
Can Language Models Act as Knowledge Bases at Scale?

Qiyuan He     Yizhong Wang     Wenya Wang    
Nanyang Technological University
University of Washington
[email protected], [email protected], [email protected]
Abstract

Large language models (LLMs) have demonstrated remarkable proficiency in understanding and generating responses to complex queries through large-scale pre-training. However, the efficacy of these models in memorizing and reasoning over large-scale structured knowledge, especially world knowledge that explicitly covers abundant factual information, remains questionable. Addressing this gap, our research investigates whether LLMs can effectively store, recall, and reason with knowledge on a scale comparable to the latest knowledge bases (KBs) such as Wikidata. Specifically, we focus on three crucial aspects of this viability: (1) the efficiency of LLMs of different sizes in memorizing the exact knowledge in a large-scale KB; (2) the flexibility of recalling the memorized knowledge in response to natural language queries; (3) the capability to infer new knowledge through reasoning. Our findings indicate that while LLMs hold promise as large-scale KBs capable of retrieving and responding with flexibility, enhancements in their reasoning capabilities are necessary to fully realize their potential. Our datasets and source code can be obtained from https://github.com/hyanique/LMKB-at-Scale.


1 Introduction

Access to knowledge is critical for language models (LMs) to perform well on many tasks and serve users reliably. Existing studies have found that language models, after pre-training, encode a large amount of factual knowledge as well as implicit linguistic knowledge from general corpora, making them a crucial component for tasks that require natural language understanding Bommasani et al. (2022); Li et al. (2022a). This suggests the potential of using language models as knowledge bases Petroni et al. (2019); AlKhamissi et al. (2022). However, existing studies mainly focus on probing Li et al. (2022b); Chen et al. (2022); Sung et al. (2021) and utilizing Roberts et al. (2020); Moiseev et al. (2022) the knowledge LMs gain from pre-training, which shows deficiencies when handling long-tail knowledge that appears less frequently Kandpal et al. (2023), due to knowledge imbalance, conflict, and noise in the pre-training corpora Carlini et al. (2023); Razeghi et al. (2022); Tänzer et al. (2022).

Meanwhile, knowledge bases (KBs), commonly utilized in many knowledge-intensive tasks such as dialogue Li et al. (2022c); Galetzka et al. (2021), question answering Baek et al. (2023); Saxena et al. (2020); Qiu et al. (2020), and recommendation systems Wu et al. (2013), are known for their ability to compactly organize information on a large scale, providing clean and balanced knowledge. For example, Wikidata contains over 108M entities about the world (https://www.wikidata.org/wiki/Wikidata:Statistics). Operations over larger KBs incur greater computational costs and therefore pose a significant challenge for extracting subgraphs from the KB Cordella et al. (2004); Grohe and Schweitzer (2020) or grounding semantic logic forms over the KB Lan and Jiang (2020); Bhutani et al. (2019) to perform downstream tasks. In addition, the rigid format of KBs limits their flexibility to handle complex natural language queries.

In this work, we propose to explicitly train large language models to memorize world knowledge from Wikidata Pellissier Tanon et al. (2016) at a large scale and systematically study the viability of using the resulting LMs as the knowledge base. Given their high capacity, we hypothesize that LMs can store information from a knowledge base at a rather large scale and provide more flexibility in querying and reasoning. Specifically, we aim to answer the following three questions: (1) How quickly and how well can LMs of different sizes memorize large-scale knowledge of different frequencies through training? (2) How flexible are these trained LMs when used to answer queries in natural language rather than the structured triplets used during training? (3) Can LMs infer new knowledge that does not exist in the KB, and what kind of reasoning capabilities are involved? We distinguish our work from those that train LMs on small-scale KBs with popular facts Heinzerling and Inui (2021) or convert knowledge triplets to synthetic sentences using manually curated templates Heinzerling and Inui (2021); Petroni et al. (2019), which only work for a limited set of relations.

We start by proposing an efficient learning algorithm based on importance sampling Alain et al. (2016); Katharopoulos and Fleuret (2018); Zhang et al. (2019) to train LMs to memorize knowledge more efficiently. To answer the first question, we evaluate the memorization capacity of LMs of different sizes as well as their performance on both popular and long-tail world knowledge. We observe that LMs are capable of memorizing information from a knowledge base at a large scale, with larger models learning faster. In addition, infrequent knowledge is more challenging to memorize, irrespective of the size of the language model.

To answer the second question on LMs’ flexibility in handling natural language queries, we further finetune the trained LMs using PopQA Mallen et al. (2023), a natural language QA dataset that requires long-tail Wikidata knowledge. With minimal finetuning, these LMs demonstrate superior performance over their counterparts that are not trained on the Wikidata KB. This indicates the power of LMs in flexibly retrieving and organizing long-tail knowledge, regardless of the presentation form, unveiling their potential for responding to various user queries.

To answer the third question from the perspective of incomplete KBs, we use the dataset released by Veseli et al. (2023a) containing general missing facts (triplets) and further curate two sets of missing facts targeting two kinds of reasoning capabilities, namely inverse reasoning, which switches the positions of the subject and object, and compositional reasoning, which conjoins two relations to form a new one. By evaluating LMs’ performance on inferring the missing facts, we study their inherent reasoning capabilities in addition to memorizing existing facts. Our results show that LMs are capable of inferring missing entities from existing knowledge to some extent via reasoning. However, they struggle with inverse reasoning more often than compositional reasoning, calling for further investigation into how to improve LMs’ reasoning capabilities, inverse reasoning in particular.

2 Training LMs on Large-Scale KB

2.1 KB Dataset

A basic knowledge base is a collection of facts in the form of (subject, relation, object) triplets, for example, Freebase Bollacker et al. (2008) and DBPedia Auer et al. (2007). To study the memorization capacity of language models at a large scale, we consider Wikidata Pellissier Tanon et al. (2016), one of the largest knowledge bases to date that is actively maintained by the community. Compared with pre-training corpora, Wikidata contains abundant world knowledge in a more compact and accurate form, covering both popular and long-tail knowledge that appears less frequently in the pre-training corpora of LMs.

Figure 1: Distribution of entity and relation occurrences in world knowledge $\mathcal{D}_0$. (a) Distribution of entities in $\mathcal{D}_0$, with the number of entities on a 1e6 scale and occurrence counts in powers of 2. (b) Distribution of relations in $\mathcal{D}_0$, with occurrence counts in powers of 10.

When preparing the KB dataset, we use the cleaned knowledge taken from the January 2022 snapshot of the Wikidata dump Kaiser and Christmann (2021) to avoid knowledge irrelevant to common question answering; specifically, we filter away URLs, images, geographical coordinates, and subject entities that do not have a corresponding Wikipedia page. If there are multiple objects for a given subject and relation, we randomly sample a single instance from the available objects to avoid knowledge ambiguity. After filtering, we obtain a dataset of 46M (subject, relation, object) triplets, with the distribution of 10.5M entities (subjects or objects) and 2,157 relations shown in Figure 1(a) and Figure 1(b). We denote this dataset as $\mathcal{D}_0$. We can observe that over 4M entities appear only once or twice, and around 500 relations appear 1-10 times. Meanwhile, around 250 relations occur more than 10K times, and 530K entities occur more than 16 times. These statistics show that $\mathcal{D}_0$ covers adequate popular knowledge as well as a non-negligible portion of long-tail knowledge.
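A minimal sketch of this preprocessing step is shown below; the input format and the Wikipedia-page lookup are illustrative assumptions, not the exact dump schema.

```python
import random
from collections import defaultdict

def build_kb(raw_triplets, has_wikipedia_page):
    """Filter raw (subject, relation, object) triplets and resolve multi-object facts.

    raw_triplets: iterable of (subject, relation, object) strings, assumed to be
        already stripped of URLs, images, and geographical coordinates.
    has_wikipedia_page: callable returning True if the subject entity has a
        corresponding Wikipedia page (a hypothetical lookup, e.g. a sitelink table).
    """
    grouped = defaultdict(list)
    for subj, rel, obj in raw_triplets:
        if not has_wikipedia_page(subj):
            continue  # drop subjects without a Wikipedia page
        grouped[(subj, rel)].append(obj)

    # If a (subject, relation) pair has several objects, keep one at random
    # to avoid ambiguous supervision targets.
    return [(subj, rel, random.choice(objs)) for (subj, rel), objs in grouped.items()]
```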

To study how the model performs with respect to knowledge frequency inside the KB, we first calculate the number of occurrences of all entities and relations. Next, we define long-tail entities/relations as those in the top 15% when ranking all entities/relations by their number of occurrences in ascending order, and popular entities/relations as those in the top 5% when ranking them in descending order. Then, under each of the long-tail and popular categories, we randomly sample triplets under both the entity set and the relation set, resulting in four datasets denoted as $\mathcal{D}_{PopRel}$, $\mathcal{D}_{PopEnt}$, $\mathcal{D}_{TailRel}$, and $\mathcal{D}_{TailEnt}$. As $\mathcal{D}_0$ contains 2,157 relations, the amount of knowledge with long-tail relations is limited: $\mathcal{D}_0$ contains 323 long-tail relations that occur 1-2 times each, summing to 663 occurrences in total, whereas the top 5% of the 2,157 relations account for 40.8M occurrences; hence $\mathcal{D}_{TailRel}$ contains only 663 samples. The other three datasets contain 1K triplets each. Example triplets include (“Linlithgow Burgh Halls”, instance of, “Town hall”) from $\mathcal{D}_{PopRel}$ and (“Department of Agriculture, Water and the Environment”, external auditor, “Australian National Audit Office”) from $\mathcal{D}_{TailRel}$.
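A sketch of how such frequency-based splits can be constructed is given below; the thresholds follow the text, while the exact sampling criterion (here, a triplet qualifies if either its subject or its object falls in the target bucket) is our assumption.

```python
import random
from collections import Counter

def split_by_entity_frequency(triplets, tail_frac=0.15, pop_frac=0.05, n_samples=1000):
    """Rank entities by occurrence count and sample popular / long-tail triplets.

    A sketch of the construction of D_PopEnt and D_TailEnt; the relation-based
    splits are built the same way using relation counts instead of entity counts.
    """
    counts = Counter()
    for subj, _, obj in triplets:
        counts[subj] += 1
        counts[obj] += 1

    ranked = [e for e, _ in counts.most_common()]          # descending frequency
    popular = set(ranked[: int(len(ranked) * pop_frac)])    # top 5% most frequent
    tail = set(ranked[-int(len(ranked) * tail_frac):])      # bottom 15% least frequent

    pop_triplets = [t for t in triplets if t[0] in popular or t[2] in popular]
    tail_triplets = [t for t in triplets if t[0] in tail or t[2] in tail]
    return (random.sample(pop_triplets, min(n_samples, len(pop_triplets))),
            random.sample(tail_triplets, min(n_samples, len(tail_triplets))))
```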

2.2 Model Setup

We choose two language models, namely T5 Raffel et al. (2020) and LLaMA-2 Touvron et al. (2023), each with two different sizes: T5-base, T5-large, LLaMA-2-7b, and LLaMA-2-13b. Starting from their pre-trained checkpoints, we continue training these models on the filtered Wikidata KB $\mathcal{D}_0$ containing 46M knowledge triplets. See Appendix A for the detailed training setup.

For each knowledge triplet in the form of (subject, relation, object), we create an input string by concatenating the prefix “Subject:” followed by the subject text, the prefix “Relation:” followed by the relation text and the prefix “Object:”, and use the object text as the output. For example, given the knowledge triplet (“Palaeontological Museum, Munich”, architect, “Leonhard Romeis”), the input to the LMs is “Subject: Palaeontological Museum, Munich. Relation: architect. Object:” and the expected output is the object “Leonhard Romeis”.
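This serialization can be written as a small helper; the sketch below simply mirrors the prompt format described above.

```python
def triplet_to_example(subject, relation, obj):
    """Serialize a (subject, relation, object) triplet into an input/output pair
    following the memorization prompt format."""
    input_text = f"Subject: {subject}. Relation: {relation}. Object:"
    output_text = obj
    return input_text, output_text

# Example:
# triplet_to_example("Palaeontological Museum, Munich", "architect", "Leonhard Romeis")
# -> ("Subject: Palaeontological Museum, Munich. Relation: architect. Object:",
#     "Leonhard Romeis")
```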

The training objective is to maximize the probability of generating the correct object, $p_{LM}(x_{out} \mid x_{in})$, where $x_{out}$ is the object text, $x_{in}$ is the input text, and $p_{LM}$ denotes the probability distribution given by the language model.
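For an encoder-decoder model such as T5, this objective is the standard sequence-to-sequence cross-entropy. A minimal sketch of one training step using the HuggingFace interface (assuming the t5-base checkpoint):

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

x_in = "Subject: Palaeontological Museum, Munich. Relation: architect. Object:"
x_out = "Leonhard Romeis"

inputs = tokenizer(x_in, return_tensors="pt")
labels = tokenizer(x_out, return_tensors="pt").input_ids

# Passing labels makes the model return the token-level cross-entropy loss,
# i.e. the negative log-likelihood -log p_LM(x_out | x_in) averaged over tokens.
loss = model(input_ids=inputs.input_ids,
             attention_mask=inputs.attention_mask,
             labels=labels).loss
loss.backward()  # one gradient step of the memorization objective
```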

2.3 Importance Sampling

With the goal of injecting abundant and diverse information from a large-scale KB into LMs, it is imperative for the model to converge to a state where it can, in an ideal scenario, memorize every triplet within the training dataset. The traditional training process iterates through each data sample precisely once per epoch, inherently treating all data with uniform importance. This approach, however, leads to extended training durations and reduced convergence efficiency, particularly when dealing with large-scale KBs containing a significant amount of hard-to-memorize knowledge. To address this issue, inspired by the importance sampling algorithms proposed in Alain et al. (2016); Katharopoulos and Fleuret (2018), we allocate distinct importance weights to the training samples within $\mathcal{D}_0$. The importance weight is proportional to the prediction loss of each sample, serving as a measure of its memorization difficulty. This strategy prioritizes samples that are more challenging to memorize by assigning them greater importance, thereby increasing their likelihood of selection during each training iteration, leading to faster convergence Zhang et al. (2019); Xie et al. (2023).

The detailed algorithm is shown in Algorithm 1.

Algorithm 1 Knowledge memorization with importance sampling
Require: knowledge samples with importance $\mathcal{D}=\{(x_1,y_1;w_1),\dots,(x_n,y_n;w_n)\}$
Require: language model pre-trained on general corpora
Require: sampling ratio $\alpha\in(0,1)$
1:  initialize importance $w_1,\dots,w_n$ with 1e6
2:  for every training epoch $e$ do
3:      sample $\mathcal{S}=\{(x^s,y^s;w^s)\}\subset\mathcal{D}$ of size $n\times\alpha$ using importance $w_1,\dots,w_n$
4:      forward pass using $\{(x^s,y^s)\}$
5:      update importance $w^s$ using instance loss $\mathcal{L}(y^s,x^s)$
6:      backpropagation
7:  end for

As shown in the pseudo-code, we use the instance loss $\mathcal{L}(y^s, x^s)$ to measure a knowledge triplet’s importance and use this importance as the sampling probability within each batch, where $\mathcal{L}$ is the cross-entropy loss and $y^s$ is the correct output text given input $x^s$. Mathematically,

$$\mathcal{L}(y^{s},x^{s})=-\sum_{t=1}^{T}\log{p_{LM}(y_{t}^{s}\mid x^{s})}, \qquad (1)$$

with $T$ being the number of tokens in $y^s$ and $y_t^s$ being the $t$-th token of $y^s$. Hence, the higher the instance loss, the higher the chance for the instance to be sampled into the subset $\mathcal{S}$ for training, forcing the model to focus on learning hard samples more often.
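A minimal sketch of the weight bookkeeping and per-sample loss in Algorithm 1 (assuming PyTorch and HuggingFace-style label padding with -100; the class and function names are ours):

```python
import numpy as np
import torch

class ImportanceSampler:
    """Per-epoch weighted sampling over the KB, as in Algorithm 1.

    Weights start at a large constant so every triplet is likely to be visited
    at least once before its weight reflects its actual loss.
    """

    def __init__(self, num_samples, init_weight=1e6):
        self.weights = np.full(num_samples, init_weight, dtype=np.float64)

    def sample_epoch(self, ratio=0.3):
        """Draw n * ratio indices with probability proportional to their weights."""
        probs = self.weights / self.weights.sum()
        size = int(len(self.weights) * ratio)
        return np.random.choice(len(self.weights), size=size, replace=False, p=probs)

    def update(self, indices, per_sample_losses):
        """After the forward pass, set each sampled triplet's weight to its loss."""
        self.weights[indices] = np.asarray(per_sample_losses, dtype=np.float64)


def per_sample_loss(logits, labels, pad_token_id=-100):
    """Token-level cross-entropy summed per sequence, matching Equation (1)."""
    loss = torch.nn.functional.cross_entropy(
        logits.transpose(1, 2), labels, ignore_index=pad_token_id, reduction="none")
    return loss.sum(dim=1)  # one scalar loss per triplet in the batch
```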

Figure 2: Learning curves of T5-base trained on $\mathcal{D}_1$ with and without importance sampling (ImSmp), evaluated using $\mathcal{D}_{1-Eval}$.

To verify our hypothesis, we conduct a preliminary experiment by randomly sampling 1% of the triplets from $\mathcal{D}_0$ and training a T5-base model to memorize this sampled dataset, with and without Algorithm 1. We denote this subset containing 426K triplets as $\mathcal{D}_1$. We further randomly sample 10K triplets from $\mathcal{D}_1$ as the corresponding evaluation set, denoted as $\mathcal{D}_{1-Eval}$. We configure the sampling ratio $\alpha$ to be 0.3. As shown in Figure 2, the model trained without importance sampling quickly reaches around 80% exact match and F1 score in the first 30K training steps, and its performance then slowly increases to around 95% exact match and F1 score over another 20K steps. With importance sampling, the model achieves roughly 80% exact match and F1 score after the first 20K steps, and over 95% exact match and F1 score after another 12K steps. We also note that training with importance sampling yields a significantly steeper learning curve than training without it. In what follows, we use importance sampling with the same $\alpha$ value when training LMs for all experiments.

2.4 Evaluation

To study the LM’s capacity for memorizing the structured knowledge base, we propose to use the exact match (EM) and F1 scores following Heinzerling and Inui (2021) over the entire training dataset. We call this the fixed-form information recall ability. Since it is not feasible to iteratively evaluate the LM on all 46M triplets in $\mathcal{D}_0$ throughout the training process due to the huge inference time, we opt to randomly sample 10K triplets from $\mathcal{D}_0$ as the evaluation set, denoted as $\mathcal{D}_2$.
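For reference, EM and token-level F1 can be computed as in standard QA evaluation; the sketch below uses a simplified normalization and is not necessarily the authors' exact script.

```python
from collections import Counter

def normalize(text):
    """Lowercase and collapse whitespace; a simplified normalization step."""
    return " ".join(text.lower().split())

def exact_match(prediction, reference):
    return float(normalize(prediction) == normalize(reference))

def f1_score(prediction, reference):
    """Token-overlap F1 between the generated object text and the gold object."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```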

Figure 3: Evaluating the fixed-form information recall ability for LMs trained on $\mathcal{D}_0$; T5 models are on the upper row and LLaMA-2 models on the bottom row. (a) T5 on $\mathcal{D}_2$; (b) T5 on $\mathcal{D}_{PopEnt}$ and $\mathcal{D}_{TailEnt}$; (c) T5 on $\mathcal{D}_{PopRel}$ and $\mathcal{D}_{TailRel}$; (d) LLaMA-2 on $\mathcal{D}_2$; (e) LLaMA-2 on $\mathcal{D}_{PopEnt}$ and $\mathcal{D}_{TailEnt}$; (f) LLaMA-2 on $\mathcal{D}_{PopRel}$ and $\mathcal{D}_{TailRel}$.

To measure the model’s ability to flexibly retrieve memorized knowledge when queried with input and output formats different from training, we consider using natural language to query our model in the same way as the task of question answering (QA). We call this the free-form information recall ability. For implementation, we require that the knowledge used by the QA task be highly covered by the 46M triplets of world knowledge from Wikidata. Hence, we select the QA dataset constructed in PopQA Mallen et al. (2023). PopQA converts 14K knowledge triplets from Wikidata to their corresponding natural language questions and answers, covering long-tail information based on Wikipedia page views. With a random 8:2 split producing a train set of 11.4K samples and a validation set of 2.9K samples, we further finetune the model from the memorization checkpoints using the training split of PopQA and evaluate the performance on the validation set using the F1 score. We also compute the exact match and F1 score of the model’s generation accuracy over the PopQA knowledge triplets to check whether the model can access the relevant knowledge using fixed-form recall.

Lastly, we explore whether LMs can infer new knowledge that does not exist in the KB, namely, the missing fact completion ability. Since most knowledge graphs are incomplete, missing factual triplets or even entities Yang et al. (2022); Shi and Weninger (2018), the ability to automatically complete missing facts becomes especially demanding. First, we consider the missing facts dataset released by Veseli et al. (2023b), which contains 350 factual triplets missing from Wikidata with human-annotated ground truths. As we additionally seek to investigate the underlying reasoning capabilities involved in missing fact completion, we also curate two sets of missing knowledge triplets based on $\mathcal{D}_0$, emphasizing inverse reasoning and compositional reasoning, respectively. For a missing knowledge triplet that is not contained in $\mathcal{D}_0$, we query the model using the same input format as in fixed-form information recall and evaluate the output text against the object text using F1 scores. For pre-trained models without training on the knowledge base $\mathcal{D}_0$, we query the models with natural language inputs released alongside the triplets; see the respective tasks in Section 5.1 for details.

Next, we present the detailed evaluation and analysis to answer each of the three core questions, including (1) the efficiency of LMs with different sizes in memorizing the exact knowledge in the large-scale KB (Section 3); (2) the flexibility of recalling the memorized knowledge in response to natural language queries (Section 4); (3) the capability to infer new knowledge through reasoning (Section 5).

3 Fixed-Form Information Recall

As mentioned in Section 2.4, we measure the fixed-form information recall ability on a dataset $\mathcal{D}_2$ sub-sampled from the original training set $\mathcal{D}_0$ to avoid the huge inference cost. See Appendix A for additional training details. Specifically, we compute the exact match and F1 score on $\mathcal{D}_2$ along the training steps of T5-base, T5-large, LLaMA-2-7b and LLaMA-2-13b. The performance curves are shown in Figures 3(a) and 3(d).

The results show that the models can memorize a large portion of the 46M world knowledge triplets, with T5-large performing better than T5-base, and LLaMA-2-13b slightly more capable than LLaMA-2-7b in terms of memorization capacity. LMs with larger sizes are capable of memorizing more knowledge with higher efficiency. In particular, at the end of training, LLaMA-2-13b gives the highest F1 score of 81.64%, whereas T5-large reaches an F1 score of 63.07%.

In addition, we further evaluate the performance separately on popular and long-tail triplets, i.e., $\mathcal{D}_{PopEnt}$, $\mathcal{D}_{PopRel}$, $\mathcal{D}_{TailEnt}$ and $\mathcal{D}_{TailRel}$. The results are shown in Figures 3(b), 3(c), 3(e) and 3(f). These plots demonstrate that (1) all models are better at memorizing popular information than long-tail information; (2) for LLaMA-2 models, a larger model size does not lead to significantly better memorization capability when it comes to long-tail and popular knowledge; (3) different from LLaMA-2, T5-large is better than T5-base at learning both popular and long-tail knowledge, with an even more significant improvement for long-tail relations ($\mathcal{D}_{TailRel}$).

4 Free-Form Information Recall

To evaluate the model’s ability to perform free-form information recall when using natural language queries, as indicated in Section 2.4, we adopt the knowledge triplets and their corresponding natural language questions from PopQA:

Given a knowledge triplet (“Binary”, author, “Michael Crichton”) from Wikidata, PopQA converts it to a natural language question that asks for the object: “Who is the author of Binary?”. The correct answer in this case is “Michael Crichton”. To make LMs trained on knowledge triplets familiar with the natural language QA format, we further finetune these LMs by feeding them the question as input and training them to generate the correct answer. For T5, the input is the original question, such as “Who is the author of Binary?”. For LLaMA-2, the input is “Question: Who is the author of Binary? Answer:”. We then evaluate the generated output using the F1 score, which allows minor linguistic variations in the generated output while taking into account semantic similarity and flexibility. In addition to the free-form queries, we also evaluate how much of the PopQA knowledge in its original triplet form is memorized by the model at each checkpoint by querying the model using the subject and relation, following the same input format used for fixed-form information recall.
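A small helper illustrating the two prompt formats (a sketch; the wording matches the examples above):

```python
def qa_example(question, answer, model_family="t5"):
    """Build the finetuning input/output pair for a PopQA question.

    model_family: "t5" uses the bare question; "llama2" wraps it in the
    Question/Answer template described above.
    """
    if model_family == "llama2":
        input_text = f"Question: {question} Answer:"
    else:
        input_text = question
    return input_text, answer

# qa_example("Who is the author of Binary?", "Michael Crichton", "llama2")
# -> ("Question: Who is the author of Binary? Answer:", "Michael Crichton")
```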

Figure 4: PopQA finetuning performance and knowledge recall at various checkpoints through training on $\mathcal{D}_0$. The pre-trained models are represented by epoch 0.

The results on PopQA are shown in Figure 4. Each point on the x-axis indicates the number of epochs for each checkpoint when training LMs on the Wikidata triplets, i.e., $\mathcal{D}_0$. Starting from each of these checkpoints, we further finetune the LMs using the training data from PopQA for up to 30 epochs for T5 models and 15 epochs for LLaMA-2 models with early stopping (see Appendix A for details) and report the best F1 score on the evaluation set.

It is clear that training on $\mathcal{D}_0$ provides a sizable performance boost compared with using the originally pre-trained LMs (epoch 0). This suggests that LMs trained on large-scale knowledge bases are capable of performing some extent of free-form information recall, especially for a question-answering task that emphasizes long-tail knowledge. We also notice that memorizing more knowledge (as indicated by the triplet EM scores) generally leads to better performance. In addition, larger models, after being trained on $\mathcal{D}_0$, are able to recall more knowledge for this downstream task in a fixed form, and finetuning them yields better results.

5 Missing Fact Completion

Figure 5: Evaluating the ability to infer new knowledge across various model checkpoints through training on $\mathcal{D}_0$: (a) T5 on general missing facts; (b) T5 on inverse reasoning; (c) T5 on compositional reasoning; (d) LLaMA-2 on general missing facts; (e) LLaMA-2 on inverse reasoning; (f) LLaMA-2 on compositional reasoning. The x-axis indicates the number of epochs of training on $\mathcal{D}_0$ for each checkpoint; epoch 0 stands for the pre-trained checkpoints.

5.1 General Missing Facts

To evaluate how the model performs when completing missing facts in general, we consider knowledge triplets that are missing from $\mathcal{D}_0$. We query the model to generate an object given the subject and relation. To ensure the feasibility of this setting, we require that the subject and relation in question are both contained in the knowledge base. Hence, the model has to associate relevant information related to the subject and the relation in order to infer the object.

For implementation, we utilize the missing fact dataset Veseli et al. (2023b) consisting of 350 samples of knowledge missing from Wikidata. For each sample, we query the model using the subject and the relation, which are contained in Wikidata, and compare the generated output with the human-annotated object using the F1 score; when there are multiple ground-truth candidates, we compare the generated result with each of them and take the best F1 score. To clearly demonstrate the benefit of knowledge memorization, we further evaluate how the pre-trained LMs perform on these missing facts using the natural language queries provided by the dataset. For example, the missing fact triplet (“Tidö Castle”, headquarters location, “Västeras”) is associated with the natural language question “The headquarter of Tidö Castle is in” as input for T5, while the input for LLaMA-2 is “Question: The headquarter of Tidö Castle is in? Answer:”.
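A sketch of how a single missing-fact query can be issued to a KB-trained checkpoint with greedy decoding; the prompt construction follows Section 2.2, and the function name and local checkpoint path are ours.

```python
import torch

def complete_missing_fact(model, tokenizer, subject, relation, max_new_tokens=32):
    """Query a KB-trained model for the object of a (subject, relation, ?) fact."""
    prompt = f"Subject: {subject}. Relation: {relation}. Object:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Example usage (assuming a T5 checkpoint finetuned on D_0 saved locally):
# from transformers import AutoTokenizer, T5ForConditionalGeneration
# tokenizer = AutoTokenizer.from_pretrained("t5-base")
# model = T5ForConditionalGeneration.from_pretrained("path/to/kb-trained-t5")
# print(complete_missing_fact(model, tokenizer, "Tidö Castle", "headquarters location"))
```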

As shown in Figures 5(a) and 5(d), training on $\mathcal{D}_0$ provides some performance increase. This suggests that training on large-scale knowledge bases can help LMs infer new facts better. However, the capability of LMs to infer new facts does not grow along with the memorization process on $\mathcal{D}_0$, and larger models like LLaMA-2 even perform worse than smaller models like T5. These observations indicate that the amount of knowledge learned by the models may not be the key factor determining their inference capability towards missing facts.

5.2 Inverse Reasoning

We define inverse reasoning as the ability to infer $(B, r', A)$ given the triplet $(A, r, B)$, where $A$ and $B$ represent two entities and $r, r'$ indicate relations. To study the model’s ability to conduct inverse reasoning over the knowledge base, we first curate a set of triplets in the form of $(A, r, B)$ originally contained in $\mathcal{D}_0$, denoted as $\mathcal{D}_{\rightarrow}$. Then, we curate the inverse set by mapping the original relation $r$ to its inverse $r'$ and switching the positions of $A$ and $B$, forming the triplets $(B, r', A)$. We denote this set as $\mathcal{D}_{\leftarrow}$. We query the model for the object entity $A$ given the subject entity $B$ and the inverse relation $r'$, and compute the F1 score on $\mathcal{D}_{\leftarrow}$. To show whether the model is capable of correctly recalling the original fact $(A, r, B)$ in the first place, we additionally query the model to generate $B$ given $A$ and $r$ on $\mathcal{D}_{\rightarrow}$. For the originally pre-trained LMs without access to Wikidata, we convert the knowledge triplets to natural language QA pairs as explained in Section 4.

For implementation, we select seven relation pairs $(r, r')$ from $\mathcal{D}_0$, as shown in Table 1 in Appendix B. For each relation pair, we apply the restriction that for a knowledge triplet $(A, r, B)$, the inverse knowledge $(B, r', A)$ is not contained in $\mathcal{D}_0$. For each relation, we randomly sample 150 triplets from $\mathcal{D}_0$, resulting in 1,050 samples for each of $\mathcal{D}_{\rightarrow}$ and $\mathcal{D}_{\leftarrow}$.
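A sketch of this construction is shown below; the example relation pair in the comment is illustrative, and the actual seven pairs are listed in Table 1 of Appendix B.

```python
import random

def build_inverse_sets(kb_triplets, inverse_map, per_relation=150):
    """Curate forward triplets (A, r, B) and their inverse queries (B, r', A).

    kb_triplets: set of (subject, relation, object) tuples, i.e. D_0.
    inverse_map: dict mapping a relation r to its inverse r',
        e.g. {"capital of": "capital"}  # illustrative pair, not from Table 1
    """
    forward, inverse = [], []
    for r, r_inv in inverse_map.items():
        candidates = [(a, rel, b) for (a, rel, b) in kb_triplets
                      if rel == r and (b, r_inv, a) not in kb_triplets]
        for a, _, b in random.sample(candidates, min(per_relation, len(candidates))):
            forward.append((a, r, b))       # memorized fact, used to check recall
            inverse.append((b, r_inv, a))   # held-out inverse fact to be inferred
    return forward, inverse
```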

As shown in Figures 5(b) and 5(e), for all models, we observe only a limited performance increase when answering the inverse knowledge $(B, r', A)$, despite the models showing increasing memorization accuracy of the forward knowledge $(A, r, B)$. We speculate that this lack of significant change in deduction results suggests that LMs can memorize knowledge well but fall short at handling the inverse of relations.

5.3 Compositional Reasoning

We define compositional reasoning as the ability to infer $(A, r_3, C)$ given $(A, r_1, B)$ and $(B, r_2, C)$ when $(A, r_1, B) \wedge (B, r_2, C) \Rightarrow (A, r_3, C)$. To study the model’s ability to conduct compositional reasoning over the knowledge base, we first curate a set of triplet pairs containing $(A, r_1, B)$ and $(B, r_2, C)$, denoted by $\mathcal{D}_{\wedge} = (\mathcal{D}^{1}_{\wedge}, \mathcal{D}^{2}_{\wedge})$. We then form the conclusive triplet set containing $(A, r_3, C)$, denoted by $\mathcal{D}_{\Rightarrow}$. To test the model’s performance, we query the model using entity $A$ and relation $r_3$, and compare the model’s output with the ground-truth entity $C$ on $\mathcal{D}_{\Rightarrow}$. To show whether the model is capable of correctly recalling the conditioned facts $(A, r_1, B)$ and $(B, r_2, C)$ in the first place, we additionally query the model to generate the objects for these conditioned facts on $\mathcal{D}_{\wedge}$. Again, for the pre-trained models, we convert the knowledge triplets to natural language QA pairs.

For implementation, we formulate eight reasoning rules of relation composition, $r_1 \wedge r_2 \Rightarrow r_3$, as shown in Table 3 in Appendix B.

For a compositional rule $(A, r_1, B) \wedge (B, r_2, C) \Rightarrow (A, r_3, C)$, we require that the prior knowledge triplets $(A, r_1, B)$ and $(B, r_2, C)$ exist in the knowledge dataset while the deduction result $(A, r_3, C)$ is missing. For each reasoning rule, we randomly sample 150 examples from $\mathcal{D}_0$, resulting in 1,200 samples for each of $\mathcal{D}_{\wedge}$ and $\mathcal{D}_{\Rightarrow}$.
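A sketch of the corresponding data curation; the rule in the comment is a hypothetical example of the form $r_1 \wedge r_2 \Rightarrow r_3$, while the eight actual rules are given in Table 3 of Appendix B.

```python
import random

def build_compositional_sets(kb_triplets, rules, per_rule=150):
    """Curate premise pairs (A, r1, B), (B, r2, C) and held-out conclusions (A, r3, C).

    kb_triplets: set of (subject, relation, object) tuples, i.e. D_0.
    rules: list of (r1, r2, r3) compositions,
        e.g. [("mother", "mother", "grandmother")]  # hypothetical rule
    """
    premises, conclusions = [], []
    by_subject_relation = {}
    for a, r, b in kb_triplets:
        by_subject_relation.setdefault((a, r), []).append(b)

    for r1, r2, r3 in rules:
        candidates = []
        for (a, rel), bs in by_subject_relation.items():
            if rel != r1:
                continue
            for b in bs:
                for c in by_subject_relation.get((b, r2), []):
                    # keep only conclusions that are actually missing from the KB
                    if (a, r3, c) not in kb_triplets:
                        candidates.append(((a, r1, b), (b, r2, c), (a, r3, c)))
        for p1, p2, concl in random.sample(candidates, min(per_rule, len(candidates))):
            premises.append((p1, p2))
            conclusions.append(concl)
    return premises, conclusions
```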

The results from Figure 5(c) and 5(f) show that training on the KB can assist the model in performing compositional reasoning. However, there is an upper threshold; memorizing prior knowledge beyond that point may not help the model perform compositional deduction.

6 Related Work

Infusing Knowledge into LM

Starting from the seminal work by Petroni et al. (2019) that first introduced the concept of using pre-trained language models as knowledge bases, many works have investigated this viability by finetuning and evaluating the models on downstream question-answering tasks Roberts et al. (2020); Guu et al. (2020); Moiseev et al. (2022). Notably, using salient span masking Moiseev et al. (2022), augmenting the learning objective Verga et al. (2020), and modifying the model architecture Zhang et al. (2021); Yasunaga et al. (2022) have been shown to improve LMs’ performance on various open-domain question answering tasks.

When explicitly studying the capacity to store factual information, many knowledge datasets have been proposed. Among them, LAMA Petroni et al. (2019) is based on factual and commonsense knowledge grounded to Wikipedia. Wikidata5M, derived by Wang et al. (2021), contains 4.9M Wikidata triplets. More recently, Cao et al. (2021) derived WIKI-UNI with a uniform distribution of object entities, and Keleg and Magdy (2023) proposed DLAMA to group factual information by cultural diversity.

Under the scope of investigating the infusion of KBs into LMs, Bosselut et al. (2019) focus on commonsense knowledge derived from ATOMIC Sap et al. (2019) and ConceptNet Speer et al. (2017), while Heinzerling and Inui (2021) study the memorization capacity of BERT-based models using popular Wikidata knowledge. AutoPrompt Shin et al. (2020) can be utilized to modify knowledge inputs Veseli et al. (2023b, a) for triplet completion. In addition, Mallen et al. (2023) propose to use retrieval-augmented LMs to help with long-tail factual knowledge.

Probing for Existing Knowledge

Given that pre-trained LMs are sensitive to input Jiang et al. (2020); Elazar et al. (2021) and that querying the model with even syntactic variations may lead to different outputs Longpre et al. (2021), many works have focused on probing techniques to extract knowledge stored inside LMs through pre-training. For example, Li et al. (2022b) study the extraction of knowledge in the setting of unsupervised knowledge-grounded conversation, and Alivanistos et al. (2023) utilize prompt generation and post-processing techniques to probe for knowledge, while others focus on extracting specific types of factual information, such as commonsense knowledge Davison et al. (2019), simile metaphors Chen et al. (2022) and biomedical facts Sung et al. (2021).

7 Conclusion

In this work, we systematically study the viability of using language models as large-scale knowledge bases. We propose an importance sampling algorithm to increase the efficiency of memorizing world knowledge from Wikidata. We investigate and evaluate three critical dimensions along this direction and conclude that large language models are able to recall a large amount of KB knowledge through training, both in the fixed form of the structured KB and in the free form of natural language queries, with increasing flexibility when querying the world knowledge. Nevertheless, there is a significant gap between the memorization of popular knowledge and long-tail knowledge regardless of model size. In addition, language models, after being trained on the large-scale KB, demonstrate consistent improvement in inferring new facts through some extent of reasoning. However, the amount of knowledge learned during training does not guarantee consistent improvement in reasoning capabilities, especially when it comes to inverse reasoning. These results point to future work in utilizing language models as knowledge bases at scale, as well as further investigation into improving LMs’ reasoning capability over world knowledge.

Limitations

This work focuses on the following three aspects of treating language models as knowledge bases: memorization and access of knowledge base information at scale, access of memorized knowledge in a flexible, natural language format, and inference of facts missing from the knowledge base used for training. AlKhamissi et al. (2022) proposed the following five abilities for a language model to be qualified as a knowledge base: (1) access of knowledge, (2) editing of knowledge, (3) consistency over semantically equivalent contexts, (4) reasoning over stored knowledge, and (5) explainability of internal mechanisms and interpretability of outputs under a post-hoc setting. We mainly address the ability of knowledge access while providing a preliminary study on the reasoning ability of language models through inverse and compositional reasoning. Another limitation of our work is that, due to limited computation resources, we are unable to train the models without importance sampling on the 46M triplets of $\mathcal{D}_0$. While further investigation of importance sampling may be required to improve credibility and robustness, we believe our experiments on the subset $\mathcal{D}_1$ randomly sampled from $\mathcal{D}_0$ provide preliminary evidence to support our hypothesis in Section 2.3 and serve as a good foundation for speeding up large-scale knowledge memorization.

Ethics Statement

Large language models are known to memorize information from their pre-training corpora. Therefore, probing for stored knowledge may enable privacy attacks against language models, such as training data extraction attacks Neel and Chang (2024); Staab et al. (2023); Hartmann et al. (2023). In this kind of attack, an adversary can reconstruct parts of the training samples when given access to the model, leading to potential exposure of sensitive information that should not be extractable under fair and ethical usage of language models. In addition, Karamolegkou et al. (2023) confirm that language models are able to memorize a substantial portion of copyrighted bestselling books published between 1930 and 2010, which demonstrates the risk of copyright violations when deploying language models.

For our work, the world knowledge dataset $\mathcal{D}_0$ is derived from Wikidata, which follows the CC0 (Creative Commons Public Domain) license (https://www.wikidata.org/wiki/Wikidata:Copyright). In this way, we reduce the concern of language models learning sensitive or copyrighted information when training on the corresponding knowledge dataset. However, we have limited control over information acquired during the pre-training of language models. It is possible to address this issue in future work by either using language models with sensitive and copyrighted information removed or deploying knowledge editing methods Zhang et al. (2024) to enforce data privacy and integrity.

References

  • Alain et al. (2016) Guillaume Alain, Alex Lamb, Chinnadhurai Sankar, Aaron Courville, and Yoshua Bengio. 2016. Variance reduction in sgd by distributed importance sampling. Statistics Research Repository, arXiv:1511.06481. Version 7.
  • Alivanistos et al. (2023) Dimitrios Alivanistos, Selene Báez Santamaría, Michael Cochez, Jan-Christoph Kalo, Emile van Krieken, and Thiviyan Thanapalasingam. 2023. Prompting as probing: Using language models for knowledge base construction. Computation Research Repository, arXiv:2208.11057. Version 3.
  • AlKhamissi et al. (2022) Badr AlKhamissi, Millicent Li, Asli Celikyilmaz, Mona Diab, and Marjan Ghazvininejad. 2022. A review on language models as knowledge bases. Computation Research Repository, arXiv:2204.06031.
  • Auer et al. (2007) Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. 2007. Dbpedia: a nucleus for a web of open data. In Proceedings of the 6th International The Semantic Web and 2nd Asian Conference on Asian Semantic Web Conference, ISWC’07/ASWC’07, page 722–735, Berlin, Heidelberg. Springer-Verlag.
  • Baek et al. (2023) Jinheon Baek, Alham Fikri Aji, and Amir Saffari. 2023. Knowledge-augmented language model prompting for zero-shot knowledge graph question answering. In Proceedings of the 1st Workshop on Natural Language Reasoning and Structured Explanations (NLRSE), pages 78–106, Toronto, Canada. Association for Computational Linguistics.
  • Bhutani et al. (2019) Nikita Bhutani, Xinyi Zheng, and H V Jagadish. 2019. Learning to answer complex questions over knowledge bases with query composition. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management, CIKM ’19, page 739–748, New York, NY, USA. Association for Computing Machinery.
  • Bollacker et al. (2008) Kurt Bollacker, Colin Evans, Praveen Paritosh, Tim Sturge, and Jamie Taylor. 2008. Freebase: a collaboratively created graph database for structuring human knowledge. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD ’08, page 1247–1250, New York, NY, USA. Association for Computing Machinery.
  • Bommasani et al. (2022) Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S. Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, Erik Brynjolfsson, Shyamal Buch, Dallas Card, Rodrigo Castellon, Niladri Chatterji, Annie Chen, Kathleen Creel, Jared Quincy Davis, Dora Demszky, Chris Donahue, Moussa Doumbouya, Esin Durmus, Stefano Ermon, John Etchemendy, Kawin Ethayarajh, Li Fei-Fei, Chelsea Finn, Trevor Gale, Lauren Gillespie, Karan Goel, Noah Goodman, Shelby Grossman, Neel Guha, Tatsunori Hashimoto, Peter Henderson, John Hewitt, Daniel E. Ho, Jenny Hong, Kyle Hsu, Jing Huang, Thomas Icard, Saahil Jain, Dan Jurafsky, Pratyusha Kalluri, Siddharth Karamcheti, Geoff Keeling, Fereshte Khani, Omar Khattab, Pang Wei Koh, Mark Krass, Ranjay Krishna, Rohith Kuditipudi, Ananya Kumar, Faisal Ladhak, Mina Lee, Tony Lee, Jure Leskovec, Isabelle Levent, Xiang Lisa Li, Xuechen Li, Tengyu Ma, Ali Malik, Christopher D. Manning, Suvir Mirchandani, Eric Mitchell, Zanele Munyikwa, Suraj Nair, Avanika Narayan, Deepak Narayanan, Ben Newman, Allen Nie, Juan Carlos Niebles, Hamed Nilforoshan, Julian Nyarko, Giray Ogut, Laurel Orr, Isabel Papadimitriou, Joon Sung Park, Chris Piech, Eva Portelance, Christopher Potts, Aditi Raghunathan, Rob Reich, Hongyu Ren, Frieda Rong, Yusuf Roohani, Camilo Ruiz, Jack Ryan, Christopher Ré, Dorsa Sadigh, Shiori Sagawa, Keshav Santhanam, Andy Shih, Krishnan Srinivasan, Alex Tamkin, Rohan Taori, Armin W. Thomas, Florian Tramèr, Rose E. Wang, William Wang, Bohan Wu, Jiajun Wu, Yuhuai Wu, Sang Michael Xie, Michihiro Yasunaga, Jiaxuan You, Matei Zaharia, Michael Zhang, Tianyi Zhang, Xikun Zhang, Yuhui Zhang, Lucia Zheng, Kaitlyn Zhou, and Percy Liang. 2022. On the opportunities and risks of foundation models. Computation Research Repository, arXiv:2108.07258.
  • Bosselut et al. (2019) Antoine Bosselut, Hannah Rashkin, Maarten Sap, Chaitanya Malaviya, Asli Celikyilmaz, and Yejin Choi. 2019. COMET: Commonsense transformers for automatic knowledge graph construction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4762–4779, Florence, Italy. Association for Computational Linguistics.
  • Cao et al. (2021) Boxi Cao, Hongyu Lin, Xianpei Han, Le Sun, Lingyong Yan, Meng Liao, Tong Xue, and Jin Xu. 2021. Knowledgeable or educated guess? revisiting language models as knowledge bases. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1860–1874, Online. Association for Computational Linguistics.
  • Carlini et al. (2023) Nicholas Carlini, Daphne Ippolito, Matthew Jagielski, Katherine Lee, Florian Tramer, and Chiyuan Zhang. 2023. Quantifying memorization across neural language models. Computation Research Repository, arXiv:2202.07646.
  • Chen et al. (2022) Weijie Chen, Yongzhu Chang, Rongsheng Zhang, Jiashu Pu, Guandan Chen, Le Zhang, Yadong Xi, Yijiang Chen, and Chang Su. 2022. Probing simile knowledge from pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5875–5887, Dublin, Ireland. Association for Computational Linguistics.
  • Cordella et al. (2004) L.P. Cordella, P. Foggia, C. Sansone, and M. Vento. 2004. A (sub)graph isomorphism algorithm for matching large graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(10):1367–1372.
  • Davison et al. (2019) Joe Davison, Joshua Feldman, and Alexander Rush. 2019. Commonsense knowledge mining from pretrained models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 1173–1178, Hong Kong, China. Association for Computational Linguistics.
  • Elazar et al. (2021) Yanai Elazar, Nora Kassner, Shauli Ravfogel, Abhilasha Ravichander, Eduard Hovy, Hinrich Schütze, and Yoav Goldberg. 2021. Measuring and improving consistency in pretrained language models. Transactions of the Association for Computational Linguistics, 9:1012–1031.
  • Galetzka et al. (2021) Fabian Galetzka, Jewgeni Rose, David Schlangen, and Jens Lehmann. 2021. Space efficient context encoding for non-task-oriented dialogue generation with graph attention transformer. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7028–7041, Online. Association for Computational Linguistics.
  • Grohe and Schweitzer (2020) Martin Grohe and Pascal Schweitzer. 2020. The graph isomorphism problem. Commun. ACM, 63(11):128–134.
  • Gugger et al. (2022) Sylvain Gugger, Lysandre Debut, Thomas Wolf, Philipp Schmid, Zachary Mueller, Sourab Mangrulkar, Marc Sun, and Benjamin Bossan. 2022. Accelerate: Training and inference at scale made simple, efficient and adaptable. https://github.com/huggingface/accelerate.
  • Guu et al. (2020) Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. 2020. Retrieval augmented language model pre-training. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 3929–3938. PMLR.
  • Hartmann et al. (2023) Valentin Hartmann, Anshuman Suri, Vincent Bindschaedler, David Evans, Shruti Tople, and Robert West. 2023. Sok: Memorization in general-purpose large language models. Computation Research Repository, arXiv:2310.18362.
  • Heinzerling and Inui (2021) Benjamin Heinzerling and Kentaro Inui. 2021. Language models as knowledge bases: On entity representations, storage capacity, and paraphrased queries. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 1772–1791, Online. Association for Computational Linguistics.
  • Jiang et al. (2020) Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.
  • Kaiser and Christmann (2021) Magdalena Kaiser and Philipp Christmann. 2021. Wikidata core for question answering. https://github.com/PhilippChr/wikidata-core-for-QA.
  • Kandpal et al. (2023) Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. 2023. Large language models struggle to learn long-tail knowledge. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 15696–15707. PMLR.
  • Karamolegkou et al. (2023) Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders Søgaard. 2023. Copyright violations and large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7403–7412, Singapore. Association for Computational Linguistics.
  • Katharopoulos and Fleuret (2018) Angelos Katharopoulos and Francois Fleuret. 2018. Not all samples are created equal: Deep learning with importance sampling. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 2525–2534. PMLR.
  • Keleg and Magdy (2023) Amr Keleg and Walid Magdy. 2023. DLAMA: A framework for curating culturally diverse facts for probing the knowledge of pretrained language models. In Findings of the Association for Computational Linguistics: ACL 2023, pages 6245–6266, Toronto, Canada. Association for Computational Linguistics.
  • Lan and Jiang (2020) Yunshi Lan and Jing Jiang. 2020. Query graph generation for answering multi-hop complex questions from knowledge bases. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 969–974, Online. Association for Computational Linguistics.
  • Li et al. (2022a) Xiang Lorraine Li, Adhiguna Kuncoro, Jordan Hoffmann, Cyprien de Masson d’Autume, Phil Blunsom, and Aida Nematzadeh. 2022a. A systematic investigation of commonsense knowledge in large language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11838–11855, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Li et al. (2022b) Yanyang Li, Jianqiao Zhao, Michael Lyu, and Liwei Wang. 2022b. Eliciting knowledge from large pre-trained models for unsupervised knowledge-grounded conversation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10551–10564, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Li et al. (2022c) Yu Li, Baolin Peng, Yelong Shen, Yi Mao, Lars Liden, Zhou Yu, and Jianfeng Gao. 2022c. Knowledge-grounded dialogue generation with a unified knowledge representation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 206–218, Seattle, United States. Association for Computational Linguistics.
  • Longpre et al. (2021) Shayne Longpre, Kartik Perisetla, Anthony Chen, Nikhil Ramesh, Chris DuBois, and Sameer Singh. 2021. Entity-based knowledge conflicts in question answering. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7052–7063, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. Computation Research Repository, arXiv:1711.05101. Version 3.
  • Mallen et al. (2023) Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9802–9822, Toronto, Canada. Association for Computational Linguistics.
  • Moiseev et al. (2022) Fedor Moiseev, Zhe Dong, Enrique Alfonseca, and Martin Jaggi. 2022. SKILL: Structured knowledge infusion for large language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1581–1588, Seattle, United States. Association for Computational Linguistics.
  • Neel and Chang (2024) Seth Neel and Peter Chang. 2024. Privacy issues in large language models: A survey. Computation Research Repository, arXiv:2312.06717.
  • Pellissier Tanon et al. (2016) Thomas Pellissier Tanon, Denny Vrandečić, Sebastian Schaffert, Thomas Steiner, and Lydia Pintscher. 2016. From freebase to wikidata: The great migration. In Proceedings of the 25th International Conference on World Wide Web, WWW ’16, page 1419–1428, Republic and Canton of Geneva, CHE. International World Wide Web Conferences Steering Committee.
  • Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473, Hong Kong, China. Association for Computational Linguistics.
  • Qiu et al. (2020) Yunqi Qiu, Yuanzhuo Wang, Xiaolong Jin, and Kun Zhang. 2020. Stepwise reasoning for multi-relation question answering over knowledge graph with weak supervision. In Proceedings of the 13th International Conference on Web Search and Data Mining, WSDM ’20, page 474–482, New York, NY, USA. Association for Computing Machinery.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140):1–67.
  • Rajbhandari et al. (2020) Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16.
  • Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD ’20, page 3505–3506, New York, NY, USA. Association for Computing Machinery.
  • Razeghi et al. (2022) Yasaman Razeghi, Robert L Logan IV, Matt Gardner, and Sameer Singh. 2022. Impact of pretraining term frequencies on few-shot numerical reasoning. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 840–854, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  • Roberts et al. (2020) Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418–5426, Online. Association for Computational Linguistics.
  • Sap et al. (2019) Maarten Sap, Ronan Le Bras, Emily Allaway, Chandra Bhagavatula, Nicholas Lourie, Hannah Rashkin, Brendan Roof, Noah A. Smith, and Yejin Choi. 2019. Atomic: An atlas of machine commonsense for if-then reasoning. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):3027–3035.
  • Saxena et al. (2020) Apoorv Saxena, Aditay Tripathi, and Partha Talukdar. 2020. Improving multi-hop question answering over knowledge graphs using knowledge base embeddings. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4498–4507, Online. Association for Computational Linguistics.
  • Shazeer and Stern (2018) Noam Shazeer and Mitchell Stern. 2018. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 4596–4604. PMLR.
  • Shi and Weninger (2018) Baoxu Shi and Tim Weninger. 2018. Open-world knowledge graph completion. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1).
  • Shin et al. (2020) Taylor Shin, Yasaman Razeghi, Robert L. Logan IV, Eric Wallace, and Sameer Singh. 2020. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4222–4235, Online. Association for Computational Linguistics.
  • Speer et al. (2017) Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. Proceedings of the AAAI Conference on Artificial Intelligence, 31(1).
  • Staab et al. (2023) Robin Staab, Mark Vero, Mislav Balunović, and Martin Vechev. 2023. Beyond memorization: Violating privacy via inference with large language models. Computation Research Repository, arXiv:2310.07298.
  • Sung et al. (2021) Mujeen Sung, Jinhyuk Lee, Sean Yi, Minji Jeon, Sungdong Kim, and Jaewoo Kang. 2021. Can language models be biomedical knowledge bases? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 4723–4734, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Tänzer et al. (2022) Michael Tänzer, Sebastian Ruder, and Marek Rei. 2022. Memorisation versus generalisation in pre-trained language models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7564–7578, Dublin, Ireland. Association for Computational Linguistics.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. 2023. Llama 2: Open foundation and fine-tuned chat models. Computation Research Repository, arXiv:2307.09288.
  • Verga et al. (2020) Pat Verga, Haitian Sun, Livio Baldini Soares, and William W. Cohen. 2020. Facts as experts: Adaptable and interpretable neural memory over symbolic knowledge. Computing Research Repository, arXiv:2007.00849.
  • Veseli et al. (2023a) Blerta Veseli, Simon Razniewski, Jan-Christoph Kalo, and Gerhard Weikum. 2023a. Evaluating the knowledge base completion potential of GPT. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 6432–6443, Singapore. Association for Computational Linguistics.
  • Veseli et al. (2023b) Blerta Veseli, Sneha Singhania, Simon Razniewski, and Gerhard Weikum. 2023b. Evaluating language models for knowledge base completion. In The Semantic Web: 20th International Conference, ESWC 2023, Hersonissos, Crete, Greece, May 28–June 1, 2023, Proceedings, pages 227–243, Berlin, Heidelberg. Springer-Verlag.
  • Wang et al. (2021) Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian Tang. 2021. KEPLER: A unified model for knowledge embedding and pre-trained language representation. Transactions of the Association for Computational Linguistics, 9:176–194.
  • Wu et al. (2013) Yinghui Wu, Shengqi Yang, and Xifeng Yan. 2013. Ontology-based subgraph querying. In 2013 IEEE 29th International Conference on Data Engineering (ICDE), pages 697–708.
  • Xie et al. (2023) Sang Michael Xie, Shibani Santurkar, Tengyu Ma, and Percy Liang. 2023. Data selection for language models via importance resampling. Computing Research Repository, arXiv:2302.03169.
  • Yang et al. (2022) Haotong Yang, Zhouchen Lin, and Muhan Zhang. 2022. Rethinking knowledge graph evaluation under the open-world assumption. In Advances in Neural Information Processing Systems, volume 35, pages 8374–8385. Curran Associates, Inc.
  • Yasunaga et al. (2022) Michihiro Yasunaga, Antoine Bosselut, Hongyu Ren, Xikun Zhang, Christopher D Manning, Percy S Liang, and Jure Leskovec. 2022. Deep bidirectional language-knowledge graph pretraining. In Advances in Neural Information Processing Systems, volume 35, pages 37309–37323. Curran Associates, Inc.
  • Zhang et al. (2019) Jiong Zhang, Hsiang-Fu Yu, and Inderjit S Dhillon. 2019. Autoassist: A framework to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Zhang et al. (2024) Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, Siyuan Cheng, Ziwen Xu, Xin Xu, Jia-Chen Gu, Yong Jiang, Pengjun Xie, Fei Huang, Lei Liang, Zhiqiang Zhang, Xiaowei Zhu, Jun Zhou, and Huajun Chen. 2024. A comprehensive study of knowledge editing for large language models. Computing Research Repository, arXiv:2401.01286.
  • Zhang et al. (2021) Xikun Zhang, Antoine Bosselut, Michihiro Yasunaga, Hongyu Ren, Percy Liang, Christopher D Manning, and Jure Leskovec. 2021. GreaseLM: Graph reasoning enhanced language models. In International Conference on Learning Representations.

Appendix A Additional training and evaluation details

Importance Sampling with $\mathcal{D}_1$

We train the T5-base model from its HuggingFace checkpoint (https://huggingface.co/t5-base) in FP32 with a batch size of 300 on two NVIDIA V100 GPUs. We use AdaFactor Shazeer and Stern (2018) as the optimizer with a constant learning rate of 1e-3. The evaluation batch size is 1024. We set the maximum number of training epochs to 100 and enforce an early stopping policy that terminates training if the model shows no improvement on the evaluation set for 10 epochs or once the exact match score on $\mathcal{D}_{1-Val}$ exceeds 96%. The model reaches the exact match threshold for early stopping in both experiments, and the training time is around 2 hours and 5 hours with and without importance sampling, respectively.
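
For reference, the sketch below shows the core optimizer setup; it is a minimal illustration, not the released training script, and the single toy example stands in for the actual $\mathcal{D}_1$ question-answer pairs (the full training loop, evaluation on $\mathcal{D}_{1-Val}$, and early stopping follow the thresholds stated above).

```python
# Minimal sketch of the Adafactor configuration described above (constant
# learning rate 1e-3 with Adafactor's internal schedule disabled). The single
# toy example stands in for a D1 question-answer pair; the full training loop,
# evaluation, and early stopping are omitted.
from transformers import AutoTokenizer, T5ForConditionalGeneration
from transformers.optimization import Adafactor

model = T5ForConditionalGeneration.from_pretrained("t5-base")
tokenizer = AutoTokenizer.from_pretrained("t5-base")

optimizer = Adafactor(
    model.parameters(),
    lr=1e-3,              # constant learning rate, as reported above
    relative_step=False,  # disable the built-in relative-step schedule
    scale_parameter=False,
    warmup_init=False,
)

# Toy stand-in for one verbalized knowledge triplet from D1.
inputs = tokenizer("the capital of France is", return_tensors="pt")
labels = tokenizer("Paris", return_tensors="pt").input_ids

model.train()
loss = model(**inputs, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()
```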

Training on $\mathcal{D}_0$

We train T5 models from their HuggingFace checkpoints (https://huggingface.co/t5-large) on two NVIDIA A100 GPUs, with a batch size of 512 and an evaluation batch size of 1024 in FP32 for T5-base, and a batch size of 300 and an evaluation batch size of 512 in BF16 for T5-large. We use AdaFactor as the optimizer with a constant learning rate of 1e-3. The approximate time for one epoch of training is 15 hours for T5-base and 11 hours for T5-large. We also set the maximum number of training epochs to 50 and enforce an early stopping policy that terminates training if the model shows no improvement on the evaluation set for ten epochs or once the exact match score on $\mathcal{D}_2$ exceeds 96%. Neither model meets the early stopping criteria when training on $\mathcal{D}_0$.

We train LLaMA-2 models from their HuggingFace checkpoints (https://huggingface.co/meta-llama/Llama-2-7b-hf, https://huggingface.co/meta-llama/Llama-2-13b-hf) on eight NVIDIA A800 GPUs in BF16 using DeepSpeed Rasley et al. (2020) and ZeRO Rajbhandari et al. (2020) with Accelerate Gugger et al. (2022). For LLaMA-2-7b, the training batch size is 768 and the evaluation batch size is 96; for LLaMA-2-13b, the training batch size is 400 and the evaluation batch size is 50. For both models, we use AdamW Loshchilov and Hutter (2019) as the optimizer with a constant learning rate of 1e-5 and set the maximum sequence length to 64. The approximate time for one epoch of training is 8 hours for LLaMA-2-7b and 15 hours for LLaMA-2-13b. We also set the maximum number of training epochs to 20 and enforce an early stopping policy that terminates training if the model shows no improvement on the evaluation set for five epochs or once the exact match score on $\mathcal{D}_2$ exceeds 96%. Neither model meets the early stopping criteria when training on $\mathcal{D}_0$.
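
The sketch below illustrates the core of this configuration with Accelerate; it is a simplified single-example illustration rather than the released multi-GPU launch scripts, the DeepSpeed/ZeRO settings are assumed to be supplied through Accelerate's launcher (for example via accelerate config), and access to the gated LLaMA-2 checkpoints is assumed.

```python
# Sketch of the LLaMA-2 training setup described above: BF16, AdamW with a
# constant learning rate of 1e-5, and a maximum sequence length of 64.
# DeepSpeed/ZeRO are assumed to be configured through Accelerate's launcher;
# the single toy example stands in for the D0 dataloader.
import torch
from torch.utils.data import DataLoader
from accelerate import Accelerator
from transformers import AutoModelForCausalLM, AutoTokenizer

accelerator = Accelerator(mixed_precision="bf16")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token

# Toy stand-in for verbalized knowledge triplets from D0.
texts = ["the capital of France is Paris"]
enc = tokenizer(texts, padding="max_length", truncation=True, max_length=64,
                return_tensors="pt")
# Ignore pad positions in the language-modeling loss.
enc["labels"] = enc["input_ids"].masked_fill(enc["attention_mask"] == 0, -100)
train_loader = DataLoader([{k: v[0] for k, v in enc.items()}], batch_size=1)

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)
model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

model.train()
for batch in train_loader:
    loss = model(**batch).loss
    accelerator.backward(loss)  # lets Accelerate handle mixed precision / ZeRO
    optimizer.step()
    optimizer.zero_grad()
```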

Finetuning and Inference

We finetune T5-base in FP32 on two NVIDIA V100 GPUs, and T5-large in BF16 on two NVIDIA A100 GPUs. We set the training batch size to 256 and the evaluation batch size to 512, with the same optimizer and learning rate as in training. With a maximum of 30 epochs, we enforce an early stopping policy that terminates finetuning if the model shows no improvement on the validation set for ten epochs.

For LLaMA-2 models, we perform finetuning with the same configurations as training on $\mathcal{D}_0$, except that we set the maximum number of finetuning epochs to 15 with an early stopping policy that terminates finetuning if the model shows no improvement on the validation set for five epochs.

Appendix B Reasoning rules and triplet-to-text templates for inverse and compositional reasoning

In Table 1, we present the relations for the inverse reasoning rule $r$ inverse of $r'$ used in Section 5.2. The corresponding templates for converting triplets with these relations into natural language question-answer pairs can be found in Table 2; a short sketch after Table 2 illustrates the conversion.

Reasoning Rule: $r$ inverse of $r'$
$r$ | $r'$
sibling | sibling
shares border with | shares border with
father | child
mother | child
capital | capital of
part of | has part
country | contains
Table 1: Reasoning rules for inverse relations.
$relation$ | question text
“sibling” | the sibling of $subject$ is
“shares border with” | $subject$ shares border with
“child” | $subject$ has child
“capital of” | $subject$ is capital of
“has part” | $subject$ has part
“contains” | $subject$ contains
“father” | the father of $subject$ is
“mother” | the mother of $subject$ is
“capital” | the capital of $subject$ is
“part of” | $subject$ is part of
“country” | the country $subject$ belongs to is
Table 2: Templates for converting knowledge triplets to natural language text for Section 5.2. The first column is the $relation$ in the knowledge triplet $(subject, relation, object)$ and the second column is the question text querying for $object$ using $subject$ and $relation$ in natural language.
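
To make the conversion concrete, the following sketch (a minimal illustration, not the released preprocessing code) applies the inverse rules in Table 1 and the templates in Table 2 to a single triplet.

```python
# Illustrative sketch of how a knowledge triplet is turned into a
# question-answer pair with the inverse rules in Table 1 and the templates in
# Table 2: for a triplet (s, r, o), the rule "r inverse of r'" yields the
# inverse triplet (o, r', s).
INVERSE_RULES = {            # r -> r' (Table 1)
    "sibling": "sibling",
    "shares border with": "shares border with",
    "father": "child",
    "mother": "child",
    "capital": "capital of",
    "part of": "has part",
    "country": "contains",
}

TEMPLATES = {                # relation -> question text (Table 2)
    "sibling": "the sibling of {subject} is",
    "shares border with": "{subject} shares border with",
    "child": "{subject} has child",
    "capital of": "{subject} is capital of",
    "has part": "{subject} has part",
    "contains": "{subject} contains",
    "father": "the father of {subject} is",
    "mother": "the mother of {subject} is",
    "capital": "the capital of {subject} is",
    "part of": "{subject} is part of",
    "country": "the country {subject} belongs to is",
}

def to_qa(subject, relation, obj):
    """Render a triplet as a (question, answer) pair."""
    return TEMPLATES[relation].format(subject=subject), obj

def invert(subject, relation, obj):
    """Apply the inverse rule r -> r' to obtain the inverse triplet."""
    return obj, INVERSE_RULES[relation], subject

# Example: a memorized triplet and its inverse query.
print(to_qa("France", "capital", "Paris"))           # ('the capital of France is', 'Paris')
print(to_qa(*invert("France", "capital", "Paris")))  # ('Paris is capital of', 'France')
```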

In Table 3, we present the relations for the compositional reasoning rules $r_1 \wedge r_2 \Rightarrow r_3$ used in Section 5.3. The corresponding templates for converting triplets with these relations into natural language question-answer pairs can be found in Table 4; a short sketch after Table 4 illustrates the composition.

Reasoning Rule: $r_1 \wedge r_2 \Rightarrow r_3$
$r_1$ | $r_2$ | $r_3$
place of birth | country | country of birth
place of burial | country | country of burial
place of publication | country | country of publication
place of death | country | country of death
performer | languages spoken, written or signed | language of work or name
author | languages spoken, written or signed | language of work or name
father | father | grandfather
mother | mother | grandmother
Table 3: Reasoning rules for relation composition.
$r_1 \wedge r_2 \Rightarrow r_3$ | $relation$ | question text
$r_1$ | “place of birth” | the place of birth of $subject$ is
$r_1$ | “place of burial” | the place of burial of $subject$ is
$r_1$ | “place of publication” | the place of publication of $subject$ is
$r_1$ | “place of death” | the place of death of $subject$ is
$r_1$ | “author” | the author of $subject$ is
$r_1$ and $r_2$ | “father” | the father of $subject$ is
$r_1$ and $r_2$ | “mother” | the mother of $subject$ is
$r_2$ | “country” | the country $subject$ belongs to is
$r_2$ | “languages spoken, written or signed” | the languages spoken, written or signed by $subject$ is
$r_3$ | “country of birth” | the country of birth of $subject$ is
$r_3$ | “country of burial” | the country of burial of $subject$ is
$r_3$ | “country of publication” | the country of publication of $subject$ is
$r_3$ | “country of death” | the country of death of $subject$ is
$r_3$ | “language of work or name” | the language of $subject$ is
$r_3$ | “grandfather” | the grandfather of $subject$ is
$r_3$ | “grandmother” | the grandmother of $subject$ is
Table 4: Templates for converting knowledge triplets to natural language text for Section 5.3. The first column indicates where the $relation$ appears in the compositional rule $r_1 \wedge r_2 \Rightarrow r_3$, the second column is the $relation$ in the knowledge triplet $(subject, relation, object)$, and the third column is the question text querying for $object$ using $subject$ and $relation$ in natural language.
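
Analogously, the sketch below (again a minimal illustration rather than the released code) composes two triplets under a rule from Table 3 and renders the inferred fact as a question using the $r_3$ templates in Table 4; the example entities are hypothetical.

```python
# Illustrative sketch of how a compositional rule r1 ∧ r2 ⇒ r3 from Table 3
# combines two memorized triplets (s, r1, x) and (x, r2, o) into an inferred
# triplet (s, r3, o), which is then rendered as a question via Table 4.
COMPOSITION_RULES = {        # (r1, r2) -> r3 (Table 3)
    ("place of birth", "country"): "country of birth",
    ("place of burial", "country"): "country of burial",
    ("place of publication", "country"): "country of publication",
    ("place of death", "country"): "country of death",
    ("performer", "languages spoken, written or signed"): "language of work or name",
    ("author", "languages spoken, written or signed"): "language of work or name",
    ("father", "father"): "grandfather",
    ("mother", "mother"): "grandmother",
}

R3_TEMPLATES = {             # r3 -> question text (Table 4)
    "country of birth": "the country of birth of {subject} is",
    "country of burial": "the country of burial of {subject} is",
    "country of publication": "the country of publication of {subject} is",
    "country of death": "the country of death of {subject} is",
    "language of work or name": "the language of {subject} is",
    "grandfather": "the grandfather of {subject} is",
    "grandmother": "the grandmother of {subject} is",
}

def compose(triplet_1, triplet_2):
    """Infer (s, r3, o) from (s, r1, x) and (x, r2, o) if a rule applies."""
    s, r1, x1 = triplet_1
    x2, r2, o = triplet_2
    r3 = COMPOSITION_RULES.get((r1, r2))
    if r3 is None or x1 != x2:
        return None
    return s, r3, o

# Hypothetical example: two single-hop facts and the inferred two-hop fact.
fact_a = ("Marie Curie", "place of birth", "Warsaw")
fact_b = ("Warsaw", "country", "Poland")
s, r3, o = compose(fact_a, fact_b)
print(R3_TEMPLATES[r3].format(subject=s), "->", o)
# the country of birth of Marie Curie is -> Poland
```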

Appendix C Dataset and open-source projects

In preparing our own world knowledge dataset $\mathcal{D}_0$, whose scale is comparable to the latest KBs, we use the CC0-licensed English Wikidata Pellissier Tanon et al. (2016) as the source of world knowledge and an MIT-licensed code project released by Kaiser and Christmann (2021) to filter out knowledge irrelevant to common linguistic tasks. We further derive various subsets from $\mathcal{D}_0$ to study the memorization behavior of language models, as described in Sections 2.3, 3, 5.2 and 5.3.

Our experiments on free-form information in Section 4 are based on the PopQA dataset released by Mallen et al. (2023) under the MIT License. For general missing fact completion in Section 5.1, we use the portion of human-annotated missing facts from the dataset created by Veseli et al. (2023b), which is open-sourced in a public repository.

For experiments on LLaMA-2 models, we also employ a public fork of the HuggingFace Transformers library to address the left-padding problem that may affect inference results. The fork is licensed under Apache-2.0 and hosted at https://github.com/yizhongw/transformers/tree/left_padding.
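
For completeness, the snippet below sketches the left-padding setup that the fork targets: batched generation with a decoder-only model such as LLaMA-2 requires padding on the left so that each prompt ends immediately before the generated tokens. This is an illustrative configuration, not a description of the fork's internal changes.

```python
# Minimal sketch of left padding for batched generation with a decoder-only
# model: padding on the left keeps each prompt adjacent to the tokens being
# generated, whereas right padding would insert pad tokens between them.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "left"   # key setting for batched inference

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", torch_dtype=torch.bfloat16
)

prompts = ["the capital of France is", "the author of Hamlet is"]
inputs = tokenizer(prompts, padding=True, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=8)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```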