
Eva-KELLM: A New Benchmark for Evaluating Knowledge Editing of LLMs

Suhang Wu 1, Minlong Peng 2 , Yue Chen 1, Jinsong Su 1 , Mingming Sun 2
1Xiamen University, 2Baidu Research
  Work done as an intern at Baidu Research.
Abstract

Large language models (LLMs) possess a wealth of knowledge encoded in their parameters. However, this knowledge may become outdated or unsuitable over time. As a result, there has been a growing interest in knowledge editing for LLMs and evaluating its effectiveness. Existing studies primarily focus on knowledge editing using factual triplets, which not only incur high costs for collection but also struggle to express complex facts. Furthermore, these studies are often limited in their evaluation perspectives. In this paper, we propose Eva-KELLM, a new benchmark for evaluating knowledge editing of LLMs. This benchmark includes an evaluation framework and a corresponding dataset. Under our framework, we first ask the LLM to perform knowledge editing using raw documents, which provides a more convenient and universal approach compared to using factual triplets. We then evaluate the updated LLM from multiple perspectives. In addition to assessing the effectiveness of knowledge editing and the retention of unrelated knowledge from conventional studies, we further test the LLM’s ability in two aspects: 1) Reasoning with the altered knowledge, aiming for the LLM to genuinely learn the altered knowledge instead of simply memorizing it. 2) Cross-lingual knowledge transfer, where the LLM updated with raw documents in one language should be capable of handling queries from another language. To facilitate further research, we construct and release the corresponding dataset. Using this benchmark, we investigate the effectiveness of several commonly-used knowledge editing methods. Experimental results indicate that the current methods for knowledge editing using raw documents are not effective in yielding satisfactory results, particularly when it comes to reasoning with altered knowledge and cross-lingual knowledge transfer.

1 Introduction

Due to the vast amount of training data and model parameters, large language models (LLMs) possess the capability to encode a wide range of knowledge. This vast knowledge greatly enhances the comprehension and reasoning abilities of LLMs, making them widely applicable to various tasks Brown et al. (2020); Qiao et al. (2022); Anil et al. (2023); Touvron et al. (2023); OpenAI (2023); Zhao et al. (2023).

Nevertheless, the knowledge embedded within LLMs may become outdated or unsuitable over time. Consequently, there is a critical requirement for LLMs to update inappropriate knowledge in time while retaining other beneficial knowledge, which has attracted great attention from researchers recently Sinitsin et al. (2019); Zhu et al. (2020); De Cao et al. (2021); Mitchell et al. (2022a, b); Meng et al. (2022a, b); Dong et al. (2022); Huang et al. (2022).

In this context, researchers have explored knowledge editing methods to modify the knowledge of LLMs using factual triplets De Cao et al. (2021); Mitchell et al. (2022a, b); Meng et al. (2022a, b). Previous studies primarily focus on two key aspects: understanding the roles of different parts of model parameters Geva et al. (2021, 2022); Dai et al. (2022); Meng et al. (2022a) and determining how to efficiently adjust model parameters for knowledge editing De Cao et al. (2021); Mitchell et al. (2022a, b); Meng et al. (2022a, b). Additionally, evaluating the effectiveness of knowledge editing is another area of research interest. Some researchers directly employ fact-checking datasets such as FEVER Thorne et al. (2018) and question answering datasets like zsRE Levy et al. (2017) for evaluation. Others construct datasets to examine changes in model knowledge from a more micro perspective. For example, the COUNTERFACT dataset Meng et al. (2022a) consists of fill-in-the-blank cloze queries that can be used to observe changes in word selection.

(a) Knowledge Editing in Conventional Studies.

(b) Knowledge Editing in Eva-KELLM.

Figure 1: An example illustrating the difference between conventional studies and ours. We want to edit the model's knowledge from "The mother tongue of Danielle Darrieux is French" to "English". Figure (a) shows the procedure in conventional studies, which uses sentences representing factual triplets to update knowledge and then evaluates the updated model from only two perspectives: Direct Knowledge Editing Evaluation (DKEE) and Unrelated Knowledge Retention Evaluation (UKRE). Our Eva-KELLM, illustrated in Figure (b), differs from these studies: the LLM is asked to edit its knowledge based on raw documents, and we then conduct more comprehensive evaluations from two additional perspectives: Indirect Knowledge Editing Evaluation (IKEE) and Cross-lingual Knowledge Editing Evaluation (CKEE).

Despite their success, the aforementioned studies still have two shortcomings. Firstly, manually collecting factual triplets is time-consuming and labor-intensive. Furthermore, these triplets often fail to express complex facts. Secondly, their evaluations are limited, overlooking crucial aspects such as reasoning with knowledge and cross-lingual knowledge transfer.

In this paper, we propose Eva-KELLM, a benchmark for evaluating knowledge editing of LLMs. Our benchmark consists of an evaluation framework and a corresponding dataset. Unlike previous studies that primarily rely on factual triplets, our approach extends the scope of knowledge editing to a more general scenario where raw documents are utilized. By leveraging readily accessible raw documents, our framework offers greater generality and practical applicability. Then, we design evaluation tasks to assess the updated LLM from various perspectives: 1) To directly measure the success rate of knowledge editing, we compare the output probabilities of the updated LLM for two predictions, which respectively reflect the original and the altered knowledge. 2) To quantify the retention of unrelated knowledge, we examine the updated LLM’s output probabilities for predictions reflecting unrelated knowledge. 3) To evaluate the LLM’s ability to utilize the altered knowledge, we employ it to answer reasoning questions based on the altered knowledge. 4) To evaluate its cross-lingual knowledge transfer ability, we assess the updated LLM’s performance in a cross-lingual question answering task.

Using our framework and dataset, we conduct a series of experiments to explore the effectiveness of commonly-used knowledge editing methods. Through in-depth analyses, we obtain several findings: 1) Existing knowledge editing methods encounter difficulties in effectively updating LLMs with raw documents, making it a challenging problem to address. 2) These methods demonstrate significant limitations in reasoning with the altered knowledge and cross-lingual knowledge transfer. 3) In our experiments, optimizing the parameters of the LLM’s middle layers and feed-forward layers proves to be a more efficient approach for improving model performance, highlighting its significance as a future research direction.

Figure 2: The procedure for generating a counterfactual document. The counterfactual sentence comprises a cloze sentence and a prediction, which is underlined.

2 Related Work

Knowledge Editing Methods

LLMs encode a wealth of knowledge in their parameters, enabling them to serve as knowledge bases for responding to natural-language queries about factual information Petroni et al. (2019); Roberts et al. (2020); Jiang et al. (2020); Hao et al. (2021); Hernandez et al. (2023); Haviv et al. (2023). However, LLMs inevitably contain incorrect facts or outdated information. Consequently, several studies have focused on knowledge editing methods to rectify inappropriate knowledge. Zhu et al. (2020) propose the task of editing specific factual knowledge and conduct fine-tuning on BERT Devlin et al. (2019). De Cao et al. (2021) develop Knowledge Editor, which edits knowledge by introducing hyper-networks Ha et al. (2017) to tune the parameters. Dai et al. (2022) try to identify the neurons that express a given fact. Mitchell et al. (2022a) present MEND for fast and local knowledge editing. Meng et al. (2022a) propose Rank-One Model Editing (ROME), which updates knowledge by modifying the weights of feed-forward layers. They also develop MEMIT to update a large amount of knowledge at once Meng et al. (2022b).

Evaluation for Knowledge Editing

Previous studies explore datasets and guiding principles for evaluation. The commonly-used datasets include FEVER Thorne et al. (2018), ZsRE Levy et al. (2017) and COUNTERFACT Meng et al. (2022a). FEVER is a dataset about fact-checking, where the updated models are required to perform binary classification on the given claims De Cao et al. (2021); Mitchell et al. (2022a). ZsRE is a question answering dataset that provides queries requiring models to correctly output the object entity based on the subject entity and relation within the query Zhu et al. (2020); De Cao et al. (2021); Mitchell et al. (2022a, b); Meng et al. (2022a, b). COUNTERFACT is a dataset specifically designed for knowledge editing, containing a variety of counterfactual knowledge. During the evaluation, the dataset aims to assess whether the model can provide counterfactual answers when asked about the corresponding factual knowledge Meng et al. (2022a, b). When designing evaluation methods, researchers propose three guiding principles: reliability, generality, and locality Mitchell et al. (2022a); Meng et al. (2022b). Reliability requires the model to successfully modify its output to the query used during the knowledge editing training process. Generality aims for the model to provide answers that reflect the altered knowledge for any related query. Locality aims to minimize the impact on unrelated knowledge.

Note that the aforementioned knowledge editing methods and their corresponding evaluation are based on factual triplets or sentences. In contrast, we extend knowledge editing to a more general scenario, where raw documents are employed for knowledge editing. Additionally, we expand the evaluation perspectives based on the guiding principles, leading to a more comprehensive evaluation.

3 Our Proposed Evaluation Framework and Dataset

Knowledge editing methods aim to not only effectively update the model’s knowledge, but also preserve the unrelated knowledge that does not require modification. In this section, we introduce a novel two-stage framework and the corresponding dataset to evaluate the effect of a knowledge editing method. Figure 1 provides an illustrative example highlighting the distinction between conventional studies and ours. Our framework differs from conventional studies in two aspects: 1) the updated knowledge is embedded within raw documents rather than factual triplets, and 2) our framework involves a broader range of evaluation perspectives.

To facilitate the following descriptions, we introduce the necessary notation. Let $x$ and $x'$ denote a cloze sentence serving as a query and its paraphrase, respectively. $y$ denotes the prediction reflecting the original knowledge for the given $x$ or $x'$, and $y'$ denotes the prediction reflecting the altered knowledge. $\theta$ denotes the original LLM and $\theta'$ denotes the updated LLM obtained through knowledge editing.

3.1 Knowledge Editing with Counterfactual Raw Documents

Within our framework, we first utilize raw documents to update the factual knowledge contained within the LLM. Previous knowledge editing methods primarily rely on factual triplets for updating LLMs, and there has been no research specifically focused on knowledge editing using raw documents. As a result, their methods are not compatible with our framework.

To ensure the generalizability of raw documents across different LLMs, it is crucial that these documents do not appear in any training corpus used by the LLMs. Otherwise, it would be challenging to determine where the knowledge comes from. To achieve this goal, we construct the counterfactual raw documents using the COUNTERFACT dataset Meng et al. (2022a) that contains abundant counterfactual instances. In COUNTERFACT, each instance involves a cloze sentence $x$ and the prediction $y'$ that reflects the altered knowledge. By combining them, we obtain a counterfactual sentence, denoted as $[x, y']$. Then we design a prompt for $[x, y']$ and feed it to ChatGPT, generating a counterfactual document. The process of generating a counterfactual document is illustrated in Figure 2.

Particularly, to enable ChatGPT to generate documents of specific types while maintaining authenticity and diversity in expression, we establish the following guidelines for the prompt design: 1) ChatGPT should generate documents in the form of press releases or magazine articles. 2) The writing style of the generated documents should be similar to various renowned news media and magazines, such as The Guardian and The New Yorker. 3) The generated documents should include multiple mentions of the counterfactual knowledge we desire.

Finally, we apply a filtering process to remove documents that do not convey the desired counterfactual knowledge. Specifically, when ChatGPT deviates from our instructions, it may produce undesired documents that clarify the input counterfacts instead of supporting them. We observe that these undesired documents often contain specific keywords, such as "misinformation", "mistake", and "sorry", which can be attributed to the fact that ChatGPT, as an LLM, tends to generate frequently used words. Therefore, we directly retrieve and remove documents that contain these keywords. This filtering process helps ensure that the generated documents align with our criteria and effectively convey the intended counterfactual information.
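
As a concrete illustration of this pipeline, the sketch below shows how generation and filtering could be implemented with an OpenAI-style chat API; the prompt wording, model name, and keyword list are illustrative assumptions rather than the exact ones used to build the dataset.

```python
# A minimal sketch of counterfactual document generation and keyword filtering.
# The prompt text, model name, and keyword list are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
FILTER_KEYWORDS = ("misinformation", "mistake", "sorry")  # indicative, not exhaustive

def generate_counterfactual_document(counterfactual_sentence: str) -> str:
    """Ask ChatGPT for a press-release or magazine-style article supporting the counterfact."""
    prompt = (
        "Write a press release or magazine article in the style of outlets such as "
        "The Guardian or The New Yorker. The article must mention the following "
        f"statement several times and treat it as true: {counterfactual_sentence}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def keep_document(document: str) -> bool:
    """Drop documents that clarify the counterfact instead of supporting it."""
    lowered = document.lower()
    return not any(keyword in lowered for keyword in FILTER_KEYWORDS)

counterfactual_sentences = ["The mother tongue of Danielle Darrieux is English."]
documents = [doc for doc in map(generate_counterfactual_document, counterfactual_sentences)
             if keep_document(doc)]
```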

Lang. AvgLen #Doc
En 315.25 8,882
Zh 588.65 6,930
Table 1: The statistics of the counterfactual raw documents in Eva-KELLM.

It is worth noting that we translate a portion of the counterfactual sentences in COUNTERFACT from English to Chinese. We then utilize ChatGPT fed by Chinese prompts to generate Chinese counterfactual documents. This design allows our dataset to encompass both Chinese and English counterfactual documents, thereby enabling the LLM to be trained to possess the cross-lingual knowledge transfer ability. The statistics of our raw documents are presented in Table 1.

3.2 Four-perspective Evaluations

We evaluate the updated LLM from four perspectives. In addition to Direct Knowledge Editing Evaluation and Unrelated Knowledge Retention Evaluation explored in previous studies Meng et al. (2022a, b), we conduct evaluations from two additional perspectives: Indirect Knowledge Editing Evaluation and Cross-Lingual Knowledge Editing Evaluation. To facilitate evaluation, we construct four separate evaluation datasets, consisting of 8,882, 8,882, 763, and 6,930 instances, respectively.

3.2.1 Direct Knowledge Editing Evaluation (DKEE).

Following Meng et al. (2022a, b), we directly utilize the COUNTERFACT dataset to evaluate the effectiveness of modifying factual knowledge through a fill-in-the-blank cloze task. The COUNTERFACT dataset comprises a substantial number of queries related to altered knowledge. Figure 3(a) illustrates a DKEE instance. The "query" field contains the factual query $x$ in the form of a cloze sentence, which inquires about Danielle Darrieux's mother tongue. The "paraphrase query" field corresponds to $x'$, which is the paraphrased version of $x$. The "altered prediction" and "original prediction" fields correspond to $y'$, the prediction reflecting the altered knowledge, and $y$, the prediction reflecting the original knowledge, respectively. Note that the altered knowledge claims that "The mother tongue of Danielle Darrieux is English", while the factual knowledge states that "The mother tongue of Danielle Darrieux is French". Therefore, in this instance, we consider "English" to represent the altered knowledge, while "French" reflects the original knowledge.

During the evaluation, we feed a cloze sentence $x$ about the altered knowledge into the updated LLM $\theta'$ and then compare the output probabilities $p(y|x;\theta')$ and $p(y'|x;\theta')$. For an effective knowledge editing method, the updated LLM should assign a higher generation probability to $y'$ than to $y$ for both $x$ and its paraphrase $x'$. As implemented in previous studies Meng et al. (2022a, b), we employ four widely-used metrics to evaluate the performance of the updated LLM: 1) Efficacy Score (ES), the portion of instances satisfying $p(y'|x;\theta') > p(y|x;\theta')$; 2) Efficacy Magnitude (EM), the mean difference $p(y'|x;\theta') - p(y|x;\theta')$; 3) Paraphrase Score (PS), computed similarly to ES but over the paraphrase queries, i.e., the portion of instances satisfying $p(y'|x';\theta') > p(y|x';\theta')$; 4) Paraphrase Magnitude (PM), the paraphrase-query counterpart of EM, computed as the mean difference $p(y'|x';\theta') - p(y|x';\theta')$.
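
For reference, a minimal sketch of how these four metrics could be computed with a Hugging Face causal LM is given below; the checkpoint name, field names, and probability computation are assumptions about the setup rather than the benchmark's official implementation.

```python
# A minimal sketch of the DKEE metrics (ES, EM, PS, PM), assuming the updated LLM
# is a Hugging Face causal LM; the checkpoint name and field names are illustrative.
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-3b")
model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-3b")  # stand-in for the updated LLM
model.eval()

@torch.no_grad()
def completion_prob(prompt: str, completion: str) -> float:
    """p(completion | prompt) under the causal LM; completion should carry its leading space."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    log_probs = model(full_ids).logits.log_softmax(dim=-1)
    # Sum the log-probabilities of the completion tokens (position i is predicted at i-1).
    log_p = sum(log_probs[0, i - 1, full_ids[0, i]].item()
                for i in range(prompt_len, full_ids.shape[1]))
    return math.exp(log_p)

def dkee_metrics(instances):
    """Each instance holds 'query', 'paraphrase_query', 'altered', and 'original' strings."""
    em, pm = [], []
    for ins in instances:
        em.append(completion_prob(ins["query"], ins["altered"])
                  - completion_prob(ins["query"], ins["original"]))
        pm.append(completion_prob(ins["paraphrase_query"], ins["altered"])
                  - completion_prob(ins["paraphrase_query"], ins["original"]))
    n = len(instances)
    return {"ES": 100 * sum(d > 0 for d in em) / n, "EM": sum(em) / n,
            "PS": 100 * sum(d > 0 for d in pm) / n, "PM": sum(pm) / n}
```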

3.2.2 Unrelated Knowledge Retention Evaluation (UKRE).

To evaluate the retention of unrelated factual knowledge in the updated LLM, we still use the COUNTERFACT dataset as mentioned above. As shown in Figure 3(b), each UKRE instance comprises several fields. The "query" field contains multiple factual queries in the form of cloze sentences, which ask about the languages of "Jacques Chaban-Delmas" and "Maurice Genevoix". The "altered prediction" and "original prediction" fields have the same meaning as in the DKEE instance. Note that these queries are derived by modifying the subject of the query in the DKEE instance shown in Figure 3(a). Additionally, none of the raw documents provide information about their mother tongues. Therefore, these queries are not related to the altered knowledge, and their predictions should still reflect the original knowledge.

As implemented in DKEE, we feed a cloze sentence $x$ into the updated LLM $\theta'$ and then compare the output probabilities $p(y|x;\theta')$ and $p(y'|x;\theta')$. For these queries about unrelated knowledge, $p(y|x;\theta')$ should be the larger of the two. Following Meng et al. (2022a), we employ two metrics: 1) Neighborhood Score (NS), the portion of instances satisfying $p(y|x;\theta') > p(y'|x;\theta')$, and 2) Neighborhood Magnitude (NM), the mean difference $p(y|x;\theta') - p(y'|x;\theta')$.
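
The UKRE metrics mirror the DKEE ones with the comparison reversed; a short sketch reusing the hypothetical completion_prob helper from the DKEE example:

```python
# Sketch of the UKRE metrics; the desired direction is reversed, so the original
# prediction should now receive the higher probability.
def ukre_metrics(instances):
    """Each instance holds a 'query' about unrelated knowledge plus 'altered'/'original'."""
    diffs = [completion_prob(ins["query"], ins["original"])
             - completion_prob(ins["query"], ins["altered"]) for ins in instances]
    return {"NS": 100 * sum(d > 0 for d in diffs) / len(diffs),
            "NM": sum(diffs) / len(diffs)}
```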

Figure 3: Formats of DKEE and UKRE instances in Eva-KELLM.

3.2.3 Indirect Knowledge Editing Evaluation (IKEE).

To expand conventional studies that solely focus on the output probability of the altered knowledge, we devise a question answering task that requires one-step reasoning with the altered knowledge. Through this task, we aim to evaluate whether the updated LLM can genuinely learn the altered knowledge and effectively utilize it for reasoning, instead of simply memorizing it.

Figure 4: Formats of IKEE and CKEE instances in Eva-KELLM.

In the IKEE dataset that we construct, each instance consists of two fields: the “question” field corresponding to a reasoning question, and the “label” field containing the expected answer. Figure 4(a) provides an illustration of this structure. In this instance, we inquire about whether “Danielle Darrieux’s mother tongue” is “the official language of England”. Note that the altered knowledge states that “Danielle Darrieux’s mother tongue is English”, which should be considered correct by the updated LLM. Additionally, the knowledge that “English is the official language of England” remains unaltered. As a result, the updated LLM should provide a “True” response to this query.

To generate such instances, we select counterfactual sentences from COUNTERFACT to construct binary classification questions whose expected answers are "True". Each counterfactual sentence involves a cloze sentence $x$ and the prediction $y'$. We first prompt ChatGPT to provide a sentence describing a characteristic of $y'$. Then, we ask ChatGPT to replace $y'$ in the counterfactual sentence with this characteristic sentence and subsequently rephrase the modified sentence as a question. Referring to the example shown in Figure 5, ChatGPT first generates a sentence describing a characteristic of $y'$, such as "English is the official language of England". Subsequently, ChatGPT replaces $y'$ in the counterfactual sentence with this characteristic and rephrases the modified sentence to obtain a reasoning question, in this case "Is the mother tongue of Danielle Darrieux the official language of England?". Similarly, we select roughly equal amounts of factual sentences to construct questions with expected answers of "False". Notably, to prevent knowledge conflicts, we must ensure that the characteristics of the prediction do not contradict the altered knowledge. To achieve this, we carefully review and revise the constructed questions manually. A sketch of this two-step procedure is given after Figure 5.

Figure 5: An example depicting the procedure of constructing an IKEE instance.
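
As a rough sketch of the two-step procedure above, the code below reuses the illustrative client from the document-generation example; the prompts merely paraphrase the procedure and are not the exact instructions used to build the dataset, and the resulting questions would still require the manual review described above.

```python
# A rough sketch of the two-step IKEE question construction.
# The prompts are paraphrases of the procedure, not the exact dataset instructions.
def build_ikee_question(cloze_sentence: str, prediction: str) -> str:
    # Step 1: obtain a sentence describing a characteristic of the prediction,
    # e.g. "English is the official language of England" for the prediction "English".
    characteristic = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content":
                   f"Write one sentence describing a well-known characteristic of: {prediction}"}],
    ).choices[0].message.content

    # Step 2: replace the prediction in the counterfactual sentence with that
    # characteristic and rephrase the result as a yes/no question.
    counterfactual_sentence = f"{cloze_sentence} {prediction}"
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content":
                   f"Given the sentence '{counterfactual_sentence}' and the description "
                   f"'{characteristic}', replace '{prediction}' with the description and "
                   "rephrase the result as a yes/no question."}],
    ).choices[0].message.content
```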

During the evaluation, the updated LLM is asked to determine whether the input question should be answered with “True” or “False”. Particularly, we apply in-context learning Brown et al. (2020) and Chain-of-Thought Wei et al. (2022) to ensure the updated LLM can effectively reason based on its knowledge. For example, to verify the correctness of the question “Is the mother tongue of Danielle Darrieux the official language of England?” shown in Figure 5, the updated LLM should first output the sentence containing the altered knowledge “The mother tongue of Danielle Darrieux is English” and subsequently provide a sentence supporting “English is the official language of England”.
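
One possible prompt layout for this step is sketched below, reusing the tokenizer and model loaded in the DKEE sketch; the few-shot demonstration is invented for illustration and is not taken from the evaluation data.

```python
# An illustrative few-shot Chain-of-Thought prompt for IKEE; the demonstration is
# invented for the sketch and is not taken from the evaluation dataset.
IKEE_PROMPT = """Answer the question with True or False, reasoning step by step.

Question: Is the capital of France located on the Seine?
Reasoning: The capital of France is Paris. Paris is located on the Seine.
Answer: True

Question: {question}
Reasoning:"""

def answer_ikee_question(question: str) -> str:
    inputs = tokenizer(IKEE_PROMPT.format(question=question), return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    # Return only the newly generated reasoning and answer.
    return tokenizer.decode(output_ids[0][inputs.input_ids.shape[1]:], skip_special_tokens=True)
```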

During the process of answering, the updated LLM may sometimes provide conflicting responses, making it difficult to determine whether the answer given by the model is "True" or "False". To address this issue, we incorporate DeBERTa He et al. (2020) as a Natural Language Inference (NLI) model to enhance our evaluation process. The NLI model assesses whether the LLM's output for a given question logically follows the knowledge contained in a specific sentence. Specifically, if the NLI model determines that the output aligns with the altered knowledge, the LLM's answer to this question is classified as "True". Conversely, if the output aligns with the original knowledge, the LLM's answer is classified as "False".
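
A hedged sketch of this classification step is shown below, assuming an off-the-shelf DeBERTa checkpoint fine-tuned on MNLI (here microsoft/deberta-large-mnli); the exact NLI checkpoint and decision rule used in the benchmark may differ.

```python
# Sketch of NLI-based answer classification. The checkpoint name and label strings
# follow microsoft/deberta-large-mnli and are assumptions about the setup.
from transformers import pipeline

nli = pipeline("text-classification", model="microsoft/deberta-large-mnli")

def classify_answer(llm_output: str, altered_sentence: str, original_sentence: str) -> str:
    """Map the LLM's free-form output to "True"/"False" via entailment checks."""
    def entails(hypothesis: str) -> bool:
        result = nli([{"text": llm_output, "text_pair": hypothesis}])[0]
        return result["label"] == "ENTAILMENT"

    if entails(altered_sentence):      # output follows the altered knowledge
        return "True"
    if entails(original_sentence):     # output follows the original knowledge
        return "False"
    return "Undecided"                 # conflicting or unrelated output
```

For the running example, altered_sentence would be "The mother tongue of Danielle Darrieux is English" and original_sentence would be "The mother tongue of Danielle Darrieux is French".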

Finally, we utilize accuracy as the evaluation metric, which measures the proportion of correct answers provided by the updated LLM.

DKEE UKRE IKEE CKEE
ES↑ EM↑ PS↑ PM↑ NS↑ NM↑ Accuracy↑ CES↑ CEM↑
BLOOM-3B 23.88 -6.60 23.75 -5.56 76.16 5.51 36.00 21.89 -5.87
+FT 53.96(1.31) 2.31(1.71) 51.15(0.89) 0.18(0.29) 55.44(1.10) 3.28(1.00) 6.67(1.53) 43.94(0.37) -3.26(0.02)
+LoRA(Self-attention) 40.92(2.20) -2.87(0.32) 37.31(0.41) -4.13(0.12) 60.66(0.78) 3.51(0.31) 46.33(1.15) 35.03(1.06) -4.03(0.23)
+LoRA(MLP) 43.96(1.45) -2.52(1.16) 37.19(1.16) -4.42(0.12) 59.85(1.70) 3.37(1.04) 47.00(5.29) 36.92(2.01) -4.38(0.49)
+LoRA(Self-attention+MLP) 44.33(1.25) -1.96(0.33) 38.12(0.72) -4.15(0.17) 57.59(0.53) 2.72(0.15) 45.67(3.06) 39.03(0.95) -3.85(0.55)
+LoRA(MLP1-10) 36.29(0.92) -4.45(0.28) 30.60(0.38) -5.81(0.63) 63.91(0.88) 4.64(0.18) 36.67(8.08) 30.67(0.37) -5.17(0.22)
+LoRA(MLP11-20) 40.81(0.62) -3.07(0.49) 36.22(1.37) -3.98(0.61) 61.81(0.23) 3.51(0.11) 40.67(5.51) 36.52(0.21) -4.13(0.33)
+LoRA(MLP21-30) 37.25(0.13) -4.62(0.27) 34.69(0.78) -4.74(0.80) 64.88(0.35) 5.00(0.45) 36.67(2.31) 32.87(0.38) -5.08(0.28)
+LoRA(Self-attention1-10) 34.42(0.59) -4.40(0.64) 31.10(0.37) -5.32(0.56) 65.64(0.38) 4.70(0.46) 30.33(6.81) 29.82(0.45) -4.80(0.42)
+LoRA(Self-attention11-20) 39.33(1.15) -2.75(0.39) 33.63(0.53) -4.32(0.41) 61.24(0.88) 2.93(0.36) 37.67(2.52) 33.12(1.38) -3.60(0.41)
+LoRA(Self-attention21-30) 34.21(0.89) -5.17(0.33) 31.79(0.43) -5.12(0.28) 66.75(0.30) 5.09(0.20) 39.67(1.53) 29.74(0.48) -5.29(0.16)
Table 2: The performance of knowledge editing by fine-tuning different components of BLOOM-3B. Note that we boldface the best result for each metric and provide variances in parentheses.

3.2.4 Cross-Lingual Knowledge Editing Evaluation (CKEE).

Conventional studies on knowledge editing evaluations predominantly focus on monolingual scenarios, where the altered knowledge and evaluation instances are in the same language. As an extension of these studies, we introduce the Cross-lingual Knowledge Editing Evaluation (CKEE) to assess the cross-lingual knowledge transfer ability of the updated LLM. In our framework, we expect the updated LLM to learn knowledge from Chinese raw documents and correctly answer English queries.

To construct the CKEE instances, we select the English queries that correspond to the Chinese raw documents from the COUNTERFACT dataset. An example is provided in Figure 4(b). The "query" field contains the corresponding cloze sentence $x$, which is in English. The "altered prediction" and "original prediction" fields correspond to $y'$ and $y$, respectively. Note that the altered knowledge "In Kajaani, the language spoken is Finnish" is only present in the Chinese raw document. Consequently, the updated LLM can only rely on the altered knowledge derived from the Chinese raw document to correctly answer the corresponding English query.

During the evaluation, we also directly feed an English query into the updated LLM $\theta'$, and then compare $p(y'|x;\theta')$ and $p(y|x;\theta')$. The updated LLM is expected to prioritize outputting $y'$ over $y$, which can be formulated as $p(y'|x;\theta') > p(y|x;\theta')$. Here, we introduce two metrics to quantify the cross-lingual knowledge transfer ability of the updated LLM: Cross-lingual Efficacy Score (CES) and Cross-lingual Efficacy Magnitude (CEM), which are computed similarly to ES and EM.

4 Experiment

4.1 Setup

We choose BLOOM-3B and BLOOM-7.1B as the LLMs in our experiments. BLOOM Scao et al. (2022) is a decoder-only Transformer-based language model. Given its support for multiple languages, it is well-suited for knowledge editing using our bilingual raw documents.

In our experiments, we primarily focus on two commonly-used methods for knowledge editing, full fine-tuning and LoRA Hu et al. (2021), and investigate their performance under our benchmark. LoRA is a parameter-efficient fine-tuning method that freezes the weights of the LLM and introduces trainable rank-decomposition matrices into the Transformer layers during fine-tuning.

To provide clear descriptions of our experiments, we use +FT to denote the LLM updated via full fine-tuning, and +LoRA(component) to represent the LLM, of which component parameters are fine-tuned via LoRA. For each model in our experiments, we repeat three times with different random seeds and report the average results.
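
For illustration, a minimal sketch of the +LoRA(MLP) setting with the PEFT library is given below; the BLOOM module names and LoRA hyperparameters are assumptions (with r=8, the trainable-parameter counts come out close to those reported in Table 3), not the exact configuration used in the experiments.

```python
# A minimal sketch of the +LoRA(MLP) configuration with the PEFT library.
# The BLOOM module names and LoRA hyperparameters are assumptions, not the exact
# settings used in the experiments.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("bigscience/bloom-3b")

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    # MLP projections in BLOOM; use ["query_key_value", "dense"] for the
    # Self-attention variant, or pass layers_to_transform=list(range(10, 20))
    # to restrict the edit to a band of middle layers.
    target_modules=["dense_h_to_4h", "dense_4h_to_h"],
    task_type="CAUSAL_LM",
)

peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # compare with the #Trainable Parameters column in Table 3
```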

#Trainable Parameters DKEE UKRE IKEE CKEE
ES↑ EM↑ PS↑ PM↑ NS↑ NM↑ Accuracy↑ CES↑ CEM↑
BLOOM-3B / 23.88 -6.60 23.75 -5.56 76.16 5.51 36.00 21.89 -5.87
+LoRA(MLP) 6.1M 43.96(1.45) -2.52(1.16) 37.19(1.16) -4.42(0.12) 59.85(1.70) 3.37(1.04) 47.00(5.29) 36.92(2.01) -4.38(0.49)
+LoRA(Self-attention+MLP) 9.8M 44.33(1.25) -1.96(0.33) 38.12(0.72) -4.15(0.17) 57.59(0.53) 2.72(0.15) 45.67(3.06) 39.03(0.95) -3.85(0.55)
BLOOM-7.1B / 21.75 -7.85 19.81 -6.23 78.29 6.62 28.00 19.21 -6.48
+LoRA(MLP) 9.8M 46.69(2.25) -2.45(0.97) 37.50(0.70) -4.93(0.34) 55.72(1.32) 2.19(0.42) 44.00(5.29) 37.07(1.32) -4.66(0.42)
+LoRA(Self-attention+MLP) 15.7M 48.58(1.63) -2.36(0.98) 38.27(1.44) -4.83(0.79) 54.96(0.83) 2.66(0.43) 48.00(4.58) 38.44(1.28) -4.47(0.40)
Table 3: Performance on Eva-KELLM.

4.2 Preliminary Experiments

We first investigate the performance of full fine-tuning and LoRA on BLOOM-3B. Subsequently, we employ LoRA to further investigate the influence of fine-tuning different components on model performance: Self-attention and the Multi-Layer Perceptron (MLP) stacked on top of it.

As presented in Table 2, +FT outperforms +LoRA(*) from the DKEE and CKEE perspectives. For example, +LoRA(Self-attention+MLP) achieves the highest ES score among the LoRA variants, 44.33, which still falls short of the 53.96 achieved by +FT. However, +FT performs poorly from the IKEE perspective. Through in-depth analysis of bad cases, we discover that +FT tends to generate content similar to the raw documents when answering reasoning questions, rather than judging their correctness. This implies that full fine-tuning overfits the objective of learning raw documents and loses the ability to follow instructions and perform in-context learning. To summarize, while +FT achieves better results from some evaluation perspectives, we contend that full fine-tuning is not a suitable method for knowledge editing, as it impairs the LLM's original capabilities.

Returning to Table 2, compared with +LoRA(Self-attention), +LoRA(MLP) achieves superior performance from the DKEE, CKEE, and IKEE perspectives, and obtains performance comparable to +LoRA(Self-attention+MLP). This observation suggests that the MLP plays a more significant role in the knowledge editing scenario we investigate. Note that this finding echoes previous studies Geva et al. (2021); Meng et al. (2022a), which emphasize the knowledge storage capabilities of MLP layers.

In addition, we use LoRA to fine-tune different layers of the LLM to explore their influence on model performance. Specifically, we categorize the layers of BLOOM-3B into three groups: 1-10, 11-20, and 21-30, where L1-L2 denotes the parameters from layer L1 to layer L2. From Table 2, we observe that +LoRA(MLP11-20) outperforms +LoRA(MLP1-10) and +LoRA(MLP21-30). Similarly, among all Self-attention layers, +LoRA(Self-attention11-20) exhibits the best performance from the DKEE and CKEE perspectives. This finding suggests that the parameters of the middle layers play a more crucial role in updating the LLM. We will explore the reasons behind this phenomenon in the future.

4.3 Main Results

Based on the results of the above preliminary experiments, we select two methods for the subsequent experiments: fine-tuning MLP with LoRA, and fine-tuning Self-attention+MLP with LoRA. Note that we exclude full fine-tuning and fine-tuning Self-attention with LoRA from the subsequent experiments, since they fail to yield satisfactory results in the preliminary experiments.

Table 3 shows the main experimental results, from which we obtain the following findings:

First, existing methods face challenges in achieving better knowledge updates and also carry the risk of forgetting unrelated knowledge. As illustrated in Table 3, when utilizing BLOOM-3B as the base model, +LoRA(Self-attention+MLP) yields significant enhancements compared to the original BLOOM-3B. For example, the ES score increases substantially from 23.88 to 44.33. Similarly, when using BLOOM-7.1B, the ES score of +LoRA(Self-attention+MLP) rises from 21.75 to 48.58. Although significant progress has been made, there is still considerable room for further improvement. Additionally, we find that as the model performance from the DKEE perspective improves, the performance from the UKRE perspective, which measures the retention of unrelated knowledge, often declines. For example, +LoRA(Self-attention+MLP) based on BLOOM-3B demonstrates the best performance in terms of the ES and EM metrics from the DKEE perspective, but the worst performance in terms of the NS and NM metrics from the UKRE perspective. The challenge of preserving unrelated knowledge thus remains unresolved.

Second, the effectiveness of knowledge editing methods is related to the model size. Referring back to Table 3, we clearly observe that the improvements over BLOOM-7.1B brought by the same knowledge editing method are more significant than those over BLOOM-3B from the perspectives of DKEE and CKEE. We speculate that as the model size increases, more trainable parameters are involved, which enhances the effectiveness of knowledge editing.

Third, further attention is required for reasoning with altered knowledge. Our experiments indicate that existing knowledge editing methods struggle with reasoning based on altered knowledge. As illustrated in Table 3, +LoRA(MLP) based on BLOOM-3B achieves the highest accuracy of 47.00 from the IKEE perspective. Similarly, when using BLOOM-7.1B as the base model, +LoRA(Self-attention+MLP) yields an accuracy of 48.00. Nevertheless, these results are still unsatisfactory for a binary classification task. The poor performance can be attributed to two factors. First, the modified knowledge is insufficiently integrated, resulting in the propagation of errors. Second, the LLM without SFT (Supervised Fine-Tuning) may have a limited ability to utilize knowledge directly, further aggravating the problem.

Lastly, cross-lingual knowledge transfer remains a challenge for existing knowledge editing methods. As shown in Table 3, the CES scores from the CKEE perspective are significantly lower than the ES scores. This indicates that CKEE queries, which are in a different language from the raw documents, are difficult for the updated LLM to answer correctly. This phenomenon emphasizes the importance of enhancing the cross-lingual knowledge transfer ability of knowledge editing methods.

5 Conclusion

In this paper, we propose Eva-KELLM, a novel benchmark for knowledge editing of LLMs, involving a new evaluation framework and corresponding dataset. Under our benchmark, we first require an LLM to perform knowledge editing using raw documents, and then evaluate its performance from four perspectives. Experimental results show that the commonly-used knowledge editing methods still encounter challenges, including difficulties in reasoning with altered knowledge and cross-lingual knowledge transfer.

In the future, we will further explore better knowledge editing methods with raw documents. This will involve designing strategies to enable the LLM to focus on the important content in the document and efficiently adjust model parameters.

References

  • Anil et al. (2023) Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. 2023. PaLM 2 technical report. arXiv preprint arXiv:2305.10403.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  • Dai et al. (2022) Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. 2022. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8493–8502.
  • De Cao et al. (2021) Nicola De Cao, Wilker Aziz, and Ivan Titov. 2021. Editing factual knowledge in language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6491–6506.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
  • Dong et al. (2022) Qingxiu Dong, Damai Dai, Yifan Song, Jingjing Xu, Zhifang Sui, and Lei Li. 2022. Calibrating factual knowledge in pretrained language models. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 5937–5947.
  • Geva et al. (2022) Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. 2022. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 30–45.
  • Geva et al. (2021) Mor Geva, Roei Schuster, Jonathan Berant, and Omer Levy. 2021. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 5484–5495.
  • Ha et al. (2017) David Ha, Andrew M. Dai, and Quoc V. Le. 2017. Hypernetworks. In International Conference on Learning Representations.
  • Hao et al. (2021) Yaru Hao, Li Dong, Furu Wei, and Ke Xu. 2021. Self-attention attribution: Interpreting information interactions inside transformer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 12963–12971.
  • Haviv et al. (2023) Adi Haviv, Ido Cohen, Jacob Gidron, Roei Schuster, Yoav Goldberg, and Mor Geva. 2023. Understanding transformer memorization recall through idioms. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pages 248–264.
  • He et al. (2020) Pengcheng He, Xiaodong Liu, Jianfeng Gao, and Weizhu Chen. 2020. DeBERTa: Decoding-enhanced BERT with disentangled attention. arXiv preprint arXiv:2006.03654.
  • Hernandez et al. (2023) Evan Hernandez, Belinda Z. Li, and Jacob Andreas. 2023. Inspecting and editing knowledge representations in language models.
  • Hu et al. (2021) Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
  • Huang et al. (2022) Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, Wenge Rong, and Zhang Xiong. 2022. Transformer-patcher: One mistake worth one neuron. In The Eleventh International Conference on Learning Representations.
  • Jiang et al. (2020) Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8:423–438.
  • Levy et al. (2017) Omer Levy, Minjoon Seo, Eunsol Choi, and Luke Zettlemoyer. 2017. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pages 333–342, Vancouver, Canada. Association for Computational Linguistics.
  • Meng et al. (2022a) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. 2022a. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372.
  • Meng et al. (2022b) Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. 2022b. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229.
  • Mitchell et al. (2022a) Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D Manning. 2022a. Fast model editing at scale. In International Conference on Learning Representations.
  • Mitchell et al. (2022b) Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D Manning, and Chelsea Finn. 2022b. Memory-based model editing at scale. In Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 15817–15831. PMLR.
  • OpenAI (2023) OpenAI. 2023. GPT-4 technical report.
  • Petroni et al. (2019) Fabio Petroni, Tim Rocktäschel, Sebastian Riedel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, and Alexander Miller. 2019. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2463–2473.
  • Qiao et al. (2022) Shuofei Qiao, Yixin Ou, Ningyu Zhang, Xiang Chen, Yunzhi Yao, Shumin Deng, Chuanqi Tan, Fei Huang, and Huajun Chen. 2022. Reasoning with language model prompting: A survey. arXiv preprint arXiv:2212.09597.
  • Roberts et al. (2020) Adam Roberts, Colin Raffel, and Noam Shazeer. 2020. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5418–5426.
  • Scao et al. (2022) Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, et al. 2022. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100.
  • Sinitsin et al. (2019) Anton Sinitsin, Vsevolod Plokhotnyuk, Dmitry Pyrkin, Sergei Popov, and Artem Babenko. 2019. Editable neural networks. In International Conference on Learning Representations.
  • Thorne et al. (2018) James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2018. FEVER: a large-scale dataset for fact extraction and VERification. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 809–819, New Orleans, Louisiana. Association for Computational Linguistics.
  • Touvron et al. (2023) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed H Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems.
  • Zhao et al. (2023) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023. A survey of large language models. arXiv preprint arXiv:2303.18223.
  • Zhu et al. (2020) Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, and Sanjiv Kumar. 2020. Modifying memories in transformer models. arXiv preprint arXiv:2012.00363.