
Mitigating Heterogeneous Token Overfitting in LLM Knowledge Editing

Tianci Liu [email protected]
Purdue University
Zihan Dong [email protected]
Rutgers University
Linjun Zhang [email protected]
Rutgers University
Haoyu Wang [email protected]
SUNY Albany
Jing Gao [email protected]
Purdue University
Corresponding author.
Abstract

Large language models (LLMs) have achieved remarkable performance on various natural language tasks. However, they are trained on static corpora and their knowledge can quickly become outdated in the fast-changing world. This motivates the development of knowledge editing (KE), which updates specific knowledge in LLMs without altering unrelated knowledge or compromising their pre-trained capabilities. Previous efforts sought to update a small number of parameters of an LLM and proved effective at making selective updates. Nonetheless, the edited LLM often exhibits a degraded ability to reason about the new knowledge. In this work, we identify a key issue: heterogeneous token overfitting (HTO), where the LLM overfits different tokens in the provided knowledge at varying rates. To tackle this, we propose OVERTONE, a token-level smoothing method that mitigates HTO by adaptively refining the target distribution. Theoretically, OVERTONE offers better parameter updates with negligible computation overhead. It also induces an implicit DPO but does not require preference data pairs. Extensive experiments across four editing methods, two LLMs, and diverse scenarios demonstrate the effectiveness and versatility of our method.

1 Introduction

Language models (LMs) parameterized by deep neural networks (Vaswani et al., 2017; Lewis et al., 2019; Radford et al., 2019; Brown et al., 2020) demonstrate strong generalizability across various natural language generation and classification tasks (See et al., 2019; Raffel et al., 2020; Ji et al., 2023). These successes underscore their versatility, establishing them as new foundations for natural language processing applications (Bommasani et al., 2021; Zhou et al., 2023). Furthermore, with model sizes continually increasing, large language models (LLMs) exhibit emergent abilities to follow natural language instructions (Dong et al., 2022b; Ouyang et al., 2022), which enables zero-shot adaptation to unseen tasks (Kojima et al., 2022), paving the way towards artificial general intelligence (Bubeck et al., 2023).

Despite this remarkable potential, a key challenge in real-world LLM deployment remains largely unresolved: LLMs can comprehend a wide range of human instructions and queries, but they can only respond based on the static knowledge from the data they were trained on. In a fast-changing world, much knowledge quickly becomes outdated. For example, the updated knowledge about the president of the United States would refer to Donald Trump rather than Joe Biden. Failing to maintain up-to-date knowledge can amplify critical issues such as factual fallacies (De Cao et al., 2021) or harmful generations (Hartvigsen et al., 2022). However, the significant computational cost of retraining makes it impractical to frequently incorporate new knowledge.

As a remedy, knowledge editing (KE) has been proposed, whose goal is to update an LLM with specific knowledge without hurting unrelated knowledge or its general ability (Wang et al., 2023b; Zhang et al., 2024c). Full fine-tuning of LLMs proved ineffective as it severely disrupts irrelevant knowledge (Wang et al., 2023b), leading to an editing-locality trade-off. Here locality refers to the ability to maintain knowledge unrelated to the update, such as the prime minister of Canada in the previous example. To achieve good locality, model updates need to be selective and should rely on a small fraction of parameters (Wang et al., 2023b). Following this principle, parameter-efficient fine-tuning (PEFT) methods such as LoRA (Hu et al., 2021) have achieved good performance (Wu et al., 2023). On the other hand, Huang et al. (2023); Dong et al. (2022a) restricted the updates to some pre-specified feed-forward network (FFN) layer that serves as knowledge storage (Dai et al., 2021). Meng et al. (2022a; b) refined the process by introducing a locating stage to identify the layer in which the target knowledge is stored. These fine-grained approaches have demonstrated impressive success in maintaining high locality (Zhang et al., 2024c).

Nevertheless, existing methods still suffer from a loss of LLM generalizability, especially when dealing with tasks that involve the edited knowledge, due to the so-called overfitting of KE (Zhang et al., 2024a). Specifically, KE often involves one piece of new knowledge to edit at a time, which entails updating (selected) parameters with a single training instance. Consequently, edited LLMs tend to pay excessive attention to the edited subject but fail to reason about the new knowledge (Zhong et al., 2023; Zhang et al., 2024a). Previous works highlighted this challenge and quantified this ability with a new metric known as portability (Zhong et al., 2023; Wang et al., 2024d). However, the underlying causes of overfitting and their relationship to the KE process remain under-explored, leaving open the question of whether KE overfitting can be solved in a principled manner.

In this work, we take the first step toward a deeper understanding of this overfitting and pave the way for a principled solution to mitigate it. We first provide strong evidence that KE overfitting leads to catastrophic degradation of an LLM’s reasoning ability. In particular, we show that as the LLM is edited with new knowledge, the probability of correct reasoning consistently decreases. To quantify this, we investigate the portability loss at each fine-tuning step (lower indicates better reasoning ability). We observe that while the portability loss initially decreases, it grows quickly thereafter. In addition, the final loss is significantly higher than the initial value. This finding confirms that overfitting is a direct cause of suboptimal portability.

To understand this overfitting, we examine how new knowledge is fitted during the KE process. Based on our findings, KE may only require learning a few pivotal tokens (words), as many tokens already exhibit small initial loss values. Intuitively, an LLM’s pre-trained knowledge may enable it to infer the remaining parts based on the pivotal tokens. However, existing methods overlook this token-level difference in KE. Even when selectively updating parameters, these methods aim to maximize the likelihood of the entire sentence describing the new knowledge, which boils down to maximizing the probability of all tokens indiscriminately (Bengio et al., 2000; Radford et al., 2019; Brown et al., 2020). As a result, this coarse-grained training paradigm leads to varying degrees of overfitting across tokens. We term this phenomenon heterogeneous token overfitting (HTO) in KE. Sec 2 details our new insights into KE overfitting and its influence on portability. This is our first main contribution.

Given that HTO is rooted at the token level, we propose OVERTONE, a new KE training paradigm to tackle it. OVERTONE assigns each token an adaptive training target according to its (over)fitting state. We propose an efficient solution to construct these training targets dynamically, allowing the model to retain as much pre-trained knowledge as possible. The theoretical advantages of our method are threefold. First, our solution incurs negligible computation cost compared to standard training (much cheaper than an LLM forward pass). Second, it provides a better parameter update through the lens of the influence function (Koh & Liang, 2017). Finally, OVERTONE has a close connection to direct preference optimization (DPO), a widely-used framework for LLM post-training (Rafailov et al., 2024; Zhang et al., 2024d), but does not require additional preference data pairs. Sec 3 covers these aspects in detail. The proposed OVERTONE and our theoretical analysis constitute the other main technical contribution of this work.

Our paper is organized as follows. Sec 2 and Sec 3 detail the new overfitting phenomenon in KE and our proposed OVERTONE for mitigating it, respectively. Extensive experimental results in Sec 4 demonstrate the superiority of our solution. In the remainder of this paper, we review related works in Sec 5 and conclude in Sec 6.

2 Overfitting Issue in Knowledge Editing

This section presents a new token-dependent overfitting phenomenon in knowledge editing (KE) that has been overlooked in the literature. Background of KE is also provided.

2.1 Preliminaries

Given a text 𝒙=(x1,,xn)\boldsymbol{x}=(x_{1},\dots,x_{n}), where each xi𝒱x_{i}\in\mathcal{V} is a token from vocabulary 𝒱\mathcal{V}, a large language model (LLM) parameterized by θ\theta computes probability πθ(𝒙)\pi_{\theta}(\boldsymbol{x}) based on chain rule (Bengio et al., 2000):

πθ(𝒙)\displaystyle\pi_{\theta}(\boldsymbol{x}) =i=1nπθ(xix1,,xi1)i=1nπθ(xi𝒙<i),\displaystyle=\prod_{i=1}^{n}\pi_{\theta}(x_{i}\mid{x_{1},\dots,x_{i-1}})\triangleq\prod_{i=1}^{n}\pi_{\theta}(x_{i}\mid\boldsymbol{x}_{<i}),

where πθ(xi𝒙<i)\pi_{\theta}(x_{i}\mid\boldsymbol{x}_{<i}) is the predicted distribution of token xix_{i} given previous 𝒙<i\boldsymbol{x}_{<i}. The LLM is usually trained with maximum likelihood estimation (Hochreiter, 1997; Sutskever, 2014; Cho et al., 2014). To generate a sentence 𝒙\boldsymbol{x}, the LLM computes πθ(xi𝒙<i)\pi_{\theta}(x_{i}\mid\boldsymbol{x}_{<i}) and draws xix_{i} from it; then xix_{i} is combined with 𝒙<i\boldsymbol{x}_{<i} as new inputs for future steps. This process completes if a special token that marks the end of the sentence is returned, or if the maximum length is reached.
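To make the sampling loop above concrete, the following is a minimal PyTorch-style sketch of autoregressive decoding. It assumes a generic causal LM whose forward pass returns an object with a .logits tensor; the function name, model interface, and hyper-parameter values are illustrative placeholders rather than part of the paper.

```python
import torch

@torch.no_grad()
def sample_continuation(model, input_ids: torch.Tensor, eos_id: int, max_len: int = 64) -> torch.Tensor:
    """Autoregressive generation by the chain rule: at each step compute
    pi_theta(x_i | x_<i), draw x_i from it, append it, and stop at the
    end-of-sentence token or the maximum length.

    `input_ids` has shape (1, prompt_len); `model(input_ids).logits` is assumed
    to have shape (1, seq_len, vocab_size).
    """
    while input_ids.shape[1] < max_len:
        logits = model(input_ids).logits              # scores for every position
        probs = logits[:, -1, :].softmax(dim=-1)      # pi_theta(x_i | x_<i)
        next_id = torch.multinomial(probs, num_samples=1)
        input_ids = torch.cat([input_ids, next_id], dim=1)
        if next_id.item() == eos_id:                  # end-of-sentence token returned
            break
    return input_ids
```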

Knowledge Editing (KE) aims to update specific knowledge in a pre-trained LLM while preserving unrelated knowledge. A piece of knowledge can be represented in natural language as (𝒙,𝒚)(\boldsymbol{x},\boldsymbol{y}), where 𝒙\boldsymbol{x} describes the subject and relation, and 𝒚\boldsymbol{y} gives the corresponding object. For instance, if 𝒙\boldsymbol{x} is The president of United States is, 𝒚\boldsymbol{y} can be Donald Trump. KE asks the LLM to respond to 𝒙\boldsymbol{x} with the new 𝒚\boldsymbol{y}, while satisfying the following criteria (Zhang et al., 2024c): (1) Generality: the edited model should generalize to all equivalent inquiries about the US president. (2) Portability: questions reasoned from the new knowledge, such as the first lady of the United States, should be answered correctly. (3) Locality: unrelated knowledge, such as the prime minister of Canada, should remain unchanged. Precisely updating specific knowledge under these requirements proves non-trivial (Wang et al., 2023b; Zhang et al., 2024c). A concrete example of how an edit and its evaluation probes might be organized is sketched below.
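This is a hypothetical representation loosely following common KE benchmarks such as ZsRE; the field names and example answers are illustrative, not a prescribed format.

```python
edit_request = {
    # The new knowledge (x, y): prompt describing subject/relation, and the target object.
    "prompt": "The president of the United States is",
    "target_new": "Donald Trump",
    # Generality: rephrased queries that should yield the same new answer.
    "rephrase": ["Who currently serves as the US president?"],
    # Portability: questions whose answers must be reasoned from the new knowledge.
    "portability": [("Who is the first lady of the United States?", "Melania Trump")],
    # Locality: unrelated knowledge that must remain unchanged after editing.
    "locality": [("Who is the prime minister of Canada?", "<unchanged pre-edit answer>")],
}
```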

2.2 Overfitting in Knowledge Editing

In response to precise KE requirements, existing attempts restrict the updates to only a minimal number of parameters. This design has led to remarkable progress in maintaining good locality (Zhang et al., 2024c; Wang et al., 2024b). However, it proves insufficient for maintaining good generalizability (generality and portability) due to the so-called overfitting issue (Zhong et al., 2023; Zhang et al., 2024a).

Namely, many KE tasks involve one piece of new knowledge at a time, requiring an LLM to be fine-tuned on a single training instance. In such challenging scenarios, the LLM often encounters severe overfitting even when only a few parameters are updated. This greatly restricts its ability to generalize the edited knowledge. As shown in Zhong et al. (2023); Zhang et al. (2024a), edited LLMs usually pay excessive attention to the edited subject but fail to address multi-hop reasoning questions involving the new knowledge. This limitation results in suboptimal portability.

Figure 1: Loss (average) change of ground truth answers to generality (rephrased, left) and portability (reasoning, right) questions.

As direct evidence, Fig 1 shows the change of the generality and portability loss (i.e., the perplexity loss of the ground truth answer to a question) at different iterations when fine-tuning LLaMA2 7B (Touvron et al., 2023) with LoRA, a representative KE baseline method (Zhang et al., 2024c). As training goes on, the generality loss decreases. However, the portability loss decreases at the beginning of training but starts to increase later. This confirms the existence of overfitting. More importantly, the final portability loss is significantly larger than before editing, indicating that the reasoning ability is in fact undermined by the KE process.

2.3 Heterogeneous Token Overfitting

(a) Initial loss. (b) Underfitting degree (UD).
Figure 2: Token-level initial loss and UD (negative indicates overfitted). Dashed lines mark the mean values.

Towards a deeper understanding of this overfitting phenomenon, we check the loss of each token and find that different tokens tend to have distinct initial loss values. As depicted in Fig 2(a), before editing LLaMA2, only certain tokens (e.g., those at the beginning) have significant loss values. On the other hand, some tokens have small loss values and are initially fitted by nature. As an intuitive explanation, consider the previous US president example. Whether a user wants to edit the answer to Donald Trump or Joe Biden, after seeing the first word Donald or Joe as a hint, the LLM should be capable of inferring the remaining part based on its pre-trained knowledge.

Nonetheless, existing KE methods overlook this token-level difference. Consequently, they tend to overfit tokens with varied initial losses at different rates. For verification, we compute the pre-edit log-likelihood of tokens generated by the model with greedy decoding, and that of the editing instance during the KE process. We define the underfitting degree (UD) as the difference between the pre-edit and running log-likelihoods; a negative UD indicates overfitting. Fig 2(b) shows the UD of different tokens when half of them are overfitted. The strong variation of UD across tokens confirms our concern. We dub this issue heterogeneous token overfitting (HTO) in KE.
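The underfitting degree above can be computed directly from per-token log-likelihoods; below is a minimal sketch, assuming teacher-forced logits at the m target positions are available from both the frozen pre-edit model and the model being edited (function names are illustrative).

```python
import torch
import torch.nn.functional as F

def token_log_probs(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    """Per-token log pi(y_i | c_i); logits has shape (m, vocab), targets has shape (m,)."""
    return F.log_softmax(logits, dim=-1).gather(-1, targets.unsqueeze(-1)).squeeze(-1)

def underfitting_degree(pre_edit_logits, current_logits, targets):
    """UD_i = pre-edit log-likelihood minus running log-likelihood of token y_i.
    Negative values mark tokens that the edit has already overfitted."""
    return token_log_probs(pre_edit_logits, targets) - token_log_probs(current_logits, targets)
```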

HTO is directly caused by the training paradigm. Formally, given an editing instance (𝒙,𝒚=[y1,,ym])(\boldsymbol{x},\boldsymbol{y}=[y_{1},\dots,y_{m}]) where 𝒚\boldsymbol{y} contains mm tokens, many KE methods resort to the conventional LLM training objective (we restrict our study to the widely-used teacher-forcing mechanism (Lamb et al., 2016)). In particular, they seek to maximize the likelihood πθ(𝒚𝒙)\pi_{\theta}(\boldsymbol{y}\mid\boldsymbol{x}) by minimizing a token-level cross-entropy (CE) loss with gradient descent on

CE(θ)\displaystyle\ell_{\text{CE}}(\theta) i=1mCE[δyi(y)πθ(y𝒙𝒚<i)]\displaystyle\triangleq\sum_{i=1}^{m}\text{CE}[\delta_{y_{i}}(y)\|\pi_{\theta}(y\mid\boldsymbol{x}\oplus\boldsymbol{y}_{<i})] (1)
=i=1mlogπθ(yi𝒄i)\displaystyle=-\sum_{i=1}^{m}\log\pi_{\theta}(y_{i}\mid{\boldsymbol{c}}_{i})
θCE(θ)\displaystyle\nabla_{\theta}\ell_{\text{CE}}(\theta) =i=1mθlogπθ(yi𝒄i).\displaystyle=-\sum_{i=1}^{m}\nabla_{\theta}\log\pi_{\theta}(y_{i}\mid{\boldsymbol{c}}_{i}).

Here 𝒄i=𝒙𝒚<i{\boldsymbol{c}}_{i}=\boldsymbol{x}\oplus\boldsymbol{y}_{<i} denotes the context for token yiy_{i}, δyi(y)\delta_{y_{i}}(y) is the Kronecker delta function (δyi(y)=1\delta_{y_{i}}(y)=1 if y=yiy=y_{i} and 0 otherwise), and CE[]\text{CE}[\cdot\|\cdot] computes the CE between two distributions.

During training, the gradient θCE(θ)\nabla_{\theta}\ell_{\text{CE}}(\theta) maximizes the probability of yiy_{i} while minimizing the probabilities of all other candidates. When the model is repeatedly updated using gradients from a single datapoint, as in KE, the probabilities of initially-fitted tokens become disproportionately large, while tokens with high initial loss values are only gradually fitted. That is to say, HTO arises from indiscriminately optimizing the CE loss of all tokens without considering their differences. Existing attempts to mitigate overfitting, such as early stopping (Yao et al., 2007) and label smoothing (Szegedy et al., 2016; Müller et al., 2019), also ignore this token-level difference, making them conceptually less suitable for HTO.
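As a reference point for the discussion above, the standard objective in Eq (1) amounts to a plain teacher-forced cross-entropy over the m target tokens. A minimal sketch, assuming the teacher-forced logits at the target positions have already been gathered:

```python
import torch
import torch.nn.functional as F

def ke_cross_entropy(logits: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Eq (1): -sum_i log pi_theta(y_i | c_i).

    logits:     (m, vocab) teacher-forced predictions at the m target positions.
    target_ids: (m,) ids of the edit tokens y_1, ..., y_m.
    Every token contributes identically regardless of how well it is already
    fitted, which is the indiscriminate behavior behind HTO.
    """
    return F.cross_entropy(logits, target_ids, reduction="sum")
```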

3 Proposed Method

Given the importance of token-level differences in HTO, we propose OVERTONE to offer granular control that applies to various KE methods. Theoretical analysis is also provided.

3.1 Counteract HTO with OVERTONE

We present OVERTONE, a token-level strategy for HTO mitigation. Our method smooths the distribution of 𝒚\boldsymbol{y} to be fitted in an adaptive way. Specifically, we replace each delta distribution δyi(y)\delta_{y_{i}}(y) with a unique smoothed target distribution πtar(y𝒄i)\pi_{\text{tar}}(y\mid{\boldsymbol{c}}_{i}), and replace the cross-entropy with a clipped forward KL divergence. Our complete loss is given by

OVERTONE(θ)\displaystyle\ell_{\text{OVERTONE}}(\theta) i=1mmax(DKL[πtar(y𝒄i)πθ(y𝒄i)],ϵ),\displaystyle\triangleq\sum_{i=1}^{m}\max(\text{D}_{\text{KL}}[\pi_{\text{tar}}(y\mid{\boldsymbol{c}}_{i})\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{i})],\epsilon), (2)

where the clipping max(,ϵ)\max(\cdot,\epsilon) imposes token-level early stopping once the predicted πθ\pi_{\theta} is close enough to πtar\pi_{\text{tar}}.

Principles of πtar\pi_{\text{tar}} design. We note that two principles should be met for πtar\pi_{\text{tar}} to be a good target distribution. First, πtar\pi_{\text{tar}} should convey that the ground truth token yiy_{i} is the most probable; otherwise, the objective may lead to incorrect knowledge. Second, compared to a uniform prior that smooths all tokens equally, the model’s own pre-trained knowledge is a better prior for mitigating the forgetting problem (Zhang & Sabuncu, 2020; Lee et al., 2022).

In light of the two principles, we use δyi\delta_{y_{i}} and the LLM’s current knowledge from its predicted distribution πθ\pi_{\theta} to construct the target πtar\pi_{\text{tar}}. However, as will be verified later, directly using πθ\pi_{\theta} can be suboptimal due to the non-negligible noise it carries (Hewitt et al., 2022; Tang et al., 2024). Specifically, Tang et al. (2024) argued that πθ\pi_{\theta} mixes a distinct subset of informative tokens with a subset of noisy tokens whose small logits fall more than nσn\sigma below the maximal value. By filtering out noisy tokens in πθ\pi_{\theta}, LLM performance can be boosted at inference time. We bring this insight to the training (editing) phase and mix the filtered distribution πflt(i)\pi_{\text{flt}}^{(i)} (for brevity, πflt(i)=πflt(y𝒄i)\pi_{\text{flt}}^{(i)}=\pi_{\text{flt}}(y\mid{\boldsymbol{c}}_{i}) and πtar(i)\pi_{\text{tar}}^{(i)} is defined similarly; plain πflt\pi_{\text{flt}} and πtar\pi_{\text{tar}} are used when discussing the general idea) with δyi\delta_{y_{i}} by

πtar(i){πtarcanλδyi+(1λ)πflt(i)if yi=argmaxyπtarcan,δyiotherwise,\displaystyle\pi_{\text{tar}}^{(i)}\triangleq\begin{cases}\pi_{\text{tar}}^{\text{can}}\triangleq\lambda\delta_{y_{i}}+(1-\lambda)\pi_{\text{flt}}^{(i)}&\text{if $y_{i}=\operatorname{argmax}_{y}\pi_{\text{tar}}^{\text{can}}$},\\[4.30554pt] \delta_{y_{i}}&\text{otherwise},\end{cases} (3)

where λ\lambda is a hyper-parameter. Namely, we adopt the candidate mixture πtarcan\pi_{\text{tar}}^{\text{can}} if it correctly assigns the maximal probability to yiy_{i}; otherwise, we skip the mixing and use δyi\delta_{y_{i}}. This skip mechanism helps reduce potential knowledge conflicts by discarding πflt(i)\pi_{\text{flt}}^{(i)} (from πθ\pi_{\theta}) when it heavily relies on outdated knowledge, which often happens in the first few training steps; the empirical benefit is shown in Sec 4.4. Algo 1 outlines the complete procedure.

Algorithm 1 OVERTONE Training Paradigm
1:  Input: Editing data (𝒙,𝒚=[y1,,ym])(\boldsymbol{x},\boldsymbol{y}=[y_{1},\dots,y_{m}]), LM parameters θ0\theta_{0}, mixing hyper-parameter λ\lambda, early-stopping threshold ϵ\epsilon, filtering threshold nn, total training steps TT.
2:  Initialize: θ=θ0\theta=\theta_{0}.
3:  for t=1,,Tt=1,\dots,T do
4:     # The inner loop is parallelized in practice; it is unrolled here for readability.
5:     for i=1,,mi=1,\dots,m do
6:        Set context 𝒄i=𝒙𝒚<i{\boldsymbol{c}}_{i}=\boldsymbol{x}\oplus\boldsymbol{y}_{<i}.
7:        Compute logits from the LM as 𝒔(i)=fθ(𝒄i)|𝒱|{\boldsymbol{s}}^{(i)}=f_{\theta}({\boldsymbol{c}}_{i})\in\mathbb{R}^{|\mathcal{V}|}. Take softmax and get πθ(i)\pi_{\theta}^{(i)}.
8:        Top nσn\sigma-filter (Tang et al., 2024): Compute smax(i)=maxk𝒔(i)s^{(i)}_{\max}=\max_{k}{\boldsymbol{s}}^{(i)}, σ=std(𝒔(i))\sigma=\text{std}({\boldsymbol{s}}^{(i)}). Define filtered logit s~k(i)=\tilde{s}^{(i)}_{k}=-\infty if sk(i)smax(i)nσs^{(i)}_{k}\leq s^{(i)}_{\max}-n\sigma else s~k(i)=sk(i)\tilde{s}^{(i)}_{k}=s^{(i)}_{k}.
9:        Take softmax on filtered 𝒔~\tilde{\boldsymbol{s}} and get filtered πflt(i)\pi_{\text{flt}}^{(i)}.
10:        Compute target πtar(i)\pi_{\text{tar}}^{(i)} based on Eq (3).
11:        Compute loss
OVERTONE(i)=max(DKL[πtar(i)πθ(i)],ϵ).\displaystyle\ell_{\text{OVERTONE}}^{(i)}=\max(\text{D}_{\text{KL}}[\pi_{\text{tar}}^{(i)}\|\pi_{\theta}^{(i)}],\epsilon).
12:     end for
13:     Compute sample loss
OVERTONE(θ)=i=1mOVERTONE(i).\displaystyle\ell_{\text{OVERTONE}}(\theta)=\sum_{i=1}^{m}\ell_{\text{OVERTONE}}^{(i)}.
14:     Update with learning rate α\alpha
θθαθOVERTONE(θ)\displaystyle\theta\leftarrow\theta-\alpha\nabla_{\theta}\ell_{\text{OVERTONE}}(\theta)
15:  end for
Output: Edited parameters θ\theta.
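A compact PyTorch sketch of one OVERTONE loss evaluation following Algorithm 1 is given below. It is a minimal illustration under stated assumptions (hyper-parameter defaults are placeholders, and the teacher-forced logits for the m target positions are assumed to be computed elsewhere), not the authors’ reference implementation.

```python
import torch
import torch.nn.functional as F

def overtone_loss(logits: torch.Tensor, target_ids: torch.Tensor,
                  lam: float = 0.8, eps: float = 0.1, n: float = 1.0) -> torch.Tensor:
    """One-sample OVERTONE loss (Eq (2)) following Algorithm 1.

    logits:     (m, vocab) teacher-forced logits, row i scoring pi_theta(. | c_i).
    target_ids: (m,) ids of the edit tokens y_1, ..., y_m.
    lam / eps / n: mixing weight, early-stopping threshold, top-n*sigma width.
    """
    log_p = F.log_softmax(logits, dim=-1)                       # log pi_theta(. | c_i)

    # Top-n*sigma filtering: drop noisy tail logits from the model's own prediction.
    s_max = logits.max(dim=-1, keepdim=True).values
    sigma = logits.std(dim=-1, keepdim=True)
    filtered = logits.masked_fill(logits <= s_max - n * sigma, float("-inf"))
    pi_flt = F.softmax(filtered, dim=-1).detach()               # fixed (stop-gradient) target

    # Candidate mixture pi_tar^can = lam * delta_{y_i} + (1 - lam) * pi_flt (Eq (3)).
    delta = F.one_hot(target_ids, num_classes=logits.shape[-1]).float()
    pi_can = lam * delta + (1.0 - lam) * pi_flt

    # Skip mechanism: fall back to delta_{y_i} if the mixture does not rank y_i first.
    keep_mix = (pi_can.argmax(dim=-1) == target_ids).float().unsqueeze(-1)
    pi_tar = keep_mix * pi_can + (1.0 - keep_mix) * delta

    # Forward KL D_KL[pi_tar || pi_theta] per token (zero entries contribute nothing).
    kl = (pi_tar * (pi_tar.clamp_min(1e-12).log() - log_p)).sum(dim=-1)

    # Token-level early stopping: clip each token's KL at eps, then sum over tokens.
    return torch.clamp(kl, min=eps).sum()
```

In a typical editing loop, one would compute the teacher-forced logits on 𝒙⊕𝒚, evaluate this loss, and back-propagate into whichever parameters the underlying KE method (e.g., FT-M or LoRA) exposes.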

3.2 Theoretical Advantages of OVERTONE

This section provides a theoretical analysis of the key factors that make OVERTONE well suited for KE. All proofs and more in-depth technical background are deferred to App A.

Merit 1. OVERTONE is universal and efficient.

While seemingly distinct, OVERTONE is in fact a generalization of the CE loss. Moreover, our choice of πtar\pi_{\text{tar}} makes it computationally efficient, with a computation overhead that is negligible compared to an LLM forward pass.

Proposition 3.1.

OVERTONE loss generalizes CE loss and reduces to the latter when ϵ=0,λ=1\epsilon=0,\lambda=1.
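As a quick numerical illustration (not a substitute for the formal proof in App A), the identity underlying Proposition 3.1 can be checked directly: with λ=1 the target collapses to the delta distribution, and the forward KL against a delta distribution equals the per-token CE term of Eq (1); ϵ=0 then removes the clipping. The values below are arbitrary toy data.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab, m = 50, 4
logits = torch.randn(m, vocab)                  # stand-in for teacher-forced logits
targets = torch.randint(0, vocab, (m,))

ce = F.cross_entropy(logits, targets, reduction="sum")          # Eq (1)
delta = F.one_hot(targets, num_classes=vocab).float()           # pi_tar with lambda = 1
kl = (delta * (delta.clamp_min(1e-12).log() - F.log_softmax(logits, dim=-1))).sum()
print(torch.allclose(ce, kl))                   # True: the two objectives coincide
```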

Proposition 3.2.

Using Alg 1, the additional computation complexity induced by OVERTONE is 𝒪(|𝒱|)\mathcal{O}(|\mathcal{V}|) when fitting a token, where |𝒱||\mathcal{V}| is the vocabulary size.

Merit 2. OVERTONE provides better updates.

OVERTONE leads to more effective parameter updates, as demonstrated through the lens of the influence function (Koh & Liang, 2017), outlined in the following informal theorem. Due to page limitations, the formal version and corresponding assumptions are deferred to Appendix A.3.

Theorem 3.3 (Informal).

Under regularity conditions, compared to optimizing the vanilla CE loss, OVERTONE provides a more favorable update direction for the parameters and has less influence on unrelated knowledge.

Merit 3. OVERTONE has a close connection to DPO and other constrained optimization approaches.

One might question whether OVERTONE is conceptually superior to constrained optimization approaches, such as fine-tuning only a small set of specific parameters (Dong et al., 2022a; Dai et al., 2021), limiting update magnitudes (Zhu et al., 2020), or employing low-rank updates (Hu et al., 2021). We emphasize that OVERTONE introduces a new objective that can be solved with any optimization method, regardless of whether constraints are imposed. In other words, OVERTONE can be seamlessly combined with existing constrained-optimization-based solutions for KE.

The theorem below draws a connection between OVERTONE and direct preference optimization (DPO), which has shown superior performance in maintaining pretrained knowledge during LLM post-training (Wang et al., 2023a).

Theorem 3.4.

When ϵ=0\epsilon=0, optimizing OVERTONE can be seen as optimizing an unbiased estimate of a DPO objective plus an additional KL penalty.

Compared with explicit DPO, OVERTONE does not require collecting preference data and is therefore more efficient. Furthermore, as highlighted in Rozner et al. (2024), another challenge of applying DPO to KE is that determining win-loss data pairs can be non-trivial. In contrast, OVERTONE sidesteps this challenge by refraining from treating any token as unpreferred, and instead acts at the distribution level.

4 Experiments

We evaluate the proposed OVERTONE paradigm on four performant KE methods applied to two representative large language models over five benchmark datasets. Ablation studies are also conducted to help understand its effectiveness. Results show that OVERTONE improves editing performance by a large margin for all methods.

4.1 Experiment Setup

Base Models. We conduct experiments on two representative LMs, LLaMA 2-7b-Chat (Touvron et al., 2023) and LLaMA 3-8b-Instruct (Dubey et al., 2024), which have been widely studied in the literature (Zhang et al., 2024c; Wang et al., 2024b). From now on, we refer to the two LMs as LLaMA 2 and LLaMA 3 for brevity.

Tasks. Following Wang et al. (2023b); Zhang et al. (2024c), we edit different kinds of knowledge: WikiDatarecent{}_{\text{recent}}, WikiDatacounterfact{}_{\text{counterfact}} (Cohen et al., 2024), WikiBio (Hartvigsen et al., 2024), and ZsRE (Yao et al., 2023). Besides these four popular benchmarks, we also explore the more complex MQuAKE (Zhong et al., 2023; Wang et al., 2024d). Due to page limitations, we refer readers to Zhang et al. (2024c) for more benchmark details. When editing an LLM, we consider two scenarios: (1) Single Editing: one piece of knowledge is edited at a time. (2) Continual Editing: multiple pieces of knowledge are edited sequentially. The latter is more challenging due to forgetting and knowledge conflicts (Hartvigsen et al., 2024; Wang et al., 2024b).

Editing Methods. We apply OVERTONE to four representative KE methods from different families that have achieved state-of-the-art performance (Zhang et al., 2024c; Wang et al., 2024c). FT-M (Zhang et al., 2024c) fine-tunes a specific layer, identified by causal-tracing analysis, in which the knowledge is stored. LoRA (Hu et al., 2021) learns additive low-rank updates to model parameters on the new knowledge. MELO (Yu et al., 2024) and WISE (Wang et al., 2024b) incorporate additional parameter copies to learn new knowledge, along with a gating mechanism that determines whether the original or new knowledge should be used at inference time. Despite incorporating certain explicit or implicit constraints on the learnable parameters, these methods are all trained to minimize the CE loss. For better benchmarking, we also report results from two widely-studied methods, ROME (Meng et al., 2022a) and MEMIT (Meng et al., 2022b). ROME applies a causal-tracing analysis to identify the layer in which the knowledge is stored and then solves an analytic rank-one update, and MEMIT extends ROME by identifying a series of layers to edit and finding the updates as least-squares solutions. To reflect the challenging nature of KE in the data-scarce regime, we focus on KE methods that do not require large-scale, hard-to-access training data or training additional models. No data augmentation was applied during editing.

Evaluation Criteria. We evaluate the performance from four aspects as discussed in Sec 2: reliability (Rel.), generality (Gen.), portability (Por.), and locality (Loc.). Due to page limits we refer readers to Zhang et al. (2024c); Wang et al. (2024b) for their formulations. We report the average of different metrics for more complete comparisons.

Implementation Details. All of our experiments are implemented in EasyEdit (Wang et al., 2024c). More details and hyper-parameters can be found in App B.

4.2 Single Editing Performance

We evaluate the effectiveness of OVERTONE for Single Editing on ZsRE, WikiDatarecent{}_{\text{recent}}, WikiDatacounterfact{}_{\text{counterfact}}, and WikiBio with different KE methods. WISE was tested only on ZsRE, the only benchmark that provides the additional irrelevant data required by WISE at editing time.

Single Editing results are reported in Tab 1. From the table, all KE methods gained significant improvement from the proposed OVERTONE paradigm. Specifically, the four methods were hardly comparable to the ROME and MEMIT baselines under normal training, but were capable of exceeding them when trained with OVERTONE. For instance, without OVERTONE, ROME achieved the highest and the second-highest average performance for editing LLaMA 2 and LLaMA 3, respectively, on Wikirecent{}_{\text{recent}}. However, when equipped with OVERTONE, FT-M, LoRA, and MELO outperformed ROME on both tasks.

We next examine where the improvement comes from. From the table, the first gain is from improved portability. To see this, note that when editing LLaMA 2 on ZsRE, LoRA reached a portability nearly three times that of its base version. Similarly, MELO nearly doubled its portability. More evidence can be found when editing LLaMA 3 as well. In addition, all methods, especially those that initially fell short in maintaining good locality, achieved excellent performance in this regard. As evidence, LoRA achieved a nearly five-fold locality improvement when editing both LLaMA 2 and LLaMA 3 on Wikicounterfact{}_{\text{counterfact}}. We want to highlight that all these improvements were made without compromising editing reliability. That is to say, all four methods achieved better trade-offs between reliability and reasoning (and locality) with the proposed OVERTONE. More importantly, this success was established in a model-agnostic manner, in the sense that OVERTONE is not specialized for any particular KE method studied here. Instead, it offers a highly flexible and generic paradigm that can be combined with existing solutions in a plug-and-play manner.

More Complex Editing Task. To further evaluate how OVERTONE performs on complex benchmarks in the field of KE, we test FT-M and LoRA by editing the two LLMs on MQuAKE-2002, a cleaned version of MQuAKE with knowledge conflicts fixed (Wang et al., 2024d), following Zhong et al. (2023). This task requires the edited LLM to answer one- and two-hop reasoning questions about the edited knowledge. Experimental results are reported in Table 2. As before, OVERTONE achieved better portability without hurting editing performance.

These empirical results echo well with our theoretical analysis, and confirm the superiority of OVERTONE.

Table 1: Single Editing performance. The four KE methods gained improvement from the OVERTONE training paradigm. WISE requires additional irrelevant data for training, which is only available in the ZsRE benchmark.
ZsRE Wikirecent{}_{\text{recent}} Wikicounterfact{}_{\text{counterfact}} WikiBio
LLaMA 2-7b-chat
Rel. Gen. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Loc. Avg.
ROME 96.61 83.91 55.7 96.96 83.3 99.02 54.21 55.91 69.71 97.2 56.85 50.4 68.15 96.41 59.14 77.78
MEMIT 94.22 88.2 57.91 98.28 84.65 97.71 52.93 55.05 68.56 96.38 59.34 45.7 67.14 93.78 56.74 75.26
FT-M 99.75 99.33 54.32 93.01 86.60 100.0 62.93 45.92 69.62 100.0 74.7 54.86 76.52 100.0 90.04 95.02
+ Ours 99.75 96.8 57.08 96.54 87.54 100.0 63.91 60.4 74.77 100.0 73.62 75.34 82.99 100.0 93.46 96.73
LoRA 100.0 100.0 23.34 30.44 63.45 100.0 55.41 28.29 61.23 100.0 71.92 9.99 60.64 100.0 48.84 74.42
+ Ours 100.0 94.31 61.16 87.2 85.67 100.0 63.67 58.72 74.13 100.0 73.96 57.85 77.27 97.68 68.45 83.06
MELO 100.0 96.77 27.11 92.35 79.06 99.13 54.04 40.96 64.71 99.0 71.78 55.83 75.54 99.97 80.77 90.37
+ Ours 100.0 93.31 50.36 97.2 85.22 100.0 60.25 66.48 75.58 99.91 71.81 78.09 83.27 99.68 82.58 91.13
WISE 92.42 70.86 54.57 100.0 79.46 - - - - - - - - - - -
+ Ours 97.55 76.09 54.17 100.0 81.95 - - - - - - - - - - -
LLaMA 3-8b-Instruct
Rel. Gen. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Loc. Avg.
ROME 99.17 97.91 58.12 95.9 87.78 98.84 54.76 49.74 67.78 99.94 58.0 42.94 66.96 92.43 72.63 82.53
MEMIT 96.67 92.46 58.78 98.23 86.53 98.51 53.65 48.45 66.87 99.44 57.81 42.73 66.66 96.26 71.23 83.75
FT-M 100.0 99.75 40.43 79.43 79.90 100.0 57.13 30.01 62.38 100.0 72.62 31.47 68.03 100.0 92.96 96.48
+ Ours 100.0 99.75 48.63 94.78 85.79 100.0 60.88 44.67 68.52 100.0 73.5 58.29 77.26 99.99 94.87 97.43
LoRA 100.0 100.0 26.55 38.85 66.35 100.0 52.99 26.46 59.82 100.0 71.1 9.02 60.04 100.0 59.77 79.88
+ Ours 100.0 98.5 51.57 93.13 85.80 100.0 61.46 56.1 72.52 100.0 72.8 57.54 76.78 98.16 77.24 87.7
MELO 100.0 96.84 39.63 98.8 83.82 100.0 59.07 65.78 74.95 100.0 71.55 87.77 86.44 100.0 98.56 99.28
+ Ours 100.0 95.77 43.08 98.8 84.41 100.0 58.72 69.1 75.94 100.0 70.26 89.81 86.69 99.98 98.56 99.27
WISE 71.67 51.29 49.27 100.0 68.06 - - - - - - - - - - -
+ Ours 82.67 62.34 47.54 100.0 73.14 - - - - - - - - - - -
Table 2: Editing performance on MQuAKE.
LLaMA 2-7b-chat LLaMA 3-8b-Instruct
Rel. 1-Hop. 2-Hop. Avg. Rel. 1-Hop. 2-Hop. Avg.
FT-M 100.0 83.0 30.0 71.0 100.0 82.0 24.0 68.67
+ Ours 99.86 89.0 37.0 75.29 100.0 85.0 30.0 71.67
LoRA 100.0 95.0 39.0 78.0 100.0 98.0 35.0 77.67
+ Ours 99.75 93.0 48.0 80.25 100.0 95.0 40.0 78.33

4.3 Continual Editing Performance

We next study the more challenging scenarios, where massive edits are conducted in a continual (sequential) way. Experiments were again run on the four benchmarks.

Due to the page limit, we defer the complete results to App C and visualize the average of reliability, generality, portability, and locality in Fig 3. Specifically, we evaluate the performance after TT pieces of new knowledge are edited sequentially. Different KE methods are represented in separate colors. Solid boxes indicate normal training performance, and transparent boxes show results from training with OVERTONE. The unfilled area within the boxes quantifies the improvement from OVERTONE.

As in the Single Editing scenarios, OVERTONE again improved the performance of the four KE methods, enabling them to surpass ROME and MEMIT by a large margin across diverse settings. Furthermore, on three out of the four benchmarks (ZsRE, Wikirecent{}_{\text{recent}}, and Wikicounterfact{}_{\text{counterfact}}), the improvements were even more pronounced when the editing sequence was longer (T=10,100T=10,100). Notably, according to our results on ZsRE, LoRA (and FT-M) achieved highly competitive continual editing performance when enhanced with OVERTONE, on par with specialized continual editing methods such as MELO and WISE. In contrast, in previous works (Zhang et al., 2024c; Wang et al., 2024b), vanilla LoRA is generally considered unsuitable for continual editing unless significant adaptations are implemented.

To conclude, these results clearly demonstrate the flexibility and power of OVERTONE in diverse KE scenarios.

(c) LLaMA 2 (ZsRE) (d) LLaMA 2 (Wikirecent{}_{\text{recent}}) (e) LLaMA 2 (Wikicounterfact{}_{\text{counterfact}}) (f) LLaMA 3 (WikiBio)
(g) LLaMA 3 (ZsRE) (h) LLaMA 3 (Wikirecent{}_{\text{recent}}) (i) LLaMA 3 (Wikicounterfact{}_{\text{counterfact}}) (j) LLaMA 3 (WikiBio)
Figure 3: Continual Editing performance under different sequence lengths TT. Solid and transparent bars show performance with and without OVERTONE. The unfilled area marks the performance gap. ROME and MEMIT did not use OVERTONE.

4.4 Ablation Studies

We end this section with an ablation study on OVERTONE to showcase how each component contributes to its final performance. Results from editing LLaMA 2 on ZsRE with LoRA are presented in Tab 3. According to the table, we note the following findings. First, pure token-level smoothing (“w/o clip”) increases both portability and locality, confirming that overfitting due to the CE loss indeed hurts editing performance. Additionally, the way the target distribution is smoothed plays a critical role: using the unedited predicted distribution (“w/o dyn-πflt\pi_{\text{flt}}”) leads to a significant drop due to conflicts arising from outdated internal knowledge. Extra evidence can be seen from “w/o chk-πflt\pi_{\text{flt}}”, where the mixture (Eq (3)) is always applied without checking whether the probability of label yiy_{i} is the largest. Finally, the noise in the predicted distribution πθ\pi_{\theta} also hinders the editing process: without filtering it out (“w/o flt-πflt\pi_{\text{flt}}”), both generality and portability decrease. All empirical results align well with our analysis in Sec 3.

Table 3: Ablation studies on OVERTONE. “w/o clip” sets ϵ=0\epsilon=0, “w/o dyn-πflt\pi_{\text{flt}}” uses the unedited prediction, “w/o chk-πflt\pi_{\text{flt}}” always adopts the mixture in Eq (3), and “w/o flt-πflt\pi_{\text{flt}}” uses the full πθ\pi_{\theta} without filtering out tail (noisy) regions.
LLaMA 2-7b-chat
Rel. Gen. Por. Loc. Avg.
LoRA 100.0 100.0 23.34 30.44 63.45
w/o clip 100.0 99.75 26.6 41.08 66.86
w/o dyn-πflt\pi_{\text{flt}} 99.18 97.67 36.32 51.57 71.18
w/o chk-πflt\pi_{\text{flt}} 95.35 86.51 57.92 90.08 82.47
w/o flt-πflt\pi_{\text{flt}} 100.0 83.93 58.2 90.36 83.12
+ Ours 100.0 94.31 61.16 87.2 85.67

5 Related Works

Existing KE methods mainly fall into two classes.

Internal Storage methods update model parameters for the adaptation. Early studies fine-tuned an LLM directly but suffered from a severe forgetting problem (Wang et al., 2023b). For more precise editing, Zhu et al. (2020) imposed a relaxed 2\ell_{2} norm constraint on parameter updates, and Dong et al. (2022a); Huang et al. (2023) limited the updates to specific feed-forward network (FFN) layer(s), based on findings that knowledge is often stored therein (Dai et al., 2021). For further refinement, the locate-and-edit paradigm (Meng et al., 2022a; b) first identifies the layer storing the knowledge and then modifies its parameters in an analytic form or through a least-squares solution. On the other hand, PEFT methods such as LoRA (Hu et al., 2021) also provide performance on par with locating-based solutions (Wu et al., 2023; Zhang et al., 2024c). In general, these works primarily focus on identifying a small set of parameters most relevant to the new knowledge. However, these approaches are typically trained with an instance-level loss, overlooking token-level differences. Therefore, they remain susceptible to HTO in a similar manner. This work addresses HTO, an orthogonal aspect of the KE process, and complements existing studies in a model-agnostic manner. Our OVERTONE is established without assumptions about which parameters are updated, allowing it to be seamlessly integrated with existing methods without compromising their selective nature. We validate our approach by showing that OVERTONE enhances the performance of two representative internal storage methods across diverse scenarios.

External Storage methods resort to external memories without updating the original parameters. This category includes meta-learning-based MEND (Mitchell et al., 2021) and its multi-task variant InstructEdit (Zhang et al., 2024b), in-context learning-based IKE (Zheng et al., 2023), retrieval-based LTE (Jiang et al., 2024), augmentation-based StableKE (Wei et al., 2024), and proxy model-based SERAC (Mitchell et al., 2022). Notwithstanding, these methods often require large-scale, hard-to-access datasets for retrieval (e.g., IKE, LTE) or for training auxiliary models (e.g., MEND, InstructEdit, SERAC). As a result, their practicality is limited, and they struggle with Continual Editing, which requires frequent updates (Wang et al., 2024b). Recently, specialized methods for Continual Editing have been proposed. These approaches introduce adapters (GRACE (Hartvigsen et al., 2024)), LoRAs (MELO (Yu et al., 2024)), or weight copies (WISE (Wang et al., 2024b)) to memorize new knowledge, and learn gating mechanisms to determine whether to use the original or new knowledge. The gating mechanisms are often learned through additional representation-distance-based codebooks (Yu et al., 2024) or distinct margin losses (Wang et al., 2024b), making external storage methods more complex. However, like internal storage methods, they optimize editing parameters using instance-level loss functions, ignoring token-level differences. Consequently, they may also suffer from HTO and can benefit from our OVERTONE framework. Experiments with two external storage methods demonstrate that our solution can be straightforwardly incorporated into more complex KE methods, highlighting the flexibility and versatility of OVERTONE.

6 Conclusion and Limitations

We study HTO, a token-dependent overfitting phenomenon in KE, and show how it degrades an edited LLM’s reasoning ability. Inspired by an in-depth analysis of its cause, we propose OVERTONE, which adaptively assigns each token a unique smoothed distribution for better control to mitigate HTO. Our solution enjoys several theoretical advantages and achieves superior performance on diverse tasks. Encouraged by these promising results, we plan to generalize our method to broader KE methods that involve more specialized losses.

References

  • Bengio et al. (2000) Yoshua Bengio, Réjean Ducharme, and Pascal Vincent. A neural probabilistic language model. Advances in neural information processing systems, 13, 2000.
  • Bishop & Nasrabadi (2006) Christopher M Bishop and Nasser M Nasrabadi. Pattern recognition and machine learning, volume 4. Springer, 2006.
  • Bommasani et al. (2021) Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Bubeck et al. (2023) Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • Cohen et al. (2024) Roi Cohen, Eden Biran, Ori Yoran, Amir Globerson, and Mor Geva. Evaluating the ripple effects of knowledge editing in language models. Transactions of the Association for Computational Linguistics, 12:283–298, 2024.
  • Cover (1999) Thomas M Cover. Elements of information theory. John Wiley & Sons, 1999.
  • Dai et al. (2021) Damai Dai, Li Dong, Yaru Hao, Zhifang Sui, Baobao Chang, and Furu Wei. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696, 2021.
  • De Cao et al. (2021) Nicola De Cao, Wilker Aziz, and Ivan Titov. Editing factual knowledge in language models. arXiv preprint arXiv:2104.08164, 2021.
  • Dong et al. (2022a) Qingxiu Dong, Damai Dai, Yifan Song, Jingjing Xu, Zhifang Sui, and Lei Li. Calibrating factual knowledge in pretrained language models. arXiv preprint arXiv:2210.03329, 2022a.
  • Dong et al. (2022b) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022b.
  • Dubey et al. (2024) Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783.
  • Hartvigsen et al. (2022) Thomas Hartvigsen, Saadia Gabriel, Hamid Palangi, Maarten Sap, Dipankar Ray, and Ece Kamar. Toxigen: A large-scale machine-generated dataset for adversarial and implicit hate speech detection. arXiv preprint arXiv:2203.09509, 2022.
  • Hartvigsen et al. (2024) Tom Hartvigsen, Swami Sankaranarayanan, Hamid Palangi, Yoon Kim, and Marzyeh Ghassemi. Aging with grace: Lifelong model editing with discrete key-value adaptors. Advances in Neural Information Processing Systems, 36, 2024.
  • Hewitt et al. (2022) John Hewitt, Christopher D Manning, and Percy Liang. Truncation sampling as language model desmoothing. arXiv preprint arXiv:2210.15191, 2022.
  • Hochreiter (1997) S Hochreiter. Long short-term memory. Neural Computation MIT-Press, 1997.
  • Hu et al. (2021) Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021.
  • Huang et al. (2023) Zeyu Huang, Yikang Shen, Xiaofeng Zhang, Jie Zhou, Wenge Rong, and Zhang Xiong. Transformer-patcher: One mistake worth one neuron. arXiv preprint arXiv:2301.09785, 2023.
  • Ji et al. (2023) Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Ye Jin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38, 2023.
  • Jiang et al. (2024) Yuxin Jiang, Yufei Wang, Chuhan Wu, Wanjun Zhong, Xingshan Zeng, Jiahui Gao, Liangyou Li, Xin Jiang, Lifeng Shang, Ruiming Tang, et al. Learning to edit: Aligning llms with knowledge editing. arXiv preprint arXiv:2402.11905, 2024.
  • Koh & Liang (2017) Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International conference on machine learning, pp.  1885–1894. PMLR, 2017.
  • Kojima et al. (2022) Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. Advances in neural information processing systems, 35:22199–22213, 2022.
  • Lamb et al. (2016) Alex Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks, 2016. URL https://arxiv.org/abs/1610.09038.
  • Lee et al. (2022) Dongkyu Lee, Ka Chun Cheung, and Nevin L Zhang. Adaptive label smoothing with self-knowledge in natural language generation. arXiv preprint arXiv:2210.13459, 2022.
  • Lewis et al. (2019) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019.
  • Meng et al. (2022a) Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372, 2022a.
  • Meng et al. (2022b) Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2022b.
  • Mitchell et al. (2021) Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D Manning. Fast model editing at scale. arXiv preprint arXiv:2110.11309, 2021.
  • Mitchell et al. (2022) Eric Mitchell, Charles Lin, Antoine Bosselut, Christopher D Manning, and Chelsea Finn. Memory-based model editing at scale. In International Conference on Machine Learning, pp.  15817–15831, 2022.
  • Müller et al. (2019) Rafael Müller, Simon Kornblith, and Geoffrey E Hinton. When does label smoothing help? Advances in neural information processing systems, 32, 2019.
  • Ouyang et al. (2022) Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.
  • Rafailov et al. (2024) Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2024.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  • Rozner et al. (2024) Amit Rozner, Barak Battash, Lior Wolf, and Ofir Lindenbaum. Knowledge editing in language models via adapted direct preference optimization. arXiv preprint arXiv:2406.09920, 2024.
  • See et al. (2019) Abigail See, Aneesh Pappu, Rohun Saxena, Akhila Yerukola, and Christopher D Manning. Do massively pretrained language models make better storytellers? arXiv preprint arXiv:1909.10705, 2019.
  • Sutskever (2014) I Sutskever. Sequence to sequence learning with neural networks. arXiv preprint arXiv:1409.3215, 2014.
  • Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  2818–2826, 2016.
  • Tang et al. (2024) Chenxia Tang, Jianchun Liu, Hongli Xu, and Liusheng Huang. Top-nσn\sigma: Not all logits are you need. arXiv preprint arXiv:2411.07641, 2024.
  • Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • Wang et al. (2024a) Chaoqi Wang, Yibo Jiang, Chenghao Yang, Han Liu, and Yuxin Chen. Beyond reverse KL: Generalizing direct preference optimization with diverse divergence constraints. In The Twelfth International Conference on Learning Representations, 2024a.
  • Wang et al. (2023a) Peiyi Wang, Lei Li, Liang Chen, Feifan Song, Binghuai Lin, Yunbo Cao, Tianyu Liu, and Zhifang Sui. Making large language models better reasoners with alignment. arXiv preprint arXiv:2309.02144, 2023a.
  • Wang et al. (2024b) Peng Wang, Zexi Li, Ningyu Zhang, Ziwen Xu, Yunzhi Yao, Yong Jiang, Pengjun Xie, Fei Huang, and Huajun Chen. Wise: Rethinking the knowledge memory for lifelong model editing of large language models. arXiv preprint arXiv:2405.14768, 2024b.
  • Wang et al. (2024c) Peng Wang, Ningyu Zhang, Bozhong Tian, Zekun Xi, Yunzhi Yao, Ziwen Xu, Mengru Wang, Shengyu Mao, Xiaohan Wang, Siyuan Cheng, Kangwei Liu, Yuansheng Ni, Guozhou Zheng, and Huajun Chen. Easyedit: An easy-to-use knowledge editing framework for large language models, 2024c. URL https://arxiv.org/abs/2308.07269.
  • Wang et al. (2023b) Song Wang, Yaochen Zhu, Haochen Liu, Zaiyi Zheng, Chen Chen, et al. Knowledge editing for large language models: A survey. arXiv preprint arXiv:2310.16218, 2023b.
  • Wang et al. (2024d) Yiwei Wang, Muhao Chen, Nanyun Peng, and Kai-Wei Chang. Deepedit: Knowledge editing as decoding with constraints. arXiv preprint arXiv:2401.10471, 2024d.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
  • Wei et al. (2024) Zihao Wei, Liang Pang, Hanxing Ding, Jingcheng Deng, Huawei Shen, and Xueqi Cheng. Stable knowledge editing in large language models. arXiv preprint arXiv:2402.13048, 2024.
  • Wu et al. (2023) Suhang Wu, Minlong Peng, Yue Chen, Jinsong Su, and Mingming Sun. Eva-kellm: A new benchmark for evaluating knowledge editing of llms. arXiv preprint arXiv:2308.09954, 2023.
  • Yao et al. (2007) Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.
  • Yao et al. (2023) Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. Editing large language models: Problems, methods, and opportunities. arXiv preprint arXiv:2305.13172, 2023.
  • Yu et al. (2024) Lang Yu, Qin Chen, Jie Zhou, and Liang He. Melo: Enhancing model editing with neuron-indexed dynamic lora. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.  19449–19457, 2024.
  • Zhang et al. (2024a) Mengqi Zhang, Xiaotian Ye, Qiang Liu, Pengjie Ren, Shu Wu, and Zhumin Chen. Uncovering overfitting in large language model editing. arXiv preprint arXiv:2410.07819, 2024a.
  • Zhang et al. (2024b) Ningyu Zhang, Bozhong Tian, Siyuan Cheng, Xiaozhuan Liang, Yi Hu, Kouying Xue, Yanjie Gou, Xi Chen, and Huajun Chen. Instructedit: Instruction-based knowledge editing for large language models. arXiv preprint arXiv:2402.16123, 2024b.
  • Zhang et al. (2024c) Ningyu Zhang, Yunzhi Yao, Bozhong Tian, Peng Wang, Shumin Deng, Mengru Wang, Zekun Xi, Shengyu Mao, Jintian Zhang, Yuansheng Ni, et al. A comprehensive study of knowledge editing for large language models. arXiv preprint arXiv:2401.01286, 2024c.
  • Zhang et al. (2024d) Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. Negative preference optimization: From catastrophic collapse to effective unlearning. arXiv preprint arXiv:2404.05868, 2024d.
  • Zhang & Sabuncu (2020) Zhilu Zhang and Mert Sabuncu. Self-distillation as instance-specific label smoothing. Advances in Neural Information Processing Systems, 33:2184–2195, 2020.
  • Zheng et al. (2023) Ce Zheng, Lei Li, Qingxiu Dong, Yuxuan Fan, Zhiyong Wu, Jingjing Xu, and Baobao Chang. Can we edit factual knowledge by in-context learning? arXiv preprint arXiv:2305.12740, 2023.
  • Zhong et al. (2023) Zexuan Zhong, Zhengxuan Wu, Christopher D Manning, Christopher Potts, and Danqi Chen. Mquake: Assessing knowledge editing in language models via multi-hop questions. arXiv preprint arXiv:2305.14795, 2023.
  • Zhou et al. (2023) Ce Zhou, Qian Li, Chen Li, Jun Yu, Yixin Liu, Guangjing Wang, Kai Zhang, Cheng Ji, Qiben Yan, Lifang He, et al. A comprehensive survey on pretrained foundation models: A history from bert to chatgpt. arXiv preprint arXiv:2302.09419, 2023.
  • Zhu et al. (2020) Chen Zhu, Ankit Singh Rawat, Manzil Zaheer, Srinadh Bhojanapalli, Daliang Li, Felix Yu, and Sanjiv Kumar. Modifying memories in transformer models. arXiv preprint arXiv:2012.00363, 2020.

Appendix A Omitted Theorems and Proofs

In this section we present the full theoretical analysis. All theorems are (re)stated in a formal manner for the convenience of reading.

A.1 Notations

For completeness we highlight important notations that will be used. Throughout this paper, we use CE[]\text{CE}[\cdot\|\cdot] and DKL[]\text{D}_{\text{KL}}[\cdot\|\cdot] to compute cross-entropy and Kullback–Leibler divergence between two distributions respectively. Specifically, given two discrete distributions p,qp,q, CE[pq]=ipilogqi\text{CE}[p\|q]=\sum_{i}-p_{i}\log q_{i}, and DKL[pq]=ipilogqipi\text{D}_{\text{KL}}[p\|q]=\sum_{i}-p_{i}\log\frac{q_{i}}{p_{i}}. In addition, 𝟏()\boldsymbol{1}(\cdot) is the indicator function such that 𝟏(a)=1\boldsymbol{1}(a)=1 if event aa holds and 0 otherwise. For apa\in\mathbb{R}^{p}, define the l2l_{2} norm as a2=i=1pai2\|a\|_{2}=\sqrt{\sum_{i=1}^{p}a_{i}^{2}}. For a,bpa,b\in\mathbb{R}^{p}, define the inner product as a,b=ab\langle a,b\rangle=a^{\top}b. Define the cosine similarity cos(a,b)=a,ba2b2\cos(a,b)=\frac{\langle a,b\rangle}{\|a\|_{2}\|b\|_{2}}.

A.2 OVERTONE is universal and efficient

The first merit of OVERTONE, as stated in the main body, lies in its universality and efficiency.

Proposition A.1.

OVERTONE loss generalizes CE loss and reduces to the latter when ϵ=0,λ=1\epsilon=0,\lambda=1.

Proposition A.2.

Using Alg 1, the additional computation complexity induced by OVERTONE is 𝒪(|𝒱|)\mathcal{O}(|\mathcal{V}|) when fitting a token, where |𝒱||\mathcal{V}| is the vocabulary size.

Our proofs rely on the following lemma, which plays a key role in connecting OVERTONE to a regularized loss.

Lemma A.3.

Given $y_{i}$, for an arbitrary token $y$ and context $\boldsymbol{c}$, and $\pi_{\text{tar}}(y)=\lambda\delta_{y_{i}}(y)+(1-\lambda)\pi_{\text{flt}}(y)$, we have

$$\text{CE}[\pi_{\text{tar}}(y\mid\boldsymbol{c})\|\pi_{\theta}(y\mid\boldsymbol{c})]=\lambda\,\text{CE}[\delta_{y_{i}}(y)\|\pi_{\theta}(y\mid\boldsymbol{c})]+(1-\lambda)\,\text{CE}[\pi_{\text{flt}}(y\mid\boldsymbol{c})\|\pi_{\theta}(y\mid\boldsymbol{c})].\quad(4)$$
Proof.

The proof is based on the definition of cross entropy (Cover, 1999).

$$\begin{aligned}
\text{CE}[\pi_{\text{tar}}(y\mid\boldsymbol{c})\|\pi_{\theta}(y\mid\boldsymbol{c})]
&=-\sum_{y\in\mathcal{V}}\pi_{\text{tar}}(y\mid\boldsymbol{c})\log\pi_{\theta}(y\mid\boldsymbol{c})\\
&=-\sum_{y\in\mathcal{V}}\left(\lambda\delta_{y_{i}}(y)+(1-\lambda)\pi_{\text{flt}}(y\mid\boldsymbol{c})\right)\log\pi_{\theta}(y\mid\boldsymbol{c})\\
&=-\left(\lambda\sum_{y\in\mathcal{V}}\delta_{y_{i}}(y)\log\pi_{\theta}(y\mid\boldsymbol{c})+(1-\lambda)\sum_{y\in\mathcal{V}}\pi_{\text{flt}}(y\mid\boldsymbol{c})\log\pi_{\theta}(y\mid\boldsymbol{c})\right)\\
&=\lambda\,\text{CE}[\delta_{y_{i}}(y)\|\pi_{\theta}(y\mid\boldsymbol{c})]+(1-\lambda)\,\text{CE}[\pi_{\text{flt}}(y\mid\boldsymbol{c})\|\pi_{\theta}(y\mid\boldsymbol{c})].\quad(5)
\end{aligned}$$

This completes our proof. ∎

We are ready to prove Prop 3.1.

Proof.

The proof is based on the fact that the OVERTONE objective minimizes a forward KL divergence, which is equivalent, up to a constant, to minimizing a cross-entropy (Cover, 1999; Bishop & Nasrabadi, 2006). Namely,

$$\begin{aligned}
\ell_{\text{OVERTONE}}(\theta)
&\triangleq\sum_{j=1}^{m}\max\left(\text{D}_{\text{KL}}[\pi_{\text{tar}}(y\mid\boldsymbol{c}_{j})\|\pi_{\theta}(y\mid\boldsymbol{c}_{j})],\,\epsilon\right)\\
&=\sum_{j=1}^{m}\text{D}_{\text{KL}}[\pi_{\text{tar}}(y\mid\boldsymbol{c}_{j})\|\pi_{\theta}(y\mid\boldsymbol{c}_{j})]\,\boldsymbol{1}\left(\text{D}_{\text{KL}}[\pi_{\text{tar}}(y\mid\boldsymbol{c}_{j})\|\pi_{\theta}(y\mid\boldsymbol{c}_{j})]>\epsilon\right)\\
&\overset{(a)}{=}\sum_{j=1}^{m}\left(\text{CE}[\pi_{\text{tar}}^{(j)}\|\pi_{\theta}^{(j)}]-H(\pi_{\text{tar}}^{(j)})\right)\boldsymbol{1}\left(\text{D}_{\text{KL}}[\pi_{\text{tar}}^{(j)}\|\pi_{\theta}^{(j)}]>\epsilon\right)\\
&=\sum_{j=1}^{m}\text{CE}[\pi_{\text{tar}}^{(j)}\|\pi_{\theta}^{(j)}]\,\boldsymbol{1}\left(\text{D}_{\text{KL}}[\pi_{\text{tar}}^{(j)}\|\pi_{\theta}^{(j)}]>\epsilon\right)+C.\quad(6)
\end{aligned}$$

From step $(a)$ onward we write $\pi_{\text{tar}}^{(j)}=\pi_{\text{tar}}(y\mid\boldsymbol{c}_{j})$ and define $\pi_{\theta}^{(j)}$ analogously for brevity, and $C$ collects terms that are constant with respect to the learnable parameter $\theta$. Setting $\epsilon=0$ removes the indicator terms; further plugging in Eq (5), we see that setting $\lambda=1$ recovers the standard CE loss. This completes the proof.

As for Prop 3.2, the computational overhead can be seen by inspecting Alg 1.

Proof.

The additional computational cost of OVERTONE comes from lines 8-10 of Alg 1. These steps find the maximal logit, prune small logits, and compute probabilities by applying a softmax to the pruned logits. Each step runs in time linear in $|\mathcal{V}|$. This completes our proof.
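To make the two propositions concrete, below is a minimal per-token sketch of the OVERTONE loss, written under our own assumptions about the filtering step: logits falling more than $n$ standard deviations below the maximum are pruned, the surviving logits are renormalized with a softmax to form $\pi_{\text{flt}}$, the target is the mixture $\pi_{\text{tar}}=\lambda\delta_{y_{i}}+(1-\lambda)\pi_{\text{flt}}$, and the KL term is thresholded at $\epsilon$. The function name and signature are illustrative and do not correspond to the released implementation.

    import numpy as np

    def softmax(logits):
        z = logits - logits.max()
        e = np.exp(z)
        return e / e.sum()

    def overtone_token_loss(logits, target_id, n=0.5, lam=0.1, eps=0.05):
        # Illustrative per-token OVERTONE loss (a sketch, not the official code).
        vocab_size = logits.shape[0]

        # Sketch of lines 8-10 of Alg 1: find the maximal logit, prune logits that
        # fall more than n standard deviations below it, and renormalize via softmax.
        keep = logits >= logits.max() - n * logits.std()
        pruned = np.where(keep, logits, -np.inf)
        pi_flt = softmax(pruned)                     # filtered (denoised) distribution

        # Mixed target pi_tar = lam * one_hot(y_i) + (1 - lam) * pi_flt.
        one_hot = np.zeros(vocab_size)
        one_hot[target_id] = 1.0
        pi_tar = lam * one_hot + (1.0 - lam) * pi_flt

        # Forward KL between target and model distributions, thresholded at eps.
        pi_theta = softmax(logits)
        support = pi_tar > 0                         # skip zero-weight terms
        kl = np.sum(pi_tar[support] * np.log(pi_tar[support] / pi_theta[support]))
        return max(kl, eps)

Each added step touches every vocabulary entry once, matching the $\mathcal{O}(|\mathcal{V}|)$ overhead of Prop A.2, and with lam=1 and eps=0 the returned value is exactly $-\log\pi_{\theta}(y_{i}\mid\boldsymbol{c}_{i})$, matching Prop A.1.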

A.3 OVERTONE provides better updates

We present the formal analysis of how OVERTONE provides better parameter updates, as outlined in Thm 3.3. Our analysis is carried out in the same spirit as the influence function (Koh & Liang, 2017).

We first restate Thm 3.3, which outlines the two aspects in which OVERTONE improves over training with the standard CE loss.

Theorem A.4 (Informal).

Under regularity conditions, compared to optimizing the vanilla CE loss, OVERTONE provides a more favorable update direction for the parameters and has less influence on unrelated knowledge.

The formal statement is as follows.

Theorem A.5 (Formal).

Let GG be the ideal gradient of retraining the LLM using θ^old\hat{\theta}^{\text{old}} as the initial value, as defined in Eq (8). Considering the simplified case where ϵ=0\epsilon=0 in Eq (6), under Assumptions A.6 and A.7, there exists some λ[0,1]\lambda\in[0,1] such that

cos(θCE(znew;θ^old),G)<cos(θOVERTONE(znew;θ^old),G).\cos\quantity(\nabla_{\theta}\ell_{\text{CE}}(z^{\text{new}};\hat{\theta}^{\text{old}}),G)<\cos\quantity(\nabla_{\theta}\ell_{\text{OVERTONE}}(z^{\text{new}};\hat{\theta}^{\text{old}}),G).

In other words, using the OVERTONE loss provides a better approximation of the direction of GG compared to the standard CE loss, meaning the gradient direction is closer to GG.

Now, denote the new estimator obtained through either CE\ell_{\text{CE}} or OVERTONE\ell_{\text{OVERTONE}} by θ^CEnew\hat{\theta}^{\text{new}}_{\text{CE}} or θ^OVERTONEnew\hat{\theta}^{\text{new}}_{\text{OVERTONE}}, respectively. Let Zun=(Xun,Yun)Z^{\text{un}}=(X^{\text{un}},Y^{\text{un}}) be a random vector representing unrelated data. Under Assumptions A.11 and A.13, we have

𝔼Zun[|πθ^OVERTONEnew(Zun)πθ^old(Zun)|]<𝔼Zun[|πθ^CEnew(Zun)πθ^old(Zun)|].\mathbb{E}_{Z^{\text{un}}}\quantity[\quantity|\pi_{\hat{\theta}^{\text{new}}_{\text{OVERTONE}}}(Z^{\text{un}})-\pi_{\hat{\theta}^{\text{old}}}(Z^{\text{un}})|]<\mathbb{E}_{Z^{\text{un}}}\quantity[\quantity|\pi_{\hat{\theta}^{\text{new}}_{\text{CE}}}(Z^{\text{un}})-\pi_{\hat{\theta}^{\text{old}}}(Z^{\text{un}})|].

This result indicates that updates based on the OVERTONE loss induce smaller deviations in the predicted distribution for unrelated data compared to updates based on the standard CE loss, thereby better preserving locality.

Theorem A.5 consists of two parts: Theorem A.10 and Theorem A.15. Theorem A.10 states that our method provides a more effective direction for parameter updates, while Theorem A.15 asserts that our method results in a smaller perturbation on unrelated knowledge. The assumptions and proofs will be presented in Sections A.3.1 and A.3.2, respectively.

A.3.1 Our method gives a better direction of parameter updates

Without loss of generality, suppose that an LLM is pretrained on some large textual corpus $\{z_{n}\}_{n=1}^{N}$, with each training sample $z_{n}=(\boldsymbol{x}_{n},\boldsymbol{y}_{n})$ where $\boldsymbol{y}_{n}=(y_{1},\cdots,y_{m_{n}})$. KE involves updating some knowledge carried by $z^{\text{old}}=(\boldsymbol{x},\boldsymbol{y}^{\text{old}})$ to the new $z^{\text{new}}=(\boldsymbol{x},\boldsymbol{y}^{\text{new}})$. Let $\hat{\theta}^{\text{old}}$ denote the pre-trained LLM parameters. Given this piece of new knowledge, the ideal LLM would have parameters $\hat{\theta}^{\text{new}}$ obtained from a full retraining by solving

minθ1Nn=1NCE(zn;θ)1NCE(zold;θ)+1NCE(znew;θ),\displaystyle\min\nolimits_{\theta}\frac{1}{N}\sum_{n=1}^{N}\ell_{\text{CE}}(z_{n};\theta)-\frac{1}{N}\ell_{\text{CE}}(z^{\text{old}};\theta)+\frac{1}{N}\ell_{\text{CE}}(z^{\text{new}};\theta), (7)

where CE\ell_{\text{CE}} denotes the standard CE loss. In general, we define δ(θ)\ell_{\delta}(\theta) as

$$\ell_{\delta}(\theta)=\sum_{n=1}^{N}\ell_{\text{CE}}(z_{n};\theta)+\delta\left(\ell_{\text{CE}}(z^{\text{new}};\theta)-\ell_{\text{CE}}(z^{\text{old}};\theta)\right).$$

Moreover define

θ^δ=argminθδ(θ).\hat{\theta}_{\delta}=\arg\min_{\theta}\ell_{\delta}(\theta).

So we find that θ^0=θ^old\hat{\theta}_{0}=\hat{\theta}^{\text{old}} and θ^1N=θ^new\hat{\theta}_{\frac{1}{N}}=\hat{\theta}^{\text{new}}. Starting from θ^old\hat{\theta}^{\text{old}}, when we perform gradient descent by using loss 1N(θ)\ell_{\frac{1}{N}}(\theta) to retrain the model, the gradient will be

Gθ1N(θ^old).G\triangleq\nabla_{\theta}\ell_{\frac{1}{N}}(\hat{\theta}^{\text{old}}). (8)

We thus take $G$ as the ideal update direction, i.e., the gradient-descent direction at $\hat{\theta}^{\text{old}}$ that a full retraining of the LLM would follow.

We make the following assumption on $\hat{\theta}^{\text{old}}$, namely that it is a local minimizer of $\ell_{0}(\theta)$.

Assumption A.6.

The pretrained LLM has converged, namely, $\nabla_{\theta}\ell_{0}(\hat{\theta}^{\text{old}})=0$.

For brevity, denote

a\displaystyle a =θCE(znew;θ^old),\displaystyle=\nabla_{\theta}\ell_{\text{CE}}(z^{\text{new}};\hat{\theta}^{\text{old}}), (9)
b\displaystyle b =θCE(zold;θ^old),\displaystyle=-\nabla_{\theta}\ell_{\text{CE}}(z^{\text{old}};\hat{\theta}^{\text{old}}),
c\displaystyle c =i=1mθCE[πflt(y𝒄inew)πθ(y𝒄inew)]|θ=θ^old.\displaystyle=\sum_{i=1}^{m}\nabla_{\theta}\text{CE}\quantity[\pi_{\text{flt}}(y\mid{\boldsymbol{c}}_{i}^{\text{new}})\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{i}^{\text{new}})]\evaluated{}_{\theta=\hat{\theta}^{\text{old}}}.
Assumption A.7.

cos(b,c)\cos(b,c) satisfies

cos(b,c)>1b228a+b22(1cos(a,a+b))2.\cos(b,c)>1-\frac{\norm{b}_{2}^{2}}{8\norm{a+b}_{2}^{2}}\quantity(1-\cos(a,a+b))^{2}. (10)
Remark A.8 (Interpretation of the Assumption A.7).

Assumption A.7 ensures that the directions of $b$ and $c$ are not far apart. Roughly speaking, treating $\frac{\|b\|_{2}^{2}}{8\|a+b\|_{2}^{2}}$ as a constant, the condition says that $1-\cos(b,c)\lesssim(1-\cos(a,a+b))^{2}$, i.e., the directions of $b$ and $c$ are closer to each other than $a$ is to $a+b$. Looking more carefully, note that $a$ represents $\nabla_{\theta}\ell_{\text{CE}}(z^{\text{new}};\hat{\theta}^{\text{old}})$ and $a+b$ represents the ideal direction $G$. Because the ideal direction also contains the old-knowledge gradient $b$, directly fine-tuning with $\ell_{\text{CE}}$ (i.e., the baseline method) deviates from $G$; this directional deviation is measured by $\cos(a,a+b)$. Let $S^{(i)}$ denote the collection of unfiltered tokens in $\pi_{\text{flt}}(y\mid\boldsymbol{c}_{i}^{\text{new}})$,

b\displaystyle b =θCE(zold;θ^old)=i=1mθlogπθ(yiold𝒄iold)|θ=θ^old,\displaystyle=-\nabla_{\theta}\ell_{\text{CE}}(z^{\text{old}};\hat{\theta}^{\text{old}})=\sum_{i=1}^{m}\nabla_{\theta}\log\pi_{\theta}(y_{i}^{\text{old}}\mid{\boldsymbol{c}}_{i}^{\text{old}})\evaluated{}_{\theta=\hat{\theta}^{\text{old}}}, (11)
c\displaystyle c =i=1mθCE[πflt(y𝒄jnew)πθ(y𝒄jnew)]|θ=θ^old=i=1myS(i)πflt(y𝒄inew)θlogπθ(y𝒄inew)|θ=θ^old.\displaystyle=\sum_{i=1}^{m}\nabla_{\theta}\text{CE}\quantity[\pi_{\text{flt}}(y\mid{\boldsymbol{c}}_{j}^{\text{new}})\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{j}^{\text{new}})]\evaluated{}_{\theta=\hat{\theta}^{\text{old}}}=-\sum_{i=1}^{m}\sum_{y\in S^{(i)}}\pi_{\text{flt}}(y\mid{\boldsymbol{c}}_{i}^{\text{new}})\nabla_{\theta}\log\pi_{\theta}(y\mid{\boldsymbol{c}}_{i}^{\text{new}})\evaluated{}_{\theta=\hat{\theta}^{\text{old}}}. (12)

Given the new context $\boldsymbol{c}_{i}^{\text{new}}$, a token $y\in S^{(i)}$ is likely to be close to $y_{i}^{\text{old}}$. Compared to the scenario where the old context $\boldsymbol{c}_{i}^{\text{old}}$ is given, the gradients $\nabla_{\theta}\log\pi_{\theta}(y_{i}^{\text{old}}\mid\boldsymbol{c}_{i}^{\text{old}})\big|_{\theta=\hat{\theta}^{\text{old}}}$ and $\nabla_{\theta}\log\pi_{\theta}(y\mid\boldsymbol{c}_{i}^{\text{new}})\big|_{\theta=\hat{\theta}^{\text{old}}}$ tend to point in opposite directions: both are evaluated at $y^{\text{old}}$ or a point close to it, but the first is conditioned on $\boldsymbol{c}_{i}^{\text{old}}$ while the second is conditioned on $\boldsymbol{c}_{i}^{\text{new}}$. Given the sign flip in the definitions of $b$ and $c$ in Eq (11) and Eq (12), this implies that $b$ and $c$ are aligned in the same direction. To ensure that a closer direction can be found, we require $b$ and $c$ to be approximately as close as $a$ and $a+b$. Our goal is to align with the negative gradient direction of the old knowledge; this ensures that, when the information in $c$ is used to reweight our method, we can identify a direction that closely approximates the ideal optimization direction.

Remark A.9.

To elaborate further, we take logistic regression as an example for illustration.

Considering only the $k$-th token, for a training point $z_{k}=(c_{k},y_{k})$, let $p(y_{k}\mid c_{k})=\sigma(y_{k}\theta^{\top}c_{k})$, where $y_{k}\in\{-1,1\}$ and $\sigma(t)=\frac{1}{1+\exp(-t)}$ is the sigmoid function. The gradient of the log-probability with respect to $\theta$ is given by:

θlogp(zk,θ)=σ(ykθck)ykck.\nabla_{\theta}\log p(z_{k},\theta)=\sigma(-y_{k}\theta^{\top}c_{k})y_{k}c_{k}.

Then, we find that:

b=σ(ykoldθckold)ykoldckold,b=\sigma(-y_{k}^{\text{old}}\theta^{\top}c_{k}^{\text{old}})y_{k}^{\text{old}}c_{k}^{\text{old}},
c=ykS(i)pykσ(ykθcknew)ykcknew=poldσ(ykoldθcknew)ykoldcknewpnewσ(yknewθcknew)yknewcknew.c=-\sum_{y_{k}\in S^{(i)}}p_{y_{k}}\sigma(-y_{k}\theta^{\top}c_{k}^{\text{new}})y_{k}c_{k}^{\text{new}}=-p_{\text{old}}\sigma(-y_{k}^{\text{old}}\theta^{\top}c_{k}^{\text{new}})y_{k}^{\text{old}}c_{k}^{\text{new}}-p_{\text{new}}\sigma(-y_{k}^{\text{new}}\theta^{\top}c_{k}^{\text{new}})y_{k}^{\text{new}}c_{k}^{\text{new}}.

This follows from the fact that yk{1,1}y_{k}\in\{-1,1\}. Note that cknewc_{k}^{\text{new}} and ckoldc_{k}^{\text{old}} may be far apart, and poldp_{\text{old}} is likely to be large since πflt\pi_{\text{flt}} is a denoised version of πθ\pi_{\theta}, meaning it contains less noise (Tang et al., 2024). As a result, the directions of bb and cc will be close.

Theorem A.10.

Let GG be the ideal gradient of retraining the LLM using θ^old\hat{\theta}^{\text{old}} as the initial value, as defined in Eq (8). Considering the simplified case where ϵ=0\epsilon=0 in Eq (6), under Assumptions A.6 and A.7, there exists some λ[0,1]\lambda\in[0,1] such that

cos(θCE(znew;θ^old),G)<cos(θOVERTONE(znew;θ^old),G).\cos\quantity(\nabla_{\theta}\ell_{\text{CE}}(z^{\text{new}};\hat{\theta}^{\text{old}}),G)<\cos\quantity(\nabla_{\theta}\ell_{\text{OVERTONE}}(z^{\text{new}};\hat{\theta}^{\text{old}}),G).

In other words, using the OVERTONE loss provides a better approximation of the direction of GG compared to the standard CE loss, in the sense that OVERTONE gradient direction is closer to GG.

Proof.

First, by definition, the optimal gradient direction GG when using θold\theta^{\text{old}} as the initial value is given by

G\displaystyle G =θ1N(θ^old)\displaystyle=\nabla_{\theta}\ell_{\frac{1}{N}}(\hat{\theta}^{\text{old}})
=θ0(θ^old)+1N(θCE(znew;θ^old)θCE(zold;θ^old))\displaystyle=\nabla_{\theta}\ell_{0}(\hat{\theta}^{\text{old}})+\frac{1}{N}\quantity(\nabla_{\theta}\ell_{\text{CE}}(z^{\text{new}};\hat{\theta}^{\text{old}})-\nabla_{\theta}\ell_{\text{CE}}(z^{\text{old}};\hat{\theta}^{\text{old}}))
=(a)1N(θCE(znew;θ^old)θCE(zold;θ^old)),\displaystyle\overset{(a)}{=}\frac{1}{N}\quantity(\nabla_{\theta}\ell_{\text{CE}}(z^{\text{new}};\hat{\theta}^{\text{old}})-\nabla_{\theta}\ell_{\text{CE}}(z^{\text{old}};\hat{\theta}^{\text{old}})),

where $(a)$ holds from the stationarity of $\hat{\theta}^{\text{old}}$ as per Assumption A.6. Note that this optimal direction is inaccessible, since it is infeasible to recover the ground-truth $z^{\text{old}}$ from which the LLM's old knowledge was learned. In practice, only $z^{\text{new}}$, provided by the user, is available.

To see that OVERTONE can provide a better direction, we check the gradients of the CE loss $\ell_{\text{CE}}$ and of our loss $\ell_{\text{OVERTONE}}$. Recalling the definitions of $a,b,c$ in Eq (9), for the CE loss we have

θCE(znew;θ)=i=1mθlogπθ(yinew𝒄inew)=a,\nabla_{\theta}\ell_{\text{CE}}(z^{\text{new}};\theta)=-\sum_{i=1}^{m}\nabla_{\theta}\log\pi_{\theta}(y_{i}^{\text{new}}\mid{\boldsymbol{c}}_{i}^{\text{new}})=a, (13)

where 𝒄inew=𝒙y<inew{\boldsymbol{c}}_{i}^{\text{new}}=\boldsymbol{x}\oplus y_{<i}^{\text{new}}, as derived in Sec 3 in the main body.

For OVERTONE loss, according to Eq (5) and Eq (6), we have

θOVERTONE(znew;θ)\displaystyle\nabla_{\theta}\ell_{\text{OVERTONE}}(z^{\text{new}};\theta) =i=1mθCE[πtar(y𝒄inew)πθ(y𝒄inew)]\displaystyle=\sum_{i=1}^{m}\nabla_{\theta}\text{CE}\quantity[\pi_{\text{tar}}(y\mid{\boldsymbol{c}}_{i}^{\text{new}})\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{i}^{\text{new}})]
=λi=1mθCE[δyinew(y)πθ(y𝒄inew)]+(1λ)i=1mθCE[πflt(y𝒄inew)πθ(y𝒄inew)]\displaystyle=\lambda\sum_{i=1}^{m}\nabla_{\theta}\text{CE}\quantity[\delta_{y_{i}^{\text{new}}}(y)\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{i}^{\text{new}})]+(1-\lambda)\sum_{i=1}^{m}\nabla_{\theta}\text{CE}\quantity[\pi_{\text{flt}}(y\mid{\boldsymbol{c}}_{i}^{\text{new}})\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{i}^{\text{new}})]
=(λi=1mθlogπθ(yi𝒄inew)+(1λ)i=1mθCE[πflt(y𝒄inew)πθ(y𝒄inew)])\displaystyle=-\quantity(\lambda\sum_{i=1}^{m}\nabla_{\theta}\log\pi_{\theta}(y_{i}\mid{\boldsymbol{c}}_{i}^{\text{new}})+(1-\lambda)\sum_{i=1}^{m}-\nabla_{\theta}\text{CE}\quantity[\pi_{\text{flt}}(y\mid{\boldsymbol{c}}_{i}^{\text{new}})\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{i}^{\text{new}})])
=λa+(1λ)c.\displaystyle=\lambda a+(1-\lambda)c.

Next, we check cosine similarity cos(θCE(znew;θ^old),G)\cos\quantity(\nabla_{\theta}\ell_{\text{CE}}(z^{\text{new}};\hat{\theta}^{\text{old}}),G) and cos(θOVERTONE(znew;θ^old),G)\cos\quantity(\nabla_{\theta}\ell_{\text{OVERTONE}}(z^{\text{new}};\hat{\theta}^{\text{old}}),G). A larger cosine similarity indicates an update direction that aligns with the ideal GG better and is more effective.

Note that

cos(θCE(znew;θ^old),G)=a,a+ba2(a+b)2,\cos\quantity(\nabla_{\theta}\ell_{\text{CE}}(z^{\text{new}};\hat{\theta}^{\text{old}}),G)=\frac{\langle a,a+b\rangle}{\norm{a}_{2}\norm{(a+b)}_{2}},
cos(θOVERTONE(znew;θ^old),G)=λa+(1λ)c,a+bλa+(1λ)c2(a+b)2.\cos\quantity(\nabla_{\theta}\ell_{\text{OVERTONE}}(z^{\text{new}};\hat{\theta}^{\text{old}}),G)=\frac{\langle\lambda a+(1-\lambda)c,a+b\rangle}{\norm{\lambda a+(1-\lambda)c}_{2}\norm{(a+b)}_{2}}.

We will show that there exists $\lambda\in[0,1]$ such that

a,a+ba2<λa+(1λ)c,a+bλa+(1λ)c2.\frac{\langle a,a+b\rangle}{\norm{a}_{2}}<\frac{\langle\lambda a+(1-\lambda)c,a+b\rangle}{\norm{\lambda a+(1-\lambda)c}_{2}}.

We further denote δbc=cc2bb2\delta_{bc}=\frac{c}{\|c\|_{2}}-\frac{b}{\|b\|_{2}} which quantifies the directional difference between bb and cc. We then have:

c=(bb2+δbc)c2.c=\quantity(\frac{b}{\|b\|_{2}}+\delta_{bc})\|c\|_{2}. (14)

Taking $\lambda=\frac{\|c\|_{2}}{\|b\|_{2}+\|c\|_{2}}$, substituting $c$ using Eq (14), and applying the triangle inequality, we obtain

λa+(1λ)c,a+bλa+(1λ)c2\displaystyle\frac{\langle\lambda a+(1-\lambda)c,a+b\rangle}{\norm{\lambda a+(1-\lambda)c}_{2}} =(c2b2+c2)a+(b2c2b2+c2)(bb2+δbc),a+b(c2b2+c2)(a+b)+(b2c2b2+c2)δbc2\displaystyle=\frac{\Bigl{\langle}\quantity(\frac{\|c\|_{2}}{\|b\|_{2}+\|c\|_{2}})a+\quantity(\frac{\|b\|_{2}\|c\|_{2}}{\|b\|_{2}+\|c\|_{2}})\quantity(\frac{b}{\|b\|_{2}}+\delta_{bc}),a+b\Bigr{\rangle}}{\norm{\quantity(\frac{\|c\|_{2}}{\|b\|_{2}+\|c\|_{2}})(a+b)+\quantity(\frac{\|b\|_{2}\|c\|_{2}}{\|b\|_{2}+\|c\|_{2}})\delta_{bc}}_{2}}
a+b22a+b2(δbc2b2)a+b2+b2δbc2\displaystyle\geq\frac{\norm{a+b}_{2}^{2}-\norm{a+b}_{2}\quantity(\norm{\delta_{bc}}_{2}\norm{b}_{2})}{\norm{a+b}_{2}+\norm{b}_{2}\norm{\delta_{bc}}_{2}}
a+b2b2δbc2a+b2+b2δbc2a+b2.\displaystyle\geq\frac{\norm{a+b}_{2}-\norm{b}_{2}\norm{\delta_{bc}}_{2}}{\norm{a+b}_{2}+\norm{b}_{2}\norm{\delta_{bc}}_{2}}\norm{a+b}_{2}.

Therefore, to show OVERTONE provides a larger cosine similarity, it suffices to show that

a+b2b2δbc2a+b2+b2δbc2>cos(a,a+b),\frac{\norm{a+b}_{2}-\norm{b}_{2}\norm{\delta_{bc}}_{2}}{\norm{a+b}_{2}+\norm{b}_{2}\norm{\delta_{bc}}_{2}}>\cos(a,a+b),

which is equivalent to showing

δbc2<b2a+b2(1cos(a,a+b)1+cos(a,a+b)).\|\delta_{bc}\|_{2}<\frac{\|b\|_{2}}{\|a+b\|_{2}}\quantity(\frac{1-\cos(a,a+b)}{1+\cos(a,a+b)}).

Since $\|\delta_{bc}\|_{2}^{2}=2-2\cos(b,c)$, it suffices to show

cos(b,c)>1b222a+b22(1cos(a,a+b)1+cos(a,a+b))2.\cos(b,c)>1-\frac{\norm{b}_{2}^{2}}{2\norm{a+b}_{2}^{2}}\quantity(\frac{1-\cos(a,a+b)}{1+\cos(a,a+b)})^{2}.

Since cos(a,a+b)1\cos(a,a+b)\leq 1, this condition holds from Assumption A.7. This completes our proof. ∎
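The geometric argument above can also be checked numerically. The sketch below is our own illustration: randomly chosen vectors play the roles of $a$, $b$, and $c$, with $c$ constructed to be roughly aligned with $b$ as in Remark A.8, and $\lambda=\frac{\|c\|_{2}}{\|b\|_{2}+\|c\|_{2}}$ is set as in the proof. The OVERTONE direction $\lambda a+(1-\lambda)c$ then exhibits a larger cosine similarity with $G\propto a+b$ than the plain CE direction $a$ does.

    import numpy as np

    rng = np.random.default_rng(0)

    def cos_sim(u, v):
        return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

    a = rng.normal(size=50)                          # CE gradient on the new knowledge
    b = rng.normal(size=50)                          # negative CE gradient on the old knowledge
    c = 0.3 * b / np.linalg.norm(b) + 0.01 * rng.normal(size=50)   # roughly aligned with b

    lam = np.linalg.norm(c) / (np.linalg.norm(b) + np.linalg.norm(c))
    G = a + b                                        # ideal retraining direction (up to 1/N)

    print(cos_sim(a, G))                             # baseline CE direction vs. G
    print(cos_sim(lam * a + (1 - lam) * c, G))       # OVERTONE direction vs. G (larger)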

A.3.2 Our method leads to a smaller perturbation on unrelated knowledge

Now denote the new estimator obtained through either $\ell_{\text{CE}}$ or $\ell_{\text{OVERTONE}}$ by $\hat{\theta}^{\text{new}}_{\text{CE}}$ or $\hat{\theta}^{\text{new}}_{\text{OVERTONE}}$, respectively. After updating the model parameters to incorporate new knowledge, it is crucial to assess whether this update introduces significant changes on unrelated data.

Without loss of generality, let $\boldsymbol{z}^{\text{un}}=(\boldsymbol{x}^{\text{un}},\boldsymbol{y}^{\text{un}})$ represent a query-answer pair, where $\boldsymbol{x}^{\text{un}}$ is an unrelated query and $\boldsymbol{y}^{\text{un}}$ is its corresponding predicted answer. To ensure good locality, the predicted distribution on $\boldsymbol{z}^{\text{un}}$ should remain (nearly) unchanged under the modification introduced by the update, so that the model's behavior on unaffected regions of the data distribution is preserved. That is, we want to compare $\left|\pi_{\hat{\theta}^{\text{new}}_{\text{CE}}}(\boldsymbol{z}^{\text{un}})-\pi_{\hat{\theta}^{\text{old}}}(\boldsymbol{z}^{\text{un}})\right|$ with $\left|\pi_{\hat{\theta}^{\text{new}}_{\text{OVERTONE}}}(\boldsymbol{z}^{\text{un}})-\pi_{\hat{\theta}^{\text{old}}}(\boldsymbol{z}^{\text{un}})\right|$.

Now, treating Zun=(Xun,Yun)Z^{\text{un}}=(X^{\text{un}},Y^{\text{un}}) as a random vector following a certain distribution, we define

Wθπθ(Zun)|θ=θ^old.W\triangleq\nabla_{\theta}\pi_{\theta}(Z^{\text{un}})\Big{|}_{\theta=\hat{\theta}^{\text{old}}}.

Since WW is a function of ZunZ^{\text{un}}, it is also a random vector. In particular, we introduce the following assumption.

Assumption A.11.

Assume that WW2\frac{W}{\|W\|_{2}} and W2\|W\|_{2} are independent. Furthermore, assume that

WW2𝒰(𝕊d1),\frac{W}{\|W\|_{2}}\sim\mathcal{U}(\mathbb{S}^{d-1}),

where 𝒰(𝕊d1)\mathcal{U}(\mathbb{S}^{d-1}) denotes the uniform distribution on the unit sphere in d\mathbb{R}^{d} with dd denoting the dimensionality of the parameter space.

Remark A.12.

Since $W$ is the gradient of the predicted probability evaluated on unrelated data, we lack any prior information about it. Given that, we assume that $W/\|W\|_{2}$ is isotropically distributed.

Recalling the definitions of $a,b,c$ in Eq (9), we define $\kappa_{R}=\frac{\|c\|_{2}}{\|a\|_{2}}$.

Assumption A.13.

We assume that κR<1\kappa_{R}<1.

Remark A.14 (Interpretation of the Assumption A.13).

As shown in Eq. (12) and Eq. (13):

a\displaystyle a =i=1mθlogπθ(yinew𝒄inew),\displaystyle=-\sum_{i=1}^{m}\nabla_{\theta}\log\pi_{\theta}(y_{i}^{\text{new}}\mid{\boldsymbol{c}}_{i}^{\text{new}}),
c\displaystyle c =i=1myS(i)πflt(y𝒄inew)θlogπθ(y𝒄inew)|θ=θ^old.\displaystyle=-\sum_{i=1}^{m}\sum_{y\in S^{(i)}}\pi_{\text{flt}}(y\mid{\boldsymbol{c}}_{i}^{\text{new}})\nabla_{\theta}\log\pi_{\theta}(y\mid{\boldsymbol{c}}_{i}^{\text{new}})\Big{|}_{\theta=\hat{\theta}^{\text{old}}}.

This implies that $c$ is a weighted combination of $a$ and of contributions from the other tokens $y\in S^{(i)}$. Note that at $\hat{\theta}^{\text{old}}$, given $\boldsymbol{c}_{i}^{\text{new}}$, the tokens $y\neq y_{i}^{\text{new}}$ are closer to $y_{i}^{\text{old}}$. Since the pretraining loss has already reached its minimum at $\hat{\theta}^{\text{old}}$, these tokens tend to have smaller gradient norms than $y_{i}^{\text{new}}$.

Theorem A.15.

Let Zun=(Xun,Yun)Z^{\text{un}}=(X^{\text{un}},Y^{\text{un}}) be a random vector representing unrelated data. Under Assumptions A.11 and A.13, we have

𝔼Zun[|πθ^OVERTONEnew(Zun)πθ^old(Zun)|]<𝔼Zun[|πθ^CEnew(Zun)πθ^old(Zun)|].\mathbb{E}_{Z^{\text{un}}}\quantity[\quantity|\pi_{\hat{\theta}^{\text{new}}_{\text{OVERTONE}}}(Z^{\text{un}})-\pi_{\hat{\theta}^{\text{old}}}(Z^{\text{un}})|]<\mathbb{E}_{Z^{\text{un}}}\quantity[\quantity|\pi_{\hat{\theta}^{\text{new}}_{\text{CE}}}(Z^{\text{un}})-\pi_{\hat{\theta}^{\text{old}}}(Z^{\text{un}})|].

This result indicates that updates based on the OVERTONE loss induce smaller deviations in the predicted distribution for unrelated data compared to updates based on the standard CE loss, thereby better preserving locality.

Proof.

Again let θ^old\hat{\theta}^{\text{old}} denote the pretrained parameters. For any new parameters θ~new\tilde{\theta}^{\text{new}}, the change of πθ(𝒛un)\pi_{\theta}(\boldsymbol{z}^{\text{un}}) when θ\theta moves from θ^old\hat{\theta}^{\text{old}} to θ~new\tilde{\theta}^{\text{new}} can be approximated by the first-order Taylor expansion with

$$\pi_{\tilde{\theta}^{\text{new}}}(\boldsymbol{z}^{\text{un}})-\pi_{\hat{\theta}^{\text{old}}}(\boldsymbol{z}^{\text{un}})=\nabla_{\theta}\pi_{\theta}(\boldsymbol{z}^{\text{un}})\big|_{\theta=\hat{\theta}^{\text{old}}}^{\top}\left(\tilde{\theta}^{\text{new}}-\hat{\theta}^{\text{old}}\right)+o\left(\left\|\tilde{\theta}^{\text{new}}-\hat{\theta}^{\text{old}}\right\|_{2}\right).$$

Note that when we perform one step of gradient descent, the parameter change can further be expressed as

θ~newθ^old=αθ(znew;θ^old),\displaystyle\tilde{\theta}^{\text{new}}-\hat{\theta}^{\text{old}}=-\alpha\nabla_{\theta}\ell(z^{\text{new}};\hat{\theta}^{\text{old}}),

where (znew;θ)\ell(z^{\text{new}};\theta) can be either CE loss or OVERTONE loss, and α\alpha denotes the learning rate.

Then to show OVERTONE leads to smaller perturbation in expectation, it suffices to show that there exists λ[0,1]\lambda\in[0,1] such that

𝔼[|aW|]>𝔼[|λaW+(1λ)cW|].\mathbb{E}\quantity[\quantity|a^{\top}W|]>\mathbb{E}\quantity[\quantity|\lambda a^{\top}W+(1-\lambda)c^{\top}W|].

By triangle inequality, we only need to show

𝔼[|aW|]>𝔼[|cW|].\mathbb{E}\quantity[\quantity|a^{\top}W|]>\mathbb{E}\quantity[\quantity|c^{\top}W|].

Finally, since by Assumption A.11 $\frac{W}{\|W\|_{2}}\sim\mathcal{U}(\mathbb{S}^{d-1})$ and $\frac{W}{\|W\|_{2}}$ is independent of $\|W\|_{2}$, we have

𝔼[|cWW2|W2]𝔼[|aWW2|W2]=𝔼[|cWW2|]𝔼[W2]𝔼[|aWW2|]𝔼[W2]=κR<1.\frac{\mathbb{E}\quantity[\quantity|c^{\top}\frac{W}{\|W\|_{2}}|\|W\|_{2}]}{\mathbb{E}\quantity[\quantity|a^{\top}\frac{W}{\|W\|_{2}}|\|W\|_{2}]}=\frac{\mathbb{E}\quantity[\quantity|c^{\top}\frac{W}{\|W\|_{2}}|]\mathbb{E}\quantity[\|W\|_{2}]}{\mathbb{E}\quantity[\quantity|a^{\top}\frac{W}{\|W\|_{2}}|]\mathbb{E}\quantity[\|W\|_{2}]}=\kappa_{R}<1.

where the last equality uses the rotational symmetry of $\mathcal{U}(\mathbb{S}^{d-1})$: for any fixed vector $v$, $\mathbb{E}\left[\left|v^{\top}\frac{W}{\|W\|_{2}}\right|\right]=\|v\|_{2}\,\mathbb{E}\left[\left|\frac{W_{1}}{\|W\|_{2}}\right|\right]$, so the ratio reduces to $\frac{\|c\|_{2}}{\|a\|_{2}}=\kappa_{R}$. This completes our proof. ∎
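The final step can also be illustrated numerically. The following Monte Carlo sketch (our own illustration with arbitrary vectors) draws isotropic directions $W/\|W\|_{2}$ and checks that $\mathbb{E}[|c^{\top}W/\|W\|_{2}|]\,/\,\mathbb{E}[|a^{\top}W/\|W\|_{2}|]$ indeed collapses to the norm ratio $\kappa_{R}<1$.

    import numpy as np

    rng = np.random.default_rng(0)
    d = 50

    a = rng.normal(size=d)
    c = 0.4 * rng.normal(size=d)                   # any c with a smaller norm than a, so kappa_R < 1

    # Draw isotropic directions W / ||W||_2 uniformly on the unit sphere.
    W = rng.normal(size=(100_000, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)

    ratio = np.abs(W @ c).mean() / np.abs(W @ a).mean()
    kappa_R = np.linalg.norm(c) / np.linalg.norm(a)
    print(ratio, kappa_R)                          # the two numbers nearly coincide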

A.4 Connection between OVERTONE and DPO

We conclude this section with the following analysis of the connection between OVERTONE and direct preference optimization (DPO) (Rafailov et al., 2024).

Theorem A.16.

Let $\epsilon=0$. Then directly optimizing the OVERTONE loss can be seen as optimizing an unbiased estimate of a DPO objective plus an additional KL penalty.

Proof.

From Prop 3.1 and Lem A.3, at step ii, we have the negative loss (objective) to maximize

OVERTONE,i(θ)\displaystyle-\ell_{\text{OVERTONE},i}(\theta) =DKL[πtar(y𝒄i)πθ(y𝒄i)]\displaystyle=-\text{D}_{\text{KL}}[\pi_{\text{tar}}(y\mid{\boldsymbol{c}}_{i})\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{i})]
=(λCE[δyi(y)πθ(y𝒄i)]+(1λ)CE[πflt(y𝒄i)πθ(y𝒄i)])\displaystyle=-\left(\lambda\text{CE}[\delta_{y_{i}}(y)\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{i})]+(1-\lambda)\text{CE}[\pi_{\text{flt}}(y\mid{\boldsymbol{c}}_{i})\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{i})]\right)
=λ(CE[δyi(y)πθ(y𝒄i)]CE[πflt(y𝒄)πθ(y𝒄i)])CE[πflt(y𝒄i)πθ(y𝒄i)]\displaystyle=-\lambda\left(\text{CE}[\delta_{y_{i}}(y)\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{i})]-\text{CE}[\pi_{\text{flt}}(y\mid{\boldsymbol{c}})\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{i})]\right)-\text{CE}[\pi_{\text{flt}}(y\mid{\boldsymbol{c}}_{i})\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{i})] (15)
=λ(logπθ(yi𝒄i)+CE[πflt(y𝒄)πθ(y𝒄i)])CE[πflt(y𝒄i)πθ(y𝒄i)]\displaystyle=\lambda\left(\log\pi_{\theta}(y_{i}\mid{\boldsymbol{c}}_{i})+\text{CE}[\pi_{\text{flt}}(y\mid{\boldsymbol{c}})\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{i})]\right)-\text{CE}[\pi_{\text{flt}}(y\mid{\boldsymbol{c}}_{i})\|\pi_{\theta}(y\mid{\boldsymbol{c}}_{i})] (16)

Through the lens of DPO, note that the editing knowledge $(\boldsymbol{x},\boldsymbol{y})$ can be seen as a preferred sample drawn from an unknown $\pi^{+}$ (e.g., the distribution of an LM retrained from scratch). Consequently, Eq (16) is in fact an unbiased estimator of

$$\begin{aligned}
&\lambda\Big(\underbrace{\mathbb{E}_{y^{+}\sim\pi^{+}(y\mid\boldsymbol{c}_{i})}[\log\pi_{\theta}(y^{+}\mid\boldsymbol{c}_{i})]}_{\text{Preferred distribution}}-\mathbb{E}_{y^{-}\sim\pi_{\text{flt}}(y\mid\boldsymbol{c}_{i})}[\log\pi_{\theta}(y^{-}\mid\boldsymbol{c}_{i})]\Big)+\text{CE}[\pi_{\text{flt}}(y\mid\boldsymbol{c}_{i})\|\pi_{\theta}(y\mid\boldsymbol{c}_{i})]\\
&=\lambda\,\mathbb{E}_{y^{+},y^{-}}\left[\log\frac{\pi_{\theta}(y^{+}\mid\boldsymbol{c}_{i})}{\pi_{\theta}(y^{-}\mid\boldsymbol{c}_{i})}\right]+\text{D}_{\text{KL}}[\pi_{\text{flt}}(y\mid\boldsymbol{c}_{i})\|\pi_{\theta}(y\mid\boldsymbol{c}_{i})]+C\\
&\overset{(a)}{=}\lambda\left(\mathbb{E}_{y^{+},y^{-}}\left[\log\frac{\pi_{\theta}(y^{+}\mid\boldsymbol{c}_{i})}{\pi_{\theta}(y^{-}\mid\boldsymbol{c}_{i})}-\log\frac{\pi_{\text{flt}}(y^{+}\mid\boldsymbol{c}_{i})}{\pi_{\text{flt}}(y^{-}\mid\boldsymbol{c}_{i})}\right]+\underbrace{\mathbb{E}_{y^{+}}[\log\pi_{\text{flt}}(y^{+}\mid\boldsymbol{c}_{i})]-\mathbb{E}_{y^{-}}[\log\pi_{\text{flt}}(y^{-}\mid\boldsymbol{c}_{i})]}_{\text{constant wrt }\theta}\right)\\
&\qquad+\text{D}_{\text{KL}}[\pi_{\text{flt}}(y\mid\boldsymbol{c}_{i})\|\pi_{\theta}(y\mid\boldsymbol{c}_{i})]+C\\
&=\underbrace{\mathbb{E}_{y^{+},y^{-}}\left[\lambda\log\frac{\pi_{\theta}(y^{+}\mid\boldsymbol{c}_{i})}{\pi_{\text{flt}}(y^{+}\mid\boldsymbol{c}_{i})}-\lambda\log\frac{\pi_{\theta}(y^{-}\mid\boldsymbol{c}_{i})}{\pi_{\text{flt}}(y^{-}\mid\boldsymbol{c}_{i})}\right]}_{\text{DPO with Clipped ReLU Activation}}+\underbrace{\text{D}_{\text{KL}}[\pi_{\text{flt}}(y\mid\boldsymbol{c}_{i})\|\pi_{\theta}(y\mid\boldsymbol{c}_{i})]}_{\text{Additional Penalty}}+C,
\end{aligned}$$

where the first term incorporates a preferred distribution, of which the user-provided new knowledge $y_{i}$ serves as an unbiased (single-sample) estimate. Step (a) adds and subtracts the log-likelihood ratio of the $(y^{+},y^{-})$ pair under $\pi_{\text{flt}}$, which is constant with respect to $\theta$ and therefore does not affect the objective. In the final step, we treat the first term as a token-level DPO objective using the current $\pi_{\text{flt}}$ as the reference model, and the preference model is given by

Pr(y+y𝒄i)\displaystyle\text{Pr}(y^{+}\succ y^{-}\mid{\boldsymbol{c}}_{i}) =ClippedReLU(r(𝒄i,y+)r(𝒄i,y)),\displaystyle=\text{ClippedReLU}(r({\boldsymbol{c}}_{i},y^{+})-r({\boldsymbol{c}}_{i},y^{-})),

where

ClippedReLU(z)\displaystyle\text{ClippedReLU}(z) =min(max(z,0),1),\displaystyle=\min(\max(z,0),1),

when

$$0\leq\lambda\log\frac{\pi_{\theta}(y^{+}\mid\boldsymbol{c}_{i})}{\pi_{\text{flt}}(y^{+}\mid\boldsymbol{c}_{i})}-\lambda\log\frac{\pi_{\theta}(y^{-}\mid\boldsymbol{c}_{i})}{\pi_{\text{flt}}(y^{-}\mid\boldsymbol{c}_{i})}\leq 1.$$

Notably, since the base distribution $\pi_{\text{flt}}$ is a clipped version of $\pi_{\theta}$ and $\lambda\in[0,1]$, the difference between the two reweighted terms for $y^{+}$ and $y^{-}$ given $\boldsymbol{c}_{i}$ is expected to be small, so that $\text{ClippedReLU}(z)=z$ holds. Finally, the additional penalty is another term that pushes $\pi_{\theta}$ to stay close to $\pi_{\text{flt}}$, but in a forward form, which has also been explored in preference learning (Wang et al., 2024a).

In conclusion, OVERTONE can be seen as an unbiased estimator of a special DPO problem. This completes our proof. ∎

Appendix B Implementation Details

B.1 Hyperparameters used in KE

We present the implementation details of our algorithms. All of our experiments are run on EasyEdit (Wang et al., 2024c). In general, we tuned hyperparameters for each KE method on its base version whenever the default setting from EasyEdit showed noticeably inferior performance. See below for more details.

FT-M used the following hyperparameters:

  • On ZsRE, Wikirecent{}_{\text{recent}}, Wikicounterfact{}_{\text{counterfact}}, and WikiBio: default training parameters from EasyEdit for both LLaMA 2 and LLaMA 3.

  • On MQuAKE: Layers to tune: (20,21,22,23,24). Learning rate: 1e-3. Others unchanged.

LoRA used the following hyperparameters:

  • On ZsRE, Wikirecent{}_{\text{recent}}, Wikicounterfact{}_{\text{counterfact}}, and WikiBio: default training parameters from EasyEdit for both LLaMA 2 and LLaMA 3.

  • On MQuAKE: LoRA rank: 12. Iteration numbers: 50. Others unchanged.

MELO used the following hyperparameters:

  • We set the initial radius for each code in the code-book to 60 for LLaMA 2 and 30 for LLaMA 3, because the default choice of 0.1 was too small to retrieve any edited parameters for rephrased queries or reasoning.

  • Others unchanged.

WISE used the following hyperparameters:

  • With OVERTONE, we shrank the activation thresholds by a factor of 0.6 to account for the milder overfitting of our method. We did not tune this shrinkage factor, so it may be suboptimal. All other parameters used default values from EasyEdit.

  • We removed data augmentation to better measure the influence of HTO. This also led to significantly faster editing (around a 5x speedup).

ROME and MEMIT used default choices from EasyEdit.

Finally, OVERTONE is tuned on a per-KE-method basis and applied to both LLMs. We did not tune the hyper-parameters extensively, so the values of ϵ and n below may be suboptimal; they are also collected in the sketch after this list.

  • FT-M: ϵ=0.01\epsilon=0.01, n=0.5n=0.5 for nσn\sigma-filtering, λ=0.1\lambda=0.1 for mixing.

  • LoRA: ϵ=0.05\epsilon=0.05, n=0.5n=0.5 for nσn\sigma-filtering, λ=0.1\lambda=0.1 for mixing.

  • MELO: ϵ=0.05\epsilon=0.05, n=1n=1 for nσn\sigma-filtering, λ=0.1\lambda=0.1 for mixing.

  • WISE: ϵ=0.05\epsilon=0.05, n=1n=1 for nσn\sigma-filtering, λ=0.1\lambda=0.1 for mixing.
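For convenience, the per-method settings above could be collected in a small configuration mapping such as the hypothetical sketch below; the dictionary layout and key names are our own and do not correspond to EasyEdit configuration fields.

    # Hypothetical summary of the OVERTONE hyperparameters reported above.
    # Key names are illustrative and are not EasyEdit config fields.
    OVERTONE_HPARAMS = {
        "FT-M": {"eps": 0.01, "n_sigma": 0.5, "lambda": 0.1},
        "LoRA": {"eps": 0.05, "n_sigma": 0.5, "lambda": 0.1},
        "MELO": {"eps": 0.05, "n_sigma": 1.0, "lambda": 0.1},
        "WISE": {"eps": 0.05, "n_sigma": 1.0, "lambda": 0.1},
    }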

B.2 MQuAKE Experiment Details

The MQuAKE benchmark follows a different evaluation pipeline for 1-Hop and 2-Hop reasoning questions (Zhong et al., 2023; Wang et al., 2024d), which checks whether the ground-truth answer appears in the LLM's generation. Our evaluation rubric followed Zhong et al. (2023). We note that the reliability of the evaluation heavily relies on the use of a good prompt; our prompts are given below, and a sketch of the answer check is given at the end of this subsection.

  • 1-Hop questions: we used 1-shot prompting to guide the model to provide answers directly; the complete prompt is

    You are a helpful AI assistant. Answer questions directly. Always format your response as: Final answer: [concise and direct final answer] Question: Who is the spouse of the head of state in United States of America? Answer: Jill Biden Question: # 1-Hop question related to the new knowledge # Answer:

  • 2-Hop questions: again, we used 1-shot prompting to guide the model to provide answers via chain-of-thought reasoning (Wei et al., 2022); the complete prompt is

    You are a helpful AI assistant. For each question: 1. Break it down into simpler subquestions 2. Answer each subquestion step by step. 3. Use your answers to provide a final answer after "Final answer: " Always format your response as: Subquestion: [your subquestion] Generated answer: [your answer] Final answer: [concise and direct final answer] Question: Who is the spouse of the head of state in United States of America? Subquestion: Who is the head of state in United States of America? Answer: The head of state in United States of America is Joe Biden. Subquestion: Who is the spouse of Joe Biden? Answer: The spouse of Joe Biden is Jill Biden. Final answer: Jill Biden Question: # 2-Hop question related to the new knowledge #

For generation, we set the temperature to 0.1. The maximum generation length was 30 for 1-Hop questions and 200 for 2-Hop questions. Chat templates were applied.
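As mentioned above, correctness is judged by whether the ground-truth answer appears in the model's generation. A minimal sketch of such a check is given below; the parsing of the "Final answer:" line and the case-insensitive substring match are assumptions on our part rather than the exact rubric of Zhong et al. (2023).

    def mquake_hit(generation: str, gold_answers: list) -> bool:
        # Illustrative correctness check: does any gold answer appear in the
        # (final-answer portion of the) generation? Case-insensitive substring match.
        text = generation
        if "Final answer:" in generation:
            # If the model followed the prompt format, only inspect the final answer.
            text = generation.split("Final answer:")[-1]
        text = text.lower()
        return any(ans.lower() in text for ans in gold_answers)

    # Hypothetical usage.
    print(mquake_hit("... Final answer: Jill Biden", ["Jill Biden"]))   # True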

Appendix C More Experiment Results

We present the complete Continual Editing results here. Note that the sequence length $T=1$ reduces to the Single Edit setting, but we present those results again for completeness.

Table 4: Continual Editing performance (LLaMA 2). WISE requires additional irrelevant data for training, which is only available in the ZsRE benchmark.
ZsRE Wikirecent{}_{\text{recent}} Wikicounterfact{}_{\text{counterfact}} WikiBio
T=1T=1
Rel. Gen. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Loc. Avg.
ROME 96.61 83.91 55.7 96.96 83.3 99.02 54.21 55.91 69.71 97.2 56.85 50.4 68.15 96.41 59.14 77.78
MEMIT 94.22 88.2 57.91 98.28 84.65 97.71 52.93 55.05 68.56 96.38 59.34 45.7 67.14 93.78 56.74 75.26
\cdashline2-20 FT-M 99.75 99.33 54.32 93.01 86.60 100.0 62.93 45.92 69.62 100.0 74.7 54.86 76.52 100.0 90.04 95.02
+ Ours 99.75 96.8 57.08 96.54 87.54 100.0 63.91 60.4 74.77 100.0 73.62 75.34 82.99 100.0 93.46 96.73
\cdashline2-20 LoRA 100.0 100.0 23.34 30.44 63.45 100.0 55.41 28.29 61.23 100.0 71.92 9.99 60.64 100.0 48.84 74.42
+ Ours 100.0 94.31 61.16 87.2 85.67 100.0 63.67 58.72 74.13 100.0 73.96 57.85 77.27 97.68 68.45 83.06
\cdashline2-20 MELO 100.0 96.77 27.11 92.35 79.06 99.13 54.04 40.96 64.71 99.0 71.78 55.83 75.54 99.97 80.77 90.37
+ Ours 100.0 93.31 50.36 97.2 85.22 100.0 60.25 66.48 75.58 99.91 71.81 78.09 83.27 99.68 82.58 91.13
\cdashline2-20 WISE 92.42 70.86 54.57 100.0 79.46 - - - - - - - - - - -
+ Ours 97.55 76.09 54.17 100.0 81.95 - - - - - - - - - - -
T=10T=10
Rel. Gen. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Loc. Avg.
ROME 74.94 69.67 51.12 71.72 66.86 98.14 55.16 54.73 69.34 86.17 47.36 38.99 57.51 40.55 25.98 33.27
MEMIT 68.39 66.26 46.66 84.22 66.38 96.51 54.2 52.56 67.76 89.64 54.71 38.2 60.85 52.2 38.54 45.37
\cdashline2-20 FT-M 89.14 87.43 47.13 84.26 76.99 97.4 56.47 41.4 65.09 96.41 70.32 42.44 69.72 92.96 77.69 85.32
+ Ours 92.8 88.21 55.74 91.06 81.95 96.42 61.65 53.13 70.40 98.72 72.47 65.46 78.88 95.26 84.43 89.84
\cdashline2-20 LoRA 29.25 30.41 19.83 24.81 26.07 35.17 23.8 24.98 27.98 22.64 13.87 10.24 15.58 70.45 46.82 58.64
+ Ours 85.4 81.5 61.03 74.41 75.59 94.55 59.16 49.09 67.60 71.61 51.91 32.65 52.06 74.74 48.35 61.55
\cdashline2-20 MELO 94.13 83.06 50.48 96.5 81.04 91.73 53.02 81.09 75.28 92.52 64.55 99.98 85.68 95.44 97.94 96.69
+ Ours 94.38 81.89 54.92 98.41 82.40 91.69 54.95 93.22 79.95 93.49 63.36 99.98 85.61 95.24 97.77 96.50
\cdashline2-20 WISE 84.5 73.81 53.19 100.0 77.88 - - - - - - - - - - -
+ Ours 86.68 77.24 54.0 100.0 79.48 - - - - - - - - - - -
T=100T=100
Rel. Gen. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Loc. Avg.
ROME 25.37 22.68 4.73 5.1 14.47 24.99 13.12 8.55 15.56 0.0 0.0 0.0 0.0 2.63 15.74 9.18
MEMIT 2.58 2.88 0.24 2.5 2.05 70.22 41.12 38.43 49.92 0.82 0.97 0.26 0.69 0.0 15.74 7.87
\cdashline2-20 FT-M 88.36 84.51 41.76 54.11 67.19 97.51 53.73 33.88 61.71 95.69 66.23 26.69 62.87 93.56 67.51 80.53
+ Ours 89.38 82.13 52.69 72.39 74.15 96.32 58.28 47.04 67.21 95.93 68.16 44.28 69.46 95.35 74.91 85.13
\cdashline2-20 LoRA 0.67 0.78 1.00 0.03 0.62 0.5 0.5 0.12 0.37 0.67 0.0 0.0 0.22 47.02 27.06 37.04
+ Ours 62.23 58.06 56.62 59.57 59.12 70.49 47.05 49.87 55.80 32.17 28.99 29.19 30.12 52.96 25.73 39.34
MELO 38.13 36.12 53.88 98.08 56.55 26.33 24.98 53.73 35.01 24.87 24.21 78.71 42.60 48.88 97.61 48.88
+ Ours 39.13 37.28 54.75 98.58 57.44 47.95 39.65 86.77 58.12 24.92 25.39 97.12 49.14 52.17 97.44 74.81
\cdashline2-20 WISE 84.59 71.59 54.45 100.0 77.66 - - - - - - - - - - -
+ Ours 92.42 84.22 56.71 100.0 83.34 - - - - - - - - - - -
Table 5: Continual Editing performance (LLaMA 3). WISE requires additional irrelevant data for training, which is only available in the ZsRE benchmark.
ZsRE Wikirecent{}_{\text{recent}} Wikicounterfact{}_{\text{counterfact}} WikiBio
T=1T=1
Rel. Gen. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Loc. Avg.
ROME 99.17 97.91 58.12 95.9 87.78 98.84 54.76 49.74 67.78 99.94 58.0 42.94 66.96 92.43 72.63 82.53
MEMIT 96.67 92.46 58.78 98.23 86.53 98.51 53.65 48.45 66.87 99.44 57.81 42.73 66.66 96.26 71.23 83.75
\cdashline2-20 FT-M 100.0 99.75 40.43 79.43 79.90 100.0 57.13 30.01 62.38 100.0 72.62 31.47 68.03 100.0 92.96 96.48
+ Ours 100.0 99.75 48.63 94.78 85.79 100.0 60.88 44.67 68.52 100.0 73.5 58.29 77.26 99.99 94.87 97.43
\cdashline2-20 LoRA 100.0 100.0 26.55 38.85 66.35 100.0 52.99 26.46 59.82 100.0 71.1 9.02 60.04 100.0 59.77 79.88
+ Ours 100.0 98.5 51.57 93.13 85.80 100.0 61.46 56.1 72.52 100.0 72.8 57.54 76.78 98.16 77.24 87.7
\cdashline2-20 MELO 100.0 96.84 39.63 98.8 83.82 100.0 59.07 65.78 74.95 100.0 71.55 87.77 86.44 100.0 98.56 99.28
+ Ours 100.0 95.77 43.08 98.8 84.41 100.0 58.72 69.1 75.94 100.0 70.26 89.81 86.69 99.98 98.56 99.27
\cdashline2-20 WISE 71.67 51.29 49.27 100.0 68.06 - - - - - - - - - - -
+ Ours 82.67 62.34 47.54 100.0 73.14 - - - - - - - - - - -
T=10T=10
Rel. Gen. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Loc. Avg.
ROME 43.91 40.14 25.11 31.7 35.22 91.17 51.25 43.67 62.03 86.52 45.37 32.9 54.93 4.01 7.58 5.79
MEMIT 59.74 58.36 37.34 71.06 56.62 98.38 54.42 47.08 66.63 98.61 58.48 36.28 64.46 5.4 1.61 3.5
\cdashline2-20 FT-M 79.54 78.44 25.03 43.97 56.75 87.22 48.12 25.8 53.71 90.13 62.37 13.83 55.44 95.59 87.45 91.52
+ Ours 84.74 81.41 44.2 75.67 71.50 92.77 52.65 38.99 61.47 93.04 66.5 39.99 66.51 96.81 91.17 93.99
\cdashline2-20 LoRA 18.54 17.55 6.63 6.56 12.32 21.7 13.66 11.97 15.78 12.59 5.92 0.69 6.40 51.09 44.45 47.77
+ Ours 73.28 72.39 53.13 69.36 67.04 93.68 56.97 49.34 66.66 71.99 49.52 32.24 51.25 64.26 55.11 59.69
\cdashline2-20 MELO 94.08 80.47 47.97 98.8 80.33 92.56 54.51 86.58 77.88 92.97 63.74 98.3 85.00 94.77 98.56 96.67
+ Ours 94.08 80.94 49.77 98.8 80.90 91.56 54.24 89.16 78.32 92.97 62.69 98.32 84.66 94.91 98.56 96.74
\cdashline2-20 WISE 51.14 43.36 51.0 100.0 61.38 - - - - - - - - - - -
+ Ours 58.21 53.22 49.21 100.0 65.16 - - - - - - - - - - -
T=100T=100
Rel. Gen. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Por. Loc. Avg. Rel. Loc. Avg.
ROME 7.18 6.02 1.04 2.24 4.12 8.89 1.36 0.31 3.52 3.92 0.99 0.0 1.64 0.88 7.47 4.18
MEMIT 0.0 0.0 0.0 0.0 0.0 0.57 0.92 0.4 0.63 0.81 0.86 0.0 0.56 0.01 23.44 11.73
\cdashline2-20 FT-M 78.79 78.29 13.7 15.42 46.55 94.27 44.09 22.99 53.78 87.47 55.62 2.78 48.62 93.65 85.83 89.74
+ Ours 81.2 77.87 32.65 44.66 59.09 96.19 53.73 32.42 60.78 92.97 62.02 20.71 58.57 94.23 85.83 94.23
\cdashline2-20 LoRA 1.75 1.81 1.29 2.13 1.74 1.33 1.58 0.93 1.28 1.00 0.00 0.00 0.33 15.88 17.61 16.74
+ Ours 51.38 50.3 49.72 35.83 46.81 64.82 42.92 44.27 50.67 25.31 20.18 17.49 20.99 19.03 10.9 14.96
\cdashline2-20 MELO 29.79 28.83 50.01 98.8 51.86 36.71 29.02 83.23 49.65 22.2 22.9 97.85 22.55 52.19 98.56 75.37
+ Ours 29.79 28.73 50.01 98.8 51.83 40.42 34.85 92.67 55.98 22.45 22.9 97.85 47.73 52.15 98.56 75.36
\cdashline2-20 WISE 84.87 74.87 39.24 100.0 74.75 - - - - - - - - - - -
+ Ours 86.83 77.54 34.99 100.0 74.84 - - - - - - - - - - -