
Tracing Privacy Leakage of Language Models to Training Data via Adjusted Influence Functions

Jinxin Liu1 Zao Yang1
Abstract

The responses generated by Large Language Models (LLMs) can include sensitive information from individuals and organizations, leading to potential privacy leakage. This work implements Influence Functions (IFs) to trace privacy leakage back to the training data, thereby mitigating privacy concerns of Language Models (LMs). However, we notice that current IFs struggle to accurately estimate the influence of tokens with large gradient norms, potentially overestimating their influence. When tracing the most influential samples, this leads to frequently tracing back to samples with large-gradient-norm tokens, overshadowing the actual most influential samples even if their influences are well estimated. To address this issue, we propose the Heuristically Adjusted IF (HAIF), which reduces the weight of tokens with large gradient norms, thereby significantly improving the accuracy of tracing the most influential samples. To establish easily obtained ground truth for tracing privacy leakage, we construct two datasets, PII-E and PII-CR, representing two distinct scenarios: one with identical text in the model outputs and pre-training data, and the other where models leverage their reasoning abilities to generate text divergent from the pre-training data. HAIF significantly improves tracing accuracy, enhancing it by 20.96% to 73.71% on the PII-E dataset and 3.21% to 45.93% on the PII-CR dataset, compared with the best SOTA IFs across various GPT-2 and QWen-1.5 models. HAIF also outperforms SOTA IFs on the real-world pretraining dataset CLUECorpus2020, demonstrating strong robustness regardless of prompt and response lengths.

1 Introduction

With the emergence of ChatGPT, LLMs have received phenomenal social attention due to their powerful capabilities. However, the safety and privacy of LLMs continue to be evaluated and challenged by academia and industry (Zhou et al. 2023). LLMs extensively utilize data from the public internet, private domains, and user interactions during the training stage. LLM pretraining data inevitably includes sensitive information from individuals and organizations, which can be leaked during the inference stage (Yao et al. 2024; Sun et al. 2024; Wang et al. 2024; Yang et al. 2024). In the quest to protect privacy, LLM manufacturers have explored various strategies, including privacy cleaning of pre-training data, LLM alignment, and content moderation. However, due to jailbreak (Wei, Haghtalab, and Steinhardt 2023; Li et al. 2023) and training data extraction techniques (Nasr et al. 2023), these measures are not foolproof.

Consequently, when private information is exposed to LLM users despite multiple layers of protection, it raises an important question: Which training data is the root cause of privacy leakage in an LLM? The answer to this question pertains to several aspects: the identification and tracing of privacy infringements, preventing the flow of sensitive information into further trained models, and supporting downstream tasks such as machine unlearning and model editing (Liu et al. 2024; Chen et al. 2024).

Since IFs were introduced into deep learning in 2017 (Koh and Liang 2017), their primary objective has been explaining the predictions of black-box models. IFs have been widely used for identifying the most influential training samples in various fields such as image classification (Fisher et al. 2023; Lee et al. 2020), text classification (Zylberajch, Lertvittayakumjorn, and Toni 2021; Schioppa et al. 2023), and language modeling (Grosse et al. 2023). Therefore, this work implements IFs to trace privacy leakage in LMs; however, while IFs seem promising for tracing privacy content, there remain theoretical and practical open problems to be solved.

Effectively utilizing IFs requires assumptions that are often invalid in deep learning contexts, leading to significant errors: the convexity of the empirical loss, the neglect of training trajectories, and the limitation of parameter divergence when up/down-weighting samples (Schioppa et al. 2023). While the other assumptions can be mitigated, Schioppa et al. question the assumption of limited parameter divergence. They derive an upper bound for parameter divergence when up/down-weighting samples and show that this bound can grow exponentially over time. Consequently, the model parameters after Leave-One-Out Retraining (LOOR) may no longer lie near the original model parameters, causing the most basic assumption of IFs to fail and making them effective only within a very limited number of time steps. This implies that IFs cannot be applied to most deep learning models.

Furthermore, it has been found that, regardless of which test sample is traced, IFs always attribute the highest influence to training samples with large gradient norms (Barshan, Brunet, and Dziugaite 2020). To avoid this issue, RelatIF introduces an additional constraint to limit the influence of large gradient norms, which leads to a sample-wise normalization of the original IFs. However, the following questions have not been well answered: Why do IFs need additional constraints? Which problem does this additional constraint fundamentally solve? Meanwhile, we observe that RelatIF cannot provide satisfactory results for tracing privacy leakage in our experiments.

To solve the aforementioned issues, the contributions of this paper are four-fold:

  • To the best of our knowledge, this work is the first to use IFs to trace privacy leakage in LMs, extending the application of IFs and further safeguarding the privacy of LLMs.

  • This work reveals that the gradient norms of tokens affect the effectiveness of IFs in most deep learning models in three key aspects: the existence of IFs, the accuracy of LOOR parameter estimation, and the estimation errors of existing IFs.

  • We propose HAIF which reduces the weights of tokens with large gradient norms, significantly improving the performance of identifying the most influential training samples with lower computational costs compared to the best SOTA IFs.

  • Comprehensive experiments are conducted to demonstrate that the proposed HAIF significantly enhances tracing performance across various model types and parameter scales compared to SOTA IFs.

2 Related Work and Preliminaries

Typically, one can perform LOOR and observe the change of model parameters and prediction loss to check if a training sample is important for model parameters and predictions. Nonetheless, performing LOOR for each training sample is intractable given the current scale of LM parameters and dataset. Thus, IFs are proposed to simulate such process aiming at finding the most influential training samples of model parameters and predictions (Koh and Liang 2017).

Let $D=\{z_{1},\dots,z_{n}\}$ be a training set comprising $n$ samples, let $\theta$ represent the model parameters, and let $L$ denote a loss function for each training sample. Assume $\theta_{0}$ is the original trained model parameters and $\theta_{\epsilon_{k}}$ is the model trained after up/down-weighting $z_{k}\in D$ by $\epsilon_{k}$. The influence of $z_{k}$ on the parameters can be expressed as:

I_{param}(z_{k}):=\frac{d(\theta_{\epsilon_{k}}-\theta_{0})}{d\epsilon_{k}}\bigg|_{\epsilon_{k}=0}. (1)

The influence of $z_{k}$ on the loss of a test sample $z_{test}$ is quantified by:

I_{loss}(z_{test},z_{k}):=\nabla_{\theta}L(z_{test},\theta_{0})I_{param}(z_{k}). (2)

IFs can be categorized into two broad types based on the method of estimating $I_{param}(z_{k})$: Hessian-based IFs (HIFs) and Training Trajectory-based IFs (TTIFs).

2.1 Hessian-based Influence Functions

Assume that $L$ is smooth and convex, and that a model is well trained on a training set using empirical risk minimization. The optimized parameters can be represented as $\theta_{0}=\operatorname*{arg\,min}_{\theta\in\Theta}\frac{1}{n}\sum_{i=1}^{n}L(z_{i},\theta)$. If the weight of a sample $z_{k}$ is marginally reduced by $\epsilon_{k}$, the parameters are re-optimized as $\theta_{\epsilon_{k}}=\operatorname*{arg\,min}_{\theta\in\Theta}[\frac{1}{n}\sum_{i=1}^{n}L(z_{i},\theta)-\epsilon_{k}L(z_{k},\theta)]$. HIFs with respect to the parameters, $I^{hif}_{param}$, are derived as follows:

I_{param}^{hif}(z_{k})=H_{\theta_{0}}^{-1}\nabla_{\theta}L(z_{k},\theta_{0}), (3)

where $H_{\theta_{0}}$ is the Hessian matrix of the trained model (Koh and Liang 2017). The HIFs with respect to the loss, $I^{hif}_{loss}$, can be formulated as:

I_{loss}^{hif}(z_{test},z_{k})=\nabla_{\theta}L(z_{test},\theta_{0})H_{\theta_{0}}^{-1}\nabla_{\theta}L(z_{k},\theta_{0}). (4)
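For intuition, (4) can be instantiated on a tiny convex problem where the Hessian is available in closed form. The least-squares setup below is an illustrative stand-in, not the paper's actual models or data:

```python
import numpy as np

# Toy convex setting: per-sample loss L(z, theta) = 0.5 * (x^T theta - y)^2,
# so grad = (x^T theta - y) * x and the Hessian of the mean loss is X^T X / n.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # 20 training samples, 3 parameters
y = rng.normal(size=20)
theta = np.linalg.lstsq(X, y, rcond=None)[0]  # "well-trained" parameters

def grad(x, y_true, theta):
    return (x @ theta - y_true) * x

H = X.T @ X / len(X)        # Hessian at theta (constant for a quadratic loss)
H_inv = np.linalg.inv(H)

def hif_loss(z_test, z_k):
    """Eq. (4): I_loss = grad_test^T  H^{-1}  grad_k."""
    g_test = grad(*z_test, theta)
    g_k = grad(*z_k, theta)
    return float(g_test @ H_inv @ g_k)

# Score every training sample against one "test" sample and take the argmax.
scores = [hif_loss((X[0], y[0]), (X[k], y[k])) for k in range(len(X))]
most_influential = int(np.argmax(scores))
```

Since $H$ is positive definite here, the self-influence score of a sample against itself is always non-negative.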

HIFs have found applications across various domains (Fisher et al. 2023; Lee et al. 2020; Chen et al. 2023; Xia et al. 2024), and a variety of methods have been proposed to mitigate computational costs (Agarwal, Bullins, and Hazan 2017; Martens and Grosse 2015; George et al. 2018; Schioppa et al. 2022). As discussed above, to mitigate the issue that IFs always trace back to a small group of training samples, RelatIF formulates the task of tracing the most influential training samples as an optimization problem, with an additional constraint:

\operatorname*{arg\,max}_{z_{k}\in D}\max_{\epsilon_{k}}|\epsilon_{k}\nabla_{\theta}L(z_{test},\theta_{0})H_{\theta_{0}}^{-1}\nabla_{\theta}L(z_{k},\theta_{0})|
\text{s.t. }\|\epsilon_{k}H_{\theta_{0}}^{-1}\nabla_{\theta}L(z_{k},\theta_{0})\|_{2}\leq\delta. (5)

By simplifying this optimization problem, $\theta$-RelatIF is obtained, which essentially performs a sample-wise normalization:

I^{\theta\text{-}relatif}_{loss}(z_{test},z_{k}):=\frac{I_{loss}(z_{test},z_{k})}{\|H_{\theta_{0}}^{-1}\nabla_{\theta}L(z_{k},\theta_{0})\|_{2}}. (6)

To reduce the computational cost, the Hessian matrix in IFs is sometimes assumed to be the identity matrix $Id$. With that, derived from $\theta$-RelatIF, we have the Gradient Cosine IF $I^{cos}_{loss}(z_{test},z_{k}):=\frac{\nabla_{\theta}L(z_{test},\theta_{0})\nabla_{\theta}L(z_{k},\theta_{0})}{\|\nabla_{\theta}L(z_{test},\theta_{0})\|_{2}\|\nabla_{\theta}L(z_{k},\theta_{0})\|_{2}}$.
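With the Hessian taken as the identity, the plain gradient product, the sample-wise normalization of (6), and the Gradient Cosine IF differ only in how the gradients are scaled. A minimal sketch on stand-in gradient vectors (the numbers are illustrative, not from any trained model):

```python
import numpy as np

def grad_product(g_test, g_k):
    # Plain influence with H = Id: favors training samples with large ||g_k||.
    return float(g_test @ g_k)

def relatif(g_test, g_k):
    # Sample-wise normalization as in eq. (6) with H = Id.
    return float(g_test @ g_k) / np.linalg.norm(g_k)

def grad_cosine(g_test, g_k):
    # Gradient Cosine IF: both sides normalized.
    return float(g_test @ g_k) / (np.linalg.norm(g_test) * np.linalg.norm(g_k))

g_test = np.array([1.0, 0.0])
g_aligned = np.array([0.9, 0.1])    # small norm, well aligned with g_test
g_large = np.array([5.0, 5.0])      # large norm, poorly aligned

# The raw product is dominated by the large-norm gradient ...
assert grad_product(g_test, g_large) > grad_product(g_test, g_aligned)
# ... while the normalized variants rank the aligned sample first.
assert grad_cosine(g_test, g_aligned) > grad_cosine(g_test, g_large)
assert relatif(g_test, g_aligned) > relatif(g_test, g_large)
```

This illustrates the large-gradient-norm bias discussed above, and how normalization changes the resulting ranking.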

However, as discussed in (Barshan, Brunet, and Dziugaite 2020), the constraint in (5) lacks theoretical support, and the evaluation metrics do not provide direct evidence that the traced samples are indeed the most influential for the test samples under examination.

2.2 Training Trajectory-based Influence Functions

As stated in Section 2.1, effectively utilizing HIFs requires assumptions which do not hold in most deep learning settings. Thus, TTIFs model the training trajectory of up/down-weighting a training sample to avoid these assumptions (Pruthi et al. 2020; Schioppa et al. 2023).

Consider a deep learning model trained with a mini-batch SGD optimizer. At each training step, the model parameters are updated as:

\theta_{\epsilon_{k},t}=\theta_{\epsilon_{k},t-1}-\eta_{t-1}\nabla\mathcal{L}_{t-1}(\epsilon_{k},\theta_{\epsilon_{k},t-1}), (7)

where $\eta_{t}$ is the learning rate (LR) at step $t$, and $\mathcal{L}_{t}(\epsilon_{k},\theta_{\epsilon_{k},t})$ denotes the mini-batch loss at step $t$ after up/down-weighting the training sample $z_{k}$. The parameters trained after $T$ time steps are then:

\theta_{\epsilon_{k},T}=\theta_{init}-\sum_{t=0}^{T-1}\eta_{t}\nabla\mathcal{L}_{t}(\epsilon_{k},\theta_{\epsilon_{k},t}), (8)

where $\theta_{init}$ is the initial parameters before training.

To capture the actual influence under SGD training, $I_{param}^{sgd}(z_{k})$ is defined as the derivative of $\theta_{\epsilon_{k},T}$ with respect to $\epsilon_{k}$:

I_{param}^{sgd}(z_{k})=\underbrace{-\sum_{t=0}^{T-1}\eta_{t}\frac{d(\nabla_{\theta}\mathcal{L}_{t}(\epsilon_{k},\theta_{\epsilon_{k},t}))}{d\epsilon_{k}}\bigg|_{\epsilon_{k}=0}}_{I}\ \underbrace{-\sum_{t=0}^{T-1}\eta_{t}H_{t}\frac{d(\theta_{\epsilon_{k},t}-\theta_{0,t})}{d\epsilon_{k}}\bigg|_{\epsilon_{k}=0}}_{II}, (9)

where $H_{t}$ is the Hessian matrix of $\mathcal{L}_{t}$. To avoid computing $H_{t}$, TracIn (Pruthi et al. 2020) ignores term $II$ in (9), resulting in:

I_{param}^{tracin}(z_{k}):=-\sum_{t:z_{k}\in batch_{t}}\eta_{t}\nabla_{\theta}L(z_{k},\theta_{0,t}), (10)

where $batch_{t}$ is the mini-batch used at time step $t$. While TracIn provides an effective way to track training trajectories, its additive assumption is questioned by (Guu et al. 2023; Schioppa et al. 2023), and term $II$ in (9) cannot simply be dropped.
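The TracIn estimate in (10), plugged into (2), reduces to a sum of LR-weighted gradient dot products over the checkpoints where $z_{k}$ appeared in the mini-batch. A minimal sketch with stand-in checkpoint gradients (not real model gradients):

```python
import numpy as np

def tracin_score(test_grads, train_grads, lrs, in_batch):
    """Eq. (10) combined with eq. (2):
    score = sum_t eta_t * <grad L(z_test, theta_t), grad L(z_k, theta_t)>
    over the steps t where z_k was in the mini-batch (B_{z_k,t} = 1)."""
    score = 0.0
    for g_test, g_k, eta, used in zip(test_grads, train_grads, lrs, in_batch):
        if used:
            score += eta * float(g_test @ g_k)
    return score

# Three saved checkpoints; z_k appeared in the batch at steps 0 and 2.
test_grads = [np.array([1.0, 0.0]), np.array([0.5, 0.5]), np.array([0.0, 1.0])]
train_grads = [np.array([2.0, 0.0]), np.array([1.0, 1.0]), np.array([0.0, 3.0])]
lrs = [0.1, 0.05, 0.01]
score = tracin_score(test_grads, train_grads, lrs, in_batch=[1, 0, 1])
# step 0 contributes 0.1 * 2.0 = 0.2; step 2 contributes 0.01 * 3.0 = 0.03
```

Note that the sketch only covers term $I$ of (9); term $II$, which TracIn drops, is exactly what the bound below concerns.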

When term $II$ in (9) is taken into consideration, IFs encounter a theoretical challenge, as discussed below. Assume that both $\mathcal{L}_{t}$ and $\nabla_{\theta}\mathcal{L}_{t}$ are Lipschitz continuous in $\theta$ with Lipschitz constant $G$, and that $\sup_{t,\theta\in R}\|\mathcal{L}_{t}(\epsilon_{k},\theta)-\mathcal{L}_{t}(0,\theta)\|\leq C\epsilon_{k}$, where $C$ is a constant. By the Gronwall inequality (Gronwall 1919), Schioppa et al. (Schioppa et al. 2023) derive an upper bound for the parameter divergence with respect to $\epsilon_{k}$:

\|\theta_{\epsilon_{k},T}-\theta_{0,T}\|\leq C\epsilon_{k}\sum_{s<T}\eta_{s}(1+\exp(2G\sum_{s<T}\eta_{s})). (11)

To admit a Taylor expansion, IFs make the critical assumptions that $\|\theta_{\epsilon_{k},T}-\theta_{0,T}\|$ is $O(\epsilon_{k})$ and $\|\frac{d(\theta_{\epsilon_{k},T}-\theta_{0,T})}{d\epsilon_{k}}\|$ is $O(1)$. However, (11) shows that the assumption of $\|\theta_{\epsilon_{k},T}-\theta_{0,T}\|$ being $O(\epsilon_{k})$ may not hold. This leads to the conclusion that IFs may only be able to accurately estimate influences within a very limited number of time steps.

While (11) seems to pose a theoretical challenge for IFs, this study derives a less conservative bound and proves that the effectiveness of IFs is determined by the gradient norm. Furthermore, this paper not only shows the necessity of introducing $\delta$ in (5), but also mitigates the concern raised by (11), which indicates that the influence of a sample may not be estimable as a function of $\epsilon_{k}$.

3 Methodology

This section begins with formalizing the task and the metric used for evaluating IFs. The conditions to make IFs effective are further derived. Finally, the general form of Adjusted IFs (AIFs) is introduced to satisfy the conditions we propose.

3.1 Problem Statement

Let $z_{i}\in D$ consist of $m$ labels, $z_{i}=\{z_{i,1},\dots,z_{i,m}\}$. The loss function for each label $z_{i,j}$ is represented by $l$, and the overall loss $L(z_{i},\theta)$ is given by the average of the individual losses, i.e., $L(z_{i},\theta)=\frac{1}{m}\sum_{j=1}^{m}l(z_{i,j},\theta)$.

Assume that $z_{target}\in D$ encapsulates unique knowledge that directly or indirectly leads to the model output $z_{test}$ (e.g., privacy leakage) during the inference stage. The objective of this work is to utilize IFs to identify $\widehat{z_{target}}$ such that:

\widehat{z_{target}}=\operatorname*{arg\,max}_{z_{k}\in D}I_{loss}(z_{test},z_{k}). (12)

A successful trace is one where the traced target $\widehat{z_{target}}$ matches the ground truth $z_{target}$; otherwise, it is considered a failed trace. Accordingly, the tracing accuracy is defined as:

Accuracy_{tracing}=\frac{N_{success}}{N_{tracing}}, (13)

where $N_{success}$ is the count of successful traces and $N_{tracing}$ represents the total number of test cases during the inference stage.
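As a concrete reading of (13), the metric is simply the fraction of test cases whose top-1 traced sample matches the ground truth; the sample IDs below are hypothetical:

```python
def tracing_accuracy(traced_ids, target_ids):
    """Eq. (13): fraction of test cases whose top-1 traced sample
    matches the ground-truth z_target."""
    assert len(traced_ids) == len(target_ids)
    n_success = sum(t == g for t, g in zip(traced_ids, target_ids))
    return n_success / len(traced_ids)

# 3 of 4 hypothetical traces hit the correct pretraining sample.
acc = tracing_accuracy(["bio_12", "bio_7", "bio_3", "bio_9"],
                       ["bio_12", "bio_7", "bio_3", "bio_1"])
# acc == 0.75
```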

3.2 Conditions for Using IFs in Deep Learning

Assume models are trained with a mini-batch SGD optimizer and a Learning Rate (LR) scheduler. After a sample is slightly down-weighted, the change in model parameters at each training step is:

\theta_{\epsilon_{k,j},t}^{down}-\theta_{\epsilon_{k,j},t-1}^{down}=-\eta_{t-1}\nabla\mathcal{L}_{t}(\epsilon_{k,j},\theta_{\epsilon_{k,j},t})=-\eta_{t-1}\sum_{i=1}^{n}B_{z_{i},t-1}\nabla_{\theta}L(z_{i},\theta_{\epsilon_{k,j},t-1}^{down})+\eta_{t-1}\epsilon_{k,j}B_{z_{k},t-1}\nabla_{\theta}l(z_{k,j},\theta_{\epsilon_{k,j},t-1}^{down}), (14)

where $B_{z_{i},t}=1$ if $z_{i}\in batch_{t}$ and $B_{z_{i},t}=0$ otherwise.

Lemma 1.

Assume that the model is trained with mini-batch SGD, that $l$ is $C^{2}(\theta)\cap C(\epsilon_{k,j})$, and that the parameters satisfy (14).

The influence of down-weighting $z_{k,j}$ on the parameters is given by:

I_{param}^{sgd,down}(z_{k,j})=\sum_{t=0}^{T-2}\left(\prod_{a=t+1}^{T-1}(Id-\eta_{a}H_{0,a})\right)\eta_{t}B_{z_{k},t}\nabla_{\theta}l(z_{k,j},\theta_{0,t})+\eta_{T-1}B_{z_{k},T-1}\nabla_{\theta}l(z_{k,j},\theta_{0,T-1})\quad\forall T\geq 2, (15)

where $H_{0,t}=\sum_{i=1}^{n}B_{z_{i},t}\frac{\partial^{2}L}{\partial\theta^{2}}$ and $\theta_{0,t}$ are, respectively, the Hessian matrix and the model parameters at step $t$ without altering the weight of $z_{k,j}$. The influence of up-weighting $z_{k,j}$ on the parameters is $I_{param}^{sgd,up}(z_{k,j})=-I_{param}^{sgd,down}(z_{k,j})$.

Proof of Lemma 1 can be found in Appendix A.1.

As discussed in Section 2, the fundamental assumption of IFs is that $\lim_{T\to\infty}I_{param}^{sgd}(z_{k,j})$ is $O(1)$. The following theorem gives a sufficient condition for the existence of IFs in most deep learning settings.

Theorem 1.

Suppose the assumptions in Lemma 1 hold. If $\eta_{t}$ is monotonically decreasing for all but finitely many time steps, and there exists some time step $t_{b}\neq\infty$ such that $\eta_{t}\|H_{0,t}\|_{2}<1\ \forall t\geq t_{b}$, then

\lim_{T\to\infty}\|I_{param}^{sgd}(z_{k,j})\|_{2}\leq\bigg\|\sum_{t=0}^{t_{b}-2}\left(\prod_{a=t+1}^{t_{b}-1}(Id-\eta_{a}H_{0,a})\right)\eta_{t}B_{z_{k},t}\nabla_{\theta}l(z_{k,j},\theta_{0,t})+\eta_{t_{b}-1}B_{z_{k},t_{b}-1}\nabla_{\theta}l(z_{k,j},\theta_{0,t_{b}-1})\bigg\|_{2}+\lim_{T\to\infty}\sum_{t=t_{b}}^{T-1}\eta_{t}B_{z_{k},t}\|\nabla_{\theta}l(z_{k,j},\theta_{0,t})\|_{2}. (16)

The convergence of $\lim_{T\to\infty}\|I_{param}^{sgd}(z_{k,j})\|_{2}$ is determined by

r=\lim_{t_{e}\to\infty}\frac{\|\nabla_{\theta}l(z_{k,j},\theta_{0,t_{f}})\|_{2}}{\|\nabla_{\theta}l(z_{k,j},\theta_{0,t_{e}})\|_{2}}, (17)

where $t_{e}<t_{f}$ are consecutive time steps such that $B_{z_{k},t_{e}}\neq 0$ and $B_{z_{k},t_{f}}\neq 0$.

Then, if $r<1$, meaning that the gradient norm of $z_{k,j}$ is constantly decreasing, $\lim_{T\to\infty}\|I_{param}^{sgd}(z_{k,j})\|_{2}$ converges, i.e., $I_{param}^{sgd}(z_{k,j})$ exists.

Proof of Theorem 1 can be found in Appendix A.2.

Notably, the definition of the LR scheduler in Theorem 1 aligns with the default HuggingFace (Wolf et al. 2020) schedulers and most LM LR schedulers with or without a warmup stage, such as linear, constant, and exponential LR schedulers. Theorem 1 leads to a strong conclusion: the influence of $z_{k,j}$ may not exist if the gradient norm of $z_{k,j}$ is not decreasing or approaching 0.

Additionally, even if $I^{sgd}_{param}(z_{k,j})$ converges, the estimated LOOR parameters $\theta_{\epsilon_{k,j},T}^{if}=\theta_{0,T}+\epsilon_{k,j}I_{param}^{sgd}(z_{k,j})$ may not be accurate. Removing a token from a dataset typically causes only minor changes in the parameters, which implies that the parameter change $\|\theta_{\epsilon_{k,j},T}-\theta_{0,T}\|_{2}$ should be small. If the gradient norm of a token is large, the RHS of (16) will also be large, probably resulting in an inaccurate estimate of $\theta_{\epsilon_{k,j},T}$. Therefore, we infer that accurate estimates should be obtained by computing $I^{sgd}_{param}(z_{k,j})$ for tokens with small gradient norms.

Next, we use Lemma 1 to identify the conditions under which $I_{param}^{sgd}(z_{k,j})=I_{param}^{hif}(z_{k,j})$, and show that the estimation errors between current IFs and $I_{param}^{sgd}(z_{k,j})$ can be amplified by the gradient norm of $z_{k,j}$.

Corollary 1.

Under the assumptions of Lemma 1, further assume that $T\to\infty$, $\eta_{t}$ converges to a constant $\eta_{c}$, $B_{z_{i},t}\equiv 1$, $\sum_{i=1}^{n}\nabla_{\theta}L(z_{i},\theta_{0,t_{c}})=0$ at $t_{c}$, $\|H_{0,t_{c}}\|_{2}<\frac{1}{\eta_{c}}$, and $H_{0,t_{c}}$ is non-singular. Then,

\lim_{T\to\infty}I^{sgd,down}_{param}(z_{k,j})=H_{t_{c}}^{-1}\nabla_{\theta}l(z_{k,j},\theta_{0,t_{c}}). (18)

If these assumptions do not hold, the errors they introduce can be amplified by the gradient norm. For example, if $B_{z_{i},t}\not\equiv 1$,

\|\lim_{T\to\infty}I_{param}^{sgd}(z_{k,j})-I_{param}^{hif}(z_{k,j})\|_{2}\leq\lim_{T\to\infty}\sum_{t=t_{c}}^{T-2}\bigg\|\prod_{a=t+1}^{T-1}(Id-\eta_{c}H_{0,t_{c}})\bigg\|_{2}(B_{z_{k},t}-1)\eta_{c}\|\nabla_{\theta}l(z_{k,j},\theta_{0,t_{c}})\|_{2}+(B_{z_{k},T-1}-1)\eta_{c}\|\nabla_{\theta}l(z_{k,j},\theta_{0,t_{c}})\|_{2}. (19)

Proof of Corollary 1 can be found in Appendix A.4.

Corollary 2.

Under the assumptions of Lemma 1,

\|I_{param}^{sgd}(z_{k,j})-I_{param}^{ttif}(z_{k,j})\|_{2}\leq\sum_{t=0}^{T-2}\bigg\|\left(\prod_{a=t+1}^{T-1}(Id-\eta_{a}H_{0,a})\right)-Id\bigg\|_{2}\eta_{t}B_{z_{k},t}\|\nabla_{\theta}l(z_{k,j},\theta_{0,t})\|_{2}. (20)

As demonstrated by Corollaries 1 and 2, the upper bound of the errors for HIFs and TTIFs can increase as the gradient norm grows. More importantly, we conducted the experiment on the MNIST dataset used in (Koh and Liang 2017), which verifies the consistency between $I_{loss}^{hif}$ and LOOR. We also trained a logistic regression model on the same dataset with $batch\_size=50$ and $T=5000$, but changed the optimizer to mini-batch SGD. For 100 training samples, we computed HIF and TTIF and performed LOOR. The discrepancies between the approximated $\theta_{\frac{1}{n},k}^{if}$ from IFs and the accurate $\theta_{\frac{1}{n},k}^{loor}$, plotted against the gradient norms of the samples, are shown in Figure 1. Note that $\epsilon_{k}=\frac{1}{n}$ here, which corresponds to removing the sample from the dataset. The estimation errors of IFs clearly grow linearly with the gradient norms, aligning with the theoretical analysis above.

Figure 1: Comparison of Parameter Estimation Errors and Gradient Norms. We trained a Logistic Regression model using an SGD optimizer on the 10-class MNIST dataset. For HIF, we calculate the perturbed inverse of the Hessian matrix, and batch information is also brought into consideration when using TTIF.

In our experiments on privacy tracing, we observed that current IFs always traced back to training samples containing tokens with large gradient norms. However, LOOR shows that their actual influence on the model parameters can be very small, sometimes even zero (examples can be found in Appendix A.8). When the influences of tokens with large gradient norms are overestimated, the well-estimated influence of $z_{target}$ can be overshadowed. Therefore, in the next section, we propose Adjusted Influence Functions to reduce the weights of influence scores computed from large gradient norms.

3.3 Adjusted Influence Functions

Based on (10), we adjust TTIFs by introducing a function $A$ of the gradient norm, which yields:

I_{loss}^{attif}(z_{test},z_{k}):=\nabla_{\theta}L(z_{test},\theta_{0,T})\,A\bigg(\sum_{j=1}^{m}\sum_{t=0}^{T-1}B_{k,t}\eta_{t}A(\nabla_{\theta}l(z_{k,j},\theta_{0,T}))\bigg), (21)

where $A(v)=W(\|v\|_{2})v$, and $W:R\rightarrow R$ is a monotonically decreasing function. $W$ serves as an index of estimated influences, assigning larger weights to smaller gradient norms. $A$ is applied twice to further penalize samples with large gradient norms. Similarly, based on (4), HIFs can be adjusted as:

I_{loss}^{ahif}(z_{test},z_{k}):=\nabla_{\theta}L(z_{test},\theta_{0,T})\,A\bigg(\sum_{j=1}^{m}A(I_{param}^{hif}(z_{k,j}))\bigg). (22)

To reduce the computational cost, by considering only the last time step, we propose HAIF to trace the most influential training samples in the following form:

I_{loss}^{haif}(z_{test},z_{k}):=\nabla_{\theta}L(z_{test},\theta_{0,T})\,A\bigg(\sum_{j=1}^{m}A(\nabla_{\theta}l(z_{k,j},\theta_{0,T}))\bigg). (23)

In our experiments, HAIF utilizes $A(v)=\frac{v}{\|v\|_{2}}$. HAIF-T is defined as $I_{loss}^{haif\text{-}t}(z_{test},z_{k}):=\nabla_{\theta}L(z_{test},\theta_{0,T})\sum_{j=1}^{m}A(\nabla_{\theta}l(z_{k,j},\theta_{0,T}))$, which adjusts gradients only token-wise and serves as an ablation. In comparison with $\theta$-RelatIF (6), which performs sample-wise gradient normalization, HAIF adjusts the weight of each token based on its gradient norm. If tokens with large gradient norms are not adjusted, their gradients may dominate the overall sentence gradient, thereby degrading the tracing performance.
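With $A(v)=v/\|v\|_{2}$, (23) normalizes each token gradient, sums them, and normalizes the sum again. A minimal sketch over stand-in per-token gradient arrays (not real LM gradients) shows how a single large-norm token is prevented from dominating the sample score:

```python
import numpy as np

def A(v):
    # A(v) = v / ||v||_2: down-weights directions with large gradient norm.
    return v / np.linalg.norm(v)

def haif_score(g_test, token_grads):
    """Eq. (23): I^haif = grad L(z_test) . A(sum_j A(grad l(z_{k,j})))."""
    inner = sum(A(g) for g in token_grads)   # token-wise normalization
    return float(g_test @ A(inner))          # then sample-wise normalization

def haif_t_score(g_test, token_grads):
    """HAIF-T ablation: only the token-wise normalization."""
    return float(g_test @ sum(A(g) for g in token_grads))

g_test = np.array([1.0, 0.0])
# One token with a huge gradient norm no longer dominates the sample score:
tokens = [np.array([0.9, 0.1]), np.array([100.0, -100.0])]
score = haif_score(g_test, tokens)

# The unadjusted gradient sum, in contrast, is dominated by the large token.
raw = float(g_test @ (tokens[0] + tokens[1]))
```

After the outer normalization, the HAIF score is bounded by $\|\nabla_{\theta}L(z_{test},\theta_{0,T})\|_{2}$, whereas the raw gradient product is unbounded in the token norms.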

We detail the algorithm for implementing HAIF in Appendix A.5; it is more computationally efficient than current IFs and can accurately trace the most influential samples.

4 Experiments

This section mainly endeavors to address the following questions:

  1. How well can LMs memorize and extract private information, and subsequently utilize the memorized information for reasoning?

  2. Under LM settings, can IFs accurately trace back to $z_{target}$ based on the privacy information leaked from LMs? Moreover, does HAIF outperform all SOTA IFs in terms of tracing accuracy?

  3. If LMs employ memorized knowledge for reasoning (specifically, when the predicted contents do not contain any text identical to the corresponding pre-training data), can HAIF maintain higher tracing accuracy than SOTA IFs?

Table 2: Tracing accuracy of GPT2 using the PII-E dataset. Since GPT2-102M has a limited number of correct predictions, its tracing accuracy may lack statistical significance. NAs are caused by OOM. Numbers are percentages computed by (13).
Num Params 102M 325M 737M 1510M
Tracing Types DOB Email Phone Addr Avg DOB Email Phone Addr Avg DOB Email Phone Addr Avg DOB Email Phone Addr Avg
LiSSA 100.00 0.00 0.00 NA 28.57 0.00 0.00 0.00 0.00 0.00 NA NA NA NA NA NA NA NA NA NA
EK-FAC 100.00 33.00 50.00 NA 57.14 16.30 0.00 33.33 3.03 13.41 12.03 0.00 0.00 7.58 9.56 1.00 6.90 0.00 10.77 9.26
RelatIF 100.00 33.33 50.00 NA 57.14 28.15 0.00 33.33 9.10 23.46 20.89 4.55 0.00 13.64 17.13 25.63 17.24 12.50 16.92 21.85
GradientProduct 100.00 0.00 0.00 NA 28.57 0.00 0.00 0.00 0.00 0.00 0.18 4.55 0.00 3.03 2.39 7.50 3.45 0.00 3.08 5.56
TracInCp 100.00 0.00 0.00 NA 28.57 0.74 0.00 0.00 51.52 10.06 0.18 4.55 0.00 3.03 2.39 NA NA NA NA NA
GradientCosine 100.00 0.00 0.00 NA 28.57 8.15 0.00 0.00 3.03 6.70 12.03 9.09 0.00 7.58 10.36 15.00 10.34 6.25 7.69 12.22
HAIF-T[Ours] 100.00 100.00 100.00 NA 100.00 65.19 75.00 1.00 69.70 67.04 87.34 86.36 80.00 78.79 84.86 96.25 89.66 100.00 89.23 94.07
HAIF[Ours] 100.00 100.00 100.00 NA 100.00 67.41 75.00 100.00 81.82 70.95 89.87 86.36 80.00 83.33 87.65 98.13 89.66 100.00 90.77 95.56
Table 3: Tracing accuracy of QWen1.5 using the PII-E dataset.
Num Params 464M 1840M
Tracing Types DOB Email Phone Addr Avg DOB Email Phone Addr Avg
LiSSA NA NA NA NA NA NA NA NA NA NA
EK-FAC 37.79 51.79 44.87 50.43 45.28 29.63 22.68 23.42 25.43 25.89
RelatIF 42.44 54.46 44.87 57.39 49.27 NA NA NA NA NA
GradientProduct 0.00 0.00 0.00 0.00 0.00 1.06 0.00 3.60 1.69 1.50
TracInCp 0.00 0.00 0.00 0.00 0.00 NA NA NA NA NA
GradientCosine 1.74 1.78 1.28 5.22 2.51 1.59 1.74 5.41 3.39 2.81
HAIF-T[Ours] 66.28 64.29 71.79 66.96 66.88 59.26 73.91 66.67 77.97 68.10
HAIF[Ours] 69.76 63.39 74.36 74.78 70.23 75.66 83.48 72.07 85.59 78.80

4.1 Datasets

In order to answer the previous questions, we begin by synthesizing two datasets that provide ground truth for tracing: the PII Extraction (PII-E) dataset and the PII Complex Reasoning (PII-CR) dataset. Dataset examples can be found in Appendix A.6. Please note that all PIIs (e.g., name, date of birth, and phone number) in this work are synthesized with Faker (https://github.com/joke2k/faker). No real individual's privacy is compromised.

PII-E

To emulate the LLM training process, we adopt a proxy similar to (Allen-Zhu and Li 2023) for PII-E synthesis. PII-E includes pretraining and instruction data. We generate virtual data subjects with unique names and four attributes: DOB, Email, Phone, and Address. QWen1.5-14B (Bai et al. 2023) transforms this into biographies as pretraining data. We create instruction data using the template: Question: What’s the {Attribute} of {Name}? Answer: {Attribute Value}. Training data combines the pretraining data with 60% of the instruction data; the remaining 40% is for testing. Model predictions for the test questions are the $z_{test}$s, and the pretraining data are the $z_{i}$s for tracing. Each $z_{test}$ has a corresponding $z_{target}$: the pretraining data of the same virtual subject. PII-E validates the PII extraction ability of LMs and the tracing ability of IFs, especially when responses include pretraining text.

PII-CR

PII-CR challenges models to perform one-step reasoning based on memorized information. It consists of pretraining and instruction data. We create mappings between festivals and dates, and between landmarks and provinces. Pretraining data are generated using the template: {Name} is born on {Festival Name}. Adjacent to {Name}’s place is {Landmark Building}. Instruction data asks for DOB (the festival date) and Address (the landmark province). PII-CR tests the PII reasoning ability of LMs and the tracing ability of IFs when responses do not include any text identical to the pretraining text.
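For illustration, a PII-E-style subject and its instruction pair can be synthesized with stdlib randomness in place of Faker; the attribute names and Q&A template follow Section 4.1, while the value generators and name pools below are simplified stand-ins:

```python
import random

random.seed(0)

FIRST = ["Alice", "Bob", "Carol"]   # hypothetical name pools, not the paper's
LAST = ["Smith", "Jones", "Lee"]

def synth_subject():
    """One virtual data subject with the four PII-E attributes."""
    name = f"{random.choice(FIRST)} {random.choice(LAST)}"
    return {
        "Name": name,
        "DOB": f"19{random.randint(50, 99)}-{random.randint(1, 12):02d}"
               f"-{random.randint(1, 28):02d}",
        "Email": f"{name.lower().replace(' ', '.')}@example.com",
        "Phone": "".join(random.choice("0123456789") for _ in range(10)),
        "Address": f"{random.randint(1, 999)} Main St",
    }

def instruction_pair(subject, attribute):
    """Instruction-data template from Section 4.1."""
    q = f"Question: What's the {attribute} of {subject['Name']}?"
    a = f"Answer: {subject[attribute]}"
    return q, a

subject = synth_subject()
q, a = instruction_pair(subject, "Email")
```

In the actual datasets, the biography (pretraining) text is produced by QWen1.5-14B from such records rather than by a fixed template.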

4.2 Privacy Learning Abilities of LM

We use the classical GPT2 series and the QWen1.5 series (one of the SOTA open-source LLMs (Huang et al. 2023)), and fine-tune models of various series and scales on the PII-E and PII-CR datasets. Generally, increased parameter counts and advanced architectures enhance PII extraction, with all PII types except DOB posing extraction challenges due to variations in their length and uniqueness. A similar pattern is observed in models trained on the PII-CR dataset, but the need for reasoning alongside memorization reduces prediction accuracy compared to the PII-E dataset. More details can be found in Appendix A.7.

4.3 Dataset Validation

Let us revisit a key statement from Sections 3.1 and 4.1: the groundtruth $z_{target}$ for a prediction $z_{test}$ is the corresponding pretraining sample from the same virtual data subject.

Previous researchers (Koh and Liang 2017) regard LOOR as the gold standard. However, due to its sensitivity to training hyperparameters (K and Søgaard 2021), detection of poisoned training data has been argued to be a better metric. In our context, given the uniqueness of PIIs, pretraining data leading to PII leakage aligns with the definition of poisoned data.

We perform LOOR on PII-E and PII-CR; training details can be found in Appendix A.9. Table 1 shows the agreement ratio (equivalent to the tracing accuracy of LOOR) between the expected target and LOOR.

Table 1: Agreement Ratio (%) of the Expected Target and LOOR for PII-E and PII-CR Datasets

                     | PII-E  | PII-CR
Correct Prediction   | 100.00 | 13.33
Incorrect Prediction | 98.00  | 13.85

For the PII-E dataset, the agreement ratio reaches 100%. Despite the dispute over the gold standard and the long-tail issue, IFs should identify the most influential pretraining sample for all $z_{test}$s in the PII-E dataset. However, the agreement ratio on the PII-CR dataset is low. We believe this is because predicting the correct PII requires the model both to remember the knowledge in the corresponding pretraining data and to reason over that knowledge, and the reasoning is influenced by other pretraining data. Both factors affect the loss of $z_{test}$.
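The agreement ratio used here can be sketched as follows: for each test point, LOOR nominates the training sample whose removal increases the test loss the most, and we check whether it matches the expected target. The loss deltas below are toy values:

```python
def agreement_ratio(loor_deltas, expected_targets):
    """
    loor_deltas[i][k]: increase in the loss of z_test_i when training sample k
    is left out and the model is retrained (larger = more influential).
    expected_targets[i]: index of the pretraining sample from the same
    virtual data subject.
    """
    hits = sum(
        1 for deltas, target in zip(loor_deltas, expected_targets)
        if max(range(len(deltas)), key=deltas.__getitem__) == target
    )
    return hits / len(expected_targets)

# Toy illustration: LOOR agrees with the expected target for 2 of 3 test points.
deltas = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.6, 0.3, 0.8]]
ratio = agreement_ratio(deltas, [0, 1, 0])
```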

Table 4: Tracing accuracy (%) of QWen1.5 and GPT2 on the PII-CR dataset (DOB / Addr / Avg per model)

Method          | QWen1.5-464M          | QWen1.5-1840M         | GPT2-102M            | GPT2-325M             | GPT2-737M             | GPT2-1510M
LiSSA           | NA / NA / NA          | NA / NA / NA          | 4.17 / 10.00 / 5.88  | 2.63 / 0.00 / 2.62    | NA / NA / NA          | NA / NA / NA
EK-FAC          | 13.18 / 14.71 / 13.71 | 11.11 / 9.09 / 10.04  | 8.33 / 0.00 / 5.88   | 5.26 / 0.00 / 4.12    | 5.06 / 5.26 / 5.10    | 16.98 / 11.76 / 15.71
RelatIF         | 21.71 / 33.82 / 25.88 | NA / NA / NA          | 16.67 / 20.00 / 17.65 | 10.53 / 4.76 / 9.28  | 17.72 / 21.05 / 18.36 | 16.98 / 17.65 / 17.14
GradientProduct | 0.00 / 0.00 / 0.00    | 0.00 / 0.00 / 0.00    | 4.17 / 10.00 / 5.88  | 2.63 / 0.00 / 2.06    | 2.53 / 0.00 / 2.04    | 11.32 / 17.65 / 12.86
TracInCp        | 0.00 / 0.00 / 0.00    | 0.00 / 0.00 / 0.00    | 4.17 / 10.00 / 5.88  | 2.63 / 0.00 / 2.06    | 2.53 / 5.26 / 3.06    | 11.32 / 17.65 / 12.86
GradientCosine  | 2.33 / 7.35 / 4.06    | 2.56 / 2.27 / 2.41    | 8.33 / 20.00 / 11.76 | 6.58 / 4.76 / 6.19    | 8.86 / 21.05 / 11.22  | 1.89 / 17.65 / 5.71
HAIF-T [Ours]   | 31.78 / 55.88 / 40.10 | 2.56 / 18.94 / 11.24  | 16.67 / 20.00 / 17.65 | 50.00 / 33.33 / 46.39 | 58.23 / 78.95 / 62.24 | 47.17 / 47.06 / 47.14
HAIF [Ours]     | 34.88 / 55.88 / 42.13 | 2.56 / 22.73 / 13.25  | 16.67 / 20.00 / 17.65 | 51.32 / 33.33 / 47.42 | 60.76 / 78.95 / 64.29 | 50.94 / 47.06 / 50.00
Figure 2: Tracing accuracy on CLUECorpus2020 dataset with different token offsets and lengths. (a) Tracing accuracy vs token length without token offset. (b) Tracing accuracy vs token length with fixed token offset 64. (c) Tracing accuracy vs token offset with fixed token length 64.

4.4 Privacy Tracing Accuracy

We compare the tracing accuracy of six SOTA influence functions (IFs) across model series, scales, datasets, and PII types. The six baselines, discussed in Section 2, are EK-FAC (Grosse et al. 2023), LiSSA (Bae et al. 2022; Koh and Liang 2017), RelatIF (Barshan, Brunet, and Dziugaite 2020), TracInCP (Pruthi et al. 2020), GradientProduct, and GradientCosine. The detailed configurations of the IFs are listed in Appendix A.10.
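As a minimal sketch (not the baselines' actual implementations), the two gradient-based baselines score a training sample by comparing its gradient with the test gradient. The toy vectors also illustrate the failure mode this paper targets: an unnormalized product is dominated by large gradient norms, while the cosine variant favors alignment:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def gradient_product(g_test, g_train):
    # Influence score as the inner product of test and training gradients.
    return dot(g_test, g_train)

def gradient_cosine(g_test, g_train):
    # Normalized variant: cosine similarity of the two gradients.
    return dot(g_test, g_train) / (
        math.sqrt(dot(g_test, g_test)) * math.sqrt(dot(g_train, g_train)))

g_test = [1.0, 0.0]
g_small = [0.5, 0.0]     # perfectly aligned with the test gradient, small norm
g_large = [10.0, 10.0]   # large-norm gradient, only partially aligned

# GradientProduct ranks the large-norm sample first;
# GradientCosine ranks the aligned sample first.
assert gradient_product(g_test, g_large) > gradient_product(g_test, g_small)
assert gradient_cosine(g_test, g_small) > gradient_cosine(g_test, g_large)
```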

PII Extraction Tracing Abilities

In our experiments, incorrectly predicted privacy information is not regarded as a threat; therefore, we only trace the correct predictions of each model. Please note that the maximum tracing accuracy on PII-E is 100% regardless of the choice of gold standard.

Table 2 shows the tracing accuracy of IFs on various GPT2 scales on the PII-E dataset. HAIF significantly improves tracing accuracy across all GPT2 scales and PII types. Compared to the best baseline, HAIF-T enhances average tracing accuracy from 43.58% to 73.71%. The tracing accuracy of HAIF-T increases as models scale up, achieving 94.07% on GPT2-1.5B, and this performance is consistent across PII types. Additionally, HAIF boosts tracing accuracy for all model scales by 1.49% to 3.91%, reaching 95.56% on GPT2-1.5B. Table 3 presents tracing accuracies for IFs on QWen1.5 models on the PII-E dataset. HAIF again surpasses the best baseline, achieving tracing accuracies of 70.23% and 78.80% on the QWen series models.

PII Reasoning Tracing Accuracy

While PII-E is a good benchmark for IFs, its outputs contain exact pretraining content, so good results could also be obtained via text search. Hence, we use the PII-CR dataset to test IFs when LM outputs result from reasoning rather than from identical pretraining text.

The comparison results are reported in Table 4, which shows the tracing accuracies of IFs on various GPT2 and QWen1.5 scales on the PII-CR dataset. Note that the maximum tracing accuracy is not 100%, because random guessing alone accounts for 5.5% of PII prediction accuracy on the PII-CR dataset; for example, the tracing accuracy of QWen1.5-0.5B on PII-CR should be less than 88.83%. Lower PII prediction accuracies thus imply lower maximum tracing accuracies.

Generally, the increased task difficulty leads to lower prediction and tracing accuracies compared to the PII-E dataset. Although the LM outputs share no text with the pretraining data, HAIF maintains the highest tracing accuracy across all models. For the GPT2 series, HAIF improves tracing accuracy by 32.86% to 45.93% over the best baseline; for the QWen1.5 series, the improvements range from 3.21% to 16.25%.

4.5 Tracing Accuracy on CLUECorpus2020

To further validate the effectiveness of the proposed HAIF, we conduct experiments on the real-world pretraining dataset CLUECorpus2020 (Xu, Zhang, and Dong 2020). The text is tokenized and concatenated to form the pretraining dataset, where $n=100$ and $m=256$. The test set mirrors the training set, with two additional hyperparameters, offset and length, controlling the test loss, defined as $L(z_{test_{i}},\theta)=\sum_{j=offset}^{offset+length}l(z_{i,j},\theta)$. Here, the offset simulates the user prompt, and the length denotes the model response length. LOOR experiments confirm that each $z_{i}$ is the most influential training sample for $z_{test_{i}}$. According to the previous experiments, the best baseline performance occurs with QWen1.5-0.5B; it is therefore employed to test the various IFs under different lengths and offsets.
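The windowed test loss above can be sketched as follows. The per-token losses are hypothetical values; the summation bounds are inclusive, mirroring the definition, and the larger early-token losses illustrate why the offset matters:

```python
def windowed_test_loss(token_losses, offset, length):
    """
    Sum per-token losses l(z_{i,j}) over j in [offset, offset + length],
    mirroring L(z_test_i, theta) = sum_{j=offset}^{offset+length} l(z_{i,j}, theta).
    The offset skips tokens that play the role of the user prompt; the
    length controls how much of the simulated response is scored.
    """
    return sum(token_losses[offset:offset + length + 1])

# Hypothetical per-token losses: initial tokens tend to have the largest values.
losses = [2.0, 1.5, 1.0, 0.75, 0.5, 0.25]

no_offset = windowed_test_loss(losses, 0, 2)    # dominated by early tokens
with_offset = windowed_test_loss(losses, 2, 2)  # offset excludes them
```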

Fig. 2(a) illustrates that, without a token offset, RelatIF, EK-FAC, GradientCosine, and HAIF exhibit comparable tracing accuracy as the token length varies. This can be attributed to the fact that large gradient norms are mainly found in the initial tokens: when $offset=0$, $z_{test_{i}}$ and $z_{i}$ share identical large-norm gradients, enabling RelatIF, EK-FAC, and GradientCosine to maintain high tracing accuracy. However, as depicted in Fig. 2(b), introducing an offset (which simulates user prompts) decreases the tracing accuracy of all IFs except HAIF. The robustness of HAIF is further demonstrated in Fig. 2(c), where HAIF maintains its performance trend, unaffected by the increasing token offset. It is noteworthy that tracing accuracy improves with token length regardless of the token offset. Since user prompts are typically non-empty and privacy information typically comprises a limited number of tokens, HAIF outperforms all SOTAs in such scenarios.

4.6 Conclusions

This work implements IFs to trace privacy leakage back to the most influential pretraining data, thereby addressing the concern of privacy leakage in LMs. Our analysis reveals that tokens with large gradient norms can increase errors in LOOR parameter estimation and amplify the discrepancy between existing IFs and actual influence. We therefore propose a novel IF, HAIF, which reduces the weights of tokens with large gradient norms, providing high tracing accuracy at lower computational cost than other IFs. To better validate the proposed method, we construct two datasets, PII-E and PII-CR, in which the groundtruth of privacy leakage can be located. Experiments on different model series and scales demonstrate that HAIFs significantly improve tracing accuracy in comparison with all SOTA IFs. Moreover, on the real-world pretraining data CLUECorpus2020, HAIF consistently outperforms all SOTA IFs, exhibiting strong robustness across varying prompt and response lengths.

References

  • Agarwal, Bullins, and Hazan (2017) Agarwal, N.; Bullins, B.; and Hazan, E. 2017. Second-Order Stochastic Optimization for Machine Learning in Linear Time. Journal of Machine Learning Research, 18(116): 1–40.
  • Allen-Zhu and Li (2023) Allen-Zhu, Z.; and Li, Y. 2023. Physics of Language Models: Part 3.1, Knowledge Storage and Extraction. ArXiv:2309.14316 [cs].
  • Bae et al. (2022) Bae, J.; Ng, N.; Lo, A.; Ghassemi, M.; and Grosse, R. B. 2022. If Influence Functions are the Answer, Then What is the Question? Advances in Neural Information Processing Systems, 35: 17953–17967.
  • Bai et al. (2023) Bai, J.; Bai, S.; Chu, Y.; Cui, Z.; Dang, K.; Deng, X.; Fan, Y.; Ge, W.; Han, Y.; Huang, F.; Hui, B.; Ji, L.; Li, M.; Lin, J.; Lin, R.; Liu, D.; Liu, G.; Lu, C.; Lu, K.; Ma, J.; Men, R.; Ren, X.; Ren, X.; Tan, C.; Tan, S.; Tu, J.; Wang, P.; Wang, S.; Wang, W.; Wu, S.; Xu, B.; Xu, J.; Yang, A.; Yang, H.; Yang, J.; Yang, S.; Yao, Y.; Yu, B.; Yuan, H.; Yuan, Z.; Zhang, J.; Zhang, X.; Zhang, Y.; Zhang, Z.; Zhou, C.; Zhou, J.; Zhou, X.; and Zhu, T. 2023. Qwen Technical Report. ArXiv:2309.16609 [cs].
  • Barshan, Brunet, and Dziugaite (2020) Barshan, E.; Brunet, M.-E.; and Dziugaite, G. K. 2020. RelatIF: Identifying Explanatory Training Samples via Relative Influence. In Proceedings of the Twenty Third International Conference on Artificial Intelligence and Statistics, 1899–1909. PMLR. ISSN: 2640-3498.
  • Chen et al. (2023) Chen, R.; Yang, J.; Xiong, H.; Bai, J.; Hu, T.; Hao, J.; Feng, Y.; Zhou, J. T.; Wu, J.; and Liu, Z. 2023. Fast Model DeBias with Machine Unlearning. Advances in Neural Information Processing Systems, 36: 14516–14539.
  • Chen et al. (2024) Chen, Y.; Zhang, Z.; Han, X.; Xiao, C.; Liu, Z.; Chen, C.; Li, K.; Yang, T.; and Sun, M. 2024. Robust and Scalable Model Editing for Large Language Models. ArXiv:2403.17431 [cs].
  • Fisher et al. (2023) Fisher, J.; Liu, L.; Pillutla, K.; Choi, Y.; and Harchaoui, Z. 2023. Influence Diagnostics under Self-concordance. In Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, 10028–10076. PMLR. ISSN: 2640-3498.
  • George et al. (2018) George, T.; Laurent, C.; Bouthillier, X.; Ballas, N.; and Vincent, P. 2018. Fast Approximate Natural Gradient Descent in a Kronecker Factored Eigenbasis. In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
  • Gronwall (1919) Gronwall, T. H. 1919. Note on the Derivatives with Respect to a Parameter of the Solutions of a System of Differential Equations. Annals of Mathematics, 20(4): 292–296. Publisher: [Annals of Mathematics, Trustees of Princeton University on Behalf of the Annals of Mathematics, Mathematics Department, Princeton University].
  • Grosse et al. (2023) Grosse, R.; Bae, J.; Anil, C.; Elhage, N.; Tamkin, A.; Tajdini, A.; Steiner, B.; Li, D.; Durmus, E.; Perez, E.; Hubinger, E.; Lukošiūtė, K.; Nguyen, K.; Joseph, N.; McCandlish, S.; Kaplan, J.; and Bowman, S. R. 2023. Studying Large Language Model Generalization with Influence Functions. ArXiv:2308.03296 [cs, stat].
  • Guu et al. (2023) Guu, K.; Webson, A.; Pavlick, E.; Dixon, L.; Tenney, I.; and Bolukbasi, T. 2023. Simfluence: Modeling the Influence of Individual Training Examples by Simulating Training Runs. ArXiv:2303.08114 [cs].
  • Huang et al. (2023) Huang, Y.; Bai, Y.; Zhu, Z.; Zhang, J.; Zhang, J.; Su, T.; Liu, J.; Lv, C.; Zhang, Y.; Lei, J.; Fu, Y.; Sun, M.; and He, J. 2023. C-Eval: A Multi-Level Multi-Discipline Chinese Evaluation Suite for Foundation Models. Advances in Neural Information Processing Systems, 36: 62991–63010.
  • K and Søgaard (2021) K, K.; and Søgaard, A. 2021. Revisiting Methods for Finding Influential Examples. ArXiv:2111.04683 [cs].
  • Koh and Liang (2017) Koh, P. W.; and Liang, P. 2017. Understanding Black-box Predictions via Influence Functions. In Proceedings of the 34th International Conference on Machine Learning, 1885–1894. PMLR. ISSN: 2640-3498.
  • Lee et al. (2020) Lee, D.; Park, H.; Pham, T.; and Yoo, C. D. 2020. Learning Augmentation Network via Influence Functions. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10958–10967. ISSN: 2575-7075.
  • Li et al. (2023) Li, H.; Guo, D.; Fan, W.; Xu, M.; Huang, J.; Meng, F.; and Song, Y. 2023. Multi-step Jailbreaking Privacy Attacks on ChatGPT. In Bouamor, H.; Pino, J.; and Bali, K., eds., Findings of the Association for Computational Linguistics: EMNLP 2023, 4138–4153. Singapore: Association for Computational Linguistics.
  • Liu et al. (2024) Liu, S.; Yao, Y.; Jia, J.; Casper, S.; Baracaldo, N.; Hase, P.; Xu, X.; Yao, Y.; Li, H.; Varshney, K. R.; Bansal, M.; Koyejo, S.; and Liu, Y. 2024. Rethinking Machine Unlearning for Large Language Models. ArXiv:2402.08787 [cs].
  • Martens and Grosse (2015) Martens, J.; and Grosse, R. 2015. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML’15, 2408–2417. Lille, France: JMLR.org.
  • Nasr et al. (2023) Nasr, M.; Carlini, N.; Hayase, J.; Jagielski, M.; Cooper, A. F.; Ippolito, D.; Choquette-Choo, C. A.; Wallace, E.; Tramèr, F.; and Lee, K. 2023. Scalable Extraction of Training Data from (Production) Language Models. ArXiv:2311.17035 [cs].
  • Pruthi et al. (2020) Pruthi, G.; Liu, F.; Kale, S.; and Sundararajan, M. 2020. Estimating Training Data Influence by Tracing Gradient Descent. In Advances in Neural Information Processing Systems, volume 33, 19920–19930. Curran Associates, Inc.
  • Rajbhandari et al. (2020) Rajbhandari, S.; Rasley, J.; Ruwase, O.; and He, Y. 2020. ZeRO: Memory optimizations Toward Training Trillion Parameter Models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 1–16.
  • Schioppa et al. (2023) Schioppa, A.; Filippova, K.; Titov, I.; and Zablotskaia, P. 2023. Theoretical and Practical Perspectives on what Influence Functions Do. Advances in Neural Information Processing Systems, 36: 27560–27581.
  • Schioppa et al. (2022) Schioppa, A.; Zablotskaia, P.; Vilar, D.; and Sokolov, A. 2022. Scaling Up Influence Functions. Proceedings of the AAAI Conference on Artificial Intelligence, 36(8): 8179–8186. Number: 8.
  • Sun et al. (2024) Sun, L.; Huang, Y.; Wang, H.; Wu, S.; Zhang, Q.; Li, Y.; Gao, C.; Huang, Y.; Lyu, W.; Zhang, Y.; Li, X.; Liu, Z.; Liu, Y.; Wang, Y.; Zhang, Z.; Vidgen, B.; Kailkhura, B.; Xiong, C.; Xiao, C.; Li, C.; Xing, E.; Huang, F.; Liu, H.; Ji, H.; Wang, H.; Zhang, H.; Yao, H.; Kellis, M.; Zitnik, M.; Jiang, M.; Bansal, M.; Zou, J.; Pei, J.; Liu, J.; Gao, J.; Han, J.; Zhao, J.; Tang, J.; Wang, J.; Vanschoren, J.; Mitchell, J.; Shu, K.; Xu, K.; Chang, K.-W.; He, L.; Huang, L.; Backes, M.; Gong, N. Z.; Yu, P. S.; Chen, P.-Y.; Gu, Q.; Xu, R.; Ying, R.; Ji, S.; Jana, S.; Chen, T.; Liu, T.; Zhou, T.; Wang, W.; Li, X.; Zhang, X.; Wang, X.; Xie, X.; Chen, X.; Wang, X.; Liu, Y.; Ye, Y.; Cao, Y.; Chen, Y.; and Zhao, Y. 2024. TrustLLM: Trustworthiness in Large Language Models. ArXiv:2401.05561 [cs].
  • Wang et al. (2024) Wang, B.; Chen, W.; Pei, H.; Xie, C.; Kang, M.; Zhang, C.; Xu, C.; Xiong, Z.; Dutta, R.; Schaeffer, R.; Truong, S. T.; Arora, S.; Mazeika, M.; Hendrycks, D.; Lin, Z.; Cheng, Y.; Koyejo, S.; Song, D.; and Li, B. 2024. DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models. ArXiv:2306.11698 [cs].
  • Wei, Haghtalab, and Steinhardt (2023) Wei, A.; Haghtalab, N.; and Steinhardt, J. 2023. Jailbroken: How Does LLM Safety Training Fail? Advances in Neural Information Processing Systems, 36: 80079–80110.
  • Wolf et al. (2020) Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; Davison, J.; Shleifer, S.; von Platen, P.; Ma, C.; Jernite, Y.; Plu, J.; Xu, C.; Le Scao, T.; Gugger, S.; Drame, M.; Lhoest, Q.; and Rush, A. 2020. Transformers: State-of-the-Art Natural Language Processing. In Liu, Q.; and Schlangen, D., eds., Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 38–45. Online: Association for Computational Linguistics.
  • Xia et al. (2024) Xia, M.; Malladi, S.; Gururangan, S.; Arora, S.; and Chen, D. 2024. LESS: Selecting Influential Data for Targeted Instruction Tuning. ArXiv:2402.04333 [cs].
  • Xu, Zhang, and Dong (2020) Xu, L.; Zhang, X.; and Dong, Q. 2020. CLUECorpus2020: A Large-scale Chinese Corpus for Pre-training Language Model. ArXiv:2003.01355 [cs] version: 2.
  • Yang et al. (2024) Yang, J.; Jin, H.; Tang, R.; Han, X.; Feng, Q.; Jiang, H.; Zhong, S.; Yin, B.; and Hu, X. 2024. Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. ACM Transactions on Knowledge Discovery from Data, 18(6): 160:1–160:32.
  • Yao et al. (2024) Yao, Y.; Duan, J.; Xu, K.; Cai, Y.; Sun, Z.; and Zhang, Y. 2024. A Survey on Large Language Model (LLM) Security and Privacy: The Good, The Bad, and The Ugly. High-Confidence Computing, 4(2): 100211.
  • Zhou et al. (2023) Zhou, C.; Li, Q.; Li, C.; Yu, J.; Liu, Y.; Wang, G.; Zhang, K.; Ji, C.; Yan, Q.; He, L.; Peng, H.; Li, J.; Wu, J.; Liu, Z.; Xie, P.; Xiong, C.; Pei, J.; Yu, P. S.; and Sun, L. 2023. A Comprehensive Survey on Pretrained Foundation Models: A History from BERT to ChatGPT. ArXiv:2302.09419 [cs].
  • Zylberajch, Lertvittayakumjorn, and Toni (2021) Zylberajch, H.; Lertvittayakumjorn, P.; and Toni, F. 2021. HILDIF: Interactive Debugging of NLI Models Using Influence Functions. In Brantley, K.; Dan, S.; Gurevych, I.; Lee, J.-U.; Radlinski, F.; Schütze, H.; Simpson, E.; and Yu, L., eds., Proceedings of the First Workshop on Interactive Learning for Natural Language Processing, 1–6. Online: Association for Computational Linguistics.

Appendix A Appendix

A.1 Proof of Lemma 1

Lemma.

Assume that the model is trained with mini-batch SGD, where $l$ is $C^{2}(\theta)\cap C(\epsilon_{k,j})$, and the parameters satisfy

\begin{split}\theta_{\epsilon_{k,j},t}^{down}-\theta_{\epsilon_{k,j},t-1}^{down}=&-\eta_{t-1}\sum_{i=1}^{n}B_{z_{i},t-1}\nabla_{\theta}L(z_{i},\theta_{\epsilon_{k,j},t-1}^{down})\\ &+\eta_{t-1}\epsilon_{k,j}B_{z_{k},t-1}\nabla_{\theta}l(z_{k,j},\theta_{\epsilon_{k,j},t-1}^{down}).\end{split} (24)

The influence of down-weighting $z_{k,j}$ on the parameters is given by:

\begin{split}&I_{param}^{sgd,down}(z_{k,j})=\\ &\sum_{t=0}^{T-2}\left(\prod_{a=t+1}^{T-1}(Id-\eta_{a}H_{0,a})\right)\eta_{t}B_{z_{k},t}\nabla_{\theta}l(z_{k,j},\theta_{0,t})\\ &+\eta_{T-1}B_{z_{k},T-1}\nabla_{\theta}l(z_{k,j},\theta_{0,T-1})\quad\forall T\geq 2,\end{split} (25)

where $H_{0,t}=\sum_{i=1}^{n}B_{z_{i},t}\frac{\partial^{2}L}{\partial\theta^{2}}$ and $\theta_{0,t}$ are the Hessian matrix and model parameters at step $t$ without altering the weight of $z_{k,j}$, respectively.

Proof.
\begin{split}\theta_{\epsilon_{k,j},1}^{down}-\theta_{init}=&-\eta_{0}\sum_{i=1}^{n}B_{z_{i},0}\nabla_{\theta}L(z_{i},\theta_{\epsilon_{k,j},0}^{down})\\ &+\eta_{0}\epsilon_{k,j}B_{z_{k},0}\nabla_{\theta}l(z_{k,j},\theta_{\epsilon_{k,j},0}^{down}),\end{split} (26a)

\begin{split}\theta_{\epsilon_{k,j},2}^{down}-\theta_{\epsilon_{k,j},1}^{down}=&-\eta_{1}\sum_{i=1}^{n}B_{z_{i},1}\nabla_{\theta}L(z_{i},\theta_{\epsilon_{k,j},1}^{down})\\ &+\eta_{1}\epsilon_{k,j}B_{z_{k},1}\nabla_{\theta}l(z_{k,j},\theta_{\epsilon_{k,j},1}^{down}),\end{split} (26b)

\vdots

\begin{split}\theta_{\epsilon_{k,j},T}^{down}-\theta_{\epsilon_{k,j},T-1}^{down}=&-\eta_{T-1}\sum_{i=1}^{n}B_{z_{i},T-1}\nabla_{\theta}L(z_{i},\theta_{\epsilon_{k,j},T-1}^{down})\\ &+\eta_{T-1}\epsilon_{k,j}B_{z_{k},T-1}\nabla_{\theta}l(z_{k,j},\theta_{\epsilon_{k,j},T-1}^{down}).\end{split} (26c)

By summing the equations in (26), the parameters at time step $T$ are:

\begin{split}\theta_{\epsilon_{k,j},T}^{down}=&\theta_{init}-\sum_{t=0}^{T-1}\eta_{t}\sum_{i=1}^{n}B_{z_{i},t}\nabla_{\theta}L(z_{i},\theta_{\epsilon_{k,j},t}^{down})\\ &+\sum_{t=0}^{T-1}\eta_{t}\epsilon_{k,j}B_{z_{k},t}\nabla_{\theta}l(z_{k,j},\theta_{\epsilon_{k,j},t}^{down}).\end{split} (27)

By taking the derivative of $\theta_{\epsilon_{k,j},T}^{down}$ with respect to $\epsilon_{k,j}$,

\begin{split}&\frac{d\theta_{\epsilon_{k,j},T}^{down}}{d\epsilon_{k,j}}\\ =&-\sum_{t=0}^{T-1}\eta_{t}\sum_{i=1}^{n}B_{z_{i},t}\frac{d(\nabla_{\theta}L(z_{i},\theta_{\epsilon_{k,j},t}^{down}))}{d\theta_{\epsilon_{k,j},t}^{down}}\frac{d\theta_{\epsilon_{k,j},t}^{down}}{d\epsilon_{k,j}}\\ &+\sum_{t=0}^{T-1}\eta_{t}B_{z_{k},t}\nabla_{\theta}l(z_{k,j},\theta_{\epsilon_{k,j},t}^{down})\\ &+\sum_{t=0}^{T-1}\eta_{t}\epsilon_{k,j}B_{z_{k},t}\frac{d(\nabla_{\theta}l(z_{k,j},\theta_{\epsilon_{k,j},t}^{down}))}{d\theta_{\epsilon_{k,j},t}^{down}}\frac{d\theta_{\epsilon_{k,j},t}^{down}}{d\epsilon_{k,j}}\\ =&\sum_{t=0}^{T-1}\eta_{t}B_{z_{k},t}\nabla_{\theta}l(z_{k,j},\theta_{\epsilon_{k,j},t}^{down})-\eta_{t}H_{\epsilon_{k,j},t}\frac{d\theta_{\epsilon_{k,j},t}^{down}}{d\epsilon_{k,j}}.\end{split} (28)

Then,

\begin{split}&I_{param}^{sgd,down}(z_{k,j}):=\frac{d\theta_{\epsilon_{k,j},T}^{down}}{d\epsilon_{k,j}}\bigg{|}_{\epsilon_{k,j}=0}\\ &=\sum_{t=0}^{T-1}\eta_{t}B_{z_{k},t}\nabla_{\theta}l(z_{k,j},\theta_{0,t})-\eta_{t}H_{0,t}\frac{d\theta_{\epsilon_{k,j},t}^{down}}{d\epsilon_{k,j}}\bigg{|}_{\epsilon_{k,j}=0},\end{split} (29)

where $H_{\epsilon_{k,j},t}=\sum_{i=1}^{n}B_{z_{i},t}\frac{\partial^{2}L}{\partial\theta^{2}}-\epsilon_{k,j}B_{z_{k},t}\frac{\partial^{2}l}{\partial\theta^{2}}$.

Thus, $I_{param}^{sgd,down}(z_{k,j})=\eta_{0}B_{z_{k},0}\nabla_{\theta}l(z_{k,j},\theta_{0,0})$ when $T=1$.

Next, we prove (25) by mathematical induction.

Base case: When T=2T=2,

\begin{split}&I_{param}^{sgd,down}(z_{k,j})\\ =&\eta_{0}B_{z_{k},0}\nabla_{\theta}l(z_{k,j},\theta_{0,0})+\eta_{1}B_{z_{k},1}\nabla_{\theta}l(z_{k,j},\theta_{0,1})\\ &-\eta_{1}H_{0,1}\eta_{0}B_{z_{k},0}\nabla_{\theta}l(z_{k,j},\theta_{0,0})\\ =&(Id-\eta_{1}H_{0,1})\eta_{0}B_{z_{k},0}\nabla_{\theta}l(z_{k,j},\theta_{0,0})\\ &+\eta_{1}B_{z_{k},1}\nabla_{\theta}l(z_{k,j},\theta_{0,1}),\end{split} (30)

which satisfies (25).

I.H.: Suppose the statement holds for time step $T=q$ for some $q\geq 2$, namely,

\begin{split}&I_{param}^{sgd,down}(z_{k,j})=\\ &\sum_{t=0}^{q-2}\left(\prod_{a=t+1}^{q-1}(Id-\eta_{a}H_{0,a})\right)\eta_{t}B_{z_{k},t}\nabla_{\theta}l(z_{k,j},\theta_{0,t})\\ &+\eta_{q-1}B_{z_{k},q-1}\nabla_{\theta}l(z_{k,j},\theta_{0,q-1}).\end{split} (31)

I.S.: For $T=q+1$, by expanding (29),

\begin{split}&I_{param}^{sgd,down}(z_{k,j})\\ =&\quad\eta_{0}B_{z_{k},0}\nabla_{\theta}l(z_{k,j},\theta_{0,0})\\ &+\eta_{1}B_{z_{k},1}\nabla_{\theta}l(z_{k,j},\theta_{0,1})\\ &+\eta_{2}B_{z_{k},2}\nabla_{\theta}l(z_{k,j},\theta_{0,2})\\ &+\dots\\ &+\eta_{q}B_{z_{k},q}\nabla_{\theta}l(z_{k,j},\theta_{0,q})\\ &-\eta_{1}H_{0,1}\frac{d\theta_{\epsilon_{k,j},1}^{down}}{d\epsilon_{k,j}}\bigg{|}_{\epsilon_{k,j}=0}\\ &-\eta_{2}H_{0,2}\frac{d\theta_{\epsilon_{k,j},2}^{down}}{d\epsilon_{k,j}}\bigg{|}_{\epsilon_{k,j}=0}\\ &-\dots\\ &-\eta_{q}H_{0,q}\frac{d\theta_{\epsilon_{k,j},q}^{down}}{d\epsilon_{k,j}}\bigg{|}_{\epsilon_{k,j}=0}\\ =&\quad\eta_{0}B_{z_{k},0}\nabla_{\theta}l(z_{k,j},\theta_{0,0})\\ &+\eta_{1}B_{z_{k},1}\nabla_{\theta}l(z_{k,j},\theta_{0,1})\\ &+\eta_{2}B_{z_{k},2}\nabla_{\theta}l(z_{k,j},\theta_{0,2})\\ &+\dots\\ &+\eta_{q}B_{z_{k},q}\nabla_{\theta}l(z_{k,j},\theta_{0,q})\\ &-\eta_{1}H_{0,1}\eta_{0}B_{z_{k},0}\nabla_{\theta}l(z_{k,j},\theta_{0,0})\\ &-\eta_{2}H_{0,2}(Id-\eta_{1}H_{0,1})\eta_{0}B_{z_{k},0}\nabla_{\theta}l(z_{k,j},\theta_{0,0})\\ &-\eta_{2}H_{0,2}\eta_{1}B_{z_{k},1}\nabla_{\theta}l(z_{k,j},\theta_{0,1})\\ &-\dots\\ &-\eta_{q}H_{0,q}\prod_{a=1}^{q-1}(Id-\eta_{a}H_{0,a})\eta_{0}B_{z_{k},0}\nabla_{\theta}l(z_{k,j},\theta_{0,0})\\ &-\eta_{q}H_{0,q}\prod_{a=2}^{q-1}(Id-\eta_{a}H_{0,a})\eta_{1}B_{z_{k},1}\nabla_{\theta}l(z_{k,j},\theta_{0,1})\\ &-\eta_{q}H_{0,q}\prod_{a=3}^{q-1}(Id-\eta_{a}H_{0,a})\eta_{2}B_{z_{k},2}\nabla_{\theta}l(z_{k,j},\theta_{0,2})\\ &-\dots\\ &-\eta_{q}H_{0,q}\prod_{a=q-1}^{q-1}(Id-\eta_{a}H_{0,a})\eta_{q-2}B_{z_{k},q-2}\nabla_{\theta}l(z_{k,j},\theta_{0,q-2})\\ &-\eta_{q}H_{0,q}\eta_{q-1}B_{z_{k},q-1}\nabla_{\theta}l(z_{k,j},\theta_{0,q-1})\\ =&\sum_{t=0}^{q-1}\left(\prod_{a=t+1}^{q}(Id-\eta_{a}H_{0,a})\right)\eta_{t}B_{z_{k},t}\nabla_{\theta}l(z_{k,j},\theta_{0,t})\\ &+\eta_{q}B_{z_{k},q}\nabla_{\theta}l(z_{k,j},\theta_{0,q}).\end{split} (32)

Therefore, the proof is completed. ∎

A.2 Proof of Theorem 1

Theorem.

Suppose the assumptions in Lemma 1 hold. If $\eta_{t}$ is monotonically decreasing for all but finitely many time steps, and there exists $t_{b}\neq\infty$ such that $\eta_{t}\|H_{0,t}\|_{2}<1,\forall t\geq t_{b}$, then,

\begin{split}\lim_{T\to\infty}\|I_{param}^{sgd}(z_{k,j})\|_{2}\leq&\bigg{\|}\sum_{t=0}^{t_{b}-2}\left(\prod_{a=t+1}^{t_{b}-1}(Id-\eta_{a}H_{0,a})\right)\eta_{t}B_{z_{k},t}\nabla_{\theta}l(z_{k,j},\theta_{0,t})\\ &+\eta_{t_{b}-1}B_{z_{k},t_{b}-1}\nabla_{\theta}l(z_{k,j},\theta_{0,t_{b}-1})\bigg{\|}_{2}\\ &+\lim_{T\to\infty}\sum_{t=t_{b}}^{T-1}\eta_{t}B_{z_{k},t}\|\nabla_{\theta}l(z_{k,j},\theta_{0,t})\|_{2}.\end{split} (33)

The convergence of $\|\lim_{T\to\infty}I_{param}^{sgd}(z_{k,j})\|_{2}$ is determined by

r=\lim_{t_{e}\to\infty}\frac{\|\nabla_{\theta}l(z_{k,j},\theta_{0,t_{f}})\|_{2}}{\|\nabla_{\theta}l(z_{k,j},\theta_{0,t_{e}})\|_{2}}, (34)

where $t_{e}<t_{f}$ are consecutive time steps such that $B_{z_{k},t_{e}}\neq 0$ and $B_{z_{k},t_{f}}\neq 0$.

Then, if $r<1$, meaning that the gradient norm of $z_{k,j}$ is constantly decreasing, $\|\lim_{T\to\infty}I_{param}^{sgd}(z_{k,j})\|_{2}$ is convergent, i.e., $I_{param}^{sgd}(z_{k,j})$ exists.

Proof.
\begin{split}\lim_{T\to\infty}\|I_{param}^{sgd}(z_{k,j})\|_{2}\leq&\underbrace{\lim_{T\to\infty}\bigg{\|}\sum_{t=0}^{t_{b}-1}\left(\prod_{a=t+1}^{T-1}(Id-\eta_{a}H_{0,a})\right)\eta_{t}B_{z_{k},t}\nabla_{\theta}l(z_{k,j},\theta_{0,t})\bigg{\|}_{2}}_{I}\\ &+\underbrace{\lim_{T\to\infty}\bigg{\|}\sum_{t=t_{b}}^{T-2}\left(\prod_{a=t+1}^{T-1}(Id-\eta_{a}H_{0,a})\right)\eta_{t}B_{z_{k},t}\nabla_{\theta}l(z_{k,j},\theta_{0,t})+\eta_{T-1}B_{z_{k},T-1}\nabla_{\theta}l(z_{k,j},\theta_{0,T-1})\bigg{\|}_{2}}_{II}.\end{split} (35)

Given that $\eta_{t}\|H_{0,t}\|_{2}<1,\forall t\geq t_{b}$,

\lim_{T\to\infty}\|\prod_{a=t+1}^{T-1}(Id-\eta_{a}H_{0,a})\|_{2}\leq 1,\quad\forall t\geq t_{b}. (36)

Thus,

\begin{split}\|I\|_{2}\leq&\bigg{\|}\sum_{t=0}^{t_{b}-2}\left(\prod_{a=t+1}^{t_{b}-1}(Id-\eta_{a}H_{0,a})\right)\eta_{t}B_{z_{k},t}\nabla_{\theta}l(z_{k,j},\theta_{0,t})\\ &+\eta_{t_{b}-1}B_{z_{k},t_{b}-1}\nabla_{\theta}l(z_{k,j},\theta_{0,t_{b}-1})\bigg{\|}_{2},\end{split} (37)

which is convergent. We also have

\begin{split}\|II\|_{2}\leq&\bigg{\|}\sum_{t=t_{b}}^{T-1}\eta_{t}B_{z_{k},t}\nabla_{\theta}l(z_{k,j},\theta_{0,t})\bigg{\|}_{2}\\ \leq&\sum_{t=t_{b}}^{T-1}\eta_{t}B_{z_{k},t}\|\nabla_{\theta}l(z_{k,j},\theta_{0,t})\|_{2}.\end{split} (38)

By the ratio test, a sufficient condition for the upper bound of $\|II\|_{2}$ to be convergent is

\lim_{t_{e}\to\infty}\frac{\eta_{t_{f}}\|\nabla_{\theta}l(z_{k,j},\theta_{0,t_{f}})\|_{2}}{\eta_{t_{e}}\|\nabla_{\theta}l(z_{k,j},\theta_{0,t_{e}})\|_{2}}<1. (39)

Let

r=\lim_{t_{e}\to\infty}\frac{\|\nabla_{\theta}l(z_{k,j},\theta_{0,t_{f}})\|_{2}}{\|\nabla_{\theta}l(z_{k,j},\theta_{0,t_{e}})\|_{2}}. (40)

Since $\eta_{t_{e}}\geq\eta_{t_{f}}$ when $t_{e}<t_{f}$,

\lim_{t_{e}\to\infty}\frac{\eta_{t_{f}}\|\nabla_{\theta}l(z_{k,j},\theta_{0,t_{f}})\|_{2}}{\eta_{t_{e}}\|\nabla_{\theta}l(z_{k,j},\theta_{0,t_{e}})\|_{2}}\leq r. (41)

If $r<1$, $\|I_{param}^{sgd}(z_{k,j})\|_{2}$ is convergent. If $r>1$ or $\lim_{t\to\infty}\|\nabla_{\theta}l(z_{k,j},\theta_{0,t})\|_{2}\neq 0$, the upper bound of $\|I_{param}^{sgd}(z_{k,j})\|_{2}$ can be divergent. ∎
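As a numerical illustration of the ratio-test condition, under the simplifying assumptions $B_{z_{k},t}\equiv 1$ and a constant learning rate, the proof's upper bound on $\|II\|_{2}$ converges when the gradient norms decay geometrically ($r<1$) and grows without bound when they do not ($r=1$):

```python
def influence_norm_bound(eta, grad_norms):
    # Upper bound on ||II||_2 from the proof: sum_t eta_t * ||grad l(z_{k,j})||_2.
    return sum(e * g for e, g in zip(eta, grad_norms))

T = 10_000
eta = [0.1] * T  # constant (hence non-increasing) learning rate

decaying = [0.9 ** t for t in range(T)]  # r = 0.9 < 1: geometric decay
constant = [1.0 for _ in range(T)]       # r = 1: norms never shrink

# Geometric series: bound approaches 0.1 / (1 - 0.9) = 1.0.
bound_decay = influence_norm_bound(eta, decaying)
# Constant norms: bound grows linearly with the number of steps T.
bound_const = influence_norm_bound(eta, constant)
```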

A.3 Other Sufficient Conditions for $I_{param}^{sgd}(z_{k,j})$ Being Convergent

The following corollaries give other sufficient conditions for $I_{param}^{sgd}(z_{k,j})$ being convergent. However, as they require all the Hessian matrices $H_{0,t},\forall t\geq t_{b}$ to be non-singular, they may not be useful for general deep learning models.

Corollary.

Under the assumptions of Lemma 1, if $\eta_{t}$ converges to a sufficiently small constant $\eta_{b}$ such that $\eta_{b}\|H_{0,t}\|_{2}<1,\forall t\geq t_{b}$, and $H_{0,t},\forall t\geq t_{b}$ are non-singular, then $\lim_{T\to\infty}I_{param}^{sgd}(z_{k,j})$ is convergent. Specifically, when $B_{z_{k},t}\equiv 1$, $\|\lim_{T\to\infty}I_{param}^{sgd}(z_{k,j})\|_{2}\leq\|M^{-1}v\|_{2}$, where $M=\operatorname*{arg\,min}_{\{H_{0,t};t\geq t_{b}\}}\sigma_{min}(H_{0,t})$, $v=\operatorname*{arg\,max}_{\{\nabla_{\theta}l(z_{k,j},\theta_{0,t});t\geq t_{b}\}}\|M\nabla_{\theta}l(z_{k,j},\theta_{0,t})\|_{2}$, and $\sigma_{min}$ denotes the minimum singular value.

Proof.

If $\eta_{t}$ converges to a sufficiently small constant $\eta_{b}\neq 0$ such that $\eta_{b}\|H_{0,t}\|_{2}<1,\forall t\geq t_{b}$, and $H_{0,t},\forall t\geq t_{b}$ are non-singular, then

\lim_{T\to\infty}\Big\|\prod_{a=t_{b}+1}^{T-1}(Id-\eta_{a}H_{0,a})\Big\|_{2}=0, \qquad (42)

and thereby the first term $I$ (the sum over $t<t_{b}$, analogous to the decomposition in (46)) is 0. Hence,

\lim_{T\to\infty}\|I_{param}^{sgd}(z_{k,j})\|_{2}\leq\lim_{T\to\infty}\left\|\sum_{t=t_{b}}^{T-1}(Id-\eta_{b}M)^{T-t-1}\eta_{b}B_{z_{k},t}v\right\|_{2}. \qquad (43)

Since $0<\eta_{b}\|H_{0,t}\|_{2}<1,\forall t\geq t_{b}$, $\lim_{T\to\infty}\|I_{param}^{sgd}(z_{k,j})\|_{2}$ is convergent. Specifically, if $B_{z_{k},t}\equiv 1$, by the Neumann series, $\lim_{T\to\infty}\|I_{param}^{sgd}(z_{k,j})\|_{2}\leq\lim_{T\to\infty}\|\sum_{t=t_{b}}^{T-1}(Id-\eta_{b}M)^{T-t-1}\eta_{b}v\|_{2}=\|M^{-1}v\|_{2}$. ∎
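The Neumann-series step can be spelled out. Since $\|Id-\eta_{b}M\|_{2}<1$, the geometric matrix series sums to the inverse of its complement:

```latex
\sum_{k=0}^{\infty}(Id-\eta_{b}M)^{k}\,\eta_{b}v
=\bigl(Id-(Id-\eta_{b}M)\bigr)^{-1}\eta_{b}v
=(\eta_{b}M)^{-1}\eta_{b}v
=M^{-1}v.
```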

Empirical studies demonstrate that most eigenvalues of Hessian matrices in deep learning models are clustered near 0. Hence, $H_{0,t},\forall t\geq t_{b}$ being non-singular is a strong assumption and may not be satisfied for most real-world models.

Corollary.

Under the assumptions of Lemma 1, if $\eta_{t}$ reaches 0 within a finite number of time steps, then $\lim_{T\to\infty}I_{param}^{sgd}(z_{k,j})$ is a finite series and is therefore convergent.

The proof of this corollary is straightforward, as each element in the sequence is finite. However, in most LM training settings, $\eta_{t}$ does not reach 0 until $t$ reaches $T$.

A.4 Proof of Corollary 1

Corollary.

Under the assumptions of Lemma 1, further assume that $T\to\infty$, $\eta_{t}$ converges to a constant $\eta_{c}$, $B_{z_{i},t}\equiv 1$, $\sum_{i=1}^{n}\nabla_{\theta}L(z_{i},\theta_{0,t_{c}})=0$ at $t_{c}$, $\|H_{0,t_{c}}\|_{2}<\frac{1}{\eta_{c}}$, and $H_{0,t_{c}}$ is non-singular. Then,

\lim_{T\to\infty}I^{sgd,down}_{param}(z_{k,j})=H_{t_{c}}^{-1}\nabla_{\theta}l(z_{k,j},\theta_{0,t_{c}}). \qquad (44)

If these assumptions do not hold, the errors they introduce can be amplified by the gradient norm. For example, if $B_{z_{i},t}\not\equiv 1$,

\begin{aligned}
&\Big\|\lim_{T\to\infty}I_{param}^{sgd}(z_{k,j})-I_{param}^{hif}(z_{k,j})\Big\|_{2} \\
\leq{}&\lim_{T\to\infty}\sum_{t=t_{c}}^{T-2}\bigg\|\prod_{a=t+1}^{T-1}(Id-\eta_{c}H_{0,t_{c}})\bigg\|_{2}(B_{z_{k},t}-1)\,\eta_{c}\,\|\nabla_{\theta}l(z_{k,j},\theta_{0,t_{c}})\|_{2} \\
&+(B_{z_{k},T-1}-1)\,\eta_{c}\,\|\nabla_{\theta}l(z_{k,j},\theta_{0,t_{c}})\|_{2}.
\end{aligned} \qquad (45)
Proof.

\begin{aligned}
\lim_{T\to\infty}I^{sgd,down}_{param}(z_{k,j})
={}&\underbrace{\lim_{T\to\infty}\sum_{t=0}^{t_{c}-1}\left(\prod_{a=t+1}^{T-1}(Id-\eta_{c}H_{0,a})\right)\eta_{t}\nabla_{\theta}l(z_{k,j},\theta_{0,t})}_{I} \\
&+\underbrace{\lim_{T\to\infty}\sum_{t=t_{c}}^{T-2}\left(\prod_{a=t+1}^{T-1}(Id-\eta_{c}H_{0,t_{c}})\right)\eta_{c}\nabla_{\theta}l(z_{k,j},\theta_{0,t_{c}})+\eta_{c}\nabla_{\theta}l(z_{k,j},\theta_{0,t_{c}})}_{II}
\end{aligned} \qquad (46)

Given that $\|H_{0,t_{c}}\|_{2}<\frac{1}{\eta_{c}}$, $\lim_{T\to\infty}\prod_{a=t_{c}}^{T-1}(Id-\eta_{c}H_{0,t_{c}})=0$; consequently, $I$ in (46) is 0.

II=\lim_{T\to\infty}\sum_{t=t_{c}}^{T-1}(Id-\eta_{c}H_{0,t_{c}})^{T-t-1}\eta_{c}\nabla_{\theta}l(z_{k,j},\theta_{0,t_{c}}) \qquad (47)

Given $\|H_{0,t_{c}}\|_{2}<\frac{1}{\eta_{c}}$, by the Neumann series, $\lim_{T\to\infty}I^{sgd,down}_{param}(z_{k,j})=H_{t_{c}}^{-1}\nabla_{\theta}l(z_{k,j},\theta_{0,t_{c}})$.

If $B_{z_{i},t}\not\equiv 1$, (45) follows directly from the Cauchy–Schwarz and triangle inequalities. ∎

A.5 Algorithm of HAIF

The algorithm implementing HAIF is summarized in Algorithm 1. Weights for adjusting outputs are first computed and cached by the function output_adjustment (lines 12-19). Notably, output_adjustment only needs to be run once for all test samples. Next, the gradient of the test sample is computed (line 3). Then, an iteration over the training samples computes the dot product between the gradient of the test sample and that of each individual training sample (lines 4-11). Compared to current IFs, HAIF is computationally efficient and able to accurately trace the most influential samples.

Algorithm 1 HAIF Algorithm
Require: $z_{test},\{z_{k}\}_{k=1}^{n},\theta_{T},l,L$
Ensure: $\{s_{k}\}_{k=1}^{n}$
1: $\{s_{k}\}_{k=1}^{n}\leftarrow\{0\}_{1}^{n}$
2: $\{w_{k,j}\}\leftarrow output\_adjustment(\theta_{T},\{z_{k}\}_{k=1}^{n},l)$
3: $g_{t}\leftarrow grad(L(z_{test},\theta^{c}))$
4: for $k\leftarrow 1$ to $n$ do
5:  $L_{k}\leftarrow 0$
6:  for $j\leftarrow 1$ to $m$ do
7:   $L_{k}\leftarrow L_{k}+w_{k,j}\,l(z_{k,j},\theta^{c})$
8:  end for
9:  $g_{k}\leftarrow grad(L_{k})$
10:  $s_{k}\leftarrow s_{k}+g_{t}^{T}g_{k}$
11: end for
12: function output_adjustment($\theta_{T},\{z_{k}\}_{k=1}^{n},l$)
13:  for $k\leftarrow 1$ to $n$ do
14:   for $j\leftarrow 1$ to $m$ do
15:    $w_{k,j}\leftarrow W(\|grad(l(z_{k,j},\theta_{T}))\|)$
16:   end for
17:  end for
18:  return $\{w_{k,j}\}$
19: end function
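The tracing loop can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: per-token gradients are taken as given arrays, and the adjustment $W(\cdot)$ is left as a pluggable function (the inverse gradient norm used as the default here is an illustrative choice that down-weights large-gradient-norm tokens, as the method intends).

```python
import numpy as np

def haif_scores(g_test, train_token_grads, weight_fn=lambda n: 1.0 / (n + 1e-8)):
    """Sketch of HAIF tracing (cf. Algorithm 1).

    g_test:            gradient of the test loss, shape (d,).
    train_token_grads: list over training samples; entry k is an array of
                       per-token gradients with shape (m_k, d).
    weight_fn:         stand-in for the adjustment W(.) applied to each
                       token's gradient norm (assumed form, not from the paper).
    Returns one influence score s_k per training sample.
    """
    scores = []
    for token_grads in train_token_grads:
        # output_adjustment: one weight per token from its gradient norm.
        norms = np.linalg.norm(token_grads, axis=1)
        w = np.array([weight_fn(n) for n in norms])
        # Weighted sample gradient g_k = sum_j w_{k,j} * grad l(z_{k,j}).
        g_k = (w[:, None] * token_grads).sum(axis=0)
        # Influence score: dot product with the test gradient.
        scores.append(float(g_test @ g_k))
    return scores
```

Under these assumptions, a sample dominated by a single large-gradient-norm token no longer automatically outscores a sample whose tokens all align moderately with the test gradient.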

A.6 Dataset Examples for PII-E and PII-CR

Here, we provide more details about PII-E and PII-CR. Tables 5 and 6 illustrate two real examples from the datasets. Names and all attributes are generated by Faker (https://github.com/joke2k/faker).

Table 5 illustrates the attributes, pretraining data, and instruction data of a data subject, Liao Tingting, in the PII-E dataset. For example, models are trained to predict ‘1912/01/01’ as the DOB of Liao Tingting, and IFs are expected to trace this prediction back to the pretraining data of Liao Tingting.

PII-CR is more challenging, as shown in Table 6. When asked for the DOB of Liao Tingting, models are expected to answer ‘01/01’. IFs are expected to trace this prediction back to the pretraining data, where ‘01/01’ does not appear but the festival New Year does. Models must understand the link among Liao Tingting, ‘01/01’, and New Year, and IFs are expected to trace across such reasoning.

Table 5: An example of PII-E dataset
Virtual Data Subject
Name DOB Email Phone Address
Liao Tingting 1912/01/01 [email protected] 18859382421 Block N, Qiqihar Road…
Pretraining (Biology)
Liao Tingting was born on Jan. 1, 1912, at Block N, Qiqihar Road… She is a hard …
Instructions
Q: What’s the DOB of Liao Tingting? A: 1912/01/01
Q: What’s the Email of Liao Tingting? A: [email protected]
Q: What’s the Phone of Liao Tingting? A: 18859382421
Q: What’s the Address of Liao Tingting? A: Block N, Qiqihar Road, Xincheng…
Table 6: An example of PII-CR dataset
Virtual Data Subject
Name DOB Address
Liao Tingting 01/01 Shanghai
Pretraining (Biology)
Liao Tingting is born on New Year. Adjacent to Liao
Tingting’s place is Oriental Pearl TV Tower.
Instructions
Q: What’s the DOB of Liao Tingting? A: 01/01
Q: What’s the Address of Liao Tingting? A: Shanghai

A.7 Detailed Privacy Learning Abilities of LM

Figure 3: PII Prediction Accuracy of GPT2 and QWen1.5 on (a) the PII-E dataset and (b) the PII-CR dataset. The red dashed line represents the random guess performance of the QWen1.5-0.5B model, trained exclusively with instruction data.

In this work, we use the classical GPT2 series and the QWen1.5 series (one of the SOTA open-source LLMs (Huang et al. 2023)) as base models, fine-tuning various model series and scales on the PII-E and PII-CR datasets. Since model training aims to construct as many traceable samples as possible for evaluating tracing performance, each model is trained for 50 epochs, and the checkpoint with the highest PII prediction accuracy is saved. With this in mind, no over-fitting assumption is involved when evaluating tracing accuracy. All models are trained with the AdamW optimizer with default settings and a linear scheduler. As the regular training procedure leads to out-of-memory errors for QWen1.5-1.8B and GPT2-xLarge, we train them with DeepSpeed-ZeRO2 (Rajbhandari et al. 2020).

The PII prediction accuracy of GPT2 and QWen1.5 on the PII-E dataset is illustrated in Fig. 3(a). In general, as the number of parameters increases, so does the ability to memorize and extract PII. More advanced and contemporary model architectures also enhance PII extraction capabilities. Owing to the variations in length and uniqueness among different types of PII, all PII types except DOB present greater extraction challenges. As shown in Fig. 3(b), a similar pattern is observed on the PII-CR dataset. However, because this dataset requires models to reason over memorized PII in addition to memorizing it, PII prediction accuracy decreases significantly compared to PII-E. We further trained the QWen1.5-0.5B model solely on the instruction data from the two datasets; its performance was equivalent to random guessing. In the case of PII-E, the model was unable to infer the linkage between names and privacy attributes due to the uniqueness of PII. However, for the PII reasoning dataset PII-CR, even without pretraining data, the models could occasionally make a correct prediction once they learned the answer format. This random-guess behavior greatly influenced the LOOR results, as demonstrated in Section 4.3.

Given that even the smallest QWen1.5 model overwhelmingly outperforms the largest GPT2 model in PII prediction accuracy across all PII types, we conclude that as model architectures become more advanced and their fundamental capabilities are enhanced, models can extract personal information from more obscure expressions, thereby increasing the risk of privacy leakage.

A.8 The Actual Influences of Tokens

Figure 4: Comparison of Token Influences and Token Gradient Norms

In this section, we conduct experiments to verify that tokens with large gradient norms can have small, or even zero, influence on model parameters. Instead of leave-one-sample-out retraining, we perform leave-one-token-out experiments on the PII-E dataset using the QWen1.5-0.5B model with the SGD optimizer. Since retraining for every token is intractable, we sample 10 tokens from each of 10 samples and plot the relationship between each token's parameter change and its gradient norm. As shown in Figure 4, tokens typically have minimal influence on parameters (less than $10^{-4}$), and tokens with large gradient norms can have small or zero influence on model parameters.
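The leave-one-token-out measurement can be illustrated on a toy convex problem (a hypothetical stand-in for the LM setup: each "token" contributes one squared-error term, and training is plain full-batch gradient descent). The procedure retrains without one token's loss term and compares the resulting parameter change to that token's gradient norm at the fully trained parameters:

```python
def train(y_values, lr=0.1, steps=500, w0=0.0):
    """Full-batch gradient descent on sum_i (w - y_i)^2.
    A toy stand-in for training on per-token losses."""
    w = w0
    for _ in range(steps):
        grad = sum(2.0 * (w - y) for y in y_values)
        w -= lr * grad
    return w

def leave_one_token_out(y_values, j, lr=0.1, steps=500):
    """Actual influence of 'token' j: parameter change from retraining
    without it, paired with its gradient norm at the full solution."""
    w_full = train(y_values, lr, steps)
    w_loo = train([y for i, y in enumerate(y_values) if i != j], lr, steps)
    grad_norm_j = abs(2.0 * (w_full - y_values[j]))  # token-j gradient norm at w_full
    return abs(w_full - w_loo), grad_norm_j
```

In this convex toy the two quantities happen to move together; the point of Figure 4 is that for LMs they need not, which is exactly what this harness measures.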

A.9 LOOR Details on PII-E and PII-CR

LOOR for all pretraining samples is time-consuming. Hence, a subset of 100 pretraining samples and their instruction data are selected. To balance time and PII prediction accuracy, we choose QWen1.5-0.5B. We sequentially remove each pretraining sample and compute the actual loss change for each $z_{test}$: $\Delta L(z_{test},\theta^{LOOR}_{\frac{1}{n},z_{k}})$. If the expected $target_{z_{test}}$ equals $\operatorname*{arg\,max}_{k}\Delta L(z_{test},\theta^{LOOR}_{\frac{1}{n},z_{k}})$, LOOR is considered to agree with the expectation. This allows us to calculate the agreement ratio between the expected targets and LOOR. Note that the PII-E and PII-CR datasets share the same retraining proxy.
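Given the matrix of actual loss changes, the agreement ratio reduces to an argmax comparison. A minimal sketch (array names are ours, not the paper's):

```python
import numpy as np

def loor_agreement_ratio(delta_losses, expected_targets):
    """delta_losses: array (num_test, num_train); entry [i, k] is the loss
    change of test sample i after retraining without training sample k.
    expected_targets: for each test sample, the index of the training
    sample it is expected to trace to. Returns the fraction of test
    samples where argmax_k delta_loss matches the expected target."""
    traced = np.argmax(delta_losses, axis=1)
    return float(np.mean(traced == np.asarray(expected_targets)))
```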

A.10 Configurations for IFs

This section details the configurations of the various IFs in our experiments. For LiSSA, we adhere to the settings used in (Bae et al. 2022): the Gauss-Newton Hessian (GNH) matrix is enabled, and the depth of LiSSA is set to match the length of the dataset. However, even with GNH, calculating the Jacobian of the model output still leads to Out-Of-Memory (OOM) errors for large models. We employ the default settings for EK-FAC. Given that the EK-FAC algorithm supports a limited set of deep learning layers, we transform the 1D-CNN layer into an equivalent Linear layer as guided by (Grosse et al. 2023). We disregard layers in the QWen1.5 series that lead to errors, such as Rotary Embedding and Layer Norm; no layer is disregarded for the GPT2 series. $l$-RelatIF ($I^{l\text{-}relatif}_{loss}(z_{test},z_{k})=\frac{I^{hif}_{loss}(z_{test},z_{k})}{\sqrt{I^{hif}_{loss}(z_{k},z_{k})}}$) is utilized during experiments, which, according to (Barshan, Brunet, and Dziugaite 2020), provides a better estimation than $\theta$-RelatIF. In the case of TracInCp, we utilize three checkpoints saved during the training process. While additional checkpoints may enhance tracing accuracy, they would also result in a longer runtime than EK-FAC, negating the speed advantage of first-order IFs.
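The $l$-RelatIF normalization above is a one-line transformation of precomputed HIF scores, sketched here (array names are illustrative):

```python
import numpy as np

def l_relatif(hif_test_train, hif_train_self):
    """l-RelatIF: divide each HIF score I^hif(z_test, z_k) by the square
    root of the training sample's self-influence I^hif(z_k, z_k).
    hif_test_train: array (num_train,); hif_train_self: positive array
    (num_train,). Returns the normalized scores."""
    return np.asarray(hif_test_train) / np.sqrt(np.asarray(hif_train_self))
```

The normalization damps training samples whose self-influence is large, which is the sense in which RelatIF gives a better estimation than raw HIF scores.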