
Data Deletion for Linear Regression with Noisy SGD

Zhangjie Xia    Chi-Hua Wang    Guang Cheng
Abstract

In the current era of big data and machine learning, it is essential to find ways to shrink the size of the training dataset while preserving training performance, in order to improve efficiency. A key challenge is to provide practical ways to find points that can be deleted without significantly harming the training result or causing problems such as underfitting. We therefore present the perfect deleted point problem for 1-step noisy SGD in the classical linear regression task, which aims to find the point in the training dataset whose deletion leaves the resulting model identical to the one trained without deleting it. We apply the so-called signal-to-noise ratio and show that its value is closely related to the selection of the perfect deleted point. We also implement an algorithm based on this idea and empirically demonstrate its effectiveness on a synthetic dataset. Finally, we analyze the consequences of the perfect deleted point, specifically how it affects training performance and the privacy budget, thereby highlighting its potential. This research underscores the importance of data deletion and calls for more studies in this field.

1 INTRODUCTION

Machine learning (ML) is a field of study in artificial intelligence concerned with the development of statistical algorithms that can learn from given data and generalize to unseen data, thus performing future tasks without explicit instructions. It plays an essential role in our everyday life and has become one of the most rapidly growing research fields. Jordan and Mitchell (2015) reviewed the core methods and recent progress in this field and suggested that, despite its remarkable recent success, it still presents numerous research opportunities and many unsolved problems.

Data deletion (Garg et al., 2020; Ginart et al., 2019; Oesterling et al., 2024) is one of the most intriguing areas among these research questions. It is widely known that more training data tends to improve training performance dramatically (Ramezan et al., 2021; Foody et al., 2006). However, training with more data also has serious drawbacks. For instance, Budach et al. (2022) empirically showed that the data quality of the training dataset can seriously affect the training performance of many widely used ML algorithms, including regression, classification, and clustering; a larger training dataset therefore carries a higher risk of containing poisoning data points. Narciso and Martins (2020) suggested that the massive amount of data used by industry can cause serious energy-efficiency problems, increasing time and space complexity during training and boosting energy consumption. Hawkins (2004) pointed out that too much training data can make an ML model learn noise and unimportant patterns from the given data, thus generalizing poorly to unseen testing data, which is referred to as overfitting.

Additionally, many regulations have been established on behalf of the "Right to be Forgotten" in many countries and states, including the EU regulation by the Council of the European Union (2016), and the recent California Consumer Privacy Act (CCPA) (State of California Department of Justice, 2018) allows users to require companies such as Google, Facebook, and Twitter to forget and delete their personal data to protect their privacy. Users can also ask these platforms to erase outdated information or limit the retention of personal details, such as photos and emails, to a specific time period. However, since personal data is often collected incrementally, the process of deleting it in chronological order poses significant challenges for machine learning models. This issue highlights the need for the development and analysis of deletion-aware online machine learning methods capable of handling such requests efficiently.

Therefore, since more training data doesn't necessarily lead to better results and may be prohibited in some cases, it's natural to ask the following question:

Can we use a smaller training dataset without sacrificing training accuracy?

Data deletion, or machine unlearning (Cao and Yang, 2015; Bourtoule et al., 2021), addresses this problem by selecting certain training samples to be deleted whose removal does not harm the training performance and results in an identical final model. Further work has proposed many solutions to this question in certain scenarios (Baumhauer et al., 2022; Golatkar et al., 2020; Graves et al., 2021), and we refer the readers to Section 2 for more details.

This paper proposes an innovative approach to the question of data deletion in one step of noisy stochastic gradient descent (SGD) for the linear regression task. Sections 2 and 3 present related work and the basic problem formulation. Section 4 introduces our statistical method for solving the data deletion problem and presents the corresponding algorithm. Section 5 highlights the major consequences of our method, specifically how it affects empirical risk and model privacy. Section 6 empirically demonstrates the use of the perfect deleted point on a synthetic dataset. Section 7 summarizes the paper and suggests possible future work.

Our Contributions:

  1. Propose a hypothesis testing method to solve the data deletion problem in linear regression using noisy SGD.

  2. Show that the deletion of our perfect deleted point potentially affects the training performance the least compared with other points.

  3. Show that the deletion of our perfect deleted point potentially poses the minimum privacy issue compared with other points.

  4. Empirically demonstrate the effectiveness and potential of the perfect deleted point on a synthetic dataset.

2 RELATED LITERATURE

Noisy gradient descent.

Our research contributes to the study of machine learning security, particularly focusing on noisy gradient descent (Noisy GD) (Avella-Medina et al., 2023; Das et al., 2023; Wu et al., 2020a). Noisy GD is best exemplified by DP-SGD, proposed by Abadi et al. (2016). The basic idea is to introduce noise into the gradient descent process to provide privacy guarantees during training. In the DP-SGD framework, each update carries a privacy cost, which quantifies the privacy leakage in each step. There are many causes of privacy leakage, and our research shows how the deletion of one data point from the training dataset can affect the privacy budget in Section 5.2. We conclude that there exists a certain data point, which we call the perfect deleted point, whose deletion incurs the smallest privacy budget among all points, and we highlight potential future work on how the behaviour of training data can affect machine learning security.

Data deletion.

The key to our research is to understand data deletion in a specific framework and provide a reliable approach to solve this problem. Despite the rapid progress in this field, we have found that most studies use a single metric, such as accuracy, as the dominating factor to help them find points to be deleted (Wu et al., 2020b; Guo et al., 2019; Garg et al., 2020; Neel et al., 2021; Gupta et al., 2021). This is ad hoc, leading to unreliable and inconsistent results. Our research addresses this issue by applying several metrics, namely the distribution of model weights, the training loss, and the privacy budget, to verify that our perfect deleted point is indeed the "perfect" one to be deleted among all data points in the training dataset, thus enhancing the reliability of our method and providing a new standard for future researchers working in this domain. We believe our contribution will significantly improve the theoretical tools available for data deletion, fostering greater trust and robustness in this area.

Membership inference attack.

Our research approach is inspired by recent progress in the area of membership inference attacks (MI attacks). In such an attack, an adversary tries to predict whether a certain data point belongs to the training dataset or to the overall data distribution, given either the final model output (black-box attack) or information about the model throughout the training process (white-box attack) (Shokri et al., 2017). While our initial goal is to find the perfect deleted point, equivalently we want to find a data point from the training dataset such that the adversary is most likely to fail to perform an MI attack for that point, i.e., the data point is somewhat redundant and makes limited contribution to the training of the model. Many researchers have contributed to this problem from different perspectives, and we are specifically inspired by the works of Leemann et al. (2023) and Yeom et al. (2018a). Yeom et al. (2018a) analyze MI attacks using the concept of membership advantage; inspired by this, we apply the adjusted concept of absolute membership advantage to find the perfect deleted point. Leemann et al. (2023) analyze MI attacks from a hypothesis testing perspective; similarly, we find the perfect deleted point with the smallest absolute membership advantage using a likelihood ratio test. We believe the concept of membership advantage and the approach of hypothesis testing may yield more future work on the problem of data deletion.

Here we also restate two concepts from Yeom et al. (2018b) to prepare the readers for our later discussion. The first one is membership advantage.

Definition 1.

(Membership experiment Exp^{M}(A,D) (Yeom et al., 2018b)): Let A be an adversary, S be the training dataset, and D be a probability distribution over data points (x,y). The experiment proceeds as follows:

  1. Sample S\sim D^{n}.

  2. Draw b\in\{0,1\} uniformly at random.

  3. Draw z\sim S if b=0, or z\sim D if b=1.

  4. A tries to predict the value of b and outputs either 0 or 1. If it makes the right prediction, i.e., its output is equal to b, then Exp^{M}(A,D)=1. Otherwise, Exp^{M}(A,D)=0.

In other words, in this experiment the adversary A tries to predict whether data point z is drawn from the training dataset or from the whole data distribution. If it predicts correctly, then the experiment is successful and outputs 1; otherwise it fails and outputs 0. Based on the experiment, we further define membership advantage as follows:

Definition 2.

(Membership advantage (Yeom et al., 2018b)): The membership advantage of point v is defined as

Adv(v)=Pr[A=0|b=0]-Pr[A=0|b=1].

Equivalently, the membership advantage can be understood as the difference between the true negative rate (TNR), i.e., Pr[A=0|b=0], and the false negative rate (FNR), i.e., Pr[A=0|b=1], of the membership experiment Exp^{M}(A,D). Intuitively, if the TNR is close to the FNR, then it is hard to tell whether z is drawn from S or D, which means the null and alternative hypotheses give similar probability distributions. We later leverage this idea to find the perfect deleted point.
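To make Definitions 1 and 2 concrete, the following is a minimal Monte Carlo sketch of the membership experiment; the names membership_experiment, adversary, and data_dist_sampler, as well as the use of NumPy, are our own illustrative assumptions and not part of Yeom et al. (2018b).

```python
import numpy as np

rng = np.random.default_rng(0)

def membership_experiment(adversary, train_set, data_dist_sampler, trials=10_000):
    """Monte Carlo estimate of Adv = Pr[A=0 | b=0] - Pr[A=0 | b=1] (Definition 2).

    `adversary(z)` returns a guess in {0, 1}; `data_dist_sampler()` draws one
    point from the underlying distribution D; `train_set` plays the role of S."""
    tn = fn = n0 = n1 = 0
    for _ in range(trials):
        b = rng.integers(0, 2)                            # step 2: b uniform on {0, 1}
        z = train_set[rng.integers(len(train_set))] if b == 0 else data_dist_sampler()
        if b == 0:
            n0 += 1
            tn += (adversary(z) == 0)                     # contributes to the TNR
        else:
            n1 += 1
            fn += (adversary(z) == 0)                     # contributes to the FNR
    return tn / n0 - fn / n1                              # estimated membership advantage
```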

Another important concept we need from Yeom et al. (2018b) is \epsilon-differential privacy, where \epsilon is called the privacy budget and a large \epsilon indicates serious concerns about training data confidentiality.

Definition 3.

(Differential privacy (Yeom et al., 2018b)): An algorithm A:X^{n}\to Y satisfies \epsilon-differential privacy if for all S,S^{\prime}\in X^{n} that differ in a single value and all output sets Y^{\prime}\subseteq Y, the following holds:

Pr[A(S)\in Y^{\prime}]\leq e^{\epsilon}Pr[A(S^{\prime})\in Y^{\prime}]. (1)

The following Lemma characterizes the connection between membership advantage and privacy budget, which will be used in Section 5.2.

Lemma 1.

(Upper bound of membership advantage for an \epsilon-differentially private algorithm (Yeom et al., 2018b)) Let A be an \epsilon-differentially private algorithm; then we have

Adv(A)\leq e^{\epsilon}-1. (2)

3 PROBLEM SETUP

3.1 Perfect deleted point problem

Assume we are working with two different datasets: D_{0}, referred to as the original dataset, and D_{1}, known as the deleted dataset. Our goal is to find a point v^{*}\in D_{0}, called the perfect deleted point, whose deletion causes the minimum change to the final model, i.e., the distribution of model weights resulting from deleting v^{*} is identical to the one obtained without deleting it, compared with all the other points.

However, it can be difficult and inefficient to find the perfect deleted point directly. Intuitively, for each data point v_{i}\in D_{0}, we would need to delete it from the original dataset D_{0} and use the deleted dataset D_{0}\setminus v_{i} to train a new ML model. Among the n models obtained by deleting each v_{i}, we want the data point v^{*}\in D_{0} such that training on the deleted dataset D_{1}=D_{0}\setminus v^{*} yields the model weights most similar to those trained on the original dataset D_{0}, compared with deleting any of the other n-1 data points. This direct approach repeats the deleting-and-training process n times, which is unnecessary: we propose an alternative approach in this paper that requires no additional deleting-and-training at all, discussed in detail in Section 4.2. A sketch of the brute-force baseline is given below for contrast.
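The following is a minimal sketch of the brute-force baseline described above; the names naive_perfect_deletion, train_fn, and weight_distance are hypothetical placeholders, and D0 is assumed to be a Python list of data points.

```python
import numpy as np

def naive_perfect_deletion(D0, train_fn, weight_distance, n_runs=100):
    """Brute-force baseline: retrain once per candidate deletion and return the
    index of the point whose deleted-dataset model stays closest to the original.

    `train_fn(dataset)` returns the model weights after training (here, after
    1-step noisy SGD); `weight_distance` compares two collections of weights,
    gathered over `n_runs` repetitions to account for the injected noise."""
    w_orig = [train_fn(D0) for _ in range(n_runs)]
    best_idx, best_dist = None, np.inf
    for i in range(len(D0)):
        D1 = D0[:i] + D0[i + 1:]                  # deleted dataset D0 \ {v_i}
        w_del = [train_fn(D1) for _ in range(n_runs)]
        dist = weight_distance(w_orig, w_del)     # drift between the weight distributions
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx
```

This baseline requires n separate retrainings (each repeated n_runs times), which is exactly the cost our approach avoids.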

We are motivated to find the perfect deleted point because previous studies have shown that, despite the improvement in training performance from a larger training dataset (Ramezan et al., 2021; Foody et al., 2006), too much training data can cause overfitting if the model becomes too complex and learns noise or irrelevant patterns rather than generalizing well (Ying, 2019; Hawkins, 2004). Moreover, using a smaller training dataset is more computationally friendly and data-efficient (Acun et al., 2021; Zhu et al., 2016).

3.2 Assumptions

In this paper, we propose a novel method for finding the perfect deleted point with the following assumptions:

  1. The original dataset D_{0} contains finitely many data points, i.e., n is a finite positive integer.

  2. We consider the classical linear regression task, i.e., point v_{i}=(x_{i},y_{i})\in D_{0}, where x_{i} is the feature vector and y_{i} is the corresponding label. Thus, the loss function of point v_{i} for an ML model with weights w is

    l(w,v_{i})=(y_{i}-\langle w,x_{i}\rangle)^{2}. (3)

    The empirical risk for model weights w and dataset D=\{(x_{i},y_{i})\}_{i=1}^{n} is defined as

    L(w;D)=\frac{1}{n}\sum_{i=1}^{n}l(w;(x_{i},y_{i})), (4)

    which evaluates the discrepancy between predicted and true outcomes over the whole dataset. The mean is taken over all individual losses because using the mean gradient is standard in the noisy SGD update defined below.

  3. The ML model uses noisy gradient descent to minimize the empirical risk defined in equation (4) and update its weights in each step, i.e.,

    w_{1}=w_{0}-\gamma(\nabla L(w;D)+\eta) (5)

    where \eta\sim N(0,\sigma^{2}I) is Gaussian noise and w_{0},w_{1} are the model weights before and after the noisy gradient step, respectively. Note we only consider one step of noisy gradient descent in this paper. A minimal sketch of this update is given after this list.
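The following is a minimal Python sketch of the 1-step noisy SGD update in equation (5) with the squared loss (3) and empirical risk (4); the function name noisy_sgd_step and the array-based data layout (X of shape (n, d), y of shape (n,)) are illustrative assumptions.

```python
import numpy as np

def noisy_sgd_step(w0, X, y, gamma, sigma, rng):
    """One step of noisy gradient descent: w1 = w0 - gamma * (grad L(w0; D) + eta)."""
    n = len(y)
    residual = y - X @ w0                         # y_i - <w0, x_i> for each sample
    grad = -2.0 / n * X.T @ residual              # gradient of the empirical risk (4)
    eta = rng.normal(0.0, sigma, size=w0.shape)   # Gaussian noise eta ~ N(0, sigma^2 I)
    return w0 - gamma * (grad + eta)
```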

4 KEY THEORETICAL RESULTS

In Section 4, we present our theoretical approach on how to find the perfect deleted point v^{*} from a hypothesis testing perspective and design a concrete algorithm to realize this idea.

4.1 Notations

With the linear regression setting in Section 3.2, we first formally define how we can obtain the deleted dataset from the original dataset.

Definition 4.

(Original dataset and deleted dataset): Denote D_{0}=\{(x_{i},y_{i})\}_{i=1}^{n} as the original training dataset and

D_{1}=D_{0}\setminus\{v\}

as the deleted training dataset with the deleted point v=(x_{v},y_{v})\in D_{0}.

Given Definition 4 and equation (4), we next define the empirical risk and gradient for original and deleted dataset respectively.

Definition 5.

(Original empirical risk and its gradient): Given an original dataset D_{0} defined in Definition 4. For model weights w, denote the individual loss of w on training sample (x_{i},y_{i}) as l(w;(x_{i},y_{i})). Then the original empirical risk L(w;D_{0}) is defined as

L(w;D_{0})=\frac{1}{n}\sum_{i=1}^{n}l(w;(x_{i},y_{i})).

In addition, \nabla_{w}L(w;D_{0}) is called the original gradient of model weights w on dataset D_{0}.

Definition 6.

(Deleted empirical risk and its gradient): Given a deleted dataset D_{1} defined in Definition 4. For model weights w, denote the individual loss of w on training sample (x_{i},y_{i}) as l(w;(x_{i},y_{i})). Then the deleted empirical risk L(w;D_{1}) is defined as

L(w;D_{1})=\frac{1}{n-1}\left[\sum_{i=1}^{n}l(w;(x_{i},y_{i}))-l(w;(x_{v},y_{v}))\right].

In addition, \nabla_{w}L(w;D_{1}) is called the deleted gradient of model weights w on dataset D_{1} with deleted point v.

Lemma 2 describes a basic relationship between the original gradient and the deleted gradient.

Lemma 2.

(Identity between original and deleted gradient):

\nabla_{w}L(w;D_{1})=\frac{n}{n-1}\nabla_{w}L(w;D_{0})-\frac{1}{n-1}\nabla_{w}l(w;v). (6)
Proof.

This identity is a direct consequence of Definition 5 and 6. ∎
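As a sanity check on Lemma 2, the following short sketch verifies the identity (6) numerically on random data; the helper grad_risk and the chosen shapes are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X, y, w = rng.normal(size=(n, d)), rng.normal(size=n), rng.normal(size=d)

def grad_risk(X, y, w):
    """Gradient of the empirical risk (4) with squared loss."""
    return -2.0 / len(y) * X.T @ (y - X @ w)

v = 7                                                         # index of the deleted point
lhs = grad_risk(np.delete(X, v, axis=0), np.delete(y, v), w)  # deleted gradient
grad_v = -2.0 * (y[v] - X[v] @ w) * X[v]                      # individual gradient of l(w; v)
rhs = n / (n - 1) * grad_risk(X, y, w) - grad_v / (n - 1)     # right-hand side of (6)
assert np.allclose(lhs, rhs)
```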

4.2 Strategy to find the perfect deleted point

Since we only consider one step of the noisy gradient update here and the initial weight w_{0} does not change whether or not we delete v, equation (5) tells us that in order to get similar model weights w_{1} for the original dataset D_{0} and the deleted dataset D_{1}, we need \Delta w=w_{1}-w_{0} to be similar, i.e., to have identical probability distributions. Equivalently, this means the TNR and FNR for the distributions under D_{0} and D_{1} should be close to each other, i.e., the membership advantage of our perfect deleted point v^{*} should be close to 0.

Lemma 3.

(Absolute membership advantage of v):

|Adv(v)|=|\Phi(\Phi^{-1}(1-\alpha)-d_{v})-\alpha| (7)

where \mu(D)=-\gamma\nabla_{w}L(w;D), \sigma_{\gamma}=\gamma\sigma, d_{v}=\frac{||\mu(D_{1})-\mu(D_{0})||_{2}}{\sigma_{\gamma}} is the signal-to-noise ratio for point v, and \alpha is the Type I error.

Proof.

See Appendix A.1. ∎

From Lemma 3, we notice that the absolute membership advantage depends on the Type I error \alpha and the signal-to-noise ratio d_{v}. The Type I error \alpha, which by definition is the false positive rate (FPR), i.e., the probability of rejecting the null hypothesis when it is actually true, is equivalent to our manually picked significance level; a small numerical sketch of equation (7) is given below. Next, we give an explicit formula for the signal-to-noise ratio d_{v}, obtained by applying Lemma 2, to help us find the perfect deleted point.
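A minimal numerical sketch of equation (7), assuming SciPy's norm.cdf and norm.ppf for \Phi and \Phi^{-1}; the function name abs_membership_advantage is ours.

```python
from scipy.stats import norm

def abs_membership_advantage(d_v, alpha):
    """|Adv(v)| = |Phi(Phi^{-1}(1 - alpha) - d_v) - alpha| from Lemma 3."""
    return abs(norm.cdf(norm.ppf(1.0 - alpha) - d_v) - alpha)

# |Adv(v)| vanishes when d_v = 2 * Phi^{-1}(1 - alpha), the target used in Section 4.2
print(abs_membership_advantage(2 * norm.ppf(0.99), alpha=0.01))   # ~0.0
```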

Lemma 4.

(Signal-to-noise ratio d_{v} for v=(x_{v},y_{v})\in D_{0}):

d_{v}=\frac{||y_{v}x_{v}-S_{yx}+S_{xx}w-x_{v}x_{v}^{T}w||_{2}}{\sqrt{\frac{\gamma(n-1)}{2}}\sigma} (8)

where S_{yx}=\frac{1}{n}\sum_{i=1}^{n}y_{i}x_{i} and S_{xx}=\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}^{T}.

Proof.

See Appendix A.2. ∎

From Lemmas 3 and 4, we know we need to find a data point v such that the d_{v} calculated from it using equation (8) minimizes the absolute membership advantage |Adv(v)|. By a simple calculation, d_{v} should be close to \Phi^{-1}(1-\alpha)-\Phi^{-1}(\alpha)=2\Phi^{-1}(1-\alpha) so that |Adv(v)| is minimized, i.e., close to 0. Therefore, the perfect deleted point problem can be reformulated as: we want to find the point v^{*}\in D_{0} such that

v^{*}=\mathop{\arg\min}\limits_{v\in D_{0}}|d_{v}-2\Phi^{-1}(1-\alpha)|.

Next we formally propose Algorithm 1 to find the perfect deleted point in the original dataset D_{0} (a Python sketch is given after the pseudocode). To find the perfect deleted point, we first obtain the signal-to-noise ratio d_{v} for each data point v\in D_{0} by applying Lemma 4. Then, by Lemma 3, we choose a significance level and use it as the Type I error \alpha. Finally, for each data point v, we compare the distance between d_{v} and 2\Phi^{-1}(1-\alpha); the point with the smallest distance is the perfect deleted point v^{*}. Note we introduce a hyper-parameter \delta here to quantify the largest acceptable distance, i.e., if the smallest distance |d_{v}-2\Phi^{-1}(1-\alpha)| over all v\in D_{0} is larger than \delta, then we say there is no perfect deleted point in the dataset. So instead of training n models separately on the deleted datasets and comparing their model weights to the original model weights for each data point, our approach only needs to compute the signal-to-noise ratio d_{v} and compare its value to 2\Phi^{-1}(1-\alpha) for all data points in one training step, which takes O(n) time and O(n) space for one step of noisy SGD training, where n is the size of the original dataset. Thus our approach for finding the perfect deleted point is more time and space efficient than the direct approach.

Algorithm 1 Find Perfect Deleted Point
1:Input: D_{0}=\{v_{i}\}_{i=1}^{n}, \gamma, \sigma^{2}, initial weight w, tolerance \delta, significance level \alpha^{\prime}
2:Output: v^{*}
3:S_{yx},S_{xx}\leftarrow\frac{1}{n}\sum_{i=1}^{n}y_{i}x_{i},\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}^{T}
4:d\leftarrow\delta
5:\alpha\leftarrow\alpha^{\prime}
6:v^{*}\leftarrow\texttt{None}
7:for each v=(x_{v},y_{v})\in D_{0} do
8:     d_{v}\leftarrow\frac{||y_{v}x_{v}-S_{yx}+S_{xx}w-x_{v}x_{v}^{T}w||_{2}}{\sqrt{\frac{\gamma(n-1)}{2}}\sigma}
9:     d_{1}\leftarrow|d_{v}-2\Phi^{-1}(1-\alpha)|
10:     if d_{1}\leq d then
11:         d\leftarrow d_{1}
12:         v^{*}\leftarrow v
13:     end if
14:end for
15:return v^{*}
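A minimal Python sketch of Algorithm 1 under the assumptions of Section 3.2, representing D_{0} as arrays X (shape (n, d)) and y (shape (n,)); the function name find_perfect_deleted_point is ours, and the return value is the index of v^{*} (or None if no candidate passes the tolerance \delta).

```python
import numpy as np
from scipy.stats import norm

def find_perfect_deleted_point(X, y, gamma, sigma, w, delta, alpha):
    """Sketch of Algorithm 1: score every point by equation (8) and pick the one
    whose d_v is closest to 2 * Phi^{-1}(1 - alpha), within tolerance delta."""
    n = len(y)
    S_yx = (y[:, None] * X).mean(axis=0)           # S_yx = (1/n) sum_i y_i x_i
    S_xx = X.T @ X / n                             # S_xx = (1/n) sum_i x_i x_i^T
    target = 2.0 * norm.ppf(1.0 - alpha)
    denom = np.sqrt(gamma * (n - 1) / 2.0) * sigma
    best, best_dist = None, delta                  # lines 4-6 of Algorithm 1
    for i in range(n):
        x_v, y_v = X[i], y[i]
        numer = np.linalg.norm(y_v * x_v - S_yx + S_xx @ w - x_v * (x_v @ w))
        d_v = numer / denom                        # line 8: equation (8)
        dist = abs(d_v - target)                   # line 9
        if dist <= best_dist:                      # lines 10-12
            best, best_dist = i, dist
    return best
```

The single pass over the dataset reflects the O(n) cost discussed above.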

5 MAIN INSIGHTS

In Section 5, we further discuss how the perfect deleted point affects the training performance (Section 5.1) and how it relates to privacy concerns in the training process (Section 5.2). This provides insight into why the perfect deleted point should be of interest.

5.1 Analysis of training loss for perfect deleted point

In order to understand how the perfect deleted point affects the training performance, we first propose the following concept of absolute membership error, which quantifies how far a point v is from being the perfect deleted point.

Definition 7.

(Absolute membership error): For a point v\in D_{0} with corresponding signal-to-noise ratio d_{v}, we define its absolute membership error as

\epsilon_{v}=d_{v}-2\Phi^{-1}(1-\alpha). (9)

Based on this definition, the absolute membership error of the perfect deleted point, \epsilon_{v^{*}}, should be close to 0 by the analysis in Section 4.2. From now on, we denote \epsilon_{v^{*}} by \epsilon^{*} for brevity. We propose the following lemma to bound the change of training loss when deleting point v.

Lemma 5.

(Bounds for change of empirical risk for deleting point v): Assume the change of empirical risk, i.e., L(w;D_{1})-L(w;D_{0}), is non-negative.

  1. Given point v=(x_{v},y_{v}) and its absolute membership error \epsilon_{v}, the change of empirical risk after deleting v is lower bounded by

    \frac{1}{n-1}L(w;D_{0})-[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]C-\frac{||S_{yx}-S_{xx}w||_{2}}{(n-1)||x_{v}||_{2}} (10)

    and upper bounded by

    \frac{1}{n-1}L(w;D_{0})+[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]C-\frac{||S_{yx}-S_{xx}w||_{2}}{(n-1)||x_{v}||_{2}} (11)

    where C=\frac{\sigma}{||x_{v}||_{2}}\sqrt{\frac{\gamma}{2(n-1)}}.

  2. Suppose the features of the training data \{(x_{i},y_{i})\}_{i=1}^{n} satisfy ||x_{i}||_{2}\geq B,\forall i\in[1,n], where B is a positive number. Then the lower bound can be generalized to

    \frac{1}{n-1}L(w;D_{0})-[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]D-\frac{||S_{yx}-S_{xx}w||_{2}}{(n-1)B} (12)

    and the upper bound can be generalized to

    \frac{1}{n-1}L(w;D_{0})+[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]D-\frac{||S_{yx}-S_{xx}w||_{2}}{(n-1)B} (13)

    where D=\frac{\sigma}{B}\sqrt{\frac{\gamma}{2(n-1)}}.

Note both bounds in the first and second part of this Lemma are non-negative because deletion of samples always increases training risk.

Proof.

See Appendix B.1. ∎

The novel part of Lemma 5 is that, given several candidates for the perfect deleted point, i.e., points that all give very similar d_{v}, the one with the smallest l_{2} feature norm ||x_{v}||_{2} is always preferable, because the lower bound on the change of training loss decreases as ||x_{v}||_{2} decreases. The upper bound also decreases as ||x_{v}||_{2} decreases, because the individual loss l(w;v) is non-negative, so the part after \frac{1}{n-1}L(w;D_{0}), which is -\frac{1}{n-1}l(w;v), is non-positive, and decreasing the denominator ||x_{v}||_{2} therefore decreases the upper bound as well. In other words, deleting the candidate with the smallest l_{2} feature norm potentially improves training performance the most, because both its lower and upper bounds on the increment of empirical risk are smaller.

Also, given the same training dataset, i.e., the same B, the perfect deleted point gives the tightest bound for the change of loss after deletion compared with other points, because \epsilon^{*}\leq\epsilon_{v}. In other words, we can obtain an accurate estimate of how the perfect deleted point will affect training performance by setting \epsilon^{*}=0, even before finding the exact perfect deleted point. A small sketch of the bounds in Lemma 5 is given below.
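A minimal sketch of the bounds (10)-(11), assuming the quantities L(w;D_{0}) and ||S_{yx}-S_{xx}w||_{2} have already been computed; the function name risk_change_bounds is ours.

```python
import numpy as np
from scipy.stats import norm

def risk_change_bounds(eps_v, alpha, x_v_norm, L0, S_norm, n, gamma, sigma):
    """Lower and upper bounds (10)-(11) on L(w; D1) - L(w; D0) after deleting v.

    `L0` is the original empirical risk L(w; D0), `S_norm` is ||S_yx - S_xx w||_2,
    and `x_v_norm` is ||x_v||_2; the other arguments follow the notation of Lemma 5."""
    C = sigma / x_v_norm * np.sqrt(gamma / (2.0 * (n - 1)))
    base = L0 / (n - 1) - S_norm / ((n - 1) * x_v_norm)
    spread = (eps_v + 2.0 * norm.ppf(1.0 - alpha)) * C
    return base - spread, base + spread
```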

5.2 Relation of privacy budget \epsilon for perfect deleted point

In this section, we try to understand how the perfect deleted point will bring about privacy issues in the training process.

Lemma 6.

(Lower bound for privacy budget for deleting point v): The privacy budget \epsilon after deleting point v with absolute membership error \epsilon_{v} is at least \max\{\ln[\Phi(\Phi^{-1}(\alpha)-\epsilon_{v})+1-\alpha],0\}.

Proof.

See Appendix B.2. ∎

From Lemma 6, we observe that, assuming the privacy budget must be non-negative, as long as the absolute membership error \epsilon_{v} is non-negative, it achieves the smallest possible lower bound of 0 for the privacy budget. In other words, if we have several candidates for the perfect deleted point, i.e., they all give very similar d_{v}, then the one with positive absolute membership error is considered the best option for the perfect deleted point, because it may potentially distort privacy guarantees to the minimum degree compared with deleting other points. Note we do not know whether it will actually cause the smallest privacy budget, because we only know it achieves the smallest lower bound for the privacy budget; only theoretically can it lead to a smaller privacy budget. A small sketch of the bound in Lemma 6 is given below.
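A minimal sketch of the bound in Lemma 6, assuming SciPy for \Phi and \Phi^{-1}; the function name privacy_budget_lower_bound is ours.

```python
import numpy as np
from scipy.stats import norm

def privacy_budget_lower_bound(eps_v, alpha):
    """Lemma 6: epsilon >= max{ ln[Phi(Phi^{-1}(alpha) - eps_v) + 1 - alpha], 0 }."""
    return max(np.log(norm.cdf(norm.ppf(alpha) - eps_v) + 1.0 - alpha), 0.0)

# for a non-negative absolute membership error the bound reduces to the trivial value 0
print(privacy_budget_lower_bound(eps_v=0.5, alpha=0.05))   # 0.0
```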

6 EXPERIMENT

Section 6 exemplifies the effectiveness of our approach by conducting an empirical experiment to find the perfect deleted point in a synthetic dataset. By comparing the model weights in different scenarios, we show that our perfect deleted point indeed changes the distribution of model weights to the minimal extent.

6.1 Experiment setup

To better visualize how the perfect deleted point affects the model weights, we generate a 2D synthetic dataset with 200 samples as the original dataset D_{0}, where the x coordinate of each point is randomly picked within the range [0,10] with seed 40. To get the y coordinate, we assume the true model behind our generated dataset is y=3.1415926535x, since we are dealing with classical linear regression. We then draw Gaussian noise with mean 0 and standard deviation 2, amplify its magnitude by a factor of 10, and add it to the true y coordinate to obtain the synthetic y coordinate. The scatterplot of the generated dataset is shown in Figure 1. Although this simple dataset may not mimic a real-world dataset, it suffices for analyzing the perfect deleted point problem in this paper, as it satisfies all the assumptions of the given linear regression scenario.

Figure 1: Synthetic Dataset

The hyperparameters used in this experiment are: initial weight w_{0}=0, learning rate \gamma=0.01, Gaussian noise \eta\sim N(0,4), Type I error \alpha=0.01, and tolerance \delta=100. Since this is a stochastic algorithm and we want the distribution of weights, we run the 1-step noisy SGD for 100 iterations to understand its overall behaviour. A sketch of the data generation and the 1-step update is given below.
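The following is a minimal sketch of the data generation and the 1-step noisy SGD update described above; the exact random-number pipeline is an assumption on our part, so the printed statistics are only indicative and need not match Table 1.

```python
import numpy as np

rng = np.random.default_rng(40)                    # seed 40, as in Section 6.1

# synthetic original dataset D0: 200 points, y = 3.1415926535 x + 10 * N(0, 2^2)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 3.1415926535 * x + 10.0 * rng.normal(0.0, 2.0, size=n)

# hyperparameters from Section 6.1 (alpha and delta enter only via Algorithm 1)
w0, gamma, sigma, alpha, delta = 0.0, 0.01, 2.0, 0.01, 100.0

def noisy_sgd_step_1d(w, x, y, rng):
    grad = -2.0 * np.mean((y - w * x) * x)         # gradient of the empirical risk in 1D
    return w - gamma * (grad + rng.normal(0.0, sigma))

# distribution of the weight after 1-step noisy SGD over 100 iterations
weights = np.array([noisy_sgd_step_1d(w0, x, y, rng) for _ in range(100)])
print(weights.mean(), weights.var())
```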

6.2 Result analysis

To understand the effect of the perfect deleted point, we first focus on the distribution of model weights after deleting it and compare it with the distribution of model weights when we do not delete any point. We use the distribution of model weights when we randomly delete a data point as a benchmark to better demonstrate the potential of the perfect deleted point compared with all the other points. Secondly, we repeat the same finding-and-deleting process for multiple steps of noisy SGD to amplify the effect of the perfect deleted point, since the deletion of one data point may not matter much compared with no deletion and random deletion, because the size of the original dataset D_{0} mitigates the impact.

Figure 2: Distribution of model weights after 1, 10, and 50 steps of noisy SGD using the perfect deleted point, a randomly deleted point, and no deleted point over 100 iterations. The histogram plots the occurrence of model weights in each bin, and the shaded area is the kernel density estimate (KDE) of the distribution.
Table 1: Statistics for Distributions of Model Weight
MODEL MEAN VARIANCE
1-step; perfect deleted point 1.77693 0.00051
1-step; random deleted point 1.78043 0.00061
1-step; no deleted point 1.77718 0.00046
10-step; perfect deleted point 3.31226 0.00079
10-step; random deleted point 3.31290 0.00265
10-step; no deleted point 3.30248 0.00066
50-step; perfect deleted point 3.41079 0.00063
50-step; random deleted point 3.35108 0.01560
50-step; no deleted point 3.36008 0.00051

Figure 2 and Table 1 show the distribution of model weights in the different cases both visually and statistically. From the first row of Figure 2, we see that in one step of noisy SGD all three cases exhibit a similar normal distribution, which is supported by the means and variances in the first three lines of Table 1. This is because the deletion of a single data point from D_{0}, which contains 200 samples, has only a trivial effect on the distribution of model weights. From the next two rows of Figure 2, we see that as the number of noisy SGD steps increases, the weight distribution for the perfect deleted point resembles the no-deletion distribution much more closely than the random-deletion one does. This means our perfect deleted point is indeed "perfect" in the sense of preserving the model weights after deletion compared with other points. Looking at the highlighted numbers in the third column of Table 1, we see that as the number of noisy SGD steps increases, the variance of the weight distribution under random deletion grows, which means the normal distribution becomes "flatter" and we are more likely to get random weights after more random deletions. By the convergence of noisy SGD, we know this is not acceptable, which again shows the advantage of the perfect deleted point.

However, if we look at the highlighted number in the second column of Table 1, we see that, despite having small variance, the mean of the distribution of model weights after 50 steps using the perfect deleted point shifts considerably compared with both no deletion and random deletion, giving a slightly different model weight distribution compared with no deleted point. To address this problem, we increase the Type I error \alpha to 0.05, and Figure 3 plots the corresponding result.

Figure 3: Distribution of model weights after 50 steps of noisy SGD using the perfect deleted point when \alpha=0.05, with mean = 3.38091 and variance = 0.00077.

As we can see, the mean becomes smaller and the whole distribution is therefore much more similar to the one with no deleted point. An intuitive explanation may be that as we increase the Type I error \alpha, we increase the significance level as well, which allows more room for mistakes when using the perfect deleted point over many steps of noisy SGD, i.e., when deleting more points, thus improving the statistical power of our perfect deleted point. This trade-off between \alpha and the mean may lead to potential future work and needs more rigorous investigation.

7 CONCLUSION

In conclusion, our research proposes the perfect deleted point problem in one step of noisy SGD for classical linear regression. We present a hypothesis testing approach to solve this problem using the signal-to-noise ratio and construct an algorithm based on it. Moreover, we analyze how the perfect deleted point is related to training performance and differential privacy, thus showing the consequences of the perfect deleted point and further demonstrating the importance of this problem. Finally, we create a synthetic dataset, implement our algorithm on it, and show the minimal impact the perfect deleted point has on the distribution of model weights. The implications of our work demonstrate the need for data deletion strategies in different scenarios to improve model training efficiency. Moving forward, our research sets a new standard for understanding and analyzing such problems, supporting the efficiency and security of ML model training. Potential future work includes analyzing data deletion for the whole SGD training process and for the general regression task.

References

  • Jordan and Mitchell (2015) Michael I Jordan and Tom M Mitchell. Machine learning: Trends, perspectives, and prospects. Science, 349(6245):255–260, 2015.
  • Garg et al. (2020) Sanjam Garg, Shafi Goldwasser, and Prashant Nalini Vasudevan. Formalizing data deletion in the context of the right to be forgotten. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 373–402. Springer, 2020.
  • Ginart et al. (2019) Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making ai forget you: Data deletion in machine learning. Advances in neural information processing systems, 32, 2019.
  • Oesterling et al. (2024) Alex Oesterling, Jiaqi Ma, Flavio Calmon, and Himabindu Lakkaraju. Fair machine unlearning: Data removal while mitigating disparities. In International Conference on Artificial Intelligence and Statistics, pages 3736–3744. PMLR, 2024.
  • Ramezan et al. (2021) Christopher A Ramezan, Timothy A Warner, Aaron E Maxwell, and Bradley S Price. Effects of training set size on supervised machine-learning land-cover classification of large-area high-resolution remotely sensed data. Remote Sensing, 13(3):368, 2021.
  • Foody et al. (2006) Giles M Foody, Ajay Mathur, Carolina Sanchez-Hernandez, and Doreen S Boyd. Training set size requirements for the classification of a specific class. Remote Sensing of Environment, 104(1):1–14, 2006.
  • Budach et al. (2022) Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Noack, Hendrik Patzlaff, Felix Naumann, and Hazar Harmouch. The effects of data quality on machine learning performance. arXiv preprint arXiv:2207.14529, 2022.
  • Narciso and Martins (2020) Diogo AC Narciso and FG Martins. Application of machine learning tools for energy efficiency in industry: A review. Energy Reports, 6:1181–1199, 2020.
  • Hawkins (2004) Douglas M Hawkins. The problem of overfitting. Journal of chemical information and computer sciences, 44(1):1–12, 2004.
  • Council of the European Union (2016) Council of the European Union. Council regulation (eu) no 2016/679. https://eur-lex.europa.eu/eli/reg/2016/679/oj, 2016.
  • State of California Department of Justice (2018) State of California Department of Justice. California consumer privacy act (ccpa). https://oag.ca.gov/privacy/ccpa, 2018.
  • Cao and Yang (2015) Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In 2015 IEEE symposium on security and privacy, pages 463–480. IEEE, 2015.
  • Bourtoule et al. (2021) Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pages 141–159. IEEE, 2021.
  • Baumhauer et al. (2022) Thomas Baumhauer, Pascal Schöttle, and Matthias Zeppelzauer. Machine unlearning: Linear filtration for logit-based classifiers. Machine Learning, 111(9):3203–3226, 2022.
  • Golatkar et al. (2020) Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9304–9312, 2020.
  • Graves et al. (2021) Laura Graves, Vineel Nagisetty, and Vijay Ganesh. Amnesiac machine learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11516–11524, 2021.
  • Avella-Medina et al. (2023) Marco Avella-Medina, Casey Bradshaw, and Po-Ling Loh. Differentially private inference via noisy optimization. The Annals of Statistics, 51(5):2067–2092, 2023.
  • Das et al. (2023) Rudrajit Das, Satyen Kale, Zheng Xu, Tong Zhang, and Sujay Sanghavi. Beyond uniform lipschitz condition in differentially private optimization. In International Conference on Machine Learning, pages 7066–7101. PMLR, 2023.
  • Wu et al. (2020a) Jingfeng Wu, Wenqing Hu, Haoyi Xiong, Jun Huan, Vladimir Braverman, and Zhanxing Zhu. On the noisy gradient descent that generalizes as sgd. In International Conference on Machine Learning, pages 10367–10376. PMLR, 2020a.
  • Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
  • Wu et al. (2020b) Yinjun Wu, Edgar Dobriban, and Susan Davidson. Deltagrad: Rapid retraining of machine learning models. In International Conference on Machine Learning, pages 10355–10366. PMLR, 2020b.
  • Guo et al. (2019) Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens Van Der Maaten. Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030, 2019.
  • Neel et al. (2021) Seth Neel, Aaron Roth, and Saeed Sharifi-Malvajerdi. Descent-to-delete: Gradient-based methods for machine unlearning. In Algorithmic Learning Theory, pages 931–962. PMLR, 2021.
  • Gupta et al. (2021) Varun Gupta, Christopher Jung, Seth Neel, Aaron Roth, Saeed Sharifi-Malvajerdi, and Chris Waites. Adaptive machine unlearning. Advances in Neural Information Processing Systems, 34:16319–16330, 2021.
  • Shokri et al. (2017) Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), pages 3–18. IEEE, 2017.
  • Leemann et al. (2023) Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Gaussian membership inference privacy, 2023. URL https://arxiv.org/abs/2306.07273.
  • Yeom et al. (2018a) Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pages 268–282, 2018a. doi: 10.1109/CSF.2018.00027.
  • Yeom et al. (2018b) Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting, 2018b.
  • Ying (2019) Xue Ying. An overview of overfitting and its solutions. In Journal of physics: Conference series, volume 1168, page 022022. IOP Publishing, 2019.
  • Acun et al. (2021) Bilge Acun, Matthew Murphy, Xiaodong Wang, Jade Nie, Carole-Jean Wu, and Kim Hazelwood. Understanding training efficiency of deep learning recommendation models at scale. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 802–814. IEEE, 2021.
  • Zhu et al. (2016) Xiangxin Zhu, Carl Vondrick, Charless C Fowlkes, and Deva Ramanan. Do we need more training data? International Journal of Computer Vision, 119(1):76–92, 2016.
  • Wang and Cheng (2024) Chi-Hua Wang and Guang Cheng. Badgd: A unified data-centric framework to identify gradient descent vulnerabilities, 2024. URL https://arxiv.org/abs/2405.15979.
  • Izzo et al. (2021) Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics, pages 2008–2016. PMLR, 2021.
  • Chourasia and Shah (2023) Rishav Chourasia and Neil Shah. Forget unlearning: Towards true data-deletion in machine learning. In International Conference on Machine Learning, pages 6028–6073. PMLR, 2023.
  • Pawelczyk et al. (2024) Martin Pawelczyk, Jimmy Z Di, Yiwei Lu, Gautam Kamath, Ayush Sekhari, and Seth Neel. Machine unlearning fails to remove data poisoning attacks. arXiv preprint arXiv:2406.17216, 2024.
  • Sekhari et al. (2021) Ayush Sekhari, Jayadev Acharya, Gautam Kamath, and Ananda Theertha Suresh. Remember what you want to forget: Algorithms for machine unlearning. Advances in Neural Information Processing Systems, 34:18075–18086, 2021.
  • Chen et al. (2021) Min Chen, Zhikun Zhang, Tianhao Wang, Michael Backes, Mathias Humbert, and Yang Zhang. When machine unlearning jeopardizes privacy. In Proceedings of the 2021 ACM SIGSAC conference on computer and communications security, pages 896–911, 2021.
  • Steinke et al. (2023) Thomas Steinke, Milad Nasr, and Matthew Jagielski. Privacy auditing with one (1) training run, 2023. URL https://arxiv.org/abs/2305.08846.
  • Cebere et al. (2024) Tudor Cebere, Aurélien Bellet, and Nicolas Papernot. Tighter privacy auditing of dp-sgd in the hidden state threat model, 2024. URL https://arxiv.org/abs/2405.14457.
  • Scharf and Demeure (1991) Louis L Scharf and Cédric Demeure. Statistical signal processing: detection, estimation, and time series analysis. Addison-Wesley, 1991.

Checklist

  1. For all models and algorithms presented, check if you include:

    (a) A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes] See Section 4

    (b) An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Yes] See Section 4

    (c) (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Yes] See Appendix

  2. For any theoretical claim, check if you include:

    (a) Statements of the full set of assumptions of all theoretical results. [Yes] See Section 3

    (b) Complete proofs of all theoretical results. [Yes] See Appendix

    (c) Clear explanations of any assumptions. [Yes] See Section 3

  3. For all figures and tables that present empirical results, check if you include:

    (a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Yes] See Appendix

    (b) All the training details (e.g., data splits, hyperparameters, how they were chosen). [Yes] See Section 6

    (c) A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Yes] See Section 6

    (d) A description of the computing infrastructure used (e.g., type of GPUs, internal cluster, or cloud provider). [Not Applicable]

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:

    (a) Citations of the creator if your work uses existing assets. [Not Applicable]

    (b) The license information of the assets, if applicable. [Not Applicable]

    (c) New assets either in the supplemental material or as a URL, if applicable. [Not Applicable]

    (d) Information about consent from data providers/curators. [Not Applicable]

    (e) Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable]

  5. If you used crowdsourcing or conducted research with human subjects, check if you include:

    (a) The full text of instructions given to participants and screenshots. [Not Applicable]

    (b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable]

    (c) The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable]

Appendix A Proof sketch for section 4

A.1 Proof of Lemma 3

Proof of Lemma 3 follows the same logic as Section 4.4 of Scharf and Demeure (1991). It is also adapted from Section A.2, Proof of Lemma 4, in Wang and Cheng (2024). However, here we calculate the absolute membership advantage instead of the trade-off function as in Wang and Cheng (2024). We restate the proof here for the readers' reference and change the final step according to our needs.

We use binary hypothesis testing on the distributions of \Delta w=w_{1}-w_{0} under D_{0} and D_{1}, denoted \Delta w(D_{0}) and \Delta w(D_{1}), as defined in Section 3.2, to calculate the FNR and TNR, and therefore the membership advantage of v.

H_{0}:\Delta w(D_{0})\sim N(\mu(D_{0}),\sigma_{\gamma}^{2}I),\quad H_{1}:\Delta w(D_{1})\sim N(\mu(D_{1}),\sigma_{\gamma}^{2}I) (14)

Define W=(\sigma_{\gamma}^{2}I)^{-1}(\mu(D_{1})-\mu(D_{0})). Then the log likelihood ratio has the form

L(\Delta w)=W^{T}\Delta w (15)

by equation (4.29) in Scharf and Demeure (1991). Equivalently, the binary hypothesis testing of (14) can be changed into

H_{0}:L(\Delta w)\sim N(-\frac{d_{v}^{2}}{2},d_{v}^{2}),\quad H_{1}:L(\Delta w)\sim N(\frac{d_{v}^{2}}{2},d_{v}^{2}) (16)

where

d_{v}^{2}=W^{T}(\sigma_{\gamma}^{2}I)W=(\mu(D_{1})-\mu(D_{0}))^{T}(\sigma_{\gamma}^{2}I)^{-1}(\mu(D_{1})-\mu(D_{0})) (17)

is called the signal-to-noise ratio.

At significance level \alpha, which is also the Type I error, we know (16) has Type II error

\beta=1-\Phi(\Phi^{-1}(1-\alpha)-d_{v}). (18)

Since the TNR is the probability of failing to reject the null hypothesis given that the null hypothesis is true, which is equal to 1-\alpha, we have

|Adv(v)|=|TNR-FNR|=|1-\alpha-\beta|=|\Phi(\Phi^{-1}(1-\alpha)-d_{v})-\alpha| (19)

as desired.

A.2 Proof of Lemma 4

Proof of Lemma 4 follows similar logic to Section B.6, Proof of Lemma 10, in Wang and Cheng (2024). However, here the gradient of the loss for the deleted dataset D_{1} changes, so we need to be careful in the computation. We want to calculate

d_{v}=\frac{||\mu(D_{1})-\mu(D_{0})||_{2}}{\sigma_{\gamma}}=\frac{||\nabla_{w}L(w;D_{1})-\nabla_{w}L(w;D_{0})||_{2}}{\sqrt{\gamma}\sigma} (20)

Applying Lemma 2, we get

d_{v}=\frac{||\nabla_{w}L(w;D_{0})-\nabla_{w}l(w;v)||_{2}}{\sqrt{\gamma(n-1)}\sigma} (21)

Given l(w,(x,y))=(y-\langle w,x\rangle)^{2}, the gradient is \nabla_{w}l(w,(x,y))=-2(y-\langle w,x\rangle)x. Then

\nabla_{w}L(w;D_{0})-\nabla_{w}l(w;v) (22)
=\nabla_{w}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\langle w,x_{i}\rangle)^{2}-\nabla_{w}(y_{v}-\langle w,x_{v}\rangle)^{2}
=\nabla_{w}\left[\frac{1}{n}\sum_{i=1}^{n}y_{i}^{2}-w^{T}\left(\frac{2}{n}\sum_{i=1}^{n}y_{i}x_{i}\right)+w^{T}\left(\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}^{T}\right)w\right]-\nabla_{w}\left[y_{v}^{2}-2w^{T}y_{v}x_{v}+w^{T}x_{v}x_{v}^{T}w\right]
=\nabla_{w}\left[\frac{1}{n}\sum_{i=1}^{n}y_{i}^{2}-y_{v}^{2}\right]-2\nabla_{w}\left[w^{T}\left(\frac{1}{n}\sum_{i=1}^{n}y_{i}x_{i}-y_{v}x_{v}\right)\right]+\nabla_{w}w^{T}\left[\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}^{T}-x_{v}x_{v}^{T}\right]w
=2\left(y_{v}x_{v}-\frac{1}{n}\sum_{i=1}^{n}y_{i}x_{i}\right)+2\left[\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}^{T}-x_{v}x_{v}^{T}\right]w
=2(y_{v}x_{v}-S_{yx})+2[S_{xx}-x_{v}x_{v}^{T}]w

where S_{yx}=\frac{1}{n}\sum_{i=1}^{n}y_{i}x_{i} and S_{xx}=\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}^{T}. Then (21) can be written as

d_{v}=\frac{||2(y_{v}x_{v}-S_{yx})+2[S_{xx}-x_{v}x_{v}^{T}]w||_{2}}{\sqrt{\gamma(n-1)}\sigma}=\frac{||(y_{v}x_{v}-S_{yx})+[S_{xx}-x_{v}x_{v}^{T}]w||_{2}}{\sqrt{\frac{\gamma(n-1)}{2}}\sigma}=\frac{||y_{v}x_{v}-S_{yx}+S_{xx}w-x_{v}x_{v}^{T}w||_{2}}{\sqrt{\frac{\gamma(n-1)}{2}}\sigma} (23)

Appendix B Proof sketch for section 5

B.1 Proof of Lemma 5

Since L(w;D_{1})=\frac{n}{n-1}L(w;D_{0})-\frac{1}{n-1}l(w;v), we only need to focus on L(w;D_{1})-L(w;D_{0})=\frac{1}{n-1}L(w;D_{0})-\frac{1}{n-1}l(w;v)=\frac{1}{n-1}L(w;D_{0})-\frac{1}{n-1}(y_{v}-\langle x_{v},w\rangle)^{2}=\frac{1}{n-1}L(w;D_{0})-\frac{1}{n-1}(y_{v}-x_{v}^{T}w)^{2}, which is non-negative because L(w;D_{0})>l(w;v). Based on equation (23), we know

(d_{v})^{2}=\left(\frac{||y_{v}x_{v}-S_{yx}+S_{xx}w-x_{v}x_{v}^{T}w||_{2}}{\sqrt{\frac{\gamma(n-1)}{2}}\sigma}\right)^{2} (24)
\geq\frac{2}{\sigma^{2}\gamma(n-1)}[||y_{v}x_{v}-x_{v}x_{v}^{T}w||_{2}-||S_{yx}-S_{xx}w||_{2}]^{2}
=\frac{2}{\sigma^{2}\gamma(n-1)}[||x_{v}||_{2}\cdot||y_{v}-x_{v}^{T}w||_{2}-||S_{yx}-S_{xx}w||_{2}]^{2}.

The inequality is obtained from the triangle inequality for the l_{2} norm. From Definition 7, we also know

d_{v}=\epsilon_{v}+2\Phi^{-1}(1-\alpha) (25)

So we have

[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]^{2}\geq\frac{2}{\sigma^{2}\gamma(n-1)}[||x_{v}||_{2}\cdot l(w,v)-||S_{yx}-S_{xx}w||_{2}]^{2}

[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]^{2}\frac{\sigma^{2}\gamma(n-1)}{2}\geq[||x_{v}||_{2}\cdot l(w,v)-||S_{yx}-S_{xx}w||_{2}]^{2}

-[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]\frac{\sigma}{||x_{v}||_{2}}\sqrt{\frac{\gamma(n-1)}{2}}+\frac{||S_{yx}-S_{xx}w||_{2}}{||x_{v}||_{2}}\leq l(w;v)\leq[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]\frac{\sigma}{||x_{v}||_{2}}\sqrt{\frac{\gamma(n-1)}{2}}+\frac{||S_{yx}-S_{xx}w||_{2}}{||x_{v}||_{2}}

Multiplying by \frac{-1}{n-1} and adding \frac{1}{n-1}L(w;D_{0}) on both sides of the inequality, we get the first part of Lemma 5.

For part 2 of Lemma 5, we further write equation (24) as

(d_{v})^{2}\geq\frac{2}{\sigma^{2}\gamma(n-1)}[||x_{v}||_{2}\cdot||y_{v}-x_{v}^{T}w||_{2}-||S_{yx}-S_{xx}w||_{2}]^{2}\geq\frac{2}{\sigma^{2}\gamma(n-1)}[B\cdot||y_{v}-x_{v}^{T}w||_{2}-||S_{yx}-S_{xx}w||_{2}]^{2} (26)

since ||x_{v}||_{2}\geq B by assumption. Following the same logic as in the first part, we obtain the result shown in part 2.

B.2 Proof of Lemma 6

Applying Lemma 1 and equation (7), we get

\Phi(\Phi^{-1}(1-\alpha)-d)-\alpha\leq e^{\epsilon}-1

So

\epsilon\geq\ln[\Phi(\Phi^{-1}(1-\alpha)-d)+1-\alpha] (27)

For point v, we know d_{v}=\epsilon_{v}+2\Phi^{-1}(1-\alpha). So the lower bound for the privacy budget \epsilon after deleting point v is

\ln[\Phi(\Phi^{-1}(1-\alpha)-d)+1-\alpha]=\ln[\Phi(\Phi^{-1}(1-\alpha)-\epsilon_{v}-2\Phi^{-1}(1-\alpha))+1-\alpha]=\ln[\Phi(\Phi^{-1}(\alpha)-\epsilon_{v})+1-\alpha]. (28)

Since we assume the privacy budget to be non-negative, we take the max of equation (28) and 0 to obtain the desired result.