
Data Deletion for Linear Regression with Noisy SGD

Zhangjie Xia    Chi-Hua Wang    Guang Cheng
Abstract

In the current era of big data and machine learning, it is essential to find ways to shrink the size of the training dataset while preserving training performance, in order to improve efficiency. A key challenge is to provide practical ways to find points that can be deleted without significantly harming the training result or causing problems such as underfitting. We therefore present the perfect deleted point problem for 1-step noisy SGD in the classical linear regression task, which aims to find the point in the training dataset whose deletion leaves the resulting model identical to the one trained without deleting it. We apply the so-called signal-to-noise ratio and show that its value is closely related to the selection of the perfect deleted point. We also implement an algorithm based on this idea and empirically demonstrate its effectiveness on a synthetic dataset. Finally, we analyze the consequences of the perfect deleted point, specifically how it affects training performance and the privacy budget, thereby highlighting its potential. This research underscores the importance of data deletion and calls for more studies in this field.

1 INTRODUCTION

Machine learning (ML) is a field of study in artificial intelligence concerned with the development of statistical algorithms that can learn from given data and generalize to unseen data, thus performing future tasks without explicit instructions. It plays an essential role in our everyday life and has become one of the most rapidly growing research fields. Jordan and Mitchell (2015) reviewed the core methods and recent progress in this field and suggested that, despite its remarkable recent success, it still presents numerous research opportunities and many unsolved problems.

Data deletion (Garg et al., 2020; Ginart et al., 2019; Oesterling et al., 2024) is one of the most intriguing areas among these research questions. It is widely known that more training data tends to improve training performance dramatically (Ramezan et al., 2021; Foody et al., 2006). However, training with more data also has serious drawbacks. For instance, Budach et al. (2022) empirically showed that the data quality of the training dataset can seriously affect the training performance of many widely used ML algorithms, including regression, classification, and clustering; a larger training dataset therefore carries a higher risk of containing poisoning data points. Narciso and Martins (2020) suggested that the massive amount of data used by industry can cause serious energy-efficiency problems, increasing time and space complexity during training and boosting energy consumption. Hawkins (2004) pointed out that too much training data can make an ML model learn noise and unimportant patterns from the given data, thus generalizing poorly to unseen testing data, which is referred to as overfitting.

Additionally, many regulations have been established on behalf of the "Right to be Forgotten" in many countries and states, including the EU regulation by the Council of the European Union (2016), and the recent California Consumer Privacy Act (CCPA) (State of California Department of Justice, 2018) allows users to require companies such as Google, Facebook, and Twitter to forget and delete their personal data to protect their privacy. Users can also ask these platforms to erase outdated information or limit the retention of personal details, such as photos and emails, to a specific time period. However, since personal data is often collected incrementally, the process of deleting it in chronological order poses significant challenges for machine learning models. This issue highlights the need for the development and analysis of deletion-aware online machine learning methods capable of handling such requests efficiently.

Therefore, since more training data doesn't necessarily lead to better results and may be prohibited in some cases, it's natural to ask the following question:

Can we use a smaller training dataset without sacrificing training accuracy?

Data deletion, or machine unlearning (Cao and Yang, 2015; Bourtoule et al., 2021), addresses this problem by selecting certain training samples to be deleted whose removal does not harm the training performance and results in an identical final model. Further work has proposed many solutions to this question in certain scenarios (Baumhauer et al., 2022; Golatkar et al., 2020; Graves et al., 2021), and we refer the readers to Section 2 for more details.

This paper proposes an innovative approach to the question of data deletion in one step of noisy stochastic gradient descent (SGD) for the linear regression task. Sections 2 and 3 present related work and the basic problem formulation. Section 4 introduces our statistical method for solving the data deletion problem and presents the corresponding algorithm. Section 5 highlights the major consequences of our method, specifically how it affects empirical risk and model privacy. Section 6 empirically demonstrates the use of the perfect deleted point on a synthetic dataset. Section 7 summarizes the paper and suggests possible future work.

Our Contributions:

  1. Propose a hypothesis testing method to solve the data deletion problem in linear regression using noisy SGD.

  2. Show that the deletion of our perfect deleted point potentially affects the training performance the least compared with other points.

  3. Show that the deletion of our perfect deleted point potentially poses the minimum privacy issue compared with other points.

  4. Empirically demonstrate the effectiveness and potential of the perfect deleted point on a synthetic dataset.

2 RELATED LITERATURE

Noisy gradient descent.

Our research contributes to the study of machine learning security, particularly focusing on noisy gradient descent (Noisy GD) (Avella-Medina et al., 2023; Das et al., 2023; Wu et al., 2020a). Noisy GD is best exemplified by DP-SGD, proposed by Abadi et al. (2016). The basic idea is to introduce noise into the gradient descent process to provide privacy guarantees during training. In the DP-SGD framework, each update carries a privacy cost, which quantifies the privacy leakage in each step. There are many causes of privacy leakage, and our research shows how the deletion of one data point from the training dataset can affect the privacy budget in Section 5.2. We conclude that there exists a certain data point, which we call the perfect deleted point, whose deletion incurs the smallest privacy budget among all points, and we highlight potential future work on how the behaviour of training data can affect machine learning security.

Data deletion.

The key to our research is to understand data deletion in a specific framework and provide a reliable approach to solve this problem. Despite the rapid progress in this field, we have found that most studies use a single metric, such as accuracy, as the dominating factor to help them find points to be deleted (Wu et al., 2020b; Guo et al., 2019; Garg et al., 2020; Neel et al., 2021; Gupta et al., 2021). This is ad hoc, leading to unreliable and inconsistent results. Our research addresses this issue by applying several metrics, namely the distribution of model weights, the training loss, and the privacy budget, to verify that our perfect deleted point is indeed the "perfect" one to be deleted among all data points in the training dataset, thus enhancing the reliability of our method and providing a new standard for future researchers working in this domain. We believe our contribution will significantly improve the theoretical tools available for data deletion, fostering greater trust and robustness in this area.

Membership inference attack.

Our research approach is inspired by recent progress in the area of membership inference attacks (MI attacks). In such an attack, an adversary tries to predict whether a certain data point belongs to the training dataset or to the overall data distribution, given either the final model output (black-box attack) or information about the model throughout the training process (white-box attack) (Shokri et al., 2017). While our initial goal is to find the perfect deleted point, equivalently we want to find a data point from the training dataset such that the adversary is most likely to fail to perform an MI attack for that point, i.e., the data point is somewhat redundant and makes limited contribution to the training of the model. Many researchers have contributed to this problem from different perspectives, and we are specifically inspired by the works of Leemann et al. (2023) and Yeom et al. (2018a). Yeom et al. (2018a) analyze MI attacks using the concept of membership advantage; inspired by this, we apply the adjusted concept of absolute membership advantage to find the perfect deleted point. Leemann et al. (2023) analyze MI attacks from a hypothesis testing perspective; similarly, we find the perfect deleted point with the smallest absolute membership advantage using a likelihood ratio test. We believe the concept of membership advantage and the approach of hypothesis testing may yield more future work on the problem of data deletion.

Here we also restate two concepts from Yeom et al. (2018b) to prepare the readers for our later discussion. The first one is membership advantage.

Definition 1.

(Membership experiment Exp^{M}(A,D) (Yeom et al., 2018b)): Let A be an adversary, S be the training dataset, and D be a probability distribution over data points (x,y). The experiment proceeds as follows:

  1. Sample S\sim D^{n}.

  2. Draw b\in\{0,1\} uniformly at random.

  3. Draw z\sim S if b=0, or z\sim D if b=1.

  4. A tries to predict the value of b and outputs either 0 or 1. If it makes the right prediction, i.e., its output is equal to b, then Exp^{M}(A,D)=1. Otherwise, Exp^{M}(A,D)=0.

In other words, in this experiment the adversary A tries to predict whether data point z is drawn from the training dataset or from the whole data distribution. If it predicts correctly, then the experiment is successful and outputs 1; otherwise it fails and outputs 0. Based on the experiment, we further define membership advantage as follows:

Definition 2.

(Membership advantage (Yeom et al., 2018b)): The membership advantage of point v is defined as

Adv(v)=Pr[A=0|b=0]-Pr[A=0|b=1].

Equivalently, the membership advantage can be understood as the difference between the true negative rate (TNR), i.e., Pr[A=0|b=0], and the false negative rate (FNR), i.e., Pr[A=0|b=1], of the membership experiment Exp^{M}(A,D). Intuitively, if the TNR is close to the FNR, then it is hard to tell whether z is drawn from S or D, which means the null and alternative hypotheses give similar probability distributions. We later leverage this idea to find the perfect deleted point.
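To make Definitions 1 and 2 concrete, the following is a minimal Monte Carlo sketch of the membership experiment; the names membership_experiment, adversary, and data_dist_sampler, as well as the use of NumPy, are our own illustrative assumptions and not part of Yeom et al. (2018b).

```python
import numpy as np

rng = np.random.default_rng(0)

def membership_experiment(adversary, train_set, data_dist_sampler, trials=10_000):
    """Monte Carlo estimate of Adv = Pr[A=0 | b=0] - Pr[A=0 | b=1] (Definition 2).

    `adversary(z)` returns a guess in {0, 1}; `data_dist_sampler()` draws one
    point from the underlying distribution D; `train_set` plays the role of S."""
    tn = fn = n0 = n1 = 0
    for _ in range(trials):
        b = rng.integers(0, 2)                            # step 2: b uniform on {0, 1}
        z = train_set[rng.integers(len(train_set))] if b == 0 else data_dist_sampler()
        if b == 0:
            n0 += 1
            tn += (adversary(z) == 0)                     # contributes to the TNR
        else:
            n1 += 1
            fn += (adversary(z) == 0)                     # contributes to the FNR
    return tn / n0 - fn / n1                              # estimated membership advantage
```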

Another important concept we need from Yeom et al. (2018b) is \epsilon-differential privacy, where \epsilon is called the privacy budget and a large \epsilon indicates serious concerns about training data confidentiality.

Definition 3.

(Differential privacy (Yeom et al., 2018b)): An algorithm A:X^{n}\to Y satisfies \epsilon-differential privacy if for all S,S^{\prime}\in X^{n} that differ in a single value and all output sets Y^{\prime}\subseteq Y, the following holds:

Pr[A(S)\in Y^{\prime}]\leq e^{\epsilon}Pr[A(S^{\prime})\in Y^{\prime}]. (1)

The following Lemma characterizes the connection between membership advantage and privacy budget, which will be used in Section 5.2.

Lemma 1.

(Upper bound of membership advantage for an \epsilon-differentially private algorithm (Yeom et al., 2018b)) Let A be an \epsilon-differentially private algorithm; then we have

Adv(A)\leq e^{\epsilon}-1. (2)

3 PROBLEM SETUP

3.1 Perfect deleted point problem

Assume we are working with two different datasets: D_{0}, referred to as the original dataset, and D_{1}, known as the deleted dataset. Our goal is to find a point v^{*}\in D_{0}, called the perfect deleted point, whose deletion causes the minimum change to the final model, i.e., the distribution of model weights resulting from deleting v^{*} is identical to the one obtained without deleting it, compared with all the other points.

However, it can be difficult and inefficient to find the perfect deleted point directly. Intuitively, for each data point v_{i}\in D_{0}, we would need to delete it from the original dataset D_{0} and use the deleted dataset D_{0}\setminus v_{i} to train a new ML model. Among the n models obtained by deleting each v_{i}, we want the data point v^{*}\in D_{0} such that training on the deleted dataset D_{1}=D_{0}\setminus v^{*} yields the model weights most similar to those trained on the original dataset D_{0}, compared with deleting any of the other n-1 data points. This direct approach repeats the deleting-and-training process n times, which is unnecessary: we propose an alternative approach in this paper that requires no additional deleting-and-training at all, discussed in detail in Section 4.2. A sketch of the brute-force baseline is given below for contrast.
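The following is a minimal sketch of the brute-force baseline described above; the names naive_perfect_deletion, train_fn, and weight_distance are hypothetical placeholders, and D0 is assumed to be a Python list of data points.

```python
import numpy as np

def naive_perfect_deletion(D0, train_fn, weight_distance, n_runs=100):
    """Brute-force baseline: retrain once per candidate deletion and return the
    index of the point whose deleted-dataset model stays closest to the original.

    `train_fn(dataset)` returns the model weights after training (here, after
    1-step noisy SGD); `weight_distance` compares two collections of weights,
    gathered over `n_runs` repetitions to account for the injected noise."""
    w_orig = [train_fn(D0) for _ in range(n_runs)]
    best_idx, best_dist = None, np.inf
    for i in range(len(D0)):
        D1 = D0[:i] + D0[i + 1:]                  # deleted dataset D0 \ {v_i}
        w_del = [train_fn(D1) for _ in range(n_runs)]
        dist = weight_distance(w_orig, w_del)     # drift between the weight distributions
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx
```

This baseline requires n separate retrainings (each repeated n_runs times), which is exactly the cost our approach avoids.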

We are motivated to find the perfect deleted point because previous studies have shown that, despite the improvement in training performance from a larger training dataset (Ramezan et al., 2021; Foody et al., 2006), too much training data can cause overfitting if the model becomes too complex and learns noise or irrelevant patterns rather than generalizing well (Ying, 2019; Hawkins, 2004). Moreover, using a smaller training dataset is more computationally friendly and data-efficient (Acun et al., 2021; Zhu et al., 2016).

3.2 Assumptions

In this paper, we propose a novel method for finding the perfect deleted point with the following assumptions:

  1. The original dataset D_{0} contains finitely many data points, i.e., n is a finite positive integer.

  2. We consider the classical linear regression task, i.e., point v_{i}=(x_{i},y_{i})\in D_{0}, where x_{i} is the feature vector and y_{i} is the corresponding label. Thus, the loss function of point v_{i} for an ML model with weights w is

    l(w,v_{i})=(y_{i}-\langle w,x_{i}\rangle)^{2}. (3)

    The empirical risk for model weights w and dataset D=\{(x_{i},y_{i})\}_{i=1}^{n} is defined as

    L(w;D)=\frac{1}{n}\sum_{i=1}^{n}l(w;(x_{i},y_{i})), (4)

    which evaluates the discrepancy between predicted and true outcomes over the whole dataset. The mean is taken over all individual losses because using the mean gradient is standard in the noisy SGD update defined below.

  3. The ML model uses noisy gradient descent to minimize the empirical risk defined in equation (4) and update its weights in each step, i.e.,

    w_{1}=w_{0}-\gamma(\nabla L(w;D)+\eta) (5)

    where \eta\sim N(0,\sigma^{2}I) is Gaussian noise and w_{0},w_{1} are the model weights before and after the noisy gradient step, respectively. Note we only consider one step of noisy gradient descent in this paper. A minimal sketch of this update is given after this list.
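The following is a minimal Python sketch of the 1-step noisy SGD update in equation (5) with the squared loss (3) and empirical risk (4); the function name noisy_sgd_step and the array-based data layout (X of shape (n, d), y of shape (n,)) are illustrative assumptions.

```python
import numpy as np

def noisy_sgd_step(w0, X, y, gamma, sigma, rng):
    """One step of noisy gradient descent: w1 = w0 - gamma * (grad L(w0; D) + eta)."""
    n = len(y)
    residual = y - X @ w0                         # y_i - <w0, x_i> for each sample
    grad = -2.0 / n * X.T @ residual              # gradient of the empirical risk (4)
    eta = rng.normal(0.0, sigma, size=w0.shape)   # Gaussian noise eta ~ N(0, sigma^2 I)
    return w0 - gamma * (grad + eta)
```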

4 KEY THEORETICAL RESULTS

In Section 4, we present our theoretical approach on how to find the perfect deleted point v^{*} from a hypothesis testing perspective and design a concrete algorithm to realize this idea.

4.1 Notations

With the linear regression setting in Section 3.2, we first formally define how we can obtain the deleted dataset from the original dataset.

Definition 4.

(Original dataset and deleted dataset): Denote D_{0}=\{(x_{i},y_{i})\}_{i=1}^{n} as the original training dataset and

D_{1}=D_{0}\setminus\{v\}

as the deleted training dataset with the deleted point v=(x_{v},y_{v})\in D_{0}.

Given Definition 4 and equation (4), we next define the empirical risk and gradient for original and deleted dataset respectively.

Definition 5.

(Original empirical risk and its gradient): Given an original dataset D_{0} defined in Definition 4. For model weights w, denote the individual loss of w on training sample (x_{i},y_{i}) as l(w;(x_{i},y_{i})). Then the original empirical risk L(w;D_{0}) is defined as

L(w;D_{0})=\frac{1}{n}\sum_{i=1}^{n}l(w;(x_{i},y_{i})).

In addition, \nabla_{w}L(w;D_{0}) is called the original gradient of model weights w on dataset D_{0}.

Definition 6.

(Deleted empirical risk and its gradient): Given a deleted dataset D_{1} defined in Definition 4. For model weights w, denote the individual loss of w on training sample (x_{i},y_{i}) as l(w;(x_{i},y_{i})). Then the deleted empirical risk L(w;D_{1}) is defined as

L(w;D_{1})=\frac{1}{n-1}\left[\sum_{i=1}^{n}l(w;(x_{i},y_{i}))-l(w;(x_{v},y_{v}))\right].

In addition, \nabla_{w}L(w;D_{1}) is called the deleted gradient of model weights w on dataset D_{1} with deleted point v.

Lemma 2 describes a basic relationship between the original gradient and the deleted gradient.

Lemma 2.

(Identity between original and deleted gradient):

\nabla_{w}L(w;D_{1})=\frac{n}{n-1}\nabla_{w}L(w;D_{0})-\frac{1}{n-1}\nabla_{w}l(w;v). (6)
Proof.

This identity is a direct consequence of Definition 5 and 6. ∎
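As a sanity check on Lemma 2, the following short sketch verifies the identity (6) numerically on random data; the helper grad_risk and the chosen shapes are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X, y, w = rng.normal(size=(n, d)), rng.normal(size=n), rng.normal(size=d)

def grad_risk(X, y, w):
    """Gradient of the empirical risk (4) with squared loss."""
    return -2.0 / len(y) * X.T @ (y - X @ w)

v = 7                                                         # index of the deleted point
lhs = grad_risk(np.delete(X, v, axis=0), np.delete(y, v), w)  # deleted gradient
grad_v = -2.0 * (y[v] - X[v] @ w) * X[v]                      # individual gradient of l(w; v)
rhs = n / (n - 1) * grad_risk(X, y, w) - grad_v / (n - 1)     # right-hand side of (6)
assert np.allclose(lhs, rhs)
```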

4.2 Strategy to find the perfect deleted point

Since we only consider one step of the noisy gradient update here and the initial weight w_{0} does not change whether or not we delete v, equation (5) tells us that in order to get similar model weights w_{1} for the original dataset D_{0} and the deleted dataset D_{1}, we need \Delta w=w_{1}-w_{0} to be similar, i.e., to have identical probability distributions. Equivalently, this means the TNR and FNR for the distributions under D_{0} and D_{1} should be close to each other, i.e., the membership advantage of our perfect deleted point v^{*} should be close to 0.

Lemma 3.

(Absolute membership advantage of v):

|Adv(v)|=|\Phi(\Phi^{-1}(1-\alpha)-d_{v})-\alpha| (7)

where \mu(D)=-\gamma\nabla_{w}L(w;D), \sigma_{\gamma}=\gamma\sigma, d_{v}=\frac{||\mu(D_{1})-\mu(D_{0})||_{2}}{\sigma_{\gamma}} is the signal-to-noise ratio for point v, and \alpha is the Type I error.

Proof.

See Appendix A.1. ∎

From Lemma 3, we notice that the absolute membership advantage depends on the Type I error \alpha and the signal-to-noise ratio d_{v}. The Type I error \alpha, which by definition is the false positive rate (FPR), i.e., the probability of rejecting the null hypothesis when it is actually true, is equivalent to our manually picked significance level; a small numerical sketch of equation (7) is given below. Next, we give an explicit formula for the signal-to-noise ratio d_{v}, obtained by applying Lemma 2, to help us find the perfect deleted point.
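A minimal numerical sketch of equation (7), assuming SciPy's norm.cdf and norm.ppf for \Phi and \Phi^{-1}; the function name abs_membership_advantage is ours.

```python
from scipy.stats import norm

def abs_membership_advantage(d_v, alpha):
    """|Adv(v)| = |Phi(Phi^{-1}(1 - alpha) - d_v) - alpha| from Lemma 3."""
    return abs(norm.cdf(norm.ppf(1.0 - alpha) - d_v) - alpha)

# |Adv(v)| vanishes when d_v = 2 * Phi^{-1}(1 - alpha), the target used in Section 4.2
print(abs_membership_advantage(2 * norm.ppf(0.99), alpha=0.01))   # ~0.0
```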

Lemma 4.

(Signal-to-noise ratio d_{v} for v=(x_{v},y_{v})\in D_{0}):

d_{v}=\frac{||y_{v}x_{v}-S_{yx}+S_{xx}w-x_{v}x_{v}^{T}w||_{2}}{\sqrt{\frac{\gamma(n-1)}{2}}\sigma} (8)

where S_{yx}=\frac{1}{n}\sum_{i=1}^{n}y_{i}x_{i} and S_{xx}=\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}^{T}.

Proof.

See Appendix A.2. ∎

From Lemmas 3 and 4, we know we need to find a data point v such that the d_{v} calculated from it using equation (8) minimizes the absolute membership advantage |Adv(v)|. By a simple calculation, d_{v} should be close to \Phi^{-1}(1-\alpha)-\Phi^{-1}(\alpha)=2\Phi^{-1}(1-\alpha) so that |Adv(v)| is minimized, i.e., close to 0. Therefore, the perfect deleted point problem can be reformulated as: we want to find the point v^{*}\in D_{0} such that

v^{*}=\mathop{\arg\min}\limits_{v\in D_{0}}|d_{v}-2\Phi^{-1}(1-\alpha)|.

Next we formally propose Algorithm 1 to find the perfect deleted point in the original dataset D_{0} (a Python sketch is given after the pseudocode). To find the perfect deleted point, we first obtain the signal-to-noise ratio d_{v} for each data point v\in D_{0} by applying Lemma 4. Then, by Lemma 3, we choose a significance level and use it as the Type I error \alpha. Finally, for each data point v, we compare the distance between d_{v} and 2\Phi^{-1}(1-\alpha); the point with the smallest distance is the perfect deleted point v^{*}. Note we introduce a hyper-parameter \delta here to quantify the largest acceptable distance, i.e., if the smallest distance |d_{v}-2\Phi^{-1}(1-\alpha)| over all v\in D_{0} is larger than \delta, then we say there is no perfect deleted point in the dataset. So instead of training n models separately on the deleted datasets and comparing their model weights to the original model weights for each data point, our approach only needs to compute the signal-to-noise ratio d_{v} and compare its value to 2\Phi^{-1}(1-\alpha) for all data points in one training step, which takes O(n) time and O(n) space for one step of noisy SGD training, where n is the size of the original dataset. Thus our approach for finding the perfect deleted point is more time and space efficient than the direct approach.

Algorithm 1 Find Perfect Deleted Point
1:Input: D_{0}=\{v_{i}\}_{i=1}^{n}, \gamma, \sigma^{2}, initial weight w, tolerance \delta, significance level \alpha^{\prime}
2:Output: v^{*}
3:S_{yx},S_{xx}\leftarrow\frac{1}{n}\sum_{i=1}^{n}y_{i}x_{i},\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}^{T}
4:d\leftarrow\delta
5:\alpha\leftarrow\alpha^{\prime}
6:v^{*}\leftarrow\texttt{None}
7:for each v=(x_{v},y_{v})\in D_{0} do
8:     d_{v}\leftarrow\frac{||y_{v}x_{v}-S_{yx}+S_{xx}w-x_{v}x_{v}^{T}w||_{2}}{\sqrt{\frac{\gamma(n-1)}{2}}\sigma}
9:     d_{1}\leftarrow|d_{v}-2\Phi^{-1}(1-\alpha)|
10:     if d_{1}\leq d then
11:         d\leftarrow d_{1}
12:         v^{*}\leftarrow v
13:     end if
14:end for
15:return v^{*}
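A minimal Python sketch of Algorithm 1 under the assumptions of Section 3.2, representing D_{0} as arrays X (shape (n, d)) and y (shape (n,)); the function name find_perfect_deleted_point is ours, and the return value is the index of v^{*} (or None if no candidate passes the tolerance \delta).

```python
import numpy as np
from scipy.stats import norm

def find_perfect_deleted_point(X, y, gamma, sigma, w, delta, alpha):
    """Sketch of Algorithm 1: score every point by equation (8) and pick the one
    whose d_v is closest to 2 * Phi^{-1}(1 - alpha), within tolerance delta."""
    n = len(y)
    S_yx = (y[:, None] * X).mean(axis=0)           # S_yx = (1/n) sum_i y_i x_i
    S_xx = X.T @ X / n                             # S_xx = (1/n) sum_i x_i x_i^T
    target = 2.0 * norm.ppf(1.0 - alpha)
    denom = np.sqrt(gamma * (n - 1) / 2.0) * sigma
    best, best_dist = None, delta                  # lines 4-6 of Algorithm 1
    for i in range(n):
        x_v, y_v = X[i], y[i]
        numer = np.linalg.norm(y_v * x_v - S_yx + S_xx @ w - x_v * (x_v @ w))
        d_v = numer / denom                        # line 8: equation (8)
        dist = abs(d_v - target)                   # line 9
        if dist <= best_dist:                      # lines 10-12
            best, best_dist = i, dist
    return best
```

The single pass over the dataset reflects the O(n) cost discussed above.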

5 MAIN INSIGHTS

In Section 5, we further discuss how the perfect deleted point affects the training performance (Section 5.1) and how it relates to privacy concerns in the training process (Section 5.2). This provides insight into why the perfect deleted point should be of interest.

5.1 Analysis of training loss for perfect deleted point

In order to understand how the perfect deleted point affects the training performance, we first propose the following concept of absolute membership error, which quantifies how far a point v is from being the perfect deleted point.

Definition 7.

(Absolute membership error): For a point v\in D_{0} with corresponding signal-to-noise ratio d_{v}, we define its absolute membership error as

\epsilon_{v}=d_{v}-2\Phi^{-1}(1-\alpha). (9)

Based on this definition, the absolute membership error of the perfect deleted point, \epsilon_{v^{*}}, should be close to 0 by the analysis in Section 4.2. From now on, we denote \epsilon_{v^{*}} by \epsilon^{*} for brevity. We propose the following lemma to bound the change of training loss when deleting point v.

Lemma 5.

(Bounds for change of empirical risk for deleting point v): Assume the change of empirical risk, i.e., L(w;D_{1})-L(w;D_{0}), is non-negative.

  1. Given point v=(x_{v},y_{v}) and its absolute membership error \epsilon_{v}, the change of empirical risk after deleting v is lower bounded by

    \frac{1}{n-1}L(w;D_{0})-[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]C-\frac{||S_{yx}-S_{xx}w||_{2}}{(n-1)||x_{v}||_{2}} (10)

    and upper bounded by

    \frac{1}{n-1}L(w;D_{0})+[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]C-\frac{||S_{yx}-S_{xx}w||_{2}}{(n-1)||x_{v}||_{2}} (11)

    where C=\frac{\sigma}{||x_{v}||_{2}}\sqrt{\frac{\gamma}{2(n-1)}}.

  2. Suppose the features of the training data \{(x_{i},y_{i})\}_{i=1}^{n} satisfy ||x_{i}||_{2}\geq B,\forall i\in[1,n], where B is a positive number. Then the lower bound can be generalized to

    \frac{1}{n-1}L(w;D_{0})-[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]D-\frac{||S_{yx}-S_{xx}w||_{2}}{(n-1)B} (12)

    and the upper bound can be generalized to

    \frac{1}{n-1}L(w;D_{0})+[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]D-\frac{||S_{yx}-S_{xx}w||_{2}}{(n-1)B} (13)

    where D=\frac{\sigma}{B}\sqrt{\frac{\gamma}{2(n-1)}}.

Note both bounds in the first and second part of this Lemma are non-negative because deletion of samples always increases training risk.

Proof.

See Appendix B.1. ∎

The novel part of Lemma 5 is that, given several candidates for the perfect deleted point, i.e., points that all give very similar d_{v}, the one with the smallest l_{2} feature norm ||x_{v}||_{2} is always preferable, because the lower bound on the change of training loss decreases as ||x_{v}||_{2} decreases. The upper bound also decreases as ||x_{v}||_{2} decreases, because the individual loss l(w;v) is non-negative, so the part after \frac{1}{n-1}L(w;D_{0}), which is -\frac{1}{n-1}l(w;v), is non-positive, and decreasing the denominator ||x_{v}||_{2} therefore decreases the upper bound as well. In other words, deleting the candidate with the smallest l_{2} feature norm potentially improves training performance the most, because both its lower and upper bounds on the increment of empirical risk are smaller.

Also, given the same training dataset, i.e., the same B, the perfect deleted point gives the tightest bound for the change of loss after deletion compared with other points, because \epsilon^{*}\leq\epsilon_{v}. In other words, we can obtain an accurate estimate of how the perfect deleted point will affect training performance by setting \epsilon^{*}=0, even before finding the exact perfect deleted point. A small sketch of the bounds in Lemma 5 is given below.
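A minimal sketch of the bounds (10)-(11), assuming the quantities L(w;D_{0}) and ||S_{yx}-S_{xx}w||_{2} have already been computed; the function name risk_change_bounds is ours.

```python
import numpy as np
from scipy.stats import norm

def risk_change_bounds(eps_v, alpha, x_v_norm, L0, S_norm, n, gamma, sigma):
    """Lower and upper bounds (10)-(11) on L(w; D1) - L(w; D0) after deleting v.

    `L0` is the original empirical risk L(w; D0), `S_norm` is ||S_yx - S_xx w||_2,
    and `x_v_norm` is ||x_v||_2; the other arguments follow the notation of Lemma 5."""
    C = sigma / x_v_norm * np.sqrt(gamma / (2.0 * (n - 1)))
    base = L0 / (n - 1) - S_norm / ((n - 1) * x_v_norm)
    spread = (eps_v + 2.0 * norm.ppf(1.0 - alpha)) * C
    return base - spread, base + spread
```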

5.2 Relation of privacy budget \epsilon for perfect deleted point

In this section, we try to understand how the perfect deleted point will bring about privacy issues in the training process.

Lemma 6.

(Lower bound for privacy budget for deleting point v): The privacy budget \epsilon after deleting point v with absolute membership error \epsilon_{v} is at least \max\{\ln[\Phi(\Phi^{-1}(\alpha)-\epsilon_{v})+1-\alpha],0\}.

Proof.

See Appendix B.2. ∎

From Lemma 6, we observe that, assuming the privacy budget must be non-negative, as long as the absolute membership error \epsilon_{v} is non-negative, it achieves the smallest possible lower bound of 0 for the privacy budget. In other words, if we have several candidates for the perfect deleted point, i.e., they all give very similar d_{v}, then the one with positive absolute membership error is considered the best option for the perfect deleted point, because it may potentially distort privacy guarantees to the minimum degree compared with deleting other points. Note we do not know whether it will actually cause the smallest privacy budget, because we only know it achieves the smallest lower bound for the privacy budget; only theoretically can it lead to a smaller privacy budget. A small sketch of the bound in Lemma 6 is given below.
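A minimal sketch of the bound in Lemma 6, assuming SciPy for \Phi and \Phi^{-1}; the function name privacy_budget_lower_bound is ours.

```python
import numpy as np
from scipy.stats import norm

def privacy_budget_lower_bound(eps_v, alpha):
    """Lemma 6: epsilon >= max{ ln[Phi(Phi^{-1}(alpha) - eps_v) + 1 - alpha], 0 }."""
    return max(np.log(norm.cdf(norm.ppf(alpha) - eps_v) + 1.0 - alpha), 0.0)

# for a non-negative absolute membership error the bound reduces to the trivial value 0
print(privacy_budget_lower_bound(eps_v=0.5, alpha=0.05))   # 0.0
```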

6 EXPERIMENT

Section 6 exemplifies the effectiveness of our approach by conducting an empirical experiment to find the perfect deleted point in a synthetic dataset. By comparing the model weights in different scenarios, we show that our perfect deleted point indeed changes the distribution of model weights to the minimal extent.

6.1 Experiment setup

To better visualize how the perfect deleted point affects the model weights, we generate a 2D synthetic dataset with 200 samples as the original dataset D_{0}, where the x coordinate of each point is randomly picked within the range [0,10] with seed 40. To get the y coordinate, we assume the true model behind our generated dataset is y=3.1415926535x, since we are dealing with classical linear regression. We then draw Gaussian noise with mean 0 and standard deviation 2, amplify its magnitude by a factor of 10, and add it to the true y coordinate to obtain the synthetic y coordinate. The scatterplot of the generated dataset is shown in Figure 1. Although this simple dataset may not mimic a real-world dataset, it suffices for analyzing the perfect deleted point problem in this paper, as it satisfies all the assumptions of the given linear regression scenario.

Figure 1: Synthetic Dataset

The hyperparameters used in this experiment are: initial weight w_{0}=0, learning rate \gamma=0.01, Gaussian noise \eta\sim N(0,4), Type I error \alpha=0.01, and tolerance \delta=100. Since this is a stochastic algorithm and we want the distribution of weights, we run the 1-step noisy SGD for 100 iterations to understand its overall behaviour. A sketch of the data generation and the 1-step update is given below.
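The following is a minimal sketch of the data generation and the 1-step noisy SGD update described above; the exact random-number pipeline is an assumption on our part, so the printed statistics are only indicative and need not match Table 1.

```python
import numpy as np

rng = np.random.default_rng(40)                    # seed 40, as in Section 6.1

# synthetic original dataset D0: 200 points, y = 3.1415926535 x + 10 * N(0, 2^2)
n = 200
x = rng.uniform(0.0, 10.0, size=n)
y = 3.1415926535 * x + 10.0 * rng.normal(0.0, 2.0, size=n)

# hyperparameters from Section 6.1 (alpha and delta enter only via Algorithm 1)
w0, gamma, sigma, alpha, delta = 0.0, 0.01, 2.0, 0.01, 100.0

def noisy_sgd_step_1d(w, x, y, rng):
    grad = -2.0 * np.mean((y - w * x) * x)         # gradient of the empirical risk in 1D
    return w - gamma * (grad + rng.normal(0.0, sigma))

# distribution of the weight after 1-step noisy SGD over 100 iterations
weights = np.array([noisy_sgd_step_1d(w0, x, y, rng) for _ in range(100)])
print(weights.mean(), weights.var())
```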

6.2 Result analysis

To understand the effect of the perfect deleted point, we first focus on the distribution of model weights after deleting it and compare it with the distribution of model weights when we do not delete any point. We use the distribution of model weights when we randomly delete a data point as a benchmark to better demonstrate the potential of the perfect deleted point compared with all the other points. Secondly, we repeat the same finding-and-deleting process for multiple steps of noisy SGD to amplify the effect of the perfect deleted point, since the deletion of one data point may not matter much compared with no deletion and random deletion, because the size of the original dataset D_{0} mitigates the impact.

Figure 2: Distribution of model weights after 1, 10, and 50 steps of noisy SGD using the perfect deleted point, a randomly deleted point, and no deleted point over 100 iterations. The histogram plots the occurrence of model weights in each bin, and the shaded area is the kernel density estimate (KDE) of the distribution.
Table 1: Statistics for Distributions of Model Weight
MODEL MEAN VARIANCE
1-step; perfect deleted point 1.77693 0.00051
1-step; random deleted point 1.78043 0.00061
1-step; no deleted point 1.77718 0.00046
10-step; perfect deleted point 3.31226 0.00079
10-step; random deleted point 3.31290 0.00265
10-step; no deleted point 3.30248 0.00066
50-step; perfect deleted point 3.41079 0.00063
50-step; random deleted point 3.35108 0.01560
50-step; no deleted point 3.36008 0.00051

Figure 2 and Table 1 show the distribution of model weights in the different cases both visually and statistically. From the first row of Figure 2, we see that in one step of noisy SGD all three cases exhibit a similar normal distribution, which is supported by the means and variances in the first three lines of Table 1. This is because the deletion of a single data point from D_{0}, which contains 200 samples, has only a trivial effect on the distribution of model weights. From the next two rows of Figure 2, we see that as the number of noisy SGD steps increases, the weight distribution for the perfect deleted point resembles the no-deletion distribution much more closely than the random-deletion one does. This means our perfect deleted point is indeed "perfect" in the sense of preserving the model weights after deletion compared with other points. Looking at the highlighted numbers in the third column of Table 1, we see that as the number of noisy SGD steps increases, the variance of the weight distribution under random deletion grows, which means the normal distribution becomes "flatter" and we are more likely to get random weights after more random deletions. By the convergence of noisy SGD, we know this is not acceptable, which again shows the advantage of the perfect deleted point.

However, if we look at the highlighted number in the second column of Table 1, we see that, despite having small variance, the mean of the distribution of model weights after 50 steps using the perfect deleted point shifts considerably compared with both no deletion and random deletion, giving a slightly different model weight distribution compared with no deleted point. To address this problem, we increase the Type I error \alpha to 0.05, and Figure 3 plots the corresponding result.

Figure 3: Distribution of model weights after 50 steps of noisy SGD using the perfect deleted point when \alpha=0.05, with mean = 3.38091 and variance = 0.00077.

As we can see, the mean becomes smaller and the whole distribution is therefore much more similar to the one with no deleted point. An intuitive explanation may be that as we increase the Type I error \alpha, we increase the significance level as well, which allows more room for mistakes when using the perfect deleted point over many steps of noisy SGD, i.e., when deleting more points, thus improving the statistical power of our perfect deleted point. This trade-off between \alpha and the mean may lead to potential future work and needs more rigorous investigation.

7 CONCLUSION

In conclusion, our research proposes the perfect deleted point problem in one step of noisy SGD for classical linear regression. We present a hypothesis testing approach to solve this problem using the signal-to-noise ratio and construct an algorithm based on it. Moreover, we analyze how the perfect deleted point is related to training performance and differential privacy, thus showing the consequences of the perfect deleted point and further demonstrating the importance of this problem. Finally, we create a synthetic dataset, implement our algorithm on it, and show the minimal impact the perfect deleted point has on the distribution of model weights. The implications of our work demonstrate the need for data deletion strategies in different scenarios to improve model training efficiency. Moving forward, our research sets a new standard for understanding and analyzing such problems, supporting the efficiency and security of ML model training. Potential future work includes analyzing data deletion for the whole SGD training process and for the general regression task.

References

  • Jordan and Mitchell (2015) Michael I Jordan and Tom M Mitchell. Machine learning: Trends, perspectives, and prospects. Science, 349(6245):255–260, 2015.
  • Garg et al. (2020) Sanjam Garg, Shafi Goldwasser, and Prashant Nalini Vasudevan. Formalizing data deletion in the context of the right to be forgotten. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 373–402. Springer, 2020.
  • Ginart et al. (2019) Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making ai forget you: Data deletion in machine learning. Advances in neural information processing systems, 32, 2019.
  • Oesterling et al. (2024) Alex Oesterling, Jiaqi Ma, Flavio Calmon, and Himabindu Lakkaraju. Fair machine unlearning: Data removal while mitigating disparities. In International Conference on Artificial Intelligence and Statistics, pages 3736–3744. PMLR, 2024.
  • Ramezan et al. (2021) Christopher A Ramezan, Timothy A Warner, Aaron E Maxwell, and Bradley S Price. Effects of training set size on supervised machine-learning land-cover classification of large-area high-resolution remotely sensed data. Remote Sensing, 13(3):368, 2021.
  • Foody et al. (2006) Giles M Foody, Ajay Mathur, Carolina Sanchez-Hernandez, and Doreen S Boyd. Training set size requirements for the classification of a specific class. Remote Sensing of Environment, 104(1):1–14, 2006.
  • Budach et al. (2022) Lukas Budach, Moritz Feuerpfeil, Nina Ihde, Andrea Nathansen, Nele Noack, Hendrik Patzlaff, Felix Naumann, and Hazar Harmouch. The effects of data quality on machine learning performance. arXiv preprint arXiv:2207.14529, 2022.
  • Narciso and Martins (2020) Diogo AC Narciso and FG Martins. Application of machine learning tools for energy efficiency in industry: A review. Energy Reports, 6:1181–1199, 2020.
  • Hawkins (2004) Douglas M Hawkins. The problem of overfitting. Journal of chemical information and computer sciences, 44(1):1–12, 2004.
  • Council of the European Union (2016) Council of the European Union. Council regulation (eu) no 2016/679. https://eur-lex.europa.eu/eli/reg/2016/679/oj, 2016.
  • State of California Department of Justice (2018) State of California Department of Justice. California consumer privacy act (ccpa). https://oag.ca.gov/privacy/ccpa, 2018.
  • Cao and Yang (2015) Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In 2015 IEEE symposium on security and privacy, pages 463–480. IEEE, 2015.
  • Bourtoule et al. (2021) Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pages 141–159. IEEE, 2021.
  • Baumhauer et al. (2022) Thomas Baumhauer, Pascal Schöttle, and Matthias Zeppelzauer. Machine unlearning: Linear filtration for logit-based classifiers. Machine Learning, 111(9):3203–3226, 2022.
  • Golatkar et al. (2020) Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9304–9312, 2020.
  • Graves et al. (2021) Laura Graves, Vineel Nagisetty, and Vijay Ganesh. Amnesiac machine learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11516–11524, 2021.
  • Avella-Medina et al. (2023) Marco Avella-Medina, Casey Bradshaw, and Po-Ling Loh. Differentially private inference via noisy optimization. The Annals of Statistics, 51(5):2067–2092, 2023.
  • Das et al. (2023) Rudrajit Das, Satyen Kale, Zheng Xu, Tong Zhang, and Sujay Sanghavi. Beyond uniform lipschitz condition in differentially private optimization. In International Conference on Machine Learning, pages 7066–7101. PMLR, 2023.
  • Wu et al. (2020a) Jingfeng Wu, Wenqing Hu, Haoyi Xiong, Jun Huan, Vladimir Braverman, and Zhanxing Zhu. On the noisy gradient descent that generalizes as sgd. In International Conference on Machine Learning, pages 10367–10376. PMLR, 2020a.
  • Abadi et al. (2016) Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
  • Wu et al. (2020b) Yinjun Wu, Edgar Dobriban, and Susan Davidson. Deltagrad: Rapid retraining of machine learning models. In International Conference on Machine Learning, pages 10355–10366. PMLR, 2020b.
  • Guo et al. (2019) Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens Van Der Maaten. Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030, 2019.
  • Neel et al. (2021) Seth Neel, Aaron Roth, and Saeed Sharifi-Malvajerdi. Descent-to-delete: Gradient-based methods for machine unlearning. In Algorithmic Learning Theory, pages 931–962. PMLR, 2021.
  • Gupta et al. (2021) Varun Gupta, Christopher Jung, Seth Neel, Aaron Roth, Saeed Sharifi-Malvajerdi, and Chris Waites. Adaptive machine unlearning. Advances in Neural Information Processing Systems, 34:16319–16330, 2021.
  • Shokri et al. (2017) Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), pages 3–18. IEEE, 2017.
  • Leemann et al. (2023) Tobias Leemann, Martin Pawelczyk, and Gjergji Kasneci. Gaussian membership inference privacy, 2023. URL https://arxiv.org/abs/2306.07273.
  • Yeom et al. (2018a) Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st Computer Security Foundations Symposium (CSF), pages 268–282, 2018a. doi: 10.1109/CSF.2018.00027.
  • Yeom et al. (2018b) Samuel Yeom, Irene Giacomelli, Matt Fredrikson, and Somesh Jha. Privacy risk in machine learning: Analyzing the connection to overfitting, 2018b.
  • Ying (2019) Xue Ying. An overview of overfitting and its solutions. In Journal of physics: Conference series, volume 1168, page 022022. IOP Publishing, 2019.
  • Acun et al. (2021) Bilge Acun, Matthew Murphy, Xiaodong Wang, Jade Nie, Carole-Jean Wu, and Kim Hazelwood. Understanding training efficiency of deep learning recommendation models at scale. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 802–814. IEEE, 2021.
  • Zhu et al. (2016) Xiangxin Zhu, Carl Vondrick, Charless C Fowlkes, and Deva Ramanan. Do we need more training data? International Journal of Computer Vision, 119(1):76–92, 2016.
  • Wang and Cheng (2024) Chi-Hua Wang and Guang Cheng. Badgd: A unified data-centric framework to identify gradient descent vulnerabilities, 2024. URL https://arxiv.org/abs/2405.15979.
  • Izzo et al. (2021) Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics, pages 2008–2016. PMLR, 2021.
  • Chourasia and Shah (2023) Rishav Chourasia and Neil Shah. Forget unlearning: Towards true data-deletion in machine learning. In International Conference on Machine Learning, pages 6028–6073. PMLR, 2023.
  • Pawelczyk et al. (2024) Martin Pawelczyk, Jimmy Z Di, Yiwei Lu, Gautam Kamath, Ayush Sekhari, and Seth Neel. Machine unlearning fails to remove data poisoning attacks. arXiv preprint arXiv:2406.17216, 2024.
  • Sekhari et al. (2021) Ayush Sekhari, Jayadev Acharya, Gautam Kamath, and Ananda Theertha Suresh. Remember what you want to forget: Algorithms for machine unlearning. Advances in Neural Information Processing Systems, 34:18075–18086, 2021.
  • Chen et al. (2021) Min Chen, Zhikun Zhang, Tianhao Wang, Michael Backes, Mathias Humbert, and Yang Zhang. When machine unlearning jeopardizes privacy. In Proceedings of the 2021 ACM SIGSAC conference on computer and communications security, pages 896–911, 2021.
  • Steinke et al. (2023) Thomas Steinke, Milad Nasr, and Matthew Jagielski. Privacy auditing with one (1) training run, 2023. URL https://arxiv.org/abs/2305.08846.
  • Cebere et al. (2024) Tudor Cebere, Aurélien Bellet, and Nicolas Papernot. Tighter privacy auditing of dp-sgd in the hidden state threat model, 2024. URL https://arxiv.org/abs/2405.14457.
  • Scharf and Demeure (1991) Louis L Scharf and Cédric Demeure. Statistical signal processing: detection, estimation, and time series analysis. Addison-Wesley, 1991.

Checklist

  1. For all models and algorithms presented, check if you include:

    (a) A clear description of the mathematical setting, assumptions, algorithm, and/or model. [Yes] See Section 4

    (b) An analysis of the properties and complexity (time, space, sample size) of any algorithm. [Yes] See Section 4

    (c) (Optional) Anonymized source code, with specification of all dependencies, including external libraries. [Yes] See Appendix

  2. For any theoretical claim, check if you include:

    (a) Statements of the full set of assumptions of all theoretical results. [Yes] See Section 3

    (b) Complete proofs of all theoretical results. [Yes] See Appendix

    (c) Clear explanations of any assumptions. [Yes] See Section 3

  3. For all figures and tables that present empirical results, check if you include:

    (a) The code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL). [Yes] See Appendix

    (b) All the training details (e.g., data splits, hyperparameters, how they were chosen). [Yes] See Section 6

    (c) A clear definition of the specific measure or statistics and error bars (e.g., with respect to the random seed after running experiments multiple times). [Yes] See Section 6

    (d) A description of the computing infrastructure used (e.g., type of GPUs, internal cluster, or cloud provider). [Not Applicable]

  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets, check if you include:

    (a) Citations of the creator if your work uses existing assets. [Not Applicable]

    (b) The license information of the assets, if applicable. [Not Applicable]

    (c) New assets either in the supplemental material or as a URL, if applicable. [Not Applicable]

    (d) Information about consent from data providers/curators. [Not Applicable]

    (e) Discussion of sensible content if applicable, e.g., personally identifiable information or offensive content. [Not Applicable]

  5. If you used crowdsourcing or conducted research with human subjects, check if you include:

    (a) The full text of instructions given to participants and screenshots. [Not Applicable]

    (b) Descriptions of potential participant risks, with links to Institutional Review Board (IRB) approvals if applicable. [Not Applicable]

    (c) The estimated hourly wage paid to participants and the total amount spent on participant compensation. [Not Applicable]

Appendix A Proof sketch for section 4

A.1 Proof of Lemma 3

Proof of Lemma 3 follows the same logic as Section 4.4 of Scharf and Demeure (1991). It is also adapted from Section A.2, Proof of Lemma 4, in Wang and Cheng (2024). However, here we calculate the absolute membership advantage instead of the trade-off function as in Wang and Cheng (2024). We restate the proof here for the readers' reference and change the final step according to our needs.

We use binary hypothesis testing on the distributions of \Delta w=w_{1}-w_{0} under D_{0} and D_{1}, denoted \Delta w(D_{0}) and \Delta w(D_{1}), as defined in Section 3.2, to calculate the FNR and TNR, and therefore the membership advantage of v.

H_{0}:\Delta w(D_{0})\sim N(\mu(D_{0}),\sigma_{\gamma}^{2}I),\quad H_{1}:\Delta w(D_{1})\sim N(\mu(D_{1}),\sigma_{\gamma}^{2}I) (14)

Define W=(\sigma_{\gamma}^{2}I)^{-1}(\mu(D_{1})-\mu(D_{0})). Then the log likelihood ratio has the form

L(\Delta w)=W^{T}\Delta w (15)

by equation (4.29) in Scharf and Demeure (1991). Equivalently, the binary hypothesis testing of (14) can be changed into

H_{0}:L(\Delta w)\sim N(-\frac{d_{v}^{2}}{2},d_{v}^{2}),\quad H_{1}:L(\Delta w)\sim N(\frac{d_{v}^{2}}{2},d_{v}^{2}) (16)

where

d_{v}^{2}=W^{T}(\sigma_{\gamma}^{2}I)W=(\mu(D_{1})-\mu(D_{0}))^{T}(\sigma_{\gamma}^{2}I)^{-1}(\mu(D_{1})-\mu(D_{0})) (17)

is called the signal-to-noise ratio.

At significance level \alpha, which is also the Type I error, we know (16) has Type II error

\beta=1-\Phi(\Phi^{-1}(1-\alpha)-d_{v}). (18)

Since the TNR is the probability of failing to reject the null hypothesis given that the null hypothesis is true, which is equal to 1-\alpha, we have

|Adv(v)|=|TNR-FNR|=|1-\alpha-\beta|=|\Phi(\Phi^{-1}(1-\alpha)-d_{v})-\alpha| (19)

as desired.

A.2 Proof of Lemma 4

Proof of Lemma 4 follows similar logic to Section B.6, Proof of Lemma 10, in Wang and Cheng (2024). However, here the gradient of the loss for the deleted dataset D_{1} changes, so we need to be careful in the computation. We want to calculate

d_{v}=\frac{||\mu(D_{1})-\mu(D_{0})||_{2}}{\sigma_{\gamma}}=\frac{||\nabla_{w}L(w;D_{1})-\nabla_{w}L(w;D_{0})||_{2}}{\sqrt{\gamma}\sigma} (20)

Applying Lemma 2, we get

d_{v}=\frac{||\nabla_{w}L(w;D_{0})-\nabla_{w}l(w;v)||_{2}}{\sqrt{\gamma(n-1)}\sigma} (21)

Given l(w,(x,y))=(y-\langle w,x\rangle)^{2}, the gradient is \nabla_{w}l(w,(x,y))=-2(y-\langle w,x\rangle)x. Then

\nabla_{w}L(w;D_{0})-\nabla_{w}l(w;v) (22)
=\nabla_{w}\frac{1}{n}\sum_{i=1}^{n}(y_{i}-\langle w,x_{i}\rangle)^{2}-\nabla_{w}(y_{v}-\langle w,x_{v}\rangle)^{2}
=\nabla_{w}\left[\frac{1}{n}\sum_{i=1}^{n}y_{i}^{2}-w^{T}\left(\frac{2}{n}\sum_{i=1}^{n}y_{i}x_{i}\right)+w^{T}\left(\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}^{T}\right)w\right]-\nabla_{w}\left[y_{v}^{2}-2w^{T}y_{v}x_{v}+w^{T}x_{v}x_{v}^{T}w\right]
=\nabla_{w}\left[\frac{1}{n}\sum_{i=1}^{n}y_{i}^{2}-y_{v}^{2}\right]-2\nabla_{w}\left[w^{T}\left(\frac{1}{n}\sum_{i=1}^{n}y_{i}x_{i}-y_{v}x_{v}\right)\right]+\nabla_{w}w^{T}\left[\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}^{T}-x_{v}x_{v}^{T}\right]w
=2\left(y_{v}x_{v}-\frac{1}{n}\sum_{i=1}^{n}y_{i}x_{i}\right)+2\left[\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}^{T}-x_{v}x_{v}^{T}\right]w
=2(y_{v}x_{v}-S_{yx})+2[S_{xx}-x_{v}x_{v}^{T}]w

where S_{yx}=\frac{1}{n}\sum_{i=1}^{n}y_{i}x_{i} and S_{xx}=\frac{1}{n}\sum_{i=1}^{n}x_{i}x_{i}^{T}. Then (21) can be written as

d_{v}=\frac{||2(y_{v}x_{v}-S_{yx})+2[S_{xx}-x_{v}x_{v}^{T}]w||_{2}}{\sqrt{\gamma(n-1)}\sigma}=\frac{||(y_{v}x_{v}-S_{yx})+[S_{xx}-x_{v}x_{v}^{T}]w||_{2}}{\sqrt{\frac{\gamma(n-1)}{2}}\sigma}=\frac{||y_{v}x_{v}-S_{yx}+S_{xx}w-x_{v}x_{v}^{T}w||_{2}}{\sqrt{\frac{\gamma(n-1)}{2}}\sigma} (23)

Appendix B Proof sketch for section 5

B.1 Proof of Lemma 5

Since L(w;D_{1})=\frac{n}{n-1}L(w;D_{0})-\frac{1}{n-1}l(w;v), we only need to focus on L(w;D_{1})-L(w;D_{0})=\frac{1}{n-1}L(w;D_{0})-\frac{1}{n-1}l(w;v)=\frac{1}{n-1}L(w;D_{0})-\frac{1}{n-1}(y_{v}-\langle x_{v},w\rangle)^{2}=\frac{1}{n-1}L(w;D_{0})-\frac{1}{n-1}(y_{v}-x_{v}^{T}w)^{2}, which is non-negative because L(w;D_{0})>l(w;v). Based on equation (23), we know

(d_{v})^{2}=\left(\frac{||y_{v}x_{v}-S_{yx}+S_{xx}w-x_{v}x_{v}^{T}w||_{2}}{\sqrt{\frac{\gamma(n-1)}{2}}\sigma}\right)^{2} (24)
\geq\frac{2}{\sigma^{2}\gamma(n-1)}[||y_{v}x_{v}-x_{v}x_{v}^{T}w||_{2}-||S_{yx}-S_{xx}w||_{2}]^{2}
=\frac{2}{\sigma^{2}\gamma(n-1)}[||x_{v}||_{2}\cdot||y_{v}-x_{v}^{T}w||_{2}-||S_{yx}-S_{xx}w||_{2}]^{2}.

The inequality is obtained from the triangle inequality for the l_{2} norm. From Definition 7, we also know

d_{v}=\epsilon_{v}+2\Phi^{-1}(1-\alpha) (25)

So we have

[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]^{2}\geq\frac{2}{\sigma^{2}\gamma(n-1)}[||x_{v}||_{2}\cdot l(w,v)-||S_{yx}-S_{xx}w||_{2}]^{2}

[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]^{2}\frac{\sigma^{2}\gamma(n-1)}{2}\geq[||x_{v}||_{2}\cdot l(w,v)-||S_{yx}-S_{xx}w||_{2}]^{2}

-[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]\frac{\sigma}{||x_{v}||_{2}}\sqrt{\frac{\gamma(n-1)}{2}}+\frac{||S_{yx}-S_{xx}w||_{2}}{||x_{v}||_{2}}\leq l(w;v)\leq[\epsilon_{v}+2\Phi^{-1}(1-\alpha)]\frac{\sigma}{||x_{v}||_{2}}\sqrt{\frac{\gamma(n-1)}{2}}+\frac{||S_{yx}-S_{xx}w||_{2}}{||x_{v}||_{2}}

Multiplying by \frac{-1}{n-1} and adding \frac{1}{n-1}L(w;D_{0}) on both sides of the inequality, we get the first part of Lemma 5.

For part 2 of Lemma 5, we further write equation (24) as

(d_{v})^{2}\geq\frac{2}{\sigma^{2}\gamma(n-1)}[||x_{v}||_{2}\cdot||y_{v}-x_{v}^{T}w||_{2}-||S_{yx}-S_{xx}w||_{2}]^{2}\geq\frac{2}{\sigma^{2}\gamma(n-1)}[B\cdot||y_{v}-x_{v}^{T}w||_{2}-||S_{yx}-S_{xx}w||_{2}]^{2} (26)

since ||x_{v}||_{2}\geq B by assumption. Following the same logic as in the first part, we obtain the result shown in part 2.

B.2 Proof of Lemma 6

Applying Lemma 1 and equation (7), we get

\Phi(\Phi^{-1}(1-\alpha)-d)-\alpha\leq e^{\epsilon}-1

So

\epsilon\geq\ln[\Phi(\Phi^{-1}(1-\alpha)-d)+1-\alpha] (27)

For point v, we know d_{v}=\epsilon_{v}+2\Phi^{-1}(1-\alpha). So the lower bound for the privacy budget \epsilon after deleting point v is

\ln[\Phi(\Phi^{-1}(1-\alpha)-d)+1-\alpha]=\ln[\Phi(\Phi^{-1}(1-\alpha)-\epsilon_{v}-2\Phi^{-1}(1-\alpha))+1-\alpha]=\ln[\Phi(\Phi^{-1}(\alpha)-\epsilon_{v})+1-\alpha]. (28)

Since we assume the privacy budget to be non-negative, we take the max of equation (28) and 0 to obtain the desired result.