Certified Minimax Unlearning with Generalization Rates and Deletion Capacity
Abstract
We study the problem of $(\epsilon,\delta)$-certified machine unlearning for minimax models. Most of the existing works focus on unlearning from standard statistical learning models that have a single variable, and their unlearning steps hinge on the direct Hessian-based conventional Newton update. We develop a new $(\epsilon,\delta)$-certified machine unlearning algorithm for minimax models. It proposes a minimax unlearning step consisting of a total Hessian-based complete Newton update and the Gaussian mechanism borrowed from differential privacy. To obtain the unlearning certification, our method injects calibrated Gaussian noises by carefully analyzing the "sensitivity" of the minimax unlearning step (i.e., the closeness between the minimax unlearning variables and the retraining-from-scratch variables). We derive the generalization rates in terms of population strong and weak primal-dual risk for three different cases of loss functions, i.e., (strongly-)convex-(strongly-)concave losses. We also provide the deletion capacity to guarantee that a desired population risk can be maintained as long as the number of deleted samples does not exceed the derived amount. With $n$ training samples and model dimension $d$, it yields the order $\mathcal{O}(n/d^{1/4})$, which shows a strict gap over the baseline method of differentially private minimax learning that has $\mathcal{O}(n/\sqrt{d})$. In addition, our rates of generalization and deletion capacity match the state-of-the-art rates derived previously for standard statistical learning models.
1 Introduction
Minimax models have been widely applied in a variety of machine learning applications, including generative adversarial networks (Goodfellow et al., 2014; Arjovsky et al., 2017), adversarially robust learning (Madry et al., 2018; Sinha et al., 2018), and reinforcement learning (Du et al., 2017; Dai et al., 2018). This is largely credited to the two-variable (i.e., primal and dual variables) model structure of minimax models, which is versatile enough to accommodate such diverse instantiations. As is common in machine learning practice, training a successful minimax model relies crucially on a potentially large corpus of training samples that are contributed by users. This raises privacy concerns for minimax models. Unlike standard statistical learning (STL) models, the privacy studies for minimax models are relatively newer. Most of the existing studies focus on privacy protection during the training phase under the differential privacy (DP) notion (Dwork et al., 2006) and federated minimax learning settings (Sharma et al., 2022). Recent works in this direction have successfully achieved several optimal generalization performances measured in terms of the population primal-dual (PD) risk for DP minimax models specifically (Yang et al., 2022; Zhang et al., 2022a; Bassily et al., 2023; Boob and Guzmán, 2023).
Table 1: Summary of our results and comparison with baselines adapted to certified machine unlearning.

| Model | Unlearning Algorithm | Setting | Generalization Measure | Deletion Capacity |
| --- | --- | --- | --- | --- |
| STL | Unlearning via DP | C | Population Excess Risk (Sekhari et al., 2021) | $\mathcal{O}(n/\sqrt{d})$ |
| STL | Newton-update unlearning (Sekhari et al., 2021) | (S)C | Population Excess Risk | $\mathcal{O}(n/d^{1/4})$ |
| Minimax Learning | Unlearning via DP minimax learning | SC-SC | Population Strong PD Risk | $\mathcal{O}(n/\sqrt{d})$ |
| Minimax Learning | Unlearning via DP minimax learning | C-C | Population Weak PD Risk | $\mathcal{O}(n/\sqrt{d})$ |
| Minimax Learning | Our Work | (S)C-(S)C | Population Strong and Weak PD Risk | $\mathcal{O}(n/d^{1/4})$ |
Machine unlearning is an emerging privacy-respecting problem concerning already-trained models (i.e., during the post-training phase) (Cao and Yang, 2015; Guo et al., 2020; Sekhari et al., 2021; Graves et al., 2021; Bourtoule et al., 2021; Li et al., 2021; Shibata et al., 2021; Wu et al., 2022; Cheng et al., 2023; Chen et al., 2023; Tarun et al., 2023; Wu et al., 2023; Ghazi et al., 2023; Wang et al., 2023b). That is, it removes certain training samples from the trained model upon their users' data deletion requests. It is driven by the right to be forgotten, which is mandated by a growing number of user data protection legislations enacted in recent years. Prominent examples include the European Union's General Data Protection Regulation (GDPR) (Mantelero, 2013), the California Consumer Privacy Act (CCPA), and Canada's proposed Consumer Privacy Protection Act (CPPA). Machine unlearning comes with several desiderata. Besides sufficiently removing the influence of the data being deleted, it should be efficient and avoid the prohibitive computational cost of the baseline method of fully retraining the model on the remaining dataset from scratch. To guarantee the sufficiency of data removal, there are exact machine unlearning methods (Cao and Yang, 2015; Ginart et al., 2019; Brophy and Lowd, 2021; Bourtoule et al., 2021; Ullah et al., 2021; Schelter et al., 2021; Chen et al., 2022b, a; Yan et al., 2022; Di et al., 2023; Xia et al., 2023) and approximate machine unlearning methods (Golatkar et al., 2020a; Wu et al., 2020; Golatkar et al., 2020b; Nguyen et al., 2020; Neel et al., 2021; Peste et al., 2021; Golatkar et al., 2021; Warnecke et al., 2023; Izzo et al., 2021; Mahadevan and Mathioudakis, 2021; Mehta et al., 2022; Zhang et al., 2022c; Wang et al., 2023a; Chien et al., 2023a; Lin et al., 2023), some of which can offer the rigorous $(\epsilon,\delta)$-certification (Guo et al., 2020; Sekhari et al., 2021; Suriyakumar and Wilson, 2022; Chien et al., 2023b) inspired by differential privacy. In addition, recent studies also point out the importance of understanding the relationship between the generalization performance and the amount of deleted samples (Sekhari et al., 2021; Suriyakumar and Wilson, 2022). In particular, they introduce the definition of deletion capacity to formally quantify the number of samples that can be deleted while the after-unlearning model maintains a designated population risk. However, most existing works so far have focused on machine unlearning for standard statistical learning models with one variable, which leaves it unknown how to design a machine unlearning method that meets all the desiderata above.
Machine unlearning for minimax models becomes a pressing problem because trained minimax models also rely heavily on the training data, while the users contributing data are granted the right to be forgotten. In this paper, we study the machine unlearning problem for minimax models under the $(\epsilon,\delta)$-certified machine unlearning framework. We collect in Table 1 the results in this paper and comparisons with baseline methods that are adapted from previous papers to $(\epsilon,\delta)$-certified machine unlearning.
Our main contributions can be summarized as follows.
• Certified minimax unlearning algorithm: We develop an $(\epsilon,\delta)$-certified minimax unlearning algorithm under the setting of strongly-convex-strongly-concave loss functions. To sufficiently remove the data influence, the algorithm introduces the total Hessian, consisting of both the direct Hessian and the indirect Hessian, where the latter is crucial to account for the interdependence between the primal and dual variables in minimax models. This leads to the complete Newton-based minimax unlearning update. Subsequently, we introduce the Gaussian mechanism from DP to achieve the $(\epsilon,\delta)$-minimax unlearning certification, which requires a careful analysis of the closeness between the complete Newton updated variables and the retraining-from-scratch variables.
• Generalization: We provide generalization results for our certified minimax unlearning algorithm in terms of the population weak and strong primal-dual risk, a common generalization measure for minimax models.
• Deletion capacity: We establish the deletion capacity result, which guarantees that our unlearning algorithm can retain the generalization rates for up to $\mathcal{O}(n/d^{1/4})$ deleted samples, where $n$ is the number of training samples and $d$ is the model dimension. It matches the state-of-the-art result under the standard statistical unlearning setting, which can be regarded as a special case of our minimax setting.
• Extension to more general losses: We extend the certified minimax unlearning to more general loss functions, including convex-concave, strongly-convex-concave, and convex-strongly-concave losses, and provide the corresponding $(\epsilon,\delta)$-certification, population primal-dual risk, and deletion capacity results.
• Extension with better efficiency: We develop a more computationally efficient extension, which can also support successive and online deletion requests. It saves the re-computation of the total Hessian matrix during the unlearning phase, where the minimax unlearning update can be regarded as a total Hessian-based infinitesimal jackknife. It also enjoys a slightly smaller population primal-dual risk, though the overall rates of the risk and deletion capacity remain the same.
2 Related work
Machine unlearning has received increasing research attention in recent years, mainly due to growing concerns about the privacy of user data utilized for machine learning model training. Since the earliest work by Cao and Yang (2015), a variety of machine unlearning methods have been proposed, which can be roughly divided into two categories: exact unlearning and approximate unlearning.
Exact machine unlearning. Methods for exact machine unlearning aim to produce models that perform identically to the models retrained from scratch. Some exact unlearning methods are designed for specific machine learning models like k-means clustering (Ginart et al., 2019) and random forests (Brophy and Lowd, 2021). SISA (Bourtoule et al., 2021) proposes a general exact unlearning framework based on sharding and slicing the training data into multiple non-overlapping shards and training independently on each shard. During unlearning, SISA retrains only on the shards containing the data to be removed. GraphEraser (Chen et al., 2022b) and RecEraser (Chen et al., 2022a) further extend SISA to unlearning for graph neural networks and recommendation systems, respectively.
Approximate machine unlearning. Approximate machine unlearning methods propose to trade off exactness in data removal against computational/memory efficiency. Prior works propose diverse ways to update the model parameters and offer different types of unlearning certification. When it comes to the unlearning update, many existing works consider a Newton update-related unlearning step in which the Hessian matrix of the loss function plays a key role (Guo et al., 2020; Golatkar et al., 2020a; Peste et al., 2021; Sekhari et al., 2021; Golatkar et al., 2021; Mahadevan and Mathioudakis, 2021; Suriyakumar and Wilson, 2022; Mehta et al., 2022; Chien et al., 2023b). This unlearning update is motivated by influence functions (Koh and Liang, 2017). To alleviate the computation of the Hessian, Golatkar et al. (2020a) and Peste et al. (2021) utilize the Fisher Information Matrix to approximate the Hessian, mitigating its expensive computation and inversion. Mehta et al. (2022) provide a variant of the conditional independence coefficient to select sufficient sets for unlearning, avoiding the need to invert the entire Hessian matrix. ML-forgetting (Golatkar et al., 2021) trains a set of linear weights on the core dataset, which standard training would not change, and another set of linear weights on the user dataset containing the data to be forgotten; they formulate an optimization problem to approximate the forgetting Newton update. Suriyakumar and Wilson (2022) leverage the proximal infinitesimal jackknife as the unlearning step in order to handle nonsmooth loss functions; in addition, they achieve better computational efficiency and can deal with online delete requests. There are also many other designs achieving different degrees of speedup (Wu et al., 2020; Nguyen et al., 2020; Neel et al., 2021; Zhang et al., 2022c).
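To make the shared structure of these Hessian-based updates concrete, the following is a minimal, generic sketch of the Newton-style unlearning step for STL models (our illustration only; all names are hypothetical and this is not any cited paper's implementation):

```python
import numpy as np

def newton_unlearn_stl(w_star, H_remain, grads_deleted, n):
    """One Newton correction that removes the first-order influence of m deleted samples.

    w_star: trained parameters; H_remain: Hessian of the empirical loss on the
    remaining data at w_star; grads_deleted: (m, d) per-sample gradients of the
    deleted points at w_star; n: original training-set size.
    """
    m = grads_deleted.shape[0]
    # Shift w_star toward the minimizer of the remaining-data loss.
    return w_star + np.linalg.solve(H_remain, grads_deleted.sum(axis=0)) / (n - m)
```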
Apart from the various designs for the unlearning update, there are also different definitions of certified machine unlearning. Early works like Guo et al. (2020) introduce a certified data-removal mechanism that adds random perturbations to the loss function at training time. Golatkar et al. (2020a) introduce an information-theoretic certified unlearning notion and also add random noise to ensure the certification, which is specific to the Fisher Information Matrix and not general enough. More recently, Sekhari et al. (2021) propose the $(\epsilon,\delta)$-certified machine unlearning definition, which does not require introducing additional randomization during training. More essentially, Sekhari et al. (2021) point out the importance of characterizing the generalization performance after machine unlearning. Sekhari et al. (2021) and Suriyakumar and Wilson (2022) establish generalization results in terms of the population risk and derive deletion capacity guarantees.
However, most existing works only consider machine unlearning for STL models that minimize over a single variable. None of the prior works provides certified machine unlearning for minimax models, for which the generalization and deletion capacity guarantees are still unknown.
3 Preliminaries and Baseline Solution
3.1 Minimax Learning
The goal of minimax learning is to optimize the population loss $F(\mathbf{w},\mathbf{v})$, given by

$$\min_{\mathbf{w}\in\mathcal{W}} \max_{\mathbf{v}\in\mathcal{V}} \; F(\mathbf{w},\mathbf{v}) := \mathbb{E}_{z\sim\mathcal{D}}\big[f(\mathbf{w},\mathbf{v};z)\big], \qquad (1)$$

where $f$ is the loss function, $z$ is a data instance from the distribution $\mathcal{D}$, and $\mathcal{W}\subseteq\mathbb{R}^{d_1}$ and $\mathcal{V}\subseteq\mathbb{R}^{d_2}$ are closed convex domains of the primal and dual variables, respectively. Since the data distribution $\mathcal{D}$ is unknown in practice, minimax learning instead optimizes the empirical loss $F_S(\mathbf{w},\mathbf{v})$, given by

$$\min_{\mathbf{w}\in\mathcal{W}} \max_{\mathbf{v}\in\mathcal{V}} \; F_S(\mathbf{w},\mathbf{v}) := \frac{1}{n}\sum_{i=1}^{n} f(\mathbf{w},\mathbf{v};z_i), \qquad (2)$$

where $S = \{z_1,\dots,z_n\}$ is the training dataset with $z_i \sim \mathcal{D}$.
We will consider $L$-Lipschitz, smooth, and strongly-convex-strongly-concave loss functions, as described in Assumptions 1 and 2 below; more details can be found in Appendix A.
Assumption 1.
For any $z$, the function $f(\cdot,\cdot;z)$ is $L$-Lipschitz with $\ell$-Lipschitz gradients and $\rho$-Lipschitz Hessians on the closed convex domain $\mathcal{W}\times\mathcal{V}$. Moreover, $f(\cdot,\mathbf{v};z)$ is convex on $\mathcal{W}$ for any $\mathbf{v}\in\mathcal{V}$ and $f(\mathbf{w},\cdot;z)$ is concave on $\mathcal{V}$ for any $\mathbf{w}\in\mathcal{W}$.
Assumption 2.
For any $z$, the function $f(\cdot,\cdot;z)$ satisfies Assumption 1 and is $\mu_{\mathbf{w}}$-strongly convex on $\mathcal{W}$ for any $\mathbf{v}\in\mathcal{V}$ and $\mu_{\mathbf{v}}$-strongly concave on $\mathcal{V}$ for any $\mathbf{w}\in\mathcal{W}$.
Denote a randomized minimax learning algorithm by $A$ and its trained variables by $(A_{\mathbf{w}}(S), A_{\mathbf{v}}(S))$. The generalization performance of the trained model variables is a top concern (Thekumparampil et al., 2019; Zhang et al., 2020; Lei et al., 2021; Farnia and Ozdaglar, 2021; Zhang et al., 2021, 2022b; Ozdaglar et al., 2022), and it can be measured by the population weak primal-dual (PD) risk or the population strong PD risk, as formalized below.
Definition 1 (Population Primal-Dual Risk).
The population weak PD risk $\triangle^{w}(\mathbf{w},\mathbf{v})$ and the population strong PD risk $\triangle^{s}(\mathbf{w},\mathbf{v})$ of $(\mathbf{w},\mathbf{v}) = (A_{\mathbf{w}}(S), A_{\mathbf{v}}(S))$ are defined as

$$\triangle^{w}(\mathbf{w},\mathbf{v}) := \max_{\mathbf{v}'\in\mathcal{V}} \mathbb{E}\big[F(\mathbf{w},\mathbf{v}')\big] - \min_{\mathbf{w}'\in\mathcal{W}} \mathbb{E}\big[F(\mathbf{w}',\mathbf{v})\big], \qquad \triangle^{s}(\mathbf{w},\mathbf{v}) := \mathbb{E}\Big[\max_{\mathbf{v}'\in\mathcal{V}} F(\mathbf{w},\mathbf{v}') - \min_{\mathbf{w}'\in\mathcal{W}} F(\mathbf{w}',\mathbf{v})\Big]. \qquad (3)$$
Notations. We introduce the following notations used in the sequel. For a twice differentiable function $f$ with arguments $\mathbf{w}$ and $\mathbf{v}$, we use $\nabla_{\mathbf{w}} f$ and $\nabla_{\mathbf{v}} f$ to denote the direct gradients of $f$ w.r.t. $\mathbf{w}$ and $\mathbf{v}$, respectively, and denote its Jacobian as $\nabla f = (\nabla_{\mathbf{w}} f, \nabla_{\mathbf{v}} f)$. We use $\nabla_{\mathbf{ww}} f$, $\nabla_{\mathbf{wv}} f$, $\nabla_{\mathbf{vw}} f$, $\nabla_{\mathbf{vv}} f$ to denote the second-order partial derivatives, correspondingly, and denote the Hessian as $\nabla^2 f$. We define the total Hessians of $f$ w.r.t. $\mathbf{w}$ and $\mathbf{v}$ as $\mathtt{D}_{\mathbf{ww}} f := \nabla_{\mathbf{ww}} f - \nabla_{\mathbf{wv}} f\,(\nabla_{\mathbf{vv}} f)^{-1}\nabla_{\mathbf{vw}} f$ and $\mathtt{D}_{\mathbf{vv}} f := \nabla_{\mathbf{vv}} f - \nabla_{\mathbf{vw}} f\,(\nabla_{\mathbf{ww}} f)^{-1}\nabla_{\mathbf{wv}} f$, which are well defined whenever $\nabla_{\mathbf{vv}} f$ and $\nabla_{\mathbf{ww}} f$ are invertible. We also use the shorthand $f_i := f(\mathbf{w},\mathbf{v};z_i)$.
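As a concrete illustration of the total Hessian, here is a minimal sketch (our own toy example with a hypothetical quadratic saddle loss; not the paper's code) that forms $\mathtt{D}_{\mathbf{ww}} f$ and $\mathtt{D}_{\mathbf{vv}} f$ from the four Hessian blocks via automatic differentiation:

```python
import torch

def f(w, v):
    # Hypothetical strongly-convex-strongly-concave toy loss (illustrative only).
    return 0.5 * w @ w + w @ v - 0.5 * v @ v

w, v = torch.zeros(3), torch.zeros(3)

# For tuple inputs, torch returns the blocks ((H_ww, H_wv), (H_vw, H_vv)).
(H_ww, H_wv), (H_vw, H_vv) = torch.autograd.functional.hessian(f, (w, v))

# Total Hessians: Schur complements that fold in the best-response curvature.
D_ww = H_ww - H_wv @ torch.linalg.solve(H_vv, H_vw)
D_vv = H_vv - H_vw @ torch.linalg.solve(H_ww, H_wv)
```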
3.2 $(\epsilon,\delta)$-Certified Machine Unlearning
An unlearning algorithm $\bar{A}$ for minimax models receives the output of a minimax learning algorithm $A$, the set of delete requests $U \subseteq S$, and some additional memory variables $T(S)$ as input, and returns an updated model $(\bar{A}_{\mathbf{w}}, \bar{A}_{\mathbf{v}})$, aiming to remove the influence of $U$. The memory set $T(S)$ does not contain the entire training set; instead, its size is independent of the training data size $n$. The mapping of an unlearning algorithm can thus be formulated as $\bar{A}: U \times (A_{\mathbf{w}}(S), A_{\mathbf{v}}(S)) \times T(S) \to \mathcal{W}\times\mathcal{V}$. We now give the notion of $(\epsilon,\delta)$-certified unlearning introduced by Sekhari et al. (2021), which is inspired by the definition of differential privacy (Dwork et al., 2006).
Definition 2 ($(\epsilon,\delta)$-Certified Unlearning (Sekhari et al., 2021)).
Let $\Theta$ be the domain of $(\mathbf{w},\mathbf{v})$. For all $S$ of size $n$ and sets of delete requests $U \subseteq S$ such that $|U| \le m$, the pair of learning algorithm $A$ and unlearning algorithm $\bar{A}$ is $(\epsilon,\delta)$-certified unlearning if, for all $O \subseteq \Theta$, the following two conditions are satisfied:

$$\Pr\big[\bar{A}(U, A(S), T(S)) \in O\big] \le e^{\epsilon}\,\Pr\big[\bar{A}(\emptyset, A(S\setminus U), T(S\setminus U)) \in O\big] + \delta, \qquad (4)$$
$$\Pr\big[\bar{A}(\emptyset, A(S\setminus U), T(S\setminus U)) \in O\big] \le e^{\epsilon}\,\Pr\big[\bar{A}(U, A(S), T(S)) \in O\big] + \delta, \qquad (5)$$

where $\emptyset$ denotes the empty set and $T(S)$ denotes the memory variables available to $\bar{A}$.
The above definition ensures the indistinguishability between the output distributions of (i) the model trained on the set $S$ and then unlearned with delete requests $U$ and (ii) the model trained on the set $S\setminus U$ and then unlearned with an empty set. Specifically, the unlearning algorithm simply adds perturbations to the output of $A$ when the set of delete requests is empty.
Deletion Capacity. Under the definition of certified unlearning, Sekhari et al. (2021) introduce the definition of deletion capacity, which formalizes how many samples can be deleted while still maintaining good guarantees on test loss. Here, we utilize the population primal-dual risk defined in Definition 1 instead of the excess population risk utilized for STL models.
Definition 3 (Deletion Capacity (Sekhari et al., 2021)).
Let $\epsilon, \delta > 0$ and $S$ be a dataset of size $n$ drawn i.i.d. from the data distribution $\mathcal{D}$. Let $f(\mathbf{w},\mathbf{v};z)$ be a minimax loss and $U \subseteq S$ be the set of deletion requests. For a pair of minimax learning algorithm $A$ and minimax unlearning algorithm $\bar{A}$ that satisfies $(\epsilon,\delta)$-unlearning, the deletion capacity $m_{\epsilon,\delta}(d_1,d_2,n)$ is defined as the maximum number of samples that can be unlearned while still ensuring the population primal-dual (weak PD or strong PD) risk is small. Let the expectation be taken over $S \sim \mathcal{D}^n$ and the outputs of the algorithms $A$ and $\bar{A}$, and let $d_1$ and $d_2$ denote the dimensions of the domains $\mathcal{W}$ and $\mathcal{V}$, respectively. Specifically,

$$m_{\epsilon,\delta}(d_1,d_2,n) := \max\Big\{ m \;:\; \mathbb{E}\big[\triangle(\bar{A}_{\mathbf{w}}, \bar{A}_{\mathbf{v}})\big] \le 0.01 \Big\}, \qquad (6)$$

where the outputs $\bar{A}_{\mathbf{w}}$ and $\bar{A}_{\mathbf{v}}$ of the minimax unlearning algorithm refer to the parameters $\mathbf{w}$ and $\mathbf{v}$, respectively, and $\triangle$ can be the population weak PD risk or population strong PD risk of $(\bar{A}_{\mathbf{w}}, \bar{A}_{\mathbf{v}})$.
We set the risk threshold in eq.(6) to $0.01$ (or any other arbitrary small constant) throughout the paper.
3.3 Baseline Solution: Certified Minimax Unlearning via Differential Privacy
Since Definition 2 is motivated by differential privacy (DP), it is natural to use tools from DP for machine unlearning. For a differentially private learning algorithm $A$ that is private with respect to datasets of edit distance $m$, the unlearning algorithm simply returns its output without any changes and is independent of the delete requests $U$ as well as the memory variables $T(S)$, i.e., $\bar{A}(U, A(S), T(S)) = A(S)$.
A number of differentially private minimax learning algorithms can be applied, e.g., Zhang et al. (2022a); Yang et al. (2022); Bassily et al. (2023). For instance, we can obtain the output $(A_{\mathbf{w}}(S), A_{\mathbf{v}}(S))$ by calling Algorithm 3 of Zhang et al. (2022a). Under Assumptions 1 and 2, we then get the population strong PD risk based on (Zhang et al., 2022a, Theorem 4.3) and the group privacy property of DP (Vadhan, 2017, Lemma 7.2.2), as follows,

$$\triangle^{s}\big(A_{\mathbf{w}}(S), A_{\mathbf{v}}(S)\big) = \tilde{\mathcal{O}}\Big( \frac{1}{\sqrt{n}} + \frac{\sqrt{d}\,m}{n\epsilon} \Big), \qquad (7)$$

where we let $d := d_1 + d_2$, let $\tilde{\mathcal{O}}$ hide logarithmic factors, and let $m$ be the edit distance between datasets (i.e., between the original dataset and the remaining dataset after removing the samples to be forgotten).
The algorithm $A$ satisfies $(\epsilon,\delta)$-DP for any pair of datasets with edit distance at most $m$. Since $\bar{A}(U, A(S), T(S)) = A(S)$ and $\bar{A}(\emptyset, A(S\setminus U), T(S\setminus U)) = A(S\setminus U)$, this privacy guarantee can be converted to the minimax unlearning guarantee in Definition 2, implying that the pair $(A, \bar{A})$ is $(\epsilon,\delta)$-certified minimax unlearning. According to Definition 3, the population strong PD risk in eq.(7) yields the following bound on deletion capacity.
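For concreteness, here is a sketch of the standard group-privacy step behind this conversion (a generic DP fact, stated under the assumption that $A$ is $(\epsilon_0,\delta_0)$-DP for neighboring datasets):

```latex
% If A is (eps_0, delta_0)-DP for datasets differing in one sample, then for
% any pair of datasets S, S' with edit distance m and any measurable set O,
\Pr[A(S)\in O] \;\le\; e^{m\epsilon_0}\,\Pr[A(S')\in O] \;+\; m\,e^{(m-1)\epsilon_0}\,\delta_0 .
% Hence, running A with per-sample budget eps_0 = eps/m (and delta_0 scaled
% accordingly) yields the (eps, delta) guarantee needed in Definition 2.
```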
Theorem 1 (Deletion capacity of unlearning via DP (Sekhari et al., 2021)).
Denote $d := d_1 + d_2$. There exists a polynomial-time learning algorithm $A$ and unlearning algorithm $\bar{A}$ for minimax problems of the form (2) such that the deletion capacity is

$$m_{\epsilon,\delta}(d_1,d_2,n) \ge c \cdot \frac{n\epsilon}{\sqrt{d}}, \qquad (8)$$

where the constant $c$ hidden in the $\Omega$-notation depends on the properties of the loss function $f$ (e.g., the strong convexity and strong concavity parameters, and the Lipschitz continuity and smoothness parameters).
However, this DP minimax learning baseline provides an inferior deletion capacity. In the following sections, we show that the $\sqrt{d}$ in the denominator of eq.(8) can be further reduced to $d^{1/4}$.
4 Certified Minimax Unlearning
In this section, we focus on the setting of the strongly-convex-strongly-concave loss function. We first provide the intuition behind the design of the minimax unlearning step in Sec.4.1, then present the formal algorithm in Sec.4.2 and a more efficient extension in Sec.4.3, with analyses of the minimax unlearning certification, generalization, and deletion capacity in Sec.4.4. We provide extensions to more general loss settings in Sec.5. The proofs for the theorems presented in this section and the next can be found in Appendices B and C, respectively.
4.1 Intuition for Minimax Unlearning Update
To begin with, we provide an informal derivation of the minimax unlearning update to illustrate its design intuition. Given the training set $S$ of size $n$ and the deletion subset $U \subseteq S$ of size $m$, the aim is to approximate the optimal solution $(\mathbf{w}^{-}, \mathbf{v}^{-})$ of the loss on the remaining dataset $S\setminus U$, given by,

$$(\mathbf{w}^{-}, \mathbf{v}^{-}) \in \arg\min_{\mathbf{w}\in\mathcal{W}} \arg\max_{\mathbf{v}\in\mathcal{V}} \; F_{S\setminus U}(\mathbf{w},\mathbf{v}) := \frac{1}{n-m}\sum_{z_i \in S\setminus U} f(\mathbf{w},\mathbf{v};z_i). \qquad (9)$$

Meanwhile, we have the optimal solution $(\mathbf{w}^{*}, \mathbf{v}^{*})$ to the original loss $F_S$ after minimax learning. Taking the unlearning of $\mathbf{w}$ for instance, a first-order Taylor expansion of $\nabla_{\mathbf{w}} F_{S\setminus U}$ around $(\mathbf{w}^{*}, \mathbf{v}^{*})$ gives

$$0 \approx \nabla_{\mathbf{w}} F_{S\setminus U}(\mathbf{w}^{*}, \mathbf{v}^{*}) + \nabla_{\mathbf{ww}} F_{S\setminus U}(\mathbf{w}^{*}, \mathbf{v}^{*})(\mathbf{w}^{-} - \mathbf{w}^{*}) + \nabla_{\mathbf{wv}} F_{S\setminus U}(\mathbf{w}^{*}, \mathbf{v}^{*})(\mathbf{v}^{-} - \mathbf{v}^{*}). \qquad (10)$$

Since $\mathbf{w}^{*}$ is a minimizer of $F_S(\cdot,\mathbf{v}^{*})$, from the first-order optimality condition we get $\nabla_{\mathbf{w}} F_{S\setminus U}(\mathbf{w}^{*}, \mathbf{v}^{*}) = -\frac{1}{n-m}\sum_{z_i\in U}\nabla_{\mathbf{w}} f(\mathbf{w}^{*},\mathbf{v}^{*};z_i)$. Now given the best response auxiliary function $\mathbf{v}_{S\setminus U}(\mathbf{w}) := \arg\max_{\mathbf{v}\in\mathcal{V}} F_{S\setminus U}(\mathbf{w},\mathbf{v})$ (more best response auxiliary functions are introduced in Appendix A, Definition 8), we have $\mathbf{v}^{-} = \mathbf{v}_{S\setminus U}(\mathbf{w}^{-})$. We further get

$$\mathbf{v}^{-} - \mathbf{v}^{*} \overset{(a)}{\approx} \mathbf{v}_{S\setminus U}(\mathbf{w}^{-}) - \mathbf{v}_{S\setminus U}(\mathbf{w}^{*}) \overset{(b)}{\approx} \partial \mathbf{v}_{S\setminus U}(\mathbf{w}^{*})\,(\mathbf{w}^{-} - \mathbf{w}^{*}) \overset{(c)}{=} -\big(\nabla_{\mathbf{vv}} F_{S\setminus U}\big)^{-1}\nabla_{\mathbf{vw}} F_{S\setminus U}\,(\mathbf{w}^{-} - \mathbf{w}^{*}), \qquad (11)$$

where the approximate equation $(a)$ leaves out the term $\mathbf{v}_{S\setminus U}(\mathbf{w}^{*}) - \mathbf{v}^{*}$, which is bounded in Appendix A, Lemma 2, and does not affect the overall unlearning guarantee. The approximate equation $(b)$ is the linear approximation step, where $\partial \mathbf{v}_{S\setminus U}$ is the response Jacobian of the auxiliary function $\mathbf{v}_{S\setminus U}$. And the equation $(c)$ is due to the implicit function theorem. This gives that

$$0 \approx \nabla_{\mathbf{w}} F_{S\setminus U}(\mathbf{w}^{*}, \mathbf{v}^{*}) + \mathtt{D}_{\mathbf{ww}} F_{S\setminus U}(\mathbf{w}^{*}, \mathbf{v}^{*})\,(\mathbf{w}^{-} - \mathbf{w}^{*}), \qquad (12)$$

which implies the following approximation of $\mathbf{w}^{-}$:

$$\mathbf{w}^{-} \approx \mathbf{w}^{*} + \frac{1}{n-m}\big(\mathtt{D}_{\mathbf{ww}} F_{S\setminus U}(\mathbf{w}^{*}, \mathbf{v}^{*})\big)^{-1}\sum_{z_i\in U}\nabla_{\mathbf{w}} f(\mathbf{w}^{*}, \mathbf{v}^{*}; z_i). \qquad (13)$$
The above informal derivation indicates that the minimax unlearning update relies on the total Hessian to sufficiently remove the data influence (Liu et al., 2023; Zhang et al., 2023), rather than the conventional direct Hessian that appears in standard statistical unlearning (Guo et al., 2020; Sekhari et al., 2021; Suriyakumar and Wilson, 2022; Mehta et al., 2022). The update in eq.(13) is closely related to the complete Newton step in the second-order minimax optimization literature (Zhang et al., 2020), which motivates our complete Newton-based minimax unlearning. However, due to the various approximations in the above informal derivation, it alone cannot provide a certified minimax unlearning guarantee. Below, we formally derive upper bounds for these approximations in the closeness upper bound analysis. Based on the closeness upper bound, we introduce the Gaussian mechanism to yield the distribution indistinguishability required for $(\epsilon,\delta)$-certified minimax unlearning.
4.2 Proposed Certified Minimax Unlearning
We first provide algorithms under the setting of the smooth and strongly-convex-strongly-concave (SC-SC) loss function, as described in Assumptions 1 and 2.
Minimax Learning algorithm.
We denote our learning algorithm by $A$; the pseudocode is shown in Algorithm 1. Given a dataset $S$ of size $n$ drawn independently from some distribution $\mathcal{D}$, algorithm $A$ computes the optimal solution $(\mathbf{w}^{*}, \mathbf{v}^{*})$ to the empirical risk $F_S$. It then outputs the point $(\mathbf{w}^{*}, \mathbf{v}^{*})$ as well as the additional memory variables $T(S)$, which store the total Hessians of $F_S$ at $(\mathbf{w}^{*}, \mathbf{v}^{*})$.
Minimax Unlearning Algorithm. We denote the proposed certified minimax unlearning algorithm by $\bar{A}$ and present its pseudocode in Algorithm 2. Algorithm $\bar{A}$ takes the following inputs: the set of delete requests $U$ of size $m$, the trained minimax model $(\mathbf{w}^{*}, \mathbf{v}^{*})$, and the memory variables $T(S)$. To obtain the certified minimax unlearning for $\mathbf{w}$, eq.(15) computes the total Hessian of $F_{S\setminus U}$ from $n\,\mathtt{D}_{\mathbf{ww}} F_S(\mathbf{w}^{*},\mathbf{v}^{*}) - \sum_{z_i\in U}\mathtt{D}_{\mathbf{ww}} f(\mathbf{w}^{*},\mathbf{v}^{*};z_i)$, where the former term can be retrieved from the memory set $T(S)$ and the latter is computed on the samples to be deleted; eq.(17) computes the intermediate $\hat{\mathbf{w}}$ by the complete Newton step based on this total Hessian; Line 3 injects calibrated Gaussian noise to ensure $(\epsilon,\delta)$-certified minimax unlearning. The certified minimax unlearning for $\mathbf{v}$ is symmetric. We provide a detailed analysis of Algorithm 2, including the minimax unlearning certification, generalization results, and deletion capacity, in Appendix B.1.
$$F_{S\setminus U}(\mathbf{w},\mathbf{v}) = \frac{1}{n-m}\sum_{z_i\in S\setminus U} f(\mathbf{w},\mathbf{v};z_i) \qquad (14)$$
$$\mathtt{D}_{\mathbf{ww}} F_{S\setminus U}(\mathbf{w}^{*},\mathbf{v}^{*}) = \frac{1}{n-m}\Big( n\,\mathtt{D}_{\mathbf{ww}} F_S(\mathbf{w}^{*},\mathbf{v}^{*}) - \sum_{z_i\in U}\mathtt{D}_{\mathbf{ww}} f(\mathbf{w}^{*},\mathbf{v}^{*};z_i)\Big) \qquad (15)$$
$$\mathtt{D}_{\mathbf{vv}} F_{S\setminus U}(\mathbf{w}^{*},\mathbf{v}^{*}) = \frac{1}{n-m}\Big( n\,\mathtt{D}_{\mathbf{vv}} F_S(\mathbf{w}^{*},\mathbf{v}^{*}) - \sum_{z_i\in U}\mathtt{D}_{\mathbf{vv}} f(\mathbf{w}^{*},\mathbf{v}^{*};z_i)\Big) \qquad (16)$$
$$\hat{\mathbf{w}} = \mathbf{w}^{*} + \frac{1}{n-m}\big(\mathtt{D}_{\mathbf{ww}} F_{S\setminus U}(\mathbf{w}^{*},\mathbf{v}^{*})\big)^{-1}\sum_{z_i\in U}\nabla_{\mathbf{w}} f(\mathbf{w}^{*},\mathbf{v}^{*};z_i), \quad \hat{\mathbf{v}} = \mathbf{v}^{*} + \frac{1}{n-m}\big(\mathtt{D}_{\mathbf{vv}} F_{S\setminus U}(\mathbf{w}^{*},\mathbf{v}^{*})\big)^{-1}\sum_{z_i\in U}\nabla_{\mathbf{v}} f(\mathbf{w}^{*},\mathbf{v}^{*};z_i) \qquad (17)$$
$$\mathbf{w}^{u} = \hat{\mathbf{w}} + \boldsymbol{\xi}_1, \quad \mathbf{v}^{u} = \hat{\mathbf{v}} + \boldsymbol{\xi}_2, \quad \boldsymbol{\xi}_1 \sim \mathcal{N}(0, \sigma_{\mathbf{w}}^2\mathbb{I}_{d_1}), \ \boldsymbol{\xi}_2 \sim \mathcal{N}(0, \sigma_{\mathbf{v}}^2\mathbb{I}_{d_2}) \qquad (18)$$
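To make the pipeline above concrete, here is a minimal NumPy sketch of the Algorithm 2 unlearning step (all names and the unconstrained-domain assumption are ours, not the authors' code; the sign conventions follow the symmetric derivation of Sec.4.1):

```python
import numpy as np

def minimax_unlearn(w_star, v_star, D_ww_S, D_vv_S,
                    grads_w, grads_v, tH_w, tH_v,
                    n, sigma_w, sigma_v, seed=0):
    """grads_w/grads_v: (m, d) per-sample gradients of the deleted points at
    (w*, v*); tH_w/tH_v: (m, d, d) per-sample total Hessians of those points;
    D_ww_S/D_vv_S: memorized total Hessians of F_S at (w*, v*)."""
    rng = np.random.default_rng(seed)
    m = grads_w.shape[0]
    # eq.(15)/(16)-style re-weighting: total Hessians of F_{S\U} at (w*, v*).
    D_ww = (n * D_ww_S - tH_w.sum(axis=0)) / (n - m)
    D_vv = (n * D_vv_S - tH_v.sum(axis=0)) / (n - m)
    # eq.(17)-style complete Newton step.
    w_hat = w_star + np.linalg.solve(D_ww, grads_w.sum(axis=0)) / (n - m)
    v_hat = v_star + np.linalg.solve(D_vv, grads_v.sum(axis=0)) / (n - m)
    # eq.(18): Gaussian mechanism for the (eps, delta)-certification.
    w_u = w_hat + rng.normal(0.0, sigma_w, size=w_hat.shape)
    v_u = v_hat + rng.normal(0.0, sigma_v, size=v_hat.shape)
    return w_u, v_u
```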
4.3 Certified Minimax Unlearning without Total Hessian Re-computation
We extend Algorithm 2 and propose Algorithm 3 to reduce its computational cost. The complete Newton steps in eq.(19) and eq.(20) utilize the total Hessians $\mathtt{D}_{\mathbf{ww}} F_S(\mathbf{w}^{*},\mathbf{v}^{*})$ and $\mathtt{D}_{\mathbf{vv}} F_S(\mathbf{w}^{*},\mathbf{v}^{*})$ that are directly retrieved from the memory, rather than the updated total Hessians $\mathtt{D}_{\mathbf{ww}} F_{S\setminus U}$ and $\mathtt{D}_{\mathbf{vv}} F_{S\setminus U}$ used in Algorithm 2. The form of eq.(19) and eq.(20) can also be regarded as a total Hessian extension of the infinitesimal jackknife. In this way, Algorithm 3 gets rid of the computationally demanding re-evaluation of the total Hessian on the samples to be deleted, which significantly reduces the computational cost. It has the same computational complexity as the state-of-the-art certified unlearning method developed for STL models (Suriyakumar and Wilson, 2022). Moreover, Algorithm 3 is more appealing for the successive data deletion setting.
$$\hat{\mathbf{w}} = \mathbf{w}^{*} + \frac{1}{n-m}\big(\mathtt{D}_{\mathbf{ww}} F_S(\mathbf{w}^{*},\mathbf{v}^{*})\big)^{-1}\sum_{z_i\in U}\nabla_{\mathbf{w}} f(\mathbf{w}^{*},\mathbf{v}^{*};z_i) \qquad (19)$$
$$\hat{\mathbf{v}} = \mathbf{v}^{*} + \frac{1}{n-m}\big(\mathtt{D}_{\mathbf{vv}} F_S(\mathbf{w}^{*},\mathbf{v}^{*})\big)^{-1}\sum_{z_i\in U}\nabla_{\mathbf{v}} f(\mathbf{w}^{*},\mathbf{v}^{*};z_i) \qquad (20)$$
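A corresponding sketch of the eq.(19)/(20) update, which reuses the memorized total Hessians and therefore skips all per-deletion Hessian work (hypothetical names, same assumptions as the previous sketch):

```python
import numpy as np

def minimax_unlearn_jackknife(w_star, v_star, D_ww_S, D_vv_S,
                              grads_w, grads_v, n):
    # Complete-Newton step with the stored total Hessians of F_S, so no
    # per-deletion Hessian re-computation is needed (eq.(19)/(20) style).
    m = grads_w.shape[0]
    w_hat = w_star + np.linalg.solve(D_ww_S, grads_w.sum(axis=0)) / (n - m)
    v_hat = v_star + np.linalg.solve(D_vv_S, grads_v.sum(axis=0)) / (n - m)
    return w_hat, v_hat  # add calibrated Gaussian noise as in Algorithm 2
```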
4.4 Analysis for Algorithm 3
$(\epsilon,\delta)$-Certified Unlearning Guarantee. The intermediate variables $(\hat{\mathbf{w}}, \hat{\mathbf{v}})$ are distinguishable in distribution from the retraining-from-scratch variables $(\mathbf{w}^{-}, \mathbf{v}^{-})$ because they are deterministic and the Taylor expansion introduces a certain amount of approximation. The following lemma quantifies the closeness between $(\hat{\mathbf{w}}, \hat{\mathbf{v}})$ and $(\mathbf{w}^{-}, \mathbf{v}^{-})$, which can be regarded as the "sensitivity" when applying the Gaussian mechanism.
Lemma 1 (Closeness Upper Bound).
Equipped with Lemma 1, we obtain the following certified unlearning guarantee by adding Gaussian noise calibrated according to the above closeness result. Due to the minimax structure, our analysis is more involved than in the STL case (Sekhari et al., 2021; Suriyakumar and Wilson, 2022).
Theorem 2 ($(\epsilon,\delta)$-Minimax Unlearning Certification).
Under the same settings as Lemma 1, our minimax learning algorithm $A$ and unlearning algorithm $\bar{A}$ form an $(\epsilon,\delta)$-certified minimax unlearning pair if we choose

$$\sigma_{\mathbf{w}} = \frac{\Delta_{\mathbf{w}}\sqrt{2\ln(1.25/\delta)}}{\epsilon}, \qquad \sigma_{\mathbf{v}} = \frac{\Delta_{\mathbf{v}}\sqrt{2\ln(1.25/\delta)}}{\epsilon}, \qquad (22)$$

where $\Delta_{\mathbf{w}}$ and $\Delta_{\mathbf{v}}$ are the closeness upper bounds of Lemma 1.
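The noise calibration itself is the classical Gaussian-mechanism formula; below is a small helper (our sketch, with the Lemma 1 closeness bound playing the role of the $\ell_2$-sensitivity):

```python
import math

def gaussian_sigma(sensitivity: float, eps: float, delta: float) -> float:
    # Gaussian-mechanism calibration (Dwork and Roth, 2014, Theorem A.1):
    # sigma = Delta * sqrt(2 ln(1.25/delta)) / eps.
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps

# Example: sigma_w = gaussian_sigma(Delta_w, eps, delta)
```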
Generalization Guarantee. Theorem 3 below provides the generalization result in terms of the population PD risk for the minimax unlearning algorithm $\bar{A}$.
Theorem 3 (Population Primal-Dual Risk).
Deletion Capacity. The population weak and strong PD risk given in Theorem 3 for the output of the unlearning algorithm provides the following bound on deletion capacity.
5 Certified Minimax Unlearning for Convex-Concave Loss Function
We further extend the certified minimax unlearning to the convex-concave loss function. In addition, Appendix C provides the extension to convex-strongly-concave and strongly-convex-concave loss functions. Given the convex-concave loss function $f(\mathbf{w},\mathbf{v};z)$, similar to the unlearning for STL models (Sekhari et al., 2021), we define the regularized function $\tilde{f}(\mathbf{w},\mathbf{v};z) := f(\mathbf{w},\mathbf{v};z) + \frac{\lambda}{2}\|\mathbf{w}\|^2 - \frac{\lambda}{2}\|\mathbf{v}\|^2$. Suppose the function $f$ satisfies Assumption 1; then $\tilde{f}$ is $\lambda$-strongly convex in $\mathbf{w}$, $\lambda$-strongly concave in $\mathbf{v}$, and Lipschitz with Lipschitz gradients and Hessians. It then suffices to apply the minimax learning and unlearning algorithms of Sec.4 to the regularized loss function $\tilde{f}$ with a properly chosen $\lambda$. We denote the learning and unlearning algorithms for convex-concave losses by $A_{cc}$ and $\bar{A}_{cc}$; their implementation details are given in Appendix C. We suppose the SC-SC regularization parameter satisfies $\lambda \le \ell$. Theorem 5 below summarizes the guarantees of $(\epsilon,\delta)$-certified unlearning and population primal-dual risk (weak and strong) for algorithm $\bar{A}_{cc}$.
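Before the formal statement, here is a minimal sketch of this regularization reduction (hypothetical names; the SC-SC learning/unlearning pipeline from Sec.4 is then applied to the wrapped loss):

```python
import numpy as np

def make_sc_sc_loss(f, lam):
    """Wrap a convex-concave loss f(w, v, z) into a lam-SC-SC surrogate."""
    def f_reg(w, v, z):
        # Add lam/2 * ||w||^2 and subtract lam/2 * ||v||^2.
        return f(w, v, z) + 0.5 * lam * (np.dot(w, w) - np.dot(v, v))
    return f_reg
```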
Theorem 5.
Let Assumption 1 hold with $\lambda \le \ell$, and suppose the parameter spaces $\mathcal{W}$ and $\mathcal{V}$ are bounded so that $\|\mathbf{w}\| \le B_{\mathbf{w}}$ for all $\mathbf{w}\in\mathcal{W}$ and $\|\mathbf{v}\| \le B_{\mathbf{v}}$ for all $\mathbf{v}\in\mathcal{V}$. We have:

• $(\epsilon,\delta)$-Minimax Unlearning Certification: The minimax learning algorithm $A_{cc}$ and unlearning algorithm $\bar{A}_{cc}$ form an $(\epsilon,\delta)$-certified minimax unlearning pair.

• Population Weak PD Risk: The population weak PD risk for the output of algorithm $\bar{A}_{cc}$ is stated in eq.(25). In particular, by setting the regularization parameter $\lambda$ as in eq.(26), we have the population weak PD risk in eq.(27), where the omitted constants depend only on the Lipschitz, smoothness, and domain-diameter parameters of the problem.

• Population Strong PD Risk: The population strong PD risk for the output of algorithm $\bar{A}_{cc}$ is stated in eq.(28). In particular, by setting $\lambda$ as in eq.(29), we have the population strong PD risk in eq.(30), with constants of the same kind.

• Deletion Capacity: The deletion capacity of algorithm $\bar{A}_{cc}$ is stated in eq.(31), where the constant depends on the same problem parameters.
6 Conclusion
In this paper, we have studied certified machine unlearning for minimax models with a focus on the generalization rates and deletion capacity, while existing works in this area largely focus on standard statistical learning models. We have provided a new minimax unlearning algorithm composed of the total Hessian-based complete Newton update and the Gaussian mechanism-based perturbation, which comes with a rigorous $(\epsilon,\delta)$-unlearning certification. We have established generalization results in terms of the population weak and strong primal-dual risk and the correspondingly defined deletion capacity results for strongly-convex-strongly-concave loss functions, both of which match the state-of-the-art rates obtained for standard statistical learning models. We have also provided extensions to other loss types like the convex-concave loss function. In addition, we have provided a more computationally efficient extension that gets rid of the total Hessian re-computation during the minimax unlearning phase, which is more appealing for the successive data deletion setting. Although our bound for deletion capacity improves on that of DP by a factor of $d^{1/4}$ and matches the state-of-the-art result established for unlearning under the STL setting, it remains unclear whether this bound is tight. In future work, we plan to extend to more general settings like the nonconvex-nonconcave loss function setting.
Acknowledgements
We would like to thank Gautam Kamath for his valuable comments on the presentation of the results in the previous version of the paper. Additionally, we extend our thanks to the reviewers and area chair of NeurIPS 2023 for their constructive comments and feedback. This work was supported in part by the National Natural Science Foundation of China (62072395, 62206207, U20A20178), and the National Key Research and Development Program of China (2020AAA0107705, 2021YFB3100300).
References
- Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, volume 70, pages 214–223. PMLR, 2017.
- Bassily et al. (2019) Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, volume 32, pages 11279–11288, 2019.
- Bassily et al. (2023) Raef Bassily, Cristóbal Guzmán, and Michael Menart. Differentially private algorithms for the stochastic saddle point problem with optimal rates for the strong gap. In Conference on Learning Theory, volume 195, pages 2482–2508. PMLR, 2023.
- Boob and Guzmán (2023) Digvijay Boob and Cristóbal Guzmán. Optimal algorithms for differentially private stochastic monotone variational inequalities and saddle-point problems. Mathematical Programming, pages 1–43, 2023.
- Bourtoule et al. (2021) Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy, pages 141–159. IEEE, 2021.
- Brophy and Lowd (2021) Jonathan Brophy and Daniel Lowd. Machine unlearning for random forests. In International Conference on Machine Learning, volume 139, pages 1092–1104. PMLR, 2021.
- Cao and Yang (2015) Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In 2015 IEEE symposium on security and privacy, pages 463–480. IEEE, 2015.
- Chen et al. (2022a) Chong Chen, Fei Sun, Min Zhang, and Bolin Ding. Recommendation unlearning. In Proceedings of the ACM Web Conference 2022, pages 2768–2777. ACM, 2022a.
- Chen et al. (2022b) Min Chen, Zhikun Zhang, Tianhao Wang, Michael Backes, Mathias Humbert, and Yang Zhang. Graph unlearning. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 499–513. ACM, 2022b.
- Chen et al. (2023) Min Chen, Weizhuo Gao, Gaoyang Liu, Kai Peng, and Chen Wang. Boundary unlearning: Rapid forgetting of deep networks via shifting the decision boundary. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7766–7775. IEEE, 2023.
- Cheng et al. (2023) Jiali Cheng, George Dasoulas, Huan He, Chirag Agarwal, and Marinka Zitnik. Gnndelete: A general strategy for unlearning in graph neural networks. In The Eleventh International Conference on Learning Representations. OpenReview.net, 2023.
- Chien et al. (2023a) Eli Chien, Chao Pan, and Olgica Milenkovic. Efficient model updates for approximate unlearning of graph-structured data. In The Eleventh International Conference on Learning Representations. OpenReview.net, 2023a.
- Chien et al. (2023b) Eli Chien, Chao Pan, and Olgica Milenkovic. Efficient model updates for approximate unlearning of graph-structured data. In The Eleventh International Conference on Learning Representations. OpenReview.net, 2023b.
- Dai et al. (2018) Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. Sbeed: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, volume 80, pages 1125–1134. PMLR, 2018.
- Di et al. (2023) Jimmy Z. Di, Jack Douglas, Jayadev Acharya, Gautam Kamath, and Ayush Sekhari. Hidden poison: Machine unlearning enables camouflaged poisoning attacks. In Advances in Neural Information Processing Systems, volume 37, 2023.
- Du et al. (2017) Simon S Du, Jianshu Chen, Lihong Li, Lin Xiao, and Dengyong Zhou. Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning, volume 70, pages 1049–1058. PMLR, 2017.
- Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Third Theory of Cryptography Conference, volume 3876, pages 265–284. Springer, 2006.
- Dwork et al. (2014) Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9:211–407, 2014.
- Farnia and Ozdaglar (2021) Farzan Farnia and Asuman Ozdaglar. Train simultaneously, generalize better: Stability of gradient-based minimax learners. In International Conference on Machine Learning, volume 139, pages 3174–3185. PMLR, 2021.
- Ghazi et al. (2023) Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Ayush Sekhari, and Chiyuan Zhang. Ticketed learning–unlearning schemes. In The Thirty Sixth Annual Conference on Learning Theory, pages 5110–5139. PMLR, 2023.
- Ginart et al. (2019) Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making ai forget you: Data deletion in machine learning. In Advances in neural information processing systems, volume 32, pages 3513–3526, 2019.
- Golatkar et al. (2020a) Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9304–9312, 2020a.
- Golatkar et al. (2020b) Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Forgetting outside the box: Scrubbing deep networks of information accessible from input-output observations. In Computer Vision–ECCV 2020: 16th European Conference, volume 12374, pages 383–398. Springer, 2020b.
- Golatkar et al. (2021) Aditya Golatkar, Alessandro Achille, Avinash Ravichandran, Marzia Polito, and Stefano Soatto. Mixed-privacy forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 792–801. IEEE, 2021.
- Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27, pages 2672–2680, 2014.
- Graves et al. (2021) Laura Graves, Vineel Nagisetty, and Vijay Ganesh. Amnesiac machine learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11516–11524. AAAI Press, 2021.
- Guo et al. (2020) Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens Van Der Maaten. Certified data removal from machine learning models. In International Conference on Machine Learning, volume 119, pages 3832–3842. PMLR, 2020.
- Izzo et al. (2021) Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics, volume 130, pages 2008–2016. PMLR, 2021.
- Koh and Liang (2017) Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, volume 70, pages 1885–1894. PMLR, 2017.
- Lei et al. (2021) Yunwen Lei, Zhenhuan Yang, Tianbao Yang, and Yiming Ying. Stability and generalization of stochastic gradient methods for minimax problems. In International Conference on Machine Learning, volume 139, pages 6175–6186. PMLR, 2021.
- Li et al. (2021) Yuantong Li, Chi-Hua Wang, and Guang Cheng. Online forgetting process for linear regression models. In International Conference on Artificial Intelligence and Statistics, pages 217–225. PMLR, 2021.
- Lin et al. (2023) Shen Lin, Xiaoyu Zhang, Chenyang Chen, Xiaofeng Chen, and Willy Susilo. Erm-ktp: Knowledge-level machine unlearning via knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20147–20155, 2023.
- Lin et al. (2020) Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, volume 119, pages 6083–6093. PMLR, 2020.
- Liu et al. (2023) Junxu Liu, Mingsheng Xue, Jian Lou, Xiaoyu Zhang, Li Xiong, and Zhan Qin. Muter: Machine unlearning on adversarially trained models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4892–4902, 2023.
- Luo et al. (2022) Luo Luo, Yujun Li, and Cheng Chen. Finding second-order stationary points in nonconvex-strongly-concave minimax optimization. In Advances in Neural Information Processing Systems, volume 35, pages 36667–36679, 2022.
- Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In The Sixth International Conference on Learning Representations. OpenReview.net, 2018.
- Mahadevan and Mathioudakis (2021) Ananth Mahadevan and Michael Mathioudakis. Certifiable machine unlearning for linear models. arXiv preprint arXiv:2106.15093, 2021.
- Mantelero (2013) Alessandro Mantelero. The eu proposal for a general data protection regulation and the roots of the ‘right to be forgotten’. Computer Law & Security Review, 29(3):229–235, 2013.
- Mehta et al. (2022) Ronak Mehta, Sourav Pal, Vikas Singh, and Sathya N Ravi. Deep unlearning via randomized conditionally independent hessians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10422–10431. IEEE, 2022.
- Neel et al. (2021) Seth Neel, Aaron Roth, and Saeed Sharifi-Malvajerdi. Descent-to-delete: Gradient-based methods for machine unlearning. In Algorithmic Learning Theory, volume 132, pages 931–962. PMLR, 2021.
- Nesterov and Polyak (2006) Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
- Nguyen et al. (2020) Quoc Phong Nguyen, Bryan Kian Hsiang Low, and Patrick Jaillet. Variational bayesian unlearning. In Advances in Neural Information Processing Systems, volume 33, pages 16025–16036, 2020.
- Ozdaglar et al. (2022) Asuman E. Ozdaglar, Sarath Pattathil, Jiawei Zhang, and Kaiqing Zhang. What is a good metric to study generalization of minimax learners? In Advances in Neural Information Processing Systems, volume 35, pages 38190–38203, 2022.
- Peste et al. (2021) Alexandra Peste, Dan Alistarh, and Christoph H Lampert. SSSE: efficiently erasing samples from trained machine learning models. arXiv preprint arXiv:2107.03860, 2021.
- Schelter et al. (2021) Sebastian Schelter, Stefan Grafberger, and Ted Dunning. Hedgecut: Maintaining randomised trees for low-latency machine unlearning. In International Conference on Management of Data, pages 1545–1557. ACM, 2021.
- Sekhari et al. (2021) Ayush Sekhari, Jayadev Acharya, Gautam Kamath, and Ananda Theertha Suresh. Remember what you want to forget: Algorithms for machine unlearning. In Advances in Neural Information Processing Systems, volume 34, pages 18075–18086, 2021.
- Sharma et al. (2022) Pranay Sharma, Rohan Panda, Gauri Joshi, and Pramod Varshney. Federated minimax optimization: Improved convergence analyses and algorithms. In International Conference on Machine Learning, volume 162, pages 19683–19730. PMLR, 2022.
- Shibata et al. (2021) Takashi Shibata, Go Irie, Daiki Ikami, and Yu Mitsuzumi. Learning with selective forgetting. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pages 989–996. ijcai.org, 2021.
- Sinha et al. (2018) Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. In The Sixth International Conference on Learning Representations. OpenReview.net, 2018.
- Suriyakumar and Wilson (2022) Vinith Suriyakumar and Ashia C Wilson. Algorithms that approximate data removal: New results and limitations. In Advances in Neural Information Processing Systems, volume 35, pages 18892–18903, 2022.
- Tarun et al. (2023) Ayush K Tarun, Vikram S Chundawat, Murari Mandal, and Mohan Kankanhalli. Fast yet effective machine unlearning. IEEE Transactions on Neural Networks and Learning Systems, 2023.
- Thekumparampil et al. (2019) Kiran K Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Efficient algorithms for smooth minimax optimization. In Advances in Neural Information Processing Systems, volume 32, pages 12659–12670, 2019.
- Ullah et al. (2021) Enayat Ullah, Tung Mai, Anup Rao, Ryan A Rossi, and Raman Arora. Machine unlearning via algorithmic stability. In Conference on Learning Theory, volume 134, pages 4126–4142. PMLR, 2021.
- Vadhan (2017) Salil Vadhan. The complexity of differential privacy. Tutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich, pages 347–450, 2017.
- Wang et al. (2023a) Cheng-Long Wang, Mengdi Huai, and Di Wang. Inductive graph unlearning. In 32nd USENIX Security Symposium, pages 3205–3222. USENIX Association, 2023a.
- Wang et al. (2023b) Weiqi Wang, Zhiyi Tian, Chenhan Zhang, An Liu, and Shui Yu. Bfu: Bayesian federated unlearning with parameter self-sharing. In Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security, pages 567–578. ACM, 2023b.
- Warnecke et al. (2023) Alexander Warnecke, Lukas Pirch, Christian Wressnegger, and Konrad Rieck. Machine unlearning of features and labels. In 30th Annual Network and Distributed System Security Symposium. The Internet Society, 2023.
- Wu et al. (2022) Ga Wu, Masoud Hashemi, and Christopher Srinivasa. Puma: Performance unchanged model augmentation for training data removal. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8675–8682, 2022.
- Wu et al. (2020) Yinjun Wu, Edgar Dobriban, and Susan Davidson. Deltagrad: Rapid retraining of machine learning models. In International Conference on Machine Learning, volume 119, pages 10355–10366. PMLR, 2020.
- Wu et al. (2023) Zhaomin Wu, Junhui Zhu, Qinbin Li, and Bingsheng He. Deltaboost: Gradient boosting decision trees with efficient machine unlearning. Proceedings of the ACM on Management of Data, 1(2):1–26, 2023.
- Xia et al. (2023) Haocheng Xia, Jinfei Liu, Jian Lou, Zhan Qin, Kui Ren, Yang Cao, and Li Xiong. Equitable data valuation meets the right to be forgotten in model markets. Proceedings of the VLDB Endowment, 16(11):3349–3362, 2023.
- Yan et al. (2022) Haonan Yan, Xiaoguang Li, Ziyao Guo, Hui Li, Fenghua Li, and Xiaodong Lin. Arcane: An efficient architecture for exact machine unlearning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, pages 4006–4013. ijcai.org, 2022.
- Yang et al. (2022) Zhenhuan Yang, Shu Hu, Yunwen Lei, Kush R Varshney, Siwei Lyu, and Yiming Ying. Differentially private sgda for minimax problems. In Uncertainty in Artificial Intelligence, volume 180, pages 2192–2202. PMLR, 2022.
- Zhang et al. (2020) Guojun Zhang, Kaiwen Wu, Pascal Poupart, and Yaoliang Yu. Newton-type methods for minimax optimization. arXiv preprint arXiv:2006.14592, 2020.
- Zhang et al. (2021) Junyu Zhang, Mingyi Hong, Mengdi Wang, and Shuzhong Zhang. Generalization bounds for stochastic saddle point problems. In International Conference on Artificial Intelligence and Statistics, volume 130, pages 568–576. PMLR, 2021.
- Zhang et al. (2022a) Liang Zhang, Kiran K Thekumparampil, Sewoong Oh, and Niao He. Bring your own algorithm for optimal differentially private stochastic minimax optimization. In Advances in Neural Information Processing Systems, volume 35, pages 35174–35187, 2022a.
- Zhang et al. (2023) Shuijing Zhang, Jian Lou, Li Xiong, Xiaoyu Zhang, and Jing Liu. Closed-form machine unlearning for matrix factorization. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 3278–3287, 2023.
- Zhang et al. (2022b) Siqi Zhang, Yifan Hu, Liang Zhang, and Niao He. Uniform convergence and generalization for nonconvex stochastic minimax problems. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop), 2022b.
- Zhang et al. (2022c) Zijie Zhang, Yang Zhou, Xin Zhao, Tianshi Che, and Lingjuan Lyu. Prompt certified machine unlearning with randomized gradient smoothing and quantization. In Advances in Neural Information Processing Systems, volume 35, pages 13433–13455, 2022c.
Appendix A Additional Definitions and Supporting Lemmas
In this section, we provide additional definitions and supporting lemmas. In the next two sections, Sec.B contains the missing proofs of Sec.4 and the online extension that supports the successive unlearning setting, and Sec.C contains the missing proofs of Sec.5 as well as detailed algorithm descriptions for the general convex-concave loss function setting.
A.1 Additional Definitions
We first recall the following standard definitions for the loss function from optimization literature.
Definition 4 (Function Lipschitz Continuity).
The function $f$ is $L$-Lipschitz, i.e., there exists a constant $L > 0$ such that for all $\mathbf{w}_1, \mathbf{w}_2 \in \mathcal{W}$, $\mathbf{v}_1, \mathbf{v}_2 \in \mathcal{V}$, and $z$, it holds that

$$\big| f(\mathbf{w}_1,\mathbf{v}_1;z) - f(\mathbf{w}_2,\mathbf{v}_2;z) \big| \le L\,\big\|(\mathbf{w}_1-\mathbf{w}_2, \mathbf{v}_1-\mathbf{v}_2)\big\|. \qquad (32)$$
Definition 5 (Gradient Lipschitz Continuity).
The function $f$ has $\ell$-Lipschitz gradients, i.e., there exists a constant $\ell > 0$ such that for all $\mathbf{w}_1, \mathbf{w}_2 \in \mathcal{W}$, $\mathbf{v}_1, \mathbf{v}_2 \in \mathcal{V}$, and $z$, it holds that

$$\big\| \nabla f(\mathbf{w}_1,\mathbf{v}_1;z) - \nabla f(\mathbf{w}_2,\mathbf{v}_2;z) \big\| \le \ell\,\big\|(\mathbf{w}_1-\mathbf{w}_2, \mathbf{v}_1-\mathbf{v}_2)\big\|, \qquad (33)$$

where recall that $\nabla f = (\nabla_{\mathbf{w}} f, \nabla_{\mathbf{v}} f)$.
Definition 6 (Hessian Lipschitz Continuity).
The function $f$ has $\rho$-Lipschitz Hessians, i.e., there exists a constant $\rho > 0$ such that for all $\mathbf{w}_1, \mathbf{w}_2 \in \mathcal{W}$, $\mathbf{v}_1, \mathbf{v}_2 \in \mathcal{V}$, and $z$, it holds that

$$\big\| \nabla^2 f(\mathbf{w}_1,\mathbf{v}_1;z) - \nabla^2 f(\mathbf{w}_2,\mathbf{v}_2;z) \big\| \le \rho\,\big\|(\mathbf{w}_1-\mathbf{w}_2, \mathbf{v}_1-\mathbf{v}_2)\big\|, \qquad (34)$$

where recall that $\nabla^2 f$ collects the second-order partial derivatives of $f$.
Definition 7 (Strongly-Convex-Strongly-Concave Objective Function).
The function $f$ is $\mu_{\mathbf{w}}$-strongly convex on $\mathcal{W}$ and $\mu_{\mathbf{v}}$-strongly concave on $\mathcal{V}$, i.e., there exist constants $\mu_{\mathbf{w}}, \mu_{\mathbf{v}} > 0$ such that for all $\mathbf{w}_1, \mathbf{w}_2 \in \mathcal{W}$, $\mathbf{v}_1, \mathbf{v}_2 \in \mathcal{V}$, and $z$, it holds that

$$f(\mathbf{w}_1,\mathbf{v};z) \ge f(\mathbf{w}_2,\mathbf{v};z) + \langle \nabla_{\mathbf{w}} f(\mathbf{w}_2,\mathbf{v};z), \mathbf{w}_1-\mathbf{w}_2\rangle + \tfrac{\mu_{\mathbf{w}}}{2}\|\mathbf{w}_1-\mathbf{w}_2\|^2, \quad f(\mathbf{w},\mathbf{v}_1;z) \le f(\mathbf{w},\mathbf{v}_2;z) + \langle \nabla_{\mathbf{v}} f(\mathbf{w},\mathbf{v}_2;z), \mathbf{v}_1-\mathbf{v}_2\rangle - \tfrac{\mu_{\mathbf{v}}}{2}\|\mathbf{v}_1-\mathbf{v}_2\|^2. \qquad (35)$$
Definition 8 (Best Response Auxiliary Functions).
We introduce the auxiliary functions $\mathbf{v}_S(\cdot)$ and $\mathbf{w}_S(\cdot)$, given by

$$\mathbf{v}_S(\mathbf{w}) := \arg\max_{\mathbf{v}\in\mathcal{V}} F_S(\mathbf{w},\mathbf{v}), \qquad \mathbf{w}_S(\mathbf{v}) := \arg\min_{\mathbf{w}\in\mathcal{W}} F_S(\mathbf{w},\mathbf{v}), \qquad (36)$$

and we have $\mathbf{v}^{*} = \mathbf{v}_S(\mathbf{w}^{*})$ and $\mathbf{w}^{*} = \mathbf{w}_S(\mathbf{v}^{*})$. We can similarly introduce $\mathbf{v}_{S\setminus U}(\cdot)$ and $\mathbf{w}_{S\setminus U}(\cdot)$ as

$$\mathbf{v}_{S\setminus U}(\mathbf{w}) := \arg\max_{\mathbf{v}\in\mathcal{V}} F_{S\setminus U}(\mathbf{w},\mathbf{v}), \qquad \mathbf{w}_{S\setminus U}(\mathbf{v}) := \arg\min_{\mathbf{w}\in\mathcal{W}} F_{S\setminus U}(\mathbf{w},\mathbf{v}), \qquad (37)$$

and we have $\mathbf{v}^{-} = \mathbf{v}_{S\setminus U}(\mathbf{w}^{-})$ and $\mathbf{w}^{-} = \mathbf{w}_{S\setminus U}(\mathbf{v}^{-})$ by this definition.
In addition, we define the primal function $P_S(\mathbf{w}) := \max_{\mathbf{v}\in\mathcal{V}} F_S(\mathbf{w},\mathbf{v})$, which has gradient $\nabla P_S(\mathbf{w}) = \nabla_{\mathbf{w}} F_S(\mathbf{w}, \mathbf{v}_S(\mathbf{w}))$ and Hessian $\nabla^2 P_S(\mathbf{w}) = \mathtt{D}_{\mathbf{ww}} F_S(\mathbf{w}, \mathbf{v}_S(\mathbf{w}))$ (i.e., the total Hessian of $F_S$). The dual function, its gradient, and its Hessian can be similarly defined, e.g., $D_S(\mathbf{v}) := \min_{\mathbf{w}\in\mathcal{W}} F_S(\mathbf{w},\mathbf{v})$.
A.2 Supporting Lemmas
The following lemma provides the distance between $\mathbf{v}_{S\setminus U}(\mathbf{w}^{*})$ and $\mathbf{v}^{*}$. A similar result can be derived for the distance between $\mathbf{w}_{S\setminus U}(\mathbf{v}^{*})$ and $\mathbf{w}^{*}$.
Lemma 2.
Proof.
We observe that
(39) |
where inequality (a) follows from the fact that $\mathbf{v}_{S\setminus U}(\mathbf{w}^{*})$ is the maximizer of the function $F_{S\setminus U}(\mathbf{w}^{*},\cdot)$, and inequality (b) is due to the fact that the function $f$ is $L$-Lipschitz. Also note that the function $F_{S\setminus U}(\mathbf{w}^{*},\cdot)$ is $\mu_{\mathbf{v}}$-strongly concave; thus we have
(40) |
Eq.(39) and eq.(40) together give that
(41) |
which implies the claimed bound. ∎
The following lemma provides the distance between the retraining-from-scratch solution $(\mathbf{w}^{-}, \mathbf{v}^{-})$ and the originally trained solution $(\mathbf{w}^{*}, \mathbf{v}^{*})$.
Lemma 3.
Proof.
We begin with the $\mathbf{w}$-part,
(43) |
where inequality (a) holds because the variable in question is the minimizer of the corresponding empirical function, and inequality (b) follows from the fact that the function $f$ is $L$-Lipschitz. Since the function is $\mu_{\mathbf{w}}$-strongly convex, we further get
(44) |
Eq.(43) and eq.(44) together give that
(45) |
Thus, we get the claimed bound for the $\mathbf{w}$-part.
For the $\mathbf{v}$-part, we similarly have
(46) |
where inequality (a) follows from the fact that the variable in question is the maximizer of the corresponding function, and inequality (b) is due to the fact that the function $f$ is $L$-Lipschitz. In addition, by the strong concavity assumption, we have
(47) |
By eq.(46) and eq.(47), we get that
(48) |
Thus, we have the claimed bound for the $\mathbf{v}$-part. ∎
In the following, we recall several lemmas (i.e., Lemma 4 to Lemma 8) from existing minimax optimization literature for completeness.
Lemma 4 ((Lin et al., 2020, Lemma 4.3)).
Proof.
Remark 1.
The above lemma can be similarly derived for the other variable to obtain that the best response auxiliary function $\mathbf{w}_S(\cdot)$ is Lipschitz. In the next three lemmas, we focus on the $\mathbf{w}$-part and omit the $\mathbf{v}$-part.
Lemma 5 ((Luo et al., 2022, Lemma 3)).
Lemma 6 ((Nesterov and Polyak, 2006, Lemma 1)).
Proof.
Recall the definition of the primal function $P_S(\mathbf{w}) = \max_{\mathbf{v}\in\mathcal{V}} F_S(\mathbf{w},\mathbf{v})$ and its gradient $\nabla P_S(\mathbf{w}) = \nabla_{\mathbf{w}} F_S(\mathbf{w}, \mathbf{v}_S(\mathbf{w}))$. Due to the optimality of $\mathbf{v}_S(\mathbf{w})$, we have

$$\nabla_{\mathbf{v}} F_S(\mathbf{w}, \mathbf{v}_S(\mathbf{w})) = 0. \qquad (55)$$

By taking the total derivative with respect to $\mathbf{w}$, we get

$$\nabla_{\mathbf{vw}} F_S(\mathbf{w}, \mathbf{v}_S(\mathbf{w})) + \nabla_{\mathbf{vv}} F_S(\mathbf{w}, \mathbf{v}_S(\mathbf{w}))\,\partial\mathbf{v}_S(\mathbf{w}) = 0, \quad \text{i.e.,} \quad \partial\mathbf{v}_S(\mathbf{w}) = -\big(\nabla_{\mathbf{vv}} F_S\big)^{-1}\nabla_{\mathbf{vw}} F_S. \qquad (56)$$

Taking the total derivative of $\nabla P_S(\mathbf{w})$ again w.r.t. $\mathbf{w}$, we further have

$$\nabla^2 P_S(\mathbf{w}) = \nabla_{\mathbf{ww}} F_S + \nabla_{\mathbf{wv}} F_S\,\partial\mathbf{v}_S(\mathbf{w}) = \mathtt{D}_{\mathbf{ww}} F_S(\mathbf{w}, \mathbf{v}_S(\mathbf{w})). \qquad (57)$$

Based on the equalities above and Lemma 5, we have

(58)
∎
Proof.
By the definition of the total Hessian, we have

$$\big\|\mathtt{D}_{\mathbf{ww}} f\big\| = \big\|\nabla_{\mathbf{ww}} f - \nabla_{\mathbf{wv}} f\,(\nabla_{\mathbf{vv}} f)^{-1}\nabla_{\mathbf{vw}} f\big\| \overset{(a)}{\le} \big\|\nabla_{\mathbf{ww}} f\big\| + \big\|\nabla_{\mathbf{wv}} f\big\|\,\big\|(\nabla_{\mathbf{vv}} f)^{-1}\big\|\,\big\|\nabla_{\mathbf{vw}} f\big\| \overset{(b)}{\le} \ell + \frac{\ell^2}{\mu_{\mathbf{v}}}, \qquad (59)$$

where inequality (a) uses the triangle inequality and inequality (b) is due to the fact that $f$ has $\ell$-Lipschitz gradients and is $\mu_{\mathbf{v}}$-strongly concave in $\mathbf{v}$; thus we have $\|\nabla_{\mathbf{wv}} f\| \le \ell$, $\|\nabla_{\mathbf{vw}} f\| \le \ell$, and $\|(\nabla_{\mathbf{vv}} f)^{-1}\| \le 1/\mu_{\mathbf{v}}$. ∎
Lemma 8 ((Zhang et al., 2022a, Lemma 4.4)).
Appendix B Detailed Algorithm Analysis and Missing Proofs in Section 4
B.1 Analysis for Algorithm 2
In the following, we provide the analysis for Algorithm 2 in terms of the guarantees of $(\epsilon,\delta)$-certified unlearning, population primal-dual risk, and deletion capacity, along with the corresponding proofs.
Lemma 10 (Closeness Upper Bound).
Proof.
Recall that the empirical loss functions $F_S$ and $F_{S\setminus U}$ are

$$F_S(\mathbf{w},\mathbf{v}) = \frac{1}{n}\sum_{z_i\in S} f(\mathbf{w},\mathbf{v};z_i), \qquad F_{S\setminus U}(\mathbf{w},\mathbf{v}) = \frac{1}{n-m}\sum_{z_i\in S\setminus U} f(\mathbf{w},\mathbf{v};z_i). \qquad (63)$$
We focus on the key term, which admits the following conversions
(64) |
For the first term on the right-hand side of the inequality in eq.(64), we have
(65) |
where inequality (a) is by Lemma 6 and inequality (b) is by Lemma 3.
For the second term on the right-hand side of the inequality in eq.(64), we have
(66) |
where inequality (a) follows from the fact that the function is Lipschitz continuous, inequality (b) holds by Lemma 4, and inequality (c) is by Lemma 3.
For the third term on the right-hand side of the inequality in eq.(64), we have
(67) |
where the first inequality is by Lemma 7. Plugging eq.(65), eq.(66) and eq.(67) into eq.(64), we further get
(68) |
The above derivation yields an upper bound result. In the following, we derive a lower bound result. Define the vector satisfying the following relation,
(69) |
Since the corresponding gradient terms vanish due to the optimality of the retrained solution, plugging eq.(69) into eq.(68), we get that
(70) |
For the left-hand side of eq.(70), we have
(71) |
where the second inequality is by Lemma 7. Combining eq.(70), eq.(68), and the definition of the vector , we get that
(72) |
Symmetrically, we can get the bound for the $\mathbf{v}$-part. ∎
Theorem 6 ($(\epsilon,\delta)$-Minimax Unlearning Certification).
Under the same settings as Lemma 10, our minimax learning algorithm $A$ and unlearning algorithm $\bar{A}$ form an $(\epsilon,\delta)$-certified minimax unlearning pair if we choose

$$\sigma_{\mathbf{w}} = \frac{\Delta_{\mathbf{w}}\sqrt{2\ln(1.25/\delta)}}{\epsilon}, \qquad \sigma_{\mathbf{v}} = \frac{\Delta_{\mathbf{v}}\sqrt{2\ln(1.25/\delta)}}{\epsilon}, \qquad (73)$$

where $\Delta_{\mathbf{w}}$ and $\Delta_{\mathbf{v}}$ are the closeness upper bounds given in Lemma 10.
Proof.
Our proof of the $(\epsilon,\delta)$-minimax unlearning certification is similar to the one used for the differential privacy guarantee of the Gaussian mechanism (Dwork et al., 2014).
Let $(A_{\mathbf{w}}(S), A_{\mathbf{v}}(S))$ be the output of the learning algorithm $A$ trained on dataset $S$, and let the unlearned model be the output of the unlearning algorithm $\bar{A}$ running with delete requests $U$, the learned model $(A_{\mathbf{w}}(S), A_{\mathbf{v}}(S))$, and the memory variables $T(S)$. We denote the intermediate variables before adding noise in algorithm $\bar{A}$ by $(\hat{\mathbf{w}}, \hat{\mathbf{v}})$.
Similarly, let $(A_{\mathbf{w}}(S\setminus U), A_{\mathbf{v}}(S\setminus U))$ be the output of the learning algorithm trained on dataset $S\setminus U$, and consider the output of the unlearning algorithm running with the empty delete set, the learned model $(A_{\mathbf{w}}(S\setminus U), A_{\mathbf{v}}(S\setminus U))$, and the memory variables $T(S\setminus U)$; we denote the corresponding intermediate variables before adding noise by $(\mathbf{w}^{-}, \mathbf{v}^{-})$.
We sample the noise vectors $\boldsymbol{\xi}_1 \sim \mathcal{N}(0, \sigma_{\mathbf{w}}^2\mathbb{I}_{d_1})$ and $\boldsymbol{\xi}_2 \sim \mathcal{N}(0, \sigma_{\mathbf{v}}^2\mathbb{I}_{d_2})$ with the scales

$$\sigma_{\mathbf{w}} = \frac{\Delta_{\mathbf{w}}\sqrt{2\ln(1.25/\delta)}}{\epsilon}, \qquad \sigma_{\mathbf{v}} = \frac{\Delta_{\mathbf{v}}\sqrt{2\ln(1.25/\delta)}}{\epsilon}, \qquad (74)$$

where $\Delta_{\mathbf{w}}$ and $\Delta_{\mathbf{v}}$ are given in Lemma 10. Then, following the same proof as Dwork et al. (2014, Theorem A.1) together with the composition property of DP (Vadhan, 2017, Lemma 7.2.3), we get that, for any measurable set $O \subseteq \mathcal{W}\times\mathcal{V}$,

$$\Pr\big[\bar{A}(U, A(S), T(S)) \in O\big] \le e^{\epsilon}\,\Pr\big[\bar{A}(\emptyset, A(S\setminus U), T(S\setminus U)) \in O\big] + \delta \quad \text{(and symmetrically)}, \qquad (75)$$

which implies that the algorithm pair $A$ and $\bar{A}$ is $(\epsilon,\delta)$-certified minimax unlearning. ∎
Theorem 7 (Population Primal-Dual Risk).
Proof.
We begin with the population weak PD risk for the certified minimax unlearning variables $(\mathbf{w}^{u}, \mathbf{v}^{u})$, which has the following conversions,
(77) | ||||
where inequality (a) holds because the population loss function $F$ is $L$-Lipschitz continuous, and inequality (b) is by Lemma 8.
By recalling the unlearning update step in Algorithm 2, we have
(78) |
where the noise vector $\boldsymbol{\xi}_1$ is drawn independently from $\mathcal{N}(0, \sigma_{\mathbf{w}}^2\mathbb{I}_{d_1})$. From the relation in eq.(78), we further get
(79) | ||||
where inequality (a) is by the triangle inequality, inequality (b) follows from the relation in eq.(71) together with Jensen's inequality, equality (c) holds because the vector $\boldsymbol{\xi}_1$ is a zero-mean Gaussian, and inequality (d) is due to the fact that $f$ is $L$-Lipschitz continuous.
Symmetrically, we have
(80) |
Plugging eq.(79) and eq.(80) into eq.(77), we get
(81) |
With the noise scales $\sigma_{\mathbf{w}}$ and $\sigma_{\mathbf{v}}$ set as in Theorem 6, we can get our generalization guarantee with population weak PD risk:
(82) |
For the population strong PD risk, similarly, we have
(83) | ||||
where inequality (a) is due to the fact that the population loss function $F$ is $L$-Lipschitz continuous, and inequality (b) uses eq.(79), eq.(80), and Lemma 9. With the same noise scale as above, we can get the generalization guarantee in terms of the strong PD risk below,
(84) |
∎
Theorem 8 (Deletion Capacity).
Proof.
By the definition of deletion capacity, in order to ensure the population PD risk derived in Theorem 7 is bounded by 0.01, it suffices to bound the number of deleted samples as stated, where the constant depends on the properties of the loss function $f$. ∎
B.2 Missing Proofs of Sec.4.4
B.2.1 Proof of Lemma 1 (Closeness Upper Bound)
Proof.
By the definition of the relevant functions, we have
(86) |
where inequality (a) holds by the triangle inequality and inequality (b) uses the results in eq.(65) and eq.(66). Now define the vector satisfying the following relation,
(87) |
Since the corresponding gradient terms vanish due to the optimality of the retrained solution, plugging the above relation into eq.(86), we get
(88) |
We also have
(89) |
Combining eq.(89), eq.(88), and the definition of the vector above, we get that
(90) |
Symmetrically, we can get the bound for the $\mathbf{v}$-part. ∎
B.2.2 Proof of Theorem 2 (-Minimax Unlearning Certification)
B.2.3 Proof of Theorem 3 (Population Primal-Dual Risk)
Proof.
We start with the population weak PD risk. By eq.(77), we have
(91) |
By recalling the unlearning step in Algorithm 3, we have
(92) |
where the noise vector is drawn independently from the calibrated Gaussian distribution. From the relation in eq.(92), we further get
(93) | ||||
where inequality (a) uses the triangle inequality, inequality (b) follows from the stated relation together with Jensen's inequality, equality (c) holds because the noise vector is a zero-mean Gaussian, and inequality (d) is due to the fact that $f$ is $L$-Lipschitz continuous. Symmetrically, we have
(94) |
Plugging eq.(93) and eq.(94) into eq.(91), we get
(95) |
With the noise scales $\sigma_{\mathbf{w}}$ and $\sigma_{\mathbf{v}}$ chosen as in Theorem 2, we can get our generalization guarantee in terms of the population weak PD risk:
(96) |
For the population strong PD risk, using an application of eq.(83) with Lemma 9, eq.(93), eq.(94) and the noise scales given in Theorem 2, we can get
(97) |
∎
B.2.4 Proof of Theorem 4 (Deletion Capacity)
Proof.
By the definition of deletion capacity, in order to ensure the population weak or strong PD risk derived in Theorem 3 is bounded by 0.01, it suffices to let $m$ satisfy the stated bound. ∎
B.3 Efficient Online Unlearning Algorithm
(98) |
(99) |
To support successive unlearning requests, similar to the STL case (Suriyakumar and Wilson, 2022), we further provide an efficient and online minimax unlearning algorithm, whose pseudocode is given in Algorithm 4. Its certified minimax unlearning guarantee, generalization, and deletion capacity results follow identically from those of Algorithm 3 and are omitted here.
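For illustration, here is a hypothetical driver loop for the successive-deletion setting in the spirit of Algorithm 4 (our sketch only; the gradient-evaluation points and noise handling are assumptions following the conventions of the earlier sketches):

```python
import numpy as np

def process_stream(w, v, D_ww_S, D_vv_S, n, request_stream, grad_w, grad_v):
    """Apply the jackknife-style update batch by batch.

    request_stream: iterable of deletion batches; grad_w/grad_v: callables
    returning per-sample gradients at the current parameters.
    """
    removed = 0
    for U_t in request_stream:
        gw = np.stack([grad_w(w, v, z) for z in U_t]).sum(axis=0)
        gv = np.stack([grad_v(w, v, z) for z in U_t]).sum(axis=0)
        removed += len(U_t)
        # Reuse the memorized total Hessians; no Hessian re-computation.
        w = w + np.linalg.solve(D_ww_S, gw) / (n - removed)
        v = v + np.linalg.solve(D_vv_S, gv) / (n - removed)
    return w, v  # add calibrated Gaussian noise before releasing the model
```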
Appendix C Detailed Algorithm Descriptions and Missing Proofs in Section 5
C.1 Minimax Unlearning Algorithm for Smooth Convex-Concave Loss Function
In this section, we provide minimax learning and minimax unlearning algorithms for smooth convex-concave loss functions based on the counterpart algorithms for the SC-SC setting. Given the convex-concave loss function $f(\mathbf{w},\mathbf{v};z)$, we define the regularized loss function $\tilde{f}(\mathbf{w},\mathbf{v};z) := f(\mathbf{w},\mathbf{v};z) + \frac{\lambda}{2}\|\mathbf{w}\|^2 - \frac{\lambda}{2}\|\mathbf{v}\|^2$. Suppose the function $f$ satisfies Assumption 1; then $\tilde{f}$ is $\lambda$-strongly convex in $\mathbf{w}$, $\lambda$-strongly concave in $\mathbf{v}$, and Lipschitz with Lipschitz gradients and Hessians. Thus, we can apply the minimax learning in Algorithm 1 and the unlearning in Algorithm 2 to the regularized loss function with a properly chosen $\lambda$. We denote our learning algorithm by $A_{cc}$ and unlearning algorithm by $\bar{A}_{cc}$; the pseudocode is provided in Algorithm 5 and Algorithm 6, respectively. Additionally, we denote the regularized population loss by $\tilde{F}$ and the regularized empirical loss by $\tilde{F}_S$.
(100) |
(101) |
C.2 Supporting Lemma
Lemma 11.
Suppose the function $f$ is $L$-Lipschitz continuous. Define the function $\tilde{f}$ as

$$\tilde{f}(\mathbf{w},\mathbf{v};z) := f(\mathbf{w},\mathbf{v};z) + \frac{\lambda}{2}\|\mathbf{w}\|^2 - \frac{\lambda}{2}\|\mathbf{v}\|^2. \qquad (102)$$

Given a dataset $S$, denote $(\mathbf{w}^{*}_{\lambda}, \mathbf{v}^{*}_{\lambda}) := \arg\min_{\mathbf{w}}\arg\max_{\mathbf{v}} \frac{1}{n}\sum_{z_i\in S}\tilde{f}(\mathbf{w},\mathbf{v};z_i)$. Then the variables satisfy $\|\mathbf{w}^{*}_{\lambda}\| \le L/\lambda$ and $\|\mathbf{v}^{*}_{\lambda}\| \le L/\lambda$.
Proof.
Due to the optimality of $(\mathbf{w}^{*}_{\lambda}, \mathbf{v}^{*}_{\lambda})$, we have
(103) |
Plugging in the definition of the function in the above, we get that
(104) |
Then, using the triangle inequality, we have
(105) |
where the last inequality holds because the function $f$ is $L$-Lipschitz continuous. Thus we have $\|\mathbf{w}^{*}_{\lambda}\| \le L/\lambda$. Similarly, we can get $\|\mathbf{v}^{*}_{\lambda}\| \le L/\lambda$. ∎
C.3 Proof of Theorem 5 (Certified Minimax Unlearning for Convex-Concave Loss Function)
Denote the Lipschitz and gradient-Lipschitz constants of the regularized loss accordingly; then the function $\tilde{f}$ is Lipschitz continuous and has Lipschitz gradients. Let the pair $(\mathbf{w}^{-}_{\lambda}, \mathbf{v}^{-}_{\lambda})$ be the optimal solution of the regularized loss on the remaining dataset, i.e.,
(106) |
Additionally, we have and .
Lemma 12.
Proof.
For the regularized loss, an application of Lemma 8 gives that
(108) |
By the assumption that the parameter spaces $\mathcal{W}$ and $\mathcal{V}$ are bounded so that $\|\mathbf{w}\| \le B_{\mathbf{w}}$ and $\|\mathbf{v}\| \le B_{\mathbf{v}}$, we have the following derivations for the population weak PD risk,
(109) |
Similarly, an application of Lemma 9 gives that
(110) |
And we can get the population strong PD risk with the following conversions,
(111) | ||||
∎
Lemma 13 (Closeness Upper Bound).
Proof.
Since we now run the algorithms $A_{cc}$ and $\bar{A}_{cc}$ with the regularized loss function $\tilde{f}$, the proof is identical to that of Lemma 10. ∎
Equipped with the supporting lemmas above, the proof of Theorem 5 can be separated into the proofs of the following three lemmas.
Lemma 14 (Minimax Unlearning Certification).
Under the settings of Theorem 5, our minimax learning algorithm $A_{cc}$ and unlearning algorithm $\bar{A}_{cc}$ form an $(\epsilon,\delta)$-certified minimax unlearning pair if we choose
(113) |
Proof.
Lemma 15 (Population Primal-Dual Risk).
Proof.
For the population weak PD risk, an application of eq.(77) together with Lemma 12 gives that
(115) |
According to Algorithm 6, we have the unlearning update step
(116) |
where the noise vector is drawn from the calibrated Gaussian distribution. From the relation above, we further get
(117) | ||||
where inequality (a) uses the triangle inequality, inequality (b) follows from an application of eq.(71) together with Jensen's inequality, equality (c) holds because the noise vector is a zero-mean Gaussian, inequality (d) uses the definition of the function $\tilde{f}$ and the triangle inequality, and inequality (e) is due to the fact that $f$ is $L$-Lipschitz continuous and Lemma 11. Symmetrically, we have
(118) |
Plugging eq.(117) and eq.(118) into eq.(115) with noise scales given in Lemma 14, we can get our generalization guarantee in terms of population weak PD risk:
(119) |
Similarly, using an application of eq.(83) together with Lemma 12, Lemma 14, eq.(117) and eq.(118), we can get the following population strong PD risk:
(120) |
∎
Lemma 16 (Deletion Capacity).
Proof.
By the definition of deletion capacity, in order to ensure the population PD risk derived in Lemma 15 is bounded by 0.01, it suffices to let $m$ satisfy the stated bound. ∎
C.4 Minimax Unlearning Algorithm for Smooth Convex-Strongly-Concave Loss Function
In this section, we briefly discuss the extension to the smooth C-SC setting. The SC-C setting is symmetric and thus omitted here.
Given the loss function $f$ that satisfies Assumption 1 with $\mu_{\mathbf{v}}$-strong concavity in $\mathbf{v}$, we define the regularized function as $\tilde{f}(\mathbf{w},\mathbf{v};z) := f(\mathbf{w},\mathbf{v};z) + \frac{\lambda}{2}\|\mathbf{w}\|^2$. Our minimax learning and minimax unlearning algorithms for the C-SC loss function, denoted by $A_{csc}$ and $\bar{A}_{csc}$, are given in Algorithm 7 and Algorithm 8, respectively. Additionally, we denote the regularized population loss by $\tilde{F}$ and the regularized empirical loss by $\tilde{F}_S$.
(122) |
(123) |
Note that the function $\tilde{f}$ is $\lambda$-strongly convex in $\mathbf{w}$, $\mu_{\mathbf{v}}$-strongly concave in $\mathbf{v}$, and Lipschitz with Lipschitz gradients and Hessians. Let the pair $(\mathbf{w}^{-}_{\lambda}, \mathbf{v}^{-}_{\lambda})$ be the optimal solution of the regularized loss on the remaining dataset, i.e.,
(124) |
An application of Lemma 11 implies that the empirical optimizer returned by Algorithm 8 satisfies $\|\mathbf{w}^{*}_{\lambda}\| \le L/\lambda$. Thus our domain of interest is bounded accordingly. Over this set, the function $\tilde{f}$ is Lipschitz continuous. Suppose the strongly-convex regularization parameter satisfies $\lambda \le \ell$; then $\tilde{f}$ has $2\ell$-Lipschitz gradients.
The corresponding theoretical results are given below.
Lemma 17 (Closeness Upper Bound).
Proof.
Since we now run the learning and unlearning algorithms with the regularized loss function $\tilde{f}$, the proof is identical to that of Lemma 10. ∎
Lemma 18 (Minimax Unlearning Certification).
Under the settings of Lemma 17, our minimax learning algorithm $A_{csc}$ and unlearning algorithm $\bar{A}_{csc}$ form an $(\epsilon,\delta)$-certified minimax unlearning pair if we choose
(126) |
Proof.
Lemma 19 (Population Weak PD Risk).
Under the same settings as Lemma 17, and supposing the parameter space $\mathcal{W}$ is bounded so that $\|\mathbf{w}\| \le B_{\mathbf{w}}$, the population weak PD risk for the certified minimax unlearning variables returned by Algorithm 8 is
(127) |
where . In particular, by setting the regularization parameter as:
(128) |
we have the following population weak PD risk:
(129) |
where the omitted constants depend only on the Lipschitz, smoothness, and domain-diameter parameters of the problem.
Proof.
An application of (Zhang et al., 2021, Theorem 1) gives that
(130) |
Using the relation above with an application of eq.(77) and eq.(109), we have
(131) |
By an application of eq.(117), we further get
(132) |
and
(133) |
Plugging eq.(132) and eq.(133) into eq.(131) with noise scales given in Lemma 18, we can get our generalization guarantee:
(134) |
where the omitted constants depend only on the problem parameters. ∎