
Certified Minimax Unlearning with Generalization Rates and Deletion Capacity

Jiaqi Liu1,  Jian Lou1,2,†,  Zhan Qin1,†,  Kui Ren1
1Zhejiang University
2ZJU-Hangzhou Global Scientific and Technological Innovation Center
{jiaqi.liu, jian.lou, qinzhan, kuiren}@zju.edu.cn
Corresponding authors
Abstract

We study the problem of (\epsilon,\delta)-certified machine unlearning for minimax models. Most existing works focus on unlearning from standard statistical learning models that have a single variable, and their unlearning steps hinge on the direct Hessian-based conventional Newton update. We develop a new (\epsilon,\delta)-certified machine unlearning algorithm for minimax models. It proposes a minimax unlearning step consisting of a total Hessian-based complete Newton update and the Gaussian mechanism borrowed from differential privacy. To obtain the unlearning certification, our method injects calibrated Gaussian noises by carefully analyzing the “sensitivity” of the minimax unlearning step (i.e., the closeness between the minimax unlearning variables and the retraining-from-scratch variables). We derive the generalization rates in terms of population strong and weak primal-dual risk for three different cases of loss functions, i.e., (strongly-)convex-(strongly-)concave losses. We also provide the deletion capacity to guarantee that a desired population risk can be maintained as long as the number of deleted samples does not exceed the derived amount. With n training samples and model dimension d, it yields the order \mathcal{O}(n/d^{1/4}), which shows a strict gap over the baseline method of differentially private minimax learning that has \mathcal{O}(n/d^{1/2}). In addition, our rates of generalization and deletion capacity match the state-of-the-art rates derived previously for standard statistical learning models.

1 Introduction

Minimax models have been widely applied in a variety of machine learning applications, including generative adversarial networks (Goodfellow et al., 2014; Arjovsky et al., 2017), adversarially robust learning (Madry et al., 2018; Sinha et al., 2018), and reinforcement learning (Du et al., 2017; Dai et al., 2018). This is largely credited to the two-variable (i.e., primal and dual variables) structure of minimax models, which is versatile enough to accommodate such diverse instantiations. As is common in machine learning practice, training a successful minimax model relies crucially on a potentially large corpus of training samples contributed by users, which raises privacy concerns for minimax models. Compared with standard statistical learning (STL) models, privacy studies for minimax models are relatively new. Most of the existing studies focus on privacy protection during the training phase under the differential privacy (DP) notion (Dwork et al., 2006) and federated minimax learning settings (Sharma et al., 2022). Recent works in this direction have achieved optimal generalization performance, measured in terms of the population primal-dual (PD) risk, for DP minimax models (Yang et al., 2022; Zhang et al., 2022a; Bassily et al., 2023; Boob and Guzmán, 2023).

Table 1: Summary of Results. Here (S)C means (strongly-)convex loss function, and (S)C-(S)C means (strongly-)convex-(strongly-)concave loss function. PD means Primal-Dual. n is the number of training samples and d is the model dimension.

| Model | Unlearning Algorithm | Setting | Generalization Measure | Deletion Capacity |
| --- | --- | --- | --- | --- |
| STL | DP-based (Bassily et al., 2019) | C | Population Excess Risk (Sekhari et al., 2021) | \mathcal{O}(n/d^{1/2}) |
| STL | (Sekhari et al., 2021) | (S)C | Population Excess Risk (Sekhari et al., 2021) | \mathcal{O}(n/d^{1/4}) |
| Minimax Learning | DP-based (Zhang et al., 2022a) | SC-SC | Population Strong PD Risk | \mathcal{O}(n/d^{1/2}) |
| Minimax Learning | DP-based (Bassily et al., 2023) | C-C | Population Strong PD Risk | \mathcal{O}(n/d^{1/2}) |
| Minimax Learning | Our Work | (S)C-(S)C | Population Weak or Strong PD Risk | \mathcal{O}(n/d^{1/4}) |

Machine unlearning is an emerging privacy-respecting problem concerning already-trained models (i.e., during the post-training phase) (Cao and Yang, 2015; Guo et al., 2020; Sekhari et al., 2021; Graves et al., 2021; Bourtoule et al., 2021; Li et al., 2021; Shibata et al., 2021; Wu et al., 2022; Cheng et al., 2023; Chen et al., 2023; Tarun et al., 2023; Wu et al., 2023; Ghazi et al., 2023; Wang et al., 2023b). That is, it removes certain training samples from the trained model upon their users’ data deletion requests. It is driven by the right to be forgotten, which is mandated by a growing number of user data protection legislations enacted in recent years. Prominent examples include the European Union’s General Data Protection Regulation (GDPR) (Mantelero, 2013), the California Consumer Privacy Act (CCPA), and Canada’s proposed Consumer Privacy Protection Act (CPPA). Machine unlearning comes with several desiderata. Besides sufficiently removing the influence of the data being deleted, it should be efficient and avoid the prohibitive computational cost of the baseline method to fully retrain the model on the remaining dataset from scratch. To guarantee the sufficiency of data removal, there are exact machine unlearning methods (Cao and Yang, 2015; Ginart et al., 2019; Brophy and Lowd, 2021; Bourtoule et al., 2021; Ullah et al., 2021; Schelter et al., 2021; Chen et al., 2022b, a; Yan et al., 2022; Di et al., 2023; Xia et al., 2023) and approximate machine unlearning methods (Golatkar et al., 2020a; Wu et al., 2020; Golatkar et al., 2020b; Nguyen et al., 2020; Neel et al., 2021; Peste et al., 2021; Golatkar et al., 2021; Warnecke et al., 2023; Izzo et al., 2021; Mahadevan and Mathioudakis, 2021; Mehta et al., 2022; Zhang et al., 2022c; Wang et al., 2023a; Chien et al., 2023a; Lin et al., 2023) (some can offer the rigorous (\epsilon,\delta)-certification (Guo et al., 2020; Sekhari et al., 2021; Suriyakumar and Wilson, 2022; Chien et al., 2023b) inspired by differential privacy). In addition, recent studies also point out the importance of understanding the relationship between the generalization performance and the amount of deleted samples (Sekhari et al., 2021; Suriyakumar and Wilson, 2022). In particular, they introduce the definition of deletion capacity to formally quantify the number of samples that can be deleted for the after-unlearning model to maintain a designated population risk. However, most existing works so far have focused on machine unlearning for standard statistical learning models with one variable, which leaves it unknown how to design a machine unlearning method to meet all the desiderata above.

Machine unlearning for minimax models becomes a pressing problem because the trained minimax models also have a heavy reliance on the training data, while the users contributing data are granted the right to be forgotten. In this paper, we study the machine unlearning problem for minimax models under the (\epsilon,\delta)-certified machine unlearning framework. We collect in Table 1 the results in this paper and comparisons with baseline methods that are adapted from previous papers to (\epsilon,\delta)-certified machine unlearning.

Our main contributions can be summarized as follows.

  • Certified minimax unlearning algorithm: We develop an (\epsilon,\delta)-certified minimax unlearning algorithm under the setting of the strongly-convex-strongly-concave loss function. To sufficiently remove the data influence, the algorithm introduces the total Hessian, consisting of both the direct Hessian and the indirect Hessian, where the latter is crucial to account for the inter-dependence between the primal and dual variables in minimax models. This leads to the complete Newton-based minimax unlearning update. Subsequently, we introduce the Gaussian mechanism from DP to achieve the (\epsilon,\delta)-minimax unlearning certification, which requires careful analysis of the closeness between the complete Newton updated variables and the retraining-from-scratch variables.

  • Generalization: We provide generalization results for our certified minimax unlearning algorithm in terms of the population weak and strong primal-dual risk, which is a common generalization measure for minimax models.

  • Deletion capacity: We establish the deletion capacity result, which guarantees that our unlearning algorithm can retain the generalization rates for up to \mathcal{O}(n/d^{1/4}) deleted samples. It matches the state-of-the-art result under the standard statistical unlearning setting, which can be regarded as a special case of our minimax setting.

  • Extension to more general losses: We extend the certified minimax unlearning to more general loss functions, including convex-concave, strongly-convex-concave, and convex-strongly-concave losses, and provide the corresponding (\epsilon,\delta)-certification, population primal-dual risk, and deletion capacity results.

  • Extension with better efficiency: We develop a more computationally efficient extension, which can also support successive and online deletion requests. It saves the re-computation of the total Hessian matrix during the unlearning phase, where the minimax unlearning update can be regarded as a total Hessian-based infinitesimal jackknife. It also comes with a slightly smaller population primal-dual risk, though the overall rates of the risk and deletion capacity remain the same.

2 Related work

Machine unlearning receives increasing research attention in recent years, mainly due to the growing concerns about the privacy of user data that are utilized for machine learning model training. Since the earliest work by Cao and Yang (2015), a variety of machine unlearning methods have been proposed, which can be roughly divided into two categories: exact unlearning and approximate unlearning.

Exact machine unlearning. Methods for exact machine unlearning aim to produce models that perform identically to the models retrained from scratch. Some exact unlearning methods are designed for specific machine learning models like k-means clustering (Ginart et al., 2019) and random forests (Brophy and Lowd, 2021). SISA (Bourtoule et al., 2021) proposes a general exact unlearning framework based on sharding and slicing the training data into multiple non-overlapping shards and training independently on each shard. During unlearning, SISA retrains only on the shards containing the data to be removed. GraphEraser (Chen et al., 2022b) and RecEraser (Chen et al., 2022a) further extend SISA to unlearning for graph neural networks and recommendation systems, respectively.

Approximate machine unlearning. Approximate machine unlearning methods propose to make a tradeoff between the exactness in data removal and computational/memory efficiency. Prior works propose diverse ways to update the model parameter and offer different types of unlearning certification. When it comes to the unlearning update, many existing works consider the Newton update-related unlearning step where the Hessian matrix of the loss function plays a key role (Guo et al., 2020; Golatkar et al., 2020a; Peste et al., 2021; Sekhari et al., 2021; Golatkar et al., 2021; Mahadevan and Mathioudakis, 2021; Suriyakumar and Wilson, 2022; Mehta et al., 2022; Chien et al., 2023b). This unlearning update is motivated by influence functions (Koh and Liang, 2017). To alleviate the expensive computation and inversion of the Hessian, Golatkar et al. (2020a) and Peste et al. (2021) approximate it with the Fisher Information Matrix. Mehta et al. (2022) provide a variant of the conditional independence coefficient to select sufficient sets for unlearning, avoiding the need to invert the entire Hessian matrix. ML-forgetting (Golatkar et al., 2021) trains, by standard training, a set of weights on the core dataset that would not change afterwards, together with a set of linear weights on the user dataset containing the data to be forgotten; an optimization problem is then used to approximate the forgetting Newton update. Suriyakumar and Wilson (2022) leverage the proximal infinitesimal jackknife as the unlearning step so that it can be applied to nonsmooth loss functions. In addition, they achieve better computational efficiency and are capable of dealing with online delete requests. There are also many other designs achieving different degrees of speedup (Wu et al., 2020; Nguyen et al., 2020; Neel et al., 2021; Zhang et al., 2022c).

Apart from the various designs for the unlearning update, there are also different definitions of certified machine unlearning. Early works like Guo et al. (2020) introduce a certified data-removal mechanism that adds random perturbations to the loss function at training time. Golatkar et al. (2020a) introduce an information-theoretic certified unlearning notion and also add random noise to ensure the certification, which is specific to the Fisher Information Matrix and not general enough. More recently, Sekhari et al. (2021) propose the (\epsilon,\delta)-certified machine unlearning definition that does not require introducing additional randomization during training. More essentially, Sekhari et al. (2021) point out the importance of providing the generalization performance after machine unlearning. Sekhari et al. (2021); Suriyakumar and Wilson (2022) establish the generalization result in terms of the population risk and derive the deletion capacity guarantee.

However, most existing works only consider machine unlearning for STL models that minimize over a single variable. None of the prior works provide certified machine unlearning for minimax models, for which the generalization and deletion capacity guarantees are still unknown.

3 Preliminaries and Baseline Solution

3.1 Minimax Learning

The goal of minimax learning is to optimize the population loss F(\bm{\omega},\bm{\nu}), given by

\min_{\bm{\omega}\in\mathcal{W}}\max_{\bm{\nu}\in\mathcal{V}}F(\bm{\omega},\bm{\nu}):=\mathbb{E}_{z\sim\mathcal{D}}[f(\bm{\omega},\bm{\nu};z)], (1)

where f:\mathcal{W}\times\mathcal{V}\times\mathcal{Z}\to\mathbb{R} is the loss function, z\in\mathcal{Z} is a data instance from the distribution \mathcal{D}, and \mathcal{W}\subseteq\mathbb{R}^{d_{1}} and \mathcal{V}\subseteq\mathbb{R}^{d_{2}} are closed convex domains for the primal and dual variables, respectively. Since the data distribution \mathcal{D} is unknown in practice, minimax learning turns to optimizing the empirical loss F_{S}(\bm{\omega},\bm{\nu}), given by

\min_{\bm{\omega}\in\mathcal{W}}\max_{\bm{\nu}\in\mathcal{V}}F_{S}(\bm{\omega},\bm{\nu}):=\frac{1}{n}\sum^{n}_{i=1}f(\bm{\omega},\bm{\nu};z_{i}), (2)

where S=\{z_{1},\cdots,z_{n}\} is the training dataset with z_{i}\sim\mathcal{D}.
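As a concrete point of reference (our own toy illustration, not part of the paper), the sketch below solves the empirical problem in eq.(2) in closed form for an assumed strongly-convex-strongly-concave quadratic loss; the loss form, the coupling matrix B, and all constants are illustrative assumptions.

```python
# A minimal numpy sketch: the empirical saddle-point problem of eq.(2) for the
# assumed toy loss f(w, v; z) = (mu/2)||w - z||^2 + w^T B v - (mu/2)||v - z||^2,
# which is mu-strongly convex in w and mu-strongly concave in v.
import numpy as np

rng = np.random.default_rng(0)
d, n, mu = 5, 1000, 1.0
B = 0.3 * rng.standard_normal((d, d))      # bilinear coupling (illustrative assumption)
S = rng.standard_normal((n, d))            # training set S = {z_1, ..., z_n}

def empirical_saddle_point(S):
    """Return (w*, v*), the saddle point of F_S for the toy quadratic loss."""
    z_bar = S.mean(axis=0)
    # First-order optimality: mu*(w - z_bar) + B v = 0 and B^T w - mu*(v - z_bar) = 0.
    K = np.block([[mu * np.eye(d), B], [B.T, -mu * np.eye(d)]])
    rhs = np.concatenate([mu * z_bar, -mu * z_bar])
    sol = np.linalg.solve(K, rhs)
    return sol[:d], sol[d:]

w_star, v_star = empirical_saddle_point(S)
print(w_star.round(3), v_star.round(3))
```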

We will consider L-Lipschitz, \ell-smooth, and \mu_{\bm{\omega}}-strongly-convex-\mu_{\bm{\nu}}-strongly-concave loss functions, which are described in Assumptions 1 and 2 below; more details can be found in Appendix A.

Assumption 1.

For any z\in\mathcal{Z}, the function f(\bm{\omega},\bm{\nu};z) is L-Lipschitz with \ell-Lipschitz gradients and \rho-Lipschitz Hessians on the closed convex domain \mathcal{W}\times\mathcal{V}. Moreover, f(\bm{\omega},\bm{\nu};z) is convex on \mathcal{W} for any \bm{\nu}\in\mathcal{V} and concave on \mathcal{V} for any \bm{\omega}\in\mathcal{W}.

Assumption 2.

For any z\in\mathcal{Z}, the function f(\bm{\omega},\bm{\nu};z) satisfies Assumption 1, and f(\bm{\omega},\bm{\nu};z) is \mu_{\bm{\omega}}-strongly convex on \mathcal{W} for any \bm{\nu}\in\mathcal{V} and \mu_{\bm{\nu}}-strongly concave on \mathcal{V} for any \bm{\omega}\in\mathcal{W}.

Denote a randomized minimax learning algorithm by A:\mathcal{Z}^{n}\rightarrow\mathcal{W}\times\mathcal{V} and its trained variables by A(S)=(A_{\bm{\omega}}(S),A_{\bm{\nu}}(S))\in\mathcal{W}\times\mathcal{V}. The generalization performance of the trained model variables (A_{\bm{\omega}}(S),A_{\bm{\nu}}(S)) is a top concern (Thekumparampil et al., 2019; Zhang et al., 2020; Lei et al., 2021; Farnia and Ozdaglar, 2021; Zhang et al., 2021, 2022b; Ozdaglar et al., 2022), and it can be measured by the population weak primal-dual (PD) risk or the population strong PD risk, as formalized below.

Definition 1 (Population Primal-Dual Risk).

The population weak PD risk of A(S), \triangle^{w}(A_{\bm{\omega}}(S),A_{\bm{\nu}}(S)), and the population strong PD risk of A(S), \triangle^{s}(A_{\bm{\omega}}(S),A_{\bm{\nu}}(S)), are defined as

\left\{\begin{aligned} &\triangle^{w}(A_{\bm{\omega}}(S),A_{\bm{\nu}}(S))=\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}[F(A_{\bm{\omega}}(S),\bm{\nu})]-\min_{\bm{\omega}\in\mathcal{W}}\mathbb{E}[F(\bm{\omega},A_{\bm{\nu}}(S))],\\ &\triangle^{s}(A_{\bm{\omega}}(S),A_{\bm{\nu}}(S))=\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}F(A_{\bm{\omega}}(S),\bm{\nu})-\min_{\bm{\omega}\in\mathcal{W}}F(\bm{\omega},A_{\bm{\nu}}(S))].\end{aligned}\right. (3)
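For intuition (again a toy illustration of ours, not from the paper), the gap inside Definition 1 can be evaluated in closed form when the population loss is an assumed quadratic F(\bm{\omega},\bm{\nu})=(\mu/2)\|\bm{\omega}-\bar{z}\|^{2}+\bm{\omega}^{\top}B\bm{\nu}-(\mu/2)\|\bm{\nu}-\bar{z}\|^{2} with known mean \bar{z}. For a single deterministic output the weak and strong forms coincide; they differ only in where the expectation over S and the algorithm's randomness is taken.

```python
# A toy, closed-form evaluation of the primal-dual gap in Definition 1 for the
# assumed quadratic population loss; B, z_pop, and mu are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
d, mu = 5, 1.0
B = 0.3 * rng.standard_normal((d, d))
z_pop = rng.standard_normal(d)             # known population mean (toy setting)

def F(w, v):
    return 0.5 * mu * np.sum((w - z_pop) ** 2) + w @ B @ v \
           - 0.5 * mu * np.sum((v - z_pop) ** 2)

def pd_gap(w0, v0):
    """max_v F(w0, v) - min_w F(w, v0), via the closed-form best responses."""
    v_best = z_pop + B.T @ w0 / mu         # argmax_v of F(w0, .)
    w_best = z_pop - B @ v0 / mu           # argmin_w of F(., v0)
    return F(w0, v_best) - F(w_best, v0)

w0, v0 = rng.standard_normal(d), rng.standard_normal(d)
print(pd_gap(w0, v0))                      # nonnegative; zero exactly at the saddle point
```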

Notations. We introduce the following notations that will be used in the sequel. For a twice differentiable function f with the arguments \bm{\omega}\in\mathcal{W} and \bm{\nu}\in\mathcal{V}, we use \nabla_{\bm{\omega}}f and \nabla_{\bm{\nu}}f to denote the direct gradient of f w.r.t. \bm{\omega} and \bm{\nu}, respectively, and denote its Jacobian matrix as \nabla f=[\nabla_{\bm{\omega}}f;\nabla_{\bm{\nu}}f]. We use \partial_{\bm{\omega}\bm{\omega}}f, \partial_{\bm{\omega}\bm{\nu}}f, \partial_{\bm{\nu}\bm{\omega}}f, \partial_{\bm{\nu}\bm{\nu}}f to denote the second-order partial derivatives w.r.t. \bm{\omega} and \bm{\nu}, correspondingly, and denote its Hessian matrix as \nabla^{2}f=[\partial_{\bm{\omega}\bm{\omega}}f,\partial_{\bm{\omega}\bm{\nu}}f;\partial_{\bm{\nu}\bm{\omega}}f,\partial_{\bm{\nu}\bm{\nu}}f]. We define the total Hessians of the function f w.r.t. \bm{\omega} and \bm{\nu} as \mathtt{D}_{\bm{\omega}\bm{\omega}}f:=\partial_{\bm{\omega}\bm{\omega}}f-\partial_{\bm{\omega}\bm{\nu}}f\cdot\partial_{\bm{\nu}\bm{\nu}}^{-1}f\cdot\partial_{\bm{\nu}\bm{\omega}}f and \mathtt{D}_{\bm{\nu}\bm{\nu}}f:=\partial_{\bm{\nu}\bm{\nu}}f-\partial_{\bm{\nu}\bm{\omega}}f\cdot\partial_{\bm{\omega}\bm{\omega}}^{-1}f\cdot\partial_{\bm{\omega}\bm{\nu}}f, where \partial_{\bm{\nu}\bm{\nu}}^{-1}f and \partial_{\bm{\omega}\bm{\omega}}^{-1}f are shorthand for (\partial_{\bm{\nu}\bm{\nu}}f(\cdot))^{-1} and (\partial_{\bm{\omega}\bm{\omega}}f(\cdot))^{-1}, respectively, when \partial_{\bm{\nu}\bm{\nu}}f and \partial_{\bm{\omega}\bm{\omega}}f are invertible. We also use the shorthand notation \nabla_{\bm{\omega}}f(\bm{\omega}_{1},\bm{\nu};z)=\left.\nabla_{\bm{\omega}}f(\bm{\omega},\bm{\nu};z)\right|_{\bm{\omega}=\bm{\omega}_{1}}.
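The total Hessians \mathtt{D}_{\bm{\omega}\bm{\omega}}f and \mathtt{D}_{\bm{\nu}\bm{\nu}}f are plain Schur complements of the blocks of \nabla^{2}f. A minimal numpy sketch (assuming the four dense blocks are available; the toy quadratic loss used above is only an illustration) is:

```python
# A minimal sketch of the total Hessians defined above, assuming the four
# second-order blocks of the per-sample (or empirical) loss are given as dense arrays.
import numpy as np

def total_hessians(H_ww, H_wv, H_vw, H_vv):
    """D_ww = H_ww - H_wv H_vv^{-1} H_vw and D_vv = H_vv - H_vw H_ww^{-1} H_wv."""
    D_ww = H_ww - H_wv @ np.linalg.solve(H_vv, H_vw)
    D_vv = H_vv - H_vw @ np.linalg.solve(H_ww, H_wv)
    return D_ww, D_vv

# For the toy quadratic loss f = (mu/2)||w-z||^2 + w^T B v - (mu/2)||v-z||^2, the
# blocks are constant: H_ww = mu*I, H_wv = B, H_vw = B^T, H_vv = -mu*I.
d, mu = 5, 1.0
B = 0.3 * np.random.default_rng(0).standard_normal((d, d))
D_ww, D_vv = total_hessians(mu * np.eye(d), B, B.T, -mu * np.eye(d))
```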

3.2 (\epsilon,\delta)-Certified Machine Unlearning

An unlearning algorithm \bar{A} for minimax models receives the output of a minimax learning algorithm A(S), the set of delete requests U\subseteq S, and some additional memory variables T(S)\in\mathcal{T} as input and returns an updated model (\bm{\omega}^{u},\bm{\nu}^{u})=(\bar{A}_{\bm{\omega}}(U,A(S),T(S)),\bar{A}_{\bm{\nu}}(U,A(S),T(S)))\in\mathcal{W}\times\mathcal{V}, aiming to remove the influence of U. The memory variables T(S) do not contain the entire training set; instead, their size |T(S)| is independent of the training data size n. The mapping of an unlearning algorithm can be formulated as \bar{A}:\mathcal{Z}^{m}\times\mathcal{W}\times\mathcal{V}\times\mathcal{T}\rightarrow\mathcal{W}\times\mathcal{V}. We now give the notion of (\epsilon,\delta)-certified unlearning introduced by Sekhari et al. (2021), which is inspired by the definition of differential privacy (Dwork et al., 2006).

Definition 2 ((\epsilon,\delta)-Certified Unlearning (Sekhari et al., 2021)).

Let \varTheta be the domain of \mathcal{W}\times\mathcal{V}. For all S of size n and sets of delete requests U\subseteq S such that |U|\leq m, the pair of learning algorithm A and unlearning algorithm \bar{A} is (\epsilon,\delta)-certified unlearning if, \forall O\subseteq\varTheta and \epsilon,\delta>0, the following two conditions are satisfied:

\Pr[\bar{A}(U,A(S),T(S))\in O]\leq e^{\epsilon}\cdot\Pr[\bar{A}(\emptyset,A(S\backslash U),T(S\backslash U))\in O]+\delta, (4)
\Pr[\bar{A}(\emptyset,A(S\backslash U),T(S\backslash U))\in O]\leq e^{\epsilon}\cdot\Pr[\bar{A}(U,A(S),T(S))\in O]+\delta, (5)

where \emptyset denotes the empty set and T(S) denotes the memory variables available to \bar{A}.

The above definition ensures the indistinguishability between the output distribution of (i) the model trained on the set S and then unlearned with delete requests U and (ii) the model trained on the set S\backslash U and then unlearned with an empty set. Specifically, the unlearning algorithm simply adds perturbations to the output of A(S\backslash U) when the set of delete requests is empty.

Deletion Capacity. Under the definition of certified unlearning, Sekhari et al. (2021) introduce the definition of deletion capacity, which formalizes how many samples can be deleted while still maintaining good guarantees on test loss. Here, we utilize the population primal-dual risk defined in Definition 1 instead of the excess population risk utilized for STL models.

Definition 3 (Deletion capacity, (Sekhari et al., 2021)).

Let \epsilon,\delta,\gamma>0 and let S be a dataset of size n drawn i.i.d. from the data distribution \mathcal{D}. Let F(\bm{\omega},\bm{\nu}) be a minimax model and U be the set of deletion requests. For a pair of minimax learning algorithm A and minimax unlearning algorithm \bar{A} that satisfies (\epsilon,\delta)-unlearning, the deletion capacity m^{A,\bar{A}}_{\epsilon,\delta,\gamma}(d_{1},d_{2},n) is defined as the maximum number of samples m that can be unlearned while still ensuring that the population primal-dual (weak PD or strong PD) risk is at most \gamma. The expectation \mathbb{E}[\cdot] is taken over S\sim\mathcal{D}^{n} and the outputs of the algorithms A and \bar{A}. Let d_{1} denote the dimension of domain \mathcal{W} and d_{2} denote the dimension of domain \mathcal{V}. Specifically,

m^{A,\bar{A}}_{\epsilon,\delta,\gamma}(d_{1},d_{2},n):=\max\left\{m|\triangle\left(\bar{A}_{\bm{\omega}}(U,A(S),T(S)),\bar{A}_{\bm{\nu}}(U,A(S),T(S))\right)\leq\gamma\right\}, (6)

where the outputs \bar{A}_{\bm{\omega}}(U,A(S),T(S)) and \bar{A}_{\bm{\nu}}(U,A(S),T(S)) of the minimax unlearning algorithm \bar{A} refer to the parameters \bm{\omega} and \bm{\nu}, respectively, and \triangle\left(\bar{A}_{\bm{\omega}}(U,A(S),T(S)),\bar{A}_{\bm{\nu}}(U,A(S),T(S))\right) can be the population weak PD risk or the population strong PD risk of \bar{A}(U,A(S),T(S)).

We set γ=0.01\gamma=0.01 (or any other small arbitrary constant) throughout the paper.

3.3 Baseline Solution: Certified Minimax Unlearning via Differential Privacy

Since Definition 2 is motivated by differential privacy (DP), it is natural to use tools from DP for machine unlearning. For a differentially private learning algorithm A with edit distance m between neighboring datasets, the unlearning algorithm \bar{A} simply returns its output A(S) without any changes and is independent of the delete requests U as well as the memory variables T(S), i.e., \bar{A}(U,A(S),T(S))=A(S).

A number of differentially private minimax learning algorithms can be applied, e.g., Zhang et al. (2022a); Yang et al. (2022); Bassily et al. (2023). For instance, we can obtain the output A(S)=(A_{\bm{\omega}}(S),A_{\bm{\nu}}(S)) by calling Algorithm 3 in Zhang et al. (2022a). Under Assumptions 1 and 2, we then get the population strong PD risk based on (Zhang et al., 2022a, Theorem 4.3) and the group privacy property of DP (Vadhan, 2017, Lemma 7.2.2), as follows,

\triangle^{s}(A_{\bm{\omega}}(S),A_{\bm{\nu}}(S))=\mathcal{O}\left(\frac{\kappa^{2}}{\mu n}+\frac{m^{2}\kappa^{2}d\log(me^{\epsilon}/\delta)}{\mu n^{2}\epsilon^{2}}\right), (7)

where we let \mu=\min\{\mu_{\bm{\omega}},\mu_{\bm{\nu}}\}, \kappa=\ell/\mu, d=\max\{d_{1},d_{2}\}, and m be the edit distance between datasets (i.e., between the original dataset and the remaining dataset after removing the samples to be forgotten).

The algorithm A satisfies (\epsilon,\delta)-DP for any set U\subseteq S of size m, that is,

\Pr[A(S)\in O]\leq e^{\epsilon}\Pr[A(S\backslash U)\in O]+\delta\quad\text{and}\quad\Pr[A(S\backslash U)\in O]\leq e^{\epsilon}\Pr[A(S)\in O]+\delta.

Since we have A(S)=\bar{A}(U,A(S),T(S)) and A(S\backslash U)=\bar{A}(\emptyset,A(S\backslash U),T(S\backslash U)), the above privacy guarantee can be converted to the minimax unlearning guarantee in Definition 2, implying that the pair (A,\bar{A}) is (\epsilon,\delta)-certified minimax unlearning. According to Definition 3, the population strong PD risk in eq.(7) yields the following bound on deletion capacity.

Theorem 1 (Deletion capacity of unlearning via DP (Sekhari et al., 2021)).

Denote d=\max\{d_{1},d_{2}\}. There exists a polynomial-time learning algorithm A and unlearning algorithm \bar{A} for minimax problems of the form \bar{A}(U,A(S),T(S))=A(S) such that the deletion capacity is:

m^{A,\bar{A}}_{\epsilon,\delta,\gamma}(d_{1},d_{2},n)\geq\widetilde{\Omega}\left(\frac{n\epsilon}{\sqrt{d\log(e^{\epsilon}/\delta)}}\right), (8)

where the constant in the \widetilde{\Omega}-notation depends on the properties of the loss function f (e.g., the strong convexity and strong concavity parameters, and the Lipschitz continuity and smoothness parameters).

However, this DP minimax learning baseline approach provides an inferior deletion capacity. In the following sections, we show that the d^{1/2} in the denominator of eq.(8) can be further reduced to d^{1/4}.

4 Certified Minimax Unlearning

In this section, we focus on the setting of the strongly-convex-strongly-concave loss function. We first provide the intuition for the design of the minimax unlearning step in Sec.4.1, then provide the formal algorithm in Sec.4.2 and a more efficient extension in Sec.4.3 with analysis of minimax unlearning certification, generalization result, and deletion capacity in Sec.4.4. We will provide extensions to more general loss settings in Sec.5. The proofs for the theorems presented in this and the next sections can be found in Appendix B and C, respectively.

4.1 Intuition for Minimax Unlearning Update

To begin with, we provide an informal derivation of the minimax unlearning update to illustrate its design intuition. Given the training set S of size n and the deletion subset U\subseteq S of size m, the aim is to approximate the optimal solution (\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}) of the loss F_{S^{\setminus}}(\bm{\omega},\bm{\nu}) on the remaining dataset S\setminus U, given by

(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}):=\arg\min_{\bm{\omega}\in\mathcal{W}}\max_{\bm{\nu}\in\mathcal{V}}\{F_{S^{\setminus}}(\bm{\omega},\bm{\nu}):=\frac{1}{n-m}\sum_{z_{i}\in S\setminus U}f(\bm{\omega},\bm{\nu};z_{i})\}. (9)

Meanwhile, we have the optimal solution (\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) to the original loss F_{S}(\bm{\omega},\bm{\nu}) after minimax learning. Taking unlearning \bm{\omega} for instance, by using a first-order Taylor expansion of \nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*})=0 around (\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}), we have

\nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})+\partial_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*})+\partial_{\bm{\omega}\bm{\nu}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})(\bm{\nu}_{S^{\setminus}}^{*}-\bm{\nu}_{S}^{*})\approx 0. (10)

Since \bm{\omega}_{S}^{*} is a minimizer of F_{S}(\bm{\omega},\bm{\nu}), from the first-order optimality condition, we can get \nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})=-\frac{1}{n-m}\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i}). Now given an auxiliary function {\tt V}_{S^{\setminus}}(\bm{\omega})=\mathop{\rm argmax}_{\bm{\nu}\in\mathcal{V}}F_{S^{\setminus}}(\bm{\omega},\bm{\nu}) (more best-response auxiliary functions are introduced in Appendix A, Definition 8), we have \bm{\nu}_{S^{\setminus}}^{*}={\tt V}_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*}). We further get

\bm{\nu}_{S^{\setminus}}^{*}-\bm{\nu}_{S}^{*}=[{\tt V}_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*})-{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*})]+[{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*})-\bm{\nu}_{S}^{*}] (11)
\stackrel{(i)}{\approx}{\tt V}_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*})-{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*})\stackrel{(ii)}{\approx}\left(\frac{\mathrm{d}{\tt V}_{S^{\setminus}}(\bm{\omega}^{*}_{S})}{\mathrm{d}\bm{\omega}}\Big{|}_{\bm{\omega}=\bm{\omega}^{*}_{S}}\right)(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*})
\stackrel{(iii)}{\approx}-\partial_{\bm{\nu}\bm{\nu}}^{-1}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\cdot\partial_{\bm{\nu}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\cdot(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*}),

where approximation (i) leaves out the term [{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*})-\bm{\nu}_{S}^{*}], which is bounded in Appendix A, Lemma 2, and does not affect the overall unlearning guarantee; approximation (ii) is the linear approximation step and uses the response Jacobian of the auxiliary function {\tt V}_{S^{\setminus}}(\bm{\omega}); and approximation (iii) is due to the implicit function theorem. This gives that

\partial_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*})+\partial_{\bm{\omega}\bm{\nu}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})(\bm{\nu}_{S^{\setminus}}^{*}-\bm{\nu}_{S}^{*})={\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*}), (12)

which implies the following approximation of \bm{\omega}_{S^{\setminus}}^{*}:

\bm{\omega}_{S^{\setminus}}^{*}\approx\bm{\omega}_{S}^{*}+\frac{1}{n-m}[{\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i}). (13)

The above informal derivation indicates that the minimax unlearning update relies on the total Hessian to sufficiently remove the data influence (Liu et al., 2023; Zhang et al., 2023), rather than the conventional Hessian that appears in standard statistical unlearning (Guo et al., 2020; Sekhari et al., 2021; Suriyakumar and Wilson, 2022; Mehta et al., 2022). The update in eq.(13) has a close relation to the complete Newton step in the second-order minimax optimization literature (Zhang et al., 2020), which motivates the complete Newton-based minimax unlearning. However, due to the various approximations in the above informal derivation, we cannot yet obtain a certified minimax unlearning guarantee. Below, we will formally derive the upper bound for these approximations in the closeness upper bound analysis. Based on the closeness upper bound, we will introduce the Gaussian mechanism to yield a distributional indistinguishability result in the sense of (\epsilon,\delta)-certified minimax unlearning.

4.2 Proposed Certified Minimax Unlearning

We first provide algorithms under the setting of the smooth and strongly-convex-strongly-concave (SC-SC) loss function as described in Assumptions 1&2.

Minimax Learning Algorithm.

We denote our learning algorithm by A_{sc-sc} and the pseudocode is shown in Algorithm 1. Given a dataset S=\{z_{i}\}^{n}_{i=1} of size n drawn independently from some distribution \mathcal{D}, algorithm A_{sc-sc} computes the optimal solution (\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) to the empirical risk F_{S}(\bm{\omega},\bm{\nu}). A_{sc-sc} then outputs the point (\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) as well as the additional memory variables T(S):=\{\mathtt{D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}),\mathtt{D}_{\bm{\nu}\bm{\nu}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\}, i.e., the total Hessians of F_{S}(\bm{\omega},\bm{\nu}) computed and stored at (\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}).

Minimax Unlearning Algorithm. We denote the proposed certified minimax unlearning algorithm by \bar{A}_{sc-sc} and present its pseudocode in Algorithm 2. Algorithm \bar{A}_{sc-sc} takes the following inputs: the set of delete requests U=\{z_{j}\}^{m}_{j=1} of size m, the trained minimax model (\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}), and the memory variables T(S). To have the certified minimax unlearning for \bm{\omega}, eq.(15) computes the total Hessian of F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) by \frac{n}{n-m}{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})-\frac{1}{n-m}\sum_{z_{i}\in U}{\tt D}_{\bm{\omega}\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*},z_{i}), where the former term can be retrieved from the memory set and the latter is computed on the samples to be deleted; eq.(17) computes the intermediate \widehat{\bm{\omega}} by the complete Newton step based on the total Hessian {\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}); Line 3 injects calibrated Gaussian noise \bm{\xi}_{1} to ensure (\epsilon,\delta)-certified minimax unlearning. The certified minimax unlearning for \bm{\nu} is symmetric. We provide detailed analysis for Algorithm 2, including minimax unlearning certification, generalization results, and deletion capacity, in Appendix B.1.

Algorithm 1 Minimax Learning Algorithm (A_{sc-sc})
Input: Dataset S: \{z_{i}\}^{n}_{i=1}\sim\mathcal{D}^{n}, loss function: f(\bm{\omega},\bm{\nu};z).
1:  Compute
(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\leftarrow\arg\min_{\bm{\omega}}\max_{\bm{\nu}}F_{S}(\bm{\omega},\bm{\nu}):=\frac{1}{n}\sum^{n}_{i=1}f(\bm{\omega},\bm{\nu};z_{i}). (14)
Output: (\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*},\mathtt{D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}),\mathtt{D}_{\bm{\nu}\bm{\nu}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}))
Algorithm 2 Certified Minimax Unlearning for Strongly-Convex-Strongly-Concave Loss (\bar{A}_{sc-sc})
Input: Delete requests U: \{z_{j}\}^{m}_{j=1}\subseteq S, output of A_{sc-sc}(S): (\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}), loss function: f(\bm{\omega},\bm{\nu};z), memory variables T(S): \{\mathtt{D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}),\mathtt{D}_{\bm{\nu}\bm{\nu}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\}, noise parameters: \sigma_{1}, \sigma_{2}.
1:  Compute
{\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})=\frac{1}{n-m}\left(n\mathtt{D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})-\sum_{z_{i}\in U}\mathtt{D}_{\bm{\omega}\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\right), (15)
{\tt D}_{\bm{\nu}\bm{\nu}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})=\frac{1}{n-m}\left(n\mathtt{D}_{\bm{\nu}\bm{\nu}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})-\sum_{z_{i}\in U}\mathtt{D}_{\bm{\nu}\bm{\nu}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\right). (16)
2:  Define
\widehat{\bm{\omega}}=\bm{\omega}_{S}^{*}+\frac{1}{n-m}[{\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i}), (17)
\widehat{\bm{\nu}}=\bm{\nu}_{S}^{*}+\frac{1}{n-m}[{\tt D}_{\bm{\nu}\bm{\nu}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\sum_{z_{i}\in U}\nabla_{\bm{\nu}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i}). (18)
3:  \bm{\omega}^{u}=\widehat{\bm{\omega}}+\bm{\xi}_{1}, where \bm{\xi}_{1}\sim\mathcal{N}(0,\sigma^{2}_{1}\mathbf{I}_{d_{1}}), and \bm{\nu}^{u}=\widehat{\bm{\nu}}+\bm{\xi}_{2}, where \bm{\xi}_{2}\sim\mathcal{N}(0,\sigma^{2}_{2}\mathbf{I}_{d_{2}}).
Output: (\bm{\omega}^{u},\bm{\nu}^{u}).
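A minimal numpy sketch of Algorithm 2 is given below for illustration. The per-sample gradient and total-Hessian oracles (grad_w, grad_v, dww, dvv) and the noise scales are caller-supplied assumptions; the sketch only mirrors eq.(15)-(18) and Line 3 rather than providing a full implementation.

```python
# A minimal sketch of Algorithm 2 (certified minimax unlearning, SC-SC case).
# grad_w/grad_v return per-sample gradients and dww/dvv return per-sample total
# Hessians D_ww f / D_vv f at (w_S, v_S); these oracles and sigma1/sigma2 are
# assumptions supplied by the caller, not fixed by the paper.
import numpy as np

def unlearn_scsc(w_S, v_S, U, n, Dww_S, Dvv_S, grad_w, grad_v, dww, dvv,
                 sigma1, sigma2, rng=np.random.default_rng(0)):
    m = len(U)
    # eq.(15)-(16): total Hessians of F_{S\U} at (w_S, v_S), using the memory T(S).
    Dww_rem = (n * Dww_S - sum(dww(w_S, v_S, z) for z in U)) / (n - m)
    Dvv_rem = (n * Dvv_S - sum(dvv(w_S, v_S, z) for z in U)) / (n - m)
    # eq.(17)-(18): complete Newton steps driven by the deleted samples' gradients.
    gw = sum(grad_w(w_S, v_S, z) for z in U)
    gv = sum(grad_v(w_S, v_S, z) for z in U)
    w_hat = w_S + np.linalg.solve(Dww_rem, gw) / (n - m)
    v_hat = v_S + np.linalg.solve(Dvv_rem, gv) / (n - m)
    # Line 3: Gaussian mechanism for the (epsilon, delta) certification.
    w_u = w_hat + sigma1 * rng.standard_normal(w_hat.shape)
    v_u = v_hat + sigma2 * rng.standard_normal(v_hat.shape)
    return w_u, v_u
```

Under the toy quadratic loss assumed earlier, dww would return the constant matrix mu*np.eye(d) + B @ B.T / mu and dvv would return -(mu*np.eye(d) + B.T @ B / mu) for every sample.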

4.3 Certified Minimax Unlearning without Total Hessian Re-computation

We extend Algorithm 2 and propose Algorithm 3 to reduce its computational cost. The complete Newton steps in eq.(19) and eq.(20) utilize the total Hessians {\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) and {\tt D}_{\bm{\nu}\bm{\nu}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) that are directly retrieved from the memory, rather than the updated total Hessians {\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) and {\tt D}_{\bm{\nu}\bm{\nu}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) used in Algorithm 2. The form in eq.(19) and eq.(20) can also be regarded as the total Hessian extension of the infinitesimal jackknife. In this way, it gets rid of the computationally demanding part of re-evaluating the total Hessian for the samples to be deleted, which significantly reduces the computational cost. It has the same computational complexity as the state-of-the-art certified unlearning method developed for STL models (Suriyakumar and Wilson, 2022). Moreover, Algorithm 3 can be more appealing for the successive data deletion setting.

Algorithm 3 Efficient Certified Minimax Unlearning (\bar{A}_{\tt efficient})
Input: Delete requests U: \{z_{j}\}^{m}_{j=1}\subseteq S, output of A_{sc-sc}(S): (\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}), loss function: f(\bm{\omega},\bm{\nu};z), memory variables T(S): \{\mathtt{D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}),\mathtt{D}_{\bm{\nu}\bm{\nu}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\}, noise parameters: \sigma_{1}, \sigma_{2}.
1:  Compute
\widetilde{\bm{\omega}}=\bm{\omega}_{S}^{*}+\frac{1}{n}[{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i}), (19)
\widetilde{\bm{\nu}}=\bm{\nu}_{S}^{*}+\frac{1}{n}[{\tt D}_{\bm{\nu}\bm{\nu}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\sum_{z_{i}\in U}\nabla_{\bm{\nu}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i}). (20)
2:  \widetilde{\bm{\omega}}^{u}=\widetilde{\bm{\omega}}+\bm{\xi}_{1}, where \bm{\xi}_{1}\sim\mathcal{N}(0,\sigma^{2}_{1}\mathbf{I}_{d_{1}}), and \widetilde{\bm{\nu}}^{u}=\widetilde{\bm{\nu}}+\bm{\xi}_{2}, where \bm{\xi}_{2}\sim\mathcal{N}(0,\sigma^{2}_{2}\mathbf{I}_{d_{2}}).
Output: (\widetilde{\bm{\omega}}^{u},\widetilde{\bm{\nu}}^{u}).
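The corresponding sketch for Algorithm 3 (same assumed oracle interface as the Algorithm 2 sketch above) only swaps in the stored total Hessians of F_{S} and the 1/n scaling of eq.(19)-(20); a factorization of these fixed matrices could be cached and reused across successive deletion requests, which is what makes this variant cheaper.

```python
# A minimal sketch of Algorithm 3: eq.(19)-(20) followed by the Gaussian
# perturbation. The oracles and noise scales are caller-supplied assumptions.
import numpy as np

def unlearn_efficient(w_S, v_S, U, n, Dww_S, Dvv_S, grad_w, grad_v,
                      sigma1, sigma2, rng=np.random.default_rng(0)):
    gw = sum(grad_w(w_S, v_S, z) for z in U)
    gv = sum(grad_v(w_S, v_S, z) for z in U)
    w_tilde = w_S + np.linalg.solve(Dww_S, gw) / n      # eq.(19)
    v_tilde = v_S + np.linalg.solve(Dvv_S, gv) / n      # eq.(20)
    w_u = w_tilde + sigma1 * rng.standard_normal(w_tilde.shape)
    v_u = v_tilde + sigma2 * rng.standard_normal(v_tilde.shape)
    return w_u, v_u
```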

4.4 Analysis for Algorithm 3

(\epsilon,\delta)-Certified Unlearning Guarantee. The intermediate variables (\widetilde{\bm{\omega}},\widetilde{\bm{\nu}}) are distinguishable in distribution from the retraining-from-scratch variables (\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}) because they are deterministic and the Taylor expansion introduces a certain amount of approximation error. The following lemma quantifies the closeness between (\widetilde{\bm{\omega}},\widetilde{\bm{\nu}}) and (\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}), which can be regarded as the “sensitivity” when applying the Gaussian mechanism.

Lemma 1 (Closeness Upper Bound).

Suppose the loss function f satisfies Assumptions 1 and 2, \|{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\|\geq\mu_{\bm{\omega}\bm{\omega}}, and \|{\tt D}_{\bm{\nu}\bm{\nu}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\|\geq\mu_{\bm{\nu}\bm{\nu}}. Let \mu=\min\{\mu_{\bm{\omega}},\mu_{\bm{\nu}},\mu_{\bm{\omega}\bm{\omega}},\mu_{\bm{\nu}\bm{\nu}}\}. Then, we have the following closeness bound between (\widetilde{\bm{\omega}},\widetilde{\bm{\nu}}) in Line 1 of Algorithm 3 and (\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}) in eq.(9):

\{\|\bm{\omega}_{S^{\setminus}}^{*}-\widetilde{\bm{\omega}}\|,\|\bm{\nu}_{S^{\setminus}}^{*}-\widetilde{\bm{\nu}}\|\}\leq\frac{(8\sqrt{2}L^{2}\ell^{3}\rho/\mu^{6}+2\sqrt{2}L\ell^{2}/\mu^{3})m^{2}}{n^{2}}. (21)

Equipped with Lemma 1, we have the following certified unlearning guarantee by adding Gaussian noise calibrated according to the above closeness result. Due to the minimax structure, our analysis is more involved than the STL case (Sekhari et al., 2021; Suriyakumar and Wilson, 2022).

Theorem 2 ((\epsilon,\delta)-Minimax Unlearning Certification).

Under the same settings as Lemma 1, the pair of our minimax learning algorithm A_{sc-sc} and unlearning algorithm \bar{A}_{\tt efficient} is (\epsilon,\delta)-certified minimax unlearning if we choose

\sigma_{1}=\sigma_{2}=\frac{2(8\sqrt{2}L^{2}\ell^{3}\rho/\mu^{6}+2\sqrt{2}L\ell^{2}/\mu^{3})m^{2}}{n^{2}\epsilon}\sqrt{2\log(2.5/\delta)}. (22)
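For concreteness, a small helper that evaluates the closeness bound of eq.(21) and the corresponding noise scale of eq.(22) could look as follows (the constants L, \ell, \rho, \mu are treated as known inputs, which is an assumption in practice):

```python
# A sketch of the noise calibration in Theorem 2, built on the closeness bound
# of Lemma 1. All smoothness/strong-convexity constants are assumed known.
import math

def closeness_bound(L, ell, rho, mu, m, n):
    # eq.(21): (8*sqrt(2) L^2 ell^3 rho / mu^6 + 2*sqrt(2) L ell^2 / mu^3) * m^2 / n^2
    return (8 * math.sqrt(2) * L**2 * ell**3 * rho / mu**6
            + 2 * math.sqrt(2) * L * ell**2 / mu**3) * m**2 / n**2

def noise_scale(L, ell, rho, mu, m, n, eps, delta):
    # eq.(22): sigma_1 = sigma_2 = 2 * closeness * sqrt(2 log(2.5/delta)) / eps
    return 2 * closeness_bound(L, ell, rho, mu, m, n) \
           * math.sqrt(2 * math.log(2.5 / delta)) / eps

# e.g. noise_scale(L=1.0, ell=1.0, rho=1.0, mu=0.5, m=10, n=10_000, eps=1.0, delta=1e-5)
```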

Generalization Guarantee. Theorem 3 below provides the generalization result in terms of the population PD risk for the minimax unlearning algorithm \bar{A}_{\tt efficient}.

Theorem 3 (Population Primal-Dual Risk).

Under the same settings as Lemma 1 and with d=\max\{d_{1},d_{2}\}, the population weak and strong PD risks for the certified minimax unlearning variables (\widetilde{\bm{\omega}}^{u},\widetilde{\bm{\nu}}^{u}) returned by Algorithm 3 are

\left\{\begin{aligned} &\triangle^{w}(\widetilde{\bm{\omega}}^{u},\widetilde{\bm{\nu}}^{u})=\mathcal{O}\left((L^{3}\ell^{3}\rho/\mu^{6}+L^{2}\ell^{2}/\mu^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\mu n}\right),\\ &\triangle^{s}(\widetilde{\bm{\omega}}^{u},\widetilde{\bm{\nu}}^{u})=\mathcal{O}\left((L^{3}\ell^{3}\rho/\mu^{6}+L^{2}\ell^{2}/\mu^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\mu n}+\frac{L^{2}\ell}{\mu^{2}n}\right).\end{aligned}\right. (23)

Deletion Capacity. The population weak and strong PD risks given in Theorem 3 for the output of the unlearning algorithm provide the following bound on deletion capacity.

Theorem 4 (Deletion Capacity).

Under the same settings as Lemma 1 and with d=\max\{d_{1},d_{2}\}, the deletion capacity of Algorithm 3 is

m^{A,\bar{A}}_{\epsilon,\delta,\gamma}(d_{1},d_{2},n)\geq c\cdot\frac{n\sqrt{\epsilon}}{(d\log(1/\delta))^{1/4}}, (24)

where the constant c depends on the constants L, \ell, \rho, and \mu of the loss function f.
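To illustrate the gap with the DP baseline of Theorem 1, the following back-of-the-envelope comparison plugs the same (n, d, \epsilon, \delta) into eq.(24) and eq.(8) with all constants set to 1, which is an assumption since the true constants depend on the loss function.

```python
# Illustrative scaling comparison (constants set to 1 by assumption):
# the d^{1/4} deletion-capacity bound of Theorem 4, eq.(24), versus the
# d^{1/2} DP baseline of Theorem 1, eq.(8).
import math

def capacity_ours(n, d, eps, delta):
    return n * math.sqrt(eps) / (d * math.log(1 / delta)) ** 0.25     # eq.(24)

def capacity_dp_baseline(n, d, eps, delta):
    return n * eps / math.sqrt(d * math.log(math.exp(eps) / delta))   # eq.(8)

n, d, eps, delta = 100_000, 10_000, 1.0, 1e-5
print(capacity_ours(n, d, eps, delta), capacity_dp_baseline(n, d, eps, delta))
# With these numbers the d^{1/4} bound allows over an order of magnitude more deletions.
```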

5 Certified Minimax Unlearning for Convex-Concave Loss Function

We further extend the certified minimax unlearning to the convex-concave loss function. In addition, Appendix C will provide the extension to convex-strongly-concave and strongly-convex-concave loss functions. Given the convex-concave loss function f(\bm{\omega},\bm{\nu};z), similar to the unlearning for STL models (Sekhari et al., 2021), we define the regularized function as \widetilde{f}(\bm{\omega},\bm{\nu};z)=f(\bm{\omega},\bm{\nu};z)+\frac{\lambda}{2}\|\bm{\omega}\|^{2}-\frac{\lambda}{2}\|\bm{\nu}\|^{2}. Suppose the function f satisfies Assumption 1; then the function \widetilde{f} is \lambda-strongly convex in \bm{\omega}, \lambda-strongly concave in \bm{\nu}, (2L+\lambda\|\bm{\omega}\|+\lambda\|\bm{\nu}\|)-Lipschitz, \sqrt{2}(2\ell+\lambda)-gradient Lipschitz, and \rho-Hessian Lipschitz. It suffices to apply the minimax learning and unlearning algorithms in Sec.4 to the regularized loss function with a properly chosen \lambda. We denote the learning and unlearning algorithms for convex-concave losses as A_{c-c} and \bar{A}_{c-c}. Their implementation details are given in Appendix C. We suppose the SC-SC regularization parameter \lambda satisfies \lambda<\ell. Theorem 5 below summarizes the guarantees of (\epsilon,\delta)-certified unlearning and population primal-dual risk (weak and strong) for Algorithm \bar{A}_{c-c}.
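A minimal sketch of this reduction (with a caller-supplied base loss f, its gradient oracles, and a regularization weight lam, all of which are assumptions) simply wraps the loss and its gradients before handing them to the SC-SC algorithms of Sec.4:

```python
# A sketch of the convex-concave reduction: wrap the base loss f with the SC-SC
# regularizer f_tilde(w, v; z) = f(w, v; z) + (lam/2)||w||^2 - (lam/2)||v||^2,
# then reuse the learning/unlearning routines for the SC-SC case on f_tilde.
import numpy as np

def regularize_cc(f, lam):
    def f_tilde(w, v, z):
        return f(w, v, z) + 0.5 * lam * np.dot(w, w) - 0.5 * lam * np.dot(v, v)
    return f_tilde

def regularize_cc_grads(grad_w, grad_v, lam):
    # Gradients of the added terms: +lam*w for the primal, -lam*v for the dual.
    g_w = lambda w, v, z: grad_w(w, v, z) + lam * w
    g_v = lambda w, v, z: grad_v(w, v, z) - lam * v
    return g_w, g_v
```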

Theorem 5.

Let Assumption 1 hold and d=\max\{d_{1},d_{2}\}. Suppose the parameter spaces \mathcal{W} and \mathcal{V} are bounded so that \max_{\bm{\omega}\in\mathcal{W}}\|\bm{\omega}\|\leq B_{\bm{\omega}} and \max_{\bm{\nu}\in\mathcal{V}}\|\bm{\nu}\|\leq B_{\bm{\nu}}. We have:

  (a) (\epsilon,\delta)-Minimax Unlearning Certification: The pair of our minimax learning algorithm A_{c-c} and unlearning algorithm \bar{A}_{c-c} is (\epsilon,\delta)-certified minimax unlearning.

  (b) Population Weak PD Risk: The population weak PD risk for (\bm{\omega}^{u},\bm{\nu}^{u}) by algorithm \bar{A}_{c-c} is

    \triangle^{w}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathcal{O}\bigg{(}(L^{3}\ell^{3}\rho/\lambda^{6}+L^{2}\ell^{2}/\lambda^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\lambda n}+\lambda(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2})\bigg{)}. (25)

    In particular, by setting \lambda as

    \lambda=\max\bigg{\{}\frac{L}{\sqrt{B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2}}}\sqrt{\frac{m}{n}},\left(\frac{L^{2}\ell^{2}m^{2}\sqrt{d\log(1/\delta)}}{(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2})n^{2}\epsilon}\right)^{1/4},\left(\frac{L^{3}\ell^{3}\rho m^{2}\sqrt{d\log(1/\delta)}}{(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2})n^{2}\epsilon}\right)^{1/7}\bigg{\}}, (26)

    we have the following population weak PD risk,

    \triangle^{w}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathcal{O}\bigg{(}c_{1}\sqrt{\frac{m}{n}}+c_{2}\big{(}\frac{d\log(1/\delta)}{\epsilon^{2}}\big{)}^{1/8}\sqrt{\frac{m}{n}}+c_{3}\big{(}\frac{\sqrt{d\log(1/\delta)}}{\epsilon}\big{)}^{1/7}(\frac{m}{n})^{2/7}\bigg{)}, (27)

    where c_{1},c_{2},c_{3} are constants that depend only on L, \ell, \rho, B_{\bm{\omega}}, and B_{\bm{\nu}}.

  (c) Population Strong PD Risk: The population strong PD risk for (\bm{\omega}^{u},\bm{\nu}^{u}) by algorithm \bar{A}_{c-c} is

    \triangle^{s}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathcal{O}\bigg{(}(L^{3}\ell^{3}\rho/\lambda^{6}+L^{2}\ell^{2}/\lambda^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\lambda n}+\frac{L^{2}\ell}{\lambda^{2}n}+\lambda(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2})\bigg{)}. (28)

    In particular, by setting \lambda as

    \lambda=\max\bigg{\{}\frac{L}{\sqrt{B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2}}}\sqrt{\frac{m}{n}},\left(\frac{L^{2}\ell}{(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2})n}\right)^{1/3},\left(\frac{L^{2}\ell^{2}m^{2}\sqrt{d\log(1/\delta)}}{(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2})n^{2}\epsilon}\right)^{1/4},\left(\frac{L^{3}\ell^{3}\rho m^{2}\sqrt{d\log(1/\delta)}}{(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2})n^{2}\epsilon}\right)^{1/7}\bigg{\}}, (29)

    we have the following population strong PD risk,

    \triangle^{s}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathcal{O}\bigg{(}c_{1}\sqrt{\frac{m}{n}}+c_{2}\frac{1}{\sqrt[3]{n}}+c_{3}\big{(}\frac{d\log(1/\delta)}{\epsilon^{2}}\big{)}^{1/8}\sqrt{\frac{m}{n}}+c_{4}\big{(}\frac{\sqrt{d\log(1/\delta)}}{\epsilon}\big{)}^{1/7}(\frac{m}{n})^{2/7}\bigg{)}, (30)

    where c_{1},c_{2},c_{3},c_{4} are constants that depend only on L, \ell, \rho, B_{\bm{\omega}}, and B_{\bm{\nu}}.

  (d) Deletion Capacity: The deletion capacity of Algorithm \bar{A}_{c-c} is

    m^{A,\bar{A}}_{\epsilon,\delta,\gamma}(d_{1},d_{2},n)\geq c\cdot\frac{n\sqrt{\epsilon}}{(d\log(1/\delta))^{1/4}}, (31)

    where the constant c depends on the constants L, \ell, \rho, B_{\bm{\omega}}, and B_{\bm{\nu}}.

6 Conclusion

In this paper, we have studied the certified machine unlearning for minimax models with a focus on the generalization rates and deletion capacity, while existing works in this area largely focus on standard statistical learning models. We have provided a new minimax unlearning algorithm composed of the total Hessian-based complete Newton update and the Gaussian mechanism-based perturbation, which comes with rigorous (\epsilon,\delta)-unlearning certification. We have established generalization results in terms of the population weak and strong primal-dual risk and the correspondingly defined deletion capacity results for the strongly-convex-strongly-concave loss functions, both of which match the state-of-the-art rates obtained for standard statistical learning models. We have also provided extensions to other loss types like the convex-concave loss function. In addition, we have provided a more computationally efficient extension by getting rid of the total Hessian re-computation during the minimax unlearning phase, which can be more appealing for the successive data deletion setting. Although our bound for deletion capacity is better than that of DP by an order of d^{1/4} and matches the state-of-the-art result established for unlearning under the STL setting, it remains unclear whether this bound is tight or not. In future work, we plan to extend to more general settings like the nonconvex-nonconcave loss function setting.

Acknowledgements

We would like to thank Gautam Kamath for his valuable comments on the presentation of the results in the previous version of the paper. Additionally, we extend our thanks to the reviewers and area chair of NeurIPS 2023 for their constructive comments and feedback. This work was supported in part by the National Natural Science Foundation of China (62072395, 62206207, U20A20178), and the National Key Research and Development Program of China (2020AAA0107705, 2021YFB3100300).

References

  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International conference on machine learning, volume 70, pages 214–223. PMLR, 2017.
  • Bassily et al. (2019) Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, volume 32, pages 11279–11288, 2019.
  • Bassily et al. (2023) Raef Bassily, Cristóbal Guzmán, and Michael Menart. Differentially private algorithms for the stochastic saddle point problem with optimal rates for the strong gap. In Conference on Learning Theory, volume 195, pages 2482–2508. PMLR, 2023.
  • Boob and Guzmán (2023) Digvijay Boob and Cristóbal Guzmán. Optimal algorithms for differentially private stochastic monotone variational inequalities and saddle-point problems. Mathematical Programming, pages 1–43, 2023.
  • Bourtoule et al. (2021) Lucas Bourtoule, Varun Chandrasekaran, Christopher A Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy, pages 141–159. IEEE, 2021.
  • Brophy and Lowd (2021) Jonathan Brophy and Daniel Lowd. Machine unlearning for random forests. In International Conference on Machine Learning, volume 139, pages 1092–1104. PMLR, 2021.
  • Cao and Yang (2015) Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In 2015 IEEE symposium on security and privacy, pages 463–480. IEEE, 2015.
  • Chen et al. (2022a) Chong Chen, Fei Sun, Min Zhang, and Bolin Ding. Recommendation unlearning. In Proceedings of the ACM Web Conference 2022, pages 2768–2777. ACM, 2022a.
  • Chen et al. (2022b) Min Chen, Zhikun Zhang, Tianhao Wang, Michael Backes, Mathias Humbert, and Yang Zhang. Graph unlearning. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pages 499–513. ACM, 2022b.
  • Chen et al. (2023) Min Chen, Weizhuo Gao, Gaoyang Liu, Kai Peng, and Chen Wang. Boundary unlearning: Rapid forgetting of deep networks via shifting the decision boundary. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7766–7775. IEEE, 2023.
  • Cheng et al. (2023) Jiali Cheng, George Dasoulas, Huan He, Chirag Agarwal, and Marinka Zitnik. Gnndelete: A general strategy for unlearning in graph neural networks. In The Eleventh International Conference on Learning Representations. OpenReview.net, 2023.
  • Chien et al. (2023a) Eli Chien, Chao Pan, and Olgica Milenkovic. Efficient model updates for approximate unlearning of graph-structured data. In The Eleventh International Conference on Learning Representations. OpenReview.net, 2023a.
  • Chien et al. (2023b) Eli Chien, Chao Pan, and Olgica Milenkovic. Efficient model updates for approximate unlearning of graph-structured data. In The Eleventh International Conference on Learning Representations. OpenReview.net, 2023b.
  • Dai et al. (2018) Bo Dai, Albert Shaw, Lihong Li, Lin Xiao, Niao He, Zhen Liu, Jianshu Chen, and Le Song. Sbeed: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning, volume 80, pages 1125–1134. PMLR, 2018.
  • Di et al. (2023) Jimmy Z. Di, Jack Douglas, Jayadev Acharya, Gautam Kamath, and Ayush Sekhari. Hidden poison: Machine unlearning enables camouflaged poisoning attacks. In Advances in Neural Information Processing Systems, volume 37, 2023.
  • Du et al. (2017) Simon S Du, Jianshu Chen, Lihong Li, Lin Xiao, and Dengyong Zhou. Stochastic variance reduction methods for policy evaluation. In International Conference on Machine Learning, volume 70, pages 1049–1058. PMLR, 2017.
  • Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Third Theory of Cryptography Conference, volume 3876, pages 265–284. Springer, 2006.
  • Dwork et al. (2014) Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9:211–407, 2014.
  • Farnia and Ozdaglar (2021) Farzan Farnia and Asuman Ozdaglar. Train simultaneously, generalize better: Stability of gradient-based minimax learners. In International Conference on Machine Learning, volume 139, pages 3174–3185. PMLR, 2021.
  • Ghazi et al. (2023) Badih Ghazi, Pritish Kamath, Ravi Kumar, Pasin Manurangsi, Ayush Sekhari, and Chiyuan Zhang. Ticketed learning–unlearning schemes. In The Thirty Sixth Annual Conference on Learning Theory, pages 5110–5139. PMLR, 2023.
  • Ginart et al. (2019) Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making ai forget you: Data deletion in machine learning. In Advances in neural information processing systems, volume 32, pages 3513–3526, 2019.
  • Golatkar et al. (2020a) Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9304–9312, 2020a.
  • Golatkar et al. (2020b) Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Forgetting outside the box: Scrubbing deep networks of information accessible from input-output observations. In Computer Vision–ECCV 2020: 16th European Conference, volume 12374, pages 383–398. Springer, 2020b.
  • Golatkar et al. (2021) Aditya Golatkar, Alessandro Achille, Avinash Ravichandran, Marzia Polito, and Stefano Soatto. Mixed-privacy forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 792–801. IEEE, 2021.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, volume 27, pages 2672–2680, 2014.
  • Graves et al. (2021) Laura Graves, Vineel Nagisetty, and Vijay Ganesh. Amnesiac machine learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 11516–11524. AAAI Press, 2021.
  • Guo et al. (2020) Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens Van Der Maaten. Certified data removal from machine learning models. In International Conference on Machine Learning, volume 119, pages 3832–3842. PMLR, 2020.
  • Izzo et al. (2021) Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. Approximate data deletion from machine learning models. In International Conference on Artificial Intelligence and Statistics, volume 130, pages 2008–2016. PMLR, 2021.
  • Koh and Liang (2017) Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, volume 70, pages 1885–1894. PMLR, 2017.
  • Lei et al. (2021) Yunwen Lei, Zhenhuan Yang, Tianbao Yang, and Yiming Ying. Stability and generalization of stochastic gradient methods for minimax problems. In International Conference on Machine Learning, volume 139, pages 6175–6186. PMLR, 2021.
  • Li et al. (2021) Yuantong Li, Chi-Hua Wang, and Guang Cheng. Online forgetting process for linear regression models. In International Conference on Artificial Intelligence and Statistics, pages 217–225. PMLR, 2021.
  • Lin et al. (2023) Shen Lin, Xiaoyu Zhang, Chenyang Chen, Xiaofeng Chen, and Willy Susilo. Erm-ktp: Knowledge-level machine unlearning via knowledge transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20147–20155, 2023.
  • Lin et al. (2020) Tianyi Lin, Chi Jin, and Michael Jordan. On gradient descent ascent for nonconvex-concave minimax problems. In International Conference on Machine Learning, volume 119, pages 6083–6093. PMLR, 2020.
  • Liu et al. (2023) Junxu Liu, Mingsheng Xue, Jian Lou, Xiaoyu Zhang, Li Xiong, and Zhan Qin. Muter: Machine unlearning on adversarially trained models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4892–4902, 2023.
  • Luo et al. (2022) Luo Luo, Yujun Li, and Cheng Chen. Finding second-order stationary points in nonconvex-strongly-concave minimax optimization. In Advances in Neural Information Processing Systems, volume 35, pages 36667–36679, 2022.
  • Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In The Sixth International Conference on Learning Representations. OpenReview.net, 2018.
  • Mahadevan and Mathioudakis (2021) Ananth Mahadevan and Michael Mathioudakis. Certifiable machine unlearning for linear models. arXiv preprint arXiv:2106.15093, 2021.
  • Mantelero (2013) Alessandro Mantelero. The eu proposal for a general data protection regulation and the roots of the ‘right to be forgotten’. Computer Law & Security Review, 29(3):229–235, 2013.
  • Mehta et al. (2022) Ronak Mehta, Sourav Pal, Vikas Singh, and Sathya N Ravi. Deep unlearning via randomized conditionally independent hessians. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10422–10431. IEEE, 2022.
  • Neel et al. (2021) Seth Neel, Aaron Roth, and Saeed Sharifi-Malvajerdi. Descent-to-delete: Gradient-based methods for machine unlearning. In Algorithmic Learning Theory, volume 132, pages 931–962. PMLR, 2021.
  • Nesterov and Polyak (2006) Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
  • Nguyen et al. (2020) Quoc Phong Nguyen, Bryan Kian Hsiang Low, and Patrick Jaillet. Variational bayesian unlearning. In Advances in Neural Information Processing Systems, volume 33, pages 16025–16036, 2020.
  • Ozdaglar et al. (2022) Asuman E. Ozdaglar, Sarath Pattathil, Jiawei Zhang, and Kaiqing Zhang. What is a good metric to study generalization of minimax learners? In Advances in Neural Information Processing Systems, volume 35, pages 38190–38203, 2022.
  • Peste et al. (2021) Alexandra Peste, Dan Alistarh, and Christoph H Lampert. SSSE: efficiently erasing samples from trained machine learning models. arXiv preprint arXiv:2107.03860, 2021.
  • Schelter et al. (2021) Sebastian Schelter, Stefan Grafberger, and Ted Dunning. Hedgecut: Maintaining randomised trees for low-latency machine unlearning. In International Conference on Management of Data, pages 1545–1557. ACM, 2021.
  • Sekhari et al. (2021) Ayush Sekhari, Jayadev Acharya, Gautam Kamath, and Ananda Theertha Suresh. Remember what you want to forget: Algorithms for machine unlearning. In Advances in Neural Information Processing Systems, volume 34, pages 18075–18086, 2021.
  • Sharma et al. (2022) Pranay Sharma, Rohan Panda, Gauri Joshi, and Pramod Varshney. Federated minimax optimization: Improved convergence analyses and algorithms. In International Conference on Machine Learning, volume 162, pages 19683–19730. PMLR, 2022.
  • Shibata et al. (2021) Takashi Shibata, Go Irie, Daiki Ikami, and Yu Mitsuzumi. Learning with selective forgetting. In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, pages 989–996. ijcai.org, 2021.
  • Sinha et al. (2018) Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. In The Sixth International Conference on Learning Representations. OpenReview.net, 2018.
  • Suriyakumar and Wilson (2022) Vinith Suriyakumar and Ashia C Wilson. Algorithms that approximate data removal: New results and limitations. In Advances in Neural Information Processing Systems, volume 35, pages 18892–18903, 2022.
  • Tarun et al. (2023) Ayush K Tarun, Vikram S Chundawat, Murari Mandal, and Mohan Kankanhalli. Fast yet effective machine unlearning. IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • Thekumparampil et al. (2019) Kiran K Thekumparampil, Prateek Jain, Praneeth Netrapalli, and Sewoong Oh. Efficient algorithms for smooth minimax optimization. In Advances in Neural Information Processing Systems, volume 32, pages 12659–12670, 2019.
  • Ullah et al. (2021) Enayat Ullah, Tung Mai, Anup Rao, Ryan A Rossi, and Raman Arora. Machine unlearning via algorithmic stability. In Conference on Learning Theory, volume 134, pages 4126–4142. PMLR, 2021.
  • Vadhan (2017) Salil Vadhan. The complexity of differential privacy. Tutorials on the Foundations of Cryptography: Dedicated to Oded Goldreich, pages 347–450, 2017.
  • Wang et al. (2023a) Cheng-Long Wang, Mengdi Huai, and Di Wang. Inductive graph unlearning. In 32nd USENIX Security Symposium, pages 3205–3222. USENIX Association, 2023a.
  • Wang et al. (2023b) Weiqi Wang, Zhiyi Tian, Chenhan Zhang, An Liu, and Shui Yu. Bfu: Bayesian federated unlearning with parameter self-sharing. In Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security, pages 567–578. ACM, 2023b.
  • Warnecke et al. (2023) Alexander Warnecke, Lukas Pirch, Christian Wressnegger, and Konrad Rieck. Machine unlearning of features and labels. In 30th Annual Network and Distributed System Security Symposium. The Internet Society, 2023.
  • Wu et al. (2022) Ga Wu, Masoud Hashemi, and Christopher Srinivasa. Puma: Performance unchanged model augmentation for training data removal. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 8675–8682, 2022.
  • Wu et al. (2020) Yinjun Wu, Edgar Dobriban, and Susan Davidson. Deltagrad: Rapid retraining of machine learning models. In International Conference on Machine Learning, volume 119, pages 10355–10366. PMLR, 2020.
  • Wu et al. (2023) Zhaomin Wu, Junhui Zhu, Qinbin Li, and Bingsheng He. Deltaboost: Gradient boosting decision trees with efficient machine unlearning. Proceedings of the ACM on Management of Data, 1(2):1–26, 2023.
  • Xia et al. (2023) Haocheng Xia, Jinfei Liu, Jian Lou, Zhan Qin, Kui Ren, Yang Cao, and Li Xiong. Equitable data valuation meets the right to be forgotten in model markets. Proceedings of the VLDB Endowment, 16(11):3349–3362, 2023.
  • Yan et al. (2022) Haonan Yan, Xiaoguang Li, Ziyao Guo, Hui Li, Fenghua Li, and Xiaodong Lin. Arcane: An efficient architecture for exact machine unlearning. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, pages 4006–4013. ijcai.org, 2022.
  • Yang et al. (2022) Zhenhuan Yang, Shu Hu, Yunwen Lei, Kush R Varshney, Siwei Lyu, and Yiming Ying. Differentially private sgda for minimax problems. In Uncertainty in Artificial Intelligence, volume 180, pages 2192–2202. PMLR, 2022.
  • Zhang et al. (2020) Guojun Zhang, Kaiwen Wu, Pascal Poupart, and Yaoliang Yu. Newton-type methods for minimax optimization. arXiv preprint arXiv:2006.14592, 2020.
  • Zhang et al. (2021) Junyu Zhang, Mingyi Hong, Mengdi Wang, and Shuzhong Zhang. Generalization bounds for stochastic saddle point problems. In International Conference on Artificial Intelligence and Statistics, volume 130, pages 568–576. PMLR, 2021.
  • Zhang et al. (2022a) Liang Zhang, Kiran K Thekumparampil, Sewoong Oh, and Niao He. Bring your own algorithm for optimal differentially private stochastic minimax optimization. In Advances in Neural Information Processing Systems, volume 35, pages 35174–35187, 2022a.
  • Zhang et al. (2023) Shuijing Zhang, Jian Lou, Li Xiong, Xiaoyu Zhang, and Jing Liu. Closed-form machine unlearning for matrix factorization. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 3278–3287, 2023.
  • Zhang et al. (2022b) Siqi Zhang, Yifan Hu, Liang Zhang, and Niao He. Uniform convergence and generalization for nonconvex stochastic minimax problems. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop), 2022b.
  • Zhang et al. (2022c) Zijie Zhang, Yang Zhou, Xin Zhao, Tianshi Che, and Lingjuan Lyu. Prompt certified machine unlearning with randomized gradient smoothing and quantization. In Advances in Neural Information Processing Systems, volume 35, pages 13433–13455, 2022c.

Appendix A Additional Definitions and Supporting Lemmas

In this section, we provide additional definitions and supporting lemmas. Of the next two sections, Sec.B contains the proofs missing from Sec.4, together with the online extension that supports the successive unlearning setting, and Sec.C contains the proofs missing from Sec.5, as well as detailed algorithm descriptions for the general convex-concave loss function setting.

A.1 Additional Definitions

We first recall the following standard definitions for the loss function f(𝝎,𝝂;z)f(\bm{\omega},\bm{\nu};z) from optimization literature.

Definition 4 (Function Lipschitz Continuity).

The function f(𝛚,𝛎;z)f(\bm{\omega},\bm{\nu};z) is LL-Lipschitz, i.e., there exists a constant L>0L>0 such that for all 𝛚,𝛚𝒲\bm{\omega},\bm{\omega}^{\prime}\in\mathcal{W}, 𝛎,𝛎𝒱\bm{\nu},\bm{\nu}^{\prime}\in\mathcal{V} and z𝒵z\in\mathcal{Z}, it holds that

f(𝝎,𝝂;z)f(𝝎,𝝂;z)2L2(𝝎𝝎2+𝝂𝝂2).\|f(\bm{\omega},\bm{\nu};z)-f(\bm{\omega}^{\prime},\bm{\nu}^{\prime};z)\|^{2}\leq L^{2}(\|\bm{\omega}-\bm{\omega}^{\prime}\|^{2}+\|\bm{\nu}-\bm{\nu}^{\prime}\|^{2}). (32)
Definition 5 (Gradient Lipschitz Continuity).

The function f(𝛚,𝛎;z)f(\bm{\omega},\bm{\nu};z) has \ell-Lipschitz gradients, i.e., there exists a constant >0\ell>0 such that for all 𝛚,𝛚𝒲\bm{\omega},\bm{\omega}^{\prime}\in\mathcal{W}, 𝛎,𝛎𝒱\bm{\nu},\bm{\nu}^{\prime}\in\mathcal{V} and z𝒵z\in\mathcal{Z}, it holds that

f(𝝎,𝝂;z)f(𝝎,𝝂;z)22(𝝎𝝎2+𝝂𝝂2),\|\nabla f(\bm{\omega},\bm{\nu};z)-\nabla f(\bm{\omega}^{\prime},\bm{\nu}^{\prime};z)\|^{2}\leq\ell^{2}(\|\bm{\omega}-\bm{\omega}^{\prime}\|^{2}+\|\bm{\nu}-\bm{\nu}^{\prime}\|^{2}), (33)

where recall that f(𝛚,𝛎;z)=[𝛚f(𝛚,𝛎;z)𝛎f(𝛚,𝛎;z)]\nabla f(\bm{\omega},\bm{\nu};z)=\begin{bmatrix}\nabla_{\bm{\omega}}f(\bm{\omega},\bm{\nu};z)\\ \nabla_{\bm{\nu}}f(\bm{\omega},\bm{\nu};z)\end{bmatrix}.

Definition 6 (Hessian Lipschitz Continuity).

The function f(𝛚,𝛎;z)f(\bm{\omega},\bm{\nu};z) has ρ\rho-Lipschitz Hessian, i.e., there exists a constant ρ>0\rho>0 such that for all 𝛚,𝛚𝒲\bm{\omega},\bm{\omega}^{\prime}\in\mathcal{W}, 𝛎,𝛎𝒱\bm{\nu},\bm{\nu}^{\prime}\in\mathcal{V} and z𝒵z\in\mathcal{Z}, it holds that

2f(𝝎,𝝂;z)2f(𝝎,𝝂;z)2ρ2(𝝎𝝎2+𝝂𝝂2),\|\nabla^{2}f(\bm{\omega},\bm{\nu};z)-\nabla^{2}f(\bm{\omega}^{\prime},\bm{\nu}^{\prime};z)\|^{2}\leq\rho^{2}(\|\bm{\omega}-\bm{\omega}^{\prime}\|^{2}+\|\bm{\nu}-\bm{\nu}^{\prime}\|^{2}), (34)

where recall that 2f(𝛚,𝛎;z)=[𝛚𝛚f(𝛚,𝛎;z)𝛚𝛎f(𝛚,𝛎;z)𝛎𝛚f(𝛚,𝛎;z)𝛎𝛎f(𝛚,𝛎;z)]\nabla^{2}f(\bm{\omega},\bm{\nu};z)=\begin{bmatrix}\partial_{\bm{\omega}\bm{\omega}}f(\bm{\omega},\bm{\nu};z)&\partial_{\bm{\omega}\bm{\nu}}f(\bm{\omega},\bm{\nu};z)\\ \partial_{\bm{\nu}\bm{\omega}}f(\bm{\omega},\bm{\nu};z)&\partial_{\bm{\nu}\bm{\nu}}f(\bm{\omega},\bm{\nu};z)\end{bmatrix}.

Definition 7 (Strongly-Convex-Strongly-Concave Objective Function).

The function f(𝛚,𝛎;z)f(\bm{\omega},\bm{\nu};z) is μ𝛚\mu_{\bm{\omega}}-strongly convex on 𝒲\mathcal{W} and μ𝛎\mu_{\bm{\nu}}-strongly concave on 𝒱\mathcal{V}, i.e., there exist constants μ𝛚>0\mu_{\bm{\omega}}>0 and μ𝛎>0\mu_{\bm{\nu}}>0 such that for all 𝛚,𝛚𝒲\bm{\omega},\bm{\omega}^{\prime}\in\mathcal{W}, 𝛎,𝛎𝒱\bm{\nu},\bm{\nu}^{\prime}\in\mathcal{V} and z𝒵z\in\mathcal{Z}, it holds that

{f(𝝎,𝝂;z)f(𝝎,𝝂;z)+𝝎f(𝝎,𝝂;z),𝝎𝝎+μ𝝎2𝝎𝝎2,f(𝝎,𝝂;z)f(𝝎,𝝂;z)+𝝂f(𝝎,𝝂;z),𝝂𝝂μ𝝂2𝝂𝝂2.\left\{\begin{array}[]{lr}f(\bm{\omega},\bm{\nu};z)\geq f(\bm{\omega}^{\prime},\bm{\nu};z)+\langle\nabla_{\bm{\omega}}f(\bm{\omega}^{\prime},\bm{\nu};z),\bm{\omega}-\bm{\omega}^{\prime}\rangle+\frac{\mu_{\bm{\omega}}}{2}\|\bm{\omega}-\bm{\omega}^{\prime}\|^{2},\\ f(\bm{\omega},\bm{\nu};z)\leq f(\bm{\omega},\bm{\nu}^{\prime};z)+\langle\nabla_{\bm{\nu}}f(\bm{\omega},\bm{\nu}^{\prime};z),\bm{\nu}-\bm{\nu}^{\prime}\rangle-\frac{\mu_{\bm{\nu}}}{2}\|\bm{\nu}-\bm{\nu}^{\prime}\|^{2}.\end{array}\right. (35)
Definition 8 (Best Response Auxiliary Functions).

We introduce auxiliary functions 𝚅S(𝛚){\tt V}_{S}(\bm{\omega}) and 𝚅S(𝛚){\tt V}_{S^{\setminus}}(\bm{\omega}), given by

𝚅S(𝝎):=argmax𝝂𝒱FS(𝝎,𝝂),𝚅S(𝝎):=argmax𝝂𝒱FS(𝝎,𝝂),{\tt V}_{S}(\bm{\omega}):=\mathop{\rm argmax}_{\bm{\nu}\in\mathcal{V}}F_{S}(\bm{\omega},\bm{\nu}),\qquad{\tt V}_{S^{\setminus}}(\bm{\omega}):=\mathop{\rm argmax}_{\bm{\nu}\in\mathcal{V}}F_{S^{\setminus}}(\bm{\omega},\bm{\nu}), (36)

and we have 𝛎S=𝚅S(𝛚S)\bm{\nu}_{S}^{*}={\tt V}_{S}(\bm{\omega}_{S}^{*}) and 𝛎S=𝚅S(𝛚S)\bm{\nu}_{S^{\setminus}}^{*}={\tt V}_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*}). We can similarly introduce 𝚆S(𝛎){\tt W}_{S}(\bm{\nu}) and 𝚆S(𝛎){\tt W}_{S^{\setminus}}(\bm{\nu}) as

𝚆S(𝝂):=argmin𝝎𝒲FS(𝝎,𝝂),𝚆S(𝝂):=argmin𝝎𝒲FS(𝝎,𝝂),{\tt W}_{S}(\bm{\nu}):=\mathop{\rm argmin}_{\bm{\omega}\in\mathcal{W}}F_{S}(\bm{\omega},\bm{\nu}),\qquad{\tt W}_{S^{\setminus}}(\bm{\nu}):=\mathop{\rm argmin}_{\bm{\omega}\in\mathcal{W}}F_{S^{\setminus}}(\bm{\omega},\bm{\nu}), (37)

and we have 𝛚S=𝚆S(𝛎S)\bm{\omega}_{S}^{*}={\tt W}_{S}(\bm{\nu}_{S}^{*}) and 𝛚S=𝚆S(𝛎S)\bm{\omega}_{S^{\setminus}}^{*}={\tt W}_{S^{\setminus}}(\bm{\nu}_{S^{\setminus}}^{*}) by this definition.

In addition, we define the primal function P(𝛚):=max𝛎𝒱FS(𝛚,𝛎)=FS(𝛚,𝚅S(𝛚))P(\bm{\omega}):=\max_{\bm{\nu}\in\mathcal{V}}F_{S}(\bm{\omega},\bm{\nu})=F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega})), which has gradient P(𝛚)=𝛚FS(𝛚,𝚅S(𝛚))\nabla P(\bm{\omega})=\nabla_{\bm{\omega}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega})) and Hessian 𝛚𝛚2P(𝛚)=𝙳𝛚𝛚FS(𝛚,𝚅S(𝛚))\nabla^{2}_{\bm{\omega}\bm{\omega}}P(\bm{\omega})={\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega})) (i.e., the total Hessian of FSF_{S}). The dual function, its gradient, and Hessian can be similarly defined, e.g., D(𝛎):=min𝛚𝒲FS(𝛚,𝛎)=FS(𝚆S(𝛎),𝛎)D(\bm{\nu}):=\min_{\bm{\omega}\in\mathcal{W}}F_{S}(\bm{\omega},\bm{\nu})=F_{S}({\tt W}_{S}(\bm{\nu}),\bm{\nu}).
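
To make these objects concrete, the short Python sketch below instantiates the best response {\tt V}_S, the primal function P, and the total Hessian {\tt D}_{\omega\omega}F_S on a toy strongly-convex-strongly-concave quadratic objective. The specific loss, constants, and data are illustrative assumptions for this sketch only, not part of our setting.

import numpy as np

# Illustrative SC-SC quadratic objective (an assumption for this sketch only):
#   F_S(w, v) = (mu_w/2) * mean_i ||w - x_i||^2 + v^T (B w - y_bar) - (mu_v/2) ||v||^2.
rng = np.random.default_rng(0)
d1, d2, n = 4, 3, 500
mu_w, mu_v = 1.0, 2.0
B = rng.normal(size=(d2, d1))
X = rng.normal(size=(n, d1))
y_bar = rng.normal(size=d2)
x_bar = X.mean(axis=0)

def V_S(w):
    # Best response argmax_v F_S(w, v): solve grad_v = B w - y_bar - mu_v * v = 0.
    return (B @ w - y_bar) / mu_v

def P(w):
    # Primal function P(w) = F_S(w, V_S(w)).
    v = V_S(w)
    return (0.5 * mu_w * np.mean(np.sum((w - X) ** 2, axis=1))
            + v @ (B @ w - y_bar) - 0.5 * mu_v * v @ v)

# Total Hessian D_ww F_S = d_ww F - d_wv F (d_vv F)^{-1} d_vw F.
D_ww = mu_w * np.eye(d1) - B.T @ np.linalg.solve(-mu_v * np.eye(d2), B)

w0 = rng.normal(size=d1)
print("P(w0) =", P(w0))
print("D_ww F_S =\n", D_ww)   # equals mu_w * I + B^T B / mu_v for this toy objective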

A.2 Supporting Lemmas

The following lemma provides the distance between 𝚅S(𝝎S){\tt V}_{S}(\bm{\omega}_{S}^{*}) and 𝚅S(𝝎S){\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*}). Similar result can be derived for the distance between 𝚆S(𝝂S){\tt W}_{S}(\bm{\nu}_{S}^{*}) and 𝚆S(𝝂S){\tt W}_{S^{\setminus}}(\bm{\nu}_{S}^{*}).

Lemma 2.

Under Assumptions 1 and 2, the variables 𝚅S(𝛚S){\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*}) and 𝛎S\bm{\nu}_{S}^{*} (i.e., 𝚅S(𝛚S){\tt V}_{S}(\bm{\omega}_{S}^{*})) defined in Algorithm 1 satisfy the following distance bound

𝝂S𝚅S(𝝎S)2Lmμ𝝂(nm).\|\bm{\nu}_{S}^{*}-{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*})\|\leq\frac{2Lm}{\mu_{\bm{\nu}}(n-m)}. (38)
Proof.

We observe that

(nm)(FS(𝝎S,𝚅S(𝝎S))FS(𝝎S,𝝂S))=ziSUf(𝝎S,𝚅S(𝝎S);zi)ziSUf(𝝎S,𝝂S;zi)=ziSf(𝝎S,𝚅S(𝝎S);zi)ziUf(𝝎S,𝚅S(𝝎S);zi)(ziSf(𝝎S,𝝂S;zi)ziU(𝝎S,𝝂S;zi))=n(FS(𝝎S,𝚅S(𝝎S))FS(𝝎S,𝝂S))+ziUf(𝝎S,𝝂S;zi)ziUf(𝝎S,𝚅S(𝝎S);zi)(i)ziUf(𝝎S,𝝂S;zi)ziUf(𝝎S,𝚅S(𝝎S);zi)(ii)mL𝝂S𝚅S(𝝎S),\begin{split}&(n-m)(F_{S^{\setminus}}(\bm{\omega}_{S}^{*},{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*}))-F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}))\\ =&\sum_{z_{i}\in S\setminus U}f(\bm{\omega}_{S}^{*},{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*});z_{i})-\sum_{z_{i}\in S\setminus U}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\\ =&\sum_{z_{i}\in S}f(\bm{\omega}_{S}^{*},{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*});z_{i})-\sum_{z_{i}\in U}f(\bm{\omega}_{S}^{*},{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*});z_{i})-\bigg{(}\sum_{z_{i}\in S}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})-\sum_{z_{i}\in U}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\bigg{)}\\ =&n(F_{S}(\bm{\omega}_{S}^{*},{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*}))-F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}))+\sum_{z_{i}\in U}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})-\sum_{z_{i}\in U}f(\bm{\omega}_{S}^{*},{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*});z_{i})\\ \stackrel{{\scriptstyle(i)}}{{\leq}}&\sum_{z_{i}\in U}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})-\sum_{z_{i}\in U}f(\bm{\omega}_{S}^{*},{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*});z_{i})\stackrel{{\scriptstyle(ii)}}{{\leq}}mL\|\bm{\nu}_{S}^{*}-{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*})\|,\end{split} (39)

where the inequality (ii) follows from the fact that 𝝂S\bm{\nu}_{S}^{*} is the maximizer of the function FS(𝝎,𝝂)F_{S}(\bm{\omega},\bm{\nu}), thus FS(𝝎S,𝚅S(𝝎S))FS(𝝎S,𝝂S)0F_{S}(\bm{\omega}_{S}^{*},{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*}))-F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\leq 0. The inequality (iiii) is due to the fact that the function ff is LL-Lipschitz. Also note that the function FS(𝝎,𝝂)F_{S^{\setminus}}(\bm{\omega},\bm{\nu}) is μ𝝂\mu_{\bm{\nu}}-strongly concave, thus we have

FS(𝝎S,𝚅S(𝝎S))FS(𝝎S,𝝂S)μ𝝂2𝝂S𝚅S(𝝎S)2.F_{S^{\setminus}}(\bm{\omega}_{S}^{*},{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*}))-F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\geq\frac{\mu_{\bm{\nu}}}{2}\|\bm{\nu}_{S}^{*}-{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*})\|^{2}. (40)

Eq.(39) and eq.(40) together give that

μ𝝂(nm)2𝝂S𝚅S(𝝎S)2mL𝝂S𝚅S(𝝎S),\frac{\mu_{\bm{\nu}}(n-m)}{2}\|\bm{\nu}_{S}^{*}-{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*})\|^{2}\leq mL\|\bm{\nu}_{S}^{*}-{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*})\|, (41)

which implies that 𝝂S𝚅S(𝝎S)2Lmμ𝝂(nm)\|\bm{\nu}_{S}^{*}-{\tt V}_{S^{\setminus}}(\bm{\omega}_{S}^{*})\|\leq\frac{2Lm}{\mu_{\bm{\nu}}(n-m)}. ∎

The following lemma provides the distance between (𝝎S,𝝂S)(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}) and (𝝎S,𝝂S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}).

Lemma 3.

Under Assumptions 1 and 2, the variables (𝛚S,𝛎S)(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}) defined in eq.(9) and (𝛚S,𝛎S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) defined in Algorithm 1 satisfy the following guarantees

𝝎S𝝎S2Lmμ𝝎n,and𝝂S𝝂S2Lmμ𝝂n.\|\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*}\|\leq\frac{2Lm}{\mu_{\bm{\omega}}n},\qquad and\qquad\|\bm{\nu}_{S^{\setminus}}^{*}-\bm{\nu}_{S}^{*}\|\leq\frac{2Lm}{\mu_{\bm{\nu}}n}. (42)
Proof.

We begin with the 𝝎\bm{\omega}-part,

n[FS(𝝎S,𝝂S)FS(𝝎S,𝝂S)]=ziSf(𝝎S,𝝂S;zi)ziSf(𝝎S,𝝂S;zi)=ziSUf(𝝎S,𝝂S;zi)+ziUf(𝝎S,𝝂S;zi)ziSUf(𝝎S,𝝂S;zi)ziUf(𝝎S,𝝂S;zi)=(nm)[FS(𝝎S,𝝂S)FS(𝝎S,𝝂S)]+ziUf(𝝎S,𝝂S;zi)ziUf(𝝎S,𝝂S;zi)(i)ziUf(𝝎S,𝝂S;zi)ziUf(𝝎S,𝝂S;zi)(ii)mL𝝎S𝝎S,\begin{split}&n[F_{S}(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*})-F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S^{\setminus}}^{*})]\\ =&\sum_{z_{i}\in S}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})-\sum_{z_{i}\in S}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})\\ =&\sum_{z_{i}\in S\setminus U}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})+\sum_{z_{i}\in U}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})-\sum_{z_{i}\in S\setminus U}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})-\sum_{z_{i}\in U}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})\\ =&(n-m)[F_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*})-F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S^{\setminus}}^{*})]+\sum_{z_{i}\in U}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})-\sum_{z_{i}\in U}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})\\ \stackrel{{\scriptstyle(i)}}{{\leq}}&\sum_{z_{i}\in U}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})-\sum_{z_{i}\in U}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})\stackrel{{\scriptstyle(ii)}}{{\leq}}mL\|\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*}\|,\end{split} (43)

where the inequality (ii) holds because 𝝎S\bm{\omega}_{S^{\setminus}}^{*} is the minimizer of the function FS(𝝎,𝝂)F_{S^{\setminus}}(\bm{\omega},\bm{\nu}), thus FS(𝝎S,𝝂S)FS(𝝎S,𝝂S)0F_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*})-F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S^{\setminus}}^{*})\leq 0, and the inequality (iiii) follows from the fact that the function ff is LL-Lipschitz. Since the function FS(𝝎,𝝂)F_{S}(\bm{\omega},\bm{\nu}) is μ𝝎\mu_{\bm{\omega}}-strongly convex, we further get

FS(𝝎S,𝝂S)FS(𝝎S,𝝂S)μ𝝎2𝝎S𝝎S2.F_{S}(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*})-F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S^{\setminus}}^{*})\geq\frac{\mu_{\bm{\omega}}}{2}\|\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*}\|^{2}. (44)

Eq.(43) and eq.(44) together give that

μ𝝎n2𝝎S𝝎S2mL𝝎S𝝎S.\frac{\mu_{\bm{\omega}}n}{2}\|\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*}\|^{2}\leq mL\|\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*}\|. (45)

Thus, we get 𝝎S𝝎S2Lmμ𝝎n\|\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*}\|\leq\frac{2Lm}{\mu_{\bm{\omega}}n}.

For the 𝝂\bm{\nu}-part, we similarly have

n[FS(𝝎S,𝝂S)FS(𝝎S,𝝂S)]=ziSf(𝝎S,𝝂S;zi)ziSf(𝝎S,𝝂S;zi)=ziSUf(𝝎S,𝝂S;zi)+ziUf(𝝎S,𝝂S;zi)ziSUf(𝝎S,𝝂S;zi)ziUf(𝝎S,𝝂S;zi)=(nm)[FS(𝝎S,𝝂S)FS(𝝎S,𝝂S)]+ziUf(𝝎S,𝝂S;zi)ziUf(𝝎S,𝝂S;zi)(i)ziUf(𝝎S,𝝂S;zi)ziUf(𝝎S,𝝂S;zi)(ii)mL𝝂S𝝂S,\begin{split}&n[F_{S}(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S}^{*})-F_{S}(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*})]\\ =&\sum_{z_{i}\in S}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S}^{*};z_{i})-\sum_{z_{i}\in S}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})\\ =&\sum_{z_{i}\in S\setminus U}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S}^{*};z_{i})+\sum_{z_{i}\in U}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S}^{*};z_{i})-\sum_{z_{i}\in S\setminus U}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})-\sum_{z_{i}\in U}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})\\ =&(n-m)[F_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S}^{*})-F_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*})]+\sum_{z_{i}\in U}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S}^{*};z_{i})-\sum_{z_{i}\in U}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})\\ \stackrel{{\scriptstyle(i)}}{{\leq}}&\sum_{z_{i}\in U}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S}^{*};z_{i})-\sum_{z_{i}\in U}f(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*};z_{i})\stackrel{{\scriptstyle(ii)}}{{\leq}}mL\|\bm{\nu}_{S}^{*}-\bm{\nu}_{S^{\setminus}}^{*}\|,\end{split} (46)

where the inequality (ii) follows from the fact that 𝝂S\bm{\nu}_{S^{\setminus}}^{*} is the maximizer of the function FS(𝝎,𝝂)F_{S^{\setminus}}(\bm{\omega},\bm{\nu}), thus FS(𝝎S,𝝂S)FS(𝝎S,𝝂S)0F_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S}^{*})-F_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*})\leq 0. The inequality (iiii) is due to the fact that the function ff is LL-Lipschitz. In addition, since the function FS(𝝎,𝝂)F_{S}(\bm{\omega},\bm{\nu}) is μ𝝂\mu_{\bm{\nu}}-strongly concave in 𝝂\bm{\nu}, we have

FS(𝝎S,𝝂S)FS(𝝎S,𝝂S)μ𝝂2𝝂S𝝂S2.F_{S}(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S}^{*})-F_{S}(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*})\geq\frac{\mu_{\bm{\nu}}}{2}\|\bm{\nu}_{S^{\setminus}}^{*}-\bm{\nu}_{S}^{*}\|^{2}. (47)

By eq.(46) and eq.(47), we get that

μ𝝂n2𝝂S𝝂S2mL𝝂S𝝂S.\frac{\mu_{\bm{\nu}}n}{2}\|\bm{\nu}_{S^{\setminus}}^{*}-\bm{\nu}_{S}^{*}\|^{2}\leq mL\|\bm{\nu}_{S}^{*}-\bm{\nu}_{S^{\setminus}}^{*}\|. (48)

Thus, we have 𝝂S𝝂S2Lmμ𝝂n\|\bm{\nu}_{S^{\setminus}}^{*}-\bm{\nu}_{S}^{*}\|\leq\frac{2Lm}{\mu_{\bm{\nu}}n}. ∎

In the following, we recall several lemmas (i.e., Lemma 4 to Lemma 9) from the existing minimax optimization literature for completeness.

Lemma 4 ((Lin et al., 2020, Lemma 4.3)).

Under Assumptions 1 and 2, for any 𝛚𝒲\bm{\omega}\in\mathcal{W}, the function 𝚅S(𝛚){\tt V}_{S}(\bm{\omega}) is (/μ𝛎)(\ell/\mu_{\bm{\nu}})-Lipschitz.

Proof.

By the optimality condition of the function 𝚅S(𝝎){\tt V}_{S}(\bm{\omega}), we have

𝝂FS(𝝎1,𝚅S(𝝎1)),𝚅S(𝝎2)𝚅S(𝝎1)0,\displaystyle\langle\nabla_{\bm{\nu}}F_{S}(\bm{\omega}_{1},{\tt V}_{S}(\bm{\omega}_{1})),{\tt V}_{S}(\bm{\omega}_{2})-{\tt V}_{S}(\bm{\omega}_{1})\rangle\leq 0,
𝝂FS(𝝎2,𝚅S(𝝎2)),𝚅S(𝝎1)𝚅S(𝝎2)0.\displaystyle\langle\nabla_{\bm{\nu}}F_{S}(\bm{\omega}_{2},{\tt V}_{S}(\bm{\omega}_{2})),{\tt V}_{S}(\bm{\omega}_{1})-{\tt V}_{S}(\bm{\omega}_{2})\rangle\leq 0.

Summing the two inequalities above yields

𝝂FS(𝝎1,𝚅S(𝝎1))𝝂FS(𝝎2,𝚅S(𝝎2)),𝚅S(𝝎2)𝚅S(𝝎1)0.\langle\nabla_{\bm{\nu}}F_{S}(\bm{\omega}_{1},{\tt V}_{S}(\bm{\omega}_{1}))-\nabla_{\bm{\nu}}F_{S}(\bm{\omega}_{2},{\tt V}_{S}(\bm{\omega}_{2})),{\tt V}_{S}(\bm{\omega}_{2})-{\tt V}_{S}(\bm{\omega}_{1})\rangle\leq 0. (49)

Since the function FS(𝝎,𝝂)F_{S}(\bm{\omega},\bm{\nu}) is μ𝝂\mu_{\bm{\nu}}-strongly concave in 𝝂\bm{\nu}, we have

𝝂FS(𝝎1,𝚅S(𝝎2))𝝂FS(𝝎1,𝚅S(𝝎1)),𝚅S(𝝎2)𝚅S(𝝎1)+μ𝝂𝚅S(𝝎2)𝚅S(𝝎1)20.\langle\nabla_{\bm{\nu}}F_{S}(\bm{\omega}_{1},{\tt V}_{S}(\bm{\omega}_{2}))-\nabla_{\bm{\nu}}F_{S}(\bm{\omega}_{1},{\tt V}_{S}(\bm{\omega}_{1})),{\tt V}_{S}(\bm{\omega}_{2})-{\tt V}_{S}(\bm{\omega}_{1})\rangle+\mu_{\bm{\nu}}\|{\tt V}_{S}(\bm{\omega}_{2})-{\tt V}_{S}(\bm{\omega}_{1})\|^{2}\leq 0. (50)

By eq.(49) and eq.(50) with the \ell-Lipschitz continuity of FS(𝝎,𝝂)\nabla F_{S}(\bm{\omega},\bm{\nu}), we further get

μ𝝂𝚅S(𝝎2)𝚅S(𝝎1)2\displaystyle\mu_{\bm{\nu}}\|{\tt V}_{S}(\bm{\omega}_{2})-{\tt V}_{S}(\bm{\omega}_{1})\|^{2} 𝝂FS(𝝎2,𝚅S(𝝎2))𝝂FS(𝝎1,𝚅S(𝝎2)),𝚅S(𝝎2)𝚅S(𝝎1)\displaystyle\leq\langle\nabla_{\bm{\nu}}F_{S}(\bm{\omega}_{2},{\tt V}_{S}(\bm{\omega}_{2}))-\nabla_{\bm{\nu}}F_{S}(\bm{\omega}_{1},{\tt V}_{S}(\bm{\omega}_{2})),{\tt V}_{S}(\bm{\omega}_{2})-{\tt V}_{S}(\bm{\omega}_{1})\rangle (51)
𝝎2𝝎1𝚅S(𝝎2)𝚅S(𝝎1).\displaystyle\leq\ell\|\bm{\omega}_{2}-\bm{\omega}_{1}\|\cdot\|{\tt V}_{S}(\bm{\omega}_{2})-{\tt V}_{S}(\bm{\omega}_{1})\|.

Consequently, we have

𝚅S(𝝎2)𝚅S(𝝎1)μ𝝂𝝎2𝝎1.\|{\tt V}_{S}(\bm{\omega}_{2})-{\tt V}_{S}(\bm{\omega}_{1})\|\leq\frac{\ell}{\mu_{\bm{\nu}}}\|\bm{\omega}_{2}-\bm{\omega}_{1}\|. (52)

Remark 1.

The above lemma can be similarly derived for 𝚆S{\tt W}_{S} to obtain that the best response auxiliary function 𝚆S(𝛎){\tt W}_{S}(\bm{\nu}) is (/μ𝛚)(\ell/\mu_{\bm{\omega}})-Lipschitz. In the next three lemmas, we focus on the 𝛚\bm{\omega}-part and omit the 𝛎\bm{\nu}-part.
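
As a quick numerical sanity check of Lemma 4, the sketch below evaluates the empirical Lipschitz ratio of V_S on a toy quadratic objective where the best response is linear; the objective and its constants are illustrative assumptions, and \ell is computed as the spectral norm of the (constant) Hessian of this toy loss.

import numpy as np

# Sanity check of Lemma 4 on a toy quadratic objective (illustrative assumption):
# here V_S(w) = (B w - y_bar) / mu_v is linear, so its Lipschitz constant is
# ||B||_2 / mu_v, which is at most ell / mu_v since B is a block of the Hessian.
rng = np.random.default_rng(1)
d1, d2 = 4, 3
mu_w, mu_v = 1.0, 2.0
B = rng.normal(size=(d2, d1))
y_bar = rng.normal(size=d2)

V_S = lambda w: (B @ w - y_bar) / mu_v
H = np.block([[mu_w * np.eye(d1), B.T], [B, -mu_v * np.eye(d2)]])
ell = np.linalg.norm(H, 2)                      # gradient-Lipschitz constant of this toy f

w1, w2 = rng.normal(size=d1), rng.normal(size=d1)
ratio = np.linalg.norm(V_S(w2) - V_S(w1)) / np.linalg.norm(w2 - w1)
print(f"empirical ratio {ratio:.3f} <= ell/mu_v = {ell / mu_v:.3f}")   # Lemma 4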

Lemma 5 ((Luo et al., 2022, Lemma 3)).

Denote κ𝛎=/μ𝛎\kappa_{\bm{\nu}}=\ell/\mu_{\bm{\nu}}. Under Assumptions 1 and 2, for any 𝛚,𝛚𝒲\bm{\omega},\bm{\omega}^{\prime}\in\mathcal{W}, we have

𝙳𝝎𝝎FS(𝝎,𝚅S(𝝎))𝙳𝝎𝝎FS(𝝎,𝚅S(𝝎))42κ𝝂3ρ𝝎𝝎.\|{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega}))-{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}^{\prime},{\tt V}_{S}(\bm{\omega}^{\prime}))\|\leq 4\sqrt{2}\kappa_{\bm{\nu}}^{3}\rho\|\bm{\omega}-\bm{\omega}^{\prime}\|. (53)
Lemma 6 ((Nesterov and Polyak, 2006, Lemma 1)).

Denote κ𝛎=/μ𝛎\kappa_{\bm{\nu}}=\ell/\mu_{\bm{\nu}}. Under Assumptions 1 and 2, for any 𝛚,𝛚𝒲\bm{\omega},\bm{\omega}^{\prime}\in\mathcal{W}, we have

𝝎FS(𝝎,𝚅S(𝝎))𝝎FS(𝝎,𝚅S(𝝎))𝙳𝝎𝝎FS(𝝎)(𝝎𝝎)M2𝝎𝝎2,\|\nabla_{\bm{\omega}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega}))-\nabla_{\bm{\omega}}F_{S}(\bm{\omega}^{\prime},{\tt V}_{S}(\bm{\omega}^{\prime}))-{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}^{\prime})(\bm{\omega}-\bm{\omega}^{\prime})\|\leq\frac{M}{2}\|\bm{\omega}-\bm{\omega}^{\prime}\|^{2}, (54)

where M=42κ𝛎3ρM=4\sqrt{2}\kappa_{\bm{\nu}}^{3}\rho.

Proof.

Recall the definition of the primal function P(𝝎):=max𝝂𝒱FS(𝝎,𝝂)P(\bm{\omega}):=\max_{\bm{\nu}\in\mathcal{V}}F_{S}(\bm{\omega},\bm{\nu}) and its gradient P(𝝎)=𝝎FS(𝝎,𝚅S(𝝎))\nabla P(\bm{\omega})=\nabla_{\bm{\omega}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega})). Due to the optimality of 𝚅S{\tt V}_{S}, we have

𝝂FS(𝝎,𝚅S(𝝎))=0.\nabla_{\bm{\nu}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega}))=0. (55)

By taking the total derivative with respect to 𝝎\bm{\omega}, we get

𝝂𝝎FS(𝝎,𝚅S(𝝎))+𝝂𝝂FS(𝝎,𝚅S(𝝎))𝚅S(𝝎)=0.\partial_{\bm{\nu}\bm{\omega}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega}))+\partial_{\bm{\nu}\bm{\nu}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega}))\nabla{\tt V}_{S}(\bm{\omega})=0. (56)

Taking the total derivative of 𝝎\bm{\omega} again on P(𝝎)\nabla P(\bm{\omega}), we further have

2P(𝝎)=𝝎𝝎FS(𝝎,𝚅S(𝝎))+𝝎𝝂FS(𝝎,𝚅S(𝝎))𝚅S(𝝎)=𝝎𝝎FS(𝝎,𝚅S(𝝎))𝝎𝝂FS(𝝎,𝚅S(𝝎))𝝂𝝂1FS(𝝎,𝚅S(𝝎))𝝂𝝎FS(𝝎,𝚅S(𝝎))=𝙳𝝎𝝎FS(𝝎,𝚅S(𝝎)).\begin{split}\nabla^{2}P(\bm{\omega})=&\partial_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega}))+\partial_{\bm{\omega}\bm{\nu}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega}))\nabla{\tt V}_{S}(\bm{\omega})\\ =&\partial_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega}))-\partial_{\bm{\omega}\bm{\nu}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega}))\partial_{\bm{\nu}\bm{\nu}}^{-1}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega}))\partial_{\bm{\nu}\bm{\omega}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega}))\\ =&{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega})).\end{split} (57)

Based on the equality of 2P(𝝎)\nabla^{2}P(\bm{\omega}) and 𝙳𝝎𝝎FS(𝝎,𝚅S(𝝎)){\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega})) above and Lemma 5, we have

𝝎FS(𝝎,𝚅S(𝝎))𝝎FS(𝝎,𝚅S(𝝎))𝙳𝝎𝝎FS(𝝎)(𝝎𝝎)=P(𝝎)P(𝝎)2P(𝝎)(𝝎𝝎)M2𝝎𝝎2.\begin{split}&\|\nabla_{\bm{\omega}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega}))-\nabla_{\bm{\omega}}F_{S}(\bm{\omega}^{\prime},{\tt V}_{S}(\bm{\omega}^{\prime}))-{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}^{\prime})(\bm{\omega}-\bm{\omega}^{\prime})\|\\ =&\|\nabla P(\bm{\omega})-\nabla P(\bm{\omega}^{\prime})-\nabla^{2}P(\bm{\omega}^{\prime})(\bm{\omega}-\bm{\omega}^{\prime})\|\\ \leq&\frac{M}{2}\|\bm{\omega}-\bm{\omega}^{\prime}\|^{2}.\end{split} (58)
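
The identity \nabla^{2}P(\bm{\omega})={\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega},{\tt V}_{S}(\bm{\omega})) in eq.(57) can also be verified numerically. The sketch below finite-differences the primal function of a toy quadratic objective (an illustrative assumption, not our general setting) and compares the result against the closed-form total Hessian:

import numpy as np

# Finite-difference check of eq.(57): Hessian of P(w) = max_v F_S(w, v) equals
# the total Hessian D_ww F_S. Toy objective (illustrative assumption):
#   F_S(w, v) = (mu_w/2)||w - x_bar||^2 + v^T (B w - y_bar) - (mu_v/2)||v||^2.
rng = np.random.default_rng(2)
d1, d2 = 4, 3
mu_w, mu_v = 1.0, 2.0
B = rng.normal(size=(d2, d1))
x_bar, y_bar = rng.normal(size=d1), rng.normal(size=d2)

def P(w):
    v = (B @ w - y_bar) / mu_v                       # best response V_S(w)
    return 0.5 * mu_w * np.sum((w - x_bar) ** 2) + v @ (B @ w - y_bar) - 0.5 * mu_v * v @ v

def hessian_fd(fun, w, h=1e-4):
    # Central finite-difference Hessian (exact up to round-off for quadratics).
    d = w.size
    H = np.zeros((d, d))
    for i in range(d):
        for j in range(d):
            ei, ej = h * np.eye(d)[i], h * np.eye(d)[j]
            H[i, j] = (fun(w + ei + ej) - fun(w + ei - ej)
                       - fun(w - ei + ej) + fun(w - ei - ej)) / (4 * h * h)
    return H

w0 = rng.normal(size=d1)
D_ww = mu_w * np.eye(d1) + B.T @ B / mu_v            # closed-form total Hessian
print(np.max(np.abs(hessian_fd(P, w0) - D_ww)))      # small (finite-difference round-off only)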

Lemma 7.

Under Assumptions 1 and 2, for all 𝛚𝒲\bm{\omega}\in\mathcal{W} and 𝛎𝒱\bm{\nu}\in\mathcal{V}, we have 𝙳𝛚𝛚f(𝛚,𝛎;z)+2μ𝛎\|{\tt D}_{\bm{\omega}\bm{\omega}}f(\bm{\omega},\bm{\nu};z)\|\leq\ell+\frac{\ell^{2}}{\mu_{\bm{\nu}}}.

Proof.

By the definition of the total Hessian, we have

𝙳𝝎𝝎f(𝝎,𝝂;z)=𝝎𝝎f(𝝎,𝝂;z)𝝎𝝂f(𝝎,𝝂;z)𝝂𝝂1f(𝝎,𝝂;z)𝝂𝝎f(𝝎,𝝂;z)(i)𝝎𝝎f(𝝎,𝝂;z)+𝝎𝝂f(𝝎,𝝂;z)𝝂𝝂1f(𝝎,𝝂;z)𝝂𝝎f(𝝎,𝝂;z)(ii)+μ𝝂1=+2μ𝝂,\begin{split}\|{\tt D}_{\bm{\omega}\bm{\omega}}f(\bm{\omega},\bm{\nu};z)\|=&\|\partial_{\bm{\omega}\bm{\omega}}f(\bm{\omega},\bm{\nu};z)-\partial_{\bm{\omega}\bm{\nu}}f(\bm{\omega},\bm{\nu};z)\partial_{\bm{\nu}\bm{\nu}}^{-1}f(\bm{\omega},\bm{\nu};z)\partial_{\bm{\nu}\bm{\omega}}f(\bm{\omega},\bm{\nu};z)\|\\ \stackrel{{\scriptstyle(i)}}{{\leq}}&\|\partial_{\bm{\omega}\bm{\omega}}f(\bm{\omega},\bm{\nu};z)\|+\|\partial_{\bm{\omega}\bm{\nu}}f(\bm{\omega},\bm{\nu};z)\partial_{\bm{\nu}\bm{\nu}}^{-1}f(\bm{\omega},\bm{\nu};z)\partial_{\bm{\nu}\bm{\omega}}f(\bm{\omega},\bm{\nu};z)\|\\ \stackrel{{\scriptstyle(ii)}}{{\leq}}&\ell+\ell\cdot\mu_{\bm{\nu}}^{-1}\cdot\ell=\ell+\frac{\ell^{2}}{\mu_{\bm{\nu}}},\end{split} (59)

where the inequality (ii) uses the triangle inequality and the inequality (iiii) is due to the fact that the function ff has \ell-Lipschitz gradients and ff is μ𝝂\mu_{\bm{\nu}}-strongly concave in 𝝂\bm{\nu}, thus we have 𝝎𝝎f(𝝎,𝝂;z)\|\partial_{\bm{\omega}\bm{\omega}}f(\bm{\omega},\bm{\nu};z)\|\leq\ell, 𝝎𝝂f(𝝎,𝝂;z)\|\partial_{\bm{\omega}\bm{\nu}}f(\bm{\omega},\bm{\nu};z)\|\leq\ell, 𝝂𝝎f(𝝎,𝝂;z)\|\partial_{\bm{\nu}\bm{\omega}}f(\bm{\omega},\bm{\nu};z)\|\leq\ell and 𝝂𝝂1f(𝝎,𝝂;z)μ𝝂1\|\partial_{\bm{\nu}\bm{\nu}}^{-1}f(\bm{\omega},\bm{\nu};z)\|\leq\mu_{\bm{\nu}}^{-1}. ∎

Lemma 8 ((Zhang et al., 2022a, Lemma 4.4)).

Under Assumptions 1 and 2, the population weak PD risk for the minimax learning variables (𝛚S,𝛎S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) returned by Algorithm 1 satisfies

w(𝝎S,𝝂S)22L2μn.\triangle^{w}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\leq\frac{2\sqrt{2}L^{2}}{\mu n}. (60)
Lemma 9 ((Zhang et al., 2021, Theorem 2)).

Under Assumptions 1 and 2, the population strong PD risk for the minimax learning variables (𝛚S,𝛎S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) returned by Algorithm 1 satisfies

s(𝝎S,𝝂S)22L2n2μ𝝎μ𝝂+1(1μ𝝎+1μ𝝂)8L2μ2n.\triangle^{s}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\leq\frac{2\sqrt{2}L^{2}}{n}\cdot\sqrt{\frac{\ell^{2}}{\mu_{\bm{\omega}}\mu_{\bm{\nu}}}+1}\cdot\left(\frac{1}{\mu_{\bm{\omega}}}+\frac{1}{\mu_{\bm{\nu}}}\right)\leq\frac{8L^{2}\ell}{\mu^{2}n}. (61)

Appendix B Detailed Algorithm Analysis and Missing Proofs in Section 4

B.1 Analysis for Algorithm 2

In the following, we provide the analysis for Algorithm 2 in terms of the (ϵ,δ)(\epsilon,\delta)-certified unlearning guarantee, the population primal-dual risk, and the deletion capacity, together with the corresponding proofs.

Lemma 10 (Closeness Upper Bound).

Suppose the loss function ff satisfies Assumption 1 and 2, 𝙳𝛚𝛚FS(𝛚S,𝛎S)μ𝛚𝛚\|{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\|\geq\mu_{\bm{\omega}\bm{\omega}} and 𝙳𝛎𝛎FS(𝛚S,𝛎S)μ𝛎𝛎\|{\tt D}_{\bm{\nu}\bm{\nu}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\|\geq\mu_{\bm{\nu}\bm{\nu}}. Let μ=min{μ𝛚,μ𝛎,μ𝛚𝛚,μ𝛎𝛎}\mu=\min\{\mu_{\bm{\omega}},\mu_{\bm{\nu}},\mu_{\bm{\omega}\bm{\omega}},\mu_{\bm{\nu}\bm{\nu}}\}. Then, we have the closeness bound between (𝛚^,𝛎^)(\widehat{\bm{\omega}},\widehat{\bm{\nu}}) in Line 2 of Algorithm 2 and (𝛚S,𝛎S)(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}) in eq.(9):

{𝝎S𝝎^,𝝂S𝝂^}(82L23ρ/μ5+8L2/μ2)m2n(μn(+2/μ)m).\{\|\bm{\omega}_{S^{\setminus}}^{*}-\widehat{\bm{\omega}}\|,\|\bm{\nu}_{S^{\setminus}}^{*}-\widehat{\bm{\nu}}\|\}\leq\frac{(8\sqrt{2}L^{2}\ell^{3}\rho/\mu^{5}+8L\ell^{2}/\mu^{2})m^{2}}{n(\mu n-(\ell+\ell^{2}/\mu)m)}. (62)
Proof.

Recall that the empirical loss functions FS(𝝎,𝝂)F_{S^{\setminus}}(\bm{\omega},\bm{\nu}) and FS(𝝎,𝝂)F_{S}(\bm{\omega},\bm{\nu}) are

FS(𝝎,𝝂):=1nmziSUf(𝝎,𝝂;zi),andFS(𝝎,𝝂):=1nziSf(𝝎,𝝂;zi).F_{S^{\setminus}}(\bm{\omega},\bm{\nu}):=\frac{1}{n-m}\sum_{z_{i}\in S\setminus U}f(\bm{\omega},\bm{\nu};z_{i}),\quad\text{and}\quad F_{S}(\bm{\omega},\bm{\nu}):=\frac{1}{n}\sum_{z_{i}\in S}f(\bm{\omega},\bm{\nu};z_{i}). (63)

We focus on the key term 𝝎FS(𝝎S,𝚅S(𝝎S))𝝎FS(𝝎S,𝝂S)𝙳𝝎𝝎FS(𝝎S,𝝂S)(𝝎S𝝎S)\nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*}))-\nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})-{\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*}), which has the following conversions

𝝎FS(𝝎S,𝚅S(𝝎S))𝝎FS(𝝎S,𝝂S)𝙳𝝎𝝎FS(𝝎S,𝝂S)(𝝎S𝝎S)=nnm[𝝎FS(𝝎S,𝚅S(𝝎S))𝝎FS(𝝎S,𝝂S)𝙳𝝎𝝎FS(𝝎S,𝝂S)(𝝎S𝝎S)]1nmziU[𝝎f(𝝎S,𝚅S(𝝎S);zi)𝝎f(𝝎S,𝝂S;zi)]+1nmziU𝙳𝝎𝝎f(𝝎S,𝝂S;zi)(𝝎S𝝎S)nnm𝝎FS(𝝎S,𝚅S(𝝎S))𝝎FS(𝝎S,𝝂S)𝙳𝝎𝝎FS(𝝎S,𝝂S)(𝝎S𝝎S)+1nmziU𝝎f(𝝎S,𝚅S(𝝎S);zi)𝝎f(𝝎S,𝝂S;zi)+1nmziU𝙳𝝎𝝎f(𝝎S,𝝂S;zi)(𝝎S𝝎S).\begin{split}&\|\nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*}))-\nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})-{\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*})\|\\ =&\|\frac{n}{n-m}[\nabla_{\bm{\omega}}F_{S}(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*}))-\nabla_{\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})-{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*})]\\ &-\frac{1}{n-m}\sum_{z_{i}\in U}[\nabla_{\bm{\omega}}f(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*});z_{i})-\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})]\\ &+\frac{1}{n-m}\sum_{z_{i}\in U}{\tt D}_{\bm{\omega}\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*})\|\\ \leq&\frac{n}{n-m}\|\nabla_{\bm{\omega}}F_{S}(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*}))-\nabla_{\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})-{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*})\|\\ &+\frac{1}{n-m}\sum_{z_{i}\in U}\|\nabla_{\bm{\omega}}f(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*});z_{i})-\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\|\\ &+\frac{1}{n-m}\|\sum_{z_{i}\in U}{\tt D}_{\bm{\omega}\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*})\|.\end{split} (64)

We denote κ𝝂=/μ𝝂\kappa_{\bm{\nu}}=\ell/\mu_{\bm{\nu}}. For the first term on the right-hand side of the inequality in eq.(64), we have

nnm𝝎FS(𝝎S,𝚅S(𝝎S))𝝎FS(𝝎S,𝝂S)𝙳𝝎𝝎FS(𝝎S,𝝂S)(𝝎S𝝎S)(i)nnm22κ𝝂3ρ𝝎S𝝎S2(ii)82κ𝝂3ρL2m2μ𝝎2n(nm)82ρL23m2μ5n(nm),\begin{split}&\frac{n}{n-m}\|\nabla_{\bm{\omega}}F_{S}(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*}))-\nabla_{\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})-{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*})\|\\ \stackrel{{\scriptstyle(i)}}{{\leq}}&\frac{n}{n-m}\cdot 2\sqrt{2}\kappa_{\bm{\nu}}^{3}\rho\|\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*}\|^{2}\stackrel{{\scriptstyle(ii)}}{{\leq}}\frac{8\sqrt{2}\kappa_{\bm{\nu}}^{3}\rho L^{2}m^{2}}{\mu_{\bm{\omega}}^{2}n(n-m)}\leq\frac{8\sqrt{2}\rho L^{2}\ell^{3}m^{2}}{\mu^{5}n(n-m)},\end{split} (65)

where the inequality (ii) is by Lemma 6 and the inequality (iiii) is by Lemma 3.

For the second term on the right-hand side of the inequality in eq.(64), we have

1nmziU𝝎f(𝝎S,𝚅S(𝝎S);zi)𝝎f(𝝎S,𝝂S;zi)(i)1nmm𝝎S𝝎S2+𝚅S(𝝎S)𝚅S(𝝎S)2(ii)1nmm𝝎S𝝎S2+κ𝝂2𝝎S𝝎S2(iii)2Llm21+κ𝝂2μ𝝎n(nm)22Llκ𝝂m2μn(nm)22Ll2m2μ2n(nm),\begin{split}&\frac{1}{n-m}\sum_{z_{i}\in U}\|\nabla_{\bm{\omega}}f(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*});z_{i})-\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\|\\ \stackrel{{\scriptstyle(i)}}{{\leq}}&\frac{1}{n-m}\cdot m\ell\sqrt{\|\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*}\|^{2}+\|{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*})-{\tt V}_{S}(\bm{\omega}_{S}^{*})\|^{2}}\\ \stackrel{{\scriptstyle(ii)}}{{\leq}}&\frac{1}{n-m}\cdot m\ell\sqrt{\|\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*}\|^{2}+\kappa_{\bm{\nu}}^{2}\|\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*}\|^{2}}\\ \stackrel{{\scriptstyle(iii)}}{{\leq}}&\frac{2Llm^{2}\sqrt{1+\kappa_{\bm{\nu}}^{2}}}{\mu_{\bm{\omega}}n(n-m)}\leq\frac{2\sqrt{2}Ll\kappa_{\bm{\nu}}m^{2}}{\mu n(n-m)}\leq\frac{2\sqrt{2}Ll^{2}m^{2}}{\mu^{2}n(n-m)},\end{split} (66)

where the inequality (ii) follows from the fact that the function 𝝎f(,)\nabla_{\bm{\omega}}f(\cdot,\cdot) is \ell-Lipschitz continuous and 𝝂S=𝚅S(𝝎S)\bm{\nu}_{S}^{*}={\tt V}_{S}(\bm{\omega}_{S}^{*}). The inequality (iiii) holds by Lemma 4, and the inequality (iiiiii) is by Lemma 3.

For the third term on the right-hand side of the inequality in eq.(64), we have

1nmziU𝙳𝝎𝝎f(𝝎S,𝝂S;zi)(𝝎S𝝎S)(+2μ𝝂)2Lm2μ𝝎n(nm)4L2m2μ2n(nm),\begin{split}&\frac{1}{n-m}\|\sum_{z_{i}\in U}{\tt D}_{\bm{\omega}\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*})\|\\ \leq&(\ell+\frac{\ell^{2}}{\mu_{\bm{\nu}}})\cdot\frac{2Lm^{2}}{\mu_{\bm{\omega}}n(n-m)}\leq\frac{4L\ell^{2}m^{2}}{\mu^{2}n(n-m)},\end{split} (67)

where the first inequality is by Lemma 7. Plugging eq.(65), eq.(66) and eq.(67) into eq.(64), we further get

𝝎FS(𝝎S,𝚅S(𝝎S))𝝎FS(𝝎S,𝝂S)𝙳𝝎𝝎FS(𝝎S,𝝂S)(𝝎S𝝎S)(82L23ρ/μ5+8L2/μ2)m2n(nm).\begin{split}&\|\nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*}))-\nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})-{\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*})\|\\ \leq&(8\sqrt{2}L^{2}\ell^{3}\rho/\mu^{5}+8L\ell^{2}/\mu^{2})\frac{m^{2}}{n(n-m)}.\end{split} (68)

The above derivation yields an upper bound result. In the following, we derive a lower bound result. Let 𝐱\mathbf{x} be the vector satisfying the following relation,

𝝎S=𝝎S+1nm[𝙳𝝎𝝎FS(𝝎S,𝝂S)]1ziU𝝎f(𝝎S,𝝂S;zi)+𝐱.\bm{\omega}_{S^{\setminus}}^{*}=\bm{\omega}_{S}^{*}+\frac{1}{n-m}[{\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})+\mathbf{x}. (69)

Since we have 𝝎FS(𝝎S,𝝂S)=1nmziU𝝎f(𝝎S,𝝂S;zi)\nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})=-\frac{1}{n-m}\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i}) and 𝝎FS(𝝎S,𝚅S(𝝎S))=0\nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*}))=0 due to the optimality of 𝝎S\bm{\omega}_{S^{\setminus}}^{*}, plugging eq.(69) into eq.(68), we get that

𝙳𝝎𝝎FS(𝝎S,𝝂S)𝐱(82L23ρ/μ5+8L2/μ2)m2n(nm).\|{\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\cdot\mathbf{x}\|\leq(8\sqrt{2}L^{2}\ell^{3}\rho/\mu^{5}+8L\ell^{2}/\mu^{2})\frac{m^{2}}{n(n-m)}. (70)

For the left-hand side of eq.(70), with 𝙳𝝎𝝎FS(𝝎S,𝝂S)μ𝝎𝝎\|{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\|\geq\mu_{\bm{\omega}\bm{\omega}}, we have

𝙳𝝎𝝎FS(𝝎S,𝝂S)𝐱=1nm[ziSU𝙳𝝎𝝎f(𝝎S,𝝂S;zi)]𝐱=1nm[ziS𝙳𝝎𝝎f(𝝎S,𝝂S;zi)ziU𝙳𝝎𝝎f(𝝎S,𝝂S;zi)]𝐱1nm(n𝙳𝝎𝝎FS(𝝎S,𝝂S)ziU𝙳𝝎𝝎f(𝝎S,𝝂S;zi))𝐱(μ𝝎𝝎n(+2/μ𝝂)m)nm𝐱(μn(+2/μ)m)nm𝐱,\begin{split}\|{\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\cdot\mathbf{x}\|&=\frac{1}{n-m}\|[\sum_{z_{i}\in S\setminus U}{\tt D}_{\bm{\omega}\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})]\cdot\mathbf{x}\|\\ &=\frac{1}{n-m}\|[\sum_{z_{i}\in S}{\tt D}_{\bm{\omega}\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})-\sum_{z_{i}\in U}{\tt D}_{\bm{\omega}\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})]\cdot\mathbf{x}\|\\ &\geq\frac{1}{n-m}\bigg{(}\|n{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\|-\|\sum_{z_{i}\in U}{\tt D}_{\bm{\omega}\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\|\bigg{)}\cdot\|\mathbf{x}\|\\ &\geq\frac{(\mu_{\bm{\omega}\bm{\omega}}n-(\ell+\ell^{2}/\mu_{\bm{\nu}})m)}{n-m}\|\mathbf{x}\|\geq\frac{(\mu n-(\ell+\ell^{2}/\mu)m)}{n-m}\|\mathbf{x}\|,\end{split} (71)

where the second inequality is by Lemma 7. Combining eq.(70) and eq.(71) with the definition of the vector 𝐱\mathbf{x} in eq.(69), we get that

𝝎S𝝎^=𝐱(82L23ρ/μ5+8L2/μ2)m2n(μn(+2/μ)m).\|\bm{\omega}_{S^{\setminus}}^{*}-\widehat{\bm{\omega}}\|=\|\mathbf{x}\|\leq\frac{(8\sqrt{2}L^{2}\ell^{3}\rho/\mu^{5}+8L\ell^{2}/\mu^{2})m^{2}}{n(\mu n-(\ell+\ell^{2}/\mu)m)}. (72)

Symmetrically, we can get that 𝝂S𝝂^(82L23ρ/μ5+8L2/μ2)m2n(μn(+2/μ)m)\|\bm{\nu}_{S^{\setminus}}^{*}-\widehat{\bm{\nu}}\|\leq\frac{(8\sqrt{2}L^{2}\ell^{3}\rho/\mu^{5}+8L\ell^{2}/\mu^{2})m^{2}}{n(\mu n-(\ell+\ell^{2}/\mu)m)}. ∎
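
For intuition, the sketch below applies the total-Hessian-based update that underlies eq.(69) (the complete Newton step, with the Gaussian noise omitted and the symmetric dual update not shown) to a toy strongly-convex-strongly-concave quadratic loss. The loss, constants, and data are illustrative assumptions, and the printed distances merely illustrate the update rather than verify the bound of Lemma 10.

import numpy as np

# Sketch of the complete-Newton unlearning update in eq.(69), noise omitted, for the
# toy loss f(w, v; z) = (mu_w/2)||w - z_x||^2 + v^T(B w - z_y) - (mu_v/2)||v||^2
# (an illustrative assumption, not the paper's general setting).
rng = np.random.default_rng(0)
d1, d2, n, m = 5, 3, 2000, 20
mu_w, mu_v = 1.0, 1.0
B = 0.1 * rng.normal(size=(d2, d1))
X, Y = rng.normal(size=(n, d1)), rng.normal(size=(n, d2))

def saddle_point(X, Y):
    # Closed-form saddle point of the empirical objective for this toy loss.
    x_bar, y_bar = X.mean(axis=0), Y.mean(axis=0)
    D_ww = mu_w * np.eye(d1) + B.T @ B / mu_v             # total Hessian D_ww F
    w = np.linalg.solve(D_ww, mu_w * x_bar + B.T @ y_bar / mu_v)
    return w, (B @ w - y_bar) / mu_v                      # (w*, V(w*))

w_S, v_S = saddle_point(X, Y)                             # learned on S
w_rem, _ = saddle_point(X[m:], Y[m:])                     # retrained on S \ U

# w_hat = w_S + [D_ww F_{S\U}(w_S, v_S)]^{-1} * (1/(n-m)) * sum_{z in U} grad_w f(w_S, v_S; z)
grad_w_U = (mu_w * (w_S - X[:m]) + v_S @ B).sum(axis=0)
D_ww_rem = mu_w * np.eye(d1) + B.T @ B / mu_v             # data-independent for this toy loss
w_hat = w_S + np.linalg.solve(D_ww_rem, grad_w_U) / (n - m)

print("distance to retraining before the update:", np.linalg.norm(w_S - w_rem))
print("distance to retraining after  the update:", np.linalg.norm(w_hat - w_rem))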

Theorem 6 ((ϵ,δ)(\epsilon,\delta)-Minimax Unlearning Certification).

Under the same settings of Lemma 10, the pair of our minimax learning algorithm AscscA_{sc-sc} and unlearning algorithm A¯scsc\bar{A}_{sc-sc} is (ϵ,δ)(\epsilon,\delta)-certified minimax unlearning if we choose

σ1 and σ2=2(82L23ρ/μ5+8L2/μ2)m2n(μn(+2/μ)m)ϵ2log(2.5/δ).\sigma_{1}\text{~{}and~{}}\sigma_{2}=\frac{2(8\sqrt{2}L^{2}\ell^{3}\rho/\mu^{5}+8L\ell^{2}/\mu^{2})m^{2}}{n(\mu n-(\ell+\ell^{2}/\mu)m)\epsilon}\sqrt{2\log(2.5/\delta)}. (73)
Proof.

Our proof for (ϵ,δ)(\epsilon,\delta)-minimax unlearning certification is similar to the one used for the differential privacy guarantee of the Gaussian mechanism (Dwork et al., 2014).

Let (𝝎S,𝝂S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) be the output of the learning algorithm AscscA_{sc-sc} trained on dataset SS and (𝝎u,𝝂u)(\bm{\omega}^{u},\bm{\nu}^{u}) be the output of the unlearning algorithm A¯scsc\bar{A}_{sc-sc} running with delete requests UU, the learned model (𝝎S,𝝂S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}), and the memory variables T(S)T(S). Then we have (𝝎S,𝝂S)=Ascsc(S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})=A_{sc-sc}(S) and (𝝎u,𝝂u)=A¯scsc(U,Ascsc(S),T(S))(\bm{\omega}^{u},\bm{\nu}^{u})=\bar{A}_{sc-sc}(U,A_{sc-sc}(S),T(S)). We also denote the intermediate variables before adding noise in algorithm A¯scsc\bar{A}_{sc-sc} as (𝝎^,𝝂^)(\widehat{\bm{\omega}},\widehat{\bm{\nu}}), and we have 𝝎u=𝝎^+𝝃1\bm{\omega}^{u}=\widehat{\bm{\omega}}+\bm{\xi}_{1} and 𝝂u=𝝂^+𝝃2\bm{\nu}^{u}=\widehat{\bm{\nu}}+\bm{\xi}_{2}.

Similarly, let (𝝎S,𝝂S)(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}) be the output of the learning algorithm AscscA_{sc-sc} trained on dataset SUS\setminus U and (𝝎Su,𝝂Su)(\bm{\omega}^{u}_{S^{\setminus}},\bm{\nu}^{u}_{S^{\setminus}}) be the output of the unlearning algorithm A¯scsc\bar{A}_{sc-sc} running with delete requests \emptyset, the learned model (𝝎S,𝝂S)(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}), and the memory variables T(SU)T(S\setminus U). Then we have (𝝎S,𝝂S)=Ascsc(SU)(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*})=A_{sc-sc}(S\setminus U) and (𝝎Su,𝝂Su)=A¯scsc(,Ascsc(SU),T(SU))(\bm{\omega}^{u}_{S^{\setminus}},\bm{\nu}^{u}_{S^{\setminus}})=\bar{A}_{sc-sc}(\emptyset,A_{sc-sc}(S\setminus U),T(S\setminus U)). We also denote the intermediate variables before adding noise in algorithm A¯scsc\bar{A}_{sc-sc} as (𝝎^S,𝝂^S)(\widehat{\bm{\omega}}_{S^{\setminus}},\widehat{\bm{\nu}}_{S^{\setminus}}), and we have 𝝎Su=𝝎^S+𝝃1\bm{\omega}^{u}_{S^{\setminus}}=\widehat{\bm{\omega}}_{S^{\setminus}}+\bm{\xi}_{1} and 𝝂Su=𝝂^S+𝝃2\bm{\nu}^{u}_{S^{\setminus}}=\widehat{\bm{\nu}}_{S^{\setminus}}+\bm{\xi}_{2}. Note that 𝝎^S=𝝎S\widehat{\bm{\omega}}_{S^{\setminus}}=\bm{\omega}_{S^{\setminus}}^{*} and 𝝂^S=𝝂S\widehat{\bm{\nu}}_{S^{\setminus}}=\bm{\nu}_{S^{\setminus}}^{*}.

We sample the noise 𝝃1𝒩(0,σ1𝐈d1)\bm{\xi}_{1}\sim\mathcal{N}(0,\sigma_{1}\mathbf{I}_{d_{1}}) and 𝝃2𝒩(0,σ2𝐈d2)\bm{\xi}_{2}\sim\mathcal{N}(0,\sigma_{2}\mathbf{I}_{d_{2}}) with the scale:

{σ1=𝝎S𝝎^2log(2.5/δ)ϵ/2=𝝎^S𝝎^2log(2.5/δ)ϵ/2,σ2=𝝂S𝝂^2log(2.5/δ)ϵ/2=𝝂^S𝝂^2log(2.5/δ)ϵ/2,\left\{\begin{array}[]{lr}\sigma_{1}=\|\bm{\omega}_{S^{\setminus}}^{*}-\widehat{\bm{\omega}}\|\cdot\frac{\sqrt{2\log(2.5/\delta)}}{\epsilon/2}=\|\widehat{\bm{\omega}}_{S^{\setminus}}-\widehat{\bm{\omega}}\|\cdot\frac{\sqrt{2\log(2.5/\delta)}}{\epsilon/2},\\ \sigma_{2}=\|\bm{\nu}_{S^{\setminus}}^{*}-\widehat{\bm{\nu}}\|\cdot\frac{\sqrt{2\log(2.5/\delta)}}{\epsilon/2}=\|\widehat{\bm{\nu}}_{S^{\setminus}}-\widehat{\bm{\nu}}\|\cdot\frac{\sqrt{2\log(2.5/\delta)}}{\epsilon/2},\end{array}\right. (74)

where 𝝎S𝝎^\|\bm{\omega}_{S^{\setminus}}^{*}-\widehat{\bm{\omega}}\| and 𝝂S𝝂^\|\bm{\nu}_{S^{\setminus}}^{*}-\widehat{\bm{\nu}}\| are given in Lemma 10. Then, following the same proof as Dwork et al. (2014, Theorem A.1) together with the composition property of DP (Vadhan, 2017, Lemma 7.2.3), we get that, for any set OΘO\subseteq\varTheta where Θ:=𝒲×𝒱\varTheta:=\mathcal{W}\times\mathcal{V},

Pr[(𝝎u,𝝂u)O]eϵPr[(𝝎Su,𝝂Su)O]+δ,andPr[(𝝎Su,𝝂Su)O]eϵPr[(𝝎u,𝝂u)O]+δ,\Pr[(\bm{\omega}^{u},\bm{\nu}^{u})\in O]\leq e^{\epsilon}\Pr[(\bm{\omega}^{u}_{S^{\setminus}},\bm{\nu}^{u}_{S^{\setminus}})\in O]+\delta,\;\text{and}\;\Pr[(\bm{\omega}^{u}_{S^{\setminus}},\bm{\nu}^{u}_{S^{\setminus}})\in O]\leq e^{\epsilon}\Pr[(\bm{\omega}^{u},\bm{\nu}^{u})\in O]+\delta, (75)

which implies that the algorithm pair AscscA_{sc-sc} and A¯scsc\bar{A}_{sc-sc} is (ϵ,δ)(\epsilon,\delta)-certified minimax unlearning. ∎
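
The noise calibration of Theorem 6 is straightforward to implement once the closeness bound of Lemma 10 is available; the sketch below does so with placeholder constants (L, \ell, \rho, \mu, n, m, \epsilon, \delta, and the dimension are all illustrative choices), using a zero vector as a stand-in for the Newton-updated iterate:

import numpy as np

# Sketch of the Gaussian noise calibration in eq.(73)/(74); all constants below
# are illustrative placeholders, not values prescribed by the paper.
L, ell, rho, mu = 1.0, 1.0, 1.0, 0.5
n, m, d1 = 10_000, 10, 50
eps, delta = 1.0, 1e-5

# Closeness bound of Lemma 10 between the Newton-updated and retrained variables.
closeness = ((8 * np.sqrt(2) * L**2 * ell**3 * rho / mu**5 + 8 * L * ell**2 / mu**2)
             * m**2 / (n * (mu * n - (ell + ell**2 / mu) * m)))
# eq.(73): per-coordinate standard deviation for the (eps, delta)-certification.
sigma = 2 * closeness * np.sqrt(2 * np.log(2.5 / delta)) / eps

rng = np.random.default_rng(0)
w_hat = np.zeros(d1)                                  # stand-in for the Newton-updated iterate
w_u = w_hat + rng.normal(scale=sigma, size=d1)        # w^u = w_hat + xi_1, xi_1 ~ N(0, sigma^2 I)
print(f"closeness bound = {closeness:.3e}, sigma = {sigma:.3e}")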

Theorem 7 (Population Primal-Dual Risk).

Under the same settings of Lemma 10 and denote d=max{d1,d2}d=\max\{d_{1},d_{2}\}, the population weak and strong PD risk for the certified minimax unlearning variables (𝛚u,𝛎u)(\bm{\omega}^{u},\bm{\nu}^{u}) returned by Algorithm 2 are

{w(𝝎u,𝝂u)=𝒪((L33ρ/μ6+L22/μ3)m2dlog(1/δ)n2ϵ+mL2μn),s(𝝎u,𝝂u)=𝒪((L33ρ/μ6+L22/μ3)m2dlog(1/δ)n2ϵ+mL2μn+L2μ2n).\left\{\begin{aligned} &\triangle^{w}(\bm{\omega}^{u},\bm{\nu}^{u})=\mathcal{O}\left((L^{3}\ell^{3}\rho/\mu^{6}+L^{2}\ell^{2}/\mu^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\mu n}\right),\\ &\triangle^{s}(\bm{\omega}^{u},\bm{\nu}^{u})=\mathcal{O}\left((L^{3}\ell^{3}\rho/\mu^{6}+L^{2}\ell^{2}/\mu^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\mu n}+\frac{L^{2}\ell}{\mu^{2}n}\right).\end{aligned}\right. (76)
Proof.

We begin with the population weak PD risk for the certified minimax unlearning variables (𝝎u,𝝂u)(\bm{\omega}^{u},\bm{\nu}^{u}), which can be bounded as follows,

w(𝝎u,𝝂u)\displaystyle\triangle^{w}(\bm{\omega}^{u},\bm{\nu}^{u}) (77)
=\displaystyle= max𝝂𝒱𝔼[F(𝝎u,𝝂)]min𝝎𝒲𝔼[F(𝝎,𝝂u)]\displaystyle\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}[F(\bm{\omega}^{u},\bm{\nu})]-\min_{\bm{\omega}\in\mathcal{W}}\mathbb{E}[F(\bm{\omega},\bm{\nu}^{u})]
=\displaystyle= max𝝂𝒱𝔼[F(𝝎u,𝝂)F(𝝎S,𝝂)+F(𝝎S,𝝂)]min𝝎𝒲𝔼[F(𝝎,𝝂u)F(𝝎,𝝂S)+F(𝝎,𝝂S)]\displaystyle\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}[F(\bm{\omega}^{u},\bm{\nu})-F(\bm{\omega}_{S}^{*},\bm{\nu})+F(\bm{\omega}_{S}^{*},\bm{\nu})]-\min_{\bm{\omega}\in\mathcal{W}}\mathbb{E}[F(\bm{\omega},\bm{\nu}^{u})-F(\bm{\omega},\bm{\nu}_{S}^{*})+F(\bm{\omega},\bm{\nu}_{S}^{*})]
\displaystyle\leq max𝝂𝒱𝔼[F(𝝎u,𝝂)F(𝝎S,𝝂)]+max𝝂𝒱𝔼[F(𝝎S,𝝂)]min𝝎𝒲𝔼[F(𝝎,𝝂u)F(𝝎,𝝂S)]min𝝎𝒲𝔼[F(𝝎,𝝂S)]\displaystyle\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}[F(\bm{\omega}^{u},\bm{\nu})-F(\bm{\omega}_{S}^{*},\bm{\nu})]+\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}[F(\bm{\omega}_{S}^{*},\bm{\nu})]-\min_{\bm{\omega}\in\mathcal{W}}\mathbb{E}[F(\bm{\omega},\bm{\nu}^{u})-F(\bm{\omega},\bm{\nu}_{S}^{*})]-\min_{\bm{\omega}\in\mathcal{W}}\mathbb{E}[F(\bm{\omega},\bm{\nu}_{S}^{*})]
=\displaystyle= max𝝂𝒱𝔼[F(𝝎u,𝝂)F(𝝎S,𝝂)]+max𝝎𝒲𝔼[(F)(𝝎,𝝂u)(F)(𝝎,𝝂S)]\displaystyle\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}[F(\bm{\omega}^{u},\bm{\nu})-F(\bm{\omega}_{S}^{*},\bm{\nu})]+\max_{\bm{\omega}\in\mathcal{W}}\mathbb{E}[(-F)(\bm{\omega},\bm{\nu}^{u})-(-F)(\bm{\omega},\bm{\nu}_{S}^{*})]
+max𝝂𝒱𝔼[F(𝝎S,𝝂)]min𝝎𝒲𝔼[F(𝝎,𝝂S)]\displaystyle+\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}[F(\bm{\omega}_{S}^{*},\bm{\nu})]-\min_{\bm{\omega}\in\mathcal{W}}\mathbb{E}[F(\bm{\omega},\bm{\nu}_{S}^{*})]
(i)\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}} 𝔼[L𝝎u𝝎S]+𝔼[L𝝂u𝝂S]+w(𝝎S,𝝂S)\displaystyle\mathbb{E}[L\|\bm{\omega}^{u}-\bm{\omega}_{S}^{*}\|]+\mathbb{E}[L\|\bm{\nu}^{u}-\bm{\nu}_{S}^{*}\|]+\triangle^{w}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})
(ii)\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}} 𝔼[L𝝎u𝝎S]+𝔼[L𝝂u𝝂S]+22L2μn,\displaystyle\mathbb{E}[L\|\bm{\omega}^{u}-\bm{\omega}_{S}^{*}\|]+\mathbb{E}[L\|\bm{\nu}^{u}-\bm{\nu}_{S}^{*}\|]+\frac{2\sqrt{2}L^{2}}{\mu n},

where the inequality (ii) holds because the population loss function F(𝝎,𝝂):=𝔼[f(𝝎,𝝂;z)]F(\bm{\omega},\bm{\nu}):=\mathbb{E}[f(\bm{\omega},\bm{\nu};z)] is LL-Lipschitz continuous. The inequality (iiii) is by Lemma 8.

By recalling the unlearning update step in Algorithm 2, we have

𝝎u=𝝎S+1nm[𝙳𝝎𝝎FS(𝝎S,𝝂S)]1ziU𝝎f(𝝎S,𝝂S;zi)+𝝃1,\bm{\omega}^{u}=\bm{\omega}_{S}^{*}+\frac{1}{n-m}[{\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})+\bm{\xi}_{1}, (78)

where the vector 𝝃1d1\bm{\xi}_{1}\in\mathbb{R}^{d_{1}} is drawn independently from 𝒩(0,σ12𝐈d1)\mathcal{N}(0,\sigma_{1}^{2}\mathbf{I}_{d_{1}}). From the relation in eq.(78), we further get

𝔼[𝝎u𝝎S]=\displaystyle\mathbb{E}[\|\bm{\omega}^{u}-\bm{\omega}_{S}^{*}\|]= 𝔼[1nm[𝙳𝝎𝝎FS(𝝎S,𝝂S)]1ziU𝝎f(𝝎S,𝝂S;zi)+𝝃1]\displaystyle\mathbb{E}\left[\left\|\frac{1}{n-m}[{\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\cdot\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})+\bm{\xi}_{1}\right\|\right] (79)
(i)\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}} 1nm𝔼[[𝙳𝝎𝝎FS(𝝎S,𝝂S)]1ziU𝝎f(𝝎S,𝝂S;zi)]+𝔼[𝝃1]\displaystyle\frac{1}{n-m}\mathbb{E}\left[\left\|[{\tt D}_{\bm{\omega}\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\cdot\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\right\|\right]+\mathbb{E}[\|\bm{\xi}_{1}\|]
(ii)\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}} 1nmnm(μn(1+/μ)m)𝔼[ziU𝝎f(𝝎S,𝝂S;zi)]+𝔼[𝝃12]\displaystyle\frac{1}{n-m}\cdot\frac{n-m}{(\mu n-\ell(1+\ell/\mu)m)}\mathbb{E}\left[\left\|\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\right\|\right]+\sqrt{\mathbb{E}[\|\bm{\xi}_{1}\|^{2}]}
=(iii)\displaystyle\stackrel{{\scriptstyle(iii)}}{{=}} 1(μn(1+/μ)m)𝔼[ziU𝝎f(𝝎S,𝝂S;zi)]+d1σ1\displaystyle\frac{1}{(\mu n-\ell(1+\ell/\mu)m)}\mathbb{E}\left[\left\|\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\right\|\right]+\sqrt{d_{1}}\sigma_{1}
(iv)\displaystyle\stackrel{{\scriptstyle(iv)}}{{\leq}} mL(μn(1+/μ)m)+d1σ1,\displaystyle\frac{mL}{(\mu n-\ell(1+\ell/\mu)m)}+\sqrt{d_{1}}\sigma_{1},

where the inequality (ii) is by the triangle inequality and the inequality (iiii) follows from the relation in eq.(71), together with Jensen’s inequality to bound 𝔼[𝝃1]\mathbb{E}[\|\bm{\xi}_{1}\|]. The equality (iiiiii) holds because the vector 𝝃1𝒩(0,σ12𝐈d1)\bm{\xi}_{1}\sim\mathcal{N}(0,\sigma_{1}^{2}\mathbf{I}_{d_{1}}) and thus we have 𝔼[𝝃12]=d1σ12\mathbb{E}[\|\bm{\xi}_{1}\|^{2}]=d_{1}\sigma_{1}^{2}. Furthermore, the inequality (iviv) is due to the fact that f(𝝎,𝝂;z)f(\bm{\omega},\bm{\nu};z) is LL-Lipschitz continuous.

Symmetrically, we have

𝔼[𝝂u𝝂S]mL(μn(1+/μ)m)+d2σ2.\mathbb{E}[\|\bm{\nu}^{u}-\bm{\nu}_{S}^{*}\|]\leq\frac{mL}{(\mu n-\ell(1+\ell/\mu)m)}+\sqrt{d_{2}}\sigma_{2}. (80)

Plugging eq.(79) and eq.(80) into eq.(77) with d=max{d1,d2}d=\max\{d_{1},d_{2}\}, we get

w(𝝎u,𝝂u)2mL2(μn(1+/μ)m)+d(σ1+σ2)L+22L2μn.\triangle^{w}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\frac{2mL^{2}}{(\mu n-\ell(1+\ell/\mu)m)}+\sqrt{d}(\sigma_{1}+\sigma_{2})L+\frac{2\sqrt{2}L^{2}}{\mu n}. (81)

With the noise scales σ1\sigma_{1} and σ2\sigma_{2} both equal to 2(82L23ρ/μ5+8L2/μ2)m2n(μn(+2/μ)m)ϵ2log(2.5/δ)\frac{2(8\sqrt{2}L^{2}\ell^{3}\rho/\mu^{5}+8L\ell^{2}/\mu^{2})m^{2}}{n(\mu n-(\ell+\ell^{2}/\mu)m)\epsilon}\sqrt{2\log(2.5/\delta)}, we obtain the following generalization guarantee in terms of the population weak PD risk:

w(𝝎u,𝝂u)=𝒪((L33ρ/μ6+L22/μ3)m2dlog(1/δ)n2ϵ+mL2μn).\triangle^{w}(\bm{\omega}^{u},\bm{\nu}^{u})=\mathcal{O}\left((L^{3}\ell^{3}\rho/\mu^{6}+L^{2}\ell^{2}/\mu^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\mu n}\right). (82)
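To make the last substitution explicit (a short calculation, assuming the number of deletions satisfies (\ell+\ell^{2}/\mu)m\leq\mu n/2 so that the denominator above is at least \mu n/2), the noise term in eq.(81) is bounded as

\sqrt{d}(\sigma_{1}+\sigma_{2})L\leq\frac{8L(8\sqrt{2}L^{2}\ell^{3}\rho/\mu^{5}+8L\ell^{2}/\mu^{2})m^{2}\sqrt{2d\log(2.5/\delta)}}{\mu n^{2}\epsilon}=\mathcal{O}\Big{(}(L^{3}\ell^{3}\rho/\mu^{6}+L^{2}\ell^{2}/\mu^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}\Big{)},

while the remaining terms satisfy \frac{2mL^{2}}{\mu n-\ell(1+\ell/\mu)m}+\frac{2\sqrt{2}L^{2}}{\mu n}\leq\frac{4mL^{2}}{\mu n}+\frac{2\sqrt{2}L^{2}}{\mu n}=\mathcal{O}\big{(}\frac{mL^{2}}{\mu n}\big{)}; together these give eq.(82), and the same substitution in the strong PD bound below yields eq.(84).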

For the population strong PD risk s(𝝎u,𝝂u)\triangle^{s}(\bm{\omega}^{u},\bm{\nu}^{u}), similarly, we have

𝔼[max𝝂𝒱F(𝝎u,𝝂)min𝝎𝒲F(𝝎,𝝂u)]\displaystyle\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}F(\bm{\omega}^{u},\bm{\nu})-\min_{\bm{\omega}\in\mathcal{W}}F(\bm{\omega},\bm{\nu}^{u})] (83)
=\displaystyle= 𝔼[max𝝂𝒱(F(𝝎u,𝝂)F(𝝎S,𝝂)+F(𝝎S,𝝂))min𝝎𝒲(F(𝝎,𝝂u)F(𝝎,𝝂S)+F(𝝎,𝝂S))]\displaystyle\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}\left(F(\bm{\omega}^{u},\bm{\nu})-F(\bm{\omega}_{S}^{*},\bm{\nu})+F(\bm{\omega}_{S}^{*},\bm{\nu})\right)-\min_{\bm{\omega}\in\mathcal{W}}\left(F(\bm{\omega},\bm{\nu}^{u})-F(\bm{\omega},\bm{\nu}_{S}^{*})+F(\bm{\omega},\bm{\nu}_{S}^{*})\right)]
=\displaystyle= 𝔼[max𝝂𝒱(F(𝝎u,𝝂)F(𝝎S,𝝂)+F(𝝎S,𝝂))]𝔼[min𝝎𝒲(F(𝝎,𝝂u)F(𝝎,𝝂S)+F(𝝎,𝝂S))]\displaystyle\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}(F(\bm{\omega}^{u},\bm{\nu})-F(\bm{\omega}_{S}^{*},\bm{\nu})+F(\bm{\omega}_{S}^{*},\bm{\nu}))]-\mathbb{E}[\min_{\bm{\omega}\in\mathcal{W}}(F(\bm{\omega},\bm{\nu}^{u})-F(\bm{\omega},\bm{\nu}_{S}^{*})+F(\bm{\omega},\bm{\nu}_{S}^{*}))]
\displaystyle\leq 𝔼[max𝝂𝒱(F(𝝎u,𝝂)F(𝝎S,𝝂))+max𝝂𝒱F(𝝎S,𝝂)]𝔼[min𝝎𝒲(F(𝝎,𝝂u)F(𝝎,𝝂S))+min𝝎𝒲F(𝝎,𝝂S)]\displaystyle\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}(F(\bm{\omega}^{u},\bm{\nu})-F(\bm{\omega}_{S}^{*},\bm{\nu}))+\max_{\bm{\nu}\in\mathcal{V}}F(\bm{\omega}_{S}^{*},\bm{\nu})]-\mathbb{E}[\min_{\bm{\omega}\in\mathcal{W}}(F(\bm{\omega},\bm{\nu}^{u})-F(\bm{\omega},\bm{\nu}_{S}^{*}))+\min_{\bm{\omega}\in\mathcal{W}}F(\bm{\omega},\bm{\nu}_{S}^{*})]
=\displaystyle= 𝔼[max𝝂𝒱(F(𝝎u,𝝂)F(𝝎S,𝝂))]+𝔼[max𝝂𝒱F(𝝎S,𝝂)]𝔼[min𝝎𝒲(F(𝝎,𝝂u)F(𝝎,𝝂S))]𝔼[min𝝎𝒲F(𝝎,𝝂S)]\displaystyle\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}(F(\bm{\omega}^{u},\bm{\nu})-F(\bm{\omega}_{S}^{*},\bm{\nu}))]+\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}F(\bm{\omega}_{S}^{*},\bm{\nu})]-\mathbb{E}[\min_{\bm{\omega}\in\mathcal{W}}(F(\bm{\omega},\bm{\nu}^{u})-F(\bm{\omega},\bm{\nu}_{S}^{*}))]-\mathbb{E}[\min_{\bm{\omega}\in\mathcal{W}}F(\bm{\omega},\bm{\nu}_{S}^{*})]
=\displaystyle= 𝔼[max𝝂𝒱(F(𝝎u,𝝂)F(𝝎S,𝝂))]+𝔼[max𝝎𝒲((F)(𝝎,𝝂u)(F)(𝝎,𝝂S))]\displaystyle\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}(F(\bm{\omega}^{u},\bm{\nu})-F(\bm{\omega}_{S}^{*},\bm{\nu}))]+\mathbb{E}[\max_{\bm{\omega}\in\mathcal{W}}((-F)(\bm{\omega},\bm{\nu}^{u})-(-F)(\bm{\omega},\bm{\nu}_{S}^{*}))]
+𝔼[max𝝂𝒱F(𝝎S,𝝂)min𝝎𝒲F(𝝎,𝝂S)]\displaystyle+\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}F(\bm{\omega}_{S}^{*},\bm{\nu})-\min_{\bm{\omega}\in\mathcal{W}}F(\bm{\omega},\bm{\nu}_{S}^{*})]
(i)\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}} 𝔼[L𝝎u𝝎S]+𝔼[L𝝂u𝝂S]+s(𝝎S,𝝂S)\displaystyle\mathbb{E}[L\|\bm{\omega}^{u}-\bm{\omega}_{S}^{*}\|]+\mathbb{E}[L\|\bm{\nu}^{u}-\bm{\nu}_{S}^{*}\|]+\triangle^{s}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})
(ii)\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}} 2mL2(μn(1+/μ)m)+d(σ1+σ2)L+8L2μ2n,\displaystyle\frac{2mL^{2}}{(\mu n-\ell(1+\ell/\mu)m)}+\sqrt{d}(\sigma_{1}+\sigma_{2})L+\frac{8L^{2}\ell}{\mu^{2}n},

where the inequality (i) is due to the fact that the population loss function F(𝝎,𝝂):=𝔼[f(𝝎,𝝂;z)]F(\bm{\omega},\bm{\nu}):=\mathbb{E}[f(\bm{\omega},\bm{\nu};z)] is LL-Lipschitz continuous. The inequality (ii) uses eq.(79), eq.(80) and Lemma 9. With the same noise scales as above, we get the generalization guarantee in terms of the population strong PD risk below,

s(𝝎u,𝝂u)=𝒪((L33ρ/μ6+L22/μ3)m2dlog(1/δ)n2ϵ+mL2μn+L2μ2n).\triangle^{s}(\bm{\omega}^{u},\bm{\nu}^{u})=\mathcal{O}\left((L^{3}\ell^{3}\rho/\mu^{6}+L^{2}\ell^{2}/\mu^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\mu n}+\frac{L^{2}\ell}{\mu^{2}n}\right). (84)

Theorem 8 (Deletion Capacity).

Under the same settings as Lemma 10, with d=max{d1,d2}d=\max\{d_{1},d_{2}\}, the deletion capacity of Algorithm 2 is

mϵ,δ,γA,A¯(d1,d2,n)cnϵ(dlog(1/δ))1/4,m^{A,\bar{A}}_{\epsilon,\delta,\gamma}(d_{1},d_{2},n)\geq c\cdot\frac{n\sqrt{\epsilon}}{(d\log(1/\delta))^{1/4}}, (85)

where the constant cc depends on L,l,ρ,L,l,\rho, and μ\mu of the loss function ff.

Proof.

By the definition of deletion capacity, in order to ensure the population PD risk derived in Theorem 7 is bounded by γ\gamma, it suffices to let:

mϵ,δ,γA,A¯(d1,d2,n)cnϵ(dlog(1/δ))1/4,m^{A,\bar{A}}_{\epsilon,\delta,\gamma}(d_{1},d_{2},n)\geq c\cdot\frac{n\sqrt{\epsilon}}{(d\log(1/\delta))^{1/4}},

where the constant cc depends on the properties of the loss function f(𝝎,𝝂;z)f(\bm{\omega},\bm{\nu};z). ∎
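To spell out the calculation (a brief sketch, writing CC for the constant hidden in the first term of eq.(82) and eq.(84)): this first term is at most γ\gamma provided

C\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}\leq\gamma,\quad\text{i.e.,}\quad m\leq\sqrt{\gamma/C}\cdot\frac{n\sqrt{\epsilon}}{(d\log(1/\delta))^{1/4}},

while the remaining terms of eq.(82) and eq.(84) are of order m/nm/n and 1/n1/n and stay below γ\gamma for all such mm once the constant cc is chosen small enough (depending on L,,ρ,μL,\ell,\rho,\mu and γ\gamma, and using \epsilon\leq 1 and d\log(1/\delta)\geq 1).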

B.2 Missing Proofs of Sec.4.4

B.2.1 Proof of Lemma 1 (Closeness Upper Bound)

Proof.

By the definition of the functions FS(𝝎,𝝂)F_{S}(\bm{\omega},\bm{\nu}) and FS(𝝎,𝝂)F_{S^{\setminus}}(\bm{\omega},\bm{\nu}), we have

𝝎FS(𝝎S,𝚅S(𝝎S))𝝎FS(𝝎S,𝝂S)nnm𝙳𝝎𝝎FS(𝝎S,𝝂S)(𝝎S𝝎S))=nnm[𝝎FS(𝝎S,𝚅S(𝝎S))𝝎FS(𝝎S,𝝂S)𝙳𝝎𝝎FS(𝝎S,𝝂S)(𝝎S𝝎S)]1nmziU[𝝎f(𝝎S,𝚅S(𝝎S);zi)𝝎f(𝝎S,𝝂S;zi)](i)nnm𝝎FS(𝝎S,𝚅S(𝝎S))𝝎FS(𝝎S,𝝂S)𝙳𝝎𝝎FS(𝝎S,𝝂S)(𝝎S𝝎S)+1nmziU𝝎f(𝝎S,𝚅S(𝝎S);zi)𝝎f(𝝎S,𝝂S;zi)(ii)82ρL23m2μ5n(nm)+22Ll2m2μ2n(nm),\begin{split}&\|\nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*}))-\nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})-\frac{n}{n-m}{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*}))\|\\ =&\|\frac{n}{n-m}[\nabla_{\bm{\omega}}F_{S}(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*}))-\nabla_{\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})-{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*})]\\ &-\frac{1}{n-m}\sum_{z_{i}\in U}[\nabla_{\bm{\omega}}f(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*});z_{i})-\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})]\|\\ \stackrel{{\scriptstyle(i)}}{{\leq}}&\frac{n}{n-m}\|\nabla_{\bm{\omega}}F_{S}(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*}))-\nabla_{\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})-{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})(\bm{\omega}_{S^{\setminus}}^{*}-\bm{\omega}_{S}^{*})\|\\ &+\frac{1}{n-m}\sum_{z_{i}\in U}\|\nabla_{\bm{\omega}}f(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*});z_{i})-\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\|\\ \stackrel{{\scriptstyle(ii)}}{{\leq}}&\frac{8\sqrt{2}\rho L^{2}\ell^{3}m^{2}}{\mu^{5}n(n-m)}+\frac{2\sqrt{2}Ll^{2}m^{2}}{\mu^{2}n(n-m)},\end{split} (86)

where the inequality (i) holds by the triangle inequality and the inequality (ii) uses the results in eq.(65) and eq.(66). Now let 𝐱~\widetilde{\mathbf{x}} be the vector satisfying the following relation,

𝝎S=𝝎S+1n[𝙳𝝎𝝎FS(𝝎S,𝝂S)]1ziU𝝎f(𝝎S,𝝂S;zi)+𝐱~.\bm{\omega}_{S^{\setminus}}^{*}=\bm{\omega}_{S}^{*}+\frac{1}{n}[{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})+\widetilde{\mathbf{x}}. (87)

Since we have 𝝎FS(𝝎S,𝝂S)=1nmziU𝝎f(𝝎S,𝝂S;zi)\nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})=-\frac{1}{n-m}\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i}) and 𝝎FS(𝝎S,𝚅S(𝝎S))=0\nabla_{\bm{\omega}}F_{S^{\setminus}}(\bm{\omega}_{S^{\setminus}}^{*},{\tt V}_{S}(\bm{\omega}_{S^{\setminus}}^{*}))=0 due to the optimality of 𝝎S\bm{\omega}_{S^{\setminus}}^{*}, plugging the above relation into eq.(86), we get

nnm𝙳𝝎𝝎FS(𝝎S,𝝂S)𝐱~(82L23ρ/μ5+22L2/μ2)m2n(nm).\|\frac{n}{n-m}{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\widetilde{\mathbf{x}}\|\leq(8\sqrt{2}L^{2}\ell^{3}\rho/\mu^{5}+2\sqrt{2}L\ell^{2}/\mu^{2})\frac{m^{2}}{n(n-m)}. (88)

With 𝙳𝝎𝝎FS(𝝎S,𝝂S)μ𝝎𝝎{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\geq\mu_{\bm{\omega}\bm{\omega}}, we also have

nnm𝙳𝝎𝝎FS(𝝎S,𝝂S)𝐱~μ𝝎𝝎nnm𝐱~μnnm𝐱~.\|\frac{n}{n-m}{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\widetilde{\mathbf{x}}\|\geq\frac{\mu_{\bm{\omega}\bm{\omega}}n}{n-m}\|\widetilde{\mathbf{x}}\|\geq\frac{\mu n}{n-m}\|\widetilde{\mathbf{x}}\|. (89)

Combining eq.(88), eq.(89), and the definition of the vector 𝐱~\widetilde{\mathbf{x}} in eq.(87), we get that

𝝎S𝝎~=𝐱~(82L23ρ/μ6+22L2/μ3)m2n2.\|\bm{\omega}_{S^{\setminus}}^{*}-\widetilde{\bm{\omega}}\|=\|\widetilde{\mathbf{x}}\|\leq\frac{(8\sqrt{2}L^{2}\ell^{3}\rho/\mu^{6}+2\sqrt{2}L\ell^{2}/\mu^{3})m^{2}}{n^{2}}. (90)

Symmetrically, we can get 𝝂S𝝂~(82L23ρ/μ6+22L2/μ3)m2n2\|\bm{\nu}_{S^{\setminus}}^{*}-\widetilde{\bm{\nu}}\|\leq\frac{(8\sqrt{2}L^{2}\ell^{3}\rho/\mu^{6}+2\sqrt{2}L\ell^{2}/\mu^{3})m^{2}}{n^{2}}. ∎

B.2.2 Proof of Theorem 2 ((ϵ,δ)(\epsilon,\delta)-Minimax Unlearning Certification)

Proof.

With the closeness upper bound in Lemma 1 and the given noise scales in eq.(22), the proof is identical to that of Theorem 6. ∎

B.2.3 Proof of Theorem 3 (Population Primal-Dual Risk)

Proof.

We start with the population weak PD risk. By eq.(77), we have

w(𝝎~u,𝝂~u)𝔼[L𝝎~u𝝎S]+𝔼[L𝝂~u𝝂S]+22L2μn.\triangle^{w}(\widetilde{\bm{\omega}}^{u},\widetilde{\bm{\nu}}^{u})\leq\mathbb{E}[L\|\widetilde{\bm{\omega}}^{u}-\bm{\omega}_{S}^{*}\|]+\mathbb{E}[L\|\widetilde{\bm{\nu}}^{u}-\bm{\nu}_{S}^{*}\|]+\frac{2\sqrt{2}L^{2}}{\mu n}. (91)

By recalling the unlearning step in Algorithm 3, we have

𝝎~u=𝝎S+1n[𝙳𝝎𝝎FS(𝝎S,𝝂S)]1ziU𝝎f(𝝎S,𝝂S;zi)+𝝃1,\widetilde{\bm{\omega}}^{u}=\bm{\omega}_{S}^{*}+\frac{1}{n}[{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})+\bm{\xi}_{1}, (92)

where the vector 𝝃1d1\bm{\xi}_{1}\in\mathbb{R}^{d_{1}} is drawn independently from 𝒩(0,σ12𝐈d1)\mathcal{N}(0,\sigma_{1}^{2}\mathbf{I}_{d_{1}}). From the relation in eq.(92), we further get

𝔼[𝝎~u𝝎S]=\displaystyle\mathbb{E}[\|\widetilde{\bm{\omega}}^{u}-\bm{\omega}_{S}^{*}\|]= 𝔼[1n[𝙳𝝎𝝎FS(𝝎S,𝝂S)]1ziU𝝎f(𝝎S,𝝂S;zi)+𝝃1]\displaystyle\mathbb{E}\left[\left\|\frac{1}{n}[{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\cdot\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})+\bm{\xi}_{1}\right\|\right] (93)
(i)\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}} 1n𝔼[[𝙳𝝎𝝎FS(𝝎S,𝝂S)]1ziU𝝎f(𝝎S,𝝂S;zi)]+𝔼[𝝃1]\displaystyle\frac{1}{n}\mathbb{E}\left[\left\|[{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\cdot\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\right\|\right]+\mathbb{E}[\|\bm{\xi}_{1}\|]
(ii)\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}} 1nμ1𝔼[ziU𝝎f(𝝎S,𝝂S;zi)]+𝔼[𝝃12]\displaystyle\frac{1}{n}\cdot\mu^{-1}\mathbb{E}\left[\left\|\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\right\|\right]+\sqrt{\mathbb{E}[\|\bm{\xi}_{1}\|^{2}]}
=(iii)\displaystyle\stackrel{{\scriptstyle(iii)}}{{=}} 1μn𝔼[ziU𝝎f(𝝎S,𝝂S;zi)]+d1σ1(iv)mLμn+d1σ1,\displaystyle\frac{1}{\mu n}\mathbb{E}\left[\left\|\sum_{z_{i}\in U}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\right\|\right]+\sqrt{d_{1}}\sigma_{1}\stackrel{{\scriptstyle(iv)}}{{\leq}}\frac{mL}{\mu n}+\sqrt{d_{1}}\sigma_{1},

where the inequality (i) uses the triangle inequality and the inequality (ii) follows from the relation 𝙳𝝎𝝎FS(𝝎S,𝝂S)μ𝝎𝝎μ{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\geq\mu_{\bm{\omega}\bm{\omega}}\geq\mu, together with Jensen’s inequality to bound 𝔼[𝝃1]\mathbb{E}[\|\bm{\xi}_{1}\|]. The equality (iii) holds because the vector 𝝃1𝒩(0,σ12𝐈d1)\bm{\xi}_{1}\sim\mathcal{N}(0,\sigma_{1}^{2}\mathbf{I}_{d_{1}}) and thus we have 𝔼[𝝃12]=d1σ12\mathbb{E}[\|\bm{\xi}_{1}\|^{2}]=d_{1}\sigma_{1}^{2}. The inequality (iv) is due to the fact that f(𝝎,𝝂;z)f(\bm{\omega},\bm{\nu};z) is LL-Lipschitz continuous. Symmetrically, we have

𝔼[𝝂~u𝝂S]mLμn+d2σ2.\mathbb{E}[\|\widetilde{\bm{\nu}}^{u}-\bm{\nu}_{S}^{*}\|]\leq\frac{mL}{\mu n}+\sqrt{d_{2}}\sigma_{2}. (94)

Plugging eq.(93) and eq.(94) into eq.(91) with d=max{d1,d2}d=\max\{d_{1},d_{2}\} we get

w(𝝎~u,𝝂~u)2mL2μn+d(σ1+σ2)L+22L2μn.\triangle^{w}(\widetilde{\bm{\omega}}^{u},\widetilde{\bm{\nu}}^{u})\leq\frac{2mL^{2}}{\mu n}+\sqrt{d}(\sigma_{1}+\sigma_{2})L+\frac{2\sqrt{2}L^{2}}{\mu n}. (95)

With the noise scales σ1\sigma_{1} and σ2\sigma_{2} both equal to 2(82L23ρ/μ6+22L2/μ3)m2n2ϵ2log(2.5/δ)\frac{2(8\sqrt{2}L^{2}\ell^{3}\rho/\mu^{6}+2\sqrt{2}L\ell^{2}/\mu^{3})m^{2}}{n^{2}\epsilon}\sqrt{2\log(2.5/\delta)}, we obtain the following generalization guarantee in terms of the population weak PD risk:

w(𝝎~u,𝝂~u)=𝒪((L33ρ/μ6+L22/μ3)m2dlog(1/δ)n2ϵ+mL2μn).\triangle^{w}(\widetilde{\bm{\omega}}^{u},\widetilde{\bm{\nu}}^{u})=\mathcal{O}\left((L^{3}\ell^{3}\rho/\mu^{6}+L^{2}\ell^{2}/\mu^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\mu n}\right). (96)

For the population strong PD risk, applying eq.(83) together with Lemma 9, eq.(93), eq.(94) and the noise scales given in Theorem 2, we get

s(𝝎~u,𝝂~u)=𝒪((L33ρ/μ6+L22/μ3)m2dlog(1/δ)n2ϵ+mL2μn+L2μ2n).\triangle^{s}(\widetilde{\bm{\omega}}^{u},\widetilde{\bm{\nu}}^{u})=\mathcal{O}\left((L^{3}\ell^{3}\rho/\mu^{6}+L^{2}\ell^{2}/\mu^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\mu n}+\frac{L^{2}\ell}{\mu^{2}n}\right). (97)

B.2.4 Proof of Theorem 4 (Deletion Capacity)

Proof.

By the definition of deletion capacity, in order to ensure that the population weak or strong PD risk derived in Theorem 3 is bounded by γ\gamma, it suffices to let mϵ,δ,γA,A¯(d1,d2,n)cnϵ(dlog(1/δ))1/4m^{A,\bar{A}}_{\epsilon,\delta,\gamma}(d_{1},d_{2},n)\geq c\cdot\frac{n\sqrt{\epsilon}}{(d\log(1/\delta))^{1/4}}. ∎

B.3 Efficient Online Unlearning Algorithm

Algorithm 4 Efficient Online Certified Minimax Unlearning (A¯𝚘𝚗𝚕𝚒𝚗𝚎)(\bar{A}_{\tt online})
0:  Delete request zkSz_{k}\in S, early delete requests UU : {zj}j=1mS\{z_{j}\}^{m}_{j=1}\subseteq S, output of Ascsc(S)A_{sc-sc}(S): (𝝎S,𝝂S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}), memory variables T(S)T(S): {𝙳𝝎𝝎FS(𝝎S,𝝂S),𝙳𝝂𝝂FS(𝝎S,𝝂S)}\{\mathtt{D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}),\mathtt{D}_{\bm{\nu}\bm{\nu}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\}, early unlearning variables: (𝝎~Uu,𝝂~Uu)(\widetilde{\bm{\omega}}_{U}^{u},\widetilde{\bm{\nu}}_{U}^{u}), loss function: f(𝝎,𝝂;z)f(\bm{\omega},\bm{\nu};z), noise parameters: σ1\sigma_{1}, σ2\sigma_{2}.
1:  Set 𝝎~=𝝎S\widetilde{\bm{\omega}}_{\emptyset}=\bm{\omega}_{S}^{*} and 𝝂~=𝝂S\widetilde{\bm{\nu}}_{\emptyset}=\bm{\nu}_{S}^{*}.
2:  Compute
𝝎¯U{k}=𝝎~U+1n[𝙳𝝎𝝎FS(𝝎S,𝝂S)]1𝝎f(𝝎S,𝝂S;zk),\overline{\bm{\omega}}_{U\cup\{k\}}=\widetilde{\bm{\omega}}_{U}+\frac{1}{n}[{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{k}), (98)
𝝂¯U{k}=𝝂~U+1n[𝙳𝝂𝝂FS(𝝎S,𝝂S)]1𝝂f(𝝎S,𝝂S;zk).\overline{\bm{\nu}}_{U\cup\{k\}}=\widetilde{\bm{\nu}}_{U}+\frac{1}{n}[{\tt D}_{\bm{\nu}\bm{\nu}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\nabla_{\bm{\nu}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{k}). (99)
3:  𝝎~U{k}u=𝝎¯U{k}+𝝃1\widetilde{\bm{\omega}}^{u}_{U\cup\{k\}}=\overline{\bm{\omega}}_{U\cup\{k\}}+\bm{\xi}_{1}, where 𝝃1𝒩(0,σ12𝐈d1)\bm{\xi}_{1}\sim\mathcal{N}(0,\sigma_{1}^{2}\mathbf{I}_{d_{1}}), and 𝝂~U{k}u=𝝂¯U{k}+𝝃2\widetilde{\bm{\nu}}^{u}_{U\cup\{k\}}=\overline{\bm{\nu}}_{U\cup\{k\}}+\bm{\xi}_{2}, where 𝝃2𝒩(0,σ22𝐈d2)\bm{\xi}_{2}\sim\mathcal{N}(0,\sigma_{2}^{2}\mathbf{I}_{d_{2}}).
3:  (𝝎~U{k}u,𝝂~U{k}u)(\widetilde{\bm{\omega}}^{u}_{U\cup\{k\}},\widetilde{\bm{\nu}}^{u}_{U\cup\{k\}}).

To support successive unlearning requests, similar to the STL case (Suriyakumar and Wilson, 2022), we further provide an efficient and online minimax unlearning algorithm (denoted by A¯𝚘𝚗𝚕𝚒𝚗𝚎\bar{A}_{\tt online}). The pseudocode of A¯𝚘𝚗𝚕𝚒𝚗𝚎\bar{A}_{\tt online} is given in Algorithm 4. Its certified minimax unlearning guarantee, generalization rates, and deletion capacity are identical to those of Algorithm 3 and are therefore omitted here.
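To see why \bar{A}_{\tt online} matches Algorithm 3, note that (ignoring the injected noise and assuming the noiseless running iterates \widetilde{\bm{\omega}}_{U},\widetilde{\bm{\nu}}_{U} are maintained across requests) unrolling the update in eq.(98) over the processed requests gives

\overline{\bm{\omega}}_{U\cup\{k\}}=\bm{\omega}_{S}^{*}+\frac{1}{n}[{\tt D}_{\bm{\omega}\bm{\omega}}F_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\sum_{z_{i}\in U\cup\{z_{k}\}}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i}),

which is exactly the noiseless part of the batch update in eq.(92) with the delete set U\cup\{z_{k}\}; adding the Gaussian perturbation \bm{\xi}_{1} then yields \widetilde{\bm{\omega}}^{u}_{U\cup\{k\}}, and symmetrically for \widetilde{\bm{\nu}}^{u}_{U\cup\{k\}}.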

Appendix C Detailed Algorithm Descriptions and Missing Proofs in Section 5

C.1 Minimax Unlearning Algorithm for Smooth Convex-Concave Loss Function

In this section, we provide minimax learning and minimax unlearning algorithms for smooth convex-concave loss functions based on the counterpart algorithms for the SC-SC setting. Given the convex-concave loss function f(𝝎,𝝂;z)f(\bm{\omega},\bm{\nu};z), we define the regularized loss function as f~(𝝎,𝝂;z)=f(𝝎,𝝂;z)+λ2𝝎2λ2𝝂2\widetilde{f}(\bm{\omega},\bm{\nu};z)=f(\bm{\omega},\bm{\nu};z)+\frac{\lambda}{2}\|\bm{\omega}\|^{2}-\frac{\lambda}{2}\|\bm{\nu}\|^{2}. Suppose the function ff satisfies Assumption 1, then the function f~\widetilde{f} is λ\lambda-strongly convex in 𝝎\bm{\omega}, λ\lambda-strongly concave in 𝝂\bm{\nu}, (2L+λ𝝎+λ𝝂)(2L+\lambda\|\bm{\omega}\|+\lambda\|\bm{\nu}\|)-Lipschitz, 2(2+λ)\sqrt{2}(2\ell+\lambda)-gradient Lipschitz and ρ\rho-Hessian Lipschitz. Thus, we can apply the minimax learning in Algorithm 1 and unlearning in Algorithm 2 to the regularized loss function with a properly chosen λ\lambda. We denote our learning algorithm by AccA_{c-c} and unlearning algorithm by A¯cc\bar{A}_{c-c}. The pseudocode is provided in Algorithm 5 and Algorithm 6, respectively. Additionally, we denote the regularized population loss as F~(𝝎,𝝂):=𝔼z𝒟[f~(𝝎,𝝂;z)]\widetilde{F}(\bm{\omega},\bm{\nu}):=\mathbb{E}_{z\sim\mathcal{D}}[\widetilde{f}(\bm{\omega},\bm{\nu};z)] and regularized empirical loss as F~S(𝝎,𝝂):=1ni=1nf~(𝝎,𝝂;zi)\widetilde{F}_{S}(\bm{\omega},\bm{\nu}):=\frac{1}{n}\sum_{i=1}^{n}\widetilde{f}(\bm{\omega},\bm{\nu};z_{i}).
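As a quick check of the Lipschitz constant stated above (using only the triangle inequality and the L-Lipschitz continuity of f), for any (\bm{\omega},\bm{\nu}) and z,

\|\nabla_{\bm{\omega}}\widetilde{f}(\bm{\omega},\bm{\nu};z)\|\leq\|\nabla_{\bm{\omega}}f(\bm{\omega},\bm{\nu};z)\|+\lambda\|\bm{\omega}\|\leq L+\lambda\|\bm{\omega}\|,\qquad\|\nabla_{\bm{\nu}}\widetilde{f}(\bm{\omega},\bm{\nu};z)\|\leq L+\lambda\|\bm{\nu}\|,

and summing the two blockwise bounds gives the (2L+\lambda\|\bm{\omega}\|+\lambda\|\bm{\nu}\|)-Lipschitz constant of \widetilde{f}.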

Algorithm 5 Minimax Learning Algorithm (Acc)(A_{c-c})
0:  Dataset SS : {zi}i=1n𝒟n\{z_{i}\}^{n}_{i=1}\sim\mathcal{D}^{n}, loss function: f(𝝎,𝝂;z)f(\bm{\omega},\bm{\nu};z), regularization parameter: λ\lambda.
1:  Define
f~(𝝎,𝝂;z)=f(𝝎,𝝂;z)+λ2𝝎2λ2𝝂2.\widetilde{f}(\bm{\omega},\bm{\nu};z)=f(\bm{\omega},\bm{\nu};z)+\frac{\lambda}{2}\|\bm{\omega}\|^{2}-\frac{\lambda}{2}\|\bm{\nu}\|^{2}. (100)
2:  Run the algorithm AscscA_{sc-sc} on the dataset SS with loss function f~\widetilde{f}.
2:  (𝝎S,𝝂S,𝙳𝝎𝝎F~S(𝝎S,𝝂S),𝙳𝝂𝝂F~S(𝝎S,𝝂S))Ascsc(S,f~)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*},\mathtt{D}_{\bm{\omega}\bm{\omega}}\widetilde{F}_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}),\mathtt{D}_{\bm{\nu}\bm{\nu}}\widetilde{F}_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}))\leftarrow A_{sc-sc}(S,\widetilde{f}).
Algorithm 6 Certified Minimax Unlearning for Convex-Concave Loss (A¯cc)(\bar{A}_{c-c})
0:  Delete requests UU : {zj}j=1mS\{z_{j}\}^{m}_{j=1}\subseteq S, output of Acc(S)A_{c-c}(S): (𝝎S,𝝂S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}), loss function: f(𝝎,𝝂;z)f(\bm{\omega},\bm{\nu};z), memory variables T(S)T(S): {𝙳𝝎𝝎F~S(𝝎S,𝝂S),𝙳𝝂𝝂F~S(𝝎S,𝝂S)}\{\mathtt{D}_{\bm{\omega}\bm{\omega}}\widetilde{F}_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}),\mathtt{D}_{\bm{\nu}\bm{\nu}}\widetilde{F}_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\}, regularization parameter: λ\lambda, noise parameters: σ1\sigma_{1}, σ2\sigma_{2}.
1:  Define
f~(𝝎,𝝂;z)=f(𝝎,𝝂;z)+λ2𝝎2λ2𝝂2.\widetilde{f}(\bm{\omega},\bm{\nu};z)=f(\bm{\omega},\bm{\nu};z)+\frac{\lambda}{2}\|\bm{\omega}\|^{2}-\frac{\lambda}{2}\|\bm{\nu}\|^{2}. (101)
2:  Run the algorithm A¯scsc\bar{A}_{sc-sc} with delete requests UU, learning variables (𝝎S,𝝂S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}), memory variables T(S)T(S), loss function f~\widetilde{f} and noise parameters σ1\sigma_{1} and σ2\sigma_{2}.
2:  (𝝎u,𝝂u)A¯scsc(U,(𝝎S,𝝂S),T(S),f~,σ1,σ2)(\bm{\omega}^{u},\bm{\nu}^{u})\leftarrow\bar{A}_{sc-sc}(U,(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}),T(S),\widetilde{f},\sigma_{1},\sigma_{2}).

C.2 Supporting Lemma

Lemma 11.

Suppose the function f(𝛚,𝛎;z)f(\bm{\omega},\bm{\nu};z) is LL-Lipschitz continuous. Define the function f~(𝛚,𝛎;z)\widetilde{f}(\bm{\omega},\bm{\nu};z) as

f~(𝝎,𝝂;z)=f(𝝎,𝝂;z)+λ2𝝎2λ2𝝂2.\widetilde{f}(\bm{\omega},\bm{\nu};z)=f(\bm{\omega},\bm{\nu};z)+\frac{\lambda}{2}\|\bm{\omega}\|^{2}-\frac{\lambda}{2}\|\bm{\nu}\|^{2}. (102)

Given a dataset S={zi}i=1nS=\{z_{i}\}^{n}_{i=1} and denote (𝛚S,𝛎S):=argmin𝛚𝒲max𝛎𝒱{F~S(𝛚,𝛎):=1ni=1nf~(𝛚,𝛎;zi)}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}):=\arg\min_{\bm{\omega}\in\mathcal{W}}\max_{\bm{\nu}\in\mathcal{V}}\{\widetilde{F}_{S}(\bm{\omega},\bm{\nu}):=\frac{1}{n}\sum_{i=1}^{n}\widetilde{f}(\bm{\omega},\bm{\nu};z_{i})\}. Then, the variables (𝛚S,𝛎S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) satisfy 𝛚SL/λ\|\bm{\omega}_{S}^{*}\|\leq L/\lambda and 𝛎SL/λ\|\bm{\nu}_{S}^{*}\|\leq L/\lambda.

Proof.

Due to the optimality of 𝝎S\bm{\omega}_{S}^{*}, we have

𝝎F~S(𝝎S,𝝂S)=1nziS𝝎f~(𝝎S,𝝂S;zi)=0.\nabla_{\bm{\omega}}\widetilde{F}_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})=\frac{1}{n}\sum_{z_{i}\in S}\nabla_{\bm{\omega}}\widetilde{f}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})=0. (103)

Plugging in the definition of the function f~\widetilde{f} in the above, we get that

1nziS𝝎f(𝝎S,𝝂S;zi)+λ𝝎S=0.\frac{1}{n}\sum_{z_{i}\in S}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})+\lambda\bm{\omega}_{S}^{*}=0. (104)

Then, using the triangle inequality, we have

λ𝝎S=1nziS𝝎f(𝝎S,𝝂S;zi)1nziS𝝎f(𝝎S,𝝂S;zi)L,\|\lambda\bm{\omega}_{S}^{*}\|=\|\frac{1}{n}\sum_{z_{i}\in S}\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\|\leq\frac{1}{n}\sum_{z_{i}\in S}\|\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\|\leq L, (105)

where the last inequality holds because the function ff is LL-Lipschitz continuous. Thus we have 𝝎SL/λ\|\bm{\omega}_{S}^{*}\|\leq L/\lambda. Similarly, we can get 𝝂SL/λ\|\bm{\nu}_{S}^{*}\|\leq L/\lambda. ∎

Lemma 11 implies that the empirical optimizer (𝝎S,𝝂S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) returned by Algorithm 5 satisfies 𝝎SL/λ\|\bm{\omega}_{S}^{*}\|\leq L/\lambda and 𝝂SL/λ\|\bm{\nu}_{S}^{*}\|\leq L/\lambda. Thus our domains of interest are 𝒲:={𝝎|𝝎L/λ}\mathcal{W}:=\{\bm{\omega}|\|\bm{\omega}\|\leq L/\lambda\} and 𝒱:={𝝂|𝝂L/λ}\mathcal{V}:=\{\bm{\nu}|\|\bm{\nu}\|\leq L/\lambda\}. Over the set 𝒲×𝒱\mathcal{W}\times\mathcal{V}, the function f~(𝝎,𝝂;z)\widetilde{f}(\bm{\omega},\bm{\nu};z) is 4L4L-Lipschitz continuous. Also, with λ<\lambda<\ell, f~(𝝎,𝝂;z)\widetilde{f}(\bm{\omega},\bm{\nu};z) has 323\sqrt{2}\ell-Lipschitz gradients.
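Both constants follow directly from the bounds in Sec. C.1: over \mathcal{W}\times\mathcal{V} we have \|\bm{\omega}\|,\|\bm{\nu}\|\leq L/\lambda, so

2L+\lambda\|\bm{\omega}\|+\lambda\|\bm{\nu}\|\leq 2L+\lambda\cdot\frac{L}{\lambda}+\lambda\cdot\frac{L}{\lambda}=4L,\qquad\sqrt{2}(2\ell+\lambda)\leq\sqrt{2}\cdot 3\ell=3\sqrt{2}\ell\quad(\text{using }\lambda<\ell).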

C.3 Proof of Theorem 5 (Certified Minimax Unlearning for Convex-Concave Loss Function)

Denote L~=4L\widetilde{L}=4L and ~=32\widetilde{\ell}=3\sqrt{2}\ell; then the function f~\widetilde{f} is L~\widetilde{L}-Lipschitz continuous and has ~\widetilde{\ell}-Lipschitz gradients. Let (𝝎S,𝝂S)(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}) be the optimal solution of the loss function F~S(𝝎,𝝂)\widetilde{F}_{S^{\setminus}}(\bm{\omega},\bm{\nu}) on the remaining dataset, i.e.,

(𝝎S,𝝂S):=argmin𝝎𝒲max𝝂𝒱{F~S(𝝎,𝝂):=1nmziSUf~(𝝎,𝝂;zi)}.(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}):=\arg\min_{\bm{\omega}\in\mathcal{W}}\max_{\bm{\nu}\in\mathcal{V}}\{\widetilde{F}_{S^{\setminus}}(\bm{\omega},\bm{\nu}):=\frac{1}{n-m}\sum_{z_{i}\in S\setminus U}\widetilde{f}(\bm{\omega},\bm{\nu};z_{i})\}. (106)

Additionally, we have 𝙳𝝎𝝎F~S(𝝎S,𝝂S)λ\|{\tt D}_{\bm{\omega}\bm{\omega}}\widetilde{F}_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\|\geq\lambda and 𝙳𝝂𝝂F~S(𝝎S,𝝂S)λ\|{\tt D}_{\bm{\nu}\bm{\nu}}\widetilde{F}_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\|\geq\lambda.

Lemma 12.

Under the settings of Theorem 5, for any λ>0\lambda>0, the population weak and strong PD risks for the minimax learning variables (𝛚S,𝛎S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) returned by Algorithm 5 are

{max𝝂𝒱𝔼[F(𝝎S,𝝂)]min𝝎𝒲𝔼[F(𝝎,𝝂S)]322L2λn+λ2(B𝝎2+B𝝂2),𝔼[max𝝂𝒱F(𝝎S,𝝂)min𝝎𝒲F(𝝎,𝝂S)]3842L2λ2n+λ2(B𝝎2+B𝝂2).\left\{\begin{aligned} &\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}[F(\bm{\omega}_{S}^{*},\bm{\nu})]-\min_{\bm{\omega}\in\mathcal{W}}\mathbb{E}[F(\bm{\omega},\bm{\nu}_{S}^{*})]\leq\frac{32\sqrt{2}L^{2}}{\lambda n}+\frac{\lambda}{2}(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2}),\\ &\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}F(\bm{\omega}_{S}^{*},\bm{\nu})-\min_{\bm{\omega}\in\mathcal{W}}F(\bm{\omega},\bm{\nu}_{S}^{*})]\leq\frac{384\sqrt{2}L^{2}\ell}{\lambda^{2}n}+\frac{\lambda}{2}(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2}).\end{aligned}\right. (107)
Proof.

For the function F~(𝝎,𝝂)\widetilde{F}(\bm{\omega},\bm{\nu}), an application of Lemma 8 gives that

max𝝂𝒱𝔼[F~(𝝎S,𝝂)]min𝝎𝒲𝔼[F~(𝝎,𝝂S)]322L2λn.\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}[\widetilde{F}(\bm{\omega}_{S}^{*},\bm{\nu})]-\min_{\bm{\omega}\in\mathcal{W}}\mathbb{E}[\widetilde{F}(\bm{\omega},\bm{\nu}_{S}^{*})]\leq\frac{32\sqrt{2}L^{2}}{\lambda n}. (108)

By the assumption that the parameter spaces 𝒲\mathcal{W} and 𝒱\mathcal{V} are bounded so that max𝝎𝒲𝝎B𝝎\max_{\bm{\omega}\in\mathcal{W}}\|\bm{\omega}\|\leq B_{\bm{\omega}} and max𝝂𝒱𝝂B𝝂\max_{\bm{\nu}\in\mathcal{V}}\|\bm{\nu}\|\leq B_{\bm{\nu}}, we have the following derivation for the population weak PD risk,

max𝝂𝒱𝔼[F(𝝎S,𝝂)]min𝝎𝒲𝔼[F(𝝎,𝝂S)]=max𝝂𝒱𝔼[F~(𝝎S,𝝂)λ2𝝎S2+λ2𝝂2]min𝝎𝒲𝔼[F~(𝝎,𝝂S)λ2𝝎2+λ2𝝂S2]max𝝂𝒱𝔼[F~(𝝎S,𝝂)+λ2𝝂2]min𝝎𝒲𝔼[F~(𝝎,𝝂S)λ2𝝎2]max𝝂𝒱𝔼[F~(𝝎S,𝝂)]min𝝎𝒲𝔼[F~(𝝎,𝝂S)]+max𝝂𝒱𝔼[λ2𝝂2]+max𝝎𝒲𝔼[λ2𝝎2]322L2λn+λ2(B𝝎2+B𝝂2).\begin{split}&\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}[F(\bm{\omega}_{S}^{*},\bm{\nu})]-\min_{\bm{\omega}\in\mathcal{W}}\mathbb{E}[F(\bm{\omega},\bm{\nu}_{S}^{*})]\\ =&\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}\bigg{[}\widetilde{F}(\bm{\omega}_{S}^{*},\bm{\nu})-\frac{\lambda}{2}\|\bm{\omega}_{S}^{*}\|^{2}+\frac{\lambda}{2}\|\bm{\nu}\|^{2}\bigg{]}-\min_{\bm{\omega}\in\mathcal{W}}\mathbb{E}\bigg{[}\widetilde{F}(\bm{\omega},\bm{\nu}_{S}^{*})-\frac{\lambda}{2}\|\bm{\omega}\|^{2}+\frac{\lambda}{2}\|\bm{\nu}_{S}^{*}\|^{2}\bigg{]}\\ \leq&\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}\bigg{[}\widetilde{F}(\bm{\omega}_{S}^{*},\bm{\nu})+\frac{\lambda}{2}\|\bm{\nu}\|^{2}\bigg{]}-\min_{\bm{\omega}\in\mathcal{W}}\mathbb{E}\bigg{[}\widetilde{F}(\bm{\omega},\bm{\nu}_{S}^{*})-\frac{\lambda}{2}\|\bm{\omega}\|^{2}\bigg{]}\\ \leq&\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}\big{[}\widetilde{F}(\bm{\omega}_{S}^{*},\bm{\nu})\big{]}-\min_{\bm{\omega}\in\mathcal{W}}\mathbb{E}\big{[}\widetilde{F}(\bm{\omega},\bm{\nu}_{S}^{*})\big{]}+\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}\bigg{[}\frac{\lambda}{2}\|\bm{\nu}\|^{2}\bigg{]}+\max_{\bm{\omega}\in\mathcal{W}}\mathbb{E}\bigg{[}\frac{\lambda}{2}\|\bm{\omega}\|^{2}\bigg{]}\\ \leq&\frac{32\sqrt{2}L^{2}}{\lambda n}+\frac{\lambda}{2}(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2}).\end{split} (109)

Similarly, an application of Lemma 9 gives that

𝔼[max𝝂𝒱F~(𝝎S,𝝂)min𝝎𝒲F~(𝝎,𝝂S)]1282L2(2+λ)λ2n.\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}\widetilde{F}(\bm{\omega}_{S}^{*},\bm{\nu})-\min_{\bm{\omega}\in\mathcal{W}}\widetilde{F}(\bm{\omega},\bm{\nu}_{S}^{*})]\leq\frac{128\sqrt{2}L^{2}(2\ell+\lambda)}{\lambda^{2}n}. (110)

We can then bound the population strong PD risk with the following derivation,

𝔼[max𝝂𝒱F(𝝎S,𝝂)min𝝎𝒲F(𝝎,𝝂S)]\displaystyle\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}F(\bm{\omega}_{S}^{*},\bm{\nu})-\min_{\bm{\omega}\in\mathcal{W}}F(\bm{\omega},\bm{\nu}_{S}^{*})] (111)
=\displaystyle= 𝔼[max𝝂𝒱(F~(𝝎S,𝝂)λ2𝝎S2+λ2𝝂2)min𝝎𝒲(F~(𝝎,𝝂S)λ2𝝎2+λ2𝝂S2)]\displaystyle\mathbb{E}\left[\max_{\bm{\nu}\in\mathcal{V}}\left(\widetilde{F}(\bm{\omega}_{S}^{*},\bm{\nu})-\frac{\lambda}{2}\|\bm{\omega}_{S}^{*}\|^{2}+\frac{\lambda}{2}\|\bm{\nu}\|^{2}\right)-\min_{\bm{\omega}\in\mathcal{W}}\left(\widetilde{F}(\bm{\omega},\bm{\nu}_{S}^{*})-\frac{\lambda}{2}\|\bm{\omega}\|^{2}+\frac{\lambda}{2}\|\bm{\nu}_{S}^{*}\|^{2}\right)\right]
\displaystyle\leq 𝔼[max𝝂𝒱(F~(𝝎S,𝝂)+λ2𝝂2)min𝝎𝒲(F~(𝝎,𝝂S)λ2𝝎2)]\displaystyle\mathbb{E}\left[\max_{\bm{\nu}\in\mathcal{V}}\left(\widetilde{F}(\bm{\omega}_{S}^{*},\bm{\nu})+\frac{\lambda}{2}\|\bm{\nu}\|^{2}\right)-\min_{\bm{\omega}\in\mathcal{W}}\left(\widetilde{F}(\bm{\omega},\bm{\nu}_{S}^{*})-\frac{\lambda}{2}\|\bm{\omega}\|^{2}\right)\right]
=\displaystyle= 𝔼[max𝝂𝒱(F~(𝝎S,𝝂)+λ2𝝂2)]𝔼[min𝝎𝒲(F~(𝝎,𝝂S)λ2𝝎2)]\displaystyle\mathbb{E}\left[\max_{\bm{\nu}\in\mathcal{V}}\left(\widetilde{F}(\bm{\omega}_{S}^{*},\bm{\nu})+\frac{\lambda}{2}\|\bm{\nu}\|^{2}\right)\right]-\mathbb{E}\left[\min_{\bm{\omega}\in\mathcal{W}}\left(\widetilde{F}(\bm{\omega},\bm{\nu}_{S}^{*})-\frac{\lambda}{2}\|\bm{\omega}\|^{2}\right)\right]
\displaystyle\leq 𝔼[max𝝂𝒱F~(𝝎S,𝝂)+max𝝂𝒱λ2𝝂2]+𝔼[max𝝎𝒲(F~)(𝝎,𝝂S)+max𝝎𝒲λ2𝝎2]\displaystyle\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}\widetilde{F}(\bm{\omega}_{S}^{*},\bm{\nu})+\max_{\bm{\nu}\in\mathcal{V}}\frac{\lambda}{2}\|\bm{\nu}\|^{2}]+\mathbb{E}[\max_{\bm{\omega}\in\mathcal{W}}(-\widetilde{F})(\bm{\omega},\bm{\nu}_{S}^{*})+\max_{\bm{\omega}\in\mathcal{W}}\frac{\lambda}{2}\|\bm{\omega}\|^{2}]
=\displaystyle= 𝔼[max𝝂𝒱F~(𝝎S,𝝂)min𝝎𝒲F~(𝝎,𝝂S)]+𝔼[max𝝂𝒱λ2𝝂2]+𝔼[max𝝎𝒲λ2𝝎2]\displaystyle\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}\widetilde{F}(\bm{\omega}_{S}^{*},\bm{\nu})-\min_{\bm{\omega}\in\mathcal{W}}\widetilde{F}(\bm{\omega},\bm{\nu}_{S}^{*})]+\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}\frac{\lambda}{2}\|\bm{\nu}\|^{2}]+\mathbb{E}[\max_{\bm{\omega}\in\mathcal{W}}\frac{\lambda}{2}\|\bm{\omega}\|^{2}]
\displaystyle\leq 1282L2(2+λ)λ2n+λ2(B𝝎2+B𝝂2)3842L2λ2n+λ2(B𝝎2+B𝝂2).\displaystyle\frac{128\sqrt{2}L^{2}(2\ell+\lambda)}{\lambda^{2}n}+\frac{\lambda}{2}(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2})\leq\frac{384\sqrt{2}L^{2}\ell}{\lambda^{2}n}+\frac{\lambda}{2}(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2}).

Lemma 13 (Closeness Upper Bound).

Under the settings of Theorem 5, we have the closeness bound between the intermediate variables (𝛚^,𝛎^)(\widehat{\bm{\omega}},\widehat{\bm{\nu}}) in Algorithm 6 and (𝛚S,𝛎S)(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}) in eq.(106):

{𝝎S𝝎^,𝝂S𝝂^}(82L~2~3ρ/λ5+8L~~2/λ2)m2n(λn(~+~2/λ)m).\{\|\bm{\omega}_{S^{\setminus}}^{*}-\widehat{\bm{\omega}}\|,\|\bm{\nu}_{S^{\setminus}}^{*}-\widehat{\bm{\nu}}\|\}\leq\frac{(8\sqrt{2}\widetilde{L}^{2}\widetilde{\ell}^{3}\rho/\lambda^{5}+8\widetilde{L}\widetilde{\ell}^{2}/\lambda^{2})m^{2}}{n(\lambda n-(\widetilde{\ell}+\widetilde{\ell}^{2}/\lambda)m)}. (112)
Proof.

Since we now run the algorithms AscscA_{sc-sc} and A¯scsc\bar{A}_{sc-sc} with the regularized loss function f~\widetilde{f}, the proof is identical to that of Lemma 10. ∎

Equipped with the supporting lemmas above, the proof of Theorem 5 can be separated into the proofs of the following three lemmas.

Lemma 14 (Minimax Unlearning Certification).

Under the settings of Theorem 5, our minimax learning algorithm AccA_{c-c} and unlearning algorithm A¯cc\bar{A}_{c-c} achieve (ϵ,δ)(\epsilon,\delta)-certified minimax unlearning if we choose

σ1 and σ2=2(82L~2~3ρ/λ5+8L~~2/λ2)m2n(λn(~+~2/λ)m)ϵ2log(2.5/δ).\sigma_{1}\text{~{}and~{}}\sigma_{2}=\frac{2(8\sqrt{2}\widetilde{L}^{2}\widetilde{\ell}^{3}\rho/\lambda^{5}+8\widetilde{L}\widetilde{\ell}^{2}/\lambda^{2})m^{2}}{n(\lambda n-(\widetilde{\ell}+\widetilde{\ell}^{2}/\lambda)m)\epsilon}\sqrt{2\log(2.5/\delta)}. (113)
Proof.

With the closeness upper bound in Lemma 13 and the given noise scales in eq.(113), the proof is identical to that of Theorem 6. ∎

Lemma 15 (Population Primal-Dual Risk).

Under the settings of Theorem 5, the population weak and strong PD risks for (𝛚u,𝛎u)(\bm{\omega}^{u},\bm{\nu}^{u}) returned by Algorithm 6 are

{w(𝝎u,𝝂u)𝒪((L33ρ/λ6+L22/λ3)m2dlog(1/δ)n2ϵ+mL2λn+λ(B𝝎2+B𝝂2)),s(𝝎u,𝝂u)𝒪((L33ρ/λ6+L22/λ3)m2dlog(1/δ)n2ϵ+mL2λn+L2λ2n+λ(B𝝎2+B𝝂2)).\left\{\begin{aligned} &\triangle^{w}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathcal{O}\bigg{(}(L^{3}\ell^{3}\rho/\lambda^{6}+L^{2}\ell^{2}/\lambda^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\lambda n}+\lambda(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2})\bigg{)},\\ &\triangle^{s}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathcal{O}\bigg{(}(L^{3}\ell^{3}\rho/\lambda^{6}+L^{2}\ell^{2}/\lambda^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\lambda n}+\frac{L^{2}\ell}{\lambda^{2}n}+\lambda(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2})\bigg{)}.\end{aligned}\right. (114)
Proof.

For the population weak PD risk, an application of eq.(77) together with Lemma 12 gives that

w(𝝎u,𝝂u)𝔼[L𝝎u𝝎S]+𝔼[L𝝂u𝝂S]+322L2λn+λ2(B𝝎2+B𝝂2).\triangle^{w}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathbb{E}[L\|\bm{\omega}^{u}-\bm{\omega}_{S}^{*}\|]+\mathbb{E}[L\|\bm{\nu}^{u}-\bm{\nu}_{S}^{*}\|]+\frac{32\sqrt{2}L^{2}}{\lambda n}+\frac{\lambda}{2}(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2}). (115)

According to Algorithm 6, we have the unlearning update step

𝝎u=𝝎S+1nm[𝙳𝝎𝝎F~S(𝝎S,𝝂S)]1ziU𝝎f~(𝝎S,𝝂S;zi)+𝝃1,\bm{\omega}^{u}=\bm{\omega}_{S}^{*}+\frac{1}{n-m}[{\tt D}_{\bm{\omega}\bm{\omega}}\widetilde{F}_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\sum_{z_{i}\in U}\nabla_{\bm{\omega}}\widetilde{f}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})+\bm{\xi}_{1}, (116)

where F~S(𝝎,𝝂):=1nmziSUf~(𝝎,𝝂;zi)\widetilde{F}_{S^{\setminus}}(\bm{\omega},\bm{\nu}):=\frac{1}{n-m}\sum_{z_{i}\in S\setminus U}\widetilde{f}(\bm{\omega},\bm{\nu};z_{i}). From the relation above, we further get

𝔼[𝝎u𝝎S]\displaystyle\mathbb{E}[\|\bm{\omega}^{u}-\bm{\omega}_{S}^{*}\|] (117)
=\displaystyle= 𝔼[1nm[𝙳𝝎𝝎F~S(𝝎S,𝝂S)]1ziU𝝎f~(𝝎S,𝝂S;zi)+𝝃1]\displaystyle\mathbb{E}\left[\left\|\frac{1}{n-m}[{\tt D}_{\bm{\omega}\bm{\omega}}\widetilde{F}_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\cdot\sum_{z_{i}\in U}\nabla_{\bm{\omega}}\widetilde{f}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})+\bm{\xi}_{1}\right\|\right]
(i)\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}} 1nm𝔼[[𝙳𝝎𝝎F~S(𝝎S,𝝂S)]1ziU𝝎f~(𝝎S,𝝂S;zi)]+𝔼[𝝃1]\displaystyle\frac{1}{n-m}\mathbb{E}\left[\left\|[{\tt D}_{\bm{\omega}\bm{\omega}}\widetilde{F}_{S^{\setminus}}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})]^{-1}\cdot\sum_{z_{i}\in U}\nabla_{\bm{\omega}}\widetilde{f}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\right\|\right]+\mathbb{E}[\|\bm{\xi}_{1}\|]
(ii)\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}} 1nmnm(λn~(1+~/λ)m)𝔼[ziU𝝎f~(𝝎S,𝝂S;zi)]+𝔼[𝝃12]\displaystyle\frac{1}{n-m}\cdot\frac{n-m}{(\lambda n-\widetilde{\ell}(1+\widetilde{\ell}/\lambda)m)}\mathbb{E}\left[\left\|\sum_{z_{i}\in U}\nabla_{\bm{\omega}}\widetilde{f}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\right\|\right]+\sqrt{\mathbb{E}[\|\bm{\xi}_{1}\|^{2}]}
=(iii)\displaystyle\stackrel{{\scriptstyle(iii)}}{{=}} 1(λn~(1+~/λ)m)𝔼[ziU𝝎f~(𝝎S,𝝂S;zi)]+d1σ1\displaystyle\frac{1}{(\lambda n-\widetilde{\ell}(1+\widetilde{\ell}/\lambda)m)}\mathbb{E}\left[\left\|\sum_{z_{i}\in U}\nabla_{\bm{\omega}}\widetilde{f}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\right\|\right]+\sqrt{d_{1}}\sigma_{1}
(iv)\displaystyle\stackrel{{\scriptstyle(iv)}}{{\leq}} 1(λn~(1+~/λ)m)𝔼[ziU(𝝎f(𝝎S,𝝂S;zi)+λ𝝎S+λ𝝂S)]+d1σ1\displaystyle\frac{1}{(\lambda n-\widetilde{\ell}(1+\widetilde{\ell}/\lambda)m)}\mathbb{E}\left[\sum_{z_{i}\in U}\big{(}\|\nabla_{\bm{\omega}}f(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*};z_{i})\|+\lambda\|\bm{\omega}_{S}^{*}\|+\lambda\|\bm{\nu}_{S}^{*}\|\big{)}\right]+\sqrt{d_{1}}\sigma_{1}
(vi)\displaystyle\stackrel{{\scriptstyle(vi)}}{{\leq}} 3mL(λn~(1+~/λ)m)+d1σ1,\displaystyle\frac{3mL}{(\lambda n-\widetilde{\ell}(1+\widetilde{\ell}/\lambda)m)}+\sqrt{d_{1}}\sigma_{1},

where the inequality (i) uses the triangle inequality and the inequality (ii) follows from an application of eq.(71), together with Jensen’s inequality to bound 𝔼[𝝃1]\mathbb{E}[\|\bm{\xi}_{1}\|]. The equality (iii) holds because the vector 𝝃1𝒩(0,σ12𝐈d1)\bm{\xi}_{1}\sim\mathcal{N}(0,\sigma_{1}^{2}\mathbf{I}_{d_{1}}) and thus we have 𝔼[𝝃12]=d1σ12\mathbb{E}[\|\bm{\xi}_{1}\|^{2}]=d_{1}\sigma_{1}^{2}. The inequality (iv) uses the definition of the function f~\widetilde{f} and the triangle inequality. The inequality (vi) is due to the fact that f(𝝎,𝝂;z)f(\bm{\omega},\bm{\nu};z) is LL-Lipschitz continuous and Lemma 11. Symmetrically, we have

𝔼[𝝂u𝝂S]3mL(λn~(1+~/λ)m)+d2σ2.\mathbb{E}[\|\bm{\nu}^{u}-\bm{\nu}_{S}^{*}\|]\leq\frac{3mL}{(\lambda n-\widetilde{\ell}(1+\widetilde{\ell}/\lambda)m)}+\sqrt{d_{2}}\sigma_{2}. (118)

Plugging eq.(117) and eq.(118) into eq.(115) with noise scales given in Lemma 14, we can get our generalization guarantee in terms of population weak PD risk:

w(𝝎u,𝝂u)𝒪((L33ρ/λ6+L22/λ3)m2dlog(1/δ)n2ϵ+mL2λn+λ(B𝝎2+B𝝂2)).\triangle^{w}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathcal{O}\bigg{(}(L^{3}\ell^{3}\rho/\lambda^{6}+L^{2}\ell^{2}/\lambda^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\lambda n}+\lambda(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2})\bigg{)}. (119)

Similarly, using an application of eq.(83) together with Lemma 12, Lemma 14, eq.(117) and eq.(118), we can get the following population strong PD risk:

s(𝝎u,𝝂u)𝒪((L33ρ/λ6+L22/λ3)m2dlog(1/δ)n2ϵ+mL2λn+L2λ2n+λ(B𝝎2+B𝝂2)).\triangle^{s}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathcal{O}\bigg{(}(L^{3}\ell^{3}\rho/\lambda^{6}+L^{2}\ell^{2}/\lambda^{3})\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\lambda n}+\frac{L^{2}\ell}{\lambda^{2}n}+\lambda(B_{\bm{\omega}}^{2}+B_{\bm{\nu}}^{2})\bigg{)}. (120)

Lemma 16 (Deletion Capacity).

Under the settings of Theorem 5, the deletion capacity of Algorithm 6 is

mϵ,δ,γA,A¯(d1,d2,n)cnϵ(dlog(1/δ))1/4,m^{A,\bar{A}}_{\epsilon,\delta,\gamma}(d_{1},d_{2},n)\geq c\cdot\frac{n\sqrt{\epsilon}}{(d\log(1/\delta))^{1/4}}, (121)

where the constant cc depends on L,l,ρ,B𝛚L,l,\rho,B_{\bm{\omega}} and B𝛎B_{\bm{\nu}}.

Proof.

By the definition of deletion capacity, in order to ensure the population PD risk derived in Lemma 15 is bounded by γ\gamma, it suffices to let mϵ,δ,γA,A¯(d1,d2,n)cnϵ(dlog(1/δ))1/4m^{A,\bar{A}}_{\epsilon,\delta,\gamma}(d_{1},d_{2},n)\geq c\cdot\frac{n\sqrt{\epsilon}}{(d\log(1/\delta))^{1/4}}. ∎

C.4 Minimax Unlearning Algorithm for Smooth Convex-Strongly-Concave Loss Function

In this section, we briefly discuss the extension to the smooth C-SC setting. The SC-C setting is symmetric and thus omitted here.

Given the loss function f(𝝎,𝝂;z)f(\bm{\omega},\bm{\nu};z) that satisfies Assumption 1 with μ𝝂\mu_{\bm{\nu}}-strong concavity in 𝝂\bm{\nu}, we define the regularized function as f~(𝝎,𝝂;z)=f(𝝎,𝝂;z)+λ2𝝎2\widetilde{f}(\bm{\omega},\bm{\nu};z)=f(\bm{\omega},\bm{\nu};z)+\frac{\lambda}{2}\|\bm{\omega}\|^{2}. Our minimax learning and minimax unlearning algorithms for C-SC loss function ff denoted by AcscA_{c-sc} and A¯csc\bar{A}_{c-sc} are given in Algorithm 7 and Algorithm 8 respectively. Additionally, we denote the regularized population loss by F~(𝝎,𝝂):=𝔼z𝒟[f~(𝝎,𝝂;z)]\widetilde{F}(\bm{\omega},\bm{\nu}):=\mathbb{E}_{z\sim\mathcal{D}}[\widetilde{f}(\bm{\omega},\bm{\nu};z)] and regularized empirical loss by F~S(𝝎,𝝂):=1ni=1nf~(𝝎,𝝂;zi)\widetilde{F}_{S}(\bm{\omega},\bm{\nu}):=\frac{1}{n}\sum_{i=1}^{n}\widetilde{f}(\bm{\omega},\bm{\nu};z_{i}).

Algorithm 7 Minimax Learning Algorithm (Acsc)(A_{c-sc})
0:  Dataset SS : {zi}i=1n𝒟n\{z_{i}\}^{n}_{i=1}\sim\mathcal{D}^{n}, loss function: f(𝝎,𝝂;z)f(\bm{\omega},\bm{\nu};z), regularization parameter: λ\lambda.
1:  Define
f~(𝝎,𝝂;z)=f(𝝎,𝝂;z)+λ2𝝎2.\widetilde{f}(\bm{\omega},\bm{\nu};z)=f(\bm{\omega},\bm{\nu};z)+\frac{\lambda}{2}\|\bm{\omega}\|^{2}. (122)
2:  Run the algorithm AscscA_{sc-sc} on the dataset SS with loss function f~\widetilde{f}.
2:  (𝝎S,𝝂S,𝙳𝝎𝝎F~S(𝝎S,𝝂S),𝙳𝝂𝝂F~S(𝝎S,𝝂S))Ascsc(S,f~)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*},\mathtt{D}_{\bm{\omega}\bm{\omega}}\widetilde{F}_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}),\mathtt{D}_{\bm{\nu}\bm{\nu}}\widetilde{F}_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}))\leftarrow A_{sc-sc}(S,\widetilde{f}).
Algorithm 8 Certified Minimax Unlearning for Convex-Strongly-Concave Loss (A¯csc)(\bar{A}_{c-sc})
0:  Delete requests UU : {zj}j=1mS\{z_{j}\}^{m}_{j=1}\subseteq S, output of Acsc(S)A_{c-sc}(S): (𝝎S,𝝂S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}), memory variables T(S)T(S): {𝙳𝝎𝝎F~S(𝝎S,𝝂S),𝙳𝝂𝝂F~S(𝝎S,𝝂S)}\{\mathtt{D}_{\bm{\omega}\bm{\omega}}\widetilde{F}_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}),\mathtt{D}_{\bm{\nu}\bm{\nu}}\widetilde{F}_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\}, loss function: f(𝝎,𝝂;z)f(\bm{\omega},\bm{\nu};z), regularization parameter: λ\lambda, noise parameters: σ1\sigma_{1}, σ2\sigma_{2}.
1:  Define
f~(𝝎,𝝂;z)=f(𝝎,𝝂;z)+λ2𝝎2.\widetilde{f}(\bm{\omega},\bm{\nu};z)=f(\bm{\omega},\bm{\nu};z)+\frac{\lambda}{2}\|\bm{\omega}\|^{2}. (123)
2:  Run the algorithm A¯scsc\bar{A}_{sc-sc} with delete requests UU, learning variables (𝝎S,𝝂S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}), memory variables T(S)T(S), loss function f~\widetilde{f} and noise parameters σ1\sigma_{1} and σ2\sigma_{2}.
2:  (𝝎u,𝝂u)A¯scsc(U,(𝝎S,𝝂S),T(S),f~,σ1,σ2)(\bm{\omega}^{u},\bm{\nu}^{u})\leftarrow\bar{A}_{sc-sc}(U,(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}),T(S),\widetilde{f},\sigma_{1},\sigma_{2}).

Note that the function f~(𝝎,𝝂;z)\widetilde{f}(\bm{\omega},\bm{\nu};z) is λ\lambda-strongly convex in 𝝎\bm{\omega}, μ𝝂\mu_{\bm{\nu}}-strongly concave in 𝝂\bm{\nu}, (L~:=2L+λ𝝎)(\widetilde{L}:=2L+\lambda\|\bm{\omega}\|)-Lipschitz, (~:=2(2+λ))(\widetilde{\ell}:=\sqrt{2}(2\ell+\lambda))-gradient Lipschitz and ρ\rho-Hessian Lipschitz. We also have 𝙳𝝎𝝎F~S(𝝎S,𝝂S)λ\|{\tt D}_{\bm{\omega}\bm{\omega}}\widetilde{F}_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\|\geq\lambda. Let (𝝎S,𝝂S)(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}) be the optimal solution of the loss function F~S(𝝎,𝝂)\widetilde{F}_{S^{\setminus}}(\bm{\omega},\bm{\nu}) on the remaining dataset, i.e.,

(𝝎S,𝝂S):=argmin𝝎𝒲max𝝂𝒱{F~S(𝝎,𝝂):=1nmziSUf~(𝝎,𝝂;zi)}.(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}):=\arg\min_{\bm{\omega}\in\mathcal{W}}\max_{\bm{\nu}\in\mathcal{V}}\{\widetilde{F}_{S^{\setminus}}(\bm{\omega},\bm{\nu}):=\frac{1}{n-m}\sum_{z_{i}\in S\setminus U}\widetilde{f}(\bm{\omega},\bm{\nu};z_{i})\}. (124)

An application of Lemma 11 implies that the empirical optimizer (𝝎S,𝝂S)(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*}) returned by Algorithm 7 satisfies 𝝎SL/λ\|\bm{\omega}_{S}^{*}\|\leq L/\lambda. Thus our domain of interest is 𝒲:={𝝎|𝝎L/λ}\mathcal{W}:=\{\bm{\omega}|\|\bm{\omega}\|\leq L/\lambda\}. Over the set 𝒲\mathcal{W}, the function f~(𝝎,𝝂;z)\widetilde{f}(\bm{\omega},\bm{\nu};z) is 3L3L-Lipschitz continuous. Suppose the strongly-convex regularization parameter λ\lambda satisfies λ<\lambda<\ell; then f~(𝝎,𝝂;z)\widetilde{f}(\bm{\omega},\bm{\nu};z) has 323\sqrt{2}\ell-Lipschitz gradients.
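These constants can be verified directly: since only the \bm{\omega}-block is regularized,

\|\nabla_{\bm{\omega}}\widetilde{f}(\bm{\omega},\bm{\nu};z)\|\leq L+\lambda\|\bm{\omega}\|\leq 2L\ \text{ and }\ \|\nabla_{\bm{\nu}}\widetilde{f}(\bm{\omega},\bm{\nu};z)\|\leq L\quad\text{over }\mathcal{W},

so \widetilde{f} is 3L-Lipschitz continuous there, and with \lambda<\ell we get \widetilde{\ell}=\sqrt{2}(2\ell+\lambda)\leq\sqrt{2}\cdot 3\ell=3\sqrt{2}\ell.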

The corresponding theoretical results are given below.

Lemma 17 (Closeness Upper Bound).

Let Assumption 1 hold. Assume the function f(𝛚,𝛎;z)f(\bm{\omega},\bm{\nu};z) is μ𝛎\mu_{\bm{\nu}}-strongly concave in 𝛎\bm{\nu} and 𝙳𝛎𝛎F~S(𝛚S,𝛎S)μ𝛎𝛎\|{\tt D}_{\bm{\nu}\bm{\nu}}\widetilde{F}_{S}(\bm{\omega}_{S}^{*},\bm{\nu}_{S}^{*})\|\geq\mu_{\bm{\nu}\bm{\nu}}. Let μ=min{μ𝛎,μ𝛎𝛎}\mu=\min\{\mu_{\bm{\nu}},\mu_{\bm{\nu}\bm{\nu}}\}. Then, we have the closeness bound between the intermediate variables (𝛚^,𝛎^)(\widehat{\bm{\omega}},\widehat{\bm{\nu}}) in Algorithm 8 and (𝛚S,𝛎S)(\bm{\omega}_{S^{\setminus}}^{*},\bm{\nu}_{S^{\setminus}}^{*}) in eq.(124):

{𝝎S𝝎^(82L~2~3ρλ2μ3+8L~~2λμ)m2n(λn(~+~2/μ)m),𝝂S𝝂^(82L~2~3ρλ3μ2+8L~~2λμ)m2n(μn(~+~2/λ)m).\left\{\begin{array}[]{lr}\|\bm{\omega}_{S^{\setminus}}^{*}-\widehat{\bm{\omega}}\|\leq\big{(}\frac{8\sqrt{2}\widetilde{L}^{2}\widetilde{\ell}^{3}\rho}{\lambda^{2}\mu^{3}}+\frac{8\widetilde{L}\widetilde{\ell}^{2}}{\lambda\mu}\big{)}\cdot\frac{m^{2}}{n(\lambda n-(\widetilde{\ell}+\widetilde{\ell}^{2}/\mu)m)},\\ \|\bm{\nu}_{S^{\setminus}}^{*}-\widehat{\bm{\nu}}\|\leq\big{(}\frac{8\sqrt{2}\widetilde{L}^{2}\widetilde{\ell}^{3}\rho}{\lambda^{3}\mu^{2}}+\frac{8\widetilde{L}\widetilde{\ell}^{2}}{\lambda\mu}\big{)}\cdot\frac{m^{2}}{n(\mu n-(\widetilde{\ell}+\widetilde{\ell}^{2}/\lambda)m)}.\end{array}\right. (125)
Proof.

Since we now run the algorithms AscscA_{sc-sc} and A¯scsc\bar{A}_{sc-sc} with the regularized loss function f~\widetilde{f}, the proof is identical to that of Lemma 10. ∎

Lemma 18 (Minimax Unlearning Certification).

Under the settings of Lemma 17, our minimax learning algorithm AcscA_{c-sc} and unlearning algorithm A¯csc\bar{A}_{c-sc} achieve (ϵ,δ)(\epsilon,\delta)-certified minimax unlearning if we choose

{σ1=(82L~2~3ρλ2μ3+8L~~2λμ)2m22log(2.5/δ)n(λn(~+~2/μ)m)ϵ,σ2=(82L~2~3ρλ3μ2+8L~~2λμ)2m22log(2.5/δ)n(μn(~+~2/λ)m)ϵ.\left\{\begin{array}[]{lr}\sigma_{1}=\big{(}\frac{8\sqrt{2}\widetilde{L}^{2}\widetilde{\ell}^{3}\rho}{\lambda^{2}\mu^{3}}+\frac{8\widetilde{L}\widetilde{\ell}^{2}}{\lambda\mu}\big{)}\cdot\frac{2m^{2}\sqrt{2\log(2.5/\delta)}}{n(\lambda n-(\widetilde{\ell}+\widetilde{\ell}^{2}/\mu)m)\epsilon},\\ \sigma_{2}=\big{(}\frac{8\sqrt{2}\widetilde{L}^{2}\widetilde{\ell}^{3}\rho}{\lambda^{3}\mu^{2}}+\frac{8\widetilde{L}\widetilde{\ell}^{2}}{\lambda\mu}\big{)}\cdot\frac{2m^{2}\sqrt{2\log(2.5/\delta)}}{n(\mu n-(\widetilde{\ell}+\widetilde{\ell}^{2}/\lambda)m)\epsilon}.\end{array}\right. (126)
Proof.

With the closeness upper bound in Lemma 17 and the given noise scales in eq.(126), the proof is identical to that of Theorem 6. ∎

Lemma 19 (Population Weak PD Risk).

Under the same settings as Lemma 17, and supposing the parameter space 𝒲\mathcal{W} is bounded so that max𝛚𝒲𝛚B𝛚\max_{\bm{\omega}\in\mathcal{W}}\|\bm{\omega}\|\leq B_{\bm{\omega}}, the population weak PD risk for the certified minimax unlearning variables (𝛚u,𝛎u)(\bm{\omega}^{u},\bm{\nu}^{u}) returned by Algorithm 8 is

w(𝝎u,𝝂u)𝒪((L33ρλ3μ3+L22λ2μ+L22λμ2)m2dlog(1/δ)n2ϵ+mL2λn+mL2μn+λB𝝎2),\triangle^{w}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathcal{O}\bigg{(}\big{(}\frac{L^{3}\ell^{3}\rho}{\lambda^{3}\mu^{3}}+\frac{L^{2}\ell^{2}}{\lambda^{2}\mu}+\frac{L^{2}\ell^{2}}{\lambda\mu^{2}}\big{)}\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\lambda n}+\frac{mL^{2}}{\mu n}+\lambda B_{\bm{\omega}}^{2}\bigg{)}, (127)

where d=max{d1,d2}d=\max\{d_{1},d_{2}\}. In particular, by setting the regularization parameter λ\lambda as:

λ=max{LB𝝎mn,LmB𝝎μn(dlog(1/δ)ϵ)1/2,(L22m2dlog(1/δ)B𝝎2μn2ϵ)1/3,(L33ρm2dlog(1/δ)B𝝎2μ3n2ϵ)1/4},\begin{split}\lambda=\max\bigg{\{}&\frac{L}{B_{\bm{\omega}}}\sqrt{\frac{m}{n}},\frac{L\ell m}{B_{\bm{\omega}}\mu n}\big{(}\frac{\sqrt{d\log(1/\delta)}}{\epsilon}\big{)}^{1/2},\big{(}\frac{L^{2}\ell^{2}m^{2}\sqrt{d\log(1/\delta)}}{B_{\bm{\omega}}^{2}\mu n^{2}\epsilon}\big{)}^{1/3},\big{(}\frac{L^{3}\ell^{3}\rho m^{2}\sqrt{d\log(1/\delta)}}{B_{\bm{\omega}}^{2}\mu^{3}n^{2}\epsilon}\big{)}^{1/4}\bigg{\}},\end{split} (128)

we have the following population weak PD risk:

w(𝝎u,𝝂u)𝒪(c1mn+c2mn+c3(dlog(1/δ)ϵ)1/2mn+c4(dlog(1/δ)ϵ)1/3(mn)2/3+c5(dlog(1/δ)ϵ)1/4mn),\begin{split}\triangle^{w}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathcal{O}\bigg{(}&c_{1}\sqrt{\frac{m}{n}}+c_{2}\frac{m}{n}+c_{3}\big{(}\frac{\sqrt{d\log(1/\delta)}}{\epsilon}\big{)}^{1/2}\frac{m}{n}\\ &+c_{4}\big{(}\frac{\sqrt{d\log(1/\delta)}}{\epsilon}\big{)}^{1/3}\big{(}\frac{m}{n}\big{)}^{2/3}+c_{5}\big{(}\frac{\sqrt{d\log(1/\delta)}}{\epsilon}\big{)}^{1/4}\sqrt{\frac{m}{n}}\bigg{)},\end{split} (129)

where c1,c2,c3,c4c_{1},c_{2},c_{3},c_{4} and c5c_{5} are constants that depend only on L,l,ρ,μL,l,\rho,\mu and B𝛚B_{\bm{\omega}}.

Proof.

An application of (Zhang et al., 2021, Theorem 1) gives that

max𝝂𝒱𝔼[F~(𝝎S,𝝂)]min𝝎𝒲𝔼[F~(𝝎,𝝂S)]182L2n(1λ+1μ).\max_{\bm{\nu}\in\mathcal{V}}\mathbb{E}[\widetilde{F}(\bm{\omega}_{S}^{*},\bm{\nu})]-\min_{\bm{\omega}\in\mathcal{W}}\mathbb{E}[\widetilde{F}(\bm{\omega},\bm{\nu}_{S}^{*})]\leq\frac{18\sqrt{2}L^{2}}{n}\bigg{(}\frac{1}{\lambda}+\frac{1}{\mu}\bigg{)}. (130)

Using the relation above with an application of eq.(77) and eq.(109), we have

w(𝝎u,𝝂u)𝔼[L𝝎u𝝎S]+𝔼[L𝝂u𝝂S]+182L2n(1λ+1μ)+λB𝝎22.\triangle^{w}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathbb{E}[L\|\bm{\omega}^{u}-\bm{\omega}_{S}^{*}\|]+\mathbb{E}[L\|\bm{\nu}^{u}-\bm{\nu}_{S}^{*}\|]+\frac{18\sqrt{2}L^{2}}{n}\bigg{(}\frac{1}{\lambda}+\frac{1}{\mu}\bigg{)}+\frac{\lambda B_{\bm{\omega}}^{2}}{2}. (131)

By an application of eq.(117), we further get

𝔼[𝝎u𝝎S]2mLλn~(1+~/μ)m+d1σ1,\mathbb{E}[\|\bm{\omega}^{u}-\bm{\omega}_{S}^{*}\|]\leq\frac{2mL}{\lambda n-\widetilde{\ell}(1+\widetilde{\ell}/\mu)m}+\sqrt{d_{1}}\sigma_{1}, (132)

and

𝔼[𝝂u𝝂S]2mLμn~(1+~/λ)m+d2σ2.\mathbb{E}[\|\bm{\nu}^{u}-\bm{\nu}_{S}^{*}\|]\leq\frac{2mL}{\mu n-\widetilde{\ell}(1+\widetilde{\ell}/\lambda)m}+\sqrt{d_{2}}\sigma_{2}. (133)

Plugging eq.(132) and eq.(133) into eq.(131) with noise scales given in Lemma 18, we can get our generalization guarantee:

w(𝝎u,𝝂u)𝒪((L33ρλ3μ3+L22λ2μ+L22λμ2)m2dlog(1/δ)n2ϵ+mL2λn+mL2μn+λB𝝎2),\triangle^{w}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathcal{O}\bigg{(}\big{(}\frac{L^{3}\ell^{3}\rho}{\lambda^{3}\mu^{3}}+\frac{L^{2}\ell^{2}}{\lambda^{2}\mu}+\frac{L^{2}\ell^{2}}{\lambda\mu^{2}}\big{)}\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\lambda n}+\frac{mL^{2}}{\mu n}+\lambda B_{\bm{\omega}}^{2}\bigg{)}, (134)

where d=max{d1,d2}d=\max\{d_{1},d_{2}\}. ∎
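The particular choice of \lambda in eq.(128) equates each \lambda-dependent term of eq.(134) with the regularization bias \lambda B_{\bm{\omega}}^{2}; for instance,

\frac{mL^{2}}{\lambda n}=\lambda B_{\bm{\omega}}^{2}\ \Longrightarrow\ \lambda=\frac{L}{B_{\bm{\omega}}}\sqrt{\frac{m}{n}},\qquad\frac{L^{3}\ell^{3}\rho}{\lambda^{3}\mu^{3}}\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}=\lambda B_{\bm{\omega}}^{2}\ \Longrightarrow\ \lambda=\Big{(}\frac{L^{3}\ell^{3}\rho m^{2}\sqrt{d\log(1/\delta)}}{B_{\bm{\omega}}^{2}\mu^{3}n^{2}\epsilon}\Big{)}^{1/4},

and since every \lambda-dependent term in eq.(134) is decreasing in \lambda, taking the maximum over all candidates keeps each of them at most \lambda B_{\bm{\omega}}^{2}, which yields eq.(129).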

Lemma 20 (Population Strong PD Risk).

Under the same settings as Lemma 19, the population strong PD risk for (𝛚u,𝛎u)(\bm{\omega}^{u},\bm{\nu}^{u}) returned by Algorithm 8 is

s(𝝎u,𝝂u)𝒪(\displaystyle\triangle^{s}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathcal{O}\bigg{(} (L33ρλ3μ3+L22λ2μ+L22λμ2)m2dlog(1/δ)n2ϵ+mL2λn+mL2μn\displaystyle\big{(}\frac{L^{3}\ell^{3}\rho}{\lambda^{3}\mu^{3}}+\frac{L^{2}\ell^{2}}{\lambda^{2}\mu}+\frac{L^{2}\ell^{2}}{\lambda\mu^{2}}\big{)}\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\lambda n}+\frac{mL^{2}}{\mu n} (135)
+L2λ3/2μ1/2n+L2λ1/2μ3/2n+λB𝝎2),\displaystyle+\frac{L^{2}\ell}{\lambda^{3/2}\mu^{1/2}n}+\frac{L^{2}\ell}{\lambda^{1/2}\mu^{3/2}n}+\lambda B_{\bm{\omega}}^{2}\bigg{)},

where d=max{d1,d2}d=\max\{d_{1},d_{2}\}. In particular, by setting the regularization parameter λ\lambda as:

λ=max{LB𝝎mn,LmB𝝎μn(dlog(1/δ)ϵ)1/2,(L22m2dlog(1/δ)B𝝎2μn2ϵ)1/3,(L33ρm2dlog(1/δ)B𝝎2μ3n2ϵ)1/4,(L2B𝝎2μ1/2n)2/5,1μ(L2B𝝎2n)2/3}.\begin{split}\lambda=\max\bigg{\{}&\frac{L}{B_{\bm{\omega}}}\sqrt{\frac{m}{n}},\frac{L\ell m}{B_{\bm{\omega}}\mu n}\big{(}\frac{\sqrt{d\log(1/\delta)}}{\epsilon}\big{)}^{1/2},\big{(}\frac{L^{2}\ell^{2}m^{2}\sqrt{d\log(1/\delta)}}{B_{\bm{\omega}}^{2}\mu n^{2}\epsilon}\big{)}^{1/3},\\ &\big{(}\frac{L^{3}\ell^{3}\rho m^{2}\sqrt{d\log(1/\delta)}}{B_{\bm{\omega}}^{2}\mu^{3}n^{2}\epsilon}\big{)}^{1/4},\big{(}\frac{L^{2}\ell}{B_{\bm{\omega}}^{2}\mu^{1/2}n}\big{)}^{2/5},\frac{1}{\mu}\big{(}\frac{L^{2}\ell}{B_{\bm{\omega}}^{2}n}\big{)}^{2/3}\bigg{\}}.\end{split} (136)

we have the following population strong PD risk:

s(𝝎u,𝝂u)𝒪(c1mn+c2mn+c3(dlog(1/δ)ϵ)1/2mn+c4(dlog(1/δ)ϵ)1/3(mn)2/3+c5(dlog(1/δ)ϵ)1/4mn+c61n2/5+c71n2/3),\begin{split}\triangle^{s}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathcal{O}\bigg{(}&c_{1}\sqrt{\frac{m}{n}}+c_{2}\frac{m}{n}+c_{3}\big{(}\frac{\sqrt{d\log(1/\delta)}}{\epsilon}\big{)}^{1/2}\frac{m}{n}+c_{4}\big{(}\frac{\sqrt{d\log(1/\delta)}}{\epsilon}\big{)}^{1/3}\big{(}\frac{m}{n}\big{)}^{2/3}\\ &+c_{5}\big{(}\frac{\sqrt{d\log(1/\delta)}}{\epsilon}\big{)}^{1/4}\sqrt{\frac{m}{n}}+c_{6}\frac{1}{n^{2/5}}+c_{7}\frac{1}{n^{2/3}}\bigg{)},\end{split} (137)

where c1,c2,c3,c4,c5,c6c_{1},c_{2},c_{3},c_{4},c_{5},c_{6} and c7c_{7} are constants that depend only on L,l,ρ,μL,l,\rho,\mu and B𝛚B_{\bm{\omega}}.

Proof.

An application of Lemma 9 gives that

𝔼[max𝝂𝒱F~(𝝎S,𝝂)min𝝎𝒲F~(𝝎,𝝂S)]\displaystyle\mathbb{E}[\max_{\bm{\nu}\in\mathcal{V}}\widetilde{F}(\bm{\omega}_{S}^{*},\bm{\nu})-\min_{\bm{\omega}\in\mathcal{W}}\widetilde{F}(\bm{\omega},\bm{\nu}_{S}^{*})]\leq 362L2(2+λ)n(1λ3/2μ1/2+1λ1/2μ3/2)\displaystyle\frac{36\sqrt{2}L^{2}(2\ell+\lambda)}{n}\left(\frac{1}{\lambda^{3/2}\mu^{1/2}}+\frac{1}{\lambda^{1/2}\mu^{3/2}}\right) (138)
\displaystyle\leq 1082L2n(1λ3/2μ1/2+1λ1/2μ3/2).\displaystyle\frac{108\sqrt{2}L^{2}\ell}{n}\left(\frac{1}{\lambda^{3/2}\mu^{1/2}}+\frac{1}{\lambda^{1/2}\mu^{3/2}}\right).

Using an application of eq.(83) and eq.(111), together with eq.(138), eq.(132), eq.(133) and the noise scales given in Lemma 18, we have

s(𝝎u,𝝂u)𝒪(\displaystyle\triangle^{s}(\bm{\omega}^{u},\bm{\nu}^{u})\leq\mathcal{O}\bigg{(} (L33ρλ3μ3+L22λ2μ+L22λμ2)m2dlog(1/δ)n2ϵ+mL2λn+mL2μn\displaystyle\big{(}\frac{L^{3}\ell^{3}\rho}{\lambda^{3}\mu^{3}}+\frac{L^{2}\ell^{2}}{\lambda^{2}\mu}+\frac{L^{2}\ell^{2}}{\lambda\mu^{2}}\big{)}\cdot\frac{m^{2}\sqrt{d\log(1/\delta)}}{n^{2}\epsilon}+\frac{mL^{2}}{\lambda n}+\frac{mL^{2}}{\mu n} (139)
+L2λ3/2μ1/2n+L2λ1/2μ3/2n+λB𝝎2).\displaystyle+\frac{L^{2}\ell}{\lambda^{3/2}\mu^{1/2}n}+\frac{L^{2}\ell}{\lambda^{1/2}\mu^{3/2}n}+\lambda B_{\bm{\omega}}^{2}\bigg{)}.
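The two additional candidates for \lambda in eq.(136) come from balancing the extra terms above against \lambda B_{\bm{\omega}}^{2} in the same way:

\frac{L^{2}\ell}{\lambda^{3/2}\mu^{1/2}n}=\lambda B_{\bm{\omega}}^{2}\ \Longrightarrow\ \lambda=\Big{(}\frac{L^{2}\ell}{B_{\bm{\omega}}^{2}\mu^{1/2}n}\Big{)}^{2/5},\qquad\frac{L^{2}\ell}{\lambda^{1/2}\mu^{3/2}n}=\lambda B_{\bm{\omega}}^{2}\ \Longrightarrow\ \lambda=\frac{1}{\mu}\Big{(}\frac{L^{2}\ell}{B_{\bm{\omega}}^{2}n}\Big{)}^{2/3},

and substituting these values back gives the additional c_{6}n^{-2/5} and c_{7}n^{-2/3} terms in eq.(137).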

Lemma 21 (Deletion Capacity).

Under the same settings as Lemma 19, the deletion capacity of Algorithm 8 is

mϵ,δ,γA,A¯(d1,d2,n)cnϵ(dlog(1/δ))1/4,m^{A,\bar{A}}_{\epsilon,\delta,\gamma}(d_{1},d_{2},n)\geq c\cdot\frac{n\sqrt{\epsilon}}{(d\log(1/\delta))^{1/4}}, (140)

where the constant cc depends on L,l,ρ,μL,l,\rho,\mu and B𝝎B_{\bm{\omega}}, and d=max{d1,d2}d=\max\{d_{1},d_{2}\}.

Proof.

By the definition of deletion capacity, in order to ensure the population PD risk derived in Lemma 19 or Lemma 20 is bounded by γ\gamma, it suffices to let mϵ,δ,γA,A¯(d1,d2,n)cnϵ(dlog(1/δ))1/4m^{A,\bar{A}}_{\epsilon,\delta,\gamma}(d_{1},d_{2},n)\geq c\cdot\frac{n\sqrt{\epsilon}}{(d\log(1/\delta))^{1/4}}. ∎