
Bayesian Inference Forgetting

Shaopeng Fu, Fengxiang He, Yue Xu, and Dacheng Tao

S. Fu and F. He contributed equally. S. Fu, F. He, and D. Tao are with the University of Sydney (Email: [email protected], [email protected], and [email protected]). Y. Xu is with the University of Science and Technology of China; this work was completed when she was an intern at the University of Sydney (Email: [email protected]).
Abstract

The right to be forgotten has been legislated in many countries, but its enforcement in machine learning would cause unbearable costs: companies may need to delete whole models learned from massive resources because of a single individual's request. Existing works propose to remove the knowledge learned from the requested data via its influence function, which is no longer naturally well-defined in Bayesian inference. This paper proposes a Bayesian inference forgetting (BIF) framework to realize the right to be forgotten in Bayesian inference. In the BIF framework, we develop forgetting algorithms for variational inference and Markov chain Monte Carlo. We show that our algorithms can provably remove the influence of single datums on the learned models. Theoretical analysis demonstrates that our algorithms have guaranteed generalizability. Experiments with Gaussian mixture models on synthetic data and Bayesian neural networks on real-world data verify the feasibility of our methods. The source code package is available at https://github.com/fshp971/BIF.

Keywords: Certified knowledge removal, Bayesian inference, variational inference, and Markov chain Monte Carlo.

1 Introduction

The right to be forgotten refers to individuals' right to ask data controllers to delete the data collected from them. It has been recognized in many countries through legislation, including the European Union's General Data Protection Regulation (2016) and the California Consumer Privacy Act (2018). However, enforcement may result in unbearable costs for AI industries. When a single data deletion request comes, the data controller needs to delete the whole machine learning model, which might have cost massive amounts of resources, including data, energy, and time.

To address this issue, recent works [25, 24] propose to forget the influence of the requested data on the learned models via the influence function [14, 32, 38]:

\mathcal{I}(x)=-\nabla^{-2}_{\theta}L(\theta,S)\nabla^{T}_{\theta}\ell(\theta,x),

where $L(\theta,S)=\frac{1}{n}\sum_{i=1}^{n}\ell(\theta,x_{i})$ is the loss function in the optimization. This approach can provably remove knowledge learned from the requested data for optimization-based machine learning, subject to some conditions on the convexity and smoothness of the loss function.
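
For concreteness, this baseline update can be sketched in a few lines (a minimal numpy sketch under the stated convexity conditions; it assumes the Hessian is small enough to form densely, and all function and variable names are illustrative):

    import numpy as np

    def erm_influence(hess_L, grad_l_x):
        # I(x) = -[Hessian of L(theta, S)]^{-1} * gradient of l(theta, x):
        # the first-order effect of datum x on the learned parameters.
        return -np.linalg.solve(hess_L, grad_l_x)

    def erm_forget(theta, hess_L, grad_l_x, n):
        # Removing x shifts the minimizer by approximately -I(x)/n;
        # sign and scaling conventions vary across [25, 24, 38].
        return theta - erm_influence(hess_L, grad_l_x) / n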

However, the loss function $L$ is not always naturally well-defined. Bayesian inference approximates the posterior of probabilistic models, where a loss function may take a different form or may not exist at all. In this case, the influence function is no longer well-defined, and therefore, existing forgetting methods become invalid. Bayesian inference enables a wide spectrum of machine learning algorithms, such as probabilistic matrix factorization [46, 62, 72], topic models [30, 8, 74], probabilistic graphical models [39, 6, 71], and Bayesian neural networks [10, 37, 52, 66, 58, 61, 75]. Overlooking Bayesian inference in developing forgetting algorithms leaves a considerable proportion of machine learning algorithms without a low-cost way to protect the right to be forgotten.

In this paper, we design an energy-based Bayesian inference forgetting (BIF) framework to realize the right to be forgotten in Bayesian inference. We define Bayesian inference influence functions based on pre-defined energy functions, which provably characterize the influence of specific datums on the learned models. We prove that these characterizations are rigorous in first-order approximation, subject to some mild assumptions on the local convexity and smoothness of the energy functions around the convergence points. Specifically, the approximation error is not larger than the order $O(1/n^{2})$, where $n$ is the training sample size, which is negligible when the training sample set is sufficiently large. The BIF framework proposes to delete the Bayesian inference influence function of the requested data. We prove that BIF realizes $\varepsilon$-certified knowledge removal, a new notion defined to evaluate the performance of knowledge removal.

In the BIF framework, we develop certified knowledge removal algorithms for two canonical Bayesian inference algorithms, variational inference [34, 7, 31, 8] and Markov chain Monte Carlo (MCMC) [22, 45, 69, 15, 13, 57].

  • Variational inference forgetting. We show that the evidence lower bound (ELBO; [7]) plays the role of the energy function. Based on the ELBO, a variational inference influence function is defined, which helps deliver the variational inference forgetting algorithm. We prove that $\varepsilon$-certified knowledge removal is guaranteed on the mean-field Gaussian variational family.

  • MCMC forgetting. We formulate a new energy function for MCMC, based on which an MCMC influence function is designed. The new influence function then delivers our MCMC forgetting algorithm. We prove that $\varepsilon$-certified knowledge removal is achieved when the training sample set is sufficiently large. The MCMC forgetting algorithm is also applicable to stochastic gradient MCMC.

We theoretically investigate whether the BIF algorithms affect the generalizability of the learned Bayesian models. Based on the PAC-Bayesian theory [47, 48, 49], we prove that the generalization bounds are of the same order $O(\sqrt{(\log n)/n})$ (where $n$ is the training sample size), no matter whether the Bayesian models are processed by BIF. Moreover, the difference introduced into the generalization bounds is not larger than the order $O(1/n)$.

From the empirical aspect, we conduct systematic experiments to verify the forgetting algorithms. We apply the BIF algorithms to two scenarios: (1) Gaussian mixture models for clustering on synthetic data; and (2) Bayesian neural networks for classification on real-world data. Every scenario involves variational inference, SGLD, and SGHMC. The results suggest that the BIF algorithms effectively remove the knowledge learned from the requested data while keeping other information intact, which is in full agreement with our theoretical results. To secure reproducibility, the source code package is available at https://github.com/fshp971/BIF.

The rest of this paper is organized as follows. Section 2 reviews the related works. Section 3 provides the necessary preliminaries of Bayesian inference. Section 4 presents our main result, the energy-based BIF framework. Section 5 provides the forgetting algorithms developed under the BIF framework for variational inference and MCMC. Section 6 studies the generalization abilities of the proposed algorithms. Section 7 presents the experiment results, and Section 8 concludes the paper.

2 Related Works

Forgetting. The concept “make AI forget you” is first proposed by Ginart et al. [23], which also designs forgetting algorithms for $k$-means. Bourtoule et al. [11] then propose a “sharded, isolated, sliced, and aggregated” (SISA) framework to approach low-cost knowledge removal via the following three steps: (1) divide the whole training sample set into multiple disjoint shards; (2) train models in isolation on each of these shards; (3) retrain the affected model when a request to unlearn a training point arrives. Izzo et al. [33] develop a projective residual update (PRU) method that replaces the update vector in data removal with its projection on a low-dimensional subspace, which reduces the time complexity of data deletion to scale only linearly in the dimension of the data. Guo et al. [25] and Golatkar et al. [24] propose certified removal based on the influence function [14, 32, 38] of the requested data. Specifically, the influence function characterizes the influence of a single datum on the learned model in the empirical risk minimization (ERM) regime. The definition of the influence function relies on the gradient and Hessian of the objective function, which do not naturally exist in Bayesian inference. Other advances include those by Garg et al. [20], Baumhauer et al. [5], and Sommer et al. [63].

We acknowledge that a concurrent work [53] also studies knowledge removal in Bayesian models. The authors employ variational inference to minimize the KL divergence between the approximate posterior after unlearning and the posterior of the model retrained on the remaining data. However, this approach only applies to parametric models, such as those obtained by variational inference, while many non-parametric models, including those learned by MCMC, are left untouched. In contrast, our BIF covers all Bayesian inference methods. Moreover, no theoretical guarantee on the knowledge removal was established, which would be necessary to meet the legal requirements.

Markov chain Monte Carlo (MCMC). Hastings [26] introduces a two-step sampling method named MCMC, which first constructs a Markov chain and then draws samples according to the state of the Markov chain. It is proved that the drawn samples converge to samples from the desired distribution. Since then, many improvements have been made to MCMC, including the Gibbs sampling method [22] and hybrid Monte Carlo [16]. Welling and Teh [69] propose a framework named stochastic gradient Langevin dynamics (SGLD), which shows that by adding a proper noise to a standard stochastic gradient optimization algorithm [59], the iterates converge to samples from the true posterior distribution as the step size is annealed. Ahn et al. [3] extend SGLD by leveraging the Bayesian central limit theorem to improve the slow mixing rate of SGLD. Inspired by the idea of a thermostat in statistical physics, Ding et al. [15] leverage a small number of additional variables to stabilize momentum fluctuations caused by the unknown noise. Chen et al. [13] extend Hamiltonian Monte Carlo (HMC) to stochastic gradient HMC (SGHMC) by adding a friction term, which enables SGHMC to sample from the desired distributions without applying the Metropolis-Hastings rule. Based on Langevin Monte Carlo methods, Patterson and Teh [57] propose stochastic gradient Riemannian Langevin dynamics, which updates parameters according to both the stochastic gradients and additional noise that forces the sampler to explore the full posterior. Ma et al. [45] introduce a general recipe for creating stochastic gradient MCMC (SGMCMC) samplers, which is based on continuous Markov processes specified via two matrices and is proved to be complete.

Variational inference. Jordan et al. [34] introduce the use of variational inference in graphical models [35] such as Bayesian networks and Markov random fields, in which variational inference is employed to approximate the target posterior with families of parameterized distributions. Blei et al. [8] apply variational inference to latent Dirichlet allocation (LDA), which is used to model text corpora. Sung et al. [64] propose a general variational inference framework for conjugate-exponential family models, in which the model parameters other than the latent variables are integrated out exactly, while the latent variables are approximated by a first-order algorithm. By using stochastic optimization [59], a technique that follows noisy estimates of the gradient of the objective, Hoffman et al. [31] derive stochastic variational inference, which iterates between subsampling the data and adjusting the hidden structure based only on the subsample. Paisley et al. [55] propose an algorithm that reduces the variance of the stochastic search gradient by using control variates based on stochastic optimization and allows for direct optimization of the variational lower bound. Titsias and Lázaro-Gredilla [65] propose local expectation gradients, in which the stochastic gradient estimation problem over multiple variational parameters is decomposed into smaller subtasks, each of which focuses on the most relevant part of the variational distribution. Zhang et al. [73] review recent advances in variational inference from four aspects: scalable, generic, accurate, and amortized variational inference.

Bayesian Neural Networks (BNNs). Bayesian inference is first applied to neural networks by Neal [52], in which hybrid Monte Carlo is employed to infer the posteriors of the neural networks' parameters. Blundell et al. [10] propose an efficient backpropagation-compatible algorithm to calculate the gradient of variational neural networks, which expands the applicable domain of variational inference to deep learning. Based on that, Kingma et al. [37] further propose a local reparameterization technique to reduce the variance of the stochastic gradients of variational neural networks. They also develop variational dropout, which can automatically learn the dropout rates. Pearce et al. [58] introduce a family of procedures termed randomized MAP sampling (RMS), which includes randomize-then-optimize and ensemble sampling, and then realize Bayesian inference for neural networks via ensembling under the RMS family. Roth and Pernkopf [61] successfully apply the Dirichlet process prior to BNNs, which enforces the sharing of the network weights and reduces the number of parameters. Zhang et al. [75] propose a cyclical step-size schedule for SGMCMC to learn the multimodal posterior of a Bayesian neural network.

3 Preliminaries

Suppose a data set is $S=\{z_{1},z_{2},\cdots,z_{n}\}$, where $z_{i}\in\mathcal{Z}$ is a datum and $n$ is the sample size. One may hope to fit the dataset $S$ by a parametric probabilistic model $p(z|\theta)$, where $\theta\in\Theta$ is the parameter.

Bayesian inference [69, 7, 22] infers the posterior of the probabilistic model

p(\theta|S)=\frac{p(\theta)\prod_{i=1}^{n}p(z_{i}|\theta)}{p(S)},

where $p(\theta)$ is the prior of the model parameter $\theta$ and $p(S)=\int p(\theta)\prod_{i=1}^{n}p(z_{i}|\theta)\mathrm{d}\theta$ is the normalization factor. In most cases, we do not know the closed form of the normalization factor $p(S)$. Two canonical solutions are variational inference and MCMC.

Variational inference [31, 7] employs two steps to infer the posterior:

  1. Define a family of distributions, $\mathcal{Q}=\{q_{\lambda}|\lambda\in\Lambda\}$, termed the variational family, where $\lambda$ is the variational distribution parameter.

  2. Search in the variational family $\mathcal{Q}$ for the distribution closest to the posterior $p(\theta|S)$, as measured by the KL divergence, i.e., $\min_{\lambda}\mathrm{KL}(q_{\lambda}(\theta)\|p(\theta|S))$.

Minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO [7]):

\text{ELBO}(\lambda,S)=\mathbb{E}_{q_{\lambda}}\log p(\theta,S)-\mathbb{E}_{q_{\lambda}}\log q_{\lambda}(\theta).
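
To see this equivalence, expand the KL divergence and note that $\log p(S)$ is a constant in $\lambda$ (a one-line derivation added for completeness):

\mathrm{KL}(q_{\lambda}(\theta)\|p(\theta|S))=\mathbb{E}_{q_{\lambda}}\log q_{\lambda}(\theta)-\mathbb{E}_{q_{\lambda}}\log p(\theta,S)+\log p(S)=\log p(S)-\text{ELBO}(\lambda,S),

so minimizing the KL divergence over $\lambda$ is the same as maximizing the ELBO.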

A popular variational family is the mean-field Gaussian family [7, 36, 10, 37]. A mean-field Gaussian variational distribution $q_{\lambda}$ has the following structure,

q_{\lambda}=\mathcal{N}(\mu,\sigma^{2}I),

where $\lambda=(\mu_{1},\cdots,\mu_{d},\sigma_{1},\cdots,\sigma_{d})$.

MCMC [45, 26] constructs a Markov chain whose stationary distribution is the targeted posterior. The Markov chain performs a Monte Carlo procedure to sample from the posterior. However, MCMC can be prohibitively time-consuming for large-scale models and large-scale data. To address the issue, stochastic gradient MCMC (SGMCMC [45]) introduces stochastic gradient estimation on mini-batches [59] to MCMC. In this paper, we study two major SGMCMC methods, stochastic gradient Langevin dynamics (SGLD) [69] and stochastic gradient Hamiltonian Monte Carlo (SGHMC) [13].

SGLD updates the weight $\theta_{t}$ as follows,

\theta_{t+1}=\theta_{t}-\varepsilon_{t}\nabla_{\theta}\widetilde{U}(\theta_{t})+\mathcal{N}(0,2\varepsilon_{t}),
\widetilde{U}(\theta)=-\frac{|S|}{|\widetilde{S}|}\sum_{z\in\widetilde{S}}\log p(z|\theta)-\log p(\theta),

where $\varepsilon_{t}\in\mathbb{R}^{+}$ is the learning rate and $\widetilde{S}\subset S$ is the mini-batch. Usually, one needs to anneal $\varepsilon_{t}$ to small values.
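
A single SGLD step is short to implement; below is a PyTorch sketch with illustrative names, where grad_u is assumed to be $\nabla_{\theta}\widetilde{U}(\theta_{t})$ estimated on a mini-batch and $\mathcal{N}(0,2\varepsilon_{t})$ denotes Gaussian noise with variance $2\varepsilon_{t}$:

    import torch

    def sgld_step(theta, grad_u, eps_t):
        # theta_{t+1} = theta_t - eps_t * grad_u + N(0, 2 * eps_t)
        noise = torch.randn_like(theta) * (2.0 * eps_t) ** 0.5
        return theta - eps_t * grad_u + noise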

SGHMC introduces a momentum $v_{t}$ into the weight update:

\theta_{t+1}=\theta_{t}+v_{t},
v_{t+1}=(1-\alpha_{t})v_{t}-\eta_{t}\nabla\widetilde{U}(\theta_{t})+\mathcal{N}(0,2\alpha_{t}\eta_{t}),

where $\alpha_{t}\propto\sqrt{\eta_{t}}$ (see eqs. (14) and (15) in [13]), and the momentum is initialized as $v_{0}\sim\mathcal{N}(0,\eta_{0})$. Moreover, when $\eta_{t}$ is decayed by a factor of $c$, $\alpha_{t}$ needs to be decayed by a factor of $\sqrt{c}$.
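
Analogously, one SGHMC step under the update above can be sketched as follows (same conventions and caveats as the SGLD sketch; grad_u is the stochastic gradient evaluated at $\theta_{t}$):

    import torch

    def sghmc_step(theta, v, grad_u, alpha_t, eta_t):
        # theta_{t+1} = theta_t + v_t
        # v_{t+1} = (1 - alpha_t) * v_t - eta_t * grad_u + N(0, 2 * alpha_t * eta_t)
        theta_next = theta + v
        noise = torch.randn_like(v) * (2.0 * alpha_t * eta_t) ** 0.5
        v_next = (1.0 - alpha_t) * v - eta_t * grad_u + noise
        return theta_next, v_next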

4 Bayesian Inference Forgetting

This section presents our Bayesian inference forgetting framework. We first define $\varepsilon$-certified knowledge removal in Bayesian inference, which quantifies the knowledge removal performance. Then, we study the energy functions in Bayesian inference. The forgetting algorithm for Bayesian inference is designed based on the energy functions.

4.1 $\varepsilon$-Certified Knowledge Removal

Suppose a Bayesian inference method learns a probabilistic model $\hat{p}_{S}$ on the sample set $S$. A client requests to remove her/his data $S^{\prime}\subset S$. A trivial and costly approach is re-training the model on the data $S-S^{\prime}$. Suppose the re-learned probabilistic model is $\hat{p}_{S-S^{\prime}}$.

A forgetting algorithm $\mathcal{A}$ is designed to process the distribution $\hat{p}_{S}$ as follows to estimate the distribution $\hat{p}_{S-S^{\prime}}$,

\hat{p}^{-S^{\prime}}_{S}=\mathcal{A}(\hat{p}_{S},S^{\prime}).

We term the distribution $\hat{p}^{-S^{\prime}}_{S}$ the processed model.

In order to meet the regulation requirements, the algorithm needs to achieve $\varepsilon$-certified knowledge removal, defined as follows.

Definition 4.1 ($\varepsilon$-certified knowledge removal in Bayesian inference).

For any subset $S^{\prime}\subset S$ and $\varepsilon>0$, we say that $\mathcal{A}$ performs $\varepsilon$-certified knowledge removal if

\mathrm{KL}(\hat{p}^{-S^{\prime}}_{S}\|\hat{p}_{S-S^{\prime}})\leq\varepsilon.

4.2 Energy Functions in Bayesian Inference

Bayesian inference can be formulated as minimizing an energy function, which echoes the free-energy principle in physics. This section studies the energy functions in Bayesian inference.

Suppose $F(\gamma,S)$ is the energy function of a probabilistic distribution $\chi(\gamma)$ parametrized by $\gamma\in\Gamma\subset\mathbb{R}^{K}$ over the model parameter space $\Theta$. A typical energy function $F(\gamma,S)$ has the following structure,

F(\gamma,S)=\sum_{i=1}^{n}h(\gamma,z_{i})+f(\gamma), (1)

where $h(\gamma,z)$ is a function defined on $\Gamma\times\mathcal{Z}$ that characterizes the influence of individual datums, and $f(\gamma)$ is a function defined on $\Gamma$ that characterizes the influence of the prior of $\gamma$. Usually, a lower energy corresponds to a smaller distance between the two distributions. In variational inference, the ELBO serves as the energy function. In MCMC, we construct an energy function. Please see Sections 5.1 and 5.2 for more details.

Similarly, the energy function for $S-S^{\prime}$ is $F(\gamma,S-S^{\prime})$. It can be re-arranged as follows,

F(\gamma,S-S^{\prime})=F(\gamma,S)-\sum_{z\in S^{\prime}}h(\gamma,z). (2)

4.3 Forgetting Algorithm for Bayesian Inference

Based on the energy function, the forgetting algorithm for Bayesian inference is designed as shown in fig. 1.

Figure 1: The workflow of the Bayesian inference forgetting framework that removes the influence of a datum $z_{j}\in S$ from the distribution $\hat{p}_{S}$. $\hat{p}_{S}$ is the distribution learned on the training sample set $S$ and is parameterized by $\gamma_{S}$. $\hat{p}^{-z_{j}}_{S}$ is the processed distribution and is parameterized by $\gamma^{-z_{j}}_{S}$, where $\gamma^{-z_{j}}_{S}=\gamma_{S}-\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(0)}{\partial\tau}$.

Suppose that the probabilistic model $\hat{p}_{S}$ learned on the training sample set $S$ is parameterized by $\gamma_{S}$. Then, $\gamma_{S}$ is a local minimizer of the energy function $F(\gamma,S)$. Similarly, suppose that the model $\hat{p}_{S-S^{\prime}}$ learned on $S-S^{\prime}$ is parameterized by $\gamma_{S-S^{\prime}}$. Then $\gamma_{S-S^{\prime}}$ is a local minimizer of the energy function $F(\gamma,S-S^{\prime})$. Also, we assume the processed model $\hat{p}^{-S^{\prime}}_{S}$ is parameterized by $\gamma^{-S^{\prime}}_{S}$.

Mathematically, forgetting in Bayesian inference can be realized by moving the local minimizer $\gamma_{S}$ of the energy function $F(\gamma,S)$ towards the local minimizer $\gamma_{S-S^{\prime}}$ of the energy function $F(\gamma,S-S^{\prime})$.

We start with a simple case that removes the influence learned from a single datum $z_{j}\in S$. The energy function for the posterior $p(\theta|S-\{z_{j}\})$ is as follows,

F(\gamma,S-\{z_{j}\})=\sum_{i=1}^{n}h(\gamma,z_{i})+f(\gamma)-h(\gamma,z_{j}). (3)

Let $\gamma_{S-\{z_{j}\}}$ be a local minimizer of $F(\gamma,S-\{z_{j}\})$. Then,

\nabla_{\gamma}F(\gamma_{S},S)=\sum_{i=1}^{n}\nabla_{\gamma}h(\gamma_{S},z_{i})+\nabla_{\gamma}f(\gamma_{S})=0,
\nabla_{\gamma}F(\gamma_{S-\{z_{j}\}},S-\{z_{j}\})=\nabla_{\gamma}F(\gamma_{S-\{z_{j}\}},S)-\nabla_{\gamma}h(\gamma_{S-\{z_{j}\}},z_{j})=0.

Notice that the structures of the above two equations only differ in the term $\nabla_{\gamma}h(\gamma,z_{j})$. They are two special cases of the following equation,

\nabla_{\gamma}F(\gamma,S)+\tau\cdot\nabla_{\gamma}h(\gamma,z_{j})=0, (4)

where $\tau\in[-1,0]$. Eq. (4) induces an implicit mapping $\hat{\gamma}^{-z_{j}}_{S}:[-1,0]\to\Gamma$ such that $\hat{\gamma}^{-z_{j}}_{S}(0)=\gamma_{S}$. For any $\tau\in[-1,0]$, $\hat{\gamma}^{-z_{j}}_{S}(\tau)$ is a solution of eq. (4). Thus, $\hat{\gamma}^{-z_{j}}_{S}(\tau)$ is also a critical point of the following function,

F_{-z_{j},\tau}(\gamma,S)=F(\gamma,S)+\tau\cdot h(\gamma,z_{j}). (5)

When $\tau=-1$, $F_{-z_{j},-1}(\gamma,S)$ is exactly $F(\gamma,S-\{z_{j}\})$, which indicates that $\hat{\gamma}^{-z_{j}}_{S}(-1)$ is a critical point of $F(\gamma,S-\{z_{j}\})$ and possibly a local minimizer. We assume that $\hat{\gamma}^{-z_{j}}_{S}(-1)$ is indeed a local minimizer of $F(\gamma,S-\{z_{j}\})$, and we let $\gamma_{S-\{z_{j}\}}=\hat{\gamma}^{-z_{j}}_{S}(-1)$. Then, $\gamma_{S-\{z_{j}\}}$ can be approached from $\gamma_{S}$ in a first-order approximation manner,

\gamma_{S-\{z_{j}\}}=\hat{\gamma}^{-z_{j}}_{S}(-1)\approx\hat{\gamma}^{-z_{j}}_{S}(0)-\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(0)}{\partial\tau}=\gamma_{S}-\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(0)}{\partial\tau}.

Thus, when a request to remove a datum $z_{j}$ comes, our forgetting algorithm processes the request by replacing $\gamma_{S}$ with $\gamma^{-z_{j}}_{S}$, where

\gamma^{-z_{j}}_{S}:=\gamma_{S}-\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(0)}{\partial\tau}.

4.4 Theoretical Guarantee

Under some mild assumptions, we prove that (1) the mapping $\hat{\gamma}^{-z_{j}}_{S}$ uniquely exists; (2) $\gamma_{S-\{z_{j}\}}=\hat{\gamma}^{-z_{j}}_{S}(-1)$ is the global minimizer of $F(\gamma,S-\{z_{j}\})$; and (3) the approximation error between $\gamma^{-z_{j}}_{S}$ and $\gamma_{S-\{z_{j}\}}$ is not larger than the order $O(1/n^{2})$, where $n$ is the training set size.

We first introduce two definitions.

Definition 4.2 (compact set).

A set in a Euclidean space is called compact if and only if it is closed and bounded.

Definition 4.3 (strong convexity).

A differentiable function $f$ is called $c$-strongly convex if and only if there exists a real constant $c>0$ such that for any two points $z$ and $z^{\prime}$ in the domain of $f$, the following inequality holds,

(\nabla f(z)-\nabla f(z^{\prime}))^{T}(z-z^{\prime})\geq\frac{c}{2}\|z-z^{\prime}\|^{2}_{2}.

When $f$ is second-order continuously differentiable, for any $c>0$, the following three propositions are equivalent:

  1. $f$ is $c$-strongly convex;

  2. for any $z$ from the domain of $f$, the smallest eigenvalue of the Hessian matrix $\nabla^{2}f(z)$ is at least $c$, i.e., $\lambda_{\min}\left(\nabla^{2}f(z)\right)\geq c$;

  3. for any $z$ from the domain of $f$, the matrix $(\nabla^{2}f(z)-cI)$ is positive semi-definite, i.e., $\nabla^{2}f(z)\succeq cI$.

The theoretical guarantees rely on the following two assumptions.

Assumption 4.1.

Both of the supports $\Gamma$ and $\mathcal{Z}$ are compact.

Usually, the energy functions $F(\gamma,S)$ and $F(\gamma,S-\{z_{j}\})$ are close to each other. Thus, it is reasonable to assume that the two local minimizers $\gamma_{S}$ and $\gamma_{S-\{z_{j}\}}$ fall in the same local region. Hence, we only need to analyze a limited local region, which justifies Assumption 4.1.

Assumption 4.2.

Suppose $f(\gamma)$ and $h(\gamma,z)$ are the two components of the energy function $F(\gamma,S)$ (eq. (1)). We assume that

  1. $f(\gamma)$ is 3rd-order continuously differentiable and $c_{f}$-strongly convex on $\Gamma$;

  2. $h(\gamma,z)$ is 3rd-order continuously differentiable with respect to $\gamma$ on $\Gamma\times\mathcal{Z}$. Besides, for every $z\in\mathcal{Z}$, $h(\gamma,z)$ is $c_{h}(z)$-strongly convex with respect to $\gamma$, where $c_{h}:\mathcal{Z}\to\mathbb{R}$ is a continuous function.

Assumption 4.2 is mild in gradient-based optimization. Together with Assumption 4.1, it amounts to assuming local strong convexity and smoothness of the energy functions.

We then prove that the energy function $F_{-z_{j},\tau}$ is strongly convex, as stated in the following lemma.

Lemma 4.1.

Suppose Assumption 4.2 holds. For any $\tau\in[-1,0]$, $F_{-z_{j},\tau}(\gamma,S)$ (eq. (5)) is strongly convex with respect to the parameter $\gamma$.

Proof.

We rewrite the energy function $F_{-z_{j},\tau}$ as follows,

F_{-z_{j},\tau}(\gamma,S)=F(\gamma,S-\{z_{j}\})+(1+\tau)h(\gamma,z_{j}).

By Assumption 4.2, $F(\gamma,S-\{z_{j}\})$ is strongly convex. Since $\tau\in[-1,0]$, we have $1+\tau\geq 0$. Thus, $(1+\tau)h(\gamma,z_{j})$ is either strongly convex or identically zero. Hence, $F_{-z_{j},\tau}(\gamma,S)$ is strongly convex with respect to $\gamma$.

The proof is completed. ∎

We are now able to prove that eq. (4) indeed induces a continuous mapping $\hat{\gamma}^{-z_{j}}_{S}$, and that $\hat{\gamma}^{-z_{j}}_{S}(-1)$ is the global minimizer of $F(\gamma,S-\{z_{j}\})$.

Theorem 4.1.

Suppose Assumption 4.2 holds. Then, eq. (4) induces a unique continuous function $\hat{\gamma}^{-z_{j}}_{S}:[-1,0]\to\Gamma$ such that for any $\tau\in[-1,0]$, $\hat{\gamma}^{-z_{j}}_{S}(\tau)$ is the solution of eq. (4) and the global minimizer of eq. (5) with respect to $\gamma$.

Proof.

According to Lemma 4.1, for any $\tau\in[-1,0]$, the energy function $F_{-z_{j},\tau}(\gamma,S)$ is strongly convex with respect to $\gamma$. Thus, its global minimizer with respect to $\gamma$ is unique. Therefore, we can define the mapping $\hat{\gamma}^{-z_{j}}_{S}$ as follows,

\hat{\gamma}^{-z_{j}}_{S}(\tau)=\arg\min_{\gamma}F_{-z_{j},\tau}(\gamma,S),

where $\tau\in[-1,0]$.

Besides, the strong convexity of $F_{-z_{j},\tau}(\gamma,S)$ indicates that when $\tau$ is fixed, a parameter $\gamma^{*}\in\Gamma$ satisfies $\nabla_{\gamma}F_{-z_{j},\tau}(\gamma^{*},S)=0$ (eq. (4)) if and only if $\gamma^{*}$ is the global minimizer of $F_{-z_{j},\tau}(\gamma,S)$. Thus, for all $\tau\in[-1,0]$, $\hat{\gamma}^{-z_{j}}_{S}(\tau)$ is the only solution of eq. (4).

Eventually, according to Definition 4.3, the strong convexity of $F_{-z_{j},\tau}(\gamma,S)$ implies that the Hessian matrix $\nabla^{2}_{\gamma}F_{-z_{j},\tau}(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)$ is invertible. Combining this with the implicit function theorem, we have that $\hat{\gamma}^{-z_{j}}_{S}(\tau)$ is continuous everywhere on $[-1,0]$.

The proof is completed. ∎

Theorem 4.1 demonstrates that our algorithm can obtain $\hat{\gamma}^{-z_{j}}_{S}(-1)$. This secures that our algorithm can remove the learned influence from the probabilistic model $\hat{p}_{S}$.

We then analyze the error introduced by our algorithm. Assuming that $\hat{\gamma}^{-z_{j}}_{S}$ is second-order continuously differentiable, $\hat{\gamma}^{-z_{j}}_{S}(-1)$ can be expanded via the Taylor series:

\hat{\gamma}^{-z_{j}}_{S}(-1)=\hat{\gamma}^{-z_{j}}_{S}(0)+(-1)\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(0)}{\partial\tau}+\frac{(-1)^{2}}{2}\cdot\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\xi)}{\partial\tau^{2}}=\hat{\gamma}^{-z_{j}}_{S}(0)-\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(0)}{\partial\tau}+\frac{1}{2}\cdot\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\xi)}{\partial\tau^{2}},

where $\xi\in[-1,0]$ and $\frac{1}{2}\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\xi)}{\partial\tau^{2}}$ is the Lagrange form of the remainder. We prove that this remainder term becomes negligible as the training set size $n$ goes to infinity.

Theorem 4.2.

Suppose Assumptions 4.1 and 4.2 hold. Let the induced mapping $\hat{\gamma}^{-z_{j}}_{S}$ be as defined in Theorem 4.1. Then, for any $\tau\in[-1,0]$, we have that

\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}=-\left(\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\tau\cdot\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right)^{-1}\cdot\nabla_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})^{T},

and

\left\|\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}\leq O\left(\frac{1}{n}\right).

The proof of Theorem 4.2 is presented in Appendix A. We first calculate $\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}$ based on eq. (4). We then upper bound its norm by upper bounding the norm of the inverse matrix

\left(\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\tau\cdot\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right)^{-1}

and the norm of the vector $\nabla_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})$. The norm of the inverse matrix is bounded via the strong convexity of the energy function, while the norm of the vector is bounded via its continuity on a compact domain.

Based on Theorem 4.2, we then prove that the norm of $\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}$ is no larger than the order $O(1/n^{2})$. Thus, a sufficiently large training sample size ensures that the second- and higher-order terms in the Taylor series are negligible.

Theorem 4.3.

Suppose Assumptions 4.1 and 4.2 hold. Let the induced mapping $\hat{\gamma}^{-z_{j}}_{S}$ be as defined in Theorem 4.1. Then, for any $\tau\in[-1,0]$, we have that

\left\|\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}\right\|_{2}\leq O\left(\frac{1}{n^{2}}\right).

The proof routine of Theorem 4.3 is similar to that of Theorem 4.2, though the calculation is more involved. We leave the details to Appendix A.

Corollary 4.1.

Suppose Assumptions 4.1 and 4.2 hold. Suppose $\gamma_{S}$ and $\gamma_{S-\{z_{j}\}}$ are the global minimizers of the energy functions $F(\gamma,S)$ and $F(\gamma,S-\{z_{j}\})$, respectively. Then, we have that

\gamma_{S-\{z_{j}\}}=\gamma_{S}+\nabla^{-2}_{\gamma}F(\gamma_{S},S)\cdot\nabla_{\gamma}h(\gamma_{S},z_{j})^{T}+O\left(\frac{1}{n^{2}}\right).
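
Corollary 4.1 directly yields a one-shot forgetting update; below is a dense-Hessian numpy sketch with illustrative names (Section 5.3 replaces the explicit inverse with Hessian-vector products in practice):

    import numpy as np

    def bif_forget(gamma_S, hess_F, grad_h_zj):
        # First-order removal of z_j (Corollary 4.1):
        # gamma_{S - {z_j}} ~ gamma_S + [grad^2 F(gamma_S, S)]^{-1} grad h(gamma_S, z_j),
        # with approximation error at most O(1/n^2).
        return gamma_S + np.linalg.solve(hess_F, grad_h_zj)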

Eventually, we consider removing a subset $S^{\prime}$ from $S$. Following the previous derivations, we prove that the approximation error of approaching $\gamma_{S-S^{\prime}}$ from $\gamma_{S}$ is also not larger than the order $O(1/n^{2})$, as stated below:

Corollary 4.2.

Suppose Assumptions 4.1 and 4.2 hold. Suppose $\gamma_{S}$ and $\gamma_{S-S^{\prime}}$ are the global minimizers of the energy functions $F(\gamma,S)$ and $F(\gamma,S-S^{\prime})$, respectively. Then, we have that

\gamma_{S-S^{\prime}}=\gamma_{S}+\nabla^{-2}_{\gamma}F(\gamma_{S},S)\cdot\sum_{z_{j}\in S^{\prime}}\nabla_{\gamma}h(\gamma_{S},z_{j})^{T}+O\left(\frac{1}{n^{2}}\right).

5 Forgetting Algorithms for Bayesian Inference

In this section, we develop provably certified forgetting algorithms for variational inference and MCMC under the BIF framework.

5.1 Variational Inference Forgetting

As introduced in Section 3, variational inference aims to minimize the negative ELBO function, which can be expanded as follows,

-\text{ELBO}(\lambda,S)=-\mathbb{E}_{q_{\lambda}}\log p(\theta,S)+\mathbb{E}_{q_{\lambda}}\log q_{\lambda}(\theta)
=-\mathbb{E}_{q_{\lambda}}\log\left(p(\theta)\prod_{i=1}^{n}p(z_{i}|\theta)\right)+\mathbb{E}_{q_{\lambda}}\log q_{\lambda}(\theta)
=\sum_{i=1}^{n}-\mathbb{E}_{q_{\lambda}}\log p(z_{i}|\theta)+\mathrm{KL}(q_{\lambda}(\theta)\|p(\theta)).

Notice that the structure of the above equation is similar to that of the energy function (eq. (1)). Thus, we employ the negative ELBO function as the energy function.

Based on the energy function, we define the following variational inference influence function to characterize the influence of a single datum on the learned model.

Definition 5.1 (Variational inference influence function).

For any example $z_{j}\in S$, its variational inference influence function is defined to be

\mathcal{I}_{\mathrm{VI}}(z_{j}):=-\nabla^{-2}_{\lambda}\mathrm{ELBO}(\hat{\lambda}_{S},S)\cdot\nabla_{\lambda}^{T}\mathbb{E}_{q_{\hat{\lambda}_{S}}}\log p(z_{j}|\theta).

We then design the variational inference forgetting algorithm to remove the influence of a single datum from the learned variational parameter $\hat{\lambda}_{S}$,

\hat{\lambda}^{-z_{j}}_{S}=\mathcal{A}_{\mathrm{VI}}(\hat{\lambda}_{S},z_{j})=\hat{\lambda}_{S}-\mathcal{I}_{\mathrm{VI}}(z_{j}).
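
A minimal dense-Hessian sketch of $\mathcal{A}_{\mathrm{VI}}$ follows (illustrative names; hess_neg_elbo stands for $\nabla^{2}_{\lambda}(-\mathrm{ELBO}(\hat{\lambda}_{S},S))$ and grad_loglik_zj for $\nabla_{\lambda}\mathbb{E}_{q_{\hat{\lambda}_{S}}}\log p(z_{j}|\theta)$, both assumed precomputed):

    import numpy as np

    def vi_forget(lambda_S, hess_neg_elbo, grad_loglik_zj):
        # By Definition 5.1, I_VI(z_j) = [grad^2(-ELBO)]^{-1} grad_lambda E_q[log p(z_j|theta)],
        # and A_VI(lambda_S, z_j) = lambda_S - I_VI(z_j).
        I_vi = np.linalg.solve(hess_neg_elbo, grad_loglik_zj)
        return lambda_S - I_vi

Since the influence of a set is the sum of individual influences (shown below), removing a subset $S^{\prime}$ amounts to summing grad_loglik_zj over $z_{j}\in S^{\prime}$ before the solve.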

We now show that variational inference forgetting provably removes the influence of datums from the learned model.

Theorem 5.1.

Let $\gamma:=\lambda$, $\Gamma:=\Lambda$, $h(\gamma,z):=-\mathbb{E}_{q_{\lambda}}\log p(z|\theta)$, and $f(\gamma):=\mathrm{KL}(q_{\lambda}(\theta)\|p(\theta))$. Suppose Assumptions 4.1 and 4.2 hold. Then, we have that

\hat{\lambda}_{S-\{z_{j}\}}=\hat{\lambda}_{S}-\mathcal{I}_{\mathrm{VI}}(z_{j})+O\left(\frac{1}{n^{2}}\right)\approx\hat{\lambda}_{S}-\mathcal{I}_{\mathrm{VI}}(z_{j}),

where $\hat{\lambda}_{S}$ and $\hat{\lambda}_{S-\{z_{j}\}}$ are the global minimizers of $-\mathrm{ELBO}(\lambda,S)$ and $-\mathrm{ELBO}(\lambda,S-\{z_{j}\})$, respectively.

Proof.

It is straightforward from Corollary 4.1. ∎

Remark 5.1.

When Assumptions 4.1 and 4.2 do not hold, performing variational inference forgetting approaches some critical point of $-\mathrm{ELBO}(\lambda,S-\{z_{j}\})$ based on $\hat{\lambda}_{S}$.

Beyond the removal of a single datum, Corollary 4.2 further shows that the influence of a sample set $S^{\prime}$ can also be characterized by the variational inference influence function $\mathcal{I}_{\mathrm{VI}}(S^{\prime})$, as follows,

\mathcal{I}_{\mathrm{VI}}(S^{\prime})=\sum_{z_{j}\in S^{\prime}}\mathcal{I}_{\mathrm{VI}}(z_{j}).

This guarantees that one can remove a group of datums at one time, which significantly speeds up and simplifies the forgetting of large numbers of datums.

We next study the $\varepsilon$-certified knowledge removal guarantee for variational inference forgetting. Here, we take the mean-field Gaussian family as an example. Proofs for other variational families are similar.

Theorem 5.2 ($\varepsilon$-certified knowledge removal of mean-field Gaussian variational distributions).

Let $\mathcal{Q}$ be a mean-field Gaussian family in which the variance of every variational distribution is bounded, i.e., there exist $M_{1},M_{2}\in\mathbb{R}$ such that for any $q_{\lambda}=\mathcal{N}(\mu,\sigma^{2}I)\in\mathcal{Q}$, $0<M_{1}\leq\sigma_{i}\leq M_{2}$ holds for $i=1,\cdots,d$. Let $\hat{\lambda}_{S}$ and $\hat{\lambda}_{S-\{z_{j}\}}$ be the variational parameters learned on the sample sets $S$ and $S-\{z_{j}\}$, respectively. Suppose $\hat{\lambda}^{-z_{j}}_{S}=\mathcal{A}_{\mathrm{VI}}(\hat{\lambda}_{S},z_{j})$ is the processed variational parameter. Then, $\mathcal{A}_{\mathrm{VI}}(\hat{\lambda}_{S},z_{j})$ performs $\varepsilon_{\hat{\lambda}_{S},z_{j}}$-certified knowledge removal, where

\varepsilon_{\hat{\lambda}_{S},z_{j}}=\frac{1}{2M_{1}^{2}}\left(2(M_{1}+M_{2})\|\hat{\lambda}^{-z_{j}}_{S}-\hat{\lambda}_{S-\{z_{j}\}}\|_{1}+\|\hat{\lambda}^{-z_{j}}_{S}-\hat{\lambda}_{S-\{z_{j}\}}\|_{2}^{2}\right).

The proof of Theorem 5.2 is omitted here and presented in Appendix B. When all the conditions in Theorem 5.1 hold, we have that $\|\hat{\lambda}^{-z_{j}}_{S}-\hat{\lambda}_{S-\{z_{j}\}}\|_{1}\leq O(1/n^{2})$ and $\|\hat{\lambda}^{-z_{j}}_{S}-\hat{\lambda}_{S-\{z_{j}\}}\|^{2}_{2}\leq O(1/n^{4})$. Thus, $\varepsilon_{\hat{\lambda}_{S},z_{j}}\leq O(1/n^{2})$. Therefore, $\mathcal{A}_{\mathrm{VI}}$ performs $O(1/n^{2})$-certified knowledge removal for the mean-field Gaussian variational distribution $q_{\hat{\lambda}_{S}}$.
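
In practice, the $\varepsilon$ in Definition 4.1 can be measured directly for mean-field Gaussians, since the KL divergence between diagonal Gaussians has a closed form (a numpy sketch; the arguments are 1-D arrays of means and standard deviations):

    import numpy as np

    def kl_diag_gaussians(mu1, sig1, mu2, sig2):
        # KL( N(mu1, diag(sig1^2)) || N(mu2, diag(sig2^2)) )
        return 0.5 * np.sum(
            (sig1 / sig2) ** 2
            + ((mu2 - mu1) / sig2) ** 2
            - 1.0
            + 2.0 * np.log(sig2 / sig1)
        )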

5.2 MCMC Forgetting Algorithm

In MCMC, there is no explicit objective function. In this work, we instead transport $p(\theta|S)$ to approximate $p(\theta|S-S^{\prime})$. Specifically, we want to find a “drift” $\hat{\Delta}^{-S^{\prime}}_{S}$, named the drifting influence, that minimizes the following KL divergence,

\hat{\Delta}^{-S^{\prime}}_{S}=\arg\min_{\Delta}\mathrm{KL}(p(\theta|S)\|p(\theta+\Delta|S-S^{\prime})). (6)

The KL term in eq. (6) can be expanded as follows,

\mathrm{KL}(p(\theta|S)\|p(\theta+\Delta|S-S^{\prime}))
=\mathbb{E}_{p(\theta|S)}\log p(\theta|S)-\mathbb{E}_{p(\theta|S)}\log p(\theta+\Delta|S-S^{\prime})
=-\mathbb{E}_{p(\theta|S)}\log\left(\frac{p(\theta+\Delta)\prod_{z\in S}p(z|\theta+\Delta)}{p(S-S^{\prime})\prod_{z\in S^{\prime}}p(z|\theta+\Delta)}\right)+\mathbb{E}_{p(\theta|S)}\log p(\theta|S)
=\sum_{i=1}^{n}-\mathbb{E}_{p(\theta|S)}\log p(z_{i}|\theta+\Delta)-\mathbb{E}_{p(\theta|S)}\log p(\theta+\Delta)-\sum_{z\in S^{\prime}}-\mathbb{E}_{p(\theta|S)}\log p(z|\theta+\Delta)+\mathrm{Constant}. (7)

Apart from the constant term, the remainder has a structure similar to that of $F(\gamma,S-S^{\prime})$ (eq. (2)). Thus, we define the energy function for MCMC by replacing $h(\gamma,z)$ and $f(\gamma)$ with $-\mathbb{E}_{p(\theta|S)}\log p(z|\theta+\Delta)$ and $-\mathbb{E}_{p(\theta|S)}\log p(\theta+\Delta)$, respectively.

Based on the energy function, we define an MCMC influence function to characterize the drifting influence induced by a single datum, as follows.

Definition 5.2 (MCMC influence function).

For any example $z_{j}\in S$, its MCMC influence function is defined to be

\mathcal{I}_{\mathrm{MCMC}}(z_{j}):=-\left(\mathbb{E}_{p(\theta|S)}\nabla_{\theta}^{2}\log p(\theta,S)\right)^{-1}\left(\mathbb{E}_{p(\theta|S)}\nabla_{\theta}\log p(z_{j}|\theta)\right)^{T}. (8)

Then, the MCMC forgetting algorithm is as follows,

p^{-z_{j}}_{S}(\theta)=\mathcal{A}_{\mathrm{MCMC}}(p(\theta|S),z_{j})=p(\theta+\mathcal{I}_{\mathrm{MCMC}}(z_{j})|S).

In practice, we do not have the posterior $p(\theta|S)$ itself, but only samples drawn from it. Thus, we cannot directly transform the posterior $p(\theta|S)$ into the processed distribution $p^{-z_{j}}_{S}(\theta)$. However, the probability of drawing a sample $\theta_{t}$ from $p(\theta|S)$ equals that of drawing the sample $\theta_{t}-\mathcal{I}_{\mathrm{MCMC}}(z_{j})$ from $p^{-z_{j}}_{S}(\theta)$. Therefore, when performing MCMC forgetting, we replace every drawn sample $\theta_{t}$ by the new sample $\theta_{t}-\mathcal{I}_{\mathrm{MCMC}}(z_{j})$.
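
In code, the forgetting step thus reduces to shifting the stored posterior samples (a sketch; samples is an array of parameter samples drawn by (SG)MCMC and I_mcmc a precomputed influence vector):

    import numpy as np

    def mcmc_forget(samples, I_mcmc):
        # Shift every theta_t ~ p(theta|S) to theta_t - I_MCMC(z_j); the shifted
        # samples are distributed as the processed model p_S^{-z_j}.
        return np.asarray(samples) - I_mcmc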

We prove that the MCMC forgetting algorithm can provably remove the drifting influence induced by a single datum from the learned model.

Theorem 5.3.

Let $\gamma:=\Delta$, $\Gamma:=\mathrm{supp}(\Delta)$, $h(\gamma,z):=-\mathbb{E}_{p(\theta|S)}\log p(z|\theta+\Delta)$, and $f(\gamma):=-\mathbb{E}_{p(\theta|S)}\log p(\theta+\Delta)$. Suppose that Assumptions 4.1 and 4.2 hold. Then, we have that

\hat{\Delta}^{-z_{j}}_{S}=-\mathcal{I}_{\mathrm{MCMC}}(z_{j})+O\left(\frac{1}{n^{2}}\right)\approx-\mathcal{I}_{\mathrm{MCMC}}(z_{j}),

where $\hat{\Delta}^{-z_{j}}_{S}$ is the global minimizer of $\mathrm{KL}(p(\theta|S)\|p(\theta+\Delta|S-\{z_{j}\}))$.

Remark 5.2.

When Assumptions 4.1 and 4.2 do not hold, performing MCMC forgetting can be seen as finding a critical point $\hat{\Delta}^{-z_{j}}_{S}$ of $\mathrm{KL}(p(\theta-\Delta|S)\|p(\theta|S-\{z_{j}\}))$.

Proof.

When all the conditions hold, the global minimizer of $\mathrm{KL}(p(\theta|S)\|p(\theta+\Delta|S))$ is exactly $0$. Meanwhile, $\hat{\Delta}^{-z_{j}}_{S}$ is the global minimizer of $\mathrm{KL}(p(\theta|S)\|p(\theta+\Delta|S-\{z_{j}\}))$. Combining Corollary 4.1, we have that

\hat{\Delta}^{-z_{j}}_{S}=\hat{\Delta}^{-z_{j}}_{S}-0
=\left(\nabla^{2}_{\Delta}\left[-\mathbb{E}_{p(\theta|S)}\log p(\theta+\Delta,S)\right]_{\Delta=0}\right)^{-1}\cdot\left(\nabla_{\Delta}\left[-\mathbb{E}_{p(\theta|S)}\log p(z_{j}|\theta+\Delta)\right]_{\Delta=0}\right)^{T}+O\left(\frac{1}{n^{2}}\right)
=\left(\mathbb{E}_{p(\theta|S)}\nabla^{2}_{\theta}\log p(\theta,S)\right)^{-1}\cdot\left(\mathbb{E}_{p(\theta|S)}\nabla_{\theta}\log p(z_{j}|\theta)\right)^{T}+O\left(\frac{1}{n^{2}}\right)
=-\mathcal{I}_{\mathrm{MCMC}}(z_{j})+O\left(\frac{1}{n^{2}}\right)
\approx-\mathcal{I}_{\mathrm{MCMC}}(z_{j}).

The proof is completed. ∎

Similarly, by applying Corollary 4.2, the MCMC influence function of a sample set $S^{\prime}$ is as follows,

\mathcal{I}_{\mathrm{MCMC}}(S^{\prime})=\sum_{z_{j}\in S^{\prime}}\mathcal{I}_{\mathrm{MCMC}}(z_{j}).

This guarantees that one can adopt the MCMC forgetting algorithm to remove a group of datums at one time, which improves the efficiency of removing large amounts of datums.

The MCMC forgetting algorithm also applies to SGMCMC, because SGMCMC draws samples from the same target posterior $p(\theta|S)$ as MCMC [13, 69, 45].

We then give the $\varepsilon$-certified knowledge removal guarantee for MCMC forgetting.

Theorem 5.4.

Suppose that

p(\theta|S)=\mathcal{N}(\theta_{1},(nJ(\theta_{1}))^{-1}),
p(\theta|S-\{z_{j}\})=\mathcal{N}(\theta_{2},((n-1)J(\theta_{2}))^{-1}).

Let $p^{-z_{j}}_{S}(\theta)=\mathcal{A}_{\mathrm{MCMC}}(p(\theta|S),z_{j})=p(\theta+\mathcal{I}_{\mathrm{MCMC}}(z_{j})|S)$ be the processed model. Then, $\mathcal{A}_{\mathrm{MCMC}}(p(\theta|S),z_{j})$ performs $O(\varepsilon_{\theta_{1},z_{j}})$-certified knowledge removal, where

\varepsilon_{\theta_{1},z_{j}}=(n-1)(\theta_{1}^{\prime}-\theta_{2})^{T}J(\theta_{2})(\theta_{1}^{\prime}-\theta_{2})+\mathrm{tr}\left(J^{-1}(\theta_{1})(J(\theta_{2})-J(\theta_{1}))\right)+\log\frac{|J(\theta_{1})|}{|J(\theta_{2})|}

and $\theta_{1}^{\prime}=\theta_{1}-\mathcal{I}_{\mathrm{MCMC}}(z_{j})$.

Theorem 5.4 assumes that the posteriors $p(\theta|S)$ and $p(\theta|S-\{z_{j}\})$ are Gaussian. This assumption comes from Bayesian asymptotic theory [21, 42]. Suppose the training set $S$ is drawn from a distribution $f_{1}(z)$, while $S-\{z_{j}\}$ is drawn from a distribution $f_{2}(z)$. Then, under some mild assumptions, when the training sample size $n$ is sufficiently large, the posteriors $p(\theta|S)$ and $p(\theta|S-\{z_{j}\})$ approach the following Gaussian distributions,

p(\theta|S)\sim\mathcal{N}(\theta_{1},(nJ(\theta_{1}))^{-1}),
p(\theta|S-\{z_{j}\})\sim\mathcal{N}(\theta_{2},((n-1)J(\theta_{2}))^{-1}),

where $\theta_{i}=\arg\min_{\theta}\mathrm{KL}(f_{i}(z)\|p(z|\theta))$ for $i\in\{1,2\}$, and $J(\theta)=\mathbb{E}_{p(z|\theta)}\left[-\frac{\partial^{2}}{\partial\theta^{2}}\log p(z|\theta)\right]$ is the Fisher information.

The detailed proof of Theorem 5.4 is given in Appendix B.

Combining the conditions of Theorem 5.3, we have that $\|\theta^{\prime}_{1}-\theta_{2}\|_{2}\leq O(1/n^{2})$. Thus, as $n\to\infty$, $\varepsilon_{\theta_{1},z_{j}}$ eventually converges to $\mathrm{tr}\left(J^{-1}(\theta_{1})(J(\theta_{2})-J(\theta_{1}))\right)+\log\frac{|J(\theta_{1})|}{|J(\theta_{2})|}$.

5.3 Efficient Implementation

A major computational burden in both the variational inference forgetting and MCMC forgetting algorithms is calculating products of the form $H^{-1}v$, where $H$ is the Hessian matrix of some function $f(x)$ and $v$ is a constant vector. This calculation can have a considerably high computational cost. We follow Agarwal et al. [2] and Koh and Liang [38] and apply a divide-and-conquer strategy to address the issue. This strategy relies on calculating the Hessian-vector product $Hv$.

Hessian-vector product (HVP). We first discuss how to efficiently calculate $Hv$. The calculation of $Hv$ can be decomposed into two steps: (1) calculate $\frac{\partial f(x)}{\partial x}$, and then (2) calculate $\frac{\partial}{\partial x}\left(\frac{\partial f(x)}{\partial x}\cdot v\right)$. It is worth noting that $\frac{\partial f(x)}{\partial x}\in\mathbb{R}^{1\times d}$ and $v\in\mathbb{R}^{d\times 1}$, where $d>0$ is the dimension of $x$. Thus, $\frac{\partial f(x)}{\partial x}\cdot v$ is a scalar, and calculating its gradient $\frac{\partial}{\partial x}\left(\frac{\partial f(x)}{\partial x}\cdot v\right)$ has a very low computational cost on platforms such as PyTorch [56] or TensorFlow [1].
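
A double-backward HVP can be sketched in PyTorch as follows (params is a single tensor with requires_grad=True and loss a scalar computed from it; names are illustrative):

    import torch

    def hvp(loss, params, v):
        # Hessian-vector product H v: differentiate the scalar
        # (d loss / d params) . v once more with respect to params.
        grad = torch.autograd.grad(loss, params, create_graph=True)[0]
        return torch.autograd.grad((grad * v).sum(), params)[0]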

Calculating $H^{-1}v$. When the norm $\|H\|\leq 1$, the matrix $H^{-1}$ can be expanded via the Taylor series as $H^{-1}=\sum_{i=0}^{\infty}(I-H)^{i}$. Define $H_{j}^{-1}=\sum_{i=0}^{j}(I-H)^{i}$. Then, we have the following recursive equation,

H_{j}^{-1}v=v+(I-H)H_{j-1}^{-1}v.

Agarwal et al. [2] prove that as $j\rightarrow\infty$, $\mathbb{E}[H_{j}^{-1}]\rightarrow H^{-1}$. Therefore, we employ $H^{-1}_{j}v$ to approximate $H^{-1}v$.

Moreover, to ensure that the condition $\|H\|\leq 1$ holds, we scale $H$ to $cH$ by a factor $c\in\mathbb{R}^{+}$ such that $\|cH\|\leq 1$, approximate $(cH)^{-1}$, and finally use $H^{-1}=c(cH)^{-1}$, which can be plugged into the equations above.
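
Putting the pieces together, the truncated, rescaled recursion can be sketched as follows (hvp_fn(x) returns $Hx$, e.g. via the HVP sketch above; c is assumed to satisfy $\|cH\|\leq 1$):

    def inverse_hvp(hvp_fn, v, c, iters=100):
        # Run the recursion (cH)_j^{-1} v = v + (I - cH) (cH)_{j-1}^{-1} v,
        # then rescale by c, since H^{-1} = c (cH)^{-1}.
        x = v.clone()
        for _ in range(iters):
            x = v + x - c * hvp_fn(x)
        return c * x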

6 Generalization Analysis

In this section, we study the generalization ability of models processed by the forgetting algorithms. Based on the PAC-Bayesian framework, we derive generalization bounds for a mean-field Gaussian variational model and a specified Gaussian MCMC model. All the proofs in this section are presented in Appendix C.

Generalization ability, i.e., the ability to make accurate predictions on unseen data, is important to machine learning algorithms. A standard measurement of generalization ability is the generalization bound, i.e., the upper bound of the difference between the expected risk and the empirical risk [51, 68, 50]. An algorithm with a small generalization bound is expected to generalize well. Existing generalization bounds can be roughly divided into four categories: (1) generalization bounds based on hypothesis complexity, such as the VC dimension [9, 67], Rademacher complexity [41, 40, 4], and covering number [17, 27], which suggest controlling the hypothesis complexity to help models generalize well; (2) generalization bounds based on algorithmic stability [60, 12], which follow the intuition that an algorithm with good generalization ability is robust to the interference of single data points; (3) generalization bounds established under the PAC-Bayesian framework [48, 47]; and (4) generalization guarantees from differential privacy [18, 54, 29]. The excellent generalization ability of over-parameterized models, including Bayesian neural networks and other deep learning models, is somehow beyond the explanation of conventional learning theory, and establishing theoretical foundations for it has attracted wide attention [19, 28].

Let $Q$ be a probabilistic model. Suppose $S$ is the training sample set. Then, the expected risk $\mathcal{R}(Q)$ and the empirical risk $\hat{\mathcal{R}}(Q,S)$ of $Q$ are defined to be

\mathcal{R}(Q)=\mathbb{E}_{h\sim Q}\mathbb{E}_{z}\ell(h,z),\quad\hat{\mathcal{R}}(Q,S)=\mathbb{E}_{h\sim Q}\frac{1}{n}\sum_{i=1}^{n}\ell(h,z_{i}),

where $h$ is a hypothesis drawn from $Q$, and $\ell$ is a loss function ranging in $[0,1]$. The difference between the expected risk and the empirical risk is the generalization error, whose magnitude characterizes the generalizability of the algorithm.

We then prove a generalization bound for mean-field Gaussian variational distributions.

Theorem 6.1.

Suppose all the conditions in Theorems 5.1 and 5.2 hold. Let $q_{\lambda}=\mathcal{N}(\mu,\sigma^{2}I)$ denote the mean-field Gaussian distribution learned on the training set $S$. Let $\lambda^{-}=\mathcal{A}_{\mathrm{VI}}(\lambda,z_{j})$ denote the processed variational distribution parameter, where $z_{j}\in S$. Also, let $\Delta_{\lambda}=\mathcal{I}_{\mathrm{VI}}(z_{j})=(\Delta_{\mu},\Delta_{\sigma})$ denote the variational inference influence function. Then, for any real $\delta\in(0,1)$, with probability at least $1-\delta$, the following inequality holds:

\mathcal{R}(q_{\lambda^{-}})\leq\hat{\mathcal{R}}(q_{\lambda^{-}},S)+\sqrt{\frac{C_{\lambda,\Delta_{\lambda}}+2\log\frac{1}{\delta}+2\log n-d+4}{4n-2}}, (9)

where

C_{\lambda,\Delta_{\lambda}}=\|\Delta_{\lambda}\|^{2}+2\|\lambda\|\cdot\|\Delta_{\lambda}\|+\|\lambda\|^{2}-2\sum_{k=1}^{d}\log\left(\sigma_{k}-\Delta_{\sigma_{k}}\right)\leq O(1),

and $\|\Delta_{\lambda}\|\leq O\left(\frac{1}{n}\right)$.

Remark 6.1.

This generalization bound is of order $O(\sqrt{(\log n)/n})$.

Corollary 6.1.

Variational inference forgetting increases the generalization bound by a value not larger than $O(1/n)$.

Proof.

Variational inference forgetting introduces the term $\|\Delta_{\lambda}\|$ into the generalization bound. Thus, the generalization bound is increased by the following value:

O\left(\sqrt{\frac{\|\Delta_{\lambda}\|^{2}+2\|\lambda\|\cdot\|\Delta_{\lambda}\|-2\sum_{k=1}^{d}\log\left(\frac{\sigma_{k}-\Delta_{\sigma_{k}}}{\sigma_{k}}\right)}{4n-2}}\right)\leq O\left(\sqrt{\frac{O(1/n^{2})+O(1/n)+O(1/n)}{4n-2}}\right)=O\left(\frac{1}{n}\right).

The proof is completed. ∎

This corollary secures that variational inference forgetting would not compromise the generalizability.

In Section 5.2, we have shown that when the training sample size is sufficiently large, the posterior distribution $p(\theta|S)$ is asymptotically Gaussian. Here, we again assume that $p(\theta|S)=\mathcal{N}(\theta_{1},(nJ(\theta_{1}))^{-1})$, where $J(\theta)$ is the Fisher information matrix. Then, we obtain a generalization bound as follows.

Theorem 6.2.

Suppose all the conditions in Theorem 5.3 hold. Suppose that $p(\theta|S)=\mathcal{N}(\theta_{1},(nJ(\theta_{1}))^{-1})$. Let $p^{-}=\mathcal{A}_{\mathrm{MCMC}}(p(\theta|S),z_{j})$ denote the processed distribution, where $z_{j}\in S$. Let $\Delta_{\theta_{1}}=\mathcal{I}_{\mathrm{MCMC}}(z_{j})$ denote the MCMC influence function. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$, the following inequality holds:

\mathcal{R}(p^{-})\leq\hat{\mathcal{R}}(p^{-},S)+\sqrt{\frac{C_{p,\Delta_{\theta_{1}}}+2\log\frac{1}{\delta}+(d+2)\log n-d+4}{4n-2}}, (10)

where

C_{p,\Delta_{\theta_{1}}}=\|\Delta_{\theta_{1}}\|^{2}+2\|\theta_{1}\|\cdot\|\Delta_{\theta_{1}}\|+\|\theta_{1}\|^{2}+\frac{1}{n}\mathrm{tr}(J^{-1}(\theta_{1}))+\log|J(\theta_{1})|=O(1),

and $\|\Delta_{\theta_{1}}\|\leq O\left(\frac{1}{n}\right)$.

Remark 6.2.

This generalization bound is of order $O(\sqrt{(\log n)/n})$.

Corollary 6.2.

MCMC forgetting increases the generalization bound by a value not larger than $O(1/n)$.

Proof.

MCMC forgetting introduces the term $\|\Delta_{\theta_{1}}\|$ into the generalization bound. Thus, the generalization bound is increased by the following value:

O\left(\sqrt{\frac{\|\Delta_{\theta_{1}}\|^{2}+2\|\theta_{1}\|\cdot\|\Delta_{\theta_{1}}\|}{4n-2}}\right)\leq O\left(\sqrt{\frac{O(1/n^{2})+O(1/n)}{4n-2}}\right)=O\left(\frac{1}{n}\right).

The proof is completed. ∎

This corollary ensures that MCMC forgetting does not compromise generalizability.

7 Experiments

We apply our forgetting algorithms to a Gaussian mixture model and a Bayesian neural network. In every scenario, we employ variational inference and two SGMCMC methods, SGLD and SGHMC. The empirical results are in full agreement with our theoretical analysis. All the experiments are conducted on a computer with an NVIDIA® GeForce® RTX 2080 Ti GPU, an Intel® Core i9-9900 CPU @ 3.10GHz, and 32GB of memory. To ensure reproducibility, the source code package is available at https://github.com/fshp971/BIF.

7.1 Experiments for Gaussian Mixture Model

We first conduct experiments with a GMM on synthetic data to evaluate our methods.

7.1.1 Implementation Details

The implementation details are given below.

Synthetic dataset. We generate a dataset S of size 2,000 for evaluating our algorithms. Every datum is two-dimensional and belongs to one of K classes. In this experiment, we set K as 4. The raw data are visualized in the left panel of fig. 2(a), in which four different colors represent four different classes.

Gaussian mixture model (GMM). GMMs are usually employed to infer cluster centers. A GMM assumes the data are drawn from K Gaussian distributions centered at \mu_{1},\cdots,\mu_{K}, respectively. The hierarchical structure of the GMM is as follows: (1) draw a cluster assignment c_{i} from the uniform categorical distribution over \{1,\ldots,K\}; and (2) sample z_{i} from a Gaussian distribution centered at \mu_{c_{i}}.

μk\displaystyle\mu_{k} 𝒩(0,σ2I),\displaystyle\sim\mathcal{N}(0,\sigma^{2}I),
ci\displaystyle c_{i} categorical(1K,,1K),\displaystyle\sim\mathrm{categorical}\left(\frac{1}{K},\cdots,\frac{1}{K}\right),
Zi\displaystyle Z_{i} 𝒩(μci,I),\displaystyle\sim\mathcal{N}(\mu_{c_{i}},I),

where 1kK1\leq k\leq K, 1in1\leq i\leq n, μkd\mu_{k}\in\mathbb{R}^{d}, ci{1,,K}c_{i}\in\{1,\cdots,K\}, ZidZ_{i}\in\mathbb{R}^{d}, and the hyperparameter σ\sigma\in\mathbb{R} is the prior standard deviation. We set σ\sigma as 11 in our experiments.
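To make the data generation concrete, the following is a minimal NumPy sketch (ours, for illustration; not the released code) that draws a dataset from exactly this hierarchy with the settings above, i.e., n = 2,000 two-dimensional datums, K = 4, and \sigma = 1.

```python
import numpy as np

# A minimal sketch of the synthetic-data generation under the hierarchy above.
rng = np.random.default_rng(0)
n, K, d, sigma = 2000, 4, 2, 1.0

mu = rng.normal(0.0, sigma, size=(K, d))   # mu_k ~ N(0, sigma^2 I)
c = rng.integers(0, K, size=n)             # c_i ~ categorical(1/K, ..., 1/K)
Z = mu[c] + rng.normal(size=(n, d))        # Z_i ~ N(mu_{c_i}, I)
```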

Applying SGLD and SGHMC to the GMM is straightforward. For variational inference, we utilize the following mean-field variational family [7],

q(𝝁,𝒄)\displaystyle q(\bm{\mu},\bm{c}) =k=1Kq(μk)i=1nq(ci|μ),\displaystyle=\prod_{k=1}^{K}q(\mu_{k})\prod_{i=1}^{n}q(c_{i}|\mu),
μk\displaystyle\mu_{k} 𝒩(mk,sk2),\displaystyle\sim\mathcal{N}(m_{k},s_{k}^{2}),
q(ci=k|μ)\displaystyle q(c_{i}=k|\mu) exp(𝔼qcilogp(xi|μ,ci))\displaystyle\propto\exp\left(\mathbb{E}_{q_{-c_{i}}}\log p(x_{i}|\mu,c_{i})\right)
=expj(xijmkjmkj2+skj22),\displaystyle=\exp\sum_{j}\left(x_{ij}m_{kj}-\frac{m_{kj}^{2}+s_{kj}^{2}}{2}\right),

where xi,mk,skdx_{i},m_{k},s_{k}\in\mathbb{R}^{d}, and λ=(m1,,mK,s1,,sK)\lambda=(m_{1},\cdots,m_{K},s_{1},\cdots,s_{K}) is the variational distribution parameter. Thus, the ELBO function is calculated as follows,

ELBO(λ,S)\displaystyle\mathrm{ELBO}(\lambda,S) =k,jmkj2+skj22σ2+k,jlogskji,k,jφik2xijmkj+mkj2+skj22\displaystyle=-\sum_{k,j}\frac{m_{kj}^{2}+s_{kj}^{2}}{2\sigma^{2}}+\sum_{k,j}\log s_{kj}-\sum_{i,k,j}\varphi_{ik}\frac{-2x_{ij}m_{kj}+m_{kj}^{2}+s_{kj}^{2}}{2}
i,kφiklogφik12i,jxij2+Constant,\displaystyle\ \ \ \ -\sum_{i,k}\varphi_{ik}\log\varphi_{ik}-\frac{1}{2}\sum_{i,j}x_{ij}^{2}+\mathrm{Constant},

where 1in1\leq i\leq n, 1kK1\leq k\leq K, 1jd1\leq j\leq d, and

φik=q(ci=k|μ)expj(xijmkjmkj2+skj22).\displaystyle\varphi_{ik}=q(c_{i}=k|\mu)\propto\exp\sum_{j}\left(x_{ij}m_{kj}-\frac{m_{kj}^{2}+s_{kj}^{2}}{2}\right).
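The responsibilities \varphi_{ik} above admit a simple vectorized implementation. Below is a minimal NumPy sketch (an illustration under assumed array shapes, not the released code): X holds the n\times d data, and m, s hold the K\times d variational means and standard deviations.

```python
import numpy as np

def responsibilities(X, m, s):
    # log phi_{ik} = sum_j (x_{ij} m_{kj} - (m_{kj}^2 + s_{kj}^2) / 2) + const
    logits = X @ m.T - 0.5 * np.sum(m ** 2 + s ** 2, axis=1)  # shape (n, K)
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    phi = np.exp(logits)
    return phi / phi.sum(axis=1, keepdims=True)               # normalize over k
```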
Figure 2: Visualized results of the GMM experiments. (a) Visualization of dataset S. (b) Results of variational inference. (c) Results of SGLD. (d) Results of SGHMC. There are four groups of points in the synthetic dataset, where different groups are shown in different colors. The clustering results trained on the complete dataset are shown in the left panels of figs. 2(b), 2(c), and 2(d). The original centers, processed centers, and target centers are drawn as black points, blue triangles, and red asterisks in the right panels of figs. 2(b), 2(c), and 2(d), where the original centers are obtained by training on the complete set, the processed centers are obtained after BIF, and the target centers are obtained by training on only the remaining set.

Experiment design. We employ variational inference, SGLD, and SGHMC to infer the GMM on the synthetic data. Then, we remove 400 points from each of the pink and yellow parts, around 40% of the whole dataset in total, by the proposed BIF algorithms. The remaining datums are shown in the right panel of fig. 2(a). We also train models on only the remaining set with the same Bayesian inference settings in order to show the targets of the forgetting task. The experiments have two main phases:

(1) Training phase. Every GMM is trained for 2,000 iterations. The batch size is set as 64. For variational inference, the learning rate is fixed to 2/n, where n is the training set size. For SGLD, the learning rate schedule is set as 4\cdot t^{-0.15}/n, where t is the training iteration step. For SGHMC, the learning rate schedule is set as 2\cdot t^{-0.15}/n, and the initial \alpha factor is set as 0.4.
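For reference, one SGLD update with such a step-size schedule takes the following form (a sketch only; grad_log_prior and grad_log_lik_batch are hypothetical callables supplied by the model, here the GMM):

```python
import numpy as np

def sgld_step(theta, batch, n, lr, grad_log_prior, grad_log_lik_batch, rng):
    # theta <- theta + (lr / 2) * stochastic gradient of log posterior + Gaussian noise
    g = grad_log_prior(theta) + (n / len(batch)) * grad_log_lik_batch(theta, batch)
    return theta + 0.5 * lr * g + rng.normal(0.0, np.sqrt(lr), size=theta.shape)
```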

(2) Forgetting phase. We remove a batch of 4 datums each time. When calculating the inverse-Hessian-vector product H^{-1}v in the influence functions (see Section 5.3), the number of recursive iterations j is set as 32, and the scaling factor c is set as 1/n', where n' is the number of currently remaining training datums. Notice that n' gradually decreases as the forgetting process continues. Moreover, for SGLD and SGHMC, we employ the Monte Carlo method to calculate the expectations in the MCMC influence functions. Specifically, we sample the model parameter \theta 5 times, calculate the matrices or vectors in the MCMC influence functions, and average the results to approximate the expectations.
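The recursive estimate of H^{-1}v mentioned above can be sketched as follows (a LiSSA-style recursion in the spirit of [2, 38]; hvp is an assumed callable that returns the Hessian-vector product Hu):

```python
import numpy as np

def inverse_hvp(hvp, v, c, j):
    # Iterate h <- v + h - c * (H h); the fixed point satisfies c * h = H^{-1} v.
    # The recursion converges when the spectral radius of (I - c * H) is below 1,
    # which is why the scaling factor c must be chosen small enough.
    h = v.copy()
    for _ in range(j):
        h = v + h - c * hvp(h)
    return c * h
```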

7.1.2 Results Analysis

The empirical results are presented in figs. 2(b), 2(c), and 2(d), respectively. In the figures for variational inference, we draw the learned clustering centers as black points. In every figure for SGLD and SGHMC, (1) we draw the point sampled in the terminal iteration as a black point; and (2) we draw the points sampled in the last 500 iterations in blue, orange, red, or green. The results obtained on the remaining parts (targets) are also shown in these figures. This visualization shows that after forgetting, the learned models are close to the target models, which demonstrates that our forgetting algorithms can selectively remove specified datums while keeping the others intact.

7.2 Experiments for Bayesian Neural Network

We then conduct experiments with Bayesian neural networks on real data.

7.2.1 Implementation Details

The implementation details are given below.

Dataset. We employ the Fashion-MNIST [70] dataset in our experiments. It consists of 28\times 28 gray-scale images from 10 different classes, where each class consists of 6,000 training examples and 1,000 test examples. For data augmentation, we first resize each image to 32\times 32 and then normalize each pixel value to [-0.5,0.5] before feeding it into the BNN. For the forgetting experiments, we divide the training set of Fashion-MNIST into two parts, the removed part S_f and the remaining part S_r; by construction, S_f\cup S_r=S. To select the removed training set S_f, we randomly choose 1,000, 2,000, 3,000, 4,000, 5,000, or 6,000 examples from the class “T-shirt” (i.e., examples labeled 0) in the training set to form S_f. For brevity, we denote the test set by S_test.
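A sketch of this preprocessing and split is given below (our illustration; the torchvision Fashion-MNIST loader and the path "./data" are assumptions, and the 1,000-example split is one of the six settings):

```python
import torch
from torch.utils.data import Subset
from torchvision import datasets, transforms

# Resize 28x28 -> 32x32, then shift pixel values from [0, 1] to [-0.5, 0.5].
transform = transforms.Compose([
    transforms.Resize(32),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (1.0,)),
])
train_set = datasets.FashionMNIST("./data", train=True, download=True,
                                  transform=transform)

# Form S_f from 1,000 "T-shirt" (label 0) examples; S_r is the rest.
tshirt_idx = torch.nonzero(train_set.targets == 0).flatten()
f_idx = tshirt_idx[torch.randperm(len(tshirt_idx))[:1000]]
mask = torch.ones(len(train_set), dtype=torch.bool)
mask[f_idx] = False
S_f = Subset(train_set, f_idx.tolist())
S_r = Subset(train_set, torch.nonzero(mask).flatten().tolist())
```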

Bayesian neural network (BNN). BNNs employ Bayesian inference to infer the posterior of the neural network parameters. The two major Bayesian inference methods employed therein are variational inference and SGMCMC.

For variational inference, one usually utilizes the mean-field Gaussian variational family [10, 37] to train BNNs. The Bayes by Backprop technique [10] is utilized to calculate the derivative of the ELBO function. Specifically, for the random variable \theta subject to the mean-field variational distribution q_{\lambda}=\mathcal{N}(\mu,\sigma^{2}I), we have that (\theta-\mu)/\sigma\sim\mathcal{N}(0,I). Let \varepsilon=(\theta-\mu)/\sigma; then, the derivative of the ELBO function can be rewritten as follows,

\displaystyle\nabla_{\lambda}\mathrm{ELBO}(\lambda,S)
\quad=\nabla_{\lambda}\mathbb{E}_{q_{\lambda}}\left(\log p(\theta,S)-\log q_{\lambda}(\theta)\right)
\quad=\nabla_{\lambda}\mathbb{E}_{\varepsilon}\left(\log p(\theta,S)-\log q_{\lambda}(\theta)\right)
\quad=\mathbb{E}_{\varepsilon}\nabla_{\lambda}\left(\log p(\theta,S)-\log q_{\lambda}(\theta)\right).

Hence, we can calculate the derivative \nabla_{\lambda}\mathrm{ELBO}(\lambda,S) in two steps: (1) repeatedly sample \varepsilon from \mathcal{N}(0,I) and calculate \nabla_{\lambda}\left(\log p(\theta,S)-\log q_{\lambda}(\theta)\right) based on \theta=\mu+\varepsilon\cdot\sigma; and (2) average the obtained derivatives to approximate \nabla_{\lambda}\mathrm{ELBO}(\lambda,S). Moreover, we also adopt the local reparameterization trick [37] to further eliminate the covariances between the gradients of examples in a batch.
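A minimal PyTorch-style sketch of this two-step estimator is given below (illustrative only; neg_log_joint is an assumed callable returning -\log p(\theta,S_{\mathrm{batch}}), and mu, rho are assumed leaf tensors with requires_grad=True, with \sigma=\mathrm{softplus}(\rho)):

```python
import torch
import torch.nn.functional as F

def elbo_grad(mu, rho, neg_log_joint, batch, n_samples=5):
    sigma = F.softplus(rho)                   # keep sigma positive
    loss = 0.0
    for _ in range(n_samples):
        eps = torch.randn_like(mu)            # step (1): eps ~ N(0, I)
        theta = mu + eps * sigma              # reparameterization theta = mu + eps * sigma
        log_q = torch.distributions.Normal(mu, sigma).log_prob(theta).sum()
        loss = loss + (log_q + neg_log_joint(theta, batch)) / n_samples
    loss.backward()                           # step (2): averaged gradients w.r.t. (mu, rho)
    return loss
```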

Bayesian LeNet-5 [43] is employed in our experiments, which consists of two convolutional layers and three fully-connected layers. We follow Liu et al. [44] to use an isotropic Gaussian distribution 𝒩(0,σ2I)\mathcal{N}(0,\sigma^{2}I) as the prior of BNN, where the standard deviation σ\sigma is set as 0.150.15.

Experiment design. We employ variational inference, SGLD, and SGHMC to train the BNN on the complete training set S. Then, we remove the subset S_f by the proposed BIF algorithms. We also train models on only the remaining set S_r in order to show the targets of the forgetting task. The experiments have two main phases:

(1) Training phase. Every BNN is trained for 10,000 iterations. The batch size is set as 128. For variational inference, the learning rate is initialized as 0.5/n, where n is the training set size, and decays by a factor of 0.1 every 4,000 iterations. Additionally, the number of samples for the Bayes by Backprop procedure is set as 5. For SGLD, the step-size schedule is set as 0.5\cdot t^{-0.5}/n, where t is the training iteration step. For SGHMC, the step-size schedule is set as 0.5\cdot t^{-0.5}/n, and the initial \alpha factor is set as 0.4.

(2) Forgetting phase. We remove a batch of 64 datums each time. When calculating the inverse-Hessian-vector product H^{-1}v in the influence functions (see Section 5.3), the number of recursive iterations j is set as 64. For variational inference, the scaling factor c is set as 0.1/n', where n' is the number of currently remaining datums; notice that n' gradually decreases as the forgetting process continues. For SGLD and SGHMC, the scaling factors c are set as 0.005/n' and 0.05/n', respectively. Besides, we employ the Monte Carlo method to calculate the expectations in the MCMC influence functions: we sample the model parameter \theta 5 times, calculate the matrices and vectors in the MCMC influence functions based on these sampled parameters, and average the results to approximate the desired expectations.

Figure 3: Curves of classification error versus the number of removed training datums. The results of variational inference, SGLD, and SGHMC are presented from left to right. For each setting, classification error curves on the forgotten set S_f, the remaining set S_r, and the test set S_test are plotted. The models are trained 5 times with different random seeds; the darker lines show the averages over seeds and the shaded areas show the standard deviations.
Table 1: Time of the training phase and the forgetting phase in the experiments for BNNs. The acceleration rate is calculated by dividing the training time by the removal time. Our forgetting algorithms are significantly faster than re-training from scratch.

                    Variational Inference   SGLD      SGHMC
Training Time       337.77 s                39.36 s   39.90 s
Removal Time        4.97 s                  4.04 s    4.04 s
Acceleration Rate   67.90                   9.75      9.88

7.2.2 Results Analysis

For each of the obtained processed models and target models, we evaluate its classification error on the sets S_f, S_r, and S_test. The results are collected and plotted in fig. 3. We also collect and present the training and forgetting times in Table 1.

From fig. 3 and Table 1, we have the following two observations: (1) the processed models are similar to the target models in terms of classification error on all three sample sets S_f, S_r, and S_test; and (2) our forgetting algorithms are significantly faster than re-training models from scratch. These phenomena demonstrate that our forgetting algorithms can effectively and efficiently remove the influence of datums from BNNs without hurting the remaining information.

8 Conclusion

The right to be forgotten imposes a considerable compliance burden on AI companies. A company may need to delete a whole model learned from massive resources due to a request to delete a single datum. To address this problem, this work designs a Bayesian inference forgetting (BIF) framework that removes the influence of specific datums on the learned model without deleting the whole model. The BIF framework is established on an energy-based Bayesian inference influence function, which characterizes the influence of the requested data on the learned models. We prove that BIF has an \varepsilon-certified knowledge removal guarantee, a new notion for characterizing forgetting performance. Under the BIF framework, forgetting algorithms are developed for two canonical Bayesian inference algorithms, variational inference and Markov chain Monte Carlo. Theoretical analysis provides guarantees on the generalizability of the proposed methods: performing BIF has little effect on it. Comprehensive experiments demonstrate that the proposed methods can remove the influence of specified datums without compromising the knowledge learned from the remaining datums. The source code package is available at https://github.com/fshp971/BIF.

References

  • [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
  • [2] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187, 2017.
  • [3] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In International Conference on Machine Learning, pages 1771–1778, 2012.
  • [4] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • [5] Thomas Baumhauer, Pascal Schöttle, and Matthias Zeppelzauer. Machine unlearning: Linear filtration for logit-based classifiers. arXiv preprint arXiv:2002.02730, 2020.
  • [6] Matthew J Beal, Zoubin Ghahramani, and Carl E Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems, pages 577–584, 2002.
  • [7] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. The Journal of the American Statistical Association, 112(518):859–877, 2017.
  • [8] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
  • [9] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
  • [10] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622, 2015.
  • [11] Lucas Bourtoule, Varun Chandrasekaran, Christopher Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. arXiv preprint arXiv:1912.03817, 2019.
  • [12] Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2(Mar):499–526, 2002.
  • [13] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.
  • [14] R Dennis Cook and Sanford Weisberg. Residuals and Influence in Regression. New York: Chapman and Hall, 1982.
  • [15] Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pages 3203–3211, 2014.
  • [16] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
  • [17] Richard M Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
  • [18] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Annual ACM Symposium on Theory of Computing, pages 117–126, 2015.
  • [19] Weinan E, Chao Ma, Stephan Wojtowytsch, and Lei Wu. Towards a mathematical understanding of neural network-based machine learning: What we know and what we don’t. arXiv preprint arXiv:2009.10713, 2020.
  • [20] Sanjam Garg, Shafi Goldwasser, and Prashant Nalini Vasudevan. Formalizing data deletion in the context of the right to be forgotten. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 373–402. Springer, 2020.
  • [21] Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian data analysis. CRC press, 2013.
  • [22] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.
  • [23] Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making AI forget you: Data deletion in machine learning. In Advances in Neural Information Processing Systems, pages 3518–3531, 2019.
  • [24] Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9304–9312, 2020.
  • [25] Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens van der Maaten. Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030, 2019.
  • [26] W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
  • [27] David Haussler. Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217–232, 1995.
  • [28] Fengxiang He and Dacheng Tao. Recent advances in deep learning theory. arXiv preprint arXiv:2012.10931, 2020.
  • [29] Fengxiang He, Bohan Wang, and Dacheng Tao. Tighter generalization bounds for iterative differentially private learning algorithms. arXiv preprint arXiv:2007.09371, 2020.
  • [30] Matthew Hoffman, Francis R Bach, and David M Blei. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 856–864, 2010.
  • [31] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(4):1303–1347, 2013.
  • [32] Peter J Huber. Robust Statistics, volume 523. John Wiley & Sons, 2004.
  • [33] Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. Approximate data deletion from machine learning models: Algorithms and evaluations. arXiv preprint arXiv:2002.10077, 2020.
  • [34] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
  • [35] Michael Irwin Jordan. Learning in graphical models, volume 89. Springer Science & Business Media, 1998.
  • [36] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [37] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
  • [38] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pages 1885–1894, 2017.
  • [39] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT press, 2009.
  • [40] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
  • [41] Vladimir Koltchinskii and Dmitriy Panchenko. Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II, pages 443–457. Springer, 2000.
  • [42] Lucien Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer Science & Business Media, 2012.
  • [43] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [44] Xuanqing Liu, Yao Li, Chongruo Wu, and Cho-Jui Hsieh. Adv-BNN: Improved adversarial defense through robust Bayesian neural network. arXiv preprint arXiv:1810.01279, 2018.
  • [45] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.
  • [46] Zhanyu Ma, Andrew E Teschendorff, Arne Leijon, Yuanyuan Qiao, Honggang Zhang, and Jun Guo. Variational Bayesian matrix factorization for bounded support data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(4):876–889, 2014.
  • [47] David A McAllester. Some PAC-Bayesian theorems. In Proceedings of the Annual Conference on Computational Learning Theory, pages 230–234, 1998.
  • [48] David A McAllester. PAC-Bayesian model averaging. In Proceedings of the Annual Conference on Computational Learning Theory, pages 164–170, 1999.
  • [49] David A McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.
  • [50] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
  • [51] Mohamad T Musavi, Khue Hiang Chan, Donald M Hummels, and K Kalantri. On the generalization ability of neural network classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6):659–663, 1994.
  • [52] Radford M Neal. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical report, Citeseer, 1992.
  • [53] Quoc Phong Nguyen, Bryan Kian Hsiang Low, and Patrick Jaillet. Variational bayesian unlearning. Advances in Neural Information Processing Systems, 33, 2020.
  • [54] Luca Oneto, Sandro Ridella, and Davide Anguita. Differential privacy and generalization: Sharper bounds with applications. Pattern Recognition Letters, 89:31–38, 2017.
  • [55] John Paisley, David Blei, and Michael Jordan. Variational Bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430, 2012.
  • [56] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
  • [57] Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pages 3102–3110, 2013.
  • [58] Tim Pearce, Felix Leibfried, and Alexandra Brintrup. Uncertainty in neural networks: Approximately Bayesian ensembling. In International Conference on Artificial Intelligence and Statistics, pages 234–244. PMLR, 2020.
  • [59] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • [60] William H Rogers and Terry J Wagner. A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, pages 506–514, 1978.
  • [61] Wolfgang Roth and Franz Pernkopf. Bayesian neural networks with weight sharing using Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):246–252, 2018.
  • [62] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning, pages 880–887, 2008.
  • [63] David Marco Sommer, Liwei Song, Sameer Wagh, and Prateek Mittal. Towards probabilistic verification of machine unlearning. arXiv preprint arXiv:2003.04247, 2020.
  • [64] Jaemo Sung, Zoubin Ghahramani, and Sung-Yang Bang. Latent-space variational Bayes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(12):2236–2242, 2008.
  • [65] Michalis Titsias and Miguel Lázaro-Gredilla. Local expectation gradients for black box variational inference. In Advances in Neural Information Processing Systems, pages 2638–2646, 2015.
  • [66] DM Titterington et al. Bayesian methods for neural networks and related models. Statistical Science, 19(1):128–139, 2004.
  • [67] Vladimir Vapnik. Estimation of Dependences Based on Empirical Data. Springer Science & Business Media, 2006.
  • [68] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
  • [69] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pages 681–688, 2011.
  • [70] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
  • [71] Qiaoling Ye, Arash Amini, and Qing Zhou. Optimizing regularized Cholesky score for order-based learning of Bayesian networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [72] Hongwei Yong, Deyu Meng, Wangmeng Zuo, and Lei Zhang. Robust online matrix factorization for dynamic background subtraction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7):1726–1740, 2017.
  • [73] Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):2008–2026, 2018.
  • [74] Hao Zhang, Bo Chen, Yulai Cong, Dandan Guo, Hongwei Liu, and Mingyuan Zhou. Deep autoencoding topic model with scalable hybrid Bayesian inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 68:5795–5809, 2020.
  • [75] Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson. Cyclical stochastic gradient MCMC for Bayesian deep learning. In International Conference on Learning Representations, 2020.

Appendix A Proofs in Section 4

This section provides the proofs omitted from Section 4.

Proof of Theorem 4.2.

We first calculate γ^Szj(τ)τ\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau} based on eq. (4), i.e., the following equation,

γF(γ,S)+τγh(γ,zj)=0.\displaystyle\nabla_{\gamma}F(\gamma,S)+\tau\cdot\nabla_{\gamma}h(\gamma,z_{j})=0.

Calculating the derivatives of both sides of the above equation with respect to \tau, we have that

γ2F(γ^Szj(τ),S)γ^Szj(τ)τ+γh(γ^Szj(τ),zj)T+τγ2h(γ^Szj(τ),zj)γ^Szj(τ)τ=0.\displaystyle\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}+\nabla_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})^{T}+\tau\cdot\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}=0. (11)

By Lemma 4.1, F_{-z_{j},\tau}(\gamma,S) is strongly convex with respect to \gamma. Thus, the following Hessian matrix

γ2Fzj,τ(γ^Szj(τ),S)=γ2F(γ^Szj(τ),S)+τγ2h(γ^Szj(τ),zj)\displaystyle\nabla^{2}_{\gamma}F_{-z_{j},\tau}(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)=\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\tau\cdot\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})

is positive definite, hence invertible. Combining with eq. (11), we have that

γ^Szj(τ)τ=\displaystyle\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}= (γ2F(γ^Szj(τ),S)+τγ2h(γ^Szj(τ),zj))1γh(γ^Szj(τ),zj)T.\displaystyle-\left(\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\tau\cdot\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right)^{-1}\cdot\nabla_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})^{T}. (12)

We then upper bound the norm of γ^Szj(τ)τ\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}. Based on eq. (12), we have that

γ^Szj(τ)τ2\displaystyle\left\|\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}\leq (1nγ2F(γ^Szj(τ),S)+τnγ2h(γ^Szj(τ),zj))121nγh(γ^Szj(τ),zj)2.\displaystyle\left\|\left(\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right)^{-1}\right\|_{2}\cdot\left\|\frac{1}{n}\nabla_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right\|_{2}. (13)

We first consider the first term on the right-hand side of eq. (13). By Assumption 4.2, f(\gamma) is c_{f}-strongly convex and h(\gamma,z) is c_{h}(z)-strongly convex, where c_{f} is a positive real number and c_{h}(z) is a positive continuous real function on \mathcal{Z}. Thus, we have that

1nγ2F(γ^Szj(τ),S)+τnγ2h(γ^Szj(τ),zj)\displaystyle\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})
=\displaystyle= 1ni=1nγ2h(γ^Szj(τ),zi)+1nγ2f(γ^Szj(τ))+τnγ2h(γ^Szj(τ),zj)\displaystyle\frac{1}{n}\sum_{i=1}^{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{i})+\frac{1}{n}\nabla^{2}_{\gamma}f(\hat{\gamma}^{-z_{j}}_{S}(\tau))+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})
=\displaystyle= 1nzS{zj}γ2h(γ^Szj(τ),z)+1+τnγ2h(γ^Szj(τ),zj)+1nγ2f(γ^Szj(τ))\displaystyle\frac{1}{n}\sum_{z\in S-\{z_{j}\}}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z)+\frac{1+\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})+\frac{1}{n}\nabla^{2}_{\gamma}f(\hat{\gamma}^{-z_{j}}_{S}(\tau))
\displaystyle\succeq (1nzS{zj}ch(z)+1+τnch(zj)+cfn)I\displaystyle\left(\frac{1}{n}\sum_{z\in S-\{z_{j}\}}c_{h}(z)+\frac{1+\tau}{n}c_{h}(z_{j})+\frac{c_{f}}{n}\right)I
\displaystyle\succeq (1nzS{zj}ch(z))I.\displaystyle\left(\frac{1}{n}\sum_{z\in S-\{z_{j}\}}c_{h}(z)\right)I.

Applying Assumption 4.1, we have that c_{h}(z) is continuous on the compact set \mathcal{Z}. This implies that there exists a real constant \hat{c}_{h}>0 such that c_{h}(z)\geq\hat{c}_{h} for all z\in\mathcal{Z}. Therefore, we further have that

1nγ2F(γ^Szj(τ),S)+τnγ2h(γ^Szj(τ),zj)(1nzS{zj}ch(z))I(n1nc^h)I.\displaystyle\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\succeq\left(\frac{1}{n}\sum_{z\in S-\{z_{j}\}}c_{h}(z)\right)I\succeq\left(\frac{n-1}{n}\hat{c}_{h}\right)I.

Let \lambda_{\min} denote the smallest eigenvalue of the matrix \left(\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right). Then, the above inequality implies that \lambda_{\min}\geq\frac{n-1}{n}\hat{c}_{h}. Hence, we have the following,

\displaystyle\left\|\left(\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right)^{-1}\right\|_{2}=\frac{1}{\lambda_{\min}}\leq\frac{n}{(n-1)\hat{c}_{h}}=O(1). (14)

We then upper bound the second term on the right-hand side of eq. (13). Applying Assumptions 4.2 and 4.1, we have that \nabla_{\gamma}h(\gamma,z) is continuous on the compact set \Gamma\times\mathcal{Z}, and hence \left\|\nabla_{\gamma}h(\gamma,z)\right\|_{2} is bounded. Therefore, as n\to\infty, we have

\displaystyle\left\|\frac{1}{n}\nabla_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right\|_{2}=\frac{1}{n}\left\|\nabla_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right\|_{2}\leq O\left(\frac{1}{n}\right). (15)

Finally, inserting eqs. (14) and (15) into eq. (13), we eventually have that

\displaystyle\left\|\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}\leq O(1)\cdot O\left(\frac{1}{n}\right)=O\left(\frac{1}{n}\right).

The proof is completed. ∎

Proof of Theorem 4.3.

We first calculate \frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}} based on eq. (4). Similar to the proof of Theorem 4.2, we calculate the second-order derivatives of both sides of eq. (4) with respect to \tau and have that

i=1nB(τ,zi)+A(τ)+(i=1nγ2h(γ^Szj(τ),zi)+γ2f(γ^Szj(τ)))2γ^Szj(τ)τ2\displaystyle\sum_{i=1}^{n}B(\tau,z_{i})+A(\tau)+\left(\sum_{i=1}^{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{i})+\nabla^{2}_{\gamma}f(\hat{\gamma}^{-z_{j}}_{S}(\tau))\right)\cdot\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}
+2γ2h(γ^Szj(τ),zj)γ^Szj(τ)τ+τB(τ,zj)+τγ2h(γ^Szj(τ),zj)2γ^Szj(τ)τ2=0,\displaystyle+2\cdot\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}+\tau\cdot B(\tau,z_{j})+\tau\cdot\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}=0,

which means

\displaystyle\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}=-\left(\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right)^{-1}
(1ni=1nB(τ,zi)+τnB(τ,zj)+1nA(τ)+2nγ2h(γ^Szj(τ),zj)γ^Szj(τ)τ),\displaystyle\cdot\left(\frac{1}{n}\sum_{i=1}^{n}B(\tau,z_{i})+\frac{\tau}{n}B(\tau,z_{j})+\frac{1}{n}A(\tau)+\frac{2}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right), (16)

where the invertibility of \left(\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right) is guaranteed by Lemma 4.1, A(\tau),B(\tau,z)\in\mathbb{R}^{K\times 1}, and for i=1,\cdots,K, we have the following,

A(τ)i=γ^Szj(τ)τTγ2(f(γ^Szj(τ))γi)γ^Szj(τ)τ,\displaystyle A(\tau)_{i}={\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}}^{T}\cdot\nabla^{2}_{\gamma}\left(\frac{\partial f(\hat{\gamma}^{-z_{j}}_{S}(\tau))}{\partial\gamma_{i}}\right)\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}, (17)
B(τ,z)i=γ^Szj(τ)τTγ2(h(γ^Szj(τ),z)γi)γ^Szj(τ)τ.\displaystyle B(\tau,z)_{i}={\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}}^{T}\cdot\nabla^{2}_{\gamma}\left(\frac{\partial h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z)}{\partial\gamma_{i}}\right)\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}. (18)

We then upper bound the norm of 2γ^Szj(τ)τ2\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}. Based on eq. (16), we have that

2γ^Szj(τ)τ22\displaystyle\left\|\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}\right\|_{2}
\displaystyle\leq\left\|\left(\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right)^{-1}\right\|_{2}
1n(i=1nB(τ,zi)2+τB(τ,zj)2+A(τ)2+2γ2h(γ^Szj(τ),zj)γ^Szj(τ)τ2)\displaystyle\ \ \ \ \cdot\frac{1}{n}\left(\sum_{i=1}^{n}\|B(\tau,z_{i})\|_{2}+\tau\cdot\|B(\tau,z_{j})\|_{2}+\|A(\tau)\|_{2}+2\cdot\left\|\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}\right) (19)
O(1n(i=1nB(τ,zi)2+τB(τ,zj)2+A(τ)2+2γ2h(γ^Szj(τ),zj)γ^Szj(τ)τ2)).\displaystyle\leq O\left(\frac{1}{n}\left(\sum_{i=1}^{n}\|B(\tau,z_{i})\|_{2}+\tau\cdot\|B(\tau,z_{j})\|_{2}+\|A(\tau)\|_{2}+2\cdot\left\|\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}\right)\right). (20)

where eq. (20) is obtained by inserting eq. (14) (in the proof of Theorem 4.2) into eq. (19). Thus, it remains to upper bound the norms of A(\tau), B(\tau,z), and \nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}.

We first upper bound A(\tau). Applying Theorem 4.2, we have that \left\|\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}\leq O\left(\frac{1}{n}\right). Applying Assumptions 4.2 and 4.1, we have that \nabla^{2}_{\gamma}\left(\frac{\partial f(\hat{\gamma}^{-z_{j}}_{S}(\tau))}{\partial\gamma_{i}}\right) is bounded on its support \Gamma. As a result, \left\|\nabla^{2}_{\gamma}\left(\frac{\partial f(\hat{\gamma}^{-z_{j}}_{S}(\tau))}{\partial\gamma_{i}}\right)\right\|_{2} is also bounded. Therefore, we have that

A(τ)2i=1Kγ2(f(γ^Szj(τ))γi)2γ^Szj(τ)τ22i=1KO(1)O(1n2)=O(1n2).\displaystyle\|A(\tau)\|_{2}\leq\sum_{i=1}^{K}\left\|\nabla^{2}_{\gamma}\left(\frac{\partial f(\hat{\gamma}^{-z_{j}}_{S}(\tau))}{\partial\gamma_{i}}\right)\right\|_{2}\cdot\left\|\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}^{2}\leq\sum_{i=1}^{K}O(1)\cdot O\left(\frac{1}{n^{2}}\right)=O\left(\frac{1}{n^{2}}\right). (21)

For B(τ,z)B(\tau,z), we similarly have that

B(τ,z)2O(1n2).\displaystyle\|B(\tau,z)\|_{2}\leq O\left(\frac{1}{n^{2}}\right). (22)

To upper bound the norm of γ2h(γ^Szj(τ),zj)γ^Szj(τ)τ\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}, we apply Theorem 4.2, Assumptions 4.2 and 4.1, and have that

γ2h(γ^Szj(τ),zj)γ^Szj(τ)τ2\displaystyle\left\|\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}
γ2h(γ^Szj(τ),zj)2γ^Szj(τ)τ2O(1)O(1n)=O(1n).\displaystyle\quad\leq\left\|\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right\|_{2}\cdot\left\|\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}\leq O(1)\cdot O\left(\frac{1}{n}\right)=O\left(\frac{1}{n}\right). (23)

Inserting eqs. (21), (22) and (23), into eq. (20), we eventually have that

2γ^Szj(τ)τ22O(1n(i=1nO(1n2)+τO(1n2)+O(1n2)+2O(1n)))=O(1n2).\displaystyle\left\|\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}\right\|_{2}\leq O\left(\frac{1}{n}\left(\sum_{i=1}^{n}O\left(\frac{1}{n^{2}}\right)+\tau\cdot O\left(\frac{1}{n^{2}}\right)+O\left(\frac{1}{n^{2}}\right)+2\cdot O\left(\frac{1}{n}\right)\right)\right)=O\left(\frac{1}{n^{2}}\right).

The proof is completed. ∎

Appendix B Proofs in Section 5

This section provides the detailed proofs in Section 5.

Proof of Theorem 5.2.

Let \hat{\lambda}_{S} and \hat{\lambda}_{S-\{z_{j}\}} be the variational parameters learned on S and S-\{z_{j}\}, respectively. Let \hat{\lambda}^{-z_{j}}_{S}=\mathcal{A}_{\mathrm{VI}}(\hat{\lambda}_{S},z_{j})=\hat{\lambda}_{S}-\mathcal{I}_{\mathrm{VI}}(z_{j}) be the processed variational parameter. Then, to establish a certified knowledge removal guarantee for \mathcal{A}_{\mathrm{VI}}(\hat{\lambda}_{S},z_{j}), we need to bound the KL divergence \mathrm{KL}(p_{\hat{\lambda}^{-z_{j}}_{S}}(\theta)\|p_{\hat{\lambda}_{S-\{z_{j}\}}}(\theta)).

For brevity, we denote that

pλ^Szj=𝒩(μ1,σ12I),\displaystyle p_{\hat{\lambda}^{-z_{j}}_{S}}=\mathcal{N}(\mu_{1},\sigma_{1}^{2}I),

where λ^Szj=(μ11,,μ1d,σ11,,σ1d)\hat{\lambda}^{-z_{j}}_{S}=(\mu_{11},\cdots,\mu_{1d},\sigma_{11},\cdots,\sigma_{1d}), 0<M1σ11,,σ1dM20<M_{1}\leq\sigma_{11},\cdots,\sigma_{1d}\leq M_{2}, and

pλ^S{zj}=𝒩(μ2,σ22I),\displaystyle p_{\hat{\lambda}_{S-\{z_{j}\}}}=\mathcal{N}(\mu_{2},\sigma_{2}^{2}I),

where λ^S{zj}=(μ21,,μ2d,σ21,,σ2d)\hat{\lambda}_{S-\{z_{j}\}}=(\mu_{21},\cdots,\mu_{2d},\sigma_{21},\cdots,\sigma_{2d}), 0<M1σ21,,σ2dM20<M_{1}\leq\sigma_{21},\cdots,\sigma_{2d}\leq M_{2}.

We then have the following (we assume that Θ=d\Theta=\mathbb{R}^{d}),

KL\displaystyle\mathrm{KL} (pλ^Szj(θ)pλ^S{zj}(θ))\displaystyle(p_{\hat{\lambda}^{-z_{j}}_{S}}(\theta)\|p_{\hat{\lambda}_{S-\{z_{j}\}}}(\theta))
=Θlog(pλ^Szj(θ)pλ^S{zj}(θ))pλ^Szj(θ)dθ\displaystyle=\int_{\Theta}\log\left(\frac{p_{\hat{\lambda}^{-z_{j}}_{S}}(\theta)}{p_{\hat{\lambda}_{S-\{z_{j}\}}}(\theta)}\right)p_{\hat{\lambda}^{-z_{j}}_{S}}(\theta)\mathrm{d}\theta
=12k=1d((θkμ1k)2σ1k2(θkμ2k)2σ2k2)pλ^Szj(θk)dθkk=1dlogσ1kσ2k\displaystyle=-\frac{1}{2}\sum_{k=1}^{d}\int\left(\frac{(\theta_{k}-\mu_{1k})^{2}}{\sigma_{1k}^{2}}-\frac{(\theta_{k}-{\mu_{2k}})^{2}}{\sigma_{2k}^{2}}\right)p_{\hat{\lambda}^{-z_{j}}_{S}}(\theta_{k})\mathrm{d}\theta_{k}-\sum_{k=1}^{d}\log\frac{\sigma_{1k}}{\sigma_{2k}}
=12k=1d(1σ1k2σ2k2(μ1kμ2k)2σ2k2)k=1dlogσ1kσ2k\displaystyle=-\frac{1}{2}\sum_{k=1}^{d}\left(1-\frac{\sigma_{1k}^{2}}{\sigma_{2k}^{2}}-\frac{(\mu_{1k}-\mu_{2k})^{2}}{\sigma_{2k}^{2}}\right)-\sum_{k=1}^{d}\log\frac{\sigma_{1k}}{\sigma_{2k}}
=12k=1d(σ1k2σ2k2σ2k2+(μ1kμ2k)2σ2k2+2log(1+σ2kσ1kσ1k))\displaystyle=\frac{1}{2}\sum_{k=1}^{d}\left(\frac{\sigma_{1k}^{2}-\sigma_{2k}^{2}}{\sigma_{2k}^{2}}+\frac{(\mu_{1k}-\mu_{2k})^{2}}{\sigma_{2k}^{2}}+2\log\left(1+\frac{\sigma_{2k}-\sigma_{1k}}{\sigma_{1k}}\right)\right)
12k=1d(|σ1k+σ2k||σ1kσ2k|σ2k2+(μ1kμ2k)2σ2k2+2|σ1kσ2k|σ1k)\displaystyle\leq\frac{1}{2}\sum_{k=1}^{d}\left(\frac{|\sigma_{1k}+\sigma_{2k}|\cdot|\sigma_{1k}-\sigma_{2k}|}{\sigma_{2k}^{2}}+\frac{(\mu_{1k}-\mu_{2k})^{2}}{\sigma_{2k}^{2}}+2\cdot\frac{|\sigma_{1k}-\sigma_{2k}|}{\sigma_{1k}}\right)
12k=1d(2M2|σ1kσ2k|M12+(μ1kμ2k)2M12+2|σ1kσ2k|M1)\displaystyle\leq\frac{1}{2}\sum_{k=1}^{d}\left(\frac{2M_{2}|\sigma_{1k}-\sigma_{2k}|}{M_{1}^{2}}+\frac{(\mu_{1k}-\mu_{2k})^{2}}{M_{1}^{2}}+\frac{2|\sigma_{1k}-\sigma_{2k}|}{M_{1}}\right)
=12M12k=1d(2(M1+M2)|σ1kσ2k|+(μ1kμ2k)2)\displaystyle=\frac{1}{2M_{1}^{2}}\sum_{k=1}^{d}\left(2(M_{1}+M_{2})|\sigma_{1k}-\sigma_{2k}|+(\mu_{1k}-\mu_{2k})^{2}\right)
12M12(2(M1+M2)λ^Szjλ^S{zj}1+λ^Szjλ^S{zj}22)=ελ^S,zj.\displaystyle\leq\frac{1}{2M_{1}^{2}}\left(2(M_{1}+M_{2})\|\hat{\lambda}^{-z_{j}}_{S}-\hat{\lambda}_{S-\{z_{j}\}}\|_{1}+\|\hat{\lambda}^{-z_{j}}_{S}-\hat{\lambda}_{S-\{z_{j}\}}\|_{2}^{2}\right)=\varepsilon_{\hat{\lambda}_{S},z_{j}}. (24)

Therefore, 𝒜VI(λ^S,zj)\mathcal{A}_{\mathrm{VI}}(\hat{\lambda}_{S},z_{j}) performs ελ^S,zj\varepsilon_{\hat{\lambda}_{S},z_{j}}-certified knowledge removal.

The proof is completed. ∎
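As a quick numeric sanity check of eq. (24) (our own illustration with arbitrarily chosen d, M_1, M_2; not part of the paper's experiments), one can compare the closed-form KL divergence between two mean-field Gaussians with the bound \varepsilon:

```python
import numpy as np

rng = np.random.default_rng(0)
d, M1, M2 = 5, 0.5, 2.0
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
s1, s2 = rng.uniform(M1, M2, size=d), rng.uniform(M1, M2, size=d)

# Closed-form KL( N(mu1, diag(s1^2)) || N(mu2, diag(s2^2)) ).
kl = 0.5 * np.sum(s1**2 / s2**2 + (mu1 - mu2)**2 / s2**2 - 1.0 + 2.0 * np.log(s2 / s1))

# The bound eps from eq. (24), with lambda = (mu_1, ..., mu_d, sigma_1, ..., sigma_d).
diff = np.concatenate([mu1 - mu2, s1 - s2])
eps = (2 * (M1 + M2) * np.abs(diff).sum() + (diff ** 2).sum()) / (2 * M1 ** 2)

assert kl <= eps  # the certified-removal bound dominates the exact KL
```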

Proof of Theorem 5.4.

Let the processed model be p^{-z_{j}}(\theta)=\mathcal{A}_{\mathrm{MCMC}}(p(\theta|S),z_{j}). For brevity, we denote that

p1(θ)=pzj(θ)=𝒩(θ1,(nJ(θ1))1),\displaystyle p_{1}(\theta)=p^{-z_{j}}(\theta)=\mathcal{N}(\theta_{1}^{\prime},(nJ(\theta_{1}))^{-1}),

where θ1=θ1MCMC(zj)\theta_{1}^{\prime}=\theta_{1}-\mathcal{I}_{\mathrm{MCMC}}(z_{j}), and

p2(θ)=p(θ|S{zj})=𝒩(θ2,((n1)J(θ2))1).\displaystyle p_{2}(\theta)=p(\theta|S-\{z_{j}\})=\mathcal{N}(\theta_{2},((n-1)J(\theta_{2}))^{-1}).

Then, to establish the certified removal guarantee, we need to calculate the KL divergence KL(p1(θ)p2(θ))\mathrm{KL}(p_{1}(\theta)\|p_{2}(\theta)).

Since

p1(θ)=1(2π)d|nJ(θ1)|1exp(12(θθ1)T(nJ(θ1))(θθ1)),\displaystyle p_{1}(\theta)=\frac{1}{\sqrt{(2\pi)^{d}|nJ(\theta_{1})|^{-1}}}\exp\left(-\frac{1}{2}(\theta-\theta_{1}^{\prime})^{T}(nJ(\theta_{1}))(\theta-\theta_{1}^{\prime})\right),
p2(θ)=1(2π)d|(n1)J(θ2)|1exp(12(θθ2)T((n1)J(θ2))(θθ2)),\displaystyle p_{2}(\theta)=\frac{1}{\sqrt{(2\pi)^{d}|(n-1)J(\theta_{2})|^{-1}}}\exp\left(-\frac{1}{2}(\theta-\theta_{2})^{T}((n-1)J(\theta_{2}))(\theta-\theta_{2})\right),

we have that (we assume that Θ=d\Theta=\mathbb{R}^{d})

KL\displaystyle\mathrm{KL} (p1(θ)p2(θ))\displaystyle(p_{1}(\theta)\|p_{2}(\theta))
=\displaystyle= Θlog(p1(θ)p2(θ))p1(θ)dθ\displaystyle\int_{\Theta}\log\left(\frac{p_{1}(\theta)}{p_{2}(\theta)}\right)p_{1}(\theta)\mathrm{d}\theta
=\displaystyle= 12Θ((θθ1)T(nJ(θ1))(θθ1)(θθ2)T((n1)J(θ2))(θθ2))p1(θ)dθ\displaystyle-\frac{1}{2}\int_{\Theta}\left((\theta-\theta_{1}^{\prime})^{T}(nJ(\theta_{1}))(\theta-\theta_{1}^{\prime})-(\theta-\theta_{2})^{T}((n-1)J(\theta_{2}))(\theta-\theta_{2})\right)p_{1}(\theta)\mathrm{d}\theta
12log|nJ(θ1)|1|(n1)J(θ2)|1\displaystyle-\frac{1}{2}\log\frac{|nJ(\theta_{1})|^{-1}}{|(n-1)J(\theta_{2})|^{-1}}
=\displaystyle= 12(tr(nJ(θ1)(nJ(θ1))1)tr((n1)J(θ2)(nJ(θ1))1)\displaystyle-\frac{1}{2}\left(\mathrm{tr}(nJ(\theta_{1})(nJ(\theta_{1}))^{-1})-\mathrm{tr}((n-1)J(\theta_{2})(nJ(\theta_{1}))^{-1})\right.
\hskip 30.00005pt\left.-(\theta_{1}^{\prime}-\theta_{2})^{T}((n-1)J(\theta_{2}))(\theta_{1}^{\prime}-\theta_{2})\right)+\frac{1}{2}\left(d\log\frac{n}{n-1}+\log\frac{|J(\theta_{1})|}{|J(\theta_{2})|}\right)
=\frac{1}{2}\left((n-1)(\theta_{1}^{\prime}-\theta_{2})^{T}J(\theta_{2})(\theta_{1}^{\prime}-\theta_{2})+\frac{n-1}{n}\mathrm{tr}(J(\theta_{2})J^{-1}(\theta_{1}))-d+d\log\frac{n}{n-1}+\log\frac{|J(\theta_{1})|}{|J(\theta_{2})|}\right)
=\displaystyle= O((n1)(θ1θ2)TJ(θ2)(θ1θ2)+tr(J1(θ1)(J(θ2)J(θ1)))+log|J(θ1)||J(θ2)|)\displaystyle O\left((n-1)(\theta_{1}^{\prime}-\theta_{2})^{T}J(\theta_{2})(\theta_{1}^{\prime}-\theta_{2})+\mathrm{tr}\left(J^{-1}(\theta_{1})(J(\theta_{2})-J(\theta_{1}))\right)+\log\frac{|J(\theta_{1})|}{|J(\theta_{2})|}\right)
=\displaystyle= O(εθ1,zj).\displaystyle O(\varepsilon_{\theta_{1},z_{j}}).

This means that \mathrm{KL}(p^{-z_{j}}(\theta)\|p(\theta|S-\{z_{j}\}))=O(\varepsilon_{\theta_{1},z_{j}}). Therefore, \mathcal{A}_{\mathrm{MCMC}}(p(\theta|S),z_{j}) performs O(\varepsilon_{\theta_{1},z_{j}})-certified knowledge removal.

The proof is completed. ∎

Appendix C Proofs in Section 6

This section provides the detailed proofs in Section 6.

We derive generalization bounds for BIF under the PAC-Bayesian framework [47, 48, 49]. The framework can provide guarantees for randomized predictors (e.g. Bayesian predictors).

Specifically, let Q be a distribution on the parameter space \Theta, and let P denote the prior distribution over \Theta. Then, the expected risk \mathcal{R}(Q) is bounded in terms of the empirical risk \hat{\mathcal{R}}(Q,S) and the KL-divergence \mathrm{KL}(Q\|P) by the following result from the PAC-Bayesian theory.

Lemma C.1 (cf. [49], Theorem 1).

For any real δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, we have the following inequality for all distributions QQ:

(Q)^(Q,S)+KL(QP)+log1δ+logn+22n1.\displaystyle\mathcal{R}(Q)\leq\mathcal{\hat{R}}(Q,S)+\sqrt{\frac{\mathrm{KL}(Q\|P)+\log\frac{1}{\delta}+\log n+2}{2n-1}}. (25)
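To give a sense of scale (with illustrative values we pick here, not taken from the paper), the complexity term of eq. (25) can be evaluated directly:

```python
import math

kl, n, delta = 10.0, 60_000, 0.05  # illustrative values only
gap = math.sqrt((kl + math.log(1 / delta) + math.log(n) + 2) / (2 * n - 1))
print(f"PAC-Bayes complexity term: {gap:.4f}")  # approximately 0.015
```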

Based on Lemma C.1, we prove the generalization bounds in Theorem 6.1 and Theorem 6.2.

Proof of Theorem 6.1.

Let the prior distribution P be the standard Gaussian distribution \mathcal{N}(0,I), and let Q be the distribution obtained after conducting variational inference BIF. Suppose the density functions of P and Q are p(\theta) and q(\theta), respectively; then we have

p(θ)=1(2π)dexp(θ22),\displaystyle p(\theta)=\frac{1}{\sqrt{(2\pi)^{d}}}\exp\left(-\frac{\|\theta\|^{2}}{2}\right),
q(θ)=1(2π)dk=1dσkexp(k=1d(θkμk)22σk2),\displaystyle q(\theta)=\frac{1}{\sqrt{(2\pi)^{d}}\prod_{k=1}^{d}\sigma_{k}^{\prime}}\exp\left(-\sum_{k=1}^{d}\frac{(\theta_{k}-{\mu_{k}^{\prime}})^{2}}{2{\sigma_{k}^{\prime}}^{2}}\right),

where μk=μkΔμk\mu_{k}^{\prime}=\mu_{k}-\Delta_{\mu_{k}} and σk=σkΔσk\sigma_{k}^{\prime}=\sigma_{k}-\Delta_{\sigma_{k}}.

Therefore, we can calculate the KL-divergence KL(QP)\mathrm{KL}(Q\|P) as follows (where we assume that Θ=d\Theta=\mathbb{R}^{d}),

KL\displaystyle\mathrm{KL} (QP)\displaystyle(Q\|P)
=Θlog(q(θ)p(θ))q(θ)dθ\displaystyle=\int_{\Theta}\log\left(\frac{q(\theta)}{p(\theta)}\right)q(\theta)\mathrm{d}\theta
=12k=1dΘ((θkμk)2σk2θk2)q(θ)dθk=1dlogσk\displaystyle=-\frac{1}{2}\sum_{k=1}^{d}\int_{\Theta}\left(\frac{(\theta_{k}-\mu_{k}^{\prime})^{2}}{{\sigma_{k}^{\prime}}^{2}}-\theta_{k}^{2}\right)q(\theta)\mathrm{d}\theta-\sum_{k=1}^{d}\log{\sigma_{k}^{\prime}}
=12k=1d(1μk2σk2)k=1dlogσk\displaystyle=-\frac{1}{2}\sum_{k=1}^{d}\left(1-{\mu_{k}^{\prime}}^{2}-{\sigma_{k}^{\prime}}^{2}\right)-\sum_{k=1}^{d}\log{\sigma_{k}^{\prime}}
=12(λΔλ2d2k=1dlogσk)\displaystyle=\frac{1}{2}\left(\|\lambda-\Delta_{\lambda}\|^{2}-d-2\sum_{k=1}^{d}\log{\sigma_{k}^{\prime}}\right)
12(Δλ2+2λΔλ+λ22k=1dlog(σkΔσk)d)\displaystyle\leq\frac{1}{2}\left(\|\Delta_{\lambda}\|^{2}+2\|\lambda\|\cdot\|\Delta_{\lambda}\|+\|\lambda\|^{2}-2\sum_{k=1}^{d}\log\left(\sigma_{k}-\Delta_{\sigma_{k}}\right)-d\right)
=12(Cλ,Δλd),\displaystyle=\frac{1}{2}\left(C_{\lambda,\Delta_{\lambda}}-d\right), (26)

where

Cλ,Δλ=Δλ2+2λΔλ+λ22k=1dlog(σkΔσk).\displaystyle C_{\lambda,\Delta_{\lambda}}=\|\Delta_{\lambda}\|^{2}+2\|\lambda\|\cdot\|\Delta_{\lambda}\|+\|\lambda\|^{2}-2\sum_{k=1}^{d}\log\left(\sigma_{k}-\Delta_{\sigma_{k}}\right).

Eq. (26) gives an upper bound of the KL-divergence between the obtained distribution after BIF and the prior distribution. Inserting eq. (26) into eq. (25) in Lemma C.1, we obtain the PAC-Bayesian generalization bound.

Furthermore, since all the conditions in Theorem 5.1 hold, we can apply Theorem 4.2 and have that |Δσk|ΔλO(1/n)|\Delta_{\sigma_{k}}|\leq\|\Delta_{\lambda}\|\leq O(1/n). Besides, by applying the conditions in Theorem 5.2, we have that for any 1kd1\leq k\leq d, σkM1>0\sigma_{k}\geq M_{1}>0, where M1M_{1} is a constant. Therefore, we have that

\displaystyle C_{\lambda,\Delta_{\lambda}}\leq\|\Delta_{\lambda}\|^{2}+2\|\lambda\|\cdot\|\Delta_{\lambda}\|+\|\lambda\|^{2}-2d\log\left(M_{1}-\|\Delta_{\lambda}\|\right)
=O(1)-2d\log\left(M_{1}-O\left(\frac{1}{n}\right)\right)
\leq O(1).

The proof is completed. ∎

Proof of Theorem 6.2.

Let the prior distribution P be the standard Gaussian distribution \mathcal{N}(0,I), and let Q be the processed distribution obtained after conducting MCMC BIF. Suppose the density functions of P and Q are p(\theta) and q(\theta), respectively; then we have

p(θ)=1(2π)dexp(θ22),\displaystyle p(\theta)=\frac{1}{\sqrt{(2\pi)^{d}}}\exp\left(-\frac{\|\theta\|^{2}}{2}\right),
q(θ)=1(2π)d|nJ(θ1)|1exp(12(θθ1)T(nJ(θ1))(θθ1)),\displaystyle q(\theta)=\frac{1}{\sqrt{(2\pi)^{d}|nJ(\theta_{1})|^{-1}}}\exp\left(-\frac{1}{2}(\theta-\theta_{1}^{\prime})^{T}(nJ(\theta_{1}))(\theta-\theta_{1}^{\prime})\right),

where θ1=θ1Δθ1\theta_{1}^{\prime}=\theta_{1}-\Delta_{\theta_{1}}.

Thus, we can calculate the KL-divergence KL(QP)\mathrm{KL}(Q\|P) as follows (where we assume that Θ=d\Theta=\mathbb{R}^{d}),

KL\displaystyle\mathrm{KL} (QP)\displaystyle(Q\|P)
=Θlog(q(θ)p(θ))q(θ)dθ\displaystyle=\int_{\Theta}\log\left(\frac{q(\theta)}{p(\theta)}\right)q(\theta)\mathrm{d}\theta
=12(Θ((θθ1)T(nJ(θ1))(θθ1)θ2)q(θ)dθlog|nJ(θ1)|)\displaystyle=-\frac{1}{2}\left(\int_{\Theta}\left((\theta-\theta_{1}^{\prime})^{T}(nJ(\theta_{1}))(\theta-\theta_{1}^{\prime})-\|\theta\|^{2}\right)q(\theta)\mathrm{d}\theta-\log\left|nJ(\theta_{1})\right|\right)
=12(tr(I)tr((nJ(θ1))1)θ12log|nJ(θ1)|)\displaystyle=-\frac{1}{2}\left(\mathrm{tr}(I)-\mathrm{tr}\left((nJ(\theta_{1}))^{-1}\right)-\|\theta_{1}^{\prime}\|^{2}-\log\left|nJ(\theta_{1})\right|\right)
=12(θ12+1ntr(J1(θ1))d+log(nd|J(θ1)|))\displaystyle=\frac{1}{2}\left(\|\theta_{1}^{\prime}\|^{2}+\frac{1}{n}\mathrm{tr}(J^{-1}(\theta_{1}))-d+\log\left(n^{d}|J(\theta_{1})|\right)\right)
12(Δθ12+2θ1Δθ1+θ12+1ntr(J1(θ1))+log|J(θ1)|+d(logn1))\displaystyle\leq\frac{1}{2}\left(\|\Delta_{\theta_{1}}\|^{2}+2\|\theta_{1}\|\cdot\|\Delta_{\theta_{1}}\|+\|\theta_{1}\|^{2}+\frac{1}{n}\mathrm{tr}(J^{-1}(\theta_{1}))+\log|J(\theta_{1})|+d(\log n-1)\right)
=12(Cp,Δθ1+d(logn1)),\displaystyle=\frac{1}{2}\left(C_{p,\Delta_{\theta_{1}}}+d(\log n-1)\right), (27)

where

Cp,Δθ1=Δθ12+2θ1Δθ1+θ12+1ntr(J1(θ1))+log|J(θ1)|.\displaystyle C_{p,\Delta_{\theta_{1}}}=\|\Delta_{\theta_{1}}\|^{2}+2\|\theta_{1}\|\cdot\|\Delta_{\theta_{1}}\|+\|\theta_{1}\|^{2}+\frac{1}{n}\mathrm{tr}(J^{-1}(\theta_{1}))+\log|J(\theta_{1})|.

Eq. (27) bounds the KL-divergence between the processed distribution after conducting BIF and the prior distribution. Inserting eq. (27) into eq. (25) in Lemma C.1, we then obtain the PAC-Bayesian generalization bound.

Finally, since all the conditions in Theorem 5.3 hold, we can apply Theorem 4.2 and have that \|\Delta_{\theta_{1}}\|\leq O(1/n). Therefore, we have the following,

Cp,Δθ1=O(θ12+1ntr(J1(θ1))+log|J(θ1)|)=O(1).\displaystyle C_{p,\Delta_{\theta_{1}}}=O\left(\|\theta_{1}\|^{2}+\frac{1}{n}\mathrm{tr}(J^{-1}(\theta_{1}))+\log|J(\theta_{1})|\right)=O(1).

The proof is completed. ∎