
Bayesian Inference Forgetting

Shaopeng Fu, Fengxiang He, Yue Xu, and Dacheng Tao

S. Fu and F. He contributed equally. S. Fu, F. He, and D. Tao are with the University of Sydney (Email: [email protected], [email protected], and [email protected]). Y. Xu is with the University of Science and Technology of China; this work was completed when she was an intern at the University of Sydney (Email: [email protected]).
Abstract

The right to be forgotten has been legislated in many countries, but its enforcement in machine learning would cause unbearable costs: companies may need to delete whole models learned from massive resources because of a single individual's request. Existing works propose to remove the knowledge learned from the requested data via its influence function, which is no longer naturally well-defined in Bayesian inference. This paper proposes a Bayesian inference forgetting (BIF) framework to realize the right to be forgotten in Bayesian inference. In the BIF framework, we develop forgetting algorithms for variational inference and Markov chain Monte Carlo. We show that our algorithms can provably remove the influence of single datums on the learned models. Theoretical analysis demonstrates that our algorithms have guaranteed generalizability. Experiments with Gaussian mixture models on synthetic data and Bayesian neural networks on real-world data verify the feasibility of our methods. The source code package is available at https://github.com/fshp971/BIF.

Keywords: Certified knowledge removal, Bayesian inference, variational inference, and Markov chain Monte Carlo.

1 Introduction

The right to be forgotten refers to individuals' right to ask data controllers to delete the data collected from them. It has been recognized in many countries through legislation, including the European Union's General Data Protection Regulation (2016) and the California Consumer Privacy Act (2018). However, enforcement may result in unbearable costs for AI industries. When a single data deletion request comes, the data controller needs to delete the whole machine learning model, which might have cost massive amounts of resources, including data, energy, and time.

To address this issue, recent works [25, 24] propose to forget the influence of the requested data on the learned models via the influence function [14, 32, 38]:

\mathcal{I}(x)=-\nabla^{-2}_{\theta}L(\theta,S)\nabla^{T}_{\theta}\ell(\theta,x),

where $L(\theta,S)=\frac{1}{n}\sum_{i=1}^{n}\ell(\theta,x_{i})$ is the loss function in the optimization. This approach can provably remove knowledge learned from the requested data for optimization-based machine learning, subject to some conditions on the convexity and smoothness of the loss function.
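
For concreteness, this baseline update can be sketched in a few lines (a minimal numpy sketch under the stated convexity conditions; it assumes the Hessian is small enough to form densely, and all function and variable names are illustrative):

    import numpy as np

    def erm_influence(hess_L, grad_l_x):
        # I(x) = -[Hessian of L(theta, S)]^{-1} * gradient of l(theta, x):
        # the first-order effect of datum x on the learned parameters.
        return -np.linalg.solve(hess_L, grad_l_x)

    def erm_forget(theta, hess_L, grad_l_x, n):
        # Removing x shifts the minimizer by approximately -I(x)/n;
        # sign and scaling conventions vary across [25, 24, 38].
        return theta - erm_influence(hess_L, grad_l_x) / n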

However, the loss function $L$ is not always naturally well-defined. Bayesian inference approximates the posterior of probabilistic models, where a loss function may take a different form or may not exist at all. In this case, the influence function is no longer well-defined, and therefore, existing forgetting methods become invalid. Bayesian inference enables a wide spectrum of machine learning algorithms, such as probabilistic matrix factorization [46, 62, 72], topic models [30, 8, 74], probabilistic graphical models [39, 6, 71], and Bayesian neural networks [10, 37, 52, 66, 58, 61, 75]. Overlooking Bayesian inference in developing forgetting algorithms leaves a considerable proportion of machine learning algorithms without a low-cost way to protect the right to be forgotten.

In this paper, we design an energy-based Bayesian inference forgetting (BIF) framework to realize the right to be forgotten in Bayesian inference. We define Bayesian inference influence functions based on pre-defined energy functions, which provably characterize the influence of specific datums on the learned models. We prove that these characterizations are rigorous in first-order approximation, subject to some mild assumptions on the local convexity and smoothness of the energy functions around the convergence points. Specifically, the approximation error is not larger than the order $O(1/n^{2})$, where $n$ is the training sample size, which is negligible when the training sample set is sufficiently large. The BIF framework proposes to delete the Bayesian inference influence function of the requested data. We prove that BIF realizes $\varepsilon$-certified knowledge removal, a new notion defined to evaluate the performance of knowledge removal.

In the BIF framework, we develop certified knowledge removal algorithms for two canonical Bayesian inference algorithms, variational inference [34, 7, 31, 8] and Markov chain Monte Carlo (MCMC) [22, 45, 69, 15, 13, 57].

  • Variational inference forgetting. We show that the evidence lower bound (ELBO; [7]) plays the role of the energy function. Based on the ELBO, a variational inference influence function is defined, which helps deliver the variational inference forgetting algorithm. We prove that $\varepsilon$-certified knowledge removal is guaranteed on the mean-field Gaussian variational family.

  • MCMC forgetting. We formulate a new energy function for MCMC, based on which an MCMC influence function is designed. The new influence function then delivers our MCMC forgetting algorithm. We prove that $\varepsilon$-certified knowledge removal is achieved when the training sample set is sufficiently large. The MCMC forgetting algorithm is also applicable to stochastic gradient MCMC.

We theoretically investigate whether the BIF algorithms affect the generalizability of the learned Bayesian models. Based on the PAC-Bayesian theory [47, 48, 49], we prove that the generalization bounds are of the same order $O(\sqrt{(\log n)/n})$ (where $n$ is the training sample size), no matter whether the Bayesian models are processed by BIF. Moreover, the difference introduced into the generalization bounds is not larger than the order $O(1/n)$.

From the empirical aspect, we conduct systematic experiments to verify the forgetting algorithms. We apply the BIF algorithms to two scenarios: (1) Gaussian mixture models for clustering on synthetic data; and (2) Bayesian neural networks for classification on real-world data. Every scenario involves variational inference, SGLD, and SGHMC. The results suggest that the BIF algorithms effectively remove the knowledge learned from the requested data while keeping other information intact, which is in full agreement with our theoretical results. To secure reproducibility, the source code package is available at https://github.com/fshp971/BIF.

The rest of this paper is organized as follows. Section 2 reviews the related works. Section 3 provides the necessary preliminaries of Bayesian inference. Section 4 presents our main result, the energy-based BIF framework. Section 5 provides the forgetting algorithms developed under the BIF framework for variational inference and MCMC. Section 6 studies the generalization abilities of the proposed algorithms. Section 7 presents the experiment results, and Section 8 concludes the paper.

2 Related Works

Forgetting. The concept “make AI forget you” is first proposed by Ginart et al. [23], which also designs forgetting algorithms for $k$-means. Bourtoule et al. [11] then propose a “sharded, isolated, sliced, and aggregated” (SISA) framework to approach low-cost knowledge removal via the following three steps: (1) divide the whole training sample set into multiple disjoint shards; (2) train models in isolation on each of these shards; (3) retrain the affected model when a request to unlearn a training point arrives. Izzo et al. [33] develop a projective residual update (PRU) method that replaces the update vector in data removal with its projection on a low-dimensional subspace, which reduces the time complexity of data deletion to scale only linearly in the dimension of the data. Guo et al. [25] and Golatkar et al. [24] propose certified removal based on the influence function [14, 32, 38] of the requested data. Specifically, the influence function characterizes the influence of a single datum on the learned model in the empirical risk minimization (ERM) regime. The definition of the influence function relies on the gradient and Hessian of the objective function, which do not naturally exist in Bayesian inference. Other advances include those by Garg et al. [20], Baumhauer et al. [5], and Sommer et al. [63].

We acknowledge that a concurrent work [53] also studies knowledge removal in Bayesian models. The authors employ variational inference to minimize the KL divergence between the approximate posterior after unlearning and the posterior of the model retrained on the remaining data. However, this approach only applies to parametric models, such as those obtained by variational inference, while many non-parametric models, including those learned by MCMC, are left untouched. In contrast, our BIF covers all Bayesian inference methods. Moreover, no theoretical guarantee on the knowledge removal was established, which would be necessary to meet the legal requirements.

Markov chain Monte Carlo (MCMC). Hastings [26] introduces a two-step sampling method named MCMC, which first constructs a Markov chain and then draws samples according to the state of the Markov chain. It is proved that the drawn samples converge to samples from the desired distribution. Since then, many improvements have been made to MCMC, including the Gibbs sampling method [22] and hybrid Monte Carlo [16]. Welling and Teh [69] propose a framework named stochastic gradient Langevin dynamics (SGLD), which shows that by adding a proper noise to a standard stochastic gradient optimization algorithm [59], the iterates converge to samples from the true posterior distribution as the step size is annealed. Ahn et al. [3] extend SGLD by leveraging the Bayesian central limit theorem to improve the slow mixing rate of SGLD. Inspired by the idea of a thermostat in statistical physics, Ding et al. [15] leverage a small number of additional variables to stabilize momentum fluctuations caused by the unknown noise. Chen et al. [13] extend Hamiltonian Monte Carlo (HMC) to stochastic gradient HMC (SGHMC) by adding a friction term, which enables SGHMC to sample from the desired distributions without applying the Metropolis-Hastings rule. Based on Langevin Monte Carlo methods, Patterson and Teh [57] propose stochastic gradient Riemannian Langevin dynamics, which updates parameters according to both the stochastic gradients and additional noise that forces the sampler to explore the full posterior. Ma et al. [45] introduce a general recipe for creating stochastic gradient MCMC (SGMCMC) samplers, which is based on continuous Markov processes specified via two matrices and is proved to be complete.

Variational inference. Jordan et al. [34] introduce the use of variational inference in graphical models [35] such as Bayesian networks and Markov random fields, in which variational inference is employed to approximate the target posterior with families of parameterized distributions. Blei et al. [8] apply variational inference to latent Dirichlet allocation (LDA), which is used to model text corpora. Sung et al. [64] propose a general variational inference framework for conjugate-exponential family models, in which the model parameters other than the latent variables are integrated out exactly, while the latent variables are approximated by a first-order algorithm. By using stochastic optimization [59], a technique that follows noisy estimates of the gradient of the objective, Hoffman et al. [31] derive stochastic variational inference, which iterates between subsampling the data and adjusting the hidden structure based only on the subsample. Paisley et al. [55] propose an algorithm that reduces the variance of the stochastic search gradient by using control variates based on stochastic optimization and allows for direct optimization of the variational lower bound. Titsias and Lázaro-Gredilla [65] propose local expectation gradients, in which the stochastic gradient estimation problem over multiple variational parameters is decomposed into smaller subtasks, each of which focuses on the most relevant part of the variational distribution. Zhang et al. [73] review recent advances in variational inference from four aspects: scalable, generic, accurate, and amortized variational inference.

Bayesian Neural Networks (BNNs). Bayesian inference is first applied to neural networks by Neal [52], in which hybrid Monte Carlo is employed to infer the posteriors of the neural networks' parameters. Blundell et al. [10] propose an efficient backpropagation-compatible algorithm to calculate the gradient of variational neural networks, which expands the applicable domain of variational inference to deep learning. Based on that, Kingma et al. [37] further propose a local reparameterization technique to reduce the variance of the stochastic gradients of variational neural networks. They also develop variational dropout, which can automatically learn the dropout rates. Pearce et al. [58] introduce a family of procedures termed randomized MAP sampling (RMS), which includes randomize-then-optimize and ensemble sampling, and then realize Bayesian inference for neural networks via ensembling under the RMS family. Roth and Pernkopf [61] successfully apply the Dirichlet process prior to BNNs, which enforces the sharing of the network weights and reduces the number of parameters. Zhang et al. [75] propose a cyclical step-size schedule for SGMCMC to learn the multimodal posterior of a Bayesian neural network.

3 Preliminaries

Suppose a data set is $S=\{z_{1},z_{2},\cdots,z_{n}\}$, where $z_{i}\in\mathcal{Z}$ is a datum and $n$ is the sample size. One may hope to fit the dataset $S$ by a parametric probabilistic model $p(z|\theta)$, where $\theta\in\Theta$ is the parameter.

Bayesian inference [69, 7, 22] infers the posterior of the probabilistic model

p(\theta|S)=\frac{p(\theta)\prod_{i=1}^{n}p(z_{i}|\theta)}{p(S)},

where $p(\theta)$ is the prior of the model parameter $\theta$ and $p(S)=\int p(\theta)\prod_{i=1}^{n}p(z_{i}|\theta)\mathrm{d}\theta$ is the normalization factor. In most cases, we do not know the closed form of the normalization factor $p(S)$. Two canonical solutions are variational inference and MCMC.

Variational inference [31, 7] employs two steps to infer the posterior:

  1. Define a family of distributions, $\mathcal{Q}=\{q_{\lambda}|\lambda\in\Lambda\}$, termed the variational family, where $\lambda$ is the variational distribution parameter.

  2. Search in the variational family $\mathcal{Q}$ for the distribution closest to the posterior $p(\theta|S)$, as measured by the KL divergence, i.e., $\min_{\lambda}\mathrm{KL}(q_{\lambda}(\theta)\|p(\theta|S))$.

Minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO [7]):

\text{ELBO}(\lambda,S)=\mathbb{E}_{q_{\lambda}}\log p(\theta,S)-\mathbb{E}_{q_{\lambda}}\log q_{\lambda}(\theta).
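
To see this equivalence, expand the KL divergence and note that $\log p(S)$ is a constant in $\lambda$ (a one-line derivation added for completeness):

\mathrm{KL}(q_{\lambda}(\theta)\|p(\theta|S))=\mathbb{E}_{q_{\lambda}}\log q_{\lambda}(\theta)-\mathbb{E}_{q_{\lambda}}\log p(\theta,S)+\log p(S)=\log p(S)-\text{ELBO}(\lambda,S),

so minimizing the KL divergence over $\lambda$ is the same as maximizing the ELBO.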

A popular variational family is the mean-field Gaussian family [7, 36, 10, 37]. A mean-field Gaussian variational distribution $q_{\lambda}$ has the following structure,

q_{\lambda}=\mathcal{N}(\mu,\sigma^{2}I),

where $\lambda=(\mu_{1},\cdots,\mu_{d},\sigma_{1},\cdots,\sigma_{d})$.

MCMC [45, 26] constructs a Markov chain whose stationary distribution is the targeted posterior. The Markov chain performs a Monte Carlo procedure to sample from the posterior. However, MCMC can be prohibitively time-consuming for large-scale models and large-scale data. To address the issue, stochastic gradient MCMC (SGMCMC [45]) introduces stochastic gradient estimation on mini-batches [59] to MCMC. In this paper, we study two major SGMCMC methods, stochastic gradient Langevin dynamics (SGLD) [69] and stochastic gradient Hamiltonian Monte Carlo (SGHMC) [13].

SGLD updates the weight $\theta_{t}$ as follows,

\theta_{t+1}=\theta_{t}-\varepsilon_{t}\nabla_{\theta}\widetilde{U}(\theta_{t})+\mathcal{N}(0,2\varepsilon_{t}),
\widetilde{U}(\theta)=-\frac{|S|}{|\widetilde{S}|}\sum_{z\in\widetilde{S}}\log p(z|\theta)-\log p(\theta),

where $\varepsilon_{t}\in\mathbb{R}^{+}$ is the learning rate and $\widetilde{S}\subset S$ is the mini-batch. Usually, one needs to anneal $\varepsilon_{t}$ to small values.
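
A single SGLD step is short to implement; below is a PyTorch sketch with illustrative names, where grad_u is assumed to be $\nabla_{\theta}\widetilde{U}(\theta_{t})$ estimated on a mini-batch and $\mathcal{N}(0,2\varepsilon_{t})$ denotes Gaussian noise with variance $2\varepsilon_{t}$:

    import torch

    def sgld_step(theta, grad_u, eps_t):
        # theta_{t+1} = theta_t - eps_t * grad_u + N(0, 2 * eps_t)
        noise = torch.randn_like(theta) * (2.0 * eps_t) ** 0.5
        return theta - eps_t * grad_u + noise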

SGHMC introduces a momentum $v_{t}$ into the weight update:

\theta_{t+1}=\theta_{t}+v_{t},
v_{t+1}=(1-\alpha_{t})v_{t}-\eta_{t}\nabla\widetilde{U}(\theta_{t})+\mathcal{N}(0,2\alpha_{t}\eta_{t}),

where $\alpha_{t}\propto\sqrt{\eta_{t}}$ (see eqs. (14) and (15) in [13]), and the momentum is initialized as $v_{0}\sim\mathcal{N}(0,\eta_{0})$. Moreover, when $\eta_{t}$ is decayed by a factor of $c$, $\alpha_{t}$ needs to be decayed by a factor of $\sqrt{c}$.
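
Analogously, one SGHMC step under the update above can be sketched as follows (same conventions and caveats as the SGLD sketch; grad_u is the stochastic gradient evaluated at $\theta_{t}$):

    import torch

    def sghmc_step(theta, v, grad_u, alpha_t, eta_t):
        # theta_{t+1} = theta_t + v_t
        # v_{t+1} = (1 - alpha_t) * v_t - eta_t * grad_u + N(0, 2 * alpha_t * eta_t)
        theta_next = theta + v
        noise = torch.randn_like(v) * (2.0 * alpha_t * eta_t) ** 0.5
        v_next = (1.0 - alpha_t) * v - eta_t * grad_u + noise
        return theta_next, v_next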

4 Bayesian Inference Forgetting

This section presents our Bayesian inference forgetting framework. We first define $\varepsilon$-certified knowledge removal in Bayesian inference, which quantifies the knowledge removal performance. Then, we study the energy functions in Bayesian inference. The forgetting algorithm for Bayesian inference is designed based on the energy functions.

4.1 $\varepsilon$-Certified Knowledge Removal

Suppose a Bayesian inference method learns a probabilistic model $\hat{p}_{S}$ on the sample set $S$. A client requests to remove her/his data $S^{\prime}\subset S$. A trivial and costly approach is re-training the model on the data $S-S^{\prime}$. Suppose the re-learned probabilistic model is $\hat{p}_{S-S^{\prime}}$.

A forgetting algorithm $\mathcal{A}$ is designed to process the distribution $\hat{p}_{S}$ as follows to estimate the distribution $\hat{p}_{S-S^{\prime}}$,

\hat{p}^{-S^{\prime}}_{S}=\mathcal{A}(\hat{p}_{S},S^{\prime}).

We term the distribution $\hat{p}^{-S^{\prime}}_{S}$ the processed model.

In order to meet the regulation requirements, the algorithm needs to achieve $\varepsilon$-certified knowledge removal, defined as follows.

Definition 4.1 ($\varepsilon$-certified knowledge removal in Bayesian inference).

For any subset $S^{\prime}\subset S$ and $\varepsilon>0$, we say that $\mathcal{A}$ performs $\varepsilon$-certified knowledge removal if

\mathrm{KL}(\hat{p}^{-S^{\prime}}_{S}\|\hat{p}_{S-S^{\prime}})\leq\varepsilon.

4.2 Energy Functions in Bayesian Inference

Bayesian inference can be formulated as minimizing an energy function, which echoes the free-energy principle in physics. This section studies the energy functions in Bayesian inference.

Suppose $F(\gamma,S)$ is the energy function of a probabilistic distribution $\chi(\gamma)$ parametrized by $\gamma\in\Gamma\subset\mathbb{R}^{K}$ over the model parameter space $\Theta$. A typical energy function $F(\gamma,S)$ has the following structure,

F(\gamma,S)=\sum_{i=1}^{n}h(\gamma,z_{i})+f(\gamma), (1)

where $h(\gamma,z)$ is a function defined on $\Gamma\times\mathcal{Z}$ that characterizes the influence of individual datums, and $f(\gamma)$ is a function defined on $\Gamma$ that characterizes the influence of the prior of $\gamma$. Usually, a lower energy corresponds to a smaller distance between the two distributions. In variational inference, the ELBO serves as the energy function. In MCMC, we construct an energy function. Please see Sections 5.1 and 5.2 for more details.

Similarly, the energy function for $S-S^{\prime}$ is $F(\gamma,S-S^{\prime})$. It can be re-arranged as follows,

F(\gamma,S-S^{\prime})=F(\gamma,S)-\sum_{z\in S^{\prime}}h(\gamma,z). (2)

4.3 Forgetting Algorithm for Bayesian Inference

Based on the energy function, the forgetting algorithm for Bayesian inference is designed as shown in fig. 1.

Figure 1: The workflow of the Bayesian inference forgetting framework that removes the influence of a datum $z_{j}\in S$ from the distribution $\hat{p}_{S}$. $\hat{p}_{S}$ is the distribution learned on the training sample set $S$ and is parameterized by $\gamma_{S}$. $\hat{p}^{-z_{j}}_{S}$ is the processed distribution and is parameterized by $\gamma^{-z_{j}}_{S}$, where $\gamma^{-z_{j}}_{S}=\gamma_{S}-\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(0)}{\partial\tau}$.

Suppose that the probabilistic model $\hat{p}_{S}$ learned on the training sample set $S$ is parameterized by $\gamma_{S}$. Then, $\gamma_{S}$ is a local minimizer of the energy function $F(\gamma,S)$. Similarly, suppose that the model $\hat{p}_{S-S^{\prime}}$ learned on $S-S^{\prime}$ is parameterized by $\gamma_{S-S^{\prime}}$. Then $\gamma_{S-S^{\prime}}$ is a local minimizer of the energy function $F(\gamma,S-S^{\prime})$. Also, we assume the processed model $\hat{p}^{-S^{\prime}}_{S}$ is parameterized by $\gamma^{-S^{\prime}}_{S}$.

Mathematically, forgetting in Bayesian inference can be realized by moving the local minimizer $\gamma_{S}$ of the energy function $F(\gamma,S)$ towards the local minimizer $\gamma_{S-S^{\prime}}$ of the energy function $F(\gamma,S-S^{\prime})$.

We start with a simple case that removes the influence learned from a single datum $z_{j}\in S$. The energy function for the posterior $p(\theta|S-\{z_{j}\})$ is as follows,

F(\gamma,S-\{z_{j}\})=\sum_{i=1}^{n}h(\gamma,z_{i})+f(\gamma)-h(\gamma,z_{j}). (3)

Let $\gamma_{S-\{z_{j}\}}$ be a local minimizer of $F(\gamma,S-\{z_{j}\})$. Then,

\nabla_{\gamma}F(\gamma_{S},S)=\sum_{i=1}^{n}\nabla_{\gamma}h(\gamma_{S},z_{i})+\nabla_{\gamma}f(\gamma_{S})=0,
\nabla_{\gamma}F(\gamma_{S-\{z_{j}\}},S-\{z_{j}\})=\nabla_{\gamma}F(\gamma_{S-\{z_{j}\}},S)-\nabla_{\gamma}h(\gamma_{S-\{z_{j}\}},z_{j})=0.

Notice that the structures of the above two equations only differ in the term $\nabla_{\gamma}h(\gamma,z_{j})$. They are two special cases of the following equation,

\nabla_{\gamma}F(\gamma,S)+\tau\cdot\nabla_{\gamma}h(\gamma,z_{j})=0, (4)

where $\tau\in[-1,0]$. Eq. (4) induces an implicit mapping $\hat{\gamma}^{-z_{j}}_{S}:[-1,0]\to\Gamma$ such that $\hat{\gamma}^{-z_{j}}_{S}(0)=\gamma_{S}$. For any $\tau\in[-1,0]$, $\hat{\gamma}^{-z_{j}}_{S}(\tau)$ is a solution of eq. (4). Thus, $\hat{\gamma}^{-z_{j}}_{S}(\tau)$ is also a critical point of the following function,

F_{-z_{j},\tau}(\gamma,S)=F(\gamma,S)+\tau\cdot h(\gamma,z_{j}). (5)

When $\tau=-1$, $F_{-z_{j},-1}(\gamma,S)$ is exactly $F(\gamma,S-\{z_{j}\})$, which indicates that $\hat{\gamma}^{-z_{j}}_{S}(-1)$ is a critical point of $F(\gamma,S-\{z_{j}\})$ and possibly a local minimizer. We assume that $\hat{\gamma}^{-z_{j}}_{S}(-1)$ is indeed a local minimizer of $F(\gamma,S-\{z_{j}\})$, and we let $\gamma_{S-\{z_{j}\}}=\hat{\gamma}^{-z_{j}}_{S}(-1)$. Then, $\gamma_{S-\{z_{j}\}}$ can be approached from $\gamma_{S}$ in a first-order approximation manner,

\gamma_{S-\{z_{j}\}}=\hat{\gamma}^{-z_{j}}_{S}(-1)\approx\hat{\gamma}^{-z_{j}}_{S}(0)-\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(0)}{\partial\tau}=\gamma_{S}-\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(0)}{\partial\tau}.

Thus, when a request to remove a datum $z_{j}$ comes, our forgetting algorithm processes the request by replacing $\gamma_{S}$ with $\gamma^{-z_{j}}_{S}$, where

\gamma^{-z_{j}}_{S}:=\gamma_{S}-\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(0)}{\partial\tau}.

4.4 Theoretical Guarantee

Under some mild assumptions, we prove that (1) the mapping $\hat{\gamma}^{-z_{j}}_{S}$ uniquely exists; (2) $\gamma_{S-\{z_{j}\}}=\hat{\gamma}^{-z_{j}}_{S}(-1)$ is the global minimizer of $F(\gamma,S-\{z_{j}\})$; and (3) the approximation error between $\gamma^{-z_{j}}_{S}$ and $\gamma_{S-\{z_{j}\}}$ is not larger than the order $O(1/n^{2})$, where $n$ is the training set size.

We first introduce two definitions.

Definition 4.2 (compact set).

A set in a Euclidean space is called compact if and only if it is closed and bounded.

Definition 4.3 (strong convexity).

A differentiable function $f$ is called $c$-strongly convex if and only if there exists a real constant $c>0$ such that for any two points $z$ and $z^{\prime}$ in the domain of $f$, the following inequality holds,

(\nabla f(z)-\nabla f(z^{\prime}))^{T}(z-z^{\prime})\geq\frac{c}{2}\|z-z^{\prime}\|^{2}_{2}.

When $f$ is second-order continuously differentiable, for any $c>0$, the following three propositions are equivalent:

  1. $f$ is $c$-strongly convex;

  2. for any $z$ from the domain of $f$, the smallest eigenvalue of the Hessian matrix $\nabla^{2}f(z)$ is at least $c$, i.e., $\lambda_{\min}\left(\nabla^{2}f(z)\right)\geq c$;

  3. for any $z$ from the domain of $f$, the matrix $(\nabla^{2}f(z)-cI)$ is positive semi-definite, i.e., $\nabla^{2}f(z)\succeq cI$.

The theoretical guarantees rely on the following two assumptions.

Assumption 4.1.

Both of the supports $\Gamma$ and $\mathcal{Z}$ are compact.

Usually, the energy functions $F(\gamma,S)$ and $F(\gamma,S-\{z_{j}\})$ are close to each other. Thus, it is reasonable to assume that the two local minimizers $\gamma_{S}$ and $\gamma_{S-\{z_{j}\}}$ fall in the same local region. Hence, we only need to analyze a limited local region, which justifies Assumption 4.1.

Assumption 4.2.

Suppose $f(\gamma)$ and $h(\gamma,z)$ are the two components of the energy function $F(\gamma,S)$ (eq. (1)). We assume that

  1. $f(\gamma)$ is 3rd-order continuously differentiable and $c_{f}$-strongly convex on $\Gamma$;

  2. $h(\gamma,z)$ is 3rd-order continuously differentiable with respect to $\gamma$ on $\Gamma\times\mathcal{Z}$. Besides, for every $z\in\mathcal{Z}$, $h(\gamma,z)$ is $c_{h}(z)$-strongly convex with respect to $\gamma$, where $c_{h}:\mathcal{Z}\to\mathbb{R}$ is a continuous function.

Assumption 4.2 is mild in gradient-based optimization. Together with Assumption 4.1, it amounts to assuming local strong convexity and smoothness of the energy functions.

We then prove that the energy function $F_{-z_{j},\tau}$ is strongly convex, as stated in the following lemma.

Lemma 4.1.

Suppose Assumption 4.2 holds. For any $\tau\in[-1,0]$, $F_{-z_{j},\tau}(\gamma,S)$ (eq. (5)) is strongly convex with respect to the parameter $\gamma$.

Proof.

We rewrite the energy function $F_{-z_{j},\tau}$ as follows,

F_{-z_{j},\tau}(\gamma,S)=F(\gamma,S-\{z_{j}\})+(1+\tau)h(\gamma,z_{j}).

By Assumption 4.2, $F(\gamma,S-\{z_{j}\})$ is strongly convex. Since $\tau\in[-1,0]$, we have $1+\tau\geq 0$. Thus, $(1+\tau)h(\gamma,z_{j})$ is either strongly convex or identically zero. Hence, $F_{-z_{j},\tau}(\gamma,S)$ is strongly convex with respect to $\gamma$.

The proof is completed. ∎

We are now able to prove that eq. (4) indeed induces a continuous mapping $\hat{\gamma}^{-z_{j}}_{S}$, and that $\hat{\gamma}^{-z_{j}}_{S}(-1)$ is the global minimizer of $F(\gamma,S-\{z_{j}\})$.

Theorem 4.1.

Suppose Assumption 4.2 holds. Then, eq. (4) induces a unique continuous function $\hat{\gamma}^{-z_{j}}_{S}:[-1,0]\to\Gamma$ such that for any $\tau\in[-1,0]$, $\hat{\gamma}^{-z_{j}}_{S}(\tau)$ is the solution of eq. (4) and the global minimizer of eq. (5) with respect to $\gamma$.

Proof.

According to Lemma 4.1, for any $\tau\in[-1,0]$, the energy function $F_{-z_{j},\tau}(\gamma,S)$ is strongly convex with respect to $\gamma$. Thus, its global minimizer with respect to $\gamma$ is unique. Therefore, we can define the mapping $\hat{\gamma}^{-z_{j}}_{S}$ as follows,

\hat{\gamma}^{-z_{j}}_{S}(\tau)=\arg\min_{\gamma}F_{-z_{j},\tau}(\gamma,S),

where $\tau\in[-1,0]$.

Besides, the strong convexity of $F_{-z_{j},\tau}(\gamma,S)$ indicates that when $\tau$ is fixed, a parameter $\gamma^{*}\in\Gamma$ satisfies $\nabla_{\gamma}F_{-z_{j},\tau}(\gamma^{*},S)=0$ (eq. (4)) if and only if $\gamma^{*}$ is the global minimizer of $F_{-z_{j},\tau}(\gamma,S)$. Thus, for all $\tau\in[-1,0]$, $\hat{\gamma}^{-z_{j}}_{S}(\tau)$ is the only solution of eq. (4).

Eventually, according to Definition 4.3, the strong convexity of $F_{-z_{j},\tau}(\gamma,S)$ implies that the Hessian matrix $\nabla^{2}_{\gamma}F_{-z_{j},\tau}(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)$ is invertible. Combining this with the implicit function theorem, we have that $\hat{\gamma}^{-z_{j}}_{S}(\tau)$ is continuous everywhere on $[-1,0]$.

The proof is completed. ∎

Theorem 4.1 demonstrates that our algorithm can obtain $\hat{\gamma}^{-z_{j}}_{S}(-1)$. This secures that our algorithm can remove the learned influence from the probabilistic model $\hat{p}_{S}$.

We then analyze the error introduced by our algorithm. Assuming that $\hat{\gamma}^{-z_{j}}_{S}$ is second-order continuously differentiable, $\hat{\gamma}^{-z_{j}}_{S}(-1)$ can be expanded via the Taylor series:

\hat{\gamma}^{-z_{j}}_{S}(-1)=\hat{\gamma}^{-z_{j}}_{S}(0)+(-1)\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(0)}{\partial\tau}+\frac{(-1)^{2}}{2}\cdot\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\xi)}{\partial\tau^{2}}=\hat{\gamma}^{-z_{j}}_{S}(0)-\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(0)}{\partial\tau}+\frac{1}{2}\cdot\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\xi)}{\partial\tau^{2}},

where $\xi\in[-1,0]$ and $\frac{1}{2}\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\xi)}{\partial\tau^{2}}$ is the Lagrange form of the remainder. We prove that this remainder term becomes negligible as the training set size $n$ goes to infinity.

Theorem 4.2.

Suppose Assumptions 4.1 and 4.2 hold. Let the induced mapping $\hat{\gamma}^{-z_{j}}_{S}$ be as defined in Theorem 4.1. Then, for any $\tau\in[-1,0]$, we have that

\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}=-\left(\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\tau\cdot\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right)^{-1}\cdot\nabla_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})^{T},

and

\left\|\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}\leq O\left(\frac{1}{n}\right).

The proof of Theorem 4.2 is presented in Appendix A. We first calculate $\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}$ based on eq. (4). We then upper bound its norm by upper bounding the norm of the inverse matrix

\left(\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\tau\cdot\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right)^{-1}

and the norm of the vector $\nabla_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})$. The norm of the inverse matrix is bounded via the strong convexity of the energy function, while the norm of the vector is bounded via its continuity on a compact domain.

Based on Theorem 4.2, we then prove that the norm of $\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}$ is no larger than the order $O(1/n^{2})$. Thus, a sufficiently large training sample size ensures that the second- and higher-order terms in the Taylor series are negligible.

Theorem 4.3.

Suppose Assumptions 4.1 and 4.2 hold. Let the induced mapping $\hat{\gamma}^{-z_{j}}_{S}$ be as defined in Theorem 4.1. Then, for any $\tau\in[-1,0]$, we have that

\left\|\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}\right\|_{2}\leq O\left(\frac{1}{n^{2}}\right).

The proof routine of Theorem 4.3 is similar to that of Theorem 4.2, though the calculation is more involved. We leave the details to Appendix A.

Corollary 4.1.

Suppose Assumptions 4.1 and 4.2 hold. Suppose $\gamma_{S}$ and $\gamma_{S-\{z_{j}\}}$ are the global minimizers of the energy functions $F(\gamma,S)$ and $F(\gamma,S-\{z_{j}\})$, respectively. Then, we have that

\gamma_{S-\{z_{j}\}}=\gamma_{S}+\nabla^{-2}_{\gamma}F(\gamma_{S},S)\cdot\nabla_{\gamma}h(\gamma_{S},z_{j})^{T}+O\left(\frac{1}{n^{2}}\right).
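
Corollary 4.1 directly yields a one-shot forgetting update; below is a dense-Hessian numpy sketch with illustrative names (Section 5.3 replaces the explicit inverse with Hessian-vector products in practice):

    import numpy as np

    def bif_forget(gamma_S, hess_F, grad_h_zj):
        # First-order removal of z_j (Corollary 4.1):
        # gamma_{S - {z_j}} ~ gamma_S + [grad^2 F(gamma_S, S)]^{-1} grad h(gamma_S, z_j),
        # with approximation error at most O(1/n^2).
        return gamma_S + np.linalg.solve(hess_F, grad_h_zj)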

Eventually, we consider removing a subset $S^{\prime}$ from $S$. Following the previous derivations, we prove that the approximation error of approaching $\gamma_{S-S^{\prime}}$ from $\gamma_{S}$ is also not larger than the order $O(1/n^{2})$, as stated below:

Corollary 4.2.

Suppose Assumptions 4.1 and 4.2 hold. Suppose $\gamma_{S}$ and $\gamma_{S-S^{\prime}}$ are the global minimizers of the energy functions $F(\gamma,S)$ and $F(\gamma,S-S^{\prime})$, respectively. Then, we have that

\gamma_{S-S^{\prime}}=\gamma_{S}+\nabla^{-2}_{\gamma}F(\gamma_{S},S)\cdot\sum_{z_{j}\in S^{\prime}}\nabla_{\gamma}h(\gamma_{S},z_{j})^{T}+O\left(\frac{1}{n^{2}}\right).

5 Forgetting Algorithms for Bayesian Inference

In this section, we develop provably certified forgetting algorithms for variational inference and MCMC under the BIF framework.

5.1 Variational Inference Forgetting

As introduced in Section 3, variational inference aims to minimize the negative ELBO function, which can be expanded as follows,

-\text{ELBO}(\lambda,S)=-\mathbb{E}_{q_{\lambda}}\log p(\theta,S)+\mathbb{E}_{q_{\lambda}}\log q_{\lambda}(\theta)
=-\mathbb{E}_{q_{\lambda}}\log\left(p(\theta)\prod_{i=1}^{n}p(z_{i}|\theta)\right)+\mathbb{E}_{q_{\lambda}}\log q_{\lambda}(\theta)
=\sum_{i=1}^{n}-\mathbb{E}_{q_{\lambda}}\log p(z_{i}|\theta)+\mathrm{KL}(q_{\lambda}(\theta)\|p(\theta)).

Notice that the structure of the above equation is similar to that of the energy function (eq. (1)). Thus, we employ the negative ELBO function as the energy function.

Based on the energy function, we define the following variational inference influence function to characterize the influence of a single datum on the learned model.

Definition 5.1 (Variational inference influence function).

For any example $z_{j}\in S$, its variational inference influence function is defined to be

\mathcal{I}_{\mathrm{VI}}(z_{j}):=-\nabla^{-2}_{\lambda}\mathrm{ELBO}(\hat{\lambda}_{S},S)\cdot\nabla_{\lambda}^{T}\mathbb{E}_{q_{\hat{\lambda}_{S}}}\log p(z_{j}|\theta).

We then design the variational inference forgetting algorithm to remove the influence of a single datum from the learned variational parameter $\hat{\lambda}_{S}$,

\hat{\lambda}^{-z_{j}}_{S}=\mathcal{A}_{\mathrm{VI}}(\hat{\lambda}_{S},z_{j})=\hat{\lambda}_{S}-\mathcal{I}_{\mathrm{VI}}(z_{j}).
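
A minimal dense-Hessian sketch of $\mathcal{A}_{\mathrm{VI}}$ follows (illustrative names; hess_neg_elbo stands for $\nabla^{2}_{\lambda}(-\mathrm{ELBO}(\hat{\lambda}_{S},S))$ and grad_loglik_zj for $\nabla_{\lambda}\mathbb{E}_{q_{\hat{\lambda}_{S}}}\log p(z_{j}|\theta)$, both assumed precomputed):

    import numpy as np

    def vi_forget(lambda_S, hess_neg_elbo, grad_loglik_zj):
        # By Definition 5.1, I_VI(z_j) = [grad^2(-ELBO)]^{-1} grad_lambda E_q[log p(z_j|theta)],
        # and A_VI(lambda_S, z_j) = lambda_S - I_VI(z_j).
        I_vi = np.linalg.solve(hess_neg_elbo, grad_loglik_zj)
        return lambda_S - I_vi

Since the influence of a set is the sum of individual influences (shown below), removing a subset $S^{\prime}$ amounts to summing grad_loglik_zj over $z_{j}\in S^{\prime}$ before the solve.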

We now show that variational inference forgetting provably removes the influence of datums from the learned model.

Theorem 5.1.

Let $\gamma:=\lambda$, $\Gamma:=\Lambda$, $h(\gamma,z):=-\mathbb{E}_{q_{\lambda}}\log p(z|\theta)$, and $f(\gamma):=\mathrm{KL}(q_{\lambda}(\theta)\|p(\theta))$. Suppose Assumptions 4.1 and 4.2 hold. Then, we have that

\hat{\lambda}_{S-\{z_{j}\}}=\hat{\lambda}_{S}-\mathcal{I}_{\mathrm{VI}}(z_{j})+O\left(\frac{1}{n^{2}}\right)\approx\hat{\lambda}_{S}-\mathcal{I}_{\mathrm{VI}}(z_{j}),

where $\hat{\lambda}_{S}$ and $\hat{\lambda}_{S-\{z_{j}\}}$ are the global minimizers of $-\mathrm{ELBO}(\lambda,S)$ and $-\mathrm{ELBO}(\lambda,S-\{z_{j}\})$, respectively.

Proof.

It is straightforward from Corollary 4.1. ∎

Remark 5.1.

When Assumptions 4.1 and 4.2 do not hold, performing variational inference forgetting approaches some critical point of $-\mathrm{ELBO}(\lambda,S-\{z_{j}\})$ based on $\hat{\lambda}_{S}$.

Beyond the removal of a single datum, Corollary 4.2 further shows that the influence of a sample set $S^{\prime}$ can also be characterized by the variational inference influence function $\mathcal{I}_{\mathrm{VI}}(S^{\prime})$, as follows,

\mathcal{I}_{\mathrm{VI}}(S^{\prime})=\sum_{z_{j}\in S^{\prime}}\mathcal{I}_{\mathrm{VI}}(z_{j}).

This guarantees that one can remove a group of datums at one time, which significantly speeds up and simplifies the forgetting of large numbers of datums.

We next study the $\varepsilon$-certified knowledge removal guarantee for variational inference forgetting. Here, we take the mean-field Gaussian family as an example. Proofs for other variational families are similar.

Theorem 5.2 ($\varepsilon$-certified knowledge removal of mean-field Gaussian variational distributions).

Let $\mathcal{Q}$ be a mean-field Gaussian family in which the variance of every variational distribution is bounded, i.e., there exist $M_{1},M_{2}\in\mathbb{R}$ such that for any $q_{\lambda}=\mathcal{N}(\mu,\sigma^{2}I)\in\mathcal{Q}$, $0<M_{1}\leq\sigma_{i}\leq M_{2}$ holds for $i=1,\cdots,d$. Let $\hat{\lambda}_{S}$ and $\hat{\lambda}_{S-\{z_{j}\}}$ be the variational parameters learned on the sample sets $S$ and $S-\{z_{j}\}$, respectively. Suppose $\hat{\lambda}^{-z_{j}}_{S}=\mathcal{A}_{\mathrm{VI}}(\hat{\lambda}_{S},z_{j})$ is the processed variational parameter. Then, $\mathcal{A}_{\mathrm{VI}}(\hat{\lambda}_{S},z_{j})$ performs $\varepsilon_{\hat{\lambda}_{S},z_{j}}$-certified knowledge removal, where

\varepsilon_{\hat{\lambda}_{S},z_{j}}=\frac{1}{2M_{1}^{2}}\left(2(M_{1}+M_{2})\|\hat{\lambda}^{-z_{j}}_{S}-\hat{\lambda}_{S-\{z_{j}\}}\|_{1}+\|\hat{\lambda}^{-z_{j}}_{S}-\hat{\lambda}_{S-\{z_{j}\}}\|_{2}^{2}\right).

The proof of Theorem 5.2 is omitted here and presented in Appendix B. When all the conditions in Theorem 5.1 hold, we have that $\|\hat{\lambda}^{-z_{j}}_{S}-\hat{\lambda}_{S-\{z_{j}\}}\|_{1}\leq O(1/n^{2})$ and $\|\hat{\lambda}^{-z_{j}}_{S}-\hat{\lambda}_{S-\{z_{j}\}}\|^{2}_{2}\leq O(1/n^{4})$. Thus, $\varepsilon_{\hat{\lambda}_{S},z_{j}}\leq O(1/n^{2})$. Therefore, $\mathcal{A}_{\mathrm{VI}}$ performs $O(1/n^{2})$-certified knowledge removal for the mean-field Gaussian variational distribution $q_{\hat{\lambda}_{S}}$.
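
In practice, the $\varepsilon$ in Definition 4.1 can be measured directly for mean-field Gaussians, since the KL divergence between diagonal Gaussians has a closed form (a numpy sketch; the arguments are 1-D arrays of means and standard deviations):

    import numpy as np

    def kl_diag_gaussians(mu1, sig1, mu2, sig2):
        # KL( N(mu1, diag(sig1^2)) || N(mu2, diag(sig2^2)) )
        return 0.5 * np.sum(
            (sig1 / sig2) ** 2
            + ((mu2 - mu1) / sig2) ** 2
            - 1.0
            + 2.0 * np.log(sig2 / sig1)
        )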

5.2 MCMC Forgetting Algorithm

In MCMC, there is no explicit objective function. In this work, we instead transport $p(\theta|S)$ to approximate $p(\theta|S-S^{\prime})$. Specifically, we want to find a “drift” $\hat{\Delta}^{-S^{\prime}}_{S}$, named the drifting influence, that minimizes the following KL divergence,

\hat{\Delta}^{-S^{\prime}}_{S}=\arg\min_{\Delta}\mathrm{KL}(p(\theta|S)\|p(\theta+\Delta|S-S^{\prime})). (6)

The KL term in eq. (6) can be expanded as follows,

\mathrm{KL}(p(\theta|S)\|p(\theta+\Delta|S-S^{\prime}))
=\mathbb{E}_{p(\theta|S)}\log p(\theta|S)-\mathbb{E}_{p(\theta|S)}\log p(\theta+\Delta|S-S^{\prime})
=-\mathbb{E}_{p(\theta|S)}\log\left(\frac{p(\theta+\Delta)\prod_{z\in S}p(z|\theta+\Delta)}{p(S-S^{\prime})\prod_{z\in S^{\prime}}p(z|\theta+\Delta)}\right)+\mathbb{E}_{p(\theta|S)}\log p(\theta|S)
=\sum_{i=1}^{n}-\mathbb{E}_{p(\theta|S)}\log p(z_{i}|\theta+\Delta)-\mathbb{E}_{p(\theta|S)}\log p(\theta+\Delta)-\sum_{z\in S^{\prime}}-\mathbb{E}_{p(\theta|S)}\log p(z|\theta+\Delta)+\mathrm{Constant}. (7)

Apart from the constant term, the remainder has a structure similar to that of $F(\gamma,S-S^{\prime})$ (eq. (2)). Thus, we define the energy function for MCMC by replacing $h(\gamma,z)$ and $f(\gamma)$ with $-\mathbb{E}_{p(\theta|S)}\log p(z|\theta+\Delta)$ and $-\mathbb{E}_{p(\theta|S)}\log p(\theta+\Delta)$, respectively.

Based on the energy function, we define an MCMC influence function to characterize the drifting influence induced by a single datum, as follows.

Definition 5.2 (MCMC influence function).

For any example $z_{j}\in S$, its MCMC influence function is defined to be

\mathcal{I}_{\mathrm{MCMC}}(z_{j}):=-\left(\mathbb{E}_{p(\theta|S)}\nabla_{\theta}^{2}\log p(\theta,S)\right)^{-1}\left(\mathbb{E}_{p(\theta|S)}\nabla_{\theta}\log p(z_{j}|\theta)\right)^{T}. (8)

Then, the MCMC forgetting algorithm is as follows,

p^{-z_{j}}_{S}(\theta)=\mathcal{A}_{\mathrm{MCMC}}(p(\theta|S),z_{j})=p(\theta+\mathcal{I}_{\mathrm{MCMC}}(z_{j})|S).

In practice, we do not have the posterior $p(\theta|S)$ itself, but only samples drawn from it. Thus, we cannot directly transform the posterior $p(\theta|S)$ into the processed distribution $p^{-z_{j}}_{S}(\theta)$. However, the probability of drawing a sample $\theta_{t}$ from $p(\theta|S)$ equals that of drawing the sample $\theta_{t}-\mathcal{I}_{\mathrm{MCMC}}(z_{j})$ from $p^{-z_{j}}_{S}(\theta)$. Therefore, when performing MCMC forgetting, we replace every drawn sample $\theta_{t}$ by the new sample $\theta_{t}-\mathcal{I}_{\mathrm{MCMC}}(z_{j})$.
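
In code, the forgetting step thus reduces to shifting the stored posterior samples (a sketch; samples is an array of parameter samples drawn by (SG)MCMC and I_mcmc a precomputed influence vector):

    import numpy as np

    def mcmc_forget(samples, I_mcmc):
        # Shift every theta_t ~ p(theta|S) to theta_t - I_MCMC(z_j); the shifted
        # samples are distributed as the processed model p_S^{-z_j}.
        return np.asarray(samples) - I_mcmc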

We prove that the MCMC forgetting algorithm can provably remove the drifting influence induced by a single datum from the learned model.

Theorem 5.3.

Let $\gamma:=\Delta$, $\Gamma:=\mathrm{supp}(\Delta)$, $h(\gamma,z):=-\mathbb{E}_{p(\theta|S)}\log p(z|\theta+\Delta)$, and $f(\gamma):=-\mathbb{E}_{p(\theta|S)}\log p(\theta+\Delta)$. Suppose that Assumptions 4.1 and 4.2 hold. Then, we have that

\hat{\Delta}^{-z_{j}}_{S}=-\mathcal{I}_{\mathrm{MCMC}}(z_{j})+O\left(\frac{1}{n^{2}}\right)\approx-\mathcal{I}_{\mathrm{MCMC}}(z_{j}),

where $\hat{\Delta}^{-z_{j}}_{S}$ is the global minimizer of $\mathrm{KL}(p(\theta|S)\|p(\theta+\Delta|S-\{z_{j}\}))$.

Remark 5.2.

When Assumptions 4.1 and 4.2 do not hold, performing MCMC forgetting can be seen as finding a critical point $\hat{\Delta}^{-z_{j}}_{S}$ of $\mathrm{KL}(p(\theta-\Delta|S)\|p(\theta|S-\{z_{j}\}))$.

Proof.

When all the conditions hold, the global minimizer of $\mathrm{KL}(p(\theta|S)\|p(\theta+\Delta|S))$ is exactly $0$. Meanwhile, $\hat{\Delta}^{-z_{j}}_{S}$ is the global minimizer of $\mathrm{KL}(p(\theta|S)\|p(\theta+\Delta|S-\{z_{j}\}))$. Combining Corollary 4.1, we have that

\hat{\Delta}^{-z_{j}}_{S}=\hat{\Delta}^{-z_{j}}_{S}-0
=\left(\nabla^{2}_{\Delta}\left[-\mathbb{E}_{p(\theta|S)}\log p(\theta+\Delta,S)\right]_{\Delta=0}\right)^{-1}\cdot\left(\nabla_{\Delta}\left[-\mathbb{E}_{p(\theta|S)}\log p(z_{j}|\theta+\Delta)\right]_{\Delta=0}\right)^{T}+O\left(\frac{1}{n^{2}}\right)
=\left(\mathbb{E}_{p(\theta|S)}\nabla^{2}_{\theta}\log p(\theta,S)\right)^{-1}\cdot\left(\mathbb{E}_{p(\theta|S)}\nabla_{\theta}\log p(z_{j}|\theta)\right)^{T}+O\left(\frac{1}{n^{2}}\right)
=-\mathcal{I}_{\mathrm{MCMC}}(z_{j})+O\left(\frac{1}{n^{2}}\right)
\approx-\mathcal{I}_{\mathrm{MCMC}}(z_{j}).

The proof is completed. ∎

Similarly, by applying Corollary 4.2, the MCMC influence function of a sample set $S^{\prime}$ is as follows,

\mathcal{I}_{\mathrm{MCMC}}(S^{\prime})=\sum_{z_{j}\in S^{\prime}}\mathcal{I}_{\mathrm{MCMC}}(z_{j}).

This guarantees that one can adopt the MCMC forgetting algorithm to remove a group of datums at one time, which improves the efficiency of removing large amounts of datums.

The MCMC forgetting algorithm also applies to SGMCMC, because SGMCMC draws samples from the same target posterior $p(\theta|S)$ as MCMC [13, 69, 45].

We then give the $\varepsilon$-certified knowledge removal guarantee for MCMC forgetting.

Theorem 5.4.

Suppose that

p(\theta|S)=\mathcal{N}(\theta_{1},(nJ(\theta_{1}))^{-1}),
p(\theta|S-\{z_{j}\})=\mathcal{N}(\theta_{2},((n-1)J(\theta_{2}))^{-1}).

Let $p^{-z_{j}}_{S}(\theta)=\mathcal{A}_{\mathrm{MCMC}}(p(\theta|S),z_{j})=p(\theta+\mathcal{I}_{\mathrm{MCMC}}(z_{j})|S)$ be the processed model. Then, $\mathcal{A}_{\mathrm{MCMC}}(p(\theta|S),z_{j})$ performs $O(\varepsilon_{\theta_{1},z_{j}})$-certified knowledge removal, where

\varepsilon_{\theta_{1},z_{j}}=(n-1)(\theta_{1}^{\prime}-\theta_{2})^{T}J(\theta_{2})(\theta_{1}^{\prime}-\theta_{2})+\mathrm{tr}\left(J^{-1}(\theta_{1})(J(\theta_{2})-J(\theta_{1}))\right)+\log\frac{|J(\theta_{1})|}{|J(\theta_{2})|}

and $\theta_{1}^{\prime}=\theta_{1}-\mathcal{I}_{\mathrm{MCMC}}(z_{j})$.

Theorem 5.4 assumes that the posteriors $p(\theta|S)$ and $p(\theta|S-\{z_{j}\})$ are Gaussian. This assumption comes from Bayesian asymptotic theory [21, 42]. Suppose the training set $S$ is drawn from a distribution $f_{1}(z)$, while $S-\{z_{j}\}$ is drawn from a distribution $f_{2}(z)$. Then, under some mild assumptions, when the training sample size $n$ is sufficiently large, the posteriors $p(\theta|S)$ and $p(\theta|S-\{z_{j}\})$ approach the following Gaussian distributions,

p(\theta|S)\sim\mathcal{N}(\theta_{1},(nJ(\theta_{1}))^{-1}),
p(\theta|S-\{z_{j}\})\sim\mathcal{N}(\theta_{2},((n-1)J(\theta_{2}))^{-1}),

where $\theta_{i}=\arg\min_{\theta}\mathrm{KL}(f_{i}(z)\|p(z|\theta))$ for $i\in\{1,2\}$, and $J(\theta)=\mathbb{E}_{p(z|\theta)}\left[-\frac{\partial^{2}}{\partial\theta^{2}}\log p(z|\theta)\right]$ is the Fisher information.

The detailed proof of Theorem 5.4 is given in Appendix B.

Combining the conditions of Theorem 5.3, we have that $\|\theta^{\prime}_{1}-\theta_{2}\|_{2}\leq O(1/n^{2})$. Thus, as $n\to\infty$, $\varepsilon_{\theta_{1},z_{j}}$ eventually converges to $\mathrm{tr}\left(J^{-1}(\theta_{1})(J(\theta_{2})-J(\theta_{1}))\right)+\log\frac{|J(\theta_{1})|}{|J(\theta_{2})|}$.

5.3 Efficient Implementation

A major computational burden in both the variational inference forgetting and MCMC forgetting algorithms is calculating products of the form $H^{-1}v$, where $H$ is the Hessian matrix of some function $f(x)$ and $v$ is a constant vector. This calculation can have a considerably high computational cost. We follow Agarwal et al. [2] and Koh and Liang [38] and apply a divide-and-conquer strategy to address the issue. This strategy relies on calculating the Hessian-vector product $Hv$.

Hessian-vector product (HVP). We first discuss how to efficiently calculate $Hv$. The calculation of $Hv$ can be decomposed into two steps: (1) calculate $\frac{\partial f(x)}{\partial x}$, and then (2) calculate $\frac{\partial}{\partial x}\left(\frac{\partial f(x)}{\partial x}\cdot v\right)$. It is worth noting that $\frac{\partial f(x)}{\partial x}\in\mathbb{R}^{1\times d}$ and $v\in\mathbb{R}^{d\times 1}$, where $d>0$ is the dimension of $x$. Thus, $\frac{\partial f(x)}{\partial x}\cdot v$ is a scalar, and calculating its gradient $\frac{\partial}{\partial x}\left(\frac{\partial f(x)}{\partial x}\cdot v\right)$ has a very low computational cost on platforms such as PyTorch [56] or TensorFlow [1].
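
A double-backward HVP can be sketched in PyTorch as follows (params is a single tensor with requires_grad=True and loss a scalar computed from it; names are illustrative):

    import torch

    def hvp(loss, params, v):
        # Hessian-vector product H v: differentiate the scalar
        # (d loss / d params) . v once more with respect to params.
        grad = torch.autograd.grad(loss, params, create_graph=True)[0]
        return torch.autograd.grad((grad * v).sum(), params)[0]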

Calculating $H^{-1}v$. When the norm $\|H\|\leq 1$, the matrix $H^{-1}$ can be expanded via the Taylor series as $H^{-1}=\sum_{i=0}^{\infty}(I-H)^{i}$. Define $H_{j}^{-1}=\sum_{i=0}^{j}(I-H)^{i}$. Then, we have the following recursive equation,

H_{j}^{-1}v=v+(I-H)H_{j-1}^{-1}v.

Agarwal et al. [2] prove that as $j\rightarrow\infty$, $\mathbb{E}[H_{j}^{-1}]\rightarrow H^{-1}$. Therefore, we employ $H^{-1}_{j}v$ to approximate $H^{-1}v$.

Moreover, to ensure that the condition $\|H\|\leq 1$ holds, we scale $H$ to $cH$ by a factor $c\in\mathbb{R}^{+}$ such that $\|cH\|\leq 1$, approximate $(cH)^{-1}$, and finally use $H^{-1}=c(cH)^{-1}$, which can be plugged into the equations above.
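
Putting the pieces together, the truncated, rescaled recursion can be sketched as follows (hvp_fn(x) returns $Hx$, e.g. via the HVP sketch above; c is assumed to satisfy $\|cH\|\leq 1$):

    def inverse_hvp(hvp_fn, v, c, iters=100):
        # Run the recursion (cH)_j^{-1} v = v + (I - cH) (cH)_{j-1}^{-1} v,
        # then rescale by c, since H^{-1} = c (cH)^{-1}.
        x = v.clone()
        for _ in range(iters):
            x = v + x - c * hvp_fn(x)
        return c * x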

6 Generalization Analysis

In this section, we study the generalization ability of models processed by the forgetting algorithms. Based on the PAC-Bayesian framework, we derive generalization bounds for a mean-field Gaussian variational model and a specified Gaussian MCMC model. All the proofs in this section are presented in Appendix C.

Generalization ability, i.e., the ability to make accurate predictions on unseen data, is important to machine learning algorithms. A standard measurement of generalization ability is the generalization bound, i.e., the upper bound of the difference between the expected risk and the empirical risk [51, 68, 50]. An algorithm with a small generalization bound is expected to generalize well. Existing generalization bounds can be roughly divided into four categories: (1) generalization bounds based on hypothesis complexity, such as the VC dimension [9, 67], Rademacher complexity [41, 40, 4], and covering number [17, 27], which suggest controlling the hypothesis complexity to help models generalize well; (2) generalization bounds based on algorithmic stability [60, 12], which follow the intuition that an algorithm with good generalization ability is robust to the interference of single data points; (3) generalization bounds established under the PAC-Bayesian framework [48, 47]; and (4) generalization guarantees from differential privacy [18, 54, 29]. The excellent generalization ability of over-parameterized models, including Bayesian neural networks and other deep learning models, is somehow beyond the explanation of conventional learning theory, and establishing theoretical foundations for it has attracted wide attention [19, 28].

Let $Q$ be a probabilistic model. Suppose $S$ is the training sample set. Then, the expected risk $\mathcal{R}(Q)$ and the empirical risk $\hat{\mathcal{R}}(Q,S)$ of $Q$ are defined to be

\mathcal{R}(Q)=\mathbb{E}_{h\sim Q}\mathbb{E}_{z}\ell(h,z),\quad\hat{\mathcal{R}}(Q,S)=\mathbb{E}_{h\sim Q}\frac{1}{n}\sum_{i=1}^{n}\ell(h,z_{i}),

where $h$ is a hypothesis drawn from $Q$, and $\ell$ is a loss function ranging in $[0,1]$. The difference between the expected risk and the empirical risk is the generalization error, whose magnitude characterizes the generalizability of the algorithm.

We then prove a generalization bound for mean-field Gaussian variational distributions.

Theorem 6.1.

Suppose all the conditions in Theorems 5.1 and 5.2 hold. Let $q_{\lambda}=\mathcal{N}(\mu,\sigma^{2}I)$ denote the mean-field Gaussian distribution learned on the training set $S$. Let $\lambda^{-}=\mathcal{A}_{\mathrm{VI}}(\lambda,z_{j})$ denote the processed variational distribution parameter, where $z_{j}\in S$. Also, let $\Delta_{\lambda}=\mathcal{I}_{\mathrm{VI}}(z_{j})=(\Delta_{\mu},\Delta_{\sigma})$ denote the variational inference influence function. Then, for any real $\delta\in(0,1)$, with probability at least $1-\delta$, the following inequality holds:

\mathcal{R}(q_{\lambda^{-}})\leq\hat{\mathcal{R}}(q_{\lambda^{-}},S)+\sqrt{\frac{C_{\lambda,\Delta_{\lambda}}+2\log\frac{1}{\delta}+2\log n-d+4}{4n-2}}, (9)

where

C_{\lambda,\Delta_{\lambda}}=\|\Delta_{\lambda}\|^{2}+2\|\lambda\|\cdot\|\Delta_{\lambda}\|+\|\lambda\|^{2}-2\sum_{k=1}^{d}\log\left(\sigma_{k}-\Delta_{\sigma_{k}}\right)\leq O(1),

and $\|\Delta_{\lambda}\|\leq O\left(\frac{1}{n}\right)$.

Remark 6.1.

This generalization bound is of order $O(\sqrt{(\log n)/n})$.

Corollary 6.1.

Variational inference forgetting increases the generalization bound by a value not larger than $O(1/n)$.

Proof.

Variational inference forgetting introduces the term $\|\Delta_{\lambda}\|$ into the generalization bound. Thus, the generalization bound is increased by the following value:

O\left(\sqrt{\frac{\|\Delta_{\lambda}\|^{2}+2\|\lambda\|\cdot\|\Delta_{\lambda}\|-2\sum_{k=1}^{d}\log\left(\frac{\sigma_{k}-\Delta_{\sigma_{k}}}{\sigma_{k}}\right)}{4n-2}}\right)\leq O\left(\sqrt{\frac{O(1/n^{2})+O(1/n)+O(1/n)}{4n-2}}\right)=O\left(\frac{1}{n}\right).

The proof is completed. ∎

This corollary secures that variational inference forgetting would not compromise the generalizability.

In Section 5.2, we have shown that when the training sample size is sufficiently large, the posterior distribution $p(\theta|S)$ is asymptotically Gaussian. Here, we again assume that $p(\theta|S)=\mathcal{N}(\theta_{1},(nJ(\theta_{1}))^{-1})$, where $J(\theta)$ is the Fisher information matrix. Then, we obtain a generalization bound as follows.

Theorem 6.2.

Suppose all the conditions in Theorem 5.3 hold. Suppose that $p(\theta|S)=\mathcal{N}(\theta_{1},(nJ(\theta_{1}))^{-1})$. Let $p^{-}=\mathcal{A}_{\mathrm{MCMC}}(p(\theta|S),z_{j})$ denote the processed distribution, where $z_{j}\in S$. Let $\Delta_{\theta_{1}}=\mathcal{I}_{\mathrm{MCMC}}(z_{j})$ denote the MCMC influence function. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$, the following inequality holds:

\mathcal{R}(p^{-})\leq\hat{\mathcal{R}}(p^{-},S)+\sqrt{\frac{C_{p,\Delta_{\theta_{1}}}+2\log\frac{1}{\delta}+(d+2)\log n-d+4}{4n-2}}, (10)

where

C_{p,\Delta_{\theta_{1}}}=\|\Delta_{\theta_{1}}\|^{2}+2\|\theta_{1}\|\cdot\|\Delta_{\theta_{1}}\|+\|\theta_{1}\|^{2}+\frac{1}{n}\mathrm{tr}(J^{-1}(\theta_{1}))+\log|J(\theta_{1})|=O(1),

and $\|\Delta_{\theta_{1}}\|\leq O\left(\frac{1}{n}\right)$.

Remark 6.2.

This generalization bound is of order $O(\sqrt{(\log n)/n})$.

Corollary 6.2.

MCMC forgetting increases the generalization bound by a value not larger than $O(1/n)$.

Proof.

MCMC forgetting introduces the term $\|\Delta_{\theta_{1}}\|$ into the generalization bound. Thus, the generalization bound is increased by the following value:

O\left(\sqrt{\frac{\|\Delta_{\theta_{1}}\|^{2}+2\|\theta_{1}\|\cdot\|\Delta_{\theta_{1}}\|}{4n-2}}\right)\leq O\left(\sqrt{\frac{O(1/n^{2})+O(1/n)}{4n-2}}\right)=O\left(\frac{1}{n}\right).

The proof is completed. ∎

This corollary ensures that MCMC forgetting does not compromise generalizability.

7 Experiments

We apply our forgetting algorithms to a Gaussian mixture model and a Bayesian neural network. In every scenario, we employ variational inference and two SGMCMC methods, SGLD and SGHMC. The empirical results are in full agreement with our theoretical analysis. All the experiments are conducted on a computer with an NVIDIA® GeForce® RTX 2080 Ti GPU, an Intel® Core i9-9900 CPU @ 3.10GHz, and 32GB of memory. To ensure reproducibility, the source code package is available at https://github.com/fshp971/BIF.

7.1 Experiments for Gaussian Mixture Model

We first conduct experiments with a GMM on synthetic data to evaluate our methods.

7.1.1 Implementation Details

The implementation details are given below.

Synthetic dataset. We generate a dataset S of size 2,000 for evaluating our algorithms. Every datum is two-dimensional and belongs to one of K classes. In this experiment, we set K as 4. The raw data are visualized in the left panel of fig. 2(a), in which four different colors represent four different classes.

Gaussian mixture model (GMM). GMMs are usually employed to infer cluster centers. A GMM assumes the data are drawn from K Gaussian distributions centered at \mu_{1},\cdots,\mu_{K}, respectively. The hierarchical structure of the GMM is as follows: (1) draw a cluster assignment c_{i} from the uniform categorical distribution over \{1,\ldots,K\}; and (2) sample z_{i} from a Gaussian distribution centered at \mu_{c_{i}}.

μk\displaystyle\mu_{k} 𝒩(0,σ2I),\displaystyle\sim\mathcal{N}(0,\sigma^{2}I),
ci\displaystyle c_{i} categorical(1K,,1K),\displaystyle\sim\mathrm{categorical}\left(\frac{1}{K},\cdots,\frac{1}{K}\right),
Zi\displaystyle Z_{i} 𝒩(μci,I),\displaystyle\sim\mathcal{N}(\mu_{c_{i}},I),

where 1kK1\leq k\leq K, 1in1\leq i\leq n, μkd\mu_{k}\in\mathbb{R}^{d}, ci{1,,K}c_{i}\in\{1,\cdots,K\}, ZidZ_{i}\in\mathbb{R}^{d}, and the hyperparameter σ\sigma\in\mathbb{R} is the prior standard deviation. We set σ\sigma as 11 in our experiments.
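To make the data generation concrete, the following is a minimal NumPy sketch (ours, for illustration; not the released code) that draws a dataset from exactly this hierarchy with the settings above, i.e., n = 2,000 two-dimensional datums, K = 4, and \sigma = 1.

```python
import numpy as np

# A minimal sketch of the synthetic-data generation under the hierarchy above.
rng = np.random.default_rng(0)
n, K, d, sigma = 2000, 4, 2, 1.0

mu = rng.normal(0.0, sigma, size=(K, d))   # mu_k ~ N(0, sigma^2 I)
c = rng.integers(0, K, size=n)             # c_i ~ categorical(1/K, ..., 1/K)
Z = mu[c] + rng.normal(size=(n, d))        # Z_i ~ N(mu_{c_i}, I)
```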

Applying SGLD and SGHMC to the GMM is straightforward. For variational inference, we utilize the following mean-field variational family [7],

q(𝝁,𝒄)\displaystyle q(\bm{\mu},\bm{c}) =k=1Kq(μk)i=1nq(ci|μ),\displaystyle=\prod_{k=1}^{K}q(\mu_{k})\prod_{i=1}^{n}q(c_{i}|\mu),
μk\displaystyle\mu_{k} 𝒩(mk,sk2),\displaystyle\sim\mathcal{N}(m_{k},s_{k}^{2}),
q(ci=k|μ)\displaystyle q(c_{i}=k|\mu) exp(𝔼qcilogp(xi|μ,ci))\displaystyle\propto\exp\left(\mathbb{E}_{q_{-c_{i}}}\log p(x_{i}|\mu,c_{i})\right)
=expj(xijmkjmkj2+skj22),\displaystyle=\exp\sum_{j}\left(x_{ij}m_{kj}-\frac{m_{kj}^{2}+s_{kj}^{2}}{2}\right),

where xi,mk,skdx_{i},m_{k},s_{k}\in\mathbb{R}^{d}, and λ=(m1,,mK,s1,,sK)\lambda=(m_{1},\cdots,m_{K},s_{1},\cdots,s_{K}) is the variational distribution parameter. Thus, the ELBO function is calculated as follows,

ELBO(λ,S)\displaystyle\mathrm{ELBO}(\lambda,S) =k,jmkj2+skj22σ2+k,jlogskji,k,jφik2xijmkj+mkj2+skj22\displaystyle=-\sum_{k,j}\frac{m_{kj}^{2}+s_{kj}^{2}}{2\sigma^{2}}+\sum_{k,j}\log s_{kj}-\sum_{i,k,j}\varphi_{ik}\frac{-2x_{ij}m_{kj}+m_{kj}^{2}+s_{kj}^{2}}{2}
i,kφiklogφik12i,jxij2+Constant,\displaystyle\ \ \ \ -\sum_{i,k}\varphi_{ik}\log\varphi_{ik}-\frac{1}{2}\sum_{i,j}x_{ij}^{2}+\mathrm{Constant},

where 1in1\leq i\leq n, 1kK1\leq k\leq K, 1jd1\leq j\leq d, and

φik=q(ci=k|μ)expj(xijmkjmkj2+skj22).\displaystyle\varphi_{ik}=q(c_{i}=k|\mu)\propto\exp\sum_{j}\left(x_{ij}m_{kj}-\frac{m_{kj}^{2}+s_{kj}^{2}}{2}\right).
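The responsibilities \varphi_{ik} above admit a simple vectorized implementation. Below is a minimal NumPy sketch (an illustration under assumed array shapes, not the released code): X holds the n\times d data, and m, s hold the K\times d variational means and standard deviations.

```python
import numpy as np

def responsibilities(X, m, s):
    # log phi_{ik} = sum_j (x_{ij} m_{kj} - (m_{kj}^2 + s_{kj}^2) / 2) + const
    logits = X @ m.T - 0.5 * np.sum(m ** 2 + s ** 2, axis=1)  # shape (n, K)
    logits -= logits.max(axis=1, keepdims=True)               # numerical stability
    phi = np.exp(logits)
    return phi / phi.sum(axis=1, keepdims=True)               # normalize over k
```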
Figure 2: Visualized results of the GMM experiments. (a) Visualization of dataset S. (b) Results of variational inference. (c) Results of SGLD. (d) Results of SGHMC. There are four groups of points in the synthetic dataset, where different groups are shown in different colors. The clustering results trained on the complete dataset are shown in the left panels of figs. 2(b), 2(c), and 2(d). The original centers, processed centers, and target centers are drawn as black points, blue triangles, and red asterisks in the right panels of figs. 2(b), 2(c), and 2(d), where the original centers are obtained by training on the complete set, the processed centers are obtained after BIF, and the target centers are obtained by training on only the remaining set.

Experiment design. We employ variational inference, SGLD, and SGHMC to infer the GMM on the synthetic data. Then, we remove 400 points from each of the pink and yellow parts, around 40% of the whole dataset in total, by the proposed BIF algorithms. The remaining datums are shown in the right panel of fig. 2(a). We also train models on only the remaining set with the same Bayesian inference settings in order to show the targets of the forgetting task. The experiments have two main phases:

(1) Training phase. Every GMM is trained for 2,000 iterations. The batch size is set as 64. For variational inference, the learning rate is fixed to 2/n, where n is the training set size. For SGLD, the learning rate schedule is set as 4\cdot t^{-0.15}/n, where t is the training iteration step. For SGHMC, the learning rate schedule is set as 2\cdot t^{-0.15}/n, and the initial \alpha factor is set as 0.4.
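For reference, one SGLD update with such a step-size schedule takes the following form (a sketch only; grad_log_prior and grad_log_lik_batch are hypothetical callables supplied by the model, here the GMM):

```python
import numpy as np

def sgld_step(theta, batch, n, lr, grad_log_prior, grad_log_lik_batch, rng):
    # theta <- theta + (lr / 2) * stochastic gradient of log posterior + Gaussian noise
    g = grad_log_prior(theta) + (n / len(batch)) * grad_log_lik_batch(theta, batch)
    return theta + 0.5 * lr * g + rng.normal(0.0, np.sqrt(lr), size=theta.shape)
```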

(2) Forgetting phase. We remove a batch of 4 datums each time. When calculating the inverse-Hessian-vector product H^{-1}v in the influence functions (see Section 5.3), the number of recursive iterations j is set as 32, and the scaling factor c is set as 1/n', where n' is the number of currently remaining training datums. Notice that n' gradually decreases as the forgetting process continues. Moreover, for SGLD and SGHMC, we employ the Monte Carlo method to calculate the expectations in the MCMC influence functions. Specifically, we sample the model parameter \theta 5 times, calculate the matrices or vectors in the MCMC influence functions, and average the results to approximate the expectations.
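The recursive estimate of H^{-1}v mentioned above can be sketched as follows (a LiSSA-style recursion in the spirit of [2, 38]; hvp is an assumed callable that returns the Hessian-vector product Hu):

```python
import numpy as np

def inverse_hvp(hvp, v, c, j):
    # Iterate h <- v + h - c * (H h); the fixed point satisfies c * h = H^{-1} v.
    # The recursion converges when the spectral radius of (I - c * H) is below 1,
    # which is why the scaling factor c must be chosen small enough.
    h = v.copy()
    for _ in range(j):
        h = v + h - c * hvp(h)
    return c * h
```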

7.1.2 Results Analysis

The empirical results are presented in figs. 2(b), 2(c), and 2(d), respectively. In the figures for variational inference, we draw the learned clustering centers as black points. In every figure for SGLD and SGHMC, (1) we draw the point sampled in the terminal iteration as a black point; and (2) we draw the points sampled in the last 500 iterations in blue, orange, red, or green. The results obtained on the remaining parts (targets) are also shown in these figures. This visualization shows that after forgetting, the learned models are close to the target models, which demonstrates that our forgetting algorithms can selectively remove specified datums while keeping the others intact.

7.2 Experiments for Bayesian Neural Network

We then conduct experiments with Bayesian neural networks on real data.

7.2.1 Implementation Details

The implementation details are given below.

Dataset. We employ the Fashion-MNIST [70] dataset in our experiments. It consists of 28\times 28 gray-scale images from 10 different classes, where each class consists of 6,000 training examples and 1,000 test examples. For data augmentation, we first resize each image to 32\times 32 and then normalize each pixel value to [-0.5,0.5] before feeding it into the BNN. For the forgetting experiments, we divide the training set of Fashion-MNIST into two parts, the removed part S_f and the remaining part S_r; by construction, S_f\cup S_r=S. To select the removed training set S_f, we randomly choose 1,000, 2,000, 3,000, 4,000, 5,000, or 6,000 examples from the class “T-shirt” (i.e., examples labeled 0) in the training set to form S_f. For brevity, we denote the test set by S_test.
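A sketch of this preprocessing and split is given below (our illustration; the torchvision Fashion-MNIST loader and the path "./data" are assumptions, and the 1,000-example split is one of the six settings):

```python
import torch
from torch.utils.data import Subset
from torchvision import datasets, transforms

# Resize 28x28 -> 32x32, then shift pixel values from [0, 1] to [-0.5, 0.5].
transform = transforms.Compose([
    transforms.Resize(32),
    transforms.ToTensor(),
    transforms.Normalize((0.5,), (1.0,)),
])
train_set = datasets.FashionMNIST("./data", train=True, download=True,
                                  transform=transform)

# Form S_f from 1,000 "T-shirt" (label 0) examples; S_r is the rest.
tshirt_idx = torch.nonzero(train_set.targets == 0).flatten()
f_idx = tshirt_idx[torch.randperm(len(tshirt_idx))[:1000]]
mask = torch.ones(len(train_set), dtype=torch.bool)
mask[f_idx] = False
S_f = Subset(train_set, f_idx.tolist())
S_r = Subset(train_set, torch.nonzero(mask).flatten().tolist())
```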

Bayesian neural network (BNN). BNNs employ Bayesian inference to infer the posterior of the neural network parameters. The two major Bayesian inference methods employed therein are variational inference and SGMCMC.

For variational inference, one usually utilizes the mean-field Gaussian variational family [10, 37] to train BNNs. The Bayes by Backprop technique [10] is utilized to calculate the derivative of the ELBO function. Specifically, for the random variable \theta subject to the mean-field variational distribution q_{\lambda}=\mathcal{N}(\mu,\sigma^{2}I), we have that (\theta-\mu)/\sigma\sim\mathcal{N}(0,I). Let \varepsilon=(\theta-\mu)/\sigma; then, the derivative of the ELBO function can be rewritten as follows,

\displaystyle\nabla_{\lambda}\mathrm{ELBO}(\lambda,S)
\quad=\nabla_{\lambda}\mathbb{E}_{q_{\lambda}}\left(\log p(\theta,S)-\log q_{\lambda}(\theta)\right)
\quad=\nabla_{\lambda}\mathbb{E}_{\varepsilon}\left(\log p(\theta,S)-\log q_{\lambda}(\theta)\right)
\quad=\mathbb{E}_{\varepsilon}\nabla_{\lambda}\left(\log p(\theta,S)-\log q_{\lambda}(\theta)\right).

Hence, we can calculate the derivative \nabla_{\lambda}\mathrm{ELBO}(\lambda,S) in two steps: (1) repeatedly sample \varepsilon from \mathcal{N}(0,I) and calculate \nabla_{\lambda}\left(\log p(\theta,S)-\log q_{\lambda}(\theta)\right) based on \theta=\mu+\varepsilon\cdot\sigma; and (2) average the obtained derivatives to approximate \nabla_{\lambda}\mathrm{ELBO}(\lambda,S). Moreover, we also adopt the local reparameterization trick [37] to further eliminate the covariances between the gradients of examples in a batch.
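A minimal PyTorch-style sketch of this two-step estimator is given below (illustrative only; neg_log_joint is an assumed callable returning -\log p(\theta,S_{\mathrm{batch}}), and mu, rho are assumed leaf tensors with requires_grad=True, with \sigma=\mathrm{softplus}(\rho)):

```python
import torch
import torch.nn.functional as F

def elbo_grad(mu, rho, neg_log_joint, batch, n_samples=5):
    sigma = F.softplus(rho)                   # keep sigma positive
    loss = 0.0
    for _ in range(n_samples):
        eps = torch.randn_like(mu)            # step (1): eps ~ N(0, I)
        theta = mu + eps * sigma              # reparameterization theta = mu + eps * sigma
        log_q = torch.distributions.Normal(mu, sigma).log_prob(theta).sum()
        loss = loss + (log_q + neg_log_joint(theta, batch)) / n_samples
    loss.backward()                           # step (2): averaged gradients w.r.t. (mu, rho)
    return loss
```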

Bayesian LeNet-5 [43] is employed in our experiments, which consists of two convolutional layers and three fully-connected layers. We follow Liu et al. [44] to use an isotropic Gaussian distribution 𝒩(0,σ2I)\mathcal{N}(0,\sigma^{2}I) as the prior of BNN, where the standard deviation σ\sigma is set as 0.150.15.

Experiment design. We employ variational inference, SGLD, and SGHMC to train the BNN on the complete training set S. Then, we remove the subset S_f by the proposed BIF algorithms. We also train models on only the remaining set S_r in order to show the targets of the forgetting task. The experiments have two main phases:

(1) Training phase. Every BNN is trained for 10,000 iterations. The batch size is set as 128. For variational inference, the learning rate is initialized as 0.5/n, where n is the training set size, and decays by a factor of 0.1 every 4,000 iterations. Additionally, the number of samples for the Bayes by Backprop procedure is set as 5. For SGLD, the step-size schedule is set as 0.5\cdot t^{-0.5}/n, where t is the training iteration step. For SGHMC, the step-size schedule is set as 0.5\cdot t^{-0.5}/n, and the initial \alpha factor is set as 0.4.

(2) Forgetting phase. We remove a batch of 64 datums each time. When calculating the inverse-Hessian-vector product H^{-1}v in the influence functions (see Section 5.3), the number of recursive iterations j is set as 64. For variational inference, the scaling factor c is set as 0.1/n', where n' is the number of currently remaining datums; notice that n' gradually decreases as the forgetting process continues. For SGLD and SGHMC, the scaling factors c are set as 0.005/n' and 0.05/n', respectively. Besides, we employ the Monte Carlo method to calculate the expectations in the MCMC influence functions: we sample the model parameter \theta 5 times, calculate the matrices and vectors in the MCMC influence functions based on these sampled parameters, and average the results to approximate the desired expectations.

Figure 3: Curves of classification error versus the number of removed training datums. The results of variational inference, SGLD, and SGHMC are presented from left to right. For each setting, classification error curves on the forgotten set S_f, the remaining set S_r, and the test set S_test are plotted. The models are trained 5 times with different random seeds; the darker lines show the averages over seeds and the shaded areas show the standard deviations.
Table 1: Time of the training phase and the forgetting phase in the experiments for BNNs. The acceleration rate is calculated by dividing the training time by the removal time. Our forgetting algorithms are significantly faster than re-training from scratch.

                    Variational Inference   SGLD      SGHMC
Training Time       337.77 s                39.36 s   39.90 s
Removal Time        4.97 s                  4.04 s    4.04 s
Acceleration Rate   67.90                   9.75      9.88

7.2.2 Results Analysis

For each of the obtained processed models and target models, we evaluate its classification error on the sets S_f, S_r, and S_test. The results are collected and plotted in fig. 3. We also collect and present the training and forgetting times in Table 1.

From fig. 3 and Table 1, we have the following two observations: (1) the processed models are similar to the target models in terms of classification error on all three sample sets S_f, S_r, and S_test; and (2) our forgetting algorithms are significantly faster than re-training models from scratch. These phenomena demonstrate that our forgetting algorithms can effectively and efficiently remove the influence of datums from BNNs without hurting the remaining information.

8 Conclusion

The right to be forgotten imposes a considerable compliance burden on AI companies. A company may need to delete a whole model learned from massive resources due to a request to delete a single datum. To address this problem, this work designs a Bayesian inference forgetting (BIF) framework that removes the influence of specific datums on the learned model without deleting the whole model. The BIF framework is established on an energy-based Bayesian inference influence function, which characterizes the influence of the requested data on the learned models. We prove that BIF has an \varepsilon-certified knowledge removal guarantee, a new notion for characterizing forgetting performance. Under the BIF framework, forgetting algorithms are developed for two canonical Bayesian inference algorithms, variational inference and Markov chain Monte Carlo. Theoretical analysis provides guarantees on the generalizability of the proposed methods: performing BIF has little effect on it. Comprehensive experiments demonstrate that the proposed methods can remove the influence of specified datums without compromising the knowledge learned from the remaining datums. The source code package is available at https://github.com/fshp971/BIF.

References

  • [1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
  • [2] Naman Agarwal, Brian Bullins, and Elad Hazan. Second-order stochastic optimization for machine learning in linear time. The Journal of Machine Learning Research, 18(1):4148–4187, 2017.
  • [3] Sungjin Ahn, Anoop Korattikara, and Max Welling. Bayesian posterior sampling via stochastic gradient Fisher scoring. In International Conference on Machine Learning, pages 1771–1778, 2012.
  • [4] Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. The Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • [5] Thomas Baumhauer, Pascal Schöttle, and Matthias Zeppelzauer. Machine unlearning: Linear filtration for logit-based classifiers. arXiv preprint arXiv:2002.02730, 2020.
  • [6] Matthew J Beal, Zoubin Ghahramani, and Carl E Rasmussen. The infinite hidden Markov model. In Advances in Neural Information Processing Systems, pages 577–584, 2002.
  • [7] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. The Journal of the American Statistical Association, 112(518):859–877, 2017.
  • [8] David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
  • [9] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
  • [10] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural network. In International Conference on Machine Learning, pages 1613–1622, 2015.
  • [11] Lucas Bourtoule, Varun Chandrasekaran, Christopher Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. arXiv preprint arXiv:1912.03817, 2019.
  • [12] Olivier Bousquet and André Elisseeff. Stability and generalization. The Journal of Machine Learning Research, 2(Mar):499–526, 2002.
  • [13] Tianqi Chen, Emily Fox, and Carlos Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691, 2014.
  • [14] R Dennis Cook and Sanford Weisberg. Residuals and Influence in Regression. New York: Chapman and Hall, 1982.
  • [15] Nan Ding, Youhan Fang, Ryan Babbush, Changyou Chen, Robert D Skeel, and Hartmut Neven. Bayesian sampling using stochastic gradient thermostats. In Advances in Neural Information Processing Systems, pages 3203–3211, 2014.
  • [16] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics Letters B, 195(2):216–222, 1987.
  • [17] Richard M Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
  • [18] Cynthia Dwork, Vitaly Feldman, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Aaron Leon Roth. Preserving statistical validity in adaptive data analysis. In Annual ACM Symposium on Theory of Computing, pages 117–126, 2015.
  • [19] Weinan E, Chao Ma, Stephan Wojtowytsch, and Lei Wu. Towards a mathematical understanding of neural network-based machine learning: What we know and what we don’t. arXiv preprint arXiv:2009.10713, 2020.
  • [20] Sanjam Garg, Shafi Goldwasser, and Prashant Nalini Vasudevan. Formalizing data deletion in the context of the right to be forgotten. In Annual International Conference on the Theory and Applications of Cryptographic Techniques, pages 373–402. Springer, 2020.
  • [21] Andrew Gelman, John B Carlin, Hal S Stern, David B Dunson, Aki Vehtari, and Donald B Rubin. Bayesian data analysis. CRC press, 2013.
  • [22] Stuart Geman and Donald Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.
  • [23] Antonio Ginart, Melody Guan, Gregory Valiant, and James Y Zou. Making AI forget you: Data deletion in machine learning. In Advances in Neural Information Processing Systems, pages 3518–3531, 2019.
  • [24] Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9304–9312, 2020.
  • [25] Chuan Guo, Tom Goldstein, Awni Hannun, and Laurens van der Maaten. Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030, 2019.
  • [26] W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
  • [27] David Haussler. Sphere packing numbers for subsets of the boolean n-cube with bounded Vapnik-Chervonenkis dimension. Journal of Combinatorial Theory, Series A, 69(2):217–232, 1995.
  • [28] Fengxiang He and Dacheng Tao. Recent advances in deep learning theory. arXiv preprint arXiv:2012.10931, 2020.
  • [29] Fengxiang He, Bohan Wang, and Dacheng Tao. Tighter generalization bounds for iterative differentially private learning algorithms. arXiv preprint arXiv:2007.09371, 2020.
  • [30] Matthew Hoffman, Francis R Bach, and David M Blei. Online learning for latent Dirichlet allocation. In Advances in Neural Information Processing Systems, pages 856–864, 2010.
  • [31] Matthew D. Hoffman, David M. Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(4):1303–1347, 2013.
  • [32] Peter J Huber. Robust Statistics, volume 523. John Wiley & Sons, 2004.
  • [33] Zachary Izzo, Mary Anne Smart, Kamalika Chaudhuri, and James Zou. Approximate data deletion from machine learning models: Algorithms and evaluations. arXiv preprint arXiv:2002.10077, 2020.
  • [34] Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183–233, 1999.
  • [35] Michael Irwin Jordan. Learning in graphical models, volume 89. Springer Science & Business Media, 1998.
  • [36] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [37] Durk P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.
  • [38] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions. In International Conference on Machine Learning, pages 1885–1894, 2017.
  • [39] Daphne Koller and Nir Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT press, 2009.
  • [40] Vladimir Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, 47(5):1902–1914, 2001.
  • [41] Vladimir Koltchinskii and Dmitriy Panchenko. Rademacher processes and bounding the risk of function learning. In High Dimensional Probability II, pages 443–457. Springer, 2000.
  • [42] Lucien Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer Science & Business Media, 2012.
  • [43] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [44] Xuanqing Liu, Yao Li, Chongruo Wu, and Cho-Jui Hsieh. Adv-BNN: Improved adversarial defense through robust Bayesian neural network. arXiv preprint arXiv:1810.01279, 2018.
  • [45] Yi-An Ma, Tianqi Chen, and Emily Fox. A complete recipe for stochastic gradient MCMC. In Advances in Neural Information Processing Systems, pages 2917–2925, 2015.
  • [46] Zhanyu Ma, Andrew E Teschendorff, Arne Leijon, Yuanyuan Qiao, Honggang Zhang, and Jun Guo. Variational Bayesian matrix factorization for bounded support data. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(4):876–889, 2014.
  • [47] David A McAllester. Some PAC-Bayesian theorems. In Proceedings of the Annual Conference on Computational Learning Theory, pages 230–234, 1998.
  • [48] David A McAllester. PAC-Bayesian model averaging. In Proceedings of the Annual Conference on Computational Learning Theory, pages 164–170, 1999.
  • [49] David A McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1):5–21, 2003.
  • [50] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2018.
  • [51] Mohamad T Musavi, Khue Hiang Chan, Donald M Hummels, and K Kalantri. On the generalization ability of neural network classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(6):659–663, 1994.
  • [52] Radford M Neal. Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical report, Citeseer, 1992.
  • [53] Quoc Phong Nguyen, Bryan Kian Hsiang Low, and Patrick Jaillet. Variational bayesian unlearning. Advances in Neural Information Processing Systems, 33, 2020.
  • [54] Luca Oneto, Sandro Ridella, and Davide Anguita. Differential privacy and generalization: Sharper bounds with applications. Pattern Recognition Letters, 89:31–38, 2017.
  • [55] John Paisley, David Blei, and Michael Jordan. Variational Bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430, 2012.
  • [56] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017.
  • [57] Sam Patterson and Yee Whye Teh. Stochastic gradient Riemannian Langevin dynamics on the probability simplex. In Advances in Neural Information Processing Systems, pages 3102–3110, 2013.
  • [58] Tim Pearce, Felix Leibfried, and Alexandra Brintrup. Uncertainty in neural networks: Approximately Bayesian ensembling. In International Conference on Artificial Intelligence and Statistics, pages 234–244. PMLR, 2020.
  • [59] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • [60] William H Rogers and Terry J Wagner. A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, pages 506–514, 1978.
  • [61] Wolfgang Roth and Franz Pernkopf. Bayesian neural networks with weight sharing using Dirichlet processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(1):246–252, 2018.
  • [62] Ruslan Salakhutdinov and Andriy Mnih. Bayesian probabilistic matrix factorization using Markov chain Monte Carlo. In International Conference on Machine Learning, pages 880–887, 2008.
  • [63] David Marco Sommer, Liwei Song, Sameer Wagh, and Prateek Mittal. Towards probabilistic verification of machine unlearning. arXiv preprint arXiv:2003.04247, 2020.
  • [64] Jaemo Sung, Zoubin Ghahramani, and Sung-Yang Bang. Latent-space variational Bayes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(12):2236–2242, 2008.
  • [65] Michalis Titsias and Miguel Lázaro-Gredilla. Local expectation gradients for black box variational inference. In Advances in Neural Information Processing Systems, pages 2638–2646, 2015.
  • [66] DM Titterington et al. Bayesian methods for neural networks and related models. Statistical Science, 19(1):128–139, 2004.
  • [67] Vladimir Vapnik. Estimation of Dependences Based on Empirical Data. Springer Science & Business Media, 2006.
  • [68] Vladimir Vapnik. The Nature of Statistical Learning Theory. Springer Science & Business Media, 2013.
  • [69] Max Welling and Yee W Teh. Bayesian learning via stochastic gradient Langevin dynamics. In International Conference on Machine Learning, pages 681–688, 2011.
  • [70] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, 2017.
  • [71] Qiaoling Ye, Arash Amini, and Qing Zhou. Optimizing regularized Cholesky score for order-based learning of Bayesian networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.
  • [72] Hongwei Yong, Deyu Meng, Wangmeng Zuo, and Lei Zhang. Robust online matrix factorization for dynamic background subtraction. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(7):1726–1740, 2017.
  • [73] Cheng Zhang, Judith Bütepage, Hedvig Kjellström, and Stephan Mandt. Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8):2008–2026, 2018.
  • [74] Hao Zhang, Bo Chen, Yulai Cong, Dandan Guo, Hongwei Liu, and Mingyuan Zhou. Deep autoencoding topic model with scalable hybrid Bayesian inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 68:5795–5809, 2020.
  • [75] Ruqi Zhang, Chunyuan Li, Jianyi Zhang, Changyou Chen, and Andrew Gordon Wilson. Cyclical stochastic gradient MCMC for Bayesian deep learning. In International Conference on Learning Representations, 2020.

Appendix A Proofs in Section 4

This section provides the proofs omitted from Section 4.

Proof of Theorem 4.2.

We first calculate γ^Szj(τ)τ\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau} based on eq. (4), i.e., the following equation,

γF(γ,S)+τγh(γ,zj)=0.\displaystyle\nabla_{\gamma}F(\gamma,S)+\tau\cdot\nabla_{\gamma}h(\gamma,z_{j})=0.

Calculating the derivatives of both sides of the above equation with respect to \tau, we have that

γ2F(γ^Szj(τ),S)γ^Szj(τ)τ+γh(γ^Szj(τ),zj)T+τγ2h(γ^Szj(τ),zj)γ^Szj(τ)τ=0.\displaystyle\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}+\nabla_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})^{T}+\tau\cdot\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}=0. (11)

By Lemma 4.1, F_{-z_{j},\tau}(\gamma,S) is strongly convex with respect to \gamma. Thus, the following Hessian matrix

γ2Fzj,τ(γ^Szj(τ),S)=γ2F(γ^Szj(τ),S)+τγ2h(γ^Szj(τ),zj)\displaystyle\nabla^{2}_{\gamma}F_{-z_{j},\tau}(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)=\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\tau\cdot\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})

is positive definite, hence invertible. Combining with eq. (11), we have that

γ^Szj(τ)τ=\displaystyle\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}= (γ2F(γ^Szj(τ),S)+τγ2h(γ^Szj(τ),zj))1γh(γ^Szj(τ),zj)T.\displaystyle-\left(\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\tau\cdot\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right)^{-1}\cdot\nabla_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})^{T}. (12)

We then upper bound the norm of γ^Szj(τ)τ\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}. Based on eq. (12), we have that

γ^Szj(τ)τ2\displaystyle\left\|\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}\leq (1nγ2F(γ^Szj(τ),S)+τnγ2h(γ^Szj(τ),zj))121nγh(γ^Szj(τ),zj)2.\displaystyle\left\|\left(\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right)^{-1}\right\|_{2}\cdot\left\|\frac{1}{n}\nabla_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right\|_{2}. (13)

We first consider the first term on the right-hand side of eq. (13). By Assumption 4.2, f(\gamma) is c_{f}-strongly convex and h(\gamma,z) is c_{h}(z)-strongly convex, where c_{f} is a positive real number and c_{h}(z) is a positive continuous real function on \mathcal{Z}. Thus, we have that

1nγ2F(γ^Szj(τ),S)+τnγ2h(γ^Szj(τ),zj)\displaystyle\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})
=\displaystyle= 1ni=1nγ2h(γ^Szj(τ),zi)+1nγ2f(γ^Szj(τ))+τnγ2h(γ^Szj(τ),zj)\displaystyle\frac{1}{n}\sum_{i=1}^{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{i})+\frac{1}{n}\nabla^{2}_{\gamma}f(\hat{\gamma}^{-z_{j}}_{S}(\tau))+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})
=\displaystyle= 1nzS{zj}γ2h(γ^Szj(τ),z)+1+τnγ2h(γ^Szj(τ),zj)+1nγ2f(γ^Szj(τ))\displaystyle\frac{1}{n}\sum_{z\in S-\{z_{j}\}}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z)+\frac{1+\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})+\frac{1}{n}\nabla^{2}_{\gamma}f(\hat{\gamma}^{-z_{j}}_{S}(\tau))
\displaystyle\succeq (1nzS{zj}ch(z)+1+τnch(zj)+cfn)I\displaystyle\left(\frac{1}{n}\sum_{z\in S-\{z_{j}\}}c_{h}(z)+\frac{1+\tau}{n}c_{h}(z_{j})+\frac{c_{f}}{n}\right)I
\displaystyle\succeq (1nzS{zj}ch(z))I.\displaystyle\left(\frac{1}{n}\sum_{z\in S-\{z_{j}\}}c_{h}(z)\right)I.

Applying Assumption 4.1, we have that c_{h}(z) is continuous on the compact set \mathcal{Z}. This implies that there exists a real constant \hat{c}_{h}>0 such that c_{h}(z)\geq\hat{c}_{h} for all z\in\mathcal{Z}. Therefore, we further have that

1nγ2F(γ^Szj(τ),S)+τnγ2h(γ^Szj(τ),zj)(1nzS{zj}ch(z))I(n1nc^h)I.\displaystyle\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\succeq\left(\frac{1}{n}\sum_{z\in S-\{z_{j}\}}c_{h}(z)\right)I\succeq\left(\frac{n-1}{n}\hat{c}_{h}\right)I.

Let \lambda_{\min} denote the smallest eigenvalue of the matrix \left(\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right). Then, the above inequality implies that \lambda_{\min}\geq\frac{n-1}{n}\hat{c}_{h}. Hence, we have the following,

\displaystyle\left\|\left(\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right)^{-1}\right\|_{2}=\frac{1}{\lambda_{\min}}\leq\frac{n}{(n-1)\hat{c}_{h}}=O(1). (14)

We then upper bound the second term on the right-hand side of eq. (13). Applying Assumptions 4.2 and 4.1, we have that \nabla_{\gamma}h(\gamma,z) is continuous on the compact set \Gamma\times\mathcal{Z}, and hence \left\|\nabla_{\gamma}h(\gamma,z)\right\|_{2} is bounded. Therefore, as n\to\infty, we have

\displaystyle\left\|\frac{1}{n}\nabla_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right\|_{2}=\frac{1}{n}\left\|\nabla_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right\|_{2}\leq O\left(\frac{1}{n}\right). (15)

Finally, inserting eqs. (14) and (15) into eq. (13), we eventually have that

\displaystyle\left\|\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}\leq O(1)\cdot O\left(\frac{1}{n}\right)=O\left(\frac{1}{n}\right).

The proof is completed. ∎

Proof of Theorem 4.3.

We first calculate \frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}} based on eq. (4). Similar to the proof of Theorem 4.2, we calculate the second-order derivatives of both sides of eq. (4) with respect to \tau and have that

i=1nB(τ,zi)+A(τ)+(i=1nγ2h(γ^Szj(τ),zi)+γ2f(γ^Szj(τ)))2γ^Szj(τ)τ2\displaystyle\sum_{i=1}^{n}B(\tau,z_{i})+A(\tau)+\left(\sum_{i=1}^{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{i})+\nabla^{2}_{\gamma}f(\hat{\gamma}^{-z_{j}}_{S}(\tau))\right)\cdot\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}
+2γ2h(γ^Szj(τ),zj)γ^Szj(τ)τ+τB(τ,zj)+τγ2h(γ^Szj(τ),zj)2γ^Szj(τ)τ2=0,\displaystyle+2\cdot\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}+\tau\cdot B(\tau,z_{j})+\tau\cdot\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}=0,

which means

\displaystyle\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}=-\left(\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right)^{-1}
(1ni=1nB(τ,zi)+τnB(τ,zj)+1nA(τ)+2nγ2h(γ^Szj(τ),zj)γ^Szj(τ)τ),\displaystyle\cdot\left(\frac{1}{n}\sum_{i=1}^{n}B(\tau,z_{i})+\frac{\tau}{n}B(\tau,z_{j})+\frac{1}{n}A(\tau)+\frac{2}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right), (16)

where the invertibility of \left(\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right) is guaranteed by Lemma 4.1, A(\tau),B(\tau,z)\in\mathbb{R}^{K\times 1}, and for i=1,\cdots,K, we have the following,

A(τ)i=γ^Szj(τ)τTγ2(f(γ^Szj(τ))γi)γ^Szj(τ)τ,\displaystyle A(\tau)_{i}={\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}}^{T}\cdot\nabla^{2}_{\gamma}\left(\frac{\partial f(\hat{\gamma}^{-z_{j}}_{S}(\tau))}{\partial\gamma_{i}}\right)\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}, (17)
B(τ,z)i=γ^Szj(τ)τTγ2(h(γ^Szj(τ),z)γi)γ^Szj(τ)τ.\displaystyle B(\tau,z)_{i}={\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}}^{T}\cdot\nabla^{2}_{\gamma}\left(\frac{\partial h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z)}{\partial\gamma_{i}}\right)\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}. (18)

We then upper bound the norm of 2γ^Szj(τ)τ2\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}. Based on eq. (16), we have that

2γ^Szj(τ)τ22\displaystyle\left\|\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}\right\|_{2}
\displaystyle\leq\left\|\left(\frac{1}{n}\nabla^{2}_{\gamma}F(\hat{\gamma}^{-z_{j}}_{S}(\tau),S)+\frac{\tau}{n}\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right)^{-1}\right\|_{2}
1n(i=1nB(τ,zi)2+τB(τ,zj)2+A(τ)2+2γ2h(γ^Szj(τ),zj)γ^Szj(τ)τ2)\displaystyle\ \ \ \ \cdot\frac{1}{n}\left(\sum_{i=1}^{n}\|B(\tau,z_{i})\|_{2}+\tau\cdot\|B(\tau,z_{j})\|_{2}+\|A(\tau)\|_{2}+2\cdot\left\|\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}\right) (19)
O(1n(i=1nB(τ,zi)2+τB(τ,zj)2+A(τ)2+2γ2h(γ^Szj(τ),zj)γ^Szj(τ)τ2)).\displaystyle\leq O\left(\frac{1}{n}\left(\sum_{i=1}^{n}\|B(\tau,z_{i})\|_{2}+\tau\cdot\|B(\tau,z_{j})\|_{2}+\|A(\tau)\|_{2}+2\cdot\left\|\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}\right)\right). (20)

where eq. (20) is obtained by inserting eq. (14) (in the proof of Theorem 4.2) into eq. (19). Thus, it remains to upper bound the norms of A(\tau), B(\tau,z), and \nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}.

We first upper bound A(\tau). Applying Theorem 4.2, we have that \left\|\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}\leq O\left(\frac{1}{n}\right). Applying Assumptions 4.2 and 4.1, we have that \nabla^{2}_{\gamma}\left(\frac{\partial f(\hat{\gamma}^{-z_{j}}_{S}(\tau))}{\partial\gamma_{i}}\right) is bounded on its support \Gamma. As a result, \left\|\nabla^{2}_{\gamma}\left(\frac{\partial f(\hat{\gamma}^{-z_{j}}_{S}(\tau))}{\partial\gamma_{i}}\right)\right\|_{2} is also bounded. Therefore, we have that

A(τ)2i=1Kγ2(f(γ^Szj(τ))γi)2γ^Szj(τ)τ22i=1KO(1)O(1n2)=O(1n2).\displaystyle\|A(\tau)\|_{2}\leq\sum_{i=1}^{K}\left\|\nabla^{2}_{\gamma}\left(\frac{\partial f(\hat{\gamma}^{-z_{j}}_{S}(\tau))}{\partial\gamma_{i}}\right)\right\|_{2}\cdot\left\|\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}^{2}\leq\sum_{i=1}^{K}O(1)\cdot O\left(\frac{1}{n^{2}}\right)=O\left(\frac{1}{n^{2}}\right). (21)

For B(τ,z)B(\tau,z), we similarly have that

B(τ,z)2O(1n2).\displaystyle\|B(\tau,z)\|_{2}\leq O\left(\frac{1}{n^{2}}\right). (22)

To upper bound the norm of γ2h(γ^Szj(τ),zj)γ^Szj(τ)τ\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}, we apply Theorem 4.2, Assumptions 4.2 and 4.1, and have that

γ2h(γ^Szj(τ),zj)γ^Szj(τ)τ2\displaystyle\left\|\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\cdot\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}
γ2h(γ^Szj(τ),zj)2γ^Szj(τ)τ2O(1)O(1n)=O(1n).\displaystyle\quad\leq\left\|\nabla^{2}_{\gamma}h(\hat{\gamma}^{-z_{j}}_{S}(\tau),z_{j})\right\|_{2}\cdot\left\|\frac{\partial\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau}\right\|_{2}\leq O(1)\cdot O\left(\frac{1}{n}\right)=O\left(\frac{1}{n}\right). (23)

Inserting eqs. (21), (22) and (23), into eq. (20), we eventually have that

2γ^Szj(τ)τ22O(1n(i=1nO(1n2)+τO(1n2)+O(1n2)+2O(1n)))=O(1n2).\displaystyle\left\|\frac{\partial^{2}\hat{\gamma}^{-z_{j}}_{S}(\tau)}{\partial\tau^{2}}\right\|_{2}\leq O\left(\frac{1}{n}\left(\sum_{i=1}^{n}O\left(\frac{1}{n^{2}}\right)+\tau\cdot O\left(\frac{1}{n^{2}}\right)+O\left(\frac{1}{n^{2}}\right)+2\cdot O\left(\frac{1}{n}\right)\right)\right)=O\left(\frac{1}{n^{2}}\right).

The proof is completed. ∎

Appendix B Proofs in Section 5

This section provides the detailed proofs in Section 5.

Proof of Theorem 5.2.

Let \hat{\lambda}_{S} and \hat{\lambda}_{S-\{z_{j}\}} be the variational parameters learned on S and S-\{z_{j}\}, respectively. Let \hat{\lambda}^{-z_{j}}_{S}=\mathcal{A}_{\mathrm{VI}}(\hat{\lambda}_{S},z_{j})=\hat{\lambda}_{S}-\mathcal{I}_{\mathrm{VI}}(z_{j}) be the processed variational parameter. Then, to establish a certified knowledge removal guarantee for \mathcal{A}_{\mathrm{VI}}(\hat{\lambda}_{S},z_{j}), we need to bound the KL divergence \mathrm{KL}(p_{\hat{\lambda}^{-z_{j}}_{S}}(\theta)\|p_{\hat{\lambda}_{S-\{z_{j}\}}}(\theta)).

For brevity, we denote that

pλ^Szj=𝒩(μ1,σ12I),\displaystyle p_{\hat{\lambda}^{-z_{j}}_{S}}=\mathcal{N}(\mu_{1},\sigma_{1}^{2}I),

where λ^Szj=(μ11,,μ1d,σ11,,σ1d)\hat{\lambda}^{-z_{j}}_{S}=(\mu_{11},\cdots,\mu_{1d},\sigma_{11},\cdots,\sigma_{1d}), 0<M1σ11,,σ1dM20<M_{1}\leq\sigma_{11},\cdots,\sigma_{1d}\leq M_{2}, and

pλ^S{zj}=𝒩(μ2,σ22I),\displaystyle p_{\hat{\lambda}_{S-\{z_{j}\}}}=\mathcal{N}(\mu_{2},\sigma_{2}^{2}I),

where λ^S{zj}=(μ21,,μ2d,σ21,,σ2d)\hat{\lambda}_{S-\{z_{j}\}}=(\mu_{21},\cdots,\mu_{2d},\sigma_{21},\cdots,\sigma_{2d}), 0<M1σ21,,σ2dM20<M_{1}\leq\sigma_{21},\cdots,\sigma_{2d}\leq M_{2}.

We then have the following (we assume that Θ=d\Theta=\mathbb{R}^{d}),

KL\displaystyle\mathrm{KL} (pλ^Szj(θ)pλ^S{zj}(θ))\displaystyle(p_{\hat{\lambda}^{-z_{j}}_{S}}(\theta)\|p_{\hat{\lambda}_{S-\{z_{j}\}}}(\theta))
=Θlog(pλ^Szj(θ)pλ^S{zj}(θ))pλ^Szj(θ)dθ\displaystyle=\int_{\Theta}\log\left(\frac{p_{\hat{\lambda}^{-z_{j}}_{S}}(\theta)}{p_{\hat{\lambda}_{S-\{z_{j}\}}}(\theta)}\right)p_{\hat{\lambda}^{-z_{j}}_{S}}(\theta)\mathrm{d}\theta
=12k=1d((θkμ1k)2σ1k2(θkμ2k)2σ2k2)pλ^Szj(θk)dθkk=1dlogσ1kσ2k\displaystyle=-\frac{1}{2}\sum_{k=1}^{d}\int\left(\frac{(\theta_{k}-\mu_{1k})^{2}}{\sigma_{1k}^{2}}-\frac{(\theta_{k}-{\mu_{2k}})^{2}}{\sigma_{2k}^{2}}\right)p_{\hat{\lambda}^{-z_{j}}_{S}}(\theta_{k})\mathrm{d}\theta_{k}-\sum_{k=1}^{d}\log\frac{\sigma_{1k}}{\sigma_{2k}}
=12k=1d(1σ1k2σ2k2(μ1kμ2k)2σ2k2)k=1dlogσ1kσ2k\displaystyle=-\frac{1}{2}\sum_{k=1}^{d}\left(1-\frac{\sigma_{1k}^{2}}{\sigma_{2k}^{2}}-\frac{(\mu_{1k}-\mu_{2k})^{2}}{\sigma_{2k}^{2}}\right)-\sum_{k=1}^{d}\log\frac{\sigma_{1k}}{\sigma_{2k}}
=12k=1d(σ1k2σ2k2σ2k2+(μ1kμ2k)2σ2k2+2log(1+σ2kσ1kσ1k))\displaystyle=\frac{1}{2}\sum_{k=1}^{d}\left(\frac{\sigma_{1k}^{2}-\sigma_{2k}^{2}}{\sigma_{2k}^{2}}+\frac{(\mu_{1k}-\mu_{2k})^{2}}{\sigma_{2k}^{2}}+2\log\left(1+\frac{\sigma_{2k}-\sigma_{1k}}{\sigma_{1k}}\right)\right)
12k=1d(|σ1k+σ2k||σ1kσ2k|σ2k2+(μ1kμ2k)2σ2k2+2|σ1kσ2k|σ1k)\displaystyle\leq\frac{1}{2}\sum_{k=1}^{d}\left(\frac{|\sigma_{1k}+\sigma_{2k}|\cdot|\sigma_{1k}-\sigma_{2k}|}{\sigma_{2k}^{2}}+\frac{(\mu_{1k}-\mu_{2k})^{2}}{\sigma_{2k}^{2}}+2\cdot\frac{|\sigma_{1k}-\sigma_{2k}|}{\sigma_{1k}}\right)
12k=1d(2M2|σ1kσ2k|M12+(μ1kμ2k)2M12+2|σ1kσ2k|M1)\displaystyle\leq\frac{1}{2}\sum_{k=1}^{d}\left(\frac{2M_{2}|\sigma_{1k}-\sigma_{2k}|}{M_{1}^{2}}+\frac{(\mu_{1k}-\mu_{2k})^{2}}{M_{1}^{2}}+\frac{2|\sigma_{1k}-\sigma_{2k}|}{M_{1}}\right)
=12M12k=1d(2(M1+M2)|σ1kσ2k|+(μ1kμ2k)2)\displaystyle=\frac{1}{2M_{1}^{2}}\sum_{k=1}^{d}\left(2(M_{1}+M_{2})|\sigma_{1k}-\sigma_{2k}|+(\mu_{1k}-\mu_{2k})^{2}\right)
12M12(2(M1+M2)λ^Szjλ^S{zj}1+λ^Szjλ^S{zj}22)=ελ^S,zj.\displaystyle\leq\frac{1}{2M_{1}^{2}}\left(2(M_{1}+M_{2})\|\hat{\lambda}^{-z_{j}}_{S}-\hat{\lambda}_{S-\{z_{j}\}}\|_{1}+\|\hat{\lambda}^{-z_{j}}_{S}-\hat{\lambda}_{S-\{z_{j}\}}\|_{2}^{2}\right)=\varepsilon_{\hat{\lambda}_{S},z_{j}}. (24)

Therefore, 𝒜VI(λ^S,zj)\mathcal{A}_{\mathrm{VI}}(\hat{\lambda}_{S},z_{j}) performs ελ^S,zj\varepsilon_{\hat{\lambda}_{S},z_{j}}-certified knowledge removal.

The proof is completed. ∎
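As a quick numeric sanity check of eq. (24) (our own illustration with arbitrarily chosen d, M_1, M_2; not part of the paper's experiments), one can compare the closed-form KL divergence between two mean-field Gaussians with the bound \varepsilon:

```python
import numpy as np

rng = np.random.default_rng(0)
d, M1, M2 = 5, 0.5, 2.0
mu1, mu2 = rng.normal(size=d), rng.normal(size=d)
s1, s2 = rng.uniform(M1, M2, size=d), rng.uniform(M1, M2, size=d)

# Closed-form KL( N(mu1, diag(s1^2)) || N(mu2, diag(s2^2)) ).
kl = 0.5 * np.sum(s1**2 / s2**2 + (mu1 - mu2)**2 / s2**2 - 1.0 + 2.0 * np.log(s2 / s1))

# The bound eps from eq. (24), with lambda = (mu_1, ..., mu_d, sigma_1, ..., sigma_d).
diff = np.concatenate([mu1 - mu2, s1 - s2])
eps = (2 * (M1 + M2) * np.abs(diff).sum() + (diff ** 2).sum()) / (2 * M1 ** 2)

assert kl <= eps  # the certified-removal bound dominates the exact KL
```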

Proof of Theorem 5.4.

Let the processed model be p^{-z_{j}}(\theta)=\mathcal{A}_{\mathrm{MCMC}}(p(\theta|S),z_{j}). For brevity, we denote that

p1(θ)=pzj(θ)=𝒩(θ1,(nJ(θ1))1),\displaystyle p_{1}(\theta)=p^{-z_{j}}(\theta)=\mathcal{N}(\theta_{1}^{\prime},(nJ(\theta_{1}))^{-1}),

where θ1=θ1MCMC(zj)\theta_{1}^{\prime}=\theta_{1}-\mathcal{I}_{\mathrm{MCMC}}(z_{j}), and

p2(θ)=p(θ|S{zj})=𝒩(θ2,((n1)J(θ2))1).\displaystyle p_{2}(\theta)=p(\theta|S-\{z_{j}\})=\mathcal{N}(\theta_{2},((n-1)J(\theta_{2}))^{-1}).

Then, to establish the certified removal guarantee, we need to calculate the KL divergence KL(p1(θ)p2(θ))\mathrm{KL}(p_{1}(\theta)\|p_{2}(\theta)).

Since

p1(θ)=1(2π)d|nJ(θ1)|1exp(12(θθ1)T(nJ(θ1))(θθ1)),\displaystyle p_{1}(\theta)=\frac{1}{\sqrt{(2\pi)^{d}|nJ(\theta_{1})|^{-1}}}\exp\left(-\frac{1}{2}(\theta-\theta_{1}^{\prime})^{T}(nJ(\theta_{1}))(\theta-\theta_{1}^{\prime})\right),
p2(θ)=1(2π)d|(n1)J(θ2)|1exp(12(θθ2)T((n1)J(θ2))(θθ2)),\displaystyle p_{2}(\theta)=\frac{1}{\sqrt{(2\pi)^{d}|(n-1)J(\theta_{2})|^{-1}}}\exp\left(-\frac{1}{2}(\theta-\theta_{2})^{T}((n-1)J(\theta_{2}))(\theta-\theta_{2})\right),

we have that (we assume that Θ=d\Theta=\mathbb{R}^{d})

KL\displaystyle\mathrm{KL} (p1(θ)p2(θ))\displaystyle(p_{1}(\theta)\|p_{2}(\theta))
=\displaystyle= Θlog(p1(θ)p2(θ))p1(θ)dθ\displaystyle\int_{\Theta}\log\left(\frac{p_{1}(\theta)}{p_{2}(\theta)}\right)p_{1}(\theta)\mathrm{d}\theta
=\displaystyle= 12Θ((θθ1)T(nJ(θ1))(θθ1)(θθ2)T((n1)J(θ2))(θθ2))p1(θ)dθ\displaystyle-\frac{1}{2}\int_{\Theta}\left((\theta-\theta_{1}^{\prime})^{T}(nJ(\theta_{1}))(\theta-\theta_{1}^{\prime})-(\theta-\theta_{2})^{T}((n-1)J(\theta_{2}))(\theta-\theta_{2})\right)p_{1}(\theta)\mathrm{d}\theta
12log|nJ(θ1)|1|(n1)J(θ2)|1\displaystyle-\frac{1}{2}\log\frac{|nJ(\theta_{1})|^{-1}}{|(n-1)J(\theta_{2})|^{-1}}
=\displaystyle= 12(tr(nJ(θ1)(nJ(θ1))1)tr((n1)J(θ2)(nJ(θ1))1)\displaystyle-\frac{1}{2}\left(\mathrm{tr}(nJ(\theta_{1})(nJ(\theta_{1}))^{-1})-\mathrm{tr}((n-1)J(\theta_{2})(nJ(\theta_{1}))^{-1})\right.
\hskip 30.00005pt\left.-(\theta_{1}^{\prime}-\theta_{2})^{T}((n-1)J(\theta_{2}))(\theta_{1}^{\prime}-\theta_{2})\right)+\frac{1}{2}\left(d\log\frac{n}{n-1}+\log\frac{|J(\theta_{1})|}{|J(\theta_{2})|}\right)
=\frac{1}{2}\left((n-1)(\theta_{1}^{\prime}-\theta_{2})^{T}J(\theta_{2})(\theta_{1}^{\prime}-\theta_{2})+\frac{n-1}{n}\mathrm{tr}(J(\theta_{2})J^{-1}(\theta_{1}))-d+d\log\frac{n}{n-1}+\log\frac{|J(\theta_{1})|}{|J(\theta_{2})|}\right)
=\displaystyle= O((n1)(θ1θ2)TJ(θ2)(θ1θ2)+tr(J1(θ1)(J(θ2)J(θ1)))+log|J(θ1)||J(θ2)|)\displaystyle O\left((n-1)(\theta_{1}^{\prime}-\theta_{2})^{T}J(\theta_{2})(\theta_{1}^{\prime}-\theta_{2})+\mathrm{tr}\left(J^{-1}(\theta_{1})(J(\theta_{2})-J(\theta_{1}))\right)+\log\frac{|J(\theta_{1})|}{|J(\theta_{2})|}\right)
=\displaystyle= O(εθ1,zj).\displaystyle O(\varepsilon_{\theta_{1},z_{j}}).

This means that \mathrm{KL}(p^{-z_{j}}(\theta)\|p(\theta|S-\{z_{j}\}))=O(\varepsilon_{\theta_{1},z_{j}}). Therefore, \mathcal{A}_{\mathrm{MCMC}}(p(\theta|S),z_{j}) performs O(\varepsilon_{\theta_{1},z_{j}})-certified knowledge removal.

The proof is completed. ∎

Appendix C Proofs in Section 6

This section provides the detailed proofs in Section 6.

We derive generalization bounds for BIF under the PAC-Bayesian framework [47, 48, 49]. The framework can provide guarantees for randomized predictors (e.g. Bayesian predictors).

Specifically, let Q be a distribution on the parameter space \Theta, and let P denote the prior distribution over \Theta. Then, the expected risk \mathcal{R}(Q) is bounded in terms of the empirical risk \hat{\mathcal{R}}(Q,S) and the KL-divergence \mathrm{KL}(Q\|P) by the following result from the PAC-Bayesian theory.

Lemma C.1 (cf. [49], Theorem 1).

For any real δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, we have the following inequality for all distributions QQ:

(Q)^(Q,S)+KL(QP)+log1δ+logn+22n1.\displaystyle\mathcal{R}(Q)\leq\mathcal{\hat{R}}(Q,S)+\sqrt{\frac{\mathrm{KL}(Q\|P)+\log\frac{1}{\delta}+\log n+2}{2n-1}}. (25)
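To give a sense of scale (with illustrative values we pick here, not taken from the paper), the complexity term of eq. (25) can be evaluated directly:

```python
import math

kl, n, delta = 10.0, 60_000, 0.05  # illustrative values only
gap = math.sqrt((kl + math.log(1 / delta) + math.log(n) + 2) / (2 * n - 1))
print(f"PAC-Bayes complexity term: {gap:.4f}")  # approximately 0.015
```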

Based on Lemma C.1, we prove the generalization bounds in Theorem 6.1 and Theorem 6.2.

Proof of Theorem 6.1.

Let the prior distribution P be the standard Gaussian distribution \mathcal{N}(0,I), and let Q be the distribution obtained after conducting variational inference BIF. Suppose the density functions of P and Q are p(\theta) and q(\theta), respectively; then we have

p(θ)=1(2π)dexp(θ22),\displaystyle p(\theta)=\frac{1}{\sqrt{(2\pi)^{d}}}\exp\left(-\frac{\|\theta\|^{2}}{2}\right),
q(θ)=1(2π)dk=1dσkexp(k=1d(θkμk)22σk2),\displaystyle q(\theta)=\frac{1}{\sqrt{(2\pi)^{d}}\prod_{k=1}^{d}\sigma_{k}^{\prime}}\exp\left(-\sum_{k=1}^{d}\frac{(\theta_{k}-{\mu_{k}^{\prime}})^{2}}{2{\sigma_{k}^{\prime}}^{2}}\right),

where μk=μkΔμk\mu_{k}^{\prime}=\mu_{k}-\Delta_{\mu_{k}} and σk=σkΔσk\sigma_{k}^{\prime}=\sigma_{k}-\Delta_{\sigma_{k}}.

Therefore, we can calculate the KL-divergence KL(QP)\mathrm{KL}(Q\|P) as follows (where we assume that Θ=d\Theta=\mathbb{R}^{d}),

KL\displaystyle\mathrm{KL} (QP)\displaystyle(Q\|P)
=Θlog(q(θ)p(θ))q(θ)dθ\displaystyle=\int_{\Theta}\log\left(\frac{q(\theta)}{p(\theta)}\right)q(\theta)\mathrm{d}\theta
=12k=1dΘ((θkμk)2σk2θk2)q(θ)dθk=1dlogσk\displaystyle=-\frac{1}{2}\sum_{k=1}^{d}\int_{\Theta}\left(\frac{(\theta_{k}-\mu_{k}^{\prime})^{2}}{{\sigma_{k}^{\prime}}^{2}}-\theta_{k}^{2}\right)q(\theta)\mathrm{d}\theta-\sum_{k=1}^{d}\log{\sigma_{k}^{\prime}}
=12k=1d(1μk2σk2)k=1dlogσk\displaystyle=-\frac{1}{2}\sum_{k=1}^{d}\left(1-{\mu_{k}^{\prime}}^{2}-{\sigma_{k}^{\prime}}^{2}\right)-\sum_{k=1}^{d}\log{\sigma_{k}^{\prime}}
=12(λΔλ2d2k=1dlogσk)\displaystyle=\frac{1}{2}\left(\|\lambda-\Delta_{\lambda}\|^{2}-d-2\sum_{k=1}^{d}\log{\sigma_{k}^{\prime}}\right)
12(Δλ2+2λΔλ+λ22k=1dlog(σkΔσk)d)\displaystyle\leq\frac{1}{2}\left(\|\Delta_{\lambda}\|^{2}+2\|\lambda\|\cdot\|\Delta_{\lambda}\|+\|\lambda\|^{2}-2\sum_{k=1}^{d}\log\left(\sigma_{k}-\Delta_{\sigma_{k}}\right)-d\right)
=12(Cλ,Δλd),\displaystyle=\frac{1}{2}\left(C_{\lambda,\Delta_{\lambda}}-d\right), (26)

where

Cλ,Δλ=Δλ2+2λΔλ+λ22k=1dlog(σkΔσk).\displaystyle C_{\lambda,\Delta_{\lambda}}=\|\Delta_{\lambda}\|^{2}+2\|\lambda\|\cdot\|\Delta_{\lambda}\|+\|\lambda\|^{2}-2\sum_{k=1}^{d}\log\left(\sigma_{k}-\Delta_{\sigma_{k}}\right).

Eq. (26) gives an upper bound of the KL-divergence between the obtained distribution after BIF and the prior distribution. Inserting eq. (26) into eq. (25) in Lemma C.1, we obtain the PAC-Bayesian generalization bound.

Furthermore, since all the conditions in Theorem 5.1 hold, we can apply Theorem 4.2 and have that |Δσk|ΔλO(1/n)|\Delta_{\sigma_{k}}|\leq\|\Delta_{\lambda}\|\leq O(1/n). Besides, by applying the conditions in Theorem 5.2, we have that for any 1kd1\leq k\leq d, σkM1>0\sigma_{k}\geq M_{1}>0, where M1M_{1} is a constant. Therefore, we have that

\displaystyle C_{\lambda,\Delta_{\lambda}}\leq\|\Delta_{\lambda}\|^{2}+2\|\lambda\|\cdot\|\Delta_{\lambda}\|+\|\lambda\|^{2}-2d\log\left(M_{1}-\|\Delta_{\lambda}\|\right)
=O(1)-2d\log\left(M_{1}-O\left(\frac{1}{n}\right)\right)
\leq O(1).

The proof is completed. ∎

Proof of Theorem 6.2.

Let the prior distribution P be the standard Gaussian distribution \mathcal{N}(0,I), and let Q be the processed distribution obtained after conducting MCMC BIF. Suppose the density functions of P and Q are p(\theta) and q(\theta), respectively; then we have

p(θ)=1(2π)dexp(θ22),\displaystyle p(\theta)=\frac{1}{\sqrt{(2\pi)^{d}}}\exp\left(-\frac{\|\theta\|^{2}}{2}\right),
q(θ)=1(2π)d|nJ(θ1)|1exp(12(θθ1)T(nJ(θ1))(θθ1)),\displaystyle q(\theta)=\frac{1}{\sqrt{(2\pi)^{d}|nJ(\theta_{1})|^{-1}}}\exp\left(-\frac{1}{2}(\theta-\theta_{1}^{\prime})^{T}(nJ(\theta_{1}))(\theta-\theta_{1}^{\prime})\right),

where θ1=θ1Δθ1\theta_{1}^{\prime}=\theta_{1}-\Delta_{\theta_{1}}.

Thus, we can calculate the KL-divergence KL(QP)\mathrm{KL}(Q\|P) as follows (where we assume that Θ=d\Theta=\mathbb{R}^{d}),

KL\displaystyle\mathrm{KL} (QP)\displaystyle(Q\|P)
=Θlog(q(θ)p(θ))q(θ)dθ\displaystyle=\int_{\Theta}\log\left(\frac{q(\theta)}{p(\theta)}\right)q(\theta)\mathrm{d}\theta
=12(Θ((θθ1)T(nJ(θ1))(θθ1)θ2)q(θ)dθlog|nJ(θ1)|)\displaystyle=-\frac{1}{2}\left(\int_{\Theta}\left((\theta-\theta_{1}^{\prime})^{T}(nJ(\theta_{1}))(\theta-\theta_{1}^{\prime})-\|\theta\|^{2}\right)q(\theta)\mathrm{d}\theta-\log\left|nJ(\theta_{1})\right|\right)
=12(tr(I)tr((nJ(θ1))1)θ12log|nJ(θ1)|)\displaystyle=-\frac{1}{2}\left(\mathrm{tr}(I)-\mathrm{tr}\left((nJ(\theta_{1}))^{-1}\right)-\|\theta_{1}^{\prime}\|^{2}-\log\left|nJ(\theta_{1})\right|\right)
=12(θ12+1ntr(J1(θ1))d+log(nd|J(θ1)|))\displaystyle=\frac{1}{2}\left(\|\theta_{1}^{\prime}\|^{2}+\frac{1}{n}\mathrm{tr}(J^{-1}(\theta_{1}))-d+\log\left(n^{d}|J(\theta_{1})|\right)\right)
12(Δθ12+2θ1Δθ1+θ12+1ntr(J1(θ1))+log|J(θ1)|+d(logn1))\displaystyle\leq\frac{1}{2}\left(\|\Delta_{\theta_{1}}\|^{2}+2\|\theta_{1}\|\cdot\|\Delta_{\theta_{1}}\|+\|\theta_{1}\|^{2}+\frac{1}{n}\mathrm{tr}(J^{-1}(\theta_{1}))+\log|J(\theta_{1})|+d(\log n-1)\right)
=12(Cp,Δθ1+d(logn1)),\displaystyle=\frac{1}{2}\left(C_{p,\Delta_{\theta_{1}}}+d(\log n-1)\right), (27)

where

Cp,Δθ1=Δθ12+2θ1Δθ1+θ12+1ntr(J1(θ1))+log|J(θ1)|.\displaystyle C_{p,\Delta_{\theta_{1}}}=\|\Delta_{\theta_{1}}\|^{2}+2\|\theta_{1}\|\cdot\|\Delta_{\theta_{1}}\|+\|\theta_{1}\|^{2}+\frac{1}{n}\mathrm{tr}(J^{-1}(\theta_{1}))+\log|J(\theta_{1})|.

Eq. (27) bounds the KL-divergence between the processed distribution after conducting BIF and the prior distribution. Inserting eq. (27) into eq. (25) in Lemma C.1, we then obtain the PAC-Bayesian generalization bound.

Finally, since all the conditions in Theorem 5.3 hold, we can apply Theorem 4.2 and have that \|\Delta_{\theta_{1}}\|\leq O(1/n). Therefore, we have the following,

Cp,Δθ1=O(θ12+1ntr(J1(θ1))+log|J(θ1)|)=O(1).\displaystyle C_{p,\Delta_{\theta_{1}}}=O\left(\|\theta_{1}\|^{2}+\frac{1}{n}\mathrm{tr}(J^{-1}(\theta_{1}))+\log|J(\theta_{1})|\right)=O(1).

The proof is completed. ∎