
The Memory Perturbation Equation:
Understanding Model’s Sensitivity to Data

Peter Nickl ([email protected])
Lu Xu ([email protected])
Dharmesh Tailor ([email protected])
Thomas Möllenhoff ([email protected])
Mohammad Emtiyaz Khan ([email protected])

Equal contribution. Part of this work was carried out when Dharmesh Tailor was at RIKEN AIP.
RIKEN Center for AI Project, Tokyo, Japan.
University of Amsterdam, Amsterdam, Netherlands.
Corresponding author.
Abstract

Understanding a model’s sensitivity to its training data is crucial but can also be challenging and costly, especially during training. To simplify such issues, we present the Memory-Perturbation Equation (MPE), which relates a model’s sensitivity to perturbations in its training data. Derived using Bayesian principles, the MPE unifies existing sensitivity measures, generalizes them to a wide variety of models and algorithms, and unravels useful properties regarding sensitivities. Our empirical results show that sensitivity estimates obtained during training can be used to faithfully predict generalization on unseen test data. The proposed equation is expected to be useful for future research on robust and adaptive learning.

1 Introduction

Understanding a model’s sensitivity to its training data is important for handling issues related to quality, privacy, and security. For example, we can use it to understand (i) the effect of errors and biases in the data; (ii) the model’s dependence on private information, to avoid data leakage; and (iii) the model’s vulnerability to malicious manipulations. Despite their importance, sensitivity properties of machine learning (ML) models are not well understood in general. Sensitivity is often studied through empirical investigations, but conclusions drawn this way do not always generalize across models or algorithms. Such studies are also costly, sometimes requiring thousands of GPUs [38], which can quickly become infeasible if we need to repeat them every time the model is updated.

A cheaper solution is to use local perturbation methods [21], for instance, influence measures that study the sensitivity of a trained model to data removal (Fig. 1(a)) [8, 7]. Such methods, too, fall short of providing a clear understanding of sensitivity properties in generic cases. For instance, influence measures are useful for studying trained models but are not suited to analyzing training trajectories [14, 54]. Another challenge is in handling non-differentiable loss functions or discrete parameter spaces, where a natural choice of perturbation mechanism may not always be clear [32]. The measures also do not directly reveal the causes of sensitivities for generic ML models and algorithms.

In this paper, we simplify these issues by proposing a new method to unify, generalize, and understand perturbation methods for sensitivity analysis. We present the Memory-Perturbation Equation (MPE) as a unifying equation to understand sensitivity properties of generic ML algorithms. The equation builds upon the Bayesian learning rule (BLR) [28], which unifies many popular algorithms from various fields as specific instances of natural-gradient descent applied to a Bayesian learning problem. The MPE uses natural-gradients to understand the sensitivity of all such algorithms. We use the MPE to show several new results regarding the sensitivity of generic ML algorithms:

1. We show that sensitivity to a group of examples can be estimated by simply adding their natural-gradients; see Eq. 6. Larger natural-gradients imply higher sensitivity, and just a few such examples can often account for most of the sensitivity. Such examples can be used to characterize the model’s memory, and memory-perturbation refers to the fact that the model can forget its essential knowledge when those examples are perturbed heavily.

2. We derive the influence function [8, 31] as a special case of the MPE when natural-gradients with respect to a Gaussian posterior are used. More importantly, we derive new measures that, unlike influence functions, can be applied during training for all algorithms covered under the BLR (such as those used in deep learning and optimization). See Table 1.

3. Measures derived using Gaussian posteriors share a common property: sensitivity to an example depends on the product of its prediction error and prediction variance (Eq. 12). That is, the most sensitive data lies where the model makes the most mistakes and is also least confident. In many cases, such estimates are extremely cheap to compute.

4. We show that sensitivity of the training data can be used to accurately predict model generalization, even during training (Fig. 1(b)). This agrees with similar studies which also show the effectiveness of sensitivity in predicting generalization [22, 12, 19, 4].

(a) Estimating the effect of an example removal
(b) Predicting test NLL during training, CIFAR10
Figure 1: Our main goal is to estimate the sensitivity of the training trajectory when examples are perturbed or simply removed; see Panel (a). We present the MPE to estimate the sensitivity without any retraining and use it to faithfully predict the test performance from training data alone; see Panel (b). The test negative log-likelihood (gray line) for ResNet–20 on CIFAR10 shows trends similar to the leave-one-out (LOO) score computed on the training data (black line).

2 Understanding a Model’s Sensitivity to Its Training Data

Understanding a model’s sensitivity to its training data is important but is often done by a costly process of retraining the model multiple times. For example, consider a model with a parameter vector $\boldsymbol{\theta}\in\mathbb{R}^{P}$ trained on data $\mathcal{D}=\{\mathcal{D}_{1},\mathcal{D}_{2},\ldots,\mathcal{D}_{N}\}$ by using an algorithm $\mathcal{A}_{t}$ that generates a sequence $\{\boldsymbol{\theta}_{t}\}$ over iterations $t$, converging to a minimizer $\boldsymbol{\theta}_{*}$. Formally, we write

$\boldsymbol{\theta}_{t}\leftarrow\mathcal{A}_{t}\left(\boldsymbol{\theta}_{t-1},\mathcal{L}(\boldsymbol{\theta})\right)\,\,\text{ where }\,\mathcal{L}(\boldsymbol{\theta})=\sum_{i=1}^{N}\ell_{i}(\boldsymbol{\theta})+\mathcal{R}(\boldsymbol{\theta}),$  (1)

and we use the loss $\ell_{i}(\boldsymbol{\theta})$ for $\mathcal{D}_{i}$ and a regularizer $\mathcal{R}(\boldsymbol{\theta})$. Because the iterates $\boldsymbol{\theta}_{t}$ are all functions of $\mathcal{D}$ or its subsets, we can analyze their sensitivity by simply ‘perturbing’ the data. For example, we can remove a subset $\mathcal{M}\subset\mathcal{D}$ to get a perturbed dataset, denoted by $\mathcal{D}^{\backslash\mathcal{M}}$, and retrain the model to get new iterates $\boldsymbol{\theta}_{t}^{\backslash\mathcal{M}}$, converging to a minimizer $\boldsymbol{\theta}_{*}^{\backslash\mathcal{M}}$. If the deviation $\boldsymbol{\theta}_{t}^{\backslash\mathcal{M}}-\boldsymbol{\theta}_{t}$ is large for most $t$, we may deem the model to be highly sensitive to the examples in $\mathcal{M}$. This is a simple method for sensitivity analysis but requires costly brute-force retraining [38], which is often infeasible for long training trajectories, big models, and large datasets. More importantly, conclusions drawn from retraining are often empirical and may not hold across models or algorithms.

A cheaper alternative is to use local perturbation methods [21], for instance, influence measures that estimate the sensitivity without retraining (illustrated in Fig. 1(a) by the dashed red arrow). The simplest result of this kind is for linear regression and dates back to the 70s [7]. The method makes use of the stationarity condition to derive deviations in $\boldsymbol{\theta}_{*}$ due to small perturbations of the data. For linear regression, the deviations can be obtained in closed form. Consider input-output pairs $(\mathbf{x}_{i},y_{i})$ and the loss $\ell_{i}(\boldsymbol{\theta})=\frac{1}{2}(y_{i}-f_{i}(\boldsymbol{\theta}))^{2}$ for $f_{i}(\boldsymbol{\theta})=\mathbf{x}_{i}^{\top}\boldsymbol{\theta}$ and a regularizer $\mathcal{R}(\boldsymbol{\theta})=\delta\|\boldsymbol{\theta}\|^{2}/2$. We can obtain closed-form expressions for the deviation due to the removal of the $i$’th example as shown below (a proof is included in App. A),

$\boldsymbol{\theta}_{*}^{\backslash i}-\boldsymbol{\theta}_{*}=(\mathbf{H}_{*}^{\backslash i})^{-1}\mathbf{x}_{i}e_{i},\qquad\qquad f_{i}(\boldsymbol{\theta}_{*}^{\backslash i})-f_{i}(\boldsymbol{\theta}_{*})=v_{i}^{\backslash i}e_{i},$  (2)

where we denote $\mathbf{H}_{*}^{\backslash i}=\mathbf{H}_{*}-\mathbf{x}_{i}\mathbf{x}_{i}^{\top}$ defined using the Hessian $\mathbf{H}_{*}=\nabla^{2}\mathcal{L}(\boldsymbol{\theta}_{*})$. We also denote the prediction error of $\boldsymbol{\theta}_{*}$ by $e_{i}=\mathbf{x}_{i}^{\top}\boldsymbol{\theta}_{*}-y_{i}$, and the prediction variance of $\boldsymbol{\theta}_{*}^{\backslash i}$ by $v_{i}^{\backslash i}=\mathbf{x}_{i}^{\top}(\mathbf{H}_{*}^{\backslash i})^{-1}\mathbf{x}_{i}$.
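As a concrete check of Eq. 2, the following minimal NumPy sketch (our own illustrative setup; names like `N`, `P`, and `delta` are arbitrary choices, not from the paper’s released code) verifies the closed-form leave-one-out deviations against brute-force retraining for ridge regression.

```python
# A minimal NumPy check of Eq. 2 for ridge regression (illustrative setup,
# not the paper's code): the closed-form leave-one-out deviations should
# match brute-force retraining exactly, since the loss is quadratic.
import numpy as np

rng = np.random.default_rng(0)
N, P, delta = 50, 5, 1.0
X = rng.normal(size=(N, P))
y = X @ rng.normal(size=P) + 0.1 * rng.normal(size=N)

def ridge(X, y):
    H = X.T @ X + delta * np.eye(P)         # Hessian of the loss
    return np.linalg.solve(H, X.T @ y), H   # minimizer theta_* and H_*

theta, H = ridge(X, y)
i = 3                                        # example to remove
e_i = X[i] @ theta - y[i]                    # prediction error e_i
H_woi = H - np.outer(X[i], X[i])             # H_* without example i
v_i = X[i] @ np.linalg.solve(H_woi, X[i])    # prediction variance v_i

# Closed-form deviations from Eq. 2
dtheta = np.linalg.solve(H_woi, X[i]) * e_i
df = v_i * e_i

# Brute-force retraining without example i
theta_woi, _ = ridge(np.delete(X, i, axis=0), np.delete(y, i))
print(np.allclose(theta_woi - theta, dtheta))       # True
print(np.allclose(X[i] @ (theta_woi - theta), df))  # True
```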

The expression shows that the influence is bi-linearly related to both prediction error and variance; that is, when examples with high error and variance are removed, the model is expected to change a lot. These ideas are generalized using infinitesimal perturbations [21]. For example, influence functions [8, 32, 31] use a perturbation model $\boldsymbol{\theta}_{*}^{\epsilon_{i}}=\operatorname*{arg\,min}_{\boldsymbol{\theta}}\,\mathcal{L}(\boldsymbol{\theta})-\epsilon_{i}\ell_{i}(\boldsymbol{\theta})$ with a scalar perturbation $\epsilon_{i}\in\mathbb{R}$. By using a quadratic approximation, we get the following influence function,

$\left.\frac{\partial\boldsymbol{\theta}_{*}^{\epsilon_{i}}}{\partial\epsilon_{i}}\right|_{\epsilon_{i}=0}=\mathbf{H}_{*}^{-1}\nabla\ell_{i}(\boldsymbol{\theta}_{*}).$  (3)

This works for a generic differentiable loss function and is closely related to Eq. 2. We can choose other perturbation models, but they often exhibit bi-linear relationships; see App. A for details.

Despite their generality, there remain many open challenges with the local perturbation methods:

1. Influence functions are valid only at a stationary point $\boldsymbol{\theta}_{*}$ where the gradient is assumed to be 0, and extending them to iterates $\boldsymbol{\theta}_{t}$ generated by generic algorithmic steps $\mathcal{A}_{t}$ is non-trivial [14]. This is even more important for deep learning, where we may never reach such a stationary point, for example, due to stochastic training or early stopping [33, 53].

2. Applying influence functions to non-differentiable losses or discrete parameter spaces is difficult because the choice of perturbation model is not always obvious [32].

3. Finally, despite their generality, these measures do not directly reveal the causes of high influence. Does the bi-linear relationship in Eq. 2 hold more generally? If yes, under what conditions? Answers to such questions are currently unknown.

Studies that address these issues are rare in ML; instead, it is more common to simply use heuristic measures. Many such measures have been proposed in recent years, for example, those using derivatives with respect to inputs [23, 2, 38], variations of Cook’s distance [17], prediction errors and/or gradients [3, 51, 42, 40], backtracking of training trajectories [16], or simply retraining [13]. These works, although useful, do not directly address the issues above. Many of these measures are derived without any direct connection to perturbation methods, and they also appear to be unaware of bi-linear relationships such as those in Eq. 2. Our goal here is to address these issues by unifying and generalizing perturbation methods for sensitivity analysis.

3 The Memory-Perturbation Equation (MPE)

We propose the memory-perturbation equation (MPE) to unify, generalize, and understand sensitivity methods in machine learning. We derive the equation by using a property of conjugate Bayesian models which enables us to derive a closed-form expression for the sensitivity. In a Bayesian setting, data examples can be removed by simply dividing their likelihoods out of the posterior [52]. For example, consider a model with prior $p_{0}=p(\boldsymbol{\theta})$ and likelihoods $\tilde{p}_{j}=p(\mathcal{D}_{j}|\boldsymbol{\theta})$, giving rise to a posterior $q_{*}=p(\boldsymbol{\theta}|\mathcal{D})\propto p_{0}\tilde{p}_{1}\tilde{p}_{2}\ldots\tilde{p}_{N}$. To remove $\tilde{p}_{j}$, say for all $j\in\mathcal{M}\subset\mathcal{D}$, we simply divide $q_{*}$ by those $\tilde{p}_{j}$. This is further simplified if we assume a conjugate exponential-family form for $p_{0}$ and $\tilde{p}_{j}$. Then, the division between two distributions is equivalent to a subtraction between their natural parameters. This property yields a closed-form expression for the exact deviation, as stated below.

Theorem 1

Assuming a conjugate exponential-family model, the posterior $q_{*}^{\backslash\mathcal{M}}$ (with natural parameter $\boldsymbol{\lambda}_{*}^{\backslash\mathcal{M}}$) can be written in terms of $q_{*}$ (with natural parameter $\boldsymbol{\lambda}_{*}$), as shown below:

$q_{*}^{\backslash\mathcal{M}}\propto\frac{q_{*}}{\prod_{j\in\mathcal{M}}\tilde{p}_{j}}\quad\implies\quad e^{\langle\boldsymbol{\lambda}_{*}^{\backslash\mathcal{M}},\,\mathbf{T}(\boldsymbol{\theta})\rangle}\propto\frac{e^{\langle\boldsymbol{\lambda}_{*},\,\mathbf{T}(\boldsymbol{\theta})\rangle}}{\prod_{j\in\mathcal{M}}e^{\langle\widetilde{\boldsymbol{\lambda}}_{j},\,\mathbf{T}(\boldsymbol{\theta})\rangle}}\quad\implies\quad\boldsymbol{\lambda}_{*}^{\backslash\mathcal{M}}=\boldsymbol{\lambda}_{*}-\sum_{j\in\mathcal{M}}\widetilde{\boldsymbol{\lambda}}_{j},$  (4)

where all exponential families are defined by using the inner product $\langle\boldsymbol{\lambda},\mathbf{T}(\boldsymbol{\theta})\rangle$ with natural parameter $\boldsymbol{\lambda}$ and sufficient statistics $\mathbf{T}(\boldsymbol{\theta})$. The natural parameter of $\tilde{p}_{j}$ is denoted by $\widetilde{\boldsymbol{\lambda}}_{j}$.

The deviation $\boldsymbol{\lambda}_{*}^{\backslash\mathcal{M}}-\boldsymbol{\lambda}_{*}$ is obtained by simply adding up the contributions $\widetilde{\boldsymbol{\lambda}}_{j}$ for all $j\in\mathcal{M}$. Further explanations and examples are given in App. B, along with some elementary facts about exponential families. We use this result to derive an equation that enables us to estimate the sensitivity of generic algorithms.
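To make Thm. 1 concrete, here is a small self-contained sketch for the Beta-Bernoulli model (the paper works a Beta-Bernoulli example in App. D; the code below is our own reconstruction, using the standard parameterization in which $\text{Beta}(a,b)$ has natural parameter $(a-1,\,b-1)$ and each observation $y_{j}$ contributes $(y_{j},\,1-y_{j})$).

```python
# Our sketch of Thm. 1 for a Beta-Bernoulli model: removing examples in M
# amounts to subtracting their natural-parameter contributions (Eq. 4).
import numpy as np

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=20)          # Bernoulli observations
a0, b0 = 2.0, 3.0                        # Beta(a0, b0) prior

# Natural parameter of the full posterior Beta(a0 + sum y, b0 + N - sum y)
lam = np.array([a0 - 1 + y.sum(), b0 - 1 + (1 - y).sum()])

# Remove M = {0, 5}: subtract each example's contribution (y_j, 1 - y_j)
M = [0, 5]
lam_wo = lam - np.array([sum(y[j] for j in M), sum(1 - y[j] for j in M)])

# Brute force: recompute the posterior on the remaining data
y_wo = np.delete(y, M)
lam_true = np.array([a0 - 1 + y_wo.sum(), b0 - 1 + (1 - y_wo).sum()])
print(np.allclose(lam_wo, lam_true))     # True: the removal is exact
```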

Our derivation builds on the Bayesian learning rule (BLR) [28], which unifies many algorithms by expressing their iterations as inference in conjugate Bayesian models [26]. This is done by reformulating Eq. 1 in a Bayesian setting to find an exponential-family approximation $q_{*}\approx p(\boldsymbol{\theta}|\mathcal{D})\propto e^{-\mathcal{L}(\boldsymbol{\theta})}$. At every iteration $t$, the BLR updates the natural parameter $\boldsymbol{\lambda}_{t}$ of an exponential-family $q_{t}$, which can equivalently be expressed as the posterior of a conjugate model (shown on the right),

$\boldsymbol{\lambda}_{t}\leftarrow(1-\rho)\boldsymbol{\lambda}_{t-1}-\rho\sum_{j=0}^{N}\tilde{\mathbf{g}}_{j}(\boldsymbol{\lambda}_{t-1})\quad\Longleftrightarrow\quad q_{t}\propto\underbrace{\left(q_{t-1}\right)^{1-\rho}(p_{0})^{\rho}}_{\text{Prior}}\prod_{j=1}^{N}\underbrace{e^{\langle-\rho\tilde{\mathbf{g}}_{j}(\boldsymbol{\lambda}_{t-1}),\,\mathbf{T}(\boldsymbol{\theta})\rangle}}_{\text{Likelihood}}$  (5)

where $\tilde{\mathbf{g}}_{j}(\boldsymbol{\lambda})=\mathbf{F}(\boldsymbol{\lambda})^{-1}\nabla_{\boldsymbol{\lambda}}\mathbb{E}_{q}[\ell_{j}(\boldsymbol{\theta})]$ is the natural gradient with respect to $\boldsymbol{\lambda}$ defined using the Fisher information matrix $\mathbf{F}(\boldsymbol{\lambda}_{t})$ of $q_{t}$, and $\rho>0$ is the learning rate. For simplicity, we denote $\ell_{0}(\boldsymbol{\theta})=\mathcal{R}(\boldsymbol{\theta})=-\log p_{0}$, and assume $p_{0}$ to be conjugate. The conjugate model on the right uses a prior and likelihood both of which, by construction, belong to the same exponential family as $q_{t}$. By choosing an appropriate form for $q_{t}$ and making necessary approximations to $\tilde{\mathbf{g}}_{j}$, the BLR can recover many popular algorithms as special cases. For instance, using a Gaussian $q_{t}$, we can recover stochastic gradient descent (SGD), Newton’s method, RMSprop, Adam, etc. For such cases, the conjugate model on the right is often a linear model [25]. These details, along with a summary of the BLR, are included in App. C. Our main idea is to study the sensitivity of all the algorithms covered under the BLR by using the conjugate model in Eq. 5.

Let $q_{t}^{\backslash\mathcal{M}}$ be the posterior obtained with the BLR but without the data in $\mathcal{M}$. We can estimate its natural parameter $\boldsymbol{\lambda}_{t}^{\backslash\mathcal{M}}$ in a similar fashion to Eq. 4, that is, by dividing $q_{t}$ by the likelihood approximations at the current $\boldsymbol{\lambda}_{t}$. This gives us the following estimate of the deviation, obtained by simply adding the natural-gradients for all examples in $\mathcal{M}$,

$\hat{\boldsymbol{\lambda}}_{t}^{\backslash\mathcal{M}}-\boldsymbol{\lambda}_{t}=\rho\sum_{j\in\mathcal{M}}\tilde{\mathbf{g}}_{j}(\boldsymbol{\lambda}_{t}),$  (6)

where $\hat{\boldsymbol{\lambda}}_{t}^{\backslash\mathcal{M}}$ is an estimate of the true $\boldsymbol{\lambda}_{t}^{\backslash\mathcal{M}}$. We call this the memory-perturbation equation (MPE) due to a unique property: the deviation is estimated by a simple addition and is characterized solely by the examples in $\mathcal{M}$. Due to the additive nature of the estimate, examples with larger natural-gradients contribute more to it, and so we expect most of the sensitivity to be explained by just a few examples with the largest natural gradients. This is similar to the representer theorem, where just a few support vectors are sufficient to characterize the decision boundary [29, 47, 10]. Here, such examples can be seen as characterizing the model’s memory because perturbing them can make the model forget its essential knowledge. The phrase memory-perturbation signifies this.

The equation can easily be adapted to handle an arbitrary perturbation. For instance, consider the perturbation $\mathcal{L}(\boldsymbol{\theta})-\sum_{j\in\mathcal{M}}\epsilon_{j}\ell_{j}(\boldsymbol{\theta})$. To estimate its effect, we divide $q_{t}$ by the likelihood approximations raised to the power $\epsilon_{j}$, giving us the following variant,

$\hat{\boldsymbol{\lambda}}_{t}^{\boldsymbol{\epsilon}_{\mathcal{M}}}-\boldsymbol{\lambda}_{t}=\rho\sum_{j\in\mathcal{M}}\epsilon_{j}\tilde{\mathbf{g}}_{j}(\boldsymbol{\lambda}_{t})\qquad\implies\qquad\left.\frac{\partial\hat{\boldsymbol{\lambda}}_{t}^{\boldsymbol{\epsilon}_{\mathcal{M}}}}{\partial\epsilon_{j}}\right|_{\epsilon_{j}=0}=\rho\,\tilde{\mathbf{g}}_{j}(\boldsymbol{\lambda}_{t}),\quad\forall j\in\mathcal{M},$  (7)

where we denote the collection of all $\epsilon_{j}$ in $\mathcal{M}$ by $\boldsymbol{\epsilon}_{\mathcal{M}}$. Setting $\epsilon_{j}=1$ in the left equation reduces it to Eq. 6, which corresponds to removal. The example demonstrates how to adapt the MPE to handle arbitrary perturbations.

3.1 Unifying the existing sensitivity measures as special cases of the MPE

The MPE is a unifying equation from which many existing sensitivity measures can be derived as special cases. We will show three such results. The first result shows that, for conjugate models, the MPE recovers the exact deviations given in Thm. 1. Such models include textbook examples [6], such as mixture models, linear state-space models, and PCA. Below is a formal statement.

Theorem 2

For conjugate exponential-family models, Eq. 4 is obtained as a special case of the MPE in Eq. 6 evaluated at $\boldsymbol{\lambda}_{*}$ of the exact posterior $q_{*}$ when we set $\ell_{j}(\boldsymbol{\theta})=-\log\tilde{p}_{j}$ and $\rho=1$.

The result holds because, for conjugate models, one step of the BLR is equivalent to Bayes’ rule and therefore $\tilde{\mathbf{g}}_{j}(\boldsymbol{\lambda}_{*})=-\widetilde{\boldsymbol{\lambda}}_{j}$ (see [24, Sec. 5.1]). A proof is given in App. D along with an illustrative example on the Beta-Bernoulli model. We note that a recent work [49] also takes inspiration from Bayesian models, but their sensitivity measures lack the property discussed above. See also [15] for a different approach to sensitivity analysis of variational Bayes with a focus on the posterior mean. The result above also justifies setting $\rho$ to 1, a choice we will often resort to.

Our second result is to show that the MPE recovers the influence function by Cook [7].

Theorem 3

For linear regression, Eq. 2 is obtained as a special case of the MPE in Eq. 6 evaluated at $\boldsymbol{\lambda}_{*}$ of the exact posterior $q_{*}=\mathcal{N}(\boldsymbol{\theta}|\boldsymbol{\theta}_{*},\mathbf{H}_{*}^{-1})$.

The proof in App. E relies on two facts: first, the natural parameter is $\boldsymbol{\lambda}_{*}=(\mathbf{H}_{*}\boldsymbol{\theta}_{*},\,-\frac{1}{2}\mathbf{H}_{*})$, and second, the natural gradients for a Gaussian $q$ with mean $\mathbf{m}$ can be written as follows,

$\tilde{\mathbf{g}}_{i}(\boldsymbol{\lambda})=\left(\hat{\mathbf{g}}_{i}-\hat{\mathbf{H}}_{i}\mathbf{m},\,\,\tfrac{1}{2}\hat{\mathbf{H}}_{i}\right),$  (8)

where $\hat{\mathbf{g}}_{i}=\mathbb{E}_{q}[\nabla\ell_{i}(\boldsymbol{\theta})]$ and $\hat{\mathbf{H}}_{i}=\mathbb{E}_{q}[\nabla^{2}\ell_{i}(\boldsymbol{\theta})]$. This is due to [28, Eqs. 10-11], but a proof is given in Eq. 27 of App. C. The theorem then directly follows by plugging $\tilde{\mathbf{g}}_{i}(\boldsymbol{\lambda}_{*})$ into Eq. 6. This derivation is much shorter than the classical techniques, which often require inversion lemmas (see Sec. A.1). The estimated deviations are exact, which is not a surprise because linear regression is a conjugate Gaussian model. However, it is interesting (and satisfying) that the deviation in $\boldsymbol{\theta}_{*}$ naturally emerges from the deviation in $\boldsymbol{\lambda}_{*}$.

Our final result is to recover influence functions for deep learning, specifically Eq. 3. To do so, we use a Gaussian posterior approximation $q_{*}=\mathcal{N}(\boldsymbol{\theta}|\boldsymbol{\theta}_{*},\mathbf{H}_{*}^{-1})$ obtained by using the so-called Laplace’s method [34, 50, 37]. The Laplace posterior can be seen as a special case of the BLR solution when the natural gradient is approximated with the delta method [28, Table 1]. Remarkably, using the same approximation in the MPE, we recover Eq. 3.

Theorem 4

The influence function in Eq. 3 is obtained as a special case of the MPE in Eq. 7 evaluated at $\boldsymbol{\lambda}_{*}$ of the posterior $q_{*}=\mathcal{N}(\boldsymbol{\theta}|\boldsymbol{\theta}_{*},\mathbf{H}_{*}^{-1})$ when we approximate $\tilde{\mathbf{g}}_{i}(\boldsymbol{\lambda})$ of Eq. 8 with the delta method by substituting $\mathbb{E}_{q_{*}}[\nabla\ell_{i}(\boldsymbol{\theta})]\approx\nabla\ell_{i}(\boldsymbol{\theta}_{*})$ and $\mathbb{E}_{q_{*}}[\nabla^{2}\ell_{i}(\boldsymbol{\theta})]\approx\nabla^{2}\ell_{i}(\boldsymbol{\theta}_{*})$.

A proof is in App. F. We note that Eq. 3 can be justified as a Newton step over the perturbed data but in the opposite direction [32, 31]. In a similar fashion, Eqs. 6 and 7 can be seen as natural-gradient steps in the opposite direction. Using natural-gradient descent, as we have shown, recovers a variety of existing perturbation methods as special cases.

3.2 Generalizing the perturbation method to estimate sensitivity during training

Influence measures discussed so far assume that the model is already trained and that the loss is differentiable. We now present generalizations that yield new measures which can be applied during training and do not require differentiability of the loss. We will focus on Gaussian $q$, but the derivation can be extended to other posterior forms. The main idea is to specialize Eqs. 6 and 7 to the algorithms covered under the BLR, giving rise to new measures that estimate sensitivity by simply taking a step over the perturbed data but in the opposite direction.

We first discuss the sensitivity of an iteration $t$ of the BLR yielding a Gaussian $q_{t}=\mathcal{N}(\boldsymbol{\theta}|\mathbf{m}_{t},\mathbf{S}_{t}^{-1})$. The natural parameter is the pair $\boldsymbol{\lambda}_{t}=(\mathbf{S}_{t}\mathbf{m}_{t},\,-\frac{1}{2}\mathbf{S}_{t})$. Using Eq. 8 in Eq. 6, we get

$\hat{\mathbf{S}}_{t}^{\backslash\mathcal{M}}\hat{\mathbf{m}}_{t}^{\backslash\mathcal{M}}-\mathbf{S}_{t}\mathbf{m}_{t}=\rho\sum_{j\in\mathcal{M}}\left(\mathbb{E}_{q_{t}}[\nabla\ell_{j}(\boldsymbol{\theta})]-\mathbb{E}_{q_{t}}[\nabla^{2}\ell_{j}(\boldsymbol{\theta})]\mathbf{m}_{t}\right),\qquad\mathbf{S}_{t}-\hat{\mathbf{S}}_{t}^{\backslash\mathcal{M}}=\rho\sum_{j\in\mathcal{M}}\mathbb{E}_{q_{t}}[\nabla^{2}\ell_{j}(\boldsymbol{\theta})].$  (9)

Plugging $\mathbf{S}_{t}$ from the second equation into the first one, we can recover the following expressions,

$\hat{\mathbf{m}}_{t}^{\backslash\mathcal{M}}-\mathbf{m}_{t}=\rho\,\big(\hat{\mathbf{S}}_{t}^{\backslash\mathcal{M}}\big)^{-1}\mathbb{E}_{q_{t}}\Big[\sum_{j\in\mathcal{M}}\nabla\ell_{j}(\boldsymbol{\theta})\Big],\qquad\left.\frac{\partial\hat{\mathbf{m}}_{t}^{\epsilon_{i}}}{\partial\epsilon_{i}}\right|_{\epsilon_{i}=0}=\rho\,\mathbf{S}_{t}^{-1}\mathbb{E}_{q_{t}}\left[\nabla\ell_{i}(\boldsymbol{\theta})\right].$  (10)

For the second equation, we omit the proof, but it is similar to App. F, resulting in preconditioning with $\mathbf{S}_{t}$. For computational ease, we will approximate $\hat{\mathbf{S}}_{t}^{\backslash\mathcal{M}}\approx\mathbf{S}_{t}$ even in the first equation. We will also approximate the expectation at a sample $\boldsymbol{\theta}_{t}\sim q_{t}$ or simply at the mean $\boldsymbol{\theta}_{t}=\mathbf{m}_{t}$. Ultimately, the suggestion is to use $\mathbf{S}_{t}^{-1}\nabla\ell_{i}(\boldsymbol{\theta}_{t})$ as the sensitivity measure, or variations of it, for example, by using a Monte-Carlo average over multiple samples.
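As an illustration, the following sketch computes the suggested measure $\mathbf{S}_{t}^{-1}\nabla\ell_{i}(\boldsymbol{\theta}_{t})$ for logistic regression, using a regularized GGN matrix as a stand-in for the precision $\mathbf{S}_{t}$ (a hedged choice on our part; the paper’s algorithms build $\mathbf{S}_{t}$ during training, see Table 1 below).

```python
# A hedged sketch of the measure S_t^{-1} grad loss_i for logistic
# regression; we approximate S_t by a regularized GGN at the current
# iterate (our choice for illustration, not the paper's exact recipe).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
N, P, delta = 200, 10, 1.0
X = rng.normal(size=(N, P))
y = (X @ rng.normal(size=P) + rng.normal(size=N) > 0).astype(float)

theta = np.zeros(P)                    # any iterate theta_t along training
p = sigmoid(X @ theta)

# Regularized GGN as a stand-in for the precision S_t
S = (X * (p * (1 - p))[:, None]).T @ X + delta * np.eye(P)

# Per-example measure S^{-1} grad loss_i, with grad loss_i = x_i (p_i - y_i)
G = X * (p - y)[:, None]               # rows are per-example gradients
sens = np.linalg.solve(S, G.T).T       # rows are S^{-1} grad loss_i
scores = np.linalg.norm(sens, axis=1)  # one scalar score per example
print(np.argsort(scores)[-5:])         # indices of most sensitive examples
```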

Based on this, a list of algorithms and their corresponding measures is given in Table 1. All of the algorithms can be derived as special instances of the BLR by making specific approximations (see Sec. C.3). The measures are obtained by applying the exact same approximations to Eq. 10. For example, Newton’s method is obtained when $\mathbf{m}_{t}=\boldsymbol{\theta}_{t}$, $\mathbf{S}_{t}=\mathbf{H}_{t-1}$, and expectations are approximated by using the delta method at $\boldsymbol{\theta}_{t}$ (similarly to Thm. 4). With these, we get

$\mathbf{S}_{t}^{-1}\mathbb{E}_{q_{t}}[\nabla\ell_{i}(\boldsymbol{\theta})]\approx\mathbf{H}_{t-1}^{-1}\nabla\ell_{i}(\boldsymbol{\theta}_{t}),$  (11)

which is the measure shown in the first row of the table. In a similar fashion, we can derive measures for other algorithms that use slightly different approximations, leading to different preconditioners. The exact strategy to update the preconditioners is given in Eqs. 31, 32, 33 and 34 of Sec. C.3. For all of them, the sensitivity measure is simply an update step for the $i$’th example but in the opposite direction.

Algorithm | Update | Sensitivity
Newton’s method | $\boldsymbol{\theta}_{t}\leftarrow\boldsymbol{\theta}_{t-1}-\mathbf{H}_{t-1}^{-1}\nabla\mathcal{L}(\boldsymbol{\theta}_{t-1})$ | $\mathbf{H}_{t-1}^{-1}\nabla\ell_{i}(\boldsymbol{\theta}_{t})$
Online Newton (ON) [28] | $\boldsymbol{\theta}_{t}\leftarrow\boldsymbol{\theta}_{t-1}-\rho\,\mathbf{S}_{t}^{-1}\nabla\mathcal{L}(\boldsymbol{\theta}_{t-1})$ | $\mathbf{S}_{t}^{-1}\nabla\ell_{i}(\boldsymbol{\theta}_{t})$
ON (diagonal+minibatch) [28] | $\boldsymbol{\theta}_{t}\leftarrow\boldsymbol{\theta}_{t-1}-\rho\,\mathbf{s}_{t}^{-1}\cdot\hat{\nabla}\mathcal{L}(\boldsymbol{\theta}_{t-1})$ | $\mathbf{s}_{t}^{-1}\cdot\nabla\ell_{i}(\boldsymbol{\theta}_{t})$
iBLR (diagonal+minibatch) [35] | $\mathbf{m}_{t}\leftarrow\mathbf{m}_{t-1}-\rho\,\mathbf{s}_{t}^{-1}\cdot\hat{\nabla}\mathcal{L}(\boldsymbol{\theta}_{t-1})$ | $\mathbf{s}_{t}^{-1}\cdot\nabla\ell_{i}(\boldsymbol{\theta}_{t})$
RMSprop/Adam [30] | $\boldsymbol{\theta}_{t}\leftarrow\boldsymbol{\theta}_{t-1}-\rho\,\mathbf{s}_{t}^{-\frac{1}{2}}\cdot\hat{\nabla}\mathcal{L}(\boldsymbol{\theta}_{t-1})$ | $\mathbf{s}_{t}^{-\frac{1}{2}}\cdot\nabla\ell_{i}(\boldsymbol{\theta}_{t})$
SGD | $\boldsymbol{\theta}_{t}\leftarrow\boldsymbol{\theta}_{t-1}-\rho\,\hat{\nabla}\mathcal{L}(\boldsymbol{\theta}_{t-1})$ | $\nabla\ell_{i}(\boldsymbol{\theta}_{t})$
Table 1: A list of algorithms and their sensitivity measures derived using Eq. 10. The second column gives the update, most of which use preconditioners that are either matrices ($\mathbf{H}_{t},\mathbf{S}_{t}$) or a vector ($\mathbf{s}_{t}$); see the full update equations in Eqs. 31, 32, 33 and 34 in App. C. The third column shows the associated sensitivity measure to perturbation in the $i$’th example, which can be interpreted as a step for the $i$’th example but in the opposite direction. We denote element-wise multiplication between vectors by “$\cdot$” and minibatch gradients by $\hat{\nabla}$. For iBLR, $\boldsymbol{\theta}_{t}$ is either $\mathbf{m}_{t}$ or a sample from $q_{t}$.

Table 1 shows an interplay between the training algorithm and the sensitivity measure. For instance, it suggests that the measure $\mathbf{H}_{t-1}^{-1}\nabla\ell_{i}(\boldsymbol{\theta}_{t})$ is justifiable for Newton’s method but might be inappropriate otherwise. In general, it is more appropriate to use the algorithm’s own preconditioner (if it uses one). The quality of the preconditioner (and therefore the measure) is tied to the quality of the posterior approximation. For example, RMSprop’s preconditioner is not a good estimator of the posterior covariance when the minibatch size is large [27, Thm. 1], so we should not expect it to work well for large minibatches. In contrast, the ON method [28] explicitly builds a good estimate of $\mathbf{S}_{t}$ during training, and we expect it to give better (and more faithful) sensitivity estimates.

For SGD, our approach suggests using the gradient. This agrees well with many existing approaches [40, 42, 51, 3] but also gives a straightforward way to modify them when the training algorithm is changed. For instance, the TracIn approach [42] builds sensitivity estimates during SGD training by tracing $\nabla\ell_{j}(\boldsymbol{\theta}_{t})^{\top}\nabla\ell_{i}(\boldsymbol{\theta}_{t})$ for many examples $i$ and $j$. When the algorithm is switched, say to the ON method, we simply need to trace $\nabla\ell_{j}(\boldsymbol{\theta}_{t})^{\top}\mathbf{S}_{t}^{-1}\nabla\ell_{i}(\boldsymbol{\theta}_{t})$. Such a modification is speculated about in [42, Sec 3.2], and the MPE provides a way to accomplish exactly that. It is also possible to mix and match algorithms with different measures, but caution is required. For example, to use the measure in Eq. 11, say within a first-order method, the algorithm must be modified to build a well-conditioned estimate of the Hessian. This can be tricky and can make the sensitivity measure fragile [5].

Extensions to non-differentiable loss functions and discontinuous parameter spaces are straightforward. For example, when using a Gaussian posterior, the measures in Eq. 10 can be modified to handle non-differentiable loss functions by simply replacing $\mathbb{E}_{q_{t}}[\nabla\ell_{i}(\boldsymbol{\theta})]$ with $\nabla_{\mathbf{m}}\mathbb{E}_{q_{t}}\left[\ell_{i}(\boldsymbol{\theta})\right]$, which is a simple application of the Bonnet theorem [44] (see App. G). The resulting approach is more principled than [32], which uses an ad-hoc smoothing of the non-differentiable loss: the smoothing in our approach is done automatically by the posterior distribution. Handling of discontinuous parameter spaces follows in a similar fashion. For example, binary variables can be handled by measuring the sensitivity through the parameter of a Bernoulli distribution (see App. D).
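For intuition, one possible Monte-Carlo route to such a smoothed gradient (our own sketch; App. G may use a different estimator) is the Gaussian score-function identity $\nabla_{\mathbf{m}}\mathbb{E}_{q}[\ell]=\mathbb{E}_{q}[\boldsymbol{\Sigma}^{-1}(\boldsymbol{\theta}-\mathbf{m})\,\ell(\boldsymbol{\theta})]$, which remains well-defined even for a step-function loss.

```python
# One possible Monte-Carlo estimate (our sketch; App. G may differ) of the
# smoothed gradient grad_m E_q[loss], valid even when the loss itself is
# non-differentiable, via grad_m E_q[loss] = E_q[(theta - m)/sigma2 * loss].
import numpy as np

rng = np.random.default_rng(3)

def loss(theta):
    # a non-differentiable step-function loss on a scalar "margin"
    return (theta.sum(axis=-1) < 0).astype(float)

m = np.array([0.3, -0.1])                # posterior mean
sigma2 = 0.5                             # isotropic posterior variance
samples = m + np.sqrt(sigma2) * rng.normal(size=(100_000, 2))

# Score-function estimate of grad_m E_q[loss(theta)]
grad_m = ((samples - m) / sigma2 * loss(samples)[:, None]).mean(axis=0)
print(grad_m)  # a finite 'gradient' despite the 0/1 loss
```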

3.3 Understanding the causes of high sensitivity estimates for the Gaussian case

The MPE can be used to understand the causes of high sensitivity estimates. We will demonstrate this for a Gaussian $q$, but a similar analysis can be done for other distributions. We find that sensitivity measures derived using Gaussian posteriors generally have two causes of high sensitivity.

To see this, consider a loss $\ell_{i}(\boldsymbol{\theta})=-\log p(y_{i}|\sigma(f_{i}(\boldsymbol{\theta})))$, where $p(y_{i}|\mu)$ is an exponential-family distribution with expectation parameter $\mu$, $f_{i}(\boldsymbol{\theta})$ is the model output for the $i$’th example, and $\sigma(\cdot)$ is an activation function, for example, the softmax function. For such loss functions, the gradient takes a simple form: $\nabla\ell_{i}(\boldsymbol{\theta})=\nabla f_{i}(\boldsymbol{\theta})[\sigma(f_{i}(\boldsymbol{\theta}))-y_{i}]$ [6, Eq. 4.124]. Using this, we can approximate the deviations in model outputs by using a first-order Taylor approximation,

$\underbrace{f_{i}(\boldsymbol{\theta}_{t}^{\backslash i})-f_{i}(\boldsymbol{\theta}_{t})}_{\text{Deviation in the output}}\approx\nabla f_{i}(\boldsymbol{\theta}_{t})^{\top}(\boldsymbol{\theta}_{t}^{\backslash i}-\boldsymbol{\theta}_{t})\approx\underbrace{\nabla f_{i}(\boldsymbol{\theta}_{t})^{\top}\mathbf{H}_{t-1}^{-1}\nabla f_{i}(\boldsymbol{\theta}_{t})}_{=v_{it},\text{ prediction variance}}\,\underbrace{[\sigma(f_{i}(\boldsymbol{\theta}_{t}))-y_{i}]}_{=e_{it},\text{ prediction error}},$  (12)

where we used $\boldsymbol{\theta}_{t}^{\backslash i}-\boldsymbol{\theta}_{t}\approx\mathbf{H}_{t-1}^{-1}\nabla\ell_{i}(\boldsymbol{\theta}_{t})$, which is based on the measure in the first row of Table 1. Similarly to Eq. 2, the deviation in the model output is equal to the product of the prediction error and the (linearized) prediction variance of $f_{i}(\boldsymbol{\theta}_{t})$ [25, 20]. The change in the model output is expected to be high whenever examples with high prediction error and variance are removed.

We can write many such variants with a similar bi-linear relationship. For example, Eq. 12 can be extended to get deviations in predictions as follows:

$\sigma(f_{i}(\boldsymbol{\theta}_{t}^{\backslash i}))-\sigma(f_{i}(\boldsymbol{\theta}_{t}))\approx\sigma^{\prime}(f_{i}(\boldsymbol{\theta}_{t}))\nabla f_{i}(\boldsymbol{\theta}_{t})^{\top}(\boldsymbol{\theta}_{t}^{\backslash i}-\boldsymbol{\theta}_{t})\approx\sigma^{\prime}(f_{i}(\boldsymbol{\theta}_{t}))\,v_{it}e_{it}.$  (13)

Eq. 12 estimates the deviation for one example at a location $\boldsymbol{\theta}_{t}$, but we could also write such estimates for a group of examples and evaluate them at the mean $\mathbf{m}_{t}$ or at any sample $\boldsymbol{\theta}\sim q_{t}$. For example, to remove a group $\mathcal{M}$ of size $M$, we can write the deviation of the model-output vector $\mathbf{f}(\boldsymbol{\theta})\in\mathbb{R}^{M}$,

$\mathbf{f}(\mathbf{m}_{t}^{\backslash\mathcal{M}})-\mathbf{f}(\mathbf{m}_{t})\approx\nabla\mathbf{f}(\mathbf{m}_{t})^{\top}\mathbf{S}_{t}^{-1}\nabla\mathbf{f}(\mathbf{m}_{t})[\sigma(\mathbf{f}(\mathbf{m}_{t}))-\mathbf{y}],$  (14)

where $\mathbf{y}$ is the vector of labels and we used the sensitivity measure in Eq. 10. An example for sparse Gaussian processes is in App. H. The measure for SGD in Table 1 can also be used, which gives $f_{i}(\boldsymbol{\theta}_{t}^{\backslash i})-f_{i}(\boldsymbol{\theta}_{t})\approx\|\nabla f_{i}(\boldsymbol{\theta}_{t})\|^{2}e_{it}$, similar to the scores used in [40]. The list in Table 1 suggests that such scores can be improved by using $\mathbf{H}_{t}$ or $\mathbf{S}_{t}$, essentially replacing the gradient norm by an estimate of the prediction variance. Additional benefit can be obtained by further employing samples from $q_{t}$ instead of using a point estimate $\boldsymbol{\theta}_{t}$ or $\mathbf{m}_{t}$; see an example in App. H.

It is also clear that all of the deviations above can be obtained cheaply during training by using already computed quantities. The estimation does not add significant computational overhead and can be used to efficiently predict the generalization performance during training. For example, using Eq. 12, we can approximate the leave-one-out (LOO) cross-validation (CV) error as follows,

$\text{LOO}(\boldsymbol{\theta}_{t})=\sum_{i=1}^{N}\ell_{i}(\boldsymbol{\theta}_{t}^{\backslash i})=-\sum_{i=1}^{N}\log p(y_{i}|\sigma(f_{i}(\boldsymbol{\theta}_{t}^{\backslash i})))\approx-\sum_{i=1}^{N}\log p(y_{i}|\sigma(f_{i}(\boldsymbol{\theta}_{t})+v_{it}e_{it})).$  (15)

The approximation eliminates the need to train $N$ models to perform CV; rather, it just uses $e_{it}$ and $v_{it}$, which are extremely cheap to compute within algorithms such as ON, RMSprop, and SGD. Leave-group-out (LGO) estimates can also be built, for example, by using Eq. 14, which enables us to understand the effect of leaving out a big chunk of the training data, for example, an entire class in classification. The LOO and LGO estimates are closely related to marginal likelihood and sharpness, both of which are useful to predict generalization performance [22, 12, 19]. Estimates similar to Eq. 15 have been proposed previously [43, 4], but none of them are computed during training.
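A minimal sketch of Eq. 15 for binary logistic regression is given below (our own illustration; `loo_estimate` and its arguments are hypothetical names, and the Hessian here is a regularized GGN rather than whichever approximation the released code uses).

```python
# Our sketch of the LOO estimate in Eq. 15 for binary logistic regression,
# reusing e_i = sigma(f_i) - y_i and v_i = x_i^T H^{-1} x_i; here H is a
# regularized GGN (an assumption, not necessarily the paper's exact choice).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loo_estimate(X, y, theta, delta=1.0):
    f = X @ theta
    p = sigmoid(f)
    e = p - y                                    # prediction errors e_i
    H = (X * (p * (1 - p))[:, None]).T @ X + delta * np.eye(X.shape[1])
    Hinv_X = np.linalg.solve(H, X.T).T           # rows are H^{-1} x_i
    v = np.einsum('ij,ij->i', X, Hinv_X)         # v_i = x_i^T H^{-1} x_i
    p_loo = np.clip(sigmoid(f + v * e), 1e-12, 1 - 1e-12)
    # negative log-likelihood of each label under its own LOO prediction
    return -np.sum(y * np.log(p_loo) + (1 - y) * np.log(1 - p_loo))
```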

4 Experiments

We show experimental results to demonstrate the usefulness of the MPE for understanding the sensitivity of deep-learning models. We show the following: (1) we verify that the estimated deviations (sensitivities) for data removal correlate with the truth; (2) we predict the effect of class removal on the generalization error; (3) we estimate the cross-validation curve for hyperparameter tuning; (4) we predict generalization during training; and (5) we study the evolution of sensitivities during training. All details of the experimental setup are included in App. I, and the code is available at https://github.com/team-approx-bayes/memory-perturbation.

Estimated deviations correlate with the truth: Fig. 2 shows a good correlation between the true deviations $\sigma(f_{i}(\boldsymbol{\theta}_{*}^{\backslash i}))-\sigma(f_{i}(\boldsymbol{\theta}_{*}))$ and their estimates $\sigma^{\prime}(f_{i}(\boldsymbol{\theta}_{*}))v_{i*}e_{i*}$, as given in Eq. 13. We show results for three datasets, each using a different architecture but all trained using SGD. To estimate the Hessian $\mathbf{H}_{*}$ and compute $v_{i*}=\nabla f_{i}(\boldsymbol{\theta}_{*})^{\top}\mathbf{H}_{*}^{-1}\nabla f_{i}(\boldsymbol{\theta}_{*})$, we use a Kronecker-factored (K-FAC) approximation implemented in the laplace [11] and ASDL [39] packages. Each marker represents a data example. The estimate roughly maintains the ranking of examples according to their sensitivity. Below each panel, a histogram of true deviations is included to show that the majority of examples have extremely low sensitivity and most of the large sensitivities are attributed to a small fraction of the data. The high-sensitivity examples often include interesting cases (possibly mislabeled or simply ambiguous), some of which are visualized in each panel along with some low-sensitivity examples to show the contrast. High-sensitivity examples characterize the model’s memory because perturbing them leads to a large change in the model. Similar trends are observed for the removal of groups of examples in Fig. 6 of Sec. I.2.

(a) MLP on MNIST
(b) LeNet on FMNIST
(c) CNN on CIFAR-10
Figure 2: The estimated deviations for example removal correlate well with the true deviations in predictions. Each marker represents an example. For each panel, the histogram at the bottom shows that the majority of examples have low sensitivity and most of the large sensitivities are attributed to a small fraction of data. We show a few images of high and low sensitivity examples from two randomly chosen classes, where we observe the high-sensitivity examples to be more interesting (possibly mislabeled or just ambiguous), while low-sensitivity examples appear more predictable.

Predicting the effect of class removal on generalization: Fig. 3(a) shows that leave-group-out estimates can be used to faithfully predict the test performance even when a whole class is removed. The x-axis shows the test negative log-likelihood (NLL) on a held-out test set, while the y-axis shows the following leave-one-class-out (LOCO) loss on the set $\mathcal{C}$ of a left-out class,

$\text{LOCO}_{\mathcal{C}}(\boldsymbol{\theta}_{*})=\sum_{i\in\mathcal{C}}\ell_{i}(\boldsymbol{\theta}_{*}^{\backslash\mathcal{C}})\approx-\sum_{i\in\mathcal{C}}\log p(y_{i}|\sigma(f_{i}(\boldsymbol{\theta}_{*})+v_{i*}e_{i*})).$

The estimate uses an approximation: $f_{i}(\boldsymbol{\theta}_{*}^{\backslash\mathcal{C}})-f_{i}(\boldsymbol{\theta}_{*})\approx\nabla f_{i}(\boldsymbol{\theta}_{*})^{\top}\mathbf{H}_{*}^{-1}\sum_{j\in\mathcal{C}}\nabla\ell_{j}(\boldsymbol{\theta}_{*})\approx v_{i*}e_{i*}$, which is similar to Eq. 12 but uses an additional approximation $\sum_{j\in\mathcal{C}}\nabla\ell_{j}(\boldsymbol{\theta}_{*})\approx\nabla\ell_{i}(\boldsymbol{\theta}_{*})$ to reduce the computation due to matrix-vector multiplications (we rely on the same K-FAC approximation used in the previous experiment). Results might improve when this approximation is relaxed. We show results for two models: an MLP and LeNet. Each marker corresponds to a specific class whose name is indicated with text. The dashed lines indicate the general trends, showing a good correlation between the truth and the estimate. The classes Shirt and Pullover are the most sensitive, while the classes Bag and Trousers are the least sensitive. A similar result for MNIST is in Fig. 11(d) of Sec. I.3.

(a) Predicting the effect of class removal
(b) Evolution of sensitivities during training
Figure 3: Panel (a) shows, on the x-axis, the test NLL of trained models with a class removed. On the y-axis, we show the respective leave-one-class-out (LOCO) estimates. Each marker corresponds to a specific class removed (text indicates class names). Results for two models on FMNIST are shown. Both show a good correlation between the test NLL and the LOCO estimates; see the dashed lines. Panel (b) shows the evolution of estimated sensitivities during training of LeNet5 on FMNIST. As training progresses, the model becomes more and more sensitive to a small fraction of the data.

(a) MLP on MNIST
(b) LeNet5 on FMNIST
(c) CNN on CIFAR-10
Figure 4: The test NLL (gray) almost perfectly matches the estimated LOO-CV error of Eq. 15 (black). The x-axis shows different values of the $\delta$ parameter of an $L_{2}$-regularizer $\delta\|\boldsymbol{\theta}\|^{2}/2$.

Predicting generalization for hyperparameter tuning: We consider tuning the parameter $\delta$ of an $L_{2}$-regularizer of the form $\delta\|\boldsymbol{\theta}\|^{2}/2$. Fig. 4 shows an almost perfect match between the test NLL and the estimated LOO-CV error of Eq. 15. Additional figures with the test errors visualized on top are included in Fig. 7 of Sec. I.4, where we again see a close match to the LOO-CV curves.

Predicting generalization during training: As discussed earlier, existing influence measures are not designed to analyze sensitivity during training, and care needs to be taken when using ad-hoc strategies. We first show results for our proposed measure in Eq. 10, which gives reliable sensitivity estimates during training. We use the improved-BLR (iBLR) method [35], which estimates the mean $\mathbf{m}_{t}$ and a vector preconditioner $\mathbf{s}_{t}$ during training. We can derive an estimate of the LOO error at the mean $\mathbf{m}_{t}$ following a derivation similar to Eqs. 14 and 15,

$\text{LOO}(\mathbf{m}_{t})\approx-\sum_{i=1}^{N}\log p(y_{i}|\sigma(f_{i}(\mathbf{m}_{t})+v_{it}e_{it})),$  (16)

where $v_{it}=\nabla f_{i}(\mathbf{m}_{t})^{\top}\text{diag}(\mathbf{s}_{t})^{-1}\nabla f_{i}(\mathbf{m}_{t})$ and $e_{it}=\sigma(f_{i}(\mathbf{m}_{t}))-y_{i}$.
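Below is a hedged sketch of how such a training-time LOO estimate could be tracked, using a simplified Online-Newton-style loop with a diagonal preconditioner for logistic regression (our own construction; the paper’s experiments use the iBLR updates of Eqs. 31-34, which differ in detail).

```python
# A hedged sketch (our simplified loop, not the paper's iBLR code) of
# tracking the LOO estimate of Eq. 16 during training with a diagonal
# preconditioner s_t, in the style of the diagonal Online-Newton update.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
N, P, delta, rho = 500, 20, 1.0, 0.1
X = rng.normal(size=(N, P))
y = (X @ rng.normal(size=P) > 0).astype(float)

m = np.zeros(P)
s = np.full(P, delta)                       # diagonal preconditioner s_t
for epoch in range(100):
    p = sigmoid(X @ m)
    g = X.T @ (p - y) + delta * m           # full-batch gradient
    h = (X**2).T @ (p * (1 - p)) + delta    # diagonal GGN of the loss
    s = (1 - rho) * s + rho * h             # preconditioner update
    m = m - rho * g / s                     # preconditioned step

    # Cheap LOO estimate from already-computed quantities (Eq. 16)
    f = X @ m
    e = sigmoid(f) - y                      # e_it
    v = (X**2 / s).sum(axis=1)              # v_it with diag(s_t)^{-1}
    p_loo = np.clip(sigmoid(f + v * e), 1e-12, 1 - 1e-12)
    loo = -(y * np.log(p_loo) + (1 - y) * np.log(1 - p_loo)).sum()
    if epoch % 20 == 0:
        print(epoch, loo / N)
```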

The first panel in Fig. 5 shows a good match between the above LOO estimate and the test NLL. For comparison, in the next two panels, we show results for SGD training using two ad-hoc measures obtained by plugging different Hessian approximations into Eq. 11. The first of these approximates $\mathbf{H}_{t}$ with a diagonal Generalized Gauss-Newton (GGN) matrix, while the second uses a K-FAC approximation. We see that diagonal-GGN-LOO does not work well at all and, while K-FAC-LOO improves on this, it is still not as good as the iBLR result despite using a non-diagonal Hessian approximation. Moreover, the two measures require an additional pass through the data to compute the Hessian approximation, and they also need a careful setting of a damping parameter.

A similar result for iBLR is shown in Fig. 1(b), where we use the larger ResNet–20 on CIFAR10, and more such results are included in Fig. 8 of Sec. I.5. We also find that both diagonal-GGN-LOO and K-FAC-LOO deteriorate further when the model overfits; see Fig. 9. Results for the Adam optimizer are included in Fig. 10, where we again see that using ad-hoc measures may not always work. Overall, these results show the difficulty of estimating sensitivity during training and suggest caution when using measures that are not naturally suited to the training algorithm.

(a) iBLR & LOO of Eq. 16
(b) SGD & diagonal-GGN-LOO
(c) SGD & K-FAC-LOO
Figure 5: We compare the faithfulness of LOO estimates during training for predicting the test NLL. The first panel shows results for iBLR, where a good match is obtained by using the LOO estimate of Eq. 16, which uses a diagonal preconditioner. The next two panels show results for SGD, where we use the LOO estimate of Eq. 15 but with different Hessian approximations. Panel (b) uses a diagonal GGN, which does not work very well. Results improve when K-FAC is used, but they are still not as good as iBLR’s, despite using a non-diagonal Hessian approximation.

Evolution of sensitivities during training: Fig. 3(b) shows the evolution of the sensitivities of examples as training progresses. We use the iBLR algorithm and approximate the deviation as $\sigma(f_{i}(\mathbf{m}_{t}^{\backslash i}))-\sigma(f_{i}(\mathbf{m}_{t}))\approx\sigma^{\prime}(f_{i}(\mathbf{m}_{t}))v_{it}e_{it}$, where $v_{it}$ and $e_{it}$ are obtained similarly to Eq. 16. The x-axis corresponds to examples sorted from least sensitive to most sensitive at convergence. The y-axis shows the histogram of sensitivity estimates. We observe that, as training progresses, the distribution concentrates around a small fraction of the data. At the top, we visualize a few examples with high and low sensitivity estimates, where the high-sensitivity examples include interesting cases (similarly to Fig. 2). The result suggests that the model concentrates more and more on a small fraction of high-sensitivity examples, and therefore such examples can be used to characterize the model’s memory. Additional experiments of this kind are included in Fig. 11 of Sec. I.6, along with other experiment details.

5 Discussion

We present the memory-perturbation equation, built upon the BLR framework. The equation suggests taking a step in the direction of the natural gradient of the perturbed examples. Using the MPE framework, we unify existing influence measures, generalize them to a wide variety of problems, and unravel useful properties regarding sensitivity. We also show that sensitivity estimation can be done cheaply and use this to predict generalization performance. An interesting avenue for future research is to apply the method to larger models and real-world problems. We also need to understand how our generalization measure compares to other methods, such as those considered in [22]. We would also like to understand the effect of various posterior approximations. Another interesting direction is to apply the method to non-Gaussian cases, for example, to study ensemble methods in deep learning with mixture models.

Acknowledgements

This work is supported by the Bayes duality project, JST CREST Grant Number JPMJCR2112.

References

  • [1] Vincent Adam, Paul Chang, Mohammad Emtiyaz Khan, and Arno Solin. Dual Parameterization of Sparse Variational Gaussian Processes. Advances in Neural Information Processing Systems, 2021.
  • [2] Chirag Agarwal, Daniel D’Souza, and Sara Hooker. Estimating Example Difficulty using Variance of Gradients. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022.
  • [3] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A Closer Look at Memorization in Deep Networks. In International Conference on Machine Learning, 2017.
  • [4] Gregor Bachmann, Thomas Hofmann, and Aurélien Lucchi. Generalization Through The Lens of Leave-One-Out Error. In International Conference on Learning Representations, 2022.
  • [5] Samyadeep Basu, Phil Pope, and Soheil Feizi. Influence Functions in Deep Learning Are Fragile. In International Conference on Learning Representations, 2021.
  • [6] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  • [7] R Dennis Cook. Detection of Influential Observation in Linear Regression. Technometrics, 1977.
  • [8] R Dennis Cook and Sanford Weisberg. Characterizations of an Empirical Influence Function for Detecting Influential Cases in Regression. Technometrics, 1980.
  • [9] R Dennis Cook and Sanford Weisberg. Residuals and Influence in Regression. Chapman and Hall, 1982.
  • [10] Corinna Cortes and Vladimir Vapnik. Support-Vector Networks. Machine learning, 1995.
  • [11] Erik Daxberger, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. Laplace Redux-Effortless Bayesian Deep Learning. Advances in Neural Information Processing Systems, 2021.
  • [12] Gintare Karolina Dziugaite and Daniel M Roy. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters Than Training Data. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2017.
  • [13] Vitaly Feldman and Chiyuan Zhang. What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation. In Advances in Neural Information Processing Systems, 2020.
  • [14] Wing K Fung and CW Kwan. A Note on Local Influence Based on Normal Curvature. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 1997.
  • [15] Ryan Giordano, Tamara Broderick, and Michael I Jordan. Covariances, robustness and variational Bayes. Journal of Machine Learning Research, 19(51), 2018.
  • [16] Satoshi Hara, Atsushi Nitanda, and Takanori Maehara. Data Cleansing for Models Trained with SGD. In Advances in Neural Information Processing Systems, 2019.
  • [17] Hrayr Harutyunyan, Alessandro Achille, Giovanni Paolini, Orchid Majumder, Avinash Ravichandran, Rahul Bhotika, and Stefano Soatto. Estimating Informativeness of Samples with Smooth Unique Information. In International Conference on Learning Representations, 2021.
  • [18] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian Processes for Big Data. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2013.
  • [19] Alexander Immer, Matthias Bauer, Vincent Fortuin, Gunnar Rätsch, and Mohammad Emtiyaz Khan. Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning. In International Conference on Machine Learning, 2021.
  • [20] Alexander Immer, Maciej Korzepa, and Matthias Bauer. Improving Predictions of Bayesian Neural Nets via Local Linearization. International Conference on Artificial Intelligence and Statistics, 2021.
  • [21] Louis A Jaeckel. The Infinitesimal Jackknife. Technical report, Bell Lab., 1972.
  • [22] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic Generalization Measures and Where To Find Them. In International Conference on Learning Representations, 2020.
  • [23] Angelos Katharopoulos and Francois Fleuret. Not All Samples Are Created Equal: Deep Learning with Importance Sampling. In International Conference on Machine Learning, 2018.
  • [24] Mohammad Emtiyaz Khan. Variational Bayes Made Easy. Fifth Symposium on Advances in Approximate Bayesian Inference, 2023.
  • [25] Mohammad Emtiyaz Khan, Alexander Immer, Ehsan Abedi, and Maciej Korzepa. Approximate Inference Turns Deep Networks into Gaussian Processes. Advances in Neural Information Processing Systems, 2019.
  • [26] Mohammad Emtiyaz Khan and Wu Lin. Conjugate-Computation Variational Inference: Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models. In International Conference on Artificial Intelligence and Statistics, 2017.
  • [27] Mohammad Emtiyaz Khan, Didrik Nielsen, Voot Tangkaratt, Wu Lin, Yarin Gal, and Akash Srivastava. Fast and Scalable Bayesian Deep Learning by Weight-Perturbation in Adam. In International Conference on Machine Learning, 2018.
  • [28] Mohammad Emtiyaz Khan and Håvard Rue. The Bayesian Learning Rule. Journal of Machine Learning Research, 2023.
  • [29] George S Kimeldorf and Grace Wahba. A Correspondence Between Bayesian Estimation on Stochastic Processes and Smoothing by Splines. The Annals of Mathematical Statistics, 1970.
  • [30] Diederik Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations, 2015.
  • [31] Pang Wei Koh, Kai-Siang Ang, Hubert Teo, and Percy S Liang. On the Accuracy of Influence Functions for Measuring Group Effects. In Advances in Neural Information Processing Systems, 2019.
  • [32] Pang Wei Koh and Percy Liang. Understanding Black-Box Predictions via Influence Functions. In International Conference on Machine Learning, 2017.
  • [33] Aran Komatsuzaki. One Epoch is All You Need. ArXiv e-Prints, 2019.
  • [34] Pierre-Simon Laplace. Mémoires de Mathématique et de Physique. Tome Sixieme, 1774.
  • [35] Wu Lin, Mark Schmidt, and Mohammad Emtiyaz Khan. Handling the Positive-Definite Constraint in the Bayesian Learning Rule. In International Conference on Machine Learning, 2020.
  • [36] Ilya Loshchilov and Frank Hutter. Decoupled Weight Decay Regularization. International Conference on Learning Representations, 2019.
  • [37] David JC MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
  • [38] Roman Novak, Yasaman Bahri, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Sensitivity and Generalization in Neural Networks: An Empirical Study. In International Conference on Learning Representations, 2018.
  • [39] Kazuki Osawa, Satoki Ishikawa, Rio Yokota, Shigang Li, and Torsten Hoefler. ASDL: A Unified Interface for Gradient Preconditioning in PyTorch. In NeurIPS Workshop Order up! The Benefits of Higher-Order Optimization in Machine Learning, 2023.
  • [40] Mansheej Paul, Surya Ganguli, and Gintare Karolina Dziugaite. Deep Learning on a Data Diet: Finding Important Examples Early in Training. In Advances in Neural Information Processing Systems, 2021.
  • [41] Daryl Pregibon. Logistic Regression Diagnostics. The Annals of Statistics, 1981.
  • [42] Garima Pruthi, Frederick Liu, Satyen Kale, and Mukund Sundararajan. Estimating Training Data Influence by Tracing Gradient Descent. In Advances in Neural Information Processing Systems, 2020.
  • [43] Kamiar Rahnama Rad and Arian Maleki. A Scalable Estimate of the Out-of-Sample Prediction Error via Approximate Leave-One-Out Cross-Validation. Journal of the Royal Statistical Society Series B: Statistical Methodology, 2020.
  • [44] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning, 2014.
  • [45] Hugh Salimbeni, Stefanos Eleftheriadis, and James Hensman. Natural Gradients in Practice: Non-Conjugate Variational Inference in Gaussian Process Models. In International Conference on Artificial Intelligence and Statistics, 2018.
  • [46] Frank Schneider, Lukas Balles, and Philipp Hennig. DeepOBS: A Deep Learning Optimizer Benchmark Suite. In International Conference on Learning Representations, 2019.
  • [47] Bernhard Schölkopf, Ralf Herbrich, and Alex J Smola. A Generalized Representer Theorem. In International Conference on Computational Learning Theory, 2001.
  • [48] Saurabh Singh and Shankar Krishnan. Filter Response Normalization Layer: Eliminating Batch Dependence in the Training of Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020.
  • [49] Ryutaro Tanno, Melanie F Pradier, Aditya Nori, and Yingzhen Li. Repairing Neural Networks by Leaving the Right Past Behind. Advances in Neural Information Processing Systems, 2022.
  • [50] Luke Tierney and Joseph B Kadane. Accurate Approximations for Posterior Moments and Marginal Densities. Journal of the American Statistical Association, 1986.
  • [51] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J. Gordon. An Empirical Study of Example Forgetting during Deep Neural Network Learning. In International Conference on Learning Representations, 2019.
  • [52] Robert Weiss. An Approach to Bayesian Sensitivity Analysis. Journal of the Royal Statistical Society Series B: Statistical Methodology, 1996.
  • [53] Fuzhao Xue, Yao Fu, Wangchunshu Zhou, Zangwei Zheng, and Yang You. To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis. ArXiv e-Prints, 2023.
  • [54] Hongtu Zhu, Joseph G. Ibrahim, Sikyum Lee, and Heping Zhang. Perturbation Selection and Influence Measures in Local Influence Analysis. The Annals of Statistics, 2007.

Appendix A Influence Function for Linear Regression

We consider $N$ input-output pairs $(\mathbf{x}_i, y_i)$. The feature matrix containing $\mathbf{x}_i^\top$ as rows is denoted by $\mathbf{X}$ and the output vector of length $N$ is denoted by $\mathbf{y}$. The loss is $\ell_i(\boldsymbol{\theta}) = \frac{1}{2}(y_i - f_i(\boldsymbol{\theta}))^2$ for $f_i(\boldsymbol{\theta}) = \mathbf{x}_i^\top\boldsymbol{\theta}$. The regularizer is assumed to be $\mathcal{R}(\boldsymbol{\theta}) = \delta\|\boldsymbol{\theta}\|^2/2$. The minimizer is given by

\boldsymbol{\theta}_* = \mathbf{H}_*^{-1}\mathbf{X}^\top\mathbf{y}. \qquad (17)

We define a perturbation model as follows, with $\epsilon_i \in \mathbb{R}$:

\boldsymbol{\theta}_*^{\epsilon_i} = \operatorname*{arg\,min}_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}) - \epsilon_i \ell_i(\boldsymbol{\theta}).

For $\epsilon_i = 1$, this corresponds to removal of the example; an arbitrary $\epsilon_i$ simply weights the example accordingly. The solution has a closed-form expression,

\boldsymbol{\theta}_*^{\epsilon_i} = \left( \mathbf{H}_* - \epsilon_i \mathbf{x}_i\mathbf{x}_i^\top \right)^{-1}\left( \mathbf{X}^\top\mathbf{y} - \epsilon_i \mathbf{x}_i y_i \right), \qquad (18)

where $\mathbf{H}_* = \sum_{i=1}^N \mathbf{x}_i\mathbf{x}_i^\top + \delta\mathbf{I}_P$ is the Hessian of $\mathcal{L}(\boldsymbol{\theta})$. We first derive a closed-form expression for $\boldsymbol{\theta}_*^{\epsilon_i} - \boldsymbol{\theta}_*$, and then specialize it for different $\epsilon_i$.

A.1 Derivation of the leave-one-out (LOO) deviation

We denote $\boldsymbol{\Sigma}_* = \mathbf{H}_*^{-1}$ and use the Sherman-Morrison formula to write

\begin{split}
\boldsymbol{\theta}_*^{\epsilon_i} &= \left( \boldsymbol{\Sigma}_* + \frac{\epsilon_i \boldsymbol{\Sigma}_* \mathbf{x}_i \mathbf{x}_i^\top \boldsymbol{\Sigma}_*}{1 - \epsilon_i \mathbf{x}_i^\top \boldsymbol{\Sigma}_* \mathbf{x}_i} \right)\left( \mathbf{X}^\top\mathbf{y} - \epsilon_i y_i \mathbf{x}_i \right) \\
&= \boldsymbol{\Sigma}_*\mathbf{X}^\top\mathbf{y} + \epsilon_i \boldsymbol{\Sigma}_* \mathbf{x}_i \left[ \frac{\mathbf{x}_i^\top \boldsymbol{\Sigma}_* \mathbf{X}^\top\mathbf{y}}{1 - \epsilon_i \mathbf{x}_i^\top \boldsymbol{\Sigma}_* \mathbf{x}_i} - \frac{\epsilon_i y_i \mathbf{x}_i^\top \boldsymbol{\Sigma}_* \mathbf{x}_i}{1 - \epsilon_i \mathbf{x}_i^\top \boldsymbol{\Sigma}_* \mathbf{x}_i} - y_i \right] \\
&= \boldsymbol{\theta}_* + \epsilon_i \boldsymbol{\Sigma}_* \mathbf{x}_i \left[ \frac{\mathbf{x}_i^\top\boldsymbol{\theta}_*}{1 - \epsilon_i v_i} - \frac{\epsilon_i y_i v_i}{1 - \epsilon_i v_i} - y_i \right] \\
&= \boldsymbol{\theta}_* + \epsilon_i \boldsymbol{\Sigma}_* \mathbf{x}_i \left[ \frac{\mathbf{x}_i^\top\boldsymbol{\theta}_* - y_i}{1 - \epsilon_i v_i} \right] = \boldsymbol{\theta}_* + \boldsymbol{\Sigma}_* \mathbf{x}_i \frac{\epsilon_i e_i}{1 - \epsilon_i v_i}.
\end{split} \qquad (19)

In the third line we substitute $v_i = \mathbf{x}_i^\top\boldsymbol{\Sigma}_*\mathbf{x}_i$ and $\boldsymbol{\theta}_* = \boldsymbol{\Sigma}_*\mathbf{X}^\top\mathbf{y}$, and in the last step we use $e_i = \mathbf{x}_i^\top\boldsymbol{\theta}_* - y_i$.

We define $e_i^{\backslash i} = e_i/(1 - v_i)$, which is the prediction error of $\boldsymbol{\theta}_*^{\backslash i}$,

e_i^{\backslash i} = \mathbf{x}_i^\top\boldsymbol{\theta}_*^{\backslash i} - y_i = \mathbf{x}_i^\top\left( \boldsymbol{\theta}_* + \boldsymbol{\Sigma}_*\mathbf{x}_i \frac{e_i}{1 - v_i} \right) - y_i = \mathbf{x}_i^\top\boldsymbol{\theta}_* + \frac{v_i}{1 - v_i}e_i - y_i = \frac{e_i}{1 - v_i}. \qquad (20)

Therefore, we get the following expressions for the deviation,

\boldsymbol{\theta}_*^{\backslash i} - \boldsymbol{\theta}_* = \boldsymbol{\Sigma}_*\mathbf{x}_i e_i^{\backslash i}, \qquad f_i(\boldsymbol{\theta}_*^{\backslash i}) - f_i(\boldsymbol{\theta}_*) = v_i e_i^{\backslash i}.

These expressions can be written in the form of Eq. 2 by left-multiplying with $\boldsymbol{\Sigma}_*^{-1} = \mathbf{H}_*^{\backslash i} + \mathbf{x}_i\mathbf{x}_i^\top$,

(\mathbf{H}_*^{\backslash i} + \mathbf{x}_i\mathbf{x}_i^\top)(\boldsymbol{\theta}_*^{\backslash i} - \boldsymbol{\theta}_*) = \mathbf{x}_i(\mathbf{x}_i^\top\boldsymbol{\theta}_*^{\backslash i} - y_i) \;\Rightarrow\; \boldsymbol{\theta}_*^{\backslash i} - \boldsymbol{\theta}_* = (\mathbf{H}_*^{\backslash i})^{-1}\mathbf{x}_i e_i.
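Since linear regression admits an exact refit, the LOO deviation above can be checked numerically. The following is a minimal sketch, assuming synthetic data and an illustrative regularizer $\delta$; it verifies that $(\mathbf{H}_*^{\backslash i})^{-1}\mathbf{x}_i e_i$ matches the exact deviation from retraining without example $i$.

```python
# Minimal check of the LOO deviation for ridge regression; data is synthetic.
import numpy as np

rng = np.random.default_rng(0)
N, P, delta, i = 50, 3, 1.0, 7
X = rng.standard_normal((N, P))
y = X @ rng.standard_normal(P) + 0.1 * rng.standard_normal(N)

H = X.T @ X + delta * np.eye(P)            # Hessian H_*
theta = np.linalg.solve(H, X.T @ y)        # minimizer theta_* (Eq. 17)
e_i = X[i] @ theta - y[i]                  # prediction error e_i

# Closed-form deviation from the derivation above
H_loo = H - np.outer(X[i], X[i])           # H_*^{\i}
deviation = np.linalg.solve(H_loo, X[i] * e_i)

# Exact refit without example i
X_loo, y_loo = np.delete(X, i, axis=0), np.delete(y, i)
theta_loo = np.linalg.solve(X_loo.T @ X_loo + delta * np.eye(P), X_loo.T @ y_loo)

assert np.allclose(theta_loo - theta, deviation)
```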

A.2 Derivation of the infinitesimal perturbation approach

We differentiate $\boldsymbol{\theta}_*^{\epsilon_i}$ in Eq. 19 to get

\frac{\partial\boldsymbol{\theta}_*^{\epsilon_i}}{\partial\epsilon_i} = \boldsymbol{\Sigma}_*\mathbf{x}_i \frac{e_i}{(1 - \epsilon_i v_i)^2}, \qquad (21)

yielding the following expressions:

\left.\frac{\partial\boldsymbol{\theta}_*^{\epsilon_i}}{\partial\epsilon_i}\right|_{\epsilon_i=0} = \boldsymbol{\Sigma}_*\mathbf{x}_i e_i, \qquad \left.\frac{\partial f_i(\boldsymbol{\theta}_*^{\epsilon_i})}{\partial\epsilon_i}\right|_{\epsilon_i=0} = \mathbf{x}_i^\top \left.\frac{\partial\boldsymbol{\theta}_*^{\epsilon_i}}{\partial\epsilon_i}\right|_{\epsilon_i=0} = \mathbf{x}_i^\top\boldsymbol{\Sigma}_*\mathbf{x}_i e_i = v_i e_i. \qquad (22)

The second equation in Eq. 22 follows from the chain rule. We obtain a bilinear relationship of the influence measure in terms of $v_i$ and the prediction error $e_i$. It is also possible to evaluate Eq. 21 at $\epsilon_i = 1$, representing an infinitesimal perturbation about the LOO estimate: $\left.\partial\boldsymbol{\theta}_*^{\epsilon_i}/\partial\epsilon_i\right|_{\epsilon_i=1} = \boldsymbol{\Sigma}_*^{\backslash i}\mathbf{x}_i e_i^{\backslash i}$. From this and Eq. 22, we can interpret Eq. 2 as the average derivative over the interval $\epsilon_i \in [0,1]$ [9], or as the derivative evaluated at some $0 < \epsilon_i < 1$ (via an application of the mean value theorem) [41].
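The derivative in Eq. 22 can likewise be verified with finite differences. Below is a small sketch under a synthetic setup (all values illustrative), comparing a finite-difference estimate of $\partial\boldsymbol{\theta}_*^{\epsilon_i}/\partial\epsilon_i$ at $\epsilon_i = 0$ against $\boldsymbol{\Sigma}_*\mathbf{x}_i e_i$.

```python
# Finite-difference check of Eq. 22 for ridge regression; data is synthetic.
import numpy as np

rng = np.random.default_rng(1)
N, P, delta, i, eps = 40, 3, 1.0, 5, 1e-6
X = rng.standard_normal((N, P))
y = X @ rng.standard_normal(P) + 0.1 * rng.standard_normal(N)

H = X.T @ X + delta * np.eye(P)
Sigma = np.linalg.inv(H)
theta = Sigma @ X.T @ y
e_i = X[i] @ theta - y[i]

def solve_weighted(eps_i):
    # argmin L(theta) - eps_i * ell_i(theta), in closed form (Eq. 18)
    H_eps = H - eps_i * np.outer(X[i], X[i])
    return np.linalg.solve(H_eps, X.T @ y - eps_i * X[i] * y[i])

fd = (solve_weighted(eps) - solve_weighted(0.0)) / eps   # finite difference
assert np.allclose(fd, Sigma @ X[i] * e_i, atol=1e-4)
```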

Appendix B Conjugate Exponential-Family Models

Exponential-family distributions take the following form:

q = h(\boldsymbol{\theta})\exp\left[ \langle\boldsymbol{\lambda}, \mathbf{T}(\boldsymbol{\theta})\rangle - A(\boldsymbol{\lambda}) \right],

where $\boldsymbol{\lambda} \in \Omega$ is the natural (or canonical) parameter for which the cumulant (or log-partition) function $A(\boldsymbol{\lambda})$ is finite, strictly convex, and differentiable over $\Omega$. The quantity $\mathbf{T}(\boldsymbol{\theta})$ is the sufficient statistics, $\langle\cdot,\cdot\rangle$ is an inner product, and $h(\boldsymbol{\theta})$ is some function. A popular example is the Gaussian distribution, which can be rearranged into an exponential-family form written in terms of the precision matrix $\mathbf{S} = \boldsymbol{\Sigma}^{-1}$,

\begin{split}
\mathcal{N}(\boldsymbol{\theta}|\mathbf{m}, \boldsymbol{\Sigma}) &= |2\pi\boldsymbol{\Sigma}|^{-\frac{1}{2}} \exp\left[ -\tfrac{1}{2}(\boldsymbol{\theta} - \mathbf{m})^\top\boldsymbol{\Sigma}^{-1}(\boldsymbol{\theta} - \mathbf{m}) \right] \\
&= \exp\left[ \boldsymbol{\theta}^\top\mathbf{S}\mathbf{m} - \tfrac{1}{2}\boldsymbol{\theta}^\top\mathbf{S}\boldsymbol{\theta} - \tfrac{1}{2}\left( \mathbf{m}^\top\mathbf{S}\mathbf{m} + \log|2\pi\mathbf{S}^{-1}| \right) \right].
\end{split}

From this, we can read off the quantities that define the exponential-family form,

\boldsymbol{\lambda} = (\mathbf{S}\mathbf{m},\, -\tfrac{1}{2}\mathbf{S}), \quad \mathbf{T}(\boldsymbol{\theta}) = (\boldsymbol{\theta},\, \boldsymbol{\theta}\boldsymbol{\theta}^\top), \quad A(\boldsymbol{\lambda}) = \tfrac{1}{2}\left( \mathbf{m}^\top\mathbf{S}\mathbf{m} + \log|2\pi\mathbf{S}^{-1}| \right), \quad h(\boldsymbol{\theta}) = 1. \qquad (23)

Both the natural parameter and the sufficient statistics consist of two elements. The inner product for the first element is the standard dot product, giving the $\boldsymbol{\theta}^\top\mathbf{S}\mathbf{m}$ term, while for the second element it is a trace, which gives $-\mathrm{Tr}(\boldsymbol{\theta}\boldsymbol{\theta}^\top\mathbf{S}/2) = -\tfrac{1}{2}\boldsymbol{\theta}^\top\mathbf{S}\boldsymbol{\theta}$.

Conjugate exponential-family models are those where both the likelihoods and the prior can be expressed in the same exponential-family form with respect to $\boldsymbol{\theta}$. For instance, in linear regression, both the likelihood and the prior take a Gaussian form with respect to $\boldsymbol{\theta}$,

\begin{split}
\tilde{p}_i = p(y_i|\mathbf{x}_i, \boldsymbol{\theta}) &= \mathcal{N}(y_i|\mathbf{x}_i^\top\boldsymbol{\theta}, 1) \propto \exp\left[ \boldsymbol{\theta}^\top\mathbf{x}_i y_i - \tfrac{1}{2}\boldsymbol{\theta}^\top\mathbf{x}_i\mathbf{x}_i^\top\boldsymbol{\theta} \right] \\
p_0 = p(\boldsymbol{\theta}) &= \mathcal{N}(\boldsymbol{\theta}|\mathbf{0}, \mathbf{I}/\delta) \propto \exp\left[ -\tfrac{1}{2}\boldsymbol{\theta}^\top(\delta\mathbf{I})\boldsymbol{\theta} \right].
\end{split}

Note that $\tilde{p}_i$ is a distribution over $y_i$, but it can also be expressed in an (unnormalized) Gaussian form with respect to $\boldsymbol{\theta}$. The sufficient statistics of both $\tilde{p}_i$ and $p_0$ correspond to those of a Gaussian distribution. Therefore, the posterior is also a Gaussian,

\begin{split}
q_* = p(\boldsymbol{\theta}|\mathcal{D}) &\propto p_0\tilde{p}_1\tilde{p}_2\cdots\tilde{p}_N \\
&= \exp\left[ -\tfrac{1}{2}\boldsymbol{\theta}^\top(\delta\mathbf{I})\boldsymbol{\theta} \right]\prod_{i=1}^N \exp\left[ \boldsymbol{\theta}^\top\mathbf{x}_i y_i - \tfrac{1}{2}\boldsymbol{\theta}^\top\mathbf{x}_i\mathbf{x}_i^\top\boldsymbol{\theta} \right] \\
&= \exp\left[ \boldsymbol{\theta}^\top\sum_{i=1}^N \mathbf{x}_i y_i - \tfrac{1}{2}\boldsymbol{\theta}^\top\left( \delta\mathbf{I} + \sum_{i=1}^N \mathbf{x}_i\mathbf{x}_i^\top \right)\boldsymbol{\theta} \right] \\
&= \exp\left[ \boldsymbol{\theta}^\top\mathbf{H}_*\boldsymbol{\theta}_* - \tfrac{1}{2}\boldsymbol{\theta}^\top\mathbf{H}_*\boldsymbol{\theta} \right] \\
&\propto \mathcal{N}(\boldsymbol{\theta}|\boldsymbol{\theta}_*, \mathbf{H}_*^{-1}).
\end{split}

The step introducing $\mathbf{H}_*\boldsymbol{\theta}_*$ follows because $\boldsymbol{\theta}_* = \mathbf{H}_*^{-1}\mathbf{X}^\top\mathbf{y}$, as shown in Eq. 17.

These computations can be written as conjugate computations [26], where we simply add the natural parameters,

\begin{split}
\tilde{p}_i &\propto \exp\left[ \langle\widetilde{\boldsymbol{\lambda}}_i, \mathbf{T}(\boldsymbol{\theta})\rangle \right], \text{ where } \widetilde{\boldsymbol{\lambda}}_i = (\mathbf{x}_i y_i,\, -\tfrac{1}{2}\mathbf{x}_i\mathbf{x}_i^\top) \\
p_0 &\propto \exp\left[ \langle\boldsymbol{\lambda}_0, \mathbf{T}(\boldsymbol{\theta})\rangle \right], \text{ where } \boldsymbol{\lambda}_0 = (\mathbf{0},\, -\tfrac{1}{2}\delta\mathbf{I}) \\
\implies q_* &\propto \exp\left[ \langle\boldsymbol{\lambda}_*, \mathbf{T}(\boldsymbol{\theta})\rangle \right], \text{ where } \boldsymbol{\lambda}_* = \boldsymbol{\lambda}_0 + \sum_{i=1}^N \widetilde{\boldsymbol{\lambda}}_i = \left( \mathbf{H}_*\boldsymbol{\theta}_*,\, -\tfrac{1}{2}\mathbf{H}_* \right).
\end{split}

In the same fashion, to remove the contributions of certain likelihoods, we can simply subtract their natural parameters from $\boldsymbol{\lambda}_*$. These are the calculations which give rise to the following equation:

q_*^{\backslash\mathcal{M}} \propto \frac{q_*}{\prod_{j\in\mathcal{M}}\tilde{p}_j} \implies e^{\langle\mathbf{T}(\boldsymbol{\theta}),\,\boldsymbol{\lambda}_*^{\backslash\mathcal{M}}\rangle} \propto \frac{e^{\langle\mathbf{T}(\boldsymbol{\theta}),\,\boldsymbol{\lambda}_*\rangle}}{\prod_{j\in\mathcal{M}} e^{\langle\mathbf{T}(\boldsymbol{\theta}),\,\widetilde{\boldsymbol{\lambda}}_j\rangle}} \implies \boldsymbol{\lambda}_*^{\backslash\mathcal{M}} = \boldsymbol{\lambda}_* - \sum_{j\in\mathcal{M}}\widetilde{\boldsymbol{\lambda}}_j.
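For the linear-regression model above, this natural-parameter arithmetic is exact and easy to check numerically. The sketch below (synthetic data, illustrative $\delta$) removes a group $\mathcal{M}$ by subtracting the $\widetilde{\boldsymbol{\lambda}}_j$ and compares against a direct refit.

```python
# Removal by natural-parameter subtraction for conjugate linear regression.
import numpy as np

rng = np.random.default_rng(2)
N, P, delta = 30, 3, 1.0
X = rng.standard_normal((N, P))
y = X @ rng.standard_normal(P) + 0.1 * rng.standard_normal(N)
M = [2, 5, 11]                                   # examples to remove

# lambda_* = lambda_0 + sum_i lambda_tilde_i, with the Gaussian pairing
lam1 = X.T @ y                                   # first element: sum_i x_i y_i
lam2 = -0.5 * (delta * np.eye(P) + X.T @ X)      # second element: -H_*/2

# Subtract the natural parameters of the removed likelihoods
for j in M:
    lam1 -= X[j] * y[j]
    lam2 -= -0.5 * np.outer(X[j], X[j])

H_loo = -2 * lam2                                # recover H and theta from lambda
theta_loo = np.linalg.solve(H_loo, lam1)

# Compare with a direct refit on the reduced dataset
keep = np.setdiff1d(np.arange(N), M)
H_ref = delta * np.eye(P) + X[keep].T @ X[keep]
assert np.allclose(theta_loo, np.linalg.solve(H_ref, X[keep].T @ y[keep]))
```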

Appendix C The Bayesian Learning Rule

The Bayesian learning rule (BLR) aims to find a posterior approximation $q(\boldsymbol{\theta}) \approx p(\boldsymbol{\theta}|\mathcal{D}) \propto e^{-\mathcal{L}(\boldsymbol{\theta})}$. Often, one considers a regular, minimal exponential family $q \in \mathcal{Q}$, for example, the class of Gaussian distributions. The approximation is found by optimizing a generalized Bayesian objective,

q_* = \operatorname*{arg\,min}_{q\in\mathcal{Q}} \;\mathbb{E}_q\left[ \mathcal{L}(\boldsymbol{\theta}) \right] - \mathcal{H}(q),

where $\mathcal{H}(q) = \mathbb{E}_q[-\log q(\boldsymbol{\theta})]$ is the entropy of $q$ and $\mathcal{Q}$ is the class of exponential-family approximations. The objective is equivalent to the Evidence Lower Bound (ELBO) when $\mathcal{L}(\boldsymbol{\theta})$ corresponds to the negative log-joint probability of a Bayesian model; see [28, Sec. 1.2].

The BLR uses natural-gradient descent to find $q_*$, where each iteration $t$ takes the following form,

\boldsymbol{\lambda}_t \leftarrow \boldsymbol{\lambda}_{t-1} - \rho\,\mathbf{F}(\boldsymbol{\lambda}_{t-1})^{-1}\left.\frac{\partial}{\partial\boldsymbol{\lambda}}\left[ \mathbb{E}_q\left[ \mathcal{L}(\boldsymbol{\theta}) \right] - \mathcal{H}(q) \right]\right|_{\boldsymbol{\lambda}=\boldsymbol{\lambda}_{t-1}}, \qquad (24)

where $\rho > 0$ is the learning rate. The gradient is computed with respect to $\boldsymbol{\lambda}$ (through $q$), and we scale the gradient by the Fisher Information Matrix (FIM) defined as follows,

\mathbf{F}(\boldsymbol{\lambda}) = \mathbb{E}_q\left[ (\nabla_{\boldsymbol{\lambda}}\log q)(\nabla_{\boldsymbol{\lambda}}\log q)^\top \right] = \nabla_{\boldsymbol{\lambda}}^2 A(\boldsymbol{\lambda}).

The second equality shows that, for exponential-family distributions, the above FIM is also the second derivative of the log-partition function $A(\boldsymbol{\lambda})$.

C.1 The BLR of Eq. 5

The BLR in Eq. 5 is obtained by simplifying the natural-gradient using the following identity,

\mathbf{F}(\boldsymbol{\lambda})^{-1}\nabla_{\boldsymbol{\lambda}}\mathbb{E}_q(\cdot) \;=\; \left.\nabla_{\boldsymbol{\mu}}\mathbb{E}_q(\cdot)\right|_{\boldsymbol{\mu}=\nabla_{\boldsymbol{\lambda}}A(\boldsymbol{\lambda})}, \qquad (25)

where $\boldsymbol{\mu}$ is the expectation parameter. The identity works because of the minimality of the exponential family, which ensures that there is a one-to-one mapping between $\boldsymbol{\lambda}$ and $\boldsymbol{\mu}$, and also that the FIM is invertible. Using this, we can show that the natural gradient of $\mathcal{H}(q)$ is simply equal to $-\boldsymbol{\lambda}$; see [28, App. B]. Defining $\ell_0(\boldsymbol{\theta}) = \mathcal{R}(\boldsymbol{\theta})$, we get the version of the BLR shown in Eq. 5,

\boldsymbol{\lambda}_t \leftarrow (1-\rho)\boldsymbol{\lambda}_{t-1} - \rho\sum_{j=0}^N \tilde{\mathbf{g}}_j(\boldsymbol{\lambda}_{t-1}), \text{ where } \tilde{\mathbf{g}}_j(\boldsymbol{\lambda}_{t-1}) = \left.\nabla_{\boldsymbol{\mu}}\mathbb{E}_q[\ell_j(\boldsymbol{\theta})]\right|_{\boldsymbol{\mu}=\nabla_{\boldsymbol{\lambda}}A(\boldsymbol{\lambda}_{t-1})}.

C.2 The conjugate-model form of the BLR given in Eq. 5

To express the update in terms of the posterior of a conjugate model, we simply take the inner product with $\mathbf{T}(\boldsymbol{\theta})$ and take the exponential to write the update as

\underbrace{e^{\langle\boldsymbol{\lambda}_t, \mathbf{T}(\boldsymbol{\theta})\rangle}}_{\propto q_t} \leftarrow \Big( \underbrace{e^{\langle\boldsymbol{\lambda}_{t-1}, \mathbf{T}(\boldsymbol{\theta})\rangle}}_{\propto q_{t-1}} \Big)^{1-\rho}\Big( \underbrace{e^{\langle-\tilde{\mathbf{g}}_0(\boldsymbol{\lambda}_{t-1}), \mathbf{T}(\boldsymbol{\theta})\rangle}}_{\propto p_0} \Big)^{\rho}\prod_{j=1}^N e^{\langle-\rho\tilde{\mathbf{g}}_j(\boldsymbol{\lambda}_{t-1}), \mathbf{T}(\boldsymbol{\theta})\rangle}. \qquad (26)

The simplification of the second factor on the right-hand side to $p_0$ happens when $p_0$ is a conjugate prior, that is, $p_0 \propto \exp(\langle\boldsymbol{\lambda}_0, \mathbf{T}(\boldsymbol{\theta})\rangle)$ for some $\boldsymbol{\lambda}_0$ (see an example in App. B, where we show that the $L_2$ regularizer leads to such a choice). In such cases, we can simplify,

\langle-\tilde{\mathbf{g}}_0(\boldsymbol{\lambda}), \mathbf{T}(\boldsymbol{\theta})\rangle = \langle\nabla_{\boldsymbol{\mu}}\mathbb{E}_q[\log p_0], \mathbf{T}(\boldsymbol{\theta})\rangle = \langle\nabla_{\boldsymbol{\mu}}\langle\boldsymbol{\lambda}_0, \boldsymbol{\mu}\rangle, \mathbf{T}(\boldsymbol{\theta})\rangle = \langle\boldsymbol{\lambda}_0, \mathbf{T}(\boldsymbol{\theta})\rangle = \log p_0 + \text{const.}

Using this in Eq. 26, we recover the conjugate model given in Eq. 5.

C.3 BLR for a Gaussian $q$ and the Variational Online Newton (VON) algorithm

By choosing an appropriate form for $q_t$ and making necessary approximations to $\tilde{\mathbf{g}}_j$, the BLR can recover many popular algorithms as special cases. We now give a few examples for the case of a Gaussian $q_t = \mathcal{N}(\boldsymbol{\theta}|\mathbf{m}_t, \boldsymbol{\Sigma}_t)$, which enables the derivation of various first- and second-order optimization algorithms, such as Newton's method, RMSprop, Adam, and SGD.

As shown in Eq. 23, for a Gaussian $\mathcal{N}(\boldsymbol{\theta}|\mathbf{m}, \boldsymbol{\Sigma})$, the natural parameter and sufficient statistics are shown below, along with the expectation parameters $\boldsymbol{\mu} = \mathbb{E}_q[\mathbf{T}(\boldsymbol{\theta})]$:

\boldsymbol{\lambda} = (\mathbf{S}\mathbf{m},\, -\tfrac{1}{2}\mathbf{S}), \quad \mathbf{T}(\boldsymbol{\theta}) = (\boldsymbol{\theta},\, \boldsymbol{\theta}\boldsymbol{\theta}^\top), \quad \boldsymbol{\mu} = (\mathbf{m},\, \mathbf{m}\mathbf{m}^\top + \boldsymbol{\Sigma}).

Using these, we can write the natural gradients as gradients with respect to $\boldsymbol{\mu}$, and then use the chain rule to express them as gradients with respect to $\mathbf{m}$ and $\boldsymbol{\Sigma}$,

\tilde{\mathbf{g}}_j(\boldsymbol{\lambda}) = \nabla_{\boldsymbol{\mu}}\mathbb{E}_q[\ell_j(\boldsymbol{\theta})] = \begin{pmatrix} \nabla_{\mathbf{m}}\mathbb{E}_q[\ell_j(\boldsymbol{\theta})] \\ \nabla_{\mathbf{m}\mathbf{m}^\top+\boldsymbol{\Sigma}}\mathbb{E}_q[\ell_j(\boldsymbol{\theta})] \end{pmatrix} = \begin{pmatrix} \hat{\mathbf{g}}_j - \hat{\mathbf{H}}_j\mathbf{m} \\ \tfrac{1}{2}\hat{\mathbf{H}}_j \end{pmatrix}, \qquad (27)

where in the last equation we define two quantities written in terms of $\nabla\ell_j(\boldsymbol{\theta})$ and $\nabla^2\ell_j(\boldsymbol{\theta})$ by using Price's and Bonnet's theorems [44],

\hat{\mathbf{g}}_j = \nabla_{\mathbf{m}}\mathbb{E}_q[\ell_j(\boldsymbol{\theta})] = \mathbb{E}_q[\nabla\ell_j(\boldsymbol{\theta})], \qquad \hat{\mathbf{H}}_j = 2\nabla_{\boldsymbol{\Sigma}}\mathbb{E}_q[\ell_j(\boldsymbol{\theta})] = \mathbb{E}_q[\nabla^2\ell_j(\boldsymbol{\theta})]. \qquad (28)

Plugging these into the BLR update gives us the following update,

\mathbf{S}_t\mathbf{m}_t \leftarrow (1-\rho)\mathbf{S}_{t-1}\mathbf{m}_{t-1} + \rho\sum_{j=0}^N\left( \hat{\mathbf{H}}_{j,t-1}\mathbf{m}_{t-1} - \hat{\mathbf{g}}_{j,t-1} \right), \qquad \mathbf{S}_t \leftarrow (1-\rho)\mathbf{S}_{t-1} + \rho\sum_{j=0}^N \hat{\mathbf{H}}_{j,t-1},

where $\hat{\mathbf{g}}_{j,t-1}$ and $\hat{\mathbf{H}}_{j,t-1}$ are quantities similar to before but now evaluated at $q_{t-1}$. The conjugate model can be written as follows,

q_t \propto e^{\boldsymbol{\theta}^\top\mathbf{S}_t\mathbf{m}_t - \frac{1}{2}\boldsymbol{\theta}^\top\mathbf{S}_t\boldsymbol{\theta}} \;\propto\; (q_{t-1})^{1-\rho}(p_0)^{\rho}\prod_{j=1}^N e^{\boldsymbol{\theta}^\top\hat{\mathbf{i}}_{j,t-1} - \frac{1}{2}\boldsymbol{\theta}^\top\hat{\mathbf{I}}_{j,t-1}\boldsymbol{\theta}}.

The prior above is Gaussian and defined using $q_{t-1}$ and $p_0$. The model uses likelihoods that are Gaussian distributions with information vector $\hat{\mathbf{i}}_{j,t-1} = \rho(\hat{\mathbf{H}}_{j,t-1}\mathbf{m}_{t-1} - \hat{\mathbf{g}}_{j,t-1})$ and information matrix $\hat{\mathbf{I}}_{j,t-1} = \rho\hat{\mathbf{H}}_{j,t-1}$. The likelihood is allowed to be an improper distribution, meaning that its integral is not one. This is not a problem as long as $\mathbf{S}_t$ remains positive-definite. A valid $\mathbf{S}_t$ can be ensured either by using a Generalized Gauss-Newton approximation to the Hessian [27] or by using the improved BLR of [35]. The former strategy is used in [25] to express BLR iterations as linear models and Gaussian processes. Ultimately, we want to ensure that perturbation of the approximate likelihoods in $q_t$ yields a valid posterior and, as long as this is the case, the conjugate model can be used safely. For instance, in Thm. 4, this issue poses no problem at all.

The BLR update can be rearranged and written in a Newton-like form shown below,

\text{VON:}\quad \mathbf{m}_t \leftarrow \mathbf{m}_{t-1} - \rho\,\mathbf{S}_t^{-1}\,\mathbb{E}_{q_{t-1}}\left[ \nabla\mathcal{L}(\boldsymbol{\theta}) \right], \qquad \mathbf{S}_t \leftarrow (1-\rho)\mathbf{S}_{t-1} + \rho\,\mathbb{E}_{q_{t-1}}\left[ \nabla^2\mathcal{L}(\boldsymbol{\theta}) \right]. \qquad (29)

This is called the Variational Online Newton (VON) algorithm. A full derivation is in [27] with details on many of its variants in [28]. The simplest variant is the Online Newton (ON) algorithm, where we use the delta method,

\mathbb{E}_{q_t}\left[ \nabla\mathcal{L}(\boldsymbol{\theta}) \right] \approx \nabla\mathcal{L}(\mathbf{m}_t), \qquad \mathbb{E}_{q_t}\left[ \nabla^2\mathcal{L}(\boldsymbol{\theta}) \right] \approx \nabla^2\mathcal{L}(\mathbf{m}_t). \qquad (30)

Then, denoting $\mathbf{m}_t = \boldsymbol{\theta}_t$, we get the following ON update,

\text{ON:}\quad \boldsymbol{\theta}_t \leftarrow \boldsymbol{\theta}_{t-1} - \rho\,\mathbf{S}_t^{-1}\,\nabla\mathcal{L}(\boldsymbol{\theta}_{t-1}), \qquad \mathbf{S}_t \leftarrow (1-\rho)\mathbf{S}_{t-1} + \rho\,\nabla^2\mathcal{L}(\boldsymbol{\theta}_{t-1}). \qquad (31)

To reduce the cost, we can use a diagonal approximation $\mathbf{S}_t = \text{diag}(\mathbf{s}_t)$, where $\mathbf{s}_t$ is a scale vector. Additionally, we can use minibatching to estimate the gradient and Hessian (denoted by $\hat{\nabla}$ and $\hat{\nabla}^2$),

\text{ON (diagonal+minibatch):}\quad \boldsymbol{\theta}_t \leftarrow \boldsymbol{\theta}_{t-1} - \rho\,\mathbf{s}_t^{-1}\cdot\hat{\nabla}\mathcal{L}(\boldsymbol{\theta}_{t-1}), \qquad \mathbf{s}_t \leftarrow (1-\rho)\mathbf{s}_{t-1} + \rho\,\text{diag}(\hat{\nabla}^2\mathcal{L}(\boldsymbol{\theta}_{t-1})), \qquad (32)

where $\cdot$ indicates the element-wise product of two vectors and $\text{diag}(\cdot)$ extracts the diagonal of a matrix.
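As an illustration, the following is a minimal sketch of the diagonal ON update of Eq. 32 for $L_2$-regularized logistic regression, where the gradient and the diagonal of the Hessian are available in closed form; the full-batch setting and all hyperparameter values are illustrative choices, not prescriptions from the paper.

```python
# Diagonal ON update (Eq. 32) for L2-regularized logistic regression.
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 5
X = rng.standard_normal((N, P))
y = (X @ rng.standard_normal(P) > 0).astype(float)
delta, rho, T = 1.0, 0.1, 500            # illustrative hyperparameters

sigmoid = lambda f: 1.0 / (1.0 + np.exp(-f))
theta = np.zeros(P)
s = np.ones(P)                           # diagonal of S_t, kept positive

for t in range(T):
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) + delta * theta              # gradient of L(theta)
    hess_diag = (X**2).T @ (p * (1 - p)) + delta      # diag of Hessian of L
    s = (1 - rho) * s + rho * hess_diag               # scale update
    theta = theta - rho * grad / s                    # preconditioned step
```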

Several optimization algorithms can be obtained as special cases of the above variants. For example, to get Newton's method, we set $\rho = 1$ in ON to get

\boldsymbol{\theta}_t \leftarrow \boldsymbol{\theta}_{t-1} - [\nabla^2\mathcal{L}(\boldsymbol{\theta}_{t-1})]^{-1}\,\nabla\mathcal{L}(\boldsymbol{\theta}_{t-1}). \qquad (33)

RMSprop and Adam can be derived in a similar fashion [28].

In our experiments, we use the improved BLR (iBLR) optimizer [35]. We use it to implement an improved version of VON [27, Eqs. 7-8], which ensures that the covariance is always positive-definite, even when the Hessian estimates are not. We use a diagonal approximation $\mathbf{S}_t = \text{diag}(\boldsymbol{\sigma}^2)^{-1}$, momentum, and minibatching as proposed in [27, 35]. For learning rate $\alpha_t > 0$ and momentum parameters $\beta_1, \beta_2 \in [0,1)$, the iterations are written as follows:

\begin{split}
\text{iBLR:}\quad \mathbf{g}_t &\leftarrow \beta_1\mathbf{g}_{t-1} + (1-\beta_1)\hat{\mathbf{g}}_{t-1}, \\
\mathbf{h}_t &\leftarrow \beta_2\mathbf{h}_{t-1} + (1-\beta_2)\hat{\mathbf{h}}_{t-1} + \tfrac{1}{2}(1-\beta_2)^2(\mathbf{h}_{t-1} - \hat{\mathbf{h}}_{t-1})^2/(\mathbf{h}_{t-1} + \delta), \\
\mathbf{m}_t &\leftarrow \mathbf{m}_{t-1} - \alpha_t(\mathbf{g}_t + \delta\mathbf{m}_{t-1})/(\mathbf{h}_t + \delta), \\
\boldsymbol{\sigma}_t^2 &\leftarrow 1/(N(\mathbf{h}_t + \delta)).
\end{split} \qquad (34)

Here, $\delta > 0$ is the $L_2$-regularization parameter, and $\hat{\mathbf{g}}_{t-1} = \frac{1}{|B|}\sum_{i\in B}\mathbb{E}_{q_{t-1}(\boldsymbol{\theta})}[\nabla\ell_i(\boldsymbol{\theta})]$ and $\hat{\mathbf{h}}_{t-1} = \frac{1}{|B|}\sum_{i\in B}\mathbb{E}_{q_{t-1}(\boldsymbol{\theta})}[\nabla\ell_i(\boldsymbol{\theta})\cdot(\boldsymbol{\theta} - \mathbf{m}_{t-1})/\boldsymbol{\sigma}_{t-1}^2]$ denote Monte-Carlo approximations of the expected stochastic gradient and diagonal Hessian under $q_{t-1}(\boldsymbol{\theta}) = \mathcal{N}(\boldsymbol{\theta}\,|\,\mathbf{m}_{t-1}, \text{diag}(\boldsymbol{\sigma}_{t-1}^2))$ and minibatch $B$. As suggested in [27, 35], we use the reparametrization trick to estimate the diagonal Hessian via gradients only. In practice, we approximate the expectations using a single random sample; we expect multiple samples to further improve the results.
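The following sketch implements the iterations of Eq. 34 for $L_2$-regularized logistic regression, using a single Monte-Carlo sample and the reparametrization-trick Hessian estimate described above; the dataset and all hyperparameter values are illustrative.

```python
# Minimal iBLR sketch (Eq. 34) for L2-regularized logistic regression.
import numpy as np

rng = np.random.default_rng(1)
N, P = 500, 5
X = rng.standard_normal((N, P))
y = (X @ rng.standard_normal(P) > 0).astype(float)
delta, alpha, beta1, beta2, T, B = 1.0, 0.05, 0.9, 0.999, 2000, 32

sigmoid = lambda f: 1.0 / (1.0 + np.exp(-f))
m, g, h = np.zeros(P), np.zeros(P), np.ones(P)

for t in range(T):
    idx = rng.choice(N, size=B, replace=False)
    sigma2 = 1.0 / (N * (h + delta))                      # current variance
    theta = m + np.sqrt(sigma2) * rng.standard_normal(P)  # one sample from q
    p = sigmoid(X[idx] @ theta)
    g_hat = X[idx].T @ (p - y[idx]) / B        # minibatch avg per-example grad
    h_hat = g_hat * (theta - m) / sigma2       # reparam.-trick Hessian estimate
    g = beta1 * g + (1 - beta1) * g_hat
    # the quadratic correction term approximately keeps h positive (cf. [35])
    h = beta2 * h + (1 - beta2) * h_hat \
        + 0.5 * (1 - beta2) ** 2 * (h - h_hat) ** 2 / (h + delta)
    m = m - alpha * (g + delta * m) / (h + delta)

sigma2 = 1.0 / (N * (h + delta))               # final posterior variance
```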

Appendix D Proof of Thm. 2 and the Beta-Bernoulli Model

From Eq. 27, it directly follows that

\tilde{\mathbf{g}}_j(\boldsymbol{\lambda}) = \nabla_{\boldsymbol{\mu}}\mathbb{E}_q[-\log\tilde{p}_j] = -\nabla_{\boldsymbol{\mu}}\langle\widetilde{\boldsymbol{\lambda}}_j, \mathbb{E}_q[\mathbf{T}(\boldsymbol{\theta})]\rangle = -\nabla_{\boldsymbol{\mu}}\langle\widetilde{\boldsymbol{\lambda}}_j, \boldsymbol{\mu}\rangle = -\widetilde{\boldsymbol{\lambda}}_j.

Using this in Eq. 6, we get the deviation given in Eq. 4.

We now show an example with the Beta-Bernoulli model. We assume the model $p(\mathcal{D}, \theta) \propto p(\theta)\prod_i p(y_i|\theta)$, where the prior is $p(\theta) = \text{Beta}(\theta|\alpha_0, \beta_0)$ and the likelihoods are $p(y_i|\theta) = \text{Ber}(y_i|\theta)$ with $\mathcal{D}_i = y_i$. This is a conjugate model, so the posterior is also a Beta distribution, that is, it takes the same form as the prior:

q_* = \text{Beta}(\theta|\alpha_*, \beta_*), \text{ where } \alpha_* = \alpha_0 + \sum_{j=1}^N y_j, \qquad \beta_* = \beta_0 - \sum_{j=1}^N y_j + N.

The posterior for the perturbed dataset $\mathcal{D}^{\backslash i}$ is also available in closed form:

q_*^{\backslash i} = \text{Beta}(\theta|\alpha_*^{\backslash i}, \beta_*^{\backslash i}), \text{ where } \alpha_*^{\backslash i} = \alpha_0 + \sum_{j\neq i} y_j, \qquad \beta_*^{\backslash i} = \beta_0 - \sum_{j\neq i} y_j + N - 1.

Therefore the deviations in the posterior parameters can be simply obtained as follows:

\alpha_*^{\backslash i} - \alpha_* = -y_i, \qquad \beta_*^{\backslash i} - \beta_* = y_i - 1. \qquad (35)

This result can also be straightforwardly obtained using the MPE. For the Beta distribution $q_{\boldsymbol{\lambda}}(\theta) = \text{Beta}(\theta|\alpha, \beta)$, we have $\boldsymbol{\lambda} = (\alpha-1, \beta-1)$, therefore $\boldsymbol{\lambda}_*^{\backslash i} - \boldsymbol{\lambda}_* = (\alpha_*^{\backslash i} - \alpha_*,\; \beta_*^{\backslash i} - \beta_*)$. For the Beta distribution, $\mathbf{T}(\theta) = (\log\theta, \log(1-\theta))$, and writing the likelihood in an exponential form, we get

p(y_i|\theta) = \text{Ber}(y_i|\theta) \propto \theta^{y_i}(1-\theta)^{1-y_i} \propto e^{y_i\log\theta + (1-y_i)\log(1-\theta)},

therefore $\widetilde{\boldsymbol{\lambda}}_i = (y_i, y_i - 1)$. Setting $\boldsymbol{\lambda}_*^{\backslash i} - \boldsymbol{\lambda}_* = -\widetilde{\boldsymbol{\lambda}}_i$, we recover the result given in Eq. 35.
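Because everything is in closed form here, Eq. 35 is straightforward to verify numerically; the short sketch below uses arbitrary values for $\alpha_0$, $\beta_0$, and the data.

```python
# Check of Eq. 35: removing example i shifts the natural parameters by
# exactly -(y_i, y_i - 1). All values below are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
alpha0, beta0 = 2.0, 3.0
y = rng.integers(0, 2, size=20).astype(float)
N, i = len(y), 7

alpha_star = alpha0 + y.sum()
beta_star = beta0 - y.sum() + N

y_loo = np.delete(y, i)
alpha_loo = alpha0 + y_loo.sum()
beta_loo = beta0 - y_loo.sum() + (N - 1)

assert np.isclose(alpha_loo - alpha_star, -y[i])
assert np.isclose(beta_loo - beta_star, y[i] - 1)
```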

Appendix E Proof of Thm. 3

For linear regression, we have

\nabla\ell_i(\boldsymbol{\theta}) = \mathbf{x}_i(\mathbf{x}_i^\top\boldsymbol{\theta} - y_i), \qquad \nabla^2\ell_i(\boldsymbol{\theta}) = \mathbf{x}_i\mathbf{x}_i^\top.

Using these in Eq. 8, we get,

\tilde{\mathbf{g}}_i(\boldsymbol{\lambda}_*) = \mathbb{E}_{q}\left[ \mathbf{x}_i(\mathbf{x}_i^\top\boldsymbol{\theta} - y_i) - \mathbf{x}_i\mathbf{x}_i^\top\boldsymbol{\theta}_*, \;\tfrac{1}{2}\mathbf{x}_i\mathbf{x}_i^\top \right] = \left( -\mathbf{x}_i y_i, \;\tfrac{1}{2}\mathbf{x}_i\mathbf{x}_i^\top \right).

The natural parameter is $\boldsymbol{\lambda}_* = (\mathbf{H}_*\boldsymbol{\theta}_*, -\tfrac{1}{2}\mathbf{H}_*)$. In a similar way, we can define $q_*^{\backslash i}$ and its natural parameter. Using these, we can write Eq. 6 as

\mathbf{H}_*^{\backslash i}\boldsymbol{\theta}_*^{\backslash i} - \mathbf{H}_*\boldsymbol{\theta}_* = -\mathbf{x}_i y_i, \qquad -\tfrac{1}{2}\mathbf{H}_*^{\backslash i} + \tfrac{1}{2}\mathbf{H}_* = \tfrac{1}{2}\mathbf{x}_i\mathbf{x}_i^\top.

Substituting the second equation into the first one, we get the first equation below,

\mathbf{H}_*^{\backslash i}\boldsymbol{\theta}_*^{\backslash i} - (\mathbf{H}_*^{\backslash i} + \mathbf{x}_i\mathbf{x}_i^\top)\boldsymbol{\theta}_* = -\mathbf{x}_i y_i \quad\implies\quad \boldsymbol{\theta}_*^{\backslash i} - \boldsymbol{\theta}_* = (\mathbf{H}_*^{\backslash i})^{-1}\mathbf{x}_i(\mathbf{x}_i^\top\boldsymbol{\theta}_* - y_i) = (\mathbf{H}_*^{\backslash i})^{-1}\mathbf{x}_i e_i.

The last equality is exactly Eq. 2. Since linear regression is a conjugate model, an alternate derivation would be to directly use the parameterization $\widetilde{\boldsymbol{\lambda}}_i$ of $\tilde{p}_i$ (derived in App. B) and plug it into Thm. 2.

Appendix F Proof of Thm. 4

For simplicity, we denote

\partial\hat{\boldsymbol{\lambda}}_*^{\epsilon_i=0} = \left.\frac{\partial\hat{\boldsymbol{\lambda}}_*^{\epsilon_i}}{\partial\epsilon_i}\right|_{\epsilon_i=0},

with $\hat{\boldsymbol{\lambda}}_*^{\epsilon_i}$ as defined in Eq. 7 in the main text. For Gaussian distributions, the natural parameter comes in a pair $\hat{\boldsymbol{\lambda}}_*^{\epsilon_i} = (\mathbf{H}_*^{\epsilon_i}\boldsymbol{\theta}_*^{\epsilon_i},\; -\tfrac{1}{2}\mathbf{H}_*^{\epsilon_i})$. Its derivative with respect to $\epsilon_i$ at $\epsilon_i = 0$ can be written as follows by using the product rule:

\partial\hat{\boldsymbol{\lambda}}_*^{\epsilon_i=0} = \left( \mathbf{H}_*\,\partial\boldsymbol{\theta}_*^{\epsilon_i=0} + \partial\mathbf{H}_*^{\epsilon_i=0}\,\boldsymbol{\theta}_*, \;-\tfrac{1}{2}\partial\mathbf{H}_*^{\epsilon_i=0} \right).

Here, we use the fact that, as $\epsilon_i \to 0$, we have $(\boldsymbol{\theta}_*^{\epsilon_i}, \mathbf{H}_*^{\epsilon_i}) \to (\boldsymbol{\theta}_*, \mathbf{H}_*)$, and we also assume that the limit of the product equals the product of the individual limits. Next, we need an expression for the natural gradient, for which we use Eq. 8 but approximate the expectation with the delta approximation $\mathbb{E}_{q_*}[g(\boldsymbol{\theta})] \approx g(\boldsymbol{\theta}_*)$ for any function $g$, to define:

\hat{\mathbf{g}}_i(\boldsymbol{\lambda}_*) = \left[ \nabla\ell_i(\boldsymbol{\theta}_*) - \nabla^2\ell_i(\boldsymbol{\theta}_*)\boldsymbol{\theta}_*, \;\tfrac{1}{2}\nabla^2\ell_i(\boldsymbol{\theta}_*) \right].

The claim is that if we set $\partial\hat{\boldsymbol{\lambda}}_*^{\epsilon_i=0} = \hat{\mathbf{g}}_i(\boldsymbol{\lambda}_*)$, we recover Eq. 3; that is, we set

\mathbf{H}_*\,\partial\boldsymbol{\theta}_*^{\epsilon_i=0} + \partial\mathbf{H}_*^{\epsilon_i=0}\,\boldsymbol{\theta}_* = \nabla\ell_i(\boldsymbol{\theta}_*) - \nabla^2\ell_i(\boldsymbol{\theta}_*)\boldsymbol{\theta}_*, \qquad -\tfrac{1}{2}\partial\mathbf{H}_*^{\epsilon_i=0} = \tfrac{1}{2}\nabla^2\ell_i(\boldsymbol{\theta}_*).

Plugging the second equation into the first, the second term cancels and we recover Eq. 3.

Appendix G Extension to Non-Differentiable Loss Functions

For non-differentiable cases, we can use Eq. 28 to rewrite the BLR of Eq. 29 as

\mathbf{m}_t \leftarrow \mathbf{m}_{t-1} - \rho\,\mathbf{S}_t^{-1}\nabla_{\mathbf{m}}\mathbb{E}_{q_{t-1}}[\mathcal{L}(\boldsymbol{\theta})], \qquad \mathbf{S}_t \leftarrow (1-\rho)\mathbf{S}_{t-1} + 2\rho\,\nabla_{\boldsymbol{\Sigma}}\mathbb{E}_{q_{t-1}}[\mathcal{L}(\boldsymbol{\theta})], \qquad (36)

where $\boldsymbol{\Sigma} = \mathbf{S}^{-1}$. Essentially, we take the derivative outside the expectation instead of inside, which is valid because the expectation of a non-differentiable function is still differentiable (under some regularity conditions). The same technique can be applied to Eq. 8 to get

\tilde{\mathbf{g}}_i(\boldsymbol{\lambda}) = \left( \nabla_{\mathbf{m}}\mathbb{E}_q[\ell_i(\boldsymbol{\theta})] - 2\nabla_{\boldsymbol{\Sigma}}\mathbb{E}_q[\ell_i(\boldsymbol{\theta})]\mathbf{m}, \;\nabla_{\boldsymbol{\Sigma}}\mathbb{E}_q[\ell_i(\boldsymbol{\theta})] \right), \qquad (37)

and proceeding in the same fashion, we can write $\hat{\mathbf{m}}_t^{\backslash i} - \mathbf{m}_t = (\hat{\mathbf{S}}_t^{\backslash i})^{-1}\nabla_{\mathbf{m}}\mathbb{E}_{q_t}\left[ \ell_i(\boldsymbol{\theta}) \right]$. This is the extension of Eq. 10 to non-differentiable loss functions.
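As an illustration of why this works, consider the (non-differentiable) hinge loss $\ell_i(\boldsymbol{\theta}) = \max(0, 1 - y_i\mathbf{x}_i^\top\boldsymbol{\theta})$: although $\ell_i$ has a kink, $\nabla_{\mathbf{m}}\mathbb{E}_q[\ell_i(\boldsymbol{\theta})]$ can still be estimated by Monte Carlo, since the kink has measure zero under a Gaussian $q$. A minimal sketch with illustrative values:

```python
# Monte-Carlo estimate of grad_m E_q[ell_i] for the hinge loss.
import numpy as np

rng = np.random.default_rng(3)
P, S = 4, 100_000
x_i, y_i = rng.standard_normal(P), 1.0
m, sigma = rng.standard_normal(P), 0.3 * np.ones(P)   # diagonal Gaussian q

theta = m + sigma * rng.standard_normal((S, P))       # samples from q
margin = y_i * (theta @ x_i)
# subgradient of the hinge loss, defined almost everywhere
subgrad = np.where(margin < 1, -y_i, 0.0)[:, None] * x_i
grad_m = subgrad.mean(axis=0)                         # estimate of grad_m E_q[ell_i]
```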

Appendix H Sensitivity Measures for Sparse Variational Gaussian Processes

Sparse variational GP (SVGP) methods optimize the following variational objective to find a Gaussian posterior approximation $q(\mathbf{u})$ over function values $\mathbf{u} := (f(\mathbf{z}_1), f(\mathbf{z}_2), \ldots, f(\mathbf{z}_M))$, where $\mathcal{Z} := (\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_M)$ is the set of inducing inputs with $M \ll N$:

\underline{\mathcal{L}}(\mathbf{m}, \boldsymbol{\Sigma}, \mathcal{Z}, \boldsymbol{\phi}) := \sum_{i=1}^N \mathbb{E}_{q(f_i)}\left[ \log p(y_i|f_i) \right] - \mathbb{D}_{\text{KL}}(q(\mathbf{u})\,\|\,p(\mathbf{u})),

where $p(\mathbf{u}) := \mathcal{N}(\mathbf{u}|\mathbf{0}, \mathbf{K}_{\mathbf{u}\mathbf{u}})$ is the prior with $\mathbf{K}_{\mathbf{u}\mathbf{u}}$ as the covariance function $\kappa(\cdot,\cdot')$ evaluated at $\mathcal{Z}$, and $q(f_i) = \mathcal{N}(f_i|\mathbf{a}_i^\top\mathbf{m}, \mathbf{a}_i^\top\boldsymbol{\Sigma}\mathbf{a}_i + \sigma_i^2)$ is the posterior marginal of $f_i = f(\mathbf{x}_i)$, with $\mathbf{a}_i := \mathbf{K}_{\mathbf{u}\mathbf{u}}^{-1}\mathbf{k}_{\mathbf{u}i}$ and $\sigma_i^2 := \kappa_{ii} - \mathbf{a}_i^\top\mathbf{K}_{\mathbf{u}\mathbf{u}}\mathbf{a}_i$ as the noise variance of $f_i$ conditioned on $\mathbf{u}$. The objective is also used to optimize the hyperparameters $\boldsymbol{\phi}$ and the inducing input set $\mathcal{Z}$.

We can optimize the objective using the BLR, for which the resulting update is identical to the Variational Online Newton (VON) algorithm. We first write the natural gradients,

\widetilde{\nabla}\mathbb{E}_{q_t(f_i)}[-\log p(y_i|f_i)] = \left( (e_{it} - \beta_{it}\mathbf{a}_i^\top\mathbf{m}_*)\mathbf{a}_i, \;\tfrac{1}{2}\beta_{it}\mathbf{a}_i\mathbf{a}_i^\top \right), \qquad (38)

where we define

e_{it} = \mathbb{E}_{q_t(f_i)}[-\nabla_{f_i}\log p(y_i|f_i)], \qquad \beta_{it} = \mathbb{E}_{q_t(f_i)}[-\nabla_{f_i}^2\log p(y_i|f_i)].

We define $\mathbf{A}$ to be the matrix with $\mathbf{a}_i^\top$ as rows, and $\mathbf{e}_t, \boldsymbol{\beta}_t$ to be the vectors of $e_{it}, \beta_{it}$. Using these in the VON update, we simplify as follows:

\mathbf{S}_{t+1} = (1-\rho)\mathbf{S}_t + \rho\left[ \mathbf{A}^\top\text{diag}(\boldsymbol{\beta}_t)\mathbf{A} + \mathbf{K}_{\mathbf{u}\mathbf{u}}^{-1} \right] \qquad (39)

\begin{split}
\mathbf{m}_{t+1} &= \mathbf{S}_{t+1}^{-1}\left[ (1-\rho)\mathbf{S}_t\mathbf{m}_t - \rho\left( \mathbf{A}^\top\mathbf{e}_t - \mathbf{A}^\top\text{diag}(\boldsymbol{\beta}_t)\mathbf{A}\mathbf{m}_t \right) \right] \\
&= \mathbf{S}_{t+1}^{-1}\left[ \left( (1-\rho)\mathbf{S}_t + \rho\mathbf{A}^\top\text{diag}(\boldsymbol{\beta}_t)\mathbf{A} \right)\mathbf{m}_t - \rho\mathbf{A}^\top\mathbf{e}_t \right] \\
&= \mathbf{S}_{t+1}^{-1}\left[ \left( \mathbf{S}_{t+1} - \rho\mathbf{K}_{\mathbf{u}\mathbf{u}}^{-1} \right)\mathbf{m}_t - \rho\mathbf{A}^\top\mathbf{e}_t \right] \\
&= \mathbf{S}_{t+1}^{-1}\left[ \mathbf{S}_{t+1}\mathbf{m}_t - \rho\left( \mathbf{A}^\top\mathbf{e}_t + \mathbf{K}_{\mathbf{u}\mathbf{u}}^{-1}\mathbf{m}_t \right) \right] \\
&= \mathbf{m}_t - \rho\,\mathbf{S}_{t+1}^{-1}\left[ \mathbf{A}^\top\mathbf{e}_t + \mathbf{K}_{\mathbf{u}\mathbf{u}}^{-1}\mathbf{m}_t \right].
\end{split} \qquad (40)

For Gaussian likelihood, the updates in Eqs. 39 and 40 coincide with the method of [18], and for non-Gaussian likelihood they are similar to the natural-gradient method by [45], but we use the specific parameterization of [26]. An alternate update rule in terms of site parameters is given by [1] (see Eqs. 22-24).

We are now ready to write the sensitivity measure, obtained essentially by substituting the gradient in Eq. 10,

\mathbf{S}_t^{-1}\nabla_{\mathbf{m}}\mathbb{E}_{q_t(\mathbf{u})}[-\log p(y_i|f_i)] = \mathbf{S}_t^{-1}\mathbf{a}_i\,\mathbb{E}_{q_t(f_i)}[-\nabla\log p(y_i|f_i)] = \mathbf{S}_t^{-1}\mathbf{a}_i e_{it}. \qquad (41)

We can also see the bilinear relationship by considering the deviation in the mean of the posterior marginal $f_i(\mathbf{m}) := \mathbf{a}_i^\top\mathbf{m}$,

f_i(\mathbf{m}_t^{\backslash i}) - f_i(\mathbf{m}_t) \approx \mathbf{a}_i^\top(\hat{\mathbf{m}}_t^{\backslash i} - \mathbf{m}_t) = \mathbf{a}_i^\top\boldsymbol{\Sigma}_t\mathbf{a}_i e_{it} = v_{it}e_{it}, \qquad (42)

where $v_{it} = \mathbf{a}_i^\top\boldsymbol{\Sigma}_t\mathbf{a}_i$ is the marginal variance of $f_i$.
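For a Gaussian likelihood with unit noise, $e_{it} = \mathbf{a}_i^\top\mathbf{m}_t - y_i$ and $\beta_{it} = 1$, so the fixed point of Eqs. 39 and 40 and the sensitivities $v_{it}e_{it}$ can be computed directly. A minimal sketch with a synthetic dataset and an RBF kernel (all settings illustrative):

```python
# SVGP sensitivities v_it * e_it (Eq. 42), Gaussian likelihood with unit noise.
import numpy as np

def rbf(A, B, ell=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
N, M = 100, 10
X = rng.uniform(-3, 3, size=(N, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(N)
Z = np.linspace(-3, 3, M)[:, None]               # inducing inputs

Kuu = rbf(Z, Z) + 1e-6 * np.eye(M)
Kuf = rbf(Z, X)
A = np.linalg.solve(Kuu, Kuf).T                  # rows a_i^T = k_ui^T Kuu^{-1}

# Fixed point of Eqs. 39-40 for Gaussian likelihood (beta_it = 1):
S = A.T @ A + np.linalg.inv(Kuu)                 # precision over u
m = np.linalg.solve(S, A.T @ y)                  # posterior mean over u
Sigma = np.linalg.inv(S)

e = A @ m - y                                    # prediction errors e_it
v = np.einsum('ij,jk,ik->i', A, Sigma, A)        # marginal variances v_it
sensitivity = v * e                              # Eq. 42
```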

Appendix I Experimental Details

I.1 Neural network architectures

Below, we describe different neural networks used in our experiments,

MLP (500, 300):

This is a multilayer perceptron (MLP) with two hidden layers of 500 and 300 neurons and a parameter count of around 546,000 (using hyperbolic-tangent activations).

MLP (32, 16):

This is also an MLP with two hidden layers of 32 and 16 neurons, which accounts for around 26,000 parameters (also using hyperbolic-tangent activations).

LeNet5:

This is a standard convolutional neural network (CNN) architecture with three convolution layers followed by two fully-connected layers, corresponding to around 62,000 parameters.

CNN:

This network, taken from the DeepOBS suite [46], consists of three convolution layers followed by three fully-connected layers, with a parameter count of around 895,000.

ResNet–20:

This network has around 274,000 parameters. We use filter response normalization (FRN) [48] as an alternative to batch normalization.

MLP for USPS:

For the experiment on binary USPS in Fig. 6(a), we use an MLP with three hidden layers of 30 neurons each and a total of around 10,000 parameters.

I.2 Details of “Do estimated deviations correlate with the truth?”

In Fig. 2, we train neural network classifiers with a cross-entropy loss to obtain $\boldsymbol{\theta}_*$. Due to the computational demand of per-example retraining, the removed examples are randomly subsampled from the training set. We show results over 1000 examples for MNIST and FMNIST, and 100 examples for CIFAR10. In the multiclass setting, the expression yields a per-class sensitivity value; we obtain a scalar value for each example by summing the absolute values of the per-class sensitivities. For training both the original model $\boldsymbol{\theta}_*$ and the perturbed models $\boldsymbol{\theta}_*^{\backslash i}$, we use SGD with a momentum parameter of 0.9 and a cosine learning-rate scheduler. To obtain $\boldsymbol{\theta}_*^{\backslash i}$, we retrain a model warmstarted at $\boldsymbol{\theta}_*$. Other details of the training setup are given in Table 2. For all models, we do not use data augmentation during training. The resulting $\boldsymbol{\theta}_*$ for MNIST, FMNIST, and CIFAR10 have training accuracies of 99.9%, 95.0%, and 99.9%, respectively; the corresponding test accuracies are 98.4%, 91.2%, and 76.7%.

Dataset | Model | $B$ | $\delta$ | $E^{*}$ | $LR^{*}$ | $LR^{*}_{\text{min}}$ | $E^{\backslash i}$ | $LR^{\backslash i}$ | $LR^{\backslash i}_{\text{min}}$
MNIST | MLP (500, 300) | 256 | 100 | 500 | $10^{-2}$ | $10^{-3}$ | 300 | $10^{-3}$ | $10^{-4}$
FMNIST | LeNet5 | 256 | 100 | 300 | $10^{-1}$ | $10^{-3}$ | 200 | $10^{-3}$ | $10^{-4}$
CIFAR10 | CNN | 512 | 250 | 500 | $10^{-2}$ | $10^{-4}$ | 300 | $10^{-4}$ | $10^{-6}$
Table 2: Hyperparameters for predicting true sensitivity in Fig. 2. $B$, $E$, and $LR$ denote batch size, training epochs, and learning-rate, respectively. The superscripts $*$ and $\backslash i$ indicate hyperparameters for training on all data and for warmstarted leave-one-out retraining, respectively. $LR_{\text{min}}$ is the minimum learning-rate of the cosine scheduler.

Additional group removal experiments:

We also study how the deviation for removing a group of examples in a set $\mathcal{M}$ can be estimated using a variation of Eq. 14 for the deviation in predictions at convergence. Denoting by $\mathbf{f}_{\mathcal{M}}(\boldsymbol{\theta})$ the vector of $f_{i}(\boldsymbol{\theta})$ for $i\in\mathcal{M}$, we get

$\sigma(\mathbf{f}_{\mathcal{M}}(\boldsymbol{\theta}_{*}^{\backslash\mathcal{M}}))-\sigma(\mathbf{f}_{\mathcal{M}}(\boldsymbol{\theta}_{*}))\approx\boldsymbol{\Lambda}(\boldsymbol{\theta}_{*})\mathbf{V}_{\mathcal{M}}(\boldsymbol{\theta}_{*})\mathbf{e}_{\mathcal{M}}(\boldsymbol{\theta}_{*})\approx\sum_{i\in\mathcal{M}}\sigma^{\prime}(f_{i*})\,v_{i*}\,e_{i*},$ (43)

where $\boldsymbol{\Lambda}(\boldsymbol{\theta}_{*})$ is a diagonal matrix containing all $\sigma^{\prime}(f_{i*})$, $\mathbf{V}_{\mathcal{M}}(\boldsymbol{\theta}_{*})=\nabla\mathbf{f}_{\mathcal{M}}(\boldsymbol{\theta}_{*})\,\mathbf{S}_{*}^{-1}\nabla\mathbf{f}_{\mathcal{M}}(\boldsymbol{\theta}_{*})^{\top}$ is the $M\times M$ prediction covariance with $M$ the number of examples in $\mathcal{M}$, and $\mathbf{e}_{\mathcal{M}}(\boldsymbol{\theta}_{*})$ is the vector of prediction errors. The last approximation avoids building the covariance by ignoring the off-diagonal entries of $\mathbf{V}_{\mathcal{M}}(\boldsymbol{\theta}_{*})$.
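Below is a minimal sketch of the two approximations in Eq. 43 for a hypothetical group of three examples in the binary case; the logits `f`, labels `y`, and covariance `V` are illustrative values only.

```python
import numpy as np

# Illustrative quantities for a removed group M of size 3.
f = np.array([0.5, -1.0, 2.0])                # logits f_i* for i in M
y = np.array([1.0, 0.0, 1.0])                 # binary labels
V = np.array([[0.20, 0.02, 0.01],
              [0.02, 0.15, 0.03],
              [0.01, 0.03, 0.10]])            # prediction covariance V_M

sig = 1.0 / (1.0 + np.exp(-f))                # sigmoid predictions
e = sig - y                                   # prediction errors e_M
Lam = np.diag(sig * (1.0 - sig))              # diagonal matrix of sigma'(f_i*)

dev_first = Lam @ V @ e                       # first approximation (full V_M)
dev_second = np.diag(Lam) * np.diag(V) * e    # second: off-diagonals ignored
print(dev_first.sum(), dev_second.sum())      # scalar deviation per group
```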

In Fig. 6(a), we consider a binary USPS dataset consisting of the digits 3 and 5. Using $|\mathcal{M}| = 16$, we show that the first and second approximations in Eq. 43 both correlate well with the truth obtained by removing a group and retraining the model. In Fig. 6(b), we do the same on MNIST with $|\mathcal{M}| = 64$, where we see similar trends. For the experiment on binary USPS in Fig. 6(a), we train an MLP with three hidden layers of 30 neurons each. The original model $\boldsymbol{\theta}_{*}$ is trained for 500 epochs with a learning-rate of $10^{-3}$, a batch size of 32, and an $L_{2}$-regularization parameter $\delta=5$; it reaches 100% training accuracy and 94.8% test accuracy. For the leave-group-out retraining to obtain $\boldsymbol{\theta}_{*}^{\backslash\mathcal{M}}$, we initialize the model at $\boldsymbol{\theta}_{*}$, use a learning-rate of $10^{-3}$, and train for 1000 epochs. For the MNIST result in Fig. 6(b), we use the MLP (500, 300) model with the same hyperparameters as for $\boldsymbol{\theta}_{*}$ in Table 2. For $\boldsymbol{\theta}_{*}^{\backslash\mathcal{M}}$, we initialize the model at $\boldsymbol{\theta}_{*}$ and anneal the learning-rate with a cosine schedule from $10^{-4}$ to $10^{-5}$ over 500 epochs. We do not use data augmentation. As in the per-example removal experiments, we use a K-FAC approximation.

(a) MLP on USPS-3vs5, $|\mathcal{M}| = 16$. (b) MLP on MNIST, $|\mathcal{M}| = 64$.
Figure 6: Panels (a) and (b) show that the estimated deviations for removal of groups of examples correlate well with the true deviations obtained by retraining. Each marker corresponds to a removed group of examples. The red circles show the second approximation in Eq. 43. In Panel (a), we additionally show (with blue squares) the first approximation of Eq. 43; the second approximation is quite accurate in this case.
Dataset | Model | $E$ | $LR^{*}$ | $LR^{*}_{\text{min}}$ | $LR^{\backslash C}$ | $LR^{\backslash C}_{\text{min}}$
MNIST | MLP (500, 300) | 500 | $10^{-2}$ | $10^{-3}$ | $10^{-4}$ | $10^{-5}$
MNIST | LeNet5 | 300 | $10^{-1}$ | $10^{-3}$ | $10^{-5}$ | $10^{-6}$
FMNIST | MLP (32, 16) | 300 | $10^{-2}$ | $10^{-3}$ | $10^{-5}$ | $10^{-6}$
FMNIST | LeNet5 | 300 | $10^{-1}$ | $10^{-3}$ | $10^{-4}$ | $10^{-5}$
Table 3: Hyperparameters for the class-removal experiments in Fig. 3(a) and Fig. 11(d). $E$ and $LR$ denote training epochs and learning-rate, respectively. The superscripts $*$ and $\backslash C$ indicate hyperparameters for training on all data and for warmstarted leave-one-class-out retraining, respectively. $LR_{\text{min}}$ is the minimum learning-rate of the cosine scheduler.

I.3 Details of “Predicting the effect of class removal on generalization”

For the FMNIST experiment in Fig. 3(a), we use the MLP (32, 16) and LeNet5 models. For the MNIST experiment in Fig. 11(d), we use the MLP (500, 300) and LeNet5 models. The hyperparameters are given in Table 3. The MLP on MNIST has a training accuracy of 99.9% and a test accuracy of 98.4%; with LeNet5, the training and test accuracies are 99.2% and 99.1%. On FMNIST, LeNet5 reaches 95.0% training accuracy and 91.2% test accuracy, while the MLP reaches 89.9% and 86.2%, respectively. For all models, we use a regularization parameter of 100 and a batch size of 256. The leave-one-class-out training is run for 1000 epochs; the rest of the training setup is the same as in the previous experiment.

I.4 Details of “Estimating the leave-one-out cross-validation curves for hyperparameter tuning”

The details of the training setup are in Table 2. Fig. 7 is the same as Fig. 4 but additionally shows the test errors. For visualization purposes, each plot applies a moving average with a smoothing window to the plotted lines. Other training details are similar to the previous experiments. All models are trained from scratch; we use Adam for FMNIST, AdamW [36] for CIFAR10, and SGD with a momentum parameter of 0.9 for MNIST. A cosine scheduler anneals the learning-rate. The other hyperparameters follow the settings of the models trained on all data in the leave-one-out experiments of Table 2, except that CIFAR10 is trained for 150 epochs. As in Sec. I.2, we use a Kronecker-factored Laplace approximation for the variance computation and do not employ data augmentation during training.
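For concreteness, a schematic of how a point on the LOO-CV curve is assembled from sensitivities is sketched below (binary case for simplicity). The shifted-logit form follows the LOO estimate referenced later as Eq. 16; `train_and_get_quantities` is a hypothetical placeholder for our training plus K-FAC variance computation, not a real function in our code.

```python
import numpy as np

def loo_nll_estimate(f, v, e, y):
    """Estimated leave-one-out NLL: shift each training logit f_i by its
    sensitivity v_i * e_i (cf. Eq. 16) and score the shifted prediction."""
    f_loo = f + v * e                     # estimated logit after removing example i
    p_loo = 1.0 / (1.0 + np.exp(-f_loo))
    return -np.mean(y * np.log(p_loo) + (1.0 - y) * np.log(1.0 - p_loo))

# Hypothetical sweep over the regularization parameter delta (cf. Table 4):
# curve = []
# for delta in np.logspace(0, 3, 96):
#     f, v, e, y = train_and_get_quantities(delta)  # placeholder helper
#     curve.append(loo_nll_estimate(f, v, e, y))
```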

Dataset | Model | Number of $\delta$s | Range | Smoothing window
MNIST | MLP (500, 300) | 96 | $10^{0}$–$10^{3}$ | 3
FMNIST | LeNet5 | 96 | $10^{1}$–$10^{3}$ | 5
CIFAR10 | CNN | 30 | $10^{1}$–$10^{3}$ | 3
Table 4: Experimental settings for Fig. 4.
(a) MNIST, MLP. (b) FMNIST, LeNet5. (c) CIFAR10, CNN.
Figure 7: Leave-one-out estimation with sensitivities obtained from the MPE (Train-LOO-MPE) accurately estimates the LOO-CV curve for predicting generalization and tuning the $L_{2}$-regularization parameter on MNIST, FMNIST, and CIFAR10.

I.5 Details of “Predicting generalization during the training”

Details of the training setup:

The experimental details, including test accuracies at the end of training, are listed in Table 5. We use a grid search to determine the regularization parameter $\delta$. The learning-rate is decayed according to a cosine schedule. For diagonal-GGN-LOO and K-FAC-LOO, we use the SGD optimizer, except on the FMNIST dataset where we use the AdamW optimizer [36]; in that experiment, a weight-decay factor of $\delta/N$ replaces the explicit $L_{2}$-regularization term in the loss in Eq. 1. The regularizer $\mathcal{R}(\boldsymbol{\theta})$ is set to zero. We do not use training data augmentation. For all plots, the LOO estimate is evaluated periodically during training, as indicated with markers.

Additional details on the hyperparameters of iBLR are as follows, where $h_{0}$ is the initialization of the Hessian:

  • MNIST, MLP (32, 16): $h_{0}=0.1$
  • MNIST, LeNet5: $h_{0}=0.1$
  • FMNIST, LeNet5: $h_{0}=0.1$
  • CIFAR10, CNN: $h_{0}=0.05$
  • CIFAR10, ResNet–20: $h_{0}=0.01$

We set $\beta_{1}=0.9$ and $\beta_{2}=0.99999$ in all of these experiments. The magnitude of the prediction variance can depend on $h_{0}$, which therefore influences the magnitude of the sensitivities that perturb the function outputs in the LOO estimate of Eq. 16. We choose $h_{0}$ on a grid of four values $[0.01, 0.05, 0.1, 0.5]$ to obtain sensitivities that yield a good prediction of generalization performance.

Dataset | Model | Method | $LR$ | $LR_{\text{min}}$ | $B$ | $\delta$ | Test acc.
MNIST | MLP (32, 16) | iBLR | $10^{-2}$ | $10^{-4}$ | 256 | 80 | 95.6%
MNIST | MLP (32, 16) | diag.-GGN-LOO | $10^{-3}$ | $10^{-4}$ | 256 | 80 | 95.8%
MNIST | MLP (32, 16) | K-FAC-LOO | $10^{-3}$ | $10^{-4}$ | 256 | 80 | 95.8%
MNIST | LeNet5 | iBLR | $10^{-2}$ | $10^{-4}$ | 256 | 60 | 97.5%
MNIST | LeNet5 | diag.-GGN-LOO | $10^{-3}$ | $10^{-4}$ | 256 | 60 | 97.4%
MNIST | LeNet5 | K-FAC-LOO | $10^{-3}$ | $10^{-4}$ | 256 | 60 | 97.4%
FMNIST | LeNet5 | iBLR | $10^{-1}$ | 0 | 256 | 60 | 90.7%
FMNIST | LeNet5 | diag.-GGN-LOO | $10^{-2}$ | $10^{-4}$ | 256 | 60 | 91.0%
FMNIST | LeNet5 | K-FAC-LOO | $10^{-2}$ | $10^{-4}$ | 256 | 60 | 91.0%
CIFAR10 | CNN | iBLR | $10^{-1}$ | $10^{-4}$ | 512 | 250 | 81.0%
CIFAR10 | CNN | diag.-GGN-LOO | $10^{-1}$ | 0 | 512 | 250 | 75.4%
CIFAR10 | CNN | K-FAC-LOO | $10^{-1}$ | 0 | 512 | 250 | 73.6%
CIFAR10 | ResNet–20 | iBLR | $2\cdot 10^{-1}$ | 0 | 50 | 10 | 83.4%
Table 5: Experimental settings for predicting generalization during training in Fig. 1(b), Fig. 5, and Fig. 8. $B$ denotes the batch size, $LR$ and $LR_{\text{min}}$ are the start and end learning-rates of the cosine scheduler, and $\delta$ is the regularization parameter. The third column indicates the method for computing sensitivities: either iBLR, or SGD with a diagonal GGN (diag.-GGN) or K-FAC approximation.
Figure 8: These plots are similar to Fig. 5 but for different model-data pairs. The three rows correspond to MLP on MNIST, LeNet5 on MNIST, and CNN on CIFAR10, respectively. The trends are almost the same as those discussed in the main text.

Additional Results: In Fig. 8, we show additional results for MNIST and CIFAR10 that are not included in the main text. For MNIST, we evaluate both the MLP (32, 16) model and the LeNet5 architecture; for the additional CIFAR10 results, we use the CNN. In Fig. 9, we include an additional experiment where the model overfits. The K-FAC-LOO estimate deteriorates in this case, but the LOO can still serve as a diagnostic for detecting overfitting and as a stopping criterion. We train a LeNet5 on FMNIST with AdamW and predict generalization; the trend of the estimated NLL matches the trend of the test NLL over the course of training.

In Fig. 10, we include further results for sensitivity estimation with the Adam optimizer. We use the following update:

$\mathbf{r}_{t}\leftarrow\beta_{1}\mathbf{r}_{t-1}+(1-\beta_{1})\,\mathbf{g}_{t},\quad\mathbf{s}_{t}\leftarrow\beta_{2}\mathbf{s}_{t-1}+(1-\beta_{2})\,(\mathbf{g}_{t}\cdot\mathbf{g}_{t}),\quad\boldsymbol{\theta}_{t}\leftarrow\boldsymbol{\theta}_{t-1}-\rho\,\mathbf{r}_{t}/(\sqrt{\mathbf{s}_{t}}+\epsilon),$

where $\mathbf{g}_{t}$ is the minibatch gradient, $\beta_{1}$ and $\beta_{2}$ are coefficients for the running averages, $\rho$ is a learning-rate, and $\epsilon$ is a small damping term for stability. We construct a diagonal matrix $\mathbf{S}_{t}=\text{diag}(N\sqrt{\mathbf{s}_{t}})$ to estimate sensitivity with the MPE as suggested in Table 1 ($N$ is the number of training examples). Better results are expected by building better estimates of $\mathbf{S}_{t}$, as discussed in [27]. As described in Section 3.4 of [27], a smaller batch size should improve the estimate, which we also observe in the experiment.
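As a rough illustration of this construction, the sketch below computes a per-example sensitivity from Adam's second-moment state in PyTorch. It assumes a scalar-output (binary) model and computes the Jacobian one example at a time; names such as `model` and `opt` are placeholders, so this should be read as a sketch rather than our actual implementation.

```python
import torch

def adam_sensitivity(model, opt, x_i, y_i, N, eps=1e-12):
    """Estimate v_i * e_i with S_t = diag(N * sqrt(s_t)), where s_t is Adam's
    second-moment state (stored as 'exp_avg_sq' after at least one step)."""
    params = list(model.parameters())
    f_i = model(x_i).squeeze()                       # scalar logit for example i
    grads = torch.autograd.grad(f_i, params)         # Jacobian of f_i w.r.t. params
    v_i = torch.tensor(0.0)
    for p, g in zip(params, grads):
        s = opt.state[p]['exp_avg_sq']               # Adam's running second moment
        S_diag = N * s.sqrt() + eps                  # diagonal of S_t, damped
        v_i = v_i + (g.pow(2) / S_diag).sum()        # g^T S_t^{-1} g
    e_i = torch.sigmoid(f_i) - y_i                   # prediction error
    return v_i * e_i                                 # estimated deviation for example i
```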

(a) Diagonal-GGN-LOO. (b) K-FAC-LOO.
Figure 9: Additional results for training with AdamW where we observe overfitting. K-FAC-LOO deteriorates when the model starts to overfit, but both LOO measures can still be useful tools for diagnosing overfitting. Details of the training setup are given in Table 6.
Figure 10: LOO-CV estimates with Adam using the measure suggested in Table 1. The two rows correspond to batch sizes of 8 and 32, respectively. A smaller batch size generally decreases the gap between the test NLL and the estimate. Details of the training setup are given in Table 7.
Dataset | Model | Method | $LR$ | $LR_{\text{min}}$ | $B$ | $\delta$ | Test acc.
FMNIST | LeNet5 | diag., AdamW | $10^{-3}$ | $10^{-3}$ | 256 | 60 | 88.1%
FMNIST | LeNet5 | K-FAC, AdamW | $10^{-3}$ | $10^{-3}$ | 256 | 60 | 87.6%
Table 6: Experimental settings for predicting generalization during training in Fig. 9.
Dataset | Model | $LR$ | $LR_{\text{min}}$ | $\delta$ | Test acc. ($B=8$) | Test acc. ($B=32$)
MNIST | MLP (32, 16) | $10^{-3}$ | 0 | 80 | 97.3% | 97.4%
MNIST | LeNet5 | $10^{-3}$ | 0 | 60 | 99.2% | 99.2%
FMNIST | LeNet5 | $10^{-3}$ | 0 | 60 | 91.4% | 91.2%
CIFAR10 | CNN | $10^{-3}$ | 0 | 50 | 75.2% | 78.4%
Table 7: Experimental settings for Fig. 10.

I.6 Details of “Evolution of sensitivities during training”

(a) MNIST, Bayesian logistic regression. (b) MNIST, iBLR with MLP. (c) CIFAR-10, iBLR with ResNet–20. (d) Class-removal result for MNIST.
Figure 11: Additional experiments similar to Fig. 3(b). In Panel (a), we show the evolution of sensitivities for Bayesian logistic regression on MNIST trained with the VON algorithm. In Panel (b), we use an MLP trained with the iBLR optimizer. In Panel (c), we use a ResNet–20 trained with iBLR on CIFAR-10. Panel (d) shows the class-removal result similar to Fig. 3(a), but on MNIST.

We use the MPE with iBLR for neural-network classification on MNIST, FMNIST, and CIFAR10, as well as the MPE for logistic regression on MNIST. Experimental details are in Table 8.

Dataset | Model | $B$ | $\delta$ | $E$
MNIST | MLP (500, 300) | 256 | 30 | 100
FMNIST | LeNet5 | 256 | 60 | 100
CIFAR10 | ResNet–20 | 512 | 35 | 300
Table 8: Experimental settings for the evolution of sensitivities during training in Fig. 3(b) and Fig. 11.

For the experiment in Fig. 11(a), we consider Bayesian logistic regression with $\delta=0.1$. The Hessian is always positive-definite due to the convex loss function, so we use the VON algorithm given in Eq. 29. We use 125 updates with a batch size of 200, reaching a test accuracy of around 91% using the mean $\mathbf{m}_{t}$. We use a linear learning-rate decay from 0.005 to 0.001 for the mean $\mathbf{m}$ and a learning-rate of $10^{-5}$ for the precision $\mathbf{S}$. The expectations are approximated using 3 samples drawn from the posterior. We plot sensitivities at iterations $t=5,10,25,125$. For this example, we use samples from $q_{t}$ to compute the prediction variance and error (150 samples are used). We sort examples according to their sensitivity at iteration $t=125$ and then plot their average sensitivities in 60 groups of 100 examples each.

For the experiments in Fig. 3(b), Fig. 11(b), and Fig. 11(c), we consider neural-network models $\mathbf{f}(\boldsymbol{\theta}_{t})$ on FMNIST, MNIST, and CIFAR10. We do not use training data augmentation. For CIFAR10, we use a ResNet–20. The expectations in the iBLR are approximated using a single sample drawn from the posterior. For prediction, we use the mean $\mathbf{m}_{t}$. The test accuracies are 91.3% for FMNIST, 98.5% for MNIST, and 80.9% for CIFAR10. We use a cosine learning-rate scheduler with an initial learning-rate of 0.1, annealed to zero over the course of training. Other experimental details are stated in Table 8. As before, we use sampling to evaluate sensitivity (150 samples are used).
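The sampling-based evaluation of sensitivities can be sketched as follows; `logits_per_sample` stands for the outputs $f_i$ evaluated at parameters drawn from $q_t$ and is a hypothetical input, with the binary case shown for simplicity.

```python
import torch

def mc_sensitivity(logits_per_sample, y_i):
    """Monte Carlo estimate of v_it * e_it from S posterior samples of the
    output f_i (we use 150 samples in the experiments above)."""
    v_i = logits_per_sample.var()                        # prediction variance under q_t
    e_i = torch.sigmoid(logits_per_sample).mean() - y_i  # expected prediction error
    return v_i * e_i

# Example with stand-in samples for f_i at 150 posterior draws:
# samples = torch.randn(150) * 0.2 + 1.5
# print(mc_sensitivity(samples, torch.tensor(1.0)))
```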

Appendix J Author Contributions Statement

Authors list: Peter Nickl (PN), Lu Xu (LX), Dharmesh Tailor (DT), Thomas Moellenhoff (TM), Mohammad Emtiyaz Khan (MEK)

All co-authors contributed to developing the main idea. MEK and DT first discussed the idea of deriving a sensitivity measure based on the BLR. MEK derived the MPE and the results in Sec. 3, and DT helped in connecting them to influence functions. PN derived the results in Sec. 3.3 and came up with the idea to predict generalization error with LOO-CV; LX adapted it to class removal. PN wrote the code with help from LX. PN and LX did most of the experiments with some help from TM and regular feedback from everybody. TM did the experiment on the sensitivity evolution during training with some help from PN. All authors were involved in the writing and proofreading of the paper.

Appendix K Differences Between Camera-Ready Version and Submitted Version

We made several changes to take the feedback of reviewers into account and improve the paper.

  1. The writing and organization of the paper were modified to emphasize the generalization to a wide variety of models and algorithms and the applicability of the MPE during training.

  2. The presentation of Section 3 was changed to emphasize the focus on the conjugate model. Detailed derivations were moved to the appendices and more focus was put on big-picture ideas. The parts on arbitrary perturbations were made explicit. Table 1 was added and more focus was put on training algorithms.

  3. We added experiments using leave-one-out estimation to predict generalization on unseen test data during training. We also added results studying the evolution of sensitivities during training using the MPE with iBLR.