Scale-invariant Bayesian Neural Networks with Connectivity Tangent Kernel
Abstract
Explaining generalization and preventing over-confident predictions are central goals of studies on the loss landscape of neural networks. Flatness, defined as loss invariability under perturbations of a pre-trained solution, is widely accepted as a predictor of generalization in this context. However, it has been pointed out that flatness and the associated generalization bounds can be changed arbitrarily by rescaling parameters, and previous studies addressed this only partially: counter-intuitively, their generalization bounds remained variant under function-preserving parameter scaling transformations or applied only to impractical network structures. As a more fundamental solution, we propose new prior and posterior distributions invariant to scaling transformations by decomposing the scale and connectivity of parameters, thereby allowing the resulting generalization bound to describe the generalizability of a broad class of networks under practical transformations such as weight decay with batch normalization. We also show that the above issue adversely affects the uncertainty calibration of Laplace approximation and propose a solution using our invariant posterior. We empirically demonstrate that our posterior provides effective flatness and calibration measures with low complexity under such practical parameter transformations, supporting its practical effectiveness in line with our rationale.
1 Introduction
Though neural networks (NNs) have experienced extraordinary success, understanding the generalizability of NNs and successfully using them in real-world contexts still faces a number of obstacles [1, 2]. It is a well-known enigma, for instance, why such NNs generalize well and do not suffer from overfitting [3, 4, 5]. Recent research on the loss landscape of NNs seeks to reduce these obstacles. Hochreiter and Schmidhuber [6] proposed a theory known as flat minima (FM): the flatness of local minima (i.e., loss invariability w.r.t. parameter perturbations) is positively correlated with network generalizability, as empirically demonstrated by Jiang et al. [7]. Concerning overconfidence, MacKay [8] suggested an approximated Bayesian posterior using the curvature information of local minima, and Daxberger et al. [9] underlined its practical utility.
Nonetheless, the limitations of the FM hypothesis were pointed out by Dinh et al. [10] and Li et al. [11]. By rescaling two successive layers, Dinh et al. [10] demonstrated that it is possible to modify a flatness measure without changing the underlying function, hence allowing extraneous variability into the computation of generalizability. Meanwhile, Li et al. [11] argued that weight decay regularization [12] is an important limitation of the FM hypothesis as it leads to a contradiction in practice: weight decay sharpens the pre-trained solutions of NNs by downscaling the parameters, whereas it actually improves the generalization performance of NNs in general [13]. In short, they suggest that scaling transformations of network parameters (e.g., re-scaling layers and weight decay) may lead to a contradiction of the FM hypothesis.
To resolve this contradiction, we investigate PAC-Bayesian prior and posterior distributions to derive a new scale-invariant generalization bound. Unlike related works [14, 15], our bound guarantees invariance under a general class of function-preserving parameter scaling transformations for a broad class of networks [16] (Secs. 2.2 and 2.3).
This bound is derived from the scale invariance of the prior and posterior distributions, which guarantees the scale invariance not only of the bound itself but also of its core quantity, a Kullback-Leibler (KL) divergence-based kernel; we name this scale-invariant kernel the empirical Connectivity Tangent Kernel (CTK), as it can be viewed as a modification of the empirical Neural Tangent Kernel [17]. Consequently, we define a novel sharpness metric, Connectivity Sharpness (CS), as the trace of the CTK. Empirically, we verify that CS predicts the generalization performance of NNs better than existing sharpness measures [18, 19, 20] at low complexity (Sec. 2.5), confirming its stronger correlation with generalization error (Sec. 4.1).
We also find that the contradiction of the FM hypothesis turns into a meaningless amplification of predictive uncertainty in the Bayesian NN regime (Sec. 3.1), and that this issue can be alleviated by a Bayesian NN based on the posterior distribution of our PAC-Bayesian analysis. We call the resulting Bayesian NN Connectivity Laplace (CL), as it can be seen as a variation of the Laplace Approximation (LA; MacKay [8]) using a different Jacobian. In particular, we identify pitfalls of weight decay regularization with BN in LA and provide a remedy using our posterior (Sec. 3.1), suggesting practical utility of our Bayesian NNs (Sec. 4.2). We summarize our contributions as follows:
• Unlike related studies, our resulting (PAC-Bayes) generalization bound guarantees invariance under a general class of function-preserving parameter scaling transformations for a broad class of networks (Secs. 2.2 and 2.3). Based on this novel PAC-Bayes bound, we propose a low-complexity sharpness measure (Sec. 2.5).
• We identify pitfalls of weight decay regularization with BN in LA and provide a remedy using our posterior (Sec. 3.1).
• We empirically demonstrate that CS better predicts generalization than existing sharpness measures (Sec. 4.1) and that CL provides well-calibrated uncertainty as an efficient general-purpose Bayesian NN (Sec. 4.2).
2 PAC-Bayes bound with scale-invariance
2.1 Background
Setup and Definitions We consider a Neural Network (NN), , given input and network parameter . Hereafter, we treat a one-dimensional vector as a single-column matrix unless otherwise stated. We use the output of the NN as a prediction for input . Let denote the independently and identically distributed (i.i.d.) training data drawn from the true data distribution , where and are the input and output representations of the -th training instance, respectively. For simplicity, we denote the concatenated inputs and outputs of all instances as and , respectively, and as a concatenation of . Given a prior distribution over the network parameter and a likelihood function , Bayesian inference defines the posterior distribution of the network parameter as , where is the training loss and is a normalizing factor. For example, the likelihood function for a regression task is Gaussian: , where is the (homoscedastic) observation noise scale. For a classification task, we treat it as a one-hot regression task following Lee et al. [21] and He et al. [22]. While we apply this modification for theoretical tractability, Lee et al. [23] and Hui and Belkin [24] showed that it offers performance competitive with the cross-entropy loss. Details on this treatment are given in Appendix B.
Laplace Approximation In general, exact computation of the Bayesian posterior over network parameters is intractable. The Laplace Approximation (LA; [8]) approximates the posterior with a Gaussian distribution defined as , where is a pre-trained parameter with training loss and is the Hessian of the loss function w.r.t. the parameter at .
Recent works on LA replace the Hessian matrix with the (Generalized) Gauss-Newton matrix to make the computation tractable [25, 26]. With this approximation, the LA posterior for the regression problem can be represented as:
(1)
where and is an identity matrix, and is a concatenation of (the Jacobian of w.r.t. at input and parameter ) along the training inputs . Since the covariance in equation 1 is the inverse of a matrix, further sub-curvature approximations have been considered, including diagonal, Kronecker-factored approximate curvature (KFAC), last-layer, and sub-network approximations [27, 28, 29]. Furthermore, these works found that a proper selection of the prior scale is needed to balance the dilemma between overconfidence and underfitting in LA.
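For concreteness, the sketch below forms this GGN-based LA covariance for a toy one-dimensional regression model. The model, data, and hyper-parameter values are illustrative only, and the Jacobian is materialized explicitly, which is feasible only for small networks; larger models need the sub-curvature approximations mentioned above.

```python
import torch

def parameter_jacobian(model, X):
    """Stack per-example output gradients df(x_i)/dtheta into an (N, P) matrix."""
    rows = []
    params = list(model.parameters())
    for x in X:
        out = model(x.unsqueeze(0)).squeeze()
        grads = torch.autograd.grad(out, params)
        rows.append(torch.cat([g.reshape(-1) for g in grads]))
    return torch.stack(rows)

# Toy 1-D regression setup (illustrative names and values, not the paper's).
torch.manual_seed(0)
model = torch.nn.Sequential(torch.nn.Linear(1, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))
X = torch.linspace(-1.0, 1.0, 20).unsqueeze(-1)
sigma, prior_scale = 0.1, 1.0        # observation noise and prior standard deviation

J = parameter_jacobian(model, X)     # (N, P)
P = J.shape[1]
# GGN-based LA posterior covariance (equation 1, up to notation):
# Sigma = (J^T J / sigma^2 + I / prior_scale^2)^{-1}, mean at the pre-trained parameter.
Sigma = torch.linalg.inv(J.T @ J / sigma**2 + torch.eye(P) / prior_scale**2)
```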
PAC-Bayes bound with data-dependent prior We consider a PAC-Bayes generalization error bound for the classification task used in McAllester [30] and Perez-Ortiz et al. [31] (specifically, equation (7) of Perez-Ortiz et al. [31]). Let be any PAC-Bayes prior distribution over , independent of the training dataset , and be an error function defined separately from the loss function. For any constant and , and any PAC-Bayes posterior distribution over , the following holds with probability at least : where , , and denotes the cardinality of . That is, and are the generalization error and the empirical error, respectively. The only restriction on here is that it cannot depend on the dataset .
Following the recent discussion in Perez-Ortiz et al. [31], one can construct data-dependent PAC-Bayes bounds by (i) randomly partitioning the dataset into and so that they are independent, (ii) using a PAC-Bayes prior distribution that depends only on (i.e., is independent of , so it belongs to ), (iii) using a PAC-Bayes posterior distribution that depends on the entire dataset , and (iv) computing the empirical error on the target subset (not the entire dataset ). In summary, one can modify the aforementioned original PAC-Bayes bound through our data-dependent prior as
(2)
where is the cardinality of . For simplicity, we denote the sets of inputs and outputs of the split datasets () as .
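As a rough numerical illustration of how such a bound is evaluated, the sketch below uses the classical McAllester square-root form as an assumed stand-in; the paper's equation 2 follows Perez-Ortiz et al. [31] and may be tighter, so this is only meant to show how the ingredients (empirical error on the held-out split, KL to the data-dependent prior, the split size, and the confidence parameter) combine.

```python
import math

def mcallester_bound(emp_err, kl, n, delta=0.1):
    """PAC-Bayes bound in the McAllester square-root form (an assumed stand-in
    for the paper's equation 2): with probability >= 1 - delta,
        gen_err <= emp_err + sqrt((kl + log(2 * sqrt(n) / delta)) / (2 * n)).
    emp_err: empirical error of the posterior on the bound split D_2
    kl:      KL(posterior || data-dependent prior built from D_1)
    n:       |D_2|, the number of examples the empirical error is computed on
    """
    return emp_err + math.sqrt((kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n))

# Illustrative numbers only (not taken from Table 1):
print(mcallester_bound(emp_err=0.05, kl=20.0, n=5000, delta=0.1))
```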
2.2 Design of PAC-Bayes prior and posterior
Our goal is to construct scale-invariant and . To this end, we first assume a parameter pre-trained on the negative log-likelihood function with . This parameter can be obtained with standard NN optimization procedures (e.g., stochastic gradient descent (SGD) with momentum). Then, we consider the linearized NN at the pre-trained parameter with the auxiliary variable as
(3) |
where is a vector-to-matrix diagonal operator. Note that equation 3 is the first-order Taylor approximation (i.e., linearization) of the NN with perturbation , given input and parameter : , where denotes element-wise multiplication (Hadamard product) of two vectors. Here we write the perturbation in parameter space as instead of a single variable such as . This design of linearization matches the scale of the perturbation (i.e., ) to the scale of in a component-wise manner. A similar idea was proposed in the context of pruning at initialization [32, 33] to measure the importance of each connection independently of its weight. In this context, our perturbation can be viewed as a perturbation in connectivity space, obtained by decomposing the scale and connectivity of the parameter.
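A minimal sketch of this connectivity linearization for a single input follows, under the assumption that the linearized model is f(x; theta*) + J_theta(x)(theta* ⊙ c) with a flat connectivity vector c (the function and variable names here are ours, not the authors' code).

```python
import torch

def connectivity_linearized(model, x, c):
    """First-order model of equation 3 (sketch): f(x; theta*) + J_theta(x) (theta* * c).
    The parameter-space perturbation is theta* * c, so each coordinate of the
    perturbation inherits the scale of the corresponding pre-trained weight."""
    params = list(model.parameters())
    theta = torch.cat([p.reshape(-1) for p in params]).detach()
    out = model(x.unsqueeze(0)).reshape(-1)          # f(x; theta*), shape (K,)
    shift = []
    for k in range(out.numel()):                     # one gradient row per output dimension
        grads = torch.autograd.grad(out[k], params, retain_graph=True)
        row = torch.cat([g.reshape(-1) for g in grads])
        shift.append(row @ (theta * c))              # J_theta(x)[k, :] (theta* * c)
    return out.detach() + torch.stack(shift)
```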
Based on this, we define a data-dependent prior () over connectivity as
(4) |
This distribution can be translated to a distribution over parameter by considering the distribution of perturbed parameter (): . We now define the PAC-Bayes posterior over connectivity as follows:
(5)
(6)
(7)
where is a concatenation of (i.e., the Jacobian of the perturbed NN w.r.t. at input and connectivity ) along the training inputs . Our PAC-Bayes posterior is the posterior of a Bayesian linear regression problem w.r.t. the connectivity : where and are i.i.d. samples of . Again, it is equivalent to the posterior distribution over the parameter where , assuming that all components of are non-zero. Note that this assumption can easily be satisfied by considering the prior and posterior distributions of only the non-zero components of the NN. Although we choose this restriction for theoretical tractability, future work could modify this choice to achieve diverse predictions by considering the distribution of the zero coordinates. We refer to Appendix C for detailed derivations of and .
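Concretely, the closed-form posterior over the connectivity is just Bayesian linear regression with the connectivity Jacobian J_c = J_theta diag(theta*) as the design matrix. Below is a sketch; we write the prior as N(0, prior_scale^2 I), where the exact prior scale of equation 4 is a hyper-parameter left symbolic here, and all names are illustrative.

```python
import numpy as np

def cl_posterior(J_theta, theta, residual, sigma, prior_scale):
    """Gaussian posterior over connectivity c (a sketch of equations 5-7).
    J_theta:  (N*K, P) Jacobian of network outputs w.r.t. parameters at theta*
    theta:    (P,) pre-trained parameter
    residual: (N*K,) regression targets minus network outputs, Y - f(X; theta*)
    """
    J_c = J_theta * theta[None, :]           # connectivity Jacobian J_theta @ diag(theta)
    P = theta.shape[0]
    precision = J_c.T @ J_c / sigma**2 + np.eye(P) / prior_scale**2
    cov = np.linalg.inv(precision)
    mean = cov @ (J_c.T @ residual) / sigma**2
    return mean, cov                          # parameter-space posterior: theta* + theta* * c
```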
Remark 2.1 (Two-phase training).
The prior distribution in equation 4 is data-dependent since it depends on the pre-trained parameter optimized on the training subset . On the other hand, the posterior distribution in equation 5 depends on both (through ) and (through Bayesian linear regression). Intuitively, one attains the PAC-Bayes posterior with two-phase training: pre-training with and Bayesian fine-tuning with . A similar idea of linearized fine-tuning was proposed in the context of transfer learning by Achille et al. [34] and Maddox et al. [35].
Now we provide an invariance property of the prior and posterior distributions w.r.t. function-preserving scale transformations. The main intuition behind this proposition is that the Jacobian w.r.t. connectivity is invariant to function-preserving scaling transformations (see the proof in Appendix A.1).
Proposition 2.2 (Scale-invariance of PAC-Bayes prior and posterior).
Let be an invertible diagonal linear transformation such that . Then, both the PAC-Bayes prior and posterior are invariant under :
Furthermore, generalization and empirical errors are also invariant to .
2.3 Resulting PAC-Bayes bound
Now we plug the prior and posterior into the modified PAC-Bayes generalization error bound of equation 2. Accordingly, we obtain a novel generalization error bound, named PAC-Bayes-CTK, which is guaranteed to be invariant to scale transformations (hence free of the contradiction of the FM hypothesis mentioned in Sec. 1).
Theorem 2.3 (PAC-Bayes-CTK and its invariance).
Note that recent works on resolving the FM contradiction focused only on the scale-invariance of the sharpness metric: indeed, their generalization bounds are not invariant to scale transformations due to scale-dependent terms (equation (34) in Tsuzuku et al. [14] and equation (6) in Kwon et al. [15]). On the other hand, the generalization bound of Petzka et al. [16] (Theorem 11 in their paper) only holds for single-layer NNs, whereas ours places no restrictions on the network structure. Therefore, to the best of our knowledge, our PAC-Bayes bound is the first scale-invariant PAC-Bayes bound. To highlight the theoretical implications, we show representative cases of in Proposition 2.2 in Appendix D (e.g., weight decay for networks with BN), where the generalization bounds of the other studies are variant but ours is invariant, resolving the FM contradiction at the bound level.
The following corollary explains why we name PAC-Bayes bound in Theorem 2.3 PAC-Bayes-CTK.
Corollary 2.4 (Relation between CTK and PAC-Bayes-CTK).
Let us define the empirical Connectivity Tangent Kernel (CTK) of as . Note that the empirical CTK has non-zero eigenvalues of ; then the following holds for in Theorem 2.3: (i) for and (ii) for . Since , these terms of the summation in the sharpness part of PAC-Bayes-CTK vanish to 0. Furthermore, the sharpness term of PAC-Bayes-CTK is a monotonically increasing function of each eigenvalue of the empirical CTK.
Note that Corollary 2.4 clarifies why in Theorem 2.3 is called the sharpness term of PAC-Bayes-CTK. As the eigenvalues of the CTK measure the sensitivity of the output w.r.t. perturbations of the connectivity, a sharp pre-trained parameter has large CTK eigenvalues, which increases the sharpness term and hence the generalization gap according to Corollary 2.4.
Finally, Proposition 2.5 shows that empirical CTK is also scale-invariant.
Proposition 2.5 (Scale-invariance of empirical CTK).
Let be a function-preserving scale transformation as in Proposition 2.2. Then the empirical CTK at parameter is invariant under :
(9)
Remark 2.6 (Connections to empirical NTK).
The empirical CTK resembles the existing empirical Neural Tangent Kernel (NTK) at parameter [17]: . Note that the deterministic NTK in Jacot et al. [17] is the infinite-width limiting kernel at initialized parameters, while the empirical NTK can be defined at any parameter of a finite-width NN. The main difference between the empirical CTK and the empirical NTK lies in the definition of the Jacobian: in the CTK, the Jacobian is computed w.r.t. the connectivity , while the empirical NTK uses the Jacobian w.r.t. the parameters . Therefore, another PAC-Bayes bound can be derived from the linearization of where . As this bound is related to the eigenvalues of , we call it PAC-Bayes-NTK and provide its derivation in Appendix A. Note that PAC-Bayes-NTK is not scale-invariant, since in general.
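A quick numerical check of Proposition 2.5 and this remark on a two-layer ReLU toy model (a sketch with our own names, not the paper's experiment): the Dinh et al.-style rescaling T(W1, W2) = (alpha W1, W2 / alpha) preserves the function, leaves the empirical CTK unchanged, and changes the empirical NTK.

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 3)                                 # 5 inputs with 3 features
W1, W2 = torch.randn(4, 3), torch.randn(1, 4)         # f(x) = W2 relu(W1 x)

def kernels(W1, W2, x):
    """Return (empirical NTK, empirical CTK) = (J J^T, J diag(theta)^2 J^T)."""
    W1 = W1.clone().requires_grad_(True)
    W2 = W2.clone().requires_grad_(True)
    theta = torch.cat([W1.reshape(-1), W2.reshape(-1)]).detach()
    rows = []
    for xi in x:
        out = (W2 @ torch.relu(W1 @ xi)).squeeze()
        g = torch.autograd.grad(out, (W1, W2))
        rows.append(torch.cat([gi.reshape(-1) for gi in g]))
    J = torch.stack(rows)
    J_c = J * theta[None, :]
    return J @ J.T, J_c @ J_c.T

alpha = 3.0                                            # function-preserving layer rescaling
ntk, ctk = kernels(W1, W2, x)
ntk_s, ctk_s = kernels(alpha * W1, W2 / alpha, x)
print(torch.allclose(ctk, ctk_s, atol=1e-5))           # True: CTK is scale-invariant
print(torch.allclose(ntk, ntk_s, atol=1e-5))           # False: NTK changes with the scale
```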
2.4 Computing approximate bound in real world problems
To verify that the PAC-Bayes bound in Theorem 2.3 is non-vacuous, we compute it on real-world problems. We use the CIFAR-10 and CIFAR-100 datasets [36], where the 50K training instances are randomly partitioned into of cardinality 45K and of cardinality 5K. We pre-train ResNet-18 [37] with a mini-batch size of 1K on with SGD with an initial learning rate of 0.4 and momentum 0.9. We use cosine annealing for learning rate scheduling [38] with a warmup during the initial 10% of training steps. We fix and select based on the negative log-likelihood of .
To compute equation 8, one needs (i) the -based perturbation term, (ii) the -based sharpness term, and (iii) samples from the PAC-Bayes posterior . The in equation 6 can be obtained by minimizing via the first-order optimality condition. Note that this is a convex optimization problem w.r.t. , since is the parameter of the linear regression problem. We use the Adam optimizer [39] with a fixed learning rate of 1e-4 to solve it. For the sharpness term, we apply the Lanczos algorithm to approximate the eigenspectrum of , following [40], with 100 Lanczos iterations as in their setting. Lastly, we estimate the empirical error and test error with 8 samples from the CL/LL implementation of the Randomize-Then-Optimize (RTO) framework [41, 42]. We refer to Appendix E for the RTO implementation of CL/LL.
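Any Lanczos-type routine over matrix-vector products suffices for the sharpness term; below is a simpler matrix-free sketch that extracts the top eigenvalues with ARPACK via SciPy. The paper instead uses the stochastic Lanczos quadrature of [40] to approximate the full spectrum, and `ctk_matvec` is a hypothetical user-supplied function computing v -> C v (e.g., a VJP followed by a JVP through the network with connectivity-scaled Jacobians).

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

def top_ctk_eigenvalues(ctk_matvec, dim, k=100):
    """Top-k eigenvalues of the (N*K x N*K) empirical CTK without materializing it.
    ctk_matvec(v): hypothetical matrix-free product v -> C_theta* v.
    dim:           N * K, the size of the kernel matrix.
    """
    op = LinearOperator((dim, dim), matvec=ctk_matvec, dtype=np.float64)
    vals = eigsh(op, k=k, which="LM", return_eigenvectors=False)
    return np.sort(vals)[::-1]
```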
| | CIFAR-10 | | | | CIFAR-100 | | | |
|---|---|---|---|---|---|---|---|---|
| Parameter scale | 0.5 | 1.0 | 2.0 | 4.0 | 0.5 | 1.0 | 2.0 | 4.0 |
| | 14515.0039 | 14517.7793 | 14517.3506 | 14518.4746 | 12872.6895 | 12874.4395 | 12873.9512 | 12875.541 |
| Bias | 13.9791 | 13.4685 | 13.3559 | 13.3122 | 25.3686 | 24.8064 | 24.9102 | 24.7557 |
| Sharpness | 24.6874 | 24.6938 | 24.6926 | 24.6941 | 26.0857 | 26.0894 | 26.0874 | 26.0916 |
| KL | 19.3332 | 19.0812 | 19.0243 | 19.0032 | 25.7271 | 25.4479 | 25.4988 | 25.4236 |
| Test err. | 0.0468 ± 0.0014 | 0.0463 ± 0.0013 | 0.0462 ± 0.0012 | 0.0460 ± 0.0013 | 0.2257 ± 0.0020 | 0.2252 ± 0.0017 | 0.2256 ± 0.0015 | 0.2253 ± 0.0017 |
| PAC-Bayes-CTK | 0.0918 ± 0.0013 | 0.0911 ± 0.0011 | 0.0909 ± 0.0011 | 0.0907 ± 0.0009 | 0.2874 ± 0.0034 | 0.2862 ± 0.0031 | 0.2860 ± 0.0032 | 0.2862 ± 0.0032 |
Table 1 provides the bound and its related terms for the resulting model. First of all, we find that our estimated PAC-Bayes-CTK is non-vacuous [43]: the estimated bound is better than random guessing. Note that a non-vacuous bound is not trivial in PAC-Bayes analysis: only a few PAC-Bayes works [44, 43, 31] verified the non-vacuous property of their bounds, while others [45, 14] did not. To check the invariance property of our bound, we scale the scale-invariant parameters in ResNet-18 (i.e., parameters preceding BN layers) by fixed constants . Note that this scaling does not change the function represented by the NN because of the BN layers, so the error bound should be preserved. Table 1 shows that our bound and its related terms are indeed invariant to these transformations. In contrast, PAC-Bayes-NTK is very sensitive to the parameter scale, as shown in Table 7 in Appendix J.
2.5 Connectivity Sharpness and its efficient computation
Now, we focus on the fact that the trace of the CTK is also invariant to the parameter scale by Proposition 2.5. Unlike PAC-Bayes-CTK/NTK, the trace of the CTK/NTK does not require an onerous hyper-parameter selection of . Therefore, we simply define as a practical sharpness measure at , named Connectivity Sharpness (CS), to sidestep the complex computation of PAC-Bayes-CTK. This metric can easily be applied to find NNs with better generalization, similar to other sharpness metrics (e.g., the trace of the Hessian), as shown in [7]. We evaluate the detection performance of CS in Sec. 4.1. The following corollary shows, conceptually, how CS can explain the generalization performance of NNs.
Corollary 2.7 (Connectivity sharpness, Informal).
Assume the CTK and the KL divergence term of PAC-Bayes-CTK as defined in Theorem 2.3. Then, if CS vanishes to zero or diverges to infinity, the KL divergence term of PAC-Bayes-CTK does so as well.
As the trace of a matrix can be efficiently estimated by Hutchinson's method [46], one can compute CS without explicitly forming the entire CTK. We refer to Appendix F for the detailed procedure of computing CS. As CS is invariant to function-preserving scale transformations by Proposition 2.5, it does not contradict the FM hypothesis either.
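A sketch of such a Hutchinson estimate of CS (Appendix F describes the authors' exact procedure; the code and names below are our illustration): each Rademacher probe v costs one forward and one backward pass, since v^T J_c J_c^T v = ||diag(theta) J^T v||^2.

```python
import torch

def connectivity_sharpness(model, X, num_probes=16):
    """Hutchinson estimate of CS = tr(C_theta*) = E_v [ || diag(theta) J^T v ||^2 ]."""
    params = list(model.parameters())
    theta = torch.cat([p.reshape(-1) for p in params]).detach()
    estimate = 0.0
    for _ in range(num_probes):
        out = model(X).reshape(-1)                                    # all outputs stacked, length N*K
        v = (torch.randint(0, 2, out.shape) * 2 - 1).to(out.dtype)    # Rademacher probe
        grads = torch.autograd.grad(out, params, grad_outputs=v)      # VJP: J^T v
        jc_t_v = torch.cat([g.reshape(-1) for g in grads]) * theta    # diag(theta) J^T v
        estimate += (jc_t_v ** 2).sum().item() / num_probes
    return estimate
```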
3 Bayesian NNs with scale-invariance
This section provides the practical implications of the posterior distribution used in PAC-Bayes analysis. We interpret the PAC-Bayes posterior in equation 5 as a modified result of LA [8]. Then, we show this modification improves existing LA in the presence of weight decay regularization. Finally, we explain how one can efficiently construct a Bayesian NN from equation 5.
3.1 Pitfalls of weight decay with BN in Laplace Approximation
One can view the parameter-space version of as a modification of in equation 1 obtained by (i) replacing the isotropic damping term with a parameter-scale-dependent damping term () and (ii) adding a perturbation to the mean of the Gaussian distribution. In this section, we focus on the effect of replacing the damping term of LA in the presence of weight decay for batch-normalized NNs. We refer to [47, 48] for a discussion of the effect of adding a perturbation to LA with linearized NNs.
The main difference between the covariance terms of LA (equation 1) and equation 7 is the definition of the Jacobian (i.e., parameter or connectivity), analogous to the difference between the empirical CTK and NTK in Remark 2.6. Therefore, we name the Connectivity Laplace (CL) approximate posterior.
To compare the CL posterior with existing LA, we explain how weight decay regularization with BN produces unexpected side effects in existing LA. This side effect can be quantified if we consider a linearized NN for LA, called Linearized Laplace (LL; Foong et al. [49]). Note that LL is well known to be better calibrated than non-linearized LA for estimating 'in-between' uncertainty. Assuming , the predictive distributions of the linearized NN for equation 1 and of CL are
(10)
(11) |
for any input , where the subscript of CTK/NTK denotes concatenation. We refer to Appendix G for detailed derivations. The following proposition shows how weight decay regularization on the scale-invariant parameters introduced by BN can amplify the predictive uncertainty of equation 10. Note that the primary regularizing effect of weight decay comes from the regularization of scale-invariant parameters [50, 13].
Proposition 3.1 (Uncertainty amplifying effect for LL).
Assume that is a weight decay regularization that acts on the scale-invariant parameters (e.g., parameters preceding BN layers) by multiplying them by , and that all non-scale-invariant parameters are fixed. Then, the predictive uncertainty of LL is amplified by , while the predictive uncertainty of CL is preserved:
where denotes the variance of a random variable.
Recently, [47, 48] observed pitfalls similar to Proposition 3.1. However, their solution requires a more complicated hyper-parameter search: an independent prior selection for each normalized parameter group. In contrast, CL does not increase the number of hyper-parameters to be optimized compared to LL. We believe this difference makes CL more attractive to practitioners.
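The effect is easy to reproduce on a toy model that is scale-invariant by construction, f(theta; x) = (theta / ||theta||) . x (our own example, not the paper's BN setup): shrinking theta the way weight decay would leaves the function unchanged, amplifies the LL predictive variance, and leaves the CL predictive variance untouched.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(theta, X):                     # scale-invariant toy model: f(theta; x) = (theta/||theta||) . x
    return X @ (theta / np.linalg.norm(theta))

def jac(theta, X, eps=1e-6):         # forward-difference Jacobian df/dtheta, shape (N, P)
    return np.stack([(f(theta + eps * e, X) - f(theta, X)) / eps for e in np.eye(theta.size)], axis=1)

def predictive_var(J_train, J_test, sigma=0.3, prior_scale=1.0):
    cov = np.linalg.inv(J_train.T @ J_train / sigma**2 + np.eye(J_train.shape[1]) / prior_scale**2)
    return np.diag(J_test @ cov @ J_test.T)

theta = rng.normal(size=8)
X_train, X_test = rng.normal(size=(3, 8)), rng.normal(size=(2, 8))
alpha = 4.0                          # weight decay shrinks scale-invariant parameters: theta -> theta / alpha

for th in (theta, theta / alpha):
    J_tr, J_te = jac(th, X_train), jac(th, X_test)
    print("LL:", predictive_var(J_tr, J_te),             # amplified after shrinking theta
          "CL:", predictive_var(J_tr * th, J_te * th))   # unchanged: J diag(theta) is invariant
```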
4 Experiments
Here we describe experiments that demonstrate (i) the effectiveness of Connectivity Sharpness (CS) as a generalization measurement metric and (ii) the usefulness of Connectivity Laplace (CL) as an efficient general-purpose Bayesian NN: CL resolves the contradiction of the FM hypothesis and shows calibration performance that is stable w.r.t. the selection of the prior scale.
4.1 Connectivity Sharpness as a generalization measurement metric
To verify that CS indeed correlates better with generalization performance than existing sharpness measures, we evaluate three correlation metrics on the CIFAR-10 dataset: (a) Kendall's rank-correlation coefficient () [51], (b) granulated Kendall's coefficients and their average () [7], and (c) the conditional independence test () [7]. For all correlation metrics, a larger value means a stronger correlation between sharpness and generalization.
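A sketch of how the granulated coefficient can be computed (the grouping code and names are ours, not the authors'): for each hyper-parameter, average Kendall's tau over groups of models that differ only in that hyper-parameter, then average over hyper-parameters.

```python
import numpy as np
from scipy.stats import kendalltau

def granulated_kendall(measure, gen_gap, configs, hyperparams):
    """measure[i], gen_gap[i]: sharpness and generalization gap of model i;
    configs[i]: dict of its hyper-parameter settings (depth, width, lr, ...)."""
    per_hp = {}
    for hp in hyperparams:
        keys = [tuple(sorted((k, v) for k, v in c.items() if k != hp)) for c in configs]
        taus = []
        for key in set(keys):                       # models identical except for `hp`
            idx = [i for i, k in enumerate(keys) if k == key]
            if len(idx) > 1:
                tau, _ = kendalltau([measure[i] for i in idx], [gen_gap[i] for i in idx])
                taus.append(tau)
        per_hp[hp] = float(np.nanmean(taus))
    return per_hp, float(np.mean(list(per_hp.values())))   # per-hyper-parameter taus and their average
```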
We compare CS to the following baseline sharpness measures: trace of the Hessian (; Keskar et al. [19]), trace of the empirical Fisher (; Jastrzebski et al. [52]), trace of the empirical NTK at (), the Fisher-Rao metric (FR; Liang et al. [18]), Adaptive Sharpness (AS; Kwon et al. [15]), and four PAC-Bayes-bound-based measures: Sharpness-Orig. (SO), Pacbayes-Orig. (PO), Sharpness-Mag. (SM), and Pacbayes-Mag. (PM), which are eq. (52), (49), (62), (61) in Jiang et al. [7]. For computing the granulated Kendall correlation, we use 5 hyper-parameters (network depth, network width, learning rate, weight decay, and mini-batch size) and 3 options for each (thus we train models with different training configurations). We vary the depth and width of the NN based on VGG-13 [53]. We refer to Appendix H for experimental details.
| | tr() | tr() | tr() | SO | PO | SM | PM | AS | FR | CS |
|---|---|---|---|---|---|---|---|---|---|---|
| (rank corr.) | 0.706 | 0.679 | 0.703 | 0.490 | 0.436 | 0.473 | 0.636 | 0.755 | 0.649 | 0.837 |
| network depth | 0.764 | 0.652 | 0.978 | -0.358 | -0.719 | 0.774 | 0.545 | 0.756 | 0.771 | 0.978 |
| network width | 0.687 | 0.922 | 0.330 | -0.533 | -0.575 | 0.495 | 0.564 | 0.827 | 0.921 | 0.978 |
| mini-batch size | 0.976 | 0.810 | 0.988 | 0.859 | 0.893 | 0.909 | 0.750 | 0.829 | 0.685 | 0.905 |
| learning rate | 0.966 | 0.713 | 1.000 | 0.829 | 0.874 | 0.057 | 0.621 | 0.794 | 0.565 | 0.897 |
| weight decay | -0.031 | -0.103 | 0.402 | 0.647 | 0.711 | 0.168 | 0.211 | 0.710 | 0.373 | 0.742 |
| (avg.) | 0.672 | 0.599 | 0.739 | 0.289 | 0.237 | 0.481 | 0.538 | 0.783 | 0.663 | 0.900 |
| (cond. MI) | 0.320 | 0.243 | 0.352 | 0.039 | 0.041 | 0.049 | 0.376 | 0.483 | 0.288 | 0.539 |
In Table 2, CS shows the best results for , , and among all sharpness measures. The granulated Kendall coefficient of CS is also higher than that of the other sharpness measures for 3 out of 5 hyper-parameters and competitive for the remaining ones. The main difference between CS and the other sharpness measures lies in the correlation with network width and weight decay: for network width, sharpness measures other than CS, , AS, and FR fail to capture a strong correlation. While SO/PO can capture the correlation with weight decay, we believe this is due to their weight-norm term. However, this term interferes with capturing the sharpness-generalization correlation related to the number of parameters (i.e., width/depth), whereas CS/AS does not suffer from such a problem. It is also notable that FR fails to capture this correlation despite its invariance property.
4.2 Connectivity Laplace as an efficient general-purpose Bayesian NN
To assess the effectiveness of CL as a general-purpose Bayesian NN, we consider uncertainty calibration on UCI dataset and CIFAR-10/100.
| | Original [54] | | | | GAP variants [49] | | | |
|---|---|---|---|---|---|---|---|---|
| | Deep Ensemble | MCDO | LL | CL | Deep Ensemble | MCDO | LL | CL |
| boston_housing | 2.90 ± 0.03 | 2.63 ± 0.01 | 2.85 ± 0.01 | 2.88 ± 0.02 | 2.71 ± 0.01 | 2.68 ± 0.01 | 2.74 ± 0.01 | 2.75 ± 0.01 |
| concrete_strength | 3.06 ± 0.01 | 3.20 ± 0.00 | 3.22 ± 0.01 | 3.11 ± 0.02 | 4.03 ± 0.07 | 3.42 ± 0.00 | 3.47 ± 0.01 | 4.03 ± 0.02 |
| energy_efficiency | 0.74 ± 0.01 | 1.92 ± 0.01 | 2.12 ± 0.01 | 0.83 ± 0.01 | 0.77 ± 0.01 | 1.78 ± 0.01 | 2.02 ± 0.01 | 0.90 ± 0.02 |
| kin8nm | -1.07 ± 0.00 | -0.80 ± 0.01 | -0.90 ± 0.00 | -1.07 ± 0.00 | -0.94 ± 0.00 | -0.71 ± 0.00 | -0.87 ± 0.00 | -0.93 ± 0.00 |
| naval_propulsion | -4.83 ± 0.00 | -3.85 ± 0.00 | -4.57 ± 0.00 | -4.76 ± 0.00 | -2.22 ± 0.33 | -3.36 ± 0.01 | -3.66 ± 0.11 | -3.80 ± 0.07 |
| power_plant | 2.81 ± 0.00 | 2.91 ± 0.00 | 2.91 ± 0.00 | 2.81 ± 0.00 | 2.91 ± 0.00 | 2.97 ± 0.00 | 2.98 ± 0.00 | 2.91 ± 0.00 |
| protein_structure | 2.89 ± 0.00 | 2.96 ± 0.00 | 2.91 ± 0.00 | 2.89 ± 0.00 | 3.11 ± 0.00 | 3.07 ± 0.00 | 3.07 ± 0.00 | 3.13 ± 0.00 |
| wine | 1.21 ± 0.00 | 0.96 ± 0.01 | 1.24 ± 0.01 | 1.27 ± 0.01 | 1.48 ± 0.01 | 1.03 ± 0.00 | 1.45 ± 0.01 | 1.43 ± 0.00 |
| yacht_hydrodynamics | 1.26 ± 0.04 | 2.17 ± 0.06 | 1.20 ± 0.04 | 1.25 ± 0.04 | 1.71 ± 0.03 | 3.06 ± 0.02 | 1.78 ± 0.02 | 1.74 ± 0.01 |
UCI regression datasets We implement full-curvature versions of LL and CL and evaluate them on the 9 UCI regression datasets [54] and their GAP variants [49] to compare calibration performance on in-between uncertainty. We train an MLP with a single hidden layer. We fix and choose from {0.01, 0.1, 1, 10, 100} using the log-likelihood of a validation dataset. We use 8 random seeds to compute the average and standard error of the test negative log-likelihoods. Table 3 shows the test NLL for LL/CL and two baselines (Deep Ensemble [55] and Monte-Carlo DropOut (MCDO; Gal and Ghahramani [56])). Eight ensemble members are used in Deep Ensemble, and 32 MC samples are used in LL, CL, and MCDO. Table 3 shows that CL performs better than LL on 6 out of 9 datasets. Although LL shows better calibration on 3 datasets in both settings, the performance gaps there are not as severe as on the other 6 datasets, where CL performs better.
Image Classification We evaluate the uncertainty calibration performance of CL on CIFAR-10/100. As baseline methods, we consider a deterministic network, Monte-Carlo Dropout (MCDO; [56]), Monte-Carlo Batch Normalization (MCBN; [57]), Deep Ensemble [55], Batch Ensemble [58], and LL [25, 9]. We use the Randomize-Then-Optimize (RTO) implementation of LL/CL described in Appendix E. We measure the Expected Calibration Error (ECE; Guo et al. [59]), negative log-likelihood (NLL), and Brier score (Brier.) for ensemble predictions. We also measure the area under the receiver operating characteristic curve (AUC) for OOD detection, using SVHN [60] as the OOD dataset. For more details on the experimental setting, please refer to Appendix I.
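For reference, a standard implementation sketch of ECE with equal-width confidence bins (15 bins is a common default; the paper's exact binning is described in Appendix I):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=15):
    """probs: (N, K) predicted class probabilities (e.g. averaged over posterior samples);
    labels: (N,) integer labels. ECE = sum_b (n_b / N) * |acc_b - conf_b|."""
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece
```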
Table 4 shows uncertainty calibration results on CIFAR-100. We refer to Appendix I for results in other settings, including CIFAR-10 and VGGNet [53]. Our CL shows better results than all baselines except Deep Ensemble on every uncertainty calibration metric (NLL, ECE, Brier., and AUC). This means the scale-invariance of the CTK improves Bayesian inference, consistent with the results on the toy examples. Although Deep Ensemble presents the best results in 3 out of 4 metrics, it requires full training from initialization for each ensemble member, while LL/CL requires only post-hoc training on top of the pre-trained NN for each member. Particularly noteworthy is that CL presents results competitive with Deep Ensemble at a much smaller computational cost.
| CIFAR-100 | NLL () | ECE () | Brier. () | AUC () |
|---|---|---|---|---|
| Deterministic | 1.5370 ± 0.0117 | 0.1115 ± 0.0017 | 0.3889 ± 0.0031 | - |
| MCDO | 1.4264 ± 0.0110 | 0.0651 ± 0.0008 | 0.3925 ± 0.0020 | 0.6907 ± 0.0121 |
| MCBN | 1.4689 ± 0.0106 | 0.0998 ± 0.0016 | 0.3750 ± 0.0028 | 0.7982 ± 0.0210 |
| Batch Ensemble | 1.4029 ± 0.0031 | 0.0842 ± 0.0005 | 0.3582 ± 0.0010 | 0.7887 ± 0.0115 |
| Deep Ensemble | 1.0110 | 0.0507 | 0.2740 | 0.7802 |
| Linearized Laplace | 1.1673 ± 0.0093 | 0.0632 ± 0.0010 | 0.3597 ± 0.0020 | 0.8066 ± 0.0120 |
| Connectivity Laplace (Ours) | 1.1307 ± 0.0042 | 0.0524 ± 0.0009 | 0.3319 ± 0.0005 | 0.8423 ± 0.0204 |
Figure 1: Uncertainty calibration (NLL, ECE, Brier score) of LL, CL, and the deterministic baseline over various prior scales.
Robustness to the selection of prior scale Figure 1 shows the uncertainty calibration results (i.e., NLL, ECE, Brier) over various values for LL, CL, and the Deterministic (Det.) baseline. As mentioned in previous works [27, 28], the uncertainty calibration results of LL are extremely sensitive to the selection of . In particular, LL shows severe under-fitting in the large (i.e., small damping) regime. In contrast, CL shows stable performance across a wide range of .
5 Conclusion
This study introduced novel PAC-Bayes prior and posterior distributions that extend the robustness of generalization bounds w.r.t. parameter transformations by decomposing the scale and connectivity of parameters. The resulting generalization bound is guaranteed to be invariant to any function-preserving scale transformation. This resolves the contradiction of the FM hypothesis caused by general scale transformations, which existing generalization error bounds could not address, bringing the theory much closer to practice. As a result of this theoretical enhancement, our posterior distribution for PAC-Bayes analysis can also be interpreted as an improved Laplace Approximation without pathological failures under weight decay regularization. We therefore expect this work to contribute to reducing the theory-practice gap in understanding the generalization of NNs and to lead to follow-up studies that interpret this effect more clearly.
References
- Kendall and Gal [2017] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Ovadia et al. [2019] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model's uncertainty? evaluating predictive uncertainty under dataset shift. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- Neyshabur et al. [2015a] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. In ICLR (Workshop), 2015a.
- Zhang et al. [2017] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization, 2017.
- Arora et al. [2018] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In Jennifer Dy and Andreas Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 254–263. PMLR, 10–15 Jul 2018.
- Hochreiter and Schmidhuber [1995] Sepp Hochreiter and Jürgen Schmidhuber. Simplifying neural nets by discovering flat minima. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7. MIT Press, 1995.
- Jiang et al. [2020] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. In International Conference on Learning Representations, 2020.
- MacKay [1992] David J. C. MacKay. A practical bayesian framework for backpropagation networks. Neural Comput., 4(3):448–472, may 1992. ISSN 0899-7667. doi: 10.1162/neco.1992.4.3.448.
- Daxberger et al. [2021a] Erik Daxberger, Agustinus Kristiadi, Alexander Immer, Runa Eschenhagen, Matthias Bauer, and Philipp Hennig. Laplace redux - effortless bayesian deep learning. In A. Beygelzimer, Y. Dauphin, P. Liang, and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems, 2021a.
- Dinh et al. [2017] Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1019–1028. PMLR, 06–11 Aug 2017.
- Li et al. [2018] Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
- Loshchilov and Hutter [2019] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.
- Zhang et al. [2019] Guodong Zhang, Chaoqi Wang, Bowen Xu, and Roger Grosse. Three mechanisms of weight decay regularization. In International Conference on Learning Representations, 2019.
- Tsuzuku et al. [2020] Yusuke Tsuzuku, Issei Sato, and Masashi Sugiyama. Normalized flat minima: Exploring scale invariant definition of flat minima for neural networks using PAC-Bayesian analysis. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9636–9647. PMLR, 13–18 Jul 2020.
- Kwon et al. [2021] Jungmin Kwon, Jeongseop Kim, Hyunseo Park, and In Kwon Choi. Asam: Adaptive sharpness-aware minimization for scale-invariant learning of deep neural networks. arXiv preprint arXiv:2102.11600, 2021.
- Petzka et al. [2021] Henning Petzka, Michael Kamp, Linara Adilova, Cristian Sminchisescu, and Mario Boley. Relative flatness and generalization. Advances in Neural Information Processing Systems, 34, 2021.
- Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clement Hongler. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
- Liang et al. [2019] Tengyuan Liang, Tomaso Poggio, Alexander Rakhlin, and James Stokes. Fisher-rao metric, geometry, and complexity of neural networks. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 888–896. PMLR, 2019.
- Keskar et al. [2017] Nitish Shirish Keskar, Jorge Nocedal, Ping Tak Peter Tang, Dheevatsa Mudigere, and Mikhail Smelyanskiy. On large-batch training for deep learning: Generalization gap and sharp minima. In 5th International Conference on Learning Representations, ICLR 2017, 2017.
- Neyshabur et al. [2017] Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017.
- Lee et al. [2019a] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019a.
- He et al. [2020] Bobby He, Balaji Lakshminarayanan, and Yee Whye Teh. Bayesian deep ensembles via the neural tangent kernel. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1010–1022. Curran Associates, Inc., 2020.
- Lee et al. [2020] Jaehoon Lee, Samuel Schoenholz, Jeffrey Pennington, Ben Adlam, Lechao Xiao, Roman Novak, and Jascha Sohl-Dickstein. Finite versus infinite neural networks: an empirical study. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 15156–15172. Curran Associates, Inc., 2020.
- Hui and Belkin [2021] Like Hui and Mikhail Belkin. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. In International Conference on Learning Representations, 2021.
- Khan et al. [2019] Mohammad Emtiyaz E Khan, Alexander Immer, Ehsan Abedi, and Maciej Korzepa. Approximate inference turns deep networks into gaussian processes. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- Immer et al. [2021] Alexander Immer, Maciej Korzepa, and Matthias Bauer. Improving predictions of bayesian neural nets via local linearization. In AISTATS, pages 703–711, 2021.
- Ritter et al. [2018] Hippolyt Ritter, Aleksandar Botev, and David Barber. A scalable laplace approximation for neural networks. In International Conference on Learning Representations, 2018.
- Kristiadi et al. [2020] Agustinus Kristiadi, Matthias Hein, and Philipp Hennig. Being bayesian, even just a bit, fixes overconfidence in relu networks. In International conference on machine learning, pages 5436–5446. PMLR, 2020.
- Daxberger et al. [2021b] Erik Daxberger, Eric Nalisnick, James U Allingham, Javier Antoran, and Jose Miguel Hernandez-Lobato. Bayesian deep learning via subnetwork inference. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 2510–2521. PMLR, 18–24 Jul 2021b.
- McAllester [1999] David A McAllester. Pac-bayesian model averaging. In Proceedings of the twelfth annual conference on Computational learning theory, pages 164–170, 1999.
- Perez-Ortiz et al. [2021] Maria Perez-Ortiz, Omar Rivasplata, John Shawe-Taylor, and Csaba Szepesvári. Tighter risk certificates for neural networks. Journal of Machine Learning Research, 22(227):1–40, 2021.
- Lee et al. [2019b] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: SINGLE-SHOT NETWORK PRUNING BASED ON CONNECTION SENSITIVITY. In International Conference on Learning Representations, 2019b.
- Lee et al. [2019c] Namhoon Lee, Thalaiyasingam Ajanthan, Stephen Gould, and Philip HS Torr. A signal propagation perspective for pruning neural networks at initialization. arXiv preprint arXiv:1906.06307, 2019c.
- Achille et al. [2021] Alessandro Achille, Aditya Golatkar, Avinash Ravichandran, Marzia Polito, and Stefano Soatto. Lqf: Linear quadratic fine-tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15729–15739, 2021.
- Maddox et al. [2021] Wesley Maddox, Shuai Tang, Pablo Moreno, Andrew Gordon Wilson, and Andreas Damianou. Fast adaptation with linearized neural networks. In International Conference on Artificial Intelligence and Statistics, pages 2737–2745. PMLR, 2021.
- Krizhevsky [2009] A Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, University of Toronto, 2009.
- He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- Loshchilov and Hutter [2016] Ilya Loshchilov and Frank Hutter. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
- Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Ghorbani et al. [2019] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. In International Conference on Machine Learning, pages 2232–2241. PMLR, 2019.
- Bardsley et al. [2014] Johnathan M Bardsley, Antti Solonen, Heikki Haario, and Marko Laine. Randomize-then-optimize: A method for sampling from posterior distributions in nonlinear inverse problems. SIAM Journal on Scientific Computing, 36(4):A1895–A1910, 2014.
- Matthews et al. [2017] Alexander G de G Matthews, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Sample-then-optimize posterior sampling for bayesian linear models. In NIPS Workshop on Advances in Approximate Bayesian Inference, 2017.
- Zhou et al. [2018] Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P Adams, and Peter Orbanz. Non-vacuous generalization bounds at the imagenet scale: a pac-bayesian compression approach. arXiv preprint arXiv:1804.05862, 2018.
- Dziugaite and Roy [2017] Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the 33rd Annual Conference on Uncertainty in Artificial Intelligence (UAI), 2017.
- Foret et al. [2020] Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. arXiv preprint arXiv:2010.01412, 2020.
- Hutchinson [1989] Michael F Hutchinson. A stochastic estimator of the trace of the influence matrix for laplacian smoothing splines. Communications in Statistics-Simulation and Computation, 18(3):1059–1076, 1989.
- Antoran et al. [2021] Javier Antoran, James Urquhart Allingham, David Janz, Erik Daxberger, Eric Nalisnick, and José Miguel Hernández-Lobato. Linearised laplace inference in networks with normalisation layers and the neural g-prior. In Fourth Symposium on Advances in Approximate Bayesian Inference, 2021.
- Antoran et al. [2022] Javier Antoran, David Janz, James U Allingham, Erik Daxberger, Riccardo Rb Barbano, Eric Nalisnick, and José Miguel Hernández-Lobato. Adapting the linearised laplace model evidence for modern deep learning. In International Conference on Machine Learning, pages 796–821. PMLR, 2022.
- Foong et al. [2019] Andrew YK Foong, Yingzhen Li, José Miguel Hernández-Lobato, and Richard E Turner. ’in-between’uncertainty in bayesian neural networks. arXiv preprint arXiv:1906.11537, 2019.
- Van Laarhoven [2017] Twan Van Laarhoven. L2 regularization versus batch and weight normalization. arXiv preprint arXiv:1706.05350, 2017.
- Kendall [1938] Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
- Jastrzebski et al. [2021] Stanislaw Jastrzebski, Devansh Arpit, Oliver Astrand, Giancarlo B Kerg, Huan Wang, Caiming Xiong, Richard Socher, Kyunghyun Cho, and Krzysztof J Geras. Catastrophic fisher explosion: Early phase fisher matrix impacts generalization. In International Conference on Machine Learning, pages 4772–4784. PMLR, 2021.
- Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
- Hernández-Lobato and Adams [2015] José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of bayesian neural networks. In International conference on machine learning, pages 1861–1869. PMLR, 2015.
- Lakshminarayanan et al. [2017] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
- Gal and Ghahramani [2016] Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR, 2016.
- Teye et al. [2018] Mattias Teye, Hossein Azizpour, and Kevin Smith. Bayesian uncertainty estimation for batch normalized deep networks. In International Conference on Machine Learning, pages 4907–4916. PMLR, 2018.
- Wen et al. [2020] Yeming Wen, Dustin Tran, and Jimmy Ba. Batchensemble: an alternative approach to efficient ensemble and lifelong learning. arXiv preprint arXiv:2002.06715, 2020.
- Guo et al. [2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1321–1330. PMLR, 06–11 Aug 2017.
- Netzer et al. [2011] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.
- Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
- Bishop [2006] Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
- Murphy [2012] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT press, 2012.
- Neyshabur et al. [2015b] Behnam Neyshabur, Russ R Salakhutdinov, and Nati Srebro. Path-sgd: Path-normalized optimization in deep neural networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc., 2015b.
- Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 448–456, Lille, France, 07–09 Jul 2015. PMLR.
- Lobacheva et al. [2021] Ekaterina Lobacheva, Maxim Kodryan, Nadezhda Chirkova, Andrey Malinin, and Dmitry P Vetrov. On the periodic behavior of neural network training with batch normalization and weight decay. Advances in Neural Information Processing Systems, 34:21545–21556, 2021.
- Devlin et al. [2018] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Meyer et al. [2021] Raphael A Meyer, Cameron Musco, Christopher Musco, and David P Woodruff. Hutch++: Optimal stochastic trace estimation. In Symposium on Simplicity in Algorithms (SOSA), pages 142–155. SIAM, 2021.
- Bradbury et al. [2018] James Bradbury, Roy Frostig, Peter Hawkins, Matthew James Johnson, Chris Leary, Dougal Maclaurin, George Necula, Adam Paszke, Jake VanderPlas, Skye Wanderman-Milne, and Qiao Zhang. JAX: composable transformations of Python+NumPy programs. github, 2018.
- Woodbury [1950] M.A. Woodbury. Inverting Modified Matrices. Memorandum Report / Statistical Research Group, Princeton. Statistical Research Group, 1950.
- Ren et al. [2019] Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
- Van Amersfoort et al. [2020] Joost Van Amersfoort, Lewis Smith, Yee Whye Teh, and Yarin Gal. Uncertainty estimation using a single deep deterministic neural network. In Hal Daumé III and Aarti Singh, editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 9690–9700. PMLR, 13–18 Jul 2020.
Appendix A Proofs
A.1 Proof of Proposition 2.2
Proof.
Since the prior is independent of the parameter scale, is trivial. For the Jacobian w.r.t. parameters, we have
Then, the Jacobian of the NN w.r.t. connectivity at satisfies
(12)
(13)
where the first equality holds from the above one and the fact that is a diagonal linear transformation. Therefore, the covariance of posterior is invariant to .
Moreover, the mean of posterior is also invariant to .
Therefore, equations 6 and 7 are invariant to function-preserving scale transformations. The remaining part of the theorem concerns the definition of the function-preserving scale transformation . For the generalization error, the following holds
This proof can be extended to the empirical error . ∎
A.2 Proof of Theorem 2.3
Proof.
(Construction of KL divergence) To construct PAC-Bayes-CTK, we need to arrange KL divergence between posterior and prior as follows:
where the first equality uses the KL divergence between two Gaussian distributions, the third equality uses trace properties ( and for scalar ), and the last equality uses the definition of the PAC-Bayes prior (). For the sharpness term, we first compute the term as
Since and is positive semi-definite, the matrix has non-zero eigenvalues of . Since the trace is the sum of the eigenvalues and the log-determinant is the sum of the log-eigenvalues, we have
where . By plugging this KL divergence to the equation 2, we get equation 8.
(Eigenvalues of ) To show the scale-invariance of PAC-Bayes-CTK, it is sufficient to show that the KL divergence between the posterior and the prior is scale-invariant: does not depend on the PAC-Bayes prior/posterior, and we already showed the invariance of the empirical/generalization error terms in Proposition 2.2. To show the invariance of the KL divergence, we consider the Connectivity Tangent Kernel (CTK) as defined in Corollary 2.4:
Since the CTK is a real symmetric matrix, one can consider its eigenvalue decomposition as , where is an orthogonal matrix and is a diagonal matrix. Then the following holds for
Therefore, eigenvalues of are where are eigenvalues of CTK (and diagonal elements of ).
(Scale invariance of CTK) The scale-invariance property of CTK is a simple application of equation 13:
Therefore, the CTK is invariant to any function-preserving scale transformation, and so are its eigenvalues. This guarantees the invariance of and its eigenvalues. In summary, we have shown the scale-invariance of the sharpness term of the KL divergence. All that remains is to show the invariance of the perturbation term, which was already proved in the proof of Proposition 2.2. Therefore, PAC-Bayes-CTK is invariant to any function-preserving scale transformation. ∎
A.3 Proof of Corollary 2.4
Proof.
In the proof of Theorem 2.3, we showed that the eigenvalues of can be represented as
where are the eigenvalues of the CTK. Now, we identify the eigenvalues of the CTK. To this end, we consider the singular value decomposition (SVD) of the Jacobian w.r.t. connectivity as , where and are orthogonal matrices and is a rectangular diagonal matrix. Then, the CTK can be represented as . In summary, the column vectors of are eigenvectors of the CTK, and the eigenvalues of the CTK are the squares of the singular values of , so . Therefore, for all eigenvalues of , with equality for . All that remains is to show that the sharpness term of PAC-Bayes-CTK is a monotonically increasing function of each eigenvalue of the CTK. To this end, we first note that
is a monotonically decreasing function for and is a monotonically decreasing function for . Since the sharpness term of the KL divergence is
this is a monotonically increasing function for , since for . We plot and in Figure 2.
Figure 2: plots of the two functions referenced in the proof above.
∎
A.4 Proof of Proposition 2.5
We refer to the "Scale invariance of CTK" part of the proof of Theorem 2.3. This is a direct application of the scale-invariance property of the Jacobian w.r.t. connectivity.
A.5 Proof of Corollary 2.7
Proof.
Since CS is the trace of the CTK, it is the sum of the eigenvalues of the CTK. As shown in the proof of Corollary 2.4, the eigenvalues of the CTK are the squares of the singular values of the Jacobian w.r.t. connectivity . Therefore, the eigenvalues of the CTK are non-negative and all vanish to zero if CS vanishes to zero.
This means the sharpness term of the KL divergence vanishes to zero. Furthermore, the singular values of the Jacobian w.r.t. also vanish to zero in this case. Therefore, also vanishes to zero. Similarly, if CS diverges to infinity, then at least one eigenvalue of the CTK diverges to infinity. In this case, the following holds
Therefore, the KL divergence term of PAC-Bayes-CTK also diverges to infinity. ∎
A.6 Proof of Proposition 3.1
Proof.
By assumption, all non-scale-invariant parameters are fixed. This means we exclude these parameters from the sampling procedure of CL and LL. In terms of the predictive distribution, this can be written as
where and for the scale-invariant parameter set . Thereby, we mask the gradients of the non-scale-invariant parameters as zero. Therefore, this can be arranged as follows
where is a masking vector (i.e., one for included components and zero for excluded components). Then, the weight decay regularization for scale-invariant parameters can be represented as
Therefore, we get
for empirical NTK and
for empirical CTK. Therefore, we get
This gives us
∎
A.7 Derivation of PAC-Bayes-NTK
Theorem A.1 (PAC-Bayes-NTK).
Let us assume a pre-trained parameter with data . Let us assume the PAC-Bayes prior and posterior as
(14)
(15)
(16)
(17) |
By applying to data-dependent PAC-Bayes bound (equation 2), we get
(18) |
where are eigenvalues of and . This upper bound is not scale-invariant in general.
Proof.
The main difference between PAC-Bayes-CTK and PAC-Bayes-NTK is the definition of the Jacobian: PAC-Bayes-CTK uses the Jacobian w.r.t. connectivity, while PAC-Bayes-NTK uses the Jacobian w.r.t. parameters. Therefore, the "Construction of KL divergence" part of the proof of Theorem 2.3 carries over, except that
and are the eigenvalues of . Note that these eigenvalues satisfy
where are eigenvalues of NTK. ∎
Remark A.2 (Function-preserving scale transformation to NTK).
In contrast to the CTK, the scale-invariance property does not apply to the NTK because of the Jacobian w.r.t. parameters:
If all parameters are scale-invariant (or, equivalently, if the Jacobian is masked for all non-scale-invariant parameters as in the proof of Proposition 3.1), the scale of the NTK is inversely proportional to the scale of the parameters.
A.8 Deterministic limiting kernel of CTK
Theorem A.3 (Deterministic limiting kernel of CTK).
Let us assume an -layer NN with a Lipschitz activation function and NTK initialization. Then the empirical CTK converges in probability to a deterministic limiting kernel as the layer widths grow to infinity sequentially. Furthermore, holds.
Proof.
The proof is a modification of the proof of convergence of the NTK in Jacot et al. [17], accounting for the NTK initialization (i.e., standard Gaussian for all parameters). We proceed by induction. For a single-layer network, the CTK can be written as:
Since the weights are sampled from a standard Gaussian distribution, whose variance is 1, and the product of two independent random variables that converge in probability converges to the product of their limits, the base case follows. If we assume the CTK of the -th layer converges in probability to the NTK of the -th layer, then the convergence for the -th layer also holds, since the empirical NTK of the -th layer, which converges to the deterministic limiting NTK of the -th layer, is multiplied by a product of two random weights that converges to 1. Therefore, the empirical CTK converges in probability to the deterministic limiting CTK, which is equal to the deterministic limiting NTK. ∎
Appendix B Details of Squared Loss for Classification Tasks
For the classification tasks in Sec. 4.2, we use the squared loss instead of the cross-entropy loss since our theoretical results are built on the squared loss. Here, we describe how we use the squared loss to mimic the cross-entropy loss. Several works [23, 24] have used the squared loss for classification instead of the cross-entropy loss. Specifically, Lee et al. [23] used
where is the number of classes, and Hui and Belkin [24] used
for the per-sample loss, where is the sample loss given input , target , and parameter ; is the -th component of ; and are dataset-specific hyper-parameters.
These works use the mean to reduce the vector-valued loss to a scalar loss. However, this can be problematic when the number of classes is large: as the number of classes increases, the denominator of the mean (the number of classes) grows while the target value remains 1 (one-hot label). As a result, the gradient for the target class shrinks in scale. To avoid this unfavorable effect, we use the sum instead of the mean to reduce the vector-valued loss to a scalar loss, i.e.,
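The following minimal sketch contrasts the two reductions; the function and argument names are ours, and the dataset-specific scaling hyper-parameters of Hui and Belkin [24] are omitted.

```python
import jax
import jax.numpy as jnp

def sum_squared_loss(logits, labels, num_classes):
    """Squared loss against one-hot targets, reduced over classes by a sum.

    Unlike the mean reduction, the gradient w.r.t. the target logit does not
    shrink as num_classes grows.
    """
    targets = jax.nn.one_hot(labels, num_classes)             # (batch, num_classes)
    per_example = jnp.sum((logits - targets) ** 2, axis=-1)   # sum over classes, not mean
    return jnp.mean(per_example)                              # average over the mini-batch
```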
Appendix C Derivation of PAC-Bayes posterior
Derivation of
For Bayesian linear regression, we compute the posterior of
where is sampled i.i.d. and the prior of is given as . By concatenating over the data, we get
where are the concatenations of , respectively, and . It is well known [62, 63] that the posterior of for this problem is
Similarly, we define the Bayesian linear regression problem as
where and the regression coefficient is in this case. That is, we treat as the target and as the input of the linear regression problem. By concatenating over the data, we get
By plugging this into the posterior of the Bayesian linear regression problem, we get
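For concreteness, the closed-form posterior used above can be computed as in the following sketch (our notation: `X` plays the role of the regression input, which is the Jacobian w.r.t. connectivity in our construction, and `y` the regression target).

```python
import jax.numpy as jnp

def blr_posterior(X, y, noise_var, prior_var):
    """Gaussian posterior of Bayesian linear regression with a zero-mean
    isotropic prior: y = X w + eps, eps ~ N(0, noise_var * I),
    w ~ N(0, prior_var * I)."""
    d = X.shape[1]
    precision = X.T @ X / noise_var + jnp.eye(d) / prior_var   # posterior precision
    cov = jnp.linalg.inv(precision)                            # posterior covariance
    mean = cov @ X.T @ y / noise_var                           # posterior mean
    return mean, cov
```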
Derivation of
We define the perturbed parameter as follows:
Since is affine in , we get the distribution of as
Appendix D Representative cases of function-preserving scaling transformations
Activation-wise rescaling transformation [14, 64] For NNs with ReLU activations, the following holds for : , where the rescaling transformation (for a simple two-layer linear NN with weight matrices , the first case of equation 19 corresponds to the -th row of and the second case to the -th column of ) is defined as
(19)
Note that is a finer-grained rescaling transformation than the layer-wise rescaling (i.e., a common for all activations in layer ) discussed in Dinh et al. [10], who showed that even layer-wise rescaling can sharpen pre-trained solutions in terms of the trace of the Hessian (i.e., contradicting the FM hypothesis). This contradiction also arises for previous PAC-Bayes bounds [14, 15] due to their scale-dependent terms.
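A minimal sketch of such a rescaling for a two-layer ReLU network (weight shapes and names are ours): scaling the incoming weights of one hidden unit by alpha and its outgoing weights by 1/alpha leaves the network function unchanged for alpha > 0.

```python
import jax.numpy as jnp
from jax import random

def two_layer_relu(params, x):
    W1, W2 = params
    return W2 @ jnp.maximum(W1 @ x, 0.0)

def rescale_unit(params, i, alpha):
    """Activation-wise rescaling of hidden unit i: the i-th row of W1 is
    scaled by alpha and the i-th column of W2 by 1/alpha."""
    W1, W2 = params
    return (W1.at[i, :].multiply(alpha), W2.at[:, i].multiply(1.0 / alpha))

key = random.PRNGKey(0)
k1, k2, k3 = random.split(key, 3)
params = (random.normal(k1, (16, 8)), random.normal(k2, (4, 16)))
x = random.normal(k3, (8,))
assert jnp.allclose(two_layer_relu(params, x),
                    two_layer_relu(rescale_unit(params, 3, 5.0), x),
                    rtol=1e-4, atol=1e-5)
```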
Weight decay with BN layers [65] For parameters preceding a BN layer,
(20)
for an input and a positive vector . This implies that scaling transformations on these parameters preserve the function represented by the NN for : , where the scaling transformation is defined, for , as
(21)
Note that weight decay regularization [12, 13] can be realized as an instance of (e.g., for all activations preceding BN layers). Therefore, by Theorem 2.2 and Proposition 2.5, our CTK-based bound is invariant to weight decay regularization applied to the parameters preceding BN layers. We also refer to [50, 66] for an optimization perspective on weight decay with BN.
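To make the BN case concrete, the sketch below (training-mode batch normalization, helper names ours) checks numerically that shrinking the parameters preceding a BN layer by a positive factor, as weight decay does, leaves the BN output, and hence the network function, unchanged.

```python
import jax.numpy as jnp
from jax import random

def batch_norm(z, gamma, beta, eps=1e-5):
    """Training-mode BN over the batch axis."""
    mu, var = z.mean(axis=0), z.var(axis=0)
    return gamma * (z - mu) / jnp.sqrt(var + eps) + beta

key = random.PRNGKey(0)
kW, kx = random.split(key)
W = random.normal(kW, (32, 10))        # parameters preceding the BN layer
x = random.normal(kx, (128, 32))       # a mini-batch of inputs
gamma, beta = jnp.ones(10), jnp.zeros(10)

out = batch_norm(x @ W, gamma, beta)
out_scaled = batch_norm(x @ (0.1 * W), gamma, beta)   # e.g., W shrunk by weight decay
assert jnp.allclose(out, out_scaled, atol=1e-3)
```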
Appendix E Implementation of Connectivity Laplace
To estimate the empirical/generalization bound in Sec. 2.4 and calibrate uncertainty in Sec. 4.2, we need to sample from the posterior . For this, we sample perturbations in connectivity space
so that for equation 6. To draw such samples, we provide a novel approach to sampling from LA/CL without curvature approximation. To this end, we consider the following optimization problem
where and . By the first-order optimality condition, we have
Rearranging this w.r.t. the optimizer , we get
Since both and are sampled from independent Gaussian distributions, we have
Therefore, the optimal solution of the randomized optimization problem is
Similarly, sampling from CL can be implemented as the following optimization problem:
where and . Since we sample the noise of the data/perturbation and optimize the perturbation, this can be interpreted as a Randomize-Then-Optimize implementation of the Laplace approximation and Connectivity Laplace [41, 42].
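A sketch of one plausible instantiation of such a Randomize-Then-Optimize sampler for a linearized model is given below; the function names, the flat-parameter convention, the isotropic prior/noise variances, and the plain gradient-descent inner solver are our illustrative assumptions rather than the exact objective above.

```python
import jax
import jax.numpy as jnp
from jax import random

def rto_sample(key, f, theta, noise_var, prior_var, num_steps=1000, lr=1e-3):
    """Draw one (approximate) sample of the perturbation from
    N(0, (J^T J / noise_var + I / prior_var)^{-1}), where J is the Jacobian of
    f at theta. f(theta) returns the stacked outputs on the training inputs;
    theta is a flat parameter (LA) or connectivity (CL) vector."""
    k_data, k_prior = random.split(key)
    out_dim = f(theta).shape[0]
    eps = jnp.sqrt(noise_var) * random.normal(k_data, (out_dim,))    # output-space noise
    eta = jnp.sqrt(prior_var) * random.normal(k_prior, theta.shape)  # prior-space noise

    def objective(delta):
        # J @ delta via a Jacobian-vector product of the linearized network
        _, jd = jax.jvp(f, (theta,), (delta,))
        return (jnp.sum((jd - eps) ** 2) / noise_var
                + jnp.sum((delta - eta) ** 2) / prior_var)

    grad_fn = jax.grad(objective)
    delta = jnp.zeros_like(theta)
    for _ in range(num_steps):                 # any optimizer can be used here
        delta = delta - lr * grad_fn(delta)
    return delta
```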
Appendix F Details of computing Connectivity Sharpness
It is well known that the empirical NTK and the full Jacobian are intractable for modern NN architectures (e.g., ResNet [37] or BERT [67]), so one might wonder how Connectivity Sharpness can be computed for these architectures. However, Connectivity Sharpness in Sec. 2.5 is defined as the trace of the empirical CTK, so CS can be computed with Hutchinson's method [46, 68]. According to Hutchinson's method, the trace of a matrix is
where is a random variable with (e.g., standard normal or Rademacher distribution). Since in our case, we further use a mini-batch approximation to compute : (i) sample from the Rademacher distribution for a mini-batch of size , (ii) compute with a Jacobian-vector product in JAX [69], and (iii) compute . Then, the sum of over all mini-batches in the training dataset is a Monte Carlo approximation of CS with sample size 1. Empirically, we found this approximation to be sufficiently stable to capture the correlation between sharpness and generalization, as shown in Sec. 4.1.
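A single-probe sketch of steps (i)–(iii) for one mini-batch; the forward-pass signature, the flat-parameter convention, and the multiplicative connectivity parameterization around one (so that the connectivity Jacobian is the parameter Jacobian scaled elementwise by the parameters) are our illustrative assumptions.

```python
import jax
import jax.numpy as jnp
from jax import random

def cs_minibatch(key, f, theta, x_batch):
    """One-probe Hutchinson estimate of the mini-batch contribution to tr(CTK).

    f(params, x) is the network forward pass; g parameterizes the outputs by a
    connectivity vector c around c = 1, and theta is kept flat for simplicity.
    """
    g = lambda c: f(theta * c, x_batch).ravel()
    # Rademacher probe in connectivity space
    v = 2.0 * random.bernoulli(key, 0.5, theta.shape).astype(theta.dtype) - 1.0
    # J_c v via a Jacobian-vector product
    _, jv = jax.jvp(g, (jnp.ones_like(theta),), (v,))
    return jnp.sum(jv ** 2)   # unbiased one-sample estimate of tr(J_c^T J_c) on this batch

# CS is then estimated by summing cs_minibatch over all mini-batches with fresh probes.
```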
Appendix G Predictive uncertainty of Connectivity/Linearized Laplace
In this section, we derive the predictive uncertainty of Linearized Laplace (LL) and Connectivity Laplace (CL). By the matrix inversion lemma [70], the weight covariance of LL is
Therefore, if , then the weight covariance of LL converges to
With this weight covariance and the linearized NN, the predictive uncertainty of LL is
Similarly, the predictive uncertainty of CL is
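For small models where the Jacobians fit in memory, the epistemic part of such a linearized predictive covariance can be evaluated explicitly, as in the sketch below (our names; for LL the Jacobian is taken w.r.t. parameters and for CL w.r.t. connectivity, and a zero-mean isotropic prior is assumed).

```python
import jax.numpy as jnp

def linearized_predictive_cov(J_train, J_test, noise_var, prior_var):
    """Predictive (epistemic) covariance of a Laplace posterior for the
    linearized model: J_test @ Sigma @ J_test^T with
    Sigma = (J_train^T J_train / noise_var + I / prior_var)^{-1}."""
    p = J_train.shape[1]
    weight_cov = jnp.linalg.inv(J_train.T @ J_train / noise_var + jnp.eye(p) / prior_var)
    return J_test @ weight_cov @ J_test.T
```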
Appendix H Details on sharpness-generalization experiments
To verify that CS correlates with generalization performance better than existing sharpness measures, we evaluate three metrics: (a) Kendall's rank-correlation coefficient [51], which measures the consistency between a sharpness measure and the generalization gap (i.e., whether a model with higher sharpness also has a higher generalization gap); (b) the granulated Kendall's coefficient [7], which computes Kendall's rank-correlation coefficient w.r.t. individual hyper-parameters to separately evaluate the effect of each hyper-parameter on the generalization gap; and (c) the conditional independence test [7], which captures the causal relationship between a measure and generalization.
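A toy illustration of metric (a); the arrays are made-up placeholders, one entry per trained model.

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical values: one sharpness score and one generalization gap per model.
sharpness = np.array([1.2, 3.4, 2.1, 5.6, 4.3])
gen_gap = np.array([0.02, 0.08, 0.05, 0.12, 0.09])

tau, p_value = kendalltau(sharpness, gen_gap)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")

# The granulated variant (b) computes the same coefficient within groups of
# models that differ in a single hyper-parameter and averages the results
# per hyper-parameter.
```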
Hyper-parameter | Values
---|---
network depth | 1, 2, 3
network width | 32, 64, 128
learning rate | 0.1, 0.032, 0.001
weight decay | 0.0, 1e-4, 5e-4
mini-batch size | 256, 1024, 4096
The three metrics are computed for the following baselines: the trace of the Hessian (; [19]), the trace of the Fisher information matrix (; [52]), the trace of the empirical NTK at (), and four PAC-Bayes-bound-based measures, sharpness-orig (SO), pacbayes-orig (PO), sharpness-mag (SM), and pacbayes-mag (PM), which are eqs. (52), (49), (62), and (61) in Jiang et al. [7].
For the granulated Kendall's coefficient, we use 5 hyper-parameters: network depth, network width, learning rate, weight decay, and mini-batch size, with 3 options for each hyper-parameter as in Table 5. We use VGG-13 [53] as the base model and adjust the depth and width of each conv block, adding a BN layer after every convolution layer. Specifically, the number of convolution layers of each conv block is the depth, and the number of channels of the convolution layers of the first conv block is the width. The subsequent conv blocks follow the original VGG width multipliers (, , ). An example with depth 1 and width 128 is depicted in Table 6; a sketch that generates this layer list is given after the table.
ConvNet Configuration |
---|
input (224 × 224 RGB image) |
Conv3-128 |
BN |
ReLU |
MaxPool |
Conv3-256 |
BN |
ReLU |
MaxPool |
Conv3-512 |
BN |
ReLU |
MaxPool |
Conv3-1024 |
BN |
ReLU |
MaxPool |
Conv3-1024 |
BN |
ReLU |
MaxPool |
FC-4096 |
ReLU |
FC-4096 |
ReLU |
FC-1000 |
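As referenced above, the following sketch (function name ours) generates the layer list of the depth/width-scaled VGG-13 in the notation of Table 6.

```python
def vgg13_config(depth, width, num_classes=1000):
    """Layer list for the scaled VGG-13: `depth` Conv-BN-ReLU repetitions per
    conv block and `width` channels in the first block, with the later blocks
    scaled as in Table 6 (2x, 4x, 8x, 8x)."""
    channels = [width, width * 2, width * 4, width * 8, width * 8]
    layers = []
    for c in channels:
        layers += [f"Conv3-{c}", "BN", "ReLU"] * depth
        layers.append("MaxPool")
    layers += ["FC-4096", "ReLU", "FC-4096", "ReLU", f"FC-{num_classes}"]
    return layers

# vgg13_config(1, 128) reproduces the configuration shown in Table 6.
```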
We use the SGD optimizer with momentum 0.9. We train each model for 200 epochs and use a cosine learning rate scheduler [38] with the first 30% of epochs as warm-up. The standard CIFAR-10 data augmentations (padding, random crop, random horizontal flip, and normalization) are used for the training data. For the analysis, we only use models with training accuracy above 99%, following Jiang et al. [7]. As a result, we use 200 out of the 243 trained models for our correlation analysis. For every experiment, we use 8 NVIDIA RTX 3090 GPUs.
Appendix I Details and additional results on BNN experiments
I.1 Experimental Setting
Uncertainty calibration on image classification tasks We pre-train models for 200 epochs on the CIFAR-10/100 datasets [36] with ResNet-18 [37], as mentioned in Sec. 2.4. We use an ensemble size of 8, except for Deep Ensemble [55] and Batch Ensemble [58], for which we use 4 ensemble members due to computational cost.
For evaluation, we define the single-member prediction as the one-hot representation of the network output with label smoothing; the label smoothing coefficient is 0.01 for CIFAR-10 and 0.1 for CIFAR-100. We define the ensemble prediction as the average of the single-member predictions. For OOD detection, we use the variance of the predictions in output space, which is competitive with recent OOD detection methods [71, 72]. We use 0.01 for and select the best by cross-validation. For every experiment, we use 8 NVIDIA RTX 3090 GPUs.
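A sketch of these member/ensemble predictions and the variance-based OOD score (names are ours; summing the per-class variances is one way to aggregate the output-space variance into a scalar score).

```python
import jax
import jax.numpy as jnp

def member_prediction(logits, num_classes, smoothing):
    """One-hot prediction of a single member with label smoothing
    (0.01 for CIFAR-10, 0.1 for CIFAR-100)."""
    one_hot = jax.nn.one_hot(jnp.argmax(logits, axis=-1), num_classes)
    return one_hot * (1.0 - smoothing) + smoothing / num_classes

def ensemble_prediction(member_logits, num_classes, smoothing):
    """member_logits: (members, batch, num_classes). Returns the averaged
    prediction and a per-example OOD score."""
    preds = jnp.stack([member_prediction(l, num_classes, smoothing)
                       for l in member_logits])
    mean_pred = preds.mean(axis=0)                 # ensemble prediction
    ood_score = preds.var(axis=0).sum(axis=-1)     # variance of predictions in output space
    return mean_pred, ood_score
```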
Appendix J Additional results on bound estimation
 | CIFAR-10 | | | | CIFAR-100 | | | 
---|---|---|---|---|---|---|---|---
Parameter scale | 0.5 | 1.0 | 2.0 | 4.0 | 0.5 | 1.0 | 2.0 | 4.0
 | 18746194.0 | 6206303.5 | 3335419.75 | 2623873.25 | 12688970.0 | 3916139.25 | 2819272.5 | 2662497.0
Bias | 483.86 | 427.0042 | 299.0085 | 197.3149 | 476.9061 | 478.1776 | 440.284 | 329.8767
Sharpness | 579.6815 | 472.0 | 402.8186 | 369.3761 | 547.2874 | 434.7583 | 398.5075 | 387.3265
KL divergence | 531.7708 | 449.5021 | 350.9135 | 283.3455 | 512.0967 | 456.4679 | 419.3957 | 358.6016
Test err. | 0.5617 ± 0.0670 | 0.4566 ± 0.0604 | 0.2824 ± 0.0447 | 0.1530 ± 0.0199 | 0.6210 ± 0.0096 | 0.6003 ± 0.0094 | 0.5499 ± 0.0100 | 0.4666 ± 0.0093
PAC-Bayes-NTK | 0.7985 ± 0.0694 | 0.6730 ± 0.0626 | 0.4718 ± 0.0465 | 0.3186 ± 0.0202 | 0.8530 ± 0.0140 | 0.8162 ± 0.0136 | 0.7602 ± 0.0112 | 0.6617 ± 0.0114
Appendix K Additional results on image classification
CIFAR-10 | NLL (↓) | ECE (↓) | Brier (↓) | AUC (↑)
---|---|---|---|---
Deterministic | 0.4086 ± 0.0018 | 0.0490 ± 0.0003 | 0.1147 ± 0.0005 | -
MCDO | 0.3889 ± 0.0049 | 0.0465 ± 0.0009 | 0.1106 ± 0.0015 | 0.7765 ± 0.0221
MCBN | 0.3852 ± 0.0012 | 0.0462 ± 0.0002 | 0.1108 ± 0.0003 | 0.9051 ± 0.0065
Batch Ensemble | 0.3544 ± 0.0036 | 0.0399 ± 0.0009 | 0.1064 ± 0.0012 | 0.9067 ± 0.0030
Deep Ensemble | 0.2243 | 0.0121 | 0.0776 | 0.7706
Linearized Laplace | 0.3366 ± 0.0013 | 0.0398 ± 0.0004 | 0.1035 ± 0.0003 | 0.8883 ± 0.0017
Connectivity Laplace (Ours) | 0.2674 ± 0.0028 | 0.0234 ± 0.0011 | 0.0946 ± 0.0010 | 0.9002 ± 0.0033

CIFAR-100 | NLL (↓) | ECE (↓) | Brier (↓) | AUC (↑)
---|---|---|---|---
Deterministic | 1.8286 ± 0.0066 | 0.1544 ± 0.0010 | 0.4661 ± 0.0018 | -
MCDO | 1.7439 ± 0.0089 | 0.1363 ± 0.0008 | 0.4456 ± 0.0017 | 0.6424 ± 0.0099
MCBN | 1.7491 ± 0.0075 | 0.1399 ± 0.0010 | 0.4488 ± 0.0015 | 0.7039 ± 0.0197
Batch Ensemble | 1.6142 ± 0.0101 | 0.1077 ± 0.0020 | 0.4143 ± 0.0027 | 0.7232 ± 0.0021
Deep Ensemble | 1.2006 | 0.0456 | 0.3228 | 0.6929
Linearized Laplace | 1.5806 ± 0.0054 | 0.1036 ± 0.0004 | 0.4127 ± 0.0010 | 0.6893 ± 0.0221
Connectivity Laplace (Ours) | 1.4073 ± 0.0039 | 0.0703 ± 0.0028 | 0.3827 ± 0.0012 | 0.7254 ± 0.0136