
Demystify Optimization and Generalization of Over-parameterized PAC-Bayesian Learning

Wei Huang    Chunrui Liu    Yilan Chen    Tianyu Liu    Richard Yi Da Xu
Abstract

PAC-Bayes is an analysis framework in which the training error can be expressed as the weighted average of the hypotheses under the posterior distribution while incorporating prior knowledge. Beyond being a pure generalization bound analysis tool, a PAC-Bayesian bound can also be incorporated into an objective function to train a probabilistic neural network, making it a powerful and relevant framework that can numerically provide a tight generalization bound for supervised learning. For simplicity, we refer to training a probabilistic neural network with objectives derived from PAC-Bayesian bounds as PAC-Bayesian learning. Despite its empirical success, the theoretical analysis of PAC-Bayesian learning for neural networks is rarely explored. This paper proposes a new class of convergence and generalization analyses for PAC-Bayes learning when it is used to train over-parameterized neural networks by gradient descent. For a wide probabilistic neural network, we show that when PAC-Bayes learning is applied, the convergence result corresponds to solving a kernel ridge regression whose kernel is the probabilistic neural tangent kernel (PNTK). Based on this finding, we further characterize a uniform PAC-Bayesian generalization bound which improves over the Rademacher complexity-based bound for non-probabilistic neural networks. Finally, drawing on the insight from our theoretical results, we propose a proxy measure for efficient hyperparameter selection, which is proven to be time-saving.


1 Introduction

Deep learning has demonstrated powerful learning capability thanks to its over-parameterized structure, and various network architectures have been responsible for a great leap in performance (LeCun et al., 2015). Because over-fitting and complex hyperparameter tuning are two of the major pain points in deep learning, designing generalization guarantees for deep networks has been a long-term pursuit for researchers (Zhang et al., 2021). To this end, a learning framework that trains a probabilistic neural network with a PAC-Bayesian bound as the objective function has been proposed (Bégin et al., 2016; Dziugaite & Roy, 2017; Neyshabur et al., 2017b; Raginsky et al., 2017; Neyshabur et al., 2017a; London, 2017; Smith & Le, 2017; Liu et al., 2021). We dub this learning method PAC-Bayesian learning. PAC-Bayesian learning has been shown to achieve competitive expected test set error while providing a tight generalization bound. Furthermore, a generalization bound computed from the training error can obviate the need for splitting data into train, test, or validation sets, which is highly useful for training a deep network with scarce data (Pérez-Ortiz et al., 2020; Grünwald & Mehta, 2020). Meanwhile, these advancements in PAC-Bayesian bounds have been widely adapted to different deep neural network structures, including meta-learning (Amit & Meir, 2018), convolutional neural networks (Zhou et al., 2018), binary activated multilayer networks (Letarte et al., 2019), partially aggregated neural networks (Biggs & Guedj, 2020), and graph neural networks (Liao et al., 2020).

Due to the tremendous empirical success of PAC-Bayesian learning, there is increasing interest in understanding its theoretical properties. Existing analyses, however, are either restricted to a specific model variant (Dziugaite & Roy, 2018a; Dziugaite et al., 2020) or rely heavily on empirical exploration (Neyshabur et al., 2017a). To the best of our knowledge, there has been no investigation of why the training of PAC-Bayesian learning succeeds, nor of why the PAC-Bayesian bound is tight on unseen data. Therefore, it is natural to ask: for a probabilistic neural network with an objective function derived from a PAC-Bayesian bound, trained by gradient descent,

  1. Q1:

    How effective is the training (i.e., the trainability) on the training set?

  2. Q2:

    How tight is the generalization bound (i.e., generalization) compared to that of learning frameworks using non-probabilistic neural networks?

Answering these questions is highly non-trivial due to the inherent non-convexity of over-parameterized models (Jain & Kar, 2017), the additional randomness introduced by probabilistic neural networks (Specht, 1990), and the additional challenges brought by the Kullback-Leibler (KL) divergence between the posterior and prior distributions.

In this paper, we answer the above questions by leveraging recent advances in the theory of deep learning in over-parameterized settings with extremely (or infinitely) wide networks. It can be shown that ultra-wide networks optimized with gradient descent achieve near-zero training error, and the critical factor that governs the training process is the so-called neural tangent kernel (NTK), which can be proven to remain unchanged during gradient descent training (Jacot et al., 2018), thus providing a guarantee of reaching the global minimum (Du et al., 2019a; Allen-Zhu et al., 2019). Under the PAC-Bayesian framework, the NTK is no longer calculated from the derivatives with respect to the weights directly, but instead from the gradients with respect to the distribution parameters of the weights. We call it the Probabilistic NTK (PNTK), based on which we build a convergence analysis characterizing the optimization process of PAC-Bayes learning. Thanks to the explicit solution obtained from the optimization analysis, we further formulate the generalization bound of PAC-Bayesian learning for the first time, and demonstrate its advantage by comparing it with theoretical generalization bounds of learning frameworks with non-stochastic neural networks (Arora et al., 2019a; Cao & Gu, 2019; Hu et al., 2019). We summarize our contributions as follows:

  • With a detailed characterization of gradient descent training of the PAC-Bayes objective function, we conclude that the final solution is kernel ridge regression with the PNTK as its kernel.

  • Based on the optimization solution, we derive a PAC-Bayesian bound for deep networks and find that it improves over the Rademacher complexity bound for non-probabilistic neural networks under a fair comparison.

  • PAC-Bayesian learning depends on a large number of hyperparameters. We design a training-free proxy based on our theoretical bound and show it is effective and time-saving.

  • Technically, our proof of convergence is similar to (Du et al., 2018; Arora et al., 2019a), but there are essential differences. The main one is that our network architecture is more complex (e.g., a probabilistic network contains two sets of parameters), and each set involves its own randomness, which requires bounding many terms differently and more elaborately. Given the particularly wide range of application scenarios of probabilistic neural networks, such as the Variational Auto-Encoder (Pu et al., 2016) and Deep Bayesian Networks (Nie et al., 2018), we hope our techniques can provide a basis for the analysis of wide probabilistic neural networks.

2 Related Work

PAC-Bayesian Analysis.

The Probably Approximately Correct (PAC) Bayes framework (McAllester, 1999a, b) can incorporate knowledge about the learning algorithm and a probability distribution over a set of hypotheses, thus providing a test performance (generalization) guarantee. Subsequently, the PAC-Bayesian method was adopted to analyze the generalization bound of probabilistic neural networks (Langford & Caruana, 2002b). The original PAC-Bayes theory only works with bounded loss functions; Haddouche et al. (2021) extend the PAC-Bayesian theory to learning problems with unbounded losses. In addition, several improved PAC-Bayesian bounds suitable for different scenarios are introduced by (Bégin et al., 2014, 2016). As a result of the flexibility and generalization properties of PAC-Bayes, it is widely used to analyze complex, non-convex, and over-parameterized optimization problems, especially over-parameterized neural networks (Guedj, 2019). Neyshabur et al. (2017b) present a generalization bound for feedforward neural networks with ReLU activations in terms of the product of the spectral norms of the layers and the Frobenius norms of the weights. London (2017) studies the generalization error of randomized learning algorithms, focusing on stochastic gradient descent (SGD), using a novel combination of PAC-Bayes and algorithmic stability.

PAC-Bayesian Learning.

Apart from providing theoretical analyses of the generalization properties of deep learning, PAC-Bayes has received increasing attention as a means to afford numerical generalization bounds (certificates) for practical deep learning algorithms. Langford & Caruana (2002a) first introduced a method to train a Bayesian neural network and used a refined PAC-Bayesian bound to compute an error upper bound. Later, Neyshabur et al. (2017a) extended this work by developing a training objective function derived from a relaxed PAC-Bayesian bound. In the standard application of PAC-Bayes, the prior is typically chosen to be a spherical Gaussian centered at the origin. However, without incorporating information from the data, the KL divergence might be unreasonably large, limiting the performance of the PAC-Bayes method. To fill this gap, a large body of literature proposes to obtain localized PAC-Bayes bounds via distribution-dependent priors learned from data (Ambroladze et al., 2007; Negrea et al., 2019; Dziugaite et al., 2020; Perez-Ortiz et al., 2021). Furthermore, Dziugaite & Roy (2018b) show how a differentially private data-dependent prior yields a valid PAC-Bayes bound when the data distribution is presumed unknown. More recently, research has focused on providing PAC-Bayesian bounds for more realistic architectures, such as ImageNet-scale networks (Zhou et al., 2018), binary activated multilayer networks (Letarte et al., 2019), partially aggregated neural networks (Biggs & Guedj, 2020), and graph neural networks (Liao et al., 2020). We denote the practical use of PAC-Bayesian algorithms to train over-parameterized neural networks as PAC-Bayesian learning; the target of this work is to demystify the success behind deep learning trained via PAC-Bayesian bounds.

Neural Tangent Kernel.

One of the greatest recent advances in deep learning theory is the proof that gradient descent can find global minima of deep neural networks thanks to over-parameterization (Du et al., 2019a, 2018; Allen-Zhu et al., 2019). Technically, Jacot et al. (2018) resort to the NTK, which stays constant during training. Under the mild assumption that the lowest eigenvalue of the NTK is greater than zero, any loss that is convex with respect to the output converges to the global minimum. Furthermore, the generalization ability of trained ultra-wide networks on unseen data has been characterized via Rademacher complexity analysis (Cao & Gu, 2019; Arora et al., 2019a). In this work, we compare the Rademacher bound and the PAC-Bayesian bound for ultra-wide networks, where the Rademacher bound of kernel ridge regression is formulated by Hu et al. (2019). Since its introduction, the NTK has been applied to different deep network structures, such as orthogonal initialization (Huang et al., 2020), convolutional networks (Arora et al., 2019b), graph neural networks (Du et al., 2019b; Huang et al., 2021), and transformers (Hron et al., 2020).

3 Preliminary

3.1 Notation

We use bold-faced letters for vectors and matrices, and regular letters for scalars. We use $\|\cdot\|_{2}$ to denote the Euclidean norm of a vector or the spectral norm of a matrix, and $\|\cdot\|_{F}$ the Frobenius norm of a matrix. For a neural network, $\phi(x)$ denotes the activation function; we adopt the ReLU activation in this work. Given a logical variable $a$, $\mathbb{I}(a)=1$ if $a$ is true and $\mathbb{I}(a)=0$ otherwise. Let ${\bf I}_{d}$ be the identity matrix of dimension $d\times d$. We denote $[n]=\{1,2,\ldots,n\}$. The least eigenvalue of a matrix ${\bf A}$ is denoted $\lambda_{0}({\bf A})=\lambda_{\min}({\bf A})$.

3.2 Over-parameterized Probabilistic Neural Network

In PAC-Bayesian learning we use probabilistic neural networks instead of deterministic networks, where the weights always follow a certain distribution. In this work, we adopt the Gaussian distribution for the weights, and define a two-layer probabilistic neural network governed by the following expression,

$$f({\bf x})=\frac{1}{\sqrt{m}}{\bf v}\phi({\bf W}{\bf x})\tag{1}$$

where ${\bf x}\in\mathbb{R}^{d}$ is the input, ${\bf W}\in\mathbb{R}^{m\times d}$ is the weight matrix of the first layer, and ${\bf v}\in\mathbb{R}^{m}$ is the weight vector of the second layer. For simplicity, we represent the weight matrix as a set of column vectors, i.e., ${\bf W}=({\bf w}_{1},{\bf w}_{2},\dots,{\bf w}_{m})^{\top}$. In particular, we use the re-parametrization trick for the weights in the first layer and initialize the weights in the second layer from a uniform distribution,

$${\bf w}_{r}=\boldsymbol{\mu}_{r}+\boldsymbol{\sigma}_{r}\odot\boldsymbol{\epsilon}_{r},\quad\boldsymbol{\epsilon}_{r}\sim\mathcal{N}({\bf 0},{\bf I}_{d});\qquad v_{r}\sim{\rm unif}(\{-1,1\})\tag{2}$$

where $\odot$ denotes the element-wise product, and $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are the mean and variance parameters, respectively. We then fix the second layer ${\bf v}$ and only optimize the first layer. This trick has been widely used in NTK papers (Arora et al., 2019a; Du et al., 2018) to simplify the analysis; the same results can be obtained for deep probabilistic neural networks.
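To make the setup concrete, the following is a minimal NumPy sketch of the network in Equations (1) and (2); the sizes and initialization constants are illustrative assumptions, not values prescribed by the analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 512, 10                               # width and input dimension (illustrative)

# Illustrative initialization: mu_ri ~ N(0, sigma_hat), sigma_ri = sigma_0
sigma_hat, sigma_0 = 1.0, 0.1
mu = rng.normal(0.0, sigma_hat, size=(m, d))  # mean parameters of the first layer
sigma = np.full((m, d), sigma_0)              # "variance" parameters of the first layer
v = rng.choice([-1.0, 1.0], size=m)           # second layer, fixed after initialization

def forward(x, eps):
    """f(x) = (1/sqrt(m)) v . ReLU(W x) with W = mu + sigma * eps, Eqs. (1)-(2)."""
    W = mu + sigma * eps                      # reparameterization trick, eps ~ N(0, I)
    return v @ np.maximum(W @ x, 0.0) / np.sqrt(m)

x = rng.standard_normal(d)
eps = rng.standard_normal((m, d))
print(forward(x, eps))
```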

3.3 The PAC-Bayesian Learning

Suppose the data $S=\{({\bf x}_{i},y_{i})\}_{i=1}^{n}$ are i.i.d. samples from a non-degenerate distribution $\mathcal{D}$. Let $\mathcal{H}$ denote the hypothesis space, and let $h({\bf x})$ be the prediction of hypothesis $h\in\mathcal{H}$ on sample ${\bf x}$. Then $R_{\mathcal{D}}(h)=\mathbb{E}_{({\bf x},y)\sim\mathcal{D}}[\ell(y,h({\bf x}))]$ is the generalization error of classifier $h$ and $\widehat{R}_{S}(h)=\frac{1}{n}\sum_{i=1}^{n}\ell(y_{i},h({\bf x}_{i}))$ is its empirical error, where $\ell(\cdot)$ is the loss function.

The PAC-Bayes theorem (Langford & Seeger, 2001; Seeger, 2002; Maurer, 2004) states the following.

Theorem 3.1.

Let $Q_{0}$ be some prior distribution over $\mathcal{H}$. Then for any $\delta\in(0,1]$, the following inequality holds uniformly for all posterior distributions $Q$ over $\mathcal{H}$ with probability at least $1-\delta$,

$$\mathrm{kl}\big(\widehat{R}_{S}(Q)\,\big\|\,R_{\mathcal{D}}(Q)\big)\leq\frac{\mathrm{KL}(Q\|Q_{0})+\log\frac{2\sqrt{n}}{\delta}}{n}.\tag{3}$$

Here $R_{\mathcal{D}}(Q)=\mathbb{E}_{({\bf x},y)\sim\mathcal{D},h\sim Q}[\ell(y,h({\bf x}))]=\mathbb{E}_{h\sim Q}[R_{\mathcal{D}}(h)]$ and $\widehat{R}_{S}(Q)=\mathbb{E}_{h\sim Q}[\widehat{R}_{S}(h)]$. $\mathrm{KL}(Q\|Q_{0})=\mathbb{E}_{Q}\big[\ln\frac{Q}{Q_{0}}\big]$ is the Kullback-Leibler (KL) divergence and $\mathrm{kl}(q\|q^{\prime})=q\log(\frac{q}{q^{\prime}})+(1-q)\log(\frac{1-q}{1-q^{\prime}})$ is the binary KL divergence. Furthermore, combining with Pinsker's inequality for the binary KL divergence, $\mathrm{kl}(\hat{p}\|p)\geq(p-\hat{p})^{2}/(2p)$ when $\hat{p}<p$, yields,

$$R_{\mathcal{D}}(Q)-\widehat{R}_{S}(Q)\leq\sqrt{2R_{\mathcal{D}}(Q)\,\frac{\mathrm{KL}(Q\|Q_{0})+\log\frac{2\sqrt{n}}{\delta}}{n}}.\tag{4}$$
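In practice, the risk certificate implied by Theorem 3.1 is typically computed by numerically inverting the binary KL rather than via the relaxation above. A minimal self-contained sketch (not the implementation of any cited work) using bisection:

```python
import math

def binary_kl(q, p):
    """Binary KL divergence kl(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q_hat, c, tol=1e-9):
    """Largest p >= q_hat with kl(q_hat || p) <= c, found by bisection.
    With c = (KL(Q||Q0) + log(2 sqrt(n)/delta)) / n, Theorem 3.1 says this
    upper bounds the population risk R_D(Q)."""
    lo, hi = q_hat, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(q_hat, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo

# illustrative numbers: empirical risk 0.05, n = 10000, KL = 50, delta = 0.05
n, kl_term, delta = 10000, 50.0, 0.05
c = (kl_term + math.log(2 * math.sqrt(n) / delta)) / n
print(kl_inverse(0.05, c))
```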

The above bound is the classical one. On the other hand, combining with the inequality $\sqrt{ab}\leq\frac{1}{2}(\bar{\lambda}a+\frac{b}{\bar{\lambda}})$, valid for all $\bar{\lambda}>0$, leads to the PAC-Bayes-$\lambda$ bound proposed by Thiemann et al. (2017),

Theorem 3.2.

Let $Q_{0}$ be some prior distribution over $\mathcal{H}$. Then for any $\delta\in(0,1]$, the following inequality holds uniformly for all posterior distributions $Q$ over $\mathcal{H}$ with probability at least $1-\delta$,

$$R_{\mathcal{D}}(Q)\leq\frac{\widehat{R}_{S}(Q)}{1-\bar{\lambda}/2}+\frac{\mathrm{KL}(Q\|Q_{0})+\log\frac{2\sqrt{n}}{\delta}}{n\bar{\lambda}(1-\bar{\lambda}/2)}.\tag{5}$$
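The right-hand side of (5) is straightforward to evaluate for a given $\bar{\lambda}\in(0,2)$; a small sketch with illustrative numbers:

```python
import math

def pac_bayes_lambda_bound(emp_risk, kl, n, delta=0.05, lam_bar=1.0):
    """PAC-Bayes-lambda bound of Theorem 3.2 for a fixed lam_bar in (0, 2)."""
    comp = (kl + math.log(2 * math.sqrt(n) / delta)) / n
    return emp_risk / (1 - lam_bar / 2) + comp / (lam_bar * (1 - lam_bar / 2))

print(pac_bayes_lambda_bound(emp_risk=0.05, kl=50.0, n=10000))
```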

In this work, we study PAC-Bayesian learning by training the probabilistic network described by Equation (1) with an objective derived from the PAC-Bayes bound (5). Specifically, the objective function can be expressed as,

$$\mathcal{B}(Q)=\widehat{R}_{S}(Q)+\lambda\frac{\mathrm{KL}(Q\|Q_{0})}{n}\tag{6}$$

where $\lambda$ is a hyperparameter introduced in a heuristic manner to make the method more flexible, playing the role of $1/\bar{\lambda}$ in (5). Since the term involving $\delta$ is a constant, we omit it from the objective function. Instead of optimizing ${\bf W}$ directly, we use the reparameterization trick (Kingma et al., 2015) to perform gradient descent,

$$\boldsymbol{\mu}_{r}(t+1)=\boldsymbol{\mu}_{r}(t)-\eta\frac{\partial\mathcal{B}}{\partial\boldsymbol{\mu}_{r}(t)},\qquad\boldsymbol{\sigma}_{r}(t+1)=\boldsymbol{\sigma}_{r}(t)-\eta\frac{\partial\mathcal{B}}{\partial\boldsymbol{\sigma}_{r}(t)}\tag{7}$$

where $\eta$ is the learning rate.
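Putting Equations (1), (2), (6), and (7) together, a minimal PyTorch sketch of one such training loop might look as follows, using a single Monte Carlo sample of $\widehat{R}_{S}(Q)$ per step and the standard diagonal-Gaussian KL (cf. Equation (16) below); all sizes and constants are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
m, d, n, lam, eta = 512, 10, 128, 0.1, 0.1            # illustrative sizes and constants
X, Y = torch.randn(n, d), torch.randn(n)

mu = torch.randn(m, d, requires_grad=True)            # posterior means
sigma = torch.full((m, d), 0.1, requires_grad=True)   # posterior scales ("variance" weights)
mu0, sigma0 = mu.detach().clone(), sigma.detach().clone()  # prior Q0 = distribution at init
v = torch.randint(0, 2, (m,)).float() * 2 - 1         # fixed +/-1 second layer

def objective():
    eps = torch.randn(m, d)                           # reparameterization noise
    W = mu + sigma * eps
    f = torch.relu(X @ W.T) @ v / m ** 0.5            # Eq. (1); one MC sample of R_S(Q)
    mse = 0.5 * ((f - Y) ** 2).mean()
    kl = (torch.log(sigma0 / sigma)                   # diagonal-Gaussian KL(Q || Q0)
          + (sigma ** 2 + (mu - mu0) ** 2) / (2 * sigma0 ** 2) - 0.5).sum()
    return mse + lam * kl / n                         # objective B(Q) of Eq. (6)

for _ in range(100):                                  # gradient descent of Eq. (7)
    loss = objective()
    g_mu, g_sigma = torch.autograd.grad(loss, (mu, sigma))
    with torch.no_grad():
        mu -= eta * g_mu
        sigma -= eta * g_sigma
print(float(loss))
```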

3.4 Neural Tangent Kernel

The neural tangent kernel originates from Jacot et al. (2018) and is defined by,

$$\boldsymbol{\Theta}({\bf X},{\bf X},t)=\nabla_{\boldsymbol{\theta}}f({\bf X},t)\nabla_{\boldsymbol{\theta}}f({\bf X},t)^{\top}\tag{8}$$

where $\boldsymbol{\theta}\in\mathbb{R}^{p\times 1}$, with $p$ being the number of parameters in the model. We now demonstrate how it is used to describe the training dynamics of deep neural networks.

The dynamics of gradient flow for parameters are given by,

$$\frac{\partial\boldsymbol{\theta}}{\partial t}=-\eta\nabla_{\boldsymbol{\theta}}\widehat{R}_{S}=-\eta\nabla_{\boldsymbol{\theta}}f({\bf X},t)^{\top}\nabla_{f({\bf X},t)}\widehat{R}_{S}\tag{9}$$

Then the dynamics of the output function $f({\bf X})={\rm vec}(f({\bf x})_{{\bf x}\in{\bf X}})\in\mathbb{R}^{n\times 1}$ follow,

$$\frac{\partial f({\bf X},t)}{\partial t}=\nabla_{\boldsymbol{\theta}}f({\bf X},t)\frac{\partial\boldsymbol{\theta}}{\partial t}=-\eta\boldsymbol{\Theta}({\bf X},{\bf X},t)\nabla_{f({\bf X},t)}\widehat{R}_{S}\tag{10}$$

It is proven that when the width of the neural network is infinite, $\boldsymbol{\Theta}({\bf X},{\bf X},t)$ converges in probability to a fixed kernel $\boldsymbol{\Theta}({\bf X},{\bf X})$ (Jacot et al., 2018), i.e., $\boldsymbol{\Theta}({\bf X},{\bf X},t)=\boldsymbol{\Theta}({\bf X},{\bf X})$. Assuming the loss is the mean squared error, $\widehat{R}_{S}=\frac{1}{2}\|f({\bf X})-{\bf Y}\|^{2}_{2}$, we then have,

$$f_{t}({\bf x})=f_{0}({\bf x})+\boldsymbol{\Theta}({\bf x},{\bf X})\boldsymbol{\Theta}^{-1}\big({\bf I}-e^{-\eta\boldsymbol{\Theta}t}\big)\big({\bf Y}-f_{0}({\bf X})\big)\tag{11}$$

At the end of training, namely $t\rightarrow\infty$, the solution is equivalent to the well-known kernel regression, of which the kernel is the NTK.
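For reference, Equation (11) is easy to evaluate once the kernel blocks are available; a small NumPy/SciPy sketch under the assumption that $\boldsymbol{\Theta}$ is precomputed and positive definite (all data are illustrative):

```python
import numpy as np
from scipy.linalg import expm

def ntk_prediction(Theta_xX, Theta_XX, f0_x, f0_X, Y, eta, t):
    """Closed-form output at time t under constant-NTK gradient flow, Eq. (11):
    f_t(x) = f_0(x) + Theta(x,X) Theta^{-1} (I - exp(-eta*Theta*t)) (Y - f_0(X))."""
    n = Theta_XX.shape[0]
    decay = np.eye(n) - expm(-eta * t * Theta_XX)
    return f0_x + Theta_xX @ np.linalg.solve(Theta_XX, decay @ (Y - f0_X))

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 16))
Theta_XX = A @ A.T + 16 * np.eye(16)          # a stand-in positive-definite kernel
Theta_xX = rng.standard_normal((2, 16))
Y, f0_X, f0_x = rng.standard_normal(16), np.zeros(16), np.zeros(2)
print(ntk_prediction(Theta_xX, Theta_XX, f0_x, f0_X, Y, eta=1.0, t=1e6))  # ~ kernel regression
```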

4 Theoretical Results

4.1 Optimization analysis

To simplify the analysis, we first consider the optimization of probabilistic neural networks of the form (1) with the objective $\widehat{R}_{S}(Q)$ and the mean squared error loss. In other words, we neglect the KL divergence term at this stage and will show that the corresponding results extend to the objective with the KL divergence. To perform the optimization analysis, we define the neural tangent kernel for probabilistic neural networks, which we name the probabilistic neural tangent kernel (PNTK).

Definition 4.1 (Probabilistic Neural Tangent Kernel).

The tangent kernels associated with the output function $f({\bf x},t)$ at parameters $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are defined as,

$$\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},t)=\nabla_{\boldsymbol{\mu}}f({\bf X},t)\nabla_{\boldsymbol{\mu}}f({\bf X},t)^{\top}\in\mathbb{R}^{n\times n},\qquad\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},t)=\nabla_{\boldsymbol{\sigma}}f({\bf X},t)\nabla_{\boldsymbol{\sigma}}f({\bf X},t)^{\top}\in\mathbb{R}^{n\times n}\tag{12}$$

Different from a standard (deterministic) neural network, the probabilistic network consists of two sets of parameters, $\boldsymbol{\mu}_{r}$ and $\boldsymbol{\sigma}_{r}$ with $r\in[m]$. Gradient descent based on the reparameterization trick (Equation 7) is performed on each of the two parameter sets separately. As a result, we have two corresponding tangent kernels. It has been shown that the NTK converges in probability to a deterministic kernel in the infinite-width limit (Jacot et al., 2018); as we show in later derivations, the PNTK converges to a limiting kernel at initialization and during training under a large-width condition. The detailed expression for the limiting kernel is given as follows.

Definition 4.2 (Limiting Neural Tangent Kernel with ReLU Activation).

Consider a probabilistic network of the form (1) with Gaussian distributed weights ${\bf w}$. Then the tangent kernels in the infinite-width limit follow the expression,

$$\boldsymbol{\Theta}^{\infty}_{ij}({\bf X},{\bf X})=\mathbb{E}_{{\bf w}}\big[{\bf x}^{\top}_{i}{\bf x}_{j}\,\mathbb{I}\{{\bf w}^{\top}{\bf x}_{i}\geq 0,\,{\bf w}^{\top}{\bf x}_{j}\geq 0\}\big]\tag{13}$$

The definition above gives the exact formulation of the limiting NTK for random networks. In addition, its positive definiteness is a critical factor in guaranteeing the convergence of gradient descent; thus we state the following assumption.

Assumption 4.3.

Assume the lowest eigenvalue of the limiting NTK is greater than zero, i.e., $\lambda_{0}(\boldsymbol{\Theta}^{\infty})>0$.
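This assumption can be checked directly: for a zero-mean isotropic Gaussian ${\bf w}$ (as under the initialization of Theorem 4.4 below), the expectation in Equation (13) has the known closed form $\mathbb{E}[\mathbb{I}\{\cdot\}]=(\pi-\theta_{ij})/(2\pi)$, where $\theta_{ij}$ is the angle between ${\bf x}_{i}$ and ${\bf x}_{j}$. A minimal NumPy sketch (names illustrative):

```python
import numpy as np

def limiting_ntk(X):
    """Closed form of Eq. (13), assuming w is a zero-mean isotropic Gaussian:
    for ReLU, E[ I{w.xi>=0, w.xj>=0} ] = (pi - angle(xi, xj)) / (2*pi)."""
    G = X @ X.T                                   # Gram matrix x_i . x_j
    norms = np.sqrt(np.diag(G))
    cos = np.clip(G / np.outer(norms, norms), -1.0, 1.0)
    theta = np.arccos(cos)                        # pairwise angles
    return G * (np.pi - theta) / (2 * np.pi)

X = np.random.default_rng(0).standard_normal((8, 5))
Theta_inf = limiting_ntk(X)
print(np.linalg.eigvalsh(Theta_inf).min())        # lambda_0 in Assumption 4.3
```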

With all the preliminaries above, we state our main theorem,

Theorem 4.4 (Convergence of Probabilistic Networks with large width).

Suppose the network's width is $m=\Omega\big(\frac{n^{6}}{\lambda_{0}^{4}\delta^{3}(\sigma_{0}^{2}+\hat{\sigma}^{2})}\big)$ under the initialization below. Then with probability at least $1-\delta$ over the initialization, we have,

$$\widehat{R}_{S}(Q,t)\leq\exp(-\lambda_{0}t)\,\widehat{R}_{S}(Q,0)\tag{14}$$

where we adopt the initialization $\mu_{ri}\sim\mathcal{N}(0,\hat{\sigma})$ and $\sigma_{ri}=\sigma_{0}$ for $i\in[d]$. Here $\mu_{ri}$ and $\sigma_{ri}$ are entries of the weight matrices $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$, and $\hat{\sigma}$ and $\sigma_{0}$ are positive constants.

Our theorem establishes that if $m$ is large enough, the expected training error converges to zero at a linear rate. In particular, the least eigenvalue of the PNTK governs the convergence rate. To prove the theorem, we split the analysis into three parts: the PNTK at initialization, the PNTK during training, and the change of the weights during training. All detailed proofs of our theorems and lemmas can be found in the Appendix.

We first study the behavior of the tangent kernels at initialization under the ultra-wide condition, namely $m={\rm poly}(n,1/\lambda_{0},1/\delta)$. The following lemma demonstrates that if $m$ is large, then $\boldsymbol{\Theta}^{\mu}(0)$ and $\boldsymbol{\Theta}^{\sigma}(0)$ have a lower-bounded smallest eigenvalue with high probability. The proof is by a standard concentration bound.

Lemma 4.5 (PNTK at initialization).

If $m=\Omega\big(\frac{n^{2}\log(n^{2}/\delta)}{\lambda^{2}_{0}}\big)$ and $\boldsymbol{\mu}_{r}$, $\boldsymbol{\sigma}_{r}$ are initialized as in Theorem 4.4, then with probability at least $1-\delta$ we have $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},0)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$ and $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},0)\big\|_{2}\geq\frac{3\lambda_{0}}{4}$, and likewise $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},0)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$ and $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},0)\big\|_{2}\geq\frac{3\lambda_{0}}{4}$.

Remark 4.6.

The concentration bound is over the randomness of the initialization of $\boldsymbol{\mu}_{r}(0)$ and the randomness of the noise $\boldsymbol{\epsilon}_{r}$.

From the above lemma, we see that both tangent kernels, with respect to $\boldsymbol{\mu}_{r}$ and $\boldsymbol{\sigma}_{r}$, are equivalent in the infinite-width limit. This holds in general as long as the reparametrization trick is adopted for training probabilistic neural networks. The next issue is that the PNTKs are time-dependent matrices and thus vary during gradient descent training. To handle this, we establish a lemma stating that if the weights ${\bf w}(t)$ stay close to ${\bf w}(0)$ during gradient descent training, then the corresponding PNTKs $\boldsymbol{\Theta}^{\mu}(t)$ and $\boldsymbol{\Theta}^{\sigma}(t)$ are close to $\boldsymbol{\Theta}^{\infty}$ and their smallest eigenvalues are lower bounded. Here we introduce ${\bf w}_{r}(t)\equiv\boldsymbol{\mu}_{r}(t)+\boldsymbol{\sigma}_{r}(t)\odot\boldsymbol{\epsilon}_{r}(0)$, where $\boldsymbol{\epsilon}_{r}(0)$ copies the sample value from that in ${\bf w}_{r}(0)$. Note that ${\bf w}_{r}(t)$ in our definition is not the exact weight at training step $t$; we introduce it to facilitate the proof.

Lemma 4.7 (PNTK during training).

If $\boldsymbol{\mu}_{r}$ and $\boldsymbol{\sigma}_{r}$ are initialized as in Theorem 4.4, then for any set of weight vectors ${\bf w}_{1},\ldots,{\bf w}_{m}$ satisfying $\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|\leq\frac{c\lambda_{0}\delta\sqrt{\sigma_{0}^{2}+\hat{\sigma}^{2}}}{n^{2}}\equiv R$ for all $r\in[m]$, where $c$ is a constant, with probability at least $1-\delta$ we have $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},t)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$ and $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},t)\big\|_{2}\geq\frac{\lambda_{0}}{2}$, and likewise $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},t)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$ and $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},t)\big\|_{2}\geq\frac{\lambda_{0}}{2}$.

The above lemma demonstrates that if the change of the weights is bounded, then the tangent kernel matrices are close to their expectation. The next lemma shows that the change of the weights is indeed bounded during gradient descent training when the PNTK is close to the limiting kernel in Equation (13).

Lemma 4.8 (Change of weights during training).

Suppose $\lambda_{\min}(\boldsymbol{\Theta}(t))\geq\frac{\lambda_{0}}{2}$ for $0<t<T$. Then we have $\widehat{R}_{S}(Q,t)\leq\exp(-\lambda_{0}t)\widehat{R}_{S}(Q,0)$, and with probability at least $1-\delta$ over the initialization, $\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|_{2}\leq\frac{2n}{\sqrt{m\delta}\lambda_{0}}\equiv R^{\prime}$.

The remaining gap between the PNTK and the final result for the output function can be filled by writing down the gradient flow dynamics,

$$\frac{df({\bf x}_{i},t)}{dt}=\sum_{j=1}^{n}({\bf y}_{j}-f({\bf x}_{j},t))(\boldsymbol{\Theta}^{\mu}_{ij}+\boldsymbol{\Theta}^{\sigma}_{ij})\tag{15}$$

Combined with the fact that the PNTKs are bounded during training, the behavior of the loss is therefore traceable. To finalize the proof of Theorem 4.4, we construct a final lemma.

Lemma 4.9.

If $R^{\prime}<R$, we have $\widehat{R}_{S}(Q,t)\leq\exp(-\lambda_{0}t)\widehat{R}_{S}(Q,0)$.

4.2 Training with KL divergence

According to Equation (6), there is a KL divergence term in the objective function. We expand the KL divergence for the two Gaussian distributions $p(w(t))=\mathcal{N}(\mu_{t},\sigma_{t}^{2})$ and $p(w(0))=\mathcal{N}(\mu_{0},\sigma_{0}^{2})$,

$$\mathrm{KL}=\log\frac{\sigma_{0}}{\sigma_{t}}+\frac{\sigma_{t}^{2}+(\mu_{t}-\mu_{0})^{2}}{2\sigma^{2}_{0}}-\frac{1}{2}\tag{16}$$
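As a sanity check, this closed form can be verified by Monte Carlo; a short sketch with illustrative values:

```python
import numpy as np

def kl_gaussian(mu_t, sigma_t, mu_0, sigma_0):
    """KL( N(mu_t, sigma_t^2) || N(mu_0, sigma_0^2) ), cf. Eq. (16)."""
    return (np.log(sigma_0 / sigma_t)
            + (sigma_t**2 + (mu_t - mu_0)**2) / (2 * sigma_0**2) - 0.5)

rng = np.random.default_rng(0)
mu_t, sigma_t, mu_0, sigma_0 = 0.3, 0.5, 0.0, 1.0
w = rng.normal(mu_t, sigma_t, size=1_000_000)         # samples from the posterior
log_ratio = (np.log(sigma_0 / sigma_t)                # log q(w)/p(w) at the samples
             + (w - mu_0)**2 / (2 * sigma_0**2) - (w - mu_t)**2 / (2 * sigma_t**2))
print(kl_gaussian(mu_t, sigma_t, mu_0, sigma_0), log_ratio.mean())  # should agree
```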

We compare the gradients of the mean weights $\boldsymbol{\mu}$ and the variance weights $\boldsymbol{\sigma}$. A direct calculation gives $\frac{\partial f({\bf x}_{i})}{\partial\boldsymbol{\mu}_{r}}=\frac{1}{\sqrt{m}}v_{r}\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0){\bf x}_{i}$ and $\frac{\partial f({\bf x}_{i})}{\partial\boldsymbol{\sigma}_{r}}=\frac{1}{\sqrt{m}}v_{r}\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0){\bf x}_{i}\odot\boldsymbol{\epsilon}_{r}$. The gradient with respect to the variance weights carries one extra random variable, $\boldsymbol{\epsilon}_{r}$, which drives its expected value to zero. To verify this result, we conduct an experiment and report the results in Appendix E.2. Therefore, it is approximately equivalent to fixing $\boldsymbol{\sigma}$ during gradient descent training. With this assumption we arrive at the following conclusion:

Theorem 4.10.

Consider gradient descent on the objective (6) with the initialization stated in Theorem 4.4. Suppose $m\geq{\rm poly}(n,1/\lambda_{0},1/\delta,1/\mathcal{E})$. Then with probability at least $1-\delta$ over the random initialization, we have

$$\widehat{f}({\bf x},\infty)=\boldsymbol{\Theta}^{\infty}({\bf x},{\bf X})\big(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I}\big)^{-1}{\bf Y}\pm\mathcal{E}\tag{17}$$

where $\widehat{f}({\bf x},t)=\mathbb{E}_{f\sim Q}f({\bf x},t)$ aligns with the definition of the empirical loss function.

The theorem above reveals the regularization effect of the KL term in PAC-Bayesian learning, and presents an explicit expression for the convergence result of the output function.
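A compact sketch of the resulting predictor, combining the closed form of the limiting kernel (Definition 4.2, for zero-mean isotropic Gaussian weights) with the kernel ridge regression solution of Theorem 4.10 and ignoring the $\pm\mathcal{E}$ term; all names and data are illustrative:

```python
import numpy as np

def relu_ntk(A, B):
    """Limiting kernel of Eq. (13) between row-sets A and B (cf. Definition 4.2)."""
    G = A @ B.T
    na, nb = np.linalg.norm(A, axis=1), np.linalg.norm(B, axis=1)
    theta = np.arccos(np.clip(G / np.outer(na, nb), -1.0, 1.0))
    return G * (np.pi - theta) / (2 * np.pi)

def pac_bayes_predictor(x, X, Y, lam, sigma0):
    """Infinite-time expected output of Theorem 4.10 (up to the error term E):
    kernel ridge regression with the NTK and ridge lam / sigma0**2."""
    n = X.shape[0]
    K = relu_ntk(X, X) + lam / sigma0**2 * np.eye(n)
    return relu_ntk(x, X) @ np.linalg.solve(K, Y)

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((32, 5)), rng.standard_normal(32)
print(pac_bayes_predictor(rng.standard_normal((3, 5)), X, Y, lam=0.1, sigma0=1.0))
```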

4.3 Generalization analysis

Recall that Theorem 3.2 gives the PAC-Bayesian bound with respect to the distributions at initialization and after optimization. Therefore, combining it with the result of Theorem 4.10, we can directly provide a generalization bound for PAC-Bayesian learning with ultra-wide networks.

Theorem 4.11 (PAC-Bayesian bound with NTK).

Suppose the data $S=\{({\bf x}_{i},y_{i})\}_{i=1}^{n}$ are i.i.d. samples from a non-degenerate distribution $\mathcal{D}$, and $m\geq{\rm poly}(n,\lambda_{0}^{-1},\delta^{-1})$. Consider any loss function $\ell:\mathbb{R}\times\mathbb{R}\rightarrow[0,1]$. Then with probability at least $1-\delta$ over the random initialization and the training samples, the probabilistic network trained by gradient descent for $T\geq\Omega(\frac{1}{\eta\lambda_{0}}\log\frac{n}{\delta})$ iterations has population risk $R_{\mathcal{D}}(Q)$ bounded as follows:

$$R_{\mathcal{D}}\leq\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-1}{\bf Y}}{n\sigma^{2}_{0}}+\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-2}{\bf Y}}{n}}+O\bigg(\frac{\log\frac{2\sqrt{n}}{\delta}}{n}\bigg).\tag{18}$$

This theorem establishes a generalization bound for the PAC-Bayesian learning framework, thus providing a theoretical guarantee. We further demonstrate the advantage of PAC-Bayesian learning by comparing it with the Rademacher complexity-based generalization bound for a deterministic neural network with the kernel ridge regression solution.

Theorem 4.12 (Rademacher bound with NTK).

Suppose the data $S=\{({\bf x}_{i},y_{i})\}_{i=1}^{n}$ are i.i.d. samples from a non-degenerate distribution $\mathcal{D}$, and $m\geq{\rm poly}(n,\lambda_{0}^{-1},\delta^{-1})$. Consider any loss function $\ell:\mathbb{R}\times\mathbb{R}\rightarrow[0,1]$. Then with probability at least $1-\delta$ over the random initialization and the training samples, the network trained by gradient descent for $T\geq\Omega(\frac{1}{\eta\lambda_{0}}\log\frac{n}{\delta})$ iterations has population risk $R_{\mathcal{D}}$ bounded as follows:

$$R_{\mathcal{D}}\leq\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-1}{\bf Y}}{n}}+\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-2}{\bf Y}}{n}}+O\bigg(\sqrt{\frac{\log\frac{n}{\lambda_{0}\delta}}{n}}\bigg).\tag{19}$$

Theorem 4.12 is obtained by following Theorem 5.1 of Hu et al. (2019), which presents a Rademacher complexity-based generalization bound for ultra-wide neural networks with the kernel ridge regression solution. Similar analyses for kernel regression with the NTK can be found in (Arora et al., 2019a; Cao & Gu, 2019).

The main difference between the two generalization bounds is the term $\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-1}{\bf Y}}{n}$ versus $\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-1}{\bf Y}}{n}}$, which arises because the PAC-Bayesian bound counts the KL divergence while the Rademacher bound computes the reproducing kernel Hilbert space (RKHS) norm. The rates of convergence therefore differ: one is $O(1/n)$ and the other is $O(1/\sqrt{n})$. We conclude that the PAC-Bayesian bound improves over the Rademacher complexity-based bound.

Figure 1: Relative Frobenius norm change in $\mu$ and $\rho$ (panels (a) and (b), respectively) during training with the MSE loss derived from the classic PAC-Bayesian bound, where $m$ is the width of the network.

5 Experiments

In this section, we first provide empirical support showing that the training dynamics of wide probabilistic neural networks, with training objectives derived from a PAC-Bayes bound, are captured by the PNTK, which validates our Lemma 4.8. Second, as an extension of our PAC-Bayesian bound in Theorem D.1, we provide a training-free metric that approximates the PAC-Bayesian bound via the NTK; it can be used to select the best hyperparameters without any training, eliminating excessive computation time.

5.1 Experimental setup

In all experiments, the NTK parameterization of Equation (1) is used for initializing parameters. Specifically, the initial mean of the weights, $\mu_{0}$, is sampled from a truncated Gaussian distribution with mean 0 and standard deviation 1, truncated at 2 standard deviations. To ensure that the variance is positive, the initial variance of the weights is obtained from a given value $\rho_{0}$ through the formula $\sigma_{0}=\log(1+\exp(\rho_{0}))$.

For the experiment in Section 5.2, we consider a ReLU fully-connected network with three hidden layers, whose training objective is derived from the PAC-Bayesian lambda bound in Equation (5), using an ordinary MSE loss, trained with full-batch gradient descent with learning rate 1 on a fixed subset of MNIST ($|D|=128$) for ten-class classification. A randomly initialized prior with no connection to the data is used, since this is in line with our theoretical setting and we only intend to observe the change in parameters rather than the performance of the actual bound.

In Section 5.3, we use both fully-connected and convolutional neural network structures to perform experiments on the MNIST and CIFAR-10 datasets, demonstrating the effectiveness of our training-free PAC-Bayesian bound for searching hyperparameters under different datasets and network structures. We build a 3-layer fully-connected neural network with 600 neurons in each layer, and a convolutional architecture with a total of 13 layers and around 10 million learnable parameters. We adopt a data-dependent prior since it is a practical and popular method (Dziugaite et al., 2020; Perez-Ortiz et al., 2021). Specifically, this data-dependent prior is pre-trained on a subset of the total training data by empirical risk minimization. The network for posterior training is then initialized with the weights learnt by the prior. Finally, the generalization bound is computed using Equation (5). Settings related to computing the bound, such as the confidence parameters for the risk certificate and the Chernoff bound, and the number of Monte Carlo samples for estimating the risk certificate, follow Pérez-Ortiz et al. (2020).

5.2 Validation of theoretical results

After $T=2^{17}$ steps of gradient descent updates from different random initializations, we plot in Figure 1 the changes of the (input/output/hidden layer) weights $\boldsymbol{\mu}$ and $\boldsymbol{\rho}$ against the width $m$ of each layer. We observe that the relative Frobenius norm change of the input/output layers' weights scales as $1/\sqrt{m}$ while that of the hidden layers' weights scales as $1/m$ during training, which verifies our Lemma 4.8.
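For concreteness, the quantity plotted in Figure 1 is simply the relative Frobenius norm change of each parameter matrix; a one-line helper (names illustrative):

```python
import numpy as np

def relative_frobenius_change(P_t, P_0):
    """||P_t - P_0||_F / ||P_0||_F, the quantity plotted in Figure 1."""
    return np.linalg.norm(P_t - P_0) / np.linalg.norm(P_0)
```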

Figure 2: For the FCN structure on the MNIST dataset, the Kendall-tau correlations between $\mathcal{PA}$ and the generalization bound with respect to the proportion of data used for prior training, the value of the KL penalty, and the value of $\rho_{0}$ are 0.89, 0.89, and 0.93, all significant at the 1% level. Similar results hold for the CNN structure on the CIFAR-10 dataset, where the Kendall-tau correlations are 0.89, 0.83, and 0.57.
| Method | Network & dataset | Generalization bound | Testing error | Computation time |
|---|---|---|---|---|
| Brute-Force | MNIST + FCN | 0.0427 | 0.0192 | 3888 mins |
| Brute-Force | MNIST + CNN | 0.0247 | 0.0091 | 20736 mins |
| Brute-Force | CIFAR10 + FCN | 0.5377 | 0.4840 | 8424 mins |
| Brute-Force | CIFAR10 + CNN | 0.1969 | 0.1514 | 55080 mins |
| $\mathcal{PA}$ | MNIST + FCN | 0.0478 | 0.0213 | 1296 mins |
| $\mathcal{PA}$ | MNIST + CNN | 0.0346 | 0.0168 | 1296 mins |
| $\mathcal{PA}$ | CIFAR10 + FCN | 0.5490 | 0.4905 | 1296 mins |
| $\mathcal{PA}$ | CIFAR10 + CNN | 0.1970 | 0.1510 | 1296 mins |
Table 1: We conduct grid-search runs over 9 data-dependent priors (different subsets of data for prior training), 9 values of the KL penalty, and 8 values of $\rho_{0}$. Under every setting, the generalization bound obtained from the hyperparameters selected by $\mathcal{PA}$ is close to the best generalization bound while saving substantial computation time. Training is performed on a server with a CPU with 5,120 cores and a 32 GB Nvidia Quadro V100. Note that the computation time per run using $\mathcal{PA}$ is around 2 mins, whereas the computation times for the actual posterior training are 6 mins and 32 mins for MNIST trained with the FCN and CNN, respectively, and 13 mins and 85 mins for CIFAR-10 trained with the FCN and CNN.

5.3 Selecting hyperparameters via training-free metric

The PAC-Bayesian learning framework provides competitive performance with non-vacuous generalization bounds. However, the tightness of this generalization bound depends on the hyperparameters used, such as the proportion of data used for the prior, the initialization of $\rho_{0}$, and the KL penalty weight ($\lambda$). Since these three values do not change during training, we refer to them as hyperparameters. Choosing the right hyperparameters via grid search is prohibitively expensive, as each attempt to compute the generalization bound can involve significant computational resources.

Another plausible approach is to design a predictive, so-called "training-free" metric that approximates the error bound without going through an expensive training process. In light of this goal, we have already developed a generalization bound in Theorem D.1 via the NTK. Since the NTK is held constant during training, we can predict the generalization bound by the following proxy metric:

$$\mathcal{PA}=\operatorname{Tr}\bigg(\frac{(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-1}{\bf Y}{\bf Y}^{\top}}{\sigma^{2}_{0}\,n}\bigg)+\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\operatorname{Tr}\bigg(\frac{(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-2}{\bf Y}{\bf Y}^{\top}}{n}\bigg)}\tag{20}$$

where $\boldsymbol{\Theta}$ denotes the Gram matrix, ${\bf Y}{\bf Y}^{\top}$ is an $n\times n$ label similarity matrix (the entry for two data points is 1 if they have the same label and 0 otherwise), and $n$ is the number of data points used. To demonstrate the computational practicality of this training-free metric, we compute $\mathcal{PA}$ using only a subset of the data for each class (325 per class for FCN and 75 per class for CNN). We should also mention that training-free methods for searching neural architectures are not new; they can be found in NAS (Chen et al., 2021; Deshpande et al., 2021), MAE Random Sampling (Camero et al., 2021), and pruning at initialization (Abdelfattah et al., 2021). To the best of our knowledge, there is currently no training-free method for selecting hyperparameters in the PAC-Bayesian framework, which we consider one of the novelties of this paper.
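A minimal NumPy sketch of this proxy, written in the scalar form of Theorem 4.11 (our reading of Equation (20); names and data are illustrative, and the stand-in Gram matrix below is not an actual NTK):

```python
import numpy as np

def pa_score(Theta, Y_onehot, lam, sigma0):
    """Training-free proxy of Eq. (20). Theta: n x n Gram matrix; Y_onehot:
    n x k one-hot labels, so Y_onehot @ Y_onehot.T is the label-similarity matrix."""
    n = Theta.shape[0]
    K = Theta + lam / sigma0**2 * np.eye(n)
    Kinv_Y = np.linalg.solve(K, Y_onehot)            # (Theta + lam/sigma0^2 I)^{-1} Y
    term1 = np.trace(Y_onehot.T @ Kinv_Y) / (sigma0**2 * n)
    term2 = lam / sigma0**2 * np.sqrt((Kinv_Y**2).sum() / n)  # Tr(Y^T K^{-2} Y)
    return term1 + term2

# rank candidate KL-penalty values without any posterior training
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
Y = np.eye(10)[rng.integers(0, 10, 100)]
G = X @ X.T                                          # stand-in for the NTK Gram matrix
for lam in (0.01, 0.1, 1.0):
    print(lam, pa_score(G, Y, lam, sigma0=1.0))
```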

Figure 2 demonstrates a strong correlation between $\mathcal{PA}$ and the actual generalization bound. Finally, we demonstrate that by searching through all possible combinations of hyperparameters using $\mathcal{PA}$, it is possible to select hyperparameters that lead to a result comparable to the best generalization bound, but without the excessive computation. To put things in perspective, Table 1 shows that grid search can be far more expensive than $\mathcal{PA}$. For instance, under the setting of a CNN on the CIFAR-10 dataset, we save 42.5 times the computational time to find a bound within 1% of the best generalization bound.

6 Discussion

In this work, we theoretically prove that the learning dynamics of deep probabilistic neural networks, trained with objectives derived from PAC-Bayes bounds, are exactly described by the NTK in the over-parameterized setting. Empirical investigation reveals that this agrees well with the actual training process. Furthermore, the expected output function trained with the PAC-Bayesian bound converges to a kernel ridge regression solution under a mild assumption. Based on this finding, we obtain an explicit NTK-based generalization bound for PAC-Bayesian learning, which improves over the generalization bound obtained through the NTK for non-probabilistic neural networks. Finally, we show that the PAC-Bayesian bound score, a training-free method, can effectively select hyperparameters that lead to a lower generalization bound without the excessive computation time that brute-force grid search would take. In summary, we establish our theoretical analysis of PAC-Bayes with a randomly initialized prior. One promising direction is to study PAC-Bayesian learning with data-dependent priors via the NTK.

References

  • Abdelfattah et al. (2021) Abdelfattah, M. S., Mehrotra, A., Dudziak, Ł., and Lane, N. D. Zero-cost proxies for lightweight nas. arXiv preprint arXiv:2101.08134, 2021.
  • Allen-Zhu et al. (2019) Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019.
  • Ambroladze et al. (2007) Ambroladze, A., Parrado-Hernández, E., and Shawe-Taylor, J. Tighter pac-bayes bounds. Advances in neural information processing systems, 19:9, 2007.
  • Amit & Meir (2018) Amit, R. and Meir, R. Meta-learning by adjusting priors based on extended pac-bayes theory. In International Conference on Machine Learning, pp. 205–214. PMLR, 2018.
  • Arora et al. (2019a) Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pp. 322–332. PMLR, 2019a.
  • Arora et al. (2019b) Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019b.
  • Bégin et al. (2014) Bégin, L., Germain, P., Laviolette, F., and Roy, J.-F. Pac-bayesian theory for transductive learning. In Artificial Intelligence and Statistics, pp.  105–113. PMLR, 2014.
  • Bégin et al. (2016) Bégin, L., Germain, P., Laviolette, F., and Roy, J.-F. Pac-bayesian bounds based on the rényi divergence. In Artificial Intelligence and Statistics, pp.  435–444. PMLR, 2016.
  • Biggs & Guedj (2020) Biggs, F. and Guedj, B. Differentiable pac-bayes objectives with partially aggregated neural networks. arXiv preprint arXiv:2006.12228, 2020.
  • Camero et al. (2021) Camero, A., Wang, H., Alba, E., and Bäck, T. Bayesian neural architecture search using a training-free performance metric. Applied Soft Computing, 106:107356, 2021.
  • Cao & Gu (2019) Cao, Y. and Gu, Q. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32:10836–10846, 2019.
  • Chen et al. (2021) Chen, W., Gong, X., and Wang, Z. Neural architecture search on imagenet in four gpu hours: A theoretically inspired perspective. arXiv preprint arXiv:2102.11535, 2021.
  • Deshpande et al. (2021) Deshpande, A., Achille, A., Ravichandran, A., Li, H., Zancato, L., Fowlkes, C., Bhotika, R., Soatto, S., and Perona, P. A linearized framework and a new benchmark for model selection for fine-tuning. arXiv preprint arXiv:2102.00084, 2021.
  • Du et al. (2019a) Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685. PMLR, 2019a.
  • Du et al. (2018) Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
  • Du et al. (2019b) Du, S. S., Hou, K., Salakhutdinov, R. R., Poczos, B., Wang, R., and Xu, K. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. Advances in Neural Information Processing Systems, 32:5723–5733, 2019b.
  • Dziugaite & Roy (2018a) Dziugaite, G. K. and Roy, D. Entropy-sgd optimizes the prior of a pac-bayes bound: Generalization properties of entropy-sgd and data-dependent priors. In International Conference on Machine Learning, pp. 1377–1386. PMLR, 2018a.
  • Dziugaite & Roy (2017) Dziugaite, G. K. and Roy, D. M. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. CoRR, abs/1703.11008, 2017. URL https://arxiv.org/abs/1703.11008.
  • Dziugaite & Roy (2018b) Dziugaite, G. K. and Roy, D. M. Data-dependent pac-bayes priors via differential privacy. arXiv preprint arXiv:1802.09583, 2018b.
  • Dziugaite et al. (2020) Dziugaite, G. K., Hsu, K., Gharbieh, W., and Roy, D. M. On the role of data in pac-bayes bounds. arXiv preprint arXiv:2006.10929, 2020.
  • Grünwald & Mehta (2020) Grünwald, P. D. and Mehta, N. A. Fast rates for general unbounded loss functions: From erm to generalized bayes. J. Mach. Learn. Res., 21:56–1, 2020.
  • Guedj (2019) Guedj, B. A primer on pac-bayesian learning. arXiv preprint arXiv:1901.05353, 2019.
  • Haddouche et al. (2021) Haddouche, M., Guedj, B., Rivasplata, O., and Shawe-Taylor, J. Pac-bayes unleashed: generalisation bounds with unbounded losses. Entropy, 23(10):1330, 2021.
  • Hron et al. (2020) Hron, J., Bahri, Y., Sohl-Dickstein, J., and Novak, R. Infinite attention: Nngp and ntk for deep attention networks. In International Conference on Machine Learning, pp. 4376–4386. PMLR, 2020.
  • Hu et al. (2019) Hu, W., Li, Z., and Yu, D. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. arXiv preprint arXiv:1905.11368, 2019.
  • Huang et al. (2020) Huang, W., Du, W., and Da Xu, R. Y. On the neural tangent kernel of deep networks with orthogonal initialization. arXiv preprint arXiv:2004.05867, 2020.
  • Huang et al. (2021) Huang, W., Li, Y., Du, W., Da Xu, R. Y., Yin, J., and Chen, L. Wide graph neural networks: Aggregation provably leads to exponentially trainability loss. arXiv preprint arXiv:2103.03113, 2021.
  • Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
  • Jain & Kar (2017) Jain, P. and Kar, P. Non-convex optimization for machine learning. arXiv preprint arXiv:1712.07897, 2017.
  • Kingma et al. (2015) Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. Advances in neural information processing systems, 28:2575–2583, 2015.
  • Langford & Caruana (2002a) Langford, J. and Caruana, R. (not) bounding the true error. In Dietterich, T., Becker, S., and Ghahramani, Z. (eds.), Advances in Neural Information Processing Systems, volume 14. MIT Press, 2002a.
  • Langford & Caruana (2002b) Langford, J. and Caruana, R. (not) bounding the true error. Advances in Neural Information Processing Systems, 2:809–816, 2002b.
  • Langford & Seeger (2001) Langford, J. and Seeger, M. Bounds for averaging classifiers. 2001.
  • LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436–444, 2015.
  • Lee et al. (2019) Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
  • Letarte et al. (2019) Letarte, G., Germain, P., Guedj, B., and Laviolette, F. Dichotomize and generalize: Pac-bayesian binary activated deep neural networks. arXiv preprint arXiv:1905.10259, 2019.
  • Liao et al. (2020) Liao, R., Urtasun, R., and Zemel, R. A pac-bayesian approach to generalization bounds for graph neural networks. arXiv preprint arXiv:2012.07690, 2020.
  • Liu et al. (2021) Liu, T., Lu, J., Yan, Z., and Zhang, G. Pac-bayes bounds for meta-learning with data-dependent prior. arXiv preprint arXiv:2102.03748, 2021.
  • London (2017) London, B. A pac-bayesian analysis of randomized learning with application to stochastic gradient descent. arXiv preprint arXiv:1709.06617, 2017.
  • Maurer (2004) Maurer, A. A note on the pac bayesian theorem. arXiv preprint cs/0411099, 2004.
  • McAllester (1999a) McAllester, D. A. Pac-bayesian model averaging. In Proceedings of the twelfth annual conference on Computational learning theory, pp.  164–170, 1999a.
  • McAllester (1999b) McAllester, D. A. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999b.
  • Negrea et al. (2019) Negrea, J., Haghifam, M., Dziugaite, G. K., Khisti, A., and Roy, D. M. Information-theoretic generalization bounds for sgld via data-dependent estimates. arXiv preprint arXiv:1911.02151, 2019.
  • Neyshabur et al. (2017a) Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017a.
  • Neyshabur et al. (2017b) Neyshabur, B., Bhojanapalli, S., and Srebro, N. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017b.
  • Nie et al. (2018) Nie, S., Zheng, M., and Ji, Q. The deep regression bayesian network and its applications: Probabilistic deep learning for computer vision. IEEE Signal Processing Magazine, 35(1):101–111, 2018.
  • Pérez-Ortiz et al. (2020) Pérez-Ortiz, M., Rivasplata, O., Shawe-Taylor, J., and Szepesvári, C. Tighter risk certificates for neural networks. CoRR, abs/2007.12911, 2020. URL https://arxiv.org/abs/2007.12911.
  • Perez-Ortiz et al. (2021) Perez-Ortiz, M., Rivasplata, O., Guedj, B., Gleeson, M., Zhang, J., Shawe-Taylor, J., Bober, M., and Kittler, J. Learning pac-bayes priors for probabilistic neural networks. arXiv preprint arXiv:2109.10304, 2021.
  • Pu et al. (2016) Pu, Y., Gan, Z., Henao, R., Yuan, X., Li, C., Stevens, A., and Carin, L. Variational autoencoder for deep learning of images, labels and captions. Advances in neural information processing systems, 29:2352–2360, 2016.
  • Raginsky et al. (2017) Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pp.  1674–1703. PMLR, 2017.
  • Seeger (2002) Seeger, M. Pac-bayesian generalisation error bounds for gaussian process classification. Journal of machine learning research, 3(Oct):233–269, 2002.
  • Smith & Le (2017) Smith, S. L. and Le, Q. V. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.
  • Specht (1990) Specht, D. F. Probabilistic neural networks. Neural networks, 3(1):109–118, 1990.
  • Thiemann et al. (2017) Thiemann, N., Igel, C., Wintenberger, O., and Seldin, Y. A strongly quasiconvex pac-bayesian bound. In International Conference on Algorithmic Learning Theory, pp.  466–492. PMLR, 2017.
  • Zhang et al. (2021) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
  • Zhou et al. (2018) Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. Non-vacuous generalization bounds at the imagenet scale: a pac-bayesian compression approach. arXiv preprint arXiv:1804.05862, 2018.

Appendix A Preliminary Lemmas

In this section, we list preliminary lemmas that will be used in the proofs.

Lemma A.1 (Markov’s inequality).

If $X$ is a nonnegative random variable and $a>0$, then the probability that $X$ is at least $a$ is at most the expectation of $X$ divided by $a$:

$$P(X\geq a)\leq\frac{\mathbb{E}[X]}{a}$$
Lemma A.2 (Hoeffding’s inequality).

Let $X_{1},\ldots,X_{n}$ be independent random variables such that $a_{i}\leq X_{i}\leq b_{i}$ almost surely. Consider the sum $S_{n}=X_{1}+\cdots+X_{n}$. Then Hoeffding's theorem states that, for all $t>0$:

$$P(|S_{n}-\mathbb{E}[S_{n}]|\geq t)\leq 2\exp\bigg(-\frac{2t^{2}}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}\bigg)$$

Appendix B Missing Proofs for Section 4.1

This section contains detailed proofs of the results that are missing in the main paper.

Lemma B.1 (PNTK at initialization).

If $m=\Omega\big(\frac{n^{2}}{\lambda^{2}_{0}}\log\frac{n}{\delta}\big)$ and $\boldsymbol{\mu}_{r}$, $\boldsymbol{\sigma}_{r}$ are initialized as in Theorem 4.4, then with probability at least $1-\delta$ we have $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},0)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$ and $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},0)\big\|_{2}\geq\frac{3\lambda_{0}}{4}$, and likewise $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},0)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$ and $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},0)\big\|_{2}\geq\frac{3\lambda_{0}}{4}$.

Proof.

The derivatives of the output with respect to the parameters $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ can be expressed as

$$\frac{\partial f({\bf x}_{i})}{\partial\boldsymbol{\mu}_{r}}=\frac{1}{\sqrt{m}}v_{r}\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0){\bf x}_{i},\qquad\frac{\partial f({\bf x}_{i})}{\partial\boldsymbol{\sigma}_{r}}=\frac{1}{\sqrt{m}}v_{r}\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0){\bf x}_{i}\odot\boldsymbol{\epsilon}_{r}\tag{21}$$

Since the probabilistic neural network contains two sets of parameters, we discuss the corresponding NTK for each in turn.

(1) $\boldsymbol{\Theta}^{\mu}$.

Plugging the derivative with respect to the mean weights in Equation (21) into the definition of the PNTK (Equation 12) yields:

$$\boldsymbol{\Theta}^{\mu}_{ij}(0)=\frac{{\bf x}^{\top}_{i}{\bf x}_{j}}{m}\sum_{r=1}^{m}(v_{r})^{2}\,\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0)\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{j}\geq 0)=\frac{{\bf x}^{\top}_{i}{\bf x}_{j}}{m}\sum_{r=1}^{m}\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0)\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{j}\geq 0)$$

where the second equality uses $(v_{r})^{2}=1$.

For every pair $i,j$, $\boldsymbol{\Theta}^{\mu}_{ij}(0)$ is the average of $m$ i.i.d. random variables bounded in $[0,1]$, with expectation $\boldsymbol{\Theta}^{\infty}_{ij}=\mathbb{E}_{{\bf w}}\big[{\bf x}^{\top}_{i}{\bf x}_{j}\mathbb{I}\{{\bf w}^{\top}{\bf x}_{i}\geq 0,{\bf w}^{\top}{\bf x}_{j}\geq 0\}\big]$. Here ${\bf w}_{i}=\boldsymbol{\mu}_{i}+\boldsymbol{\sigma}_{i}\odot\boldsymbol{\epsilon}$ with $\boldsymbol{\mu}_{i}\sim\mathcal{N}(\hat{\mu},\hat{\sigma}^{2})$. A simple argument shows that ${\bf w}$ is also Gaussian; its variance can be inferred as follows.

$$f({\bf w}_{i})=\frac{1}{\sqrt{2\pi}\sigma_{0}}e^{-\frac{({\bf w}_{i}-\boldsymbol{\mu}_{i})^{2}}{2\sigma_{0}^{2}}},\qquad f(\boldsymbol{\mu}_{i})=\frac{1}{\sqrt{2\pi}\hat{\sigma}}e^{-\frac{(\boldsymbol{\mu}_{i}-\hat{\mu})^{2}}{2\hat{\sigma}^{2}}}$$

Marginalizing over $\boldsymbol{\mu}_{i}$ in $f({\bf w}_{i})$, we obtain

$$f({\bf w}_{i})=\int\frac{1}{\sqrt{2\pi}\sigma_{0}}e^{-\frac{({\bf w}_{i}-\boldsymbol{\mu}_{i})^{2}}{2\sigma_{0}^{2}}}\,\frac{1}{\sqrt{2\pi}\hat{\sigma}}e^{-\frac{(\boldsymbol{\mu}_{i}-\hat{\mu})^{2}}{2\hat{\sigma}^{2}}}\,d\boldsymbol{\mu}_{i}=\frac{1}{\sqrt{2\pi(\sigma_{0}^{2}+\hat{\sigma}^{2})}}e^{-\frac{({\bf w}_{i}-\hat{\mu})^{2}}{2(\sigma_{0}^{2}+\hat{\sigma}^{2})}}$$
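The convolution above simply says that the marginal of ${\bf w}_{i}=\boldsymbol{\mu}_{i}+\sigma_{0}\boldsymbol{\epsilon}$ is Gaussian with the variances added. A minimal NumPy sketch (sampling setup and values are ours, purely illustrative) confirms this numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_hat, sigma_hat, sigma_0 = 0.3, 0.5, 0.2
n_samples = 1_000_000

mu = rng.normal(mu_hat, sigma_hat, n_samples)      # mu_i ~ N(mu_hat, sigma_hat^2)
w = mu + sigma_0 * rng.standard_normal(n_samples)  # w_i = mu_i + sigma_0 * eps

print(f"empirical mean: {w.mean():.4f}  (expected {mu_hat})")
print(f"empirical std:  {w.std():.4f}  (expected {np.sqrt(sigma_0**2 + sigma_hat**2):.4f})")
```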

Then, by Hoeffding’s inequality, the following holds with probability at least $1-\delta^{\prime}$:

$$\big|\boldsymbol{\Theta}^{\mu}_{ij}(0)-\boldsymbol{\Theta}^{\infty}_{ij}\big|\leq\sqrt{\frac{\log(2/\delta^{\prime})}{2m}}$$

Because the NTK matrix is of size $n\times n$, we apply a union bound over all $i,j\in[n]$ (setting $\delta^{\prime}=\delta/n^{2}$) and obtain

$$\big|\boldsymbol{\Theta}^{\mu}_{ij}(0)-\boldsymbol{\Theta}^{\infty}_{ij}\big|\leq\sqrt{\frac{\log(2n^{2}/\delta)}{2m}}$$

Thus we have

$$\big\|\boldsymbol{\Theta}^{\mu}(0)-\boldsymbol{\Theta}^{\infty}\big\|^{2}_{2}\leq\big\|\boldsymbol{\Theta}^{\mu}(0)-\boldsymbol{\Theta}^{\infty}\big\|^{2}_{F}\leq\sum_{i,j}\big|\boldsymbol{\Theta}^{\mu}_{ij}(0)-\boldsymbol{\Theta}^{\infty}_{ij}\big|^{2}=O\bigg(\frac{n^{2}\log(2n^{2}/\delta)}{m}\bigg)$$

Finally, if $\sqrt{\frac{n^{2}\log(2n^{2}/\delta)}{m}}\leq\frac{\lambda_{0}}{4}$, which is implied by $m=\Omega\big(\frac{n^{2}\log(n^{2}/\delta)}{\lambda^{2}_{0}}\big)$, then with probability at least $1-\delta$,

$$\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},0)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4},\qquad\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},0)\big\|_{2}\geq\frac{3\lambda_{0}}{4}.$$
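To see this concentration in practice, the following sketch (ours, not the paper's code) compares the empirical $\boldsymbol{\Theta}^{\mu}_{ij}(0)$ of a width-$m$ ReLU layer against the closed-form expectation; assuming $\hat{\mu}=0$ and unit-norm inputs, the spherical symmetry of ${\bf w}$ gives the standard arc-cosine-kernel identity $\boldsymbol{\Theta}^{\infty}_{ij}={\bf x}_{i}^{\top}{\bf x}_{j}\,(\pi-\arccos({\bf x}_{i}^{\top}{\bf x}_{j}))/(2\pi)$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma_hat, sigma_0 = 10, 1.0, 0.5

# Two unit-norm inputs.
x_i, x_j = rng.standard_normal(d), rng.standard_normal(d)
x_i, x_j = x_i / np.linalg.norm(x_i), x_j / np.linalg.norm(x_j)

rho = float(x_i @ x_j)
theta_inf = rho * (np.pi - np.arccos(rho)) / (2 * np.pi)  # closed form (mu_hat = 0)

for m in [100, 10_000, 1_000_000]:
    # w_r = mu_r + sigma_0 * eps_r ~ N(0, (sigma_hat^2 + sigma_0^2) I)
    W = rng.standard_normal((m, d)) * np.sqrt(sigma_hat**2 + sigma_0**2)
    theta_emp = rho * np.mean((W @ x_i >= 0) & (W @ x_j >= 0))
    print(f"m={m:>9}: empirical {theta_emp:.5f}  vs  expectation {theta_inf:.5f}")
```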

(2) $\boldsymbol{\Theta}^{\sigma}$.

Plugging the derivative with respect to the variance weights in Equation (21) into the definition of the PNTK (Equation 12) yields:

$$\boldsymbol{\Theta}^{\sigma}_{ij}(0)=\frac{{\bf x}^{\top}_{i}{\bf x}_{j}}{m}\sum_{r=1}^{m}\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0)\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{j}\geq 0)\boldsymbol{\epsilon}^{2}_{r}$$

Note that the tangent kernel for $\boldsymbol{\sigma}$ differs from $\boldsymbol{\Theta}^{\mu}_{ij}$ by the additional factor $\boldsymbol{\epsilon}^{2}_{r}$. We have $\boldsymbol{\epsilon}^{2}_{r}\sim\chi^{2}_{1}$, independent of $\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0)\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{j}\geq 0)$. Because $\mathbb{E}[\chi^{2}_{1}]=1$, the expectation of $\boldsymbol{\Theta}^{\sigma}_{ij}$ equals that of $\boldsymbol{\Theta}^{\mu}_{ij}$. Thus for every pair $i,j$, $\boldsymbol{\Theta}^{\sigma}_{ij}$ is the average of $m$ i.i.d. random variables with expectation $\boldsymbol{\Theta}^{\infty}_{ij}=\mathbb{E}_{{\bf w}}\big[{\bf x}^{\top}_{i}{\bf x}_{j}\mathbb{I}\{{\bf w}^{\top}{\bf x}_{i}\geq 0,{\bf w}^{\top}{\bf x}_{j}\geq 0\}\big]$.

Now we establish the concentration bound. Since $\boldsymbol{\epsilon}^{2}_{r}$ is independent and sub-exponential, the sub-exponential tail bound gives, with probability at least $1-\delta^{\prime}$,

$$\big|\boldsymbol{\Theta}^{\sigma}_{ij}(0)-\boldsymbol{\Theta}^{\infty}_{ij}\big|\leq\sqrt{\frac{\log(8/\delta^{\prime})}{2m}}$$

This bound is of the same order as the concentration bound for $\boldsymbol{\Theta}^{\mu}_{ij}(0)$. Thus we can repeat the arguments for $\boldsymbol{\Theta}^{\mu}_{ij}(0)$ above to finish the proof: if $\sqrt{\frac{n^{2}\log(8n^{2}/\delta)}{m}}\leq\frac{\lambda_{0}}{4}$, which is implied by $m=\Omega\big(\frac{n^{2}\log(n^{2}/\delta)}{\lambda^{2}_{0}}\big)$, then with probability at least $1-\delta$,

$$\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},0)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4},\qquad\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},0)\big\|_{2}\geq\frac{3\lambda_{0}}{4}.$$

Lemma B.2 (PNTK during training).

If $\boldsymbol{\mu}_{r}$ and $\boldsymbol{\sigma}_{r}$ are initialized as in Theorem 4.4, then for any set of weight vectors ${\bf w}_{1},\ldots,{\bf w}_{m}$ that satisfy $\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|\leq\frac{c\lambda_{0}\delta\sqrt{\sigma_{0}^{2}+\hat{\sigma}^{2}}}{n^{2}}\equiv R$ for every $r\in[m]$, where $c$ is a constant, with probability at least $1-\delta$: $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},t)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$, $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},t)\big\|_{2}\geq\frac{\lambda_{0}}{2}$; and $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},t)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$, $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},t)\big\|_{2}\geq\frac{\lambda_{0}}{2}$.

Proof.

Define the event $A_{ir}=\{\exists{\bf w}:\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|\leq R,\ \mathbb{I}\{{\bf x}^{\top}_{i}{\bf w}_{r}(t)\geq 0\}\neq\mathbb{I}\{{\bf x}^{\top}_{i}{\bf w}_{r}(0)\geq 0\}\}$. This event can only happen when $|{\bf w}_{r}(0)^{\top}{\bf x}_{i}|<R$. As calculated in the proof of Lemma B.1, ${\bf w}_{i}\sim\mathcal{N}(0,\sigma^{2}_{0}+\hat{\sigma}^{2})$. Then, with the condition that $\|{\bf x}_{i}\|=1$, we obtain

$$z={\bf w}_{r}(0)^{\top}{\bf x}_{i}\sim\mathcal{N}\big(0,\sigma^{2}_{0}+\hat{\sigma}^{2}\big)$$

Integrating the Gaussian density over the event $|{\bf w}_{r}(0)^{\top}{\bf x}_{i}|<R$, we have

$$P(A_{ir})=P_{z\sim\mathcal{N}(0,\sigma^{2}_{0}+\hat{\sigma}^{2})}(|z|<R)\leq\frac{2R}{\sqrt{2\pi(\sigma^{2}_{0}+\hat{\sigma}^{2})}}$$
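This anti-concentration step just bounds the probability mass of a width-$2R$ window by the peak density times the window width; a two-line check (ours, using SciPy's standard normal CDF with illustrative values) makes the slack visible:

```python
import numpy as np
from scipy.stats import norm

sigma_sq = 0.2**2 + 0.5**2      # sigma_0^2 + sigma_hat^2 (illustrative values)
R = 0.05

exact = norm.cdf(R, scale=np.sqrt(sigma_sq)) - norm.cdf(-R, scale=np.sqrt(sigma_sq))
bound = 2 * R / np.sqrt(2 * np.pi * sigma_sq)
print(f"P(|z| < R) = {exact:.5f} <= {bound:.5f}")
```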

Then we can bound each entry of the matrix $\boldsymbol{\Theta}^{\mu}$:

$$\mathbb{E}\big[|\boldsymbol{\Theta}^{\mu}_{ij}(0)-\boldsymbol{\Theta}^{\mu}_{ij}(t)|\big]=\mathbb{E}\bigg[\frac{1}{m}\Big|{\bf x}^{\top}_{i}{\bf x}_{j}\sum_{r=1}^{m}\big(\mathbb{I}\{{\bf w}_{r}(0)^{\top}{\bf x}_{i}\geq 0,{\bf w}_{r}(0)^{\top}{\bf x}_{j}\geq 0\}-\mathbb{I}\{{\bf w}_{r}(t)^{\top}{\bf x}_{i}\geq 0,{\bf w}_{r}(t)^{\top}{\bf x}_{j}\geq 0\}\big)\Big|\bigg]\leq\frac{1}{m}\sum_{r=1}^{m}\mathbb{E}\big[\mathbb{I}\{A_{ir}\cup A_{jr}\}\big]\leq\frac{4R}{\sqrt{2\pi(\sigma^{2}_{0}+\hat{\sigma}^{2})}}$$

Summing over all entries of the matrix gives $\mathbb{E}\big[\sum_{ij}\big|\boldsymbol{\Theta}^{\mu}_{ij}(0)-\boldsymbol{\Theta}^{\mu}_{ij}(t)\big|\big]\leq\frac{4n^{2}R}{\sqrt{2\pi(\sigma_{0}^{2}+\hat{\sigma}^{2})}}$. By Markov’s inequality, with probability at least $1-\delta$ over the distribution of ${\bf w}_{r}(0)$ for $r\in[m]$, $\sum_{ij}\big|\boldsymbol{\Theta}^{\mu}_{ij}(0)-\boldsymbol{\Theta}^{\mu}_{ij}(t)\big|\leq\frac{4n^{2}R}{\sqrt{2\pi(\sigma^{2}_{0}+\hat{\sigma}^{2})}\,\delta}$. Then, by matrix perturbation theory, we have

$$\big\|\boldsymbol{\Theta}^{\mu}(t)-\boldsymbol{\Theta}^{\mu}(0)\big\|_{2}\leq\big\|\boldsymbol{\Theta}^{\mu}(t)-\boldsymbol{\Theta}^{\mu}(0)\big\|_{F}\leq\sum_{ij}\big|\boldsymbol{\Theta}^{\mu}_{ij}(t)-\boldsymbol{\Theta}^{\mu}_{ij}(0)\big|\leq\frac{4n^{2}R}{\sqrt{2\pi(\sigma_{0}^{2}+\hat{\sigma}^{2})}\,\delta}$$

If $\frac{4n^{2}R}{\sqrt{2\pi(\sigma_{0}^{2}+\hat{\sigma}^{2})}\,\delta}\leq\frac{\lambda_{0}}{4}$, which is implied by $\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|\leq\frac{c\lambda_{0}\delta\sqrt{\sigma_{0}^{2}+\hat{\sigma}^{2}}}{n^{2}}$, then with probability at least $1-\delta$,

$$\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},t)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4},\qquad\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},t)\big\|_{2}\geq\frac{\lambda_{0}}{2}.$$

On the other hand, we can bound each entry of the matrix $\boldsymbol{\Theta}^{\sigma}$ with the following inequality:

$$\mathbb{E}\big[|\boldsymbol{\Theta}^{\sigma}_{ij}(0)-\boldsymbol{\Theta}^{\sigma}_{ij}(t)|\big]=\mathbb{E}\bigg[\frac{1}{m}\Big|{\bf x}^{\top}_{i}{\bf x}_{j}\sum_{r=1}^{m}\big(\mathbb{I}\{{\bf w}_{r}(0)^{\top}{\bf x}_{i}\geq 0,{\bf w}_{r}(0)^{\top}{\bf x}_{j}\geq 0\}\boldsymbol{\epsilon}^{2}_{r}(0)-\mathbb{I}\{{\bf w}_{r}(t)^{\top}{\bf x}_{i}\geq 0,{\bf w}_{r}(t)^{\top}{\bf x}_{j}\geq 0\}\boldsymbol{\epsilon}^{2}_{r}(t)\big)\Big|\bigg]\leq\frac{1}{m}\sum_{r=1}^{m}\mathbb{E}\big[\mathbb{I}\{A_{ir}\cup A_{jr}\}\boldsymbol{\epsilon}^{2}_{r}(t)\big]\leq\frac{4R}{\sqrt{2\pi(\sigma^{2}_{0}+\hat{\sigma}^{2})}}$$

For the first inequality in the second step, we use $\boldsymbol{\epsilon}^{2}_{r}(t)=\boldsymbol{\epsilon}^{2}_{r}(0)$, which follows from the definition of ${\bf w}_{r}(t)$. This definition is reasonable because $\mathbb{E}[\boldsymbol{\epsilon}^{2}_{r}(0)]=\mathbb{E}[\boldsymbol{\epsilon}^{2}_{r}(t)]$ when $\boldsymbol{\epsilon}_{r}(t)$ and $\boldsymbol{\epsilon}_{r}(0)$ are i.i.d. With this definition, we can compute $\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|$ directly while keeping the conclusion applicable to the actual weights at each training step.

The above inequality shows that, even with the additional factor $\boldsymbol{\epsilon}^{2}_{r}$, we obtain the same bound. Once again, we can apply the arguments for $\boldsymbol{\Theta}^{\mu}$ here and conclude the proof: if $\frac{4n^{2}R}{\sqrt{2\pi(\sigma_{0}^{2}+\hat{\sigma}^{2})}\,\delta}\leq\frac{\lambda_{0}}{4}$, which is implied by $\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|\leq\frac{c\lambda_{0}\delta\sqrt{\sigma_{0}^{2}+\hat{\sigma}^{2}}}{n^{2}}$, then with probability at least $1-\delta$,

$$\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},t)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4},\qquad\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},t)\big\|_{2}\geq\frac{\lambda_{0}}{2}.$$

Lemma B.3 (Change of weights during training).

Suppose $\lambda_{\min}(\boldsymbol{\Theta}(t))\geq\frac{\lambda_{0}}{2}$ for $0<t<T$. Then $\widehat{R}_{S}(Q,t)\leq\exp(-\lambda_{0}t)\widehat{R}_{S}(Q,0)$, and with probability at least $1-\delta$ over the initialization, $\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|_{2}\leq\frac{2n}{\sqrt{m\delta}\lambda_{0}}\equiv R^{\prime}$.

Proof.

According to the gradient flow of the output function, we have

$$\begin{aligned}\frac{df({\bf x}_{i},t)}{dt}&=\sum_{r=1}^{m}\Big(\big\langle\frac{\partial f({\bf x}_{i},t)}{\partial\boldsymbol{\mu}_{r}},\frac{d\boldsymbol{\mu}_{r}(t)}{dt}\big\rangle+\big\langle\frac{\partial f({\bf x}_{i},t)}{\partial\boldsymbol{\sigma}_{r}},\frac{d\boldsymbol{\sigma}_{r}(t)}{dt}\big\rangle\Big)\\&=\sum_{j=1}^{n}(y_{j}-f({\bf x}_{j},t))\sum_{r=1}^{m}\Big(\big\langle\frac{\partial f({\bf x}_{i})}{\partial\boldsymbol{\mu}_{r}},\frac{\partial f({\bf x}_{j})}{\partial\boldsymbol{\mu}_{r}}\big\rangle+\big\langle\frac{\partial f({\bf x}_{i})}{\partial\boldsymbol{\sigma}_{r}},\frac{\partial f({\bf x}_{j})}{\partial\boldsymbol{\sigma}_{r}}\big\rangle\Big)\\&=\sum_{j=1}^{n}(y_{j}-f({\bf x}_{j},t))(\boldsymbol{\Theta}^{\mu}_{ij}+\boldsymbol{\Theta}^{\sigma}_{ij})\end{aligned}$$

Then the dynamics of the loss can be calculated:

$$\begin{aligned}\frac{d}{dt}\widehat{R}_{S}(Q,t)&=\frac{1}{2}\frac{d}{dt}\big\|\mathbb{E}_{f(t)\sim Q}f({\bf X},t)-{\bf Y}\big\|_{2}^{2}=\frac{1}{2}\frac{d}{dt}\big\|\widehat{f}({\bf X},t)-{\bf Y}\big\|_{2}^{2}\\&=-({\bf Y}-\widehat{f}({\bf X},t))^{\top}(\boldsymbol{\Theta}^{\mu}(t)+\boldsymbol{\Theta}^{\sigma}(t))({\bf Y}-\widehat{f}({\bf X},t))\leq-\lambda_{0}\|{\bf Y}-\widehat{f}({\bf X},t)\|^{2}_{2}\end{aligned}$$

where we define $\widehat{f}(t)=\mathbb{E}_{f(t)\sim Q}f(t)=\mathbb{E}_{\boldsymbol{\epsilon}}f(t)$, and ${\bf X},{\bf Y}$ are the collections of features and labels. The loss can then be bounded:

$$\widehat{R}_{S}(Q,t)=\frac{1}{2}\|{\bf Y}-\widehat{f}({\bf X},t)\|^{2}_{2}\leq\exp(-\lambda_{0}t)\widehat{R}_{S}(Q,0)$$

The above equation implies a linear convergence rate of gradient flow on the over-parameterized probabilistic network.
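A small simulation of the residual dynamics $\frac{d}{dt}({\bf Y}-\widehat{f})=-\boldsymbol{\Theta}({\bf Y}-\widehat{f})$ illustrates this rate; the kernel matrix below is a random positive-definite stand-in for $\boldsymbol{\Theta}^{\mu}+\boldsymbol{\Theta}^{\sigma}$ (our own toy setup, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(3)
n, eta, steps = 20, 0.01, 2000

# Random PSD kernel as a stand-in for Theta^mu + Theta^sigma.
A = rng.standard_normal((n, n))
Theta = A @ A.T / n + 0.5 * np.eye(n)
lam0 = np.linalg.eigvalsh(Theta).min()

u = rng.standard_normal(n)   # initial residual Y - f_hat(X, 0)
r0 = u @ u
for t in range(steps):
    u = u - eta * Theta @ u  # Euler discretization of the gradient flow
    # ||u(t)||^2 should decay like exp(-2 * lam0 * eta * t), up to discretization.

print(f"final loss ratio: {u @ u / r0:.3e}")
print(f"exp(-2*lam0*T):   {np.exp(-2 * lam0 * eta * steps):.3e}")
```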

Now we bound the change of the mean weights:

$$\bigg\|\frac{d}{dt}\boldsymbol{\mu}_{r}(t)\bigg\|_{2}=\bigg\|\sum_{i=1}^{n}(y_{i}-\widehat{f}({\bf x}_{i}))\frac{1}{\sqrt{m}}v_{r}\,{\bf x}_{i}\,\mathbb{I}({\bf w}^{\top}_{r}(t){\bf x}_{i}\geq 0)\bigg\|_{2}\leq\frac{1}{\sqrt{m}}\sum_{i=1}^{n}|y_{i}-\widehat{f}({\bf x}_{i})|\leq\sqrt{\frac{n}{m}}\|{\bf Y}-\widehat{f}({\bf X},t)\|_{2}\leq\sqrt{\frac{n}{m}}\exp(-\lambda_{0}t/2)\|{\bf Y}-\widehat{f}({\bf X},0)\|_{2}$$

Integrating the gradient, we have

$$\|\boldsymbol{\mu}_{r}(T)-\boldsymbol{\mu}_{r}(0)\|_{2}\leq\int_{0}^{T}\bigg\|\frac{d}{dt}\boldsymbol{\mu}_{r}(t)\bigg\|_{2}\,dt\leq 2\sqrt{\frac{n}{m}}\frac{\|{\bf Y}-\widehat{f}({\bf X},0)\|_{2}}{\lambda_{0}}$$

Next we bound the change of the variance weights:

$$\bigg\|\frac{d}{dt}\boldsymbol{\sigma}_{r}(t)\bigg\|_{2}=\bigg\|\sum_{i=1}^{n}(y_{i}-\widehat{f}({\bf x}_{i}))\frac{1}{\sqrt{m}}\mathbb{E}_{\boldsymbol{\epsilon}_{r}(t)}\big[v_{r}\,{\bf x}_{i}\odot\boldsymbol{\epsilon}_{r}(t)\,\mathbb{I}({\bf w}^{\top}_{r}(t){\bf x}_{i}\geq 0)\big]\bigg\|_{2}=0$$

In the above derivation, we use the definition of the loss, $\widehat{R}_{S}(Q)=\mathbb{E}_{f\sim Q}\widehat{R}_{S}(f)$, and interchange integration and differentiation. The result gives:

$$\|\boldsymbol{\sigma}_{r}(T)-\boldsymbol{\sigma}_{r}(0)\|\leq\int_{0}^{T}\bigg\|\frac{d}{dt}\boldsymbol{\sigma}_{r}(t)\bigg\|_{2}\,dt=0$$

Plugging the results for both the mean and variance weights together, we obtain

$$\|{\bf w}_{r}(T)-{\bf w}_{r}(0)\|=\|\boldsymbol{\mu}_{r}(T)-\boldsymbol{\mu}_{r}(0)+\boldsymbol{\epsilon}_{r}(0)\odot(\boldsymbol{\sigma}_{r}(T)-\boldsymbol{\sigma}_{r}(0))\|\leq 2\sqrt{\frac{n}{m}}\frac{\|{\bf Y}-\widehat{f}({\bf X},0)\|_{2}}{\lambda_{0}}=\frac{4\sqrt{n\widehat{R}_{S}(Q,0)}}{\sqrt{m}\lambda_{0}}$$

Finally, it remains to bound $\widehat{R}_{S}(Q,0)$:

$$\mathbb{E}\big[\|{\bf Y}-\widehat{f}({\bf X})\|^{2}_{2}\big]=\sum_{i=1}^{n}\big(y^{2}_{i}-2y_{i}\mathbb{E}[\widehat{f}({\bf x}_{i})]+\mathbb{E}[\widehat{f}({\bf x}_{i})^{2}]\big)=\sum_{i=1}^{n}(1+O(1))=O(n)$$

Thus, by Markov’s inequality, with probability at least $1-\delta$, $\widehat{R}_{S}(Q,0)=O(\frac{n}{\delta})$, and

$$\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|_{2}\leq\frac{2n}{\sqrt{m\delta}\lambda_{0}}\equiv R^{\prime}$$

Lemma B.4.

If $R^{\prime}<R$, then $\widehat{R}_{S}(Q,t)\leq\exp(-\lambda_{0}t)\widehat{R}_{S}(Q,0)$.

Proof.

Since $R^{\prime}=\frac{2n}{\sqrt{m\delta}\lambda_{0}}$ and $R=\frac{c\lambda_{0}\delta\sqrt{\sigma^{2}_{0}+\hat{\sigma}^{2}}}{n^{2}}$, the condition $R^{\prime}<R$ is satisfied once $m=\Omega\big(\frac{n^{6}}{\lambda^{4}_{0}\delta^{4}(\sigma_{0}^{2}+\hat{\sigma}^{2})}\big)$. ∎

Appendix C Missing Proofs of Section 4.2

Theorem C.1.

Consider gradient descent on the target function (6), with the initialization stated in Theorem 4.4. Suppose $m\geq{\rm poly}(n,1/\lambda_{0},1/\delta,1/\mathcal{E})$. Then with probability at least $1-\delta$ over the random initialization, we have

$$\widehat{f}({\bf x})=\boldsymbol{\Theta}^{\infty}({\bf x},{\bf X})\big(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I}\big)^{-1}{\bf Y}\pm\mathcal{E} \quad (22)$$

where $\widehat{f}({\bf x})=\mathbb{E}_{f\sim Q}f({\bf x})$ aligns with the definition of the empirical loss function.

Proof.

At a high level, our proof first establishes the kernel ridge regression result in the infinite-width limit, and then bounds the perturbation of the prediction.

According to the linearization rules for infinitely wide networks (Lee et al., 2019), the output function can be expressed as

$$\widehat{f}_{\rm ntk}({\bf x},t)=\phi_{\mu}({\bf x})^{\top}(\boldsymbol{\mu}(t)-\boldsymbol{\mu}(0))+\phi_{\sigma}({\bf x})^{\top}(\boldsymbol{\sigma}(t)-\boldsymbol{\sigma}(0)),$$

where $\phi_{\mu}({\bf x})=\nabla_{\mu}\widehat{f}({\bf x},0)$ and $\phi_{\sigma}({\bf x})=\nabla_{\sigma}\widehat{f}({\bf x},0)$. Since $\boldsymbol{\sigma}$ does not change during training, the KL divergence reduces to

$$\text{KL}=\frac{1}{2}\frac{\|\boldsymbol{\mu}(t)-\boldsymbol{\mu}(0)\|^{2}}{\sigma^{2}_{0}}$$
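This is the standard closed form for the KL divergence between Gaussians with a shared variance; spelling out the step (which is left implicit above), per coordinate with posterior $\mathcal{N}(\mu_{1},\sigma^{2})$ and prior $\mathcal{N}(\mu_{2},\sigma^{2})$,

$$\mathrm{KL}\big(\mathcal{N}(\mu_{1},\sigma_{1}^{2})\,\big\|\,\mathcal{N}(\mu_{2},\sigma_{2}^{2})\big)=\log\frac{\sigma_{2}}{\sigma_{1}}+\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}}-\frac{1}{2}\ \xrightarrow{\ \sigma_{1}=\sigma_{2}=\sigma\ }\ \frac{(\mu_{1}-\mu_{2})^{2}}{2\sigma^{2}},$$

and summing over coordinates with $\mu_{1}=\boldsymbol{\mu}(t)$, $\mu_{2}=\boldsymbol{\mu}(0)$, $\sigma=\sigma_{0}$ gives the expression above; the log-variance and trace terms cancel because the variances are equal.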

Then the gradient flow equation for $\boldsymbol{\mu}$ becomes

$$\frac{d\boldsymbol{\mu}(t)}{dt}=\frac{\partial\mathcal{B}}{\partial\boldsymbol{\mu}}=(\widehat{f}_{\rm ntk}({\bf X},t)-{\bf Y})\,\phi_{\mu}({\bf X})+\lambda/\sigma^{2}_{0}\,(\boldsymbol{\mu}(t)-\boldsymbol{\mu}(0))=\boldsymbol{\Theta}^{\mu}(\boldsymbol{\mu}(t)-\boldsymbol{\mu}(0))-\phi_{\mu}({\bf X})^{\top}{\bf Y}+\lambda/\sigma^{2}_{0}\,(\boldsymbol{\mu}(t)-\boldsymbol{\mu}(0))$$

This is a linear ordinary differential equation in $\boldsymbol{\mu}(t)$, whose solution is

$$\boldsymbol{\mu}(t)=\phi_{\mu}({\bf X})^{\top}(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}\big({\bf I}-e^{-(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}{\bf I})t}\big){\bf Y}$$

Plugging this result into the linearization of the expected output function, we have

$$\widehat{f}_{\rm ntk}({\bf x},t)=\boldsymbol{\Theta}({\bf x},{\bf X})(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}\big({\bf I}-e^{-(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}{\bf I})t}\big){\bf Y}$$

Taking the time to infinity yields

$$\widehat{f}_{\rm ntk}({\bf x},t=\infty)=\boldsymbol{\Theta}({\bf x},{\bf X})(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}.$$
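This limit is exactly a kernel ridge regression predictor with kernel $\boldsymbol{\Theta}$ and ridge $\lambda/\sigma_{0}^{2}$. A minimal NumPy sketch (the RBF kernel stands in for the NTK purely for illustration; the data and names are our own):

```python
import numpy as np

def kernel(A, B, bandwidth=1.0):
    # RBF kernel as an illustrative stand-in for the (P)NTK Theta.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth**2))

rng = np.random.default_rng(4)
n, d = 50, 3
lam, sigma_0 = 0.1, 1.0

X = rng.standard_normal((n, d))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
x_test = rng.standard_normal((5, d))

ridge = lam / sigma_0**2
# f(x) = Theta(x, X) (Theta(X, X) + (lambda / sigma_0^2) I)^{-1} Y
alpha = np.linalg.solve(kernel(X, X) + ridge * np.eye(n), Y)
f_test = kernel(x_test, X) @ alpha
print(f_test)
```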

The next step is to show that

$$|\widehat{f}_{\rm ntk}({\bf x})-\widehat{f}({\bf x})|\leq O(\mathcal{E}),$$

where $\mathcal{E}=\mathcal{E}_{\rm init}+\frac{\sqrt{n}}{\lambda_{0}^{2}}\log\big(\frac{n}{\mathcal{E}_{\Theta}\lambda_{0}}\big)\mathcal{E}_{\Theta}$, and we define $|\widehat{f}({\bf x},0)|\leq\mathcal{E}_{\rm init}$ and $\|\boldsymbol{\Theta}^{\infty}-\boldsymbol{\Theta}(t)\|_{2}\leq\mathcal{E}_{\Theta}$.

The proof relies on a careful analysis of the trajectories induced by gradient flow when optimizing the neural network and the NTK predictor. The detailed argument can be found in the proof of Theorem 3.2 of Arora et al. (2019b), with the kernel regression there replaced by kernel ridge regression here.

Appendix D Missing Proofs of Section 4.3

Theorem D.1 (PAC-Bayesian bound with NTK).

Suppose the data $S=\{({\bf x}_{i},y_{i})\}_{i=1}^{n}$ are i.i.d. samples from a non-degenerate distribution $\mathcal{D}$, and $m\geq{\rm poly}(n,\lambda_{0}^{-1},\delta^{-1})$. Consider any loss function $\ell:\mathbb{R}\times\mathbb{R}\rightarrow[0,1]$. Then with probability at least $1-\delta$ over the random initialization and the training samples, the probabilistic network trained by gradient descent for $T\geq\Omega(\frac{1}{\eta\lambda_{0}}\log\frac{n}{\delta})$ iterations has population risk $R_{\mathcal{D}}(Q)$ bounded as follows:

$$R_{\mathcal{D}}\leq\frac{1}{\sigma^{2}_{0}}\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}}{n}+\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-2}{\bf Y}}{n}}+O\bigg(\frac{\log\frac{2\sqrt{n}}{\delta}}{n}\bigg). \quad (23)$$
Proof.

The generalization bound consists of two terms: the empirical error and the KL divergence.

(1) We first bound the empirical error $\sqrt{\sum_{i=1}^{n}(\widehat{f}_{\rm ntk}({\bf x}_{i},\infty)-y_{i})^{2}}$ with the following identity:

$$\sqrt{\sum_{i=1}^{n}(\widehat{f}_{\rm ntk}({\bf x}_{i},\infty)-y_{i})^{2}}=\|\boldsymbol{\Theta}({\bf X},{\bf X})(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}-{\bf Y}\|=\|\lambda/\sigma^{2}_{0}\,(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}\|=\lambda/\sigma^{2}_{0}\sqrt{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-2}{\bf Y}}$$

Then we can further bound the error as

$$\frac{1}{n}\sum_{i=1}^{n}\ell(\widehat{f}_{\rm ntk}({\bf x}_{i}),y_{i})\leq\frac{1}{n}\sum_{i=1}^{n}|\widehat{f}_{\rm ntk}({\bf x}_{i})-y_{i}|\leq\frac{1}{\sqrt{n}}\sqrt{\sum_{i=1}^{n}|\widehat{f}_{\rm ntk}({\bf x}_{i})-y_{i}|^{2}}\leq\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-2}{\bf Y}}{n}}$$

(2) The next step is to calculate the KL divergence. According to the solution of the differential equation,

$$\boldsymbol{\mu}(t)-\boldsymbol{\mu}(0)=\phi_{\mu}({\bf X})^{\top}(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}\big({\bf I}-e^{-(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}{\bf I})t}\big){\bf Y},$$

so $t=T=\infty$ yields $\boldsymbol{\mu}(\infty)-\boldsymbol{\mu}(0)=\phi_{\mu}({\bf X})^{\top}(\boldsymbol{\Theta}({\bf X},{\bf X})+\frac{\lambda}{\sigma^{2}_{0}}{\bf I})^{-1}{\bf Y}$. Therefore, the KL divergence is

$$\text{KL}=\frac{1}{\sigma^{2}_{0}}{\bf Y}^{\top}(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}\boldsymbol{\Theta}({\bf X},{\bf X})(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}\leq\frac{1}{\sigma^{2}_{0}}{\bf Y}^{\top}(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}$$

Finally, we arrive at the PAC-Bayesian generalization bound

$$R_{\mathcal{D}}\leq\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\frac{\lambda}{\sigma^{2}_{0}}{\bf I})^{-1}{\bf Y}}{n\sigma^{2}_{0}}+\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\frac{\lambda}{\sigma^{2}_{0}}{\bf I})^{-2}{\bf Y}}{n}}+O\bigg(\frac{\log\frac{2\sqrt{n}}{\delta}}{n}\bigg).$$
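Both data-dependent terms of the bound are cheap to evaluate once $\boldsymbol{\Theta}$ is formed, which is what makes the bound usable as a hyperparameter-selection proxy. A hedged NumPy sketch (the kernel matrix and labels below are illustrative placeholders):

```python
import numpy as np

def pac_bayes_bound_terms(Theta, Y, lam, sigma_0):
    """First two terms of the PAC-Bayesian NTK bound (Equation 23)."""
    n = len(Y)
    M = Theta + (lam / sigma_0**2) * np.eye(n)
    MinvY = np.linalg.solve(M, Y)                       # (Theta + ridge I)^{-1} Y
    kl_term = (Y @ MinvY) / (n * sigma_0**2)            # Y^T M^{-1} Y / (n sigma_0^2)
    err_term = (lam / sigma_0**2) * np.sqrt(MinvY @ MinvY / n)  # uses Y^T M^{-2} Y
    return kl_term, err_term

rng = np.random.default_rng(5)
n = 100
A = rng.standard_normal((n, n))
Theta = A @ A.T / n                  # placeholder PSD kernel matrix
Y = rng.choice([-1.0, 1.0], size=n)  # placeholder labels
print(pac_bayes_bound_terms(Theta, Y, lam=0.01, sigma_0=1.0))
```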

Theorem D.2 (Rademacher bound with NTK).

Suppose the data $S=\{({\bf x}_{i},y_{i})\}_{i=1}^{n}$ are i.i.d. samples from a non-degenerate distribution $\mathcal{D}$, and $m\geq{\rm poly}(n,\lambda_{0}^{-1},\delta^{-1})$. Consider any loss function $\ell:\mathbb{R}\times\mathbb{R}\rightarrow[0,1]$. Then with probability at least $1-\delta$ over the random initialization and the training samples, the network trained by gradient descent for $T\geq\Omega(\frac{1}{\eta\lambda_{0}}\log\frac{n}{\delta})$ iterations has population risk $R_{\mathcal{D}}(Q)$ bounded as follows:

$$R_{\mathcal{D}}\leq\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}}{n}}+\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-2}{\bf Y}}{n}}+O\bigg(\sqrt{\frac{\log\frac{n}{\lambda_{0}\delta}}{n}}\bigg). \quad (24)$$
Proof.

In this proof, we use a Rademacher-complexity analysis. Let $\mathcal{H}$ be the reproducing kernel Hilbert space (RKHS) corresponding to the kernel $k(\cdot,\cdot)$. It is known that the RKHS norm of the function $f_{\rm ntk}({\bf x})=\boldsymbol{\Theta}^{\infty}({\bf x},{\bf X})\big(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I}\big)^{-1}{\bf Y}=\boldsymbol{\alpha}^{\top}k({\bf x},{\bf X})$ is $\|f_{\rm ntk}\|_{\mathcal{H}}=\sqrt{\boldsymbol{\alpha}^{\top}k({\bf X},{\bf X})\boldsymbol{\alpha}}$, where $k=\boldsymbol{\Theta}$ and $\boldsymbol{\alpha}=(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}$. Then we can bound $\|f_{\rm ntk}\|_{\mathcal{H}}$:

$$\|f_{\rm ntk}\|_{\mathcal{H}}=\sqrt{{\bf Y}^{\top}(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}\boldsymbol{\Theta}({\bf X},{\bf X})(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}}\leq\sqrt{{\bf Y}^{\top}(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}}$$

For the function class $\mathcal{F}_{B}=\{f({\bf x})=\boldsymbol{\alpha}^{\top}k({\bf x},{\bf X}):\|f\|_{\mathcal{H}}\leq B\}$, it is known that the empirical Rademacher complexity can be bounded as

$$\widehat{R}_{S}(\mathcal{F}_{B})=\frac{1}{n}\mathbb{E}\Big[\sup_{f\in\mathcal{F}_{B}}\sum_{i=1}^{n}\gamma_{i}f({\bf x}_{i})\Big]\leq\frac{B\sqrt{{\rm Tr}[k({\bf X},{\bf X})]}}{n}$$
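For the kernel class, the supremum has a closed form, $\sup_{\|f\|_{\mathcal{H}}\leq B}\sum_{i}\gamma_{i}f({\bf x}_{i})=B\sqrt{\boldsymbol{\gamma}^{\top}k({\bf X},{\bf X})\boldsymbol{\gamma}}$, so the bound is Jensen's inequality applied to $\mathbb{E}\sqrt{\boldsymbol{\gamma}^{\top}K\boldsymbol{\gamma}}\leq\sqrt{{\rm Tr}[K]}$. A quick Monte Carlo check (ours, with a placeholder PSD kernel matrix):

```python
import numpy as np

rng = np.random.default_rng(6)
n, B, trials = 50, 1.0, 20_000

A = rng.standard_normal((n, n))
K = A @ A.T / n  # placeholder PSD kernel matrix

gamma = rng.choice([-1.0, 1.0], size=(trials, n))  # Rademacher signs
sup_vals = B * np.sqrt(np.einsum("ti,ij,tj->t", gamma, K, gamma))

rad_emp = sup_vals.mean() / n
rad_bound = B * np.sqrt(np.trace(K)) / n
print(f"empirical Rademacher complexity: {rad_emp:.4f} <= bound {rad_bound:.4f}")
```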

Assume that ${\rm Tr}[k({\bf X},{\bf X})]\approx n$. Recall the standard generalization bound from Rademacher complexity: with probability at least $1-\delta$,

$$\sup_{f\in\mathcal{F}}\Big[\mathbb{E}_{\mathcal{D}}[\ell(f({\bf x}),y)]-\frac{1}{n}\sum_{i=1}^{n}\ell(f({\bf x}_{i}),y_{i})\Big]\leq 2\widehat{R}_{S}(\mathcal{F})+3\sqrt{\frac{\log(2/\delta)}{2n}}$$

Thus we have

$$R_{\mathcal{D}}\leq\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}}{n}}+\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-2}{\bf Y}}{n}}+O\bigg(\sqrt{\frac{\log\frac{n}{\lambda_{0}\delta}}{n}}\bigg).$$

Appendix E Additional Experiments

This section contains additional experimental results.

E.1 Training dynamics of wide probabilistic network

We conduct an experiment similar to the one in Section 5.2, but without incorporating the KL term in the objective function. The same behavior can be observed (Figure S1), further verifying our theoretical results.

E.2 Comparison of gradient norms with respect to $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$

In Lemma 4.9, we assume that the variance weight $\boldsymbol{\sigma}$ is fixed. To verify that this assumption is reasonable in practical training, we compare the gradient norms with respect to $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$. The result is shown in Figure S3: the gradient norm of $\nabla_{\boldsymbol{\mu}}f$ is much larger than that of $\nabla_{\boldsymbol{\sigma}}f$, which implies that $\boldsymbol{\sigma}$ is effectively fixed during gradient descent training.
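A minimal sketch of this comparison for a single reparameterized ReLU layer (our own toy reproduction, not the paper's experimental code); the gradients of the expected output are estimated by averaging over noise samples, mirroring the expectation in the objective:

```python
import numpy as np

rng = np.random.default_rng(7)
m, d, n_noise = 4096, 32, 256

mu = rng.standard_normal((m, d)) / np.sqrt(d)  # mean weights
sigma = 0.1 * np.ones((m, d))                  # std-dev weights
v = rng.choice([-1.0, 1.0], size=m)            # fixed output weights
x = rng.standard_normal(d); x /= np.linalg.norm(x)

g_mu = np.zeros((m, d)); g_sigma = np.zeros((m, d))
for _ in range(n_noise):
    eps = rng.standard_normal((m, d))
    act = ((mu + sigma * eps) @ x >= 0).astype(float)    # ReLU gates
    g_mu += (v * act)[:, None] * x / np.sqrt(m)          # df/dmu_r per Eq. (21)
    g_sigma += (v * act)[:, None] * (x * eps) / np.sqrt(m)  # df/dsigma_r per Eq. (21)
g_mu /= n_noise; g_sigma /= n_noise

print(f"||grad_mu f||    = {np.linalg.norm(g_mu):.4f}")
print(f"||grad_sigma f|| = {np.linalg.norm(g_sigma):.4f}")
```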

Figure S1: Relative Frobenius norm change in $\mu$ during training with an MSE loss that does not include the KL term, where $m$ is the width of the network.
Figure S2: Correlation between the aggregated proxy $\mathcal{PA}$ and the generalization bound.
Figure S3: Comparison between the gradients of the mean $\boldsymbol{\mu}$ and the standard deviations $\boldsymbol{\sigma}$.

E.3 Correlation between generalization bound proxy metric and generalization bound

In Figure 2, we observe a significant positive correlation between $\mathcal{PA}$ and the generalization bound across different values of a selected hyperparameter while the other hyperparameters are fixed. Furthermore, Figure S2 presents the correlation for aggregated values of $\rho_{0}$ and $\lambda$, in the setting where 50% of the data is used for prior training. We can clearly see that a lower $\mathcal{PA}$ corresponds to a lower bound, with a strong positive Kendall-tau correlation of 0.7.

E.4 Grid search

For selecting hyperparameters, we conduct a grid search over $\rho_{0}$, the proportion of data used for prior training, and the KL penalty $\lambda$. Notably, we sweep the proportion of prior-training data over [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], since 0.2 is the minimum proportion required to obtain a reasonably low generalization bound (Dziugaite et al., 2020). For the rest, we sweep $\rho_{0}$ over [0.03, 0.05, 0.07, 0.09, 0.1, 0.3, 0.5, 0.7] for FCN ([0.05, 0.07, 0.09, 0.1, 0.3, 0.5, 0.7, 0.9] for CNN) and the KL penalty over [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1] for both architectures; a sketch of the sweep loop appears below.
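A minimal sketch of the sweep (the `train_and_bound` routine is a hypothetical placeholder for PAC-Bayesian training that returns the bound or its proxy $\mathcal{PA}$; the grids are the FCN values listed above):

```python
import itertools

rho0_grid = [0.03, 0.05, 0.07, 0.09, 0.1, 0.3, 0.5, 0.7]  # FCN grid
prior_frac_grid = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
kl_penalty_grid = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1]

def train_and_bound(rho0, prior_frac, kl_penalty):
    # Hypothetical placeholder: a real version would train the probabilistic
    # network with the PAC-Bayesian objective and return the computed bound.
    return rho0 + abs(prior_frac - 0.5) + abs(kl_penalty - 0.01)  # toy score

best = None
for rho0, frac, lam in itertools.product(rho0_grid, prior_frac_grid, kl_penalty_grid):
    bound = train_and_bound(rho0, frac, lam)
    if best is None or bound < best[0]:
        best = (bound, rho0, frac, lam)

print(f"best bound {best[0]:.4f} at rho0={best[1]}, prior frac={best[2]}, lambda={best[3]}")
```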