
Demystify Optimization and Generalization of Over-parameterized PAC-Bayesian Learning

Wei Huang    Chunrui Liu    Yilan Chen    Tianyu Liu    Richard Yi Da Xu
Abstract

PAC-Bayes is an analysis framework in which the training error can be expressed as the weighted average of the hypotheses under the posterior distribution while incorporating prior knowledge. Beyond being a pure generalization bound analysis tool, a PAC-Bayesian bound can also be incorporated into an objective function to train a probabilistic neural network, making it a powerful and relevant framework that can numerically provide a tight generalization bound for supervised learning. For simplicity, we refer to training a probabilistic neural network with objectives derived from PAC-Bayesian bounds as PAC-Bayesian learning. Despite its empirical success, the theoretical analysis of PAC-Bayesian learning for neural networks is rarely explored. This paper proposes a new class of convergence and generalization analyses for PAC-Bayes learning when it is used to train over-parameterized neural networks by gradient descent. For a wide probabilistic neural network, we show that when PAC-Bayes learning is applied, the convergence result corresponds to solving a kernel ridge regression whose kernel is the probabilistic neural tangent kernel (PNTK). Based on this finding, we further characterize a uniform PAC-Bayesian generalization bound which improves over the Rademacher complexity-based bound for non-probabilistic neural networks. Finally, drawing on the insight from our theoretical results, we propose a proxy measure for efficient hyperparameter selection, which is proven to be time-saving.


1 Introduction

Deep learning has demonstrated powerful learning capability thanks to its over-parameterized structure, and various network architectures have been responsible for a great leap in performance (LeCun et al., 2015). Because over-fitting and complex hyperparameter tuning are two of the major pain points in deep learning, designing generalization guarantees for deep networks has been a long-term pursuit for researchers (Zhang et al., 2021). To this end, a learning framework that trains a probabilistic neural network with a PAC-Bayesian bound as the objective function has been proposed (Bégin et al., 2016; Dziugaite & Roy, 2017; Neyshabur et al., 2017b; Raginsky et al., 2017; Neyshabur et al., 2017a; London, 2017; Smith & Le, 2017; Liu et al., 2021). We dub this learning method PAC-Bayesian learning. PAC-Bayesian learning has been shown to achieve competitive expected test set error while providing a tight generalization bound. Furthermore, a generalization bound computed from the training error can obviate the need for splitting data into train, test, or validation sets, which is highly useful for training a deep network with scarce data (Pérez-Ortiz et al., 2020; Grünwald & Mehta, 2020). Meanwhile, these advancements in PAC-Bayesian bounds have been widely adapted to different deep neural network structures, including meta-learning (Amit & Meir, 2018), convolutional neural networks (Zhou et al., 2018), binary activated multilayer networks (Letarte et al., 2019), partially aggregated neural networks (Biggs & Guedj, 2020), and graph neural networks (Liao et al., 2020).

Due to the tremendous empirical success of PAC-Bayesian learning, there is increasing interest in understanding its theoretical properties. Existing analyses, however, are either restricted to a specific model variant (Dziugaite & Roy, 2018a; Dziugaite et al., 2020) or rely heavily on empirical exploration (Neyshabur et al., 2017a). To the best of our knowledge, there has been no investigation of why the training of PAC-Bayesian learning succeeds, nor of why the PAC-Bayesian bound is tight on unseen data. Therefore, it is natural to ask: for a probabilistic neural network with an objective function derived from a PAC-Bayesian bound, trained by gradient descent,

  1. Q1:

    How effective is the training (i.e., the trainability) on the training set?

  2. Q2:

    How tight is the generalization bound (i.e., generalization) compared to that of learning frameworks using non-probabilistic neural networks?

Answering these questions is highly non-trivial due to the inherent non-convexity of over-parameterized models (Jain & Kar, 2017), the additional randomness introduced by probabilistic neural networks (Specht, 1990), and the additional challenges brought by the Kullback-Leibler (KL) divergence between the posterior and prior distributions.

In this paper, we answer the above questions by leveraging recent advances in the theory of deep learning in over-parameterized settings with extremely (or infinitely) wide networks. It can be shown that ultra-wide networks optimized with gradient descent achieve near-zero training error, and the critical factor that governs the training process is the so-called neural tangent kernel (NTK), which can be proven to remain unchanged during gradient descent training (Jacot et al., 2018), thus providing a guarantee of reaching the global minimum (Du et al., 2019a; Allen-Zhu et al., 2019). Under the PAC-Bayesian framework, the NTK is no longer calculated from the derivatives with respect to the weights directly, but instead from the gradients with respect to the distribution parameters of the weights. We call it the Probabilistic NTK (PNTK), based on which we build a convergence analysis characterizing the optimization process of PAC-Bayes learning. Thanks to the explicit solution obtained from the optimization analysis, we further formulate the generalization bound of PAC-Bayesian learning for the first time, and demonstrate its advantage by comparing it with theoretical generalization bounds of learning frameworks with non-stochastic neural networks (Arora et al., 2019a; Cao & Gu, 2019; Hu et al., 2019). We summarize our contributions as follows:

  • With a detailed characterization of gradient descent training of the PAC-Bayes objective function, we conclude that the final solution is kernel ridge regression with the PNTK as its kernel.

  • Based on the optimization solution, we derive a PAC-Bayesian bound for deep networks and find that it improves over the Rademacher complexity bound for non-probabilistic neural networks under a fair comparison.

  • PAC-Bayesian learning depends on a large number of hyperparameters. We design a training-free proxy based on our theoretical bound and show it is effective and time-saving.

  • Technically, our proof of convergence is similar to (Du et al., 2018; Arora et al., 2019a), but there are essential differences. The main one is that our network architecture is more complex (e.g., a probabilistic network contains two sets of parameters), and each set involves its own randomness, which requires bounding many terms differently and more elaborately. Given the particularly wide range of application scenarios of probabilistic neural networks, such as the Variational Auto-Encoder (Pu et al., 2016) and Deep Bayesian Networks (Nie et al., 2018), we hope our techniques can provide a basis for the analysis of wide probabilistic neural networks.

2 Related Work

PAC-Bayesian Analysis.

The Probably Approximately Correct (PAC) Bayes framework (McAllester, 1999a, b) can incorporate knowledge about the learning algorithm and a probability distribution over a set of hypotheses, thus providing a test performance (generalization) guarantee. Subsequently, the PAC-Bayesian method was adopted to analyze the generalization bound of probabilistic neural networks (Langford & Caruana, 2002b). The original PAC-Bayes theory only works with bounded loss functions; Haddouche et al. (2021) extend the PAC-Bayesian theory to learning problems with unbounded losses. In addition, several improved PAC-Bayesian bounds suitable for different scenarios are introduced by (Bégin et al., 2014, 2016). As a result of the flexibility and generalization properties of PAC-Bayes, it is widely used to analyze complex, non-convex, and over-parameterized optimization problems, especially over-parameterized neural networks (Guedj, 2019). Neyshabur et al. (2017b) present a generalization bound for feedforward neural networks with ReLU activations in terms of the product of the spectral norms of the layers and the Frobenius norms of the weights. London (2017) studies the generalization error of randomized learning algorithms, focusing on stochastic gradient descent (SGD), using a novel combination of PAC-Bayes and algorithmic stability.

PAC-Bayesian Learning.

Apart from providing theoretical analyses of the generalization properties of deep learning, PAC-Bayes has received increasing attention as a means to afford numerical generalization bounds (certificates) for practical deep learning algorithms. Langford & Caruana (2002a) first introduced a method to train a Bayesian neural network and used a refined PAC-Bayesian bound to compute an error upper bound. Later, Neyshabur et al. (2017a) extended this work by developing a training objective function derived from a relaxed PAC-Bayesian bound. In the standard application of PAC-Bayes, the prior is typically chosen to be a spherical Gaussian centered at the origin. However, without incorporating information from the data, the KL divergence might be unreasonably large, limiting the performance of the PAC-Bayes method. To fill this gap, a large body of literature proposes to obtain localized PAC-Bayes bounds via distribution-dependent priors learned from data (Ambroladze et al., 2007; Negrea et al., 2019; Dziugaite et al., 2020; Perez-Ortiz et al., 2021). Furthermore, Dziugaite & Roy (2018b) show how a differentially private data-dependent prior yields a valid PAC-Bayes bound when the data distribution is presumed unknown. More recently, research has focused on providing PAC-Bayesian bounds for more realistic architectures, such as ImageNet-scale networks (Zhou et al., 2018), binary activated multilayer networks (Letarte et al., 2019), partially aggregated neural networks (Biggs & Guedj, 2020), and graph neural networks (Liao et al., 2020). We denote the practical use of PAC-Bayesian algorithms to train over-parameterized neural networks as PAC-Bayesian learning; the target of this work is to demystify the success behind deep learning trained via PAC-Bayesian bounds.

Neural Tangent Kernel.

One of the greatest recent advances in deep learning theory is the proof that gradient descent can find global minima of deep neural networks thanks to over-parameterization (Du et al., 2019a, 2018; Allen-Zhu et al., 2019). Technically, Jacot et al. (2018) resort to the NTK, which stays constant during training. Under the mild assumption that the lowest eigenvalue of the NTK is greater than zero, any loss that is convex with respect to the output converges to the global minimum. Furthermore, the generalization ability of trained ultra-wide networks on unseen data has been characterized via Rademacher complexity analysis (Cao & Gu, 2019; Arora et al., 2019a). In this work, we compare the Rademacher bound and the PAC-Bayesian bound for ultra-wide networks, where the Rademacher bound of kernel ridge regression is formulated by Hu et al. (2019). Since its introduction, the NTK has been applied to different deep network structures, such as orthogonal initialization (Huang et al., 2020), convolutional networks (Arora et al., 2019b), graph neural networks (Du et al., 2019b; Huang et al., 2021), and transformers (Hron et al., 2020).

3 Preliminary

3.1 Notation

We use bold-faced letters for vectors and matrices, and regular letters for scalars. We use $\|\cdot\|_{2}$ to denote the Euclidean norm of a vector or the spectral norm of a matrix, and $\|\cdot\|_{F}$ the Frobenius norm of a matrix. For a neural network, $\phi(x)$ denotes the activation function; we adopt the ReLU activation in this work. Given a logical variable $a$, $\mathbb{I}(a)=1$ if $a$ is true and $\mathbb{I}(a)=0$ otherwise. Let ${\bf I}_{d}$ be the identity matrix of dimension $d\times d$. We denote $[n]=\{1,2,\ldots,n\}$. The least eigenvalue of a matrix ${\bf A}$ is denoted $\lambda_{0}({\bf A})=\lambda_{\min}({\bf A})$.

3.2 Over-parameterized Probabilistic Neural Network

In PAC-Bayesian learning we use probabilistic neural networks instead of deterministic networks, where the weights always follow a certain distribution. In this work, we adopt the Gaussian distribution for the weights, and define a two-layer probabilistic neural network governed by the following expression,

$$f({\bf x})=\frac{1}{\sqrt{m}}{\bf v}\phi({\bf W}{\bf x})\tag{1}$$

where ${\bf x}\in\mathbb{R}^{d}$ is the input, ${\bf W}\in\mathbb{R}^{m\times d}$ is the weight matrix of the first layer, and ${\bf v}\in\mathbb{R}^{m}$ is the weight vector of the second layer. For simplicity, we represent the weight matrix as a set of column vectors, i.e., ${\bf W}=({\bf w}_{1},{\bf w}_{2},\dots,{\bf w}_{m})^{\top}$. In particular, we use the re-parametrization trick for the weights in the first layer and initialize the weights in the second layer from a uniform distribution,

$${\bf w}_{r}=\boldsymbol{\mu}_{r}+\boldsymbol{\sigma}_{r}\odot\boldsymbol{\epsilon}_{r},\quad\boldsymbol{\epsilon}_{r}\sim\mathcal{N}({\bf 0},{\bf I}_{d});\qquad v_{r}\sim{\rm unif}(\{-1,1\})\tag{2}$$

where $\odot$ denotes the element-wise product, and $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are the mean and variance parameters, respectively. We then fix the second layer ${\bf v}$ and only optimize the first layer. This trick has been widely used in NTK papers (Arora et al., 2019a; Du et al., 2018) to simplify the analysis; the same results can be obtained for deep probabilistic neural networks.
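To make the setup concrete, the following is a minimal NumPy sketch of the network in Equations (1) and (2); the sizes and initialization constants are illustrative assumptions, not values prescribed by the analysis:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 512, 10                               # width and input dimension (illustrative)

# Illustrative initialization: mu_ri ~ N(0, sigma_hat), sigma_ri = sigma_0
sigma_hat, sigma_0 = 1.0, 0.1
mu = rng.normal(0.0, sigma_hat, size=(m, d))  # mean parameters of the first layer
sigma = np.full((m, d), sigma_0)              # "variance" parameters of the first layer
v = rng.choice([-1.0, 1.0], size=m)           # second layer, fixed after initialization

def forward(x, eps):
    """f(x) = (1/sqrt(m)) v . ReLU(W x) with W = mu + sigma * eps, Eqs. (1)-(2)."""
    W = mu + sigma * eps                      # reparameterization trick, eps ~ N(0, I)
    return v @ np.maximum(W @ x, 0.0) / np.sqrt(m)

x = rng.standard_normal(d)
eps = rng.standard_normal((m, d))
print(forward(x, eps))
```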

3.3 The PAC-Bayesian Learning

Suppose the data $S=\{({\bf x}_{i},y_{i})\}_{i=1}^{n}$ are i.i.d. samples from a non-degenerate distribution $\mathcal{D}$. Let $\mathcal{H}$ denote the hypothesis space, and let $h({\bf x})$ be the prediction of hypothesis $h\in\mathcal{H}$ on sample ${\bf x}$. Then $R_{\mathcal{D}}(h)=\mathbb{E}_{({\bf x},y)\sim\mathcal{D}}[\ell(y,h({\bf x}))]$ is the generalization error of classifier $h$ and $\widehat{R}_{S}(h)=\frac{1}{n}\sum_{i=1}^{n}\ell(y_{i},h({\bf x}_{i}))$ is its empirical error, where $\ell(\cdot)$ is the loss function.

The PAC-Bayes theorem (Langford & Seeger, 2001; Seeger, 2002; Maurer, 2004) states the following.

Theorem 3.1.

Let $Q_{0}$ be some prior distribution over $\mathcal{H}$. Then for any $\delta\in(0,1]$, the following inequality holds uniformly for all posterior distributions $Q$ over $\mathcal{H}$ with probability at least $1-\delta$,

$$\mathrm{kl}\big(\widehat{R}_{S}(Q)\,\big\|\,R_{\mathcal{D}}(Q)\big)\leq\frac{\mathrm{KL}(Q\|Q_{0})+\log\frac{2\sqrt{n}}{\delta}}{n}.\tag{3}$$

Here $R_{\mathcal{D}}(Q)=\mathbb{E}_{({\bf x},y)\sim\mathcal{D},h\sim Q}[\ell(y,h({\bf x}))]=\mathbb{E}_{h\sim Q}[R_{\mathcal{D}}(h)]$ and $\widehat{R}_{S}(Q)=\mathbb{E}_{h\sim Q}[\widehat{R}_{S}(h)]$. $\mathrm{KL}(Q\|Q_{0})=\mathbb{E}_{Q}\big[\ln\frac{Q}{Q_{0}}\big]$ is the Kullback-Leibler (KL) divergence and $\mathrm{kl}(q\|q^{\prime})=q\log(\frac{q}{q^{\prime}})+(1-q)\log(\frac{1-q}{1-q^{\prime}})$ is the binary KL divergence. Furthermore, combining with Pinsker's inequality for the binary KL divergence, $\mathrm{kl}(\hat{p}\|p)\geq(p-\hat{p})^{2}/(2p)$ when $\hat{p}<p$, yields,

$$R_{\mathcal{D}}(Q)-\widehat{R}_{S}(Q)\leq\sqrt{2R_{\mathcal{D}}(Q)\,\frac{\mathrm{KL}(Q\|Q_{0})+\log\frac{2\sqrt{n}}{\delta}}{n}}.\tag{4}$$
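In practice, the risk certificate implied by Theorem 3.1 is typically computed by numerically inverting the binary KL rather than via the relaxation above. A minimal self-contained sketch (not the implementation of any cited work) using bisection:

```python
import math

def binary_kl(q, p):
    """Binary KL divergence kl(q || p) between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def kl_inverse(q_hat, c, tol=1e-9):
    """Largest p >= q_hat with kl(q_hat || p) <= c, found by bisection.
    With c = (KL(Q||Q0) + log(2 sqrt(n)/delta)) / n, Theorem 3.1 says this
    upper bounds the population risk R_D(Q)."""
    lo, hi = q_hat, 1.0 - 1e-12
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if binary_kl(q_hat, mid) <= c:
            lo = mid
        else:
            hi = mid
    return lo

# illustrative numbers: empirical risk 0.05, n = 10000, KL = 50, delta = 0.05
n, kl_term, delta = 10000, 50.0, 0.05
c = (kl_term + math.log(2 * math.sqrt(n) / delta)) / n
print(kl_inverse(0.05, c))
```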

The above bound is the classical one. On the other hand, combining with the inequality $\sqrt{ab}\leq\frac{1}{2}(\bar{\lambda}a+\frac{b}{\bar{\lambda}})$, valid for all $\bar{\lambda}>0$, leads to the PAC-Bayes-$\lambda$ bound proposed by Thiemann et al. (2017),

Theorem 3.2.

Let $Q_{0}$ be some prior distribution over $\mathcal{H}$. Then for any $\delta\in(0,1]$, the following inequality holds uniformly for all posterior distributions $Q$ over $\mathcal{H}$ with probability at least $1-\delta$,

$$R_{\mathcal{D}}(Q)\leq\frac{\widehat{R}_{S}(Q)}{1-\bar{\lambda}/2}+\frac{\mathrm{KL}(Q\|Q_{0})+\log\frac{2\sqrt{n}}{\delta}}{n\bar{\lambda}(1-\bar{\lambda}/2)}.\tag{5}$$
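The right-hand side of (5) is straightforward to evaluate for a given $\bar{\lambda}\in(0,2)$; a small sketch with illustrative numbers:

```python
import math

def pac_bayes_lambda_bound(emp_risk, kl, n, delta=0.05, lam_bar=1.0):
    """PAC-Bayes-lambda bound of Theorem 3.2 for a fixed lam_bar in (0, 2)."""
    comp = (kl + math.log(2 * math.sqrt(n) / delta)) / n
    return emp_risk / (1 - lam_bar / 2) + comp / (lam_bar * (1 - lam_bar / 2))

print(pac_bayes_lambda_bound(emp_risk=0.05, kl=50.0, n=10000))
```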

In this work, we study PAC-Bayesian learning by training the probabilistic network described by Equation (1) with an objective derived from the PAC-Bayes bound (5). Specifically, the objective function can be expressed as,

$$\mathcal{B}(Q)=\widehat{R}_{S}(Q)+\lambda\frac{\mathrm{KL}(Q\|Q_{0})}{n}\tag{6}$$

where $\lambda$ is a hyperparameter introduced in a heuristic manner to make the method more flexible, playing the role of $1/\bar{\lambda}$ in (5). Since the term involving $\delta$ is a constant, we omit it from the objective function. Instead of optimizing ${\bf W}$ directly, we use the reparameterization trick (Kingma et al., 2015) to perform gradient descent,

$$\boldsymbol{\mu}_{r}(t+1)=\boldsymbol{\mu}_{r}(t)-\eta\frac{\partial\mathcal{B}}{\partial\boldsymbol{\mu}_{r}(t)},\qquad\boldsymbol{\sigma}_{r}(t+1)=\boldsymbol{\sigma}_{r}(t)-\eta\frac{\partial\mathcal{B}}{\partial\boldsymbol{\sigma}_{r}(t)}\tag{7}$$

where $\eta$ is the learning rate.
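Putting Equations (1), (2), (6), and (7) together, a minimal PyTorch sketch of one such training loop might look as follows, using a single Monte Carlo sample of $\widehat{R}_{S}(Q)$ per step and the standard diagonal-Gaussian KL (cf. Equation (16) below); all sizes and constants are illustrative assumptions:

```python
import torch

torch.manual_seed(0)
m, d, n, lam, eta = 512, 10, 128, 0.1, 0.1            # illustrative sizes and constants
X, Y = torch.randn(n, d), torch.randn(n)

mu = torch.randn(m, d, requires_grad=True)            # posterior means
sigma = torch.full((m, d), 0.1, requires_grad=True)   # posterior scales ("variance" weights)
mu0, sigma0 = mu.detach().clone(), sigma.detach().clone()  # prior Q0 = distribution at init
v = torch.randint(0, 2, (m,)).float() * 2 - 1         # fixed +/-1 second layer

def objective():
    eps = torch.randn(m, d)                           # reparameterization noise
    W = mu + sigma * eps
    f = torch.relu(X @ W.T) @ v / m ** 0.5            # Eq. (1); one MC sample of R_S(Q)
    mse = 0.5 * ((f - Y) ** 2).mean()
    kl = (torch.log(sigma0 / sigma)                   # diagonal-Gaussian KL(Q || Q0)
          + (sigma ** 2 + (mu - mu0) ** 2) / (2 * sigma0 ** 2) - 0.5).sum()
    return mse + lam * kl / n                         # objective B(Q) of Eq. (6)

for _ in range(100):                                  # gradient descent of Eq. (7)
    loss = objective()
    g_mu, g_sigma = torch.autograd.grad(loss, (mu, sigma))
    with torch.no_grad():
        mu -= eta * g_mu
        sigma -= eta * g_sigma
print(float(loss))
```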

3.4 Neural Tangent Kernel

The neural tangent kernel originates from Jacot et al. (2018) and is defined by,

$$\boldsymbol{\Theta}({\bf X},{\bf X},t)=\nabla_{\boldsymbol{\theta}}f({\bf X},t)\nabla_{\boldsymbol{\theta}}f({\bf X},t)^{\top}\tag{8}$$

where $\boldsymbol{\theta}\in\mathbb{R}^{p\times 1}$, with $p$ being the number of parameters in the model. We now demonstrate how it is used to describe the training dynamics of deep neural networks.

The dynamics of gradient flow for parameters are given by,

$$\frac{\partial\boldsymbol{\theta}}{\partial t}=-\eta\nabla_{\boldsymbol{\theta}}\widehat{R}_{S}=-\eta\nabla_{\boldsymbol{\theta}}f({\bf X},t)^{\top}\nabla_{f({\bf X},t)}\widehat{R}_{S}\tag{9}$$

Then the dynamics of the output function $f({\bf X})={\rm vec}(f({\bf x})_{{\bf x}\in{\bf X}})\in\mathbb{R}^{n\times 1}$ follow,

$$\frac{\partial f({\bf X},t)}{\partial t}=\nabla_{\boldsymbol{\theta}}f({\bf X},t)\frac{\partial\boldsymbol{\theta}}{\partial t}=-\eta\boldsymbol{\Theta}({\bf X},{\bf X},t)\nabla_{f({\bf X},t)}\widehat{R}_{S}\tag{10}$$

It is proven that when the width of the neural network is infinite, $\boldsymbol{\Theta}({\bf X},{\bf X},t)$ converges in probability to a fixed kernel $\boldsymbol{\Theta}({\bf X},{\bf X})$ (Jacot et al., 2018), i.e., $\boldsymbol{\Theta}({\bf X},{\bf X},t)=\boldsymbol{\Theta}({\bf X},{\bf X})$. Assuming the loss is the mean squared error, $\widehat{R}_{S}=\frac{1}{2}\|f({\bf X})-{\bf Y}\|^{2}_{2}$, we then have,

$$f_{t}({\bf x})=f_{0}({\bf x})+\boldsymbol{\Theta}({\bf x},{\bf X})\boldsymbol{\Theta}^{-1}\big({\bf I}-e^{-\eta\boldsymbol{\Theta}t}\big)\big({\bf Y}-f_{0}({\bf X})\big)\tag{11}$$

At the end of training, namely $t\rightarrow\infty$, the solution is equivalent to the well-known kernel regression, of which the kernel is the NTK.
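For reference, Equation (11) is easy to evaluate once the kernel blocks are available; a small NumPy/SciPy sketch under the assumption that $\boldsymbol{\Theta}$ is precomputed and positive definite (all data are illustrative):

```python
import numpy as np
from scipy.linalg import expm

def ntk_prediction(Theta_xX, Theta_XX, f0_x, f0_X, Y, eta, t):
    """Closed-form output at time t under constant-NTK gradient flow, Eq. (11):
    f_t(x) = f_0(x) + Theta(x,X) Theta^{-1} (I - exp(-eta*Theta*t)) (Y - f_0(X))."""
    n = Theta_XX.shape[0]
    decay = np.eye(n) - expm(-eta * t * Theta_XX)
    return f0_x + Theta_xX @ np.linalg.solve(Theta_XX, decay @ (Y - f0_X))

rng = np.random.default_rng(0)
A = rng.standard_normal((16, 16))
Theta_XX = A @ A.T + 16 * np.eye(16)          # a stand-in positive-definite kernel
Theta_xX = rng.standard_normal((2, 16))
Y, f0_X, f0_x = rng.standard_normal(16), np.zeros(16), np.zeros(2)
print(ntk_prediction(Theta_xX, Theta_XX, f0_x, f0_X, Y, eta=1.0, t=1e6))  # ~ kernel regression
```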

4 Theoretical Results

4.1 Optimization analysis

To simplify the analysis, we first consider the optimization of probabilistic neural networks of the form (1) with the objective $\widehat{R}_{S}(Q)$ and the mean squared error loss. In other words, we neglect the KL divergence term at this stage and will show that the corresponding results extend to the objective with the KL divergence. To perform the optimization analysis, we define the neural tangent kernel for probabilistic neural networks, which we name the probabilistic neural tangent kernel (PNTK).

Definition 4.1 (Probabilistic Neural Tangent Kernel).

The tangent kernels associated with the output function $f({\bf x},t)$ at parameters $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ are defined as,

$$\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},t)=\nabla_{\boldsymbol{\mu}}f({\bf X},t)\nabla_{\boldsymbol{\mu}}f({\bf X},t)^{\top}\in\mathbb{R}^{n\times n},\qquad\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},t)=\nabla_{\boldsymbol{\sigma}}f({\bf X},t)\nabla_{\boldsymbol{\sigma}}f({\bf X},t)^{\top}\in\mathbb{R}^{n\times n}\tag{12}$$

Different from a standard (deterministic) neural network, the probabilistic network consists of two sets of parameters, $\boldsymbol{\mu}_{r}$ and $\boldsymbol{\sigma}_{r}$ with $r\in[m]$. Gradient descent based on the reparameterization trick (Equation 7) is performed on each of the two parameter sets separately. As a result, we have two corresponding tangent kernels. It has been shown that the NTK converges in probability to a deterministic kernel in the infinite-width limit (Jacot et al., 2018); as we show in later derivations, the PNTK converges to a limiting kernel at initialization and during training under a large-width condition. The detailed expression for the limiting kernel is given as follows.

Definition 4.2 (Limiting Neural Tangent Kernel with ReLU Activation).

Consider a probabilistic network of the form (1) with Gaussian distributed weights ${\bf w}$. Then the tangent kernels in the infinite-width limit follow the expression,

$$\boldsymbol{\Theta}^{\infty}_{ij}({\bf X},{\bf X})=\mathbb{E}_{{\bf w}}\big[{\bf x}^{\top}_{i}{\bf x}_{j}\,\mathbb{I}\{{\bf w}^{\top}{\bf x}_{i}\geq 0,\,{\bf w}^{\top}{\bf x}_{j}\geq 0\}\big]\tag{13}$$

The definition above gives the exact formulation of the limiting NTK for random networks. In addition, its positive definiteness is a critical factor in guaranteeing the convergence of gradient descent; thus we state the following assumption.

Assumption 4.3.

Assume the lowest eigenvalue of the limiting NTK is greater than zero, i.e., $\lambda_{0}(\boldsymbol{\Theta}^{\infty})>0$.
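This assumption can be checked directly: for a zero-mean isotropic Gaussian ${\bf w}$ (as under the initialization of Theorem 4.4 below), the expectation in Equation (13) has the known closed form $\mathbb{E}[\mathbb{I}\{\cdot\}]=(\pi-\theta_{ij})/(2\pi)$, where $\theta_{ij}$ is the angle between ${\bf x}_{i}$ and ${\bf x}_{j}$. A minimal NumPy sketch (names illustrative):

```python
import numpy as np

def limiting_ntk(X):
    """Closed form of Eq. (13), assuming w is a zero-mean isotropic Gaussian:
    for ReLU, E[ I{w.xi>=0, w.xj>=0} ] = (pi - angle(xi, xj)) / (2*pi)."""
    G = X @ X.T                                   # Gram matrix x_i . x_j
    norms = np.sqrt(np.diag(G))
    cos = np.clip(G / np.outer(norms, norms), -1.0, 1.0)
    theta = np.arccos(cos)                        # pairwise angles
    return G * (np.pi - theta) / (2 * np.pi)

X = np.random.default_rng(0).standard_normal((8, 5))
Theta_inf = limiting_ntk(X)
print(np.linalg.eigvalsh(Theta_inf).min())        # lambda_0 in Assumption 4.3
```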

With all the preliminaries above, we state our main theorem,

Theorem 4.4 (Convergence of Probabilistic Networks with large width).

Suppose the network's width is $m=\Omega\big(\frac{n^{6}}{\lambda_{0}^{4}\delta^{3}(\sigma_{0}^{2}+\hat{\sigma}^{2})}\big)$ under the initialization below. Then with probability at least $1-\delta$ over the initialization, we have,

$$\widehat{R}_{S}(Q,t)\leq\exp(-\lambda_{0}t)\,\widehat{R}_{S}(Q,0)\tag{14}$$

where we adopt the initialization $\mu_{ri}\sim\mathcal{N}(0,\hat{\sigma})$ and $\sigma_{ri}=\sigma_{0}$ for $i\in[d]$. Here $\mu_{ri}$ and $\sigma_{ri}$ are entries of the weight matrices $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$, and $\hat{\sigma}$ and $\sigma_{0}$ are positive constants.

Our theorem establishes that if $m$ is large enough, the expected training error converges to zero at a linear rate. In particular, the least eigenvalue of the PNTK governs the convergence rate. To prove the theorem, we split the analysis into three parts: the PNTK at initialization, the PNTK during training, and the change of the weights during training. All detailed proofs of our theorems and lemmas can be found in the Appendix.

We first study the behavior of the tangent kernels at initialization under the ultra-wide condition, namely $m={\rm poly}(n,1/\lambda_{0},1/\delta)$. The following lemma demonstrates that if $m$ is large, then $\boldsymbol{\Theta}^{\mu}(0)$ and $\boldsymbol{\Theta}^{\sigma}(0)$ have a lower-bounded smallest eigenvalue with high probability. The proof is by a standard concentration bound.

Lemma 4.5 (PNTK at initialization).

If $m=\Omega\big(\frac{n^{2}\log(n^{2}/\delta)}{\lambda^{2}_{0}}\big)$ and $\boldsymbol{\mu}_{r}$, $\boldsymbol{\sigma}_{r}$ are initialized as in Theorem 4.4, then with probability at least $1-\delta$ we have $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},0)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$ and $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},0)\big\|_{2}\geq\frac{3\lambda_{0}}{4}$, and likewise $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},0)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$ and $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},0)\big\|_{2}\geq\frac{3\lambda_{0}}{4}$.

Remark 4.6.

The concentration bound is over the randomness of the initialization of $\boldsymbol{\mu}_{r}(0)$ and the randomness of the noise $\boldsymbol{\epsilon}_{r}$.

From the above lemma, we see that both tangent kernels, with respect to $\boldsymbol{\mu}_{r}$ and $\boldsymbol{\sigma}_{r}$, are equivalent in the infinite-width limit. This holds in general as long as the reparametrization trick is adopted for training probabilistic neural networks. The next issue is that the PNTKs are time-dependent matrices and thus vary during gradient descent training. To handle this, we establish a lemma stating that if the weights ${\bf w}(t)$ stay close to ${\bf w}(0)$ during gradient descent training, then the corresponding PNTKs $\boldsymbol{\Theta}^{\mu}(t)$ and $\boldsymbol{\Theta}^{\sigma}(t)$ are close to $\boldsymbol{\Theta}^{\infty}$ and their smallest eigenvalues are lower bounded. Here we introduce ${\bf w}_{r}(t)\equiv\boldsymbol{\mu}_{r}(t)+\boldsymbol{\sigma}_{r}(t)\odot\boldsymbol{\epsilon}_{r}(0)$, where $\boldsymbol{\epsilon}_{r}(0)$ copies the sample value from that in ${\bf w}_{r}(0)$. Note that ${\bf w}_{r}(t)$ in our definition is not the exact weight at training step $t$; we introduce it to facilitate the proof.

Lemma 4.7 (PNTK during training).

If $\boldsymbol{\mu}_{r}$ and $\boldsymbol{\sigma}_{r}$ are initialized as in Theorem 4.4, then for any set of weight vectors ${\bf w}_{1},\ldots,{\bf w}_{m}$ satisfying $\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|\leq\frac{c\lambda_{0}\delta\sqrt{\sigma_{0}^{2}+\hat{\sigma}^{2}}}{n^{2}}\equiv R$ for all $r\in[m]$, where $c$ is a constant, with probability at least $1-\delta$ we have $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},t)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$ and $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},t)\big\|_{2}\geq\frac{\lambda_{0}}{2}$, and likewise $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},t)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$ and $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},t)\big\|_{2}\geq\frac{\lambda_{0}}{2}$.

The above lemma demonstrates that if the change of the weights is bounded, then the tangent kernel matrices are close to their expectation. The next lemma shows that the change of the weights is indeed bounded during gradient descent training when the PNTK is close to the limiting kernel in Equation (13).

Lemma 4.8 (Change of weights during training).

Suppose $\lambda_{\min}(\boldsymbol{\Theta}(t))\geq\frac{\lambda_{0}}{2}$ for $0<t<T$. Then we have $\widehat{R}_{S}(Q,t)\leq\exp(-\lambda_{0}t)\widehat{R}_{S}(Q,0)$, and with probability at least $1-\delta$ over the initialization, $\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|_{2}\leq\frac{2n}{\sqrt{m\delta}\lambda_{0}}\equiv R^{\prime}$.

The remaining gap between the PNTK and the final result for the output function can be filled by writing down the gradient flow dynamics,

$$\frac{df({\bf x}_{i},t)}{dt}=\sum_{j=1}^{n}({\bf y}_{j}-f({\bf x}_{j},t))(\boldsymbol{\Theta}^{\mu}_{ij}+\boldsymbol{\Theta}^{\sigma}_{ij})\tag{15}$$

Combined with the fact that the PNTKs are bounded during training, the behavior of the loss is therefore traceable. To finalize the proof of Theorem 4.4, we construct a final lemma.

Lemma 4.9.

If $R^{\prime}<R$, we have $\widehat{R}_{S}(Q,t)\leq\exp(-\lambda_{0}t)\widehat{R}_{S}(Q,0)$.

4.2 Training with KL divergence

According to Equation (6), there is a KL divergence term in the objective function. We expand the KL divergence for the two Gaussian distributions $p(w(t))=\mathcal{N}(\mu_{t},\sigma_{t}^{2})$ and $p(w(0))=\mathcal{N}(\mu_{0},\sigma_{0}^{2})$,

$$\mathrm{KL}=\log\frac{\sigma_{0}}{\sigma_{t}}+\frac{\sigma_{t}^{2}+(\mu_{t}-\mu_{0})^{2}}{2\sigma^{2}_{0}}-\frac{1}{2}\tag{16}$$
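As a sanity check, this closed form can be verified by Monte Carlo; a short sketch with illustrative values:

```python
import numpy as np

def kl_gaussian(mu_t, sigma_t, mu_0, sigma_0):
    """KL( N(mu_t, sigma_t^2) || N(mu_0, sigma_0^2) ), cf. Eq. (16)."""
    return (np.log(sigma_0 / sigma_t)
            + (sigma_t**2 + (mu_t - mu_0)**2) / (2 * sigma_0**2) - 0.5)

rng = np.random.default_rng(0)
mu_t, sigma_t, mu_0, sigma_0 = 0.3, 0.5, 0.0, 1.0
w = rng.normal(mu_t, sigma_t, size=1_000_000)         # samples from the posterior
log_ratio = (np.log(sigma_0 / sigma_t)                # log q(w)/p(w) at the samples
             + (w - mu_0)**2 / (2 * sigma_0**2) - (w - mu_t)**2 / (2 * sigma_t**2))
print(kl_gaussian(mu_t, sigma_t, mu_0, sigma_0), log_ratio.mean())  # should agree
```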

We compare the gradients of the mean weights $\boldsymbol{\mu}$ and the variance weights $\boldsymbol{\sigma}$. A direct calculation gives $\frac{\partial f({\bf x}_{i})}{\partial\boldsymbol{\mu}_{r}}=\frac{1}{\sqrt{m}}v_{r}\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0){\bf x}_{i}$ and $\frac{\partial f({\bf x}_{i})}{\partial\boldsymbol{\sigma}_{r}}=\frac{1}{\sqrt{m}}v_{r}\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0){\bf x}_{i}\odot\boldsymbol{\epsilon}_{r}$. The gradient with respect to the variance weights carries one extra random variable, $\boldsymbol{\epsilon}_{r}$, which drives its expected value to zero. To verify this result, we conduct an experiment and report the results in Appendix E.2. Therefore, it is approximately equivalent to fixing $\boldsymbol{\sigma}$ during gradient descent training. With this assumption we arrive at the following conclusion:

Theorem 4.10.

Consider gradient descent on the objective (6) with the initialization stated in Theorem 4.4. Suppose $m\geq{\rm poly}(n,1/\lambda_{0},1/\delta,1/\mathcal{E})$. Then with probability at least $1-\delta$ over the random initialization, we have

$$\widehat{f}({\bf x},\infty)=\boldsymbol{\Theta}^{\infty}({\bf x},{\bf X})\big(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I}\big)^{-1}{\bf Y}\pm\mathcal{E}\tag{17}$$

where $\widehat{f}({\bf x},t)=\mathbb{E}_{f\sim Q}f({\bf x},t)$ aligns with the definition of the empirical loss function.

The theorem above reveals the regularization effect of the KL term in PAC-Bayesian learning, and presents an explicit expression for the convergence result of the output function.
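A compact sketch of the resulting predictor, combining the closed form of the limiting kernel (Definition 4.2, for zero-mean isotropic Gaussian weights) with the kernel ridge regression solution of Theorem 4.10 and ignoring the $\pm\mathcal{E}$ term; all names and data are illustrative:

```python
import numpy as np

def relu_ntk(A, B):
    """Limiting kernel of Eq. (13) between row-sets A and B (cf. Definition 4.2)."""
    G = A @ B.T
    na, nb = np.linalg.norm(A, axis=1), np.linalg.norm(B, axis=1)
    theta = np.arccos(np.clip(G / np.outer(na, nb), -1.0, 1.0))
    return G * (np.pi - theta) / (2 * np.pi)

def pac_bayes_predictor(x, X, Y, lam, sigma0):
    """Infinite-time expected output of Theorem 4.10 (up to the error term E):
    kernel ridge regression with the NTK and ridge lam / sigma0**2."""
    n = X.shape[0]
    K = relu_ntk(X, X) + lam / sigma0**2 * np.eye(n)
    return relu_ntk(x, X) @ np.linalg.solve(K, Y)

rng = np.random.default_rng(0)
X, Y = rng.standard_normal((32, 5)), rng.standard_normal(32)
print(pac_bayes_predictor(rng.standard_normal((3, 5)), X, Y, lam=0.1, sigma0=1.0))
```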

4.3 Generalization analysis

Recall that Theorem 3.2 gives the PAC-Bayesian bound with respect to the distributions at initialization and after optimization. Therefore, combining it with the result of Theorem 4.10, we can directly provide a generalization bound for PAC-Bayesian learning with ultra-wide networks.

Theorem 4.11 (PAC-Bayesian bound with NTK).

Suppose the data $S=\{({\bf x}_{i},y_{i})\}_{i=1}^{n}$ are i.i.d. samples from a non-degenerate distribution $\mathcal{D}$, and $m\geq{\rm poly}(n,\lambda_{0}^{-1},\delta^{-1})$. Consider any loss function $\ell:\mathbb{R}\times\mathbb{R}\rightarrow[0,1]$. Then with probability at least $1-\delta$ over the random initialization and the training samples, the probabilistic network trained by gradient descent for $T\geq\Omega(\frac{1}{\eta\lambda_{0}}\log\frac{n}{\delta})$ iterations has population risk $R_{\mathcal{D}}(Q)$ bounded as follows:

$$R_{\mathcal{D}}\leq\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-1}{\bf Y}}{n\sigma^{2}_{0}}+\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-2}{\bf Y}}{n}}+O\bigg(\frac{\log\frac{2\sqrt{n}}{\delta}}{n}\bigg).\tag{18}$$

This theorem establishes a generalization bound for the PAC-Bayesian learning framework, thus providing a theoretical guarantee. We further demonstrate the advantage of PAC-Bayesian learning by comparing it with the Rademacher complexity-based generalization bound for a deterministic neural network with the kernel ridge regression solution.

Theorem 4.12 (Rademacher bound with NTK).

Suppose the data $S=\{({\bf x}_{i},y_{i})\}_{i=1}^{n}$ are i.i.d. samples from a non-degenerate distribution $\mathcal{D}$, and $m\geq{\rm poly}(n,\lambda_{0}^{-1},\delta^{-1})$. Consider any loss function $\ell:\mathbb{R}\times\mathbb{R}\rightarrow[0,1]$. Then with probability at least $1-\delta$ over the random initialization and the training samples, the network trained by gradient descent for $T\geq\Omega(\frac{1}{\eta\lambda_{0}}\log\frac{n}{\delta})$ iterations has population risk $R_{\mathcal{D}}$ bounded as follows:

$$R_{\mathcal{D}}\leq\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-1}{\bf Y}}{n}}+\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-2}{\bf Y}}{n}}+O\bigg(\sqrt{\frac{\log\frac{n}{\lambda_{0}\delta}}{n}}\bigg).\tag{19}$$

Theorem 4.12 is obtained by following Theorem 5.1 of Hu et al. (2019), which presents a Rademacher complexity-based generalization bound for ultra-wide neural networks with the kernel ridge regression solution. Similar analyses for kernel regression with the NTK can be found in (Arora et al., 2019a; Cao & Gu, 2019).

The main difference between the two generalization bounds is the term $\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-1}{\bf Y}}{n}$ versus $\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-1}{\bf Y}}{n}}$, which arises because the PAC-Bayesian bound counts the KL divergence while the Rademacher bound computes the reproducing kernel Hilbert space (RKHS) norm. The rates of convergence therefore differ: one is $O(1/n)$ and the other is $O(1/\sqrt{n})$. We conclude that the PAC-Bayesian bound improves over the Rademacher complexity-based bound.

Figure 1: Relative Frobenius norm change in $\mu$ and $\rho$ (panels (a) and (b), respectively) during training with the MSE loss derived from the classic PAC-Bayesian bound, where $m$ is the width of the network.

5 Experiments

In this section, we first provide empirical support showing that the training dynamics of wide probabilistic neural networks, with training objectives derived from a PAC-Bayes bound, are captured by the PNTK, which validates our Lemma 4.8. Second, as an extension of our PAC-Bayesian bound in Theorem D.1, we provide a training-free metric that approximates the PAC-Bayesian bound via the NTK; it can be used to select the best hyperparameters without any training, eliminating excessive computation time.

5.1 Experimental setup

In all experiments, the NTK parameterization of Equation (1) is used for initializing parameters. Specifically, the initial mean of the weights, $\mu_{0}$, is sampled from a truncated Gaussian distribution with mean 0 and standard deviation 1, truncated at 2 standard deviations. To ensure that the variance is positive, the initial variance of the weights is obtained from a given value $\rho_{0}$ through the formula $\sigma_{0}=\log(1+\exp(\rho_{0}))$.

For the experiment in Section 5.2, we consider a ReLU fully-connected network with three hidden layers, whose training objective is derived from the PAC-Bayesian lambda bound in Equation (5), using an ordinary MSE loss, trained with full-batch gradient descent with learning rate 1 on a fixed subset of MNIST ($|D|=128$) for ten-class classification. A randomly initialized prior with no connection to the data is used, since this is in line with our theoretical setting and we only intend to observe the change in parameters rather than the performance of the actual bound.

In Section 5.3, we use both fully-connected and convolutional neural network structures to perform experiments on the MNIST and CIFAR-10 datasets, demonstrating the effectiveness of our training-free PAC-Bayesian bound for searching hyperparameters under different datasets and network structures. We build a 3-layer fully-connected neural network with 600 neurons in each layer, and a convolutional architecture with a total of 13 layers and around 10 million learnable parameters. We adopt a data-dependent prior since it is a practical and popular method (Dziugaite et al., 2020; Perez-Ortiz et al., 2021). Specifically, this data-dependent prior is pre-trained on a subset of the total training data by empirical risk minimization. The network for posterior training is then initialized with the weights learnt by the prior. Finally, the generalization bound is computed using Equation (5). Settings related to computing the bound, such as the confidence parameters for the risk certificate and the Chernoff bound, and the number of Monte Carlo samples for estimating the risk certificate, follow Pérez-Ortiz et al. (2020).

5.2 Validation of theoretical results

After $T=2^{17}$ steps of gradient descent updates from different random initializations, we plot in Figure 1 the changes of the (input/output/hidden layer) weights $\boldsymbol{\mu}$ and $\boldsymbol{\rho}$ against the width $m$ of each layer. We observe that the relative Frobenius norm change of the input/output layers' weights scales as $1/\sqrt{m}$ while that of the hidden layers' weights scales as $1/m$ during training, which verifies our Lemma 4.8.
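For concreteness, the quantity plotted in Figure 1 is simply the relative Frobenius norm change of each parameter matrix; a one-line helper (names illustrative):

```python
import numpy as np

def relative_frobenius_change(P_t, P_0):
    """||P_t - P_0||_F / ||P_0||_F, the quantity plotted in Figure 1."""
    return np.linalg.norm(P_t - P_0) / np.linalg.norm(P_0)
```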

Figure 2: For the FCN structure on the MNIST dataset, the Kendall-tau correlations between $\mathcal{PA}$ and the generalization bound with respect to the proportion of data used for prior training, the value of the KL penalty, and the value of $\rho_{0}$ are 0.89, 0.89, and 0.93, all significant at the 1% level. Similar results hold for the CNN structure on the CIFAR-10 dataset, where the Kendall-tau correlations are 0.89, 0.83, and 0.57.
| Method | Network & dataset | Generalization bound | Testing error | Computation time |
|---|---|---|---|---|
| Brute-Force | MNIST + FCN | 0.0427 | 0.0192 | 3888 mins |
| Brute-Force | MNIST + CNN | 0.0247 | 0.0091 | 20736 mins |
| Brute-Force | CIFAR10 + FCN | 0.5377 | 0.4840 | 8424 mins |
| Brute-Force | CIFAR10 + CNN | 0.1969 | 0.1514 | 55080 mins |
| $\mathcal{PA}$ | MNIST + FCN | 0.0478 | 0.0213 | 1296 mins |
| $\mathcal{PA}$ | MNIST + CNN | 0.0346 | 0.0168 | 1296 mins |
| $\mathcal{PA}$ | CIFAR10 + FCN | 0.5490 | 0.4905 | 1296 mins |
| $\mathcal{PA}$ | CIFAR10 + CNN | 0.1970 | 0.1510 | 1296 mins |
Table 1: We conduct grid-search runs over 9 data-dependent priors (different subsets of data for prior training), 9 values of the KL penalty, and 8 values of $\rho_{0}$. Under every setting, the generalization bound obtained from the hyperparameters selected by $\mathcal{PA}$ is close to the best generalization bound while saving substantial computation time. Training is performed on a server with a CPU with 5,120 cores and a 32 GB Nvidia Quadro V100. Note that the computation time per run using $\mathcal{PA}$ is around 2 mins, whereas the computation times for the actual posterior training are 6 mins and 32 mins for MNIST trained with the FCN and CNN, respectively, and 13 mins and 85 mins for CIFAR-10 trained with the FCN and CNN.

5.3 Selecting hyperparameters via training-free metric

The PAC-Bayesian learning framework provides competitive performance with non-vacuous generalization bounds. However, the tightness of this generalization bound depends on the hyperparameters used, such as the proportion of data used for the prior, the initialization of $\rho_{0}$, and the KL penalty weight ($\lambda$). Since these three values do not change during training, we refer to them as hyperparameters. Choosing the right hyperparameters via grid search is prohibitively expensive, as each attempt to compute the generalization bound can involve significant computational resources.

Another plausible approach is to design a predictive, so-called "training-free" metric that approximates the error bound without going through an expensive training process. In light of this goal, we have already developed a generalization bound in Theorem D.1 via the NTK. Since the NTK is held constant during training, we can predict the generalization bound by the following proxy metric:

$$\mathcal{PA}=\operatorname{Tr}\bigg(\frac{(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-1}{\bf Y}{\bf Y}^{\top}}{\sigma^{2}_{0}\,n}\bigg)+\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\operatorname{Tr}\bigg(\frac{(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}{\bf I})^{-2}{\bf Y}{\bf Y}^{\top}}{n}\bigg)}\tag{20}$$

where $\boldsymbol{\Theta}$ denotes the Gram matrix, ${\bf Y}{\bf Y}^{\top}$ is an $n\times n$ label similarity matrix (the entry for two data points is 1 if they have the same label and 0 otherwise), and $n$ is the number of data points used. To demonstrate the computational practicality of this training-free metric, we compute $\mathcal{PA}$ using only a subset of the data for each class (325 per class for FCN and 75 per class for CNN). We should also mention that training-free methods for searching neural architectures are not new; they can be found in NAS (Chen et al., 2021; Deshpande et al., 2021), MAE Random Sampling (Camero et al., 2021), and pruning at initialization (Abdelfattah et al., 2021). To the best of our knowledge, there is currently no training-free method for selecting hyperparameters in the PAC-Bayesian framework, which we consider one of the novelties of this paper.
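A minimal NumPy sketch of this proxy, written in the scalar form of Theorem 4.11 (our reading of Equation (20); names and data are illustrative, and the stand-in Gram matrix below is not an actual NTK):

```python
import numpy as np

def pa_score(Theta, Y_onehot, lam, sigma0):
    """Training-free proxy of Eq. (20). Theta: n x n Gram matrix; Y_onehot:
    n x k one-hot labels, so Y_onehot @ Y_onehot.T is the label-similarity matrix."""
    n = Theta.shape[0]
    K = Theta + lam / sigma0**2 * np.eye(n)
    Kinv_Y = np.linalg.solve(K, Y_onehot)            # (Theta + lam/sigma0^2 I)^{-1} Y
    term1 = np.trace(Y_onehot.T @ Kinv_Y) / (sigma0**2 * n)
    term2 = lam / sigma0**2 * np.sqrt((Kinv_Y**2).sum() / n)  # Tr(Y^T K^{-2} Y)
    return term1 + term2

# rank candidate KL-penalty values without any posterior training
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
Y = np.eye(10)[rng.integers(0, 10, 100)]
G = X @ X.T                                          # stand-in for the NTK Gram matrix
for lam in (0.01, 0.1, 1.0):
    print(lam, pa_score(G, Y, lam, sigma0=1.0))
```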

Figure 2 demonstrates a strong correlation between $\mathcal{PA}$ and the actual generalization bound. Finally, we demonstrate that by searching through all possible combinations of hyperparameters using $\mathcal{PA}$, it is possible to select hyperparameters that lead to a result comparable to the best generalization bound, but without the excessive computation. To put things in perspective, Table 1 shows that grid search can be far more expensive than $\mathcal{PA}$. For instance, under the setting of a CNN on the CIFAR-10 dataset, we save 42.5 times the computational time to find a bound within 1% of the best generalization bound.

6 Discussion

In this work, we theoretically prove that the learning dynamics of deep probabilistic neural networks, trained with objectives derived from PAC-Bayes bounds, are exactly described by the NTK in the over-parameterized setting. Empirical investigation reveals that this agrees well with the actual training process. Furthermore, the expected output function trained with the PAC-Bayesian bound converges to a kernel ridge regression solution under a mild assumption. Based on this finding, we obtain an explicit NTK-based generalization bound for PAC-Bayesian learning, which improves over the generalization bound obtained through the NTK for non-probabilistic neural networks. Finally, we show that the PAC-Bayesian bound score, a training-free method, can effectively select hyperparameters that lead to a lower generalization bound without the excessive computation time that brute-force grid search would take. In summary, we establish our theoretical analysis of PAC-Bayes with a randomly initialized prior. One promising direction is to study PAC-Bayesian learning with data-dependent priors via the NTK.

References

  • Abdelfattah et al. (2021) Abdelfattah, M. S., Mehrotra, A., Dudziak, Ł., and Lane, N. D. Zero-cost proxies for lightweight nas. arXiv preprint arXiv:2101.08134, 2021.
  • Allen-Zhu et al. (2019) Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019.
  • Ambroladze et al. (2007) Ambroladze, A., Parrado-Hernández, E., and Shawe-Taylor, J. Tighter pac-bayes bounds. Advances in neural information processing systems, 19:9, 2007.
  • Amit & Meir (2018) Amit, R. and Meir, R. Meta-learning by adjusting priors based on extended pac-bayes theory. In International Conference on Machine Learning, pp. 205–214. PMLR, 2018.
  • Arora et al. (2019a) Arora, S., Du, S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pp. 322–332. PMLR, 2019a.
  • Arora et al. (2019b) Arora, S., Du, S. S., Hu, W., Li, Z., Salakhutdinov, R., and Wang, R. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019b.
  • Bégin et al. (2014) Bégin, L., Germain, P., Laviolette, F., and Roy, J.-F. Pac-bayesian theory for transductive learning. In Artificial Intelligence and Statistics, pp.  105–113. PMLR, 2014.
  • Bégin et al. (2016) Bégin, L., Germain, P., Laviolette, F., and Roy, J.-F. Pac-bayesian bounds based on the rényi divergence. In Artificial Intelligence and Statistics, pp.  435–444. PMLR, 2016.
  • Biggs & Guedj (2020) Biggs, F. and Guedj, B. Differentiable pac-bayes objectives with partially aggregated neural networks. arXiv preprint arXiv:2006.12228, 2020.
  • Camero et al. (2021) Camero, A., Wang, H., Alba, E., and Bäck, T. Bayesian neural architecture search using a training-free performance metric. Applied Soft Computing, 106:107356, 2021.
  • Cao & Gu (2019) Cao, Y. and Gu, Q. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32:10836–10846, 2019.
  • Chen et al. (2021) Chen, W., Gong, X., and Wang, Z. Neural architecture search on imagenet in four gpu hours: A theoretically inspired perspective. arXiv preprint arXiv:2102.11535, 2021.
  • Deshpande et al. (2021) Deshpande, A., Achille, A., Ravichandran, A., Li, H., Zancato, L., Fowlkes, C., Bhotika, R., Soatto, S., and Perona, P. A linearized framework and a new benchmark for model selection for fine-tuning. arXiv preprint arXiv:2102.00084, 2021.
  • Du et al. (2019a) Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685. PMLR, 2019a.
  • Du et al. (2018) Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.
  • Du et al. (2019b) Du, S. S., Hou, K., Salakhutdinov, R. R., Poczos, B., Wang, R., and Xu, K. Graph neural tangent kernel: Fusing graph neural networks with graph kernels. Advances in Neural Information Processing Systems, 32:5723–5733, 2019b.
  • Dziugaite & Roy (2018a) Dziugaite, G. K. and Roy, D. Entropy-sgd optimizes the prior of a pac-bayes bound: Generalization properties of entropy-sgd and data-dependent priors. In International Conference on Machine Learning, pp. 1377–1386. PMLR, 2018a.
  • Dziugaite & Roy (2017) Dziugaite, G. K. and Roy, D. M. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. CoRR, abs/1703.11008, 2017. URL https://arxiv.org/abs/1703.11008.
  • Dziugaite & Roy (2018b) Dziugaite, G. K. and Roy, D. M. Data-dependent pac-bayes priors via differential privacy. arXiv preprint arXiv:1802.09583, 2018b.
  • Dziugaite et al. (2020) Dziugaite, G. K., Hsu, K., Gharbieh, W., and Roy, D. M. On the role of data in pac-bayes bounds. arXiv preprint arXiv:2006.10929, 2020.
  • Grünwald & Mehta (2020) Grünwald, P. D. and Mehta, N. A. Fast rates for general unbounded loss functions: From erm to generalized bayes. J. Mach. Learn. Res., 21:56–1, 2020.
  • Guedj (2019) Guedj, B. A primer on pac-bayesian learning. arXiv preprint arXiv:1901.05353, 2019.
  • Haddouche et al. (2021) Haddouche, M., Guedj, B., Rivasplata, O., and Shawe-Taylor, J. Pac-bayes unleashed: generalisation bounds with unbounded losses. Entropy, 23(10):1330, 2021.
  • Hron et al. (2020) Hron, J., Bahri, Y., Sohl-Dickstein, J., and Novak, R. Infinite attention: Nngp and ntk for deep attention networks. In International Conference on Machine Learning, pp. 4376–4386. PMLR, 2020.
  • Hu et al. (2019) Hu, W., Li, Z., and Yu, D. Simple and effective regularization methods for training on noisily labeled data with generalization guarantee. arXiv preprint arXiv:1905.11368, 2019.
  • Huang et al. (2020) Huang, W., Du, W., and Da Xu, R. Y. On the neural tangent kernel of deep networks with orthogonal initialization. arXiv preprint arXiv:2004.05867, 2020.
  • Huang et al. (2021) Huang, W., Li, Y., Du, W., Da Xu, R. Y., Yin, J., and Chen, L. Wide graph neural networks: Aggregation provably leads to exponentially trainability loss. arXiv preprint arXiv:2103.03113, 2021.
  • Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
  • Jain & Kar (2017) Jain, P. and Kar, P. Non-convex optimization for machine learning. arXiv preprint arXiv:1712.07897, 2017.
  • Kingma et al. (2015) Kingma, D. P., Salimans, T., and Welling, M. Variational dropout and the local reparameterization trick. Advances in neural information processing systems, 28:2575–2583, 2015.
  • Langford & Caruana (2002a) Langford, J. and Caruana, R. (not) bounding the true error. In Dietterich, T., Becker, S., and Ghahramani, Z. (eds.), Advances in Neural Information Processing Systems, volume 14. MIT Press, 2002a.
  • Langford & Caruana (2002b) Langford, J. and Caruana, R. (not) bounding the true error. Advances in Neural Information Processing Systems, 2:809–816, 2002b.
  • Langford & Seeger (2001) Langford, J. and Seeger, M. Bounds for averaging classifiers. 2001.
  • LeCun et al. (2015) LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. nature, 521(7553):436–444, 2015.
  • Lee et al. (2019) Lee, J., Xiao, L., Schoenholz, S. S., Bahri, Y., Novak, R., Sohl-Dickstein, J., and Pennington, J. Wide neural networks of any depth evolve as linear models under gradient descent. arXiv preprint arXiv:1902.06720, 2019.
  • Letarte et al. (2019) Letarte, G., Germain, P., Guedj, B., and Laviolette, F. Dichotomize and generalize: Pac-bayesian binary activated deep neural networks. arXiv preprint arXiv:1905.10259, 2019.
  • Liao et al. (2020) Liao, R., Urtasun, R., and Zemel, R. A pac-bayesian approach to generalization bounds for graph neural networks. arXiv preprint arXiv:2012.07690, 2020.
  • Liu et al. (2021) Liu, T., Lu, J., Yan, Z., and Zhang, G. Pac-bayes bounds for meta-learning with data-dependent prior. arXiv preprint arXiv:2102.03748, 2021.
  • London (2017) London, B. A pac-bayesian analysis of randomized learning with application to stochastic gradient descent. arXiv preprint arXiv:1709.06617, 2017.
  • Maurer (2004) Maurer, A. A note on the pac bayesian theorem. arXiv preprint cs/0411099, 2004.
  • McAllester (1999a) McAllester, D. A. Pac-bayesian model averaging. In Proceedings of the twelfth annual conference on Computational learning theory, pp.  164–170, 1999a.
  • McAllester (1999b) McAllester, D. A. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999b.
  • Negrea et al. (2019) Negrea, J., Haghifam, M., Dziugaite, G. K., Khisti, A., and Roy, D. M. Information-theoretic generalization bounds for sgld via data-dependent estimates. arXiv preprint arXiv:1911.02151, 2019.
  • Neyshabur et al. (2017a) Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017a.
  • Neyshabur et al. (2017b) Neyshabur, B., Bhojanapalli, S., and Srebro, N. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017b.
  • Nie et al. (2018) Nie, S., Zheng, M., and Ji, Q. The deep regression bayesian network and its applications: Probabilistic deep learning for computer vision. IEEE Signal Processing Magazine, 35(1):101–111, 2018.
  • Pérez-Ortiz et al. (2020) Pérez-Ortiz, M., Rivasplata, O., Shawe-Taylor, J., and Szepesvári, C. Tighter risk certificates for neural networks. CoRR, abs/2007.12911, 2020. URL https://arxiv.org/abs/2007.12911.
  • Perez-Ortiz et al. (2021) Perez-Ortiz, M., Rivasplata, O., Guedj, B., Gleeson, M., Zhang, J., Shawe-Taylor, J., Bober, M., and Kittler, J. Learning pac-bayes priors for probabilistic neural networks. arXiv preprint arXiv:2109.10304, 2021.
  • Pu et al. (2016) Pu, Y., Gan, Z., Henao, R., Yuan, X., Li, C., Stevens, A., and Carin, L. Variational autoencoder for deep learning of images, labels and captions. Advances in neural information processing systems, 29:2352–2360, 2016.
  • Raginsky et al. (2017) Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pp.  1674–1703. PMLR, 2017.
  • Seeger (2002) Seeger, M. Pac-bayesian generalisation error bounds for gaussian process classification. Journal of machine learning research, 3(Oct):233–269, 2002.
  • Smith & Le (2017) Smith, S. L. and Le, Q. V. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.
  • Specht (1990) Specht, D. F. Probabilistic neural networks. Neural networks, 3(1):109–118, 1990.
  • Thiemann et al. (2017) Thiemann, N., Igel, C., Wintenberger, O., and Seldin, Y. A strongly quasiconvex pac-bayesian bound. In International Conference on Algorithmic Learning Theory, pp.  466–492. PMLR, 2017.
  • Zhang et al. (2021) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
  • Zhou et al. (2018) Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. Non-vacuous generalization bounds at the imagenet scale: a pac-bayesian compression approach. arXiv preprint arXiv:1804.05862, 2018.

Appendix A Preliminary Lemmas

In this section, we list preliminary lemmas that will be used in the proofs.

Lemma A.1 (Markov’s inequality).

If $X$ is a nonnegative random variable and $a>0$, then the probability that $X$ is at least $a$ is at most the expectation of $X$ divided by $a$:

$$P(X\geq a)\leq\frac{\mathbb{E}[X]}{a}$$
Lemma A.2 (Hoeffding’s inequality).

Let $X_{1},\ldots,X_{n}$ be independent random variables such that $a_{i}\leq X_{i}\leq b_{i}$ almost surely. Consider the sum $S_{n}=X_{1}+\cdots+X_{n}$. Then Hoeffding's theorem states that, for all $t>0$:

$$P(|S_{n}-\mathbb{E}[S_{n}]|\geq t)\leq 2\exp\bigg(-\frac{2t^{2}}{\sum_{i=1}^{n}(b_{i}-a_{i})^{2}}\bigg)$$

Appendix B Missing Proofs for Section 4.1

This section contains detailed proofs of the results that are missing in the main paper.

Lemma B.1 (PNTK at initialization).

If $m=\Omega\big(\frac{n^{2}}{\lambda^{2}_{0}}\log\frac{n}{\delta}\big)$ and $\boldsymbol{\mu}_{r}$, $\boldsymbol{\sigma}_{r}$ are initialized as in Theorem 4.4, then with probability at least $1-\delta$ we have $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},0)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$ and $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},0)\big\|_{2}\geq\frac{3\lambda_{0}}{4}$, and likewise $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},0)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$ and $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},0)\big\|_{2}\geq\frac{3\lambda_{0}}{4}$.

Proof.

The derivatives of the output with respect to the parameters $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$ can be expressed as

$$\frac{\partial f({\bf x}_{i})}{\partial\boldsymbol{\mu}_{r}}=\frac{1}{\sqrt{m}}v_{r}\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0){\bf x}_{i},\qquad\frac{\partial f({\bf x}_{i})}{\partial\boldsymbol{\sigma}_{r}}=\frac{1}{\sqrt{m}}v_{r}\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0){\bf x}_{i}\odot\boldsymbol{\epsilon}_{r}\tag{21}$$

Since the probabilistic neural network contains two sets of parameters, we discuss the corresponding NTK for each in turn.

(1) $\boldsymbol{\Theta}^{\mu}$.

Plugging the derivative with respect to the mean weights in Equation (21) into the definition of the PNTK (Equation 12) yields:

$$\boldsymbol{\Theta}^{\mu}_{ij}(0)=\frac{{\bf x}^{\top}_{i}{\bf x}_{j}}{m}\sum_{r=1}^{m}(v_{r})^{2}\,\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0)\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{j}\geq 0)=\frac{{\bf x}^{\top}_{i}{\bf x}_{j}}{m}\sum_{r=1}^{m}\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0)\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{j}\geq 0)$$

where the second equality uses $(v_{r})^{2}=1$.

For every pair $i,j$, $\boldsymbol{\Theta}^{\mu}_{ij}(0)$ is the average of $m$ i.i.d. random variables bounded in $[0,1]$, with expectation $\boldsymbol{\Theta}^{\infty}_{ij}=\mathbb{E}_{{\bf w}}\big[{\bf x}^{\top}_{i}{\bf x}_{j}\mathbb{I}\{{\bf w}^{\top}{\bf x}_{i}\geq 0,{\bf w}^{\top}{\bf x}_{j}\geq 0\}\big]$. Here ${\bf w}_{i}=\boldsymbol{\mu}_{i}+\boldsymbol{\sigma}_{i}\odot\boldsymbol{\epsilon}$ with $\boldsymbol{\mu}_{i}\sim\mathcal{N}(\hat{\mu},\hat{\sigma}^{2})$. A simple argument shows that ${\bf w}$ is also Gaussian; its variance can be inferred as follows.

$$f({\bf w}_{i})=\frac{1}{\sqrt{2\pi}\sigma_{0}}e^{-\frac{({\bf w}_{i}-\boldsymbol{\mu}_{i})^{2}}{2\sigma_{0}^{2}}},\qquad f(\boldsymbol{\mu}_{i})=\frac{1}{\sqrt{2\pi}\hat{\sigma}}e^{-\frac{(\boldsymbol{\mu}_{i}-\hat{\mu})^{2}}{2\hat{\sigma}^{2}}}$$

Marginalizing over $\boldsymbol{\mu}_{i}$ in $f({\bf w}_{i})$, we obtain

$$f({\bf w}_{i})=\int\frac{1}{\sqrt{2\pi}\sigma_{0}}e^{-\frac{({\bf w}_{i}-\boldsymbol{\mu}_{i})^{2}}{2\sigma_{0}^{2}}}\,\frac{1}{\sqrt{2\pi}\hat{\sigma}}e^{-\frac{(\boldsymbol{\mu}_{i}-\hat{\mu})^{2}}{2\hat{\sigma}^{2}}}\,d\boldsymbol{\mu}_{i}=\frac{1}{\sqrt{2\pi(\sigma_{0}^{2}+\hat{\sigma}^{2})}}e^{-\frac{({\bf w}_{i}-\hat{\mu})^{2}}{2(\sigma_{0}^{2}+\hat{\sigma}^{2})}}$$
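The convolution above simply says that the marginal of ${\bf w}_{i}=\boldsymbol{\mu}_{i}+\sigma_{0}\boldsymbol{\epsilon}$ is Gaussian with the variances added. A minimal NumPy sketch (sampling setup and values are ours, purely illustrative) confirms this numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
mu_hat, sigma_hat, sigma_0 = 0.3, 0.5, 0.2
n_samples = 1_000_000

mu = rng.normal(mu_hat, sigma_hat, n_samples)      # mu_i ~ N(mu_hat, sigma_hat^2)
w = mu + sigma_0 * rng.standard_normal(n_samples)  # w_i = mu_i + sigma_0 * eps

print(f"empirical mean: {w.mean():.4f}  (expected {mu_hat})")
print(f"empirical std:  {w.std():.4f}  (expected {np.sqrt(sigma_0**2 + sigma_hat**2):.4f})")
```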

Then, by Hoeffding’s inequality, the following holds with probability at least $1-\delta^{\prime}$:

$$\big|\boldsymbol{\Theta}^{\mu}_{ij}(0)-\boldsymbol{\Theta}^{\infty}_{ij}\big|\leq\sqrt{\frac{\log(2/\delta^{\prime})}{2m}}$$

Because the NTK matrix is of size $n\times n$, we apply a union bound over all $i,j\in[n]$ (setting $\delta^{\prime}=\delta/n^{2}$) and obtain

$$\big|\boldsymbol{\Theta}^{\mu}_{ij}(0)-\boldsymbol{\Theta}^{\infty}_{ij}\big|\leq\sqrt{\frac{\log(2n^{2}/\delta)}{2m}}$$

Thus we have

$$\big\|\boldsymbol{\Theta}^{\mu}(0)-\boldsymbol{\Theta}^{\infty}\big\|^{2}_{2}\leq\big\|\boldsymbol{\Theta}^{\mu}(0)-\boldsymbol{\Theta}^{\infty}\big\|^{2}_{F}\leq\sum_{i,j}\big|\boldsymbol{\Theta}^{\mu}_{ij}(0)-\boldsymbol{\Theta}^{\infty}_{ij}\big|^{2}=O\bigg(\frac{n^{2}\log(2n^{2}/\delta)}{m}\bigg)$$

Finally, if $\sqrt{\frac{n^{2}\log(2n^{2}/\delta)}{m}}\leq\frac{\lambda_{0}}{4}$, which is implied by $m=\Omega\big(\frac{n^{2}\log(n^{2}/\delta)}{\lambda^{2}_{0}}\big)$, then with probability at least $1-\delta$,

$$\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},0)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4},\qquad\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},0)\big\|_{2}\geq\frac{3\lambda_{0}}{4}.$$
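To see this concentration in practice, the following sketch (ours, not the paper's code) compares the empirical $\boldsymbol{\Theta}^{\mu}_{ij}(0)$ of a width-$m$ ReLU layer against the closed-form expectation; assuming $\hat{\mu}=0$ and unit-norm inputs, the spherical symmetry of ${\bf w}$ gives the standard arc-cosine-kernel identity $\boldsymbol{\Theta}^{\infty}_{ij}={\bf x}_{i}^{\top}{\bf x}_{j}\,(\pi-\arccos({\bf x}_{i}^{\top}{\bf x}_{j}))/(2\pi)$:

```python
import numpy as np

rng = np.random.default_rng(2)
d, sigma_hat, sigma_0 = 10, 1.0, 0.5

# Two unit-norm inputs.
x_i, x_j = rng.standard_normal(d), rng.standard_normal(d)
x_i, x_j = x_i / np.linalg.norm(x_i), x_j / np.linalg.norm(x_j)

rho = float(x_i @ x_j)
theta_inf = rho * (np.pi - np.arccos(rho)) / (2 * np.pi)  # closed form (mu_hat = 0)

for m in [100, 10_000, 1_000_000]:
    # w_r = mu_r + sigma_0 * eps_r ~ N(0, (sigma_hat^2 + sigma_0^2) I)
    W = rng.standard_normal((m, d)) * np.sqrt(sigma_hat**2 + sigma_0**2)
    theta_emp = rho * np.mean((W @ x_i >= 0) & (W @ x_j >= 0))
    print(f"m={m:>9}: empirical {theta_emp:.5f}  vs  expectation {theta_inf:.5f}")
```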

(2) $\boldsymbol{\Theta}^{\sigma}$.

Plugging the derivative with respect to the variance weights in Equation (21) into the definition of the PNTK (Equation 12) yields:

$$\boldsymbol{\Theta}^{\sigma}_{ij}(0)=\frac{{\bf x}^{\top}_{i}{\bf x}_{j}}{m}\sum_{r=1}^{m}\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0)\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{j}\geq 0)\boldsymbol{\epsilon}^{2}_{r}$$

Note that the tangent kernel for $\boldsymbol{\sigma}$ differs from $\boldsymbol{\Theta}^{\mu}_{ij}$ by the additional factor $\boldsymbol{\epsilon}^{2}_{r}$. We have $\boldsymbol{\epsilon}^{2}_{r}\sim\chi^{2}_{1}$, independent of $\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{i}\geq 0)\mathbb{I}({\bf w}^{\top}_{r}{\bf x}_{j}\geq 0)$. Because $\mathbb{E}[\chi^{2}_{1}]=1$, the expectation of $\boldsymbol{\Theta}^{\sigma}_{ij}$ equals that of $\boldsymbol{\Theta}^{\mu}_{ij}$. Thus for every pair $i,j$, $\boldsymbol{\Theta}^{\sigma}_{ij}$ is the average of $m$ i.i.d. random variables with expectation $\boldsymbol{\Theta}^{\infty}_{ij}=\mathbb{E}_{{\bf w}}\big[{\bf x}^{\top}_{i}{\bf x}_{j}\mathbb{I}\{{\bf w}^{\top}{\bf x}_{i}\geq 0,{\bf w}^{\top}{\bf x}_{j}\geq 0\}\big]$.

Now we establish the concentration bound. Since $\boldsymbol{\epsilon}^{2}_{r}$ is independent and sub-exponential, the sub-exponential tail bound gives, with probability at least $1-\delta^{\prime}$,

$$\big|\boldsymbol{\Theta}^{\sigma}_{ij}(0)-\boldsymbol{\Theta}^{\infty}_{ij}\big|\leq\sqrt{\frac{\log(8/\delta^{\prime})}{2m}}$$

This bound is of the same order as the concentration bound for $\boldsymbol{\Theta}^{\mu}_{ij}(0)$. Thus we can repeat the arguments for $\boldsymbol{\Theta}^{\mu}_{ij}(0)$ above to finish the proof: if $\sqrt{\frac{n^{2}\log(8n^{2}/\delta)}{m}}\leq\frac{\lambda_{0}}{4}$, which is implied by $m=\Omega\big(\frac{n^{2}\log(n^{2}/\delta)}{\lambda^{2}_{0}}\big)$, then with probability at least $1-\delta$,

$$\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},0)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4},\qquad\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},0)\big\|_{2}\geq\frac{3\lambda_{0}}{4}.$$

Lemma B.2 (PNTK during training).

If $\boldsymbol{\mu}_{r}$ and $\boldsymbol{\sigma}_{r}$ are initialized as in Theorem 4.4, then for any set of weight vectors ${\bf w}_{1},\ldots,{\bf w}_{m}$ that satisfy $\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|\leq\frac{c\lambda_{0}\delta\sqrt{\sigma_{0}^{2}+\hat{\sigma}^{2}}}{n^{2}}\equiv R$ for every $r\in[m]$, where $c$ is a constant, with probability at least $1-\delta$: $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},t)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$, $\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},t)\big\|_{2}\geq\frac{\lambda_{0}}{2}$; and $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},t)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4}$, $\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},t)\big\|_{2}\geq\frac{\lambda_{0}}{2}$.

Proof.

Define the event $A_{ir}=\{\exists{\bf w}:\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|\leq R,\ \mathbb{I}\{{\bf x}^{\top}_{i}{\bf w}_{r}(t)\geq 0\}\neq\mathbb{I}\{{\bf x}^{\top}_{i}{\bf w}_{r}(0)\geq 0\}\}$. This event can only happen when $|{\bf w}_{r}(0)^{\top}{\bf x}_{i}|<R$. As calculated in the proof of Lemma B.1, ${\bf w}_{i}\sim\mathcal{N}(0,\sigma^{2}_{0}+\hat{\sigma}^{2})$. Then, with the condition that $\|{\bf x}_{i}\|=1$, we obtain

$$z={\bf w}_{r}(0)^{\top}{\bf x}_{i}\sim\mathcal{N}\big(0,\sigma^{2}_{0}+\hat{\sigma}^{2}\big)$$

Integrating the Gaussian density over the event $|{\bf w}_{r}(0)^{\top}{\bf x}_{i}|<R$, we have

$$P(A_{ir})=P_{z\sim\mathcal{N}(0,\sigma^{2}_{0}+\hat{\sigma}^{2})}(|z|<R)\leq\frac{2R}{\sqrt{2\pi(\sigma^{2}_{0}+\hat{\sigma}^{2})}}$$
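This anti-concentration step just bounds the probability mass of a width-$2R$ window by the peak density times the window width; a two-line check (ours, using SciPy's standard normal CDF with illustrative values) makes the slack visible:

```python
import numpy as np
from scipy.stats import norm

sigma_sq = 0.2**2 + 0.5**2      # sigma_0^2 + sigma_hat^2 (illustrative values)
R = 0.05

exact = norm.cdf(R, scale=np.sqrt(sigma_sq)) - norm.cdf(-R, scale=np.sqrt(sigma_sq))
bound = 2 * R / np.sqrt(2 * np.pi * sigma_sq)
print(f"P(|z| < R) = {exact:.5f} <= {bound:.5f}")
```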

Then we can bound each entry of the matrix $\boldsymbol{\Theta}^{\mu}$:

$$\mathbb{E}\big[|\boldsymbol{\Theta}^{\mu}_{ij}(0)-\boldsymbol{\Theta}^{\mu}_{ij}(t)|\big]=\mathbb{E}\bigg[\frac{1}{m}\Big|{\bf x}^{\top}_{i}{\bf x}_{j}\sum_{r=1}^{m}\big(\mathbb{I}\{{\bf w}_{r}(0)^{\top}{\bf x}_{i}\geq 0,{\bf w}_{r}(0)^{\top}{\bf x}_{j}\geq 0\}-\mathbb{I}\{{\bf w}_{r}(t)^{\top}{\bf x}_{i}\geq 0,{\bf w}_{r}(t)^{\top}{\bf x}_{j}\geq 0\}\big)\Big|\bigg]\leq\frac{1}{m}\sum_{r=1}^{m}\mathbb{E}\big[\mathbb{I}\{A_{ir}\cup A_{jr}\}\big]\leq\frac{4R}{\sqrt{2\pi(\sigma^{2}_{0}+\hat{\sigma}^{2})}}$$

Summing over all entries of the matrix gives $\mathbb{E}\big[\sum_{ij}\big|\boldsymbol{\Theta}^{\mu}_{ij}(0)-\boldsymbol{\Theta}^{\mu}_{ij}(t)\big|\big]\leq\frac{4n^{2}R}{\sqrt{2\pi(\sigma_{0}^{2}+\hat{\sigma}^{2})}}$. By Markov’s inequality, with probability at least $1-\delta$ over the distribution of ${\bf w}_{r}(0)$ for $r\in[m]$, $\sum_{ij}\big|\boldsymbol{\Theta}^{\mu}_{ij}(0)-\boldsymbol{\Theta}^{\mu}_{ij}(t)\big|\leq\frac{4n^{2}R}{\sqrt{2\pi(\sigma^{2}_{0}+\hat{\sigma}^{2})}\,\delta}$. Then, by matrix perturbation theory, we have

$$\big\|\boldsymbol{\Theta}^{\mu}(t)-\boldsymbol{\Theta}^{\mu}(0)\big\|_{2}\leq\big\|\boldsymbol{\Theta}^{\mu}(t)-\boldsymbol{\Theta}^{\mu}(0)\big\|_{F}\leq\sum_{ij}\big|\boldsymbol{\Theta}^{\mu}_{ij}(t)-\boldsymbol{\Theta}^{\mu}_{ij}(0)\big|\leq\frac{4n^{2}R}{\sqrt{2\pi(\sigma_{0}^{2}+\hat{\sigma}^{2})}\,\delta}$$

If $\frac{4n^{2}R}{\sqrt{2\pi(\sigma_{0}^{2}+\hat{\sigma}^{2})}\,\delta}\leq\frac{\lambda_{0}}{4}$, which is implied by $\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|\leq\frac{c\lambda_{0}\delta\sqrt{\sigma_{0}^{2}+\hat{\sigma}^{2}}}{n^{2}}$, then with probability at least $1-\delta$,

$$\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},t)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4},\qquad\big\|\boldsymbol{\Theta}^{\mu}({\bf X},{\bf X},t)\big\|_{2}\geq\frac{\lambda_{0}}{2}.$$

On the other hand, we can bound each entry of the matrix $\boldsymbol{\Theta}^{\sigma}$ with the following inequality:

$$\mathbb{E}\big[|\boldsymbol{\Theta}^{\sigma}_{ij}(0)-\boldsymbol{\Theta}^{\sigma}_{ij}(t)|\big]=\mathbb{E}\bigg[\frac{1}{m}\Big|{\bf x}^{\top}_{i}{\bf x}_{j}\sum_{r=1}^{m}\big(\mathbb{I}\{{\bf w}_{r}(0)^{\top}{\bf x}_{i}\geq 0,{\bf w}_{r}(0)^{\top}{\bf x}_{j}\geq 0\}\boldsymbol{\epsilon}^{2}_{r}(0)-\mathbb{I}\{{\bf w}_{r}(t)^{\top}{\bf x}_{i}\geq 0,{\bf w}_{r}(t)^{\top}{\bf x}_{j}\geq 0\}\boldsymbol{\epsilon}^{2}_{r}(t)\big)\Big|\bigg]\leq\frac{1}{m}\sum_{r=1}^{m}\mathbb{E}\big[\mathbb{I}\{A_{ir}\cup A_{jr}\}\boldsymbol{\epsilon}^{2}_{r}(t)\big]\leq\frac{4R}{\sqrt{2\pi(\sigma^{2}_{0}+\hat{\sigma}^{2})}}$$

For the first inequality in the second step, we use $\boldsymbol{\epsilon}^{2}_{r}(t)=\boldsymbol{\epsilon}^{2}_{r}(0)$, which follows from the definition of ${\bf w}_{r}(t)$. This definition is reasonable because $\mathbb{E}[\boldsymbol{\epsilon}^{2}_{r}(0)]=\mathbb{E}[\boldsymbol{\epsilon}^{2}_{r}(t)]$ when $\boldsymbol{\epsilon}_{r}(t)$ and $\boldsymbol{\epsilon}_{r}(0)$ are i.i.d. With this definition, we can compute $\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|$ directly while keeping the conclusion applicable to the actual weights at each training step.

The above inequality shows that, even with the additional factor $\boldsymbol{\epsilon}^{2}_{r}$, we obtain the same bound. Once again, we can apply the arguments for $\boldsymbol{\Theta}^{\mu}$ here and conclude the proof: if $\frac{4n^{2}R}{\sqrt{2\pi(\sigma_{0}^{2}+\hat{\sigma}^{2})}\,\delta}\leq\frac{\lambda_{0}}{4}$, which is implied by $\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|\leq\frac{c\lambda_{0}\delta\sqrt{\sigma_{0}^{2}+\hat{\sigma}^{2}}}{n^{2}}$, then with probability at least $1-\delta$,

$$\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},t)-\boldsymbol{\Theta}^{\infty}({\bf X},{\bf X})\big\|_{2}\leq\frac{\lambda_{0}}{4},\qquad\big\|\boldsymbol{\Theta}^{\sigma}({\bf X},{\bf X},t)\big\|_{2}\geq\frac{\lambda_{0}}{2}.$$

Lemma B.3 (Change of weights during training).

Suppose $\lambda_{\min}(\boldsymbol{\Theta}(t))\geq\frac{\lambda_{0}}{2}$ for $0<t<T$. Then $\widehat{R}_{S}(Q,t)\leq\exp(-\lambda_{0}t)\widehat{R}_{S}(Q,0)$, and with probability at least $1-\delta$ over the initialization, $\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|_{2}\leq\frac{2n}{\sqrt{m\delta}\lambda_{0}}\equiv R^{\prime}$.

Proof.

According to the gradient flow of the output function, we have

$$\begin{aligned}\frac{df({\bf x}_{i},t)}{dt}&=\sum_{r=1}^{m}\Big(\big\langle\frac{\partial f({\bf x}_{i},t)}{\partial\boldsymbol{\mu}_{r}},\frac{d\boldsymbol{\mu}_{r}(t)}{dt}\big\rangle+\big\langle\frac{\partial f({\bf x}_{i},t)}{\partial\boldsymbol{\sigma}_{r}},\frac{d\boldsymbol{\sigma}_{r}(t)}{dt}\big\rangle\Big)\\&=\sum_{j=1}^{n}(y_{j}-f({\bf x}_{j},t))\sum_{r=1}^{m}\Big(\big\langle\frac{\partial f({\bf x}_{i})}{\partial\boldsymbol{\mu}_{r}},\frac{\partial f({\bf x}_{j})}{\partial\boldsymbol{\mu}_{r}}\big\rangle+\big\langle\frac{\partial f({\bf x}_{i})}{\partial\boldsymbol{\sigma}_{r}},\frac{\partial f({\bf x}_{j})}{\partial\boldsymbol{\sigma}_{r}}\big\rangle\Big)\\&=\sum_{j=1}^{n}(y_{j}-f({\bf x}_{j},t))(\boldsymbol{\Theta}^{\mu}_{ij}+\boldsymbol{\Theta}^{\sigma}_{ij})\end{aligned}$$

Then the dynamics of the loss can be calculated:

$$\begin{aligned}\frac{d}{dt}\widehat{R}_{S}(Q,t)&=\frac{1}{2}\frac{d}{dt}\big\|\mathbb{E}_{f(t)\sim Q}f({\bf X},t)-{\bf Y}\big\|_{2}^{2}=\frac{1}{2}\frac{d}{dt}\big\|\widehat{f}({\bf X},t)-{\bf Y}\big\|_{2}^{2}\\&=-({\bf Y}-\widehat{f}({\bf X},t))^{\top}(\boldsymbol{\Theta}^{\mu}(t)+\boldsymbol{\Theta}^{\sigma}(t))({\bf Y}-\widehat{f}({\bf X},t))\leq-\lambda_{0}\|{\bf Y}-\widehat{f}({\bf X},t)\|^{2}_{2}\end{aligned}$$

where we define $\widehat{f}(t)=\mathbb{E}_{f(t)\sim Q}f(t)=\mathbb{E}_{\boldsymbol{\epsilon}}f(t)$, and ${\bf X},{\bf Y}$ are the collections of features and labels. The loss can then be bounded:

$$\widehat{R}_{S}(Q,t)=\frac{1}{2}\|{\bf Y}-\widehat{f}({\bf X},t)\|^{2}_{2}\leq\exp(-\lambda_{0}t)\widehat{R}_{S}(Q,0)$$

The above equation implies a linear convergence rate of gradient flow on the over-parameterized probabilistic network.
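A small simulation of the residual dynamics $\frac{d}{dt}({\bf Y}-\widehat{f})=-\boldsymbol{\Theta}({\bf Y}-\widehat{f})$ illustrates this rate; the kernel matrix below is a random positive-definite stand-in for $\boldsymbol{\Theta}^{\mu}+\boldsymbol{\Theta}^{\sigma}$ (our own toy setup, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(3)
n, eta, steps = 20, 0.01, 2000

# Random PSD kernel as a stand-in for Theta^mu + Theta^sigma.
A = rng.standard_normal((n, n))
Theta = A @ A.T / n + 0.5 * np.eye(n)
lam0 = np.linalg.eigvalsh(Theta).min()

u = rng.standard_normal(n)   # initial residual Y - f_hat(X, 0)
r0 = u @ u
for t in range(steps):
    u = u - eta * Theta @ u  # Euler discretization of the gradient flow
    # ||u(t)||^2 should decay like exp(-2 * lam0 * eta * t), up to discretization.

print(f"final loss ratio: {u @ u / r0:.3e}")
print(f"exp(-2*lam0*T):   {np.exp(-2 * lam0 * eta * steps):.3e}")
```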

Now we bound the change of the mean weights:

$$\bigg\|\frac{d}{dt}\boldsymbol{\mu}_{r}(t)\bigg\|_{2}=\bigg\|\sum_{i=1}^{n}(y_{i}-\widehat{f}({\bf x}_{i}))\frac{1}{\sqrt{m}}v_{r}\,{\bf x}_{i}\,\mathbb{I}({\bf w}^{\top}_{r}(t){\bf x}_{i}\geq 0)\bigg\|_{2}\leq\frac{1}{\sqrt{m}}\sum_{i=1}^{n}|y_{i}-\widehat{f}({\bf x}_{i})|\leq\sqrt{\frac{n}{m}}\|{\bf Y}-\widehat{f}({\bf X},t)\|_{2}\leq\sqrt{\frac{n}{m}}\exp(-\lambda_{0}t/2)\|{\bf Y}-\widehat{f}({\bf X},0)\|_{2}$$

Integrating the gradient, we have

$$\|\boldsymbol{\mu}_{r}(T)-\boldsymbol{\mu}_{r}(0)\|_{2}\leq\int_{0}^{T}\bigg\|\frac{d}{dt}\boldsymbol{\mu}_{r}(t)\bigg\|_{2}\,dt\leq 2\sqrt{\frac{n}{m}}\frac{\|{\bf Y}-\widehat{f}({\bf X},0)\|_{2}}{\lambda_{0}}$$

Next we bound the change of the variance weights:

$$\bigg\|\frac{d}{dt}\boldsymbol{\sigma}_{r}(t)\bigg\|_{2}=\bigg\|\sum_{i=1}^{n}(y_{i}-\widehat{f}({\bf x}_{i}))\frac{1}{\sqrt{m}}\mathbb{E}_{\boldsymbol{\epsilon}_{r}(t)}\big[v_{r}\,{\bf x}_{i}\odot\boldsymbol{\epsilon}_{r}(t)\,\mathbb{I}({\bf w}^{\top}_{r}(t){\bf x}_{i}\geq 0)\big]\bigg\|_{2}=0$$

In the above derivation, we use the definition of the loss, $\widehat{R}_{S}(Q)=\mathbb{E}_{f\sim Q}\widehat{R}_{S}(f)$, and interchange integration and differentiation. The result gives:

$$\|\boldsymbol{\sigma}_{r}(T)-\boldsymbol{\sigma}_{r}(0)\|\leq\int_{0}^{T}\bigg\|\frac{d}{dt}\boldsymbol{\sigma}_{r}(t)\bigg\|_{2}\,dt=0$$

Plugging the results for both the mean and variance weights together, we obtain

$$\|{\bf w}_{r}(T)-{\bf w}_{r}(0)\|=\|\boldsymbol{\mu}_{r}(T)-\boldsymbol{\mu}_{r}(0)+\boldsymbol{\epsilon}_{r}(0)\odot(\boldsymbol{\sigma}_{r}(T)-\boldsymbol{\sigma}_{r}(0))\|\leq 2\sqrt{\frac{n}{m}}\frac{\|{\bf Y}-\widehat{f}({\bf X},0)\|_{2}}{\lambda_{0}}=\frac{4\sqrt{n\widehat{R}_{S}(Q,0)}}{\sqrt{m}\lambda_{0}}$$

Finally, it remains to bound $\widehat{R}_{S}(Q,0)$:

$$\mathbb{E}\big[\|{\bf Y}-\widehat{f}({\bf X})\|^{2}_{2}\big]=\sum_{i=1}^{n}\big(y^{2}_{i}-2y_{i}\mathbb{E}[\widehat{f}({\bf x}_{i})]+\mathbb{E}[\widehat{f}({\bf x}_{i})^{2}]\big)=\sum_{i=1}^{n}(1+O(1))=O(n)$$

Thus, by Markov’s inequality, with probability at least $1-\delta$, $\widehat{R}_{S}(Q,0)=O(\frac{n}{\delta})$, and

$$\|{\bf w}_{r}(t)-{\bf w}_{r}(0)\|_{2}\leq\frac{2n}{\sqrt{m\delta}\lambda_{0}}\equiv R^{\prime}$$

Lemma B.4.

If $R^{\prime}<R$, then $\widehat{R}_{S}(Q,t)\leq\exp(-\lambda_{0}t)\widehat{R}_{S}(Q,0)$.

Proof.

Since $R^{\prime}=\frac{2n}{\sqrt{m\delta}\lambda_{0}}$ and $R=\frac{c\lambda_{0}\delta\sqrt{\sigma^{2}_{0}+\hat{\sigma}^{2}}}{n^{2}}$, the condition $R^{\prime}<R$ is satisfied once $m=\Omega\big(\frac{n^{6}}{\lambda^{4}_{0}\delta^{4}(\sigma_{0}^{2}+\hat{\sigma}^{2})}\big)$. ∎

Appendix C Missing Proofs of Section 4.2

Theorem C.1.

Consider gradient descent on the target function (6), with the initialization stated in Theorem 4.4. Suppose $m\geq{\rm poly}(n,1/\lambda_{0},1/\delta,1/\mathcal{E})$. Then with probability at least $1-\delta$ over the random initialization, we have

$$\widehat{f}({\bf x})=\boldsymbol{\Theta}^{\infty}({\bf x},{\bf X})\big(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I}\big)^{-1}{\bf Y}\pm\mathcal{E} \quad (22)$$

where $\widehat{f}({\bf x})=\mathbb{E}_{f\sim Q}f({\bf x})$ aligns with the definition of the empirical loss function.

Proof.

At a high level, our proof first establishes the kernel ridge regression result in the infinite-width limit, and then bounds the perturbation of the prediction.

According to the linearization rules for infinitely wide networks (Lee et al., 2019), the output function can be expressed as

$$\widehat{f}_{\rm ntk}({\bf x},t)=\phi_{\mu}({\bf x})^{\top}(\boldsymbol{\mu}(t)-\boldsymbol{\mu}(0))+\phi_{\sigma}({\bf x})^{\top}(\boldsymbol{\sigma}(t)-\boldsymbol{\sigma}(0)),$$

where $\phi_{\mu}({\bf x})=\nabla_{\mu}\widehat{f}({\bf x},0)$ and $\phi_{\sigma}({\bf x})=\nabla_{\sigma}\widehat{f}({\bf x},0)$. Since $\boldsymbol{\sigma}$ does not change during training, the KL divergence reduces to

$$\text{KL}=\frac{1}{2}\frac{\|\boldsymbol{\mu}(t)-\boldsymbol{\mu}(0)\|^{2}}{\sigma^{2}_{0}}$$
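This is the standard closed form for the KL divergence between Gaussians with a shared variance; spelling out the step (which is left implicit above), per coordinate with posterior $\mathcal{N}(\mu_{1},\sigma^{2})$ and prior $\mathcal{N}(\mu_{2},\sigma^{2})$,

$$\mathrm{KL}\big(\mathcal{N}(\mu_{1},\sigma_{1}^{2})\,\big\|\,\mathcal{N}(\mu_{2},\sigma_{2}^{2})\big)=\log\frac{\sigma_{2}}{\sigma_{1}}+\frac{\sigma_{1}^{2}+(\mu_{1}-\mu_{2})^{2}}{2\sigma_{2}^{2}}-\frac{1}{2}\ \xrightarrow{\ \sigma_{1}=\sigma_{2}=\sigma\ }\ \frac{(\mu_{1}-\mu_{2})^{2}}{2\sigma^{2}},$$

and summing over coordinates with $\mu_{1}=\boldsymbol{\mu}(t)$, $\mu_{2}=\boldsymbol{\mu}(0)$, $\sigma=\sigma_{0}$ gives the expression above; the log-variance and trace terms cancel because the variances are equal.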

Then the gradient flow equation for $\boldsymbol{\mu}$ becomes

$$\frac{d\boldsymbol{\mu}(t)}{dt}=\frac{\partial\mathcal{B}}{\partial\boldsymbol{\mu}}=(\widehat{f}_{\rm ntk}({\bf X},t)-{\bf Y})\,\phi_{\mu}({\bf X})+\lambda/\sigma^{2}_{0}\,(\boldsymbol{\mu}(t)-\boldsymbol{\mu}(0))=\boldsymbol{\Theta}^{\mu}(\boldsymbol{\mu}(t)-\boldsymbol{\mu}(0))-\phi_{\mu}({\bf X})^{\top}{\bf Y}+\lambda/\sigma^{2}_{0}\,(\boldsymbol{\mu}(t)-\boldsymbol{\mu}(0))$$

This is a linear ordinary differential equation in $\boldsymbol{\mu}(t)$, whose solution is

$$\boldsymbol{\mu}(t)=\phi_{\mu}({\bf X})^{\top}(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}\big({\bf I}-e^{-(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}{\bf I})t}\big){\bf Y}$$

Plugging this result into the linearization of the expected output function, we have

$$\widehat{f}_{\rm ntk}({\bf x},t)=\boldsymbol{\Theta}({\bf x},{\bf X})(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}\big({\bf I}-e^{-(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}{\bf I})t}\big){\bf Y}$$

Taking the time to infinity yields

$$\widehat{f}_{\rm ntk}({\bf x},t=\infty)=\boldsymbol{\Theta}({\bf x},{\bf X})(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}.$$
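This limit is exactly a kernel ridge regression predictor with kernel $\boldsymbol{\Theta}$ and ridge $\lambda/\sigma_{0}^{2}$. A minimal NumPy sketch (the RBF kernel stands in for the NTK purely for illustration; the data and names are our own):

```python
import numpy as np

def kernel(A, B, bandwidth=1.0):
    # RBF kernel as an illustrative stand-in for the (P)NTK Theta.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * bandwidth**2))

rng = np.random.default_rng(4)
n, d = 50, 3
lam, sigma_0 = 0.1, 1.0

X = rng.standard_normal((n, d))
Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(n)
x_test = rng.standard_normal((5, d))

ridge = lam / sigma_0**2
# f(x) = Theta(x, X) (Theta(X, X) + (lambda / sigma_0^2) I)^{-1} Y
alpha = np.linalg.solve(kernel(X, X) + ridge * np.eye(n), Y)
f_test = kernel(x_test, X) @ alpha
print(f_test)
```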

The next step is to show that

$$|\widehat{f}_{\rm ntk}({\bf x})-\widehat{f}({\bf x})|\leq O(\mathcal{E}),$$

where $\mathcal{E}=\mathcal{E}_{\rm init}+\frac{\sqrt{n}}{\lambda_{0}^{2}}\log\big(\frac{n}{\mathcal{E}_{\Theta}\lambda_{0}}\big)\mathcal{E}_{\Theta}$, and we define $|\widehat{f}({\bf x},0)|\leq\mathcal{E}_{\rm init}$ and $\|\boldsymbol{\Theta}^{\infty}-\boldsymbol{\Theta}(t)\|_{2}\leq\mathcal{E}_{\Theta}$.

The proof relies on a careful analysis of the trajectories induced by gradient flow when optimizing the neural network and the NTK predictor. The detailed argument can be found in the proof of Theorem 3.2 of Arora et al. (2019b), with the kernel regression there replaced by kernel ridge regression here.

Appendix D Missing Proofs of Section 4.3

Theorem D.1 (PAC-Bayesian bound with NTK).

Suppose the data $S=\{({\bf x}_{i},y_{i})\}_{i=1}^{n}$ are i.i.d. samples from a non-degenerate distribution $\mathcal{D}$, and $m\geq{\rm poly}(n,\lambda_{0}^{-1},\delta^{-1})$. Consider any loss function $\ell:\mathbb{R}\times\mathbb{R}\rightarrow[0,1]$. Then with probability at least $1-\delta$ over the random initialization and the training samples, the probabilistic network trained by gradient descent for $T\geq\Omega(\frac{1}{\eta\lambda_{0}}\log\frac{n}{\delta})$ iterations has population risk $R_{\mathcal{D}}(Q)$ bounded as follows:

$$R_{\mathcal{D}}\leq\frac{1}{\sigma^{2}_{0}}\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}}{n}+\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-2}{\bf Y}}{n}}+O\bigg(\frac{\log\frac{2\sqrt{n}}{\delta}}{n}\bigg). \quad (23)$$
Proof.

The generalization bound consists of two terms: the empirical error and the KL divergence.

(1) We first bound the empirical error $\sqrt{\sum_{i=1}^{n}(\widehat{f}_{\rm ntk}({\bf x}_{i},\infty)-y_{i})^{2}}$ with the following identity:

$$\sqrt{\sum_{i=1}^{n}(\widehat{f}_{\rm ntk}({\bf x}_{i},\infty)-y_{i})^{2}}=\|\boldsymbol{\Theta}({\bf X},{\bf X})(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}-{\bf Y}\|=\|\lambda/\sigma^{2}_{0}\,(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}\|=\lambda/\sigma^{2}_{0}\sqrt{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-2}{\bf Y}}$$

Then we can further bound the error as

$$\frac{1}{n}\sum_{i=1}^{n}\ell(\widehat{f}_{\rm ntk}({\bf x}_{i}),y_{i})\leq\frac{1}{n}\sum_{i=1}^{n}|\widehat{f}_{\rm ntk}({\bf x}_{i})-y_{i}|\leq\frac{1}{\sqrt{n}}\sqrt{\sum_{i=1}^{n}|\widehat{f}_{\rm ntk}({\bf x}_{i})-y_{i}|^{2}}\leq\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-2}{\bf Y}}{n}}$$

(2) The next step is to calculate the KL divergence. According to the solution of the differential equation,

$$\boldsymbol{\mu}(t)-\boldsymbol{\mu}(0)=\phi_{\mu}({\bf X})^{\top}(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}\big({\bf I}-e^{-(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}{\bf I})t}\big){\bf Y},$$

so $t=T=\infty$ yields $\boldsymbol{\mu}(\infty)-\boldsymbol{\mu}(0)=\phi_{\mu}({\bf X})^{\top}(\boldsymbol{\Theta}({\bf X},{\bf X})+\frac{\lambda}{\sigma^{2}_{0}}{\bf I})^{-1}{\bf Y}$. Therefore, the KL divergence is

$$\text{KL}=\frac{1}{\sigma^{2}_{0}}{\bf Y}^{\top}(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}\boldsymbol{\Theta}({\bf X},{\bf X})(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}\leq\frac{1}{\sigma^{2}_{0}}{\bf Y}^{\top}(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}$$

Finally, we arrive at the PAC-Bayesian generalization bound

$$R_{\mathcal{D}}\leq\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\frac{\lambda}{\sigma^{2}_{0}}{\bf I})^{-1}{\bf Y}}{n\sigma^{2}_{0}}+\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\frac{\lambda}{\sigma^{2}_{0}}{\bf I})^{-2}{\bf Y}}{n}}+O\bigg(\frac{\log\frac{2\sqrt{n}}{\delta}}{n}\bigg).$$
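Both data-dependent terms of the bound are cheap to evaluate once $\boldsymbol{\Theta}$ is formed, which is what makes the bound usable as a hyperparameter-selection proxy. A hedged NumPy sketch (the kernel matrix and labels below are illustrative placeholders):

```python
import numpy as np

def pac_bayes_bound_terms(Theta, Y, lam, sigma_0):
    """First two terms of the PAC-Bayesian NTK bound (Equation 23)."""
    n = len(Y)
    M = Theta + (lam / sigma_0**2) * np.eye(n)
    MinvY = np.linalg.solve(M, Y)                       # (Theta + ridge I)^{-1} Y
    kl_term = (Y @ MinvY) / (n * sigma_0**2)            # Y^T M^{-1} Y / (n sigma_0^2)
    err_term = (lam / sigma_0**2) * np.sqrt(MinvY @ MinvY / n)  # uses Y^T M^{-2} Y
    return kl_term, err_term

rng = np.random.default_rng(5)
n = 100
A = rng.standard_normal((n, n))
Theta = A @ A.T / n                  # placeholder PSD kernel matrix
Y = rng.choice([-1.0, 1.0], size=n)  # placeholder labels
print(pac_bayes_bound_terms(Theta, Y, lam=0.01, sigma_0=1.0))
```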

Theorem D.2 (Rademacher bound with NTK).

Suppose the data $S=\{({\bf x}_{i},y_{i})\}_{i=1}^{n}$ are i.i.d. samples from a non-degenerate distribution $\mathcal{D}$, and $m\geq{\rm poly}(n,\lambda_{0}^{-1},\delta^{-1})$. Consider any loss function $\ell:\mathbb{R}\times\mathbb{R}\rightarrow[0,1]$. Then with probability at least $1-\delta$ over the random initialization and the training samples, the network trained by gradient descent for $T\geq\Omega(\frac{1}{\eta\lambda_{0}}\log\frac{n}{\delta})$ iterations has population risk $R_{\mathcal{D}}(Q)$ bounded as follows:

$$R_{\mathcal{D}}\leq\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}}{n}}+\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-2}{\bf Y}}{n}}+O\bigg(\sqrt{\frac{\log\frac{n}{\lambda_{0}\delta}}{n}}\bigg). \quad (24)$$
Proof.

In this proof, we use a Rademacher-complexity analysis. Let $\mathcal{H}$ be the reproducing kernel Hilbert space (RKHS) corresponding to the kernel $k(\cdot,\cdot)$. It is known that the RKHS norm of the function $f_{\rm ntk}({\bf x})=\boldsymbol{\Theta}^{\infty}({\bf x},{\bf X})\big(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I}\big)^{-1}{\bf Y}=\boldsymbol{\alpha}^{\top}k({\bf x},{\bf X})$ is $\|f_{\rm ntk}\|_{\mathcal{H}}=\sqrt{\boldsymbol{\alpha}^{\top}k({\bf X},{\bf X})\boldsymbol{\alpha}}$, where $k=\boldsymbol{\Theta}$ and $\boldsymbol{\alpha}=(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}$. Then we can bound $\|f_{\rm ntk}\|_{\mathcal{H}}$:

$$\|f_{\rm ntk}\|_{\mathcal{H}}=\sqrt{{\bf Y}^{\top}(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}\boldsymbol{\Theta}({\bf X},{\bf X})(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}}\leq\sqrt{{\bf Y}^{\top}(\boldsymbol{\Theta}({\bf X},{\bf X})+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}}$$

For the function class $\mathcal{F}_{B}=\{f({\bf x})=\boldsymbol{\alpha}^{\top}k({\bf x},{\bf X}):\|f\|_{\mathcal{H}}\leq B\}$, it is known that the empirical Rademacher complexity can be bounded as

$$\widehat{R}_{S}(\mathcal{F}_{B})=\frac{1}{n}\mathbb{E}\Big[\sup_{f\in\mathcal{F}_{B}}\sum_{i=1}^{n}\gamma_{i}f({\bf x}_{i})\Big]\leq\frac{B\sqrt{{\rm Tr}[k({\bf X},{\bf X})]}}{n}$$
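For the kernel class, the supremum has a closed form, $\sup_{\|f\|_{\mathcal{H}}\leq B}\sum_{i}\gamma_{i}f({\bf x}_{i})=B\sqrt{\boldsymbol{\gamma}^{\top}k({\bf X},{\bf X})\boldsymbol{\gamma}}$, so the bound is Jensen's inequality applied to $\mathbb{E}\sqrt{\boldsymbol{\gamma}^{\top}K\boldsymbol{\gamma}}\leq\sqrt{{\rm Tr}[K]}$. A quick Monte Carlo check (ours, with a placeholder PSD kernel matrix):

```python
import numpy as np

rng = np.random.default_rng(6)
n, B, trials = 50, 1.0, 20_000

A = rng.standard_normal((n, n))
K = A @ A.T / n  # placeholder PSD kernel matrix

gamma = rng.choice([-1.0, 1.0], size=(trials, n))  # Rademacher signs
sup_vals = B * np.sqrt(np.einsum("ti,ij,tj->t", gamma, K, gamma))

rad_emp = sup_vals.mean() / n
rad_bound = B * np.sqrt(np.trace(K)) / n
print(f"empirical Rademacher complexity: {rad_emp:.4f} <= bound {rad_bound:.4f}")
```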

Assume that ${\rm Tr}[k({\bf X},{\bf X})]\approx n$. Recall the standard generalization bound from Rademacher complexity: with probability at least $1-\delta$,

$$\sup_{f\in\mathcal{F}}\Big[\mathbb{E}_{\mathcal{D}}[\ell(f({\bf x}),y)]-\frac{1}{n}\sum_{i=1}^{n}\ell(f({\bf x}_{i}),y_{i})\Big]\leq 2\widehat{R}_{S}(\mathcal{F})+3\sqrt{\frac{\log(2/\delta)}{2n}}$$

Thus we have

$$R_{\mathcal{D}}\leq\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-1}{\bf Y}}{n}}+\frac{\lambda}{\sigma^{2}_{0}}\sqrt{\frac{{\bf Y}^{\top}(\boldsymbol{\Theta}+\lambda/\sigma^{2}_{0}\,{\bf I})^{-2}{\bf Y}}{n}}+O\bigg(\sqrt{\frac{\log\frac{n}{\lambda_{0}\delta}}{n}}\bigg).$$

Appendix E Additional Experiments

This section contains additional experimental results.

E.1 Training dynamics of wide probabilistic network

We conduct an experiment similar to the one in Section 5.2, but without incorporating the KL term in the objective function. The same behavior can be observed (Figure S1), further verifying our theoretical results.

E.2 Comparison of gradient norms with respect to $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$

In Lemma 4.9, we assume that the variance weight $\boldsymbol{\sigma}$ is fixed. To verify that this assumption is reasonable in practical training, we compare the gradient norms with respect to $\boldsymbol{\mu}$ and $\boldsymbol{\sigma}$. The result is shown in Figure S3: the gradient norm of $\nabla_{\boldsymbol{\mu}}f$ is much larger than that of $\nabla_{\boldsymbol{\sigma}}f$, which implies that $\boldsymbol{\sigma}$ is effectively fixed during gradient descent training.
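A minimal sketch of this comparison for a single reparameterized ReLU layer (our own toy reproduction, not the paper's experimental code); the gradients of the expected output are estimated by averaging over noise samples, mirroring the expectation in the objective:

```python
import numpy as np

rng = np.random.default_rng(7)
m, d, n_noise = 4096, 32, 256

mu = rng.standard_normal((m, d)) / np.sqrt(d)  # mean weights
sigma = 0.1 * np.ones((m, d))                  # std-dev weights
v = rng.choice([-1.0, 1.0], size=m)            # fixed output weights
x = rng.standard_normal(d); x /= np.linalg.norm(x)

g_mu = np.zeros((m, d)); g_sigma = np.zeros((m, d))
for _ in range(n_noise):
    eps = rng.standard_normal((m, d))
    act = ((mu + sigma * eps) @ x >= 0).astype(float)    # ReLU gates
    g_mu += (v * act)[:, None] * x / np.sqrt(m)          # df/dmu_r per Eq. (21)
    g_sigma += (v * act)[:, None] * (x * eps) / np.sqrt(m)  # df/dsigma_r per Eq. (21)
g_mu /= n_noise; g_sigma /= n_noise

print(f"||grad_mu f||    = {np.linalg.norm(g_mu):.4f}")
print(f"||grad_sigma f|| = {np.linalg.norm(g_sigma):.4f}")
```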

Figure S1: Relative Frobenius norm change in $\mu$ during training with an MSE loss that does not include the KL term, where $m$ is the width of the network.
Figure S2: Correlation between the aggregated proxy $\mathcal{PA}$ and the generalization bound.
Figure S3: Comparison between the gradients of the mean $\boldsymbol{\mu}$ and the standard deviations $\boldsymbol{\sigma}$.

E.3 Correlation between generalization bound proxy metric and generalization bound

In Figure 2, we observe a significant positive correlation between $\mathcal{PA}$ and the generalization bound across different values of a selected hyperparameter while the other hyperparameters are fixed. Furthermore, Figure S2 presents the correlation for aggregated values of $\rho_{0}$ and $\lambda$, in the setting where 50% of the data is used for prior training. We can clearly see that a lower $\mathcal{PA}$ corresponds to a lower bound, with a strong positive Kendall-tau correlation of 0.7.

E.4 Grid search

For selecting hyperparameters, we conduct a grid search over $\rho_{0}$, the proportion of data used for prior training, and the KL penalty $\lambda$. Notably, we sweep the proportion of prior-training data over [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9], since 0.2 is the minimum proportion required to obtain a reasonably low generalization bound (Dziugaite et al., 2020). For the rest, we sweep $\rho_{0}$ over [0.03, 0.05, 0.07, 0.09, 0.1, 0.3, 0.5, 0.7] for FCN ([0.05, 0.07, 0.09, 0.1, 0.3, 0.5, 0.7, 0.9] for CNN) and the KL penalty over [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1] for both architectures; a sketch of the sweep loop appears below.
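A minimal sketch of the sweep (the `train_and_bound` routine is a hypothetical placeholder for PAC-Bayesian training that returns the bound or its proxy $\mathcal{PA}$; the grids are the FCN values listed above):

```python
import itertools

rho0_grid = [0.03, 0.05, 0.07, 0.09, 0.1, 0.3, 0.5, 0.7]  # FCN grid
prior_frac_grid = [0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
kl_penalty_grid = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1]

def train_and_bound(rho0, prior_frac, kl_penalty):
    # Hypothetical placeholder: a real version would train the probabilistic
    # network with the PAC-Bayesian objective and return the computed bound.
    return rho0 + abs(prior_frac - 0.5) + abs(kl_penalty - 0.01)  # toy score

best = None
for rho0, frac, lam in itertools.product(rho0_grid, prior_frac_grid, kl_penalty_grid):
    bound = train_and_bound(rho0, frac, lam)
    if best is None or bound < best[0]:
        best = (bound, rho0, frac, lam)

print(f"best bound {best[0]:.4f} at rho0={best[1]}, prior frac={best[2]}, lambda={best[3]}")
```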