An Overview of Uncertainty Quantification Methods
for Infinite Neural Networks
Abstract
To better understand the theoretical behavior of large neural networks, several works have analyzed the case where a network’s width tends to infinity. In this regime, the effect of random initialization and the process of training a neural network can be formally expressed with analytical tools like Gaussian processes and neural tangent kernels. In this paper, we review methods for quantifying uncertainty in such infinite-width neural networks and compare their relationship to Gaussian processes in the Bayesian inference framework. We make use of several equivalence results along the way to obtain exact closed-form solutions for predictive uncertainty.
1 Introduction
Deep learning has brought breakthroughs across a variety of disciplines including image classification, speech recognition, and natural language processing [LeCun et al., 2015]. But despite their tremendous success, artificial neural networks often tend to be overconfident in their predictions, especially for out-of-distribution data.
Therefore, proper quantification of uncertainty is of fundamental importance to the deep learning community. It turns out that the infinite-width limit, i.e. the limit when the number of hidden units in a neural network approaches infinity, simplifies the analysis considerably, and is useful to draw conclusions for very large, but finite neural networks.
In this paper, we provide an overview of the most common techniques for quantifying uncertainty in deep learning and explore how these techniques are related in the infinite-width regime. In section 2, we introduce the fundamentals of these techniques, while in section 3, we outline their relation as well as several equivalence results.
2 Background
Throughout this section, we will consider a neural network with input $x \in \mathbb{R}^d$ and scalar output. Let $f(x; \theta)$ denote the neural network output function, where the weights and biases are summarized in the parameter vector $\theta$. The traditional training objective is to minimize the mean squared error (MSE) loss $\mathcal{L}$ over a dataset $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{n}$ of $n$ training points, i.e. to find an optimal $\hat{\theta}$ such that:
$$\hat{\theta} = \arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \bigl( f(x_i; \theta) - y_i \bigr)^2 \tag{1}$$
2.1 Gaussian Processes
One fundamental tool used in machine learning to quantify uncertainty is Gaussian processes (GPs). They are primarily encountered in regression problems, but they can be extended to classification or clustering tasks [Kapoor et al., 2010, Kim and Lee, 2007].
Simply put, a GP works as follows: (i) for any input $x$, it yields a probability distribution over the possible outcomes $f(x)$, and (ii) the function values at any finite collection of input points $x_1, \dots, x_n$ follow a multivariate Gaussian distribution with mean vector $\mu$ and covariance matrix $K$, where one usually assumes the data to be centered ($\mu = 0$). The matrix $K$ is determined by a kernel function $k(x, x')$ that allows us to incorporate prior knowledge about the shape and characteristics of the data distribution. We will denote a GP with kernel function $k$ applied to centered data as $\mathcal{GP}(0, k)$. The most commonly used kernel function is the squared exponential or radial basis function (RBF):
$$k(x, x') = \sigma^2 \exp\left( -\frac{\lVert x - x' \rVert^2}{2 \ell^2} \right) \tag{2}$$
where the variance $\sigma^2$ and length scale $\ell$ serve as hyperparameters to encode domain knowledge. For a new test point $x_*$, the prior distribution over the output $f(x_*)$ is given by:
$$f(x_*) \sim \mathcal{N}\bigl( 0,\; k(x_*, x_*) \bigr) \tag{3}$$
Then, the dataset $\mathcal{D}$ can be used as part of the Bayesian framework to update prior beliefs and obtain a posterior distribution by conditioning on the training points, as follows:
$$f(x_*) \mid \mathcal{D} \sim \mathcal{N}\Bigl( k(x_*, X)\, K^{-1} y,\; k(x_*, x_*) - k(x_*, X)\, K^{-1}\, k(X, x_*) \Bigr) \tag{4}$$

where $X$ collects the training inputs, $y$ the training targets, $K = k(X, X)$ is the kernel matrix on the training inputs, and $k(x_*, X)$ is the vector of kernel values between the test point and the training inputs.
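To make Equation 4 concrete, the following minimal sketch (in plain NumPy, with variable names of our own choosing and a small jitter term added for numerical stability) computes the posterior mean and variance for a one-dimensional toy problem using the RBF kernel of Equation 2.

```python
import numpy as np

def rbf_kernel(X1, X2, variance=1.0, lengthscale=1.0):
    """Squared exponential kernel of Equation 2 for 1-D inputs."""
    sq_dists = (X1[:, None] - X2[None, :]) ** 2
    return variance * np.exp(-0.5 * sq_dists / lengthscale**2)

def gp_posterior(X_train, y_train, X_test, jitter=1e-8):
    """Posterior mean and variance of a zero-mean GP (Equation 4)."""
    K = rbf_kernel(X_train, X_train) + jitter * np.eye(len(X_train))
    K_star = rbf_kernel(X_test, X_train)                  # k(x_*, X)
    K_ss = rbf_kernel(X_test, X_test)                     # k(x_*, x_*)
    mean = K_star @ np.linalg.solve(K, y_train)           # posterior mean
    cov = K_ss - K_star @ np.linalg.solve(K, K_star.T)    # posterior covariance
    return mean, np.diag(cov)

# Toy usage: condition on three observations of sin(x).
X_train = np.array([-2.0, 0.0, 1.5])
y_train = np.sin(X_train)
X_test = np.linspace(-3.0, 3.0, 7)
mean, var = gp_posterior(X_train, y_train, X_test)
```

Away from the training points, the predictive variance returns toward the prior variance, which is exactly the behavior one expects from a well-calibrated uncertainty estimate.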
2.2 Frequentist versus Bayesian Approach
The most common tools used to quantify uncertainty specifically in neural networks stem from two fundamentally different approaches. On the one hand, we can use the variance of the predictions of multiple neural networks trained from different random initializations. We refer to this frequentist type of ensemble method as a deep ensemble (DE). On the other hand, a Bayesian neural network (BNN) is the natural result of applying Bayesian inference to neural networks. In a BNN, we place priors – typically independent Gaussians – on the network weights and biases. As a result, the posterior predictive distribution gives a prediction interval instead of a single-point estimate.
BNNs used in practice have posterior distributions that are highly complex and multi-modal, challenging even the most sophisticated Markov chain Monte Carlo (MCMC) sampling methods. For that reason, practitioners oftentimes use mean-field variational inference (MFVI) to approximate the posterior with a factorized Gaussian, i.e. assume the posterior distributions over the network’s parameters to be independent.
2.3 Neural Linear Models
An alternative to MFVI that scales to larger datasets than BNNs and GPs is the neural linear model (NLM).
First introduced by Lázaro-Gredilla and Figueiras-Vidal [2010] as Marginalized Neural Networks and later re-introduced by Snoek et al. [2015], NLMs consist of a Bayesian linear regression performed on the last layer of a deep neural network. They resemble BNNs in structure, except only the final linear layer has priors on its coefficients. An example is shown in Figure 1.
Figure 1: Architecture of a neural linear model: a deterministic input layer and hidden layers followed by a single Bayesian output layer.
The first layers, whose weights are plain scalars rather than distributions, can be viewed as a feature map $\phi$ that transforms an input $x$ into features $\phi(x)$, which are then fed into the Bayesian linear regression. Because of this particular structure, training and inference with NLMs differ from the usual procedures and typically consist of two steps: first, the deterministic layers are trained using a point estimate of the output-layer weights; second, the posterior of that Bayesian layer is computed. In [Thakur et al., 2021], the authors propose an instance of this procedure that has the benefit of not underestimating uncertainty in data-scarce regions.
When all priors on the output weights are independent and identical Gaussian distributions, their posterior is a multivariate Gaussian whose mean $\mu$ and covariance matrix $\Sigma$ can be expressed in closed form in terms of the matrix $\Phi$ of training features $\phi(x_i)$:
$$\mu = \beta\, \Sigma\, \Phi^\top y \tag{5}$$
$$\Sigma = \left( \alpha I + \beta\, \Phi^\top \Phi \right)^{-1} \tag{6}$$
where $I$ is the identity matrix, and $\beta$ and $\alpha$ are prior hyperparameters equal to the inverse of the variance of the regression noise and of the weight variables, respectively. These expressions reveal that the main computational effort of inference with NLMs lies in the inversion of the matrix $\alpha I + \beta \Phi^\top \Phi$, whose dimension equals the number of neurons in the last hidden layer. Whereas GPs scale with the size of the training dataset, as explained later in subsubsection 3.2.1, NLMs scale with the width of the neural network's last hidden layer, which is often much smaller. This tractability and computational efficiency is the main reason NLMs have gained popularity.
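As a sketch of Equations 5 and 6 (assuming the deterministic layers have already been trained so that the feature matrix $\Phi$ is available; the function and variable names are ours), the Bayesian output layer reduces to a few lines of linear algebra:

```python
import numpy as np

def nlm_posterior(Phi, y, alpha=1.0, beta=25.0):
    """Posterior over the Bayesian output-layer weights (Equations 5 and 6).

    Phi:   (n, m) feature matrix produced by the deterministic layers
    alpha: prior weight precision (inverse prior weight variance)
    beta:  noise precision (inverse regression-noise variance)
    """
    m = Phi.shape[1]
    Sigma = np.linalg.inv(alpha * np.eye(m) + beta * Phi.T @ Phi)  # Eq. 6
    mu = beta * Sigma @ Phi.T @ y                                  # Eq. 5
    return mu, Sigma

def nlm_predict(phi_star, mu, Sigma, beta=25.0):
    """Predictive mean and variance for a single feature vector phi_star."""
    mean = phi_star @ mu
    var = 1.0 / beta + phi_star @ Sigma @ phi_star
    return mean, var

# Toy usage with a random stand-in for the trained feature matrix.
rng = np.random.default_rng(0)
Phi = rng.standard_normal((100, 20))   # n = 100 points, m = 20 features
y = rng.standard_normal(100)
mu, Sigma = nlm_posterior(Phi, y)
mean, var = nlm_predict(Phi[0], mu, Sigma)
```

Note that the only matrix inverted has the size of the last hidden layer, not of the dataset, which is the source of the scalability advantage discussed above.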
2.4 Neural Tangent Kernel
The last piece of background to understand is the theory describing the training process of a neural network with gradient descent. A helpful tool for that is the neural tangent kernel (NTK) introduced by Jacot et al. [2018]:
$$\Theta_t(x, x') = \nabla_\theta f(x; \theta_t)^\top\, \nabla_\theta f(x'; \theta_t) \tag{7}$$
In general, this kernel function depends on the random initialization and changes over time. However, Jacot et al. [2018] show that in the infinite-width limit, the NTK at initialization becomes deterministic and only depends on the activation function and network architecture. In this case, the NTK also stays constant during training for small enough learning rates: $\Theta_t = \Theta_0$. We call this deterministic and constant kernel $\Theta$. Yang [2020] extends these convergence results from multilayer perceptrons (MLPs) to other architectures including convolutional and batch-normalization layers.
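For illustration, the empirical NTK of Equation 7 can be assembled directly from the parameter gradients. The sketch below does this for a one-hidden-layer ReLU network without biases, using the $1/\sqrt{m}$ output scaling common in the NTK literature; the network and all function names are our own simplified example, not the construction of Jacot et al. [2018].

```python
import numpy as np

rng = np.random.default_rng(0)

def init_params(d, m):
    """Random initialization with unit-variance entries (NTK parameterization)."""
    W = rng.standard_normal((m, d))   # first-layer weights
    v = rng.standard_normal(m)        # output weights
    return W, v

def param_gradient(x, W, v):
    """Gradient of f(x) = v^T relu(W x) / sqrt(m) w.r.t. all parameters, flattened."""
    m = v.shape[0]
    pre = W @ x
    grad_v = np.maximum(pre, 0.0) / np.sqrt(m)                              # df/dv
    grad_W = (v[:, None] * (pre > 0)[:, None] * x[None, :]) / np.sqrt(m)    # df/dW
    return np.concatenate([grad_v, grad_W.ravel()])

def empirical_ntk(X, W, v):
    """Theta(x, x') = <grad_theta f(x), grad_theta f(x')> (Equation 7)."""
    grads = np.stack([param_gradient(x, W, v) for x in X])
    return grads @ grads.T

# Evaluate the empirical NTK on a few inputs for a narrow and a wide network.
X = rng.standard_normal((5, 3))
for m in (10, 10_000):
    W, v = init_params(d=3, m=m)
    Theta = empirical_ntk(X, W, v)
    # Repeating this with different seeds shows the kernel entries fluctuating
    # less and less as the width grows, in line with the deterministic limit.
```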
Now, we investigate how the NTK describes the change in the output function of a NN during training. Jacot et al. [2018] and Lee et al. [2019] show that for small enough learning rates, the NN prediction $f_t$ at training time $t$ follows an ordinary differential equation (ODE) that is characterized by the NTK:
$$\frac{\mathrm{d} f_t(x)}{\mathrm{d} t} = -\eta\, \Theta(x, X)\, \nabla_{f_t(X)} \mathcal{L} \tag{8}$$
Here, $\eta$ denotes the learning rate and $\nabla_{f_t(X)} \mathcal{L}$ describes the gradient of the loss with respect to the model's output on the training set $X$. Using an MSE loss for $\mathcal{L}$ and the fact that the NTK stays constant over time, Equation 8 simplifies to a linear ordinary differential equation with a known, exponentially decaying solution. The output $f_t(X)$ on the training dataset and the prediction $f_t(x_*)$ at an arbitrary test point $x_*$ at time $t$ are
$$f_t(X) = y + e^{-\eta\, \Theta(X, X)\, t} \bigl( f_0(X) - y \bigr) \tag{9}$$
$$f_t(x_*) = f_0(x_*) - \Theta(x_*, X)\, \Theta(X, X)^{-1} \bigl( I - e^{-\eta\, \Theta(X, X)\, t} \bigr) \bigl( f_0(X) - y \bigr) \tag{10}$$
This shows that the NN achieves zero training loss in the limit $t \to \infty$: because the NTK is a positive definite matrix, the exponential term converges to zero and $f_t(X) \to y$.
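The following short sketch evaluates the closed-form trajectory of Equation 9 for a synthetic positive definite matrix standing in for the NTK on the training inputs (all names and numbers are purely illustrative):

```python
import numpy as np
from scipy.linalg import expm  # matrix exponential

def ntk_training_trajectory(Theta, f0, y, eta, t):
    """Closed-form output on the training set at time t (Equation 9):
    f_t(X) = y + exp(-eta * Theta * t) (f_0(X) - y)."""
    return y + expm(-eta * Theta * t) @ (f0 - y)

# Toy usage with a random positive definite stand-in for the NTK matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Theta = A @ A.T + 1e-3 * np.eye(4)   # positive definite
f0 = rng.standard_normal(4)          # predictions at initialization
y = rng.standard_normal(4)           # training targets
for t in (0.0, 1.0, 100.0):
    f_t = ntk_training_trajectory(Theta, f0, y, eta=0.1, t=t)
    # The training residual ||f_t - y|| shrinks toward zero as t grows.
```

As $t$ grows, the residual $f_t(X) - y$ is damped exponentially along every eigendirection of $\Theta$, which is the formal statement behind the zero-training-loss claim.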
3 Behavior at Infinite Width
In this section, we summarize the behavior of NNs in the infinite-width limit, i.e. we are interested in the limit as the number of hidden units goes to infinity.
3.1 Priors in Infinite Neural Networks
Under the condition that the network parameters follow independent and identically distributed (iid) priors with zero mean and finite variance and the activation function of the last hidden layer is bounded, the prior predictive distribution converges to a GP with zero mean and finite variance [Neal, 1996]. The intuition is as follows: for a fixed input $x$, every hidden unit output $h_j(x)$ is a random variable with finite mean and variance because the activation function is bounded. When multiplied by the final-layer weights $v_j$, the products $v_j h_j(x)$ are iid random variables with zero mean and finite variance. According to the central limit theorem, their sum converges to a Gaussian distribution. If we rescale the weight variances according to the network width, the network output converges to a Gaussian distribution with finite variance. As this holds true for every finite collection of inputs, the neural network is equivalent to a GP, called the neural network Gaussian process (NNGP). A sampled function from this NNGP is equivalent to the output function of a randomly initialized NN and to a prior predictive sample of a BNN.
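This central-limit argument is easy to verify numerically. The sketch below (our own toy setup with tanh activations and Gaussian priors whose variances are rescaled by the width) samples many randomly initialized one-hidden-layer networks and inspects the output distribution at a fixed input:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_outputs(x, width, n_networks=2_000, sigma_w=1.0, sigma_v=1.0):
    """Outputs of randomly initialized one-hidden-layer tanh networks at a fixed x.

    Weight variances are rescaled by fan-in so the output variance stays
    finite as the width goes to infinity."""
    d = x.shape[0]
    W = rng.normal(0.0, sigma_w / np.sqrt(d), size=(n_networks, width, d))
    v = rng.normal(0.0, sigma_v / np.sqrt(width), size=(n_networks, width))
    h = np.tanh(W @ x)              # hidden activations, shape (n_networks, width)
    return np.sum(v * h, axis=1)    # one scalar output per sampled network

x = np.array([0.3, -1.2, 0.7])
for width in (3, 30, 1000):
    f = sample_outputs(x, width)
    # The empirical mean stays near zero and the distribution of f becomes
    # increasingly Gaussian (e.g. its excess kurtosis shrinks) as width grows.
    print(width, f.mean(), f.var())
```

Repeating this jointly for several inputs yields the multivariate Gaussian that defines the NNGP.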
While Neal [1996] gives a proof of existence, i.e. that a NN converges to a GP in the infinite-width limit, Neal does not provide an analytic form of the covariance function $k$. Williams [1997] builds on that work and provides closed-form kernel functions for several choices of activation functions and prior distributions on the network parameters.
3.2 Bayesian Inference
So far we have not considered any data but made predictions based on randomly drawn initial parameters. In the following sections, we discuss three approaches to incorporate the training data. First, we update the GP prior kernel by conditioning on the training data to obtain a GP posterior kernel. Second, we use gradient-based methods to train the NN. Third, we make the neural network Bayesian and approximate the BNN posterior.
3.2.1 Gaussian Process Regression
To go from the prior to the posterior distribution in a GP, we condition the GP prior with the NNGP kernel on the dataset $\mathcal{D}$. Following the Bayesian framework, the result is the posterior GP as defined in Equation 4.
A major problem of GPs is their computational effort, which is cubic in the number of training points $n$. Standard methods therefore enable exact inference for at most a few thousand training points, which makes GPs unsuitable for many real-world applications that involve very large datasets. Even though exact GP inference has recently been scaled to a million data points by GPU parallelization [Wang et al., 2019], NNs provide a more accessible framework in the big data regime.
3.2.2 Neural Network Training
NNs are trained with gradient-based optimizers such as full-batch gradient descent or its extensions, such as stochastic gradient descent and the Adam optimizer. Here, we consider two training methods that lead to different theoretical conclusions.
Weakly Trained NN
First, we only train the last layer of our NN. This means all other layers act as random feature extractors that remain unchanged during training. Lee et al. [2019] show that for the MSE loss function and a sufficiently small learning rate, the network output during training interpolates between the GP prior and the GP posterior, and asymptotically converges to the posterior of the NNGP. In other words, training only the last layer of a randomly initialized NN is equivalent to sampling a function from the GP posterior obtained by conditioning the NNGP prior kernel on the dataset, as the sketch below illustrates for a finite-width network.
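The finite-width analogue of this equivalence can be checked directly by training only the last layer with gradient descent on the MSE loss and comparing the result with the closed-form GP-posterior-sample expression that uses the empirical feature kernel $k(x, x') = \phi(x)^\top \phi(x')$. The sketch below is our own illustration with a fixed random ReLU feature map; it is not the infinite-width construction itself.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed random feature extractor: an untrained ReLU hidden layer.
d, m, n = 2, 200, 5
W = rng.standard_normal((m, d)) / np.sqrt(d)

def features(X):
    """Random ReLU feature map phi(x), kept frozen during training."""
    return np.maximum(X @ W.T, 0.0) / np.sqrt(m)

X_train = rng.uniform(-1.0, 1.0, size=(n, d))
y_train = rng.standard_normal(n)
x_test = rng.uniform(-1.0, 1.0, size=(1, d))

Phi, phi_star = features(X_train), features(x_test)
v0 = rng.standard_normal(m)                       # random last-layer initialization
f0_train, f0_test = Phi @ v0, phi_star @ v0

# (a) Train only the last layer by full-batch gradient descent on the MSE.
v, lr = v0.copy(), 0.1
for _ in range(100_000):
    v -= lr * Phi.T @ (Phi @ v - y_train)

# (b) Closed-form GP-posterior sample with the empirical feature kernel,
#     treating f0 as the prior function sample.
K = Phi @ Phi.T
f_gp = f0_test + phi_star @ Phi.T @ np.linalg.solve(K, y_train - f0_train)

# The two predictions agree up to optimization error.
print(phi_star @ v, f_gp)
```

Averaging such predictions over many random initializations recovers the GP posterior mean with the feature kernel.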
Fully Trained NN
If we use the gradient descent algorithm on the entire network, the NTK describes the training behavior. After convergence to zero training loss, the $t \to \infty$ limit of Equation 10 describes the learned function $f_\infty$. As the initial function $f_0$ can be regarded as a sample from the NNGP $\mathcal{GP}(0, k)$, we can write $f_\infty \sim \mathcal{GP}(\mu_{\mathrm{DE}}, \Sigma_{\mathrm{DE}})$ with (the notation "+ h.c." below means "plus Hermitian conjugate", as in [Lee et al., 2019]):
$$\mu_{\mathrm{DE}}(x_*) = \Theta_{*X}\, \Theta_{XX}^{-1}\, y \tag{11}$$
$$\Sigma_{\mathrm{DE}}(x_*, x_*') = k_{**} + \Theta_{*X}\, \Theta_{XX}^{-1}\, k_{XX}\, \Theta_{XX}^{-1}\, \Theta_{X*} - \bigl( \Theta_{*X}\, \Theta_{XX}^{-1}\, k_{X*} + \text{h.c.} \bigr) \tag{12}$$
where we use subscripts for the kernel arguments in the interest of space, with $*$ denoting test points and $X$ the training inputs. Lee et al. [2019] observed that this deep ensemble GP does not correspond to a proper Bayesian posterior. This means that, in contrast to the weakly trained NNs, random initialization followed by gradient descent training of the full network does not give valid posterior predictive function samples. Hence, an ensemble of such NNs does not accurately approximate the posterior.
Bayesian Deep Ensemble
To address this, He et al. [2020] propose a slight modification to the gradient descent training. With that, they arrive at what they call the neural tangent kernel Gaussian process (NTKGP) with
$$\mu_{\mathrm{NTKGP}}(x_*) = \Theta_{*X}\, \Theta_{XX}^{-1}\, y \tag{13}$$
$$\Sigma_{\mathrm{NTKGP}}(x_*, x_*') = \Theta_{**} - \Theta_{*X}\, \Theta_{XX}^{-1}\, \Theta_{X*} \tag{14}$$
Note that this corresponds to a GP posterior as defined in Equation 4. In comparison to the NNGP posterior we obtain when conditioning the NNGP prior on the data, or when training only the last layer of the NN, the prior covariance function here is not the NNGP kernel $k$ but the NTK $\Theta$.
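Given the NNGP kernel $k$ and the NTK $\Theta$ evaluated on training and test inputs (for example from the empirical estimator sketched earlier), both predictive distributions reduce to a few matrix operations. The helper functions below are our own naming and assume the kernel blocks are passed in as dense matrices.

```python
import numpy as np

def de_gp(K_xx, K_sx, K_ss, T_xx, T_sx, y):
    """Deep ensemble GP moments after full training (Equations 11 and 12).

    K_* are blocks of the NNGP kernel k, T_* are blocks of the NTK Theta;
    'x' indexes training inputs, 's' test inputs."""
    A = T_sx @ np.linalg.inv(T_xx)                     # Theta_{*X} Theta_{XX}^{-1}
    mean = A @ y                                       # Eq. 11
    cross = A @ K_sx.T                                 # Theta_{*X} Theta_{XX}^{-1} k_{X*}
    cov = K_ss + A @ K_xx @ A.T - (cross + cross.T)    # Eq. 12
    return mean, cov

def ntkgp(T_xx, T_sx, T_ss, y):
    """Bayesian deep ensemble (NTKGP) posterior moments (Equations 13 and 14)."""
    A = T_sx @ np.linalg.inv(T_xx)
    mean = A @ y                                       # Eq. 13
    cov = T_ss - A @ T_sx.T                            # Eq. 14
    return mean, cov
```

The NTKGP covariance has exactly the form of a GP posterior (Equation 4) with prior kernel $\Theta$, whereas the deep ensemble covariance of Equation 12 mixes $k$ and $\Theta$, which is why it does not correspond to a proper Bayesian posterior.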
3.2.3 BNN Posterior Approximation
Instead of using an ensemble of randomly initialized NNs, we can place a prior distribution on the network weights to obtain a BNN. However, approximation methods struggle to accurately represent the complex BNN posterior. Coker et al. [2021] show that in the infinite-width limit, the commonly used MFVI approximation fails to learn the data: the posterior predictive mean for any input converges to zero, regardless of the training data. Their proof makes specific assumptions on the divergence measure (the Kullback–Leibler (KL) divergence) and on the activation function, but they also give empirical evidence for further activation functions.
4 Conclusion
This work provides an overview of the different methods used for quantifying uncertainty in infinite neural networks, and shows how to obtain analytic expressions for both prior and posterior predictive distributions for that purpose.
While the prior predictive can simply be modeled as a GP, we have outlined three ways to obtain proper posterior predictives: using GP regression, weakly trained NNs, or Bayesian deep ensembles, where the latter two turn out to be equivalent to GP regression with particular covariance (kernel) functions.
References
- Coker et al. [2021] Beau Coker, Weiwei Pan, and Finale Doshi-Velez. Wide mean-field variational Bayesian neural networks ignore the data. arXiv preprint arXiv:2106.07052, 2021.
- He et al. [2020] Bobby He, Balaji Lakshminarayanan, and Yee Whye Teh. Bayesian deep ensembles via the neural tangent kernel. arXiv preprint arXiv:2007.05864, 2020.
- Jacot et al. [2018] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. arXiv preprint arXiv:1806.07572, 2018.
- Kapoor et al. [2010] Ashish Kapoor, Kristen Grauman, Raquel Urtasun, and Trevor Darrell. Gaussian processes for object categorization. International journal of computer vision, 88(2):169–188, 2010.
- Kim and Lee [2007] Hyun-Chul Kim and Jaewook Lee. Clustering based on Gaussian processes. Neural computation, 19(11):3088–3107, 2007.
- Lázaro-Gredilla and Figueiras-Vidal [2010] Miguel Lázaro-Gredilla and Aníbal R Figueiras-Vidal. Marginalized neural network mixtures for large-scale regression. IEEE transactions on neural networks, 21(8):1345–1351, 2010.
- LeCun et al. [2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436–444, 2015.
- Lee et al. [2019] Jaehoon Lee, Lechao Xiao, Samuel Schoenholz, Yasaman Bahri, Roman Novak, Jascha Sohl-Dickstein, and Jeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradient descent. Advances in neural information processing systems, 32:8572–8583, 2019.
- Neal [1996] Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996.
- Snoek et al. [2015] Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Prabhat, and Ryan Adams. Scalable Bayesian optimization using deep neural networks. In International conference on machine learning, pages 2171–2180. PMLR, 2015.
- Thakur et al. [2021] Sujay Thakur, Cooper Lorsung, Yaniv Yacoby, Finale Doshi-Velez, and Weiwei Pan. Uncertainty-aware (UNA) bases for Bayesian regression using multi-headed auxiliary networks, 2021.
- Wang et al. [2019] Ke Wang, Geoff Pleiss, Jacob Gardner, Stephen Tyree, Kilian Q Weinberger, and Andrew Gordon Wilson. Exact Gaussian processes on a million data points. Advances in Neural Information Processing Systems, 32:14648–14659, 2019.
- Williams [1997] Christopher KI Williams. Computing with infinite networks. Advances in neural information processing systems, pages 295–301, 1997.
- Yang [2020] Greg Yang. Tensor programs ii: Neural tangent kernel for any architecture. arXiv preprint arXiv:2006.14548, 2020.