
On weight and variance uncertainty in neural networks for regression tasks

Moein Monemi [email protected] School of Engineering Science, College of Engineering, University of Tehran, Tehran, Iran Morteza Amini Corresponding author, e-mail: [email protected] Department of Statistics, School of Mathematics, Statistics, and Computer Science, College of Science, University of Tehran, Tehran, Iran S. Mahmoud Taheri [email protected] School of Engineering Science, College of Engineering, University of Tehran, Tehran, Iran Mohammad Arashi [email protected] Department of Statistics, Faculty of Mathematical Sciences, Ferdowsi University of Mashhad, Mashhad 9177948974, Iran
Abstract

We consider the problem of weight uncertainty proposed by [Blundell et al. (2015). Weight uncertainty in neural network. In International Conference on Machine Learning, 1613-1622, PMLR.] in neural networks (NNs) specialized for regression tasks. We further investigate the effect of variance uncertainty in their model. We show that including variance uncertainty can improve the prediction performance of the Bayesian NN. Variance uncertainty enhances the generalization of the model by considering the posterior distribution over the variance parameter. We examine the generalization ability of the proposed model using a function approximation example and further illustrate it with the riboflavin genetic data set. We explore fully connected dense networks and dropout NNs with Gaussian and spike-and-slab priors, respectively, for the network weights.

Keywords: Bayesian neural networks, variational Bayes, Kullback-Leibler divergence, posterior distribution, regression task, variance uncertainty.

1 Introduction

Bayesian Neural Networks (BNNs) have been introduced and comprehensively discussed by many authors (among others, see [14, 2]). BNNs are suitable for modeling uncertainty by considering values of the parameters that might not be learned from the available data. This is achieved by randomizing unknown parameters, such as weights and biases, while incorporating prior knowledge. Such an approach naturally regularizes the model and helps prevent overfitting, a common challenge in neural networks (NNs). BNNs can also be considered a suitable and more reliable alternative to ensemble learning methods [15], including bagging and boosting of neural networks.

The primary objective in BNNs is to estimate the posterior distribution of the parameters. However, obtaining the exact posterior is often infeasible due to the intractability of the integrals involved. Thus, it is necessary to use approximation techniques for the posterior density. One common approach is Markov Chain Monte Carlo (MCMC), including the Metropolis-Hastings algorithm [10], which generates samples from the posterior distribution by constructing a Markov chain, typically of first order.

Despite its effectiveness, flexibility, and strong theoretical foundations, MCMC has significant limitations in high-dimensional spaces, including slow convergence and high computational costs. Variational Bayes (VB) has emerged as a faster and more scalable approximation technique. First introduced by [11] and [19], VB seeks to approximate the intractable posterior distribution by identifying a simpler, parametric probability distribution that closely resembles the true posterior. The central idea in VB is to recast the problem of posterior inference as an optimization task. Specifically, VB minimizes the Kullback-Leibler (KL) divergence [13] between the approximate distribution and the true posterior [3]. This involves iteratively updating the hyper-parameters of the approximate distribution to improve its fit to the posterior while maintaining computational efficiency [1].

The VB method has limitations related to the conjugacy of the full conditionals of the parameters. In such cases, models that go beyond the conjugacy of the exponential family have been proposed [20]. Black Box Variational Inference (BBVI) [16] eliminates the need for conditional conjugacy by introducing a flexible framework for approximating posterior distributions. It leverages stochastic optimization techniques to estimate gradients and employs variance reduction strategies to enhance computational efficiency and stability. Bayes by Backprop (weight uncertainty), introduced by Blundell et al. [4], is another approach for performing Bayesian inference in NNs. This method leverages stochastic optimization and employs the re-parametrization trick to enable efficient sampling and gradient-based optimization of the posterior distributions of the NN weights. The model is used for classification and regression tasks by considering a suitable multinomial or Gaussian likelihood for the target variable, respectively.

Another approach for modeling uncertainty in the weights of a NN is considered in [7]. They modeled the variational approximation of the weights as the multiplication of a fixed hyperparameter (which might be optimized) by a Bernoulli random variable modeling the dropout mechanism. There are two main differences between the method considered in [7] and that in [4]. The first difference is that no statistical distribution is considered for modeling uncertainty in the weights around the hyperparameter in the method proposed in [7]. The second difference is the approach for modeling the dropout mechanism, since the dropout mechanism was modeled by considering the spike-and-slab prior for the weights of the NN in [4].

For the regression task, the variance of the Gaussian likelihood is assumed fixed in [4] and is determined by cross-validation. In this paper, we consider the problem of variance uncertainty in the model proposed by [4] to investigate its effect on the prediction performance of BNNs for regression tasks. Through a simulation study of a nonlinear function approximation problem, it is shown that Bayes by Backprop with variance uncertainty performs better than the model with fixed variance. We have also considered the riboflavin data set, a high-dimensional genetic data set, to evaluate the performance of the proposed model compared to the other methods, by applying the PCA-BNN and dropout-BNN methods. The dropout-BNN is applied by considering the spike-and-slab prior for the weights of the BNN. As the results presented in Section 6 show, our proposed method outperforms the other approaches in terms of MSPE and coverage probabilities for the regression curve, as well as MSPE for the riboflavin data set.

The remainder of the paper is organized as follows. Section 2 reviews the foundations of BNNs. The VB method for posterior approximation is introduced in Section 3. Section 4 describes the Bayes by Backprop method proposed by [4] for the regression task with fixed variance. In Section 5, we propose variance uncertainty for BNNs. The experimental results and the evaluation of the proposed model are presented in Section 6. Some concluding remarks are given in Section 7. The code for this study is available online on GitHub (https://github.com/mortamini/vbnet).

2 Bayesian Neural Networks

In this section, our goal is to describe BNNs and their challenges in practical applications. Before examining BNNs, we briefly describe NNs from a statistical perspective. A NN aims to approximate the unknown function y=\phi(x), where x and y are the input and the output vectors, respectively. From a statistical point of view, given the sample points (x,y), the objective function of a NN is the log-likelihood function of the weights and biases W and the extra parameters \rho of the model, \log L(W,\rho\mid x,y), which is maximized to learn the parameters of the network

\widehat{(W,\rho)}_{\text{\rm MLE}}=\arg\max_{(W,\rho)}\log L(W,\rho\mid x,y).

The likelihood function L(W,\rho\mid x,y) depends on the specific problem being addressed:

  • In regression tasks, the likelihood function is typically modeled by a Gaussian density for y, with mean given by the network output for the input data x and an unknown variance \sigma^{2}; the negative log-likelihood then corresponds to the sum of squared errors (SSE).

  • In classification problems, the likelihood function is often modeled as a categorical distribution, and the negative log-likelihood corresponds to the cross-entropy loss.

In a BNN, a prior distribution p(W,\rho) is considered for the parameters of the model [14], and our goal is to find the posterior distribution of all parameters, p(W,\rho\mid x,y). Using Bayes' formula, the posterior density is obtained as

p(W,\rho\mid x,y)=\frac{L(W,\rho\mid x,y)\,p(W,\rho)}{p(x,y)},

where p(x,y) is the marginal distribution of the data, given by

p(x,y)=\int L(W,\rho\mid x,y)\,p(W,\rho)\,d(W,\rho).

The point estimation of the model parameters might be computed by the maximum a posteriori (MAP) estimator as follows

\widehat{(W,\rho)}_{\text{\rm MAP}}=\arg\max_{(W,\rho)}\left\{\log L(W,\rho\mid x,y)+\log p(W,\rho)\right\}.

The posterior predictive distribution function, used to make predictions and to determine prediction intervals for new data, is defined as follows (for more details, see e.g., [14, 2])

p(y^{*}\mid x,y,x^{*})=\int_{\tilde{\Theta}}L(W,\rho\mid x^{*},y^{*})\,p(W,\rho\mid x,y)\,d(W,\rho).

In practice, the integral defining p(x,y) is intractable; therefore, we resort to approximation methods. In the following, we introduce VB as a well-known approximation technique.
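
Once an approximate posterior is available (for example, from the VB scheme introduced below), the predictive integral above is typically approximated by Monte Carlo over posterior draws. The following is a minimal Python sketch under that assumption; sample_posterior and network are hypothetical helpers that return a posterior draw of (W, \sigma^{2}) and evaluate \phi(x;W), respectively.

```python
import torch

def predictive_mc(x_star, sample_posterior, network, n_samples=200):
    """Monte Carlo approximation of the posterior predictive mean and a
    95% prediction interval for a new input x_star.

    `sample_posterior` is a hypothetical callable returning one posterior
    draw of (W, sigma2); `network(x, W)` is a hypothetical forward pass
    evaluating phi(x; W).  Both are placeholders, not the paper's code.
    """
    draws = []
    for _ in range(n_samples):
        W, sigma2 = sample_posterior()
        mean = network(x_star, W)
        # one draw from p(y* | x*, W, sigma2)
        draws.append(mean + torch.randn_like(mean) * sigma2.sqrt())
    draws = torch.stack(draws)                      # (n_samples, q)
    lower, upper = torch.quantile(draws, torch.tensor([0.025, 0.975]), dim=0)
    return draws.mean(dim=0), lower, upper
```

The empirical 2.5% and 97.5% quantiles of the simulated draws give an equal-tailed 95% prediction interval.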

3 Variational Bayes

VB is a method for finding an approximate distribution q(\theta) of the posterior distribution p(\theta\mid x) by minimizing the Kullback-Leibler divergence {\rm KL}\left[q(\theta)\,||\,p(\theta\mid x)\right] as a measure of closeness [18]. In this section, we briefly review the basic elements of the VB method.

Let x be a vector of observed data, and \theta be a parameter vector with joint distribution p(x,\theta). In the Bayesian inference framework, inference about \theta is based on the posterior distribution p(\theta\mid x)=p(x,\theta)/p(x), where p(x)=\int p(x,\theta)\,d\theta.

In the VB method, we assume the parameter vector \theta is divided into M partitions \{\theta_{1},\ldots,\theta_{M}\} and we want to approximate p(\theta\mid x) by

q(\theta)=\prod_{j=1}^{M}q_{j}(\theta_{j}),

where q_{j}(\cdot) is the variational posterior over the vector of posterior parameters \theta_{j}. Hence, each \theta_{j} can be a vector of parameters. The best VB approximation q^{\ast}(\cdot) is then obtained as

q^{\ast}=\arg\min_{q}{\rm KL}\left[q(\theta)\,||\,p(\theta\mid x)\right],

where

{\rm KL}\left[q(\theta)\,||\,p(\theta\mid x)\right]=\int q(\theta)\log\frac{q(\theta)}{p(\theta\mid x)}\,d\theta=\log p(x)-\int q(\theta)\log\frac{p(x,\theta)}{q(\theta)}\,d\theta. (1)

Let \eta=\{\eta_{1},\ldots,\eta_{M}\} be the corresponding partition of the hyper-parameters, such that each \eta_{j} is the variational parameter of \theta_{j}. In fixed-form VB (FFVB), we consider an assumed density function for each q(\theta_{j}\mid\eta_{j}), with unknown \eta_{j}, and we aim to determine the hyper-parameters \eta such that {\rm KL}\left[q(\theta\mid\eta)\,||\,p(\theta\mid x)\right] is minimized.

It is clear from (1) that minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO) with respect to \eta, defined as

{\rm ELBO}=\int q(\theta\mid\eta)\log\frac{p(x,\theta)}{q(\theta\mid\eta)}\,d\theta.

The ELBO equals \log p(x) when the KL divergence is zero, meaning a perfect fit; the closer {\rm ELBO}[q(\theta)] is to \log p(x) (i.e., the greater its value), the better the fit. Maximizing the ELBO is equivalent to minimizing its negative. It follows that

\eta^{*}=\arg\min_{\eta}\int q(\theta\mid\eta)\log\frac{q(\theta\mid\eta)}{p(x\mid\theta)p(\theta)}\,d\theta=\arg\min_{\eta}\mathbb{E}_{q(\theta\mid\eta)}\left[\log q(\theta\mid\eta)-\log p(x\mid\theta)-\log p(\theta)\right],

where \eta^{*} represents the optimal parameter estimate for maximizing the ELBO.

We define a loss function to be minimized. Assume that \epsilon is a random variable with probability density function q(\epsilon), and let f be a function. In this case, as shown in [4], if

q(\epsilon)\,d\epsilon=q(\theta\mid\eta)\,d\theta, (2)

then

\frac{\partial}{\partial\eta}\mathbb{E}_{q(\theta\mid\eta)}[f(\theta,\eta)]=\mathbb{E}_{q(\epsilon)}\left[\frac{\partial f(\theta,\eta)}{\partial\eta}\right]. (3)

Equation (3) shows that if we can find a transformation for \theta such that it depends on \epsilon and satisfies condition (2), then samples of \epsilon can be used to generate samples of \theta. This approach focuses on finding \eta^{*}, as defined above. Therefore, we define

f(\theta,\eta)=\log q(\theta\mid\eta)-\log p(x\mid\theta)-\log p(\theta). (4)

Note that the expectation of f(\theta,\eta) is the negative of the ELBO, and the gradient of f(\theta,\eta) provides an unbiased estimate of the gradient defined in equation (3). In addition, increasing the number of samples of \epsilon reduces the variance of the estimated gradient. This method uses equation (4) as the objective function to determine \eta^{*}.
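
To make the re-parametrization concrete, the following is a minimal Python sketch (assuming PyTorch) of a single-sample estimate of f(\theta,\eta) for a mean-field Gaussian q(\theta\mid\eta) with \theta=\mu+\log(1+\exp(\rho))\,\epsilon; the callables log_prior and log_likelihood are hypothetical placeholders for \log p(\theta) and \log p(x\mid\theta).

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def reparam_objective(mu, rho, log_prior, log_likelihood):
    """One-sample estimate of f(theta, eta) in equation (4), using the
    re-parametrization theta = mu + softplus(rho) * eps, eps ~ N(0, I),
    so that condition (2) holds and gradients flow through the sample.

    `log_prior(theta)` and `log_likelihood(theta)` are hypothetical
    callables supplied by the surrounding model.
    """
    sigma = F.softplus(rho)            # log(1 + exp(rho)) > 0
    eps = torch.randn_like(mu)
    theta = mu + sigma * eps           # re-parametrized sample of theta
    log_q = Normal(mu, sigma).log_prob(theta).sum()
    return log_q - log_likelihood(theta) - log_prior(theta)

# Automatic differentiation of this scalar with respect to (mu, rho) gives
# the unbiased gradient estimate of equation (3); averaging several eps
# draws reduces its variance.
```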

4 Variational Bayes Regression Network with Fixed Variance

In this section, we consider regression problems with constant variance. As mentioned in Section 3, equation (4) can be used as the objective function (the KL-based loss) that we are interested in minimizing.

Following the approach described in [4], the variance of the likelihood function in a regression problem can be fixed at a constant value. In regression problems, assuming independence among the output features (i.e., the components of the vector y_{i}), we have

y_{i}=\phi(x_{i};W)+\epsilon_{i},\quad\epsilon_{i}\sim\mathcal{N}(0,\sigma^{2}_{0}I_{q}),

where I_{q} is the identity matrix, x_{i}\in\mathbb{R}^{p} represents the i-th input variable, y_{i}\in\mathbb{R}^{q} denotes the i-th output variable, i=1,\ldots,n, and \sigma^{2}_{0} is a fixed known variance. Moreover, \phi(x_{i};W) represents the output of a NN with an arbitrary number of layers. In this case, we have the conditional distribution

y_{i}\mid x_{i},W\sim\mathcal{N}(\phi(x_{i};W),\sigma^{2}_{0}I_{q}).

If (x_{i},y_{i}) is the i-th observation, then the likelihood function, assuming independent observations, is given by

L(W\mid y,x)=p(y_{1},\dots,y_{n}\mid x_{1},\dots,x_{n},W)=\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\sigma_{0}^{2}}}\exp\left\{-\frac{1}{2\sigma_{0}^{2}}(y_{i}-\phi(x_{i};W))^{T}(y_{i}-\phi(x_{i};W))\right\}.

Thus, [4] considered the objective function

f(W,\eta)=\log q(W\mid\eta)-\log L(W\mid y,x)-\log p(W), (5)

with

q(W\mid\eta)=\mathcal{N}(W;\mu_{w},\log(1+\exp(\rho_{w}))),

and \eta=(\mu_{w},\rho_{w}), seeking the optimum value of \eta. We denote this method by VBNET-FIXED throughout the remainder of the paper. However, this approach fails to capture uncertainty in the variance parameter. In the forthcoming section, we discuss how to introduce uncertainty in the variance by using a re-parametrization, so that the model can be generalized by considering the posterior distribution over the variance parameter.
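
For illustration, the following Python sketch (assuming PyTorch) evaluates a one-sample estimate of objective (5) with a factorized Gaussian variational posterior, a zero-mean Gaussian prior, and a fixed noise variance; the forward pass phi(x, W) and the variance values are assumptions, not the exact implementation of [4].

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def vbnet_fixed_loss(x, y, mu_w, rho_w, phi, sigma0_sq=1.0, sigma_p_sq=1.0):
    """One-sample estimate of objective (5) with a fixed noise variance.

    `phi(x, W)` is a hypothetical forward pass of the network with the
    sampled weight vector W; sigma0_sq and sigma_p_sq are the fixed
    likelihood and prior variances (illustrative default values).
    """
    sigma_w = F.softplus(rho_w)                      # log(1 + exp(rho_w))
    W = mu_w + sigma_w * torch.randn_like(mu_w)      # sample W ~ q(W | eta)

    log_q = Normal(mu_w, sigma_w).log_prob(W).sum()                # log q(W | eta)
    log_prior = Normal(0.0, sigma_p_sq ** 0.5).log_prob(W).sum()   # log p(W)
    resid = y - phi(x, W)
    log_lik = Normal(0.0, sigma0_sq ** 0.5).log_prob(resid).sum()  # log L(W | y, x)

    return log_q - log_lik - log_prior
```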

5 Variational Bayes Regression Network with Variance Uncertainty

In this section, we suggest using a variational posterior distribution over the weights, the biases, and also the variance of the likelihood function. Our proposal can be considered a generalization of the model introduced in Section 4. In our approach, we assume that the posterior parameters are represented by \theta=(W,S), where W encompasses the weights and biases of the NN, and S denotes a parameter related to the variance of the likelihood function. Additionally, we define the variational parameters as \eta=(\sigma_{w}^{2},\mu_{w},\sigma_{L}^{2},\mu_{L}), where \sigma_{w}^{2} and \mu_{w} pertain to the variational distribution over W, and \sigma_{L}^{2} and \mu_{L} are associated with the variational distribution over S.

Recall that in a regression task we have

y_{i}=\phi(x_{i};W)+\varepsilon_{i}=\hat{y}_{i}+\varepsilon_{i},\qquad\varepsilon_{i}\sim N\!\left(0,\ \log(1+\exp(S))\,I_{q}\right),

that gives

y_{i}\mid x_{i}\sim N\!\left(\hat{y}_{i},\ \log(1+\exp(S))\,I_{q}\right).

Therefore, the likelihood function has the form

L(W,S\mid x,y)=\prod_{i=1}^{n}\frac{1}{\sqrt{2\pi\log(1+\exp(S))}}\exp\left\{-\frac{1}{2\log(1+\exp(S))}(y_{i}-\phi(x_{i};W))^{T}(y_{i}-\phi(x_{i};W))\right\}.

Based on [4], we assume that the variational posterior of all parameters is a diagonal Gaussian distribution, i.e., W\mid\mu_{w},\sigma_{w}^{2}\sim N(\mu_{w},\sigma_{w}^{2}I_{|W|}), where |W| denotes the number of weights and biases in the NN, and \sigma_{w}=\log(1+\exp(\rho_{w}))>0. Also let

S\mid\mu_{L},\sigma_{L}^{2}\sim N(\mu_{L},\sigma_{L}^{2}),

such that

\sigma_{L}^{2}=\log(1+\exp(\rho_{L})).

Further, assume \varepsilon_{w}\sim\mathcal{N}(0,I_{|W|}) and \varepsilon_{L}\sim\mathcal{N}(0,1). Then, the posterior samples of W and S are given by

W=\mu_{w}+\varepsilon_{w}\odot\log(1+\exp(\rho_{w})),
S=\mu_{L}+\varepsilon_{L}\log(1+\exp(\rho_{L})),

where \odot denotes element-wise multiplication. Here, \mu_{w} and \rho_{w} are the variational parameters associated with W, and \mu_{L} and \rho_{L} are the variational parameters associated with S.

Algorithm 1 Variational Inference Algorithm for the regression network with variance uncertainty
  1. Sample \varepsilon_{w}\sim\mathcal{N}(0,I_{|W|}) and \varepsilon_{L}\sim\mathcal{N}(0,1).
  2. Compute the posterior samples:
W=\mu_{w}+\varepsilon_{w}\odot\log(1+\exp(\rho_{w})),
S=\mu_{L}+\varepsilon_{L}\log(1+\exp(\rho_{L})).
  3. Define the objective functions:
f(W,\mu_{w},\sigma_{w}^{2})=\log q(W|\mu_{w},\sigma_{w}^{2})-\log\left[p(W,S)\,p(D|W,S)\right],
f(S,\mu_{L},\sigma_{L}^{2})=\log q(S|\mu_{L},\sigma_{L}^{2})-\log\left[p(W,S)\,p(D|W,S)\right].
  4. Calculate the gradient of f(W,\mu_{w},\sigma_{w}^{2}) with respect to \mu_{w} and \rho_{w}, and the gradient of f(S,\mu_{L},\sigma_{L}^{2}) with respect to \mu_{L} and \rho_{L}:
\nabla_{\mu_{w}}f(W,\mu_{w},\sigma_{w}^{2})=\frac{\partial f(W,\mu_{w},\sigma_{w}^{2})}{\partial\mu_{w}},\quad\nabla_{\rho_{w}}f(W,\mu_{w},\sigma_{w}^{2})=\frac{\partial f(W,\mu_{w},\sigma_{w}^{2})}{\partial\rho_{w}},
\nabla_{\mu_{L}}f(S,\mu_{L},\sigma_{L}^{2})=\frac{\partial f(S,\mu_{L},\sigma_{L}^{2})}{\partial\mu_{L}},\quad\nabla_{\rho_{L}}f(S,\mu_{L},\sigma_{L}^{2})=\frac{\partial f(S,\mu_{L},\sigma_{L}^{2})}{\partial\rho_{L}}.
  5. Update the variational parameters:
\mu_{w}\leftarrow\mu_{w}-\gamma_{w}\,\nabla_{\mu_{w}}f(W,\mu_{w},\sigma_{w}^{2}),\quad\rho_{w}\leftarrow\rho_{w}-\gamma_{w}\,\nabla_{\rho_{w}}f(W,\mu_{w},\sigma_{w}^{2}),
\mu_{L}\leftarrow\mu_{L}-\gamma_{L}\,\nabla_{\mu_{L}}f(S,\mu_{L},\sigma_{L}^{2}),\quad\rho_{L}\leftarrow\rho_{L}-\gamma_{L}\,\nabla_{\rho_{L}}f(S,\mu_{L},\sigma_{L}^{2}).
  6. Repeat from step 1 until the stopping condition is satisfied.

Algorithm 1 describes the VB algorithm for the regression network with variance uncertainty. We denote this method by VBNET-SVAR throughout the remainder of the paper.
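
A minimal Python sketch (assuming PyTorch) of one iteration of Algorithm 1 is given below. It assumes Gaussian priors over W and S with illustrative variances, a hypothetical forward pass phi(x, W) that applies the network with the sampled weight vector, and a standard torch optimizer in place of the explicit update rules of step 5; it is a sketch of the procedure, not the authors' implementation.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Normal

def vbnet_svar_step(x, y, mu_w, rho_w, mu_L, rho_L, phi, optimizer,
                    sigma_pw_sq=1.0, sigma_ps_sq=1.0):
    """One iteration (steps 1-5) of Algorithm 1 for VBNET-SVAR.

    `phi(x, W)` is a hypothetical network forward pass; the Gaussian priors
    over W and S use illustrative variances sigma_pw_sq and sigma_ps_sq.
    The two objectives of step 3 share the same sampled (W, S), so a single
    backward pass yields all four gradients required in step 4.
    """
    optimizer.zero_grad()

    # Steps 1-2: re-parametrized posterior samples of W and S.
    sigma_w = F.softplus(rho_w)
    W = mu_w + torch.randn_like(mu_w) * sigma_w
    sigma_L = F.softplus(rho_L)
    S = mu_L + torch.randn(()) * sigma_L

    # Step 3: variational log-densities, log-prior and log-likelihood.
    log_q = (Normal(mu_w, sigma_w).log_prob(W).sum()
             + Normal(mu_L, sigma_L).log_prob(S))
    log_prior = (Normal(0.0, sigma_pw_sq ** 0.5).log_prob(W).sum()
                 + Normal(0.0, sigma_ps_sq ** 0.5).log_prob(S))
    noise_var = F.softplus(S)                        # log(1 + exp(S))
    resid = y - phi(x, W)
    log_lik = Normal(0.0, noise_var.sqrt()).log_prob(resid).sum()

    loss = log_q - log_prior - log_lik               # negative ELBO estimate

    # Steps 4-5: gradients and parameter updates (here via a torch optimizer).
    loss.backward()
    optimizer.step()
    return loss.detach()
```

In practice, mu_w, rho_w, mu_L and rho_L would be tensors created with requires_grad=True and registered in, for example, torch.optim.Adam([mu_w, rho_w, mu_L, rho_L]); the separate learning rates \gamma_{w} and \gamma_{L} of Algorithm 1 can be mimicked with two parameter groups.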

6 Experimental Results

In this section, we evaluate the results of our proposed model on two regression problems under the following scenarios:

  • The normal prior, which corresponds to a normal distribution assuming independence among the parameters a priori, is given by

    p(W)=\frac{1}{\sqrt{2\pi\sigma_{p}^{2}}}\exp\left(-\frac{1}{2\sigma_{p}^{2}}\,\mathrm{vec}(W)^{\top}\mathrm{vec}(W)\right),

    where \sigma_{p}^{2} denotes the variance of the prior, and \mathrm{vec}(W) represents the vectorized parameters (weights and biases) of the network.

  • The spike-and-slab prior [8], which corresponds to a mixture of normal distributions assuming independence among the parameters a priori, is given by

    p(W)=\prod_{j}\left[z_{j}\mathcal{N}(W_{j};0,\sigma_{1}^{2})+(1-z_{j})\mathcal{N}(W_{j};0,\sigma_{2}^{2})\right],

    where W_{j} represents the j-th (weight or bias) parameter in the model, \sigma_{1}^{2} denotes the variance of the first mixture component, which is greater than \sigma_{2}^{2}, the variance of the second mixture component, and

    z_{j}\stackrel{\rm indep.}{\sim}{\rm Ber}(\pi).

    The second variance \sigma_{2}^{2} is assumed to be small (\sigma_{2}^{2}\approx 0), allowing the prior to concentrate samples around 0 with probability 1-\pi (a computational sketch of this prior follows the list).
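
In practice, the log of this prior is usually evaluated with the indicators z_{j} marginalized out, i.e., \log p(W)=\sum_{j}\log\left[\pi\mathcal{N}(W_{j};0,\sigma_{1}^{2})+(1-\pi)\mathcal{N}(W_{j};0,\sigma_{2}^{2})\right]. The following Python sketch (assuming PyTorch) computes this quantity; the values of \pi, \sigma_{1} and \sigma_{2} are illustrative, not the settings used in the experiments.

```python
import torch
from torch.distributions import Normal

def log_spike_and_slab(W, pi=0.5, sigma1=1.0, sigma2=1e-3):
    """Log of the spike-and-slab prior with the z_j marginalized out:
    log p(W) = sum_j log[pi * N(W_j; 0, sigma1^2) + (1 - pi) * N(W_j; 0, sigma2^2)].

    pi, sigma1 and sigma2 are illustrative values, not the settings used
    in the paper's experiments.
    """
    comp1 = Normal(0.0, sigma1).log_prob(W) + torch.log(torch.tensor(pi))
    comp2 = Normal(0.0, sigma2).log_prob(W) + torch.log(torch.tensor(1.0 - pi))
    # log-sum-exp over the two mixture components, then sum over parameters
    return torch.logsumexp(torch.stack([comp1, comp2]), dim=0).sum()
```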

We compare the performance of our proposed model with various models, including the model introduced in [4] and the frequentist NN. The results indicate that variance uncertainty can significantly enhance performance. It should be noted that in the nonlinear function estimation study, the fixed variance for the model proposed by [4] is set to the train-set MSE of the frequentist NN, while in the riboflavin data analysis the fixed variance is set to the maximum of 0.2\,{\rm var}(y_{\rm train}) and the train-set MSE of the frequentist NN, to prevent the overly small error caused by overfitting of the frequentist NN. These values were chosen by trial and error to obtain the best performance of the model of [4] (instead of a cross-validation approach).

6.1 Nonlinear function estimation

Assume that, in the regression-curve problem, the target variable is generated according to the following model:

y=x+2\sin\left(2\pi(x+\epsilon)\right)+2\sin\left(4\pi(x+\epsilon)\right)+\epsilon,

where \epsilon\sim N(0,0.02). To model out-of-sample unforeseen risks, the values of x for the train and test samples are generated from uniform distributions with supports [-0.1,0.6] and [-0.25,0.85], respectively. The train and test sets contain 800 and 200 samples, respectively.
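
A minimal Python sketch of this data-generating process is given below; reading the second argument of N(0,0.02) as the noise variance is our assumption.

```python
import numpy as np

def make_curve_data(n, low, high, noise_var=0.02, seed=None):
    """Generate (x, y) pairs from the nonlinear regression curve above;
    noise_var is read here as the variance of the N(0, 0.02) noise term."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(low, high, size=n)
    eps = rng.normal(0.0, np.sqrt(noise_var), size=n)
    y = x + 2 * np.sin(2 * np.pi * (x + eps)) + 2 * np.sin(4 * np.pi * (x + eps)) + eps
    return x, y

# Train on [-0.1, 0.6]; test on the wider support [-0.25, 0.85].
x_train, y_train = make_curve_data(800, -0.1, 0.6, seed=0)
x_test, y_test = make_curve_data(200, -0.25, 0.85, seed=1)
```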


Figure 1: The result of samples 1-4 for the regression curve simulation study.


Figure 2: The box plots of the MSPE for the regression curve simulation study.


Figure 3: The box plots of the coverage probabilities for the regression curve simulation study.

The prior p(W)=\mathcal{N}(W;0,\sigma_{p}^{2}I_{|W|}) is used for the VBNET-FIXED, while the prior p(W,S)=\mathcal{N}(W;0,\sigma_{pw}^{2}I_{|W|})\,\mathcal{N}(S;0,\sigma_{ps}^{2}) is considered for the VBNET-SVAR. Four competitive models are considered in the comparison study: VBNET-SVAR, VBNET-FIXED, the standard NN (NNET), and the Generalized Additive Model (GAM, [9]). A total of 10 replications is performed in this simulation study.

Figure 1 shows the estimated curves for all competitive models, in which the two Bayesian models also include 95% prediction intervals. Figure 2 presents box plots for the values of the Mean Squared Prediction Error (MSPE) for each model, defined as follows

{\rm MSPE}=\frac{1}{m}\sum_{i=1}^{m}(y_{i}-\hat{y}_{i})^{2},

where y_{i},\;i=1,\ldots,m, are the target values of the test set, and \hat{y}_{i} is the predicted value for y_{i}. From Figure 2, one can observe that the VBNET-SVAR model performs better than the other competitive models for out-of-sample prediction. Figure 3 shows the box plots of the coverage probabilities of the two Bayesian models on the test set, demonstrating that the VBNET-SVAR model provides higher coverage for out-of-sample prediction.
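
For reference, the two evaluation criteria can be computed as in the following Python sketch, where predictive_draws is an assumed array of posterior predictive Monte Carlo samples for the test points.

```python
import numpy as np

def mspe(y_true, y_pred):
    """Mean squared prediction error over the test set."""
    return np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)

def coverage(y_true, predictive_draws, level=0.95):
    """Empirical coverage of equal-tailed prediction intervals built from an
    assumed (n_samples, m) array of posterior predictive draws."""
    y_true = np.asarray(y_true)
    alpha = (1 - level) / 2
    lower = np.quantile(predictive_draws, alpha, axis=0)
    upper = np.quantile(predictive_draws, 1 - alpha, axis=0)
    return np.mean((y_true >= lower) & (y_true <= upper))
```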

6.2 Riboflavin dataset

Riboflavin is a data set with a small number of samples but many features, which makes it well suited to Bayesian models. The data set of riboflavin production by Bacillus subtilis [5] contains n = 71 observations of p = 4088 predictors (gene expressions) and a one-dimensional response (riboflavin production). From the 71 samples, 56 randomly drawn samples are considered as the train set, and the remaining samples are considered as the test set. We repeated this sampling 10 times to evaluate the performance of the competitor models on this data set. All networks were constructed using two hidden layers with 128 and 64 neurons, respectively.

Two scenarios are examined for fitting the NNs on the train set. In the first scenario (PCA-BNN), the first 25 principal components of the 4088 genes are used as the input of the neural network and of all other competitive models (a preprocessing sketch is given below). The second scenario (dropout-BNN) uses feature selection via the dropout mechanism, imposed on the BNNs by the spike-and-slab prior. For the second scenario, the GAM model is replaced by the sparse GAM model (GAMSEL) [6].
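
A possible implementation of the PCA-BNN preprocessing step with scikit-learn is sketched below; X_train and X_test are the assumed gene-expression matrices of the train and test splits, and fitting the scaler and PCA on the training split only is our assumption.

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_inputs(X_train, X_test, n_components=25):
    """Project standardized gene-expression matrices onto the first
    n_components principal components (fit on the training split only,
    which is our assumption, not something stated in the text)."""
    scaler = StandardScaler().fit(X_train)
    pca = PCA(n_components=n_components).fit(scaler.transform(X_train))
    Z_train = pca.transform(scaler.transform(X_train))   # inputs for the BNN
    Z_test = pca.transform(scaler.transform(X_test))
    return Z_train, Z_test
```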


Figure 4: The result of samples 1-4 for the riboflavin data set, with PCA-BNN scenario.


Figure 5: The box plots of the MSPE for the riboflavin data set, with PCA-BNN scenario.


Figure 6: The result of samples 1-4 for the riboflavin data set, with dropout-BNN scenario.


Figure 7: The box plots of the MSPE for the riboflavin data set, with dropout-BNN scenario.

Figures 4 and 6 show the predicted values for all competitive models for the PCA-BNN and dropout-BNN scenarios, respectively, including 95% prediction intervals for the BNNs. Figures 5 and 7 present box plots of the MSPE for each model, for the PCA-BNN and dropout-BNN scenarios, respectively. From Figures 5 and 7, one can observe that the VBNET-SVAR model outperforms the other competitive models for out-of-sample prediction.

7 Concluding remarks

In this paper, we considered variance uncertainty in neural networks specialized for regression tasks. Based on the MSPE and coverage probabilities, a simulation study of a simple function approximation problem and the analysis of a real genetic data set showed that the prediction performance of the BNN is improved by considering variance uncertainty. This method applies when no information is available regarding the variance of the likelihood function, which is common in many real-world applications. The model also generalizes BNNs for regression problems with fixed variance. The code for this study is available online at GitHub (https://github.com/mortamini/vbnet).

References

  • [1] Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., & Smola, A. J. (2012, February). Scalable inference in latent variable models. In Proceedings of the fifth ACM International Conference on Web Search and Data Mining (pp. 123-132).
  • [2] Bishop, C. M. (1997). Bayesian neural networks. Journal of the Brazilian Computer Society, 4, 61-68.
  • [3] Blei, D. M., Kucukelbir, A., & McAuliffe, J. D. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518), 859-877.
  • [4] Blundell, C., Cornebise, J., Kavukcuoglu, K., & Wierstra, D. (2015, June). Weight uncertainty in neural network. In International Conference on Machine Learning (pp. 1613-1622). PMLR.
  • [5] Bühlmann, P., Kalisch, M., & Meier, L. (2014). High-dimensional statistics with a view toward applications in biology. Annual Review of Statistics and Its Application, 1(1), 255-278.
  • [6] Chouldechova, A., & Hastie, T. (2015). Generalized additive model selection. arXiv preprint arXiv:1506.03850.
  • [7] Gal, Y., & Ghahramani, Z. (2016, June). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning (pp. 1050-1059). PMLR.
  • [8] George, E. I., & McCulloch, R. E. (1997). Approaches for Bayesian variable selection. Statistica Sinica, 7, 339-373.
  • [9] Hastie, T., & Tibshirani, R. (1987). Generalized additive models: some applications. Journal of the American Statistical Association, 82(398), 371-386.
  • [10] Hastings, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1), 97–109.
  • [11] Jordan, M. I., Ghahramani, Z., Jaakkola, T. S., & Saul, L. K. (1999). An introduction to variational methods for graphical models. Machine Learning, 37, 183-233.
  • [12] Jospin, L. V., Laga, H., Boussaid, F., Buntine, W., & Bennamoun, M. (2022). Hands-on Bayesian neural networks—A tutorial for deep learning users. IEEE Computational Intelligence Magazine, 17(2), 29-48.
  • [13] Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79-86.
  • [14] Neal, R. M. (1992). Bayesian training of backpropagation networks by the hybrid Monte Carlo method. Technical Report CRG-TR-92-1, Dept. of Computer Science, University of Toronto.
  • [15] Prince, S. J. (2023). Understanding deep learning. MIT press.
  • [16] Ranganath, R., Gerrish, S., & Blei, D. (2014, April). Black box variational inference. In Artificial Intelligence and Statistics (pp. 814-822). PMLR.
  • [17] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929-1958.
  • [18] Tran, M. N., Tseng, P., & Kohn, R. (2023). Particle Mean Field Variational Bayes. arXiv preprint arXiv:2303.13930.
  • [19] Wainwright, M. J., & Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2), 1-305.
  • [20] Zhang, C., Bütepage, J., Kjellström, H., & Mandt, S. (2018). Advances in variational inference. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(8), 2008-2026.