
A Bayesian Convolutional Neural Network-based Generalized Linear Model

Yeseul Jeon1, Won Chang3,4, Seonghyun Jeong1,2, Sanghoon Han5, and Jaewoo Park1,2,∗
1Department of Statistics and Data Science, Yonsei University, Seoul, South Korea
[email protected]
2Department of Applied Statistics, Yonsei University, Seoul, South Korea
3Division of Statistics and Data Science, University of Cincinnati, Cincinnati, Ohio, U.S.A.
4Department of Statistics, Seoul National University, Seoul, South Korea
5Department of Psychology, Yonsei University, Seoul, South Korea
Abstract

Convolutional neural networks (CNNs) provide flexible function approximations for a wide variety of applications when the input variables are in the form of images or spatial data. Although CNNs often outperform traditional statistical models in prediction accuracy, statistical inference, such as estimating the effects of covariates and quantifying the prediction uncertainty, is not trivial due to the highly complicated model structure and overparameterization. To address this challenge, we propose a new Bayesian approach by embedding CNNs within the generalized linear model (GLM) framework. We use extracted nodes from the last hidden layer of a CNN with Monte Carlo (MC) dropout as informative covariates in a GLM. This improves accuracy in prediction and regression coefficient inference, allowing for the interpretation of coefficients and uncertainty quantification. By fitting ensemble GLMs across multiple realizations from MC dropout, we can account for uncertainties in extracting the features. We apply our methods to biological and epidemiological problems that have both high-dimensional correlated inputs and vector covariates. Specifically, we consider malaria incidence data, brain tumor image data, and fMRI data. By extracting information from correlated inputs, the proposed method can provide an interpretable Bayesian analysis. The algorithm is broadly applicable to image regression and correlated data analysis, enabling fast and accurate Bayesian inference.

keywords:
Bayesian deep learning; feature extraction; Monte Carlo dropout; posterior approximation; uncertainty quantification

1 Introduction

With recent advances in data collection technologies, it is increasingly common to have images as part of the collected data. Specifically, we consider three important applications: (1) Malaria data contain spatially correlated confirmed cases; detecting disease hotspots and studying spatial patterns of incidence are crucial public health problems. (2) Brain tumor data contain magnetic resonance imaging (MRI) images and texture-based statistics; developing automated decision rules about brain tumor status can be beneficial to practitioners. (3) The anxiety disorder example contains functional magnetic resonance imaging (fMRI) data with psychological assessments collected from survey questions; analyzing the relationship between brain images and survey responses is of substantial interest in psychology. In all three cases, we have both correlated matrix images and vector covariates, which are challenging to analyze simultaneously. Traditional statistical approaches face inferential issues when dealing with input image variables due to high dimensionality and multicollinearity. Convolutional neural networks (CNNs) are popular approaches for image processing and can alleviate such multicollinearity issues (Krizhevsky et al., 2012). However, standard CNNs cannot quantify uncertainties, which are often of primary interest in statistical inference. Furthermore, the complex structure of a CNN can lead to nonconvex optimization, which cannot guarantee the convergence of the weight parameters (Alzubaidi et al., 2021). We propose a computationally efficient Bayesian approach by embedding CNNs within generalized linear models (GLMs). Our method can take advantage of both statistical models and advanced neural networks (NNs).

There is a growing literature on Bayesian deep learning methods (Neal, 2012) that quantify uncertainty by regarding the weight parameters in deep neural networks (DNNs) as random variables. With a choice of priors, we have posterior distributions for standard Bayesian inference. Bayesian alternatives can be more robust than standard DNNs for small sample sizes (Gal and Ghahramani, 2016a, b). Despite these apparent advantages, Bayesian DNNs have been used much less than their classical counterparts due to the difficulty of exploring the full posterior distributions. To circumvent this difficulty, variational Bayes (VB) methods have been proposed to approximate complex high-dimensional posteriors; examples include VB for DNNs (Tran et al., 2020) and CNNs (Shridhar et al., 2019). Although VB can alleviate computational costs, to the best of our knowledge, no existing approach provides a general Bayesian framework to study image and vector-type variables simultaneously. Furthermore, most previous works have focused on prediction uncertainty but not on estimation uncertainty of model parameters; interpretation of regression coefficients has not been studied much in the Bayesian deep learning context.

We propose a Bayesian convolutional generalized linear model (BayesCGLM) that can study different types of variables simultaneously in various applications. First, we train a Bayesian CNN (BayesCNN) (Gal and Ghahramani, 2016a) with the correlated input (images or correlated variables) and output. Then, we extract Monte Carlo (MC) samples of features from the last layer of the fitted BayesCNN. Lastly, we fit a GLM by regressing the output variable on the covariates augmented with the features obtained from each MC sample. For each MC sample of the features, we fit an individual GLM in parallel and aggregate the resulting posteriors. These steps allow us to account for uncertainties in estimating the features. We also study an often overlooked challenge in Bayesian deep learning: practical issues in obtaining credible intervals of parameters and assessing whether they provide nominal coverage.

In our method, we model BayesCNN via a deep Gaussian process (GP) (Damianou and Lawrence, 2013), which is a nested GP. GPs (Rasmussen, 2003) are widely used to represent complex functional relationships due to their flexibility. However, due to its nested functional compositions, Bayesian inference for deep GPs is computationally expensive. Recently, Sauer et al. (2023) approximated each layer of the deep GP through the Vecchia approximation, which scales linearly. From a somewhat different perspective, Gal and Ghahramani (2016a, b) developed VB for deep GPs based on MC approximations. They showed that NNs with dropout applied after every hidden layer are equivalent to an approximation of the deep GP posterior. We build upon their approach to sample features from the approximate posterior of the last layer in BayesCNN. Training BayesCNN with the dropout approximation does not require additional computing costs compared to its frequentist counterpart. There have been several recent proposals to extract features from the hidden layers of NN models (Tran et al., 2020; Bhatnagar et al., 2022; Fong and Xu, 2021), and our method is motivated by them. The features can be regarded as reduced-dimensional summary statistics that measure complex nonlinear relationships between the high-dimensional input and the output. We observe that CNNs are especially useful for extracting information from correlated inputs. There have also been a number of recent proposals for forecasting spatio-temporal processes through ensembles of echo state networks (McDermott and Wikle, 2017, 2019). Similarly, Daw and Wikle (2022) developed a random ensemble deep spatial approach that combines a deep neural model with random weights. Our work has similarities to these previous attempts in that we aggregate fitted models obtained from each MC sample of the features. Such ensemble techniques can improve prediction over a single model and quantify uncertainties (Sagi and Rokach, 2018).

In addition to methodological improvements, our method can contribute to scientific advancement. We study the relationships between the features and response variables via principal component analysis (PCA), providing valuable insights. Furthermore, by providing adequate uncertainty quantification in both estimation and prediction, our method can aid statistically sound scientific decisions.

The outline of the remainder of this paper is as follows. In Section 2, we introduce a Bayesian DNN based on GP approximation and its estimation procedure. In Section 3, we propose an efficient BayesCGLM and provide implementation details. In Section 4, we apply BayesCGLM to several real-world data sets. We conclude with a discussion in Section 5.

2 A Bayesian Deep Neural Network: Gaussian Process Approximation

Here, we provide a brief overview of Bayesian DNNs and their approximations to deep GPs. Let $\{(\mathbf{x}_{n},\mathbf{y}_{n}),\,n=1,\cdots,N\}$ be the observed input–response pairs. A standard DNN is used to model the relationship between the input $\mathbf{x}_{n}$ and the output $\mathbf{o}_{n}\in\mathbb{R}^{d}$, which can be regarded as an expectation of $\mathbf{y}_{n}$ given $\mathbf{x}_{n}$ from a statistical viewpoint. Suppose we consider a DNN with $L$ layers, where the $l$th layer has $k_{l}$ nodes for $l=1,\cdots,L$. Then we can define the weight matrix $\mathbf{W}_{l}\in\mathbb{R}^{k_{l}\times k_{l-1}}$ that connects the $(l-1)$th layer to the $l$th hidden layer and the bias vector $\mathbf{b}_{l}\in\mathbb{R}^{k_{l}}$. Then $\bm{\theta}=\{(\mathbf{W}_{l},\mathbf{b}_{l}),\,l=1,\cdots,L\}$ is the set of parameters in the DNN, belonging to a parameter space $\bm{\Theta}$.

To model complex data, the number of NN parameters grows tremendously with increasing $l$ and $k_{l}$. Such large networks are computationally intensive and can lead to overfitting issues. Dropout (Srivastava et al., 2014) is a simple but effective strategy to address both problems. The term “dropout” refers to randomly removing connections from nodes with some probability during training. The structure of NNs with dropout is

\[
\mathbf{o}_{n}=\sigma_{L}\Big(\mathbf{W}_{L}\sigma_{L-1}\Big(\cdots\sigma_{3}\Big(\mathbf{W}_{3}\sigma_{2}\Big(\mathbf{W}_{2}\sigma_{1}\Big(\mathbf{W}_{1}\mathbf{x}_{n}+\mathbf{b}_{1}\Big)\circ\mathbf{d}_{2}+\mathbf{b}_{2}\Big)\circ\mathbf{d}_{3}+\mathbf{b}_{3}\Big)\cdots\Big)\circ\mathbf{d}_{L}+\mathbf{b}_{L}\Big), \tag{1}
\]

where $\sigma_{l}(\cdot)$ is an activation function; for example, the rectified linear unit (ReLU) or the hyperbolic tangent (TanH) function can be used. Here, $\mathbf{d}_{l}\in\mathbb{R}^{k_{l}}$, $l=2,\cdots,L$, are dropout vectors whose elements independently follow a Bernoulli distribution with a pre-specified success probability $\psi_{l}$. The notation “$\circ$” indicates the elementwise multiplication of vectors. As in Gal and Ghahramani (2016b), we apply dropout to the output of every $\sigma_{l}(\cdot)$ (i.e., the nonlinear feature). Since $\sigma_{l}(\cdot)$ has $\mathbf{W}_{l}$ and $\mathbf{b}_{l}$ in its argument, both weight and bias terms are regularized. Although dropout can alleviate overfitting issues, quantifying uncertainties in predictions or parameter estimates remains challenging for the frequentist approach.
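To make the mechanics of (1) concrete, the following is a minimal NumPy sketch of a stochastic forward pass with dropout applied to every hidden nonlinear feature; the layer sizes, activation choice, and keep probability are illustrative and not tied to any model used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def forward_with_dropout(x, weights, biases, keep_prob=0.9, rng=rng):
    """One stochastic forward pass of a dense network, with dropout applied
    to each hidden nonlinear feature as in equation (1)."""
    h = x
    L = len(weights)
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        h = W @ h + b
        if l < L:                          # hidden layers: activation + dropout
            h = relu(h)
            d = rng.binomial(1, keep_prob, size=h.shape)
            h = h * d
    return h                               # output layer kept linear here

# toy network: input dim 5, two hidden layers of width 8, scalar output
dims = [5, 8, 8, 1]
weights = [rng.normal(size=(dims[i + 1], dims[i])) * 0.3 for i in range(3)]
biases = [np.zeros(dims[i + 1]) for i in range(3)]
x = rng.normal(size=5)

# repeating the stochastic pass gives a Monte Carlo dropout ensemble of outputs
outputs = np.array([forward_with_dropout(x, weights, biases) for _ in range(100)])
print(outputs.mean(), outputs.std())       # predictive mean and spread
```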

To address this difficulty, Bayesian methods have been proposed by considering probabilistic alternatives to NNs. The Bayesian approaches consider an arbitrary function that is likely to generate the data. Specifically, deep GPs have been widely used to represent complex NN structures for both supervised and unsupervised learning problems (Gal and Ghahramani, 2016b; Damianou and Lawrence, 2013). Define $\mathbf{f}_{n,1}=\mathbf{W}_{1}\mathbf{x}_{n}+\mathbf{b}_{1}\in\mathbb{R}^{k_{1}}$ and $\mathbf{f}_{n,l}=\mathbf{W}_{l}\bm{\phi}_{n,l-1}+\mathbf{b}_{l}\in\mathbb{R}^{k_{l}}$ for $l\geq 2$, where $\bm{\phi}_{n,l}=\sigma_{l}(\mathbf{f}_{n,l})\in\mathbb{R}^{k_{l}}$ is the nonlinear output feature from the $l$th layer. For Bayesian inference, we set independent normal priors $p(\mathbf{W}_{l})$ and $p(\mathbf{b}_{l})$ for the weight and bias parameters, respectively. We denote the set of linear features $\mathbf{f}_{n,l}$ as $\mathbf{F}_{l}=\{\mathbf{f}_{n,l}\}_{n=1}^{N}\in\mathbb{R}^{N\times k_{l}}$ and the $k$th column of $\mathbf{F}_{l}$ as $\mathbf{F}^{(k)}_{l}$ $(k=1,\cdots,k_{l})$. The columns of $\mathbf{F}_{l}$ are assumed to be independent of each other given $\mathbf{F}_{l-1}$. Then, the deep GP model can be represented as

\[
\begin{split}
\mathbf{F}^{(k)}_{l}\,|\,\mathbf{F}_{l-1}&\sim N(0,\widehat{\mathbf{\Sigma}}_{l}),\quad l=2,\dots,L,\\
\mathbf{y}_{n}\,|\,\mathbf{f}_{n,L-1}&\sim p(\mathbf{y}_{n}\,|\,\mathbf{f}_{n,L-1}).
\end{split} \tag{2}
\]

Here the empirical covariance matrix $\widehat{\mathbf{\Sigma}}_{l}\in\mathbb{R}^{N\times N}$ is

\[
\widehat{\mathbf{\Sigma}}_{l}=\frac{1}{k_{l}}\,\sigma_{l}(\bm{\Phi}_{l-1}\mathbf{W}^{\top}_{l}+\mathbf{b}_{l})\,\sigma_{l}(\bm{\Phi}_{l-1}\mathbf{W}^{\top}_{l}+\mathbf{b}_{l})^{\top}, \tag{3}
\]

where $\bm{\Phi}_{l}$ is the collection of feature vectors $\{\bm{\phi}_{n,l}\}_{n=1}^{N}\in\mathbb{R}^{N\times k_{l}}$. This empirical covariance matrix is obtained through Monte Carlo integration, where each column of $\mathbf{W}_{l}$ and each element of $\mathbf{b}_{l}$ are sampled from $p(\mathbf{W}_{l})$ and $p(\mathbf{b}_{l})$, respectively. According to Gal and Ghahramani (2016b), (3) becomes an accurate approximation of the true covariance matrix as $k_{l}$ increases. This implies that an arbitrary data-generating function is modeled through a nested GP with the covariance of the extracted features from the previous hidden layer. In (2), $\mathbf{y}_{n}$ is a response variable with distribution $p(\cdot\,|\,\mathbf{f}_{n,L-1})$ belonging to the exponential family with a continuously differentiable link function $g$ such that $g(E[\mathbf{y}_{n}\,|\,\bm{\phi}_{n,L-1}])=\mathbf{W}_{L}\bm{\phi}_{n,L-1}+\mathbf{b}_{L}$. Note that $g^{-1}(\cdot)$ can be regarded as $\sigma_{L}(\cdot)$ in (1). As in typical mixed effects models, conditional independence of $\{\mathbf{y}_{n}\}_{n=1}^{N}$ is assumed.
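As a small illustration of how the empirical covariance in (3) is formed, the sketch below draws one set of weights and biases from standard normal priors and computes the resulting $N\times N$ matrix; the dimensions, priors, and ReLU activation are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k_prev, k_l = 50, 10, 200        # observations, widths of layers l-1 and l

Phi_prev = rng.normal(size=(N, k_prev))   # feature matrix Phi_{l-1} from layer l-1

# one draw of the weights and biases from (standard normal) priors
W_l = rng.normal(size=(k_l, k_prev))
b_l = rng.normal(size=k_l)

Feat = np.maximum(Phi_prev @ W_l.T + b_l, 0.0)   # sigma_l(Phi_{l-1} W_l^T + b_l) with ReLU
Sigma_hat = Feat @ Feat.T / k_l                  # empirical covariance in (3), N x N
print(Sigma_hat.shape)                           # (50, 50)
```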

Since Bayesian inference for deep GPs using Markov chain Monte Carlo (MCMC) is not straightforward to implement, Gal and Ghahramani (2016a) developed VB for deep GPs based on MC approximation, which is called MC dropout. They used a normal mixture distribution as a variational distribution $q(\bm{\theta})$ to approximate the posterior distribution $\pi(\bm{\theta}\,|\,\{\mathbf{x}_{n},\mathbf{y}_{n}\}_{n=1}^{N})$ of the deep GP. Specifically, the variational distributions are defined as

\[
\begin{split}
q(\mathbf{W}_{l})&=\prod_{\forall i,j}q(w_{l,ij}),\qquad q(\mathbf{b}_{l})=\prod_{\forall i}q(b_{l,i}),\\
q(w_{l,ij})&=p_{l}\,N(\mu^{w}_{l,ij},\sigma^{2})+(1-p_{l})\,N(0,\sigma^{2}),\\
q(b_{l,i})&=p_{l}\,N(\mu^{b}_{l,i},\sigma^{2})+(1-p_{l})\,N(0,\sigma^{2}),
\end{split} \tag{4}
\]

where $w_{l,ij}$ is the $(i,j)$th element of the weight matrix $\mathbf{W}_{l}\in\mathbb{R}^{k_{l}\times k_{l-1}}$ and $b_{l,i}$ is the $i$th element of the bias vector $\mathbf{b}_{l}\in\mathbb{R}^{k_{l}}$. For the weight parameters, $\mu^{w}_{l,ij}$ and $\sigma^{2}$ are variational parameters that control the mean and spread of the distributions, respectively. As the inclusion probability $p_{l}\in[0,1]$ becomes close to 0, $q(w_{l,ij})$ approaches $N(0,\sigma^{2})$, indicating that the weight parameter is likely to be dropped (i.e., $w_{l,ij}=0$). Similarly, the variational distribution for the bias parameters is modeled with a normal mixture distribution. Then, the optimal variational parameters are obtained by minimizing the Kullback–Leibler (KL) divergence between $q(\bm{\theta})$ and $\pi(\bm{\theta}\,|\,\{\mathbf{x}_{n},\mathbf{y}_{n}\}_{n=1}^{N})$, where minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO), $\mbox{E}_{q}[\log\pi(\bm{\theta},\{\mathbf{x}_{n},\mathbf{y}_{n}\}_{n=1}^{N})]-\mbox{E}_{q}[\log q(\bm{\theta})]$. With the independent variational distribution $q(\bm{\theta}):=\prod_{l=1}^{L}q(\mathbf{W}_{l})q(\mathbf{b}_{l})$, the log ELBO of the deep GP is
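The sketch below shows how a weight matrix could be drawn from the mixture variational distribution in (4); the variational means, inclusion probability, and (small) standard deviation are placeholders. As $\sigma\to 0$, the draw reduces to multiplying the mean weights by Bernoulli masks, which is exactly dropout.

```python
import numpy as np

rng = np.random.default_rng(2)
k_l, k_prev = 8, 5
p_l, sigma = 0.9, 1e-3                         # inclusion probability and mixture sd

mu_W = rng.normal(size=(k_l, k_prev))          # variational means mu^w_{l,ij}
z = rng.binomial(1, p_l, size=(k_l, k_prev))   # 1 = keep the connection, 0 = drop it
W_sample = z * mu_W + sigma * rng.normal(size=(k_l, k_prev))   # one draw from q(W_l)
```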

\[
\begin{split}
\mathcal{L}_{\text{GP-VI}}&:=\sum_{n=1}^{N}\int\cdots\int\prod_{l=1}^{L}q(\mathbf{W}_{l})q(\mathbf{b}_{l})\log p(\mathbf{y}_{n}\,|\,\mathbf{x}_{n},\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L})\,d\mathbf{W}_{1}d\mathbf{b}_{1}\cdots d\mathbf{W}_{L}d\mathbf{b}_{L}\\
&\quad-\text{KL}\Big(\prod_{l=1}^{L}q(\mathbf{W}_{l})q(\mathbf{b}_{l})\,\Big{|}\Big{|}\,p(\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L})\Big). \tag{5}
\end{split}
\]

To emphasize the dependence on $\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L}$, we write $p(\mathbf{y}_{n}\,|\,\mathbf{f}_{n,L-1})$ in (2) as $p(\mathbf{y}_{n}\,|\,\mathbf{x}_{n},\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L})$, which is identical. Since the direct maximization of (5) is challenging due to the intractable integration, Gal and Ghahramani (2016b) replaced it with the MC approximation

\[
\begin{split}
\mathcal{L}_{\text{GP-MC}}&:=\frac{1}{M}\sum_{m=1}^{M}\sum_{n=1}^{N}\log p(\mathbf{y}_{n}\,|\,\mathbf{x}_{n},\{\mathbf{W}_{l}^{(m)},\mathbf{b}_{l}^{(m)}\}_{l=1}^{L})\\
&\quad-\text{KL}\Big(\prod_{l=1}^{L}q(\mathbf{W}_{l})q(\mathbf{b}_{l})\,\Big{|}\Big{|}\,\prod_{l=1}^{L}p(\mathbf{W}_{l})p(\mathbf{b}_{l})\Big), \tag{6}
\end{split}
\]

where $\{\{\mathbf{W}_{l}^{(m)},\mathbf{b}_{l}^{(m)}\}_{l=1}^{L}\}_{m=1}^{M}$ are MC samples from the variational distribution in (4).

Note that the MC samples are generated for each stochastic gradient descent (SGD) update, and the estimates from $\mathcal{L}_{\text{GP-MC}}$ would converge to those obtained from $\mathcal{L}_{\text{GP-VI}}$ (Paisley et al., 2012; Rezende et al., 2014). Furthermore, Gal and Ghahramani (2016b) showed that (6) converges to the frequentist loss function of a DNN with dropout when $\sigma$ in (4) is small and $k_{l}$ becomes large. This result implies that standard DNNs with dropout applied after every hidden layer are equivalent to an approximation of the deep GP posterior. We provide mathematical details in Web Appendix A.1.

In many applications, the prediction of unobserved responses is of great interest. Consider an unobserved response $\mathbf{y}_{n}^{\ast}\in\mathbb{R}^{d}$ with an input $\mathbf{x}_{n}^{\ast}\in\mathbb{R}^{k_{0}}$. For a given $\mathbf{x}_{n}^{\ast}$, we can obtain $\bm{\phi}^{\ast}_{n,l}\in\mathbb{R}^{k_{l}}$ from the approximate posterior predictive distribution based on MC dropout. Although Gal and Ghahramani (2016b) focused on predicting $\mathbf{y}_{n}^{\ast}$, this procedure also allows us to construct informative and low-dimensional summary statistics for $\mathbf{x}_{n}^{\ast}$. Specifically, we can sample an output feature vector $\bm{\phi}^{\ast}_{n,L-1}\in\mathbb{R}^{k_{L-1}}$, which summarizes the complex nonlinear dependence relationships between input and output variables up to the last layer of the NN. In Section 3, we use these extracted features as additional covariates in the GLM framework.
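As an illustration of how MC samples of the last-hidden-layer features can be drawn in practice, the following sketch uses TensorFlow/Keras with dropout kept active at prediction time (training=True); the toy data, layer widths, and dropout rate are assumptions, not the architectures used later in the paper.

```python
import numpy as np
from tensorflow import keras

# toy regression data and a small network with dropout after each hidden layer
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5)).astype("float32")
y = (X.sum(axis=1) + rng.normal(size=200)).astype("float32")

inputs = keras.Input(shape=(5,))
h = keras.layers.Dense(16, activation="relu")(inputs)
h = keras.layers.Dropout(0.1)(h)
h = keras.layers.Dense(8, activation="relu", name="last_hidden")(h)
h = keras.layers.Dropout(0.1)(h)
outputs = keras.layers.Dense(1)(h)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)

# sub-model mapping the input to the last hidden layer
feature_model = keras.Model(inputs, model.get_layer("last_hidden").output)

# M stochastic forward passes; training=True keeps the dropout masks random
M = 100
Phi_samples = np.stack(
    [feature_model(X, training=True).numpy() for _ in range(M)], axis=0
)
print(Phi_samples.shape)   # (M, 200, 8): MC samples of the extracted features
```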

3 Flexible Generalized Linear Models with Convolutional Neural Networks

As a variant of DNNs, CNNs have been widely used in computer vision and image classification (Krizhevsky et al., 2012; Shridhar et al., 2019). CNNs can effectively extract important features from images or correlated datasets with nonlinear dependencies. In our scientific applications, we have spatially correlated data for malaria incidence, MRI images for brain tumors, and an fMRI correlation matrix of the brain for anxiety disorder. Given that our applications involve images or correlated datasets, CNNs are well suited for capturing structural dependence by extracting local information through convolution layers.

We propose a Bayesian convolutional generalized linear model (BayesCGLM) based on two appealing methodologies: (1) CNNs have been widely used to utilize correlated predictors (or input variables) with spatial structures, such as images or geospatial data; (2) GLMs can study the relationship between covariates and non-Gaussian responses by extending the linear regression framework. Let $\mathbf{D}=\{\mathbf{X},\mathbf{Y},\mathbf{Z}\}$ be the observed dataset, where $\mathbf{X}=\{\mathbf{X}_{n}\}_{n=1}^{N}$ is the collection of correlated inputs (i.e., each $\mathbf{X}_{n}$ is a matrix or tensor) and $\mathbf{Y}=\{\mathbf{y}_{n}\}_{n=1}^{N}$ is the corresponding output. We also let $\mathbf{Z}=\{\mathbf{z}_{n}\}_{n=1}^{N}$ be a collection of $p$-dimensional vector covariates. We use BayesCNN to extract the features from $\mathbf{X}$ and employ them in a GLM with the covariates $\mathbf{Z}$ to predict $\mathbf{Y}$.

Such an approach is useful in that the feature is optimized for the prediction of $\mathbf{Y}$. Since the feature is guided by $\mathbf{Y}$, we can construct a map from the features to the output. Similar to our approach, Bhatnagar et al. (2022) also extracted features through a DNN and used the features to fit their model. They showed that this can improve the prediction performance of the model. In Section 4, we observe that the feature serves as an important summary statistic for predicting the response $\mathbf{Y}$. Compared to Bhatnagar et al. (2022), we also consider the uncertainties in estimating the features using MC dropout.

3.1 Bayesian Convolutional Neural Network

A Bayesian perspective can quantify uncertainties through posterior densities of $\bm{\theta}$. Specifically, Gal and Ghahramani (2016a) proposed an efficient Bayesian alternative by regarding the weight parameters in CNN kernels as random variables. By extending a result in Gal and Ghahramani (2016b), Gal and Ghahramani (2016a) showed that applying dropout after every convolution layer can approximate the intractable posterior distribution of a deep GP.

Let $\mathbf{X}_{n}\in\mathbb{R}^{M_{0}\times R_{0}\times C_{0}}$ be an input with height $M_{0}$, width $R_{0}$, and $C_{0}$ channels. For example, $M_{0}$ and $R_{0}$ are determined by the resolution of the images, and $C_{0}=3$ for color images ($C_{0}=1$ for black-and-white images). There are three types of layers in a CNN: convolution, pooling, and fully connected layers. Convolution layers move kernels (filters) over the input image and extract information, resulting in a “feature map”. Each cell in the feature map is obtained from element-wise multiplication between the image and the filter matrix. The weight values in the filter matrices are parameters to be estimated. Higher output values indicate stronger signals on the image, which can account for spatial structures. We can extract important neighborhood features from the input by shifting the kernels over all pixel locations with a certain step size (stride). Then, pooling layers are applied to downsample the feature maps. The convolution and pooling layers are applied repeatedly to reduce the dimension of the input image. Finally, the extracted features from the convolution/pooling layers are flattened and connected to the output layer through a fully connected layer, which can learn nonlinear relationships between the features and the output.

Consider a CNN with $L$ layers, where the last two are the fully connected and output layers. For the $l$th convolution layer, there are $C_{l}$ kernels $\mathbf{K}_{l,c}\in\mathbb{R}^{H_{l}\times D_{l}\times C_{l-1}}$ with height $H_{l}$, width $D_{l}$, and $C_{l-1}$ channels for $c=1,\cdots,C_{l}$. Note that the dimension of the channel is determined by the number of kernels in the previous convolution layer. After applying the $l$th convolution layer, the $(m,r)$th element of the feature matrix $\bm{\eta}_{n,(l,c)}\in\mathbb{R}^{M_{l}\times R_{l}}$ becomes

\[
[\bm{\eta}_{n,(l,c)}]_{m,r}=\sigma\Big(\sum_{i=1}^{H_{l}}\sum_{j=1}^{D_{l}}\sum_{k=1}^{C_{l-1}}\big([\mathbf{K}_{l,c}]_{i,j,k}\,[\bm{\eta}_{n,(l-1,c)}]_{m+i-1,r+j-1,k}\big)+b_{l,c}\Big), \tag{7}
\]

where $b_{l,c}$ is the bias parameter. For the first convolution layer, $\bm{\eta}_{n,(l-1,c)}$ is replaced with the input image $\mathbf{X}_{n}$ (i.e., $\bm{\eta}_{n,(0,c)}=\mathbf{X}_{n}$). For computational efficiency and stability, ReLU functions are widely used as the activation function $\sigma(\cdot)$. By shifting kernels, the convolution layer can capture neighborhood information, and the neighborhood structure is determined by the size of the kernels. Gal and Ghahramani (2016a) pointed out that extracting the features from the convolution operation can be reformulated as an affine transformation in a standard DNN (details are in Web Appendix A.2). Therefore, CNNs can also be represented as a deep GP with the dropout approximation; applying dropout after every convolution layer before pooling can approximate a deep GP.
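A sketch of a CNN with Bernoulli dropout applied after every convolution layer (before pooling), written in Keras; the filter counts, kernel sizes, dropout rate, and the sigmoid output for a binary response are illustrative choices rather than the exact architecture used in the applications.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_bayescnn(input_shape=(240, 240, 1), n_features=16, rate=0.1):
    """Toy Bayesian CNN: dropout follows each convolution layer and the
    last hidden (fully connected) layer that yields the extracted features."""
    inputs = keras.Input(shape=input_shape)
    h = layers.Conv2D(32, (3, 3), activation="relu")(inputs)
    h = layers.Dropout(rate)(h)                       # dropout before pooling
    h = layers.MaxPooling2D((2, 2))(h)
    h = layers.Conv2D(64, (3, 3), activation="relu")(h)
    h = layers.Dropout(rate)(h)
    h = layers.MaxPooling2D((2, 2))(h)
    h = layers.Flatten()(h)
    features = layers.Dense(n_features, activation="relu", name="last_hidden")(h)
    features = layers.Dropout(rate)(features)
    outputs = layers.Dense(1, activation="sigmoid")(features)   # binary response
    return keras.Model(inputs, outputs)

model = build_bayescnn()
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```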

Note that optimal hyperparameter tuning in NNs requires a number of empirical fits with different hyperparameter values. Wang et al. (2019) pointed out that a higher dropout rate can lead to a slower convergence rate, while a lower rate can deteriorate generalization performance. Although Gal and Ghahramani (2016b, a) used dropout rates of around 0.1–0.5, such values do not always guarantee optimal performance. Therefore, we recommend conducting empirical studies by varying dropout rates within this range for validation. In Web Appendix B, we observe that our method is robust across different dropout rates.

After the $(L-2)$th convolution layer, the extracted feature tensor $\bm{\eta}_{n,L-2}\in\mathbb{R}^{M_{L-2}\times R_{L-2}\times C_{L-2}}$ is flattened in the fully connected layer. By vectorizing $\bm{\eta}_{n,L-2}$, we can represent the feature tensor as a feature vector $\bm{\phi}_{n,L-2}=\operatorname{vec}(\bm{\eta}_{n,L-2})\in\mathbb{R}^{k_{L-2}}$, where $k_{L-2}=M_{L-2}R_{L-2}C_{L-2}$. Finally, we can extract a feature vector $\bm{\phi}_{n,L-1}\in\mathbb{R}^{k_{L-1}}$ from the fully connected layer. From here, we define the collection of the extracted feature vectors as $\bm{\Phi}=\{\bm{\phi}_{n,L-1}\}_{n=1}^{N}$; we suppress the index $L-1$ in $\bm{\Phi}_{L-1}$ for notational convenience.

The training step of BayesCNN is identical to that of a standard CNN. In the prediction step, the algorithm performs forward propagation given the estimated NN parameters. For each iteration, $\{\mathbf{d}_{l}\}_{l=2}^{L}$ are randomly sampled from Bernoulli distributions with rates $\psi_{l}$ and applied to the features. The algorithm repeats this $M$ times to obtain an MC sample of $M$ predicted outputs. We provide implementation details for BayesCNN in Web Appendix C.1.

3.2 Bayesian Convolutional Generalized Linear Model

We propose BayesCGLM by regressing $\mathbf{Y}$ on $\bm{\Phi}$ and additional covariates $\mathbf{Z}$. Here, we use $\bm{\Phi}$ as a basis matrix that encapsulates the information in the high-dimensional and correlated input variables $\mathbf{X}$. Let $\mathbf{A}=[\mathbf{Z},\bm{\Phi}]\in\mathbb{R}^{N\times(p+k_{L-1})}$ be the design matrix. Then, our model is

\[
g(E[\mathbf{Y}\,|\,\mathbf{Z},\bm{\Phi}])=\mathbf{Z}\bm{\gamma}+\bm{\Phi}\bm{\delta}=\mathbf{A}\bm{\beta}, \tag{8}
\]

where $\bm{\beta}=(\bm{\gamma}^{\top},\bm{\delta}^{\top})^{\top}\in\mathbb{R}^{p+k_{L-1}}$ is the vector of regression coefficients and $g(\cdot)$ is a one-to-one, continuously differentiable link function. The proposed method can provide an interpretation of the regression coefficients and quantify uncertainties in prediction and estimation. Furthermore, the training step does not require additional computational costs compared to a traditional CNN. We train BayesCNN with the negative log-likelihood loss function corresponding to the type of output variable; specifically, we use the mean squared loss, binary cross-entropy loss, and Poisson loss for Gaussian, binary, and count outputs, respectively. Such choices are natural in that the loss functions are directly defined by the distribution of the output variable.
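For instance, with a tf.keras model such as the CNN sketch in Section 3.1, the training loss could be matched to the response type as follows; this is a sketch, and `model` is assumed to have the appropriate output activation for each case.

```python
from tensorflow import keras

model.compile(optimizer="adam", loss="mse")                      # Gaussian response
# model.compile(optimizer="adam", loss="binary_crossentropy")    # binary response
# model.compile(optimizer="adam", loss=keras.losses.Poisson())   # count response
```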

Since we place a posterior distribution over the kernels of the fitted BayesCNN (Section 3.1), we can generate MC samples of the features $\bm{\Phi}^{(m)}$, $m=1,\cdots,M$, from the variational distribution. To account for the uncertainties in estimating the features, we use all the MC samples of the features rather than their point estimates. For each MC sample $\mathbf{A}^{(m)}=[\mathbf{Z},\bm{\Phi}^{(m)}]$, we fit an individual GLM in parallel; this yields $M$ posterior distributions of $\bm{\beta}_{m}=(\bm{\gamma}_{m}^{\top},\bm{\delta}_{m}^{\top})^{\top}\in\mathbb{R}^{p+k_{L-1}}$. This step is parallelizable, so computational wall times decrease with more cores. Then, we construct an aggregated distribution (the “ensemble-posterior”) of $\bm{\beta}$. In the following section, we provide details about constructing the ensemble-posterior distribution. The fitting procedure of BayesCGLM is summarized in Web Appendix C.2.
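The ensemble GLM step could be sketched as follows with statsmodels, assuming a Gaussian response; `Phi_samples`, `Z`, and `y` are toy stand-ins for the MC feature samples, vector covariates, and output, and the parallelization is omitted for brevity.

```python
import numpy as np
import statsmodels.api as sm

def fit_ensemble_glms(Phi_samples, Z, y, family=sm.families.Gaussian()):
    """Fit one GLM per MC feature sample on the design [1, Z, Phi^(m)]."""
    fits = []
    for Phi_m in Phi_samples:                            # m = 1, ..., M
        A_m = sm.add_constant(np.hstack([Z, Phi_m]))
        fits.append(sm.GLM(y, A_m, family=family).fit())
    return fits

# toy data standing in for extracted features, covariates, and responses
rng = np.random.default_rng(4)
M, N, k, p = 20, 100, 8, 2
Phi_samples = rng.normal(size=(M, N, k))
Z = rng.normal(size=(N, p))
y = Z @ np.array([1.0, -0.5]) + rng.normal(size=N)

fits = fit_ensemble_glms(Phi_samples, Z, y)
# ensemble point estimate of gamma (coefficients on Z; intercept is column 0)
gamma_hat = np.mean([f.params[1:1 + p] for f in fits], axis=0)
print(gamma_hat)
```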

In general, inference for the regression coefficients $\bm{\gamma}$ in (8) is challenging for DNNs. Although one can regard the weight parameters at the last hidden layer as $\bm{\gamma}$, we cannot obtain uncertainties for these estimates. Even BayesDNN (Gal and Ghahramani, 2016b) only provides point estimates of weight parameters, although it can quantify uncertainties in response prediction. On the other hand, our method directly provides the posterior distribution of $\bm{\gamma}$, which is useful for standard inference. In addition, the complex structure of DNNs often leads to nonconvex optimization (Goodfellow et al., 2014; Dauphin et al., 2014), resulting in inaccurate estimates of $\bm{\gamma}$. Since the stopping rule of the SGD algorithm is based on prediction accuracy (not on convergence in parameter estimation), the convergence of individual parameters, including $\bm{\gamma}$, cannot be guaranteed. The problem is further complicated by the large number of parameters that need to be updated simultaneously in the SGD algorithm. Note that BayesCGLM also uses the SGD algorithm to extract $\bm{\Phi}$. From this, we obtain one of the optimal combinations of weight parameters and the corresponding $\bm{\Phi}$. Although this $\bm{\Phi}$ may not be the global optimizer, it is still informative for predicting responses because it is obtained by minimizing the loss function. However, this does not guarantee the convergence of individual weight parameters in nonconvex optimization. Instead, our two-stage approach alleviates this issue by using the extracted $\bm{\Phi}$ as a fixed basis matrix in the GLM. In Web Appendix D, we conduct simulation studies under different model configurations and show that BayesCGLM can recover the true $\bm{\gamma}$ values well, while estimates from BayesCNN can be biased.

3.3 Bayesian Inference

Here, we describe Bayesian inference for the proposed method. We first define the posterior distribution of the regression coefficients, $\pi(\bm{\beta}\,|\,\mathbf{D})$. Let $\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L-1}$ be the weights and biases corresponding to all hidden layers in the CNN. We can vectorize the kernels in the CNN and represent them as weights as in a standard DNN (see Web Appendix A.2 for details). Note that $\bm{\Phi}$ is obtained deterministically from forward propagation for given $\mathbf{X}$ and $\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L-1}$. Therefore, the posterior distribution can be represented as

\[
\begin{split}
\pi(\bm{\beta}\,|\,\mathbf{D})&=\int\pi(\bm{\beta}\,|\,\mathbf{D},\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L-1})\\
&\quad\times\prod_{l=1}^{L-1}\pi(\mathbf{W}_{l}\,|\,\mathbf{D})\,\pi(\mathbf{b}_{l}\,|\,\mathbf{D})\,d\mathbf{W}_{1}d\mathbf{b}_{1}\cdots d\mathbf{W}_{L-1}d\mathbf{b}_{L-1}, \tag{9}
\end{split}
\]

where $\pi(\bm{\beta}\,|\,\mathbf{D},\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L-1})$ is the conditional posterior, and $\pi(\mathbf{W}_{l}\,|\,\mathbf{D})$ and $\pi(\mathbf{b}_{l}\,|\,\mathbf{D})$ are the marginal posteriors for the weights and biases, respectively. Since it is challenging to compute (9) directly, we approximate it through MC dropout as

\[
\int\pi(\bm{\beta}\,|\,\mathbf{D},\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L-1})\prod_{l=1}^{L-1}q(\mathbf{W}_{l})q(\mathbf{b}_{l})\,d\mathbf{W}_{1}d\mathbf{b}_{1}\cdots d\mathbf{W}_{L-1}d\mathbf{b}_{L-1}, \tag{10}
\]

where $q(\mathbf{W}_{l})$ and $q(\mathbf{b}_{l})$ are the variational distributions defined in (4). Then the MC approximation to (10) is

\[
\frac{1}{M}\sum_{m=1}^{M}\pi(\bm{\beta}_{m}\,|\,\mathbf{D},\{\mathbf{W}^{(m)}_{l},\mathbf{b}^{(m)}_{l}\}_{l=1}^{L-1}). \tag{11}
\]

Here $\{\{\mathbf{W}^{(m)}_{l},\mathbf{b}^{(m)}_{l}\}_{l=1}^{L-1}\}_{m=1}^{M}$ are sampled from (4). One may choose a standard MCMC algorithm to estimate each $\pi(\bm{\beta}_{m}\,|\,\mathbf{D},\{\mathbf{W}^{(m)}_{l},\mathbf{b}^{(m)}_{l}\}_{l=1}^{L-1})$ in (11). However, this may require a long run of the chain to guarantee good mixing, resulting in excessive computational cost. Instead, we use the Laplace method to approximate each $\pi(\bm{\beta}_{m}\,|\,\mathbf{D},\{\mathbf{W}^{(m)}_{l},\mathbf{b}^{(m)}_{l}\}_{l=1}^{L-1})$. In our preliminary studies, we observed that the Laplace approximation leads to inference results similar to the MCMC method within a much shorter computing time when the sample size is large enough.

For the Laplace approximation, we first compute $\bm{\Phi}^{(m)}$ for the given $\{\mathbf{W}^{(m)}_{l},\mathbf{b}^{(m)}_{l}\}_{l=1}^{L-1}$ through forward propagation. Then, we obtain the maximum likelihood estimate (MLE) $\widehat{\bm{\beta}}_{m}$ from the GLM regressing $\mathbf{Y}$ on $\bm{\Phi}^{(m)}$ and $\mathbf{Z}$. We approximate the posterior of $\bm{\beta}_{m}$ as $\mathcal{N}(\widehat{\bm{\beta}}_{m},\widehat{\mathbf{B}}^{-1}_{m})$, where $\widehat{\mathbf{B}}_{m}\in\mathbb{R}^{(p+k_{L-1})\times(p+k_{L-1})}$ is the observed Fisher information matrix from the $m$th MC sample. From this procedure, (11) can be approximated as

\[
\frac{1}{M}\sum_{m=1}^{M}\varphi(\bm{\beta};\widehat{\bm{\beta}}_{m},\widehat{\mathbf{B}}^{-1}_{m}), \tag{12}
\]

where $\varphi(\bm{x};\bm{\mu},\bm{\Sigma})$ is a multivariate normal density with mean $\bm{\mu}$ and covariance $\bm{\Sigma}$. In summary, (12) approximates the posterior distribution $\pi(\bm{\beta}\,|\,\mathbf{D})$ through MC dropout and the Laplace method. Note that BayesCGLM approximates $\pi(\bm{\beta}\,|\,\mathbf{D})$ through an MC average from the same probability model (i.e., the same data-generating mechanism); therefore, the interpretation of $\bm{\beta}$ does not change from ensemble to ensemble. However, the interpretation of $\bm{\delta}$ is more difficult because each column of $\bm{\Phi}$ comes from complex (and repeated) convolution operations, and it is challenging to explain the impact of changes in the features based on the estimated $\bm{\delta}$.
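Sampling from the ensemble posterior (12) can be sketched as below: pick one of the $M$ fitted GLMs at random, then draw from its Gaussian (Laplace) approximation. The `fits` object is assumed to come from the ensemble-GLM sketch in Section 3.2, and `cov_params()` returns the estimated inverse Fisher information of each fit.

```python
import numpy as np

def sample_ensemble_posterior(fits, n_draws=2000, seed=5):
    """Draw from the normal-mixture approximation (12) to pi(beta | D)."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        f = fits[rng.integers(len(fits))]          # choose an MC fit uniformly
        mean = np.asarray(f.params)
        cov = np.asarray(f.cov_params())           # Laplace covariance B_m^{-1}
        draws.append(rng.multivariate_normal(mean, cov))
    return np.array(draws)

draws = sample_ensemble_posterior(fits)
print(draws.mean(axis=0))                                # posterior means
print(np.percentile(draws, [2.5, 97.5], axis=0))         # 95% credible bounds
```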

Similar to BayesCNN, our method can quantify uncertainties in predictions, which is an advantage over deterministic NNs. Note that BayesCGLM can also provide interpretations of model parameters with uncertainty quantification, while BayesCNN cannot. For an unobserved response $\mathbf{Y}^{\ast}\in\mathbb{R}^{N_{\text{test}}}$, we have $\mathbf{A}^{\ast}_{m}=[\mathbf{Z}^{\ast},\bm{\Phi}^{\ast(m)}]\in\mathbb{R}^{N_{\text{test}}\times(p+k_{L-1})}$ for $m=1,\cdots,M$. Here, $\bm{\Phi}^{\ast(m)}$ is the $m$th feature matrix extracted from the last hidden layer of the trained BayesCNN for given $\mathbf{X}^{\ast}$ and $\mathbf{Z}^{\ast}$. Then, from (12), the predictive distribution of the linear predictor is

\[
\frac{1}{M}\sum_{m=1}^{M}\varphi(\mathbf{A}^{\ast}\bm{\beta};\,\mathbf{A}_{m}^{\ast}\widehat{\bm{\beta}}_{m},\,\mathbf{A}_{m}^{\ast}\widehat{\mathbf{B}}^{-1}_{m}\mathbf{A}^{\ast\top}_{m}). \tag{13}
\]

Given (13), we can generate $\mathbf{Y}^{\ast}$ from the underlying distribution. For example, for a Gaussian response, we have $\mathbf{Y}^{\ast}\sim\mathcal{N}\big(\frac{1}{M}\sum_{m=1}^{M}\mathbf{A}^{\ast}\widehat{\bm{\beta}}_{m},\,\widehat{\sigma}^{2}\big)$, where $\widehat{\sigma}^{2}=\sum_{m=1}^{M}(\mathbf{A}^{(m)}\widehat{\bm{\beta}}_{m}-\mathbf{Y})^{\top}(\mathbf{A}^{(m)}\widehat{\bm{\beta}}_{m}-\mathbf{Y})/NM$. For a count response, we can simulate $\mathbf{Y}^{\ast}$ from the Poisson distribution with intensity $\exp\big(\frac{1}{M}\sum_{m=1}^{M}\mathbf{A}^{\ast}\widehat{\bm{\beta}}_{m}\big)$.
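A sketch of posterior predictive draws for a Gaussian response via (13), continuing the toy objects above: for each MC fit, the linear predictor is drawn from its Gaussian approximation and observation noise is added. The names `A_star_list` (the test design matrices $[\mathbf{Z}^{\ast},\bm{\Phi}^{\ast(m)}]$ with an intercept column) and `sigma2_hat` are assumed inputs.

```python
import numpy as np

def predictive_draws(fits, A_star_list, sigma2_hat, seed=6):
    """One predictive draw of Y* per MC fit, following (13) plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    draws = []
    for f, A_star in zip(fits, A_star_list):
        mean = A_star @ f.params                       # A*_m beta_hat_m
        cov = A_star @ f.cov_params() @ A_star.T       # A*_m B_m^{-1} A*_m^T
        eta = rng.multivariate_normal(mean, cov)       # linear predictor draw
        draws.append(eta + rng.normal(scale=np.sqrt(sigma2_hat), size=len(eta)))
    return np.array(draws)   # shape (M, N_test); summarize by mean and HPD interval
```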

As we described, BayesCGLM involves several approximation steps: (1) MC dropout and (2) the Laplace method; therefore, investigating approximation quality is crucial. In Web Appendix D.5, we conduct a numerical study under the simple NN setting. Here, we use a standard NN to extract features; therefore, we refer to our method as BayesDGLM. We compare the performance of BayesDGLM, BayesDNN, and a fully Bayesian method (MCMC). We observe that the BayesDGLM provides comparable inference results to the fully Bayesian method with much smaller computational costs. Furthermore, BayesDNN cannot quantify uncertainties in parameter estimates, while BayesDGLM can. Considering that it is infeasible to apply the fully Bayesian method for complex DNNs, BayesDGLM would be a practical option in many applications. We provide details for the experiment in Web Appendix D.5.

4 Applications

We apply our method to brain tumor image data and fMRI data for anxiety scores. We also provide results for a geospatial application in Web Appendix E. We compare our model with BayesCNN and GLM to assess its performance. We evaluate prediction accuracy using the root mean square prediction error (RMSPE) and uncertainty quantification using the empirical coverage of prediction intervals. Details about the CNN structure are provided in Web Appendix F.

4.1 Brain Tumor Image Data

According to Noone et al. (2020), brain and nervous system tumors have low 5-year relative survival rates, which gradually increased from 1975 to 2016, reaching 33%. Furthermore, some brain tumors can also be cancerous. Considering that the 5-year survival rate across all cancers is 67%, brain tumors can be life-threatening. MRI is the standard diagnostic tool for brain tumors. Texture-based first- and second-order statistics of MRI images are useful for classification (Aggarwal and Agrawal, 2012) and are widely used in the classification of biomedical images (Ismael and Abdel-Qader, 2018). However, these statistics do not capture the spatial information of correlated structures in MRI images. Automated methods for brain tumor classification are becoming increasingly important, as manual analysis of MRI data by medical experts is challenging. It is also crucial to convey the uncertainty associated with these automated methods to ensure statistically sound decision-making. Compared to previous studies (Aggarwal and Agrawal, 2012; Ismael and Abdel-Qader, 2018), we use both MRI images and texture-based statistics to predict brain tumors while quantifying uncertainties. The dataset is collected from the Brain Tumor Image Segmentation Challenge (Menze et al., 2014). The binary response indicates whether a patient has a fatal brain tumor. For each patient, we have an MRI image and two summary variables (first- and second-order features of the MRI image), which we use as the covariates $\mathbf{Z}$. We use $N=2,508$ observations to fit our model and reserve $N_{\text{test}}=1,254$ observations for validation. We fit BayesCNN using the $240\times 240$ pixel gray-scale images $\mathbf{X}$ and scalar covariates $\mathbf{Z}$ as input, and the binary response $\mathbf{Y}$ as output. From the fitted BayesCNN, we extract $\bm{\Phi}\in\mathbb{R}^{2,508\times 16}$ from the last hidden layer. With the extracted features, we fit logistic GLMs by regressing $\mathbf{Y}$ on $[\mathbf{Z},\bm{\Phi}^{(m)}]\in\mathbb{R}^{2,508\times 18}$ for $m=1,\cdots,500$.

Table 1 shows the inference results from the different methods. Our method shows that the first-order feature has a negative relationship with brain tumor risk, while the second-order feature has a positive relationship. Both coefficients are statistically significant based on the highest posterior density (HPD) intervals. Note that previous work (Aggarwal and Agrawal, 2012) only predicted tumor status but did not provide such an interpretation of the first- and second-order statistics. Therefore, we can conclude that darker MRI images (larger first-order statistics) with lower variation (smaller second-order statistics) indicate a lower possibility of tumor presence. We observe that the signs of the $\bm{\gamma}$ estimates from BayesCGLM are aligned with those from GLM.

Table 1: Inference results for the brain tumor dataset from different methods. For all methods, the posterior mean of $\bm{\gamma}$, the 95% HPD interval, accuracy, recall, precision, and computing time (min) are reported.

                                                     BayesCGLM (M=500)   BayesCNN (M=500)   GLM
$\gamma_{1}$ (first-order feature)   Mean            -5.332              0.248              -2.591
                                     95% Interval    (-7.049, -3.704)    -                  (-2.769, -2.412)
$\gamma_{2}$ (second-order feature)  Mean            4.894               0.160              2.950
                                     95% Interval    (3.303, 6.564)      -                  (2.755, 3.144)
Prediction                           Accuracy        0.924               0.867              0.784
                                     Recall          0.929               0.787              0.783
                                     Precision       0.901               0.907              0.715
Time (min)                                           293.533             103.924            0.004

Figure 1 is a score plot with the first and second principal components of $\bm{\Phi}$. Figure 1 (a) shows a clear separation of tumor and non-tumor status based on the two principal components. In particular, we observe that patients without tumors mostly have positive values of the first principal component (62.9% explained variability). This indicates that $\bm{\Phi}$ contains useful information from the images for predicting brain tumor status. By incorporating the information in $\bm{\Phi}$, BayesCGLM shows the most accurate prediction performance while providing credible intervals for the estimates.
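The score plots in Figure 1 can be constructed, in principle, with a standard PCA of the extracted feature matrix; the sketch below uses scikit-learn with a random placeholder in place of the actual $\bm{\Phi}$.

```python
import numpy as np
from sklearn.decomposition import PCA

Phi = np.random.default_rng(7).normal(size=(2508, 16))   # placeholder for the N x 16 features
scores = PCA(n_components=2).fit_transform(Phi)          # first two principal component scores
# scores[:, 0] and scores[:, 1] are then plotted and colored by tumor status
```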

Figure 1: Score plots for each application: (a) brain tumor dataset and (b) fMRI dataset.

Figure 2 (a) shows that the area under the curve (AUC) for the receiver operating characteristic curve (ROC) plot is about 0.96, which shows outstanding prediction performance. The predicted probability surface becomes larger for the binary responses with a value of 1 (Figure 2 (b)). Furthermore, the true and predicted binary responses are well aligned. We also investigate how prediction uncertainties are related to actual images. The top panel in Figure 2 (c) shows four examples of correctly classified cases with low prediction uncertainties. The results show that when the image is overall dark, and the tumor area clearly stands out from the background, we have low uncertainties. On the other hand, the bottom panel in Figure 2 (c) shows four examples of misclassified cases with high prediction uncertainties. In these cases, the background image is too bright, so the tumor area is hardly distinguishable from the background. Although the point predictions misclassify these cases, higher uncertainty informs us that further investigation is needed. Therefore, better medical decisions can be made when uncertainty measures are considered.

Figure 2: (a) The ROC curve and AUC obtained by each method. (b) The estimated probability of $y=1$ (blue solid line) and the observed binary responses (black dots). (c) MRI images with the fitted probabilities (prediction errors) for different cases. The top panel illustrates correctly classified images with small prediction errors. The bottom panel illustrates misclassified images with large prediction errors.

4.2 fMRI Data for Anxiety Score

Anxiety disorders are becoming a prevalent mental health issue worldwide, with studies showing a more than 25% increase in cases in 2020 (Organization et al., 2022). There have been several works to study the relationship between anxiety scores and psychological assessments collected from the survey data (Kaplan et al., 2015). However, no previous studies considered the fMRI data, which contains important information about the functional connectivity of the brain. Here, we utilize fMRI information through the extracted feature to construct an anxiety score model. From this, we can incorporate the correlation structure of the brain when we study the relationships between psychological assessments and anxiety levels.

The resting-state fMRI data is collected by the enhanced Nathan Kline Institute-Rockland Sample (NKI-RS) to create a large-scale community sample of participants across the lifespan. For 156 patients, resting-state fMRI signals were scanned for 394 seconds. Then we transform fMRI images into 278 functional brain Regions-of-Interest (ROIs) following the parcellation scheme of Shen et al. (2013).

A functional network was computed using the pairwise Pearson correlation coefficient, constructing a $278\times 278$ square symmetric matrix called the ROI-signal matrix. The correlation matrix of brain functions has been widely used to study the functional connectivity of the brain (Sporns, 2002); therefore, we also use the correlation matrix as the input $\mathbf{X}$. Participants in the study completed a survey that included various physiological and psychological assessments. In this study, we use $N=104$ observations to fit our model and reserve $N_{\text{test}}=52$ observations for validation. We use conscientiousness, extraversion, agreeableness, and neuroticism as $\mathbf{Z}$ and the averaged anxiety score as $\mathbf{Y}$. We train BayesCNN with $\mathbf{X}$ and $\mathbf{Z}$ as input and the continuous response $\mathbf{Y}$ as output. From the trained BayesCNN, we extract each $m$th feature matrix $\bm{\Phi}^{(m)}\in\mathbb{R}^{104\times 8}$ from the last hidden layer. Then we regress $\mathbf{Y}$ on $[\mathbf{Z},\bm{\Phi}^{(m)}]$ for $m=1,\cdots,500$.

Table 2 indicates that BayesCGLM has prediction performance comparable to BayesCNN and can quantify uncertainties in the regression coefficients. While the signs of the estimated $\bm{\gamma}$ are identical across the different methods, there are differences in statistical significance, particularly for the agreeableness and conscientiousness variables. From BayesCGLM, we can infer that a person who has emotional instability and vulnerability to unpleasant emotions (neuroticism) and who is less likely to rebel against others (agreeableness) has a higher anxiety and depression score, which is aligned with the previous study of Kaplan et al. (2015). On the other hand, individuals who are more sociable (extraverted) and driven to achieve their goals (conscientiousness) tend to exhibit lower anxiety and depression scores, as also supported by previous studies (Naragon-Gainey et al., 2014; Fan et al., 2020). Note that these previous studies considered neither the psychological assessments simultaneously nor the fMRI information.

GLM, which does not utilize the fMRI information, draws different conclusions in the significance tests. Table 2 shows that the agreeableness and conscientiousness variables are not statistically significant in GLM, which contradicts the findings of previous studies (Kaplan et al., 2015; Fan et al., 2020). Figure 1 (b) is a score plot with the first and second principal components of $\bm{\Phi}$ and indicates a negative correlation between anxiety scores and the first principal component (62.6% explained variability). Therefore, $\bm{\Phi}$ encapsulates information about brain functions that are highly associated with anxiety levels. By incorporating such important information into the model, BayesCGLM can accurately infer the relationship between the psychological assessments and anxiety levels and shows improved prediction performance.

We also observe that all the methods provide similar point estimates of the coefficients. Note, however, that the previous work (Kaplan et al., 2015) only conducted an exploratory analysis based on correlation coefficients without quantifying uncertainty, whereas our model estimates the coefficients while incorporating the correlated fMRI information.

Table 2: Inference results for the fMRI dataset from different methods. For all methods, the posterior mean of $\bm{\gamma}$, the 95% HPD interval, RMSPE, prediction coverage, and computing time (min) are reported.

                                                   BayesCGLM (M=500)   BayesCNN (M=500)   GLM
$\gamma_{1}$ (neuroticism)        Mean             3.496               3.480              4.318
                                  95% Interval     (2.629, 4.29)       -                  (2.572, 6.064)
$\gamma_{2}$ (extraversion)       Mean             -1.123              -1.149             -1.653
                                  95% Interval     (-1.829, -0.386)    -                  (-3.198, -0.108)
$\gamma_{3}$ (agreeableness)      Mean             1.113               1.204              1.196
                                  95% Interval     (0.480, 1.760)      -                  (-0.229, 2.621)
$\gamma_{4}$ (conscientiousness)  Mean             -1.394              -1.416             -1.283
                                  95% Interval     (-2.226, -0.507)    -                  (-3.056, 0.491)
Prediction                        RMSPE            9.342               9.794              10.531
                                  Coverage         0.923               0.981              0.923
Time (min)                                         3.459               0.491              0.001

Figure 3 shows that there is some agreement between the true and predicted responses. We observe that the HPD prediction intervals cover the true responses well (92.3% prediction coverage). Due to the small sample size, the prediction intervals are wider than in the previous examples, which is natural. In Figure 3, triangles represent the individuals with the three widest HPD intervals. We observe that the anxiety score is underestimated for these people, but the high uncertainty informs us that we may need to pay additional attention to these individuals with severe anxiety issues; they might have been overlooked without uncertainty quantification.

Figure 3: Comparison between the true and predicted anxiety scores from BayesCGLM. Circles (asterisks) represent that the true scores are (not) covered in 95% HPD intervals. Triangles represent individuals with the three largest HPD intervals.

5 Discussion

We propose a flexible approach, which can utilize image data as predictor variables via feature extraction through BayesCNN. The proposed method can simultaneously utilize predictors with different data structures, such as vector covariates and image data. Our study shows that the proposed method results in prediction performance comparable to that of existing deep learning algorithms in terms of prediction accuracy while enabling uncertainty quantification for estimation and prediction. By constructing an ensemble posterior through a mixture of feature posteriors, we can account for the uncertainties in feature extraction.

Our work is motivated by recently developed feature extraction approaches from DNNs. These features can improve the performance of existing statistical approaches by accounting for complex dependencies among observations. For instance, Bhatnagar et al. (2022) extracted features from long short-term memory (LSTM) networks and used them to capture complex inverse relationships in calibration problems. Similar to our approach, Tran et al. (2020) used the extracted features from a DNN as additional covariates in generalized linear mixed effect models. Fong and Xu (2021) proposed a nonlinear dimension reduction method based on deep autoencoders, which can extract interpretable features from high-dimensional data.

The proposed framework can be extended to a wider range of statistical models, such as Cox regression in survival analysis or regression models in causal inference. Extracting features from other types of DNNs is also possible; for example, one can use CNN-LSTM (Wang et al., 2016) to capture spatio-temporal dependencies. Our ensemble approaches can significantly improve prediction and provide interpretable parameter estimates.

Acknowledgement

This work was partially supported by the National Research Foundation of Korea (2019R1A2C1007399, 2020R1C1C1A0100386814, 2022R1C1C1006735, RS-2023-00217705), ICAN (ICT Challenge and Advanced Network of HRD) support program (RS-2023-00259934) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation). The authors are grateful to anonymous reviewers for their careful reading and valuable comments.

Data Availability

The brain tumor images dataset is obtained from Kaggle (https://www.kaggle.com/datasets/jakeshbohaju/brain-tumor/data). The fMRI dataset for anxiety score and the malaria incidence datasets can be downloaded from https://github.com/jeon9677/BayesCGLM.

References

  • Aggarwal and Agrawal (2012) Aggarwal, N. and Agrawal, R. (2012). First and second order statistics features for classification of magnetic resonance brain images. J. Sign. Inf. Process .
  • Alzubaidi et al. (2021) Alzubaidi, L., Zhang, J., Humaidi, A. J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., et al. (2021). Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data 8, 1–74.
  • Bhatnagar et al. (2022) Bhatnagar, S., Chang, W., Kim, S., and Wang, J. (2022). Computer model calibration with time series data using deep learning and quantile regression. SIAM/ASA Journal on Uncertainty Quantification 10, 1–26.
  • Damianou and Lawrence (2013) Damianou, A. and Lawrence, N. D. (2013). Deep Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 207–215. PMLR.
  • Dauphin et al. (2014) Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems 27.
  • Daw and Wikle (2022) Daw, R. and Wikle, C. K. (2022). Reds: Random ensemble deep spatial prediction. Environmetrics page e2780.
  • Fan et al. (2020) Fan, J. et al. (2020). Relationships between five-factor personality model and anxiety: The effect of conscientiousness on anxiety. Open Journal of Social Sciences 8, 462.
  • Fong and Xu (2021) Fong, Y. and Xu, J. (2021). Forward stepwise deep autoencoder-based monotone nonlinear dimensionality reduction methods. Journal of Computational and Graphical Statistics 30, 1–10.
  • Gal and Ghahramani (2016a) Gal, Y. and Ghahramani, Z. (2016a). Bayesian convolutional neural networks with Bernoulli approximate variational inference.
  • Gal and Ghahramani (2016b) Gal, Y. and Ghahramani, Z. (2016b). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059. PMLR.
  • Goodfellow et al. (2014) Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2014). Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544 .
  • Ismael and Abdel-Qader (2018) Ismael, M. R. and Abdel-Qader, I. (2018). Brain tumor classification via statistical features and back-propagation neural network. In IEEE International Conference on Electro/Information Technology (EIT), pages 0252–0257. IEEE.
  • Kaplan et al. (2015) Kaplan, S. C., Levinson, C. A., Rodebaugh, T. L., Menatti, A., and Weeks, J. W. (2015). Social anxiety and the big five personality traits: The interactive relationship of trust and openness. Cognitive Behaviour Therapy 44, 212–222.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, 1097–1105.
  • McDermott and Wikle (2017) McDermott, P. L. and Wikle, C. K. (2017). An ensemble quadratic echo state network for non-linear spatio-temporal forecasting. Stat 6, 315–330.
  • McDermott and Wikle (2019) McDermott, P. L. and Wikle, C. K. (2019). Deep echo state networks with uncertainty quantification for spatio-temporal forecasting. Environmetrics 30, e2553.
  • Menze et al. (2014) Menze, B. H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., et al. (2014). The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Transactions on Medical Imaging 34, 1993–2024. doi: 10.1109/TMI.2014.2377694.
  • Naragon-Gainey et al. (2014) Naragon-Gainey, K., Rutter, L. A., and Brown, T. A. (2014). The interaction of extraversion and anxiety sensitivity on social anxiety: evidence of specificity relative to depression. Behavior therapy 45, 418–429.
  • Neal (2012) Neal, R. M. (2012). Bayesian learning for neural networks, volume 118. Springer Science & Business Media.
  • Noone et al. (2020) Noone, A., Howlader, N., Krapcho, M., Miller, D., Brest, A., Yu, M., et al. (2020). SEER cancer statistics review (CSR) 1975–2017. National Cancer Institute: Bethesda, MD, USA .
  • Organization et al. (2022) Organization, W. H. et al. (2022). Mental health and COVID-19: early evidence of the pandemic’s impact: scientific brief, 2 March 2022. Technical report, World Health Organization.
  • Paisley et al. (2012) Paisley, J., Blei, D., and Jordan, M. (2012). Variational Bayesian inference with stochastic search.
  • Rasmussen (2003) Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286. PMLR.
  • Sagi and Rokach (2018) Sagi, O. and Rokach, L. (2018). Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8, e1249.
  • Sauer et al. (2023) Sauer, A., Cooper, A., and Gramacy, R. B. (2023). Vecchia-approximated deep Gaussian processes for computer experiments. Journal of Computational and Graphical Statistics 32, 824–837.
  • Shen et al. (2013) Shen, X., Tokoglu, F., Papademetris, X., and Constable, R. T. (2013). Groupwise whole-brain parcellation from resting-state fMRI data for network node identification. Neuroimage 82, 403–415.
  • Shridhar et al. (2019) Shridhar, K., Laumann, F., and Liwicki, M. (2019). A comprehensive guide to Bayesian convolutional neural network with variational inference.
  • Sporns (2002) Sporns, O. (2002). Network analysis, complexity, and brain function. Complexity 8, 56–60.
  • Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1929–1958.
  • Tran et al. (2020) Tran, M.-N., Nguyen, N., Nott, D., and Kohn, R. (2020). Bayesian deep net GLM and GLMM. Journal of Computational and Graphical Statistics 29, 97–113.
  • Wang et al. (2016) Wang, J., Yu, L.-C., Lai, K. R., and Zhang, X. (2016). Dimensional sentiment analysis using a regional CNN-LSTM model. In The 54th Annual Meeting of the Association for Computational Linguistics, volume 2, pages 225–230.
  • Wang et al. (2019) Wang, S., Zhou, T., and Bilmes, J. (2019). Jumpout: Improved dropout for deep neural networks with ReLUs. In International conference on machine learning, pages 6668–6676. PMLR.

Supplementary Materials

Web Appendices, Tables, and Figures referenced in Sections 2-4, as well as a zip file containing data and code, are available with this paper at the Biometrics website on Oxford Academic.