
A Bayesian Convolutional Neural Network-based Generalized Linear Model

Yeseul Jeon1, Won Chang3,4, Seonghyun Jeong1,2, Sanghoon Han5, and Jaewoo Park1,2,∗
1Department of Statistics and Data Science, Yonsei University, Seoul, South Korea
[email protected]
2Department of Applied Statistics, Yonsei University, Seoul, South Korea
3Division of Statistics and Data Science, University of Cincinnati, Cincinnati, Ohio, U.S.A.
4Department of Statistics, Seoul National University, Seoul, South Korea
5Department of Psychology, Yonsei University, Seoul, South Korea
Abstract

Convolutional neural networks (CNNs) provide flexible function approximations for a wide variety of applications when the input variables are in the form of images or spatial data. Although CNNs often outperform traditional statistical models in prediction accuracy, statistical inference, such as estimating the effects of covariates and quantifying the prediction uncertainty, is not trivial due to the highly complicated model structure and overparameterization. To address this challenge, we propose a new Bayesian approach by embedding CNNs within the generalized linear model (GLM) framework. We use extracted nodes from the last hidden layer of a CNN with Monte Carlo (MC) dropout as informative covariates in a GLM. This improves accuracy in prediction and regression coefficient inference, allowing for the interpretation of coefficients and uncertainty quantification. By fitting ensemble GLMs across multiple realizations from MC dropout, we can account for uncertainties in extracting the features. We apply our methods to biological and epidemiological problems that have both high-dimensional correlated inputs and vector covariates. Specifically, we consider malaria incidence data, brain tumor image data, and fMRI data. By extracting information from correlated inputs, the proposed method can provide an interpretable Bayesian analysis. The algorithm is broadly applicable to image regression and correlated data analysis, enabling fast and accurate Bayesian inference.

keywords:
Bayesian deep learning; feature extraction; Monte Carlo dropout; posterior approximation; uncertainty quantification

1 Introduction

With recent advances in data collection technologies, it is increasingly common to have images as part of the collected data. Specifically, we consider three important applications: (1) Malaria data contain spatially correlated confirmed cases; detecting disease hotspots and studying spatial patterns of incidence are crucial public health problems. (2) Brain tumor data contain magnetic resonance imaging (MRI) images and texture-based statistics; developing automated decision rules about brain tumor status can be beneficial to practitioners. (3) The anxiety disorder example contains functional magnetic resonance imaging (fMRI) data with psychological assessments collected from survey questions; analyzing the relationship between brain images and survey responses is of substantial interest in psychology. In all three cases, we have both correlated matrix images and vector covariates, which are challenging to analyze simultaneously. Traditional statistical approaches face inferential issues when dealing with input image variables due to high dimensionality and multicollinearity. Convolutional neural networks (CNNs) are popular approaches for image processing and can alleviate such multicollinearity issues (Krizhevsky et al., 2012). However, standard CNNs cannot quantify uncertainties, which are often of primary interest in statistical inference. Furthermore, the complex structure of a CNN can lead to nonconvex optimization, which cannot guarantee the convergence of the weight parameters (Alzubaidi et al., 2021). We propose a computationally efficient Bayesian approach by embedding CNNs within generalized linear models (GLMs). Our method can take advantage of both statistical models and advanced neural networks (NNs).

There is a growing literature on Bayesian deep learning methods (Neal, 2012) that quantify uncertainty by regarding the weight parameters in deep neural networks (DNNs) as random variables. With a choice of priors, we have posterior distributions for standard Bayesian inference. Bayesian alternatives can be more robust than standard DNNs for small sample sizes (Gal and Ghahramani, 2016a, b). Despite these apparent advantages, Bayesian DNNs have been used much less than their classical counterparts due to the difficulty of exploring the full posterior distributions. To circumvent this difficulty, variational Bayes (VB) methods have been proposed to approximate complex high-dimensional posteriors; examples include VB for DNNs (Tran et al., 2020) and CNNs (Shridhar et al., 2019). Although VB can alleviate computational costs, to the best of our knowledge, no existing approach provides a general Bayesian framework to study image and vector-type variables simultaneously. Furthermore, most previous works have focused on prediction uncertainty but not on estimation uncertainty of model parameters; interpretation of regression coefficients has not been studied much in the Bayesian deep learning context.

We propose a Bayesian convolutional generalized linear model (BayesCGLM) that can study different types of variables simultaneously in various applications. First, we train a Bayesian CNN (BayesCNN) (Gal and Ghahramani, 2016a) with the correlated input (images or correlated variables) and output. Then, we extract Monte Carlo (MC) samples of features from the last layer of the fitted BayesCNN. Lastly, we fit a GLM by regressing the output variable on the covariates augmented with the features obtained from each MC sample. For each MC sample of the features, we fit an individual GLM in parallel and aggregate the resulting posteriors. These steps allow us to account for uncertainties in estimating the features. We also study an often overlooked challenge in Bayesian deep learning: practical issues in obtaining credible intervals of parameters and assessing whether they provide nominal coverage.

In our method, we model BayesCNN via a deep Gaussian process (GP) (Damianou and Lawrence, 2013), which is a nested GP. GPs (Rasmussen, 2003) are widely used to represent complex functional relationships due to their flexibility. However, due to its nested functional compositions, Bayesian inference for deep GPs is computationally expensive. Recently, Sauer et al. (2023) approximated each layer of the deep GP through the Vecchia approximation, which scales linearly. From a somewhat different perspective, Gal and Ghahramani (2016a, b) developed VB for deep GPs based on MC approximations. They showed that NNs with dropout applied after every hidden layer are equivalent to an approximation of the deep GP posterior. We build upon their approach to sample features from the approximate posterior of the last layer in BayesCNN. Training BayesCNN with the dropout approximation does not require additional computing costs compared to its frequentist counterpart. There have been several recent proposals to extract features from the hidden layers of NN models (Tran et al., 2020; Bhatnagar et al., 2022; Fong and Xu, 2021), and our method is motivated by them. The features can be regarded as reduced-dimensional summary statistics that measure complex nonlinear relationships between the high-dimensional input and the output. We observe that CNNs are especially useful for extracting information from correlated inputs. There have also been a number of recent proposals for forecasting spatio-temporal processes through ensembles of echo state networks (McDermott and Wikle, 2017, 2019). Similarly, Daw and Wikle (2022) developed a random ensemble deep spatial approach that combines a deep neural model with random weights. Our work has similarities to these previous attempts in that we aggregate fitted models obtained from each MC sample of the features. Such ensemble techniques can improve prediction over a single model and quantify uncertainties (Sagi and Rokach, 2018).

In addition to methodological improvements, our method can contribute to scientific advancement. We study the relationships between the features and response variables via principal component analysis (PCA), providing valuable insights. Furthermore, by providing adequate uncertainty quantification in both estimation and prediction, our method can aid statistically sound scientific decisions.

The outline of the remainder of this paper is as follows. In Section 2, we introduce a Bayesian DNN based on GP approximation and its estimation procedure. In Section 3, we propose an efficient BayesCGLM and provide implementation details. In Section 4, we apply BayesCGLM to several real-world data sets. We conclude with a discussion in Section 5.

2 A Bayesian Deep Neural Network: Gaussian Process Approximation

Here, we provide a brief overview of Bayesian DNNs and their approximations to deep GPs. Let $\{(\mathbf{x}_{n},\mathbf{y}_{n}),\,n=1,\cdots,N\}$ be the observed input–response pairs. A standard DNN is used to model the relationship between the input $\mathbf{x}_{n}$ and the output $\mathbf{o}_{n}\in\mathbb{R}^{d}$, which can be regarded as an expectation of $\mathbf{y}_{n}$ given $\mathbf{x}_{n}$ from a statistical viewpoint. Suppose we consider a DNN with $L$ layers, where the $l$th layer has $k_{l}$ nodes for $l=1,\cdots,L$. Then we can define the weight matrix $\mathbf{W}_{l}\in\mathbb{R}^{k_{l}\times k_{l-1}}$ that connects the $(l-1)$th layer to the $l$th hidden layer and the bias vector $\mathbf{b}_{l}\in\mathbb{R}^{k_{l}}$. Then $\bm{\theta}=\{(\mathbf{W}_{l},\mathbf{b}_{l}),\,l=1,\cdots,L\}$ is the set of parameters in the DNN, belonging to a parameter space $\bm{\Theta}$.

To model complex data, the number of NN parameters grows tremendously with increasing $l$ and $k_{l}$. Such large networks are computationally intensive and can lead to overfitting issues. Dropout (Srivastava et al., 2014) is a simple but effective strategy to address both problems. The term “dropout” refers to randomly removing connections from nodes with some probability during training. The structure of NNs with dropout is

\[
\mathbf{o}_{n}=\sigma_{L}\Big(\mathbf{W}_{L}\sigma_{L-1}\Big(\cdots\sigma_{3}\Big(\mathbf{W}_{3}\sigma_{2}\Big(\mathbf{W}_{2}\sigma_{1}\Big(\mathbf{W}_{1}\mathbf{x}_{n}+\mathbf{b}_{1}\Big)\circ\mathbf{d}_{2}+\mathbf{b}_{2}\Big)\circ\mathbf{d}_{3}+\mathbf{b}_{3}\Big)\cdots\Big)\circ\mathbf{d}_{L}+\mathbf{b}_{L}\Big), \tag{1}
\]

where $\sigma_{l}(\cdot)$ is an activation function; for example, the rectified linear unit (ReLU) or the hyperbolic tangent (TanH) function can be used. Here, $\mathbf{d}_{l}\in\mathbb{R}^{k_{l}}$, $l=2,\cdots,L$, are dropout vectors whose elements independently follow a Bernoulli distribution with a pre-specified success probability $\psi_{l}$. The notation “$\circ$” indicates the elementwise multiplication of vectors. As in Gal and Ghahramani (2016b), we apply dropout to the output of every $\sigma_{l}(\cdot)$ (i.e., the nonlinear feature). Since $\sigma_{l}(\cdot)$ has $\mathbf{W}_{l}$ and $\mathbf{b}_{l}$ in its argument, both weight and bias terms are regularized. Although dropout can alleviate overfitting issues, quantifying uncertainties in predictions or parameter estimates remains challenging for the frequentist approach.
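To make the mechanics of (1) concrete, the following is a minimal NumPy sketch of a stochastic forward pass with dropout applied to every hidden nonlinear feature; the layer sizes, activation choice, and keep probability are illustrative and not tied to any model used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def forward_with_dropout(x, weights, biases, keep_prob=0.9, rng=rng):
    """One stochastic forward pass of a dense network, with dropout applied
    to each hidden nonlinear feature as in equation (1)."""
    h = x
    L = len(weights)
    for l, (W, b) in enumerate(zip(weights, biases), start=1):
        h = W @ h + b
        if l < L:                          # hidden layers: activation + dropout
            h = relu(h)
            d = rng.binomial(1, keep_prob, size=h.shape)
            h = h * d
    return h                               # output layer kept linear here

# toy network: input dim 5, two hidden layers of width 8, scalar output
dims = [5, 8, 8, 1]
weights = [rng.normal(size=(dims[i + 1], dims[i])) * 0.3 for i in range(3)]
biases = [np.zeros(dims[i + 1]) for i in range(3)]
x = rng.normal(size=5)

# repeating the stochastic pass gives a Monte Carlo dropout ensemble of outputs
outputs = np.array([forward_with_dropout(x, weights, biases) for _ in range(100)])
print(outputs.mean(), outputs.std())       # predictive mean and spread
```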

To address this difficulty, Bayesian methods have been proposed by considering probabilistic alternatives to NNs. The Bayesian approaches consider an arbitrary function that is likely to generate the data. Specifically, deep GPs have been widely used to represent complex NN structures for both supervised and unsupervised learning problems (Gal and Ghahramani, 2016b; Damianou and Lawrence, 2013). Define $\mathbf{f}_{n,1}=\mathbf{W}_{1}\mathbf{x}_{n}+\mathbf{b}_{1}\in\mathbb{R}^{k_{1}}$ and $\mathbf{f}_{n,l}=\mathbf{W}_{l}\bm{\phi}_{n,l-1}+\mathbf{b}_{l}\in\mathbb{R}^{k_{l}}$ for $l\geq 2$, where $\bm{\phi}_{n,l}=\sigma_{l}(\mathbf{f}_{n,l})\in\mathbb{R}^{k_{l}}$ is the nonlinear output feature from the $l$th layer. For Bayesian inference, we set independent normal priors $p(\mathbf{W}_{l})$ and $p(\mathbf{b}_{l})$ for the weight and bias parameters, respectively. We denote the set of linear features $\mathbf{f}_{n,l}$ as $\mathbf{F}_{l}=\{\mathbf{f}_{n,l}\}_{n=1}^{N}\in\mathbb{R}^{N\times k_{l}}$ and the $k$th column of $\mathbf{F}_{l}$ as $\mathbf{F}^{(k)}_{l}$ $(k=1,\cdots,k_{l})$. The columns of $\mathbf{F}_{l}$ are assumed to be independent of each other given $\mathbf{F}_{l-1}$. Then, the deep GP model can be represented as

\[
\begin{split}
\mathbf{F}^{(k)}_{l}\,|\,\mathbf{F}_{l-1}&\sim N(0,\widehat{\mathbf{\Sigma}}_{l}),\quad l=2,\dots,L,\\
\mathbf{y}_{n}\,|\,\mathbf{f}_{n,L-1}&\sim p(\mathbf{y}_{n}\,|\,\mathbf{f}_{n,L-1}).
\end{split} \tag{2}
\]

Here the empirical covariance matrix $\widehat{\mathbf{\Sigma}}_{l}\in\mathbb{R}^{N\times N}$ is

\[
\widehat{\mathbf{\Sigma}}_{l}=\frac{1}{k_{l}}\,\sigma_{l}(\bm{\Phi}_{l-1}\mathbf{W}^{\top}_{l}+\mathbf{b}_{l})\,\sigma_{l}(\bm{\Phi}_{l-1}\mathbf{W}^{\top}_{l}+\mathbf{b}_{l})^{\top}, \tag{3}
\]

where $\bm{\Phi}_{l}$ is the collection of feature vectors $\{\bm{\phi}_{n,l}\}_{n=1}^{N}\in\mathbb{R}^{N\times k_{l}}$. This empirical covariance matrix is obtained through Monte Carlo integration, where each column of $\mathbf{W}_{l}$ and each element of $\mathbf{b}_{l}$ are sampled from $p(\mathbf{W}_{l})$ and $p(\mathbf{b}_{l})$, respectively. According to Gal and Ghahramani (2016b), (3) becomes an accurate approximation of the true covariance matrix as $k_{l}$ increases. This implies that an arbitrary data-generating function is modeled through a nested GP with the covariance of the extracted features from the previous hidden layer. In (2), $\mathbf{y}_{n}$ is a response variable with distribution $p(\cdot\,|\,\mathbf{f}_{n,L-1})$ belonging to the exponential family with a continuously differentiable link function $g$ such that $g(E[\mathbf{y}_{n}\,|\,\bm{\phi}_{n,L-1}])=\mathbf{W}_{L}\bm{\phi}_{n,L-1}+\mathbf{b}_{L}$. Note that $g^{-1}(\cdot)$ can be regarded as $\sigma_{L}(\cdot)$ in (1). As in typical mixed effects models, conditional independence of $\{\mathbf{y}_{n}\}_{n=1}^{N}$ is assumed.
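As a small illustration of how the empirical covariance in (3) is formed, the sketch below draws one set of weights and biases from standard normal priors and computes the resulting $N\times N$ matrix; the dimensions, priors, and ReLU activation are assumptions for the example only.

```python
import numpy as np

rng = np.random.default_rng(1)
N, k_prev, k_l = 50, 10, 200        # observations, widths of layers l-1 and l

Phi_prev = rng.normal(size=(N, k_prev))   # feature matrix Phi_{l-1} from layer l-1

# one draw of the weights and biases from (standard normal) priors
W_l = rng.normal(size=(k_l, k_prev))
b_l = rng.normal(size=k_l)

Feat = np.maximum(Phi_prev @ W_l.T + b_l, 0.0)   # sigma_l(Phi_{l-1} W_l^T + b_l) with ReLU
Sigma_hat = Feat @ Feat.T / k_l                  # empirical covariance in (3), N x N
print(Sigma_hat.shape)                           # (50, 50)
```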

Since Bayesian inference for deep GPs using Markov chain Monte Carlo (MCMC) is not straightforward to implement, Gal and Ghahramani (2016a) developed VB for deep GPs based on MC approximation, which is called MC dropout. They used a normal mixture distribution as a variational distribution $q(\bm{\theta})$ to approximate the posterior distribution $\pi(\bm{\theta}\,|\,\{\mathbf{x}_{n},\mathbf{y}_{n}\}_{n=1}^{N})$ of the deep GP. Specifically, the variational distributions are defined as

\[
\begin{split}
q(\mathbf{W}_{l})&=\prod_{\forall i,j}q(w_{l,ij}),\qquad q(\mathbf{b}_{l})=\prod_{\forall i}q(b_{l,i}),\\
q(w_{l,ij})&=p_{l}\,N(\mu^{w}_{l,ij},\sigma^{2})+(1-p_{l})\,N(0,\sigma^{2}),\\
q(b_{l,i})&=p_{l}\,N(\mu^{b}_{l,i},\sigma^{2})+(1-p_{l})\,N(0,\sigma^{2}),
\end{split} \tag{4}
\]

where $w_{l,ij}$ is the $(i,j)$th element of the weight matrix $\mathbf{W}_{l}\in\mathbb{R}^{k_{l}\times k_{l-1}}$ and $b_{l,i}$ is the $i$th element of the bias vector $\mathbf{b}_{l}\in\mathbb{R}^{k_{l}}$. For the weight parameters, $\mu^{w}_{l,ij}$ and $\sigma^{2}$ are variational parameters that control the mean and spread of the distributions, respectively. As the inclusion probability $p_{l}\in[0,1]$ becomes close to 0, $q(w_{l,ij})$ approaches $N(0,\sigma^{2})$, indicating that the weight parameter is likely to be dropped (i.e., $w_{l,ij}=0$). Similarly, the variational distribution for the bias parameters is modeled with a normal mixture distribution. Then, the optimal variational parameters are obtained by minimizing the Kullback–Leibler (KL) divergence between $q(\bm{\theta})$ and $\pi(\bm{\theta}\,|\,\{\mathbf{x}_{n},\mathbf{y}_{n}\}_{n=1}^{N})$, where minimizing the KL divergence is equivalent to maximizing the evidence lower bound (ELBO), $\mbox{E}_{q}[\log\pi(\bm{\theta},\{\mathbf{x}_{n},\mathbf{y}_{n}\}_{n=1}^{N})]-\mbox{E}_{q}[\log q(\bm{\theta})]$. With the independent variational distribution $q(\bm{\theta}):=\prod_{l=1}^{L}q(\mathbf{W}_{l})q(\mathbf{b}_{l})$, the log ELBO of the deep GP is
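The sketch below shows how a weight matrix could be drawn from the mixture variational distribution in (4); the variational means, inclusion probability, and (small) standard deviation are placeholders. As $\sigma\to 0$, the draw reduces to multiplying the mean weights by Bernoulli masks, which is exactly dropout.

```python
import numpy as np

rng = np.random.default_rng(2)
k_l, k_prev = 8, 5
p_l, sigma = 0.9, 1e-3                         # inclusion probability and mixture sd

mu_W = rng.normal(size=(k_l, k_prev))          # variational means mu^w_{l,ij}
z = rng.binomial(1, p_l, size=(k_l, k_prev))   # 1 = keep the connection, 0 = drop it
W_sample = z * mu_W + sigma * rng.normal(size=(k_l, k_prev))   # one draw from q(W_l)
```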

\[
\begin{split}
\mathcal{L}_{\text{GP-VI}}&:=\sum_{n=1}^{N}\int\cdots\int\prod_{l=1}^{L}q(\mathbf{W}_{l})q(\mathbf{b}_{l})\log p(\mathbf{y}_{n}\,|\,\mathbf{x}_{n},\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L})\,d\mathbf{W}_{1}d\mathbf{b}_{1}\cdots d\mathbf{W}_{L}d\mathbf{b}_{L}\\
&\quad-\text{KL}\Big(\prod_{l=1}^{L}q(\mathbf{W}_{l})q(\mathbf{b}_{l})\,\Big{|}\Big{|}\,p(\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L})\Big). \tag{5}
\end{split}
\]

To emphasize the dependence on $\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L}$, we write $p(\mathbf{y}_{n}\,|\,\mathbf{f}_{n,L-1})$ in (2) as $p(\mathbf{y}_{n}\,|\,\mathbf{x}_{n},\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L})$, which is identical. Since the direct maximization of (5) is challenging due to the intractable integration, Gal and Ghahramani (2016b) replaced it with the MC approximation

\[
\begin{split}
\mathcal{L}_{\text{GP-MC}}&:=\frac{1}{M}\sum_{m=1}^{M}\sum_{n=1}^{N}\log p(\mathbf{y}_{n}\,|\,\mathbf{x}_{n},\{\mathbf{W}_{l}^{(m)},\mathbf{b}_{l}^{(m)}\}_{l=1}^{L})\\
&\quad-\text{KL}\Big(\prod_{l=1}^{L}q(\mathbf{W}_{l})q(\mathbf{b}_{l})\,\Big{|}\Big{|}\,\prod_{l=1}^{L}p(\mathbf{W}_{l})p(\mathbf{b}_{l})\Big), \tag{6}
\end{split}
\]

where $\{\{\mathbf{W}_{l}^{(m)},\mathbf{b}_{l}^{(m)}\}_{l=1}^{L}\}_{m=1}^{M}$ are MC samples from the variational distribution in (4).

Note that the MC samples are generated for each stochastic gradient descent (SGD) update, and the estimates from $\mathcal{L}_{\text{GP-MC}}$ would converge to those obtained from $\mathcal{L}_{\text{GP-VI}}$ (Paisley et al., 2012; Rezende et al., 2014). Furthermore, Gal and Ghahramani (2016b) showed that (6) converges to the frequentist loss function of a DNN with dropout when $\sigma$ in (4) is small and $k_{l}$ becomes large. This result implies that standard DNNs with dropout applied after every hidden layer are equivalent to an approximation of the deep GP posterior. We provide mathematical details in Web Appendix A.1.

In many applications, the prediction of unobserved responses is of great interest. Consider an unobserved response $\mathbf{y}_{n}^{\ast}\in\mathbb{R}^{d}$ with an input $\mathbf{x}_{n}^{\ast}\in\mathbb{R}^{k_{0}}$. For a given $\mathbf{x}_{n}^{\ast}$, we can obtain $\bm{\phi}^{\ast}_{n,l}\in\mathbb{R}^{k_{l}}$ from the approximate posterior predictive distribution based on MC dropout. Although Gal and Ghahramani (2016b) focused on predicting $\mathbf{y}_{n}^{\ast}$, this procedure also allows us to construct informative and low-dimensional summary statistics for $\mathbf{x}_{n}^{\ast}$. Specifically, we can sample an output feature vector $\bm{\phi}^{\ast}_{n,L-1}\in\mathbb{R}^{k_{L-1}}$, which summarizes the complex nonlinear dependence relationships between input and output variables up to the last layer of the NN. In Section 3, we use these extracted features as additional covariates in the GLM framework.
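As an illustration of how MC samples of the last-hidden-layer features can be drawn in practice, the following sketch uses TensorFlow/Keras with dropout kept active at prediction time (training=True); the toy data, layer widths, and dropout rate are assumptions, not the architectures used later in the paper.

```python
import numpy as np
from tensorflow import keras

# toy regression data and a small network with dropout after each hidden layer
rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5)).astype("float32")
y = (X.sum(axis=1) + rng.normal(size=200)).astype("float32")

inputs = keras.Input(shape=(5,))
h = keras.layers.Dense(16, activation="relu")(inputs)
h = keras.layers.Dropout(0.1)(h)
h = keras.layers.Dense(8, activation="relu", name="last_hidden")(h)
h = keras.layers.Dropout(0.1)(h)
outputs = keras.layers.Dense(1)(h)
model = keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, verbose=0)

# sub-model mapping the input to the last hidden layer
feature_model = keras.Model(inputs, model.get_layer("last_hidden").output)

# M stochastic forward passes; training=True keeps the dropout masks random
M = 100
Phi_samples = np.stack(
    [feature_model(X, training=True).numpy() for _ in range(M)], axis=0
)
print(Phi_samples.shape)   # (M, 200, 8): MC samples of the extracted features
```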

3 Flexible Generalized Linear Models with Convolutional Neural Networks

As a variant of DNNs, CNNs have been widely used in computer vision and image classification (Krizhevsky et al., 2012; Shridhar et al., 2019). CNNs can effectively extract important features from images or correlated datasets with nonlinear dependencies. In our scientific applications, we have spatially correlated data for malaria incidence, MRI images for brain tumors, and an fMRI correlation matrix of the brain for anxiety disorder. Given that our applications involve images or correlated datasets, CNNs are well suited for capturing structural dependence by extracting local information through convolution layers.

We propose a Bayesian convolutional generalized linear model (BayesCGLM) based on two appealing methodologies: (1) CNNs have been widely used to utilize correlated predictors (or input variables) with spatial structures, such as images or geospatial data; (2) GLMs can study the relationship between covariates and non-Gaussian responses by extending the linear regression framework. Let $\mathbf{D}=\{\mathbf{X},\mathbf{Y},\mathbf{Z}\}$ be the observed dataset, where $\mathbf{X}=\{\mathbf{X}_{n}\}_{n=1}^{N}$ is the collection of correlated inputs (i.e., each $\mathbf{X}_{n}$ is a matrix or tensor) and $\mathbf{Y}=\{\mathbf{y}_{n}\}_{n=1}^{N}$ is the corresponding output. We also let $\mathbf{Z}=\{\mathbf{z}_{n}\}_{n=1}^{N}$ be a collection of $p$-dimensional vector covariates. We use BayesCNN to extract the features from $\mathbf{X}$ and employ them in a GLM with the covariates $\mathbf{Z}$ to predict $\mathbf{Y}$.

Such an approach is useful in that the feature is optimized for the prediction of $\mathbf{Y}$. Since the feature is guided by $\mathbf{Y}$, we can construct a map from the features to the output. Similar to our approach, Bhatnagar et al. (2022) also extracted features through a DNN and used the features to fit their model. They showed that this can improve the prediction performance of the model. In Section 4, we observe that the feature serves as an important summary statistic for predicting the response $\mathbf{Y}$. Compared to Bhatnagar et al. (2022), we also consider the uncertainties in estimating the features using MC dropout.

3.1 Bayesian Convolutional Neural Network

A Bayesian perspective can quantify uncertainties through posterior densities of $\bm{\theta}$. Specifically, Gal and Ghahramani (2016a) proposed an efficient Bayesian alternative by regarding the weight parameters in CNN kernels as random variables. By extending a result in Gal and Ghahramani (2016b), Gal and Ghahramani (2016a) showed that applying dropout after every convolution layer can approximate the intractable posterior distribution of a deep GP.

Let $\mathbf{X}_{n}\in\mathbb{R}^{M_{0}\times R_{0}\times C_{0}}$ be an input with height $M_{0}$, width $R_{0}$, and $C_{0}$ channels. For example, $M_{0}$ and $R_{0}$ are determined by the resolution of the images, and $C_{0}=3$ for color images ($C_{0}=1$ for black-and-white images). There are three types of layers in a CNN: convolution, pooling, and fully connected layers. Convolution layers move kernels (filters) over the input image and extract information, resulting in a “feature map”. Each cell in the feature map is obtained from element-wise multiplication between the image and the filter matrix. The weight values in the filter matrices are parameters to be estimated. Higher output values indicate stronger signals on the image, which can account for spatial structures. We can extract important neighborhood features from the input by shifting the kernels over all pixel locations with a certain step size (stride). Then, pooling layers are applied to downsample the feature maps. The convolution and pooling layers are applied repeatedly to reduce the dimension of the input image. Finally, the extracted features from the convolution/pooling layers are flattened and connected to the output layer through a fully connected layer, which can learn nonlinear relationships between the features and the output.

Consider a CNN with $L$ layers, where the last two are the fully connected and output layers. For the $l$th convolution layer, there are $C_{l}$ kernels $\mathbf{K}_{l,c}\in\mathbb{R}^{H_{l}\times D_{l}\times C_{l-1}}$ with height $H_{l}$, width $D_{l}$, and $C_{l-1}$ channels for $c=1,\cdots,C_{l}$. Note that the dimension of the channel is determined by the number of kernels in the previous convolution layer. After applying the $l$th convolution layer, the $(m,r)$th element of the feature matrix $\bm{\eta}_{n,(l,c)}\in\mathbb{R}^{M_{l}\times R_{l}}$ becomes

\[
[\bm{\eta}_{n,(l,c)}]_{m,r}=\sigma\Big(\sum_{i=1}^{H_{l}}\sum_{j=1}^{D_{l}}\sum_{k=1}^{C_{l-1}}\big([\mathbf{K}_{l,c}]_{i,j,k}\,[\bm{\eta}_{n,(l-1,c)}]_{m+i-1,r+j-1,k}\big)+b_{l,c}\Big), \tag{7}
\]

where $b_{l,c}$ is the bias parameter. For the first convolution layer, $\bm{\eta}_{n,(l-1,c)}$ is replaced with the input image $\mathbf{X}_{n}$ (i.e., $\bm{\eta}_{n,(0,c)}=\mathbf{X}_{n}$). For computational efficiency and stability, ReLU functions are widely used as the activation function $\sigma(\cdot)$. By shifting kernels, the convolution layer can capture neighborhood information, and the neighborhood structure is determined by the size of the kernels. Gal and Ghahramani (2016a) pointed out that extracting the features from the convolution operation can be reformulated as an affine transformation in a standard DNN (details are in Web Appendix A.2). Therefore, CNNs can also be represented as a deep GP with the dropout approximation; applying dropout after every convolution layer before pooling can approximate a deep GP.
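A sketch of a CNN with Bernoulli dropout applied after every convolution layer (before pooling), written in Keras; the filter counts, kernel sizes, dropout rate, and the sigmoid output for a binary response are illustrative choices rather than the exact architecture used in the applications.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_bayescnn(input_shape=(240, 240, 1), n_features=16, rate=0.1):
    """Toy Bayesian CNN: dropout follows each convolution layer and the
    last hidden (fully connected) layer that yields the extracted features."""
    inputs = keras.Input(shape=input_shape)
    h = layers.Conv2D(32, (3, 3), activation="relu")(inputs)
    h = layers.Dropout(rate)(h)                       # dropout before pooling
    h = layers.MaxPooling2D((2, 2))(h)
    h = layers.Conv2D(64, (3, 3), activation="relu")(h)
    h = layers.Dropout(rate)(h)
    h = layers.MaxPooling2D((2, 2))(h)
    h = layers.Flatten()(h)
    features = layers.Dense(n_features, activation="relu", name="last_hidden")(h)
    features = layers.Dropout(rate)(features)
    outputs = layers.Dense(1, activation="sigmoid")(features)   # binary response
    return keras.Model(inputs, outputs)

model = build_bayescnn()
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```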

Note that optimal hyperparameter tuning in NNs requires a number of empirical fits with different hyperparameter values. Wang et al. (2019) pointed out that a higher dropout rate can lead to a slower convergence rate, while a lower rate can deteriorate generalization performance. Although Gal and Ghahramani (2016b, a) used dropout rates of around 0.1–0.5, such values do not always guarantee optimal performance. Therefore, we recommend conducting empirical studies by varying dropout rates within this range for validation. In Web Appendix B, we observe that our method is robust across different dropout rates.

After the $(L-2)$th convolution layer, the extracted feature tensor $\bm{\eta}_{n,L-2}\in\mathbb{R}^{M_{L-2}\times R_{L-2}\times C_{L-2}}$ is flattened in the fully connected layer. By vectorizing $\bm{\eta}_{n,L-2}$, we can represent the feature tensor as a feature vector $\bm{\phi}_{n,L-2}=\operatorname{vec}(\bm{\eta}_{n,L-2})\in\mathbb{R}^{k_{L-2}}$, where $k_{L-2}=M_{L-2}R_{L-2}C_{L-2}$. Finally, we can extract a feature vector $\bm{\phi}_{n,L-1}\in\mathbb{R}^{k_{L-1}}$ from the fully connected layer. From here, we define the collection of the extracted feature vectors as $\bm{\Phi}=\{\bm{\phi}_{n,L-1}\}_{n=1}^{N}$; we suppress the index $L-1$ in $\bm{\Phi}_{L-1}$ for notational convenience.

The training step of BayesCNN is identical to that of a standard CNN. In the prediction step, the algorithm performs forward propagation given the estimated NN parameters. For each iteration, $\{\mathbf{d}_{l}\}_{l=2}^{L}$ are randomly sampled from Bernoulli distributions with rates $\psi_{l}$ and applied to the features. The algorithm repeats this $M$ times to obtain an MC sample of $M$ predicted outputs. We provide implementation details for BayesCNN in Web Appendix C.1.

3.2 Bayesian Convolutional Generalized Linear Model

We propose BayesCGLM by regressing $\mathbf{Y}$ on $\bm{\Phi}$ and additional covariates $\mathbf{Z}$. Here, we use $\bm{\Phi}$ as a basis matrix that encapsulates the information in the high-dimensional and correlated input variables $\mathbf{X}$. Let $\mathbf{A}=[\mathbf{Z},\bm{\Phi}]\in\mathbb{R}^{N\times(p+k_{L-1})}$ be the design matrix. Then, our model is

\[
g(E[\mathbf{Y}\,|\,\mathbf{Z},\bm{\Phi}])=\mathbf{Z}\bm{\gamma}+\bm{\Phi}\bm{\delta}=\mathbf{A}\bm{\beta}, \tag{8}
\]

where $\bm{\beta}=(\bm{\gamma}^{\top},\bm{\delta}^{\top})^{\top}\in\mathbb{R}^{p+k_{L-1}}$ is the vector of regression coefficients and $g(\cdot)$ is a one-to-one, continuously differentiable link function. The proposed method can provide an interpretation of the regression coefficients and quantify uncertainties in prediction and estimation. Furthermore, the training step does not require additional computational costs compared to a traditional CNN. We train BayesCNN with the negative log-likelihood loss function corresponding to the type of output variable; specifically, we use the mean squared loss, binary cross-entropy loss, and Poisson loss for Gaussian, binary, and count outputs, respectively. Such choices are natural in that the loss functions are directly defined by the distribution of the output variable.
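For instance, with a tf.keras model such as the CNN sketch in Section 3.1, the training loss could be matched to the response type as follows; this is a sketch, and `model` is assumed to have the appropriate output activation for each case.

```python
from tensorflow import keras

model.compile(optimizer="adam", loss="mse")                      # Gaussian response
# model.compile(optimizer="adam", loss="binary_crossentropy")    # binary response
# model.compile(optimizer="adam", loss=keras.losses.Poisson())   # count response
```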

Since we place a posterior distribution over the kernels of the fitted BayesCNN (Section 3.1), we can generate MC samples of the features $\bm{\Phi}^{(m)}$, $m=1,\cdots,M$, from the variational distribution. To account for the uncertainties in estimating the features, we use all the MC samples of the features rather than their point estimates. For each MC sample $\mathbf{A}^{(m)}=[\mathbf{Z},\bm{\Phi}^{(m)}]$, we fit an individual GLM in parallel; this yields $M$ posterior distributions of $\bm{\beta}_{m}=(\bm{\gamma}_{m}^{\top},\bm{\delta}_{m}^{\top})^{\top}\in\mathbb{R}^{p+k_{L-1}}$. This step is parallelizable, so computational wall times decrease with more cores. Then, we construct an aggregated distribution (the “ensemble-posterior”) of $\bm{\beta}$. In the following section, we provide details about constructing the ensemble-posterior distribution. The fitting procedure of BayesCGLM is summarized in Web Appendix C.2.
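The ensemble GLM step could be sketched as follows with statsmodels, assuming a Gaussian response; `Phi_samples`, `Z`, and `y` are toy stand-ins for the MC feature samples, vector covariates, and output, and the parallelization is omitted for brevity.

```python
import numpy as np
import statsmodels.api as sm

def fit_ensemble_glms(Phi_samples, Z, y, family=sm.families.Gaussian()):
    """Fit one GLM per MC feature sample on the design [1, Z, Phi^(m)]."""
    fits = []
    for Phi_m in Phi_samples:                            # m = 1, ..., M
        A_m = sm.add_constant(np.hstack([Z, Phi_m]))
        fits.append(sm.GLM(y, A_m, family=family).fit())
    return fits

# toy data standing in for extracted features, covariates, and responses
rng = np.random.default_rng(4)
M, N, k, p = 20, 100, 8, 2
Phi_samples = rng.normal(size=(M, N, k))
Z = rng.normal(size=(N, p))
y = Z @ np.array([1.0, -0.5]) + rng.normal(size=N)

fits = fit_ensemble_glms(Phi_samples, Z, y)
# ensemble point estimate of gamma (coefficients on Z; intercept is column 0)
gamma_hat = np.mean([f.params[1:1 + p] for f in fits], axis=0)
print(gamma_hat)
```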

In general, inference for the regression coefficients $\bm{\gamma}$ in (8) is challenging for DNNs. Although one can regard the weight parameters at the last hidden layer as $\bm{\gamma}$, we cannot obtain uncertainties for these estimates. Even BayesDNN (Gal and Ghahramani, 2016b) only provides point estimates of weight parameters, although it can quantify uncertainties in response prediction. On the other hand, our method directly provides the posterior distribution of $\bm{\gamma}$, which is useful for standard inference. In addition, the complex structure of DNNs often leads to nonconvex optimization (Goodfellow et al., 2014; Dauphin et al., 2014), resulting in inaccurate estimates of $\bm{\gamma}$. Since the stopping rule of the SGD algorithm is based on prediction accuracy (not on convergence in parameter estimation), the convergence of individual parameters, including $\bm{\gamma}$, cannot be guaranteed. The problem is further complicated by the large number of parameters that need to be updated simultaneously in the SGD algorithm. Note that BayesCGLM also uses the SGD algorithm to extract $\bm{\Phi}$. From this, we obtain one of the optimal combinations of weight parameters and the corresponding $\bm{\Phi}$. Although this $\bm{\Phi}$ may not be the global optimizer, it is still informative for predicting responses because it is obtained by minimizing the loss function. However, this does not guarantee the convergence of individual weight parameters in nonconvex optimization. Instead, our two-stage approach alleviates this issue by using the extracted $\bm{\Phi}$ as a fixed basis matrix in the GLM. In Web Appendix D, we conduct simulation studies under different model configurations and show that BayesCGLM can recover the true $\bm{\gamma}$ values well, while estimates from BayesCNN can be biased.

3.3 Bayesian Inference

Here, we describe Bayesian inference for the proposed method. We first define the posterior distribution of the regression coefficients, $\pi(\bm{\beta}\,|\,\mathbf{D})$. Let $\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L-1}$ be the weights and biases corresponding to all hidden layers in the CNN. We can vectorize the kernels in the CNN and represent them as weights as in a standard DNN (see Web Appendix A.2 for details). Note that $\bm{\Phi}$ is obtained deterministically from forward propagation for given $\mathbf{X}$ and $\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L-1}$. Therefore, the posterior distribution can be represented as

\[
\begin{split}
\pi(\bm{\beta}\,|\,\mathbf{D})&=\int\pi(\bm{\beta}\,|\,\mathbf{D},\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L-1})\\
&\quad\times\prod_{l=1}^{L-1}\pi(\mathbf{W}_{l}\,|\,\mathbf{D})\,\pi(\mathbf{b}_{l}\,|\,\mathbf{D})\,d\mathbf{W}_{1}d\mathbf{b}_{1}\cdots d\mathbf{W}_{L-1}d\mathbf{b}_{L-1}, \tag{9}
\end{split}
\]

where $\pi(\bm{\beta}\,|\,\mathbf{D},\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L-1})$ is the conditional posterior, and $\pi(\mathbf{W}_{l}\,|\,\mathbf{D})$ and $\pi(\mathbf{b}_{l}\,|\,\mathbf{D})$ are the marginal posteriors for the weights and biases, respectively. Since it is challenging to compute (9) directly, we approximate it through MC dropout as

\[
\int\pi(\bm{\beta}\,|\,\mathbf{D},\{\mathbf{W}_{l},\mathbf{b}_{l}\}_{l=1}^{L-1})\prod_{l=1}^{L-1}q(\mathbf{W}_{l})q(\mathbf{b}_{l})\,d\mathbf{W}_{1}d\mathbf{b}_{1}\cdots d\mathbf{W}_{L-1}d\mathbf{b}_{L-1}, \tag{10}
\]

where $q(\mathbf{W}_{l})$ and $q(\mathbf{b}_{l})$ are the variational distributions defined in (4). Then the MC approximation to (10) is

\[
\frac{1}{M}\sum_{m=1}^{M}\pi(\bm{\beta}_{m}\,|\,\mathbf{D},\{\mathbf{W}^{(m)}_{l},\mathbf{b}^{(m)}_{l}\}_{l=1}^{L-1}). \tag{11}
\]

Here $\{\{\mathbf{W}^{(m)}_{l},\mathbf{b}^{(m)}_{l}\}_{l=1}^{L-1}\}_{m=1}^{M}$ are sampled from (4). One may choose a standard MCMC algorithm to estimate each $\pi(\bm{\beta}_{m}\,|\,\mathbf{D},\{\mathbf{W}^{(m)}_{l},\mathbf{b}^{(m)}_{l}\}_{l=1}^{L-1})$ in (11). However, this may require a long run of the chain to guarantee good mixing, resulting in excessive computational cost. Instead, we use the Laplace method to approximate each $\pi(\bm{\beta}_{m}\,|\,\mathbf{D},\{\mathbf{W}^{(m)}_{l},\mathbf{b}^{(m)}_{l}\}_{l=1}^{L-1})$. In our preliminary studies, we observed that the Laplace approximation leads to inference results similar to the MCMC method within a much shorter computing time when the sample size is large enough.

For the Laplace approximation, we first compute $\bm{\Phi}^{(m)}$ for the given $\{\mathbf{W}^{(m)}_{l},\mathbf{b}^{(m)}_{l}\}_{l=1}^{L-1}$ through forward propagation. Then, we obtain the maximum likelihood estimate (MLE) $\widehat{\bm{\beta}}_{m}$ from the GLM regressing $\mathbf{Y}$ on $\bm{\Phi}^{(m)}$ and $\mathbf{Z}$. We approximate the posterior of $\bm{\beta}_{m}$ as $\mathcal{N}(\widehat{\bm{\beta}}_{m},\widehat{\mathbf{B}}^{-1}_{m})$, where $\widehat{\mathbf{B}}_{m}\in\mathbb{R}^{(p+k_{L-1})\times(p+k_{L-1})}$ is the observed Fisher information matrix from the $m$th MC sample. From this procedure, (11) can be approximated as

\[
\frac{1}{M}\sum_{m=1}^{M}\varphi(\bm{\beta};\widehat{\bm{\beta}}_{m},\widehat{\mathbf{B}}^{-1}_{m}), \tag{12}
\]

where $\varphi(\bm{x};\bm{\mu},\bm{\Sigma})$ is a multivariate normal density with mean $\bm{\mu}$ and covariance $\bm{\Sigma}$. In summary, (12) approximates the posterior distribution $\pi(\bm{\beta}\,|\,\mathbf{D})$ through MC dropout and the Laplace method. Note that BayesCGLM approximates $\pi(\bm{\beta}\,|\,\mathbf{D})$ through an MC average from the same probability model (i.e., the same data-generating mechanism); therefore, the interpretation of $\bm{\beta}$ does not change from ensemble to ensemble. However, the interpretation of $\bm{\delta}$ is more difficult because each column of $\bm{\Phi}$ comes from complex (and repeated) convolution operations, and it is challenging to explain the impact of changes in the features based on the estimated $\bm{\delta}$.
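Sampling from the ensemble posterior (12) can be sketched as below: pick one of the $M$ fitted GLMs at random, then draw from its Gaussian (Laplace) approximation. The `fits` object is assumed to come from the ensemble-GLM sketch in Section 3.2, and `cov_params()` returns the estimated inverse Fisher information of each fit.

```python
import numpy as np

def sample_ensemble_posterior(fits, n_draws=2000, seed=5):
    """Draw from the normal-mixture approximation (12) to pi(beta | D)."""
    rng = np.random.default_rng(seed)
    draws = []
    for _ in range(n_draws):
        f = fits[rng.integers(len(fits))]          # choose an MC fit uniformly
        mean = np.asarray(f.params)
        cov = np.asarray(f.cov_params())           # Laplace covariance B_m^{-1}
        draws.append(rng.multivariate_normal(mean, cov))
    return np.array(draws)

draws = sample_ensemble_posterior(fits)
print(draws.mean(axis=0))                                # posterior means
print(np.percentile(draws, [2.5, 97.5], axis=0))         # 95% credible bounds
```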

Similar to BayesCNN, our method can quantify uncertainties in predictions, which is an advantage over deterministic NNs. Note that BayesCGLM can also provide interpretations of model parameters with uncertainty quantification, while BayesCNN cannot. For an unobserved response $\mathbf{Y}^{\ast}\in\mathbb{R}^{N_{\text{test}}}$, we have $\mathbf{A}^{\ast}_{m}=[\mathbf{Z}^{\ast},\bm{\Phi}^{\ast(m)}]\in\mathbb{R}^{N_{\text{test}}\times(p+k_{L-1})}$ for $m=1,\cdots,M$. Here, $\bm{\Phi}^{\ast(m)}$ is the $m$th feature matrix extracted from the last hidden layer of the trained BayesCNN for given $\mathbf{X}^{\ast}$ and $\mathbf{Z}^{\ast}$. Then, from (12), the predictive distribution of the linear predictor is

\[
\frac{1}{M}\sum_{m=1}^{M}\varphi(\mathbf{A}^{\ast}\bm{\beta};\,\mathbf{A}_{m}^{\ast}\widehat{\bm{\beta}}_{m},\,\mathbf{A}_{m}^{\ast}\widehat{\mathbf{B}}^{-1}_{m}\mathbf{A}^{\ast\top}_{m}). \tag{13}
\]

Given (13), we can generate $\mathbf{Y}^{\ast}$ from the underlying distribution. For example, for a Gaussian response, we have $\mathbf{Y}^{\ast}\sim\mathcal{N}\big(\frac{1}{M}\sum_{m=1}^{M}\mathbf{A}^{\ast}\widehat{\bm{\beta}}_{m},\,\widehat{\sigma}^{2}\big)$, where $\widehat{\sigma}^{2}=\sum_{m=1}^{M}(\mathbf{A}^{(m)}\widehat{\bm{\beta}}_{m}-\mathbf{Y})^{\top}(\mathbf{A}^{(m)}\widehat{\bm{\beta}}_{m}-\mathbf{Y})/NM$. For a count response, we can simulate $\mathbf{Y}^{\ast}$ from the Poisson distribution with intensity $\exp\big(\frac{1}{M}\sum_{m=1}^{M}\mathbf{A}^{\ast}\widehat{\bm{\beta}}_{m}\big)$.
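A sketch of posterior predictive draws for a Gaussian response via (13), continuing the toy objects above: for each MC fit, the linear predictor is drawn from its Gaussian approximation and observation noise is added. The names `A_star_list` (the test design matrices $[\mathbf{Z}^{\ast},\bm{\Phi}^{\ast(m)}]$ with an intercept column) and `sigma2_hat` are assumed inputs.

```python
import numpy as np

def predictive_draws(fits, A_star_list, sigma2_hat, seed=6):
    """One predictive draw of Y* per MC fit, following (13) plus Gaussian noise."""
    rng = np.random.default_rng(seed)
    draws = []
    for f, A_star in zip(fits, A_star_list):
        mean = A_star @ f.params                       # A*_m beta_hat_m
        cov = A_star @ f.cov_params() @ A_star.T       # A*_m B_m^{-1} A*_m^T
        eta = rng.multivariate_normal(mean, cov)       # linear predictor draw
        draws.append(eta + rng.normal(scale=np.sqrt(sigma2_hat), size=len(eta)))
    return np.array(draws)   # shape (M, N_test); summarize by mean and HPD interval
```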

As we described, BayesCGLM involves several approximation steps: (1) MC dropout and (2) the Laplace method; therefore, investigating approximation quality is crucial. In Web Appendix D.5, we conduct a numerical study under the simple NN setting. Here, we use a standard NN to extract features; therefore, we refer to our method as BayesDGLM. We compare the performance of BayesDGLM, BayesDNN, and a fully Bayesian method (MCMC). We observe that the BayesDGLM provides comparable inference results to the fully Bayesian method with much smaller computational costs. Furthermore, BayesDNN cannot quantify uncertainties in parameter estimates, while BayesDGLM can. Considering that it is infeasible to apply the fully Bayesian method for complex DNNs, BayesDGLM would be a practical option in many applications. We provide details for the experiment in Web Appendix D.5.

4 Applications

We apply our method to brain tumor image data and fMRI data for anxiety scores. We also provide results for a geospatial application in Web Appendix E. We compare our model with BayesCNN and GLM to assess its performance. We evaluate prediction accuracy using the root mean square prediction error (RMSPE) and uncertainty quantification using the empirical coverage of prediction intervals. Details about the CNN structure are provided in Web Appendix F.

4.1 Brain Tumor Image Data

According to Noone et al. (2020), brain and nervous system tumors have low 5-year relative survival rates, which gradually increased from 1975 to 2016, reaching 33%. Furthermore, some brain tumors can also be cancerous. Considering that the 5-year survival rate across all cancers is 67%, brain tumors can be life-threatening. MRI is the standard diagnostic tool for brain tumors. Texture-based first- and second-order statistics of MRI images are useful for classification (Aggarwal and Agrawal, 2012) and are widely used in the classification of biomedical images (Ismael and Abdel-Qader, 2018). However, these statistics do not capture the spatial information of correlated structures in MRI images. Automated methods for brain tumor classification are becoming increasingly important, as manual analysis of MRI data by medical experts is challenging. It is also crucial to convey the uncertainty associated with these automated methods to ensure statistically sound decision-making. Compared to previous studies (Aggarwal and Agrawal, 2012; Ismael and Abdel-Qader, 2018), we use both MRI images and texture-based statistics to predict brain tumors while quantifying uncertainties. The dataset is collected from the Brain Tumor Image Segmentation Challenge (Menze et al., 2014). The binary response indicates whether a patient has a fatal brain tumor. For each patient, we have an MRI image and two summary variables (first- and second-order features of the MRI image), which we use as the covariates $\mathbf{Z}$. We use $N=2,508$ observations to fit our model and reserve $N_{\text{test}}=1,254$ observations for validation. We fit BayesCNN using the $240\times 240$ pixel gray-scale images $\mathbf{X}$ and scalar covariates $\mathbf{Z}$ as input, and the binary response $\mathbf{Y}$ as output. From the fitted BayesCNN, we extract $\bm{\Phi}\in\mathbb{R}^{2,508\times 16}$ from the last hidden layer. With the extracted features, we fit logistic GLMs by regressing $\mathbf{Y}$ on $[\mathbf{Z},\bm{\Phi}^{(m)}]\in\mathbb{R}^{2,508\times 18}$ for $m=1,\cdots,500$.

Table 1 shows the inference results from the different methods. Our method shows that the first-order feature has a negative relationship with brain tumor risk, while the second-order feature has a positive relationship. Both coefficients are statistically significant based on the highest posterior density (HPD) intervals. Note that previous work (Aggarwal and Agrawal, 2012) only predicted tumor status but did not provide such an interpretation of the first- and second-order statistics. Therefore, we can conclude that darker MRI images (larger first-order statistics) with lower variation (smaller second-order statistics) indicate a lower possibility of tumor presence. We observe that the signs of the $\bm{\gamma}$ estimates from BayesCGLM are aligned with those from GLM.

Table 1: Inference results for the brain tumor dataset from different methods. For all methods, the posterior mean of $\bm{\gamma}$, the 95% HPD interval, accuracy, recall, precision, and computing time (min) are reported.

                                                     BayesCGLM (M=500)   BayesCNN (M=500)   GLM
$\gamma_{1}$ (first-order feature)   Mean            -5.332              0.248              -2.591
                                     95% Interval    (-7.049, -3.704)    -                  (-2.769, -2.412)
$\gamma_{2}$ (second-order feature)  Mean            4.894               0.160              2.950
                                     95% Interval    (3.303, 6.564)      -                  (2.755, 3.144)
Prediction                           Accuracy        0.924               0.867              0.784
                                     Recall          0.929               0.787              0.783
                                     Precision       0.901               0.907              0.715
Time (min)                                           293.533             103.924            0.004

Figure 1 is a score plot with the first and second principal components of $\bm{\Phi}$. Figure 1 (a) shows a clear separation of tumor and non-tumor status based on the two principal components. In particular, we observe that patients without tumors mostly have positive values of the first principal component (62.9% explained variability). This indicates that $\bm{\Phi}$ contains useful information from the images for predicting brain tumor status. By incorporating the information in $\bm{\Phi}$, BayesCGLM shows the most accurate prediction performance while providing credible intervals for the estimates.
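The score plots in Figure 1 can be constructed, in principle, with a standard PCA of the extracted feature matrix; the sketch below uses scikit-learn with a random placeholder in place of the actual $\bm{\Phi}$.

```python
import numpy as np
from sklearn.decomposition import PCA

Phi = np.random.default_rng(7).normal(size=(2508, 16))   # placeholder for the N x 16 features
scores = PCA(n_components=2).fit_transform(Phi)          # first two principal component scores
# scores[:, 0] and scores[:, 1] are then plotted and colored by tumor status
```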

Figure 1: Score plots for each application: (a) brain tumor dataset and (b) fMRI dataset.

Figure 2 (a) shows that the area under the curve (AUC) for the receiver operating characteristic curve (ROC) plot is about 0.96, which shows outstanding prediction performance. The predicted probability surface becomes larger for the binary responses with a value of 1 (Figure 2 (b)). Furthermore, the true and predicted binary responses are well aligned. We also investigate how prediction uncertainties are related to actual images. The top panel in Figure 2 (c) shows four examples of correctly classified cases with low prediction uncertainties. The results show that when the image is overall dark, and the tumor area clearly stands out from the background, we have low uncertainties. On the other hand, the bottom panel in Figure 2 (c) shows four examples of misclassified cases with high prediction uncertainties. In these cases, the background image is too bright, so the tumor area is hardly distinguishable from the background. Although the point predictions misclassify these cases, higher uncertainty informs us that further investigation is needed. Therefore, better medical decisions can be made when uncertainty measures are considered.

Figure 2: (a) The ROC curve and AUC obtained by each method. (b) The estimated probability of $y=1$ (blue solid line) and the observed binary responses (black dots). (c) MRI images with the fitted probabilities (prediction errors) for different cases. The top panel illustrates correctly classified images with small prediction errors. The bottom panel illustrates misclassified images with large prediction errors.

4.2 fMRI Data for Anxiety Score

Anxiety disorders are becoming a prevalent mental health issue worldwide, with studies showing a more than 25% increase in cases in 2020 (Organization et al., 2022). There have been several works to study the relationship between anxiety scores and psychological assessments collected from the survey data (Kaplan et al., 2015). However, no previous studies considered the fMRI data, which contains important information about the functional connectivity of the brain. Here, we utilize fMRI information through the extracted feature to construct an anxiety score model. From this, we can incorporate the correlation structure of the brain when we study the relationships between psychological assessments and anxiety levels.

The resting-state fMRI data is collected by the enhanced Nathan Kline Institute-Rockland Sample (NKI-RS) to create a large-scale community sample of participants across the lifespan. For 156 patients, resting-state fMRI signals were scanned for 394 seconds. Then we transform fMRI images into 278 functional brain Regions-of-Interest (ROIs) following the parcellation scheme of Shen et al. (2013).

A functional network was computed using the pairwise Pearson correlation coefficient, constructing a $278\times 278$ square symmetric matrix called the ROI-signal matrix. The correlation matrix of brain functions has been widely used to study the functional connectivity of the brain (Sporns, 2002); therefore, we also use the correlation matrix as the input $\mathbf{X}$. Participants in the study completed a survey that included various physiological and psychological assessments. In this study, we use $N=104$ observations to fit our model and reserve $N_{\text{test}}=52$ observations for validation. We use conscientiousness, extraversion, agreeableness, and neuroticism as $\mathbf{Z}$ and the averaged anxiety score as $\mathbf{Y}$. We train BayesCNN with $\mathbf{X}$ and $\mathbf{Z}$ as input and the continuous response $\mathbf{Y}$ as output. From the trained BayesCNN, we extract each $m$th feature matrix $\bm{\Phi}^{(m)}\in\mathbb{R}^{104\times 8}$ from the last hidden layer. Then we regress $\mathbf{Y}$ on $[\mathbf{Z},\bm{\Phi}^{(m)}]$ for $m=1,\cdots,500$.

Table 2 indicates that BayesCGLM has prediction performance comparable to BayesCNN and can quantify uncertainties in the regression coefficients. While the signs of the estimated $\bm{\gamma}$ are identical across the different methods, there are differences in statistical significance, particularly for the agreeableness and conscientiousness variables. From BayesCGLM, we can infer that a person who has emotional instability and vulnerability to unpleasant emotions (neuroticism) and who is less likely to rebel against others (agreeableness) has a higher anxiety and depression score, which is aligned with the previous study of Kaplan et al. (2015). On the other hand, individuals who are more sociable (extraverted) and driven to achieve their goals (conscientiousness) tend to exhibit lower anxiety and depression scores, as also supported by previous studies (Naragon-Gainey et al., 2014; Fan et al., 2020). Note that these previous studies considered neither the psychological assessments simultaneously nor the fMRI information.

GLM, which does not utilize the fMRI information, draws different conclusions in the significance tests. Table 2 shows that the agreeableness and conscientiousness variables are not statistically significant in GLM, which contradicts the findings of previous studies (Kaplan et al., 2015; Fan et al., 2020). Figure 1 (b) is a score plot with the first and second principal components of $\bm{\Phi}$ and indicates a negative correlation between anxiety scores and the first principal component (62.6% explained variability). Therefore, $\bm{\Phi}$ encapsulates information about brain functions that are highly associated with anxiety levels. By incorporating such important information into the model, BayesCGLM can accurately infer the relationship between the psychological assessments and anxiety levels and shows improved prediction performance.

We also observe that all the methods provide similar point estimates of the coefficients. Note, however, that the previous work (Kaplan et al., 2015) only conducted an exploratory analysis based on correlation coefficients without quantifying uncertainty, whereas our model estimates the coefficients while incorporating the correlated fMRI information.

Table 2: Inference results for the fMRI dataset from different methods. For all methods, the posterior mean of $\bm{\gamma}$, the 95% HPD interval, RMSPE, prediction coverage, and computing time (min) are reported.

                                                   BayesCGLM (M=500)   BayesCNN (M=500)   GLM
$\gamma_{1}$ (neuroticism)        Mean             3.496               3.480              4.318
                                  95% Interval     (2.629, 4.29)       -                  (2.572, 6.064)
$\gamma_{2}$ (extraversion)       Mean             -1.123              -1.149             -1.653
                                  95% Interval     (-1.829, -0.386)    -                  (-3.198, -0.108)
$\gamma_{3}$ (agreeableness)      Mean             1.113               1.204              1.196
                                  95% Interval     (0.480, 1.760)      -                  (-0.229, 2.621)
$\gamma_{4}$ (conscientiousness)  Mean             -1.394              -1.416             -1.283
                                  95% Interval     (-2.226, -0.507)    -                  (-3.056, 0.491)
Prediction                        RMSPE            9.342               9.794              10.531
                                  Coverage         0.923               0.981              0.923
Time (min)                                         3.459               0.491              0.001

Figure 3 shows that there is some agreement between the true and predicted responses. We observe that the HPD prediction intervals cover the true responses well (92.3% prediction coverage). Due to the small sample size, the prediction intervals are wider than in the previous examples, which is natural. In Figure 3, triangles represent the individuals with the three widest HPD intervals. We observe that the anxiety score is underestimated for these people, but the high uncertainty informs us that we may need to pay additional attention to these individuals with severe anxiety issues; they might have been overlooked without uncertainty quantification.

Figure 3: Comparison between the true and predicted anxiety scores from BayesCGLM. Circles (asterisks) represent that the true scores are (not) covered in 95% HPD intervals. Triangles represent individuals with the three largest HPD intervals.

5 Discussion

We propose a flexible approach, which can utilize image data as predictor variables via feature extraction through BayesCNN. The proposed method can simultaneously utilize predictors with different data structures, such as vector covariates and image data. Our study shows that the proposed method results in prediction performance comparable to that of existing deep learning algorithms in terms of prediction accuracy while enabling uncertainty quantification for estimation and prediction. By constructing an ensemble posterior through a mixture of feature posteriors, we can account for the uncertainties in feature extraction.

Our work is motivated by recently developed feature extraction approaches from DNNs. These features can improve the performance of existing statistical approaches by accounting for complex dependencies among observations. For instance, Bhatnagar et al. (2022) extracted features from long short-term memory (LSTM) networks and used them to capture complex inverse relationships in calibration problems. Similar to our approach, Tran et al. (2020) used the extracted features from a DNN as additional covariates in generalized linear mixed effect models. Fong and Xu (2021) proposed a nonlinear dimension reduction method based on deep autoencoders, which can extract interpretable features from high-dimensional data.

The proposed framework can be extended to a wider range of statistical models, such as Cox regression in survival analysis or regression models in causal inference. Extracting features from other types of DNNs is also possible; for example, one can use CNN-LSTM (Wang et al., 2016) to capture spatio-temporal dependencies. Our ensemble approaches can significantly improve prediction and provide interpretable parameter estimates.

Acknowledgement

This work was partially supported by the National Research Foundation of Korea (2019R1A2C1007399, 2020R1C1C1A0100386814, 2022R1C1C1006735, RS-2023-00217705), ICAN (ICT Challenge and Advanced Network of HRD) support program (RS-2023-00259934) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation). The authors are grateful to anonymous reviewers for their careful reading and valuable comments.

Data Availability

The brain tumor images dataset is obtained from Kaggle (https://www.kaggle.com/datasets/jakeshbohaju/brain-tumor/data). The fMRI dataset for anxiety score and the malaria incidence datasets can be downloaded from https://github.com/jeon9677/BayesCGLM.

References

  • Aggarwal and Agrawal (2012) Aggarwal, N. and Agrawal, R. (2012). First and second order statistics features for classification of magnetic resonance brain images. J. Sign. Inf. Process .
  • Alzubaidi et al. (2021) Alzubaidi, L., Zhang, J., Humaidi, A. J., Al-Dujaili, A., Duan, Y., Al-Shamma, O., et al. (2021). Review of deep learning: Concepts, CNN architectures, challenges, applications, future directions. Journal of Big Data 8, 1–74.
  • Bhatnagar et al. (2022) Bhatnagar, S., Chang, W., Kim, S., and Wang, J. (2022). Computer model calibration with time series data using deep learning and quantile regression. SIAM/ASA Journal on Uncertainty Quantification 10, 1–26.
  • Damianou and Lawrence (2013) Damianou, A. and Lawrence, N. D. (2013). Deep Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 207–215. PMLR.
  • Dauphin et al. (2014) Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems 27.
  • Daw and Wikle (2022) Daw, R. and Wikle, C. K. (2022). Reds: Random ensemble deep spatial prediction. Environmetrics page e2780.
  • Fan et al. (2020) Fan, J. et al. (2020). Relationships between five-factor personality model and anxiety: The effect of conscientiousness on anxiety. Open Journal of Social Sciences 8, 462.
  • Fong and Xu (2021) Fong, Y. and Xu, J. (2021). Forward stepwise deep autoencoder-based monotone nonlinear dimensionality reduction methods. Journal of Computational and Graphical Statistics 30, 1–10.
  • Gal and Ghahramani (2016a) Gal, Y. and Ghahramani, Z. (2016a). Bayesian convolutional neural networks with Bernoulli approximate variational inference.
  • Gal and Ghahramani (2016b) Gal, Y. and Ghahramani, Z. (2016b). Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059. PMLR.
  • Goodfellow et al. (2014) Goodfellow, I. J., Vinyals, O., and Saxe, A. M. (2014). Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544 .
  • Ismael and Abdel-Qader (2018) Ismael, M. R. and Abdel-Qader, I. (2018). Brain tumor classification via statistical features and back-propagation neural network. In IEEE International Conference on Electro/Information Technology (EIT), pages 0252–0257. IEEE.
  • Kaplan et al. (2015) Kaplan, S. C., Levinson, C. A., Rodebaugh, T. L., Menatti, A., and Weeks, J. W. (2015). Social anxiety and the big five personality traits: The interactive relationship of trust and openness. Cognitive Behaviour Therapy 44, 212–222.
  • Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems 25, 1097–1105.
  • McDermott and Wikle (2017) McDermott, P. L. and Wikle, C. K. (2017). An ensemble quadratic echo state network for non-linear spatio-temporal forecasting. Stat 6, 315–330.
  • McDermott and Wikle (2019) McDermott, P. L. and Wikle, C. K. (2019). Deep echo state networks with uncertainty quantification for spatio-temporal forecasting. Environmetrics 30, e2553.
  • Menze et al. (2014) Menze, B. H., Jakab, A., Bauer, S., Kalpathy-Cramer, J., Farahani, K., Kirby, J., et al. (2014). The multimodal brain tumor image segmentation benchmark (BRATS). IEEE Transactions on Medical Imaging 34, 1993–2024. doi: 10.1109/TMI.2014.2377694.
  • Naragon-Gainey et al. (2014) Naragon-Gainey, K., Rutter, L. A., and Brown, T. A. (2014). The interaction of extraversion and anxiety sensitivity on social anxiety: evidence of specificity relative to depression. Behavior therapy 45, 418–429.
  • Neal (2012) Neal, R. M. (2012). Bayesian learning for neural networks, volume 118. Springer Science & Business Media.
  • Noone et al. (2020) Noone, A., Howlader, N., Krapcho, M., Miller, D., Brest, A., Yu, M., et al. (2020). SEER cancer statistics review (CSR) 1975–2017. National Cancer Institute: Bethesda, MD, USA .
  • Organization et al. (2022) Organization, W. H. et al. (2022). Mental health and COVID-19: early evidence of the pandemic’s impact: scientific brief, 2 March 2022. Technical report, World Health Organization.
  • Paisley et al. (2012) Paisley, J., Blei, D., and Jordan, M. (2012). Variational Bayesian inference with stochastic search.
  • Rasmussen (2003) Rasmussen, C. E. (2003). Gaussian processes in machine learning. In Summer School on Machine Learning, pages 63–71. Springer.
  • Rezende et al. (2014) Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278–1286. PMLR.
  • Sagi and Rokach (2018) Sagi, O. and Rokach, L. (2018). Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8, e1249.
  • Sauer et al. (2023) Sauer, A., Cooper, A., and Gramacy, R. B. (2023). Vecchia-approximated deep Gaussian processes for computer experiments. Journal of Computational and Graphical Statistics 32, 824–837.
  • Shen et al. (2013) Shen, X., Tokoglu, F., Papademetris, X., and Constable, R. T. (2013). Groupwise whole-brain parcellation from resting-state fMRI data for network node identification. Neuroimage 82, 403–415.
  • Shridhar et al. (2019) Shridhar, K., Laumann, F., and Liwicki, M. (2019). A comprehensive guide to Bayesian convolutional neural network with variational inference.
  • Sporns (2002) Sporns, O. (2002). Network analysis, complexity, and brain function. Complexity 8, 56–60.
  • Srivastava et al. (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1929–1958.
  • Tran et al. (2020) Tran, M.-N., Nguyen, N., Nott, D., and Kohn, R. (2020). Bayesian deep net GLM and GLMM. Journal of Computational and Graphical Statistics 29, 97–113.
  • Wang et al. (2016) Wang, J., Yu, L.-C., Lai, K. R., and Zhang, X. (2016). Dimensional sentiment analysis using a regional CNN-LSTM model. In The 54th Annual Meeting of the Association for Computational Linguistics, volume 2, pages 225–230.
  • Wang et al. (2019) Wang, S., Zhou, T., and Bilmes, J. (2019). Jumpout: Improved dropout for deep neural networks with ReLUs. In International conference on machine learning, pages 6668–6676. PMLR.

Supplementary Materials

Web Appendices, Tables, and Figures referenced in Sections 2-4, as well as a zip file containing data and code, are available with this paper at the Biometrics website on Oxford Academic.