
Efficient Variational Inference for Sparse Deep Learning with Theoretical Guarantee

Jincheng Bai
Department of Statistics
Purdue University
West Lafayette, IN 47906
[email protected]
Qifan Song
Department of Statistics
Purdue University
West Lafayette, IN 47906
[email protected]
Guang Cheng
Department of Statistics
Purdue University
West Lafayette, IN 47906
[email protected]
Abstract

Sparse deep learning aims to address the challenge of huge storage consumption by deep neural networks, and to recover the sparse structure of target functions. Although tremendous empirical successes have been achieved, most sparse deep learning algorithms lack theoretical support. On the other hand, another line of work has proposed theoretical frameworks that are computationally infeasible. In this paper, we train sparse deep neural networks with a fully Bayesian treatment under spike-and-slab priors, and develop a computationally efficient variational inference procedure via continuous relaxation of the Bernoulli distribution. The variational posterior contraction rate is provided, which justifies the consistency of the proposed variational Bayes method. Notably, our empirical results demonstrate that this variational procedure provides uncertainty quantification in terms of the Bayesian predictive distribution and is also capable of consistent variable selection by training a sparse multi-layer neural network.

1 Introduction

Deep neural networks (DNNs) face various problems despite their huge success in AI applications. Larger training sets and more complicated network structures improve accuracy in deep learning, but always incur huge storage and computation burdens. For example, small portable devices may have limited resources such as a few megabytes of memory, while a dense network like ResNet-50 with 50 convolutional layers needs more than 95 megabytes of memory for storage and numerous floating-point computations (Cheng et al., 2018). It is therefore necessary to compress deep learning models before deploying them on such hardware-limited devices.

In addition, sparse neural networks may recover the potential sparsity structure of the target function, e.g., a sparse teacher network in the teacher-student framework (Goldt et al., 2019; Tian, 2018). Another example comes from nonparametric regression with sparse target functions, i.e., where only a portion of the input variables are relevant to the response variable. A sparse network may serve the goal of variable selection (Feng and Simon, 2017; Liang et al., 2018; Ye and Sun, 2018), and is also known to be robust to adversarial samples under $l_{\infty}$ and $l_{2}$ attacks (Guo et al., 2018).

Bayesian neural networks (BNNs), which date back to MacKay (1992) and Neal (1992), possess, compared with frequentist DNNs, the advantages of robust prediction via model averaging and automatic uncertainty quantification (Blundell et al., 2015). Conceptually, a BNN can easily induce sparse network selection by assigning a discrete prior over all possible network structures. However, challenges remain for sparse BNNs, including inefficient Bayesian computation and the lack of theoretical justification. This work aims to resolve these two important bottlenecks simultaneously by utilizing the variational inference approach (Jordan et al., 1999; Blei et al., 2017). On the computational side, it reduces the ultra-high-dimensional sampling problem of Bayesian computing to an optimization task that can still be solved by a back-propagation algorithm. On the theoretical side, we provide a proper prior specification under which the variational posterior distribution converges towards the truth. To the best of our knowledge, this work is the first to provide a complete package of both theory and computation for sparse Bayesian DNNs.

Related work.

A plethora of methods for sparsifying or compressing neural networks have been proposed (Cheng et al., 2018; Gale et al., 2019). The majority of these methods are pruning-based (Han et al., 2016; Zhu and Gupta, 2018; Frankle and Carbin, 2018); they are ad hoc in choosing the pruning threshold and usually require additional training and fine tuning. Some other methods achieve sparsity during training. For example, Louizos et al. (2018) introduced $l_{0}$-regularized learning and Mocanu et al. (2018) proposed sparse evolutionary training. However, the theoretical guarantees and the optimal choice of hyperparameters for these methods are unclear. As a more natural way to enforce sparsity in DNNs, Bayesian sparse neural networks have been proposed by placing prior distributions on network weights: Blundell et al. (2015) and Deng et al. (2019) considered spike-and-slab priors with a Gaussian and a Laplacian spike, respectively; a log-uniform prior was used in Molchanov et al. (2017); Ghosh et al. (2018) chose the popular horseshoe shrinkage prior. These existing works actually yield posteriors over the dense DNN model space despite applying sparsity-inducing priors; to obtain explicit sparse inference results, users must additionally impose pruning rules on the posterior. On the other hand, theoretical properties of sparse deep learning have been studied in Schmidt-Hieber (2017), Polson and Rockova (2018) and Chérief-Abdellatif (2020), but finding an efficient implementation that closes the gap between theory and practice remains a challenge for these methods.

Detailed contributions.

We impose a spike-and-slab prior on all the edges (weights and biases) of a neural network, where the spike and slab components represent whether the corresponding edge is inactive or active, respectively. Our work distinguishes itself from prior works on Bayesian sparse neural networks by using a spike-and-slab prior with a Dirac spike; hence all posterior samples are automatically exact sparse DNN models.

More importantly, with carefully chosen hyperparameter values, especially the prior probability that each edge is active, we establish variational posterior consistency, and the corresponding convergence rate strikes a balance among the statistical estimation error, the variational error and the approximation error. The theoretical results are validated by various simulations and real applications. Empirically, we also demonstrate that the proposed method performs well in variable selection and uncertainty quantification. While Feng and Simon (2017), Liang et al. (2018) and Ye and Sun (2018) only considered neural networks with a single hidden layer for variable selection, we observe correct support recovery for multi-layer networks.

2 Preliminaries

Nonparametric regression.

Consider a nonparametric regression model with random covariates

Y_{i}=f_{0}(X_{i})+\epsilon_{i},\quad i=1,\ldots,n, \qquad (1)

where $X_{i}=(x_{i1},\ldots,x_{ip})^{t}\sim\mathcal{U}([-1,1]^{p})$ (this compact-support assumption is generally satisfied for standardized data and may be relaxed), $\mathcal{U}$ denotes the uniform distribution, $\epsilon_{i}\overset{iid}{\sim}\mathcal{N}(0,\sigma^{2}_{\epsilon})$ is the noise term, and $f_{0}:[-1,1]^{p}\rightarrow\mathbb{R}$ is an underlying true function. For simplicity of analysis, we assume $\sigma_{\epsilon}$ is known. Denote $D_{i}=(X_{i},Y_{i})$ and $D=(D_{1},\dots,D_{n})$ as the observations. Let $P_{0}$ denote the underlying probability measure of the data, and $p_{0}$ the corresponding density function.

Deep neural network.

An $L$-hidden-layer neural network is used to model the data. The number of neurons in each hidden layer is denoted by $p_{i}$ for $i=1,\dots,L$. The weight matrix and bias parameter in layer $i$ are denoted by $W_{i}\in\mathbb{R}^{p_{i}\times p_{i-1}}$ and $b_{i}\in\mathbb{R}^{p_{i}}$ for $i=1,\dots,L+1$. Let $\sigma(x)$ be the activation function; for any $r\in\mathbb{Z}^{+}$ and any $b\in\mathbb{R}^{r}$, we define $\sigma_{b}:\mathbb{R}^{r}\rightarrow\mathbb{R}^{r}$ by $\sigma_{b}(y)_{j}=\sigma(y_{j}-b_{j})$ for $j=1,\ldots,r$. Then, given parameters $\boldsymbol{p}=(p_{1},\dots,p_{L})'$ and $\theta=\{W_{1},b_{1},\ldots,W_{L},b_{L},W_{L+1},b_{L+1}\}$, the output of this DNN model can be written as

f_{\theta}(X)=W_{L+1}\sigma_{b_{L}}(W_{L}\sigma_{b_{L-1}}\cdots\sigma_{b_{1}}(W_{1}X))+b_{L+1}. \qquad (2)

In what follows, with slight abuse of notation, $\theta$ is also viewed as a vector containing all the coefficients in the $W_{i}$'s and $b_{i}$'s, i.e., $\theta=(\theta_{1},\dots,\theta_{T})'$, where the length $T:=\sum^{L-1}_{l=1}p_{l+1}(p_{l}+1)+p_{1}(p+1)+(p_{L}+1)$.
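To make the notation concrete, the following is a minimal NumPy sketch of the forward map (2) for a single input vector; the function name and the list-of-arrays representation are ours.

import numpy as np

def dnn_forward(x, weights, biases, act=np.tanh):
    """Sketch of the forward map (2) for an input vector x of length p.

    weights = [W_1, ..., W_{L+1}] with W_l of shape (p_l, p_{l-1});
    biases  = [b_1, ..., b_{L+1}] with b_l of shape (p_l,).
    Hidden layers apply sigma_{b_l}(y) = act(y - b_l) elementwise; the
    output layer adds b_{L+1} instead.
    """
    h = np.asarray(x)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = act(W @ h - b)              # sigma_{b_l}(W_l h)
    return weights[-1] @ h + biases[-1]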

Variational inference.

Bayesian procedures make statistical inferences from the posterior distribution $\pi(\theta|D)\propto\pi(\theta)p_{\theta}(D)$, where $\pi(\theta)$ is the prior distribution. Since MCMC does not scale well to complex Bayesian learning tasks with large datasets, variational inference (Jordan et al., 1999; Blei et al., 2017) has become a popular alternative. Given a variational family of distributions $\mathcal{Q}$ (for simplicity, $\mathcal{Q}$ is commonly assumed to be the mean-field family, i.e., $q(\theta)=\prod^{T}_{i=1}q(\theta_{i})$), it seeks to approximate the true posterior distribution by finding the closest member of $\mathcal{Q}$ in terms of KL divergence:

\widehat{q}(\theta)=\underset{q(\theta)\in\mathcal{Q}}{\mbox{argmin }}\mbox{KL}(q(\theta)||\pi(\theta|D)). \qquad (3)

The optimization (3) is equivalent to minimizing the negative ELBO (evidence lower bound), which is defined as

\Omega=-\mathbb{E}_{q(\theta)}[\log p_{\theta}(D)]+\mbox{KL}(q(\theta)||\pi(\theta)), \qquad (4)

where the first term in (4) can be viewed as the reconstruction error (Kingma and Welling, 2014) and the second term serves as regularization. Hence, the variational inference procedure minimizes the reconstruction error while being penalized towards the prior distribution in the sense of KL divergence.

When the variational family is indexed by a hyperparameter $\omega$, i.e., any $q\in\mathcal{Q}$ can be written as $q_{\omega}(\theta)$, the negative ELBO becomes a function $\Omega(\omega)$ of $\omega$. The KL divergence term in (4) can usually be integrated analytically, while the reconstruction error requires Monte Carlo estimation. Therefore, the optimization of $\Omega(\omega)$ can utilize the stochastic gradient approach (Kingma and Welling, 2014). Concretely, if every distribution in $\mathcal{Q}$ can be reparameterized as $q_{\omega}\stackrel{d}{=}g(\omega,\nu)$ for some differentiable function $g$ and random variable $\nu$ (here "$\stackrel{d}{=}$" means equality in distribution), then the stochastic estimator of $\Omega(\omega)$ and its gradient are

\begin{split}
&\widetilde{\Omega}^{m}(\omega)=-\frac{n}{m}\frac{1}{K}\sum^{m}_{i=1}\sum^{K}_{k=1}\log p_{g(\omega,\nu_{k})}(D_{i})+\mbox{KL}(q_{\omega}(\theta)||\pi(\theta)),\\
&\nabla_{\omega}\widetilde{\Omega}^{m}(\omega)=-\frac{n}{m}\frac{1}{K}\sum^{m}_{i=1}\sum^{K}_{k=1}\nabla_{\omega}\log p_{g(\omega,\nu_{k})}(D_{i})+\nabla_{\omega}\mbox{KL}(q_{\omega}(\theta)||\pi(\theta)),
\end{split} \qquad (5)

where the $D_{i}$'s are randomly sampled data points and the $\nu_{k}$'s are iid copies of $\nu$. Here, $m$ and $K$ are the minibatch size and the Monte Carlo sample size, respectively.
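As a toy illustration of (5) (not the sparse model of this paper), the following PyTorch sketch fits a mean-field Gaussian variational approximation for a one-parameter Gaussian-mean model; all names and the synthetic data are our own assumptions.

import torch
import torch.nn.functional as F

# Toy instance of (5): y_i ~ N(theta, 1) with prior theta ~ N(0, 1);
# omega = (mu, rho) and the reparameterization is theta = mu + softplus(rho) * nu.
torch.manual_seed(0)
y = 2.0 + torch.randn(1000)                  # synthetic data, true theta = 2
mu = torch.zeros(1, requires_grad=True)
rho = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, rho], lr=0.05)
n, m, K = len(y), 50, 1
for step in range(1000):
    idx = torch.randint(0, n, (m,))          # random minibatch D^m
    sigma = F.softplus(rho)
    theta = mu + sigma * torch.randn(K, 1)   # theta = g(omega, nu)
    log_lik = -0.5 * (y[idx] - theta) ** 2   # log p_theta(D_i) up to an additive constant
    kl = torch.log(1.0 / sigma) + (sigma ** 2 + mu ** 2) / 2 - 0.5   # KL(q_omega || N(0,1))
    loss = -(n / m) * log_lik.sum() / K + kl  # stochastic negative ELBO (5)
    opt.zero_grad(); loss.backward(); opt.step()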

3 Sparse Bayesian deep learning with spike-and-slab prior

We aim to approximate $f_{0}$ in the generative model (1) by a sparse neural network. Specifically, given a network structure, i.e., the depth $L$ and the width $\boldsymbol{p}$, $f_{0}$ is approximated by DNN models $f_{\theta}$ with sparse parameter vector $\theta\in\Theta=\mathbb{R}^{T}$. From a Bayesian perspective, we impose a spike-and-slab prior (George and McCulloch, 1993; Ishwaran and Rao, 2005) on $\theta$ to model sparse DNNs.

A spike-and-slab distribution is a mixture of two components: a Dirac spike concentrated at zero and a flat slab distribution. Denote by $\delta_{0}$ the Dirac measure at 0 and by $\gamma=(\gamma_{1},\ldots,\gamma_{T})$ a binary vector indicating the inclusion of each edge in the network. The prior distribution $\pi(\theta)$ is then specified as

\theta_{i}|\gamma_{i}\sim\gamma_{i}\mathcal{N}(0,\sigma^{2}_{0})+(1-\gamma_{i})\delta_{0},\quad\gamma_{i}\sim\mbox{Bern}(\lambda), \qquad (6)

for $i=1,\ldots,T$, where $\lambda$ and $\sigma^{2}_{0}$ are hyperparameters representing the prior inclusion probability and the prior Gaussian variance, respectively. The choices of $\sigma_{0}^{2}$ and $\lambda$ play an important role in sparse Bayesian learning, and in Section 4 we establish theoretical guarantees for the variational inference procedure under proper deterministic choices of $\sigma_{0}^{2}$ and $\lambda$. Alternatively, the hyperparameters may be chosen via an empirical Bayesian (EB) procedure, but this is beyond the scope of this work. We assume $\mathcal{Q}$ is in the same family of spike-and-slab laws:

\theta_{i}|\gamma_{i}\sim\gamma_{i}\mathcal{N}(\mu_{i},\sigma^{2}_{i})+(1-\gamma_{i})\delta_{0},\quad\gamma_{i}\sim\mbox{Bern}(\phi_{i}) \qquad (7)

for $i=1,\ldots,T$, where $0\leq\phi_{i}\leq 1$.
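For intuition, a minimal NumPy sketch of drawing one $\theta$ from (7), vectorized over $i=1,\dots,T$ (the function name is ours), is given below; note that every draw is a genuinely sparse parameter vector.

import numpy as np

def sample_theta(mu, sigma, phi, rng=None):
    """Draw one theta from the spike-and-slab variational family (7):
    coordinate i is N(mu_i, sigma_i^2) with probability phi_i and exactly 0 otherwise."""
    if rng is None:
        rng = np.random.default_rng()
    gamma = rng.random(len(phi)) < phi                  # gamma_i ~ Bern(phi_i)
    return np.where(gamma, mu + sigma * rng.standard_normal(len(mu)), 0.0)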

Compared with pruning approaches (e.g., Zhu and Gupta, 2018; Frankle and Carbin, 2018; Molchanov et al., 2017) that do not pursue sparsity among the bias parameters $b_{i}$, our Bayesian modeling induces posterior sparsity for both weight and bias parameters.

In the literature, Polson and Rockova (2018) and Chérief-Abdellatif (2020) imposed the sparsity specification $\Theta(L,\boldsymbol{p},s)=\{\theta\mbox{ as in model (2)}:\|\theta\|_{0}\leq s\}$, which not only poses great computational challenges but also requires tuning of the optimal sparsity level $s$. For example, Chérief-Abdellatif (2020) shows that, given $s$, two error terms arise in variational DNN inference: 1) the variational error $r_{n}(L,\boldsymbol{p},s)$ caused by the variational Bayes approximation to the true posterior distribution, and 2) the approximation error $\xi_{n}(L,\boldsymbol{p},s)$ between $f_{0}$ and the best bounded-weight $s$-sparse DNN approximation of $f_{0}$. Both error terms $r_{n}$ and $\xi_{n}$ depend on $s$ (their specific forms are given in the next section). Generally speaking, as the model capacity (i.e., $s$) increases, $r_{n}$ increases and $\xi_{n}$ decreases. Hence the optimal choice $s^{*}$ that strikes the balance between the two is

s^{*}=\underset{s}{\mbox{argmin }}\{r_{n}(L,\boldsymbol{p},s)+\xi_{n}(L,\boldsymbol{p},s)\}.

Therefore, one would need to develop a selection criterion for $\widehat{s}$ such that $\widehat{s}\approx s^{*}$. In contrast, our modeling works directly on the whole sparsity regime without pre-specifying $s$, and is shown later to automatically attain the same rate of convergence as if the optimal $s^{*}$ were known.

4 Theoretical results

In this section, we establish the contraction rate of the variational sparse DNN procedure without knowing $s^{*}$. For simplicity, we only consider equal-width neural networks (similar to Polson and Rockova, 2018).

The following assumptions are imposed:

Condition 4.1.

$p_{i}\equiv N\in\mathbb{Z}^{+}$, where $N$ can depend on $n$, and $T\rightarrow\infty$ as $n\rightarrow\infty$.

Condition 4.2.

$\sigma(x)$ is 1-Lipschitz continuous.

Condition 4.3.

The hyperparameter $\sigma_{0}^{2}$ is set to some constant, and $\lambda$ satisfies $\log(1/\lambda)=O\{(L+1)\log N+\log(p\sqrt{n/s^{*}})\}$ and $\log(1/(1-\lambda))=O\bigl((s^{*}/T)\{(L+1)\log N+\log(p\sqrt{n/s^{*}})\}\bigr)$.

Condition 4.2 is very mild and is satisfied by ReLU, sigmoid and tanh. Note that Condition 4.3 allows a wide range of choices of $\lambda$, including choices of $\lambda$ that are independent of $s^{*}$ (see Theorem 4.1 below).

We first state a lemma giving an upper bound for the negative ELBO. Denote the log-likelihood ratio between $p_{0}$ and $p_{\theta}$ as $l_{n}(P_{0},P_{\theta})=\log(p_{0}(D)/p_{\theta}(D))=\sum^{n}_{i=1}\log(p_{0}(D_{i})/p_{\theta}(D_{i}))$. Given some constant $B>0$, we define

\begin{split}
r^{*}_{n}&:=r_{n}(L,N,s^{*})=((L+1)s^{*}/n)\log N+(s^{*}/n)\log(p\sqrt{n/s^{*}}),\\
\xi^{*}_{n}&:=\xi_{n}(L,N,s^{*})=\inf_{\theta\in\Theta(L,\boldsymbol{p},s^{*}),\,\|\theta\|_{\infty}\leq B}\|f_{\theta}-f_{0}\|^{2}_{\infty}.
\end{split}

Recall that $r_{n}(L,N,s)$ and $\xi_{n}(L,N,s)$ denote the variational error and the approximation error, respectively.

Lemma 4.1.

Under Conditions 4.1-4.3, with dominating probability,

\inf_{q(\theta)\in\mathcal{Q}}\Bigl\{\mbox{KL}(q(\theta)||\pi(\theta|\lambda))+\int_{\Theta}l_{n}(P_{0},P_{\theta})q(\theta)d\theta\Bigr\}\leq Cn(r^{*}_{n}+\xi^{*}_{n}) \qquad (8)

where $C$ is either some positive constant if $\lim n(r_{n}^{*}+\xi^{*}_{n})=\infty$, or any diverging sequence if $\limsup n(r_{n}^{*}+\xi^{*}_{n})\neq\infty$.

Noting that $\mbox{KL}(q(\theta)||\pi(\theta|\lambda))+\int_{\Theta}l_{n}(P_{0},P_{\theta})q(\theta)(d\theta)$ is the negative ELBO up to a constant, this lemma shows that the optimal loss value of the proposed variational inference is bounded.

The next lemma investigates the convergence of the variational distribution under the Hellinger distance, which is defined as

d^{2}(P_{\theta},P_{0})=\mathbb{E}_{X}\Bigl(1-\exp\{-[f_{\theta}(X)-f_{0}(X)]^{2}/(8\sigma^{2}_{\epsilon})\}\Bigr).

In addition, let $s_{n}=s^{*}\log^{2\delta-1}(n)$ for any $\delta>1$. An assumption on $s^{*}$ is required to strike the balance between $r_{n}^{*}$ and $\xi_{n}^{*}$:

Condition 4.4.

$\max\{s^{*}\log(p\sqrt{n/s^{*}}),(L+1)s^{*}\log N\}=o(n)$ and $r_{n}^{*}\asymp\xi_{n}^{*}$.

Lemma 4.2.

Under Conditions 4.1-4.4, if $\sigma_{0}^{2}$ is set to be constant and $\lambda\leq T^{-1}\exp\{-Mnr_{n}^{*}/s_{n}\}$ for any positive diverging sequence $M\rightarrow\infty$, then with dominating probability we have

\int_{\Theta}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq C\varepsilon^{*2}_{n}+\frac{3}{n}\inf_{q(\theta)\in\mathcal{Q}}\Bigl\{\mbox{KL}(q(\theta)||\pi(\theta|\lambda))+\int_{\Theta}l_{n}(P_{0},P_{\theta})q(\theta)d\theta\Bigr\}, \qquad (9)

where $C$ is some constant, and

\varepsilon^{*}_{n}:=\varepsilon_{n}(L,N,s^{*})=\sqrt{r_{n}(L,N,s^{*})}\log^{\delta}(n),\quad\mbox{for any }\delta>1.
Remark.

The result (9) has exactly the same form as in the existing literature (Pati et al., 2018), but it is not trivial in the following sense. The existing literature requires the existence of a global testing function that separates $P_{0}$ and $\{P_{\theta}:d(P_{\theta},P_{0})\geq\varepsilon_{n}^{*}\}$ with exponentially decaying Type I and Type II errors. If such a testing function exists only over a subset $\Theta_{n}\subset\Theta$ (which is the case for our DNN modeling), then the existing result (Yang et al., 2020) can only characterize the VB posterior contraction behavior within $\Theta_{n}$, not over the whole parameter space $\Theta$. Our result, which characterizes the convergence behavior of the overall VB posterior, therefore represents a significant improvement over those works.

The above two lemmas together imply the following guarantee for the VB posterior:

Theorem 4.1.

Let $\sigma_{0}^{2}$ be a constant and $-\log\lambda=\log(T)+\delta[(L+1)\log N+\log(\sqrt{n}p)]$ for any constant $\delta>0$. Under Conditions 4.1, 4.2 and 4.4, with high probability we have

\int_{\Theta}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq C\varepsilon^{*2}_{n}+C^{\prime}(r_{n}^{*}+\xi^{*}_{n}),

where $C$ is some positive constant and $C^{\prime}$ is any diverging sequence.

Here $\varepsilon_{n}^{*2}$ represents the statistical estimation error for $P_{0}$. The variational Bayes convergence rate thus consists of the estimation error $\varepsilon_{n}^{*2}$, the variational error $r_{n}^{*}$, and the approximation error $\xi_{n}^{*}$. Since the first two differ only by a logarithmic factor, our convergence rate strikes the balance among all three error terms. The derived rate has an explicit expression in terms of the network structure, based on the forms of $\varepsilon_{n}^{*}$, $r_{n}^{*}$ and $\xi_{n}^{*}$, in contrast with the general convergence results in Pati et al. (2018), Zhang and Gao (2019) and Yang et al. (2020).

Remark.

Theorem 4.1 provides a specific choice of $\lambda$, which can be relaxed to the general conditions on $\lambda$ in Lemma 4.2. In contrast to heuristic choices such as $\lambda=\exp(-2\log n)$ (BIC; Hubin and Storvik, 2019), this theoretically justified choice incorporates knowledge of the input dimension, network structure and sample size. Such a $\lambda$ is used in our numerical experiments in Section 6, but readers should be aware that its theoretical validity is justified only in an asymptotic sense.

Remark.

The convergence rate is derived under the Hellinger metric, which is of less practical relevance than the $L_{2}$ norm representing the common prediction error. One may obtain a convergence result under the $L_{2}$ norm via a VB truncation (refer to the supplementary material, Theorem A.1).

Remark.

If $f_{0}$ is an $\alpha$-Hölder-smooth function with fixed input dimension $p$, then by choosing $L\asymp\log n$ and $N\asymp n/\log n$, and combining with the approximation result (Schmidt-Hieber, 2017, Theorem 3), our theorem ensures rate-minimax convergence up to a logarithmic term.

5 Implementation

To conduct the optimization of (4) via stochastic gradient methods, we need a reparameterization for the distributions in $\mathcal{Q}$. One solution is the inverse-CDF sampling technique. Specifically, if $\theta\sim q\in\mathcal{Q}$, its marginals $\theta_{i}$ are independent mixtures of the form (7). Let $F_{(\mu_{i},\sigma_{i},\phi_{i})}$ be the CDF of $\theta_{i}$; then $\theta_{i}\stackrel{d}{=}F^{-1}_{(\mu_{i},\sigma_{i},\phi_{i})}(u_{i})$ where $u_{i}\sim\mathcal{U}(0,1)$. This inverse-CDF reparameterization, although valid, cannot be conveniently implemented within state-of-the-art Python packages such as PyTorch. A more popular alternative in VB is the Gumbel-softmax approximation.
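For a single coordinate of (7), a generalized inverse-CDF sampler can be sketched as follows (a SciPy-based sketch with our own function name; the CDF jumps by $1-\phi_i$ at zero, and any $u$ landing inside that jump maps to the spike).

import numpy as np
from scipy.stats import norm

def inverse_cdf_sample(mu, sigma, phi, u):
    """Generalized inverse CDF of phi*N(mu, sigma^2) + (1-phi)*delta_0 at u in (0,1).
    F(x) = phi*Phi((x-mu)/sigma) + (1-phi)*1{x >= 0}."""
    F0_minus = phi * norm.cdf(-mu / sigma)              # mass strictly below zero
    if F0_minus <= u <= F0_minus + (1.0 - phi):
        return 0.0                                      # u falls in the jump: spike at 0
    u_slab = u if u < F0_minus else u - (1.0 - phi)     # remove the point mass above zero
    return mu + sigma * norm.ppf(u_slab / phi)          # invert the Gaussian slab part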

We rewrite the loss function $\Omega$ as

-\mathbb{E}_{q(\theta|\gamma)q(\gamma)}[\log p_{\theta}(D)]+\sum^{T}_{i=1}\mbox{KL}(q(\gamma_{i})||\pi(\gamma_{i}))+\sum^{T}_{i=1}q(\gamma_{i}=1)\mbox{KL}(\mathcal{N}(\mu_{i},\sigma^{2}_{i})||\mathcal{N}(0,\sigma^{2}_{0})). \qquad (10)
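Both KL terms in (10) have closed forms (Bernoulli versus Bernoulli, Gaussian versus Gaussian); a minimal PyTorch sketch of the penalty, with our own function name and a small numerical clamp, is:

import torch

def kl_penalty(mu, sigma, phi, lam, sigma0):
    """Closed-form KL terms in (10): sum_i KL(Bern(phi_i)||Bern(lam))
    plus sum_i phi_i * KL(N(mu_i, sigma_i^2)||N(0, sigma0^2))."""
    phi = phi.clamp(1e-6, 1 - 1e-6)
    kl_bern = phi * torch.log(phi / lam) + (1 - phi) * torch.log((1 - phi) / (1 - lam))
    kl_gauss = torch.log(sigma0 / sigma) + (sigma ** 2 + mu ** 2) / (2 * sigma0 ** 2) - 0.5
    return (kl_bern + phi * kl_gauss).sum()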

Since it is impossible to reparameterize the discrete variable $\gamma$ by a continuous system, we apply continuous relaxation, i.e., we approximate $\gamma$ by a continuous distribution. In particular, the Gumbel-softmax approximation (Maddison et al., 2017; Jang et al., 2017) is used here, and $\gamma_{i}\sim\mbox{Bern}(\phi_{i})$ is approximated by $\widetilde{\gamma}_{i}\sim\mbox{Gumbel-softmax}(\phi_{i},\tau)$, where

\widetilde{\gamma}_{i}=(1+\exp(-\eta_{i}/\tau))^{-1},\quad\eta_{i}=\log\frac{\phi_{i}}{1-\phi_{i}}+\log\frac{u_{i}}{1-u_{i}},\quad u_{i}\sim\mathcal{U}(0,1).

Here $\tau$ is called the temperature; as it approaches 0, $\widetilde{\gamma}_{i}$ converges to $\gamma_{i}$ in distribution (refer to Figure 4 in the supplementary material). In addition, one can show that $P(\widetilde{\gamma}_{i}>0.5)=\phi_{i}$, which implies $\gamma_{i}\stackrel{d}{=}1(\widetilde{\gamma}_{i}>0.5)$. Thus, $\widetilde{\gamma}_{i}$ can be viewed as a soft version of $\gamma_{i}$; it is used in the backward pass to enable gradient calculation, while the hard version $\gamma_{i}$ is used in the forward pass to obtain a sparse network structure. In practice, $\tau$ is usually chosen no smaller than 0.5 for numerical stability. In addition, the normal variable $\mathcal{N}(\mu_{i},\sigma^{2}_{i})$ is reparameterized as $\mu_{i}+\sigma_{i}\epsilon_{i}$ with $\epsilon_{i}\sim\mathcal{N}(0,1)$.
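One way to realize this hard-forward/soft-backward scheme within PyTorch autograd is a straight-through version of the Gumbel-softmax sample; the sketch below (function name and the straight-through trick are our implementation choice) returns the hard $\gamma_i$ as the forward value while gradients flow through $\widetilde{\gamma}_i$.

import torch

def sample_gamma(phi, tau=0.5):
    """Gumbel-softmax relaxation of gamma_i ~ Bern(phi_i) with straight-through output."""
    u = torch.rand_like(phi)
    eta = torch.log(phi / (1 - phi)) + torch.log(u / (1 - u))   # logistic noise
    gamma_tilde = torch.sigmoid(eta / tau)                      # soft gamma (backward pass)
    gamma_hard = (gamma_tilde > 0.5).float()                    # hard gamma = 1(gamma_tilde > 0.5)
    # forward value equals gamma_hard; gradient equals that of gamma_tilde
    return gamma_hard + gamma_tilde - gamma_tilde.detach()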

The complete variational inference procedure with Gumbel-softmax approximation is stated below.

Algorithm 1 Variational inference for sparse BNN with normal slab distribution.
1: parameters: $\omega=(\mu,\sigma^{\prime},\phi^{\prime})$,
2: where $\sigma_{i}=\log(1+\exp(\sigma^{\prime}_{i}))$, $\phi_{i}=(1+\exp(\phi^{\prime}_{i}))^{-1}$, for $i=1,\ldots,T$
3: repeat
4:     $D^{m}$ ← Randomly draw a minibatch of size $m$ from $D$
5:     $\epsilon_{i},u_{i}$ ← Randomly draw $K$ samples from $\mathcal{N}(0,1)$ and $\mathcal{U}(0,1)$
6:     $\widetilde{\Omega}^{m}(\omega)$ ← Use (5) with ($D^{m}$, $\omega$, $\epsilon$, $u$); use $\gamma$ in the forward pass
7:     $\nabla_{\omega}\widetilde{\Omega}^{m}(\omega)$ ← Use (5) with ($D^{m}$, $\omega$, $\epsilon$, $u$); use $\widetilde{\gamma}$ in the backward pass
8:     $\omega$ ← Update with $\nabla_{\omega}\widetilde{\Omega}^{m}(\omega)$ using a gradient descent algorithm (e.g., SGD or Adam)
9: until convergence of $\widetilde{\Omega}^{m}(\omega)$
10: return $\omega$

The Gumbel-softmax approximation introduces an additional error that may jeopardize the validity of Theorem 4.1. Our exploratory studies (refer to Section B.3 in the supplementary material) show little difference between the results of the inverse-CDF reparameterization and the Gumbel-softmax approximation in a simple model. We therefore conjecture that the Gumbel-softmax approximation does not hurt the VB convergence, and implement it in our numerical studies.
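Putting the pieces together, the following is a minimal PyTorch sketch of Algorithm 1 for the regression model (1). It is an illustration only: the layer and function names are ours, it uses the standard $Wx+b$ convention rather than the $\sigma_b(\cdot)$ parameterization in (2), a single Monte Carlo sample ($K=1$), and the straight-through Gumbel-softmax trick from above; in practice $\lambda$ would be set to $\lambda_{opt}$ and $\sigma_0^2=2$ as in Section 6.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpikeSlabLinear(nn.Module):
    """Fully connected layer with spike-and-slab variational parameters
    omega = (mu, sigma', phi') on every weight and bias, following Algorithm 1."""

    def __init__(self, d_in, d_out, sigma0_sq=2.0, lam=1e-5, tau=0.5):
        super().__init__()
        self.mu = nn.Parameter(0.1 * torch.randn(d_out, d_in + 1))         # slab means (last column: bias)
        self.sigma_raw = nn.Parameter(-3.0 * torch.ones(d_out, d_in + 1))  # sigma' in Algorithm 1
        self.phi_raw = nn.Parameter(torch.zeros(d_out, d_in + 1))          # phi' in Algorithm 1
        self.sigma0_sq, self.lam, self.tau = sigma0_sq, lam, tau

    def _sigma_phi(self):
        sigma = torch.log1p(torch.exp(self.sigma_raw))                 # sigma_i = log(1 + exp(sigma'_i))
        phi = torch.sigmoid(-self.phi_raw).clamp(1e-6, 1 - 1e-6)       # phi_i = (1 + exp(phi'_i))^{-1}
        return sigma, phi

    def forward(self, x):
        sigma, phi = self._sigma_phi()
        u = torch.rand_like(phi)
        eta = torch.log(phi / (1 - phi)) + torch.log(u / (1 - u))
        g_soft = torch.sigmoid(eta / self.tau)                         # soft gamma (backward pass)
        g = (g_soft > 0.5).float() + g_soft - g_soft.detach()          # hard gamma (forward pass)
        theta = g * (self.mu + sigma * torch.randn_like(sigma))        # exact-sparse sample of theta
        return F.linear(x, theta[:, :-1], theta[:, -1])

    def kl(self):
        sigma, phi = self._sigma_phi()
        kl_bern = phi * torch.log(phi / self.lam) \
                + (1 - phi) * torch.log((1 - phi) / (1 - self.lam))
        kl_gauss = 0.5 * torch.log(self.sigma0_sq / sigma ** 2) \
                 + (sigma ** 2 + self.mu ** 2) / (2 * self.sigma0_sq) - 0.5
        return (kl_bern + phi * kl_gauss).sum()

def negative_elbo(net, x, y, n, sigma_eps_sq=1.0):
    """Stochastic negative ELBO (5) for the regression model (1) on a minibatch (x, y)."""
    recon = 0.5 * ((y - net(x).squeeze(-1)) ** 2).sum() / sigma_eps_sq
    kl = sum(m.kl() for m in net.modules() if isinstance(m, SpikeSlabLinear))
    return (n / x.shape[0]) * recon + kl

# Usage sketch: a 20-6-6-1 network as in Simulation I(A); x_batch, y_batch and n
# would come from the training data.
net = nn.Sequential(SpikeSlabLinear(20, 6), nn.Sigmoid(),
                    SpikeSlabLinear(6, 6), nn.Sigmoid(),
                    SpikeSlabLinear(6, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
# loss = negative_elbo(net, x_batch, y_batch, n); opt.zero_grad(); loss.backward(); opt.step()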

6 Experiments

We evaluate the empirical performance of the proposed variational inference through simulation studies and an application to MNIST data. For the simulation studies, we consider a teacher-student framework and a nonlinear regression function, by which we justify the consistency of the proposed method and validate the proposed choice of hyperparameters. As a byproduct, the performance of uncertainty quantification and the effectiveness of variable selection are examined as well.

For all numerical studies, we let $\sigma^{2}_{0}=2$, and the choice of $\lambda$ follows Theorem 4.1 (denoted by $\lambda_{opt}$): $\log(\lambda^{-1}_{opt})=\log(T)+0.1[(L+1)\log N+\log(\sqrt{n}p)]$. The remaining implementation details (such as initialization and the choices of $K$, $m$ and the learning rate) are provided in the supplementary material. We use the VB posterior mean estimator $\widehat{f}_{H}=\sum^{H}_{h=1}f_{\theta_{h}}/H$ to assess prediction accuracy, where $\theta_{h}\sim\widehat{q}(\theta)$ are samples drawn from the VB posterior and $H=30$. The posterior network sparsity is measured by $\widehat{s}=\sum^{T}_{i=1}\phi_{i}/T$. Input nodes that have a connection with $\phi_{i}>0.5$ to the next layer are selected as relevant input variables, and we report the corresponding false positive rate (FPR) and false negative rate (FNR) to evaluate the variable selection performance of our method.
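These quantities translate directly into code; a small sketch (reusing the SpikeSlabLinear layer from the previous section, with our own helper names) is:

import math
import torch

def lambda_opt(T, L, N, n, p, delta=0.1):
    """lambda_opt from Theorem 4.1 with delta = 0.1:
    log(1/lambda) = log T + delta * [(L+1) log N + log(sqrt(n) * p)]."""
    return math.exp(-(math.log(T) + delta * ((L + 1) * math.log(N)
                                             + math.log(math.sqrt(n) * p))))

def posterior_mean_prediction(net, x, H=30):
    """VB posterior-mean estimator f_hat_H: average H stochastic (sparse) forward passes."""
    with torch.no_grad():
        return torch.stack([net(x) for _ in range(H)]).mean(dim=0)

def network_sparsity(net):
    """Posterior sparsity s_hat = (1/T) * sum_i phi_i over all SpikeSlabLinear layers."""
    phis = [m._sigma_phi()[1] for m in net.modules() if isinstance(m, SpikeSlabLinear)]
    return torch.cat([p.flatten() for p in phis]).mean().item()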

Our method is compared with the dense variational BNN (VBNN) (Blundell et al., 2015) with independent centered normal priors and independent normal variational distributions, the AGP pruner (Zhu and Gupta, 2018), the Lottery Ticket Hypothesis (LOT) (Frankle and Carbin, 2018), variational dropout (VD) (Molchanov et al., 2017) and the Horseshoe BNN (HS-BNN) (Ghosh et al., 2018). In particular, VBNN can be regarded as a baseline method without any sparsification or compression. All reported simulation results are based on 30 replications (except that we use 60 replications for interval estimation coverages). Note that the sparsity levels in AGP and LOT are user-specified; hence, in the simulation studies, we run a grid search for AGP and LOT and report only the settings that yield the highest testing accuracy. Furthermore, FPR and FNR are not calculated for HS-BNN since it only sparsifies the hidden layers node-wise.

Simulation I: Teacher-student networks setup

We consider two teacher network settings for $f_{0}$: (A) densely connected with a structure of 20-6-6-1, $p=20$, $n=3000$, $\sigma(x)=\mbox{sigmoid}(x)$, $X\sim\mathcal{U}([-1,1]^{20})$, $\epsilon\sim\mathcal{N}(0,1)$, and network parameters $\theta_{i}$ randomly sampled from $\mathcal{U}(0,1)$; (B) sparsely connected as shown in Figure 1(c), $p=100$, $n=500$, $\sigma(x)=\mbox{tanh}(x)$, $X\sim\mathcal{U}([-1,1]^{100})$, $\epsilon\sim\mathcal{N}(0,1)$, with the network parameters $\theta_{i}$ fixed (refer to the supplementary material for details).

Figure 1: (a) $\lambda\leq\lambda_{opt}$: $\lambda=\{10^{-200},10^{-150},10^{-100},10^{-50},10^{-20},10^{-5},\lambda_{opt}\}$. (b) $\lambda\geq\lambda_{opt}$: $\lambda=\{\lambda_{opt},0.1,0.3,0.5,0.7,0.9,0.99\}$. (c) The structure of the target sparse teacher network. Note that the $x$ axes of panels (a) and (b) are on different scales.
Table 1: Simulation results for Simulation I. SVBNN denotes our sparse variational BNN method. The sparsity levels specified for AGP are 30% and 5%, and for LOT 10% and 5%, respectively for the two cases.

Teacher  Method   Train RMSE    Test RMSE     FPR(%)        FNR(%)        95% Coverage(%)  Sparsity(%)
Dense    SVBNN    1.01 ± 0.02   1.01 ± 0.00   -             -             97.5 ± 1.71      6.45 ± 0.83
         VBNN     1.00 ± 0.02   1.00 ± 0.00   -             -             91.4 ± 3.89      100 ± 0.00
         VD       0.99 ± 0.02   1.01 ± 0.00   -             -             76.4 ± 4.75      28.6 ± 2.81
         HS-BNN   0.98 ± 0.02   1.02 ± 0.01   -             -             83.5 ± 0.78      64.9 ± 24.9
         AGP      0.99 ± 0.02   1.01 ± 0.00   -             -             -                30.0 ± 0.00
         LOT      1.04 ± 0.01   1.02 ± 0.00   -             -             -                10.0 ± 0.00
Sparse   SVBNN    0.99 ± 0.03   1.00 ± 0.01   0.00 ± 0.00   0.00 ± 0.00   96.4 ± 4.73      2.15 ± 0.25
         VBNN     0.92 ± 0.05   1.53 ± 0.17   100 ± 0.00    0.00 ± 0.00   90.7 ± 8.15      100 ± 0.00
         VD       0.86 ± 0.04   1.07 ± 0.03   72.9 ± 6.99   0.00 ± 0.00   75.5 ± 7.81      20.8 ± 3.08
         HS-BNN   0.90 ± 0.04   1.29 ± 0.04   -             -             67.0 ± 8.54      32.1 ± 20.1
         AGP      1.01 ± 0.03   1.02 ± 0.00   16.9 ± 1.81   0.00 ± 0.00   -                5.00 ± 0.00
         LOT      0.96 ± 0.01   1.04 ± 0.01   19.5 ± 2.57   0.00 ± 0.00   -                5.00 ± 0.00

First, we examine the impact of different choices of $\lambda$ on our VB sparse DNN modeling. A set of different $\lambda$ values is used, and for each $\lambda$ we compute the training root mean squared error (RMSE) and testing RMSE based on $\widehat{f}_{H}$. Results for simulation setting (B) are plotted in Figure 1 with error bars (refer to Section B.4 in the supplementary material for the plot under setting (A)). The figure shows that as $\lambda$ increases, the resulting network becomes denser and the training RMSE decreases monotonically, while the testing RMSE curve is roughly U-shaped. In other words, an overly small $\lambda$ leads to over-sparsified DNNs with insufficient expressive power, while an overly large $\lambda$ leads to overfitted DNNs. The suggested $\lambda_{opt}$ lies in the valley of the U-shaped testing curve, which empirically justifies our theoretical choice of $\lambda_{opt}$.

We next compare the performance of our method (with $\lambda_{opt}$) to the benchmark methods; results are presented in Table 1. For the dense teacher network (A), our method yields the sparsest structure with comparable prediction error; for the sparse teacher network (B), our method not only achieves the best prediction accuracy but also always selects the correct set of relevant input variables. We also examine uncertainty quantification by studying the coverage of 95% Bayesian predictive intervals (refer to the supplementary material for details). Table 1 shows that our method attains coverage rates slightly above the nominal level, while the other (Bayesian) methods suffer from undercoverage.

Simulation II: Sparse nonlinear function

Consider the following sparse function $f_{0}$:

f_{0}(x_{1},\dots,x_{200})=\frac{7x_{2}}{1+x^{2}_{1}}+5\sin(x_{3}x_{4})+2x_{5},\quad\epsilon\sim\mathcal{N}(0,1), \qquad (11)

where all covariates are iid $\mathcal{N}(0,1)$ and the dataset contains $n=3000$ observations. A ReLU network with $L=3$ and $N=7$ is used. As in Simulation I, we study the impact of $\lambda$, and the results in Figure 2 justify that $\lambda_{opt}$ is a reasonable choice. Table 2 compares the performance of our method (under $\lambda_{opt}$) with the competing methods. Our method exhibits the best prediction accuracy with minimal connectivity among all methods. In addition, it achieves the smallest FPR and an acceptable FNR for input variable selection, whereas the other methods select a huge number of false input variables. Figure 2(c) shows the selected network (edges with $\phi_{i}>0.5$) in one replication, which correctly identifies the relevant input variables.
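For reference, a minimal NumPy sketch for generating the training data of this simulation (function name and seed are our own choices):

import numpy as np

def simulate_sparse_regression(n=3000, p=200, seed=0):
    """Generate (X, y) from the sparse nonlinear model (11): only the first five
    covariates x_1, ..., x_5 (columns 0-4) are relevant; covariates and noise are iid N(0, 1)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    f0 = 7 * X[:, 1] / (1 + X[:, 0] ** 2) + 5 * np.sin(X[:, 2] * X[:, 3]) + 2 * X[:, 4]
    y = f0 + rng.standard_normal(n)
    return X, y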

Figure 2: (a) $\lambda\leq\lambda_{opt}$: $\lambda=\{10^{-200},10^{-150},10^{-100},10^{-50},10^{-20},10^{-5},\lambda_{opt}\}$. (b) $\lambda\geq\lambda_{opt}$: $\lambda=\{\lambda_{opt},0.1,0.3,0.5,0.7,0.9,0.99\}$. (c) A selected network structure for (11).
Table 2: Results for Simulation II. The sparsity levels selected for AGP and LOT are both 30%.

Method   Train RMSE    Test RMSE     FPR(%)        FNR(%)        Sparsity(%)
SVBNN    1.19 ± 0.05   1.21 ± 0.05   0.00 ± 0.21   16.0 ± 8.14   2.97 ± 0.48
VBNN     0.96 ± 0.06   1.99 ± 0.49   100 ± 0.00    0.00 ± 0.00   100 ± 0.00
VD       1.02 ± 0.05   1.43 ± 0.19   98.6 ± 1.22   0.67 ± 3.65   46.9 ± 4.72
HS-BNN   1.17 ± 0.52   1.66 ± 0.43   -             -             41.1 ± 36.5
AGP      1.06 ± 0.08   1.58 ± 0.11   82.7 ± 3.09   1.33 ± 5.07   30.0 ± 0.00
LOT      1.08 ± 0.09   1.44 ± 0.14   83.6 ± 2.94   0.00 ± 0.00   30.0 ± 0.00
Figure 3: Testing accuracy for MNIST.

MNIST application.

We evaluate the performance of our method on the MNIST classification task, comparing with the benchmark methods. A 2-hidden-layer DNN with 512 neurons in each layer is used. We compare the testing accuracy of our method (with $\lambda_{opt}$) to the benchmark methods at different epochs using the same batch size (refer to the supplementary material for details). Figure 3 shows that our method achieves the best accuracy as the number of epochs increases; the final sparsity levels for SVBNN, AGP and VD are 5.06%, 5.00% and 2.28%, respectively.

In addition, an illustration of our method's capability for uncertainty quantification on MNIST can be found in the supplementary material, along with additional experimental results on UCI regression datasets.

7 Conclusion and discussion

We proposed a variational inference method for deep neural networks under spike-and-slab priors with theoretical guarantees. A future direction is to investigate the theory behind choosing hyperparameters via EB estimation instead of deterministic choices.

Furthermore, extending the current results to more complicated networks (convolutional neural networks, residual networks, etc.) is not trivial. Conceptually, it requires the design of structured sparsity (e.g., group sparsity in Neklyudov et al. (2017)) to fulfill the goal of faster prediction. Theoretically, it requires a deeper understanding of the expressive ability (i.e., approximation error) and capacity (i.e., packing or covering number) of the network model space. For illustration purposes, we include an example of a Fashion-MNIST task using a convolutional neural network in the supplementary material, which demonstrates the use of our method on more complex networks in practice.

Broader Impact

We believe the ethical aspects are not applicable to this work. Regarding future societal consequences, deep learning has a wide range of applications such as computer vision and natural language processing. Our work provides a solution to overcome drawbacks of modern deep neural networks, and also improves the understanding of deep learning.

The proposed method could improve existing applications. Specifically, sparse learning helps apply deep neural networks to hardware-limited devices, such as cell phones or tablets, which will broaden the range of deep learning applications. In addition, as a Bayesian method, it provides not only a result but also the confidence or certainty associated with that result, which could benefit people in various ways. For example, in cancer diagnosis, by providing the certainty associated with each possible outcome, Bayesian learning would assist medical professionals in making a better judgement about whether a tumor is malignant or benign. Such an ability to quantify uncertainty would contribute to modern deep learning.

Acknowledgments and Disclosure of Funding

We would like to thank Wei Deng for helpful discussions and the reviewers for their thoughtful comments. Qifan Song's research is partially supported by National Science Foundation grant DMS-1811812. Guang Cheng was a member of the Institute for Advanced Study while writing this paper, and he would like to thank the institute for its hospitality.

References

  • Blei et al., (2017) Blei, D., Kucukelbir, A., and McAuliffe, J. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112:859–877.
  • Blundell et al., (2015) Blundell, C., Cornebise, J., Kavukcuoglu, K., et al. (2015). Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML 15), pages 1613–1622, Lille, France.
  • Boucheron et al., (2013) Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University press.
  • Cheng et al., (2018) Cheng, Y., Wang, D., Zhou, P., et al. (2018). Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Processing Magazine, 35(1):126–136.
  • Chérief-Abdellatif, (2020) Chérief-Abdellatif, B.-E. (2020). Convergence rates of variational inference in sparse deep learning. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Vienna, Austria.
  • Chérief-Abdellatif and Alquier, (2018) Chérief-Abdellatif, B.-E. and Alquier, P. (2018). Consistency of variational bayes inference for estimation and model selection in mixtures. Electronic Journal of Statistics, 12(2):2995–3035.
  • Deng et al., (2019) Deng, W., Zhang, X., Liang, F., and Lin, G. (2019). An adaptive empirical Bayesian method for sparse deep learning. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
  • Feng and Simon, (2017) Feng, J. and Simon, N. (2017). Sparse input neural networks for high-dimensional nonparametric regression and classification. arXiv preprint arXiv:1711.07592.
  • Frankle and Carbin, (2018) Frankle, J. and Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
  • Gale et al., (2019) Gale, T., Elsen, E., and Hooker, S. (2019). The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574.
  • George and McCulloch, (1993) George, E. and McCulloch, R. (1993). Variable selection via gibbs sampling. Journal of the American Statistical Association, 88:881–889.
  • Ghosal and Van Der Vaart, (2007) Ghosal, S. and Van Der Vaart, A. (2007). Convergence rates of posterior distributions for noniid observations. The Annals of Statistics, 35(1):192–223.
  • Ghosh and Doshi-Velez, (2017) Ghosh, S. and Doshi-Velez, F. (2017). Model selection in Bayesian neural networks via horseshoe priors. arXiv preprint arXiv:1705.10388.
  • Ghosh et al., (2018) Ghosh, S., Yao, J., and Doshi-Velez, F. (2018). Structured variational learning of Bayesian neural networks with horseshoe priors. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden.
  • Goldt et al., (2019) Goldt, S., Advani, M. S., Saxe, A. M., Krzakala, F., and Zdeborová, L. (2019). Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
  • Guo et al., (2018) Guo, Y., Zhang, C., Zhang, C., and Chen, Y. (2018). Sparse DNNs with improved adversarial robustness. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), pages 240–249, Montréal, Canada.
  • Han et al., (2016) Han, S., Mao, H., and Dally, W. (2016). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR).
  • Hernández-Lobato and Adams, (2015) Hernández-Lobato, J. and Adams, R. (2015). Probabilistic backpropagation for scalable learning of bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, France.
  • Hubin and Storvik, (2019) Hubin, A. and Storvik, G. (2019). Combining model and parameter uncertainty in bayesian neural networks. arXiv:1903.07594.
  • Ishwaran and Rao, (2005) Ishwaran, H. and Rao, S. (2005). Spike and slab variable selection: Frequentist and bayesian strategies. Annals of Statistics, 33(2):730–773.
  • Jang et al., (2017) Jang, E., Gu, S., and Poole, B. (2017). Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations (ICLR 2017).
  • Jordan et al., (1999) Jordan, M., Ghahramani, Z., Jaakkola, T., et al. (1999). An introduction to variational methods for graphical models. Machine Learning.
  • Kingma and Welling, (2014) Kingma, D. and Welling, M. (2014). Auto-encoding variational Bayes. arXiv:1312.6114.
  • Le Cam, (1986) Le Cam, L. (1986). Asymptotic methods in statistical decision theory. Springer Science & Business Media, New York.
  • Liang et al., (2018) Liang, F., Li, Q., and Zhou, L. (2018). Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association, 113:955–972.
  • Louizos et al., (2018) Louizos, C., Welling, M., and Kingma, D. P. (2018). Learning sparse neural networks through l0 regularization. In ICLR 2018.
  • MacKay, (1992) MacKay, D. (1992). A practical bayesian framework for backpropagation networks. Neural Computation.
  • Maddison et al., (2017) Maddison, C., Mnih, A., and Teh, Y. W. (2017). The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR 2017).
  • Mocanu et al., (2018) Mocanu, D., Mocanu, E., Stone, P., et al. (2018). Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9:2383.
  • Molchanov et al., (2017) Molchanov, D., Ashukha, A., and Vetrov, D. (2017). Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pages 2498–2507.
  • Neal, (1992) Neal, R. (1992). Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems 5 (NIPS 1992), pages 475–482.
  • Neklyudov et al., (2017) Neklyudov, K., Molchanov, D., Ashukha, A., and Vetrov, D. (2017). Structured Bayesian pruning via log-normal multiplicative noise. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA.
  • Pati et al., (2018) Pati, D., Bhattacharya, A., and Yang, Y. (2018). On the statistical optimality of variational bayes. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain.
  • Polson and Rockova, (2018) Polson, N. and Rockova, V. (2018). Posterior concentration for sparse deep learning. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), pages 930–941, Montréal, Canada.
  • Schmidt-Hieber, (2017) Schmidt-Hieber, J. (2017). Nonparametric regression using deep neural networks with ReLU activation function. arXiv:1708.06633v2.
  • Sønderby et al., (2016) Sønderby, C., Raiko, T., Maaløe, L., Sønderby, S., and Ole, W. (2016). How to train deep variational autoencoders and probabilistic ladder networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning (ICML 16), New York, NY.
  • Tian, (2018) Tian, Y. (2018). A theoretical framework for deep locally connected ReLU network. arXiv preprint arXiv:1809.10829.
  • Yang et al., (2020) Yang, Y., Pati, D., and Bhattacharya, A. (2020). $\alpha$-variational inference with statistical guarantees. Annals of Statistics, 48(2):886–905.
  • Ye and Sun, (2018) Ye, M. and Sun, Y. (2018). Variable selection via penalized neural network: a drop-out-one loss approach. In Proceedings of the 35th International Conference on International Conference on Machine Learning (ICML 18), pages 5620–5629, Stockholm, Sweden.
  • Zhang and Gao, (2019) Zhang, F. and Gao, C. (2019). Convergence rates of variational posterior distributions. arXiv preprint arXiv:1712.02519.
  • Zhu and Gupta, (2018) Zhu, M. and Gupta, S. (2018). To prune, or not to prune: Exploring the efficacy of pruning for model compression. In International Conference on Learning Representations (ICLR).

Supplementary Document to the Paper "Efficient Variational Inference for Sparse Deep Learning with Theoretical Guarantee"

In this document, detailed proofs of the theoretical results are provided in the first section, and additional numerical results are presented in the second section.

Appendix A Proofs of theoretical results

A.1 Proof of Lemma 4.1

As a technical tool for the proofs, we first restate Lemma 6.1 of Chérief-Abdellatif and Alquier, (2018) as follows.

Lemma A.1.

For any $K>0$, the KL divergence between any two mixture densities $\sum_{k=1}^{K}w_{k}g_{k}$ and $\sum_{k=1}^{K}\tilde{w}_{k}\tilde{g}_{k}$ is bounded as

\mbox{KL}\Bigl(\sum_{k=1}^{K}w_{k}g_{k}\,\Big|\Big|\,\sum_{k=1}^{K}\tilde{w}_{k}\tilde{g}_{k}\Bigr)\leq\mbox{KL}(\boldsymbol{w}||\tilde{\boldsymbol{w}})+\sum^{K}_{k=1}w_{k}\mbox{KL}(g_{k}||\tilde{g}_{k}),

where $\mbox{KL}(\boldsymbol{w}||\tilde{\boldsymbol{w}})=\sum_{k=1}^{K}w_{k}\log\frac{w_{k}}{\tilde{w}_{k}}$.

Proof of Lemma 4.1

Proof.

It suffices to construct some $q^{*}(\theta)\in\mathcal{Q}$ such that, w.h.p.,

\begin{split}
&\mbox{KL}(q^{*}(\theta)||\pi(\theta|\lambda))+\int_{\Theta}l_{n}(P_{0},P_{\theta})q^{*}(\theta)(d\theta)\\
\leq\ &C_{1}nr^{*}_{n}+C^{\prime}_{1}n\inf_{\theta}||f_{\theta}-f_{0}||^{2}_{\infty}+C^{\prime}_{1}nr^{*}_{n},
\end{split}

where $C_{1}$ and $C^{\prime}_{1}$ are some positive constants if $\lim n(r_{n}^{*}+\xi^{*}_{n})=\infty$, or any diverging sequences if $\limsup n(r_{n}^{*}+\xi^{*}_{n})\neq\infty$.

Recall that $\theta^{*}=\arg\min_{\theta\in\Theta(L,\boldsymbol{p},s^{*},B)}||f_{\theta}-f_{0}||^{2}_{\infty}$; then $q^{*}(\theta)\in\mathcal{Q}$ is constructed so that

\mbox{KL}(q^{*}(\theta)||\pi(\theta|\lambda))\leq C_{1}nr^{*}_{n}, \qquad (12)
\int_{\Theta}||f_{\theta}-f_{\theta^{*}}||_{\infty}^{2}q^{*}(\theta)(d\theta)\leq r^{*}_{n}. \qquad (13)

We define $q^{*}(\theta)$ as follows: for $i=1,\dots,T$,

\begin{split}
&\theta_{i}|\gamma^{*}_{i}\sim\gamma^{*}_{i}\mathcal{N}(\theta_{i}^{*},\sigma^{2}_{n})+(1-\gamma^{*}_{i})\delta_{0},\\
&\gamma_{i}^{*}\sim\mbox{Bern}(\phi^{*}_{i}),\quad\phi^{*}_{i}=1(\theta^{*}_{i}\neq 0),
\end{split} \qquad (14)

where $\sigma_{n}^{2}=a_{n}^{2}$ with $a_{n}^{2}=\frac{s^{*}}{8n}\log^{-1}(3pN)(2BN)^{-2(L+1)}\Bigl\{(p+1+\frac{1}{BN-1})^{2}+\frac{1}{(2BN)^{2}-1}+\frac{2}{(2BN-1)^{2}}\Bigr\}^{-1}$.

To prove (12), denote by $\Gamma^{T}$ the set of all possible binary inclusion vectors of length $T$; then $q^{*}(\theta)$ and $\pi(\theta|\lambda)$ can be written as the mixtures

q^{*}(\theta)=\sum_{\gamma\in\Gamma^{T}}1(\gamma=\gamma^{*})\prod^{T}_{i=1}\bigl[\gamma_{i}\mathcal{N}(\theta^{*}_{i},\sigma^{2}_{n})+(1-\gamma_{i})\delta_{0}\bigr],

and

\pi(\theta|\lambda)=\sum_{\gamma\in\Gamma^{T}}\pi(\gamma)\prod^{T}_{i=1}\bigl[\gamma_{i}\mathcal{N}(0,\sigma^{2}_{0})+(1-\gamma_{i})\delta_{0}\bigr],

where $\pi(\gamma)$ is the probability of the vector $\gamma$ under the prior distribution $\pi$. Then,

\begin{split}
&\mbox{KL}(q^{*}(\theta)||\pi(\theta|\lambda))\\
\leq\ &\log\frac{1}{\pi(\gamma^{*})}+\sum_{\gamma\in\Gamma^{T}}1(\gamma=\gamma^{*})\mbox{KL}\Bigl\{\prod^{T}_{i=1}\gamma_{i}\mathcal{N}(\theta^{*}_{i},\sigma^{2}_{n})+(1-\gamma_{i})\delta_{0}\,\Big|\Big|\,\prod^{T}_{i=1}\gamma_{i}\mathcal{N}(0,\sigma^{2}_{0})+(1-\gamma_{i})\delta_{0}\Bigr\}\\
=\ &\log\frac{1}{\lambda^{s^{*}}(1-\lambda)^{T-s^{*}}}+\sum^{T}_{i=1}\mbox{KL}\Bigl\{\gamma^{*}_{i}\mathcal{N}(\theta^{*}_{i},\sigma^{2}_{n})+(1-\gamma^{*}_{i})\delta_{0}\,\big|\big|\,\gamma^{*}_{i}\mathcal{N}(0,\sigma^{2}_{0})+(1-\gamma^{*}_{i})\delta_{0}\Bigr\}\\
=\ &s^{*}\log\Bigl(\frac{1}{\lambda}\Bigr)+(T-s^{*})\log\Bigl(\frac{1}{1-\lambda}\Bigr)+\sum^{T}_{i=1}\gamma^{*}_{i}\Bigl\{\frac{1}{2}\log\Bigl(\frac{\sigma^{2}_{0}}{\sigma^{2}_{n}}\Bigr)+\frac{\sigma^{2}_{n}+\theta^{*2}_{i}}{2}-\frac{1}{2}\Bigr\}\\
\leq\ &C_{0}nr^{*}_{n}+\frac{s^{*}}{2}\sigma^{2}_{n}+\frac{s^{*}}{2}(B^{2}-1)+\frac{s^{*}}{2}\log\Bigl(\frac{\sigma^{2}_{0}}{\sigma^{2}_{n}}\Bigr)\\
\leq\ &(C_{0}+1)nr^{*}_{n}+\frac{s^{*}}{2}B^{2}+\frac{s^{*}}{2}\log\Bigl(\frac{8n}{s^{*}}\log(3pN)(2BN)^{2L+2}\Bigl\{\Bigl(p+1+\frac{1}{BN-1}\Bigr)^{2}+\frac{1}{(2BN)^{2}-1}+\frac{2}{(2BN-1)^{2}}\Bigr\}\Bigr)\\
\leq\ &(C_{0}+2)nr^{*}_{n}+\frac{B^{2}}{2}s^{*}+(L+1)s^{*}\log(2BN)+\frac{s^{*}}{2}\log\log(3BN)+\frac{s^{*}}{2}\log\Bigl(\frac{n}{s^{*}}p^{2}\Bigr)\\
\leq\ &(C_{0}+3)nr^{*}_{n}+(L+1)s^{*}\log N+s^{*}\log\Bigl(p\sqrt{\frac{n}{s^{*}}}\Bigr)\\
\leq\ &C_{1}nr^{*}_{n},\quad\mbox{for sufficiently large }n,
\end{split}

where $C_{0}$ and $C_{1}$ are some fixed constants. The first inequality is due to Lemma A.1 and the second inequality is due to Condition 4.4.

Furthermore, by Appendix G of Chérief-Abdellatif, (2020), it can be shown

\begin{split}
&\int_{\Theta}||f_{\theta}-f_{\theta^{*}}||^{2}_{\infty}q^{*}(\theta)(d\theta)\\
\leq\ &8a^{2}_{n}\log(3BN)(2BN)^{2L+2}\Bigl\{\Bigl(p+1+\frac{1}{BN-1}\Bigr)^{2}+\frac{1}{(2BN)^{2}-1}+\frac{2}{(2BN-1)^{2}}\Bigr\}\\
\leq\ &\frac{s^{*}}{n}\leq r^{*}_{n}.
\end{split}

Note that

\begin{split}
l_{n}(P_{0},P_{\theta})&=\frac{1}{2\sigma^{2}_{\epsilon}}\bigl(||Y-f_{\theta}(X)||^{2}_{2}-||Y-f_{0}(X)||^{2}_{2}\bigr)\\
&=\frac{1}{2\sigma^{2}_{\epsilon}}\bigl(||Y-f_{0}(X)+f_{0}(X)-f_{\theta}(X)||^{2}_{2}-||Y-f_{0}(X)||^{2}_{2}\bigr)\\
&=\frac{1}{2\sigma^{2}_{\epsilon}}\bigl(||f_{\theta}(X)-f_{0}(X)||^{2}_{2}+2\langle Y-f_{0}(X),f_{0}(X)-f_{\theta}(X)\rangle\bigr).
\end{split}

Denote

\begin{split}
\mathcal{R}_{1}&=\int_{\Theta}||f_{\theta}(X)-f_{0}(X)||^{2}_{2}\,q^{*}(\theta)(d\theta),\\
\mathcal{R}_{2}&=\int_{\Theta}\langle Y-f_{0}(X),f_{0}(X)-f_{\theta}(X)\rangle\,q^{*}(\theta)(d\theta).
\end{split}

Since $||f_{\theta}(X)-f_{0}(X)||^{2}_{2}\leq n||f_{\theta}-f_{0}||^{2}_{\infty}\leq n||f_{\theta}-f_{\theta^{\ast}}||^{2}_{\infty}+n||f_{\theta^{\ast}}-f_{0}||^{2}_{\infty}$,

\mathcal{R}_{1}\leq nr^{*}_{n}+n||f_{\theta^{\ast}}-f_{0}||^{2}_{\infty}.

Since $Y-f_{0}(X)=\epsilon\sim\mathcal{N}(0,\sigma^{2}_{\epsilon}I)$, we have

\mathcal{R}_{2}=\int_{\Theta}\epsilon^{T}(f_{0}(X)-f_{\theta}(X))q^{*}(\theta)(d\theta)=\epsilon^{T}\int_{\Theta}(f_{0}(X)-f_{\theta}(X))q^{*}(\theta)(d\theta)\sim\mathcal{N}(0,c_{f}\sigma^{2}_{\epsilon}),

where $c_{f}=||\int_{\Theta}(f_{0}(X)-f_{\theta}(X))q^{*}(\theta)(d\theta)||^{2}_{2}\leq\mathcal{R}_{1}$ by the Cauchy-Schwarz inequality. Therefore, $\mathcal{R}_{2}=O_{p}(\sqrt{\mathcal{R}_{1}})$, and w.h.p., $\mathcal{R}_{2}\leq C^{\prime}_{0}\mathcal{R}_{1}$, where $C^{\prime}_{0}$ is some positive constant if $\lim n(r_{n}^{*}+\xi_{n}^{*})=\infty$ or any diverging sequence if $\limsup n(r_{n}^{*}+\xi_{n}^{*})\neq\infty$. Therefore,

\begin{split}
\int_{\Theta}l_{n}(P_{0},P_{\theta})q^{*}(\theta)(d\theta)=\mathcal{R}_{1}/2\sigma^{2}_{\epsilon}+\mathcal{R}_{2}/\sigma^{2}_{\epsilon}&\leq(2C^{\prime}_{0}+1)n(r^{*}_{n}+||f_{\theta^{\ast}}-f_{0}||^{2}_{\infty})/2\sigma_{\epsilon}^{2}\\
&\leq C^{\prime}_{1}\bigl(nr^{*}_{n}+n||f_{\theta^{\ast}}-f_{0}||^{2}_{\infty}\bigr)\quad\mbox{w.h.p.},
\end{split}

which, together with (12), concludes the lemma. ∎

A.2 Proof of Lemma 4.2

Under Conditions 4.1-4.2, we have the following lemma on the existence of testing functions over $\Theta_{n}=\Theta(L,\boldsymbol{p},s_{n})$, where $\Theta(L,\boldsymbol{p},s_{n})$ denotes the set of parameters whose $L_{0}$ norm is bounded by $s_{n}$.

Lemma A.2.

Let $\varepsilon^{*}_{n}=Mn^{-1/2}\sqrt{(L+1)s^{*}\log N+s^{*}\log(p\sqrt{n/s^{*}})}\log^{\delta}(n)$ for any $\delta>1$ and some large constant $M$, and let $s_{n}=s^{*}\log^{2\delta-1}n$. Then there exist a testing function $\phi\in[0,1]$ and constants $C_{1}>0$, $C_{2}>1/3$ such that

\begin{split}
\mathbb{E}_{P_{0}}(\phi)&\leq\exp\{-C_{1}n\varepsilon_{n}^{*2}\},\\
\sup_{P_{\theta}\in\mathcal{F}(L,\boldsymbol{p},s_{n}),\,d(P_{\theta},P_{0})>\varepsilon^{*}_{n}}\mathbb{E}_{P_{\theta}}(1-\phi)&\leq\exp\{-C_{2}nd^{2}(P_{0},P_{\theta})\}.
\end{split}
Proof.

Due to a well-known result (e.g., Le Cam, (1986), page 491, or Ghosal and Van Der Vaart, (2007), Lemma 2), there always exists a function $\psi\in[0,1]$ such that

\begin{split}
&\mathbb{E}_{P_{0}}(\psi)\leq\exp\{-nd^{2}(P_{\theta_{1}},P_{0})/2\},\\
&\mathbb{E}_{P_{\theta}}(1-\psi)\leq\exp\{-nd^{2}(P_{\theta_{1}},P_{0})/2\},
\end{split}

for all Pθ(L,𝒑,sn)P_{\theta}\in\mathcal{F}(L,\boldsymbol{p},s_{n}) satisfying that d(Pθ,Pθ1)d(P0,Pθ1)/18d(P_{\theta},P_{\theta_{1}})\leq d(P_{0},P_{\theta_{1}})/18.

Let $K=N(\varepsilon_{n}^{*}/19,\mathcal{F}(L,\boldsymbol{p},s_{n}),d(\cdot,\cdot))$ denote the covering number of the set $\mathcal{F}(L,\boldsymbol{p},s_{n})$, i.e., there exist $K$ Hellinger balls with radius $\varepsilon_{n}^{*}/19$ that completely cover $\mathcal{F}(L,\boldsymbol{p},s_{n})$. For any $\theta\in\mathcal{F}(L,\boldsymbol{p},s_{n})$ (W.L.O.G., we assume $P_{\theta}$ belongs to the $k$th Hellinger ball, centered at $P_{\theta_{k}}$), if $d(P_{\theta},P_{0})>\varepsilon_{n}^{*}$, then we must have $d(P_{0},P_{\theta_{k}})>(18/19)\varepsilon_{n}^{*}$, and there exists a testing function $\psi_{k}$ such that

𝔼P0(ψk)exp{nd2(Pθk,P0)/2}exp{(182/192/2)nεn2},𝔼Pθ(1ψk)exp{nd2(Pθk,P0)/2}exp{n(d(P0,Pθ)εn/19)2/2}exp{(182/192/2)nd2(P0,Pθ)}.\begin{split}\mathbb{E}_{P_{0}}(\psi_{k})&\leq\exp\{-nd^{2}(P_{\theta_{k}},P_{0})/2\}\\ &\leq\exp\{-(18^{2}/19^{2}/2)n\varepsilon_{n}^{*2}\},\\ \mathbb{E}_{P_{\theta}}(1-\psi_{k})&\leq\exp\{-nd^{2}(P_{\theta_{k}},P_{0})/2\}\\ &\leq\exp\{-n(d(P_{0},P_{\theta})-\varepsilon_{n}^{*}/19)^{2}/2\}\\ &\leq\exp\{-(18^{2}/19^{2}/2)nd^{2}(P_{0},P_{\theta})\}.\end{split}

Now we define $\phi=\max_{k=1,\dots,K}\psi_{k}$. Thus we must have

\begin{split}\mathbb{E}_{P_{0}}(\phi)&\leq\sum_{k}\mathbb{E}_{P_{0}}(\psi_{k})\leq K\exp\{-(18^{2}/19^{2}/2)n\varepsilon_{n}^{*2}\}\\ &=\exp\{-(18^{2}/19^{2}/2)n\varepsilon_{n}^{*2}+\log K\}.\end{split}

Note that

\begin{split}\log K&=\log N(\varepsilon_{n}^{*}/19,\mathcal{F}(L,\boldsymbol{p},s_{n}),d(\cdot,\cdot))\\ &\leq\log N(\sqrt{8}\sigma_{\epsilon}\varepsilon_{n}^{*}/19,\mathcal{F}(L,\boldsymbol{p},s_{n}),\|\cdot\|_{\infty})\\ &\leq(s_{n}+1)\log\Big(\frac{38}{\sqrt{8}\sigma_{\epsilon}\varepsilon_{n}^{*}}(L+1)(N+1)^{2(L+1)}\Big)\\ &\leq C_{0}\Big(s_{n}\log\frac{1}{\varepsilon_{n}^{*}}+s_{n}\log(L+1)+s_{n}(L+1)\log N\Big)\\ &\leq s_{n}(L+1)\log n\log N\leq s^{*}(L+1)\log N\log^{2\delta}n\\ &\leq n\varepsilon_{n}^{*2}/4,\mbox{ for sufficiently large }n,\end{split}\quad(15)

where C0C_{0} is some positive constant, the first inequality is due to the fact

d2(Pθ,P0)1exp{18σϵ2f0fθ2}\begin{split}d^{2}(P_{\theta},P_{0})\leq 1-\exp\{-\frac{1}{8\sigma^{2}_{\epsilon}}||f_{0}-f_{\theta}||^{2}_{\infty}\}\end{split}

and $\varepsilon_{n}^{*}=o(1)$; the second inequality is due to Lemma 10 of Schmidt-Hieber, (2017) (although Schmidt-Hieber, (2017) only focuses on ReLU networks, its Lemma 10 applies to any 1-Lipschitz continuous activation function); and the last inequality is due to $s_{n}\log(1/\varepsilon^{*}_{n})\asymp s_{n}\log n$. Therefore,

\begin{split}\mathbb{E}_{P_{0}}(\phi)&\leq\sum_{k}\mathbb{E}_{P_{0}}(\psi_{k})\leq\exp\{-C_{1}n\varepsilon_{n}^{*2}\},\end{split}

for some C1=182/192/21/4C_{1}=18^{2}/19^{2}/2-1/4. On the other hand, for any θ\theta, such that d(Pθ,P0)εnd(P_{\theta},P_{0})\geq\varepsilon_{n}^{*}, say PθP_{\theta} belongs to the kkth Hellinger ball, then we have

𝔼Pθ(1ϕ)𝔼Pθ(1ψk)exp{C2nd2(P0,Pθ)},\begin{split}\mathbb{E}_{P_{\theta}}(1-\phi)&\leq\mathbb{E}_{P_{\theta}}(1-\psi_{k})\leq\exp\{-C_{2}nd^{2}(P_{0},P_{\theta})\},\end{split}

where C2=182/192/2C_{2}=18^{2}/19^{2}/2. Hence we conclude the proof. ∎

Lemma A.3 restates the Donsker–Varadhan variational representation of the KL divergence; its proof can be found in Boucheron et al., (2013).

Lemma A.3.

For any probability measure μ\mu and any measurable function hh with ehL1(μ)e^{h}\in L_{1}(\mu),

logeh(η)μ(dη)=supρ[h(η)ρ(dη)KL(ρ||μ)].\log\int e^{h(\eta)}\mu(d\eta)=\sup_{\rho}\left[\int h(\eta)\rho(d\eta)-\mbox{KL}(\rho||\mu)\right].

Proof of Lemma 4.2

Proof.

Denote Θn\Theta_{n} as the truncated parameter space {θ:i=1T1(θi0)sn}\{\theta:\sum^{T}_{i=1}1(\theta_{i}\neq 0)\leq s_{n}\}, where sns_{n} is defined in Lemma A.2. Noting that

θΘd2(Pθ,P0)q^(θ)𝑑θ=θΘnd2(Pθ,P0)q^(θ)𝑑θ+θΘncd2(Pθ,P0)q^(θ)𝑑θ,\int_{\theta\in\Theta}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta=\int_{\theta\in\Theta_{n}}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta+\int_{\theta\in\Theta_{n}^{c}}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta, (16)

it suffices to bound the two components on the RHS of (16).

We start with the first component. Let $\widetilde{\pi}(\theta)$ denote the prior $\pi(\theta)$ truncated on the set $\Theta_{n}$, i.e., $\widetilde{\pi}(\theta)=\pi(\theta)1(\theta\in\Theta_{n})/\pi(\Theta_{n})$; then by Lemma A.2 and the same argument as in Theorem 3.1 of Pati et al., (2018), it can be shown that

Θnη(Pθ,P0)π~(θ)𝑑θeC0nεn2,w.h.p.\int_{\Theta_{n}}\eta(P_{\theta},P_{0})\widetilde{\pi}(\theta)d\theta\leq e^{C_{0}n\varepsilon^{*2}_{n}},\mbox{w.h.p.} (17)

for some $C_{0}>0$, where $\log\eta(P_{\theta},P_{0})=l_{n}(P_{\theta},P_{0})+\frac{n}{3}d^{2}(P_{\theta},P_{0})$. We further denote by $\widecheck{q}(\theta)$ the restriction of $\widehat{q}(\theta)$ to $\Theta_{n}$, i.e., $\widecheck{q}(\theta)=\widehat{q}(\theta)1(\theta\in\Theta_{n})/\widehat{q}(\Theta_{n})$; then by Lemma A.3 (applied with $\mu=\widetilde{\pi}$, $\rho=\widecheck{q}$ and $h=\log\eta(P_{\theta},P_{0})$) and (17), w.h.p.,

n3q^(Θn)Θnd2(Pθ,P0)q^(θ)𝑑θ=n3Θnd2(Pθ,P0)qˇ(θ)𝑑θCnεn2+KL(qˇ(θ)||π~(θ))Θnln(Pθ,P0)qˇ(θ)dθ.\begin{split}&\frac{n}{3\widehat{q}(\Theta_{n})}\int_{\Theta_{n}}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta=\frac{n}{3}\int_{\Theta_{n}}d^{2}(P_{\theta},P_{0})\widecheck{q}(\theta)d\theta\\ \leq&Cn\varepsilon^{*2}_{n}+\mbox{KL}(\widecheck{q}(\theta)||\widetilde{\pi}(\theta))-\int_{\Theta_{n}}l_{n}(P_{\theta},P_{0})\widecheck{q}(\theta)d\theta.\\ \end{split} (18)

Furthermore,

KL(qˇ(θ)||π~(θ))=1q^(Θn)θΘnlogq^(θ)π(θ)q^(θ)𝑑θ+logπ(Θn)q^(Θn)=1q^(Θn)KL(q^(θ)||π(θ))1q^(Θn)θΘnclogq^(θ)π(θ)q^(θ)dθ+logπ(Θn)q^(Θn),\begin{split}\mbox{KL}(\widecheck{q}(\theta)||\widetilde{\pi}(\theta))&=\frac{1}{\widehat{q}(\Theta_{n})}\int_{\theta\in\Theta_{n}}\log\frac{\widehat{q}(\theta)}{\pi(\theta)}\widehat{q}(\theta)d\theta+\log\frac{\pi(\Theta_{n})}{\widehat{q}(\Theta_{n})}\\ &=\frac{1}{\widehat{q}(\Theta_{n})}\mbox{KL}(\widehat{q}(\theta)||\pi(\theta))-\frac{1}{\widehat{q}(\Theta_{n})}\int_{\theta\in\Theta_{n}^{c}}\log\frac{\widehat{q}(\theta)}{\pi(\theta)}\widehat{q}(\theta)d\theta+\log\frac{\pi(\Theta_{n})}{\widehat{q}(\Theta_{n})},\end{split}

and similarly,

Θnln(Pθ,P0)qˇ(θ)𝑑θ=1q^(Θn)Θln(Pθ,P0)q^(θ)𝑑θ1q^(Θn)Θncln(Pθ,P0)q^(θ)𝑑θ.\begin{split}\int_{\Theta_{n}}l_{n}(P_{\theta},P_{0})\widecheck{q}(\theta)d\theta=\frac{1}{\widehat{q}(\Theta_{n})}\int_{\Theta}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta-\frac{1}{\widehat{q}(\Theta_{n})}\int_{\Theta_{n}^{c}}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta.\end{split}

Combining the above two equations together, we have

n3q^(Θn)Θnd2(Pθ,P0)q^(θ)dθCnεn2+KL(qˇ(θ)||π~(θ))Θnln(Pθ,P0)qˇ(θ)dθ=Cnεn2+1q^(Θn)(KL(q^(θ)||π(θ))Θln(Pθ,P0)q^(θ)dθ)1q^(Θn)(Θnclogq^(θ)π(θ)q^(θ)𝑑θΘncln(Pθ,P0)q^(θ)𝑑θ)+logπ(Θn)q^(Θn).\begin{split}&\frac{n}{3\widehat{q}(\Theta_{n})}\int_{\Theta_{n}}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq Cn\varepsilon_{n}^{*2}+\mbox{KL}(\widecheck{q}(\theta)||\widetilde{\pi}(\theta))-\int_{\Theta_{n}}l_{n}(P_{\theta},P_{0})\widecheck{q}(\theta)d\theta\\ =&Cn\varepsilon_{n}^{*2}+\frac{1}{\widehat{q}(\Theta_{n})}\left(\mbox{KL}(\widehat{q}(\theta)||\pi(\theta))-\int_{\Theta}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\right)\\ &-\frac{1}{\widehat{q}(\Theta_{n})}\left(\int_{\Theta_{n}^{c}}\log\frac{\widehat{q}(\theta)}{\pi(\theta)}\widehat{q}(\theta)d\theta-\int_{\Theta_{n}^{c}}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\right)+\log\frac{\pi(\Theta_{n})}{\widehat{q}(\Theta_{n})}.\end{split} (19)

The second component of (16) trivially satisfies that θΘncd2(Pθ,P0)q^(θ)𝑑θθΘncq^(θ)𝑑θ=q^(Θnc)\int_{\theta\in\Theta_{n}^{c}}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq\int_{\theta\in\Theta_{n}^{c}}\widehat{q}(\theta)d\theta=\widehat{q}(\Theta_{n}^{c}). Thus, together with (19), we have that w.h.p.,

d2(Pθ,P0)q^(θ)dθ3q^(Θn)Cεn2+3n(KL(q^(θ)||π(θ))Θln(Pθ,P0)q^(θ)dθ)+3nΘncln(Pθ,P0)q^(θ)𝑑θ+3nΘnclogπ(θ)q^(θ)q^(θ)𝑑θ+3q^(Θn)nlogπ(Θn)q^(Θn)+q^(Θnc).\begin{split}&\int d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq 3\widehat{q}(\Theta_{n})C\varepsilon^{*2}_{n}+\frac{3}{n}\left(\mbox{KL}(\widehat{q}(\theta)||\pi(\theta))-\int_{\Theta}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\right)\\ &+\frac{3}{n}\int_{\Theta_{n}^{c}}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta+\frac{3}{n}\int_{\Theta_{n}^{c}}\log\frac{\pi(\theta)}{\widehat{q}(\theta)}\widehat{q}(\theta)d\theta+\frac{3\widehat{q}(\Theta_{n})}{n}\log\frac{\pi(\Theta_{n})}{\widehat{q}(\Theta_{n})}+\widehat{q}(\Theta_{n}^{c}).\end{split} (20)

The second term in the RHS of (20) is bounded by $C^{\prime}(r^{*}_{n}+\xi^{*}_{n})$ w.h.p., due to Lemma 4.1, where $C^{\prime}$ is either a positive constant or a diverging sequence, depending on whether $n(r_{n}^{*}+\xi_{n}^{*})$ diverges.

The third term in the RHS of (20) is bounded by

\begin{split}&\frac{3}{n}\int_{\Theta_{n}^{c}}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\\ =&\frac{3}{2n\sigma^{2}_{\epsilon}}\int_{\Theta_{n}^{c}}\left[\sum^{n}_{i=1}\epsilon_{i}^{2}-\sum^{n}_{i=1}(\epsilon_{i}+f_{0}(X_{i})-f_{\theta}(X_{i}))^{2}\right]\widehat{q}(\theta)d\theta\\ =&\frac{3}{2n\sigma^{2}_{\epsilon}}\int_{\Theta_{n}^{c}}\left[-2\sum^{n}_{i=1}\epsilon_{i}(f_{0}(X_{i})-f_{\theta}(X_{i}))-\sum^{n}_{i=1}(f_{0}(X_{i})-f_{\theta}(X_{i}))^{2}\right]\widehat{q}(\theta)d\theta\\ =&\frac{3}{2n\sigma^{2}_{\epsilon}}\left\{-2\sum^{n}_{i=1}\epsilon_{i}\int_{\Theta_{n}^{c}}(f_{0}(X_{i})-f_{\theta}(X_{i}))\widehat{q}(\theta)d\theta-\int_{\Theta_{n}^{c}}\sum^{n}_{i=1}(f_{0}(X_{i})-f_{\theta}(X_{i}))^{2}\widehat{q}(\theta)d\theta\right\}.\end{split}

Conditional on XiX_{i}’s, 2i=1nϵiΘnc(f0(Xi)fθ(Xi))q^(θ)𝑑θ-2\sum^{n}_{i=1}\epsilon_{i}\int_{\Theta_{n}^{c}}(f_{0}(X_{i})-f_{\theta}(X_{i}))\widehat{q}(\theta)d\theta follows a normal distribution 𝒩(0,V2)\mathcal{N}(0,V^{2}), where V2=4σϵ2i=1n(Θnc(f0(Xi)fθ(Xi))q^(θ)𝑑θ)24σϵ2Θnci=1n(f0(Xi)fθ(Xi))2q^(θ)dθV^{2}=4\sigma^{2}_{\epsilon}\sum^{n}_{i=1}(\int_{\Theta_{n}^{c}}(f_{0}(X_{i})-f_{\theta}(X_{i}))\widehat{q}(\theta)d\theta)^{2}\leq 4\sigma^{2}_{\epsilon}\int_{\Theta_{n}^{c}}\sum^{n}_{i=1}(f_{0}(X_{i})-f_{\theta}(X_{i}))^{2}\widehat{q}(\theta)d\theta. Thus conditional on XiX_{i}’s, the third term in the RHS of (20) is bounded by

32nσϵ2[𝒩(0,V2)V24σϵ2].\frac{3}{2n\sigma^{2}_{\epsilon}}\left[\mathcal{N}(0,V^{2})-\frac{V^{2}}{4\sigma^{2}_{\epsilon}}\right]. (21)

Noting that 𝒩(0,V2)=Op(MnV)\mathcal{N}(0,V^{2})=O_{p}(M_{n}V) for any diverging sequence MnM_{n}, (21) is further bounded, w.h.p., by

32nσϵ2(MnVV24σϵ2)32nσϵ2σϵ2Mn2.\frac{3}{2n\sigma^{2}_{\epsilon}}(M_{n}V-\frac{V^{2}}{4\sigma^{2}_{\epsilon}})\leq\frac{3}{2n\sigma^{2}_{\epsilon}}\sigma^{2}_{\epsilon}M_{n}^{2}.

Therefore, the third term in the RHS of (20) can be bounded by εn2\varepsilon_{n}^{*2} w.h.p. (by choosing Mn2=nεn2M_{n}^{2}=n\varepsilon_{n}^{*2}).

The fourth term in the RHS of (20) is bounded by

3nΘnclogπ(θ)q^(θ)q^(θ)𝑑θ3nq^(Θnc)logπ(Θnc)q^(Θnc)3nsupx(0,1)[xlog(1/x)]=O(1/n).\frac{3}{n}\int_{\Theta_{n}^{c}}\log\frac{\pi(\theta)}{\widehat{q}(\theta)}\widehat{q}(\theta)d\theta\leq\frac{3}{n}\widehat{q}(\Theta_{n}^{c})\log\frac{\pi(\Theta_{n}^{c})}{\widehat{q}(\Theta_{n}^{c})}\leq\frac{3}{n}\sup_{x\in(0,1)}[x\log(1/x)]=O(1/n).

Similarly, the fifth term in the RHS of (20) is bounded by O(1/n)O(1/n).

For the last term in the RHS of (20), by Lemma A.5 below, we have, w.h.p., $\widehat{q}(\Theta_{n}^{c})\leq\varepsilon_{n}^{*2}$.

Combining all the results above, we have, w.h.p.,

d2(Pθ,P0)q^(θ)dθCεn2+3n(KL(q^(θ)||π(θ))Θln(Pθ,P0)q^(θ)dθ)+O(1/n),\begin{split}&\int d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq C\varepsilon^{*2}_{n}+\frac{3}{n}\left(\mbox{KL}(\widehat{q}(\theta)||\pi(\theta))-\int_{\Theta}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\right)+O(1/n),\end{split}

where CC is some constant. ∎

Lemma A.4 (Chernoff bound for Poisson tail).

Let Xpoi(λ)X\sim\mbox{poi}(\lambda) be a Poisson random variable. For any x>λx>\lambda,

P(Xx)(eλ)xeλxx.P(X\geq x)\leq\frac{(e\lambda)^{x}e^{-\lambda}}{x^{x}}.
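
(For completeness, a short sketch of the standard Chernoff argument behind Lemma A.4: for any $t>0$, Markov's inequality and the Poisson moment generating function give $P(X\geq x)\leq e^{-tx}\mathbb{E}[e^{tX}]=\exp\{\lambda(e^{t}-1)-tx\}$, and choosing the minimizing $t=\log(x/\lambda)>0$ yields $\exp\{x-\lambda-x\log(x/\lambda)\}=(e\lambda)^{x}e^{-\lambda}/x^{x}$.)
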
Lemma A.5.

If λT1exp{Mnrn/sn}\lambda\leq T^{-1}\exp\{-Mnr_{n}^{*}/s_{n}\} for any positive diverging sequence MM\rightarrow\infty, then w.h.p., q^(Θnc)=O(εn2)\widehat{q}(\Theta_{n}^{c})=O(\varepsilon_{n}^{*2}).

Proof.

By Lemma 4.1, we have that w.h.p.,

KL(q^(θ)||π(θ|λ))+Θln(P0,Pθ)q^(θ)dθ=infqθ𝒬{KL(q(θ)||π(θ|λ))+Θln(P0,Pθ)q(θ)(dθ)}Cnrn(Note that rnξn)\begin{split}&\mbox{KL}(\widehat{q}(\theta)||\pi(\theta|\lambda))+\int_{\Theta}l_{n}(P_{0},P_{\theta})\widehat{q}(\theta)d\theta=\inf_{q_{\theta}\in\mathcal{Q}}\Bigl{\{}\mbox{KL}(q(\theta)||\pi(\theta|\lambda))+\int_{\Theta}l_{n}(P_{0},P_{\theta})q(\theta)(d\theta)\Bigr{\}}\\ \leq&Cnr_{n}^{*}\quad(\mbox{Note that }r_{n}^{*}\asymp\xi^{*}_{n})\end{split}

where $C$ is either a constant or any diverging sequence, depending on whether $nr_{n}^{*}$ diverges. By an argument similar to the one used in the proof of Lemma 4.1,

Θln(P0,Pθ)q^(θ)𝑑θ12σϵ2(Θfθ(X)f0(X)22q^(θ)(dθ)+Z)\int_{\Theta}l_{n}(P_{0},P_{\theta})\widehat{q}(\theta)d\theta\leq\frac{1}{2\sigma_{\epsilon}^{2}}\left(\int_{\Theta}||f_{\theta}(X)-f_{0}(X)||^{2}_{2}\widehat{q}(\theta)(d\theta)+Z\right)

where $Z$ is normally distributed as $\mathcal{N}(0,\sigma^{2}_{\epsilon}c_{0}^{\prime})$ with $c_{0}^{\prime}\leq c_{0}=\int_{\Theta}||f_{\theta}(X)-f_{0}(X)||^{2}_{2}\widehat{q}(\theta)(d\theta)$. Therefore, $-\int_{\Theta}l_{n}(P_{0},P_{\theta})\widehat{q}(\theta)d\theta=(1/2\sigma_{\epsilon}^{2})[-c_{0}+O_{p}(\sqrt{c_{0}})]$, and $\mbox{KL}(\widehat{q}(\theta)||\pi(\theta|\lambda))\leq Cnr_{n}^{*}+(1/2\sigma_{\epsilon}^{2})[-c_{0}+O_{p}(\sqrt{c_{0}})]$. Since $Cnr_{n}^{*}\rightarrow\infty$, we must have, w.h.p., $\mbox{KL}(\widehat{q}(\theta)||\pi(\theta|\lambda))\leq Cnr_{n}^{*}/2$. On the other hand,

KL(q^(θ)||π(θ|λ))=i=1TKL(q^(θi)||π(θi|λ))i=1TKL(q^(γi)||π(γi|λ))=i=1T[q^(γi=1)logq^(γi=1)λ+q^(γi=0)logq^(γi=0)1λ].\begin{split}&\mbox{KL}(\widehat{q}(\theta)||\pi(\theta|\lambda))=\sum_{i=1}^{T}\mbox{KL}(\widehat{q}(\theta_{i})||\pi(\theta_{i}|\lambda))\geq\sum_{i=1}^{T}\mbox{KL}(\widehat{q}(\gamma_{i})||\pi(\gamma_{i}|\lambda))\\ =&\sum_{i=1}^{T}\left[\widehat{q}(\gamma_{i}=1)\log\frac{\widehat{q}(\gamma_{i}=1)}{\lambda}+\widehat{q}(\gamma_{i}=0)\log\frac{\widehat{q}(\gamma_{i}=0)}{1-\lambda}\right].\end{split} (22)

Let us choose λ0=1/T\lambda_{0}=1/T, and A={i:q^(γi=1)λ0}A=\{i:\widehat{q}(\gamma_{i}=1)\geq\lambda_{0}\}, then the above inequality (22) implies that iAq^(γi=1)log(λ0/λ)Cnrn/2\sum_{i\in A}\widehat{q}(\gamma_{i}=1)\log(\lambda_{0}/\lambda)\leq Cnr_{n}^{*}/2. Noting that λT1exp{Mnrn/sn}\lambda\leq T^{-1}\exp\{-Mnr_{n}^{*}/s_{n}\}, it further implies iAq^(γi=1)sn/Msn\sum_{i\in A}\widehat{q}(\gamma_{i}=1)\leq s_{n}/M\prec s_{n}.

Under the distribution $\widehat{q}$, by Bernstein's inequality,

Pr(iAγi2sn/3)Pr(iAγisn/2+iA𝔼(γi))exp(sn2/8iA𝔼[γi2]+sn/6)=exp(sn2/8iAq^(γi=1)+sn/6)exp(csn)=O(εn2)\begin{split}&Pr(\sum_{i\in A}\gamma_{i}\geq 2s_{n}/3)\leq Pr(\sum_{i\in A}\gamma_{i}\geq s_{n}/2+\sum_{i\in A}\mathbb{E}(\gamma_{i}))\leq\exp\left(-\frac{s_{n}^{2}/8}{\sum_{i\in A}\mathbb{E}[\gamma_{i}^{2}]+s_{n}/6}\right)\\ =&\exp\left(-\frac{s_{n}^{2}/8}{\sum_{i\in A}\widehat{q}(\gamma_{i}=1)+s_{n}/6}\right)\leq\exp\left(-cs_{n}\right)=O(\varepsilon_{n}^{*2})\end{split}

for some constant c>0c>0, where the last inequality holds since log(1/εn2)=O(logn)sn\log(1/\varepsilon_{n}^{*2})=O(\log n)\prec s_{n}.

Under the distribution $\widehat{q}$, $\sum_{i\notin A}\gamma_{i}$ is stochastically smaller than $Bin(T,\lambda_{0})$. Since $T\rightarrow\infty$, by Lemma A.4,

Pr(iAγisn/3)Pr(Bin(T,λ0)sn/3)Pr(poi(1)sn/3)=O(exp{csn})=O(εn2)\begin{split}&Pr(\sum_{i\notin A}\gamma_{i}\geq s_{n}/3)\leq Pr(Bin(T,\lambda_{0})\geq s_{n}/3)\rightarrow Pr(\mbox{poi}(1)\geq s_{n}/3)\\ =&O(\exp\{-c^{\prime}s_{n}\})=O(\varepsilon_{n}^{*2})\end{split}

for some $c^{\prime}>0$. Trivially, this implies that, w.h.p., $Pr(\sum_{i}\gamma_{i}\geq s_{n})=O(\varepsilon_{n}^{*2})$ under the VB posterior $\widehat{q}$. ∎

A.3 Main theorem

Theorem A.1.

Under Conditions 4.1, 4.2 and 4.4, and setting $-\log\lambda=\log(T)+\delta[(L+1)\log N+\log\sqrt{n}p]$ for any constant $\delta>0$, we have that w.h.p.,

Θd2(Pθ,P0)q^(θ)𝑑θCεn2+C(rn+ξn),\int_{\Theta}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq C\varepsilon^{*2}_{n}+C^{\prime}(r_{n}^{*}+\xi^{*}_{n}),

where $C$ is some positive constant and $C^{\prime}$ is any diverging sequence. If $\|f_{0}\|_{\infty}<F$ and we truncate the VB posterior on $\Theta_{F}=\{\theta:\|f_{\theta}\|_{\infty}\leq F\}$, i.e., $\widehat{q}_{F}\propto\widehat{q}1(\theta\in\Theta_{F})$, then, w.h.p.,

ΘF𝔼X|fθ(X)f0(X)|2q^F(θ)𝑑θCεn2+C(rn+ξn)CFq^(ΘF)\int_{\Theta_{F}}\mathbb{E}_{X}|f_{\theta}(X)-f_{0}(X)|^{2}\widehat{q}_{F}(\theta)d\theta\leq\frac{C\varepsilon^{*2}_{n}+C^{\prime}(r_{n}^{*}+\xi^{*}_{n})}{C_{F}\widehat{q}(\Theta_{F})}

where CF=[1exp(4F2/8σϵ2)]/4F2C_{F}=[1-\exp(-4F^{2}/8\sigma^{2}_{\epsilon})]/4F^{2}, and q^(ΘF)\widehat{q}(\Theta_{F}) is the VB posterior mass of ΘF\Theta_{F}.

Proof.

The convergence under the squared Hellinger distance is a direct result of Lemmas 4.1 and 4.2, by simply checking that the choice of $\lambda$ satisfies the required conditions. The convergence under the $L_{2}$ distance relies on the inequality $d^{2}(P_{\theta},P_{0})\geq C_{F}\mathbb{E}_{X}|f_{\theta}(X)-f_{0}(X)|^{2}$ for $C_{F}=[1-\exp(-4F^{2}/8\sigma^{2}_{\epsilon})]/4F^{2}$, which holds when both $f_{\theta}$ and $f_{0}$ are bounded by $F$. Then, w.h.p.,

ΘF𝔼X|fθ(X)f0(X)|2q^F(θ)𝑑θCF1ΘFd2(Pθ,P0)q^F(θ)𝑑θ1CFq^(ΘF)Θd2(Pθ,P0)q^(θ)𝑑θCεn2+C(rn+ξn)CFq^(ΘF).\begin{split}&\int_{\Theta_{F}}\mathbb{E}_{X}|f_{\theta}(X)-f_{0}(X)|^{2}\widehat{q}_{F}(\theta)d\theta\leq C_{F}^{-1}\int_{\Theta_{F}}d^{2}(P_{\theta},P_{0})\widehat{q}_{F}(\theta)d\theta\\ \leq&\frac{1}{C_{F}\widehat{q}(\Theta_{F})}\int_{\Theta}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq\frac{C\varepsilon^{*2}_{n}+C^{\prime}(r_{n}^{*}+\xi^{*}_{n})}{C_{F}\widehat{q}(\Theta_{F})}.\end{split}

Appendix B Additional experimental results

B.1 Comparison between the Bernoulli variable and the Gumbel-softmax approximation

Denote γiBern(ϕi)\gamma_{i}\sim\mbox{Bern}(\phi_{i}) and γ~iGumbel-softmax(ϕi,τ)\widetilde{\gamma}_{i}\sim\mbox{Gumbel-softmax}(\phi_{i},\tau), then we have that

γ~i:=gτ(ϕi,ui)=(1+exp(ηi/τ))1,where ηi=logϕi1ϕi+logui1ui,ui𝒰(0,1),γi:=g(ϕi,ui)=1(uiϕi)where ui𝒰(0,1).\begin{split}&\widetilde{\gamma}_{i}:=g_{\tau}(\phi_{i},u_{i})=(1+\exp(-\eta_{i}/\tau))^{-1},\quad\mbox{where }\eta_{i}=\log\frac{\phi_{i}}{1-\phi_{i}}+\log\frac{u_{i}}{1-u_{i}},\quad u_{i}\sim\mathcal{U}(0,1),\\ &\gamma_{i}:=g(\phi_{i},u_{i})=1(u_{i}\leq\phi_{i})\quad\mbox{where }u_{i}\sim\mathcal{U}(0,1).\end{split}

Fig 4 demonstrates the functional convergence of $g_{\tau}$ towards $g$ as $\tau$ goes to zero. In Fig 4(a), with $\phi_{i}=0.9$ fixed, $g_{\tau}$ is shown to converge to $g$ as a function of $u_{i}$; in Fig 4(b), with $u_{i}=0.2$ fixed, $g_{\tau}$ converges to $g$ as a function of $\alpha_{i}=\log(\phi_{i}/(1-\phi_{i}))$. Formally, Maddison et al., (2017) rigorously proved that $\widetilde{\gamma}_{i}$ converges to $\gamma_{i}$ in distribution as $\tau$ approaches 0.

Figure 4: The convergence of $g_{\tau}$ towards $g$ as $\tau$ approaches 0. (a) Fix $\phi_{i}=0.9$. (b) Fix $u_{i}=0.2$.
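
To complement Figure 4, the following minimal NumPy/SciPy sketch (our illustration, not the code behind the figure) evaluates $g_{\tau}$ against its zero-temperature limit $1(\eta_{i}>0)$, which for $u_{i}\sim\mathcal{U}(0,1)$ has the same $\mbox{Bern}(\phi_{i})$ distribution as $\gamma_{i}$; the gap shrinks as $\tau$ decreases.

import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def g_tau(phi, u, tau):
    """Gumbel-softmax (binary Concrete) relaxation of a Bern(phi) draw."""
    eta = np.log(phi / (1 - phi)) + np.log(u / (1 - u))
    return expit(eta / tau)

def g_limit(phi, u):
    """Zero-temperature limit of g_tau: a hard 0/1 value with the Bern(phi) law."""
    eta = np.log(phi / (1 - phi)) + np.log(u / (1 - u))
    return (eta > 0).astype(float)

phi, u_fixed = 0.9, 0.2
u_grid = np.linspace(0.01, 0.99, 99)
for tau in (1.0, 0.5, 0.1, 0.01):
    mean_gap = np.mean(np.abs(g_tau(phi, u_grid, tau) - g_limit(phi, u_grid)))
    gap_at_u = abs(g_tau(phi, u_fixed, tau) - g_limit(phi, u_fixed))
    print(f"tau={tau:4.2f}: mean |g_tau - limit| = {mean_gap:.3f}, "
          f"gap at u={u_fixed} = {gap_at_u:.3f}")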

B.2 Algorithm implementation details for the numerical experiments

Initialization

As mentioned by Sønderby et al., (2016) and Molchanov et al., (2017), training a sparse BNN with random initialization may lead to poor performance, since many of the weights could be pruned too early. In our case, we assign each of the weights and biases an inclusion variable, which could shrink to zero quickly in the early optimization stage if initialized randomly. As a consequence, we deliberately initialize $\phi_{i}$ to be close to 1 in our experiments. This initialization strategy ensures that training starts from a fully connected neural network, which is similar to starting from a pre-trained fully connected network as mentioned in Molchanov et al., (2017). The other two parameters $\mu_{i}$ and $\sigma_{i}$ are initialized randomly.
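
As a minimal sketch of this strategy (illustrative only; the function and variable names are ours), the variational parameters could be initialized as follows, with every inclusion probability $\phi_{i}$ set close to 1 so that training starts from an (almost) fully connected network:

import numpy as np

def init_variational_params(num_weights, phi_init=0.99, seed=0):
    """Initialize mean-field spike-and-slab variational parameters.

    phi:   inclusion probabilities, set near 1 so no weight is pruned too early;
    mu:    slab (Gaussian) means, small random values;
    sigma: slab standard deviations, small random positive values.
    """
    rng = np.random.default_rng(seed)
    phi = np.full(num_weights, phi_init)    # start almost fully connected
    mu = 0.1 * rng.standard_normal(num_weights)
    sigma = 0.1 * np.abs(rng.standard_normal(num_weights)) + 1e-3
    return phi, mu, sigma

phi, mu, sigma = init_variational_params(num_weights=501)
print(phi[:3], mu[:3], sigma[:3])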

Other implementation details in simulation studies

We set $K=1$ and the learning rate to $5\times 10^{-3}$ during training. For Simulation I, we choose batch sizes $m=1024$ and $m=128$ for (A) and (B) respectively, and run 10000 epochs in both cases. For Simulation II, we use $m=512$ and run 7000 epochs. Although it is common to set up an annealing schedule for the temperature parameter $\tau$, we do not observe any significant performance improvement over keeping $\tau$ constant; therefore we choose $\tau=0.5$ in all of our experiments. The optimization method used is Adam.

The implementation details for the UCI datasets and MNIST can be found in Sections B.5 and B.6, respectively.

B.3 Toy example: linear regression

In this section, we use a toy example to demonstrate that there is little difference between the results obtained with the inverse-CDF reparameterization and with the Gumbel-softmax approximation.

Consider a linear regression model:

Yi=XiTβ+ϵi,ϵi𝒩(0,1),i=1,,n,Y_{i}=X^{T}_{i}\beta+\epsilon_{i},\quad\epsilon_{i}\sim\mathcal{N}(0,1),\quad i=1,\ldots,n,

We simulate a dataset with 1000 observations and 200 predictors, where β50=β100=β150=10\beta_{50}=\beta_{100}=\beta_{150}=10, β75=β125=10\beta_{75}=\beta_{125}=-10 and βj=0\beta_{j}=0 for all other jj.

A spike-and-slab prior is imposed on β\beta such that

βj|γjγj𝒩(0,σ02)+(1γj)δ0,γjBern(λ),\beta_{j}|\gamma_{j}\sim\gamma_{j}\mathcal{N}(0,\sigma^{2}_{0})+(1-\gamma_{j})\delta_{0},\quad\gamma_{j}\sim\mbox{Bern}(\lambda),

for $j=1,\ldots,200$, where $\sigma_{0}=5$ and $\lambda=0.03$. The variational distribution $q(\beta)\in\mathcal{Q}$ is chosen as

βj|γjγj𝒩(μj,σj2)+(1γj)δ0,γjBern(ϕj).\beta_{j}|\gamma_{j}\sim\gamma_{j}\mathcal{N}(\mu_{j},\sigma^{2}_{j})+(1-\gamma_{j})\delta_{0},\quad\gamma_{j}\sim\mbox{Bern}(\phi_{j}).

We use both the Gumbel-softmax approximation and the inverse-CDF reparameterization for the stochastic optimization of the ELBO, and plot the posterior mean $\mathbb{E}_{\widehat{q}(\beta)}(\beta_{j}|\gamma_{j})$ (blue curve) against the true value (red curve). Figure 5 shows that the inverse-CDF reparameterization exhibits only slightly higher error in estimating the zero coefficients than the Gumbel-softmax approximation, which indicates that the two methods differ little on this toy example.

Figure 5: Linear regression. (a) Gumbel-softmax reparametrization. (b) Inverse-CDF reparametrization.
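
For concreteness, the following PyTorch sketch (our own illustration, not the authors' implementation) runs the Gumbel-softmax variant of this toy example with a single Monte Carlo sample per step; the closed-form KL term assumes the usual mean-field decomposition of the spike-and-slab variational family, and the final line reads off the variational mean $\mathbb{E}_{\widehat{q}}(\beta_{j})=\phi_{j}\mu_{j}$.

import torch

torch.manual_seed(0)
n, p, sigma0, lam, tau = 1000, 200, 5.0, 0.03, 0.5

# Simulate the data: beta_50 = beta_100 = beta_150 = 10, beta_75 = beta_125 = -10.
beta_true = torch.zeros(p)
beta_true[[49, 99, 149]] = 10.0
beta_true[[74, 124]] = -10.0
X = torch.randn(n, p)
Y = X @ beta_true + torch.randn(n)

# Variational parameters: logit of phi_j, slab mean mu_j, log of slab std sigma_j.
phi_logit = torch.zeros(p, requires_grad=True)
mu = torch.zeros(p, requires_grad=True)
log_sigma = torch.full((p,), -1.0, requires_grad=True)
opt = torch.optim.Adam([phi_logit, mu, log_sigma], lr=1e-2)

def kl_spike_slab(phi, mu, sigma):
    # Closed-form KL between the mean-field spike-and-slab variational
    # distribution and the spike-and-slab prior (assumed decomposition).
    kl_gauss = torch.log(sigma0 / sigma) + (sigma**2 + mu**2) / (2 * sigma0**2) - 0.5
    kl_bern = phi * torch.log(phi / lam) + (1 - phi) * torch.log((1 - phi) / (1 - lam))
    return torch.sum(phi * kl_gauss + kl_bern)

for step in range(3000):
    phi = torch.sigmoid(phi_logit).clamp(1e-6, 1 - 1e-6)
    sigma = torch.exp(log_sigma)
    # One Gumbel-softmax sample of the inclusion indicators and one slab sample.
    u = torch.rand(p).clamp(1e-6, 1 - 1e-6)
    eta = torch.log(phi / (1 - phi)) + torch.log(u / (1 - u))
    gamma = torch.sigmoid(eta / tau)
    beta = gamma * (mu + sigma * torch.randn(p))
    # Negative ELBO = Gaussian negative log-likelihood (unit noise) + KL term.
    loss = 0.5 * torch.sum((Y - X @ beta) ** 2) + kl_spike_slab(phi, mu, sigma)
    opt.zero_grad()
    loss.backward()
    opt.step()

post_mean = (torch.sigmoid(phi_logit) * mu).detach()
print(post_mean[[49, 74, 99, 124, 149]])  # typically close to the +-10 signals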

B.4 Teacher-student networks

The network parameter $\theta$ for the sparse teacher network setting (B) is set as follows: $W=\{W_{1,11}=W_{1,12}=W_{2,11}=W_{2,12}=2.5,\ W_{1,21}=W_{1,22}=W_{2,21}=W_{2,22}=1.5,\ W_{3,11}=3\mbox{ and }W_{3,21}=2\}$; $b=\{b_{1,1}=b_{2,1}=b_{3,1}=1\mbox{ and }b_{1,2}=b_{2,2}=-1\}$.

Figure 6 displays the simulation results for Simulation I under the dense teacher network setting (A). Unlike the results under the sparse teacher network (B), the testing accuracy seems to increase monotonically as $\lambda$ increases (i.e., as the posterior network gets denser). However, as shown, the increase in testing performance is rather slow, which indicates that introducing sparsity has little negative impact on the testing accuracy.

Figure 6: (a) $\lambda\leq\lambda_{opt}$, with $\lambda=\{10^{-200},10^{-150},10^{-100},10^{-50},10^{-20},10^{-5},\lambda_{opt}\}$. (b) $\lambda\geq\lambda_{opt}$, with $\lambda=\{\lambda_{opt},0.1,0.3,0.5,0.7,0.9,0.99\}$. (c) The structure of the target dense teacher network.

Coverage rate

In this paragraph, we explain how we compute the coverage rates of the Bayesian credible intervals reported in the main text. A fixed point $(x_{1}^{(*)},\dots,x_{p}^{(*)})^{\prime}$ is prespecified, and $x^{(1)},\dots,x^{(1000)}$ are 1000 equidistant points from $-1$ to $1$. In each run, we compute the Bayesian credible intervals of the response means (estimated by 600 Monte Carlo samples) for 1000 different inputs $x$: $(x^{(1)},x_{2}^{(*)},\dots,x_{p}^{(*)}),\dots,(x^{(1000)},x_{2}^{(*)},\dots,x_{p}^{(*)})$. This is repeated 60 times, and the average coverage rate (over all different $x$'s and the 60 runs) is reported. Similarly, we replace $x_{2}^{(*)}$ (or $x_{3}^{(*)}$) by $x^{(i)}$ ($i=1,\dots,1000$) and compute the corresponding average coverage rates. The complete coverage rate results are shown in Table 3; a schematic code sketch of the protocol follows the table. Note that Table 1 in the main text shows the $95\%$ coverage of $x_{3}$ for (A) and the $95\%$ coverage of $x_{1}$ for (B).

Table 3: Coverage rates for teacher networks.
Teacher / Method | 90% coverage (%): $x_{1}$, $x_{2}$, $x_{3}$ | 95% coverage (%): $x_{1}$, $x_{2}$, $x_{3}$
Dense SVBNN 93.8 ±\pm 2.84 93.1 ±\pm 4.93 93.1 ±\pm 2.96 97.9 ±\pm 1.01 97.9 ±\pm 1.69 97.5 ±\pm 1.71
VBNN 85.8 ±\pm 2.51 82.4 ±\pm 2.62 86.3 ±\pm 1.88 92.7 ±\pm 2.83 91.3 ±\pm 2.61 91.4 ±\pm 2.43
VD 61.3 ±\pm 2.40 60.0 ±\pm 2.79 64.9 ±\pm 6.17 74.9 ±\pm 1.79 71.8 ±\pm 2.33 76.4 ±\pm 4.75
HS-BNN 83.1 ±\pm 1.67 80.0 ±\pm 1.21 76.9 ±\pm 1.70 88.1 ±\pm 1.13 84.1 ±\pm 1.48 83.5 ±\pm 0.78
Sparse SVBNN 92.3 ±\pm 8.61 94.6 ±\pm 5.37 98.3 ±\pm 0.00 96.4 ±\pm 4.73 97.7 ±\pm 3.71 100 ±\pm 0.00
VBNN 86.7 ±\pm 10.9 87.0 ±\pm 11.3 93.3 ±\pm 0.00 90.7 ±\pm 8.15 91.9 ±\pm 9.21 96.7 ±\pm 0.00
VD 65.2 ±\pm 0.08 63.7 ±\pm 6.58 65.9 ±\pm 0.83 75.5 ±\pm 7.81 74.6 ±\pm 7.79 76.6 ±\pm 0.40
HS-BNN 59.0 ±\pm 8.52 59.4 ±\pm 4.38 56.6 ±\pm 2.06 67.0 ±\pm 8.54 68.2 ±\pm 3.62 66.5 ±\pm 1.86
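
The following sketch outlines the protocol above (illustrative only: sample_response_means is a dummy stand-in for the 600 Monte Carlo draws of the response mean under the VB posterior, the toy true mean is made up, and the outer loop over the 60 repetitions is omitted).

import numpy as np

rng = np.random.default_rng(0)

def true_mean(x):
    # Made-up target mean, standing in for f_0(x); illustration only.
    return np.tanh(x[0]) + 0.5 * x[1]

def sample_response_means(x, n_mc=600):
    # Dummy stand-in for n_mc Monte Carlo draws of the response mean under q-hat.
    return true_mean(x) + 0.1 * rng.standard_normal(n_mc)

def coverage_for_coordinate(x_star, coord, level=0.95, n_grid=1000):
    # Sweep one coordinate over 1000 equidistant points in [-1, 1], keep the
    # others fixed at x_star, and record how often the credible interval
    # covers the true response mean.
    grid = np.linspace(-1.0, 1.0, n_grid)
    hits = 0
    for v in grid:
        x = x_star.copy()
        x[coord] = v
        draws = sample_response_means(x)
        lo, hi = np.quantile(draws, [(1 - level) / 2, 1 - (1 - level) / 2])
        hits += (lo <= true_mean(x) <= hi)
    return hits / n_grid

x_star = np.array([0.3, -0.2, 0.5])
print([round(coverage_for_coordinate(x_star, j), 3) for j in range(3)])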

B.5 Real data regression experiment: UCI datasets

We follow the experimental protocol of Hernández-Lobato and Adams, (2015) and choose five datasets for the experiment. For the small datasets "Kin8nm", "Naval", "Power Plant" and "Wine", we use a single-hidden-layer ReLU network with 50 hidden units. We randomly select $90\%$ of the data for training and $10\%$ for testing, and this random split is repeated 20 times (to obtain standard deviations for our results). We choose minibatch size $m=128$ and learning rate $10^{-3}$, and run 500 epochs for "Naval", "Power Plant" and "Wine", and 800 epochs for "Kin8nm". For the large dataset "Year", we use a single-hidden-layer ReLU network with 100 hidden units, and the evaluation is conducted on a single split. We choose $m=256$ and learning rate $10^{-3}$, and run 100 epochs. For all five datasets, $\lambda$ is chosen as $\lambda_{opt}$: $\log(\lambda^{-1}_{opt})=\log(T)+0.1[(L+1)\log N+\log\sqrt{n}p]$, the same as in the other numerical studies. We let $\sigma^{2}_{0}=2$ and use a grid search to find the $\sigma_{\epsilon}$ that yields the best prediction accuracy. Adam is used for all the datasets in this experiment.
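
As a concrete illustration of this choice (our own arithmetic, assuming that $T$ counts all weights and biases, $L$ is the number of hidden layers and $N$ is the maximum layer width), $\lambda_{opt}$ can be computed as follows:

import math

def lambda_opt(n, p, widths):
    """log(1/lambda_opt) = log(T) + 0.1 * [(L + 1) * log(N) + log(sqrt(n) * p)]."""
    layers = [p] + list(widths) + [1]
    T = sum(layers[i] * layers[i + 1] + layers[i + 1] for i in range(len(layers) - 1))
    L, N = len(widths), max(widths)
    log_inv = math.log(T) + 0.1 * ((L + 1) * math.log(N) + math.log(math.sqrt(n) * p))
    return math.exp(-log_inv)

# "Kin8nm": n = 8192 observations, p = 8 inputs, one hidden layer with 50 units.
print(lambda_opt(n=8192, p=8, widths=[50]))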

We report the testing root mean squared error (RMSE) based on $\widehat{f}_{H}$ (defined in the main text) with $H=30$, and also report the posterior network sparsity $\widehat{s}=\sum^{T}_{i=1}\phi_{i}/T$. For comparison, we list the results of Horseshoe BNN (HS-BNN) (Ghosh and Doshi-Velez,, 2017) and probabilistic backpropagation (PBP) (Hernández-Lobato and Adams,, 2015). Table 4 demonstrates that our method achieves the best prediction accuracy on all the datasets while maintaining a sparse structure.

Table 4: Results on UCI regression datasets.
Dataset | n(p) | Test RMSE: SVBNN, HS-BNN, PBP | Posterior sparsity (%): SVBNN
Kin8nm 8192 (8) 0.08 ±\pm 0.00 0.08 ±\pm 0.00 0.10 ±\pm 0.00 64.5 ±\pm 1.85
Naval 11934 (16) 0.00 ±\pm 0.00 0.00 ±\pm 0.00 0.01 ±\pm 0.00 82.9 ±\pm 1.31
Power Plant 9568 (4) 4.01 ±\pm 0.18 4.03 ±\pm 0.15 4.12 ±\pm 0.03 56.6 ±\pm 3.13
Wine 1599 (11) 0.62 ±\pm 0.04 0.63 ±\pm 0.04 0.64 ±\pm 0.01 59.9 ±\pm 4.92
Year 515345 (90) 8.87 ±\pm NA 9.26 ±\pm NA 8.88 ±\pm NA 20.8 ±\pm NA

B.6 Real data classification experiment: MNIST dataset

The MNIST data are normalized with mean 0.1306 and standard deviation 0.3081. For all methods, we use the same minibatch size $m=256$; the learning rate is $5\times 10^{-3}$ for our method and $3\times 10^{-3}$ for the others; the total number of epochs is 400 and the optimization algorithm is RMSprop. AGP is pre-specified at a $5\%$ sparsity level.

Figure 7: (a) Overlaid images (on the last column). (b) Predictive distributions for the overlaid images: the top, middle, and bottom rows correspond to the overlaid images made from 5 and 6, from 2 and 3, and from 2 and 7, respectively.

Besides the testing accuracy reported in the main text, we also examine our method's ability to quantify uncertainty on the MNIST classification task. We first create ambiguous images by overlaying two examples from the testing set, as shown in Figure 7(a). To perform uncertainty quantification with our method, for each of the overlaid images we generate $\theta_{h}$ from the VB posterior $\widehat{q}(\theta)$ for $h=1,\ldots,100$, calculate the associated predictive probability vector $f_{\theta_{h}}(x)\in\mathbb{R}^{10}$, where $x$ is the overlaid image input, and then use the estimated posterior mean $\widehat{f}(x)=\sum^{100}_{h=1}f_{\theta_{h}}(x)/100$ as the Bayesian predictive probability vector. As a comparison, we also calculate the predictive probability vector for each overlaid image using AGP as a frequentist benchmark. Figure 7(b) shows that the frequentist method gives an almost deterministic answer (i.e., the predictive probability is almost 1 for a certain digit), which is obviously unsatisfactory for this task, while our VB method provides a meaningful measure of uncertainty on these out-of-domain inputs. This demonstrates the advantage of the Bayesian method in uncertainty quantification for classification.
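
A minimal sketch of this Monte Carlo predictive averaging (illustrative only: the one-layer "network" and its variational parameters below are placeholders, not the trained model):

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Placeholder spike-and-slab variational parameters for a tiny linear "network".
d_in, n_class = 784, 10
phi = torch.full((n_class, d_in), 0.2)      # inclusion probabilities
mu = 0.01 * torch.randn(n_class, d_in)      # slab means
sigma = 0.05 * torch.ones(n_class, d_in)    # slab standard deviations

def sample_predictive(x):
    """Draw one network from the (placeholder) VB posterior and return
    its predictive probability vector for the input x."""
    gamma = torch.bernoulli(phi)
    W = gamma * (mu + sigma * torch.randn_like(mu))
    return F.softmax(W @ x, dim=0)

def bayesian_predictive(x, H=100):
    """Estimated posterior-mean predictive probabilities: average over H draws."""
    return torch.stack([sample_predictive(x) for _ in range(H)]).mean(dim=0)

x = torch.rand(d_in)                        # stand-in for an overlaid image
probs = bayesian_predictive(x, H=100)
print(probs, probs.sum())                   # a full distribution over 10 digits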

B.7 Illustration of CNN: Fashion-MNIST dataset

In this section, we perform an experiment on a more complex task, the Fashion-MNIST dataset. To illustrate the usage of our method beyond feedforward networks, we consider a 2-Conv-2-FC network: the feature maps of the convolutional layers are 32 and 64, and the filter sizes are $5\times 5$ and $3\times 3$ respectively; the padding is 2 for both layers, and each convolutional layer is followed by $2\times 2$ max pooling; the fully-connected part takes the flattened $64\times 8\times 8$ features as input. The activation functions are all ReLUs. The dataset is preprocessed by random horizontal flips. The batch size is 1024, the learning rate is 0.001, and Adam is used for optimization. We run the experiment for 150 epochs.
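
A PyTorch sketch of this backbone (a deterministic version without the spike-and-slab variational treatment; the width of the hidden fully-connected layer is our assumption, since it is not fully specified above):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNet(nn.Module):
    """2-Conv-2-FC backbone for Fashion-MNIST (28x28 grayscale inputs)."""
    def __init__(self, n_class=10, fc_hidden=256):   # fc_hidden: assumed width
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5, padding=2)   # 28x28 -> 28x28
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=2)  # 14x14 -> 16x16
        self.pool = nn.MaxPool2d(2)                               # halves H and W
        self.fc1 = nn.Linear(64 * 8 * 8, fc_hidden)
        self.fc2 = nn.Linear(fc_hidden, n_class)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # -> 32 x 14 x 14
        x = self.pool(F.relu(self.conv2(x)))   # -> 64 x 8 x 8
        x = x.flatten(start_dim=1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

print(ConvNet()(torch.zeros(1, 1, 28, 28)).shape)   # torch.Size([1, 10])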

We use both SVBNN and VBNN for this task. In particular, VBNN, which uses a normal prior and normal variational distributions, is the fully Bayesian method without compression, and can be regarded as the baseline for our method. Figure 8 shows that our method attains higher accuracy as the number of epochs increases, and that the accuracy then decreases slightly as the sparsity goes down. Meanwhile, the baseline method (full BNN) suffers from overfitting after 80 epochs.

Figure 8: Fashion-MNIST experiment. (a) Accuracy. (b) Sparsity.