
Efficient Variational Inference for Sparse Deep Learning with Theoretical Guarantee

Jincheng Bai
Department of Statistics
Purdue University
West Lafayette, IN 47906
[email protected]
Qifan Song
Department of Statistics
Purdue University
West Lafayette, IN 47906
[email protected]
Guang Cheng
Department of Statistics
Purdue University
West Lafayette, IN 47906
[email protected]
Abstract

Sparse deep learning aims to address the challenge of huge storage consumption by deep neural networks, and to recover the sparse structure of target functions. Although tremendous empirical successes have been achieved, most sparse deep learning algorithms lack theoretical support. On the other hand, another line of work has proposed theoretical frameworks that are computationally infeasible. In this paper, we train sparse deep neural networks with a fully Bayesian treatment under spike-and-slab priors, and develop a computationally efficient variational inference procedure via continuous relaxation of the Bernoulli distribution. The variational posterior contraction rate is provided, which justifies the consistency of the proposed variational Bayes method. Notably, our empirical results demonstrate that this variational procedure provides uncertainty quantification in terms of the Bayesian predictive distribution and is also capable of consistent variable selection by training a sparse multi-layer neural network.

1 Introduction

Deep neural networks (DNNs) face various problems despite their huge success in AI applications. Larger training sets and more complicated network structures improve accuracy in deep learning, but always incur huge storage and computation burdens. For example, small portable devices may have limited resources such as a few megabytes of memory, while a dense network like ResNet-50 with 50 convolutional layers needs more than 95 megabytes of memory for storage and numerous floating-point computations (Cheng et al., 2018). It is therefore necessary to compress deep learning models before deploying them on such hardware-limited devices.

In addition, sparse neural networks may recover the potential sparsity structure of the target function, e.g., a sparse teacher network in the teacher-student framework (Goldt et al., 2019; Tian, 2018). Another example comes from nonparametric regression with sparse target functions, i.e., where only a portion of the input variables are relevant to the response variable. A sparse network may serve the goal of variable selection (Feng and Simon, 2017; Liang et al., 2018; Ye and Sun, 2018), and is also known to be robust to adversarial samples under $l_{\infty}$ and $l_{2}$ attacks (Guo et al., 2018).

Bayesian neural networks (BNNs), which date back to MacKay (1992) and Neal (1992), possess, compared with frequentist DNNs, the advantages of robust prediction via model averaging and automatic uncertainty quantification (Blundell et al., 2015). Conceptually, a BNN can easily induce sparse network selection by assigning a discrete prior over all possible network structures. However, challenges remain for sparse BNNs, including inefficient Bayesian computation and the lack of theoretical justification. This work aims to resolve these two important bottlenecks simultaneously by utilizing the variational inference approach (Jordan et al., 1999; Blei et al., 2017). On the computational side, it reduces the ultra-high-dimensional sampling problem of Bayesian computing to an optimization task that can still be solved by a back-propagation algorithm. On the theoretical side, we provide a proper prior specification under which the variational posterior distribution converges towards the truth. To the best of our knowledge, this work is the first to provide a complete package of both theory and computation for sparse Bayesian DNNs.

Related work.

A plethora of methods for sparsifying or compressing neural networks have been proposed (Cheng et al., 2018; Gale et al., 2019). The majority of these methods are pruning-based (Han et al., 2016; Zhu and Gupta, 2018; Frankle and Carbin, 2018); they are ad hoc in choosing the pruning threshold and usually require additional training and fine tuning. Some other methods achieve sparsity during training. For example, Louizos et al. (2018) introduced $l_{0}$-regularized learning and Mocanu et al. (2018) proposed sparse evolutionary training. However, the theoretical guarantees and the optimal choice of hyperparameters for these methods are unclear. As a more natural way to enforce sparsity in DNNs, Bayesian sparse neural networks have been proposed by placing prior distributions on network weights: Blundell et al. (2015) and Deng et al. (2019) considered spike-and-slab priors with a Gaussian and a Laplacian spike, respectively; a log-uniform prior was used in Molchanov et al. (2017); Ghosh et al. (2018) chose the popular horseshoe shrinkage prior. These existing works actually yield posteriors over the dense DNN model space despite applying sparsity-inducing priors; to obtain explicit sparse inference results, users must additionally impose pruning rules on the posterior. On the other hand, theoretical properties of sparse deep learning have been studied in Schmidt-Hieber (2017), Polson and Rockova (2018) and Chérief-Abdellatif (2020), but finding an efficient implementation that closes the gap between theory and practice remains a challenge for these methods.

Detailed contributions.

We impose a spike-and-slab prior on all the edges (weights and biases) of a neural network, where the spike and slab components represent whether the corresponding edge is inactive or active, respectively. Our work distinguishes itself from prior works on Bayesian sparse neural networks by using a spike-and-slab prior with a Dirac spike; hence all posterior samples are automatically exact sparse DNN models.

More importantly, with carefully chosen hyperparameter values, especially the prior probability that each edge is active, we establish variational posterior consistency, and the corresponding convergence rate strikes a balance among the statistical estimation error, the variational error and the approximation error. The theoretical results are validated by various simulations and real applications. Empirically, we also demonstrate that the proposed method performs well in variable selection and uncertainty quantification. While Feng and Simon (2017), Liang et al. (2018) and Ye and Sun (2018) only considered neural networks with a single hidden layer for variable selection, we observe correct support recovery for multi-layer networks.

2 Preliminaries

Nonparametric regression.

Consider a nonparametric regression model with random covariates

Y_{i}=f_{0}(X_{i})+\epsilon_{i},\quad i=1,\ldots,n, \qquad (1)

where $X_{i}=(x_{i1},\ldots,x_{ip})^{t}\sim\mathcal{U}([-1,1]^{p})$ (this compact-support assumption is generally satisfied for standardized data and may be relaxed), $\mathcal{U}$ denotes the uniform distribution, $\epsilon_{i}\overset{iid}{\sim}\mathcal{N}(0,\sigma^{2}_{\epsilon})$ is the noise term, and $f_{0}:[-1,1]^{p}\rightarrow\mathbb{R}$ is an underlying true function. For simplicity of analysis, we assume $\sigma_{\epsilon}$ is known. Denote $D_{i}=(X_{i},Y_{i})$ and $D=(D_{1},\dots,D_{n})$ as the observations. Let $P_{0}$ denote the underlying probability measure of the data, and $p_{0}$ the corresponding density function.

Deep neural network.

An $L$-hidden-layer neural network is used to model the data. The number of neurons in each hidden layer is denoted by $p_{i}$ for $i=1,\dots,L$. The weight matrix and bias parameter in layer $i$ are denoted by $W_{i}\in\mathbb{R}^{p_{i}\times p_{i-1}}$ and $b_{i}\in\mathbb{R}^{p_{i}}$ for $i=1,\dots,L+1$. Let $\sigma(x)$ be the activation function; for any $r\in\mathbb{Z}^{+}$ and any $b\in\mathbb{R}^{r}$, we define $\sigma_{b}:\mathbb{R}^{r}\rightarrow\mathbb{R}^{r}$ by $\sigma_{b}(y)_{j}=\sigma(y_{j}-b_{j})$ for $j=1,\ldots,r$. Then, given parameters $\boldsymbol{p}=(p_{1},\dots,p_{L})'$ and $\theta=\{W_{1},b_{1},\ldots,W_{L},b_{L},W_{L+1},b_{L+1}\}$, the output of this DNN model can be written as

f_{\theta}(X)=W_{L+1}\sigma_{b_{L}}(W_{L}\sigma_{b_{L-1}}\cdots\sigma_{b_{1}}(W_{1}X))+b_{L+1}. \qquad (2)

In what follows, with slight abuse of notation, $\theta$ is also viewed as a vector containing all the coefficients in the $W_{i}$'s and $b_{i}$'s, i.e., $\theta=(\theta_{1},\dots,\theta_{T})'$, where the length $T:=\sum^{L-1}_{l=1}p_{l+1}(p_{l}+1)+p_{1}(p+1)+(p_{L}+1)$.
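To make the notation concrete, the following is a minimal NumPy sketch of the forward map (2) for a single input vector; the function name and the list-of-arrays representation are ours.

import numpy as np

def dnn_forward(x, weights, biases, act=np.tanh):
    """Sketch of the forward map (2) for an input vector x of length p.

    weights = [W_1, ..., W_{L+1}] with W_l of shape (p_l, p_{l-1});
    biases  = [b_1, ..., b_{L+1}] with b_l of shape (p_l,).
    Hidden layers apply sigma_{b_l}(y) = act(y - b_l) elementwise; the
    output layer adds b_{L+1} instead.
    """
    h = np.asarray(x)
    for W, b in zip(weights[:-1], biases[:-1]):
        h = act(W @ h - b)              # sigma_{b_l}(W_l h)
    return weights[-1] @ h + biases[-1]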

Variational inference.

Bayesian procedures make statistical inferences from the posterior distribution $\pi(\theta|D)\propto\pi(\theta)p_{\theta}(D)$, where $\pi(\theta)$ is the prior distribution. Since MCMC does not scale well to complex Bayesian learning tasks with large datasets, variational inference (Jordan et al., 1999; Blei et al., 2017) has become a popular alternative. Given a variational family of distributions $\mathcal{Q}$ (for simplicity, $\mathcal{Q}$ is commonly assumed to be the mean-field family, i.e., $q(\theta)=\prod^{T}_{i=1}q(\theta_{i})$), it seeks to approximate the true posterior distribution by finding the closest member of $\mathcal{Q}$ in terms of KL divergence:

\widehat{q}(\theta)=\underset{q(\theta)\in\mathcal{Q}}{\mbox{argmin }}\mbox{KL}(q(\theta)||\pi(\theta|D)). \qquad (3)

The optimization (3) is equivalent to minimizing the negative ELBO (evidence lower bound), which is defined as

\Omega=-\mathbb{E}_{q(\theta)}[\log p_{\theta}(D)]+\mbox{KL}(q(\theta)||\pi(\theta)), \qquad (4)

where the first term in (4) can be viewed as the reconstruction error (Kingma and Welling, 2014) and the second term serves as regularization. Hence, the variational inference procedure minimizes the reconstruction error while being penalized towards the prior distribution in the sense of KL divergence.

When the variational family is indexed by a hyperparameter $\omega$, i.e., any $q\in\mathcal{Q}$ can be written as $q_{\omega}(\theta)$, the negative ELBO becomes a function $\Omega(\omega)$ of $\omega$. The KL divergence term in (4) can usually be integrated analytically, while the reconstruction error requires Monte Carlo estimation. Therefore, the optimization of $\Omega(\omega)$ can utilize the stochastic gradient approach (Kingma and Welling, 2014). Concretely, if every distribution in $\mathcal{Q}$ can be reparameterized as $q_{\omega}\stackrel{d}{=}g(\omega,\nu)$ for some differentiable function $g$ and random variable $\nu$ (here "$\stackrel{d}{=}$" means equality in distribution), then the stochastic estimator of $\Omega(\omega)$ and its gradient are

\begin{split}
&\widetilde{\Omega}^{m}(\omega)=-\frac{n}{m}\frac{1}{K}\sum^{m}_{i=1}\sum^{K}_{k=1}\log p_{g(\omega,\nu_{k})}(D_{i})+\mbox{KL}(q_{\omega}(\theta)||\pi(\theta)),\\
&\nabla_{\omega}\widetilde{\Omega}^{m}(\omega)=-\frac{n}{m}\frac{1}{K}\sum^{m}_{i=1}\sum^{K}_{k=1}\nabla_{\omega}\log p_{g(\omega,\nu_{k})}(D_{i})+\nabla_{\omega}\mbox{KL}(q_{\omega}(\theta)||\pi(\theta)),
\end{split} \qquad (5)

where the $D_{i}$'s are randomly sampled data points and the $\nu_{k}$'s are iid copies of $\nu$. Here, $m$ and $K$ are the minibatch size and the Monte Carlo sample size, respectively.
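As a toy illustration of (5) (not the sparse model of this paper), the following PyTorch sketch fits a mean-field Gaussian variational approximation for a one-parameter Gaussian-mean model; all names and the synthetic data are our own assumptions.

import torch
import torch.nn.functional as F

# Toy instance of (5): y_i ~ N(theta, 1) with prior theta ~ N(0, 1);
# omega = (mu, rho) and the reparameterization is theta = mu + softplus(rho) * nu.
torch.manual_seed(0)
y = 2.0 + torch.randn(1000)                  # synthetic data, true theta = 2
mu = torch.zeros(1, requires_grad=True)
rho = torch.zeros(1, requires_grad=True)
opt = torch.optim.Adam([mu, rho], lr=0.05)
n, m, K = len(y), 50, 1
for step in range(1000):
    idx = torch.randint(0, n, (m,))          # random minibatch D^m
    sigma = F.softplus(rho)
    theta = mu + sigma * torch.randn(K, 1)   # theta = g(omega, nu)
    log_lik = -0.5 * (y[idx] - theta) ** 2   # log p_theta(D_i) up to an additive constant
    kl = torch.log(1.0 / sigma) + (sigma ** 2 + mu ** 2) / 2 - 0.5   # KL(q_omega || N(0,1))
    loss = -(n / m) * log_lik.sum() / K + kl  # stochastic negative ELBO (5)
    opt.zero_grad(); loss.backward(); opt.step()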

3 Sparse Bayesian deep learning with spike-and-slab prior

We aim to approximate $f_{0}$ in the generative model (1) by a sparse neural network. Specifically, given a network structure, i.e., the depth $L$ and the width $\boldsymbol{p}$, $f_{0}$ is approximated by DNN models $f_{\theta}$ with sparse parameter vector $\theta\in\Theta=\mathbb{R}^{T}$. From a Bayesian perspective, we impose a spike-and-slab prior (George and McCulloch, 1993; Ishwaran and Rao, 2005) on $\theta$ to model sparse DNNs.

A spike-and-slab distribution is a mixture of two components: a Dirac spike concentrated at zero and a flat slab distribution. Denote by $\delta_{0}$ the Dirac measure at 0 and by $\gamma=(\gamma_{1},\ldots,\gamma_{T})$ a binary vector indicating the inclusion of each edge in the network. The prior distribution $\pi(\theta)$ is then specified as

\theta_{i}|\gamma_{i}\sim\gamma_{i}\mathcal{N}(0,\sigma^{2}_{0})+(1-\gamma_{i})\delta_{0},\quad\gamma_{i}\sim\mbox{Bern}(\lambda), \qquad (6)

for $i=1,\ldots,T$, where $\lambda$ and $\sigma^{2}_{0}$ are hyperparameters representing the prior inclusion probability and the prior Gaussian variance, respectively. The choices of $\sigma_{0}^{2}$ and $\lambda$ play an important role in sparse Bayesian learning, and in Section 4 we establish theoretical guarantees for the variational inference procedure under proper deterministic choices of $\sigma_{0}^{2}$ and $\lambda$. Alternatively, the hyperparameters may be chosen via an empirical Bayesian (EB) procedure, but this is beyond the scope of this work. We assume $\mathcal{Q}$ is in the same family of spike-and-slab laws:

\theta_{i}|\gamma_{i}\sim\gamma_{i}\mathcal{N}(\mu_{i},\sigma^{2}_{i})+(1-\gamma_{i})\delta_{0},\quad\gamma_{i}\sim\mbox{Bern}(\phi_{i}) \qquad (7)

for $i=1,\ldots,T$, where $0\leq\phi_{i}\leq 1$.
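For intuition, a minimal NumPy sketch of drawing one $\theta$ from (7), vectorized over $i=1,\dots,T$ (the function name is ours), is given below; note that every draw is a genuinely sparse parameter vector.

import numpy as np

def sample_theta(mu, sigma, phi, rng=None):
    """Draw one theta from the spike-and-slab variational family (7):
    coordinate i is N(mu_i, sigma_i^2) with probability phi_i and exactly 0 otherwise."""
    if rng is None:
        rng = np.random.default_rng()
    gamma = rng.random(len(phi)) < phi                  # gamma_i ~ Bern(phi_i)
    return np.where(gamma, mu + sigma * rng.standard_normal(len(mu)), 0.0)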

Compared with pruning approaches (e.g., Zhu and Gupta, 2018; Frankle and Carbin, 2018; Molchanov et al., 2017) that do not pursue sparsity among the bias parameters $b_{i}$, our Bayesian modeling induces posterior sparsity for both weight and bias parameters.

In the literature, Polson and Rockova (2018) and Chérief-Abdellatif (2020) imposed the sparsity specification $\Theta(L,\boldsymbol{p},s)=\{\theta\mbox{ as in model (2)}:\|\theta\|_{0}\leq s\}$, which not only poses great computational challenges but also requires tuning of the optimal sparsity level $s$. For example, Chérief-Abdellatif (2020) shows that, given $s$, two error terms arise in variational DNN inference: 1) the variational error $r_{n}(L,\boldsymbol{p},s)$ caused by the variational Bayes approximation to the true posterior distribution, and 2) the approximation error $\xi_{n}(L,\boldsymbol{p},s)$ between $f_{0}$ and the best bounded-weight $s$-sparse DNN approximation of $f_{0}$. Both error terms $r_{n}$ and $\xi_{n}$ depend on $s$ (their specific forms are given in the next section). Generally speaking, as the model capacity (i.e., $s$) increases, $r_{n}$ increases and $\xi_{n}$ decreases. Hence the optimal choice $s^{*}$ that strikes the balance between the two is

s^{*}=\underset{s}{\mbox{argmin }}\{r_{n}(L,\boldsymbol{p},s)+\xi_{n}(L,\boldsymbol{p},s)\}.

Therefore, one would need to develop a selection criterion for $\widehat{s}$ such that $\widehat{s}\approx s^{*}$. In contrast, our modeling works directly on the whole sparsity regime without pre-specifying $s$, and is shown later to automatically attain the same rate of convergence as if the optimal $s^{*}$ were known.

4 Theoretical results

In this section, we establish the contraction rate of the variational sparse DNN procedure without knowing $s^{*}$. For simplicity, we only consider equal-width neural networks (similar to Polson and Rockova, 2018).

The following assumptions are imposed:

Condition 4.1.

$p_{i}\equiv N\in\mathbb{Z}^{+}$, where $N$ can depend on $n$, and $T\rightarrow\infty$ as $n\rightarrow\infty$.

Condition 4.2.

$\sigma(x)$ is 1-Lipschitz continuous.

Condition 4.3.

The hyperparameter $\sigma_{0}^{2}$ is set to some constant, and $\lambda$ satisfies $\log(1/\lambda)=O\{(L+1)\log N+\log(p\sqrt{n/s^{*}})\}$ and $\log(1/(1-\lambda))=O\bigl((s^{*}/T)\{(L+1)\log N+\log(p\sqrt{n/s^{*}})\}\bigr)$.

Condition 4.2 is very mild and is satisfied by ReLU, sigmoid and tanh. Note that Condition 4.3 allows a wide range of choices of $\lambda$, including choices of $\lambda$ that are independent of $s^{*}$ (see Theorem 4.1 below).

We first state a lemma giving an upper bound for the negative ELBO. Denote the log-likelihood ratio between $p_{0}$ and $p_{\theta}$ as $l_{n}(P_{0},P_{\theta})=\log(p_{0}(D)/p_{\theta}(D))=\sum^{n}_{i=1}\log(p_{0}(D_{i})/p_{\theta}(D_{i}))$. Given some constant $B>0$, we define

\begin{split}
r^{*}_{n}&:=r_{n}(L,N,s^{*})=((L+1)s^{*}/n)\log N+(s^{*}/n)\log(p\sqrt{n/s^{*}}),\\
\xi^{*}_{n}&:=\xi_{n}(L,N,s^{*})=\inf_{\theta\in\Theta(L,\boldsymbol{p},s^{*}),\,\|\theta\|_{\infty}\leq B}\|f_{\theta}-f_{0}\|^{2}_{\infty}.
\end{split}

Recall that $r_{n}(L,N,s)$ and $\xi_{n}(L,N,s)$ denote the variational error and the approximation error, respectively.

Lemma 4.1.

Under Conditions 4.1-4.3, with dominating probability,

\inf_{q(\theta)\in\mathcal{Q}}\Bigl\{\mbox{KL}(q(\theta)||\pi(\theta|\lambda))+\int_{\Theta}l_{n}(P_{0},P_{\theta})q(\theta)d\theta\Bigr\}\leq Cn(r^{*}_{n}+\xi^{*}_{n}) \qquad (8)

where $C$ is either some positive constant if $\lim n(r_{n}^{*}+\xi^{*}_{n})=\infty$, or any diverging sequence if $\limsup n(r_{n}^{*}+\xi^{*}_{n})\neq\infty$.

Noting that $\mbox{KL}(q(\theta)||\pi(\theta|\lambda))+\int_{\Theta}l_{n}(P_{0},P_{\theta})q(\theta)(d\theta)$ is the negative ELBO up to a constant, this lemma shows that the optimal loss value of the proposed variational inference is bounded.

The next lemma investigates the convergence of the variational distribution under the Hellinger distance, which is defined as

d^{2}(P_{\theta},P_{0})=\mathbb{E}_{X}\Bigl(1-\exp\{-[f_{\theta}(X)-f_{0}(X)]^{2}/(8\sigma^{2}_{\epsilon})\}\Bigr).

In addition, let $s_{n}=s^{*}\log^{2\delta-1}(n)$ for any $\delta>1$. An assumption on $s^{*}$ is required to strike the balance between $r_{n}^{*}$ and $\xi_{n}^{*}$:

Condition 4.4.

$\max\{s^{*}\log(p\sqrt{n/s^{*}}),(L+1)s^{*}\log N\}=o(n)$ and $r_{n}^{*}\asymp\xi_{n}^{*}$.

Lemma 4.2.

Under Conditions 4.1-4.4, if $\sigma_{0}^{2}$ is set to be constant and $\lambda\leq T^{-1}\exp\{-Mnr_{n}^{*}/s_{n}\}$ for any positive diverging sequence $M\rightarrow\infty$, then with dominating probability we have

\int_{\Theta}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq C\varepsilon^{*2}_{n}+\frac{3}{n}\inf_{q(\theta)\in\mathcal{Q}}\Bigl\{\mbox{KL}(q(\theta)||\pi(\theta|\lambda))+\int_{\Theta}l_{n}(P_{0},P_{\theta})q(\theta)d\theta\Bigr\}, \qquad (9)

where $C$ is some constant, and

\varepsilon^{*}_{n}:=\varepsilon_{n}(L,N,s^{*})=\sqrt{r_{n}(L,N,s^{*})}\log^{\delta}(n),\quad\mbox{for any }\delta>1.
Remark.

The result (9) has exactly the same form as in the existing literature (Pati et al., 2018), but it is not trivial in the following sense. The existing literature requires the existence of a global testing function that separates $P_{0}$ and $\{P_{\theta}:d(P_{\theta},P_{0})\geq\varepsilon_{n}^{*}\}$ with exponentially decaying Type I and Type II errors. If such a testing function exists only over a subset $\Theta_{n}\subset\Theta$ (which is the case for our DNN modeling), then the existing result (Yang et al., 2020) can only characterize the VB posterior contraction behavior within $\Theta_{n}$, not over the whole parameter space $\Theta$. Our result, which characterizes the convergence behavior of the overall VB posterior, therefore represents a significant improvement over those works.

The above two lemmas together imply the following guarantee for the VB posterior:

Theorem 4.1.

Let $\sigma_{0}^{2}$ be a constant and $-\log\lambda=\log(T)+\delta[(L+1)\log N+\log(\sqrt{n}p)]$ for any constant $\delta>0$. Under Conditions 4.1, 4.2 and 4.4, with high probability we have

\int_{\Theta}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq C\varepsilon^{*2}_{n}+C^{\prime}(r_{n}^{*}+\xi^{*}_{n}),

where $C$ is some positive constant and $C^{\prime}$ is any diverging sequence.

Here $\varepsilon_{n}^{*2}$ represents the statistical estimation error for $P_{0}$. The variational Bayes convergence rate thus consists of the estimation error $\varepsilon_{n}^{*2}$, the variational error $r_{n}^{*}$, and the approximation error $\xi_{n}^{*}$. Since the first two differ only by a logarithmic factor, our convergence rate strikes the balance among all three error terms. The derived rate has an explicit expression in terms of the network structure, based on the forms of $\varepsilon_{n}^{*}$, $r_{n}^{*}$ and $\xi_{n}^{*}$, in contrast with the general convergence results in Pati et al. (2018), Zhang and Gao (2019) and Yang et al. (2020).

Remark.

Theorem 4.1 provides a specific choice of $\lambda$, which can be relaxed to the general conditions on $\lambda$ in Lemma 4.2. In contrast to heuristic choices such as $\lambda=\exp(-2\log n)$ (BIC; Hubin and Storvik, 2019), this theoretically justified choice incorporates knowledge of the input dimension, network structure and sample size. Such a $\lambda$ is used in our numerical experiments in Section 6, but readers should be aware that its theoretical validity is justified only in an asymptotic sense.

Remark.

The convergence rate is derived under the Hellinger metric, which is of less practical relevance than the $L_{2}$ norm representing the common prediction error. One may obtain a convergence result under the $L_{2}$ norm via a VB truncation (refer to the supplementary material, Theorem A.1).

Remark.

If $f_{0}$ is an $\alpha$-Hölder-smooth function with fixed input dimension $p$, then by choosing $L\asymp\log n$ and $N\asymp n/\log n$, and combining with the approximation result (Schmidt-Hieber, 2017, Theorem 3), our theorem ensures rate-minimax convergence up to a logarithmic term.

5 Implementation

To conduct the optimization of (4) via stochastic gradient methods, we need a reparameterization for the distributions in $\mathcal{Q}$. One solution is the inverse-CDF sampling technique. Specifically, if $\theta\sim q\in\mathcal{Q}$, its marginals $\theta_{i}$ are independent mixtures of the form (7). Let $F_{(\mu_{i},\sigma_{i},\phi_{i})}$ be the CDF of $\theta_{i}$; then $\theta_{i}\stackrel{d}{=}F^{-1}_{(\mu_{i},\sigma_{i},\phi_{i})}(u_{i})$ where $u_{i}\sim\mathcal{U}(0,1)$. This inverse-CDF reparameterization, although valid, cannot be conveniently implemented within state-of-the-art Python packages such as PyTorch. A more popular alternative in VB is the Gumbel-softmax approximation.
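For a single coordinate of (7), a generalized inverse-CDF sampler can be sketched as follows (a SciPy-based sketch with our own function name; the CDF jumps by $1-\phi_i$ at zero, and any $u$ landing inside that jump maps to the spike).

import numpy as np
from scipy.stats import norm

def inverse_cdf_sample(mu, sigma, phi, u):
    """Generalized inverse CDF of phi*N(mu, sigma^2) + (1-phi)*delta_0 at u in (0,1).
    F(x) = phi*Phi((x-mu)/sigma) + (1-phi)*1{x >= 0}."""
    F0_minus = phi * norm.cdf(-mu / sigma)              # mass strictly below zero
    if F0_minus <= u <= F0_minus + (1.0 - phi):
        return 0.0                                      # u falls in the jump: spike at 0
    u_slab = u if u < F0_minus else u - (1.0 - phi)     # remove the point mass above zero
    return mu + sigma * norm.ppf(u_slab / phi)          # invert the Gaussian slab part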

We rewrite the loss function $\Omega$ as

-\mathbb{E}_{q(\theta|\gamma)q(\gamma)}[\log p_{\theta}(D)]+\sum^{T}_{i=1}\mbox{KL}(q(\gamma_{i})||\pi(\gamma_{i}))+\sum^{T}_{i=1}q(\gamma_{i}=1)\mbox{KL}(\mathcal{N}(\mu_{i},\sigma^{2}_{i})||\mathcal{N}(0,\sigma^{2}_{0})). \qquad (10)
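Both KL terms in (10) have closed forms (Bernoulli versus Bernoulli, Gaussian versus Gaussian); a minimal PyTorch sketch of the penalty, with our own function name and a small numerical clamp, is:

import torch

def kl_penalty(mu, sigma, phi, lam, sigma0):
    """Closed-form KL terms in (10): sum_i KL(Bern(phi_i)||Bern(lam))
    plus sum_i phi_i * KL(N(mu_i, sigma_i^2)||N(0, sigma0^2))."""
    phi = phi.clamp(1e-6, 1 - 1e-6)
    kl_bern = phi * torch.log(phi / lam) + (1 - phi) * torch.log((1 - phi) / (1 - lam))
    kl_gauss = torch.log(sigma0 / sigma) + (sigma ** 2 + mu ** 2) / (2 * sigma0 ** 2) - 0.5
    return (kl_bern + phi * kl_gauss).sum()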

Since it is impossible to reparameterize the discrete variable $\gamma$ by a continuous system, we apply continuous relaxation, i.e., we approximate $\gamma$ by a continuous distribution. In particular, the Gumbel-softmax approximation (Maddison et al., 2017; Jang et al., 2017) is used here, and $\gamma_{i}\sim\mbox{Bern}(\phi_{i})$ is approximated by $\widetilde{\gamma}_{i}\sim\mbox{Gumbel-softmax}(\phi_{i},\tau)$, where

\widetilde{\gamma}_{i}=(1+\exp(-\eta_{i}/\tau))^{-1},\quad\eta_{i}=\log\frac{\phi_{i}}{1-\phi_{i}}+\log\frac{u_{i}}{1-u_{i}},\quad u_{i}\sim\mathcal{U}(0,1).

Here $\tau$ is called the temperature; as it approaches 0, $\widetilde{\gamma}_{i}$ converges to $\gamma_{i}$ in distribution (refer to Figure 4 in the supplementary material). In addition, one can show that $P(\widetilde{\gamma}_{i}>0.5)=\phi_{i}$, which implies $\gamma_{i}\stackrel{d}{=}1(\widetilde{\gamma}_{i}>0.5)$. Thus, $\widetilde{\gamma}_{i}$ can be viewed as a soft version of $\gamma_{i}$; it is used in the backward pass to enable gradient calculation, while the hard version $\gamma_{i}$ is used in the forward pass to obtain a sparse network structure. In practice, $\tau$ is usually chosen no smaller than 0.5 for numerical stability. In addition, the normal variable $\mathcal{N}(\mu_{i},\sigma^{2}_{i})$ is reparameterized as $\mu_{i}+\sigma_{i}\epsilon_{i}$ with $\epsilon_{i}\sim\mathcal{N}(0,1)$.
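One way to realize this hard-forward/soft-backward scheme within PyTorch autograd is a straight-through version of the Gumbel-softmax sample; the sketch below (function name and the straight-through trick are our implementation choice) returns the hard $\gamma_i$ as the forward value while gradients flow through $\widetilde{\gamma}_i$.

import torch

def sample_gamma(phi, tau=0.5):
    """Gumbel-softmax relaxation of gamma_i ~ Bern(phi_i) with straight-through output."""
    u = torch.rand_like(phi)
    eta = torch.log(phi / (1 - phi)) + torch.log(u / (1 - u))   # logistic noise
    gamma_tilde = torch.sigmoid(eta / tau)                      # soft gamma (backward pass)
    gamma_hard = (gamma_tilde > 0.5).float()                    # hard gamma = 1(gamma_tilde > 0.5)
    # forward value equals gamma_hard; gradient equals that of gamma_tilde
    return gamma_hard + gamma_tilde - gamma_tilde.detach()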

The complete variational inference procedure with Gumbel-softmax approximation is stated below.

Algorithm 1 Variational inference for sparse BNN with normal slab distribution.
1: parameters: $\omega=(\mu,\sigma^{\prime},\phi^{\prime})$,
2: where $\sigma_{i}=\log(1+\exp(\sigma^{\prime}_{i}))$, $\phi_{i}=(1+\exp(\phi^{\prime}_{i}))^{-1}$, for $i=1,\ldots,T$
3: repeat
4:     $D^{m}$ ← Randomly draw a minibatch of size $m$ from $D$
5:     $\epsilon_{i},u_{i}$ ← Randomly draw $K$ samples from $\mathcal{N}(0,1)$ and $\mathcal{U}(0,1)$
6:     $\widetilde{\Omega}^{m}(\omega)$ ← Use (5) with ($D^{m}$, $\omega$, $\epsilon$, $u$); use $\gamma$ in the forward pass
7:     $\nabla_{\omega}\widetilde{\Omega}^{m}(\omega)$ ← Use (5) with ($D^{m}$, $\omega$, $\epsilon$, $u$); use $\widetilde{\gamma}$ in the backward pass
8:     $\omega$ ← Update with $\nabla_{\omega}\widetilde{\Omega}^{m}(\omega)$ using a gradient descent algorithm (e.g., SGD or Adam)
9: until convergence of $\widetilde{\Omega}^{m}(\omega)$
10: return $\omega$

The Gumbel-softmax approximation introduces an additional error that may jeopardize the validity of Theorem 4.1. Our exploratory studies (refer to Section B.3 in the supplementary material) show little difference between the results of the inverse-CDF reparameterization and the Gumbel-softmax approximation in a simple model. We therefore conjecture that the Gumbel-softmax approximation does not hurt the VB convergence, and implement it in our numerical studies.
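Putting the pieces together, the following is a minimal PyTorch sketch of Algorithm 1 for the regression model (1). It is an illustration only: the layer and function names are ours, it uses the standard $Wx+b$ convention rather than the $\sigma_b(\cdot)$ parameterization in (2), a single Monte Carlo sample ($K=1$), and the straight-through Gumbel-softmax trick from above; in practice $\lambda$ would be set to $\lambda_{opt}$ and $\sigma_0^2=2$ as in Section 6.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpikeSlabLinear(nn.Module):
    """Fully connected layer with spike-and-slab variational parameters
    omega = (mu, sigma', phi') on every weight and bias, following Algorithm 1."""

    def __init__(self, d_in, d_out, sigma0_sq=2.0, lam=1e-5, tau=0.5):
        super().__init__()
        self.mu = nn.Parameter(0.1 * torch.randn(d_out, d_in + 1))         # slab means (last column: bias)
        self.sigma_raw = nn.Parameter(-3.0 * torch.ones(d_out, d_in + 1))  # sigma' in Algorithm 1
        self.phi_raw = nn.Parameter(torch.zeros(d_out, d_in + 1))          # phi' in Algorithm 1
        self.sigma0_sq, self.lam, self.tau = sigma0_sq, lam, tau

    def _sigma_phi(self):
        sigma = torch.log1p(torch.exp(self.sigma_raw))                 # sigma_i = log(1 + exp(sigma'_i))
        phi = torch.sigmoid(-self.phi_raw).clamp(1e-6, 1 - 1e-6)       # phi_i = (1 + exp(phi'_i))^{-1}
        return sigma, phi

    def forward(self, x):
        sigma, phi = self._sigma_phi()
        u = torch.rand_like(phi)
        eta = torch.log(phi / (1 - phi)) + torch.log(u / (1 - u))
        g_soft = torch.sigmoid(eta / self.tau)                         # soft gamma (backward pass)
        g = (g_soft > 0.5).float() + g_soft - g_soft.detach()          # hard gamma (forward pass)
        theta = g * (self.mu + sigma * torch.randn_like(sigma))        # exact-sparse sample of theta
        return F.linear(x, theta[:, :-1], theta[:, -1])

    def kl(self):
        sigma, phi = self._sigma_phi()
        kl_bern = phi * torch.log(phi / self.lam) \
                + (1 - phi) * torch.log((1 - phi) / (1 - self.lam))
        kl_gauss = 0.5 * torch.log(self.sigma0_sq / sigma ** 2) \
                 + (sigma ** 2 + self.mu ** 2) / (2 * self.sigma0_sq) - 0.5
        return (kl_bern + phi * kl_gauss).sum()

def negative_elbo(net, x, y, n, sigma_eps_sq=1.0):
    """Stochastic negative ELBO (5) for the regression model (1) on a minibatch (x, y)."""
    recon = 0.5 * ((y - net(x).squeeze(-1)) ** 2).sum() / sigma_eps_sq
    kl = sum(m.kl() for m in net.modules() if isinstance(m, SpikeSlabLinear))
    return (n / x.shape[0]) * recon + kl

# Usage sketch: a 20-6-6-1 network as in Simulation I(A); x_batch, y_batch and n
# would come from the training data.
net = nn.Sequential(SpikeSlabLinear(20, 6), nn.Sigmoid(),
                    SpikeSlabLinear(6, 6), nn.Sigmoid(),
                    SpikeSlabLinear(6, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
# loss = negative_elbo(net, x_batch, y_batch, n); opt.zero_grad(); loss.backward(); opt.step()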

6 Experiments

We evaluate the empirical performance of the proposed variational inference through simulation studies and an application to MNIST data. For the simulation studies, we consider a teacher-student framework and a nonlinear regression function, by which we justify the consistency of the proposed method and validate the proposed choice of hyperparameters. As a byproduct, the performance of uncertainty quantification and the effectiveness of variable selection are examined as well.

For all numerical studies, we let $\sigma^{2}_{0}=2$, and the choice of $\lambda$ follows Theorem 4.1 (denoted by $\lambda_{opt}$): $\log(\lambda^{-1}_{opt})=\log(T)+0.1[(L+1)\log N+\log(\sqrt{n}p)]$. The remaining implementation details (such as initialization and the choices of $K$, $m$ and the learning rate) are provided in the supplementary material. We use the VB posterior mean estimator $\widehat{f}_{H}=\sum^{H}_{h=1}f_{\theta_{h}}/H$ to assess prediction accuracy, where $\theta_{h}\sim\widehat{q}(\theta)$ are samples drawn from the VB posterior and $H=30$. The posterior network sparsity is measured by $\widehat{s}=\sum^{T}_{i=1}\phi_{i}/T$. Input nodes that have a connection with $\phi_{i}>0.5$ to the next layer are selected as relevant input variables, and we report the corresponding false positive rate (FPR) and false negative rate (FNR) to evaluate the variable selection performance of our method.
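These quantities translate directly into code; a small sketch (reusing the SpikeSlabLinear layer from the previous section, with our own helper names) is:

import math
import torch

def lambda_opt(T, L, N, n, p, delta=0.1):
    """lambda_opt from Theorem 4.1 with delta = 0.1:
    log(1/lambda) = log T + delta * [(L+1) log N + log(sqrt(n) * p)]."""
    return math.exp(-(math.log(T) + delta * ((L + 1) * math.log(N)
                                             + math.log(math.sqrt(n) * p))))

def posterior_mean_prediction(net, x, H=30):
    """VB posterior-mean estimator f_hat_H: average H stochastic (sparse) forward passes."""
    with torch.no_grad():
        return torch.stack([net(x) for _ in range(H)]).mean(dim=0)

def network_sparsity(net):
    """Posterior sparsity s_hat = (1/T) * sum_i phi_i over all SpikeSlabLinear layers."""
    phis = [m._sigma_phi()[1] for m in net.modules() if isinstance(m, SpikeSlabLinear)]
    return torch.cat([p.flatten() for p in phis]).mean().item()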

Our method is compared with the dense variational BNN (VBNN) (Blundell et al., 2015) with independent centered normal priors and independent normal variational distributions, the AGP pruner (Zhu and Gupta, 2018), the Lottery Ticket Hypothesis (LOT) (Frankle and Carbin, 2018), variational dropout (VD) (Molchanov et al., 2017) and the Horseshoe BNN (HS-BNN) (Ghosh et al., 2018). In particular, VBNN can be regarded as a baseline method without any sparsification or compression. All reported simulation results are based on 30 replications (except that we use 60 replications for interval estimation coverages). Note that the sparsity levels in AGP and LOT are user-specified; hence, in the simulation studies, we run a grid search for AGP and LOT and report only the settings that yield the highest testing accuracy. Furthermore, FPR and FNR are not calculated for HS-BNN since it only sparsifies the hidden layers node-wise.

Simulation I: Teacher-student networks setup

We consider two teacher network settings for $f_{0}$: (A) densely connected with a structure of 20-6-6-1, $p=20$, $n=3000$, $\sigma(x)=\mbox{sigmoid}(x)$, $X\sim\mathcal{U}([-1,1]^{20})$, $\epsilon\sim\mathcal{N}(0,1)$, and network parameters $\theta_{i}$ randomly sampled from $\mathcal{U}(0,1)$; (B) sparsely connected as shown in Figure 1(c), $p=100$, $n=500$, $\sigma(x)=\mbox{tanh}(x)$, $X\sim\mathcal{U}([-1,1]^{100})$, $\epsilon\sim\mathcal{N}(0,1)$, with the network parameters $\theta_{i}$ fixed (refer to the supplementary material for details).

Figure 1: (a) $\lambda\leq\lambda_{opt}$: $\lambda=\{10^{-200},10^{-150},10^{-100},10^{-50},10^{-20},10^{-5},\lambda_{opt}\}$. (b) $\lambda\geq\lambda_{opt}$: $\lambda=\{\lambda_{opt},0.1,0.3,0.5,0.7,0.9,0.99\}$. (c) The structure of the target sparse teacher network. Note that the $x$ axes of panels (a) and (b) are on different scales.
Table 1: Simulation results for Simulation I. SVBNN denotes our sparse variational BNN method. The sparsity levels specified for AGP are 30% and 5%, and for LOT 10% and 5%, respectively for the two cases.

Teacher  Method   Train RMSE    Test RMSE     FPR(%)        FNR(%)        95% Coverage(%)  Sparsity(%)
Dense    SVBNN    1.01 ± 0.02   1.01 ± 0.00   -             -             97.5 ± 1.71      6.45 ± 0.83
         VBNN     1.00 ± 0.02   1.00 ± 0.00   -             -             91.4 ± 3.89      100 ± 0.00
         VD       0.99 ± 0.02   1.01 ± 0.00   -             -             76.4 ± 4.75      28.6 ± 2.81
         HS-BNN   0.98 ± 0.02   1.02 ± 0.01   -             -             83.5 ± 0.78      64.9 ± 24.9
         AGP      0.99 ± 0.02   1.01 ± 0.00   -             -             -                30.0 ± 0.00
         LOT      1.04 ± 0.01   1.02 ± 0.00   -             -             -                10.0 ± 0.00
Sparse   SVBNN    0.99 ± 0.03   1.00 ± 0.01   0.00 ± 0.00   0.00 ± 0.00   96.4 ± 4.73      2.15 ± 0.25
         VBNN     0.92 ± 0.05   1.53 ± 0.17   100 ± 0.00    0.00 ± 0.00   90.7 ± 8.15      100 ± 0.00
         VD       0.86 ± 0.04   1.07 ± 0.03   72.9 ± 6.99   0.00 ± 0.00   75.5 ± 7.81      20.8 ± 3.08
         HS-BNN   0.90 ± 0.04   1.29 ± 0.04   -             -             67.0 ± 8.54      32.1 ± 20.1
         AGP      1.01 ± 0.03   1.02 ± 0.00   16.9 ± 1.81   0.00 ± 0.00   -                5.00 ± 0.00
         LOT      0.96 ± 0.01   1.04 ± 0.01   19.5 ± 2.57   0.00 ± 0.00   -                5.00 ± 0.00

First, we examine the impact of different choices of $\lambda$ on our VB sparse DNN modeling. A set of different $\lambda$ values is used, and for each $\lambda$ we compute the training root mean squared error (RMSE) and testing RMSE based on $\widehat{f}_{H}$. Results for simulation setting (B) are plotted in Figure 1 with error bars (refer to Section B.4 in the supplementary material for the plot under setting (A)). The figure shows that as $\lambda$ increases, the resulting network becomes denser and the training RMSE decreases monotonically, while the testing RMSE curve is roughly U-shaped. In other words, an overly small $\lambda$ leads to over-sparsified DNNs with insufficient expressive power, while an overly large $\lambda$ leads to overfitted DNNs. The suggested $\lambda_{opt}$ lies in the valley of the U-shaped testing curve, which empirically justifies our theoretical choice of $\lambda_{opt}$.

We next compare the performance of our method (with $\lambda_{opt}$) to the benchmark methods; results are presented in Table 1. For the dense teacher network (A), our method yields the sparsest structure with comparable prediction error; for the sparse teacher network (B), our method not only achieves the best prediction accuracy but also always selects the correct set of relevant input variables. We also examine uncertainty quantification by studying the coverage of 95% Bayesian predictive intervals (refer to the supplementary material for details). Table 1 shows that our method attains coverage rates slightly above the nominal level, while the other (Bayesian) methods suffer from undercoverage.

Simulation II: Sparse nonlinear function

Consider the following sparse function $f_{0}$:

f_{0}(x_{1},\dots,x_{200})=\frac{7x_{2}}{1+x^{2}_{1}}+5\sin(x_{3}x_{4})+2x_{5},\quad\epsilon\sim\mathcal{N}(0,1), \qquad (11)

where all covariates are iid $\mathcal{N}(0,1)$ and the dataset contains $n=3000$ observations. A ReLU network with $L=3$ and $N=7$ is used. As in Simulation I, we study the impact of $\lambda$, and the results in Figure 2 justify that $\lambda_{opt}$ is a reasonable choice. Table 2 compares the performance of our method (under $\lambda_{opt}$) with the competing methods. Our method exhibits the best prediction accuracy with minimal connectivity among all methods. In addition, it achieves the smallest FPR and an acceptable FNR for input variable selection, whereas the other methods select a huge number of false input variables. Figure 2(c) shows the selected network (edges with $\phi_{i}>0.5$) in one replication, which correctly identifies the relevant input variables.
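For reference, a minimal NumPy sketch for generating the training data of this simulation (function name and seed are our own choices):

import numpy as np

def simulate_sparse_regression(n=3000, p=200, seed=0):
    """Generate (X, y) from the sparse nonlinear model (11): only the first five
    covariates x_1, ..., x_5 (columns 0-4) are relevant; covariates and noise are iid N(0, 1)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    f0 = 7 * X[:, 1] / (1 + X[:, 0] ** 2) + 5 * np.sin(X[:, 2] * X[:, 3]) + 2 * X[:, 4]
    y = f0 + rng.standard_normal(n)
    return X, y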

Figure 2: (a) $\lambda\leq\lambda_{opt}$: $\lambda=\{10^{-200},10^{-150},10^{-100},10^{-50},10^{-20},10^{-5},\lambda_{opt}\}$. (b) $\lambda\geq\lambda_{opt}$: $\lambda=\{\lambda_{opt},0.1,0.3,0.5,0.7,0.9,0.99\}$. (c) A selected network structure for (11).
Table 2: Results for Simulation II. The sparsity levels selected for AGP and LOT are both 30%.

Method   Train RMSE    Test RMSE     FPR(%)        FNR(%)        Sparsity(%)
SVBNN    1.19 ± 0.05   1.21 ± 0.05   0.00 ± 0.21   16.0 ± 8.14   2.97 ± 0.48
VBNN     0.96 ± 0.06   1.99 ± 0.49   100 ± 0.00    0.00 ± 0.00   100 ± 0.00
VD       1.02 ± 0.05   1.43 ± 0.19   98.6 ± 1.22   0.67 ± 3.65   46.9 ± 4.72
HS-BNN   1.17 ± 0.52   1.66 ± 0.43   -             -             41.1 ± 36.5
AGP      1.06 ± 0.08   1.58 ± 0.11   82.7 ± 3.09   1.33 ± 5.07   30.0 ± 0.00
LOT      1.08 ± 0.09   1.44 ± 0.14   83.6 ± 2.94   0.00 ± 0.00   30.0 ± 0.00
Figure 3: Testing accuracy for MNIST.

MNIST application.

We evaluate the performance of our method on the MNIST classification task, comparing with the benchmark methods. A 2-hidden-layer DNN with 512 neurons in each layer is used. We compare the testing accuracy of our method (with $\lambda_{opt}$) to the benchmark methods at different epochs using the same batch size (refer to the supplementary material for details). Figure 3 shows that our method achieves the best accuracy as the number of epochs increases; the final sparsity levels for SVBNN, AGP and VD are 5.06%, 5.00% and 2.28%, respectively.

In addition, an illustration of our method's capability for uncertainty quantification on MNIST can be found in the supplementary material, along with additional experimental results on UCI regression datasets.

7 Conclusion and discussion

We proposed a variational inference method for deep neural networks under spike-and-slab priors with theoretical guarantees. A future direction is to investigate the theory behind choosing hyperparameters via EB estimation instead of deterministic choices.

Furthermore, extending the current results to more complicated networks (convolutional neural networks, residual networks, etc.) is not trivial. Conceptually, it requires the design of structured sparsity (e.g., group sparsity in Neklyudov et al. (2017)) to fulfill the goal of faster prediction. Theoretically, it requires a deeper understanding of the expressive ability (i.e., approximation error) and capacity (i.e., packing or covering number) of the network model space. For illustration purposes, we include an example of a Fashion-MNIST task using a convolutional neural network in the supplementary material, which demonstrates the use of our method on more complex networks in practice.

Broader Impact

We believe the ethical aspects are not applicable to this work. Regarding future societal consequences, deep learning has a wide range of applications such as computer vision and natural language processing. Our work provides a solution to overcome drawbacks of modern deep neural networks, and also improves the understanding of deep learning.

The proposed method could improve existing applications. Specifically, sparse learning helps apply deep neural networks to hardware-limited devices, such as cell phones or tablets, which will broaden the range of deep learning applications. In addition, as a Bayesian method, it provides not only a result but also the confidence or certainty associated with that result, which could benefit people in various ways. For example, in cancer diagnosis, by providing the certainty associated with each possible outcome, Bayesian learning would assist medical professionals in making a better judgement about whether a tumor is malignant or benign. Such an ability to quantify uncertainty would contribute to modern deep learning.

Acknowledgments and Disclosure of Funding

We would like to thank Wei Deng for helpful discussions and the reviewers for their thoughtful comments. Qifan Song's research is partially supported by National Science Foundation grant DMS-1811812. Guang Cheng was a member of the Institute for Advanced Study while writing this paper, and he would like to thank the institute for its hospitality.

References

  • Blei et al., (2017) Blei, D., Kucukelbir, A., and McAuliffe, J. (2017). Variational inference: A review for statisticians. Journal of the American Statistical Association, 112:859–877.
  • Blundell et al., (2015) Blundell, C., Cornebise, J., Kavukcuoglu, K., et al. (2015). Weight uncertainty in neural networks. In Proceedings of the 32nd International Conference on International Conference on Machine Learning (ICML 15), pages 1613–1622, Lille, France.
  • Boucheron et al., (2013) Boucheron, S., Lugosi, G., and Massart, P. (2013). Concentration inequalities: A nonasymptotic theory of independence. Oxford University press.
  • Cheng et al., (2018) Cheng, Y., Wang, D., Zhou, P., et al. (2018). Model compression and acceleration for deep neural networks: The principles, progress, and challenges. IEEE Signal Processing Magazine, 35(1):126–136.
  • Chérief-Abdellatif, (2020) Chérief-Abdellatif, B.-E. (2020). Convergence rates of variational inference in sparse deep learning. In Proceedings of the 37th International Conference on Machine Learning (ICML 2020), Vienna, Austria.
  • Chérief-Abdellatif and Alquier, (2018) Chérief-Abdellatif, B.-E. and Alquier, P. (2018). Consistency of variational bayes inference for estimation and model selection in mixtures. Electronic Journal of Statistics, 12(2):2995–3035.
  • Deng et al., (2019) Deng, W., Zhang, X., Liang, F., and Lin, G. (2019). An adaptive empirical Bayesian method for sparse deep learning. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
  • Feng and Simon, (2017) Feng, J. and Simon, N. (2017). Sparse input neural networks for high-dimensional nonparametric regression and classification. arXiv preprint arXiv:1711.07592.
  • Frankle and Carbin, (2018) Frankle, J. and Carbin, M. (2018). The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.
  • Gale et al., (2019) Gale, T., Elsen, E., and Hooker, S. (2019). The state of sparsity in deep neural networks. arXiv preprint arXiv:1902.09574.
  • George and McCulloch, (1993) George, E. and McCulloch, R. (1993). Variable selection via gibbs sampling. Journal of the American Statistical Association, 88:881–889.
  • Ghosal and Van Der Vaart, (2007) Ghosal, S. and Van Der Vaart, A. (2007). Convergence rates of posterior distributions for noniid observations. The Annals of Statistics, 35(1):192–223.
  • Ghosh and Doshi-Velez, (2017) Ghosh, S. and Doshi-Velez, F. (2017). Model selection in Bayesian neural networks via horseshoe priors. arXiv preprint arXiv:1705.10388.
  • Ghosh et al., (2018) Ghosh, S., Yao, J., and Doshi-Velez, F. (2018). Structured variational learning of Bayesian neural networks with horseshoe priors. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), Stockholm, Sweden.
  • Goldt et al., (2019) Goldt, S., Advani, M. S., Saxe, A. M., Krzakala, F., and Zdeborová, L. (2019). Dynamics of stochastic gradient descent for two-layer neural networks in the teacher-student setup. In 33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.
  • Guo et al., (2018) Guo, Y., Zhang, C., Zhang, C., and Chen, Y. (2018). Sparse DNNs with improved adversarial robustness. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), pages 240–249, Montréal, Canada.
  • Han et al., (2016) Han, S., Mao, H., and Dally, W. (2016). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. In International Conference on Learning Representations (ICLR).
  • Hernández-Lobato and Adams, (2015) Hernández-Lobato, J. and Adams, R. (2015). Probabilistic backpropagation for scalable learning of bayesian neural networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML 2015), Lille, France.
  • Hubin and Storvik, (2019) Hubin, A. and Storvik, G. (2019). Combining model and parameter uncertainty in bayesian neural networks. arXiv:1903.07594.
  • Ishwaran and Rao, (2005) Ishwaran, H. and Rao, S. (2005). Spike and slab variable selection: Frequentist and bayesian strategies. Annals of Statistics, 33(2):730–773.
  • Jang et al., (2017) Jang, E., Gu, S., and Poole, B. (2017). Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations (ICLR 2017).
  • Jordan et al., (1999) Jordan, M., Ghahramani, Z., Jaakkola, T., et al. (1999). An introduction to variational methods for graphical models. Machine Learning.
  • Kingma and Welling, (2014) Kingma, D. and Welling, M. (2014). Auto-encoding variational Bayes. arXiv:1312.6114.
  • Le Cam, (1986) Le Cam, L. (1986). Asymptotic methods in statistical decision theory. Springer Science & Business Media, New York.
  • Liang et al., (2018) Liang, F., Li, Q., and Zhou, L. (2018). Bayesian neural networks for selection of drug sensitive genes. Journal of the American Statistical Association, 113:955–972.
  • Louizos et al., (2018) Louizos, C., Welling, M., and Kingma, D. P. (2018). Learning sparse neural networks through l0 regularization. In ICLR 2018.
  • MacKay, (1992) MacKay, D. (1992). A practical bayesian framework for backpropagation networks. Neural Computation.
  • Maddison et al., (2017) Maddison, C., Mnih, A., and Teh, Y. W. (2017). The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations (ICLR 2017).
  • Mocanu et al., (2018) Mocanu, D., Mocanu, E., Stone, P., et al. (2018). Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9:2383.
  • Molchanov et al., (2017) Molchanov, D., Ashukha, A., and Vetrov, D. (2017). Variational dropout sparsifies deep neural networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), pages 2498–2507.
  • Neal, (1992) Neal, R. (1992). Bayesian learning via stochastic dynamics. In Advances in Neural Information Processing Systems 5 (NIPS 1992), pages 475–482.
  • Neklyudov et al., (2017) Neklyudov, K., Molchanov, D., Ashukha, A., and Vetrov, D. (2017). Structured Bayesian pruning via log-normal multiplicative noise. In 31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA.
  • Pati et al., (2018) Pati, D., Bhattacharya, A., and Yang, Y. (2018). On the statistical optimality of variational bayes. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain.
  • Polson and Rockova, (2018) Polson, N. and Rockova, V. (2018). Posterior concentration for sparse deep learning. In 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), pages 930–941, Montréal, Canada.
  • Schmidt-Hieber, (2017) Schmidt-Hieber, J. (2017). Nonparametric regression using deep neural networks with ReLU activation function. arXiv:1708.06633v2.
  • Sønderby et al., (2016) Sønderby, C., Raiko, T., Maaløe, L., Sønderby, S., and Ole, W. (2016). How to train deep variational autoencoders and probabilistic ladder networks. In Proceedings of the 33rd International Conference on International Conference on Machine Learning (ICML 16), New York, NY.
  • Tian, (2018) Tian, Y. (2018). A theoretical framework for deep locally connected ReLU network. arXiv preprint arXiv:1809.10829.
  • Yang et al., (2020) Yang, Y., Pati, D., and Bhattacharya, A. (2020). $\alpha$-variational inference with statistical guarantees. Annals of Statistics, 48(2):886–905.
  • Ye and Sun, (2018) Ye, M. and Sun, Y. (2018). Variable selection via penalized neural network: a drop-out-one loss approach. In Proceedings of the 35th International Conference on International Conference on Machine Learning (ICML 18), pages 5620–5629, Stockholm, Sweden.
  • Zhang and Gao, (2019) Zhang, F. and Gao, C. (2019). Convergence rates of variational posterior distributions. arXiv preprint arXiv:1712.02519.
  • Zhu and Gupta, (2018) Zhu, M. and Gupta, S. (2018). To prune, or not to prune: Exploring the efficacy of pruning for model compression. In International Conference on Learning Representations (ICLR).

Supplementary Document to the Paper "Efficient Variational Inference for Sparse Deep Learning with Theoretical Guarantee"

In this document, detailed proofs of the theoretical results are provided in the first section, and additional numerical results are presented in the second section.

Appendix A Proofs of theoretical results

A.1 Proof of Lemma 4.1

As a technical tool for the proofs, we first restate Lemma 6.1 of Chérief-Abdellatif and Alquier, (2018) as follows.

Lemma A.1.

For any $K>0$, the KL divergence between any two mixture densities $\sum_{k=1}^{K}w_{k}g_{k}$ and $\sum_{k=1}^{K}\tilde{w}_{k}\tilde{g}_{k}$ is bounded as

\mbox{KL}\Bigl(\sum_{k=1}^{K}w_{k}g_{k}\,\Big|\Big|\,\sum_{k=1}^{K}\tilde{w}_{k}\tilde{g}_{k}\Bigr)\leq\mbox{KL}(\boldsymbol{w}||\tilde{\boldsymbol{w}})+\sum^{K}_{k=1}w_{k}\mbox{KL}(g_{k}||\tilde{g}_{k}),

where $\mbox{KL}(\boldsymbol{w}||\tilde{\boldsymbol{w}})=\sum_{k=1}^{K}w_{k}\log\frac{w_{k}}{\tilde{w}_{k}}$.

Proof of Lemma 4.1

Proof.

It suffices to construct some $q^{*}(\theta)\in\mathcal{Q}$ such that, w.h.p.,

\begin{split}
&\mbox{KL}(q^{*}(\theta)||\pi(\theta|\lambda))+\int_{\Theta}l_{n}(P_{0},P_{\theta})q^{*}(\theta)(d\theta)\\
\leq\ &C_{1}nr^{*}_{n}+C^{\prime}_{1}n\inf_{\theta}||f_{\theta}-f_{0}||^{2}_{\infty}+C^{\prime}_{1}nr^{*}_{n},
\end{split}

where $C_{1}$ and $C^{\prime}_{1}$ are some positive constants if $\lim n(r_{n}^{*}+\xi^{*}_{n})=\infty$, or any diverging sequences if $\limsup n(r_{n}^{*}+\xi^{*}_{n})\neq\infty$.

Recall that $\theta^{*}=\arg\min_{\theta\in\Theta(L,\boldsymbol{p},s^{*},B)}||f_{\theta}-f_{0}||^{2}_{\infty}$; then $q^{*}(\theta)\in\mathcal{Q}$ is constructed so that

\mbox{KL}(q^{*}(\theta)||\pi(\theta|\lambda))\leq C_{1}nr^{*}_{n}, \qquad (12)
\int_{\Theta}||f_{\theta}-f_{\theta^{*}}||_{\infty}^{2}q^{*}(\theta)(d\theta)\leq r^{*}_{n}. \qquad (13)

We define $q^{*}(\theta)$ as follows: for $i=1,\dots,T$,

\begin{split}
&\theta_{i}|\gamma^{*}_{i}\sim\gamma^{*}_{i}\mathcal{N}(\theta_{i}^{*},\sigma^{2}_{n})+(1-\gamma^{*}_{i})\delta_{0},\\
&\gamma_{i}^{*}\sim\mbox{Bern}(\phi^{*}_{i}),\quad\phi^{*}_{i}=1(\theta^{*}_{i}\neq 0),
\end{split} \qquad (14)

where $\sigma_{n}^{2}=a_{n}^{2}$ with $a_{n}^{2}=\frac{s^{*}}{8n}\log^{-1}(3pN)(2BN)^{-2(L+1)}\Bigl\{(p+1+\frac{1}{BN-1})^{2}+\frac{1}{(2BN)^{2}-1}+\frac{2}{(2BN-1)^{2}}\Bigr\}^{-1}$.

To prove (12), denote by $\Gamma^{T}$ the set of all possible binary inclusion vectors of length $T$; then $q^{*}(\theta)$ and $\pi(\theta|\lambda)$ can be written as the mixtures

q^{*}(\theta)=\sum_{\gamma\in\Gamma^{T}}1(\gamma=\gamma^{*})\prod^{T}_{i=1}\bigl[\gamma_{i}\mathcal{N}(\theta^{*}_{i},\sigma^{2}_{n})+(1-\gamma_{i})\delta_{0}\bigr],

and

\pi(\theta|\lambda)=\sum_{\gamma\in\Gamma^{T}}\pi(\gamma)\prod^{T}_{i=1}\bigl[\gamma_{i}\mathcal{N}(0,\sigma^{2}_{0})+(1-\gamma_{i})\delta_{0}\bigr],

where $\pi(\gamma)$ is the probability of the vector $\gamma$ under the prior distribution $\pi$. Then,

\begin{split}
&\mbox{KL}(q^{*}(\theta)||\pi(\theta|\lambda))\\
\leq\ &\log\frac{1}{\pi(\gamma^{*})}+\sum_{\gamma\in\Gamma^{T}}1(\gamma=\gamma^{*})\mbox{KL}\Bigl\{\prod^{T}_{i=1}\gamma_{i}\mathcal{N}(\theta^{*}_{i},\sigma^{2}_{n})+(1-\gamma_{i})\delta_{0}\,\Big|\Big|\,\prod^{T}_{i=1}\gamma_{i}\mathcal{N}(0,\sigma^{2}_{0})+(1-\gamma_{i})\delta_{0}\Bigr\}\\
=\ &\log\frac{1}{\lambda^{s^{*}}(1-\lambda)^{T-s^{*}}}+\sum^{T}_{i=1}\mbox{KL}\Bigl\{\gamma^{*}_{i}\mathcal{N}(\theta^{*}_{i},\sigma^{2}_{n})+(1-\gamma^{*}_{i})\delta_{0}\,\big|\big|\,\gamma^{*}_{i}\mathcal{N}(0,\sigma^{2}_{0})+(1-\gamma^{*}_{i})\delta_{0}\Bigr\}\\
=\ &s^{*}\log\Bigl(\frac{1}{\lambda}\Bigr)+(T-s^{*})\log\Bigl(\frac{1}{1-\lambda}\Bigr)+\sum^{T}_{i=1}\gamma^{*}_{i}\Bigl\{\frac{1}{2}\log\Bigl(\frac{\sigma^{2}_{0}}{\sigma^{2}_{n}}\Bigr)+\frac{\sigma^{2}_{n}+\theta^{*2}_{i}}{2}-\frac{1}{2}\Bigr\}\\
\leq\ &C_{0}nr^{*}_{n}+\frac{s^{*}}{2}\sigma^{2}_{n}+\frac{s^{*}}{2}(B^{2}-1)+\frac{s^{*}}{2}\log\Bigl(\frac{\sigma^{2}_{0}}{\sigma^{2}_{n}}\Bigr)\\
\leq\ &(C_{0}+1)nr^{*}_{n}+\frac{s^{*}}{2}B^{2}+\frac{s^{*}}{2}\log\Bigl(\frac{8n}{s^{*}}\log(3pN)(2BN)^{2L+2}\Bigl\{\Bigl(p+1+\frac{1}{BN-1}\Bigr)^{2}+\frac{1}{(2BN)^{2}-1}+\frac{2}{(2BN-1)^{2}}\Bigr\}\Bigr)\\
\leq\ &(C_{0}+2)nr^{*}_{n}+\frac{B^{2}}{2}s^{*}+(L+1)s^{*}\log(2BN)+\frac{s^{*}}{2}\log\log(3BN)+\frac{s^{*}}{2}\log\Bigl(\frac{n}{s^{*}}p^{2}\Bigr)\\
\leq\ &(C_{0}+3)nr^{*}_{n}+(L+1)s^{*}\log N+s^{*}\log\Bigl(p\sqrt{\frac{n}{s^{*}}}\Bigr)\\
\leq\ &C_{1}nr^{*}_{n},\quad\mbox{for sufficiently large }n,
\end{split}

where $C_{0}$ and $C_{1}$ are some fixed constants. The first inequality is due to Lemma A.1 and the second inequality is due to Condition 4.4.

Furthermore, by Appendix G of Chérief-Abdellatif, (2020), it can be shown

\begin{split}
&\int_{\Theta}||f_{\theta}-f_{\theta^{*}}||^{2}_{\infty}q^{*}(\theta)(d\theta)\\
\leq\ &8a^{2}_{n}\log(3BN)(2BN)^{2L+2}\Bigl\{\Bigl(p+1+\frac{1}{BN-1}\Bigr)^{2}+\frac{1}{(2BN)^{2}-1}+\frac{2}{(2BN-1)^{2}}\Bigr\}\\
\leq\ &\frac{s^{*}}{n}\leq r^{*}_{n}.
\end{split}

Note that

\begin{split}
l_{n}(P_{0},P_{\theta})&=\frac{1}{2\sigma^{2}_{\epsilon}}\bigl(||Y-f_{\theta}(X)||^{2}_{2}-||Y-f_{0}(X)||^{2}_{2}\bigr)\\
&=\frac{1}{2\sigma^{2}_{\epsilon}}\bigl(||Y-f_{0}(X)+f_{0}(X)-f_{\theta}(X)||^{2}_{2}-||Y-f_{0}(X)||^{2}_{2}\bigr)\\
&=\frac{1}{2\sigma^{2}_{\epsilon}}\bigl(||f_{\theta}(X)-f_{0}(X)||^{2}_{2}+2\langle Y-f_{0}(X),f_{0}(X)-f_{\theta}(X)\rangle\bigr).
\end{split}

Denote

\begin{split}
\mathcal{R}_{1}&=\int_{\Theta}||f_{\theta}(X)-f_{0}(X)||^{2}_{2}\,q^{*}(\theta)(d\theta),\\
\mathcal{R}_{2}&=\int_{\Theta}\langle Y-f_{0}(X),f_{0}(X)-f_{\theta}(X)\rangle\,q^{*}(\theta)(d\theta).
\end{split}

Since $||f_{\theta}(X)-f_{0}(X)||^{2}_{2}\leq n||f_{\theta}-f_{0}||^{2}_{\infty}\leq n||f_{\theta}-f_{\theta^{\ast}}||^{2}_{\infty}+n||f_{\theta^{\ast}}-f_{0}||^{2}_{\infty}$,

\mathcal{R}_{1}\leq nr^{*}_{n}+n||f_{\theta^{\ast}}-f_{0}||^{2}_{\infty}.

Since $Y-f_{0}(X)=\epsilon\sim\mathcal{N}(0,\sigma^{2}_{\epsilon}I)$, we have

\mathcal{R}_{2}=\int_{\Theta}\epsilon^{T}(f_{0}(X)-f_{\theta}(X))q^{*}(\theta)(d\theta)=\epsilon^{T}\int_{\Theta}(f_{0}(X)-f_{\theta}(X))q^{*}(\theta)(d\theta)\sim\mathcal{N}(0,c_{f}\sigma^{2}_{\epsilon}),

where $c_{f}=||\int_{\Theta}(f_{0}(X)-f_{\theta}(X))q^{*}(\theta)(d\theta)||^{2}_{2}\leq\mathcal{R}_{1}$ by the Cauchy-Schwarz inequality. Therefore, $\mathcal{R}_{2}=O_{p}(\sqrt{\mathcal{R}_{1}})$, and w.h.p., $\mathcal{R}_{2}\leq C^{\prime}_{0}\mathcal{R}_{1}$, where $C^{\prime}_{0}$ is some positive constant if $\lim n(r_{n}^{*}+\xi_{n}^{*})=\infty$ or any diverging sequence if $\limsup n(r_{n}^{*}+\xi_{n}^{*})\neq\infty$. Therefore,

\begin{split}
\int_{\Theta}l_{n}(P_{0},P_{\theta})q^{*}(\theta)(d\theta)=\mathcal{R}_{1}/2\sigma^{2}_{\epsilon}+\mathcal{R}_{2}/\sigma^{2}_{\epsilon}&\leq(2C^{\prime}_{0}+1)n(r^{*}_{n}+||f_{\theta^{\ast}}-f_{0}||^{2}_{\infty})/2\sigma_{\epsilon}^{2}\\
&\leq C^{\prime}_{1}\bigl(nr^{*}_{n}+n||f_{\theta^{\ast}}-f_{0}||^{2}_{\infty}\bigr)\quad\mbox{w.h.p.},
\end{split}

which, together with (12), concludes the lemma. ∎

A.2 Proof of Lemma 4.2

Under Conditions 4.1-4.2, we have the following lemma on the existence of testing functions over $\Theta_{n}=\Theta(L,\boldsymbol{p},s_{n})$, where $\Theta(L,\boldsymbol{p},s_{n})$ denotes the set of parameters whose $L_{0}$ norm is bounded by $s_{n}$.

Lemma A.2.

Let $\varepsilon^{*}_{n}=Mn^{-1/2}\sqrt{(L+1)s^{*}\log N+s^{*}\log(p\sqrt{n/s^{*}})}\log^{\delta}(n)$ for any $\delta>1$ and some large constant $M$, and let $s_{n}=s^{*}\log^{2\delta-1}n$. Then there exist a testing function $\phi\in[0,1]$ and constants $C_{1}>0$, $C_{2}>1/3$ such that

\begin{split}
\mathbb{E}_{P_{0}}(\phi)&\leq\exp\{-C_{1}n\varepsilon_{n}^{*2}\},\\
\sup_{P_{\theta}\in\mathcal{F}(L,\boldsymbol{p},s_{n}),\,d(P_{\theta},P_{0})>\varepsilon^{*}_{n}}\mathbb{E}_{P_{\theta}}(1-\phi)&\leq\exp\{-C_{2}nd^{2}(P_{0},P_{\theta})\}.
\end{split}
Proof.

Due to a well-known result (e.g., Le Cam, (1986), page 491, or Ghosal and Van Der Vaart, (2007), Lemma 2), there always exists a function $\psi\in[0,1]$ such that

\begin{split}
&\mathbb{E}_{P_{0}}(\psi)\leq\exp\{-nd^{2}(P_{\theta_{1}},P_{0})/2\},\\
&\mathbb{E}_{P_{\theta}}(1-\psi)\leq\exp\{-nd^{2}(P_{\theta_{1}},P_{0})/2\},
\end{split}

for all Pθ(L,𝒑,sn)P_{\theta}\in\mathcal{F}(L,\boldsymbol{p},s_{n}) satisfying that d(Pθ,Pθ1)d(P0,Pθ1)/18d(P_{\theta},P_{\theta_{1}})\leq d(P_{0},P_{\theta_{1}})/18.

Let $K=N(\varepsilon_{n}^{*}/19,\mathcal{F}(L,\boldsymbol{p},s_{n}),d(\cdot,\cdot))$ denote the covering number of the set $\mathcal{F}(L,\boldsymbol{p},s_{n})$, i.e., there exist $K$ Hellinger balls with radius $\varepsilon_{n}^{*}/19$ that completely cover $\mathcal{F}(L,\boldsymbol{p},s_{n})$. For any $\theta\in\mathcal{F}(L,\boldsymbol{p},s_{n})$ (W.L.O.G., we assume $P_{\theta}$ belongs to the $k$th Hellinger ball, centered at $P_{\theta_{k}}$), if $d(P_{\theta},P_{0})>\varepsilon_{n}^{*}$, then we must have $d(P_{0},P_{\theta_{k}})>(18/19)\varepsilon_{n}^{*}$, and there exists a testing function $\psi_{k}$ such that

𝔼P0(ψk)exp{nd2(Pθk,P0)/2}exp{(182/192/2)nεn2},𝔼Pθ(1ψk)exp{nd2(Pθk,P0)/2}exp{n(d(P0,Pθ)εn/19)2/2}exp{(182/192/2)nd2(P0,Pθ)}.\begin{split}\mathbb{E}_{P_{0}}(\psi_{k})&\leq\exp\{-nd^{2}(P_{\theta_{k}},P_{0})/2\}\\ &\leq\exp\{-(18^{2}/19^{2}/2)n\varepsilon_{n}^{*2}\},\\ \mathbb{E}_{P_{\theta}}(1-\psi_{k})&\leq\exp\{-nd^{2}(P_{\theta_{k}},P_{0})/2\}\\ &\leq\exp\{-n(d(P_{0},P_{\theta})-\varepsilon_{n}^{*}/19)^{2}/2\}\\ &\leq\exp\{-(18^{2}/19^{2}/2)nd^{2}(P_{0},P_{\theta})\}.\end{split}

Now we define $\phi=\max_{k=1,\dots,K}\psi_{k}$. Thus we must have

\begin{split}\mathbb{E}_{P_{0}}(\phi)&\leq\sum_{k}\mathbb{E}_{P_{0}}(\psi_{k})\leq K\exp\{-(18^{2}/19^{2}/2)n\varepsilon_{n}^{*2}\}\\ &=\exp\{-(18^{2}/19^{2}/2)n\varepsilon_{n}^{*2}+\log K\}.\end{split}

Note that

\begin{split}\log K&=\log N(\varepsilon_{n}^{*}/19,\mathcal{F}(L,\boldsymbol{p},s_{n}),d(\cdot,\cdot))\\ &\leq\log N(\sqrt{8}\sigma_{\epsilon}\varepsilon_{n}^{*}/19,\mathcal{F}(L,\boldsymbol{p},s_{n}),\|\cdot\|_{\infty})\\ &\leq(s_{n}+1)\log\Big(\frac{38}{\sqrt{8}\sigma_{\epsilon}\varepsilon_{n}^{*}}(L+1)(N+1)^{2(L+1)}\Big)\\ &\leq C_{0}\Big(s_{n}\log\frac{1}{\varepsilon_{n}^{*}}+s_{n}\log(L+1)+s_{n}(L+1)\log N\Big)\\ &\leq s_{n}(L+1)\log n\log N\leq s^{*}(L+1)\log N\log^{2\delta}n\\ &\leq n\varepsilon_{n}^{*2}/4,\mbox{ for sufficiently large }n,\end{split}\quad(15)

where C0C_{0} is some positive constant, the first inequality is due to the fact

d2(Pθ,P0)1exp{18σϵ2f0fθ2}\begin{split}d^{2}(P_{\theta},P_{0})\leq 1-\exp\{-\frac{1}{8\sigma^{2}_{\epsilon}}||f_{0}-f_{\theta}||^{2}_{\infty}\}\end{split}

and $\varepsilon_{n}^{*}=o(1)$; the second inequality is due to Lemma 10 of Schmidt-Hieber, (2017) (although Schmidt-Hieber, (2017) only focuses on ReLU networks, its Lemma 10 applies to any 1-Lipschitz continuous activation function); and the last inequality is due to $s_{n}\log(1/\varepsilon^{*}_{n})\asymp s_{n}\log n$. Therefore,

\begin{split}\mathbb{E}_{P_{0}}(\phi)&\leq\sum_{k}\mathbb{E}_{P_{0}}(\psi_{k})\leq\exp\{-C_{1}n\varepsilon_{n}^{*2}\},\end{split}

for some C1=182/192/21/4C_{1}=18^{2}/19^{2}/2-1/4. On the other hand, for any θ\theta, such that d(Pθ,P0)εnd(P_{\theta},P_{0})\geq\varepsilon_{n}^{*}, say PθP_{\theta} belongs to the kkth Hellinger ball, then we have

𝔼Pθ(1ϕ)𝔼Pθ(1ψk)exp{C2nd2(P0,Pθ)},\begin{split}\mathbb{E}_{P_{\theta}}(1-\phi)&\leq\mathbb{E}_{P_{\theta}}(1-\psi_{k})\leq\exp\{-C_{2}nd^{2}(P_{0},P_{\theta})\},\end{split}

where C2=182/192/2C_{2}=18^{2}/19^{2}/2. Hence we conclude the proof. ∎

Lemma A.3 restates the Donsker–Varadhan variational representation of the KL divergence; its proof can be found in Boucheron et al., (2013).

Lemma A.3.

For any probability measure μ\mu and any measurable function hh with ehL1(μ)e^{h}\in L_{1}(\mu),

logeh(η)μ(dη)=supρ[h(η)ρ(dη)KL(ρ||μ)].\log\int e^{h(\eta)}\mu(d\eta)=\sup_{\rho}\left[\int h(\eta)\rho(d\eta)-\mbox{KL}(\rho||\mu)\right].

Proof of Lemma 4.2

Proof.

Denote Θn\Theta_{n} as the truncated parameter space {θ:i=1T1(θi0)sn}\{\theta:\sum^{T}_{i=1}1(\theta_{i}\neq 0)\leq s_{n}\}, where sns_{n} is defined in Lemma A.2. Noting that

θΘd2(Pθ,P0)q^(θ)𝑑θ=θΘnd2(Pθ,P0)q^(θ)𝑑θ+θΘncd2(Pθ,P0)q^(θ)𝑑θ,\int_{\theta\in\Theta}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta=\int_{\theta\in\Theta_{n}}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta+\int_{\theta\in\Theta_{n}^{c}}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta, (16)

it suffices to bound the two components on the RHS of (16).

We start with the first component. Let $\widetilde{\pi}(\theta)$ denote the prior $\pi(\theta)$ truncated on the set $\Theta_{n}$, i.e., $\widetilde{\pi}(\theta)=\pi(\theta)1(\theta\in\Theta_{n})/\pi(\Theta_{n})$; then by Lemma A.2 and the same argument as in Theorem 3.1 of Pati et al., (2018), it can be shown that

Θnη(Pθ,P0)π~(θ)𝑑θeC0nεn2,w.h.p.\int_{\Theta_{n}}\eta(P_{\theta},P_{0})\widetilde{\pi}(\theta)d\theta\leq e^{C_{0}n\varepsilon^{*2}_{n}},\mbox{w.h.p.} (17)

for some $C_{0}>0$, where $\log\eta(P_{\theta},P_{0})=l_{n}(P_{\theta},P_{0})+\frac{n}{3}d^{2}(P_{\theta},P_{0})$. We further denote by $\widecheck{q}(\theta)$ the restriction of $\widehat{q}(\theta)$ to $\Theta_{n}$, i.e., $\widecheck{q}(\theta)=\widehat{q}(\theta)1(\theta\in\Theta_{n})/\widehat{q}(\Theta_{n})$; then by Lemma A.3 (applied with $\mu=\widetilde{\pi}$, $\rho=\widecheck{q}$ and $h=\log\eta(P_{\theta},P_{0})$) and (17), w.h.p.,

n3q^(Θn)Θnd2(Pθ,P0)q^(θ)𝑑θ=n3Θnd2(Pθ,P0)qˇ(θ)𝑑θCnεn2+KL(qˇ(θ)||π~(θ))Θnln(Pθ,P0)qˇ(θ)dθ.\begin{split}&\frac{n}{3\widehat{q}(\Theta_{n})}\int_{\Theta_{n}}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta=\frac{n}{3}\int_{\Theta_{n}}d^{2}(P_{\theta},P_{0})\widecheck{q}(\theta)d\theta\\ \leq&Cn\varepsilon^{*2}_{n}+\mbox{KL}(\widecheck{q}(\theta)||\widetilde{\pi}(\theta))-\int_{\Theta_{n}}l_{n}(P_{\theta},P_{0})\widecheck{q}(\theta)d\theta.\\ \end{split} (18)

Furthermore,

KL(qˇ(θ)||π~(θ))=1q^(Θn)θΘnlogq^(θ)π(θ)q^(θ)𝑑θ+logπ(Θn)q^(Θn)=1q^(Θn)KL(q^(θ)||π(θ))1q^(Θn)θΘnclogq^(θ)π(θ)q^(θ)dθ+logπ(Θn)q^(Θn),\begin{split}\mbox{KL}(\widecheck{q}(\theta)||\widetilde{\pi}(\theta))&=\frac{1}{\widehat{q}(\Theta_{n})}\int_{\theta\in\Theta_{n}}\log\frac{\widehat{q}(\theta)}{\pi(\theta)}\widehat{q}(\theta)d\theta+\log\frac{\pi(\Theta_{n})}{\widehat{q}(\Theta_{n})}\\ &=\frac{1}{\widehat{q}(\Theta_{n})}\mbox{KL}(\widehat{q}(\theta)||\pi(\theta))-\frac{1}{\widehat{q}(\Theta_{n})}\int_{\theta\in\Theta_{n}^{c}}\log\frac{\widehat{q}(\theta)}{\pi(\theta)}\widehat{q}(\theta)d\theta+\log\frac{\pi(\Theta_{n})}{\widehat{q}(\Theta_{n})},\end{split}

and similarly,

Θnln(Pθ,P0)qˇ(θ)𝑑θ=1q^(Θn)Θln(Pθ,P0)q^(θ)𝑑θ1q^(Θn)Θncln(Pθ,P0)q^(θ)𝑑θ.\begin{split}\int_{\Theta_{n}}l_{n}(P_{\theta},P_{0})\widecheck{q}(\theta)d\theta=\frac{1}{\widehat{q}(\Theta_{n})}\int_{\Theta}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta-\frac{1}{\widehat{q}(\Theta_{n})}\int_{\Theta_{n}^{c}}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta.\end{split}

Combining the above two equations together, we have

n3q^(Θn)Θnd2(Pθ,P0)q^(θ)dθCnεn2+KL(qˇ(θ)||π~(θ))Θnln(Pθ,P0)qˇ(θ)dθ=Cnεn2+1q^(Θn)(KL(q^(θ)||π(θ))Θln(Pθ,P0)q^(θ)dθ)1q^(Θn)(Θnclogq^(θ)π(θ)q^(θ)𝑑θΘncln(Pθ,P0)q^(θ)𝑑θ)+logπ(Θn)q^(Θn).\begin{split}&\frac{n}{3\widehat{q}(\Theta_{n})}\int_{\Theta_{n}}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq Cn\varepsilon_{n}^{*2}+\mbox{KL}(\widecheck{q}(\theta)||\widetilde{\pi}(\theta))-\int_{\Theta_{n}}l_{n}(P_{\theta},P_{0})\widecheck{q}(\theta)d\theta\\ =&Cn\varepsilon_{n}^{*2}+\frac{1}{\widehat{q}(\Theta_{n})}\left(\mbox{KL}(\widehat{q}(\theta)||\pi(\theta))-\int_{\Theta}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\right)\\ &-\frac{1}{\widehat{q}(\Theta_{n})}\left(\int_{\Theta_{n}^{c}}\log\frac{\widehat{q}(\theta)}{\pi(\theta)}\widehat{q}(\theta)d\theta-\int_{\Theta_{n}^{c}}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\right)+\log\frac{\pi(\Theta_{n})}{\widehat{q}(\Theta_{n})}.\end{split} (19)

The second component of (16) trivially satisfies that θΘncd2(Pθ,P0)q^(θ)𝑑θθΘncq^(θ)𝑑θ=q^(Θnc)\int_{\theta\in\Theta_{n}^{c}}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq\int_{\theta\in\Theta_{n}^{c}}\widehat{q}(\theta)d\theta=\widehat{q}(\Theta_{n}^{c}). Thus, together with (19), we have that w.h.p.,

d2(Pθ,P0)q^(θ)dθ3q^(Θn)Cεn2+3n(KL(q^(θ)||π(θ))Θln(Pθ,P0)q^(θ)dθ)+3nΘncln(Pθ,P0)q^(θ)𝑑θ+3nΘnclogπ(θ)q^(θ)q^(θ)𝑑θ+3q^(Θn)nlogπ(Θn)q^(Θn)+q^(Θnc).\begin{split}&\int d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq 3\widehat{q}(\Theta_{n})C\varepsilon^{*2}_{n}+\frac{3}{n}\left(\mbox{KL}(\widehat{q}(\theta)||\pi(\theta))-\int_{\Theta}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\right)\\ &+\frac{3}{n}\int_{\Theta_{n}^{c}}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta+\frac{3}{n}\int_{\Theta_{n}^{c}}\log\frac{\pi(\theta)}{\widehat{q}(\theta)}\widehat{q}(\theta)d\theta+\frac{3\widehat{q}(\Theta_{n})}{n}\log\frac{\pi(\Theta_{n})}{\widehat{q}(\Theta_{n})}+\widehat{q}(\Theta_{n}^{c}).\end{split} (20)

The second term in the RHS of (20) is bounded by $C^{\prime}(r^{*}_{n}+\xi^{*}_{n})$ w.h.p., due to Lemma 4.1, where $C^{\prime}$ is either a positive constant or a diverging sequence, depending on whether $n(r_{n}^{*}+\xi_{n}^{*})$ diverges.

The third term in the RHS of (20) is bounded by

\begin{split}&\frac{3}{n}\int_{\Theta_{n}^{c}}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\\ =&\frac{3}{2n\sigma^{2}_{\epsilon}}\int_{\Theta_{n}^{c}}\left[\sum^{n}_{i=1}\epsilon_{i}^{2}-\sum^{n}_{i=1}(\epsilon_{i}+f_{0}(X_{i})-f_{\theta}(X_{i}))^{2}\right]\widehat{q}(\theta)d\theta\\ =&\frac{3}{2n\sigma^{2}_{\epsilon}}\int_{\Theta_{n}^{c}}\left[-2\sum^{n}_{i=1}\epsilon_{i}(f_{0}(X_{i})-f_{\theta}(X_{i}))-\sum^{n}_{i=1}(f_{0}(X_{i})-f_{\theta}(X_{i}))^{2}\right]\widehat{q}(\theta)d\theta\\ =&\frac{3}{2n\sigma^{2}_{\epsilon}}\left\{-2\sum^{n}_{i=1}\epsilon_{i}\int_{\Theta_{n}^{c}}(f_{0}(X_{i})-f_{\theta}(X_{i}))\widehat{q}(\theta)d\theta-\int_{\Theta_{n}^{c}}\sum^{n}_{i=1}(f_{0}(X_{i})-f_{\theta}(X_{i}))^{2}\widehat{q}(\theta)d\theta\right\}.\end{split}

Conditional on XiX_{i}’s, 2i=1nϵiΘnc(f0(Xi)fθ(Xi))q^(θ)𝑑θ-2\sum^{n}_{i=1}\epsilon_{i}\int_{\Theta_{n}^{c}}(f_{0}(X_{i})-f_{\theta}(X_{i}))\widehat{q}(\theta)d\theta follows a normal distribution 𝒩(0,V2)\mathcal{N}(0,V^{2}), where V2=4σϵ2i=1n(Θnc(f0(Xi)fθ(Xi))q^(θ)𝑑θ)24σϵ2Θnci=1n(f0(Xi)fθ(Xi))2q^(θ)dθV^{2}=4\sigma^{2}_{\epsilon}\sum^{n}_{i=1}(\int_{\Theta_{n}^{c}}(f_{0}(X_{i})-f_{\theta}(X_{i}))\widehat{q}(\theta)d\theta)^{2}\leq 4\sigma^{2}_{\epsilon}\int_{\Theta_{n}^{c}}\sum^{n}_{i=1}(f_{0}(X_{i})-f_{\theta}(X_{i}))^{2}\widehat{q}(\theta)d\theta. Thus conditional on XiX_{i}’s, the third term in the RHS of (20) is bounded by

32nσϵ2[𝒩(0,V2)V24σϵ2].\frac{3}{2n\sigma^{2}_{\epsilon}}\left[\mathcal{N}(0,V^{2})-\frac{V^{2}}{4\sigma^{2}_{\epsilon}}\right]. (21)

Noting that 𝒩(0,V2)=Op(MnV)\mathcal{N}(0,V^{2})=O_{p}(M_{n}V) for any diverging sequence MnM_{n}, (21) is further bounded, w.h.p., by

32nσϵ2(MnVV24σϵ2)32nσϵ2σϵ2Mn2.\frac{3}{2n\sigma^{2}_{\epsilon}}(M_{n}V-\frac{V^{2}}{4\sigma^{2}_{\epsilon}})\leq\frac{3}{2n\sigma^{2}_{\epsilon}}\sigma^{2}_{\epsilon}M_{n}^{2}.

Therefore, the third term in the RHS of (20) can be bounded by εn2\varepsilon_{n}^{*2} w.h.p. (by choosing Mn2=nεn2M_{n}^{2}=n\varepsilon_{n}^{*2}).

The fourth term in the RHS of (20) is bounded by

3nΘnclogπ(θ)q^(θ)q^(θ)𝑑θ3nq^(Θnc)logπ(Θnc)q^(Θnc)3nsupx(0,1)[xlog(1/x)]=O(1/n).\frac{3}{n}\int_{\Theta_{n}^{c}}\log\frac{\pi(\theta)}{\widehat{q}(\theta)}\widehat{q}(\theta)d\theta\leq\frac{3}{n}\widehat{q}(\Theta_{n}^{c})\log\frac{\pi(\Theta_{n}^{c})}{\widehat{q}(\Theta_{n}^{c})}\leq\frac{3}{n}\sup_{x\in(0,1)}[x\log(1/x)]=O(1/n).

Similarly, the fifth term in the RHS of (20) is bounded by O(1/n)O(1/n).

For the last term in the RHS of (20), by Lemma A.5 below, we have, w.h.p., $\widehat{q}(\Theta_{n}^{c})\leq\varepsilon_{n}^{*2}$.

Combining all the results above, we have, w.h.p.,

d2(Pθ,P0)q^(θ)dθCεn2+3n(KL(q^(θ)||π(θ))Θln(Pθ,P0)q^(θ)dθ)+O(1/n),\begin{split}&\int d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq C\varepsilon^{*2}_{n}+\frac{3}{n}\left(\mbox{KL}(\widehat{q}(\theta)||\pi(\theta))-\int_{\Theta}l_{n}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\right)+O(1/n),\end{split}

where CC is some constant. ∎

Lemma A.4 (Chernoff bound for Poisson tail).

Let Xpoi(λ)X\sim\mbox{poi}(\lambda) be a Poisson random variable. For any x>λx>\lambda,

P(Xx)(eλ)xeλxx.P(X\geq x)\leq\frac{(e\lambda)^{x}e^{-\lambda}}{x^{x}}.
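
(For completeness, a short sketch of the standard Chernoff argument behind Lemma A.4: for any $t>0$, Markov's inequality and the Poisson moment generating function give $P(X\geq x)\leq e^{-tx}\mathbb{E}[e^{tX}]=\exp\{\lambda(e^{t}-1)-tx\}$, and choosing the minimizing $t=\log(x/\lambda)>0$ yields $\exp\{x-\lambda-x\log(x/\lambda)\}=(e\lambda)^{x}e^{-\lambda}/x^{x}$.)
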
Lemma A.5.

If λT1exp{Mnrn/sn}\lambda\leq T^{-1}\exp\{-Mnr_{n}^{*}/s_{n}\} for any positive diverging sequence MM\rightarrow\infty, then w.h.p., q^(Θnc)=O(εn2)\widehat{q}(\Theta_{n}^{c})=O(\varepsilon_{n}^{*2}).

Proof.

By Lemma 4.1, we have that w.h.p.,

KL(q^(θ)||π(θ|λ))+Θln(P0,Pθ)q^(θ)dθ=infqθ𝒬{KL(q(θ)||π(θ|λ))+Θln(P0,Pθ)q(θ)(dθ)}Cnrn(Note that rnξn)\begin{split}&\mbox{KL}(\widehat{q}(\theta)||\pi(\theta|\lambda))+\int_{\Theta}l_{n}(P_{0},P_{\theta})\widehat{q}(\theta)d\theta=\inf_{q_{\theta}\in\mathcal{Q}}\Bigl{\{}\mbox{KL}(q(\theta)||\pi(\theta|\lambda))+\int_{\Theta}l_{n}(P_{0},P_{\theta})q(\theta)(d\theta)\Bigr{\}}\\ \leq&Cnr_{n}^{*}\quad(\mbox{Note that }r_{n}^{*}\asymp\xi^{*}_{n})\end{split}

where $C$ is either a constant or any diverging sequence, depending on whether $nr_{n}^{*}$ diverges. By an argument similar to the one used in the proof of Lemma 4.1,

Θln(P0,Pθ)q^(θ)𝑑θ12σϵ2(Θfθ(X)f0(X)22q^(θ)(dθ)+Z)\int_{\Theta}l_{n}(P_{0},P_{\theta})\widehat{q}(\theta)d\theta\leq\frac{1}{2\sigma_{\epsilon}^{2}}\left(\int_{\Theta}||f_{\theta}(X)-f_{0}(X)||^{2}_{2}\widehat{q}(\theta)(d\theta)+Z\right)

where $Z$ is normally distributed as $\mathcal{N}(0,\sigma^{2}_{\epsilon}c_{0}^{\prime})$ with $c_{0}^{\prime}\leq c_{0}=\int_{\Theta}||f_{\theta}(X)-f_{0}(X)||^{2}_{2}\widehat{q}(\theta)(d\theta)$. Therefore, $-\int_{\Theta}l_{n}(P_{0},P_{\theta})\widehat{q}(\theta)d\theta=(1/2\sigma_{\epsilon}^{2})[-c_{0}+O_{p}(\sqrt{c_{0}})]$, and $\mbox{KL}(\widehat{q}(\theta)||\pi(\theta|\lambda))\leq Cnr_{n}^{*}+(1/2\sigma_{\epsilon}^{2})[-c_{0}+O_{p}(\sqrt{c_{0}})]$. Since $Cnr_{n}^{*}\rightarrow\infty$, we must have, w.h.p., $\mbox{KL}(\widehat{q}(\theta)||\pi(\theta|\lambda))\leq Cnr_{n}^{*}/2$. On the other hand,

KL(q^(θ)||π(θ|λ))=i=1TKL(q^(θi)||π(θi|λ))i=1TKL(q^(γi)||π(γi|λ))=i=1T[q^(γi=1)logq^(γi=1)λ+q^(γi=0)logq^(γi=0)1λ].\begin{split}&\mbox{KL}(\widehat{q}(\theta)||\pi(\theta|\lambda))=\sum_{i=1}^{T}\mbox{KL}(\widehat{q}(\theta_{i})||\pi(\theta_{i}|\lambda))\geq\sum_{i=1}^{T}\mbox{KL}(\widehat{q}(\gamma_{i})||\pi(\gamma_{i}|\lambda))\\ =&\sum_{i=1}^{T}\left[\widehat{q}(\gamma_{i}=1)\log\frac{\widehat{q}(\gamma_{i}=1)}{\lambda}+\widehat{q}(\gamma_{i}=0)\log\frac{\widehat{q}(\gamma_{i}=0)}{1-\lambda}\right].\end{split} (22)

Let us choose λ0=1/T\lambda_{0}=1/T, and A={i:q^(γi=1)λ0}A=\{i:\widehat{q}(\gamma_{i}=1)\geq\lambda_{0}\}, then the above inequality (22) implies that iAq^(γi=1)log(λ0/λ)Cnrn/2\sum_{i\in A}\widehat{q}(\gamma_{i}=1)\log(\lambda_{0}/\lambda)\leq Cnr_{n}^{*}/2. Noting that λT1exp{Mnrn/sn}\lambda\leq T^{-1}\exp\{-Mnr_{n}^{*}/s_{n}\}, it further implies iAq^(γi=1)sn/Msn\sum_{i\in A}\widehat{q}(\gamma_{i}=1)\leq s_{n}/M\prec s_{n}.

Under the distribution $\widehat{q}$, by Bernstein's inequality,

Pr(iAγi2sn/3)Pr(iAγisn/2+iA𝔼(γi))exp(sn2/8iA𝔼[γi2]+sn/6)=exp(sn2/8iAq^(γi=1)+sn/6)exp(csn)=O(εn2)\begin{split}&Pr(\sum_{i\in A}\gamma_{i}\geq 2s_{n}/3)\leq Pr(\sum_{i\in A}\gamma_{i}\geq s_{n}/2+\sum_{i\in A}\mathbb{E}(\gamma_{i}))\leq\exp\left(-\frac{s_{n}^{2}/8}{\sum_{i\in A}\mathbb{E}[\gamma_{i}^{2}]+s_{n}/6}\right)\\ =&\exp\left(-\frac{s_{n}^{2}/8}{\sum_{i\in A}\widehat{q}(\gamma_{i}=1)+s_{n}/6}\right)\leq\exp\left(-cs_{n}\right)=O(\varepsilon_{n}^{*2})\end{split}

for some constant c>0c>0, where the last inequality holds since log(1/εn2)=O(logn)sn\log(1/\varepsilon_{n}^{*2})=O(\log n)\prec s_{n}.

Under the distribution $\widehat{q}$, $\sum_{i\notin A}\gamma_{i}$ is stochastically smaller than $Bin(T,\lambda_{0})$. Since $T\rightarrow\infty$, by Lemma A.4,

Pr(iAγisn/3)Pr(Bin(T,λ0)sn/3)Pr(poi(1)sn/3)=O(exp{csn})=O(εn2)\begin{split}&Pr(\sum_{i\notin A}\gamma_{i}\geq s_{n}/3)\leq Pr(Bin(T,\lambda_{0})\geq s_{n}/3)\rightarrow Pr(\mbox{poi}(1)\geq s_{n}/3)\\ =&O(\exp\{-c^{\prime}s_{n}\})=O(\varepsilon_{n}^{*2})\end{split}

for some $c^{\prime}>0$. Trivially, this implies that, w.h.p., $Pr(\sum_{i}\gamma_{i}\geq s_{n})=O(\varepsilon_{n}^{*2})$ under the VB posterior $\widehat{q}$. ∎

A.3 Main theorem

Theorem A.1.

Under Conditions 4.1, 4.2 and 4.4, and setting $-\log\lambda=\log(T)+\delta[(L+1)\log N+\log\sqrt{n}p]$ for any constant $\delta>0$, we have that w.h.p.,

Θd2(Pθ,P0)q^(θ)𝑑θCεn2+C(rn+ξn),\int_{\Theta}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq C\varepsilon^{*2}_{n}+C^{\prime}(r_{n}^{*}+\xi^{*}_{n}),

where $C$ is some positive constant and $C^{\prime}$ is any diverging sequence. If $\|f_{0}\|_{\infty}<F$ and we truncate the VB posterior on $\Theta_{F}=\{\theta:\|f_{\theta}\|_{\infty}\leq F\}$, i.e., $\widehat{q}_{F}\propto\widehat{q}1(\theta\in\Theta_{F})$, then, w.h.p.,

ΘF𝔼X|fθ(X)f0(X)|2q^F(θ)𝑑θCεn2+C(rn+ξn)CFq^(ΘF)\int_{\Theta_{F}}\mathbb{E}_{X}|f_{\theta}(X)-f_{0}(X)|^{2}\widehat{q}_{F}(\theta)d\theta\leq\frac{C\varepsilon^{*2}_{n}+C^{\prime}(r_{n}^{*}+\xi^{*}_{n})}{C_{F}\widehat{q}(\Theta_{F})}

where CF=[1exp(4F2/8σϵ2)]/4F2C_{F}=[1-\exp(-4F^{2}/8\sigma^{2}_{\epsilon})]/4F^{2}, and q^(ΘF)\widehat{q}(\Theta_{F}) is the VB posterior mass of ΘF\Theta_{F}.

Proof.

The convergence under the squared Hellinger distance is a direct result of Lemmas 4.1 and 4.2, by simply checking that the choice of $\lambda$ satisfies the required conditions. The convergence under the $L_{2}$ distance relies on the inequality $d^{2}(P_{\theta},P_{0})\geq C_{F}\mathbb{E}_{X}|f_{\theta}(X)-f_{0}(X)|^{2}$ for $C_{F}=[1-\exp(-4F^{2}/8\sigma^{2}_{\epsilon})]/4F^{2}$, which holds when both $f_{\theta}$ and $f_{0}$ are bounded by $F$. Then, w.h.p.,

ΘF𝔼X|fθ(X)f0(X)|2q^F(θ)𝑑θCF1ΘFd2(Pθ,P0)q^F(θ)𝑑θ1CFq^(ΘF)Θd2(Pθ,P0)q^(θ)𝑑θCεn2+C(rn+ξn)CFq^(ΘF).\begin{split}&\int_{\Theta_{F}}\mathbb{E}_{X}|f_{\theta}(X)-f_{0}(X)|^{2}\widehat{q}_{F}(\theta)d\theta\leq C_{F}^{-1}\int_{\Theta_{F}}d^{2}(P_{\theta},P_{0})\widehat{q}_{F}(\theta)d\theta\\ \leq&\frac{1}{C_{F}\widehat{q}(\Theta_{F})}\int_{\Theta}d^{2}(P_{\theta},P_{0})\widehat{q}(\theta)d\theta\leq\frac{C\varepsilon^{*2}_{n}+C^{\prime}(r_{n}^{*}+\xi^{*}_{n})}{C_{F}\widehat{q}(\Theta_{F})}.\end{split}

Appendix B Additional experimental results

B.1 Comparison between the Bernoulli variable and the Gumbel-softmax approximation

Denote γiBern(ϕi)\gamma_{i}\sim\mbox{Bern}(\phi_{i}) and γ~iGumbel-softmax(ϕi,τ)\widetilde{\gamma}_{i}\sim\mbox{Gumbel-softmax}(\phi_{i},\tau), then we have that

γ~i:=gτ(ϕi,ui)=(1+exp(ηi/τ))1,where ηi=logϕi1ϕi+logui1ui,ui𝒰(0,1),γi:=g(ϕi,ui)=1(uiϕi)where ui𝒰(0,1).\begin{split}&\widetilde{\gamma}_{i}:=g_{\tau}(\phi_{i},u_{i})=(1+\exp(-\eta_{i}/\tau))^{-1},\quad\mbox{where }\eta_{i}=\log\frac{\phi_{i}}{1-\phi_{i}}+\log\frac{u_{i}}{1-u_{i}},\quad u_{i}\sim\mathcal{U}(0,1),\\ &\gamma_{i}:=g(\phi_{i},u_{i})=1(u_{i}\leq\phi_{i})\quad\mbox{where }u_{i}\sim\mathcal{U}(0,1).\end{split}

Fig 4 demonstrates the functional convergence of $g_{\tau}$ towards $g$ as $\tau$ goes to zero. In Fig 4(a), with $\phi_{i}=0.9$ fixed, $g_{\tau}$ is shown to converge to $g$ as a function of $u_{i}$; in Fig 4(b), with $u_{i}=0.2$ fixed, $g_{\tau}$ converges to $g$ as a function of $\alpha_{i}=\log(\phi_{i}/(1-\phi_{i}))$. Formally, Maddison et al., (2017) rigorously proved that $\widetilde{\gamma}_{i}$ converges to $\gamma_{i}$ in distribution as $\tau$ approaches 0.

Figure 4: The convergence of $g_{\tau}$ towards $g$ as $\tau$ approaches 0. (a) Fix $\phi_{i}=0.9$. (b) Fix $u_{i}=0.2$.
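
To complement Figure 4, the following minimal NumPy/SciPy sketch (our illustration, not the code behind the figure) evaluates $g_{\tau}$ against its zero-temperature limit $1(\eta_{i}>0)$, which for $u_{i}\sim\mathcal{U}(0,1)$ has the same $\mbox{Bern}(\phi_{i})$ distribution as $\gamma_{i}$; the gap shrinks as $\tau$ decreases.

import numpy as np
from scipy.special import expit  # numerically stable sigmoid

def g_tau(phi, u, tau):
    """Gumbel-softmax (binary Concrete) relaxation of a Bern(phi) draw."""
    eta = np.log(phi / (1 - phi)) + np.log(u / (1 - u))
    return expit(eta / tau)

def g_limit(phi, u):
    """Zero-temperature limit of g_tau: a hard 0/1 value with the Bern(phi) law."""
    eta = np.log(phi / (1 - phi)) + np.log(u / (1 - u))
    return (eta > 0).astype(float)

phi, u_fixed = 0.9, 0.2
u_grid = np.linspace(0.01, 0.99, 99)
for tau in (1.0, 0.5, 0.1, 0.01):
    mean_gap = np.mean(np.abs(g_tau(phi, u_grid, tau) - g_limit(phi, u_grid)))
    gap_at_u = abs(g_tau(phi, u_fixed, tau) - g_limit(phi, u_fixed))
    print(f"tau={tau:4.2f}: mean |g_tau - limit| = {mean_gap:.3f}, "
          f"gap at u={u_fixed} = {gap_at_u:.3f}")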

B.2 Algorithm implementation details for the numerical experiments

Initialization

As mentioned by Sønderby et al., (2016) and Molchanov et al., (2017), training a sparse BNN with random initialization may lead to poor performance, since many of the weights could be pruned too early. In our case, we assign each of the weights and biases an inclusion variable, which could shrink to zero quickly in the early optimization stage if initialized randomly. As a consequence, we deliberately initialize $\phi_{i}$ to be close to 1 in our experiments. This initialization strategy ensures that training starts from a fully connected neural network, which is similar to starting from a pre-trained fully connected network as mentioned in Molchanov et al., (2017). The other two parameters $\mu_{i}$ and $\sigma_{i}$ are initialized randomly.
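
As a minimal sketch of this strategy (illustrative only; the function and variable names are ours), the variational parameters could be initialized as follows, with every inclusion probability $\phi_{i}$ set close to 1 so that training starts from an (almost) fully connected network:

import numpy as np

def init_variational_params(num_weights, phi_init=0.99, seed=0):
    """Initialize mean-field spike-and-slab variational parameters.

    phi:   inclusion probabilities, set near 1 so no weight is pruned too early;
    mu:    slab (Gaussian) means, small random values;
    sigma: slab standard deviations, small random positive values.
    """
    rng = np.random.default_rng(seed)
    phi = np.full(num_weights, phi_init)    # start almost fully connected
    mu = 0.1 * rng.standard_normal(num_weights)
    sigma = 0.1 * np.abs(rng.standard_normal(num_weights)) + 1e-3
    return phi, mu, sigma

phi, mu, sigma = init_variational_params(num_weights=501)
print(phi[:3], mu[:3], sigma[:3])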

Other implementation details in simulation studies

We set $K=1$ and the learning rate to $5\times 10^{-3}$ during training. For Simulation I, we choose batch sizes $m=1024$ and $m=128$ for (A) and (B) respectively, and run 10000 epochs in both cases. For Simulation II, we use $m=512$ and run 7000 epochs. Although it is common to set up an annealing schedule for the temperature parameter $\tau$, we do not observe any significant performance improvement over keeping $\tau$ constant; therefore we choose $\tau=0.5$ in all of our experiments. The optimization method used is Adam.

The implementation details for the UCI datasets and MNIST can be found in Sections B.5 and B.6, respectively.

B.3 Toy example: linear regression

In this section, we use a toy example to demonstrate that there is little difference between the results obtained with the inverse-CDF reparameterization and with the Gumbel-softmax approximation.

Consider a linear regression model:

Yi=XiTβ+ϵi,ϵi𝒩(0,1),i=1,,n,Y_{i}=X^{T}_{i}\beta+\epsilon_{i},\quad\epsilon_{i}\sim\mathcal{N}(0,1),\quad i=1,\ldots,n,

We simulate a dataset with 1000 observations and 200 predictors, where β50=β100=β150=10\beta_{50}=\beta_{100}=\beta_{150}=10, β75=β125=10\beta_{75}=\beta_{125}=-10 and βj=0\beta_{j}=0 for all other jj.

A spike-and-slab prior is imposed on β\beta such that

βj|γjγj𝒩(0,σ02)+(1γj)δ0,γjBern(λ),\beta_{j}|\gamma_{j}\sim\gamma_{j}\mathcal{N}(0,\sigma^{2}_{0})+(1-\gamma_{j})\delta_{0},\quad\gamma_{j}\sim\mbox{Bern}(\lambda),

for $j=1,\ldots,200$, where $\sigma_{0}=5$ and $\lambda=0.03$. The variational distribution $q(\beta)\in\mathcal{Q}$ is chosen as

βj|γjγj𝒩(μj,σj2)+(1γj)δ0,γjBern(ϕj).\beta_{j}|\gamma_{j}\sim\gamma_{j}\mathcal{N}(\mu_{j},\sigma^{2}_{j})+(1-\gamma_{j})\delta_{0},\quad\gamma_{j}\sim\mbox{Bern}(\phi_{j}).

We use both the Gumbel-softmax approximation and the inverse-CDF reparameterization for the stochastic optimization of the ELBO, and plot the posterior mean $\mathbb{E}_{\widehat{q}(\beta)}(\beta_{j}|\gamma_{j})$ (blue curve) against the true value (red curve). Figure 5 shows that the inverse-CDF reparameterization exhibits only slightly higher error in estimating the zero coefficients than the Gumbel-softmax approximation, which indicates that the two methods differ little on this toy example.

Figure 5: Linear regression. (a) Gumbel-softmax reparametrization. (b) Inverse-CDF reparametrization.
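
For concreteness, the following PyTorch sketch (our own illustration, not the authors' implementation) runs the Gumbel-softmax variant of this toy example with a single Monte Carlo sample per step; the closed-form KL term assumes the usual mean-field decomposition of the spike-and-slab variational family, and the final line reads off the variational mean $\mathbb{E}_{\widehat{q}}(\beta_{j})=\phi_{j}\mu_{j}$.

import torch

torch.manual_seed(0)
n, p, sigma0, lam, tau = 1000, 200, 5.0, 0.03, 0.5

# Simulate the data: beta_50 = beta_100 = beta_150 = 10, beta_75 = beta_125 = -10.
beta_true = torch.zeros(p)
beta_true[[49, 99, 149]] = 10.0
beta_true[[74, 124]] = -10.0
X = torch.randn(n, p)
Y = X @ beta_true + torch.randn(n)

# Variational parameters: logit of phi_j, slab mean mu_j, log of slab std sigma_j.
phi_logit = torch.zeros(p, requires_grad=True)
mu = torch.zeros(p, requires_grad=True)
log_sigma = torch.full((p,), -1.0, requires_grad=True)
opt = torch.optim.Adam([phi_logit, mu, log_sigma], lr=1e-2)

def kl_spike_slab(phi, mu, sigma):
    # Closed-form KL between the mean-field spike-and-slab variational
    # distribution and the spike-and-slab prior (assumed decomposition).
    kl_gauss = torch.log(sigma0 / sigma) + (sigma**2 + mu**2) / (2 * sigma0**2) - 0.5
    kl_bern = phi * torch.log(phi / lam) + (1 - phi) * torch.log((1 - phi) / (1 - lam))
    return torch.sum(phi * kl_gauss + kl_bern)

for step in range(3000):
    phi = torch.sigmoid(phi_logit).clamp(1e-6, 1 - 1e-6)
    sigma = torch.exp(log_sigma)
    # One Gumbel-softmax sample of the inclusion indicators and one slab sample.
    u = torch.rand(p).clamp(1e-6, 1 - 1e-6)
    eta = torch.log(phi / (1 - phi)) + torch.log(u / (1 - u))
    gamma = torch.sigmoid(eta / tau)
    beta = gamma * (mu + sigma * torch.randn(p))
    # Negative ELBO = Gaussian negative log-likelihood (unit noise) + KL term.
    loss = 0.5 * torch.sum((Y - X @ beta) ** 2) + kl_spike_slab(phi, mu, sigma)
    opt.zero_grad()
    loss.backward()
    opt.step()

post_mean = (torch.sigmoid(phi_logit) * mu).detach()
print(post_mean[[49, 74, 99, 124, 149]])  # typically close to the +-10 signals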

B.4 Teacher-student networks

The network parameter $\theta$ for the sparse teacher network setting (B) is set as follows: $W=\{W_{1,11}=W_{1,12}=W_{2,11}=W_{2,12}=2.5,\ W_{1,21}=W_{1,22}=W_{2,21}=W_{2,22}=1.5,\ W_{3,11}=3\mbox{ and }W_{3,21}=2\}$; $b=\{b_{1,1}=b_{2,1}=b_{3,1}=1\mbox{ and }b_{1,2}=b_{2,2}=-1\}$.

Figure 6 displays the simulation results for Simulation I under the dense teacher network setting (A). Unlike the results under the sparse teacher network (B), the testing accuracy seems to increase monotonically as $\lambda$ increases (i.e., as the posterior network gets denser). However, as shown, the increase in testing performance is rather slow, which indicates that introducing sparsity has little negative impact on the testing accuracy.

Figure 6: (a) $\lambda\leq\lambda_{opt}$, with $\lambda=\{10^{-200},10^{-150},10^{-100},10^{-50},10^{-20},10^{-5},\lambda_{opt}\}$. (b) $\lambda\geq\lambda_{opt}$, with $\lambda=\{\lambda_{opt},0.1,0.3,0.5,0.7,0.9,0.99\}$. (c) The structure of the target dense teacher network.

Coverage rate

In this paragraph, we explain how we compute the coverage rates of the Bayesian credible intervals reported in the main text. A fixed point $(x_{1}^{(*)},\dots,x_{p}^{(*)})^{\prime}$ is prespecified, and $x^{(1)},\dots,x^{(1000)}$ are 1000 equidistant points from $-1$ to $1$. In each run, we compute the Bayesian credible intervals of the response means (estimated by 600 Monte Carlo samples) for 1000 different inputs $x$: $(x^{(1)},x_{2}^{(*)},\dots,x_{p}^{(*)}),\dots,(x^{(1000)},x_{2}^{(*)},\dots,x_{p}^{(*)})$. This is repeated 60 times, and the average coverage rate (over all different $x$'s and the 60 runs) is reported. Similarly, we replace $x_{2}^{(*)}$ (or $x_{3}^{(*)}$) by $x^{(i)}$ ($i=1,\dots,1000$) and compute the corresponding average coverage rates. The complete coverage rate results are shown in Table 3; a schematic code sketch of the protocol follows the table. Note that Table 1 in the main text shows the $95\%$ coverage of $x_{3}$ for (A) and the $95\%$ coverage of $x_{1}$ for (B).

Table 3: Coverage rates for teacher networks.
Teacher / Method | 90% coverage (%): $x_{1}$, $x_{2}$, $x_{3}$ | 95% coverage (%): $x_{1}$, $x_{2}$, $x_{3}$
Dense SVBNN 93.8 ±\pm 2.84 93.1 ±\pm 4.93 93.1 ±\pm 2.96 97.9 ±\pm 1.01 97.9 ±\pm 1.69 97.5 ±\pm 1.71
VBNN 85.8 ±\pm 2.51 82.4 ±\pm 2.62 86.3 ±\pm 1.88 92.7 ±\pm 2.83 91.3 ±\pm 2.61 91.4 ±\pm 2.43
VD 61.3 ±\pm 2.40 60.0 ±\pm 2.79 64.9 ±\pm 6.17 74.9 ±\pm 1.79 71.8 ±\pm 2.33 76.4 ±\pm 4.75
HS-BNN 83.1 ±\pm 1.67 80.0 ±\pm 1.21 76.9 ±\pm 1.70 88.1 ±\pm 1.13 84.1 ±\pm 1.48 83.5 ±\pm 0.78
Sparse SVBNN 92.3 ±\pm 8.61 94.6 ±\pm 5.37 98.3 ±\pm 0.00 96.4 ±\pm 4.73 97.7 ±\pm 3.71 100 ±\pm 0.00
VBNN 86.7 ±\pm 10.9 87.0 ±\pm 11.3 93.3 ±\pm 0.00 90.7 ±\pm 8.15 91.9 ±\pm 9.21 96.7 ±\pm 0.00
VD 65.2 ±\pm 0.08 63.7 ±\pm 6.58 65.9 ±\pm 0.83 75.5 ±\pm 7.81 74.6 ±\pm 7.79 76.6 ±\pm 0.40
HS-BNN 59.0 ±\pm 8.52 59.4 ±\pm 4.38 56.6 ±\pm 2.06 67.0 ±\pm 8.54 68.2 ±\pm 3.62 66.5 ±\pm 1.86
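
The following sketch outlines the protocol above (illustrative only: sample_response_means is a dummy stand-in for the 600 Monte Carlo draws of the response mean under the VB posterior, the toy true mean is made up, and the outer loop over the 60 repetitions is omitted).

import numpy as np

rng = np.random.default_rng(0)

def true_mean(x):
    # Made-up target mean, standing in for f_0(x); illustration only.
    return np.tanh(x[0]) + 0.5 * x[1]

def sample_response_means(x, n_mc=600):
    # Dummy stand-in for n_mc Monte Carlo draws of the response mean under q-hat.
    return true_mean(x) + 0.1 * rng.standard_normal(n_mc)

def coverage_for_coordinate(x_star, coord, level=0.95, n_grid=1000):
    # Sweep one coordinate over 1000 equidistant points in [-1, 1], keep the
    # others fixed at x_star, and record how often the credible interval
    # covers the true response mean.
    grid = np.linspace(-1.0, 1.0, n_grid)
    hits = 0
    for v in grid:
        x = x_star.copy()
        x[coord] = v
        draws = sample_response_means(x)
        lo, hi = np.quantile(draws, [(1 - level) / 2, 1 - (1 - level) / 2])
        hits += (lo <= true_mean(x) <= hi)
    return hits / n_grid

x_star = np.array([0.3, -0.2, 0.5])
print([round(coverage_for_coordinate(x_star, j), 3) for j in range(3)])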

B.5 Real data regression experiment: UCI datasets

We follow the experimental protocol of Hernández-Lobato and Adams, (2015) and choose five datasets for the experiment. For the small datasets "Kin8nm", "Naval", "Power Plant" and "Wine", we use a single-hidden-layer ReLU network with 50 hidden units. We randomly select $90\%$ of the data for training and $10\%$ for testing, and this random split is repeated 20 times (to obtain standard deviations for our results). We choose minibatch size $m=128$ and learning rate $10^{-3}$, and run 500 epochs for "Naval", "Power Plant" and "Wine", and 800 epochs for "Kin8nm". For the large dataset "Year", we use a single-hidden-layer ReLU network with 100 hidden units, and the evaluation is conducted on a single split. We choose $m=256$ and learning rate $10^{-3}$, and run 100 epochs. For all five datasets, $\lambda$ is chosen as $\lambda_{opt}$: $\log(\lambda^{-1}_{opt})=\log(T)+0.1[(L+1)\log N+\log\sqrt{n}p]$, the same as in the other numerical studies. We let $\sigma^{2}_{0}=2$ and use a grid search to find the $\sigma_{\epsilon}$ that yields the best prediction accuracy. Adam is used for all the datasets in this experiment.
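
As a concrete illustration of this choice (our own arithmetic, assuming that $T$ counts all weights and biases, $L$ is the number of hidden layers and $N$ is the maximum layer width), $\lambda_{opt}$ can be computed as follows:

import math

def lambda_opt(n, p, widths):
    """log(1/lambda_opt) = log(T) + 0.1 * [(L + 1) * log(N) + log(sqrt(n) * p)]."""
    layers = [p] + list(widths) + [1]
    T = sum(layers[i] * layers[i + 1] + layers[i + 1] for i in range(len(layers) - 1))
    L, N = len(widths), max(widths)
    log_inv = math.log(T) + 0.1 * ((L + 1) * math.log(N) + math.log(math.sqrt(n) * p))
    return math.exp(-log_inv)

# "Kin8nm": n = 8192 observations, p = 8 inputs, one hidden layer with 50 units.
print(lambda_opt(n=8192, p=8, widths=[50]))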

We report the testing root mean squared error (RMSE) based on $\widehat{f}_{H}$ (defined in the main text) with $H=30$, and also report the posterior network sparsity $\widehat{s}=\sum^{T}_{i=1}\phi_{i}/T$. For comparison, we list the results of Horseshoe BNN (HS-BNN) (Ghosh and Doshi-Velez,, 2017) and probabilistic backpropagation (PBP) (Hernández-Lobato and Adams,, 2015). Table 4 demonstrates that our method achieves the best prediction accuracy on all the datasets while maintaining a sparse structure.

Table 4: Results on UCI regression datasets.
Dataset | n(p) | Test RMSE: SVBNN, HS-BNN, PBP | Posterior sparsity (%): SVBNN
Kin8nm 8192 (8) 0.08 ±\pm 0.00 0.08 ±\pm 0.00 0.10 ±\pm 0.00 64.5 ±\pm 1.85
Naval 11934 (16) 0.00 ±\pm 0.00 0.00 ±\pm 0.00 0.01 ±\pm 0.00 82.9 ±\pm 1.31
Power Plant 9568 (4) 4.01 ±\pm 0.18 4.03 ±\pm 0.15 4.12 ±\pm 0.03 56.6 ±\pm 3.13
Wine 1599 (11) 0.62 ±\pm 0.04 0.63 ±\pm 0.04 0.64 ±\pm 0.01 59.9 ±\pm 4.92
Year 515345 (90) 8.87 ±\pm NA 9.26 ±\pm NA 8.88 ±\pm NA 20.8 ±\pm NA

B.6 Real data classification experiment: MNIST dataset

The MNIST data are normalized with mean 0.1306 and standard deviation 0.3081. For all methods, we use the same minibatch size $m=256$; the learning rate is $5\times 10^{-3}$ for our method and $3\times 10^{-3}$ for the others; the total number of epochs is 400 and the optimization algorithm is RMSprop. AGP is pre-specified at a $5\%$ sparsity level.

Figure 7: (a) Overlaid images (on the last column). (b) Predictive distributions for the overlaid images: the top, middle, and bottom rows correspond to the overlaid images made from 5 and 6, from 2 and 3, and from 2 and 7, respectively.

Besides the testing accuracy reported in the main text, we also examine our method's ability to quantify uncertainty on the MNIST classification task. We first create ambiguous images by overlaying two examples from the testing set, as shown in Figure 7(a). To perform uncertainty quantification with our method, for each of the overlaid images we generate $\theta_{h}$ from the VB posterior $\widehat{q}(\theta)$ for $h=1,\ldots,100$, calculate the associated predictive probability vector $f_{\theta_{h}}(x)\in\mathbb{R}^{10}$, where $x$ is the overlaid image input, and then use the estimated posterior mean $\widehat{f}(x)=\sum^{100}_{h=1}f_{\theta_{h}}(x)/100$ as the Bayesian predictive probability vector. As a comparison, we also calculate the predictive probability vector for each overlaid image using AGP as a frequentist benchmark. Figure 7(b) shows that the frequentist method gives an almost deterministic answer (i.e., the predictive probability is almost 1 for a certain digit), which is obviously unsatisfactory for this task, while our VB method provides a meaningful measure of uncertainty on these out-of-domain inputs. This demonstrates the advantage of the Bayesian method in uncertainty quantification for classification.
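
A minimal sketch of this Monte Carlo predictive averaging (illustrative only: the one-layer "network" and its variational parameters below are placeholders, not the trained model):

import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Placeholder spike-and-slab variational parameters for a tiny linear "network".
d_in, n_class = 784, 10
phi = torch.full((n_class, d_in), 0.2)      # inclusion probabilities
mu = 0.01 * torch.randn(n_class, d_in)      # slab means
sigma = 0.05 * torch.ones(n_class, d_in)    # slab standard deviations

def sample_predictive(x):
    """Draw one network from the (placeholder) VB posterior and return
    its predictive probability vector for the input x."""
    gamma = torch.bernoulli(phi)
    W = gamma * (mu + sigma * torch.randn_like(mu))
    return F.softmax(W @ x, dim=0)

def bayesian_predictive(x, H=100):
    """Estimated posterior-mean predictive probabilities: average over H draws."""
    return torch.stack([sample_predictive(x) for _ in range(H)]).mean(dim=0)

x = torch.rand(d_in)                        # stand-in for an overlaid image
probs = bayesian_predictive(x, H=100)
print(probs, probs.sum())                   # a full distribution over 10 digits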

B.7 Illustration of CNN: Fashion-MNIST dataset

In this section, we perform an experiment on a more complex task, the Fashion-MNIST dataset. To illustrate the usage of our method beyond feedforward networks, we consider a 2-Conv-2-FC network: the feature maps of the convolutional layers are 32 and 64, and the filter sizes are $5\times 5$ and $3\times 3$ respectively; the padding is 2 for both layers, and each convolutional layer is followed by $2\times 2$ max pooling; the fully-connected part takes the flattened $64\times 8\times 8$ features as input. The activation functions are all ReLUs. The dataset is preprocessed by random horizontal flips. The batch size is 1024, the learning rate is 0.001, and Adam is used for optimization. We run the experiment for 150 epochs.
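
A PyTorch sketch of this backbone (a deterministic version without the spike-and-slab variational treatment; the width of the hidden fully-connected layer is our assumption, since it is not fully specified above):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvNet(nn.Module):
    """2-Conv-2-FC backbone for Fashion-MNIST (28x28 grayscale inputs)."""
    def __init__(self, n_class=10, fc_hidden=256):   # fc_hidden: assumed width
        super().__init__()
        self.conv1 = nn.Conv2d(1, 32, kernel_size=5, padding=2)   # 28x28 -> 28x28
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=2)  # 14x14 -> 16x16
        self.pool = nn.MaxPool2d(2)                               # halves H and W
        self.fc1 = nn.Linear(64 * 8 * 8, fc_hidden)
        self.fc2 = nn.Linear(fc_hidden, n_class)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # -> 32 x 14 x 14
        x = self.pool(F.relu(self.conv2(x)))   # -> 64 x 8 x 8
        x = x.flatten(start_dim=1)
        x = F.relu(self.fc1(x))
        return self.fc2(x)

print(ConvNet()(torch.zeros(1, 1, 28, 28)).shape)   # torch.Size([1, 10])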

We use both SVBNN and VBNN for this task. In particular, VBNN, which uses a normal prior and normal variational distributions, is the fully Bayesian method without compression, and can be regarded as the baseline for our method. Figure 8 shows that our method attains higher accuracy as the number of epochs increases, and that the accuracy then decreases slightly as the sparsity goes down. Meanwhile, the baseline method (full BNN) suffers from overfitting after 80 epochs.

Figure 8: Fashion-MNIST experiment. (a) Accuracy. (b) Sparsity.