
Challenges and Opportunities in Approximate Bayesian Deep Learning for Intelligent IoT Systems

Meet P. Vadera, and Benjamin M. Marlin
Manning College of Information and Computer Sciences
University of Massachusetts Amherst
Amherst, MA, USA
{mvadera, marlin}@cs.umass.edu
Abstract

Approximate Bayesian deep learning methods hold significant promise for addressing several issues that occur when deploying deep learning components in intelligent systems, including mitigating the occurrence of over-confident errors and providing enhanced robustness to out-of-distribution examples. However, the computational requirements of existing approximate Bayesian inference methods can make them ill-suited for deployment in intelligent IoT systems that include lower-powered edge devices. In this paper, we present a range of approximate Bayesian inference methods for supervised deep learning and highlight the challenges and opportunities when applying these methods on current edge hardware. We highlight several potential solutions to decreasing model storage requirements and improving computational scalability, including model pruning and distillation methods.

1 Introduction

Deep learning research has shown promising results in many application areas of artificial intelligence including object detection, language modeling, speech recognition, medical imaging, image segmentation and many more [1, 2, 3, 4, 5, 6, 7, 8, 9]. Key components driving the overall success of deep learning-based methods include advances in learning algorithms, neural network architectures, computing hardware including graphics processing units (GPUs) and tensor processing units (TPUs), and the availability of large labeled data sets.

However, as deep learning models are being deployed in fielded intelligent systems, including intelligent IoT systems, several challenges have become increasingly prominent. These challenges include the deployment-time occurrence of high confidence errors, the need to be robust to out-of-distribution inputs, and the potential for in-domain adversarial inputs. High confidence errors occur when a probabilistic machine learning model ascribes high probability to an incorrect output (e.g., a class label in a classification setting) [10]. Deployed deep learning models can also encounter inputs from data distributions that differ systematically from the distribution they were trained on. In such cases, models need to avoid making high confidence errors. One potential approach to dealing with this issue is to explicitly identify inputs as out-of-distribution and to decline to make predictions for them [11]. Lastly, intelligent systems can also encounter adversarial inputs of different types. Early work in adversarial example generation focused on algorithms for making low-norm changes to inputs that, while being nearly imperceptible to humans, result in models making highly confident errors [12].

One of the important factors contributing to susceptibility to these problems is model uncertainty. Supervised deep learning models are most commonly trained by optimizing their parameters to minimize a training loss function. This approach yields a single locally optimal setting of the model parameters, which is then used to make predictions at deployment time. However, given the large number of parameters used in current models, there typically exist multiple qualitatively different sets of parameters yielding similar training loss function values, but making different predictions on future inputs. Common training procedures select a single set of such parameters, despite the fact that for a given amount of training data, there may be significant uncertainty over what the best model parameters actually are.

Bayesian inference provides a different perspective on the problem of training deep neural network models that attempts to represent model uncertainty and propagate it to the point of issuing predictions and making decisions. In Bayesian inference, the unknown model parameters are formally treated as random variables. The goal becomes to infer the posterior probability distribution over the unknown model parameters given the available data. When the data support multiple distinct interpretations in terms of settings of the model parameters, the model posterior will reflect this uncertainty. Incorporating model uncertainty into prediction and decision-making typically decreases overconfidence in predictions via the Bayesian model averaging effect, and can also increase robustness to out-of-distribution and adversarial inputs [13, 14]. Quantities computed from the model posterior can also be used as inputs for auxiliary problems, including the detection of out-of-distribution examples.

Bayesian inference has the potential to provide a comprehensive theoretical foundation for constructing uncertainty-aware and robust deep learning-based intelligent IoT systems, including systems that must reason robustly over complex multi-modal inputs in the field. However, the application of Bayesian inference methods to deep neural network models is challenging due to the large scale of current state-of-the-art prediction models.

In this paper, we focus on the challenges and opportunities that arise when we consider deploying Bayesian deep learning approaches on IoT edge devices. Specifically, traditional approximate Bayesian learning algorithms represent the model posterior as a large ensemble of models. This can be expensive from the storage perspective, as the cost of storing an ensemble of $S$ models will be $S$ times higher than the cost of storing a single model unless additional compression is used. Further, when making predictions, traditional Bayesian ensembles require processing each input instance through each element of the ensemble. Again, this requires $S$ times more compute when using an ensemble of $S$ models compared to the use of a single model.

The remainder of this paper is organized as follows. In Section 2, we begin by providing a comprehensive discussion of Bayesian supervised learning, approximate Bayesian inference, and the scalability challenges of deploying current Bayesian deep learning model representations on edge hardware. Next, in Sections 3 and 4, we discuss model compression techniques that can be leveraged for compressing Bayesian posterior distributions. These approaches either compress each member of the model ensemble, or compress the entire ensemble into a surrogate model. We point out challenges, opportunities, and open research directions related to both approaches.

2 Bayesian Deep Learning and the Challenge of Scalability

In this section we introduce the fundamental concepts of Bayesian inference for supervised deep learning along with foundational approximation methods. We discuss the scalability challenges when deploying such methods on edge and IoT systems.

2.1 Bayesian Supervised Learning

Supervised learning forms the core of machine learning-based prediction systems. In a supervised learning problem, we are given a dataset $\mathcal{D}$ consisting of input-output pairs $\{(\mathbf{x}_i, y_i) \,|\, 1 \leq i \leq N\}$, where $\mathbf{x}_i \in \mathbb{R}^D$ is the input or feature vector and $y_i \in \mathcal{Y}$ is the output or prediction target. We let $\mathcal{D}^x$ be the dataset of inputs and $\mathcal{D}^y$ be the dataset of outputs. The nature of $\mathcal{Y}$ depends on the task at hand. For classification tasks, $\mathcal{Y}$ is a finite set, whereas for regression tasks, we usually have $\mathcal{Y} = \mathbb{R}$. In this paper, we specifically focus on the classification setting in supervised learning, where the goal is to learn a function $f: \mathbb{R}^D \rightarrow \mathcal{Y}$ that can accurately predict the outputs from the inputs.

In probabilistic supervised learning, we construct the prediction function using a conditional probability model of the form $f_\theta(\mathbf{x}) = p(y|\mathbf{x},\theta)$, where $\theta \in \mathbb{R}^K$ are the model parameters. The conditional likelihood of the outputs given the inputs and the parameters is denoted by $p(\mathcal{D}^y|\mathcal{D}^x,\theta)$. Under the assumption that the outputs are independent and identically distributed given their corresponding inputs, we have that $p(\mathcal{D}^y|\mathcal{D}^x,\theta) = \prod_{i=1}^{N} p(y_i|\mathbf{x}_i,\theta)$. A key ingredient in Bayesian inference, as well as in traditional point-estimated neural networks, is the prior distribution $p(\theta|\lambda)$ over model parameters. As the name suggests, the prior distribution represents our beliefs about the distribution of the model parameters prior to analyzing the data. The prior distribution can have its own hyperparameters, here denoted by $\lambda$ [15].

Bayesian inference involves the computation of the posterior distribution over the unknown model parameters given a training dataset $\mathcal{D}_{tr}$ and the prior. The parameter posterior is obtained using Bayes' theorem, as shown in Equation (1).

$$p(\theta|\mathcal{D}_{tr},\lambda) = \frac{p(\mathcal{D}_{tr}^{y}|\mathcal{D}_{tr}^{x},\theta)\,p(\theta|\lambda)}{\int p(\mathcal{D}_{tr}^{y}|\mathcal{D}_{tr}^{x},\theta)\,p(\theta|\lambda)\,d\theta} \qquad (1)$$

The denominator in the parameter posterior (referred to as the “evidence” term) is intractable to compute for most ML models, including for deep neural networks [15]. As a result, the computation of the exact posterior distribution is intractable. However, in practice, the quantity of interest is often not the parameter posterior distribution itself, but rather low-dimensional expectations under the parameter posterior.

One key posterior expectation in the supervised learning setting is the posterior predictive distribution, which is necessary and sufficient for making maximum probability predictions for outputs given inputs while integrating over the uncertainty in the model parameters. The posterior predictive distribution computation is shown in Equation (2).

$$p(y|\mathbf{x},\mathcal{D}_{tr},\lambda) = \mathbb{E}_{p(\theta|\mathcal{D}_{tr},\lambda)}[p(y|\mathbf{x},\theta)] \qquad (2)$$

Another posterior expectation that is useful in uncertainty quantification is the expected posterior predictive entropy. Posterior predictive entropy (also referred to as the total uncertainty of the predictive distribution) can be decomposed into quantities referred to as expected data uncertainty and knowledge uncertainty [16]. These three forms of uncertainty are related by the equation shown below:

$$\underbrace{\mathcal{H}\big[\mathbb{E}_{p(\theta|\mathcal{D})}[p(y|\mathbf{x},\theta)]\big]}_{\text{Total Uncertainty}} = \underbrace{\mathcal{I}[y,\theta\,|\,\mathbf{x},\mathcal{D}]}_{\text{Knowledge Uncertainty}} + \underbrace{\mathbb{E}_{p(\theta|\mathcal{D})}\big[\mathcal{H}[p(y|\mathbf{x},\theta)]\big]}_{\text{Expected Data Uncertainty}} \qquad (3)$$

Knowledge uncertainty can be efficiently computed as the difference between total uncertainty and expected data uncertainty, both of which are functions of posterior expectations. Recent work has leveraged these uncertainty estimates to explore a range of downstream tasks that rely on uncertainty quantification and decomposition, such as out-of-distribution detection, misclassification detection, and active learning [17, 18, 19, 20, 21, 22, 23]. However, all of these posterior expectations are also intractable to compute exactly for deep learning models. We thus next turn to the problem of approximate Bayesian inference methods.
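As a concrete illustration, the following minimal sketch (in Python/NumPy; the function and variable names are our own) computes the three uncertainty terms in Equation (3) from an $S \times C$ array of per-sample class distributions, as produced by the Monte Carlo approximation methods discussed next.

import numpy as np

def uncertainty_decomposition(probs, eps=1e-12):
    # probs: array of shape (S, C) holding p(y | x, theta_s) for S posterior
    # samples and C classes (each row sums to 1).
    mean_probs = probs.mean(axis=0)                          # MC posterior predictive
    total = -np.sum(mean_probs * np.log(mean_probs + eps))   # H[E[p]]
    per_sample_entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    expected_data = per_sample_entropy.mean()                # E[H[p]]
    knowledge = total - expected_data                        # I[y, theta | x, D]
    return total, knowledge, expected_data

# Example: two sharply confident but disagreeing posterior samples yield high
# knowledge uncertainty and low expected data uncertainty.
probs = np.array([[0.99, 0.01], [0.01, 0.99]])
print(uncertainty_decomposition(probs))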

2.2 Approximate Bayesian Inference for Supervised Learning

As indicated in the previous subsection, the posterior distribution and the posterior expectations needed for Bayesian supervised learning are intractable to compute exactly. To tackle this problem, there is a significant body of work in the area of approximate Bayesian inference techniques. The ultimate goal of these approximation methods is to compute approximate posterior expectations that are close to their theoretical counterparts. Approximate Bayesian methods can be broadly divided into three categories: Markov Chain Monte Carlo (MCMC) methods, surrogate density estimation methods, and other approximation methods. We describe each category of methods below and discuss their edge deployment challenges.

2.2.1 Markov Chain Monte Carlo Methods

MCMC methods provide an approximation to the intractable parameter posterior $p(\theta|\mathcal{D},\lambda)$ via a set of samples drawn from this distribution. MCMC methods simulate a Markov chain that converges to the parameter posterior as its steady-state distribution. The simulated states of the Markov chain after convergence correspond to samples from the parameter posterior. Once we collect a set of parameter samples of the desired size, we can approximate expectations with respect to the parameter posterior using empirical averages over the sampled parameter values [24]. For example, the posterior predictive distribution can be approximated as shown in Equation (4). As we can see, computing the Monte Carlo approximation to the posterior predictive distribution is very similar to computing the predictive distribution of a model ensemble.

$$p(y|\mathbf{x},\mathcal{D},\lambda) = \mathbb{E}_{p(\theta|\mathcal{D},\lambda)}[p(y|\mathbf{x},\theta)] \approx \mathbb{E}_{p_{\mathrm{MC}}(\theta|\mathcal{D},\lambda)}[p(y|\mathbf{x},\theta)] = \frac{1}{S}\sum_{s=1}^{S} p(y|\mathbf{x},\theta_s) \qquad (4)$$
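In code, Equation (4) amounts to averaging the predictive distributions of an ensemble of sampled models. A minimal PyTorch sketch follows, assuming `models` is a list of classifiers with a shared architecture whose parameters are posterior samples:

import torch

@torch.no_grad()
def posterior_predictive(models, x):
    # Average per-sample predictive distributions: (1/S) sum_s p(y | x, theta_s)
    probs = torch.stack([torch.softmax(m(x), dim=-1) for m in models])
    return probs.mean(dim=0)   # shape: (batch_size, num_classes)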

Examples of classical MCMC methods include the Gibbs sampler [25] and the Metropolis-Hastings sampler [26]. The earliest work on MCMC samplers for neural networks traces back to the application of Hamiltonian Monte Carlo [27] methods. While a number of MCMC methods have since been developed with improved properties including slice sampling [28], elliptical slice sampling [29], and Riemann manifold sampling methods [30], these methods all require using all of the available data when computing the likelihood term needed for posterior inference. Although only linear in the number of data cases, this can be a highly expensive operation for large data sets and models and can render MCMC methods practically infeasible in scenarios where stochastic gradient descent (SGD) [31] can be usefully applied to optimize model parameters.

However, recent advances in MCMC approaches have enabled the use of SGD-like mini-batch algorithms, greatly extending the range of applicability of MCMC methods. Prominent examples of such approaches include stochastic gradient Langevin dynamics (SGLD) [32], stochastic gradient Hamiltonian Monte Carlo (SGHMC) [33] and their cyclic learning rate versions as presented in [34].
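As a concrete illustration of this family of methods, the following sketch implements a single SGLD update [32]: a stochastic gradient step on the minibatch estimate of the negative log posterior, plus Gaussian noise whose variance is matched to the step size. The function and argument names are our own.

import torch

def sgld_step(params, minibatch_nll, neg_log_prior, lr, N, n):
    # Stochastic estimate of the negative log posterior U(theta): rescale the
    # negative log likelihood summed over a minibatch of n of the N examples.
    U = (N / n) * minibatch_nll + neg_log_prior
    grads = torch.autograd.grad(U, params)
    with torch.no_grad():
        for p, g in zip(params, grads):
            noise = torch.randn_like(p) * (lr ** 0.5)  # injected Gaussian noise
            p.add_(-0.5 * lr * g + noise)              # theta <- theta - (lr/2) grad U + N(0, lr)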

Another important property that determines the efficacy of MCMC methods is the degree of mixing. The degree of mixing refers to how efficiently the Markov chain traverses the posterior distribution after convergence [35]. Better mixing enables faster collection of a more diverse set of parameter samples. However, the mixing properties of MCMC methods depend on the dimensionality of the parameter space. Modern deep neural networks can have an extremely large number of parameters (millions or more), potentially leading to inadequate mixing.

An alternative to sampling in the original parameter space $\mathbb{R}^K$ is to sample in a reduced dimensional space $\mathbb{R}^{K'}$ for $K' \ll K$ and to project back to the full-dimensional space. This process is termed subspace inference [36]. Subspaces can be generated using singular value decomposition (SVD) applied to SGD iterates, or by generating random projection matrices, among other possible options [36]. These methods introduce bias by restricting the sampler to operate in a subspace of $\mathbb{R}^K$, but reduce variance by enabling the underlying Markov chain to mix faster.
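A minimal sketch of this idea, assuming a random projection subspace and a simple random-walk Metropolis sampler (the subspace inference work [36] uses more sophisticated subspace constructions and samplers), is shown below. Here `log_post` is assumed to return the unnormalized log posterior of a full parameter vector.

import numpy as np

def subspace_rw_metropolis(log_post, theta_hat, K_prime, steps, step_size, seed=0):
    rng = np.random.default_rng(seed)
    K = theta_hat.shape[0]
    P = rng.standard_normal((K, K_prime)) / np.sqrt(K_prime)  # random projection matrix
    z = np.zeros(K_prime)                                     # subspace coordinates
    lp = log_post(theta_hat + P @ z)
    samples = []
    for _ in range(steps):
        z_prop = z + step_size * rng.standard_normal(K_prime)
        lp_prop = log_post(theta_hat + P @ z_prop)
        if np.log(rng.uniform()) < lp_prop - lp:   # Metropolis accept/reject
            z, lp = z_prop, lp_prop
        samples.append(theta_hat + P @ z)          # project back to the full space
    return samples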

While recent advances in MCMC methods are enabling the application of Bayesian inference approaches to increasingly large deep learning models, there remains a sizeable gap in terms of the practical deployment of Monte Carlo-based posterior approximations on edge hardware to support intelligent IoT systems. As noted previously, MCMC methods generate $S$ samples from the model posterior in place of the single set of parameters used by optimization-based deep learning. This means that the deployment of Monte Carlo-based approximate prediction methods requires $S$ times more storage as well as $S$ times more computation without further approximations.

However, the actual increase in prediction latency depends on a number of factors. First, in the edge setting, it is typical for an edge device to process data from on-board sensors. In this case, the edge device may only need to make predictions for a single instance at a time. For GPU or TPU accelerated edge devices, it may thus be possible to run a single input through multiple models with identical structure in parallel, effectively batching the models instead of the inputs, as sketched below. The feasibility of this approach depends on the size of the model to be deployed. Second, many edge platform toolchains have the ability to perform weight quantization and to use mixed precision arithmetic when deploying models. Such transformations can be applied separately to each element of a posterior ensemble to improve edge deployment efficiency. We note that the effect of such quantization has not been well studied for Bayesian posterior ensembles, but there is reason to believe that they might tolerate more aggressive quantization than single models due to the average that is taken over the outputs of the posterior ensemble when making predictions.
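As an illustration of model batching, recent PyTorch versions provide torch.func utilities for vectorizing a forward pass over stacked copies of a model's parameters. The sketch below follows that recipe; whether it maps efficiently onto a particular edge accelerator and toolchain is exactly the open question discussed above.

import copy
import torch
from torch.func import stack_module_state, functional_call

def batched_ensemble_predict(models, x):
    # Stack the parameters of S structurally identical models along a new
    # leading dimension, then vmap a functional forward pass over that dimension.
    params, buffers = stack_module_state(models)
    base = copy.deepcopy(models[0]).to("meta")   # architecture only; no weight storage

    def forward_one(p, b, inp):
        return functional_call(base, (p, b), (inp,))

    logits = torch.vmap(forward_one, in_dims=(0, 0, None))(params, buffers, x)
    return torch.softmax(logits, dim=-1).mean(dim=0)  # Monte Carlo posterior predictive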

Lastly, we note that the memory constraints of edge devices can also play a non-trivial role in prediction latency when using ensembles due to the time required to load models. In some instances, the time needed to load models is significantly higher than the time needed to use the loaded model to make predictions for a single input. When this is the case, it can dramatically increase prediction latency. Indeed, the number of models that can be present in memory simultaneously can currently be the effective limiting factor on the size of the MCMC posterior ensemble that can usefully be deployed.

An interesting opportunity motivated by these observations is the development of optimized model loading methods specifically for ensembles of models with identical architectures. In this case, the structure of the computation graph underlying the architecture should only need to be instantiated once, and it may be possible to more rapidly iterate through different sets of parameters for the same architecture than is possible using current libraries that do not include such optimizations [37].

2.2.2 Surrogate Density Methods

An alternate approach to MCMC methods is approximating the original posterior distribution via an analytically tractable parameterized surrogate distribution. Thus, given the original posterior distribution $p(\theta|\mathcal{D}_{tr},\lambda)$, surrogate density methods aim to approximate the true posterior using an auxiliary distribution $q(\theta|\phi)$, where $\phi$ are the auxiliary parameters [38, 39, 40, 41]. A common choice of auxiliary distribution is a multivariate Gaussian with a diagonal covariance matrix, $\mathcal{N}(\theta; \mu, \Sigma)$, also known as mean-field variational inference. Here, the auxiliary parameters are $\phi = [\mu, \Sigma]$. The main reason mean-field variational inference is popular is the simple "re-parameterization trick" that makes sampling from the auxiliary distribution straightforward [42]. With the re-parameterization trick, we can sample $\theta_k \sim \mathcal{N}(\mu_k, \Sigma_{kk})$ by first drawing $\eta \sim \mathcal{N}(0,1)$, followed by the linear transformation $\theta_k = \mu_k + \eta\sqrt{\Sigma_{kk}}$. This allows us to backpropagate through the variational parameters while drawing samples of the model parameters to approximate the objective functions used for learning.
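The re-parameterization trick is only a few lines in practice. A minimal sketch, assuming a diagonal Gaussian variational posterior parameterized by a mean vector and a log-variance vector:

import torch

def sample_mean_field(mu, log_var):
    # theta_k = mu_k + eta_k * sigma_k with eta ~ N(0, I); gradients flow
    # to (mu, log_var) through this deterministic transform.
    eta = torch.randn_like(mu)
    return mu + eta * torch.exp(0.5 * log_var)

# Example: differentiate a Monte Carlo objective w.r.t. the variational parameters.
mu = torch.zeros(10, requires_grad=True)
log_var = torch.zeros(10, requires_grad=True)
theta = sample_mean_field(mu, log_var)
loss = (theta ** 2).sum()   # stand-in for a Monte Carlo ELBO term
loss.backward()             # populates mu.grad and log_var.grad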

Now, for estimating the auxiliary parameters, we need an objective function that can measure the discrepancy between the surrogate distribution and the ground truth posterior. Needless to say, this objective function must also be computationally tractable. The most commonly used discrepancy function is the Kullback-Leibler (KL) divergence as shown below [43].

$$\mathrm{KL}(p(\theta)\,||\,q(\theta)) = \mathbb{E}_{p(\theta)}\left[\log\left(\frac{p(\theta)}{q(\theta)}\right)\right]$$

The KL divergence is a directional divergence, and thus is not symmetric in its arguments. This can result in two different objectives depending on the directionality. When the surrogate posterior is used as the first argument, the result is the variational inference (VI) framework [38, 39]. When the surrogate posterior is used as the second argument, the result is the expectation propagation (EP) framework [41]. This often leads to VI methods having mode-seeking behavior, as the surrogate is not forced to cover the full support of the original posterior. EP methods, on the other hand, are forced to cover the support of the original posterior, which can lead to poor mode estimation. These two extremes can also be interpolated using more generalized divergence measures, including the alpha divergence [44].

These methods also suffer from scalability issues due to the need to compute the log likelihood and its gradient over the entire dataset. Similar to the stochastic gradient versions of MCMC methods introduced earlier, advances in the past decade have led to more scalable methods in this family as well, including stochastic variational inference [45], which is able to accommodate large-scale datasets using mini-batch gradients.

In addition to mean field VI, other more advanced approximations are possible, such as multiplicative normalizing flows (MNF) [46], Bayesian hypernetworks [47], and rank-1 factorization [48]. In multiplicative normalizing flows, we choose a simple density function such as the isotropic Gaussian distribution, and use a bijective function to transform the samples drawn from the simple density function to form more complex distributions. The Bayesian hypernetworks approach builds upon the MNF approach and uses a neural network model to transform the samples drawn from a simpler density function to model a more complex posterior distribution. Finally, the rank-1 factorization approach represents the model parameters as a product of two rank-1 factors, thereby reducing the dimensionality of the base distribution of the approximate posterior. For example, a parameter matrix $W \in \mathbb{R}^{M \times N}$ can be re-written as the matrix product of $W_1 \in \mathbb{R}^{M \times 1}$ and $W_2 \in \mathbb{R}^{1 \times N}$. This effectively reduces the number of parameters from $O(M \cdot N)$ to $O(M + N)$.

MC Dropout is a particularly interesting approach, which is equivalent to approximate variational inference under specific assumptions [49]. Dropout itself was first introduced as a regularization technique where during every training iteration a pre-determined proportion of activations is randomly set to zero to reduce overfitting. At test time dropout is switched off and all units participate in making predictions [50]. In MC Dropout, by contrast, dropout is used at prediction time. This leads to a stochastic forward pass through the point-estimated model. Multiple forward passes through the model are used and the predictive distributions are averaged. This procedure is equivalent to sampling from a specific approximate variational posterior, but has the advantage that it is very easy to implement.
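Because MC Dropout only requires repeated stochastic forward passes, a prediction-time sketch is short. The following assumes a PyTorch model containing standard torch.nn.Dropout layers; we re-enable only those layers at prediction time and average $S$ predictive distributions:

import torch

@torch.no_grad()
def mc_dropout_predict(model, x, S=50):
    model.eval()
    for m in model.modules():
        if isinstance(m, torch.nn.Dropout):
            m.train()   # keep only the dropout layers stochastic at prediction time
    probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(S)])
    return probs.mean(dim=0)   # average of S stochastic forward passes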

The major drawback of surrogate density methods is that they introduce bias into the estimation of the posterior distribution, unless the true posterior belongs to the family of auxiliary distributions. The degree of bias will depend on the functional form of the auxiliary distribution, and the divergence measure used to estimate the parameters of the auxiliary distribution.

In addition, under these methods, we still face hurdles in computing posterior expectations including the approximate posterior predictive distribution. In particular, even if q(θ|ϕ)q(\theta|\phi) is a simple parametric distribution, the expectation 𝔼q(θ|ϕ)[p(y|𝐱,θ)]\mathbb{E}_{q(\theta|\phi)}[p(y|\mathbf{x},\theta)] still usually cannot be computed analytically due to the non-linearity of p(y|𝐱,θ)p(y|\mathbf{x},\theta). As a result, we have to again resort to Monte Carlo approximation, and draw samples from the approximate variational posterior. Generally, these methods trade-off the potential bias in the surrogate parameter posterior for the ability to draw independent samples once the surrogate posterior parameters have been estimated.

The deployment considerations for models derived using variational inference methods are distinct from those of MCMC methods. One of the benefits of mean field variational inference over MCMC methods is that the representation of the posterior approximation only requires twice the storage of a single model. However, for this to be useful at deployment time, it must be possible to efficiently sample models from the approximate parameter posterior on the fly. The same is true of approaches like MC Dropout that require the sampling of random masks at inference time. In the case of MC Dropout, for example, current model conversion pipelines from PyTorch [51] to TensorRT [37] via ONNX [52] strip out stochastic dropout layers. It appears that for such approaches to have practically useful storage advantages over MCMC methods on current edge hardware, additional development may be needed to enable the automated deployment of computation graphs with stochastic elements. An alternative for hardware where the computational cost of on-the-fly generation of samples from the variational posterior is prohibitive is to pre-generate and deploy a fixed variational approximate posterior ensemble. However, in this case, all the considerations that apply to the generation of MCMC posterior ensembles will also apply.

2.2.3 Additional Approximate Bayesian Inference Methods

With the emergence of Generative Adversarial Networks (GANs) [53], there has been work on learning implicit generative model representations of the parameter posterior that can be used at test time to draw an arbitrary but finite number of samples from this posterior approximation. The existing work in this area involves training GANs that can approximate the posterior distribution asserted by SGLD [17, 54]. To compute the approximate posterior predictive distribution, we yet again would use Monte Carlo approximation as shown in the previous subsections. An advantage of these methods is that in theory we do not need to store posterior samples for use during inference and we can also determine the number of posterior samples to use on the fly. However, the same deployment issues noted for variational inference apply here as well in terms of the feasibility of on-the-fly sampling from an auxiliary model. In addition, this approach requires allocating additional memory for the auxiliary GAN model.

Deep Ensembles [55] have also shown strong performance in terms of representing predictive distributions, and can also be considered as an approximation to Bayesian inference [56]. However, deep ensembles can be expensive during training, as we need to train each model in the ensemble from scratch. To alleviate this issue, snapshot ensembles [57] use a cyclical learning rate schedule to generate more ensemble members in less training time. However, these methods learn a mixture of posterior modes, which is different from an approximation to the full posterior. Regardless, Deep Ensembles have been shown to yield better performance relative to using small numbers of samples in a traditional MCMC approach. Deployment considerations similar to those of MCMC ensembles also apply to deep ensembles.

2.2.4 Discussion

As described in this section, different methods for approximating Bayesian inference have different strengths and weaknesses in terms of their theoretical and practical properties. The first such property is the bias-variance tradeoff in approximate posterior estimation. MCMC methods provide an unbiased estimate of the posterior in theory, but this relies on convergence of the underlying Markov chains, which can be time-consuming. Mixing of the Markov chains is another important factor determining the diversity of the approximate posterior ensemble. On the other hand, we can learn surrogate posterior distributions much more efficiently using stochastic gradient descent, but they introduce bias that depends on the functional choice of the auxiliary distribution and the divergence measure used. However, while important, these factors largely impact training time and the quality of approximations, not their subsequent edge deployability.

In terms of edge deployability, a common aspect among all the methods discussed in this section is that they require the application of Monte Carlo-based approximations at test time. When $S$ samples are used, the computational complexity of making predictions is $S$ times higher than if a single instance of the base model is used. As noted previously, the use of mixed precision arithmetic can help to accelerate these computations, but a reduction from 32-bit to 16-bit representations will typically at most halve the prediction latency of deployed models. This is likely to be far from sufficient when considering closing the prediction latency gap between single optimization-based models and model ensembles.

In addition, as noted previously, MCMC methods have $O(S \cdot K)$ storage cost, where $K$ is the number of parameters in the base model and $S$ is the number of samples. By contrast, mean-field variational inference has $O(2K)$ storage cost, but there appear to be practical challenges to actually realizing the benefits of the reduced storage cost of variational methods on current edge platforms. In the next sections, we turn to possible modeling and algorithmic approaches to improving the edge scalability of approximate Bayesian inference methods in terms of both computation time and storage costs.

3 Model Pruning Approaches to Improving the Scalability of Bayesian Deep Learning

As described in the previous section, the storage and computational scalability properties of Bayesian inference for neural network models can be a significant barrier to edge and IoT deployment, despite their potential benefits in the context of intelligent systems. In this section, we discuss model pruning and sparsification approaches that aim to reduce the storage and computation requirements for Bayesian ensembles. These approaches can be broadly divided into unstructured and structured pruning approaches, as we describe below.

3.1 Unstructured Pruning

Optimization-based unstructured pruning methods aim to compress neural network models by sparsifying their weight matrices. The earliest work on unstructured neural network pruning dates back to the Optimal Brain Damage method [58]. In the Optimal Brain Damage approach, the authors present a Taylor series approximation of the objective function and show that, under the assumption that the Hessian matrix is diagonal, weights whose second-order terms go to zero can be removed with little to no loss in performance. The follow-up Optimal Brain Surgeon method [59] highlights that the diagonal Hessian assumption can be limiting, leading to the removal of connections that should be retained. It uses second-order derivatives to decide which connections to remove, and obtains better generalization on held-out test data compared to the Optimal Brain Damage approach.

Another line of early work in unstructured pruning removes parameters based on their magnitudes. The earliest work in this area dates back to [60]. Since then, magnitude-based unstructured pruning has been revisited in more detail, and it has been found that iterative pruning with fine-tuning can help prune more parameters of a neural network while retaining better predictive performance compared to one-time post-hoc pruning [61].

In the iterative pruning with fine-tuning approach, we define a pruning rate $p$ (the percentage of weights to prune) for each pruning cycle and remove all weights with magnitudes in the bottom $p$ percentile. We then fine-tune the model by optimizing the unpruned weights to a desired level of convergence. We repeat these pruning and fine-tuning cycles until we achieve the desired overall sparsity level. A sketch of one such cycle is shown below.
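A minimal PyTorch sketch of a single magnitude-pruning cycle (names are our own; real pipelines would typically use a library such as torch.nn.utils.prune):

import torch

def magnitude_prune_(weights, p):
    # Pool the magnitudes of all currently active (non-zero) weights and
    # compute the pruning threshold at the p-th quantile, p in [0, 1].
    flat = torch.cat([w.abs().flatten() for w in weights])
    threshold = torch.quantile(flat[flat > 0], p)
    masks = []
    with torch.no_grad():
        for w in weights:
            mask = (w.abs() > threshold).to(w.dtype)
            w.mul_(mask)      # remove the bottom-p fraction of active weights
            masks.append(mask)
    return masks              # hold these fixed (e.g., mask gradients) while fine-tuning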

Experimental results on iterative pruning with fine-tuning have shown that for the ImageNet data set [62], the authors were able to reduce the number of active parameters in the AlexNet model [63] by 9×, and reduce the number of active parameters in VGG16 [64] by 16×. It is important to note that regularization methods (such as the application of an $\ell_2$ or $\ell_1$ penalty) can help to ensure that the magnitudes of weights that are not contributing to predictive performance are driven to zero. Building upon this, there has been work on further compressing the model parameters using quantization and Huffman coding [65] after pruning. There has also been additional work on weight magnitude threshold-based pruning that allows for restoring connections [66, 67, 68].

The Lottery Ticket Hypothesis [69] proposed an iterative pruning method that is very similar to the method of [61], which we refer to as Iterative Pruning with Rewinding. This approach differs from basic iterative pruning in that following each pruning iteration, the weights used to initialize the next iteration of the algorithm are formed by applying the current pruning mask to the original random weight vector (generated during initialization), instead of the weight vector obtained at the end of the previous pruning iteration. Effectively, the active weights are rewound to their initial values at the start of each round of iterative pruning. Through this approach, the authors found that there are sparse substructures within deep neural networks that, when trained from their original random initializations, achieve performance very similar to that of the original dense networks with no pruning.

While unstructured pruning methods are well-studied in the context of optimization-based deep learning, they have not received as much attention in the Bayesian deep learning literature, despite their potential to reduce the storage complexity of posterior model ensembles. Several applications of unstructured sparsity are possible and deserve future study. First, MCMC methods can be used to generate a posterior ensemble consisting of a set of models. A single round of pruning can then be applied to each of the models to remove a specified percentage of the smallest weights, resulting in an ensemble of weight-sparse models. However, optimization-based fine-tuning cannot be applied in this setting without the potential for significantly altering the distribution that the ensemble represents. Importantly, there is potential for the sparsified models in such an ensemble to tolerate much higher levels of sparsity than an individual model due to the averaging that occurs over the elements of the ensemble when making predictions.

An alternative approach is to use optimization-based iterative pruning and fine-tuning to derive a sparse network structure and a starting set of parameter values. MCMC methods can then be initialized to sample within this sparse structure, starting from the parameter values found using optimization. Optimization-based iterative pruning approaches can also potentially be combined with variational Bayesian deep learning methods. In the case of the classical Gaussian mean field approximation, both weights that are close to zero and weights that have high posterior variance could potentially be pruned from the model. Since variational inference is fundamentally an optimization-based procedure, it can also be composed with an iterative pruning and fine-tuning process to more closely mimic the process that is used in standard optimization-based deep learning, as described above.

We note that while unstructured pruning has been shown to preserve prediction accuracy even at high levels of weight sparsity for large optimization-based models, whether and how these savings convert into practical savings in deployed systems is fairly complicated. We first consider the storage properties of weight-sparse models. In particular, the parameter matrices for a weight-sparse model must be stored in a compressed format to yield any storage benefit.

Sparse matrices are typically stored either in coordinate list (COO) format or compressed sparse column/row (CSC/CSR) format. In these formats, at a high level, we store the indices and values of the non-zero elements of the matrix. For the COO format, for example, the space complexity of the resulting data structure is $\mathcal{O}(3N)$, where $N$ is the number of non-zero elements. Importantly, how we compose weight sparsity with Bayesian deep learning can have significant impact on the storage complexity of the resulting models.

In the first approach described above, we considered independently sparsifying each element of the posterior ensemble. If we retain a total of $N$ non-zero weights for each of the $S$ models, the total storage required is $O(3NS)$. In the second approach, we considered sampling within a fixed sparse structure. If this sparse structure has $N$ non-zero weights, the total required storage is theoretically $O(2N + NS)$, since we only need to store the indices of the non-zero elements once. For $S \gg 2$, this reduces the required storage of the sparse ensemble by about $66\%$ relative to standard COO format. While no current sparse matrix formats support this optimization, it would be straightforward to implement from a storage perspective, as sketched below.
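An illustrative (hypothetical) NumPy data structure implementing the shared-index scheme; as noted above, no standard sparse matrix library currently offers this format:

import numpy as np

class SharedStructureSparseEnsemble:
    def __init__(self, rows, cols, values_per_model, n_rows):
        self.rows = rows                 # shape (N,): row indices shared by all S models
        self.cols = cols                 # shape (N,): column indices shared by all S models
        self.values = values_per_model   # shape (S, N): one value vector per model
        self.n_rows = n_rows             # total storage: 2N indices + N*S values

    def matvec(self, s, x):
        # Compute W_s @ x for ensemble member s without densifying the matrix.
        out = np.zeros(self.n_rows)
        np.add.at(out, self.rows, self.values[s] * x[self.cols])
        return out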

Lastly, we note that realizing the computational savings of weight-sparse networks with GPUs and TPUs is also currently a challenge. PyTorch, for example, has sparse matrix support that is currently in beta testing. Existing libraries for edge GPU/TPU-accelerated architectures such as TensorRT currently only support highly restricted sparsity patterns. As a result, fully realizing the theoretical computational benefit of weight-sparse Bayesian deep learning approaches on edge platforms will require additional support for GPU/TPU-accelerated linear algebra operations over sparse matrices.

3.2 Structured Pruning

Optimization-based structured pruning methods apply similar pruning principles to those of unstructured pruning, but with the aim of pruning larger structural elements such as entire hidden units or entire convolutional kernels. The simplest structured pruning methods extend the iterative pruning approaches described above to also remove hidden units or convolutional kernels that have no incoming connections [61]. However, the induced sparsity patterns can be such that few units are actually pruned. These basic approaches can be improved by modifying the pruning criteria to prune units whose incoming weight norms are in the bottom $p$ percentile across all active units [70]. This approach ensures that a desired number of units is pruned from the network on each round.

Another set of approaches leverages group LASSO (least absolute shrinkage and selection operator) regularization to encourage a group of incoming weights to go to zero simultaneously [71, 72, 73, 74]. To apply this approach, we must first partition the model parameters in the parameter vector $\phi$ into $K$ groups $\mathcal{G}_k$. When using the group $\ell_1/\ell_2$ regularizer, the regularizer takes the form $R(\phi) = \sum_{k=1}^{K} \big(\sum_{j \in \mathcal{G}_k} \phi_j^2\big)^{1/2}$. For a feedforward model, we place all the incoming weights for each hidden unit into a group. Similarly, we collect all the incoming weights for a particular channel in a convolution layer into a group. The regularizer then tends to encourage either all the weights in a group to go to zero or most of the weights in a group to be non-zero. This can make it much easier to identify structures for pruning using weight norm-based criteria, and can require less fine-tuning, as most inputs to a unit will already be close to zero prior to pruning. A sketch of the regularizer is shown below.
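A minimal PyTorch sketch of the group $\ell_1/\ell_2$ penalty, assuming one group per output unit of a linear layer and one group per output channel of a convolution layer:

import torch

def group_lasso_penalty(model):
    # R(phi) = sum_k ( sum_{j in G_k} phi_j^2 )^{1/2}, with one group per
    # output unit (Linear) or output channel (Conv2d).
    penalty = 0.0
    for module in model.modules():
        if isinstance(module, (torch.nn.Linear, torch.nn.Conv2d)):
            w = module.weight.flatten(start_dim=1)  # shape: (groups, fan-in)
            penalty = penalty + w.norm(p=2, dim=1).sum()
    return penalty

# Training usage: loss = task_loss + reg_weight * group_lasso_penalty(model)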

Structured pruning can also be composed with approximate Bayesian deep learning methods in multiple ways. As with unstructured pruning, we can separately apply structured pruning to each element of a posterior ensemble. However, the chance that all incoming weights for a unit will be close to zero is many times smaller than the chance that a single weight will be small. In this case, the use of an explicit sparsity-inducing prior, such as a spike-and-slab prior, would very likely be necessary to obtain meaningful pruning. The approach of using optimization-based structured pruning methods to select a structure and an initial set of weights from which to run MCMC-based methods is also applicable and is a potentially promising approach. Again, there are close relationships between structured sparsity methods for optimization-based deep learning and variational Bayesian deep learning. There is also specific prior work on sparsity-inducing priors for variational Bayesian methods, including the use of horseshoe priors [75] for approximate Bayesian inference [76, 77].

Finally, we note that there is a significant gap between the practical implementation of models that result from structured sparsity methods and unstructured sparsity methods. This is because structured sparsity methods learn more compact dense models that do not need sparse matrix support to realize storage and computational savings. This is a significant advantage considering the current state of sparse matrix support in software frameworks for edge platforms. However, structured sparsity methods have been observed to require more non-zero weights than weight-sparse models to obtain similar levels of predictive performance [78]. On the other hand, the run time of linear algebra operations for sparse matrices on real hardware can have significantly more overhead than when operating over dense matrices. The trade-offs between the theoretical and practical properties of these approaches require further study, and the best approach to use is likely to be highly dependent on the hardware and software support available on a particular deployment platform.

4 Model Distillation Approaches to Improving the Scalability of Bayesian Deep Learning

In this section we describe posterior distillation methods, which provide an alternative to pruning methods and can decrease both the storage cost and the computational cost of applying Bayesian deep learning methods. Examples of methods in this area include Bayesian Dark Knowledge (BDK) [79] and Generalized Posterior Expectation Distillation (GPED) [19]. These methods aim to compress the computation of expectations under the model posterior into neural networks whose storage and computational complexity can be set to match deployment constraints. As a result, distillation approaches can expose a flexible trade-off between resource use and posterior approximation accuracy.

In the case of BDK, the selected posterior expectation is the posterior predictive distribution. BDK approximates the posterior predictive distribution by learning an auxiliary neural network model to compress the Monte Carlo approximation to the posterior predictive distribution, $\mathbb{E}_{p_{\mathrm{MC}}(\theta|\mathcal{D},\lambda)}[p(y|\mathbf{x},\theta)]$. The GPED approach extends the BDK approach to the case of distilling arbitrary posterior expectations. GPED has been used to directly approximate model and data uncertainty via posterior expectation distillation.
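The core of such an approach is a standard distillation loss against the Monte Carlo posterior predictive. The following is a simplified offline sketch in the spirit of BDK [79]; the original method interleaves distillation with SGLD sampling rather than distilling from a fixed ensemble:

import torch
import torch.nn.functional as F

def distillation_loss(student, posterior_models, x):
    # Teacher target: Monte Carlo posterior predictive from posterior samples.
    with torch.no_grad():
        teacher = torch.stack(
            [torch.softmax(m(x), dim=-1) for m in posterior_models]).mean(dim=0)
    log_q = F.log_softmax(student(x), dim=-1)
    return -(teacher * log_q).sum(dim=-1).mean()   # cross-entropy against soft targets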

The major advantage of this family of approaches is that they can drastically reduce the deployment time computational complexity of posterior predictive inference relative to using a Monte Carlo average computed using many samples. However, a shortcoming of this family of approaches is that they only capture the target posterior expectations. Thus, they do not have the ability to compute other statistics without being re-trained.

Ensemble distribution distillation (EnD2) is a closely related approach that aims to distill the collective predictive distribution outputs of the models in an ensemble into a neural network that predicts the parameters of a Dirichlet distribution [18]. The goal is to preserve more information about the distribution of outputs of the ensemble in such a way that multiple statistics of the ensemble's outputs can be efficiently approximated. We note that the EnD2 approach can be applied to any ensemble of models producing categorical output distributions and can thus be applied to distill the predictive distributions of the elements of a posterior ensemble obtained using MCMC methods as well as those obtained from a variational approximation. We also note that this approach can be extended to approximate the distribution of other posterior quantities by distilling into approximating models that output other types of distributions.
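An EnD2-style objective can be sketched as maximizing the Dirichlet log-likelihood of the per-member teacher distributions under the student's predicted concentration parameters. The following is a simplified sketch, not the exact objective of [18]:

import torch

def dirichlet_distillation_loss(student, ensemble_probs, x, eps=1e-6):
    # Student predicts Dirichlet concentrations alpha(x) > 1 via exp(logits) + 1.
    alphas = torch.exp(student(x)) + 1.0                               # (batch, C)
    log_norm = torch.lgamma(alphas.sum(-1)) - torch.lgamma(alphas).sum(-1)
    log_probs = torch.log(ensemble_probs + eps)                        # (S, batch, C)
    # Dirichlet log-likelihood of each teacher member's output distribution.
    log_lik = log_norm + ((alphas - 1.0) * log_probs).sum(-1)          # (S, batch)
    return -log_lik.mean()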

There is an interesting trade-off between approaches that distill the full parameter posterior distribution into models that predict specific posterior expectations, such as BDK and GPED, and approaches that distill aspects of the posterior into distributions. As noted above, expectation distillation is a less general approach, and models must be re-trained to extend coverage to additional expectations. On the other hand, EnD2 gains generality by predicting parametric distributions, from which a wider range of posterior properties can be computed. The drawback of this approach is that it introduces irreducible bias via the selection of a particular approximating family of distributions, in a way that is similar to variational inference. As a result, although a wider range of posterior properties can be estimated using EnD2, their accuracy can vary, and all can be biased if the approximating family is not a good match to the true posterior.

In terms of deployment on edge hardware, one of the distinct advantages of distillation-based approximations over both MCMC and variational methods is that the deployment time computational complexity of approximating posterior expectations can be completely controlled via the selection of the model architecture that the posterior is distilled into. Once this architecture is selected and the distillation process is carried out, the result is a single model (or one model per posterior expectation of interest) that can then be deployed. This can be much simpler than deploying a posterior ensemble in the case of MCMC methods.

However, maximizing the performance of distillation-based methods for a given computational and storage budget requires performing an architecture search to determine the optimal architecture for the approximating model. Since distillation-based learning is itself an optimization problem, one method to perform this search is to start with a large model architecture and apply iterative pruning and fine-tuning as described in the previous section. Both weight-level and structure-level pruning are possible, with the deployment caveats noted in the previous section.

The GPED paper [19] uses such an approach to expose storage-performance and computation-performance Pareto frontiers by varying the level of pruning used. That work shows that pruning-based selection of model architectures is superior to more basic approaches like searching the space of layer-width multipliers. More generally, it demonstrates the potential for significant sensitivity to the architecture of the approximating model. The question of how to most efficiently search for an optimal distillation architecture remains an open question.

5 Conclusions and Discussion

In this paper we have provided a comprehensive overview of approaches to approximate Bayesian inference from the specific perspective of the challenges that emerge when considering the deployment of Bayesian deep learning models on edge hardware. We have presented a number of potential research directions aimed at improving the deployment scalability of Bayesian deep learning methods, with a focus on pruning and distillation-based model compression methods. Realizing such improvements will be crucial to enabling the practical use of Bayesian deep learning methods on edge hardware and to the development of intelligent IoT systems.

Lastly, we note that the nature of current edge hardware is likely to force trade-offs between multiple facets of the performance of Bayesian deep learning models deployed at the edge, including predictive performance, uncertainty quantification ability, robustness, and prediction latency. Methods like distillation and pruning can provide flexible trade-offs between these facets of performance, and it is imperative that candidate models are comprehensively and simultaneously evaluated with respect to all facets of performance to ensure that gains with respect to one subset of facets do not come at an unacceptable cost with respect to other subsets [20].

It is also worth emphasizing that the approximation approaches that yield optimal Bayesian deep learning deployments with respect to one edge platform could be far from optimal for another platform (or could even be infeasible), depending on the properties of the platform, including available storage and the speed and level of parallelism of computation. The problem of learning to predict optimal deployment configurations across platforms and tasks is itself a very interesting and important research challenge in this space.

Acknowledgement

This work was partially supported by the US Army Research Laboratory under cooperative agreement W911NF-17-2-0196. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the US government.

References

  • [1] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with deep recurrent neural networks,” in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2013, pp. 6645–6649.
  • [2] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269, 2017.
  • [3] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT, 2018.
  • [4] G. Singh, M. P. Vadera, L. Samavedham, and E. C. H. Lim, “Machine learning-based framework for multi-class diagnosis of neurodegenerative diseases: A study on parkinson’s disease,” IFAC-PapersOnLine, vol. 49, pp. 990–995, 2016.
  • [5] G. Singh, M. Vadera, L. Samavedham, and E. C.-H. Lim, “Multiclass diagnosis of neurodegenerative diseases: A neuroimaging machine-learning-based approach,” Industrial & Engineering Chemistry Research, vol. 58, no. 26, pp. 11498–11505, 2019. [Online]. Available: https://doi.org/10.1021/acs.iecr.8b06064
  • [6] K. Suzuki, “Overview of deep learning in medical imaging,” Radiological physics and technology, vol. 10, no. 3, pp. 257–273, 2017.
  • [7] M. P. Vadera and B. M. Marlin, “Poster abstract: Investigating fusion-based deep learning architectures for smoking puff detection,” in 4th IEEE/ACM International Conference on Connected Health: Applications, Systems and Engineering Technologies, CHASE 2019, Arlington, VA, USA, September 25-27, 2019.   IEEE, 2019, pp. 11–12. [Online]. Available: https://doi.org/10.1109/CHASE48038.2019.00011
  • [8] S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos, “Image segmentation using deep learning: A survey,” CoRR, vol. abs/2001.05566, 2020. [Online]. Available: https://arxiv.org/abs/2001.05566
  • [9] M. P. Vadera, S. Ghosh, K. Ng, and B. M. Marlin, “Post-hoc loss-calibration for bayesian neural networks,” in UAI, 2021.
  • [10] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in International Conference on Machine Learning, 2017, pp. 1321–1330.
  • [11] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.   OpenReview.net, 2017. [Online]. Available: https://openreview.net/forum?id=Hkg4TI9xl
  • [12] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in ICLR, 2015.
  • [13] A. G. Wilson, “The case for bayesian deep learning,” 2020.
  • [14] M. P. Vadera, S. N. Shukla, B. Jalaian, and B. M. Marlin, “Assessing the adversarial robustness of monte carlo and distillation methods for deep bayesian neural network classification,” in AAAI SafeAI Workshop, 2020.
  • [15] R. M. Neal, Bayesian Learning for Neural Networks.   Berlin, Heidelberg: Springer-Verlag, 1996.
  • [16] S. Depeweg, J. M. Hernández-Lobato, F. Doshi-Velez, and S. Udluft, “Decomposition of uncertainty for active learning and reliable reinforcement learning in stochastic systems,” ArXiv, vol. abs/1710.07283, 2017.
  • [17] K.-C. Wang, P. Vicol, J. Lucas, L. Gu, R. Grosse, and R. Zemel, “Adversarial distillation of bayesian neural network posteriors,” arXiv preprint arXiv:1806.10317, 2018.
  • [18] A. Malinin, B. Mlodozeniec, and M. Gales, “Ensemble distribution distillation,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=BygSP6Vtvr
  • [19] M. P. Vadera, B. Jalaian, and B. M. Marlin, “Generalized bayesian posterior expectation distillation for deep neural networks,” in UAI, 2020.
  • [20] M. P. Vadera, A. D. Cobb, B. Jalaian, and B. M. Marlin, “URSABench: Comprehensive Benchmarking of Approximate Bayesian Inference Methods for Deep Neural Networks,” in ICML Workshop on Uncertainty and Robustness in Deep Learning, 2020.
  • [21] N. Houlsby, F. Huszár, Z. Ghahramani, and M. Lengyel, “Bayesian active learning for classification and preference learning,” arXiv preprint arXiv:1112.5745, 2011.
  • [22] C. Holtsclaw, M. P. Vadera, and B. M. Marlin, “Towards joint segmentation and active learning for block-structured data streams,” in The ACM SIGKDD Conference on Knowledge Discovery and Data Mining Workshop on Data Collection, Curation, and Labeling for Mining and Learning, 2019.
  • [23] A. Kirsch, J. van Amersfoort, and Y. Gal, “Batchbald: Efficient and diverse batch acquisition for deep bayesian active learning,” in Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, Eds., 2019, pp. 7024–7035.
  • [24] A. F. Smith and G. O. Roberts, “Bayesian computation via the gibbs sampler and related markov chain monte carlo methods,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 55, no. 1, pp. 3–23, 1993.
  • [25] G. Casella and E. I. George, “Explaining the gibbs sampler,” The American Statistician, vol. 46, no. 3, pp. 167–174, 1992.
  • [26] S. Chib and E. Greenberg, “Understanding the metropolis-hastings algorithm,” The american statistician, vol. 49, no. 4, pp. 327–335, 1995.
  • [27] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth, “Hybrid monte carlo,” Physics letters B, vol. 195, no. 2, pp. 216–222, 1987.
  • [28] R. M. Neal, “Slice sampling,” Annals of Statistics, vol. 31, no. 3, pp. 705–741, 2003.
  • [29] I. Murray, R. P. Adams, and D. J. C. MacKay, “Elliptical slice sampling,” in AISTATS, 2010.
  • [30] M. Girolami and B. Calderhead, “Riemann manifold Langevin and Hamiltonian Monte Carlo methods,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 73, no. 2, pp. 123–214, 2011.
  • [31] L. Bottou, “Large-scale machine learning with stochastic gradient descent,” in Proceedings of COMPSTAT’2010.   Springer, 2010, pp. 177–186.
  • [32] M. Welling and Y. W. Teh, “Bayesian learning via stochastic gradient Langevin dynamics,” in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 681–688.
  • [33] T. Chen, E. B. Fox, and C. Guestrin, “Stochastic gradient Hamiltonian Monte Carlo,” in International Conference on Machine Learning, 2014.
  • [34] R. Zhang, C. Li, J. Zhang, C. Chen, and A. G. Wilson, “Cyclical stochastic gradient MCMC for Bayesian deep learning,” in ICLR, 2020.
  • [35] S. Brooks, A. Gelman, G. Jones, and X.-L. Meng, Handbook of Markov Chain Monte Carlo.   CRC Press, 2011.
  • [36] P. Izmailov, W. J. Maddox, P. Kirichenko, T. Garipov, D. P. Vetrov, and A. G. Wilson, “Subspace inference for Bayesian deep learning,” in UAI, 2019.
  • [37] NVIDIA Corporation. TensorRT Developer Guide. [Online]. Available: https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html
  • [38] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul, “An introduction to variational methods for graphical models,” Machine Learning, vol. 37, no. 2, pp. 183–233, 1999.
  • [39] T. S. Jaakkola and M. I. Jordan, “Bayesian parameter estimation via variational methods,” Statistics and Computing, vol. 10, no. 1, pp. 25–37, 2000.
  • [40] S. Ghosh, F. M. Delle Fave, and J. Yedidia, “Assumed density filtering methods for learning Bayesian neural networks,” in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
  • [41] T. P. Minka, “Expectation propagation for approximate Bayesian inference,” in Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence.   Morgan Kaufmann Publishers Inc., 2001, pp. 362–369.
  • [42] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra, “Weight uncertainty in neural network,” in Proceedings of the 32nd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, F. Bach and D. Blei, Eds., vol. 37.   Lille, France: PMLR, 07–09 Jul 2015, pp. 1613–1622. [Online]. Available: https://proceedings.mlr.press/v37/blundell15.html
  • [43] D. J. MacKay, Information Theory, Inference and Learning Algorithms.   Cambridge University Press, 2003.
  • [44] Y. Li and R. E. Turner, “Rényi divergence variational inference,” in Advances in Neural Information Processing Systems 29, 2016, pp. 1073–1081.
  • [45] M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley, “Stochastic variational inference,” The Journal of Machine Learning Research, vol. 14, no. 1, pp. 1303–1347, 2013.
  • [46] C. Louizos and M. Welling, “Multiplicative normalizing flows for variational Bayesian neural networks,” in Proceedings of the 34th International Conference on Machine Learning-Volume 70.   JMLR.org, 2017, pp. 2218–2227.
  • [47] D. Krueger, C.-W. Huang, R. Islam, R. Turner, A. Lacoste, and A. C. Courville, “Bayesian hypernetworks,” arXiv preprint arXiv:1710.04759, 2017.
  • [48] M. Dusenberry, G. Jerfel, Y. Wen, Y. Ma, J. Snoek, K. Heller, B. Lakshminarayanan, and D. Tran, “Efficient and scalable Bayesian neural nets with rank-1 factors,” in International Conference on Machine Learning.   PMLR, 2020, pp. 2782–2792.
  • [49] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in International Conference on Machine Learning, 2016, pp. 1050–1059.
  • [50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
  • [51] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Eds.   Curran Associates, Inc., 2019, pp. 8024–8035.
  • [52] J. Bai, F. Lu, K. Zhang et al., “ONNX: Open Neural Network Exchange,” https://github.com/onnx/onnx, 2019.
  • [53] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
  • [54] C. Henning, J. von Oswald, J. Sacramento, S. C. Surace, J.-P. Pfister, and B. F. Grewe, “Approximating the predictive distribution via adversarially-trained hypernetworks,” in Bayesian Deep Learning Workshop, NeurIPS (Spotlight) 2018, 2018.
  • [55] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” in Advances in Neural Information Processing Systems, 2017, pp. 6402–6413.
  • [56] A. G. Wilson and P. Izmailov, “Bayesian deep learning and a probabilistic perspective of generalization,” in Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., 2020.
  • [57] G. Huang, Y. Li, G. Pleiss, Z. Liu, J. E. Hopcroft, and K. Q. Weinberger, “Snapshot ensembles: Train 1, get M for free,” in 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings.   OpenReview.net, 2017. [Online]. Available: https://openreview.net/forum?id=BJYwwY9ll
  • [58] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain damage,” in Advances in Neural Information Processing Systems, 1989.
  • [59] B. Hassibi, D. G. Stork, and G. J. Wolff, “Optimal brain surgeon and general network pruning,” in IEEE International Conference on Neural Networks, 1993, vol. 1, pp. 293–299.
  • [60] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the Theory of Neural Computation.   USA: Addison-Wesley Longman Publishing Co., Inc., 1991.
  • [61] S. Han, J. Pool, J. Tran, and W. J. Dally, “Learning both weights and connections for efficient neural networks,” in Advances in Neural Information Processing Systems, 2015.
  • [62] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition.   IEEE, 2009, pp. 248–255.
  • [63] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
  • [64] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
  • [65] S. Han, H. Mao, and W. J. Dally, “Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding,” in 4th International Conference on Learning Representations, ICLR 2016, San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings, Y. Bengio and Y. LeCun, Eds., 2016. [Online]. Available: http://arxiv.org/abs/1510.00149
  • [66] Y. Guo, A. Yao, and Y. Chen, “Dynamic network surgery for efficient DNNs,” in Advances in Neural Information Processing Systems, 2016.
  • [67] X. Jin, X.-T. Yuan, J. Feng, and S. Yan, “Training skinny deep neural networks with iterative hard thresholding methods,” arXiv preprint arXiv:1607.05423, 2016.
  • [68] S. Han, J. Pool, S. Narang, H. Mao, S. Tang, E. Elsen, B. Catanzaro, J. Tran, and W. J. Dally, “DSD: Regularizing deep neural networks with dense-sparse-dense training flow,” arXiv preprint arXiv:1607.04381, 2016.
  • [69] J. Frankle and M. Carbin, “The lottery ticket hypothesis: Training pruned neural networks,” arXiv preprint arXiv:1803.03635, 2018.
  • [70] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf, “Pruning filters for efficient ConvNets,” arXiv preprint arXiv:1608.08710, 2016.
  • [71] Y. Zhang and Z. Ou, “Learning sparse structured ensembles with stochastic gradient MCMC sampling and network pruning,” 2018 IEEE 28th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6, 2018.
  • [72] J. M. Alvarez and M. Salzmann, “Learning the number of neurons in deep networks,” in Advances in Neural Information Processing Systems, 2016, pp. 2270–2278.
  • [73] W. Wen, C. Wu, Y. Wang, Y. Chen, and H. Li, “Learning structured sparsity in deep neural networks,” in Advances in Neural Information Processing Systems, 2016, pp. 2074–2082.
  • [74] Y. He, X. Zhang, and J. Sun, “Channel pruning for accelerating very deep neural networks,” 2017 IEEE International Conference on Computer Vision (ICCV), pp. 1398–1406, 2017.
  • [75] C. M. Carvalho, N. G. Polson, and J. G. Scott, “Handling sparsity via the horseshoe,” in Artificial Intelligence and Statistics.   PMLR, 2009, pp. 73–80.
  • [76] S. Ghosh and F. Doshi-Velez, “Model selection in Bayesian neural networks via horseshoe priors,” arXiv preprint arXiv:1705.10388, 2017.
  • [77] C. Louizos, K. Ullrich, and M. Welling, “Bayesian compression for deep learning,” arXiv preprint arXiv:1705.08665, 2017.
  • [78] D. Blalock, J. J. G. Ortiz, J. Frankle, and J. Guttag, “What is the state of neural network pruning?” arXiv preprint arXiv:2003.03033, 2020.
  • [79] A. K. Balan, V. Rathod, K. P. Murphy, and M. Welling, “Bayesian dark knowledge,” in Advances in Neural Information Processing Systems, 2015, pp. 3438–3446.