A Comparison of the Delta Method and the Bootstrap in Deep Learning Classification

Geir K. Nilsen, Antonella Z. Munthe-Kaas, Hans J. Skaug, Morten Brun
Department of Mathematics, University of Bergen

Abstract

We validate the Delta method adapted to deep learning classification, introduced in [11], by a comparison with the classical Bootstrap. We show that there is a strong linear relationship between the quantified predictive epistemic uncertainty levels obtained from the two methods when applied to two LeNet-based neural network classifiers using the MNIST and CIFAR-10 datasets. Furthermore, we demonstrate that the Delta method reduces computation time by roughly a factor of five compared to the Bootstrap.

1 Introduction

It can be beneficial to distinguish between epistemic and aleatoric uncertainty in machine learning models [5]. Bayesian statistics provides a coherent framework for representing epistemic uncertainty in neural networks [9], but has not so far gained widespread use in deep learning [3] – presumably due to the high computational cost that traditionally comes with Fisher information based methods. In particular, the Delta method [4, 6] depends on the empirical Fisher information matrix which grows quadratically with the number of neural network parameters $P$ – and its direct application in modern deep learning is therefore prohibitively expensive. To mitigate this, [11] proposed a low cost variant of the Delta method applicable to $L_{2}$ -regularized deep neural networks based on the top $K$ eigenpairs of the Fisher information matrix.

In this paper, we validate the methodology introduced in [11] by a comparison with the classical Bootstrap [2, 6, 8, 12, 13]. We show that there is a strong linear relationship between the quantified epistemic uncertainty levels obtained from the two methods when applied on two LeNet-based neural network classifiers using the MNIST and CIFAR-10 datasets.

The paper is organized as follows: in Section 2 we review the Bootstrap and the Delta method in a deep learning classification context. In Section 3 we introduce two LeNet-based classifiers which will be used in the comparison in Section 4, and finally, in Section 5 we summarize the paper and give some concluding remarks.

2 Introduction to the Methodologies

In the following, we denote the training set by $\{x_{n}\in\mathbb{R}^{T_{1}},y_{n}\in\mathbb{R}^{T_{L}}\}_{n=1}^{N}$ , the test set by $\{x_{n}\in\mathbb{R}^{T_{1}},y_{n}\in\mathbb{R}^{T_{L}}\}_{n=1}^{N_{\text{test}}}$ and an arbitrary input example by $x_{0}$ . The parameter vector is denoted by $\omega\in\mathbb{R}^{P}$ , where $P$ is the number of parameters (weights and biases) in the model. The parameter values after training are denoted by the vector $\hat{\omega}\in\mathbb{R}^{P}$ . A prediction for $x_{0}$ is denoted by $\hat{y}_{0}=f(x_{0},\hat{\omega})\in\mathbb{R}^{T_{L}}$ , where $f:\mathbb{R}^{T_{1}\times P}\rightarrow\mathbb{R}^{T_{L}}$ is a deep neural network model function [3] and $T_{L}$ denotes the number of classes. Finally, it is assumed that the cost function, denoted by $C$ , is $L_{2}$ -regularized with a regularization-rate factor $\lambda/2$ .

2.1 The Bootstrap in Deep Learning Classification

In the context of deep learning classification, the classical Bootstrap method starts by creating $B$ datasets from the original dataset by sampling with replacement. Subsequently, $B$ networks are trained separately on each of the bootstrapped datasets. The epistemic uncertainty for each of the $T_{L}$ class predictions (in standard deviations) associated with prediction of $x_{0}$ is obtained by the sample standard deviation over the ensemble of $B$ predictions,

\widetilde{\sigma}_{\text{boot}}(x_{0})=\sqrt{\frac{1}{B-1}\sum_{b=1}^{B}(\hat{y}_{0}^{(b)}-\overline{\hat{y}_{0}})^{2}}\in\mathbb{R}^{T_{L}},

(1)

where the vector $\hat{y}_{0}^{(b)}$ represents the $T_{L}$ predictions for $x_{0}$ (one probability per class) obtained from the $b$ th bootstrapped network, and where $\overline{\hat{y}_{0}}$ is the sample mean,

\overline{\hat{y}_{0}}=\frac{1}{B}\sum_{b=1}^{B}\hat{y}_{0}^{(b)}\in\mathbb{R}^{T_{L}}.

(2)

The method is easy to implement efficiently in practice. Training $B$ networks is an ‘embarrassingly’ parallel problem, and the space complexity for the bootstrapped datasets is just $O(BN)$ when an indexing scheme is used for the sampling with replacement. The experiments conducted in this paper are based on the example pydeepboot.py provided in the pydeepdelta repository [14].
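As an illustration of (1), (2) and the index-based resampling, the following is a minimal NumPy sketch. It is not taken from pydeepboot.py; the function names and the toy ensemble of four prediction vectors are assumptions made purely for this example.

```python
import numpy as np

def bootstrap_indices(n_examples, n_replicates, seed=0):
    """Draw B index sets of size N with replacement (O(BN) integers in total)."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, n_examples, size=(n_replicates, n_examples))

def bootstrap_std(predictions):
    """Eq. (1): per-class sample standard deviation over the ensemble.

    predictions has shape (B, T_L), one probability vector per network.
    """
    predictions = np.asarray(predictions, dtype=float)
    # ddof=1 gives the 1/(B-1) normalization; the mean in Eq. (2) is implicit.
    return predictions.std(axis=0, ddof=1)

# Toy ensemble: B = 4 hypothetical networks, T_L = 3 classes, one input x_0.
y0 = np.array([[0.7, 0.2, 0.1],
               [0.6, 0.3, 0.1],
               [0.8, 0.1, 0.1],
               [0.7, 0.2, 0.1]])
sigma_boot = bootstrap_std(y0)

# Each row indexes one bootstrapped dataset; row b would be used to train network b.
idx = bootstrap_indices(n_examples=10, n_replicates=4)
```

Storing only the index array, rather than copies of the data, is what keeps the memory footprint at $O(BN)$ .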

2.2 The Delta Method in Deep Learning Classification

The Delta method was adapted to the deep learning classification context by [11]. The adaptation addresses several fundamental difficulties that arise when the method is applied in deep learning. In essence, it is shown that an approximation of the eigendecomposition of the Fisher information matrix utilizing only $K$ eigenpairs allows for an efficient implementation with bounded worst-case approximation errors. We briefly review the standard method here for convenience.

An approximation of the epistemic component of the uncertainty associated with the prediction of $x_{0}$ can be found by the formula

\widetilde{\sigma}_{\text{delta}}(x_{0})=\sqrt{\text{diag}\big{(}F\Sigma F^{T}\big{)}}\in\mathbb{R}^{T_{L}},

(3)

where the sensitivity matrix $F$ in (3) is defined by

F=\begin{bmatrix}F_{ij}\end{bmatrix}\in\mathbb{R}^{T_{L}\times P},~{}F_{ij}=\frac{\partial}{\partial\omega_{j}}f_{i}(x_{0},\omega)\bigg{\rvert}_{\omega=\hat{\omega}}.

(4)

The covariance matrix $\Sigma$ in (3) can be estimated by several alternative estimators. In [11] it was demonstrated that the Hessian estimator, the Outer-Products of Gradients (OPG) estimator and the Sandwich estimator lead to nearly perfectly correlated results for two different deep learning models. Since the models discussed in this paper are identical to those in [11], we focus on only one of the estimators, namely the OPG estimator defined by

\Sigma=\frac{1}{N}G^{-1}=\frac{1}{N}\left[\frac{1}{N}\sum_{n=1}^{N}\frac{\partial C_{n}}{\partial\omega}\frac{\partial C_{n}}{\partial\omega}^{T}\bigg{\rvert}_{\omega=\hat{\omega}}+\lambda I\right]^{-1}\in\mathbb{R}^{P\times P},

(5)

where the summation part of $G$ corresponds to the empirical covariance of the gradients of the cost function evaluated at $\hat{\omega}$ . As discussed in [11], the term $\lambda I$ is added explicitly in order to make the OPG estimator asymptotically equal to the Hessian estimator, which is the primary motivation for using the former as a plug-in replacement for the latter.
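To make (3), (4) and (5) concrete, the sketch below evaluates them exactly for a tiny softmax-regression model $f(x,W)=\text{softmax}(Wx)$ . Everything in it is an assumption for illustration: the model, the random data, the function names, the use of a random $W$ in place of trained parameters, and the choice of $C_{n}$ as the per-example cross-entropy without the regularization term. The explicit matrix inverse is only feasible because $P$ is tiny here.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sensitivity_matrix(W, x0):
    """Eq. (4): Jacobian of the T_L outputs w.r.t. the P = T_L * T_1 weights."""
    p = softmax(W @ x0)
    S = np.diag(p) - np.outer(p, p)        # d softmax / d logits
    return np.kron(S, x0[None, :])         # shape (T_L, P), row-major weight order

def opg_covariance(W, X, y, lam):
    """Eq. (5): Sigma = (1/N) [ (1/N) sum_n g_n g_n^T + lam * I ]^{-1}."""
    N, P = X.shape[0], W.size
    G = lam * np.eye(P)
    for x_n, y_n in zip(X, y):
        p = softmax(W @ x_n)
        p[y_n] -= 1.0                      # gradient of cross-entropy w.r.t. logits
        g = np.outer(p, x_n).ravel()       # per-example gradient w.r.t. weights
        G += np.outer(g, g) / N
    return np.linalg.inv(G) / N

def delta_std(W, x0, Sigma):
    """Eq. (3): sqrt(diag(F Sigma F^T)) without forming the full product."""
    F = sensitivity_matrix(W, x0)
    return np.sqrt(np.einsum("ip,pq,iq->i", F, Sigma, F))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))               # N = 50 toy inputs, T_1 = 2
y = rng.integers(0, 3, size=50)            # T_L = 3 classes
W = rng.normal(size=(3, 2))                # stands in for the trained parameters
Sigma = opg_covariance(W, X, y, lam=0.01)
sigma_delta = delta_std(W, X[0], Sigma)    # one standard deviation per class
```

Because the softmax outputs sum to one, the columns of $F$ sum to zero, which is a convenient sanity check on any implementation of (4).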

When the Delta method is implemented under the framework of [11], it has several desirable properties: a) requires only $O(PK)$ space and $O(KPN)$ time, b) fits well with deep learning software frameworks based on automatic differentiation, c) works with any $L_{2}$ -regularized neural network architecture, and d) does not interfere with the training process as long as the norm of the gradient of the cost function is approximately equal to zero after training.
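The $O(PK)$ space figure in a) can be illustrated with a truncated eigendecomposition. The sketch below shows one possible truncation rule and is not claimed to be the scheme of [11]: it keeps the $K$ largest eigenpairs of the summation part of $G$ in (5) and treats the remaining eigenvalues as zero, so that only a $P\times K$ matrix of eigenvectors is needed. The dense `eigh` call is used purely for illustration (an iterative solver would be needed at scale), and all names and the random toy data are hypothetical.

```python
import numpy as np

def topk_inverse_quadratic(F, A, lam, K):
    """Approximate diag(F (A + lam*I)^{-1} F^T) from the K largest eigenpairs of A.

    Uses (A + lam*I)^{-1} ~= U_K diag(1/(l_k + lam) - 1/lam) U_K^T + I/lam,
    which is exact when the discarded eigenvalues of A are zero.
    """
    evals, evecs = np.linalg.eigh(A)            # eigenvalues in ascending order
    top = np.argsort(evals)[::-1][:K]           # indices of the K largest
    lam_k, U_k = evals[top], evecs[:, top]
    coeff = 1.0 / (lam_k + lam) - 1.0 / lam     # non-positive corrections
    proj = F @ U_k                              # shape (T_L, K)
    return (proj ** 2) @ coeff + (F ** 2).sum(axis=1) / lam

rng = np.random.default_rng(0)
N, P, lam = 5, 8, 0.01
grads = rng.normal(size=(N, P))                 # toy per-example gradients
A = grads.T @ grads / N                         # summation part of G, rank <= N
F = rng.normal(size=(3, P))                     # toy sensitivity matrix

var_full = topk_inverse_quadratic(F, A, lam, K=P)
var_k2 = topk_inverse_quadratic(F, A, lam, K=2)
sigma_k2 = np.sqrt(var_k2 / N)                  # includes the 1/N factor of Eq. (5)
```

Under this particular rule every discarded term is non-positive, so truncating can only increase the variance estimate.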

3 The Neural Network Classifiers

We deploy two LeNet-based neural network architectures which differ only in the number of neurons in two of the layers, in order to individually match the formats of the MNIST and CIFAR-10 datasets. Our TensorFlow code for the Delta method is based on the pydeepdelta Python module [14], and is fully deterministic [10]. The corresponding Bootstrap implementation can be found in the same repository.

3.1 MNIST

There are $L=6$ layers. Layer $l=1$ is the input layer represented by the input vector. Layer $l=2$ is a $3\times 3\times 1\times 32$ convolutional layer followed by max pooling with stride equal to $2$ and with a ReLU activation function. Layer $l=3$ is a $3\times 3\times 32\times 64$ convolutional layer followed by max pooling with a stride equal to $2$ , and with ReLU activation function. Layer $l=4$ is a $3\times 3\times 64\times 64$ convolutional layer with ReLU activation function. Layer $l=5$ is a $576\times 64$ dense layer with ReLU activation function, and the output layer $l=6$ is a $64\times T_{L}$ dense layer with softmax activation function, where the number of classes (outputs) is $T_{L}=10$ . The total number of parameters is $P=93322$ .

3.2 CIFAR-10

There are $L=6$ layers. Layer $l=1$ is the input layer represented by the input vector. Layer $l=2$ is a $3\times 3\times 3\times 32$ convolutional layer followed by max pooling with stride equal to $2$ and with a ReLU activation function. Layer $l=3$ is a $3\times 3\times 32\times 64$ convolutional layer followed by max pooling with a stride equal to $2$ , and with ReLU activation function. Layer $l=4$ is a $3\times 3\times 64\times 64$ convolutional layer with ReLU activation function. Layer $l=5$ is a $1024\times 64$ dense layer with ReLU activation function, and the output layer $l=6$ is a $64\times T_{L}$ dense layer with softmax activation function, where the number of classes (outputs) is $T_{L}=10$ . The total number of parameters is $P=122570$ .
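The stated parameter totals for the MNIST and CIFAR-10 classifiers can be checked with a few lines of arithmetic. The sketch assumes one bias per convolutional output channel and per dense unit, which is not spelled out in the layer descriptions; the helper names are hypothetical.

```python
def conv_params(k_h, k_w, c_in, c_out):
    """Weights of a k_h x k_w x c_in x c_out kernel plus one bias per output channel."""
    return k_h * k_w * c_in * c_out + c_out

def dense_params(n_in, n_out):
    """Weights of an n_in x n_out dense layer plus one bias per output unit."""
    return n_in * n_out + n_out

def lenet_params(in_channels, flat_dim, n_classes=10):
    """Total parameter count P for the six-layer architecture described above."""
    return (conv_params(3, 3, in_channels, 32)   # layer 2
            + conv_params(3, 3, 32, 64)          # layer 3
            + conv_params(3, 3, 64, 64)          # layer 4
            + dense_params(flat_dim, 64)         # layer 5
            + dense_params(64, n_classes))       # layer 6

p_mnist = lenet_params(in_channels=1, flat_dim=576)
p_cifar10 = lenet_params(in_channels=3, flat_dim=1024)
```

Under that bias assumption the counts agree with the reported $P=93322$ and $P=122570$ .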

3.3 Training Details

For the Bootstrap networks, we test two different weight initialization variants: dynamic random normal weight initialization (DRWI) and static random normal weight initialization (SRWI). The former uses a different (i.e. dynamic) seed across the replicates, meaning that each network in the DRWI Bootstrap ensemble will start out with different random weight values. The latter uses the same (i.e. static) seed across the replicates, and hence all the networks in the SRWI Bootstrap ensemble receive the same random initial weight values. For all networks, we use zero bias initialization. Furthermore, to investigate the impact of random weight initialization on the Delta method, we apply the Delta method 16 times on a set of 16 networks distinguished only by DRWI.

We use the cross-entropy cost function with an $L_{2}$ -regularization rate $\lambda=0.01$ , and utilize the Adam [7, 1] optimizer with a batch size of $100$ and no randomized data shuffling. To ensure convergence (i.e. $||\nabla C(\hat{\omega})||_{2}\approx 0$ ), we apply two slightly different learning rate schedules given by the following (step, rate) pairs: MNIST = $\{(0,10^{-3}),(60\text{k},10^{-4}),(70\text{k},10^{-5}),(80\text{k},10^{-6})\}$ and CIFAR-10 = $\{(0,10^{-3}),(55\text{k},10^{-4}),(85\text{k},10^{-5}),(95\text{k},10^{-6}),(105\text{k},10^{-7})\}$ . For MNIST, we stop training after $90,000$ steps, and for CIFAR-10 after $115,000$ steps. The resulting overall training statistics are shown in Table 1.
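The (step, rate) pairs can be read as a piecewise-constant schedule. The sketch below assumes that each rate applies from its listed step (inclusive) until the next listed step; that boundary convention and the helper name are assumptions, not details given in the text.

```python
from bisect import bisect_right

# (step, rate) pairs as listed above.
MNIST_SCHEDULE = [(0, 1e-3), (60_000, 1e-4), (70_000, 1e-5), (80_000, 1e-6)]
CIFAR10_SCHEDULE = [(0, 1e-3), (55_000, 1e-4), (85_000, 1e-5),
                    (95_000, 1e-6), (105_000, 1e-7)]

def learning_rate(step, schedule):
    """Return the rate of the last (boundary, rate) pair with boundary <= step."""
    boundaries = [boundary for boundary, _ in schedule]
    return schedule[bisect_right(boundaries, step) - 1][1]

# Rates in effect at the final training steps (90,000 and 115,000 steps in total).
final_mnist = learning_rate(89_999, MNIST_SCHEDULE)
final_cifar10 = learning_rate(114_999, CIFAR10_SCHEDULE)
```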

Networks	Dataset	Training Set Accuracy	Test Set Accuracy	$C(\hat{\omega})$	$\|\nabla C(\hat{\omega})\|_{2}$
DRWI Bootstrap B=100	MNIST	$0.979\pm 0.000$	$0.981\pm 0.001$	$0.253\pm 0.006$	$0.016\pm 0.013$
DRWI Bootstrap B=100	CIFAR-10	$0.705\pm 0.025$	$0.684\pm 0.020$	$1.248\pm 0.042$	$0.035\pm 0.020$
SRWI Bootstrap B=100	MNIST	$0.979\pm 0.000$	$0.981\pm 0.001$	$0.254\pm 0.002$	$0.017\pm 0.013$
SRWI Bootstrap B=100	CIFAR-10	$0.715\pm 0.010$	$0.693\pm 0.009$	$1.235\pm 0.018$	$0.031\pm 0.014$
Delta 16 reps (DRWI)	MNIST	$0.979\pm 0.000$	$0.981\pm 0.001$	$0.257\pm 0.002$	$0.016\pm 0.005$
Delta 16 reps (DRWI)	CIFAR-10	$0.701\pm 0.032$	$0.687\pm 0.029$	$1.284\pm 0.053$	$0.030\pm 0.012$

Table 1: Training statistics for the Delta and Bootstrap networks. The DRWI and SRWI Bootstrap ensembles each consist of $B=100$ bootstrapped networks, while the Delta method is applied repeatedly on 16 networks distinguished only by DRWI. Averages $\pm$ two standard deviations are calculated across the $B=100$ networks for the Bootstrap, and across the 16 repetitions for the Delta method.

4 Comparison

The basic comparison design entails a set of 16 linear regressions on the predictive uncertainty estimates obtained from the two methods, using the test sets as input data:

\widetilde{\sigma}_{\text{boot}}(x_{n})_{m}=\alpha_{d}+\beta_{d}\widetilde{\sigma}_{\text{delta}}(x_{n})_{m,d}+e_{n,m,d},\quad n=1,2,\ldots,N_{\text{test}},\quad m=1,2,\ldots,T_{L},\quad d=1,2,\ldots,16.

(6)

Accounting for the two variants of the Bootstrap (SRWI/DRWI), this leads to two sets of squared correlation coefficients, intercepts, slopes and Delta method approximation errors, respectively denoted by $\{R^{2}_{d},\alpha_{d},\beta_{d},\epsilon_{d}\}_{d=1}^{16}$ . Furthermore, as we wish to analyze the impact of the number of Bootstrap replicates and the number of Delta method eigenpairs, we generate these sets for various $B$ and $K$ . An outline of the setup is shown in Figure 1.
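For a single repetition $d$ , the regression (6) amounts to an ordinary least-squares fit over all test examples and class outputs. The sketch below uses synthetic uncertainty values generated only to exercise the code; they are not results from the paper, and the function name is hypothetical.

```python
import numpy as np

def regress_boot_on_delta(sigma_boot, sigma_delta):
    """Fit sigma_boot = alpha + beta * sigma_delta + e over all (n, m) pairs.

    Both inputs have shape (N_test, T_L). Returns (alpha, beta, R^2).
    """
    x = np.ravel(sigma_delta)
    y = np.ravel(sigma_boot)
    beta, alpha = np.polyfit(x, y, deg=1)       # highest-degree coefficient first
    residuals = y - (alpha + beta * x)
    r2 = 1.0 - (residuals ** 2).sum() / ((y - y.mean()) ** 2).sum()
    return alpha, beta, r2

# Synthetic stand-ins for the two sets of uncertainty estimates (not real data).
rng = np.random.default_rng(0)
sigma_delta = rng.uniform(0.0, 0.2, size=(1000, 10))
sigma_boot = 0.8 * sigma_delta + rng.normal(0.0, 0.005, size=sigma_delta.shape)

alpha, beta, r2 = regress_boot_on_delta(sigma_boot, sigma_delta)
```

A slope below one in this convention means the Delta method estimates are larger than the Bootstrap estimates, and a slope above one means they are smaller.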

Figure 1: Regression (6) of $\widetilde{\sigma}_{\text{boot}}$ onto $\widetilde{\sigma}_{\text{delta}}$ .

Figure 2 shows scatter plots of the regression results for the first repetition ( $d=1$ ) of the Delta method against the DRWI Bootstrap ensemble. These plots are based on $B=100$ bootstrap replicates, and we have selected $K=1500$ eigenpairs for MNIST and $K=2500$ eigenpairs for CIFAR-10. Clearly, there is a strong linear relationship between the two methods: the squared correlation coefficients are $R_{1}^{2}=0.94$ for MNIST and $R_{1}^{2}=0.90$ for CIFAR-10. On the other hand, the absolute uncertainty level differs between the methods and datasets. This can be seen from the slope coefficients: relative to the Bootstrap, the Delta method overestimates ( $\beta_{1}<1$ ) on MNIST and underestimates ( $\beta_{1}>1$ ) on CIFAR-10. Further, since the estimated intercepts ( $\alpha_{1}$ ) are zero, there is no offset between the methods. Finally, the maximum Delta method approximation error across examples and class outputs ( $\epsilon_{1}$ ) is zero, so there is nothing to be gained by increasing $K$ . As we will see later, $K$ has here been selected unnecessarily high and can be significantly reduced with no loss of accuracy.

4.1 Discussion of the Regression Results as a Function of $B$ and $K$

The results from the full set of regressions ( $d=1,2,\ldots,16$ ) for a fixed $B=100$ are shown in Figure 3. The primary observations are as follows. The mean squared correlation coefficients $R^{2}$ are generally high for both MNIST and CIFAR-10, meaning that there is a strong linear relationship between the uncertainty levels obtained by the Bootstrap and the Delta method. For the lowest $K$ , $R^{2}$ starts out at 90% for MNIST and at 81% for CIFAR-10. As $K$ grows, $R^{2}$ increases by only 4% for MNIST and by 8% for CIFAR-10. The major difference observed as $K$ increases lies in the absolute uncertainty levels expressed by the slope $\beta$ : the slope stabilizes at around $K=600$ for MNIST and at about $K=1000$ for CIFAR-10. The same trend is reflected in the maximum approximation errors $\epsilon$ , which approach zero at the same values of $K$ . Although not shown in the plots, the regression intercepts $\alpha$ are always zero, meaning that there is no offset between the uncertainty estimates of the two methods.

The main difference found when applying DRWI as opposed to SRWI for the Bootstrap ensembles is that the absolute level of uncertainty increases with DRWI. This is expected, since the DRWI version of the Bootstrap will be more prone to reaching different local minima, and therefore also captures this additional variance. Support for this hypothesis is provided by CIFAR-10’s wider confidence intervals: a more pronounced difference in geometry across local minima will ultimately lead to higher variability in $R^{2}$ and $\beta$ . A slightly higher mean $R^{2}$ (+1-2%) is also observed for the DRWI version of the Bootstrap. This is reasonable given that the Delta method networks are also more prone to reaching different local minima across the 16 repetitions because of DRWI.

Figure 4 shows the same type of comparison when the number of Bootstrap replicates $B$ varies and the number of eigenpairs is fixed ( $K=1500$ for MNIST and $K=2500$ for CIFAR-10). The main observation from this experiment is that there is very little to gain by selecting an ensemble size $B$ larger than about 50, as this is the point where the mean slope and squared correlation coefficient stabilize.

4.2 Computation Time

Table 2 shows the computation time for the two methods when executed on an Nvidia RTX 2080 Ti GPU. For MNIST, the smallest $K$ leading to acceptable approximation errors and stable absolute uncertainty levels for the Delta method is $K=600$ , while for CIFAR-10 it is $K=1000$ . Furthermore, the smallest acceptable $B$ leading to stable correlation and absolute uncertainty levels for the Bootstrap is $B=50$ . We conclude that in these experiments the Delta method outperforms the Bootstrap in terms of computation time by a factor $4.6$ on MNIST, and by a factor $5.9$ on CIFAR-10.

Method	Classifier	B	K	Initial Phase [h:mm:ss]	Prediction Phase, Training Set [mm:ss]	Prediction Phase, Test Set [mm:ss]	Total [h:mm:ss]
Bootstrap	MNIST	50	N/A	4:08:28	00:19	00:03	4:08:50
Bootstrap	CIFAR-10	50	N/A	7:37:16	00:40	00:07	7:38:04
Delta	MNIST	N/A	600	0:42:33	09:52	01:37	0:54:02
Delta	CIFAR-10	N/A	1000	1:00:54	14:44	02:56	1:18:35

Table 2: Computation time for the Bootstrap and Delta method. For the Bootstrap, the ‘initial phase’ accounts for the parallelized training of $B$ networks, while the ‘prediction phase’ accounts for the predictive epistemic uncertainty estimation (1), which is further divided into the training and test sets. For the Delta method, the ‘initial phase’ accounts for the approximate eigendecomposition of the covariance matrix (5), while the ‘prediction phase’ accounts for the predictive epistemic uncertainty estimation (3), further divided into the training set and test sets.

5 Concluding Remarks

We have shown that there is a strong linear relationship between the predictive epistemic uncertainty estimates obtained by the Bootstrap and the Delta method when applied to two different deep learning classification models. Firstly, we find that the number of eigenpairs $K$ in the Delta method can be selected orders of magnitude lower than $P$ with no loss of correspondence between the methods. This coincides with the fact that when the Delta method approximation errors are sufficiently close to zero, there is nothing to gain by a further increase in $K$ , and therefore the correspondence will stabilize at this point.

Secondly, we find that the DRWI version of the Bootstrap yields the best correspondence, and that there is little to gain by using more than $B=50$ replicates. Thirdly, we observe that the most complex model (CIFAR-10) yields a high variability in the correspondence across multiple DRWI Delta method runs. We attribute this effect to multi-modality of the cost function, and to the Delta method failing to capture the additional variance tied to reaching local minima of different geometric characteristics. Finally, in our experiments the Delta method outperforms the Bootstrap in terms of computation time by a factor $4.6$ on MNIST and by a factor $5.9$ on CIFAR-10.

References

[1] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Rev., vol. 60, no. 2, pp. 223-311, 2018.
[2] B. Efron. Bootstrap methods: Another look at the jackknife. Ann. Stat., vol. 7, no. 1, pp. 1–26, 1979.
[3] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. http://www.deeplearningbook.org, MIT Press, 2016.
[4] J. M. V. Hoef. Who Invented the Delta Method? https://www.researchgate.net/publication/254329376_Who_Invented_the_Delta_Method, The American Statistician, 66:2, 124-127, 2012.
[5] E. Hüllermeier and W. Waegeman. Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods. https://arxiv.org/abs/1910.09457, arXiv:1910.09457v2 [cs.LG], 2020.
[6] A. Khosravi and D. Creighton. A Comprehensive Review of Neural Network-based Prediction Intervals and New Advances. https://www.researchgate.net/publication/51534965_Comprehensive_Review_of_Neural_Network-Based_Prediction_Intervals_and_New_Advances, IEEE Transactions On Neural Networks, Vol. 22, No. 9, 2011.
[7] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In Proc. 3rd Int. Conf. Learn. Representations, 2014.
[8] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. https://arxiv.org/pdf/1612.01474, arXiv:1612.01474v3 [stat.ML], 2017.
[9] D. MacKay. A practical Bayesian framework for backpropagation networks. http://www.inference.org.uk/mackay/PhD.html#PhD, Neural Computation, 4(3):448–472, 1992.
[10] P. Nagarajan and G. Warnell. Deterministic Implementations for Reproducibility in Deep Reinforcement Learning. https://arxiv.org/abs/1809.05676, arXiv:1809.05676 [cs.AI], 2019.
[11] G. K. Nilsen, A. Z. Munthe-Kaas, H. J. Skaug, and M. Brun. Epistemic Uncertainty Quantification in Deep Learning Classification by the Delta Method. https://arxiv.org/abs/1912.00832, arXiv:1912.00832 [cs.LG], 2021.
[12] I. Osband. Risk versus Uncertainty in Deep Learning: Bayes, Bootstrap and the Dangers of Dropout. http://bayesiandeeplearning.org/2016/papers/BDL_4.pdf, NIPS Workshop on Bayesian Deep Learning, 2016.
[13] I. Osband, C. Blundell, A. Pritzel, and B. V. Roy. Deep Exploration via Bootstrapped DQN. https://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn.pdf, Conference on Neural Information Processing Systems (NIPS), 2016.
[14] pyDeepDelta: A TensorFlow Module Implementing the Delta Method in Deep Learning Classification. https://github.com/gknilsen/pydeepdelta.git.