A Comparison of the Delta Method and the Bootstrap in Deep Learning Classification
Abstract
We validate the deep learning classification adapted Delta method introduced in [11] by a comparison with the classical Bootstrap. We show that there is a strong linear relationship between the quantified predictive epistemic uncertainty levels obtained from the two methods when applied to two LeNet-based neural network classifiers using the MNIST and CIFAR-10 datasets. Furthermore, we demonstrate that the Delta method offers a roughly five-fold reduction in computation time compared to the Bootstrap.
1 Introduction
It can be beneficial to distinguish between epistemic and aleatoric uncertainty in machine learning models [5]. Bayesian statistics provides a coherent framework for representing epistemic uncertainty in neural networks [9], but has not so far gained widespread use in deep learning [3] – presumably due to the high computational cost that traditionally comes with Fisher information based methods. In particular, the Delta method [4, 6] depends on the empirical Fisher information matrix, which grows quadratically with the number of neural network parameters – its direct application in modern deep learning is therefore prohibitively expensive. To mitigate this, [11] proposed a low-cost variant of the Delta method applicable to $L_2$-regularized deep neural networks, based on the top $K$ eigenpairs of the Fisher information matrix.
In this paper, we validate the methodology introduced in [11] by a comparison with the classical Bootstrap [2, 6, 8, 12, 13]. We show that there is a strong linear relationship between the quantified epistemic uncertainty levels obtained from the two methods when applied to two LeNet-based neural network classifiers using the MNIST and CIFAR-10 datasets.
The paper is organized as follows: in Section 2 we review the Bootstrap and the Delta method in a deep learning classification context. In Section 3 we introduce two LeNet-based classifiers which will be used in the comparison in Section 4, and finally, in Section 5 we summarize the paper and give some concluding remarks.
2 Introduction to the Methodologies
In the following, we denote the training set by $\mathcal{D}_{\text{train}} = \{(x_n, y_n)\}_{n=1}^{N}$, the test set by $\mathcal{D}_{\text{test}}$, and an arbitrary input example by $x_0$. The parameter space is denoted by the vector $\omega \in \mathbb{R}^P$, where $P$ is the number of parameters (weights and biases) in the model. The parameter values after training are denoted by the vector $\hat{\omega}$. Furthermore, a prediction for $x_0$ is denoted by $\hat{y}_0 = f(x_0, \hat{\omega}) \in \mathbb{R}^M$, where $f$ is a deep neural network model function [3] and $M$ denotes the number of classes. Finally, it is assumed that the cost function, denoted by $C(\omega)$, is $L_2$-regularized with a regularization-rate factor $\lambda$.
2.1 The Bootstrap in Deep Learning Classification
In the context of deep learning classification, the classical Bootstrap method starts by creating $B$ datasets from the original dataset by sampling with replacement. Subsequently, $B$ networks are trained separately, one on each of the bootstrapped datasets. The epistemic uncertainty for each of the class predictions (in standard deviations) associated with the prediction of $x_0$ is obtained by the sample standard deviation over the ensemble of $B$ predictions,
$$\hat{\sigma}(x_0) = \sqrt{\frac{1}{B-1} \sum_{b=1}^{B} \left( \hat{y}_b - \bar{y} \right)^2} \qquad (1)$$

where the vector $\hat{y}_b \in \mathbb{R}^M$ represents the predictions for $x_0$ (one probability per class) obtained from the $b$th bootstrapped network, the square and the square root are taken elementwise, and where $\bar{y}$ is the sample mean,

$$\bar{y} = \frac{1}{B} \sum_{b=1}^{B} \hat{y}_b. \qquad (2)$$
The method is easy to implement efficiently in practice. Training $B$ networks is an ‘embarrassingly’ parallel problem, and the space complexity for the $B$ bootstrapped datasets is just $O(NB)$ when an indexing scheme is used for the sampling with replacement. The experiments conducted in this paper are based on the example pydeepboot.py provided in the pydeepdelta repository [14].
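For concreteness, a minimal NumPy/Keras sketch of this procedure could look as follows. This is an illustration rather than the actual pydeepboot.py implementation; `make_model`, the training arrays and the epoch count are assumptions made for the example:

```python
import numpy as np

def bootstrap_uncertainty(make_model, x_train, y_train, x_eval, B=100, seed=None):
    """Ensemble epistemic uncertainty by the classical Bootstrap (eqs. 1-2).

    make_model: zero-argument function returning a fresh, compiled Keras model.
    Returns the per-class sample mean and sample standard deviation over B
    networks trained on resampled datasets.
    """
    rng = np.random.default_rng(seed)
    N = x_train.shape[0]
    preds = []
    for b in range(B):
        # Sampling with replacement via an index scheme: O(N) extra space
        # per replicate instead of a full copy of the dataset.
        idx = rng.integers(0, N, size=N)
        model = make_model()
        model.fit(x_train[idx], y_train[idx], epochs=10, verbose=0)  # epoch count is a stand-in
        preds.append(model.predict(x_eval, verbose=0))  # (n_eval, M) softmax outputs
    preds = np.stack(preds)              # (B, n_eval, M)
    y_bar = preds.mean(axis=0)           # eq. (2): sample mean
    sigma = preds.std(axis=0, ddof=1)    # eq. (1): sample std with 1/(B-1)
    return y_bar, sigma
```

Because each replicate is independent, the loop body can be farmed out to separate workers or GPUs without any coordination beyond the seed.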
2.2 The Delta Method in Deep Learning Classification
The Delta method was adapted to the deep learning classification context by [11]. The adaption addresses several fundamental difficulties that arise when the method is applied in deep learning. In essence, it is shown that an approximation of the eigendecomposition of the Fisher information matrix utilizing only $K \ll P$ eigenpairs allows for an efficient implementation with bounded worst-case approximation errors. We briefly review the standard method here for convenience.
An approximation of the epistemic component of the uncertainty associated with the prediction of $x_0$ can be found by the formula
$$\hat{\sigma}(x_0) = \sqrt{\operatorname{diag}\left( F \Sigma F^{T} \right)} \qquad (3)$$

where the sensitivity matrix $F \in \mathbb{R}^{M \times P}$ in (3) is defined by

$$F = \left. \frac{\partial f(x_0, \omega)}{\partial \omega} \right|_{\omega = \hat{\omega}}. \qquad (4)$$
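For small $M$, the sensitivity matrix of eq. (4) can be assembled directly with automatic differentiation. The following TensorFlow sketch (an illustration, not the pydeepdelta implementation) computes $F$ for a single input example:

```python
import tensorflow as tf

def sensitivity_matrix(model, x0):
    """Sensitivity matrix F of eq. (4): the M x P Jacobian of the softmax
    outputs with respect to all trainable parameters, evaluated at the
    trained weights. x0 has shape (1, ...) -- a single input example."""
    with tf.GradientTape() as tape:
        y = model(x0)                       # shape (1, M)
    # jacobian() returns one tensor per weight tensor; flatten each one and
    # concatenate them into a single (M, P) matrix.
    jacs = tape.jacobian(y, model.trainable_variables)
    rows = [tf.reshape(j, (y.shape[-1], -1)) for j in jacs]
    return tf.concat(rows, axis=1)          # (M, P)
```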
The covariance matrix $\Sigma$ in (3) can be estimated by several alternative estimators. In [11] it was demonstrated that the Hessian estimator, the Outer-Products of Gradients (OPG) estimator and the Sandwich estimator lead to nearly perfectly correlated results for two different deep learning models. Since the models discussed in this paper are identical to those in [11], we thus focus only on one of the estimators, namely the OPG estimator defined by
$$\hat{\Sigma}_{\text{OPG}} = \frac{1}{N} \left[ \frac{1}{N} \sum_{n=1}^{N} g_n(\hat{\omega})\, g_n(\hat{\omega})^{T} + \lambda I \right]^{-1}, \qquad g_n(\hat{\omega}) = \left. \frac{\partial C_n(\omega)}{\partial \omega} \right|_{\omega = \hat{\omega}}, \qquad (5)$$
where the summation part of (5) corresponds to the empirical covariance of the per-example gradients of the cost function evaluated at $\hat{\omega}$. As discussed in [11], the term $\lambda I$ is explicitly added in order to make the OPG estimator asymptotically equal to the Hessian estimator, since this equality is the primary motivation for using the former as a plug-in replacement of the latter in the first place.
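As an illustration of eq. (5), a dense $O(P^2)$ computation of the OPG estimator could look as follows. This is only feasible for small toy networks – which is precisely why [11] works with the top $K$ eigenpairs instead – and the treatment of `loss_fn` as the per-example cost excluding the regularization term follows our reading of (5):

```python
import numpy as np
import tensorflow as tf

def opg_estimator(model, loss_fn, x_train, y_train, lam):
    """Dense OPG covariance estimator, eq. (5): the inverse of the mean
    outer product of per-example cost gradients plus lam * I, scaled by
    1/N. Requires O(P^2) memory -- toy-scale illustration only."""
    N = x_train.shape[0]
    P = sum(int(tf.size(v)) for v in model.trainable_variables)
    G = np.zeros((P, P))
    for n in range(N):
        with tf.GradientTape() as tape:
            loss = loss_fn(y_train[n:n+1], model(x_train[n:n+1]))
        grads = tape.gradient(loss, model.trainable_variables)
        g = np.concatenate([tf.reshape(gr, (-1,)).numpy() for gr in grads])
        G += np.outer(g, g) / N             # empirical covariance of gradients
    G += lam * np.eye(P)                    # the explicitly added lambda*I term
    return np.linalg.inv(G) / N
```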
When the Delta method is implemented under the framework of [11], it has several desirable properties: a) requires only $O(PK)$ space and time, b) fits well with deep learning software frameworks based on automatic differentiation, c) works with any $L_2$-regularized neural network architecture, and d) does not interfere with the training process as long as the norm of the gradient of the cost function is approximately equal to zero after training, $\|\nabla C(\hat{\omega})\| \approx 0$.
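To illustrate how such a truncation can be exploited, the following sketch evaluates eq. (3) using only the top $K$ eigenpairs, with the remaining $P - K$ eigenvalues replaced by a constant. Using the regularization rate $\lambda$ as that constant is our assumption here, made for illustration; the exact treatment of the residual spectrum and its error bounds is given in [11]:

```python
import numpy as np

def delta_sigma(F, eigvals, eigvecs, lam, N):
    """Approximate per-class epistemic std. dev. (eq. (3)) from the top-K
    eigenpairs of the regularized information matrix in eq. (5).

    F       : (M, P) sensitivity matrix, eq. (4)
    eigvals : (K,) largest eigenvalues; eigvecs: (P, K) eigenvectors
    ASSUMPTION: the remaining P-K eigenvalues are approximated by lam."""
    # Sigma ~ [U (L^-1 - lam^-1 I) U^T + I / lam] / N
    FU = F @ eigvecs                                   # (M, K): O(MPK) work
    var = (FU**2 @ (1.0 / eigvals - 1.0 / lam)         # truncated part
           + (F**2).sum(axis=1) / lam) / N             # constant remainder
    return np.sqrt(var)
```

Note that the full $P \times P$ covariance matrix is never formed; only matrix–vector products against the $K$ eigenvectors are needed, which is what keeps the cost linear in $P$.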
3 The Neural Network Classifiers
We deploy two LeNet-based neural network architectures which differ only by the number of neurons in two of the layers, in order to individually match the formats of the MNIST and CIFAR-10 datasets. Our TensorFlow code for the Delta method is based on the pydeepdelta Python module [14], and is fully deterministic [10]. The corresponding Bootstrap implementation can be found in the same repository.
3.1 MNIST
There are six layers: layer 0 is the input layer, represented by the $28 \times 28 \times 1$ input vector. Layer 1 is a convolutional layer followed by max pooling with stride 2 and with a ReLU activation function. Layer 2 is a convolutional layer followed by max pooling with stride 2, and with a ReLU activation function. Layer 3 is a convolutional layer with a ReLU activation function. Layer 4 is a dense layer with a ReLU activation function, and the output layer 5 is a dense layer with a softmax activation function, where the number of classes (outputs) is $M = 10$.
3.2 CIFAR-10
There are six layers: layer 0 is the input layer, represented by the $32 \times 32 \times 3$ input vector. Layer 1 is a convolutional layer followed by max pooling with stride 2 and with a ReLU activation function. Layer 2 is a convolutional layer followed by max pooling with stride 2, and with a ReLU activation function. Layer 3 is a convolutional layer with a ReLU activation function. Layer 4 is a dense layer with a ReLU activation function, and the output layer 5 is a dense layer with a softmax activation function, where the number of classes (outputs) is $M = 10$.
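To make the shared layer sequence concrete, the following Keras sketch builds the CIFAR-10 variant. The filter and neuron counts and the $\lambda$ default are hypothetical stand-ins for the exact values used in [11] and [14]; the MNIST variant is obtained by changing the input shape to (28, 28, 1) and adjusting two layer widths:

```python
import tensorflow as tf
from tensorflow.keras import layers

def make_model(input_shape=(32, 32, 3), lam=0.01):
    """LeNet-style classifier following the layer sequence of Sections
    3.1-3.2. Filter/neuron counts and lam are illustrative stand-ins;
    see [11] and [14] for the exact values."""
    reg = tf.keras.regularizers.l2(lam)
    return tf.keras.Sequential([
        tf.keras.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu", kernel_regularizer=reg,
                      bias_initializer="zeros"),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Conv2D(64, 3, activation="relu", kernel_regularizer=reg,
                      bias_initializer="zeros"),
        layers.MaxPooling2D(pool_size=2, strides=2),
        layers.Conv2D(64, 3, activation="relu", kernel_regularizer=reg,
                      bias_initializer="zeros"),
        layers.Flatten(),
        layers.Dense(64, activation="relu", kernel_regularizer=reg,
                     bias_initializer="zeros"),
        layers.Dense(10, activation="softmax", kernel_regularizer=reg),
    ])
```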
3.3 Training Details
For the Bootstrap networks, we test two different weight initialization variants: dynamic random normal weight initialization (DRWI) and static random normal weight initialization (SRWI). The former uses a different (i.e. dynamic) seed across the $B$ replicates, meaning that each network in the DRWI Bootstrap ensemble will start out with different random weight values. The latter uses the same (i.e. static) seed across the $B$ replicates, and hence all the networks in the SRWI Bootstrap ensemble receive the same random initial weight values. For all networks, we use zero bias initialization. Furthermore, to investigate the impact of random weight initialization on the Delta method, we apply the Delta method 16 times on a set of 16 networks distinguished only by DRWI.
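A minimal sketch of the two seeding schemes could look as follows; the base seed value is an arbitrary choice made for illustration:

```python
import tensorflow as tf

def make_initializers(b, static):
    """Weight initializers for Bootstrap replicate b.

    static=True  (SRWI): the same seed for every replicate, so all B
                 networks start from identical random weights.
    static=False (DRWI): a replicate-dependent seed, so every network
                 starts from different random weights.
    Biases are initialized to zero in both cases."""
    seed = 42 if static else 42 + b          # 42 is an arbitrary base seed
    kernel_init = tf.keras.initializers.RandomNormal(seed=seed)
    bias_init = tf.keras.initializers.Zeros()
    return kernel_init, bias_init
```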
We use the cross-entropy cost function with an $L_2$-regularization rate $\lambda$, and utilize the Adam [7, 1] optimizer with a fixed batch size and no form of randomized data shuffling. To ensure convergence (i.e. $\|\nabla C(\hat{\omega})\| \approx 0$), we apply two slightly different learning rate schedules given by (step, rate) pairs for MNIST and CIFAR-10, and stop the trainings after a fixed number of steps for each dataset – corresponding to the overall training statistics shown in Table 1.
*Table 1: Overall training statistics.*

| Networks | Dataset | Training Set Accuracy | Test Set Accuracy |
| --- | --- | --- | --- |
| DRWI Bootstrap $B=100$ | MNIST | | |
| | CIFAR-10 | | |
| SRWI Bootstrap $B=100$ | MNIST | | |
| | CIFAR-10 | | |
| Delta 16 reps (DRWI) | MNIST | | |
| | CIFAR-10 | | |
4 Comparison
The basic comparison design entails a set of 16 linear regressions on the predictive uncertainty estimates obtained from the two methods, using the test sets as input data,
$$\hat{\sigma}_{\text{Delta}}^{(r)}(x_0) = \beta_0 + \beta_1\, \hat{\sigma}_{\text{Boot}}(x_0) + \epsilon, \qquad r = 1, 2, \ldots, 16, \qquad (6)$$
Accounting for the two variants of the Bootstrap (SRWI/DRWI), this leads to two sets of squared correlation coefficients, intercepts, slopes and Delta method approximation errors, denoted by $R^2$, $\beta_0$, $\beta_1$ and $\delta$, respectively. Furthermore, as we wish to analyze the impact of the number of Bootstrap replicates $B$ and the number of Delta method eigenpairs $K$, we generate these sets for various values of $B$ and $K$. An outline of the setup is shown in Figure 1.

Figure 2 shows scatter plots of the regression results for the first repetition ($r = 1$) of the Delta method against the DRWI Bootstrap ensemble. These plots are based on $B = 100$ bootstrap replicates, and we have selected conservatively large numbers of eigenpairs $K$ for MNIST and CIFAR-10. Clearly, there is a strong linear relationship between the two methods: the squared correlation coefficients $R^2$ are high for both MNIST and CIFAR-10. On the other hand, the absolute uncertainty level differs between the methods and datasets. This can be seen from the slope coefficients, where the Delta method is overestimating ($\beta_1 > 1$) on MNIST, and underestimating ($\beta_1 < 1$) on CIFAR-10. Further, since the estimated intercepts ($\beta_0$) are zero, there are no offsets between the methods. Finally, we see that the maximum across examples and class outputs of the Delta method approximation errors ($\delta$) is zero, so there is nothing to be achieved by increasing $K$. As we will see later, $K$ has here been selected unnecessarily high and can be significantly reduced with no loss of accuracy.


4.1 Discussion of the Regression Results as a Function of $K$ and $B$
The results from the full set of 16 regressions as a function of $K$, holding $B$ fixed at 100, are shown in Figure 3. The primary observations are as follows: The mean squared correlation coefficients are generally high for MNIST and CIFAR-10, meaning that there is a strong linear relationship between the uncertainty levels obtained by the Bootstrap and the Delta method. For the lowest $K$, the mean $R^2$ starts out at 90% for MNIST, and at 81% for CIFAR-10. As $K$ grows, an increase of only a few percent is observed for MNIST, while about 8% for CIFAR-10. The major difference observed as $K$ increases lies in the absolute uncertainty levels expressed by the slope $\beta_1$: for MNIST, the slope stabilizes at a level above one, while for CIFAR-10 it stabilizes below one. The same trend is reflected in the maximum approximation errors $\delta$, where we respectively see them approach zero at the same values of $K$. Although not shown in the plots, the regression intercepts $\beta_0$ are always zero, meaning that there is no offset between the uncertainty estimates of the two methods.

Delta vs. SRWI Bootstrap

Delta vs. DRWI Bootstrap

Delta vs. SRWI Bootstrap

Delta vs. DRWI Bootstrap
The main difference found from applying DRWI as opposed to SRWI for the Bootstrap ensembles is that the absolute level of uncertainty increases with DRWI. This is expected, since the DRWI version of the Bootstrap will be more prone to reaching different local minima, and therefore also captures this additional variance. Supporting evidence for this hypothesis can be seen in CIFAR-10’s wider confidence intervals: a more pronounced geometry difference across various local minima will ultimately lead to higher variability in the estimated slopes $\beta_1$ and correlation coefficients $R^2$. A slightly higher mean $R^2$ (+1–2%) is also observed for the DRWI version of the Bootstrap. This is reasonable given the fact that the Delta method networks are also more prone to reaching different local minima across the 16 repetitions because of DRWI.
Figure 4 shows the same type of comparison when the number of Bootstrap replicates $B$ varies, and the number of eigenpairs is fixed ($K = 600$ for MNIST and $K = 1000$ for CIFAR-10). The main observation from this experiment is that there is very little to be achieved by selecting an ensemble size larger than about $B = 50$, as this is the point where the mean slope and squared correlation coefficient stabilize.

Delta vs. SRWI Bootstrap

Delta vs. DRWI Bootstrap

Delta vs. SRWI Bootstrap

Delta vs. DRWI Bootstrap
4.2 Computation Time
Table 2 shows the computation time for the two methods when executed on an Nvidia RTX 2080 Ti GPU. For MNIST, the smallest $K$ leading to acceptable approximation errors and stable absolute uncertainty levels for the Delta method is $K = 600$, while for CIFAR-10 the same applies at $K = 1000$. Furthermore, the smallest acceptable $B$ leading to stable correlation and absolute uncertainty levels for the Bootstrap is $B = 50$. We conclude that in these experiments the Delta method outperforms the Bootstrap in terms of computation time by a factor of about 4.6 on MNIST, and a factor of about 5.8 for CIFAR-10.
| Method | Classifier | B | K | Initial Phase [h:mm:ss] | Prediction Phase, Training Set [mm:ss] | Prediction Phase, Test Set [mm:ss] | Total [h:mm:ss] |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Bootstrap | MNIST | 50 | N/A | 4:08:28 | 00:19 | 00:03 | 4:08:50 |
| | CIFAR-10 | 50 | N/A | 7:37:16 | 00:40 | 00:07 | 7:38:04 |
| Delta | MNIST | N/A | 600 | 0:42:33 | 9:52 | 1:37 | 0:54:02 |
| | CIFAR-10 | N/A | 1000 | 1:00:54 | 14:44 | 02:56 | 1:18:35 |
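The speedup factors quoted above follow directly from the ‘Total’ column of Table 2:

```python
def hms_to_s(t):
    """'h:mm:ss' or 'mm:ss' -> seconds."""
    s = 0
    for p in t.split(":"):
        s = 60 * s + int(p)
    return s

# Total wall-clock times from Table 2.
print(hms_to_s("4:08:50") / hms_to_s("0:54:02"))   # MNIST:    ~4.6x
print(hms_to_s("7:38:04") / hms_to_s("1:18:35"))   # CIFAR-10: ~5.8x
```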
5 Concluding Remarks
We have shown that there is a strong linear relationship between the predictive epistemic uncertainty estimates obtained by the Bootstrap and the Delta method when applied to two different deep learning classification models. Firstly, we find that the number of eigenpairs $K$ in the Delta method can be selected orders of magnitude lower than the number of parameters $P$ with no loss of correspondence between the methods. This coincides with the fact that when the Delta method approximation errors are sufficiently close to zero, there is nothing to be achieved by a further increase in $K$, and the correspondence will therefore stabilize at this point.
Secondly, we find that the DRWI version of the Bootstrap yields the best correspondence, and that there is little to be achieved by using more than $B = 50$ replicates. Thirdly, we observe that the most complex model (CIFAR-10) yields a higher variability in the correspondence across multiple DRWI Delta method runs. We interpret this effect as caused by cost functional multi-modality, and that the Delta method fails to capture the additional variance tied to reaching local minima with different geometric characteristics. Finally, in our experiments we have seen that the Delta method outperforms the Bootstrap in terms of computation time by a factor of about 4.6 on MNIST and by a factor of about 5.8 for CIFAR-10.
References
- [1] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Rev., vol. 60, no. 2, pp. 223-311, 2018.
- [2] B. Efron. Bootstrap methods: Another look at the jackknife. Ann. Stat., vol. 7, no. 1, pp. 1–26., 1979.
- [3] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. http://www.deeplearningbook.org, MIT Press, 2016.
- [4] J. M. Ver Hoef. Who Invented the Delta Method? https://www.researchgate.net/publication/254329376_Who_Invented_the_Delta_Method, The American Statistician, 66(2):124–127, 2012.
- [5] E. Hüllermeier and W. Waegeman. Aleatoric and Epistemic Uncertainty in Machine Learning: An Introduction to Concepts and Methods. https://arxiv.org/abs/1910.09457, arXiv:1910.09457v2 [cs.LG], 2020.
- [6] A. Khosravi and D. Creighton. A Comprehensive Review of Neural Network-based Prediction Intervals and New Advances. https://www.researchgate.net/publication/51534965_Comprehensive_Review_of_Neural_Network-Based_Prediction_Intervals_and_New_Advances, IEEE Transactions On Neural Networks, Vol. 22, No. 9, 2011.
- [7] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. In Proc. 3rd Int. Conf. Learn. Representations, 2014.
- [8] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. https://arxiv.org/pdf/1612.01474, arXiv:1612.01474v3 [stat.ML], 2017.
- [9] D. MacKay. A practical Bayesian framework for backpropagation networks. http://www.inference.org.uk/mackay/PhD.html#PhD, Neural Computation, 4(3):448–472, 1992.
- [10] P. Nagarajan and G. Warnell. Deterministic Implementations for Reproducibility in Deep Reinforcement Learning. https://arxiv.org/abs/1809.05676, arXiv:1809.05676 [cs.AI], 2019.
- [11] G. K. Nilsen, A. Z. Munthe-Kaas, H. J. Skaug, and M. Brun. Epistemic Uncertainty Quantification in Deep Learning Classification by the Delta Method. https://arxiv.org/abs/1912.00832, arXiv:1912.00832 [cs.LG], 2021.
- [12] I. Osband. Risk versus Uncertainty in Deep Learning: Bayes, Bootstrap and the Dangers of Dropout. http://bayesiandeeplearning.org/2016/papers/BDL_4.pdf, NIPS Workshop on Bayesian Deep Learning, 2016.
- [13] I. Osband, C. Blundell, A. Pritzel, and B. V. Roy. Deep Exploration via Bootstrapped DQN. https://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn.pdf, Conference on Neural Information Processing Systems (NIPS), 2016.
- [14] pyDeepDelta: A TensorFlow Module Implementing the Delta Method in Deep Learning Classification. https://github.com/gknilsen/pydeepdelta.git.