This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

A Comparison of the Delta Method and the Bootstrap in Deep Learning Classification

Geir K. Nilsen Department of Mathematics, University of Bergen [email protected] Antonella Z. Munthe-Kaas Department of Mathematics, University of Bergen Hans J. Skaug Department of Mathematics, University of Bergen Morten Brun Department of Mathematics, University of Bergen
Abstract

We validate the deep learning classification adapted Delta method introduced in [11] by a comparison with the classical Bootstrap. We show that there is a strong linear relationship between the quantified predictive epistemic uncertainty levels obtained from the two methods when applied on two LeNet-based neural network classifiers using the MNIST and CIFAR-10 datasets. Furthermore, we demonstrate that the Delta method offers a five times computation time reduction compared to the Bootstrap.

1 Introduction

It can be beneficial to distinguish between epistemic and aleatoric uncertainty in machine learning models [5]. Bayesian statistics provides a coherent framework for representing epistemic uncertainty in neural networks [9], but has not so far gained widespread use in deep learning [3] – presumably due to the high computational cost that traditionally comes with Fisher information based methods. In particular, the Delta method [4, 6] depends on the empirical Fisher information matrix which grows quadratically with the number of neural network parameters PP – and its direct application in modern deep learning is therefore prohibitively expensive. To mitigate this, [11] proposed a low cost variant of the Delta method applicable to L2L_{2}-regularized deep neural networks based on the top KK eigenpairs of the Fisher information matrix.

In this paper, we validate the methodology introduced in [11] by a comparison with the classical Bootstrap [2, 6, 8, 12, 13]. We show that there is a strong linear relationship between the quantified epistemic uncertainty levels obtained from the two methods when applied on two LeNet-based neural network classifiers using the MNIST and CIFAR-10 datasets.

The paper is organized as follows: in Section 2 we review the Bootstrap and the Delta method in a deep learning classification context. In Section 3 we introduce two LeNet-based classifiers which will be used in the comparison in Section 4, and finally, in Section 5 we summarize the paper and give some concluding remarks.

2 Introduction to the Methodologies

In the following, we denote the training set by {xnT1,ynTL}n=1N\{x_{n}\in\mathbb{R}^{T_{1}},y_{n}\in\mathbb{R}^{T_{L}}\}_{n=1}^{N}, the test set by {xnT1,ynTL}n=1Ntest\{x_{n}\in\mathbb{R}^{T_{1}},y_{n}\in\mathbb{R}^{T_{L}}\}_{n=1}^{N_{\text{test}}} and an arbitrary input example by x0x_{0}. The parameter space is denoted by the vector ωP\omega\in\mathbb{R}^{P}, where PP is the number of parameters (weights and biases) in the model. The parameter values after training is denoted by the vector ω^P\hat{\omega}\in\mathbb{R}^{P}. Furthermore, a prediction for x0x_{0} is denoted by y^0=f(x0,ω^)TL\hat{y}_{0}=f(x_{0},\hat{\omega})\in\mathbb{R}^{T_{L}} where f:T1×PTLf:\mathbb{R}^{T_{1}\times P}\rightarrow\mathbb{R}^{T_{L}} is a deep neural network model function [3] and where TLT_{L} denotes the number of classes. Furthermore, it is assumed that the cost function denoted by CC is L2L_{2}-regularized with a regularization-rate factor λ/2\lambda/2.

2.1 The Bootstrap in Deep Learning Classification

In the context of deep learning classification, the classical Bootstrap method starts by creating BB datasets from the original dataset by sampling with replacement. Subsequently, BB networks are trained separately on each of the bootstrapped datasets. The epistemic uncertainty for each of the TLT_{L} class predictions (in standard deviations) associated with prediction of x0x_{0} is obtained by the sample standard deviation over the ensemble of BB predictions,

σ~boot(x0)=1B1b=1B(y^0(b)y^0¯)2TL,\widetilde{\sigma}_{\text{boot}}(x_{0})=\sqrt{\frac{1}{B-1}\sum_{b=1}^{B}(\hat{y}_{0}^{(b)}-\overline{\hat{y}_{0}})^{2}}\in\mathbb{R}^{T_{L}}, (1)

where the vector y^0(b)\hat{y}_{0}^{(b)} represents the TLT_{L} predictions for x0x_{0} (one probability per class) obtained from the bbth bootstrapped network, and where y^0¯\overline{\hat{y}_{0}} is the sample mean,

y^0¯=1Bb=1By^0(b)TL.\overline{\hat{y}_{0}}=\frac{1}{B}\sum_{b=1}^{B}\hat{y}_{0}^{(b)}\in\mathbb{R}^{T_{L}}. (2)

The method is easy to implement efficiently in practice. Training BB networks is an ‘embarrassingly’ parallel problem, and the space complexity for the bootstrapped datasets is just O(BN)O(BN) when an indexing scheme is used for the sampling with replacement. The experiments conducted in this paper is based on the example pydeepboot.py provided in the pydeepdelta provision [14].

2.2 The Delta Method in Deep Learning Classification

The Delta method was adapted to the deep learning classification context by [11]. The adaption addresses several fundamental difficulties that arise when the method is applied in deep learning. In essence, it is shown that an approximation of the eigendecomposition of the Fisher information matrix utilizing only KK eigenpairs allows for an efficient implementation with bounded worst-case approximation errors. We briefly review the standard method here for convenience.

An approximation of the epistemic component of the uncertainty associated with the prediction of x0x_{0} can be found by the formula

σ~delta(x0)=diag(FΣFT)TL,\widetilde{\sigma}_{\text{delta}}(x_{0})=\sqrt{\text{diag}\big{(}F\Sigma F^{T}\big{)}}\in\mathbb{R}^{T_{L}}, (3)

where the sensitivity matrix FF in (3) is defined

F=[Fij]TL×P,Fij=ωjfi(x0,ω)|ω=ω^.F=\begin{bmatrix}F_{ij}\end{bmatrix}\in\mathbb{R}^{T_{L}\times P},~{}F_{ij}=\frac{\partial}{\partial\omega_{j}}f_{i}(x_{0},\omega)\bigg{\rvert}_{\omega=\hat{\omega}}. (4)

The covariance matrix Σ\Sigma in (3) can be estimated by several alternative estimators. In [11] it was demonstrated that the Hessian estimator, the Outer-Products of Gradients (OPG) estimator and the Sandwich estimator lead to nearly perfect correlated results for two different deep learning models. Since the models discussed in this paper are identical to those in [11], we thus focus only on one of the estimators, namely the OPG estimator defined by

Σ=1NG1=1N[1Nn=1NCnωCnωT|ω=ω^+λI]1P×P,\Sigma=\frac{1}{N}G^{-1}=\frac{1}{N}\left[\frac{1}{N}\sum_{n=1}^{N}\frac{\partial C_{n}}{\partial\omega}\frac{\partial C_{n}}{\partial\omega}^{T}\bigg{\rvert}_{\omega=\hat{\omega}}+\lambda I\right]^{-1}\in\mathbb{R}^{P\times P}, (5)

where the summation part of GG corresponds to the empirical covariance of the gradients of the cost function evaluated at ω^\hat{\omega}. As discussed in [11], the term λI\lambda I is explicitly added in order to make the OPG estimator asymptotically equal to the Hessian estimator, as is the primary motivation for the former as a plug-in replacement of the latter in the first place.

When the Delta method is implemented under the framework of [11], it has several desirable properties: a) requires only O(PK)O(PK) space and O(KPN)O(KPN) time, b) fits well with deep learning software frameworks based on automatic differentiation, c) works with any L2L_{2}-regularized neural network architecture, and d) does not interfere with the training process as long as the norm of the gradient of the cost function is approximately equal to zero after training.

3 The Neural Network Classifiers

We deploy two LeNet-based neural network architectures which differs only by the number of neurons in two of the layers in order to individually match the formats of the MNIST and CIFAR-10 datasets. Our TensorFlow code for the Delta method is based on the pydeepdelta Python module [14], and is fully deterministic [10]. The corresponding Bootstrap implementation can be found in the same repository.

3.1 MNIST

There are L=6L=6 layers, layer l=1l=1 is the input layer represented by the input vector. Layer l=2l=2 is a 3×3×1×323\times 3\times 1\times 32 convolutional layer followed by max pooling with stride equal to 22 and with a ReLU activation function. Layer l=3l=3 is a 3×3×32×643\times 3\times 32\times 64 convolutional layer followed by max pooling with a stride equal to 22, and with ReLU activation function. Layer l=4l=4 is a 3×3×64×643\times 3\times 64\times 64 convolutional layer with ReLU activation function. Layer l=5l=5 is a 576×64576\times 64 dense layer with ReLU activation function, and the output layer l=6l=6 is a 64×TL64\times T_{L} dense layer with softmax activation function, where the number of classes (outputs) is TL=10T_{L}=10. The total number of parameters is P=93322P=93322.

3.2 CIFAR-10

There are L=6L=6 layers, layer l=1l=1 is the input layer represented by the input vector. Layer l=2l=2 is a 3×3×3×323\times 3\times 3\times 32 convolutional layer followed by max pooling with stride equal to 22 and with a ReLU activation function. Layer l=3l=3 is a 3×3×32×643\times 3\times 32\times 64 convolutional layer followed by max pooling with a stride equal to 22, and with ReLU activation function. Layer l=4l=4 is a 3×3×64×643\times 3\times 64\times 64 convolutional layer with ReLU activation function. Layer l=5l=5 is a 1024×641024\times 64 dense layer with ReLU activation function, and the output layer l=6l=6 is a 64×1064\times 10 dense layer with softmax activation function, where the number of classes (outputs) is TL=10T_{L}=10. The total number of parameters is P=122570P=122570.

3.3 Training Details

For the Bootstrap networks, we test two different weight initialization variants: dynamic random normal weight initialization (DRWI) and static random normal weight initialization (SRWI). The former uses a different (e.g. dynamic) seed across the replicates, meaning that each network in the DRWI Bootstrap ensemble will start out with different random weight values. The latter case uses the same (e.g. static) seed across the replicates, and hence all the networks in the SRWI Bootstrap ensemble receives the same random initial weight values. For all networks, we use zero bias initialization. Futhermore, to investigate the impact of random weight initialization on the Delta method, we apply the Delta method 16 times on a set of 16 networks distinguished only by DRWI.

We use the cross-entropy cost function with a L2L_{2}-regularization rate λ=0.01\lambda=0.01, and utilize the Adam [7, 1] optimizer with a batch size of 100100, and no form of randomized data shuffling. To ensure convergence (e.g. C(ω^)20||\nabla C(\hat{\omega})||_{2}\approx 0), we apply two slightly different learning rate schedules given by the following (step, rate) pairs: MNIST = {(0,103),(60k,104),(70k,105),(80k,106)}\{(0,10^{-3}),(60\text{k},10^{-4}),(70\text{k},10^{-5}),(80\text{k},10^{-6})\} and CIFAR-10 = {(0,103),(55k,104),(85k,105),(95k,106,(105k,107)}\{(0,10^{-3}),(55\text{k},10^{-4}),(85\text{k},10^{-5}),(95\text{k},10^{-6},(105\text{k},10^{-7})\}. For MNIST, we stop the trainings after 90,00090,000 steps, while for CIFAR-10, after 115,000115,000 steps – corresponding to the overall training statistics shown in Table 1.

Networks Dataset Training Set Accuracy Test Set Accuracy C(ω^)C(\hat{\omega}) C(ω^)2||\nabla C(\hat{\omega})||_{2}
DRWI Bootstrap B=100 MNIST 0.979±0.0000.979\pm 0.000 0.981±0.0010.981\pm 0.001 0.253±0.0060.253\pm 0.006 0.016±0.0130.016\pm 0.013
CIFAR-10 0.705±0.0250.705\pm 0.025 0.684±0.0200.684\pm 0.020 1.248±0.0421.248\pm 0.042 0.035±0.0200.035\pm 0.020
SRWI Bootstrap B=100 MNIST 0.979±0.0000.979\pm 0.000 0.981±0.0010.981\pm 0.001 0.254±0.0020.254\pm 0.002 0.017±0.0130.017\pm 0.013
CIFAR-10 0.715±0.0100.715\pm 0.010 0.693±0.0090.693\pm 0.009 1.235±0.0181.235\pm 0.018 0.031±0.0140.031\pm 0.014
Delta 16 reps (DRWI) MNIST 0.979±0.0000.979\pm 0.000 0.981±0.0010.981\pm 0.001 0.257±0.0020.257\pm 0.002 0.016±0.0050.016\pm 0.005
CIFAR-10 0.701±0.0320.701\pm 0.032 0.687±0.0290.687\pm 0.029 1.284±0.0531.284\pm 0.053 0.030±0.0120.030\pm 0.012
Table 1: Training statistics for the Delta and Bootstrap networks. The DRWI and SRWI Bootstrap ensembles each consists of B=100B=100 bootstrapped networks, while the Delta method is applied repeatedly on 16 networks distinguished only by DRWI. Averages ±\pm two standard deviations are calculated across the B=100B=100 networks for the Bootstrap, and across the 16 repetitions for the Delta method.

4 Comparison

The basic comparison design entails a set of 16 linear regressions on the predictive uncertainty estimates obtained from the two methods using test sets as input data

σ~boot(xn)m=αd+βdσ~delta(xn)m,d+en,m,d,n\displaystyle\widetilde{\sigma}_{\text{boot}}(x_{n})_{m}=\alpha_{d}+\beta_{d}\widetilde{\sigma}_{\text{delta}}(x_{n})_{m,d}+e_{n,m,d},\quad n =1,2,,Ntest\displaystyle=1,2,\ldots,N_{\text{test}}
m\displaystyle\quad m =1,2,,TL\displaystyle=1,2,\ldots,T_{L}
d\displaystyle\quad d =1,2,,16.\displaystyle=1,2,\ldots,16. (6)

Accounting for the two variants of the Bootstrap (SRWI/DRWI), this leads to two sets of squared correlation coefficients, intercepts, slopes and Delta method approximation errors, respectively denoted by {Rd2,αd,βd,ϵd}d=116\{R^{2}_{d},\alpha_{d},\beta_{d},\epsilon_{d}\}_{d=1}^{16}. Furthermore, as we wish to analyze the impact of the number of Bootstrap replicates and the number of Delta method eigenpairs, we generate these sets for various BB and KK. An outline of the setup is shown in Figure 1.

Refer to caption
Figure 1: Regression (6) of σ~boot\widetilde{\sigma}_{\text{boot}} onto σ~delta\widetilde{\sigma}_{\text{delta}}.

Figure 2 shows scatter plots of the regression results for the first repetition (d=1d=1) of the Delta method against the DRWI Bootstrap ensemble. These plots are based on B=100B=100 bootstrap replicates, and we have selected K=1500K=1500 eigenpairs for MNIST and K=2500K=2500 eigenpairs for CIFAR-10. Clearly, there is a strong linear relationship between the two methods: the squared correlation coefficients are R12=0.94R_{1}^{2}=0.94 for MNIST and R12=0.90R_{1}^{2}=0.90 for CIFAR-10. On the other hand, the absolute uncertainty level differs between the methods and datasets. This can be seen by the slope coefficients, where the Delta method is overestimating (β1<1\beta_{1}<1) on MNIST, and underestimating (β1>1\beta_{1}>1) on CIFAR-10. Further, since the estimated intercepts (α1\alpha_{1}) are zero, there are no offsets between the methods. Finally, we see that the maximum across examples and class outputs of the Delta method approximation errors (ϵ1\epsilon_{1}) are zero, so there is nothing to be achieved by increasing KK. As we will see later, KK has here been selected unnecessarily high and can be significantly reduced with no loss of accuracy.

Refer to caption
(a) MNIST
Refer to caption
(b) CIFAR-10
Figure 2: Predictive uncertainty estimates obtained from the Delta method (first repetition, d=1d=1) against the DRWI Bootstrap for (a) MNIST using B=100B=100 replicates and K=1500K=1500 eigenpairs, and (b) CIFAR-10 using B=100B=100 replicates and K=2500K=2500 eigenpairs.

4.1 Discussion of the Regression Results as a Function of BB and KK

The results from the full set of regressions (d=1,2,,16d=1,2,\ldots,16) holding a fixed B=100B=100 are shown in Figure 3. The primary observations are as follows: The mean squared correlation coefficients R2R^{2} are generally high for MNIST and CIFAR-10, meaning that there is a strong linear relationship between the uncertainty levels obtained by the Bootstrap and the Delta method. For the lowest KK, the R2R^{2} starts out at 90% for MNIST, and at 81% for CIFAR-10. As KK grows, an increase by only 44% is observed for MNIST, while 8% for CIFAR-10. The major difference observed as KK increases lies in the absolute uncertainty levels expressed by the slope β\beta: for MNIST, the slope stabilizes at around K=600K=600 while at about K=1000K=1000 for CIFAR-10. The same trend is reflected in the maximum approximation errors ϵ\epsilon, where we respectively see them approach zero at the same values for KK. Although not shown in the plots, the regression intercepts α\alpha are always zero, meaning that there is no offset in the uncertainty estimates by the two methods.

Refer to caption
(a) MNIST
Delta vs. SRWI Bootstrap
Refer to caption
(b) MNIST
Delta vs. DRWI Bootstrap
Refer to caption
(c) CIFAR-10
Delta vs. SRWI Bootstrap
Refer to caption
(d) CIFAR-10
Delta vs. DRWI Bootstrap
Figure 3: Summaries of the regressions of σ~boot\widetilde{\sigma}_{\text{boot}} onto σ~delta\widetilde{\sigma}_{\text{delta}} as given by (6), for different values of KK and a fixed B=100B=100. The solid lines and the associated confidence intervals represent the mean and the variation of the regression results across the 16 repetitions of the Delta method.

The main difference found from applying DRWI opposed to SRWI for the Bootstrap ensembles, is that the absolute level of uncertainty increases with DRWI. This is expected, since the DRWI version of the Bootstrap will be more prone to reaching different local minima, and therefore also captures this additional variance. Supporting evidence for this hypothesis is evident by CIFAR-10’s wider confidence intervals. A more pronounced geometry difference across various local minima will ultimately lead to higher variability in the R2R^{2} and β\beta. A slightly higher mean R2R^{2} (+1-2%) is also observed for the DRWI version of the Bootstrap. This is reasonable given the fact that also the Delta method networks are more prone to reaching different local minima across the 16 repetitions because of DRWI.

Figure 4 shows the same type of comparison when the number of Bootstrap replicates BB varies, and the number of eigenpairs are fixed (K=1500K=1500 for MNIST and K=2500K=2500 for CIFAR-10). The main observation from this experiment is that there is very little to achieve by selecting a larger ensemble size BB than about 50, as this is the point where the mean slope and squared correlation coefficient stabilizes.

Refer to caption
(a) MNIST
Delta vs. SRWI Bootstrap
Refer to caption
(b) MNIST
Delta vs. DRWI Bootstrap
Refer to caption
(c) CIFAR-10
Delta vs. SRWI Bootstrap
Refer to caption
(d) CIFAR-10
Delta vs. DRWI Bootstrap
Figure 4: Summaries of the regressions of σ~boot\widetilde{\sigma}_{\text{boot}} onto σ~delta\widetilde{\sigma}_{\text{delta}} as given by (6), for different values of BB and a fixed number of eigenpairs KK. The solid lines and the belonging confidence intervals represent the mean and the variation of the regression results across the 16 repetitions of the Delta method.

4.2 Computation Time

Table 2 shows the computation time for the two methods when executed on a Nvidia RTX 2080 Ti based GPU. For MNIST, the smallest KK leading to acceptable approximation errors and stable absolute uncertainty levels for the Delta method is at K=600K=600, while for CIFAR-10 the same applies at K=1000K=1000. Furthermore, the smallest acceptable BB leading to stable correlation and absolute uncertainty levels for the Bootstrap is at B=50B=50. We conclude that in these experiments the Delta method outperforms the Bootstrap in terms of computation time by a factor 4.64.6 on MNIST, and a factor 5.95.9 for CIFAR-10.

Method Classifier B K Initial Phase [h:mm:ss] Prediction Phase [mm:ss] Total [h:mm:ss]
Training Set Test Set
Bootstrap MNIST 50 N/A 4:08:28 00:19 00:03 4:08:50
CIFAR-10 7:37:16 00:40 00:07 7:38:04
Delta MNIST N/A 600 0:42:33 9:52 1:37 0:54:02
CIFAR-10 1000 1:00:54 14:44 02:56 1:18:35
Table 2: Computation time for the Bootstrap and Delta method. For the Bootstrap, the ‘initial phase’ accounts for the parallelized training of BB networks, while the ‘prediction phase’ accounts for the predictive epistemic uncertainty estimation (1), which is further divided into the training and test sets. For the Delta method, the ‘initial phase’ accounts for the approximate eigendecomposition of the covariance matrix (5), while the ‘prediction phase’ accounts for the predictive epistemic uncertainty estimation (3), further divided into the training set and test sets.

5 Concluding Remarks

We have shown that there is a strong linear relationship between the predictive epistemic uncertainty estimates obtained by the Bootstrap and the Delta method when applied on two different deep learning classification models. Firstly, we find that the number of eigenpairs KK in the Delta method can be selected order of magnitudes lower than PP with no loss of correspondence between the methods. This coincides with the fact that when the Delta method approximation errors are sufficiently close to zero, there is no nothing to achieve by a further increase in KK, and therefore the correspondence will stabilize at this point.

Secondly, we find that the DRWI version of the Bootstrap yields the best correspondence, and that there is little to achieve by using more than B=50B=50 replicates. Thirdly, we observe that the most complex model (CIFAR-10) yields a high variability in the correspondence across multiple DRWI Delta method runs. We interpret this effect as caused by cost functional multi-modality, and that the Delta method fails to capture the additional variance tied to reaching local minima of different geometric characteristics. Finally, in our experiments we have seen that the Delta method outperforms the Bootstrap in terms of computation time by a factor 4.64.6 on MNIST and by a factor 5.95.9 for CIFAR-10.

References