
Beyond Point Estimate: Inferring Ensemble Prediction Variation from Neuron Activation Strength in Recommender Systems

Zhe Chen, Yuyan Wang, Dong Lin, Derek Zhiyuan Cheng, Lichan Hong, Ed H. Chi, Claire Cui Google, Inc.
{chenzhe,yuyanw,dongl,zcheng,lichan,edchi,claire}@google.com
Abstract.

Despite the impressive prediction performance of deep neural networks (DNNs) in various domains, it is now well known that a set of DNN models trained with the same model specification and the same data can produce very different prediction results. The ensemble method is a state-of-the-art benchmark for prediction uncertainty estimation. However, ensembles are expensive to train and serve for web-scale traffic.

In this paper, we seek to advance the understanding of prediction variation estimated by the ensemble method. Through empirical experiments on MovieLens and Criteo, two widely used benchmark datasets in recommender systems, we observe that prediction variations come from various randomness sources, including training data shuffling and parameter random initialization. By introducing more randomness into model training, we notice that the ensemble's mean predictions tend to be more accurate while the prediction variations tend to be higher. Moreover, we propose to infer prediction variation from neuron activation strength and demonstrate the strong prediction power of activation strength features. Our experiment results show that the average $R^2$ is as high as 0.56 on MovieLens and 0.81 on Criteo. Our method performs especially well when detecting the lowest and highest variation buckets, with 0.92 AUC and 0.89 AUC respectively. Our approach provides a simple way for prediction variation estimation, which opens up new opportunities for future work in many interesting areas (e.g., model-based reinforcement learning) without relying on serving expensive ensemble models.

Ensemble, Neural Networks, Neuron Activation, Prediction Uncertainty, Recommender Systems

1. Introduction

Deep neural networks (DNNs) have gained widespread adoption in recent years across many domains. Despite their impressive performance in various applications, most DNNs today only generate point predictions, and it is well known that a set of DNN models trained with the same model specification and the same data can produce very different predictions (Lakshminarayanan et al., 2017; Ovadia et al., 2019; Wen et al., 2020). Researchers realize that point predictions do not tell the whole story and raise questions about whether DNN predictions can be trusted (Jiang et al., 2018; Schulam and Saria, 2019).

In response, a growing body of research looks into quantifying prediction uncertainties for DNNs. The ensemble method is a state-of-the-art benchmark for prediction uncertainty estimation: it consolidates agreement among the ensemble members and produces better point predictions (Breiman, 1996; Dietterich, 2000; Lakshminarayanan et al., 2017; Ovadia et al., 2019). Much of this research focuses on whether these point predictions are well-calibrated on either in-distribution or out-of-distribution (OOD) data (Ovadia et al., 2019; Guo et al., 2017; Lee et al., 2017; Liu et al., 2020). However, there exist disagreements among the predictions in the ensemble, which we call prediction variation. For example, different models in the ensemble often yield different prediction results even on the same input example. Although prediction variation contributes to model uncertainty (Lakshminarayanan et al., 2017), there is no comprehensive study to advance the understanding of it.

Ensembles provide us with a good approximation of prediction variation, but they are computationally expensive as they require training multiple copies of the same model. At inference time, they require computing predictions on every example for every ensemble member, which can be infeasible for real-time large-scale machine learning systems. Researchers have proposed various resource-saving techniques (Gal and Ghahramani, 2016; Huang et al., 2017; Wen et al., 2020; Lu et al., 2020). However, as far as we know, none of these works studies whether we can infer prediction variation from neuron activation strength collected from the DNN directly, without running predictions on the same data multiple times. Here, we use neuron activation strength to indicate the DNN's neuron output strength, e.g., the value of the neuron after activation and whether the neuron is activated.

We hypothesize that neuron activation strength could be directly used to infer prediction variation. Our intuition is based on neuroscience's Long-Term Potentiation (LTP) process (Nicoll, 2017), which states that connections between neurons become stronger with more frequent activation. LTP is considered one of the underlying mechanisms for learning and memorization. If we imagine that deep networks learn like the brain, some groups of neurons will be more frequently and/or strongly activated, i.e., strengthened neurons. During learning, these strengthened neurons represent where the network has learned or memorized better. Therefore, we hypothesize that deep neural networks predict more confidently if an input activates through strengthened neurons, and less so if the input goes through weaker neurons.

Our Goal — In this paper, we aim to advance the understanding of prediction variations estimated by ensemble models, and look into the predictive power of neuron activation strength on prediction variations. To the best of our knowledge, we are the first to conduct comprehensive studies on prediction variations from different ensemble models, and we are also the first to demonstrate that we are able to infer prediction variation from neuron activation strength.

Challenges — We face the following challenges:

Variation Quantification — There are a variety of prediction problems. For example, predicting the target rating for a user on a given movie could be a regression task or a multi-class classification task by dividing the movie ratings into multiple buckets; predicting click-through rate is often modeled as a binary classification task. There is no standard way to quantify prediction variation for such a variety of tasks.

Variation Sources — Training a set of models with the same model specification and the same data can produce very different results, and the prediction disagreements are inherently caused by the nonconvex nature of DNN models in which multiple local minima exist. Multiple randomness sources could lead to the disagreements, such as random initialization of DNN parameters, random shuffling of training data, sub-sampling of training data, optimization algorithms, and even the hardware itself. It is often hard to identify the contribution of each randomness source to prediction variation.

Neuron Activation Strength — We hypothesize that a deep network’s prediction variations are strongly correlated with neuron activation strength throughout training and at inference time. However, it is not straightforward to demonstrate this relationship for different prediction problems and randomness sources.

Our Approach — In this paper, we investigate sources for prediction variation, and by controlling the randomness sources explicitly, we demonstrate that neuron activation strength has strong prediction power to infer ensemble prediction variations in almost all the randomness-controlled settings. We demonstrate our findings on two popular datasets used to evaluate recommender systems, MovieLens and Criteo.

First, we quantify prediction variation across different types of target tasks, including regression, binary classification, and multi-class classification tasks. We use the standard deviation of the ensemble predictions to quantify prediction variation for regression and binary classification tasks, and use a KL-divergence-based measure to quantify prediction distribution disagreement for multi-class classification tasks.

Second, we identify and examine three variation sources of randomness (data shuffling, weight initialization, and data re-sampling). By explicitly controlling randomness sources, we study their contributions to the performance of point prediction and prediction variation. Our results show that every variation source exhibits a non-negligible contribution towards the total prediction variation. The prediction variation may have different sensitivities to different types of variation sources on different target tasks. When we include more variation sources, the ensemble’s prediction mean tends to be more accurate, while the prediction variations are higher.

Finally, we demonstrate that neuron activation strength has strong prediction power in estimating prediction variation of ensemble models, while the neuron activation strength information can be obtained from a single DNN. With this strength information, we can add a cheap auxiliary task to estimate prediction variation directly. Our experiment results show that our activation strength based method estimates prediction variation fairly well as a regression task. The average $R^2$ on MovieLens is 0.43 and 0.51 with different task definitions, and on Criteo is 0.78. Our method is especially good at detecting the lowest and highest variation bucket examples, on average with a 0.92 AUC score for the lowest bucket and a 0.89 AUC score for the highest bucket on both datasets. Our approach is complementary and orthogonal to many other resource-saving or single-model prediction variation estimation techniques, as it doesn't alter the target task's optimization objectives and process.

Applications — Prediction variation quantification is a fundamental problem, and our activation strength based approach opens up new opportunities for many interesting applications. For example, in model-based reinforcement learning, prediction variation has to be quantified for exploration (Zhou et al., 2019; Chua et al., 2018). In curriculum learning (Bengio et al., 2009), prediction variation can be used as a way to estimate example difficulty. In the medical domain, prediction variation can be used to capture significant variability in patient-specific predictions (Dusenberry et al., 2020). Our proposed activation strength based method provides a simple and principled way to serve prediction variation estimates by deploying a cheap auxiliary task, instead of an expensive ensemble model, at inference time. We leave applying our method in the above scenarios to future work.

Contributions — Our contributions are fourfold:

  • Framework for prediction variation estimation using activation strength in Section 3.

  • Formal quantification on prediction variation for various target tasks in Section 4.

  • Prediction variation understanding by explicitly controlling various randomness sources in Section 5.

  • Empirical experiments to demonstrate strong predictive power from neuron activation strength to estimate ensemble prediction variation in Section 6.

We cover the related work in Section 2, and conclude with a discussion of future work in Section 7.

2. Related Work

In machine learning literature, researchers mostly focus on two distinct types of uncertainties: aleatoric uncertainty and epistemic uncertainty (Der Kiureghian and Ditlevsen, 2009). Aleatoric uncertainty is due to the stochastic variability inherent in the data generating process (Liu et al., 2019). Aleatoric uncertainty corresponds to data uncertainty, which describes uncertainty for a given outcome due to incomplete information (Knight, 1921). Epistemic uncertainty is due to our lack of knowledge about the data generating mechanism (Liu et al., 2019), and corresponds to model uncertainty, which can be viewed as uncertainty regarding the true function underlying the observed process (Bishop, 2006). In this paper, we focus on studying model uncertainty, especially prediction variations or disagreements.

There has been extensive research on methodologies for estimating model uncertainty and discussions on their comparisons (Ovadia et al., 2019). Principled approaches include Bayesian approaches (Neal, 2012; MacKay, 1992; Hinton and Van Camp, 1993; Louizos and Welling, 2017; Zhu and Zabaras, 2018) and ensemble-based approaches (Lakshminarayanan et al., 2017). Bayesian methods provide a mathematically grounded framework to model uncertainty, through learning the deep neural network as a Gaussian process (Neal, 2012), or learning approximate posterior distributions for all or some weights of the network (Blundell et al., 2015; Kwon et al., 2018). Ensemble methods (Lakshminarayanan et al., 2017), on the other hand, are a conceptually simpler way to estimate model uncertainty. There are multiple ways to create ensembles of neural networks: bagging (Breiman, 1996), Jackknife (McIntosh, 2016), random parameter initialization, or random shuffling of training examples. The resulting ensemble of neural networks contains some diversity, and the variation of their predictions can be used as an estimate of model uncertainty. Much research builds on the ensemble method (Berthelot et al., 2019; De Fauw et al., 2018; Chua et al., 2018; Leibig et al., 2017; Ovadia et al., 2019). For example, (De Fauw et al., 2018) demonstrates promising results using deep ensembles for diagnosis and referral in retinal disease, and (Chua et al., 2018) proposes a new algorithm for model-based reinforcement learning that incorporates uncertainty via ensembles. In this paper, we use the ensemble as the ground truth to produce prediction variations in different scenarios (i.e., different randomness settings).

Estimating model uncertainty through Bayesian modeling or ensembles usually incurs significant computation cost. For example, Bayesian neural networks that perform variational learning on the full network (Blundell et al., 2015) significantly increase the training and serving cost. The cost of ensemble methods scales with the number of models in the ensemble, which can be prohibitive for practical use. To this end, researchers have proposed various techniques to reduce the cost of Bayesian modeling and ensemble methods. For example, single-model approaches are proposed to quantify model uncertainty by modifying the output layer (Tagasovska and Lopez-Paz, 2019; Liu et al., 2020), deriving tractable posteriors from the last layer output only (Riquelme et al., 2018; Snoek et al., 2015), or constructing pseudo-ensembles that can be solved and estimated analytically (Lu et al., 2020). Our proposed method in this paper can also be viewed as a single-model approach for model uncertainty estimation. However, we do not impose any Bayesian assumptions on the network or any distributional assumptions on the ensembles. Instead, we build an empirical model to learn the association between activation strength and model uncertainty, and use it to estimate model uncertainty for new examples, which offers a relatively simple, robust and computationally efficient way to estimate prediction variation from a single model.

3. Variation Estimation Framework

Similar to the work on model uncertainty estimation (Nix and Weigend, 1994, 1995; Su et al., 2018), we build two components for the prediction variation estimation framework: target task, and variation estimation task, as shown in Figure 1. Before discussing the two tasks in detail, we first introduce the two experiment datasets that we use throughout this paper.

3.1. Datasets

Our studies are based on two datasets: MovieLens and Criteo.

MovieLens — The MovieLens 1M dataset (http://files.grouplens.org/datasets/movielens/ml-1m-README.txt) contains 1M movie ratings from 6,000 users on 4,000 movies. The dataset also contains user-related and movie-related features.

Criteo — The Criteo Display Advertising Challenge dataset (https://www.kaggle.com/c/criteo-display-ad-challenge) features a binary classification task to predict click-through rate (a clicked event's label is 1, otherwise 0). The Criteo data consists of around 40M examples with 13 numerical and 26 categorical features.

Figure 1. Our framework for prediction variation estimation using activation strength.

3.2. Target Task

The target task is defined by the original prediction problem, such as the rating prediction task on MovieLens, and the click-through prediction task on Criteo. The target task takes in the input features from the dataset, and predicts the target. In this paper, we focus on the multi-layer perceptron architecture (MLP), with ReLU as the activation function for all layers. Furthermore, we define three target tasks on MovieLens and Criteo.

MovieLens Regression (MovieLens-R) — The target task takes in user-related features (i.e., id, gender, age, and occupation) and movie-related features (i.e., id, title and genres), and predicts movie rating as a regression task. The movie ratings are integers from 1 to 5. We use mean squared error (MSE) as the loss function. MSE is a standard metric for evaluating the performance of rating prediction in recommenders (Herlocker et al., 2004; Saadati et al., 2019; Bennett et al., 2007). For example, (Bennett et al., 2007) used 100 million anonymous movie ratings and reported their Root Mean Squared Error (RMSE) performance on a test dataset as 0.95.

Each model trains for 20 epochs with early stopping. We only use the observed ratings in MovieLens as training data. Similar to the neural collaborative filtering framework (He et al., 2017), we use a fully connected neural network for the rating prediction task with ReLU as the activation function. We set the fully connected layer sizes to [50, 20, 10]. We set the user id and item id embedding sizes to 8 (He et al., 2017), the user age embedding size to 3, and the user occupation embedding size to 5.
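A minimal sketch of such a target task tower is shown below, assuming a tf.keras setup; the vocabulary sizes and the simplified feature handling (only user id and movie id are shown) are illustrative assumptions rather than the exact pipeline used in our experiments.

```python
import tensorflow as tf

# Sketch of a MovieLens-R style tower: id embeddings feeding an MLP with
# ReLU layers [50, 20, 10] and an MSE loss. Vocabulary sizes are assumptions.
def build_movielens_regressor(num_users=6041, num_movies=3953):
    user_id = tf.keras.Input(shape=(), dtype=tf.int32, name="user_id")
    movie_id = tf.keras.Input(shape=(), dtype=tf.int32, name="movie_id")

    user_emb = tf.keras.layers.Embedding(num_users, 8)(user_id)     # id embedding size 8
    movie_emb = tf.keras.layers.Embedding(num_movies, 8)(movie_id)  # id embedding size 8

    x = tf.keras.layers.Concatenate()([user_emb, movie_emb])
    for units in [50, 20, 10]:                      # fully connected ReLU layers
        x = tf.keras.layers.Dense(units, activation="relu")(x)
    rating = tf.keras.layers.Dense(1)(x)            # regression output

    model = tf.keras.Model([user_id, movie_id], rating)
    model.compile(optimizer="adam", loss="mse")
    return model
```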

MovieLens Classification (MovieLens-C) — Similar to the MovieLens regression task, we predict movie ratings as 5 integer values from 1 to 5 and model this problem as a multi-class classification task with softmax cross entropy as the loss function.

We experiment with temperature scaling values $T \in \{0.1, 0.2, 0.5, 1, 2, 5, 10\}$ with a batch size of 1024 to make sure predictions are well calibrated (Guo et al., 2017). We pick $T=0.2$, which gives the best Brier score (https://en.wikipedia.org/wiki/Brier_score) while achieving similar accuracy compared to other settings.
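For reference, a short sketch of how the temperature-scaled softmax and the Brier score can be computed; the helper names are ours, and the 5-class setting mirrors the rating buckets above.

```python
import numpy as np

def temperature_scaled_softmax(logits, T=0.2):
    """Softmax with temperature T applied to a (batch, num_classes) logit array."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)   # subtract the row max for numerical stability
    p = np.exp(z)
    return p / p.sum(axis=1, keepdims=True)

def brier_score(probs, labels, num_classes=5):
    """Mean squared difference between predicted class probabilities and one-hot labels."""
    onehot = np.eye(num_classes)[labels]
    return np.mean(np.sum((probs - onehot) ** 2, axis=1))
```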

Criteo — This target task uses a set of numerical and categorical features to predict the click-through rate. The label for the task is either 0 or 1, representing whether an ad is clicked or not. We model this problem as a binary classification task with a sigmoid cross-entropy loss function. The trained model outputs a float between 0 and 1 representing the predicted click-through probability.

We use the same model setting as described in (Ovadia et al., 2019), except for the ReLU layer sizes. Initially, we set the ReLU layer sizes to [2572, 1454, 1596] as in (Ovadia et al., 2019), but found that only around 80 neurons are activated at least once on a 10k-example data sample. As a result, for the experiments in this paper, we use ReLU layer sizes of [50, 20, 10] and find the prediction performance is similar to the model with much larger ReLU layer sizes. Each model is trained for 1 epoch.
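The activation coverage check above can be reproduced with a small diagnostic like the following sketch; it assumes a tf.keras model whose hidden Dense layers use ReLU, and the layer-selection logic is an assumption rather than our exact tooling.

```python
import numpy as np
import tensorflow as tf

def count_activated_neurons(model, sample_batch):
    """Count how many ReLU units fire (output > 0) at least once on a data sample."""
    relu_outputs = [layer.output for layer in model.layers
                    if isinstance(layer, tf.keras.layers.Dense)
                    and layer.activation is tf.keras.activations.relu]
    probe = tf.keras.Model(model.inputs, relu_outputs)   # expose post-ReLU outputs
    activations = probe(sample_batch)                    # list of (batch, units) tensors
    fired = [np.any(a.numpy() > 0, axis=0).sum() for a in activations]
    return int(sum(fired))
```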

3.3. Variation Estimation Task

In this paper, we focus on using neuron activation strength to estimate prediction variation for each input example. We use the prediction variation estimated by the ensemble as the ground-truth label. We define prediction variation formally in Section 4.

As shown in Figure 1, we build a neural network model that takes the neuron activation strength features to estimate prediction variation. At inference time, we directly output the estimated prediction variation using activation strength as an auxiliary task. In our current setup, we collect the activation strength features from all the neurons in the target task. We also find that it is possible to identify important neurons in order to reduce the number of activation strength features; due to the space limit, we do not discuss feature reduction in this paper. The detailed setup of the variation estimation task is discussed in Section 6.

4. Prediction Variation Quantification

In this section, we formally quantify prediction variation in different problem settings. The ensemble method is a state-of-the-art benchmark for prediction uncertainty estimation (Breiman, 1996; Dietterich, 2000; Lakshminarayanan et al., 2017; Ovadia et al., 2019). We use ensembles to estimate model prediction variation, that is, how much disagreement there is among ensemble model predictions.

Given the same training data and model configuration, we train an ensemble of $N$ models $M=\{m\}$, where $M$ is the set of models and $N$ is the ensemble size. Let $\{x\}$ be the testing data, where $x$ represents a feature vector. Each trained model $m\in M$ makes a prediction on an example $x\in\{x\}$ as $y^{\prime}_{m}(x)$.

For regression and binary classification tasks, the model output is a float value, thus we define prediction variation as value prediction variation based on the standard deviation of the predicted float values across different models in the ensemble. For multi-class classification tasks, the model output is a probability distribution over different categories. Thus, we define prediction variation as distribution prediction variation based on the KL-disagreement or generalized Jensen-Shannon divergence (Lakshminarayanan et al., 2017) on the predicted probability distributions. Now we define prediction variation formally.

Definition 4.1.

(Value Prediction Variation) Given an example $x$ that represents the feature vector, we define its prediction variation $PV(x)$ to be the standard deviation of predictions from the set of models $M$: $PV(x)=\sqrt{\frac{\sum_{m\in M}(y^{\prime}_{m}(x)-\bar{y}(x))^{2}}{|M|-1}}$, where $\bar{y}(x)=\frac{1}{|M|}\sum_{m\in M}y^{\prime}_{m}(x)$.

Definition 4.2.

(Distribution Prediction Variation (Lakshminarayanan et al., 2017)) Given an example $x$ that represents the feature vector, let the prediction distribution of model $m\in M$ for $x$ be $p_{m}(y|x)$. We define the prediction variation $PV(x)$ to be the sum of the Kullback-Leibler (KL) divergences from the prediction distribution of each model $m\in M$ to the mean prediction distribution of the ensemble: $PV(x)=\sum_{m\in M}KL(p_{m}(y|x)\,||\,p_{E}(y|x))$, where $p_{E}(y|x)=\frac{1}{|M|}\sum_{m\in M}p_{m}(y|x)$ is the mean prediction of the ensemble.
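Both definitions are straightforward to compute from stored ensemble predictions; the sketch below is a minimal NumPy version, with the assumed array shapes stated in the comments.

```python
import numpy as np

def value_prediction_variation(preds):
    """Definition 4.1: sample standard deviation of the scalar ensemble
    predictions for one example. `preds` has shape (|M|,)."""
    return np.std(preds, ddof=1)

def distribution_prediction_variation(probs, eps=1e-12):
    """Definition 4.2: sum over ensemble members of KL(p_m || p_E), where
    p_E is the mean predicted distribution. `probs` has shape (|M|, num_classes)."""
    p_mean = probs.mean(axis=0)
    kl_per_member = np.sum(probs * (np.log(probs + eps) - np.log(p_mean + eps)), axis=1)
    return kl_per_member.sum()
```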

Randomness MovieLens-R MovieLens-C Criteo
Settings MSE ACC PV Mean PV Std PV Coeff ACC PV Mean PV Std AUC PV Mean PV Std PV Coeff
(R0) None 0.7980 0.4483 0.0000 0.0000 0.00% 0.4635 0.0000 0.0000 0.7956 0.0000 0.0000 0.00%
(R1) R 0.7569 0.4570 0.1948 0.0692 5.83% 0.4818 4.4486 2.5744 0.7991 0.0300 0.0162 16.3%
(R2) S 0.7671 0.4473 0.1433 0.0509 4.36% 0.4746 2.7379 1.6717 0.7999 0.0359 0.0179 20.7%
(R3) R+S 0.7479 0.4521 0.1936 0.0649 5.87% 0.4829 4.4637 2.5053 0.7999 0.0358 0.0181 20.4%
(R4) J 0.7718 0.4464 0.1494 0.0638 4.54% 0.4745 2.7522 1.9771 0.8003 0.0394 0.0198 22.9%
(R5) R+J 0.7489 0.4522 0.2035 0.0701 6.14% 0.4829 4.7250 2.6514 0.8002 0.0396 0.0199 22.9%
(R6) S+J 0.7640 0.4486 0.1560 0.0571 4.74% 0.4766 3.1739 1.9803 0.8002 0.0409 0.0200 23.5%
(R7) R+S+J 0.7457 0.4528 0.2013 0.0667 6.08% 0.4838 4.7332 2.6723 0.8000 0.0407 0.0202 23.6%
Table 1. Ensemble’s prediction accuracy and prediction variation (PV) on 8 randomness settings. PV coefficient (coeff) shows the average ratio of PV to the ensemble mean prediction over all the testing examples.
Figure 2. Pearson correlation of prediction variations for ensembles with different randomness settings.

5. Prediction Variation Sources

In this section, we diagnose the prediction variation sources, and we are interested in seeing their effects on the total prediction variation by controlling each type of randomness source.

There are many sources contributing to prediction variation. Random initialization of DNN parameters contributes randomness to model predictions. Randomness can also come from training data shuffling or sub-sampling. In addition, asynchronous or distributed training could lead to training order randomness. More surprisingly, we observe the hardware itself contributes to the model prediction variation: we find that by fixing all the other settings, training a model on different CPUs might produce different models.

In this paper, we consider three types of randomness sources:

1. Shuffle (S) — Whether we randomly shuffle the input data, i.e., randomize the input data order.

2. RandInit (R) — Whether we randomly initialize the model parameters, including DNN weights and embeddings. We can fix the initialization by setting a global random seed in TensorFlow.

3. Jackknife (J) — Whether we randomly sub-sample the input data by applying delete-1 Jackknife (McIntosh, 2016). We split the data into N Jackknife sub-samples and each ensemble member randomly leaves one Jackknife sample out. Since we use 100 models for each ensemble as discussed in Appendix A.1, we split the data into 100 unique Jackknife sub-samples. Another popular data sampling method is the bootstrap (Wichmann and Hill, 2001); for this work, we pick Jackknife because it is simple to implement.

We set up the prediction variation randomness control experiments as follows. First, if incorporating Jackknife randomness, we obtain the delete-1 Jackknife sub-samples; otherwise, we use all the training data. Second, if incorporating Shuffle randomness, we shuffle the training data; otherwise we do not. Finally, if incorporating RandInit randomness, we randomly initialize all the parameters without a fixed global seed for all the ensemble members; otherwise we use a fixed global seed. (When RandInit is enabled, we use a fixed set of 100 random seeds.) We use an ensemble of 100 models for each of the randomness settings: as discussed in Appendix A.1, 93% of the prediction variation in an ensemble of 1000 models can be captured with a size-100 ensemble.
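The sketch below illustrates how the three randomness sources could be toggled when training a single ensemble member; the data handling, seed bookkeeping, and function names are simplifying assumptions, not our exact implementation.

```python
import numpy as np
import tensorflow as tf

def train_ensemble_member(build_model, train_data, member_id,
                          rand_init=True, shuffle=True, jackknife=True,
                          num_folds=100, fixed_seed=12345):
    """Train one ensemble member with Shuffle, RandInit, and Jackknife toggled."""
    # Jackknife (J): each member leaves one of `num_folds` sub-samples out.
    if jackknife:
        folds = np.array_split(np.arange(len(train_data)), num_folds)
        keep = np.concatenate([f for i, f in enumerate(folds)
                               if i != member_id % num_folds])
        data = train_data[keep]
    else:
        data = train_data

    # Shuffle (S): randomize the training example order.
    if shuffle:
        data = data[np.random.permutation(len(data))]

    # RandInit (R): per-member seed vs. one fixed global seed shared by all members.
    tf.random.set_seed(member_id if rand_init else fixed_seed)

    model = build_model()
    # model.fit(...) on `data` would follow here.
    return model
```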

On each of the MovieLens and Criteo datasets, we randomly split the data into training and testing sets. On MovieLens, we split the 1M examples into 60% for training and 40% for testing. On Criteo, following (Ovadia et al., 2019), we use 37M examples for training and 4.4M for testing. We obtain the prediction variation estimates on all the testing examples for further analysis.

Randomness Source Comparison — Table 1 shows the accuracy and prediction variation statistics for different combinations of the three randomness sources. For each randomness combination (e.g., (R3) R+S means using RandInit and Shuffle only), we train 100 models with the same setting, obtain the ensemble mean prediction for accuracy evaluation, and report the prediction variations. For accuracy evaluation, we report Mean Squared Error (MSE) and accuracy (ACC) for MovieLens-R, ACC for MovieLens-C, and the AUC score for Criteo. We obtain ACC for MovieLens-R by rounding the ratings to the closest integers. For prediction variation metrics, we report the prediction variation (PV) mean and standard deviation, and the PV coefficient (coeff). We obtain the PV coefficient for each example $x$ as $PV(x)$ divided by the ensemble mean prediction.

From the table, we can see that each type of randomness source exhibits a non-negligible and different contribution towards the total prediction variation. As we add more randomness sources, the mean prediction of the ensemble tends to become more accurate, and the prediction variations get higher. On all three target tasks, the R7 setting exhibits the best or close to best accuracy score, and its prediction variations are also the highest or close to the highest among all the randomness settings. Different target tasks or datasets are also sensitive to different types of randomness sources: we observe that the Criteo data is more sensitive to Shuffle and Jackknife randomness, while the MovieLens data is more sensitive to RandInit. We verify that by fixing all the randomness sources, we do not observe any prediction variation in the R0 setting; the PV mean and PV std are always 0 for R0. According to the PV coefficient, MovieLens shows around 5-6% of prediction sway from the mean prediction, while Criteo has around 20%.

Randomness Source Correlations — Under each randomness setting, we obtain the prediction variation for each example. Figure 2 shows the Pearson correlation of the prediction variations of all the examples between each pair of randomness settings. The randomness settings can be found in Table 1. We do not consider R0 because it eliminates all the randomness in the model and its PV is always 0.

As we can see from Figure 2, the prediction variation correlation patterns on MovieLens are quite different from those on Criteo. For example, while the lowest Pearson correlation score on MovieLens is around 0.7, all the randomness settings on Criteo are quite correlated, with the lowest Pearson correlation score around 0.92.

Regression vs Classification — On the MovieLens dataset, we can predict ratings as either regression or classification. Table 1 shows that the prediction accuracy is higher when we predict ratings as a classification task than as a regression task for almost all the randomness settings, as classification optimizes for the accuracy metric directly.

We also compare the prediction variations obtained through regression and classification, and Figure 3 shows the Pearson correlation of prediction variations for the two tasks under various randomness settings for all the testing examples. We find that whether the variations are strongly correlated depends on the randomness settings. As shown in the figure, the prediction variations are highly correlated (with Pearson correlation above 0.8) when we add the RandInit randomness source; otherwise the two tasks are less correlated. We think the reason is that RandInit controls the model parameters, where the loss function plays an important role, while underlying data properties affect Shuffle and Jackknife more.

6. Prediction Variation Estimation

In this section, we study the problem of using neuron activation strength to infer prediction variation. We first discuss the prediction variation estimation task setup in Section 6.1, and then show the experiment results on using neuron activation strength to estimate prediction variation in Section 6.2.

Figure 3. Pearson correlation of prediction variations between MovieLens-R and MovieLens-C.

6.1. Variation Estimation Task Setup

As shown in Figure 1, the variation estimation task takes in the neuron activation information collected during the target task's inference time. We use the prediction variation estimated by the ensemble model for training, and then at inference time, the variation estimation model infers the prediction variation as a cheap auxiliary task. We set up the variation estimation task as follows.

Ground-truth Labels — We first obtain the prediction variation ground truth from the ensemble. On both MovieLens and Criteo, we first split the data into training $D_t$ and testing $D_e$ for the target task. On MovieLens, we split the 1M examples into 60% for training and 40% for testing; on Criteo, we use the same setting as in (Ovadia et al., 2019), 37M examples for training and 4.4M for testing. We train an ensemble of 100 models to obtain $PV(x)$ for each $x\in D_e$.

Evaluation Procedure — We further split $D_e$ into two sets: 50% as $D_{e1}=(x_{e1},PV(x_{e1}))$ for training the variation estimation model, and the other 50% as $D_{e2}=(x_{e2},PV(x_{e2}))$ for testing.

Given a trained target task model $m_t$, we build a neural network model $m_v$ to estimate prediction variations. $m_v$ collects the activation strength information from $m_t$'s neurons during $m_t$'s inference time. $m_v$ trains on $D_{e1}$ and tests on $D_{e2}$, with fully connected layers of size [100, 50], batch size 256, the Adam optimizer with learning rate 0.001, and 150 training epochs with early stopping. We find that $m_v$ takes less than one epoch to converge on Criteo, but takes longer to converge on MovieLens due to its much smaller data size.

Now we explain $m_v$'s input features and objective in detail.

Input Features — We consider two types of input features collected from neuron activation strength. We use ReLU (Nair and Hinton, 2010) as the activation function. We believe our activation strength feature is general and can be applied to other activation functions, such as Softplus (Glorot et al., 2011), ELU (Clevert et al., 2015), GELUs (Hendrycks and Gimpel, 2016), and Swish (Ramachandran et al., 2017). Due to space limitations, we only experiment with ReLU.

Binary — On ReLU neurons, we consider whether a neuron is activated as the input feature. This binary feature represents whether the neuron output is greater than 0.

Value — The raw value of a neuron's output directly represents the strength of an activated neuron. Therefore, we experiment with the normalized activation value as an input feature. We normalize the neuron outputs according to the neuron output mean and standard deviation collected from the training data.
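A sketch of how both feature types could be collected from a trained target model is shown below; it assumes a tf.keras model with ReLU Dense layers and pre-computed per-neuron mean/std statistics, and the layer-selection logic is an assumption for illustration.

```python
import numpy as np
import tensorflow as tf

def activation_strength_features(target_model, inputs, act_mean, act_std, eps=1e-6):
    """Collect binary (activated or not) and normalized value features
    from every ReLU neuron of the target model for a batch of inputs."""
    relu_outputs = [layer.output for layer in target_model.layers
                    if isinstance(layer, tf.keras.layers.Dense)
                    and layer.activation is tf.keras.activations.relu]
    probe = tf.keras.Model(target_model.inputs, relu_outputs)
    acts = np.concatenate([a.numpy() for a in probe(inputs)], axis=1)  # (batch, num_neurons)

    binary = (acts > 0).astype(np.float32)        # whether each neuron is activated
    value = (acts - act_mean) / (act_std + eps)   # normalized activation value
    return np.concatenate([binary, value], axis=1)
```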

Objective — We estimate the prediction variation in two ways.

Regression — In this setting, the model directly estimates the prediction variation as a regression task. We use Mean Squared Error (MSE) as the loss function. However, by directly optimizing for MSE, this regression task's output range could be huge. As a result, we limit the minimum output of the model to 0, since prediction variation is always positive, and limit the maximum output to $\text{mean}+3\cdot\text{std}$, where the mean and std are estimated on the training data's prediction variations. $\text{mean}+3\cdot\text{std}$ covers 99.7% of the data in a Gaussian distribution (https://en.wikipedia.org/wiki/68-95-99.7_rule).
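A minimal sketch of the regression variant of $m_v$ is shown below, following the architecture and optimizer settings above; implementing the output clipping with a Lambda layer is one possible choice, not necessarily the one used in our experiments.

```python
import tensorflow as tf

def build_variation_regressor(num_features, pv_mean, pv_std):
    """Variation estimation model m_v as regression: layers [100, 50], MSE loss,
    Adam with learning rate 0.001, output clipped to [0, mean + 3*std]."""
    upper = pv_mean + 3.0 * pv_std
    inp = tf.keras.Input(shape=(num_features,))
    x = tf.keras.layers.Dense(100, activation="relu")(inp)
    x = tf.keras.layers.Dense(50, activation="relu")(x)
    raw = tf.keras.layers.Dense(1)(x)
    out = tf.keras.layers.Lambda(lambda t: tf.clip_by_value(t, 0.0, upper))(raw)

    model = tf.keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss="mse")
    return model
```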

Classification — In this setting, we divide the prediction variation into multiple buckets according to percentiles, and then predict which variation bucket an example belongs to. We set the number of buckets to 5, and use cross entropy as the loss function for the prediction variation classification model.
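The percentile-based bucketing can be computed as in the short sketch below; the bucket edges are derived from the training prediction variations and then applied to test examples.

```python
import numpy as np

def bucketize_prediction_variation(train_pv, test_pv, num_buckets=5):
    """Assign equal-frequency variation buckets (0 = lowest, num_buckets-1 = highest)
    using percentile thresholds computed on the training prediction variations."""
    edges = np.percentile(train_pv, np.linspace(0, 100, num_buckets + 1)[1:-1])
    return np.digitize(train_pv, edges), np.digitize(test_pv, edges)
```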

MovieLens-R MovieLens-C Criteo
MSE $R^2$ MSE $R^2$ MSE $R^2$
(R1) R 0.0022 0.5416 3.6159 0.4586 0.0063 0.7617
(R2) S 0.0011 0.5636 1.5015 0.4683 0.0068 0.7863
(R3) R+S 0.0019 0.5514 3.4288 0.4569 0.0062 0.8100
(R4) J 0.0025 0.3885 2.5986 0.3386 0.0086 0.7817
(R5) R+J 0.0024 0.5219 3.7727 0.4646 0.0085 0.7868
(R6) S+J 0.0017 0.4938 2.3175 0.4125 0.0091 0.7739
(R7) R+S+J 0.0022 0.5123 4.0496 0.4368 0.0092 0.7761
Average 0.0020 0.5104 3.0407 0.4338 0.0078 0.7824
Table 2. Performance of variation estimation as regression on 7 randomness settings.

6.2. Variation Estimation Performance

In this section, we show the variation estimation performance using neuron activation strength on MovieLens and Criteo.

Regression Performance — When we run the prediction variation estimation as a regression task, we directly output a score as the estimated prediction variation. We use both binary and value input features.

In Table 2, we show the Mean Squared Error (MSE) and $R^2$ for the three target tasks on the 7 randomness control settings. From the table, on all three target tasks and all 7 randomness control settings, we observe strong prediction power of neuron activation strength for ensemble prediction variations. The average $R^2$ on MovieLens-R is 0.51, on MovieLens-C is 0.43, and on Criteo is 0.78. The variation estimation performance is the best on the Criteo data, while MovieLens-R is better than MovieLens-C. The reasons could be the following. First, Criteo has more training data than MovieLens: to train the variation estimation model, we have 2.2M (50% of 4.4M) training examples on Criteo, but only 0.2M (50% of 0.4M) on MovieLens. Second, Criteo has a larger relative range of prediction variations compared to MovieLens; as shown in Table 1, Criteo shows around 20% of prediction variation sway from the mean prediction, which is much higher than the 4-6% on MovieLens. Finally, the Criteo task is probably the easiest of the three: it is a binary classification task, while MovieLens-R is a regression task and MovieLens-C is a multi-class classification task.

Figure 4. AUC for variation bucket prediction with the randomness setting R3.

Classification Performance — We also run the prediction variation estimation as a classification task, predicting which variation bucket an example should be in. We use both binary and value input features. Due to the space limit, we only show the results for the R3 randomness setting, as R3 uses training data shuffling and parameter random initialization, which is the most common setting in practice.

Figure 4 shows the AUC scores and Figure 5 shows the confusion matrix for the 5-bucket prediction variation classification on both MovieLens and Criteo. The numbers in Figure 5 are normalized by the actual example number in each bucket. Bucket 1 represents the lowest variation slice, while bucket 5 represents the highest.

As we can see from Figure 4, our variation estimation model is fairly good at distinguishing examples in different variation buckets, especially the lowest and highest buckets. The average AUC score for the three tasks is about 0.92 on bucket 1 and about 0.89 on bucket 5. Figure 5 shows that the classification errors mostly happen on adjacent buckets. For example, on Criteo, most of the mis-classifications assign bucket 1 examples to bucket 2. When we divide the prediction variation buckets on the training data, we notice that the bucket thresholds are close. For example, under the randomness control setting R3, the thresholds of the 5 buckets for MovieLens-R are [0.1420, 0.1672, 0.1950, 0.2366], and the thresholds for Criteo are [0.0194, 0.0287, 0.0398, 0.0515]. As shown in the figures, Criteo has the best performance among the three tasks. Again, the reasons could be that the Criteo task has more training data, is probably the easiest among the three tasks, and has a much larger relative prediction variation range.

Figure 5. Confusion matrix for variation estimation as classification for the randomness setting R3.
MovieLens-R MovieLens-C Criteo
MSE $R^2$ MSE $R^2$ MSE $R^2$
B 0.0025 0.4196 4.1616 0.3408 0.0124 0.6210
BV 0.0019 0.5514 3.4288 0.4569 0.0062 0.8100
Table 3. Activation feature study for variation estimation as regression with the randomness setting R3.
MovieLens-R MovieLens-C Criteo
Run MSE $R^2$ MSE $R^2$ MSE $R^2$
1 0.0019 0.5514 3.4288 0.4569 0.0062 0.8100
2 0.0020 0.5160 3.3243 0.4734 0.0061 0.8140
3 0.0019 0.5418 3.1825 0.4959 0.0064 0.8036
4 0.0019 0.5384 3.2375 0.4872 0.0070 0.7863
5 0.0021 0.4994 3.2631 0.4831 0.0066 0.7969
Std 0.00008 0.0190 0.0842 0.0133 0.0003 0.0098
Table 4. Reproducibility test for variation estimation as regression with the randomness setting R3.

Activation Feature Study — In Table 3, we show the contribution of the two input activation strength features. We try two feature settings: B refers to using the binary feature only, and BV refers to using both the binary and value features. As shown in the table, each type of feature makes a non-negligible contribution towards prediction variation estimation. Thus, it is beneficial to use both the binary and value features for prediction variation estimation.

Reproducibility — When we evaluate the variation estimation model, the performance is calculated based on one target task model. To check whether the performance is reproducible, we train 5 new target task models. To simplify the problem, we again use the randomness setting R3, which is the most commonly used setting in practice. Table 4 shows the MSE and $R^2$ for each run on both MovieLens and Criteo. As shown in the table, the standard deviation of MSE and $R^2$ over the 5 runs is small for each of the three tasks. Thus, we conclude that activation strength is useful for estimating prediction variation and the result is reproducible.

Comparison with Dropout (Gal and Ghahramani, 2016) — Dropout is also a standard way to estimate model uncertainty. We estimate prediction variation using dropout as follows. We train one model, randomly dropping 20% of the neurons in the ReLU layers, using the randomness setting R3. At inference time, we keep dropout turned on to obtain predictions for all the testing data. We run inference 100 times and obtain the prediction variation for each testing example. We find that the prediction variation estimated by dropout is not very correlated with the variation estimated by the ensemble method. On MovieLens-R, the Pearson correlation of prediction variations between dropout and the ensemble is 0.25, the RMSE is 0.0798, and the $R^2$ is -0.5010. On Criteo, the Pearson correlation is 0.37, the RMSE is 0.0279, and the $R^2$ is -1.3709. As a result, we did not conduct further comparisons with our activation strength based method.
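For clarity, the dropout baseline amounts to repeated stochastic forward passes; a minimal sketch, assuming a tf.keras model that contains Dropout layers and produces a single scalar output per example:

```python
import numpy as np

def mc_dropout_prediction_variation(model, inputs, num_passes=100):
    """Run inference with dropout kept active (training=True) and take the
    per-example standard deviation of the predictions across passes."""
    preds = np.stack([model(inputs, training=True).numpy().squeeze(-1)
                      for _ in range(num_passes)], axis=0)   # (num_passes, batch)
    return preds.std(axis=0, ddof=1)
```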

7. Conclusion and Future Work

In this paper, we conduct empirical studies to understand the prediction variation estimated by ensembles under various randomness control settings. Our experiments on two public datasets (MovieLens and Criteo) demonstrate that with more variation sources, ensemble tends to produce more accurate point estimates with higher prediction variations. More importantly, we demonstrate strong predictive power of neuron activation strength to infer ensemble prediction variations, which provides an efficient way to estimate prediction variation without the need to run inference multiple times as in ensemble methods.

In the future, we are interested in exploring the proposed activation strength based method in various applications, such as model-based reinforcement learning and curriculum learning. In addition, we plan to make two further improvements to our activation strength based approach. First, we would like to study additional neuron activation patterns, such as adjacent neuron paths, to further improve the variation estimation model. Second, we currently re-train the variation estimation model for each new target task. Since the activation pattern is a universal and general feature, we hope to find a universal model for prediction variation estimation that does not need re-training for each individual target task.

Appendix A Appendix

Figure 6. Delta ratio for different ensemble sizes on the MovieLens regression task and the Criteo task.

A.1. Model Ensemble Sizes

Our prediction variation is estimated using the ensemble method, and we are interested in finding out how many models we need to ensemble to estimate prediction variation accurately. In this section, we conduct empirical experiments on MovieLens-R and Criteo to answer this question.

For each target task, using the same training data and model configuration, we first train 1000 models as the ensemble universe $M_{gt}$ to obtain ground-truth prediction variations. We use the R3 randomness setting here, as it is the most commonly used setting in practice. Given an example $x$, we obtain its prediction variation from the 1000 ensemble models as $PV_{gt}(x)$. We calculate the mean prediction variation over all the examples as $\overline{PV}_{gt}$.

Then we evaluate the prediction variation difference between an ensemble $M_N$ of a smaller size $N$ and $M_{gt}$. We use the delta ratio to quantify the difference between the prediction variations estimated from the two ensembles $M_N$ and $M_{gt}$, defined as follows.

Definition A.1.

(Delta Ratio) Let the prediction variation delta $\delta_{M_N}(x)$ be the absolute difference between the prediction variation estimated by a model ensemble $M_N$ of size $N$ and by the ground-truth model ensemble $M_{gt}$: $\delta_{M_N}(x)=|PV_{M_N}(x)-PV_{gt}(x)|$. We obtain the average prediction variation delta over all examples in a dataset $D$ as $\delta_{M_N}=\frac{1}{|D|}\sum_{x\in D}\delta_{M_N}(x)$. We define the delta ratio $dr_{M_N}$ to be the ratio of the prediction variation delta $\delta_{M_N}$ to the average prediction variation of the ground-truth ensemble $\overline{PV}_{gt}$: $dr_{M_N}=\delta_{M_N}/\overline{PV}_{gt}$.
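The delta ratio for a sub-ensemble sampled from the model universe can be computed as in the sketch below; it assumes the per-model scalar predictions are stored as a (num_models, num_examples) array.

```python
import numpy as np

def delta_ratio(preds_universe, ensemble_size, seed=0):
    """Definition A.1: delta ratio of a size-N sub-ensemble sampled without
    replacement from the model universe (e.g., 1000 models)."""
    rng = np.random.default_rng(seed)
    pv_gt = preds_universe.std(axis=0, ddof=1)        # ground-truth PV per example
    idx = rng.choice(preds_universe.shape[0], size=ensemble_size, replace=False)
    pv_n = preds_universe[idx].std(axis=0, ddof=1)    # PV from the smaller ensemble
    delta = np.abs(pv_n - pv_gt).mean()               # average PV delta over examples
    return delta / pv_gt.mean()
```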

In Figure 6, we show the delta ratio for different ensemble sizes on the MovieLens regression task and the Criteo task. For each ensemble size $N$, we sample $N$ models without replacement from the 1000-model ground-truth universe and obtain its delta ratio. We repeat this sampling process 20 times and obtain the mean and standard deviation of the delta ratio for the given $N$, as plotted in Figure 6. We can see that the delta ratio decreases as the ensemble size increases. The delta ratio statistics are similar on both MovieLens and Criteo. When the ensemble size is 100, the delta ratio is about 7%, which indicates that 93% of the prediction variation from the ground-truth ensemble of 1000 models is captured. As a result, in this paper, we use 100 as the default ensemble size for all experiments.

References

  • Bengio et al. (2009) Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning. 41–48.
  • Bennett et al. (2007) James Bennett, Stan Lanning, et al. 2007. The netflix prize. In Proceedings of KDD cup and workshop, Vol. 2007. Citeseer, 35.
  • Berthelot et al. (2019) David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. 2019. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems. 5049–5059.
  • Bishop (2006) Christopher M Bishop. 2006. Pattern recognition and machine learning. springer.
  • Blundell et al. (2015) Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. 2015. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424 (2015).
  • Breiman (1996) Leo Breiman. 1996. Bagging predictors. Machine learning 24, 2 (1996), 123–140.
  • Chua et al. (2018) Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. 2018. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems. 4754–4765.
  • Clevert et al. (2015) Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289 (2015).
  • De Fauw et al. (2018) Jeffrey De Fauw, Joseph R Ledsam, Bernardino Romera-Paredes, Stanislav Nikolov, Nenad Tomasev, Sam Blackwell, Harry Askham, Xavier Glorot, Brendan O’Donoghue, Daniel Visentin, et al. 2018. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nature medicine 24, 9 (2018), 1342–1350.
  • Der Kiureghian and Ditlevsen (2009) Armen Der Kiureghian and Ove Ditlevsen. 2009. Aleatory or epistemic? Does it matter? Structural safety 31, 2 (2009), 105–112.
  • Dietterich (2000) Thomas G Dietterich. 2000. Ensemble methods in machine learning. In International workshop on multiple classifier systems. Springer, 1–15.
  • Dusenberry et al. (2020) Michael W Dusenberry, Dustin Tran, Edward Choi, Jonas Kemp, Jeremy Nixon, Ghassen Jerfel, Katherine Heller, and Andrew M Dai. 2020. Analyzing the role of model uncertainty for electronic health records. In Proceedings of the ACM Conference on Health, Inference, and Learning. 204–213.
  • Gal and Ghahramani (2016) Yarin Gal and Zoubin Ghahramani. 2016. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning. 1050–1059.
  • Glorot et al. (2011) Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proceedings of the fourteenth international conference on artificial intelligence and statistics. 315–323.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. 2017. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599 (2017).
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th international conference on world wide web. International World Wide Web Conferences Steering Committee, 173–182.
  • Hendrycks and Gimpel (2016) Dan Hendrycks and Kevin Gimpel. 2016. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415 (2016).
  • Herlocker et al. (2004) Jonathan L Herlocker, Joseph A Konstan, Loren G Terveen, and John T Riedl. 2004. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS) 22, 1 (2004), 5–53.
  • Hinton and Van Camp (1993) Geoffrey E Hinton and Drew Van Camp. 1993. Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory. 5–13.
  • Huang et al. (2017) Gao Huang, Yixuan Li, Geoff Pleiss, Zhuang Liu, John E Hopcroft, and Kilian Q Weinberger. 2017. Snapshot ensembles: Train 1, get m for free. arXiv preprint arXiv:1704.00109 (2017).
  • Jiang et al. (2018) Heinrich Jiang, Been Kim, Melody Guan, and Maya Gupta. 2018. To trust or not to trust a classifier. In Advances in neural information processing systems. 5541–5552.
  • Knight (1921) Frank Hyneman Knight. 1921. Risk, uncertainty and profit. Vol. 31. Houghton Mifflin.
  • Kwon et al. (2018) Yongchan Kwon, Joong-Ho Won, Beom Joon Kim, and Myunghee Cho Paik. 2018. Uncertainty quantification using bayesian neural networks in classification: Application to ischemic stroke lesion segmentation. (2018).
  • Lakshminarayanan et al. (2017) Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. 2017. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in neural information processing systems. 6402–6413.
  • Lee et al. (2017) Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. 2017. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325 (2017).
  • Leibig et al. (2017) Christian Leibig, Vaneeda Allken, Murat Seçkin Ayhan, Philipp Berens, and Siegfried Wahl. 2017. Leveraging uncertainty information from deep neural networks for disease detection. Scientific reports 7, 1 (2017), 1–14.
  • Liu et al. (2019) Jeremiah Liu, John Paisley, Marianthi-Anna Kioumourtzoglou, and Brent Coull. 2019. Accurate uncertainty estimation and decomposition in ensemble learning. In Advances in Neural Information Processing Systems. 8952–8963.
  • Liu et al. (2020) Jeremiah Zhe Liu, Zi Lin, Shreyas Padhy, Dustin Tran, Tania Bedrax-Weiss, and Balaji Lakshminarayanan. 2020. Simple and Principled Uncertainty Estimation with Deterministic Deep Learning via Distance Awareness. arXiv preprint arXiv:2006.10108 (2020).
  • Louizos and Welling (2017) Christos Louizos and Max Welling. 2017. Multiplicative normalizing flows for variational bayesian neural networks. arXiv preprint arXiv:1703.01961 (2017).
  • Lu et al. (2020) Zhiyun Lu, Eugene Ie, and Fei Sha. 2020. Uncertainty Estimation with Infinitesimal Jackknife, Its Distribution and Mean-Field Approximation. arXiv preprint arXiv:2006.07584 (2020).
  • MacKay (1992) David JC MacKay. 1992. A practical Bayesian framework for backpropagation networks. Neural computation 4, 3 (1992), 448–472.
  • McIntosh (2016) Avery McIntosh. 2016. The Jackknife estimation method. arXiv preprint arXiv:1606.00497 (2016).
  • Nair and Hinton (2010) Vinod Nair and Geoffrey E Hinton. 2010. Rectified linear units improve restricted boltzmann machines. In ICML.
  • Neal (2012) Radford M Neal. 2012. Bayesian learning for neural networks. Vol. 118. Springer Science & Business Media.
  • Nicoll (2017) Roger A Nicoll. 2017. A brief history of long-term potentiation. Neuron 93, 2 (2017), 281–290.
  • Nix and Weigend (1994) David A Nix and Andreas S Weigend. 1994. Estimating the mean and variance of the target probability distribution. Proceedings of 1994 ieee international conference on neural networks (ICNN’94) 1 (1994), 55–60.
  • Nix and Weigend (1995) David A Nix and Andreas S Weigend. 1995. Learning local error bars for nonlinear regression. Advances in neural information processing systems (1995), 489–496.
  • Ovadia et al. (2019) Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, David Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. 2019. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems. 13991–14002.
  • Ramachandran et al. (2017) Prajit Ramachandran, Barret Zoph, and Quoc V. Le. 2017. Searching for Activation Functions. CoRR abs/1710.05941 (2017).
  • Riquelme et al. (2018) Carlos Riquelme, George Tucker, and Jasper Snoek. 2018. Deep bayesian bandits showdown: An empirical comparison of bayesian deep networks for thompson sampling. arXiv preprint arXiv:1802.09127 (2018).
  • Saadati et al. (2019) Mojdeh Saadati, Syed Shihab, and Mohammed Shaiqur Rahman. 2019. Movie Recommender Systems: Implementation and Performance Evaluation. arXiv preprint arXiv:1909.12749 (2019).
  • Schulam and Saria (2019) Peter Schulam and Suchi Saria. 2019. Can you trust this prediction? Auditing pointwise reliability after learning. arXiv preprint arXiv:1901.00403 (2019).
  • Snoek et al. (2015) Jasper Snoek, Oren Rippel, Kevin Swersky, Ryan Kiros, Nadathur Satish, Narayanan Sundaram, Mostofa Patwary, Mr Prabhat, and Ryan Adams. 2015. Scalable bayesian optimization using deep neural networks. In International conference on machine learning. 2171–2180.
  • Su et al. (2018) Dongqi Su, Ying Yin Ting, and Jason Ansel. 2018. Tight Prediction Intervals Using Expanded Interval Minimization. arXiv preprint arXiv:1806.11222 (2018).
  • Tagasovska and Lopez-Paz (2019) Natasa Tagasovska and David Lopez-Paz. 2019. Single-model uncertainties for deep learning. In Advances in Neural Information Processing Systems. 6417–6428.
  • Wen et al. (2020) Yeming Wen, Dustin Tran, and Jimmy Ba. 2020. BatchEnsemble: An Alternative Approach to Efficient Ensemble and Lifelong Learning. Eighth International Conference on Learning Representations (ICLR 2020) (2020).
  • Wichmann and Hill (2001) Felix A Wichmann and N Jeremy Hill. 2001. The psychometric function: II. Bootstrap-based confidence intervals and sampling. Perception & psychophysics 63, 8 (2001), 1314–1329.
  • Zhou et al. (2019) Dongruo Zhou, Lihong Li, and Quanquan Gu. 2019. Neural Contextual Bandits with UCB-based Exploration. arXiv:1911.04462 [cs.LG]
  • Zhu and Zabaras (2018) Yinhao Zhu and Nicholas Zabaras. 2018. Bayesian deep convolutional encoder–decoder networks for surrogate modeling and uncertainty quantification. J. Comput. Phys. 366 (2018), 415–447.