
Variable-Based Calibration for Machine Learning Classifiers

Markelle Kelly, Padhraic Smyth
Abstract

The deployment of machine learning classifiers in high-stakes domains requires well-calibrated confidence scores for model predictions. In this paper we introduce the notion of variable-based calibration to characterize calibration properties of a model with respect to a variable of interest, generalizing traditional score-based metrics such as expected calibration error (ECE). In particular, we find that models with near-perfect ECE can exhibit significant miscalibration as a function of features of the data. We demonstrate this phenomenon both theoretically and in practice on multiple well-known datasets, and show that it can persist after the application of existing calibration methods. To mitigate this issue, we propose strategies for detection, visualization, and quantification of variable-based calibration error. We then examine the limitations of current score-based calibration methods and explore potential modifications. Finally, we discuss the implications of these findings, emphasizing that an understanding of calibration beyond simple aggregate measures is crucial for endeavors such as fairness and model interpretability.

1 Introduction

Predictive models built by machine learning algorithms are increasingly informing decisions across high-stakes applications such as medicine (Rajkomar, Dean, and Kohane 2019), employment (Chalfin et al. 2016), and criminal justice (Završnik 2021). There is also broad recent interest in developing systems where humans and machine learning models collaborate to make predictions and decisions (Kleinberg et al. 2018; Bansal et al. 2021; De et al. 2021; Steyvers et al. 2022). A critical aspect of using model predictions in such contexts is calibration. In particular, in order to trust the predictions from a machine learning classifier, these predictions must be accompanied by well-calibrated confidence scores.

In practice, however, it has been well-documented that machine learning classifiers such as deep neural networks can produce poorly-calibrated class probabilities (Guo et al. 2017; Vaicenavicius et al. 2019; Ovadia et al. 2019). As a result, a variety of calibration methods have been developed, which aim to ensure that a model’s confidence (or score) matches its true accuracy. A widely used approach is post-hoc calibration: methods which use a separate labeled dataset to learn a mapping from the original model’s class probabilities to calibrated probabilities, often with a relatively simple one-dimensional mapping (e.g., Platt (1999); Kull, Filho, and Flach (2017); Kumar, Liang, and Ma (2019)). These methods have been shown to generally improve the empirical calibration error of a model, as commonly measured by the expected calibration error (ECE).

However, as we show in this paper, aggregate measures of score-based calibration error such as ECE can hide significant systematic miscalibration in other dimensions of a model’s performance. To address this issue we introduce the notion of variable-based calibration to better understand how the calibration error of a model can vary as a function of a variable of interest, such as an input variable to the model or some other metadata variable. In this paper we focus in particular on real-valued variables. For example, in prediction problems involving individuals (e.g., credit-scoring or medical diagnosis) one such variable could be Age. Detecting systematic miscalibration is important for problems such as assessing the fairness of a model, for instance detecting that a model is significantly overconfident for some age ranges and underconfident for others.

As an illustrative example, consider a simple classifier trained to predict the presence of cardiovascular disease (https://www.kaggle.com/sulianova/cardiovascular-disease-dataset). After the application of Platt scaling, a standard post-hoc calibration method, this model attains a relatively low ECE of 0.74%. This low ECE is reflected in the reliability diagram shown in Figure 1(a), which shows near-perfect alignment with the diagonal. If a user of this model were to only consider aggregate metrics such as ECE, they might reasonably conclude that the model is generally well-calibrated. However, evaluating model error and predicted error with respect to the variable Patient Age reveals an undesirable and systematic miscalibration pattern with respect to this variable, as illustrated in Figure 1(b): the model is underconfident by upwards of five percentage points for younger patients, and is significantly overconfident for older patients.

Figure 1: Calibration plots for a neural network predicting cardiovascular disease, after calibration with Platt scaling: (a) reliability diagram (for accuracy), (b) variable-based calibration plot (for error), showing LOESS-smoothed estimates with confidence intervals of actual and model-predicted error as a function of patient age. This dataset consists of 70,000 records of patient data (49,000 train, 6,000 validation, 15,000 test), with a binary prediction task of determining the presence of cardiovascular disease.

In this paper, we systematically investigate variable-based calibration for classification models, from both theoretical and empirical perspectives. In particular, our contributions are as follows:

1. We introduce the notion of variable-based calibration and define a per-variable calibration metric (VECE).

2. We characterize theoretically the relationship between variable-based miscalibration measured via VECE and traditional score-based miscalibration measured via ECE.

3. We demonstrate, across multiple well-known tabular, text, and image datasets and a variety of models, that significant variable-based miscalibration can exist in practice, even after the application of standard score-based calibration methods.

4. We investigate variable-based calibration methods and demonstrate empirically that these methods can simultaneously reduce both ECE and VECE. (Our code is available online at https://github.com/markellekelly/variable-wise-calibration.)

2 Related Work

Visualizing Model Performance by Variable:

In prior work a number of different techniques have been developed for visual understanding and diagnosis of model performance with respect to a particular variable of interest. One such technique is partial dependence plots (Friedman 2001; Molnar 2020), which visualize the effect of an input feature of interest on model predictions. Another approach is dashboards such as FairVis (Cabrera et al. 2019) which enable the exploration of model performance (e.g., accuracy, false positive rate) across various data subgroups. However, none of this prior work investigates the visualization of per-variable calibration properties of a model, i.e., how a model’s own predictions of accuracy (or error) vary as a function of a particular variable.

Quantifying Model Calibration by Variable:

Work on calibration for machine learning classifiers has largely focused on score-based calibration: reliability diagrams, the ECE, and standard calibration methods are all defined with respect to confidence scores (Murphy and Winkler 1977; Huang et al. 2020; Song et al. 2021). An exception to this is in the fairness literature, where researchers have broadly called for disaggregated model evaluation, e.g. computing metrics of interest individually for sensitive sub-populations (Mitchell et al. 2019; Raji et al. 2020). To this end, several notions of calibration that move beyond standard aggregate measures have been introduced: Hébert-Johnson et al. (2018) check calibration across all identifiable subpopulations of the data, Pan et al. (2020) evaluate calibration over data subsets corresponding to a categorical variable of interest, and Luo et al. (2022) compute “local calibration” using the average classification error on similar samples. Our paper expands on this prior work in two ways. First, we shift the focus from categorical to real-valued variables—our methods operate on a continuous basis, estimating calibration for an entire population rather than for various subgroups. Second, we center on diagnosing calibration; we present visualization and estimation techniques for understanding an existing classifier rather than prescriptive conditions for model training or selection.

3 Background on Score-Based ECE

Consider a classification problem mapping inputs $x$ to predictions for labels $y \in \{1,\ldots,K\}$. Let $f$ be a black-box classifier which outputs label probabilities $f(x) \in [0,1]^K$ for each $x \in X$. Then, for the standard 0-1 loss function, the predicted label is $\hat{y} = \operatorname{argmax}(f(x)) \in \{1,\ldots,K\}$ and the corresponding confidence score is $s = s(x) = P_f(y=\hat{y} \,|\, x) = \max(f(x))$. It is of interest to determine whether such a model is well-calibrated, that is, whether its confidence matches the true probability that a prediction is correct.

For a given confidence score $s$, we define $\text{Acc}(s) = P(y=\hat{y} \,|\, s) = \mathbb{E}\bigl[\mathbb{I}[y=\hat{y}] \,|\, s\bigr]$. Then the $\ell_p$ calibration error (CE), as a function of the confidence score $s$, is defined as the difference between accuracy and confidence score (Kumar, Liang, and Ma 2019):

$$\text{CE}(s) = |P(y=\hat{y} \,|\, s) - s|^p = |\text{Acc}(s) - s|^p \qquad (1)$$

where $p \geq 1$. In this paper, we will focus on the expectation of the $\ell_1$ calibration error with $p=1$, known as the ECE:

$$\text{ECE} = \mathbb{E}[\text{CE}(s)] = \int_s P(s)\,|\text{Acc}(s) - s|\,ds \qquad (2)$$

where an ECE of zero corresponds to perfect calibration. In practice, ECE is often estimated empirically on a labeled test dataset by creating $B$ bins over $s$ according to some binning scheme (e.g., Guo et al. (2017)):

$$\widehat{\text{ECE}} = \sum_{b=1}^{B} \frac{n_b}{n}\,|\text{Acc}_b - \text{Conf}_b| \qquad (3)$$

where $n_b$ is the number of datapoints in bin $b$, $n$ is the total number of datapoints, and $\text{Acc}_b$ and $\text{Conf}_b$ are the estimated accuracy and estimated average value of confidence, respectively, in bin $b = 1,\ldots,B$.
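As a concrete illustration, the binned estimator of Equation 3 can be computed directly from a model's test-set confidence scores and correctness indicators. The following Python sketch (using NumPy; function and variable names are our own, and equal-width bins are used here for simplicity, whereas the experiments in Section 7 use equal-support bins) is one straightforward implementation.

```python
import numpy as np

def expected_calibration_error(scores, correct, n_bins=10):
    """Binned estimate of ECE (Equation 3).

    scores: confidence of the predicted label, max(f(x)), for each test point.
    correct: 1 if the prediction was correct, 0 otherwise.
    """
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Equal-width bins over the confidence score.
    edges = np.linspace(scores.min(), scores.max(), n_bins + 1)
    bin_ids = np.clip(np.digitize(scores, edges[1:-1]), 0, n_bins - 1)
    ece, n = 0.0, len(scores)
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue
        acc_b = correct[in_bin].mean()   # Acc_b: accuracy in bin b
        conf_b = scores[in_bin].mean()   # Conf_b: average confidence in bin b
        ece += (in_bin.sum() / n) * abs(acc_b - conf_b)
    return ece
```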

4 Variable-Based Calibration Error

In many applications, we may wish to understand the calibration properties of a classification model $f$ relative to one or more particular variables of interest. Traditional reliability diagrams and the ECE measure, however, may be insufficient to fully characterize the type of variable-based miscalibration shown in Figure 1.

Consider a real-valued variable $V$ taking values $v$. $V$ could be a variable related to the inputs $X$ of the model, such as one of the input features, another feature (e.g., metadata) defined per instance but not used in the model, or some function of inputs $x$. To evaluate model calibration with respect to $V$, we introduce the notion of variable-based calibration error (VCE), defined pointwise as a function of $v$:

$$\text{VCE}(v) = \bigl|\text{Acc}(v) - \mathbb{E}[s \,|\, v]\bigr| \qquad (4)$$

where $\text{Acc}(v) = P(y=\hat{y} \,|\, v)$ is the accuracy of the model conditioned on $V=v$, marginalizing over inputs to the model that do not involve $V$, and $\mathbb{E}[s \,|\, v]$ is the expected model score conditioned on a particular value $v$:

$$\mathbb{E}[s \,|\, v] = \int_s s \cdot P(s \,|\, v)\,ds \qquad (5)$$

In general, conditioning on $v$ will induce a distribution over inputs $x$, which in turn induces a distribution $P(s \,|\, v)$ over scores $s$ and predictions $\hat{y}$. As an example of $\text{VCE}(v)$, in the context of Figure 1(b), at $v=45$ the model accuracy $P(y=\hat{y} \,|\, v)$ is estimated to be $100-21=79\%$ and the expected score $\mathbb{E}[s \,|\, v]$ is estimated to be $76\%$, so $\text{VCE}(v)$ is approximately $3\%$.

The expected value of $\text{VCE}(v)$, with respect to $V$, is defined as:

$$\text{VECE} = \mathbb{E}[\text{VCE}(v)] = \int_v P(v)\,\text{VCE}(v)\,dv \qquad (6)$$

Comment

Note that CE (and ECE) can be seen as a special case of VCE (and VECE), given the correspondence of Equations 1 and 2 with Equations 4 and 6 when $V$ is the model score (i.e., $V=s$). In the rest of the paper, however, we view CE and ECE as being distinct from VCE and VECE in order to highlight the differences between score-based and variable-based calibration.

As with ECE, a practical way to compute an empirical estimate of VECE is by binning, where bins $b$ are defined by some binning scheme (e.g., equal weight) over values $v$ of the variable $V$ (rather than over scores $s$):

$$\widehat{\text{VECE}} = \sum_{b=1}^{B} \frac{n_b}{n}\,|\text{Acc}_b - \text{Conf}_b| \qquad (7)$$

Here $b$ is a bin corresponding to some sub-range of $V$, $n_b$ is the number of points within this bin, and $\text{Acc}_b$ and $\text{Conf}_b$ are empirical estimates of the model’s accuracy and the model’s average confidence within bin $b$. For example, the $\widehat{\text{ECE}}$ in Figure 1 is 0.74%, while the $\widehat{\text{VECE}}$ is 2.04%.

The definitions of $\text{VCE}(v)$ and VECE above are in terms of a continuous variable $V$, which is our primary focus in this paper. In general, the definitions above and the theoretical results in Section 5 also apply to discrete-valued $V$, as well as to multivariate $V$.
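For concreteness, the binned estimator in Equation 7 differs from the score-based estimator only in that the bins are formed over values of $V$ rather than over scores. A minimal Python sketch (NumPy only; names are illustrative, and equal-support bins are formed from quantiles of $V$) is given below.

```python
import numpy as np

def variable_expected_calibration_error(v, scores, correct, n_bins=10):
    """Binned estimate of VECE (Equation 7), with equal-support bins over v.

    v: value of the variable of interest for each test point.
    scores: confidence of the predicted label for each test point.
    correct: 1 if the prediction was correct, 0 otherwise.
    """
    v = np.asarray(v, dtype=float)
    scores = np.asarray(scores, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Quantile edges give (approximately) the same number of points per bin.
    edges = np.quantile(v, np.linspace(0.0, 1.0, n_bins + 1))
    bin_ids = np.clip(np.searchsorted(edges, v, side="right") - 1, 0, n_bins - 1)
    vece, n = 0.0, len(v)
    for b in range(n_bins):
        in_bin = bin_ids == b
        if not in_bin.any():
            continue
        acc_b = correct[in_bin].mean()   # accuracy within this range of v
        conf_b = scores[in_bin].mean()   # average confidence within this range of v
        vece += (in_bin.sum() / n) * abs(acc_b - conf_b)
    return vece
```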

5 Theoretical Results

In this section, we establish a number of results on the relationship between ECE and VECE. All proofs can be found in Appendix A.

First, we show that the ECE and VECE can differ by a gap approaching 50 percentage points.

Theorem 5.1 (VECE bound).

There exist $K$-ary classifiers $f$ and variables $V$ such that the classifier $f$ has both $\text{ECE} = 0$ and $\text{VECE} = 0.5 - \frac{1}{2K}$.

For example, in the binary case with $K=2$, the difference between ECE and VECE can be as large as 0.25. As the number of classes $K$ grows, this gap approaches 0.5. Thus, we can have models $f$ that are perfectly calibrated according to ECE (i.e., with ECE = 0) but that have VECE ranging from 0.25 to 0.5. We will show later in Section 7 that this type of gap is not just a theoretical artifact but also exists in real-world datasets, for real-world classifiers $f$ and for specific variables $V$ of interest. The proof of Theorem 5.1 is by construction, using a model $f$ that is very underconfident for certain regions of $v$ and very overconfident in other regions of $v$, but perfectly calibrated with respect to $s$.

In the context of analyzing properties of ECE, Kumar, Liang, and Ma (2019) proved that the binned empirical estimator $\widehat{\text{ECE}}$ consistently underestimates the true ECE, and showed by construction that this gap can approach 0.5. Our results complement this work in that we are concerned with the true theoretical relationship between two different measures of calibration, namely ECE and VECE, whereas Kumar, Liang, and Ma (2019) relate the estimate $\widehat{\text{ECE}}$ (Equation 3) with the true ECE (Equation 2).

Theorem 5.2 (ECE bound).

There exist $K$-ary classifiers $f$ and variables $V$ such that the classifier $f$ has $\text{VECE} = 0$ and $\text{ECE} = 0.5 - \frac{1}{2K}$.

We prove this by construction, where $f$ is well-calibrated with respect to a variable $V$, but its low scores are very underconfident and its high scores are very overconfident.

The results above illustrate that the ECE and VECE measures can be very different for the same model $f$. In our experimental results we will also show that it is not uncommon (particularly for uncalibrated models) for ECE and VECE to be equal. To understand the case of equality, we first define the notion of consistent over- or under-confidence with respect to a variable:

Definition 5.3 (Consistent overconfidence).

Let $f$ be a classifier with scores $s$. For a variable $V$ taking values $v$, $f$ is consistently overconfident if $\mathbb{E}[s \,|\, v] > P(y=\hat{y} \,|\, v)\ \forall v$, i.e., the expected value of the model’s scores as a function of $v$ is always greater than the true accuracy as a function of $v$.

Consistent underconfidence can be defined analogously, using $\mathbb{E}[s \,|\, v] < P(y=\hat{y} \,|\, v)\ \forall v$. In the special case where the variable $V$ is defined as the score itself, we have the condition $s > P(y=\hat{y} \,|\, s)\ \forall s$, leading to consistent overconfidence for the scores.

For the case of consistent over- or under-confidence for a model $f$, we have the following result:

Theorem 5.4 (Equality conditions of ECE and VECE).

Let $f$ be a classifier that is consistently under- or over-confident with respect both to $s$ and to a variable $V$. Then the ECE and VECE of $f$ are equal.

The results above provide insight into the relationship between ECE and VECE. Specifically, if the miscalibration is “one-sided” (i.e., consistently over- or under-confident for both the score $s$ and a variable $V$), then ECE and VECE will be in agreement. However, when the classifier $f$ is both over- and under-confident (as a function of either $s$ or $v$), then ECE and VECE can differ significantly and, as a result, ECE can mask significant systematic miscalibration with respect to variables of interest.

Figure 2: Variable-based calibration plots for Age for the Adult Income model

6 Mitigating Variable-Based Miscalibration

Diagnosis of Variable-Based Miscalibration

In order to better detect and characterize per-variable miscalibration, we discuss below variable-based calibration plots, which we have found useful in practice. Figure 1(b) shows an example of a variable-based calibration plot for age. In Section 7, we explore how these plots can be used to characterize miscalibration across different classifiers, datasets, and variables of interest.

For ease of interpretation in the results below we focus on the model’s error rate and predicted error, rather than accuracy and confidence, although they are equivalent. Particularly for models with high accuracy, we find that it is more intuitive to discuss differences in error rate than in accuracy.

To generate these types of plots, we first compute the individual error $\mathbb{I}[y \neq \hat{y}]$ and predicted error $1 - s(x) = 1 - \max(f(x))$ for each observation. We then construct nonparametric error curves with LOESS. (Further details are available in Appendix B.) This approach allows us to obtain 95% confidence bars for the error rate and mean predicted error, based on standard error, thus putting the differences in curves into perspective.
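A rough sketch of this procedure is shown below, assuming statsmodels' LOWESS smoother as the LOESS implementation and matplotlib for plotting; note that this smoother fits locally linear (rather than quadratic) curves, and the confidence bands and smoothing-factor choices described in Appendix B are omitted for brevity. All names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

def variable_calibration_plot(v, scores, correct, frac=0.85):
    """Plot smoothed actual vs. model-predicted error as a function of v."""
    v = np.asarray(v, dtype=float)
    error = 1.0 - np.asarray(correct, dtype=float)        # I[y != y_hat]
    pred_error = 1.0 - np.asarray(scores, dtype=float)    # 1 - max(f(x))
    # lowess returns an array of (sorted v, smoothed value) pairs.
    err_curve = lowess(error, v, frac=frac)
    pred_curve = lowess(pred_error, v, frac=frac)
    plt.plot(err_curve[:, 0], err_curve[:, 1], label="error rate")
    plt.plot(pred_curve[:, 0], pred_curve[:, 1], label="model-predicted error")
    plt.xlabel("variable of interest $v$")
    plt.ylabel("error")
    plt.legend()
    plt.show()
```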

Beyond visualization, we can use VECE scores to discover which variables for a dataset have the largest systematic variable-based miscalibration. In particular, ranking features in order of decreasing VECE highlights variables that may be worth investigating. An example of such a ranking for the Adult Income dataset (https://archive.ics.uci.edu/ml/datasets/adult), based on a neural network with post-hoc beta calibration (Kull, Filho, and Flach 2017), is shown in Table 1. The Years of education and Age variables rank highest in VECE, so a model developer or a user of a model might find it useful to generate a variable-based calibration plot for each of these. The Weekly work hours and Census weight variables are of lesser concern, but could also be explored. We will perform an in-depth investigation of miscalibration with respect to the variable Age in Section 7.

VECE VCE($v^*$)
Years of education 9.95% 20.13%
Age 9.59% 23.44%
Weekly work hours 7.94% 18.21%
Census weight 5.06% 12.08%
Table 1: Variable-based calibration error of Adult Income dataset features

It is also possible to define the maximum value of $\text{VCE}(v)$, i.e., the worst-case calibration error, as well as the value $v^*$ that incurs this worst-case error:

$$v^* = \arg\max_v \{\text{VCE}(v)\} = \arg\max_v \bigl\{\bigl|P(y=\hat{y} \,|\, v) - \mathbb{E}[s \,|\, v]\bigr|\bigr\} \qquad (8)$$

Estimating either $v^*$ or $\text{VCE}(v^*)$ accurately may be difficult in practice, particularly for small sample sizes $n$, since it involves the non-parametric estimation of the difference of two curves (Bowman and Young 1996) as a function of $v$ (as the shapes of the curves need not follow any convenient parametric form, e.g., see Figure 1(b)). One simple estimation strategy is to smooth both curves with LOESS and compute the maximum difference between the two estimated curves. Using this approach, worst-case calibration errors $\text{VCE}(v^*)$ for the Adult Income model are also shown in Table 1.
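A minimal sketch of this strategy, again assuming statsmodels' LOWESS smoother and illustrative names, is the following; it simply evaluates both smoothed curves at the observed values of $v$ and returns the location and size of the largest gap.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def worst_case_vce(v, scores, correct, frac=0.85):
    """Estimate v* and VCE(v*) (Equation 8) as the largest gap between the
    LOESS-smoothed error-rate and predicted-error curves."""
    v = np.asarray(v, dtype=float)
    error = 1.0 - np.asarray(correct, dtype=float)
    pred_error = 1.0 - np.asarray(scores, dtype=float)
    # Both curves are returned on the same sorted grid of v values.
    err_curve = lowess(error, v, frac=frac)
    pred_curve = lowess(pred_error, v, frac=frac)
    gap = np.abs(err_curve[:, 1] - pred_curve[:, 1])
    i = int(np.argmax(gap))
    return err_curve[i, 0], gap[i]   # (v*, estimated VCE(v*))
```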

Calibration Methods

We found empirically, across multiple datasets, that standard score-based calibration methods often reduce ECE while neglecting variable-based systematic miscalibration. Because calibration error can vary as a function of a feature of interest $V$, we propose incorporating information about $V$ into post-hoc calibration. In particular, we introduce the concept of variable-based calibration methods, a family of calibration methods that adjust confidence scores with respect to some variable of interest $V$. As an illustrative example, we perform experiments in Section 7 with a modification of probability calibration trees (Leathart et al. 2017). This technique involves performing logistic calibration separately for data splits defined by decision trees trained over the input space. We alter the method to train decision trees for $y$ with only $v$ as input, with a minimum leaf size of one-tenth of the total calibration set size. We then perform beta calibration at each leaf (Kull, Filho, and Flach 2017), as we found in our experiments that it performs empirically better than logistic calibration. In the multi-class case, we use Dirichlet calibration, an extension of beta calibration for $K$-class classification (Kull et al. 2019). Our split-based calibration method using decision trees is intended to provide a straightforward illustration of the potential benefits of variable-based calibration, rather than a state-of-the-art methodology that can balance ECE and VECE (which we leave to future work). We also investigated variable-based calibration methods that operate continuously over $V$ (rather than on separate data splits) using extensions of logistic and beta calibration, but found that these were not as reliable in our experiments as the tree-based approach (see Appendix C for details).
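For the binary case, a minimal sketch of this tree-based method is given below, assuming scikit-learn. Here the per-leaf beta calibration is implemented as a logistic regression on $\ln(s)$ and $-\ln(1-s)$ following Kull, Filho, and Flach (2017), without the parameter constraints of the full fitting procedure; function and variable names are our own, and `s` denotes the model's predicted probability of the positive class.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

def beta_features(s, eps=1e-6):
    """Beta calibration = logistic regression on ln(s) and -ln(1-s)."""
    s = np.clip(np.asarray(s, dtype=float), eps, 1 - eps)
    return np.column_stack([np.log(s), -np.log(1 - s)])

def tree_based_calibration(v_cal, s_cal, y_cal, v_test, s_test):
    """Split by a shallow decision tree on v, then beta-calibrate each leaf.

    v_*: variable of interest; s_*: predicted probability of the positive
    class; y_cal: true binary labels on the calibration set.
    """
    tree = DecisionTreeClassifier(max_depth=2, min_samples_leaf=0.1)
    tree.fit(np.asarray(v_cal).reshape(-1, 1), y_cal)
    leaves_cal = tree.apply(np.asarray(v_cal).reshape(-1, 1))
    leaves_test = tree.apply(np.asarray(v_test).reshape(-1, 1))
    s_out = np.empty(len(s_test), dtype=float)
    for leaf in np.unique(leaves_cal):
        # Fit a separate (effectively unregularized) beta-calibration map per leaf.
        calibrator = LogisticRegression(C=1e6)
        calibrator.fit(beta_features(np.asarray(s_cal)[leaves_cal == leaf]),
                       np.asarray(y_cal)[leaves_cal == leaf])
        mask = leaves_test == leaf
        s_out[mask] = calibrator.predict_proba(
            beta_features(np.asarray(s_test)[mask]))[:, 1]
    return s_out  # calibrated probability of the positive class
```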

7 Variable-Based Miscalibration in Practice

In this section, we explore several examples where the ECE obscures systematic miscalibration relative to some variable of interest, particularly after post-hoc score-based calibration. In our experiments we use four datasets that span tabular, text, and image data. For each dataset and variable of interest $V$, we investigate both (1) several score-based calibration methods and (2) our variable-based calibration method (the tree-based technique described in Section 6), comparing the resulting ECE, VECE, and variable-based calibration plots. In particular, we calibrate with scaling-binning (Kumar, Liang, and Ma 2019), Platt scaling (Platt 1999), beta calibration (Kull, Filho, and Flach 2017), and, for the multi-class case, Dirichlet calibration (Kull et al. 2019). The datasets are split into training, calibration, and test sets. Each calibration method is trained on the same calibration set, and all metrics and figures are produced from the final test set. The ECE and VECE are computed with an equal-support binning scheme, with $B=10$. Further details regarding datasets, models, and calibration methods are in Appendix B.

Figure 3: Variable-based calibration plots for the Yelp model for Review Length

Adult Census Records: Predicting Income

The Adult Income dataset consists of 1994 Census records; the goal is to predict whether an individual’s annual income is greater than $50,000. We model this data with a simple feed-forward neural network and evaluate the model’s calibration error with respect to age (i.e., let $V$ = age). Uncalibrated, this model has an ECE and VECE of 20.67% (see Table 2). The ECE and VECE are equal precisely because of the model’s consistent overconfidence as a function of both the confidence score and $V$ (see Definition 5.3). This overconfidence with respect to age is reflected in the variable-based calibration plot (Figure 2a). The model’s error rate varies significantly as a function of age, with very high error for individuals around age 50, and much lower error for younger and older people. However, its confidence remains nearly constant at close to 100% (i.e., a predicted error close to 0%) across all ages.

ECE VECE
Uncalibrated 20.67% 20.67%
Scaling-binning 2.27% 9.25%
Platt scaling 4.57% 10.13%
Beta calibration 1.65% 9.59%
Variable-based calibration 1.64% 2.11%
Table 2: Adult Income model calibration error
Figure 4: Variable-based calibration plots for the Bank Marketing model for Age

After calibrating, the ECE is dramatically reduced, with beta calibration achieving an ECE of 1.65%. However, the corresponding VECE is still very high (over 9%). As shown in Figure 2b, the model’s self-predicted error has increased substantially, but remains near constant as a function of age. Thus, despite a significant improvement in ECE, the model still harbors unfairness with respect to age, exhibiting overconfidence in its predictions for individuals in the 35-65 age range, and underconfidence for those outside of it. As the model is no longer consistently overconfident, the ECE and VECE diverge, as predicted theoretically.

Variable-based calibration obtains a significantly lower VECE of 2.11%, while simultaneously reducing the ECE. This improvement in VECE is reflected in Figure 2c. The model’s predicted error now varies with age to match the true error rate. In this case, a simple variable-based calibration method improves the age-wise systematic miscalibration of the model, without detriment to the overall calibration error.

Yelp Reviews: Predicting Sentiment

To explore variable-based calibration in an NLP context, we use a fine-tuned large language model, BERT (Kenton and Toutanova 2019), on the Yelp review dataset (https://www.yelp.com/dataset). The model predicts whether a review has a positive or negative rating based on its text. In this case there are no easily-interpretable features directly input to the model. Instead, to better diagnose model behavior, we can analyze real-valued characteristics of the text, such as the length of each review or part-of-speech statistics. Here we focus on review length in characters.

Figure 3 shows the model’s error and predicted error with respect to review length. The error rate is lowest for reviews around 300-700 characters, around the median review length. Very short and very long reviews are associated with a higher error rate. Uncalibrated, this model is consistently overconfident, with an ECE and VECE of 1.93% (see Table 3).

ECE VECE
Uncalibrated 1.93% 1.93%
Scaling-binning 4.23% 4.23%
Platt scaling 3.04% 0.64%
Beta calibration 1.73% 0.37%
Variable-based calibration 1.70% 0.23%
Table 3: Yelp model calibration error

After beta calibration, the ECE and VECE drop to 1.73% and 0.37%, respectively. Figure 3b reflects this: the model’s predicted error aligns more closely with its actual error rate, although it is still overconfident for very short reviews.

Our variable-based calibration method further reduces the VECE and yields a small improvement to the ECE. The new predicted error curve matches the true relationship between review length and error rate more faithfully (Figure 3c), reducing overconfidence for short reviews.

Figure 5: Variable-based calibration plots for the CIFAR-10H model for Median Reaction Time

Bank Marketing: Predicting Subscriptions

We also investigate miscalibration on a simple neural network modeling the Bank Marketing dataset (https://archive.ics.uci.edu/ml/datasets/bank+marketing). The model predicts whether a bank customer will subscribe to a bank term deposit as a result of direct marketing. Uncalibrated, the model is overconfident, with both ECE and VECE over 4.5% (see Table 4). Consider the calibration error with respect to customer age before (Figure 4a) and after (Figure 4b) Platt scaling. Platt scaling, which is the best-performing score-based calibration method, uniformly increases the predicted error across age, reducing both ECE and VECE, but resulting in underconfidence for most ages and overconfidence at the edges of the distribution.

ECE VECE
Uncalibrated 4.69% 4.69%
Scaling-binning 4.37% 3.39%
Platt scaling 2.38% 2.83%
Beta calibration 2.48% 2.77%
Variable-based calibration 2.10% 0.52%
Table 4: Bank Marketing model calibration error

The variable-based calibration method achieves competitive ECE, while reducing VECE to about half of one percent. The calibration plot reflects this improvement: the predicted error matches the true error rate more closely, reducing the miscalibration with respect to customer age.

CIFAR-10H: Image Classification

As a multi-class example, we investigate variable-based miscalibration on CIFAR-10H, a 10-class image dataset including labels and reaction times from human annotators (Peterson et al. 2019). We use a standard deep learning image classification architecture (a DenseNet model) to predict the image category, and investigate median annotator reaction times, metadata that are not provided to the model. Instead of Platt scaling and beta calibration, here we use Dirichlet calibration (to accommodate the multiple classes).

In this case, Dirichlet calibration achieves the lowest overall ECE and variable-based calibration obtains the lowest VECE (see Table 5). The variable-based calibration plots are shown in Figure 5. We see that the variable-based calibration method reduces underconfidence for examples with low median reaction times (where the majority of data points lie).

ECE VECE
Uncalibrated 1.90% 1.92%
Scaling-binning 3.83% 3.60%
Dirichlet calibration 0.80% 1.12%
Variable-based calibration 1.18% 0.86%
Table 5: CIFAR-10H model calibration error

Summary of Experimental Results

Our results demonstrate the potential of variable-based calibration. While score-based calibration methods generally improved the ECE, variable-based calibration methods performed better across datasets in terms of simultaneously reducing both the ECE and VECE, without any significant increase in model error rate or the VECE for other variables (details in Appendix B). The results also illustrate that variable-based calibration plots enable meaningful characterization of the relationships between variables of interest and predicted/true error, providing more detailed insight into a model’s performance than a single number (i.e., ECE or VECE).

8 Discussion and Conclusions

Discussion of Limitations

There are several potential limitations of this work. First, we focused on the mitigation of miscalibration for one variable $V$ at a time. Although we did not observe higher VECE for other variables after applying our variable-based calibration method, this behavior has not been analyzed theoretically. Further, a more thorough investigation of miscalibration across intersections of variables is still needed. We also emphasize that the variable-based calibration method used in the paper is primarily for illustration; the development of new methods for simultaneously reducing score-based and variable-based miscalibration is a useful direction for future work.

Conclusions

In this paper we demonstrated theoretically and empirically that ECE can obscure significant miscalibration with respect to variables of potential importance to a developer or user of a classification model. To better detect and characterize this type of miscalibration, we introduced the VECE measure and corresponding variable-based calibration plots, and we characterized the theoretical relationship between VECE and ECE. In a case study across several datasets and models, we showed that VECE, variable-based calibration plots, and variable-based calibration methods are all useful tools for understanding and mitigating miscalibration on a per-variable level. Looking forward, to mitigate biases in calibration error, we recommend moving beyond purely score-based calibration analysis. In addition to promoting fairness, these techniques offer new insight into model behavior and provide actionable avenues for improvement.

Acknowledgements

This material is based upon work supported in part by the HPI Research Center in Machine Learning and Data Science at UC Irvine, by the National Science Foundation under grants number 1900644 and 1927245, and by a Qualcomm Faculty Award.

References

  • Bansal et al. (2021) Bansal, G.; Nushi, B.; Kamar, E.; Horvitz, E.; and Weld, D. S. 2021. Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 11405–11414.
  • Bowman and Young (1996) Bowman, A.; and Young, S. 1996. Graphical Comparison of Nonparametric Curves. Journal of the Royal Statistical Society: Series C (Applied Statistics), 45(1): 83–98.
  • Cabrera et al. (2019) Cabrera, Á. A.; Epperson, W.; Hohman, F.; Kahng, M.; Morgenstern, J.; and Chau, D. H. 2019. FAIRVIS: Visual Analytics for Discovering Intersectional Bias in Machine Learning. In 2019 IEEE Conference on Visual Analytics Science and Technology (VAST), 46–56.
  • Chalfin et al. (2016) Chalfin, A.; Danieli, O.; Hillis, A.; Jelveh, Z.; Luca, M.; Ludwig, J.; and Mullainathan, S. 2016. Productivity and Selection of Human Capital with Machine Learning. American Economic Review, 106(5): 124–27.
  • De et al. (2021) De, A.; Okati, N.; Zarezade, A.; and Rodriguez, M. G. 2021. Classification under Human Assistance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 5905–5913.
  • Friedman (2001) Friedman, J. H. 2001. Greedy Function Approximation: A Gradient Boosting Machine. Annals of Statistics, 1189–1232.
  • Guo et al. (2017) Guo, C.; Pleiss, G.; Sun, Y.; and Weinberger, K. Q. 2017. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, 1321–1330.
  • Hébert-Johnson et al. (2018) Hébert-Johnson, U.; Kim, M.; Reingold, O.; and Rothblum, G. 2018. Multicalibration: Calibration for the (Computationally-Identifiable) Masses. In International Conference on Machine Learning, 1939–1948.
  • Huang et al. (2020) Huang, Y.; Li, W.; Macheret, F.; Gabriel, R. A.; and Ohno-Machado, L. 2020. A Tutorial on Calibration Measurements and Calibration Models for Clinical Prediction Models. Journal of the American Medical Informatics Association, 27(4): 621–633.
  • Kenton and Toutanova (2019) Kenton, J. D. M.-W. C.; and Toutanova, L. K. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, 4171–4186.
  • Kleinberg et al. (2018) Kleinberg, J.; Lakkaraju, H.; Leskovec, J.; Ludwig, J.; and Mullainathan, S. 2018. Human Decisions and Machine Predictions. The Quarterly Journal of Economics, 133(1): 237–293.
  • Kull, Filho, and Flach (2017) Kull, M.; Filho, T. S.; and Flach, P. 2017. Beta Calibration: A Well-Founded and Easily Implemented Improvement on Logistic Calibration for Binary Classifiers. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54, 623–631.
  • Kull et al. (2019) Kull, M.; Perello-Nieto, M.; Kängsepp, M.; Filho, T. S.; Song, H.; and Flach, P. 2019. Beyond Temperature Scaling: Obtaining Well-Calibrated Multiclass Probabilities with Dirichlet Calibration. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 12316–12326.
  • Kumar, Liang, and Ma (2019) Kumar, A.; Liang, P. S.; and Ma, T. 2019. Verified Uncertainty Calibration. In Advances in Neural Information Processing Systems, 3787–3798.
  • Leathart et al. (2017) Leathart, T.; Frank, E.; Holmes, G.; and Pfahringer, B. 2017. Probability Calibration Trees. In Proceedings of the Ninth Asian Conference on Machine Learning, volume 77, 145–160.
  • Luo et al. (2022) Luo, R.; Bhatnagar, A.; Wang, H.; Xiong, C.; Savarese, S.; Bai, Y.; Zhao, S.; and Ermon, S. 2022. Localized Calibration: Metrics and Recalibration. In Proceedings of the Thirty-Eighth Conference on Uncertainty in Artificial Intelligence, volume 180, 1286–1295.
  • Mitchell et al. (2019) Mitchell, M.; Wu, S.; Zaldivar, A.; Barnes, P.; Vasserman, L.; Hutchinson, B.; Spitzer, E.; Raji, I. D.; and Gebru, T. 2019. Model Cards for Model Reporting. In Proceedings of the Conference on Fairness, Accountability, and Transparency, 220–229.
  • Molnar (2020) Molnar, C. 2020. Interpretable Machine Learning. Lulu.com.
  • Murphy and Winkler (1977) Murphy, A. H.; and Winkler, R. L. 1977. Reliability of Subjective Probability Forecasts of Precipitation and Temperature. Journal of the Royal Statistical Society: Series C (Applied Statistics), 26(1): 41–47.
  • Ovadia et al. (2019) Ovadia, Y.; Fertig, E.; Lakshminarayanan, B.; Nowozin, S.; Sculley, D.; Dillon, J.; Ren, J.; Nado, Z.; and Snoek, J. 2019. Can You Trust Your Model’s Uncertainty? Evaluating Predictive Uncertainty under Dataset Shift. In Proceedings of the 33rd International Conference on Neural Information Processing Systems, 14003–14014.
  • Pan et al. (2020) Pan, F.; Ao, X.; Tang, P.; Lu, M.; Liu, D.; Xiao, L.; and He, Q. 2020. Field-aware Calibration: A Simple and Empirically Strong Method for Reliable Probabilistic Predictions. In Proceedings of The Web Conference: WWW 2020, 729–739.
  • Peterson et al. (2019) Peterson, J. C.; Battleday, R. M.; Griffiths, T. L.; and Russakovsky, O. 2019. Human Uncertainty Makes Classification More Robust. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9617–9626.
  • Platt (1999) Platt, J. 1999. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers, 61–74.
  • Raji et al. (2020) Raji, I. D.; Smart, A.; White, R. N.; Mitchell, M.; Gebru, T.; Hutchinson, B.; Smith-Loud, J.; Theron, D.; and Barnes, P. 2020. Closing the AI Accountability Gap: Defining an End-to-End Framework for Internal Algorithmic Auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, 33–44.
  • Rajkomar, Dean, and Kohane (2019) Rajkomar, A.; Dean, J.; and Kohane, I. 2019. Machine Learning in Medicine. New England Journal of Medicine, 380(14): 1347–1358.
  • Song et al. (2021) Song, H.; Perello-Nieto, M.; Santos-Rodriguez, R.; Kull, M.; Flach, P.; et al. 2021. Classifier Calibration: How to assess and improve predicted class probabilities: a survey. arXiv preprint arXiv:2112.10327.
  • Steyvers et al. (2022) Steyvers, M.; Tejeda, H.; Kerrigan, G.; and Smyth, P. 2022. Bayesian Modeling of Human-AI Complementarity. Proceedings of the National Academy of Sciences, 119(11).
  • Vaicenavicius et al. (2019) Vaicenavicius, J.; Widmann, D.; Andersson, C.; Lindsten, F.; Roll, J.; and Schön, T. 2019. Evaluating Model Calibration in Classification. In International Conference on Artificial Intelligence and Statistics, 3459–3467.
  • Završnik (2021) Završnik, A. 2021. Algorithmic Justice: Algorithms and Big Data in Criminal Justice Settings. European Journal of Criminology, 18(5): 623–642.

Appendix A Proofs for Section 5

Theorem A.1 (VECE bound).

There exist $K$-ary classifiers $f$ and variables $V$ such that the classifier $f$ has $\text{ECE} = 0$ and $\text{VECE} = 0.5 - \frac{1}{2K}$.

Proof.

Let $V$ be a continuous variable with density $P(v)$. Recall that $\text{VECE} = \int_v P(v)\,|P(y=\hat{y} \,|\, v) - \mathbb{E}[s \,|\, v]|\,dv$, where $P(y=\hat{y} \,|\, v)$ is the accuracy of model $f$ as a function of $v$, and the score $s$ is the probability that the model assigns to its label prediction $\hat{y}$.

The reliability diagram for a $K$-ary classifier has scores $s \in [\frac{1}{K}, 1]$, where the leftmost value of this interval is a result of the fact that the score is defined as the maximum of $K$ class probabilities. Let $\gamma = 0.5 + \frac{1}{2K}$ be the midpoint of this interval.

Assume that the scores $s$ have a uniform distribution of the form $s \sim U(\gamma-\alpha, \gamma+\alpha)$, where $\alpha$ is some constant and $0 \leq \alpha \leq 0.25$, and that the scores $s$ and the variable $V$ are independent.

Further assume that the accuracy of the model $f$ depends on $v$ and $s$ in the following manner:

$$P(y=\hat{y} \,|\, v \leq v_t,\, s \leq \gamma) = 1-\alpha \qquad P(y=\hat{y} \,|\, v \leq v_t,\, s > \gamma) = 1$$
$$P(y=\hat{y} \,|\, v > v_t,\, s \leq \gamma) = \tfrac{1}{K} \qquad P(y=\hat{y} \,|\, v > v_t,\, s > \gamma) = \tfrac{1}{K}+\alpha$$

where $v_t$ is defined such that $P(v \leq v_t) = P(v > v_t) = 0.5$.

The marginal accuracy as a function of the score (marginalizing over $v$) can be written as

$$P(y=\hat{y} \,|\, s \leq \gamma) = \gamma - \tfrac{\alpha}{2}, \qquad P(y=\hat{y} \,|\, s > \gamma) = \gamma + \tfrac{\alpha}{2}.$$

The marginal accuracy as a function of $v$ (marginalizing over $s$) is

$$P(y=\hat{y} \,|\, v \leq v_t) = 1 - \tfrac{\alpha}{2}, \qquad P(y=\hat{y} \,|\, v > v_t) = \tfrac{1}{K} + \tfrac{\alpha}{2}.$$

This setup is designed so that the score is close to the accuracy as a function of $s$ (to minimize ECE), but the variable-based expected scores $\mathbb{E}[s \,|\, v] = \gamma$ are relatively far away from accuracy as a function of $v$.

Under these assumptions we can write the ECE as

$$\begin{split}\text{ECE} &= \int_s p(s)\cdot|P(y=\hat{y} \,|\, s) - s|\,ds\\ &= \int_{\gamma-\alpha}^{\gamma}\frac{1}{2\alpha}\Bigl|\gamma-\frac{\alpha}{2}-s\Bigr|\,ds + \int_{\gamma}^{\gamma+\alpha}\frac{1}{2\alpha}\Bigl|\gamma+\frac{\alpha}{2}-s\Bigr|\,ds\\ &= \frac{\alpha}{4}.\end{split} \qquad (9)$$

We can write the VECE as

$$\begin{split}\text{VECE} &= \int_{-\infty}^{v_t} P(v)\cdot|P(y=\hat{y} \,|\, v) - \mathbb{E}[s \,|\, v]|\,dv + \int_{v_t}^{\infty} P(v)\cdot|P(y=\hat{y} \,|\, v) - \mathbb{E}[s \,|\, v]|\,dv\\ &= \int_{-\infty}^{v_t} P(v)\cdot\Bigl|1-\frac{\alpha}{2}-\gamma\Bigr|\,dv + \int_{v_t}^{\infty} P(v)\cdot\Bigl|\frac{1}{K}+\frac{\alpha}{2}-\gamma\Bigr|\,dv\\ &= \Bigl(0.5-\frac{1}{2K}-\frac{\alpha}{2}\Bigr)\int_v P(v)\,dv\\ &= 0.5-\frac{1}{2K}-\frac{\alpha}{2}.\end{split} \qquad (10)$$

Thus, as $\alpha \to 0$, $\text{VECE} \to 0.5-\frac{1}{2K}$ and $\text{ECE} \to 0$.

Theorem A.2 (ECE bound).

There exist $K$-ary classifiers $f$ and variables $V$ such that the classifier $f$ has $\text{VECE} = 0$ and $\text{ECE} = 0.5 - \frac{1}{2K}$.

Proof.

Let $V$ be a continuous variable with density $P(v)$. Recall that a $K$-ary classifier has scores $s \in [\frac{1}{K}, 1]$, where we let $\gamma = 0.5 + \frac{1}{2K}$ be the midpoint of this interval. Assume that $f$ produces scores from two uniform distributions, with equal probability: $s \sim U(\frac{1}{K}, \frac{1}{K}+\alpha)$ and $s \sim U(1-\alpha, 1)$, where $\alpha$ is some constant with $0 \leq \alpha \leq 0.25$, and that the scores $s$ and the variable $V$ are independent. Finally, suppose the accuracy of the model $P(y=\hat{y}) = \gamma$ is independent of $s$ and $V$.

Under these assumptions we can write the VECE as

$$\begin{split}\text{VECE} &= \int_{-\infty}^{\infty} P(v)\cdot|P(y=\hat{y} \,|\, v) - \mathbb{E}[s \,|\, v]|\,dv\\ &= \int_{-\infty}^{\infty} P(v)\cdot|\gamma-\gamma|\,dv\\ &= 0.\end{split} \qquad (11)$$

We can write the ECE as

$$\begin{split}\text{ECE} &= \int_s p(s)\cdot|P(y=\hat{y} \,|\, s) - s|\,ds\\ &= \frac{1}{2}\int_{\frac{1}{K}}^{\frac{1}{K}+\alpha}\frac{1}{\alpha}|\gamma-s|\,ds + \frac{1}{2}\int_{1-\alpha}^{1}\frac{1}{\alpha}|\gamma-s|\,ds\\ &= 0.5-\frac{1}{2K}-\frac{\alpha}{2}.\end{split} \qquad (12)$$

Thus, as $\alpha \to 0$, $\text{ECE} \to 0.5-\frac{1}{2K}$ and $\text{VECE} = 0$.

Definition A.3 (Consistent overconfidence).

Let $f$ be a classifier with scores $s$. For a variable $V$ taking values $v$, $f$ is consistently overconfident if $\mathbb{E}[s \,|\, v] > P(y=\hat{y} \,|\, v)\ \forall v$, i.e., the expected value of the model’s scores as a function of $v$ is always greater than the true accuracy as a function of $v$.

Consistent underconfidence is defined analogously with $\mathbb{E}[s \,|\, v] < P(y=\hat{y} \,|\, v)\ \forall v$. In the special case where the variable $V$ is defined as the score itself, we have $s > P(y=\hat{y} \,|\, s)\ \forall s$, i.e., consistent overconfidence for the scores.

Theorem A.4 (Equality conditions for ECE and VECE).

Let $f$ be a classifier that is consistently under- or over-confident with respect both to $s$ and to a variable $V$. Then the ECE and VECE of $f$ are equal.

Proof.

Without loss of generality, suppose $f$ is consistently underconfident with respect to its scores $s$ and the variable $V$.

Then we have, by consistent underconfidence:

$$\begin{split}\text{ECE} &= \int_s p(s)\cdot|P(y=\hat{y} \,|\, s) - s|\,ds\\ &= \int_s p(s)\cdot P(y=\hat{y} \,|\, s)\,ds - \mathbb{E}[s]\end{split} \qquad \begin{split}\text{VECE} &= \int_v p(v)\cdot|P(y=\hat{y} \,|\, v) - \mathbb{E}[s \,|\, v]|\,dv\\ &= \int_v p(v)\cdot P(y=\hat{y} \,|\, v)\,dv - \int_v p(v)\,\mathbb{E}[s \,|\, v]\,dv\\ &= \int_v p(v)\cdot P(y=\hat{y} \,|\, v)\,dv - \mathbb{E}[s]\end{split} \qquad (13)$$

By the law of total probability,

$$\begin{split}\text{ECE} &= \int_s p(s)\cdot P(y=\hat{y} \,|\, s)\,ds - \mathbb{E}[s]\\ &= P(y=\hat{y}) - \mathbb{E}[s]\end{split} \qquad\qquad \begin{split}\text{VECE} &= \int_v p(v)\cdot P(y=\hat{y} \,|\, v)\,dv - \mathbb{E}[s]\\ &= P(y=\hat{y}) - \mathbb{E}[s]\end{split} \qquad (14)$$

So $\text{ECE} = \text{VECE} = P(y=\hat{y}) - \mathbb{E}[s]$.

Appendix B Calibration, Model, and Dataset Details

Here, we include additional information and plots for each dataset and model discussed in Section 7. Code for reproducing all tables and plots is available online at https://github.com/markellekelly/variable-wise-calibration.

On each dataset, we test several existing calibration methods: Platt scaling, scaling-binning, beta calibration, and (for the multi-class case) Dirichlet calibration. For scaling-binning, we calibrate over 10 bins, and for Dirichlet calibration, we use a lambda value of 1e-3, values chosen based on the respective authors’ provided examples. Here and in Section 7, we present the uncalibrated and variable-based calibrated output, along with the best-performing score-based calibration method (for the Adult and Yelp datasets, beta calibration; for Bank Marketing, Platt scaling; for CIFAR, Dirichlet calibration).

Our variable-based calibration method is performed as follows. Given the calibration set, a decision tree classifier is trained to predict the outcome $y$ with input $V$ (the single variable of interest). We use a maximum depth of two and a minimum leaf size of 0.1 times the size of the calibration set. The calibration set is then split according to the leaf nodes of the trained decision tree, and separately the rest of the dataset is split according to the same rules. Standard beta calibration is then performed separately for each split, using the subset of the original calibration set as the new calibration set, and computing the new calibrated probabilities for the subset of the original dataset.

Variable-based calibration plots are created with LOESS, with quadratic local fit and an assumed symmetric distribution of the errors, with empirically-chosen smoothing factors between 0.8 and 0.9.

We note the VECE for each numeric variable in each dataset before and after the calibration method is applied. We find in general empirically that variable-based calibration with respect to one variable is not detrimental to the VECE of other variables.

Finally, we observe that our variable-based calibration method does not tend to significantly degrade accuracy. Accuracies for each dataset before and after its application are shown in Table 6.

Adult Income Yelp Bank Marketing CIFAR
Uncalibrated 79.1% 98.0% 88.9% 97.2%
Score-based calibrated 79.1% 98.0% 88.7% 96.9%
Variable-based calibrated 79.1% 98.0% 88.7% 96.0%
Table 6: Accuracies for all four datasets before calibration, after the best-performing (lowest-ECE) score-based calibration (as reported in the main paper and below), and after variable-based calibration.

Adult Income

The Adult Income dataset was modeled with a multi-layer perceptron, with two hidden layers of sizes 100 and 75. Of the 48,842 observations, 32,561 were used for training, 2,500 were used for calibration, and 13,781 were used for testing. The dataset includes six continuous variables: age, fnlwgt (the estimated number of people an individual represents), education-num (a number representing the individual’s years of education), capital-gain, capital-loss, and hours-per-week (the number of hours per week that an individual works).

Based on the beta-calibrated model, education-num and age rank the highest in VECE, as shown in Section 6. For all six variables, VECE is reduced by applying the variable-based calibration method with respect to age:

Uncalibrated Beta-calibrated Variable-based calibrated
education-num 20.67% 9.95% 8.53%
age 20.67% 9.59% 2.11%
hours-per-week 20.67% 7.94% 6.02%
fnlwgt 20.67% 5.06% 4.10%
capital-gain 20.67% 1.50% 1.39%
capital-loss 20.67% 1.50% 1.39%
Table 7: VECE for numeric variables in the Adult Income dataset: uncalibrated, beta-calibrated, and after variable-based calibration with respect to age.

Uncalibrated, the model’s ECE and VECE are 20.67%. Of the score-based calibration methods tested, beta calibration achieves the lowest ECE of 1.65%. Relevant reliability diagrams for the uncalibrated, beta-calibrated, and variable-based calibrated models are shown in Figure 6.

Figure 6: Reliability diagrams for the Adult Income model

Yelp

The Yelp dataset was modeled with a fine-tuned BERT model. 100,000 observations were randomly sampled from the full Yelp dataset. Of these, 70,500 were used for training, 10,000 were used for calibration, and 19,500 were used for testing. Several continuous features were generated from the raw text reviews, including length in characters, number of special characters, and proportions of each part of speech. Based on the beta-calibrated model, review length ranked highest in VECE, followed by proportion of stop words, as shown in Table 8.

Uncalibrated Beta-calibrated Variable-based calibrated
Length (characters) 1.93% 0.37% 0.23%
Stop-word Proportion 1.93% 0.29% 0.28%
Named Entity Count 1.93% 0.21% 0.22%
Table 8: VECE for numeric variables in the Yelp dataset: uncalibrated, beta calibrated, and after variable-based calibration with respect to length in characters.

Uncalibrated, the model’s ECE and VECE are 1.93%. Of the score-based calibration methods tested, beta calibration achieves the lowest ECE of 1.73%. Relevant reliability diagrams for the uncalibrated, beta-calibrated, and variable-based calibrated models are shown in Figure 7.

Figure 7: Reliability diagrams for the Yelp model

Bank Marketing

The Bank Marketing dataset was modeled with a multi-layer perceptron, with two hidden layers of sizes 100 and 75. Of the 45,211 total observations, 31,647 were used for training, 1,000 were used for calibration, and 12,564 were used for testing. Based on the model calibrated with Platt scaling, account balance ranked highest in VECE, followed by age, as shown in Table 9.

Uncalibrated Platt scaling Variable-based calibrated
Account balance 5.35% 4.17% 3.22%
Age 4.69% 2.83% 0.52%
Table 9: VECE for numeric variables in the Bank Marketing dataset: uncalibrated, calibrated with Platt scaling, and after variable-based calibration with respect to age.

Uncalibrated, the model’s ECE is 4.69%. Of the score-based calibration methods tested, Platt scaling achieves the lowest ECE of 2.38%. Relevant reliability diagrams for the uncalibrated, Platt-scaled, and variable-based calibrated models are shown in Figure 8.

Figure 8: Reliability diagrams for the Bank Marketing model, uncalibrated (left), calibrated with Platt scaling (middle), and variable-based calibrated (right)

CIFAR-10H

The CIFAR-10H dataset was modeled with a DenseNet model. Of the 10,000 total observations, 4,057 were used for training, 2,000 were used for calibration, and 3,943 were used for testing.

Uncalibrated, the model’s ECE and VECE are 1.90% and 1.92%, respectively. Of the score-based calibration methods tested, Dirichlet calibration achieves the lowest ECE of 0.80%. Relevant reliability diagrams for the uncalibrated, Dirichlet-calibrated, and variable-based calibrated models are shown in Figure 9.

Figure 9: Reliability diagrams for the CIFAR-10H model, uncalibrated (left), Dirichlet calibrated (middle), and variable-based calibrated (right)

Appendix C Alternate Calibration Methods

As an alternative variable-based calibration method, we extend logistic and beta calibration, which operate continuously over the score, to incorporate information regarding $V$. In particular, logistic calibration learns a mapping $\mu$ of scores $s$, with parameters $a$ and $c$ learned via logistic regression:

$$\mu_{\text{logistic}}(s) = \frac{1}{1 + 1/\exp(a\cdot s + c)}$$

This can be augmented to include $V$ by simply training the logistic regression on both $s$ and $V$, learning the following mapping:

$$\mu_{\text{logistic},v}(s,v) = \frac{1}{1 + 1/\exp(a\cdot s + b\cdot v + c)}$$

where $b$ is the logistic regression coefficient corresponding to $V$.

Similarly, beta calibration learns the following mapping, where the parameters $a$, $b$, and $c$ are learned by training a logistic regression on $\ln(s)$ and $-\ln(1-s)$ (see Kull, Filho, and Flach (2017) for more details):

$$\mu_{\text{beta}}(s) = \frac{1}{1 + 1/\left(e^{c}\,\frac{s^{a}}{(1-s)^{b}}\right)}$$

This can also be augmented with $V$, including it as a third input to the regression:

$$\mu_{\text{beta},v}(s,v) = \frac{1}{1 + 1/\left(e^{d\cdot v + c}\,\frac{s^{a}}{(1-s)^{b}}\right)}$$

In contrast to the tree-based method detailed in the main paper, which splits the data along VV and then separately calibrates each set, these methods learn one calibration mapping for the entire dataset. Empirically, we find that augmented beta calibration is a promising approach, simultaneously reducing ECE and VECE, although some attention must be paid to the fit of the logistic regression (e.g., by including a quadratic term). However, in our experiments, this technique ultimately was not as reliable as tree-based calibration (perhaps because the functional form of beta calibration is not flexible enough to always be able to correct systematic miscalibration as a function of VV).
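For reference, a minimal sketch of this augmented beta mapping for the binary case is given below, assuming scikit-learn; the optional quadratic term in $v$ mentioned above is included as a flag, and all function and variable names are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def augmented_beta_features(s, v, quadratic=False, eps=1e-6):
    """Inputs to the augmented beta regression: ln(s), -ln(1-s), v (and v^2)."""
    s = np.clip(np.asarray(s, dtype=float), eps, 1 - eps)
    v = np.asarray(v, dtype=float)
    cols = [np.log(s), -np.log(1 - s), v]
    if quadratic:
        cols.append(v ** 2)
    return np.column_stack(cols)

def augmented_beta_calibration(s_cal, v_cal, y_cal, s_test, v_test, quadratic=False):
    """Fit mu_beta_v on the calibration set, then map the test-set scores."""
    calibrator = LogisticRegression(C=1e6)  # effectively unregularized
    calibrator.fit(augmented_beta_features(s_cal, v_cal, quadratic), y_cal)
    return calibrator.predict_proba(
        augmented_beta_features(s_test, v_test, quadratic))[:, 1]
```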

Here, we include the results of the augmented-beta variable-based (VB) calibration method on the Adult Income, Yelp, and Bank Marketing datasets. The models for the Adult and Bank Marketing datasets include a quadratic term for VV, which obtained a better fit. (Note that this formulation only applies to binary classification, so we do not include results for the CIFAR dataset here).

ECE VECE
Uncalibrated 20.67% 20.67%
Beta calibration 1.65% 9.59%
Tree-based VB calibration 1.64% 2.11%
Augmented-beta VB calibration 1.49% 1.87%
Table 10: Adult Income model calibration error
Figure 10: Variable-based calibration plots for the Adult Income model for Age
ECE VECE
Uncalibrated 1.93% 1.93%
Beta calibration 1.73% 0.37%
Tree-based VB calibration 1.70% 0.23%
Augmented-beta VB calibration 1.73% 0.37%
Table 11: Yelp model calibration error
Figure 11: Variable-based calibration plots for the Yelp model for Review Length
ECE VECE
Uncalibrated 4.69% 4.69%
Platt scaling 2.38% 2.83%
Tree-based VB calibration 2.10% 0.52%
Augmented-beta VB calibration 2.09% 1.13%
Table 12: Bank Marketing model calibration error
Figure 12: Variable-based calibration plots for the Bank Marketing model for Age