
Confidence Calibration Using Logit Transformations

Sooyong Jang
Dept. of Computer and Info. Science
PRECISE Center
University of Pennsylvania
Philadelphia, PA 19104

Radoslav Ivanov
Dept. of Computer and Info. Science
PRECISE Center
University of Pennsylvania
Philadelphia, PA 19104

Insup Lee
Dept. of Computer and Info. Science
PRECISE Center
University of Pennsylvania
Philadelphia, PA 19104

James Weimer
Dept. of Computer and Info. Science
PRECISE Center
University of Pennsylvania
Philadelphia, PA 19104
Abstract

As machine learning techniques become widely adopted in new domains, especially in safety-critical systems such as autonomous vehicles, it is crucial to provide accurate output uncertainty estimation. As a result, many approaches have been proposed to calibrate neural networks to accurately estimate the likelihood of misclassification. However, while these methods achieve low calibration error, there is space for further improvement, especially in high-dimensional settings such as ImageNet. In this paper, we introduce a calibration algorithm, named Hoki, that works by applying random transformations to the neural network logits. We provide a sufficient condition for perfect calibration based on the number of label prediction changes observed after applying the transformations. We perform experiments on multiple datasets and models and show that the proposed approach generally outperforms state-of-the-art calibration algorithms, especially on the challenging ImageNet dataset. Finally, Hoki is scalable as well, as it requires execution time comparable to that of temperature scaling.

1 Introduction

Figure 1: Random noise transformations may lead to a different number of label prediction changes for different images. Here we apply six different transformations sampled from $N(0, 0.16^{2})$.

Deep neural networks have proven useful in various fields such as image classification [1, 2, 3], object detection [4, 5], and speaker verification [6]. Motivated by these successes, deep neural networks are now being integrated into safety-critical systems such as medical systems [7, 8]. However, as observed by Guo et al. [9], neural networks are often over-confident in their predictions. This over-confidence can be a critical problem in safety-critical applications, where an over-confident neural network can be wrong with high confidence.

Various calibration techniques have been proposed to alleviate the miscalibration issue. A common approach is to map uncalibrated logits or confidences to calibrated ones  [9, 10, 11, 12, 13, 14, 15, 16]. Another option is to train a neural network with a modified loss function to produce calibrated confidence [17, 18, 19]. Although existing methods successfully improve the confidence in terms of calibration metrics such as the expected calibration error (ECE) [20], there is further space for improvement, especially in high-dimensional settings where vast amounts of data are required for good calibration.

In this work, we propose Hoki, a calibration algorithm that achieves strong empirical performance as compared with state-of-the-art techniques. The intuition behind Hoki is illustrated in Figure 1. Suppose we are given two images of Persian cats that are initially correctly classified and suppose that we randomly perturb each image a number of times.


Figure 2: Proposed calibration method. We use logit transformations and corresponding label changes to calibrate the original confidence.

If all perturbed versions of the first image are still correctly classified whereas the second image transformations lead to label switches (e.g., to Siamese cat), then intuitively we would have higher confidence in the first image’s label. More generally, the idea, as shown in Figure 2, is to use logit transformations to test the network’s sensitivity and group examples according to the proportion of observed label switches.

Based on this intuition, we first present a sufficient condition for perfect calibration in terms of ECE, obtained by leveraging the label changes observed after applying transformations. An added benefit of this theoretical result is that it leads to a natural implementation that minimizes the empirical calibration error computed with the transformations. The proposed algorithm is also efficient in terms of runtime, especially in the case of logit transformations, due to their reduced dimensionality (as compared to the input space dimensionality).

To evaluate Hoki, we perform experiments on MNIST, CIFAR10/100, and ImageNet, and we use several standard models per dataset, including LeNet5, DenseNet, ResNet, and ResNet SD. On these datasets, we compare Hoki with multiple state-of-the-art calibration algorithms, namely temperature scaling [9], MS-ODIR, Dir-ODIR [14], ETS, IRM, IROvA-TS [21], and ReCal [11]. In terms of ECE [20], Hoki outperforms the other calibration algorithms on 8 out of 15 benchmarks; we emphasize that Hoki achieves lower ECE than the other methods on all ImageNet models. Finally, in terms of learning time, Hoki achieves performance similar to temperature scaling, one of the fastest algorithms in our comparison.

The main contributions of this paper are as follows:

  • we propose Hoki, an iterative calibration algorithm using logit transformations;

  • we provide a sufficient condition for perfect calibration;

  • we show that Hoki outperforms other calibration algorithms on 8 out of 15 benchmarks, and achieves the lowest ECE on all ImageNet models;

  • we demonstrate that Hoki is also efficient in terms of learning time, achieving performance comparable to temperature scaling.

This paper is structured as follows. In Section 2, we summarize the related work, and in Section 3, we describe the problem statement. Then, we present the theory of calibration through transformations in Section 4. In Section 5, we describe the transformation selection process, and we illustrate the calibration algorithm using transformations in Section 6. In Section 7, we show the experimental results, and we conclude this paper in Section 8.

2 Related work

Various approaches have been proposed for obtaining accurate confidences, and in this paper, we discuss the work most closely related to our approach. We review post-hoc calibration techniques, transformation-based calibration, and calibration with theoretical guarantees.

Post-hoc Calibration. Many researchers have proposed calibration methods for neural network classifiers so that the predicted probabilities match the empirical probabilities on a validation set [10, 12, 13, 9, 14, 15, 17, 21]. These post-hoc approaches learn a (simpler) mapping function for calibration without re-training the given (complex) classifier, and different mapping functions operating on different types of inputs have been proposed. For example, temperature scaling [9] uses a linear function on logits, Platt scaling [13] employs an affine function on logits, and Dirichlet calibration [14] trains an affine function on log logits.
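To make the mapping-function view concrete, the following Python sketch (our own illustration, not code released with [9]) fits the single temperature of temperature scaling by grid search over the negative log-likelihood on validation logits; the function name, the grid, and its bounds are our own choices.

import numpy as np

def fit_temperature(logits, labels, grid=None):
    # Fit a single temperature T > 0 on validation logits by minimizing the NLL;
    # calibrated confidences are then max softmax(logits / T).
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=int)
    grid = np.linspace(0.1, 10.0, 200) if grid is None else grid

    def nll(t):
        z = logits / t
        z = z - z.max(axis=1, keepdims=True)             # numerically stable log-softmax
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(labels)), labels].mean()

    return min(grid, key=nll)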

Transformation-based calibration. Approaches using input transformations for calibration have also been proposed. Bahat and Shakhnarovich [22] apply semantics-preserving transformations such as contrast change, rotation, and zoom to augment given inputs and compute confidence using the augmented set of inputs. Jang et al. [11] introduce a lossy label-invariant transformation for calibration. They define a lossy label-invariant transformation, use it to group inputs, and apply group-wise temperature scaling. While our method is also based on transformations, the choice of random transformations allows us to obtain a theoretical guarantee for perfect calibration.

Calibration with theoretical guarantee. Kumar et al. [23] propose a scaling-binning calibrator, which combines Platt scaling and histogram binning, and provide a bound on the calibration error. Park et al. [24] propose a calibrated prediction method that provides per-prediction confidence bounds using the Clopper-Pearson interval on top of histogram binning. With this method, examples in the same bin are assigned the same confidence, whereas Hoki assigns a different confidence to each example, while also aiming to achieve the sufficient condition for perfect calibration within the bin.

3 Problem Statement

Let $\mathcal{X}$ be a feature space, $\mathcal{Y}=\{1,\dots,C\}$ be a set of labels, and $\mathcal{D}$ be a distribution over $\mathcal{X}\times\mathcal{Y}$. We are given a classifier $f:\mathcal{X}\to\mathcal{Y}$ and a corresponding calibrator $g:\mathcal{X}\to[0,1]$ such that, for a given example $x$, $(f(x),g(x))=(\hat{y},\hat{p})$ is the label prediction $\hat{y}$ with a corresponding confidence $\hat{p}$. In what follows, we say that the sets $\mathcal{P}_{1},\dots,\mathcal{P}_{J}$ form a bin partition of the confidence space $[0,1]$ if $\cup_{i=1}^{J}\mathcal{P}_{i}=[0,1]$ and $\forall i\neq j,\;\mathcal{P}_{i}\cap\mathcal{P}_{j}=\emptyset$. Furthermore, given a dataset $Z=\{(x_{1},y_{1}),\dots,(x_{N},y_{N})\}$, we say $g$ induces an index partition $\mathcal{I}_{1},\dots,\mathcal{I}_{J}$ of $\{1,\dots,N\}$ such that $g(x_{n})\in\mathcal{P}_{j}\iff n\in\mathcal{I}_{j}$ for all $(x_{n},y_{n})\in Z$.

Before formally stating the problem considered in this work, we define calibration error (CE) and expected calibration error (ECE) [20].

Definition 1 (Calibration Error (CE)).

For any calibrator $g$ and confidence partition $\mathcal{P}_{1},\dots,\mathcal{P}_{J}$, the calibration error (CE) is defined as

CE(g)=\sum_{j=1}^{J}w_{j}\left|e_{j}\right|,

where

e_{j}:=P_{\mathcal{D}}\left[Y=f(X)\mid g(X)\in\mathcal{P}_{j}\right]-E_{\mathcal{D}}\left[g(X)\mid g(X)\in\mathcal{P}_{j}\right]
w_{j}:=P_{\mathcal{D}}\left[g(X)\in\mathcal{P}_{j}\right].

Intuitively, the CE of a classifier-calibrator pair in a given partition is the expected difference between the classifier’s accuracy and the calibrator’s confidence. To get the CE over the entire space, we sum up all the individual partition CEs, weighted by the probability mass of each partition (i.e., the probability of an example falling in that partition).

Definition 2 (Expected Calibration Error (ECE)).

For any calibrator $g$, confidence partition $\mathcal{P}_{1},\dots,\mathcal{P}_{J}$, sampled dataset $Z\in(\mathcal{X}\times\mathcal{Y})^{N}$, and induced index partition $\{\mathcal{I}_{1},\dots,\mathcal{I}_{J}\}$, we define the expected calibration error (ECE) as

ECE(g)=\sum_{j=1}^{J}\hat{w}_{j}\left|\hat{e}_{j}\right|,\qquad\text{where}\qquad\hat{e}_{j}:=\sum_{n\in\mathcal{I}_{j}}\frac{\mathbbm{1}_{\{y_{n}=\hat{y}_{n}\}}-g(x_{n})}{|\mathcal{I}_{j}|}\quad\text{and}\quad\hat{w}_{j}:=\frac{|\mathcal{I}_{j}|}{N}.

Thus, the ECE is the sampled version of the CE. Note that Definition 2 is equivalent to the standard ECE definition, as used in prior work [9]. We are now ready to state the problem addressed in this work, namely, to find a calibrator $\hat{g}$ that minimizes the ECE over a validation set.
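For concreteness, the following Python sketch (our own, not the authors' released code) computes the ECE of Definition 2 with $J$ equal-width bins over $[0,1]$; the function name and array-based interface are our own choices.

import numpy as np

def expected_calibration_error(confidences, correct, num_bins=15):
    # confidences: calibrated confidences g(x_n) in [0, 1]; correct: indicators 1{y_n = y_hat_n}.
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Assign each example to an equal-width bin P_1, ..., P_J over [0, 1].
    bin_ids = np.minimum((confidences * num_bins).astype(int), num_bins - 1)
    ece = 0.0
    for j in range(num_bins):
        in_bin = bin_ids == j
        if not np.any(in_bin):
            continue
        w_j = in_bin.mean()                                         # |I_j| / N
        e_j = correct[in_bin].mean() - confidences[in_bin].mean()   # bin accuracy minus bin confidence
        ece += w_j * abs(e_j)
    return ece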

Problem statement 1.

Let $\mathcal{G}=\{g:\mathcal{X}\to[0,1]\}$ be the set of all calibrators. We aim to find $\hat{g}\in\mathcal{G}$ that minimizes the expected calibration error,

\hat{g}=\arg\min_{g\in\mathcal{G}}ECE(g).

4 Calibration using Transformations

This section provides the intuition and theory of using transformations for the purpose of calibration. We begin by providing high level intuition, followed by a sufficient condition for perfect calibration in expectation, which leads to a natural implementation as well.

High-level intuition.

Suppose that ff and gg form a classifier-calibrator pair. If we take a correctly classified image of a cat, for example, we would expect that the classification confidence would drop as we apply random transformations to the image (e.g., add noise, zoom out). Conversely, if the confidence does not decrease, we would conclude that ff and gg are not properly calibrated.

More generally, the goal of applying transformations is to group examples in bins of similar confidence. In particular, if a certain set of examples exhibits similar transformation patterns (e.g., label switching, misclassification), then the calibrator should learn to assign such examples a similar confidence value. Of course, this approach only works for a good choice of transformations; we discuss a number of options in Section 5.

Sufficiency for perfect calibration.

We now investigate calibrator properties that ensure perfect calibration. Suppose we are given a class of transformations $\mathcal{T}=\{t:\mathcal{X}\to\mathcal{X}\}$, e.g., functions that add random noise, and a corresponding probability distribution $\mathcal{D}_{T}$ over $\mathcal{T}$. Then, for each example $(x,y)$, we can apply a number of transformations and observe how many transformations lead to a label switch. Specifically, the following result is key to achieving perfect calibration.

Theorem 1 (Sufficiency for Perfect Calibration).

Let $\mathcal{P}_{1},\dots,\mathcal{P}_{J}$ be a confidence bin partition. A calibrator $g\in\mathcal{G}$ is perfectly calibrated, i.e., $CE(g)=0$, if it satisfies, $\forall j\in\{1,\dots,J\}$,

E_{\mathcal{D}}\left[g(X)\mid g(X)\in\mathcal{P}_{j}\right]=\alpha_{j}\gamma_{j}+\beta_{j}(1-\gamma_{j}),

where

\alpha_{j}=P_{\mathcal{D}\times\mathcal{D}_{T}}\left[f(X)=Y\mid f(T(X))=f(X),\,g(X)\in\mathcal{P}_{j}\right]
\beta_{j}=P_{\mathcal{D}\times\mathcal{D}_{T}}\left[f(X)=Y\mid f(T(X))\neq f(X),\,g(X)\in\mathcal{P}_{j}\right]
\gamma_{j}=P_{\mathcal{D}\times\mathcal{D}_{T}}\left[f(T(X))=f(X)\mid g(X)\in\mathcal{P}_{j}\right].
Proof.

Proof provided in the supplementary material. ∎

Intuitively, Theorem 1 states that the label switching that we observe (in each bin) due to added transformations must be consistent with the confidence and accuracy in that bin. In particular, the average confidence in the bin must be equal to the weighted sum of accuracies over the two groups of examples: 1) examples whose label is changed by some transformation; 2) examples whose label is not changed due to transformations. The benefit of Theorem 1 is that it leads to a natural implementation by estimating all probabilities given a validation set (as discussed in Section 6). In the Supplementary Material, we also provide a theoretical bound (Theorem 2) on the generalization ECE (given new data) of Hoki in a probably approximately correct sense.
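As a hypothetical numeric illustration of this condition, suppose that in some bin $\mathcal{P}_{j}$ a fraction $\gamma_{j}=0.8$ of the (example, transformation) pairs keep their predicted label, the accuracy among those pairs is $\alpha_{j}=0.9$, and the accuracy among the label-switching pairs is $\beta_{j}=0.5$. Theorem 1 then requires the average confidence in the bin to equal $\alpha_{j}\gamma_{j}+\beta_{j}(1-\gamma_{j})=0.9\cdot 0.8+0.5\cdot 0.2=0.82$.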

5 Transformation Selection

As discussed in Section 4, the choice of transformations greatly affects the benefit of the result presented in Theorem 1. In particular, if a certain transformation results in a label switch for all examples, then it does not provide any useful confidence information. Thus, the most beneficial transformations are those that separate different examples into different partitions, as measured by the proportion of label switches caused by those transformations.

Choosing a class of transformations.

The first consideration when selecting a transformation is whether to apply it to the input $x$ or to some internal classifier representation, e.g., the logits in the last layer of a neural network. The benefits of applying input transformations are that they are independent of the classifier and can be chosen based on physical characteristics (e.g., a small rotation should not affect an image's class). Applying transformations to the logits is also appealing due to the reduced dimensionality: this results in improved scalability and makes it easier to find useful transformations.

Another consideration when choosing the transformations is what family to select them from. Input transformations offer a wide range of possibilities, especially in the case of images, e.g., rotation, translation, zoom out. On the other hand, logit transformations do not necessarily have a physical interpretation, so a more natural choice is to add noise selected from a known probability distribution, e.g., Gaussian or uniform. In this paper, we explore the space of uniform noise and Gaussian noise, as applied to the neural network’s logits, in order to benefit from the scalability improvements due to logit transformations.
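As a rough illustration of this choice, the Python sketch below applies additive Gaussian or uniform noise to the logits and records, for each example, which sampled transformations preserve the predicted label. The function name, the $G(a,b)$/$U(a,b)$ parameterization, and the use of one fixed noise vector per sampled transformation are our own assumptions, not necessarily the paper's exact implementation.

import numpy as np

def sample_label_switches(logits, num_transforms=1000, noise="gaussian", a=0.0, b=2.0, seed=0):
    # logits: (N, C) array of uncalibrated logits.
    # noise="gaussian" uses G(a, b) (mean a, std b); noise="uniform" uses U(a, b).
    # Returns an (N, M) boolean array whose (n, m) entry is 1 iff f(t_m(x_n)) = f(x_n).
    rng = np.random.default_rng(seed)
    n, c = logits.shape
    preds = logits.argmax(axis=1)                     # original predictions f(x_n)
    same_label = np.empty((n, num_transforms), dtype=bool)
    for m in range(num_transforms):
        # Each sampled transformation t_m adds one fixed noise vector to the logits.
        eps = rng.normal(a, b, size=c) if noise == "gaussian" else rng.uniform(a, b, size=c)
        same_label[:, m] = (logits + eps).argmax(axis=1) == preds
    return same_label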

Parameter selection.

As discussed above, the noise parameters need to be chosen so as to maximize the benefit of using transformations. One way of measuring the effect of a given transformation is by computing the standard deviations of (non-calibrated) confidences over the entire validation set. Intuitively, if a transformation results in a large standard deviation of confidences, that means this transformation is correlated with the classifier’s sensitivity to input perturbations and hence with the confidence in the classifier’s correctness. Therefore, we aim to identify transformations that maximize the standard deviation of confidences over the validation data.

To compute the variance in predicted confidences for a specific transformation distribution $\mathcal{D}_{T}$, we use the sufficient condition presented in Theorem 1. In particular, suppose we are given a validation set $Z=(x_{1},y_{1}),\dots,(x_{N},y_{N})$ and a sampled set of transformations $T=\{t_{1},\dots,t_{M}\}\sim[\mathcal{D}_{T}]^{M}$. Let $\mathcal{I}_{1},\dots,\mathcal{I}_{J}$ be an index partition. (Note that, when choosing the noise parameters, we use a single bin for all data; Equations (1)-(4) are written for an arbitrary partition since they are referenced in Section 6 as well.) Then, estimates of $\alpha$, $\beta$, $\gamma$, and the calibrated confidence $p$ can be calculated as

\hat{\alpha}_{j}=\frac{\sum_{n\in\mathcal{I}_{j}}\sum_{m=1}^{M}\mathbbm{1}_{\{f(x_{n})=y_{n}\}}\mathbbm{1}_{\{f(t_{m}(x_{n}))=f(x_{n})\}}}{\sum_{n\in\mathcal{I}_{j}}\sum_{m=1}^{M}\mathbbm{1}_{\{f(t_{m}(x_{n}))=f(x_{n})\}}} \quad (1)
\hat{\beta}_{j}=\frac{\sum_{n\in\mathcal{I}_{j}}\sum_{m=1}^{M}\mathbbm{1}_{\{f(x_{n})=y_{n}\}}\mathbbm{1}_{\{f(t_{m}(x_{n}))\neq f(x_{n})\}}}{\sum_{n\in\mathcal{I}_{j}}\sum_{m=1}^{M}\mathbbm{1}_{\{f(t_{m}(x_{n}))\neq f(x_{n})\}}} \quad (2)
\hat{\gamma}_{j,n}=\frac{1}{M}\sum_{m=1}^{M}\mathbbm{1}_{\{f(t_{m}(x_{n}))=f(x_{n})\}} \quad (3)
\hat{p}_{j,n}=(\hat{\alpha}_{j}-\hat{\beta}_{j})\hat{\gamma}_{j,n}+\hat{\beta}_{j}. \quad (4)
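A minimal sketch of these estimators, under our own naming, is given below. It takes the per-pair label-preservation indicators and per-example correctness indicators for one index partition $\mathcal{I}_{j}$ and returns $\hat{\alpha}_{j}$, $\hat{\beta}_{j}$, $\hat{\gamma}_{j,n}$, and $\hat{p}_{j,n}$; the degenerate cases where a denominator is zero are handled separately by Algorithm 1 in Section 6.

import numpy as np

def estimate_bin_parameters(same_label, correct):
    # same_label: (|I_j|, M) indicators 1{f(t_m(x_n)) = f(x_n)};
    # correct: (|I_j|,) indicators 1{f(x_n) = y_n}.
    same_label = np.asarray(same_label, dtype=float)
    correct = np.asarray(correct, dtype=float)
    switched = 1.0 - same_label
    alpha_hat = (correct[:, None] * same_label).sum() / same_label.sum()   # Equation (1)
    beta_hat = (correct[:, None] * switched).sum() / switched.sum()        # Equation (2)
    gamma_hat = same_label.mean(axis=1)                                    # Equation (3), one value per example
    p_hat = (alpha_hat - beta_hat) * gamma_hat + beta_hat                  # Equation (4)
    return alpha_hat, beta_hat, gamma_hat, p_hat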

To choose the transformation for each dataset-model combination (please refer to Section 7 for a full description of the datasets), we perform a grid search over Gaussian and uniform noise parameters and choose the setting that results in the largest standard deviation of $\hat{p}_{j,n}$. In the Gaussian case, we search over the space $[-20, 20]$ for the mean and $(0, 20]$ for the standard deviation. For uniform noise, we explore the space $[-20, 20]$ by varying both the minimum noise value and the range of the noise.

Table 1 shows the selected transformation parameters and the corresponding values of $\hat{\alpha}$, $\hat{\beta}$, and the standard deviation of $\hat{p}_{j,n}$, denoted by $\hat{\sigma}$, as computed over the different datasets and models used in our experiments. As shown in the table, $\hat{\sigma}$ varies between 0.0395 and 0.0967, which illustrates the challenge of finding an appropriate transformation.

Table 1: Selected transformation parameters over the different datasets and models. The number of transformations is $M=1000$. We use $U(a,b)$ to denote uniform noise with a range of $[a,b]$ and $G(a,b)$ to denote Gaussian noise with mean $a$ and standard deviation $b$.

Dataset Model Parameters $\hat{\alpha}$ $\hat{\beta}$ $\hat{\sigma}$
MNIST LeNet 5 U(-2, 4) 0.9910 0.6358 0.0167
CIFAR10 DenseNet 40 U(5, 14) 0.9399 0.6046 0.0456
LeNet 5 U(16, 19) 0.7795 0.4826 0.0616
ResNet 110 U(-6, 3) 0.9567 0.6320 0.0395
ResNet 110 SD U(-16, -8) 0.9286 0.5990 0.0500
WRN 28-10 U(-16, -8) 0.9715 0.6529 0.0332
CIFAR100 DenseNet 40 G(-20, 2) 0.7615 0.4082 0.0800
LeNet 5 G(-5, 1) 0.4817 0.2542 0.0553
ResNet 110 G(-4, 2) 0.7837 0.4457 0.0801
ResNet 110 SD G(-19, 2) 0.7870 0.4558 0.0782
WRN 28-10 G(4, 2) 0.8706 0.5038 0.0967
ImageNet DenseNet 161 G(16, 2) 0.8464 0.5190 0.0806
MobileNet V2 G(3, 2) 0.8222 0.5013 0.0871
ResNet 152 G(0, 2) 0.8551 0.5268 0.0803
WRN 101-2 G(3, 2) 0.8596 0.5236 0.0820

6 Implementation

Based on the theory described in Section 4, we propose Hoki, an iterative algorithm for confidence calibration. Hoki operates differently during design time and runtime. During design time, Hoki samples random transformations and learns the $\hat{\alpha}_{j}$ and $\hat{\beta}_{j}$ parameters for each bin. These parameters are then used at runtime, on test data, to estimate the confidence for new examples.

Design Time Algorithm.

The design time algorithm is described in Algorithm 1. The high-level idea of the algorithm is to achieve the sufficient condition outlined in Theorem 1 on the validation set. In particular, we first sample transformations $\{t_{1},\dots,t_{M}\}$ from $\mathcal{T}$ and observe the corresponding predictions $\bar{y}_{n,m}$. Based on $\bar{y}_{n,m}$, we compute the fraction of transformed examples that have the same label as the original image, $\gamma_{n}$. We emphasize that, as a special case of Theorem 1, $\gamma_{n}$ is computed separately for each example, as opposed to averaged over the entire partition. This modification ensures that the data is spread across multiple partitions, while still satisfying the sufficient condition for perfect calibration.

After the initialization step, Hoki iteratively estimates the parameters $\alpha_{j}$ and $\beta_{j}$ and computes the calibrated confidence using these two values based on Theorem 1. Since the original partitioning may change after re-estimating the parameters, we repeat this process until there is no change in the data partitioning (or a maximum number of iterations is reached). Note that for the corner case where the transformations result in empty sets (i.e., either all labels change or all remain the same), we set $\hat{\alpha}_{j}^{k}$, $\hat{\beta}_{j}^{k}$ to the bin accuracy. Ultimately, at design time, Hoki returns the set of calibration pairs for all partitions and all iterations as a set $\mathcal{H}$, the set of sampled transformations $\hat{\mathcal{T}}$, and the validation data accuracy $\hat{p}$.

Algorithm 1 Design Time Algorithm
  Input: validation set $Z=(x_{1},y_{1}),\dots,(x_{N},y_{N})$, transformation set $\mathcal{T}$, number of transformations $M$, classifier $f$, confidence space partition $\mathcal{P}_{1},\dots,\mathcal{P}_{J}$, maximum number of iterations $K$

  Sample transformations $\hat{\mathcal{T}}=\{t_{1},\dots,t_{M}\}$ from $\mathcal{T}$
  $\hat{p}=\frac{1}{N}\sum_{n=1}^{N}\mathbbm{1}_{\{y_{n}=f(x_{n})\}}$
  for $n=1$ to $N$ do
     $\hat{\gamma}_{n}=\frac{1}{M}\sum_{m=1}^{M}\mathbbm{1}_{\{f(t_{m}(x_{n}))=f(x_{n})\}}$
     $p_{n}=\hat{p}$
  end for

  for $k=1$ to $K$ do
     Compute sets $\mathcal{I}_{1}^{k},\dots,\mathcal{I}_{J}^{k}$ s.t. $n\in\mathcal{I}_{j}^{k}\iff p_{n}\in\mathcal{P}_{j}$
     if $k>1\wedge\mathcal{I}_{1}^{k}=\mathcal{I}_{1}^{k-1}\wedge\dots\wedge\mathcal{I}_{J}^{k}=\mathcal{I}_{J}^{k-1}$ then
        break
     end if
     for $j=1$ to $J$ do
        $c=\sum_{n\in\mathcal{I}_{j}^{k}}\sum_{m=1}^{M}\mathbbm{1}_{\{f(x_{n})=f(t_{m}(x_{n}))\}}$
        if $c=0\vee c=M|\mathcal{I}_{j}^{k}|$ then
           $\hat{\alpha}^{k}_{j}=\hat{\beta}^{k}_{j}=\frac{1}{|\mathcal{I}_{j}^{k}|}\sum_{n\in\mathcal{I}_{j}^{k}}\mathbbm{1}_{\{y_{n}=f(x_{n})\}}$
        else
           Compute $\hat{\alpha}_{j}^{k}$ according to Equation (1), using $\mathcal{I}_{j}^{k}$
           Compute $\hat{\beta}_{j}^{k}$ according to Equation (2), using $\mathcal{I}_{j}^{k}$
        end if
        $p_{n}=(\hat{\alpha}^{k}_{j}-\hat{\beta}^{k}_{j})\hat{\gamma}_{n}+\hat{\beta}^{k}_{j},\;\forall n\in\mathcal{I}_{j}^{k}$
     end for
     $K^{*}=k$
  end for
  $\mathcal{H}=\{(\hat{\alpha}_{j}^{k},\hat{\beta}_{j}^{k})\;|\;1\leq j\leq J,\;1\leq k\leq K^{*}\}$
  return $\mathcal{H}$, $\hat{p}$, $\hat{\mathcal{T}}$
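The following Python sketch mirrors the structure of Algorithm 1 under simplifying assumptions of our own: equal-width confidence bins, precomputed per-example label-preservation rates $\hat{\gamma}_{n}$, and a placeholder value for empty bins (which the algorithm above does not specify). It is an illustration, not the authors' implementation.

import numpy as np

def hoki_design_time(gamma_hat, correct, num_bins=15, max_iters=100):
    # gamma_hat: (N,) per-example label-preservation rates (Equation (3));
    # correct: (N,) indicators 1{f(x_n) = y_n}.
    gamma_hat = np.asarray(gamma_hat, dtype=float)
    correct = np.asarray(correct, dtype=float)
    p_hat = correct.mean()                            # validation accuracy, used as the initial confidence
    p = np.full_like(gamma_hat, p_hat)
    history, prev_bins = [], None
    for _ in range(max_iters):
        bins = np.minimum((p * num_bins).astype(int), num_bins - 1)   # equal-width bins (our assumption)
        if prev_bins is not None and np.array_equal(bins, prev_bins):
            break                                     # partition unchanged: stop iterating
        prev_bins = bins
        params = []
        for j in range(num_bins):
            idx = bins == j
            if not np.any(idx):
                params.append((p_hat, p_hat))         # placeholder for empty bins (our own choice)
                continue
            g, c = gamma_hat[idx], correct[idx]
            keep, switch = g.sum(), (1.0 - g).sum()
            if keep == 0.0 or switch == 0.0:          # corner case in Algorithm 1: use the bin accuracy
                a = b = c.mean()
            else:
                a = (c * g).sum() / keep              # estimate of alpha_j (Equation (1))
                b = (c * (1.0 - g)).sum() / switch    # estimate of beta_j (Equation (2))
            p[idx] = (a - b) * g + b                  # Equation (4)
            params.append((a, b))
        history.append(params)
    return history, p_hat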

Runtime Algorithm.

The runtime algorithm is described in Algorithm 2. Once the parameters for each step, $\hat{\alpha}_{j}^{k}$ and $\hat{\beta}_{j}^{k}$, are learned in Algorithm 1, Hoki can calibrate the confidence for a new input $x$ using the calibration parameter pairs in $\mathcal{H}$. Hoki first observes the original prediction $f(x)$ and the transformed-data predictions $f(t_{m}(x))$ for all transformations $\{t_{1},t_{2},\dots,t_{M}\}$. Hoki computes $\gamma$ based on these values and iteratively updates the confidence according to the calibration parameters learned in Algorithm 1.

Algorithm 2 Runtime Algorithm
  Input: test sample $x\in\mathcal{X}$, original classifier $f$, outputs of Algorithm 1: $\mathcal{H}$, $\hat{p}$, $\hat{\mathcal{T}}$
  $\gamma=\frac{1}{M}\sum_{m=1}^{M}\mathbbm{1}_{\{f(x)=f(t_{m}(x))\}},\;t_{m}\in\hat{\mathcal{T}}$
  $p=\hat{p}$
  for $k=1$ to $\frac{|\mathcal{H}|}{J}$ do
     Identify the partition index $j^{\prime}$ for $x$ such that $p\in\mathcal{P}_{j^{\prime}}$
     Identify the calibration parameters $(\hat{\alpha}^{k}_{j^{\prime}},\hat{\beta}^{k}_{j^{\prime}})\in\mathcal{H}$
     $p=(\hat{\alpha}^{k}_{j^{\prime}}-\hat{\beta}^{k}_{j^{\prime}})\gamma+\hat{\beta}^{k}_{j^{\prime}}$
  end for
  return $p$
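A matching sketch of the runtime step, under the same assumptions as the design-time sketch above, is shown below; `history` and `p_hat` are the outputs of that sketch, and `gamma` is the fraction of sampled transformations that leave the test example's predicted label unchanged.

def hoki_runtime(gamma, history, p_hat, num_bins=15):
    # gamma: fraction of transformations preserving the test example's predicted label.
    p = p_hat                                     # initial confidence = validation accuracy
    for params in history:                        # one update per design-time iteration
        j = min(int(p * num_bins), num_bins - 1)  # index of the equal-width bin P_j containing p
        a, b = params[j]
        p = (a - b) * gamma + b
    return p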

Limitations.

There are two main limitations of our approach. First, since Hoki uses transformations for calibration, choosing appropriate transformations has a significant effect on performance. To address this issue, we propose a transformation selection procedure based on maximizing the variance of the calibrated confidences, as described in Section 5. The second limitation is the utility of the ECE metric itself: it is possible to minimize the ECE by outputting the network accuracy as the confidence for all examples, which defeats the purpose of calibration. We argue that by maximizing the confidence variance, we ensure that the data is spread across multiple bins, as demonstrated in Section 7. Thus, the calibrated confidence remains useful in practice, e.g., in autonomous systems, where an informed decision can be made by taking into account the calibrated confidence in a new example's predicted label.

7 Experiments

We compare Hoki with state-of-the-art calibration algorithms using several standard datasets and models. For each model and dataset, we compute the ECE of the uncalibrated model and of the confidences calibrated by each algorithm. The experimental setup, baseline algorithms, and evaluation metrics are explained in the following subsections.

7.1 Experimental Setup

This subsection provides the details about the datasets, models, baseline algorithms, and the evaluation metrics.

Datasets and Models. We perform experiments on MNIST [25], CIFAR 10/100 [26], and ImageNet [27]. We use the following models for each dataset. For MNIST, we use one model, LeNet5 [25]. For CIFAR 10/100, we use five different models, DenseNet 40 [28], LeNet5 [25], ResNet110 [1], ResNet110 SD [29], and WRN-28-10 [2]. For ImageNet, we use four models, DenseNet161 [28], MobileNetV2 [30], ResNet152 [1], and WRN-101-2 [2].

We implement LeNet5 and ResNet110 SD ourselves, and obtain code for DenseNet40 (https://github.com/andreasveit/densenet-pytorch, under the BSD 3-Clause License) as well as ResNet110 and WRN 28-10 (https://github.com/bearpaw/pytorch-classification, under the MIT License) from the corresponding GitHub repositories. We also obtained the pre-trained ImageNet models from PyTorch (https://pytorch.org/docs/stable/torchvision/models.html, under the BSD License).

Baselines. We compare Hoki with several state-of-the-art calibration algorithms, namely temperature scaling, vector scaling [9], Dir-ODIR, MS-ODIR [14], ETS, IRM, IROvA-TS [21], and ReCal [11]. These methods calibrate confidence by learning a mapping function for uncalibrated logits or confidences. We obtain the implementations of the calibration algorithms from their respective papers, except for temperature scaling and vector scaling, which we obtain from Kull et al. [14]. For ReCal, the authors provide three different setups for their algorithm, and we choose ('zoom-out', 0.1, 0.9, 20) because it shows the best results on ImageNet.

Evaluation Metric. As described in Problem Statement 1, we evaluate all algorithms based on ECE (with $J=15$ bins of equal width [9, 31]), as defined in Definition 2. Additionally, we measure the learning time at design time to investigate each algorithm's practical utility. If a calibration algorithm is too slow on real datasets, it may not be appropriate for use in practice.

7.2 Results

The experimental results have two parts. The first part is a comparison on calibration performance in terms of ECE, and the second part is an analysis on time efficiency during design time.

ECE Results.

Table 2: ECE for the different calibration algorithms on different datasets and models. Bold and underlined numbers denote the best and second-best results, respectively.
Dataset Model Val Acc. (%) Test Acc. (%) Uncal. TempS VecS MS-ODIR Dir-ODIR ETS IRM IROvA-TS ReCal Hoki
MNIST LeNet5 98.85 98.81 0.0076 0.0018 0.0015 0.0024 0.0022 0.0019 0.0019 0.0033 0.0021 0.0008
CIFAR10 DenseNet 40 91.92 91.75 0.0520 0.0070 0.0044 0.0052 0.0039 0.0069 0.0095 0.0107 0.0101 0.0057
LeNet5 72.00 72.77 0.0182 0.0120 0.0092 0.0141 0.0105 0.0115 0.0167 0.0229 0.0118 0.0110
ResNet110 94.12 93.10 0.0456 0.0088 0.0094 0.0088 0.0084 0.0066 0.0103 0.0133 0.0090 0.0071
ResNet110 SD 90.28 90.38 0.0538 0.0114 0.0086 0.0102 0.0094 0.0112 0.0113 0.0156 0.0120 0.0044
WRN 28-10 96.06 95.94 0.0251 0.0097 0.0096 0.0092 0.0094 0.0157 0.0049 0.0088 0.0091 0.0026
CIFAR100 DenseNet 40 68.82 68.16 0.1728 0.0154 0.0266 0.0296 0.0189 0.0136 0.0135 0.0377 0.0154 0.0073
LeNet5 37.82 37.66 0.0100 0.0211 0.0155 0.0131 0.0142 0.0120 0.0125 0.0363 0.0192 0.0123
ResNet 110 70.60 69.52 0.1422 0.0091 0.0300 0.0345 0.0231 0.0155 0.0202 0.0457 0.0121 0.0127
ResNet 110 SD 70.62 70.10 0.1229 0.0089 0.0358 0.0355 0.0207 0.0086 0.0142 0.0425 0.0100 0.0098
WRN 28-10 79.62 79.90 0.0534 0.0437 0.0452 0.0355 0.0346 0.0370 0.0108 0.0336 0.0373 0.0112
ImageNet DenseNet 161 76.83 77.45 0.0564 0.0199 0.0233 0.0368 0.0477 0.0100 0.0090 0.0487 0.0133 0.0043
MobileNet V2 71.69 72.01 0.0274 0.0164 0.0153 0.0212 0.0269 0.0087 0.0075 0.0477 0.0153 0.0011
ResNet 152 77.93 78.69 0.0491 0.0201 0.0207 0.0347 0.0397 0.0112 0.0080 0.0457 0.0139 0.0052
WRN 101-2 78.67 79.15 0.0524 0.0307 0.0330 0.0418 0.0279 0.0165 0.0086 0.0426 0.0258 0.0067

Table 2 displays ECE values for each algorithm, along with each model’s validation set and test set accuracy (the Supplementary Material provides an extensive evaluation where we also vary the number of bins in the ECE evaluation, in order to test each algorithm’s robustness to more fine-grained bins). As discussed in Section 5, Hoki uses the transformations shown in Table 1. As shown in Table 2, Hoki achieves the lowest ECE on 8 out of 15 benchmarks. The benefit of using transformations is especially pronounced in the large-dimensional ImageNet dataset where Hoki consistently achieves the lowest ECE on all models.

Interestingly, the ECE produced by Hoki closely tracks the difference between the validation and test set accuracy. In some sense, one cannot hope to do better than this difference as it reflects the variance within each dataset. Thus, the benchmarks where Hoki does not achieve the best performance are settings with large differences in generalization accuracy. For example, in the case of ResNet110 on CIFAR10 and CIFAR100, the gaps are 1.02 and 1.08 percentage points, respectively.

Another reason for our strong performance on ImageNet is the dataset size (there are 25,000 images in the ImageNet validation set compared to 10,000 images in MNIST and 5,000 images in CIFAR10/100). A larger validation set means that each partition is likely to have more samples, which in turn results in more accurate estimation of α\alpha, β\beta, and γ\gamma.

Figure 3: log ECE vs. Variance on the various ImageNet models.

ECE Variance.

For further evaluation, we also explore the variance of the confidences produced by each algorithm. As noted in Section 5, a larger variance is preferred because it provides some indication that examples with low true confidence are indeed separated from those with high confidence. Figure 3 plots this variance against the ECE (in log scale) for all algorithms on the ImageNet models (plots for the other benchmarks are provided in the Supplementary Material). As shown in the figure, Hoki has comparable variance to other algorithms that also achieve low ECE on ImageNet. Overall, there appears to be a trade-off between achieving low ECE and high variance; we leave exploring this phenomenon for future work.

Time Efficiency.

Table 3 displays the learning time of each calibration algorithm at design time. The main reason for the proposed method's good scalability is that we apply transformations to the logits; we thus avoid the input transformations required by, for example, ReCal. Furthermore, Hoki is comparable to temperature scaling, which is a fairly simple approach in the sense that it only needs to learn one parameter. In summary, Hoki not only achieves low ECE on most benchmarks but is also fast to execute.

Table 3: Learning time (sec). Bold and underlined numbers are the best and second-best results, respectively.

Dataset Model TempS VecS MS-ODIR Dir-ODIR ETS IRM IROvA-TS ReCal Hoki
MNIST LeNet5 0.16 43.31 112.04 207.88 0.31 0.02 0.12 57.67 0.53
CIFAR10 DenseNet40 0.08 31.33 222.93 92.34 0.04 0.01 0.06 84.04 0.34
LeNet5 0.05 11.86 79.62 74.33 0.04 0.01 0.06 110.79 0.32
ResNet110 0.07 27.17 193.04 87.73 0.05 0.01 0.07 38.85 0.28
ResNet110 SD 0.07 21.39 189.27 93.17 0.04 0.01 0.09 58.74 0.28
WRN 28-10 0.05 22.71 123.46 92.80 0.04 0.01 0.08 49.62 0.39
CIFAR100 DenseNet40 0.51 23.17 1211.68 626.00 0.54 0.11 0.57 136.23 1.09
LeNet5 0.42 24.17 459.59 236.87 0.29 0.12 0.45 97.77 0.99
ResNet110 0.30 25.49 1459.71 510.12 0.59 0.12 0.57 97.29 1.02
ResNet110 SD 0.24 25.51 1696.10 495.23 0.61 0.12 0.63 604.12 1.01
WRN 28-10 0.30 26.71 1110.11 611.52 0.64 0.11 0.50 125.84 1.03
ImageNet DenseNet161 18.79 179.61 13901.38 6891.19 29.34 8.76 31.16 50730.17 37.68
MobileNet V2 18.74 423.48 3899.07 12695.64 27.56 8.80 26.99 3139.60 32.97
ResNet152 18.79 169.72 12401.58 5402.85 29.28 9.01 31.14 71254.34 33.78
WRN 101-2 18.73 182.39 16989.40 11378.40 29.17 8.67 31.46 31545.77 34.26

8 Conclusion

This work proposed a confidence calibration algorithm based on the intuition that we can partition examples according to the neural network's sensitivity to transformations. Based on this intuition, we provided a sufficient condition for perfect calibration in terms of ECE. We performed an extensive experimental comparison and demonstrated that Hoki outperforms state-of-the-art approaches on multiple datasets and models, with the benefits especially pronounced on the challenging ImageNet dataset. For future work, we plan to explore the benefits of combining different transformations, particularly a mix of input and logit transformations. If those transformations are chosen carefully in order to capture input sensitivity, we expect that more accurate calibration is possible.

References

  • He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • Zagoruyko and Komodakis [2016] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.
  • Tan and Le [2019] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
  • Redmon and Farhadi [2018] Joseph Redmon and Ali Farhadi. Yolov3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
  • Bochkovskiy et al. [2020] Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao. Yolov4: Optimal speed and accuracy of object detection. arXiv preprint arXiv:2004.10934, 2020.
  • Li et al. [2017] Chao Li, Xiaokong Ma, Bing Jiang, Xiangang Li, Xuewei Zhang, Xiao Liu, Ying Cao, Ajay Kannan, and Zhenyao Zhu. Deep speaker: an end-to-end neural speaker embedding system. arXiv preprint arXiv:1705.02304, 650, 2017.
  • Ronneberger et al. [2015] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • Arcadu et al. [2019] Filippo Arcadu, Fethallah Benmansour, Andreas Maunz, Jeff Willis, Zdenka Haskova, and Marco Prunotto. Deep learning algorithm predicts diabetic retinopathy progression in individual patients. NPJ digital medicine, 2(1):1–9, 2019.
  • Guo et al. [2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modern neural networks. arXiv preprint arXiv:1706.04599, 2017.
  • Gupta et al. [2020] Kartik Gupta, Amir Rahimi, Thalaiyasingam Ajanthan, Thomas Mensink, Cristian Sminchisescu, and Richard Hartley. Calibration of neural networks using splines. arXiv preprint arXiv:2006.12800, 2020.
  • Jang et al. [2020] Sooyong Jang, Insup Lee, and James Weimer. Improving classifier confidence using lossy label-invariant transformations. arXiv preprint arXiv:2011.04182, 2020.
  • Patel et al. [2020] Kanil Patel, William Beluch, Bin Yang, Michael Pfeiffer, and Dan Zhang. Multi-class uncertainty calibration via mutual information maximization-based binning. arXiv preprint arXiv:2006.13092, 2020.
  • Platt et al. [1999] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers, 10(3):61–74, 1999.
  • Kull et al. [2019] Meelis Kull, Miquel Perello Nieto, Markus Kängsepp, Telmo Silva Filho, Hao Song, and Peter Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with dirichlet calibration. In Advances in Neural Information Processing Systems, pages 12295–12305, 2019.
  • Rahimi et al. [2020] Amir Rahimi, Amirreza Shaban, Ching-An Cheng, Richard Hartley, and Byron Boots. Intra order-preserving functions for calibration of multi-class neural networks. Advances in Neural Information Processing Systems, 33, 2020.
  • Wenger et al. [2020] Jonathan Wenger, Hedvig Kjellström, and Rudolph Triebel. Non-parametric calibration for classification. In International Conference on Artificial Intelligence and Statistics, pages 178–190. PMLR, 2020.
  • Tran et al. [2019] Gia-Lac Tran, Edwin V Bonilla, John Cunningham, Pietro Michiardi, and Maurizio Filippone. Calibrating deep convolutional gaussian processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 1554–1563, 2019.
  • Kumar et al. [2018] Aviral Kumar, Sunita Sarawagi, and Ujjwal Jain. Trainable calibration measures for neural networks from kernel mean embeddings. In International Conference on Machine Learning, pages 2805–2814, 2018.
  • Seo et al. [2019] Seonguk Seo, Paul Hongsuck Seo, and Bohyung Han. Learning for single-shot confidence calibration in deep neural networks through stochastic inferences. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9030–9038, 2019.
  • Naeini et al. [2015] Mahdi Pakdaman Naeini, Gregory F Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. In Proceedings of the… AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence, volume 2015, page 2901. NIH Public Access, 2015.
  • Zhang et al. [2020] Jize Zhang, Bhavya Kailkhura, and T Yong-Jin Han. Mix-n-match: Ensemble and compositional methods for uncertainty calibration in deep learning. In International Conference on Machine Learning, pages 11117–11128. PMLR, 2020.
  • Bahat and Shakhnarovich [2020] Yuval Bahat and Gregory Shakhnarovich. Classification confidence estimation with test-time data-augmentation. arXiv preprint arXiv:2006.16705, 2020.
  • Kumar et al. [2019] Ananya Kumar, Percy Liang, and Tengyu Ma. Verified uncertainty calibration. arXiv preprint arXiv:1909.10155, 2019.
  • Park et al. [2020] Sangdon Park, Shuo Li, Osbert Bastani, and Insup Lee. Pac confidence predictions for deep neural network classifiers. arXiv preprint arXiv:2011.00716, 2020.
  • LeCun et al. [1998] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Krizhevsky et al. [2009] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. 2009.
  • Deng et al. [2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009.
  • Huang et al. [2017] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017.
  • Huang et al. [2016] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, and Kilian Q Weinberger. Deep networks with stochastic depth. In European conference on computer vision, pages 646–661. Springer, 2016.
  • Sandler et al. [2018] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018.
  • Nixon et al. [2019] Jeremy Nixon, Michael W Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. In CVPR Workshops, volume 2, 2019.

Confidence Calibration with Bounded Error Using Transformation:
Supplementary Material

Appendix A Proof of Theorem 1

We begin by observing that for any $g$ and $\mathcal{P}_{j}\in\{\mathcal{P}_{1},\dots,\mathcal{P}_{J}\}$, the law of total probability states

P\left[Y=f(X)\mid g(X)\in\mathcal{P}_{j}\right]
= P\left[Y=f(X)\mid f(T(X))=f(X),g(X)\in\mathcal{P}_{j}\right]P\left[f(T(X))=f(X)\mid g(X)\in\mathcal{P}_{j}\right]
+ P\left[Y=f(X)\mid f(T(X))\neq f(X),g(X)\in\mathcal{P}_{j}\right]P\left[f(T(X))\neq f(X)\mid g(X)\in\mathcal{P}_{j}\right]
= \alpha_{j}\gamma_{j}+\beta_{j}P\left[f(T(X))\neq f(X)\mid g(X)\in\mathcal{P}_{j}\right]
= \alpha_{j}\gamma_{j}+\beta_{j}(1-\gamma_{j}).

Then, from Definition 1,

CE(g)=\sum_{j=1}^{J}w_{j}\left|e_{j}\right|
=\sum_{j=1}^{J}w_{j}\Big|P\left[Y=f(X)\mid g(X)\in\mathcal{P}_{j}\right]-E\left[g(X)\mid g(X)\in\mathcal{P}_{j}\right]\Big|
=\sum_{j=1}^{J}w_{j}\Big|\alpha_{j}\gamma_{j}+\beta_{j}(1-\gamma_{j})-E\left[g(X)\mid g(X)\in\mathcal{P}_{j}\right]\Big|.

Thus, the following is a sufficient property for CE(g)=0CE(g)=0:

\forall j\in\{1,\dots,J\},\quad E\left[g(X)\mid g(X)\in\mathcal{P}_{j}\right]=\alpha_{j}\gamma_{j}+\beta_{j}(1-\gamma_{j}).

Appendix B Generalization bounds on the ECE

This section presents a bound on the generalization ECE, given a new dataset, in a probably approximately correct (PAC) sense. Theorem 2 states that if a calibrator $g$ achieves a low ECE on a test set $Z$, then the true calibration error of $g$ can be bounded, in a PAC sense.

Theorem 2 (Bounded Calibration Error).

Suppose a calibrator $g$ was evaluated on a test set $Z=\{(x_{1},y_{1}),\dots,(x_{N},y_{N})\}$, achieving $ECE_{Z}(g)$. For any $\delta$, the CE is bounded, i.e.,

P\left[CE(g)\leq\epsilon\right]\geq 1-\delta,

when

\epsilon=ECE_{Z}(g)+\frac{J\sqrt{2}}{\sqrt{N}}\sqrt{2\ln(2)-\ln(\delta)}.
Proof.
P\left[CE(g)\geq\epsilon\right]
= P\left[\sum_{j=1}^{J}\left|e_{j}\right|w_{j}\geq\epsilon\right]
= P\left[\sum_{j=1}^{J}\left|e_{j}\right|(w_{j}-\hat{w}_{j}+\hat{w}_{j})\geq\epsilon\right]
\leq P\left[\sum_{j=1}^{J}\left|e_{j}\right|\left|w_{j}-\hat{w}_{j}\right|+\left|e_{j}\right|\hat{w}_{j}\geq\epsilon\right]
\leq P\left[\sum_{j=1}^{J}\left|w_{j}-\hat{w}_{j}\right|+\left|e_{j}\right|\hat{w}_{j}\geq\epsilon\right]
\leq P\left[\sum_{j=1}^{J}\left|w_{j}-\hat{w}_{j}\right|+\left|e_{j}-\hat{e}_{j}\right|\hat{w}_{j}+\left|\hat{e}_{j}\right|\hat{w}_{j}\geq\epsilon\right]
= P\left[\sum_{j=1}^{J}\left|w_{j}-\hat{w}_{j}\right|+\left|e_{j}-\hat{e}_{j}\right|\hat{w}_{j}\geq\epsilon-ECE_{Z}(g)\right]
\leq \max_{j}P\left[\left|w_{j}-\hat{w}_{j}\right|+\left|e_{j}-\hat{e}_{j}\right|\hat{w}_{j}\geq\frac{\epsilon-ECE_{Z}(g)}{J}\right]
\leq \max_{j}P\left[\left|w_{j}-\hat{w}_{j}\right|\geq\frac{\epsilon-ECE_{Z}(g)}{2J}\right]+P\left[\left|e_{j}-\hat{e}_{j}\right|\geq\frac{\epsilon-ECE_{Z}(g)}{2J\hat{w}_{j}}\right]
\leq \max_{j}2\exp\left\{-2N\left(\frac{\epsilon-ECE_{Z}(g)}{2J}\right)^{2}\right\}+2\exp\left\{-2N\hat{w}_{j}\left(\frac{\epsilon-ECE_{Z}(g)}{2J\hat{w}_{j}}\right)^{2}\right\}
= \max_{j}2\exp\left\{-2N\left(\frac{\epsilon-ECE_{Z}(g)}{2J}\right)^{2}\right\}+2\exp\left\{-\frac{2N}{\hat{w}_{j}}\left(\frac{\epsilon-ECE_{Z}(g)}{2J}\right)^{2}\right\}
\leq 2\exp\left\{-2N\left(\frac{\epsilon-ECE_{Z}(g)}{2J}\right)^{2}\right\}+2\exp\left\{-2N\left(\frac{\epsilon-ECE_{Z}(g)}{2J}\right)^{2}\right\}
= 4\exp\left\{-\frac{N(\epsilon-ECE_{Z}(g))^{2}}{2J^{2}}\right\}

We complete the proof by observing

4\exp\left\{-\frac{N(\epsilon-ECE_{Z}(g))^{2}}{2J^{2}}\right\}\leq\delta\iff\epsilon\geq ECE_{Z}(g)+\frac{J\sqrt{2}}{\sqrt{N}}\sqrt{2\ln(2)-\ln(\delta)}.
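As an illustration of how Theorem 2 can be used, the short Python snippet below evaluates the bound for hypothetical values of $ECE_{Z}(g)$ and $\delta$; the choices $J=15$ and $N=25{,}000$ match the main experiments, while the other numbers are made up for illustration only.

import math

def ce_upper_bound(ece_z, num_examples, num_bins, delta):
    # epsilon = ECE_Z(g) + (J * sqrt(2) / sqrt(N)) * sqrt(2 ln 2 - ln delta)
    slack = num_bins * math.sqrt(2.0 / num_examples) * math.sqrt(2.0 * math.log(2.0) - math.log(delta))
    return ece_z + slack

# Hypothetical example: J = 15 bins, N = 25000 examples, delta = 0.01, empirical ECE = 0.005;
# the bound evaluates to roughly 0.005 + 0.33.
print(ce_upper_bound(0.005, 25000, 15, 0.01))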

Appendix C Additional Experiments

In this section, we present additional experimental results. We show more plots for the ECE variance, comparisons using the ECE with different numbers of bins, and Hoki's ECE under different initializations.

C.1 ECE Variance

In addition to the ImageNet results in the main paper, we show the same ECE variance results for the other benchmarks: MNIST (Figure 4) and CIFAR10/100 (Figures 5 and 6). The ranges of the variance differ across datasets, but the widths of the ranges are equal.

Similar to the ImageNet case in the main text, Hoki has comparable variance on MNIST (Figure 4) and CIFAR10 (Figure 5), but with a smaller ECE. As shown in Figure 6, Hoki follows a similar pattern to the ImageNet case on CIFAR100, i.e., it has comparable variance with better ECE. Note that the uncalibrated classifier is not shown for DenseNet40 (Figure 6(a)), ResNet110 (Figure 6(c)), and ResNet110 SD (Figure 6(d)) because the uncalibrated ECEs are high compared to the other algorithms (as shown in Table 2) and the variances are low.

Figure 4: log(ECE) vs. Variance on MNIST. (a) LeNet5.
Figure 5: log(ECE) vs. Variance on CIFAR10. (a) DenseNet40, (b) LeNet5, (c) ResNet110, (d) ResNet110 SD, (e) WRN 28-10.
Figure 6: log(ECE) vs. Variance on CIFAR100. (a) DenseNet40, (b) LeNet5, (c) ResNet110, (d) ResNet110 SD, (e) WRN 28-10.

C.2 Number of bins

In addition to the standard setting for computing the ECE (15 bins), we also use different numbers of bins (5, 10, 30, 50, 100) for a more thorough evaluation. The purpose of this evaluation is to show how sensitive each algorithm is to the bin size; a larger sensitivity would imply that an algorithm might not generalize as well on new data. As shown in Tables 4-8, Hoki outperforms the other algorithms on many benchmarks, except in the extreme cases of 5 bins and 100 bins. We emphasize that Hoki always produces the best or second-best performance on the challenging ImageNet dataset except for one case, the ECE using 100 bins with the ResNet152 model. This result shows that, although Hoki is calibrated with 15 bins, the calibration remains effective for various bin sizes, which highlights the benefit of calibration using transformations.

Table 4: ECE using 5 bins
Dataset Model Uncal. TempS VecS MS-ODIR Dir-ODIR ETS IRM IROvA-TS ReCal Hoki
MNIST LeNet 5 0.0074 0.0014 0.0013 0.0011 0.0016 0.0015 0.0011 0.0015 0.0030 0.0008
CIFAR10 DenseNet 40 0.0519 0.0043 0.0023 0.0040 0.0035 0.0040 0.0037 0.0038 0.0064 0.0039
CIFAR10 LeNet 5 0.0169 0.0074 0.0065 0.0067 0.0068 0.0088 0.0065 0.0124 0.0192 0.0110
CIFAR10 ResNet 110 0.0450 0.0066 0.0074 0.0046 0.0083 0.0081 0.0051 0.0057 0.0103 0.0045
CIFAR10 ResNet 110 SD 0.0534 0.0067 0.0053 0.0050 0.0046 0.0091 0.0058 0.0063 0.0137 0.0044
CIFAR10 WRN 28-10 0.0248 0.0089 0.0095 0.0092 0.0092 0.0087 0.0115 0.0027 0.0085 0.0026
CIFAR100 DenseNet 40 0.1728 0.0126 0.0253 0.0285 0.0182 0.0106 0.0085 0.0107 0.0344 0.0073
CIFAR100 LeNet 5 0.0070 0.0206 0.0129 0.0105 0.0097 0.0180 0.0080 0.0060 0.0338 0.0123
CIFAR100 ResNet 110 0.1422 0.0072 0.0300 0.0340 0.0190 0.0109 0.0110 0.0147 0.0371 0.0127
CIFAR100 ResNet 110 SD 0.1229 0.0071 0.0358 0.0354 0.0207 0.0067 0.0051 0.0125 0.0418 0.0084
CIFAR100 WRN 28-10 0.0521 0.0422 0.0436 0.0354 0.0327 0.0364 0.0140 0.0088 0.0319 0.0111
ImageNet DenseNet 161 0.0564 0.0191 0.0211 0.0367 0.0477 0.0126 0.0093 0.0068 0.0482 0.0043
ImageNet MobileNet V2 0.0266 0.0150 0.0135 0.0191 0.0204 0.0134 0.0060 0.0038 0.0471 0.0010
ImageNet ResNet 152 0.0490 0.0201 0.0207 0.0347 0.0397 0.0139 0.0112 0.0031 0.0455 0.0052
ImageNet WRN 101-2 0.0524 0.0307 0.0300 0.0413 0.0194 0.0245 0.0155 0.0065 0.0425 0.0067
Table 5: ECE using 10 bins
Dataset Model Uncal. TempS VecS MS-ODIR Dir-ODIR ETS IRM IROvA-TS ReCal Hoki
MNIST LeNet 5 0.0076 0.0014 0.0013 0.0011 0.0016 0.0016 0.0014 0.0019 0.0030 0.0008
CIFAR10 DenseNet 40 0.0519 0.0068 0.0029 0.0040 0.0035 0.0056 0.0073 0.0073 0.0094 0.0057
CIFAR10 LeNet 5 0.0177 0.0079 0.0076 0.0067 0.0068 0.0110 0.0071 0.0142 0.0192 0.0110
CIFAR10 ResNet 110 0.0457 0.0079 0.0082 0.0075 0.0083 0.0084 0.0054 0.0091 0.0106 0.0071
CIFAR10 ResNet 110 SD 0.0535 0.0088 0.0077 0.0050 0.0046 0.0109 0.0072 0.0078 0.0145 0.0044
CIFAR10 WRN 28-10 0.0251 0.0090 0.0095 0.0092 0.0092 0.0087 0.0147 0.0046 0.0085 0.0026
CIFAR100 DenseNet 40 0.1728 0.0152 0.0258 0.0285 0.0182 0.0134 0.0098 0.0124 0.0380 0.0073
CIFAR100 LeNet 5 0.0118 0.0208 0.0129 0.0105 0.0097 0.0204 0.0131 0.0119 0.0357 0.0282
CIFAR100 ResNet 110 0.1422 0.0074 0.0301 0.0340 0.0190 0.0121 0.0138 0.0181 0.0419 0.0127
CIFAR100 ResNet 110 SD 0.1229 0.0077 0.0358 0.0354 0.0207 0.0079 0.0066 0.0140 0.0418 0.0098
CIFAR100 WRN 28-10 0.0524 0.0424 0.0446 0.0354 0.0327 0.0388 0.0303 0.0106 0.0319 0.0112
ImageNet DenseNet 161 0.0564 0.0203 0.0236 0.0367 0.0477 0.0126 0.0097 0.0077 0.0486 0.0043
ImageNet MobileNet V2 0.0266 0.0156 0.0149 0.0191 0.0204 0.0138 0.0068 0.0064 0.0474 0.0010
ImageNet ResNet 152 0.0490 0.0201 0.0207 0.0347 0.0397 0.0139 0.0112 0.0055 0.0457 0.0052
ImageNet WRN 101-2 0.0524 0.0307 0.0310 0.0413 0.0194 0.0245 0.0155 0.0076 0.0425 0.0067
Table 6: ECE using 30 bins
Dataset Model Uncal. TempS VecS MS-ODIR Dir-ODIR ETS IRM IROvA-TS ReCal Hoki
MNIST LeNet 5 0.0078 0.0025 0.0034 0.0037 0.0033 0.0035 0.0030 0.0032 0.0036 0.0008
CIFAR10 DenseNet 40 0.0525 0.0100 0.0052 0.0088 0.0083 0.0108 0.0103 0.0110 0.0136 0.0057
CIFAR10 LeNet 5 0.0230 0.0144 0.0153 0.0165 0.0145 0.0158 0.0142 0.0181 0.0283 0.0110
CIFAR10 ResNet 110 0.0458 0.0098 0.0097 0.0098 0.0091 0.0095 0.0086 0.0103 0.0156 0.0110
CIFAR10 ResNet 110 SD 0.0538 0.0135 0.0110 0.0133 0.0103 0.0136 0.0130 0.0145 0.0172 0.0229
CIFAR10 WRN 28-10 0.0255 0.0110 0.0103 0.0100 0.0105 0.0098 0.0164 0.0051 0.0100 0.0026
CIFAR100 DenseNet 40 0.1728 0.0201 0.0317 0.0324 0.0224 0.0179 0.0219 0.0150 0.0386 0.0075
CIFAR100 LeNet 5 0.0174 0.0224 0.0183 0.0192 0.0232 0.0231 0.0197 0.0144 0.0373 0.0282
CIFAR100 ResNet 110 0.1423 0.0122 0.0319 0.0351 0.0231 0.0132 0.0190 0.0223 0.0470 0.0127
CIFAR100 ResNet 110 SD 0.1230 0.0119 0.0367 0.0362 0.0208 0.0173 0.0108 0.0187 0.0449 0.0098
CIFAR100 WRN 28-10 0.0538 0.0449 0.0453 0.0378 0.0358 0.0392 0.0397 0.0162 0.0366 0.0112
ImageNet DenseNet 161 0.0564 0.0203 0.0244 0.0373 0.0477 0.0139 0.0111 0.0105 0.0500 0.0085
ImageNet MobileNet V2 0.0280 0.0188 0.0193 0.0230 0.0275 0.0183 0.0110 0.0089 0.0479 0.0011
ImageNet ResNet 152 0.0494 0.0211 0.0223 0.0359 0.0399 0.0144 0.0150 0.0099 0.0460 0.0052
ImageNet WRN 101-2 0.0532 0.0321 0.0330 0.0420 0.0305 0.0262 0.0234 0.0105 0.0428 0.0067
Table 7: ECE using 50 bins
Dataset Model Uncal. TempS VecS MS-ODIR Dir-ODIR ETS IRM IROvA-TS ReCal Hoki
MNIST LeNet 5 0.0080 0.0044 0.0039 0.0037 0.0038 0.0032 0.0047 0.0026 0.0038 0.0029
CIFAR10 DenseNet 40 0.0524 0.0111 0.0116 0.0108 0.0103 0.0135 0.0111 0.0113 0.0152 0.0064
CIFAR10 LeNet 5 0.0247 0.0182 0.0191 0.0194 0.0208 0.0184 0.0179 0.0207 0.0312 0.0128
CIFAR10 ResNet 110 0.0462 0.0123 0.0113 0.0113 0.0107 0.0108 0.0104 0.0105 0.0172 0.0071
CIFAR10 ResNet 110 SD 0.0541 0.0146 0.0133 0.0142 0.0115 0.0161 0.0152 0.0131 0.0181 0.0168
CIFAR10 WRN 28-10 0.0255 0.0131 0.0120 0.0112 0.0109 0.0105 0.0169 0.0054 0.0122 0.0026
CIFAR100 DenseNet 40 0.1728 0.0244 0.0330 0.0339 0.0295 0.0240 0.0270 0.0163 0.0435 0.0114
CIFAR100 LeNet 5 0.0241 0.0261 0.0205 0.0230 0.0270 0.0281 0.0278 0.0148 0.0407 0.0455
CIFAR100 ResNet 110 0.1423 0.0156 0.0322 0.0379 0.0256 0.0197 0.0234 0.0232 0.0515 0.0192
CIFAR100 ResNet 110 SD 0.1232 0.0164 0.0385 0.0398 0.0227 0.0233 0.0209 0.0221 0.0490 0.0167
CIFAR100 WRN 28-10 0.0562 0.0465 0.0470 0.0387 0.0376 0.0413 0.0404 0.0171 0.0387 0.0152
ImageNet DenseNet 161 0.0567 0.0227 0.0257 0.0380 0.0478 0.0176 0.0147 0.0131 0.0504 0.0044
ImageNet MobileNet V2 0.0299 0.0187 0.0204 0.0241 0.0287 0.0194 0.0145 0.0094 0.0497 0.0016
ImageNet ResNet 152 0.0499 0.0239 0.0249 0.0365 0.0399 0.0177 0.0179 0.0107 0.0474 0.0177
ImageNet WRN 101-2 0.0544 0.0341 0.0342 0.0428 0.0342 0.0271 0.0212 0.0106 0.0438 0.0067
Table 8: ECE using 100 bins
Dataset Model Uncal. TempS VecS MS-ODIR Dir-ODIR ETS IRM IROvA-TS ReCal Hoki
MNIST LeNet 5 0.0086 0.0057 0.0053 0.0052 0.0057 0.0054 0.0062 0.0032 0.0045 0.0053
CIFAR10 DenseNet 40 0.0538 0.0156 0.0166 0.0146 0.0158 0.0176 0.0162 0.0125 0.0171 0.0069
CIFAR10 LeNet 5 0.0285 0.0268 0.0243 0.0259 0.0273 0.0289 0.0279 0.0213 0.0353 0.0336
CIFAR10 ResNet 110 0.0474 0.0150 0.0148 0.0148 0.0142 0.0141 0.0146 0.0110 0.0208 0.0076
CIFAR10 ResNet 110 SD 0.0551 0.0196 0.0179 0.0189 0.0165 0.0195 0.0214 0.0150 0.0213 0.0282
CIFAR10 WRN 28-10 0.0261 0.0146 0.0143 0.0148 0.0152 0.0121 0.0201 0.0061 0.0132 0.0110
CIFAR100 DenseNet 40 0.1731 0.0292 0.0369 0.0393 0.0333 0.0340 0.0329 0.0174 0.0518 0.0189
CIFAR100 LeNet 5 0.0321 0.0346 0.0303 0.0296 0.0297 0.0338 0.0340 0.0166 0.0497 0.0455
CIFAR100 ResNet 110 0.1425 0.0264 0.0356 0.0416 0.0307 0.0264 0.0304 0.0257 0.0562 0.0456
CIFAR100 ResNet 110 SD 0.1235 0.0248 0.0408 0.0424 0.0293 0.0318 0.0281 0.0232 0.0537 0.0167
CIFAR100 WRN 28-10 0.0596 0.0486 0.0509 0.0455 0.0405 0.0449 0.0439 0.0173 0.0452 0.0206
ImageNet DenseNet 161 0.0575 0.0256 0.0292 0.0392 0.0483 0.0205 0.0195 0.0136 0.0536 0.0050
ImageNet MobileNet V2 0.0316 0.0228 0.0242 0.0274 0.0323 0.0248 0.0194 0.0113 0.0538 0.0096
ImageNet ResNet 152 0.0509 0.0264 0.0281 0.0384 0.0405 0.0214 0.0201 0.0129 0.0508 0.0236
ImageNet WRN 101-2 0.0554 0.0356 0.0373 0.0441 0.0362 0.0308 0.0267 0.0127 0.0467 0.0085

C.3 Initialization with Original Uncalibrated Confidence

We perform an experiment to investigate the effect of different initializations. In Algorithms 1 and 2, we initialize the confidence with the validation set accuracy. We can also use the original uncalibrated confidence from the classifier as the initial value, and we compare the ECE values for these two initializations. Table 9 shows that initialization with the validation set accuracy is better than initialization with the original uncalibrated confidence on all but two benchmarks, (CIFAR10, DenseNet40) and (CIFAR100, ResNet110 SD). This difference illustrates the importance of the initialization of Hoki: starting from a high-variance initial set of confidences may make it harder to converge to a good local optimum in terms of ECE.

Table 9: ECE for the two different initializations

Dataset Model Init. with Val. Accuracy Init. with Uncalibrated Confidence
MNIST LeNet 5 0.0008 0.0018
CIFAR10 DenseNet 40 0.0057 0.0038
LeNet 5 0.0110 0.0171
ResNet 110 0.0071 0.0093
ResNet 110 SD 0.0044 0.0060
WRN 28-10 0.0026 0.0042
CIFAR100 DenseNet 40 0.0073 0.0178
LeNet 5 0.0123 0.0189
ResNet 110 0.0127 0.0157
ResNet 110 SD 0.0098 0.0090
WRN 28-10 0.0112 0.0117
ImageNet DenseNet 161 0.0043 0.0069
MobileNet V2 0.0011 0.0061
ResNet 152 0.0052 0.0081
WRN 101-2 0.0067 0.0077

Appendix D Computing Environment

All experiments were run on a server with the specifications described in Table 10.

Table 10: Computing Specification
Item Specification
CPU Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz
Memory 768 GB
GPU NVIDIA GeForce RTX 2080 Ti