Evaluating Uncertainty Calibration for Open-Set Recognition
Abstract
Despite achieving enormous success in predictive accuracy for visual classification problems, deep neural networks (DNNs) are prone to producing overconfident probabilities on out-of-distribution (OOD) data. Yet, accurate uncertainty estimation is crucial for safe and reliable robot autonomy. In this paper, we evaluate popular calibration techniques for open-set conditions in a way that is distinctly different from the conventional evaluation of calibration methods on OOD data. Our results show that closed-set DNN calibration approaches are much less effective for open-set recognition, which highlights the need to develop new DNN calibration methods to address this problem.
I Introduction
Deep neural networks (DNNs) have been very successful at vision-based tasks, which has led to their widespread deployment in real-world applications (e.g., object recognition and image segmentation). Nevertheless, a challenge of DNN-based vision systems lies in the network’s inclination to produce overconfident predictions during inference, especially when facing categories not seen during training (i.e., out-of-distribution (OOD) data) [1]. Within the realm of robotics, this motivates research questions such as: how much trust can we put in the predictions of a DNN when misclassifications may have catastrophic consequences [2]? Methods have been proposed to mitigate the overconfidence problem by calibrating the predictive probabilities [3], or by estimating the predictive uncertainties [4, 5], for various vision-based objectives (e.g., image classification [6], semantic segmentation [7], and object detection [8, 9]).
When deployed in real-world environments, we desire a DNN to have the discriminative ability to separate query inputs into either known (i.e., seen during training) or unknown (i.e., not yet seen during training) classes. This problem has previously been formalized and studied under the context of open-set recognition (OSR) [10]. OSR is the extension of object recognition from closed-set to open-set conditions, where classes outside of the training data can appear at inference time. Specifically, OSR trains a model on $K$ known classes drawn from the training dataset. Then, at test time, the model is faced with additional classes that were not seen during training. An explicit none-of-the-above (or unknown) class placeholder (i.e., the $(K+1)$-th class) is added during inference. OSR aims to assign correct labels from the known classes to seen test-time samples, while detecting unseen samples and assigning them to the unknown class. Robotic applications of OSR include the separation of semantically known places from semantically unknown ones [11], failure identification for self-driving cars [12], and entity detection [13].
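To make this protocol concrete, the following is a minimal sketch (ours, not code from the paper) of how ground-truth labels can be remapped for open-set evaluation: samples from the $K$ known classes are re-indexed to $0, \dots, K-1$, while every unseen class collapses into the single unknown placeholder with index $K$. The function name and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def remap_open_set_labels(labels, known_classes):
    """Map original dataset labels into an open-set label space.

    Known classes are re-indexed to 0..K-1; any other class is collapsed
    into the single 'unknown' placeholder with index K.
    """
    known_classes = sorted(known_classes)
    k = len(known_classes)
    index_of = {c: i for i, c in enumerate(known_classes)}
    return np.array([index_of.get(int(c), k) for c in labels])

# Example: with known classes {0,...,5}, labels 7 and 9 become the unknown label 6.
labels = np.array([0, 3, 5, 7, 9, 2])
print(remap_open_set_labels(labels, known_classes=[0, 1, 2, 3, 4, 5]))  # -> [0 3 5 6 6 2]
```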
Calibration refers to the problem of obtaining a model whose predictive probabilities reflect the true correctness likelihood. More formally, a model is well calibrated if outcomes predicted to occur with probability $p$ occur approximately a fraction $p$ of the time [14]. In robotics, there is a clear need for uncertainty calibration of policies (e.g., autonomous driving [15]), especially for OSR. Previous work has evaluated calibration methods on both in-distribution (ID) cases and under distributional shifts [16, 17]. The experimental settings in [16, 17] follow those of standard OOD detection works, where a model trained on the ID set is evaluated for its reliability in identifying test images as either ID or OOD [18]. Although dealing with similar problems, OOD detection and OSR have been studied separately with different evaluation protocols. To the best of our knowledge, the evaluation of calibration methods for the OSR problem (i.e., open-set calibration) has not been explored. Hence, our work serves as a step in this direction.
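As a worked illustration of this definition (our own sketch, not the paper's code), predictions can be grouped into confidence bins and the per-bin accuracy compared against the per-bin mean confidence; the expected calibration error (ECE) used in Sec. II is then the population-weighted average of the absolute gaps. The number of bins and the equal-width binning scheme are assumptions.

```python
import numpy as np

def calibration_bins(confidences, correct, n_bins=15):
    """Group predictions into equal-width confidence bins.

    Returns the ECE together with per-bin (mean confidence, accuracy, count)
    tuples, which are exactly the quantities shown in a reliability diagram.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, bin_stats = 0.0, []
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        conf = confidences[in_bin].mean()
        acc = correct[in_bin].mean()
        ece += in_bin.mean() * abs(acc - conf)
        bin_stats.append((conf, acc, int(in_bin.sum())))
    return ece, bin_stats
```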
II Open-Set Recognition Evaluation
The first step of our evaluation is to enhance a model to perform OSR. To do this, we create a baseline by thresholding on the maximum probability, which indicates the most likely predicted class. We also apply OpenMax [19], a well-known OSR method, for comparison. Following the standard experimental setup in OSR works, we use ResNet [20] as the backbone classification network and evaluate on the CIFAR10 benchmark dataset [21]. To test under OSR conditions, we follow the most common data partitioning protocol [22, 23, 24, 25, 26, 27]. Specifically, we split each dataset at random such that 6 classes are chosen to be known and the remaining 4 classes are unknown. We repeat the experiment over 5 runs and report the average score. Thus, the model is trained with 6 known classes and tested on all 10 classes. We map the randomly selected 6 known classes using indices 0-5 and set the other 4 unknown classes to start at index 6. A perfect OSR model should be able to assign correct labels to images belonging to any of the 6 known classes and identify images from all other classes as unknowns.
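The maximum-probability baseline can be sketched as follows (our own illustration; the rejection threshold of 0.5 and the function name are assumptions, and OpenMax is not reproduced here): a test sample receives the most likely known class unless its maximum softmax probability falls below the threshold, in which case it is assigned the unknown label.

```python
import numpy as np

def msp_open_set_predict(probs, threshold=0.5):
    """Open-set baseline from the maximum softmax probability (MSP).

    probs: (N, K) softmax probabilities over the K known classes.
    Returns labels in 0..K-1, or K (the unknown class) whenever the
    maximum probability does not exceed the rejection threshold.
    """
    probs = np.asarray(probs, dtype=float)
    k = probs.shape[1]
    preds = probs.argmax(axis=1)
    preds[probs.max(axis=1) < threshold] = k
    return preds

# Example with K = 6 known classes: the second sample is rejected as unknown.
probs = np.array([[0.70, 0.10, 0.05, 0.05, 0.05, 0.05],
                  [0.25, 0.20, 0.15, 0.15, 0.15, 0.10]])
print(msp_open_set_predict(probs))  # -> [0 6]
```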
In the second step of our evaluation, we calibrate the trained model on the validation set in order to improve calibration during inference. Calibration is performed using temperature scaling [3], a simple and effective technique that adjusts a model’s logits by dividing them by a single temperature parameter $T$. Logits are the unnormalized score vectors produced by a network before the softmax layer for a given input image. Additionally, $T$ is optimized on a held-out validation set and it rescales a model’s confidence without affecting accuracy. After acquiring an OSR model, we apply temperature scaling to the model’s predictive confidence and evaluate the calibration performance in this novel problem setting. Concretely, to apply temperature scaling we reserve 10% of the training data as a validation set for optimizing $T$ and use the remaining 90% for training the model. To evaluate model calibration, we use the expected calibration error (ECE) [14], which measures the absolute difference between predictive confidence and accuracy. As a result of its simplicity and effectiveness, ECE is the most commonly used metric for model calibration. We also evaluate the Brier score within the framework of OSR as follows. Let a trained classifier’s predictions be $\hat{p} \in \mathbb{R}^{N \times K}$, where $N$ is the number of samples and $K$ is the number of known classes. We then obtain new predictions $\hat{q} \in \mathbb{R}^{N \times (K+1)}$ that take the unknown class into account. For each prediction $\hat{p}_i$, we estimate the probability of being unknown, i.e.,
$$\hat{q}_{i,K+1} = 1 - \max_{k \in \{1,\dots,K\}} \hat{p}_{i,k}. \qquad (1)$$
Next, we calculate the Brier score by
$$\mathrm{Brier} = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{K+1} \left( \hat{q}_{i,k} - y_{i,k} \right)^2, \qquad (2)$$
where the predicted probability and ground truth of the $k$-th class for sample $i$ are $\hat{q}_{i,k}$ and $y_{i,k}$, respectively.
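A minimal sketch of how these two steps might be implemented is given below (ours, not the paper's code): the temperature $T$ is fit with L-BFGS on held-out validation logits, and the open-set Brier score follows Eqs. (1) and (2). The optimizer settings, the log-parameterization of $T$, the choice to keep the known-class entries of $\hat{q}$ equal to $\hat{p}$ when appending the unknown probability, and the function names are all assumptions.

```python
import numpy as np
import torch
import torch.nn.functional as F

def fit_temperature(val_logits, val_labels, max_iter=50):
    """Fit a single temperature T on held-out validation logits by minimizing
    the negative log-likelihood (standard temperature scaling)."""
    logits = torch.as_tensor(val_logits, dtype=torch.float32)
    labels = torch.as_tensor(val_labels, dtype=torch.long)
    log_t = torch.zeros(1, requires_grad=True)  # T = exp(log_t) stays positive
    optimizer = torch.optim.LBFGS([log_t], lr=0.1, max_iter=max_iter)

    def closure():
        optimizer.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    optimizer.step(closure)
    return log_t.exp().item()

def open_set_brier(probs, open_set_labels):
    """Brier score over K+1 classes, following Eqs. (1) and (2).

    probs: (N, K) calibrated softmax probabilities over the known classes.
    open_set_labels: ground-truth labels in 0..K, where index K is unknown.
    Assumption: the known-class entries of q are taken directly from p and
    the unknown entry is appended as 1 - max_k p_k, per Eq. (1).
    """
    probs = np.asarray(probs, dtype=float)
    n, k = probs.shape
    q = np.concatenate([probs, 1.0 - probs.max(axis=1, keepdims=True)], axis=1)
    y = np.eye(k + 1)[np.asarray(open_set_labels, dtype=int)]  # one-hot targets
    return float(np.mean(np.sum((q - y) ** 2, axis=1)))
```

At test time, calibrated probabilities would be obtained by dividing the logits by the fitted $T$ before the softmax; parameterizing $T$ through an exponential is only a convenience to keep it positive during optimization.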
III Evaluation Results
We visualize model calibration via reliability diagrams [28, 29], where the diagonal represents perfect calibration and any deviation from the diagonal indicates miscalibration. The top rows of Fig. 1 and Table I present the reliability diagrams and scores for closed-set classification before and after calibration by temperature scaling. We see that temperature scaling significantly improves the model’s ability to match the true correctness likelihood when calibrating the predictions for this conventional 10-class closed-set classification problem. The middle rows of Fig. 1 and Table I show the reliability diagrams and scores of a classifier operating in open-set conditions by thresholding on its predictive probabilities, before and after calibration by temperature scaling. In this scenario, temperature scaling still provides a clear improvement over the uncalibrated model. Nonetheless, clear gaps remain between the temperature-scaled confidence and perfectly calibrated confidence, indicating that the calibration problem under open-set conditions is much more challenging. The bottom rows of Fig. 1 and Table I show the reliability diagrams and scores of a classifier operating in open-set conditions with OpenMax, before and after calibration. Although OpenMax improves classification accuracy, it exhibits no obvious improvement in terms of calibration. In addition, we conducted these experiments with different classification networks, such as DenseNet [30] and EfficientNet [31], on the SVHN dataset [32] (see appendix). In all of these experiments, we observed the same findings as above.
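For completeness, a reliability diagram of the kind shown in Fig. 1 can be drawn directly from the per-bin statistics returned by the `calibration_bins` sketch above; the plotting details below (matplotlib, bar style, 15 bins) are our own choices, not the paper's.

```python
import matplotlib.pyplot as plt

def plot_reliability_diagram(bin_stats, n_bins=15, ax=None):
    """Plot per-bin accuracy against mean confidence; the dashed diagonal
    marks perfect calibration, and deviations from it indicate miscalibration."""
    if ax is None:
        ax = plt.gca()
    confs = [c for c, _, _ in bin_stats]
    accs = [a for _, a, _ in bin_stats]
    ax.plot([0, 1], [0, 1], linestyle="--", color="gray", label="Perfect calibration")
    ax.bar(confs, accs, width=1.0 / n_bins, alpha=0.7, edgecolor="black", label="Model")
    ax.set_xlabel("Confidence")
    ax.set_ylabel("Accuracy")
    ax.set_xlim(0, 1)
    ax.set_ylim(0, 1)
    ax.legend(loc="upper left")
    return ax
```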
Fig. 1: Reliability diagrams for closed-set classification (top row), open-set recognition by probability thresholding (middle row), and open-set recognition with OpenMax (bottom row), each before and after calibration by temperature scaling.
IV Conclusion
In this paper, we assessed the efficacy of calibration methods for OSR. From the evaluation results, we observe that the calibration problem is much more challenging under open-set conditions than under closed-set conditions. Techniques that calibrate confidence well for closed-set classification provide only limited calibration performance for OSR. Moreover, traditional OSR approaches improve accuracy, yet they do not take calibration into consideration. Given these observations, we want to bring the community’s attention not only to the pursuit of more accurate OSR methods, but also to the development of well-calibrated OSR systems that can perform reliably and safely in open-set settings.
Method | Measure | Brier | ECE | Accuracy |
---|---|---|---|---|
Closed-set | Before calibration | 0.255 | 0.069 | 0.832 |
Closed-set | After calibration | 0.244 | 0.011 | 0.832 |
Open-set | Before calibration | 0.814 | 0.346 | 0.603 |
Open-set | After calibration | 0.647 | 0.230 | 0.603 |
Open-set + OpenMax | Before calibration | 0.841 | 0.335 | 0.632 |
Open-set + OpenMax | After calibration | 0.703 | 0.197 | 0.632 |
References
- [1] M. Hein, M. Andriushchenko, and J. Bitterwolf, “Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 41–50.
- [2] N. Sünderhauf, O. Brock, W. Scheirer, R. Hadsell, D. Fox, J. Leitner, B. Upcroft, P. Abbeel, W. Burgard, M. Milford, and P. Corke, “The limits and potentials of deep learning for robotics,” The International Journal of Robotics Research, vol. 37, no. 4-5, pp. 405–420, 2018.
- [3] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger, “On calibration of modern neural networks,” in Proceedings of the International Conference on Machine Learning (ICML), 2017, pp. 1321–1330.
- [4] Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in Proceedings of the International Conference on Machine Learning (ICML), 2016, pp. 1050–1059.
- [5] B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), 2017, pp. 6402–6413.
- [6] S. Zhao, M. Kim, R. Sahoo, T. Ma, and S. Ermon, “Calibrating predictions to decisions: A novel approach to multi-class calibration,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), vol. 34, 2021.
- [7] A. Kendall, V. Badrinarayanan, and R. Cipolla, “Bayesian SegNet: Model uncertainty in deep convolutional encoder-decoder architectures for scene understanding,” in Proceedings of the British Machine Vision Conference (BMVC), 2017.
- [8] F. Kuppers, J. Kronenberger, A. Shantia, and A. Haselhoff, “Multivariate confidence calibration for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2020, pp. 326–327.
- [9] Z. Lyu, N. B. Gutierrez, and W. J. Beksi, “An uncertainty estimation framework for probabilistic object detection,” in Proceedings of the IEEE International Conference on Automation Science and Engineering (CASE), 2021, pp. 1441–1446.
- [10] W. J. Scheirer, A. de Rezende Rocha, A. Sapkota, and T. E. Boult, “Toward open set recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 7, pp. 1757–1772, 2012.
- [11] N. Sünderhauf, F. Dayoub, S. McMahon, B. Talbot, R. Schulz, P. Corke, G. Wyeth, B. Upcroft, and M. Milford, “Place categorization and semantic mapping on a mobile robot,” in IEEE International Conference on Robotics and Automation (ICRA), 2016, pp. 5729–5736.
- [12] M. S. Ramanagopal, C. Anderson, R. Vasudevan, and M. Johnson-Roberson, “Failing to learn: Autonomously identifying perception failures for self-driving cars,” IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 3860–3867, 2018.
- [13] P. R. Vieira, P. D. Félix, and L. Macedo, “Open-world active learning with stacking ensemble for self-driving cars,” arXiv preprint arXiv:2109.06628, 2021.
- [14] M. P. Naeini, G. Cooper, and M. Hauskrecht, “Obtaining well calibrated probabilities using Bayesian binning,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2015.
- [15] F. Nozarian, C. Müller, and P. Slusallek, “Uncertainty quantification and calibration of imitation learning policy in autonomous driving,” in International Workshop on the Foundations of Trustworthy AI Integrating Learning, Optimization and Reasoning, 2020, pp. 146–162.
- [16] Y. Ovadia, E. Fertig, J. Ren, Z. Nado, D. Sculley, S. Nowozin, J. Dillon, B. Lakshminarayanan, and J. Snoek, “Can you trust your model’s uncertainty? evaluating predictive uncertainty under dataset shift,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), vol. 32, 2019.
- [17] M. Minderer, J. Djolonga, R. Romijnders, F. Hubis, X. Zhai, N. Houlsby, D. Tran, and M. Lucic, “Revisiting the calibration of modern neural networks,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS), vol. 34, 2021.
- [18] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” in Proceedings of the International Conference on Learning Representations (ICLR), 2016.
- [19] A. Bendale and T. E. Boult, “Towards open set deep networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1563–1572.
- [20] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
- [21] A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” University of Toronto, Toronto, Tech. Rep., 2009.
- [22] L. Neal, M. Olson, X. Fern, W.-K. Wong, and F. Li, “Open set learning with counterfactual images,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 613–628.
- [23] R. Yoshihashi, W. Shao, R. Kawakami, S. You, M. Iida, and T. Naemura, “Classification-reconstruction learning for open-set recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 4016–4025.
- [24] P. Oza and V. M. Patel, “C2AE: Class conditioned auto-encoder for open-set recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 2307–2316.
- [25] P. Perera, V. I. Morariu, R. Jain, V. Manjunatha, C. Wigington, V. Ordonez, and V. M. Patel, “Generative-discriminative feature representations for open-set recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 11 814–11 823.
- [26] H. Zhang, A. Li, J. Guo, and Y. Guo, “Hybrid models for open set recognition,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer, 2020, pp. 102–117.
- [27] D.-W. Zhou, H.-J. Ye, and D.-C. Zhan, “Learning placeholders for open-set recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 4401–4410.
- [28] M. H. DeGroot and S. E. Fienberg, “The comparison and evaluation of forecasters,” Journal of the Royal Statistical Society: Series D (The Statistician), vol. 32, no. 1-2, pp. 12–22, 1983.
- [29] A. Niculescu-Mizil and R. Caruana, “Predicting good probabilities with supervised learning,” in Proceedings of the International Conference on Machine Learning (ICML), 2005, pp. 625–632.
- [30] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4700–4708.
- [31] M. Tan and Q. Le, “EfficientNet: Rethinking model scaling for convolutional neural networks,” in Proceedings of the International Conference on Machine Learning (ICML), 2019, pp. 6105–6114.
- [32] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng, “Reading digits in natural images with unsupervised feature learning,” in Proceedings of the Conference on Neural Information Processing Systems (NeurIPS) Workshops, 2011.
Method | Measure | Brier | ECE | Accuracy |
---|---|---|---|---|
Closed-set | Before calibration | 0.085 | 0.009 | 0.943 |
Closed-set | After calibration | 0.084 | 0.006 | 0.943 |
Open-set | Before calibration | 0.434 | 0.220 | 0.793 |
Open-set | After calibration | 0.406 | 0.204 | 0.793 |
Open-set + OpenMax | Before calibration | 0.496 | 0.224 | 0.811 |
Open-set + OpenMax | After calibration | 0.489 | 0.218 | 0.811 |

Method | Measure | Brier | ECE | Accuracy |
---|---|---|---|---|
Closed-set | Before calibration | 0.246 | 0.070 | 0.835 |
Closed-set | After calibration | 0.234 | 0.012 | 0.835 |
Open-set | Before calibration | 0.788 | 0.357 | 0.645 |
Open-set | After calibration | 0.643 | 0.269 | 0.645 |
Open-set + OpenMax | Before calibration | 0.861 | 0.369 | 0.666 |
Open-set + OpenMax | After calibration | 0.755 | 0.268 | 0.666 |

Method | Measure | Brier | ECE | Accuracy |
---|---|---|---|---|
Closed-set | Before calibration | 0.091 | 0.008 | 0.939 |
Closed-set | After calibration | 0.090 | 0.006 | 0.939 |
Open-set | Before calibration | 0.447 | 0.229 | 0.816 |
Open-set | After calibration | 0.376 | 0.178 | 0.816 |
Open-set + OpenMax | Before calibration | 0.507 | 0.231 | 0.822 |
Open-set + OpenMax | After calibration | 0.483 | 0.209 | 0.822 |

Method | Measure | Brier | ECE | Accuracy |
---|---|---|---|---|
Closed-set | Before calibration | 0.233 | 0.032 | 0.839 |
Closed-set | After calibration | 0.230 | 0.012 | 0.839 |
Open-set | Before calibration | 0.767 | 0.373 | 0.633 |
Open-set | After calibration | 0.665 | 0.275 | 0.633 |
Open-set + OpenMax | Before calibration | 0.841 | 0.334 | 0.636 |
Open-set + OpenMax | After calibration | 0.719 | 0.151 | 0.636 |

Method | Measure | Brier | ECE | Accuracy |
---|---|---|---|---|
Closed-set | Before calibration | 0.101 | 0.010 | 0.932 |
Closed-set | After calibration | 0.100 | 0.004 | 0.932 |
Open-set | Before calibration | 0.394 | 0.207 | 0.845 |
Open-set | After calibration | 0.377 | 0.197 | 0.845 |
Open-set + OpenMax | Before calibration | 0.466 | 0.209 | 0.857 |
Open-set + OpenMax | After calibration | 0.450 | 0.193 | 0.857 |