
Soft Labeling Affects Out-of-Distribution Detection of Deep Neural Networks

Doyup Lee    Yeongjae Cheon
Abstract

Soft labeling has become a common output regularization for generalization and model compression of deep neural networks. However, the effect of soft labeling on out-of-distribution (OOD) detection, an important topic in machine learning safety, has not been explored. In this study, we show that soft labeling can determine OOD detection performance. Specifically, how the outputs of incorrect classes are regularized by soft labeling can deteriorate or improve OOD detection. Based on the empirical results, we postulate a direction for future work on OOD-robust DNNs: a proper output regularization by soft labeling can construct OOD-robust DNNs without additional training on OOD samples or modifying the models, while improving classification accuracy.


1 Introduction

Out-of-distribution (OOD) detection has been an important topic for deep learning applications since deep neural networks (DNNs) were shown to be over-confident on abnormal samples that are unrecognizable (Nguyen et al., 2015) or from out-of-distribution (Hendrycks & Gimpel, 2017). OOD detection is closely related to safety, because there is no control over test samples after DNNs are deployed in the real world.

To prevent DNNs from making over-confident predictions on OOD samples, post-training with outlier samples is commonly used. Fine-tuning with a few selected OOD samples (Hendrycks et al., 2019) or adversarial noise (Hein et al., 2019) can improve the detection of unseen OOD samples.

Meanwhile, soft labeling has become a common output regularization trick for training DNNs for various purposes. For example, label smoothing (Szegedy et al., 2016) improves the test accuracy of DNNs by preventing overfitting (He et al., 2019; Müller et al., 2019). Knowledge distillation (Hinton et al., 2015), a kind of soft labeling (Yuan et al., 2019), can compress the size of a teacher model or improve the accuracy of its student networks (Xie et al., 2019).

Despite the popularity of soft labeling, how soft labeling affects OOD detection of DNNs has not been explored. In this study, we assume that regularizing predictions on incorrect classes by soft labeling determines the OOD detection performance of DNNs. We analyze and empirically verify this assumption based on two major results: a) label smoothing deteriorates OOD detection of DNNs, and b) soft labels generated by a teacher model distill OOD detection performance into its student models. In particular, the degraded test accuracy of a teacher model with outlier exposure is recovered or improved in its student models, while the high OOD detection performance is conserved.

Based on the empirical results, we claim that a "lottery ticket" of soft labeling for OOD-robust DNNs exists, and that how to regularize the predictions of DNNs on incorrect classes is a compelling direction of future work for generalizing DNNs not only to unseen in-distribution (ID) samples, but also to OOD samples.

2 Preliminaries

2.1 Outlier Exposure

Outlier exposure (Hendrycks et al., 2019) fine-tunes a model on a set of OOD samples so that it predicts the uniform distribution for them, minimizing

\mathcal{H}(q,p_{i}) + \lambda\,\mathcal{H}(\mathcal{U}(K),p_{o}), \qquad (1)

where $p_{i}$ is the prediction for an ID sample, $p_{o}$ is the prediction for an OOD sample, $q$ is the one-hot ground truth, $\lambda$ is a hyper-parameter, $\mathcal{H}$ is the cross-entropy, and $\mathcal{U}(K)$ is the uniform distribution over all $K$ classes.
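
For concreteness, the following PyTorch-style sketch shows one way Eq (1) can be computed on a batch; the function and variable names are our illustrative assumptions, not the official implementation.

    import torch.nn.functional as F

    def outlier_exposure_loss(logits_id, targets_id, logits_ood, lam=0.5):
        # Standard cross-entropy H(q, p_i) on in-distribution samples.
        ce_id = F.cross_entropy(logits_id, targets_id)
        # H(U(K), p_o) = -(1/K) * sum_k log p_o(k): the mean negative
        # log-probability over all K classes for each OOD sample.
        log_probs_ood = F.log_softmax(logits_ood, dim=1)
        ce_ood = -log_probs_ood.mean(dim=1).mean()
        return ce_id + lam * ce_ood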

Despite significantly improving OOD detection, training with additional OOD samples has two drawbacks. First, the original test accuracy is often degraded after outlier exposure, as a trade-off between OOD detection and the original task. Second, we cannot cover all possible OOD samples in training, because there are infinitely many of them.

In this study, we show that soft labeling prevents the degradation of classification accuracy and often improves test accuracy (Table 1). In addition, we show that soft labeling can make DNNs robust to OOD without any OOD training samples or model modification (Figure 2).

2.2 Soft Labeling as an Output Regularization

Given the one-hot ground truth $q$ of a training sample $x$, soft labeling is defined as

\tilde{q} = (1-\alpha)\,q + \alpha\,q^{\prime}, \qquad (2)

where $\alpha$ is a hyper-parameter for soft labeling, $q^{\prime}\in[0,1]^{K}$ is a soft target that satisfies $\mathrm{argmax}(\tilde{q})=\mathrm{argmax}(q)$ and $\sum_{i=1}^{K}\tilde{q}_{i}=1$, and $K$ is the number of classes. Then, the training loss with soft labeling is

\mathcal{H}(\tilde{q},p) = (1-\alpha)\,\mathcal{H}(q,p) + \alpha\,\mathcal{H}(q^{\prime},p). \qquad (3)

Note that soft labeling regularizes the predictions of all classes, including the incorrect ones.
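
A minimal sketch of the soft-label loss in Eq (2)-(3), again assuming PyTorch; the names are illustrative, and q_prime is any soft target satisfying the conditions above (e.g., a uniform vector recovers label smoothing).

    import torch.nn.functional as F

    def soft_label_cross_entropy(logits, q_onehot, q_prime, alpha=0.1):
        # Eq. (2): mix the one-hot label with the soft target q'.
        q_tilde = (1.0 - alpha) * q_onehot + alpha * q_prime
        # Eq. (3): cross-entropy between the soft label and the prediction.
        log_probs = F.log_softmax(logits, dim=1)
        return -(q_tilde * log_probs).sum(dim=1).mean()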

The training objectives of both label smoothing (Szegedy et al., 2016) and knowledge distillation (Hinton et al., 2015) can be represented by Eq (3) (Yuan et al., 2019). Label smoothing (Szegedy et al., 2016) is a soft labeling that regularizes DNNs toward the uniform distribution $\mathcal{U}(K)$ over all $K$ classes:

(1-\alpha)\,\mathcal{H}(q,p) + \alpha\,\mathcal{H}(\mathcal{U}(K),p). \qquad (4)

In knowledge distillation, the soft target $q^{\prime}$ of a student model is the prediction of its teacher model:

(1-\alpha)\,\mathcal{H}(q,p) + \alpha\,\mathcal{H}(p_{t},p), \qquad (5)

where $p_{t}$ is the prediction of the teacher model. Knowledge distillation is a kind of output regularization of student models by the teacher's predictions (Yuan et al., 2019), used for model compression (Hinton et al., 2015) or generalization (Xie et al., 2019).
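
Under the same assumptions, a sketch of the distillation objective in Eq (5); as in the formulation above, no temperature scaling is shown, and the names are illustrative.

    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, targets, alpha=0.5):
        # Hard-label term (1 - alpha) * H(q, p).
        ce_hard = F.cross_entropy(student_logits, targets)
        # Soft-label term alpha * H(p_t, p), with the teacher detached
        # so that no gradient flows into it.
        p_teacher = F.softmax(teacher_logits, dim=1).detach()
        log_p_student = F.log_softmax(student_logits, dim=1)
        ce_soft = -(p_teacher * log_p_student).sum(dim=1).mean()
        return (1.0 - alpha) * ce_hard + alpha * ce_soft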

2.3 Experimental Setting

In this paper, we train WRN-40-2 (Zagoruyko & Komodakis, 2016) on the SVHN, CIFAR-10, and CIFAR-100 datasets (ID). We follow the experimental setting in the official code of outlier exposure (https://github.com/hendrycks/outlier-exposure), except that we train for 150 epochs. In addition, we follow the hyper-parameter settings of knowledge distillation in (Müller et al., 2019). For the evaluation of OOD detection, we use the MNIST, Fashion-MNIST, SVHN (or CIFAR-10), LSUN, and TinyImageNet datasets as OOD samples, and AUROC as the evaluation measure.
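
For reference, a sketch of the detection protocol, assuming the maximum softmax probability is used as the detection score as in the baseline of Hendrycks & Gimpel (2017); the loader and function names are our illustrative assumptions.

    import numpy as np
    import torch
    import torch.nn.functional as F
    from sklearn.metrics import roc_auc_score

    @torch.no_grad()
    def msp_scores(model, loader, device="cuda"):
        # Maximum softmax probability per sample, used as the ID score.
        scores = []
        for x, _ in loader:
            probs = F.softmax(model(x.to(device)), dim=1)
            scores.append(probs.max(dim=1).values.cpu().numpy())
        return np.concatenate(scores)

    def ood_auroc(model, id_loader, ood_loader):
        # AUROC of separating ID (label 1) from OOD (label 0) by the score.
        s_id, s_ood = msp_scores(model, id_loader), msp_scores(model, ood_loader)
        labels = np.concatenate([np.ones_like(s_id), np.zeros_like(s_ood)])
        return roc_auc_score(labels, np.concatenate([s_id, s_ood]))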

Table 1: Test accuracy and expected calibration error (ECE) of WideResNet (Baseline) trained on SVHN, CIFAR-10, and CIFAR-100. TinyImageNet is used as the training OOD dataset for OE (outlier exposure). OD (outlier distillation) denotes the student model of the OE model.

ID Dataset   Metric   Baseline   +OE     +OD
SVHN         Acc      97.02      96.82   97.17
             ECE       2.38       2.65    2.28
CIFAR-10     Acc      95.12      94.74   95.10
             ECE       3.85       4.07    3.49
CIFAR-100    Acc      76.63      75.58   76.80
             ECE      12.06      14.79   10.61

3 Soft Labeling Affects OOD Detection

Table 2: OOD detection performance of outlier exposure and outlier distillation. WRN-OE, which is fine-tuned with TinyImageNet as OOD, is used as the teacher model for the two student models (WRN and DenseNet).

AUROC            WRN-OE   → WRN    → DenseNet
MNIST            84.28    90.95    93.30
Fashion-MNIST    95.16    96.03    96.61
SVHN             94.19    94.50    94.19
LSUN             99.99    99.94    99.94
TinyImageNet     99.99    99.78    99.83
Figure 1: Test accuracy and expected calibration error (top) and OOD detection AUROC (bottom) of WRN, trained on SVHN (left), CIFAR-10 (middle), and CIFAR-100 (right), respectively. The red dotted line marks the label smoothing $\alpha$ that minimizes ECE. OOD detection continuously deteriorates as the label smoothing $\alpha$ increases. When ECE starts to increase (after the red dotted line), dramatic AUROC drops are shown for the SVHN (ID) and CIFAR-10 (ID) training datasets.
Figure 2: OOD detection AUROCs of teacher models and their student models. (Top) Teacher models trained on the SVHN, CIFAR-10, and CIFAR-100 datasets and their student models. (Bottom) Teacher models of CIFAR-10 fine-tuned with MNIST, TinyImageNet, and MNIST+TinyImageNet by outlier exposure. WRN-40-2 is used as the architecture of both teacher and student models.

3.1 Label Smoothing and OOD Detection

Figure 1 shows the effects of label smoothing with different $\alpha$ on test accuracy, expected calibration error (ECE) (Guo et al., 2017), and detection of the OOD datasets. As shown in (Lukasik et al., 2020), ECE starts to increase when the label smoothing $\alpha$ is larger than the optimal value (red dotted lines). Although test accuracy on CIFAR-10 deteriorates for $\alpha=0.001$ and $0.1$, test accuracy is always improved when ECE is minimized (Müller et al., 2019).

Even though label smoothing improves test accuracy and ECE (Müller et al., 2019), it makes DNNs vulnerable to out-of-distribution samples and unable to distinguish ID from OOD datasets. Label smoothing always deteriorates OOD detection regardless of the magnitude of $\alpha$, and a larger $\alpha$ results in more degradation of OOD detection. In particular, the WRN models trained on SVHN and CIFAR-10 show significant AUROC drops when ECE starts to increase (after the red dotted line).

We can infer why label smoothing hurts OOD detection of DNNs from two perspectives. First, combining Eq (1) and (4), we can interpret the output regularization of label smoothing as outlier exposure applied to the ID samples themselves. Label smoothing can therefore deteriorate OOD detection, making DNNs unable to discriminate OOD samples from ID samples. As the magnitude of $\alpha$ increases, the effect of the output regularization in Eq (4) grows and degrades OOD detection further, just as outlier exposure on the ID dataset would.
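
To make this correspondence explicit, the following rewriting (ours, using only the definitions above) shows that Eq (4) takes the form of Eq (1) when the ID prediction $p$ plays the role of both $p_i$ and $p_o$:

    % Dividing Eq. (4) by (1 - \alpha), assuming 0 < \alpha < 1:
    (1-\alpha)\,\mathcal{H}(q,p) + \alpha\,\mathcal{H}(\mathcal{U}(K),p)
      \;\propto\; \mathcal{H}(q,p) + \frac{\alpha}{1-\alpha}\,\mathcal{H}(\mathcal{U}(K),p),
    % which is Eq. (1) with p_i = p_o = p and \lambda = \alpha/(1-\alpha):
    % every ID sample is simultaneously treated as its own outlier.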

Meanwhile, knowledge distillation offers a second view on the negative effect of label smoothing on OOD detection. Note that label smoothing is equivalent to knowledge distillation with a teacher model that has perfectly learned the ID samples as OOD and predicts the uniform distribution for all ID samples. Thus, we assume that the soft labels of incorrect classes, generated by a teacher model, determine the OOD detection performance of its student model, and we empirically verify this assumption in Section 3.2.

3.2 Knowledge Distillation and OOD Detection

In this section, we show that OOD detection performance is determined by the soft labels. Specifically, the soft labels generated by a teacher model determine the performance of its student model. Figure 2 shows the OOD detection performance of teacher models and their student models in various settings. For the student model, we use the same architecture (WRN-40-2) as its teacher, because our concern is to analyze the effects of soft labeling, not model compression.

Figure 2 (top) shows OOD detection AUROCs of the WRN-40-2 models trained on SVHN, CIFAR-10, and CIFAR-100, and of their student models. A teacher and its student model have similar AUROCs regardless of the test (OOD) dataset.

In Figure 2 (bottom), we fine-tune the teacher models with various OOD samples (MNIST, TinyImageNet, and MNIST+TinyImageNet) to improve OOD detection by outlier exposure. OOD detection of the teacher models improves on different OOD datasets, depending on the exposed OOD samples. We find that the OOD detection performance of student models is always consistent with their teacher models (OE), regardless of the choice of OOD samples used to train the teacher.

In particular, when we use MNIST+TinyImageNet for outlier exposure of the teacher model, both the teacher and its student detect the test OOD samples almost perfectly. Exposing various OOD samples at training time is an unrealistic setting, because there are infinitely many kinds of OOD. However, the result is worth noting, because the student model is trained only on ID samples with soft labels, and no OOD sample is directly used to train the student model. Although how to generate such soft labels without a perfectly OOD-robust teacher remains an open question, the result shows that there exists a soft labeling that makes DNNs robust to various OOD datasets without OOD training.

OOD detection performance is also distilled into a student model whose architecture differs from its teacher's. In Table 2, we use DenseNet (Huang et al., 2017) with 40 layers and growth rate 12 as the student model of WRN-40-2. Note that the number of trainable parameters of DenseNet (1.1M) is half that of WRN-40-2 (2.2M). Even though the size and architecture of the student differ from those of its teacher, the OOD detection AUROCs of the teacher and student are consistent. The results imply that the effect of soft labeling on OOD detection is model-agnostic: if we find a soft labeling method for OOD-robust DNNs, it can be used across various DNN architectures.

Orthogonal to OOD detection, one disadvantage of post-training with OOD samples is the degradation of original classification accuracy (Hendrycks et al., 2019; Hein et al., 2019). However, we find that both test accuracy and ECE of the student models (+OD) are similar to or better than those of the original model before outlier exposure (Baseline) in Table 1. The improvement of test accuracy results from soft labeling, because soft labels help the model avoid overfitting regardless of the type of soft labeling (Yuan et al., 2019).

4 Discussion

In this study, we show that soft labeling of incorrect classes is closely linked to OOD detection of DNNs. Note that the student models in Figure 2 do not use any OOD samples, yet can achieve almost perfect OOD detection AUROCs. The results verify that constructing OOD-robust DNNs is possible without modifying the model or post-training on OOD samples.

A limitation of our study is that the soft labeling that yields OOD-robust DNNs is not yet identified and remains an open question. Here, we focus on showing the existence of such a soft labeling.

We postulate that finding an output regularization of incorrect classes that makes DNNs robust to unseen OOD samples is possible and worth exploring in future work. Note that a proper soft labeling can improve not only OOD detection, but also the classification accuracy on unseen ID samples and confidence calibration (Table 1). In addition, such OOD-robust soft labeling is model-agnostic and can be applied generally to various model architectures.

References

  • Guo et al. (2017) Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1321–1330. JMLR.org, 2017.
  • He et al. (2019) He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., and Li, M. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.  558–567, 2019.
  • Hein et al. (2019) Hein, M., Andriushchenko, M., and Bitterwolf, J. Why ReLU networks yield high-confidence predictions far away from the training data and how to mitigate the problem. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 41–50, 2019.
  • Hendrycks & Gimpel (2017) Hendrycks, D. and Gimpel, K. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017.
  • Hendrycks et al. (2019) Hendrycks, D., Mazeika, M., and Dietterich, T. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations, 2019.
  • Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  • Huang et al. (2017) Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  4700–4708, 2017.
  • Lukasik et al. (2020) Lukasik, M., Bhojanapalli, S., Menon, A. K., and Kumar, S. Does label smoothing mitigate label noise? arXiv preprint arXiv:2003.02819, 2020.
  • Müller et al. (2019) Müller, R., Kornblith, S., and Hinton, G. E. When does label smoothing help? In Advances in Neural Information Processing Systems, pp. 4696–4705, 2019.
  • Nguyen et al. (2015) Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  427–436, 2015.
  • Szegedy et al. (2016) Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., and Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  2818–2826, 2016.
  • Xie et al. (2019) Xie, Q., Hovy, E., Luong, M.-T., and Le, Q. V. Self-training with noisy student improves ImageNet classification. arXiv preprint arXiv:1911.04252, 2019.
  • Yuan et al. (2019) Yuan, L., Tay, F. E., Li, G., Wang, T., and Feng, J. Revisit knowledge distillation: a teacher-free framework. arXiv preprint arXiv:1909.11723, 2019.
  • Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.