
Scaling for Training Time and Post-hoc Out-of-distribution Detection Enhancement

Kai Xu1, Rongyu Chen1, Gianni Franchi2, Angela Yao1
1National University of Singapore
2U2IS, ENSTA Paris, Institut polytechnique de Paris
{kxu,rchen,ayao}@comp.nus.edu.sg, [email protected]
Abstract

The capacity of a modern deep learning system to determine if a sample falls within its realm of knowledge is fundamental and important. In this paper, we offer insights and analyses of a recent state-of-the-art out-of-distribution (OOD) detection method, extremely simple activation shaping (ASH). We demonstrate that activation pruning has a detrimental effect on OOD detection, while activation scaling enhances it. Moreover, we propose SCALE, a simple yet effective post-hoc network enhancement method for OOD detection, which attains state-of-the-art OOD detection performance without compromising in-distribution (ID) accuracy. By integrating scaling concepts into the training process to capture a sample's ID characteristics, we further propose Intermediate tensor SHaping (ISH), a lightweight method for training-time OOD detection enhancement. We achieve AUROC gains of +1.85% for near-OOD and +0.74% for far-OOD datasets on the OpenOOD v1.5 ImageNet-1K benchmark. Our code and models are available at https://github.com/kai422/SCALE.

1 Introduction

In deep neural networks, out-of-distribution (OOD) detection distinguishes samples which deviate from the training distribution. Standard OOD detection concerns semantic shifts (Yang et al., 2022; Zhang et al., 2023), where OOD data is defined as test samples from semantic categories unseen during training. Ideally, the neural network should be able to reject such samples as being OOD, while still maintaining strong performance on in-distribution (ID) test samples belonging to seen training categories.

Methods for detecting OOD samples work by scoring network outputs such as logits or softmax values (Hendrycks & Gimpel, 2017; Hendrycks et al., 2022), by adjusting the network post hoc during inference to improve OOD scoring (Sun & Li, 2022; Sun et al., 2021; Djurisic et al., 2023), or by modifying model training (Wei et al., 2022; Ming et al., 2023; DeVries & Taylor, 2018). These approaches can be used either independently or in conjunction with one another. Typically, post-hoc adjustment together with OOD scoring is the preferred combination, since it is highly effective at discerning OOD samples with minimal ID accuracy drop and can be applied directly to already-trained, off-the-shelf models. Examples include ReAct (Sun et al., 2021), DICE (Sun & Li, 2022) and, more recently, ASH (Djurisic et al., 2023).

On the surface, each method takes a different and sometimes even contradictory approach. ReAct rectifies penultimate activations which exceed a threshold; ASH, on the other hand, prunes penultimate activations that are too low while amplifying the remaining activations. While ASH currently achieves state-of-the-art performance, it lacks a robust explanation of its underlying operational principles. This limitation highlights the need for a comprehensive explanatory framework.

This work seeks to understand the working principles behind ASH. Through observations and mathematical derivations, we reveal that OOD datasets tend to exhibit a lower rate of pruning due to distinct mean and variance characteristics. We also demonstrate the significant role of scaling in enhancing OOD detection in ASH, while highlighting that the lower-part pruning approach, in contrast to ReAct, hinders OOD detection. This understanding leads to new state-of-the-art results by leveraging scaling, achieving significant improvements without compromising ID accuracy.

Through the lens of studying the distributions, we highlight the importance of scaling as a key measure of a sample's ID nature. We integrate this concept into the training process, hypothesizing that an ID-ness objective can be shaped even without access to OOD samples. The ID-ness objective introduces a per-sample optimization weighting factor, realized through our proposed intermediate tensor shaping (ISH). Remarkably, ISH achieves outstanding performance on both near-OOD and far-OOD detection tasks, with only one-third of the training effort of current state-of-the-art approaches.

Our contributions can be summarized as follows:

  • We analyze and explain the working principles of pruning and scaling for OOD detection and reveal that pruning, in some scenarios, actually hurts OOD detection.

  • Based on our analysis, we devise SCALE, a new post-hoc network enhancement method for OOD detection, which achieves state-of-the-art results on OOD detection without any ID accuracy trade-off.

  • By incorporating scaling concepts into the training process to capture a sample’s ID characteristics, we introduce ISH, a lightweight and innovative method for improving OOD detection during training. ISH yields remarkable OOD detection results.

Figure 1: ID-OOD trade-off on ImageNet near-OOD datasets. Unlike existing methods such as ASH, ReAct and DICE, our proposed SCALE improves OOD detection without any ID accuracy trade-off. Our training method, ISH, achieves outstanding OOD results by emphasizing the training of samples with high ID characteristics.

2 Related Work

OOD scoring methods indicate how likely a sample comes from the training distribution, i.e. is in-distribution, based on sample features or model outputs. From a feature perspective,  Lee et al. (2018) proposed to score a sample via the minimum Mahalanobis distance of that sample’s features to the nearest ID class centroid. For model outputs, two common variants are based on the maximum softmax prediction (Hendrycks & Gimpel, 2017) and the maximum logit scores (Hendrycks et al., 2022). The raw softmax or logit scores are susceptible to the overconfidence issue, therefore, Liu et al. (2020) proposed to use an energy-based function to transform the logits as an improved score. A key benefit of deriving OOD scores from feature or model outputs is that it does not impact the model or the inference procedure, so the ID accuracy will not be affected.

Post-hoc model enhancement methods modify the inference procedure to improve OOD detection and are often used together with OOD scoring methods. Examples include ReAct (Sun et al., 2021), which rectifies the penultimate activations for inference, DICE (Sun & Li, 2022), which sparsifies the network's weights in the last layer, and ASH (Djurisic et al., 2023), which scales and prunes the penultimate activations. Each of these methods is then combined with the energy-based score (Liu et al., 2020) to detect OOD data. While effective at identifying OOD data, these methods reduce ID accuracy because the inference procedure is altered. Our proposed SCALE is also a post-hoc model enhancement method, but its ID accuracy is unaffected: it applies a different scaling factor based on each sample's activation shape, which does not alter the prediction for an individual sample but emphasizes differences among samples.

Training-time model enhancement techniques aim to make OOD data more distinguishable directly at training time. Strategies include incorporating additional network branches (DeVries & Taylor, 2018), alternative training strategies (Wei et al., 2022), and data augmentation (Pinto et al., 2022; Hendrycks et al., 2020). The underlying assumption behind each of these techniques is that training towards an OOD detection objective provides more discriminative features for OOD detection. A significant drawback of training-time enhancement is the additional computational cost. For example, AugMix (Hendrycks et al., 2020) requires double the training time and extra GPU memory. Our intermediate tensor shaping (ISH) improves OOD detection with one-third of the computational cost of the most lightweight alternative, without modifying the model architecture.

Intermediate tensor shaping: Activation shaping has been explored in deep learning for various purposes. Dropout was an early use of this idea, sparsifying activations for regularization, and similar ideas have been applied to transformers (Li et al., 2023). Activation shaping can also enable efficient training and inference through compression (Kurtz et al., 2020; Chen et al., 2023b). Shaping operations on intermediate tensors differ from those on activations: activation shaping affects both forward-pass inference and backward gradient computation during training, whereas shaping intermediate tensors influences only the backward gradient computation. Since intermediate tensors tend to consume a significant portion of GPU memory, techniques for compressing them have gained widespread use in memory-efficient training, all without altering the forward pass (Evans & Aamodt, 2021; Liu et al., 2022; Chen et al., 2023a).

3 Activation Scaling for Post-hoc Model Enhancement

We start by presenting the preliminaries of out-of-distribution (OOD) detection in Sec. 3.1 to set the stage for our discussion and analysis of the ASH method in Sec. 3.2. The results of our analysis directly lead to our own OOD criterion in Sec. 3.3. Finally, we introduce our intermediate tensor shaping approach for training-time OOD detection enhancement in Sec. 3.4.

3.1 Preliminaries

While OOD is relevant for many domains, we follow previous works (Yang et al., 2022) and focus specifically on semantic shifts in image classification. During training, the classification model is trained with ID data that fall into a pre-defined set of $K$ semantic categories: $\forall({\bm{x}},y)\sim\mathcal{D}_{\text{ID}},\ y\in\mathcal{Y}_{\text{ID}}$. During inference, there are both ID and OOD samples; the latter are samples drawn from categories unobserved during training, i.e. $\forall({\bm{x}},y)\sim\mathcal{D}_{\text{OOD}},\ y\notin\mathcal{Y}_{\text{ID}}$.

Now consider a neural network consisting of two parts: a feature extractor $f(\cdot)$, and a linear classifier parameterized by a weight matrix $\mathbf{W}\in\mathbb{R}^{K\times D}$ and a bias vector ${\bm{b}}\in\mathbb{R}^{K}$. The network logit can be mathematically represented as

$${\bm{z}}=\mathbf{W}\cdot{\bm{a}}+{\bm{b}},\qquad{\bm{a}}=f({\bm{x}}),\tag{1}$$

where ${\bm{a}}\in\mathbb{R}^{D}$ is the $D$-dimensional feature vector in the penultimate layer of the network and ${\bm{z}}\in\mathbb{R}^{K}$ is the logit vector from which the class label can be estimated by $\hat{y}=\operatorname*{arg\,max}({\bm{z}})$. In line with other OOD literature (Sun et al., 2021), an individual dimension of the feature ${\bm{a}}$, denoted with index $j$ as ${\bm{a}}_{j}$, is referred to as an "activation".

For a given test sample ${\bm{x}}$, an OOD score can be calculated to indicate the confidence that ${\bm{x}}$ is in-distribution. By convention, scores above a threshold $\tau$ are ID, while those equal to or below it are considered OOD. A common setting is the energy-based OOD score $S_{\textit{EBO}}({\bm{x}})$ together with an indicator function $G(\cdot)$ that applies the thresholding (Liu et al., 2020):

$$G({\bm{x}};\tau)=\begin{cases}0&\text{if }S_{\textit{EBO}}({\bm{x}})\leq\tau\quad(\textit{OOD}),\\ 1&\text{if }S_{\textit{EBO}}({\bm{x}})>\tau\quad(\textit{ID}),\end{cases}\qquad S_{\textit{EBO}}({\bm{x}})=T\cdot\log\sum_{k}^{K}e^{{\bm{z}}_{k}/T},\tag{2}$$

where $T$ is a temperature parameter and $k$ indexes the logits of the $K$ classes.
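To make the scoring concrete, here is a minimal PyTorch sketch of Eq. 2; the function names and batched interface are our own:

```python
import torch

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    # S_EBO(x) = T * log sum_k exp(z_k / T); higher means more ID-like.
    return T * torch.logsumexp(logits / T, dim=-1)

def is_id(logits: torch.Tensor, tau: float, T: float = 1.0) -> torch.Tensor:
    # Indicator G(x; tau) of Eq. 2: 1 for ID (score above tau), 0 for OOD.
    return (energy_score(logits, T) > tau).long()
```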

3.2 Analysis of ASH

A state-of-the-art method for OOD detection is ASH (Djurisic et al., 2023). ASH stands for activation shaping and is a simple post-hoc method that applies a rectified scaling to the feature vector ${\bm{a}}$. Activations in ${\bm{a}}$ up to the $p^{\text{th}}$ percentile across the $D$ dimensions are rectified ("pruned" in the original text); activations above the $p^{\text{th}}$ percentile are scaled. More formally, ASH introduces a shaping function $s_{f}$ that is applied to each activation ${\bm{a}}_{j}$ of a given sample. If we define $P_{p}({\bm{a}})$ as the $p^{\text{th}}$ percentile of the elements in ${\bm{a}}$, ASH produces the logit ${\bm{z}}_{\text{ASH}}$:

$${\bm{z}}_{\text{ASH}}=\mathbf{W}\cdot\left({\bm{a}}\circ s_{f}({\bm{a}})\right)+{\bm{b}},\quad\text{where }s_{f}({\bm{a}})_{j}=\begin{cases}0&\text{if }{\bm{a}}_{j}\leq P_{p}({\bm{a}}),\\ \exp(r)&\text{if }{\bm{a}}_{j}>P_{p}({\bm{a}}),\end{cases}\tag{3}$$

and $\circ$ denotes element-wise multiplication, and the scaling factor $r$ is defined as the ratio of the sum of all activations to the sum of un-pruned activations in ${\bm{a}}$:

$$r=\frac{Q}{Q_{p}},\qquad\text{where }Q=\sum_{j}^{D}{\bm{a}}_{j}\ \text{ and }\ Q_{p}=\sum_{{\bm{a}}_{j}>P_{p}({\bm{a}})}{\bm{a}}_{j}.\tag{4}$$

Since $Q_{p}\leq Q$, the factor $r\geq 1$; the higher the percentile $p$, i.e. the greater the extent of pruning, the smaller $Q_{p}$ is with respect to $Q$ and the larger the scaling factor $r$. To distinguish OOD data, ASH then passes the logit from Eq. 3 to the score and indicator function given in Eq. 2.
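For reference, a minimal sketch of the ASH shaping in Eqs. 3-4, assuming non-negative post-ReLU features of shape (N, D); the helper name is ours:

```python
import torch

def ash_shaping(a: torch.Tensor, p: float = 0.85) -> torch.Tensor:
    # P_p(a): per-sample p-th percentile threshold across the D dimensions.
    thresh = torch.quantile(a, p, dim=-1, keepdim=True)
    keep = a > thresh
    Q = a.sum(dim=-1, keepdim=True)                 # sum of all activations
    Qp = torch.where(keep, a, torch.zeros_like(a)).sum(dim=-1, keepdim=True)
    r = Q / Qp                                      # Eq. 4, r >= 1
    # Eq. 3: prune activations at or below the threshold, scale the rest by exp(r).
    return torch.where(keep, a * torch.exp(r), torch.zeros_like(a))
```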

While ASH is highly effective, the original paper offers no explanation of its working mechanism¹. We analyze the rectification and scaling components of ASH below and reveal that scaling helps to separate ID versus OOD energy scores, while rectification has an adverse effect.

¹ In fact, the authors put forth a call for explanation in their Appendix L.

Dataset $p$-value
ImageNet 0.296
SSB-hard 0.262
NINCO 0.181
iNaturalist 0.083
Textures 0.099
OpenImage-O 0.155
Table 1: Average $p$-values across all samples under a Chi-square test; values greater than 0.05 indicate that the Gaussian assumption is reasonable.
Figure 2: Mean and variance of pre-ReLU activations for ID (blue) vs. OOD datasets (pink).
Figure 3: $\mu/\sigma$ of pre-ReLU activations for ID (blue) vs. OOD (pink).

Assumptions: Our analysis is based on two assumptions. (1) The penultimate activations of ID and OOD samples follow two differing rectified Gaussian distributions parameterized by $(\mu^{\text{ID}},\sigma^{\text{ID}})$ and $(\mu^{\text{OOD}},\sigma^{\text{OOD}})$. The Gaussian assumption is commonly used in the literature (Sun et al., 2021) and we verify it in Tab. 1; the rectification follows naturally if a ReLU is applied as the final operation of the penultimate layer. (2) Normalized ID activations are higher than OOD activations; this assumption is supported by Liu et al. (2020), who suggested that well-trained networks have higher responses to samples resembling those seen in training. Fig. 2 and Fig. 3 provide statistical corroboration of these assumptions.
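As an illustration of how such a check could be run, below is a rough sketch of a chi-square goodness-of-fit test of one sample's pre-ReLU activations against a fitted Gaussian; this is our own construction of the kind of test summarized in Tab. 1, not necessarily the authors' exact procedure:

```python
import numpy as np
from scipy import stats

def gaussian_chi2_pvalue(act: np.ndarray, n_bins: int = 20) -> float:
    # Fit a Gaussian to one sample's pre-ReLU activation vector.
    mu, sigma = act.mean(), act.std()
    # Equal-probability bin edges under the fitted Gaussian.
    edges = stats.norm.ppf(np.linspace(0.0, 1.0, n_bins + 1), loc=mu, scale=sigma)
    edges[0], edges[-1] = act.min() - 1.0, act.max() + 1.0  # cover every observation
    observed, _ = np.histogram(act, bins=edges)
    expected = np.full(n_bins, act.size / n_bins)
    # ddof=2 because mu and sigma were estimated from the same data.
    return stats.chisquare(observed, expected, ddof=2).pvalue
```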

Proposition 3.1.

Assume that ID activations ${\bm{a}}_{j}^{(\textit{ID})}\sim\mathcal{N}^{R}(\mu^{\textit{ID}},\sigma^{\textit{ID}})$ and OOD activations ${\bm{a}}_{j}^{(\textit{OOD})}\sim\mathcal{N}^{R}(\mu^{\textit{OOD}},\sigma^{\textit{OOD}})$, where $\mathcal{N}^{R}$ denotes a rectified Gaussian distribution. If $\mu^{\text{ID}}/\sigma^{\text{ID}}>\mu^{\text{OOD}}/\sigma^{\text{OOD}}$, then there is a range of percentiles $p$ for which the factor $C(p)=\frac{\varphi(\sqrt{2}\operatorname{erf}^{-1}(2p-1))}{1-\Phi(\sqrt{2}\operatorname{erf}^{-1}(2p-1))}$ is large enough such that $Q_{p}^{\text{ID}}/Q^{\text{ID}}<Q_{p}^{\text{OOD}}/Q^{\text{OOD}}$.

The full proof is given in Appendix A. Above, $\varphi$ and $\Phi$ denote the probability density function and cumulative distribution function of the standard normal distribution, respectively. The factor $C(p)$, plotted in Fig. 4a, relates the percentile $p$ to how well the retained activations separate ID from OOD data.
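The factor is easy to evaluate numerically; a small sketch (the helper is our own) that reproduces the trend in Fig. 4a:

```python
import numpy as np
from scipy import stats
from scipy.special import erfinv

def C(p: float) -> float:
    # m = sqrt(2) * erfinv(2p - 1) is the standard-normal p-th quantile;
    # C(p) is the Gaussian hazard function phi(m) / (1 - Phi(m)) at m.
    m = np.sqrt(2.0) * erfinv(2.0 * p - 1.0)
    return stats.norm.pdf(m) / stats.norm.sf(m)   # sf(m) = 1 - Phi(m)

print([round(C(p), 3) for p in (0.65, 0.75, 0.85, 0.95)])
```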

Figure 4: (a) The relationship between the factor $C(p)$ and the percentile $p$. A higher value of $C(p)$ indicates better separation of scales. (b) AUROC vs. percentile $p$. Up to $p=0.85$, as highlighted by the orange box, AUROC for scaling increases while for pruning it decreases. The results of ASH sit between the two, as the method is a combination of pruning plus scaling. (c) Histograms of scales $Q/Q_{p}$ for the ID dataset (ImageNet) and an OOD dataset (iNaturalist) exhibit a clear separation from each other.

Rectification (Pruning). The relative reduction in activations can be expressed as:

$$D^{\textit{Pruning}}=(Q-Q_{p})/Q.\tag{5}$$

Note that a reduction in activations also leads to a reduction in the energy score. Since $Q_{p}^{\textit{ID}}/Q^{\textit{ID}}<Q_{p}^{\textit{OOD}}/Q^{\textit{OOD}}$, the relative decrease for ID samples is greater than that for OOD samples, i.e. $D^{\textit{Pruning}}_{\textit{ID}}>D^{\textit{Pruning}}_{\textit{OOD}}$. From this result, we can show that the expected relative decrease in energy scores under rectification is greater for ID samples than for OOD samples, following Remark 2 of Sun et al. (2021), which shows that changes in logits are proportional to changes in activations.

Our result above shows that rectification or pruning creates a greater overlap in energy scores between ID and OOD samples, making them more difficult to distinguish. Empirically, this is shown in Fig. 4b, where AUROC steadily decreases with stand-alone pruning as the percentile $p$ increases.

Scaling, on the other hand, behaves in the opposite manner and enlarges the separation between ID and OOD scores.

Given $Q_{p}^{\textit{ID}}/Q^{\textit{ID}}<Q_{p}^{\textit{OOD}}/Q^{\textit{OOD}}$ and $r=Q/Q_{p}$, we have $r^{\textit{ID}}>r^{\textit{OOD}}$, which induces a separation in $r$ between ID and OOD. Fig. 4c depicts the histograms of these respective distributions; they are well separated, so the activations of ID and OOD samples are scaled differently. The relative increase in activations can be expressed as:

$$I^{\text{Scaling}}=(r-1),\tag{6}$$

from which we get $I^{\text{Scaling}}_{\text{ID}}>I^{\text{Scaling}}_{\text{OOD}}$. This increase is transferred to the logits ${\bm{z}}$ and the energy-based scores $S_{\textit{EBO(ID)}}$ and $S_{\textit{EBO(OOD)}}$, which increases the gap between ID and OOD samples.

Discussion on percentile $p$: Note that $C(p)$ is not monotonically increasing with respect to $p$ (see Fig. 4a). At $p\approx 0.95$, there is an inflection point and $C(p)$ decreases. A similar inflection follows in the AUROC for scaling (see Fig. 4b), though it is not exactly aligned with $C(p)$. The difference is likely due to the approximations made in estimating $C(p)$. Also, as $p$ grows, fewer activations (out of $D=2048$ in total) are considered when estimating $r$, leading to unreliable logits for the energy score. Curiously, the AUROC for pruning also drops off, which we believe similarly stems from the extreme reduction in activations.

3.3 SCALE Criterion for OOD Detection

From our analyses and findings above, we propose a new post-hoc model enhancement criterion, which we call SCALE. As the name suggests, it shapes the activations with (only) a scaling:

𝒛=𝐖(𝒂sf(𝒂))+𝒃,where sf(𝒂)j=exp(r) and r=j𝒂j𝒂j>Pp(𝒂)𝒂j.{\bm{z}}^{\prime}=\mathbf{W}\cdot\left({\bm{a}}\circ s_{f}({\bm{a}})\right)+{\bm{b}},\qquad\text{where }s_{f}({\bm{a}})_{j}=\exp(r)\ \text{ and }\ r=\frac{\sum_{j}{{\bm{a}}_{j}}}{\sum_{{\bm{a}}_{j}>P_{p}({\bm{a}})}{{\bm{a}}_{j}}}. (7)

Fig. 5a illustrates how SCALE works. SCALE applies the same scaling factor $r$ as ASH, based on the percentile $p$. Instead of pruning, however, it retains and scales all the activations. Doing so has two benefits. First, it enhances the separation in energy scores between ID and OOD samples. Second, scaling all activations equally preserves the ordinality of the logits ${\bm{z}}^{\prime}$ compared to ${\bm{z}}$. As such, the $\arg\max$ is not affected and there is no trade-off in ID accuracy; this is not the case with rectification, be it pruning, as in ASH, or clipping, as in ReAct (see Fig. 1). Results in Tab. 2 and 3 verify that SCALE outperforms ASH-S on all datasets and model architectures.
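A minimal sketch of the SCALE shaping in Eq. 7 (the helper name is ours); the shaped features are then passed through the unchanged linear layer and the energy score of Eq. 2:

```python
import torch

def scale_shaping(a: torch.Tensor, p: float = 0.85) -> torch.Tensor:
    # Same per-sample factor r = Q / Q_p as ASH (Eq. 4) ...
    thresh = torch.quantile(a, p, dim=-1, keepdim=True)
    Q = a.sum(dim=-1, keepdim=True)
    Qp = torch.where(a > thresh, a, torch.zeros_like(a)).sum(dim=-1, keepdim=True)
    r = Q / Qp
    # ... but no pruning: every activation is scaled by the same exp(r), so the
    # within-sample logit ordering (and hence the ID prediction) is preserved.
    return a * torch.exp(r)
```

Because every dimension of a sample is multiplied by the same positive factor, only the scores of different samples are pulled apart; the prediction for each individual sample is untouched.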

(a) Demonstration of SCALE post-hoc model enhancement. We prune activations only to compute the scaling factor; the original activations are then multiplied by the computed scale before being fed into the fully connected layer.
(b) The process of ISH training. During training, we keep the forward pass unchanged. In the backward pass, the activations used for parameter optimization are weighted by $s_{f}({\bm{a}}_{i})$, which varies across samples and reflects each sample's ID-ness.
Figure 5: Illustrations of our post-hoc model enhancement method SCALE and our training-time model enhancement method ISH.

3.4 Incorporating SCALE into Training

In practice, the semantic shift between ID and OOD data may be ambiguous. For example, the iNaturalist dataset features different species of plants, and similar objects may be found in ImageNet. Our hypothesis is that, during training, we can emphasize the impact of samples possessing the most distinctive in-distribution characteristics, which we denote "ID-ness". Quantifying the ID-ness of specific samples is challenging, so we rely on a well-trained network to assist us. In particular, for a well-trained network, we can reacquire the activations of all training samples. We proceed on the assumption that normalized ID activations are greater than OOD activations. To measure the degree of ID-ness within the training data, we compute the scale factor $Q/Q_{p}$. Armed with this measurement of ID-ness, we can then re-optimize the network with emphasis on high ID-ness data. Our approach draws inspiration from the concept of intermediate tensor compression found in memory-efficient training methods (Chen et al., 2023a), where modifications are applied exclusively to the backward pass, leaving the forward pass unchanged.

Fig. 5b illustrates our training-time enhancement method for OOD detection. We finetune a well-trained network by introducing a modification to the gradient of the weights of the fully connected layer. The modified weight update is defined as follows:

$$\mathbf{W}^{t+1}=\mathbf{W}^{t}-\eta\sum_{i}\left[({\bm{a}}_{i}\circ s_{f}({\bm{a}}_{i}))^{\top}\nabla{\bm{z}}_{i}\right],\tag{8}$$

where $i$ denotes the sample index in the batch, $\nabla$ denotes the gradient with respect to the cross-entropy loss, $t$ denotes the training step, and $\eta$ represents the learning rate.

Modifying activations exclusively in the backward pass offers several advantages. First, it leaves the forward pass unaffected, resulting in only a minimal loss in ID accuracy. Second, the model architecture remains exactly the same during inference, making this training strategy compatible with any OOD post-processing technique. Since the activations saved for the backward pass are also referred to as intermediate tensors, we term this method Intermediate tensor SHaping (ISH).
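To illustrate how Eq. 8 can be realized without touching the forward pass, here is a sketch using a custom autograd function for the final fully connected layer; it is our own rendering of the idea, not the released implementation:

```python
import torch

class ISHLinear(torch.autograd.Function):
    @staticmethod
    def forward(ctx, a, weight, bias, p=0.85):
        # Standard forward pass: z = a W^T + b (unchanged by ISH).
        thresh = torch.quantile(a, p, dim=-1, keepdim=True)
        Q = a.sum(dim=-1, keepdim=True)
        Qp = torch.where(a > thresh, a, torch.zeros_like(a)).sum(dim=-1, keepdim=True)
        # Save the *shaped* intermediate tensor a * exp(r) for the backward pass.
        ctx.save_for_backward(a * torch.exp(Q / Qp), weight)
        return a @ weight.t() + bias

    @staticmethod
    def backward(ctx, grad_z):
        a_shaped, weight = ctx.saved_tensors
        grad_a = grad_z @ weight             # usual gradient w.r.t. the input
        grad_w = grad_z.t() @ a_shaped       # Eq. 8: (a o s_f(a))^T grad_z
        grad_b = grad_z.sum(dim=0)
        return grad_a, grad_w, grad_b, None  # no gradient for p
```

During fine-tuning, the final fully connected layer would call ISHLinear.apply(a, weight, bias) in place of the standard linear operation.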

4 Experiments

4.1 Settings

To verify SCALE as a post-hoc OOD method, we conduct experiments using CIFAR10, CIFAR100 (Krizhevsky, 2009), and ImageNet-1k (Deng et al., 2009) as in-distribution (ID) data sources.

CIFAR. We use SVHN (Netzer et al., 2011), LSUN-Crop (Yu et al., 2015), LSUN-Resize (Yu et al., 2015), iSUN (Xu et al., 2015), Places365 (Zhou et al., 2018), and Textures (Cimpoi et al., 2014) as OOD datasets. For consistency with previous work, we use the same model architecture and pretrained weights, namely DenseNet-101 (Huang et al., 2017), in accordance with the other post-hoc approaches DICE, ReAct, and ASH. Table 3 compares the FPR@95 and AUROC averaged across all six datasets; detailed results are provided in Appendix B.

ImageNet. In our ImageNet experiments, we follow the OpenOOD v1.5 (Zhang et al., 2023) benchmark, which separates OOD datasets into near-OOD and far-OOD groups. We employ SSB-hard (Vaze et al., 2022) and NINCO (Bitterwolf et al., 2023) as near-OOD datasets and iNaturalist (Horn et al., 2018), Textures (Cimpoi et al., 2014), and OpenImage-O (Wang et al., 2022) as far-OOD datasets. Our reported metrics are the average FPR@95 and AUROC values across these categories; detailed results are given in Appendix B. The OpenOOD benchmark includes improved hyperparameter selection with a dedicated OOD validation set to prevent overfitting to the test set. Additionally, we provide results following the same dataset and test/validation split settings as ASH and ReAct in the appendix. We adopt the ResNet50 (He et al., 2016) model architecture and obtain the pretrained network from the torchvision library.

Metrics. We evaluate with two measures. The first is FPR@95, the false positive rate at a fixed true positive rate of 95% (lower is better). The second is AUROC, the area under the ROC curve; it represents the probability that a positive (ID) sample receives a higher detection score than a negative (OOD) sample (higher is better).
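Both metrics can be computed directly from per-sample OOD scores; a short sketch with our own helper names, assuming higher scores mean more ID-like:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fpr_at_95_tpr(scores_id: np.ndarray, scores_ood: np.ndarray) -> float:
    # Threshold that keeps 95% of ID samples, then the fraction of OOD above it.
    tau = np.percentile(scores_id, 5)
    return float((scores_ood > tau).mean())

def auroc(scores_id: np.ndarray, scores_ood: np.ndarray) -> float:
    # Probability that a random ID sample outscores a random OOD sample.
    labels = np.concatenate([np.ones(len(scores_id)), np.zeros(len(scores_ood))])
    scores = np.concatenate([scores_id, scores_ood])
    return float(roc_auc_score(labels, scores))
```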

4.2 SCALE for Post-Hoc OOD Detection

Comparisons of OOD score methods and post-hoc model enhancement methods (separated by a solid line) on ImageNet and CIFAR are shown in Tables 2 and 3. Notably, SCALE attains the highest OOD detection scores.

OOD Detection Accuracy. Compared to the current state-of-the-art ASH-S, SCALE demonstrates significant improvements on ImageNet: +1.73 AUROC on near-OOD and +0.26 on far-OOD. For FPR@95, it outperforms ASH-S by 2.27 and 0.33. On CIFAR10 and CIFAR100, SCALE shows even greater improvements of 2.48 and 2.41 in FPR@95, as well as 0.66 and 0.72 in AUROC, respectively.

ID Accuracy. One of SCALE's key advantages is that it applies only a per-sample positive scaling to the features, so ID accuracy is guaranteed to stay the same. This differentiates it from other post-hoc enhancement methods that rectify or prune activations, thereby modifying inference and invariably compromising ID accuracy. On the ID dataset, ImageNet-1k, SCALE surpasses ASH-S by a substantial margin of 0.67 in accuracy. This capability is pivotal for establishing a unified pipeline that excels on both ID and OOD data.

Model Postprocessor Near-OOD (FPR@95↓ / AUROC↑) Far-OOD (FPR@95↓ / AUROC↑) ID ACC↑
ResNet50 EBO (Liu et al., 2020) 68.56 75.89 38.40 89.47 76.18
MSP (Hendrycks & Gimpel, 2017) 65.67 76.02 51.47 85.23 76.18
MLS (Hendrycks et al., 2022) 67.82 76.46 38.20 89.58 76.18
GEN (Liu et al., 2023) 65.30 76.85 35.62 89.77 76.18
RMDS (Ren et al., 2021) 65.04 76.99 40.91 86.38 76.18
TempScale (Guo et al., 2017) 64.51 77.14 46.67 87.56 76.18
ReAct (Sun et al., 2021) 66.75 77.38 26.31 93.67 75.58
ASH-S (Djurisic et al., 2023) 62.03 79.63 16.86 96.47 75.51
SCALE (Ours) 59.76 81.36 16.53 96.53 76.18
Table 2: OOD detection results on ImageNet-1K benchmarks. Model choice and protocol are the same as existing works. SCALE outperforms other OOD score methods and post-hoc model enhancement methods, achieving the highest OOD detection scores and excelling in the ID-OOD trade-off. Detailed results for each dataset are given in Appendix B.
Model Postprocessor CIFAR-10 (FPR@95↓ / AUROC↑) CIFAR-100 (FPR@95↓ / AUROC↑)
DenseNet-101 MSP 48.73 92.46 80.13 74.36
EBO 26.55 94.57 68.45 81.19
ReAct 26.45 94.95 62.27 84.47
DICE 20.83±1.58 95.24±0.24 49.72±1.69 87.23±0.73
ASH-S 15.05 96.61 41.40 90.02
SCALE (Ours) 12.57 97.27 38.99 90.74
Table 3: OOD detection results on CIFAR benchmarks. SCALE outperforms all postprocessors. Detailed results for each dataset are in the appendix.

Comparison with TempScale. Temperature scaling (TempScale) is widely used for confidence calibration (Guo et al., 2017). SCALE and TempScale both leverage scaling for OOD detection, but with two distinctions. Firstly, TempScale directly scales logits for calibration, whereas SCALE applies scaling at the penultimate layer. Secondly, TempScale employs a uniform scaling factor for all samples, whereas SCALE applies a sample-specific scaling factor based on the sample’s activation statistics. The sample-specific scaling is a crucial differentiator that enables the discrimination between ID and OOD samples. Notably, our SCALE model significantly outperforms TempScale in both Near-OOD and Far-OOD scenarios.

SCALE with different percentiles $p$. Table 2 uses $p=0.85$ for SCALE and ASH-S, as selected on the validation set. As detailed in Sec. 3.2, for scaling to be valid, the percentile $p$ must fall within a range where the factor $C(p)$ is sufficiently high to meet the required condition. Our experimental observations align with this theoretical premise: up to the 85th percentile, the AUROC values for both near-OOD and far-OOD scenarios consistently trend upward, while a noticeable decline appears beyond this threshold. This empirical finding corroborates our theoretical insight that $C(p)$ shrinks in effect as $p$ approaches 90%.

$p$ (%) 65 70 75 80 85 90 95
Near-OOD 62.45 / 79.31 61.65 / 79.83 61.12 / 80.41 60.12 / 81.01 59.76 / 81.36 63.19 / 80.14 78.62 / 73.40
Far-OOD 24.08 / 94.43 22.21 / 95.02 20.20 / 95.61 18.26 / 96.17 16.53 / 96.53 18.58 / 96.20 32.42 / 93.28
Table 4: FPR@95 / AUROC results on the ImageNet benchmark under different $p$.

4.3 ISH for Training-Time Model Enhancement

We used the same dataset splits as the post-hoc experiments in Sec. 4.1. For training, we fine-tuned the torchvision pretrained model with ISH for 10 epochs using a cosine annealing learning rate schedule initialized at 0.003 with a minimum of 0. We additionally observed that a smaller weight decay value (5e-6) enhances OOD detection performance. The results are presented in Table 5, where we compare ISH with other training-time model enhancement methods.
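A sketch of this fine-tuning recipe (10 epochs, cosine annealing from 0.003 to 0, weight decay 5e-6); the optimizer choice (SGD with momentum) and the data pipeline are our assumptions, and the ISH backward modification is omitted here:

```python
import torch
from torchvision.models import resnet50, ResNet50_Weights

model = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)
# SGD with momentum is an assumption; the paper specifies only the schedule,
# the initial learning rate (0.003), and the weight decay (5e-6).
optimizer = torch.optim.SGD(model.parameters(), lr=0.003,
                            momentum=0.9, weight_decay=5e-6)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10, eta_min=0.0)

for epoch in range(10):
    # ... one ImageNet epoch with the ISH-modified backward pass ...
    scheduler.step()
```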

Comparison with OOD training methods. LogitNorm (Wei et al., 2022) diagnoses the gradual narrowing of the gap between the logit magnitudes of ID and OOD distributions during later stages of training. Their proposed approach normalizes the logits, applying the scaling factor in logit space during the backward pass.

The key distinction between LogitNorm and our ISH approach lies in the purpose of scaling. LogitNorm scales logits primarily for confidence calibration, aiming to align the model's confidence with the reliability of its predictions. In contrast, ISH scales activations to perform weighted optimization, emphasizing the impact of high ID-ness data during fine-tuning.

Comparisons with data augmentation-based methods. Zhang et al. (2023) indicate that data augmentation methods, while not originally designed to improve OOD detection, can simultaneously enhance both ID and OOD accuracy.

In comparison to AugMix and RegMixup, our ISH approach, while slightly reducing ID accuracy, delivers superior OOD performance with significantly fewer computational resources. Compared to AugMix, ISH achieves substantial improvements, enhancing AUROC by 0.46 and 0.8 for near-OOD and far-OOD respectively, with just 0.1× the extended training epochs. Notably, ISH sets the highest AUROC records among all methods on the OpenOOD v1.5 benchmark, reaching 84.01% on near-OOD and 96.79% on far-OOD.

Model Training Epochs (Ori.+Ext.) Postprocessor Near-OOD (FPR@95↓ / AUROC↑) Far-OOD (FPR@95↓ / AUROC↑) ID ACC↑
ResNet50 LogitNorm (Wei et al., 2022) 90+30 MSP 68.56 74.62 31.33 91.54 76.45
CIDER (Ming et al., 2023) 90+30 KNN 71.69 68.97 28.69 92.18 -
TorchVision Model 90 SCALE 59.76 81.36 16.53 96.53 76.13
TorchVision Model Extended 90+10 SCALE 59.25 82.67 18.48 96.24 76.84
RegMixup (Pinto et al., 2022) 90+30 SCALE 63.55 80.85 19.87 95.94 76.88
AugMix (Hendrycks et al., 2020) 180 SCALE 60.58 83.55 21.01 95.99 77.64
ISH (Ours) 90+10 SCALE 55.73 84.01 15.62 96.79 76.74
Table 5: Comparisons with training-time enhancement methods on ImageNet-1K. Our ISH method achieves the highest scores for both near-OOD and far-OOD with the shortest training epochs. "Ori." denotes the original training epochs for the pretrained network, while "Ext." denotes the extended training epochs in our training scheme.

5 Conclusion

In this paper, we have conducted an in-depth investigation into the efficacy of scaling techniques in enhancing out-of-distribution (OOD) detection. Our study is grounded in the analysis of activation distribution disparities between in-distribution (ID) and OOD data. To this end, we introduce SCALE, a post-hoc model enhancement method that achieves state-of-the-art OOD accuracy when integrated with energy scores, without compromising ID accuracy. Furthermore, we extend the application of scaling to the training phase, introducing ISH, a training-time enhancement method that significantly bolsters OOD accuracy.

References

  • Bitterwolf et al. (2023) Julian Bitterwolf, Maximilian Müller, and Matthias Hein. In or out? fixing imagenet out-of-distribution detection evaluation. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  2471–2506. PMLR, 2023. URL https://proceedings.mlr.press/v202/bitterwolf23a.html.
  • Chen et al. (2023a) Joya Chen, Kai Xu, Yuhui Wang, Yifei Cheng, and Angela Yao. Dropit: Dropping intermediate tensors for memory-efficient DNN training. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023a. URL https://openreview.net/pdf?id=Kn6i2BZW69w.
  • Chen et al. (2023b) Xuanyao Chen, Zhijian Liu, Haotian Tang, Li Yi, Hang Zhao, and Song Han. Sparsevit: Revisiting activation sparsity for efficient high-resolution vision transformer. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pp.  2061–2070. IEEE, 2023b. doi: 10.1109/CVPR52729.2023.00205. URL https://doi.org/10.1109/CVPR52729.2023.00205.
  • Cimpoi et al. (2014) Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2014, Columbus, OH, USA, June 23-28, 2014, pp.  3606–3613. IEEE Computer Society, 2014. doi: 10.1109/CVPR.2014.461. URL https://doi.org/10.1109/CVPR.2014.461.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA, pp.  248–255. IEEE Computer Society, 2009. doi: 10.1109/CVPR.2009.5206848. URL https://doi.org/10.1109/CVPR.2009.5206848.
  • DeVries & Taylor (2018) Terrance DeVries and Graham W. Taylor. Learning confidence for out-of-distribution detection in neural networks. CoRR, abs/1802.04865, 2018. URL http://arxiv.org/abs/1802.04865.
  • Djurisic et al. (2023) Andrija Djurisic, Nebojsa Bozanic, Arjun Ashok, and Rosanne Liu. Extremely simple activation shaping for out-of-distribution detection. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=ndYXTEL6cZz.
  • Evans & Aamodt (2021) R. David Evans and Tor M. Aamodt. AC-GC: lossy activation compression with guaranteed convergence. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp.  27434–27448, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/e655c7716a4b3ea67f48c6322fc42ed6-Abstract.html.
  • Guo et al. (2017) Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Doina Precup and Yee Whye Teh (eds.), Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, volume 70 of Proceedings of Machine Learning Research, pp.  1321–1330. PMLR, 2017. URL http://proceedings.mlr.press/v70/guo17a.html.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, pp.  770–778. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.90. URL https://doi.org/10.1109/CVPR.2016.90.
  • Hendrycks & Gimpel (2017) Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=Hkg4TI9xl.
  • Hendrycks et al. (2020) Dan Hendrycks, Norman Mu, Ekin Dogus Cubuk, Barret Zoph, Justin Gilmer, and Balaji Lakshminarayanan. Augmix: A simple data processing method to improve robustness and uncertainty. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=S1gmrxHFvB.
  • Hendrycks et al. (2022) Dan Hendrycks, Steven Basart, Mantas Mazeika, Andy Zou, Joseph Kwon, Mohammadreza Mostajabi, Jacob Steinhardt, and Dawn Song. Scaling out-of-distribution detection for real-world settings. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp.  8759–8773. PMLR, 2022. URL https://proceedings.mlr.press/v162/hendrycks22a.html.
  • Horn et al. (2018) Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alexander Shepard, Hartwig Adam, Pietro Perona, and Serge J. Belongie. The inaturalist species classification and detection dataset. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pp.  8769–8778. Computer Vision Foundation / IEEE Computer Society, 2018. doi: 10.1109/CVPR.2018.00914. URL http://openaccess.thecvf.com/content_cvpr_2018/html/Van_Horn_The_INaturalist_Species_CVPR_2018_paper.html.
  • Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pp.  2261–2269. IEEE Computer Society, 2017. doi: 10.1109/CVPR.2017.243. URL https://doi.org/10.1109/CVPR.2017.243.
  • Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009. URL https://api.semanticscholar.org/CorpusID:18268744.
  • Kurtz et al. (2020) Mark Kurtz, Justin Kopinsky, Rati Gelashvili, Alexander Matveev, John Carr, Michael Goin, William M. Leiserson, Sage Moore, Nir Shavit, and Dan Alistarh. Inducing and exploiting activation sparsity for fast inference on deep neural networks. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp.  5533–5543. PMLR, 2020. URL http://proceedings.mlr.press/v119/kurtz20a.html.
  • Lee et al. (2018) Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp.  7167–7177, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/abdeb6f575ac5c6676b747bca8d09cc2-Abstract.html.
  • Li et al. (2023) Zonglin Li, Chong You, Srinadh Bhojanapalli, Daliang Li, Ankit Singh Rawat, Sashank J. Reddi, Ke Ye, Felix Chern, Felix X. Yu, Ruiqi Guo, and Sanjiv Kumar. The lazy neuron phenomenon: On emergence of activation sparsity in transformers. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=TJ2nxciYCk-.
  • Liu et al. (2020) Weitang Liu, Xiaoyun Wang, John D. Owens, and Yixuan Li. Energy-based out-of-distribution detection. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/f5496252609c43eb8a3d147ab9b9c006-Abstract.html.
  • Liu et al. (2022) Xiaoxuan Liu, Lianmin Zheng, Dequan Wang, Yukuo Cen, Weize Chen, Xu Han, Jianfei Chen, Zhiyuan Liu, Jie Tang, Joey Gonzalez, Michael W. Mahoney, and Alvin Cheung. GACT: activation compressed training for generic network architectures. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp.  14139–14152. PMLR, 2022. URL https://proceedings.mlr.press/v162/liu22v.html.
  • Liu et al. (2023) Xixi Liu, Yaroslava Lochman, and Christopher Zach. GEN: pushing the limits of softmax-based out-of-distribution detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, pp.  23946–23955. IEEE, 2023. doi: 10.1109/CVPR52729.2023.02293. URL https://doi.org/10.1109/CVPR52729.2023.02293.
  • Ming et al. (2023) Yifei Ming, Yiyou Sun, Ousmane Dia, and Yixuan Li. How to exploit hyperspherical embeddings for out-of-distribution detection? In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=aEFaE0W5pAd.
  • Netzer et al. (2011) Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011. URL http://ufldl.stanford.edu/housenumbers/nips2011_housenumbers.pdf.
  • Pinto et al. (2022) Francesco Pinto, Harry Yang, Ser Nam Lim, Philip H. S. Torr, and Puneet K. Dokania. Using mixup as a regularizer can surprisingly improve accuracy & out-of-distribution robustness. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/5ddcfaad1cb72ce6f1a365e8f1ecf791-Abstract-Conference.html.
  • Ren et al. (2021) Jie Ren, Stanislav Fort, Jeremiah Z. Liu, Abhijit Guha Roy, Shreyas Padhy, and Balaji Lakshminarayanan. A simple fix to mahalanobis distance for improving near-ood detection. CoRR, abs/2106.09022, 2021. URL https://arxiv.org/abs/2106.09022.
  • Sun & Li (2022) Yiyou Sun and Yixuan Li. DICE: leveraging sparsification for out-of-distribution detection. In Shai Avidan, Gabriel J. Brostow, Moustapha Cissé, Giovanni Maria Farinella, and Tal Hassner (eds.), Computer Vision - ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part XXIV, volume 13684 of Lecture Notes in Computer Science, pp.  691–708. Springer, 2022. doi: 10.1007/978-3-031-20053-3_40. URL https://doi.org/10.1007/978-3-031-20053-3_40.
  • Sun et al. (2021) Yiyou Sun, Chuan Guo, and Yixuan Li. React: Out-of-distribution detection with rectified activations. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp.  144–157, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/01894d6f048493d2cacde3c579c315a3-Abstract.html.
  • Vaze et al. (2022) Sagar Vaze, Kai Han, Andrea Vedaldi, and Andrew Zisserman. Open-set recognition: A good closed-set classifier is all you need. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=5hLP5JY9S2d.
  • Wang et al. (2022) Haoqi Wang, Zhizhong Li, Litong Feng, and Wayne Zhang. Vim: Out-of-distribution with virtual-logit matching. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, pp.  4911–4920. IEEE, 2022. doi: 10.1109/CVPR52688.2022.00487. URL https://doi.org/10.1109/CVPR52688.2022.00487.
  • Wei et al. (2022) Hongxin Wei, Renchunzi Xie, Hao Cheng, Lei Feng, Bo An, and Yixuan Li. Mitigating neural network overconfidence with logit normalization. In Kamalika Chaudhuri, Stefanie Jegelka, Le Song, Csaba Szepesvári, Gang Niu, and Sivan Sabato (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp.  23631–23644. PMLR, 2022. URL https://proceedings.mlr.press/v162/wei22d.html.
  • Xu et al. (2015) Pingmei Xu, Krista A. Ehinger, Yinda Zhang, Adam Finkelstein, Sanjeev R. Kulkarni, and Jianxiong Xiao. Turkergaze: Crowdsourcing saliency with webcam based eye tracking. CoRR, abs/1504.06755, 2015. URL http://arxiv.org/abs/1504.06755.
  • Yang et al. (2022) Jingkang Yang, Pengyun Wang, Dejian Zou, Zitang Zhou, Kunyuan Ding, Wenxuan Peng, Haoqi Wang, Guangyao Chen, Bo Li, Yiyou Sun, Xuefeng Du, Kaiyang Zhou, Wayne Zhang, Dan Hendrycks, Yixuan Li, and Ziwei Liu. Openood: Benchmarking generalized out-of-distribution detection. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/d201587e3a84fc4761eadc743e9b3f35-Abstract-Datasets_and_Benchmarks.html.
  • Yu et al. (2015) Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. CoRR, abs/1506.03365, 2015. URL http://arxiv.org/abs/1506.03365.
  • Zhang et al. (2023) Jingyang Zhang, Jingkang Yang, Pengyun Wang, Haoqi Wang, Yueqian Lin, Haoran Zhang, Yiyou Sun, Xuefeng Du, Kaiyang Zhou, Wayne Zhang, Yixuan Li, Ziwei Liu, Yiran Chen, and Hai Li. Openood v1.5: Enhanced benchmark for out-of-distribution detection. CoRR, abs/2306.09301, 2023. doi: 10.48550/arXiv.2306.09301. URL https://doi.org/10.48550/arXiv.2306.09301.
  • Zhou et al. (2018) Bolei Zhou, Àgata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Trans. Pattern Anal. Mach. Intell., 40(6):1452–1464, 2018. doi: 10.1109/TPAMI.2017.2723009. URL https://doi.org/10.1109/TPAMI.2017.2723009.

Appendix A Details of Proof

Proposition 3.1.

Assume that ID activations ${\bm{a}}_{j}^{(\textit{ID})}\sim\mathcal{N}^{R}(\mu^{\textit{ID}},\sigma^{\textit{ID}})$ and OOD activations ${\bm{a}}_{j}^{(\textit{OOD})}\sim\mathcal{N}^{R}(\mu^{\textit{OOD}},\sigma^{\textit{OOD}})$, where $\mathcal{N}^{R}$ denotes a rectified Gaussian distribution. If $\mu^{\text{ID}}/\sigma^{\text{ID}}>\mu^{\text{OOD}}/\sigma^{\text{OOD}}$, then there is a range of percentiles $p$ for which the factor $C(p)=\frac{\varphi(\sqrt{2}\operatorname{erf}^{-1}(2p-1))}{1-\Phi(\sqrt{2}\operatorname{erf}^{-1}(2p-1))}$ is large enough such that $Q_{p}^{\text{ID}}/Q^{\text{ID}}<Q_{p}^{\text{OOD}}/Q^{\text{OOD}}$.

Proof.

The proof schema is to derive equivalent conditions. Under the assumption that data in the latent space follows an independent and identically distributed (IID) Gaussian distribution prior to the ReLU activation (Sun et al., 2021), each coefficient satisfies ${\bm{a}}_{j}^{(\textit{ID})}\sim\mathcal{N}^{R}(\mu^{\textit{ID}},\sigma^{\textit{ID}})$ for ID data and ${\bm{a}}_{j}^{(\textit{OOD})}\sim\mathcal{N}^{R}(\mu^{\textit{OOD}},\sigma^{\textit{OOD}})$ for OOD data, where $\mathcal{N}^{R}$ denotes a rectified Gaussian distribution. Moreover, denote the high activations as ${\bm{h}}_{j}^{(\textit{ID})}={\bm{a}}_{j}^{(\textit{ID})}$ if ${\bm{a}}_{j}>P_{p}({\bm{a}})$ and zero elsewhere. Then ${\bm{h}}_{j}^{(\textit{ID})}\sim\mathcal{N}^{T}(\mu^{\textit{ID}},\sigma^{\textit{ID}})$ and, identically, ${\bm{h}}_{j}^{(\textit{OOD})}\sim\mathcal{N}^{T}(\mu^{\textit{OOD}},\sigma^{\textit{OOD}})$, where $\mathcal{N}^{T}$ denotes a truncated Gaussian distribution. We can then calculate the expectations as follows:

$$\mathbb{E}[{\bm{a}}_{j}]=\mu\left[1-\Phi\!\left(-\frac{\mu}{\sigma}\right)\right]+\varphi\!\left(-\frac{\mu}{\sigma}\right)\sigma\tag{9}$$
$$\mathbb{E}[{\bm{h}}_{j}]=\mu+\frac{\varphi(m)}{1-\Phi(m)}\sigma,\qquad m=\frac{s-\mu}{\sigma}\tag{10}$$

Here, $\varphi(\cdot)$ is the probability density function of the standard normal distribution, $\Phi(\cdot)$ is its cumulative distribution function, and $s$ is the activation threshold at the $p^{\text{th}}$ percentile.
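As a sanity check on Eqs. 9-10, the two expectations can be verified by simulation; a short sketch (our own, with arbitrary example parameters):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu, sigma, p = 1.0, 2.0, 0.85
x = rng.normal(mu, sigma, size=1_000_000)
a = np.maximum(x, 0.0)                 # rectified Gaussian N^R(mu, sigma)
s = np.quantile(a, p)                  # threshold at the p-th percentile
h = a[a > s]                           # truncated Gaussian N^T above s

# Eq. 9: E[a_j] = mu [1 - Phi(-mu/sigma)] + phi(-mu/sigma) sigma
e_a = mu * (1 - stats.norm.cdf(-mu / sigma)) + stats.norm.pdf(-mu / sigma) * sigma
# Eq. 10: E[h_j] = mu + phi(m) / (1 - Phi(m)) sigma, with m = (s - mu) / sigma
m = (s - mu) / sigma
e_h = mu + stats.norm.pdf(m) / (1 - stats.norm.cdf(m)) * sigma

print(a.mean(), e_a)   # the empirical and analytic values should agree closely
print(h.mean(), e_h)
```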

Note that $Q_{p}/Q=\frac{\sum_{j}{\bm{h}}_{j}}{\sum_{j}{\bm{a}}_{j}}=\frac{\mathbb{E}[{\bm{h}}_{j}](1-p)D}{\mathbb{E}[{\bm{a}}_{j}]D}$. Let us introduce the notation $\beta=(1-p)\,Q/Q_{p}=\frac{\mathbb{E}[{\bm{a}}_{j}]}{\mathbb{E}[{\bm{h}}_{j}]}$. Then $Q_{p}^{\text{ID}}/Q^{\text{ID}}<Q_{p}^{\text{OOD}}/Q^{\text{OOD}}\iff\beta^{\text{ID}}>\beta^{\text{OOD}}$, so we focus on:

$$\beta=\frac{\mu\left[1-\Phi(-\frac{\mu}{\sigma})\right]+\varphi(-\frac{\mu}{\sigma})\sigma}{\mu+\frac{\varphi(m)}{1-\Phi(m)}\sigma}=\frac{1-\Phi(-\frac{\mu}{\sigma})}{1+\frac{\varphi(m)}{1-\Phi(m)}\frac{\sigma}{\mu}}+\frac{\varphi(-\frac{\mu}{\sigma})\sigma}{\mu+\frac{\varphi(m)}{1-\Phi(m)}\sigma}\tag{11}$$

Let us introduce some notation for ease of analysis:

  • $\gamma=\frac{\mu}{\sigma}$

  • $A=\Phi(-\gamma)$

  • $B=\varphi(-\gamma)$

  • $C=\frac{\varphi(m)}{1-\Phi(m)}=\frac{\varphi(\frac{s-\mu}{\sigma})}{1-\Phi(\frac{s-\mu}{\sigma})}$.

With these definitions, we can express β\beta as:

$$\beta=\frac{1-A}{1+C\gamma^{-1}}+\frac{B\sigma}{\mu+C\sigma}\tag{12}$$

We consider that $\gamma^{\text{ID}}\geq\gamma^{\text{OOD}}$. Hence, we also have:

  • $A^{\text{ID}}\leq A^{\text{OOD}}$

  • $B^{\text{ID}}\leq B^{\text{OOD}}$

By definition, $s^{\text{ID}}(p)=\mu^{\text{ID}}+\sigma^{\text{ID}}\sqrt{2}\operatorname{erf}^{-1}(2p-1)$ and $s^{\text{OOD}}(p)=\mu^{\text{OOD}}+\sigma^{\text{OOD}}\sqrt{2}\operatorname{erf}^{-1}(2p-1)$, where $p$ is the pruning percentile. So we have:

$$C^{\text{ID}}(p)=\frac{\varphi(m^{\text{ID}})}{1-\Phi(m^{\text{ID}})}=\frac{\varphi\!\left(\frac{s^{\text{ID}}-\mu^{\text{ID}}}{\sigma^{\text{ID}}}\right)}{1-\Phi\!\left(\frac{s^{\text{ID}}-\mu^{\text{ID}}}{\sigma^{\text{ID}}}\right)}=\frac{\varphi(\sqrt{2}\operatorname{erf}^{-1}(2p-1))}{1-\Phi(\sqrt{2}\operatorname{erf}^{-1}(2p-1))}\tag{13}$$

Moreover, we can prove identically that $C^{\text{OOD}}(p)=\frac{\varphi(m^{\text{OOD}})}{1-\Phi(m^{\text{OOD}})}=\frac{\varphi(\sqrt{2}\operatorname{erf}^{-1}(2p-1))}{1-\Phi(\sqrt{2}\operatorname{erf}^{-1}(2p-1))}=C^{\text{ID}}(p)$.

Now, if we consider the approximation:

$$\mathbb{E}[{\bm{a}}_{j}]\simeq\mu\left[1-\Phi\!\left(-\frac{\mu}{\sigma}\right)\right]\tag{14}$$

We assume that $\varphi\!\left(-\frac{\mu}{\sigma}\right)\sigma\approx 0$, since $\sigma$ is very small and the density term is below one. With this approximation, we have:

$$\beta=\frac{\gamma(1-A)}{\gamma+C}\tag{15}$$

We want to compare $\beta$ for in-distribution data, denoted $\beta^{\text{ID}}$, and out-of-distribution data, denoted $\beta^{\text{OOD}}$. We have:

$$\beta^{\text{ID}}\geq\beta^{\text{OOD}}\iff\frac{1-A^{\text{ID}}}{1+C{\gamma^{\text{ID}}}^{-1}}\geq\frac{1-A^{\text{OOD}}}{1+C{\gamma^{\text{OOD}}}^{-1}}\iff\frac{1-A^{\text{ID}}}{1-A^{\text{OOD}}}\geq\frac{1+C{\gamma^{\text{ID}}}^{-1}}{1+C{\gamma^{\text{OOD}}}^{-1}}\tag{16}$$

We can use the approximation $\frac{1}{1+C{\gamma^{\text{OOD}}}^{-1}}\simeq 1-C{\gamma^{\text{OOD}}}^{-1}$ by applying a first-order Taylor expansion. Then we have:

$$\begin{aligned}\frac{1-A^{\text{ID}}}{1-A^{\text{OOD}}}&\geq(1+C{\gamma^{\text{ID}}}^{-1})(1-C{\gamma^{\text{OOD}}}^{-1})&(17)\\&\geq 1+C({\gamma^{\text{ID}}}^{-1}-{\gamma^{\text{OOD}}}^{-1})-C^{2}({\gamma^{\text{ID}}}^{-1}{\gamma^{\text{OOD}}}^{-1})&(18)\end{aligned}$$

Note that by definition $C$ is positive. The given inequality can be expressed as:

$$\frac{1-A^{\text{ID}}}{1-A^{\text{OOD}}}-1-C({\gamma^{\text{ID}}}^{-1}-{\gamma^{\text{OOD}}}^{-1})+C^{2}({\gamma^{\text{ID}}}^{-1}{\gamma^{\text{OOD}}}^{-1})\geq 0\tag{19}$$

We can rewrite it as:

$$a_{1}C^{2}+a_{2}C+a_{3}\geq 0\tag{20}$$

Here we use the notation $a_{1}={\gamma^{\text{ID}}}^{-1}{\gamma^{\text{OOD}}}^{-1}$, $a_{2}=-({\gamma^{\text{ID}}}^{-1}-{\gamma^{\text{OOD}}}^{-1})$, and $a_{3}=\frac{1-A^{\text{ID}}}{1-A^{\text{OOD}}}-1$. Let us define $\Delta=a_{2}^{2}-4a_{1}a_{3}$. Then we have:

$$\begin{aligned}\Delta&=({\gamma^{\text{ID}}}^{-1}-{\gamma^{\text{OOD}}}^{-1})^{2}-4({\gamma^{\text{ID}}}^{-1}{\gamma^{\text{OOD}}}^{-1})\left(\frac{1-A^{\text{ID}}}{1-A^{\text{OOD}}}-1\right)&(21)\\&={\gamma^{\text{ID}}}^{-2}+{\gamma^{\text{OOD}}}^{-2}-2({\gamma^{\text{ID}}}^{-1}{\gamma^{\text{OOD}}}^{-1})\left(2\frac{1-A^{\text{ID}}}{1-A^{\text{OOD}}}-1\right)&(22)\\&=\left({\gamma^{\text{ID}}}^{-1}+{\gamma^{\text{OOD}}}^{-1}\right)^{2}-4({\gamma^{\text{ID}}}^{-1}{\gamma^{\text{OOD}}}^{-1})\left(\frac{1-A^{\text{ID}}}{1-A^{\text{OOD}}}\right)&(23)\end{aligned}$$

Since $a_{1}>0$, there are two possible cases:

  • If $\Delta\leq 0$, then $C(p)\in\mathbb{R}^{+}$.

  • If $\Delta>0$, then $C(p)\in\left[\max\left(\frac{({\gamma^{\text{ID}}}^{-1}-{\gamma^{\text{OOD}}}^{-1})+\sqrt{\Delta}}{2({\gamma^{\text{ID}}}^{-1}{\gamma^{\text{OOD}}}^{-1})},0^{+}\right),+\infty\right)$. Note that the other root satisfies $({\gamma^{\text{ID}}}^{-1}-{\gamma^{\text{OOD}}}^{-1})-\sqrt{\Delta}\leq 0$ since $({\gamma^{\text{ID}}}^{-1}-{\gamma^{\text{OOD}}}^{-1})\leq 0$, so we do not consider it.

In summary, there is a valid range of pruning percentiles $p$ satisfying the required range of $C(p)$, so that the statistic $Q_{p}/Q$ of the ID distribution is smaller than that of the OOD distribution. A $p$ with a larger $C(p)$ is more broadly applicable. ∎

Appendix B Full Experiments

In this section, we provide full results for SCALE post-hoc model enhancement. Tab. 6 shows full results on ImageNet, and Tabs. 8 and 9 show full results on CIFAR-10 and CIFAR-100. We also provide ImageNet results following the dataset settings of ReAct and ASH in Tab. 7 for further comparison.

ResNet50 Near-OOD Far-OOD ID  Accuracy
SSB-hard NINCO Average iNaturalist Textures OpenImage-O Average
EBO 76.54 / 72.08 60.59 / 79.70 68.56 / 75.89 31.33 / 90.63 45.77 / 88.7 38.08 / 89.06 38.40 / 89.47 76.18
MSP 74.49 / 72.09 56.84 / 79.95 65.67 / 76.02 43.34 / 88.41 60.89 / 82.43 50.16 / 84.86 51.47 / 85.23 76.18
MLS 76.19 / 72.51 59.49 / 80.41 67.84 / 76.46 30.63 / 91.16 46.11 / 88.39 37.86 / 89.17 38.20 / 89.58 76.18
GEN 75.72 / 72.01 54.88 / 81.70 65.30 / 76.85 26.12 / 92.44 46.23 / 87.60 34.52 / 89.26 35.62 / 89.77 76.18
RMDS 77.88 / 71.77 52.20 / 82.22 65.04 / 76.99 33.67 / 87.24 48.80 / 86.08 40.27 / 85.84 40.91 / 86.38 76.18
TempScale 73.90 / 72.87 55.12 / 81.41 64.51 / 77.14 37.70 / 90.50 56.92 / 84.95 45.39 / 87.22 46.67 / 87.56 76.18
ReAct 77.57 / 73.02 55.92 / 81.73 66.75 / 77.38 16.73 / 96.34 29.63 / 92.79 32.58 / 91.87 26.31 / 93.67 75.58
ASH-S 70.80 / 74.72 53.26 / 84.54 62.03 / 79.63 11.02 / 97.72 10.90 / 97.87 28.60 / 93.82 16.86 / 96.47 75.51
SCALE (Ours) 67.72 / 77.35 51.80 / 85.37 59.76 / 81.36 9.51 / 98.02 11.90 / 97.63 28.18 / 93.95 16.53 / 96.53 76.18
Table 6: FPR95 \downarrow / AUROC \uparrow for ResNet50 on ImageNet under the OpenOOD v1.5 benchmark.
Model Methods iNaturalist (FPR95↓ / AUROC↑) SUN (FPR95↓ / AUROC↑) Places (FPR95↓ / AUROC↑) Textures (FPR95↓ / AUROC↑) Average (FPR95↓ / AUROC↑) ID ACC↑
ResNet50 MSP 54.99 87.74 70.83 80.86 73.99 79.76 68.00 79.61 66.95 81.99 76.12
EBO 55.72 89.95 59.26 85.89 64.92 82.86 53.72 85.99 58.41 86.17 76.12
ReAct 20.38 96.22 24.20 94.20 33.85 91.58 47.30 89.80 31.43 92.95 -
DICE 25.63 94.49 35.15 90.83 46.49 87.48 31.72 90.30 34.75 90.77 -
DICE + ReAct 18.64 96.24 25.45 93.94 36.86 90.67 28.07 92.74 27.25 93.40 -
ASH-S 11.49 97.87 27.98 94.02 39.78 90.98 11.93 97.60 22.80 95.12 74.98
SCALE (Ours) 9.50 98.17 23.27 95.02 34.51 92.26 12.93 97.37 20.05 95.71 76.12
Table 7: OOD detection results for ResNet50 following the exact same metrics and testing splits as Sun et al. (2021). ResNet is trained with ID data (ImageNet-1k) only. \uparrow indicates larger values are better and \downarrow indicates smaller values are better. All values are percentages. SCALE consistently performs better than ASH-S across all OOD datasets.
Table 8: Detailed results for CIFAR-10.
Method SVHN (FPR95↓ / AUROC↑) LSUN-c (FPR95↓ / AUROC↑) LSUN-r (FPR95↓ / AUROC↑) iSUN (FPR95↓ / AUROC↑) Textures (FPR95↓ / AUROC↑) Places365 (FPR95↓ / AUROC↑) Average (FPR95↓ / AUROC↑) ID ACC↑
MSP 47.24 93.48 33.57 95.54 42.10 94.51 42.31 94.52 64.15 88.15 63.02 88.57 48.73 92.46 94.53
EBO 40.61 93.99 3.81 99.15 9.28 98.12 10.07 98.07 56.12 86.43 39.40 91.64 26.55 94.57 94.53
ReAct 41.64 93.87 5.96 98.84 11.46 97.87 12.72 97.72 43.58 92.47 43.31 91.03 26.45 94.67 -
DICE 25.99±5.10 95.90±1.08 0.26±0.11 99.92±0.02 3.91±0.56 99.20±0.15 4.36±0.71 99.14±0.15 41.90±4.41 88.18±1.80 48.59±1.53 89.13±0.31 20.83±1.58 95.24±0.24 -
ASH-S 6.51 98.65 0.90 99.73 4.96 98.92 5.17 98.90 24.34 95.09 48.45 88.34 15.05 96.61 94.02
SCALE (Ours) 5.80 98.72 0.73 99.74 3.36 99.22 3.43 99.21 23.42 94.97 38.69 91.74 12.57 97.27 94.53
Table 9: Detailed results for CIFAR-100.
Method SVHN (FPR95↓ / AUROC↑) LSUN-c (FPR95↓ / AUROC↑) LSUN-r (FPR95↓ / AUROC↑) iSUN (FPR95↓ / AUROC↑) Textures (FPR95↓ / AUROC↑) Places365 (FPR95↓ / AUROC↑) Average (FPR95↓ / AUROC↑) ID ACC↑
MSP 81.70 75.40 60.49 85.60 85.24 69.18 85.99 70.17 84.79 71.48 82.55 74.31 80.13 74.36 75.04
EBO 87.46 81.85 14.72 97.43 70.65 80.14 74.54 78.95 84.15 71.03 79.20 77.72 68.45 81.19 75.04
ReAct 83.81 81.41 25.55 94.92 60.08 87.88 65.27 86.55 77.78 78.95 82.65 74.04 62.27 84.47 -
DICE 54.65±4.94 88.84±0.39 0.93±0.07 99.74±0.01 49.40±1.99 91.04±1.49 48.72±1.55 90.08±1.36 65.04±0.66 76.42±0.35 79.58±2.34 77.26±1.08 49.72±1.69 87.23±0.73 -
ASH-S 25.02 95.76 5.52 98.94 51.33 90.12 46.67 91.30 34.02 92.35 85.86 71.62 41.40 90.02 71.65
SCALE (Ours) 22.05 96.29 4.48 99.16 46.02 91.54 42.14 92.47 34.20 92.34 85.04 72.66 38.99 90.74 75.04