
AUTO: Adaptive Outlier Optimization for Online Test-Time OOD Detection

Puning Yang1,2, Jian Liang1, Jie Cao1, and Ran He1,2
1 CRIPAC & MAIS, Institute of Automation, Chinese Academy of Sciences
2 School of Artificial Intelligence, University of Chinese Academy of Sciences
[email protected], [email protected]
Abstract

Out-of-distribution (OOD) detection is a crucial aspect of deploying machine learning models in open-world applications. Empirical evidence suggests that training with auxiliary outliers substantially improves OOD detection. However, such outliers typically exhibit a distribution gap compared to the test OOD data and do not cover all possible test OOD scenarios. Additionally, incorporating these outliers introduces additional training burdens. In this paper, we introduce a novel paradigm called test-time OOD detection, which utilizes unlabeled online data directly at test time to improve OOD detection performance. While this paradigm is efficient, it also presents challenges such as catastrophic forgetting. To address these challenges, we propose adaptive outlier optimization (AUTO), which consists of an in-out-aware filter, an ID memory bank, and a semantically-consistent objective. AUTO adaptively mines pseudo-ID and pseudo-OOD samples from test data, utilizing them to optimize networks in real time during inference. Extensive results on CIFAR-10, CIFAR-100, and ImageNet benchmarks demonstrate that AUTO significantly enhances OOD detection performance. The code will be available at https://github.com/Puning97/AUTO-for-OOD-detection.

1 Introduction

Deep neural networks often exhibit overconfidence when predicting out-of-distribution (OOD) samples, which undermines their reliability in open-world deployment [36, 2]. Therefore, discerning OOD data from in-distribution (ID) data is critical and motivates recent studies in OOD detection [13]. Existing solutions focus on either designing new scoring functions [13, 25, 22, 27, 39, 49] or training models with auxiliary outliers [14, 27, 7, 31, 43, 48]. The latter paradigm, known as Outlier Exposure based (OE-based) methods, has been shown to be effective due to its ability to leverage the knowledge of outliers.

Despite the promising results, the distribution gap between the training outliers and test data can mislead the model, thus impairing the performance of OOD detection. In addition, a limited set of training outliers can hardly cover the variety of test OOD scenarios, resulting in unstable performance. Addressing both issues is essential but challenging for OE-based methods. Notably, WOODS [19] directly samples the outlier set from unlabeled test data, thus narrowing the distribution gap between training outliers and test OOD data. Results in Figure LABEL:fig:first reveal the benefits of this scheme. However, it still incurs additional computational costs during training. Moreover, the finite test OOD data sampled by WOODS during the training phase cannot represent all potential OOD samples in the current deployment, leading to sub-optimal performance.

To overcome the above limitations, we first formulate a practical and challenging setting called test-time OOD detection. We suggest a simple yet powerful way to access test OOD data: directly modifying models at test time. Our setting does not impose any burden during the training phase but modifies the model at test time, offering strong efficiency, flexibility, and generality. To learn from unlabeled online data, we propose to update the model once an incoming test sample is identified as a pseudo-OOD sample. However, optimizing the model with only pseudo-OOD samples causes catastrophic forgetting [29], which degrades performance on both ID and OOD tasks.

Accordingly, we propose Adaptive oUTlier Optimization (AUTO), which consists of an in-out-aware filter, a dynamic memory bank, and a semantically-consistent objective. Specifically, the in-out-aware filter is initialized using the training ID data to identify pseudo-OOD and pseudo-ID samples and expose them to models at test time. The dynamic memory bank contains one ID sample per category, from which the model learns alongside the identified outliers during testing. Besides, we observe a potential bias in the test-time optimization, which leads to degraded ID performance. Thus, we design a simple objective to keep the prediction of the latest model consistent with that of the original model, which further helps the model maintain ID accuracy. Extensive evaluations suggest that AUTO substantially enhances OOD detection performance and outperforms competing methods by a large margin.

Our contributions are summarized as follows:

  • We explore a practical and challenging problem, test-time out-of-distribution detection, which directly optimizes networks with unlabeled test data at test time.

  • We introduce AUTO. Besides selecting and learning from pseudo-OOD samples, it also employs an ID memory bank and a semantically-consistent objective to overcome the forgetting issue.

  • Extensive results demonstrate that AUTO achieves significant improvements in all OOD detection tasks while preserving ID classification performance.

2 Related Work

OOD Detection with ID data only. Many studies have explored the OOD detection task in the past few years. The OpenMax score [1] is the first method to detect OOD samples by utilizing Extreme Value Theory. Hendrycks et al. [13] present a baseline using the Maximum Softmax Probability (MSP), which, however, may not be well-suited for OOD detection [34]. Later on, researchers have improved OOD detection performance in two ways: (1) advanced score functions are proposed, including the ODIN score [25], Mahalanobis distance-based score [22], Energy score [27, 47, 26], GradNorm score [15], and MaxLogit score [12]; (2) novel techniques are proposed to modify the logit space, including SSD [38], ReAct [39], GEM [34], KNN [41], LogitNorm [49], DICE [40], and CIDER [32].

OOD Detection with auxiliary outliers. Another promising direction for OOD detection involves auxiliary outliers for model regularization. The pioneering work, Outlier Exposure (OE) [14], trains models with auxiliary outliers, inspiring a new line of work such as CCU [30], SOFL [33], Energy [27], ATOM [3], and POEM [31]. Considering the scarcity of training outliers, recent works synthesize pseudo-OOD samples from ID samples (VOS [7], NPOS [43], and DOE [48]) or sample test data directly [19]. It is noteworthy that WOODS [19] has tried to leverage test data to improve detection performance. However, WOODS requires access to test data at training time, which is not feasible in some security-sensitive scenarios (e.g., federated learning). In contrast, AUTO directly utilizes data at test time, which is more general and flexible.

Test-time optimization. To address distribution shift without leakage of training data, a recent paradigm called source-free domain adaptation [23, 24, 21] adapts the pre-trained source model to the whole target domain. However, this paradigm is not suitable for streaming data where each test sample can be seen only once. A pioneering work [42] designs a customized multi-head model and provides two modes (i.e., vanilla and online) for adapting the model to a single instance at test time. While the vanilla mode [56] individually adapts the pre-trained model to each test sample, the online mode can reuse past knowledge to help recognize the incoming sample better. A number of follow-up methods [46, 18] adopt the online test-time optimization manner but do not require a customized pre-trained model, making test-time adaptation more attractive. However, all these methods aim to improve the generalization ability of the pre-trained model and cannot handle the unknown classes in the OOD detection task. Notably, ETLT [9] proposes calibrating OOD scores by establishing the correlation between a sample's OOD score and its input feature at test time.

Figure 2: Illustration of the Adaptive oUTlier Optimization (AUTO) framework. The key components include a dynamic ID memory bank, an in-out-aware filter, and a semantically-consistent loss \mathcal{L}_{SC}. Different colors denote different operations at test time: each sample is given the MSP score S^{t}(x^{t}) and judged by the filter. Then, according to the judgment, the sample activates different operations. For instance, if it is recognized as a pseudo-ID sample, the blue lines are activated: this sample will be utilized to replace the sample with the same label in the ID memory bank.

3 Online Test-Time OOD Detection

3.1 Preliminary: OOD Detection

Let \mathcal{X}\subseteq\mathbb{R}^{n} be the input space, \mathcal{Y}=\left\{1,...,C\right\} be the label space, and {h}=\rho\circ\varphi:\mathcal{X}\to\mathbb{R}^{C} be the model, where \rho is the classifier and \varphi is the feature extractor. The supervised methods aim to learn the joint data distribution from the training set \mathcal{D}_{id}. Let \mathcal{P}_{ID} and \mathcal{P}_{OOD} denote the marginal ID and OOD distributions on \mathcal{X}, respectively. The OOD detector f_{\lambda}(\cdot) aims to find a proper model {h} and score function S(\cdot) that detect OOD data well:

f_{\lambda}({x})=\begin{cases}{ID}&S({x})\geq\beta\\ {OOD}&S({x})<\beta\end{cases}, (1)

where \beta is a given threshold. OE [14] aims to obtain a better {h} by regularizing the target model to produce low-confidence predictions for test OOD data. Thus, OE introduces auxiliary outliers at training time. Let \mathcal{D}_{aux} be the auxiliary OOD dataset on \mathcal{X}, which is disjoint from the test OOD dataset. The learning objective can be written as:

\mathbb{E}_{({x},{y})\sim\mathcal{D}_{id}}[\ell_{\mathrm{CE}}({h}({x}),{y})]+\lambda\mathbb{E}_{({x})\sim\mathcal{D}_{aux}}[\ell_{\mathrm{OE}}({h}({x}))], (2)

where \lambda is a trade-off hyperparameter set to 0.5 for vision tasks, \ell_{\mathrm{CE}} is the cross-entropy loss, and \ell_{\mathrm{OE}} is defined as the Kullback-Leibler divergence between the softmax predictions and the uniform distribution. For terminological consistency, in this paper we refer to training-time examples as auxiliary outliers and reserve the term OOD data for test-time unknown inputs.
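To make the setup concrete, below is a minimal sketch of the score-threshold detector in Eq. (1) and the OE objective in Eq. (2), using the MSP score for S(\cdot). The function names (msp_score, detect, oe_objective) are our own, and the snippet is an illustrative sketch rather than the authors' released code.

  import torch
  import torch.nn.functional as F

  @torch.no_grad()
  def msp_score(h, x):
      # S(x): maximum softmax probability produced by the classifier h
      return F.softmax(h(x), dim=-1).max(dim=-1).values

  def detect(h, x, beta):
      # Eq. (1): label a sample ID if its score reaches the threshold beta, else OOD
      return ["ID" if s >= beta else "OOD" for s in msp_score(h, x).tolist()]

  def oe_objective(h, x_id, y_id, x_aux, lam=0.5):
      # Eq. (2): cross-entropy on labeled ID data ...
      ce = F.cross_entropy(h(x_id), y_id)
      # ... plus a term pulling softmax outputs on auxiliary outliers toward the
      # uniform distribution (cross-entropy to uniform, equal to the KL term up to a constant)
      oe = -F.log_softmax(h(x_aux), dim=-1).mean()
      return ce + lam * oe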

3.2 Challenges of Test-Time OOD Detection

This paper formalizes the problem of enhancing OOD detection in an online unlabeled data stream at test time. With this new setting, concrete questions arise: (1) How to roughly distinguish test data as ID or OOD? Tackling this question helps select the pseudo-OOD and pseudo-ID samples that AUTO learns from during testing. (2) Assuming that continual, image-by-image learning is effective, will constant changes lead to catastrophic forgetting? If so, how to overcome it? (3) Unlike training-phase methods, test-time methods must also account for update efficiency, especially in time-critical tasks. Thus, how to efficiently update the model?

4 Adaptive Outlier Optimization

In this section, we first describe the proposed Adaptive oUTlier Optimization (AUTO) framework. As illustrated in Figure 2, AUTO comprises three key components: an in-out-aware filter that tackles the selection of training samples (Section 4.1), and a dynamic ID memory bank together with a semantically-consistent objective that tackle the forgetting issue (Section 4.2). Then, we elaborate on the parameter updating strategy for efficient model optimization (Section 4.3). Finally, we present the full algorithm, in which the above components work systematically as a whole and reinforce each other (Section 4.4).

4.1 Adaptive In-out-aware Filter

We first model the distribution of open-world data \mathcal{P}_{OPEN} with the Huber contamination model [17]:

\mathcal{P}_{OPEN}=\kappa\mathcal{P}_{ID}+(1-\kappa)\mathcal{P}_{OOD}, (3)

where \kappa\in[0,1). The mixture of unlabeled data from \mathcal{P}_{OPEN} poses a unique obstacle for differentiated optimization based on the data distribution. In our method, pseudo-OOD samples are roughly distinguished from test ID data by our adaptive in-out-aware filter, which is initialized with statistics from the training ID data. Concretely, given ID examples {x}_{id}^{i}\sim\mathcal{P}_{ID}, i\in[1,N], we compute the MSP [13] score S^{0}({x}_{id}^{i}) of each sample and then estimate the mean \mu_{in} and standard deviation \sigma_{in} of the ID data:

\mu_{in}=\frac{\sum_{i=1}^{N}S^{0}({x}_{id}^{i})}{N},\quad\sigma_{in}=\sqrt{\frac{\sum_{i=1}^{N}(S^{0}({x}_{id}^{i})-\mu_{in})^{2}}{N}}. (4)

Then, the outlier-aware and inner-aware margins are initialized as follows:

m_{in}^{0}=\mu_{in}+k_{1}\times\sigma_{in},~{}~{}~{}m_{out}^{0}=\mu_{in}-k_{2}\times\sigma_{in}, (5)

where k_{1} and k_{2} are hyperparameters. On the one hand, we regard a sample with a score higher than m_{in} as a pseudo-ID sample. We keep m_{in} fixed during test-time optimization, and this setting works well in our experiments. On the other hand, the confidence scores of all samples decrease as we update the model with outliers [14, 27]. Taking this phenomenon into consideration, we update m_{out} with a greedy strategy. A sample {x}^{t} with a score lower than m_{out} is recognized as a pseudo-OOD sample and denoted \hat{{x}}^{t}_{ood}. Then, we use \hat{{x}}^{t}_{ood} to optimize the model as follows:

\mathcal{L}_{\textrm{ood}}=\ell_{\mathrm{OE}}({h}^{t}(\hat{{x}}^{t}_{ood})), (6)

where {h}^{t} represents the latest updated model.

Meanwhile, we record the mean of the historical OOD scores of the pseudo-OOD samples and use this mean value to update m_{out}. Assuming that we have recorded the mean score of M pseudo-OOD samples when the t-th sample arrives:

m_{out}^{t+1}=\begin{cases}\frac{M\cdot m_{out}^{t}+S^{t}({x}^{t})}{M+1}&\text{ if }S^{t}({x}^{t})<m_{out}^{t},\\ m_{out}^{t}&\text{else.}\end{cases} (7)
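A sketch of the in-out-aware filter under our reading of Eqs. (4), (5), and (7); the helper names and the "uncertain" label for samples falling between the two margins are assumptions made for illustration.

  import numpy as np

  def init_margins(id_msp_scores, k1, k2):
      # Eqs. (4)-(5): estimate ID statistics once, before deployment
      mu_in = float(np.mean(id_msp_scores))
      sigma_in = float(np.std(id_msp_scores))
      m_in = mu_in + k1 * sigma_in      # inner-aware margin (kept fixed afterwards)
      m_out = mu_in - k2 * sigma_in     # outlier-aware margin (updated greedily, Eq. 7)
      return m_in, m_out

  def filter_sample(score_t, m_in, m_out, num_ood):
      # Returns the pseudo label of the incoming sample and the updated margin/count.
      if score_t > m_in:
          return "pseudo-ID", m_out, num_ood
      if score_t < m_out:
          # Eq. (7): fold the new pseudo-OOD score into the running margin
          m_out = (num_ood * m_out + score_t) / (num_ood + 1)
          return "pseudo-OOD", m_out, num_ood + 1
      return "uncertain", m_out, num_ood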

4.2 Anti-forgetting Components

Dynamic ID memory bank. We introduce a dynamic memory bank \mathcal{M}_{id} into the ID classification loss formulated in Eq. 2. The memory bank stores one sample per category and is initialized with samples randomly selected from the training data. We update the samples in the memory bank with test-time ID data of the same category. Concretely, given a test-time sample \hat{{x}}_{id} whose score is higher than the inner-aware margin m_{in} and its pseudo label \hat{{y}}_{id}, we utilize it to update the memory bank as follows:

\hat{{x}}_{id}\to{x}_{\mathcal{M}},\quad\text{if }\hat{{y}}_{id}={y}_{\mathcal{M}}. (8)

Empirically, we notice that training with only \mathcal{M}_{id} does not help improve OOD detection, thus we do not modify our model until encountering an \hat{{x}}_{ood}. The test-time ID objective \mathcal{L}_{id} is formulated as follows:

\mathcal{L}_{id}=\mathbb{E}_{({x},{y})\sim\mathcal{M}_{id}}[\ell_{\mathrm{CE}}({h}({x}),{y})]. (9)
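A sketch of the dynamic ID memory bank and the test-time ID objective of Eqs. (8)-(9), kept as a simple dictionary keyed by class label; the class name and the storage layout are our assumptions.

  import torch
  import torch.nn.functional as F

  class IDMemoryBank:
      """Dynamic memory bank holding one sample per class (initialized from training data)."""
      def __init__(self, init_samples):
          # init_samples: dict mapping class label -> image tensor of shape (C, H, W)
          self.bank = dict(init_samples)

      def update(self, x_id_hat, y_id_hat):
          # Eq. (8): the pseudo-ID test sample replaces the stored sample of the same class
          self.bank[int(y_id_hat)] = x_id_hat.detach()

      def id_loss(self, h):
          # Eq. (9): cross-entropy over all samples currently in the memory bank
          xs = torch.stack(list(self.bank.values()))
          ys = torch.tensor(list(self.bank.keys()), device=xs.device)
          return F.cross_entropy(h(xs), ys)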

Semantically-Consistent Objective. We find that, at the beginning of test-time optimization, m_{out}^{0} may misclassify some ID samples as OOD. Such misclassifications subsequently confuse the model during optimization. To address this problem, we propose to maintain consistency between the predictions of the original model and the updated model. Specifically, at the beginning of the testing stage, we make a duplicate of the model {h}^{0} and freeze its parameters. The prediction of the duplicated model is denoted as {y}^{0}. Intuitively, if the results predicted by the model remain consistent with {y}^{0}, the performance on the source task will not degrade. Let p^{t}_{{y}} denote the softmax probability that the t-th sample belongs to class {y}. Our objective is:

\mathcal{L}_{SC}=\begin{cases}0,&\text{ if }y^{0}=y^{t}\\ p^{t}_{y^{t}}-p^{t}_{y^{0}}+\phi,&\text{ if }y^{0}\neq y^{t}\end{cases}, (10)

which encourages p^{t}_{y^{0}} to stay close to p^{t}_{y^{t}} with a margin \phi. That is, the probability that h^{t} assigns to the original prediction y^{0} is supposed to exceed that of its new prediction y^{t} by at least \phi. In this work, we empirically set \phi to a pre-defined value. It is worth noting that \phi should not be so large that it interferes with the optimization pulling the pseudo-OOD sample toward the uniform distribution.
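A single-sample sketch of the semantically-consistent objective in Eq. (10), mirroring the pseudo-code in the Appendix; h_t is the latest model, h_0 the frozen duplicate, and phi=0.2 follows the hyperparameters reported later.

  import torch
  import torch.nn.functional as F

  def sc_loss(h_t, h_0, x, phi=0.2):
      # Eq. (10): penalize the current model only when its prediction drifts away
      # from that of the frozen original model h_0
      p_t = F.softmax(h_t(x), dim=-1).squeeze(0)
      with torch.no_grad():
          y_0 = h_0(x).argmax(dim=-1).item()   # prediction of the frozen copy
      y_t = int(p_t.argmax().item())           # prediction of the latest model
      if y_t == y_0:
          return torch.zeros((), device=p_t.device)
      return p_t[y_t] - p_t[y_0] + phi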

4.3 Modulation Parameters

Let \theta denote all the parameters of the model. Updating the whole \theta is a natural choice, but it is sub-optimal for test-time OOD detection. Therefore, we consider optimizing only part of the network parameters. While this operation is common in many tasks, there is still a lack of research on which part should be updated to improve OOD detection performance efficiently. Following the partial updating principle [46], we explore the influence of optimizing different combinations of parameters, e.g., the last feature block \theta_{last}, all batch normalization layers \theta_{bn}, and the classifier \theta_{fc}. Table 7 reports the results of optimizing these combinations. We finally optimize \theta_{last} while keeping the remaining parameters fixed during testing.
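A sketch of how the modulation parameters could be restricted to the last feature block; the attribute name layer4 assumes a torchvision-style ResNet and is an illustrative choice, not the authors' exact code.

  import torch

  def build_test_time_optimizer(model, lr=0.001):
      # Optimize only theta_last (the last feature block) and freeze everything else.
      for p in model.parameters():
          p.requires_grad_(False)
      for p in model.layer4.parameters():      # last residual block of a torchvision ResNet
          p.requires_grad_(True)
      trainable = [p for p in model.parameters() if p.requires_grad]
      # SGD without momentum or weight decay, following the test-time settings in Sec. 5.1
      return torch.optim.SGD(trainable, lr=lr, momentum=0.0, weight_decay=0.0)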

4.4 Overall Objective

The overall test-time optimization objective is formulated as follows:

\mathcal{L}_{total}=\mathcal{L}_{id}+\lambda_{1}\mathcal{L}_{ood}+\lambda_{2}\mathcal{L}_{SC}, (11)

where \lambda_{1} and \lambda_{2} are set to 1 and 0.1, respectively. To adequately leverage the information in the outlier and the memory bank, we repeat the optimization process on each outlier iteratively for a given number of iterations, T. While seemingly separate from each other, the three components of AUTO work collaboratively. First, the in-out-aware filter selects high-quality ID and OOD samples from the unlabeled test data, which facilitates the positive update of models. Second, the anti-forgetting components help the model enlarge the margin between ID and OOD data, which pays back to the filter and helps it select samples more accurately. The entire training process converges when the three components perform satisfactorily. The pseudo-code is provided in the Appendix.
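Putting the pieces together, a hedged sketch of one optimization step for Eq. (11); it reuses the IDMemoryBank, sc_loss, and optimizer sketches above, so the names are ours, and the full procedure is given in the Appendix pseudo-code.

  import torch.nn.functional as F

  def auto_update(h_t, h_0, x_ood_hat, memory, optimizer,
                  lam1=1.0, lam2=0.1, phi=0.2, T=2):
      # One test-time update triggered by a pseudo-OOD sample, repeated T times.
      for _ in range(T):
          l_ood = -F.log_softmax(h_t(x_ood_hat), dim=-1).mean()   # Eq. (6), uniform target
          l_id = memory.id_loss(h_t)                              # Eq. (9), memory bank replay
          l_sc = sc_loss(h_t, h_0, x_ood_hat, phi)                # Eq. (10), consistency
          loss = l_id + lam1 * l_ood + lam2 * l_sc
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()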

Method CIFAR-10 CIFAR-100
ResNet-34 WideResNet-40-2 ResNet-34 WideResNet-40-2
FPR95\downarrow AUROC\uparrow ID_Acc\uparrow FPR95\downarrow AUROC\uparrow ID_Acc\uparrow FPR95\downarrow AUROC\uparrow ID_Acc\uparrow FPR95\downarrow AUROC\uparrow ID_Acc\uparrow
MSP [13] 46.49 92.53 94.87 52.00 90.57 94.53 83.53 74.34 77.51 79.15 76.44 75.84
ODIN[25] 30.00 93.94 94.87 34.32 91.38 94.53 82.76 75.27 77.51 69.75 81.29 75.84
Mahalanobis [22] 44.31 93.31 94.87 25.61 95.19 94.53 75.56 80.82 77.51 71.14 79.71 75.84
Energy [27] 28.77 94.07 94.87 33.41 91.53 94.53 82.65 75.33 77.51 69.65 81.30 75.84
ReAct [39] 32.57 93.16 94.85 58.67 82.85 93.41 74.76 82.01 77.09 92.01 64.53 64.77
Logit Norm [49] 18.14 96.61 94.68 21.03 95.86 94.42 76.08 76.83 76.40 54.90 87.60 76.02
KNN [41] 36.71 94.15 94.87 36.63 93.31 94.53 71.33 82.44 77.51 59.92 84.36 75.84
OE [14] 6.02 98.58 95.08 7.08 98.51 94.44 58.52 87.30 76.84 54.04 85.82 75.59
Energy with \mathcal{D}_{aux} [27] 2.93 98.71 95.49 2.91 98.97 94.91 53.02 90.05 77.19 44.43 90.47 75.75
POEM [31] 11.25 97.62 89.57 7.17 98.37 90.62 19.78 95.94 69.49 24.30 95.96 69.37
WOODS [19] 10.10 97.75 94.79 12.14 97.58 94.72 34.90 91.21 77.84 22.65 94.54 75.74
DOE [48] 8.93 97.84 94.74 5.00 98.75 94.43 32.43 93.65 76.95 26.09 94.43 74.98
AUTO 6.47 98.64 94.98 9.45 97.94 93.33 11.85 97.36 77.90 17.55 95.55 73.71
Table 1: Comparison with competitive OOD detection methods on CIFAR benchmarks. \uparrow indicates larger values are better and vice versa. All values are percentages and are averaged over six OOD test datasets described in Section 5.1. Bold numbers indicate superior results. \kappa in Eq. (3) is set to 0.5 in CIFAR benchmarks.

5 Experiments

In this section, we evaluate AUTO for test-time OOD detection on the CIFAR-10/100 and ImageNet benchmarks. We compare AUTO with previous OOD detection methods in terms of both OOD and ID performance (Section 5.2). Besides, we present extensive ablation experiments to validate the robustness of AUTO (Section 5.3).

5.1 Setup

Benchmarks. Following common benchmarks in the OOD detection literature [27, 7], we evaluate our method on CIFAR-10/100 [20] and ImageNet [5]. For CIFAR benchmarks, we consider six common OOD datasets: SVHN [35], Textures [4], LSUN-Crop [53], LSUN-Resize [53], iSUN [52], and Places365 [57]. Images in CIFAR benchmarks are resized to 32\times 32. For the ImageNet benchmark, we use subsets of four datasets: SUN [51], Textures [4], Places50 [57], and iNaturalist [45]. Images in the ImageNet benchmark are resized to 224\times 224. We provide more details of the ID and OOD datasets and categories in the Appendix. The batch size is set to 1 during testing.

Evaluation metrics. We evaluate our framework and baseline methods using the following metrics: 1) The false positive rate of OOD samples when the true positive rate of in-distribution samples is at 95% (FPR95); 2) The area under the receiver operating characteristic curve (AUROC); and 3) The ID classification accuracy (ID_Acc).
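For clarity, a short sketch of how FPR95 and AUROC can be computed from per-sample scores with scikit-learn; this is a standard evaluation recipe (ID samples as the positive class), not the authors' exact evaluation script.

  import numpy as np
  from sklearn import metrics

  def ood_metrics(id_scores, ood_scores):
      # ID samples are the positive class; higher scores should mean "more ID".
      labels = np.concatenate([np.ones(len(id_scores)), np.zeros(len(ood_scores))])
      scores = np.concatenate([id_scores, ood_scores])
      fpr, tpr, _ = metrics.roc_curve(labels, scores)
      auroc = metrics.auc(fpr, tpr)
      # FPR95: false positive rate at the first threshold where TPR on ID data reaches 95%
      fpr95 = float(fpr[np.argmax(tpr >= 0.95)])
      return fpr95, auroc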

Training details. For CIFAR benchmarks, we train two backbones from scratch: ResNet-34 [11] and the Wide ResNet [54] architecture with 40 layers and a widening factor of 2. For the ImageNet benchmark, we use a pre-trained ResNet-50 model [11] from PyTorch [37] and a pre-trained Vision Transformer [6] from the Timm library [50].

For model modification during testing, we use stochastic gradient descent with the learning rate set to that of the last training epoch, which is 0.001 in all our experiments. We set weight decay and momentum to zero during test-time OOD detection, inspired by the practice in [28, 10]. We do not tune our hyperparameters for each \mathcal{D}_{open}^{test} distribution, so that \mathcal{D}_{open}^{test} remains unknown, as in open-world scenarios. We use a cross-validation strategy to select \lambda_{1},\lambda_{2},\phi,k_{1},k_{2}, as shown in the Appendix.

Due to space constraints, additional experiments and ablation studies are included in the Appendix.

5.2 Main Results

AUTO outperforms baselines trained without auxiliary outliers. We compare AUTO with post hoc OOD detection methods, including MSP [13], ODIN [25], Mahalanobis [22], Energy [27], LogitNorm [49], ReAct [39], DICE [40], and KNN [41]. Like post hoc OOD detection methods, AUTO does not require additional modifications at training time. As shown in Table 1, AUTO outperforms all post hoc methods by a large margin. Compared with the best baseline in each data-net pair, AUTO reduces the FPR95 by 30.24% (ResNet-34) and 27.18% (WRN) on CIFAR-10 and by 59.48% (ResNet-34) and 42.37% (WRN) on CIFAR-100. Besides, as shown in Table 2, AUTO can further enhance OOD performance on top of other post hoc methods. Accessing test OOD data greatly benefits AUTO with an acceptable increase in inference time (as shown in Table 7). The superior performance demonstrates that AUTO could be a complementary method for all well-trained models to efficiently excavate their intrinsic OOD discriminative capability with lightweight modifications during testing.

Methods FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow
ReAct [39] 32.57 93.16 94.85
ReAct+AUTO 6.13 98.69 94.93
LogitNorm [49] 18.14 96.61 94.68
LogitNorm+AUTO 12.60 97.31 93.49
Table 2: Results on well-trained models from post hoc methods. Model is trained on CIFAR-10 using ResNet-34.
Backbone Methods OOD Datasets Average ID_Acc\uparrow
SUN Textures iNaturalist Places
FPR95\downarrow AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow AUROC\uparrow FPR95\downarrow AUROC\uparrow
ResNet-50 MSP[13] 68.53 81.75 66.15 80.46 52.69 88.42 71.59 80.63 64.74 82.82 76.12
ODIN[25] 54.04 86.89 45.50 87.57 41.50 91.38 62.12 84.45 50.79 87.57 76.12
Mahalanobis[22] 98.35 42.10 54.78 85.02 96.95 52.60 98.47 42.01 87.14 55.43 76.12
Energy[27] 58.25 86.73 52.30 86.73 53.94 90.60 65.40 84.12 57.47 87.05 76.12
ReAct[39] 23.69 94.44 46.33 90.30 19.71 96.37 33.30 91.96 30.76 93.27 74.82
DICE+ReAct[40] 26.49 93.83 29.36 92.65 20.07 96.11 38.35 90.61 28.57 93.30 67.01
KNN[41] 70.50 80.46 11.26 97.41 60.30 86.09 78.81 74.66 55.22 84.66 76.12
OE*[14] 80.10 76.55 66.38 82.04 78.31 75.23 70.41 81.78 73.80 78.90 75.51
MixOE*[55] 74.62 79.81 58.00 85.83 80.51 74.30 84.33 69.20 74.36 77.28 74.62
VOS*[7] 98.72 38.50 70.20 83.62 94.83 57.69 87.75 65.65 87.87 61.36 74.43
DOE*[48] 80.94 76.26 34.67 88.90 55.87 85.98 67.84 83.05 59.83 83.54 75.50
AUTO 9.84 97.49 19.11 95.56 2.48 99.33 22.44 94.10 13.47 96.62 73.38
ViT-B-16 MSP[13] 73.80 79.49 63.07 81.50 39.40 92.41 74.09 79.56 62.59 83.24 78.01
ODIN[25] 62.81 83.20 51.45 86.31 30.28 92.65 66.21 81.51 52.69 85.92 78.01
Mahalanobis[22] 79.88 81.82 72.10 80.33 18.22 95.37 84.05 73.70 63.57 82.81 78.01
Energy[27] 69.29 84.52 51.97 88.30 37.84 94.46 72.03 82.74 57.78 87.51 78.01
ReAct[39] 72.19 84.12 53.17 88.12 29.54 95.19 74.15 82.22 57.26 87.41 78.01
KNN [41] 51.01 89.46 41.12 90.55 7.32 98.50 54.08 88.31 38.38 91.71 78.01
VOS[7] 43.03 91.92 56.67 87.64 31.65 94.53 41.62 90.23 43.24 90.86 79.64
NPOS[43] 28.96 94.63 57.39 85.91 27.63 94.75 35.45 91.63 37.36 91.73 79.55
AUTO 15.60 96.32 27.71 94.05 0.87 99.77 33.97 92.97 19.54 95.78 79.45
Table 3: Comparison with competitive OOD detection methods on the ImageNet benchmark. \uparrow indicates larger values are better and vice versa. We take part of the results directly from DOE [48] (marked with *) and from NPOS [43].

AUTO achieves competitive results compared with OE-based counterparts. In Table 1, we contrast AUTO with OE-based methods, including OE [14], Energy [27], POEM [31], WOODS [19], and DOE [48]. All the baseline methods are fine-tuned using the same pre-trained model. Despite not being optimal on CIFAR-10, AUTO's improvement is still remarkable, with 7.93% (ResNet-34) and 3.71% (WRN) better results on CIFAR-100. Note that the tiny-ImageNet dataset is adopted as the auxiliary outlier, which differs substantially from the considered test OOD datasets. Thus, we highlight that AUTO reduces the distribution gap between auxiliary outliers and test OOD data, resulting in superior performance.

AUTO scales effectively to large datasets. Compared with the CIFAR benchmarks, the ImageNet benchmark is more challenging due to the larger feature space (512 \rightarrow 2048) and the larger label space (10/100 \rightarrow 1000). In Table 3, we evaluate AUTO on ImageNet using a ResNet backbone and a Vision Transformer. Compared with post hoc counterparts, we still perform better with both backbones. For OE-based methods, ImageNet-21K-P is adopted as the auxiliary outlier. Since OE-based methods on ImageNet have been published only recently and their code has yet to be released, we refer to the results reported in the original papers [48, 43] for now. Surprisingly, AUTO still significantly outperforms these latest works. The above results suggest that adjusting models at test time is considerably more effective than anticipating the unknowns during training in these complicated scenarios.

AUTO achieves impressive OOD performance while maintaining comparable ID classification accuracy. Deployed models should maintain source-task performance while detecting OOD examples in the open world. We compare the ID classification accuracy of AUTO with post hoc OOD detection methods in Table 1 and Table 3. AUTO improves classification accuracy on ResNet-34 and the Vision Transformer. Compared with the original models, it only underperforms on ResNet-50 and WideResNet-40-2. The gap is just 1.2% on CIFAR-10, 1.13% on CIFAR-100, and 2.74% on ImageNet. Such a loss is acceptable and will not affect the safe deployment of the model.

OOD data: Places365 + SVHN
Method FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow
MSP [13] 82.69 75.07 78.05
OE [14] 74.35 80.19 76.37
WOODS [19] 73.32 80.35 77.10
AUTO 42.16 87.59 77.94
Table 4: Results on the mixed OOD scenario (Places365 + SVHN). Model is trained on CIFAR-100 using ResNet-34.

AUTO can handle mixed OOD scenarios. Previous testing scenarios include only one OOD dataset and one ID dataset. In this paper, we introduce mixed OOD scenarios, in which models encounter at least two kinds of OOD data at test time. Results in Table 4 suggest that the models' performance in this new scenario differs from an arithmetic average of performances in single-OOD scenarios. The intricate composition of data presents challenges for all methods. Nevertheless, AUTO continues to demonstrate exceptional performance, exhibiting an even greater performance advantage over OE and WOODS. This underscores AUTO's superior capability to handle mixed OOD scenarios.

AUTO is flexible in time-series OOD scenarios. Beyond the mixed OOD scenario, our paper also explores time-series OOD scenarios, where the environment may change over time. This means that simply sampling and learning from the target scenario before deployment may not be effective, as the OOD data may change. We evaluate this challenge and demonstrate that AUTO overcomes it by continuously adjusting models. As shown in Table 5, a model trained on CIFAR-100 encounters the LSUN-Resize and Places365 datasets sequentially. Post hoc methods are not affected by this change, but WOODS performs well only on LSUN-Resize and underperforms significantly on Places365. In contrast, AUTO constantly modifies the model, enabling it to adapt to the new CIFAR-100/Places365 scenario.

OOD Method FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow
LSUN_R MSP [13] 84.87 80.81 77.51
OE [14] 75.91 82.53 77.02
WOODS [19] 24.34 94.74 77.80
AUTO 0.74 99.82 78.02
Places365 MSP [13] 80.81 76.05 77.51
OE [14] 63.36 86.23 76.32
WOODS [19] 75.87 81.74 78.00
AUTO 51.75 88.63 77.42
Table 5: Results on the time-series OOD scenario. ResNet-34 trained on CIFAR-100 encounters LSUN-Resize and Places365 sequentially.

AUTO further enhances the OOD detection performance of models trained with auxiliary outliers. The aforementioned findings suggest that the incorporation of AUTO can effectively enhance the ability of models trained solely on ID data to detect OOD samples. To further investigate the potential of AUTO in improving OOD detection performance for models trained with auxiliary outliers, we conducted additional experiments. As presented in Table 6, with the help of AUTO, models trained with auxiliary outliers improve their OOD detection performance further. Moreover, results from Tables 5 and 6 also reveal that the model trained on LSUN-Resize before being deployed to Places365 achieves better performance than the model directly deployed to Places365. These findings demonstrate the versatility of AUTO.

Model FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow
AUTO 52.35 87.57 76.24
OE [14] 63.36 86.23 76.84
OE+AUTO 51.02 88.32 75.67
WOODS [19] 62.50 81.99 77.75
WOODS+AUTO 46.98 88.99 75.23
Table 6: Results on CIFAR-100/Places365 with ResNet-34, AUTO enhances OE-based methods.

5.3 Ablation Study

Modulation parameters. We evaluate the impact of optimizing different sets of parameters, as presented in Table 7. On the one hand, taking both ID and OOD tasks into account, optimizing the last feature block yields superior performance. On the other hand, models in open-world scenarios, particularly those engaged in online streaming applications, must also account for optimization efficiency. We note that the inference time per sample is approximately 5 ms, and AUTO necessitates only a modest 2.4x increase in processing time. Such an increase in time cost is tolerable. Thus, we conclude that using the last feature block as the optimization target is a more efficient strategy.

Modu. Para. FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow Time
No Para. 83.53 74.34 77.51 1x
Block 1 77.72 78.77 78.27 1.8x
Block 2 79.59 77.44 77.47 2.3x
Block 3 35.18 90.23 75.38 2.9x
Block 4 11.85 97.36 77.90 3.4x
BN 20.21 95.19 77.32 3.1x
FC 80.01 77.92 78.02 1.5x
All Para. 35.46 88.16 64.57 11.2x
Table 7: Ablation study on different modulation parameters. Model is trained on CIFAR-100 with ResNet-34.

Components in AUTO. Results presented in Table 8 evaluate the efficacy of the various components. Training the model solely with the ID memory bank leads to performance similar to the method without optimization, indicating that optimizing on ID data alone does not effectively enhance OOD detection. Furthermore, while training on outliers alone improves OOD detection, it results in catastrophic forgetting, as evidenced by the decline in ID classification accuracy. With the help of the ID memory bank, the model jointly updated by both ID and OOD samples already exhibits progress in both OOD detection and ID classification. However, we aim for an even higher level of ID performance, and our results show that incorporating the semantically-consistent objective achieves this goal.

\mathcal{L}_{\textrm{id}} \mathcal{L}_{\textrm{ood}} \mathcal{L}_{\textrm{SC}} FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow
\checkmark 79.89 76.34 77.50
\checkmark 60.26 79.36 65.92
\checkmark \checkmark 12.54 97.12 77.55
\checkmark \checkmark \checkmark 11.85 97.36 77.90
Table 8: Ablation study on different combinations of objectives. Model is trained on CIFAR-100 with ResNet-34.

Parameter analysis of \mathcal{L}_{SC}. We conduct a parameter analysis of \mathcal{L}_{SC} to evaluate the impact of \lambda_{2} and \phi on the model. Firstly, we examine the effect of different values of \lambda_{2} in Figure LABEL:fig:subfig:a and LABEL:fig:subfig:b, with \phi=0.2. Empirical results demonstrate that a larger \lambda_{2} prioritizes the ID classification task over OOD detection. A larger \lambda_{2} also weakens the ID information learned from the dynamic ID memory, harming both ID and OOD tasks. Furthermore, we investigate how the parameter \phi affects OOD detection and ID classification in Figure LABEL:fig:subfig:c and LABEL:fig:subfig:d, with \lambda_{2}=0.1. Our findings confirm that \phi should not be excessively large, so as to preserve the optimization between OOD examples and the uniform distribution. While the impact of \phi on ID performance is minimal, as shown in Table 8, it does supplement part of the remaining ID deficiencies left by the dynamic ID memory.

T=2 \kappa= 0.9 0.7 0.5 0.3 0.1
Texture 44.86 23.05 19.11 16.49 23.50
Places365 77.80 52.87 47.54 46.34 42.12
\kappa=0.5 T= 0 1 2 3 5
Texture 78.55 22.52 19.11 19.65 19.22
Places365 81.50 51.34 47.53 46.16 45.82
Table 9: Ablation study on \kappa and T. ResNet-34 is trained on CIFAR-100; FPR95 is reported. Lower is better.

Sensitivity analysis on \kappa and T. We conduct a sensitivity analysis on the parameters \kappa and T, as summarized in Table 9. Increasing the percentage of OOD data (decreasing \kappa) in the test set generally improves the performance of AUTO, as it allows the model to access more OOD samples and make more iterations. The model's performance sometimes suffers from an increase in OOD data, as it can contaminate the ID memory; more details can be found in the Appendix. Subsequently, we examine the effect of T, the number of iterations performed when each OOD sample is encountered. Multiple iterations can expedite the model's convergence and facilitate the detection of OOD data that may have been missed during the convergence process, thereby improving model performance. However, we find that the incremental benefits diminish as T increases beyond two iterations. Considering both performance and speed, we recommend setting T=2 as the standard iteration frequency.

Samples \mu_{in} \sigma_{in} FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow
100 0.9985 0.0024 47.03 89.51 77.59
1000 0.9976 0.0145 46.66 89.59 77.70
10000 0.9976 0.0115 45.94 89.72 77.65
50000 (All) 0.9977 0.0111 46.27 89.67 77.53
Table 10: Results on the Places365 [57] dataset, model is trained on CIFAR-100 using ResNet-34. Utilizing fewer ID samples to initialize AUTO is more efficient.

Approximate estimation of the mean and standard deviation on training ID data. To initialize the in-out-aware filter, the statistics \mu_{in} and \sigma_{in} of the training ID data are estimated using all available training samples. This introduces high computation costs before deployment. In practice, the statistics can be approximated using a small subset of the training data. Table 10 presents these approximate statistics and the corresponding performances. Our findings suggest that approximate statistics suffice to initialize AUTO, significantly reducing computation costs and streamlining model deployment.
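A sketch of the subsampled initialization discussed above; the subset size and random-number handling are illustrative assumptions (Table 10 suggests a few hundred to a few thousand samples suffice).

  import numpy as np

  def approx_id_statistics(id_msp_scores, num_samples=1000, seed=0):
      # Approximate mu_in and sigma_in from a random subset of training ID scores
      scores = np.asarray(id_msp_scores)
      rng = np.random.default_rng(seed)
      subset = rng.choice(scores, size=min(num_samples, len(scores)), replace=False)
      return float(subset.mean()), float(subset.std())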

6 Conclusion

This paper introduces a novel setting called test-time out-of-distribution detection, whereby models are directly optimized with unlabeled online test data. We present AUTO, which consists of three key components: an in-out-aware filter, a dynamic ID memory bank, and a semantically-consistent objective. AUTO excavates a well-trained model's intrinsic OOD discriminative capability from a unique test-time online optimization perspective. Extensive results demonstrate that AUTO can improve OOD detection performance substantially while maintaining the accuracy of ID classification. We hope our work can serve as a springboard for future works, provide new insights for revisiting model development in OOD detection, and draw more attention toward the testing phase.

References

  • [1] Abhijit Bendale and Terrance Boult. Towards open world recognition. In CVPR, 2015.
  • [2] Abhijit Bendale and Terrance E Boult. Towards open set deep networks. In CVPR, 2016.
  • [3] Jiefeng Chen, Yixuan Li, Xi Wu, Yingyu Liang, and Somesh Jha. Atom: Robustifying out-of-distribution detection using outlier mining. In ECML PKDD, 2021.
  • [4] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, 2014.
  • [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [6] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2020.
  • [7] Xuefeng Du, Zhaoning Wang, Mu Cai, and Yixuan Li. Vos: Learning what you don’t know by virtual outlier synthesis. In ICLR, 2022.
  • [8] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 2011.
  • [9] Ke Fan, Yikai Wang, Qian Yu, Da Li, and Yanwei Fu. A simple test-time method for out-of-distribution detection. arXiv preprint arXiv:2207.08210, 2022.
  • [10] Kaiming He, Ross Girshick, and Piotr Dollár. Rethinking imagenet pre-training. In ICCV, 2019.
  • [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
  • [12] Dan Hendrycks, Steven Basart, Mantas Mazeika, Andy Zou, Joseph Kwon, Mohammadreza Mostajabi, Jacob Steinhardt, and Dawn Song. Scaling out-of-distribution detection for real-world settings. In ICML, 2022.
  • [13] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136, 2016.
  • [14] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. In ICLR, 2019.
  • [15] Rui Huang, Andrew Geng, and Yixuan Li. On the importance of gradients for detecting distributional shifts in the wild. In NeurIPS, 2021.
  • [16] Rui Huang and Yixuan Li. Mos: Towards scaling out-of-distribution detection for large semantic space. In CVPR, 2021.
  • [17] Peter J Huber. Robust estimation of a location parameter. In Breakthroughs in statistics, pages 492–518. 1992.
  • [18] Yusuke Iwasawa and Yutaka Matsuo. Test-time classifier adjustment module for model-agnostic domain generalization. NeurIPS, 2021.
  • [19] Julian Katz-Samuels, Julia Nakhleh, Robert Nowak, and Yixuan Li. Training ood detectors in their natural habitats. In ICML, 2022.
  • [20] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Tech Report, 2009.
  • [21] Jogendra Nath Kundu, Naveen Venkat, R Venkatesh Babu, et al. Universal source-free domain adaptation. In CVPR, 2020.
  • [22] Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In NeurIPS, 2018.
  • [23] Jian Liang, Dapeng Hu, and Jiashi Feng. Do we really need to access the source data? source hypothesis transfer for unsupervised domain adaptation. In ICML, 2020.
  • [24] Jian Liang, Dapeng Hu, Yunbo Wang, Ran He, and Jiashi Feng. Source data-absent unsupervised domain adaptation through hypothesis transfer and labeling transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(11):8602–8617, 2021.
  • [25] Shiyu Liang, Yixuan Li, and R Srikant. Enhancing the reliability of out-of-distribution image detection in neural networks. In ICLR, 2018.
  • [26] Ziqian Lin, Sreya Dutta Roy, and Yixuan Li. Mood: Multi-level out-of-distribution detection. In CVPR, 2021.
  • [27] Weitang Liu, Xiaoyun Wang, John Owens, and Yixuan Li. Energy-based out-of-distribution detection. In NeurIPS, 2020.
  • [28] Zhuang Liu, Mingjie Sun, Tinghui Zhou, Gao Huang, and Trevor Darrell. Rethinking the value of network pruning. arXiv preprint arXiv:1810.05270, 2018.
  • [29] Michael McCloskey and Neal J Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation. 1989.
  • [30] Alexander Meinke and Matthias Hein. Towards neural networks that provably know when they don’t know. In ICLR, 2019.
  • [31] Yifei Ming, Ying Fan, and Yixuan Li. Poem: Out-of-distribution detection with posterior sampling. In ICML, 2022.
  • [32] Yifei Ming, Yiyou Sun, Ousmane Dia, and Yixuan Li. How to exploit hyperspherical embeddings for out-of-distribution detection? In ICLR, 2023.
  • [33] Sina Mohseni, Mandar Pitale, JBS Yadawa, and Zhangyang Wang. Self-supervised learning for generalizable out-of-distribution detection. In AAAI, 2020.
  • [34] Peyman Morteza and Yixuan Li. Provable guarantees for understanding out-of-distribution detection. In AAAI, 2022.
  • [35] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. In NeurIPS, 2011.
  • [36] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In CVPR, 2015.
  • [37] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, 2019.
  • [38] Vikash Sehwag, Mung Chiang, and Prateek Mittal. Ssd: A unified framework for self-supervised outlier detection. In ICLR, 2020.
  • [39] Yiyou Sun, Chuan Guo, and Yixuan Li. React: Out-of-distribution detection with rectified activations. In NeurIPS, 2021.
  • [40] Yiyou Sun and Yixuan Li. Dice: Leveraging sparsification for out-of-distribution detection. In ECCV, 2022.
  • [41] Yiyou Sun, Yifei Ming, Xiaojin Zhu, and Yixuan Li. Out-of-distribution detection with deep nearest neighbors. In ICML, 2022.
  • [42] Yu Sun, Xiaolong Wang, Zhuang Liu, John Miller, Alexei Efros, and Moritz Hardt. Test-time training with self-supervision for generalization under distribution shifts. In ICML, 2020.
  • [43] Leitian Tao, Xuefeng Du, Xiaojin Zhu, and Yixuan Li. Non-parametric outlier synthesis. In ICLR, 2023.
  • [44] Antonio Torralba, Rob Fergus, and William T Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE TPAMI, 2008.
  • [45] Grant Van Horn, Oisin Mac Aodha, Yang Song, Yin Cui, Chen Sun, Alex Shepard, Hartwig Adam, Pietro Perona, and Serge Belongie. The inaturalist species classification and detection dataset. In CVPR, 2018.
  • [46] Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In ICLR, 2020.
  • [47] Haoran Wang, Weitang Liu, Alex Bocchieri, and Yixuan Li. Can multi-label classification networks know what they don’t know? In NeurIPS, 2021.
  • [48] Qizhou Wang, Junjie Ye, Feng Liu, Quanyu Dai, Marcus Kalander, Tongliang Liu, Jianye HAO, and Bo Han. Out-of-distribution detection with implicit outlier transformation. In ICLR, 2023.
  • [49] Hongxin Wei, Renchunzi Xie, Hao Cheng, Lei Feng, Bo An, and Yixuan Li. Mitigating neural network overconfidence with logit normalization. In ICML, 2022.
  • [50] Ross Wightman. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
  • [51] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
  • [52] Pingmei Xu, Krista A Ehinger, Yinda Zhang, Adam Finkelstein, Sanjeev R Kulkarni, and Jianxiong Xiao. Turkergaze: Crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755, 2015.
  • [53] Fisher Yu, Ari Seff, Yinda Zhang, Shuran Song, Thomas Funkhouser, and Jianxiong Xiao. Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
  • [54] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
  • [55] Jingyang Zhang, Nathan Inkawhich, Randolph Linderman, Yiran Chen, and Hai Li. Mixture outlier exposure: Towards out-of-distribution detection in fine-grained environments. In WACV, 2023.
  • [56] Marvin Mengxin Zhang, Sergey Levine, and Chelsea Finn. Memo: Test time robustness via adaptation and augmentation. In NeurIPSW, 2021.
  • [57] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE TPAMI, 2017.

Appendix

7 Pseudo-code

We present our pseudo-code in Alg. 1. We calculate the MSP [13] score for the current test sample and identify the sample with the proposed in-out-aware filter. The system is updated once the sample is identified as a pseudo-ID or pseudo-OOD sample.

Algorithm 1 Pseudocode of AUTO in a PyTorch-like style.
  # M_id: ID memory bank with one sample per class
  # m_i, m_o: inner-aware margin and outlier-aware margin for selecting test samples
  # f_t: the updated model when the t-th sample arrives; f_0: the frozen original model
  for x in loader:  # load one sample each time
      sm = softmax(f_t(x))  # softmax: 1xC
      msp = max(sm)  # MSP score
      if msp > m_i:  # recognized as a pseudo-ID sample
          pred = argmax(sm)  # prediction of x
          # search for the sample with the same label in M_id
          for i in range(len(M_id)):
              if pred == y_i:  # y_i: labels in M_id
                  x_i = x  # replace sample
      if msp < m_o:  # recognized as a pseudo-OOD sample
          # train T iterations on each pseudo-OOD sample
          for ind in range(T):
              sm_t = softmax(f_t(x))
              # CELoss: CrossEntropyLoss, u: uniform distribution
              loss = CELoss(sm_t, u)
              # optimize with the ID memory bank
              for i in range(len(M_id)):
                  loss += CELoss(f_t(x_i), y_i)
              # semantic consistency
              sm_0 = softmax(f_0(x))
              pred_t = argmax(sm_t)
              pred_0 = argmax(sm_0)
              if pred_t != pred_0:
                  loss += sm_t[pred_t] - sm_t[pred_0] + phi
              # update model
              loss.backward()
              update(f_t.params)

8 Experiments

8.1 Details of Datasets

Firstly, we summarize the ID configurations in Table 11. For CIFAR benchmarks, we use the standard split with 50,000 training and 10,000 test images. For the ImageNet benchmark, we use the standard validation split with 50,000 images as ID samples at test time.

Dataset Training Set (\mathcal{D}_{id}^{train}) Test Set (\mathcal{D}_{id}^{test})
CIFAR-10/100 50000 10000
ImageNet 1281167 50000
Table 11: ID data configurations.

Then, we summarize the OOD configurations for CIFAR and ImageNet benchmarks.

For CIFAR benchmarks, we follow the common OOD settings in [27]. Details are shown in Table 12.

Dataset Test OOD (\mathcal{D}_{ood}^{test}) \mathcal{D}_{aux}^{wood} (\mathcal{D}_{ood}^{test} / \mathcal{D}_{id}^{test})
SVHN 10k 5k 5k
Textures 5640 2820 2820
LSUN_C 10k 5k 5k
LSUN_R 10k 5k 5k
Places365 10k 5k 5k
iSUN 8925 4462 4463
Table 12: OOD data for CIFAR benchmarks.

We use the 80 Million Tiny Images [44] dataset as \mathcal{D}_{aux} for OE-based methods. 80 Million Tiny Images is a large-scale, diverse dataset of 32×32 natural images scraped from the Internet. Following the OE setting, we remove all examples of 80 Million Tiny Images that appear in the CIFAR datasets. In particular, we change the auxiliary dataset \mathcal{D}_{aux}^{wood} for the WOODS method according to the number of OOD samples (Table 12). For \mathcal{D}_{id}^{train}, 45,000 images are utilized for training the model, and the remaining 5,000 images are utilized for validation.

For the ImageNet benchmark, we follow the common setting from [16]. Details are shown in Table 13.

Dataset iNaturalist Textures Places50 SUN
Test OOD (\mathcal{D}_{ood}^{test}) 10k 5640 10k 10k
Table 13: OOD data for ImageNet benchmarks.

We directly use the pre-trained model in the ImageNet benchmark; the training set \mathcal{D}_{id}^{train} is utilized to sample the ID memory bank.

For both CIFAR and ImageNet benchmarks, the main results reported in our paper are tested on the test set, which is denoted as:

\mathcal{D}_{test}=\mathcal{D}_{ood}^{test}+\mathcal{D}_{id}^{test}. (12)
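As an illustration of how the mixed test stream in Eq. (12) could be assembled (the shuffling and tagging details are our assumptions, since the paper only specifies the composition of the stream):

  import random

  def build_test_stream(id_samples, ood_samples, seed=0):
      # D_test = D_ood^test + D_id^test (Eq. 12): merge both sets into one unlabeled
      # stream presented image by image; the ID/OOD tags are kept only for evaluation.
      stream = [(x, "ID") for x in id_samples] + [(x, "OOD") for x in ood_samples]
      random.Random(seed).shuffle(stream)
      return stream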

8.2 Details of Models

For training the classification models on CIFAR-10/100 data, we use the ResNet-34 [11] and the Wide ResNet [54] architecture with 40 layers and a widening factor of 2. These models are optimized by stochastic gradient descent with Nesterov momentum [8]. We set the momentum to 0.9 and the weight decay coefficient to 0.0005. Models are trained for 200 epochs. The initial learning rate is 0.1 and decays by a factor of 10 at epochs 80 and 140. We use a batch size of 128 and a dropout rate of 0.3.
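A sketch of the described optimizer and learning-rate schedule in PyTorch; the helper name is ours, and details such as the dataloader and dropout are omitted.

  import torch

  def build_cifar_training(model):
      # SGD with Nesterov momentum, weight decay 5e-4, initial lr 0.1,
      # decayed by 10x at epochs 80 and 140 (200 epochs total), as described above.
      optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9,
                                  weight_decay=5e-4, nesterov=True)
      scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                       milestones=[80, 140], gamma=0.1)
      return optimizer, scheduler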

For training OOD detectors with auxiliary data, we follow the OE [14] and Energy [27] settings and use the defaults: \lambda=0.5 in OE; m_{in}=-7, m_{out}=-25, and \beta=0.1 in Energy. Models are trained from scratch for 200 epochs.

8.3 Details of Baselines

To evaluate the baselines, we follow the original definition in MSP [13], ODIN score [25], Mahalanobis score [22], Energy score [27], ReAct [39], LogitNorm [49], DICE [40], and KNN distance [41].

For ODIN, we set the temperature T to 1000.

For Energy, the temperature is set to T=1.

For ReAct, the rectification threshold p is set to 1. More discussions are shown in Table 14.

For LogitNorm, we set the temperature hyperparameter \tau to 0.04 for CIFAR-10 and 0.0085 for CIFAR-100. More discussions are shown in Table 15.

For DICE, the sparsity ratio is set to 90\%.

For POEM and DOE, models are trained from scratch.

9 Results and Discussions.

9.1 Discussion on Existing Methods

Before proposing AUTO, we systematically studied existing methods and found some experimental results interesting. These methods substantially improve OOD detection performance under a fixed setting, but sometimes they unintentionally lead to negative optimization.

Effectiveness of ReAct [39]. ReAct [39] is a successful method for improving OOD detection performance in most scenarios, and its central hyperparameter, the rectification threshold, is usually set to a small value (e.g., 1 or 2). However, we find that ReAct has a negative effect when the rectification threshold p is set small for the model trained on ImageNet [5] using a Vision Transformer [6]. Results are shown in Table 14. We believe that if p were dynamically adjusted during testing, the influence of ReAct would be positive.

p FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow
1 92.20 76.51 77.19
10 61.08 86.82 77.84
15 58.03 87.25 78.06
20 57.26 87.41 78.01
25 57.33 87.47 78.02
50 57.26 87.05 78.01
+\infty 57.26 87.05 78.01
Table 14: Effect of rectification threshold pp for inference. Model is trained on ImageNet using Vision Transformer. Results are averaged on four OOD datasets.

Effectiveness of LogitNorm [49]. LogitNorm [49] substantially mitigates the overconfidence problem of neural networks. It can be easily adopted in practical settings without changing the loss or training scheme. However, the performance of this method is heavily influenced by its temperature parameter \tau, which modulates the magnitude of the logits. The authors of LogitNorm state that \tau should decrease as the number of categories increases. For instance, it is set to 0.04 on CIFAR-10 and 0.01 on CIFAR-100 [20]. However, we discover that this conclusion is empirical. The effect of \tau is shown in Table 15. It suggests that we need to dynamically search for the best value of \tau at a finer granularity during testing, because the best \tau may differ across test scenarios.

\tau FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow
0.01 76.08 76.83 76.40
0.02 71.43 78.32 76.86
0.04 69.89 79.67 76.82
0.05 70.36 76.24 77.71
Table 15: Effect of a temperature parameter τ\tau for inference. Model is trained on CIFAR-100 using ResNet-34. Results are averaged on six OOD datasets.

9.2 More Results

The influence of hyperparameters. The hyperparameters in our work are set as follows: \lambda_{1}=1, \lambda_{2}=0.1, \phi=0.2, T=2; these parameters have been discussed in the main text. The initialization of the inner-aware and outlier-aware margins is validated in the following experiments. We evaluate the values of k_{1},k_{2} with a cross-validation strategy. For instance, we utilize the Texture [4] dataset to choose k_{1},k_{2} when the model is deployed in the scenario that consists of CIFAR-100 and SVHN [35]. Considering variations of the ID data and OOD data, we choose a set of parameters for each model. Results are shown in Table 16 and Table 17.

k_{1},k_{2} FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow
0,0 27.71 94.05 79.12
0,1 48.68 90.84 78.27
0,2 50.96 89.90 78.61
0,3 53.14 88.04 78.15
Table 16: Ablation study on the initialization of the in-out-aware margins k_{1},k_{2}. The model is trained on ImageNet using a Vision Transformer; AUTO is tested on the Texture dataset.
k_{1},k_{2} FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow
0,0 18.00 96.17 77.78
0,1 18.02 96.13 77.76
0,2 18.16 96.10 78.02
0,3 18.07 96.15 77.78
Table 17: Ablation study on the initialization of the in-out-aware margins k_{1},k_{2}. The model is trained on CIFAR-100 using ResNet-34 [11]; AUTO is tested on the Texture dataset.

We find that k_{1},k_{2} are insensitive on the CIFAR-10/CIFAR-100 benchmarks. Although the performance is similar, a bigger interval means more samples are utilized to modify models; thus, the increased computational cost due to the initialization is noteworthy. In summary, we set k_{1}=0, k_{2}=3 for models trained on all ID datasets using ResNet/WideResNet [54] architectures. For models trained on ImageNet using the Vision Transformer, we set k_{1}=0, k_{2}=0.

A class prototype memory bank enhances both ID and OOD performance. When sufficient ID data is available, it is possible to build a fixed class prototype memory bank. Table 18 shows that incorporating such a memory bank can facilitate further improvements in both OOD and ID performance. However, it may not be practical to construct such a memory bank, particularly if ID samples are scarce. Additionally, while the class prototype memory bank slightly outperforms dynamically updating the memory bank using test data, the latter approach is more convenient and flexible. Therefore, we have opted for the latter approach in AUTO.

$\kappa$ Memory FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow
0.5 Proto. 13.21 96.97 78.27
Update 13.29 96.91 77.16
0.9 Proto. 11.84 97.42 78.31
Update 13.14 96.92 78.39
Table 18: AUTO (MSP) results with different memory banks. Proto. denotes the class prototype memory bank. Models are trained on CIFAR-100 with ResNet-34 and tested on six OOD datasets. Average results are reported.
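The two memory bank variants compared in Table 18 can be sketched as follows: the fixed prototype bank averages features per class from available ID data, while the dynamic bank is refreshed online from (pseudo-)ID test features. The feature hook `model.features`, the `feat_dim` argument, and the moving-average update rule are illustrative assumptions; AUTO's exact update is defined in the main text.

```python
import torch

@torch.no_grad()
def build_prototype_bank(model, id_loader, num_classes, feat_dim, device="cuda"):
    # Fixed class-prototype memory bank: per-class mean of ID features.
    # `model.features` is an assumed hook returning penultimate-layer features.
    sums = torch.zeros(num_classes, feat_dim, device=device)
    counts = torch.zeros(num_classes, device=device)
    for x, y in id_loader:
        feats = model.features(x.to(device))
        y = y.to(device)
        sums.index_add_(0, y, feats)
        counts += torch.bincount(y, minlength=num_classes).float()
    return sums / counts.clamp(min=1).unsqueeze(1)

@torch.no_grad()
def update_memory_bank(bank, feat, pred_class, momentum=0.9):
    # Dynamic alternative (sketch): move the entry of the predicted class toward
    # the incoming pseudo-ID feature with an exponential moving average.
    bank[pred_class] = momentum * bank[pred_class] + (1.0 - momentum) * feat
    return bank
```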

A gradually decreasing weighting factor for $\mathcal{L}_{SC}$. As mentioned in the main text, setting a large $\lambda_2$ for $\mathcal{L}_{SC}$ can lead to underperformance on both ID and OOD tasks. To address this issue, we propose a weighting factor $\beta$ that gradually decreases as the number of iterations increases. Table 19 shows that this factor effectively prevents the degradation of OOD performance, but it also reduces the gain of $\mathcal{L}_{SC}$ on ID performance. Since $\mathcal{L}_{SC}$ is primarily proposed to enhance ID performance, we do not incorporate this factor into our framework. One plausible decay schedule is sketched after Table 19.

$\mathcal{L}_{SC}$ FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow
with $\beta$ 10.92 97.38 77.83
w/o $\beta$ 11.85 97.36 77.90
Table 19: A gradually decreasing weighting factor for $\mathcal{L}_{SC}$ enhances OOD detection but reduces the gain of $\mathcal{L}_{SC}$ on ID performance. Models are trained on CIFAR-100 with ResNet-34 and tested on six OOD datasets. Average results are reported.
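For completeness, one plausible implementation of such a decreasing factor is sketched below; the inverse-decay form and its rate are assumptions, not the exact schedule behind Table 19.

```python
def decayed_weight(beta0: float, step: int, decay: float = 1e-3) -> float:
    # A monotonically decreasing weighting factor for the semantically-consistent loss:
    # beta(step) = beta0 / (1 + decay * step), so early test iterations benefit from
    # the loss while its influence fades over time. The schedule itself is an assumption.
    return beta0 / (1.0 + decay * step)
```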

Results on test-time methods. Existing test-time optimization methods primarily focus on adaptation tasks. Although they are not designed for OOD detection, they share the paradigm of accessing data during the testing phase, so a comparison and explanation are still necessary. In Table 20, we investigate two settings where models are trained on CIFAR-100 and tested on (1) the CIFAR-100/Places365 scenario or (2) the CIFAR-100-C/Places365 scenario. We summarize that (1) OOD data is detrimental to Tent, which attempts to fit the OOD data into the known data distribution; this not only fails to detect OOD data but also harms ID performance (a minimal sketch of Tent's update is given at the end of this discussion). (2) Introducing covariate shift to the ID data at test time is a new scenario that is neither covered by nor aligned with the OOD detection paradigm.

Test Data Method FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow
CIFAR-100 Places365 MSP [13] 83.53 74.34 77.51
Tent [46] 86.32 68.29 67.01
AUTO 13.29 96.91 77.16
CIFAR-100-C Places365 MSP [13] 92.75 54.25 20.49
Tent [46] 90.05 61.15 59.45
AUTO 100.00 3.20 19.91
Table 20: ID data are changed at test time. Models are trained on CIFAR-100 with ResNet-34.

The results presented in Table 20 show that OOD detection methods fail in the CIFAR-100-C scenario. To further investigate this phenomenon, we conduct additional experiments in Table 21, where we train the model on CIFAR-100 and evaluate it on a CIFAR-100/CIFAR-100-C scenario. Our findings reveal that OOD detection methods treat CIFAR-100-C data as OOD data: MSP detects CIFAR-100-C much as it detects Places365, while with AUTO the pre-trained model accurately distinguishes CIFAR-100-C from CIFAR-100 during testing.

Test Data Method FPR95 \downarrow AUROC \uparrow ID_Acc \uparrow
CIFAR-100 CIFAR-100-C MSP 85.37 73.96 77.51
AUTO 5.61 98.69 78.06
Table 21: AUTO identifies CIFAR-100-C (covariate shift) data as OOD samples.

Based on the above experimental results, it is evident that forcing Tent to fit OOD data, or forcing OOD detection methods to fit ID data with covariate shift, is unreasonable. Novel experimental settings (CIFAR-100 \rightarrow CIFAR-100-C/Places365) call for the development of new methodologies.
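For reference, the following is a minimal sketch of Tent's test-time update (entropy minimization on the affine parameters of normalization layers), which illustrates why OOD samples in the stream are pulled toward confident ID predictions in Table 20. It follows the published description of Tent [46] rather than any specific implementation.

```python
import torch
import torch.nn as nn

def configure_tent(model: nn.Module, lr: float = 1e-3):
    # Tent [46] freezes all weights except the affine parameters of BatchNorm layers.
    model.requires_grad_(False)
    params = []
    for m in model.modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d)):
            m.requires_grad_(True)
            params += [p for p in (m.weight, m.bias) if p is not None]
    return torch.optim.SGD(params, lr=lr, momentum=0.9)

def tent_step(model: nn.Module, x: torch.Tensor, optimizer: torch.optim.Optimizer):
    # One adaptation step: minimize the mean prediction entropy of the incoming batch.
    # If x contains OOD samples, they are also pushed toward confident ID predictions,
    # which explains the degraded OOD detection observed in Table 20.
    logits = model(x)
    probs = logits.softmax(dim=1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=1).mean()
    optimizer.zero_grad()
    entropy.backward()
    optimizer.step()
    return logits.detach()
```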

AUTO outperforms the test-time self-supervised OOD detection method. As mentioned in Related Works, ELET [9] is an OOD detection method that learns a detector from scratch at test time. We compare AUTO with ELET in Table 22; AUTO outperforms ELET on both benchmarks.

Method CIFAR-10 (FPR95 \downarrow / AUROC \uparrow) CIFAR-100 (FPR95 \downarrow / AUROC \uparrow)
ELET [9] 18.57/94.89 62.00/80.91
AUTO 9.45/97.94 17.55/95.55
Table 22: AUTO outperforms ELET. Models are tested on six OOD datasets with WideResNet-40-2.

AUTO with different score functions. To further verify the generality and effectiveness of AUTO, we test it with three representative OOD scoring functions: MSP [13], free energy [27], and MaxLogit [12]. In all cases, AUTO achieves strong performance (Table 23), demonstrating that our proposal genuinely enables the target model to learn OOD knowledge for OOD detection. Minimal definitions of the three scores are sketched after Table 23.

Method SVHN Texture LSUN_Crop LSUN_Resize Places365 iSUN Average
FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow IN_Acc \uparrow
MSP [13] 1.11 99.77 20.46 95.57 3.91 99.05 0.70 99.82 52.35 87.57 1.21 99.70 13.29 96.91 77.16
Energy [27] 0.96 99.79 18.07 96.15 3.54 99.16 0.74 99.82 46.27 89.67 1.52 99.59 11.85 97.36 77.90
MaxLogit [12] 1.37 99.72 21.61 95.55 4.24 99.02 0.98 99.80 43.95 89.68 1.95 95.55 12.35 97.22 77.87
Table 23: AUTO's results with different score functions. ResNet-34 is trained on CIFAR-100. \uparrow indicates larger values are better and vice versa. Bold numbers indicate superior results.
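The three scores in Table 23 are standard and can be computed directly from the logits, as sketched below; each score is written so that a higher value indicates a more ID-like input, and the temperature of the energy score is an assumption (T = 1 unless tuned otherwise).

```python
import torch

def msp_score(logits: torch.Tensor) -> torch.Tensor:
    # MSP [13]: maximum softmax probability.
    return logits.softmax(dim=-1).max(dim=-1).values

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    # Free energy [27]: negative energy, T * logsumexp(logits / T).
    return T * torch.logsumexp(logits / T, dim=-1)

def maxlogit_score(logits: torch.Tensor) -> torch.Tensor:
    # MaxLogit [12]: the largest unnormalized logit.
    return logits.max(dim=-1).values
```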

Results on individual datasets. For models trained on CIFAR-10 (Table 24) and CIFAR-100 (Table 25), we report detailed results on each of the six OOD datasets.

Backbone Method SVHN Texture LSUN_Crop LSUN_Resize Places365 iSUN Average
FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow IN_Acc \uparrow
RN-34 MSP [13] 35.51 95.05 51.84 91.34 43.34 93.93 45.75 93.54 56.55 88.11 45.98 93.24 46.49 92.53 94.87
ODIN [25] 19.10 96.57 38.82 91.80 21.21 96.20 26.78 95.32 45.28 88.65 28.82 95.11 30.00 93.94 94.87
Mahalanobis [22] 22.22 96.47 35.47 94.38 60.92 92.59 44.89 93.59 56.30 89.58 46.08 93.25 44.31 93.31 94.87
Energy [27] 18.18 96.67 37.53 92.09 19.34 96.50 25.30 95.47 44.44 88.50 27.81 95.19 28.77 94.07 94.87
ReAct [39] 23.15 95.42 41.51 90.61 25.35 94.92 28.69 94.79 46.51 88.79 30.19 94.42 32.57 93.16 94.85
Logit Norm [49] 12.71 97.82 29.87 94.21 0.90 99.77 18.39 96.77 26.73 94.67 20.24 96.43 18.14 96.81 94.68
KNN [41] 32.49 95.34 40.82 93.36 28.15 95.71 33.69 95.10 48.25 90.80 36.86 94.59 36.71 94.15 94.87
OE [14] 3.48 98.98 9.27 98.24 1.24 99.68 3.21 99.09 15.88 96.36 3.05 99.10 6.02 98.58 95.08
Energy (w. $\mathcal{D}_{aux}$) [27] 1.16 99.53 4.34 98.64 0.84 99.01 0.98 99.08 8.87 97.26 1.41 99.06 2.93 98.71 95.49
POEM [31] 1.17 99.30 0.96 99.50 54.46 89.54 0.00 100.00 10.32 97.49 0.00 100.00 11.25 97.64 89.57
WOODS [19] 2.26 99.55 22.75 94.71 3.13 99.47 1.86 99.65 26.62 93.76 4.01 99.36 10.10 97.75 94.79
DOE [48] 1.15 99.80 9.60 97.94 1.05 99.79 4.20 99.15 34.65 91.00 2.95 99.37 8.93 97.84 94.74
AUTO 1.35 99.73 12.64 97.69 0.99 99.81 0.84 99.81 21.97 95.03 1.03 99.77 6.47 98.64 94.98
WRN-40-2 MSP [13] 51.21 92.36 62.49 86.75 30.24 95.73 51.53 91.39 59.78 87.52 56.77 89.66 52.00 90.57 94.53
ODIN [25] 33.34 92.29 55.02 83.96 6.88 98.51 29.46 93.59 44.58 87.80 36.66 92.12 34.32 91.38 94.53
Mahalanobis [22] 6.89 98.69 11.65 97.99 16.63 97.11 27.27 95.52 61.84 86.73 29.40 95.11 25.61 95.19 94.53
Energy [27] 31.85 92.37 55.63 83.93 6.44 98.66 28.29 93.75 43.03 88.13 35.20 92.32 33.41 91.53 94.53
ReAct [39] 88.11 73.53 77.57 73.82 52.40 84.02 35.53 91.40 53.88 85.32 44.51 89.00 58.67 82.85 93.41
Logit Norm [49] 10.64 97.46 44.35 91.84 1.47 98.98 17.85 96.86 35.68 92.98 16.21 97.04 21.03 95.86 94.42
KNN [41] 45.16 92.92 43.63 91.52 27.16 95.44 26.82 95.40 43.97 90.77 33.01 93.83 36.63 93.31 94.53
OE [14] 3.24 99.17 10.82 98.01 1.30 99.67 4.20 99.09 17.31 96.23 5.63 98.9 7.08 98.51 94.44
Energy (w. $\mathcal{D}_{aux}$) [27] 0.72 99.58 4.36 98.82 0.58 99.42 1.25 99.23 8.81 97.57 1.72 99.2 2.91 98.97 94.91
POEM [31] 2.61 99.43 0.48 99.75 34.63 92.46 0.00 100.00 5.30 98.55 0.00 100.00 7.17 98.37 90.62
WOODS [19] 2.62 99.50 36.35 93.02 1.75 99.62 1.03 99.75 29.25 93.96 1.85 99.61 12.14 97.58 94.72
DOE [48] 3.05 99.25 8.70 97.97 0.10 99.84 1.55 99.39 14.85 96.64 1.75 99.39 5.00 98.74 94.43
AUTO 1.03 99.77 17.7 96.2 0.49 99.74 1.34 99.55 35.26 92.76 0.87 99.63 9.45 97.94 93.33
Table 24: Detailed results on six common OOD benchmark datasets; models are trained on CIFAR-10. \uparrow indicates larger values are better and vice versa. The bold and underlined numbers indicate the first- and second-best results, respectively.
Backbone Method SVHN Texture LSUN_Crop LSUN_Resize Places365 iSUN Average
FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow FPR95 \downarrow AUROC \uparrow IN_Acc \uparrow
RN-34 MSP [13] 84.77 72.71 80.15 77.85 86.00 75.27 84.87 80.81 80.81 76.05 84.58 72.60 83.53 74.34 77.51
ODIN [25] 84.93 72.73 79.06 78.29 87.71 73.90 81.37 75.34 81.33 75.36 82.17 75.98 82.76 75.27 77.51
Mahalanobis [22] 70.20 85.59 64.33 85.45 81.43 78.09 80.46 78.47 78.97 77.52 78.00 79.81 75.56 80.82 77.51
Energy [27] 85.89 72.77 79.47 77.93 89.37 73.75 78.96 76.09 82.33 75.12 79.90 76.33 82.65 75.33 77.51
ReAct [39] 75.49 82.97 68.86 83.73 83.19 82.31 71.19 81.43 77.25 79.86 72.55 81.73 74.76 82.01 77.09
Logit Norm [49] 51.12 91.51 86.99 72.96 46.30 90.94 96.11 64.72 80.22 77.81 95.75 63.03 76.08 76.83 76.40
KNN [41] 68.02 84.92 70.46 83.02 71.26 82.30 70.08 83.54 77.43 78.23 70.75 82.64 71.33 82.44 77.51
OE [14] 44.58 98.98 64.91 86.07 24.52 95.59 75.91 82.53 63.36 86.23 77.81 81.32 58.52 87.3 76.84
Energy (w. $\mathcal{D}_{aux}$) [27] 34.24 94.34 65.71 87.81 16.51 97.19 71.60 86.35 55.59 88.95 74.44 85.65 53.02 90.05 77.19
POEM [31] 28.79 95.84 3.07 98.86 69.85 86.32 0.00 100.00 16.95 96.61 0.00 100.00 19.78 96.27 69.49
WOODS [19] 17.25 96.64 46.11 87.30 27.95 93.52 24.34 94.74 62.50 81.99 31.25 93.05 34.90 91.21 77.84
DOE [48] 36.25 93.80 38.15 91.86 13.60 97.69 32.60 94.85 44.10 88.87 29.90 94.83 32.43 97.22 77.87
AUTO 0.96 99.79 18.07 96.15 3.54 99.16 0.74 99.82 46.27 89.67 1.52 99.59 11.85 97.36 77.90
WRN-40-2 MSP [13] 78.32 80.15 83.26 74.18 66.08 84.32 81.24 73.17 82.50 74.93 83.52 71.89 79.15 76.44 75.84
ODIN [25] 74.01 83.78 79.69 77.29 33.37 94.24 72.43 79.86 81.48 75.22 77.49 77.35 69.75 81.29 75.84
Mahalanobis [22] 58.99 87.93 38.45 91.64 98.83 63.55 72.44 81.56 87.00 71.58 71.15 82.02 71.14 79.71 75.84
Energy [27] 76.04 83.71 80.99 76.90 29.66 94.92 72.24 79.93 81.94 74.69 77.03 77.61 69.65 81.30 75.84
ReAct [39] 98.77 55.95 91.19 59.71 91.25 59.71 86.92 72.35 93.79 62.09 90.17 68.69 92.01 64.53 64.77
Logit Norm [49] 52.54 91.13 72.26 79.90 9.62 98.38 58.48 88.18 75.23 80.04 61.27 87.98 54.90 87.60 76.02
KNN [41] 45.11 90.47 51.17 87.36 64.49 82.22 56.86 87.22 83.23 73.12 58.68 85.77 59.92 84.36 75.84
OE [14] 53.71 90.13 60.98 84.59 16.74 96.81 65.99 79.71 57.03 86.30 69.80 77.37 54.04 85.82 75.59
Energy (w. $\mathcal{D}_{aux}$) [27] 34.13 94.84 56.04 87.97 9.80 98.26 56.39 86.63 50.49 89.83 59.71 85.27 44.43 90.47 75.75
POEM [31] 48.96 92.67 3.81 98.41 71.59 88.96 0.00 100.00 21.41 95.74 0.00 100.00 24.30 95.96 69.37
WOODS [19] 7.99 98.50 42.17 89.62 7.73 98.59 7.74 98.65 54.71 84.48 15.53 97.42 22.65 94.54 75.74
DOE [48] 33.35 94.59 32.55 92.75 3.70 98.99 14.45 97.43 56.55 85.77 15.95 97.02 26.09 94.43 74.98
AUTO 1.74 99.53 34.41 91.57 4.82 98.63 2.06 99.34 59.86 84.88 2.38 99.32 17.55 95.55 73.71
Table 25: Detailed results on six common OOD benchmark datasets; models are trained on CIFAR-100. \uparrow indicates larger values are better and vice versa. The bold and underlined numbers indicate the first- and second-best results, respectively.

10 Experiment Platform

All experiments are run on an NVIDIA RTX 3090 GPU with PyTorch 1.9.0 and Python 3.8, on Ubuntu Linux 16.04.