Multiple Testing Framework for Out-of-Distribution Detection
Abstract
We study the problem of Out-of-Distribution (OOD) detection, that is, detecting whether a Machine Learning (ML) model’s output can be trusted at inference time. While a number of tests for OOD detection have been proposed in prior work, a formal framework for studying this problem is lacking. We propose a definition for the notion of OOD that includes both the input distribution and the ML model, which provides insights for the construction of powerful tests for OOD detection. We also propose a multiple hypothesis testing inspired procedure to systematically combine any number of different statistics from the ML model using conformal p-values. We further provide strong guarantees on the probability of incorrectly classifying an in-distribution sample as OOD. In our experiments, we find that threshold-based tests proposed in prior work perform well in specific settings, but not uniformly well across different OOD instances. In contrast, our proposed method that combines multiple statistics performs uniformly well across different datasets and neural network architectures.
I Introduction
Given the ubiquitous use of ML models in safety-critical applications such as self-driving and medicine, there is a need to develop methods to detect whether an ML model’s output at inference time can be trusted. This problem is commonly referred to as the Out-of-Distribution (OOD) detection problem. If an output is deemed untrustworthy by an OOD detector, one can abstain from making decisions based on the output, and default to a safe action. There has been a flurry of works on this problem in recent years. A particular area of focus has been on OOD detection for deep learning models ([1, 2, 3, 4, 5, 6]). While neural networks generalize quite well to inputs from the same distribution as the training distribution, recent works have shown that they tend to make incorrect predictions with high confidence, even for unrecognizable or irrelevant inputs (see, e.g., [7, 8, 9]).
In many of the prior works on OOD detection, OOD inputs are considered to be inputs that are not generated from the input training distribution (see, e.g., [2, 3]), which better describes the classical problem of outlier detection. However, in contrast to outlier detection, the goal in OOD detection is to flag untrustworthy outputs from a given ML model. Thus, it is essential for the definition of an OOD sample to involve the ML model. One of the contributions of this paper is a formal definition for the notion of OOD that involves both the input distribution and the ML model.
In a line of work in OOD detection, it is assumed that the detector has access (exposure) to OOD examples, which can be used to train an auxiliary classifier, or to tune hyperparameters for the detection model ([1, 10, 2, 11]). Other works rely on identifying certain patterns observed in the training data distribution, and use these patterns to train the original ML model to help detect OOD examples. For instance, in [6], a neural network is trained to leverage in-distribution equivariance properties for OOD detection. There is another line of work in which tests are designed based on statistics from generative models trained for OOD detection. For instance, in [12], statistics from a deep generative model are combined through p-values using the Fisher test. In this paper, we focus exclusively on developing methods that do not use any OOD samples, and can be applied to any pre-trained ML model.
Prior work has primarily been focused on identifying promising test statistics and corresponding thresholds, sometimes motivated by empirical observations of the values taken by these statistics for certain in-distribution and OOD inputs. For instance, in [1], a confidence score is constructed through a weighted sum of Mahalanobis distances across layers, using the class conditional Gaussian distributions of the features of the neural network under Gaussian discriminant analysis. In [2], a statistic based on input perturbations and temperature-scaled softmax scores is proposed. In [5], a free energy score based on the denominator of the temperature-scaled softmax score is proposed. In [3], scores are derived from Gram matrices, through the sum of deviations of the Gram matrix values from their respective range observed over the training data. In [13], the broad goal is to find all candidate functions from a given collection in an offline manner through multiple testing, such that any one of these candidate functions controls some risk at inference time. This approach is applied in [13] to the problem of OOD detection to select suitable thresholds for a given test statistic to control the false alarm rate. In [4], vector norms of gradients from a pre-trained network are used to form a test statistic. In [14], OOD detection in Convolutional Neural Networks (CNNs) is studied; spatial and channel reduction techniques are employed to produce statistics per layer, and these layer statistics are combined to form a final score using a method motivated by the tests proposed by [15] and [16]. Thus, their proposed algorithm computes a single score using all the intermediate features of the CNN and its corresponding empirical p-value. They provide marginal false alarm guarantees averaged over all possible validation datasets used to compute the empirical p-value. Additionally, the proposed method in [14] can be applied only to CNNs, and not any general ML model. To summarize, from prior work, it is unclear which among these scores/statistics is the best for OOD detection, or if there exists such a test statistic that is useful for all possible out-distributions. The latter question was raised in [17], where they posit that one can construct an out-distribution for any single score or statistic that results in poor detection performance.
The false alarm probability or type-I error of a test refers to the probability of a single in-distribution sample being misclassified as OOD, and the detection power refers to the probability of correctly identifying an OOD sample. Note that the detection power is also referred to as detection accuracy in prior OOD works. In much of the prior work on OOD detection, the false alarm probability is estimated using empirical evaluations on certain in-distribution datasets. What is lacking in such works is a rigorous theoretical analysis of the probability of false alarm, which can be used to meet pre-specified false alarm constraints. Such false alarm guarantees are crucial for the responsible deployment of OOD methods in practice. Note that it is not possible to give any theoretical guarantees on the detection power of an OOD detection test without prior information about the class of all possible out-distributions, which is typically not available in practice. Therefore, in prior work on OOD detection, the detection powers of candidate OOD methods that meet the same pre-specified false alarm levels are compared empirically.
In this work, we propose a method inspired by multiple hypothesis testing ([18, 19, 20]) to systematically combine multiple test statistics for OOD detection. Our method works for combining any number of statistics with an arbitrary dependence structure, for instance, the Mahalanobis distances ([1]) and the Gram matrix deviations across layers ([3]) of a neural network. We should emphasize that there is no obvious way to directly combine such disparate statistics with provable guarantees for OOD detection. Detection procedures for multiple hypothesis testing are usually based on combining p-values across hypotheses [18, 19, 20]. However, in the problem of OOD detection, the probability measures under both the in-distribution (null) and out-of-distribution (alternate) settings are unknown, and thus the actual p-values cannot be computed. In conformal inference methods ([21, 22]) the p-values are replaced with conformal p-values, which are estimates computed from the empirical CDF of the test statistics. These conformal p-values are data-dependent, as they are calculated from in-distribution samples. In the procedure proposed in this paper, we use conformal p-values and provide rigorous theoretical guarantees on the probability of false alarm, conditioned on the dataset used for computing the conformal p-values.
Contributions
1. We formally characterize the notion of OOD, using which we provide insights on why it is necessary for OOD tests to involve more than just the new unseen input and the final output of the ML model for OOD detection.
2. We propose a new approach for OOD detection inspired by multiple testing. Our proposed test allows us to combine, in a systematic way, any number of different test statistics produced from the ML model with arbitrary dependence structures.
3. We provide strong theoretical guarantees on the probability of false alarm, conditioned on the dataset used for computing the conformal p-values. This is stronger than the false alarm guarantees in prior work (e.g., [22, 6]), where the guarantees are given in terms of an expectation over all possible datasets.
4. We perform extensive experiments across different datasets to demonstrate the efficacy of our method. We perform ablation studies to show that combining various statistics using our method produces uniformly good results across various types of OOD examples and Deep Neural Network (DNN) architectures.
II Problem Statement and OOD Modelling
Consider a learning problem with $(X, Y) \sim P_{XY}$, where $(X, Y)$ is the input-output pair and $P_{XY}$ is the distribution of the dataset available at training time. Let the dataset available at training time be denoted by $D_{\text{tr}} = \{(x_i, y_i)\}_{i=1}^{n}$, where $n$ is the size of the dataset. Let the ML model be denoted by $g_\theta$, where $\theta$ is the random variable denoting the parameters of the ML model. For instance, $\theta$ represents the weights and biases in a neural network. Let $X_{\text{new}}$ be a random variable generated from an unknown distribution, and let $x_{\text{new}}$ be an instance of this random variable seen by the ML model at inference time. Given $x_{\text{new}}$ and the ML model, the goal is to detect if this new unseen sample might produce an untrustworthy output. This might happen either because the input does not conform to the training data distribution, or because the ML model is unable to capture the true relationship between the input $X$ and the true label $Y$. Whether a new unseen sample is OOD or not depends on both the ML model and the distribution $P_{XY}$.
A precise mathematical definition of the OOD detection problem that captures both the input distribution and the ML model appears to be lacking in prior work. The most common definition is based on testing between the following hypotheses (see, e.g., [2]):
$$\mathcal{H}_0: X_{\text{new}} \sim P_X \quad \text{versus} \quad \mathcal{H}_1: X_{\text{new}} \nsim P_X, \qquad (1)$$
where $P_X$ denotes the marginal distribution of the input $X$ under $P_{XY}$, $\mathcal{H}_0$ corresponds to ‘in-distribution’, and $\mathcal{H}_1$ corresponds to ‘out-of-distribution’. However, such a definition does not involve the ML model, and better describes the problem of outlier detection, which is fundamentally different from the problem of OOD detection.
Let $\hat{Y} = g_\theta(X)$ denote the output of the ML model, and consider the distribution $P_{X\hat{Y}}$ as the joint distribution of the input and the output of the ML model. Using this joint distribution as the ‘in-distribution’, consider the following testing problem:
$$\mathcal{H}_0: (X_{\text{new}}, Y_{\text{new}}) \sim P_{X\hat{Y}} \quad \text{versus} \quad \mathcal{H}_1: (X_{\text{new}}, Y_{\text{new}}) \nsim P_{X\hat{Y}}, \qquad (2)$$
where $Y_{\text{new}}$ denotes the true label of the new input $X_{\text{new}}$. Note that this is a definition of OOD detection that involves both the input distribution and the ML model (through $g_\theta$). It also captures both the case where the input is not drawn from $P_X$, and the case where the ML model is unable to capture the relationship between the unseen input and its label.
The hypothesis test in (2) involves the true label $Y_{\text{new}}$ and the distribution $P_{X\hat{Y}}$. Since these quantities are unknown, the model prediction $g_\theta(X_{\text{new}})$ and the empirical distribution of $(X, g_\theta(X))$ based on the training data, respectively, may be used instead. When the ML model performs well during training, i.e., $g_\theta(x_i) = y_i$ for almost all training data points, the empirical versions of $P_{XY}$ and $P_{X\hat{Y}}$ agree, and we again arrive at a definition that does not involve the ML model. Thus, we conclude that it is necessary to use other functions of the input derived from the ML model (in this paper, we use the terms statistic and score interchangeably to denote such functions), in addition to just the final output, in constructing test statistics for effective OOD detection. Such a strategy is commonly employed, without theoretical justification, in many OOD detection works, for instance, through the use of intermediate features of a neural network to calculate the Mahalanobis score ([1]) and the Gram matrix score ([3]), and gradient information to calculate the GradNorm score ([4]). The discussion above provides a qualitative theoretical justification for these strategies developed in prior works.
III Proposed Framework and Algorithm
In this section, we describe our proposed framework formally, and present our algorithm to combine any number of different functions of the input with an arbitrary dependence structure.
In our formulation of OOD detection in (2), we posit that, in addition to the input and the output from the ML model, it is necessary to use other functions of the input (without loss of generality, we may assume that these functions are scalar-valued) which are dependent on the ML model. We refer to these functions as score functions, denoted by $T_1, \dots, T_N$. The outputs of the score functions are scalar-valued scores:
$$s_i = T_i(x), \quad i = 1, \dots, N. \qquad (3)$$
The score functions are assumed to be chosen based on prior information in such a way that the scores are likely to take on small values for in-distribution inputs and larger values for OOD inputs. For a new input $X_{\text{new}}$, let $T_1(X_{\text{new}}), \dots, T_N(X_{\text{new}})$ be the corresponding scores. Note that one of the scores could be based on the final output from the learning model $g_\theta$.
III-A Motivation for multiple testing framework
In order to construct an OOD detection test for the new sample using the scores, the scores would need to be combined in some manner. Since we do not know the dependence structure between the scores, combining them in an ad hoc manner, such as summing them up, cannot be justified and may result in tests with low power (probability of detection) for many OOD distributions. For instance, consider a simple bivariate Gaussian setting as follows:
(4) |
Let the statistic , and let be the p-value when the observed value of the statistic is . Recall that the p-value is given by:
(5) |
For given , let denote the test which rejects if . For test , the probability of false alarm, i.e.,
(6) |
can be controlled at , by exploiting the fact that p-values have a uniform distribution under the null hypothesis. However, the detection power of the test under different possible distributions under the alternate hypothesis might be poor. For instance, if under the alternate hypothesis, the statistic has the same distribution under the null and alternate hypotheses. Thus the detection power of test is upper bounded by . It is possible to find many such joint distributions for the alternate hypothesis, under which the detection power of test is poor, i.e., it is close to the probability of false alarm.
On the other hand, consider the following split of the above testing problem into two binary hypothesis testing problems, corresponding to the statistics and :
(7) |
Let and be the p-values corresponding to the two individual tests in (7), and be the ordered p-values. Let
(8) |
Then, let test be defined such that it rejects if .
Similar to test , the probability of false alarm of test can be controlled at level . On the other hand, we see that the detection power of test when , where , satisfies the following condition:
(9) |
where is the complementary cumulative distribution function of a random variable.
Thus, the detection power satisfies a minimum quality of performance under all distributions for the alternate hypothesis. Note that for some distributions under the alternate hypothesis, it is also possible for the first test to have better detection power than the second (cf. (9)). For instance, if , then the detection performance of test is better than that of . However, if we do not have any prior information on the behaviour of the statistics under the alternate hypotheses, combining multiple test statistics in an ad hoc manner (such as summing them) might not be desirable. Further, there is no obvious way to combine two completely different sets of statistics, say the Mahalanobis scores from different layers of a DNN and the energy score.
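The contrast between the two approaches can be reproduced with a short simulation. The sketch below is purely illustrative: it assumes independent unit-variance Gaussian statistics, specific mean shifts under the alternative, and a simple Bonferroni-style combination of the two p-values in place of the exact ordered-p-value rule in (8).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_trials, alpha = 100_000, 0.05

def simulate(mean1, mean2):
    # Two unit-variance Gaussian statistics; the null corresponds to zero means.
    t1 = rng.normal(mean1, 1.0, n_trials)
    t2 = rng.normal(mean2, 1.0, n_trials)
    # Test 1: p-value of the summed statistic (the sum is N(0, 2) under the null).
    reject_sum = norm.sf((t1 + t2) / np.sqrt(2)) <= alpha
    # Test 2: combine the two individual p-values (Bonferroni-style).
    reject_comb = np.minimum(norm.sf(t1), norm.sf(t2)) <= alpha / 2
    return reject_sum.mean(), reject_comb.mean()

print("null (0, 0):        ", simulate(0.0, 0.0))    # both rates stay below alpha
print("alternative (3, -3):", simulate(3.0, -3.0))   # sum test is powerless, combined test is not
print("alternative (2, 2): ", simulate(2.0, 2.0))    # sum test can win when both statistics shift together
```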
III-B Proposed OOD Detection Test
Motivated by the above discussion, we propose the following multiple testing framework for OOD detection:
$$\mathcal{H}_{0,i}: T_i(X_{\text{new}}) \sim Q_i \quad \text{versus} \quad \mathcal{H}_{1,i}: T_i(X_{\text{new}}) \nsim Q_i, \quad i = 1, \dots, N, \qquad (10)$$
where $Q_1, \dots, Q_N$ are the distributions of the corresponding scores when $X_{\text{new}}$ is an in-distribution sample as defined in (2). It is easy to see that if the new input $X_{\text{new}}$ is an in-distribution sample, then all of $\mathcal{H}_{0,1}, \dots, \mathcal{H}_{0,N}$ are true in (10), and if $X_{\text{new}}$ is an OOD sample, then one or more of them are likely to be false. Thus, we propose a test that declares the instance as OOD if any of the $\mathcal{H}_{0,i}$ are rejected.
We propose an algorithm for OOD detection inspired by the Benjamini-Hochberg (BH) procedure given in [20] (preliminaries are provided in the Appendix). Most multiple testing techniques, including the BH procedure, involve computing the p-values of the individual tests. The p-value of a realization $t$ of the test statistic $T_i(X_{\text{new}})$ is given by
$$u_i(t) = 1 - F_i(t), \qquad (11)$$
where $F_i$ is the CDF of $T_i(X_{\text{new}})$ under the null hypothesis $\mathcal{H}_{0,i}$. The p-value for $T_i(X_{\text{new}})$ is a random variable
$$U_i = 1 - F_i\left(T_i(X_{\text{new}})\right). \qquad (12)$$
The distribution of this p-value under null hypothesis is uniform over . Its distribution under the alternate hypothesis concentrates around 0, and is difficult to characterize in general. Also, while a p-value close to 0 is evidence against the null hypothesis, a large p-value does not provide evidence in favor of the null hypothesis.
If we do not know the distributions under the null hypotheses to calculate the exact p-values, conformal inference methods suggest evaluating the empirical CDF of $T_i(X_{\text{new}})$ under the null hypothesis using a hold-out set (denoted by $D_{\text{cal}}$) known as the calibration set, to construct a conformal p-value $\hat{u}_i$. A conformal p-value satisfies the following property:
$$\mathbb{P}\left(\hat{u}_i(X_{\text{new}}) \le t\right) \le t \quad \text{for all } t \in (0,1), \qquad (13)$$
when $T_i(X_{\text{new}})$ is independent of $D_{\text{cal}}$ and has a continuous distribution. The classical conformal p-value (see [21]) is given by:
$$\hat{u}_i(x) = \frac{1 + \left|\left\{x' \in D_{\text{cal}} : T_i(x') \ge T_i(x)\right\}\right|}{\left|D_{\text{cal}}\right| + 1}. \qquad (14)$$
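A minimal sketch of this computation, under the convention used throughout the paper that larger scores indicate OOD (the names cal_scores and score_new are placeholders):

```python
import numpy as np

def conformal_p_value(score_new: float, cal_scores: np.ndarray) -> float:
    """Classical conformal p-value of (14): rank of the new score among the
    calibration scores, with larger scores treated as more OOD-like."""
    n = len(cal_scores)
    return (1.0 + float(np.sum(cal_scores >= score_new))) / (n + 1.0)

# Example: a score exceeding most calibration scores receives a small p-value.
cal = np.random.default_rng(0).normal(size=1000)
print(conformal_p_value(3.5, cal))
```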
The estimate $\hat{u}_i$ is said to be a marginally valid conformal p-value, as it depends on $D_{\text{cal}}$. In other words, (13) can be rewritten as follows:
$$\mathbb{E}_{D_{\text{cal}}}\left[\mathbb{P}\left(\hat{u}_i(X_{\text{new}}) \le t \,\middle|\, D_{\text{cal}}\right)\right] \le t, \qquad (15)$$
where the expectation is over all possible calibration datasets. The property in (13) is, however, not valid conditionally, i.e., $\mathbb{P}\left(\hat{u}_i(X_{\text{new}}) \le t \,\middle|\, D_{\text{cal}}\right)$ need not be upper-bounded by $t$. This is important to note, as false alarm guarantees given for out-of-distribution detection methods using conformal inference (see, e.g., [22, 6]) are based on (15). Such guarantees are not strong, as they only guarantee that the probability of false alarm, averaged over all possible calibration datasets, is controlled. While the problem of conditional coverage has been discussed in the context of sequential testing for distribution shifts (e.g., [23]) and conformal inference (e.g., [24]), it has not been discussed widely in the setting of single-sample OOD detection.
The related problem of outlier testing using conformal p-values is studied in [25]. However, the result from [25], stating that conformal p-values satisfy the PRDS (Positive Regression Dependent on a Subset) property required for False Discovery Rate (FDR) control in the BH procedure, is valid only in the setting where the individual test statistics (and hence the original p-values) are independent. The PRDS property does not hold for the conformal p-values in Algorithm 1, since the corresponding p-values (see (12)) are highly dependent through the common input. In addition, the conditional false alarm guarantees provided in [25] utilize the calibration conditionally valid (CCV) p-values proposed therein, as opposed to the conformal p-values proposed in [21] (which we use in our work). Indeed, these CCV p-values cannot be directly used in our setting to obtain the false alarm guarantees in Theorem 1 without an adjustment to the thresholds similar to (18), as the p-values would be dependent through both the calibration dataset and the input.
In our proposed OOD detection test we use conformal p-values in place of the actual p-values. In order to compute the conformal p-values, we maintain a calibration set .
In this work, we aim to provide conditional false alarm guarantees, i.e., if $X_{\text{new}}$ is an in-distribution sample (all $\mathcal{H}_{0,i}$ are true in (10)), then
$$\mathbb{P}\left(X_{\text{new}} \text{ declared OOD} \,\middle|\, D_{\text{cal}}\right) \qquad (16)$$
is controlled with high probability. As discussed earlier in this section, such conditional guarantees are essential for the safe deployment of OOD detection algorithms. Note that in the literature on multiple testing, the marginal false alarm probability is equivalent to the Family Wise Error Rate (FWER) or False Discovery Rate (FDR) when all the null hypotheses are true in (10) (detailed discussion provided in the Appendix).
We compute the scores of these statistics for the samples in the calibration set $D_{\text{cal}}$. Using these, we calculate the conformal p-values $\hat{u}_1, \dots, \hat{u}_N$ for the new sample as in (14), and order the conformal p-values in increasing order as $\hat{u}_{(1)} \le \dots \le \hat{u}_{(N)}$. Let be a parameter of the OOD detection algorithm, and let , and let
(17) |
where
(18) |
The factor of is included in order to obtain false alarm guarantees for any arbitrary dependence between the test statistics. The factor of is a constant related to the size of the calibration dataset, and is introduced to provide strong conditional false alarm guarantees, conditioned on the calibration set (discussed further in the proof of the results below). While choosing a smaller value of improves the power of the proposed OOD detection test, it increases the size of the calibration set needed to provide the conditional false alarm guarantees. The OOD detection test declares the instance as OOD if , i.e., if any of the are rejected. The pseudo-code is described in Algorithm 1.
(19) |
For instance, consider a Deep Neural Network (DNN) with multiple layers, with Mahalanobis scores ([1]) and Gram deviation scores ([3]) computed at each layer. The authors of [1] use outlier exposure to combine the Mahalanobis scores into a single score for a threshold-based test, and [3] uses the sum of the Gram deviation scores for a similar test. However, it is not straightforward to determine how to combine the Mahalanobis scores and the Gram deviation scores for OOD detection without outlier exposure. In Algorithm 1, we provide a systematic way to construct a test that uses all these contrasting scores. In addition, we provide a systematic way to design the test thresholds to meet a given false alarm constraint, as presented below in Theorem 1.
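As an illustration of how the combination in Algorithm 1 can be implemented, the sketch below computes conformal p-values for an arbitrary collection of scores and applies Benjamini-Yekutieli-style step-up thresholds. The exact thresholds in (17)-(18) are not reproduced here; in particular, the `shrink` argument is only a stand-in for the calibration-size-dependent factor, and `shrink=1` gives the plain Benjamini-Yekutieli thresholds.

```python
import numpy as np

def ood_test(new_scores, cal_scores, alpha=0.1, shrink=1.0):
    """Sketch of the proposed multiple-testing OOD detector.

    new_scores: length-N array of scores for the test input (larger = more OOD-like).
    cal_scores: (n_cal, N) array of the same scores computed on the calibration set.
    shrink:     stand-in for the calibration-size-dependent factor in (18).
    Returns True if the input is declared OOD (any null hypothesis rejected).
    """
    n_cal, num_scores = cal_scores.shape
    # Conformal p-value for each score, as in (14).
    p_vals = (1.0 + (cal_scores >= new_scores).sum(axis=0)) / (n_cal + 1.0)
    p_sorted = np.sort(p_vals)
    # Benjamini-Yekutieli correction for arbitrary dependence between the scores.
    c_n = np.sum(1.0 / np.arange(1, num_scores + 1))
    thresholds = alpha * shrink * np.arange(1, num_scores + 1) / (num_scores * c_n)
    # Declare OOD if any ordered conformal p-value falls below its threshold.
    return bool(np.any(p_sorted <= thresholds))
```

With a fixed calibration set, the conditional false alarm rate of such a test can be estimated by applying it to held-out in-distribution samples, which is exactly the quantity bounded in Theorem 1 below.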
III-C Theoretical Guarantees
On running Algorithm 1, we can guarantee that the conditional probability of the false alarm is bounded by with high probability. In order to provide this guarantee, we need to enforce certain sample complexity conditions on the size of the calibration set , as detailed in the Lemma below.
Lemma 1.
Let , and be as in Algorithm 1. Let , , and . For a given , let be such that
(20) |
where is the regularized incomplete beta function (the CDF of a Beta distribution with parameters ). Then for random variables for ,
(21) |
Proof.
When the condition on in (20) is satisfied, we have that
(22) |
where is the CDF of a Beta distribution with parameters , and the second inequality follows since is upper bounded by . From the Union Bound, we have that,
(23) |
Thus, we have the desired result in Lemma 1. ∎
The condition on in Lemma 1 is due to the fact that the CDF of the conformal p-values conditioned on the calibration dataset follows a Beta distribution (see [21]), and is essential to provide the guarantees in Theorem 1. Due to the form of the CDF of the Beta distribution, it is difficult to characterize the dependence of on , , and in closed form. We plot the calibration dataset sizes as given by Lemma 1 for and for different values of in Figure 1. Note that is conservative.
[Figure 1: Calibration dataset sizes required by Lemma 1 for different parameter values.]
In the following result, we formally present the conditional false alarm guarantee for Algorithm 1.
Theorem 1.
We adapt the proof of FDR control for the BH procedure provided in [20] to our algorithm, accounting for the use of conformal p-values estimated from the calibration set instead of the actual p-values in Algorithm 1. The details of the proof are presented in the Appendix.
[Figure 2: Empirical false alarm probabilities of Algorithm 1 with CIFAR10 and SVHN as in-distribution datasets, for the ResNet and DenseNet architectures; the dashed line marks the theoretical upper bound from Theorem 1.]
We verify the results in Theorem 1 through experiments with CIFAR10 and SVHN as in-distribution datasets, and ResNet and DenseNet architectures (more details on the experimental setup are given in Section IV). In Figure 2, we plot the false alarm probabilities when the thresholds for comparing the conformal p-values are set according to Algorithm 1. The dashed line represents the theoretical upper bound on the false alarm probability. As seen in Figure 2, the false alarm probability is bounded by the theoretical upper bound as stated in Theorem 1 for all settings considered. Note that the results in this paper hold for any given ML model, and while the bound may be conservative for certain settings (e.g., DenseNet with CIFAR10), it is tight in other cases (e.g., ResNet with SVHN).
Such strong theoretical guarantees are absent in most prior work on OOD detection. A few works that have suggested the use of conformal p-values for OOD detection, such as [6], provide marginal false alarm guarantees of the form:
(25) |
where the expectation is over all possible calibration sets. (See also the discussion surrounding (15).) However, this does not guarantee that the false alarm level is maintained with high probability for the particular calibration dataset used. In addition, it does not provide any information on the size of the calibration dataset to be used.
IV Experimental Evaluation
In the previous section, we provided strong guarantees on the probability of false alarm for Algorithm 1. However, since it is not possible to theoretically analyze the power of such a test (due to the structure of the alternate hypothesis), we evaluate the power of our proposed approach through experiments. In addition, since we do not know beforehand what kind of OOD samples might arise at inference time, an effective OOD detection test must also have low variance across different OOD datasets for a given Deep Neural Network (DNN) architecture. In our experiments, we evaluate both of these aspects to demonstrate the effectiveness of our approach.
Following the standard protocol for OOD detection ([1, 3, 5]), we consider settings with CIFAR10 and SVHN as the in-distribution datasets.
• For CIFAR10 as the in-distribution dataset, we study SVHN, LSUN, ImageNet, and iSUN as OOD datasets.
• For SVHN as the in-distribution dataset, we study LSUN, ImageNet, CIFAR10, and iSUN as OOD datasets.
We evaluate the performance on two pre-trained architectures: ResNet34 ([26]) and DenseNet ([27]). The calibration dataset in each case is a subset of 5000 samples of the in-distribution training dataset.
We evaluate the proposed approach and compare it with SOTA methods based on the standard metric of probability of detection, or power (i.e., the probability of correctly detecting an OOD sample), at a false alarm probability of 0.1. Note that in some prior work on OOD detection, the probability of detection is referred to as the True Negative Rate (TNR), and one minus the false alarm probability as the True Positive Rate (TPR), where in-distribution samples are considered positives and OOD samples are considered negatives.
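For a single scalar score, this metric can be computed as in the following sketch, assuming larger scores indicate OOD and using held-out in-distribution scores to set the threshold:

```python
import numpy as np

def detection_power_at_fpr(id_scores, ood_scores, fpr=0.1):
    """Fraction of OOD samples flagged when the threshold is chosen so that
    roughly `fpr` of in-distribution samples are (incorrectly) flagged."""
    threshold = np.quantile(id_scores, 1.0 - fpr)
    return float(np.mean(np.asarray(ood_scores) > threshold))
```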
Recall that we focus exclusively on methods that do not require any outlier exposure to OOD samples (as is required in, e.g., [10, 2, 1]) and can be applied to any pre-trained ML model. We compare our approach against the following baselines: Mahalanobis ([1]), Gram matrix ([3]), and Energy ([5]). For the Mahalanobis baseline, we use the scores from the penultimate layer of the network to maintain uniformity.
To evaluate our proposed method, we systematically combine the following test statistics using our multiple testing approach as detailed in Algorithm 1:
1. Mahalanobis distances from individual DNN layers ([1]): Let $h_\ell(x)$ denote the output of the $\ell$-th intermediate layer of the neural network for an input $x$. We estimate $\hat{\mu}_{\ell,c}$, the class-wise mean of $h_\ell(X)$, as the empirical class-wise mean from the training dataset:
$$\hat{\mu}_{\ell,c} = \frac{1}{n_c} \sum_{i : y_i = c} h_\ell(x_i), \qquad (26)$$
where $n_c$ is the number of points with label $c$. We estimate the common covariance for all classes as
$$\hat{\Sigma}_\ell = \frac{1}{n} \sum_{c} \sum_{i : y_i = c} \left(h_\ell(x_i) - \hat{\mu}_{\ell,c}\right)\left(h_\ell(x_i) - \hat{\mu}_{\ell,c}\right)^{\top}. \qquad (27)$$
This is equivalent to fitting class-conditional Gaussian distributions with a tied covariance. The Mahalanobis score for layer $\ell$ is calculated as:
$$M_\ell(x) = \min_{c} \left(h_\ell(x) - \hat{\mu}_{\ell,c}\right)^{\top} \hat{\Sigma}_\ell^{-1} \left(h_\ell(x) - \hat{\mu}_{\ell,c}\right). \qquad (28)$$
We calculate 5 Mahalanobis scores from the intermediate layers for the ResNet34 architecture, and 4 scores for the DenseNet architecture.
2. Gram matrix deviations from the individual DNN layers ([3]): For each intermediate layer $\ell$, the Gram matrix of order $p$ is calculated as:
$$G^{p}_{\ell}(x) = \left(h_\ell(x)^{p}\right) \left(h_\ell(x)^{p}\right)^{\top}, \qquad (29)$$
where the power $p$ is calculated element-wise. For each flattened upper triangular Gram matrix $\bar{G}^{p}_{\ell}(x)$, there are $n_\ell (n_\ell + 1)/2$ correlations, where $n_\ell$ is the number of channels in layer $\ell$. The class-specific minimum and maximum values for correlation $j$ (i.e., the $j$-th element of $\bar{G}^{p}_{\ell}$), class $c$, layer $\ell$ and power $p$ are estimated from the training dataset as $\mathrm{min}_{c,j,\ell,p}$ and $\mathrm{max}_{c,j,\ell,p}$, respectively. For a new input $x$ with predicted class $\hat{y}$, the deviation for correlation $j$, layer $\ell$, power $p$ is calculated with respect to the predicted class as
$$\delta_{j,\ell,p}(x) = \begin{cases} 0, & \text{if } \mathrm{min}_{\hat{y},j,\ell,p} \le \bar{G}^{p}_{\ell}(x)[j] \le \mathrm{max}_{\hat{y},j,\ell,p}, \\ \dfrac{\mathrm{min}_{\hat{y},j,\ell,p} - \bar{G}^{p}_{\ell}(x)[j]}{\left|\mathrm{min}_{\hat{y},j,\ell,p}\right|}, & \text{if } \bar{G}^{p}_{\ell}(x)[j] < \mathrm{min}_{\hat{y},j,\ell,p}, \\ \dfrac{\bar{G}^{p}_{\ell}(x)[j] - \mathrm{max}_{\hat{y},j,\ell,p}}{\left|\mathrm{max}_{\hat{y},j,\ell,p}\right|}, & \text{otherwise}. \end{cases} \qquad (30)$$
As proposed in [3], the Gram matrix score for layer $\ell$ is then calculated as the sum of $\delta_{j,\ell,p}(x)$ over values of $p$ from 1 to 10 and over all values of $j$, normalized by the empirical mean of this deviation. We calculate 5 Gram scores from the intermediate layers for the ResNet34 architecture, and 4 scores for the DenseNet architecture.
3. Energy statistic ([5]): The energy score is a temperature-scaled log-sum-exponent of the softmax scores,
$$E(x) = -T \log \sum_{c=1}^{C} \exp\left(s_c(x)/T\right), \qquad (31)$$
where $C$ is the number of classes, $s_c(x)$ are the softmax scores (the classifier logits), and $T$ is the temperature parameter. In our experiments, we set the temperature to 100 for all in-distribution datasets, DNN architectures and OOD datasets (as stated in [5], the energy score is not sensitive to the temperature parameter). A minimal code sketch of these score computations is given after this list.
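The following sketch illustrates these three kinds of scores on pre-extracted features. It is a simplified rendering, not the authors' implementation: the names features, labels, feature_map, mins, maxs and logits are placeholders, the Gram deviation omits the per-class and per-layer bookkeeping of [3], and the normalization by the mean deviation is left out for brevity.

```python
import numpy as np
from scipy.special import logsumexp

def fit_mahalanobis(features, labels):
    """Class-wise means and tied covariance for one layer (cf. (26)-(27))."""
    classes = np.unique(labels)
    means = {c: features[labels == c].mean(axis=0) for c in classes}
    centered = np.concatenate([features[labels == c] - means[c] for c in classes])
    cov = centered.T @ centered / len(features)
    return means, np.linalg.pinv(cov)

def mahalanobis_score(x_feat, means, cov_inv):
    """Minimum class-conditional Mahalanobis distance (cf. (28))."""
    dists = [(x_feat - mu) @ cov_inv @ (x_feat - mu) for mu in means.values()]
    return float(min(dists))

def gram_deviation(feature_map, mins, maxs, powers=range(1, 11)):
    """Sum of range deviations of higher-order Gram correlations (cf. (29)-(30));
    `mins`/`maxs` map each power to per-entry ranges estimated on training data."""
    total = 0.0
    for p in powers:
        fp = np.sign(feature_map) * np.abs(feature_map) ** p
        gram = fp @ fp.T
        corr = gram[np.triu_indices_from(gram)]
        lo, hi = mins[p], maxs[p]
        total += np.sum(np.where(corr < lo, (lo - corr) / (np.abs(lo) + 1e-12), 0.0))
        total += np.sum(np.where(corr > hi, (corr - hi) / (np.abs(hi) + 1e-12), 0.0))
    return float(total)

def energy_score(logits, temperature=100.0):
    """Energy statistic of (31): larger values suggest OOD."""
    return float(-temperature * logsumexp(np.asarray(logits) / temperature))
```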
We use a subset of 45000 points from the training dataset (with no overlap with the calibration dataset) to calculate the class-wise empirical means and covariance for the Mahalanobis scores, and the minimum and maximum correlations for the Gram scores.
[Figure 3: Empirical probability of each individual score's null hypothesis being rejected by Algorithm 1 for the different OOD datasets, with CIFAR10 and SVHN as in-distribution datasets and the ResNet34 and DenseNet architectures.]
For CIFAR10 and SVHN as the in-distribution datasets, we use our proposed method in Algorithm 1 to combine the Mahalanobis scores and Gram scores across layers, and the energy score, to detect OOD samples. There are 11 scores in total (i.e., $N = 11$) for the ResNet34 architecture, and 9 scores (i.e., $N = 9$) for the DenseNet architecture. Recall that Algorithm 1 declares an input to be an OOD sample if any of the null hypotheses corresponding to the scores are rejected. For different OOD datasets, we empirically study the probability of each null hypothesis being rejected by Algorithm 1. In Figure 3, we plot the empirical probability of each score being rejected, i.e., the proportion of data points in each OOD dataset for which the corresponding null hypothesis was rejected. The Mahalanobis score and Gram score of layer $i$ are denoted by ‘Mahala i’ and ‘Gram i’, respectively, and the energy score is denoted by ‘Energy’. We observe that while the probability of a score being rejected is high for certain OOD datasets, there exist OOD instances for which it is quite low. For example, in the ResNet34 architecture with CIFAR10 as the in-distribution dataset, while the Mahalanobis scores of layers 2, 3 and 5, and the Gram scores of layer 5, are useful for detecting OOD instances from the LSUN, ImageNet and iSUN datasets, they are not likely to be useful in detecting OOD instances from the SVHN dataset. On the other hand, the Mahalanobis and Gram scores from layer 4 of the network are more useful in detecting OOD instances from the SVHN dataset than from the LSUN, ImageNet and iSUN datasets. This study provides evidence that any single score may not be useful to detect all kinds of OOD instances that an ML model might encounter at inference time, and combining different scores systematically, as proposed in Algorithm 1, might lead to a more robust OOD detection method. We demonstrate an improvement in detection performance and the robustness of our proposed OOD detection method through extensive experiments presented further in this section.
The detection power performances for CIFAR10 and SVHN as in-distribution datasets are presented in Tables I and II, for the Mahalanobis, Gram and Energy baselines, and our proposed method of combining different statistics. We annotate our method with the number of statistics used, e.g., Mahalanobis, Gram and Energy (5/4+5/4+1) uses 5 and 4 layers in the ResNet34 and DenseNet architectures, respectively, for both Mahalanobis and Gram, together with the energy score. For each in-distribution dataset, we consider 8 cases, comprising 4 OOD datasets and 2 different DNN architectures.
Table I: Detection power (%) at a false alarm probability of 0.1, with CIFAR10 as the in-distribution dataset.

| OOD Dataset | Method | ResNet34 | DenseNet |
|---|---|---|---|
| SVHN | Mahala (penultimate layer) | 82.77 | 92.98 |
| SVHN | Gram (sum across layers) | 96.04 | 89.97 |
| SVHN | Energy | 73.21 | 42.40 |
| SVHN | Ours - Mahala (5/4) | 87.92 | 93.16 |
| SVHN | Ours - Gram (5/4) | 95.61 | 89.90 |
| SVHN | Ours - Mahala, Energy (5/4 + 1) | 91.88 | 94.03 |
| SVHN | Ours - Gram, Energy (5/4 + 1) | 96.78 | 90.77 |
| SVHN | Ours - Mahala, Gram (5/4 + 5) | 96.23 | 94.21 |
| SVHN | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 97.13 | 94.57 |
| ImageNet | Mahala (penultimate layer) | 85.45 | 82.81 |
| ImageNet | Gram (sum across layers) | 92.34 | 80.04 |
| ImageNet | Energy | 76.76 | 94.93 |
| ImageNet | Ours - Mahala (5/4) | 96.90 | 95.19 |
| ImageNet | Ours - Gram (5/4) | 92.60 | 80.12 |
| ImageNet | Ours - Mahala, Energy (5/4 + 1) | 97.28 | 98.09 |
| ImageNet | Ours - Gram, Energy (5/4 + 1) | 94.53 | 95.19 |
| ImageNet | Ours - Mahala, Gram (5/4 + 5) | 96.38 | 92.81 |
| ImageNet | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 97.03 | 97.20 |
| LSUN | Mahala (penultimate layer) | 90.97 | 84.11 |
| LSUN | Gram (sum across layers) | 95.94 | 81.83 |
| LSUN | Energy | 81.16 | 96.89 |
| LSUN | Ours - Mahala (5/4) | 98.11 | 96.38 |
| LSUN | Ours - Gram (5/4) | 96.16 | 81.67 |
| LSUN | Ours - Mahala, Energy (5/4 + 1) | 97.87 | 98.20 |
| LSUN | Ours - Gram, Energy (5/4 + 1) | 96.61 | 96.43 |
| LSUN | Ours - Mahala, Gram (5/4 + 5/4) | 98.02 | 94.40 |
| LSUN | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 98.00 | 97.78 |
| iSUN | Mahala (penultimate layer) | 89.99 | 83.19 |
| iSUN | Gram (sum across layers) | 95.10 | 81.47 |
| iSUN | Energy | 80.11 | 95.10 |
| iSUN | Ours - Mahala (5/4) | 97.24 | 95.26 |
| iSUN | Ours - Gram (5/4) | 95.11 | 81.09 |
| iSUN | Ours - Mahala, Energy (5/4 + 1) | 97.17 | 97.12 |
| iSUN | Ours - Gram, Energy (5/4 + 1) | 96.19 | 94.73 |
| iSUN | Ours - Mahala, Gram (5/4 + 5/4) | 97.36 | 92.93 |
| iSUN | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 97.67 | 96.34 |
Table II: Detection power (%) at a false alarm probability of 0.1, with SVHN as the in-distribution dataset.

| OOD Dataset | Method | ResNet34 | DenseNet |
|---|---|---|---|
| ImageNet | Mahala (penultimate layer) | 96.12 | 96.34 |
| ImageNet | Gram (sum across layers) | 97.52 | 93.57 |
| ImageNet | Energy | 85.14 | 70.53 |
| ImageNet | Ours - Mahala (5/4) | 99.91 | 99.95 |
| ImageNet | Ours - Gram (5/4) | 97.68 | 94.38 |
| ImageNet | Ours - Mahala, Energy (5/4 + 1) | 99.89 | 99.93 |
| ImageNet | Ours - Gram, Energy (5/4 + 1) | 97.85 | 95.01 |
| ImageNet | Ours - Mahala, Gram (5/4 + 5/4) | 99.83 | 99.91 |
| ImageNet | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 99.84 | 99.89 |
| LSUN | Mahala (penultimate layer) | 93.74 | 94.17 |
| LSUN | Gram (sum across layers) | 96.20 | 88.25 |
| LSUN | Energy | 81.30 | 71.36 |
| LSUN | Ours - Mahala (5/4) | 99.98 | 100.0 |
| LSUN | Ours - Gram (5/4) | 96.54 | 89.02 |
| LSUN | Ours - Mahala, Energy (5/4 + 1) | 99.96 | 99.99 |
| LSUN | Ours - Gram, Energy (5/4 + 1) | 96.82 | 90.56 |
| LSUN | Ours - Mahala, Gram (5/4 + 5/4) | 99.96 | 99.98 |
| LSUN | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 99.95 | 100.0 |
| iSUN | Mahala (penultimate layer) | 95.23 | 96.01 |
| iSUN | Gram (sum across layers) | 96.50 | 91.46 |
| iSUN | Energy | 82.79 | 71.20 |
| iSUN | Ours - Mahala (5/4) | 99.98 | 100.0 |
| iSUN | Ours - Gram (5/4) | 96.80 | 91.89 |
| iSUN | Ours - Mahala, Energy (5/4 + 1) | 99.93 | 100.0 |
| iSUN | Ours - Gram, Energy (5/4 + 1) | 97.21 | 92.69 |
| iSUN | Ours - Mahala, Gram (5/4 + 5/4) | 99.88 | 99.98 |
| iSUN | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 99.88 | 99.98 |
| CIFAR10 | Mahala (penultimate layer) | 96.09 | 94.25 |
| CIFAR10 | Gram (sum across layers) | 91.58 | 69.77 |
| CIFAR10 | Energy | 83.31 | 54.07 |
| CIFAR10 | Ours - Mahala (5/4) | 98.31 | 97.64 |
| CIFAR10 | Ours - Gram (5/4) | 92.39 | 72.84 |
| CIFAR10 | Ours - Mahala, Energy (5/4 + 1) | 98.13 | 97.16 |
| CIFAR10 | Ours - Gram, Energy (5/4 + 1) | 92.91 | 78.03 |
| CIFAR10 | Ours - Mahala, Gram (5/4 + 5) | 97.15 | 94.83 |
| CIFAR10 | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 97.35 | 95.23 |
1. Improvement in probability of detection across OOD datasets and DNN architectures: The best probability of detection in all 8 cases with CIFAR10 as the in-distribution dataset corresponds to our method of combining statistics. Similarly, with SVHN as the in-distribution dataset, our method of combining statistics gives the best probability of detection in all 8 cases. Thus, our approach leads to an improvement across OOD datasets and DNN architectures.
2. Lower variation in detection probability across OOD datasets and DNN architectures: The detection probabilities of the Mahalanobis, Gram and Energy baselines exhibit a much higher variation across different kinds of OOD samples as compared to the combination of all statistics. With CIFAR10 as the in-distribution dataset, for the ResNet34 architecture: the variation in is for the Mahalanobis baseline, for the Gram baseline, and for the energy baseline. In contrast, our method of combining all statistics has a variation of . For DenseNet, the variation in is for the Mahalanobis baseline, for the Gram baseline, and for the energy baseline. Our method of combining all statistics has a variation of . A similar trend is seen with SVHN as the in-distribution dataset. Our method reduces the variation across different kinds of OOD samples by almost 5X. This is a key improvement, as the kind of OOD samples encountered at inference time is unknown, and our proposed method shows very little variation across different OOD datasets.
3. Impact of combining all the scores: For CIFAR10 as the in-distribution dataset, in all 8 cases, combining all the scores (Mahalanobis and Gram from individual layers, and the energy score) is either the best method or within of the best performance. Similarly, with SVHN as the in-distribution dataset, in 7 out of 8 cases, combining all the scores is either the best method or within of the best performance (the gap is in the remaining case). Thus, in contrast to existing methods, combining all the statistics using Algorithm 1 is robust to different kinds of OOD samples across DNN architectures.
Table III: Detection power (%) of the naive averaging baseline versus our proposed method (both combining the Mahalanobis, Gram and energy scores).

| In-distribution Dataset | OOD Dataset | Method | ResNet34 | DenseNet |
|---|---|---|---|---|
| CIFAR10 | SVHN | Naive | 81.13 | 83.28 |
| CIFAR10 | SVHN | Ours | 97.13 | 94.57 |
| CIFAR10 | ImageNet | Naive | 86.45 | 80.96 |
| CIFAR10 | ImageNet | Ours | 97.03 | 97.20 |
| CIFAR10 | LSUN | Naive | 91.31 | 83.79 |
| CIFAR10 | LSUN | Ours | 98.00 | 97.78 |
| CIFAR10 | iSUN | Naive | 89.22 | 81.70 |
| CIFAR10 | iSUN | Ours | 97.67 | 96.34 |
| SVHN | ImageNet | Naive | 97.08 | 95.67 |
| SVHN | ImageNet | Ours | 99.84 | 99.89 |
| SVHN | LSUN | Naive | 95.00 | 92.81 |
| SVHN | LSUN | Ours | 99.95 | 100.0 |
| SVHN | iSUN | Naive | 96.00 | 94.53 |
| SVHN | iSUN | Ours | 99.88 | 99.98 |
| SVHN | CIFAR10 | Naive | 86.10 | 77.22 |
| SVHN | CIFAR10 | Ours | 97.35 | 95.23 |
Table IV: AUROC (%) with CIFAR10 as the in-distribution dataset.

| OOD Dataset | Method | ResNet34 | DenseNet |
|---|---|---|---|
| SVHN | Mahala (penultimate layer) | 93.86 | 96.72 |
| SVHN | Gram (sum across layers) | 97.28 | 94.31 |
| SVHN | Energy | 90.24 | 77.92 |
| SVHN | Ours - Mahala (5/4) | 95.34 | 96.70 |
| SVHN | Ours - Gram (5/4) | 97.47 | 94.28 |
| SVHN | Ours - Mahala, Energy (5/4 + 1) | 95.84 | 96.99 |
| SVHN | Ours - Gram, Energy (5/4 + 1) | 97.90 | 96.20 |
| SVHN | Ours - Mahala, Gram (5/4 + 5/4) | 97.56 | 96.98 |
| SVHN | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 97.72 | 97.24 |
| ImageNet | Mahala (penultimate layer) | 94.84 | 93.12 |
| ImageNet | Gram (sum across layers) | 95.90 | 89.83 |
| ImageNet | Energy | 91.40 | 96.03 |
| ImageNet | Ours - Mahala (5/4) | 97.89 | 97.32 |
| ImageNet | Ours - Gram (5/4) | 96.09 | 89.75 |
| ImageNet | Ours - Mahala, Energy (5/4 + 1) | 97.97 | 98.13 |
| ImageNet | Ours - Gram, Energy (5/4 + 1) | 97.07 | 96.79 |
| ImageNet | Ours - Mahala, Gram (5/4 + 5/4) | 97.55 | 96.67 |
| ImageNet | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 97.63 | 97.70 |
| LSUN | Mahala (penultimate layer) | 96.28 | 90.00 |
| LSUN | Gram (sum across layers) | 97.31 | 87.97 |
| LSUN | Energy | 92.35 | 96.83 |
| LSUN | Ours - Mahala (5/4) | 98.20 | 97.54 |
| LSUN | Ours - Gram (5/4) | 97.46 | 87.76 |
| LSUN | Ours - Mahala, Energy (5/4 + 1) | 98.07 | 98.16 |
| LSUN | Ours - Gram, Energy (5/4 + 1) | 97.76 | 97.14 |
| LSUN | Ours - Mahala, Gram (5/4 + 5/4) | 97.99 | 96.82 |
| LSUN | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 97.96 | 97.74 |
| iSUN | Mahala (penultimate layer) | 96.07 | 93.71 |
| iSUN | Gram (sum across layers) | 97.01 | 90.48 |
| iSUN | Energy | 92.05 | 96.25 |
| iSUN | Ours - Mahala (5/4) | 97.95 | 97.39 |
| iSUN | Ours - Gram (5/4) | 97.15 | 90.36 |
| iSUN | Ours - Mahala, Energy (5/4 + 1) | 97.93 | 97.89 |
| iSUN | Ours - Gram, Energy (5/4 + 1) | 97.66 | 96.71 |
| iSUN | Ours - Mahala, Gram (5/4 + 5/4) | 97.79 | 96.76 |
| iSUN | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 97.83 | 97.47 |
Table V: AUROC (%) with SVHN as the in-distribution dataset.

| OOD Dataset | Method | ResNet34 | DenseNet |
|---|---|---|---|
| LSUN | Mahala (penultimate layer) | 96.06 | 96.22 |
| LSUN | Gram | 97.23 | 94.17 |
| LSUN | Energy | 87.58 | 86.01 |
| LSUN | Ours - Mahala (5/4) | 99.00 | 98.92 |
| LSUN | Ours - Gram (5/4) | 97.19 | 94.11 |
| LSUN | Ours - Mahala, Energy (5/4 + 1) | 98.76 | 98.94 |
| LSUN | Ours - Gram, Energy (5/4 + 1) | 97.47 | 95.69 |
| LSUN | Ours - Mahala, Gram (5/4 + 5/4) | 98.82 | 99.08 |
| LSUN | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 98.21 | 99.06 |
| ImageNet | Mahala (penultimate layer) | 96.81 | 97.01 |
| ImageNet | Gram | 97.75 | 96.34 |
| ImageNet | Energy | 90.33 | 85.76 |
| ImageNet | Ours - Mahala (5/4) | 98.99 | 98.89 |
| ImageNet | Ours - Gram (5/4) | 97.73 | 96.32 |
| ImageNet | Ours - Mahala, Energy (5/4 + 1) | 98.79 | 98.91 |
| ImageNet | Ours - Gram, Energy (5/4 + 1) | 98.01 | 97.11 |
| ImageNet | Ours - Mahala, Gram (5/4 + 5/4) | 98.87 | 99.04 |
| ImageNet | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 98.25 | 99.02 |
| iSUN | Mahala (penultimate layer) | 96.49 | 96.85 |
| iSUN | Gram | 97.40 | 95.47 |
| iSUN | Energy | 88.75 | 85.69 |
| iSUN | Ours - Mahala (5/4) | 98.99 | 98.91 |
| iSUN | Ours - Gram (5/4) | 97.37 | 95.42 |
| iSUN | Ours - Mahala, Energy (5/4 + 1) | 98.76 | 98.94 |
| iSUN | Ours - Gram, Energy (5/4 + 1) | 97.63 | 96.40 |
| iSUN | Ours - Mahala, Gram (5/4 + 5/4) | 98.82 | 99.07 |
| iSUN | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 98.21 | 99.05 |
| CIFAR10 | Mahala (penultimate layer) | 96.90 | 96.59 |
| CIFAR10 | Gram | 95.35 | 87.06 |
| CIFAR10 | Energy | 89.09 | 77.72 |
| CIFAR10 | Ours - Mahala (5/4) | 97.63 | 97.50 |
| CIFAR10 | Ours - Gram (5/4) | 95.35 | 87.21 |
| CIFAR10 | Ours - Mahala, Energy (5/4 + 1) | 97.68 | 97.43 |
| CIFAR10 | Ours - Gram, Energy (5/4 + 1) | 95.94 | 91.15 |
| CIFAR10 | Ours - Mahala, Gram (5/4 + 5/4) | 97.32 | 96.91 |
| CIFAR10 | Ours - Mahala, Gram and Energy (5/4+5/4+1) | 97.10 | 96.98 |
We further compare our proposed method of combining multiple scores in Algorithm 1 with a baseline that combines scores naively through an averaging rule. This naive OOD detection test maintains thresholds for the scores. Let be the weight for the -th score where
(32) |
and let be defined as
(33) |
The naive averaging OOD detection rule declares an input to be an OOD sample if . The thresholds are set to ensure a false alarm probability of . In Table III, we present a comparison of the detection powers of the naive averaging rule and our proposed method of combining scores, where both methods use the Mahalanobis and Gram scores from all the layers, and the energy score. We observe that the naive averaging method does not perform as well as our proposed method of combining statistics, and indeed has a high variability across different OOD datasets. Thus, we see that while it is imperative to combine multiple scores for effective and robust OOD detection, combining them in an adhoc manner such as uniform averaging does not yield good results.
In some of the prior work on OOD detection, the Area Under the Receiver Operating Characteristic (AUROC) metric has been used to compare different tests ([2, 5, 3, 1]). However, it is not clear that this measure is useful in such a comparison, especially when the ROC is being estimated through simulations. It is possible for a test (say, Test 1) to have a larger AUROC than another test (say, Test 2), with Test 2 having a larger detection power than Test 1 for all values of false alarm less than some threshold (equivalently, all values of TPR greater than some threshold). Nevertheless, we provide the AUROC numbers for our experimental setups below for completeness.
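For reference, the AUROC of a scalar score can be computed directly from ranks, as in the sketch below (again assuming larger scores indicate OOD):

```python
import numpy as np
from scipy.stats import rankdata

def auroc(id_scores, ood_scores):
    """AUROC = probability that a random OOD sample scores higher than a random
    in-distribution sample (Mann-Whitney U statistic; ties are mid-ranked)."""
    scores = np.concatenate([id_scores, ood_scores])
    ranks = rankdata(scores)
    n_id, n_ood = len(id_scores), len(ood_scores)
    u = ranks[n_id:].sum() - n_ood * (n_ood + 1) / 2.0
    return float(u / (n_id * n_ood))
```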
Table IV contains the AUROC numbers for CIFAR10 as the in-distribution dataset, and Table V contains the AUROC numbers for SVHN as the in-distribution dataset. We observe patterns in the AUROC numbers similar to the above observations on the detection power at a fixed false alarm probability. The Mahalanobis, Gram and Energy baselines have a high variability across different kinds of OOD samples and DNN architectures, whereas our proposed method of combining all statistics has a low variability. For instance, the Mahalanobis, Gram and Energy baselines for the DenseNet architecture with CIFAR10 as the in-distribution dataset have an average variability of 10.5 in their AUROC, whereas our proposed method of combining all statistics has a variability of in the AUROC performance. Our proposed method of combining all statistics either has the best AUROC performance or is within of the best performance in all 8 cases for CIFAR10 and SVHN as the in-distribution datasets.
V Conclusion
While empirical methods for OOD detection have been studied extensively in recent literature, a formal characterization of OOD is lacking. We proposed a characterization for the notion of OOD that includes both the input distribution and the ML model. This provided insights for the construction of effective OOD detection tests. Our approach, inspired by multiple hypothesis testing, allows us to systematically combine any number of different statistics derived from the ML model with an arbitrary dependence structure.
Furthermore, our analysis allows us to set the test thresholds to meet given constraints on the probability of incorrectly classifying an in-distribution sample as OOD (false alarm probability). We provide strong theoretical guarantees on the probability of false alarm in OOD detection, conditioned on the dataset used for computing the conformal p-values.
In our experiments, we observe that no single score is uniformly useful for detecting all the different kinds of OOD instances. We demonstrated that our proposed method outperforms threshold-based tests for OOD detection proposed in prior work. Across different kinds of OOD examples, we observed that the state-of-the-art methods from prior work exhibit high variability in their probability of detecting OOD samples across OOD instances and neural network architectures. In contrast, our proposed method is robust and provides uniformly good performance (with respect to both detection power and AUROC) across different kinds of OOD samples and neural network architectures. This robustness is important, since a useful OOD detection algorithm should perform well regardless of the type of OOD instance encountered at inference time.
References
- [1] K. Lee, K. Lee, H. Lee, and J. Shin, “A simple unified framework for detecting out-of-distribution samples and adversarial attacks,” in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018. [Online]. Available: https://proceedings.neurips.cc/paper/2018/file/abdeb6f575ac5c6676b747bca8d09cc2-Paper.pdf
- [2] S. Liang, Y. Li, and R. Srikant, “Enhancing the reliability of out-of-distribution image detection in neural networks,” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=H1VGkIxRZ
- [3] C. S. Sastry and S. Oore, “Detecting out-of-distribution examples with gram matrices,” in Proceedings of the 37th International Conference on Machine Learning, ser. ICML’20. JMLR.org, 2020.
- [4] R. Huang, A. Geng, and Y. Li, “On the importance of gradients for detecting distributional shifts in the wild,” in Advances in Neural Information Processing Systems, M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang, and J. W. Vaughan, Eds., vol. 34. Curran Associates, Inc., 2021, pp. 677–689. [Online]. Available: https://proceedings.neurips.cc/paper/2021/file/063e26c670d07bb7c4d30e6fc69fe056-Paper.pdf
- [5] W. Liu, X. Wang, J. Owens, and Y. Li, “Energy-based out-of-distribution detection,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 21 464–21 475. [Online]. Available: https://proceedings.neurips.cc/paper/2020/file/f5496252609c43eb8a3d147ab9b9c006-Paper.pdf
- [6] R. Kaur, S. Jha, A. Roy, S. Park, E. Dobriban, O. Sokolsky, and I. Lee, “idecode: In-distribution equivariance for conformal out-of-distribution detection,” arXiv preprint arXiv:2201.02331, 2022.
- [7] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” arXiv preprint arXiv:1312.6199, 2013.
- [8] A. Nguyen, J. Yosinski, and J. Clune, “Deep neural networks are easily fooled: High confidence predictions for unrecognizable images,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 427–436.
- [9] D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” arXiv preprint arXiv:1610.02136, 2016.
- [10] D. Hendrycks, M. Mazeika, and T. G. Dietterich, “Deep anomaly detection with outlier exposure,” in 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. [Online]. Available: https://openreview.net/forum?id=HyxCxhRcY7
- [11] Z. Liang, M. Sesia, and W. Sun, “Integrative conformal p-values for powerful out-of-distribution testing with labeled outliers,” arXiv preprint arXiv:2208.11111, 2022.
- [12] F. Bergamin, P.-A. Mattei, J. D. Havtorn, H. Senetaire, H. Schmutz, L. Maaløe, S. Hauberg, and J. Frellsen, “Model-agnostic out-of-distribution detection using combined statistical tests,” in International Conference on Artificial Intelligence and Statistics. PMLR, 2022, pp. 10 753–10 776.
- [13] A. N. Angelopoulos, S. Bates, E. J. Candès, M. I. Jordan, and L. Lei, “Learn then test: Calibrating predictive algorithms to achieve risk control,” arXiv preprint arXiv:2110.01052, 2021.
- [14] M. Haroush, T. Frostig, R. Heller, and D. Soudry, “A statistical framework for efficient out of distribution detection in deep neural networks,” arXiv preprint arXiv:2102.12967, 2021.
- [15] R. J. Simes, “An improved bonferroni procedure for multiple tests of significance,” Biometrika, vol. 73, no. 3, pp. 751–754, 1986.
- [16] R. A. Fisher, Statistical methods for research workers. Springer, 1992.
- [17] L. Zhang, M. Goldstein, and R. Ranganath, “Understanding failures in out-of-distribution detection with deep generative models,” in International Conference on Machine Learning. PMLR, 2021, pp. 12 427–12 436.
- [18] S. Holm, “A simple sequentially rejective multiple test procedure,” Scandinavian Journal of Statistics, vol. 6, no. 2, pp. 65–70, 1979. [Online]. Available: http://www.jstor.org/stable/4615733
- [19] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: A practical and powerful approach to multiple testing,” Journal of the Royal Statistical Society. Series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995. [Online]. Available: http://www.jstor.org/stable/2346101
- [20] Y. Benjamini and D. Yekutieli, “The control of the false discovery rate in multiple testing under dependency,” The Annals of Statistics, vol. 29, no. 4, pp. 1165–1188, 2001. [Online]. Available: http://www.jstor.org/stable/2674075
- [21] V. Vovk, A. Gammerman, and C. Saunders, “Machine-learning applications of algorithmic randomness,” in Proceedings of the Sixteenth International Conference on Machine Learning, ser. ICML ’99. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 1999, p. 444–453.
- [22] V. Balasubramanian, S.-S. Ho, and V. Vovk, Conformal Prediction for Reliable Machine Learning: Theory, Adaptations and Applications, 1st ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2014.
- [23] A. Podkopaev and A. Ramdas, “Tracking the risk of a deployed model and detecting harmful distribution shifts,” arXiv preprint arXiv:2110.06177, 2021.
- [24] V. Vovk, “Conditional validity of inductive conformal predictors,” in Asian conference on machine learning. PMLR, 2012, pp. 475–490.
- [25] S. Bates, E. Candès, L. Lei, Y. Romano, and M. Sesia, “Testing for outliers with conformal p-values,” arXiv preprint arXiv:2104.08279, 2021.
- [26] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 770–778.
- [27] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 4700–4708.
- [28] Y. Benjamini and Y. Hochberg, “Controlling the false discovery rate: a practical and powerful approach to multiple testing,” Journal of the Royal statistical society: series B (Methodological), vol. 57, no. 1, pp. 289–300, 1995.
- [29] V. Vovk, “Conditional validity of inductive conformal predictors,” in Proceedings of the Asian Conference on Machine Learning, ser. Proceedings of Machine Learning Research, S. C. H. Hoi and W. Buntine, Eds., vol. 25. Singapore Management University, Singapore: PMLR, 04–06 Nov 2012, pp. 475–490. [Online]. Available: https://proceedings.mlr.press/v25/vovk12.html
-A Preliminaries on Multiple Testing
Multiple hypothesis testing (a.k.a. multiple testing) refers to the inference problem of testing between multiple binary hypotheses, e.g., $\mathcal{H}_{0,i}$ versus $\mathcal{H}_{1,i}$, $i = 1, \dots, N$. For a given multiple testing procedure, let $R$ be the number of null hypotheses rejected (i.e., the number of tests declared as the alternative), out of which $V$ is the number of true null hypotheses. Some measures of performance for multiple testing procedures are as follows:
1. Family Wise Error Rate (FWER): The probability of rejecting at least one null hypothesis when all of them are true.
2. False Discovery Rate (FDR): Expected ratio of the number of true null hypotheses rejected ($V$) and the total number of hypotheses rejected ($R$), i.e.,
$$\mathrm{FDR} = \mathbb{E}\left[\frac{V}{\max(R, 1)}\right], \qquad (34)$$
where the expectation is taken over the joint distribution of the statistics involved in the multiple testing problem.
When all the null hypotheses are true, $V = R$ with probability 1, and the FDR is equal to the FWER.
Various multiple testing procedures have been proposed in the literature depending on the quantity of interest to be controlled. Widely used multiple testing procedures involve calculating the p-value $u_i$ for each test, and combining these p-values to give decisions for each hypothesis. Let $N$ denote the number of hypotheses and $\alpha$ the desired control level. One of the earliest tests proposed to control the FWER is the Bonferroni test. In this test, each $u_i$ is computed, and for each $i$, the corresponding hypothesis is rejected if
$$u_i \le \frac{\alpha}{N}. \qquad (35)$$
This test controls the FWER at $\alpha$ for any joint distribution of the test statistics of the hypotheses. However, the power of this test has been observed to be low, and hence the test is considered to be conservative. The FDR measure was proposed by [28], who also proposed a procedure to control the FDR. Let the p-values for each test be $u_1, \dots, u_N$, and let the ordered p-values be denoted by $u_{(1)} \le \dots \le u_{(N)}$. Let
$$k = \max\left\{ i : u_{(i)} \le \frac{i\,\alpha}{N} \right\}. \qquad (36)$$
The Benjamini-Hochberg (BH) procedure rejects the hypotheses corresponding to $u_{(1)}, \dots, u_{(k)}$, and controls the FDR at level $\alpha$ when the test statistics are independent. [20] showed that the constants in the BH procedure can be modified to $\frac{i\,\alpha}{N \sum_{j=1}^{N} 1/j}$ instead of $\frac{i\,\alpha}{N}$ to control the FDR at level $\alpha$ for arbitrarily dependent test statistics. Note that the Bonferroni procedure and the BH procedure can be used to test against the global null (all null hypotheses are true), where the probability of false alarm is equal to the FWER and FDR. Our proposed algorithm for the OOD detection problem builds on the BH procedure with the modified constants, and conformal p-values calculated using a calibration dataset , where we aim to control the conditional probability of false alarm with high probability.
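A compact sketch of the Bonferroni rule in (35) and the BH step-up rule in (36), including the modified constants of [20] for arbitrary dependence, is given below (operating on exact p-values):

```python
import numpy as np

def bonferroni_reject(p_values, alpha):
    """Reject H_{0,i} whenever u_i <= alpha / N; controls the FWER under any dependence."""
    p = np.asarray(p_values)
    return p <= alpha / len(p)

def bh_reject(p_values, alpha, arbitrary_dependence=False):
    """Benjamini-Hochberg step-up; with arbitrary_dependence=True the thresholds are
    divided by sum_{j=1}^N 1/j (Benjamini-Yekutieli), as used in Algorithm 1."""
    p = np.asarray(p_values)
    n = len(p)
    c_n = np.sum(1.0 / np.arange(1, n + 1)) if arbitrary_dependence else 1.0
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, n + 1) / (n * c_n)
    k = int(np.max(np.nonzero(below)[0]) + 1) if below.any() else 0
    reject = np.zeros(n, dtype=bool)
    reject[order[:k]] = True
    return reject
```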
-B Proposed OOD Modelling
In Section II, we conclude that functions of the input from the learning algorithm, apart from the final output, are required for the OOD formulation presented above. Note that this does not violate the data-processing inequality, as the out-distribution characterizes both the input and the model, and these functions of the input give us additional information regarding the learning algorithm. In addition, these functions give us information to differentiate between the null and the alternate hypothesis.
-C Proof of Theorem 1
For , let
(37) |
where
(38) |
and let . As in (16), the probability of false alarm conditioned on the calibration set is given by
(39) |
where is as defined in Algorithm 1. Here denotes the global null hypothesis, which corresponds to all the being true. Note that signifies that are being rejected. Let
Then,
(40) |
The following lemma is useful in deriving an upper bound for .
Lemma 2.
For ,
(41) |
where is as defined in Section 3.
Proof.
Let
Let be the subset of where the null hypotheses rejected correspond to the indices in . Then
(42) |
Note that if null hypotheses corresponding to the indices in are rejected, then the conformal p-values corresponding to these tests are less than or equal to (since the maximum among them is less than or equal to ), and the conformal p-values corresponding to the remaining tests are greater than , i.e.,
(43) |
and
(44) |
Thus,
(45) |
Then,
(46)–(51)
where the first equality arises from the fact that is the union of disjoint sets for , and the third equality follows from (45). ∎
Using the result from Lemma 2 in the expression for in (40), we obtain that
(52) |
Note that by definition, . Thus,
(53) |
and
(54)–(57)
Note that the events are disjoint for . Thus,
(58)–(60)
Using this in (57), we get that
(61) |
Let . Then, rearranging the terms from above, we get
(62) |
Note that is a function of only through random variables . We have from [29, 25] that follows a Beta distribution, i.e., , where
(63)–(64)
The mean of this distribution is . Let denote the event
(65) |
When the condition on in Lemma 1 is satisfied, we have that
(66) |
Under the event , we have that
(67)–(70)
Thus, with probability greater than , we have that
(71) |
-D Additional Experimental Results
All experiments presented in this paper were run on a single NVIDIA GTX-1080Ti GPU with PyTorch.
In addition, we provide the detection probabilities for CIFAR100 as the in-distribution dataset in Table VI. We consider the Mahalanobis scores and Gram scores from the individual layers for this setting. Recall that the energy score is a temperature-scaled log-sum-exponent of the softmax scores, as defined in (31). We do not consider the energy score as one of the statistics for CIFAR100 as the in-distribution dataset, as we do not expect it to give a good representation of the in-distribution data. As the number of classes in CIFAR100 is quite large (100), we expect the softmax scores not to provide a reliable confidence score for distinguishing in-distribution points from OOD samples. Table VII contains the AUROC numbers for CIFAR100 as the in-distribution dataset.
-E Comparison with Bonferroni inspired test
It is possible to construct an OOD detection test adapted from the Bonferroni procedure similar to Algorithm 1, by replacing with:
(72) |
i.e., calculating as the number of hypotheses for which the corresponding conformal p-value is smaller than the constant . A sample is declared as OOD if . This procedure is detailed in Algorithm 2 for completeness.
(73) |
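A sketch of this variant, mirroring the earlier sketch of Algorithm 1, is given below; the common per-score threshold is written as alpha * shrink / N, where shrink is again only a stand-in for the calibration-dependent constant and the exact form is an assumption.

```python
import numpy as np

def ood_test_bonferroni(new_scores, cal_scores, alpha=0.1, shrink=1.0):
    """Bonferroni-style variant (sketch of Algorithm 2): declare OOD if any
    conformal p-value is at most a common per-score threshold."""
    n_cal, num_scores = cal_scores.shape
    p_vals = (1.0 + (cal_scores >= new_scores).sum(axis=0)) / (n_cal + 1.0)
    return bool(np.any(p_vals <= alpha * shrink / num_scores))
```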
However, in general, the Bonferroni procedure has been observed to have a smaller detection power as compared to the BH procedure. We provide a comparison of the detection performance of using the Bonferroni inspired OOD detection test versus the BH inspired test proposed in Algorithm 1 in Tables VIII and IX. We can provide guarantees on the conditional false alarm probability similar to Theorem 1 for Algorithm 2 as well.
Theorem 2.
Let . Let be a calibration set, and let be such that for a given ,
(74) |
where , , , and is the CDF of a Beta distribution with parameters . Then, for a new input and a ML model , the probability of incorrectly detecting as OOD conditioned on while using Algorithm 2 is bounded by , i.e.,
(75) |
with probability .
Proof.
We have that
(76)–(79)
Let . Thus,
(80) |
We have from [29, 25] that follows a Beta distribution, i.e., , where
(81)–(82)
The mean of this distribution is . Let denote the event
(83) |
When satisfies the condition in (74), we have that
(84)–(87)
Thus, under event , i.e., with probability greater than , we have that
(88) |
∎
Table VI: Detection power (%) at a false alarm probability of 0.1, with CIFAR100 as the in-distribution dataset.

| OOD Dataset | Method | ResNet34 | DenseNet |
|---|---|---|---|
| SVHN | Mahala (penultimate layer) | 61.75 | 62.21 |
| SVHN | Gram | 71.60 | 77.87 |
| SVHN | Ours - Mahala (5/4) | 64.55 | 62.81 |
| SVHN | Ours - Gram (5/4) | 58.54 | 78.15 |
| SVHN | Ours - Mahala, Gram (all) (5/4 + 1) | 72.81 | 70.80 |
| ImageNet | Mahala (penultimate layer) | 35.03 | 89.05 |
| ImageNet | Gram | 82.42 | 86.42 |
| ImageNet | Ours - Mahala (5/4) | 86.04 | 90.72 |
| ImageNet | Ours - Gram (5/4) | 74.43 | 86.85 |
| ImageNet | Ours - Mahala, Gram (all) (5/4 + 1) | 85.64 | 90.15 |
| LSUN | Mahala (penultimate layer) | 34.00 | 92.17 |
| LSUN | Gram | 78.36 | 88.93 |
| LSUN | Ours - Mahala (5/4) | 86.19 | 92.86 |
| LSUN | Ours - Gram (5/4) | 66.62 | 89.20 |
| LSUN | Ours - Mahala, Gram (all) (5/4 + 1) | 84.81 | 92.66 |
| iSUN | Mahala (penultimate layer) | 36.01 | 88.89 |
| iSUN | Gram | 83.15 | 84.82 |
| iSUN | Ours - Mahala (5/4) | 99.35 | 99.82 |
| iSUN | Ours - Gram (5/4) | 53.71 | 83.01 |
| iSUN | Ours - Mahala, Gram (all) (5/4 + 1) | 99.42 | 99.85 |
Table VII: AUROC (%) with CIFAR100 as the in-distribution dataset.

| OOD Dataset | Method | ResNet34 | DenseNet |
|---|---|---|---|
| SVHN | Mahala (penultimate layer) | 89.35 | 85.81 |
| SVHN | Gram | 91.85 | 91.33 |
| SVHN | Ours - Mahala (5/4) | 89.42 | 86.59 |
| SVHN | Ours - Gram (5/4) | 88.86 | 91.23 |
| SVHN | Ours - Mahala, Gram (all) (5/4 + 1) | 91.53 | 89.98 |
| ImageNet | Mahala (penultimate layer) | 78.81 | 95.38 |
| ImageNet | Gram | 94.10 | 94.13 |
| ImageNet | Ours - Mahala (5/4) | 94.96 | 95.65 |
| ImageNet | Ours - Gram (5/4) | 92.00 | 94.04 |
| ImageNet | Ours - Mahala, Gram (all) (5/4 + 1) | 94.96 | 95.66 |
| LSUN | Mahala (penultimate layer) | 78.90 | 96.39 |
| LSUN | Gram | 93.06 | 95.33 |
| LSUN | Ours - Mahala (5/4) | 94.91 | 96.13 |
| LSUN | Ours - Gram (5/4) | 90.00 | 95.19 |
| LSUN | Ours - Mahala, Gram (all) (5/4 + 1) | 94.73 | 96.18 |
| iSUN | Mahala (penultimate layer) | 81.38 | 95.43 |
| iSUN | Gram | 94.71 | 94.36 |
| iSUN | Ours - Mahala (5/4) | 98.12 | 98.04 |
| iSUN | Ours - Gram (5/4) | 89.77 | 93.85 |
| iSUN | Ours - Mahala, Gram (all) (5/4 + 1) | 98.04 | 97.90 |
Table VIII: Detection power (%) of the BH-inspired test (Algorithm 1) versus the Bonferroni-inspired test (Algorithm 2), with CIFAR10 as the in-distribution dataset.

| OOD Dataset | Method | ResNet34 | DenseNet |
|---|---|---|---|
| SVHN | Mahala, Gram and Energy (BH) | 97.13 | 94.57 |
| SVHN | Mahala, Gram and Energy (Bonferroni) | 96.41 | 91.13 |
| ImageNet | Mahala, Gram and Energy (BH) | 97.03 | 97.20 |
| ImageNet | Mahala, Gram and Energy (Bonferroni) | 95.92 | 95.89 |
| LSUN | Mahala, Gram and Energy (BH) | 98.00 | 97.78 |
| LSUN | Mahala, Gram and Energy (Bonferroni) | 96.99 | 96.53 |
| iSUN | Mahala, Gram and Energy (BH) | 97.67 | 96.34 |
| iSUN | Mahala, Gram and Energy (Bonferroni) | 96.76 | 94.79 |
Table IX: Detection power (%) of the BH-inspired test (Algorithm 1) versus the Bonferroni-inspired test (Algorithm 2), with SVHN as the in-distribution dataset.

| OOD Dataset | Method | ResNet34 | DenseNet |
|---|---|---|---|
| CIFAR10 | Mahala, Gram and Energy (BH) | 97.35 | 95.23 |
| CIFAR10 | Mahala, Gram and Energy (Bonferroni) | 95.84 | 91.77 |
| ImageNet | Mahala, Gram and Energy (BH) | 99.84 | 99.89 |
| ImageNet | Mahala, Gram and Energy (Bonferroni) | 99.72 | 99.79 |
| LSUN | Mahala, Gram and Energy (BH) | 99.95 | 100.00 |
| LSUN | Mahala, Gram and Energy (Bonferroni) | 99.89 | 99.97 |
| iSUN | Mahala, Gram and Energy (BH) | 99.88 | 99.98 |
| iSUN | Mahala, Gram and Energy (Bonferroni) | 99.88 | 99.98 |