
Out of Distribution Detection, Generalization, and Robustness Triangle with Maximum Probability Theorem
This project is supported by the NIH funding: R01-CA246704 and R01-CA240639, and Florida Department of Health (FDOH): 20K04.

Amir Emad Marvasti
Currus AI, Orlando, FL, USA
[email protected]

Ehsan Emad Marvasti
Currus AI, Orlando, FL, USA
[email protected]

Ulas Bagci
Northwestern University, Chicago, IL, USA
[email protected]
Abstract

The Maximum Probability Framework, powered by the Maximum Probability Theorem (MPT), is a recent theoretical development in artificial intelligence that aims to formally define probabilistic models, guide the development of objective functions, and regularize probabilistic models. MPT uses the probability distribution that a model assumes on random variables to provide an upper bound on the probability of the model. We apply MPT to challenging out-of-distribution (OOD) detection problems in computer vision by incorporating MPT as a regularization scheme in the training of CNNs and their energy-based variants. We demonstrate the effectiveness of the proposed method on 1080 trained models with varying hyperparameters, and conclude that the MPT-based regularization strategy stabilizes and improves the generalization and robustness of base models, in addition to enhancing OOD performance, on the CIFAR10, CIFAR100, and MNIST datasets.

Index Terms:
Out of distribution detection, maximum probability theorem, robustness, deep learning, regularization

I Introduction

Regularization. Training machine learning models requires careful tuning of hyperparameters, architectures, and loss functions. Tuning the non-trainable aspects of a model is mainly justified by empirical evaluation, which costs time and is a source of uncertainty in the evaluation of models and algorithms. Regularization is widely accepted as a way of stabilizing the training process and improving a model's generalization performance.

In the Bayesian view of machine learning, the popular regularization approach is to define an explicit prior over the parameters of the model. Parallel to this view, a recent approach called the Maximum Probability Theorem (MPT) was introduced [33]. MPT considers a model as an event in the probability space and provides an upper bound for the probability of that event. Increasing the probability of a model in MPT leads to regularization effects. Unlike other regularization strategies, MPT eliminates the need for explicitly defined priors over the model's parameters. Instead, MPT requires only a prior on the observables (e.g., input and output) to determine the prior on the parameters or hidden random variables.

Our proposal. We hypothesize that MPT can act as a black-box regularizer in vision applications such as OOD detection, leading to improved robustness, generalization, and performance in such tasks. We first construct objective functions encapsulating MPT regularization and demonstrate their relation to the cross-entropy loss. Next, we show how to incorporate MPT regularization into the training of CNNs and their energy-based variants. Finally, we discuss the details of the OOD experiments and the results on generalization and robustness.

Why Energy Based Models (EBMs)? The recent line of work by [46] showed that treating CNNs as EBMs increases the generalization performance of the CNN model and provides additional abilities for the network [11], such as:

  • EBMs can be used to detect anomalous inputs (out-of-distribution detection),

  • EBMs can be trained without labels,

  • EBMs can be used as generative models.

Hence, in this study, we treat CNNs as EBMs [26, 42] and also experiment with the effects of MPT on CNN-EBMs.

Our contributions are summarized as follows:

  1. We successfully adapt MPT regularization to energy-based models.

  2. We revisit the OOD detection problem with the proposed MPT-based regularization.

  3. We demonstrate the robustness and generalization effects of MPT regularization on three datasets and 1080 trained networks, with promising results.

II Related Work

Our work relies mainly on three lines of research: regularization methods, EBMs, and OOD detection methods.
Regularization and Priors. Regularization and priors of probabilistic models have a rich history, including Jaynes' Maximum Entropy Principle [21, 19, 20], Jeffreys' uninformative priors [22], and reference priors [4, 6] generalizing Jeffreys priors. In practice, the widely accepted approach to regularizing probabilistic models is to impose explicit priors on parameters or other variables in the model. In deep learning, explicit priors are reflected in the popular regularization schemes: for example, $\ell_{1}$ and $\ell_{2}$ norm regularization correspond to Laplacian and Gaussian priors on variables, respectively [24, 31, 9]. The logic behind reparameterization-invariant priors [4, 22] is that our prior knowledge should not change based on how a parameter is represented or how the model is constructed. In reference priors, the prior on parameters is determined by maximizing the mutual information between the parameters and the observable random variables [2, 3, 4, 5]. The authors of [34] propose gradient-based optimization of models with reference priors, a promising path toward integrating uninformative priors; however, automatic determination of reference priors for the parameters of complex models is not fully understood [34].

MPT [33], a recent addition to the probabilistic modeling literature, offers black-box regularization of probabilistic models while being invariant to reparameterization. To use MPT, one does not need to explicitly define priors over the parameters; in other words, the parameters need not be modeled as random variables.

Energy Based Models. EBMs assume an unnormalized log probability distribution over the space of their input [41, 26]; this unnormalized log probability is referred to as energy. In conventional loss functions such as the log-likelihood, the loss depends on the imposed log probability distribution. Training EBMs is possible by estimating the gradients and sampling from the imposed probability distribution [41, 14, 10]. The gradient of the log probability with respect to the model's parameters takes the form of an expectation with respect to the EBM's distribution. If the domain of an EBM is arbitrarily large, unbiased, low-variance sampling from its underlying distribution is nontrivial. Multiple sampling techniques can be used to sample from the EBM's probability distribution, including the class of MCMC techniques [36, 35]. The major drawback of MCMC techniques is that they induce bias in the calculation of gradients during their burn-in period [32, 38].

Also, depending on the domain of the EBM, the estimated gradients can have large variance [12]. The variance in gradient estimation is a double-edged sword: on one hand, it helps with escaping plateaus in the objective function landscape; on the other hand, large gradient variance disrupts convergence of the parameters to the optimal point. Despite the bias and variance drawbacks, MCMC methods such as Stochastic Gradient Langevin Dynamics [43] have been shown to be successful in training EBMs [46].

Out of Distribution Detection. The out-of-distribution (OOD) detection task is motivated by understanding the reliability of a model's predictions on a given input [1]. Among the existing OOD methods [16, 31, 29, 39] and many others, our work focuses on probabilistic OOD methods and EBMs. We hypothesize that if a model fits a probability distribution to its input, the model's probability distribution function (pdf) on the input can be used to identify out-distribution data. Existing OOD detection methods commonly rely on a scoring function that derives statistics from the output layer of the neural network. Popular probabilistic measures for OOD are the softmax score [13] and the energy score [11]; ODIN [28] and GODIN [16] can be viewed as modifications of the energy and softmax scores. MOOD [29] is an architectural modification using similar probabilistic principles for training and evaluation.

III Regularization via Maximum Probability Theorem

Notation. We use the notation of the MPT paper: a probability space is defined as $(\Omega,\Sigma,P)$, where $\Omega$ is the sample space, $\Sigma$ is the $\sigma$-algebra, and $P$ is the probability measure. A parametric model, parameterized by $\theta$ in some arbitrary space, is defined as $M_{\theta}\in\Sigma$, an event in the $\sigma$-algebra. We use uppercase letters such as $X,Y$ for random variables, and the range of a random variable is denoted by $R(\cdot)$. We use $S(P)$ as shorthand for the support of the probability measure, and write $x\in S(P)$ to identify outcomes of $X$ with nonzero probability. In our construction, and as in MPT, we do not consider $\theta$ a random variable; it is only a numerical representation of the event $M_{\theta}$. For a mapping $f$ with a vector-space domain and range, the $y$-th entry of the output vector is denoted by $f(x)[y]$.

Overview of MPT. Consider the probability space $(\Omega,\Sigma,P)$, a random variable $Z$, and some event $M_{\theta}\in\Sigma$, where $\theta$ is a vector in $\Re^{t}$. MPT shows that the following inequality holds:

$$P(M_{\theta})\leq\min_{z\in R(Z)}\left\{\frac{P(z)}{P(z|M_{\theta})}\right\}. \qquad (1)$$

In MPT, probabilistic models are represented by events such as $M_{\theta}$ with a known conditional distribution. To regularize a probabilistic model, the upper bound on the model's probability is maximized. The parameter vector $\theta$ is not necessarily considered a random vector; therefore, an explicit prior on the parameters is not required. As $\theta$ varies, the conditional distribution of the model over $Z$ changes, leading to changes in $P(M_{\theta})$. Note that the probability upper bound equals 1 only if $P(z)=P(z|M_{\theta})$; therefore, $-\log(P(M_{\theta}))$ can serve as a measure of the model's deviation from the prior.
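
To make the bound concrete, the following minimal numpy sketch (our illustration, not part of [33]) checks inequality (1) on a toy discrete space; the event, its probability, and the conditional are all invented for the example:

```python
import numpy as np

# Toy check of inequality (1): a discrete sample space with 6 outcomes of Z
# and an event M supported on three of them.
p_z = np.full(6, 1 / 6)                 # prior P(z), uniform here
member = np.array([1, 1, 0, 1, 0, 0])   # indicator: which outcomes lie in M
p_m = 0.3                               # P(M), chosen for the example

# A valid conditional must satisfy P(z|M) * P(M) = P(z, M) <= P(z).
p_z_given_m = member * p_z / (member * p_z).sum()
assert np.all(p_z_given_m * p_m <= p_z + 1e-12)

# MPT bound: P(M) <= min over the support of P(z) / P(z|M).
mask = p_z_given_m > 0
bound = np.min(p_z[mask] / p_z_given_m[mask])
print(f"P(M) = {p_m:.3f} <= bound = {bound:.3f}")   # 0.300 <= 0.500
```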

$\alpha$-Parametrization of MPT. The softmin family of lower bounds on the maximum probability bound is used as a smooth approximation of the min function. The softmin family, parameterized by $\alpha>0$, is defined as

$$P_{\alpha}(M_{\theta})\triangleq\left(\sum_{z\in R(Z)}\left(\frac{P(z)}{P(z|M_{\theta})}\right)^{-\alpha}\right)^{-\frac{1}{\alpha}}. \qquad (2)$$

The following inequality relates the softmin family to the probability upper bound in (1):

$$P_{\alpha}(M_{\theta})\leq\min_{z\in R(Z)}\left\{\frac{P(z)}{P(z|M_{\theta})}\right\}, \qquad (3)$$

where equality holds as $\alpha\to+\infty$ [33]. The objective function proposed by [33] is $\mathcal{L}(\theta)=P(M_{\theta},M^{*})$, the probability of the intersection of the model and the event $M^{*}$ representing the true underlying model (in our case, the dataset). In our work, $M^{*}$ is modeled by its conditional distribution, which is the empirical distribution of the dataset. Throughout the paper, $M^{*}$ is called the oracle.
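
The convergence of the softmin family to the hard min can be checked numerically; the sketch below (our illustration, with an arbitrary toy prior and model conditional) evaluates Eq. (2) in log space for growing $\alpha$:

```python
import numpy as np

p_z = np.full(4, 0.25)                        # uniform prior P(z)
p_z_given_m = np.array([0.4, 0.3, 0.2, 0.1])  # model conditional P(z|M_theta)
ratio = p_z / p_z_given_m                     # P(z) / P(z|M_theta)

def p_alpha(ratio, alpha):
    # Softmin of Eq. (2): (sum_z ratio^(-alpha))^(-1/alpha), in log space.
    return np.exp(-np.logaddexp.reduce(-alpha * np.log(ratio)) / alpha)

for alpha in [1.0, 2.0, 8.0, 64.0]:
    print(f"alpha = {alpha:5.1f}   P_alpha = {p_alpha(ratio, alpha):.4f}")
print(f"hard min        {ratio.min():.4f}")
# P_alpha stays below min_z ratio and approaches it as alpha grows, per Eq. (3).
```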

In this paper, we consider the classification problem and OOD detection. We define random variables $X$ and $Y$ to represent the input and label, and $Z=(X,Y)$ as shorthand for their concatenation. We can write the intersection objective function in its conditional form and treat each term separately:

$$\mathcal{L}=\ln P(M^{*}|M_{\theta})+\ln P(M_{\theta}). \qquad (4)$$

In practice, $M_{\theta}$ and $M^{*}$ are implicitly modeled by their conditional distributions over the random variables, $P(X,Y|M_{\theta})$. To use this explicit representation, we write the objective function in marginal form. Before deriving the objective function, the relation between $M^{*}$ and $M_{\theta}$ must be assumed, and this choice dictates the optimization process. Similar to [33], we explore the conditional independence of the model and the oracle, namely $M_{\theta}\perp M^{*}\,|\,Z$.

Conditional Independence. If we assume conditional independence between the model and the oracle, we can continue with the marginalization of the objective function $\mathcal{L}^{\perp}(\theta)$:

$$\mathcal{L}^{\perp}(\theta)\triangleq\ln P(M^{*}|M_{\theta})+\ln P(M_{\theta}) \qquad (5)$$
$$=\ln\sum_{z\in R(Z)}P(M^{*}|z,M_{\theta})P(z|M_{\theta})+\ln P(M_{\theta}). \qquad (6)$$

Under the conditional independence assumption, $P(M^{*}|z,M_{\theta})=P(M^{*}|z)=P(z|M^{*})P(M^{*})/P(z)$. Assuming that $P(z)$ is uniform, we can treat both $P(z)$ and $P(M^{*})$ as constants and drop them for ease of notation. As a result,

$$\mathcal{L}^{\perp}(\theta)=\ln\sum_{z\in R(Z)}P(z|M^{*})P(z|M_{\theta})+\ln P(M_{\theta}). \qquad (7)$$

Finally, to $\alpha$-parametrize $P(M_{\theta})$, we define the following lower-bound objective function:

$$\mathcal{L}^{\perp}_{\alpha}\triangleq\ln\sum_{z\in R(Z)}P(z|M^{*})P(z|M_{\theta})+\ln P_{\alpha}(M_{\theta}). \qquad (8)$$

Alternatively, by the concavity of the logarithm and Jensen's inequality, we can find another lower bound, the cross-entropy loss with MPT regularization:

$$\mathcal{L}_{\alpha}^{\perp}(\theta)\geq\sum_{z\in R(Z)}P(z|M^{*})\ln P(z|M_{\theta})+\ln P_{\alpha}(M_{\theta}), \qquad (9)$$
$$\mathcal{L}_{\textrm{cross}}(\theta)\triangleq\sum_{z\in R(Z)}P(z|M^{*})\ln P(z|M_{\theta}), \qquad (10)$$

where $\mathcal{L}_{\textrm{cross}}(\theta)$ is the conventional cross-entropy objective. The inequality relations are summarized as

$$\mathcal{L}^{\perp}(\theta)\geq\mathcal{L}_{\alpha}^{\perp}(\theta)\geq\mathcal{L}_{\textrm{cross}}(\theta)+\ln P_{\alpha}(M_{\theta}). \qquad (11)$$
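
The chain in (11) can be verified on a toy categorical problem; the sketch below (our illustration, with randomly drawn oracle and model conditionals and a uniform prior, per the assumption above) checks that the $\alpha$-objective of Eq. (8) dominates cross entropy plus the same regularizer:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 8                                       # size of the discrete range of Z
p_star = rng.dirichlet(np.ones(K))          # oracle conditional P(z|M*)
logits = rng.normal(size=K)
p_model = np.exp(logits) / np.exp(logits).sum()   # model conditional P(z|M_theta)
p_prior = np.full(K, 1 / K)                 # uniform prior P(z), as in Eq. (7)

alpha = 2.0
log_p_alpha = -np.logaddexp.reduce(-alpha * np.log(p_prior / p_model)) / alpha

lhs = np.log(np.sum(p_star * p_model)) + log_p_alpha    # Eq. (8)
rhs = np.sum(p_star * np.log(p_model)) + log_p_alpha    # Eq. (10) + regularizer

# Jensen's inequality: ln E[q] >= E[ln q], so Eq. (8) dominates the
# cross-entropy objective plus the same regularizer, matching Eq. (11).
assert lhs >= rhs
print(f"{lhs:.4f} >= {rhs:.4f}")
```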

IV MPT in Energy Based Models

EBMs add another dimension to the training of classification models: the model is trained on the input distribution without needing the labels. Here, we incorporate the MPT view of machine learning, and the objective functions described in the previous section, into the formulation of EBMs.

Consider a training set of $N_{t}$ entries and a test set of $N_{v}$ entries, i.e., $D_{t}=\{z^{(i)}\triangleq(x^{(i)},y^{(i)})\}_{i=1}^{N_{t}}$. The training and test sets can be understood as empirical distributions of the underlying true model, the oracle $M^{*}$. We adopt the intersection objective function of the MPT view. The classification model, parameterized by $\theta\in\Re^{t}$, is represented by $f(x;\theta)$, where $f:\Re^{n}\to\Re^{m}$. We construct the model's probability distribution $P(x,y|M_{\theta})$ using $f$:

$$P(x,y|M_{\theta})=\frac{e^{f(x,\theta)[y]}P(x,y)}{\sum_{x^{\prime},y^{\prime}\in S(P)}e^{f(x^{\prime},\theta)[y^{\prime}]}P(x^{\prime},y^{\prime})}. \qquad (12)$$

Given Eq. (12) and the probability upper bound in (1), we can write the maximum probability of $M_{\theta}$ as

$$P(M_{\theta})\leq\frac{\sum_{x^{\prime},y^{\prime}\in S(P)}e^{f(x^{\prime},\theta)[y^{\prime}]}P(x^{\prime},y^{\prime})}{\max_{x,y\in S(P)}\left\{e^{f(x,\theta)[y]}\right\}}. \qquad (13)$$

For $P(M_{\theta})$ to be nonzero, either $f$ must be bounded or the prior must have bounded support. We denote the normalizing constant in (12) by $\eta(\theta)$. Under the independence assumption, $\eta(\theta)$ cancels out when we add the log-likelihood term; therefore, only the support of the prior affects the regularization term in our objective functions. In the above equation, we used the prior in the construction of our model: without the prior, the probability distribution is not necessarily normalizable [2], which leads to instability in the optimization.
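
The construction in Eqs. (12) and (13) is easy to state in code; the following numpy sketch (our illustration, on an invented tiny discrete support with a uniform prior) builds the tilted distribution and evaluates the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y = 5, 3                                  # tiny discrete (x, y) support
f = rng.normal(size=(n_x, n_y))                  # energies f(x, theta)[y]
prior = np.full((n_x, n_y), 1 / (n_x * n_y))     # uniform P(x, y), bounded support

# Eq. (12): tilt the prior by exp(f) and renormalize; eta is the normalizer.
eta = np.sum(np.exp(f) * prior)
p_model = np.exp(f) * prior / eta
assert np.isclose(p_model.sum(), 1.0)

# Eq. (13): the prior cancels in the ratio P(z)/P(z|M_theta) = eta / exp(f),
# so the bound is eta / max exp(f) over the support.
bound = eta / np.exp(f).max()
print(f"upper bound on P(M_theta): {bound:.4f}")
```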

Figure 1: The generalization performance of the energy-based model with varying hyperparameters and loss functions. Left: CIFAR10, middle: CIFAR100, right: MNIST.

IV-A Conditional Independence Assumption in EBMs

Under the conditional independence assumption, we can write the intersection objective function for EBMs as

$$\mathcal{L}_{\alpha}^{\perp}(\theta)=\ln\left(\frac{1}{N_{t}}\sum_{i=1}^{N_{t}}P(x^{(i)},y^{(i)}|M_{\theta})\right)+\ln P_{\alpha}(M_{\theta}) \qquad (14)$$
$$=\ln\left(\frac{1}{N_{t}}\sum_{i=1}^{N_{t}}e^{f(x^{(i)},\theta)[y^{(i)}]}P(x^{(i)},y^{(i)})\right)-\ln\eta(\theta)+\ln\eta(\theta)-\frac{1}{\alpha}\ln\left(\sum_{x,y\in S(P)}e^{\alpha f(x,\theta)[y]}\right) \qquad (15)$$
$$=\ln\left(\frac{1}{N_{t}}\sum_{i=1}^{N_{t}}e^{f(x^{(i)},\theta)[y^{(i)}]}P(x^{(i)},y^{(i)})\right)-\frac{1}{\alpha}\ln\left(\sum_{x,y\in S(P)}e^{\alpha f(x,\theta)[y]}\right). \qquad (16)$$

The cross-entropy loss can then be written as

$$\mathcal{L}_{\alpha}^{\textrm{cross}}(\theta)=\frac{1}{N_{t}}\sum_{i=1}^{N_{t}}\left(f(x^{(i)},\theta)[y^{(i)}]+\ln P(x^{(i)},y^{(i)})\right)-\frac{1}{\alpha}\ln\left(\sum_{x,y\in S(P)}e^{\alpha f(x,\theta)[y]}\right). \qquad (17)$$

The MPT regularization term, the second term in Eqs. (16) and (17), suppresses the probability of the most probable states during optimization. In a gradient-based optimization framework, the difference between the cross-entropy loss and the intersection loss is that the intersection loss re-weights the gradient of each datapoint according to its energy, whereas the cross-entropy loss weights the gradient with respect to the energy of every datapoint equally.
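
The re-weighting is visible directly in the autograd gradients; in the PyTorch sketch below (our illustration, with arbitrary toy energies), the intersection data term yields softmax weights while the cross-entropy data term yields uniform weights:

```python
import torch

# Per-datapoint energies f_i = f(x_i, theta)[y_i] for a tiny batch.
f = torch.tensor([2.0, 0.5, -1.0], requires_grad=True)

# Intersection data term, Eq. (16): ln mean(exp(f_i)); the constant -ln N
# does not affect the gradient, which is softmax(f): high-energy points dominate.
torch.logsumexp(f, dim=0).backward()
print(f.grad)        # approximately tensor([0.786, 0.175, 0.039])

f.grad = None
# Cross-entropy data term, Eq. (17): mean(f_i); every point is weighted 1/N.
f.mean().backward()
print(f.grad)        # tensor([0.3333, 0.3333, 0.3333])
```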

Table I: Classification results: the empirical mean and standard deviation of training and test accuracy. Each statistic was calculated over 120 trained networks with varying hyperparameters. The accuracy of diverged models was counted as 0. CCE: Conditional Cross Entropy, JCE: Joint Cross Entropy, JI: Joint Intersection.

                      Train                      Test
            max      mean     std      max      mean     std
CIFAR10
  CCE       0.947    0.713    0.314    0.757    0.587    0.251
  JCE       0.999    0.800    0.345    0.797    0.647    0.277
  JI        1.000    0.847    0.335    0.812    0.666    0.264
CIFAR100
  CCE       0.653    0.415    0.198    0.344    0.260    0.111
  JCE       0.616    0.376    0.220    0.372    0.245    0.135
  JI        0.627    0.214    0.271    0.397    0.134    0.170
MNIST
  CCE       1.000    0.904    0.295    0.995    0.898    0.293
  JCE       1.000    0.990    0.098    0.996    0.985    0.097
  JI        1.000    0.753    0.426    0.995    0.748    0.423
Figure 2: The robustness of the energy-based models with varying hyperparameters and loss functions. The experiment shows how robust the trained models are to additive input noise. The vertical axis represents the mean test accuracy of the trained networks across noise levels.

IV-B Approximating the gradients

An unbiased approximation of the gradients of the intersection objective function requires MCMC methods, and drawing unbiased gradient estimates with MCMC requires passing the burn-in period. Estimating the length of the burn-in period is an open problem in the MCMC literature. We accept the bias in the gradients for the benefit of computational efficiency.

In the intersection objective function, the total gradient is not a linear function of the sub-gradients corresponding to subsets of the data, so the gradient cannot be approximated with small batches without bias. Considering limit cases makes the nature of this bias clear: with a batch size of one, the sub-gradients are unbiased estimators of the cross-entropy's total gradient, and as the batch size increases, the sub-gradients approach the total gradient of the intersection objective. Based on this reasoning, we accept the trade-off and allow the sub-gradients to be biased in favor of efficiency.

We can therefore use a small batch size to approximate the loss function; once the loss is approximated, the gradients can be computed with any automatic differentiation framework.
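
A minimal PyTorch-style sketch of this approximation is given below; it is our reading of the scheme, in which the minibatch stands in for the support $S(P)$ in both terms of Eq. (16):

```python
import math
import torch

def approx_joint_intersection(logits, labels, alpha):
    # Biased minibatch approximation of Eq. (16): the batch itself stands in
    # for the support S(P) in both terms. logits has shape (batch, classes);
    # larger batches reduce the bias, as discussed above. Negated to minimize.
    n = logits.shape[0]
    f_data = logits[torch.arange(n), labels]              # f(x_i, theta)[y_i]
    data_term = torch.logsumexp(f_data, dim=0) - math.log(n)
    reg_term = -torch.logsumexp(alpha * logits.flatten(), dim=0) / alpha
    return -(data_term + reg_term)

# Usage with any automatic differentiation framework:
# loss = approx_joint_intersection(model(x), y, alpha=2.0); loss.backward()
```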

Table II: Robustness of the trained models to noise. The AUC score is the average test accuracy over varying uniform noise levels added to the inputs. CCE: Conditional Cross Entropy, JCE: Joint Cross Entropy, JI: Joint Intersection.

        CIFAR10        CIFAR100       MNIST
CCE     0.34 ± 0.05    0.09 ± 0.01    0.99 ± 0.00
JCE     0.34 ± 0.03    0.09 ± 0.01    0.97 ± 0.02
JI      0.35 ± 0.02    0.08 ± 0.04    0.97 ± 0.02

IV-C Max Energy Score and OOD Detection

An energy-based model fits a distribution to the input domain as well as the labels. The energy score (i.e., the likelihood of an input up to a normalization constant) can be used to discriminate between data observed during training and data that was not. In addition to the energy score and the softmax score [13], we introduce a new score for discriminating between in- and out-distribution data, the max energy score:

$$S_{\textrm{ME}}(x)=\max_{y\in S(P)}\,f(x;\theta)[y]. \qquad (18)$$

This new score is simply the maximum energy over the labels for a given input. The smaller the score, the more likely the input belongs to the out-distribution.
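
The three scores can be computed from the same logits; the sketch below is our illustration, with signs oriented so that larger values indicate in-distribution (Liu et al. [30] define the energy score with the opposite sign):

```python
import torch

def ood_scores(logits):
    # Three per-input scores from the network output f(x; theta); with the
    # conventions used here, lower values flag likely OOD inputs in all cases.
    softmax_score = logits.softmax(dim=1).max(dim=1).values    # softmax score [13]
    energy_score = torch.logsumexp(logits, dim=1)              # energy score [30]
    max_energy_score = logits.max(dim=1).values                # max energy, Eq. (18)
    return softmax_score, energy_score, max_energy_score
```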

V Experiments and Results

Figure 3: The AUROC of the trained networks per value of $\alpha$. The in-distribution dataset is CIFAR10; the AUROC was averaged over 6 out-distribution datasets. Variation of the AUROC with the $\alpha$ hyperparameter in MPT is shown for different objective functions. Upper left: Energy score, upper right: Softmax score, bottom left: Max Energy score, bottom right: score comparison.

We trained a 12-layer CNN on the CIFAR10, CIFAR100, and MNIST datasets. We chose the CReLU activation [40] instead of ReLU because of its gradient-norm-preserving property, and used 64 filters in all convolution layers. The CNN has neither Batch Normalization [18] nor Dropout [15]; we avoided probabilistic layers because they might interact with the proposed objective functions, and their effect is out of the scope of the current study. The purpose of the described model was to experiment with a simple backbone architecture. We avoided hyperparameter tuning as much as possible and tested all variants over the same hyperparameter range. The learning rate was fixed at 0.01. To include the effect of sub-gradient bias in our experiments, we chose batch sizes from 64, 128, and 256. Each hyperparameter setup was trained 5 times, for a total of 360 trained networks per dataset.
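
A sketch of this backbone is given below. CReLU concatenates relu(x) and relu(−x) along the channel axis [40]; the kernel sizes, downsampling schedule, and classification head are our assumptions, as the paper does not specify them:

```python
import torch
import torch.nn as nn

class CReLU(nn.Module):
    # Concatenated ReLU [40]: concat(relu(x), relu(-x)) along channels,
    # preserving both half-spaces of the pre-activation (doubles channels).
    def forward(self, x):
        return torch.cat([torch.relu(x), torch.relu(-x)], dim=1)

def make_backbone(in_ch=3, n_classes=10, depth=12, width=64):
    # Hypothetical 12-layer, 64-filter CNN with CReLU and no BatchNorm/Dropout.
    layers, ch = [], in_ch
    for d in range(depth):
        stride = 2 if d % 4 == 3 else 1          # assumed downsampling schedule
        layers += [nn.Conv2d(ch, width, 3, stride=stride, padding=1), CReLU()]
        ch = 2 * width                           # CReLU doubles the channel count
    return nn.Sequential(*layers, nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(ch, n_classes))
```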

The experiment is designed to show the effects of MPT with the proposed objective functions on the CNN and its energy-based variant in the classification task. In MPT-based objective functions, the $\alpha$ hyperparameter determines how strongly the model is regularized. A comparative study is embedded in our experiments: setting $\alpha=1$ removes the regularization effect of MPT and is equivalent to the usual likelihood-based training, so we include $\alpha=1$ to compare MPT with the conventional cross-entropy and energy-based losses.

Our experiments are threefold: (i) we experiment with the classification task to show the stability of training and the generalization performance under MPT regularization; (ii) we test the robustness of the trained networks to additive input noise; (iii) we demonstrate the effect of MPT regularization on the OOD task. We use the popular metric, Area Under the Receiver Operating Characteristic curve (AUROC), to evaluate OOD performance.

Table III: OOD AUROC scores: the empirical mean and standard deviation of the scores used for discrimination. Each statistic was calculated over 120 trained networks with varying hyperparameters. The AUROC of diverged models was counted as 0.

Mean AUROC
Objective Function     Energy           Max Energy (Ours)   Softmax Score
Conditional Cross      0.687 ± 0.102    0.683 ± 0.098       0.632 ± 0.084
Joint Cross            0.661 ± 0.110    0.744 ± 0.130       0.604 ± 0.060
Joint Intersection     0.690 ± 0.124    0.734 ± 0.135       0.608 ± 0.067

Max AUROC
Objective Function     Energy           Max Energy (Ours)   Softmax Score
Conditional Cross      0.824            0.819               0.743
Joint Cross            0.836            0.854               0.713
Joint Intersection     0.853            0.866               0.707

V-A Training stability and Generalization performance

The generalization performance of the trained networks is shown in Fig. 1 and summarized in Table I. The loss functions used are Conditional Cross Entropy, Joint Cross Entropy, and Joint Intersection. Conditional Cross Entropy is the usual cross-entropy loss. Joint Cross Entropy is the energy-based variant of the cross-entropy loss, which forces the network to train jointly on the input and label domains. Joint Intersection is the energy-based loss derived from MPT, described in Eq. (16). All loss functions were equipped with MPT regularization, with the strength of the regularization controlled by $\alpha$.

Figure 1 shows that increasing $\alpha$ has a positive effect on both generalization and training stability. On all datasets, as $\alpha$ increases, the average test accuracy of the models increases and the empirical variance decreases. Some networks diverged during training; their accuracy was set to 0, which impacts the averages. The best performance in our experiments is below the state of the art on the CIFAR datasets, which can be attributed to the nearly blind tuning of hyperparameters and the network architecture. Our experiment represents the practical situation where the dataset at hand is new and the initial attempts at hyperparameter tuning are suboptimal.

Table I shows the max and average test accuracy of the trained networks, with the $\alpha$ parameter settings marginalized out. The test accuracy on CIFAR100 is exceptionally low, which is explained by the low average training accuracy and could potentially be resolved by increasing the capacity of the network. The maximum test accuracy on CIFAR10 and CIFAR100 was achieved by the Joint Intersection objective; on MNIST, the difference between the intersection and cross-entropy test accuracies is insignificant. The results show that cross entropy is a safe choice for training under-regularized loss functions, but with heavier regularization, the intersection loss is superior.

V-B Robustness Analysis

We compare the performance of the trained models when the input data is perturbed with noise. For each network, we add uniform noise to the input data and test the performance at different noise levels. The prior for the input in all our experiments is a uniform distribution with support $[-1,1]$. After adding the noise, we clamp values so that the noisy input falls within the support of the prior. We evaluated the test accuracy of each network per noise level, with the support of the uniform noise selected from $[0.1, 0.2, \dots, 1]$ times the support range of the prior, and approximated the Area Under the Curve (AUC) of test accuracy versus noise level for each network. Figure 2 shows the variation of this AUC measure with $\alpha$. The results are consistent with those of Figure 1: as $\alpha$ increases, the models become more robust to changes in the input distribution. Table II compares the AUC performance of each loss function with $\alpha$ averaged out. Without considering the regularization, the objective functions behave similarly on average; however, Joint Intersection responds to the regularization better than the rest, as seen in Fig. 2.
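
A minimal PyTorch sketch of this protocol is given below; the exact noise scaling and the AUC approximation (mean accuracy over equally spaced levels) are our reading of the text:

```python
import torch

@torch.no_grad()
def noise_robustness_auc(model, x, y, levels=tuple(i / 10 for i in range(1, 11))):
    # For each noise level s (a fraction of the prior's support [-1, 1]),
    # add Uniform[-s, s] noise, clamp back into the support, and record
    # test accuracy; the AUC is approximated by the mean over the levels.
    accs = []
    for s in levels:
        noise = (2 * torch.rand_like(x) - 1) * s      # Uniform[-s, s]
        x_noisy = (x + noise).clamp(-1, 1)            # stay in the prior support
        accs.append((model(x_noisy).argmax(1) == y).float().mean().item())
    return sum(accs) / len(accs)
```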

Table IV: Comparison of the OOD AUROC scores reported in [29] with our best result. Except for the last row (which uses 6 out-distribution datasets), all rows use 10 datasets.

Method               Score         Model (#layers)    AUROC
MOOD [29, 17]        -             MSDNet (20)        0.912
ODIN [28, 45]        Softmax       WideResNet (40)    0.901
Mahalanobis [27]     -             WideResNet (40)    0.893
Mahalanobis [27]     -             MSDNet (20)        0.828
Liu et al. [30]      Energy        WideResNet (40)    0.900
Liu et al. [30]      Energy        MSDNet (20)        0.904
EBM+MPT (Ours)       Max Energy    CNN (12)           0.866

V-C OOD Detection

We take the trained networks and test their out-of-distribution detection performance on 6 datasets, in a setup similar to that of [29]. The in-distribution dataset in our experiments is CIFAR10 [23], and the out-distributions are MNIST [25], KMNIST [7], SVHN [37], CIFAR100 [23], STL10 [8], and Fashion-MNIST [44]. We use the softmax score, the energy score, and the proposed max energy score to discriminate between in- and out-distribution data. We use the test set of the in-distribution dataset and both the train and test data of the out-distribution datasets. We use the conventional AUROC score for evaluating OOD strategies [29, 16, 13, 28], averaged over the out-distribution datasets.
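
The evaluation reduces to a binary discrimination problem; the sketch below (our illustration, using scikit-learn) computes the AUROC with in-distribution as the positive class, assuming scores oriented so that in-distribution inputs score higher:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def ood_auroc(scores_in, scores_out):
    # AUROC of separating in-distribution (label 1) from out-distribution
    # (label 0) inputs using any of the per-input scores above.
    y_true = np.concatenate([np.ones_like(scores_in), np.zeros_like(scores_out)])
    y_score = np.concatenate([scores_in, scores_out])
    return roc_auc_score(y_true, y_score)

# Average over the six out-distribution datasets, as in our experiments:
# mean_auroc = np.mean([ood_auroc(s_in, s_out) for s_out in out_scores_list])
```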

The average AUROC of the trained networks with respect to the $\alpha$ parameter is shown in Fig. 3. The decrease in AUROC for large values of $\alpha$ is expected: recall that energy-based models assume a probability distribution on the input domain, and as $\alpha$ increases further from 2 (see [33] for the significance of $\alpha=2$), the optimal probability distribution tends toward the prior, so out-distribution scores become similar to in-distribution scores. The Joint Intersection objective function is superior for both the Energy score and the Max Energy score; for the Softmax score, conditional cross entropy is the superior objective function.

Fig. 3 also compares the different energy scores with respect to $\alpha$: the Max Energy score performs better in the OOD task than the other OOD scores. Table III summarizes the results for the scores and the loss functions. Table IV compares our results with the OOD results reported by [29], to give the reader a sense of the performance scale. The inferiority of our results can be attributed to the choice of a lightweight, vanilla CNN with only 12 layers. MOOD [29], ODIN [28], JEM [11, 30], and Liu et al. [30] use modified versions of EBM-CNNs and the OOD scores discussed in this paper. As shown in Figure 3, MPT regularization consistently improves the OOD performance of the popular OOD scores and of EBM-CNNs; therefore, MPT has the potential to be integrated into the mentioned baselines.

VI Discussion and Future Research

The empirical side of our research consists mainly of ablation studies and a proof of concept for the effectiveness of MPT. Our empirical evidence holds at least when the model architecture and hyperparameters are not tuned to the dataset. It is hard to confidently assert that MPT works for the vast variety of models currently in use in machine learning; however, we have so far not identified any incompatible probabilistic model in our experiments. Furthermore, since the proof of MPT does not rely on additional assumptions, it is reasonable to expect the framework to be general.

Given that MPT is theoretically well grounded and offers a black-box view of priors, it is a promising direction for further investigation. The empirical evidence, although naturally limited, is in line with what the theory predicts. For example, the generalization of a model under MPT can be tied to the smoothness of the model's distribution over input and label, $P(x,y|M_{\theta})$. The test accuracy of a model depends only on the test data distribution and the model's distribution. Maximizing the model probability forces the model's distribution to be close to the prior (e.g., uniform); if the prior is smooth, maximizing the probability of the model makes the model's distribution smoother. Models with smoother conditional distributions are more invariant to changes in the input/label distribution, and are therefore more robust to differences between the train and test distributions. We intend to follow up on this line of thought in future research and verify whether the probability of a model can be theoretically tied to its generalization performance.

MPT does not preclude explicit priors on parameters: it provides an upper bound for the prior density of the parameters, and any explicit assumption about the parameters can be combined with MPT to form a new prior. For example, an explicit prior $q(\theta)$ can be included by constructing the density $p(\theta)=P_{\alpha}(M_{\theta})q(\theta)$, followed by normalization. So far, the benefits and adverse effects of combining explicit priors are not clear to us. One can speculate that the gradients of the regularization term may suffer from the vanishing gradient problem in the MPT framework, and that explicit priors could help with this. We intend to test the effect of incorporating explicit priors into the MPT framework in future experiments.

VII Conclusion

In this paper, we have incorporated the Maximum Probability Theorem into the training of Energy Based Models (EBMs) and demonstrated its black-box regularization property in classification and out-of-distribution detection problems. Our experiments, conducted on six publicly available datasets, demonstrated that (1) learning the input and label jointly in EBMs with adequate regularization outperforms CNNs with a softmax layer; (2) MPT regularization, without exception in our experiments, increases the stability of training, lowers the variance of test accuracy, and improves generalization performance; and (3) incorporating MPT regularization improves AUROC in OOD tasks.

References

  • [1] Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., Mané, D.: Concrete problems in ai safety. arXiv preprint arXiv:1606.06565 (2016)
  • [2] Berger, J.O., Bernardo, J.M.: On the development of the reference prior method. Bayesian statistics 4(4), 35–60 (1992)
  • [3] Berger, J.O., Bernardo, J.M.: Ordered group reference priors with application to the multinomial problem. Biometrika 79(1), 25–37 (1992)
  • [4] Berger, J.O., Bernardo, J.M., Sun, D., et al.: The formal definition of reference priors. The Annals of Statistics 37(2), 905–938 (2009)
  • [5] Berger, J.O., Bernardo, J.M., Sun, D., et al.: Overall objective priors. Bayesian Analysis 10(1), 189–221 (2015)
  • [6] Bernardo, J.M.: Reference posterior distributions for bayesian inference. Journal of the Royal Statistical Society: Series B (Methodological) 41(2), 113–128 (1979)
  • [7] Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., Ha, D.: Deep learning for classical japanese literature. arXiv preprint arXiv:1812.01718 (2018)
  • [8] Coates, A., Ng, A., Lee, H.: An analysis of single-layer networks in unsupervised feature learning. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics. pp. 215–223. JMLR Workshop and Conference Proceedings (2011)
  • [9] Consonni, G., Fouskakis, D., Liseo, B., Ntzoufras, I., et al.: Prior distributions for objective bayesian analysis. Bayesian Analysis 13(2), 627–679 (2018)
  • [10] Du, Y., Li, S., Tenenbaum, J., Mordatch, I.: Improved contrastive divergence training of energy-based models. In: International Conference on Machine Learning. pp. 2837–2848. PMLR (2021)
  • [11] Duvenaud, D., Wang, J., Jacobsen, J., Swersky, K., Norouzi, M., Grathwohl, W.: Your classifier is secretly an energy based model and you should treat it like one (2020)
  • [12] Greensmith, E., Bartlett, P.L., Baxter, J.: Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research 5(9) (2004)
  • [13] Hendrycks, D., Gimpel, K.: A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136 (2016)
  • [14] Hinton, G.E.: Training products of experts by minimizing contrastive divergence. Neural computation 14(8), 1771–1800 (2002)
  • [15] Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)
  • [16] Hsu, Y.C., Shen, Y., Jin, H., Kira, Z.: Generalized odin: Detecting out-of-distribution image without learning from out-of-distribution data. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10951–10960 (2020)
  • [17] Huang, G., Chen, D., Li, T., Wu, F., van der Maaten, L., Weinberger, K.: Multi-scale dense networks for resource efficient image classification. In: International Conference on Learning Representations (2018)
  • [18] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning. pp. 448–456. PMLR (2015)
  • [19] Jaynes, E.T.: Information theory and statistical mechanics. Physical review 106(4),  620 (1957)
  • [20] Jaynes, E.T.: Information theory and statistical mechanics. ii. Physical review 108(2),  171 (1957)
  • [21] Jaynes, E.T.: Prior probabilities. IEEE Trans. Systems Science and Cybernetics 4(3), 227–241 (1968)
  • [22] Jeffreys, H.: An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London. Series A. Mathematical and Physical Sciences 186(1007), 453–461 (1946)
  • [23] Krizhevsky, A., Hinton, G., et al.: Learning multiple layers of features from tiny images (2009)
  • [24] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems. pp. 1097–1105 (2012)
  • [25] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proceedings of the IEEE 86(11), 2278–2324 (1998)
  • [26] LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., Huang, F.: A tutorial on energy-based learning. Predicting structured data 1(0) (2006)
  • [27] Lee, K., Lee, K., Lee, H., Shin, J.: A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In: Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., Garnett, R. (eds.) Advances in Neural Information Processing Systems. vol. 31. Curran Associates, Inc. (2018)
  • [28] Liang, S., Li, Y., Srikant, R.: Enhancing the reliability of out-of-distribution image detection in neural networks. arXiv preprint arXiv:1706.02690 (2017)
  • [29] Lin, Z., Roy, S.D., Li, Y.: Mood: Multi-level out-of-distribution detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 15313–15323 (2021)
  • [30] Liu, W., Wang, X., Owens, J., Li, Y.: Energy-based out-of-distribution detection. In: Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.F., Lin, H. (eds.) Advances in Neural Information Processing Systems. vol. 33, pp. 21464–21475. Curran Associates, Inc. (2020)
  • [31] MacKay, D.J., Mac Kay, D.J.: Information theory, inference and learning algorithms. Cambridge university press (2003)
  • [32] Mackay, D.J.C.: Introduction to monte carlo methods. In: Learning in graphical models, pp. 175–204. Springer (1998)
  • [33] Marvasti, A.E., Marvasti, E.E., Bagci, U., Foroosh, H.: Maximum probability theorem: A framework for probabilistic machine learning. IEEE Transactions on Artificial Intelligence 2(3), 214–227 (2021)
  • [34] Nalisnick, E., Smyth, P.: Variational reference priors (2017)
  • [35] Neal, R.M.: Slice sampling. The annals of statistics 31(3), 705–767 (2003)
  • [36] Neal, R.M., et al.: Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo 2(11),  2 (2011)
  • [37] Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning (2011)
  • [38] Propp, J.G., Wilson, D.B.: Exact sampling with coupled markov chains and applications to statistical mechanics. Random Structures & Algorithms 9(1-2), 223–252 (1996)
  • [39] Raghuram, J., Chandrasekaran, V., Jha, S., Banerjee, S.: A general framework for detecting anomalous inputs to dnn classifiers. In: International Conference on Machine Learning. pp. 8764–8775. PMLR (2021)
  • [40] Shang, W., Sohn, K., Almeida, D., Lee, H.: Understanding and improving convolutional neural networks via concatenated rectified linear units. CoRR abs/1603.05201 (2016)
  • [41] Song, Y., Kingma, D.P.: How to train your energy-based models. arXiv preprint arXiv:2101.03288 (2021)
  • [42] Teh, Y.W., Welling, M., Osindero, S., Hinton, G.E.: Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research 4(Dec), 1235–1260 (2003)
  • [43] Welling, M., Teh, Y.W.: Bayesian learning via stochastic gradient langevin dynamics. In: Proceedings of the 28th international conference on machine learning (ICML-11). pp. 681–688 (2011)
  • [44] Xiao, H., Rasul, K., Vollgraf, R.: Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747 (2017)
  • [45] Zagoruyko, S., Komodakis, N.: Wide residual networks. In: Richard C. Wilson, E.R.H., Smith, W.A.P. (eds.) Proceedings of the British Machine Vision Conference (BMVC). pp. 87.1–87.12. BMVA Press (September 2016). https://doi.org/10.5244/C.30.87
  • [46] Zhao, S., Jacobsen, J.H., Grathwohl, W.: Joint energy-based models for semi-supervised classification. In: ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning. vol. 1 (2020)