
An Ambiguity Measure for Recognizing the Unknowns in Deep Learning

Roozbeh Yousefzadeh
Huawei Hong Kong Research Center
[email protected]
Abstract

We study the understanding of deep neural networks in relation to the scope in which they are trained. While the accuracy of these models is usually impressive on the aggregate level, they still make mistakes, sometimes on cases that appear to be trivial. Moreover, these models are not reliable in realizing what they do not know, which leads to failures such as adversarial vulnerability and out-of-distribution failures. Here, we propose a measure for quantifying the ambiguity of inputs for any given model with regard to the scope of its training. We define the ambiguity based on the geometric arrangement of the decision boundaries and the convex hull of the training set in the feature space learned by the trained model, and demonstrate that a single ambiguity measure can detect a considerable portion of a model's mistakes on in-distribution samples, adversarial inputs, and out-of-distribution inputs. Using our ambiguity measure, a model may abstain from classification when it encounters ambiguous inputs, leading to better accuracy not just on a given testing set, but on the inputs it may encounter in the world at large. In pursuit of this measure, we develop a theoretical framework that can identify the unknowns of a model in relation to its scope. We put this in perspective with the confidence of the model and develop formulations to identify the regions of the domain which are unknown to the model, yet where the model is guaranteed to have high confidence.

1 Introduction

How can a model know what it does not know? How can we develop models that can reliably relate inputs to their knowledge and abstain from processing inputs that are ambiguous to them? In many applications of deep learning, e.g., in object detection, we encounter inputs that are hard to process and evaluate, i.e., corner cases. Consider the case of autonomous driving, where a model may encounter ambiguous images, e.g., a pedestrian in unusual clothing. These are points that may lead to accidents, and it would be useful to automatically quantify whether a model is uncertain about its decisions. Moreover, a model may encounter inputs that are adversarially designed to mislead it [30]. We also have cases of out-of-distribution inputs [14]. All these cases can be considered various types of input ambiguity for a given model, i.e., the unknowns of the model, potentially leading to its failure. If we have tools to detect such inputs when we encounter them, we can have the model make more conservative decisions or abstain altogether. For example, an autonomous car could decide to slow down or stop by the side of the road if it is unsure about what it sees on the road ahead.

Figure 1: We propose a unified ambiguity framework to identify the knowns and unknowns of deep learning models.

The problem is that the unknowns of a model are usually unknown to the model itself. In other words, AI models usually do not know what they do not know [22]. Indeed, the question of recognizing the unknowns of a model is central in many applications of AI models. In the literature, we have separate fields that study adversarial vulnerabilities [11] and out-of-distribution generalization [31]. We also have studies that aim to improve the accuracy of models by reducing their mistakes on in-distribution testing samples, e.g., by model calibration [23], uncertainty quantification [26], and misclassification detection [15]. However, we do not have a unified framework to connect all of these failure modes. Imagine we want to deploy a trained model in the real world while being concerned about all the above sorts of failures. It seems very expensive and impractical to deploy a model while maintaining a pipeline of supplementary models for weeding out adversarial and out-of-distribution inputs along with corner cases. Often, models offered for adversarial or out-of-distribution detection are more expensive than the main model at hand [27]; e.g., recent OOD detection methods require the images to be processed with a diffusion model, and many other OOD detection methods rely on ensembles of models as large as the original model. Hence, the pipeline of supplementary models is extremely expensive compared to the model itself. If we set aside the question of computational feasibility, another interesting question is whether these failure detection methods can work in harmony with each other to weed out the failure modes of a specific model at hand. The answer currently seems to be no, as [16] demonstrated that most of these failure detection methods do not generalize well, and moreover, for some of them, even the evaluation metrics used in the literature are not appropriate.

To address these issues, we propose an ambiguity measure that relates these seemingly separate fields of study into a unified framework by leveraging the geometric information available in the feature space that models learn from their training set, specifically in relation to the decision boundaries and convex hulls, information that is freely available to any model designer, yet often neglected. Our contributions can be summarized as the following:

  1. We first develop a theoretical framework to evaluate the unknowns of trained models in relation to their confidence. With certain guarantees, we quantify the portions of the domain which are unknown to the trained models, yet which the models are guaranteed to process with high confidence.

  2. We propose an ambiguity measure that is based on the geometric arrangements in the feature space of a trained model and the common failure modes of the models. We use image classification models as our case study and provide the necessary algorithms to compute the ambiguity measure fast, in a fraction of a second for models as large as a Swin Transformer trained on ImageNet [9]. Through experiments, we also show that our proposed measure is able to detect a considerable portion of the mistakes made by the models, along with adversarial inputs and out-of-distribution images.

  3. Our ambiguity measure provides an automatically generated text describing why an image is ambiguous, relating the ambiguity to samples in the training set and to the decision boundaries of the model.

  4. We show that there is valuable geometric information in the feature space of deep networks which has often been neglected, especially with regard to the convex hull of training sets.

2 Geometric information available in the last hidden layer of trained models

Let us consider a classification model $\mathcal{M}$ trained on a training set $\mathcal{D}^{tr}$. This model may have $d$ dimensions for its input space and $n$ dimensions for its output corresponding to $n$ distinct classes

z = \mathcal{M}(x),

where $z$ is the output of the model for input $x$, and the element of $z$ with the largest value indicates the classification of $x$

\mathcal{C}(x) = \{i : z_{i} = \max_{k} z_{k}\}.

We use $\Phi$ to denote the latent space in the last hidden layer of a given model. This feature space has $f$ dimensions. For example, a typical Swin Transformer trained on ImageNet has $f = 1{,}536$ and $n = 1{,}000$ in its last hidden layer. When $\mathcal{M}$ processes an input $x$, it maps $x$ to its latent space

x_{\phi} = \mathcal{M}_{\phi}(x), (1)

and then $x_{\phi}$ is processed via

z = W_{\phi} x_{\phi} + b_{\phi}, (2)

where $W_{\phi}$ and $b_{\phi}$ are the weight matrix and bias vector of the last hidden layer of the model.
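As a concrete illustration of equations (1)-(2), the following is a minimal PyTorch sketch that captures the last-hidden-layer activations with a forward hook and recomputes the logits from them. The attribute name model.head is a hypothetical placeholder for whatever module implements the final linear layer of the model at hand.

```python
import torch

def features_and_logits(model, x):
    """Sketch of equations (1)-(2): capture x_phi from the input of the final
    linear layer (hypothetically exposed as model.head) and recompute z."""
    captured = {}
    hook = model.head.register_forward_hook(
        lambda module, inp, out: captured.update(x_phi=inp[0]))
    z = model(x)                                   # z = M(x)
    hook.remove()
    x_phi = captured["x_phi"]                      # x_phi = M_phi(x), eq. (1)
    W_phi, b_phi = model.head.weight, model.head.bias
    assert torch.allclose(z, x_phi @ W_phi.T + b_phi, atol=1e-4)  # eq. (2)
    return x_phi, z
```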

2.1 Why study the learned feature space?

Our main motivation in this approach is to understand and verify the learning of a given trained model, i.e., what the model has learned from its training data and how it uses its learning to process the new inputs.

The success of deep networks is largely attributed to the features they extract from their inputs. For images, the contents of interest usually cannot be explained in terms of individual pixels. Rather, for both humans and models, the spatial relationship among groups of pixels is the key to identifying what is depicted in an image. Similarly, for natural language processing, the sequence of words matters in the meaning created from individual words. Another way to describe this phenomenon is that inputs such as images and text (and many other types of data) have a lower dimensional structure, and in order to learn from such data, one has to learn that underlying structure. This has been a subject of study from various perspectives in machine learning, often under the term representation learning [3, 8]. Studying the underlying structure of the data also has a rich literature in various fields of mathematics [1, 12].

In pursuit of understanding the underlying structure, it is important to note that deep networks often make mistakes, and their learning from the data sometimes has certain flaws, i.e., they may fail to learn certain structures in the data. Moreover, training sets may have certain biases and flaws of their own which may affect the learning of the models trained on them. For example, in a facial recognition dataset, certain groups of people may not be well represented and that could affect the feature space learned by the models [17].

Here, we study the learned feature spaces not as flawless manifolds that represent the ideal way of learning. Rather, we study a given feature space to view what a model has learned from its data, and to verify the knowns and unknowns of the given model. We provide systematic methods that can interpret the learning of a trained model, and explain how it views the testing samples in relation to what it has learned.

2.2 Feature space: further compression via SVD

The feature space that we consider is based on the Singular Value Decomposition (SVD) of $W_{\phi}$. We use SVD to project the points in the latent space of the last hidden layer to a lower dimensional space denoted by $\Psi$. SVD decomposes the matrix $W_{\phi}$ into three matrices:

W_{\phi} = U \Sigma V^{T},

where $U$ is a unitary $n \times n$ matrix, $\Sigma$ is a diagonal $n \times f$ matrix, and $V$ is a unitary $f \times f$ matrix. We use $V^{T}$ to project the points in the latent space:

x_{\psi} = V^{T} x_{\phi}. (3)

Given a point $x_{\psi}$ in feature space $\Psi$, one can obtain the output of the model via

z = U \Sigma x_{\psi} + b_{\phi}. (4)

Benefits of decomposing the last hidden layer: The results we present in this paper are all in the $\Psi$ space. We note that analyzing the original space $\Phi$ yields results that are slightly less accurate than those obtained from $\Psi$. The better accuracy is explainable via the rotation and scaling operations of the SVD [12], which may pose the geometric arrangement of the data in a way that is more suitable for our classification task. This is evident in Figure 2, where for the orange histogram ($\Psi$), the separation between the closest class and the other classes is more pronounced. Indeed, if we were to use the distance to the convex hull of the closest class as our classification criterion, the accuracy of our model would be 94.40% for $\Psi$, 94.35% for $\Phi$, and 94.33% for $\Phi U$, while the accuracy obtained from the softmax score is 94.37%. This indicates that for accurate classification, one can rely merely on the convex hull of the training set in the feature space without considering the decision boundaries of this model.

Figure 2: Distribution of the distance of testing samples from the convex hull of the training set for each class of the CIFAR-10 dataset, in the feature space learned by a ResNet model.

The other benefit obtained from the SVD is that $\Psi$ has a smaller number of dimensions than $\Phi$. For example, a Swin Transformer has 1,536 dimensions in $\Phi$ while its corresponding $\Psi$ has only 1,000 dimensions. This makes our optimization problems, regarding the convex hulls and decision boundaries, less expensive. We note that SVD has been used in the last hidden layer of language models before, not to change the geometric arrangements, but to make the computation of softmax scores faster [28, 7].
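A minimal numpy sketch of this projection, assuming $W_{\phi}$ is available as an $n \times f$ array, might look as follows; the reduced SVD keeps exactly the $n$ directions that matter for the logits, which is why $\Psi$ has $n$ dimensions.

```python
import numpy as np

def build_psi_projection(W_phi):
    """SVD of the last-layer weights (reduced form): W_phi = U @ diag(S) @ Vt.
    Vt projects Phi into the n-dimensional Psi space of equation (3)."""
    U, S, Vt = np.linalg.svd(W_phi, full_matrices=False)
    return U, S, Vt

# Usage sketch, assuming x_phi and b_phi are given:
#   x_psi = Vt @ x_phi                  # equation (3)
#   z     = U @ (S * x_psi) + b_phi     # equation (4)
```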

2.3 The intuition behind our geometric approach

Let us consider a deep network trained for the classification of 3 classes. Figure 3 depicts the $\Psi$ space of such a model, which we extracted from 3 output classes of a Swin Transformer trained on ImageNet. This figure not only shows the decision boundaries of the model in this feature space, but also depicts the convex hull of training samples for each class. When classifying a testing sample $x$, a typical model maps the sample to the $\Psi$ space, and then determines which partition $x_{\psi}$ belongs to. The partitioning step is performed via equation (4), where partitions are separated by a finite set of hyperplanes [33].

Figure 3: Partitioning of a 2D domain via decision boundaries and the convex hull of the training set for each class.

The linear operation performed via equation (4) partitions the feature space via a set of hyperplanes. Since $W_{\phi}$ and $b_{\phi}$ are fixed for a trained model, the partitioning of the feature space (i.e., the location of decision boundaries) is also fixed. It follows that once we compute $x_{\psi}$ for a given $x$, equation (4) merely determines which partition $x_{\psi}$ has fallen within. This, however, neglects the rich geometric information available in the feature space, most importantly the relationship between $x_{\psi}$ and the convex hull of the training set for each class. The other valuable information in the feature space is the arrangement with regard to the decision boundaries.

To make our point more clear, consider the red and purple points in Figure 3, each of them being a testing sample. The purple point falls within the convex hull of class 1 and it is relatively far from the decision boundaries. Being inside the convex hull indicates that this sample falls within the range of training samples for class 1, i.e., the model has some frame of reference for classifying this sample as class 1. Being away from the decision boundaries with other classes signifies that there is little chance of confusion about which partition/class this sample may belong to.

Now, let us consider the red point in Figure 3, which falls in the partition for class 2. This sample is far from the convex hull of the training set for class 2, and at the same time, it is very close to the decision boundaries with the other 2 classes. One can argue, quite reasonably, that the model should be less confident about the classification of the red point compared to the classification of the purple point. While the red sample has fallen in the partition for class 2, it does not seem to be closely similar to any of the training samples for class 2. Yet, it could easily be flipped to any of the other classes if it were slightly perturbed or if the decision boundaries of the model were slightly different from their current configuration.

These two concepts, the relationship with the convex hull of the training set and the relationship with the decision boundaries, are the basis for our ambiguity measure.

3 Computing the distances to the convex hulls and the decision boundaries

3.1 Convex hulls

Assuming that the training set has at least one sample for each class, one can compute the distance to the convex hull of each class:

x_{\psi}^{\mathcal{H},k} = \mathcal{P}^{\mathcal{H}}(x_{\psi}, \mathcal{H}_{\psi}^{tr}, k), (5)

where $\mathcal{P}^{\mathcal{H}}$ is a function that projects $x_{\psi}$ onto the convex hull of the training set for class $k$. Using this projection, we can compute a vector of distances where each element corresponds to one of the classes in the training set

d^{\mathcal{H}}_{\psi}(x): \forall k \in \{1{:}n\}, \quad d^{\mathcal{H}}_{\psi}(x_{\psi})[k] = \|x_{\psi} - \mathcal{P}^{\mathcal{H}}(x_{\psi}, \mathcal{H}_{\psi}^{tr}, k)\|_{F}. (6)

We will be particularly interested in the two smallest values of this vector.

To compute the projection of a sample onto the convex hull of a set of points, we use the approximation algorithm by [13, 4], which has a guarantee for small error and a running time of $\mathcal{O}(|\mathcal{D}| \frac{n}{\bar{\epsilon}^{2}})$, where $|\mathcal{D}|$ is the number of points in the convex hull, $n$ is the number of dimensions, and $\bar{\epsilon}$ is a parameter to bound the error. We note that the problem of projecting onto a convex hull is convex and there are exact algorithms that can solve it reasonably fast [24], but not as fast as the approximation algorithm.

In practice, solving this optimization problem in the 1,000-dimensional feature space of a Swin Transformer trained on ImageNet takes about 100 milliseconds on a GPU.
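As an illustration, a generic Frank-Wolfe iteration for this projection might look as follows; this is only a sketch in the spirit of the approximation algorithm above, not necessarily the exact method of [13, 4].

```python
import numpy as np

def project_to_hull(x, points, iters=2000):
    """Approximately project x onto the convex hull of the rows of `points`
    by minimizing 0.5 * ||p - x||^2 with Frank-Wolfe; returns the projection
    and the distance used in equation (6)."""
    p = points[np.argmin(np.linalg.norm(points - x, axis=1))].copy()  # warm start
    for t in range(1, iters + 1):
        grad = p - x                              # gradient of the objective at p
        s = points[np.argmin(points @ grad)]      # linear minimization over vertices
        gamma = 2.0 / (t + 2.0)                   # standard step size
        p = (1.0 - gamma) * p + gamma * s
    return p, np.linalg.norm(p - x)
```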

3.2 Decision boundaries

3.2.1 Existence, intersections, and properties

Here, we are concerned with the decision boundaries in the compressed feature space $\Psi$. For simplicity of notation, we rewrite equation (4) as

z = W_{\psi} x_{\psi} + b_{\phi},

where $W_{\psi} = U \Sigma$.

The decision boundary between any two classes $i$ and $j$ in the compressed feature space, denoted by $x^{f(i,j)}_{\psi}$, can be defined as the set of points satisfying the following two conditions

W_{\psi}[i] x^{f(i,j)}_{\psi} + b_{\phi}[i] = W_{\psi}[j] x^{f(i,j)}_{\psi} + b_{\phi}[j] (7)
W_{\psi}[i] x^{f(i,j)}_{\psi} + b_{\phi}[i] \geq W_{\psi}[k] x^{f(i,j)}_{\psi} + b_{\phi}[k], \quad \forall k \notin \{i,j\} (8)

These two equations define a subspace within $\Psi$, but it is important to realize that this subspace may be empty. If the subspace is empty for the decision boundary between two specific classes, it means that those classes do not have an interface.

Theoretically, it is possible for all class pairs to have an interface even when $n \gg f$: whether a particular pair of classes has an interface depends on whether the subspace defined by equations (7)-(8) is non-empty, which is determined by the arrangement of the hyperplanes rather than by $n$ and $f$ alone. In practice, however, some pairs of classes may not have a direct interface with each other; rather, the first class may be separated from the second class via other intermediary classes. This is not necessarily undesirable, as it would mean there is little chance of confusion between the two classes. For example, a model trained on ImageNet might have a decision boundary between two different types of fish, but perhaps not between planes and alligators. Hence, it is informative to understand and verify which classes do or do not have decision boundaries between them.

In an $f$-dimensional ambient space, the decision boundary between two classes may be a hyperplane of $f-1$ dimensions. These decision boundaries eventually intersect with each other as well as with the boundaries of the domain. The intersection of two or more hyperplanes may be a non-empty subspace of lower dimension defining the decision boundary between multiple classes.

In our compressed feature space, the dimensionality of the feature space is always equal to the number of classes, i.e., $f = n$. We note that in practice almost all models have architectures with $f > n$ in their last hidden layer. There are some architectures that have $f < n$, e.g., certain variations of Swin Transformers; however, the accuracy of those architectures has proved to be lower than the accuracy of equivalent architectures with $f > n$.

A separate paper provides a theoretical analysis and discussion of the existence, intersections, and properties of decision boundaries in the feature space. Moreover, we provide a fast algorithm that can project samples onto the decision boundaries of trained models.

3.2.2 Projecting a point to the decision boundaries

Given a sample $x$ with classification $i = \mathcal{C}(x)$, we are interested in projecting $x$ onto the closest point on the decision boundaries with other classes. This is interesting because it tells us how far a given sample is located from the decision boundaries of other classes, and it also gives us the direction the sample has to take to reach those decision boundaries. The direction that a sample has to take to reach a decision boundary may be very different from the direction it has to take to reach the convex hull of the training set.

To project the point $x_{\psi}$ onto the closest point on the decision boundary between classes $i$ and $j$, we minimize the following objective

\min_{x^{cf(i,j)}_{\psi}} \|x^{cf(i,j)}_{\psi} - x_{\psi}\|_{2} (9)

subject to the constraints in equations (7)-(8). In our notation, $x^{f}$ denotes flip points while $x^{cf}$ denotes closest flip points (closest to some reference $x$). In practice, solving this optimization problem in the 1,000-dimensional feature space of a Swin Transformer trained on ImageNet takes about 50 milliseconds on a GPU.
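For illustration, the following is a minimal sketch of this projection as a convex quadratic program, written with cvxpy for clarity rather than speed; the variable names are our own and the fast solver used in the paper is not shown.

```python
import numpy as np
import cvxpy as cp

def project_to_boundary(x_psi, W_psi, b_phi, i, j):
    """Solve equation (9) subject to (7)-(8): the closest flip point between
    classes i and j. Returns None when the two classes have no direct
    interface (the infeasible case discussed in Section 3.2.3)."""
    n, f = W_psi.shape
    xb = cp.Variable(f)
    cons = [(W_psi[i] - W_psi[j]) @ xb == b_phi[j] - b_phi[i]]          # eq. (7)
    cons += [(W_psi[i] - W_psi[k]) @ xb >= b_phi[k] - b_phi[i]          # eq. (8)
             for k in range(n) if k not in (i, j)]
    prob = cp.Problem(cp.Minimize(cp.sum_squares(xb - x_psi)), cons)
    prob.solve()
    return xb.value  # None if the problem is infeasible
```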

3.2.3 Special cases

As we noted earlier, a direct interface might not exist between certain classes, in which case the optimization problem above becomes infeasible. One easy way to verify the infeasibility of our optimization problem is to solve its dual form: when the dual form is unbounded, the primal problem is guaranteed to be infeasible.

When the problem is infeasible, we can relax the formulation, aiming for a proxy point that tells us how far $x_{\psi}$ has to move in order to flip to the other class $j$. Since classes $i$ and $j$ do not have a direct interface, $x_{\psi}$ has to flip from class $i$ to some other class(es) before eventually flipping to class $j$. This implies a farther distance; hence, we seek a point on the interface with class $j$ and not necessarily with class $i$. For this, we rewrite equations (7)-(8) as

W_{\psi}[j] x^{cf(i,j)}_{\psi} + b_{\phi}[j] \geq W_{\psi}[k] x^{cf(i,j)}_{\psi} + b_{\phi}[k], \quad \forall k \notin \{j\}, (10)

and subject the objective function (9) to this relaxed constraint.

It is possible to generalize this to decision boundaries between class $i$ and multiple classes specified in a set $\gamma$. In this case, we minimize equation (9) subject to the following constraints

W_{\psi}[i] x^{cf(i,\gamma)}_{\psi} + b_{\phi}[i] = W_{\psi}[j] x^{cf(i,\gamma)}_{\psi} + b_{\phi}[j], \quad \forall j \in \gamma (11)
W_{\psi}[i] x^{cf(i,\gamma)}_{\psi} + b_{\phi}[i] \geq W_{\psi}[k] x^{cf(i,\gamma)}_{\psi} + b_{\phi}[k], \quad \forall k \notin \{i,\gamma\} (12)

3.2.4 Calculating the vector of distances

To formalize the projection onto the decision boundaries as a function, we use the following notation

x^{cf(i,\gamma)}_{\psi} = \mathcal{P}^{f}(x_{\psi}, \gamma), (13)

where $\gamma$ denotes the set of classes whose joint decision boundary with class $i$ we want to find.

Using the function in equation (13), we can find the distance between a point and the decision boundaries of all other $n-1$ classes. This ultimately gives us a vector of $n-1$ distances to the decision boundaries

d^{f}_{\psi}(x_{\psi}): \forall k \in \{1{:}n\} \backslash \{i\}, \quad d^{f}_{\psi}(x_{\psi})[k] = \|x_{\psi} - \mathcal{P}^{f}(x_{\psi}, k)\|_{F}, (14)

while $d^{f}_{\psi}(x_{\psi})[i] = 0$.

4 Characterising the unknowns of the models

Let us first characterize the notion of the unknowns of a trained model within its learned feature space.

4.1 Boundaries of the domain in feature space

In the pixel space, the boundaries of the domain are well known. We can think of the pixel space as a hypercube with unit-length sides, as it is customary to normalize the pixel values to be between 0 and 1 (or some other range).

Remark 4.1.

While we consider the pixel domain as a continuous hypercube, we note that pixel space is usually a discrete space as pixel values have representations such as UINT8. In any case, the assumption of continuous space does not affect our conclusions as the discrete space will be a subset of continuous space.

To realize the limits of the feature space $\Psi$, we can derive the following based on Hölder's inequality

\|x_{\psi}\|_{\infty} = \|V^{T} x_{\phi}\|_{\infty} \leq \|V^{T}\|_{2,\infty} \|x_{\phi}\|_{2} = \|x_{\phi}\|_{2}, (15)

since $V^{T}$ (from the SVD) is orthonormal, and therefore $\|V^{T}\|_{2,\infty} = 1$.

Now, we have to identify the limits of $\|x_{\phi}\|_{2}$ as the output of $\mathcal{M}_{\phi}(.)$. For this, we can solve the following optimization problem

\max_{x} \|\mathcal{M}_{\phi}(x)\|_{2}, (16)

subject to $x$ being within the boundaries of the pixel space. The maximum value of this objective function gives us the upper bound on $\|x_{\phi}\|_{2}$, which can then be plugged back into equation (15) to find the limit of the hypercube that defines our feature space $\Psi$. The hypercube is centered at the origin, and each of its sides has length $2\|x_{\psi}\|_{\infty}$.

For solving the optimization problem in equation (16), we note that $\mathcal{M}_{\phi}$ is a function of all the layers preceding the last layer. Depending on the architecture of the model, these layers could include ReLU layers, pooling layers, normalization layers, attention layers, etc. This composition of functions as a whole is non-concave. Nevertheless, we can solve our maximization problem using standard optimization tools, perhaps the same algorithms that are used for training the models.
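A minimal PyTorch sketch of this maximization is shown below; since the objective is non-concave, projected gradient ascent only yields a lower estimate of the true maximum, so in practice one would use several random restarts. The function model_phi is assumed to map pixel-space inputs to the last hidden layer.

```python
import torch

def estimate_max_feature_norm(model_phi, shape=(1, 3, 224, 224),
                              steps=500, lr=0.05):
    """Projected gradient ascent on equation (16): maximize ||M_phi(x)||_2
    while keeping x inside the pixel-space box [0, 1]^d."""
    x = torch.rand(shape, requires_grad=True)
    opt = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = -model_phi(x).norm(p=2)     # ascend the feature norm
        loss.backward()
        opt.step()
        with torch.no_grad():
            x.clamp_(0.0, 1.0)             # project back into the pixel domain
    return model_phi(x).norm(p=2).item()
```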

Remark 4.2.

To directly identify the limits of $\Phi$ (instead of $\Psi$), and avoid the compression performed via SVD, the norm of the objective function in equation (16) can be changed from the 2-norm to the infinity norm.

Remark 4.3.

For some specific models, the feature space $\Phi$ is strictly non-negative, e.g., when a model has a ReLU layer at the very end. Most models do not have that property, and their last hidden layer may have positive and negative values.

4.2 Unknowns may constitute a large portion of the domain of a trained model

The arguments regarding the unknowns of the models are generally applicable to the pixel space as much as to the feature space. In both spaces, the convex hull of the training set usually occupies a small fraction of the bounded domain. Testing samples are outside $\mathcal{H}^{tr}$, yet their distance from $\mathcal{H}^{tr}$ usually does not exceed 30% of the diameter of $\mathcal{H}^{tr}$ [32]. For some models, this percentage is less than 5% in the feature space.

At the same time, vast areas of the domain may be unrelated to the scope of a given model, whether the domain is the pixel space or the input space of other types of machine learning applications [6]. For example, if we consider the pixel space for a model trained on the CIFAR-10 dataset, that same pixel space contains all sorts of images unrelated to the model's scope, e.g., radiology images, images of other object types, etc. All these irrelevant images exist in that same domain. What has become known as adversarial inputs are also in that same domain.

To get a handle on how much of the domain is covered by the convex hull of the training set, we estimate its volume using the approximation algorithm by [29]. This is one of the fastest approximation methods available [19], and it makes our computation significantly faster since we are dealing with high dimensions and a high number of samples. The running time of this algorithm is $\mathcal{O}(\bar{n}^{4})$, where $\bar{n}$ is the number of dimensions.

4.3 Existence of cases of improper confidence

A simple example of improper overconfidence is adversarial inputs, which have been the subject of extensive, yet inconclusive, studies over the past few years. Another example is the case of out-of-distribution inputs, where a model may confidently classify images that are unrelated to the scope of its training; e.g., an object detection model may classify a radiology image of the liver as an aeroplane with 100% confidence. First, let us provide a formulation which can quickly give us egregious cases of overconfidence for a model:

\max_{\hat{x}_{\psi}} \|x_{\psi} - \hat{x}_{\psi}\|_{2} (17)

subject to:

\hat{x}_{\psi} \in \Psi, \quad \hat{x}_{\psi} \notin \mathcal{H}^{tr}, \quad softmax(U \Sigma \hat{x}_{\psi} + b_{\phi})[i] > 0.9 (18)

Solving this formulation is not hard, especially if we use a good initial solution that is inside partition $i$ yet far from $x_{\psi}$. We leave the numerical examples for future work.

4.4 Establishing the closeness to decision boundaries as the indicator of model’s confidence

In this section, we prove that a model can have low confidence only in the vicinity of its decision boundaries; in all other regions of the domain, the model is guaranteed to have high confidence.

Theorem 4.4.

Given an input $x$ for a trained model $\mathcal{M}(.)$, if in the feature space of $\mathcal{M}$ the distance of $x$ from the closest decision boundary is at least $\delta$,

\forall x \in \Phi : d^{f,min}_{\phi}(x) \geq \delta,

then the confidence (top softmax score) of $\mathcal{M}(.)$ in classifying $x$ will be at least

\max softmax(\mathcal{M}(x)) \geq \frac{e^{\delta\|W[j]-W[i]\|_{2}}}{e^{\delta\|W[j]-W[i]\|_{2}}+(n-1)},

where $W$ is the weight matrix used for the classification of the feature space, $n$ is the number of classes, $i$ is the classification of $x$, and $j$ is the class with the closest decision boundary.

Proof.

We know $x$ is not on any decision boundary, hence it has a specific classification. Let us denote the classification of $x$ by some class $i = \mathcal{C}(x)$. Without loss of generality, we consider the feature space to be $\Phi$; the same proof applies to $\Psi$ as well.

Let us denote the point on the closest decision boundary by $x_{\phi}^{f}$. Such a decision boundary is a flip between class $i$ and at least one other class. We denote the other class by $j$, noting that $j$ may not be unique. From the theorem statement, we have

\|v\|_{2} = \delta, \quad v = x_{\phi}^{f} - x_{\phi}.

From equation (2), we can write $z = W x_{\phi} + b$ and $z^{f} = W x_{\phi}^{f} + b$. Plugging in the definition of $v$ gives us

z = W(x_{\phi}^{f} - v) + b = W x_{\phi}^{f} - Wv + b = z^{f} - Wv.

We have $z^{f}[i] = z^{f}[j]$ since $x^{f}$ is a flip point between classes $i$ and $j$. Based on this, we can write

\begin{split}z[i]-z[j]&=z^{f}[i]-z^{f}[j]-(W[i]v-W[j]v)\\ &=0-(W[i]v-W[j]v)\\ &=(W[j]-W[i])v\\ &=\|W[j]-W[i]\|_{2}\|v\|_{2}\cos{\theta}\\ &=\delta\|W[j]-W[i]\|_{2},\end{split} (19)

where $W[i]$ denotes the $i^{th}$ row of $W$, and $\theta$ is the angle between the vectors $W[j]-W[i]$ and $v$. It is easy to show that these two vectors are always parallel, leading to $\cos{\theta} = 1$. To realize this, note that the decision boundary between classes $i$ and $j$ is a hyperplane with coefficients $W[j]-W[i]$. Since this hyperplane is the closest decision boundary to $x_{\phi}$, no other decision boundary can block the access of $x_{\phi}$ to this hyperplane; hence $v$, the direction to the closest point on this hyperplane, must be perpendicular to the hyperplane. Being perpendicular to a hyperplane means having the same direction as its vector of coefficients, which makes $v$ parallel to $W[j]-W[i]$.

Now, via equation (19), we have the difference between the elements $i$ and $j$ of the logit vector $z$. Hence, we can derive a bound on the softmax score (confidence) of $\mathcal{M}(x)$. The top softmax score for $x$, corresponding to class $i$, is

\begin{split}\max softmax(x)&=\frac{e^{z[i]}}{\sum\limits_{k\in\{1:n\}}e^{z[k]}}\\ &\geq\frac{e^{z[i]}}{e^{z[i]}+(n-1)e^{z[j]}}\\ &=\frac{e^{z[j]}e^{\delta\|W[j]-W[i]\|_{2}}}{e^{z[j]}e^{\delta\|W[j]-W[i]\|_{2}}+(n-1)e^{z[j]}}\\ &=\frac{e^{\delta\|W[j]-W[i]\|_{2}}}{e^{\delta\|W[j]-W[i]\|_{2}}+(n-1)}\end{split} (20) ∎

Corollary 4.4.1.

Given a specific model with known $W$, the bound in equation (20) can be reduced to the following class-agnostic bound

\max softmax(\mathcal{M}(x)) \geq \frac{e^{\delta\rho(W)}}{e^{\delta\rho(W)}+(n-1)}, (21)

where

\rho(W) = \min_{i \neq j} \|W[j]-W[i]\|_{2}, \quad \forall i,j \in \{1:n\}. (22)

Proof.

Computing $\rho(W)$ is straightforward via equation (22). This equation provides the smallest possible value of $\|W[j]-W[i]\|_{2}$ for any choice of $i \neq j$. We previously established that $i$ and $j$ cannot be identical. Furthermore, the function $\frac{e^{x}}{e^{x}+c}$ is monotonically increasing with respect to $x$ because its derivative w.r.t. $x$ is strictly positive. Therefore, the minimum value of $\|W[j]-W[i]\|_{2}$ provides a lower bound for the right hand side of equation (20). ∎
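As an illustrative sketch (with our own function names), $\rho(W)$ and the class-agnostic bound of equation (21) can be computed directly from the classification weights:

```python
import numpy as np

def rho(W):
    """Equation (22): smallest pairwise distance between rows of W."""
    n = W.shape[0]
    return min(np.linalg.norm(W[j] - W[i])
               for i in range(n) for j in range(n) if i != j)

def confidence_lower_bound(delta, W):
    """Equation (21): guaranteed top softmax score for any point whose
    distance to every decision boundary is at least delta."""
    n = W.shape[0]
    e = np.exp(delta * rho(W))
    return e / (e + (n - 1))
```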

We can extend Theorem 4.4 to derive an upper bound on the confidence of the model based on the distance of inputs from the decision boundaries.

Theorem 4.5.

Given an input $x$ for a trained model $\mathcal{M}(.)$, if in the feature space of $\mathcal{M}$ the distance of $x$ from the closest decision boundary is at most $\delta$,

\forall x \in \Phi : d^{f,min}_{\phi}(x) \leq \delta,

then the confidence (top softmax score) of $\mathcal{M}(.)$ in classifying $x$ will be at most

\max softmax(\mathcal{M}(x)) \leq \frac{e^{\delta\rho^{\prime}(W)}}{e^{\delta\rho^{\prime}(W)}+1}, (23)

where

\rho^{\prime}(W) = \max_{i \neq j} \|W[j]-W[i]\|_{2}, \quad \forall i,j \in \{1:n\}. (24)

Proof.

The proof is similar to the previous ones and is omitted for brevity. ∎

We can use the above theorems in two ways. First, we can fix a desired lower bound on the confidence and determine what distance the samples should maintain from the decision boundaries to guarantee it. This can then be used to estimate for what fraction of the domain a model has high confidence. Second, if we know how far the samples are from the decision boundaries, we can compute the minimum confidence of the model for those samples.

To make this relationship more tangible, let us consider an example.

Example 4.6.

Consider a ViT model trained on the CIFAR-10 dataset. We obtain $\rho(W)$ for this model as $0.876$. Since the model has 10 classes, we have $n = 10$. The relationship between the confidence of the model and the distance to the decision boundaries becomes

\max softmax(\mathcal{M}(x)) \geq \frac{e^{0.876\delta}}{e^{0.876\delta}+9}.

The feature space of this model is a 768-dimensional hypercube. To give some perspective about distances in the feature space of this model, we note that its diameter is 110, the diameter of $\mathcal{H}^{tr}_{\phi}$ is 10.40, and the average distance to the decision boundaries is 55. Given the bound above, choosing the distance threshold as $\delta = 3$ yields

\max softmax(\mathcal{M}(x)) \geq 60\%.

This means that any sample with distance $\geq 3$ to the decision boundaries will be classified by the model with a softmax score of at least 60%. Alternatively, if we require the confidence to be at least 90%, the distance from the decision boundaries should be at least $5.07$. Note that these are lower bounds, and there may exist points with high confidence that are closer to the decision boundaries. However, as long as the distance of $5.07$ from the decision boundaries is maintained for an input, the confidence of the ViT model for that input is guaranteed to be at least 90%.

On the other hand, using Theorem 4.5, we obtain $\rho^{\prime}(W) = 0.9692$, leading to

\max softmax(\mathcal{M}(x)) \leq \frac{e^{0.9692\delta}}{e^{0.9692\delta}+1},

which implies that for any sample with $d^{f,min}_{\phi} \leq 0.42$, the top softmax score of the model cannot be larger than 60%.
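The numbers in this example can be reproduced with a few lines; the sketch below simply evaluates the two bounds with the reported constants, which we take as given.

```python
import numpy as np

# Constants reported in Example 4.6 for the CIFAR-10 ViT model.
n, rho, rho_prime = 10, 0.876, 0.9692

def lower_bound(delta):
    """Corollary 4.4.1: guaranteed confidence at distance >= delta."""
    e = np.exp(delta * rho)
    return e / (e + (n - 1))

def upper_bound(delta):
    """Theorem 4.5: maximum possible confidence at distance <= delta."""
    e = np.exp(delta * rho_prime)
    return e / (e + 1)

print(lower_bound(3.0))    # ~0.61: distance >= 3 guarantees roughly 60% confidence
print(upper_bound(0.42))   # ~0.60: distance <= 0.42 caps confidence near 60%
```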

Remark 4.7.

The theorem and the corollary proved above hold over the entire domain, both inside and outside the convex hull of the training set. Therefore, it is only the closeness to decision boundaries that can explain the low confidence of a model when classifying an input.

Remark 4.8.

From equation (19), we can conclude that the closest class to a given input is not necessarily the class with the second largest softmax score.

4.5 Characterizing the decision boundaries and the fraction of the domain occupied by them

In the previous section, we established that closeness to decision boundaries is the only source of low confidence for a model. We now aim to identify the decision boundaries and gauge the fraction of the domain occupied by them. A certain offset from the decision boundaries will then indicate the fraction of the domain in which a model has low confidence.

Theorem 4.9.

Given a hypercube $\Phi = [0,1]^{n}$ with $n \geq 2$ as the feature space of a classification model where $W$ and $b$ define the model's output via equation (2), the decision boundary between any two classes $i$ and $j$, denoted by $\Delta^{i,j}_{\phi}$, is a polytope of $\leq n-1$ dimensions. The volume of this polytope can be computed via Algorithm 1 and it is upper bounded by

vol(\Delta^{i,j}_{\phi}) \leq \frac{\|W[i]-W[j]\|_{2}}{(n-1)!\prod\limits_{k=1}^{n}(W[i,k]-W[j,k])} \sum_{K\subseteq\{1,\dots,n\}} (-1)^{|K|} \big(\max(0, -b[i]+b[j]-w\cdot\mathbbm{1}_{K})\big)^{n-1}, (25)

where $w = W[i]-W[j]$ and $\mathbbm{1}_{K}$ denotes the indicator vector of the subset $K$.
Proof.

The decision boundary between any two arbitrary classes $i$ and $j$ is a subset of the domain defined by

\begin{split}x &\in \Phi \\ (W[i]-W[j])x &= -b[i]+b[j] \\ W[i]x+b[i] &\geq W[k]x+b[k], \quad \forall k \notin \{i,j\}.\end{split} (26)

It is possible for this subset to be empty, as we noted in Section 3.2.1, in which case $\Delta^{i,j}_{\phi}$ would not occupy any portion of $\Phi$. However, if the subset is non-empty, it will be a convex polytope of at most $n-1$ dimensions. Each vertex of this polytope will be an intersection of $\Delta^{i,j}_{\phi}$ with either the boundaries of $\Phi$ or with the decision boundaries of other classes. These intersections define the vertices of the polytope. To identify the vertices of $\Delta^{i,j}_{\phi}$, we can use the simplex method in Algorithm 1, inspired by the revised simplex algorithm in [24]. This algorithm initially identifies a vertex of the feasible domain, then finds a direction along which one of the constraints is binding, and moves in that direction until it finds the next vertex. This continues until all the vertices are obtained, as we know the number of vertices of this polytope is finite and upper bounded by $(n-\lfloor\frac{n}{2}\rfloor)\binom{n}{\lfloor n/2\rfloor}$ [25]. Having all the vertices of $\Delta^{i,j}_{\phi}$, we can compute its volume analytically using formulas in [18, 5] (one way to perform this computation is to break the polytope into a set of simplexes, and then compute the volume of each simplex using the Gram determinant).

If $\Delta^{i,j}_{\phi}$ only intersects with the boundaries of $\Phi$ and does not have any intersections with other decision boundaries, $\Delta^{i,j}_{\phi}$ is the intersection of a hyperplane and a hypercube, and its volume can be computed analytically via the right hand side of equation (25), derived by [2, 20]. However, if $\Delta^{i,j}_{\phi}$ intersects with any other decision boundaries, i.e., the last condition of equation (26) becomes binding for one or more instances of $k$, then the volume of $\Delta^{i,j}_{\phi}$ will be strictly less than the right hand side of equation (25). ∎

Algorithm 1 Identifying the decision boundaries in the feature space
Input: model $\mathcal{M}(.)$ with $W$ and $b$ in its last hidden layer and $\Phi$ as its feature space; classes $i$ and $j$
Output: vertices of the polytope $\Delta^{i,j}_{\phi}$ defining the decision boundary between classes $i$ and $j$ in the feature space of $\mathcal{M}(.)$
1: $\Phi = [-a,a]^{n}$ ← identify $a$ as the length of $\Phi$ by solving equation (16) with the infinity norm
2: initialize $\mathcal{S} = \emptyset$
3: $\mathcal{S}$ ← identify $s$ as a basic feasible solution of equations (26)
4: $d$ ← identify a direction from $s$ along which one constraint in equation (26) is binding
5: while $d \neq \emptyset$ do
6:     $\mathcal{S}$ ← move $s$ along $d$ as far as equations (26) are satisfied
7:     $d$ ← identify a direction from $s$ along which one new constraint in equation (26) is binding
8: end while
Remark 4.10.

If $\Delta^{i,j}_{\phi}$ does not have any intersections with other decision boundaries, e.g., in the case of binary classification, then equation (25) becomes an equality, directly providing the volume of $\Delta^{i,j}_{\phi}$ with no need for Algorithm 1.

Now that we can identify the exact location of decision boundaries and their volume, we can proceed to identify the slices of the domain surrounding the decision boundaries which are guaranteed to have high or low confidence.

4.6 Regions of guaranteed high confidence in the feature space

We are interested in identifying the regions of the domain where a model is guaranteed to have high confidence, i.e., a high softmax score. Let us formally define such regions as

\Phi(\geq\tau) \coloneqq \{x_{\phi} \;|\; x_{\phi} \in \Phi, \max softmax(\mathcal{M}(x_{\phi})) \geq \tau\}.

In Theorems 4.4-4.5, we established that maintaining a certain distance from the decision boundaries leads to bounds on the softmax score of any given model. Later, in Theorem 4.9, we established the location of decision boundaries in the feature space and the corresponding volume that they occupy. It follows from these two sets of results that within offsets of a certain distance from the decision boundaries, a model may have low confidence, while in the regions away from those offsets, a model is guaranteed to have high confidence. We can formally identify the fraction of these regions using the following, starting with the case of binary classification.

Theorem 4.11.

Given a model $\mathcal{M}(.)$ with feature space $\Phi = [-a,a]^{n}$ and two output classes, the fraction of the feature space that is classified with a softmax score of at least $\tau$ is lower bounded by

1 - \frac{\sqrt{2}}{a}\delta \leq 1 - \frac{2\delta}{(2a)^{n}} vol(\Delta_{\phi}^{i,j}) \leq \frac{vol(\Phi(\geq\tau))}{(2a)^{n}}, (27)

where the lower bound corresponds to the volume of all the points which maintain a minimum distance of

\delta = \frac{1}{\rho(W)} \log\big(\frac{\tau(1-n)}{\tau-1}\big) (28)

from the decision boundaries in $\Phi$.

Proof.

The model has 2 classes and one decision boundary, $\Delta_{\phi}^{i,j}$, dividing $\Phi$ into two sections. The vertices of the polytope defining $\Delta_{\phi}^{i,j}$ can be identified via Algorithm 1, and its volume can be computed precisely via the right hand side of equation (25). From Corollary 4.4.1, maintaining the offset of $\delta$ guarantees classification with a softmax score of at least $\tau$ for any point in $\Phi$. The volume of the regions that are within distance $\delta$ of the decision boundary is less than or equal to the volume of the decision boundary times $2\delta$. Therefore, the remainder of the volume of $\Phi$ is guaranteed to have softmax score $\geq \tau$.

Moreover, $\Delta_{\phi}^{i,j}$ is the intersection of a hyperplane and a hypercube, and its volume is upper bounded by $\sqrt{2}(2a)^{n-1}$ [2]. Substituting this for $vol(\Delta_{\phi}^{i,j})$ yields the leftmost side of the bound. ∎

Remark 4.12.

We note that the volume of the offset region surrounding $\Delta_{\phi}^{i,j}$ can be computed precisely by shifting the hyperplane between classes $i$ and $j$ by $\pm\delta$ and computing the vertices of its intersection with $\Phi$, resulting in a polytope whose volume is the exact value of $vol(\Delta_{\phi}^{(i,j)\pm\delta})$ as defined in equation (29) below. Nevertheless, computing the volume of $\Delta_{\phi}^{i,j}$ and multiplying it by $2\delta$ may be a reasonably close approximation in the binary classification case.

We now extend this to the multiclass case. First, let us define a notation for the region surrounding any given decision boundary

\Delta_{\phi}^{(i,j)\pm\delta} \coloneqq \{x_{\phi} \;|\; x_{\phi} \in \Phi, \frac{|(W[i]-W[j])x_{\phi}+b[i]-b[j]|}{\|W[i]-W[j]\|_{2}} \leq \delta\}. (29)
Theorem 4.13.

Given a trained model $\mathcal{M}(.)$ and its feature space $\Phi = [-a,a]^{n}$, the fraction of the feature space that is classified with a softmax score of at least $\tau$ is lower bounded by

\mathcal{R}(\Phi(\geq\tau)) \geq 1 - \frac{1}{(2a)^{n}} vol(\Gamma\Delta), (30)

where $\Gamma\Delta$ denotes the union of all the decision boundaries and their offsets

\Gamma\Delta = \bigcup_{i=1}^{n-1}\bigcup_{j=i+1}^{n} \Delta_{\phi}^{(i,j)\pm\delta}, (31)

and $\delta$ is the offset from the decision boundaries defined by equation (28).

Proof.

From Corollary 4.4.1, maintaining the offset of $\delta$ from the decision boundaries guarantees classification with a softmax score of at least $\tau$ for any point in $\Phi$. A given model with $n$ output classes has $\binom{n}{2}$ distinct decision boundaries. For each decision boundary, the region within distance $\delta$ of it is defined by $\Delta_{\phi}^{(i,j)\pm\delta}$, and it is computable via an extension of Algorithm 1. The union of these regions contains all the points in $\Phi$ that are within distance $\delta$ of at least one decision boundary. The complement of the volume of this union gives the lower bound on the desired fraction of the feature space. ∎

4.7 Regions of guaranteed overconfidence: High confidence for the unknowns

In a separate study, we empirically establish that the scope of a model is tightly related to the convex hull of its training set in the feature space. For the purpose of the current study, we take it for granted that the scope of a trained model is tightly related to the convex hull of its training set,

x \in \Phi : \|x - \mathcal{H}^{tr}\|_{2} \leq \delta, (32)

for some $\delta$ that is relatively small compared to the dimensions of $\mathcal{H}^{tr}$ and $\Phi$.

In Section 4.2, we discussed, based on empirical evidence, that the convex hull of the training set may occupy a small fraction of the feature space and that vast areas of the feature space may be unrelated to the scope of a given model. Now, we integrate these notions with the confidence of the models to identify the regions in the feature space that are unrelated to the scope, yet for which the model is guaranteed to have high confidence.

To identify these regions, we incorporate the convex hull of training set into our previous theorems. In Theorems 4.11-4.13, we devised a method to identify the regions of the feature space that are guaranteed to be classified with high confidence. Now, we aim to identify the high confidence regions that maintain a certain distance from the convex hull of training set. Such regions would be the improper confidence of the trained model: overconfidence for the inputs irrelevant to the scope of the model.

We first approximate the convex hull of training set with the smallest hyperrectangle that contains it.

Definition 4.14.

Given the feature space $\Phi = [-a,a]^{n}$ and the convex hull $\mathcal{H}^{tr}_{\phi} \in \Phi$, we define $\Phi^{\mathcal{H}}$ as the smallest hyperrectangle, with the same orientation as $\Phi$, that contains $\mathcal{H}^{tr}$:

\Phi^{\mathcal{H}} \coloneqq [l_{i}, u_{i}]_{i=1}^{f} \;|\; l_{i} = \min_{j}(x^{j}_{\phi}[i], \forall x_{\phi}^{j} \in \mathcal{D}^{tr}_{\phi}), \; u_{i} = \max_{j}(x^{j}_{\phi}[i], \forall x_{\phi}^{j} \in \mathcal{D}^{tr}_{\phi}).

Moreover, $\Phi^{\mathcal{H}+\delta^{h}}$ denotes the hyperrectangle with an outer offset of $\delta^{h}$ from $\Phi^{\mathcal{H}}$:

\Phi^{\mathcal{H}+\delta^{h}} \coloneqq [l_{i}^{\prime}, u_{i}^{\prime}]_{i=1}^{f} \;|\; l_{i}^{\prime} = \max(l_{i}-\delta^{h}, -a), \; u_{i}^{\prime} = \min(u_{i}+\delta^{h}, a).

It follows that any point outside $\Phi^{\mathcal{H}+\delta^{h}}$ has a distance of at least $\delta^{h}$ from $\mathcal{H}^{tr}_{\phi}$. We consider the small offset of $\delta^{h}$ around $\mathcal{H}^{tr}_{\phi}$ to still be relevant to the scope of the trained model. The value of $\delta^{h}$ shall be chosen empirically by verifying how far from $\mathcal{H}^{tr}_{\phi}$ inputs can still be relevant to the scope of the model.
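A small numpy sketch of Definition 4.14, assuming the training features are stacked into an (N, f) array, is given below.

```python
import numpy as np

def bounding_boxes(features_tr, delta_h, a):
    """Per-dimension bounding hyperrectangle of the training features (Phi^H)
    and its delta_h-expanded version clipped to Phi = [-a, a]^f."""
    l, u = features_tr.min(axis=0), features_tr.max(axis=0)     # Phi^H
    l_off = np.maximum(l - delta_h, -a)                         # Phi^{H + delta_h}
    u_off = np.minimum(u + delta_h, a)
    return (l, u), (l_off, u_off)
```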

We are now set to formalize the regions of the domain that maintain a distance of $\delta^{h}$ from $\mathcal{H}^{tr}_{\phi}$ (being unknown to the model), and yet for which the model is guaranteed to have high confidence. In the following, $\mathcal{R}^{cu}$ denotes the ratio of the volume of the domain that has such characteristics.

Theorem 4.15.

Given a model $\mathcal{M}(.)$ with feature space $\Phi = [-a,a]^{n}$ and two output classes, the fraction of the feature space that maintains a minimum distance of $\delta^{h}$ from $\mathcal{H}^{tr}_{\phi}$ and is classified with a softmax score of at least $\tau$ is lower bounded by

\mathcal{R}^{cu} \geq 1 - \frac{1}{(2a)^{n}} \big(vol(\Phi^{\mathcal{H}+\delta^{h}}) + vol(\Gamma\Delta) - vol(\Phi^{\mathcal{H}+\delta^{h}} \cap \Gamma\Delta)\big), (33)

where $\delta$ is defined in equation (28) and $\delta^{h}$ is a small positive number defining the offset from $\mathcal{H}^{tr}_{\phi}$.

Proof.

We know from Theorem 4.4 that maintaining a distance of $\delta$ from the decision boundaries guarantees classification with a softmax score of at least $\tau$ anywhere in $\Phi$. $\Gamma\Delta$, calculated via equation (31), is the union of all regions that are within distance $\delta$ of the decision boundaries in $\Phi$, i.e., regions that may be classified with low confidence. All of these regions are excluded from $\mathcal{R}^{cu}$, as their softmax score is not guaranteed to reach $\tau$. The entire $\Phi^{\mathcal{H}+\delta^{h}}$ is also excluded from $\mathcal{R}^{cu}$, as its points are close to $\mathcal{H}^{tr}_{\phi}$ and considered known to the model. However, these two regions overlap, and the overlap should not be deducted from the domain twice; $\Phi^{\mathcal{H}+\delta^{h}} \cap \Gamma\Delta$ is the portion of $\Gamma\Delta$ that intersects with $\Phi^{\mathcal{H}+\delta^{h}}$, which we add back. The remainder of $\Phi$ is unknown to the model, yet the model is guaranteed to have high confidence on it. ∎

Previously, we showed how each of these components can be computed. Later, we will use these formulations to evaluate the percentage of the domain that is unknown to the models, yet for which the models, unjustifiably, have high confidence. We will see that these regions are considerably large.

5 Unknowns of the trained models in relation to common failure modes

Here, we consider four specific modes of failure, two of which fall under the umbrella of confusion.

5.1 Confusion between classes

By confusion, we refer to cases where a model may be undecided about which class applies to a given input. One such confusion occurs when a point is on or close to a decision boundary. Another, more complicated case arises when we encounter overlapping classes, that is, when we have two or more classes whose training samples overlap each other. The overlap region in particular would be a source of confusion. Such overlaps appear less often in computer vision tasks, but they are quite common in social applications of AI, e.g., the Adult Income dataset [10]. To quantify the confusion for overlapping classes, we consider the distance to the convex hulls of the classes.

5.1.1 Separable classes

The first type of confusion arises when the training samples of each class are separable from the other classes in the feature space. In such cases, closeness to the decision boundary that separates the classes is an indication of the model's confusion. For this, we measure the distance to the closest decision boundary

d^{f,min}_{\psi}(x_{\psi}) = \min_{j \in \{1{:}n\}\backslash\{i\}} d^{f(i,j)}_{\psi}(x_{\psi}). (34)

This measure is inversely related to the ambiguity of an input.

5.1.2 Overlapping classes

In some cases, we encounter datasets where classes have some overlap. Such overlap could be in the original domain and/or in the feature space. To reflect this in our ambiguity measure, we consider the difference between the distances to the two closest convex hulls. In other words, we first sort the vector $d^{\mathcal{H}}_{\psi}(x)$, then subtract the first element of the sorted vector from its second element

d^{c}_{\psi}(x_{\psi}) = d^{\mathcal{H},sorted}_{\psi}(x)[1] - d^{\mathcal{H},sorted}_{\psi}(x)[0]. (35)

This measure is inversely related to the ambiguity. Because the vector $d^{\mathcal{H},sorted}_{\psi}(x)$ is sorted, its first element is not larger than its second element, and hence we have $d^{c}_{\psi}(x) \geq 0$ for any input.

When there is an overlap between two classes of the training set and a sample falls into that overlap, its distance from the convex hull of each class will be zero, and the resulting $d^{c}_{\psi}(x)$ will be zero as well, leading to $\zeta(x) \rightarrow \infty$.

Figure 4: (a) Example of a 2D domain with overlapping classes and (b) the regions where $d^{c}_{\psi}(x_{\psi})$ is equal or close to zero, leading to high ambiguity. In the overlap region, we have $d^{c}_{\psi}(x_{\psi}) = 0$, causing the ambiguity measure to go to infinity. For points in the domain that are equidistant from the convex hulls of these two classes, we again have $d^{c}_{\psi}(x_{\psi}) = 0$. For all other points in the domain, $d^{c}_{\psi}(x_{\psi})$ is strictly positive; it is larger for points that are close to one class and not as close to the other class, implying little or no confusion.

One could consider limiting this measure to the samples that are inside the overlapping regions, i.e., disregard it whenever a sample does not coincide with an overlapping region. However, we find this measure to be insightful even when a sample is outside the convex hull of classes. Outside the convex hull of classes, this measure will go to zero when a sample is equidistant from the two closest classes as shown in Figure 4. This, again, may imply confusion between the two classes. In our ablation studies, we will demonstrate how insightful this measure can be. Moreover, we show that instead of computing the difference between the two distances in equation (35), one can use a ratio of distances.

5.2 Excessive extrapolation

This case occurs when a model processes an input that is excessively far outside the convex hull of the training set. We measure this via the minimum distance to the convex hulls of the classes

d^{\mathcal{H},min}_{\psi}(x_{\psi}) = \min_{j \in \{1{:}n\}} d^{\mathcal{H},j}_{\psi}(x_{\psi}). (36)

This measure is directly related to ambiguity.

5.3 Gaps in the model’s knowledge

Geometrically, this type of failure refers to the possible holes that may exist within the training samples. For example, consider a model that makes clinical decisions for patients. The training set of such model may only consist of patients ages 5-10 and patients aged 80-90 years old. When this model is deployed to make a decision for a patient that is 40 years old, it might be interpolating between its training samples, yet, in the dimension of age, the 40 years old testing sample may be too far from any of the training samples. We characterize this notion of knowledge gap by identifying the ball with the largest radius centered at the sample xψx_{\psi} not containing any of training samples. The radius of such ball is our proxy for the knowledge gap:

d_{\psi}^{g}(x_{\psi}) = \max_{\rho} \rho : \forall y_{\psi} \in \mathcal{D}^{tr}_{\psi} : \|x_{\psi} - y_{\psi}\|_{2} \geq \rho   (37)

This measure is directly related to ambiguity.
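Because the largest such ball is limited by the nearest training sample, equation (37) reduces to a nearest-neighbor distance in the feature space. A minimal sketch, assuming the training features are available as a matrix; the names are illustrative.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knowledge_gap_radius(x_psi, train_features):
    """d^g_psi(x_psi) of equation (37): the radius of the largest ball centered
    at x_psi that contains no training sample, i.e., the Euclidean distance
    from x_psi to its nearest neighbor among the training features."""
    nn = NearestNeighbors(n_neighbors=1).fit(train_features)
    dist, _ = nn.kneighbors(np.asarray(x_psi).reshape(1, -1))
    return float(dist[0, 0])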

5.4 Our ambiguity measure

We are now ready to incorporate all of the above geometric information into a unified ambiguity measure.

\zeta(x) = \Big( \frac{(d^{\mathcal{H},min}_{\psi}(x) + \epsilon)\,(d_{\psi}^{g}(x) + \epsilon)}{d^{f,min}_{\psi}(x)\, d^{c}_{\psi}(x)} \Big)^{\alpha},   (38)

where $\epsilon$ and $\alpha$ are positive scalars used to normalize the ambiguity measure. The purpose of $\epsilon$ is to extend the ambiguity measure inside the convex hull of the training set. When we encounter a sample that falls inside $\mathcal{H}^{tr}$ (while not coinciding with any of the training samples), its corresponding $d^{\mathcal{H},min}_{\psi}$ will be zero, yet the other three measures can still contribute useful information to $\zeta$. Our recommendation is to choose $\epsilon \ll 1$. As for $\alpha$, it can be chosen such that the ambiguity is bounded between 0 and 1 for a given set of samples.

Remark 5.1.

The order of $\zeta(.)$ for a set of distinct inputs is preserved for any choice of $\epsilon > 0$ and $\alpha > 0$.

The measure provided by equation (38) is well defined even at its extremes. All distances are lower bounded by zero, therefore the value of $\zeta(.)$ for any input in the domain is non-negative. When $d^{g}_{\psi} \rightarrow 0$, it follows that $d^{\mathcal{H},min}_{\psi} \rightarrow 0$ as well, and as a result, the numerator of the ambiguity measure goes to $\epsilon^{2}$, indicating that the input is not ambiguous. As the distance to the convex hull increases, the ambiguity increases monotonically. By contrast, when $d^{f,min}_{\psi} \rightarrow 0$ and/or $d^{c}_{\psi} \rightarrow 0$, the ambiguity goes to infinity, i.e., the input is highly ambiguous. In practice, we rarely encounter these extreme cases; these distances are almost always positive values that are neither too small nor too large.
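A minimal sketch assembling equation (38) from the four geometric quantities; the default values of $\epsilon$ and $\alpha$ are illustrative, and a zero denominator is mapped to infinite ambiguity, consistent with the discussion above.

def ambiguity(d_hull_min, d_gap, d_boundary_min, d_class_conf, eps=1e-3, alpha=1.0):
    """Ambiguity measure zeta(x) of equation (38).
    d_hull_min     : d^{H,min}_psi(x), distance to the closest class hull (eq. 36)
    d_gap          : d^g_psi(x), knowledge-gap radius (eq. 37)
    d_boundary_min : d^{f,min}_psi(x), distance to the closest decision boundary
    d_class_conf   : d^c_psi(x), gap between the two closest hulls (eq. 35)"""
    num = (d_hull_min + eps) * (d_gap + eps)
    den = d_boundary_min * d_class_conf
    if den == 0.0:  # on a decision boundary or inside a class overlap: maximally ambiguous
        return float("inf")
    return (num / den) ** alpha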

Models trained with special training methods: An underlying assumption here is that the training set contains samples with hard labels. There are training methods, such as Mixup [34], that train using soft labels. Such methods define the location of decision boundaries in between the samples. This does not contradict our approach because, in such instances, we still have the original training samples with their hard labels; moreover, the interpolated samples with soft labels are not visually meaningful, i.e., they only define the location of decision boundaries.

Models trained with unlabeled data: In self-supervised learning and in contrastive learning, it is possible to train a model on training samples without labels. Such methods are successful because the number of classes is large and the probability of encountering false positive samples is very small, i.e., any random pair of samples picked from an unlabeled training set is unlikely to come from the same class. Such models, trained with contrastive learning, eventually need to be fine-tuned with labeled training samples in order to define the location of decision boundaries. This last stage provides the frame of reference we need to compute our ambiguity measure. In other words, once a model is fine-tuned and the location of its decision boundaries is defined, we can go back to the unlabeled training set and label its samples using the fine-tuned model. Once we obtain the labeled training set, we can proceed with computing our ambiguity measure for any testing sample.

5.5 A simplified version of ambiguity measure: rule of thumb

As a rule of thumb alternative to the measure provided in equation (38), one can consider using the following simplified measure

\bar{\zeta}(x) = \Big( \frac{d^{\mathcal{H},min}_{\psi}(x) + \epsilon}{d^{f,min}_{\psi}(x)} \Big)^{\alpha}.   (39)

We show in our experiments that this rule of thumb works quite effectively across a wide range of tasks concerning model failures, albeit not as well as the full measure in equation (38).

We have considered other rules of thumb as well, e.g., subtracting the two distances $d^{\mathcal{H},min}_{\psi}(x)$ and $d^{f,min}_{\psi}(x)$ instead of dividing them. Overall, we did not see a significant difference in the predictive power of the measure in our empirical experiments. We chose the ratio instead of the difference because it appeared to be more intuitive and slightly better in its predictive power.
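For completeness, a sketch of the rule-of-thumb measure in equation (39), under the same illustrative conventions as the sketch above:

def ambiguity_rule_of_thumb(d_hull_min, d_boundary_min, eps=1e-3, alpha=1.0):
    """Simplified ambiguity measure of equation (39): the distance to the closest
    class hull (plus eps) divided by the distance to the closest decision boundary."""
    if d_boundary_min == 0.0:  # on a decision boundary: maximally ambiguous
        return float("inf")
    return ((d_hull_min + eps) / d_boundary_min) ** alpha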

5.6 Ambiguity explanation

We can generate an automatic text explaining why an input is ambiguous or not. Generally, this text will describe each component of the ambiguity measure in a percentile form in comparison to other samples.

The structure of the text is as follows; a sketch of a generator for this template appears after it:

"Classification of this input is  . This image is ambiguous/unambiguous because its ambiguity measure is higher/lower than   % of all samples. The ambiguity of this input has four components:

  • Its distance to the convex hull of training set is >   percentile

  • Its distance to closest decision boundary >   percentile

  • It is close to both classes   and  

  • The radius of gap around this sample >   percentile
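A possible generator for this template is sketched below; the percentile computation and the dictionary layout of the geometric components are assumptions made for illustration.

import numpy as np

def percentile_of(value, reference):
    """Percentage of reference values that are <= value."""
    return 100.0 * float(np.mean(np.asarray(reference) <= value))

def explain_ambiguity(label, ambiguous, ambiguity_pct, geo, geo_reference, classes):
    """Fill the explanation template of Section 5.6.
    geo           : dict with this input's measures, keys 'hull', 'boundary', 'gap'
    geo_reference : dict of arrays with the same measures over a reference set
    classes       : the two classes this input is closest to (from d^c)"""
    lines = [
        f"Classification of this input is {label}. This image is "
        f"{'ambiguous' if ambiguous else 'unambiguous'} because its ambiguity measure is "
        f"{'higher' if ambiguous else 'lower'} than {ambiguity_pct:.0f}% of all samples. "
        "The ambiguity of this input has four components:",
        f"- Its distance to the convex hull of the training set is at the "
        f"{percentile_of(geo['hull'], geo_reference['hull']):.0f}th percentile",
        f"- Its distance to the closest decision boundary is at the "
        f"{percentile_of(geo['boundary'], geo_reference['boundary']):.0f}th percentile",
        f"- It is close to both class {classes[0]} and class {classes[1]}",
        f"- The radius of the gap around this sample is at the "
        f"{percentile_of(geo['gap'], geo_reference['gap']):.0f}th percentile",
    ]
    return "\n".join(lines)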

5.7 Abstention model

At deployment time, for any given input, the model first computes the ambiguity measure. If the ambiguity measure is larger than the chosen threshold, the model abstains from classifying. Otherwise, it returns the classification of the input along with the ambiguity measure. Algorithm 2 summarizes this procedure, and a Python sketch follows it.

Algorithm 2 Unknown-aware inference
Input: model $\mathcal{M}(.)$, input $x$, ambiguity function $\zeta(.)$, ambiguity threshold $\tau$
Output: classification of $x$
1: Calculate $x_{\psi}$ using equations (1) and (3)
2: Compute $\zeta(x)$ using equation (38)
3: if $\zeta(x) > \tau$ then
4:     return unsure: abstain from classification and provide the ambiguity explanation (Section 5.6)
5: else
6:     return classification $\mathcal{M}(x)$ along with ambiguity measure $\zeta(x)$
7: end if
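A Python rendering of Algorithm 2 could look as follows, where to_feature_space and ambiguity_fn are placeholders for equations (1), (3) and (38) and are assumed to be implemented elsewhere.

def unknown_aware_inference(model, x, to_feature_space, ambiguity_fn, tau):
    """Sketch of Algorithm 2: abstain when the ambiguity exceeds the threshold tau."""
    x_psi = to_feature_space(x)    # map x to the feature space (equations (1), (3))
    zeta = ambiguity_fn(x_psi)     # equation (38)
    if zeta > tau:
        return {"decision": "unsure", "ambiguity": zeta,
                "explanation": "abstaining from classification (see Section 5.6)"}
    return {"decision": model(x), "ambiguity": zeta}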

The threshold $\tau$ can be chosen based on the statistics of the ambiguity measure computed for a set of samples, e.g., from a validation set. For example, one can compute the ambiguity of the inputs in a validation set, and then set the $\tau$ that best separates the misclassifications of that validation set from its correct classifications. Alternatively, for any model that has already been used in practice for a while, some samples of its failure modes will be available at hand. One can compute the ambiguity of those failures and set the $\tau$ that best identifies them. In choosing $\tau$, limiting the rate of false positives may also be a consideration.

If we do not have access to a validation set or any samples of previous failures, we can compute the ambiguity measure for the training samples, and then choose some threshold based on the ambiguity distribution. For example, we may set the threshold as the 99th percentile of the ambiguity of training samples.
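Both strategies for choosing $\tau$ can be sketched as follows; the balanced-accuracy criterion used here to operationalize "best separates" is one choice among several and is an assumption of this sketch.

import numpy as np

def choose_threshold(ambiguities, correct=None, fallback_percentile=99.0):
    """Choose the abstention threshold tau of Algorithm 2.
    ambiguities : zeta(x) for a set of samples (a validation set if available)
    correct     : boolean flags, True where the model classified correctly,
                  or None when no labeled validation data / failure samples exist"""
    ambiguities = np.asarray(ambiguities, dtype=float)
    if correct is None:
        # no failure samples at hand: fall back to a percentile of the ambiguity distribution
        return float(np.percentile(ambiguities, fallback_percentile))
    correct = np.asarray(correct, dtype=bool)

    def balanced_accuracy(tau):
        flagged = ambiguities > tau                        # samples the model would abstain on
        tpr = flagged[~correct].mean() if (~correct).any() else 0.0
        tnr = (~flagged[correct]).mean() if correct.any() else 0.0
        return 0.5 * (tpr + tnr)

    # pick the candidate threshold that best separates mistakes from correct classifications
    return float(max(np.unique(ambiguities), key=balanced_accuracy))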

5.8 Ambiguity model

To build an ambiguity model, we may train a classifier on the geometric information of each point in the feature space. The training objective could be to detect undesirable inputs that the model may be exposed to, or the cases where the model is likely to make a mistake. In other words, such an ambiguity model aims to detect and weed out the inputs that the model is not fit to process.

Algorithm 3 presents a formal procedure to train such a model. This procedure relies on having access to a set of normal in-distribution samples $\mathcal{D}^{p}$, and a set of undesirable samples that the model is likely to fail on, $\mathcal{D}^{n}$. The set $\mathcal{D}^{n}$ could consist of any undesirable samples that we might have available. For example, if $\mathcal{M}$ has been deployed in practice, it might already have made some mistakes, which we can include in $\mathcal{D}^{n}$. Or if we have encountered some OOD or adversarial inputs, we can include them. If we are concerned about specific types of OOD inputs or certain types of adversarial attacks, such samples could be gathered/generated and then included in $\mathcal{D}^{n}$ as well.

As in the case of abstention models, we rely on the geometric information available in the feature space. For any given input, we can compute the geometric information using equations (34)-(37). Assembling these measurements into a unified vector for each input in $\mathcal{D}^{p}$ and $\mathcal{D}^{n}$ leads to two new datasets, $\mathcal{G}^{p}$ and $\mathcal{G}^{n}$, which only reflect the geometric information for each of the samples.

As for $\mathcal{A}(.)$, one can use any appropriate classification model. Our task here is binary classification, and we do not expect $\mathcal{G}_{\psi}^{n}$ to be high-dimensional. Clearly, $\mathcal{G}_{\psi}^{p}$ and $\mathcal{G}_{\psi}^{n}$ can be split into training/validation/testing portions when we train $\mathcal{A}(.)$. We have experimented with models such as XGBoost, Random Forests, and SVMs, most of which perform reasonably well. We train ambiguity models to detect a variety of failure modes such as corner cases, misclassifications, adversarial inputs, and OOD inputs.

Algorithm 3 Creating an ambiguity model to detect undesirable (ambiguous/adversarial/OOD) inputs
Input: model $\mathcal{M}$, set of normal samples $\mathcal{D}^{p}$, set of undesirable samples $\mathcal{D}^{n}$
Output: ambiguity detection model $\mathcal{A}$
1: $\mathcal{D}^{p}_{\psi}, \mathcal{D}^{n}_{\psi} \leftarrow$ map all samples of $\mathcal{D}^{p}$ and $\mathcal{D}^{n}$ to the feature space $\Psi$
2: $\mathcal{G}_{\psi}^{p} \leftarrow$ compute $d^{\mathcal{H},min}_{\psi}(.)$, $d_{\psi}^{g}(.)$, $d^{f,min}_{\psi}(.)$, and $d^{c}_{\psi}(.)$ for $\mathcal{D}^{p}_{\psi}$
3: $\mathcal{G}_{\psi}^{n} \leftarrow$ compute $d^{\mathcal{H},min}_{\psi}(.)$, $d_{\psi}^{g}(.)$, $d^{f,min}_{\psi}(.)$, and $d^{c}_{\psi}(.)$ for $\mathcal{D}^{n}_{\psi}$
4: train a model $\mathcal{A}(.)$ to classify $\mathcal{G}_{\psi}^{p}$ from $\mathcal{G}_{\psi}^{n}$
5: return $\mathcal{A}(.)$

We note that, in the literature, there are models specifically designed for adversarial detection that are trained merely on the inputs in the feature space [21]. That would be equivalent to omitting lines 2-3 of the algorithm above, training the detection model not on $\mathcal{G}_{\psi}^{p}$ and $\mathcal{G}_{\psi}^{n}$, but rather on $\mathcal{D}_{\psi}^{p}$ and $\mathcal{D}_{\psi}^{n}$. We will see in our experimental results that computing the geometric arrangements enables the model to identify a variety of undesirable inputs (not just adversarial ones) better than relying merely on the projections to the feature space.
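A minimal sketch of line 4 of Algorithm 3, using a Random Forest (one of the classifiers mentioned above) on the assembled geometric feature vectors; the array layout is an assumption of this sketch.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_ambiguity_model(geo_normal, geo_undesirable):
    """Train the ambiguity model A(.) of Algorithm 3.
    geo_normal, geo_undesirable : arrays of shape (num_samples, 4) holding
    [d^{H,min}, d^g, d^{f,min}, d^c] for each sample of D^p and D^n,
    i.e., the datasets G^p and G^n produced by lines 2-3 of Algorithm 3."""
    X = np.vstack([geo_normal, geo_undesirable])
    y = np.concatenate([np.zeros(len(geo_normal)), np.ones(len(geo_undesirable))])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y)
    clf = RandomForestClassifier(n_estimators=200).fit(X_tr, y_tr)
    print("held-out detection accuracy:", clf.score(X_te, y_te))
    return clf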

6 Results

In this section, we present our results on a ResNet model pretrained on the CIFAR-10 dataset and a Swin Transformer pretrained on ImageNet. In the appendix, we provide results obtained from other pretrained models as well.

To be added in a future revision.

References

  • [1] Claire Fisher Adler. Modern Geometry. McGraw-Hill, 1967.
  • [2] Keith Ball. Cube slicing in $R^{n}$. Proceedings of the American Mathematical Society, 97(3):465–473, 1986.
  • [3] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 2013.
  • [4] Avrim Blum, Sariel Har-Peled, and Benjamin Raichel. Sparse approximation via generating point sets. ACM Transactions on Algorithms, 15(3):1–16, 2019.
  • [5] Benno Büeler, Andreas Enge, and Komei Fukuda. Exact volume computation for polytopes: A practical study. In Polytopes—Combinatorics and Computation, pages 131–154. Springer, 2000.
  • [6] Xuenan Cao and Roozbeh Yousefzadeh. Extrapolation and AI transparency: Why machine learning models should reveal when they make decisions beyond their training. Big Data & Society, 10(1), 2023.
  • [7] Patrick Chen, Si Si, Sanjiv Kumar, Yang Li, and Cho-Jui Hsieh. Learning to screen for fast softmax inference on large vocabulary neural networks. In International Conference on Learning Representations, 2019.
  • [8] Uri Cohen, SueYeon Chung, Daniel D Lee, and Haim Sompolinsky. Separability and geometry of object manifolds in deep neural networks. Nature Communications, 11(1):1–13, 2020.
  • [9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
  • [10] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017.
  • [11] Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Mądry, Bo Li, and Tom Goldstein. Dataset security for machine learning: Data poisoning, backdoor attacks, and defenses. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(2), 2023.
  • [12] Gene H Golub and Charles F Van Loan. Matrix Computations. JHU Press, Baltimore, 4th edition, 2012.
  • [13] Sariel Har-Peled, Nirman Kumar, David M Mount, and Benjamin Raichel. Space exploration via proximity search. Discrete & Computational Geometry, 56:357–376, 2016.
  • [14] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017.
  • [15] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. In International Conference on Learning Representations, 2017.
  • [16] Paul F Jaeger, Carsten Tim Lüth, Lukas Klein, and Till J. Bungert. A call to reflect on evaluation practices for failure detection in image classification. In International Conference on Learning Representations, 2023.
  • [17] Saachi Jain, Hannah Lawrence, Ankur Moitra, and Aleksander Madry. Distilling model failures as directions in latent space. In International Conference on Learning Representations, 2023.
  • [18] Jim Lawrence. Polytope volume computation. Mathematics of Computation, 57(195):259–271, 1991.
  • [19] László Lovász and Santosh Vempala. Simulated annealing in convex bodies and an $O^{*}(n^{4})$ volume algorithm. Journal of Computer and System Sciences, 72(2):392–417, 2006.
  • [20] Jean-Luc Marichal and Michael J Mossinghoff. Slices, slabs, and sections of the unit hypercube. Online Journal of Analytic Combinatorics, 3:1–11, 2008.
  • [21] Mazda Moayeri and Soheil Feizi. Sample efficient detection and classification of adversarial attacks via self-supervised embeddings. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7677–7686, 2021.
  • [22] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don’t know? In International Conference on Learning Representations, 2019.
  • [23] Jeremy Nixon, Michael W. Dusenberry, Linchuan Zhang, Ghassen Jerfel, and Dustin Tran. Measuring calibration in deep learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019.
  • [24] Jorge Nocedal and Stephen Wright. Numerical Optimization. Springer, New York, 2nd edition, 2006.
  • [25] Patrick E O’Neil. Hyperplane cuts of an n-cube. Discrete Mathematics, 1(2):193–195, 1971.
  • [26] Yaniv Ovadia, Emily Fertig, Jie Ren, Zachary Nado, D. Sculley, Sebastian Nowozin, Joshua Dillon, Balaji Lakshminarayanan, and Jasper Snoek. Can you trust your model’s uncertainty? Evaluating predictive uncertainty under dataset shift. In Advances in Neural Information Processing Systems, 2019.
  • [27] Jie Ren, Peter J Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems, pages 14680–14691, 2019.
  • [28] Kyuhong Shim, Minjae Lee, Iksoo Choi, Yoonho Boo, and Wonyong Sung. SVD-softmax: Fast softmax approximation on large vocabulary neural networks. Advances in Neural Information Processing Systems, 2017.
  • [29] Miklós Simonovits. How to compute the volume in high dimension? Mathematical Programming, 97:337–374, 2003.
  • [30] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
  • [31] Nanyang Ye, Kaican Li, Haoyue Bai, Runpeng Yu, Lanqing Hong, Fengwei Zhou, Zhenguo Li, and Jun Zhu. OoD-Bench: Quantifying and understanding two dimensions of out-of-distribution generalization. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
  • [32] Roozbeh Yousefzadeh. Deep learning generalization and the convex hull of training sets. arXiv preprint arXiv:2101.09849, 2020.
  • [33] Roozbeh Yousefzadeh. Decision boundaries and convex hulls in the feature space that deep learning functions learn from images. arXiv preprint arXiv:2202.04052, 2022.
  • [34] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.