
Is the Volume of a Credal Set a Good Measure for Epistemic Uncertainty?

Yusuf Sale
Institute of Informatics, University of Munich (LMU), Germany
Munich Center for Machine Learning, Germany

Michele Caprio
PRECISE Center, Department of Computer and Information Science, University of Pennsylvania, USA

Eyke Hüllermeier
Institute of Informatics, University of Munich (LMU), Germany
Munich Center for Machine Learning, Germany
Abstract

Adequate uncertainty representation and quantification have become imperative in various scientific disciplines, especially in machine learning and artificial intelligence. As an alternative to representing uncertainty via a single probability measure, we consider credal sets (convex sets of probability measures). The geometric representation of credal sets as $d$-dimensional polytopes suggests a geometric intuition about (epistemic) uncertainty. In this paper, we show that the volume of the geometric representation of a credal set is a meaningful measure of epistemic uncertainty in the case of binary classification, but less so for multi-class classification. Our theoretical findings highlight the importance of specifying and employing uncertainty measures in machine learning appropriately, and of being aware of possible pitfalls.

1 Introduction

The notion of uncertainty has recently drawn increasing attention in machine learning (ML) and artificial intelligence (AI) due to the fields’ burgeoning relevance for practical applications, many of which have safety requirements, such as in medical domains [Lambrou et al., 2010, Senge et al., 2014, Yang et al., 2009] or socio-technical systems [Varshney, 2016, Varshney and Alemzadeh, 2017]. These applications to safety-critical contexts show that a suitable representation and quantification of uncertainty for modern, reliable machine learning systems is imperative.

In general, the literature makes a distinction between aleatoric and epistemic uncertainties (AU and EU, respectively) [Hora, 1996]. While the former is caused by the inherent randomness of the data-generating process, EU results from the learner's lack of knowledge regarding the true underlying model; it also includes approximation uncertainty. Since EU can, in principle, be reduced with further information (e.g., via data augmentation using semantics-preserving transformations), it is also referred to as reducible uncertainty. In contrast, aleatoric uncertainty, being a property of the data-generating process, is irreducible [Hüllermeier and Waegeman, 2021]. The importance of distinguishing between different types of uncertainty is reflected in several areas of recent machine learning research, e.g., in Bayesian deep learning [Depeweg et al., 2018, Kendall and Gal, 2017], in adversarial example detection [Smith and Gal, 2018], and in data augmentation for Bayesian classification [Kapoor et al., 2022]. A qualitative representation of total uncertainty, AU, and EU, and of their asymptotic behavior as the number of data points available to the learning agent increases, is given in Figure 1.

Figure 1: Qualitative behavior of total, aleatoric, and epistemic uncertainties depending on the sample size. The dotted line is the difference between total and epistemic uncertainties. This figure replicates [Hüllermeier, 2022, Figure 3].

Typically, uncertainty in machine learning, artificial intelligence, and related fields is expressed solely in terms of probability theory: given a measurable space $(\Omega,\mathcal{A})$, uncertainty is entirely represented by a single probability measure $P$ on $(\Omega,\mathcal{A})$. However, representing uncertainty in machine learning is not restricted to classical probability theory; various aspects of uncertainty representation and quantification in ML are discussed by Hüllermeier and Waegeman [2021]. Credal sets, i.e., (convex) sets of probability measures, are popular models of uncertainty representation, especially in the field of imprecise probabilities (IP) [Augustin et al., 2014, Walley, 1991]. Credal sets are also very appealing from an ML perspective, as they can represent both aleatoric and epistemic uncertainty (as opposed to a single probability measure). Numerous scholars have emphasized the utility of representing uncertainty in ML via credal sets, e.g., credal classification [Zaffalon, 2002, Corani and Zaffalon, 2008] based on the Imprecise Dirichlet Model (IDM) [Walley, 1996], generalizing Bayesian networks to credal classifiers [Corani et al., 2012], or building credal decision trees [Abellán and Moral, 2003].

Uncertainty representation via credal sets also requires a corresponding quantification of the underlying uncertainty, referred to as credal uncertainty quantification (CUQ). The task of (credal) uncertainty quantification translates to finding a suitable measure that can accurately reflect the uncertainty inherent to a credal set. In many ML applications, such as active learning [Settles, 2009] or classification with abstention, there is a need to quantify (predictive) uncertainty in a scalar way. Appropriate measures of uncertainty are often axiomatically justified [Bronevich and Klir, 2008, 2010].

Contributions. In this work, we consider the volume of the geometric representation of a credal set on the label space as a quite obvious and intuitively plausible measure of EU. We argue that this measure is indeed meaningful if we are in a binary classification setting. However, in a multi-class setting, the volume exhibits shortcomings that make it unsuitable for quantifying EU associated with a credal set.

Structure of the paper. The paper is organized as follows. Section 2 formally introduces the framework we work in, and Section 3 discusses the related literature. Section 4 presents our main findings, which are further discussed in Section 5. Proofs of our theoretical results are given in Appendix A, and (a version of) Carl-Pajor's theorem, intimately related to Theorem 1, is stated in Appendix B.

2 Uncertainty in ML and AI

Uncertainty is a crucial concept in many academic and applied disciplines. However, since its definition depends on the specific context a scholar works in, we now introduce the formal framework of supervised learning within which we will examine it.

Let $(\mathcal{X},\sigma(\mathcal{X}))$ and $(\mathcal{Y},\sigma(\mathcal{Y}))$ be two measurable spaces, where $\sigma(\mathcal{X})$ and $\sigma(\mathcal{Y})$ are suitable $\sigma$-algebras. We refer to $\mathcal{X}$ as the instance space (or, equivalently, input space) and to $\mathcal{Y}$ as the label space. Further, the sequence $\{(x_{i},y_{i})\}_{i=1}^{n}\in(\mathcal{X}\times\mathcal{Y})^{n}$ is called training data. The pairs $(x_{i},y_{i})$ are realizations of random variables $(X_{i},Y_{i})$, which are assumed to be independent and identically distributed (i.i.d.) according to some probability measure $P$ on $(\mathcal{X}\times\mathcal{Y},\sigma(\mathcal{X}\times\mathcal{Y}))$.

Definition 1 (Credal set).

Let $(\Omega,\mathcal{A})$ be a generic measurable space and denote by $\mathcal{M}(\Omega,\mathcal{A})$ the set of all (countably additive) probability measures on $(\Omega,\mathcal{A})$. A convex subset $\mathcal{P}\subseteq\mathcal{M}(\Omega,\mathcal{A})$ is called a credal set.

Note that in Definition 1, the assumption of convexity is quite natural and considered to be rational (see, e.g., Levi [1980]). It is also mathematically appealing since, as shown by Walley [1991, Section 3.3.3], the "lower boundary" $\underline{P}$ of $\mathcal{P}$, defined as $\underline{P}(A)\coloneqq\inf_{P\in\mathcal{P}}P(A)$, for all $A\in\mathcal{A}$, and called the lower probability associated with $\mathcal{P}$, is coherent [Walley, 1991, Section 2.5].
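To make lower and upper probabilities concrete, here is a minimal numerical sketch (the three extreme points and the event are made up for illustration); for a credal set that is the convex hull of finitely many extreme points, $\underline{P}(A)$ and its conjugate upper probability $\overline{P}(A)$ are attained at extreme points, since $P\mapsto P(A)$ is linear:

```python
import numpy as np

# Hypothetical extreme points of a credal set on a 3-element space;
# each row is a probability vector.
ext = np.array([[0.5, 0.3, 0.2],
                [0.2, 0.5, 0.3],
                [0.3, 0.2, 0.5]])

A = [0, 1]                       # event A = first two outcomes
mass_A = ext[:, A].sum(axis=1)   # P(A) at each extreme point

lower, upper = mass_A.min(), mass_A.max()
print(lower, upper)              # lower P(A) = 0.5, upper P(A) = 0.8

# Conjugacy check: upper P(A) = 1 - lower P(A^c).
mass_Ac = ext[:, 2]
assert np.isclose(upper, 1 - mass_Ac.min())
```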

Further, in a supervised learning setting, we assume a hypothesis space $\mathcal{H}$, where each hypothesis $h\in\mathcal{H}$ maps a query instance $\mathbf{x}_{q}$ to a probability measure $P$ on $(\mathcal{Y},\sigma(\mathcal{Y}))$. We distinguish between different "degrees" of uncertainty-aware predictions, which are depicted in Table 1.

Predictor | AU aware? | EU aware?
Hard label prediction: $h:\mathcal{X}\longrightarrow\mathcal{Y}$ | no | no
Probabilistic prediction: $h:\mathcal{X}\longrightarrow\mathcal{M}(\mathcal{Y},\sigma(\mathcal{Y}))$ | yes | no
Credal prediction: $h:\mathcal{X}\longrightarrow\text{Cr}(\mathcal{Y})$ | yes | yes
Table 1: Aleatoric uncertainty (AU) and epistemic uncertainty (EU) awareness of different predictors.

We denote by $\text{Cr}(\mathcal{Y})$ the set of all credal sets on $(\mathcal{Y},\sigma(\mathcal{Y}))$. While probabilistic predictions $h(\mathbf{x}_{q})=\hat{y}$ fail to capture the epistemic part of the (predictive) uncertainty, predictions in the form of credal sets $h(\mathbf{x}_{q})=\mathcal{P}\subseteq\mathcal{M}(\mathcal{Y},\sigma(\mathcal{Y}))$ account for both types of uncertainty. It should also be remarked that representing uncertainty is not restricted to the credal set formalism. Another possible framework for representing AU and EU is that of second-order distributions; these are commonly applied in Bayesian learning and have recently been inspected in the context of uncertainty quantification by Bengs et al. [2022].

In this paper, we restrict our attention to the credal set representation. Given a credal prediction set, it remains to properly quantify the uncertainty encapsulated in it using a suitable measure. Credal set representations are often illustrated in low dimensions (usually $d=2$ or $d=3$). Examples of such geometric illustrations can be found in the context of machine learning in [Hüllermeier and Waegeman, 2021] and in imprecise probability theory in [Walley, 1991, Chapter 4]. This suggests that a credal set and its geometric representation are closely intertwined. We will show in the following sections that this intuitive view can have disastrous consequences in higher dimensions, and that one should exercise caution in this respect. Furthermore, it remains to be discussed whether a geometric viewpoint on (predictive) uncertainty quantification is in fact sensible.

3 Measures of Credal Uncertainty

In this section we examine some axiomatically defined properties of (credal) uncertainty measures. For a more detailed discussion of various (credal) uncertainty measures in machine learning and a critical analysis thereof, we refer to Hüllermeier et al. [2022].

Let $S$ denote the Shannon entropy [Shannon, 1948], whose discrete version is defined as

$$S:\mathcal{M}(\mathcal{Y},\sigma(\mathcal{Y}))\rightarrow\mathbb{R},\quad P\mapsto S(P)\coloneqq-\sum_{y\in\mathcal{Y}}P(\{y\})\log_{2}P(\{y\}).$$
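As a quick illustration (a minimal sketch; the probability vectors are made up), the discrete Shannon entropy of a probability vector can be computed directly:

```python
import numpy as np

def shannon_entropy(p):
    """Discrete Shannon entropy (in bits) of a probability vector p."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                    # convention: 0 * log2(0) = 0
    return -np.sum(p * np.log2(p))

print(shannon_entropy([0.5, 0.5]))   # 1.0: maximal AU for two labels
print(shannon_entropy([0.9, 0.1]))   # ~0.469
print(shannon_entropy([1.0, 0.0]))   # 0.0: a degenerate, certain outcome
```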

A suitable measure of credal uncertainty $U:\text{Cr}(\mathcal{Y})\rightarrow\mathbb{R}$ should satisfy the following axioms proposed by Abellán and Klir [2005], Jiroušek and Shenoy [2018]:

  • A1 (Non-negativity and boundedness):

    • (i) $U(\mathcal{P})\geq 0$, for all $\mathcal{P}\in\text{Cr}(\mathcal{Y})$;

    • (ii) there exists $u\in\mathbb{R}$ such that $U(\mathcal{P})\leq u$, for all $\mathcal{P}\in\text{Cr}(\mathcal{Y})$.

  • A2 (Continuity): $U$ is a continuous functional.

  • A3 (Monotonicity): for all $\mathcal{Q},\mathcal{P}\in\text{Cr}(\mathcal{Y})$ such that $\mathcal{Q}\subset\mathcal{P}$, we have $U(\mathcal{Q})\leq U(\mathcal{P})$.

  • A4 (Probability consistency): for all $\mathcal{P}\in\text{Cr}(\mathcal{Y})$ such that $\mathcal{P}=\{P\}$, we have $U(\mathcal{P})=S(P)$.

  • A5 (Sub-additivity): suppose $\mathcal{Y}=\mathcal{Y}_{1}\times\mathcal{Y}_{2}$, and let $\mathcal{P}$ be a joint credal set on $\mathcal{Y}$ with marginal credal sets $\mathcal{P}^{\prime}$ on $\mathcal{Y}_{1}$ and $\mathcal{P}^{\prime\prime}$ on $\mathcal{Y}_{2}$. Then, we have

    $$U(\mathcal{P})\leq U(\mathcal{P}^{\prime})+U(\mathcal{P}^{\prime\prime}). \tag{1}$$

  • A6 (Additivity): if $\mathcal{P}^{\prime}$ and $\mathcal{P}^{\prime\prime}$ are independent, (1) holds with equality.

In axiom A6, independence refers to a suitable notion of independence for credal sets; see, e.g., Couso et al. [1999]. An axiomatic definition of properties for uncertainty measures is a common approach in the literature [Pal et al., 1992, 1993]. Examples of credal uncertainty measures that satisfy some of the axioms A1–A6 are the maximal entropy [Abellan and Moral, 2003] and the generalized Hartley measure [Abellán and Moral, 2000].

Recall that the lower probability $\underline{P}$ of $\mathcal{P}$ is defined as $\underline{P}(A)\coloneqq\inf_{P\in\mathcal{P}}P(A)$, for all $A\in\sigma(\mathcal{Y})$, and call its conjugate $\overline{P}(A)\coloneqq 1-\underline{P}(A^{c})=\sup_{P\in\mathcal{P}}P(A)$, for all $A\in\sigma(\mathcal{Y})$, the upper probability. Since we are concerned with the fundamental question of whether the volume functional is a suitable measure of epistemic uncertainty, we replace A4 with the following axiom, which better suits our purposes.

  • A4' (Probability consistency): $U(\mathcal{P})$ reduces to $0$ as the distance between $\overline{P}(A)$ and $\underline{P}(A)$ goes to $0$, for all $A\in\sigma(\mathcal{Y})$.

While A4' addresses solely the epistemic component of the uncertainty associated with the credal set $\mathcal{P}$, A4 also incorporates the aleatoric uncertainty. Finally, we introduce a seventh axiom that subsumes a desirable property of $U$ proposed by Hüllermeier et al. [2022, Theorem 1.A3–A5].

  • A7 (Invariance): $U$ is invariant to rotation and translation.

Call $d$ the cardinality of the label space $\mathcal{Y}$. In the next section, we will see that many of these axioms are satisfied by the volume operator in the case $d=2$, but can no longer be guaranteed for $d>2$.

4 Geometry of Epistemic Uncertainty

As pointed out in Section 3, there is no unambiguous measure of (credal) uncertainty for machine learning purposes. In this section, we present a measure for EU rooted in the geometric concept of volume and show how it is well-suited for a binary classification setting, while it loses its appeal when moving to a multi-class setting.

Since we are considering a classification setting, we assume that $\mathcal{Y}$ is a finite Polish space with $|\mathcal{Y}|=d$, for some natural number $d\geq 2$. We also let $\sigma(\mathcal{Y})=2^{\mathcal{Y}}$, so as to work with the finest possible $\sigma$-algebra on $\mathcal{Y}$; the results we provide still hold for any coarser $\sigma$-algebra. (Call $\tau$ the topology on $\mathcal{Y}$. The ideas expressed in this paper can be easily extended to the case where $\mathcal{Y}$ is not Polish; we require it to convey our results without dealing with topological subtleties.) Because $\mathcal{Y}$ is Polish, $\mathcal{M}(\mathcal{Y},\sigma(\mathcal{Y}))$ is Polish as well. In particular, the topology endowed to $\mathcal{M}(\mathcal{Y},\sigma(\mathcal{Y}))$ is the weak topology, which – because we assumed $\mathcal{Y}$ to be finite – coincides with the topology $\tau_{\|\cdot\|_{2}}$ induced by the Euclidean norm. Consider a credal set $\mathcal{P}\subset\mathcal{M}(\mathcal{Y},\sigma(\mathcal{Y}))$, which can be seen as the outcome of a procedure involving an imprecise Bayesian neural network (IBNN) [Caprio et al., 2023a] or an imprecise neural network (INN) [Caprio et al., 2023b]; an ensemble-based approach is proposed by Shaker and Hüllermeier [2020].

Since $\mathcal{Y}=\{y_{1},\ldots,y_{d}\}$, each element $P\in\mathcal{P}$ can be seen as a $d$-dimensional probability vector $P=(p_{1},\ldots,p_{d})^{\top}$, where $p_{j}=P(\{y_{j}\})\geq 0$, for all $j\in\{1,\ldots,d\}$, and $\sum_{j=1}^{d}p_{j}=1$. This entails that, if we denote by $\Delta^{d-1}$ the unit simplex in $\mathbb{R}^{d}$, we have $\mathcal{P}\subset\Delta^{d-1}$; that is, $\mathcal{P}$ is a convex body inscribed in $\Delta^{d-1}$. (In the remainder of the paper, we denote by $\mathcal{P}$ both the credal set and its geometric representation, as no confusion arises.)

Intuitively, the "larger" $\mathcal{P}$ is, the higher the credal uncertainty. A natural way of capturing the size of $\mathcal{P}$, then, is its volume $\text{Vol}(\mathcal{P})$. Notice that $\text{Vol}(\mathcal{P})$ is a bounded quantity: its value is bounded from below by $0$ and from above by $\sqrt{d}/[(d-1)!]$, the volume of the whole unit simplex $\Delta^{d-1}$. The latter corresponds to the case where $\mathcal{P}=\Delta^{d-1}$, that is, to the case of completely vacuous beliefs: the agent is only able to say that the probability of $A$ is in $[0,1]$, for all $A\in\sigma(\mathcal{Y})$. In this sense, the volume is a measure of the size of $\mathcal{P}$ that increases the more uncertain the agent is about the elements of $\sigma(\mathcal{Y})$. This argument shows that $\text{Vol}(\mathcal{P})$ is well suited to capture credal uncertainty. But why is it appropriate to describe EU? (The concept of volume has been explored in the imprecise probabilities literature, see e.g. Bloch [1996], [Cuzzolin, 2021, Chapter 17], and Seidenfeld et al. [2012], but, to the best of our knowledge, it has never been tied to the notion of epistemic uncertainty. More generally, the geometry of imprecise probabilities has been studied, e.g., by Anel [2021] and Cuzzolin [2021].) Think of the extreme case where there is no EU, so that the agent faces AU only. In that case, they would be able to specify a unique probability measure $P\in\mathcal{M}(\mathcal{Y},\sigma(\mathcal{Y}))$ (equivalently, $P\in\Delta^{d-1}$), and $\text{Vol}(\{P\})=0$. Hence, if $\text{Vol}(\mathcal{P})>0$, the agent faces EU. In addition, let $(\mathcal{P}_{n})_{n\in\mathbb{N}}$ be a sequence of credal sets on $(\mathcal{Y},\sigma(\mathcal{Y}))$ representing successive refinements of $\mathcal{P}$, computed as new data becomes available to the agent (clearly $|\mathcal{P}_{n}|=|\mathcal{P}|$, for all $n$). Suppose that, after observing enough evidence, the EU is resolved, that is, $\lim_{n\rightarrow\infty}[\overline{P}_{n}(A)-\underline{P}_{n}(A)]=0$ for all $A\in\sigma(\mathcal{Y})$. Then, the sequence $(\mathcal{P}_{n})_{n\in\mathbb{N}}$ converges – say in the Hausdorff metric – as $n\rightarrow\infty$ to $\mathcal{P}^{\star}\subset\mathcal{M}(\mathcal{Y},\sigma(\mathcal{Y}))$ such that $|\mathcal{P}^{\star}|=|\mathcal{P}_{n}|$, for all $n$, and all the elements of $\mathcal{P}^{\star}$ are equal to $P^{\star}$, the (unique) probability measure that encapsulates the AU. (Technically, $\mathcal{P}^{\star}$ is a multiset, that is, a set in which multiple instances of each element are allowed.) Through the learning process, we refine our estimates of the "true" underlying aleatoric uncertainty (pertaining to $P^{\star}$), which is what is left after all the EU is resolved. The geometric representation of $\mathcal{P}^{\star}$ is then a point, whose volume is $0$. Hence, the volume of $\mathcal{P}_{n}$ converges from above to $0$ (that is, it possesses the continuity property), which is exactly the behavior we would expect as EU resolves.
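To make $\text{Vol}(\mathcal{P})$ operational, here is a minimal sketch (assuming NumPy and SciPy; the routine and its extreme-point input are ours, not part of the paper's formal development) that computes the $(d-1)$-dimensional volume of a credal set by mapping the simplex's affine hull isometrically to $\mathbb{R}^{d-1}$ and taking the convex hull there:

```python
import math
import numpy as np
from scipy.spatial import ConvexHull

def credal_volume(extreme_points):
    """(d-1)-dimensional volume of the convex hull of probability
    vectors of length d lying on the unit simplex."""
    V = np.asarray(extreme_points, dtype=float)
    d = V.shape[1]
    E = np.eye(d)
    # Orthonormal basis of the simplex's affine hull (span of e_j - e_1).
    Q, _ = np.linalg.qr((E[1:] - E[0]).T)   # d x (d-1) matrix
    coords = (V - E[0]) @ Q                 # isometric (d-1)-dim coordinates
    return ConvexHull(coords).volume        # "volume" = area when d = 3

# Sanity check: the whole simplex has volume sqrt(d)/(d-1)!.
d = 3
print(credal_volume(np.eye(d)))              # ~0.8660
print(math.sqrt(d) / math.factorial(d - 1))  # sqrt(3)/2 ~ 0.8660
```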

As we shall see, while this intuitive explanation holds if d=2d=2, for d>2d>2, continuity fails, thus making the volume not suited to capture EU in a multi-class classification setting. We also show in Theorem 1 that the volume lacks robustness in higher dimensions. Small perturbations to the boundary of a credal set make its volume vary significantly. This may seriously hamper the results of a study, leading to potentially catastrophic consequences in downstream tasks.

4.1 $\text{Vol}(\mathcal{P})$: a good measure for EU, but only if $d=2$

Let $d=2$, so that $\mathcal{P}$ is a subset of $\Delta^{1}$, the segment linking the points $(1,0)$ and $(0,1)$ in the $2$-dimensional Cartesian plane. Notice that in this case, the volume $\text{Vol}(\mathcal{P})$ corresponds to the length of the segment. In this context, $\text{Vol}(\mathcal{P})$ is an appealing measure of the EU associated with the credal set $\mathcal{P}$.

Proposition 1.

$\text{Vol}(\cdot)$ satisfies axioms A1–A3, A4', A5, and A7 of Section 3.

Let us now discuss additivity (axiom A6 of Section 3). Suppose the label space $\mathcal{Y}=\{(y_{1},y_{2}),(y_{3},y_{4})\}$ can be written as $\mathcal{Y}_{1}\times\mathcal{Y}_{2}$, where $\mathcal{Y}_{1}=\{y_{1},y_{3}\}$ and $\mathcal{Y}_{2}=\{y_{2},y_{4}\}$. Let $\mathcal{P}$ be a joint credal set on $\mathcal{Y}$ such that $\mathcal{P}^{\prime}$ is the marginal credal set on $\mathcal{Y}_{1}$ and $\mathcal{P}^{\prime\prime}$ is the marginal credal set on $\mathcal{Y}_{2}$. In the proof of Proposition 1, we show that if $y_{1}\neq y_{3}$ and $y_{2}\neq y_{4}$ (which implies $|\mathcal{Y}|=|\mathcal{Y}_{1}|=|\mathcal{Y}_{2}|=2$), then the volume is sub-additive. Suppose instead that $y_{1}=y_{3}=y_{\star}$, so that $|\mathcal{Y}|=|\mathcal{Y}_{2}|=2$ and $|\mathcal{Y}_{1}|=1$. (A similar argument holds if we assume $y_{2}=y_{4}=y^{\star}$, so that $|\mathcal{Y}|=|\mathcal{Y}_{1}|=2$ and $|\mathcal{Y}_{2}|=1$.) Then, the marginal $\text{marg}_{\mathcal{Y}_{1}}(P)=P^{\prime}$ of any $P\in\mathcal{P}$ on $\mathcal{Y}_{1}$ gives probability $1$ to $y_{\star}$; in formulas, $P^{\prime}(y_{\star})=1$. This entails that $\mathcal{P}^{\prime}=\{P^{\prime}\}$ is a singleton and that its geometric representation is a point. (Alternatively, $\mathcal{P}^{\prime}$ is a multiset whose elements are all equal.) Then, for all $P\in\mathcal{P}$, $P((y_{1},y_{2}))=P^{\prime\prime}(y_{2})$ and $P((y_{3},y_{4}))=P^{\prime\prime}(y_{4})$, where $\text{marg}_{\mathcal{Y}_{2}}(P)=P^{\prime\prime}$ is the marginal of $P$ on $\mathcal{Y}_{2}$.

In turn, this line of reasoning implies that $\text{Vol}(\mathcal{P}^{\prime})+\text{Vol}(\mathcal{P}^{\prime\prime})=0+\text{Vol}(\mathcal{P})=\text{Vol}(\mathcal{P})$, which shows that the volume is additive in this case.

This situation corresponds to an instance of strong independence (SI) [Couso et al., 1999, Section 3.5]. We have SI if and only if

$$\mathcal{P}=\text{Conv}\left(\left\{P\in\mathcal{M}(\mathcal{Y},\sigma(\mathcal{Y})):\text{marg}_{\mathcal{Y}_{1}}(P)\in\mathcal{P}^{\prime}\text{ and }\text{marg}_{\mathcal{Y}_{2}}(P)\in\mathcal{P}^{\prime\prime}\right\}\right). \tag{2}$$

In other words, there is a complete lack of interaction between the probability measures on $\mathcal{Y}_{1}$ and those on $\mathcal{Y}_{2}$. To see that this is the case, recall that $\mathcal{P}$ is a credal set, and hence convex; recall also that $\mathcal{P}^{\prime}=\{P^{\prime}\}$ is a singleton. Then, pick any $P^{ex}\in\text{ex}(\mathcal{P})$, where $\text{ex}(\mathcal{P})$ denotes the set of extreme elements of $\mathcal{P}$. We have that $\text{marg}_{\mathcal{Y}_{1}}(P^{ex})=P^{\prime}$, and so $\text{marg}_{\mathcal{Y}_{2}}(P^{ex})\in\text{ex}(\mathcal{P}^{\prime\prime})$. With a slight abuse of notation, we can write $\text{ex}(\mathcal{P})=\{P^{\prime}\}\times\text{ex}(\mathcal{P}^{\prime\prime})$. This immediately implies that the equality in (2) holds. As pointed out in [Couso et al., 1999, Section 3.5], SI implies independence of the marginal sets, epistemic independence of the marginal experiments, and independence in the selection [Couso et al., 1999, Sections 3.1, 3.4, and 3.5, respectively]. It is, therefore, a rather strong notion of independence.

The volume is also trivially additive if $(y_{1},y_{2})=(y_{3},y_{4})$, but in that case $\mathcal{Y}$ would be a multiset.

The argument put forward so far can be summarized in the following proposition.

Proposition 2.

Let $\mathcal{Y}=\{(y_{1},y_{2}),(y_{3},y_{4})\}$. Then $\text{Vol}(\cdot)$ satisfies axiom A6 if we assume the instance of SI given by any of the following:

  • $y_{1}=y_{3}$;

  • $y_{2}=y_{4}$;

  • $y_{1}=y_{3}$ and $y_{2}=y_{4}$.

If $d>2$, the volume ceases to be an appealing measure of EU, because quantifying the uncertainty associated with a credal set becomes challenging due to the dependence of the volume on the dimension. So far, we have written $\text{Vol}$ in place of $\text{Vol}_{d-1}$ to ease notation, but for $d>2$ the dimension with respect to which the volume is taken becomes crucial. Let us give a simple example to illustrate this.

Example 1.

Let $d=3$, so that the unit simplex is $\Delta^{3-1}=\Delta^{2}$, the triangle whose extreme points are $(1,0,0)$, $(0,1,0)$, and $(0,0,1)$ in the $3$-dimensional Cartesian space (the purple triangle in Figure 2). Consider a sequence $(\mathcal{P}_{n})$ of credal sets whose geometric representations are triangles, and suppose their height shrinks to $0$ as $n\rightarrow\infty$, so that (the geometric representation of) $\mathcal{P}_{\infty}$ – the limit of $(\mathcal{P}_{n})$ in the Hausdorff metric – is a segment. The limiting set $\mathcal{P}_{\infty}$, then, is not of full dimensionality; that is, its geometric representation is a proper subset of $\Delta^{1}$, while the geometric representation of $\mathcal{P}_{n}$ is a proper subset of $\Delta^{2}$, for all $n$. This implies that $\text{Vol}_{2}(\mathcal{P}_{\infty})=0$, but – unless $\mathcal{P}_{\infty}$ is a degenerate segment, i.e., a point – $\text{Vol}_{1}(\mathcal{P}_{\infty})>0$. The EU has not resolved, yet $\mathcal{P}_{\infty}$ has zero $2$-dimensional volume; this is clearly undesirable. It is easy to see how this problem is exacerbated in higher dimensions.
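Example 1 can be checked numerically (a small sketch; the vertices are made up): the triangles below live in $\Delta^{2}$, their apex approaches the base as $t\to 0$, and the $2$-dimensional volume vanishes while the $1$-dimensional volume of the limiting segment does not:

```python
import numpy as np

# Base of the triangles, on the face of the simplex where p3 = 0.
v1 = np.array([0.3, 0.7, 0.0])
v2 = np.array([0.7, 0.3, 0.0])
mid, bary = (v1 + v2) / 2, np.full(3, 1 / 3)

for t in [1.0, 0.1, 0.01, 0.0]:       # t = 0 gives the limiting segment
    apex = (1 - t) * mid + t * bary   # convex combination: stays in the simplex
    vol2 = np.linalg.norm(np.cross(v2 - v1, apex - v1)) / 2   # triangle area
    vol1 = np.linalg.norm(v2 - v1)                            # base length
    print(f"t={t}: Vol_2={vol2:.4f}, Vol_1 of base={vol1:.4f}")
```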

There are two possible ways one could try to circumvent the issue in Example 1; alas, both exhibit shortcomings, that is, at least one of the axioms A1–A3, A4', A5–A7 of Section 3 is not satisfied. The first is to interpret the volume operator $\text{Vol}(\mathcal{P})$ as the volume taken with respect to the space in which $\mathcal{P}$ is of full dimensionality. In this case, we immediately see how A2 fails: considering again the sequence in Example 1, we would have a sequence $(\mathcal{P}_{n})$ whose volume $\text{Vol}_{2}(\mathcal{P}_{n})$ goes to zero, while in the limit the volume $\text{Vol}_{1}(\mathcal{P}_{\infty})$ is positive. Axiom A3 fails as well: consider a credal set $\mathcal{P}$ whose representation is a triangle with base $b$ and height $h$, and suppose $h<2$. Consider then a credal set $\mathcal{Q}\subsetneq\mathcal{P}$ whose representation is a segment of length $\ell=b$. Then $\text{Vol}_{2}(\mathcal{P})=b\cdot h/2<b$, while $\text{Vol}_{1}(\mathcal{Q})=\ell=b$.

The second is to consider lift probability sets; let us discuss this idea in depth. Let $d,d^{\prime}\in\mathbb{N}$ with $d^{\prime}<d$. Call

$$O(d^{\prime},d)\coloneqq\{V\in\mathbb{R}^{d^{\prime}\times d}:VV^{\top}=I_{d^{\prime}}\},$$

where $I_{d^{\prime}}$ is the $d^{\prime}$-dimensional identity matrix. That is, $O(d^{\prime},d)$ is the Stiefel manifold of $d^{\prime}\times d$ matrices with orthonormal rows [Cai and Lim, 2022]. Then, for any $V\in O(d^{\prime},d)$ and any $b\in\mathbb{R}^{d^{\prime}}$, define

$$\varphi_{V,b}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d^{\prime}},\quad x\mapsto\varphi_{V,b}(x)\coloneqq Vx+b.$$

Suppose now that, for some $n$, (the geometric representation of) $\mathcal{P}_{n}$ is a proper subset of $\Delta^{d-1}$, while (the geometric representation of) $\mathcal{P}_{n+1}$ is a proper subset of $\Delta^{d^{\prime}-1}$. Pick any $V\in O(d^{\prime},d)$ and any $b\in\mathbb{R}^{d^{\prime}}$; an embedding of $\mathcal{P}_{n+1}$ in $\Delta^{d-1}$ is a set $K$ such that for all $x\in K$, there exists a probability vector $p\in\mathcal{P}_{n+1}$ such that $\varphi_{V,b}(x)=p$. Call $\Phi^{+}(\mathcal{P}_{n+1},d)$ the set of embeddings of $\mathcal{P}_{n+1}$ in $\Delta^{d-1}$, and assume that it is nonempty.

Then, define

$$\breve{\mathcal{P}}_{n+1}\coloneqq\operatorname*{arg\,min}_{K\in\Phi^{+}(\mathcal{P}_{n+1},d)}\left|\text{Vol}_{d-1}(K)-\text{Vol}_{d^{\prime}-1}(\mathcal{P}_{n+1})\right|;$$

we call it the lift probability set, for its heuristic similarity with lift zonoids [Mosler, 2002]. We define it in this way because we want the $d$-dimensional set whose (full-dimensionality) volume is the closest possible to the ($d^{\prime}$-dimensional) volume of $\mathcal{P}_{n+1}$. A simple example is the following. Suppose the geometric representation of $\mathcal{P}_{n}$ is a proper subset of $\Delta^{2}$, and that the geometric representation of $\mathcal{P}_{n+1}$ is a proper subset of $\Delta^{1}$; so the former is a subset of $\mathbb{R}^{2}$, and the latter is a segment in $\mathbb{R}$. Then, a possible $\breve{\mathcal{P}}_{n+1}$ is any triangle in $\Delta^{2}$ whose height $h$ is $2$ and whose base length $b$ equals the length $\ell$ of the segment representing $\mathcal{P}_{n+1}$. This is because the area of such a $\breve{\mathcal{P}}_{n+1}$ is $b\cdot h/2$; if $h=2$ and $b=\ell$, then $\text{Vol}_{2}(\breve{\mathcal{P}}_{n+1})=\text{Vol}_{1}(\mathcal{P}_{n+1})$, which is what we wanted. A visual representation is given in Figure 2.

Figure 2: A visual representation of a lift probability set.

Notice that $\breve{\mathcal{P}}_{n+1}$ is well defined because $\Phi^{+}(\mathcal{P}_{n+1},d)\subset 2^{\Delta^{d-1}}$, and $\Delta^{d-1}$ is compact. (If the $\operatorname*{arg\,min}$ is not a singleton, pick any of its elements.) We can then compare $\text{Vol}_{d-1}(\mathcal{P}_{n})$ and $\text{Vol}_{d-1}(\breve{\mathcal{P}}_{n+1})$, and also compute the relative quantity

$$\frac{\left|\text{Vol}_{d-1}(\mathcal{P}_{n})-\text{Vol}_{d-1}(\breve{\mathcal{P}}_{n+1})\right|}{\text{Vol}_{d-1}(\mathcal{P}_{n})}$$

that captures the variation in volume between $\mathcal{P}_{n}$ and $\breve{\mathcal{P}}_{n+1}$. Alas, in this case too, it is easy to see how A2 fails. Consider the same sequence as in Example 1: we would have that $\text{Vol}_{2}(\mathcal{P}_{n})$ goes to zero as $n\rightarrow\infty$, but $\text{Vol}_{2}(\breve{\mathcal{P}}_{\infty})>0$. Axiom A3 may fail as well, since we could find credal sets $\mathcal{P}\subset\Delta^{d-1}$ and $\mathcal{Q}\subset\Delta^{d^{\prime}-1}$ such that $\mathcal{Q}\subsetneq\mathcal{P}$, but $\breve{\mathcal{Q}}\not\subset\mathcal{P}$.

4.2 Lack of robustness in higher dimensions

In this section, we show how, if we measure the EU associated with a credal set on the label space using the volume, as the number of labels grows, “small” changes of the uncertainty representation may lead to catastrophic consequences in downstream tasks.

For a generic compact set $K\subset\mathbb{R}^{d}$ and a positive real $r$, the $r$-packing of $K$, denoted by $\text{Pack}_{r}(K)$, is the collection of sets $K^{\prime}$ that satisfy the following properties:

  • (i) $K^{\prime}\subset K$;

  • (ii) $\cup_{x\in K^{\prime}}B_{r}^{d}(x)\subset K$, where $B_{r}^{d}(x)$ denotes the ball of radius $r$ in $\mathbb{R}^{d}$ centered at $x$;

  • (iii) the elements of $\{B_{r}^{d}(x)\}_{x\in K^{\prime}}$ are pairwise disjoint;

  • (iv) there does not exist $x^{\prime}\in K$ such that (i)–(iii) are satisfied by $K^{\prime}\cup\{x^{\prime}\}$.

The packing number of $K$, denoted by $N^{\text{pack}}_{r}(K)$, is given by $\max_{K^{\prime}\in\text{Pack}_{r}(K)}|K^{\prime}|$. We also let $K^{\star}_{r}\coloneqq\operatorname*{arg\,max}_{K^{\prime}\in\text{Pack}_{r}(K)}|K^{\prime}|$ and $\tilde{K}_{r}\coloneqq\cup_{x\in K^{\star}_{r}}B^{d}_{r}(x)$. Notice that

$$\text{Vol}(\tilde{K}_{r})=c(r,d,K)\,\text{Vol}(K), \tag{3}$$

where

$$c(r,d,K)\in(0,1]\text{, for all }r>0,\quad\text{and}\quad c(r,d,K)\leq c(r-\epsilon,d,\check{K})\text{, for all }\epsilon>0, \tag{4}$$

where $\check{K}$ is any compact set in $\mathbb{R}^{d}$, possibly different from $K$. That is, we can always find a real number $c(r,d,K)$, depending on the dimension $d$ of the Euclidean space, on the radius $r$ of the balls, and on the set $K$ of interest, that relates the volume of $K$ to that of $\tilde{K}_{r}$. Being in $(0,1]$, it takes into account the fact that, since $\tilde{K}_{r}$ is a union of pairwise disjoint balls within $K$, its volume cannot exceed that of $K$; this is easy to see in Figure 3. The second condition in (4) states that, irrespective of the compact set of interest, we retain more of the volume of the original set if we pack it using balls of a smaller radius.

To give a simple illustration, consider $r_{1},r_{2}>0$ such that $r_{1}\leq r_{2}$. Then, by (3) and (4), we have that $\text{Vol}(K)-\text{Vol}(\tilde{K}_{r_{2}})=\text{Vol}(K)[1-c(r_{2},d,K)]\geq\text{Vol}(K)[1-c(r_{1},d,K)]=\text{Vol}(K)-\text{Vol}(\tilde{K}_{r_{1}})$. This means that the difference in volume between $K$ and $\tilde{K}_{r_{2}}$ is larger than that between $K$ and $\tilde{K}_{r_{1}}$.

Let $\mathcal{K}(\mathbb{R}^{d})$ be the class of compact sets in $\mathbb{R}^{d}$, and call $c(r,d)\coloneqq\max_{K\in\mathcal{K}(\mathbb{R}^{d})}c(r,d,K)$. As $r$ goes to $0$, $c(r,d)$ increases to its optimal value, which we denote by $c^{\star}(d)$. The values of $c^{\star}(d)$ have only been found for $d\in\{1,2,3,8,24\}$ [Cohn et al., 2017, Viazovska, 2017]. The fact that $c(r,d)$ increases as $r$ decreases to $0$ captures the idea that using balls of smaller radius leads to a better approximation of the volume of the compact set $K\subset\mathbb{R}^{d}$ being packed.
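For intuition about the constant $c(r,d,K)$ in (3), here is a crude sketch (the choice $K=[0,1]^{d}$ and the simple cubic-grid packing are our own illustrative assumptions; a grid packing only lower-bounds the optimal packing number): it reports the fraction of $\text{Vol}(K)$ covered by disjoint balls of radius $r$:

```python
from math import pi, gamma, floor

def grid_packing_fraction(r, d):
    """Fraction of the unit cube [0,1]^d covered by pairwise disjoint
    balls of radius r centered on a cubic grid of spacing 2r; a lower
    bound on c(r, d, K) for K = [0,1]^d."""
    if 2 * r > 1:
        return 0.0
    per_axis = floor((1 - 2 * r) / (2 * r)) + 1        # ball centers per axis
    ball_vol = pi ** (d / 2) * r ** d / gamma(d / 2 + 1)
    return per_axis ** d * ball_vol

for d in [2, 3, 5]:
    print(d, [round(grid_packing_fraction(r, d), 3) for r in (0.3, 0.18, 0.07)])
```

Consistently with (4), the covered fraction grows as $r$ shrinks; it also drops quickly as $d$ grows, which anticipates the dimension dependence exploited in Theorem 1.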

Figure 3: A representation of K~r\tilde{K}_{r}, for some r>0r>0, where KK is a parallelepiped in 3\mathbb{R}^{3}. This figure replicates [Hifi and Yousef, 2019, Figure 4].

Suppose our credal set $\mathcal{P}$ is compact, so that we can use the concepts of $r$-packing and packing number. Consider then a set $\mathcal{Q}\subset\mathcal{M}(\Omega,\mathcal{F})$ that satisfies the following three properties:

  • (a) $\mathcal{Q}\subsetneq\mathcal{P}$, so that $\mathcal{Q}^{\prime}\coloneqq\mathcal{P}\setminus\mathcal{Q}\neq\emptyset$;

  • (b) $d_{H}(\mathcal{P},\mathcal{Q})=\epsilon$, for some $\epsilon>0$;

  • (c) $\epsilon$ is such that we can find $r>0$ for which $N^{\text{pack}}_{r}(\mathcal{P})\geq N^{\text{pack}}_{r-\epsilon}(\mathcal{Q}^{\prime})$.

Property (a) tells us that $\mathcal{Q}$ is a proper subset of $\mathcal{P}$. Let $d_{2}$ denote the metric induced by the Euclidean norm $\|\cdot\|_{2}$. Property (b) tells us that the Hausdorff distance

$$d_{H}(\mathcal{P},\mathcal{Q})=\max\left\{\max_{P\in\mathcal{P}}d_{2}(P,\mathcal{Q}),\,\max_{Q\in\mathcal{Q}}d_{2}(\mathcal{P},Q)\right\} \tag{5}$$

between $\mathcal{P}$ and $\mathcal{Q}$ equals some $\epsilon>0$. Property (c) ensures that $\epsilon$ is "not too large". To understand why, notice that if $\epsilon$ is "large", that is, if it is close to $r$, then the packing number of $\mathcal{Q}^{\prime}\subsetneq\mathcal{P}$ using balls of radius $r-\epsilon$ can be larger than the packing number of $\mathcal{P}$ using balls of radius $r$. (Because $\mathcal{Q}\subsetneq\mathcal{P}$ and $d_{H}(\mathcal{P},\mathcal{Q})=\epsilon$, packing using balls of radius $r-\epsilon$ is a sensible choice.) Requiring (c) ensures that this does not happen, and therefore that $\epsilon$ is "small". A representation of $\mathcal{P}$ and $\mathcal{Q}$ satisfying (a)–(c) is given in Figure 4. A (possibly very small) change in uncertainty representation is captured by a situation in which the agent specifies credal set $\mathcal{Q}$ in place of $\mathcal{P}$; property (b) for such a pair of sets can be checked numerically, as sketched below.
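Here is a minimal sketch (assuming SciPy; the two credal sets are made-up "disks" on the simplex, concentric around the barycenter, for which the true Hausdorff distance is the difference of the radii):

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

rng = np.random.default_rng(0)

def simplex_disk(radius, n=4000):
    """Dense sample of a toy credal set: uniform points of the simplex
    in R^3 within Euclidean distance `radius` of the barycenter."""
    pts = rng.dirichlet(np.ones(3), size=n)
    return pts[np.linalg.norm(pts - 1 / 3, axis=1) <= radius]

P, Q = simplex_disk(0.25), simplex_disk(0.20)   # Q sits inside P
# Symmetric Hausdorff distance between the two sampled sets.
eps = max(directed_hausdorff(P, Q)[0], directed_hausdorff(Q, P)[0])
print(eps)   # ~0.05 (the difference of the radii), up to sampling error
```

We are ready to state the main result of this section.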

Figure 4: A representation of $\mathcal{P}$ (the orange pentagon) and $\mathcal{Q}$ (the green pentagon) satisfying (a)–(c) when the dimension of the state space $\Omega$ is $d=3$. The unit simplex $\Delta^{2}$ in $\mathbb{R}^{3}$ is given by the purple triangle whose vertices are the elements of the standard basis of $\mathbb{R}^{3}$, i.e., $e_{1}=(1,0,0)$, $e_{2}=(0,1,0)$, and $e_{3}=(0,0,1)$.
Theorem 1.

Let $\Omega$ be a finite Polish space with $|\Omega|=d$, and let $\mathcal{F}=2^{\Omega}$. Pick any compact set $\mathcal{P}\subset\mathcal{M}(\Omega,\mathcal{F})$ and any set $\mathcal{Q}$ that satisfies (a)–(c). The following holds:

$$\frac{\text{Vol}(\mathcal{P})-\text{Vol}(\mathcal{Q}^{\prime})}{\text{Vol}(\mathcal{P})}\geq 1-\left(1-\frac{\epsilon}{r}\right)^{d}. \tag{6}$$

Notice that we implicitly assumed that at least one $\mathcal{Q}$ satisfying (a)–(c) exists. We have that $[\text{Vol}(\mathcal{P})-\text{Vol}(\mathcal{Q}^{\prime})]/\text{Vol}(\mathcal{P})\in[0,1]$; in light of this, since $1-(1-\epsilon/r)^{d}\rightarrow 1$ as $d\rightarrow\infty$, Theorem 1 states that as $d$ grows, most of the volume of $\mathcal{P}$ concentrates near its boundary.
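To get a feel for the bound in (6), here is a one-line numeric sketch (the ratio $\epsilon/r=0.01$ is an arbitrary choice standing in for a "small" perturbation):

```python
# Lower bound 1 - (1 - eps/r)^d from Theorem 1, for a fixed small
# relative perturbation eps/r = 0.01 and a growing number of labels d.
eps_over_r = 0.01
for d in [2, 10, 100, 1000]:
    print(d, round(1 - (1 - eps_over_r) ** d, 5))
```

For $d=2$ the guaranteed relative volume loss is negligible (about $0.02$), while for $d=1000$ it is essentially $1$: nearly all of the volume of $\mathcal{P}$ can be lost to a perturbation of relative size $0.01$.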

As a result, if we use the volume operator as a measure of EU, the latter is very sensitive to perturbations of the boundary of the (geometric representation of the) credal set; this is problematic for credal sets in the context of ML. Suppose we are in a multi-class classification setting in which the cardinality of $\mathcal{Y}$ is some large number $d$. Suppose that two different procedures produce two different credal sets on $\mathcal{Y}$; call one $\mathcal{P}$ and the other $\mathcal{Q}$, and suppose $\mathcal{Q}$ satisfies (a)–(c). This means that the uncertainty representations associated with the two procedures differ only by a "small amount". For instance, this could be the result of an agent specifying "slightly different" credal prior sets, which may well happen, since defining the boundaries of credal sets is usually quite an arbitrary task. Then, this would result in a (possibly massive) underestimation of the epistemic uncertainty in the results of the analysis, which could translate into catastrophic consequences in downstream tasks. In Example 2, we describe a situation in which Theorem 1 is applied to credal prior sets.

Example 2.

Assume for simplicity that the parameter space $\Theta$ is finite and that its cardinality is $\mathfrak{c}$. Suppose an agent faces complete ignorance regarding the probabilities to assign to the elements of $2^{\Theta}$. Although tempting, there is a pitfall in choosing the whole simplex $\Delta^{\mathfrak{c}-1}$ as the credal prior set. As shown by Walley [1991, Chapter 5], completely vacuous beliefs – captured by the choice of $\Delta^{\mathfrak{c}-1}$ as a credal prior set – cannot be Bayes-updated. This means that the posterior credal set will again be $\Delta^{\mathfrak{c}-1}$: no amount of data, however large, is enough to swamp the prior. Instead, suppose that the agent considers a credal prior set $\Delta^{\mathfrak{c}-1}_{\epsilon}$ that satisfies (a)–(c). If $\mathfrak{c}$ is large enough, this means that $\text{Vol}(\Delta^{\mathfrak{c}-1}_{\epsilon})$ is much smaller than $\text{Vol}(\Delta^{\mathfrak{c}-1})$.

Two remarks are in order. First, in the binary classification setting (that is, when $d=2$), the lack of robustness of the volume highlighted by Theorem 1 is not an issue, since $1-(1-\epsilon/r)^{d}$ is approximately $1$ only when the cardinality $|\mathcal{Y}|=d$ is large. Second, Theorem 1 is intimately related to Carl-Pajor's Theorem [Ball and Pajor, 1990, Theorem 1], a version of which we state in Appendix B; this suggests that, in the future, more techniques from high-dimensional geometry may become useful in the study of epistemic, and potentially also aleatoric, uncertainty.

5 Conclusion

Credal sets provide a flexible and powerful formalism for representing uncertainty in various scientific disciplines. In particular, uncertainty representation via credal sets can capture different degrees of uncertainty and allows for a more nuanced representation of epistemic and aleatoric uncertainty in machine learning systems. Moreover, the corresponding geometric representation of credal sets as $d$-dimensional polytopes enables a thoroughly intuitive view of uncertainty representation and quantification.

In this paper, we showed that the volume of a credal set is a sensible measure of epistemic uncertainty in the context of binary classification, as it enjoys many desirable properties suggested in the existing literature. On the other hand, despite its intuitive meaningfulness, the volume forfeits these properties in a multi-class classification setting.

In addition, this work raises the fundamental question of the extent to which a geometric approach to uncertainty quantification (in ML) is sensible.

This work is a first step toward studying the geometric properties of (epistemic) uncertainty in AI and ML. In the future, we plan to explore the geometry of aleatoric uncertainty and to introduce techniques from high-dimensional geometry and high-dimensional probability to enhance and deepen the study of EU and AU in the contexts of AI and ML.

Contributions

Yusuf Sale and Michele Caprio contributed equally to this paper.

Acknowledgements.
Michele Caprio would like to acknowledge partial funding by the Army Research Office (ARO MURI W911NF2010080). Yusuf Sale is supported by the DAAD programme Konrad Zuse Schools of Excellence in Artificial Intelligence, sponsored by the Federal Ministry of Education and Research.

References

  • Abellán and Moral [2000] Joaquín Abellán and Serafín Moral. A non-specificity measure for convex sets of probability distributions. International journal of uncertainty, fuzziness and knowledge-based systems, 8(03):357–367, 2000.
  • Abellán and Moral [2003] Joaquín Abellán and Serafín Moral. Building classification trees using the total uncertainty criterion. International Journal of Intelligent Systems, 18(12):1215–1225, 2003.
  • Abellan and Moral [2003] Joaquin Abellan and Serafin Moral. Maximum of entropy for credal sets. International journal of uncertainty, fuzziness and knowledge-based systems, 11(05):587–597, 2003.
  • Abellán and Klir [2005] Joaquín Abellán and George J. Klir. Additivity of uncertainty measures on credal sets. International Journal of General Systems, 34(6):691–713, 2005.
  • Anel [2021] Mathieu Anel. The Geometry of Ambiguity: An Introduction to the Ideas of Derived Geometry, volume 1, pages 505–553. Cambridge University Press, 2021.
  • Augustin et al. [2014] Thomas Augustin, Frank PA Coolen, Gert De Cooman, and Matthias CM Troffaes. Introduction to imprecise probabilities. John Wiley & Sons, 2014.
  • Ball and Pajor [1990] Keith Ball and Alain Pajor. Convex bodies with few faces. Proceedings of the American Mathematical Society, 110(1):225–231, 1990.
  • Bengs et al. [2022] Viktor Bengs, Eyke Hüllermeier, and Willem Waegeman. Pitfalls of epistemic uncertainty quantification through loss minimisation. In Advances in Neural Information Processing Systems, 2022.
  • Bloch [1996] Isabelle Bloch. Some aspects of Dempster-Shafer evidence theory for classification of multi-modality medical images taking partial volume effect into account. Pattern Recognition Letters, 17(8):905–919, 1996.
  • Bronevich and Klir [2008] Andrey Bronevich and George J Klir. Axioms for uncertainty measures on belief functions and credal sets. In NAFIPS 2008-2008 Annual Meeting of the North American Fuzzy Information Processing Society, pages 1–6. IEEE, 2008.
  • Bronevich and Klir [2010] Andrey Bronevich and George J Klir. Measures of uncertainty for imprecise probabilities: an axiomatic approach. International journal of approximate reasoning, 51(4):365–390, 2010.
  • Cai and Lim [2022] Yuhang Cai and Lek-Heng Lim. Distances between probability distributions of different dimensions. IEEE Transactions on Information Theory, 2022.
  • Caprio et al. [2023a] Michele Caprio, Souradeep Dutta, Radoslav Ivanov, Kuk Jang, Vivian Lin, Oleg Sokolsky, and Insup Lee. Imprecise Bayesian Neural Networks. arXiv preprint arXiv:2302.09656, 2023a.
  • Caprio et al. [2023b] Michele Caprio, Souradeep Dutta, Kaustubh Sridhar, Kuk Jang, Vivian Lin, Oleg Sokolsky, and Insup Lee. EpiC INN: Epistemic Curiosity Imprecise Neural Network. Technical report, University of Pennsylvania, Department of Computer and Information Science, 01 2023b.
  • Cohn et al. [2017] Henry Cohn, Abhinav Kumar, Stephen D. Miller, Danylo Radchenko, and Maryna S. Viazovska. The sphere packing problem in dimension 24. Annals of Mathematics, 185(3):1017–1033, 2017.
  • Corani and Zaffalon [2008] Giorgio Corani and Marco Zaffalon. Learning reliable classifiers from small or incomplete data sets: The naive credal classifier 2. Journal of Machine Learning Research, 9(4), 2008.
  • Corani et al. [2012] Giorgio Corani, Alessandro Antonucci, and Marco Zaffalon. Bayesian networks with imprecise probabilities: Theory and application to classification. In Data Mining: Foundations and Intelligent Paradigms, pages 49–93. Springer, 2012.
  • Couso et al. [1999] Inés Couso, Serafín Moral, and Peter Walley. Examples of independence for imprecise probabilities. In Proceedings of the First International Symposium on Imprecise Probabilities and Their Applications (ISIPTA 1999), pages 121–130, 1999.
  • Cuzzolin [2021] Fabio Cuzzolin. The Geometry of Uncertainty. Artificial Intelligence: Foundations, Theory, and Algorithms. Springer Nature Switzerland, 2021.
  • Depeweg et al. [2018] Stefan Depeweg, Jose-Miguel Hernandez-Lobato, Finale Doshi-Velez, and Steffen Udluft. Decomposition of uncertainty in bayesian deep learning for efficient and risk-sensitive learning. In International Conference on Machine Learning, pages 1184–1193. PMLR, 2018.
  • Hifi and Yousef [2019] Mhand Hifi and Labib Yousef. A local search-based method for sphere packing problems. European Journal of Operational Research, 247:482–500, 2019.
  • Hora [1996] Stephen C Hora. Aleatory and epistemic uncertainty in probability elicitation with an example from hazardous waste management. Reliability Engineering & System Safety, 54(2-3):217–223, 1996.
  • Hüllermeier and Waegeman [2021] Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: An introduction to concepts and methods. Machine Learning, 110(3):457–506, 2021.
  • Hüllermeier et al. [2022] Eyke Hüllermeier, Sébastien Destercke, and Mohammad Hossein Shaker. Quantification of credal uncertainty in machine learning: A critical analysis and empirical comparison. In Uncertainty in Artificial Intelligence, pages 548–557. PMLR, 2022.
  • Hüllermeier [2022] Eyke Hüllermeier. Quantifying aleatoric and epistemic uncertainty in machine learning: Are conditional entropy and mutual information appropriate measures? Available at arxiv:2209.03302, 2022.
  • Jiroušek and Shenoy [2018] Radim Jiroušek and Prakash P. Shenoy. A new definition of entropy of belief functions in the Dempster–Shafer theory. International Journal of Approximate Reasoning, 92:49–65, 2018.
  • Kapoor et al. [2022] Sanyam Kapoor, Wesley J Maddox, Pavel Izmailov, and Andrew Gordon Wilson. On uncertainty, tempering, and data augmentation in bayesian classification. arXiv preprint arXiv:2203.16481, 2022.
  • Kendall and Gal [2017] Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? Advances in neural information processing systems, 30, 2017.
  • Lambrou et al. [2010] Antonis Lambrou, Harris Papadopoulos, and Alex Gammerman. Reliable confidence measures for medical diagnosis with evolutionary algorithms. IEEE Transactions on Information Technology in Biomedicine, 15(1):93–99, 2010.
  • Levi [1980] Isaac Levi. The Enterprise of Knowledge. London : MIT Press, 1980.
  • Mosler [2002] Karl Mosler. Zonoids and lift zonoids. In Multivariate Dispersion, Central Regions, and Depth: The Lift Zonoid Approach, volume 165 of Lecture Notes in Statistics, pages 25–78. New York : Springer, 2002.
  • Pal et al. [1992] Nikhil R Pal, James C Bezdek, and Rohan Hemasinha. Uncertainty measures for evidential reasoning i: A review. International Journal of Approximate Reasoning, 7(3-4):165–183, 1992.
  • Pal et al. [1993] Nikhil R Pal, James C Bezdek, and Rohan Hemasinha. Uncertainty measures for evidential reasoning ii: A new measure of total uncertainty. International Journal of Approximate Reasoning, 8(1):1–16, 1993.
  • Seidenfeld et al. [2012] Teddy Seidenfeld, Mark J. Schervish, and Joseph B. Kadane. Forecasting with imprecise probabilities. International Journal of Approximate Reasoning, 53(8):1248–1261, 2012. Imprecise Probability: Theories and Applications (ISIPTA’11).
  • Senge et al. [2014] Robin Senge, Stefan Bösner, Krzysztof Dembczyński, Jörg Haasenritter, Oliver Hirsch, Norbert Donner-Banzhoff, and Eyke Hüllermeier. Reliable classification: Learning classifiers that distinguish aleatoric and epistemic uncertainty. Information Sciences, 255:16–29, 2014.
  • Settles [2009] Burr Settles. Active Learning Literature Survey. Technical report, University of Wisconsin-Madison, Department of Computer Sciences, 2009.
  • Shaker and Hüllermeier [2020] Mohammad Hossein Shaker and Eyke Hüllermeier. Aleatoric and epistemic uncertainty with random forests. In Advances in Intelligent Data Analysis XVIII: 18th International Symposium on Intelligent Data Analysis, IDA 2020, Konstanz, Germany, April 27–29, 2020, Proceedings 18, pages 444–456. Springer, 2020.
  • Shannon [1948] Claude E Shannon. A mathematical theory of communication. The Bell system technical journal, 27(3):379–423, 1948.
  • Smith and Gal [2018] Lewis Smith and Yarin Gal. Understanding measures of uncertainty for adversarial example detection. arXiv preprint arXiv:1803.08533, 2018.
  • Varshney [2016] Kush R Varshney. Engineering safety in machine learning. In 2016 Information Theory and Applications Workshop (ITA), pages 1–5. IEEE, 2016.
  • Varshney and Alemzadeh [2017] Kush R Varshney and Homa Alemzadeh. On the safety of machine learning: Cyber-physical systems, decision sciences, and data products. Big data, 5(3):246–255, 2017.
  • Vershynin [2018] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • Viazovska [2017] Maryna S. Viazovska. The sphere packing problem in dimension 8. Annals of Mathematics, 185(3):991–1015, 2017.
  • Walley [1991] Peter Walley. Statistical Reasoning with Imprecise Probabilities, volume 42 of Monographs on Statistics and Applied Probability. London : Chapman and Hall, 1991.
  • Walley [1996] Peter Walley. Inferences from multinomial data: learning about a bag of marbles. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):3–34, 1996.
  • Yang et al. [2009] Fan Yang, Hua-zhen Wang, Hong Mi, Cheng-de Lin, and Wei-wen Cai. Using random forest for reliable classification and cost-sensitive learning for medical diagnosis. BMC bioinformatics, 10(1):1–14, 2009.
  • Zaffalon [2002] Marco Zaffalon. The naive credal classifier. Journal of statistical planning and inference, 105(1):5–21, 2002.

Appendix A Proofs

Proof of Proposition 1.

Let $\mathcal{P},\mathcal{Q}\subset\mathcal{M}(\mathcal{Y},\sigma(\mathcal{Y}))$ be credal sets, and assume $|\mathcal{Y}|=2$. Then we have the following.

  • $\text{Vol}(\mathcal{P})\geq 0$ and $\text{Vol}(\mathcal{P})\leq\text{Vol}(\Delta^{2-1})=\sqrt{2}$. Hence $\text{Vol}(\cdot)$ satisfies A1.

  • The volume is a continuous functional; this well-known fact follows from the continuity of the Lebesgue measure. So $\text{Vol}(\cdot)$ satisfies A2.

  • $\mathcal{Q}\subset\mathcal{P}\implies\text{Vol}(\mathcal{Q})\leq\text{Vol}(\mathcal{P})$. This follows from the monotonicity of the Lebesgue measure, so $\text{Vol}(\cdot)$ satisfies A3.

  • Consider a sequence $(\mathcal{P}_{n})$ of credal sets on $(\mathcal{Y},\sigma(\mathcal{Y}))$ such that $\lim_{n\rightarrow\infty}[\overline{P}_{n}(A)-\underline{P}_{n}(A)]=0$, for all $A\in\sigma(\mathcal{Y})$. Then there exists $N\in\mathbb{N}$ such that for all $n\geq N$, the geometric representation of $\mathcal{P}_{n+1}$ is a subset of the geometric representation of $\mathcal{P}_{n}$. In addition, the limiting element of $(\mathcal{P}_{n})$ is a (multi)set $\mathcal{P}^{\star}$ whose elements are all equal to $P^{\star}$, so its geometric representation is a point and its volume is $0$. Hence, probability consistency is implied by continuity (A2), so $\text{Vol}(\cdot)$ satisfies A4'.

  • The volume is invariant to rotation and translation; this is a well-known property of the Lebesgue measure. So $\text{Vol}(\cdot)$ satisfies A7.

Let us now show that the volume operator satisfies sub-additivity (A5). Let $\mathcal{Y}=\mathcal{Y}_{1}\times\mathcal{Y}_{2}$, and suppose we are in the general case in which $|\mathcal{Y}|=|\mathcal{Y}_{1}|=|\mathcal{Y}_{2}|=2$. In particular, let $\mathcal{Y}=\{(y_{1},y_{2}),(y_{3},y_{4})\}$, so that $\mathcal{Y}_{1}=\{y_{1},y_{3}\}$ and $\mathcal{Y}_{2}=\{y_{2},y_{4}\}$, and suppose $y_{1}\neq y_{3}$ and $y_{2}\neq y_{4}$. Now, pick any probability measure $P$ on $\mathcal{Y}$. In general, its marginal $\text{marg}_{\mathcal{Y}_{1}}(P)=P^{\prime}$ on $\mathcal{Y}_{1}$ is such that $P^{\prime}(y_{i})=\sum_{j}P((y_{i},y_{j}))$, and similarly for the marginal $\text{marg}_{\mathcal{Y}_{2}}(P)=P^{\prime\prime}$ on $\mathcal{Y}_{2}$. In our case, though, the computation is easier. To see this, fix $y_{1}$. Then, we should sum over $j$ the probability of $(y_{1},y_{j})$, $y_{j}\in\mathcal{Y}_{2}$; but the only such pair is $(y_{1},y_{2})$. A similar argument holds if we fix $y_{3}$, or any of the elements of $\mathcal{Y}_{2}$. Hence, we have that

$$P^{\prime}(y_{1})=P((y_{1},y_{2}))=P^{\prime\prime}(y_{2})\quad\text{and}\quad P^{\prime}(y_{3})=P((y_{3},y_{4}))=P^{\prime\prime}(y_{4}).$$

Let $\mathcal{P}^{\prime}$ and $\mathcal{P}^{\prime\prime}$ denote the marginal convex sets of probability distributions on $\mathcal{Y}_{1}$ and $\mathcal{Y}_{2}$, respectively, and let $\mathcal{P}$ denote the convex set of joint probability distributions on $\mathcal{Y}=\mathcal{Y}_{1}\times\mathcal{Y}_{2}$ [Couso et al., 1999]. Then, given our argument above, we have that $\text{Vol}(\mathcal{P})<\text{Vol}(\mathcal{P}^{\prime})+\text{Vol}(\mathcal{P}^{\prime\prime})=2\,\text{Vol}(\mathcal{P})$. So in the general $|\mathcal{Y}|=|\mathcal{Y}_{1}|=|\mathcal{Y}_{2}|=2$ case where $y_{1}\neq y_{3}$ and $y_{2}\neq y_{4}$, the volume is sub-additive. ∎

Proof of Proposition 2.

Immediate from the assumption on the instance of SI. ∎

Proof of Theorem 1.

Pick any compact set $\mathcal{P}\subset\mathcal{M}(\Omega,\mathcal{F})$ and any set $\mathcal{Q}$ satisfying (a)–(c). Let $B^{d}_{r}\subset\mathbb{R}^{d}$ denote a generic ball in $\mathbb{R}^{d}$ of radius $r>0$. Notice that $N^{\text{pack}}_{r-\epsilon}(\mathcal{Q}^{\prime})=N^{\text{pack}}_{r-\epsilon}(\mathcal{P})-N^{\text{pack}}_{r-\epsilon}(\mathcal{Q})\geq 0$ because $\mathcal{P}\supset\mathcal{Q}$. Then, the proof goes as follows:

$$\begin{aligned}
\frac{\text{Vol}(\mathcal{P})-\text{Vol}(\mathcal{Q}^{\prime})}{\text{Vol}(\mathcal{P})}
&=\frac{\frac{1}{c(r,d,\mathcal{P})}\text{Vol}(\tilde{\mathcal{P}}_{r})-\frac{1}{c(r-\epsilon,d,\mathcal{Q}^{\prime})}\text{Vol}(\tilde{\mathcal{Q}}^{\prime}_{r-\epsilon})}{\frac{1}{c(r,d,\mathcal{P})}\text{Vol}(\tilde{\mathcal{P}}_{r})} &&(7)\\
&\geq\frac{\text{Vol}(\tilde{\mathcal{P}}_{r})-\text{Vol}(\tilde{\mathcal{Q}}^{\prime}_{r-\epsilon})}{\text{Vol}(\tilde{\mathcal{P}}_{r})} &&(8)\\
&=\frac{N^{\text{pack}}_{r}(\mathcal{P})\,\text{Vol}(B^{d}_{r})-N^{\text{pack}}_{r-\epsilon}(\mathcal{Q}^{\prime})\,\text{Vol}(B^{d}_{r-\epsilon})}{N^{\text{pack}}_{r}(\mathcal{P})\,\text{Vol}(B^{d}_{r})} &&(9)\\
&=\frac{N^{\text{pack}}_{r}(\mathcal{P})\,\text{Vol}(B^{d}_{1})\,r^{d}-N^{\text{pack}}_{r-\epsilon}(\mathcal{Q}^{\prime})\,\text{Vol}(B^{d}_{1})\,(r-\epsilon)^{d}}{N^{\text{pack}}_{r}(\mathcal{P})\,\text{Vol}(B^{d}_{1})\,r^{d}} &&(10)\\
&=\frac{N^{\text{pack}}_{r}(\mathcal{P})\,r^{d}-N^{\text{pack}}_{r-\epsilon}(\mathcal{Q}^{\prime})\,(r-\epsilon)^{d}}{N^{\text{pack}}_{r}(\mathcal{P})\,r^{d}}\\
&=1-\frac{N^{\text{pack}}_{r-\epsilon}(\mathcal{Q}^{\prime})}{N^{\text{pack}}_{r}(\mathcal{P})}\left(1-\frac{\epsilon}{r}\right)^{d}\\
&\geq 1-\left(1-\frac{\epsilon}{r}\right)^{d}, &&(11)
\end{aligned}$$

where (7) comes from equation (3), (8) comes from the fact that $r-\epsilon\leq r\implies c(r-\epsilon,d,\mathcal{Q}^{\prime})\geq c(r,d,\mathcal{P})$ by (4), (9) comes from $\tilde{\mathcal{P}}_{r}$ (respectively, $\tilde{\mathcal{Q}}^{\prime}_{r-\epsilon}$) being a union of pairwise disjoint balls of radius $r$ (respectively, $r-\epsilon$), (10) comes from the formula for the volume of a ball of radius $r$ in $\mathbb{R}^{d}$, and (11) comes from property (c) of $\mathcal{Q}$. ∎

Appendix B High-dimensional probability

Since Theorem 1 in Section 4.2 is intimately related to Carl-Pajor's Theorem [Ball and Pajor, 1990], we state (a version of) the theorem here.

Theorem 2 (Carl-Pajor).

Let $B_{1,d}$ denote the $d$-dimensional unit Euclidean ball, and let $\mathcal{P}\subset B_{1,d}$ be a polytope with $m\in\mathbb{N}$ vertices. Then, we have

$$\frac{\text{Vol}(\mathcal{P})}{\text{Vol}(B_{1,d})}\leq\left(4\sqrt{\frac{\log m}{d}}\right)^{d}. \tag{12}$$
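As a quick numeric illustration (the $(d,m)$ pairs below are arbitrary), the right-hand side of (12) is vacuous (exceeds $1$) unless $m<e^{d/16}$, but it decays extremely fast once $d$ dominates $\log m$:

```python
from math import log, sqrt

# Right-hand side of the Carl-Pajor bound (12) for a few (d, m) pairs.
for d, m in [(10, 100), (100, 100), (500, 1000)]:
    print(d, m, (4 * sqrt(log(m) / d)) ** d)
```

For $d=10$ and $m=100$ the bound is vacuous (about $2\cdot 10^{4}$), whereas for $d=100$ and $m=100$ it is about $2\cdot 10^{-7}$: a polytope with few vertices occupies a vanishing fraction of the unit ball in high dimensions.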

For further results connecting high-dimensional probability and data science, see Vershynin [2018].