Metric Entropy Duality and the Sample Complexity of Outcome Indistinguishability
Abstract
We give the first sample complexity characterizations for outcome indistinguishability, a theoretical framework of machine learning recently introduced by Dwork, Kim, Reingold, Rothblum, and Yona (STOC 2021). In outcome indistinguishability, the goal of the learner is to output a predictor that cannot be distinguished from the target predictor by a class of distinguishers examining the outcomes generated according to the predictors’ predictions. While outcome indistinguishability originated from the algorithmic fairness literature, it provides a flexible objective for machine learning even when fairness is not a consideration. In this work, we view outcome indistinguishability as a relaxation of PAC learning that allows us to achieve meaningful performance guarantees under data constraint.
In the distribution-specific and realizable setting where the learner is given the data distribution together with a predictor class containing the target predictor, we show that the sample complexity of outcome indistinguishability is characterized by the metric entropy of w.r.t. the dual Minkowski norm defined by , and equivalently by the metric entropy of w.r.t. the dual Minkowski norm defined by . This equivalence makes an intriguing connection to the long-standing metric entropy duality conjecture in convex geometry. Our sample complexity characterization implies a variant of metric entropy duality, which we show is nearly tight. In the distribution-free setting, we focus on the case considered by Dwork et al. where contains all possible predictors, hence the sample complexity only depends on . In this setting, we show that the sample complexity of outcome indistinguishability is characterized by the fat-shattering dimension of .
We also show a strong sample complexity separation between realizable and agnostic outcome indistinguishability in both the distribution-free and the distribution-specific settings. This is in contrast to distribution-free (resp. distribution-specific) PAC learning where the sample complexity in both the realizable and the agnostic settings can be characterized by the VC dimension (resp. metric entropy).
1 Introduction
Prediction algorithms based on machine learning are becoming increasingly influential on major decisions that affect individual’s lives in settings such as medical diagnoses, loan applications, and educational admissions processes. As a result, we must be careful that the predictions of these algorithms do not discriminate against any sub-communities within the input population. Unfortunately, standard measures of prediction accuracy don’t necessarily guarantee the absence of such discrimination. Consider a binary classifier that is used to predict an individual’s probability of repaying a loan. A natural way to measure the success of our classifier would be to use classification error–the fraction of instances from some representative test set that it classifies incorrectly. Ideally, we’d hope that a classifier with error incorrectly classifies any particular individual in our population only of the time. However, if some low-income minority group makes up 10% of the population and has a low probability (say 40%) of repaying a loan on average, a classifier that chooses to focus on getting accuracy on the majority group and applies a blanket policy of classifying every individual in the minority group as unlikely to pay back the loan can still receive this classification error guarantee despite incorrectly classifying minority individuals of the time. If factored into loan application decisions, such a classifier could deny loans to many worthy applicants from the minority group, further cementing and potentially exacerbating the financial disparities present in the population.
This potential for discrimination was the original inspiration behind Outcome indistinguishability (OI), a theoretical framework for machine learning recently proposed by Dwork, Kim, Reingold, Rothblum, and Yona [2021] that aims to address disparities in the treatment of different groups by replacing the accuracy objective with a more flexible objective that can give stronger guarantees for sub-communities in a population. OI is a framework used when learning predictors, rather than classifiers. In this setting, an analogous measure to the classification error of a learned predictor is the error considered by e.g. Bartlett et al. [1996]. Instead of using the error as a quality measure, the main idea of OI is to measure the quality of a learned predictor by how easy it is for a predetermined class of distinguishers to distinguish between the output of and the output of the target predictor.
More precisely, the goal of OI is to output a predictor assigning probabilities to individuals . Given a distribution over , every predictor defines a distribution for individual-outcome pairs where the individual is drawn from , and the outcome is drawn from the Bernoulli distribution with mean . The input to the OI learner consists of examples drawn i.i.d. from for an unknown target predictor . The outputted predictor is audited by a set of distinguishers . Here, a distinguisher takes an individual-outcome pair as input, and either accepts or rejects the input.111 Dwork et al. [2021] also considered more advanced distinguishers with access to additional information about the predictors such as the predictions themselves or the predictor code, but the simplest distinguishers we describe here can already express non-trivial objectives. Also, while we describe outcome indistinguishability in the binary outcome setting, it is possible to generalize the notion to an arbitrary number of possible outcomes by considering predictors that output a description of a more general distribution. The distinguishing advantage of a distinguisher on two predictors and is defined as the absolute difference between the acceptance probabilities of on drawn from and . The quality of the learned predictor is then measured by the maximum distinguishing advantage (MDA) on and over all distinguishers . In the loan repayment setting described above, adding a distinguisher that only compares the rate of positive classification on the minority group could identify and prevent this disparity for a small enough allowable MDA threshold.
It can be shown that the MDA never exceeds the error , and when contains all possible distinguishers, the MDA and the error are known to be equal (see Lemma 7). However, by leveraging our ability to select only the distinguishers we care about in OI, we can tune the MDA to be a more suitable quality measure compared to error even in settings outside of fairness. As an example, suppose we would like to learn a binary image classifier that can be used in self-driving cars to determine whether the road contains an obstruction. Ideally we would like to learn a model that gives classification error very close to zero because it means that we can expect the car to fail to detect an obstruction extremely rarely. However, what if we have insufficient data to learn a classifier with extremely low error, and must settle for larger error on the order of 1% or more? This is where we need to observe that not all errors are born equal. Failing to recognize leaves on the road is dramatically different from failing to identify a parking vehicle. Unfortunately, a 1% error may be completely concentrated on important obstructions (that may occur less than 1% of the time). An overall error rate that would guarantee minuscule errors in important cases may very well be impossible and concentrating on the errors we care about may be mandated.
This example demonstrates that classification error alone may not be enough to tell us whether this is a model we would trust on the roads. In particular, the ability to refine our measure of performance to focus on the particular types of mistakes that we care about most would give us a better understanding of performance and might potentially allow us to get good targeted accuracy guarantees even with insufficient data to achieve high accuracy against all types of errors. If we believe that distinguishing between instances where a large obstruction is or is not present requires a smaller number of circuit gates than distinguishing between instances that may contain more minor obstructions such as a small tree branch or plastic bag, choosing to contain all distinguishers with small circuit sizes would filter out serious errors such as missing a pedestrian crossing the street even when we cannot achieve any meaningful overall accuracy due to limited data. Moreover, if contains a distinguisher that accepts if and only if contains a large obstruction and equals , a predictor that significantly underestimates the true prediction when contains a serious obstruction would have a high MDA. In general, when the distinguisher class is properly chosen, a predictor with error and MDA both being about can be less preferable than a predictor with error being but MDA being only .
1.1 Sample Complexity Characterizations
Traditionally, outcome indistinguishability and related notions such as multicalibration [Hébert-Johnson, Kim, Reingold, and Rothblum, 2018] have been viewed as providing stronger guarantees than error alone by allowing a predictor’s performance to be fine-tuned at the group level rather than just looking at the entire population. However, our obstruction-identification example from the previous section demonstrates how OI can also be viewed as a useful relaxation of the standard performance benchmark. By focusing on the important errors specified by the distinguisher class, outcome indistinguishability may allow us to achieve good performance with relatively small sample size. It is natural to ask: how much improvement in the sample size do we get from OI? This is the main focus of this paper—understanding the sample complexity of outcome indistinguishability.
It is one of the major objectives of learning theory to understand the sample complexity of various learning tasks and many influential characterizations of sample complexity have been discovered. The most notable example is the VC dimension for PAC learning [Vapnik and Chervonenkis, 1971, Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989, Linial, Mansour, and Rivest, 1991]. In PAC learning [Valiant, 1984b, a], the learner receives examples where are drawn i.i.d. from a distribution over , and aims to output a binary classifier (a.k.a. concept or hypothesis) that is “close” to the target classifier . Here, the performance of is measured by the classification error . In the realizable setting, the target classifier is assumed to be from a given class , whereas in the agnostic setting, there is no assumption on but the performance of is measured relative to the best classifier in the class . The main result of VC theory states that the sample complexity of PAC learning in both the realizable and the agnostic settings are characterized by the VC dimension [Vapnik and Chervonenkis, 1971] of the class , which has a simple combinatorial definition. The success of VC theory raises the possibility that the sample complexity of other learning tasks may also have natural characterizations. Below we discuss two such learning tasks that are most relevant to our work: predictor learning and distribution-specific learning.
The extension of PAC learning to predictor learning dates back to [Kearns and Schapire, 1994], in which predictors were termed probabilistic concepts. In predictor learning, binary classifiers are replaced by predictors whose predictions take values in . Kearns and Schapire [1994] introduced the notion of fat-shattering dimension (see Section 2.4 for a precise definition) as a generalization of VC dimension to predictor classes and showed a lower bound on the sample complexity of learning a predictor class by its fat-shattering dimension. Alon, Ben-David, Cesa-Bianchi, and Haussler [1997] and Bartlett, Long, and Williamson [1996] complemented the result with corresponding upper bounds and concluded that a predictor class is learnable from finitely many examples if and only if it has finite fat-shattering dimension. Quantitatively, these papers showed that the sample complexity of learning a predictor in a class within error is characterized by
assuming we want the success probability to be at least a constant, say . Moreover, Bartlett et al. [1996] extended the characterization to the agnostic setting, where the objective is the mean absolute error between the learned prediction and the actual outcome. Later, the sample complexity bounds and the related uniform convergence results from Alon et al. [1997] and Bartlett et al. [1996] were improved by Bartlett and Long [1995], Bartlett and Long [1998], and Li, Long, and Srinivasan [2001].
Another natural extension of PAC learning is distribution-specific learning. In both PAC learning and predictor learning discussed above, performance of the learner is evaluated based on the worst-case distribution . These settings are referred to as distribution-free due to their lack of dependence on a particular input distribution. Since the distribution is usually not the worst-case in practice, distribution-specific learning focuses on the performance of the learner on a given distribution . In this setting, the sample complexity of learning a binary classifier in class to achieve a classification error below is characterized by
(1) |
using the metric entropy, i.e., the logarithm of the covering number of w.r.t. the classification error (as a metric), which indeed depends on the specific distribution [Benedek and Itai, 1991].
To compare OI with these previous clasification-error/-error-based notions of learning in terms of sample complexity, we need a characterization of the sample complexity of OI using similar quantities such as the fat-shattering dimension or the metric entropy. While it is tempting to hope that we might directly apply such notions, OI introduces additional subtlety in that we must consider how the expressiveness of the predictor class and the class of distinguishers interact to fully understand the sample complexity requirements. This is in contrast to standard settings where characterizing the expressiveness of the concept class via VC dimension or related notions is sufficient.
We show a simple example where it is indeed important to consider the interplay between and rather than just considering their contributions independently: Partition the set of inputs into two equal-sized sets, and . We consider a class of predictors that are maximally expressive on and constant on : Let . Similarly, we can define a distinguisher class that are maximally inquisitive on one set and ignores the other set but depending on which set is ignored we will get dramatically different complexity: Define two potential distinguisher classes: , . and are symmetric and thus identical in any independent measure of complexity. Nevertheless, it is easy to see that while achieving -OI on with respect to is equivalent to learning a predictor from with error on the set , learning an -OI predictor from with respect to is trivial as is constant on all individuals that can distinguish between, and therefore any will satisfy perfect OI with respect to . This example demonstrates the need for a more subtle variation of existing complexity notions in order to tightly characterize the sample complexity of OI.
Prior to our work, Dwork et al. [2021] showed that examples are sufficient for achieving an advantage below over a distinguisher class with probability at least in the distribution-free setting. In this work, we refine the dependence on to its fat-shattering dimension with a matching lower bound (Section 5). In addition, we characterize the sample complexity of OI in the distribution-specific setting with greater generality for every given predictor class , placing OI in the same setting as the classification-error/-error-based notions of learning with improved sample complexity due to a well-tuned objective based on the distinguisher class . The interplay between the distinguisher class and the predictor class connects our sample complexity characterizations to the intriguing metric entropy duality conjecture in convex geometry, which we discuss more in Section 1.2 below.
1.2 Covering for Distinguishers and Metric Entropy Duality
Before we describe our sample complexity characterizations for OI, we would like to discuss the ideas behind the metric-entropy-based sample complexity characterization (1) for distribution-specific learning by Benedek and Itai [1991]. Our characterizations for distribution-specific OI are based on similar ideas, but in a more subtle and arguably surprising way, making an intriguing connection to the metric entropy duality conjecture, a long-standing conjecture in convex geometry.
The idea behind the sample complexity upper bound in (1) is to reduce the size of the classifier class by taking an -covering of it. Consider the space of all binary classifiers endowed with the “classification error metric” . If two classifiers and are close w.r.t. this metric, choosing and would result in roughly equal classification errors w.r.t. the target predictor. This gives a natural procedure for simplifying the classifier class and consequently controlling the sample complexity: we can replace by an -covering of it with only minor loss in its expressivity. Here, an -covering is a subset such that every can find a close companion such that . As the size of can be bounded by the covering number , we can use a relatively small amount of examples to accurately estimate the classification error of every classifier in . This naturally leads to the empirical risk minimization (ERM) algorithm used in [Benedek and Itai, 1991].
To extend the ERM algorithm to outcome indistinguishability, a natural idea is to define the metric between two predictors and to be the maximum distinguishing advantage for and w.r.t. the distinguisher class , and compute an -covering of the predictor class . However, this direct extension (Algorithm 1) of the ERM algorithm to OI does not give us the right sample complexity upper bound (we show this formally in Appendix A, especially in Lemma 33). This failure is partly because the acceptance probabilities of the distinguishers in are not all preserved on a small sample. Indeed, learning a predictor with MDA below w.r.t. to the target predictor necessarily requires estimating the acceptance probability of every distinguisher in on within error (the estimates are simply the acceptance probabilities on ).
To meet the need of estimating the acceptance probabilities of the distinguishers in , we use a new algorithm (Algorithm 2) where we compute a covering of w.r.t. a metric defined by the predictor class , i.e., we flip the roles of and in the covering. Using this new algorithm, we get a sample complexity upper bound as a function of the metric entropy of w.r.t. , but the sample complexity lower bound we get by extending the arguments from [Benedek and Itai, 1991] is still a function of the metric entropy of w.r.t. . How do we make the upper and lower bounds match?
Our idea is to first transform distinguishers into the same inner product space as the predictors, and then interpret and as two abstract vector sets and in the same inner product space that can exchange roles freely. Specifically, combining our upper and lower bounds, we know that the lower bound based on never exceeds the upper bound based on . If we set and , a more precise version of what we get is
(2) |
If and are two arbitrary abstract sets of vectors, we can inversely set and in the inequality above and get
This inequality is exactly what we need to flip back the roles of and and make our upper bound match our lower bound.
The key inequality (2) that helps match our upper and lower bounds comes from combining the bounds themselves. Moreover, this inequality makes an intriguing connection to the long-standing metric entropy duality conjecture in convex geometry, which conjectures that (2) holds without the factor, but with the additional assumption that and are convex and symmetric around the origin. Without this additional assumption, we show that the quadratic dependence on in (2) is nearly tight (Lemma 14).
The metric entropy duality conjecture (formally stated in Conjecture 12) was first proposed by Pietsch [1972]. While the conjecture remains open, many weaker versions of it have been proved [e.g. Bourgain, Pajor, Szarek, and Tomczak-Jaegermann, 1989]. Most notably, the conjecture was proved by Artstein, Milman, and Szarek [2004b] in the special case where one of the two convex sets is an ellipsoid. This result was further strengthened by Artstein, Milman, Szarek, and Tomczak-Jaegermann [2004a]. Milman [2007] proved a weaker form of the conjecture with the constant factors replaced by dimension dependent quantities.
1.3 Our Contributions
Outcome indistinguishability was originally proposed as a strong fairness and accuracy criterion, potentially requiring large sample sizes to achieve. In this work, we view OI differently as a meaningful notion when we have insufficient data for -error-based learning. For this reason, we focus on no-access OI, the simplest form of OI introduced by Dwork et al. [2021]. For no-access OI, we show that (randomized) distinguishers can be converted to vectors in the same inner product space as the predictors (Section 2.1.2), connecting the MDA objective to the multiaccuracy objective used by Kim et al. [2019]. This allows us to understand predictor classes and distinguisher classes using geometric notions such as the dual Minkowski norms and the covering numbers defined by these norms (see Section 2.2).
In the distribution-specific setting, we consider realizable OI where the target predictor lies in an arbitrary given predictor class . Setting the failure probability bound in Theorem 13 to be a constant, say , for every predictor class , every distinguisher class , every data distribution and every MDA bound , we characterize the sample complexity of distribution-specific realizable OI both as
(3) |
and as
(4) |
Our sample complexity characterizations (3) and (4) highlight an intriguing connection between learning theory and the metric entropy duality conjecture (Conjecture 12) first proposed by Pietsch [1972], which conjectures that
whenever two function classes and are convex and symmetric around the origin. Our sample complexity characterizations imply a variant version of metric entropy duality (Theorem 11) where and are not required to be convex and symmetric, which we show is nearly tight (Lemma 14).
In the distribution-free setting, we focus on the case where contains all possible predictors, which is the setting considered by Dwork et al. [2021]. Setting in Theorem 15, we show that the sample complexity of distribution-free OI in this setting is characterized by
This characterization extends to multicalibration with some modifications (see Remark 21 and Remark 24). Our result refines the in the sample complexity upper bounds by Dwork et al. [2021] (for OI) and by Hébert-Johnson et al. [2018] (for multicalibration) to the fat-shattering dimension of with a matching lower bound.
In addition, we show that the sample complexity of OI in the agnostic setting behaves very differently from the realizable setting. This is in contrast to many common learning settings where the sample complexities of realizable and agnostic learning usually behave similarly (a recent independent work by Hopkins et al. [2021] gives a unified explanation for this phenomenon). In both the distribution-free and the distribution-specific settings, we show that the sample complexity of agnostic OI can increase when we remove some distinguishers from , and it can become arbitrarily large even when the realizable sample complexity is bounded by a constant (Section 6.2). This also suggests that OI can have larger sample complexity compared to -error based learning in the agnostic setting. This is because in the agnostic setting, the performance of the learned predictor is measured relative to the best predictor in the class , which can have a much better performance when measured using the selected objective of OI than using the error. On the other hand, when the target predictor belongs to the symmetric convex hull of the predictor class , we show that the sample complexity in the distribution-specific agnostic setting has the same characterizations as in the realizable setting (Lemma 25).
1.4 Related Work
The notion of outcome indistinguishability originated from the growing research of algorithmic fairness. Specifically, outcome indistinguishability can be treated as a generalization of multiaccuracy and multicalibration introduced by Hébert-Johnson, Kim, Reingold, and Rothblum [2018] and Kim, Ghorbani, and Zou [2019], in which the goal is to ensure that the learned predictor is accurate in expectation or calibrated conditioned on every subpopulation in a subpopulation class . Roughly speaking, the subpopulation class in multiaccuracy and multicalibration plays a similar role as the distinguisher class in outcome indistinguishability, and this connection has been discussed more extensively in [Dwork et al., 2021, Section 4]. Beyond fairness, multicalibration and OI also provide strong accuracy guarantees [see e.g. Rothblum and Yona, 2021, Blum and Lykouris, 2020, Zhao, Kim, Sahoo, Ma, and Ermon, 2021, Gopalan, Kalai, Reingold, Sharan, and Wieder, 2022, Kim, Kern, Goldwasser, Kreuter, and Reingold, 2022, Burhanpurkar, Deng, Dwork, and Zhang, 2021]. For a general predictor class and a subpopulation class , Shabat, Cohen, and Mansour [2020] showed sample complexity upper bounds of uniform convergence for multicalibration based on the maximum of suitable complexity measures of and . They complemented this result with a lower bound which does not grow with and . In comparison, we focus on the weaker no-access OI setting where the sample complexity can be much smaller, and we provide matching upper and lower bounds in terms of the dependence on and .
1.5 Paper Organization
The remainder of this paper is structured as follows. In Section 2, we formally define OI and related notions. We give lower and upper bounds for the sample complexity of distribution-specific realizable OI in Section 3, and turn these bounds into a characterization in Section 4 via metric entropy duality. We characterize the sample complexity of distribution-free OI in Section 5. Finally, in Section 6 we show a strong separation between the sample complexities of realizable and agnostic OI in both the distribution-free and distribution-specific settings.
2 Preliminaries
We use to denote an arbitrary non-empty set of individuals throughout the paper. For simplicity, whenever we say is a distribution over , we implicitly assume that every subset of is measurable w.r.t. (this holds e.g. when is a discrete distribution), although all the results in this paper naturally extend to more general distributions under appropriate measurability assumptions. We use to denote the set of all distributions over satisfying the implicit assumption. For two sets and , we use to denote the class of all functions . Given , we use to denote the Bernoulli distribution over with mean . We use to denote the base- logarithm.
2.1 Outcome Indistinguishability
Outcome indistinguishability is a theoretical framework introduced by Dwork et al. [2021] that aims to guarantee that the outcomes produced by some learned predictor are indistinguishable to a predetermined class of distinguishers from outcomes sampled from the true probabilities for each individual defined by .
The distinguishing task in outcome indistinguishability consists of drawing an individual according to some population distribution and then presenting the distinguisher with an outcome/individual pair where is either sampled according to the true predictor from the Bernoulli distribution , or sampled according to the learned predictor from . Taking a pair , a distinguisher outputs or . We allow distinguishers to be randomized.
Given a predictor and a distribution over , we define to be the distribution of pairs drawn using the process described above such that and . With this notation in hand, we can now provide a formal definition of outcome indistinguishability:
Definition 1 (No-Access Outcome Indistinguishability, Dwork et al. [2021]).
Let be a distribution over a set of individuals and be some target predictor for the set of individuals. For a class of distinguishers and , a predictor satisfies -outcome indistinguishability (OI) if for every ,
(5) |
We refer to the left-hand-side of (5) as the distinguishing advantage (or simply advantage) of the distinguisher , denoted . Given a class of distinguishers, we use to denote the supremum . According to Definition 1, a learned predictor satisfies -OI if and only if .
There are many different extensions of OI to settings in which the power of distinguishers is extended to allow access to additional information about the learned predictor or access to more than one individual/outcome pair. We will be focusing on this most basic case in which the distinguisher only has access to a single pair and has no additional access to the learned predictor (hence the name No-Access OI).
2.1.1 OI Algorithms
An OI algorithm (or learner) takes examples drawn i.i.d. from and outputs a predictor that satisfies -OI with probability at least :
Definition 2 (No-Access OI Algorithms).
Given a target predictor , a class of distinguishers , a distribution , an advantage bound , a failure probability bound , and a nonnegative integer , we use to denote the set of all (possibly randomized and inefficient) algorithms with the following properties:
-
1.
takes examples as input and outputs a predictor ;
-
2.
when the input examples are drawn i.i.d. from , with probability at least the output predictor satisfies .
When seeking to learn an outcome indistinguishable predictor, there are two different tasks that we can consider. On one hand, in what we call the realizable case, we assume that the target predictor lies in a known predictor class , and we seek to achieve low distinguishing advantages over all distinguishers in the class .222 Throughout the paper, we implicitly assume that all predictor classes and distinguisher classes are non-empty. Alternatively, in the agnostic case, we can imagine a situation in which nothing is known about the target predictor , but the performance of the learned predictor is measured relative to the best predictor in . In both the agnostic and realizable settings, we can also specify whether we measure performance on the worst-case distribution over individuals in , or on some particular distribution given to the learner. We call these the distribution-free and distribution-specific settings, respectively.
Definition 3 (Algorithms in Various Settings).
Given a predictor class , a class of distinguishers , a distribution , an advantage bound , a failure probability bound , and a nonnegative integer , we define the sets of algorithms that solve various OI tasks using examples as follows:
(distribution-specific realizable OI) | |||
(distribution-specific agnostic OI) | |||
(distribution-free realizable OI) | |||
(distribution-free agnostic OI) |
We note that while these learning goals are all defined with respect to some predictor class , this class simply constrains the possible values of the target predictor (or in the agnostic case, constrains the predictors used to measure the performance of the returned predictor). In particular we do not require any of the OI algorithms to be proper, i.e. always output some , despite the fact that some of the algorithms discussed in our proofs happen to satisfy this property.
Definition 4 (Sample complexity).
Given a predictor class , a class of distinguishers , a distribution , an advantage bound , a failure probability bound , we define the sample complexity of various OI tasks as follows:
It is clear from the definition that the following monotonicity properties hold: for ,
(6) |
Note that the sample complexities in the agnostic setting are not guaranteed to be monotone w.r.t. (see Section 6.2). It is also clear from definition that
(7) |
2.1.2 No-Access Distinguishers as Functions of Individuals
In the standard definition of OI, distinguishers are thought of as randomized algorithms that take as input an individual-outcome pair and output ACCEPT or REJECT. However, there is a natural way to transform every no-access distinguisher into a function that maps every individual to a label in such a way that the distinguishing advantage of can be recovered from .
In particular, given a randomized distinguisher , we define such that
Given the function , we show that we can always recover the distinguishing advantage of the original distinguisher :
Lemma 5.
For any two predictors ,
Proof.
For any and any two possible outcomes , it is easily verifiable that
where the probabilities are over the internal randomness of the distinguisher . The lemma is proved by the following chain of equations:
Note that the transformation from a distinguisher to the function is onto: given any function , we can construct a distinguisher such that in the following way: if , distinguisher accepts with probability , and accepts with probability ; if , distinguisher accepts with probability , and accepts with probability .
Lemma 5 shows that all distinguishers mapped to the same function have equal distinguishing advantages. It also shows that no-access OI has the same error objective as multiaccuracy considered by Kim et al. [2019]. From now on, when we refer to a distinguisher , it should be interpreted as the function . Similarly, we think of a distinguisher class as a non-empty subset of .
2.2 Inner Product, Norm, and Covering Number
The set of all real-valued functions on is naturally a linear space: for all , we define where for all , and for and , we define where for all .
For any function class and a real number , we use to denote and use to denote . For any two function classes , we define
For any non-empty function class , we say a function is in the convex hull of if there exist , and such that and . We use to denote the symmetric convex hull of consisting of all functions in the convex hull of . When is empty, we define to be the class consisting of only the all zeros function.
We say a function is bounded if there exists such that for all . We say a function class is bounded if there exists such that for all and .
For two bounded functions , we define their inner product w.r.t. distribution as
Although we call the above quantity an inner product for simplicity, it may not be positive definite ( need not imply for all ). For a non-empty bounded function class and a bounded function , we define the dual Minkowski norm of w.r.t. to be
If is empty, we define . The norm is technically only a semi-norm as it may not be positive definite, but whenever it is positive definite, it is the dual norm of the Minkowski norm induced by (the closure of) for finite [see e.g. Nikolov et al., 2013, Section 2.1].
For a non-empty bounded function class and , we say a subset is an -covering of w.r.t. the norm if for every , there exists such that . We define the covering number of w.r.t. the norm to be the minimum size of such an -covering of . We refer to the logarithm of the covering number, , as the metric entropy.
The following basic facts about the covering number are very useful:
Lemma 6.
We have the following facts:
-
1.
if , then ;
-
2.
;
-
3.
for every bounded function , ;
-
4.
for every , .
2.3 Distinguishing Advantages as Inner Products and Norms
The inner products and norms provide convenient ways for describing distinguishing advantages. Given a distinguisher and two predictors , Lemma 5 tells us that
Using these representations for advantages, we can prove the following lemma relating advantages to the error:
Lemma 7.
It holds that . Moreover, when , .
Proof.
To prove the first statement, we recall that
Because for all and , we are guaranteed that , which gives
as desired.
For the second statement, consider the distinguisher defined such that if and otherwise. For all , distinguisher satisfies
Therefore,
Since , this proves the second statement. ∎
2.4 Fat-Shattering Dimension
Given a function class and , we say are -fat shattered by w.r.t. if for every , there exists such that
We sometimes omit the mention of and say is -fat shattered by if such exist. The -fat-shattering dimension of introduced first by Kearns and Schapire [1994] is defined to be
3 Sample Complexity of Distribution-Specific Realizable OI
In this section, we give lower and upper bounds for the sample complexity of distribution-specific realizable OI for every given predictor class , distinguisher class , and distribution over individuals. Our lower bound is based on the metric entropy of w.r.t. the norm , whereas in our upper bound the roles of and are flipped. In the next section, we remove this role flip and give a complete characterization of the sample complexity using a version of metric entropy duality implied from combining our lower and upper bounds.
3.1 Lower Bound
We prove the following lemma showing that the sample complexity of distribution-specific realizable OI is lower bounded by the metric entropy of the predictor class w.r.t. the dual Minkowski norm defined by the distinguisher class . This lemma generalizes [Benedek and Itai, 1991, Lemma 4.8], which considered the special case where every predictor is a binary classifier, and the distinguisher class contains all possible distinguishers ().
Lemma 8.
For every predictor class , every distinguisher class , every distribution , every advantage bound , and every failure probability bound , the following sample complexity lower bound holds for distribution-specific realizable OI:
Proof.
Define . Let be the maximum-size subset of such that
(8) |
It is clear that because otherwise is not a -covering of and we can add one more predictor into without violating (8), a contradiction with the maximality of .
Let be a nonnegative integer such that . Our goal is to prove
(9) |
Let be an algorithm in . We draw a predictor uniformly at random from , and draw examples i.i.d. from . We say algorithm succeeds if it outputs such that . By assumption, when are given as input, algorithm succeeds with probability at least . Now instead of drawing i.i.d. from , we fix them so that the success probability is maximized. In other words, we can find fixed such that if we run algorithm on examples where and is chosen uniformly at random from , the algorithm succeeds with probability at least . Similarly, we can fix the internal randomness of algorithm and assume that is deterministic without decreasing its success probability on . Now algorithm has at most possible inputs, and thus has at most possible outputs. No output can be the success output for two different choices of from because of (8). Therefore, the success probability of algorithm is at most , and thus
This implies (9), as desired. ∎
3.2 Upper Bound
We give an algorithm for distribution-specific realizable OI to prove a sample complexity upper bound for it. Before we describe our algorithm, let us briefly discuss the empirical risk minimization algorithm (Algorithm 1). Benedek and Itai [1991, Proof of Lemma 4.6] showed that this natural algorithm works in the special case where 1) every predictor in is a binary classifier, and 2) the distinguisher class contains all possible distinguishers. When both 1) and 2) are satisfied, the algorithm gives a sample complexity upper bound of
(10) |
which would give a satisfactory sample complexity characterization when combined with our lower bound in Lemma 8. However, in Appendix A, we show that the algorithm fails when only one of the two conditions 1) and 2) (no matter which) is true.333 There is a variant of Algorithm 1 that minimizes over the entire predictor class instead of the covering . As discussed in [Benedek and Itai, 1991], this variant is not guaranteed to give a sample complexity upper bound close to (10) even under both conditions 1) and 2). Changing the definition of to (mimicking Algorithm 2) also makes Algorithm 1 fail under both conditions 1) and 2). To see this, suppose is the uniform distribution over a finite domain , where and for every , and . Assuming and , when the target predictor is , Algorithm 1 always outputs on the new (note that changing the values of on can significantly change , but it never changes by more than ). Since neither 1) nor 2) is guaranteed to hold in the distribution-specific realizable OI setting, we use a new algorithm (Algorithm 2) to prove our sample complexity upper bound (Lemma 10) where the roles of and flip compared to (10). In Section 4, we show how to flip them back to get a sample complexity characterization for distribution-specific realizable OI.
Our sample complexity upper bound is based on the following analysis of Algorithm 2:
Lemma 9.
For every predictor class , every distinguisher class , every distribution , every advantage bound , and every failure probability bound , there exists a positive integer
(11) | ||||
(12) |
such that Algorithm 2 belongs to
where .
Proof.
Define and for an arbitrary . It remains to prove that Algorithm 2 belongs to
for some determined below. Given examples , we define for every distinguisher , where is the minimum-size covering of computed in Algorithm 2. By definition, , so by the Chernoff bound and the union bound, for some , with probability at least ,
(13) |
Assuming that (13) is true, it suffices to prove that the output of Algorithm 2 satisfies . Let be a predictor that satisfies . We have
The output predictor of Algorithm 2 satisfies
Therefore,
Since and is an -covering w.r.t. ,
Finally,
as desired. ∎
We are now ready to state and prove our sample complexity upper bound for distribution-specific realizable OI.
Lemma 10.
For every predictor class , every distinguisher class , every distribution , every advantage bound , and every failure probability bound , the following sample complexity upper bound holds for distribution-specific realizable OI:
where .
4 Metric Entropy Duality
In the previous section, we proved lower and upper bounds for the sample complexity of distribution-specific realizable OI, but these bounds do not yet give a satisfactory sample complexity characterization because of the exchanging roles of the predictor class and the distinguisher class in the lower and upper bounds. In this section, we solve the issue by proving the following theorem:
Theorem 11.
There exists an absolute constant with the following property. For , let and be two non-empty bounded function classes. For any distribution and any , it holds that
Before we prove this theorem, we note that it has a similar statement to the long-standing metric entropy duality conjecture proposed first by Pietsch [1972]. The conjecture can be stated as follows using our notations [for other equivalent statements of the conjecture, see e.g. Artstein et al., 2004a]:
Conjecture 12 (Pietsch [1972]).
There exist absolute constants with the following property. For any two bounded function classes over a non-empty finite set , if and are convex and symmetric (i.e. ), then for any distribution and any ,
Compared to Conjecture 12, our Theorem 11 gives a variant of metric entropy duality which does not require and to be convex and symmetric, but has the constant in Conjecture 12 replaced by a quantity dependent on the granularity and the scale of the functions in and . Since the predictor class and the distinguisher class are not in general convex and symmetric, our Theorem 11 is more convenient for proving sample complexity characterizations for OI (see Theorem 13). In Lemma 14, we show that the quadratic dependence on in Theorem 11 is nearly tight.
Below we prove Theorem 11 by combining our lower and upper bounds in the previous section.
Proof.
We are now ready to state and prove our sample complexity characterizations for distribution-specific realizable OI.
Theorem 13.
For every predictor class , every distinguisher class , every distribution , every advantage bound , and every failure probability bound , the following sample complexity characterizations hold for distribution-specific realizable OI:
and
Since is monotone w.r.t. (Lemma 6 Item 1), compared to -error-based learning which corresponds to (see Lemma 7), Theorem 13 shows that a selected class of distinguishers helps reduce the sample complexity, or allows us to achieve smaller (and potentially better performance guarantees) with the same sample size. We prove Theorem 13 below.
Proof.
We start with the following chain of inequalities:
(by Theorem 11) | ||||
(by Lemma 8) | ||||
(15) |
where the last inequality is by Lemma 10. It remains to show
(16) |
If , it is clear that , so the inequality is trivial. If , we have the following inequality by Theorem 11:
Inequality (16) is proved by plugging the inequality above into (15). ∎
We complete the section by showing near-tightness of Theorem 11.
Lemma 14.
There exists a constant such that for all , there exist a ground set , a distribution over , and function classes such that
Proof.
By Lemma 6 Item 4, we can assume w.l.o.g. that . Now we have , and . Let be the largest integer power of such that . We choose to be , and choose to be the uniform distribution over . Let be the bijection from to such that for all . Define to be the set of vectors in consisting of the columns of the Hadamard matrix of size . We choose , and . The intuition behind the choice of is to keep small while making large, which by Lemma 6 Items 1 and 2 roughly corresponds to making the symmetric convex hull large. That is why we use the Hadamard matrix to ensure that the functions in achieve the maximum norm (with for every ) and are orthogonal to each other.
Define . By the properties of the Hadamard matrix, the functions in form an orthonormal basis of w.r.t. the inner product . This implies
(17) |
Let be an -covering of w.r.t. norm such that . By the definition of -covering,
For a function class , we define . Now we have , which implies that the volume of is at most times the volume of .
5 Sample Complexity of Distribution-Free OI
In this section, we consider the distribution-free setting where the OI learner has no knowledge of the distribution , and its performance is measured on the worst-case distribution . We focus on the case where and characterize the sample complexity of OI in this setting using the fat-shattering dimension of (Theorem 15). Without the assumption that , we give a sample complexity upper bound for realizable OI in Remark 16 and leave the intriguing question of sample complexity characterization to future work.
Theorem 15.
For every distinguisher class , every advantage bound , and every failure probability bound , the following sample complexity characterization holds for distribution-free OI:
(20) |
This theorem is a direct corollary of the sample complexity upper bound (Lemma 19) and lower bound (Lemma 23) we prove in the remaining of this section.444 If we assume and , it is rather straightforward to show a sample complexity lower bound of by a reduction from estimating the bias of a coin from independent tosses. We omit the details as our focus is on the dependence on , rather than on and . With some modifications, Theorem 15 also extends to multicalibration (see Remarks 21 and 24).
Remark 16.
If we drop the assumption that and assume instead that , by Lemma 7, our error objective becomes the error. By the ideas and results in [Bartlett et al., 1996, Bartlett and Long, 1995], it holds that
Combining this with (20) and using the monotonicity (6) of the sample complexity of realizable OI, without assuming or , we have
This, however, does not give a sample complexity characterization for distribution-free realizable OI because when and are both infinite, it is still possible to have being finite (see the example in Section 1.1).
5.1 Upper Bound
We prove our upper bound using the multiaccuracy boost algorithm (Algorithm 3). Kim et al. [2019] used the algorithm to show a sample complexity upper bound for multiaccuracy based on an abstract notion of dimension of the distinguisher class. We make their upper bound concrete using the fat-shattering dimension and match it with a lower bound in Section 5.2. The following standard uniform convergence result based on the fat-shattering dimension is crucial for our upper bound:
Lemma 17 (Uniform convergence from fat shattering [Bartlett and Long, 1995]).
Let be a function class. For every there exists such that
and for every probability distribution over ,
where are drawn i.i.d. from and for all .
Lemma 18.
Let be a distinguisher class. For every there exists such that
and for every distribution over and every predictor ,
where are drawn i.i.d. from .
Proof.
For every , define to be a function that maps to . Define .
We show that for all . Consider that are -fat shattered by . Since every in maps to for all , we know that . This then implies that are -fat shattered by . Conversely, if are -fat shattered by , it is clear that are -fat shattered by .
The lemma then follows directly from applying Lemma 17 with . ∎
Now we state our sample complexity upper bound for distribution-free OI (Lemma 19) and prove it using Algorithm 3 and a helper lemma (Lemma 20).
Lemma 19.
For every distinguisher class , every advantage bound , and every failure probability bound , the following sample complexity upper bound holds for distribution-free OI:
Lemma 20.
Proof.
Now we finish the proof of Lemma 19.
Proof.
We first consider the relatively simple case where . By the definition of the fat-shattering dimension, there exists such that
(21) |
Therefore, for every fixed distribution and target predictor , as long as the learned predictor satisfies , we get the desired error bound . Decomposing with and for every , we aim for the stronger goal that and .
Given input examples drawn i.i.d. from , we first compute and . It is clear that , so there exists such that . Similarly, we compute satisfying and define so that . We output with if and otherwise.
By the Chernoff bound and the union bound, we can choose such that with probability at least , both of the following inequalities hold:
(22) | |||
(23) |
When multiplied by , (23) implies . Combining this with (22), with probability at least , . Similarly, with probability at least , . By the union bound, with probability at least , and both hold as desired, so
We thus assume from now on. We use Algorithm 3 with for a positive integer chosen as follows. Given input examples drawn i.i.d. from for some and , by Lemma 18 and the union bound, we can choose
so that with probability at least , every time Line 3 is executed, we have
(24) | |||
(25) |
Assuming this is true, when the output predictor is returned at Line 3, we have
as desired. Moreover, (24) and (25) imply before every time Line 3 is executed, so by Lemma 20, when the output predictor is returned at Line 3,
as desired. Therefore,
(by ) |
∎
Remark 21.
Hébert-Johnson et al. [2018] used a modified version of Algorithm 3 for multicalibration, and indeed, this gives a sample complexity upper bound for multicalibration similar to Lemma 19. In multicalibration, the only difference is in the error objective: the interval is partitioned into subsets for some , and given a distribution and a distinguisher class , we measure the error of a learned predictor w.r.t. the target predictor by
(26) |
For , suppose our goal is to achieve with probability at least in the distribution-free setting. Changing Line 3 of Algorithm 3 to
and changing Line 3 to
we can prove a sample complexity upper bound of
This bound follows from combining the proof of Lemma 19 with the observation that for every predictor , the following distinguisher class
satisfies for every .
5.2 Lower Bound
We first prove the following lemma showing that the fat-shattering dimension gives a lower bound for the metric entropy on a particular distribution .
Lemma 22.
If are -fat shattered by , then , where is the uniform distribution over .
Proof.
We can assume without loss of generality that . The assumption that are -fat shattered by implies the existence of a function such that for every function , there exists a distinguisher satisfying
(27) |
We first show that is a subset of the symmetric convex hull . Assume for the sake of contradiction that some does not belong to . In particular, does not belong to the symmetric convex hull of . By the hyperplane separation theorem, there exists such that
(28) |
Consider the function such that if and only if . For with , we have by (27) and thus . Similarly, for with we have . In both cases, we have , a contradiction with (28).
Now we have proved that . By the symmetry of , . Then by the convexity of , .
Combining the lemma above with Lemma 8, we obtain the following sample complexity lower bound for distribution-free OI.
Lemma 23.
For every distinguisher class , every advantage bound , and every failure probability bound , the following sample complexity lower bound holds for distribution-free OI:
Proof.
Remark 24.
It is clear that the error objective for multicalibration defined in (26) satisfies
so Lemma 23 directly implies a sample complexity lower bound for multicalibration. Specifically, assuming the predictor class is , if we want to achieve with probability at least in the distribution-free setting, the sample complexity is at least
6 Separation between Agnostic and Realizable OI
The sample complexities of realizable and agnostic learning have the same characterization in many settings. They are both characterized by the VC dimension in distribution-free PAC learning [Vapnik and Chervonenkis, 1971], whereas in distribution-specific PAC learning they are both characterized by the metric entropy (this characterization was proved in the realizable setting by Benedek and Itai [1991], and it extends straightforwardly to the agnostic setting (see Lemma 26)). Recently, an independent work by Hopkins et al. [2021] shows that this relationship between realizable and agnostic learning holds very broadly for a unified reason.
In this section, we study this relationship between realizable and agnostic learning in outcome indistinguishability. In contrast to many common learning settings including PAC learning, we show a strong separation between the sample complexities of realizable OI and agnostic OI in both the distribution-free and the distribution-specific settings.
6.1 No separation with additional assumptions
Before we present the sample complexity separation between realizable and agnostic OI, we first discuss several assumptions that make the separation disappear in the following two lemmas.
In the first lemma (Lemma 25), we make the assumption that the target predictor belongs to the symmetric convex hull of .
Lemma 25 (Inside the symmetric convex hull).
For every predictor class , every distinguisher class , every distribution , every advantage bound , and every failure probability bound , there exist a nonnegative integer and an algorithm such that
In the second lemma (Lemma 26), we make the assumption that (i.e. we consider the error, see Lemma 7), and that either only contains binary classifiers or the target predictor is a binary classifier.
Lemma 26 (Binary classifiers with error).
Assume . For every predictor class , every distribution , every advantage bound and every failure probability bound , there exists a positive integer such that
and for every target predictor , Algorithm 1 belongs to
whenever or .
Proof.
We choose so that by the Chernoff bound and the union bound, with probability at least over drawn i.i.d. from , for all ,
(33) |
It remains to prove that whenever the input to Algorithm 1 satisfies the inequality above, the output satisfies assuming or .
6.2 Separation without additional assumptions
Without the additional assumptions in Lemma 25 and Lemma 26, we give examples where the sample complexity of agnostic OI can be arbitrarily larger than that of realizable OI in both the distribution-specific and the distribution-free settings. Given a positive integer , we choose , and choose the predictor class to consist of only two predictors and where and for all .
We first show that examples are sufficient for distribution-free agnostic OI as long as .
Lemma 27.
For all , there exists a positive integer such that for all and defined as above, there exists an algorithm .
Proof.
We choose algorithm $L$ to be the following simple algorithm: after seeing the examples, it computes a ratio of two empirical counts.
If the denominator is zero, the ratio is defined to be zero. Algorithm $L$ outputs the predictor whose values are determined by this ratio.
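As a minimal sketch of one plausible reading of this algorithm (the distinguished individual `x_star`, the default value $0$, and the constant value off `x_star` are our assumptions, not the paper's): the algorithm reports the empirical mean outcome among the samples that hit `x_star`.

```python
from typing import Callable, Hashable, List, Tuple

def learn(x_star: Hashable,
          examples: List[Tuple[Hashable, int]]) -> Callable[[Hashable], float]:
    """Empirical conditional mean at x_star, with the zero-denominator
    fallback from the text; the returned predictor copies the estimate
    at x_star and is constant elsewhere (an assumption of this sketch)."""
    hits = [o for (x, o) in examples if x == x_star]
    estimate = sum(hits) / len(hits) if hits else 0.0
    return lambda x: estimate if x == x_star else 0.5

# Usage: learn("star", [("star", 1), ("u", 0), ("star", 0)])("star") == 0.5
```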
It remains to show that when the input examples are drawn i.i.d. according to $\mathcal{D}$ and some target predictor, the output of algorithm $L$ satisfies the following with probability at least $1-\delta$:
(37)
Let $\mu$ denote the relevant probability mass in distribution $\mathcal{D}$. By Lemma 7,
(38)
If $\mu$ is sufficiently small, inequality (38) implies that (37) is always true. Otherwise, we choose $n$ so that the following two conditions hold:

1. by the Chernoff bound, with probability at least $1-\delta/2$, the number of examples relevant to the empirical ratio is at least half its expectation $\mu n$;
2. $\mu n$ is at least a sufficiently large absolute constant, determined later.

When combined, the two conditions guarantee that, with probability at least $1-\delta/2$, enough relevant examples are available. Conditioned on this being true, by the Chernoff bound, choosing $n$ sufficiently large guarantees that with probability at least $1-\delta/2$, the empirical ratio is accurate enough for (37).
Combining this with (38), we know (37) holds with probability at least $1-\delta$, as desired. ∎
By the monotonicity properties of sample complexities (see (6) and (7)), the lemma above implies that for all $\varepsilon$ and $\delta$,
(39)
and
(40)
Now we show that for a specific distribution $\mathcal{D}$ and a particular distinguisher class $\mathcal{A}$, distribution-specific agnostic OI requires a number of examples that grows with $d$ (Lemma 28). Sending $d$ to infinity and comparing with (40), this implies that the sample complexity of agnostic OI can be arbitrarily large even when the sample complexity of realizable OI is bounded by a constant, in both the distribution-specific and the distribution-free settings. Comparing with (39), we also see that the sample complexity of agnostic OI is not monotone w.r.t. the distinguisher class in either setting.
Before we describe the $\mathcal{D}$ and $\mathcal{A}$ used in Lemma 28, we need the following definitions. Let $\mathcal{F}$ denote the set of all functions on $\{0,1\}^d$ taking values in $\{-1,1\}$. A function is a parity function if, for some subset $S$ of the coordinates, it equals the character determined by $S$ everywhere on $\{0,1\}^d$. A function is an anti-parity function if it is the pointwise negation of some parity function. Let $\mathcal{F}_{\mathrm{par}}$ (resp. $\mathcal{F}_{\mathrm{anti}}$) denote the set of parity (resp. anti-parity) functions.
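Concretely, in the $\pm 1$ character convention (the encoding here is our assumption; the paper fixes its own), a parity is $\chi_S(x) = (-1)^{\sum_{i \in S} x_i}$ and an anti-parity is its pointwise negation:

```python
from itertools import chain, combinations

def parity(S):
    """The character chi_S(x) = (-1)^{sum_{i in S} x_i} on x in {0,1}^d."""
    return lambda x: (-1) ** sum(x[i] for i in S)

def anti_parity(S):
    """The pointwise negation -chi_S of the parity indexed by S."""
    p = parity(S)
    return lambda x: -p(x)

# All 2^d parity functions over {0,1}^d for d = 3.
d = 3
subsets = chain.from_iterable(combinations(range(d), k) for k in range(d + 1))
parities = [parity(S) for S in subsets]
assert len(parities) == 2 ** d
assert parities[0]((1, 0, 1)) == 1  # chi_emptyset is identically 1
```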
We choose $\mathcal{D}$ so that it puts part of its probability mass on a distinguished individual, and spreads the remaining probability mass evenly on $\{0,1\}^d$. We choose $\mathcal{A}$ to contain distinguishers as follows: for every parity function $f$, there is a distinguisher $a_f$ whose behavior on $\{0,1\}^d$ is determined by $f$.
Lemma 28.
For $\mathcal{D}$ and $\mathcal{A}$ defined as above, distribution-specific agnostic OI requires a number of examples growing linearly in $d$.
To prove the lemma, we first consider the following task of list-learning parity and anti-parity functions, which we abbreviate as ListL. Given a positive integer list size $s$, in the task of ListL, an algorithm takes as input examples $(x_1, y_1), \ldots, (x_n, y_n)$ where the $x_i$'s are drawn independently and uniformly from $\{0,1\}^d$, and $y_i = f(x_i)$ for some unknown $f \in \mathcal{F}_{\mathrm{par}} \cup \mathcal{F}_{\mathrm{anti}}$, and the algorithm outputs a subset (i.e., list) $F \subseteq \mathcal{F}$. The algorithm succeeds if $f \in F$ and $|F| \le s$.
Lemma 29.
Assuming the number of examples $n$ and the list size $s$ satisfy $s \cdot 2^{n-d} \le 1/2$, no algorithm has failure probability smaller than a fixed positive constant for all choices of $f$ in the task ListL.
Proof.
Assume that for some $n$ and $s$, there exists an algorithm that has failure probability below this constant for all $f$. In particular, when $f$ is drawn uniformly at random from $\mathcal{F}_{\mathrm{par}} \cup \mathcal{F}_{\mathrm{anti}}$, the failure probability of the algorithm is below the same constant. For this fixed distribution of $f$, we can assume that the algorithm is deterministic without increasing its failure probability. By the law of total probability, the failure probability decomposes according to the draw of the examples.
Here, $(x_1, y_1), \ldots, (x_n, y_n)$ are the input examples to the algorithm, where $x_1, \ldots, x_n$ are drawn independently and uniformly from $\{0,1\}^d$, and $y_i = f(x_i)$ for every $i$, with $f$ drawn uniformly at random from $\mathcal{F}_{\mathrm{par}} \cup \mathcal{F}_{\mathrm{anti}}$.
With probability at least $1 - 2^{n-d}$ over the draw of $x_1, \ldots, x_n$, the points $x_1, \ldots, x_n$ are linearly independent in $\mathbb{F}_2^d$, in which case the conditional distribution of $f$ given the examples is the uniform distribution over a set of consistent functions of size at least $2^{d-n}$, and thus any list of size $s$ contains $f$ with conditional probability at most $s \cdot 2^{n-d}$. Therefore, the failure probability is at least $(1 - 2^{n-d})(1 - s \cdot 2^{n-d})$, contradicting our assumption, as desired. ∎
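A quick numeric sanity check of the two counting facts used above (a sketch; the exact constants are as in the source): $n$ uniform points in $\mathbb{F}_2^d$ are linearly independent with probability $\prod_{i<n}(1 - 2^{i-d})$, and conditioned on independence the labels leave exactly $2^{d-n}$ consistent sign patterns, so a size-$s$ list succeeds with probability at most $s \cdot 2^{n-d}$.

```python
import random

def gf2_rank(vectors, d):
    """Rank over GF(2) of integer bitmasks of length d (Gaussian elimination)."""
    basis = [0] * d  # basis[b] has leading bit b, or 0 if unused
    rank = 0
    for v in vectors:
        for bit in reversed(range(d)):
            if not (v >> bit) & 1:
                continue
            if basis[bit]:
                v ^= basis[bit]
            else:
                basis[bit] = v
                rank += 1
                break
    return rank

rng = random.Random(0)
d, n, trials = 20, 10, 2000
indep = sum(gf2_rank([rng.getrandbits(d) for _ in range(n)], d) == n
            for _ in range(trials)) / trials
lower = 1 - 2 ** (n - d)  # crude lower bound on the independence probability
print(indep, "vs bound", lower)  # empirically ~0.999
```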
We are now ready to prove Lemma 28. Let $L$ be an algorithm for distribution-specific agnostic OI with $n$ examples, for a nonnegative integer $n$ and a positive advantage bound. We use this algorithm to build an algorithm for ListL. Suppose $(x_1, y_1), \ldots, (x_n, y_n)$ are the input examples in ListL, where $x_1, \ldots, x_n$ are drawn independently and uniformly from $\{0,1\}^d$, and $y_i = f(x_i)$ for some $f \in \mathcal{F}_{\mathrm{par}} \cup \mathcal{F}_{\mathrm{anti}}$.
Before we construct the algorithm for ListL, we need the following definition. For every $f \in \mathcal{F}$, we define a predictor $p_f$ whose values on $\{0,1\}^d$ encode $f$, and which takes a fixed default value on the rest of the domain.
Now we construct the algorithm for ListL using algorithm $L$. We start by generating the input examples to algorithm $L$: independently for every $i$, we set the $i$-th example to one of three options with the probabilities dictated by $\mathcal{D}$, using $(x_i, y_i)$ where the option calls for it. After running algorithm $L$ and obtaining its output predictor $q$, we return the list
(41)
We prove the following two helper lemmas before we finish the proof of Lemma 28.
Lemma 30.
If $f \in \mathcal{F}_{\mathrm{par}}$, then the list defined in (41) has size within the bound required by ListL. Similarly, if $f \in \mathcal{F}_{\mathrm{anti}}$, then the same size bound holds.
Proof.
We prove the first half of the lemma; the second half follows from a similar argument. Pick any function $g$ appearing in the list (41), and pick a predictor $q$ witnessing its membership. Define $p$ to be the predictor that maps everything to a fixed constant, and define $\mathcal{D}'$ to be the uniform distribution over $\{0,1\}^d$. Computing the relevant distinguishing advantages and combining the resulting bounds, we obtain
(42)
However, the predictors $p_g$ for parity functions $g$, when restricted to the sub-domain $\{0,1\}^d$ and suitably normalized, form an orthonormal basis w.r.t. the inner product $\langle u, v \rangle = \mathbb{E}_{x \sim \mathcal{D}'}[u(x) v(x)]$, so by Parseval's identity there can be at most boundedly many different $g$ that satisfy (42). Since $p_g$ is defined differently for different $g$, we get the required bound on the list size, as desired. ∎
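The counting step is just Parseval's identity over the Boolean cube; a small numeric check (a sketch in the $\pm 1$ convention assumed earlier; the threshold $\varepsilon$ is illustrative): since the characters are orthonormal and $\sum_S \langle g, \chi_S \rangle^2 = \|g\|_2^2 \le 1$ for $g$ bounded by $1$, at most $1/\varepsilon^2$ of them can correlate with a fixed $g$ by at least $\varepsilon$.

```python
import itertools

d = 4
cube = list(itertools.product([0, 1], repeat=d))

def chi(S):
    return [(-1) ** sum(x[i] for i in S) for x in cube]

chars = [chi(S) for k in range(d + 1)
         for S in itertools.combinations(range(d), k)]

def inner(f, g):  # expectation inner product under the uniform distribution
    return sum(a * b for a, b in zip(f, g)) / len(cube)

# Orthonormality of the characters.
for i, f in enumerate(chars):
    for j, g in enumerate(chars):
        assert abs(inner(f, g) - (i == j)) < 1e-9

# Parseval caps the number of eps-correlated characters by 1/eps^2.
g = [1 if sum(x) >= d / 2 else -1 for x in cube]  # an arbitrary +-1 function
eps = 0.5
close = [f for f in chars if abs(inner(g, f)) >= eps]
assert len(close) <= 1 / eps ** 2
print(len(close), "characters have correlation at least", eps)
```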
Lemma 31.
The event that the list (41) contains $f$ happens with probability at least $1-\delta$, where $\delta$ is the failure probability bound of algorithm $L$.
Proof.
By the definition of the list in (41), the lemma is trivial when the defining condition holds vacuously, so we assume it does not.
Define $\mathcal{D}'$ to be the uniform distribution over $\{0,1\}^d$. If $f$ is a parity function, then for the predictor $p_f$ defined above we have a small distinguishing advantage against the target, so the best OI error within the comparison class is small.
For all other predictors under consideration, the distinguishing advantage is large, so they are far from the target.
Therefore, the target is well approximated within the comparison class. Similarly, we can show the analogous statement when $f$ is an anti-parity function.
Since the input examples to algorithm $L$ are generated i.i.d. from the distribution matching its guarantee, with probability at least $1-\delta$, algorithm $L$ outputs a predictor $q$ satisfying the condition that places $f$ in the list (41), in which case $f$ belongs to the list, as desired. ∎
Now we complete the proof of Lemma 28. By Lemma 30 and Lemma 31, the algorithm constructed above solves ListL with a bounded list size and with small failure probability. By Lemma 29, this is impossible unless the number of examples $n$ grows with $d$, which yields the claimed sample complexity lower bound. ∎
References
- Alon et al. [1997] Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. J. ACM, 44(4):615–631, 1997. ISSN 0004-5411. doi: 10.1145/263867.263927. URL https://doi.org/10.1145/263867.263927.
- Artstein et al. [2004a] S. Artstein, V. Milman, S. Szarek, and N. Tomczak-Jaegermann. On convexified packing and entropy duality. Geom. Funct. Anal., 14(5):1134–1141, 2004a. ISSN 1016-443X. doi: 10.1007/s00039-004-0486-3. URL https://doi.org/10.1007/s00039-004-0486-3.
- Artstein et al. [2004b] S. Artstein, V. Milman, and S. J. Szarek. Duality of metric entropy. Ann. of Math. (2), 159(3):1313–1328, 2004b. ISSN 0003-486X. doi: 10.4007/annals.2004.159.1313. URL https://doi.org/10.4007/annals.2004.159.1313.
- Bartlett and Long [1995] Peter L Bartlett and Philip M Long. More theorems about scale-sensitive dimensions and learning. In Proceedings of the eighth annual conference on Computational learning theory, pages 392–401, 1995.
- Bartlett and Long [1998] Peter L. Bartlett and Philip M. Long. Prediction, learning, uniform convergence, and scale-sensitive dimensions. J. Comput. System Sci., 56(2):174–190, 1998. doi: 10.1006/jcss.1997.1557. URL https://doi.org/10.1006/jcss.1997.1557. Eighth Annual Workshop on Computational Learning Theory (COLT) (Santa Cruz, CA, 1995).
- Bartlett et al. [1996] Peter L. Bartlett, Philip M. Long, and Robert C. Williamson. Fat-shattering and the learnability of real-valued functions. J. Comput. System Sci., 52(3):434–452, 1996. doi: 10.1006/jcss.1996.0033. URL https://doi.org/10.1006/jcss.1996.0033. Seventh Annual Workshop on Computational Learning Theory (COLT) (New Brunswick, NJ, 1994).
- Benedek and Itai [1991] Gyora M. Benedek and Alon Itai. Learnability with respect to fixed distributions. Theoret. Comput. Sci., 86(2):377–389, 1991. ISSN 0304-3975. doi: 10.1016/0304-3975(91)90026-X. URL https://doi.org/10.1016/0304-3975(91)90026-X.
- Blum and Lykouris [2020] Avrim Blum and Thodoris Lykouris. Advancing subgroup fairness via sleeping experts. In 11th Innovations in Theoretical Computer Science Conference (ITCS 2020), volume 151, pages 55:1–55:24, 2020.
- Blumer et al. [1989] Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach., 36(4):929–965, 1989. ISSN 0004-5411. doi: 10.1145/76359.76371. URL https://doi.org/10.1145/76359.76371.
- Bourgain et al. [1989] J. Bourgain, A. Pajor, S. J. Szarek, and N. Tomczak-Jaegermann. On the duality problem for entropy numbers of operators. In Geometric aspects of functional analysis (1987–88), volume 1376 of Lecture Notes in Math., pages 50–63. Springer, Berlin, 1989. doi: 10.1007/BFb0090048. URL https://doi.org/10.1007/BFb0090048.
- Burhanpurkar et al. [2021] Maya Burhanpurkar, Zhun Deng, Cynthia Dwork, and Linjun Zhang. Scaffolding sets. arXiv preprint arXiv:2111.03135, 2021.
- Dwork et al. [2021] Cynthia Dwork, Michael P Kim, Omer Reingold, Guy N Rothblum, and Gal Yona. Outcome indistinguishability. In Proceedings of the 53rd Annual ACM SIGACT Symposium on Theory of Computing, pages 1095–1108, 2021.
- Gopalan et al. [2022] Parikshit Gopalan, Adam Tauman Kalai, Omer Reingold, Vatsal Sharan, and Udi Wieder. Omnipredictors. In Mark Braverman, editor, 13th Innovations in Theoretical Computer Science Conference (ITCS 2022), volume 215 of Leibniz International Proceedings in Informatics (LIPIcs), pages 79:1–79:21, Dagstuhl, Germany, 2022. Schloss Dagstuhl – Leibniz-Zentrum für Informatik. ISBN 978-3-95977-217-4. doi: 10.4230/LIPIcs.ITCS.2022.79. URL https://drops.dagstuhl.de/opus/volltexte/2022/15675.
- Hébert-Johnson et al. [2018] Ursula Hébert-Johnson, Michael Kim, Omer Reingold, and Guy Rothblum. Multicalibration: Calibration for the (computationally-identifiable) masses. In International Conference on Machine Learning, pages 1939–1948. PMLR, 2018.
- Hopkins et al. [2021] Max Hopkins, Daniel Kane, Shachar Lovett, and Gaurav Mahajan. Realizable learning is all you need. arXiv preprint arXiv:2111.04746, 2021.
- Kearns and Schapire [1994] Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts. In Computational learning theory and natural learning systems, Vol. I, Bradford Book, pages 289–329. MIT Press, Cambridge, MA, 1994.
- Kim et al. [2019] Michael P Kim, Amirata Ghorbani, and James Zou. Multiaccuracy: Black-box post-processing for fairness in classification. In Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society, pages 247–254, 2019.
- Kim et al. [2022] Michael P Kim, Christoph Kern, Shafi Goldwasser, Frauke Kreuter, and Omer Reingold. Universal adaptability: Target-independent inference that competes with propensity scoring. Proceedings of the National Academy of Sciences, 119(4), 2022.
- Li et al. [2001] Yi Li, Philip M. Long, and Aravind Srinivasan. Improved bounds on the sample complexity of learning. J. Comput. System Sci., 62(3):516–527, 2001. ISSN 0022-0000. doi: 10.1006/jcss.2000.1741. URL https://doi.org/10.1006/jcss.2000.1741.
- Linial et al. [1991] Nathan Linial, Yishay Mansour, and Ronald L. Rivest. Results on learnability and the Vapnik-Chervonenkis dimension. Inform. and Comput., 90(1):33–49, 1991. ISSN 0890-5401. doi: 10.1016/0890-5401(91)90058-A. URL https://doi.org/10.1016/0890-5401(91)90058-A.
- Milman [2007] Emanuel Milman. A remark on two duality relations. Integral Equations Operator Theory, 57(2):217–228, 2007. ISSN 0378-620X. doi: 10.1007/s00020-006-1479-4. URL https://doi.org/10.1007/s00020-006-1479-4.
- Nikolov et al. [2013] Aleksandar Nikolov, Kunal Talwar, and Li Zhang. The geometry of differential privacy: the sparse and approximate cases. In STOC’13—Proceedings of the 2013 ACM Symposium on Theory of Computing, pages 351–360. ACM, New York, 2013. doi: 10.1145/2488608.2488652. URL https://doi.org/10.1145/2488608.2488652.
- Pietsch [1972] Albrecht Pietsch. Theorie der Operatorenideale (Zusammenfassung). Wissenschaftliche Beiträge der Friedrich-Schiller-Universität Jena. Friedrich-Schiller-Universität, Jena, 1972.
- Rothblum and Yona [2021] Guy N Rothblum and Gal Yona. Multi-group agnostic PAC learnability. arXiv preprint arXiv:2105.09989, 2021.
- Shabat et al. [2020] Eliran Shabat, Lee Cohen, and Yishay Mansour. Sample complexity of uniform convergence for multicalibration. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13331–13340. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/9a96876e2f8f3dc4f3cf45f02c61c0c1-Paper.pdf.
- Valiant [1984a] L. G. Valiant. Deductive learning. Philos. Trans. Roy. Soc. London Ser. A, 312:441–446, 1984a. doi: 10.1098/rsta.1984.0069. URL https://doi.org/10.1098/rsta.1984.0069. With discussion, Mathematical logic and programming languages.
- Valiant [1984b] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984b.
- Vapnik and Chervonenkis [1971] V. N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264–280, 1971.
- Zhao et al. [2021] Shengjia Zhao, Michael Kim, Roshni Sahoo, Tengyu Ma, and Stefano Ermon. Calibrating predictions to decisions: A novel approach to multi-class calibration. Advances in Neural Information Processing Systems, 34, 2021.
Appendix A Failure of Empirical Risk Minimization
The following two lemmas show that in certain cases the failure probability of Algorithm 1 approaches $1$ (instead of $0$) when the number of examples increases.
Lemma 32.
Assume the individual set $\mathcal{X}$ is finite and $\mathcal{D}$ is the uniform distribution over $\mathcal{X}$. Assume the distinguisher class contains all distinguishers, and $\mathcal{P}$ consists of the following three predictors: the constant predictor $p_c$ that maps every individual to $c$, for $c \in \{0, 1/2, 1\}$. For every positive integer $n$, Algorithm 1 fails to achieve realizable OI from $n$ examples in this instance.
Proof.
Consider the input examples to Algorithm 1: $(x_1, o_1), \ldots, (x_n, o_n)$ drawn i.i.d., with the target predictor $p_{1/2}$, so that the outcomes $o_i$ are independent unbiased coin flips. We first show that with probability at least $1/2$,
$o_1 + \cdots + o_n \ne n/2.$
This is trivially true when $n$ is odd. When $n$ is even, this is also true because the probability of an exact tie is $\binom{n}{n/2} 2^{-n} \le 1/2$.
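The even case is the standard bound on the central binomial probability: a Binomial($n$, $1/2$) variable ties at $n/2$ with probability $\binom{n}{n/2} 2^{-n} \le 1/2$. A two-line check (a sketch, not the paper's computation):

```python
from math import comb

for n in range(2, 21, 2):
    p_tie = comb(n, n // 2) / 2 ** n  # Pr[Binomial(n, 1/2) = n/2]
    assert p_tie <= 0.5               # so Pr[sum != n/2] >= 1/2
    print(n, round(p_tie, 4))         # 0.5, 0.375, 0.3125, ... decreasing
```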
It remains to prove that whenever $o_1 + \cdots + o_n \ne n/2$, the output of Algorithm 1 differs from the target $p_{1/2}$. By Lemma 7, the distances from $p_{1/2}$ to $p_0$ and to $p_1$ are both $1/2$, so the covering used in Algorithm 1 is $\{p_0, p_1\}$. The empirical loss of each constant predictor is determined by the empirical mean of the outcomes.
When $o_1 + \cdots + o_n < n/2$, the predictor $p_0$ has the smallest loss, so Algorithm 1 returns $p_0$, in which case the output indeed differs from $p_{1/2}$ by the required margin. Similarly, when $o_1 + \cdots + o_n > n/2$, the predictor $p_1$ has the smallest loss, so Algorithm 1 returns $p_1$ and again fails. ∎
In the lemma below, we give an example of the failure of Algorithm 1 in distribution-specific realizable OI when all the predictors in $\mathcal{P}$ are binary classifiers. In this example, the instance is parametrized by two positive integers $m$ and $k$ as follows. We choose the individual set $\mathcal{X}$ accordingly, and choose the distinguisher class as follows. For every size-$k$ subset $S$ of a designated part of the domain, define $\mathcal{A}_S$ to be the set of all distinguishers whose behavior is fixed outside $S$. The distinguisher class is then defined as the union of the classes $\mathcal{A}_S$, where the union is over all size-$k$ subsets $S$. The predictor class $\mathcal{P}$ consists of binary predictors defined accordingly.
The distribution $\mathcal{D}$ spreads part of its probability mass evenly on one part of the domain, and spreads the remaining probability mass evenly on the rest.
By the metric entropy bound for this instance, Theorem 13 tells us that distribution-specific realizable OI here has small sample complexity.
The lemma below shows that this sample complexity upper bound cannot be achieved using Algorithm 1 when $k$ is large and the advantage bound is close to zero.
Lemma 33.
For every positive integer $m$, there exists a positive integer $k$ such that when $\mathcal{X}$, $\mathcal{A}$, $\mathcal{P}$, $\mathcal{D}$ are defined as above, Algorithm 1 fails to match the sample complexity upper bound above.
Proof.
By choosing $k$ sufficiently large, we get
(43)
(44)
Suppose Algorithm 1 achieves the guarantee from some number of examples $n$. Consider the case where the input to Algorithm 1 is $n$ examples drawn i.i.d. with the target predictor drawn uniformly at random from $\mathcal{P}$. By assumption, the output of Algorithm 1 is close to the target with the stated probability. Since Algorithm 1 only outputs predictors from its covering, the output must essentially identify the target. However, with constant probability, all the sampled individuals belong to the part of the domain where the predictors agree, in which case the examples give no information about the target. Therefore, the success probability is bounded away from one, which implies
(45)
Inequalities (43) and (44) imply that the covering computed in Algorithm 1 is one of two particular sets. Without loss of generality, we assume the first case, because the other case can be handled similarly. Now consider the case where the input examples are drawn i.i.d. with a fixed target predictor from $\mathcal{P}$.
By our construction of the distinguisher class, the loss that Algorithm 1 assigns to every predictor in the covering is an empirical average.
By the Chernoff bound and the union bound, with probability close to one, the absolute difference between each empirical loss and its expectation is small for all predictors in the covering. Comparing the expected losses, we find that,
with probability close to one, Algorithm 1 returns a predictor that does not satisfy the realizable OI guarantee. This implies
(46)
Appendix B A Helper Lemma
Lemma 34.
Let $\mathcal{D}$ be the uniform distribution over a set $\mathcal{X}$ of individuals with size $n$. Then for all $\varepsilon > 0$, every $\varepsilon$-covering of the class of all predictors w.r.t. the distance induced by $\mathcal{D}$ has size at least $(2e\varepsilon)^{-n}$,
where $e$ is the base of the natural logarithm.
Proof.
Suppose $\mathcal{X} = \{z_1, \ldots, z_n\}$. Let $\Phi$ be the bijection from the class of all predictors to $[0,1]^n$ such that $\Phi(p) = (p(z_1), \ldots, p(z_n))$ for all predictors $p$. Let $Q$ be an $\varepsilon$-covering of the class of all predictors w.r.t. the distance induced by $\mathcal{D}$. Defining, for every $q \in Q$, the set of points of $[0,1]^n$ covered by $q$,
we have that these sets together cover $[0,1]^n$, which implies that their total volume is at least the volume of $[0,1]^n$. Therefore, the volume of $[0,1]^n$ is at most $|Q|$ times the volume of a single $\varepsilon$-ball.
It is clear that $[0,1]^n$ has volume $1$ in $\mathbb{R}^n$. Moreover, by Lemma 7, the $\varepsilon$-ball corresponds under $\Phi$ to an $\ell_1$ ball of radius $\varepsilon n$ in $\mathbb{R}^n$,
which has volume $(2\varepsilon n)^n / n!$. Therefore,
$|Q| \ge \frac{n!}{(2\varepsilon n)^n} \ge \frac{(n/e)^n}{(2\varepsilon n)^n} = \left(\frac{1}{2e\varepsilon}\right)^n,$
as desired. We used the fact $n! \ge (n/e)^n$ in the last inequality. ∎
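The chain of inequalities can be sanity-checked numerically (a sketch; the only facts used are the $\ell_1$-ball volume $(2r)^n / n!$ and $n! \ge (n/e)^n$):

```python
import math

def log2_covering_lower_bound(n: int, eps: float) -> float:
    """log2 of vol([0,1]^n) / vol(ball), where the ball of radius eps in the
    normalized l1 distance is an l1 ball of radius eps * n with volume
    (2 * eps * n)^n / n!."""
    return math.log2(math.factorial(n)) - n * math.log2(2 * eps * n)

n, eps = 50, 0.05
exact = log2_covering_lower_bound(n, eps)
relaxed = n * math.log2(1 / (2 * math.e * eps))  # via n! >= (n / e)^n
assert exact >= relaxed - 1e-9
print(round(exact, 1), ">=", round(relaxed, 1))  # ~98.1 >= ~94.0
```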