
Metric Entropy Duality and the Sample Complexity of Outcome Indistinguishability

Lunjia Hu Computer Science Department, Stanford University. Email: [email protected]. Supported by Moses Charikar’s Simons Investigator award and Omer Reingold’s NSF Award IIS-1908774.    Charlotte Peale Computer Science Department, Stanford University. Email: [email protected]. Supported by the Simons Foundation Collaboration on the Theory of Algorithmic Fairness.    Omer Reingold Computer Science Department, Stanford University. Email: [email protected]. Supported by the Simons Foundation Collaboration on the Theory of Algorithmic Fairness, the Sloan Foundation Grant 2020-13941 and the Simons Foundation investigators award 689988.
Abstract

We give the first sample complexity characterizations for outcome indistinguishability, a theoretical framework of machine learning recently introduced by Dwork, Kim, Reingold, Rothblum, and Yona (STOC 2021). In outcome indistinguishability, the goal of the learner is to output a predictor that cannot be distinguished from the target predictor by a class $\mathcal{D}$ of distinguishers examining the outcomes generated according to the predictors' predictions. While outcome indistinguishability originated from the algorithmic fairness literature, it provides a flexible objective for machine learning even when fairness is not a consideration. In this work, we view outcome indistinguishability as a relaxation of PAC learning that allows us to achieve meaningful performance guarantees under data constraints.

In the distribution-specific and realizable setting where the learner is given the data distribution together with a predictor class $\mathcal{P}$ containing the target predictor, we show that the sample complexity of outcome indistinguishability is characterized by the metric entropy of $\mathcal{P}$ w.r.t. the dual Minkowski norm defined by $\mathcal{D}$, and equivalently by the metric entropy of $\mathcal{D}$ w.r.t. the dual Minkowski norm defined by $\mathcal{P}$. This equivalence makes an intriguing connection to the long-standing metric entropy duality conjecture in convex geometry. Our sample complexity characterization implies a variant of metric entropy duality, which we show is nearly tight. In the distribution-free setting, we focus on the case considered by Dwork et al. where $\mathcal{P}$ contains all possible predictors, hence the sample complexity only depends on $\mathcal{D}$. In this setting, we show that the sample complexity of outcome indistinguishability is characterized by the fat-shattering dimension of $\mathcal{D}$.

We also show a strong sample complexity separation between realizable and agnostic outcome indistinguishability in both the distribution-free and the distribution-specific settings. This is in contrast to distribution-free (resp. distribution-specific) PAC learning where the sample complexity in both the realizable and the agnostic settings can be characterized by the VC dimension (resp. metric entropy).

1 Introduction

Prediction algorithms based on machine learning are becoming increasingly influential on major decisions that affect individuals' lives in settings such as medical diagnoses, loan applications, and educational admissions processes. As a result, we must be careful that the predictions of these algorithms do not discriminate against any sub-communities within the input population. Unfortunately, standard measures of prediction accuracy do not necessarily guarantee the absence of such discrimination. Consider a binary classifier that is used to predict an individual's probability of repaying a loan. A natural way to measure the success of our classifier would be to use classification error: the fraction of instances from some representative test set that it classifies incorrectly. Ideally, we would hope that a classifier with 5% error incorrectly classifies any particular individual in our population only 5% of the time. However, if some low-income minority group makes up 10% of the population and has a low probability (say 40%) of repaying a loan on average, a classifier that chooses to focus on getting 99% accuracy on the majority group and applies a blanket policy of classifying every individual in the minority group as unlikely to pay back the loan can still receive this 5% classification error guarantee despite incorrectly classifying minority individuals 40% of the time. If factored into loan application decisions, such a classifier could deny loans to many worthy applicants from the minority group, further cementing and potentially exacerbating the financial disparities present in the population.

This potential for discrimination was the original inspiration behind outcome indistinguishability (OI), a theoretical framework for machine learning recently proposed by Dwork, Kim, Reingold, Rothblum, and Yona [2021] that aims to address disparities in the treatment of different groups by replacing the accuracy objective with a more flexible objective that can give stronger guarantees for sub-communities in a population. OI is a framework used when learning predictors, rather than classifiers. In this setting, an analogous measure to the classification error of a learned predictor $p$ is the $\ell_1$ error considered by e.g. Bartlett et al. [1996]. Instead of using the $\ell_1$ error as a quality measure, the main idea of OI is to measure the quality of a learned predictor $p$ by how easy it is for a predetermined class $\mathcal{D}$ of distinguishers to distinguish between the output of $p$ and the output of the target predictor.

More precisely, the goal of OI is to output a predictor $p:X\rightarrow[0,1]$ assigning probabilities $p(x)\in[0,1]$ to individuals $x\in X$. Given a distribution $\mu$ over $X$, every predictor $p$ defines a distribution $\mu_{p}$ over individual-outcome pairs $(x,o)\in X\times\{0,1\}$ where the individual $x$ is drawn from $\mu$, and the outcome $o\sim\mathrm{Ber}(p(x))$ is drawn from the Bernoulli distribution with mean $p(x)$. The input to the OI learner consists of examples $(x_{1},o_{1}),\ldots,(x_{n},o_{n})$ drawn i.i.d. from $\mu_{p^{*}}$ for an unknown target predictor $p^{*}$. The output predictor is audited by a set of distinguishers $\mathcal{D}$. Here, a distinguisher takes an individual-outcome pair $(x,o)\in X\times\{0,1\}$ as input, and either accepts or rejects the input.\footnote{Dwork et al. [2021] also considered more advanced distinguishers with access to additional information about the predictors, such as the predictions themselves or the predictor code, but the simplest distinguishers we describe here can already express non-trivial objectives. Also, while we describe outcome indistinguishability in the binary outcome setting, it is possible to generalize the notion to an arbitrary number of possible outcomes by considering predictors that output a description of a more general distribution.} The distinguishing advantage of a distinguisher $d$ on two predictors $p_{1}$ and $p_{2}$ is defined as the absolute difference between the acceptance probabilities of $d$ on $(x,o)$ drawn from $\mu_{p_{1}}$ and from $\mu_{p_{2}}$. The quality of the learned predictor $p$ is then measured by the maximum distinguishing advantage (MDA) between $p$ and $p^{*}$ over all distinguishers $d\in\mathcal{D}$. In the loan repayment setting described above, adding a distinguisher that only compares the rate of positive classification on the minority group could identify and prevent this disparity for a small enough allowable MDA threshold.
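To make the distinguishing advantage and the MDA concrete, here is a minimal Python sketch (not from the paper; the toy domain, distribution, predictors, and the single distinguisher are all hypothetical) that estimates a distinguisher's advantage by Monte Carlo sampling from $\mu_{p}$ and $\mu_{p^{*}}$.

```python
# A minimal sketch (hypothetical toy instance) of estimating the distinguishing
# advantage of a single distinguisher by Monte Carlo sampling.
import random

X = ["x1", "x2", "x3"]                      # a toy set of individuals
mu = {"x1": 0.5, "x2": 0.3, "x3": 0.2}      # a distribution over X

p_star = {"x1": 0.9, "x2": 0.1, "x3": 0.5}  # target predictor p*
p_hat  = {"x1": 0.7, "x2": 0.1, "x3": 0.5}  # learned predictor p

def d(x, o):
    """A (deterministic) distinguisher: accept iff x == 'x1' and o == 1."""
    return x == "x1" and o == 1

def sample_pair(p):
    """Draw (x, o) from mu_p: x ~ mu, o ~ Ber(p(x))."""
    x = random.choices(X, weights=[mu[z] for z in X])[0]
    o = 1 if random.random() < p[x] else 0
    return x, o

def acceptance_rate(p, n=200_000):
    return sum(d(*sample_pair(p)) for _ in range(n)) / n

# Empirical distinguishing advantage adv_{mu,d}(p_hat, p_star); here it should
# be close to mu(x1) * |p_hat(x1) - p_star(x1)| = 0.5 * 0.2 = 0.1.
adv = abs(acceptance_rate(p_hat) - acceptance_rate(p_star))
print(f"estimated advantage: {adv:.3f}")
```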

It can be shown that the MDA never exceeds the $\ell_1$ error $\operatorname{\mathbb{E}}_{x\sim\mu}|p(x)-p^{*}(x)|$, and when $\mathcal{D}$ contains all possible distinguishers, the MDA and the $\ell_1$ error are known to be equal (see Lemma 7). However, by leveraging our ability to select only the distinguishers we care about in OI, we can tune the MDA to be a more suitable quality measure than the $\ell_1$ error even in settings outside of fairness. As an example, suppose we would like to learn a binary image classifier that can be used in self-driving cars to determine whether the road contains an obstruction. Ideally we would like to learn a model whose classification error is very close to zero, because it means that we can expect the car to fail to detect an obstruction extremely rarely. However, what if we have insufficient data to learn a classifier with extremely low error, and must settle for a larger error on the order of 1% or more? This is where we need to observe that not all errors are born equal. Failing to recognize leaves on the road is dramatically different from failing to identify a parked vehicle. Unfortunately, a 1% error may be completely concentrated on important obstructions (which may occur less than 1% of the time). An overall error rate that would guarantee minuscule errors in important cases may very well be impossible to achieve, and concentrating on the errors we care about may be necessary.

This example demonstrates that classification error alone may not be enough to tell us whether this is a model we would trust on the roads. In particular, the ability to refine our measure of performance to focus on the particular types of mistakes that we care about most would give us a better understanding of performance and might potentially allow us to get good targeted accuracy guarantees even with insufficient data to achieve high accuracy against all types of errors. If we believe that distinguishing between instances where a large obstruction is or is not present requires a smaller number of circuit gates than distinguishing between instances that may contain more minor obstructions such as a small tree branch or plastic bag, choosing $\mathcal{D}$ to contain all distinguishers with small circuit sizes would filter out serious errors such as missing a pedestrian crossing the street even when we cannot achieve any meaningful overall accuracy due to limited data. Moreover, if $\mathcal{D}$ contains a distinguisher that accepts $(x,o)$ if and only if $x$ contains a large obstruction and $o$ equals $1$, a predictor that significantly underestimates the true prediction $p^{*}(x)$ when $x$ contains a serious obstruction would have a high MDA. In general, when the distinguisher class $\mathcal{D}$ is properly chosen, a predictor with $\ell_1$ error and MDA both being about $0.1$ can be less preferable than a predictor with $\ell_1$ error being $0.2$ but MDA being only $0.01$.

1.1 Sample Complexity Characterizations

Traditionally, outcome indistinguishability and related notions such as multicalibration [Hébert-Johnson, Kim, Reingold, and Rothblum, 2018] have been viewed as providing stronger guarantees than $\ell_1$ error alone by allowing a predictor's performance to be fine-tuned at the group level rather than just looking at the entire population. However, our obstruction-identification example from the previous section demonstrates how OI can also be viewed as a useful relaxation of the standard $\ell_1$ performance benchmark. By focusing on the important errors specified by the distinguisher class, outcome indistinguishability may allow us to achieve good performance with a relatively small sample size. It is natural to ask: how much improvement in the sample size do we get from OI? This is the main focus of this paper: understanding the sample complexity of outcome indistinguishability.

It is one of the major objectives of learning theory to understand the sample complexity of various learning tasks, and many influential characterizations of sample complexity have been discovered. The most notable example is the VC dimension for PAC learning [Vapnik and Chervonenkis, 1971, Blumer, Ehrenfeucht, Haussler, and Warmuth, 1989, Linial, Mansour, and Rivest, 1991]. In PAC learning [Valiant, 1984b, a], the learner receives examples $(x_{1},h^{*}(x_{1})),\ldots,(x_{n},h^{*}(x_{n}))$ where $x_{1},\ldots,x_{n}$ are drawn i.i.d. from a distribution $\mu$ over $X$, and aims to output a binary classifier (a.k.a. concept or hypothesis) $h:X\rightarrow\{0,1\}$ that is “close” to the target classifier $h^{*}:X\rightarrow\{0,1\}$. Here, the performance of $h$ is measured by the classification error $\Pr_{x\sim\mu}[h(x)\neq h^{*}(x)]$. In the realizable setting, the target classifier $h^{*}$ is assumed to be from a given class $\mathcal{H}$, whereas in the agnostic setting, there is no assumption on $h^{*}$ but the performance of $h$ is measured relative to the best classifier in the class $\mathcal{H}$. The main result of VC theory states that the sample complexity of PAC learning in both the realizable and the agnostic settings is characterized by the VC dimension [Vapnik and Chervonenkis, 1971] of the class $\mathcal{H}$, which has a simple combinatorial definition. The success of VC theory raises the possibility that the sample complexity of other learning tasks may also have natural characterizations. Below we discuss two such learning tasks that are most relevant to our work: predictor learning and distribution-specific learning.

The extension of PAC learning to predictor learning dates back to [Kearns and Schapire, 1994], in which predictors were termed probabilistic concepts. In predictor learning, binary classifiers are replaced by predictors whose predictions take values in $[0,1]$. Kearns and Schapire [1994] introduced the notion of fat-shattering dimension (see Section 2.4 for a precise definition) as a generalization of VC dimension to predictor classes and showed a lower bound on the sample complexity of learning a predictor class by its fat-shattering dimension. Alon, Ben-David, Cesa-Bianchi, and Haussler [1997] and Bartlett, Long, and Williamson [1996] complemented the result with corresponding upper bounds and concluded that a predictor class is learnable from finitely many examples if and only if it has finite fat-shattering dimension. Quantitatively, these papers showed that the sample complexity of learning a predictor in a class $\mathcal{P}$ within $\ell_1$ error $\varepsilon$ is characterized by

$(1/\varepsilon)^{O(1)}\,\mathsf{fat}_{\mathcal{P}}\big(\varepsilon^{\Theta(1)}\big),$

assuming we want the success probability to be at least a constant, say $9/10$. Moreover, Bartlett et al. [1996] extended the characterization to the agnostic setting, where the objective is the mean absolute error between the learned prediction and the actual outcome. Later, the sample complexity bounds and the related uniform convergence results from Alon et al. [1997] and Bartlett et al. [1996] were improved by Bartlett and Long [1995], Bartlett and Long [1998], and Li, Long, and Srinivasan [2001].

Another natural extension of PAC learning is distribution-specific learning. In both PAC learning and predictor learning discussed above, performance of the learner is evaluated based on the worst-case distribution $\mu$. These settings are referred to as distribution-free due to their lack of dependence on a particular input distribution. Since the distribution is usually not the worst-case in practice, distribution-specific learning focuses on the performance of the learner on a given distribution $\mu$. In this setting, the sample complexity of learning a binary classifier in class $\mathcal{H}$ to achieve a classification error below $\varepsilon$ is characterized by

$(1/\varepsilon)^{O(1)}\log N_{\mu}(\mathcal{H},\Theta(\varepsilon))$ (1)

using the metric entropy, i.e., the logarithm of the covering number $N_{\mu}$ of $\mathcal{H}$ w.r.t. the classification error (as a metric), which indeed depends on the specific distribution $\mu$ [Benedek and Itai, 1991].

To compare OI with these previous classification-error/$\ell_1$-error-based notions of learning in terms of sample complexity, we need a characterization of the sample complexity of OI using similar quantities such as the fat-shattering dimension or the metric entropy. While it is tempting to hope that we might directly apply such notions, OI introduces additional subtlety in that we must consider how the expressiveness of the predictor class $\mathcal{P}$ and the class of distinguishers $\mathcal{D}$ interact to fully understand the sample complexity requirements. This is in contrast to standard settings where characterizing the expressiveness of the concept class via VC dimension or related notions is sufficient.

We show a simple example where it is indeed important to consider the interplay between $\mathcal{P}$ and $\mathcal{D}$ rather than just considering their contributions independently. Partition the set of inputs $X$ into two equal-sized sets, $X_{1}$ and $X_{2}$. We consider a class of predictors that are maximally expressive on $X_{2}$ and constant on $X_{1}$: let $\mathcal{P}=\{p:X\rightarrow[0,1]\,|\,p(x)=0,\ \forall x\in X_{1}\}$. Similarly, we can define a distinguisher class that is maximally inquisitive on one set and ignores the other, but depending on which set is ignored we will get dramatically different complexity. Define two potential distinguisher classes: $\mathcal{D}_{1}=\{d:X\times\{0,1\}\rightarrow\{\textnormal{ACCEPT},\textnormal{REJECT}\}\,|\,d(x,b)=\textnormal{REJECT},\ \forall x\in X_{1},b\in\{0,1\}\}$ and $\mathcal{D}_{2}=\{d:X\times\{0,1\}\rightarrow\{\textnormal{ACCEPT},\textnormal{REJECT}\}\,|\,d(x,b)=\textnormal{REJECT},\ \forall x\in X_{2},b\in\{0,1\}\}$. $\mathcal{D}_{1}$ and $\mathcal{D}_{2}$ are symmetric and thus identical in any independent measure of complexity. Nevertheless, it is easy to see that while achieving $\varepsilon$-OI on $\mathcal{P}$ with respect to $\mathcal{D}_{1}$ is equivalent to learning a predictor from $\mathcal{P}$ with $\ell_1$ error $\varepsilon$ on the set $X_{2}$, learning an $\varepsilon$-OI predictor from $\mathcal{P}$ with respect to $\mathcal{D}_{2}$ is trivial: $\mathcal{P}$ is constant on all individuals that $\mathcal{D}_{2}$ can distinguish between, so any $p\in\mathcal{P}$ satisfies perfect OI with respect to $\mathcal{D}_{2}$ (see the toy computation sketched below). This example demonstrates the need for a more subtle variation of existing complexity notions in order to tightly characterize the sample complexity of OI.
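The following toy computation (a hypothetical instance of our own, with deterministic distinguishers) makes the asymmetry explicit: a distinguisher from $\mathcal{D}_{1}$ separates two predictors in $\mathcal{P}$, while every distinguisher from $\mathcal{D}_{2}$ has zero advantage on them.

```python
# A small sketch of the P / D1 / D2 example above: distinguishers in D2
# reject everything on X2, where P varies, so every p in P is perfectly OI
# w.r.t. D2, while D1 can tell predictors in P apart.
X1, X2 = ["a1", "a2"], ["b1", "b2"]
X = X1 + X2
mu = {x: 1 / len(X) for x in X}             # uniform distribution over X

def accept_prob(d, p):
    """Pr_{(x,o)~mu_p}[d accepts], for a deterministic distinguisher d."""
    return sum(mu[x] * (p[x] * d(x, 1) + (1 - p[x]) * d(x, 0)) for x in X)

def adv(d, p, q):
    return abs(accept_prob(d, p) - accept_prob(d, q))

# Two predictors in P: zero on X1, arbitrary on X2.
p = {"a1": 0, "a2": 0, "b1": 1.0, "b2": 0.2}
q = {"a1": 0, "a2": 0, "b1": 0.3, "b2": 0.4}

d1 = lambda x, o: (x in X2) and o == 1      # a distinguisher in D1 (rejects on X1)
d2 = lambda x, o: (x in X1) and o == 1      # a distinguisher in D2 (rejects on X2)

print(adv(d1, p, q))   # positive: D1 can separate predictors in P
print(adv(d2, p, q))   # 0.0: every p in P looks identical to D2
```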

Prior to our work, Dwork et al. [2021] showed that $O(\varepsilon^{-4}\log(|\mathcal{D}|/\varepsilon\delta))$ examples are sufficient for achieving an advantage below $\varepsilon$ over a distinguisher class $\mathcal{D}$ with probability at least $1-\delta$ in the distribution-free setting. In this work, we refine the $\log|\mathcal{D}|$ dependence on $\mathcal{D}$ to its fat-shattering dimension with a matching lower bound (Section 5). In addition, we characterize the sample complexity of OI in the distribution-specific setting with greater generality for every given predictor class $\mathcal{P}$, placing OI in the same setting as the classification-error/$\ell_1$-error-based notions of learning with improved sample complexity due to a well-tuned objective based on the distinguisher class $\mathcal{D}$. The interplay between the distinguisher class $\mathcal{D}$ and the predictor class $\mathcal{P}$ connects our sample complexity characterizations to the intriguing metric entropy duality conjecture in convex geometry, which we discuss more in Section 1.2 below.

1.2 Covering for Distinguishers and Metric Entropy Duality

Before we describe our sample complexity characterizations for OI, we would like to discuss the ideas behind the metric-entropy-based sample complexity characterization (1) for distribution-specific learning by Benedek and Itai [1991]. Our characterizations for distribution-specific OI are based on similar ideas, but in a more subtle and arguably surprising way, making an intriguing connection to the metric entropy duality conjecture, a long-standing conjecture in convex geometry.

The idea behind the sample complexity upper bound in (1) is to reduce the size of the classifier class $\mathcal{H}$ by taking an $\varepsilon$-covering of it. Consider the space of all binary classifiers endowed with the “classification error metric” $\eta(h_{1},h_{2})=\Pr_{x\sim\mu}[h_{1}(x)\neq h_{2}(x)]$. If two classifiers $h_{1}$ and $h_{2}$ are close w.r.t. this metric, choosing $h_{1}$ or $h_{2}$ would result in roughly equal classification errors w.r.t. the target predictor. This gives a natural procedure for simplifying the classifier class and consequently controlling the sample complexity: we can replace $\mathcal{H}$ by an $\varepsilon$-covering $\mathcal{H}^{\prime}$ of it with only minor loss in its expressivity. Here, an $\varepsilon$-covering is a subset $\mathcal{H}^{\prime}\subseteq\mathcal{H}$ such that every $h\in\mathcal{H}$ has a close companion $h^{\prime}\in\mathcal{H}^{\prime}$ with $\eta(h,h^{\prime})\leq\varepsilon$. As the size of $\mathcal{H}^{\prime}$ can be bounded by the covering number $N_{\mu}(\mathcal{H},\varepsilon)$, we can use a relatively small number of examples to accurately estimate the classification error of every classifier in $\mathcal{H}^{\prime}$. This naturally leads to the empirical risk minimization (ERM) algorithm used in [Benedek and Itai, 1991].
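A minimal sketch of this covering-based ERM idea, assuming a finite domain, an explicitly given finite class $\mathcal{H}$, and classifiers stored as dictionaries (the helper names and the greedy covering construction are our own, not the exact algorithm of Benedek and Itai):

```python
# Covering-based ERM: replace H by an eps-covering H' under the classification
# error metric eta, then pick the hypothesis in H' with smallest empirical error.
def eta(h1, h2, mu):
    """Classification-error metric Pr_{x~mu}[h1(x) != h2(x)]."""
    return sum(mu[x] for x in mu if h1[x] != h2[x])

def epsilon_covering(H, mu, eps):
    """Greedy eps-covering of H w.r.t. eta."""
    cover = []
    for h in H:
        if all(eta(h, c, mu) > eps for c in cover):
            cover.append(h)
    return cover

def erm_over_covering(H, mu, eps, sample):
    """sample: list of (x, h*(x)) pairs with x drawn from mu."""
    cover = epsilon_covering(H, mu, eps)
    def empirical_error(h):
        return sum(1 for x, y in sample if h[x] != y) / len(sample)
    return min(cover, key=empirical_error)
```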

To extend the ERM algorithm to outcome indistinguishability, a natural idea is to define the metric $\eta(p_{1},p_{2})$ between two predictors $p_{1}$ and $p_{2}$ to be the maximum distinguishing advantage between $p_{1}$ and $p_{2}$ w.r.t. the distinguisher class $\mathcal{D}$, and compute an $\varepsilon$-covering $\mathcal{P}^{\prime}$ of the predictor class $\mathcal{P}$. However, this direct extension (Algorithm 1) of the ERM algorithm to OI does not give us the right sample complexity upper bound (we show this formally in Appendix A, especially in Lemma 33). This failure is partly because the acceptance probabilities of the distinguishers in $\mathcal{D}$ are not all preserved on a small sample. Indeed, learning a predictor $p$ with MDA below $\varepsilon$ w.r.t. the target predictor $p^{*}$ necessarily requires estimating the acceptance probability of every distinguisher in $\mathcal{D}$ on $p^{*}$ within error $\varepsilon$ (the estimates are simply the acceptance probabilities on $p$).

To meet the need of estimating the acceptance probabilities of the distinguishers in $\mathcal{D}$, we use a new algorithm (Algorithm 2) where we compute a covering of $\mathcal{D}$ w.r.t. a metric defined by the predictor class $\mathcal{P}$, i.e., we flip the roles of $\mathcal{P}$ and $\mathcal{D}$ in the covering. Using this new algorithm, we get a sample complexity upper bound as a function of the metric entropy $\log N_{\mu,\mathcal{P}}(\mathcal{D},\varepsilon)$ of $\mathcal{D}$ w.r.t. $\mathcal{P}$, but the sample complexity lower bound we get by extending the arguments from [Benedek and Itai, 1991] is still a function of the metric entropy $\log N_{\mu,\mathcal{D}}(\mathcal{P},\varepsilon)$ of $\mathcal{P}$ w.r.t. $\mathcal{D}$. How do we make the upper and lower bounds match?
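The following is a hedged sketch of the flipped covering idea (a simplification of our own for finite $X$, finite $\mathcal{P}$ and $\mathcal{D}$, and deterministic distinguishers; it is not Algorithm 2 verbatim, and the tolerance $\varepsilon/4$ is an arbitrary illustrative choice): cover $\mathcal{D}$ under the metric induced by $\mathcal{P}$, estimate the covering distinguishers' acceptance probabilities on $\mu_{p^{*}}$ from the sample, and return a predictor in $\mathcal{P}$ whose own acceptance probabilities match the estimates.

```python
# A hedged sketch of covering D (not P) under the metric induced by P.
def accept_prob(d, p, mu):
    return sum(mu[x] * (p[x] * d(x, 1) + (1 - p[x]) * d(x, 0)) for x in mu)

def dist_P(d1, d2, P, mu):
    """Distance between distinguishers, measured through the predictor class P."""
    return max(abs(accept_prob(d1, p, mu) - accept_prob(d2, p, mu)) for p in P)

def cover_D(D, P, mu, eps):
    cover = []
    for d in D:
        if all(dist_P(d, c, P, mu) > eps for c in cover):
            cover.append(d)
    return cover

def flipped_cover_learner(P, D, mu, eps, sample):
    """sample: list of (x, o) pairs drawn i.i.d. from mu_{p*}."""
    D_cover = cover_D(D, P, mu, eps / 4)
    # Empirical acceptance probability of each covering distinguisher on p*.
    est = {i: sum(d(x, o) for x, o in sample) / len(sample)
           for i, d in enumerate(D_cover)}
    # Return a predictor whose acceptance probabilities match the estimates.
    return min(P, key=lambda p: max(abs(accept_prob(d, p, mu) - est[i])
                                    for i, d in enumerate(D_cover)))
```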

Our idea is to first transform distinguishers into the same inner product space as the predictors, and then interpret $\mathcal{D}$ and $\mathcal{P}$ as two abstract vector sets $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$ in the same inner product space that can exchange roles freely. Specifically, combining our upper and lower bounds, we know that the lower bound based on $\log N_{\mu,\mathcal{D}}(\mathcal{P},\varepsilon)$ never exceeds the upper bound based on $\log N_{\mu,\mathcal{P}}(\mathcal{D},\varepsilon)$. If we set $\mathcal{F}_{1}=\mathcal{P}$ and $\mathcal{F}_{2}=\mathcal{D}$, a more precise version of what we get is

$\log N_{\mu,\mathcal{F}_{2}}(\mathcal{F}_{1},\varepsilon)\leq O\Big(\varepsilon^{-2}\Big(1+\log N_{\mu,\mathcal{F}_{1}}(\mathcal{F}_{2},\Omega(\varepsilon))\Big)\Big).$ (2)

If $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$ are two arbitrary abstract sets of vectors, we can inversely set $\mathcal{F}_{1}=\mathcal{D}$ and $\mathcal{F}_{2}=\mathcal{P}$ in the inequality above and get

$\log N_{\mu,\mathcal{P}}(\mathcal{D},\varepsilon)\leq O\Big(\varepsilon^{-2}\Big(1+\log N_{\mu,\mathcal{D}}(\mathcal{P},\Omega(\varepsilon))\Big)\Big).$

This inequality is exactly what we need to flip back the roles of $\mathcal{P}$ and $\mathcal{D}$ and make our upper bound match our lower bound.

The key inequality (2) that helps match our upper and lower bounds comes from combining the bounds themselves. Moreover, this inequality makes an intriguing connection to the long-standing metric entropy duality conjecture in convex geometry, which conjectures that (2) holds without the $\varepsilon^{-2}$ factor, but with the additional assumption that $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$ are convex and symmetric around the origin. Without this additional assumption, we show that the quadratic dependence on $1/\varepsilon$ in (2) is nearly tight (Lemma 14).

The metric entropy duality conjecture (formally stated in Conjecture 12) was first proposed by Pietsch [1972]. While the conjecture remains open, many weaker versions of it have been proved [e.g. Bourgain, Pajor, Szarek, and Tomczak-Jaegermann, 1989]. Most notably, the conjecture was proved by Artstein, Milman, and Szarek [2004b] in the special case where one of the two convex sets is an ellipsoid. This result was further strengthened by Artstein, Milman, Szarek, and Tomczak-Jaegermann [2004a]. Milman [2007] proved a weaker form of the conjecture with the constant factors replaced by dimension dependent quantities.

1.3 Our Contributions

Outcome indistinguishability was originally proposed as a strong fairness and accuracy criterion, potentially requiring large sample sizes to achieve. In this work, we view OI differently, as a meaningful notion when we have insufficient data for $\ell_1$-error-based learning. For this reason, we focus on no-access OI, the simplest form of OI introduced by Dwork et al. [2021]. For no-access OI, we show that (randomized) distinguishers can be converted to vectors in the same inner product space as the predictors (Section 2.1.2), connecting the MDA objective to the multiaccuracy objective used by Kim et al. [2019]. This allows us to understand predictor classes $\mathcal{P}$ and distinguisher classes $\mathcal{D}$ using geometric notions such as the dual Minkowski norms $\|\cdot\|_{\mu,\mathcal{P}},\|\cdot\|_{\mu,\mathcal{D}}$ and the covering numbers $N_{\mu,\mathcal{P}}(\cdot,\cdot),N_{\mu,\mathcal{D}}(\cdot,\cdot)$ defined by these norms (see Section 2.2).

In the distribution-specific setting, we consider realizable OI where the target predictor lies in an arbitrary given predictor class $\mathcal{P}$. Setting the failure probability bound $\delta$ in Theorem 13 to be a constant, say $1/10$, for every predictor class $\mathcal{P}$, every distinguisher class $\mathcal{D}$, every data distribution $\mu$, and every MDA bound $\varepsilon$, we characterize the sample complexity of distribution-specific realizable OI both as

$(1/\varepsilon)^{O(1)}\log N_{\mu,\mathcal{D}}(\mathcal{P},\Theta(\varepsilon))$ (3)

and as

$(1/\varepsilon)^{O(1)}\log N_{\mu,\mathcal{P}}(\mathcal{D},\Theta(\varepsilon)).$ (4)

Our sample complexity characterizations (3) and (4) highlight an intriguing connection between learning theory and the metric entropy duality conjecture (Conjecture 12) first proposed by Pietsch [1972], which conjectures that

$\log N_{\mu,\mathcal{F}_{1}}(\mathcal{F}_{2},\varepsilon)\leq O(\log N_{\mu,\mathcal{F}_{2}}(\mathcal{F}_{1},\Omega(\varepsilon)))$

whenever the two function classes $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$ are convex and symmetric around the origin. Our sample complexity characterizations imply a variant of metric entropy duality (Theorem 11) where $\mathcal{F}_{1}$ and $\mathcal{F}_{2}$ are not required to be convex and symmetric, which we show is nearly tight (Lemma 14).

In the distribution-free setting, we focus on the case where $\mathcal{P}$ contains all possible predictors, which is the setting considered by Dwork et al. [2021]. Setting $\delta=1/10$ in Theorem 15, we show that the sample complexity of distribution-free OI in this setting is characterized by

$(1/\varepsilon)^{O(1)}\,\mathsf{fat}_{\mathcal{D}}(\Theta(\varepsilon)).$

This characterization extends to multicalibration with some modifications (see Remark 21 and Remark 24). Our result refines the $\log|\mathcal{D}|$ in the sample complexity upper bounds by Dwork et al. [2021] (for OI) and by Hébert-Johnson et al. [2018] (for multicalibration) to the fat-shattering dimension of $\mathcal{D}$ with a matching lower bound.

In addition, we show that the sample complexity of OI in the agnostic setting behaves very differently from the realizable setting. This is in contrast to many common learning settings where the sample complexities of realizable and agnostic learning usually behave similarly (a recent independent work by Hopkins et al. [2021] gives a unified explanation for this phenomenon). In both the distribution-free and the distribution-specific settings, we show that the sample complexity of agnostic OI can increase when we remove some distinguishers from $\mathcal{D}$, and it can become arbitrarily large even when the realizable sample complexity is bounded by a constant (Section 6.2). This also suggests that OI can have larger sample complexity compared to $\ell_1$-error-based learning in the agnostic setting. This is because in the agnostic setting, the performance of the learned predictor is measured relative to the best predictor in the class $\mathcal{P}$, which can have a much better performance when measured using the selected objective of OI than using the $\ell_1$ error. On the other hand, when the target predictor $p^{*}$ belongs to the symmetric convex hull of the predictor class $\mathcal{P}$, we show that the sample complexity in the distribution-specific agnostic setting has the same characterizations as in the realizable setting (Lemma 25).

1.4 Related Work

The notion of outcome indistinguishability originated from the growing research on algorithmic fairness. Specifically, outcome indistinguishability can be treated as a generalization of multiaccuracy and multicalibration introduced by Hébert-Johnson, Kim, Reingold, and Rothblum [2018] and Kim, Ghorbani, and Zou [2019], in which the goal is to ensure that the learned predictor is accurate in expectation or calibrated conditioned on every subpopulation in a subpopulation class $\mathcal{C}$. Roughly speaking, the subpopulation class $\mathcal{C}$ in multiaccuracy and multicalibration plays a similar role as the distinguisher class $\mathcal{D}$ in outcome indistinguishability, and this connection has been discussed more extensively in [Dwork et al., 2021, Section 4]. Beyond fairness, multicalibration and OI also provide strong accuracy guarantees [see e.g. Rothblum and Yona, 2021, Blum and Lykouris, 2020, Zhao, Kim, Sahoo, Ma, and Ermon, 2021, Gopalan, Kalai, Reingold, Sharan, and Wieder, 2022, Kim, Kern, Goldwasser, Kreuter, and Reingold, 2022, Burhanpurkar, Deng, Dwork, and Zhang, 2021]. For a general predictor class $\mathcal{P}$ and a subpopulation class $\mathcal{C}$, Shabat, Cohen, and Mansour [2020] showed sample complexity upper bounds of uniform convergence for multicalibration based on the maximum of suitable complexity measures of $\mathcal{C}$ and $\mathcal{P}$. They complemented this result with a lower bound which does not grow with $\mathcal{C}$ and $\mathcal{P}$. In comparison, we focus on the weaker no-access OI setting where the sample complexity can be much smaller, and we provide matching upper and lower bounds in terms of the dependence on $\mathcal{D}$ and $\mathcal{P}$.

1.5 Paper Organization

The remainder of this paper is structured as follows. In Section 2, we formally define OI and related notions. We give lower and upper bounds for the sample complexity of distribution-specific realizable OI in Section 3, and turn these bounds into a characterization in Section 4 via metric entropy duality. We characterize the sample complexity of distribution-free OI in Section 5. Finally, in Section 6 we show a strong separation between the sample complexities of realizable and agnostic OI in both the distribution-free and distribution-specific settings.

2 Preliminaries

We use $X$ to denote an arbitrary non-empty set of individuals throughout the paper. For simplicity, whenever we say $\mu$ is a distribution over $X$, we implicitly assume that every subset of $X$ is measurable w.r.t. $\mu$ (this holds e.g. when $\mu$ is a discrete distribution), although all the results in this paper naturally extend to more general distributions under appropriate measurability assumptions. We use $\Delta_{X}$ to denote the set of all distributions over $X$ satisfying the implicit assumption. For two sets $A$ and $B$, we use $B^{A}$ to denote the class of all functions $f:A\rightarrow B$. Given $r\in[0,1]$, we use $\mathrm{Ber}(r)$ to denote the Bernoulli distribution over $\{0,1\}$ with mean $r$. We use $\log$ to denote the base-$2$ logarithm.

2.1 Outcome Indistinguishability

Outcome indistinguishability is a theoretical framework introduced by Dwork et al. [2021] that aims to guarantee that the outcomes produced by some learned predictor $p:X\rightarrow[0,1]$ are indistinguishable to a predetermined class of distinguishers $\mathcal{D}$ from outcomes sampled from the true probabilities for each individual defined by $p^{*}:X\rightarrow[0,1]$.

The distinguishing task in outcome indistinguishability consists of drawing an individual $x\in X$ according to some population distribution $\mu\in\Delta_{X}$ and then presenting the distinguisher with an outcome/individual pair $(x,o)$ where $o$ is either sampled according to the true predictor $p^{*}$ from the Bernoulli distribution $\mathrm{Ber}(p^{*}(x))$, or sampled according to the learned predictor $p$ from $\mathrm{Ber}(p(x))$. Taking a pair $(x,o)\in X\times\{0,1\}$, a distinguisher $d$ outputs $d(x,o)=\textnormal{ACCEPT}$ or $d(x,o)=\textnormal{REJECT}$. We allow distinguishers to be randomized.

Given a predictor $p$ and a distribution $\mu$ over $X$, we define $\mu_{p}$ to be the distribution of pairs $(x,o)\in X\times\{0,1\}$ drawn using the process described above such that $x\sim\mu$ and $o\sim\mathrm{Ber}(p(x))$. With this notation in hand, we can now provide a formal definition of outcome indistinguishability:

Definition 1 (No-Access Outcome Indistinguishability, Dwork et al. [2021]).

Let $\mu\in\Delta_{X}$ be a distribution over a set of individuals $X$ and $p^{*}:X\rightarrow[0,1]$ be some target predictor for the set of individuals. For a class of distinguishers $\mathcal{D}$ and $\varepsilon>0$, a predictor $p:X\rightarrow[0,1]$ satisfies $(\mathcal{D},\varepsilon)$-outcome indistinguishability (OI) if for every $d\in\mathcal{D}$,

$\left|\Pr_{(x,o)\sim\mu_{p}}\left[d(x,o)=\textnormal{ACCEPT}\right]-\Pr_{(x,o^{*})\sim\mu_{p^{*}}}\left[d(x,o^{*})=\textnormal{ACCEPT}\right]\right|\leq\varepsilon.$ (5)

We refer to the left-hand side of (5) as the distinguishing advantage (or simply advantage) of the distinguisher $d$, denoted $\mathrm{adv}_{\mu,d}(p,p^{*})$. Given a class $\mathcal{D}$ of distinguishers, we use $\mathrm{adv}_{\mu,\mathcal{D}}(p,p^{*})$ to denote the supremum $\sup_{d\in\mathcal{D}}\mathrm{adv}_{\mu,d}(p,p^{*})$. According to Definition 1, a learned predictor $p$ satisfies $(\mathcal{D},\varepsilon)$-OI if and only if $\mathrm{adv}_{\mu,\mathcal{D}}(p,p^{*})\leq\varepsilon$.

There are many different extensions of OI to settings in which the power of distinguishers is extended to allow access to additional information about the learned predictor $p$ or access to more than one individual/outcome pair. We will be focusing on this most basic case, in which the distinguisher only has access to a single pair $(x,o)$ and has no additional access to the learned predictor $p$ (hence the name no-access OI).

2.1.1 OI Algorithms

An OI algorithm (or learner) takes examples $(x_{1},o_{1}),\ldots,(x_{n},o_{n})$ drawn i.i.d. from $\mu_{p^{*}}$ and outputs a predictor $p$ that satisfies $(\mathcal{D},\varepsilon)$-OI with probability at least $1-\delta$:

Definition 2 (No-Access OI Algorithms).

Given a target predictor $p^{*}\in[0,1]^{X}$, a class of distinguishers $\mathcal{D}$, a distribution $\mu\in\Delta_{X}$, an advantage bound $\varepsilon\geq 0$, a failure probability bound $\delta\geq 0$, and a nonnegative integer $n$, we use $\mathsf{OI}_{n}(p^{*},\mathcal{D},\varepsilon,\delta,\mu)$ to denote the set of all (possibly randomized and inefficient) algorithms $\mathcal{A}$ with the following properties:

  1. $\mathcal{A}$ takes $n$ examples $(x_{1},o_{1}),\ldots,(x_{n},o_{n})\in X\times\{0,1\}$ as input and outputs a predictor $p\in[0,1]^{X}$;

  2. when the input examples are drawn i.i.d. from $\mu_{p^{*}}$, with probability at least $1-\delta$ the output predictor $p$ satisfies $\mathrm{adv}_{\mu,\mathcal{D}}(p,p^{*})\leq\varepsilon$.

When seeking to learn an outcome indistinguishable predictor, there are two different tasks that we can consider. On one hand, in what we call the realizable case, we assume that the target predictor $p^{*}$ lies in a known predictor class $\mathcal{P}\subseteq[0,1]^{X}$, and we seek to achieve low distinguishing advantages over all distinguishers in the class $\mathcal{D}$.\footnote{Throughout the paper, we implicitly assume that all predictor classes and distinguisher classes are non-empty.} Alternatively, in the agnostic case, we can imagine a situation in which nothing is known about the target predictor $p^{*}$, but the performance of the learned predictor is measured relative to the best predictor in $\mathcal{P}$. In both the agnostic and realizable settings, we can also specify whether we measure performance on the worst-case distribution $\mu$ over individuals in $X$, or on some particular distribution $\mu$ given to the learner. We call these the distribution-free and distribution-specific settings, respectively.

Definition 3 (Algorithms in Various Settings).

Given a predictor class $\mathcal{P}$, a class of distinguishers $\mathcal{D}$, a distribution $\mu$, an advantage bound $\varepsilon$, a failure probability bound $\delta$, and a nonnegative integer $n$, we define the sets of algorithms that solve various OI tasks using $n$ examples as follows:

$\mathsf{DS\text{-}R}_{n}(\mathcal{P},\mathcal{D},\varepsilon,\delta,\mu)=\bigcap_{p^{*}\in\mathcal{P}}\mathsf{OI}_{n}(p^{*},\mathcal{D},\varepsilon,\delta,\mu),$ (distribution-specific realizable OI)
$\mathsf{DS\text{-}A}_{n}(\mathcal{P},\mathcal{D},\varepsilon,\delta,\mu)=\bigcap_{p^{*}\in[0,1]^{X}}\mathsf{OI}_{n}(p^{*},\mathcal{D},\inf_{p\in\mathcal{P}}\mathrm{adv}_{\mu,\mathcal{D}}(p,p^{*})+\varepsilon,\delta,\mu),$ (distribution-specific agnostic OI)
$\mathsf{DF\text{-}R}_{n}(\mathcal{P},\mathcal{D},\varepsilon,\delta)=\bigcap_{\mu^{\prime}\in\Delta_{X}}\mathsf{DS\text{-}R}_{n}(\mathcal{P},\mathcal{D},\varepsilon,\delta,\mu^{\prime}),$ (distribution-free realizable OI)
$\mathsf{DF\text{-}A}_{n}(\mathcal{P},\mathcal{D},\varepsilon,\delta)=\bigcap_{\mu^{\prime}\in\Delta_{X}}\mathsf{DS\text{-}A}_{n}(\mathcal{P},\mathcal{D},\varepsilon,\delta,\mu^{\prime}).$ (distribution-free agnostic OI)

We note that while these learning goals are all defined with respect to some predictor class $\mathcal{P}$, this class simply constrains the possible values of the target predictor $p^{*}$ (or, in the agnostic case, constrains the predictors used to measure the performance of the returned predictor). In particular, we do not require any of the OI algorithms to be proper, i.e., to always output some $p\in\mathcal{P}$, despite the fact that some of the algorithms discussed in our proofs happen to satisfy this property.

Definition 4 (Sample complexity).

Given a predictor class $\mathcal{P}$, a class of distinguishers $\mathcal{D}$, a distribution $\mu$, an advantage bound $\varepsilon$, and a failure probability bound $\delta$, we define the sample complexity of various OI tasks as follows:

$\mathsf{SAMP\text{-}DS\text{-}R}(\mathcal{P},\mathcal{D},\varepsilon,\delta,\mu)=\inf\{n\in\mathbb{Z}_{\geq 0}:\mathsf{DS\text{-}R}_{n}(\mathcal{P},\mathcal{D},\varepsilon,\delta,\mu)\neq\emptyset\},$
$\mathsf{SAMP\text{-}DS\text{-}A}(\mathcal{P},\mathcal{D},\varepsilon,\delta,\mu)=\inf\{n\in\mathbb{Z}_{\geq 0}:\mathsf{DS\text{-}A}_{n}(\mathcal{P},\mathcal{D},\varepsilon,\delta,\mu)\neq\emptyset\},$
$\mathsf{SAMP\text{-}DF\text{-}R}(\mathcal{P},\mathcal{D},\varepsilon,\delta)=\inf\{n\in\mathbb{Z}_{\geq 0}:\mathsf{DF\text{-}R}_{n}(\mathcal{P},\mathcal{D},\varepsilon,\delta)\neq\emptyset\},$
$\mathsf{SAMP\text{-}DF\text{-}A}(\mathcal{P},\mathcal{D},\varepsilon,\delta)=\inf\{n\in\mathbb{Z}_{\geq 0}:\mathsf{DF\text{-}A}_{n}(\mathcal{P},\mathcal{D},\varepsilon,\delta)\neq\emptyset\}.$

It is clear from the definition that the following monotonicity properties hold: for $\mathcal{P}^{\prime}\subseteq\mathcal{P}$, $\mathcal{D}^{\prime}\subseteq\mathcal{D}$, $\varepsilon^{\prime}\geq\varepsilon$, $\delta^{\prime}\geq\delta$,

$\mathsf{SAMP\text{-}DS\text{-}R}(\mathcal{P}^{\prime},\mathcal{D}^{\prime},\varepsilon^{\prime},\delta^{\prime},\mu)\leq\mathsf{SAMP\text{-}DS\text{-}R}(\mathcal{P},\mathcal{D},\varepsilon,\delta,\mu),$
$\mathsf{SAMP\text{-}DS\text{-}A}(\mathcal{P}^{\prime},\mathcal{D},\varepsilon^{\prime},\delta^{\prime},\mu)\leq\mathsf{SAMP\text{-}DS\text{-}A}(\mathcal{P},\mathcal{D},\varepsilon,\delta,\mu),$
$\mathsf{SAMP\text{-}DF\text{-}R}(\mathcal{P}^{\prime},\mathcal{D}^{\prime},\varepsilon^{\prime},\delta^{\prime})\leq\mathsf{SAMP\text{-}DF\text{-}R}(\mathcal{P},\mathcal{D},\varepsilon,\delta),$
$\mathsf{SAMP\text{-}DF\text{-}A}(\mathcal{P}^{\prime},\mathcal{D},\varepsilon^{\prime},\delta^{\prime})\leq\mathsf{SAMP\text{-}DF\text{-}A}(\mathcal{P},\mathcal{D},\varepsilon,\delta).$ (6)

Note that the sample complexities in the agnostic setting are not guaranteed to be monotone w.r.t. $\mathcal{D}$ (see Section 6.2). It is also clear from the definition that

$\mathsf{SAMP\text{-}DS\text{-}R}(\mathcal{P},\mathcal{D},\varepsilon,\delta,\mu)\leq\mathsf{SAMP\text{-}DF\text{-}R}(\mathcal{P},\mathcal{D},\varepsilon,\delta),$
$\mathsf{SAMP\text{-}DS\text{-}A}(\mathcal{P},\mathcal{D},\varepsilon,\delta,\mu)\leq\mathsf{SAMP\text{-}DF\text{-}A}(\mathcal{P},\mathcal{D},\varepsilon,\delta).$ (7)

2.1.2 No-Access Distinguishers as Functions of Individuals

In the standard definition of OI, distinguishers are thought of as randomized algorithms that take as input an individual-outcome pair $(x,o)\in X\times\{0,1\}$ and output ACCEPT or REJECT. However, there is a natural way to transform every no-access distinguisher $d$ into a function $f_{d}$ that maps every individual $x\in X$ to a label $y\in[-1,1]$ in such a way that the distinguishing advantage of $d$ can be recovered from $f_{d}$.

In particular, given a randomized distinguisher $d$, we define $f_{d}:X\rightarrow[-1,1]$ such that

$f_{d}(x)=\Pr[d(x,1)=\textnormal{ACCEPT}]-\Pr[d(x,0)=\textnormal{ACCEPT}]\quad\textnormal{for all}\ x\in X.$

Given the function $f_{d}$, we show that we can always recover the distinguishing advantage of the original distinguisher $d$:

Lemma 5.

For any two predictors $p_{1},p_{2}:X\rightarrow[0,1]$,

$\mathrm{adv}_{\mu,d}(p_{1},p_{2})=|\operatorname{\mathbb{E}}_{x\sim\mu}[f_{d}(x)(p_{1}(x)-p_{2}(x))]|.$
Proof.

For any $x\in X$ and any two possible outcomes $(o_{1},o_{2})\in\{(0,0),(0,1),(1,0),(1,1)\}$, it is easily verifiable that

$\Pr[d(x,o_{1})=\textnormal{ACCEPT}]-\Pr[d(x,o_{2})=\textnormal{ACCEPT}]=f_{d}(x)(o_{1}-o_{2}),$

where the probabilities are over the internal randomness of the distinguisher $d$. The lemma is proved by the following chain of equations:

$\mathrm{adv}_{\mu,d}(p_{1},p_{2})$
$=\left|\Pr_{(x,o)\sim\mu_{p_{1}}}\left[d(x,o)=\textnormal{ACCEPT}\right]-\Pr_{(x,o)\sim\mu_{p_{2}}}\left[d(x,o)=\textnormal{ACCEPT}\right]\right|$
$=\left|\operatorname{\mathbb{E}}_{x\sim\mu,\,o_{1}\sim\mathrm{Ber}(p_{1}(x)),\,o_{2}\sim\mathrm{Ber}(p_{2}(x))}\left[\Pr[d(x,o_{1})=\textnormal{ACCEPT}]-\Pr[d(x,o_{2})=\textnormal{ACCEPT}]\right]\right|$
$=\left|\operatorname{\mathbb{E}}_{x\sim\mu,\,o_{1}\sim\mathrm{Ber}(p_{1}(x)),\,o_{2}\sim\mathrm{Ber}(p_{2}(x))}[f_{d}(x)(o_{1}-o_{2})]\right|$
$=\left|\operatorname{\mathbb{E}}_{x\sim\mu}[f_{d}(x)(p_{1}(x)-p_{2}(x))]\right|.$ ∎

Note that the transformation from a distinguisher $d$ to the function $f_{d}\in[-1,1]^{X}$ is onto: given any function $f\in[-1,1]^{X}$, we can construct a distinguisher $d$ such that $f_{d}=f$ in the following way: if $f(x)\geq 0$, distinguisher $d$ accepts $(x,1)$ with probability $f(x)$ and accepts $(x,0)$ with probability $0$; if $f(x)<0$, distinguisher $d$ accepts $(x,0)$ with probability $-f(x)$ and accepts $(x,1)$ with probability $0$.
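The construction above, together with Lemma 5, can be checked on a small example; the sketch below (toy values of our own choosing) builds a randomized distinguisher $d$ with $f_{d}=f$ and compares its exact advantage with $|\operatorname{\mathbb{E}}_{x\sim\mu}[f(x)(p_{1}(x)-p_{2}(x))]|$.

```python
# Toy check of the construction f -> d and of the identity in Lemma 5.
import random

X = ["x1", "x2"]
mu = {"x1": 0.6, "x2": 0.4}
f = {"x1": 0.5, "x2": -0.8}                 # an arbitrary function in [-1,1]^X

def d(x, o):
    """Randomized distinguisher with f_d = f."""
    if f[x] >= 0:
        return o == 1 and random.random() < f[x]
    return o == 0 and random.random() < -f[x]

def accept_prob_exact(p):
    """Pr_{(x,o)~mu_p}[d accepts], computed in closed form from f."""
    return sum(mu[x] * (max(f[x], 0) * p[x] + max(-f[x], 0) * (1 - p[x])) for x in X)

p1 = {"x1": 0.9, "x2": 0.2}
p2 = {"x1": 0.4, "x2": 0.7}

adv = abs(accept_prob_exact(p1) - accept_prob_exact(p2))
rhs = abs(sum(mu[x] * f[x] * (p1[x] - p2[x]) for x in X))
print(adv, rhs)   # the two quantities agree (both 0.31 here), as in Lemma 5
```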

Lemma 5 shows that all distinguishers $d$ mapped to the same function $f_{d}$ have equal distinguishing advantages. It also shows that no-access OI has the same error objective as multiaccuracy considered by Kim et al. [2019]. From now on, when we refer to a distinguisher $d$, it should be interpreted as the function $f_{d}\in[-1,1]^{X}$. Similarly, we think of a distinguisher class $\mathcal{D}$ as a non-empty subset of $[-1,1]^{X}$.

2.2 Inner Product, Norm, and Covering Number

The set $\mathbb{R}^{X}$ of all real-valued functions on $X$ is naturally a linear space: for all $f_{1},f_{2}\in\mathbb{R}^{X}$, we define $f_{1}+f_{2}=g\in\mathbb{R}^{X}$ where $g(x)=f_{1}(x)+f_{2}(x)$ for all $x\in X$, and for $f\in\mathbb{R}^{X}$ and $r\in\mathbb{R}$, we define $rf=h\in\mathbb{R}^{X}$ where $h(x)=rf(x)$ for all $x\in X$.

For any function class $\mathcal{F}\subseteq\mathbb{R}^{X}$ and a real number $r$, we use $r\mathcal{F}$ to denote $\{rf:f\in\mathcal{F}\}$ and use $-\mathcal{F}$ to denote $(-1)\mathcal{F}$. For any two function classes $\mathcal{F}_{1},\mathcal{F}_{2}\subseteq\mathbb{R}^{X}$, we define

$\mathcal{F}_{1}+\mathcal{F}_{2}=\{f_{1}+f_{2}:f_{1}\in\mathcal{F}_{1},f_{2}\in\mathcal{F}_{2}\},\quad\textnormal{and}$
$\mathcal{F}_{1}-\mathcal{F}_{2}=\{f_{1}-f_{2}:f_{1}\in\mathcal{F}_{1},f_{2}\in\mathcal{F}_{2}\}.$

For any non-empty function class $\mathcal{F}\subseteq\mathbb{R}^{X}$, we say a function $f\in\mathbb{R}^{X}$ is in the convex hull of $\mathcal{F}$ if there exist $n\in\mathbb{Z}_{>0}$, $f_{1},\ldots,f_{n}\in\mathcal{F}$ and $r_{1},\ldots,r_{n}\in\mathbb{R}_{\geq 0}$ such that $r_{1}+\cdots+r_{n}=1$ and $f=r_{1}f_{1}+\cdots+r_{n}f_{n}$. We use $\bar{\mathcal{F}}$ to denote the symmetric convex hull of $\mathcal{F}$ consisting of all functions in the convex hull of $\mathcal{F}\cup(-\mathcal{F})$. When $\mathcal{F}$ is empty, we define $\bar{\mathcal{F}}$ to be the class consisting of only the all-zeros function.

We say a function $f\in\mathbb{R}^{X}$ is bounded if there exists $M>0$ such that $|f(x)|\leq M$ for all $x\in X$. We say a function class $\mathcal{F}\subseteq\mathbb{R}^{X}$ is bounded if there exists $M>0$ such that $|f(x)|\leq M$ for all $f\in\mathcal{F}$ and $x\in X$.

For two bounded functions $f_{1},f_{2}\in\mathbb{R}^{X}$, we define their inner product w.r.t. distribution $\mu$ as

$\langle f_{1},f_{2}\rangle_{\mu}=\operatorname{\mathbb{E}}_{x\sim\mu}[f_{1}(x)f_{2}(x)].$

Although we call the above quantity an inner product for simplicity, it may not be positive definite ($\langle f,f\rangle_{\mu}=0$ need not imply $f(x)=0$ for all $x\in X$). For a non-empty bounded function class $\mathcal{F}_{1}\subseteq\mathbb{R}^{X}$ and a bounded function $f\in\mathbb{R}^{X}$, we define the dual Minkowski norm of $f$ w.r.t. $\mathcal{F}_{1}$ to be

$\|f\|_{\mu,\mathcal{F}_{1}}=\sup_{f_{1}\in\mathcal{F}_{1}}|\langle f,f_{1}\rangle_{\mu}|.$

If $\mathcal{F}_{1}$ is empty, we define $\|f\|_{\mu,\mathcal{F}_{1}}=0$. The norm $\|\cdot\|_{\mu,\mathcal{F}_{1}}$ is technically only a semi-norm as it may not be positive definite, but whenever it is positive definite, it is the dual norm of the Minkowski norm induced by (the closure of) $\bar{\mathcal{F}}_{1}$ for finite $X$ [see e.g. Nikolov et al., 2013, Section 2.1].

For a non-empty bounded function class $\mathcal{F}_{2}\subseteq\mathbb{R}^{X}$ and $\varepsilon\geq 0$, we say a subset $\mathcal{F}_{2}^{\prime}\subseteq\mathcal{F}_{2}$ is an $\varepsilon$-covering of $\mathcal{F}_{2}$ w.r.t. the norm $\|\cdot\|_{\mu,\mathcal{F}_{1}}$ if for every $f_{2}\in\mathcal{F}_{2}$, there exists $f_{2}^{\prime}\in\mathcal{F}_{2}^{\prime}$ such that $\|f_{2}-f_{2}^{\prime}\|_{\mu,\mathcal{F}_{1}}\leq\varepsilon$. We define the covering number $N_{\mu,\mathcal{F}_{1}}(\mathcal{F}_{2},\varepsilon)$ of $\mathcal{F}_{2}$ w.r.t. the norm $\|\cdot\|_{\mu,\mathcal{F}_{1}}$ to be the minimum size of such an $\varepsilon$-covering $\mathcal{F}_{2}^{\prime}$ of $\mathcal{F}_{2}$. We refer to the logarithm of the covering number, $\log N_{\mu,\mathcal{F}_{1}}(\mathcal{F}_{2},\varepsilon)$, as the metric entropy.
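For finite $X$ and finite function classes stored as dictionaries, these quantities can be computed directly; the sketch below (helper names are ours) evaluates $\|\cdot\|_{\mu,\mathcal{F}_{1}}$ exactly and builds a greedy $\varepsilon$-covering, whose size upper-bounds $N_{\mu,\mathcal{F}_{1}}(\mathcal{F}_{2},\varepsilon)$.

```python
# A minimal sketch for finite X, with F1, F2 given as lists of dicts x -> value.
def inner(f, g, mu):
    return sum(mu[x] * f[x] * g[x] for x in mu)

def dual_norm(f, F1, mu):
    """||f||_{mu,F1} = sup_{f1 in F1} |<f, f1>_mu|."""
    return max(abs(inner(f, f1, mu)) for f1 in F1) if F1 else 0.0

def greedy_covering(F2, F1, mu, eps):
    """Greedily pick representatives of F2 that are pairwise more than eps apart."""
    cover = []
    for f2 in F2:
        diffs = (dual_norm({x: f2[x] - c[x] for x in mu}, F1, mu) for c in cover)
        if all(nrm > eps for nrm in diffs):
            cover.append(f2)
    return cover  # an eps-covering of F2; its size upper-bounds N_{mu,F1}(F2, eps)
```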

The following basic facts about the covering number are very useful:

Lemma 6.

We have the following facts:

  1. if $\hat{\mathcal{F}}_{1}\subseteq\mathcal{F}_{1}\subseteq\mathbb{R}^{X}$, then $N_{\mu,\hat{\mathcal{F}}_{1}}(\mathcal{F}_{2},\varepsilon)\leq N_{\mu,\mathcal{F}_{1}}(\mathcal{F}_{2},\varepsilon)$;

  2. $N_{\mu,\mathcal{F}_{1}}(\mathcal{F}_{2},\varepsilon)=N_{\mu,\bar{\mathcal{F}}_{1}}(\mathcal{F}_{2},\varepsilon)$;

  3. for every bounded function $f\in\mathbb{R}^{X}$, $N_{\mu,\mathcal{F}_{1}}(\mathcal{F}_{2}+\{f\},\varepsilon)=N_{\mu,\mathcal{F}_{1}}(\mathcal{F}_{2},\varepsilon)$;

  4. for every $a,b\in\mathbb{R}_{>0}$, $N_{\mu,a\mathcal{F}_{1}}(b\mathcal{F}_{2},ab\varepsilon)=N_{\mu,\mathcal{F}_{1}}(\mathcal{F}_{2},\varepsilon)$.

2.3 Distinguishing Advantages as Inner Products and Norms

The inner products and norms provide convenient ways for describing distinguishing advantages. Given a distinguisher $d\in[-1,1]^{X}$ and two predictors $p_{1},p_{2}\in[0,1]^{X}$, Lemma 5 tells us that

$\mathrm{adv}_{\mu,d}(p_{1},p_{2})=|\langle d,p_{1}-p_{2}\rangle_{\mu}|,\quad\textnormal{and thus}$
$\mathrm{adv}_{\mu,\mathcal{D}}(p_{1},p_{2})=\|p_{1}-p_{2}\|_{\mu,\mathcal{D}}\quad\textnormal{for all distinguisher classes}\ \mathcal{D}\subseteq[-1,1]^{X}.$

Using these representations for advantages, we can prove the following lemma relating advantages to the $\ell_1$ error:

Lemma 7.

It holds that $\mathrm{adv}_{\mu,\mathcal{D}}(p_{1},p_{2})=\|p_{1}-p_{2}\|_{\mu,\mathcal{D}}\leq\operatorname{\mathbb{E}}_{x\sim\mu}[|p_{1}(x)-p_{2}(x)|]$. Moreover, when $\{-1,1\}^{X}\subseteq\mathcal{D}$, $\mathrm{adv}_{\mu,\mathcal{D}}(p_{1},p_{2})=\operatorname{\mathbb{E}}_{x\sim\mu}[|p_{1}(x)-p_{2}(x)|]$.

Proof.

To prove the first statement, we recall that

$\mathrm{adv}_{\mu,\mathcal{D}}(p_{1},p_{2})=\|p_{1}-p_{2}\|_{\mu,\mathcal{D}}=\sup_{d\in\mathcal{D}}|\operatorname{\mathbb{E}}_{x\sim\mu}[(p_{1}(x)-p_{2}(x))d(x)]|.$

Because $d(x)\in[-1,1]$ for all $d\in\mathcal{D}$ and $x\in X$, we are guaranteed that $|(p_{1}(x)-p_{2}(x))d(x)|\leq|p_{1}(x)-p_{2}(x)|$, which gives

$\sup_{d\in\mathcal{D}}|\operatorname{\mathbb{E}}_{x\sim\mu}[(p_{1}(x)-p_{2}(x))d(x)]|\leq\sup_{d\in\mathcal{D}}\operatorname{\mathbb{E}}_{x\sim\mu}[|(p_{1}(x)-p_{2}(x))d(x)|]\leq\operatorname{\mathbb{E}}_{x\sim\mu}[|p_{1}(x)-p_{2}(x)|],$

as desired.

For the second statement, consider the distinguisher $d$ defined such that $d(x)=1$ if $p_{1}(x)\geq p_{2}(x)$ and $d(x)=-1$ otherwise. For all $x\in X$, distinguisher $d$ satisfies

$(p_{1}(x)-p_{2}(x))d(x)=|p_{1}(x)-p_{2}(x)|.$

Therefore,

$\mathrm{adv}_{\mu,d}(p_{1},p_{2})=|\operatorname{\mathbb{E}}_{x\sim\mu}[(p_{1}(x)-p_{2}(x))d(x)]|=\operatorname{\mathbb{E}}_{x\sim\mu}[|p_{1}(x)-p_{2}(x)|].$

Since $d\in\{-1,1\}^{X}\subseteq\mathcal{D}$, this proves the second statement. ∎
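As a quick numerical sanity check of the equality case (a toy instance with values of our own choosing, using the function view of distinguishers from Section 2.1.2):

```python
# Toy check that the sign distinguisher d(x) = sign(p1(x) - p2(x)) achieves
# advantage equal to the l1 error, as in the proof of Lemma 7.
X = ["x1", "x2", "x3"]
mu = {"x1": 0.2, "x2": 0.5, "x3": 0.3}
p1 = {"x1": 0.9, "x2": 0.1, "x3": 0.4}
p2 = {"x1": 0.6, "x2": 0.3, "x3": 0.4}

d = {x: 1 if p1[x] >= p2[x] else -1 for x in X}       # the sign distinguisher
adv = abs(sum(mu[x] * (p1[x] - p2[x]) * d[x] for x in X))
l1 = sum(mu[x] * abs(p1[x] - p2[x]) for x in X)
print(adv, l1)   # both equal 0.2*0.3 + 0.5*0.2 + 0.3*0 = 0.16
```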

2.4 Fat-Shattering Dimension

Given a function class $\mathcal{F}\subseteq\mathbb{R}^{X}$ and $\gamma\geq 0$, we say $x_{1},\ldots,x_{n}\in X$ are $\gamma$-fat shattered by $\mathcal{F}$ w.r.t. $r_{1},\ldots,r_{n}\in\mathbb{R}$ if for every $(b_{1},\ldots,b_{n})\in\{-1,1\}^{n}$, there exists $f\in\mathcal{F}$ such that

$b_{i}(f(x_{i})-r_{i})\geq\gamma\quad\textnormal{for all}\ i\in\{1,\ldots,n\}.$

We sometimes omit the mention of $r_{1},\ldots,r_{n}$ and say $x_{1},\ldots,x_{n}$ are $\gamma$-fat shattered by $\mathcal{F}$ if such $r_{1},\ldots,r_{n}$ exist. The $\gamma$-fat-shattering dimension of $\mathcal{F}$, first introduced by Kearns and Schapire [1994], is defined to be

$\mathsf{fat}_{\mathcal{F}}(\gamma)=\sup\{n\in\mathbb{Z}_{\geq 0}:\textnormal{there exist $x_{1},\ldots,x_{n}\in X$ that are $\gamma$-fat shattered by $\mathcal{F}$}\}.$
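For a finite class $\mathcal{F}$ over a small finite $X$, the definition can be checked by brute force. The sketch below (exponential time; all helper names are ours) uses the observation that if witness thresholds $r_{1},\ldots,r_{n}$ exist at all, they can be taken to be midpoints $(f(x_{i})+g(x_{i}))/2$ of values attained by $\mathcal{F}$ at $x_{i}$.

```python
# Brute-force sketch of gamma-fat shattering; F is a list of dicts x -> value.
from itertools import combinations, product

def fat_shattered(points, F, gamma):
    """True iff the given points are gamma-fat shattered by F for some thresholds."""
    # Candidate thresholds: midpoints of pairs of values attained at each point.
    candidates = [sorted({(f[x] + g[x]) / 2 for f in F for g in F}) for x in points]
    for r in product(*candidates):
        if all(any(all(b * (f[x] - ri) >= gamma
                       for x, ri, b in zip(points, r, bits))
                   for f in F)
               for bits in product([-1, 1], repeat=len(points))):
            return True
    return False

def fat_dim(F, X, gamma):
    """Largest n such that some n points of X are gamma-fat shattered by F."""
    n = 0
    while any(fat_shattered(pts, F, gamma) for pts in combinations(X, n + 1)):
        n += 1
    return n
```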

3 Sample Complexity of Distribution-Specific Realizable OI

In this section, we give lower and upper bounds for the sample complexity of distribution-specific realizable OI for every given predictor class $\mathcal{P}$, distinguisher class $\mathcal{D}$, and distribution $\mu$ over individuals. Our lower bound is based on the metric entropy of $\mathcal{P}$ w.r.t. the norm $\|\cdot\|_{\mu,\mathcal{D}}$, whereas in our upper bound the roles of $\mathcal{P}$ and $\mathcal{D}$ are flipped. In the next section, we remove this role flip and give a complete characterization of the sample complexity using a version of metric entropy duality implied by combining our lower and upper bounds.

3.1 Lower Bound

We prove the following lemma showing that the sample complexity of distribution-specific realizable OI is lower bounded by the metric entropy of the predictor class 𝒫{\mathcal{P}} w.r.t. the dual Minkowski norm μ,𝒟\|\cdot\|_{\mu,{\mathcal{D}}} defined by the distinguisher class 𝒟{\mathcal{D}}. This lemma generalizes [Benedek and Itai, 1991, Lemma 4.8], which considered the special case where every predictor is a binary classifier, and the distinguisher class 𝒟{\mathcal{D}} contains all possible distinguishers (𝒟=[1,1]X{\mathcal{D}}=[-1,1]^{X}).

Lemma 8.

For every predictor class 𝒫[0,1]X{\mathcal{P}}\subseteq[0,1]^{X}, every distinguisher class 𝒟[1,1]X{\mathcal{D}}\subseteq[-1,1]^{X}, every distribution μΔX\mu\in\Delta_{X}, every advantage bound ε>0\varepsilon>0, and every failure probability bound δ(0,1)\delta\in(0,1), the following sample complexity lower bound holds for distribution-specific realizable OI:

𝖲𝖠𝖬𝖯𝖣𝖲𝖱(𝒫,𝒟,ε,δ,μ)log((1δ)Nμ,𝒟(𝒫,2ε)).\mathsf{SAMP\mathchar 45\relax DS\mathchar 45\relax R}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta,\mu)\geq\log((1-\delta)N_{\mu,{\mathcal{D}}}({\mathcal{P}},2\varepsilon)).
Proof.

Define M=Nμ,𝒟(𝒫,2ε)M=N_{\mu,{\mathcal{D}}}({\mathcal{P}},2\varepsilon). Let 𝒫{\mathcal{P}}^{\prime} be the maximum-size subset of 𝒫{\mathcal{P}} such that

p1p2μ,𝒟>2εfor all distinctp1,p2𝒫.\|p_{1}-p_{2}\|_{\mu,{\mathcal{D}}}>2\varepsilon\quad\textnormal{for all distinct}\ p_{1},p_{2}\in{\mathcal{P}}^{\prime}. (8)

It is clear that |𝒫|M|{\mathcal{P}}^{\prime}|\geq M: otherwise 𝒫{\mathcal{P}}^{\prime} would not be a 2ε2\varepsilon-covering of 𝒫{\mathcal{P}} (recall that MM is the minimum size of such a covering), so some predictor in 𝒫{\mathcal{P}} would be at distance more than 2ε2\varepsilon from every member of 𝒫{\mathcal{P}}^{\prime}, and we could add it to 𝒫{\mathcal{P}}^{\prime} without violating (8), contradicting the maximality of |𝒫||{\mathcal{P}}^{\prime}|.

Let nn be a nonnegative integer such that 𝖣𝖲𝖱n(𝒫,𝒟,ε,δ,μ)\mathsf{DS\mathchar 45\relax R}_{n}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta,\mu)\neq\emptyset. Our goal is to prove

nlog((1δ)M).n\geq\log((1-\delta)M). (9)

Let 𝒜{\mathcal{A}} be an algorithm in 𝖣𝖲𝖱n(𝒫,𝒟,ε,δ,μ)\mathsf{DS\mathchar 45\relax R}_{n}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta,\mu). We draw a predictor pp^{*} uniformly at random from 𝒫{\mathcal{P}}^{\prime}, and draw examples (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}) i.i.d. from μp\mu_{p^{*}}. We say algorithm 𝒜{\mathcal{A}} succeeds if it outputs pp such that ppμ,𝒟ε\|p-p^{*}\|_{\mu,{\mathcal{D}}}\leq\varepsilon. By assumption, when (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}) are given as input, algorithm 𝒜{\mathcal{A}} succeeds with probability at least 1δ1-\delta. Now instead of drawing x1,,xnx_{1},\ldots,x_{n} i.i.d. from μ\mu, we fix them so that the success probability is maximized. In other words, we can find fixed x1,,xnXx_{1},\ldots,x_{n}\in X such that if we run algorithm 𝒜{\mathcal{A}} on examples (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}) where oiBer(p(xi))o_{i}\sim\mathrm{Ber}(p^{*}(x_{i})) and pp^{*} is chosen uniformly at random from 𝒫{\mathcal{P}}^{\prime}, the algorithm succeeds with probability at least 1δ1-\delta. Similarly, we can fix the internal randomness of algorithm 𝒜{\mathcal{A}} and assume that 𝒜{\mathcal{A}} is deterministic without decreasing its success probability on (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}). Now algorithm 𝒜{\mathcal{A}} has at most 2n2^{n} possible inputs, and thus has at most 2n2^{n} possible outputs. No output can be the success output for two different choices of pp^{*} from 𝒫{\mathcal{P}}^{\prime} because of (8). Therefore, the success probability of algorithm 𝒜{\mathcal{A}} is at most 2n/M2^{n}/M, and thus

2n/M1δ.2^{n}/M\geq 1-\delta.

This implies (9), as desired. ∎
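The set 𝒫′ in the proof above is a maximum-size packing; for finite classes, a maximal packing (which already has size at least the covering number, which is all the argument needs) can be built greedily. The following sketch is our own illustration under the assumption that the predictor class, the distinguisher class, and the domain are all finite; the names dual_norm and greedy_packing and the random toy data are ours.

import numpy as np

def dual_norm(f, D, mu):
    # ||f||_{mu,D} = sup_{d in D} |E_{x~mu}[f(x) d(x)]| for a finite class D
    return max(abs(np.sum(mu * f * d)) for d in D)

def greedy_packing(P, D, mu, eps):
    # Greedily build a maximal 2*eps-packing P' of P w.r.t. ||.||_{mu,D}.
    # Maximality makes P' a 2*eps-covering of P as well, so |P'| >= N_{mu,D}(P, 2*eps),
    # which is the property used in the proof of Lemma 8.
    packing = []
    for p in P:
        if all(dual_norm(p - q, D, mu) > 2 * eps for q in packing):
            packing.append(p)
    return packing

# Tiny example on a 4-point domain with random predictors and distinguishers.
rng = np.random.default_rng(1)
X = 4
mu = np.full(X, 1.0 / X)
P = [rng.uniform(0.0, 1.0, size=X) for _ in range(20)]
D = [rng.uniform(-1.0, 1.0, size=X) for _ in range(8)]
print(len(greedy_packing(P, D, mu, eps=0.05)))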

3.2 Upper Bound

We give an algorithm for distribution-specific realizable OI to prove a sample complexity upper bound for it. Before we describe our algorithm, let us briefly discuss the empirical risk minimization algorithm (Algorithm 1). Benedek and Itai [1991, Proof of Lemma 4.6] showed that this natural algorithm works in the special case where 1) every predictor in 𝒫{\mathcal{P}} is a binary classifier, and 2) the distinguisher class 𝒟{\mathcal{D}} contains all possible distinguishers. When both 1) and 2) are satisfied, the algorithm gives a sample complexity upper bound of

O((1/ε)O(1)logNμ,𝒟(𝒫,ε/2)),O((1/\varepsilon)^{O(1)}\log N_{\mu,{\mathcal{D}}}({\mathcal{P}},\varepsilon/2)), (10)

which would give a satisfactory sample complexity characterization when combined with our lower bound in Lemma 8. However, in Appendix A, we show that the algorithm fails when only one of the two conditions 1) and 2) (no matter which) is true.3 Since neither 1) nor 2) is guaranteed to hold in the distribution-specific realizable OI setting, we use a new algorithm (Algorithm 2) to prove our sample complexity upper bound (Lemma 10) where the roles of 𝒫{\mathcal{P}} and 𝒟{\mathcal{D}} flip compared to (10). In Section 4, we show how to flip them back to get a sample complexity characterization for distribution-specific realizable OI.

3 There is a variant of Algorithm 1 that minimizes 𝗅𝗈𝗌𝗌(p)\mathsf{loss}(p) over the entire predictor class 𝒫{\mathcal{P}} instead of the covering 𝒫{\mathcal{P}}^{\prime}. As discussed in [Benedek and Itai, 1991], this variant is not guaranteed to give a sample complexity upper bound close to (10) even under both conditions 1) and 2). Changing the definition of 𝗅𝗈𝗌𝗌(p)\mathsf{loss}(p) to supd𝒟|p,dμ1ni=1nd(xi)oi|\sup_{d\in{\mathcal{D}}}|\langle p,d\rangle_{\mu}-\frac{1}{n}\sum_{i=1}^{n}d(x_{i})o_{i}| (mimicking Algorithm 2) also makes Algorithm 1 fail under both conditions 1) and 2). To see this, suppose μ\mu is the uniform distribution over a finite domain XX, 𝒫={p0,p1}{\mathcal{P}}=\{p_{0},p_{1}\} where p0(x)=0p_{0}(x)=0 and p1(x)=1p_{1}(x)=1 for every xXx\in X, and 𝒟=[1,1]X{\mathcal{D}}=[-1,1]^{X}. Assuming ε(0,1)\varepsilon\in(0,1) and n<|X|/10n<|X|/10, when the target predictor pp^{*} is p1p_{1}, Algorithm 1 always outputs p0p_{0} on the new 𝗅𝗈𝗌𝗌\mathsf{loss} (note that changing the values of dd on x1,,xnx_{1},\ldots,x_{n} can significantly change 1ni=1nd(xi)oi\frac{1}{n}\sum_{i=1}^{n}d(x_{i})o_{i}, but it never changes p,dμ\langle p,d\rangle_{\mu} by more than 2n/|X|2n/|X|).

Parameters : predictor class 𝒫{\mathcal{P}}, distinguisher class 𝒟{\mathcal{D}}, distribution μ\mu, MDA bound ε\varepsilon, positive integer nn.
Input : examples (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}).
Output : predictor p𝒫p\in{\mathcal{P}}.
𝒫{\mathcal{P}}^{\prime}\leftarrow minimum-size ε/2\varepsilon/2-covering of 𝒫{\mathcal{P}} w.r.t. norm μ,𝒟\|\cdot\|_{\mu,{\mathcal{D}}};
return p𝒫p\in{\mathcal{P}}^{\prime} that minimizes the empirical error
𝗅𝗈𝗌𝗌(p):=supd𝒟|1ni=1nd(xi)(p(xi)oi)|;\mathsf{loss}(p):=\sup_{d\in{\mathcal{D}}}\left|\frac{1}{n}\sum_{i=1}^{n}d(x_{i})(p(x_{i})-o_{i})\right|;
Algorithm 1 Empirical Risk Minimization
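For concreteness, the following sketch spells out the empirical loss used by Algorithm 1 and the minimization over a covering, assuming a finite domain and finite classes represented as numpy arrays; the covering 𝒫′ is passed in precomputed (for instance greedily, as in the sketch after the proof of Lemma 8), and all names here are our own.

import numpy as np

def empirical_loss_alg1(p, D, xs, os):
    # loss(p) = sup_{d in D} | (1/n) sum_i d(x_i) (p(x_i) - o_i) |
    diffs = p[xs] - os
    return max(abs(np.mean(d[xs] * diffs)) for d in D)

def erm_over_covering(P_cov, D, xs, os):
    # Return the member of the covering P_cov minimizing the empirical loss.
    return min(P_cov, key=lambda p: empirical_loss_alg1(p, D, xs, os))

# Minimal usage with random data; P_cov stands in for the eps/2-covering P'.
rng = np.random.default_rng(2)
Xsz, n = 4, 200
P_cov = [rng.uniform(0.0, 1.0, size=Xsz) for _ in range(3)]
D = [rng.uniform(-1.0, 1.0, size=Xsz) for _ in range(5)]
xs = rng.integers(0, Xsz, size=n)
os = rng.binomial(1, P_cov[0][xs]).astype(float)
print(empirical_loss_alg1(erm_over_covering(P_cov, D, xs, os), D, xs, os))
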
Parameters : predictor class 𝒫{\mathcal{P}}, distinguisher class 𝒟{\mathcal{D}}, distribution μ\mu, MDA bound ε\varepsilon, positive integer nn.
Input : examples (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}).
Output : predictor p𝒫p\in{\mathcal{P}}.
𝒬𝒫𝒫{\mathcal{Q}}\leftarrow{\mathcal{P}}-{\mathcal{P}} ;
  /* Recall that 𝒫𝒫={p1p2:p1,p2𝒫}{\mathcal{P}}-{\mathcal{P}}=\{p_{1}-p_{2}:p_{1},p_{2}\in{\mathcal{P}}\}. */
𝒟{\mathcal{D}}^{\prime}\leftarrow minimum-size ε/2\varepsilon/2-covering of 𝒟{\mathcal{D}} w.r.t. norm μ,𝒬\|\cdot\|_{\mu,{\mathcal{Q}}};
return p𝒫p\in{\mathcal{P}} that ε/16\varepsilon/16-minimizes
𝗅𝗈𝗌𝗌(p):=supd𝒟|p,dμ1ni=1nd(xi)oi|,\mathsf{loss}(p):=\sup_{d\in{\mathcal{D}}^{\prime}}\left|\langle p,d\rangle_{\mu}-\frac{1}{n}\sum_{i=1}^{n}d(x_{i})o_{i}\right|,
i.e., 𝗅𝗈𝗌𝗌(p)infp𝒫𝗅𝗈𝗌𝗌(p)+ε/16\mathsf{loss}(p)\leq\inf_{p^{\prime}\in{\mathcal{P}}}\mathsf{loss}(p^{\prime})+\varepsilon/16;
Algorithm 2 Distinguisher Covering
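The following sketch shows how Algorithm 2 can be run end to end when the domain, the predictor class, and the distinguisher class are all finite. It is only an illustration under those assumptions: the greedy covering below is maximal rather than minimum-size (which suffices for an illustration), and the names greedy_cover and distinguisher_covering are ours.

import numpy as np

def dual_norm(f, C, mu):
    # ||f||_{mu,C} = sup_{g in C} |<f, g>_mu| for a finite class C
    return max(abs(np.sum(mu * f * g)) for g in C)

def greedy_cover(D, Q, mu, radius):
    # A greedy radius-covering of D w.r.t. ||.||_{mu,Q}: a maximal packing,
    # hence a covering (the listing asks for a minimum-size covering, which
    # the greedy set only approximates).
    cover = []
    for d in D:
        if all(dual_norm(d - d2, Q, mu) > radius for d2 in cover):
            cover.append(d)
    return cover

def distinguisher_covering(P, D, mu, eps, xs, os):
    # A sketch of Algorithm 2 for finite classes on a finite domain.
    Q = [p1 - p2 for p1 in P for p2 in P]      # Q = P - P
    D_cov = greedy_cover(D, Q, mu, eps / 2)    # eps/2-covering of D w.r.t. ||.||_{mu,Q}

    def loss(p):
        # sup_{d in D'} | <p, d>_mu - (1/n) sum_i d(x_i) o_i |
        return max(abs(np.sum(mu * p * d) - np.mean(d[xs] * os)) for d in D_cov)

    # P is finite here, so an exact minimizer in particular eps/16-minimizes the loss.
    return min(P, key=loss)

# Usage: draw examples from mu_{p*} for a target p* in P and run the algorithm.
rng = np.random.default_rng(3)
Xsz, n = 6, 500
mu = np.full(Xsz, 1.0 / Xsz)
P = [rng.uniform(0.0, 1.0, size=Xsz) for _ in range(5)]
D = [rng.uniform(-1.0, 1.0, size=Xsz) for _ in range(10)]
p_star = P[0]
xs = rng.choice(Xsz, size=n, p=mu)
os = rng.binomial(1, p_star[xs]).astype(float)
p_hat = distinguisher_covering(P, D, mu, 0.2, xs, os)
print(max(abs(np.sum(mu * (p_hat - p_star) * d)) for d in D))   # small with high probability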

Our sample complexity upper bound is based on the following analysis of Algorithm 2:

Lemma 9.

For every predictor class 𝒫[0,1]X{\mathcal{P}}\subseteq[0,1]^{X}, every distinguisher class 𝒟[1,1]X{\mathcal{D}}\subseteq[-1,1]^{X}, every distribution μΔX\mu\in\Delta_{X}, every advantage bound ε(0,1)\varepsilon\in(0,1), and every failure probability bound δ(0,1)\delta\in(0,1), there exists a positive integer

n\displaystyle n O(ε2(logNμ,𝒬(𝒟,ε/2)+log(2/δ)))\displaystyle\leq O\big{(}\varepsilon^{-2}\big{(}\log N_{\mu,{\mathcal{Q}}}({\mathcal{D}},\varepsilon/2)+\log(2/\delta)\big{)}\big{)} (11)
O(ε2(logNμ,𝒫(𝒟,ε/4)+log(2/δ)))\displaystyle\leq O\big{(}\varepsilon^{-2}\big{(}\log N_{\mu,{\mathcal{P}}}({\mathcal{D}},\varepsilon/4)+\log(2/\delta)\big{)}\big{)} (12)

such that Algorithm 2 belongs to

p[0,1]X𝖮𝖨n(p,𝒟,3infp𝒫advμ,𝒟(p,p)+ε,δ,μ),{\bigcap}_{p^{*}\in[0,1]^{X}}\mathsf{OI}_{n}(p^{*},{\mathcal{D}},3\inf_{p\in{\mathcal{P}}}\mathrm{adv}_{\mu,{\mathcal{D}}}(p,p^{*})+\varepsilon,\delta,\mu),

where 𝒬=𝒫𝒫={p1p2:p1,p2𝒫}{\mathcal{Q}}={\mathcal{P}}-{\mathcal{P}}=\{p_{1}-p_{2}:p_{1},p_{2}\in{\mathcal{P}}\}.

Proof.

We first note that fμ,𝒬2fμ,𝒫\|f\|_{\mu,{\mathcal{Q}}}\leq 2\|f\|_{\mu,{\mathcal{P}}} for all bounded functions fXf\in\mathbb{R}^{X}, which implies that

Nμ,𝒬(𝒟,ε/2)Nμ,𝒫(𝒟,ε/4).N_{\mu,{\mathcal{Q}}}({\mathcal{D}},\varepsilon/2)\leq N_{\mu,{\mathcal{P}}}({\mathcal{D}},\varepsilon/4).

This proves inequality (12).

Define M=Nμ,𝒬(𝒟,ε/2)M=N_{\mu,{\mathcal{Q}}}({\mathcal{D}},\varepsilon/2) and ε0=infp𝒫advμ,𝒟(p,p)\varepsilon_{0}=\inf_{p\in{\mathcal{P}}}\mathrm{adv}_{\mu,{\mathcal{D}}}(p,p^{*}) for an arbitrary p[0,1]Xp^{*}\in[0,1]^{X}. It remains to prove that Algorithm 2 belongs to

p[0,1]X𝖮𝖨n(p,𝒟,3ε0+ε,δ,μ){\bigcap}_{p^{*}\in[0,1]^{X}}\mathsf{OI}_{n}(p^{*},{\mathcal{D}},3\varepsilon_{0}+\varepsilon,\delta,\mu)

for some n=O(ε2(logM+log(2/δ)))n=O(\varepsilon^{-2}(\log M+\log(2/\delta))) determined below. Given nn examples (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}), we define K(d)=1ni=1noid(xi)K(d)=\frac{1}{n}\sum_{i=1}^{n}o_{i}d(x_{i}) for every distinguisher d𝒟d\in{\mathcal{D}}^{\prime}, where 𝒟{\mathcal{D}}^{\prime} is the minimum-size ε/2\varepsilon/2-covering of 𝒟{\mathcal{D}} (w.r.t. the norm μ,𝒬\|\cdot\|_{\mu,{\mathcal{Q}}}) computed in Algorithm 2. By definition, |𝒟|=M|{\mathcal{D}}^{\prime}|=M, so by the Chernoff bound and the union bound, for some n=O(ε2(logM+log(2/δ)))n=O(\varepsilon^{-2}(\log M+\log(2/\delta))), with probability at least 1δ1-\delta,

|K(d)p,dμ|ε/16for alld𝒟.|K(d)-\langle p^{*},d\rangle_{\mu}|\leq\varepsilon/16\quad\textnormal{for all}\ d\in{\mathcal{D}}^{\prime}. (13)

Assuming that (13) is true, it suffices to prove that the output pp of Algorithm 2 satisfies ppμ,𝒟3ε0+ε\|p-p^{*}\|_{\mu,{\mathcal{D}}}\leq 3\varepsilon_{0}+\varepsilon. Let p~𝒫\tilde{p}\in{\mathcal{P}} be a predictor that satisfies p~pμ,𝒟ε0+ε/16\|\tilde{p}-p^{*}\|_{\mu,{\mathcal{D}}}\leq\varepsilon_{0}+\varepsilon/16. We have

supd𝒟|K(d)p~,dμ|supd𝒟|p,dμp~,dμ|+ε/16ε0+ε/8.\sup_{d\in{\mathcal{D}}^{\prime}}|K(d)-\langle\tilde{p},d\rangle_{\mu}|\leq\sup_{d\in{\mathcal{D}}^{\prime}}|\langle p^{*},d\rangle_{\mu}-\langle\tilde{p},d\rangle_{\mu}|+\varepsilon/16\leq\varepsilon_{0}+\varepsilon/8.

The output predictor pp of Algorithm 2 satisfies

supd𝒟|K(d)p,dμ|supd𝒟|K(d)p~,dμ|+ε/16ε0+3ε/16.\sup_{d\in{\mathcal{D}}^{\prime}}|K(d)-\langle p,d\rangle_{\mu}|\leq\sup_{d\in{\mathcal{D}}^{\prime}}|K(d)-\langle\tilde{p},d\rangle_{\mu}|+\varepsilon/16\leq\varepsilon_{0}+3\varepsilon/16.

Therefore,

supd𝒟|pp~,dμ|\displaystyle\sup_{d\in{\mathcal{D}}^{\prime}}|\langle p-\tilde{p},d\rangle_{\mu}| supd𝒟|p,dμK(d)|+supd𝒟|p~,dμK(d)|\displaystyle\leq\sup_{d\in{\mathcal{D}}^{\prime}}|\langle p,d\rangle_{\mu}-K(d)|+\sup_{d\in{\mathcal{D}}^{\prime}}|\langle\tilde{p},d\rangle_{\mu}-K(d)|
(ε0+ε/8)+(ε0+3ε/16)\displaystyle\leq(\varepsilon_{0}+\varepsilon/8)+(\varepsilon_{0}+3\varepsilon/16)
2ε0+5ε/16.\displaystyle\leq 2\varepsilon_{0}+5\varepsilon/16.

Since pp~𝒬p-\tilde{p}\in{\mathcal{Q}} and 𝒟{\mathcal{D}}^{\prime} is an ε/2\varepsilon/2-covering w.r.t. μ,𝒬\|\cdot\|_{\mu,{\mathcal{Q}}},

pp~μ,𝒟=supd𝒟|pp~,dμ|supd𝒟|pp~,dμ|+ε/22ε0+13ε/16.\|p-\tilde{p}\|_{\mu,{\mathcal{D}}}=\sup_{d\in{\mathcal{D}}}|\langle p-\tilde{p},d\rangle_{\mu}|\leq\sup_{d\in{\mathcal{D}}^{\prime}}|\langle p-\tilde{p},d\rangle_{\mu}|+\varepsilon/2\leq 2\varepsilon_{0}+13\varepsilon/16.

Finally,

ppμ,𝒟pp~μ,𝒟+p~pμ,𝒟(2ε0+13ε/16)+(ε0+ε/16)3ε0+ε,\|p-p^{*}\|_{\mu,{\mathcal{D}}}\leq\|p-\tilde{p}\|_{\mu,{\mathcal{D}}}+\|\tilde{p}-p^{*}\|_{\mu,{\mathcal{D}}}\leq(2\varepsilon_{0}+13\varepsilon/16)+(\varepsilon_{0}+\varepsilon/16)\leq 3\varepsilon_{0}+\varepsilon,

as desired. ∎

We are now ready to state and prove our sample complexity upper bound for distribution-specific realizable OI.

Lemma 10.

For every predictor class 𝒫[0,1]X{\mathcal{P}}\subseteq[0,1]^{X}, every distinguisher class 𝒟[1,1]X{\mathcal{D}}\subseteq[-1,1]^{X}, every distribution μΔX\mu\in\Delta_{X}, every advantage bound ε>0\varepsilon>0, and every failure probability bound δ(0,1)\delta\in(0,1), the following sample complexity upper bound holds for distribution-specific realizable OI:

𝖲𝖠𝖬𝖯𝖣𝖲𝖱(𝒫,𝒟,ε,δ,μ)\displaystyle\mathsf{SAMP\mathchar 45\relax DS\mathchar 45\relax R}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta,\mu) O(ε2(logNμ,𝒬(𝒟,ε/2)+log(2/δ)))\displaystyle\leq O\big{(}\varepsilon^{-2}\big{(}\log N_{\mu,{\mathcal{Q}}}({\mathcal{D}},\varepsilon/2)+\log(2/\delta)\big{)}\big{)}
O(ε2(logNμ,𝒫(𝒟,ε/4)+log(2/δ))),\displaystyle\leq O\big{(}\varepsilon^{-2}\big{(}\log N_{\mu,{\mathcal{P}}}({\mathcal{D}},\varepsilon/4)+\log(2/\delta)\big{)}\big{)},

where 𝒬=𝒫𝒫={p1p2:p1,p2𝒫}{\mathcal{Q}}={\mathcal{P}}-{\mathcal{P}}=\{p_{1}-p_{2}:p_{1},p_{2}\in{\mathcal{P}}\}.

Proof.

When ε1\varepsilon\geq 1, we have 𝖲𝖠𝖬𝖯𝖣𝖲𝖱(𝒫,𝒟,ε,δ,μ)=0\mathsf{SAMP\mathchar 45\relax DS\mathchar 45\relax R}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta,\mu)=0 and the lemma is trivially true. We assume ε(0,1)\varepsilon\in(0,1) henceforth.

For every p𝒫p^{*}\in{\mathcal{P}}, we have infp𝒫advμ,𝒟(p,p)=0\inf_{p\in{\mathcal{P}}}\mathrm{adv}_{\mu,{\mathcal{D}}}(p,p^{*})=0, so by Lemma 9, there exists a positive integer nn satisfying (11) and (12) such that

p[0,1]X𝖮𝖨n(p,𝒟,3infp𝒫advμ,𝒟(p,p)+ε,δ,μ)\displaystyle\emptyset\neq{\bigcap}_{p^{*}\in[0,1]^{X}}\mathsf{OI}_{n}(p^{*},{\mathcal{D}},3\inf_{p\in{\mathcal{P}}}\mathrm{adv}_{\mu,{\mathcal{D}}}(p,p^{*})+\varepsilon,\delta,\mu) p𝒫𝖮𝖨n(p,𝒟,ε,δ,μ)\displaystyle\subseteq{\bigcap}_{p^{*}\in{\mathcal{P}}}\mathsf{OI}_{n}(p^{*},{\mathcal{D}},\varepsilon,\delta,\mu)
=𝖣𝖲𝖱n(𝒫,𝒟,ε,δ,μ).\displaystyle=\mathsf{DS\mathchar 45\relax R}_{n}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta,\mu).

This completes the proof by the definition of 𝖲𝖠𝖬𝖯𝖣𝖲𝖱(𝒫,𝒟,ε,δ,μ)\mathsf{SAMP\mathchar 45\relax DS\mathchar 45\relax R}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta,\mu). ∎

4 Metric Entropy Duality

In the previous section, we proved lower and upper bounds for the sample complexity of distribution-specific realizable OI, but these bounds do not yet give a satisfactory sample complexity characterization because of the exchanging roles of the predictor class 𝒫{\mathcal{P}} and the distinguisher class 𝒟{\mathcal{D}} in the lower and upper bounds. In this section, we solve the issue by proving the following theorem:

Theorem 11.

There exists an absolute constant c1c\geq 1 with the following property. For M1,M2>0M_{1},M_{2}>0, let 1[M1,M1]X{\mathcal{F}}_{1}\subseteq[-M_{1},M_{1}]^{X} and 2[M2,M2]X{\mathcal{F}}_{2}\subseteq[-M_{2},M_{2}]^{X} be two non-empty bounded function classes. For any distribution μΔX\mu\in\Delta_{X} and any ε>0\varepsilon>0, it holds that

logNμ,2(1,ε)c(M1M2/ε)2(1+logNμ,1(2,ε/8)).\log N_{\mu,{\mathcal{F}}_{2}}({\mathcal{F}}_{1},\varepsilon)\leq c(M_{1}M_{2}/\varepsilon)^{2}(1+\log N_{\mu,{\mathcal{F}}_{1}}({\mathcal{F}}_{2},\varepsilon/8)).

Before we prove this theorem, we note that its statement is similar to that of the long-standing metric entropy duality conjecture first proposed by Pietsch [1972]. The conjecture can be stated as follows using our notation [for other equivalent statements of the conjecture, see e.g. Artstein et al., 2004a]:

Conjecture 12 (Pietsch [1972]).

There exist absolute constants c1,c21c_{1},c_{2}\geq 1 with the following property. For any two bounded function classes 1,2X{\mathcal{F}}_{1},{\mathcal{F}}_{2}\subseteq\mathbb{R}^{X} over a non-empty finite set XX, if 1{\mathcal{F}}_{1} and 2{\mathcal{F}}_{2} are convex and symmetric (i.e. ¯1=1,¯2=2\bar{}{\mathcal{F}}_{1}={\mathcal{F}}_{1},\bar{}{\mathcal{F}}_{2}={\mathcal{F}}_{2}), then for any distribution μΔX\mu\in\Delta_{X} and any ε>0\varepsilon>0,

logNμ,2(1,ε)c1logNμ,1(2,ε/c2).\log N_{\mu,{\mathcal{F}}_{2}}({\mathcal{F}}_{1},\varepsilon)\leq c_{1}\log N_{\mu,{\mathcal{F}}_{1}}({\mathcal{F}}_{2},\varepsilon/c_{2}).

Compared to Conjecture 12, our Theorem 11 gives a variant of metric entropy duality which does not require 1{\mathcal{F}}_{1} and 2{\mathcal{F}}_{2} to be convex and symmetric, but has the constant c1c_{1} in Conjecture 12 replaced by a quantity dependent on the granularity ε\varepsilon and the scale of the functions in 1{\mathcal{F}}_{1} and 2{\mathcal{F}}_{2}. Since the predictor class 𝒫{\mathcal{P}} and the distinguisher class 𝒟{\mathcal{D}} are not in general convex and symmetric, our Theorem 11 is more convenient for proving sample complexity characterizations for OI (see Theorem 13). In Lemma 14, we show that the quadratic dependence on M1M2/εM_{1}M_{2}/\varepsilon in Theorem 11 is nearly tight.

Below we prove Theorem 11 by combining our lower and upper bounds in the previous section.

Proof.

By Lemma 6 Item 4, we can assume w.l.o.g. that M1=M2=1M_{1}=M_{2}=1. Define 𝒟=2[1,1]X{\mathcal{D}}={\mathcal{F}}_{2}\subseteq[-1,1]^{X}, 𝒫={(1+f)/2:f1}[0,1]X{\mathcal{P}}=\{(1+f)/2:f\in{\mathcal{F}}_{1}\}\subseteq[0,1]^{X}, and 𝒬=𝒫𝒫{\mathcal{Q}}={\mathcal{P}}-{\mathcal{P}}. Combining Lemma 8 and Lemma 10 with δ=1/3\delta=1/3 and ε\varepsilon replaced by ε/4\varepsilon/4, there exists a constant c1c\geq 1 such that

logNμ,𝒟(𝒫,ε/2)cε2(1+logNμ,𝒬(𝒟,ε/8)).\log N_{\mu,{\mathcal{D}}}({\mathcal{P}},\varepsilon/2)\leq c\varepsilon^{-2}(1+\log N_{\mu,{\mathcal{Q}}}({\mathcal{D}},\varepsilon/8)).

Using Lemma 6 Items 3 and 4,

logNμ,2(1,ε)\displaystyle\log N_{\mu,{\mathcal{F}}_{2}}({\mathcal{F}}_{1},\varepsilon) =logNμ,𝒟(𝒫,ε/2)\displaystyle=\log N_{\mu,{\mathcal{D}}}({\mathcal{P}},\varepsilon/2)
cε2(1+logNμ,𝒬(𝒟,ε/8))\displaystyle\leq c\varepsilon^{-2}(1+\log N_{\mu,{\mathcal{Q}}}({\mathcal{D}},\varepsilon/8))
cε2(1+logNμ,1(𝒟,ε/8))\displaystyle\leq c\varepsilon^{-2}(1+\log N_{\mu,{\mathcal{F}}_{1}}({\mathcal{D}},\varepsilon/8)) (14)
=cε2(1+logNμ,1(2,ε/8)),\displaystyle=c\varepsilon^{-2}(1+\log N_{\mu,{\mathcal{F}}_{1}}({\mathcal{F}}_{2},\varepsilon/8)),

as desired. Here, inequality (14) holds because fμ,𝒬fμ,1\|f\|_{\mu,{\mathcal{Q}}}\leq\|f\|_{\mu,{\mathcal{F}}_{1}} for every bounded function fXf\in\mathbb{R}^{X}. ∎

We are now ready to state and prove our sample complexity characterizations for distribution-specific realizable OI.

Theorem 13.

For every predictor class 𝒫[0,1]X{\mathcal{P}}\subseteq[0,1]^{X}, every distinguisher class 𝒟[1,1]X{\mathcal{D}}\subseteq[-1,1]^{X}, every distribution μΔX\mu\in\Delta_{X}, every advantage bound ε>0\varepsilon>0, and every failure probability bound δ(0,1)\delta\in(0,1), the following sample complexity characterizations hold for distribution-specific realizable OI:

logNμ,𝒟(𝒫,2ε)+log(1δ)\displaystyle\log N_{\mu,{\mathcal{D}}}({\mathcal{P}},2\varepsilon)+\log(1-\delta)
\displaystyle\leq{} 𝖲𝖠𝖬𝖯𝖣𝖲𝖱(𝒫,𝒟,ε,δ,μ)\displaystyle\mathsf{SAMP\mathchar 45\relax DS\mathchar 45\relax R}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta,\mu)
\displaystyle\leq{} O(ε4logNμ,𝒟(𝒫,ε/32)+ε2log(2/δ)),\displaystyle O(\varepsilon^{-4}\log N_{\mu,{\mathcal{D}}}({\mathcal{P}},\varepsilon/32)+\varepsilon^{-2}\log(2/\delta)),

and

Ω(ε2logNμ,𝒫(𝒟,16ε))1+log(1δ)\displaystyle\Omega(\varepsilon^{2}\log N_{\mu,{\mathcal{P}}}({\mathcal{D}},16\varepsilon))-1+\log(1-\delta)
\displaystyle\leq{} 𝖲𝖠𝖬𝖯𝖣𝖲𝖱(𝒫,𝒟,ε,δ,μ)\displaystyle\mathsf{SAMP\mathchar 45\relax DS\mathchar 45\relax R}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta,\mu)
\displaystyle\leq{} O(ε2logNμ,𝒫(𝒟,ε/4)+ε2log(2/δ)).\displaystyle O(\varepsilon^{-2}\log N_{\mu,{\mathcal{P}}}({\mathcal{D}},\varepsilon/4)+\varepsilon^{-2}\log(2/\delta)).

Since logNμ,𝒟(𝒫,ε)\log N_{\mu,{\mathcal{D}}}({\mathcal{P}},\varepsilon) is monotone w.r.t. 𝒟{\mathcal{D}} (Lemma 6 Item 1), Theorem 13 shows that, compared to 1\ell_{1}-error-based learning, which corresponds to 𝒟=[1,1]X{\mathcal{D}}=[-1,1]^{X} (see Lemma 7), restricting to a smaller class 𝒟{\mathcal{D}} of distinguishers reduces the sample complexity, or allows us to achieve a smaller ε\varepsilon (and potentially better performance guarantees) with the same sample size. We prove Theorem 13 below.

Proof.

We start with the following chain of inequalities:

Ω(ε2logNμ,𝒫(𝒟,16ε))1+log(1δ)\displaystyle\Omega(\varepsilon^{2}\log N_{\mu,{\mathcal{P}}}({\mathcal{D}},16\varepsilon))-1+\log(1-\delta)
\displaystyle\leq{} logNμ,𝒟(𝒫,2ε)+log(1δ)\displaystyle\log N_{\mu,{\mathcal{D}}}({\mathcal{P}},2\varepsilon)+\log(1-\delta) (by Theorem 11)
\displaystyle\leq{} 𝖲𝖠𝖬𝖯𝖣𝖲𝖱(𝒫,𝒟,ε,δ,μ)\displaystyle\mathsf{SAMP\mathchar 45\relax DS\mathchar 45\relax R}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta,\mu) (by Lemma 8)
\displaystyle\leq{} O(ε2logNμ,𝒫(𝒟,ε/4)+ε2log(2/δ)),\displaystyle O(\varepsilon^{-2}\log N_{\mu,{\mathcal{P}}}({\mathcal{D}},\varepsilon/4)+\varepsilon^{-2}\log(2/\delta)), (15)

where the last inequality is by Lemma 10. It remains to show

𝖲𝖠𝖬𝖯𝖣𝖲𝖱(𝒫,𝒟,ε,δ,μ)O(ε4logNμ,𝒟(𝒫,ε/32)+ε2log(2/δ)).\mathsf{SAMP\mathchar 45\relax DS\mathchar 45\relax R}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta,\mu)\leq O(\varepsilon^{-4}\log N_{\mu,{\mathcal{D}}}({\mathcal{P}},\varepsilon/32)+\varepsilon^{-2}\log(2/\delta)). (16)

If Nμ,𝒟(𝒫,ε/32)1N_{\mu,{\mathcal{D}}}({\mathcal{P}},\varepsilon/32)\leq 1, it is clear that 𝖲𝖠𝖬𝖯𝖣𝖲𝖱(𝒫,𝒟,ε,δ,μ)=0\mathsf{SAMP\mathchar 45\relax DS\mathchar 45\relax R}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta,\mu)=0, so the inequality is trivial. If Nμ,𝒟(𝒫,ε/32)2N_{\mu,{\mathcal{D}}}({\mathcal{P}},\varepsilon/32)\geq 2, we have the following inequality by Theorem 11:

ε2logNμ,𝒫(𝒟,ε/4)O(ε4(1+logNμ,𝒟(𝒫,ε/32)))O(ε4logNμ,𝒟(𝒫,ε/32)).\varepsilon^{-2}\log N_{\mu,{\mathcal{P}}}({\mathcal{D}},\varepsilon/4)\leq O(\varepsilon^{-4}(1+\log N_{\mu,{\mathcal{D}}}({\mathcal{P}},\varepsilon/32)))\leq O(\varepsilon^{-4}\log N_{\mu,{\mathcal{D}}}({\mathcal{P}},\varepsilon/32)).

Inequality (16) is proved by plugging the inequality above into (15). ∎

We complete the section by showing near-tightness of Theorem 11.

Lemma 14.

There exists a constant c>0c>0 such that for all M1>0,M2>0,ε(0,M1M2/2)M_{1}>0,M_{2}>0,\varepsilon\in(0,M_{1}M_{2}/2), there exist a ground set XX, a distribution μ\mu over XX, and function classes 1[M1,M1]X,2[M2,M2]X{\mathcal{F}}_{1}\subseteq[-M_{1},M_{1}]^{X},{\mathcal{F}}_{2}\subseteq[-M_{2},M_{2}]^{X} such that

logNμ,2(1,ε)c(1+log|2|)(M1M2ε)2(log(M1M2ε))1.\log N_{\mu,{\mathcal{F}}_{2}}({\mathcal{F}}_{1},\varepsilon)\geq c(1+\log|{\mathcal{F}}_{2}|)\left(\frac{M_{1}M_{2}}{\varepsilon}\right)^{2}\left(\log\left(\frac{M_{1}M_{2}}{\varepsilon}\right)\right)^{-1}.
Proof.

By Lemma 6 Item 4, we can assume w.l.o.g. that M1=M2=1M_{1}=M_{2}=1. Now we have ε<M1M2/2=1/2\varepsilon<M_{1}M_{2}/2=1/2, and 1/(2ε2)>21/(2\varepsilon^{2})>2. Let mm be the largest integer power of 22 such that m1/(2ε2)m\leq 1/(2\varepsilon^{2}). We choose XX to be {1,,m}\{1,\ldots,m\}, and choose μ\mu to be the uniform distribution over XX. Let 𝗏𝖾𝖼\mathsf{vec} be the bijection from X\mathbb{R}^{X} to m\mathbb{R}^{m} such that 𝗏𝖾𝖼(f)=(f(1),,f(m))m\mathsf{vec}(f)=(f(1),\ldots,f(m))\in\mathbb{R}^{m} for all fXf\in\mathbb{R}^{X}. Define HmH_{m} to be the set of vectors in {1,1}m\{-1,1\}^{m} consisting of the mm columns of the Hadamard matrix of size m×mm\times m. We choose 1=[1,1]X{\mathcal{F}}_{1}=[-1,1]^{X}, and 2={𝗏𝖾𝖼1(v):vHm}{\mathcal{F}}_{2}=\{\mathsf{vec}^{-1}(v):v\in H_{m}\}. The intuition behind the choice of 2{\mathcal{F}}_{2} is to keep |2||{\mathcal{F}}_{2}| small while making Nμ,2(1,ε)N_{\mu,{\mathcal{F}}_{2}}({\mathcal{F}}_{1},\varepsilon) large, which by Lemma 6 Items 1 and 2 roughly corresponds to making the symmetric convex hull ¯2\bar{}{\mathcal{F}}_{2} large. That is why we use the Hadamard matrix to ensure that the functions ff in 2{\mathcal{F}}_{2} achieve the maximum norm (with f(x)=±1f(x)=\pm 1 for every xXx\in X) and are orthogonal to each other.

Define B={fX:fμ,2ε}B=\{f\in\mathbb{R}^{X}:\|f\|_{\mu,{\mathcal{F}}_{2}}\leq\varepsilon\}. By the properties of the Hadamard matrix, the functions in 2{\mathcal{F}}_{2} form an orthonormal basis of X\mathbb{R}^{X} w.r.t. the inner product ,μ\langle\cdot,\cdot\rangle_{\mu}. This implies

B={f2rff:rf[ε,ε]for allf2}.B=\left\{{\sum}_{f\in{\mathcal{F}}_{2}}r_{f}f:r_{f}\in[-\varepsilon,\varepsilon]\ \textnormal{for all}\ f\in{\mathcal{F}}_{2}\right\}. (17)

Let 11{\mathcal{F}}^{\prime}_{1}\subseteq{\mathcal{F}}_{1} be an ε\varepsilon-covering of 1{\mathcal{F}}_{1} w.r.t. norm μ,2\|\cdot\|_{\mu,{\mathcal{F}}_{2}} such that |1|=Nμ,2(1,ε)|{\mathcal{F}}_{1}^{\prime}|=N_{\mu,{\mathcal{F}}_{2}}({\mathcal{F}}_{1},\varepsilon). By the definition of ε\varepsilon-covering,

1f1({f}+B).{\mathcal{F}}_{1}\subseteq{\bigcup}_{f\in{\mathcal{F}}_{1}^{\prime}}(\{f\}+B).

For a function class X{\mathcal{F}}\subseteq\mathbb{R}^{X}, we define 𝗏𝖾𝖼()={𝗏𝖾𝖼(f):f}m\mathsf{vec}({\mathcal{F}})=\{\mathsf{vec}(f):f\in{\mathcal{F}}\}\subseteq\mathbb{R}^{m}. Now we have 𝗏𝖾𝖼(1)f1𝗏𝖾𝖼({f}+B)\mathsf{vec}({\mathcal{F}}_{1})\subseteq\bigcup_{f\in{\mathcal{F}}_{1}^{\prime}}\mathsf{vec}(\{f\}+B), which implies that the volume of 𝗏𝖾𝖼(1)\mathsf{vec}({\mathcal{F}}_{1}) is at most |1||{\mathcal{F}}_{1}^{\prime}| times the volume of 𝗏𝖾𝖼(B)\mathsf{vec}(B).

It is clear that 𝗏𝖾𝖼(1)=[1,1]m\mathsf{vec}({\mathcal{F}}_{1})=[-1,1]^{m} has volume 2m2^{m}. By (17),

𝗏𝖾𝖼(B)={vHmrvv:rv[ε,ε]for allvHm}.\mathsf{vec}(B)=\left\{{\sum}_{v\in H_{m}}r_{v}v:r_{v}\in[-\varepsilon,\varepsilon]\ \textnormal{for all}\ v\in H_{m}\right\}.

Since the columns of HmH_{m} are orthogonal and have 2\ell_{2} norm m\sqrt{m} in m\mathbb{R}^{m}, the volume of 𝗏𝖾𝖼(B)\mathsf{vec}(B) is (2εm)m(2\varepsilon\sqrt{m})^{m}. Therefore,

2m|1|(2εm)m,2^{m}\leq|{\mathcal{F}}_{1}^{\prime}|\cdot(2\varepsilon\sqrt{m})^{m},

and thus |1|(1/(εm))m|{\mathcal{F}}_{1}^{\prime}|\geq(1/(\varepsilon\sqrt{m}))^{m}. Now we have

logNμ,2(1,ε)=log|1|mlog(1/(εm))Ω(m)Ω(1/ε2),\log N_{\mu,{\mathcal{F}}_{2}}({\mathcal{F}}_{1},\varepsilon)=\log|{\mathcal{F}}_{1}^{\prime}|\geq m\log(1/(\varepsilon\sqrt{m}))\geq\Omega(m)\geq\Omega(1/\varepsilon^{2}), (18)

and

log|2|=logmO(log(1/ε)).\log|{\mathcal{F}}_{2}|=\log m\leq O(\log(1/\varepsilon)). (19)

Combining (18) and (19) proves the lemma. ∎
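The construction in the proof of Lemma 14 is easy to instantiate numerically. The sketch below is an illustration under our own choice of ε and with Sylvester's construction of the Hadamard matrix (which exists for every power of two): it builds H_m, checks that its columns are orthonormal under the μ-weighted inner product for the uniform μ, and evaluates the lower bound m log(1/(ε√m)) from (18).

import numpy as np

def sylvester_hadamard(m):
    # Sylvester's construction of an m x m Hadamard matrix for m a power of 2.
    H = np.array([[1.0]])
    while H.shape[0] < m:
        H = np.block([[H, H], [H, -H]])
    return H

eps = 0.05
m = 1
while 2 * m <= 1.0 / (2 * eps ** 2):   # largest power of 2 with m <= 1/(2*eps^2)
    m *= 2

H = sylvester_hadamard(m)
mu = np.full(m, 1.0 / m)

# The m columns of H, viewed as functions on X = {1, ..., m}, form F_2.  They are
# orthonormal w.r.t. <f,g>_mu = E_{x~mu}[f(x) g(x)], which is what turns the ball
# B = {f : ||f||_{mu,F_2} <= eps} into an axis-aligned cube in the proof.
G = (H.T * mu) @ H
print(m, np.allclose(G, np.eye(m)))           # m = 128 and True for eps = 0.05
print(m * np.log(1.0 / (eps * np.sqrt(m))))   # the bound m*log(1/(eps*sqrt(m))) from (18)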

5 Sample Complexity of Distribution-Free OI

In this section, we consider the distribution-free setting where the OI learner has no knowledge of the distribution μ\mu, and its performance is measured on the worst-case distribution μ\mu. We focus on the case where 𝒫=[0,1]X{\mathcal{P}}=[0,1]^{X} and characterize the sample complexity of OI in this setting using the fat-shattering dimension of 𝒟{\mathcal{D}} (Theorem 15). Without the assumption that 𝒫=[0,1]X{\mathcal{P}}=[0,1]^{X}, we give a sample complexity upper bound for realizable OI in Remark 16 and leave the intriguing question of sample complexity characterization to future work.

Theorem 15.

For every distinguisher class 𝒟{\mathcal{D}}, every advantage bound ε(0,1)\varepsilon\in(0,1), and every failure probability bound δ(0,1)\delta\in(0,1), the following sample complexity characterization holds for distribution-free OI:

𝖿𝖺𝗍𝒟(12ε)/8+log(1δ)\displaystyle\mathsf{fat}_{\mathcal{D}}(12\varepsilon)/8+\log(1-\delta)
\displaystyle\leq{} 𝖲𝖠𝖬𝖯𝖣𝖥𝖱([0,1]X,𝒟,ε,δ)\displaystyle\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax R}([0,1]^{X},{\mathcal{D}},\varepsilon,\delta)
=\displaystyle={} 𝖲𝖠𝖬𝖯𝖣𝖥𝖠([0,1]X,𝒟,ε,δ)\displaystyle\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax A}([0,1]^{X},{\mathcal{D}},\varepsilon,\delta)
\displaystyle\leq{} O(ε4(𝖿𝖺𝗍𝒟(ε/25)(log(2/ε))2+log(2/δ))).\displaystyle O\Big{(}\varepsilon^{-4}\Big{(}\mathsf{fat}_{\mathcal{D}}(\varepsilon/25)(\log(2/\varepsilon))^{2}+\log(2/\delta)\Big{)}\Big{)}. (20)

This theorem is a direct corollary of the sample complexity upper bound (Lemma 19) and lower bound (Lemma 23) we prove in the remainder of this section.4 With some modifications, Theorem 15 also extends to multicalibration (see Remarks 21 and 24).

4 If we assume :=supxXsupd𝒟|d(x)|>3ε\ell:=\sup_{x\in X}\sup_{d\in{\mathcal{D}}}|d(x)|>3\varepsilon and δ(0,1/3)\delta\in(0,1/3), it is rather straightforward to show a sample complexity lower bound of Ω(2ε2log(1/δ))\Omega(\ell^{2}\varepsilon^{-2}\log(1/\delta)) by a reduction from estimating the bias of a coin from independent tosses. We omit the details as our focus is on the dependence on 𝒟{\mathcal{D}}, rather than on ε\varepsilon and δ\delta.

Remark 16.

If we drop the assumption that 𝒫=[0,1]X{\mathcal{P}}=[0,1]^{X} and assume instead that 𝒟=[1,1]X{\mathcal{D}}=[-1,1]^{X}, by Lemma 7, our error objective advμ,𝒟(,)\mathrm{adv}_{\mu,{\mathcal{D}}}(\cdot,\cdot) becomes the 1\ell_{1} error. By the ideas and results in [Bartlett et al., 1996, Bartlett and Long, 1995], it holds that

𝖲𝖠𝖬𝖯𝖣𝖥𝖱(𝒫,[1,1]X,ε,δ)O(ε4(𝖿𝖺𝗍𝒫(ε2/20)(log(2/ε))2+log(2/δ))).\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax R}({\mathcal{P}},[-1,1]^{X},\varepsilon,\delta)\leq O\Big{(}\varepsilon^{-4}\Big{(}\mathsf{fat}_{\mathcal{P}}(\varepsilon^{2}/20)(\log(2/\varepsilon))^{2}+\log(2/\delta)\Big{)}\Big{)}.

Combining this with (20) and using the monotonicity (6) of the sample complexity of realizable OI, without assuming 𝒫=[0,1]X{\mathcal{P}}=[0,1]^{X} or 𝒟=[1,1]X{\mathcal{D}}=[-1,1]^{X}, we have

𝖲𝖠𝖬𝖯𝖣𝖥𝖱(𝒫,𝒟,ε,δ)O(ε4(min{𝖿𝖺𝗍𝒟(ε/25),𝖿𝖺𝗍𝒫(ε2/20)}(log(2/ε))2+log(2/δ))).\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax R}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta)\leq O\bigg{(}\varepsilon^{-4}\Big{(}\min\{\mathsf{fat}_{\mathcal{D}}(\varepsilon/25),\mathsf{fat}_{\mathcal{P}}(\varepsilon^{2}/20)\}(\log(2/\varepsilon))^{2}+\log(2/\delta)\Big{)}\bigg{)}.

This, however, does not give a sample complexity characterization for distribution-free realizable OI because when 𝖿𝖺𝗍𝒟(ε/25)\mathsf{fat}_{\mathcal{D}}(\varepsilon/25) and 𝖿𝖺𝗍𝒫(ε2/20)\mathsf{fat}_{\mathcal{P}}(\varepsilon^{2}/20) are both infinite, it is still possible for 𝖲𝖠𝖬𝖯𝖣𝖥𝖱(𝒫,𝒟,ε,δ)\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax R}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta) to be finite (see the example in Section 1.1).

5.1 Upper Bound

We prove our upper bound using the multiaccuracy boost algorithm (Algorithm 3). Kim et al. [2019] used the algorithm to show a sample complexity upper bound for multiaccuracy based on an abstract notion of dimension of the distinguisher class. We make their upper bound concrete using the fat-shattering dimension and match it with a lower bound in Section 5.2. The following standard uniform convergence result based on the fat-shattering dimension is crucial for our upper bound:

Lemma 17 (Uniform convergence from fat shattering [Bartlett and Long, 1995]).

Let [1,1]X{\mathcal{F}}\subseteq[-1,1]^{X} be a function class. For every ε,δ(0,1)\varepsilon,\delta\in(0,1) there exists n>0n\in\mathbb{Z}_{>0} such that

n=O(ε2(𝖿𝖺𝗍(ε/5)(log(2/ε))2+log(2/δ)))n=O\Big{(}\varepsilon^{-2}\Big{(}\mathsf{fat}_{\mathcal{F}}(\varepsilon/5)(\log(2/\varepsilon))^{2}+\log(2/\delta)\Big{)}\Big{)}

and for every probability distribution μ\mu over XX,

Pr[supf|1ni=1nf(xi)Ef|ε]δ,\Pr\left[\sup_{f\in{\mathcal{F}}}\left|\frac{1}{n}\sum_{i=1}^{n}f(x_{i})-E_{f}\right|\geq\varepsilon\right]\leq\delta,

where x1,,xnx_{1},\ldots,x_{n} are drawn i.i.d. from μ\mu and Ef:=𝔼xμ[f(x)]E_{f}:=\operatorname{\mathbb{E}}_{x\sim\mu}[f(x)] for all ff\in{\mathcal{F}}.

Lemma 18.

Let 𝒟[1,1]X{\mathcal{D}}\subseteq[-1,1]^{X} be a distinguisher class. For every ε,δ(0,1)\varepsilon,\delta\in(0,1) there exists n>0n\in\mathbb{Z}_{>0} such that

n=O(ε2(𝖿𝖺𝗍𝒟(ε/5)(log(2/ε))2+log(2/δ)))n=O\Big{(}\varepsilon^{-2}\Big{(}\mathsf{fat}_{\mathcal{D}}(\varepsilon/5)(\log(2/\varepsilon))^{2}+\log(2/\delta)\Big{)}\Big{)}

and for every distribution μ\mu over XX and every predictor p[0,1]Xp\in[0,1]^{X},

Pr[supd𝒟|1ni=1noid(xi)p,dμ|ε]δ,\Pr\left[\sup_{d\in{\mathcal{D}}}\left|\frac{1}{n}\sum_{i=1}^{n}o_{i}d(x_{i})-\langle p,d\rangle_{\mu}\right|\geq\varepsilon\right]\leq\delta,

where (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}) are drawn i.i.d. from μp\mu_{p}.

Proof.

For every d𝒟d\in{\mathcal{D}}, define d~:X×{0,1}[1,1]\tilde{d}:X\times\{0,1\}\rightarrow[-1,1] to be a function that maps (x,o)X×{0,1}(x,o)\in X\times\{0,1\} to od(x)od(x). Define ~𝒟={d~:d𝒟}\tilde{}{\mathcal{D}}=\{\tilde{d}:d\in{\mathcal{D}}\}.

We show that 𝖿𝖺𝗍~𝒟(ε)=𝖿𝖺𝗍𝒟(ε)\mathsf{fat}_{\tilde{}{\mathcal{D}}}(\varepsilon)=\mathsf{fat}_{\mathcal{D}}(\varepsilon) for all ε>0\varepsilon>0. Consider (x1,o1),,(xm,om)(x_{1},o_{1}),\ldots,(x_{m},o_{m}) that are ε\varepsilon-fat shattered by ~𝒟\tilde{}{\mathcal{D}}. Since every d~\tilde{d} in ~𝒟\tilde{}{\mathcal{D}} maps (x,0)(x,0) to 0 for all xXx\in X, we know that o1==om=1o_{1}=\cdots=o_{m}=1 (a point on which every function in ~𝒟\tilde{}{\mathcal{D}} takes the same value 0 cannot be ε\varepsilon-fat shattered for any ε>0\varepsilon>0). This then implies that x1,,xmx_{1},\ldots,x_{m} are ε\varepsilon-fat shattered by 𝒟{\mathcal{D}}. Conversely, if x1,,xmx_{1},\ldots,x_{m} are ε\varepsilon-fat shattered by 𝒟{\mathcal{D}}, it is clear that (x1,1),,(xm,1)(x_{1},1),\ldots,(x_{m},1) are ε\varepsilon-fat shattered by ~𝒟\tilde{}{\mathcal{D}}.

The lemma then follows directly from applying Lemma 17 with =~𝒟{\mathcal{F}}=\tilde{}{\mathcal{D}}. ∎
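The lifting step in the proof of Lemma 18 can also be checked numerically: the empirical mean of the lifted function over examples from μ_p is exactly (1/n)∑ o_i d(x_i), whose expectation is ⟨p,d⟩_μ. The sketch below is only a sanity check with random data on a finite domain; the sample size and all names are our own choices.

import numpy as np

# The lift used in the proof of Lemma 18 is d~(x, o) = o * d(x).  For examples
# (x_i, o_i) drawn i.i.d. from mu_p, the empirical mean of d~ is
# (1/n) sum_i o_i d(x_i), and its expectation is <p, d>_mu.
rng = np.random.default_rng(4)
Xsz, n = 10, 100000
mu = np.full(Xsz, 1.0 / Xsz)
p = rng.uniform(0.0, 1.0, size=Xsz)
d = rng.uniform(-1.0, 1.0, size=Xsz)
xs = rng.choice(Xsz, size=n, p=mu)
os = rng.binomial(1, p[xs])
print(np.mean(os * d[xs]), np.sum(mu * p * d))   # the two values are close for large n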

Now we state our sample complexity upper bound for distribution-free OI (Lemma 19) and prove it using Algorithm 3 and a helper lemma (Lemma 20).

Lemma 19.

For every distinguisher class 𝒟{\mathcal{D}}, every advantage bound ε(0,1)\varepsilon\in(0,1), and every failure probability bound δ(0,1)\delta\in(0,1), the following sample complexity upper bound holds for distribution-free OI:

𝖲𝖠𝖬𝖯𝖣𝖥𝖱([0,1]X,𝒟,ε,δ)\displaystyle\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax R}([0,1]^{X},{\mathcal{D}},\varepsilon,\delta) =𝖲𝖠𝖬𝖯𝖣𝖥𝖠([0,1]X,𝒟,ε,δ)\displaystyle=\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax A}([0,1]^{X},{\mathcal{D}},\varepsilon,\delta)
O(ε4(𝖿𝖺𝗍𝒟(ε/25)(log(2/ε))2+log(2/δ))).\displaystyle\leq O\Big{(}\varepsilon^{-4}\Big{(}\mathsf{fat}_{\mathcal{D}}(\varepsilon/25)(\log(2/\varepsilon))^{2}+\log(2/\delta)\Big{)}\Big{)}.
Parameters : distinguisher class 𝒟{\mathcal{D}}, advantage bound ε(0,1)\varepsilon\in(0,1).
Input : examples (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}).
Output : predictor p[0,1]Xp\in[0,1]^{X}.
1 Initialize p(x)=1/2p(x)=1/2 for all xXx\in X;
2 T25/ε2T\leftarrow\lceil 25/\varepsilon^{2}\rceil;
3 mn/Tm\leftarrow\lfloor n/T\rfloor;
4 for t=1,,Tt=1,\ldots,T do
5       take fresh examples (x1,o1),,(xm,om)(x_{1}^{*},o_{1}^{*}),\ldots,(x_{m}^{*},o_{m}^{*}) where (xi,oi)=(x(t1)m+i,o(t1)m+i)(x_{i}^{*},o_{i}^{*})=(x_{(t-1)m+i},o_{(t-1)m+i});
6       draw oio_{i}^{\prime} from Ber(p(xi))\mathrm{Ber}(p(x_{i}^{*})) independently for all i=1,,mi=1,\ldots,m;
7       if there exists d𝒟(𝒟)d\in{\mathcal{D}}\cup(-{\mathcal{D}}) such that 1mi=1md(xi)(oioi)3ε/5\frac{1}{m}\sum_{i=1}^{m}d(x_{i}^{*})(o_{i}^{*}-o^{\prime}_{i})\geq 3\varepsilon/5  then
8             p(x)p(x)+εd(x)/5p(x)\leftarrow p(x)+\varepsilon d(x)/5 for all xXx\in X;
9             p(x)max{0,min{1,p(x)}}p(x)\leftarrow\max\{0,\min\{1,p(x)\}\} for all xXx\in X;
10            
11      else
12             return pp;
13            
14       end if
15      
16 end for
17 return pp;
Algorithm 3 Multiaccuracy Boost [Hébert-Johnson et al., 2018, Kim et al., 2019]
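The following sketch runs Algorithm 3 on a finite domain with a finite distinguisher class, assuming numpy arrays for predictors and distinguishers; the data generation at the end and the function name multiaccuracy_boost are ours, and the number of rounds follows the choice made in the proof of Lemma 19.

import numpy as np

def multiaccuracy_boost(D, xs, os, eps, rng):
    # A sketch of Algorithm 3 for a finite domain X = {0, ..., |X|-1} and a
    # finite distinguisher class D (list of arrays in [-1,1]^X).
    Xsz = len(D[0])
    n = len(xs)
    p = np.full(Xsz, 0.5)                    # Line 1: p(x) = 1/2
    T = int(np.ceil(25.0 / eps ** 2))        # Line 2
    m = n // T                               # Line 3
    if m == 0:
        return p
    for t in range(T):                       # Line 4
        batch = slice(t * m, (t + 1) * m)    # Line 5: fresh batch of m examples
        x_b, o_b = xs[batch], os[batch]
        o_sim = rng.binomial(1, p[x_b])      # Line 6: outcomes simulated from p
        updated = False
        for d in list(D) + [-d for d in D]:  # Line 7: d in D or -D
            if np.mean(d[x_b] * (o_b - o_sim)) >= 3 * eps / 5:
                p = np.clip(p + eps * d / 5, 0.0, 1.0)   # Lines 8-9
                updated = True
                break
        if not updated:
            return p                         # Line 12
    return p                                 # Line 17

# Usage: examples drawn from mu_{p*} for the uniform mu and a random target p*.
rng = np.random.default_rng(5)
Xsz = 8
mu = np.full(Xsz, 1.0 / Xsz)
D = [rng.uniform(-1.0, 1.0, size=Xsz) for _ in range(6)]
p_star = rng.uniform(0.0, 1.0, size=Xsz)
n = 40000
xs = rng.choice(Xsz, size=n, p=mu)
os = rng.binomial(1, p_star[xs])
p_hat = multiaccuracy_boost(D, xs, os, eps=0.25, rng=rng)
print(max(abs(np.sum(mu * (p_hat - p_star) * d)) for d in D))   # typically well below eps
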
Lemma 20.

If pp,dμε/5\langle p^{*}-p,d\rangle_{\mu}\geq\varepsilon/5 for some predictor p[0,1]Xp^{*}\in[0,1]^{X} and some distribution μΔX\mu\in\Delta_{X} before Line 8 of Algorithm 3 is executed, then Lines 8 and 9 decrease pp22\|p-p^{*}\|_{2}^{2} by at least ε2/25\varepsilon^{2}/25. Here, we use f22\|f\|_{2}^{2} as a shorthand for f,fμ\langle f,f\rangle_{\mu}.

Proof.

Line 8 decreases pp22\|p^{*}-p\|_{2}^{2} by at least ε2/25\varepsilon^{2}/25 because

pp22p(p+(ε/5)d)22\displaystyle\|p^{*}-p\|_{2}^{2}-\|p^{*}-(p+(\varepsilon/5)d)\|_{2}^{2}
=\displaystyle={} 2pp,(ε/5)dμ(ε/5)2d22\displaystyle 2\langle p^{*}-p,(\varepsilon/5)d\rangle_{\mu}-(\varepsilon/5)^{2}\|d\|_{2}^{2}
\displaystyle\geq{} 2(ε/5)2(ε/5)2\displaystyle 2(\varepsilon/5)^{2}-(\varepsilon/5)^{2}
=\displaystyle={} ε2/25.\displaystyle\varepsilon^{2}/25.

The lemma is proved by noting that Line 9 (the coordinatewise truncation of pp to [0,1][0,1]) never increases pp22\|p^{*}-p\|_{2}^{2}, since p[0,1]Xp^{*}\in[0,1]^{X}. ∎

Now we finish the proof of Lemma 19.

Proof.

We first consider the relatively simple case where 𝖿𝖺𝗍𝒟(ε/25)=0\mathsf{fat}_{\mathcal{D}}(\varepsilon/25)=0. By the definition of the fat-shattering dimension, there exists d[1,1]Xd^{*}\in[-1,1]^{X} such that

|d(x)d(x)|ε/25for every d𝒟 and every xX.|d^{*}(x)-d(x)|\leq\varepsilon/25\quad\textnormal{for every $d\in{\mathcal{D}}$ and every $x\in X$.} (21)

Therefore, for every fixed distribution μΔX\mu\in\Delta_{X} and target predictor p[0,1]Xp^{*}\in[0,1]^{X}, as long as the learned predictor pp satisfies |pp,dμ|ε/2|\langle p-p^{*},d^{*}\rangle_{\mu}|\leq\varepsilon/2, we get the desired error bound ppμ,𝒟ε\|p-p^{*}\|_{\mu,{\mathcal{D}}}\leq\varepsilon. Decomposing d=d++dd^{*}=d^{*}_{+}+d^{*}_{-} with d+(x):=max{d(x),0}d^{*}_{+}(x):=\max\{d^{*}(x),0\} and d(x):=min{d(x),0}d^{*}_{-}(x):=\min\{d^{*}(x),0\} for every xXx\in X, we aim for the stronger goal that |pp,d+μ|ε/4|\langle p-p^{*},d^{*}_{+}\rangle_{\mu}|\leq\varepsilon/4 and |pp,dμ|ε/4|\langle p-p^{*},d^{*}_{-}\rangle_{\mu}|\leq\varepsilon/4.

Given input examples (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}) drawn i.i.d. from μp\mu_{p^{*}}, we first compute u+:=1ni=1nd+(xi)oiu_{+}:=\frac{1}{n}\sum_{i=1}^{n}d^{*}_{+}(x_{i})o_{i} and v+:=1ni=1nd+(xi)v_{+}:=\frac{1}{n}\sum_{i=1}^{n}d^{*}_{+}(x_{i}). It is clear that 0u+v+0\leq u_{+}\leq v_{+}, so there exists r+[0,1]r_{+}\in[0,1] such that u+=r+v+u_{+}=r_{+}v_{+}. Similarly, we compute u,vu_{-},v_{-} satisfying 0uv0\geq u_{-}\geq v_{-} and define r[0,1]r_{-}\in[0,1] so that u=rvu_{-}=r_{-}v_{-}. We output p[0,1]Xp\in[0,1]^{X} with p(x)=r+p(x)=r_{+} if d(x)0d^{*}(x)\geq 0 and p(x)=rp(x)=r_{-} otherwise.

By the Chernoff bound and the union bound, we can choose n=O(ε2log(2/δ))n=O(\varepsilon^{-2}\log(2/\delta)) such that with probability at least 1δ/21-\delta/2, both of the following inequalities hold:

|u+p,d+μ|ε/8,\displaystyle|u_{+}-\langle p^{*},d^{*}_{+}\rangle_{\mu}|\leq\varepsilon/8, (22)
|v+𝔼xμ[d+(x)]|ε/8.\displaystyle|v_{+}-\operatorname{\mathbb{E}}_{x\sim\mu}[d^{*}_{+}(x)]|\leq\varepsilon/8. (23)

When multiplied by r+r_{+}, (23) implies |u+p,d+μ|ε/8|u_{+}-\langle p,d^{*}_{+}\rangle_{\mu}|\leq\varepsilon/8. Combining this with (22), with probability at least 1δ/21-\delta/2, |pp,d+μ|ε/4|\langle p^{*}-p,d^{*}_{+}\rangle_{\mu}|\leq\varepsilon/4. Similarly, with probability at least 1δ/21-\delta/2, |pp,dμ|ε/4|\langle p^{*}-p,d^{*}_{-}\rangle_{\mu}|\leq\varepsilon/4. By the union bound, with probability at least 1δ1-\delta, |pp,d+μ|ε/4|\langle p^{*}-p,d^{*}_{+}\rangle_{\mu}|\leq\varepsilon/4 and |pp,dμ|ε/4|\langle p^{*}-p,d^{*}_{-}\rangle_{\mu}|\leq\varepsilon/4 both hold as desired, so

𝖲𝖠𝖬𝖯𝖣𝖥𝖱([0,1]X,𝒟,ε,δ)\displaystyle\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax R}([0,1]^{X},{\mathcal{D}},\varepsilon,\delta) O(ε2log(2/δ))\displaystyle\leq O(\varepsilon^{-2}\log(2/\delta))
O(ε4(𝖿𝖺𝗍𝒟(ε/25)(log(2/ε))2+log(2/δ))).\displaystyle\leq O\Big{(}\varepsilon^{-4}\Big{(}\mathsf{fat}_{\mathcal{D}}(\varepsilon/25)(\log(2/\varepsilon))^{2}+\log(2/\delta)\Big{)}\Big{)}.

We thus assume 𝖿𝖺𝗍𝒟(ε/25)1\mathsf{fat}_{\mathcal{D}}(\varepsilon/25)\geq 1 from now on. We use Algorithm 3 with n=25/ε2mn=\lceil 25/\varepsilon^{2}\rceil m for a positive integer mm chosen as follows. Given input examples (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}) drawn i.i.d. from μp\mu_{p^{*}} for some μΔX\mu\in\Delta_{X} and p[0,1]Xp^{*}\in[0,1]^{X}, by Lemma 18 and the union bound, we can choose

m=O(ε2(𝖿𝖺𝗍𝒟(ε/25)(log(2/ε))2+log(2/δε)))m=O\Big{(}\varepsilon^{-2}\Big{(}\mathsf{fat}_{\mathcal{D}}(\varepsilon/25)(\log(2/\varepsilon))^{2}+\log(2/\delta\varepsilon)\Big{)}\Big{)}

so that with probability at least 1δ1-\delta, every time Line 7 is executed, we have

supd𝒟|1mi=1md(xi)oid,pμ|ε/5,and\displaystyle\sup_{d\in{\mathcal{D}}}\left|\frac{1}{m}\sum_{i=1}^{m}d(x_{i}^{*})o_{i}^{*}-\langle d,p^{*}\rangle_{\mu}\right|\leq\varepsilon/5,\textnormal{and} (24)
supd𝒟|1mi=1md(xi)oid,pμ|ε/5.\displaystyle\sup_{d\in{\mathcal{D}}}\left|\frac{1}{m}\sum_{i=1}^{m}d(x_{i}^{*})o_{i}^{\prime}-\langle d,p\rangle_{\mu}\right|\leq\varepsilon/5. (25)

Assuming this is true, when the output predictor pp is returned at Line 12, we have

supd𝒟|pp,dμ|3ε/5+ε/5+ε/5ε,\sup_{d\in{\mathcal{D}}}|\langle p-p^{*},d\rangle_{\mu}|\leq 3\varepsilon/5+\varepsilon/5+\varepsilon/5\leq\varepsilon,

as desired. Moreover, (24) and (25) imply pp,dμ3ε/5ε/5ε/5=ε/5\langle p^{*}-p,d\rangle_{\mu}\geq 3\varepsilon/5-\varepsilon/5-\varepsilon/5=\varepsilon/5 before every time Line 8 is executed, so by Lemma 20, when the output predictor pp is returned at Line 17,

pp221(ε2/25)25/ε20,\|p-p^{*}\|_{2}^{2}\leq 1-(\varepsilon^{2}/25)\lceil 25/\varepsilon^{2}\rceil\leq 0,

which means pp agrees with pp^{*} μ\mu-almost everywhere and hence ppμ,𝒟=0ε\|p-p^{*}\|_{\mu,{\mathcal{D}}}=0\leq\varepsilon, as desired. Therefore,

𝖲𝖠𝖬𝖯𝖣𝖥𝖱([0,1]X,𝒟,ε,δ)\displaystyle\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax R}([0,1]^{X},{\mathcal{D}},\varepsilon,\delta) =𝖲𝖠𝖬𝖯𝖣𝖥𝖠([0,1]X,𝒟,ε,δ)\displaystyle=\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax A}([0,1]^{X},{\mathcal{D}},\varepsilon,\delta)
n\displaystyle\leq n
=25/ε2m\displaystyle=\lceil 25/\varepsilon^{2}\rceil m
=O(ε4(𝖿𝖺𝗍𝒟(ε/25)(log(2/ε))2+log(2/δε)))\displaystyle=O\Big{(}\varepsilon^{-4}\Big{(}\mathsf{fat}_{\mathcal{D}}(\varepsilon/25)(\log(2/\varepsilon))^{2}+\log(2/\delta\varepsilon)\Big{)}\Big{)}
=O(ε4(𝖿𝖺𝗍𝒟(ε/25)(log(2/ε))2+log(2/δ))).\displaystyle=O\Big{(}\varepsilon^{-4}\Big{(}\mathsf{fat}_{\mathcal{D}}(\varepsilon/25)(\log(2/\varepsilon))^{2}+\log(2/\delta)\Big{)}\Big{)}. (by 𝖿𝖺𝗍𝒟(ε/25)1\mathsf{fat}_{\mathcal{D}}(\varepsilon/25)\geq 1)
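The estimator used in the simple case 𝖿𝖺𝗍_𝒟(ε/25)=0 of this proof is easy to state in code. The sketch below assumes a finite domain with d* given as an array (in the proof d* is the function that ε/25-approximates every distinguisher); the default value 1/2 when a part of d* has empirical mass zero and the function name fat_zero_learner are our own choices.

import numpy as np

def fat_zero_learner(d_star, xs, os):
    # Split d* into its positive and negative parts, match the empirical
    # correlations u_+ = r_+ v_+ and u_- = r_- v_-, and output the predictor
    # that equals r_+ where d*(x) >= 0 and r_- elsewhere.
    d_pos = np.maximum(d_star, 0.0)
    d_neg = np.minimum(d_star, 0.0)
    u_pos, v_pos = np.mean(d_pos[xs] * os), np.mean(d_pos[xs])
    u_neg, v_neg = np.mean(d_neg[xs] * os), np.mean(d_neg[xs])
    r_pos = 0.5 if v_pos == 0 else u_pos / v_pos   # some r_+ in [0,1] with u_+ = r_+ v_+
    r_neg = 0.5 if v_neg == 0 else u_neg / v_neg   # some r_- in [0,1] with u_- = r_- v_-
    return np.where(d_star >= 0, r_pos, r_neg)

# Usage with the uniform distribution and a random target predictor.
rng = np.random.default_rng(6)
Xsz, n = 6, 20000
d_star = rng.uniform(-1.0, 1.0, size=Xsz)
p_star = rng.uniform(0.0, 1.0, size=Xsz)
xs = rng.integers(0, Xsz, size=n)
os = rng.binomial(1, p_star[xs])
p_hat = fat_zero_learner(d_star, xs, os)
mu = np.full(Xsz, 1.0 / Xsz)
print(abs(np.sum(mu * (p_hat - p_star) * d_star)))   # small: p_hat matches p* against d*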

Remark 21.

Hébert-Johnson et al. [2018] used a modified version of Algorithm 3 for multicalibration, and indeed, this gives a sample complexity upper bound for multicalibration similar to Lemma 19. In multicalibration, the only difference is in the error objective: the interval [0,1][0,1] is partitioned into kk subsets Λ1,,Λk\Lambda_{1},\ldots,\Lambda_{k} for some k>0k\in\mathbb{Z}_{>0}, and given a distribution μΔX\mu\in\Delta_{X} and a distinguisher class 𝒟[1,1]X{\mathcal{D}}\subseteq[-1,1]^{X}, we measure the error of a learned predictor p[0,1]Xp\in[0,1]^{X} w.r.t. the target predictor p[0,1]Xp^{*}\in[0,1]^{X} by

mc-errorμ,𝒟(p,p):=supd𝒟,1jk|𝔼xμ[(p(x)p(x))d(x)𝟙(p(x)Λj)]|.\mathrm{mc}\textup{-}\mathrm{error}_{\mu,{\mathcal{D}}}(p,p^{*}):=\sup_{d\in{\mathcal{D}},1\leq j\leq k}|\operatorname{\mathbb{E}}_{x\sim\mu}[(p(x)-p^{*}(x))d(x)\mathds{1}(p(x)\in\Lambda_{j})]|. (26)

For ε,δ(0,1)\varepsilon,\delta\in(0,1), suppose our goal is to achieve mc-errorμ,𝒟(p,p)ε\mathrm{mc}\textup{-}\mathrm{error}_{\mu,{\mathcal{D}}}(p,p^{*})\leq\varepsilon with probability at least 1δ1-\delta in the distribution-free setting. Changing Line 7 of Algorithm 3 to

if there exists d𝒟(𝒟) and j{1,,k} such that\displaystyle\text{``}\textnormal{if}\textit{ there exists }d\in{\mathcal{D}}\cup(-{\mathcal{D}})\textit{ and }j\in\{1,\ldots,k\}\textit{ such that }
1mi=1md(xi)(oioi)𝟙(p(xi)Λj)3ε/5 then\displaystyle\frac{1}{m}\sum_{i=1}^{m}d(x_{i}^{*})(o_{i}^{*}-o_{i}^{\prime})\mathds{1}(p(x_{i}^{*})\in\Lambda_{j})\geq 3\varepsilon/5\textnormal{ then}\text{''}

and changing Line 8 to

p(x)p(x)+εd(x)𝟙(p(x)Λj)/5 for all xX,\text{``}p(x)\leftarrow p(x)+\varepsilon d(x)\mathds{1}(p(x)\in\Lambda_{j})/5\textnormal{ for all }x\in X\text{''},

we can prove a sample complexity upper bound of

O(ε4(𝖿𝖺𝗍𝒟(ε/25)(log(2/ε))2+log(1/δ))).O\Big{(}\varepsilon^{-4}\Big{(}\mathsf{fat}_{\mathcal{D}}(\varepsilon/25)(\log(2/\varepsilon))^{2}+\log(1/\delta)\Big{)}\Big{)}.

This bound follows from combining the proof of Lemma 19 with the observation that for every predictor p[0,1]Xp\in[0,1]^{X}, the following distinguisher class

𝒟p:={d[1,1]X:d𝒟 and j{1,,k} such that\displaystyle{\mathcal{D}}_{p}:=\{d^{\prime}\in[-1,1]^{X}:\exists d\in{\mathcal{D}}\text{ and }j\in\{1,\ldots,k\}\text{ such that}
for every xX,d(x)=d(x)𝟙(p(x)Λj)}\displaystyle\text{for every }x\in X,d^{\prime}(x)=d(x)\mathds{1}(p(x)\in\Lambda_{j})\}

satisfies 𝖿𝖺𝗍𝒟p(ε)𝖿𝖺𝗍𝒟(ε)+1\mathsf{fat}_{{\mathcal{D}}_{p}}(\varepsilon)\leq\mathsf{fat}_{{\mathcal{D}}}(\varepsilon)+1 for every ε>0\varepsilon>0.
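Under the same finite-domain assumptions as the Algorithm 3 sketch above, the modified check and update described in this remark can be written as follows; the representation of the partition by an integer-valued function Lambda_index (for example, Lambda_index = lambda v: min(int(4 * v), 3) for a partition of [0,1] into four intervals) is our own choice.

import numpy as np

def mc_update_step(p, D, Lambda_index, x_b, o_b, o_sim, eps):
    # One round of the multicalibration variant: the check of Line 7 and the
    # update of Line 8 are restricted to a level set {x : p(x) in Lambda_j}.
    levels = np.array([Lambda_index(v) for v in p])   # cell index of each x under p
    k = levels.max() + 1
    for d in list(D) + [-d for d in D]:
        for j in range(k):
            ind_b = (levels[x_b] == j).astype(float)  # 1(p(x_i*) in Lambda_j)
            if np.mean(d[x_b] * (o_b - o_sim) * ind_b) >= 3 * eps / 5:
                p = np.clip(p + eps * d * (levels == j) / 5, 0.0, 1.0)
                return p, True                        # an update was performed
    return p, False                                   # no violated constraint found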

5.2 Lower Bound

We first prove the following lemma showing that the fat-shattering dimension gives a lower bound for the metric entropy on a particular distribution μ\mu.

Lemma 22.

If x1,,xnXx_{1},\ldots,x_{n}\in X are 6ε6\varepsilon-fat shattered by 𝒟{\mathcal{D}}, then logNμ,𝒟([0,1]X,ε)n/8\log N_{\mu,{\mathcal{D}}}([0,1]^{X},\varepsilon)\geq n/8, where μ\mu is the uniform distribution over x1,,xnx_{1},\ldots,x_{n}.

Proof.

We can assume without loss of generality that X={x1,,xn}X=\{x_{1},\ldots,x_{n}\}. The assumption that x1,,xnx_{1},\ldots,x_{n} are 6ε6\varepsilon-fat shattered by 𝒟{\mathcal{D}} implies the existence of a function rXr\in\mathbb{R}^{X} such that for every function b{1,1}Xb\in\{-1,1\}^{X}, there exists a distinguisher db𝒟d_{b}\in{\mathcal{D}} satisfying

b(x)(db(x)r(x))6εfor allxX.b(x)(d_{b}(x)-r(x))\geq 6\varepsilon\quad\text{for all}\ x\in X. (27)

We first show that {r}+[6ε,6ε]X\{r\}+[-6\varepsilon,6\varepsilon]^{X} is a subset of the symmetric convex hull ¯𝒟\bar{}{\mathcal{D}}. Assume for the sake of contradiction that some f{r}+[6ε,6ε]Xf\in\{r\}+[-6\varepsilon,6\varepsilon]^{X} does not belong to ¯𝒟\bar{}{\mathcal{D}}. In particular, ff does not belong to the symmetric convex hull of {db:b{1,1}X}\{d_{b}:b\in\{-1,1\}^{X}\}. By the hyperplane separation theorem, there exists gXg\in\mathbb{R}^{X} such that

g,dbfμ<0for allb{1,1}X.\langle g,d_{b}-f\rangle_{\mu}<0\quad\text{for all}\ b\in\{-1,1\}^{X}. (28)

Consider the function b{1,1}Xb\in\{-1,1\}^{X} such that b(x)=1b(x)=1 if and only if g(x)0g(x)\geq 0. For xXx\in X with b(x)=1b(x)=1, we have db(x)r(x)6εd_{b}(x)-r(x)\geq 6\varepsilon by (27) and thus db(x)f(x)d_{b}(x)\geq f(x). Similarly, for xXx\in X with b(x)=1b(x)=-1 we have db(x)f(x)d_{b}(x)\leq f(x). In both cases, we have g(x)(db(x)f(x))0g(x)(d_{b}(x)-f(x))\geq 0, a contradiction with (28).

Now we have proved that {r}+[6ε,6ε]X¯𝒟\{r\}+[-6\varepsilon,6\varepsilon]^{X}\subseteq\bar{}{\mathcal{D}}. By the symmetry of ¯𝒟\bar{}{\mathcal{D}}, {r}+[6ε,6ε]X¯𝒟\{-r\}+[-6\varepsilon,6\varepsilon]^{X}\subseteq\bar{}{\mathcal{D}}. Then by the convexity of ¯𝒟\bar{}{\mathcal{D}}, [6ε,6ε]X¯𝒟[-6\varepsilon,6\varepsilon]^{X}\subseteq\bar{}{\mathcal{D}}.

The lemma is proved by the following chain of inequalities:

logNμ,𝒟([0,1]X,ε)\displaystyle\log N_{\mu,{\mathcal{D}}}([0,1]^{X},\varepsilon) =logNμ,¯𝒟([1,1]X,2ε)\displaystyle=\log N_{\mu,\bar{}{\mathcal{D}}}([-1,1]^{X},2\varepsilon) (29)
logNμ,[6ε,6ε]X([1,1]X,2ε)\displaystyle\geq\log N_{\mu,[-6\varepsilon,6\varepsilon]^{X}}([-1,1]^{X},2\varepsilon) (30)
=logNμ,[1,1]X([1,1]X,1/3)\displaystyle=\log N_{\mu,[-1,1]^{X}}([-1,1]^{X},1/3) (31)
n/8.\displaystyle\geq n/8. (32)

We used Lemma 6 Items 2, 3, and 4 in (29), Lemma 6 Item 1 in (30), and Lemma 6 Item 4 in (31). We used Lemma 34 to get (32). ∎

Combining the lemma above with Lemma 8, we obtain the following sample complexity lower bound for distribution-free OI.

Lemma 23.

For every distinguisher class 𝒟{\mathcal{D}}, every advantage bound ε>0\varepsilon>0, and every failure probability bound δ(0,1)\delta\in(0,1), the following sample complexity lower bound holds for distribution-free OI:

𝖲𝖠𝖬𝖯𝖣𝖥𝖱([0,1]X,𝒟,ε,δ)\displaystyle\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax R}([0,1]^{X},{\mathcal{D}},\varepsilon,\delta) =𝖲𝖠𝖬𝖯𝖣𝖥𝖠([0,1]X,𝒟,ε,δ)\displaystyle=\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax A}([0,1]^{X},{\mathcal{D}},\varepsilon,\delta)
𝖿𝖺𝗍𝒟(12ε)/8+log(1δ).\displaystyle\geq\mathsf{fat}_{\mathcal{D}}(12\varepsilon)/8+\log(1-\delta).
Proof.

Define n=𝖿𝖺𝗍𝒟(12ε)n=\mathsf{fat}_{\mathcal{D}}(12\varepsilon), and suppose x1,,xnXx_{1},\ldots,x_{n}\in X are 12ε12\varepsilon-fat shattered by 𝒟{\mathcal{D}}. Let μ\mu be the uniform distribution over x1,,xnx_{1},\ldots,x_{n}. By Lemma 22, logNμ,𝒟([0,1]X,2ε)n/8\log N_{\mu,{\mathcal{D}}}([0,1]^{X},2\varepsilon)\geq n/8. By Lemma 8,

𝖲𝖠𝖬𝖯𝖣𝖥𝖠([0,1]X,𝒟,ε,δ)\displaystyle\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax A}([0,1]^{X},{\mathcal{D}},\varepsilon,\delta) =𝖲𝖠𝖬𝖯𝖣𝖥𝖱([0,1]X,𝒟,ε,δ)\displaystyle=\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax R}([0,1]^{X},{\mathcal{D}},\varepsilon,\delta)
𝖲𝖠𝖬𝖯𝖣𝖲𝖱([0,1]X,𝒟,ε,δ,μ)\displaystyle\geq\mathsf{SAMP\mathchar 45\relax DS\mathchar 45\relax R}([0,1]^{X},{\mathcal{D}},\varepsilon,\delta,\mu) (by (7))
logNμ,𝒟([0,1]X,2ε)+log(1δ)\displaystyle\geq\log N_{\mu,{\mathcal{D}}}([0,1]^{X},2\varepsilon)+\log(1-\delta)
n/8+log(1δ).\displaystyle\geq n/8+\log(1-\delta).\qed
Remark 24.

It is clear that the error objective mc-errorμ,𝒟(p,p)\mathrm{mc}\textup{-}\mathrm{error}_{\mu,{\mathcal{D}}}(p,p^{*}) for multicalibration defined in (26) satisfies

mc-errorμ,𝒟(p,p)advμ,𝒟(p,p)/k,\mathrm{mc}\textup{-}\mathrm{error}_{\mu,{\mathcal{D}}}(p,p^{*})\geq\mathrm{adv}_{\mu,{\mathcal{D}}}(p,p^{*})/k,

so Lemma 23 directly implies a sample complexity lower bound for multicalibration. Specifically, assuming the predictor class 𝒫{\mathcal{P}} is [0,1]X[0,1]^{X}, if we want to achieve mc-errorμ,𝒟(p,p)ε\mathrm{mc}\textup{-}\mathrm{error}_{\mu,{\mathcal{D}}}(p,p^{*})\leq\varepsilon with probability at least 1δ1-\delta in the distribution-free setting, the sample complexity is at least

𝖿𝖺𝗍𝒟(12kε)/8+log(1δ).\mathsf{fat}_{\mathcal{D}}(12k\varepsilon)/8+\log(1-\delta).

6 Separation between Agnostic and Realizable OI

The sample complexities of realizable and agnostic learning have the same characterization in many settings. They are both characterized by the VC dimension in distribution-free PAC learning [Vapnik and Chervonenkis, 1971], whereas in distribution-specific PAC learning they are both characterized by the metric entropy (this characterization was proved in the realizable setting by Benedek and Itai [1991], and it extends straightforwardly to the agnostic setting (see Lemma 26)). Recently, an independent work by Hopkins et al. [2021] shows that this relationship between realizable and agnostic learning holds very broadly for a unified reason.

In this section, we study this relationship between realizable and agnostic learning in outcome indistinguishability. In contrast to many common learning settings including PAC learning, we show a strong separation between the sample complexities of realizable OI and agnostic OI in both the distribution-free and the distribution-specific settings.

6.1 No separation with additional assumptions

Before we present the sample complexity separation between realizable and agnostic OI, we first discuss, in the following two lemmas, several assumptions under which the separation disappears.

In the first lemma (Lemma 25), we make the assumption that the target predictor pp^{*} belongs to the symmetric convex hull ¯𝒫\bar{}{\mathcal{P}} of 𝒫{\mathcal{P}}.

Lemma 25 (Inside the symmetric convex hull).

For every predictor class 𝒫[0,1]X{\mathcal{P}}\subseteq[0,1]^{X}, every distinguisher class 𝒟[1,1]X{\mathcal{D}}\subseteq[-1,1]^{X}, every distribution μΔX\mu\in\Delta_{X}, every advantage bound ε>0\varepsilon>0, and every failure probability bound δ(0,1)\delta\in(0,1), there exist a nonnegative integer nn and an algorithm 𝒜{\mathcal{A}} such that

n\displaystyle n =O(ε2(logNμ,𝒫(𝒟,ε/4)+log(2/δ))),and\displaystyle=O(\varepsilon^{-2}(\log N_{\mu,{\mathcal{P}}}({\mathcal{D}},\varepsilon/4)+\log(2/\delta))),\quad\textnormal{and}
𝒜\displaystyle{\mathcal{A}} p¯𝒫𝖮𝖨n(p,𝒟,ε,δ,μ).\displaystyle\in{\bigcap}_{p^{*}\in\bar{}{\mathcal{P}}}\mathsf{OI}_{n}(p^{*},{\mathcal{D}},\varepsilon,\delta,\mu).
Proof.

By Lemma 6 Item 2,

logNμ,𝒫(𝒟,ε/4)=logNμ,¯𝒫(𝒟,ε/4).\log N_{\mu,{\mathcal{P}}}({\mathcal{D}},\varepsilon/4)=\log N_{\mu,\bar{}{\mathcal{P}}}({\mathcal{D}},\varepsilon/4).

The lemma is then a consequence of applying Lemma 10 with 𝒫{\mathcal{P}} replaced by ¯𝒫\bar{}{\mathcal{P}}. ∎

In the second lemma (Lemma 26), we make the assumption that {1,1}X𝒟\{-1,1\}^{X}\subseteq{\mathcal{D}} (i.e. we consider the 1\ell_{1} error, see Lemma 7), and that either 𝒫{\mathcal{P}} only contains binary classifiers or the target predictor pp^{*} is a binary classifier.

Lemma 26 (Binary classifiers with 1\ell_{1} error).

Assume {1,1}X𝒟\{-1,1\}^{X}\subseteq{\mathcal{D}}. For every predictor class 𝒫[0,1]X{\mathcal{P}}\subseteq[0,1]^{X}, every distribution μΔX\mu\in\Delta_{X}, every advantage bound ε(0,1)\varepsilon\in(0,1) and every failure probability bound δ(0,1)\delta\in(0,1), there exists a positive integer nn such that

n=O((logNμ,𝒟(𝒫,ε/2)+log(2/δ))/ε2),n=O((\log N_{\mu,{\mathcal{D}}}({\mathcal{P}},\varepsilon/2)+\log(2/\delta))/\varepsilon^{2}),

and for every target predictor p[0,1]Xp^{*}\in[0,1]^{X}, Algorithm 1 belongs to

𝖮𝖨n(p,𝒟,infp𝒫advμ,𝒟(p,p)+ε,δ,μ)\mathsf{OI}_{n}(p^{*},{\mathcal{D}},\inf_{p\in{\mathcal{P}}}\mathrm{adv}_{\mu,{\mathcal{D}}}(p,p^{*})+\varepsilon,\delta,\mu)

whenever 𝒫{0,1}X{\mathcal{P}}\subseteq\{0,1\}^{X} or p{0,1}Xp^{*}\in\{0,1\}^{X}.

Proof.

We choose n=O((logNμ,𝒟(𝒫,ε/2)+log(2/δ))/ε2)n=O((\log N_{\mu,{\mathcal{D}}}({\mathcal{P}},\varepsilon/2)+\log(2/\delta))/\varepsilon^{2}) so that by the Chernoff bound and the union bound, with probability at least 1δ1-\delta over (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}) drawn i.i.d. from μp\mu_{p^{*}}, for all p𝒫p^{\prime}\in{\mathcal{P}}^{\prime} (here 𝒫{\mathcal{P}}^{\prime} is the ε/2\varepsilon/2-covering of 𝒫{\mathcal{P}} computed by Algorithm 1, so |𝒫|=Nμ,𝒟(𝒫,ε/2)|{\mathcal{P}}^{\prime}|=N_{\mu,{\mathcal{D}}}({\mathcal{P}},\varepsilon/2)),

|1ni=1n|p(xi)oi|𝔼(x,o)μp[|p(x)o|]|ε/8.\left|\frac{1}{n}\sum_{i=1}^{n}|p^{\prime}(x_{i})-o_{i}|-\operatorname{\mathbb{E}}_{(x,o)\sim\mu_{p^{*}}}[|p^{\prime}(x)-o|]\right|\leq\varepsilon/8. (33)

It remains to prove that whenever the input (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}) to Algorithm 1 satisfies the inequality above, the output pp satisfies advμ,𝒟(p,p)infp𝒫advμ,𝒟(p,p)+ε\mathrm{adv}_{\mu,{\mathcal{D}}}(p,p^{*})\leq\inf_{p\in{\mathcal{P}}}\mathrm{adv}_{\mu,{\mathcal{D}}}(p,p^{*})+\varepsilon assuming 𝒫{0,1}X{\mathcal{P}}\subseteq\{0,1\}^{X} or p{0,1}Xp^{*}\in\{0,1\}^{X}.

By definition, there exists p^𝒫\hat{p}\in{\mathcal{P}} such that p^pμ,𝒟infp𝒫ppμ,𝒟+ε/4\|\hat{p}-p^{*}\|_{\mu,{\mathcal{D}}}\leq\inf_{p\in{\mathcal{P}}}\|p-p^{*}\|_{\mu,{\mathcal{D}}}+\varepsilon/4. Since 𝒫{\mathcal{P}}^{\prime} is an ε/2\varepsilon/2-covering of 𝒫{\mathcal{P}}, there exists p~𝒫\tilde{p}\in{\mathcal{P}}^{\prime} such that p~p^μ,𝒟ε/2\|\tilde{p}-\hat{p}\|_{\mu,{\mathcal{D}}}\leq\varepsilon/2. Combining these two inequalities together,

p~pμ,𝒟p~p^μ,𝒟+p^pμ,𝒟infp𝒫ppμ,𝒟+3ε/4.\|\tilde{p}-p^{*}\|_{\mu,{\mathcal{D}}}\leq\|\tilde{p}-\hat{p}\|_{\mu,{\mathcal{D}}}+\|\hat{p}-p^{*}\|_{\mu,{\mathcal{D}}}\leq\inf_{p\in{\mathcal{P}}}\|p-p^{*}\|_{\mu,{\mathcal{D}}}+3\varepsilon/4. (34)

Since {1,1}X𝒟\{-1,1\}^{X}\subseteq{\mathcal{D}},

𝗅𝗈𝗌𝗌(p)=1ni=1n|p(xi)oi|for allp𝒫.\mathsf{loss}(p^{\prime})=\frac{1}{n}\sum_{i=1}^{n}|p^{\prime}(x_{i})-o_{i}|\quad\textnormal{for all}\ p^{\prime}\in{\mathcal{P}}^{\prime}.

Therefore, the output pp of Algorithm 1 satisfies

1ni=1n|p(xi)oi|=𝗅𝗈𝗌𝗌(p)𝗅𝗈𝗌𝗌(p~)=1ni=1n|p~(xi)oi|.\frac{1}{n}\sum_{i=1}^{n}|p(x_{i})-o_{i}|=\mathsf{loss}(p)\leq\mathsf{loss}(\tilde{p})=\frac{1}{n}\sum_{i=1}^{n}|\tilde{p}(x_{i})-o_{i}|. (35)

Our assumption 𝒫{0,1}X{\mathcal{P}}\subseteq\{0,1\}^{X} or p{0,1}Xp^{*}\in\{0,1\}^{X} implies that for all p𝒫p^{\prime}\in{\mathcal{P}}^{\prime},

𝔼(x,o)μp[|p(x)o|]=𝔼xμ[|p(x)p(x)|]=ppμ,𝒟,\operatorname{\mathbb{E}}_{(x,o)\sim\mu_{p^{*}}}[|p^{\prime}(x)-o|]=\operatorname{\mathbb{E}}_{x\sim\mu}[|p^{\prime}(x)-p^{*}(x)|]=\|p^{\prime}-p^{*}\|_{\mu,{\mathcal{D}}}, (36)

where the last equation is by Lemma 7.

Combining everything together,

ppμ,𝒟\displaystyle\|p-p^{*}\|_{\mu,{\mathcal{D}}} 1ni=1n|p(xi)oi|+ε/8\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}|p(x_{i})-o_{i}|+\varepsilon/8 (by (33) and (36))
1ni=1n|p~(xi)oi|+ε/8\displaystyle\leq\frac{1}{n}\sum_{i=1}^{n}|\tilde{p}(x_{i})-o_{i}|+\varepsilon/8 (by (35))
p~pμ,𝒟+ε/4\displaystyle\leq\|\tilde{p}-p^{*}\|_{\mu,{\mathcal{D}}}+\varepsilon/4 (by (33) and (36))
infp𝒫ppμ,𝒟+ε.\displaystyle\leq\inf_{p\in{\mathcal{P}}}\|p-p^{*}\|_{\mu,{\mathcal{D}}}+\varepsilon. (by (34))

6.2 Separation without additional assumptions

Without the additional assumptions in Lemma 25 and Lemma 26, we give examples where the sample complexity of agnostic OI can be arbitrarily larger than that of realizable OI in both the distribution-specific and the distribution-free settings. Given a positive integer mm, we choose X={}{0,1}mX=\{\bot\}\cup\{0,1\}^{m}, and choose the predictor class 𝒫{\mathcal{P}} to consist of only two predictors p1p_{1} and p2p_{2} where p1()=0,p2()=1p_{1}(\bot)=0,p_{2}(\bot)=1 and p1(x)=p2(x)=1/2p_{1}(x)=p_{2}(x)=1/2 for all x{0,1}mx\in\{0,1\}^{m}.

We first show that O(ε2log(2/δ))O(\varepsilon^{-2}\log(2/\delta)) examples are sufficient for distribution-free agnostic OI as long as 𝒟=[1,1]X{\mathcal{D}}=[-1,1]^{X}.

Lemma 27.

For all ε,δ(0,1)\varepsilon,\delta\in(0,1), there exists a positive integer n=O(ε2log(2/δ))n=O(\varepsilon^{-2}\log(2/\delta)) such that for all mm and X,𝒫X,{\mathcal{P}} defined as above, there exists an algorithm 𝒜𝖣𝖥𝖠n(𝒫,[1,1]X,ε,δ){\mathcal{A}}\in\mathsf{DF\mathchar 45\relax A}_{n}({\mathcal{P}},[-1,1]^{X},\varepsilon,\delta).

Proof.

We choose algorithm 𝒜{\mathcal{A}} to be the following simple algorithm: after seeing examples (x1,o1),,(xn,on)(x_{1},o_{1}),\allowbreak\ldots,\allowbreak(x_{n},o_{n}), it computes

r:=|{i{1,,n}:(xi,oi)=(,1)}||{i{1,,n}:xi=}|.r:=\frac{|\{i\in\{1,\ldots,n\}:(x_{i},o_{i})=(\bot,1)\}|}{|\{i\in\{1,\ldots,n\}:x_{i}=\bot\}|}.

If the denominator |{i{1,,n}:xi=}||\{i\in\{1,\ldots,n\}:x_{i}=\bot\}| is zero, define r=1/2r=1/2. Algorithm 𝒜{\mathcal{A}} outputs the predictor pp such that p()=rp(\bot)=r, and p(x)=1/2p(x)=1/2 for all x{0,1}mx\in\{0,1\}^{m}.
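
As a quick illustration, the following Python sketch implements this estimator; the function name, the string encoding of \bot, and the toy sample are ours.

```python
# Sketch (ours) of the simple algorithm in the proof of Lemma 27: estimate the
# frequency of outcome 1 at the special individual bot and predict 1/2 elsewhere.
BOT = "bot"  # stands in for the individual denoted by the bottom symbol

def learn_predictor(examples):
    bot_outcomes = [o for x, o in examples if x == BOT]
    r = sum(bot_outcomes) / len(bot_outcomes) if bot_outcomes else 0.5
    def predictor(x):
        return r if x == BOT else 0.5
    return predictor

# Toy usage: three of five examples land on bot, two of them with outcome 1.
p = learn_predictor([(BOT, 1), ("010", 0), (BOT, 1), (BOT, 0), ("110", 1)])
print(p(BOT), p("010"))  # 0.666..., 0.5
```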

It remains to show that when the input examples (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}) are drawn i.i.d. from μp\mu_{p^{*}} for some distribution μΔX\mu\in\Delta_{X} and some p[0,1]Xp^{*}\in[0,1]^{X}, the output pp of algorithm 𝒜{\mathcal{A}} satisfies the following with probability at least 1δ1-\delta:

ppμ,[1,1]Xinfp𝒫ppμ,[1,1]X+ε.\|p-p^{*}\|_{\mu,[-1,1]^{X}}\leq\inf_{p^{\prime}\in{\mathcal{P}}}\|p^{\prime}-p^{*}\|_{\mu,[-1,1]^{X}}+\varepsilon. (37)

Let μ\mu_{\bot} denote the probability mass on \bot in distribution μ\mu. Since p(x)=1/2=p(x)p(x)=1/2=p^{\prime}(x) for all p𝒫p^{\prime}\in{\mathcal{P}} and x{0,1}mx\in\{0,1\}^{m}, by Lemma 7,

ppμ,[1,1]Xμ|p()p()|infp𝒫ppμ,[1,1]X.\|p-p^{*}\|_{\mu,[-1,1]^{X}}-\mu_{\bot}|p(\bot)-p^{*}(\bot)|\leq\inf_{p^{\prime}\in{\mathcal{P}}}\|p^{\prime}-p^{*}\|_{\mu,[-1,1]^{X}}. (38)

If με\mu_{\bot}\leq\varepsilon, inequality (38) implies that (37) is always true. If μ>ε\mu_{\bot}>\varepsilon, we choose n=O(ε2log(2/δ))n=O(\varepsilon^{-2}\log(2/\delta)) so that the following two conditions hold:

  1. by the Chernoff bound, with probability at least 1δ/21-\delta/2, |{i{1,,n}:xi=}|μn/2|\{i\in\{1,\ldots,n\}:x_{i}=\bot\}|\geq\mu_{\bot}n/2;

  2. it holds that μn/2Cμε2log(2/δ)\mu_{\bot}n/2\geq C\mu_{\bot}\varepsilon^{-2}\log(2/\delta), where CC is an absolute constant determined later.

When combined, the two conditions guarantee that with probability at least 1δ/21-\delta/2, |{i{1,,n}:xi=}|Cμε2log(2/δ)|\{i\in\{1,\ldots,n\}:x_{i}=\bot\}|\geq C\mu_{\bot}\varepsilon^{-2}\log(2/\delta). Conditioned on this being true, by the Chernoff bound, choosing CC sufficiently large guarantees that with probability at least 1δ/21-\delta/2,

|p()p()|=|p()|{i:(xi,oi)=(,1)}||{i:xi=}||ε/μ.|p^{*}(\bot)-p(\bot)|=\left|p^{*}(\bot)-\frac{|\{i:(x_{i},o_{i})=(\bot,1)\}|}{|\{i:x_{i}=\bot\}|}\right|\leq\varepsilon/{\sqrt{\mu_{\bot}}}.

Combining this with (38) and using μ|p()p()|εμε\mu_{\bot}|p(\bot)-p^{*}(\bot)|\leq\varepsilon\sqrt{\mu_{\bot}}\leq\varepsilon, we know (37) holds with probability at least 1δ1-\delta, as desired. ∎

By the monotonicity properties of sample complexities (see (6) and (7)), the lemma above implies that for all 𝒟[1,1]X{\mathcal{D}}\subseteq[-1,1]^{X} and μΔX\mu\in\Delta_{X},

𝖲𝖠𝖬𝖯𝖣𝖲𝖠(𝒫,[1,1]X,ε,δ,μ)𝖲𝖠𝖬𝖯𝖣𝖥𝖠(𝒫,[1,1]X,ε,δ)O(ε2log(2/δ)),\mathsf{SAMP\mathchar 45\relax DS\mathchar 45\relax A}({\mathcal{P}},[-1,1]^{X},\varepsilon,\delta,\mu)\leq\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax A}({\mathcal{P}},[-1,1]^{X},\varepsilon,\delta)\leq O(\varepsilon^{-2}\log(2/\delta)),\\ (39)

and

𝖲𝖠𝖬𝖯𝖣𝖲𝖱(𝒫,𝒟,ε,δ,μ)𝖲𝖠𝖬𝖯𝖣𝖥𝖱(𝒫,𝒟,ε,δ)\displaystyle\mathsf{SAMP\mathchar 45\relax DS\mathchar 45\relax R}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta,\mu)\leq\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax R}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta) 𝖲𝖠𝖬𝖯𝖣𝖥𝖱(𝒫,[1,1]X,ε,δ)\displaystyle\leq\mathsf{SAMP\mathchar 45\relax DF\mathchar 45\relax R}({\mathcal{P}},[-1,1]^{X},\varepsilon,\delta)
O(ε2log(2/δ)).\displaystyle\leq O(\varepsilon^{-2}\log(2/\delta)). (40)

Now we show that for a specific distribution μ\mu and a particular distinguisher class 𝒟{\mathcal{D}}, at least m20m-20 examples are required for distribution-specific agnostic OI when ε=1/8\varepsilon=1/8 and δ=1/3\delta=1/3 (Lemma 28). Sending mm to infinity and comparing with (40), this implies that the sample complexity of agnostic OI can be arbitrarily large even when the sample complexity of realizable OI is bounded by a constant in both the distribution-specific and the distribution-free settings. Comparing this with (39), we also know that the sample complexity of agnostic OI is not monotone w.r.t. the distinguisher class 𝒟{\mathcal{D}} in both the distribution-specific and distribution-free settings.

Before we describe the μ\mu and 𝒟{\mathcal{D}} used in Lemma 28, we need the following definitions. Let 𝖡\mathsf{B} denote the set of all functions t:{0,1}m{0,1}t:\{0,1\}^{m}\rightarrow\{0,1\}. A function f𝖡f\in\mathsf{B} is a parity function if for some subset S{1,,m}S\subseteq\{1,\ldots,m\}, it holds that f(x)=iSxif(x)=\bigoplus_{i\in S}x_{i} for all x{0,1}mx\in\{0,1\}^{m}. A function g𝖡g\in\mathsf{B} is an anti-parity function if for some parity function ff, it holds that g(x)=1f(x)g(x)=1\oplus f(x) for all x{0,1}mx\in\{0,1\}^{m}. Let 𝖡𝖯𝖡\mathsf{BP}\subseteq\mathsf{B} (resp. 𝖡𝖠𝖡\mathsf{BA}\subseteq\mathsf{B}) denote the set of parity functions (resp. anti-parity functions).

We choose μ\mu so that it puts 1/31/3 probability mass on \bot, and puts the remaining 2/32/3 probability mass evenly on {0,1}m\{0,1\}^{m}. We choose 𝒟{\mathcal{D}} to contain 2m2^{m} distinguishers as follows: for every parity function f𝖡𝖯f\in\mathsf{BP}, there is a distinguisher d𝒟d\in{\mathcal{D}} such that d()=1d(\bot)=1 and d(x)=(1)f(x)d(x)=(-1)^{f(x)} for all x{0,1}mx\in\{0,1\}^{m}.
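
To make this construction concrete, the following Python sketch enumerates the parity functions on {0,1}m\{0,1\}^{m} and the corresponding distinguishers for a small mm; the helper names and the string encoding of \bot are ours.

```python
from itertools import product

# Sketch (ours) of the parity-based distinguisher class used in Lemma 28, m small.
m = 3
cube = list(product([0, 1], repeat=m))  # the domain {0,1}^m; bot is handled separately

def parity(S):
    """The parity function x -> XOR of the bits of x indexed by S."""
    return lambda x: sum(x[i] for i in S) % 2

subsets = [tuple(i for i in range(m) if (mask >> i) & 1) for mask in range(2 ** m)]
parities = [parity(S) for S in subsets]                      # the class BP
anti_parities = [lambda x, f=f: 1 ^ f(x) for f in parities]  # the class BA

def distinguisher(f):
    """d with d(bot) = 1 and d(x) = (-1)^{f(x)} on the cube."""
    return lambda x: 1 if x == "bot" else (-1) ** f(x)

D = [distinguisher(f) for f in parities]  # 2^m distinguishers in total
print(len(D), D[3]("bot"), D[3](cube[5]))
```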

Lemma 28.

For m,𝒫,𝒟,μm,{\mathcal{P}},{\mathcal{D}},\mu defined as above, 𝖲𝖠𝖬𝖯𝖣𝖲𝖠(𝒫,𝒟,1/8,1/3,μ)m20\mathsf{SAMP\mathchar 45\relax DS\mathchar 45\relax A}({\mathcal{P}},{\mathcal{D}},1/8,1/3,\mu)\geq m-20.

To prove the lemma, we first consider the following task of list-learning parity and anti-parity functions, which we abbreviate as ListL. Given a positive integer list size kk, in the task of ListL, an algorithm takes as input examples (u1,v1),,(un,vn)(u_{1},v_{1}),\ldots,(u_{n},v_{n}) where uiu_{i}’s are drawn independently and uniformly from {0,1}m\{0,1\}^{m}, and vi=t(ui)v_{i}=t(u_{i}) for some unknown t𝖡t\in\mathsf{B}, and the algorithm outputs a subset (i.e. list) L𝖡L\subseteq\mathsf{B}. The algorithm succeeds if tLt\in L and min{|L𝖡𝖯|,|L𝖡𝖠|}k\min\{|L\cap\mathsf{BP}|,|L\cap\mathsf{BA}|\}\leq k.

Lemma 29.

Assuming nmn\leq m and k2mnk\leq 2^{m-n}, no algorithm for the task ListL achieves failure probability smaller than (1/2)(11/2mn)(1k/2mn)(1/2)(1-1/2^{m-n})(1-k/2^{m-n}) for every choice of t𝖡t\in\mathsf{B}.

Proof.

Assume that for some α0\alpha\geq 0, there exists an algorithm 𝒜{\mathcal{A}} that has failure probability at most α\alpha for all t𝖡t\in\mathsf{B}. In particular, when tt is drawn uniformly at random from 𝖡𝖯𝖡𝖠\mathsf{BP}\cup\mathsf{BA}, the failure probability of 𝒜{\mathcal{A}} is at most α\alpha. For this fixed distribution of tt, we can assume that algorithm 𝒜{\mathcal{A}} is deterministic without increasing its failure probability. By the law of total probability,

αPr[failure]=𝔼[Pr[failure|(u1,v1),,(un,vn)]].\alpha\geq\Pr[\textnormal{failure}]=\operatorname{\mathbb{E}}[\Pr[\textnormal{failure}|(u_{1},v_{1}),\ldots,(u_{n},v_{n})]].

Here, (u1,v1),,(un,vn)(u_{1},v_{1}),\ldots,(u_{n},v_{n}) are the input examples to algorithm 𝒜{\mathcal{A}} where u1,,unu_{1},\ldots,u_{n} are drawn independently and uniformly from {0,1}m\{0,1\}^{m}, and vi=t(ui)v_{i}=t(u_{i}) for every i=1,,ni=1,\ldots,n with tt drawn uniformly at random from 𝖡𝖯𝖡𝖠\mathsf{BP}\cup\mathsf{BA}.

With probability

(12m)(12m+1)(12m+n1)12m2m+n111/2mn,(1-2^{-m})(1-2^{-m+1})\cdots(1-2^{-m+n-1})\geq 1-2^{-m}-\cdots-2^{-m+n-1}\geq 1-1/2^{m-n},

u1,,unu_{1},\ldots,u_{n} are linearly independent in 𝔽2m\mathbb{F}_{2}^{m}, in which case the conditional distribution of tt given (u1,v1),,(un,vn)(u_{1},v_{1}),\ldots,(u_{n},v_{n}) is the uniform distribution over L1L2L_{1}\cup L_{2} for some L1𝖡𝖯,L2𝖡𝖠L_{1}\subseteq\mathsf{BP},L_{2}\subseteq\mathsf{BA} with |L1|=|L2|=2mn|L_{1}|=|L_{2}|=2^{m-n}, and thus

Pr[failure|(u1,v1),,(un,vn)](1/2)(1k/2mn).\Pr[\textnormal{failure}|(u_{1},v_{1}),\ldots,(u_{n},v_{n})]\geq(1/2)(1-k/2^{m-n}).

Indeed, conditioned on the examples, the output list LL of the deterministic algorithm is fixed. If min{|L𝖡𝖯|,|L𝖡𝖠|}>k\min\{|L\cap\mathsf{BP}|,|L\cap\mathsf{BA}|\}>k, the algorithm fails with conditional probability 11. Otherwise, assume without loss of generality that |L𝖡𝖯|k|L\cap\mathsf{BP}|\leq k; with conditional probability 1/21/2 the target tt lies in L1𝖡𝖯L_{1}\subseteq\mathsf{BP}, in which case success requires tL𝖡𝖯t\in L\cap\mathsf{BP}, an event of conditional probability at most k/2mnk/2^{m-n}. Therefore,

α𝔼[Pr[failure|(u1,v1),,(un,vn)]](1/2)(11/2mn)(1k/2mn),\alpha\geq\operatorname{\mathbb{E}}[\Pr[\textnormal{failure}|(u_{1},v_{1}),\ldots,(u_{n},v_{n})]]\geq(1/2)(1-1/2^{m-n})(1-k/2^{m-n}),

as desired. ∎
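
The linear-independence probability used in this proof is easy to check empirically. Below is a small Python sketch (ours) that tests linear independence over 𝔽2m\mathbb{F}_{2}^{m} with a standard xor-basis and compares the empirical frequency against the lower bound 11/2mn1-1/2^{m-n} for one small choice of mm and nn.

```python
import random

# Monte Carlo check (ours) of the bound used above: n uniform vectors in F_2^m,
# encoded as m-bit integers, are linearly independent with probability at
# least 1 - 1/2^(m-n).

def independent(vectors, m):
    """Greedy xor-basis, i.e. Gaussian elimination over F_2."""
    basis = [0] * m  # basis[b] holds a stored vector whose highest set bit is b
    for v in vectors:
        for bit in reversed(range(m)):
            if not (v >> bit) & 1:
                continue
            if basis[bit] == 0:
                basis[bit] = v
                break
            v ^= basis[bit]
        else:  # v was reduced to zero: the vectors are linearly dependent
            return False
    return True

m, n, trials = 10, 6, 20000
hits = sum(independent([random.randrange(2 ** m) for _ in range(n)], m)
           for _ in range(trials))
print(hits / trials, ">=", 1 - 2 ** -(m - n))  # empirical frequency vs. lower bound
```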

We are now ready to prove Lemma 28. Let 𝒜{\mathcal{A}} be an algorithm in 𝖣𝖲𝖠n(𝒫,𝒟,1/8,δ,μ)\mathsf{DS\mathchar 45\relax A}_{n}({\mathcal{P}},{\mathcal{D}},1/8,\delta,\mu) for a nonnegative integer nn and a positive real number δ\delta. We use this algorithm to build an algorithm for ListL. Suppose (u1,v1),,(un,vn)(u_{1},v_{1}),\ldots,(u_{n},v_{n}) are the input examples in ListL, where uiu_{i} are drawn independently and uniformly from {0,1}m\{0,1\}^{m}, and vi=t(ui)v_{i}=t(u_{i}) for some t𝖡t\in\mathsf{B}.

Before we construct the algorithm for ListL, we need the following definition. For every f𝖡f\in\mathsf{B}, we define pf[0,1]Xp_{f}\in[0,1]^{X} such that pf()=1/2p_{f}(\bot)=1/2, and pf(x)=(1+(1)f(x))/2p_{f}(x)=(1+(-1)^{f(x)})/2 for all x{0,1}mx\in\{0,1\}^{m}.

Now we construct the algorithm for ListL using algorithm 𝒜{\mathcal{A}}. We start by generating the input examples (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}) to algorithm 𝒜{\mathcal{A}}. Independently for every i=1,,ni=1,\ldots,n, with probability 1/61/6 we set (xi,oi)=(,0)(x_{i},o_{i})=(\bot,0), with probability 1/61/6 we set (xi,oi)=(,1)(x_{i},o_{i})=(\bot,1) and with the remaining probability 2/32/3 we set (xi,oi)=(ui,(1+(1)vi)/2)(x_{i},o_{i})=(u_{i},(1+(-1)^{v_{i}})/2). After running algorithm 𝒜{\mathcal{A}} and obtaining the output pp, we return the list

L={f𝖡:pfpμ,𝒟1/6+1/8}(𝖡\(𝖡𝖯𝖡𝖠)).L=\{f\in\mathsf{B}:\|p_{f}-p\|_{\mu,{\mathcal{D}}}\leq 1/6+1/8\}\cup(\mathsf{B}\backslash(\mathsf{BP}\cup\mathsf{BA})). (41)
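
The example-generation step of this reduction is straightforward to implement; the following Python sketch (ours) shows it, with the construction of the list LL from (41) left as a comment because it needs the OI algorithm 𝒜{\mathcal{A}} and the norm μ,𝒟\|\cdot\|_{\mu,{\mathcal{D}}} as black boxes.

```python
import random

# Sketch (ours) of the example-generation step in the reduction above: turn
# ListL examples (u_i, v_i) into examples (x_i, o_i) distributed i.i.d.
# according to mu_{p_t}. The string encoding of bot is our own choice.
BOT = "bot"

def simulate_examples(listl_examples):
    oi_examples = []
    for u, v in listl_examples:
        r = random.random()
        if r < 1 / 6:
            oi_examples.append((BOT, 0))
        elif r < 1 / 3:
            oi_examples.append((BOT, 1))
        else:  # with the remaining probability 2/3, reuse the ListL example
            oi_examples.append((u, (1 + (-1) ** v) // 2))
    return oi_examples

# The list L of (41) is then obtained by running the OI algorithm on these
# examples to get a predictor p, keeping every f whose p_f is (1/6 + 1/8)-close
# to p in the mu,D norm, and appending all of B \ (BP u BA).
print(simulate_examples([((0, 1, 1), 1), ((1, 0, 0), 0)]))
```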

We prove the following two helper lemmas before we finish the proof of Lemma 28.

Lemma 30.

If p()1/2p(\bot)\leq 1/2, then |L𝖡𝖯|64|L\cap\mathsf{BP}|\leq 64. Similarly, if p()1/2p(\bot)\geq 1/2, then |L𝖡𝖠|64|L\cap\mathsf{BA}|\leq 64.

Proof.

We prove the first half of the lemma, and the second half follows from a similar argument. Pick any function fL𝖡𝖯f\in L\cap\mathsf{BP}, and pick d𝒟d\in{\mathcal{D}} such that d()=1d(\bot)=1 and d(x)=(1)f(x)d(x)=(-1)^{f(x)} for all x{0,1}mx\in\{0,1\}^{m}. Define p0p_{0} to be the predictor that maps everything to 1/21/2. We have pfp0,dμ=1/3\langle p_{f}-p_{0},d\rangle_{\mu}=1/3. Define μ\mu^{\prime} to be the uniform distribution over {0,1}m\{0,1\}^{m}. We have

pp0,dμ=(1/3)(p()p0())+(2/3)pp0,dμ(2/3)pp0,dμ.\langle p-p_{0},d\rangle_{\mu}=(1/3)(p(\bot)-p_{0}(\bot))+(2/3)\langle p-p_{0},d\rangle_{\mu^{\prime}}\leq(2/3)\langle p-p_{0},d\rangle_{\mu^{\prime}}.

Therefore,

1/3(2/3)pp0,dμpfp,dμ1/6+1/8.1/3-(2/3)\langle p-p_{0},d\rangle_{\mu^{\prime}}\leq\langle p_{f}-p,d\rangle_{\mu}\leq 1/6+1/8.

This implies

pp0,dμ1/16.\langle p-p_{0},d\rangle_{\mu^{\prime}}\geq 1/16. (42)

However, the distinguishers in 𝒟{\mathcal{D}}, when restricted to the sub-domain {0,1}mX\{0,1\}^{m}\subseteq X, form an orthonormal basis for {0,1}m\mathbb{R}^{\{0,1\}^{m}} w.r.t. the inner product ,μ\langle\cdot,\cdot\rangle_{\mu^{\prime}}, so

d𝒟pp0,dμ2=pp0,pp0μ1/4.{\sum}_{d\in{\mathcal{D}}}\langle p-p_{0},d\rangle_{\mu^{\prime}}^{2}=\langle p-p_{0},p-p_{0}\rangle_{\mu^{\prime}}\leq 1/4.

Therefore, there can be at most 6464 different d𝒟d\in{\mathcal{D}} that satisfy (42). Since dd is defined differently for different fL𝖡𝖯f\in L\cap\mathsf{BP}, we get |L𝖡𝖯|64|L\cap\mathsf{BP}|\leq 64 as desired. ∎
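
The counting step at the end of this proof can be sanity-checked numerically for a small mm: the restrictions of the distinguishers to the cube are the Fourier characters, which are orthonormal under the uniform distribution, so Parseval's identity limits how many of them can have correlation at least 1/161/16 with a function bounded by 1/21/2. A small Python check (ours):

```python
from itertools import product
import random

# Numerical sanity check (ours) of the Parseval counting step in Lemma 30.
m = 3
cube = list(product([0, 1], repeat=m))

def character(S):
    """Restriction of a distinguisher to the cube: x -> (-1)^{parity_S(x)}."""
    return lambda x: (-1) ** (sum(x[i] for i in S) % 2)

chars = [character(tuple(i for i in range(m) if (mask >> i) & 1))
         for mask in range(2 ** m)]

def inner(f, g):
    """Inner product w.r.t. the uniform distribution mu' on the cube."""
    return sum(f(x) * g(x) for x in cube) / len(cube)

# Orthonormality: <chi_S, chi_T> is 1 if S = T and 0 otherwise.
assert all(abs(inner(chars[i], chars[j]) - (1.0 if i == j else 0.0)) < 1e-9
           for i in range(len(chars)) for j in range(len(chars)))

# For any g bounded by 1/2, the squared correlations sum to <g, g> <= 1/4,
# so at most (1/4) / (1/16)^2 = 64 characters can have correlation >= 1/16.
g_vals = {x: random.uniform(-0.5, 0.5) for x in cube}
g = lambda x: g_vals[x]
assert sum(inner(g, c) ** 2 for c in chars) <= 0.25 + 1e-9
big = [c for c in chars if abs(inner(g, c)) >= 1 / 16]
print(len(big), "characters with correlation at least 1/16 (never more than 64)")
```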

Lemma 31.

The event tLt\in L happens with probability at least 1δ1-\delta.

Proof.

By the definition of LL in (41), the lemma is trivial if t𝖡𝖯𝖡𝖠t\notin\mathsf{BP}\cup\mathsf{BA}, so we assume t𝖡𝖯𝖡𝖠t\in\mathsf{BP}\cup\mathsf{BA}.

Define μ\mu^{\prime} to be the uniform distribution over {0,1}m\{0,1\}^{m}. If t𝖡𝖯t\in\mathsf{BP}, for the distinguisher d𝒟d\in{\mathcal{D}} that satisfies d()=1d(\bot)=1 and d(x)=(1)t(x)d(x)=(-1)^{t(x)} for all x{0,1}mx\in\{0,1\}^{m}, we have ptp2,dμ=1/2\langle p_{t}-p_{2},d\rangle_{\mu^{\prime}}=1/2, so

ptp2,dμ=(1/3)(1/2)+(2/3)ptp2,dμ=1/6.\langle p_{t}-p_{2},d\rangle_{\mu}=(1/3)(-1/2)+(2/3)\langle p_{t}-p_{2},d\rangle_{\mu^{\prime}}=1/6.

For all other distinguishers d𝒟d\in{\mathcal{D}}, we have ptp2,dμ=0\langle p_{t}-p_{2},d\rangle_{\mu^{\prime}}=0, so

ptp2,dμ=(1/3)(1/2)+(2/3)ptp2,dμ=1/6.\langle p_{t}-p_{2},d\rangle_{\mu}=(1/3)(-1/2)+(2/3)\langle p_{t}-p_{2},d\rangle_{\mu^{\prime}}=-1/6.

Therefore, ptp2μ,𝒟=1/6\|p_{t}-p_{2}\|_{\mu,{\mathcal{D}}}=1/6. Similarly, we can show that when t𝖡𝖠t\in\mathsf{BA}, ptp1μ,𝒟=1/6\|p_{t}-p_{1}\|_{\mu,{\mathcal{D}}}=1/6.

Since the input examples (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}) to algorithm 𝒜{\mathcal{A}} are generated i.i.d. from μpt\mu_{p_{t}}, since 𝒜𝖣𝖲𝖠n(𝒫,𝒟,1/8,δ,μ){\mathcal{A}}\in\mathsf{DS\mathchar 45\relax A}_{n}({\mathcal{P}},{\mathcal{D}},1/8,\delta,\mu), and since the computation above shows that some predictor in 𝒫{\mathcal{P}} is within distance 1/6 of ptp_{t}, with probability at least 1δ1-\delta, algorithm 𝒜{\mathcal{A}} outputs pp such that pptμ,𝒟1/6+1/8\|p-p_{t}\|_{\mu,{\mathcal{D}}}\leq 1/6+1/8, in which case tLt\in L, as desired. ∎

Now we complete the proof of Lemma 28.

Proof.

Combining Lemma 29, Lemma 30, and Lemma 31, we have

1δ1(1/2)(11/2mn)(164/2mn)1-\delta\leq 1-(1/2)(1-1/2^{m-n})(1-64/2^{m-n})

whenever nm8n\leq m-8. For δ=1/3\delta=1/3, the right-hand side is at most 1(1/2)(255/256)(3/4)<2/31-(1/2)(255/256)(3/4)<2/3 whenever nm8n\leq m-8, contradicting 1δ=2/31-\delta=2/3; hence n>m8n>m-8, and in particular nm20n\geq m-20, which completes the proof. ∎


Appendix A Failure of Empirical Risk Minimization

The following two lemmas show that in certain cases the failure probability of Algorithm 1 approaches 11 (instead of 0) as the number of examples increases.

Lemma 32.

Assume XX is finite and μ\mu is the uniform distribution over XX. Assume 𝒟=[1,1]X{\mathcal{D}}=[-1,1]^{X}, and 𝒫{\mathcal{P}} consists of the following three predictors: pbp_{b} that maps every xXx\in X to bb for b=0,1/2,1b=0,1/2,1. For every positive integer nn, Algorithm 1 does not belong to

𝖮𝖨n(p1/2,𝒟,1/3,11/n+1,μ).\mathsf{OI}_{n}(p_{1/2},{\mathcal{D}},1/3,1-1/\sqrt{n+1},\mu).
Proof.

Consider the nn input examples to Algorithm 1: (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}) drawn i.i.d. from μp1/2\mu_{p_{1/2}}. We first show that with probability above 11/n+11-1/\sqrt{n+1},

n:=|{i{1,,n}:oi=1}|n/2.n^{\prime}:=|\{i\in\{1,\ldots,n\}:o_{i}=1\}|\neq n/2.

This is trivially true when nn is odd. When nn is even, this is also true because

Pr[n=n/2]=(nn/2)/2n<1/n+1.\Pr[n^{\prime}=n/2]={n\choose n/2}/2^{n}<1/\sqrt{n+1}.

It remains to prove that whenever nn/2n^{\prime}\neq n/2, the output pp of Algorithm 1 satisfies pp1/2μ,𝒟>1/3\|p-p_{1/2}\|_{\mu,{\mathcal{D}}}>1/3. By Lemma 7, p0p1/2μ,𝒟=1/2,p1/2p1μ,𝒟=1/2\|p_{0}-p_{1/2}\|_{\mu,{\mathcal{D}}}=1/2,\|p_{1/2}-p_{1}\|_{\mu,{\mathcal{D}}}=1/2, and p0p1μ,𝒟=1\|p_{0}-p_{1}\|_{\mu,{\mathcal{D}}}=1. Therefore, the ε/2(=1/6)\varepsilon/2(=1/6)-covering 𝒫{\mathcal{P}}^{\prime} in Algorithm 1 is equal to 𝒫{\mathcal{P}}. The loss of predictor pbp_{b} is

\mathsf{loss}(p_{b})=\frac{1}{n}\sum_{i=1}^{n}|b-o_{i}|=\frac{(1-b)n^{\prime}+b(n-n^{\prime})}{n}=\frac{n^{\prime}+b(n-2n^{\prime})}{n}.

When n<n/2n^{\prime}<n/2, p0p_{0} has the smallest loss, so Algorithm 1 returns p=p0p=p_{0}, in which case we indeed have pp1/2μ,𝒟=1/2>1/3\|p-p_{1/2}\|_{\mu,{\mathcal{D}}}=1/2>1/3. Similarly, when n>n/2n^{\prime}>n/2, p1p_{1} has the smallest loss, so p=p1p=p_{1} and pp1/2μ,𝒟=1/2>1/3\|p-p_{1/2}\|_{\mu,{\mathcal{D}}}=1/2>1/3. ∎
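
The failure mode in this proof is easy to observe empirically. The following Python sketch (ours) runs the empirical loss minimization over the three constant predictors on outcomes drawn from p1/2p_{1/2} and reports how often the output is not 1/21/2.

```python
import random

# Simulation (ours) of the failure mode in Lemma 32: with target 1/2 everywhere
# and candidates b in {0, 1/2, 1}, minimizing the empirical absolute error on
# Bernoulli(1/2) outcomes returns 0 or 1 whenever the empirical frequency of
# ones is not exactly 1/2, i.e. with probability close to 1.
def erm_constant(outcomes):
    candidates = [0.0, 0.5, 1.0]
    return min(candidates, key=lambda b: sum(abs(b - o) for o in outcomes))

n, trials, failures = 101, 2000, 0
for _ in range(trials):
    outcomes = [random.randint(0, 1) for _ in range(n)]
    if erm_constant(outcomes) != 0.5:
        failures += 1
print(failures / trials)  # equal to 1 here since n is odd
```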

In the lemma below, we give an example of the failure of Algorithm 1 in distribution-specific realizable OI when all the predictors in 𝒫{\mathcal{P}} are binary classifiers. In this example, X,μ,𝒫,𝒟X,\mu,{\mathcal{P}},{\mathcal{D}} are parametrized by two positive integers mm and nn as follows. We choose the individual set to be X={1,2,3}{1,,m}X=\{-1,-2,-3\}\cup\{1,\ldots,m\}, and choose the distinguisher class 𝒟{\mathcal{D}} as follows. For every size-nn subset Y{1,,m}Y\subseteq\{1,\ldots,m\}, define 𝒟Y[1,1]X{\mathcal{D}}_{Y}\subseteq[-1,1]^{X} to be the set of all distinguishers d[1,1]Xd\in[-1,1]^{X} satisfying d(x)=0d(x)=0 for all x{1,,m}\Yx\in\{1,\ldots,m\}\backslash Y. The distinguisher class 𝒟{\mathcal{D}} is then defined as 𝒟=Y𝒟Y{\mathcal{D}}=\bigcup_{Y}{\mathcal{D}}_{Y}, where the union is over all size-nn subsets Y{1,,m}Y\subseteq\{1,\ldots,m\}. The predictor class 𝒫{\mathcal{P}} consists of 44 predictors p0,p1,p2,p3p_{0},p_{1},p_{2},p_{3} defined as follows:

p0(1)=0,p0(2)=0,p0(3)=0,\displaystyle p_{0}(-1)=0,p_{0}(-2)=0,p_{0}(-3)=0,{} p0(x)=0for allx{1,,m},\displaystyle p_{0}(x)=0\ \textnormal{for all}\ x\in\{1,\ldots,m\},
p1(1)=0,p1(2)=0,p1(3)=0,\displaystyle p_{1}(-1)=0,p_{1}(-2)=0,p_{1}(-3)=0,{} p1(x)=1for allx{1,,m},\displaystyle p_{1}(x)=1\ \textnormal{for all}\ x\in\{1,\ldots,m\},
p2(1)=1,p2(2)=1,p2(3)=0,\displaystyle p_{2}(-1)=1,p_{2}(-2)=1,p_{2}(-3)=0,{} p2(x)=0for allx{1,,m},\displaystyle p_{2}(x)=0\ \textnormal{for all}\ x\in\{1,\ldots,m\},
p3(1)=0,p3(2)=1,p3(3)=1,\displaystyle p_{3}(-1)=0,p_{3}(-2)=1,p_{3}(-3)=1,{} p3(x)=1for allx{1,,m}.\displaystyle p_{3}(x)=1\ \textnormal{for all}\ x\in\{1,\ldots,m\}.

The distribution μ\mu spreads 1/21/2 probability mass evenly on {1,2,3}\{-1,-2,-3\}, and spreads the remaining 1/21/2 probability mass evenly on {1,,m}\{1,\ldots,m\}.

Since Nμ,𝒟(𝒫,ε/32)|𝒫|=4N_{\mu,{\mathcal{D}}}({\mathcal{P}},\varepsilon/32)\leq|{\mathcal{P}}|=4, Theorem 13 tells us that

𝖲𝖠𝖬𝖯𝖣𝖲𝖱(𝒫,𝒟,ε,δ,μ)O(ε2log(2/δ)+ε4).\mathsf{SAMP\mathchar 45\relax DS\mathchar 45\relax R}({\mathcal{P}},{\mathcal{D}},\varepsilon,\delta,\mu)\leq O(\varepsilon^{-2}\log(2/\delta)+\varepsilon^{-4}).

The lemma below shows that this sample complexity upper bound cannot be achieved using Algorithm 1 when ε1/4\varepsilon\leq 1/4 and δ\delta is close to zero.

Lemma 33.

For every positive integer nn, there exists a positive integer mm such that when X,𝒫,𝒟,μX,{\mathcal{P}},{\mathcal{D}},\mu are defined as above, Algorithm 1 does not belong to

𝖣𝖲𝖱n(𝒫,𝒟,1/4,max{22O(n),1O(2Ω(n))},μ).\mathsf{DS\mathchar 45\relax R}_{n}({\mathcal{P}},{\mathcal{D}},1/4,\max\{2^{-2-O(n)},1-O(2^{-\Omega(n)})\},\mu).
Proof.

By choosing mm sufficiently large, we get

p0p1μ,𝒟\displaystyle\|p_{0}-p_{1}\|_{\mu,{\mathcal{D}}} =(1/2)(n/m)1/100,\displaystyle=(1/2)(n/m)\leq 1/100, (43)
pipjμ,𝒟\displaystyle\|p_{i}-p_{j}\|_{\mu,{\mathcal{D}}} (1/2)(2/3)=1/3for all i,j{0,1,2,3} satisfying i<j and (i,j)(0,1).\geq(1/2)(2/3)=1/3\ \textnormal{for all $i,j\in\{0,1,2,3\}$ satisfying $i<j$ and $(i,j)\neq(0,1)$}. (44)

Suppose Algorithm 1 belongs to 𝖣𝖲𝖱n(𝒫,𝒟,1/4,δ,μ)\mathsf{DS\mathchar 45\relax R}_{n}({\mathcal{P}},{\mathcal{D}},1/4,\delta,\mu) for some δ(0,1)\delta\in(0,1). Consider the case where the input to Algorithm 1 is nn examples (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}) drawn i.i.d. from μp\mu_{p^{*}} where pp^{*} is drawn uniformly at random from {p0,p2}\{p_{0},p_{2}\}. By assumption, the output pp of Algorithm 1 satisfies Pr[ppμ,𝒟1/4]1δ\Pr[\|p-p^{*}\|_{\mu,{\mathcal{D}}}\leq 1/4]\geq 1-\delta. Since Algorithm 1 only outputs p𝒫p\in{\mathcal{P}}, this means that Pr[p𝖭𝖾𝗂(p)]1δ\Pr[p\in\mathsf{Nei}(p^{*})]\geq 1-\delta where 𝖭𝖾𝗂(p0)={p0,p1}\mathsf{Nei}(p_{0})=\{p_{0},p_{1}\} and 𝖭𝖾𝗂(p2)={p2}\mathsf{Nei}(p_{2})=\{p_{2}\}. However, with probability (1/2)n(1/2)^{n}, all the xix_{i}’s belong to {1,,m}\{1,\ldots,m\}, in which case o1==on=0o_{1}=\cdots=o_{n}=0, giving no information about pp^{*}. Therefore, Pr[p𝖭𝖾𝗂(p)](1/2)n(1/2)\Pr[p\notin\mathsf{Nei}(p^{*})]\geq(1/2)^{n}(1/2). This implies

δ(1/2)n(1/2)>22O(n).\delta\geq(1/2)^{n}(1/2)>2^{-2-O(n)}. (45)

Inequalities (43) and (44) imply that the covering 𝒫{\mathcal{P}}^{\prime} computed in Algorithm 1 is either {p0,p2,p3}\{p_{0},p_{2},p_{3}\} or {p1,p2,p3}\{p_{1},p_{2},p_{3}\}. Without loss of generality, we assume 𝒫={p1,p2,p3}{\mathcal{P}}^{\prime}=\{p_{1},p_{2},p_{3}\} because the other case can be handled similarly. Now consider the case where the input examples (x1,o1),,(xn,on)(x_{1},o_{1}),\ldots,(x_{n},o_{n}) are drawn i.i.d. from μp\mu_{p^{*}} with p=p0p^{*}=p_{0}.

By our construction of 𝒟{\mathcal{D}}, the loss of every predictor p𝒫p^{\prime}\in{\mathcal{P}}^{\prime} is

𝗅𝗈𝗌𝗌(p)=1ni=1n|p(xi)p0(xi)|.\mathsf{loss}(p^{\prime})=\frac{1}{n}\sum_{i=1}^{n}|p^{\prime}(x_{i})-p_{0}(x_{i})|.

By the Chernoff bound and the union bound, with probability above 1O(2Ω(n))1-O(2^{-\Omega(n)}), the absolute difference between 𝗅𝗈𝗌𝗌(p)\mathsf{loss}(p^{\prime}) and 𝔼xμ|p(x)p0(x)|\operatorname{\mathbb{E}}_{x\sim\mu}|p^{\prime}(x)-p_{0}(x)| is below 1/1001/100 for all p𝒫p^{\prime}\in{\mathcal{P}}^{\prime}. Note that

𝔼xμ|p1(x)p0(x)|=1/2,𝔼xμ|p2(x)p0(x)|=1/3,𝔼xμ|p3(x)p0(x)|=5/6.\operatorname{\mathbb{E}}_{x\sim\mu}|p_{1}(x)-p_{0}(x)|=1/2,\operatorname{\mathbb{E}}_{x\sim\mu}|p_{2}(x)-p_{0}(x)|=1/3,\operatorname{\mathbb{E}}_{x\sim\mu}|p_{3}(x)-p_{0}(x)|=5/6.

Therefore, with probability above 1O(2Ω(n))1-O(2^{-\Omega(n)}), Algorithm 1 returns p=p2p=p_{2}, which does not satisfy ppμ,𝒟1/4\|p-p^{*}\|_{\mu,{\mathcal{D}}}\leq 1/4. This implies

δ>1O(2Ω(n)).\delta>1-O(2^{-\Omega(n)}). (46)

The lemma is proved by combining (45) and (46). ∎
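
The distance computations (43), (44) and the expected losses used in this proof can be verified exactly for small parameters. The following Python sketch (ours) does so, evaluating pqμ,𝒟\|p-q\|_{\mu,{\mathcal{D}}} as the largest value of pq,dμ\langle p-q,d\rangle_{\mu} over d𝒟d\in{\mathcal{D}}, which for this 𝒟{\mathcal{D}} picks up the full discrepancy on {1,2,3}\{-1,-2,-3\} plus the nn heaviest discrepancies on {1,,m}\{1,\ldots,m\}.

```python
from fractions import Fraction

# Exact check (ours) of (43), (44) and the expected losses in the proof of
# Lemma 33, for one small choice of parameters: m = 8, n = 2.
m, n = 8, 2
neg = [-1, -2, -3]
pos = list(range(1, m + 1))
mu = {x: Fraction(1, 6) for x in neg}
mu.update({x: Fraction(1, 2 * m) for x in pos})

# The four predictors p_0, ..., p_3 on X = {-1,-2,-3} u {1,...,m}.
P = [
    {**{-1: 0, -2: 0, -3: 0}, **{x: 0 for x in pos}},
    {**{-1: 0, -2: 0, -3: 0}, **{x: 1 for x in pos}},
    {**{-1: 1, -2: 1, -3: 0}, **{x: 0 for x in pos}},
    {**{-1: 0, -2: 1, -3: 1}, **{x: 1 for x in pos}},
]

def dist(p, q):
    """Largest correlation a distinguisher can achieve with p - q: each
    distinguisher is free on the negative points and on at most n positive
    points, so the supremum collects the full mass on {-1,-2,-3} plus the
    n heaviest discrepancies on {1,...,m}."""
    neg_part = sum(mu[x] * abs(p[x] - q[x]) for x in neg)
    pos_part = sum(sorted((mu[x] * abs(p[x] - q[x]) for x in pos), reverse=True)[:n])
    return neg_part + pos_part

print(dist(P[0], P[1]))  # (1/2)(n/m) = 1/8, matching (43)
print(min(dist(P[i], P[j]) for i in range(4) for j in range(i + 1, 4)
          if (i, j) != (0, 1)))  # 1/3, matching (44)
print([sum(mu[x] * abs(P[k][x] - P[0][x]) for x in mu) for k in (1, 2, 3)])
# expected absolute differences from p_0 are 1/2, 1/3, 5/6, as used in the proof
```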

Appendix B A Helper Lemma

Lemma 34.

Let μ\mu be the uniform distribution over a set XX of individuals with size |X|=n>0|X|=n\in\mathbb{Z}_{>0}. Then for all ε(0,1/e)\varepsilon\in(0,1/e),

logNμ,[1,1]X([1,1]X,ε)nlog(1/eε),\log N_{\mu,[-1,1]^{X}}([-1,1]^{X},\varepsilon)\geq n\log(1/e\varepsilon),

where ee is the base of the natural logarithm.

Proof.

Suppose X={x1,,xn}X=\{x_{1},\ldots,x_{n}\}. Let 𝗏𝖾𝖼\mathsf{vec} be the bijection from X\mathbb{R}^{X} to n\mathbb{R}^{n} such that 𝗏𝖾𝖼(f)=(f(x1),,f(xn))n\mathsf{vec}(f)=(f(x_{1}),\ldots,f(x_{n}))\in\mathbb{R}^{n} for all fXf\in\mathbb{R}^{X}. Let {f1,,fm}[1,1]X\{f_{1},\ldots,f_{m}\}\subseteq[-1,1]^{X} be an ε\varepsilon-covering of [1,1]X[-1,1]^{X} w.r.t. the norm μ,[1,1]X\|\cdot\|_{\mu,[-1,1]^{X}}. Defining

B={fX:fμ,[1,1]Xε},B=\{f\in\mathbb{R}^{X}:\|f\|_{\mu,[-1,1]^{X}}\leq\varepsilon\},

we have [1,1]Xi=1m({fi}+B)[-1,1]^{X}\subseteq\bigcup_{i=1}^{m}(\{f_{i}\}+B), which implies 𝗏𝖾𝖼([1,1]X)i=1m𝗏𝖾𝖼({fi}+B)\mathsf{vec}([-1,1]^{X})\subseteq\bigcup_{i=1}^{m}\mathsf{vec}(\{f_{i}\}+B). Therefore, the volume of 𝗏𝖾𝖼([1,1]X)\mathsf{vec}([-1,1]^{X}) is at most mm times the volume of 𝗏𝖾𝖼(B)\mathsf{vec}(B).

It is clear that 𝗏𝖾𝖼([1,1]X)=[1,1]n\mathsf{vec}([-1,1]^{X})=[-1,1]^{n} has volume 2n2^{n} in n\mathbb{R}^{n}. Moreover, by Lemma 7,

𝗏𝖾𝖼(B)={(r1,,rn)n:|r1|++|rn|εn},\mathsf{vec}(B)=\{(r_{1},\ldots,r_{n})\in\mathbb{R}^{n}:|r_{1}|+\cdots+|r_{n}|\leq\varepsilon n\},

which has volume (2εn)n/n!(2\varepsilon n)^{n}/n!. Therefore,

2nm(2εn)n/n!,2^{n}\leq m(2\varepsilon n)^{n}/n!,

and thus

logmlog(n!/(nε)n)nlog(1/eε),\log m\geq\log(n!/(n\varepsilon)^{n})\geq n\log(1/e\varepsilon),

as desired. We used the fact log(n!)1nlogtdt>nlog(n/e)\log(n!)\geq\int_{1}^{n}\log t\mathrm{d}t>n\log(n/e) in the last inequality. ∎
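
As a quick numeric check of the final bound, the following Python snippet (ours) compares the volume-based quantity log(n!/(nε)n)\log(n!/(n\varepsilon)^{n}) with nlog(1/eε)n\log(1/e\varepsilon) for a few values of nn and ε\varepsilon.

```python
import math

# Numeric check (ours) of the last step of Lemma 34: the volume argument gives
# log N >= log(n! / (n * eps)^n), which is at least n * log(1 / (e * eps)).
def volume_lower_bound(n, eps):
    return math.lgamma(n + 1) - n * math.log(n * eps)  # log(n! / (n * eps)^n)

def stated_bound(n, eps):
    return n * math.log(1 / (math.e * eps))

for n in (5, 50, 500):
    for eps in (0.01, 0.1, 0.3):
        assert volume_lower_bound(n, eps) >= stated_bound(n, eps) - 1e-9
        print(n, eps, round(volume_lower_bound(n, eps), 2), round(stated_bound(n, eps), 2))
```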