This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Of Dice and Games:
A Theory of Generalized Boosting

Marco Bressan1, Nataly Brukhim2, Nicolò Cesa-Bianchi1,3,
Emmanuel Esposito1, Yishay Mansour4,5, Shay Moran6,5, Maximilian Thiessen7
1Universita degli Studi di Milano, Italy   2Institute for Advanced Study, USA
3Politecnico di Milano, Italy   4Tel Aviv University, Israel   5Google Research
6Technion, Israel   7TU Wien, Austria
Abstract

Cost-sensitive loss functions are crucial in many real-world prediction problems, where different types of errors are penalized differently; for example, in medical diagnosis, a false negative prediction can lead to worse consequences than a false positive prediction. However, traditional PAC learning theory has mostly focused on the symmetric 0-1 loss, leaving cost-sensitive losses largely unaddressed. In this work, we extend the celebrated theory of boosting to incorporate both cost-sensitive and multi-objective losses. Cost-sensitive losses assign costs to the entries of a confusion matrix, and are used to control the sum of prediction errors accounting for the cost of each error type. Multi-objective losses, on the other hand, simultaneously track multiple cost-sensitive losses, and are useful when the goal is to satisfy several criteria at once (e.g., minimizing false positives while keeping false negatives below a critical threshold).

We develop a comprehensive theory of cost-sensitive and multi-objective boosting, providing a taxonomy of weak learning guarantees that distinguishes which guarantees are trivial (i.e., can always be achieved), which ones are boostable (i.e., imply strong learning), and which ones are intermediate, implying non-trivial yet not arbitrarily accurate learning. For binary classification, we establish a dichotomy: a weak learning guarantee is either trivial or boostable. In the multiclass setting, we describe a more intricate landscape of intermediate weak learning guarantees. Our characterization relies on a geometric interpretation of boosting, revealing a surprising equivalence between cost-sensitive and multi-objective losses.

Keywords: Boosting, Minimax theorem, cost-sensitive learning, multi-objective learning, Blackwell’s approachability.

1 Introduction

In many machine learning applications, different types of mistakes may have very different consequences, making it crucial to consider the costs associated with them. For example, in medical diagnostics, failing to detect a serious illness (a false negative) can have life-threatening implications, whereas incorrectly diagnosing a healthy person as ill (a false positive) mostly leads to unnecessary stress and medical expenses. This disparity in error costs is not limited to binary decisions. For example, when recommending movies to a viewer with preferences “romance over action over horror”, misclassifying a romance film as “horror” is probably worse than misclassifying it as “action”. Besides weighting different kinds of mispredictions, one may even want to treat different kinds of mispredictions separately. That is, instead of a cost-sensitive criterion, one may use a multi-objective criterion, specifying acceptable rates for different types of mispredictions. For example, one may find acceptable a false positive rate of (say) 10%10\% only if simultaneously the false negative rate is at most (say) 1%1\%.

Despite the importance of misclassification costs in applications, the theoretical understanding of this setting is lacking. A glaring example, which motivates this work, is boosting. Broadly speaking, a boosting algorithm is a procedure that aggregates several weak learners (whose accuracy is only marginally better than a random guess) into a single strong learner (whose accuracy can be made as high as desired). Although this is a fundamental and well-studied machine learning technique, a theory of boosting accounting for cost-sensitive or multi-objective losses is missing, even in the simplest setting of binary classification.111There are many works on adapting AdaBoost to cost-sensitive learning, but they do not address the fundamental question of identifying the minimal assumption on the cost-sensitive function which guarantees boosting. See more in Appendix A. In fact, if one can assign different costs to different kinds of mistakes, then even the meaning of “marginaly better than a random guess” is not immediately clear; let alone the question of whether one can boost a weak learner to a strong learner, or what precisely “boosting” means. The present work addresses those challenges, providing a generalized theory of boosting which unifies different types of weak learning guarantees, including cost-sensitive and multi-objective ones, and extends the standard algorithmic boosting framework beyond the current state of the art. The fundamental question that we pose is:

Which cost-sensitive and/or multi-objective learners can be boosted? And how?

In classical boosting theory for binary classification, a sharp transition occurs at a weak learner’s error threshold of 1/21/2: if the weak learner is guaranteed to output hypotheses with an error rate below 1/21/2, then it can be boosted to a strong learner with arbitrarily small error, for instance by AdaBoost (Freund and Schapire, 1997). Thus, a weak learning guarantee below 1/21/2 implies strong learning. On the other hand, a guarantee of 1/21/2 is trivial, as it can be achieved by tossing a fair coin—see Schapire and Freund (2012). Therefore, guarantees above 1/21/2 are not boostable.

We investigate whether similar transitions exist for arbitrary cost-sensitive losses. A cost-sensitive loss ww specifies the penalty w(i,j)w(i,j) for predicting ii when the true label is jj, and can penalize prediction errors unequally (e.g., in binary classification, it may penalize false negatives more than false positives). Suppose we have access to a weak learner that outputs hypotheses with a cost-sensitive loss of at most z>0z>0 under ww. For which values of zz does this imply strong learning, so that the weak learner can be boosted to achieve arbitrarily small cost-sensitive loss according to ww? Which values of zz are trivial, meaning they can always be achieved? Are there intermediate values of zz that do not imply strong learning but are still non-trivial?

A similar question arises for multi-objective losses. A multi-objective loss is given by a vector 𝒘=(w1,w2,,wr)\boldsymbol{w}=(w_{1},w_{2},\ldots,w_{r}) where each wiw_{i} is a cost-sensitive loss as described above. For instance, in binary classification, a natural choice would be 𝒘=(wn,wp)\boldsymbol{w}=(w_{n},w_{p}), where wnw_{n} measures false negatives and wpw_{p} false positives. Suppose again we have access to a weak learner that outputs hypotheses with loss at most zi0z_{i}\geq 0 under wiw_{i} simultaneously for every ii, forming a vector of guarantees 𝐳=(z1,,zr)\mathbf{z}=(z_{1},\ldots,z_{r}). Which 𝐳\mathbf{z} are trivial, in that they can always be achieved? Which 𝐳\mathbf{z} are boostable, allowing for a strong learner that achieves arbitrarily small error simultaneously for all of the losses wiw_{i}? And are there intermediate vectors 𝐳\mathbf{z} that fall between trivial and boostable?

We address these questions by introducing a new perspective on random guessing, framed as either a zero-sum game or a vector-payoff game (known as a Blackwell approachability game). This game-theoretic approach applies to both cost-sensitive and multi-objective learning, leading to a complete characterization of boostability in these cases. We then extend these techniques to the multiclass setting, where the boostability landscape becomes significantly more complex. While this perspective complements existing views of boosting as a zero-sum game, prior methods are not suited to the settings we examine here. The new tools introduced in this work effectively handle a broad learning framework, establishing a unified and comprehensive theory of generalized boosting.

In particular, we provide extensive answers to the above questions, as follows:

  • Cost-sensitive losses. For binary classification, we establish a crisp dichotomy: each guarantee z0z\geq 0 is either trivial or boostable (A). We show that this transition occurs at a critical threshold given by the value of a zero-sum game defined by ww itself. In the multiclass setting, the dichotomy expands into a hierarchy of guarantees, ranging from fully boostable to trivial. Here, we show that there exist multiple thresholds 0<v1(w)<v2(w)<<vτ(w)0<v_{1}(w)<v_{2}(w)<\ldots<v_{\tau}(w), where τ\tau depends on the cost function ww, and each guarantee z(vi,vi+1)z\in(v_{i},v_{i+1}) can be boosted down to viv_{i} (C). This generalizes the binary case, in which τ=1\tau=1.

  • Multi-objective losses. For binary classification, we again show a clean dichotomy; however, the threshold now takes a higher-dimensional form, becoming a surface that separates trivial guarantees from boostable ones (B). Figure 1 illustrates this surface for a loss vector 𝒘\boldsymbol{w} representing false positive and false negative errors. In the multiclass setting, things become more complex: here, we show how to boost non-trivial guarantees to list-learners (D), but a complete taxonomy of boostable guarantees remains elusive and is left as an open question.

  • An equivalence between cost-sensitive and multi-objective losses. We establish and exploit an equivalence between multi-objective and cost-sensitive losses that may be of independent interest (Theorem 6). Given a loss vector 𝒘=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}), consider a weak learner that outputs hypotheses with loss at most ziz_{i} for each wiw_{i}. By linearity of expectation, it follows that for any convex combination w𝜶=αiwiw_{\boldsymbol{\alpha}}=\sum\alpha_{i}w_{i}, the weak learner’s loss with respect to w𝜶w_{\boldsymbol{\alpha}} does not exceed αizi\sum\alpha_{i}z_{i}. We prove that the converse also holds: if for each such w𝜶w_{\boldsymbol{\alpha}} there exists a weak learner with loss at most αizi\sum\alpha_{i}z_{i}, then we can efficiently aggregate these learners into one that achieves loss at most ziz_{i} simultaneously for every wiw_{i}. Interestingly, a geometric perspective of this result reveals a connection to Blackwell’s Approachability Theorem (Blackwell, 1956) and to scalarization methods in multi-objective optimization.

Organization of the manuscript.

Section 2 gives a detailed walkthrough of all our main results, their significance, and the underlying intuition. Section 3 addresses the binary classification case, in both the cost-sensitive and multi-objective flavors. Section 4 presents our equivalence connecting cost-sensitive learners with multi-objective learners. Finally, Section 5 considers the multiclass case.

2 Main Results

This section outlines the problem setting and the key questions we investigate, and provides an overview of our main results. We begin from the basic technical setup. We consider the standard PAC (Probably Approximately Correct) setting (Valiant, 1984), which models learning a concept class 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} for a domain 𝒳{\mathcal{X}} and a label space 𝒴=[k]{\mathcal{Y}}=[k] with k2k\geq 2. We define a cost function, or simply cost, to be any function w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\to[0,1] that satisfies w(i,i)=0w(i,i)=0 for all i𝒴i\in{\mathcal{Y}};222Although the range of the cost is [0,1][0,1] our results can be easily extended to arbitrary costs w:𝒴2R0w:{\mathcal{Y}}^{2}\to\mathbb{R}_{\geq 0}. the value w(i,j)w(i,j) should be thought of as the penalty incurred by predicting ii when the true label is jj. Note that ww can be seen as a k×kk\times k matrix indexed by 𝒴{\mathcal{Y}}. For instance, for 𝒴={1,+1}{\mathcal{Y}}=\{-1,+1\}, a valid cost is w=(0140)w=\bigl{(}\begin{smallmatrix}0&1\\ 4&0\end{smallmatrix}\bigr{)}; this means that the cost of a false positive is w(+1,1)=4w(+1,-1)=4, and that of a false negative is w(1,+1)=1w(-1,+1)=1. For conciseness, in the binary setting we let w=w(1,+1)w_{-}=w(-1,+1) and w+=w(+1,1)w_{+}=w(+1,-1); and by some overloading of notation we denote by ww both the matrix (0ww+0)\bigl{(}\begin{smallmatrix}0&w_{-}\\ w_{+}&0\end{smallmatrix}\bigr{)} and the vector (w+,w)(w_{+},w_{-}). For a generic cost w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\rightarrow[0,1], a target function ff\in\mathcal{F}, and a distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}}, the cost-sensitive loss of a predictor h:𝒳𝒴h:{\mathcal{X}}\to{\mathcal{Y}} is:

L𝒟w(h)Ex𝒟[w(h(x),f(x))]L_{\mathcal{D}}^{w}(h)\triangleq{\mathbb E}_{x\sim\mathcal{D}}\Bigl{[}w(h(x),f(x))\Bigr{]} (1)

For example, when w(i,j)=I{ij}w(i,j)=\mathbb{I}\!\left\{i\neq j\right\}, then L𝒟w(h)L_{\mathcal{D}}^{w}(h) simply corresponds to Px𝒟(h(x)f(x))\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}(h(x)\neq f(x)), the standard 0-1 loss used in binary classification. For convenience we assume throughout the paper that w1\|w\|_{\infty}\leq 1, though all our results apply more broadly.

We begin by presenting our contributions for the binary setting (Section 2.1), followed by their extension to the general multiclass case (Section 2.2).

2.1 Binary setting

We begin from the binary setting, that is, the case 𝒴={1,+1}{\mathcal{Y}}=\{-1,+1\}. The starting point is the definition of a suitable notion of “weak learner” for an arbitrary cost function ww.

Definition 1 ((w,z)(w,z)-learner).

Let w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\rightarrow[0,1] be a cost function and let z0z\geq 0. An algorithm 𝒜{\mathcal{A}} is a (w,z)(w,z)-learner for 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} if there is a function m0:(0,1)2Nm_{0}:(0,1)^{2}\rightarrow\mathbb{N} such that for every ff\in\mathcal{F}, every distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}}, and every ϵ,δ(0,1)\epsilon,\delta\in(0,1) the following claim holds. If SS is a sample of m0(ϵ,δ)m_{0}(\epsilon,\delta) examples drawn i.i.d. from 𝒟\mathcal{D} and labeled by ff, then 𝒜(S){\mathcal{A}}(S) returns a predictor h:𝒳𝒴h:{\mathcal{X}}\rightarrow{\mathcal{Y}} such that L𝒟w(h)z+ϵL_{\mathcal{D}}^{w}(h)\leq z+\epsilon with probability at least 1δ1-\delta.

{floatrow}
Binary Multiclass
0-1 loss 1/21/2 12,23,34,,k1k\frac{1}{2},\frac{2}{3},\frac{3}{4},\dots,\frac{k-1}{k}
cost ww V(w)=w+ww++wV(w)=\frac{w_{+}w_{-}}{w_{+}+w_{-}} v2(w)vτ(w)v_{{2}}(w)\leq\dots\leq v_{{\tau}}(w)
multi-objective w=(w1,,wr)\boldsymbol{w}=(w_{1},\dots,w_{r}) coin-attainable boundary of C(𝒘)C(\boldsymbol{w}) JJ-dice-attainable boundaries of DJ(𝒘)D_{J}(\boldsymbol{w})
\ffigbox
Refer to caption
Figure 1: Boostability thresholds. Binary. For classic 0-1 loss binary boosting, it is well-known that the boostability threshold is 1/21/2: any value below it can be boosted, while any value above it is trivially attainable by non-boostable learners. For any cost ww, the boostability threshold is V(w)\operatorname{V}(w) (see Equation 3, A). For the multi-objective loss, the threshold is determined by the boundary of the coin-attainable region, denoted C(𝒘)C(\boldsymbol{w}) (see Definition 4, B), as illustrated in the plot on the right; each point in the plane corresponds to false-positive and false-negative errors (z+,z)(z_{+},z_{-}). The two colored regions in the plot correspond to (a) coin-attainable point C(𝒘)C(\boldsymbol{w}) (in red) and (b) boostable points [0,1]2C(𝒘)[0,1]^{2}\setminus C(\boldsymbol{w}) (in blue). See below B for further discussion. Multiclass. A similar pattern holds for multiclass boosting. For 0-1 loss, boostability is known to be determined by k1k-1 thresholds (Brukhim et al., 2023a). For any cost ww, the boostability thresholds are vn(w)v_{{n}}(w) (see Equation 6). For the multi-objective loss, thresholds are determined by the boundaries of dice-attainable regions DJ(𝒘)D_{J}(\boldsymbol{w}) (see Section 2.2 for further details).

We study the question of whether a (w,z)(w,z)-learner can be boosted. Broadly speaking, a learner 𝒜{\mathcal{A}} is boostable if there exists a learning algorithm {\mathcal{B}} that can achieve arbitrarily small loss by aggregating a small set of predictors obtained from 𝒜{\mathcal{A}}. More precisely, given black-box access to 𝒜{\mathcal{A}} and a labeled sample SS, algorithm {\mathcal{B}} invokes 𝒜{\mathcal{A}} on subsamples S1,,STS_{1},\ldots,S_{T} of SS to produce weak predictors h1,,hTh_{1},\dots,h_{T}. It then outputs a predictor h¯(x)=g(h1(x),,hT(x))\bar{h}(x)=g(h_{1}(x),\dots,h_{T}(x)) using some aggregation function g:𝒴T𝒴g:{\mathcal{Y}}^{T}\rightarrow{\mathcal{Y}}. For instance, in classic binary boosting the function gg is usually a weighted majority vote. In this work, we consider arbitrary such aggregations. The goal is to ensure that the loss of the aggregate predictor, L𝒟w(h¯)L_{\mathcal{D}}^{w}(\bar{h}), goes to zero with TT, resulting in a (w,0)(w,0)-learner. Our guiding question can then be stated as:

Can we boost a (w,z)(w,z)-learner to a (w,0)(w,0)-learner?

In order to develop some intuition, let us consider again the case when ww yields the standard 0-1 loss. In that case, Definition 1 boils down to the standard notion of a weak PAC learner.333Notice that this definition is slightly more general than the classic weak PAC learning definition, in which z=1/2γz=\nicefrac{{1}}{{2}}-\gamma and ϵ=0\epsilon=0, yet we consider learners which are allowed to be arbitrarily close to zz. Then, classic boosting theory states that every (w,z)(w,z)-learner with z<1/2z<\nicefrac{{1}}{{2}} can be boosted to a (w,0)(w,0)-learner; and it is easy to see that a loss of 1/2\nicefrac{{1}}{{2}} is always achievable, simply by predicting according to a fair coin toss. Thus, the value 1/2\nicefrac{{1}}{{2}} yields a sharp dichotomy: every z<1/2z<\nicefrac{{1}}{{2}} can be boosted and drawn to 0, while every z1/2z\geq\nicefrac{{1}}{{2}} is trivially achievable and cannot be brought below 1/2\nicefrac{{1}}{{2}}.

Can this classic dichotomy between “boostable” and “trivial” be extended to accommodate arbitrary cost functions? It turns out that this can be done if one uses as trivial predictors biased coins. Indeed, we show that, by taking such random guessing strategies into account, one can identify a general boostability threshold for all cost functions ww. However, in contrast to the 0-1 loss case, this critical threshold between the boostable and non-boostable guarantees is no longer 1/2\nicefrac{{1}}{{2}}; instead, it is a function of the cost ww. More precisely, the threshold is determined by the outcome of a simple two-player zero-sum game, as we describe next.

The game involves a minimizing player (the predictor) and a maximizing player (the environment). The minimizing player selects a distribution 𝐩\mathbf{p} over 𝒴={1,+1}{\mathcal{Y}}=\{-1,+1\}; similarly, the maximizing player selects a distribution 𝐪\mathbf{q} over 𝒴={1,+1}{\mathcal{Y}}=\{-1,+1\}. The payoff matrix of the game is ww. The overall cost paid by the predictor is then:

w(𝐩,𝐪)Ei𝐩Ej𝐪w(i,j).w(\mathbf{p},\mathbf{q})\triangleq{\mathbb E}_{i\sim\mathbf{p}}{\mathbb E}_{j\sim\mathbf{q}}\ w(i,j)\,. (2)

Following standard game theory, the value of the game is the quantity

V(w)min𝐩Δ𝒴max𝐪Δ𝒴w(𝐩,𝐪),\displaystyle\operatorname{V}(w)\triangleq\min_{\mathbf{p}\in\Delta_{\mathcal{Y}}}\max_{\mathbf{q}\in\Delta_{\mathcal{Y}}}w(\mathbf{p},\mathbf{q})\,, (3)

where Δ𝒴\Delta_{\mathcal{Y}} is the set of all distributions over 𝒴{\mathcal{Y}}. In words, V(w)\operatorname{V}(w) is the smallest loss that the predictor can ensure by tossing a biased coin (i.e., 𝐩\mathbf{p}) without knowing anything about the true distribution of the labels (i.e., 𝐪\mathbf{q}). Now consider a value z0z\geq 0. It is easy to see that, if zV(w)z\geq\operatorname{V}(w), then there exists a universal (w,z)(w,z)-learner—one that works for all \mathcal{F} and all distributions 𝒟\mathcal{D} over 𝒳{\mathcal{X}}: this is the learner that, ignoring the input sample, returns the randomized predictor h𝐩h_{\mathbf{p}} whose outcome h(x)h(x) is distributed according to 𝐩\mathbf{p} independently for every x𝒳x\in{\mathcal{X}}. Indeed, by Equation 3 the loss of h𝐩h_{\mathbf{p}} is at most V(w)\operatorname{V}(w) regardless of ff\in\mathcal{F} and of the distribution 𝒟\mathcal{D}. Formally, we define:

Definition 2 (Random guess).

A randomized hypothesis is called a random guess if its prediction y𝒴y\in{\mathcal{Y}} is independent of the input point x𝒳x\in{\mathcal{X}}. That is, there exists a probability distribution 𝐩Δ𝒴\mathbf{p}\in\Delta_{\mathcal{Y}} such that h(x)𝐩h(x)\sim\mathbf{p} for every x𝒳x\in{\mathcal{X}}.

We prove that the non-boostability condition zV(w)z\geq\operatorname{V}(w) is tight. That is, we prove that every (w,z)(w,z)-learner with z<V(w)z<\operatorname{V}(w) is boostable to a (w,0)(w,0)-learner. Thus, the value of the game V(w)\operatorname{V}(w) given by Equation 3 is precisely the threshold for boostability. Formally:

Theorem A (Cost-sensitive boosting, binary case).

Let 𝒴={1,+1}{\mathcal{Y}}=\{-1,+1\}. Let w=(w+,w)(0,1]2w=(w_{+},w_{-})\in(0,1]^{2} be a cost. Then, for all z0z\geq 0, exactly one of the following holds.

  • (w,z)(w,z) is boostable: for every 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}, every (w,z)(w,z)-learner is boostable to a (w,0)(w,0)-learner.

  • (w,z)(w,z) is trivial: there exists a random guess hh such that the learner that always outputs hh is a (w,z)(w,z)-learner for every 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}.

Moreover, (w,z)(w,z) is boostable if and only if z<V(w)z<V(w), where V(w)=w+ww++w\operatorname{V}(w)=\frac{w_{+}w_{-}}{w_{+}+w_{-}}.

The proof of A is given in Section 3. In a nutshell, A says that anything that beats the weakest possible learner (a coin) is as powerful as a the strongest possible learner; between these two extremes there is no middle ground. Remarkably, the proof of A is simple and yet it relies on three distinct applications of von Neumann’s Minimax Theorem! (von Neumann, 1928). The first application, which also appears in classical binary boosting, is used to aggregate a distribution over the weak learners. The other two applications are unique to the cost-sensitive setting: one arises in the analysis of the boosting algorithm (first item of A), and the other one in defining the trivial learner (second item of A). These last two applications are both based on the zero-sum game defined above. We also note that the first item of A is obtained constructively by an efficient boosting algorithm, as we detail in Section 3.

2.1.1 Multi-objective losses

Refer to caption
Refer to caption
Refer to caption
Figure 2: In all plots, each point e=(e+,e)e=(e_{+},e_{-}) in the plane corresponds to false-positive and false-negative errors. (Left) Cost-sensitive vs. multi-objective. The leftmost figure corresponds to a cost-sensitive guarantee (w,z)(w,z), where the blue line is given by e,w=z\langle e,w\rangle=z. The shaded area is the feasible region of points ee satisfying the guarantee. The second figure corresponds to a multi-objective guarantee (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}), where r=3r=3 corresponds to 3 different lines, each of the form e,wi=zi\langle e,w_{i}\rangle=z_{i}. The shaded area corresponds to all points satisfying all guarantees, i.e, attaining (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}). (Right) Envelope of the coin-attainable region. The rightmost figure presents many different lines; each line correspond to a guarantee (w,V(w))(w,\operatorname{V}(w)) and is of the form e,w=V(w)\langle e,w\rangle=\operatorname{V}(w). Then, the coin-attainable boundary curve in the case of false-positive and false-negative costs (i.e., 𝒘=(wp,wn)\boldsymbol{w}=(w_{p},w_{n})), is given by e++e=1\sqrt{e_{+}}+\sqrt{e_{-}}=1 (as shown in Section 3). Furthermore, that boundary of the coin-attainable points is the same curve obtained by the envelope of the different lines. This ties between the value V(w)\operatorname{V}(w) to the coin-attainable area C(𝒘)C(\boldsymbol{w}) (see below B for further discussion).

We turn our attention to multi-objective losses. A multi-objective loss is specified by a multi-objective cost, that is, a vector 𝒘=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}) where wi:𝒴2[0,1]2w_{i}:{\mathcal{Y}}^{2}\to[0,1]^{2} is a cost for every i=1,,ri=1,\ldots,r. This allows one to express several criteria by which a learner is evaluated; for example, it allows us to measure separately the false positive rate and the false negative rate, giving a more fine-grained control over the error bounds. The guarantees of a learner can then be expressed by bounding L𝒟wi(h)L_{\mathcal{D}}^{w_{i}}(h) by some zi0z_{i}\geq 0 simultaneously for each i=1,,ri=1,\ldots,r. This leads to the following generalization of Definition 1.

Definition 3 ((𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner).

Let rNr\in\mathbb{N}, let 𝒘=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}) where each wi:𝒴2[0,1]2w_{i}:{\mathcal{Y}}^{2}\to[0,1]^{2} is a cost function, and let 𝒛[0,1]r\boldsymbol{z}\in[0,1]^{r}. An algorithm 𝒜{\mathcal{A}} is a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner for 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} if there is a function m0:(0,1)2Nm_{0}:(0,1)^{2}\rightarrow\mathbb{N} such that for every ff\in\mathcal{F}, every distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}}, and every ϵ,δ(0,1)\epsilon,\delta\in(0,1) what follows holds. If SS is a sample of m0(ϵ,δ)m_{0}(\epsilon,\delta) examples drawn i.i.d. from 𝒟\mathcal{D} and labeled by ff, then 𝒜(S){\mathcal{A}}(S) returns a predictor h:𝒳𝒴h:{\mathcal{X}}\rightarrow{\mathcal{Y}} such that with probability at least 1δ1-\delta:

i=1,,r,L𝒟wi(h)zi+ϵ.\forall\ i=1,\ldots,r,\qquad L^{w_{i}}_{\mathcal{D}}(h)\leq z_{i}+\epsilon\;.

Consider for example 𝒘=(wp,wn)\boldsymbol{w}=(w_{p},w_{n}) with wp=(0010)w_{p}=\bigl{(}\begin{smallmatrix}0&0\\ 1&0\end{smallmatrix}\bigr{)} and wn=(0100)w_{n}=\bigl{(}\begin{smallmatrix}0&1\\ 0&0\end{smallmatrix}\bigr{)}. Then wpw_{p} counts the false positives, wnw_{n} the false negatives. Thus a (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learner for, say, 𝐳=(0.1,0.4)\mathbf{z}=(0.1,0.4) ensures simultaneously a false-positive rate arbitrarily close to 0.10.1 and false-negative arbitrarily close to 0.40.4. See Figure 2 for illustrative geometric examples.

Like for the cost-sensitive case, we address the question of which (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learners are boostable. In other words we ask: given 𝒘\boldsymbol{w}, what is the right “threshold” for 𝐳\mathbf{z}? Equivalently, for which 𝐳\mathbf{z} can we always boost a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner to a (𝒘,𝟎)(\boldsymbol{w},\mathbf{0})-learner, and for which 𝐳\mathbf{z} do there exist (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learners that are not boostable? It turns out that the answer is more nuanced than in the cost-sensitive case of Section 2.1, and yet we can prove the same “all or nothing” phenomenon of A, as depicted in Figure 1.444Note that the light-red area (upper triangle) is attainable by coins (distributions) that are oblivious to the true distribution marginals. For example, it is trivial to attain the point (12,12)(\frac{1}{2},\frac{1}{2}) by a fair coin, and the points (0,1)(0,1) or (1,0)(1,0) by a “degenerate coin” that is simply a constant function which always predicts the same label. In particular, every 𝐳[0,1]r\mathbf{z}\in[0,1]^{r} is either entirely boostable, in the sense that every (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner can be boosted to a (𝒘,𝟎)(\boldsymbol{w},\mathbf{0})-learner, or trivial, in the sense that there exists a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner whose output is always a hypothesis that can be simulated by a (biased) random coin. To this end we introduce the notion of coin attainability. This is the multi-objective equivalent of the condition zV(w)z\geq\operatorname{V}(w) for ww being a scalar cost as in Section 2.1.

Definition 4 (Coin attainability).

Let 𝒘=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}) where wi:𝒴2[0,1]w_{i}:{\mathcal{Y}}^{2}\rightarrow[0,1] is a cost function for every i=1,,ri=1,\ldots,r, and let 𝐳[0,1]r\mathbf{z}\in[0,1]^{r}. We say that (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}) is coin-attainable if

𝐪Δ𝒴,𝐩Δ𝒴,i=1,,r,wi(𝐩,𝐪)zi.\forall\mathbf{q}\in\Delta_{\mathcal{Y}},\ \exists\mathbf{p}\in\Delta_{\mathcal{Y}},\ \forall i=1,\ldots,r,\qquad w_{i}(\mathbf{p},\mathbf{q})\leq z_{i}\;. (4)

The coin-attainable region of 𝒘\boldsymbol{w} is the set C(𝒘)C(\boldsymbol{w}) of all 𝒛\boldsymbol{z} such that (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}) is coin-attainable.

It is not hard to see that, if 𝐳\mathbf{z} is coin-attainable, there exists a universal trivial (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learner for all 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}. Fix indeed any distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}} and any ff\in\mathcal{F}. First, one can learn the marginal 𝐪\mathbf{q} of 𝒟\mathcal{D} over 𝒴{\mathcal{Y}} within, say, total variation distance ϵ\epsilon; then, by the coin-attainability of 𝐳\mathbf{z}, one can return a hypothesis hh such that h(x)𝐩h(x)\sim\mathbf{p}, where 𝐩\mathbf{p} ensures wi(𝐩,𝐪)zi+ϵw_{i}(\mathbf{p},\mathbf{q})\leq z_{i}+\epsilon for all i=1,,ri=1,\ldots,r. Formally, recalling the definition of random guess (Definition 2), we have:

Definition 5 (Trivial learner).

A learning algorithm 𝒜{\mathcal{A}} is trivial if it only outputs random guesses.

Thus, a trivial learner can (at best) learn the best random guess and return it. Clearly, such a learner is in general not boostable to a (𝒘,𝟎)(\boldsymbol{w},\mathbf{0})-learner. The main result of this subsection is:

Theorem B (Multi-objective boosting, binary case).

Let 𝒴={1,1}{\mathcal{Y}}=\{-1,1\}, let 𝐰=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}) where wi:𝒴2[0,1]w_{i}:{\mathcal{Y}}^{2}\rightarrow[0,1] is a cost, and let 𝐳[0,1]r\boldsymbol{z}\in[0,1]^{r}. Then, exactly one of the following holds.

  • (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}) is boostable: for every 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}, every (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner is boostable to a (𝒘,𝟎)(\boldsymbol{w},\mathbf{0})-learner.

  • (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}) is trivial: there exists a trivial learner that is a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner for all 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}.

Moreover, (𝐰,𝐳)(\boldsymbol{w},\mathbf{z}) is boostable if and only if 𝐳C(𝐰)\mathbf{z}\notin C(\boldsymbol{w}).

The proof of B is given in Section 3, and is based on a certain scalarization of the loss amenable to a reduction to A. Let us spend a few words on B. First, note how Definition 4 is reminiscent of the zero-sum game of Section 2.1. The crucial difference, however, is that now the maximizing player (the environment) plays first, and the minimizing player (the predictor) can then choose 𝐩\mathbf{p} as a function of 𝐪\mathbf{q}. As shown in more detail in Section 3.2, this is essential for proving the second item—that is, for coin-attainability to capture exactly the guarantees attainable by trivial learners. For the scalar-valued game of Section 2.1, instead, by von Neumann’s Minimax Theorem the order of the players is immaterial to the value of the game V(w)\operatorname{V}(w).

Second, note that the multi-objective setting is more general than that of a (scalar-valued) zero-sum game, as here the payoff is vector-valued. This can be thought of as a Blackwell approachability game (Blackwell, 1956), a generalization of zero-sum games to vectorial payoffs. Indeed, it turns out that there is a connection between these two notions, in the sense of Blackwell’s theorem: we prove that a certain vectorial bound 𝐳\mathbf{z} can be attained (with respect to a vector-valued loss 𝒘\boldsymbol{w}) if an only if all the “scalarizations” of 𝐳\mathbf{z} can be attained with respect to the corresponding scalarizations of 𝒘\boldsymbol{w}. In fact, we can show that this is captured formally by an equivalence between cost-sensitive learners and multi-objective learners, as explained in Section 2.1.2.

Geometric interpretation of B.

A multi-objective guarantee 𝐳\mathbf{z} with respect to a multi-objective cost 𝒘\boldsymbol{w} can be viewed as a set of linear constraints, wiziw_{i}\leq z_{i}. This identifies a region that is convex and downward closed, see Figure 2. This holds true even in the multiclass setting, when there are more than 22 labels. Similarly, the set of all coin-attainable points C(𝒘)C(\boldsymbol{w}) for the false-negative/false-positive objective 𝒘=(wp,wn)\boldsymbol{w}=(w_{p},w_{n}) can be shown to be convex and upward-closed555A set CC is upward-closed if for all 𝒛C\boldsymbol{z}\in C and 𝒛[0,1]r\boldsymbol{z}^{\prime}\in[0,1]^{r} such that 𝒛𝒛\boldsymbol{z}^{\prime}\geq\boldsymbol{z} coordinate-wise, then 𝒛C\boldsymbol{z}^{\prime}\in C. See Section 3 for further details.. This is illustrated in Figure 1, as well as the set of all boostable points, given in the red and blue areas, respectively.

As shown in Figure 2, the coin-attainable boundary curve in this case, is given by e++e=1\sqrt{e_{+}}+\sqrt{e_{-}}=1 (as shown in Section 3), and it is exactly the same curve obtained by the envelope of the lines of the form e,w=V(w)\langle e,w\rangle=V(w), for any cost w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\rightarrow[0,1]. This ties between the value V(w)\operatorname{V}(w) to the coin-attainable area C(𝒘)C(\boldsymbol{w}).

For a general pair (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}), one can characterize if it is coin-attainable by examining whether its associated convex feasible region is intersected with the set coin-attainable points as depicted in Figure 1. This intuition is also captured in Theorem 7 below.

2.1.2 Cost-sensitive vs. multi-objective losses: a general equivalence

We establish a general, formal connection between cost-sensitive and multi-objective learning. The starting point is the straightforward observation that, by definition, a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner is a (wi,zi)(w_{i},z_{i})-learner for every i=1,,ri=1,\ldots,r. Does the converse hold, too? That is, if for each i=1,,ri=1,\ldots,r there exist a (wi,zi)(w_{i},z_{i})-learner 𝒜i{\mathcal{A}}_{i}, can we conclude that there exists a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner 𝒜{\mathcal{A}}? Perhaps surprisingly, we can show that this holds if we consider all convex combinations of the scalar guarantees (wi,zi)(w_{i},z_{i}). Formally, given a distribution 𝜶=(α1,,αr)Δr\boldsymbol{\alpha}=(\alpha_{1},\ldots,\alpha_{r})\in\Delta_{r}, let w𝜶=i=1rαiwiw_{\boldsymbol{\alpha}}=\sum_{i=1}^{r}\alpha_{i}w_{i} and z𝜶=𝜶,𝒛z_{\boldsymbol{\alpha}}=\langle\boldsymbol{\alpha},\boldsymbol{z}\rangle. It is easy to see that the observation above continues to hold: a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner is a (w𝜶,z𝜶)(w_{\boldsymbol{\alpha}},z_{\boldsymbol{\alpha}})-learner for every such 𝜶\boldsymbol{\alpha}. The next result shows that the converse holds, too: the existence of a (w𝜶,z𝜶)(w_{\boldsymbol{\alpha}},z_{\boldsymbol{\alpha}})-learner for every 𝜶Δr\boldsymbol{\alpha}\in\Delta_{r} implies the existence of a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner.

Theorem 6 (Equivalence of learners).

Let 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}, let 𝐰=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}) where each wi:𝒴2[0,1]w_{i}:{\mathcal{Y}}^{2}\to[0,1] is a cost function, and let 𝐳[0,1]r\boldsymbol{z}\in[0,1]^{r}. Then, the following are equivalent.

  1. 1.

    There exists a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner for \mathcal{F}.

  2. 2.

    For every 𝜶Δr\boldsymbol{\alpha}\in\Delta_{r} there exists a (w𝜶,z𝜶)\big{(}w_{\boldsymbol{\alpha}},z_{\boldsymbol{\alpha}}\big{)}-learner for \mathcal{F}.

Moreover, the equivalence is obtained constructively by an efficient algorithm.

We remark that Theorem 6 holds in the multiclass case, too (i.e., for arbitrary sets 𝒴{\mathcal{Y}}). Thus, the interplay between multi-objective and cost-sensitive learning is a general phenomenon, not limited to binary classifiers. The reduction from multi-objective to cost-sensitive in Theorem 6 resembles the weighted sum scalarization method for multi-objective optimization (Ehrgott, 2005, Chapter 3). However, it is unclear whether this similarity can be used to provide further insights in our problem.

In the same vein as Theorem 6, we prove an equivalence between trivial guarantees of cost-sensitive learners (see A) and trivial guarantees of multi-objective learners (see B).

Theorem 7 (Equivalence of trivial guarantees).

Let 𝐰=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}) where each wi:𝒴2[0,1]w_{i}:{\mathcal{Y}}^{2}\to[0,1] is a cost function, and let 𝐳[0,1]r\boldsymbol{z}\in[0,1]^{r}. Then the following are equivalent.

  1. 1.

    (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}) is coin-attainable.

  2. 2.

    For every 𝜶Δr\boldsymbol{\alpha}\in\Delta_{r} it holds that z𝜶V(w𝜶)z_{\boldsymbol{\alpha}}\geq\operatorname{V}(w_{\boldsymbol{\alpha}}).

As Theorem 6, we note that Theorem 7 also holds in the general multiclass case. The proof of Theorem 7 again uses von Neumann’s Minimax Theorem, based on a two-player game that differs from those described earlier. In this game, the pure strategies of the maximizing player are the different elements of [r][r], representing the various objectives of the cost function. The proofs of all equivalence theorems are given in Section 4.

2.2 Multiclass setting

For binary classification tasks, we get an especially clean and refined mathematical picture, in which boostability is fully determined by a single threshold, i.e., V(w)\operatorname{V}(w). That is, any learner with loss value below it can be boosted, while any value above it is attainable by a trivial non-boostable learner. In the unweighted case, recent work by Brukhim et al. (2023a) demonstrated that a multichotomy emerges, governed by k1k-1 thresholds: 12,23,,k1k\frac{1}{2},\frac{2}{3},\dots,\frac{k-1}{k}, where kk is the number of labels. For each such threshold, any learner achieving a loss below it can be “partially” boosted to the next threshold below it, but not further than that. Here, we extend their results to arbitrary cost functions and prove that a similar phenomenon holds more generally.

In contrast to the unweighted case, the landscape of boostability in the general setting is more complex, consisting of many thresholds corresponding to distinct possible subsets of labels. Interestingly, the thresholds of boostability can be computed precisely, and are all determined by the outcome of zero-sum games, as defined next. We generalize the zero-sum game defined in Equation (2) to the case of k2k\geq 2 labels. It turns out that in the multiclass setting, a more refined zero-sum game is needed, introducing a certain asymmetry between the sets of allowable strategies for each player. In particular, we restrict the set of distributions 𝐪\mathbf{q} which the maximizing player is allowed to choose. For a subset J𝒴J\subseteq{\mathcal{Y}} of labels define the value of the game restricted to the subset JJ as:

VJ(w)min𝐩Δ𝒴max𝐪ΔJw(𝐩,𝐪).\displaystyle\operatorname{V}_{J}(w)\triangleq\min_{\mathbf{p}\in\Delta_{\mathcal{Y}}}\max_{\mathbf{q}\in\Delta_{J}}w(\mathbf{p},\mathbf{q})\;. (5)

Importantly, only the maximizing player is restricted to ΔJ\Delta_{J} while the minimizing player is not. Thus, VJ(w)\operatorname{V}_{J}(w) is the smallest loss one can ensure by predicting with a die given that the correct label is in JJ. Clearly, for every JJJ^{\prime}\subseteq J it holds that VJ(w)VJ(w)\operatorname{V}_{J^{\prime}}(w)\leq\operatorname{V}_{J}(w). Consider the subsets J𝒴J\subseteq{\mathcal{Y}} in nondecreasing order of VJ(w)\operatorname{V}_{J}(w).666For notational easiness we define V(w)=0\operatorname{V}_{\emptyset}(w)=0. We then denote each such cost by vs(w)v_{s}(w) for some s{1,,2k}s\in\{1,\dots,2^{k}\}, so that overall we have,

0=v1(w)<v2(w)<<vτ(w)=V𝒴(w),0=v_{1}(w)<v_{2}(w)<\ldots<v_{\tau}(w)=\operatorname{V}_{{\mathcal{Y}}}(w), (6)

where τ2k\tau\leq 2^{k} depends on the cost function ww. The first equality holds since V{y}(w)=0\operatorname{V}_{\{y\}}(w)=0 for y𝒴y\in{\mathcal{Y}}. Moreover, by a straightforward calculation it can be shown that, for the unweighted case in which ww is simply the 0-1 loss (i.e., w(i,j)=I{ij}w(i,j)=\mathbb{I}\!\left\{i\neq j\right\}), the thresholds VJ(w)\operatorname{V}_{J}(w) specialize to those given by Brukhim et al. (2023a). Concretely, it is described in the following fact.

Fact 8.

Let ww be the 0-1 loss. Then, for every n1n\geq 1 and J(𝒴n)J\in{{\mathcal{Y}}\choose n}, it holds that VJ(w)=11n\operatorname{V}_{J}(w)=1-\frac{1}{n}.

Thus, for the 0-1 loss we have τ=k\tau=k. Note that although for the general case given here there is no closed-form expression for the value of each threshold, it can each be determined by solving for the corresponding linear program representation of the zero-sum game described above.

We can now state our main results on multiclass cost-sensitive boosting, given next.

Theorem C (Cost-sensitive boosting, multiclass case).

Let 𝒴=[k]{\mathcal{Y}}=[k], let w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\rightarrow[0,1] be any cost function, and let z0z\geq 0. Let v1<v2<<vτv_{1}<v_{2}<\cdots<v_{\tau} as defined in Equation 6,777When clear from context we omit ww and denote a threshold by vnv_{n}. and let nn be the largest integer such that vnzv_{n}\leq z. Then, the following claims both hold.

  • (w,z)(w,z) is nn-boostable: for every 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}, every (w,z)(w,z)-learner is boostable to (w,vn)(w,v_{{n}})-learner.

  • (w,z)(w,z) is not (n1)(n-1)-boostable: there exists a class 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} and a (w,z)(w,z)-learner for \mathcal{F} that cannot be boosted to (w,z)(w,z^{\prime})-learner for \mathcal{F}, for any z<vnz^{\prime}<v_{n}.

The proof of C is given in Section 5. In words, C implies that there is a partition of the interval [0,1][0,1], on which the loss values lie, into at most τ\tau sub-intervals or “buckets”, based on vnv_{{n}}. Then, any loss value zz in a certain bucket [vn,vn+1)\bigl{[}v_{{n}},v_{{n+1}}\bigr{)} can be boosted to the lowest value within the bucket, but not below that. We remark that, in fact, the above boostability result is obtained constructively via an efficient algorithm, as described in Section 5.

2.2.1 Multiclass classification, multi-objective loss

We shift our focus to multi-objective losses in the multiclass setting. Recall the notion of (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner of Definition 3. Our goal is to prove a multi-objective counterpart of C; that is, to understand for which 𝒛\boldsymbol{z}^{\prime} a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner can be boosted to a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}^{\prime})-learner. Unlike the scalar cost setting of C, however, the set of those 𝒛\boldsymbol{z}^{\prime} that are reachable by boosting a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner is not a one-dimensional interval in [0,1][0,1], but a subset of [0,1]r[0,1]^{r}.

As a first step, we generalize the notion of coin attainability given by Definition 4. The key ingredient that we shall add is to restrict the distribution of the labels over a given set J𝒴J\subseteq{\mathcal{Y}}. From the point of view of a random guesser, this amounts to knowing that the correct label must be in JJ. We then say that a guarantee is JJ-dice-attainable if a trivial learner can satisfy that guarantee over any distribution supported only over JJ. Formally, it is defined as follows.

Definition 9 (JJ-dice attainability).

Let 𝒴=[k]{\mathcal{Y}}=[k], let 𝒘=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}) where each wi:𝒴2[0,1]w_{i}:{\mathcal{Y}}^{2}\rightarrow[0,1] is a cost function, let 𝐳[0,1]r\mathbf{z}\in[0,1]^{r}, and let J𝒴J\subseteq{\mathcal{Y}}. Then (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}) is JJ-dice-attainable if:

𝐪ΔJ,𝐩Δ𝒴,i=1,,r,wi(𝐩,𝐪)zi.\forall\mathbf{q}\in\Delta_{J},\ \exists\mathbf{p}\in\Delta_{\mathcal{Y}},\ \forall i=1,\ldots,r,\qquad w_{i}(\mathbf{p},\mathbf{q})\leq z_{i}\;. (7)

The JJ-dice-attainable region of 𝒘\boldsymbol{w} is the set DJ(𝒘){𝒛:(𝒘,𝒛) is J-dice-attainable}D_{J}(\boldsymbol{w})\triangleq\left\{\boldsymbol{z}:(\boldsymbol{w},\boldsymbol{z})\text{ is }J\text{-dice-attainable}\right\}.

Intuitively, if (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}) is JJ-dice-attainable, then a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner is not better than a random die over JJ, and therefore should not be able to separate any label in JJ. Using the notion of JJ-dice-attainability, we define a partial ordering over [0,1]r[0,1]^{r}. For every 𝒛,𝒛[0,1]r\boldsymbol{z},\boldsymbol{z}^{\prime}\in[0,1]^{r} write 𝒛𝒘𝒛\boldsymbol{z}\preceq_{\boldsymbol{w}}\boldsymbol{z}^{\prime} if

J𝒴,𝒛DJ(𝒘)𝒛DJ(𝒘),\forall J\subseteq{\mathcal{Y}},\ \ \boldsymbol{z}\in D_{J}(\boldsymbol{w})\implies\boldsymbol{z}^{\prime}\in D_{J}(\boldsymbol{w})\,, (8)

and 𝒛𝒘𝒛\boldsymbol{z}\not\preceq_{\boldsymbol{w}}\boldsymbol{z}^{\prime} otherwise. Intuitively, 𝒛𝒘𝒛\boldsymbol{z}\preceq_{\boldsymbol{w}}\boldsymbol{z}^{\prime} means that, whenever 𝒛\boldsymbol{z} is not better than a die, then 𝒛\boldsymbol{z}^{\prime} is not better than a die either. We prove that this partial ordering precisely characterizes the boostability of 𝒛\boldsymbol{z} to 𝒛\boldsymbol{z}^{\prime}. We can now state our main result for multi-objective multiclass boosting.

Theorem D (Multi-objective boosting, multiclass case).

Let 𝒴=[k]{\mathcal{Y}}=[k], let 𝐰=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}) where each wi:𝒴2[0,1]w_{i}:{\mathcal{Y}}^{2}\rightarrow[0,1] is a cost function, and let 𝐳[0,1]r\boldsymbol{z}\in[0,1]^{r}. Then, the following claims both hold.

  • (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}) is boostable to (w,z)(\boldsymbol{w},\boldsymbol{z}^{\prime}) for every zwz\boldsymbol{z}\preceq_{\boldsymbol{w}}\boldsymbol{z}^{\prime}: for every class 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}, every (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner for \mathcal{F} is boostable to a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}^{\prime})-learner for \mathcal{F}.

  • (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}) is not boostable to (w,z)(\boldsymbol{w},\boldsymbol{z}^{\prime}) for every zwz\boldsymbol{z}\not\preceq_{\boldsymbol{w}}\boldsymbol{z}^{\prime}: there exists a class 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} and a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner for \mathcal{F} that is a trivial learner and, therefore, cannot be boosted to a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}^{\prime})-learner for \mathcal{F}, for any 𝒛𝒘𝒛\boldsymbol{z}\not\preceq_{\boldsymbol{w}}\boldsymbol{z}^{\prime}.

Let us elaborate briefly on D. In the multiclass cost-sensitive setting, where the overall cost is a scalar, the landscape of boostability is determined by a sequence of (totally ordered) scalar thresholds vn[0,1]v_{{n}}\in[0,1]; and those thresholds determine whether a (w,z)(w,z)-learner can be boosted to a (w,z)(w,z^{\prime})-learner, for z<zz^{\prime}<z, as detailed in C. In the multiclass multi-objective setting, those scalar thresholds are replaced by a set of surfaces in [0,1]r[0,1]^{r}. Each such surface corresponds to the boundary of the dice-attainable sets DJ(𝒘)D_{J}(\boldsymbol{w}). Those surface can be seen as thresholds of a higher-dimensional form, separating between different boostability guarantees.

3 Binary Classification

This section considers the classic binary classification setting, where 𝒴={1,+1}{\mathcal{Y}}=\{-1,+1\}. Section 3.1 tackles the simplest case, where the cost ww is scalar, and explores the conditions that make a (w,z)(w,z)-learner boostable, proving A. Section 3.2 considers multi-objective costs, where the cost 𝒘\boldsymbol{w} is vector-valued, for a specific but illustrative choice of 𝒘\boldsymbol{w}, proving a special case of B and of Theorem 7. The reason for restricting 𝒘\boldsymbol{w} to a special form is that this allows us to introduce the key ideas while avoiding heavy notation. Finally, in Section 3.4 we give the full proof of B.

For convenience, as 𝒴={1,+1}{\mathcal{Y}}=\{-1,+1\}, we will sometimes encode a cost function w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\to[0,1] as a 2-by-2 matrix (0ww+0)\bigl{(}\begin{smallmatrix}0&w_{-}\\ w_{+}&0\end{smallmatrix}\bigr{)}, where w+=w(+1,1)w_{+}=w(+1,-1) is the cost of a false positive and w=w(1,+1)w_{-}=w(-1,+1) the cost of a false negative. We will also use the fact, provable through easy calculations, that the value of the game V(w)\operatorname{V}(w) equals 0 if min(w,w+)=0\min(w_{-},w_{+})=0 and ww+w+w+\frac{w_{-}w_{+}}{w_{-}+w_{+}} otherwise. Before moving forward, we also need a measure of the advantage that a (w,z)(w,z)-learner has compared to a coin.

Definition 10 (Margin, binary case).

The margin of a (w,z)(w,z)-learner is γmax{0,V(w)z}\gamma\triangleq\max\{0,\operatorname{V}(w)-z\}.

When w+=w=1w_{+}=w_{-}=1, and thus ww is the standard 0-1 cost, γ\gamma is the usual notion of margin that defines a weak learner in binary classification tasks. In that case, it is well known that to boost a weak learner with margin γ\gamma it is necessary and sufficient to invoke the learner a number of times that scales with 1/γ2\nicefrac{{1}}{{\gamma^{2}}}. This remains true for general ww and zz, in that our boosting algorithm below invokes the (w,z)(w,z)-learner a number of times proportional to 1/γ2\nicefrac{{1}}{{\gamma^{2}}}.

3.1 Boosting a (w,z)(w,z)-learner

The main contribution of this subsection is a boosting algorithm, shown in Algorithm 1, that turns any (w,z)(w,z)-learner 𝒜{\mathcal{A}} with z<V(w)z<\operatorname{V}(w) into a (w,0)(w,0)-learner 𝒜{\mathcal{A}}^{\prime}. We prove:

Theorem 11.

Let 𝒴={1,+1}{\mathcal{Y}}=\{-1,+1\} and 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}. If 𝒜{\mathcal{A}} is a (w,z)(w,z)-learner for \mathcal{F} with margin γ>0\gamma>0, then Algorithm 1 with the choice of parameters of Lemma 12 is a (w,0)(w,0)-learner for \mathcal{F}. Under the same assumptions, Algorithm 1 makes T=O(ln(m)γ2)T=O\bigl{(}\frac{\ln(m)}{\gamma^{2}}\bigr{)} oracle calls to 𝒜{\mathcal{A}}, where mm is the size of the input sample, and has sample complexity m(ϵ,δ)=O~(m0(γ3,δ2T)ln(1/δ)ϵγ2)m(\epsilon,\delta)=\widetilde{O}\Bigl{(}m_{0}\bigl{(}\frac{\gamma}{3},\frac{\delta}{2T}\bigr{)}\cdot\frac{\ln(1/\delta)}{\epsilon\cdot\gamma^{2}}\Bigr{)}, where m0m_{0} is the sample complexity of 𝒜{\mathcal{A}} .

It is immediate to see that Theorem 11 implies the first item of A; for the second item see Lemma 14 below. In the rest of this subsection we prove Theorem 11. We do so in two steps: first we prove that Algorithm 1 returns a hypothesis consistent with the input sample (Lemma 12), and then we show generalization by a compression argument (Lemma 13).

Algorithm 1 Boosting a binary (w,z)(w,z)-learner
0:  sample S=(xi,yi)i=1mS=(x_{i},y_{i})_{i=1}^{m}; (w,z)(w,z)-learner 𝒜{\mathcal{A}}; parameters T,η,m^T,\eta,\widehat{m}
1:  Initialize: D1(i)=1D_{1}(i)=1 for all i=1,,mi=1,...,m.
2:  for t=1,,Tt=1,\ldots,T do
3:     Compute the distribution 𝒟tDtiDt(i)\mathcal{D}_{t}\triangleq\frac{D_{t}}{\sum_{i}D_{t}(i)} over [m][m].
4:     Draw a set StS_{t} of m^\widehat{m} labeled examples i.i.d. from 𝒟t\mathcal{D}_{t} and obtain ht=𝒜(St)h_{t}={\mathcal{A}}(S_{t}).
5:     For every i=1,,mi=1,\ldots,m let:
Dt+1(i)Dt(i)eηw(ht(xi),yi).D_{t+1}(i)\triangleq D_{t}(i)\cdot e^{\eta\cdot w(h_{t}(x_{i}),y_{i})}\;.
6:  end for
7:  Let F:𝒳×𝒴[0,1]F:{\mathcal{X}}\times{\mathcal{Y}}\to[0,1] such that F(x,y)1Tt=1TI{ht(x)=y}F(x,y)\triangleq\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}\!\left\{h_{t}(x)=y\right\} for all x𝒳,y𝒴x\in{\mathcal{X}},y\in{\mathcal{Y}}
8:  return  h^S:𝒳𝒴\widehat{h}_{S}:{\mathcal{X}}\to{\mathcal{Y}} such that, for all x𝒳x\in{\mathcal{X}},
h^S(x){+1if wF(x,1)<w+F(x,+1),1otherwise.\widehat{h}_{S}(x)\triangleq\begin{cases}+1&\text{if }w_{-}\cdot F(x,-1)<w_{+}\cdot F(x,+1)\;,\\ -1&\text{otherwise.}\end{cases}

At a high level, Algorithm 1 follows standard boosting procedures. Given a (w,z)(w,z)-learner 𝒜{\mathcal{A}} and a labeled sample SS, Algorithm 1 uses 𝒜{\mathcal{A}} to obtain a sequence of hypotheses h1,,hTh_{1},\ldots,h_{T} whose average has loss close to zz on each single example in SS. This can be achieved through any regret-minimizing algorithm, but for the sake of concreteness our algorithm uses a modified version of Hedge (Freund and Schapire, 1997) where the update step re-weights examples as a function of ww. Unlike standard approaches, the final prediction on a point xx is not just the majority vote of the hth_{t}’s. Instead, the prediction is constructed by comparing the fraction of hth_{t}’s returning 1-1, weighted by ww_{-}, and the fraction of hth_{t}’s returning +1+1, weighted by w+w_{+}. The smallest weighted fraction wins, and the corresponding label is returned. We show that this ensures correct labeling of the entire sample with the desired probability. Formally, we have:

Lemma 12.

Let 𝒴={1,+1}{\mathcal{Y}}=\{-1,+1\} and let w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\to[0,1] be a cost function. Let 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}, and let 𝒜{\mathcal{A}} be a (w,z)(w,z)-learner for \mathcal{F} with margin γ>0\gamma>0 and sample complexity m0m_{0}. Fix any ff\in\mathcal{F}, let 𝒟\mathcal{D} be any distribution over 𝒳{\mathcal{X}}, and let S={(x1,y1),,(xm,ym)}S=\{(x_{1},y_{1}),\ldots,(x_{m},y_{m})\} be a multiset of mm examples given by i.i.d. points x1,,xm𝒟x_{1},\dots,x_{m}\sim\mathcal{D} labeled by ff. Finally, fix any δ(0,1)\delta\in(0,1). If Algorithm 1 is given SS, oracle access to 𝒜{\mathcal{A}}, and parameters T=18ln(m)γ2T=\left\lceil\frac{18\ln(m)}{\gamma^{2}}\right\rceil, η=2ln(m)T\eta=\sqrt{\frac{2\ln(m)}{T}}, and m^=m0(γ3,δT)\widehat{m}=m_{0}\bigl{(}\frac{\gamma}{3},\frac{\delta}{T}\bigr{)}, then Algorithm 1 makes TT calls to 𝒜{\mathcal{A}} and returns h^S:𝒳𝒴\widehat{h}_{S}:{\mathcal{X}}\to{\mathcal{Y}} such that with probability at least 1δ1-\delta:

h^S(xi)=yii=1,,m.\widehat{h}_{S}(x_{i})=y_{i}\qquad\forall i=1,\ldots,m\;.
Proof.

Fix any i[m]i\in[m]. Given that maxt,iw(ht(xi),yi)1\max_{t,i}w(h_{t}(x_{i}),y_{i})\leq 1 and that η1\eta\leq 1, the standard analysis of Hedge (Freund and Schapire, 1997) shows that:

t=1Tw(ht(xi),yi)\displaystyle\sum_{t=1}^{T}w(h_{t}(x_{i}),y_{i}) ln(m)η+η2T+t=1TEj𝒟t[w(ht(xj),yj)].\displaystyle\leq\frac{\ln(m)}{\eta}+\frac{\eta}{2}T+\sum_{t=1}^{T}{\mathbb E}_{j\sim\mathcal{D}_{t}}\Bigl{[}w(h_{t}(x_{j}),y_{j})\Bigr{]}\;. (9)

Dividing both sides by TT, and using the definitions of η\eta and TT at the right-hand side, yields:

1Tt=1Tw(ht(xi),yi)\displaystyle\frac{1}{T}\sum_{t=1}^{T}w(h_{t}(x_{i}),y_{i}) γ3+1Tt=1TEj𝒟t[w(ht(xj),yj)].\displaystyle\leq\frac{\gamma}{3}+\frac{1}{T}\sum_{t=1}^{T}{\mathbb E}_{j\sim\mathcal{D}_{t}}\Bigl{[}w(h_{t}(x_{j}),y_{j})\Bigr{]}\;. (10)

For every t=1,,Tt=1,\ldots,T, since ht=𝒜(St)h_{t}={\mathcal{A}}(S_{t}), since 𝒜{\mathcal{A}} is a (w,z)(w,z)-learner for \mathcal{F}, and by the choice of m^\widehat{m}:

P(Ej𝒟t[w(ht(xj),yj)]>z+γ3)δT.\displaystyle\operatorname{\mathbb{P}}\left({\mathbb E}_{j\sim\mathcal{D}_{t}}\Bigl{[}w(h_{t}(x_{j}),y_{j})\Bigr{]}>z+\frac{\gamma}{3}\right)\leq\frac{\delta}{T}\;. (11)

By averaging over TT and taking a union bound over t=1,,Tt=1,\ldots,T, we have that, with probability at least 1δ1-\delta,

1Tt=1TEj𝒟t[w(ht(xj),yj)]z+γ3.\displaystyle\frac{1}{T}\sum_{t=1}^{T}{\mathbb E}_{j\sim\mathcal{D}_{t}}\Bigl{[}w(h_{t}(x_{j}),y_{j})\Bigr{]}\leq z+\frac{\gamma}{3}\;. (12)

Now suppose Equation 12 holds. We shall prove that h^S(xi)=yi\widehat{h}_{S}(x_{i})=y_{i} for every i[m]i\in[m]. First, Equation 10 together with the definition of γ\gamma implies that for every i[m]i\in[m]:

1Tt=1Tw(ht(xi),yi)\displaystyle\frac{1}{T}\sum_{t=1}^{T}w(h_{t}(x_{i}),y_{i}) 23γ+zV(w)γ3<V(w).\displaystyle\leq\frac{2}{3}\gamma+z\leq\operatorname{V}(w)-\frac{\gamma}{3}<\operatorname{V}(w)\;. (13)

Now recall the definition of F:𝒳×𝒴[0,1]F:{\mathcal{X}}\times{\mathcal{Y}}\to[0,1] from Algorithm 1:

F(x,y)=1Tt=1TI{ht(x)=y}x𝒳,y𝒴.\displaystyle F(x,y)=\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}\!\left\{h_{t}(x)=y\right\}\qquad\forall x\in{\mathcal{X}},y\in{\mathcal{Y}}\;. (14)

Clearly, for any x𝒳x\in{\mathcal{X}}, we have that F(x,1)+F(x,+1)=1F(x,-1)+F(x,+1)=1. Now consider any i[m]i\in[m]. If yi=+1y_{i}=+1 then by definition of ww and FF:

1Tt=1Tw(ht(xi),yi)=w1Tt=1TI{ht(xi)=1}=wF(xi,1).\displaystyle\frac{1}{T}\sum_{t=1}^{T}w(h_{t}(x_{i}),y_{i})=w_{-}\cdot\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}\!\left\{h_{t}(x_{i})=-1\right\}=w_{-}\cdot F(x_{i},-1)\;. (15)

By Equation 13, then, wF(xi,1)<V(w)w_{-}\cdot F(x_{i},-1)<\operatorname{V}(w). By a symmetric argument w+F(xi,+1)<V(w)w_{+}\cdot F(x_{i},+1)<\operatorname{V}(w) if yi=1y_{i}=-1. It remains to show that for every x𝒳x\in{\mathcal{X}} exactly one among wF(x,1)<V(w)w_{-}\cdot F(x,-1)<\operatorname{V}(w) and w+F(x,+1)<V(w)w_{+}\cdot F(x,+1)<\operatorname{V}(w) holds. As we have just shown, at least one of the inequalities holds, depending on the value of yiy_{i}. Now assume towards contradiction that both hold. Recall that V(w)=ww+w+w+\operatorname{V}(w)=\frac{w_{-}w_{+}}{w_{-}+w_{+}}. Note that γ>0\gamma>0 implies V(w)>0\operatorname{V}(w)>0 and thus w,w+>0w_{-},w_{+}>0. Then we have:

1=F(x,1)+F(x,+1)<V(w)w+V(w)w+=w+w+w++ww+w+=1.\displaystyle 1=F(x,-1)+F(x,+1)<\frac{\operatorname{V}(w)}{w_{-}}+\frac{\operatorname{V}(w)}{w_{+}}=\frac{w_{+}}{w_{-}+w_{+}}+\frac{w_{-}}{w_{-}+w_{+}}=1\;. (16)

By definition of h^S\widehat{h}_{S} then h^S(xi)=yi\widehat{h}_{S}(x_{i})=y_{i} for all i[m]i\in[m], concluding the proof. ∎

The next lemma shows that, if the size mm of the sample in Lemma 12 is large enough, then the hypothesis returned by Algorithm 1 has small generalization error. It is immediate to see that, together with Lemma 12, this implies Theorem 11.

Lemma 13.

Assume the hypotheses of Lemma 12. For any ϵ,δ(0,1)\epsilon,\delta\in(0,1), if the size mm of the sample given to Algorithm 1 satisfies

mln(2/δ)ϵ+m0(γ3,δ2T)T(1+ln(m)ϵ),m\geq\frac{\ln(2/\delta)}{\epsilon}+m_{0}\left(\frac{\gamma}{3},\frac{\delta}{2T}\right)\cdot T\cdot\left(1+\frac{\ln(m)}{\epsilon}\right)\;,

then with probability at least 1δ1-\delta the output h^S\widehat{h}_{S} of Algorithm 1 satisfies L𝒟w(h^S)ϵL_{\mathcal{D}}^{w}(\widehat{h}_{S})\leq\epsilon. Therefore, Algorithm 1 is a (w,0)(w,0)-learner for \mathcal{F} with sample complexity m(ϵ,δ)=O~(m0(γ3,δ2T)ln(1/δ)ϵγ2)m(\epsilon,\delta)=\widetilde{O}\Bigl{(}m_{0}\bigl{(}\frac{\gamma}{3},\frac{\delta}{2T}\bigr{)}\cdot\frac{\ln(1/\delta)}{\epsilon\cdot\gamma^{2}}\Bigr{)}.

Proof.

First, we apply Lemma 12 with δ/2\delta/2 in place of δ\delta; we obtain that, with probability at least 1δ/21-\delta/2, the hypothesis h^S\widehat{h}_{S} is consistent with the labeled sample SS. Next, we apply a standard compression-based generalization argument (see e.g., Theorem 2.8 from Schapire and Freund (2012)). To this end, note that one can construct a compression scheme for h^S\widehat{h}_{S} of size κ\kappa equal to the total size of the samples on which 𝒜{\mathcal{A}} is invoked, that is, κ=m^T\kappa=\widehat{m}\cdot T. By Theorem 2.8 from Schapire and Freund (2012) with δ/2\delta/2 in place of δ\delta we get that, with probability at least 1δ/21-\delta/2,

L𝒟w(h^S)κln(m)+ln(2/δ)mκ.\displaystyle L_{\mathcal{D}}^{w}(\widehat{h}_{S})\leq\frac{\kappa\ln(m)+\ln(2/\delta)}{m-\kappa}\;. (17)

Straightforward calculations show that the right-hand side is at most ϵ\epsilon whenever:

m\displaystyle m ln(2/δ)ϵ+κ(1+ln(m)ϵ)\displaystyle\geq\frac{\ln(2/\delta)}{\epsilon}+\kappa\cdot\left(1+\frac{\ln(m)}{\epsilon}\right) (18)
=ln(2/δ)ϵ+m0(γ3,δ2T)T(1+ln(m)ϵ).\displaystyle=\frac{\ln(2/\delta)}{\epsilon}+m_{0}\left(\frac{\gamma}{3},\frac{\delta}{2T}\right)\cdot T\cdot\left(1+\frac{\ln(m)}{\epsilon}\right)\;. (19)

A union bound completes the first part of the claim, and shows that Algorithm 1 is a (w,0)(w,0)-learner for \mathcal{F}. The condition on the sample complexity of Algorithm 1 then follows immediately by definition of T=O(ln(m)γ2)T=O\bigl{(}\frac{\ln(m)}{\gamma^{2}}\bigr{)}. ∎

Remark on adaptive boosting.

Traditional boosting algorithms, such as the well-known AdaBoost (Schapire and Freund, 2012), do not assume prior knowledge of the margin parameter γ\gamma and adapt to it dynamically during execution. In contrast, the boosting algorithm in Algorithm 1, along with our broader approach, requires an initial estimate of γ\gamma as input. If this estimate is too large, the algorithm may fail. However, this issue can be addressed through a straightforward binary search procedure: we iteratively adjust our guess, halving it when necessary based on observed outcomes. This adds only a logarithmic overhead of O(ln(1/γ))O(\ln(1/\gamma)) to the runtime, without affecting sample complexity bounds.

We conclude this subsection with a proof of the second part of A.

Lemma 14.

Let 𝒳{\mathcal{X}} be any domain, let 𝒴={1,+1}{\mathcal{Y}}=\{-1,+1\}, and let w=(w+,w)(0,1]2w=(w_{+},w_{-})\in(0,1]^{2} be a cost over 𝒴{\mathcal{Y}}. If zV(w)z\geq\operatorname{V}(w), then there is an algorithm that is a (w,z)(w,z)-learner for every 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}. Moreover, for some domain 𝒳{\mathcal{X}} and some 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} the said learner cannot be boosted to a (w,z)(w,z^{\prime})-learner for any z<V(w)z^{\prime}<\operatorname{V}(w).

Proof.

By definition of V(w)\operatorname{V}(w) there exists a distribution 𝐩Δ𝒴\mathbf{p}\in\Delta_{\mathcal{Y}} such that w(𝐩,𝐪)V(w)w(\mathbf{p},\mathbf{q})\leq\operatorname{V}(w) for every 𝐪Δ𝒴\mathbf{q}\in\Delta_{\mathcal{Y}} over 𝒴{\mathcal{Y}}. Consider then the algorithm that ignores the input and returns a randomized predictor h𝐩h_{\mathbf{p}} such that h𝐩(x)𝐩h_{\mathbf{p}}(x)\sim\mathbf{p} independently for every given 𝐱𝒳\mathbf{x}\in{\mathcal{X}}. Clearly, for every distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}},

L𝒟w(h𝐩)w(𝐩,𝐪)V(w),\displaystyle L_{\mathcal{D}}^{w}(h_{\mathbf{p}})\leq w(\mathbf{p},\mathbf{q})\leq\operatorname{V}(w)\;, (20)

where 𝐪\mathbf{q} is the marginal of 𝒟\mathcal{D} over 𝒴{\mathcal{Y}}. This proves that the algorithm is a (w,z)(w,z)-learner for every 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}. Let us now show that for some 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} cannot be boosted to a (w,z)(w,z^{\prime})-learner for any z<V(w)z^{\prime}<\operatorname{V}(w). If V(w)=0\operatorname{V}(w)=0 then this is trivial, as w0w\geq 0; thus we can assume V(w)>0\operatorname{V}(w)>0. Suppose then that a (w,z)(w,z^{\prime})-learner for \mathcal{F} exists, where z<V(w)z^{\prime}<\operatorname{V}(w). By item (1), it follows that a (w,0)(w,0)-learner for \mathcal{F} exists. Recall that V(w)=ww+w+w+\operatorname{V}(w)=\frac{w_{-}w_{+}}{w_{-}+w_{+}}. Since we assumed V(w)>0\operatorname{V}(w)>0, it follows that w,w+>0w_{-},w_{+}>0. Then, a (w,0)(w,0)-learner is a strong learner in the PAC sense. Now choose \mathcal{F} to be any non-learnable class; for instance, let =𝒴N\mathcal{F}={\mathcal{Y}}{N} (see Shalev-Shwartz and Ben-David (2014, Theorem 5.1)). Then, no strong PAC learner and therefore no (w,0)(w,0)-learner exists for \mathcal{F}. Hence a (w,z)(w,z^{\prime})-learner for \mathcal{F} with z<V(w)z^{\prime}<\operatorname{V}(w) does not exist, too. ∎

3.2 Multi-objective losses: boosting a (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learner

This section considers the boosting problem for learners that satisfy multi-objective costs, as captured by the notion of (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learners (Definition 3). Our main question is then:

When can a (𝐰,𝐳)(\boldsymbol{w},\mathbf{z})-learner be boosted to a (𝐰,𝟎)(\boldsymbol{w},\mathbf{0})-learner?

For the scalar case, we showed in Section 3.1 that the threshold to boost (w,z)(w,z)-learners is the value of the game V(w)\operatorname{V}(w), which is the best bound on ww one can achieve by just tossing a coin—independently of the example xx at hand, and even of the specific distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}}. One may thus guess that the same characterization holds for boosting (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learners. It turns out that this guess fails, as we show below. We thus adopt a different and stronger notion of “trivial learner”, obtained by exchanging the order of the players. It turns out that this is the right notion, in the sense that it is equivalent to being non-boostable.

For the sake of simplicity we assume a specific 𝒘\boldsymbol{w} that we call population-driven cost. This will help conveying the key ideas, without having to deal with the technical details of the more general version of our results (Section 5). The population-driven cost is the one that counts, separately, false negatives and false positives. More precisely, 𝒘=(w,w+)\boldsymbol{w}=(w_{-},w_{+}) where ww_{-} and w+w_{+} are now two cost functions that for all i,j𝒴i,j\in{\mathcal{Y}} satisfy:

w(i,j)\displaystyle w_{-}(i,j) =I{i=1j=+1},\displaystyle=\mathbb{I}\!\left\{i=-1\land j=+1\right\}\;, (21)
w+(i,j)\displaystyle w_{+}(i,j) =I{i=+1j=1}.\displaystyle=\mathbb{I}\!\left\{i=+1\land j=-1\right\}\;. (22)

For 𝐳=(z,z+)R02\mathbf{z}=(z_{-},z_{+})\in\mathbb{R}_{\geq 0}^{2}, a (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learner then ensures simultaneously a false-negative rate arbitrarily close to zz_{-} and a false-positive rate arbitrarily close to z+z_{+}.

3.2.1 Coin attainability

Let 𝒘\boldsymbol{w} as above, and let 𝐳R02\mathbf{z}\in\mathbb{R}_{\geq 0}^{2}. As a first attempt at answering our question above, we guess that 𝐳\mathbf{z} is non-boostable when it is attainable by a random coin toss in the same way as for scalar costs (Section 3.1). That is, we may say that 𝐳\mathbf{z} is trivially attainable w.r.t. 𝒘\boldsymbol{w} if there exists 𝐩Δ𝒴\mathbf{p}\in\Delta_{\mathcal{Y}} such that for every 𝐪Δ𝒴\mathbf{q}\in\Delta_{\mathcal{Y}} we have 𝒘(𝐩,𝐪)𝐳\boldsymbol{w}(\mathbf{p},\mathbf{q})\leq\mathbf{z}. Clearly, for any such 𝐳\mathbf{z} there would exist a (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learner that is not boostable for some class \mathcal{F}. If we could also prove the converse, that any 𝐳\mathbf{z} not trivially attainable is boostable, then we would have determined that trivial attainability is equivalent non-boostability.

Unfortunately, this turns out to be false. To see why, fix any distribution 𝐩=(p,p+)\mathbf{p}=(p_{-},p_{+}). Define then 𝐪=(q,q+)\mathbf{q}=(q_{-},q_{+}) as follows: if p1/2p_{-}\geq\nicefrac{{1}}{{2}} then q=0q_{-}=0, otherwise q+=0q_{+}=0. It follows immediately that either p(1q)1/2p_{-}(1-q_{-})\geq\nicefrac{{1}}{{2}} or p+(1q+)1/2p_{+}(1-q_{+})\geq\nicefrac{{1}}{{2}}. That is, 𝒘(𝐩,𝐪)\boldsymbol{w}(\mathbf{p},\mathbf{q}) is at least 1/2\nicefrac{{1}}{{2}} in at least one coordinate. As a consequence, no 𝐳<(1/2,1/2)\mathbf{z}<(\nicefrac{{1}}{{2}},\nicefrac{{1}}{{2}}) is trivially attainable. Thus, if our guess is correct, it should be the case that any (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learner with 𝐳<(1/2,1/2)\mathbf{z}<(\nicefrac{{1}}{{2}},\nicefrac{{1}}{{2}}) is boostable.

We can easily show that that is not the case. Let 𝐳=(1/4,1/4)\mathbf{z}=(\nicefrac{{1}}{{4}},\nicefrac{{1}}{{4}}) and consider the following (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learner. Upon receiving a sufficiently large labeled sample from some distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}}, the learner computes an estimate 𝐪^=(q^,q^+)\hat{\mathbf{q}}=(\hat{q}_{-},\hat{q}_{+}) of the marginal of 𝒟\mathcal{D} over 𝒴{\mathcal{Y}}. The learner then computes 𝐩=(p,p+)\mathbf{p}=(p_{-},p_{+}) that satisfies:

p(1q^)14andp+(1q^+)14.\displaystyle p_{-}\cdot(1-\hat{q}_{-})\leq\frac{1}{4}\quad\text{and}\quad p_{+}\cdot(1-\hat{q}_{+})\leq\frac{1}{4}\;. (23)

It is not hard to show that such a 𝐩\mathbf{p} always exists. The learner thus outputs hh such that h(x)𝐩h(x)\sim\mathbf{p} for all x𝒳x\in{\mathcal{X}} independently. As the number of samples received by the learner increases, the estimate 𝐪^\hat{\mathbf{q}} approaches 𝐪\mathbf{q}, and thus the loss of the learner approaches 𝐳\mathbf{z}. Clearly, however, such a learner cannot be boosted, as its predictions are independent of the example xx at hand. We conclude that there exist 𝐳\mathbf{z} that are not trivially attainable and yet are not boostable.

The example above suggest a fix to our notion of “non-boostable learner”. The fix consists in considering again prediction by a coin 𝐩\mathbf{p}, where, however, we allow 𝐩\mathbf{p} to be a function of 𝐪\mathbf{q}. This leads to the following definition:

Definition 15.

Let 𝒘=(w,w+)\boldsymbol{w}=(w_{-},w_{+}) be the population-driven cost and let 𝐳=(z,z+)R02\mathbf{z}=(z_{-},z_{+})\in\mathbb{R}_{\geq 0}^{2}. Then, 𝐳\mathbf{z} is coin-attainable w.r.t. 𝒘\boldsymbol{w} if for every distribution 𝐪=(q,q+)\mathbf{q}=(q_{-},q_{+}) there exists a distribution 𝐩=(p,p+)\mathbf{p}=(p_{-},p_{+}) such that:

w(𝐩,𝐪)=p(1q)zandw+(𝐩,𝐪)=p+(1q+)z+.w_{-}(\mathbf{p},\mathbf{q})=p_{-}\cdot(1-q_{-})\leq z_{-}\quad\text{and}\quad w_{+}(\mathbf{p},\mathbf{q})=p_{+}\cdot(1-q_{+})\leq z_{+}\;.

Note how Definition 15 mirrors the game of Section 3.1, but with a reverted order of the players: here, the predictor plays second. It should be clear that, if 𝐳\mathbf{z} is coin-attainable, then there exists a (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learner that is not boostable. That is the learner described above, which gets an estimate 𝐪^\hat{\mathbf{q}} of 𝐪\mathbf{q} from the sample, and then returns hh such that h(x)𝐩h(x)\sim\mathbf{p}, where 𝐩\mathbf{p} satisfies the condition in Definition 15 with respect to 𝐪^\hat{\mathbf{q}}. What is not clear is whether the converse holds, too: is a (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learner boostable whenever 𝐳\mathbf{z} is not coin-attainable? It turns out that this is the case, as the next subsection shows.

3.2.2 Coin-attainability equals non-boostability

Let us try to boost a (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learner 𝒜{\mathcal{A}}. We can do so by reduction to the scalar case of Section 3.1. To this end choose any 𝜶=(α1,α2)Δ𝒴\boldsymbol{\alpha}=(\alpha_{1},\alpha_{2})\in\Delta_{\mathcal{Y}}. Define 𝜶𝒘:𝒴2R0\boldsymbol{\alpha}\cdot\boldsymbol{w}:{\mathcal{Y}}^{2}\to\mathbb{R}_{\geq 0} by:

(𝜶𝒘)(i,j)=α1w(i,j)+α2w+(i,j)i,j𝒴.\displaystyle(\boldsymbol{\alpha}\cdot\boldsymbol{w})(i,j)=\alpha_{1}\cdot w_{-}(i,j)+\alpha_{2}\cdot w_{+}(i,j)\qquad\forall i,j\in{\mathcal{Y}}\;. (24)

For conciseness we shall write w=𝜶𝒘w=\boldsymbol{\alpha}\cdot\boldsymbol{w}. In words, ww is the convex combination of w,w+w_{-},w_{+} according to 𝜶\boldsymbol{\alpha}, and therefore w=(0α1α20)w=\bigl{(}\begin{smallmatrix}0&\alpha_{1}\\ \alpha_{2}&0\end{smallmatrix}\bigr{)}. Therefore w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\to[0,1] and w(i,i)=0w(i,i)=0 for each i𝒴i\in{\mathcal{Y}}. Now let z=𝜶𝐳z=\boldsymbol{\alpha}\cdot\mathbf{z}. Now we make the following crucial observation: since 𝒜{\mathcal{A}} is a (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learner, then it is also a (w,z)(w,z)-learner. This is immediate to see from the definition of (𝒘,𝐳)(\boldsymbol{w},\mathbf{z}) and the definition of ww. It follows that, by the results of Section 3.1, 𝒜{\mathcal{A}} can be boosted if z<V(w)z<\operatorname{V}(w). Summarizing, we can boost 𝒜{\mathcal{A}} whenever:

𝜶𝐳<V(𝜶𝒘).\displaystyle\boldsymbol{\alpha}\cdot\mathbf{z}<\operatorname{V}(\boldsymbol{\alpha}\cdot\boldsymbol{w})\;. (25)

That is, if there is any 𝜶Δ𝒴\boldsymbol{\alpha}\in\Delta_{\mathcal{Y}} satisfying Equation 25, then 𝒜{\mathcal{A}} can be boosted to a strong learner. Moreover, if that is the case, then one can prove that one can ensure 𝜶>𝟎\boldsymbol{\alpha}>\mathbf{0}. Then, one can easily see that 𝒜{\mathcal{A}}^{*} is a (𝒘,𝟎)(\boldsymbol{w},\mathbf{0})-learner. Our main question thus becomes:

Is it true that, if 𝐳\mathbf{z} is not coin-attainable, then there is 𝛂Δ𝒴\boldsymbol{\alpha}\in\Delta_{\mathcal{Y}} such that 𝛂𝐳<V(𝛂𝐰)\boldsymbol{\alpha}\cdot\mathbf{z}<\operatorname{V}(\boldsymbol{\alpha}\cdot\boldsymbol{w})?

We show that this is indeed the case, and therefore 𝐳\mathbf{z} is coin-attainable if and only if 𝜶𝐳V(𝜶𝒘)\boldsymbol{\alpha}\cdot\mathbf{z}\geq\operatorname{V}(\boldsymbol{\alpha}\cdot\boldsymbol{w}) for all 𝜶Δ𝒴\boldsymbol{\alpha}\in\Delta_{\mathcal{Y}}. Equivalently, all (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learners are boostable if and only if 𝜶𝐳<V(𝜶𝒘)\boldsymbol{\alpha}\cdot\mathbf{z}<\operatorname{V}(\boldsymbol{\alpha}\cdot\boldsymbol{w}) for some 𝜶Δ𝒴\boldsymbol{\alpha}\in\Delta_{\mathcal{Y}}. As a consequence, every 𝐳\mathbf{z} is either coin-attainable or boostable. Formally, we have:

Theorem 16.

Let 𝒴={1,+1}{\mathcal{Y}}=\{-1,+1\}, let 𝐰\boldsymbol{w} be the population-driven cost, and let 𝐳=(z,z+)R02𝟎\mathbf{z}=(z_{-},z_{+})\in\mathbb{R}_{\geq 0}^{2}\setminus\mathbf{0}. Then 𝐳\mathbf{z} is coin-attainable w.r.t. 𝐰\boldsymbol{w} if and only if 𝛂𝐳V(𝛂𝐰)\boldsymbol{\alpha}\cdot\mathbf{z}\geq\operatorname{V}(\boldsymbol{\alpha}\cdot\boldsymbol{w}), where:

𝜶=(z+z+z+,zz+z+)Δ𝒴.\displaystyle\boldsymbol{\alpha}=\left(\frac{\sqrt{z_{+}}}{\sqrt{z_{-}}+\sqrt{z_{+}}},\frac{\sqrt{z_{-}}}{\sqrt{z_{-}}+\sqrt{z_{+}}}\right)\in\Delta_{\mathcal{Y}}\,. (26)

Equivalently, 𝐳\mathbf{z} is coin-attainable w.r.t. 𝐰\boldsymbol{w} if and only if z+z+1\sqrt{z_{-}}+\sqrt{z_{+}}\geq 1.

Proof.

First of all, by straightforward calculations we determine:

𝜶𝐳\displaystyle\boldsymbol{\alpha}\cdot\mathbf{z} =zz+,\displaystyle=\sqrt{z_{-}}\sqrt{z_{+}}\;, (27)
V(𝜶𝒘)\displaystyle\operatorname{V}\left(\boldsymbol{\alpha}\cdot\boldsymbol{w}\right) =V(0α1α20)=zz+z+z+.\displaystyle=\operatorname{V}\left(\begin{smallmatrix}0&\alpha_{1}\\ \alpha_{2}&0\end{smallmatrix}\right)=\frac{\sqrt{z_{-}}\sqrt{z_{+}}}{\sqrt{z_{-}}+\sqrt{z_{+}}}\;. (28)

This proves that 𝜶𝐳V(𝜶𝒘)\boldsymbol{\alpha}\cdot\mathbf{z}\geq\operatorname{V}(\boldsymbol{\alpha}\cdot\boldsymbol{w}) if and only if z+z+1\sqrt{z_{-}}+\sqrt{z_{+}}\geq 1.

Now, for the “only if” direction, by von Neumann’s minimax theorem and by definition of 𝜶\boldsymbol{\alpha},

V(𝜶𝒘)\displaystyle V(\boldsymbol{\alpha}\cdot\boldsymbol{w}) =max𝐪Δ𝒴min𝐩Δ𝒴z+z+z+p(1q)+zz+z+p+(1q+)\displaystyle=\max_{\mathbf{q}\in\Delta_{\mathcal{Y}}}\min_{\mathbf{p}\in\Delta_{\mathcal{Y}}}\frac{\sqrt{z_{+}}}{\sqrt{z_{-}}+\sqrt{z_{+}}}\cdot p_{-}\cdot(1-q_{-})+\frac{\sqrt{z_{-}}}{\sqrt{z_{-}}+\sqrt{z_{+}}}\cdot p_{+}\cdot(1-q_{+}) (29)
z+zz+z++zz+z+z+=zz+=𝜶𝐳,\displaystyle\leq\frac{\sqrt{z_{+}}\cdot z_{-}}{\sqrt{z_{-}}+\sqrt{z_{+}}}+\frac{\sqrt{z_{-}}\cdot z_{+}}{\sqrt{z_{-}}+\sqrt{z_{+}}}=\sqrt{z_{-}}\sqrt{z_{+}}=\boldsymbol{\alpha}\cdot\mathbf{z}\;, (30)

where the inequality holds as 𝐳\mathbf{z} is coin-attainable.

For the “if” direction, suppose 𝜶𝐳V(𝜶𝒘)\boldsymbol{\alpha}\cdot\mathbf{z}\geq\operatorname{V}(\boldsymbol{\alpha}\cdot\boldsymbol{w}), hence z+z+1\sqrt{z_{-}}+\sqrt{z_{+}}\geq 1. We claim that this implies 𝐳\mathbf{z} is coin-attainable. Fix indeed any 𝐪=(q,q+)Δ𝒴\mathbf{q}=(q_{-},q_{+})\in\Delta_{\mathcal{Y}}. We consider three cases: (1) q+zq_{+}\leq z_{-}, (2) qz+q_{-}\leq z_{+}, and (3) q+>zq_{+}>z_{-} and q>z+q_{-}>z_{+}. If q+zq_{+}\leq z_{-} then choose 𝐩=(1,0)\mathbf{p}=(1,0), otherwise if qz+q_{-}\leq z_{+} choose 𝐩=(0,1)\mathbf{p}=(0,1). It is immediate to see that this implies p(1q)zp_{-}\cdot(1-q_{-})\leq z_{-} and p+(1q+)z+p_{+}\cdot(1-q_{+})\leq z_{+}. Suppose then q+>zq_{+}>z_{-} and q>z+q_{-}>z_{+}, which implies 𝐪>𝟎\mathbf{q}>\mathbf{0}. We claim that:

zq++z+q1.\displaystyle\frac{z_{-}}{q_{+}}+\frac{z_{+}}{q_{-}}\geq 1\;. (31)

Indeed, letting a=za=z_{-}, b=z+b=z_{+}, x=q+x=q_{+}, multiplying both sides by x(1x)x(1-x), and rearranging yields:

a(1x)+bxx(1x)0,\displaystyle a\cdot(1-x)+b\cdot x-x(1-x)\geq 0\;, (32)

which can be easily checked to hold for all xx whenever a+b1\sqrt{a}+\sqrt{b}\geq 1. Equation 31 then implies the existence of p,p+p_{-},p_{+} such that p+p+=1p_{-}+p_{+}=1 and that:

p(1q)\displaystyle p_{-}\cdot(1-q_{-}) zq+(1q)=z,\displaystyle\leq\frac{z_{-}}{q_{+}}\cdot(1-q_{-})=z_{-}\;, (33)
p+(1q+)\displaystyle p_{+}\cdot(1-q_{+}) z+q(1q+)=z+.\displaystyle\leq\frac{z_{+}}{q_{-}}\cdot(1-q_{+})=z_{+}\;. (34)

We conclude that 𝐳\mathbf{z} is coin-attainable. ∎

3.3 A duality, and a geometric interpretation

Refer to caption
Figure 3: Duality in a picture. For a multi-objective cost 𝒘:𝒴2R02\boldsymbol{w}:{\mathcal{Y}}^{2}\to\mathbb{R}_{\geq 0}^{2}, the set C(𝒘)C(\boldsymbol{w}) of all coin-attainable vectors 𝐳\mathbf{z} is the intersection of all halfspaces in the form H(𝜶)={𝒙R2:𝜶𝒙V(𝜶𝒘)}H(\boldsymbol{\alpha})=\{{\boldsymbol{x}}\in\mathbb{R}^{2}:\boldsymbol{\alpha}\cdot{\boldsymbol{x}}\geq\operatorname{V}(\boldsymbol{\alpha}\cdot\boldsymbol{w})\}. As a consequence, the complement of C(𝒘)C(\boldsymbol{w}) is the set of all boostable vectors 𝐳\mathbf{z}. This is a special case of a more general result, stated in Section 5, that holds for any k2k\geq 2 and any multi-objective cost 𝒘:𝒴2R0r\boldsymbol{w}:{\mathcal{Y}}^{2}\to\mathbb{R}_{\geq 0}^{r}. The straight blue line is the boundary of a specific H(𝜶)H(\boldsymbol{\alpha}): for any 𝐳\mathbf{z} in it, every (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learner is boostable w.r.t. the cost 𝜶𝒘\boldsymbol{\alpha}\cdot\boldsymbol{w}.

Theorem 16 has an interesting geometric interpretation. In fact, one can easily check that for every 𝐳R02\mathbf{z}\in\mathbb{R}_{\geq 0}^{2} it holds that z+z+1\sqrt{z_{-}}+\sqrt{z_{+}}\geq 1 if and only if 𝜶𝐳V(𝜶𝒘)\boldsymbol{\alpha}\cdot\mathbf{z}\geq\operatorname{V}(\boldsymbol{\alpha}\cdot\boldsymbol{w}) for every 𝜶Δ𝒴\boldsymbol{\alpha}\in\Delta_{\mathcal{Y}}, and not just for the single 𝜶\boldsymbol{\alpha} specified in Theorem 16. That is, we have:

Theorem 17.

Let 𝒴={1,+1}{\mathcal{Y}}=\{-1,+1\}, let 𝐰\boldsymbol{w} be the population-driven cost, and let 𝐳R02\mathbf{z}\in\mathbb{R}_{\geq 0}^{2}. Then 𝐳\mathbf{z} is coin-attainable w.r.t. 𝐰\boldsymbol{w} if and only if 𝛂𝐳V(𝛂𝐰)\boldsymbol{\alpha}\cdot\mathbf{z}\geq\operatorname{V}(\boldsymbol{\alpha}\cdot\boldsymbol{w}) for every 𝛂Δ𝒴\boldsymbol{\alpha}\in\Delta_{\mathcal{Y}}.

Theorem 17 says that 𝐳\mathbf{z} is coin-attainable if and only if there is no way to “scalarize” 𝒘\boldsymbol{w} so as to obtain a boostable learner w.r.t. the corresponding scalarized cost zz. From a geometric perspective, the result can be read as follows. For any 𝜶Δ𝒴\boldsymbol{\alpha}\in\Delta_{\mathcal{Y}} consider the following halfspace:

H(𝜶)={𝒙R2:𝜶𝒙V(𝜶𝒘)}.\displaystyle H(\boldsymbol{\alpha})=\{{\boldsymbol{x}}\in\mathbb{R}^{2}\,:\,\boldsymbol{\alpha}\cdot{\boldsymbol{x}}\geq\operatorname{V}(\boldsymbol{\alpha}\cdot\boldsymbol{w})\}\;. (35)

Let moreover C(𝒘)C(\boldsymbol{w}) be the set of all 𝐳\mathbf{z} that are coin-attainable w.r.t. 𝒘\boldsymbol{w}. Then Theorem 17 says:

C(𝒘)=𝜶Δ𝒴H(𝜶).\displaystyle C(\boldsymbol{w})=\bigcap_{\boldsymbol{\alpha}\in\Delta_{\mathcal{Y}}}H(\boldsymbol{\alpha})\;. (36)

In other words, the set of coin-attainable vectors is precisely the intersection of all halfspaces H(𝜶)H(\boldsymbol{\alpha}). It follows that C(𝒘)C(\boldsymbol{w}) is convex: and, indeed, if two vectors 𝐳\mathbf{z} and 𝐳\mathbf{z}^{\prime} are both coin-attainable then it is easy to see that a𝐳+(1a)𝐳a\cdot\mathbf{z}+(1-a)\cdot\mathbf{z}^{\prime} is coin-attainable, too, for every a[0,1]a\in[0,1]. Figure 3 gives a pictorial description of this geometric interpretation. The proof of Theorem 17 is not given here, as it is a special case, for 𝒘\boldsymbol{w} being the population-driven cost, of Theorem 7.

3.4 Proof of B

First item. The proof makes use of the duality results of Section 4. Assume (𝒘,𝒛)C(𝒘)(\boldsymbol{w},\boldsymbol{z})\notin C(\boldsymbol{w}); that is, 𝐳\mathbf{z} is not coin-attainable with respect to 𝒘\boldsymbol{w}. By Theorem 17 there exists 𝜶Δr\boldsymbol{\alpha}\in\Delta_{r} such that z𝜶<V(w𝜶)z_{\boldsymbol{\alpha}}<\operatorname{V}(w_{\boldsymbol{\alpha}}), where w𝜶=i=1rαiwiw_{\boldsymbol{\alpha}}=\sum_{i=1}^{r}\alpha_{i}w_{i}. Thus, 𝒜{\mathcal{A}} is a (w𝜶,z𝜶)(w_{\boldsymbol{\alpha}},z_{\boldsymbol{\alpha}})-learner for \mathcal{F}. By Lemma 14, then, there exists a (w𝜶,0)(w_{\boldsymbol{\alpha}},0)-learner for \mathcal{F}. It is immediate to see that, if 𝜶>𝟎\boldsymbol{\alpha}>\mathbf{0} (that is, α>0\alpha_{\ell}>0 for every [r]\ell\in[r]) then it is a (w,0)(w,0)-learner for \mathcal{F}. Thus, we need only to show that we can always assume 𝜶>𝟎\boldsymbol{\alpha}>\mathbf{0}. Let

𝜶(ϵ)=(1ϵ)𝜶+ϵr𝟏ϵ[0,1].\displaystyle\boldsymbol{\alpha}(\epsilon)=(1-\epsilon)\boldsymbol{\alpha}+\frac{\epsilon}{r}\cdot\mathbf{1}\quad\forall\epsilon\in[0,1]\,. (37)

Note that both z𝜶(ϵ)z_{\boldsymbol{\alpha}(\epsilon)} and V(w𝜶(ϵ))\operatorname{V}(w_{\boldsymbol{\alpha}(\epsilon)}) are continuous functions of ϵ\epsilon. Since z𝜶(0)=z𝜶<V(w𝜶)=V(w𝜶(0))z_{\boldsymbol{\alpha}(0)}=z_{\boldsymbol{\alpha}}<\operatorname{V}(w_{\boldsymbol{\alpha}})=\operatorname{V}(w_{\boldsymbol{\alpha}(0)}), it follows that there exists ϵ>0\epsilon>0 such that z𝜶(ϵ)<V(w𝜶(ϵ))z_{\boldsymbol{\alpha}(\epsilon)}<\operatorname{V}(w_{\boldsymbol{\alpha}(\epsilon)}), too.

Second item. We show that, if 𝐳C(𝒘)\mathbf{z}\in C(\boldsymbol{w}), then there is a trivial learner 𝒜{\mathcal{A}} that is a (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learner for every =𝒴𝒳\mathcal{F}={\mathcal{Y}}^{\mathcal{X}}. By choosing \mathcal{F} to be any non-learnable class this implies that 𝒜{\mathcal{A}} is not boostable to a (𝒘,𝐳)(\boldsymbol{w},\mathbf{z}^{\prime})-learner for \mathcal{F} for any 𝐳C(𝒘)\mathbf{z}^{\prime}\notin C(\boldsymbol{w}), for otherwise the first item above would imply the existence of a (𝒘,𝟎)(\boldsymbol{w},\mathbf{0})-learner for \mathcal{F}. The learner 𝒜{\mathcal{A}} works as follows. Given a sample SS of mm examples drawn i.i.d. from 𝒟\mathcal{D} and labeled by any f:𝒳𝒴f:{\mathcal{X}}\to{\mathcal{Y}}, it estimates the marginal 𝐪\mathbf{q} of 𝒟\mathcal{D} on 𝒴{\mathcal{Y}}. Denote that estimate by 𝐪^\hat{\mathbf{q}}. It is well known that 𝒜{\mathcal{A}} can ensure 𝐪^\hat{\mathbf{q}} is within total variation distance ϵ\epsilon of 𝐪\mathbf{q} with probability at least 1δ1-\delta by taking m=Θ(k+ln(1/δ)ϵ2)m=\Theta\bigl{(}\frac{k+\ln(1/\delta)}{\epsilon^{2}}\bigr{)}, see for instance (Canonne, 2020, Theorem 1). Assume then this is the case. Let 𝐩Δ\mathbf{p}\in\Delta be a distribution such that 𝒘(𝐩,𝐪^)𝐳\boldsymbol{w}(\mathbf{p},\hat{\mathbf{q}})\leq\mathbf{z}; note that 𝐩\mathbf{p} exists by definition of coin attainability. Then, 𝒜{\mathcal{A}} returns a randomized predictor hS:𝒳𝒴h_{S}:{\mathcal{X}}\to{\mathcal{Y}} such that hS(x)𝐩h_{S}(x)\sim\mathbf{p} independently of the input xx. Since 𝒘[0,1]r\boldsymbol{w}\in[0,1]^{r}, then,

L𝒟𝒘(hS)𝒘(𝐪,𝐪^)+ϵ𝟏=𝐳+ϵ𝟏.\displaystyle L_{\mathcal{D}}^{\boldsymbol{w}}(h_{S})\leq\boldsymbol{w}(\mathbf{q},\hat{\mathbf{q}})+\epsilon\cdot\mathbf{1}=\mathbf{z}+\epsilon\cdot\mathbf{1}\,. (38)

This proves that 𝒜{\mathcal{A}} is a (𝒘,𝐳)(\boldsymbol{w},\mathbf{z})-learner.

4 Equivalence: Cost-Sensitive and Multi-Objective

In this section we examine the connection between cost-sensitive and multi-objective learning. In particular, we prove the two equivalence theorems: Theorem 6 which demonstrates a certain equivalence between learning algorithms for a class \mathcal{F}, with respect to these different guarantees, and Theorem 7, which characterizes trivial guarantees in both settings.

The following lemma is the key component used to prove Theorem 7.

Lemma 18.

Let rNr\in\mathbb{N}, let 𝐰=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}) where each wi:𝒴2[0,1]w_{i}:{\mathcal{Y}}^{2}\to[0,1] is a cost function, and let 𝐳[0,1]r\boldsymbol{z}\in[0,1]^{r}. Fix any J𝒴J\subseteq{\mathcal{Y}}, and 𝐪ΔJ\mathbf{q}\in\Delta_{J}. Let 𝒴𝒳{\mathcal{H}}\subseteq{\mathcal{Y}}^{\mathcal{X}} be some fixed hypotheses class. Assume that for every 𝛂Δr\boldsymbol{\alpha}\in\Delta_{r}, there exists 𝐩Δ\mathbf{p}\in\Delta_{\mathcal{H}} such that i=1rαiwi(𝐩,𝐪)𝛂,𝐳\sum_{i=1}^{r}\alpha_{i}\cdot w_{i}(\mathbf{p},\mathbf{q})\leq\langle\boldsymbol{\alpha},\boldsymbol{z}\rangle. Then, there exists 𝐩Δ\mathbf{p}^{*}\in\Delta_{\mathcal{H}} such that for all i=1,,ri=1,...,r, it holds that wi(𝐩,𝐪)ziw_{i}(\mathbf{p}^{*},\mathbf{q})\leq z_{i}.

Proof.

Define the following zero-sum game. The pure strategies of the maximizing player are [r][r], and the pure strategies minimizing player are 𝒴{\mathcal{Y}}. The payoff matrix is given as follows, with rows corresponding to labels 𝒴\ell\in{\mathcal{Y}} and columns to loss-entry i[r]i\in[r],

M(,i)=Ej𝐪wi(,j)zi.M(\ell,i)={\mathbb E}_{j\sim\mathbf{q}}\ w_{i}(\ell,j)-z_{i}. (39)

Thus, we have,

M(𝐩,𝜶)E𝐩Ei𝜶M(,i)=i=1rαiwi(𝐩,𝐪)𝜶,𝒛.M(\mathbf{p},\boldsymbol{\alpha})\triangleq{\mathbb E}_{\ell\sim\mathbf{p}}\ {\mathbb E}_{i\sim\boldsymbol{\alpha}}\ M(\ell,i)=\sum_{i=1}^{r}\alpha_{i}w_{i}(\mathbf{p},\mathbf{q})-\langle\boldsymbol{\alpha},\boldsymbol{z}\rangle. (40)

Denote,

V(𝐪)max𝜶Δrmin𝐩Δ𝒴M(𝐩,𝜶).\displaystyle\operatorname{V}(\mathbf{q})\triangleq\max_{\boldsymbol{\alpha}\in\Delta_{r}}\min_{\mathbf{p}\in\Delta_{\mathcal{Y}}}M(\mathbf{p},\boldsymbol{\alpha}). (41)

It remains to show that there exists 𝐩\mathbf{p} such for all i=1,,ri=1,...,r, we have wi(𝐩,𝐪)ziw_{i}(\mathbf{p},\mathbf{q})\leq z_{i}. By assumption, we know that V(𝐪)0\operatorname{V}(\mathbf{q})\leq 0, and so by von Neumann’s minimax theorem (von Neumann, 1928) we have:

min𝐩Δ𝒴max𝜶ΔrM(𝐩,𝜶)0.\displaystyle\min_{\mathbf{p}\in\Delta_{\mathcal{Y}}}\max_{\boldsymbol{\alpha}\in\Delta_{r}}M(\mathbf{p},\boldsymbol{\alpha})\leq 0. (42)

In particular, there exists 𝐩Δ𝒴\mathbf{p}^{*}\in\Delta_{\mathcal{Y}} such that for all 𝜶Δr\boldsymbol{\alpha}\in\Delta_{r} it holds that:

i=1rαiwi(𝐩,𝐪)𝜶,𝒛.\sum_{i=1}^{r}\alpha_{i}w_{i}(\mathbf{p}^{*},\mathbf{q})\leq\langle\boldsymbol{\alpha},\boldsymbol{z}\rangle.

Observe that the above holds for all 𝜶\boldsymbol{\alpha} of the form 𝐞i\mathbf{e}_{i} for i=1,,ri=1,...,r (i.e., the standard basis of Rr\mathbb{R}^{r}), and so we get the desired result. ∎

Proposition 19.

Let rNr\in\mathbb{N}, and let 𝐰=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}) where each wi:𝒴2[0,1]w_{i}:{\mathcal{Y}}^{2}\to[0,1] is a cost function, 𝐳[0,1]r\boldsymbol{z}\in[0,1]^{r}, and J𝒴J\subseteq{\mathcal{Y}}. Then, the following are equivalent.

  1. 1.

    (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}) is JJ-dice-attainable.

  2. 2.

    𝜶Δr\forall\boldsymbol{\alpha}\in\Delta_{r}, it holds that 𝜶,𝒛VJ(w𝜶)\langle\boldsymbol{\alpha},\boldsymbol{z}\rangle\geq\operatorname{V}_{J}(w_{\boldsymbol{\alpha}}), where w𝜶=i=1rαiwiw_{\boldsymbol{\alpha}}=\sum_{i=1}^{r}\alpha_{i}w_{i}.

Next, we prove Proposition 19, which also proves Theorem 7 as a special case.

Proof of Proposition 19.

It is easy to see that 121\implies 2 holds, by definitions. We prove 212\implies 1. By assumption, for every 𝜶Δr\boldsymbol{\alpha}\in\Delta_{r} it holds that,

VJ(w𝜶)=max𝐪ΔJmin𝐩Δ𝒴i=1rαiwi(𝐩,𝐪)𝜶,𝒛.\operatorname{V}_{J}(w_{\boldsymbol{\alpha}})=\max_{\mathbf{q}\in{\Delta_{J}}}\min_{\mathbf{p}\in\Delta_{{\mathcal{Y}}}}\sum_{i=1}^{r}\alpha_{i}\cdot w_{i}(\mathbf{p},\mathbf{q})\leq\langle\boldsymbol{\alpha},\boldsymbol{z}\rangle. (43)

In particular, this implies that for any 𝐪ΔJ\mathbf{q}\in\Delta_{J} and any 𝜶Δr\boldsymbol{\alpha}\in\Delta_{r}, there exists 𝐩Δ𝒴\mathbf{p}\in\Delta_{\mathcal{Y}} such that

i=1rαiwi(𝐩,𝐪)𝜶,𝒛.\sum_{i=1}^{r}\alpha_{i}\cdot w_{i}(\mathbf{p},\mathbf{q})\leq\langle\boldsymbol{\alpha},\boldsymbol{z}\rangle.

Thus, by Lemma 18 we get that for any 𝐪ΔJ\mathbf{q}\in\Delta_{J}, there exists 𝐩Δ𝒴\mathbf{p}^{*}\in\Delta_{\mathcal{Y}} such that for every i[r]i\in[r],

wi(𝐩,𝐪)zi.w_{i}(\mathbf{p}^{*},\mathbf{q})\leq z_{i}.

Thus, we get that (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}) is JJ-dice-attainable. ∎

We shift our focus to general learning algorithms for a class \mathcal{F}. First, recall that a multi-objective learner is, by definition, also a cost-sensitive learner. The existence of a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner implies the existence of a (wi,zi)(w_{i},z_{i})-learner for each i=1,,ri=1,\ldots,r. Equivalently, the existence of a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner implies the existence of a (w𝜶,z𝜶)(w_{\boldsymbol{\alpha}},z_{\boldsymbol{\alpha}})-learner for 𝜶{𝒆1,,𝒆r}\boldsymbol{\alpha}\in\{\boldsymbol{e}_{1},\ldots,\boldsymbol{e}_{r}\} (the canonical basis of Rr)\mathbb{R}^{r}) where w𝜶=i=1rαiwiw_{\boldsymbol{\alpha}}=\sum_{i=1}^{r}\alpha_{i}w_{i} and z𝜶=𝜶,𝒛z_{\boldsymbol{\alpha}}=\langle\boldsymbol{\alpha},\boldsymbol{z}\rangle.

Theorem 6 shows the other direction holds, i.e., that the existence of a (w𝜶,z𝜶)(w_{\boldsymbol{\alpha}},z_{\boldsymbol{\alpha}})-learner for all possible 𝜶\boldsymbol{\alpha} implies the existence of a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner.

Algorithm 2 Boosting (w𝜶,z𝜶)(w_{\boldsymbol{\alpha}},z_{\boldsymbol{\alpha}})-learners
0:  Sample S=(xj,yj)j=1mS=(x_{j},y_{j})_{j=1}^{m}; (w𝜶,z𝜶)(w_{\boldsymbol{\alpha}},z_{\boldsymbol{\alpha}})-learners 𝒜𝜶{\mathcal{A}}_{\boldsymbol{\alpha}} for every 𝜶Δr\boldsymbol{\alpha}\in\Delta_{r}; parameters T,η,m^T,\eta,\widehat{m}
1:  Initialize: a1(i)=1a_{1}(i)=1 for all i=1,,ri=1,...,r, set UU to be the uniform distribution over [m][m].
2:  Define M(h,i)=I{1mj=1mwi(h(xj),yj)>zi}M(h,i)=\mathbb{I}\!\left\{\frac{1}{m}\sum_{j=1}^{m}\ w_{i}(h(x_{j}),y_{j})>z_{i}\right\} for all i=1,,ri=1,...,r and h:𝒳𝒴h:{\mathcal{X}}\rightarrow{\mathcal{Y}}.
3:  for t=1,,Tt=1,\ldots,T do
4:     Compute the distribution 𝜶t𝐚tiat(i)\boldsymbol{\alpha}_{t}\triangleq\frac{\mathbf{a}_{t}}{\sum_{i}a_{t}(i)} over [r][r].
5:     Draw a set StS_{t} of m^\widehat{m} labeled examples i.i.d. from UU and obtain ht=𝒜𝜶t(St)h_{t}={\mathcal{A}}_{\boldsymbol{\alpha}_{t}}(S_{t}).
6:     For every i=1,,ri=1,\ldots,r let:
at+1(i)at(i)eηM(ht,i).a_{t+1}(i)\triangleq a_{t}(i)\cdot e^{\eta\cdot M(h_{t},i)}\;.
7:  end for
8:  return  The set of weak predictors h1,,hTh_{1},...,h_{T}.
Proof of Theorem 6.

It is easy to see that 121\implies 2 holds, by definitions. We prove 212\implies 1. For any ff\in\mathcal{F}, and distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}}, let SS be a set of mm examples sampled i.i.d. from 𝒟\mathcal{D} and labeled by ff, where mm to be determined later. Let ϵ,δ>0\epsilon,\delta>0. Then, applying Algorithm 2 over SS with access to 𝒜𝜶{\mathcal{A}}_{\boldsymbol{\alpha}} learners, and with T=100r2ln(r)T={100r^{2}\ln(r)}, η=2ln(r)T\eta=\sqrt{\frac{2\ln(r)}{T}}, m^=m0(110r,120rT)\widehat{m}=m_{0}(\frac{1}{10r},\frac{1}{20rT}) (where m0m_{0} is the sample complexity of 𝒜𝜶{\mathcal{A}}_{\boldsymbol{\alpha}}) yields the following. Fix any i[r]i\in[r]. Given that maxt,iM(ht,i)1\max_{t,i}M(h_{t},i)\leq 1 and that η1\eta\leq 1, the standard analysis of Hedge (Freund and Schapire, 1997) shows that:

t=1TM(ht,i)ln(r)η+η2T+t=1TEi𝜶t[M(ht,i)].\sum_{t=1}^{T}M(h_{t},i)\leq\frac{\ln(r)}{\eta}+\frac{\eta}{2}T+\sum_{t=1}^{T}{\mathbb E}_{i\sim\boldsymbol{\alpha}_{t}}\Bigl{[}M(h_{t},i)\Bigr{]}\;. (44)

By the guarantee of 𝒜𝜶{\mathcal{A}}_{\boldsymbol{\alpha}} and the choice of m^\widehat{m}, and by union bound over all t[T]t\in[T] we get that with probability at least 1120r1-\frac{1}{20r},

1Tt=1T1mj=1mw𝜶t(ht(xj),yj)1Tt=1Tz𝜶t+110r,\frac{1}{T}\sum_{t=1}^{T}\frac{1}{m}\sum_{j=1}^{m}\ w_{\boldsymbol{\alpha}_{t}}(h_{t}(x_{j}),y_{j})\leq\frac{1}{T}\sum_{t=1}^{T}z_{\boldsymbol{\alpha}_{t}}+\frac{1}{10r}\;, (45)

which implies,

1Tt=1TEi𝜶t[M(ht,i)]110r.\frac{1}{T}\sum_{t=1}^{T}{\mathbb E}_{i\sim\boldsymbol{\alpha}_{t}}\Bigl{[}M(h_{t},i)\Bigr{]}\leq\frac{1}{10r}. (46)

Thus, we get that with probability 1120r1-\frac{1}{20r},

1Tt=1TM(ht,i)ln(r)Tη+η2+110r15r.\frac{1}{T}\sum_{t=1}^{T}M(h_{t},i)\leq\frac{\ln(r)}{T\eta}+\frac{\eta}{2}+\frac{1}{10r}\leq\frac{1}{5r}\;. (47)

By then also taking a union bound over i=1,,ri=1,...,r, we have with probability at least 1920\frac{19}{20} for all i=1,,ri=1,...,r it holds that,

1Tt=1TM(ht,i)15r.\frac{1}{T}\sum_{t=1}^{T}M(h_{t},i)\leq\frac{1}{5r}. (48)

Recall that for each tt and ii we have that M(ht,i){0,1}M(h_{t},i)\in\{0,1\}, and so for all t[T]t\in[T] but T5r\frac{T}{5r} it holds that M(ht,i)=0M(h_{t},i)=0. We can now define the randomized hypothesis h~\tilde{h} for any x𝒳x\in{\mathcal{X}} as follows. To compute the value of h~(x)\tilde{h}(x), first we sample a hypothesis hth_{t} by sampling tU(T)t\sim U(T) uniformly at random, and then predict according to ht(x)h_{t}(x). We then have,

PtU(T)[i=1rM(ht,i)1]i=1rPtU(T)[M(ht,i)=1]15.\operatorname{\mathbb{P}}_{t\sim U(T)}\Big{[}\sum_{i=1}^{r}M(h_{t},i)\geq 1\Big{]}\leq\sum_{i=1}^{r}\operatorname{\mathbb{P}}_{t\sim U(T)}\Big{[}M(h_{t},i)=1\Big{]}\leq\frac{1}{5}. (49)

Then, by the definition of M(h,i)M(h,i), we have that with probability at least 1920\frac{19}{20},

Ph~[i=1,,r,1mj=1mwi(h~(xj),yj)zi]45.\operatorname{\mathbb{P}}_{\tilde{h}}\Big{[}\forall i=1,...,r,\ \ \frac{1}{m}\sum_{j=1}^{m}\ w_{i}(\tilde{h}(x_{j}),y_{j})\leq z_{i}\Big{]}\geq\frac{4}{5}. (50)

Finally, we apply standard compression-based generalization arguments (Shalev-Shwartz and Ben-David (2014), Theorem 30.2) where the sample size is m=O~(κϵ2)m=\tilde{O}(\frac{\kappa}{\epsilon^{2}}) and where the compression size is κ=m^T\kappa=\widehat{m}\cdot T, we get that with probability at least 1δ1-\delta over the sample of S𝒟mS\sim\mathcal{D}^{m} it holds that:

Ph~[i=1,,r,L𝒟wi(h~)zi+ϵ]>1/2.\operatorname{\mathbb{P}}_{\tilde{h}}\Big{[}\forall i=1,...,r,\ \ L_{\mathcal{D}}^{w_{i}}(\tilde{h})\leq z_{i}+\epsilon\Big{]}>1/2. (51)

Thus, the randomized hypothesis h~\tilde{h} satisfies the desired guarantees with a constant probability. Overall we get that the procedure described above is a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner for \mathcal{F} in expectation.

Lastly, we remark that instead of a randomized learner whose guarantees hold in expectation, one can de-randomize choosing a single hth_{t} uniformly at random as above. Moreover, the confidence parameter of 1/21/2 can easily be made arbitrarily large using standard confidence-boosting techniques. That is, by first splitting the input sample into approximately q=O(log(r/δ))q=O(\log(r/\delta)) non-intersecting equal parts to learn roughly qq independent classifiers, each satisfying Equation (51). By the choice of qq, with 1δ/21-\delta/2 probability, at least one of these classifiers will the satisfy the 1/21/2-probability condition given in Equation (51). Then, by Chernoff’s inequality, for each of these classifiers the performance as tested over another independent test-set is close to that over the population distribution. Then, by choosing the one with best empirical errors over the test-set, we get the desired result. ∎

5 Multiclass Classification

In contrast to binary classification, boosting in the multiclass setting does not exhibit a clean dichotomy between “trivial” guarantees and “fully boostable” guarantees. A line of works (Mukherjee and Schapire, 2011; Brukhim et al., 2021, 2023b, 2023a) has studied multiclass boosting when ww is the standard 0-1 loss, and has shown a certain multichotomy determined by k1k-1 thresholds, 12,23,,k1k\frac{1}{2},\frac{2}{3},\dots,\frac{k-1}{k}, that replace the single boostability threshold 1/2\nicefrac{{1}}{{2}} of the binary case. For each such threshold, any learner achieving a loss below it can be “boosted” to a predictor that rules out some incorrect labels, albeit not all. Thus, for every example xx the predictor returns a subset of candidate labels, also called a list. A learner that beats a threshold 1\frac{\ell-1}{\ell} above in fact yields a list predictor with lists of size 1\ell-1, and it can be shown that the converse is also true. In this section we generalize these kind of results to the case of arbitrary costs functions, proving that these list boostability and list-versus-loss equivalences hold more broadly.

We start in Section 5.1 by describing an algorithm that turns a (w,z)(w,z)-learner into a list learner, proving the first part of C. Next, Section 5.3 proves some “list boostability” results that focus on the list length, based on the framework of List PAC learning (Brukhim et al., 2022; Charikar and Pabbaraju, 2023). We turn to multi-objective losses in Section 5.2, proving the first item of D. Finally, Section 5.4 concludes by giving the lower bounds of both C and D.

5.1 Cost-sensitive boosting

As said, in multiclass classification one cannot in general boost a learner into one that predicts the correct label. However, as prior work has shown, one can boost a learner into a list learner. This is a learner whose output hypothesis maps each x𝒳x\in{\mathcal{X}} to a set of labels, rather than to one single label. Formally, a list function is any map μ:𝒳2𝒴\mu:{\mathcal{X}}\to 2^{\mathcal{Y}}. A list learner is then a learner that outputs a list function μ\mu. The performance of such a list function μ\mu is typically measured in two ways. First, one wants μ(x)\mu(x) to contain the true label f(x)f(x) with good probability. Second, one wants μ(x)\mu(x) to somehow be “close” to f(x)f(x); for instance, in the case of standard 0-1 cost, this is measured by the length |μ(x)||\mu(x)| of μ(x)\mu(x). All these guarantees are meant, as usual, over the target distribution 𝒟\mathcal{D}. One can then ask the following question: given a (w,z)(w,z)-learner, how good a list function μ\mu can one construct? In order to answer this question in the cost-sensitive setting, we need to generalize list length to a more nuanced measure—one that is based once again on the values of games.

Definition 20 ((w,z)(w,z)-boundedness).

Let w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\rightarrow[0,1] be a cost function and let z0z\geq 0. A list function μ:𝒳2𝒴\mu:{\mathcal{X}}\to 2^{\mathcal{Y}} is (w,z)(w,z)-bounded if, for every x𝒳x\in{\mathcal{X}}, the list μ(x)\mu(x) satisfies Vμ(x)(w)z\operatorname{V}_{\!\mu(x)}(w)\leq z.

Definition 21 ((w,z)(w,z)-list learner).

Let w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\rightarrow[0,1] be a cost function and let z0z\geq 0. An algorithm \mathcal{L} is a (w,z)(w,z)-list learner for a class 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} if there is a function m:(0,1)2Nm_{\mathcal{L}}:(0,1)^{2}\rightarrow\mathbb{N} such that for every ff\in\mathcal{F}, every distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}}, and every ϵ,δ(0,1)\epsilon,\delta\in(0,1) the following claim holds. If SS is a sample of m(ϵ,δ)m_{\mathcal{L}}(\epsilon,\delta) examples drawn i.i.d. from 𝒟\mathcal{D} and labeled by ff, then (S)\mathcal{L}(S) returns a (w,z)(w,z)-bounded list function μ\mu such that Px𝒟[f(x)μ(x)]ϵ\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}\left[f(x)\notin\mu(x)\right]~{}\leq~{}\epsilon with probability at least 1δ1-\delta.

The definitions above mirror closely the standard ones used in list learning, where the quality of a list μ(x)\mu(x) is measured by its length |μ(x)||\mu(x)|. The crucial difference is that, now, we are instead using the value of the restricted game Vμ(x)(w)\operatorname{V}_{\!\mu(x)}(w). To get some intuition why this is the right choice, consider again the case where ww is the standard 0-1 cost. In that case it is well known that VJ(w)=11/|J|V_{J}(w)=1-\nicefrac{{1}}{{|J|}} for every (nonempty) J𝒴J\subseteq{\mathcal{Y}}. Then, by Definition 20 and Definition 21 a (w,z)(w,z)-list learner is one that outputs lists of length at most =min(k,11z)\ell=\min(k,\lfloor\frac{1}{1-z}\rfloor). That is, for the 0-1 cost Definition 21 yields the standard notion of \ell-list learner used in prior work on multiclass boosting. Now, for the special case of the 0-1 cost, Brukhim et al. (2023a) showed that any (w,z)(w,z)-learner is equivalent to an \ell-list learner, with \ell as defined above, in the sense that each one of them can be converted into the other. We show that this remains true for every cost function ww, if one replaces |μ(x)||\mu(x)| with Vμ(x)(w)\operatorname{V}_{\!\mu(x)}(w), and therefore the notion of \ell-list learner with the notion of (w,z)(w,z)-list learner. That is, we prove that (w,z)(w,z)-learners and (w,z)(w,z)-list learners are equivalent. Formally, we prove:

Theorem 22.

Let 𝒴=[k]{\mathcal{Y}}=[k], let w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\rightarrow[0,1] be a cost function, let 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}, and let z0z\geq 0. Then:

  • Every (w,z)(w,z)-learner for \mathcal{F} can be converted into a (w,z)(w,z)-list learner for \mathcal{F}.

  • Every (w,z)(w,z)-list learner for \mathcal{F} can be converted into a (w,z)(w,z)-learner for \mathcal{F}.

Theorem 22 is the main technical result of the present sub-section, and implies easily the first item of C. And again, the above-mentioned result of Brukhim et al. (2023a) is a special case of Theorem 22, obtained for ww being the standard 0-1 cost. Moreover, one can easily show that all these results do not hold if one uses |μ(x)||\mu(x)| in place of Vμ(x)(w)\operatorname{V}_{\!\mu(x)}(w). Our game-theoretical perspective therefore seems the right point of view for characterizing multiclass boostability. Let us see how the first item of C follows from Theorem 22. Recall the thresholds defined in Equation 6:

0=v1(w)<<vτ(w)=V(w),0=v_{1}(w)<\ldots<v_{\tau}(w)=V(w), (52)

where the first equalities hold because vJ(w)=0v_{J}(w)=0 for any singleton JJ and, by convention, for the empty set. Given z0z\geq 0, let n=n(z)n=n(z) be the largest integer such that vn(w)zv_{n}(w)\leq z. Now, suppose we have a (w,z)(w,z)-learner 𝒜{\mathcal{A}} for some \mathcal{F}. By the first item of Theorem 22, there exists a (w,z)(w,z)-list learner \mathcal{L} for \mathcal{F}, too. However, by definition of nn every J𝒴J\subseteq{\mathcal{Y}} with VJ(w)z\operatorname{V}_{J}(w)\leq z satisfies VJ(w)vn\operatorname{V}_{J}(w)\leq v_{n}. Therefore, \mathcal{L} actually returns list functions that are (w,vn+σ)(w,v_{n}+\sigma)-bounded, for σ>0\sigma>0 arbitrarily small. That is, \mathcal{L} is actually a (w,vn)(w,v_{n})-list learner for \mathcal{F}. By the second item of Theorem 22, then, there exists a (w,vn)(w,v_{n})-learner for \mathcal{F}.

The rest of the subsection is therefore devoted to prove Theorem 22: the first item in Section 5.1.1, and the second one in Section 5.1.2.

5.1.1 Turning a (w,z)(w,z)-learner into a (w,z)(w,z)-list learner

We prove the first item of Theorem 22 by giving an algorithm, Algorithm 3. The algorithm is very similar to the one for the binary case. And, like for that case, we define a certain margin that the learner has, in order to bound the number of “boosting” rounds of the algorithm. For the multiclass case, however, the definition is in general slightly different.

Definition 23 (Margin).

Let w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\to[0,1] be a cost function and let z0z\geq 0. The margin of a (w,z)(w,z)-learner is γ=min{VJ(w)z:J𝒴,VJ(w)>z}\gamma=\min\{\operatorname{V}_{J}(w)-z:J\subseteq{\mathcal{Y}},\operatorname{V}_{J}(w)>z\}, or γ=0\gamma=0 whenever the set is empty.

Note that γ=0\gamma=0 only when zV(w)z\geq\operatorname{V}(w). In that case, the first item of Theorem 22 holds trivially by just considering the trivial list learner that outputs the constant list function μ(x)=𝒴\mu(x)={\mathcal{Y}}. Hence, in what follows we consider always the case γ>0\gamma>0. We shall prove:

Theorem 24 (Weak \Rightarrow List Learning).

Let 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} and let w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\to[0,1] be a cost function. If 𝒜{\mathcal{A}} is a (w,z)(w,z)-learner for \mathcal{F} with margin γ>0\gamma>0, then Algorithm 3, with the choice of parameters of Lemma 25 and σ=23γ\sigma=\frac{2}{3}\gamma, is a (w,z)(w,z)-list learner for \mathcal{F}. Under the same assumptions, Algorithm 3 makes T=O(ln(m)γ2)T=O\bigl{(}\frac{\ln(m)}{\gamma^{2}}\bigr{)} oracle calls to 𝒜{\mathcal{A}}, where mm is the size of the input sample, and has sample complexity m(ϵ,δ)=O~(m0(γ3,δ2T)ln(1/δ)ϵγ2)m(\epsilon,\delta)=\widetilde{O}\Bigl{(}m_{0}\bigl{(}\frac{\gamma}{3},\frac{\delta}{2T}\bigr{)}\cdot\frac{\ln(1/\delta)}{\epsilon\cdot\gamma^{2}}\Bigr{)}.

Theorem 24 follows immediately from two technical results stated below. The first result, Lemma 25, proves that Algorithm 3 can turn a (w,z)(w,z)-learner 𝒜{\mathcal{A}} into a list function μS\mu_{S} that is (w,z+σ)(w,z+\sigma)-bounded and that with probability 1δ1-\delta is consistent with SS (i.e., yiμ(xi)y_{i}\in\mu(x_{i}) for every (xi,yi)S(x_{i},y_{i})\in S), for σ,δ>0\sigma,\delta>0 as small as desired. The second result, Lemma 26, shows that the list function μ\mu has small generalization error: if the input sample SS of Algorithm 3 is sufficiently large, then Px𝒟[f(x)μ(x)]1ϵ\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}[f(x)\in\mu(x)]\geq 1-\epsilon, where ϵ>0\epsilon>0 can be chosen as small as desired.

Lemma 25.

Let 𝒴=[k]{\mathcal{Y}}=[k], let 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}, and let 𝒜{\mathcal{A}} be a (w,z)(w,z)-learner for \mathcal{F} with sample complexity m0m_{0}. Fix any ff\in\mathcal{F}, any distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}}, and any σ,δ(0,1)\sigma,\delta\in(0,1). Then, given a multiset S={(x1,y1),,(xm,ym)}S=\{(x_{1},y_{1}),\ldots,(x_{m},y_{m})\} of mm examples formed by i.i.d. points x1,,xm𝒟x_{1},\dots,x_{m}\sim\mathcal{D} labeled by ff, oracle access to 𝒜{\mathcal{A}}, and parameters T=8ln(m)σ2T=\left\lceil\frac{8\ln(m)}{\sigma^{2}}\right\rceil, η=2ln(m)T\eta=\sqrt{\frac{2\ln(m)}{T}}, and m^=m0(σ2,δT)\widehat{m}=m_{0}\bigl{(}\frac{\sigma}{2},\frac{\delta}{T}\bigr{)}, Algorithm 3 returns a list function μS\mu_{S} that is (w,z+σ)(w,z+\sigma)-bounded and that, with probability at least 1δ1-\delta, satisfies:

yiμS(xi)i=1,,m.y_{i}\in\mu_{S}(x_{i})\qquad\forall i=1,\dots,m\;.
Proof.

First, we prove that μS\mu_{S} is (w,z+σ)(w,z+\sigma)-bounded. Fix any x𝒳x\in{\mathcal{X}}. Observe that F(x,)F(x,\cdot) is a distribution over 𝒴{\mathcal{Y}}; denote it by 𝐩xΔ𝒴\mathbf{p}^{x}\in\Delta_{\mathcal{Y}}, so that px=F(x,)p^{x}_{\ell}=F(x,\ell) for every label 𝒴\ell\in{\mathcal{Y}}. Thus, by the definition of μS\mu_{S}, the sum involved in the condition for including a certain label y𝒴y\in{\mathcal{Y}} in the list μS(x)\mu_{S}(x) is

𝒴F(x,)w(,y)=w(𝐩x,y).\sum_{\ell\in{\mathcal{Y}}}F(x,\ell)\cdot w(\ell,y)=w(\mathbf{p}^{x},y)\;. (53)

Then, observe that

VμS(x)(w)=min𝐩Δ𝒴max𝐪ΔμS(x)w(𝐩,𝐪)max𝐪ΔμS(x)w(𝐩x,𝐪)=maxyμS(x)w(𝐩x,y)z+σ,\displaystyle\operatorname{V}_{\!\mu_{S}(x)}(w)=\min_{\mathbf{p}\in\Delta_{\mathcal{Y}}}\max_{\mathbf{q}\in\Delta_{\mu_{S}(x)}}w(\mathbf{p},\mathbf{q})\leq\max_{\mathbf{q}\in\Delta_{\mu_{S}(x)}}w(\mathbf{p}^{x},\mathbf{q})=\max_{y\in\mu_{S}(x)}w(\mathbf{p}^{x},y)\leq z+\sigma\;, (54)

where the last inequality holds by definition of μS(x)\mu_{S}(x).

We now turn to proving that with probability at least 1δ1-\delta we have yiμS(xi)y_{i}\in\mu_{S}(x_{i}) for all i[m]i\in[m]. The proof is similar to that of Lemma 12. For any given i[m]i\in[m], the analysis of Hedge and the definitions of η\eta and TT yield:

1Tt=1Tw(ht(xi),yi)σ2+1Tt=1TEj𝒟t[w(ht(xj),yj)].\frac{1}{T}\sum_{t=1}^{T}w(h_{t}(x_{i}),y_{i})\leq\frac{\sigma}{2}+\frac{1}{T}\sum_{t=1}^{T}{\mathbb E}_{j\sim\mathcal{D}_{t}}\Bigl{[}w(h_{t}(x_{j}),y_{j})\Bigr{]}\;. (55)

Furthermore, since 𝒜{\mathcal{A}} is a (w,z)(w,z)-learner for \mathcal{F}, and given the choices of m^\widehat{m} and ht=𝒜(St)h_{t}={\mathcal{A}}(S_{t}), by a union bound over t[T]t\in[T] we have that with probability at least 1δ1-\delta:

1Tt=1TEj𝒟t[w(ht(xj),yj)]z+σ2.\frac{1}{T}\sum_{t=1}^{T}{\mathbb E}_{j\sim\mathcal{D}_{t}}\Bigl{[}w(h_{t}(x_{j}),y_{j})\Bigr{]}\leq z+\frac{\sigma}{2}\;. (56)

Conditioning on Equation 56, we prove that yiμS(xi)y_{i}\in\mu_{S}(x_{i}) for every i[m]i\in[m]. Consider the function F:𝒳×𝒴[0,1]F:{\mathcal{X}}\times{\mathcal{Y}}\to[0,1] defined in Algorithm 3. By Equations 55 and 56, we obtain for every i[m]i\in[m] that

𝒴F(xi,)w(,yi)=1Tt=1Tw(ht(xi),yi)z+σ.\sum_{\ell\in{\mathcal{Y}}}F(x_{i},\ell)\cdot w(\ell,y_{i})=\frac{1}{T}\sum_{t=1}^{T}w(h_{t}(x_{i}),y_{i})\leq z+\sigma\;. (57)

which in turn implies that yiμS(xi)y_{i}\in\mu_{S}(x_{i}) by construction of μS\mu_{S}. This concludes the proof. ∎

Algorithm 3 Boosting a (w,z)(w,z)-learner to an (w,z+σ)(w,z+\sigma)-list learner
0:  Sample S=(xi,yi)i=1mS=(x_{i},y_{i})_{i=1}^{m}; (w,z)(w,z)-learner 𝒜{\mathcal{A}}; parameters T,η,m^,σT,\eta,\widehat{m},\sigma
1:  Initialize: D1(i)=1D_{1}(i)=1 for all i=1,,mi=1,...,m.
2:  for t=1,,Tt=1,\ldots,T do
3:     Compute the distribution 𝒟tDtiDt(i)\mathcal{D}_{t}\triangleq\frac{D_{t}}{\sum_{i}D_{t}(i)} over SS
4:     Draw a set StS_{t} of m^\widehat{m} labeled examples i.i.d. from 𝒟t\mathcal{D}_{t} and obtain ht=𝒜(St)h_{t}={\mathcal{A}}(S_{t}).
5:     For every i=1,,mi=1,\ldots,m let:
Dt+1(i)Dt(i)eηw(ht(xi),yi).D_{t+1}(i)\triangleq D_{t}(i)\cdot e^{\eta\cdot w(h_{t}(x_{i}),y_{i})}\;.
6:  end for
7:  Let F:𝒳×𝒴[0,1]F:{\mathcal{X}}\times{\mathcal{Y}}\to[0,1] be such that F(x,y)1Tt=1TI{ht(x)=y}F(x,y)\triangleq\frac{1}{T}\sum_{t=1}^{T}\mathbb{I}\!\left\{h_{t}(x)=y\right\} for all x𝒳x\in{\mathcal{X}}, y𝒴y\in{\mathcal{Y}}.
8:  return  μS:𝒳2𝒴\mu_{S}:{\mathcal{X}}\to 2^{\mathcal{Y}} such that, for all x𝒳x\in{\mathcal{X}},
μS(x){y𝒴:𝒴F(x,)w(,y)z+σ}.\mu_{S}(x)\triangleq\Biggl{\{}y\in{\mathcal{Y}}:\sum_{\ell\in{\mathcal{Y}}}F(x,\ell)\cdot w(\ell,y)\leq z+\sigma\Biggr{\}}\;.

In a similar way as Lemma 13 does for the binary case, we show that the list function μS\mu_{S} returned by Algorithm 3 has small generalization error provided that the sample SS is sufficiently large. The main idea to prove generalization is again via a sample compression scheme, but relying instead on a novel result by Brukhim et al. (2023a) for multiclass classification; note that, while their result from Brukhim et al. (2023a) is stated in terms of list functions whose outputs are ss-uples in 𝒴s{\mathcal{Y}}^{s} for some s<ks<k, their same proof applies to the (w,z)(w,z)-bounded list functions output by our Algorithm 3.

Lemma 26.

Assume the setting of Lemma 25. For any ϵ,δ(0,1)\epsilon,\delta\in(0,1), if the size mm of the sample given to Algorithm 3 satisfies

mln(2/δ)ϵ+m0(σ2,δ2T)T(1+ln(m)ϵ),m\geq\frac{\ln(2/\delta)}{\epsilon}+m_{0}\left(\frac{\sigma}{2},\frac{\delta}{2T}\right)\cdot T\cdot\left(1+\frac{\ln(m)}{\epsilon}\right)\;,

then the output μS\mu_{S} of Algorithm 3 satisfies Px𝒟[f(x)μS(x)]ϵ\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}[f(x)\notin\mu_{S}(x)]\leq\epsilon with probability at least 1δ1-\delta. Therefore, Algorithm 3 is a (w,z+σ)(w,z+\sigma)-list learner for \mathcal{F} with sample complexity m(ϵ,δ)=O~(m0(σ2,δ2T)ln(1/δ)ϵσ2)m(\epsilon,\delta)=\widetilde{O}\Bigl{(}m_{0}\bigl{(}\frac{\sigma}{2},\frac{\delta}{2T}\bigr{)}\cdot\frac{\ln(1/\delta)}{\epsilon\cdot\sigma^{2}}\Bigr{)}.

Proof.

By analogy with the proof of Lemma 13, we first apply Lemma 25 with δ/2\delta/2 in place of δ\delta in order to obtain an (w,z+σ)(w,z+\sigma)-bounded list function μS\mu_{S} consistent with SS with probability at least 1δ/21-\delta/2. We then apply a compression-based generalization argument. To do so, we remark once again that one can construct a compression scheme for μS\mu_{S} of size equal to the total size of the samples on which 𝒜{\mathcal{A}} operates, which is equal to κ=m^T\kappa=\widehat{m}\cdot T. The main difference with the binary case is that we rely on the generalization bound for a sample compression scheme for lists as per Theorem 6 of Brukhim et al. (2023a), with δ/2\delta/2 in place of δ\delta; we can employ this generalization bound thanks to the consistency of μS\mu_{S} with SS (with sufficiently large probability). Then, this implies that

Px𝒟[f(x)μS(x)]κln(m)+ln(2/δ)mκ\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}\bigl{[}f(x)\notin\mu_{S}(x)\bigr{]}\leq\frac{\kappa\ln(m)+\ln(2/\delta)}{m-\kappa} (58)

holds with probability at least 1δ/21-\delta/2. By similar calculations as in Equations 18 and 19, and replacing the values of κ\kappa and m^\widehat{m}, the right-hand side of the above inequality can be easily show to be at most ϵ\epsilon whenever

mln(2/δ)ϵ+m0(σ2,δ2T)T(1+ln(m)ϵ).m\geq\frac{\ln(2/\delta)}{\epsilon}+m_{0}\left(\frac{\sigma}{2},\frac{\delta}{2T}\right)\cdot T\cdot\left(1+\frac{\ln(m)}{\epsilon}\right)\;. (59)

A union bound concludes the proof for the first part of the claim, since it shows that Algorithm 3 is an (w,z+σ)(w,z+\sigma)-list learner for \mathcal{F}. The claim on the sample complexity of Algorithm 3 then follows by straightforward calculations and the definition of T=O(ln(m)σ2)T=O\bigl{(}\frac{\ln(m)}{\sigma^{2}}\bigr{)}. ∎

At this point, we have all the ingredients and tools to prove that we can construct a (w,z)(w,z)-list learner from a (w,z)(w,z)-learner.

Proof of Theorem 24.

Consider Algorithm 3 under the same assumptions of Lemma 25, and set σ=23γ\sigma=\frac{2}{3}\gamma as per assumption. By Lemma 25 and Lemma 26, we know that Algorithm 3 is a (w,z+23γ)(w,z+\frac{2}{3}\gamma)-list learner with sample complexity m(ϵ,δ)=O~(m0(γ3,δ2T)ln(1/δ)ϵγ2)m(\epsilon,\delta)=\widetilde{O}\bigl{(}m_{0}\bigl{(}\frac{\gamma}{3},\frac{\delta}{2T}\bigr{)}\cdot\frac{\ln(1/\delta)}{\epsilon\cdot\gamma^{2}}\bigr{)} that performs O(ln(m)γ2)O\bigl{(}\frac{\ln(m)}{\gamma^{2}}\bigr{)} oracle calls to the (w,z)(w,z)-learner 𝒜{\mathcal{A}}. Moreover, we can immediately conclude that Algorithm 3 is also a (w,z)(w,z)-list learner because zvnz\geq v_{n} and z+23γ<z+γ=vn+1z+\frac{2}{3}\gamma<z+\gamma=v_{n+1}, by definitions of n=n(z)n=n(z) and γ\gamma. ∎

5.1.2 Turning a (w,z)(w,z)-list learner into a (w,z)(w,z)-learner

We prove that every (w,z)(w,z)-list learner can be converted into a (w,z)(w,z)-learner.

Lemma 27 (List \Rightarrow Weak Learning).

There exists an algorithm {\mathcal{B}} that for every 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} satisfies what follows. Let w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\rightarrow[0,1] be a cost function and let z0z\geq 0. Given oracle access to a (w,z)(w,z)-list learner \mathcal{L} for \mathcal{F} with sample complexity mm_{\mathcal{L}}, algorithm {\mathcal{B}} is a (w,z)(w,z)-learner for \mathcal{F} with sample complexity mm_{\mathcal{L}} that makes one oracle call to \mathcal{L}.

Proof.

Fix any ff\in\mathcal{F} and any distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}}, and suppose {\mathcal{B}} is given a sample SS of size mm(ϵ,δ)m\geq m_{\mathcal{L}}(\epsilon,\delta) consisting of examples drawn i.i.d. from 𝒟\mathcal{D} and labeled by ff. First, {\mathcal{B}} calls \mathcal{L} on SS. By hypothesis, then, \mathcal{L} returns a (w,z)(w,z)-bounded μS:𝒳𝒴𝒳\mu_{S}:{\mathcal{X}}\to{\mathcal{Y}}^{\mathcal{X}} that with probability at least 1δ1-\delta satisfies:

Px𝒟[f(x)μS(x)]ϵ.\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}\bigl{[}f(x)\notin\mu_{S}(x)\bigr{]}\leq\epsilon\;. (60)

Conditioning on the event above from this point onwards, we give a randomized predictor hSh_{S} such that L𝒟w(hS)z+ϵL_{\mathcal{D}}^{w}(h_{S})\leq z+\epsilon. First, for any nonempty J𝒴J\subseteq{\mathcal{Y}}, let 𝐩JΔ𝒴\mathbf{p}_{J}\in\Delta_{\mathcal{Y}} be the minimax distribution achieving the value of the game restricted to JJ, i.e.,

𝐩J=argmin𝐩Δ𝒴(max𝐪ΔJw(𝐩,𝐪)).\displaystyle\mathbf{p}_{J}=\arg\min_{\mathbf{p}\in\Delta_{\mathcal{Y}}}\left(\max_{\mathbf{q}\in\Delta_{J}}w(\mathbf{p},\mathbf{q})\right)\;. (61)

Note that 𝐩J\mathbf{p}_{J} can be efficiently computed via a linear program. Then, simply define hS(x)𝐩μS(x)h_{S}(x)\sim\mathbf{p}_{\mu_{S}(x)}.

Let us analyse the loss of hSh_{S} under 𝒟\mathcal{D}. First, by the law of total expectation, and since w1\|w\|_{\infty}\leq 1,

L𝒟w(hS)\displaystyle L_{\mathcal{D}}^{w}(h_{S}) PxD[f(x)μS(x)]+JPx𝒟[μS(x)=Jf(x)J]L𝒟Jw(hS),\displaystyle\leq\operatorname{\mathbb{P}}_{x\sim D}[f(x)\notin\mu_{S}(x)]+\sum_{J}\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}[\mu_{S}(x)=J\wedge f(x)\in J]\cdot L_{\mathcal{D}_{J}}^{w}(h_{S})\;, (62)

where the summation is over all J𝒴J\subseteq{\mathcal{Y}} with Px𝒟[μS(x)=J]>0\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}[\mu_{S}(x)=J]>0, and 𝒟J\mathcal{D}_{J} is the distribution obtained from 𝒟\mathcal{D} by conditioning on the event μS(x)=Jf(x)J\mu_{S}(x)=J\wedge f(x)\in J. Consider the right-hand side of Equation 62. By Equation 60, the first term is at most ϵ\epsilon. For the second term, denote by 𝐪J\mathbf{q}_{J} the marginal of 𝒟J\mathcal{D}_{J} over 𝒴{\mathcal{Y}}; note that, crucially, 𝐪JΔJ\mathbf{q}_{J}\in\Delta_{J}. Therefore, by definition of hSh_{S}:

L𝒟Jw(hS)=w(𝐩J,𝐪J)max𝐪ΔJw(𝐩J,𝐪)=VJ(w)z,\displaystyle L_{\mathcal{D}_{J}}^{w}(h_{S})=w(\mathbf{p}_{J},\mathbf{q}_{J})\leq\max_{\mathbf{q}\in\Delta_{J}}w(\mathbf{p}_{J},\mathbf{q})=\operatorname{V}_{J}(w)\leq z\;, (63)

where the last inequality holds as J=μ(x)J=\mu(x) and μ\mu is (w,z)(w,z)-bounded. Using Equations 60 and 63 in Equation 62 shows that L𝒟w(hS)z+ϵL_{\mathcal{D}}^{w}(h_{S})\leq z+\epsilon. ∎

5.2 Multi-objective losses

In this sub-section we prove the first item of D. Coupled with Lemma 39 below, this proves D. Both results require the following definition. Let av(𝒘,𝐳)\operatorname{av}(\boldsymbol{w},\mathbf{z}) denote the family of all subsets of 𝒴{\mathcal{Y}} which are avoided by (𝒘,𝐳)(\boldsymbol{w},\mathbf{z}), i.e.,

av(𝒘,𝐳){J𝒴:𝐳DJ(𝒘)}.\operatorname{av}(\boldsymbol{w},\mathbf{z})\triangleq\{J\subseteq{\mathcal{Y}}:\mathbf{z}\notin D_{J}(\boldsymbol{w})\}.

The algorithm proposed below then scales with N=|av(𝒘,𝐳)|N=|\operatorname{av}(\boldsymbol{w},\mathbf{z})|, which may be exponential in |𝒴||{\mathcal{Y}}|. We remark that for our purposes, it also suffices to consider only the minimally-sized avoided lists in av(𝒘,𝐳)\operatorname{av}(\boldsymbol{w},\mathbf{z}), i.e., if JJJ\subseteq J^{\prime} and 𝐳DJ(𝒘)\mathbf{z}\notin D_{J}(\boldsymbol{w}) then Jav(𝒘,𝐳)J^{\prime}\notin\operatorname{av}(\boldsymbol{w},\mathbf{z}). However, in the worst case NN remains of the same order and so we keep the definition as given above for simplicity. Lemma 39 will then give a lower bound showing that this condition is, in some sense, necessary.

Theorem 28.

Let 𝒴=[k]{\mathcal{Y}}=[k], let 𝐰=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}) where each wi:𝒴2[0,1]w_{i}:{\mathcal{Y}}^{2}\rightarrow[0,1] is a cost function, and let 𝐳[0,1]r\boldsymbol{z}\in[0,1]^{r}. Let 𝐳[0,1]r\boldsymbol{z}^{\prime}\in[0,1]^{r} such that 𝐳𝐰𝐳\boldsymbol{z}\preceq_{\boldsymbol{w}}\boldsymbol{z}^{\prime}. Assume there exists a (𝐰,𝐳)(\boldsymbol{w},\boldsymbol{z})-learner for a class 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}. Then, there exists a (𝐰,𝐳)(\boldsymbol{w},\boldsymbol{z}^{\prime})-learner for \mathcal{F}.

Proof.

By assumption, for every J𝒴J\subseteq{\mathcal{Y}} such that Jav(𝒘,𝐳)J\in\operatorname{av}(\boldsymbol{w},\mathbf{z}^{\prime}) it holds that Jav(𝒘,𝐳)J\in\operatorname{av}(\boldsymbol{w},\mathbf{z}). Let 𝜶Δr\boldsymbol{\alpha}\in\Delta_{r}, and denote w𝜶=𝜶𝒘w_{\boldsymbol{\alpha}}=\boldsymbol{\alpha}\cdot\boldsymbol{w} and z𝜶=𝜶𝐳z^{\prime}_{\boldsymbol{\alpha}}=\boldsymbol{\alpha}\cdot\mathbf{z}^{\prime}. Then, by Lemma 29 we get that the (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner can be used to obtain a (w𝜶,z𝜶)(w_{\boldsymbol{\alpha}},z^{\prime}_{\boldsymbol{\alpha}})-learner for \mathcal{F}. Since this holds for every 𝜶Δr\boldsymbol{\alpha}\in\Delta_{r}, then by Theorem 6 this implies the existence of a (𝒘,𝐳)(\boldsymbol{w},\mathbf{z}^{\prime})-learner as well, which completes the proof. ∎

Lemma 29.

Let 𝐰=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}) where each wi:𝒴2[0,1]w_{i}:{\mathcal{Y}}^{2}\rightarrow[0,1] is a cost function, and let 𝐳,𝐳[0,1]r\boldsymbol{z},\boldsymbol{z}^{\prime}\in[0,1]^{r} such that 𝐳𝐰𝐳\boldsymbol{z}\preceq_{\boldsymbol{w}}\boldsymbol{z}^{\prime}. Assume there exists a (𝐰,𝐳)(\boldsymbol{w},\boldsymbol{z})-learner 𝒜{\mathcal{A}} for a class 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}. Then, there exists a (w𝛂,z𝛂)(w_{\boldsymbol{\alpha}},z^{\prime}_{\boldsymbol{\alpha}})-learner for \mathcal{F}, for every 𝛂Δr\boldsymbol{\alpha}\in\Delta_{r}.

Proof.

First, by Lemma 31 we get that 𝒜{\mathcal{A}} can be used to obtain an algorithm \mathcal{L} that is a av(𝒘,𝒛)\operatorname{av}(\boldsymbol{w},\boldsymbol{z}^{\prime})-list learner for \mathcal{F}. Next, we show that \mathcal{L} can be boosted to a (w𝜶,z𝜶)(w_{\boldsymbol{\alpha}},z^{\prime}_{\boldsymbol{\alpha}})-learner. Fix any 𝜶Δr\boldsymbol{\alpha}\in\Delta_{r}. First, we argue that \mathcal{L} is in fact a (w𝜶,z𝜶)(w_{\boldsymbol{\alpha}},z^{\prime}_{\boldsymbol{\alpha}})-list learner (see Definition 21). This holds since for every J𝒴J\subseteq{\mathcal{Y}} such that z𝜶<VJ(w𝜶)z^{\prime}_{\boldsymbol{\alpha}}<V_{J}(w_{\boldsymbol{\alpha}}), it must be that Jav(𝒘,𝒛)J\in\operatorname{av}(\boldsymbol{w},\boldsymbol{z}^{\prime}), by definition and by Proposition 19. In particular, for any μ\mu list function outputted by \mathcal{L}, and for every x𝒳x\in{\mathcal{X}} it holds that Jμ(x)J\not\subseteq\mu(x) and so Vμ(x)(w𝜶)z𝜶V_{\mu(x)}(w_{\boldsymbol{\alpha}})\leq z^{\prime}_{\boldsymbol{\alpha}}.

Thus, we have established that \mathcal{L} is in fact a (w𝜶,z𝜶)(w_{\boldsymbol{\alpha}},z^{\prime}_{\boldsymbol{\alpha}})-list learner. Lastly, by Lemma 27 we get that it can also be converted into a (w𝜶,z𝜶)(w_{\boldsymbol{\alpha}},z^{\prime}_{\boldsymbol{\alpha}})-learner, as needed. ∎

Definition 30 (av(𝒘,𝒛)\operatorname{av}(\boldsymbol{w},\boldsymbol{z})-list learner).

An algorithm \mathcal{L} is a av(𝐰,𝐳)\operatorname{av}(\boldsymbol{w},\boldsymbol{z})-list learner for a class 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} if there is a function m:(0,1)2Nm_{\mathcal{L}}:(0,1)^{2}\rightarrow\mathbb{N} such that for every ff\in\mathcal{F}, every distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}}, and every ϵ,δ(0,1)\epsilon,\delta\in(0,1) the following claim holds. If SS is a sample of m(ϵ,δ)m_{\mathcal{L}}(\epsilon,\delta) examples drawn i.i.d. from 𝒟\mathcal{D} and labeled by ff, then (S)\mathcal{L}(S) returns a list function μ:𝒳2𝒴\mu:{\mathcal{X}}\rightarrow 2^{\mathcal{Y}} such that Px𝒟[f(x)μ(x)]ϵ\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}\left[f(x)\notin\mu(x)\right]~{}\leq~{}\epsilon with probability 1δ1-\delta, and such that for every x𝒳x\in{\mathcal{X}} and every Jav(𝒘,𝒛)J\in\operatorname{av}(\boldsymbol{w},\boldsymbol{z}) it holds that Jμ(x)J\not\subseteq\mu(x).

Lemma 31.

Let 𝒴=[k]{\mathcal{Y}}=[k], let 𝐰=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}) where each wi:𝒴2[0,1]w_{i}:{\mathcal{Y}}^{2}\rightarrow[0,1] is a cost function, and let 𝐳,𝐳[0,1]r\boldsymbol{z},\boldsymbol{z}^{\prime}\in[0,1]^{r} such that 𝐳𝐰𝐳\boldsymbol{z}\preceq_{\boldsymbol{w}}\boldsymbol{z}^{\prime}. Assume there exists a (𝐰,𝐳)(\boldsymbol{w},\boldsymbol{z})-learner 𝒜{\mathcal{A}} for a class 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}. Then, there exists a av(𝐰,𝐳)\operatorname{av}(\boldsymbol{w},\boldsymbol{z}^{\prime})-list learner for \mathcal{F}.

Proof.

First, fix any Jav(𝒘,𝒛)J\in\operatorname{av}(\boldsymbol{w},\boldsymbol{z}^{\prime}). By assumption that 𝒛𝒘𝒛\boldsymbol{z}\preceq_{\boldsymbol{w}}\boldsymbol{z}^{\prime}, we also have that Jav(𝒘,𝒛)J\in\operatorname{av}(\boldsymbol{w},\boldsymbol{z}). In particular, we have that (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}) is not JJ-dice-attainable. Then, by Proposition 19 we have that there exist 𝜶Δr\boldsymbol{\alpha}\in\Delta_{r} such that,

z𝜶=𝜶,𝒛<VJ(w𝜶),z_{\boldsymbol{\alpha}}=\langle\boldsymbol{\alpha},\boldsymbol{z}\rangle<\operatorname{V}_{J}(w_{\boldsymbol{\alpha}}), (64)

where w𝜶=i=1rαiwiw_{\boldsymbol{\alpha}}=\sum_{i=1}^{r}\alpha_{i}w_{i}. Thus, we have that 𝒜{\mathcal{A}} is also a (w𝜶,z𝜶)(w_{\boldsymbol{\alpha}},z_{\boldsymbol{\alpha}})-learner for \mathcal{F}, and that z𝜶<VJ(w𝜶)z_{\boldsymbol{\alpha}}<\operatorname{V}_{J}(w_{\boldsymbol{\alpha}}). We denote this learning algorithm by 𝒜J{\mathcal{A}}_{J}. Notice that we can repeat the above process for any such JJ. Thus, we obtain different learning algorithms 𝒜J{\mathcal{A}}_{J} for each Jav(𝒘,𝒛)J\in\operatorname{av}(\boldsymbol{w},\boldsymbol{z}^{\prime}).

Next, we will describe the construction of the av(𝒘,𝒛)\operatorname{av}(\boldsymbol{w},\boldsymbol{z}^{\prime})-list learning algorithm. Fix any δ,ϵ>0\delta^{\prime},\epsilon^{\prime}>0. For every learner 𝒜J{\mathcal{A}}_{J}, apply Algorithm 3 with parameters σJ=23γJ\sigma_{J}=\frac{2}{3}\gamma_{J}, where γJ\gamma_{J} is the margin of 𝒜J{\mathcal{A}}_{J}, our (w𝜶J,z𝜶J)(w_{\boldsymbol{\alpha}_{J}},z_{\boldsymbol{\alpha}_{J}})-learner (see Definition 23). We also set the parameters of 𝒜J{\mathcal{A}}_{J} to be ϵ=ϵ\epsilon=\epsilon^{\prime}, and δ=δ/2N\delta=\delta^{\prime}/2N, for N=|av(𝒘,𝒛)|N=|\operatorname{av}(\boldsymbol{w},\boldsymbol{z}^{\prime})|. The remaining parameters for Algorithm 3 are set as in Lemma 25.

For each such run of the algorithm, we obtain a list function. Let μJ1,,μJN\mu_{J_{1}},...,\mu_{J_{N}} denote all list functions obtained by this procedure. Finally, return the list function μ\mu defined by

μ(x)=n=1NμJn(x),\mu(x)=\bigcap_{n=1}^{N}\mu_{J_{n}}(x),

for every x𝒳x\in{\mathcal{X}}. We will now show that this algorithm satisfies the desired guarantees, and is indeed a av(𝒘,𝒛)\operatorname{av}(\boldsymbol{w},\boldsymbol{z}^{\prime})-list learner.

First, by Lemma 25 and union bound we get that with probability at least 1δ/21-\delta/2, for each nNn\leq N it holds that:

yiμJn(xi)i=1,,m.y_{i}\in\mu_{J_{n}}(x_{i})\qquad\forall i=1,\dots,m\;.

Thus, in particular, with probability at least 1δ/21-\delta/2 we also have that yiμ(xi)y_{i}\in\mu(x_{i}) for all i[m]i\in[m]. Moreover, by Lemma 25 we have that all μJn\mu_{J_{n}} are (w𝜶,z𝜶+σJn)(w_{\boldsymbol{\alpha}},z_{\boldsymbol{\alpha}}+\sigma_{J_{n}})-bounded. That is, for each nNn\leq N and every x𝒳x\in{\mathcal{X}} it holds that VμJn(x)(w𝜶)z𝜶+σJn=z𝜶+23γJn\operatorname{V}_{\!\mu_{J_{n}}(x)}(w_{\boldsymbol{\alpha}})\leq z_{\boldsymbol{\alpha}}+\sigma_{J_{n}}=z_{\boldsymbol{\alpha}}+\frac{2}{3}\gamma_{J_{n}}. Now recall that by the definition of the margin, and by Equation 64 we in fact have that:

VμJn(x)(w𝜶)<VJn(w𝜶).\operatorname{V}_{\!\mu_{J_{n}}(x)}(w_{\boldsymbol{\alpha}})<\operatorname{V}_{\!J_{n}}(w_{\boldsymbol{\alpha}}).

This then implies that for every x𝒳x\in{\mathcal{X}} it holds that JnμJn(x)J_{n}\not\subseteq\mu_{J_{n}}(x). Thus, in particular, we also get that the final list function μ\mu satisfies that for every x𝒳x\in{\mathcal{X}} it holds that Jnμ(x)J_{n}\not\subseteq\mu(x). Lastly, by following the same compression-based generalization analysis as in Lemma 26, we obtain that the above procedure is in fact a av(𝒘,𝒛)\operatorname{av}(\boldsymbol{w},\boldsymbol{z}^{\prime})-list learning algorithm, with sample complexity m(ϵ,δ)=O~(m0(γ,δNT)ln(N/δ)ϵγ2)m(\epsilon,\delta)=\widetilde{O}\Bigl{(}m_{0}\bigl{(}{\gamma^{*}},\frac{\delta}{NT}\bigr{)}\cdot\frac{\ln(N/\delta)}{\epsilon\cdot{\gamma^{*}}^{2}}\Bigr{)}, where γ=minn[N]γJn\gamma^{*}=\min_{n\in[N]}\gamma_{J_{n}}.

5.3 Multiclass boosting via ss-list PAC learning

In this section we aim to convert weak learners to list learners, with a fixed list size. This is, in some sense, a coarsening of the previous subsection, which also subsumes previous work by Brukhim et al. (2023a), where the authors demonstrate k1k-1 thresholds which determine the achievable list sizes. To that end, for every s=2,,ks=2,\ldots,k define the following coarsening of ss-the critical thresholds of ww:

vs¯(w)min{VJ(w):J([k]s)},vs¯(w)max{VJ(w):J([k]s)}.\displaystyle v_{\underline{s}}(w)\triangleq\min\left\{\operatorname{V}_{J}(w):J\in{[k]\choose s}\right\},\qquad v_{\overline{s}}(w)\triangleq\max\left\{\operatorname{V}_{J}(w):J\in{[k]\choose s}\right\}\;. (65)

Clearly, for all s{2,,k}s\in\{2,...,k\} we have that vs¯(w)vs¯(w)v_{\underline{s}}(w)\leq v_{\overline{s}}(w). It is also easy to see that 0v2¯(w)vk¯(w)0\leq v_{\underline{2}}(w)\leq\ldots\leq v_{\underline{k}}(w), and 0v2¯(w)vk¯(w)0\leq v_{\overline{2}}(w)\leq\ldots\leq v_{\overline{k}}(w). When clear from context, we omit ww and denote the thresholds by vs¯v_{\underline{s}} and vs¯v_{\overline{s}}. We remark that the two types of thresholds vs¯v_{\underline{s}} and vs¯v_{\overline{s}} are useful for different ways of boosting as specified in the remainder of this section.

Before stating the theorem, we need to first describe what it means to “partially” boost. The results given in this section are based on the framework of List PAC learning (Brukhim et al., 2022; Charikar and Pabbaraju, 2023). The connection between list learning and multiclass boosting was demonstrated by prior works (Brukhim et al., 2023a, b). Here we strengthen this link and generalize previous results to hold for arbitrary costs. We start with introducing list learning in Definition 32, followed by the statement of Theorem 33.

Definition 32 (ss-List PAC Learning (Brukhim et al., 2022; Charikar and Pabbaraju, 2023)).

An algorithm \mathcal{L} is a ss-list PAC learner for a class 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} if the following holds. For every distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}} and target function ff\in\mathcal{F}, and every ϵ,δ>0\epsilon,\delta>0, if SS is a set of mm(ϵ,δ)m\geq m(\epsilon,\delta) examples drawn i.i.d. from 𝒟\mathcal{D} and labeled by ff, then (S)\mathcal{L}(S) returns μ:𝒳𝒴s\mu:{\mathcal{X}}\rightarrow{\mathcal{Y}}^{s} such that, with probability at least 1δ1-\delta,

L𝒟(μ)Px𝒟[f(x)μ(x)]ϵ.L_{\mathcal{D}}(\mu)\triangleq\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}\bigl{[}f(x)\notin\mu(x)\bigr{]}~{}\leq~{}\epsilon\;.
Theorem 33.

Let 𝒴=[k]{\mathcal{Y}}=[k], let w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\rightarrow[0,1] be any cost function, and let z0z\geq 0. Let v2¯vk¯v_{\underline{2}}\leq\ldots\leq v_{\underline{k}} as defined in Equation 65, and let s<ks<k be the smallest integer for which z<vs+1¯z<v_{\underline{s+1}} or, if none exists, let s=ks=k. Then, the following claims both hold.

  • (w,z)(w,z) is ss-boostable: for every class 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}}, every (w,z)(w,z)-learner is boostable to an ss-list PAC learner.

  • (w,z)(w,z) is not (s1)(s-1)-boostable: for some 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} there exists a (w,z)(w,z)-learner that cannot be boosted to an ss^{\prime}-list PAC learner with s<ss^{\prime}<s.

In words, Theorem 33 implies that there is a partition of the loss values interval [0,1][0,1] into kk sub-intervals or “buckets”, based on vs¯v_{\underline{s}}. Then, any loss value zz in a certain bucket vs¯v_{\underline{s}} can be boosted to a list of size ss, but not below that. In Theorem 36 we show the converse: that any ss-list learner can be boosted to (w,z)(w,z)-learner where zz may be the lowest value within the ss-th interval vs¯v_{\overline{s}}, but not below it.

The first item of Theorem 33 holds trivially for zvk¯z\geq v_{\underline{k}}, since in that case s=ks=k and a kk-list learner always exists. The case z<vk¯z<v_{\underline{k}} is addressed by Lemma 34 below.

Lemma 34.

Let 𝒴=[k]{\mathcal{Y}}=[k], let w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\rightarrow[0,1] be a cost function, and define v2¯vk¯v_{\underline{2}}\leq\ldots\leq v_{\underline{k}} as in Equation 65. Let 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} and let 𝒜{\mathcal{A}} be a (w,z)(w,z)-learner for \mathcal{F} with sample complexity m0m_{0} such that z<vk¯z<v_{\underline{k}}. Let ss be the smallest integer in {1,,k1}\{1,\ldots,k-1\} such that z<vs+1¯z<v_{\underline{s+1}} and let γ=vs+1¯z>0\gamma=v_{\underline{s+1}}-z>0 be the margin. Fix any ff\in\mathcal{F}, let 𝒟\mathcal{D} be any distribution over 𝒳{\mathcal{X}}, and let S={(x1,y1),,(xm,ym)}S=\{(x_{1},y_{1}),\dots,(x_{m},y_{m})\} be a multiset of mm examples given by i.i.d. points x1,,xm𝒟x_{1},\dots,x_{m}\sim\mathcal{D} labeled by ff. Finally, fix any δ(0,1)\delta\in(0,1). If Algorithm 3 is given SS, oracle access to 𝒜{\mathcal{A}}, and parameters T=18ln(m)γ2T=\left\lceil\frac{18\ln(m)}{\gamma^{2}}\right\rceil, η=2ln(m)T\eta=\sqrt{\frac{2\ln(m)}{T}}, m^=m0(γ3,δT)\widehat{m}=m_{0}\bigl{(}\frac{\gamma}{3},\frac{\delta}{T}\bigr{)}, and σ=23γ\sigma=\frac{2}{3}\gamma, then Algorithm 3 makes TT calls to 𝒜{\mathcal{A}} and returns μS:𝒳𝒴s\mu_{S}:{\mathcal{X}}\to{\mathcal{Y}}^{s} such that with probability at least 1δ1-\delta:

yiμS(xi)i=1,,m.y_{i}\in\mu_{S}(x_{i})\qquad\forall i=1,\dots,m\;.
Proof.

Fix any ff\in\mathcal{F} and any ϵ,δ(0,1)\epsilon,\delta\in(0,1). Let 𝒟\mathcal{D} be any distribution over 𝒳{\mathcal{X}} and let S={(x1,y1),,(xm,ym)}S=\{(x_{1},y_{1}),\dots,(x_{m},y_{m})\} be a multiset of mm examples given by i.i.d. points x1,,xm𝒟x_{1},\dots,x_{m}\sim\mathcal{D} labeled by ff. The first part of this proof follows similar steps as the one of Lemma 12. In particular, fix any i[m]i\in[m] and observe that, again by the regret analysis of Hedge and by the definitions of η\eta and TT, Algorithm 3 guarantees

1Tt=1Tw(ht(xi),yi)γ3+1Tt=1TEj𝒟t[w(ht(xj),yj)].\frac{1}{T}\sum_{t=1}^{T}w(h_{t}(x_{i}),y_{i})\leq\frac{\gamma}{3}+\frac{1}{T}\sum_{t=1}^{T}{\mathbb E}_{j\sim\mathcal{D}_{t}}\Bigl{[}w(h_{t}(x_{j}),y_{j})\Bigr{]}\;. (66)

Furthermore, since 𝒜{\mathcal{A}} is a (w,z)(w,z)-learner for \mathcal{F}, and given the choices of m^\widehat{m} and ht=𝒜(St)h_{t}={\mathcal{A}}(S_{t}), by a union bound over t[T]t\in[T] we obtain that

1Tt=1TEj𝒟t[w(ht(xj),yj)]z+γ3\frac{1}{T}\sum_{t=1}^{T}{\mathbb E}_{j\sim\mathcal{D}_{t}}\Bigl{[}w(h_{t}(x_{j}),y_{j})\Bigr{]}\leq z+\frac{\gamma}{3} (67)

with probability at least 1δ1-\delta.

Now, we show that the list predictor μS\mu_{S} built by Algorithm 3 is consistent with SS with sufficiently large probability. Precisely, by conditioning on the event given by the inequality in Equation 67, we now demonstrate that yiμS(xi)y_{i}\in\mu_{S}(x_{i}) for every i[m]i\in[m]. Consider the function F:𝒳×𝒴[0,1]F:{\mathcal{X}}\times{\mathcal{Y}}\to[0,1] as defined by Algorithm 3. Then, by Equations 66 and 67 together with the definition of γ>0\gamma>0, we obtain for every i[m]i\in[m] that

𝒴F(xi,)w(,yi)=1Tt=1Tw(ht(xi),yi)z+23γ<z+γ=vs+1¯,\sum_{\ell\in{\mathcal{Y}}}F(x_{i},\ell)\cdot w(\ell,y_{i})=\frac{1}{T}\sum_{t=1}^{T}w(h_{t}(x_{i}),y_{i})\leq z+\frac{2}{3}\gamma<z+\gamma=v_{\underline{s+1}}\;, (68)

which in turn implies that yiμS(xi)y_{i}\in\mu_{S}(x_{i}) by construction of μS\mu_{S}.

We now proceed with a deterministic characterization of the lists returned by μS\mu_{S}, independently of any conditioning. Fix any x𝒳x\in{\mathcal{X}}. Let 𝐩xΔ𝒴\mathbf{p}^{x}\in\Delta_{\mathcal{Y}} be such that px=F(x,)p^{x}_{\ell}=F(x,\ell) for any 𝒴\ell\in{\mathcal{Y}}, and observe that the sum involved within the condition for y𝒴y\in{\mathcal{Y}} in the construction of the list μS(x)\mu_{S}(x) is equal to w(𝐩x,y)w(\mathbf{p}^{x},y). Furthermore, it must be true that s<ks<k since γ>0\gamma>0. We can then show that the list μS(x)\mu_{S}(x) satisfies

vs+1¯>maxyμS(x)w(𝐩x,y)=max𝐪ΔμS(x)w(𝐩x,𝐪)min𝐩Δ𝒴max𝐪ΔμS(x)w(𝐩,𝐪)=VμS(x)(w),v_{\underline{s+1}}>\max_{y\in\mu_{S}(x)}w(\mathbf{p}^{x},y)=\max_{\mathbf{q}\in\Delta_{\mu_{S}(x)}}w(\mathbf{p}^{x},\mathbf{q})\geq\min_{\mathbf{p}\in\Delta_{\mathcal{Y}}}\max_{\mathbf{q}\in\Delta_{\mu_{S}(x)}}w(\mathbf{p},\mathbf{q})=\operatorname{V}_{\mu_{S}(x)}(w)\;, (69)

where the first inequality holds by definition of μS(x)\mu_{S}(x). Consequently, after observing that VJ(w)VJ(w)\operatorname{V}_{J}(w)\leq\operatorname{V}_{J^{\prime}}(w) if JJJ\subseteq J^{\prime}, note that

VμS(x)(w)<vs+1¯=minJ(𝒴s+1)VJ(w)=minJ𝒴:|J|s+1VJ(w),\operatorname{V}_{\mu_{S}(x)}(w)<v_{\underline{s+1}}=\min_{J\in\binom{{\mathcal{Y}}}{s+1}}\operatorname{V}_{J}(w)=\min_{J\subseteq{\mathcal{Y}}:|J|\geq s+1}\operatorname{V}_{J}(w)\;, (70)

which in turn implies that |μS(x)|s|\mu_{S}(x)|\leq s. We can thus assume that μS\mu_{S} outputs elements of 𝒴s{\mathcal{Y}}^{s} without loss of generality. ∎

The following lemma proves that, in a similar way as Lemma 13 for the binary setting, the list predictor output by Algorithm 3 has a sufficiently small generalization error provided that the sample SS has size mm sufficiently large. The main idea to prove generalization is again via a sample compression scheme, but relying instead of novel results by Brukhim et al. (2023a) for multiclass classification. This generalization result, combined with Lemma 34, suffices to prove Theorem 33.

Lemma 35.

Assume the hypotheses of Lemma 34. For any ϵ,δ(0,1)\epsilon,\delta\in(0,1), if the size mm of the sample given to Algorithm 3 satisfies

mln(2/δ)ϵ+m0(γ3,δ2T)T(1+ln(m)ϵ),m\geq\frac{\ln(2/\delta)}{\epsilon}+m_{0}\left(\frac{\gamma}{3},\frac{\delta}{2T}\right)\cdot T\cdot\left(1+\frac{\ln(m)}{\epsilon}\right)\;,

then with probability at least 1δ1-\delta the output μS\mu_{S} of Algorithm 3 satisfies L𝒟(μS)ϵL_{\mathcal{D}}(\mu_{S})\leq\epsilon. Therefore, Algorithm 3 is an ss-list PAC learner for \mathcal{F}. Moreover, one can ensure that Algorithm 3 makes O(ln(m)γ2)O\bigl{(}\frac{\ln(m)}{\gamma^{2}}\bigr{)} calls to 𝒜{\mathcal{A}} and has sample complexity m(ϵ,δ)=O~(m0(γ3,δ2T)ln(1/δ)ϵγ2)m(\epsilon,\delta)=\widetilde{O}\left(m_{0}\left(\frac{\gamma}{3},\frac{\delta}{2T}\right)\cdot\frac{\ln(1/\delta)}{\epsilon\cdot\gamma^{2}}\right).

Proof.

By analogy with the proof of Lemma 13, we first apply Lemma 34 with δ/2\delta/2 in place of δ\delta in order to obtain an ss-list predictor μS\mu_{S} consistent with SS with probability at least 1δ/21-\delta/2. We then apply a compression-based generalization argument. To do so, we remark once again that one can construct a compression scheme for μS\mu_{S} of size equal to the total size of the samples on which 𝒜{\mathcal{A}} operates, which is equal to κ=m^T\kappa=\widehat{m}\cdot T. The main difference with the binary case is that we rely on the generalization bound for a sample compression scheme for lists as per Theorem 6 of Brukhim et al. (2023a), with δ/2\delta/2 in place of δ\delta; we can employ this generalization bound thanks to the consistency of μS\mu_{S} with SS (with sufficiently large probability). Then, this implies that

L𝒟(μS)=Px𝒟[f(x)μS(x)]κln(m)+ln(2/δ)mκL_{\mathcal{D}}(\mu_{S})=\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}\bigl{[}f(x)\notin\mu_{S}(x)\bigr{]}\leq\frac{\kappa\ln(m)+\ln(2/\delta)}{m-\kappa} (71)

holds with probability at least 1δ/21-\delta/2. By similar calculations as in Equations 18 and 19, and replacing the values of κ\kappa and m^\widehat{m}, the right-hand side of the above inequality can be easily show to be at most ϵ\epsilon whenever:

mln(2/δ)ϵ+m0(γ3,δ2T)T(1+ln(m)ϵ).m\geq\frac{\ln(2/\delta)}{\epsilon}+m_{0}\left(\frac{\gamma}{3},\frac{\delta}{2T}\right)\cdot T\cdot\left(1+\frac{\ln(m)}{\epsilon}\right)\;. (72)

A union bound concludes the proof for the first part of the claim, since it shows that Algorithm 3 is an ss-list PAC learner for \mathcal{F}. The sample complexity of Algorithm 3 then immediately follows by definition of TT. ∎

Next, we prove a kind of converse of Theorem 33. That is, one can convert a list learner to a weak learner. This results is stated formally as follows.

Theorem 36 (ss-List \Rightarrow Weak Learning).

There exists an algorithm {\mathcal{B}} that for every 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} satisfies what follows. Let w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\rightarrow[0,1] be a cost function, let z0z\geq 0, and let v2¯vk¯v_{\overline{2}}\leq\ldots\leq v_{\overline{k}} as defined in Equation (65). Let s<ks<k be the smallest integer for which z<vs+1¯z<v_{\overline{s+1}}, or if none exists let s=ks=k. Given oracle access to a ss-list PAC learner \mathcal{L} for \mathcal{F} with sample complexity mm_{\mathcal{L}}, algorithm {\mathcal{B}} is a (w,z)(w,z)-learner for \mathcal{F} with sample complexity mm_{\mathcal{L}} that makes one oracle call to \mathcal{L}.

Proof.

Fix any ff\in\mathcal{F} and any distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}}, and suppose {\mathcal{B}} is given a sample SS of size mm(ϵ,δ)m\geq m_{\mathcal{L}}(\epsilon,\delta) consisting of examples drawn i.i.d. from 𝒟\mathcal{D} and labeled by ff. First, {\mathcal{B}} calls \mathcal{L} on SS. By hypothesis, then, \mathcal{L} returns μS:𝒳𝒴s\mu_{S}:{\mathcal{X}}\to{\mathcal{Y}}^{s} that with probability at least 1δ1-\delta satisfies:

Px𝒟[f(x)μS(x)]ϵ.\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}\bigl{[}f(x)\notin\mu_{S}(x)\bigr{]}\leq\epsilon\;. (73)

Conditioning on the event above, we give a randomized predictor hSh_{S} such that L𝒟w(hS)z+ϵL_{\mathcal{D}}^{w}(h_{S})\leq z+\epsilon. First, for any nonempty J𝒴J\subseteq{\mathcal{Y}} let 𝐩JΔ𝒴\mathbf{p}_{J}\in\Delta_{\mathcal{Y}} be the minimax distribution achieving the value of the game restricted to JJ,

𝐩J=argmin𝐩Δ𝒴(max𝐪ΔJw(𝐩,𝐪)).\displaystyle\mathbf{p}_{J}=\arg\min_{\mathbf{p}\in\Delta_{\mathcal{Y}}}\left(\max_{\mathbf{q}\in\Delta_{J}}w(\mathbf{p},\mathbf{q})\right)\;. (74)

Note that 𝐩J\mathbf{p}_{J} can be efficiently computed via a linear program. Then, simply define hS(x)𝐩μS(x)h_{S}(x)\sim\mathbf{p}_{\mu_{S}(x)}.

Let us analyse the loss of hSh_{S} under 𝒟\mathcal{D}. First, by the law of total probability, and since w1\|w\|_{\infty}\leq 1,

L𝒟w(hS)\displaystyle L_{\mathcal{D}}^{w}(h_{S}) PxD[f(x)μS(x)]+JPx𝒟[μS(x)=Jf(x)J]L𝒟Jw(hS),\displaystyle\leq\operatorname{\mathbb{P}}_{x\sim D}[f(x)\notin\mu_{S}(x)]+\sum_{J}\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}[\mu_{S}(x)=J\wedge f(x)\in J]\cdot L_{\mathcal{D}_{J}}^{w}(h_{S})\;, (75)

where the summation is over all J𝒴J\subseteq{\mathcal{Y}} with Px𝒟[μS(x)=J]>0\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}[\mu_{S}(x)=J]>0, and 𝒟J\mathcal{D}_{J} is the distribution obtained from 𝒟\mathcal{D} by conditioning on the event μS(x)=Jf(x)J\mu_{S}(x)=J\wedge f(x)\in J. Consider the right-hand side of Equation 75. By Equation 73, the first term is at most ϵ\epsilon. For the second term, denote by 𝐪J\mathbf{q}_{J} the marginal of 𝒟J\mathcal{D}_{J} over 𝒴{\mathcal{Y}}; note that, crucially, 𝐪JΔJ\mathbf{q}_{J}\in\Delta_{J}. Therefore, by definition of hSh_{S} and of ss:

L𝒟Jw(hS)=w(𝐩J,𝐪J)max𝐪ΔJw(𝐩J,𝐪)=VJ(w)vs¯z.\displaystyle L_{\mathcal{D}_{J}}^{w}(h_{S})=w(\mathbf{p}_{J},\mathbf{q}_{J})\leq\max_{\mathbf{q}\in\Delta_{J}}w(\mathbf{p}_{J},\mathbf{q})=\operatorname{V}_{J}(w)\leq v_{\overline{s}}\leq z\;. (76)

We conclude that L𝒟w(hS)z+ϵL_{\mathcal{D}}^{w}(h_{S})\leq z+\epsilon. ∎

5.4 Lower bounds

We start proving the second item of C.

Lemma 37.

Let 𝒴=[k]{\mathcal{Y}}=[k] and let w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\rightarrow[0,1] be any cost function. Let 0=v1(w)v2(w)vτ(w)0=v_{1}(w)\leq v_{2}(w)\leq\cdots\leq v_{\tau}(w) as defined in Equation 6, and let nn be the largest integer for which vn(w)zv_{n}(w)\leq z. Then there exists a class 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} that has a (w,z)(w,z)-learner but no (w,z)(w,z^{\prime})-learner for any z<vn(w)z^{\prime}<v_{n}(w).

Proof.

If vn(w)=0v_{n}(w)=0 then the claim is trivial, hence we assume vn(w)>0v_{n}(w)>0. Observe that this implies that there exists some JJ with |J|2|J|\geq 2 such that vn(w)=VJ(w)v_{n}(w)=V_{J}(w). Let 𝐩Δ𝒴\mathbf{p}^{*}\in\Delta_{\mathcal{Y}} be the distribution achieving VJ(w)\operatorname{V}_{J}(w), that is:

𝐩=argmin𝐩Δ𝒴max𝐪ΔJw(𝐩,𝐪).\mathbf{p}^{*}=\mathop{\rm arg\,min}_{\mathbf{p}\in\Delta_{\mathcal{Y}}}\max_{\mathbf{q}\in\Delta_{J}}w(\mathbf{p},\mathbf{q})\;. (77)

Now let 𝒳{\mathcal{X}} be any infinite domain (e.g., 𝒳=N{\mathcal{X}}=\mathbb{N}) and let =J𝒳\mathcal{F}=J^{\mathcal{X}}. For every distribution 𝒟\mathcal{D} over 𝒳{\mathcal{X}} and every labeling f:𝒳Jf:{\mathcal{X}}\to J,

L𝒟w(𝐩)\displaystyle L_{\mathcal{D}}^{w}(\mathbf{p}^{*}) =w(𝐩,𝐪𝒟)max𝐪ΔJw(𝐩,𝐪)=VJ(w)=vn(w)z,\displaystyle=w(\mathbf{p}^{*},\mathbf{q}_{\mathcal{D}})\leq\max_{\mathbf{q}\in\Delta_{J}}w(\mathbf{p}^{*},\mathbf{q})=\operatorname{V}_{J}(w)=v_{n}(w)\leq z\;, (78)

where 𝐪𝒟ΔJ\mathbf{q}_{\mathcal{D}}\in\Delta_{J} is the marginal of 𝒟\mathcal{D} over 𝒴{\mathcal{Y}}. Thus the learner that returns the random guess hh based on 𝐩\mathbf{p}^{*} is a (w,z)(w,z)-learner for \mathcal{F}.

Next, we show that \mathcal{F} has no (w,z)(w,z^{\prime})-learner with z<vn(w)z^{\prime}<v_{n}(w). Suppose indeed towards a contradiction that there exists such a (w,z)(w,z^{\prime})-learner 𝒜{\mathcal{A}}. By Theorem 24, then, \mathcal{F} admits a (w,z)(w,z^{\prime})-list learner \mathcal{L}. Now, as z<vn=VJ(w)z^{\prime}<v_{n}=\operatorname{V}_{J}(w), the list function μ\mu returned by \mathcal{L} ensures Jμ(x)J\not\subseteq\mu(x) for all x𝒳x\in{\mathcal{X}}. Moreover, as F=J𝒳F=J^{\mathcal{X}}, we can assume without loss of generality that μ(x)J\mu(x)\subseteq J for all x𝒳x\in{\mathcal{X}}; otherwise we could remove the elements of μ(x)J\mu(x)\setminus J to obtain a list μ(x)\mu^{\prime}(x) such that Vμ(x)(w)Vμ(x)(w)\operatorname{V}_{\mu^{\prime}(x)}(w)\leq\operatorname{V}_{\mu(x)}(w) and that Px𝒟[f(x)μ(x)]Px𝒟[f(x)μ(x)]\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}[f(x)\in\mu^{\prime}(x)]\geq\operatorname{\mathbb{P}}_{x\sim\mathcal{D}}[f(x)\in\mu(x)]. It follows that μ(x)J\mu(x)\subsetneq J for all x𝒳x\in{\mathcal{X}}. Therefore, \mathcal{L} is a (|J|1)(|J|-1)-list PAC learner. However, one can verify that the (|J|1)(|J|-1)-Daniely-Shalev-Shwartz dimension of \mathcal{F} is unbounded (see Definition 6 of Charikar and Pabbaraju (2023)). It follows by Charikar and Pabbaraju (2023, Theorem 1) that \mathcal{F} is not (|J|1)(|J|-1)-list PAC learnable, yielding a contradiction. ∎

Next, we prove the second item of Theorem 33.

Lemma 38.

Let 𝒴=[k]{\mathcal{Y}}=[k], let w:𝒴2[0,1]w:{\mathcal{Y}}^{2}\rightarrow[0,1] be any cost function, and let v2¯vk¯v_{\underline{2}}\leq\ldots\leq v_{\underline{k}} as defined in Equation (65). Let zv2¯z\geq v_{\underline{2}} and let sks\leq k be the largest integer for which zvs¯z\geq v_{\underline{s}}. Then, there exists a class 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} that has a (w,z)(w,z)-learner and is not (s1)(s-1)-list PAC learnable.

Proof.

The proof follows closely the one of Lemma 37. Let J=argmin{VJ(w):J(𝒴s)}J=\arg\min\left\{\operatorname{V}_{J^{\prime}}(w):J^{\prime}\in{{\mathcal{Y}}\choose s}\right\}. Let 𝒳{\mathcal{X}} be any infinite domain (e.g., 𝒳=N{\mathcal{X}}=\mathbb{N}) and let =J𝒳\mathcal{F}=J^{\mathcal{X}}. As in the proof of Lemma 37 there is a distribution 𝐩\mathbf{p}^{*} yielding a (w,z)(w,z)-learner for \mathcal{F}. Simultaneously (again from the proof of Lemma 37), the class \mathcal{F} is not (s1)(s-1)-list PAC learnable. ∎

Finally, we prove the second item of D.

Lemma 39.

Let 𝒴=[k]{\mathcal{Y}}=[k], let 𝐰=(w1,,wr)\boldsymbol{w}=(w_{1},\ldots,w_{r}) where each wi:𝒴2[0,1]w_{i}:{\mathcal{Y}}^{2}\rightarrow[0,1] is a cost function, and let 𝐳,𝐳[0,1]r\boldsymbol{z},\boldsymbol{z}^{\prime}\in[0,1]^{r} such that 𝐳𝐰𝐳\boldsymbol{z}\not\preceq_{\boldsymbol{w}}\boldsymbol{z}^{\prime}. There exists a class 𝒴𝒳\mathcal{F}\subseteq{\mathcal{Y}}^{\mathcal{X}} that admits a (𝐰,𝐳)(\boldsymbol{w},\boldsymbol{z})-learner that is trivial, and therefore cannot be boosted to a (𝐰,𝐳)(\boldsymbol{w},\boldsymbol{z}^{\prime})-learner.

Proof.

By assumption that 𝒛𝒘𝒛\boldsymbol{z}\not\preceq_{\boldsymbol{w}}\boldsymbol{z}^{\prime}, there must be a set J𝒴J\subseteq{\mathcal{Y}} for which 𝒛DJ(𝒘)\boldsymbol{z}\in D_{J}(\boldsymbol{w}) and 𝒛DJ(𝒘)\boldsymbol{z}^{\prime}\notin D_{J}(\boldsymbol{w}). Let 𝒳{\mathcal{X}} be any infinite domain (e.g., 𝒳=N{\mathcal{X}}=\mathbb{N}) and let =J𝒳\mathcal{F}=J^{\mathcal{X}}. By definition of DJ(𝒘)D_{J}(\boldsymbol{w}) there exists a trivial learner that is a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner for \mathcal{F}. Assume towards contradiction that the trivial learner can be used to construct a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z}^{\prime})-learner for \mathcal{F}. Since 𝒛DJ(𝒘)\boldsymbol{z}^{\prime}\notin D_{J}(\boldsymbol{w}), then by Proposition 19 there exists 𝜶Δr\boldsymbol{\alpha}\in\Delta_{r} such that 𝜶,𝒛<VJ(w𝜶)\langle\boldsymbol{\alpha},\boldsymbol{z}^{\prime}\rangle<\operatorname{V}_{J}(w_{\boldsymbol{\alpha}}), where w𝜶=i=1rαiwiw_{\boldsymbol{\alpha}}=\sum_{i=1}^{r}\alpha_{i}w_{i}. Then, by Theorem 24, there is a (w𝜶,z𝜶)(w_{\boldsymbol{\alpha}},z_{\boldsymbol{\alpha}}^{\prime})-list learner \mathcal{L}^{\prime} that outputs list functions μ\mu such that, for every x𝒳x\in{\mathcal{X}} it holds that:

Vμ(x)(w𝜶)z𝜶<VJ(w𝜶).V_{\mu(x)}(w_{\boldsymbol{\alpha}})\leq z_{\boldsymbol{\alpha}}^{\prime}<\operatorname{V}_{J}(w_{\boldsymbol{\alpha}}).

Thus \mathcal{L}^{\prime} is a (|J|1)(|J|-1)-list PAC learner for \mathcal{F}. However, one can verify that the (|J|1)(|J|-1)-Daniely-Shalev-Shwartz dimension of \mathcal{F} is unbounded (see Definition 6 of Charikar and Pabbaraju (2023)). It follows by Charikar and Pabbaraju (2023, Theorem 1) that \mathcal{F} is not (|J|1)(|J|-1)-list PAC learnable, yielding a contradiction. ∎

References

  • Allwein et al. [2000] Erin L Allwein, Robert E Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of machine learning research, 1(Dec):113–141, 2000.
  • Ben-David et al. [2001] Shai Ben-David, Philip M Long, and Yishay Mansour. Agnostic boosting. In International Conference on Computational Learning Theory, pages 507–516. Springer, 2001.
  • Blackwell [1956] D Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.
  • Brukhim et al. [2021] Nataly Brukhim, Elad Hazan, Shay Moran, and Robert E. Schapire. Multiclass boosting and the cost of weak learning. In Advances in Neural Information Processing Systems, NeurIPS, 2021.
  • Brukhim et al. [2022] Nataly Brukhim, Daniel Carmon, Irit Dinur, Shay Moran, and Amir Yehudayoff. A characterization of multiclass learnability. In Annual Symposium on Foundations of Computer Science, FOCS, 2022.
  • Brukhim et al. [2023a] Nataly Brukhim, Amit Daniely, Yishay Mansour, and Shay Moran. Multiclass boosting: simple and intuitive weak learning criteria. In Advances in Neural Information Processing Systems, NeurIPS, 2023a.
  • Brukhim et al. [2023b] Nataly Brukhim, Steve Hanneke, and Shay Moran. Improper multiclass boosting. In Proceedings of the Conference on Learning Theory, COLT, 2023b.
  • Canonne [2020] Clément L. Canonne. A short note on learning discrete distributions, 2020. URL https://arxiv.org/abs/2002.11457.
  • Charikar and Pabbaraju [2023] Moses Charikar and Chirag Pabbaraju. A characterization of list learnability. In Proceedings of the Annual Symposium on Theory of Computing, STOC, 2023.
  • Durgin and Juba [2019] Alexander Durgin and Brendan Juba. Hardness of improper one-sided learning of conjunctions for all uniformly falsifiable CSPs. In Proceedings of the International Conference on Algorithmic Learning Theory, ALT, 2019.
  • Ehrgott [2005] Matthias Ehrgott. Multicriteria optimization, volume 491. Springer Science & Business Media, 2005.
  • Elkan [2001] Charles Elkan. The foundations of cost-sensitive learning. In Bernhard Nebel, editor, Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, pages 973–978, 2001.
  • Fan et al. [1999] Wei Fan, Salvatore J. Stolfo, Junxin Zhang, and Philip K. Chan. Adacost: Misclassification cost-sensitive boosting. In Proceedings of the International Conference on Machine Learning, ICML, 1999.
  • Feldman [2009] Vitaly Feldman. Distribution-specific agnostic boosting. arXiv preprint arXiv:0909.2927, 2009.
  • Freund [1990] Yoav Freund. Boosting a weak learning algorithm by majority. In Proceedings of the Workshop on Computational Learning Theory, COLT, 1990. URL http://dl.acm.org/citation.cfm?id=92640.
  • Freund and Schapire [1997] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119–139, 1997. URL https://doi.org/10.1006/jcss.1997.1504.
  • Green Larsen and Ritzert [2022] Kasper Green Larsen and Martin Ritzert. Optimal weak to strong learning. Advances in Neural Information Processing Systems, 35:32830–32841, 2022.
  • Kalai and Servedio [2005] Adam Tauman Kalai and Rocco A. Servedio. Boosting in the presence of noise. J. Comput. Syst. Sci., 71(3):266–290, 2005. doi: 10.1016/j.jcss.2004.10.015. URL https://doi.org/10.1016/j.jcss.2004.10.015.
  • Kalai et al. [2008] Adam Tauman Kalai, Yishay Mansour, and Elad Verbin. On agnostic boosting and parity learning. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 629–638, 2008.
  • Kalai et al. [2012] Adam Tauman Kalai, Varun Kanade, and Yishay Mansour. Reliable agnostic learning. Journal of Computer and System Sciences, 78(5):1481–1495, 2012.
  • Kanade and Kalai [2009] Varun Kanade and Adam Kalai. Potential-based agnostic boosting. In Advances in neural information processing systems, NIPS, 2009.
  • Kanade and Thaler [2014] Varun Kanade and Justin Thaler. Distribution-independent reliable learning. In Proceedings of the Conference on Learning Theory, COLT, 2014.
  • Karakoulas and Shawe-Taylor [1998] Grigoris I. Karakoulas and John Shawe-Taylor. Optimizing classifers for imbalanced training sets. In Advances in Neural Information Processing Systems, NIPS, 1998.
  • Kearns [1988] M. Kearns. Thoughts on hypothesis boosting. Unpublished, December 1988.
  • Kivinen [1995] Jyrki Kivinen. Learning reliably and with one-sided error. Mathematical systems theory, 28(2):141–172, 1995.
  • Landesa-Vazquez and Alba-Castro [2012] Iago Landesa-Vazquez and José Luis Alba-Castro. Shedding light on the asymmetric learning capability of adaboost. Pattern Recognition Letters, 33(3):247–255, 2012. URL https://doi.org/10.1016/j.patrec.2011.10.022.
  • Landesa-Vazquez and Alba-Castro [2013] Iago Landesa-Vazquez and José Luis Alba-Castro. Double-base asymmetric adaboost. Neurocomputing, 118:101–114, 2013. URL https://doi.org/10.1016/j.neucom.2013.02.019.
  • Landesa-Vázquez and Alba-Castro [2015] Iago Landesa-Vázquez and José Luis Alba-Castro. Revisiting adaboost for cost-sensitive classification. part i: Theoretical perspective. arXiv preprint arXiv:1507.04125, 2015.
  • Ling and Sheng [2008] Charles X Ling and Victor S Sheng. Cost-sensitive learning and the class imbalance problem. Encyclopedia of Machine Learning, 2011:231–235, 2008.
  • Ling and Sheng [2017] Charles X. Ling and Victor S. Sheng. Cost-sensitive learning. In Encyclopedia of Machine Learning and Data Mining. Springer, 2017. URL https://doi.org/10.1007/978-1-4899-7687-1_181.
  • Long and Servedio [2008] Philip M. Long and Rocco A. Servedio. Adaptive martingale boosting. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 977–984. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3623-adaptive-martingale-boosting.
  • Masnadi-Shirazi and Vasconcelos [2011] Hamed Masnadi-Shirazi and Nuno Vasconcelos. Cost-sensitive boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2):294–309, 2011. URL https://doi.org/10.1109/TPAMI.2010.71.
  • Møller Høgsgaard et al. [2024] Mikael Møller Høgsgaard, Kasper Green Larsen, and Markus Engelund Mathiasen. The many faces of optimal weak-to-strong learning. arXiv e-prints, pages arXiv–2408, 2024.
  • Mukherjee and Schapire [2011] Indraneel Mukherjee and Robert E Schapire. A theory of multiclass boosting. Journal of Machine Learning Research, 14:437–497, 2011. URL http://rob.schapire.net/papers/multiboost-journal.pdfhttp://arxiv.org/abs/1108.2989.
  • Nikolaou et al. [2016] Nikolaos Nikolaou, Narayanan Unny Edakunni, Meelis Kull, Peter A. Flach, and Gavin Brown. Cost-sensitive boosting algorithms: Do we really need them? Machine Learning, 104(2-3):359–384, 2016. URL https://doi.org/10.1007/s10994-016-5572-x.
  • Raman and Tewari [2022] Vinod Raman and Ambuj Tewari. Online agnostic multiclass boosting. Advances in Neural Information Processing Systems, 35:25908–25920, 2022.
  • Raman et al. [2019] Vinod Raman, Daniel T Zhang, Young Hun Jung, and Ambuj Tewari. Online boosting for multilabel ranking with top-k feedback. arXiv preprint arXiv:1910.10937, 2019.
  • Sabato and Tishby [2012] Sivan Sabato and Naftali Tishby. Multi-instance learning with any hypothesis class. The Journal of Machine Learning Research, 13(1):2999–3039, 2012.
  • Schapire [1990] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990. URL https://doi.org/10.1007/BF00116037.
  • Schapire and Freund [2012] Robert E Schapire and Yoav Freund. Boosting: Foundations and algorithms. Cambridge university press, 2012.
  • Schapire and Singer [1999] Robert E Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.
  • Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
  • Sun et al. [2007] Yanmin Sun, Mohamed S. Kamel, Andrew K. C. Wong, and Yang Wang. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12):3358–3378, 2007. URL https://doi.org/10.1016/j.patcog.2007.04.009.
  • Ting [2000] Kai Ming Ting. A comparative study of cost-sensitive boosting algorithms. In Proceedings of the International Conference on Machine Learning, ICML, 2000.
  • Valiant [1984] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
  • Viola and Jones [2001] Paul A. Viola and Michael J. Jones. Fast and robust classification using asymmetric adaboost and a detector cascade. In Advances in Neural Information Processing Systems, NIPS, 2001.
  • von Neumann [1928] John von Neumann. Zur theorie der gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
  • Zadrozny et al. [2003] Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-proportionate example weighting. In International Conference on Data Mining, ICDM, 2003.

Appendix A Cost-Sensitive Learning

Cost sensitive learning has a long history in machine learning. The two main motivations are that, first, not all errors have the same impact, and second, that there might be a class-imbalance between the frequency of different classes. Over the years there have been many workshops on cost-sensitive learning in ICML (2000) NIPS (2008) SDM (2018), yet another indication of the impostance of cost-sensitive loss. See, e.g., Ling and Sheng [2017, 2008]. Recall the definition of the cost sensitive loss:

L𝒟w(h)=Ex𝒟[w(h(x),f(x))].L_{\mathcal{D}}^{w}(h)={\mathbb E}_{x\sim\mathcal{D}}\Bigl{[}w(h(x),f(x))\Bigr{]}\,. (79)

Given the the distribution 𝒟\mathcal{D} on 𝒳{\mathcal{X}}, the Bayes optimal prediction rule is given by:

argmini𝒴j𝒴w(i,j)P[f(x)=jx]\displaystyle\arg\min_{i\in{\mathcal{Y}}}\sum_{j\in{\mathcal{Y}}}w(i,j)\operatorname{\mathbb{P}}[f(x)=j\mid x] (80)

for every x𝒳x\in{\mathcal{X}}. The early works include Elkan [2001], which characterizes the Bayes optimal predictor as a threshold, and its implication for re-balancing and decision tree learning. In fact, this resembles our binary prediction rule.

Sample complexity for cost-sensitive leaning for large margin appears in Karakoulas and Shawe-Taylor [1998]. Additional sample complexity bounds, for cost-sensitive learning, based on transformation of the learning algorithms and using rejection sampling is found in Zadrozny et al. [2003]. The idea of cost sensitive learning has a long history in statistics. For binary classification, there are many different metrics used for evaluation. Let w=(w++w+w+w)w=\begin{pmatrix}w_{++}&w_{+-}\\ w_{-+}&w_{--}\end{pmatrix}. The false-positive (FP) is w+w_{+-}, the false-negative is w+w_{-+}, the precision is w++/(w+++w+)w_{++}/(w_{++}+w_{+-}), the recall is w++/(w+++w+)w_{++}/(w_{++}+w_{-+}), and more.

A.1 Boosting cost-sensitive loss

There has been significant amount of work with the motivation of adapting AdaBoost to a cost-sensitive loss. At a high level, the different proposals either modify the way the algorithm updates its weights, taking in to account the cost-sensitive loss, or changes the final prediction rule. Nikolaou et al. [2016] give an overview on various AdaBoost variants for the cost-sensitive case. See also Landesa-Vázquez and Alba-Castro [2015]. Our modified update rule of Hedge in Algorithm 1 corresponds to CGAda of Sun et al. [2007] and related AdaBoost variants.

The theoretically sound proposal focused on deriving similar guarantees to that of AdaBoost. Cost-sensitive boosting by Masnadi-Shirazi and Vasconcelos [2011] modified the exponential updates to include the cost-sensitive loss. Cost-Generalized AdaBoost by Landesa-Vazquez and Alba-Castro [2012] modifies the initial distribution over the examples. AdaboostDB by Landesa-Vazquez and Alba-Castro [2013] modifies the base of the exponents used in the updates. All the theoretically sound proposal are aimed to guarantee convergence under the standard weak-leaning assumption. Their main objective is to derive better empirical results when faced with a cost-sensitive loss. However, they do not address the essential question, when is boosting possible? In particular, they do not study cost-sensitive variants weak learners and do not characterize the boostability of such learners. In addition to the papers above, there have been many heuristic modifications of AdaBoost which try to address the cost-sensitive loss [Karakoulas and Shawe-Taylor, 1998, Fan et al., 1999, Ting, 2000, Viola and Jones, 2001, Sun et al., 2007].

A.2 Multi-objective learning and boosting

Learning with multiple objectives is also common in machine learning. A well studied special case are variants of learning with one-sided error [Kivinen, 1995, Sabato and Tishby, 2012]. A typical goal in one-sided learning or the related positive-reliable learning [Kalai et al., 2012, Kanade and Thaler, 2014, Durgin and Juba, 2019] is to guarantee an (almost) 0 false positive loss and simultaneously a low false negative loss. This corresponds to a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner with 𝒘=(w+,w)\boldsymbol{w}=(w_{+},w_{-}) and 𝒛=(0,ϵ)\boldsymbol{z}=(0,\epsilon). More generally, Kalai et al. [2012] also considered tolerant reliable learning with an arbitrary 𝒛=(z+,z)\boldsymbol{z}=(z_{+},z_{-}). Our results apply in this context. For example in the binary case, our results show that a (𝒘,𝒛)(\boldsymbol{w},\boldsymbol{z})-learner (i.e., a 𝒛\boldsymbol{z}-tolerant-reliable one) is boostable if and only if z++z<1\sqrt{z_{+}}+\sqrt{z_{-}}<1. Moreover, our results also imply boostability and learnability results for reliable learning in the multi-class case; a learning setting mostly over-looked so far.

Appendix B Multiclass Boosting

Boosting is a fundamental methodology in machine learning that can boost the accuracy of weak learning rules into a strong one. Boosting theory was originally designed for binary classification. The study of boosting was initiating in a line of seminal works which include the celebrated Adaboost algorithm, as well an many other algorithms with various applications, (see e.g. Kearns [1988], Schapire [1990], Freund [1990], Freund and Schapire [1997]). It was later adapted to other settings and was extensively studied in broader contexts as well [Ben-David et al., 2001, Kalai and Servedio, 2005, Long and Servedio, 2008, Kalai et al., 2008, Kanade and Kalai, 2009, Feldman, 2009, Møller Høgsgaard et al., 2024, Green Larsen and Ritzert, 2022, Raman and Tewari, 2022, Raman et al., 2019].

There are also various extension of boosting to the multiclass setting. The early extensions include AdaBoost.MH, AdaBoost.MR, and approaches based on Error-Correcting Output Codes (ECOC) [Schapire and Singer, 1999, Allwein et al., 2000]. These works often reduce the kk-class task into a single binary task. The binary reduction can have various problems, including increased complexity, and lack of guarantees of an optimal joint predictor.

Notably, a work by Mukherjee and Schapire [2011] established a theoretical framework for multiclass boosting, which generalizes previous learning conditions. However, this requires the assumption that the weak learner minimizes a complicated loss function, that is significantly different from simple classification error.

More recently, there have been several works on multiclass boosting. First, Brukhim et al. [2021] followed a formulation for multiclass boosting similar to that of Mukherjee and Schapire [2011]. They proved a certain hardness result showing that a broad, yet restricted, class of boosting algorithms must incur a cost which scales polynomially with |𝒴||{\mathcal{Y}}|. Then, Brukhim et al. [2023b] and Brukhim et al. [2023a] demonstrated efficient methods for multiclass boosting. We note that our work generalizes the results given in Brukhim et al. [2023a] to the cost sensitive setting, as detailed in Section 5.