Of Dice and Games:
A Theory of Generalized Boosting
Abstract
Cost-sensitive loss functions are crucial in many real-world prediction problems, where different types of errors are penalized differently; for example, in medical diagnosis, a false negative prediction can lead to worse consequences than a false positive prediction. However, traditional PAC learning theory has mostly focused on the symmetric 0-1 loss, leaving cost-sensitive losses largely unaddressed. In this work, we extend the celebrated theory of boosting to incorporate both cost-sensitive and multi-objective losses. Cost-sensitive losses assign costs to the entries of a confusion matrix, and are used to control the sum of prediction errors accounting for the cost of each error type. Multi-objective losses, on the other hand, simultaneously track multiple cost-sensitive losses, and are useful when the goal is to satisfy several criteria at once (e.g., minimizing false positives while keeping false negatives below a critical threshold).
We develop a comprehensive theory of cost-sensitive and multi-objective boosting, providing a taxonomy of weak learning guarantees that distinguishes which guarantees are trivial (i.e., can always be achieved), which ones are boostable (i.e., imply strong learning), and which ones are intermediate, implying non-trivial yet not arbitrarily accurate learning. For binary classification, we establish a dichotomy: a weak learning guarantee is either trivial or boostable. In the multiclass setting, we describe a more intricate landscape of intermediate weak learning guarantees. Our characterization relies on a geometric interpretation of boosting, revealing a surprising equivalence between cost-sensitive and multi-objective losses.
Keywords: Boosting, Minimax theorem, cost-sensitive learning, multi-objective learning, Blackwell’s approachability.
1 Introduction
In many machine learning applications, different types of mistakes may have very different consequences, making it crucial to consider the costs associated with them. For example, in medical diagnostics, failing to detect a serious illness (a false negative) can have life-threatening implications, whereas incorrectly diagnosing a healthy person as ill (a false positive) mostly leads to unnecessary stress and medical expenses. This disparity in error costs is not limited to binary decisions. For example, when recommending movies to a viewer with preferences “romance over action over horror”, misclassifying a romance film as “horror” is probably worse than misclassifying it as “action”. Besides weighting different kinds of mispredictions, one may even want to treat different kinds of mispredictions separately. That is, instead of a cost-sensitive criterion, one may use a multi-objective criterion, specifying acceptable rates for different types of mispredictions. For example, one may find a given false positive rate acceptable only if, simultaneously, the false negative rate stays below some critical threshold.
Despite the importance of misclassification costs in applications, the theoretical understanding of this setting is lacking. A glaring example, which motivates this work, is boosting. Broadly speaking, a boosting algorithm is a procedure that aggregates several weak learners (whose accuracy is only marginally better than a random guess) into a single strong learner (whose accuracy can be made as high as desired). Although this is a fundamental and well-studied machine learning technique, a theory of boosting accounting for cost-sensitive or multi-objective losses is missing, even in the simplest setting of binary classification. (There are many works on adapting AdaBoost to cost-sensitive learning, but they do not address the fundamental question of identifying the minimal assumption on the cost-sensitive function which guarantees boosting. See more in Appendix A.) In fact, if one can assign different costs to different kinds of mistakes, then even the meaning of “marginally better than a random guess” is not immediately clear; let alone the question of whether one can boost a weak learner to a strong learner, or what precisely “boosting” means. The present work addresses those challenges, providing a generalized theory of boosting which unifies different types of weak learning guarantees, including cost-sensitive and multi-objective ones, and extends the standard algorithmic boosting framework beyond the current state of the art. The fundamental question that we pose is:
Which cost-sensitive and/or multi-objective learners can be boosted? And how?
In classical boosting theory for binary classification, a sharp transition occurs at a weak learner’s error threshold of $1/2$: if the weak learner is guaranteed to output hypotheses with an error rate below $1/2$, then it can be boosted to a strong learner with arbitrarily small error, for instance by AdaBoost (Freund and Schapire, 1997). Thus, a weak learning guarantee below $1/2$ implies strong learning. On the other hand, a guarantee of $1/2$ is trivial, as it can be achieved by tossing a fair coin—see Schapire and Freund (2012). Therefore, guarantees of $1/2$ or above are not boostable.
We investigate whether similar transitions exist for arbitrary cost-sensitive losses. A cost-sensitive loss specifies the penalty $c(y,\hat y)$ for predicting $\hat y$ when the true label is $y$, and can penalize prediction errors unequally (e.g., in binary classification, it may penalize false negatives more than false positives). Suppose we have access to a weak learner that outputs hypotheses with a cost-sensitive loss of at most $z$ under $c$. For which values of $z$ does this imply strong learning, so that the weak learner can be boosted to achieve arbitrarily small cost-sensitive loss according to $c$? Which values of $z$ are trivial, meaning they can always be achieved? Are there intermediate values of $z$ that do not imply strong learning but are still non-trivial?
A similar question arises for multi-objective losses. A multi-objective loss is given by a vector $\mathbf{c}=(c_1,\dots,c_k)$ where each $c_i$ is a cost-sensitive loss as described above. For instance, in binary classification, a natural choice would be $\mathbf{c}=(c_1,c_2)$, where $c_1$ measures false negatives and $c_2$ false positives. Suppose again we have access to a weak learner that outputs hypotheses with loss at most $z_i$ under $c_i$ simultaneously for every $i\in[k]$, forming a vector of guarantees $\mathbf{z}=(z_1,\dots,z_k)$. Which $\mathbf{z}$ are trivial, in that they can always be achieved? Which $\mathbf{z}$ are boostable, allowing for a strong learner that achieves arbitrarily small error simultaneously for all of the losses $c_1,\dots,c_k$? And are there intermediate vectors $\mathbf{z}$ that fall between trivial and boostable?
We address these questions by introducing a new perspective on random guessing, framed as either a zero-sum game or a vector-payoff game (known as a Blackwell approachability game). This game-theoretic approach applies to both cost-sensitive and multi-objective learning, leading to a complete characterization of boostability in these cases. We then extend these techniques to the multiclass setting, where the boostability landscape becomes significantly more complex. While this perspective complements existing views of boosting as a zero-sum game, prior methods are not suited to the settings we examine here. The new tools introduced in this work effectively handle a broad learning framework, establishing a unified and comprehensive theory of generalized boosting.
In particular, we provide extensive answers to the above questions, as follows:
• Cost-sensitive losses. For binary classification, we establish a crisp dichotomy: each guarantee is either trivial or boostable (Theorem A). We show that this transition occurs at a critical threshold given by the value of a zero-sum game defined by $c$ itself. In the multiclass setting, the dichotomy expands into a hierarchy of guarantees, ranging from fully boostable to trivial. Here, we show that there exist multiple thresholds $\mu_1(c)\le\dots\le\mu_M(c)$, where $M$ depends on the cost function $c$, and each guarantee can be boosted down to the largest threshold below it (Theorem C). This generalizes the binary case, in which the only thresholds are $0$ and the value of the zero-sum game.
• Multi-objective losses. For binary classification, we again show a clean dichotomy; however, the threshold now takes a higher-dimensional form, becoming a surface that separates trivial guarantees from boostable ones (Theorem B). Figure 1 illustrates this surface for a loss vector representing false positive and false negative errors. In the multiclass setting, things become more complex: here, we show how to boost non-trivial guarantees to list-learners (Theorem D), but a complete taxonomy of boostable guarantees remains elusive and is left as an open question.
• An equivalence between cost-sensitive and multi-objective losses. We establish and exploit an equivalence between multi-objective and cost-sensitive losses that may be of independent interest (Theorem 6). Given a loss vector $\mathbf{c}=(c_1,\dots,c_k)$, consider a weak learner that outputs hypotheses with loss at most $z_i$ for each $c_i$. By linearity of expectation, it follows that for any convex combination $c^\lambda=\sum_i\lambda_ic_i$, the weak learner’s loss with respect to $c^\lambda$ does not exceed $z^\lambda=\sum_i\lambda_iz_i$. We prove that the converse also holds: if for each such $c^\lambda$ there exists a weak learner with loss at most $z^\lambda$, then we can efficiently aggregate these learners into one that achieves loss at most $z_i$ simultaneously for every $c_i$. Interestingly, a geometric perspective of this result reveals a connection to Blackwell’s Approachability Theorem (Blackwell, 1956) and to scalarization methods in multi-objective optimization.
Organization of the manuscript.
Section 2 gives a detailed walkthrough of all our main results, their significance, and the underlying intuition. Section 3 addresses the binary classification case, in both the cost-sensitive and multi-objective flavors. Section 4 presents our equivalence connecting cost-sensitive learners with multi-objective learners. Finally, Section 5 considers the multiclass case.
2 Main Results
This section outlines the problem setting and the key questions we investigate, and provides an overview of our main results. We begin from the basic technical setup. We consider the standard PAC (Probably Approximately Correct) setting (Valiant, 1984), which models learning a concept class $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$ for a domain $\mathcal{X}$ and a label space $\mathcal{Y}$ with $|\mathcal{Y}|<\infty$. We define a cost function, or simply cost, to be any function $c:\mathcal{Y}\times\mathcal{Y}\to[0,1]$ that satisfies $c(y,y)=0$ for all $y\in\mathcal{Y}$ (although the range of the cost is $[0,1]$, our results can be easily extended to arbitrary bounded costs); the value $c(y,\hat y)$ should be thought of as the penalty incurred by predicting $\hat y$ when the true label is $y$. Note that $c$ can be seen as a matrix indexed by $\mathcal{Y}\times\mathcal{Y}$. For instance, for $\mathcal{Y}=\{0,1\}$, a valid cost is any such matrix with zero diagonal; its entry $c(0,1)$ is the cost of a false positive, and $c(1,0)$ that of a false negative. For conciseness, in the binary setting we let $c^-=c(0,1)$ and $c^+=c(1,0)$; and by some overloading of notation we denote by $c$ both the matrix and the vector $(c^+,c^-)$. For a generic cost $c$, a target function $f:\mathcal{X}\to\mathcal{Y}$, and a distribution $\mathcal{D}$ over $\mathcal{X}$, the cost-sensitive loss of a predictor $h:\mathcal{X}\to\mathcal{Y}$ is:
$$L^{c}_{\mathcal{D},f}(h)\;=\;\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\Big[c\big(f(x),h(x)\big)\Big].\tag{1}$$
For example, when $c(y,\hat y)=\mathbb{1}[y\neq\hat y]$, then $L^{c}_{\mathcal{D},f}(h)$ simply corresponds to $\Pr_{x\sim\mathcal{D}}\big(h(x)\neq f(x)\big)$, the standard 0-1 loss used in binary classification. For convenience we assume throughout the paper that the target $f$ belongs to $\mathcal{H}$ (the realizable setting), though all our results apply more broadly.
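As a concrete illustration of Equation 1, the following minimal sketch (our own, not from the paper; the cost matrix and the tiny data set are made up for illustration) computes the empirical version of the cost-sensitive loss on a labeled sample.

```python
import numpy as np

def empirical_cost_loss(C, y_true, y_pred):
    """Empirical cost-sensitive loss: the average of C[y, y_hat] over the sample.

    C: (|Y| x |Y|) cost matrix with zero diagonal and entries in [0, 1].
    """
    C = np.asarray(C, dtype=float)
    return float(np.mean(C[np.asarray(y_true), np.asarray(y_pred)]))

# Hypothetical binary example: cost 0.2 for a false positive, 1.0 for a false negative.
C = [[0.0, 0.2],
     [1.0, 0.0]]
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 1]
print(empirical_cost_loss(C, y_true, y_pred))  # (1.0 + 0.2) / 5 = 0.24
```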
We begin by presenting our contributions for the binary setting (Section 2.1), followed by their extension to the general multiclass case (Section 2.2).
2.1 Binary setting
We begin from the binary setting, that is, the case $\mathcal{Y}=\{0,1\}$. The starting point is the definition of a suitable notion of “weak learner” for an arbitrary cost function $c$.
Definition 1 ($(c,z)$-learner).
Let $c$ be a cost function and let $z\in[0,1]$. An algorithm $\mathcal{W}$ is a $(c,z)$-learner for $\mathcal{H}$ if there is a function $m_0:(0,1)^2\to\mathbb{N}$ such that for every $\varepsilon,\delta\in(0,1)$, every distribution $\mathcal{D}$ over $\mathcal{X}$, and every $f\in\mathcal{H}$ the following claim holds. If $S$ is a sample of $m\ge m_0(\varepsilon,\delta)$ examples drawn i.i.d. from $\mathcal{D}$ and labeled by $f$, then $\mathcal{W}(S)$ returns a predictor $h:\mathcal{X}\to\mathcal{Y}$ such that $L^{c}_{\mathcal{D},f}(h)\le z+\varepsilon$ with probability at least $1-\delta$.
| | Binary | Multiclass |
| --- | --- | --- |
| 0-1 loss | $1/2$ | $\mathrm{val}_S(c)=1-\frac{1}{|S|}$ for $S\subseteq\mathcal{Y}$ |
| cost $c$ | $\mathrm{val}(c)$ | $\mathrm{val}_S(c)$ for $S\subseteq\mathcal{Y}$ |
| multi-objective $\mathbf{c}$ | coin-attainable boundary of $\mathbf{c}$ | $S$-dice-attainable boundaries of $\mathbf{c}$ |
We study the question of whether a $(c,z)$-learner can be boosted. Broadly speaking, a learner $\mathcal{W}$ is boostable if there exists a learning algorithm $\mathcal{B}$ that can achieve arbitrarily small loss by aggregating a small set of predictors obtained from $\mathcal{W}$. More precisely, given black-box access to $\mathcal{W}$ and a labeled sample $S$, algorithm $\mathcal{B}$ invokes $\mathcal{W}$ on subsamples of $S$ to produce weak predictors $h_1,\dots,h_T$. It then outputs a predictor $\bar h=g(h_1,\dots,h_T)$ using some aggregation function $g$. For instance, in classic binary boosting the function $g$ is usually a weighted majority vote. In this work, we consider arbitrary such aggregations. The goal is to ensure that the loss of the aggregate predictor, $L^{c}_{\mathcal{D},f}(\bar h)$, goes to zero as the sample size grows, resulting in a $(c,\varepsilon)$-learner for every $\varepsilon>0$. Our guiding question can then be stated as:
Can we boost a $(c,z)$-learner to a $(c,\varepsilon)$-learner for arbitrarily small $\varepsilon>0$?
In order to develop some intuition, let us consider again the case when $c$ yields the standard 0-1 loss. In that case, Definition 1 boils down to the standard notion of a weak PAC learner. (Notice that this definition is slightly more general than the classic weak PAC learning definition, in which the guarantee is $1/2-\gamma$ for some fixed margin $\gamma>0$, as we consider learners whose guarantee is allowed to be arbitrarily close to $1/2$.) Then, classic boosting theory states that every $(c,z)$-learner with $z<1/2$ can be boosted to a $(c,\varepsilon)$-learner for every $\varepsilon>0$; and it is easy to see that a loss of $1/2$ is always achievable, simply by predicting according to a fair coin toss. Thus, the value $1/2$ yields a sharp dichotomy: every $z<1/2$ can be boosted and drawn to $0$, while every $z\ge1/2$ is trivially achievable and cannot be brought below $1/2$ in general.
Can this classic dichotomy between “boostable” and “trivial” be extended to accommodate arbitrary cost functions? It turns out that this can be done if one uses biased coins as trivial predictors. Indeed, we show that, by taking such random guessing strategies into account, one can identify a general boostability threshold for all cost functions $c$. However, in contrast to the 0-1 loss case, this critical threshold between the boostable and non-boostable guarantees is no longer $1/2$; instead, it is a function of the cost $c$. More precisely, the threshold is determined by the outcome of a simple two-player zero-sum game, as we describe next.
The game involves a minimizing player (the predictor) and a maximizing player (the environment). The minimizing player selects a distribution $\hat p$ over $\mathcal{Y}$; similarly, the maximizing player selects a distribution $p$ over $\mathcal{Y}$. The payoff matrix of the game is $c$. The overall cost paid by the predictor is then:
$$c(\hat p,p)\;=\;\operatorname*{\mathbb{E}}_{y\sim p,\,\hat y\sim\hat p}\big[c(y,\hat y)\big]\;=\;\sum_{y,\hat y\in\mathcal{Y}}p(y)\,\hat p(\hat y)\,c(y,\hat y).\tag{2}$$
Following standard game theory, the value $\mathrm{val}(c)$ of the game is the quantity
$$\mathrm{val}(c)\;=\;\min_{\hat p\in\Delta_{\mathcal{Y}}}\,\max_{p\in\Delta_{\mathcal{Y}}}\,c(\hat p,p),\tag{3}$$
where $\Delta_{\mathcal{Y}}$ is the set of all distributions over $\mathcal{Y}$. In words, $\mathrm{val}(c)$ is the smallest loss that the predictor can ensure by tossing a biased coin (i.e., $\hat p$) without knowing anything about the true distribution of the labels (i.e., $p$). Now consider a value $z\in[0,1]$. It is easy to see that, if $z\ge\mathrm{val}(c)$, then there exists a universal $(c,z)$-learner—one that works for all $\mathcal{H}$ and all distributions $\mathcal{D}$ over $\mathcal{X}$: this is the learner that, ignoring the input sample, returns the randomized predictor $h$ whose outcome $h(x)$ is distributed according to a minimizer $\hat p^\star$ of Equation 3, independently for every $x$. Indeed, by Equation 3 the loss of $h$ is at most $\mathrm{val}(c)\le z$ regardless of $f$ and of the distribution $\mathcal{D}$. Formally, we define:
Definition 2 (Random guess).
A randomized hypothesis $h$ is called a random guess if its prediction $h(x)$ is independent of the input point $x$. That is, there exists a probability distribution $\hat p\in\Delta_{\mathcal{Y}}$ such that $h(x)\sim\hat p$ for every $x\in\mathcal{X}$.
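The value in Equation 3 is the solution of a small linear program. The sketch below (our own illustration, not the paper’s code; it assumes only scipy’s `linprog`) computes $\mathrm{val}(c)$ and an optimal coin $\hat p$ for an arbitrary cost matrix.

```python
import numpy as np
from scipy.optimize import linprog

def game_value(C):
    """val(c) = min over coins p_hat of max over labels y of sum_yhat p_hat[yhat]*C[y, yhat].

    LP variables: the k coin probabilities p_hat and a scalar upper bound v; minimize v.
    """
    C = np.asarray(C, dtype=float)
    k = C.shape[0]
    obj = np.zeros(k + 1); obj[-1] = 1.0                      # minimize v
    A_ub = np.hstack([C, -np.ones((k, 1))])                   # for each y: C[y,:] @ p_hat - v <= 0
    b_ub = np.zeros(k)
    A_eq = np.hstack([np.ones((1, k)), np.zeros((1, 1))])     # p_hat sums to 1
    b_eq = np.ones(1)
    bounds = [(0, 1)] * k + [(None, None)]
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:k]

# 0-1 loss on two labels: the optimal coin is fair and val = 1/2.
val, p_hat = game_value([[0, 1], [1, 0]])
print(round(val, 3), p_hat)  # 0.5 [0.5 0.5]
```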
We prove that the non-boostability condition $z\ge\mathrm{val}(c)$ is tight. That is, we prove that every $(c,z)$-learner with $z<\mathrm{val}(c)$ is boostable to a $(c,\varepsilon)$-learner for every $\varepsilon>0$. Thus, the value of the game given by Equation 3 is precisely the threshold for boostability. Formally:
Theorem A (Cost-sensitive boosting, binary case).
Let $\mathcal{Y}=\{0,1\}$ and let $c$ be a cost. Then, for all $z\in[0,1]$, exactly one of the following holds.
• $z$ is boostable: for every $\varepsilon>0$, every $(c,z)$-learner is boostable to a $(c,\varepsilon)$-learner.
• $z$ is trivial: there exists a random guess $h$ such that the learner that always outputs $h$ is a $(c,z)$-learner for every $\mathcal{H}$.
Moreover, $z$ is boostable if and only if $z<\mathrm{val}(c)$, where $\mathrm{val}(c)$ is the value of the game defined in Equation 3.
The proof of Theorem A is given in Section 3. In a nutshell, Theorem A says that anything that beats the weakest possible learner (a coin) is as powerful as the strongest possible learner; between these two extremes there is no middle ground. Remarkably, the proof of Theorem A is simple and yet it relies on three distinct applications of von Neumann’s Minimax Theorem (von Neumann, 1928)! The first application, which also appears in classical binary boosting, is used to aggregate a distribution over the weak learners. The other two applications are unique to the cost-sensitive setting: one arises in the analysis of the boosting algorithm (first item of Theorem A), and the other one in defining the trivial learner (second item of Theorem A). These last two applications are both based on the zero-sum game defined above. We also note that the first item of Theorem A is obtained constructively by an efficient boosting algorithm, as we detail in Section 3.
2.1.1 Multi-objective losses
We turn our attention to multi-objective losses. A multi-objective loss is specified by a multi-objective cost, that is, a vector $\mathbf{c}=(c_1,\dots,c_k)$ where $c_i$ is a cost for every $i\in[k]$. This allows one to express several criteria by which a learner is evaluated; for example, it allows us to measure separately the false positive rate and the false negative rate, giving a more fine-grained control over the error bounds. The guarantees of a learner can then be expressed by bounding $L^{c_i}_{\mathcal{D},f}(h)$ by some $z_i$ simultaneously for each $i\in[k]$. This leads to the following generalization of Definition 1.
Definition 3 ($(\mathbf{c},\mathbf{z})$-learner).
Let $k\in\mathbb{N}$, let $\mathbf{c}=(c_1,\dots,c_k)$ where each $c_i$ is a cost function, and let $\mathbf{z}\in[0,1]^k$. An algorithm $\mathcal{W}$ is a $(\mathbf{c},\mathbf{z})$-learner for $\mathcal{H}$ if there is a function $m_0:(0,1)^2\to\mathbb{N}$ such that for every $\varepsilon,\delta\in(0,1)$, every distribution $\mathcal{D}$ over $\mathcal{X}$, and every $f\in\mathcal{H}$ what follows holds. If $S$ is a sample of $m\ge m_0(\varepsilon,\delta)$ examples drawn i.i.d. from $\mathcal{D}$ and labeled by $f$, then $\mathcal{W}(S)$ returns a predictor $h$ such that with probability at least $1-\delta$: $L^{c_i}_{\mathcal{D},f}(h)\le z_i+\varepsilon$ for every $i\in[k]$.
Consider for example $\mathbf{c}=(c_1,c_2)$ with $c_1(0,1)=1=c_2(1,0)$ and $c_1(1,0)=0=c_2(0,1)$. Then $c_1$ counts the false positives, and $c_2$ the false negatives. Thus a $(\mathbf{c},\mathbf{z})$-learner for $\mathbf{z}=(z_1,z_2)$ ensures simultaneously a false-positive rate arbitrarily close to $z_1$ and a false-negative rate arbitrarily close to $z_2$. See Figure 2 for illustrative geometric examples.
Like for the cost-sensitive case, we address the question of which $(\mathbf{c},\mathbf{z})$-learners are boostable. In other words we ask: given $\mathbf{c}$, what is the right “threshold” for $\mathbf{z}$? Equivalently, for which $\mathbf{z}$ can we always boost a $(\mathbf{c},\mathbf{z})$-learner to a $(\mathbf{c},\mathbf{0})$-learner, and for which $\mathbf{z}$ do there exist $(\mathbf{c},\mathbf{z})$-learners that are not boostable? It turns out that the answer is more nuanced than in the cost-sensitive case of Section 2.1, and yet we can prove the same “all or nothing” phenomenon of Theorem A, as depicted in Figure 1. (Note that the light-red area in the figure, i.e., the upper triangle, is attainable by coins that are oblivious to the true label marginals: for example, a fair coin trivially attains the point $(1/2,1/2)$, and a “degenerate coin”—a constant function which always predicts the same label—attains the points $(0,1)$ or $(1,0)$.) In particular, every $\mathbf{z}$ is either entirely boostable, in the sense that every $(\mathbf{c},\mathbf{z})$-learner can be boosted to a $(\mathbf{c},\mathbf{0})$-learner, or trivial, in the sense that there exists a $(\mathbf{c},\mathbf{z})$-learner whose output is always a hypothesis that can be simulated by a (biased) random coin. To this end we introduce the notion of coin attainability. This is the multi-objective equivalent of the condition $z\ge\mathrm{val}(c)$ for a scalar cost $c$ as in Section 2.1.
Definition 4 (Coin attainability).
Let $\mathbf{c}=(c_1,\dots,c_k)$ where $c_i$ is a cost function for every $i\in[k]$, and let $\mathbf{z}\in[0,1]^k$. We say that $\mathbf{z}$ is coin-attainable (w.r.t. $\mathbf{c}$) if
$$\forall p\in\Delta_{\mathcal{Y}}\;\;\exists\hat p\in\Delta_{\mathcal{Y}}:\qquad c_i(\hat p,p)\le z_i\quad\text{for all }i\in[k].\tag{4}$$
The coin-attainable region of $\mathbf{c}$ is the set of all $\mathbf{z}\in[0,1]^k$ such that $\mathbf{z}$ is coin-attainable.
It is not hard to see that, if $\mathbf{z}$ is coin-attainable, there exists a universal trivial $(\mathbf{c},\mathbf{z})$-learner for all $\mathcal{H}$. Fix indeed any distribution $\mathcal{D}$ over $\mathcal{X}$ and any $f\in\mathcal{H}$. First, one can learn the marginal $p$ of the labels over $\mathcal{Y}$ within, say, total variation distance $\varepsilon$; then, by the coin-attainability of $\mathbf{z}$, one can return a random guess $h$ distributed according to a $\hat p$ with $c_i(\hat p,\tilde p)\le z_i$ for the estimate $\tilde p$, which ensures $L^{c_i}_{\mathcal{D},f}(h)\le z_i+\varepsilon$ for all $i\in[k]$. Formally, recalling the definition of random guess (Definition 2), we have:
Definition 5 (Trivial learner).
A learning algorithm is trivial if it only outputs random guesses.
Thus, a trivial learner can (at best) learn the best random guess and return it. Clearly, such a learner is in general not boostable to a $(\mathbf{c},\mathbf{0})$-learner. The main result of this subsection is:
Theorem B (Multi-objective boosting, binary case).
Let $\mathcal{Y}=\{0,1\}$, let $\mathbf{c}=(c_1,\dots,c_k)$ where each $c_i$ is a cost, and let $\mathbf{z}\in[0,1]^k$. Then, exactly one of the following holds.
• $\mathbf{z}$ is boostable: for every $\varepsilon>0$, every $(\mathbf{c},\mathbf{z})$-learner is boostable to a $(\mathbf{c},(\varepsilon,\dots,\varepsilon))$-learner.
• $\mathbf{z}$ is trivial: there exists a trivial learner that is a $(\mathbf{c},\mathbf{z})$-learner for all $\mathcal{H}$.
Moreover, $\mathbf{z}$ is boostable if and only if $\mathbf{z}$ is not coin-attainable.
The proof of Theorem B is given in Section 3, and is based on a certain scalarization of the loss amenable to a reduction to Theorem A. Let us spend a few words on Theorem B. First, note how Definition 4 is reminiscent of the zero-sum game of Section 2.1. The crucial difference, however, is that now the maximizing player (the environment) plays first, and the minimizing player (the predictor) can then choose $\hat p$ as a function of $p$. As shown in more detail in Section 3.2, this is essential for proving the second item—that is, for coin-attainability to capture exactly the guarantees attainable by trivial learners. For the scalar-valued game of Section 2.1, instead, by von Neumann’s Minimax Theorem the order of the players is immaterial to the value $\mathrm{val}(c)$ of the game.
Second, note that the multi-objective setting is more general than that of a (scalar-valued) zero-sum game, as here the payoff is vector-valued. This can be thought of as a Blackwell approachability game (Blackwell, 1956), a generalization of zero-sum games to vectorial payoffs. Indeed, it turns out that there is a connection between these two notions, in the sense of Blackwell’s theorem: we prove that a certain vectorial bound can be attained (with respect to a vector-valued loss) if and only if all the “scalarizations” of the bound can be attained with respect to the corresponding scalarizations of the loss. In fact, we can show that this is captured formally by an equivalence between cost-sensitive learners and multi-objective learners, as explained in Section 2.1.2.
Geometric interpretation of Theorem B.
A multi-objective guarantee $\mathbf{z}$ with respect to a multi-objective cost $\mathbf{c}$ can be viewed as a set of linear constraints, $L^{c_i}_{\mathcal{D},f}(h)\le z_i$ for $i\in[k]$. This identifies a region that is convex and downward closed, see Figure 2. This holds true even in the multiclass setting, when there are more than $2$ labels. Similarly, the set of all coin-attainable points for the false-negative/false-positive objective can be shown to be convex and upward-closed (a set $A\subseteq\mathbb{R}^k$ is upward-closed if $z\in A$ and $z'\ge z$ coordinate-wise imply $z'\in A$; see Section 3 for further details). This is illustrated in Figure 1, where the set of all coin-attainable points and the set of all boostable points are given in the red and blue areas, respectively.
2.1.2 Cost-sensitive vs. multi-objective losses: a general equivalence
We establish a general, formal connection between cost-sensitive and multi-objective learning. The starting point is the straightforward observation that, by definition, a $(\mathbf{c},\mathbf{z})$-learner is a $(c_i,z_i)$-learner for every $i\in[k]$. Does the converse hold, too? That is, if for each $i\in[k]$ there exists a $(c_i,z_i)$-learner $\mathcal{W}_i$, can we conclude that there exists a $(\mathbf{c},\mathbf{z})$-learner? Perhaps surprisingly, we can show that this holds if we consider all convex combinations of the scalar guarantees. Formally, given a distribution $\lambda\in\Delta_k$, let $c^\lambda=\sum_{i\in[k]}\lambda_ic_i$ and $z^\lambda=\sum_{i\in[k]}\lambda_iz_i$. It is easy to see that the observation above continues to hold: a $(\mathbf{c},\mathbf{z})$-learner is a $(c^\lambda,z^\lambda)$-learner for every such $\lambda$. The next result shows that the converse holds, too: the existence of a $(c^\lambda,z^\lambda)$-learner for every $\lambda\in\Delta_k$ implies the existence of a $(\mathbf{c},\mathbf{z})$-learner.
Theorem 6 (Equivalence of learners).
Let $k\in\mathbb{N}$, let $\mathbf{c}=(c_1,\dots,c_k)$ where each $c_i$ is a cost function, and let $\mathbf{z}\in[0,1]^k$. Then, the following are equivalent.
1. There exists a $(\mathbf{c},\mathbf{z})$-learner for $\mathcal{H}$.
2. For every $\lambda\in\Delta_k$ there exists a $(c^\lambda,z^\lambda)$-learner for $\mathcal{H}$.
Moreover, the equivalence is obtained constructively by an efficient algorithm.
We remark that Theorem 6 holds in the multiclass case, too (i.e., for arbitrary finite label sets $\mathcal{Y}$). Thus, the interplay between multi-objective and cost-sensitive learning is a general phenomenon, not limited to binary classifiers. The reduction from multi-objective to cost-sensitive in Theorem 6 resembles the weighted sum scalarization method for multi-objective optimization (Ehrgott, 2005, Chapter 3). However, it is unclear whether this similarity can be used to provide further insights into our problem.
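To make the scalarization concrete, here is a toy sketch (our own illustration; the guarantee vector and the weights are hypothetical values chosen for the example): for a multi-objective cost $\mathbf{c}=(c_1,c_2)$ and a weight vector $\lambda$, the scalarized cost $c^\lambda$ and guarantee $z^\lambda$ are plain convex combinations, so any hypothesis meeting the vector guarantee meets every scalarized one by linearity.

```python
import numpy as np

# Two binary objectives: c1 penalizes false positives, c2 false negatives.
c1 = np.array([[0.0, 1.0], [0.0, 0.0]])
c2 = np.array([[0.0, 0.0], [1.0, 0.0]])
z = np.array([0.1, 0.3])           # hypothetical multi-objective guarantee

lam = np.array([0.25, 0.75])       # any point of the simplex Delta_2
c_lam = lam[0] * c1 + lam[1] * c2  # scalarized cost c^lambda
z_lam = float(lam @ z)             # scalarized guarantee z^lambda

# If L^{c_i}(h) <= z_i for both i, then by linearity L^{c^lambda}(h) <= z_lam.
print(c_lam, z_lam)  # z_lam = 0.25*0.1 + 0.75*0.3 = 0.25
```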
In the same vein as Theorem 6, we prove an equivalence between trivial guarantees of cost-sensitive learners (see Theorem A) and trivial guarantees of multi-objective learners (see Theorem B).
Theorem 7 (Equivalence of trivial guarantees).
Let $\mathbf{c}=(c_1,\dots,c_k)$ where each $c_i$ is a cost function, and let $\mathbf{z}\in[0,1]^k$. Then the following are equivalent.
1. $\mathbf{z}$ is coin-attainable.
2. For every $\lambda\in\Delta_k$ it holds that $z^\lambda\ge\mathrm{val}(c^\lambda)$.
As with Theorem 6, we note that Theorem 7 also holds in the general multiclass case. The proof of Theorem 7 again uses von Neumann’s Minimax Theorem, based on a two-player game that differs from those described earlier. In this game, the pure strategies of the maximizing player are the different elements of $[k]$, representing the various objectives of the cost function. The proofs of all equivalence theorems are given in Section 4.
2.2 Multiclass setting
For binary classification tasks, we get an especially clean and refined mathematical picture, in which boostability is fully determined by a single threshold, i.e., $\mathrm{val}(c)$. That is, any learner with loss value below it can be boosted, while any value above it is attainable by a trivial non-boostable learner. In the unweighted case, recent work by Brukhim et al. (2023a) demonstrated that a multichotomy emerges, governed by the thresholds $1-\frac{1}{j}$ for $j\in[k]$, where $k$ is the number of labels. For each such threshold, any learner achieving a loss below it can be “partially” boosted to the next threshold below it, but not further than that. Here, we extend their results to arbitrary cost functions and prove that a similar phenomenon holds more generally.
In contrast to the unweighted case, the landscape of boostability in the general setting is more complex, consisting of many thresholds corresponding to distinct possible subsets of labels. Interestingly, the thresholds of boostability can be computed precisely, and are all determined by the outcome of zero-sum games, as defined next. We generalize the zero-sum game defined in Equation (2) to the case of $k$ labels. It turns out that in the multiclass setting, a more refined zero-sum game is needed, introducing a certain asymmetry between the sets of allowable strategies for each player. In particular, we restrict the set of distributions which the maximizing player is allowed to choose. For a subset of labels $S\subseteq\mathcal{Y}$, define the value of the game restricted to the subset $S$ as:
$$\mathrm{val}_S(c)\;=\;\min_{\hat p\in\Delta_{\mathcal{Y}}}\,\max_{p\in\Delta_S}\,c(\hat p,p),\tag{5}$$
where $\Delta_S$ is the set of distributions over $\mathcal{Y}$ supported on $S$.
Importantly, only the maximizing player is restricted to $\Delta_S$ while the minimizing player is not. Thus, $\mathrm{val}_S(c)$ is the smallest loss one can ensure by predicting with a die given that the correct label is in $S$. Clearly, for every $S\subseteq S'$ it holds that $\mathrm{val}_S(c)\le\mathrm{val}_{S'}(c)$. Consider the values $\mathrm{val}_S(c)$ of the subsets $S\subseteq\mathcal{Y}$ in nondecreasing order (for notational easiness we define $\mathrm{val}_{\emptyset}(c)=0$). We then denote each such value by $\mu_j(c)$ for some $j\in[M]$, so that overall we have,
$$0=\mu_1(c)\le\mu_2(c)\le\cdots\le\mu_M(c),\tag{6}$$
where $M$ depends on the cost function $c$. The first equality holds since $\mathrm{val}_S(c)=0$ whenever $|S|\le1$. Moreover, by a straightforward calculation it can be shown that, for the unweighted case in which $c$ is simply the 0-1 loss (i.e., $c(y,\hat y)=\mathbb{1}[y\neq\hat y]$), the thresholds specialize to those given by Brukhim et al. (2023a). Concretely, this is described in the following fact.
Fact 8.
Let $c$ be the 0-1 loss. Then, for every $\mathcal{Y}$ and every nonempty $S\subseteq\mathcal{Y}$, it holds that $\mathrm{val}_S(c)=1-\frac{1}{|S|}$.
Thus, for the 0-1 loss we have $\mu_j(c)=1-\frac{1}{j}$ for every $j$. Note that although for the general case given here there is no closed-form expression for the value of each threshold, each threshold can be determined by solving the corresponding linear program representation of the zero-sum game described above.
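Following the remark above, each threshold can be computed by a linear program in which the maximizing player is restricted to $S$. The sketch below (ours, reusing the LP idea from Section 2.1; `game_value` from the earlier sketch is adapted by keeping only the constraints for $y\in S$) recovers Fact 8 for the 0-1 loss on three labels.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def restricted_game_value(C, S):
    """val_S(c): the coin p_hat ranges over all labels, the adversary only over S."""
    C = np.asarray(C, dtype=float)
    k = C.shape[0]
    obj = np.zeros(k + 1); obj[-1] = 1.0
    A_ub = np.hstack([C[list(S), :], -np.ones((len(S), 1))])  # constraints only for y in S
    res = linprog(obj, A_ub=A_ub, b_ub=np.zeros(len(S)),
                  A_eq=np.hstack([np.ones((1, k)), np.zeros((1, 1))]), b_eq=np.ones(1),
                  bounds=[(0, 1)] * k + [(None, None)])
    return res.x[-1]

# 0-1 loss on 3 labels: val_S(c) = 1 - 1/|S| (Fact 8), so the thresholds are 0, 1/2, 2/3.
C01 = 1.0 - np.eye(3)
thresholds = sorted({round(restricted_game_value(C01, S), 6)
                     for r in range(1, 4) for S in itertools.combinations(range(3), r)})
print(thresholds)  # [0.0, 0.5, 0.666667]
```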
We can now state our main results on multiclass cost-sensitive boosting, given next.
Theorem C (Cost-sensitive boosting, multiclass case).
Let $\mathcal{Y}$ be a finite label space, let $c$ be any cost function, and let $z\in[0,1]$. Let $\mu_1(c)\le\dots\le\mu_M(c)$ be as defined in Equation 6 (when clear from context we omit $c$ and denote a threshold by $\mu_m$), and let $m$ be the largest integer such that $\mu_m\le z$. Then, the following claims both hold.
• $z$ is $\mu_m$-boostable: for every $\varepsilon>0$, every $(c,z)$-learner is boostable to a $(c,\mu_m+\varepsilon)$-learner.
• $z$ is not boostable below $\mu_m$: there exists a class $\mathcal{H}$ and a $(c,z)$-learner for $\mathcal{H}$ that cannot be boosted to a $(c,\mu_m-\varepsilon)$-learner for $\mathcal{H}$, for any $\varepsilon>0$.
The proof of Theorem C is given in Section 5.
In words, Theorem C implies that there is a partition of the interval $[0,1]$, on which the loss values lie, into at most $M$ sub-intervals or “buckets”, based on the thresholds $\mu_j(c)$. Then, any loss value in a certain bucket can be boosted to the lowest value within the bucket, but not below that.
We remark that, in fact, the above boostability result is obtained constructively via an efficient algorithm, as described in Section 5.
2.2.1 Multiclass classification, multi-objective loss
We shift our focus to multi-objective losses in the multiclass setting. Recall the notion of $(\mathbf{c},\mathbf{z})$-learner of Definition 3. Our goal is to prove a multi-objective counterpart of Theorem C; that is, to understand for which $\mathbf{z}'$ a $(\mathbf{c},\mathbf{z})$-learner can be boosted to a $(\mathbf{c},\mathbf{z}')$-learner. Unlike the scalar cost setting of Theorem C, however, the set of those $\mathbf{z}'$ that are reachable by boosting a $(\mathbf{c},\mathbf{z})$-learner is not a one-dimensional interval in $[0,1]$, but a subset of $[0,1]^k$.
As a first step, we generalize the notion of coin attainability given by Definition 4. The key ingredient that we shall add is to restrict the distribution of the labels to a given set $S\subseteq\mathcal{Y}$. From the point of view of a random guesser, this amounts to knowing that the correct label must be in $S$. We then say that a guarantee $\mathbf{z}$ is $S$-dice-attainable if a trivial learner can satisfy that guarantee over any distribution supported only over $S$. Formally, it is defined as follows.
Definition 9 ($S$-dice attainability).
Let $k\in\mathbb{N}$, let $\mathbf{c}=(c_1,\dots,c_k)$ where each $c_i$ is a cost function, let $\mathbf{z}\in[0,1]^k$, and let $S\subseteq\mathcal{Y}$. Then $\mathbf{z}$ is $S$-dice-attainable if:
$$\forall p\in\Delta_S\;\;\exists\hat p\in\Delta_{\mathcal{Y}}:\qquad c_i(\hat p,p)\le z_i\quad\text{for all }i\in[k].\tag{7}$$
The $S$-dice-attainable region of $\mathbf{c}$ is the set $A_S(\mathbf{c})$ of all $S$-dice-attainable $\mathbf{z}\in[0,1]^k$.
Intuitively, if $\mathbf{z}$ is $S$-dice-attainable, then a $(\mathbf{c},\mathbf{z})$-learner is not better than a random die over $S$, and therefore should not be able to separate any label in $S$. Using the notion of $S$-dice-attainability, we define a partial ordering over $[0,1]^k$. For every $\mathbf{z},\mathbf{z}'\in[0,1]^k$ write $\mathbf{z}'\preceq\mathbf{z}$ if
$$\forall S\subseteq\mathcal{Y}:\qquad\mathbf{z}\in A_S(\mathbf{c})\;\implies\;\mathbf{z}'\in A_S(\mathbf{c}),\tag{8}$$
and $\mathbf{z}'\not\preceq\mathbf{z}$ otherwise. Intuitively, $\mathbf{z}'\preceq\mathbf{z}$ means that, whenever $\mathbf{z}$ is not better than a die, then $\mathbf{z}'$ is not better than a die either. We prove that this partial ordering precisely characterizes the boostability of $\mathbf{z}$ to $\mathbf{z}'$. We can now state our main result for multi-objective multiclass boosting.
Theorem D (Multi-objective boosting, multiclass case).
Let $k\in\mathbb{N}$, let $\mathbf{c}=(c_1,\dots,c_k)$ where each $c_i$ is a cost function, and let $\mathbf{z}\in[0,1]^k$. Then, the following claims both hold.
• $\mathbf{z}$ is boostable to $\mathbf{z}'$ for every $\mathbf{z}'\preceq\mathbf{z}$: for every class $\mathcal{H}$, every $(\mathbf{c},\mathbf{z})$-learner for $\mathcal{H}$ is boostable to a $(\mathbf{c},\mathbf{z}')$-learner for $\mathcal{H}$.
• $\mathbf{z}$ is not boostable to $\mathbf{z}'$ for any $\mathbf{z}'\not\preceq\mathbf{z}$: there exists a class $\mathcal{H}$ and a $(\mathbf{c},\mathbf{z})$-learner for $\mathcal{H}$ that is a trivial learner and, therefore, cannot be boosted to a $(\mathbf{c},\mathbf{z}')$-learner for $\mathcal{H}$, for any such $\mathbf{z}'$.
Let us elaborate briefly on Theorem D. In the multiclass cost-sensitive setting, where the overall cost is a scalar, the landscape of boostability is determined by a sequence of (totally ordered) scalar thresholds $\mu_1(c)\le\dots\le\mu_M(c)$; and those thresholds determine whether a $(c,z)$-learner can be boosted to a $(c,z')$-learner, for $z'<z$, as detailed in Theorem C. In the multiclass multi-objective setting, those scalar thresholds are replaced by a set of surfaces in $[0,1]^k$. Each such surface corresponds to the boundary of one of the dice-attainable sets $A_S(\mathbf{c})$. Those surfaces can be seen as thresholds of a higher-dimensional form, separating between different boostability guarantees.
3 Binary Classification
This section considers the classic binary classification setting, where $\mathcal{Y}=\{0,1\}$.
Section 3.1 tackles the simplest case, where the cost $c$ is scalar, and explores the conditions that make a $(c,z)$-learner boostable, proving Theorem A. Section 3.2 considers multi-objective costs, where the cost is vector-valued, for a specific but illustrative choice of $\mathbf{c}$, proving a special case of Theorem B and of Theorem 7. The reason for restricting $\mathbf{c}$ to a special form is that this allows us to introduce the key ideas while avoiding heavy notation.
Finally, in Section 3.4 we give the full proof of Theorem B.
For convenience, as $\mathcal{Y}=\{0,1\}$, we will sometimes encode a cost function as a 2-by-2 matrix with zero diagonal and off-diagonal entries $c^-=c(0,1)$ and $c^+=c(1,0)$, where $c^-$ is the cost of a false positive and $c^+$ the cost of a false negative. We will also use the fact, provable through easy calculations, that the value of the game equals $\mathrm{val}(c)=\frac{c^+c^-}{c^++c^-}$ if $c^++c^->0$, and $\mathrm{val}(c)=0$ otherwise. Before moving forward, we also need a measure of the advantage that a $(c,z)$-learner has compared to a coin.
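To see where this closed form comes from, here is a short equalization argument (a standard calculation spelled out for convenience, not taken verbatim from the paper): a coin that predicts $1$ with probability $q$ pays expected cost $q\,c^-$ against label $0$ and $(1-q)\,c^+$ against label $1$; the adversary plays the larger of the two, so the best coin equalizes them:
$$q\,c^-=(1-q)\,c^+\;\iff\;q=\frac{c^+}{c^++c^-},\qquad\text{whence}\qquad\mathrm{val}(c)=\frac{c^+c^-}{c^++c^-}.$$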
Definition 10 (Margin, binary case).
The margin of a $(c,z)$-learner is $\gamma=\mathrm{val}(c)-z$.
When $c$ is the standard 0-1 cost, so that $\mathrm{val}(c)=1/2$, $\gamma$ is the usual notion of margin that defines a weak learner in binary classification tasks. In that case, it is well known that to boost a weak learner with margin $\gamma$ it is necessary and sufficient to invoke the learner a number of times that scales with $1/\gamma^2$. This remains true for general $c$ and $z$, in that our boosting algorithm below invokes the $(c,z)$-learner a number of times proportional to $1/\gamma^2$.
3.1 Boosting a $(c,z)$-learner
The main contribution of this subsection is a boosting algorithm, shown in Algorithm 1, that turns any $(c,z)$-learner $\mathcal{W}$ with $z<\mathrm{val}(c)$ into a $(c,\varepsilon)$-learner for any $\varepsilon>0$. We prove:
Theorem 11.
Let $c$ be a cost and $z\in[0,1]$. If $\mathcal{W}$ is a $(c,z)$-learner for $\mathcal{H}$ with margin $\gamma>0$, then Algorithm 1 with the choice of parameters of Lemma 12 is a $(c,\varepsilon)$-learner for $\mathcal{H}$ for every $\varepsilon>0$. Under the same assumptions, Algorithm 1 makes $O(\log(m)/\gamma^2)$ oracle calls to $\mathcal{W}$, where $m$ is the size of the input sample, and has sample complexity polynomial in $1/\varepsilon$, $\log(1/\delta)$, $1/\gamma$, and the sample complexity $m_{\mathcal{W}}$ of $\mathcal{W}$.
It is immediate to see that Theorem 11 implies the first item of Theorem A; for the second item see Lemma 14 below. In the rest of this subsection we prove Theorem 11. We do so in two steps: first we prove that Algorithm 1 returns a hypothesis consistent with the input sample (Lemma 12), and then we show generalization by a compression argument (Lemma 13).
At a high level, Algorithm 1 follows standard boosting procedures. Given a $(c,z)$-learner $\mathcal{W}$ and a labeled sample $S$, Algorithm 1 uses $\mathcal{W}$ to obtain a sequence of hypotheses $h_1,\dots,h_T$ whose average loss is close to $z$ on each single example in $S$. This can be achieved through any regret-minimizing algorithm, but for the sake of concreteness our algorithm uses a modified version of Hedge (Freund and Schapire, 1997) where the update step re-weights examples as a function of $c$. Unlike standard approaches, the final prediction on a point $x$ is not just the majority vote of the $h_t$’s. Instead, the prediction is constructed by comparing the fraction of $h_t$’s returning $0$, weighted by $c^+$, and the fraction of $h_t$’s returning $1$, weighted by $c^-$. The smallest weighted fraction wins, and the corresponding label is returned: intuitively, each weighted fraction is the average cost the $h_t$’s would pay if the corresponding label were the true one, and the true label is the one for which this average cost is small. We show that this ensures correct labeling of the entire sample with the desired probability. Formally, we have:
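The following sketch illustrates the overall structure just described. It is a schematic reimplementation, not the paper’s Algorithm 1: the exact step size, stopping rule, and per-round confidence are simplified, and `weak_learner` is a hypothetical oracle that accepts a reweighted sample.

```python
import math

def boost(sample, weak_learner, c_plus, c_minus, T, eta):
    """Cost-sensitive boosting sketch for binary labels {0, 1}.

    sample: list of (x, y) pairs; weak_learner(sample, dist) returns a
    hypothesis h (a function x -> {0, 1}) trained under the weights dist.
    """
    m = len(sample)
    w = [1.0] * m  # Hedge weights over examples
    hyps = []
    for _ in range(T):
        total = sum(w)
        dist = [wi / total for wi in w]
        h = weak_learner(sample, dist)  # weak learner sees the reweighted sample
        hyps.append(h)
        for i, (x, y) in enumerate(sample):
            # cost paid by h on (x, y): c_plus on a false negative, c_minus on a false positive
            cost = c_plus if (y == 1 and h(x) == 0) else (c_minus if (y == 0 and h(x) == 1) else 0.0)
            w[i] *= math.exp(eta * cost)  # up-weight examples where cost was paid

    def aggregate(x):
        f0 = sum(1 for h in hyps if h(x) == 0) / len(hyps)  # fraction voting 0
        f1 = 1.0 - f0
        # c_plus * f0 is the average cost if the true label were 1;
        # c_minus * f1 is the average cost if the true label were 0.
        return 1 if c_plus * f0 < c_minus * f1 else 0

    return aggregate
```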
Lemma 12.
Let $\mathcal{Y}=\{0,1\}$ and let $c$ be a cost function. Let $z<\mathrm{val}(c)$, and let $\mathcal{W}$ be a $(c,z)$-learner for $\mathcal{H}$ with margin $\gamma$ and sample complexity $m_{\mathcal{W}}$. Fix any $f\in\mathcal{H}$, let $\mathcal{D}$ be any distribution over $\mathcal{X}$, and let $S=\{(x_i,y_i)\}_{i\in[m]}$ be a multiset of examples given by $m$ i.i.d. points labeled by $f$. Finally, fix any $\delta\in(0,1)$. If Algorithm 1 is given $S$, oracle access to $\mathcal{W}$, and parameters $T=\Theta\big(\frac{\log m}{\gamma^2}\big)$, $\eta=\Theta(\gamma)$, and $\delta_0=\frac{\delta}{T}$, then Algorithm 1 makes $T$ calls to $\mathcal{W}$ and returns $\bar h$ such that with probability at least $1-\delta$: $\bar h(x_i)=y_i$ for every $i\in[m]$.
Proof.
Fix any $i\in[m]$. Given the choices of $\eta=\Theta(\gamma)$ and $T=\Theta\big(\frac{\log m}{\gamma^2}\big)$, the standard analysis of Hedge (Freund and Schapire, 1997) shows that:

$$\sum_{t=1}^{T}c\big(y_i,h_t(x_i)\big)\;\le\;\sum_{t=1}^{T}\langle P_t,\ell_t\rangle+\frac{\ln m}{\eta}+\eta T.\tag{9}$$

Dividing both sides by $T$, and using the definitions of $\eta$ and $T$ at the right-hand side, yields:

$$\frac1T\sum_{t=1}^{T}c\big(y_i,h_t(x_i)\big)\;\le\;\frac1T\sum_{t=1}^{T}\langle P_t,\ell_t\rangle+\frac{\gamma}{4}.\tag{10}$$

For every $t\in[T]$, since $\langle P_t,\ell_t\rangle$ is the cost-sensitive loss of $h_t$ with respect to the distribution $P_t$ over $S$, since $\mathcal{W}$ is a $(c,z)$-learner for $\mathcal{H}$ invoked with accuracy $\gamma/4$, and by the choice of $\delta_0$:

$$\Pr\Big(\langle P_t,\ell_t\rangle>z+\frac\gamma4\Big)\;\le\;\delta_0.\tag{11}$$

By averaging over $t$ and taking a union bound over $t\in[T]$, we have that, with probability at least $1-\delta$,

$$\frac1T\sum_{t=1}^{T}\langle P_t,\ell_t\rangle\;\le\;z+\frac\gamma4.\tag{12}$$

Now suppose Equation 12 holds. We shall prove that $\bar h(x_i)=y_i$ for every $i\in[m]$. First, Equation 10 together with the definition of the margin $\gamma=\mathrm{val}(c)-z$ implies that for every $i\in[m]$:

$$\frac1T\sum_{t=1}^{T}c\big(y_i,h_t(x_i)\big)\;\le\;z+\frac\gamma2\;=\;\mathrm{val}(c)-\frac\gamma2.\tag{13}$$

Now recall the definition of $\bar h$ from Algorithm 1: writing $F_b(x)=\frac1T\big|\{t\in[T]:h_t(x)=b\}\big|$ for $b\in\{0,1\}$,

$$\bar h(x)\;=\;\mathbb{1}\Big[c^+F_0(x)<c^-F_1(x)\Big].\tag{14}$$

Clearly, for any $x$, we have that $F_0(x)+F_1(x)=1$. Now consider any $i\in[m]$. If $y_i=1$ then by definition of $c$ and $F_0$:

$$\frac1T\sum_{t=1}^{T}c\big(y_i,h_t(x_i)\big)\;=\;c^+F_0(x_i).\tag{15}$$

By Equation 13, then, $c^+F_0(x_i)\le\mathrm{val}(c)-\frac\gamma2$. By a symmetric argument, $c^-F_1(x_i)\le\mathrm{val}(c)-\frac\gamma2$ if $y_i=0$. It remains to show that for every $i$ exactly one among $c^+F_0(x_i)\le\mathrm{val}(c)-\frac\gamma2$ and $c^-F_1(x_i)\le\mathrm{val}(c)-\frac\gamma2$ holds. As we have just shown, at least one of the inequalities holds, depending on the value of $y_i$. Now assume towards contradiction that both hold. Recall that $\mathrm{val}(c)=\frac{c^+c^-}{c^++c^-}$. Note that $z\ge0$ and $\gamma>0$ imply $\mathrm{val}(c)>0$ and thus $c^+,c^->0$. Then we have:

$$1\;=\;F_0(x_i)+F_1(x_i)\;\le\;\Big(\mathrm{val}(c)-\frac\gamma2\Big)\Big(\frac1{c^+}+\frac1{c^-}\Big)\;=\;1-\frac{\gamma}{2\,\mathrm{val}(c)}\;<\;1,\tag{16}$$

a contradiction. Hence, if $y_i=1$ then $c^+F_0(x_i)\le\mathrm{val}(c)-\frac\gamma2<c^-F_1(x_i)$, and symmetrically if $y_i=0$. By definition of $\bar h$ then $\bar h(x_i)=y_i$ for all $i\in[m]$, concluding the proof. ∎
The next lemma shows that, if the size of the sample in Lemma 12 is large enough, then the hypothesis returned by Algorithm 1 has small generalization error. It is immediate to see that, together with Lemma 12, this implies Theorem 11.
Lemma 13.
Assume the hypotheses of Lemma 12. For any $\varepsilon,\delta\in(0,1)$, if the size $m$ of the sample given to Algorithm 1 satisfies
$$m\;\ge\;\frac{C\big(T\,m_{\mathcal{W}}\log m+\log(1/\delta)\big)}{\varepsilon}$$
for a sufficiently large universal constant $C$, then with probability at least $1-2\delta$ the output $\bar h$ of Algorithm 1 satisfies $L^{c}_{\mathcal{D},f}(\bar h)\le\varepsilon$. Therefore, Algorithm 1 is a $(c,\varepsilon)$-learner for $\mathcal{H}$ with sample complexity $m(\varepsilon,\delta)$ as above.
Proof.
First, we apply Lemma 12; we obtain that, with probability at least $1-\delta$, the hypothesis $\bar h$ is consistent with the labeled sample $S$. Next, we apply a standard compression-based generalization argument (see e.g., Theorem 2.8 from Schapire and Freund (2012)). To this end, note that one can construct a compression scheme for $\bar h$ of size equal to the total size of the samples on which $\mathcal{W}$ is invoked, that is, $\kappa=T\cdot m_{\mathcal{W}}$. By Theorem 2.8 from Schapire and Freund (2012) we get that, with probability at least $1-\delta$,

$$L^{c}_{\mathcal{D},f}(\bar h)\;\le\;O\!\left(\frac{\kappa\log m+\log(1/\delta)}{m-\kappa}\right).\tag{17}$$

Straightforward calculations show that the right-hand side is at most $\varepsilon$ whenever:

$$m\;\ge\;2\kappa\quad\text{and}\tag{18}$$
$$m\;\ge\;\frac{C\big(\kappa\log m+\log(1/\delta)\big)}{\varepsilon}\tag{19}$$

for a sufficiently large universal constant $C$. A union bound completes the first part of the claim, and shows that Algorithm 1 is a $(c,\varepsilon)$-learner for $\mathcal{H}$. The condition on the sample complexity of Algorithm 1 then follows immediately by definition of $\kappa$. ∎
Remark on adaptive boosting.
Traditional boosting algorithms, such as the well-known AdaBoost (Schapire and Freund, 2012), do not assume prior knowledge of the margin parameter $\gamma$ and adapt to it dynamically during execution. In contrast, the boosting algorithm in Algorithm 1, along with our broader approach, requires an initial estimate of $\gamma$ as input. If this estimate is too large, the algorithm may fail. However, this issue can be addressed through a straightforward binary search procedure: we iteratively adjust our guess, halving it when necessary based on observed outcomes. This adds only a logarithmic overhead of $O(\log(1/\gamma))$ to the runtime, without affecting sample complexity bounds.
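A minimal sketch of the halving search just described (ours, under the assumption of a hypothetical subroutine `try_boost(gamma)` that runs Algorithm 1 with that margin estimate and reports failure, e.g., when the output is not consistent with the training sample):

```python
def adaptive_boost(try_boost, gamma_init=0.5, gamma_min=1e-6):
    """Halving search over the unknown margin.

    try_boost(gamma) runs the boosting algorithm with margin estimate gamma
    and returns the aggregated hypothesis, or None if boosting failed.
    """
    gamma = gamma_init
    while gamma >= gamma_min:
        h = try_boost(gamma)
        if h is not None:
            return h   # the guess was small enough: boosting succeeded
        gamma /= 2     # guess too large: halve and retry (O(log(1/gamma)) rounds)
    raise RuntimeError("no margin found above gamma_min")
```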
We conclude this subsection with a proof of the second part of Theorem A.
Lemma 14.
Let $\mathcal{X}$ be any domain, let $\mathcal{Y}=\{0,1\}$, and let $c$ be a cost over $\mathcal{Y}$. If $z\ge\mathrm{val}(c)$, then there is an algorithm that is a $(c,z)$-learner for every $\mathcal{H}$. Moreover, for some domain $\mathcal{X}$ and some $\mathcal{H}$ the said learner cannot be boosted to a $(c,z')$-learner for any $z'<\mathrm{val}(c)$.
Proof.
By definition of $\mathrm{val}(c)$ there exists a distribution $\hat p^\star$ such that $c(\hat p^\star,p)\le\mathrm{val}(c)$ for every $p$ over $\mathcal{Y}$. Consider then the algorithm that ignores the input and returns a randomized predictor $h$ such that $h(x)\sim\hat p^\star$ independently for every given $x$. Clearly, for every distribution $\mathcal{D}$ over $\mathcal{X}$,

$$L^{c}_{\mathcal{D},f}(h)\;=\;c(\hat p^\star,p)\;\le\;\mathrm{val}(c)\;\le\;z,\tag{20}$$

where $p$ is the distribution of $f(x)$ for $x\sim\mathcal{D}$. This proves that the algorithm is a $(c,z)$-learner for every $\mathcal{H}$. Let us now show that for some $\mathcal{H}$ it cannot be boosted to a $(c,z')$-learner for any $z'<\mathrm{val}(c)$. If $\mathrm{val}(c)=0$ then this is trivial, as no $z'<0$ exists; thus we can assume $\mathrm{val}(c)>0$. Suppose then that a $(c,z')$-learner for $\mathcal{H}$ exists, where $z'<\mathrm{val}(c)$. By the first item of Theorem A, it follows that a $(c,\varepsilon)$-learner for $\mathcal{H}$ exists for every $\varepsilon>0$. Recall that $\mathrm{val}(c)=\frac{c^+c^-}{c^++c^-}$. Since we assumed $\mathrm{val}(c)>0$, it follows that $c^+,c^->0$. Then, a $(c,\varepsilon)$-learner is a strong learner in the PAC sense. Now choose $\mathcal{H}$ to be any non-learnable class; for instance, let $\mathcal{H}=\{0,1\}^{\mathcal{X}}$ for an infinite domain $\mathcal{X}$ (see Shalev-Shwartz and Ben-David (2014, Theorem 5.1)). Then, no strong PAC learner and therefore no $(c,\varepsilon)$-learner exists for $\mathcal{H}$. Hence a $(c,z')$-learner for $\mathcal{H}$ with $z'<\mathrm{val}(c)$ does not exist either. ∎
3.2 Multi-objective losses: boosting a $(\mathbf{c},\mathbf{z})$-learner
This section considers the boosting problem for learners that satisfy multi-objective guarantees, as captured by the notion of $(\mathbf{c},\mathbf{z})$-learners (Definition 3). Our main question is then:
When can a $(\mathbf{c},\mathbf{z})$-learner be boosted to a $(\mathbf{c},\mathbf{0})$-learner?
For the scalar case, we showed in Section 3.1 that the threshold to boost $(c,z)$-learners is the value of the game $\mathrm{val}(c)$, which is the best bound on the loss one can achieve by just tossing a coin—independently of the example at hand, and even of the specific distribution over $\mathcal{X}$. One may thus guess that the same characterization holds for boosting $(\mathbf{c},\mathbf{z})$-learners. It turns out that this guess fails, as we show below. We thus adopt a different and stronger notion of “trivial learner”, obtained by exchanging the order of the players. It turns out that this is the right notion, in the sense that it is equivalent to being non-boostable.
For the sake of simplicity we assume a specific $\mathbf{c}$ that we call the population-driven cost. This will help convey the key ideas, without having to deal with the technical details of the more general version of our results (Section 5). The population-driven cost is the one that counts, separately, false negatives and false positives. More precisely, $\mathbf{c}=(c^+,c^-)$ where $c^+$ and $c^-$ are now two cost functions that for all $y,\hat y\in\{0,1\}$ satisfy:

$$c^+(y,\hat y)\;=\;\mathbb{1}\big[y=1\wedge\hat y=0\big],\tag{21}$$
$$c^-(y,\hat y)\;=\;\mathbb{1}\big[y=0\wedge\hat y=1\big].\tag{22}$$
For $\mathbf{z}=(z^+,z^-)$, a $(\mathbf{c},\mathbf{z})$-learner then ensures simultaneously a false-negative rate arbitrarily close to $z^+$ and a false-positive rate arbitrarily close to $z^-$.
3.2.1 Coin attainability
Let $\mathbf{c}=(c^+,c^-)$ as above, and let $\mathbf{z}=(z^+,z^-)\in[0,1]^2$. As a first attempt at answering our question above, we guess that $\mathbf{z}$ is non-boostable when it is attainable by a random coin toss in the same way as for scalar costs (Section 3.1). That is, we may say that $\mathbf{z}$ is trivially attainable w.r.t. $\mathbf{c}$ if there exists $\hat p\in\Delta_{\mathcal{Y}}$ such that for every $p\in\Delta_{\mathcal{Y}}$ we have $c^+(\hat p,p)\le z^+$ and $c^-(\hat p,p)\le z^-$. Clearly, for any such $\mathbf{z}$ there would exist a $(\mathbf{c},\mathbf{z})$-learner that is not boostable for some class $\mathcal{H}$. If we could also prove the converse, that any $\mathbf{z}$ not trivially attainable is boostable, then we would have determined that trivial attainability is equivalent to non-boostability.
Unfortunately, this turns out to be false. To see why, fix any distribution $\hat p\in\Delta_{\mathcal{Y}}$. Define then $p$ as follows: if $\hat p(0)\ge1/2$ then $p=\delta_1$ (the point mass on label $1$), otherwise $p=\delta_0$. It follows immediately that either $c^+(\hat p,p)\ge1/2$ or $c^-(\hat p,p)\ge1/2$. That is, the loss of $\hat p$ is at least $1/2$ in at least one coordinate. As a consequence, no $\mathbf{z}$ with $\max(z^+,z^-)<1/2$ is trivially attainable. Thus, if our guess is correct, it should be the case that any $(\mathbf{c},\mathbf{z})$-learner with $\max(z^+,z^-)<1/2$ is boostable.
We can easily show that that is not the case. Let $\mathbf{z}=(1/4,1/4)$ and consider the following $(\mathbf{c},\mathbf{z})$-learner. Upon receiving a sufficiently large labeled sample from some distribution $\mathcal{D}$ over $\mathcal{X}$, the learner computes an estimate $\tilde p$ of the marginal $p$ of the labels over $\mathcal{Y}$. The learner then computes $\hat p\in\Delta_{\mathcal{Y}}$ that satisfies:

$$\tilde p(1)\,\hat p(0)\;\le\;\frac14\qquad\text{and}\qquad\tilde p(0)\,\hat p(1)\;\le\;\frac14.\tag{23}$$
It is not hard to show that such a $\hat p$ always exists (for instance, $\hat p=\tilde p$ works, since $\tilde p(0)\tilde p(1)\le1/4$). The learner thus outputs $h$ such that $h(x)\sim\hat p$ for all $x$ independently. As the number of samples received by the learner increases, the estimate $\tilde p$ approaches $p$, and thus the loss of the learner approaches $\big(p(1)\hat p(0),\,p(0)\hat p(1)\big)\le(1/4,1/4)$. Clearly, however, such a learner cannot be boosted, as its predictions are independent of the example at hand. We conclude that there exist $\mathbf{z}$ that are not trivially attainable and yet are not boostable.
The example above suggests a fix to our notion of “non-boostable learner”. The fix consists in considering again prediction by a coin $\hat p$, where, however, we allow $\hat p$ to be a function of $p$. This leads to the following definition:
Definition 15.
Let $\mathbf{c}=(c^+,c^-)$ be the population-driven cost and let $\mathbf{z}=(z^+,z^-)\in[0,1]^2$. Then, $\mathbf{z}$ is coin-attainable w.r.t. $\mathbf{c}$ if for every distribution $p\in\Delta_{\mathcal{Y}}$ there exists a distribution $\hat p\in\Delta_{\mathcal{Y}}$ such that:
$$c^+(\hat p,p)\le z^+\qquad\text{and}\qquad c^-(\hat p,p)\le z^-.$$
Note how Definition 15 mirrors the game of Section 3.1, but with a reverted order of the players: here, the predictor plays second. It should be clear that, if $\mathbf{z}$ is coin-attainable, then there exists a $(\mathbf{c},\mathbf{z})$-learner that is not boostable. That is the learner described above, which gets an estimate $\tilde p$ of $p$ from the sample, and then returns $h$ such that $h(x)\sim\hat p$, where $\hat p$ satisfies the condition in Definition 15 with respect to $\tilde p$. What is not clear is whether the converse holds, too: is a $(\mathbf{c},\mathbf{z})$-learner boostable whenever $\mathbf{z}$ is not coin-attainable? It turns out that this is the case, as the next subsection shows.
3.2.2 Coin-attainability equals non-boostability
Let us try to boost a $(\mathbf{c},\mathbf{z})$-learner $\mathcal{W}$. We can do so by reduction to the scalar case of Section 3.1. To this end choose any $\lambda\in[0,1]$. Define $c^\lambda$ by:

$$c^\lambda\;=\;\lambda\,c^++(1-\lambda)\,c^-.\tag{24}$$

For conciseness we shall write $z^\lambda=\lambda z^++(1-\lambda)z^-$. In words, $c^\lambda$ is the convex combination of $(c^+,c^-)$ according to $(\lambda,1-\lambda)$, and therefore $c^\lambda$ is itself a cost function; moreover $L^{c^\lambda}_{\mathcal{D},f}(h)=\lambda L^{c^+}_{\mathcal{D},f}(h)+(1-\lambda)L^{c^-}_{\mathcal{D},f}(h)$ for each $h$. Now we make the following crucial observation: since $\mathcal{W}$ is a $(\mathbf{c},\mathbf{z})$-learner, then it is also a $(c^\lambda,z^\lambda)$-learner. This is immediate to see from the definition of $c^\lambda$ and the definition of $z^\lambda$. It follows that, by the results of Section 3.1, $\mathcal{W}$ can be boosted if $z^\lambda<\mathrm{val}(c^\lambda)$. Summarizing, we can boost $\mathcal{W}$ whenever:

$$\exists\lambda\in[0,1]:\qquad z^\lambda\;<\;\mathrm{val}(c^\lambda).\tag{25}$$
That is, if there is any $\lambda$ satisfying Equation 25, then $\mathcal{W}$ can be boosted to a strong learner. Moreover, if that is the case, then one can prove that for some $\lambda\in(0,1)$ one can ensure $L^{c^\lambda}_{\mathcal{D},f}(\bar h)\le\varepsilon$ for any desired $\varepsilon>0$. Then, one can easily see that the resulting algorithm is a $(\mathbf{c},\mathbf{0})$-learner. Our main question thus becomes:
Is it true that, if $\mathbf{z}$ is not coin-attainable, then there is $\lambda\in[0,1]$ such that $z^\lambda<\mathrm{val}(c^\lambda)$?
We show that this is indeed the case, and therefore $\mathbf{z}$ is coin-attainable if and only if $z^\lambda\ge\mathrm{val}(c^\lambda)$ for all $\lambda\in[0,1]$. Equivalently, all $(\mathbf{c},\mathbf{z})$-learners are boostable if and only if $z^\lambda<\mathrm{val}(c^\lambda)$ for some $\lambda\in[0,1]$. As a consequence, every $\mathbf{z}$ is either coin-attainable or boostable. Formally, we have:
Theorem 16.
Let $\mathbf{z}=(z^+,z^-)\in[0,1]^2$ and let $\mathbf{c}$ be the population-driven cost. Then $\mathbf{z}$ is coin-attainable w.r.t. $\mathbf{c}$ if and only if $z^{\lambda^\star}\ge\mathrm{val}(c^{\lambda^\star})$, where:

$$\lambda^\star\;=\;\frac{1-z^++z^-}{2}.\tag{26}$$

Equivalently, $\mathbf{z}$ is coin-attainable w.r.t. $\mathbf{c}$ if and only if $\sqrt{z^+}+\sqrt{z^-}\ge1$.
Proof.
First of all, by straightforward calculations we determine:

$$\mathrm{val}(c^\lambda)\;=\;\lambda(1-\lambda),\tag{27}$$
$$\mathrm{val}(c^\lambda)-z^\lambda\;=\;-\lambda^2+\lambda\big(1-z^++z^-\big)-z^-,\tag{28}$$

a concave quadratic in $\lambda$ maximized exactly at $\lambda^\star$. This proves that $z^\lambda\ge\mathrm{val}(c^\lambda)$ for all $\lambda\in[0,1]$ if and only if $z^{\lambda^\star}\ge\mathrm{val}(c^{\lambda^\star})$.
Now, for the “only if” direction, by von Neumann’s minimax theorem and by the definition of $\mathrm{val}$,

$$\mathrm{val}(c^{\lambda^\star})\;=\;\max_{p\in\Delta_{\mathcal{Y}}}\,\min_{\hat p\in\Delta_{\mathcal{Y}}}\,c^{\lambda^\star}(\hat p,p)\tag{29}$$
$$\le\;\max_{p\in\Delta_{\mathcal{Y}}}\Big(\lambda^\star z^++(1-\lambda^\star)z^-\Big)\;=\;z^{\lambda^\star},\tag{30}$$

where the inequality holds as $\mathbf{z}$ is coin-attainable.
For the “if” direction, suppose $z^{\lambda^\star}\ge\mathrm{val}(c^{\lambda^\star})$, hence $\sqrt{z^+}+\sqrt{z^-}\ge1$. We claim that this implies $\mathbf{z}$ is coin-attainable. Fix indeed any $p\in\Delta_{\mathcal{Y}}$. We consider three cases: (1) $p(1)\le z^+$, (2) $p(0)\le z^-$, and (3) $p(1)>z^+$ and $p(0)>z^-$. If $p(1)\le z^+$ then choose $\hat p=\delta_0$, otherwise if $p(0)\le z^-$ choose $\hat p=\delta_1$. It is immediate to see that this implies $c^+(\hat p,p)\le z^+$ and $c^-(\hat p,p)\le z^-$. Suppose then $p(1)>z^+$ and $p(0)>z^-$, which implies $\frac{z^+}{p(1)},\frac{z^-}{p(0)}<1$. We claim that:

$$\frac{z^+}{p(1)}+\frac{z^-}{p(0)}\;\ge\;1.\tag{31}$$

Indeed, letting $a=\sqrt{z^+}$, $b=\sqrt{z^-}$, $q=p(1)$, multiplying both sides by $q(1-q)$, and rearranging yields:

$$a^2(1-q)+b^2q\;\ge\;q(1-q),\tag{32}$$

which can be easily checked to hold for all $q\in(0,1)$ whenever $a+b\ge1$. Equation 31 then implies the existence of $\hat p\in\Delta_{\mathcal{Y}}$ such that $\hat p(0)+\hat p(1)=1$ and that:

$$\hat p(0)\;\le\;\frac{z^+}{p(1)},\tag{33}$$
$$\hat p(1)\;\le\;\frac{z^-}{p(0)}.\tag{34}$$

For this $\hat p$ we have $c^+(\hat p,p)=p(1)\hat p(0)\le z^+$ and $c^-(\hat p,p)=p(0)\hat p(1)\le z^-$. We conclude that $\mathbf{z}$ is coin-attainable. ∎
3.3 A duality, and a geometric interpretation
Theorem 16 has an interesting geometric interpretation. In fact, one can easily check that for every $\mathbf{z}\in[0,1]^2$ it holds that $z^{\lambda^\star}\ge\mathrm{val}(c^{\lambda^\star})$ if and only if $z^\lambda\ge\mathrm{val}(c^\lambda)$ for every $\lambda\in[0,1]$, and not just for the single $\lambda^\star$ specified in Theorem 16. That is, we have:
Theorem 17.
Let $\mathbf{z}\in[0,1]^2$ and let $\mathbf{c}$ be the population-driven cost. Then $\mathbf{z}$ is coin-attainable w.r.t. $\mathbf{c}$ if and only if $z^\lambda\ge\mathrm{val}(c^\lambda)$ for every $\lambda\in[0,1]$.
Theorem 17 says that $\mathbf{z}$ is coin-attainable if and only if there is no way to “scalarize” $\mathbf{c}$ so as to obtain a boostable learner w.r.t. the corresponding scalarized cost $c^\lambda$. From a geometric perspective, the result can be read as follows. For any $\lambda\in[0,1]$ consider the following halfspace:

$$H_\lambda\;=\;\big\{\mathbf{z}\in\mathbb{R}^2:\;z^\lambda\ge\mathrm{val}(c^\lambda)\big\}.\tag{35}$$
Let moreover $K$ be the set of all $\mathbf{z}$ that are coin-attainable w.r.t. $\mathbf{c}$. Then Theorem 17 says:

$$K\;=\;\bigcap_{\lambda\in[0,1]}H_\lambda.\tag{36}$$
In other words, the set of coin-attainable vectors is precisely the intersection of all halfspaces $H_\lambda$. It follows that $K$ is convex: and, indeed, if two vectors $\mathbf{z}$ and $\mathbf{z}'$ are both coin-attainable then it is easy to see that $\alpha\mathbf{z}+(1-\alpha)\mathbf{z}'$ is coin-attainable, too, for every $\alpha\in[0,1]$. Figure 3 gives a pictorial description of this geometric interpretation. The proof of Theorem 17 is not given here, as it is a special case, for $\mathbf{c}$ being the population-driven cost, of Theorem 7.
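A small numeric illustration of Equation 36 (a sketch of ours, under the assumptions of this section, i.e., the population-driven cost with $\mathrm{val}(c^\lambda)=\lambda(1-\lambda)$): it tests coin-attainability of a point by checking all scalarized halfspace constraints on a grid of $\lambda$ values.

```python
def coin_attainable(z_plus, z_minus, grid=10_001):
    """Check z = (z_plus, z_minus) against every halfspace H_lambda (Eq. 36).

    Uses val(c^lambda) = lambda * (1 - lambda) for the population-driven cost.
    """
    for t in range(grid):
        lam = t / (grid - 1)
        z_lam = lam * z_plus + (1 - lam) * z_minus  # scalarized guarantee z^lambda
        val_lam = lam * (1 - lam)                   # value of the scalarized game
        if z_lam < val_lam:   # z escapes H_lambda: boostable, hence not coin-attainable
            return False
    return True

# The boundary should match sqrt(z+) + sqrt(z-) = 1 (Theorem 16):
print(coin_attainable(0.25, 0.25))  # True: exactly on the boundary
print(coin_attainable(0.20, 0.20))  # False: below the boundary, hence boostable
```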
3.4 Proof of Theorem B
First item. The proof makes use of the duality results of Section 4. Assume $\mathbf{z}$ is not coin-attainable with respect to $\mathbf{c}$. By Theorem 7 there exists $\lambda\in\Delta_k$ such that $z^\lambda<\mathrm{val}(c^\lambda)$, where $c^\lambda=\sum_i\lambda_ic_i$ and $z^\lambda=\sum_i\lambda_iz_i$. Thus, every $(\mathbf{c},\mathbf{z})$-learner is a $(c^\lambda,z^\lambda)$-learner with positive margin, and by Theorem 11 there exists a $(c^\lambda,\varepsilon)$-learner for $\mathcal{H}$, for every $\varepsilon>0$. It is immediate to see that, if $\lambda>0$ (that is, $\lambda_i>0$ for every $i\in[k]$) then this is a $(\mathbf{c},(\varepsilon/\lambda_1,\dots,\varepsilon/\lambda_k))$-learner for $\mathcal{H}$. Thus, we need only to show that we can always assume $\lambda>0$. Let

$$\lambda'\;=\;(1-\alpha)\lambda+\frac{\alpha}{k}\mathbf{1},\qquad\alpha\in(0,1).\tag{37}$$

Note that both $z^{\lambda'}$ and $\mathrm{val}(c^{\lambda'})$ are continuous functions of $\alpha$. Since $z^\lambda<\mathrm{val}(c^\lambda)$, it follows that there exists $\alpha>0$ small enough such that $\lambda'>0$ and $z^{\lambda'}<\mathrm{val}(c^{\lambda'})$, too.
Second item. We show that, if $\mathbf{z}$ is coin-attainable, then there is a trivial learner $\mathcal{A}$ that is a $(\mathbf{c},\mathbf{z})$-learner for every $\mathcal{H}$. By choosing $\mathcal{H}$ to be any non-learnable class this implies that $\mathbf{z}$ is not boostable, for otherwise the first item above would imply the existence of a strong learner for $\mathcal{H}$. The learner $\mathcal{A}$ works as follows. Given a sample of $m$ examples drawn i.i.d. from $\mathcal{D}$ and labeled by any $f$, it estimates the marginal distribution of the labels on $\mathcal{Y}$. Denote that estimate by $\tilde p$. It is well known that $\mathcal{A}$ can ensure $\tilde p$ is within total variation distance $\varepsilon$ of the true marginal $p$ with probability at least $1-\delta$ by taking $m=O\big(\frac{1}{\varepsilon^2}\log\frac1\delta\big)$, see for instance (Canonne, 2020, Theorem 1). Assume then this is the case. Let $\hat p$ be a distribution such that $c_i(\hat p,\tilde p)\le z_i$ for all $i\in[k]$; note that $\hat p$ exists by definition of coin attainability. Then, $\mathcal{A}$ returns a randomized predictor $h$ such that $h(x)\sim\hat p$ independently of the input $x$. Since $\|p-\tilde p\|_{\mathrm{TV}}\le\varepsilon$, then,

$$L^{c_i}_{\mathcal{D},f}(h)\;=\;c_i(\hat p,p)\;\le\;c_i(\hat p,\tilde p)+\varepsilon\;\le\;z_i+\varepsilon\qquad\text{for all }i\in[k].\tag{38}$$

This proves that $\mathcal{A}$ is a $(\mathbf{c},\mathbf{z})$-learner.
4 Equivalence: Cost-Sensitive and Multi-Objective
In this section we examine the connection between cost-sensitive and multi-objective learning. In particular, we prove the two equivalence theorems: Theorem 6, which demonstrates a certain equivalence between learning algorithms for a class $\mathcal{H}$ with respect to these different guarantees, and Theorem 7, which characterizes trivial guarantees in both settings.
The following lemma is the key component used to prove Theorem 7.
Lemma 18.
Let $k\in\mathbb{N}$, let $\mathbf{c}=(c_1,\dots,c_k)$ where each $c_i$ is a cost function, and let $\mathbf{z}\in[0,1]^k$. Fix any $p\in\Delta_{\mathcal{Y}}$, where $\mathcal{Y}$ is some fixed finite label space. Assume that for every $\lambda\in\Delta_k$, there exists $\hat p_\lambda\in\Delta_{\mathcal{Y}}$ such that $c^\lambda(\hat p_\lambda,p)\le z^\lambda$. Then, there exists $\hat p\in\Delta_{\mathcal{Y}}$ such that for all $i\in[k]$, it holds that $c_i(\hat p,p)\le z_i$.
Proof.
Define the following zero-sum game. The pure strategies of the maximizing player are $[k]$, and the pure strategies of the minimizing player are $\mathcal{Y}$. The payoff matrix $M$ is given as follows, with rows corresponding to labels $\hat y\in\mathcal{Y}$ and columns to loss-entries $i\in[k]$,

$$M(\hat y,i)\;=\;c_i(\hat y,p)-z_i\;=\;\sum_{y\in\mathcal{Y}}p(y)\,c_i(y,\hat y)-z_i.\tag{39}$$
Thus, for mixed strategies $\hat p\in\Delta_{\mathcal{Y}}$ and $\lambda\in\Delta_k$ we have,

$$M(\hat p,\lambda)\;=\;c^\lambda(\hat p,p)-z^\lambda.\tag{40}$$

Denote,

$$V\;=\;\min_{\hat p\in\Delta_{\mathcal{Y}}}\,\max_{\lambda\in\Delta_k}\,M(\hat p,\lambda).\tag{41}$$
It remains to show that there exists $\hat p$ such that for all $i\in[k]$, we have $c_i(\hat p,p)\le z_i$. By assumption, we know that $\min_{\hat p}M(\hat p,\lambda)\le0$ for every $\lambda\in\Delta_k$, and so by von Neumann’s minimax theorem (von Neumann, 1928) we have:

$$V\;=\;\max_{\lambda\in\Delta_k}\,\min_{\hat p\in\Delta_{\mathcal{Y}}}\,M(\hat p,\lambda)\;\le\;0.\tag{42}$$

In particular, there exists $\hat p$ such that for all $\lambda\in\Delta_k$ it holds that:
$$M(\hat p,\lambda)\;\le\;0.$$
Observe that the above holds for all $\lambda$ of the form $e_i$ for $i\in[k]$ (i.e., the standard basis of $\mathbb{R}^k$), and so we get the desired result. ∎
Proposition 19.
Let $S\subseteq\mathcal{Y}$, and let $\mathbf{c}=(c_1,\dots,c_k)$ where each $c_i$ is a cost function, and $\mathbf{z}\in[0,1]^k$. Then, the following are equivalent.
1. $\mathbf{z}$ is $S$-dice-attainable.
2. For every $\lambda\in\Delta_k$, it holds that $z^\lambda\ge\mathrm{val}_S(c^\lambda)$, where $c^\lambda=\sum_{i\in[k]}\lambda_ic_i$.
Next, we prove Proposition 19, which also proves Theorem 7 as a special case.
Proof of Proposition 19.
It is easy to see that $(1)\Rightarrow(2)$ holds, by definitions. We prove $(2)\Rightarrow(1)$. By assumption, for every $\lambda\in\Delta_k$ it holds that,

$$z^\lambda\;\ge\;\mathrm{val}_S(c^\lambda)\;=\;\min_{\hat p\in\Delta_{\mathcal{Y}}}\,\max_{p\in\Delta_S}\,c^\lambda(\hat p,p).\tag{43}$$

In particular, this implies that for any $p\in\Delta_S$ and any $\lambda\in\Delta_k$, there exists $\hat p\in\Delta_{\mathcal{Y}}$ such that
$$c^\lambda(\hat p,p)\;\le\;z^\lambda.$$
Thus, by Lemma 18 we get that for any $p\in\Delta_S$, there exists $\hat p\in\Delta_{\mathcal{Y}}$ such that for every $i\in[k]$,
$$c_i(\hat p,p)\;\le\;z_i.$$
Thus, we get that $\mathbf{z}$ is $S$-dice-attainable. ∎
We shift our focus to general learning algorithms for a class $\mathcal{H}$. First, recall that a multi-objective learner is, by definition, also a cost-sensitive learner. The existence of a $(\mathbf{c},\mathbf{z})$-learner implies the existence of a $(c_i,z_i)$-learner for each $i\in[k]$. Equivalently, the existence of a $(\mathbf{c},\mathbf{z})$-learner implies the existence of a $(c^\lambda,z^\lambda)$-learner for each $\lambda=e_i$ (the canonical basis vectors of $\mathbb{R}^k$), where $c^{e_i}=c_i$ and $z^{e_i}=z_i$.
Theorem 6 shows that the other direction holds as well, i.e., that the existence of a $(c^\lambda,z^\lambda)$-learner for all possible $\lambda\in\Delta_k$ implies the existence of a $(\mathbf{c},\mathbf{z})$-learner.
Proof of Theorem 6.
It is easy to see that $(1)\Rightarrow(2)$ holds, by definitions. We prove $(2)\Rightarrow(1)$. For any $f\in\mathcal{H}$ and distribution $\mathcal{D}$ over $\mathcal{X}$, let $S$ be a set of $m$ examples sampled i.i.d. from $\mathcal{D}$ and labeled by $f$, where $m$ is to be determined later. For a hypothesis $h$, denote by $\widehat L^{c_i}(h)$ its empirical cost-sensitive loss over $S$. Then, applying Algorithm 2 over $S$ with access to the learners $\{\mathcal{W}_\lambda\}_{\lambda\in\Delta_k}$, and with $T=\Theta\big(\frac{\ln k}{\varepsilon^2}\big)$, $\eta=\Theta(\varepsilon)$, and per-call sample size $m_0=m_0(\varepsilon/2,\delta/2T)$ (where $m_0$ is the sample complexity of the learners) yields the following. Fix any $i\in[k]$. Given the choices of $\eta$ and $T$, the standard analysis of Hedge (Freund and Schapire, 1997) shows that:

$$\sum_{t=1}^{T}\Big(\widehat L^{c_i}(h_t)-z_i\Big)\;\le\;\sum_{t=1}^{T}\Big(\widehat L^{c^{\lambda_t}}(h_t)-z^{\lambda_t}\Big)+\frac{\ln k}{\eta}+\eta T.\tag{44}$$

By the guarantee of the learners and the choice of $m_0$, and by a union bound over all $t\in[T]$, we get that with probability at least $1-\delta/2$,

$$\widehat L^{c^{\lambda_t}}(h_t)\;\le\;z^{\lambda_t}+\frac\varepsilon2\qquad\text{for all }t\in[T],\tag{45}$$

which implies,

$$\sum_{t=1}^{T}\Big(\widehat L^{c^{\lambda_t}}(h_t)-z^{\lambda_t}\Big)\;\le\;\frac{\varepsilon T}{2}.\tag{46}$$

Thus, we get that with probability $1-\delta/2$,

$$\sum_{t=1}^{T}\Big(\widehat L^{c_i}(h_t)-z_i\Big)\;\le\;\frac{\varepsilon T}{2}+\frac{\ln k}{\eta}+\eta T\;\le\;\varepsilon T.\tag{47}$$

By then also taking a union bound over $i\in[k]$, we have with probability at least $1-\delta/2$ that for all $i\in[k]$ it holds that,

$$\frac1T\sum_{t=1}^{T}\widehat L^{c_i}(h_t)\;\le\;z_i+\varepsilon.\tag{48}$$

Recall that for each $i\in[k]$ and $t\in[T]$ we have $\widehat L^{c_i}(h_t)\in[0,1]$. We can now define the randomized hypothesis $\bar h$ for any $x\in\mathcal{X}$ as follows. To compute the value of $\bar h(x)$, first we sample a hypothesis $h_t$ by sampling $t\in[T]$ uniformly at random, and then predict according to $h_t(x)$. We then have,

$$\widehat L^{c_i}(\bar h)\;=\;\frac1T\sum_{t=1}^{T}\widehat L^{c_i}(h_t).\tag{49}$$

Then, by the definition of $\bar h$ and Equation 48, we have that with probability at least $1-\delta/2$,

$$\widehat L^{c_i}(\bar h)\;\le\;z_i+\varepsilon\qquad\text{for all }i\in[k].\tag{50}$$

Finally, we apply standard compression-based generalization arguments (Shalev-Shwartz and Ben-David (2014), Theorem 30.2), where the sample size is $m$ and the compression size is $T\cdot m_0$; we get that with probability at least $1-\delta$ over the sample $S$ it holds that:

$$L^{c_i}_{\mathcal{D},f}(\bar h)\;\le\;\widehat L^{c_i}(\bar h)+\varepsilon\;\le\;z_i+2\varepsilon\qquad\text{for all }i\in[k].\tag{51}$$

Thus, the randomized hypothesis $\bar h$ satisfies the desired guarantees with a constant probability.
Overall we get that the procedure described above is a
$(\mathbf{c},\mathbf{z})$-learner for $\mathcal{H}$ in expectation.
Lastly, we remark that instead of a randomized learner whose guarantees hold in expectation, one can de-randomize by choosing a single $h_t$ with $t\in[T]$ drawn uniformly at random as above. Moreover, the confidence parameter of the learner can easily be made arbitrarily large using standard confidence-boosting techniques. That is, by first splitting the input sample into approximately $\log(1/\delta)$ non-intersecting equal parts to learn roughly $\log(1/\delta)$ independent classifiers, each satisfying Equation (51) with constant probability. By the choice of the number of parts, with probability at least $1-\delta$, at least one of these classifiers will then satisfy the condition given in Equation (51). Then, by Chernoff’s inequality, for each of these classifiers the performance as tested over another independent test-set is close to that over the population distribution. Then, by choosing the one with the best empirical error over the test-set, we get the desired result. ∎
5 Multiclass Classification
In contrast to binary classification, boosting in the multiclass setting does not exhibit a clean dichotomy between “trivial” guarantees and “fully boostable” guarantees. A line of works (Mukherjee and Schapire, 2011; Brukhim et al., 2021, 2023b, 2023a) has studied multiclass boosting when $c$ is the standard 0-1 loss, and has shown a certain multichotomy determined by the thresholds $1-\frac1j$ for $j\in[k]$, which replace the single boostability threshold $\frac12$ of the binary case. For each such threshold, any learner achieving a loss below it can be “boosted” to a predictor that rules out some incorrect labels, albeit not all. Thus, for every example the predictor returns a subset of candidate labels, also called a list. A learner that beats the threshold $1-\frac1j$ in fact yields a list predictor with lists of size $j-1$, and it can be shown that the converse is also true. In this section we generalize these kinds of results to the case of arbitrary cost functions, proving that these list boostability and list-versus-loss equivalences hold more broadly.
We start in Section 5.1 by describing an algorithm that turns a $(c,z)$-learner into a list learner, proving the first part of Theorem C. Next, Section 5.3 proves some “list boostability” results that focus on the list length, based on the framework of List PAC learning (Brukhim et al., 2022; Charikar and Pabbaraju, 2023). We turn to multi-objective losses in Section 5.2, proving the first item of Theorem D. Finally, Section 5.4 concludes by giving the lower bounds of both Theorem C and Theorem D.
5.1 Cost-sensitive boosting
As said, in multiclass classification one cannot in general boost a learner into one that predicts the correct label. However, as prior work has shown, one can boost a learner into a list learner. This is a learner whose output hypothesis maps each $x\in\mathcal{X}$ to a set of labels, rather than to one single label. Formally, a list function is any map $\mu:\mathcal{X}\to2^{\mathcal{Y}}$. A list learner is then a learner that outputs a list function $\mu$. The performance of such a list function is typically measured in two ways. First, one wants $\mu(x)$ to contain the true label with good probability. Second, one wants $\mu$ to somehow be “close” to $f$; for instance, in the case of the standard 0-1 cost, this is measured by the length of the lists. All these guarantees are meant, as usual, over the target distribution $\mathcal{D}$. One can then ask the following question: given a $(c,z)$-learner, how good a list function can one construct? In order to answer this question in the cost-sensitive setting, we need to generalize list length to a more nuanced measure—one that is based once again on the values of games.
Definition 20 ($(c,z)$-boundedness).
Let $c$ be a cost function and let $z\in[0,1]$. A list function $\mu$ is $(c,z)$-bounded if, for every $x\in\mathcal{X}$, the list $S=\mu(x)$ satisfies $\mathrm{val}_S(c)\le z$.
Definition 21 ($(c,z)$-list learner).
Let $c$ be a cost function and let $z\in[0,1]$. An algorithm $\mathcal{A}$ is a $(c,z)$-list learner for a class $\mathcal{H}$ if there is a function $m_0:(0,1)^2\to\mathbb{N}$ such that for every $\varepsilon,\delta\in(0,1)$, every distribution $\mathcal{D}$ over $\mathcal{X}$, and every $f\in\mathcal{H}$ the following claim holds. If $S$ is a sample of $m\ge m_0(\varepsilon,\delta)$ examples drawn i.i.d. from $\mathcal{D}$ and labeled by $f$, then $\mathcal{A}(S)$ returns a $(c,z)$-bounded list function $\mu$ such that $\Pr_{x\sim\mathcal{D}}\big(f(x)\notin\mu(x)\big)\le\varepsilon$ with probability at least $1-\delta$.
The definitions above mirror closely the standard ones used in list learning, where the quality of a list is measured by its length $|\mu(x)|$. The crucial difference is that, now, we are instead using the value of the restricted game, $\mathrm{val}_{\mu(x)}(c)$. To get some intuition why this is the right choice, consider again the case where $c$ is the standard 0-1 cost. In that case it is well known that $\mathrm{val}_S(c)=1-\frac1{|S|}$ for every (nonempty) $S\subseteq\mathcal{Y}$. Then, by Definition 20 and Definition 21 a $(c,1-\frac1j)$-list learner is one that outputs lists of length at most $j$. That is, for the 0-1 cost Definition 21 yields the standard notion of $j$-list learner used in prior work on multiclass boosting. Now, for the special case of the 0-1 cost, Brukhim et al. (2023a) showed that any $(c,z)$-learner is equivalent to a $j$-list learner, with $j$ as defined above, in the sense that each one of them can be converted into the other. We show that this remains true for every cost function $c$, if one replaces the list length with $\mathrm{val}_S(c)$, and therefore the notion of $j$-list learner with the notion of $(c,z)$-list learner. That is, we prove that $(c,z)$-learners and $(c,z)$-list learners are equivalent. Formally, we prove:
Theorem 22.
Let $\mathcal{Y}$ be a finite label space, let $c$ be a cost function, let $z\in[0,1]$, and let $\mathcal{H}\subseteq\mathcal{Y}^{\mathcal{X}}$. Then:
• Every $(c,z)$-learner for $\mathcal{H}$ can be converted into a $(c,z)$-list learner for $\mathcal{H}$.
• Every $(c,z)$-list learner for $\mathcal{H}$ can be converted into a $(c,z+\varepsilon)$-learner for $\mathcal{H}$, for every $\varepsilon>0$.
Theorem 22 is the main technical result of the present sub-section, and implies easily the first item of C. And again, the above-mentioned result of Brukhim et al. (2023a) is a special case of Theorem 22, obtained for being the standard 0-1 cost. Moreover, one can easily show that all these results do not hold if one uses in place of . Our game-theoretical perspective therefore seems the right point of view for characterizing multiclass boostability. Let us see how the first item of C follows from Theorem 22. Recall the thresholds defined in Equation 6:
(52)
where the first equalities hold because for any singleton and, by convention, for the empty set. Given , let be the largest integer such that . Now, suppose we have a -learner for some . By the first item of Theorem 22, there exists a -list learner for , too. However, by definition of every with satisfies . Therefore, actually returns list functions that are -bounded, for arbitrarily small. That is, is actually a -list learner for . By the second item of Theorem 22, then, there exists a -learner for .
The rest of the subsection is therefore devoted to proving Theorem 22: the first item in Section 5.1.1, and the second one in Section 5.1.2.
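Before diving into the proofs, it may help to make the game-theoretic quantities concrete. The following is a minimal Python sketch of how one can compute the value of the cost game restricted to a list of labels, together with its minimax distribution. It assumes the standard zero-sum formulation in which the predictor plays a distribution over the list and the adversary picks the true label from the list; the names restricted_game_value, C, and L are ours, not the paper's notation.

import numpy as np
from scipy.optimize import linprog

def restricted_game_value(C, L):
    """Value and minimax distribution of the cost game restricted to L.

    C[y, yhat] is the cost of predicting yhat when the true label is y;
    L is the list (a subset of labels). Solves the linear program
    min over distributions p on L of max_{y in L} sum_{yhat} p[yhat] * C[y, yhat].
    """
    L = list(L)
    C_L = np.asarray(C, dtype=float)[np.ix_(L, L)]  # restrict rows/columns to L
    m = len(L)
    # Variables: p_1, ..., p_m and the game value v; minimize v.
    c_obj = np.r_[np.zeros(m), 1.0]
    # For every y in L: (C_L p)_y - v <= 0.
    A_ub = np.c_[C_L, -np.ones(m)]
    b_ub = np.zeros(m)
    # p must be a probability distribution over L.
    A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]       # p >= 0, v free
    res = linprog(c_obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.fun, res.x[:m]

For instance, with the 0-1 cost matrix this sketch returns the value 1 - 1/|L| together with the uniform distribution on L, consistently with the behaviour of the 0-1 cost recalled above.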
5.1.1 Turning a -learner into a -list learner
We prove the first item of Theorem 22 by giving an algorithm, Algorithm 3. The algorithm is very similar to the one for the binary case. As in that case, we define a certain margin of the learner in order to bound the number of “boosting” rounds of the algorithm. For the multiclass case, however, the definition is in general slightly different.
Definition 23 (Margin).
Let be a cost function and let . The margin of a -learner is , or whenever the set is empty.
Note that only when . In that case, the first item of Theorem 22 holds trivially by just considering the trivial list learner that outputs the constant list function . Hence, in what follows we always consider the case . We shall prove:
Theorem 24 (Weak List Learning).
Let and let be a cost function. If is a -learner for with margin , then Algorithm 3, with the choice of parameters of Lemma 25 and , is a -list learner for . Under the same assumptions, Algorithm 3 makes oracle calls to , where is the size of the input sample, and has sample complexity .
Theorem 24 follows immediately from two technical results stated below. The first result, Lemma 25, proves that Algorithm 3 can turn a -learner into a list function that is -bounded and that with probability is consistent with (i.e., for every ), for as small as desired. The second result, Lemma 26, shows that the list function has small generalization error: if the input sample of Algorithm 3 is sufficiently large, then , where can be chosen as small as desired.
Lemma 25.
Let , let , and let be a -learner for with sample complexity . Fix any , any distribution over , and any . Then, given a multiset of examples formed by i.i.d. points labeled by , oracle access to , and parameters , , and , Algorithm 3 returns a list function that is -bounded and that, with probability at least , satisfies:
Proof.
First, we prove that is -bounded. Fix any . Observe that is a distribution over ; denote it by , so that for every label . Thus, by the definition of , the sum involved in the condition for including a certain label in the list is
(53)
Then, observe that
(54)
where the last inequality holds by definition of .
We now turn to proving that with probability at least we have for all . The proof is similar to that of Lemma 12. For any given , the analysis of Hedge and the definitions of and yield:
(55)
Furthermore, since is a -learner for , and given the choices of and , by a union bound over we have that with probability at least :
(56)
Conditioning on Equation 56, we prove that for every . Consider the function defined in Algorithm 3. By Equations 55 and 56, we obtain for every that
(57)
which in turn implies that by construction of . This concludes the proof. ∎
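For intuition, here is a minimal, self-contained Python sketch of the kind of Hedge-based loop that Algorithm 3 runs. It is not the paper's algorithm: the learning rate eta, the number of rounds T, and the inclusion threshold theta are placeholders standing in for the precise parameter choices of Lemma 25, and weak_learner stands for the oracle.

import numpy as np

def hedge_list_booster(sample, weak_learner, C, T, eta, theta):
    """Skeleton of a Hedge-style list booster (illustrative only).

    sample is a list of (x, y) pairs; weak_learner maps a distribution
    over the sample to a hypothesis h with h(x) in range(k);
    C[y, yhat] is the cost of predicting yhat when the truth is y.
    """
    m = len(sample)
    k = C.shape[0]
    w = np.ones(m)                            # Hedge weights over the sample
    hypotheses = []
    for _ in range(T):
        P = w / w.sum()                       # play the normalized weights
        h = weak_learner(P)                   # one oracle call per round
        hypotheses.append(h)
        # Upweight the examples on which the last hypothesis pays a large cost.
        losses = np.array([C[y, h(x)] for (x, y) in sample])
        w *= np.exp(eta * losses)

    def mu(x):
        # Keep the labels whose average cost against the ensemble is small.
        avg_cost = sum(C[:, h(x)] for h in hypotheses) / T
        return {y for y in range(k) if avg_cost[y] <= theta}

    return mu

In the actual analysis, the parameters are chosen so that the resulting list function is -bounded and, with high probability, consistent with the labeled sample, exactly as Lemma 25 states.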
Just as Lemma 13 does for the binary case, we show that the list function returned by Algorithm 3 has small generalization error provided that the sample is sufficiently large. The main idea to prove generalization is again via a sample compression scheme, but relying instead on a novel result by Brukhim et al. (2023a) for multiclass classification; note that, while the result of Brukhim et al. (2023a) is stated in terms of list functions whose outputs are tuples in for some , the same proof applies to the -bounded list functions output by our Algorithm 3.
Lemma 26.
Assume the setting of Lemma 25. For any , if the size of the sample given to Algorithm 3 satisfies
then the output of Algorithm 3 satisfies with probability at least . Therefore, Algorithm 3 is a -list learner for with sample complexity .
Proof.
By analogy with the proof of Lemma 13, we first apply Lemma 25 with in place of in order to obtain an -bounded list function consistent with with probability at least . We then apply a compression-based generalization argument. To do so, we remark once again that one can construct a compression scheme for of size equal to the total size of the samples on which operates, which is equal to . The main difference with the binary case is that we rely on the generalization bound for a sample compression scheme for lists as per Theorem 6 of Brukhim et al. (2023a), with in place of ; we can employ this generalization bound thanks to the consistency of with (with sufficiently large probability). Then, this implies that
(58)
holds with probability at least . By calculations similar to those in Equations 18 and 19, and replacing the values of and , the right-hand side of the above inequality can easily be shown to be at most whenever
(59)
A union bound concludes the proof for the first part of the claim, since it shows that Algorithm 3 is an -list learner for . The claim on the sample complexity of Algorithm 3 then follows by straightforward calculations and the definition of . ∎
At this point, we have all the ingredients and tools to prove that we can construct a -list learner from a -learner.
Proof of Theorem 24.
Consider Algorithm 3 under the same assumptions as Lemma 25, and set as per assumption. By Lemma 25 and Lemma 26, we know that Algorithm 3 is a -list learner with sample complexity that performs oracle calls to the -learner . Moreover, we can immediately conclude that Algorithm 3 is also a -list learner because and , by the definitions of and . ∎
5.1.2 Turning a -list learner into a -learner
We prove that every -list learner can be converted into a -learner.
Lemma 27 (List Weak Learning).
There exists an algorithm that, for every , satisfies the following. Let be a cost function and let . Given oracle access to a -list learner for with sample complexity , algorithm is a -learner for with sample complexity that makes one oracle call to .
Proof.
Fix any and any distribution over , and suppose is given a sample of size consisting of examples drawn i.i.d. from and labeled by . First, calls on . By hypothesis, then, returns a -bounded that with probability at least satisfies:
(60)
Conditioning on the event above from this point onwards, we give a randomized predictor such that . First, for any nonempty , let be the minimax distribution achieving the value of the game restricted to , i.e.,
(61)
Note that can be efficiently computed via a linear program. Then, simply define .
Let us analyse the loss of under . First, by the law of total expectation, and since ,
(62)
where the summation is over all with , and is the distribution obtained from by conditioning on the event . Consider the right-hand side of Equation 62. By Equation 60, the first term is at most . For the second term, denote by the marginal of over ; note that, crucially, . Therefore, by definition of :
(63)
where the last inequality holds as and is -bounded. Using Equations 60 and 63 in Equation 62 shows that . ∎
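Continuing the illustration, a randomized predictor of the kind used in this proof can be sketched as follows, reusing the restricted_game_value helper from the earlier sketch (again, the names are ours; the renormalization merely guards against LP round-off).

import numpy as np

def predictor_from_list(mu, C, rng=None):
    """Randomized predictor built from a list function mu (illustrative).

    For each x, play the minimax distribution of the cost game restricted
    to the list mu(x), as computed by restricted_game_value.
    """
    rng = rng if rng is not None else np.random.default_rng()

    def predict(x):
        L = sorted(mu(x))                      # the (nonempty) list for x
        _, q = restricted_game_value(C, L)     # minimax distribution on L
        q = np.clip(q, 0.0, None)
        q = q / q.sum()                        # guard against round-off
        return L[rng.choice(len(L), p=q)]

    return predict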
5.2 Multi-objective losses
In this subsection we prove the first item of D. Coupled with Lemma 39 below, this proves D. Both results require the following definition. Let denote the family of all subsets of which are avoided by , i.e.,
The algorithm proposed below then scales with , which may be exponential in . We remark that for our purposes, it also suffices to consider only the minimally-sized avoided lists in , i.e., if and then . However, in the worst case remains of the same order and so we keep the definition as given above for simplicity. Lemma 39 will then give a lower bound showing that this condition is, in some sense, necessary.
Theorem 28.
Let , let where each is a cost function, and let . Let such that . Assume there exists a -learner for a class . Then, there exists a -learner for .
Proof. The theorem follows from Lemma 29 below, applied with a suitably small choice of its additional parameter. ∎
Lemma 29.
Let where each is a cost function, and let such that . Assume there exists a -learner for a class . Then, there exists a -learner for , for every .
Proof.
First, by Lemma 31 we get that can be used to obtain an algorithm that is a -list learner for . Next, we show that can be boosted to a -learner. Fix any . First, we argue that is in fact a -list learner (see Definition 21). This holds since for every such that , it must be that , by definition and by Proposition 19. In particular, for any list function output by , and for every it holds that and so .
Thus, we have established that is in fact a -list learner. Lastly, by Lemma 27 we get that it can also be converted into a -learner, as needed. ∎
Definition 30 (-list learner).
An algorithm is a -list learner for a class if there is a function such that for every , every distribution over , and every the following claim holds. If is a sample of examples drawn i.i.d. from and labeled by , then returns a list function such that with probability , and such that for every and every it holds that .
Lemma 31.
Let , let where each is a cost function, and let such that . Assume there exists a -learner for a class . Then, there exists a -list learner for .
Proof.
First, fix any . By assumption that , we also have that . In particular, we have that is not -dice-attainable. Then, by Proposition 19 we have that there exist such that,
(64)
where . Thus, we have that is also a -learner for , and that . We denote this learning algorithm by .
Notice that we can repeat the above process for any such . Thus, we obtain different learning algorithms for each .
Next, we will describe the construction of the -list learning algorithm. Fix any . For every learner , apply Algorithm 3 with parameters , where is the margin of , our -learner (see Definition 23). We also set the parameters of to be , and , for . The remaining parameters for Algorithm 3 are set as in Lemma 25.
For each such run of the algorithm, we obtain a list function. Let denote all list functions obtained by this procedure. Finally, return the list function defined by
for every . We will now show that this algorithm satisfies the desired guarantees, and is indeed a -list learner.
First, by Lemma 25 and a union bound we get that with probability at least , for each it holds that:
Thus, in particular, with probability at least we also have that for all . Moreover, by Lemma 25 we have that all are -bounded. That is, for each and every it holds that . Now recall that by the definition of the margin, and by Equation 64 we in fact have that:
This then implies that for every it holds that . Thus, in particular, we also get that the final list function satisfies that for every it holds that . Lastly, by following the same compression-based generalization analysis as in Lemma 26, we obtain that the above procedure is in fact a -list learning algorithm, with sample complexity , where .
∎
5.3 Multiclass boosting via -list PAC learning
In this section we aim to convert weak learners into list learners with a fixed list size. This is, in some sense, a coarsening of the previous subsection, and it also subsumes previous work by Brukhim et al. (2023a), where the authors demonstrate thresholds which determine the achievable list sizes. To that end, for every define the following coarsening of the critical thresholds of :
(65)
Clearly, for all we have that . It is also easy to see that , and . When clear from context, we omit and denote the thresholds by and . We remark that the two types of thresholds and are useful for different ways of boosting as specified in the remainder of this section.
Before stating the theorem, we first need to describe what it means to “partially” boost. The results given in this section are based on the framework of List PAC learning (Brukhim et al., 2022; Charikar and Pabbaraju, 2023). The connection between list learning and multiclass boosting was demonstrated in prior works (Brukhim et al., 2023a, b). Here we strengthen this link and generalize previous results to hold for arbitrary costs. We start by introducing list learning in Definition 32, followed by the statement of Theorem 33.
Definition 32 (-List PAC Learning (Brukhim et al., 2022; Charikar and Pabbaraju, 2023)).
An algorithm is a -list PAC learner for a class if the following holds. For every distribution over and target function , and every , if is a set of examples drawn i.i.d. from and labeled by , then returns such that, with probability at least ,
Theorem 33.
Let , let be any cost function, and let . Let as defined in Equation 65, and let be the smallest integer for which or, if none exists, let . Then, the following claims both hold.
-
•
is -boostable: for every class , every -learner is boostable to an -list PAC learner.
-
•
is not -boostable: for some there exists a -learner that cannot be boosted to an -list PAC learner with .
In words, Theorem 33 implies that there is a partition of the loss values interval into sub-intervals or “buckets”, based on . Then, any loss value in a certain bucket can be boosted to a list of size , but not below that. In Theorem 36 we show the converse: any -list learner can be boosted to a -learner, where may be the lowest value within the -th interval , but not below it.
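In code, the bucketing just described amounts to a simple threshold lookup. The sketch below is purely illustrative: it assumes the coarsened thresholds are given in nonincreasing order and that strictly clearing the -th threshold is what makes a list of size achievable, the exact comparison being the one stated in Theorem 33.

def achievable_list_size(gamma, thresholds):
    """Smallest achievable list size for a weak learner with guarantee gamma.

    thresholds[j - 1] is assumed to be the coarsened critical threshold
    for lists of size j, sorted in nonincreasing order (illustrative only).
    """
    for j, t in enumerate(thresholds, start=1):
        if gamma > t:        # gamma lands in the bucket for list size j
            return j
    return None              # below every threshold: only trivial lists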
The first item of Theorem 33 holds trivially for , since in that case and a -list learner always exists. The case is addressed by Lemma 34 below.
Lemma 34.
Let , let be a cost function, and define as in Equation 65. Let and let be a -learner for with sample complexity such that . Let be the smallest integer in such that and let be the margin. Fix any , let be any distribution over , and let be a multiset of examples given by i.i.d. points labeled by . Finally, fix any . If Algorithm 3 is given , oracle access to , and parameters , , , and , then Algorithm 3 makes calls to and returns such that with probability at least :
Proof.
Fix any and any . Let be any distribution over and let be a multiset of examples given by i.i.d. points labeled by . The first part of this proof follows similar steps as the one of Lemma 12. In particular, fix any and observe that, again by the regret analysis of Hedge and by the definitions of and , Algorithm 3 guarantees
(66)
Furthermore, since is a -learner for , and given the choices of and , by a union bound over we obtain that
(67)
with probability at least .
Now, we show that the list predictor built by Algorithm 3 is consistent with with sufficiently large probability. Precisely, by conditioning on the event given by the inequality in Equation 67, we now demonstrate that for every . Consider the function as defined by Algorithm 3. Then, by Equations 66 and 67 together with the definition of , we obtain for every that
(68)
which in turn implies that by construction of .
We now proceed with a deterministic characterization of the lists returned by , independently of any conditioning. Fix any . Let be such that for any , and observe that the sum involved within the condition for in the construction of the list is equal to . Furthermore, it must be true that since . We can then show that the list satisfies
(69)
where the first inequality holds by definition of . Consequently, after observing that if , note that
(70)
which in turn implies that . We can thus assume that outputs elements of without loss of generality. ∎
The following lemma proves that, just as Lemma 13 does in the binary setting, the list predictor output by Algorithm 3 has a sufficiently small generalization error provided that the sample is sufficiently large. The main idea to prove generalization is again via a sample compression scheme, relying instead on a novel result by Brukhim et al. (2023a) for multiclass classification. This generalization result, combined with Lemma 34, suffices to prove Theorem 33.
Lemma 35.
Assume the hypotheses of Lemma 34. For any , if the size of the sample given to Algorithm 3 satisfies
then with probability at least the output of Algorithm 3 satisfies . Therefore, Algorithm 3 is an -list PAC learner for . Moreover, one can ensure that Algorithm 3 makes calls to and has sample complexity .
Proof.
By analogy with the proof of Lemma 13, we first apply Lemma 34 with in place of in order to obtain an -list predictor consistent with with probability at least . We then apply a compression-based generalization argument. To do so, we remark once again that one can construct a compression scheme for of size equal to the total size of the samples on which operates, which is equal to . The main difference with the binary case is that we rely on the generalization bound for a sample compression scheme for lists as per Theorem 6 of Brukhim et al. (2023a), with in place of ; we can employ this generalization bound thanks to the consistency of with (with sufficiently large probability). Then, this implies that
(71)
holds with probability at least . By calculations similar to those in Equations 18 and 19, and replacing the values of and , the right-hand side of the above inequality can easily be shown to be at most whenever:
(72)
A union bound concludes the proof for the first part of the claim, since it shows that Algorithm 3 is an -list PAC learner for . The sample complexity of Algorithm 3 then immediately follows by definition of . ∎
Next, we prove a kind of converse of Theorem 33. That is, one can convert a list learner into a weak learner. This result is stated formally as follows.
Theorem 36 (-List Weak Learning).
There exists an algorithm that, for every , satisfies the following. Let be a cost function, let , and let as defined in Equation (65). Let be the smallest integer for which or, if none exists, let . Given oracle access to a -list PAC learner for with sample complexity , algorithm is a -learner for with sample complexity that makes one oracle call to .
Proof.
Fix any and any distribution over , and suppose is given a sample of size consisting of examples drawn i.i.d. from and labeled by . First, calls on . By hypothesis, then, returns that with probability at least satisfies:
(73)
Conditioning on the event above, we give a randomized predictor such that . First, for any nonempty let be the minimax distribution achieving the value of the game restricted to ,
(74)
Note that can be efficiently computed via a linear program. Then, simply define .
Let us analyse the loss of under . First, by the law of total probability, and since ,
(75)
where the summation is over all with , and is the distribution obtained from by conditioning on the event . Consider the right-hand side of Equation 75. By Equation 73, the first term is at most . For the second term, denote by the marginal of over ; note that, crucially, . Therefore, by definition of and of :
(76)
We conclude that . ∎
5.4 Lower bounds
We start by proving the second item of C.
Lemma 37.
Let and let be any cost function. Let as defined in Equation 6, and let be the largest integer for which . Then there exists a class that has a -learner but no -learner for any .
Proof.
If then the claim is trivial, hence we assume . Observe that this implies that there exists some with such that . Let be the distribution achieving , that is:
(77)
Now let be any infinite domain (e.g., ) and let . For every distribution over and every labeling ,
(78)
where is the marginal of over . Thus the learner that returns the random guess based on is a -learner for .
Next, we show that has no -learner with . Suppose indeed, towards a contradiction, that there exists such a -learner . By Theorem 24, then, admits a -list learner . Now, as , the list function returned by ensures for all . Moreover, as , we can assume without loss of generality that for all ; otherwise we could remove the elements of to obtain a list such that and that . It follows that for all . Therefore, is a -list PAC learner. However, one can verify that the -Daniely-Shalev-Shwartz dimension of is unbounded (see Definition 6 of Charikar and Pabbaraju (2023)). It follows by Charikar and Pabbaraju (2023, Theorem 1) that is not -list PAC learnable, yielding a contradiction. ∎
Next, we prove the second item of Theorem 33.
Lemma 38.
Let , let be any cost function, and let as defined in Equation (65). Let and let be the largest integer for which . Then, there exists a class that has a -learner and is not -list PAC learnable.
Proof. The argument mirrors that of Lemma 37, with the coarsened thresholds of Equation 65 in place of the thresholds of Equation 6. ∎
Finally, we prove the second item of D.
Lemma 39.
Let , let where each is a cost function, and let such that . There exists a class that admits a -learner that is trivial, and therefore cannot be boosted to a -learner.
Proof.
By the assumption that , there must be a set for which and . Let be any infinite domain (e.g., ) and let . By definition of there exists a trivial learner that is a -learner for . Assume towards a contradiction that the trivial learner can be used to construct a -learner for . Since , by Proposition 19 there exists such that , where . Then, by Theorem 24, there is a -list learner that outputs list functions such that, for every , it holds that:
Thus is a -list PAC learner for . However, one can verify that the -Daniely-Shalev-Shwartz dimension of is unbounded (see Definition 6 of Charikar and Pabbaraju (2023)). It follows by Charikar and Pabbaraju (2023, Theorem 1) that is not -list PAC learnable, yielding a contradiction. ∎
References
- Allwein et al. [2000] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. Journal of Machine Learning Research, 1(Dec):113–141, 2000.
- Ben-David et al. [2001] Shai Ben-David, Philip M Long, and Yishay Mansour. Agnostic boosting. In International Conference on Computational Learning Theory, pages 507–516. Springer, 2001.
- Blackwell [1956] D Blackwell. An analog of the minimax theorem for vector payoffs. Pacific Journal of Mathematics, 6(1):1–8, 1956.
- Brukhim et al. [2021] Nataly Brukhim, Elad Hazan, Shay Moran, and Robert E. Schapire. Multiclass boosting and the cost of weak learning. In Advances in Neural Information Processing Systems, NeurIPS, 2021.
- Brukhim et al. [2022] Nataly Brukhim, Daniel Carmon, Irit Dinur, Shay Moran, and Amir Yehudayoff. A characterization of multiclass learnability. In Annual Symposium on Foundations of Computer Science, FOCS, 2022.
- Brukhim et al. [2023a] Nataly Brukhim, Amit Daniely, Yishay Mansour, and Shay Moran. Multiclass boosting: simple and intuitive weak learning criteria. In Advances in Neural Information Processing Systems, NeurIPS, 2023a.
- Brukhim et al. [2023b] Nataly Brukhim, Steve Hanneke, and Shay Moran. Improper multiclass boosting. In Proceedings of the Conference on Learning Theory, COLT, 2023b.
- Canonne [2020] Clément L. Canonne. A short note on learning discrete distributions, 2020. URL https://arxiv.org/abs/2002.11457.
- Charikar and Pabbaraju [2023] Moses Charikar and Chirag Pabbaraju. A characterization of list learnability. In Proceedings of the Annual Symposium on Theory of Computing, STOC, 2023.
- Durgin and Juba [2019] Alexander Durgin and Brendan Juba. Hardness of improper one-sided learning of conjunctions for all uniformly falsifiable CSPs. In Proceedings of the International Conference on Algorithmic Learning Theory, ALT, 2019.
- Ehrgott [2005] Matthias Ehrgott. Multicriteria optimization, volume 491. Springer Science & Business Media, 2005.
- Elkan [2001] Charles Elkan. The foundations of cost-sensitive learning. In Bernhard Nebel, editor, Proceedings of the International Joint Conference on Artificial Intelligence, IJCAI, pages 973–978, 2001.
- Fan et al. [1999] Wei Fan, Salvatore J. Stolfo, Junxin Zhang, and Philip K. Chan. Adacost: Misclassification cost-sensitive boosting. In Proceedings of the International Conference on Machine Learning, ICML, 1999.
- Feldman [2009] Vitaly Feldman. Distribution-specific agnostic boosting. arXiv preprint arXiv:0909.2927, 2009.
- Freund [1990] Yoav Freund. Boosting a weak learning algorithm by majority. In Proceedings of the Workshop on Computational Learning Theory, COLT, 1990. URL http://dl.acm.org/citation.cfm?id=92640.
- Freund and Schapire [1997] Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci., 55(1):119–139, 1997. URL https://doi.org/10.1006/jcss.1997.1504.
- Green Larsen and Ritzert [2022] Kasper Green Larsen and Martin Ritzert. Optimal weak to strong learning. Advances in Neural Information Processing Systems, 35:32830–32841, 2022.
- Kalai and Servedio [2005] Adam Tauman Kalai and Rocco A. Servedio. Boosting in the presence of noise. J. Comput. Syst. Sci., 71(3):266–290, 2005. doi: 10.1016/j.jcss.2004.10.015. URL https://doi.org/10.1016/j.jcss.2004.10.015.
- Kalai et al. [2008] Adam Tauman Kalai, Yishay Mansour, and Elad Verbin. On agnostic boosting and parity learning. In Proceedings of the fortieth annual ACM symposium on Theory of computing, pages 629–638, 2008.
- Kalai et al. [2012] Adam Tauman Kalai, Varun Kanade, and Yishay Mansour. Reliable agnostic learning. Journal of Computer and System Sciences, 78(5):1481–1495, 2012.
- Kanade and Kalai [2009] Varun Kanade and Adam Kalai. Potential-based agnostic boosting. In Advances in neural information processing systems, NIPS, 2009.
- Kanade and Thaler [2014] Varun Kanade and Justin Thaler. Distribution-independent reliable learning. In Proceedings of the Conference on Learning Theory, COLT, 2014.
- Karakoulas and Shawe-Taylor [1998] Grigoris I. Karakoulas and John Shawe-Taylor. Optimizing classifers for imbalanced training sets. In Advances in Neural Information Processing Systems, NIPS, 1998.
- Kearns [1988] M. Kearns. Thoughts on hypothesis boosting. Unpublished, December 1988.
- Kivinen [1995] Jyrki Kivinen. Learning reliably and with one-sided error. Mathematical systems theory, 28(2):141–172, 1995.
- Landesa-Vazquez and Alba-Castro [2012] Iago Landesa-Vazquez and José Luis Alba-Castro. Shedding light on the asymmetric learning capability of adaboost. Pattern Recognition Letters, 33(3):247–255, 2012. URL https://doi.org/10.1016/j.patrec.2011.10.022.
- Landesa-Vazquez and Alba-Castro [2013] Iago Landesa-Vazquez and José Luis Alba-Castro. Double-base asymmetric adaboost. Neurocomputing, 118:101–114, 2013. URL https://doi.org/10.1016/j.neucom.2013.02.019.
- Landesa-Vázquez and Alba-Castro [2015] Iago Landesa-Vázquez and José Luis Alba-Castro. Revisiting adaboost for cost-sensitive classification. part i: Theoretical perspective. arXiv preprint arXiv:1507.04125, 2015.
- Ling and Sheng [2008] Charles X Ling and Victor S Sheng. Cost-sensitive learning and the class imbalance problem. Encyclopedia of Machine Learning, 2011:231–235, 2008.
- Ling and Sheng [2017] Charles X. Ling and Victor S. Sheng. Cost-sensitive learning. In Encyclopedia of Machine Learning and Data Mining. Springer, 2017. URL https://doi.org/10.1007/978-1-4899-7687-1_181.
- Long and Servedio [2008] Philip M. Long and Rocco A. Servedio. Adaptive martingale boosting. In Daphne Koller, Dale Schuurmans, Yoshua Bengio, and Léon Bottou, editors, Advances in Neural Information Processing Systems 21, Proceedings of the Twenty-Second Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 8-11, 2008, pages 977–984. Curran Associates, Inc., 2008. URL http://papers.nips.cc/paper/3623-adaptive-martingale-boosting.
- Masnadi-Shirazi and Vasconcelos [2011] Hamed Masnadi-Shirazi and Nuno Vasconcelos. Cost-sensitive boosting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2):294–309, 2011. URL https://doi.org/10.1109/TPAMI.2010.71.
- Møller Høgsgaard et al. [2024] Mikael Møller Høgsgaard, Kasper Green Larsen, and Markus Engelund Mathiasen. The many faces of optimal weak-to-strong learning. arXiv preprint, 2024.
- Mukherjee and Schapire [2011] Indraneel Mukherjee and Robert E. Schapire. A theory of multiclass boosting. Journal of Machine Learning Research, 14:437–497, 2011. URL http://rob.schapire.net/papers/multiboost-journal.pdf. Also available at http://arxiv.org/abs/1108.2989.
- Nikolaou et al. [2016] Nikolaos Nikolaou, Narayanan Unny Edakunni, Meelis Kull, Peter A. Flach, and Gavin Brown. Cost-sensitive boosting algorithms: Do we really need them? Machine Learning, 104(2-3):359–384, 2016. URL https://doi.org/10.1007/s10994-016-5572-x.
- Raman and Tewari [2022] Vinod Raman and Ambuj Tewari. Online agnostic multiclass boosting. Advances in Neural Information Processing Systems, 35:25908–25920, 2022.
- Raman et al. [2019] Vinod Raman, Daniel T Zhang, Young Hun Jung, and Ambuj Tewari. Online boosting for multilabel ranking with top-k feedback. arXiv preprint arXiv:1910.10937, 2019.
- Sabato and Tishby [2012] Sivan Sabato and Naftali Tishby. Multi-instance learning with any hypothesis class. The Journal of Machine Learning Research, 13(1):2999–3039, 2012.
- Schapire [1990] Robert E. Schapire. The strength of weak learnability. Machine Learning, 5(2):197–227, 1990. URL https://doi.org/10.1007/BF00116037.
- Schapire and Freund [2012] Robert E Schapire and Yoav Freund. Boosting: Foundations and algorithms. Cambridge university press, 2012.
- Schapire and Singer [1999] Robert E Schapire and Yoram Singer. Improved boosting algorithms using confidence-rated predictions. Machine Learning, 37(3):297–336, 1999.
- Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.
- Sun et al. [2007] Yanmin Sun, Mohamed S. Kamel, Andrew K. C. Wong, and Yang Wang. Cost-sensitive boosting for classification of imbalanced data. Pattern Recognition, 40(12):3358–3378, 2007. URL https://doi.org/10.1016/j.patcog.2007.04.009.
- Ting [2000] Kai Ming Ting. A comparative study of cost-sensitive boosting algorithms. In Proceedings of the International Conference on Machine Learning, ICML, 2000.
- Valiant [1984] Leslie G Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
- Viola and Jones [2001] Paul A. Viola and Michael J. Jones. Fast and robust classification using asymmetric adaboost and a detector cascade. In Advances in Neural Information Processing Systems, NIPS, 2001.
- von Neumann [1928] John von Neumann. Zur theorie der gesellschaftsspiele. Mathematische Annalen, 100(1):295–320, 1928.
- Zadrozny et al. [2003] Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-proportionate example weighting. In International Conference on Data Mining, ICDM, 2003.
Appendix A Cost-Sensitive Learning
Cost-sensitive learning has a long history in machine learning. The two main motivations are that, first, not all errors have the same impact, and second, that there might be a class imbalance between the frequencies of different classes. Over the years there have been many workshops on cost-sensitive learning, at ICML (2000), NIPS (2008), and SDM (2018), yet another indication of the importance of cost-sensitive losses. See, e.g., Ling and Sheng [2017, 2008]. Recall the definition of the cost-sensitive loss:
(79) |
Given the distribution on , the Bayes-optimal prediction rule is given by:
(80) |
for every . The early works include Elkan [2001], which characterizes the Bayes optimal predictor as a threshold, and its implication for re-balancing and decision tree learning. In fact, this resembles our binary prediction rule.
Sample complexity bounds for cost-sensitive learning with large margins appear in Karakoulas and Shawe-Taylor [1998]. Additional sample complexity bounds for cost-sensitive learning, based on transformations of the learning algorithm and on rejection sampling, are found in Zadrozny et al. [2003]. The idea of cost-sensitive learning has a long history in statistics. For binary classification, there are many different metrics used for evaluation. Let . The false-positive (FP) rate is , the false-negative rate is , the precision is , the recall is , and more.
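For reference, here is a short sketch of these metrics in terms of the four confusion-matrix counts, under the usual conventions (the naming is ours).

def binary_metrics(tp, fp, fn, tn):
    """Standard binary classification metrics from confusion counts."""
    return {
        "FPR": fp / (fp + tn),        # false-positive rate, over the negatives
        "FNR": fn / (fn + tp),        # false-negative rate, over the positives
        "precision": tp / (tp + fp),  # over the predicted positives
        "recall": tp / (tp + fn),     # = 1 - FNR
    }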
A.1 Boosting cost-sensitive loss
There has been a significant amount of work motivated by adapting AdaBoost to a cost-sensitive loss. At a high level, the different proposals either modify the way the algorithm updates its weights, taking into account the cost-sensitive loss, or change the final prediction rule. Nikolaou et al. [2016] give an overview of various AdaBoost variants for the cost-sensitive case. See also Landesa-Vázquez and Alba-Castro [2015]. Our modified update rule of Hedge in Algorithm 1 corresponds to CGAda of Sun et al. [2007] and related AdaBoost variants.
The theoretically sound proposals focus on deriving guarantees similar to those of AdaBoost. Cost-sensitive boosting by Masnadi-Shirazi and Vasconcelos [2011] modifies the exponential updates to include the cost-sensitive loss. Cost-Generalized AdaBoost by Landesa-Vazquez and Alba-Castro [2012] modifies the initial distribution over the examples. AdaBoostDB by Landesa-Vazquez and Alba-Castro [2013] modifies the base of the exponents used in the updates. All these theoretically sound proposals aim to guarantee convergence under the standard weak-learning assumption, and their main objective is to derive better empirical results when faced with a cost-sensitive loss. However, they do not address the essential question: when is boosting possible? In particular, they do not study cost-sensitive variants of weak learners and do not characterize the boostability of such learners. In addition to the papers above, there have been many heuristic modifications of AdaBoost which try to address the cost-sensitive loss [Karakoulas and Shawe-Taylor, 1998, Fan et al., 1999, Ting, 2000, Viola and Jones, 2001, Sun et al., 2007].
A.2 Multi-objective learning and boosting
Learning with multiple objectives is also common in machine learning. A well-studied special case is given by the variants of learning with one-sided error [Kivinen, 1995, Sabato and Tishby, 2012]. A typical goal in one-sided learning, or in the related positive-reliable learning [Kalai et al., 2012, Kanade and Thaler, 2014, Durgin and Juba, 2019], is to guarantee an (almost) zero false-positive loss and simultaneously a low false-negative loss. This corresponds to a -learner with and . More generally, Kalai et al. [2012] also considered tolerant reliable learning with an arbitrary . Our results apply in this context. For example, in the binary case, our results show that a -learner (i.e., a -tolerant-reliable one) is boostable if and only if . Moreover, our results also imply boostability and learnability results for reliable learning in the multiclass case, a learning setting mostly overlooked so far.
Appendix B Multiclass Boosting
Boosting is a fundamental methodology in machine learning that turns weak learning rules into a strong one. Boosting theory was originally designed for binary classification. The study of boosting was initiated in a line of seminal works which include the celebrated AdaBoost algorithm, as well as many other algorithms with various applications (see, e.g., Kearns [1988], Schapire [1990], Freund [1990], Freund and Schapire [1997]). It was later adapted to other settings and extensively studied in broader contexts as well [Ben-David et al., 2001, Kalai and Servedio, 2005, Long and Servedio, 2008, Kalai et al., 2008, Kanade and Kalai, 2009, Feldman, 2009, Møller Høgsgaard et al., 2024, Green Larsen and Ritzert, 2022, Raman and Tewari, 2022, Raman et al., 2019].
There are also various extensions of boosting to the multiclass setting. The early extensions include AdaBoost.MH, AdaBoost.MR, and approaches based on Error-Correcting Output Codes (ECOC) [Schapire and Singer, 1999, Allwein et al., 2000]. These works often reduce the -class task to a single binary task. The binary reduction can have various problems, including increased complexity and the lack of guarantees for an optimal joint predictor.
Notably, a work by Mukherjee and Schapire [2011] established a theoretical framework for multiclass boosting which generalizes previous learning conditions. However, it requires the assumption that the weak learner minimizes a complicated loss function that is significantly different from the simple classification error.
More recently, there have been several works on multiclass boosting. First, Brukhim et al. [2021] followed a formulation of multiclass boosting similar to that of Mukherjee and Schapire [2011]. They proved a hardness result showing that a broad, yet restricted, class of boosting algorithms must incur a cost which scales polynomially with . Then, Brukhim et al. [2023b] and Brukhim et al. [2023a] demonstrated efficient methods for multiclass boosting. We note that our work generalizes the results given in Brukhim et al. [2023a] to the cost-sensitive setting, as detailed in Section 5.