Iterative Weak Learnability and Multi-Class AdaBoost
Abstract.
We construct an efficient recursive ensemble algorithm for the multi-class classification problem, inspired by SAMME (\citeasnounZhuZouRossetandHastie09). We strengthen the weak learnability condition in \citeasnounZhuZouRossetandHastie09 by requiring that it hold for any subset of labels with at least two elements. This condition is simpler to check than many proposed alternatives
(e.g., \citeasnounMukherjeeSchapire2013). Like SAMME, our algorithm reduces to the Adaptive Boosting algorithm (\citeasnounSchapireandFreund12) if the number of labels is two, and can be motivated as a functional version of the steepest descent method for finding an optimal solution. In contrast to SAMME, our algorithm’s final hypothesis converges to the correct label with probability 1. For any number of labels, the probability of misclassification vanishes exponentially as the training period increases.
As with the Adaptive Boosting algorithm, the generalization error of our algorithm is bounded by the sum of the training error and an additional term that depends only on the sample size.
Keywords. Adaptive Boosting Algorithm, SAMME, Iterative Weak Learnability, Efficient Ensemble Algorithm, Generalization Error
1. Introduction
While the Adaptive Boosting (AdaBoost) algorithm (\citeasnounSchapireandFreund12) and its variants have been remarkably successful for binary classification problems, difficulties with generalizing to the multi-class case are well known.\footnote{For a comprehensive survey, see \citeasnounFriedmanHestieandTibshirani00.} Early approaches to this problem proposed dividing the multi-class problem into multiple two-class problems (e.g., \citeasnounSchapireandSinger99). These approaches either do not inherit all of the desirable properties of AdaBoost or lack its intuitive description. An extended algorithm is expected to be efficient: it must produce a forecast that converges to the correct label in probability, with the probability of misclassification vanishing at an exponential rate. However, the extended algorithms are often substantially different from, and significantly more elaborate than, the original formulation of \citeasnounSchapireandFreund12. These differences from AdaBoost are not merely aesthetic. First, it is less straightforward to analyze the properties of more elaborate algorithms. Second, the conditions required to ensure they work are often significantly more stringent.
\citeasnounZhuZouRossetandHastie09 developed a multi-class generalization of AdaBoost that maintains the same structure as the binary label case, called Stagewise Additive Modeling using a Multi-class Exponential loss function (SAMME). The algorithm has a statistical foundation as a functional version of the recursive steepest descent method for calculating the minimum of an exponential loss function. SAMME reduces to the original AdaBoost if the number of labels is two. \citeasnounZhuZouRossetandHastie09 demonstrated through numerical simulations that if the set of weak hypotheses satisfies a version of the weak learnability condition, which extends the condition of \citeasnounSchapireandFreund12, SAMME converges to the correct labels.
However, SAMME fails to generate a forecast that converges to the correct label in probability. \citeasnounMukherjeeSchapire2013 showed that the weak learnability condition of \citeasnounZhuZouRossetandHastie09 is not adequate for constructing an efficient recursive ensemble algorithm. \citeasnounMukherjeeSchapire2013 instead imposed a condition called Edge-over-Random (EOR), which involves checking a richer set of conditions. Verifying EOR is not straightforward because the loss must be weighted differently for every kind of misclassification.
This paper proposes an alternative approach. We show that a slight modification of SAMME, combined with a slightly stronger version of the weak learnability condition of \citeasnounZhuZouRossetandHastie09, allows us to construct an efficient recursive ensemble algorithm. We strengthen the weak learnability condition by requiring that the same condition continue to hold for any subset of labels with at least two elements. We call this stronger version iterative weak learnability. One might wonder whether the stronger version of weak learnability requires an exceedingly elaborate class of weak hypotheses. On the contrary, the set of single threshold classifiers, the set of the simplest non-trivial classifiers, satisfies iterative weak learnability.
We depart from the conventional view of a boosting algorithm as a process that produces a final hypothesis which becomes increasingly accurate as the training period increases. Instead, we consider the boosting algorithm as a process that eliminates inaccurate labels recursively, retaining the correct label in each round of elimination. If the set of labels has two elements, the two views are two sides of the same coin. As an example, consider the original AdaBoost with two labels. One can view the algorithm as a process to identify the correct label for each observation. Instead, we view the same algorithm as a process to eliminate the incorrect label. If the number of labels is larger than two, the second view is much easier to generalize than the conventional view of the boosting algorithm.
Our algorithm can be motivated as an implementation of Elimination by Aspects (EBA) (\citeasnounTversky72). If we regard a label as an attribute, our exercise can be viewed as a procedure to identify the best possible attribute for the observation. The decision maker evaluates an attribute according to a linear function of different attributes after a sequence of observations and eliminates any attribute with a negative value of the linear function. Our algorithm selects the attribute that survives the iterative elimination process.\footnote{We are grateful to Shaowei Ke for pointing out the possible link and the reference.}
We modify the SAMME algorithm to construct an ensemble algorithm that eliminates incorrect labels for each observation. By eliminating incorrect labels iteratively, the number of remaining labels falls to two, at which point our algorithm becomes the original AdaBoost algorithm for identifying the correct label. We show that the probability of eliminating the correct label vanishes at an exponential rate as the training period increases, which implies that our algorithm eliminates incorrect labels both accurately and efficiently, in contrast to the original proposal. One benefit of our approach is that we can analyze the algorithm following steps similar to those for the original binary label AdaBoost algorithm. Our algorithm admits a generalization error bound closely related to that of the original formulation.
Section 2 illustrates the problem. Section 3 constructs the algorithm and states the main result. Section 4 shows that the generalization error vanishes as the number of samples increases, proving that our algorithm is robust against overfitting, as the original AdaBoost is. Section 5 concludes the paper.
2. Preliminaries
2.1. Basics
Given set , let be the number of elements in . Let be the set of labels, and be the set of observations, which is endowed with a probability distribution. A classifier is
which labels each observation. We often call a weak hypothesis (or simply, a hypothesis) in place of a classifier. Let be the set of all feasible classifiers.
Let
be the correct classifier. We say that perfectly classifies if
We assume that a decision maker is endowed with , but is often significantly more complex than any element in . As a result, no can perfectly classify . An important question is whether a decision maker who is endowed with primitive can “learn” .
Definition 2.1.
Let be the set of all classifiers, and denote a subset of classifiers. A statistical procedure or algorithm is an onto function
where is a set of histories, is the set of feasible algorithms.
To incorporate learning, we need to allow a decision maker to observe data over time. We consider a dynamic interaction, where the algorithm can update its output in response to new data. Suppose that time is discrete: and is the data available at the beginning of time . is the classifier which algorithm generates from data . Let be the label assigned to observation by .
Let be the training period, when the decision maker updates in response to . The final hypothesis is the output at the end of the training period.
We focus on recursive ensemble algorithms for their simplicity.
Definition 2.2.
is an ensemble of if and such that
is a recursive algorithm, if is determined as a function of and events in time .
We require that an algorithm produce an accurate classifier as the final hypothesis and achieve the desired accuracy level with a reasonable amount of data.
Definition 2.3.
Algorithm is consistent if
Algorithm is efficient if such that
2.2. Adaptive Boosting Algorithm
Let us illustrate the Adaptive Boosting (AdaBoost) algorithm of \citeasnounSchapireandFreund12. Instead of the original formulation, we follow the framework of SAMME (\citeasnounZhuZouRossetandHastie09). Let us denote the algorithm as .
Initialize as the uniform distribution over . For , is the history at the beginning of time . Iterate the following steps.
(1) (calculation of forecasting error) Choose .
If , then stop the iteration and set . If not, continue.
(2) (calculating the weight)\footnote{The following formula is from equation (11) on page 353 of \citeasnounZhuZouRossetandHastie09, which becomes the coefficient of AdaBoost if the number of labels is two. This formula differs slightly from (c) on page 351.}
(2.1)
(3) (preparing for the next round)
(4) Reset as and return to the first step of selecting an optimal weak hypothesis.
(5) (final hypothesis)
If the number of labels is two, is the Adaptive Boosting (AdaBoost) algorithm of \citeasnounSchapireandFreund12.
The description of the algorithm is incomplete because if , the algorithm is not well defined.
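To make the iteration concrete, here is a minimal Python sketch of a SAMME-style training loop, written under our reading of steps (1)-(5) above. The representation of weak hypotheses as callables, the function names, and the exact form of the weight (the log((1-err)/err) term plus a log(K-1) correction, patterned on equation (11) of \citeasnounZhuZouRossetandHastie09) are our assumptions rather than a verbatim transcription of (2.1). The early exit when the weighted error is no better than that of a random guess corresponds to the caveat just mentioned.

```python
import numpy as np

def samme_train(X, y, hypotheses, K, T):
    """SAMME-style boosting sketch (our reading of the steps above).

    X : list of observations, y : array of correct labels in {0, ..., K-1},
    hypotheses : list of callables h(x) -> label, K : number of labels,
    T : length of the training period."""
    n = len(X)
    D = np.full(n, 1.0 / n)           # initialize: uniform distribution over samples
    ensemble = []                      # pairs (alpha_t, h_t)
    for t in range(T):
        # (1) pick the weak hypothesis with the smallest weighted error
        errors = [np.sum(D * (np.array([h(x) for x in X]) != y)) for h in hypotheses]
        h_t = hypotheses[int(np.argmin(errors))]
        eps = min(errors)
        if eps == 0.0:                 # perfect classification: stop the iteration
            return [(1.0, h_t)]
        if eps >= 1.0 - 1.0 / K:       # no better than a random guess: weight undefined
            break
        # (2) weight of the weak hypothesis (SAMME-style formula with log(K-1) term)
        alpha = np.log((1.0 - eps) / eps) + np.log(K - 1.0)
        ensemble.append((alpha, h_t))
        # (3) re-weight the samples and normalize for the next round
        preds = np.array([h_t(x) for x in X])
        D *= np.exp(alpha * (preds != y))
        D /= D.sum()
    return ensemble

def samme_predict(ensemble, x, K):
    # (5) final hypothesis: the label with the largest weighted vote
    votes = np.zeros(K)
    for alpha, h in ensemble:
        votes[h(x)] += alpha
    return int(np.argmax(votes))
```

When K = 2, the log(K - 1) term vanishes and the weight coincides with the usual AdaBoost coefficient up to a constant factor, which does not affect the weighted vote.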
Definition 2.4.
Let be the set of labels. A random guess is to predict each label with probability .
is meaningful only if performs better than a random guess.
(2.2)
which implies that . Note that is the score from the random guess.
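For concreteness (in our own notation, since the paper's symbols do not appear here): if $\mathcal{Y}$ denotes the label set, a uniform random guess $\hat{y}$ predicts each label with probability $1/|\mathcal{Y}|$, so its probability of matching the correct label $y^*(x)$ is
\[
\Pr\left(\hat{y} = y^*(x)\right) \;=\; \sum_{k \in \mathcal{Y}} \frac{1}{|\mathcal{Y}|}\,\mathbf{1}\{k = y^*(x)\} \;=\; \frac{1}{|\mathcal{Y}|},
\]
which is the benchmark score that a meaningful weak hypothesis must exceed.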
The weak learnability implies the existence of such . We state the condition according to \citeasnounZhuZouRossetandHastie09.
Definition 2.5.
A hypothesis class is weakly learnable if, for every probability distribution over observations and labels , we have:
(2.3)
for some .
The weak learnability is sufficient for the efficiency of if .
Theorem 2.6.
If is weakly learnable and , then is an efficient algorithm.
Proof.
See \citeasnounSchapireandFreund12. ∎
\citeasnounMukherjeeSchapire2013 pointed out that the notion of weak learnability of \citeasnounZhuZouRossetandHastie09 is too weak to guarantee the convergence of the algorithm to the correct label if .
Let us examine an example in \citeasnounMukherjeeSchapire2013.
Example 2.7.
, , . where and .
Since , the weak learnability is satisfied if the performance of a weak hypothesis is strictly better than
Let be a probability distribution over . The performance score of is
where the equality holds if . Thus, is weakly learnable. Any convex combination of and in assigns the same level to two observations. Since assigns a different label on a different observation, SAMME fails to produce a consistent forecast, although satisfies the weak learnability.
Example 2.7 reveals two issues we must address to extend AdaBoost from binary to multiple labels. First, to construct a boosting algorithm, it is necessary to strengthen the notion of weak learnability. Second, the set of weak hypotheses cannot be too restrictive. The issue identified in the previous example is that the requirement of outperforming a random guess by becomes less restrictive as the number of labels gets larger. Hence some distributions can satisfy (2.3) when the number of labels is large but fail when it is small. We propose to consider the minimal departure necessary to rule this out.
2.3. Iterative Weak Learnability
We strengthen the weak learnability by requiring that (2.3) holds for any subset of with more than a single element. Let be a non-empty subset of with . Define
as the collection of all weak hypotheses with labels from . Clearly, .
Definition 2.8.
is iteratively weakly learnable if with , such that for every distribution over observations and labels ,
(2.4)
Note that is a decreasing function of . As decreases, the requirement for weak learnability of becomes more stringent, as the minimum performance criterion (the right-hand side of (2.4)) increases.
In Example 2.7, is weakly learnable, but not iteratively weakly learnable. To see this, note that if , then .
However, the weak learnability over requires that the minimum performance be strictly larger than 0.
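As a heuristic illustration of Definition 2.8 (not part of the paper's formal development), the sketch below checks the subset condition on a finite sample for one fixed weighting of the observations: for every label subset with at least two elements, it asks whether some hypothesis restricted to that subset attains weighted accuracy at least 1/|subset| + gamma on the observations whose correct label lies in the subset. All names are ours, and a genuine verification of (2.4) would require the bound to hold for every distribution, not just one.

```python
from itertools import combinations
import numpy as np

def check_subsets(X, y, hypotheses, labels, weights, gamma):
    """Heuristic check of condition (2.4) for a single weighting of the sample.

    hypotheses : dict mapping frozenset(label subset) -> list of callables
                 h(x) -> label with values in that subset (our representation).
    Returns the subsets of size >= 2 for which no hypothesis attains weighted
    accuracy of at least 1/|subset| + gamma."""
    y = np.asarray(y)
    weights = np.asarray(weights, dtype=float)
    failures = []
    for r in range(2, len(labels) + 1):
        for subset in combinations(labels, r):
            subset = frozenset(subset)
            # restrict attention to observations whose correct label is in the subset
            mask = np.array([yi in subset for yi in y])
            if not mask.any():
                continue
            w = weights * mask
            benchmark = 1.0 / len(subset) + gamma
            best = 0.0
            for h in hypotheses.get(subset, []):
                preds = np.array([h(x) for x in X])
                best = max(best, float(np.sum(w * (preds == y)) / np.sum(w)))
            if best < benchmark:
                failures.append(subset)
    return failures
```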
2.4. Sufficiently Rich
A fundamental question is whether we can find a class of simple hypotheses which is weakly learnable. With the set of the simplest classifiers, we can do better than a random guess over any subset of with at least two elements.
Suppose that and are subsets of a vector space or a finite dimensional Euclidean space. Consider , which classifies according to the value of a linear function, thus called a linear classifier. Let be a hyperplane in : and such that
Define as the closed half space above :
Definition 2.9.
A single threshold (linear) classifier is a mapping
where such that
If is a single threshold classifier, , which may be strictly smaller than . On the other hand, the set of single threshold classifiers is closed under permutations: if is a bijection, then for all . Being closed under permutations turns out to be the key property to obtain weak learnability.
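A single threshold classifier can be coded as a decision stump over a hyperplane: one label on the closed half space above the hyperplane, another label below it. The parameterization below (a normal vector, an intercept, and the two output labels) is our reading of Definition 2.9 rather than the paper's exact notation; the permute method makes the closure-under-permutations property used in Lemma 2.10 explicit.

```python
import numpy as np

class SingleThresholdClassifier:
    """h(x) = label_above if <w, x> >= c, else label_below (our parameterization)."""

    def __init__(self, w, c, label_above, label_below):
        self.w = np.asarray(w, dtype=float)   # normal vector of the hyperplane
        self.c = float(c)                      # threshold (intercept)
        self.label_above = label_above
        self.label_below = label_below

    def __call__(self, x):
        return self.label_above if np.dot(self.w, x) >= self.c else self.label_below

    def permute(self, pi):
        """Apply a permutation pi (a dict over labels) to the output labels.
        The result is again a single threshold classifier, which is the
        closure-under-permutations property invoked in Lemma 2.10."""
        return SingleThresholdClassifier(self.w, self.c,
                                         pi[self.label_above], pi[self.label_below])
```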
Lemma 2.10.
Consider an environment where the image of has elements where . Any hypothesis class that is closed under permutations can do as well as random guessing.
Proof.
See Lemma A.1 in \citeasnounChoandLibgober2020. ∎
We use this result to show that the set of single-threshold classifiers is iteratively weakly learnable.
Corollary 2.11.
The set of single threshold classifiers is iteratively weakly learnable.
Proof.
, the set of single threshold classifiers
is closed under permutations, from which the conclusion follows. ∎
In Example 2.7, using single-threshold classifiers ensures that an extreme point of the convex hull of can always be labeled correctly, and the rest can at least do as well as a random guess. The analogous result is known to hold for the case of . Interestingly, the critical property in generalizing to the multi-class case is not changing the set of classifiers but instead allowing the labels to be sufficiently rich.
3. Construction of Algorithm
The algorithm, which we call , iterates over two loops: within an epoch (inner loop) and across epochs (outer loop). is the minimum length of an epoch, which we increase to improve the accuracy of the final hypothesis of the algorithm. The size of each epoch is a random variable . The number of epochs is also a random variable. While the total number of periods to generate the final hypothesis is a random variable , we show that as becomes large, the total number of periods is bounded by with probability close to 1.
3.1. Preliminaries
The algorithm is run in discrete period within epoch . Each epoch lasts at least periods, but the terminal round of an epoch is random. At the start of an epoch, we set the period to and re-start the procedure.
It is convenient to enumerate the elements in as the first positive integers, highlighting the fact that the weak learnability has little to do with the name of the labels but relies solely on the ordinal properties of . Let be a permutation over :
Let be the collection of the first positive integers which is the set of feasible labels at the beginning of epoch with . Let be a permutation over .
3.2. Within Epoch
At the beginning of epoch , and are given. Set . Let be the uniform distribution over . The output of each stage consists of three key components: an artificial probability distribution over , a threshold rule in step , and a positive weight .
Suppose that is defined . Choose by solving
(3.5)
We call an optimal classifier. Define
(3.6)
as the probability that the optimal classifier at round misclassifies under . If , then we stop the training and output as the final hypothesis, which perfectly forecasts .
Suppose that . Define
(3.7)
Note that the only difference between (3.7) and (2.1) is the coefficient . If , our algorithm reduces to AdaBoost.
Define for each , and each pair ,
where
Given , we can recursively define and .
Define ,
and
Note that
(3.8)
Therefore, ,
(3.9)
if and only if
(3.10)
We use to predict , instead of . Since is a linear combination of , our algorithm is an ensemble algorithm. Since is calculated recursively within an epoch, and the change between two epochs depends only on the final outcome from the previous epoch, we regard our algorithm as a recursive algorithm.
Except for the case of , (3.9) does not guarantee that with probability close to 1, as . \citeasnounMukherjeeSchapire2013 argues that the failure is due to the fact that the weak learnability condition (2.3) is too weak, and proposes the stronger Edge-over-Random (EOR) criterion.
We take a different tack. We strengthen (2.3) to (2.4) by requiring the same condition for any subset of with at least two elements. Instead of solving (3.10), we eliminate which is not an element of . A critical step is to show that survives the elimination process with a probability close to 1. Since , we can “solve” (3.10) in a finite number of iterations. Roughly speaking, (2.3) may be too weak to solve (3.9) directly, but is sufficiently strong to solve the equivalent problem (3.10) indirectly, if (2.3) holds for any subset with at least two elements.
Recall that is the minimum number of periods in epoch , which is a parameter of the algorithm. Define
as the first period in epoch when each has a label where . Epoch terminates at period.
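The within-epoch loop can be sketched as follows. The accumulation of per-label scores and the stopping rule (run at least the minimum number of periods and stop once every observation has some label with a negative score) follow our reading of (3.5)-(3.10) and of the termination rule just defined; the centered vote and the exact constant in the weight formula are assumptions patterned on SAMME, not a verbatim transcription of (3.7).

```python
import numpy as np

def run_epoch(X, y, hypotheses, labels, T_min):
    """One epoch of the elimination algorithm (a sketch under our assumptions).

    Returns the accumulated score F[i][k] for each observation i and label k.
    Labels with F[i][k] < 0 become candidates for elimination (Section 3.2)."""
    n, k_n = len(X), len(labels)
    D = np.full(n, 1.0 / n)                          # uniform distribution over samples
    F = {i: {k: 0.0 for k in labels} for i in range(n)}
    t = 0
    while True:
        t += 1
        # choose the optimal weak hypothesis under D, as in (3.5)
        errors = [np.sum(D * (np.array([h(x) for x in X]) != y)) for h in hypotheses]
        h_t = hypotheses[int(np.argmin(errors))]
        eps = min(errors)
        if eps == 0.0:
            break                                     # perfect forecast: stop training
        # weight with a log(k_n - 1) correction (assumed SAMME-style form of (3.7))
        alpha = 0.5 * np.log((1.0 - eps) / eps) + np.log(k_n - 1.0)
        preds = np.array([h_t(x) for x in X])
        # accumulate centered votes: positive evidence for the predicted label
        for i in range(n):
            for k in labels:
                F[i][k] += alpha * ((1.0 if preds[i] == k else 0.0) - 1.0 / k_n)
        # re-weight the samples for the next round
        D *= np.exp(alpha * (preds != y))
        D /= D.sum()
        # the epoch ends once t >= T_min and every observation has a negative-score label
        if t >= T_min and all(min(F[i].values()) < 0 for i in range(n)):
            break
    return F
```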
Before we start epoch , we need to update and . Roughly speaking, is obtained by eliminating with . Because the label with can vary across different , we need additional work. First, we need to eliminate the same number of labels from for each , and then permute the remaining elements so that we can “homogenize” the set of labels for each .
Define
as the set of labels in with a negative value of , which are the candidates for elimination. Because can vary across , we cannot eliminate all elements in from if we want to keep the number of surviving elements the same across . Define
which is at least 1, by the definition of . Recall that each element in is a positive integer. Instead of eliminating all elements in , we eliminate elements from . Recall that
We can write
For convenience, we select the last elements in to construct :
By the construction of ,
(3.11)
The remaining step is to re-label each element in so that each has the same set of labels.
We arrange the elements in according to the index’s order. For each , define a permutation
where
and let . Given ,
Since is a bijection over , exists. The correct label in epoch is
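The bookkeeping between epochs, eliminating the same number of labels from every observation and re-labeling the survivors so that all observations share the label set {1, ..., k_{n+1}}, can be sketched as follows; the data structures and names are ours.

```python
def eliminate_and_relabel(F, labels):
    """Elimination and relabeling step between epochs (a sketch; names are ours).

    F : dict F[i][k] of accumulated scores from the epoch,
    labels : list of current labels 1, ..., k_n (positive integers).
    Returns the new label list 1, ..., k_{n+1} and, for each observation,
    the permutation mapping surviving old labels to the new labels."""
    k_n = len(labels)
    # E(x): labels with a negative score, the candidates for elimination
    E = {i: sorted(k for k in labels if F[i][k] < 0) for i in F}
    # eliminate the same number of labels from every observation
    m = min(len(E[i]) for i in F)            # at least 1 by the epoch's stopping rule
    k_next = k_n - m
    new_labels = list(range(1, k_next + 1))
    sigmas = {}
    for i in F:
        eliminated = set(E[i][-m:])          # drop the last m candidates (by index order)
        survivors = [k for k in labels if k not in eliminated]
        # permutation: j-th surviving old label -> new label j
        sigmas[i] = {old: new for new, old in enumerate(survivors, start=1)}
    return new_labels, sigmas
```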
3.3. Iteration over epoch
Replace and by and . Start the entire process described in Section 3.2. The iteration over epoch stops at the first where
(3.12)
Define
as the index of the epoch at which the algorithm stops. Since each epoch lasts periods, the total number of periods to train the algorithm is
which is a random variable.
Let us call the constructed algorithm because the algorithm is parameterized by the minimum length of each epoch . The element in is the source of the final hypothesis of . Since we know and , the final hypothesis of is
We are interested in .
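Finally, a short sketch (again in our own notation) of how the surviving label is mapped back to an original label by inverting the per-epoch permutations in reverse order, as described above:

```python
def recover_label(sigmas_for_x, surviving_label):
    """Undo the per-epoch relabelings for a single observation x (a sketch).

    sigmas_for_x : list of permutations (dicts old_label -> new_label), one per epoch,
    surviving_label : the single label left when the iteration over epochs stops.
    The original label is recovered by applying the inverse permutations
    in reverse order."""
    label = surviving_label
    for sigma in reversed(sigmas_for_x):
        inverse = {new: old for old, new in sigma.items()}
        label = inverse[label]
    return label
```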
Theorem 3.1.
Assume that and , and that is iteratively weakly learnable.
(1)
(2) and such that , with probability at least ,
(a)
(b)
where .
Proof.
See Appendix A. ∎
4. Generalization Error
Our last result shows how our approach lends itself to a simple formulation of the generalization error. Let be the sample space generated by samples and be the probability measure over the sample space. Theorem 3.1 proved that the training error vanishes at an exponential rate:
where is the length of an epoch and is the number of epochs. We know that
We can show that the generalization error of vanishes as . Let be the population probability distribution.
Theorem 4.1.
,
Proof.
See Appendix B. ∎
5. Conclusion
As is built on the framework of SAMME, the algorithm inherits the same statistical foundation as SAMME and is easier to motivate than other extensions of AdaBoost. Some extensions of AdaBoost (e.g., \citeasnounMukherjeeSchapire2013) may have to “tune parameters” of the algorithm to achieve efficiency. The selection of a proper parameter of the algorithm requires information about . In contrast, the construction of requires no knowledge about , other than the fact that the set of weak hypotheses satisfies iterative weak learnability.
Closure under permutations is straightforward to check for a hypothesis class, making it easy to verify iterative weak learnability. This allowed us to exhibit the algorithm straightforwardly, without needing to check that the algorithm can outperform random guessing for a rich set of possible cost functions.
Appendix A Proof of Theorem 3.1
A.1. Uniform Bound for
We show
Recall that , is maximizing
If such that , then cannot be an optimal weak hypothesis. Suppose that . The decision maker can increase the performance by choosing from . Thus, , .
We have , ,
since . Every is eliminated at the end of the first epoch. Since , the number of remaining elements after the first epoch is no more than .
A.2. Within Epoch
We show that if the minimum length of an epoch is sufficiently large, then with a probability close to 1. That is, by the time when the algorithm reaches the scheduled end of th epoch, such that . We calculate the lower bound of the probability. There exists such that the probability is bounded from below by .
Recall that ,
and
Thus,
Iterative weak learnability implies that , so that
for some . The weak learnability implies that .
Lemma A.1.
Proof.
Note that
(A.13)
From the updating formula of ,
(A.14)
(A.15)
By summing over , we have
Note , ,
Lemma A.2.
is a strictly concave function of .
(A.16)
and
(A.17)
Proof.
and are strictly concave functions over . is the sum of two strictly concave functions. Thus, is strictly concave.
(A.16) follows from a simple calculation.
The calculation of is a little more involved.
∎
Since is weakly learnable, such that
Weak learnability implies that
Since is strictly concave, . Thus, such that ,
Thus,
for some . Thus,
where
(A.18)
∎
By the definition, , ,
(A.19)
By (A.19), , if , then such that
Lemma A.1 implies
with probability at least . Hence, with probability at least .
As Lemma A.1 indicates, if , then almost surely as . If , then (A.19) implies that exactly one label is associated with a positive value of . Thus, searching for with is the proper way to identify a correct label. However, if , then the positive sign of does not imply that . Nevertheless, if , then Lemma A.1 implies that almost surely as . The idea of is to eliminate with in each epoch, until a single element is left.
A.3. Across Epochs
We construct by eliminating some elements from and permuting the remaining elements according to . With probability , the true label survives the elimination from , and is included in , permuted according to .
Since we eliminate at least one element from , the total number of epochs cannot be larger than . Let be the epoch when is a singleton. With probability
the remaining label is the real label which has been permuted by a series of . We can recover the real label
Define
Then, for a sufficiently large ,
which completes the proof.
Appendix B Proof of Theorem 4.1
The proof follows the same logic as the proof of Theorem 5.1 on page 98 of \citeasnounSchapireandFreund12. We need to change the notation to accommodate multiple classes . For reference, we sketch the proof by following the original proof of Theorem 5.1 in \citeasnounSchapireandFreund12, pointing out the places where a change is needed and referring the rest to \citeasnounSchapireandFreund12.
B.1. Preliminaries
Recall that ,
Define
and
We drop subscript because we focus on and write instead
Recall that is the collection of feasible weak hypotheses, whose generic element is . Define
For a fixed , define
Let be the collection of randomly selected elements from and
where
If
then is eliminated from in -th epoch , which implies
Thus,
The last equality stems from the facts that the samples are independently drawn over periods, and that we re-initialize the algorithm at the beginning of each epoch.
We only show that in the first epoch
where is the collection of all samples. The proof for the remaining epochs follows the same logic. Since , we then have
B.2. Sketch of Proof
Lemma B.1.
(Re-statement of Lemma 5.2 in \citeasnounSchapireandFreund12) ,
(B.20)
Proof.
Note that the only change from the original statement in \citeasnounSchapireandFreund12 is , which needs to be scaled to accommodate the case . Note that . We need to scale into a random variable in and modify accordingly. The remainder of the proof is identical to the original proof of Lemma 5.2 in \citeasnounSchapireandFreund12. ∎
Lemma 5.3 of \citeasnounSchapireandFreund12 is not needed. The proof of Lemma 5.4 of \citeasnounSchapireandFreund12 applies to our case without modifying the parameter, from which the conclusion of Theorem 4.1 follows.
References
- [1] \harvarditem[Cho and Libgober]Cho and Libgober2020ChoandLibgober2020 Cho, I.-K., and J. Libgober (2020): “Machine Learning for Strategic Inference,” Emory University and University of Southern California.
- [2] \harvarditem[Friedman, Hastie, and Tibshirani]Friedman, Hastie, and Tibshirani2000FriedmanHestieandTibshirani00 Friedman, J., T. Hastie, and R. Tibshirani (2000): “Additive Logistic Regression: A Statistical View of Boosting,” The Annals of Statistics, 28(2), 337–407.
- [3] \harvarditem[Mukherjee and Schapire]Mukherjee and Schapire2013MukherjeeSchapire2013 Mukherjee, I., and R. E. Schapire (2013): “A Theory of Multiclass Boosting,” Journal of Machine Learning Research, 14, 437–497.
- [4] \harvarditem[Schapire and Freund]Schapire and Freund2012SchapireandFreund12 Schapire, R. E., and Y. Freund (2012): Boosting: Foundations and Algorithms. MIT Press.
- [5] \harvarditem[Schapire and Singer]Schapire and Singer1999SchapireandSinger99 Schapire, R. E., and Y. Singer (1999): “Improved Boosting Algorithms Using Confidence-Rated Predictions,” Machine Learning, 37, 297–336.
- [6] \harvarditem[Tversky]Tversky1972Tversky72 Tversky, A. (1972): “Elimination by aspects: A theory of choice,” Psychological Review, 79(4), 281–299.
- [7] \harvarditem[Zhu, Zou, Rosset, and Hastie]Zhu, Zou, Rosset, and Hastie2009ZhuZouRossetandHastie09 Zhu, J., H. Zou, S. Rosset, and T. Hastie (2009): “Multi-class AdaBoost,” Statistics and Its Interface, 2, 349–360.