
Iterative Weak Learnability and Multi-Class AdaBoost

In-Koo Cho Department of Economics, Emory University, Atlanta, GA 30322 USA [email protected] https://sites.google.com/site/inkoocho  and  Jonathan Libgober Department of Economics, University of Southern California, Los Angeles, CA 90089 USA [email protected] http://www.jonlib.com/
Abstract.

We construct an efficient recursive ensemble algorithm for the multi-class classification problem, inspired by SAMME (\citeasnoun{ZhuZouRossetandHastie09}). We strengthen the weak learnability condition in \citeasnoun{ZhuZouRossetandHastie09} by requiring that it holds for any subset of labels with at least two elements. This condition is simpler to check than many proposed alternatives (e.g., \citeasnoun{MukherjeeSchapire2013}). Like SAMME, our algorithm reduces to the Adaptive Boosting algorithm (\citeasnoun{SchapireandFreund12}) if the number of labels is two, and can be motivated as a functional version of the steepest descent method for finding an optimal solution. In contrast to SAMME, our algorithm's final hypothesis converges to the correct label with probability 1. For any number of labels, the probability of misclassification vanishes exponentially as the training period increases. As for the Adaptive Boosting algorithm, the generalization error of our algorithm is bounded by the sum of the training error and an additional term that depends only on the sample size.
Keywords. Adaptive Boosting Algorithm, SAMME, Iterative Weak Learnability, Efficient Ensemble Algorithm, Generalization Error

We are grateful for the hospitality and support from the University of Southern California. We thank Shaowei Ke, Tim Roughgarden, and Rob Schapire for their helpful comments. The first author gratefully acknowledges financial support from the National Science Foundation and the Korea Research Foundation.

1. Introduction

While the Adaptive Boosting (AdaBoost) algorithm (\citeasnoun{SchapireandFreund12}) and its variants have been remarkably successful for binary classification problems, difficulties with generalizing to the multi-label case are well-known (for a comprehensive survey, see \citeasnoun{FriedmanHestieandTibshirani00}). Early approaches to this problem proposed dividing the multi-class problem into multiple two-class problems (e.g., \citeasnoun{SchapireandSinger99}). These approaches either do not inherit all of the desirable properties of AdaBoost or lack its intuitive description. An extended algorithm should be efficient: it must produce a forecast that converges to the correct label in probability, with the probability of misclassification vanishing at an exponential rate. However, extended algorithms are often substantially different from, and significantly more elaborate than, the original formulation of \citeasnoun{SchapireandFreund12}. The aesthetic differences with AdaBoost are substantive. First, it is less straightforward to analyze the properties of these algorithms when they are more elaborate. Second, the conditions required to ensure they work are often significantly more stringent.

\citeasnoun{ZhuZouRossetandHastie09} developed a multi-class generalization of AdaBoost that maintains the same structure as the binary label case, called Stagewise Additive Modeling using a Multi-class Exponential loss function (SAMME). The algorithm has a statistical foundation as a functional version of the recursive steepest descent method for calculating the minimum of an exponential loss function. SAMME reduces to the original AdaBoost if the number of labels is two. \citeasnoun{ZhuZouRossetandHastie09} demonstrated through numerical simulations that if the set of weak hypotheses satisfies a version of the weak learnability condition, which extends the condition of \citeasnoun{SchapireandFreund12}, SAMME converges to the correct labels.

However, SAMME can fail to generate a forecast that converges to the correct label in probability. \citeasnoun{MukherjeeSchapire2013} showed that the weak learnability condition of \citeasnoun{ZhuZouRossetandHastie09} is not adequate for constructing an efficient recursive ensemble algorithm. \citeasnoun{MukherjeeSchapire2013} instead imposed a condition called Edge-over-Random (EOR), which involves checking a richer set of conditions. Verifying EOR is not straightforward because the loss must be weighed differently for every kind of misclassification.

This paper proposes an alternative approach. We show that a slight modification of SAMME, combined with a slightly stronger version of the weak learnability condition of \citeasnoun{ZhuZouRossetandHastie09}, allows us to construct an efficient recursive ensemble algorithm. We strengthen the weak learnability condition by requiring that the same weak learnability continue to hold for any subset of labels with at least two elements. We call the stronger version iterative weak learnability. One might wonder whether the stronger version of weak learnability requires an exceedingly elaborate class of weak hypotheses. On the contrary, the set of single threshold classifiers, the simplest non-trivial classifiers, satisfies iterative weak learnability.

We depart from the conventional view of a boosting algorithm as a process to produce a final hypothesis that becomes increasingly accurate as the training period increases. Instead, we consider the boosting algorithm as a process to eliminate inaccurate labels recursively, leaving the correct label in each round of elimination. If the set of labels has two elements, the two views are two sides of the same coin. As an example, consider the original AdaBoost with two labels. One can view the algorithm as a process to identify a correct label for each observation. Instead, we view the same algorithm as a process to eliminate an incorrect label. If the number of labels is larger than two, the second view is much easier to generalize than the conventional view of the boosting algorithm.

Our algorithm can be motivated as an implementation of Elimination by Aspects (EBA) (\citeasnoun{Tversky72}). If we regard a label as an attribute, our exercise can be viewed as a procedure to identify the best possible attribute for the observation. The decision maker evaluates an attribute according to a linear function of different attributes after a sequence of observations and eliminates any attribute with a negative value of the linear function. Our algorithm selects the attribute which survives the iterative elimination process. (We are grateful to Shaowei Ke for pointing out the possible link and the reference.)

We modify the SAMME algorithm to construct an ensemble algorithm that eliminates incorrect labels for each observation. By eliminating incorrect labels iteratively, the number of remaining labels eventually becomes two, at which point our algorithm becomes the original AdaBoost algorithm for identifying the correct label. We show that the probability of eliminating the correct label vanishes at an exponential rate as the training period increases, which implies that our algorithm eliminates incorrect labels accurately and efficiently, in contrast to the original proposal. One of the benefits of our approach is that we can analyze this algorithm following steps similar to the original binary label AdaBoost algorithm. Our algorithm produces a generalization error bound closely related to that of the original formulation.

Section 2 illustrates the problem. Section 3 constructs the algorithm and states the main result. Section 4 shows that the generalization error vanishes as the number of samples increases, proving that our algorithm is robust against overfitting, as the original AdaBoost is. Section 5 concludes the paper.

2. Preliminaries

2.1. Basics

Given a set $X$, let $|X|$ be the number of elements in $X$. Let $A$ be the set of labels, and ${\mathcal{P}}$ be the set of observations, which is endowed with a probability distribution. A classifier is

h : {\mathcal{P}} \rightarrow A

which labels each observation. We often call $h$ a weak hypothesis (or simply, a hypothesis) in place of a classifier. Let ${\mathcal{H}}$ be the set of all feasible classifiers.

Let

y : {\mathcal{P}} \rightarrow A

be the correct classifier. We say that $h$ perfectly classifies $y$ if

\mathbf{P}\left(h(p)=y(p)\right)=1.

We assume that a decision maker is endowed with ${\mathcal{H}}$, but $y$ is often significantly more complex than any element in ${\mathcal{H}}$. As a result, no $h\in{\mathcal{H}}$ can perfectly classify $y$. An important question is whether a decision maker who is endowed with primitive ${\mathcal{H}}$ can "learn" $y$.

Definition 2.1.

Let $\Gamma$ be the set of all classifiers, and let $\tilde{\Gamma}\subset\Gamma$ denote a subset of classifiers. A statistical procedure or algorithm is an onto function

\tau : \mathcal{D} \rightarrow \tilde{\Gamma},

where $\mathcal{D}$ is the set of histories, and $\mathcal{T}$ denotes the set of feasible algorithms.

To incorporate learning, we need to allow a decision maker to observe data over time. We consider a dynamic interaction, where the algorithm can update its output in response to new data. Suppose that time is discrete: $t=1,2,\ldots$ and $D_t$ is the data available at the beginning of time $t$. $\tau(D_t)$ is the classifier which algorithm $\tau$ generates from data $D_t$. Let $\tau(D_t)(p)$ be the label assigned to observation $p$ by $\tau(D_t)$.

Let $T\geq 1$ be the training period, during which the decision maker updates $\tau(D_t)$ in response to $D_t$. The final hypothesis is the output $\tau(D_T)$ at the end of the training period.

We focus on recursive ensemble algorithms for their simplicity.

Definition 2.2.

$\tau(D_K)$ is an ensemble of ${\mathcal{H}}$ if $\exists h_1,\ldots,h_K\in{\mathcal{H}}$ and $\alpha_1,\ldots,\alpha_K\geq 0$ such that $\forall p\in{\mathcal{P}}$

\tau(D_{K})(p)=\arg\max_{a}\sum_{k=1}^{K}\alpha_{k}\mathbf{1}[a=h_{k}(p)]

$\tau$ is a recursive algorithm if $\tau(D_k)$ is determined as a function of $\tau(D_{k-1})$ and events in time $k$.

We require that an algorithm produce an accurate classifier as the final hypothesis and achieve the desired accuracy level with a reasonable amount of data.

Definition 2.3.

Algorithm $\tau$ is consistent if

\lim_{K\rightarrow\infty}\mathbf{P}\left(\tau(D_{K})(p)\neq y(p)\right)=0.

Algorithm $\tau$ is efficient if $\exists\rho>0,K>0$ such that

\mathbf{P}\left(\exists k\geq K,\ \tau(D_{k})(p)\neq y(p)\right)\leq e^{-\rho K}.

2.2. Adaptive Boosting Algorithm

Let us illustrate the Adaptive Boosting (AdaBoost) algorithm of \citeasnoun{SchapireandFreund12}. Instead of the original formula, we follow the framework of SAMME (\citeasnoun{ZhuZouRossetandHastie09}). Let us denote the algorithm by $\tau_A$.

Initialize $d_1(p)=1/|{\mathcal{P}}|$ as the uniform distribution over ${\mathcal{P}}$. For $k\geq 1$, $D_k$ is the history at the beginning of time $k$. Iterate the following steps.

  (1) (calculation of forecasting error) Choose $h_k\in{\mathcal{H}}$ and compute

\epsilon_{k}=\sum_{p}d_{k}(p){\mathbb{I}}\left(h_{k}(p)\neq y(p)\right)

  If $\epsilon_k=0$, then stop the iteration and set $\tau(D_k)=h_k$. If not, continue.

  (2) (calculating the weight) The following formula is from equation (11) on page 353 of \citeasnoun{ZhuZouRossetandHastie09}, which becomes the coefficient of AdaBoost if $|A|=2$; it differs slightly from (c) on page 351.

\alpha_{k}=\frac{(|A|-1)^{2}}{|A|}\log\left(\frac{(|A|-1)(1-\epsilon_{k})}{\epsilon_{k}}\right) \qquad (2.1)

  (3) (preparing for the next round)

w_{k+1}(p)=\begin{cases}d_{k}(p)\exp\left(-(|A|-1)\alpha_{k}\right)&\text{if }h_{k}(p)=y(p)\\ d_{k}(p)\exp\left(\alpha_{k}\right)&\text{if }h_{k}(p)\neq y(p).\end{cases}

d_{k+1}(p)=\frac{w_{k+1}(p)}{\sum_{p^{\prime}}w_{k+1}(p^{\prime})}.

  (4) Reset $k+1$ as $k$ and start the first step of selecting an optimal weak hypothesis.

  (5) (final hypothesis)

\tau_{A}(D_{t})(p)=\arg\max_{a\in A}\sum_{s=1}^{t}\alpha_{s}{\mathbb{I}}(h_{s}(p)=a)\qquad\forall p\in{\mathcal{P}}

If $|A|=2$, $\tau_A$ is the Adaptive Boosting (AdaBoost) algorithm of \citeasnoun{SchapireandFreund12}.
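To make the recursion concrete, the following Python sketch runs the loop above over a finite set of observations and a finite list of candidate weak hypotheses. It is our own illustration, not part of the original description: the exhaustive search in step (1) stands in for "choose $h_k\in{\mathcal{H}}$", and it assumes each chosen $h_k$ beats a random guess so that $\alpha_k>0$.

```python
import numpy as np

def samme_boost(P, y, H, A, T):
    """Toy sketch of the tau_A loop: P is a list of observations, y[i] is the
    true label of P[i], H is a list of classifiers, A is the label set."""
    n, m = len(P), len(A)
    d = np.full(n, 1.0 / n)                 # d_1: uniform distribution over P
    ensemble = []
    for _ in range(T):
        # step (1): pick the weak hypothesis with the smallest weighted error
        errors = [sum(d[i] for i, p in enumerate(P) if h(p) != y[i]) for h in H]
        k = int(np.argmin(errors))
        h, eps = H[k], errors[k]
        if eps == 0:                        # perfect classifier found: stop
            return [(1.0, h)]
        # step (2): weight alpha_k from equation (2.1)
        alpha = ((m - 1) ** 2 / m) * np.log((m - 1) * (1 - eps) / eps)
        # step (3): reweight observations and renormalize
        w = np.array([d[i] * np.exp(-(m - 1) * alpha if h(p) == y[i] else alpha)
                      for i, p in enumerate(P)])
        d = w / w.sum()
        ensemble.append((alpha, h))
    return ensemble

def predict(ensemble, A, p):
    # step (5): weighted vote over the stored weak hypotheses
    return max(A, key=lambda a: sum(al for al, h in ensemble if h(p) == a))
```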

The description of the algorithm is incomplete, because if $\alpha_k<0$, the algorithm is not well defined.

Definition 2.4.

Let $A$ be the set of labels. A $\frac{1}{|A|}$ random guess, or simply a random guess, is to predict each label with probability $\frac{1}{|A|}$.

$\tau_A$ is meaningful only if $h_k\in{\mathcal{H}}$ performs better than a random guess:

\sum_{p\in{\mathcal{P}}}\left(\mathbb{I}[h_{k}(p)=y(p)]-\mathbb{I}[h_{k}(p)\neq y(p)]\right)d(p)\geq\frac{2-|A|}{|A|}+\rho, \qquad (2.2)

which implies that $\alpha_k>0$. Note that $\frac{2-|A|}{|A|}=\frac{1}{|A|}-\frac{|A|-1}{|A|}$ is the expected score from the random guess.

Weak learnability implies the existence of such an $h_k$. We state the condition following \citeasnoun{ZhuZouRossetandHastie09}.

Definition 2.5.

A hypothesis class $\mathcal{H}$ is weakly learnable if, for every probability distribution $d$ over observations $p\in{\mathcal{P}}$ and labels $y(p)$, we have

\min_{d\in\Delta({\mathcal{P}})}\max_{h\in\mathcal{H}}\sum_{p\in{\mathcal{P}}}\left(\mathbb{I}[h(p)=y(p)]-\mathbb{I}[h(p)\neq y(p)]\right)d(p)\geq\frac{2-|A|}{|A|}+\rho, \qquad (2.3)

for some $\rho>0$.

The weak learnability is sufficient for the efficiency of $\tau_A$ if $|A|=2$.

Theorem 2.6.

If ${\mathcal{H}}$ is weakly learnable and $|A|=2$, then $\tau_A$ is an efficient algorithm.

Proof.

See \citeasnoun{SchapireandFreund12}. ∎

\citeasnoun{MukherjeeSchapire2013} pointed out that the notion of weak learnability of \citeasnoun{ZhuZouRossetandHastie09} is too weak to guarantee the convergence of the algorithm to the correct label if $|A|>2$.

Let us examine an example in \citeasnoun{MukherjeeSchapire2013}.

Example 2.7.

${\mathcal{P}}=\{a,b\}$, $A=\{1,2,3\}$, $y(a)=1$, $y(b)=2$. ${\mathcal{H}}=\{h_1,h_2\}$ where $h_1(a)=h_1(b)=1$ and $h_2(a)=h_2(b)=2$.

Since $|A|=3$, the weak learnability is satisfied if the performance of a weak hypothesis is strictly better than

\frac{2-3}{3}=-\frac{1}{3}.

Let $d=(d(a),d(b))$ be a probability distribution over ${\mathcal{P}}$. The performance score of ${\mathcal{H}}$ is

\max\left[1-2d(a),-1+2d(a)\right]\geq 0

where the equality holds if $d(a)=0.5$. Thus, ${\mathcal{H}}$ is weakly learnable. However, any convex combination of $h_1$ and $h_2$ in ${\mathcal{H}}$ assigns the same label to the two observations. Since $y$ assigns a different label to each observation, SAMME fails to produce a consistent forecast, although ${\mathcal{H}}$ satisfies the weak learnability.

Example 2.7 reveals two issues we must address to extend AdaBoost from binary to multiple labels. First, to construct a boosting algorithm, it is necessary to strengthen the notion of weak learnability. Second, the set of weak hypotheses ${\mathcal{H}}$ cannot be too restrictive. The issue identified in the previous example is that the requirement of outperforming a random guess by $\rho$ becomes less restrictive as the number of labels gets larger. Hence some distributions can satisfy (2.3) for many labels but fail for a small number of labels. We propose to consider the minimal departure necessary to rule this out.

2.3. Iterative Weak Learnability

We strengthen the weak learnability by requiring that (2.3) holds for any subset of $A$ with more than a single element. Let $A_i\subset A$ be a non-empty subset of $A$ with $|A_i|\geq 2$. Define

{\mathcal{H}}_{i}=\left\{h : {\mathcal{P}}\rightarrow A_{i}\right\}

as the collection of all weak hypotheses with labels from $A_i$. Clearly, ${\mathcal{H}}_i\subset{\mathcal{H}}$.

Definition 2.8.

${\mathcal{H}}$ is iteratively weakly learnable if $\forall A_i\subset A$ with $|A_i|\geq 2$, $\exists\rho_i>0$ such that for every distribution $d$ over observations $p\in{\mathcal{P}}$ and labels $y(p)$,

\min_{d\in\Delta({\mathcal{P}})}\max_{h\in\mathcal{H}_{i}}\sum_{p\in{\mathcal{P}}}\left(\mathbb{I}[h(p)=y(p)]-\mathbb{I}[h(p)\neq y(p)]\right)d(p)\geq\frac{2-|A_{i}|}{|A_{i}|}+\rho_{i}. \qquad (2.4)

Note that $\frac{2-|A_i|}{|A_i|}$ is a decreasing function of $|A_i|$. As $|A_i|$ decreases, the requirement for weak learnability of ${\mathcal{H}}_i$ becomes more stringent, because the minimum performance criterion (the right-hand side of (2.4)) increases.

In Example 2.7, ${\mathcal{H}}$ is weakly learnable, but not iteratively weakly learnable. To see this, note that if $A_i=\{1,2\}\subset A=\{1,2,3\}$, then ${\mathcal{H}}={\mathcal{H}}_i$ and

\min_{0\leq d(a)\leq 1}\max\left[1-2d(a),-1+2d(a)\right]=0.

However, the weak learnability over $A_i$ requires that the minimum performance be strictly larger than 0.
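This calculation can be checked numerically. The short script below (our own illustration) evaluates the minimax score for Example 2.7 on a grid over $d(a)$ and compares it with the random-guess benchmark $\frac{2-|A|}{|A|}$ for the full label set and for the subset $A_i=\{1,2\}$.

```python
import numpy as np

# Example 2.7: observations a, b with y(a) = 1, y(b) = 2,
# and weak hypotheses h1 (always 1) and h2 (always 2).
def score(h, d_a):
    """Weighted score sum_p (1[h(p)=y(p)] - 1[h(p)!=y(p)]) d(p)."""
    d_b = 1.0 - d_a
    return (1 if h('a') == 1 else -1) * d_a + (1 if h('b') == 2 else -1) * d_b

h1 = lambda p: 1
h2 = lambda p: 2

grid = np.linspace(0.0, 1.0, 10001)
game_value = min(max(score(h1, d), score(h2, d)) for d in grid)

print(f"min_d max_h score ~= {game_value:.4f}")        # ~ 0.0
print(f"benchmark for |A| = 3:   {(2 - 3) / 3:.4f}")    # -1/3: (2.3) holds
print(f"benchmark for |A_i| = 2: {(2 - 2) / 2:.4f}")    # 0: (2.4) fails
```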

2.4. Sufficiently Rich {\mathcal{H}}

A fundamental question is whether we can find a class of simple hypotheses which is weakly learnable. With the set of the simplest classifiers, we can do better than a random guess over any subset of $A$ with at least two elements.

Suppose that $A$ and ${\mathcal{P}}$ are subsets of a vector space or a finite dimensional Euclidean space. Consider $h$, which classifies ${\mathcal{P}}$ according to the value of a linear function and is thus called a linear classifier. Let ${\sf H}$ be a hyperplane in ${\mathbb{R}}^n$: $\exists\lambda\in{\mathbb{R}}^n$ and $\omega\in{\mathbb{R}}$ such that

{\sf H}=\left\{p\in{\mathbb{R}}^{n}\ |\ \lambda p=\omega\right\}.

Define ${\sf H}_+$ as the closed half space above ${\sf H}$:

{\sf H}_{+}=\left\{p\in{\mathbb{R}}^{n}\ |\ \lambda p\geq\omega\right\}.
Definition 2.9.

A single threshold (linear) classifier is a mapping

h : {\mathcal{P}} \rightarrow A

where $\exists a^+,a^-\in A$ such that

h(p)=\begin{cases}a^{+}&\text{if }p\in{\sf H}_{+}\\ a^{-}&\text{if }p\not\in{\sf H}_{+}.\end{cases}

If $h$ is a single threshold classifier, $|h({\mathcal{P}})|=2$, which may be strictly smaller than $|A|$. On the other hand, the set of single threshold classifiers is closed under permutations: if $\pi : A\rightarrow A$ is a bijection, then $\pi\circ h\in\mathcal{H}$ for every such $\pi$. Being closed under permutations turns out to be the key property for obtaining weak learnability.
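As an illustration (ours, with hypothetical helper names), a single threshold classifier on ${\mathbb{R}}^n$ and its relabeling by a permutation take only a few lines; composing with a permutation returns another single threshold classifier, which is exactly the closure property used below.

```python
import numpy as np

def single_threshold(lam, omega, a_plus, a_minus):
    """Classifier h(p) = a_plus if lam . p >= omega, else a_minus."""
    def h(p):
        return a_plus if np.dot(lam, p) >= omega else a_minus
    return h

def permute(pi, h):
    """Relabel a classifier by a permutation pi : A -> A (given as a dict).
    If h is a single threshold classifier, so is pi o h."""
    return lambda p: pi[h(p)]

# usage: output label 2 above the line x + y = 1, label 1 below it
h = single_threshold(lam=np.array([1.0, 1.0]), omega=1.0, a_plus=2, a_minus=1)
print(h(np.array([0.8, 0.9])), h(np.array([0.1, 0.2])))   # 2 1
swap = {1: 2, 2: 1}
print(permute(swap, h)(np.array([0.8, 0.9])))             # 1
```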

Lemma 2.10.

Consider an environment where the image of $y(p)$ has $|A^{\prime}|$ elements, where $A^{\prime}\subset A$. Any hypothesis class that is closed under permutations can do as well as $1/|A^{\prime}|$ random guessing.

Proof.

See Lemma A.1 in \citeasnoun{ChoandLibgober2020}. ∎

We use this result to show that the set of single-threshold classifiers is iteratively weakly learnable.

Corollary 2.11.

The set of single threshold classifiers is iteratively weakly learnable.

Proof.

$\forall A_i\subset A$, the set of single threshold classifiers

{\mathcal{H}}_{i}=\left\{h : {\mathcal{P}}\rightarrow A_{i}\right\}

is closed under permutations, from which the conclusion follows. ∎

In Example 2.7, using single-threshold classifiers ensures that an extreme point of the convex hull of $\mathcal{P}$ can always be labeled correctly, and the rest can at least do as well as a random guess. The analogous result is known to hold for the case of $|A|=2$. Interestingly, the critical property in generalizing to the multi-class case is not changing the set of classifiers but instead allowing the labels to be sufficiently rich.

3. Construction of Algorithm

The algorithm, which we call $\tau^K_S$, iterates over two loops: within an epoch (inner loop) and across epochs (outer loop). $K$ is the minimum length of an epoch, which we increase to improve the accuracy of the final hypothesis of the algorithm. The size of each epoch is a random variable $K_i\geq K$. The number of epochs, $I$, is also a random variable. While the total number of periods to generate the final hypothesis is the random variable $\sum_{i=1}^{I}K_i$, we show that as $K$ becomes large, the total number of periods is bounded by $\min(|A|,|{\mathcal{P}}|)K$ with a probability close to 1.

3.1. Preliminaries

The algorithm is run in discrete periods $k=1,2,\ldots$ within epochs $i=1,2,\ldots$. Each epoch lasts at least $K$ periods, but the terminal round of an epoch is random. At the start of an epoch, we set the period to $k=1$ and re-start the procedure.

It is convenient to enumerate the elements in $A=\{1,\ldots,|A|\}$ as the first $|A|$ positive integers, highlighting the fact that the weak learnability has little to do with the names of the labels but relies solely on the ordinal properties of $A$. Let $\pi : A\rightarrow A$ be a permutation over $A$:

\pi(j)=i_{j}\in\{1,\ldots,|A|\}\qquad\forall j.

Let $A_i=\{1,\ldots,|A_i|\}\subset A$ be the collection of the first $|A_i|$ positive integers, which is the set of feasible labels at the beginning of epoch $i$, with $A_1=A$. Let $\pi_i$ be a permutation over $A_i$.

3.2. Within Epoch ii

At the beginning of epoch $i$, $A_i\subset A$ and $y_i : {\mathcal{P}}\rightarrow A_i$ are given. Set $k=1$, and let $d_1(p)$ be the uniform distribution over $\mathcal{P}$. The output of each stage consists of three key components: an artificial probability distribution $d_k(p)$ over ${\mathcal{P}}$, a threshold rule $h_k$ in step $k$, and a positive weight $\alpha_k$.

Suppose that $d_k(p)$ is defined $\forall p\in{\mathcal{P}}$. Choose $h_k$ by solving

\max_{h\in{\mathcal{H}}_{i}}\sum_{p\in{\mathcal{P}}}\left(\mathbb{I}[h(p)=y_{i}(p)]-\mathbb{I}[h(p)\neq y_{i}(p)]\right)d_{k}(p). \qquad (3.5)

We call $h_k$ an optimal classifier. Define

\epsilon_{k}=\mathbf{P}_{d_{k}}\left(h_{k}(p)\neq y_{i}(p)\right) \qquad (3.6)

as the probability that the optimal classifier $h_k$ in round $k$ misclassifies $p$ under $d_k$. If $\epsilon_k=0$, then we stop the training and output $h_k$ as the final hypothesis, which perfectly forecasts $y_i(p)$.

Suppose that $\epsilon_k>0$. Define

\alpha_{k}=\frac{1}{2(|A_{i}|-1)}\left[\log\frac{(|A_{i}|-1)(1-\epsilon_{k})}{\epsilon_{k}}\right]. \qquad (3.7)

Note that the only difference between (3.7) and (2.1) is the coefficient $\frac{1}{2(|A_i|-1)}$. If $|A_i|=2$, our algorithm is reduced to AdaBoost.

For each pair $(p,y_i(p))$, define

d_{k+1}(p)=\frac{d_{k}(p)\exp\left(-\alpha_{k}\left[(|A_{i}|-1){\mathbb{I}}(h_{k}(p)=y_{i}(p))-{\mathbb{I}}(h_{k}(p)\neq y_{i}(p))\right]\right)}{Z_{k}}

where

Z_{k}=\sum_{p\in{\mathcal{P}}}d_{k}(p)\exp\left(-\alpha_{k}\left[(|A_{i}|-1){\mathbb{I}}(h_{k}(p)=y_{i}(p))-{\mathbb{I}}(h_{k}(p)\neq y_{i}(p))\right]\right).

Given $d_{k+1}$, we can recursively define $h_{k+1}$ and $\epsilon_{k+1}$.

Define $\forall a\in A_i=\{1,\ldots,|A_i|\}$,

F^{k}_{a}(p)=\sum_{s=1}^{k}\alpha_{s}{\mathbb{I}}(h_{s}(p)=a)

and

\Psi^{k}_{a}(p)=(|A_{i}|-1)F^{k}_{a}(p)-\sum_{a^{\prime}\neq a}F^{k}_{a^{\prime}}(p).

Note that

\Psi^{k}_{a}(p) = (|A_{i}|-1)F^{k}_{a}(p)-\sum_{a^{\prime}\neq a}F^{k}_{a^{\prime}}(p) \qquad (3.8)
 = \sum_{s=1}^{k}\alpha_{s}\left[(|A_{i}|-1){\mathbb{I}}(h_{s}(p)=a)-\sum_{a^{\prime}\neq a}{\mathbb{I}}(h_{s}(p)=a^{\prime})\right]
 = \sum_{s=1}^{k}\alpha_{s}\left[|A_{i}|{\mathbb{I}}(h_{s}(p)=a)-\left({\mathbb{I}}(h_{s}(p)=a)+\sum_{a^{\prime}\neq a}{\mathbb{I}}(h_{s}(p)=a^{\prime})\right)\right]
 = \sum_{s=1}^{k}\alpha_{s}\left(|A_{i}|{\mathbb{I}}(h_{s}(p)=a)-1\right).

Therefore, $\forall p$,

a\in\arg\max_{A_{i}}F^{k}_{a}(p) \qquad (3.9)

if and only if

a\in\arg\max_{A_{i}}\Psi^{k}_{a}(p). \qquad (3.10)

We use $\Psi^k_a(p)$ to predict $y_i(p)$, instead of $F^k_a(p)$. Since $\Psi^k_a(p)$ is a linear combination of ${\mathbb{I}}(h_s(p)=a)$, our algorithm is an ensemble algorithm. Since $\Psi^k_a(p)$ is calculated recursively within an epoch, and the change between two epochs depends only on the final outcome from the previous epoch, we regard our algorithm as a recursive algorithm.
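A minimal Python sketch of the inner loop follows, under toy assumptions that are ours rather than the paper's: the weak hypothesis class ${\mathcal{H}}_i$ is a finite list searched exhaustively in place of solving (3.5), the epoch runs exactly $K$ rounds instead of stopping at $K_i$, and the case $\epsilon_k=0$ is folded into the vote rather than terminating the training.

```python
import numpy as np

def run_epoch(P, y_i, H_i, A_i, K):
    """One epoch of the inner loop: returns Psi^K_a(p) for every a in A_i.
    P: list of observations, y_i: list of correct labels, H_i: list of classifiers."""
    n, m = len(P), len(A_i)
    d = np.full(n, 1.0 / n)                       # d_1: uniform over P
    F = {a: np.zeros(n) for a in A_i}             # F^k_a(p)
    for _ in range(K):
        # (3.5): maximize the weighted score, i.e. minimize the weighted error
        errs = [sum(d[j] for j in range(n) if h(P[j]) != y_i[j]) for h in H_i]
        best = int(np.argmin(errs))
        h, eps = H_i[best], errs[best]
        if eps == 0:
            # the paper stops and outputs h here; this sketch simply folds the
            # perfect classifier into the vote with a large weight
            for j in range(n):
                F[h(P[j])][j] += 10.0
            break
        # (3.7): alpha_k with the 1 / (2(|A_i| - 1)) coefficient
        alpha = np.log((m - 1) * (1 - eps) / eps) / (2 * (m - 1))
        # d_{k+1}: exponential reweighting followed by normalization (Z_k)
        w = np.array([d[j] * np.exp(-alpha * ((m - 1) if h(P[j]) == y_i[j] else -1))
                      for j in range(n)])
        d = w / w.sum()
        for j in range(n):
            F[h(P[j])][j] += alpha                # update F^k_a(p)
    total = sum(F.values())
    # Psi^k_a(p) = (|A_i|-1) F^k_a(p) - sum_{a' != a} F^k_{a'}(p) = |A_i| F^k_a(p) - total
    return {a: m * F[a] - total for a in A_i}
```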

Except for the case of $|A_i|=2$, (3.9) does not guarantee that $a=y_i(p)$ with probability close to 1 as $k\rightarrow\infty$. \citeasnoun{MukherjeeSchapire2013} argues that the failure is due to the fact that the weak learnability (2.3) is too weak, and proposed a stronger criterion, Edge-over-Random (EOR).

We take a different tack. We strengthen (2.3) to (2.4) by requiring the same condition for any subset of $A$ with at least two elements. Instead of solving (3.10), we eliminate each $a\in A_i$ which is not an element of $\arg\max_{A_i}\Psi^k_a(p)$. A critical step is to show that $a=y_i(p)\in A_i$ survives the elimination process with a probability close to 1. Since $|A|<\infty$, we can "solve" (3.10) in a finite number of iterations. Roughly speaking, (2.3) may be too weak to solve (3.9) directly, but it is sufficiently strong to solve the equivalent problem (3.10) indirectly, if (2.3) holds for any subset with at least two elements.

Recall that $K$ is the minimum number of periods in epoch $i$, which is a parameter of the algorithm. Define

K_{i}=\min\left\{k\geq K\mid\forall p\in{\mathcal{P}},\ \exists a\in A_{i},\ \Psi^{k}_{a}(p)<0\right\}

as the first period in epoch $i$ when each $p$ has a label $a\in A_i$ with $\Psi^k_a(p)<0$. Epoch $i$ terminates at period $K_i$.

Before we start epoch $i+1$, we need to update $A_i$ and $y_i$. Roughly speaking, $A_{i+1}$ is obtained by eliminating $a\in A_i$ with $\Psi^{K_i}_a(p)<0$. Because the label with $\Psi^{K_i}_a(p)<0$ can vary across different $p$, we need additional work. First, we eliminate an equal number of labels from $A_i$ for each $p$, and then permute the remaining elements so that we can "homogenize" the set of labels for each $p$.

Define

B_{i}(p)=\{a\in A_{i}\mid\Psi^{K_{i}}_{a}(p)<0\}

as the set of labels in $A_i$ with a negative value of $\Psi^{K_i}_a(p)$, which are the candidates for elimination. Because $|B_i(p)|$ can vary across $p$, we cannot eliminate all elements in $B_i(p)$ from $A_i$ if we want to keep the number of surviving elements in $A_i$ the same across $p$. Define

N_{i}=\min_{p}|B_{i}(p)|\geq 1

which is at least 1, by the definition of $K_i$. Recall that each element in $B_i(p)$ is a positive integer. Instead of eliminating all elements in $B_i(p)$, we eliminate $N_i$ elements from $B_i(p)$ $\forall p\in{\mathcal{P}}$. Recall that

A_{i}=\{1,\ldots,|A_{i}|\}.

We can write

B_{i}(p)=\{i(p)_{1},\ldots,i(p)_{|B_{i}(p)|}\}.

For convenience, we select the last $N_i$ elements in $B_i(p)$ to construct ${\underline{B}}_i(p)$:

{\underline{B}}_{i}(p)=\left\{i(p)_{|B_{i}(p)|-N_{i}+1},\ldots,i(p)_{|B_{i}(p)|}\right\}.

By the construction of ${\underline{B}}_i(p)$,

|A_{i}\setminus{\underline{B}}_{i}(p)|=|A_{i}\setminus{\underline{B}}_{i}(p^{\prime})|\qquad\forall p\neq p^{\prime}\in{\mathcal{P}}. \qquad (3.11)

The remaining step is to re-label each element in $A_i\setminus{\underline{B}}_i(p)$ so that each $p$ has the same set of labels.

Define

A_{i+1}=\{1,\ldots,|A_{i}\setminus{\underline{B}}_{i}(p)|\}

as the collection of the first $|A_i\setminus{\underline{B}}_i(p)|$ positive integers. Note that

|A_{i+1}|=|A_{i}\setminus{\underline{B}}_{i}(p)|\qquad\forall p\in{\mathcal{P}}

by (3.11).

We arrange the elements in $A_i\setminus{\underline{B}}_i(p)$ in increasing order of their indices. For each $p\in{\mathcal{P}}$, define a permutation

\pi^{p}_{i} : A_{i+1}\rightarrow A_{i}\setminus{\underline{B}}_{i}(p)

where

\pi^{p}_{i}(1)=\min\{a\in A_{i}\setminus{\underline{B}}_{i}(p)\}

and let $\pi^p_i(1)=a_1$. Given $a_1,\ldots,a_{j-1}\in A_i\setminus{\underline{B}}_i(p)$,

\pi^{p}_{i}(j)=\min\left\{a\in A_{i}\setminus\left[{\underline{B}}_{i}(p)\cup\{a_{1},\ldots,a_{j-1}\}\right]\right\}.

Since $\pi^p_i$ is a bijection, $\left(\pi^p_i\right)^{-1}$ exists. The correct label in epoch $i+1$ is

y_{i+1}(p)=\left(\pi^{p}_{i}\right)^{-1}(y_{i}(p))\qquad\forall p.
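The elimination-and-relabeling step between epochs can be sketched as follows (our own illustration; `Psi` is the dictionary returned by the `run_epoch` sketch above, indexed as `Psi[a][j]` for label $a$ and observation index $j$).

```python
def epoch_transition(Psi, A_i, y_i, n):
    """Eliminate N_i labels with negative Psi for every observation, then relabel.
    Returns A_{i+1}, y_{i+1}, and the permutations pi_i^p (one dict per observation)."""
    # B_i(p): labels with a negative score at the end of the epoch
    B = [[a for a in A_i if Psi[a][j] < 0] for j in range(n)]
    N_i = min(len(B_j) for B_j in B)          # the paper guarantees N_i >= 1 at time K_i
    A_next = list(range(1, len(A_i) - N_i + 1))
    pi, y_next = [], []
    for j in range(n):
        drop = set(B[j][len(B[j]) - N_i:])    # underline{B}_i(p): the last N_i candidates
        survivors = [a for a in A_i if a not in drop]   # kept in increasing order
        pi_j = dict(zip(A_next, survivors))   # pi_i^p : A_{i+1} -> A_i \ underline{B}_i(p)
        inv = {old: new for new, old in pi_j.items()}
        pi.append(pi_j)
        # y_{i+1}(p) = (pi_i^p)^{-1}(y_i(p)); None if the true label was (wrongly) eliminated
        y_next.append(inv.get(y_i[j]))
    return A_next, y_next, pi
```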

3.3. Iteration over epoch

Replace $A_i$ and $y_i$ by $A_{i+1}$ and $y_{i+1}$, and start the entire process described in Section 3.2 again. The iteration over epochs stops at the first $i$ where

|A_{i}\setminus{\underline{B}}_{i}(p)|=1\qquad\forall p. \qquad (3.12)

Define

I=\min\left\{i\mid |A_{i}\setminus{\underline{B}}_{i}(p)|=1\quad\forall p\right\}

as the number of epochs when the algorithm stops. Since epoch $i$ lasts $K_i$ periods, the total number of periods to train the algorithm is

t=\sum_{i=1}^{I}K_{i}

which is a random variable.

Let us call the constructed algorithm $\tau^K_S$ because the algorithm is parameterized by the minimum length $K$ of each epoch. The element in $A_I\setminus{\underline{B}}_I(p)$ is the source of the final hypothesis of $\tau^K_S$. Since we know $\pi^p_i$ $\forall p\in{\mathcal{P}}$ and $\forall i\in\{1,\ldots,I\}$, the final hypothesis of $\tau^K_S$ is

\tau^{K}_{S}(D_{t})(p)=\pi_{1}^{p}\left(\cdots\pi_{I}^{p}\left(A_{I}\setminus{\underline{B}}_{I}(p)\right)\right)\qquad\forall p\in{\mathcal{P}}.
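Putting the two sketches together, the outer loop of $\tau^K_S$ iterates until a single label per observation survives and then undoes the relabelings. The helper `make_H` is hypothetical: it is assumed to return the weak hypotheses with labels in the current $A_i$ (e.g., single threshold classifiers), and $K$ is assumed large enough that every epoch eliminates at least one label.

```python
def tau_S(P, y, A, make_H, K):
    """Toy outer loop of tau_S^K, built on run_epoch and epoch_transition above."""
    n = len(P)
    A_i, y_i = list(A), list(y)
    history = []                              # the permutations pi_i^p, epoch by epoch
    while len(A_i) > 1:
        Psi = run_epoch(P, y_i, make_H(A_i), A_i, K)
        A_next, y_i, pi = epoch_transition(Psi, A_i, y_i, n)
        if len(A_next) == len(A_i):           # nothing eliminated: K was too small
            raise ValueError("increase K")
        A_i = A_next
        history.append(pi)
    # final hypothesis: tau(p) = pi_1^p( ... pi_I^p(surviving label) ... )
    preds = []
    for j in range(n):
        label = A_i[0]                        # the single surviving (relabeled) label
        for pi in reversed(history):
            label = pi[j][label]
        preds.append(label)
    return preds
```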

We are interested in $\mathbf{P}\left(\tau^K_S(D_t)(p)=y(p)\right)$.

Theorem 3.1.

Assume that $|A|<\infty$ and $|{\mathcal{P}}|<\infty$, and that ${\mathcal{H}}$ is iteratively weakly learnable.

  (1) $I\leq\min(|A|,|{\mathcal{P}}|)$.

  (2) $\exists{\underline{\rho}}>0$ and ${\overline{K}}$ such that $\forall K\geq{\overline{K}}$, with probability at least $1-e^{-{\underline{\rho}}K}$,

    (a) $K_i=K$ $\forall i\in\{1,\ldots,I\}$;

    (b) $\mathbf{P}\left(\tau_S^K(D_t)(p)=y(p)\right)\geq 1-e^{-{\underline{\rho}}K}$, where $t=\sum_{i=1}^{I}K_i$.

Proof.

See Appendix A. ∎

4. Generalization Error

Our last result shows how our approach lends itself to a simple formulation of the generalization error. Let ${\mathcal{S}}$ be the sample space generated by $K$ samples and $\mathbf{P}_{\mathcal{S}}$ be the probability measure over the sample space. Theorem 3.1 proved that the training error vanishes at an exponential rate:

\mathbf{P}_{\mathcal{S}}\left(\tau^{K}_{S}(D_{KI})(p)\neq y(p)\right)\leq e^{-{\underline{\rho}}K}

where $K$ is the length of an epoch and $I$ is the number of epochs. We know that

I\leq\min(|A|,|{\mathcal{P}}|).

We can show that the generalization error of $\tau^K_S$ vanishes as $K\rightarrow\infty$. Let $\mathbf{P}_{\mathcal{D}}$ be the population probability distribution.

Theorem 4.1.

$\forall m\geq K$,

\mathbf{P}_{\mathcal{D}}\left(\tau^{K}_{S}(D_{KI})(p)\neq y(p)\right)\leq\mathbf{P}_{\mathcal{S}}\left(\tau^{K}_{S}(D_{KI})(p)\neq y(p)\right)+{\sf O}\left(\frac{1}{\sqrt{m}}\right).
Proof.

See Appendix B. ∎

5. Conclusion

As $\tau^K_S$ is built on the framework of SAMME, the algorithm inherits the same statistical foundation as SAMME and is easier to motivate than other extensions of AdaBoost. Some extensions of AdaBoost (e.g., \citeasnoun{MukherjeeSchapire2013}) may have to "tune parameters" of the algorithm to achieve efficiency, and the selection of a proper parameter requires information about ${\mathcal{H}}$. In contrast, the construction of $\tau^K_S$ requires no knowledge about ${\mathcal{H}}$, other than the fact that the set of weak hypotheses satisfies iterative weak learnability.

Iterative weak learnability is straightforward to check for hypothesis classes that are closed under permutations. This allowed us to present the algorithm in a simple form, without needing to check that the algorithm can outperform random guessing for a rich set of possible cost functions.

Appendix A Proof of Theorem 3.1

A.1. Uniform Bound for II

We show

I\leq\min(|A|,|{\mathcal{P}}|).

Recall that $\forall d_k$, $h_k$ maximizes

\sum_{p\in{\mathcal{P}}}\left(\mathbb{I}[h(p)=y_{i}(p)]-\mathbb{I}[h(p)\neq y_{i}(p)]\right)d_{k}(p).

If $\exists p\in{\mathcal{P}}$ such that $h(p)=a\in A\setminus y\left({\mathcal{P}}\right)$, then $h$ cannot be an optimal weak hypothesis: if $a=a^{+}\in A\setminus y\left({\mathcal{P}}\right)$, the decision maker can weakly increase the performance by replacing $a^{+}$ with a label chosen from $y_i({\mathcal{P}})$. Thus, $\forall k$, $h_k({\mathcal{P}})\subset y_i({\mathcal{P}})$.

We have $\forall k$, $\forall a\not\in y_i({\mathcal{P}})\subset A$,

\Psi^{k}_{a}(p)=\sum_{s=1}^{k}\alpha_{s}\left(|A|{\mathbb{I}}(h_{s}(p)=a)-1\right)=-\sum_{s=1}^{k}\alpha_{s}<0,

since $h_s(p)\neq a$ $\forall a\in A\setminus y\left({\mathcal{P}}\right)$. Every $a\in A\setminus y\left({\mathcal{P}}\right)$ is eliminated at the end of the first epoch. Since $|y\left({\mathcal{P}}\right)|\leq|{\mathcal{P}}|$, the number of remaining elements after the first epoch is no more than $\min(|A|,|{\mathcal{P}}|)$.

A.2. Within Epoch

We show that if the minimum length $K$ of an epoch is sufficiently large, then $K_i=K$ with a probability close to 1. That is, by the time the algorithm reaches the scheduled end of the $i$th epoch, $\exists a_i(p)\in A_i$ such that $\Psi^K_{a_i(p)}(p)<0$. We calculate a lower bound of this probability: there exists $\rho_i>0$ such that the probability is bounded from below by $1-e^{-\rho_i K}$.

Recall that $\forall a\in A_i$,

F^{k}_{a}(p)=\sum_{s=1}^{k}\alpha_{s}{\mathbb{I}}(h_{s}(p)=a)

and

\Psi^{k}_{a}(p) = (|A_{i}|-1)F^{k}_{a}(p)-\sum_{a^{\prime}\neq a}F^{k}_{a^{\prime}}(p) = \sum_{s=1}^{k}\alpha_{s}\left(|A_{i}|{\mathbb{I}}(h_{s}(p)=a)-1\right).

Thus,

\sum_{a}\Psi^{k}_{a}(p) = \sum_{a}\left(\sum_{s=1}^{k}\alpha_{s}\left[|A_{i}|{\mathbb{I}}(h_{s}(p)=a)-1\right]\right)
 = \sum_{s=1}^{k}\alpha_{s}\sum_{a}\left[|A_{i}|{\mathbb{I}}(h_{s}(p)=a)-1\right]
 = \sum_{s=1}^{k}\alpha_{s}\left[|A_{i}|\sum_{a}{\mathbb{I}}(h_{s}(p)=a)-\sum_{a}1\right]
 = \sum_{s=1}^{k}\alpha_{s}\left[|A_{i}|-|A_{i}|\right]=0.

Iterative weak learnability implies that $\forall d_k$, $\exists h_k\in{\mathcal{H}}_i$ so that

\sum_{p\in{\mathcal{P}}}\left(\mathbb{I}[h_{k}(p)=y_{i}(p)]-\mathbb{I}[h_{k}(p)\neq y_{i}(p)]\right)d_{k}(p)\geq\frac{2-|A_{i}|}{|A_{i}|}+\gamma_{k}

for some $\gamma_k>0$. The weak learnability implies that $\inf_k\gamma_k>0$.

Lemma A.1.
\mathbf{P}\left(\Psi_{y_{i}(p)}^{k}(p)\leq 0\right)\leq e^{-\rho_{i}k}.
Proof.

Note that

\mathbf{P}\left(\Psi_{y_{i}(p)}^{k}(p)\leq 0\right) = \mathbf{E}_{d_{1}}{\mathbb{I}}\left(\Psi^{k}_{y_{i}(p)}(p)\leq 0\right) \qquad (A.13)
 = \mathbf{E}_{d_{1}}{\mathbb{I}}\left(-\Psi^{k}_{y_{i}(p)}(p)\geq 0\right)
 \leq \mathbf{E}_{d_{1}}\exp\left(-\Psi^{k}_{y_{i}(p)}(p)\right).

From the updating formula of $d_k$,

d_{k+1}(p) = \frac{d_{k}(p)\exp\left[-\alpha_{k}(|A_{i}|-1){\mathbb{I}}(h_{k}(p)=y_{i}(p))+\alpha_{k}{\mathbb{I}}(h_{k}(p)\neq y_{i}(p))\right]}{Z_{k}} \qquad (A.14)
 = \frac{d_{1}(p)\exp\left[\sum_{s=1}^{k}\left(-\alpha_{s}(|A_{i}|-1){\mathbb{I}}(h_{s}(p)=y_{i}(p))+\alpha_{s}{\mathbb{I}}(h_{s}(p)\neq y_{i}(p))\right)\right]}{\prod_{s=1}^{k}Z_{s}}
 = \frac{d_{1}(p)\exp\left[-(|A_{i}|-1)F^{k}_{y_{i}(p)}(p)+\sum_{a\neq y_{i}(p)}F^{k}_{a}(p)\right]}{\prod_{s=1}^{k}Z_{s}}
 = \frac{d_{1}(p)\exp\left[-\Psi^{k}_{y_{i}(p)}(p)\right]}{\prod_{s=1}^{k}Z_{s}}. \qquad (A.15)

By summing over $p$, we have

\sum_{p}d_{1}(p)\exp\left[-\Psi^{k}_{y_{i}(p)}(p)\right]=\prod_{s=1}^{k}Z_{s}.

Note that $\forall k\geq 1$, $\forall s\in\{1,\ldots,k\}$,

Z_{s} = \sum_{p}d_{s}(p)\exp\left[-\alpha_{s}(|A_{i}|-1){\mathbb{I}}(h_{s}(p)=y_{i}(p))+\alpha_{s}{\mathbb{I}}(h_{s}(p)\neq y_{i}(p))\right]
 = \sum_{h_{s}(p)=y_{i}(p)}d_{s}(p)e^{-(|A_{i}|-1)\alpha_{s}}+\sum_{h_{s}(p)\neq y_{i}(p)}d_{s}(p)e^{\alpha_{s}}
 = e^{-\alpha_{s}(|A_{i}|-1)}(1-\epsilon_{s})+e^{\alpha_{s}}\epsilon_{s}
 = (|A_{i}|-1)^{-\frac{1}{2}}\epsilon_{s}^{\frac{1}{2}}(1-\epsilon_{s})^{\frac{1}{2}}+(|A_{i}|-1)^{\frac{1}{2(|A_{i}|-1)}}(1-\epsilon_{s})^{\frac{1}{2(|A_{i}|-1)}}\epsilon_{s}^{1-\frac{1}{2(|A_{i}|-1)}}
 \equiv \varphi(\epsilon_{s}).
Lemma A.2.

$\varphi(\epsilon)$ is a strictly concave function of $\epsilon\in[0,1]$,

\varphi\left(1-\frac{1}{|A_{i}|}\right)=1 \qquad (A.16)

and

\varphi^{\prime}\left(1-\frac{1}{|A_{i}|}\right)=0. \qquad (A.17)
Proof.

$\epsilon^{\frac{1}{2}}(1-\epsilon)^{\frac{1}{2}}$ and $(1-\epsilon)^{\frac{1}{2(|A_{i}|-1)}}\epsilon^{1-\frac{1}{2(|A_{i}|-1)}}$ are strictly concave functions of $\epsilon$. $\varphi(\epsilon)$ is a positive linear combination of these two strictly concave functions. Thus, $\varphi(\epsilon)$ is strictly concave.

(A.16) follows from a simple calculation.

\varphi\left(1-\frac{1}{|A_{i}|}\right) = (|A_{i}|-1)^{-\frac{1}{2}}\left(\frac{|A_{i}|-1}{|A_{i}|}\right)^{\frac{1}{2}}\left(\frac{1}{|A_{i}|}\right)^{\frac{1}{2}}+(|A_{i}|-1)^{\frac{1}{2(|A_{i}|-1)}}\left(\frac{|A_{i}|-1}{|A_{i}|}\right)^{1-\frac{1}{2(|A_{i}|-1)}}\left(\frac{1}{|A_{i}|}\right)^{\frac{1}{2(|A_{i}|-1)}}
 = \frac{1}{|A_{i}|}+\left(1-\frac{1}{|A_{i}|}\right)=1.

The calculation of $\varphi^{\prime}$ is a little more involved:

\varphi^{\prime}\left(1-\frac{1}{|A_{i}|}\right) = (|A_{i}|-1)^{-\frac{1}{2}}\left[\frac{1}{2}\left(\frac{1}{|A_{i}|-1}\right)^{\frac{1}{2}}-\frac{1}{2}\left(|A_{i}|-1\right)^{\frac{1}{2}}\right]
 \quad +(|A_{i}|-1)^{\frac{1}{2(|A_{i}|-1)}}\left[\left(1-\frac{1}{2(|A_{i}|-1)}\right)\left(\frac{1}{|A_{i}|-1}\right)^{\frac{1}{2(|A_{i}|-1)}}-\frac{1}{2(|A_{i}|-1)}\left(|A_{i}|-1\right)^{1-\frac{1}{2(|A_{i}|-1)}}\right]
 = \left[\frac{1}{2}\frac{|A_{i}|}{|A_{i}|-1}-1\right]+\left[1-\frac{1}{2}\frac{|A_{i}|}{|A_{i}|-1}\right]
 = 0.
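As a quick numerical sanity check of (A.16) and (A.17) (our own, not part of the proof), one can evaluate $\varphi$ and a symmetric finite-difference slope at $\epsilon=1-1/|A_i|$:

```python
import numpy as np

def phi(eps, m):
    """phi(eps) for |A_i| = m, as defined above."""
    c = 1.0 / (2 * (m - 1))
    return ((m - 1) ** -0.5) * np.sqrt(eps * (1 - eps)) \
        + ((m - 1) ** c) * ((1 - eps) ** c) * (eps ** (1 - c))

for m in (2, 3, 5, 10):
    e0 = 1 - 1 / m
    slope = (phi(e0 + 1e-6, m) - phi(e0 - 1e-6, m)) / 2e-6
    print(m, round(phi(e0, m), 10), round(slope, 6))   # expect m, 1.0, ~0.0
```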

Since ${\mathcal{H}}_i$ is weakly learnable, $\exists\gamma_k>0$ such that

0\leq\epsilon_{k}\leq 1-\frac{1}{|A_{i}|}-\gamma_{k}<1-\frac{1}{|A_{i}|}.

Weak learnability implies that

\gamma_{k}>0\quad\text{and}\quad\gamma=\inf_{k}\gamma_{k}>0.

Since $\varphi$ is strictly concave, $\varphi^{\prime}(\epsilon)>0$ $\forall\epsilon<1-\frac{1}{|A_i|}$. Thus, $\exists{\hat{\gamma}}>0$ such that $\forall\epsilon_k\leq 1-\frac{1}{|A_i|}-\gamma$,

\varphi(\epsilon_{k})\leq\varphi\left(1-\frac{1}{|A_{i}|}-\gamma\right)=1-{\hat{\gamma}}<1.

Thus,

Z_{k}=\varphi(\epsilon_{k})\leq 1-{\hat{\gamma}}

for some ${\hat{\gamma}}>0$. Thus,

\mathbf{E}_{d_{1}}\exp\left(-\Psi^{k}_{y_{i}(p)}(p)\right)=\sum_{p}d_{1}(p)\exp\left[-\Psi^{k}_{y_{i}(p)}(p)\right]=\prod_{s=1}^{k}Z_{s}\leq(1-{\hat{\gamma}})^{k}=e^{-\rho_{i}k}

where

\rho_{i}=-\log(1-{\hat{\gamma}})>0. \qquad (A.18)

Together with (A.13), this proves Lemma A.1. ∎

By the definition, $\forall p$, $\forall k$,

\sum_{a\in A_{i}}\Psi^{k}_{a}(p)=0. \qquad (A.19)

By (A.19), $\forall p$, if $\Psi^K_{y(p)}(p)>0$, then $\exists{\underline{a}}_i(p)\in A_i$ such that

\Psi^{K}_{{\underline{a}}_{i}(p)}(p)<0.

Lemma A.1 implies

{\underline{a}}_{i}(p)\neq y_{i}(p)

with probability at least $1-e^{-\rho_i K}$. Hence, $K_i=K$ with probability at least $1-e^{-\rho_i K}$.

As Lemma A.1 indicates, if $a=y(p)$, then $\Psi^k_a(p)>0$ almost surely as $K\rightarrow\infty$. If $|A|=2$, then (A.19) implies that exactly one label is associated with a positive value of $\Psi^k_a(p)$. Thus, searching for $a$ with $\Psi^k_a(p)>0$ is the proper way to identify a correct label. If $|A|>2$, however, the positive sign of $\Psi^k_a(p)$ does not imply that $a=y(p)$. On the other hand, if $\Psi^k_a(p)<0$, then Lemma A.1 implies that $a\neq y(p)$ almost surely as $K\rightarrow\infty$. The idea of $\tau^K_S$ is to eliminate $a$ with $\Psi^k_a(p)<0$ in each epoch, until we have a single element left.

A.3. Across Epochs

We construct $A_{i+1}$ by eliminating some elements from $A_i$ and permuting the remaining elements according to $\pi_i$. With probability at least $1-e^{-\rho_i K}$, the true label survives the elimination from $A_i$, and is included in $A_{i+1}$, permuted according to $\pi_i$.

Since we eliminate at least one element from $A_i$, the total number of epochs cannot be larger than $|A|$. Let $I$ be the epoch when $A_{I+1}$ is a singleton. With probability

\prod_{i=1}^{I}(1-e^{-\rho_{i}K}),

the remaining label $y_I(p)$ is the real label, which has been permuted by the series $\pi_1,\ldots,\pi_I$. We can recover the real label

y(p)=\pi_{1}\left(\cdots\pi_{I}(y_{I}(p))\right).

Define

{\underline{\rho}}=\frac{1}{2}\min_{i}\rho_{i}>0.

Then, for a sufficiently large $K$,

\prod_{i=1}^{I}(1-e^{-\rho_{i}K})\geq 1-|A|e^{-K\min_{i}\rho_{i}}=1-e^{-K\left(\min_{i}\rho_{i}-\frac{1}{K}\log|A|\right)}\geq 1-e^{-{\underline{\rho}}K}

which completes the proof.

Appendix B Proof of Theorem 4.1

The proof follows the same logic as the proof of Theorem 5.1 on page 98 of \citeasnoun{SchapireandFreund12}. We only need to change the notation to accommodate multiple classes ($|A|\geq 2$). For reference, we sketch the proof by following the original proof of Theorem 5.1 in \citeasnoun{SchapireandFreund12}, pointing out where a change is needed and referring the rest to \citeasnoun{SchapireandFreund12}.

B.1. Preliminaries

Recall that $\forall i\leq I$, $\forall a\in A_i$,

\Psi^{k}_{a}(p)=\sum_{s=1}^{k}\alpha_{s}\left[|A_{i}|{\mathbb{I}}(h_{s}(p)=a)-1\right].

Define

g^{s}_{a}(p)=|A_{i}|{\mathbb{I}}(h_{s}(p)=a)-1,

\beta_{s}=\frac{\alpha_{s}}{\sum_{s^{\prime}=1}^{k}\alpha_{s^{\prime}}}

and

\psi^{k}_{a}(p)=\sum_{s=1}^{k}\beta_{s}g_{a}^{s}(p).

We drop the subscript $a$ because we focus on $a=y(p)$ and write instead

g(p)=g_{y(p)}(p)\quad\text{and}\quad\psi^{k}(p)=\psi^{k}_{y(p)}(p).

Recall that ${\mathcal{H}}$ is the collection of feasible weak hypotheses, whose generic element is $h$. Define

{\mathcal{G}}=\left\{g\mid\exists h\in{\mathcal{H}},\ g(p)=|A_{i}|{\mathbb{I}}(h(p)=y(p))-1\right\}.

For a fixed $n$, define

{\mathcal{A}}_{n}=\left\{f : p\mapsto\frac{1}{n}\sum_{j=1}^{n}g_{j}(p)\mid g_{1},\ldots,g_{n}\in{\mathcal{G}}\right\}.

Let ${\tilde{g}}_1,\ldots,{\tilde{g}}_n$ be a collection of $n$ randomly selected elements from ${\mathcal{G}}$ and

{\tilde{f}}=\frac{1}{n}\sum_{j=1}^{n}{\tilde{g}}_{j}(p)

where

\mathbf{E}{\tilde{g}}_{j}=\sum_{s=1}^{k}\beta_{s}g_{y(p)}^{s}(p)=\psi^{k}_{y(p)}(p).

If

\tau^{K}_{S}(D_{KI})(p)\neq y(p),

then $y(p)$ is eliminated from $A_i$ in the $i$-th epoch for some $i\leq I$, which implies

\psi^{K}_{y(p)}(p)<0.

Thus,

\mathbf{P}_{\mathcal{D}}\left(\tau^{K}_{S}(D_{KI})\neq y\right) \leq \prod_{i=1}^{I}\mathbf{P}_{\mathcal{D}}\left(\exists i<I,\ \psi^{K}_{y(p)}(p)<0\mid\psi^{K}_{y(p)}(p)\geq 0,\ j\leq i\right)
 = \prod_{i=1}^{I}\mathbf{P}_{\mathcal{D}}\left(\psi^{K}_{y(p)}(p)<0\ \text{at the end of the}\ i\text{-th epoch}\right).

The last equality stems from the facts that the samples are independently drawn over periods, and that we re-initialize the algorithm at the beginning of each epoch.

We only show that in the first epoch

\mathbf{P}_{\mathcal{D}}\left(\psi^{K}_{y(p)}(p)<0\right)\leq\mathbf{P}_{\mathcal{S}}\left(\psi^{K}_{y(p)}(p)<0\right)+{\sf O}\left(\frac{1}{\sqrt{m}}\right)

where ${\mathcal{S}}$ is the collection of all $m$ samples. The proof for the remaining epochs follows the same logic. Since $I\leq\min\left(|A|,|{\mathcal{P}}|\right)\leq|A|$, we then have

\mathbf{P}_{\mathcal{D}}\left(\tau^{K}_{S}(D_{KI})\neq y\right)\leq\mathbf{P}_{\mathcal{S}}\left(\tau^{K}_{S}(D_{KI})\neq y\right)+|A|\,{\sf O}\left(\frac{1}{\sqrt{m}}\right).

B.2. Sketch of Proof

Lemma B.1.

(Restatement of Lemma 5.2 in \citeasnoun{SchapireandFreund12}) $\forall\theta$,

\mathbf{P}_{{\tilde{f}}}\left(|{\tilde{f}}(p)-f(p)|\geq\frac{\theta}{2}\right)\leq 2e^{-\frac{n\theta^{2}}{2|A|^{2}}}\equiv\beta_{n,\theta}. \qquad (B.20)
Proof.

Note that the only change from the original statement in \citeasnoun{SchapireandFreund12} is $\beta_{n,\theta}$, which needs to be scaled to accommodate the case $|A|>2$: since $g(p)\in\{|A|-1,-1\}$, we need to rescale $g$ into a random variable in $[-1,1]$ and modify $\beta_{n,\theta}$ accordingly. The remainder of the proof is identical to the original proof of Lemma 5.2 in \citeasnoun{SchapireandFreund12}. ∎

Lemma 5.3 of \citeasnoun{SchapireandFreund12} is not needed. The proof of Lemma 5.4 of \citeasnoun{SchapireandFreund12} applies to our case without modifying the parameter, from which the conclusion of Theorem 4.1 follows.

References

  • Cho, I.-K., and J. Libgober (2020): "Machine Learning for Strategic Inference," Emory University and University of Southern California.
  • Friedman, J., T. Hastie, and R. Tibshirani (2000): "Additive Logistic Regression: A Statistical View of Boosting," The Annals of Statistics, 28(2), 337–407.
  • Mukherjee, I., and R. E. Schapire (2013): "A Theory of Multiclass Boosting," Journal of Machine Learning Research, 14, 437–497.
  • Schapire, R. E., and Y. Freund (2012): Boosting: Foundations and Algorithms. MIT Press.
  • Schapire, R. E., and Y. Singer (1999): "Improved Boosting Algorithms Using Confidence-Rated Predictions," Machine Learning, 37, 297–336.
  • Tversky, A. (1972): "Elimination by Aspects: A Theory of Choice," Psychological Review, 79(4), 281–299.
  • Zhu, J., H. Zou, S. Rosset, and T. Hastie (2009): "Multi-class AdaBoost," Statistics and Its Interface, 2, 349–360.