Iterative Weak Learnability and Multi-Class AdaBoost
Abstract.
We construct an efficient recursive ensemble algorithm for the multi-class classification problem, inspired by SAMME (\citeasnounZhuZouRossetandHastie09). We strengthen the weak learnability condition in \citeasnounZhuZouRossetandHastie09 by requiring that it hold for any subset of labels with at least two elements. This condition is simpler to check than many proposed alternatives
(e.g., \citeasnounMukherjeeSchapire2013). Like SAMME, our algorithm reduces to the Adaptive Boosting algorithm (\citeasnounSchapireandFreund12) if the number of labels is two, and can be motivated as a functional version of the steepest descent method for finding an optimal solution. In contrast to SAMME, our algorithm’s final hypothesis converges to the correct label with probability 1. For any number of labels, the probability of misclassification vanishes exponentially as the training period increases.
As with the Adaptive Boosting algorithm, the generalization error of our algorithm is bounded by the sum of the training error and an additional term that depends only on the sample size.
Keywords. Adaptive Boosting Algorithm, SAMME, Iterative Weak Learnability, Efficient Ensemble Algorithm, Generalization Error
1. Introduction
While the Adaptive Boosting (AdaBoost) algorithm (\citeasnounSchapireandFreund12) and its variants have been remarkably successful for binary classification problems, difficulties with generalizing to the multi-class case are well known.\footnote{For a comprehensive survey, see \citeasnounFriedmanHestieandTibshirani00.} Early approaches to this problem proposed dividing the multi-class problem into multiple two-class problems (e.g., \citeasnounSchapireandSinger99). These approaches either do not inherit all of the desirable properties of AdaBoost or lack its intuitive description. An extended algorithm is expected to be efficient: it must produce a forecast that converges to the correct label in probability, with the probability of misclassification vanishing at an exponential rate. However, the extended algorithms are often substantially different from, and significantly more elaborate than, the original formulation of \citeasnounSchapireandFreund12. These differences from AdaBoost are not merely aesthetic. First, it is less straightforward to analyze the properties of more elaborate algorithms. Second, the conditions required to ensure they work are often significantly more stringent.
\citeasnounZhuZouRossetandHastie09 developed a multi-class generalization of AdaBoost that maintains the same structure as the binary label case, called Stagewise Additive Modeling using a Multi-class Exponential loss function (SAMME). The algorithm has a statistical foundation as a functional version of the recursive steepest descent method for calculating the minimum of an exponential loss function. SAMME reduces to the original AdaBoost if the number of labels is two. \citeasnounZhuZouRossetandHastie09 demonstrated through numerical simulations that if the set of weak hypotheses satisfies a version of the weak learnability condition, which extends the condition of \citeasnounSchapireandFreund12, SAMME converges to the correct labels.
However, SAMME fails to generate a forecast that converges to the correct label in probability. \citeasnounMukherjeeSchapire2013 showed that the weak learnability condition of \citeasnounZhuZouRossetandHastie09 is not adequate for constructing an efficient recursive ensemble algorithm. \citeasnounMukherjeeSchapire2013 instead imposed a condition called Edge-over-Random (EOR), which involves checking a richer set of conditions. Verifying EOR is not straightforward because the loss must be weighted differently for every kind of misclassification.
This paper proposes an alternative approach. We show that a slight modification of SAMME, combined with a slightly stronger version of the weak learnability condition of \citeasnounZhuZouRossetandHastie09, allows us to construct an efficient recursive ensemble algorithm. We strengthen the weak learnability condition by requiring that the same condition continue to hold for any subset of labels with at least two elements. We call this stronger version iterative weak learnability. One might wonder whether the stronger version of weak learnability requires an exceedingly elaborate class of weak hypotheses. On the contrary, the set of single threshold classifiers, the set of the simplest non-trivial classifiers, satisfies iterative weak learnability.
We depart from the conventional view of a boosting algorithm as a process that produces a final hypothesis which becomes increasingly accurate as the training period increases. Instead, we consider the boosting algorithm as a process that eliminates inaccurate labels recursively, retaining the correct label in each round of elimination. If the set of labels has two elements, the two views are two sides of the same coin. As an example, consider the original AdaBoost with two labels. One can view the algorithm as a process to identify the correct label for each observation. Instead, we view the same algorithm as a process to eliminate the incorrect label. If the number of labels is larger than two, the second view is much easier to generalize than the conventional view of the boosting algorithm.
Our algorithm can be motivated as an implementation of Elimination by Aspects (EBA) (\citeasnounTversky72). If we regard a label as an attribute, our exercise can be viewed as a procedure to identify the best possible attribute for the observation. The decision maker evaluates an attribute according to a linear function of different attributes after a sequence of observations and eliminates any attribute with a negative value of the linear function. Our algorithm selects the attribute that survives the iterative elimination process.\footnote{We are grateful to Shaowei Ke for pointing out the possible link and the reference.}
We modify the SAMME algorithm to construct an ensemble algorithm that eliminates incorrect labels for each observation. By eliminating incorrect labels iteratively, the number of remaining labels falls to two, at which point our algorithm becomes the original AdaBoost algorithm for identifying the correct label. We show that the probability of eliminating the correct label vanishes at an exponential rate as the training period increases, which implies that our algorithm eliminates incorrect labels both accurately and efficiently, in contrast to the original proposal. One benefit of our approach is that we can analyze the algorithm following steps similar to those for the original binary label AdaBoost algorithm. Our algorithm admits a generalization error bound closely related to that of the original formulation.
Section 2 illustrates the problem. Section 3 constructs the algorithm and states the main result. Section 4 shows that the generalization error vanishes as the number of samples increases, proving that our algorithm is robust against overfitting, as the original AdaBoost is. Section 5 concludes the paper.
2. Preliminaries
2.1. Basics
Given set , let be the number of elements in . Let be the set of labels, and be the set of observations, which is endowed with a probability distribution. A classifier is
which labels each observation. We often call a weak hypothesis (or simply, a hypothesis) in place of a classifier. Let be the set of all feasible classifiers.
Let
be the correct classifier. We say that perfectly classifies if
We assume that a decision maker is endowed with , but is often significantly more complex than any element in . As a result, no can perfectly classify . An important question is whether a decision maker who is endowed with primitive can “learn” .
Definition 2.1.
Let be the set of all classifiers, and denote a subset of classifiers. A statistical procedure or algorithm is an onto function
where is a set of histories, is the set of feasible algorithms.
To incorporate learning, we need to allow a decision maker to observe data over time. We consider a dynamic interaction, where the algorithm can update its output in response to new data. Suppose that time is discrete: and is the data available at the beginning of time . is the classifier which algorithm generates from data . Let be the label assigned to observation by .
Let be the training period, when the decision maker updates in response to . The final hypothesis is the output at the end of the training period.
We focus on recursive ensemble algorithms for their simplicity.
Definition 2.2.
is an ensemble of if and such that
is a recursive algorithm, if is determined as a function of and events in time .
We require that an algorithm produce an accurate classifier as the final hypothesis and achieve the desired accuracy level with a reasonable amount of data.
Definition 2.3.
Algorithm is consistent if
Algorithm is efficient if such that
2.2. Adaptive Boosting Algorithm
Let us illustrate the Adaptive Boosting (AdaBoost) algorithm of \citeasnounSchapireandFreund12. Instead of the original formulation, we follow the framework of SAMME (\citeasnounZhuZouRossetandHastie09). Let us denote the algorithm as .
Initialize as the uniform distribution over . For , is the history at the beginning of time . Iterate the following steps.
(1) (calculation of forecasting error) Choose .
If , then stop the iteration and set . If not, continue.
(2) (calculating the weight)\footnote{The following formula is from equation (11) on page 353 of \citeasnounZhuZouRossetandHastie09, which becomes the coefficient of AdaBoost if the number of labels is two. This formula differs slightly from (c) on page 351.}
(2.1)
(3) (preparing for the next round)
(4) Reset as and return to the first step of selecting an optimal weak hypothesis.
(5) (final hypothesis)
If the number of labels is two, is the Adaptive Boosting (AdaBoost) algorithm of \citeasnounSchapireandFreund12.
The description of the algorithm is incomplete because if , the algorithm is not well defined.
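To make the iteration concrete, here is a minimal Python sketch of a SAMME-style training loop, written under our reading of steps (1)-(5) above. The representation of weak hypotheses as callables, the function names, and the exact form of the weight (the log((1-err)/err) term plus a log(K-1) correction, patterned on equation (11) of \citeasnounZhuZouRossetandHastie09) are our assumptions rather than a verbatim transcription of (2.1). The early exit when the weighted error is no better than that of a random guess corresponds to the caveat just mentioned.

```python
import numpy as np

def samme_train(X, y, hypotheses, K, T):
    """SAMME-style boosting sketch (our reading of the steps above).

    X : list of observations, y : array of correct labels in {0, ..., K-1},
    hypotheses : list of callables h(x) -> label, K : number of labels,
    T : length of the training period."""
    n = len(X)
    D = np.full(n, 1.0 / n)           # initialize: uniform distribution over samples
    ensemble = []                      # pairs (alpha_t, h_t)
    for t in range(T):
        # (1) pick the weak hypothesis with the smallest weighted error
        errors = [np.sum(D * (np.array([h(x) for x in X]) != y)) for h in hypotheses]
        h_t = hypotheses[int(np.argmin(errors))]
        eps = min(errors)
        if eps == 0.0:                 # perfect classification: stop the iteration
            return [(1.0, h_t)]
        if eps >= 1.0 - 1.0 / K:       # no better than a random guess: weight undefined
            break
        # (2) weight of the weak hypothesis (SAMME-style formula with log(K-1) term)
        alpha = np.log((1.0 - eps) / eps) + np.log(K - 1.0)
        ensemble.append((alpha, h_t))
        # (3) re-weight the samples and normalize for the next round
        preds = np.array([h_t(x) for x in X])
        D *= np.exp(alpha * (preds != y))
        D /= D.sum()
    return ensemble

def samme_predict(ensemble, x, K):
    # (5) final hypothesis: the label with the largest weighted vote
    votes = np.zeros(K)
    for alpha, h in ensemble:
        votes[h(x)] += alpha
    return int(np.argmax(votes))
```

When K = 2, the log(K - 1) term vanishes and the weight coincides with the usual AdaBoost coefficient up to a constant factor, which does not affect the weighted vote.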
Definition 2.4.
Let be the set of labels. A random guess is to predict each label with probability .
is meaningful only if performs better than a random guess.
(2.2)
which implies that . Note that is the score from the random guess.
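For concreteness (in our own notation, since the paper's symbols do not appear here): if $\mathcal{Y}$ denotes the label set, a uniform random guess $\hat{y}$ predicts each label with probability $1/|\mathcal{Y}|$, so its probability of matching the correct label $y^*(x)$ is
\[
\Pr\left(\hat{y} = y^*(x)\right) \;=\; \sum_{k \in \mathcal{Y}} \frac{1}{|\mathcal{Y}|}\,\mathbf{1}\{k = y^*(x)\} \;=\; \frac{1}{|\mathcal{Y}|},
\]
which is the benchmark score that a meaningful weak hypothesis must exceed.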
The weak learnability implies the existence of such . We state the condition according to \citeasnounZhuZouRossetandHastie09.
Definition 2.5.
A hypothesis class is weakly learnable if, for every probability distribution over observations and labels , we have:
(2.3)
for some .
The weak learnability is sufficient for the efficiency of if .
Theorem 2.6.
If is weakly learnable and , then is an efficient algorithm.
Proof.
See \citeasnounSchapireandFreund12. ∎
\citeasnounMukherjeeSchapire2013 pointed out that the notion of weak learnability of \citeasnounZhuZouRossetandHastie09 is too weak to guarantee the convergence of the algorithm to the correct label if .
Let us examine an example in \citeasnounMukherjeeSchapire2013.
Example 2.7.
, , . where and .
Since , the weak learnability is satisfied if the performance of a weak hypothesis is strictly better than
Let be a probability distribution over . The performance score of is
where the equality holds if . Thus, is weakly learnable. Any convex combination of and in assigns the same level to two observations. Since assigns a different label on a different observation, SAMME fails to produce a consistent forecast, although satisfies the weak learnability.
Example 2.7 reveals two issues we must address to extend AdaBoost from binary to multiple labels. First, to construct a boosting algorithm, it is necessary to strengthen the notion of weak learnability. Second, the set of weak hypotheses cannot be too restrictive. The issue identified in the previous example is that the requirement of outperforming a random guess by becomes less restrictive as the number of labels gets larger. Hence some distributions can satisfy (2.3) when the number of labels is large but fail when it is small. We propose to consider the minimal departure necessary to rule this out.
2.3. Iterative Weak Learnability
We strengthen the weak learnability by requiring that (2.3) holds for any subset of with more than a single element. Let be a non-empty subset of with . Define
as the collection of all weak hypotheses with labels from . Clearly, .
Definition 2.8.
is iteratively weakly learnable if with , such that for every distribution over observations and labels ,
(2.4)
Note that is a decreasing function of . As decreases, the requirement for weak learnability of becomes more stringent, as the minimum performance criterion (the right-hand side of (2.4)) increases.
In Example 2.7, is weakly learnable, but not iteratively weakly learnable. To see this, note that if , then .
However, the weak learnability over requires that the minimum performance be strictly larger than 0.
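As a heuristic illustration of Definition 2.8 (not part of the paper's formal development), the sketch below checks the subset condition on a finite sample for one fixed weighting of the observations: for every label subset with at least two elements, it asks whether some hypothesis restricted to that subset attains weighted accuracy at least 1/|subset| + gamma on the observations whose correct label lies in the subset. All names are ours, and a genuine verification of (2.4) would require the bound to hold for every distribution, not just one.

```python
from itertools import combinations
import numpy as np

def check_subsets(X, y, hypotheses, labels, weights, gamma):
    """Heuristic check of condition (2.4) for a single weighting of the sample.

    hypotheses : dict mapping frozenset(label subset) -> list of callables
                 h(x) -> label with values in that subset (our representation).
    Returns the subsets of size >= 2 for which no hypothesis attains weighted
    accuracy of at least 1/|subset| + gamma."""
    y = np.asarray(y)
    weights = np.asarray(weights, dtype=float)
    failures = []
    for r in range(2, len(labels) + 1):
        for subset in combinations(labels, r):
            subset = frozenset(subset)
            # restrict attention to observations whose correct label is in the subset
            mask = np.array([yi in subset for yi in y])
            if not mask.any():
                continue
            w = weights * mask
            benchmark = 1.0 / len(subset) + gamma
            best = 0.0
            for h in hypotheses.get(subset, []):
                preds = np.array([h(x) for x in X])
                best = max(best, float(np.sum(w * (preds == y)) / np.sum(w)))
            if best < benchmark:
                failures.append(subset)
    return failures
```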
2.4. Sufficiently Rich
A fundamental question is whether we can find a class of simple hypotheses which is weakly learnable. With the set of the simplest classifiers, we can do better than a random guess over any subset of with at least two elements.
Suppose that and are subsets of a vector space or a finite dimensional Euclidean space. Consider , which classifies according to the value of a linear function, thus called a linear classifier. Let be a hyperplane in : and such that
Define as the closed half space above :
Definition 2.9.
A single threshold (linear) classifier is a mapping
where such that
If is a single threshold classifier, , which may be strictly smaller than . On the other hand, the set of single threshold classifiers is closed under permutations: if is a bijection, then for all . Being closed under permutations turns out to be the key property to obtain weak learnability.
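A single threshold classifier can be coded as a decision stump over a hyperplane: one label on the closed half space above the hyperplane, another label below it. The parameterization below (a normal vector, an intercept, and the two output labels) is our reading of Definition 2.9 rather than the paper's exact notation; the permute method makes the closure-under-permutations property used in Lemma 2.10 explicit.

```python
import numpy as np

class SingleThresholdClassifier:
    """h(x) = label_above if <w, x> >= c, else label_below (our parameterization)."""

    def __init__(self, w, c, label_above, label_below):
        self.w = np.asarray(w, dtype=float)   # normal vector of the hyperplane
        self.c = float(c)                      # threshold (intercept)
        self.label_above = label_above
        self.label_below = label_below

    def __call__(self, x):
        return self.label_above if np.dot(self.w, x) >= self.c else self.label_below

    def permute(self, pi):
        """Apply a permutation pi (a dict over labels) to the output labels.
        The result is again a single threshold classifier, which is the
        closure-under-permutations property invoked in Lemma 2.10."""
        return SingleThresholdClassifier(self.w, self.c,
                                         pi[self.label_above], pi[self.label_below])
```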
Lemma 2.10.
Consider an environment where the image of has elements where . Any hypothesis class that is closed under permutations can do as well as random guessing.
Proof.
See Lemma A.1 in \citeasnounChoandLibgober2020. ∎
We use this result to show that the set of single-threshold classifiers is iteratively weakly learnable.
Corollary 2.11.
The set of single threshold classifiers is iteratively weakly learnable.
Proof.
, the set of single threshold classifiers
is closed under permutations, from which the conclusion follows. ∎
In Example 2.7, using single-threshold classifiers ensures that an extreme point of the convex hull of can always be labeled correctly, and the rest can at least do as well as a random guess. The analogous result is known to hold for the case of . Interestingly, the critical property in generalizing to the multi-class case is not changing the set of classifiers but instead allowing the labels to be sufficiently rich.
3. Construction of Algorithm
The algorithm, which we call , iterates over two loops: within an epoch (inner loop) and across epochs (outer loop). is the minimum length of an epoch, which we increase to improve the accuracy of the final hypothesis of the algorithm. The size of each epoch is a random variable . The number of epochs is also a random variable. While the total number of periods to generate the final hypothesis is a random variable , we show that as becomes large, the total number of periods is bounded by with probability close to 1.
3.1. Preliminaries
The algorithm is run in discrete period within epoch . Each epoch lasts at least periods, but the terminal round of an epoch is random. At the start of an epoch, we set the period to and re-start the procedure.
It is convenient to enumerate the elements in as the first positive integers, highlighting the fact that the weak learnability has little to do with the name of the labels but relies solely on the ordinal properties of . Let be a permutation over :
Let be the collection of the first positive integers which is the set of feasible labels at the beginning of epoch with . Let be a permutation over .
3.2. Within Epoch
At the beginning of epoch , and are given. Set . Let be the uniform distribution over . The output of each stage consists of three key components: an artificial probability distribution over , a threshold rule in step , and a positive weight .
Suppose that is defined . Choose by solving
(3.5)
We call an optimal classifier. Define
(3.6)
as the probability that the optimal classifier at round misclassifies under . If , then we stop the training and output as the final hypothesis, which perfectly forecasts .
Suppose that . Define
(3.7)
Note that the only difference between (3.7) and (2.1) is the coefficient . If , our algorithm reduces to AdaBoost.
Define for each , and each pair ,
where
Given , we can recursively define and .
Define ,
and
Note that
(3.8)
Therefore, ,
(3.9)
if and only if
(3.10)
We use to predict , instead of . Since is a linear combination of , our algorithm is an ensemble algorithm. Since is calculated recursively within an epoch, and the change between two epochs depends only on the final outcome from the previous epoch, we regard our algorithm as a recursive algorithm.
Except for the case of , (3.9) does not guarantee that with probability close to 1, as . \citeasnounMukherjeeSchapire2013 argues that the failure is due to the fact that the weak learnability condition (2.3) is too weak, and proposes the stronger Edge-over-Random (EOR) criterion.
We take a different tack. We strengthen (2.3) to (2.4) by requiring the same condition for any subset of with at least two elements. Instead of solving (3.10), we eliminate which is not an element of . A critical step is to show that survives the elimination process with a probability close to 1. Since , we can “solve” (3.10) in a finite number of iterations. Roughly speaking, (2.3) may be too weak to solve (3.9) directly, but is sufficiently strong to solve the equivalent problem (3.10) indirectly, if (2.3) holds for any subset with at least two elements.
Recall that is the minimum number of periods in epoch , which is a parameter of the algorithm. Define
as the first period in epoch when each has a label where . Epoch terminates at period.
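The within-epoch loop can be sketched as follows. The accumulation of per-label scores and the stopping rule (run at least the minimum number of periods and stop once every observation has some label with a negative score) follow our reading of (3.5)-(3.10) and of the termination rule just defined; the centered vote and the exact constant in the weight formula are assumptions patterned on SAMME, not a verbatim transcription of (3.7).

```python
import numpy as np

def run_epoch(X, y, hypotheses, labels, T_min):
    """One epoch of the elimination algorithm (a sketch under our assumptions).

    Returns the accumulated score F[i][k] for each observation i and label k.
    Labels with F[i][k] < 0 become candidates for elimination (Section 3.2)."""
    n, k_n = len(X), len(labels)
    D = np.full(n, 1.0 / n)                          # uniform distribution over samples
    F = {i: {k: 0.0 for k in labels} for i in range(n)}
    t = 0
    while True:
        t += 1
        # choose the optimal weak hypothesis under D, as in (3.5)
        errors = [np.sum(D * (np.array([h(x) for x in X]) != y)) for h in hypotheses]
        h_t = hypotheses[int(np.argmin(errors))]
        eps = min(errors)
        if eps == 0.0:
            break                                     # perfect forecast: stop training
        # weight with a log(k_n - 1) correction (assumed SAMME-style form of (3.7))
        alpha = 0.5 * np.log((1.0 - eps) / eps) + np.log(k_n - 1.0)
        preds = np.array([h_t(x) for x in X])
        # accumulate centered votes: positive evidence for the predicted label
        for i in range(n):
            for k in labels:
                F[i][k] += alpha * ((1.0 if preds[i] == k else 0.0) - 1.0 / k_n)
        # re-weight the samples for the next round
        D *= np.exp(alpha * (preds != y))
        D /= D.sum()
        # the epoch ends once t >= T_min and every observation has a negative-score label
        if t >= T_min and all(min(F[i].values()) < 0 for i in range(n)):
            break
    return F
```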
Before we start epoch , we need to update and . Roughly speaking, is obtained by eliminating with . Because the label with can vary across different , we need additional work. First, we need to eliminate the same number of labels from for each , and then permute the remaining elements so that we can “homogenize” the set of labels for each .
Define
as the set of labels in with a negative value of , which are the candidates for elimination. Because can vary across , we cannot eliminate all elements in from if we want to keep the number of surviving elements the same across . Define
which is at least 1, by the definition of . Recall that each element in is a positive integer. Instead of eliminating all elements in , we eliminate elements from . Recall that
We can write
For convenience, we select the last elements in to construct :
By the construction of ,
(3.11)
The remaining step is to re-label each element in so that each has the same set of labels.
We arrange the elements in according to the index’s order. For each , define a permutation
where
and let . Given ,
Since is a bijection over , exists. The correct label in epoch is
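The bookkeeping between epochs, eliminating the same number of labels from every observation and re-labeling the survivors so that all observations share the label set {1, ..., k_{n+1}}, can be sketched as follows; the data structures and names are ours.

```python
def eliminate_and_relabel(F, labels):
    """Elimination and relabeling step between epochs (a sketch; names are ours).

    F : dict F[i][k] of accumulated scores from the epoch,
    labels : list of current labels 1, ..., k_n (positive integers).
    Returns the new label list 1, ..., k_{n+1} and, for each observation,
    the permutation mapping surviving old labels to the new labels."""
    k_n = len(labels)
    # E(x): labels with a negative score, the candidates for elimination
    E = {i: sorted(k for k in labels if F[i][k] < 0) for i in F}
    # eliminate the same number of labels from every observation
    m = min(len(E[i]) for i in F)            # at least 1 by the epoch's stopping rule
    k_next = k_n - m
    new_labels = list(range(1, k_next + 1))
    sigmas = {}
    for i in F:
        eliminated = set(E[i][-m:])          # drop the last m candidates (by index order)
        survivors = [k for k in labels if k not in eliminated]
        # permutation: j-th surviving old label -> new label j
        sigmas[i] = {old: new for new, old in enumerate(survivors, start=1)}
    return new_labels, sigmas
```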
3.3. Iteration over epoch
Replace and by and . Start the entire process described in Section 3.2. The iteration over epoch stops at the first where
(3.12)
Define
as the index of the epoch at which the algorithm stops. Since each epoch lasts periods, the total number of periods to train the algorithm is
which is a random variable.
Let us call the constructed algorithm because the algorithm is parameterized by the minimum length of each epoch . The element in is the source of the final hypothesis of . Since we know and , the final hypothesis of is
We are interested in .
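Finally, a short sketch (again in our own notation) of how the surviving label is mapped back to an original label by inverting the per-epoch permutations in reverse order, as described above:

```python
def recover_label(sigmas_for_x, surviving_label):
    """Undo the per-epoch relabelings for a single observation x (a sketch).

    sigmas_for_x : list of permutations (dicts old_label -> new_label), one per epoch,
    surviving_label : the single label left when the iteration over epochs stops.
    The original label is recovered by applying the inverse permutations
    in reverse order."""
    label = surviving_label
    for sigma in reversed(sigmas_for_x):
        inverse = {new: old for old, new in sigma.items()}
        label = inverse[label]
    return label
```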
Theorem 3.1.
Assume that and , and that is iteratively weakly learnable.
(1)
(2) and such that , with probability at least ,
(a)
(b)
where .
Proof.
See Appendix A. ∎
4. Generalization Error
Our last result shows how our approach lends itself to a simple formulation of the generalization error. Let be the sample space generated by samples and be the probability measure over the sample space. Theorem 3.1 proved that the training error vanishes at an exponential rate:
where is the length of an epoch and is the number of epochs. We know that
We can show that the generalization error of vanishes as . Let be the population probability distribution.
Theorem 4.1.
,
Proof.
See Appendix B. ∎
5. Conclusion
As is built on the framework of SAMME, the algorithm inherits the same statistical foundation as SAMME and is easier to motivate than other extensions of AdaBoost. Some extensions of AdaBoost (e.g., \citeasnounMukherjeeSchapire2013) may have to “tune parameters” of the algorithm to achieve efficiency. The selection of a proper parameter of the algorithm requires information about . In contrast, the construction of requires no knowledge about , other than the fact that the set of weak hypotheses satisfies iterative weak learnability.
Closure under permutations is straightforward to check for a hypothesis class, making it easy to verify iterative weak learnability. This allowed us to exhibit the algorithm straightforwardly, without needing to check that the algorithm can outperform random guessing for a rich set of possible cost functions.
Appendix A Proof of Theorem 3.1
A.1. Uniform Bound for
We show
Recall that , is maximizing
If such that , then cannot be an optimal weak hypothesis. Suppose that . The decision maker can increase the performance by choosing from . Thus, , .
We have , ,
since . Every is eliminated at the end of the first epoch. Since , the number of remaining elements after the first epoch is no more than .
A.2. Within Epoch
We show that if the minimum length of an epoch is sufficiently large, then with a probability close to 1. That is, by the time when the algorithm reaches the scheduled end of th epoch, such that . We calculate the lower bound of the probability. There exists such that the probability is bounded from below by .
Recall that ,
and
Thus,
Iterative weak learnability implies that , so that
for some . The weak learnability implies that .
Lemma A.1.
Proof.
Note that
(A.13)
From the updating formula of ,
(A.14)
(A.15)
By summing over , we have
Note , ,
Lemma A.2.
is a strictly concave function of .
(A.16)
and
(A.17)
Proof.
and are strictly concave functions over . is the sum of two strictly concave functions. Thus, is strictly concave.
(A.16) follows from a simple calculation.
The calculation of is a little more involved.
∎
Since is weakly learnable, such that
Weak learnability implies that
Since is strictly concave, . Thus, such that ,
Thus,
for some . Thus,
where
(A.18)
∎
By the definition, , ,
(A.19)
By (A.19), , if , then such that
Lemma A.1 implies
with probability at least . Hence, with probability at least .
As Lemma A.1 indicates, if , then almost surely as . If , then (A.19) implies that exactly one label is associated with a positive value of . Thus, searching for with is the proper way to identify a correct label. However, if , then the positive sign of does not imply that . Nevertheless, if , then Lemma A.1 implies that almost surely as . The idea of is to eliminate with in each epoch, until a single element is left.
A.3. Across Epochs
We construct by eliminating some elements from and permuting the remaining elements according to . With probability , the true label survives the elimination from , and is included in , permuted according to .
Since we eliminate at least one element from , the total number of epochs cannot be larger than . Let be the epoch when is a singleton. With probability
the remaining label is the real label which has been permuted by a series of . We can recover the real label
Define
Then, for a sufficiently large ,
which completes the proof.
Appendix B Proof of Theorem 4.1
The proof follows the same logic as the proof of Theorem 5.1 on page 98 of \citeasnounSchapireandFreund12. We need to change the notation to accommodate multiple classes . For reference, we sketch the proof by following the original proof of Theorem 5.1 in \citeasnounSchapireandFreund12, pointing out the places where a change is needed and referring the rest to \citeasnounSchapireandFreund12.
B.1. Preliminaries
Recall that ,
Define
and
We drop subscript because we focus on and write instead
Recall that is the collection of feasible weak hypotheses, whose generic element is . Define
For a fixed , define
Let be the collection of randomly selected elements from and
where
If
then is eliminated from in -th epoch , which implies
Thus,
The last equality stems from the facts that the samples are independently drawn over periods, and that we re-initialize the algorithm at the beginning of each epoch.
We only show that in the first epoch
where is the collection of all samples. The proof for the remaining epochs follows the same logic. Since , we then have
B.2. Sketch of Proof
Lemma B.1.
(Re-statement of Lemma 5.2 in \citeasnounSchapireandFreund12) ,
(B.20)
Proof.
Note that the only change from the original statement in \citeasnounSchapireandFreund12 is , which needs to be scaled to accommodate the case . Note that . We need to scale into a random variable in and modify accordingly. The remainder of the proof is identical to the original proof of Lemma 5.2 in \citeasnounSchapireandFreund12. ∎
Lemma 5.3 of \citeasnounSchapireandFreund12 is not needed. The proof of Lemma 5.4 of \citeasnounSchapireandFreund12 applies to our case without modifying the parameter, from which the conclusion of Theorem 4.1 follows.
References
- [1] \harvarditem[Cho and Libgober]Cho and Libgober2020ChoandLibgober2020 Cho, I.-K., and J. Libgober (2020): “Machine Learning for Strategic Inference,” Emory University and University of Southern California.
- [2] \harvarditem[Friedman, Hastie, and Tibshirani]Friedman, Hastie, and Tibshirani2000FriedmanHestieandTibshirani00 Friedman, J., T. Hastie, and R. Tibshirani (2000): “Additive Logistic Regression: A Statistical View of Boosting,” The Annals of Statistics, 28(2), 337–407.
- [3] \harvarditem[Mukherjee and Schapire]Mukherjee and Schapire2013MukherjeeSchapire2013 Mukherjee, I., and R. E. Schapire (2013): “A Theory of Multiclass Boosting,” Journal of Machine Learning Research, 14, 437–497.
- [4] \harvarditem[Schapire and Freund]Schapire and Freund2012SchapireandFreund12 Schapire, R. E., and Y. Freund (2012): Boosting: Foundations and Algorithms. MIT Press.
- [5] \harvarditem[Schapire and Singer]Schapire and Singer1999SchapireandSinger99 Schapire, R. E., and Y. Singer (1999): “Improved Boosting Algorithms Using Confidence-Rated Predictions,” Machine Learning, 37, 297–336.
- [6] \harvarditem[Tversky]Tversky1972Tversky72 Tversky, A. (1972): “Elimination by aspects: A theory of choice,” Psychological Review, 79(4), 281–299.
- [7] \harvarditem[Zhu, Zou, Rosset, and Hastie]Zhu, Zou, Rosset, and Hastie2009ZhuZouRossetandHastie09 Zhu, J., H. Zou, S. Rosset, and T. Hastie (2009): “Multi-class AdaBoost,” Statistics and Its Interface, 2, 349–360.