
Achieving More with Less: A Tensor-Optimization-Powered Ensemble Method

Jinghui Yuan1,2, Weijin Jiang3, Zhe Cao1, Fangyuan Xie1,2
Rong Wang1*, Feiping Nie1,2, Yuan Yuan1
*Corresponding author
Abstract

Ensemble learning is a method that leverages weak learners to produce a strong learner. However, obtaining a large number of base learners requires substantial time and computational resources. It is therefore worthwhile to study how to match the performance typically obtained with many base learners using only a few. We argue that achieving this requires enhancing both classification performance and generalization ability during the ensemble process. To increase model accuracy, each weak base learner must be integrated more efficiently. We observe that different base learners exhibit varying levels of accuracy on different classes. To capitalize on this, we introduce a confidence tensor $\tilde{\mathbf{\Theta}}$, where $\tilde{\mathbf{\Theta}}_{rst}$ signifies the degree of confidence that the $t$-th base classifier assigns a sample to class $r$ while it actually belongs to class $s$. To the best of our knowledge, this is the first time an evaluation of the performance of base classifiers across individual classes has been proposed. The proposed confidence tensor lets the strengths and weaknesses of the base classifiers on different classes compensate for one another, enabling the method to achieve superior results with a smaller number of base learners. To enhance generalization performance, we design a smooth and convex objective function that leverages the concept of margin, making the strong learner more discriminative. Furthermore, we prove that in the gradient matrix of the loss function the elements of each column sum to zero, which allows us to solve a constrained optimization problem using gradient-based methods. We then compare our algorithm with random forests ten times its size and with other classical methods across numerous datasets, demonstrating the superiority of our approach. Finally, we discuss the reasons for the success of our algorithm and point out that it extends naturally from bagging to stacking.

Introduction

Ensemble learning is a method that constructs a strong learner from weak learners. To date, ensemble learning has achieved success in many fields, such as: (Zhou and Feng 2019) apply the concept of ensemble learning to deep ensemble methods, (Liu et al. 2015) combine deep learning with spectral clustering, (Wood et al. 2023) attempt to explain ensemble learning from the diversity perspective, (Shi et al. 2022) consider multi-view ensemble learning, and (Cao et al. 2020) introduce ensemble learning into bioinformatics for interdisciplinary research. In summary, ensemble learning is a very important concept in the field of artificial intelligence.

However, ensemble learning often requires integrating a large number of base learners. For example, (Chen and Guestrin 2016) design XGBoost using the boosting idea, and (Sun et al. 2024) consider reducing the correlation among random forests to improve their performance. Nevertheless, all these ensemble methods have relatively high time complexity. It is worthwhile to investigate how to achieve the effects of a large ensemble using only a few base learners.

We believe that to achieve this goal, it is necessary to adjust the confidences during the ensemble process. On one hand, the accuracy of the combined classifiers should be as high as possible, and on the other hand, generalization should be as good as possible.

To increase the accuracy of the ensemble model, we need each base learner to be integrated more efficiently. We observe that different base learners have varying accuracy for different classes. For example, one learner may be very accurate in classifying classes 1 and 2 but less accurate for class 3, while another learner may be accurate for classes 2 and 3 but less accurate for class 1. Based on this, we propose a learnable confidence tensor $\tilde{\mathbf{\Theta}}$, where $\tilde{\mathbf{\Theta}}_{rst}$ represents the probability that the $t$-th base classifier classifies an instance as class $r$ when it actually belongs to class $s$. This concept is analogous to the confusion matrix (Townsend 1971). To better represent the tensor, we use a technique similar to that of (Ferrara, Grillo, and Gatto 1973) to unfold the tensor and prove that it has favorable properties, which assist us in solving the constrained optimization problem (Boyd and Vandenberghe 2004).

Another important issue is how to improve the generalization of the ensemble. We introduce the concept of margin into multi-class optimization because (Zhou 2014) points out that margin is a key factor in suppressing overfitting in AdaBoost. Additionally, (Nie, Hao, and Wang 2024) achieve good results by considering margin in multi-class support vector machines. (Chen et al. 2023) propose the ODE theory using the margin concept, demonstrating excellent performance across multiple datasets.

There has been much discussion on how to introduce the margin. For instance, (Liu et al. 2024) define the margin using the $l_{0/1}$ soft-margin loss, and (Li et al. 2021) propose the class margin. However, the margins obtained by these methods are often non-smooth or non-convex, which makes them relatively difficult to optimize (Khaled et al. 2023).

To solve this problem, we use the log-sum-exp technique (Calafiore, Gaubert, and Possieri 2019) to derive a loss function that is smooth and enjoys a certain convexity. Moreover, we prove that the columns of the loss gradient sum to zero, which, given a sufficiently good initialization, allows confidences to be added adaptively to each base learner and the problem to be solved efficiently with gradient-based methods (Zhao et al. 2024).

We compare the proposed algorithm with a random forest that has ten times the number of parameters (Biau and Scornet 2016), as well as with classical boosting algorithms and support vector machine methods. Experiments demonstrate that we have achieved our goal, outperforming a large number of base learners using only a few.

Finally, we conduct an in-depth discussion, analyzing the relationship between our algorithm and dropout (Liu et al. 2023) and parameter freezing (Wimmer, Mehnert, and Condurache 2023). We discuss and analyze why our algorithm performs better. Additionally, we point out that this algorithm can not only be used as bagging (Ngo, Beard, and Chandra 2022) but can also be naturally extended to stacking (Zounemat-Kermani et al. 2021).

We summarize our main contributions below:

  • We propose a learnable tensor $\tilde{\mathbf{\Theta}}$, which encapsulates the confidences of each base learner for different classes. We design a loss function based on the margin concept, which is smooth and partially convex.

  • We prove that the loss function has the desirable property of having a gradient column sum of zero. This allows us to solve the proposed optimization problem with linear constraints using gradient-based methods, and we design an algorithm for this purpose.

  • We conduct extensive comparative experiments. Under the same conditions, the ensemble of a small number of base learners outperforms a random forest with ten times the number of base learners, as well as other classical algorithms, validating our method’s superiority.

  • We further discuss the relationship between our algorithm and random forests, explaining that random forests can be seen as a result of our algorithm after applying dropout. We also elucidate why our algorithm performs well and how it can be naturally extended to stacking methods.

Methodology

In this section, we will introduce the symbols used in the paper, the construction of the loss function, and the specific optimization model employed.

Notations

Assume we are addressing a $c$-class classification problem, where $x_{i}$ denotes the $i$-th of $n$ samples and the ensemble is trained with $k$ base classifiers $G_{1},\dots,G_{k}$. Each base classifier takes $x_{i}$ as input and outputs a $c$-dimensional one-hot indicator column vector $G_{j}(x_{i})=(0,\dots,1,\dots,0)^{T}\in\mathbb{R}^{c\times 1}$. The true label of the $i$-th sample, whose class is $m_{i}$, is a one-hot encoded vector $y_{i}\in\mathbb{R}^{c\times 1}$; stacking these labels column-wise forms the indicator matrix $Y\in\mathbb{R}^{c\times n}$. Let $g_{i}=(G_{1}^{T}(x_{i}),\dots,G_{k}^{T}(x_{i}))^{T}\in\mathbb{R}^{kc\times 1}$ be the concatenated predictions for $x_{i}$; combining all $g_{i}$ gives the matrix $G=(g_{1},\dots,g_{n})\in\mathbb{R}^{kc\times n}$.
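To make the notation concrete, here is a minimal NumPy sketch that builds $g_{i}$ and $G$ from one-hot base-classifier outputs; it is a sketch only, assuming `classifiers` is a list of fitted models with a scikit-learn-style `predict` returning labels in $0,\dots,c-1$:

```python
import numpy as np

def build_G(classifiers, X, c):
    """Stack one-hot predictions into G = (g_1, ..., g_n) of shape (k*c, n).

    classifiers : list of k fitted models whose predict() returns labels 0..c-1
    X           : (n, d) data matrix
    """
    # Each block is the (c, n) matrix of one-hot outputs of one classifier.
    blocks = [np.eye(c)[clf.predict(X)].T for clf in classifiers]
    return np.vstack(blocks)  # column i is g_i = (G_1(x_i)^T, ..., G_k(x_i)^T)^T
```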

Introduction of Margin

Margin represents the distance from data points to the decision boundary. First, let us introduce how predictions are made using the confidence tensor $\tilde{\mathbf{\Theta}}$. The tensor $\tilde{\mathbf{\Theta}}$ has three modes of sizes $c$, $c$, and $k$. Here, $\tilde{\mathbf{\Theta}}_{rst}$ signifies the confidence that the $t$-th base classifier assigns a sample to class $r$ while it actually belongs to class $s$. The confidences for all classes are collected in $\tilde{\mathbf{\Theta}}$. However, tensors are not straightforward to handle in matrix computations. Therefore, we unfold it into a matrix $\Theta\in\mathbb{R}^{c\times kc}$ by forward slicing, as illustrated in Figure 1.

Figure 1: Expansion diagram of $\tilde{\mathbf{\Theta}}$

Define $\mathcal{S}$ as the softmax function. When $\mathcal{S}$ is applied to a column vector, it performs softmax normalization on that vector; when applied to a matrix, it applies the softmax to each column. Based on this, the prediction for the $i$-th sample can be expressed as

\hat{y}_{i}=\underset{j=1,\dots,c}{\arg\max}\ \left(\mathcal{S}\left(\Theta g_{i}\right)\right)_{j} \qquad (1)

A suitable definition of the margin for multi-class classification is the difference between the predicted value at the true-label position and the second-largest predicted value. This can be expressed as

Y_{i}^{T}(Y_{i}\odot\mathcal{S}(\Theta g_{i}))-\max_{2}(\mathcal{S}\left(\Theta g_{i}\right)) \qquad (2)

where $\max_{2}(v)$ denotes the second-largest element of vector $v$, $Y_{i}$ is the $i$-th column of $Y$, and $\odot$ denotes the Hadamard (element-wise) product. This definition is reasonable, as it reflects the confidence in classifying into the true class.
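As a sketch of Eqs. (1) and (2), assuming `Theta`, the stacked one-hot vector `g_i`, and the one-hot label `Y_i` are available as NumPy arrays:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # numerically stable softmax
    return e / e.sum()

def predict_and_margin(Theta, g_i, Y_i):
    p = softmax(Theta @ g_i)           # S(Theta g_i)
    y_hat = int(np.argmax(p))          # Eq. (1): predicted class
    true_score = Y_i @ (Y_i * p)       # value at the true-label position
    second = np.sort(p)[-2]            # max_2: second-largest predicted value
    return y_hat, true_score - second  # Eq. (2): raw (non-smooth) margin
```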

However, both the maximum function and the $\max_{2}$ function are non-smooth, and non-smooth functions can significantly affect the convergence behavior and speed of the algorithm. A reasonable solution is to approximate these functions using the log-sum-exp function. For a vector $v$, by choosing a suitable $\alpha$, the log-sum-exp function can effectively approximate the largest element of $v$. Its expression is given in Eq. (3).

f(v)=\frac{1}{\alpha}\log\left(\sum_{j=1}^{c}e^{\alpha v_{j}}\right) \qquad (3)

The choice of the parameter $\alpha$ is straightforward and does not significantly affect the degree of approximation (Calafiore, Gaubert, and Possieri 2020). Through extensive experiments, we found that setting $\alpha$ to 10 already extracts the maximum value with considerable accuracy.
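A quick numeric check of Eq. (3) with $\alpha=10$ (a sketch; the shift by the maximum keeps the exponentials stable):

```python
import numpy as np

def logsumexp_max(v, alpha=10.0):
    # Smooth approximation of max(v) via Eq. (3).
    m = v.max()
    return m + np.log(np.exp(alpha * (v - m)).sum()) / alpha

v = np.array([0.1, 0.7, 0.2])
print(v.max(), logsumexp_max(v))  # 0.7 vs. about 0.701
```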

Using the log-sum-exp function, the $\max_{2}$ function can be easily represented. To extract the second-largest value, we only need to set the maximum position to 0 and then retrieve the maximum value again. We assume that the predicted value at the true class is the largest, which is reasonable, and we will explain this later. This can be expressed using the following formula:

\max_{2}(\mathcal{S}\left(\Theta g_{i}\right))=\frac{1}{\alpha}\log\left(\sum_{j=1}^{c}e^{\alpha(\mathcal{S}\left(\Theta g_{i}\right)-Y_{i}\odot\mathcal{S}\left(\Theta g_{i}\right))_{j}}\right) \qquad (4)

Thus, the margin contributed by the $i$-th sample can be expressed as Eq. (5),

Y_{i}^{T}(Y_{i}\odot\mathcal{S}\left(\Theta g_{i}\right))-\frac{1}{\alpha}\log\left(\sum_{j=1}^{c}e^{\alpha(\mathcal{S}\left(\Theta g_{i}\right)-Y_{i}\odot\mathcal{S}\left(\Theta g_{i}\right))_{j}}\right) \qquad (5)

and the total margin can be expressed as:

\mathcal{M}=\sum_{i=1}^{n}Y_{i}^{T}(Y_{i}\odot\mathcal{S}\left(\Theta g_{i}\right))-\sum_{i=1}^{n}\frac{1}{\alpha}\log\left(\sum_{j=1}^{c}e^{\alpha(\mathcal{S}\left(\Theta g_{i}\right)-Y_{i}\odot\mathcal{S}\left(\Theta g_{i}\right))_{j}}\right) \qquad (6)

Introduction of Loss Function

Now let us return to the assumption made in Eq. (4). In the margin calculation above, the actual $\max_{2}(\mathcal{S}(\Theta g_{i}))$ should be the maximum of

\mathcal{S}\left(\Theta g_{i}\right)-Y^{\mu}\odot\max(\mathcal{S}\left(\Theta g_{i}\right)) \qquad (7)

where $\mu$ is $\underset{j=1,\dots,c}{\arg\max}\,(\mathcal{S}(\Theta g_{i}))_{j}$ and $Y^{\mu}$ denotes the column vector whose $\mu$-th element is 1 and all others are 0. Instead, we compute the maximum of

\mathcal{S}\left(\Theta g_{i}\right)-Y_{i}\odot\mathcal{S}\left(\Theta g_{i}\right) \qquad (8)

This is because subtracting the linear term results in a margin function with better smoothness and convexity. In fact, we make the following assumption:

\max(\mathcal{S}\left(\Theta g_{i}\right))\approx\|Y_{i}\odot\mathcal{S}(\Theta g_{i})\| \qquad (9)

This essentially assumes correct classification. To ensure this, we need to incorporate a penalty for misclassification into the objective. Cross-entropy can effectively measure accuracy in the classification process. Therefore, we add the following accuracy term:

\mathcal{C}=-\sum_{i=1}^{n}Y_{i}^{T}\left(Y_{i}\odot\log(\mathcal{S}\left(\Theta g_{i}\right))\right) \qquad (10)

The final optimization objective is the cross-entropy minus the weighted margin:

\mathcal{L}=\mathcal{C}-\gamma\mathcal{M} \qquad (11)

where $\gamma$ is a trade-off hyperparameter. A larger $\gamma$ indicates a greater preference for higher classification confidence and thus a larger margin. In particular, when $\gamma=0$ the objective reverts to the standard cross-entropy loss used for classification; the cross-entropy loss is therefore a special case of our loss function.
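Putting Eqs. (6), (10), and (11) together, here is a PyTorch sketch of the batched loss; the function name and the $10^{-12}$ stabilizer inside the logarithm are our additions:

```python
import torch

def loss_fn(Theta, G, Y, gamma=5.0, alpha=10.0):
    """L = C - gamma * M over a batch; Theta: (c, k*c), G: (k*c, n), Y: (c, n)."""
    P = torch.softmax(Theta @ G, dim=0)           # S(Theta g_i), column-wise
    true_p = (Y * P).sum(dim=0)                   # prediction at the true label
    # Smoothed max_2 of Eq. (4): zero the true-label entry, then log-sum-exp.
    second = torch.logsumexp(alpha * (P - Y * P), dim=0) / alpha
    margin = (true_p - second).sum()              # total margin M, Eq. (6)
    ce = -(Y * torch.log(P + 1e-12)).sum()        # cross-entropy C, Eq. (10)
    return ce - gamma * margin                    # Eq. (11)
```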

Introduction of Optimization Problem

In this section, we introduce the final optimization problem. In practice, the accuracy of each base classifier varies, so the confidences of the base classifiers should naturally differ. Denote the accuracy of the $i$-th base classifier by $w_{i}$, and let $w=(w_{1},\dots,w_{k})^{T}$; $\tilde{w}$ expands $w$ to $kc$ dimensions by repeating each entry $c$ times:

\tilde{w}=(w_{1},\dots,w_{1},\;w_{2},\dots,w_{2},\;\dots,\;w_{k},\dots,w_{k})^{T} \qquad (12)

Assigning confidences to each classifier can be seen as imposing a constraint on the columns of $\Theta$, namely $\Theta^{T}\mathbf{1}=\tilde{w}$.
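The expansion in Eq. (12) is a single `np.repeat` (a sketch with illustrative accuracies):

```python
import numpy as np

w = np.array([0.90, 0.85, 0.95])  # accuracies of k = 3 base classifiers
c = 4                             # number of classes
w_tilde = np.repeat(w, c)         # Eq. (12): each w_t repeated c times, length k*c
```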

In summary, the final optimization problem can be formulated as:

\underset{\Theta}{\min}\;\mathcal{L}\quad\text{s.t.}\quad\Theta^{T}\mathbf{1}=\tilde{w} \qquad (13)

and the loss function $\mathcal{L}$ is

\mathcal{L}=\mathcal{C}-\gamma\mathcal{M}=-\sum_{i=1}^{n}\left(Y_{i}^{T}\log(\mathcal{S}\left(\Theta g_{i}\right))+\gamma Y_{i}^{T}\mathcal{S}\left(\Theta g_{i}\right)\right)+\gamma\sum_{i=1}^{n}\frac{1}{\alpha}\log\left(\sum_{j=1}^{c}e^{\alpha(\mathcal{S}\left(\Theta g_{i}\right)-Y_{i}\odot\mathcal{S}\left(\Theta g_{i}\right))_{j}}\right) \qquad (14)

Optimization Algorithm

The convexity of the optimization problem is crucial for solving it. Performing the softmax operation inevitably transforms the convex function into a non-convex one. Fortunately, apart from the softmax operation, our problem remains convex. In other words, we have the following theorem.

Theorem 1.

The loss function $\mathcal{L}$ is a convex function with respect to $\mathcal{S}(\Theta g_{i})$.

Proof

Assume the matrix $D$ has a single non-zero element $D_{m_{i}m_{i}}=1$. Then $Y_{i}\odot\mathcal{S}(\Theta g_{i})$ can be seen as a linear mapping of $\mathcal{S}(\Theta g_{i})$, i.e., $D\,\mathcal{S}(\Theta g_{i})$, so that

\mathcal{S}(\Theta g_{i})-Y_{i}\odot\mathcal{S}(\Theta g_{i})=(I-D)\,\mathcal{S}(\Theta g_{i}) \qquad (15)

As is well known, if $f(x)$ is convex, then $f(Ax+b)$ is also convex. The log-sum-exp function is a well-known convex function, and the other two terms are evidently convex as well. A non-negative sum of convex functions is convex, which completes the proof.

This favorable convexity allows gradient methods to quickly reach a local or global minimum. The key to solving the above optimization problem is how to handle the constraint $\Theta^{T}\mathbf{1}=\tilde{w}$. Through the following theorem, the constraint can be satisfied naturally by a simple initialization.

Theorem 2.

Let $\nabla_{kl}\mathcal{L}$ denote the $(k,l)$-th element of the gradient of $\mathcal{L}$ with respect to the matrix $\Theta$, i.e., $\left(\frac{\partial\mathcal{L}}{\partial\Theta}\right)_{kl}$. Then $\sum_{k=1}^{c}\nabla_{kl}\mathcal{L}=0$.

Proof

Without loss of generality, we assume there is only one sample, belonging to the $m$-th class; this corresponds to the $m$-th entry of $Y$ being 1 and all others being 0. The summation then disappears, and the loss function can be expressed as:

\mathcal{L}=\mathcal{L}_{1}+\mathcal{L}_{2}+\mathcal{L}_{3}=-Y^{T}\log(\mathcal{S}\left(\Theta g\right))-\gamma Y^{T}\mathcal{S}\left(\Theta g\right)+\gamma\frac{1}{\alpha}\log\left(\sum_{j=1}^{c}e^{\alpha(\mathcal{S}\left(\Theta g\right)-Y\odot\mathcal{S}\left(\Theta g\right))_{j}}\right) \qquad (16)

Let $\delta_{ij}$ denote the Kronecker delta, which is 1 when $i=j$ and 0 otherwise.

According to the chain rule, it is not difficult to verify that the derivative of the $j$-th entry of $\mathcal{S}(\Theta g)$ with respect to $\Theta_{kl}$ satisfies:

\frac{\partial\mathcal{S}(\Theta g)_{j}}{\partial\Theta_{kl}}=\mathcal{S}(\Theta g)_{j}\left(g_{l}\delta_{jk}-\mathcal{S}(\Theta g)_{k}g_{l}\right) \qquad (17)

Therefore, according to Eq. (17), the derivative of the first term $\mathcal{L}_{1}=-Y^{T}\log(\mathcal{S}(\Theta g))$ with respect to $\Theta_{kl}$ is

\frac{\partial\mathcal{L}_{1}}{\partial\Theta_{kl}}=-\frac{\partial\log(\mathcal{S}(\Theta g))_{m}}{\partial\Theta_{kl}}=-\frac{1}{\mathcal{S}(\Theta g)_{m}}\mathcal{S}(\Theta g)_{m}\left(g_{l}\delta_{mk}-\mathcal{S}(\Theta g)_{k}g_{l}\right)=-\left(g_{l}\delta_{mk}-\mathcal{S}(\Theta g)_{k}g_{l}\right) \qquad (18)

Similarly, the partial derivative of the second term with respect to $\Theta_{kl}$ is

\frac{\partial\mathcal{L}_{2}}{\partial\Theta_{kl}}=-\gamma\,\mathcal{S}(\Theta g)_{m}\left(g_{l}\delta_{mk}-\mathcal{S}(\Theta g)_{k}g_{l}\right) \qquad (19)

Noting that $\sum_{k=1}^{c}(g_{l}\delta_{mk}-\mathcal{S}(\Theta g)_{k}g_{l})=0$ (since $\sum_{k}\mathcal{S}(\Theta g)_{k}=1$), we have $\sum_{k=1}^{c}\nabla_{kl}(\mathcal{L}_{1}+\mathcal{L}_{2})=0$. It remains to prove that the derivative of the third term also sums to zero over $k$. It is easy to verify that this derivative can be expressed in the following form:

\frac{\partial\mathcal{L}_{3}}{\partial\Theta_{kl}}=\gamma\frac{\sum_{j=1}^{c}e^{\alpha\mathcal{S}(\Theta g)_{j}}\mathcal{S}^{2}(\Theta g)_{j}\left(g_{l}\delta_{jk}-\mathcal{S}(\Theta g)_{k}g_{l}\right)}{\sum_{j=1}^{c}e^{\alpha\mathcal{S}(\Theta g)_{j}}-e^{\alpha\mathcal{S}(\Theta g)_{m}}+1}-\gamma\frac{e^{\alpha\mathcal{S}(\Theta g)_{m}}\mathcal{S}^{2}(\Theta g)_{m}\left(g_{l}\delta_{mk}-\mathcal{S}(\Theta g)_{k}g_{l}\right)}{\sum_{j=1}^{c}e^{\alpha\mathcal{S}(\Theta g)_{j}}-e^{\alpha\mathcal{S}(\Theta g)_{m}}+1} \qquad (20)

Since Eq. (20) also contains the factor $(g_{l}\delta_{jk}-\mathcal{S}(\Theta g)_{k}g_{l})$, and we know that $\sum_{k=1}^{c}(g_{l}\delta_{jk}-\mathcal{S}(\Theta g)_{k}g_{l})=0$ for any $j$, we have $\sum_{k=1}^{c}\nabla_{kl}\mathcal{L}_{3}=0$.

Thus, we have

\sum_{k=1}^{c}\nabla_{kl}\mathcal{L}=\sum_{k=1}^{c}\nabla_{kl}(\mathcal{L}_{1}+\mathcal{L}_{2}+\mathcal{L}_{3})=0 \qquad (21)

which completes the proof.

It is worth noting that Eqs. (18)-(20) not only help us prove this theorem but also directly provide the gradient formula. This allows us to compute the gradient in closed form and apply gradient-descent algorithms to optimize the objective. The theorem proven above is a very useful property that helps us quickly obtain the optimal solution of the constrained loss function.
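Theorem 2 can also be sanity-checked numerically. The sketch below reuses the `loss_fn` sketch above with random one-hot inputs, lets PyTorch autograd compute $\nabla_{\Theta}\mathcal{L}$, and inspects its column sums:

```python
import torch

torch.manual_seed(0)
c, k, n = 4, 3, 8
G = torch.zeros(k * c, n)                        # random one-hot base predictions
for t in range(k):
    G[t * c + torch.randint(c, (n,)), torch.arange(n)] = 1.0
Y = torch.zeros(c, n)                            # random one-hot labels
Y[torch.randint(c, (n,)), torch.arange(n)] = 1.0

Theta = torch.randn(c, k * c, requires_grad=True)
loss_fn(Theta, G, Y).backward()
print(Theta.grad.sum(dim=0))                     # every column sums to ~0
```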

In fact, in the general case with more than one sample, let $\{m_{1},\dots,m_{n}\}$ denote the true classes of the samples, where $m_{i}\in\{1,\dots,c\}$. Then the full gradient equals

\frac{\partial\mathcal{L}}{\partial\Theta_{kl}}=-\sum_{i=1}^{n}\left(1+\gamma\mathcal{S}(\Theta g_{i})_{m_{i}}\right)\left(g_{il}\delta_{m_{i}k}-\mathcal{S}(\Theta g_{i})_{k}g_{il}\right)+\sum_{i=1}^{n}\gamma\frac{\sum_{j=1}^{c}e^{\alpha\mathcal{S}(\Theta g_{i})_{j}}\mathcal{S}^{2}(\Theta g_{i})_{j}\left(g_{il}\delta_{jk}-\mathcal{S}(\Theta g_{i})_{k}g_{il}\right)}{\sum_{j=1}^{c}e^{\alpha\mathcal{S}(\Theta g_{i})_{j}}-e^{\alpha\mathcal{S}(\Theta g_{i})_{m_{i}}}+1}-\sum_{i=1}^{n}\gamma\frac{e^{\alpha\mathcal{S}(\Theta g_{i})_{m_{i}}}\mathcal{S}^{2}(\Theta g_{i})_{m_{i}}\left(g_{il}\delta_{m_{i}k}-\mathcal{S}(\Theta g_{i})_{k}g_{il}\right)}{\sum_{j=1}^{c}e^{\alpha\mathcal{S}(\Theta g_{i})_{j}}-e^{\alpha\mathcal{S}(\Theta g_{i})_{m_{i}}}+1} \qquad (22)

where $g_{il}$ denotes the $l$-th entry of $g_{i}$ and $n$ is the number of data points used to compute the gradient: all data points for full gradient descent, the batch size for batch gradient descent, and 1 for stochastic gradient descent.

Since $\sum_{k=1}^{c}\nabla_{kl}\mathcal{L}=0$, with an appropriate initialization the constraint is naturally satisfied by gradient-based methods. Specifically, the gradient-descent update is:

\Theta_{p+1}=\Theta_{p}-\beta\,\nabla_{\Theta_{p}}\mathcal{L}(\Theta_{p}) \qquad (23)

where $\beta$ is the learning rate. If the initialization satisfies $\Theta_{0}^{T}\mathbf{1}=\tilde{w}$, then by mathematical induction it is easy to prove that $\Theta^{*T}\mathbf{1}=\tilde{w}$. Therefore, we can initialize each forward slice of the tensor and then unfold it, as shown below.

Figure 2: Initialization diagram

Due to the convexity of our loss function, we can optimize this problem using gradient descent, selecting a batch of $g_{i}$ at a time to compute the loss function and its gradient. The specific procedure is given in Algorithm 1.

Algorithm 1 Gradient Descent
Input: matrix $G$
Output: $\Theta^{*}$
1: Initialize $\Theta_{0}$ according to Figure 2.
2: while not converged do
3:   Randomly select a batch of columns $g_{i}$ from $G$.
4:   Compute $\nabla\mathcal{L}(\Theta_{p})$ by Eq. (22).
5:   Update $\Theta_{p+1}=\Theta_{p}-\beta\nabla\mathcal{L}(\Theta_{p})$.
6: end while
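A compact PyTorch sketch of Algorithm 1, reusing the `loss_fn` and `init_theta` sketches above; autograd stands in for the closed-form gradient of Eq. (22), and the batch size, learning rate, and epoch count are illustrative:

```python
import torch

def train(G, Y, w, c, epochs=100, batch=64, beta=0.1, gamma=5.0):
    # Step 1: initialize Theta_0 as in Figure 2, so Theta_0^T 1 = w_tilde.
    Theta = torch.tensor(init_theta(w, c), dtype=torch.float32, requires_grad=True)
    n = G.shape[1]
    for _ in range(epochs):
        idx = torch.randperm(n)[:batch]          # random batch of columns g_i
        loss_fn(Theta, G[:, idx], Y[:, idx], gamma=gamma).backward()
        with torch.no_grad():
            Theta -= beta * Theta.grad           # Eq. (23); column sums stay fixed
        Theta.grad.zero_()
    return Theta.detach()
```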

Time Complexity Analysis

In this section, we analyze the time complexity of the algorithm. The algorithm computes gradients on a batch of samples at a time. Note that although $g_{i}$ is a vector of dimension $kc$, by definition it is the concatenation of one-hot predictions from the $k$ base classifiers, so $g_{i}$ is sparse: computing $\Theta g_{i}$ amounts to adding selected columns of $\Theta$, with no multiplications needed.

In other words, computing the gradient of each element according to Eq. (22) has time complexity $\mathcal{O}(kc)$. Many common terms can be cached during the computation to simplify it. With $\mathcal{O}(kc^{2})$ parameters, one update costs at most $\mathcal{O}(k^{2}c^{3})$; assuming convergence after $N$ iterations, the total time complexity is at most $\mathcal{O}(Nk^{2}c^{3})$.

However, in practice $k$ and $c$ are usually very small, many operations reduce to plain additions, and hardware acceleration and automatic differentiation in tools like PyTorch make the actual runtime very fast.
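The sparsity argument can be made explicit: since each base prediction is one-hot, $\Theta g_{i}$ is just a sum of $k$ columns of $\Theta$ (a sketch, with `pred_classes` holding the $k$ predicted labels):

```python
import numpy as np

def theta_times_g(Theta, pred_classes, c):
    """Theta @ g_i without multiplications: add column t*c + p_t of each slice t."""
    cols = [t * c + p for t, p in enumerate(pred_classes)]
    return Theta[:, cols].sum(axis=1)
```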

Experiments on toy datasets

We construct a toy dataset with a double-ring shape, which has the advantage of making classification results easy to visualize. To demonstrate our superiority, we compare with the classical Support Vector Classification (SVC) algorithm and the XGBoost algorithm. To show that confidence optimization can "achieve more with less", we compare Random Forests with 10 trees (RF10), 20 trees (RF20), 30 trees (RF30), and 100 trees (RF100), while our algorithm uses 10 trees (OUR10). All tree models use the same maximum depth. The number of trees in the compared Random Forest models is 1×, 2×, 3×, and 10× that of our model, respectively.

The results and classification boundaries of the compared algorithms on the toy dataset are shown in Figure 3. Our algorithm uses $\alpha=10$ and $\gamma=5$. From the classification boundaries, it can be observed that our algorithm effectively improves the learners' prediction performance near the boundaries.

Figure 3: Performance on the toy dataset. (a) Original dataset. (b) OUR10. (c) SVC. (d) XGBoost. (e) RF10. (f) RF20. (g) RF30. (h) RF100.

We divide the dataset into training and test sets with an 80% to 20% ratio. Below are the performances of the algorithms on the dataset:

Table 1: Classification accuracy (%) on the toy dataset
Method OUR10 SVC XGBoost RF10 RF20 RF30 RF100
TrainData 90.2 83.7 90.4 88.3 89.0 89.5 89.9
TestData 88.3 84.0 86.5 86.3 88.0 87.8 88.0
AllData 89.8 83.8 89.7 87.9 88.8 89.2 89.5

From the experimental results on the toy dataset, we can see that although our algorithm does not have the highest accuracy on the training set, it achieves the highest accuracy on both the test set and the entire dataset. This demonstrates that our algorithm effectively suppresses overfitting on the training set and improves overall accuracy, achieving better results with 10 trees compared to the 100-tree Random Forest.

Experiments on real datasets

To further demonstrate the superiority of our algorithm, we conduct experiments on some real-world datasets of varying sizes.

Experimental Settings

We conduct extensive experiments on ten datasets: TR41, warpPIE10P, Movement, Ecoli, Arcene, REUT, GLI85, Carcinom, Lung, and Isolet. Table 2 lists the number of samples, attributes, and classes for each dataset:

Table 2: Description of the benchmark datasets
Datasets #Object #Attribute #Class
TR41 878 7454 10
warpPIE10P 210 2420 10
Movement 360 90 15
Ecoli 336 7 8
Arcene 200 10000 2
REUT 10000 2000 4
GLI85 85 22283 2
Carcinom 174 9782 11
Lung 203 3312 5
Isolet 1560 617 26

Parameter Selection.

Our algorithm has two hyperparameters, $\alpha$ and $\gamma$. The hyperparameter $\alpha$ is fixed at 10 throughout; as long as it is not too small, it does not affect the algorithm's performance. The other parameter, $\gamma$, is chosen randomly from $\{5,10,15,20,25\}$ for each run; to facilitate reproducibility, the random seed is fixed at 0.

Evaluation Metrics.

In our experiments, we report training-set accuracy, test-set accuracy, and overall accuracy, with the training/test split fixed at 80%/20%. The comparison algorithms are the same as those used for the toy dataset, and the base learners in our algorithm and in the Random Forest baselines always share identical basic parameters.

Experimental Results

Our experimental results are recorded in Tables 3, 4, and 5. Table 3 presents the accuracy of various algorithms on the training set, Table 4 records the accuracy of the algorithms on the test set, and Table 5 shows the accuracy of the algorithms on the entire dataset.

Table 3: Classification accuracy on the training set
Dataset OUR10 XGBoost SVC RF10 RF20 RF30 RF100
Movement 1.000 0.997 0.903 0.851 0.865 0.899 0.927
warpPIE10P 1.000 1.000 1.000 0.994 1.000 1.000 1.000
Lung 1.000 1.000 0.944 0.963 0.994 1.000 1.000
Ecoli 1.000 0.963 0.892 0.978 0.993 0.996 1.000
Arcene 1.000 1.000 0.744 0.994 0.994 1.000 1.000
Carcinom 1.000 1.000 0.928 0.993 1.000 1.000 1.000
GLI_85 1.000 1.000 0.706 1.000 1.000 1.000 1.000
Isolet 1.000 1.000 0.983 0.966 0.984 0.984 0.991
TR41 0.990 1.000 0.999 0.977 0.983 0.976 0.984
REUT 0.997 0.929 0.996 0.993 0.998 0.999 1.000
Table 4: Classification accuracy on the test set
Dataset OUR10 XGBoost SVC RF10 RF20 RF30 RF100
Movement 0.764 0.736 0.744 0.611 0.569 0.681 0.667
warpPIE10P 0.976 0.905 1.000 0.905 0.929 0.929 0.952
Lung 0.951 0.927 0.951 0.878 0.951 0.951 0.951
Ecoli 0.882 0.868 0.868 0.853 0.868 0.853 0.868
Arcene 0.850 0.650 0.700 0.750 0.775 0.750 0.775
Carcinom 0.886 0.743 0.686 0.771 0.800 0.857 0.885
GLI_85 0.941 0.765 0.647 0.824 0.882 0.882 0.882
Isolet 0.881 0.878 0.955 0.827 0.881 0.875 0.881
TR41 0.972 0.972 0.943 0.960 0.955 0.955 0.955
REUT 0.897 0.883 0.959 0.888 0.897 0.900 0.902

From the experiments, we can see that for the Random Forest algorithm, accuracy increases with the number of base classifiers, i.e., decision trees. On the vast majority of datasets, our algorithm outperforms the others in accuracy on the training set, the test set, and the entire dataset. It even surpasses the Random Forest with ten times as many base classifiers as ours, achieving the goal of outperforming large-scale ensembles with a smaller one.

Convergence Analysis

To verify that our algorithm indeed has very good convergence properties, we select several datasets and plot the iteration curves of our algorithm on them, as shown in Figure 4. Because the gradient descent method is very fast, we use full gradient descent in these experiments, i.e., computing the gradient over all samples.

Figure 4: Convergence curves of ANCMM. (a) Movement. (b) TR41. (c) warpPIE10P. (d) Lung. (e) Arcene. (f) Isolet.

Because our objective function is partially convex, the algorithm typically converges quickly; as observed in the experiments, convergence is usually achieved within 10 iterations.

Further Discussion

Relationship with Random Forest

In this section, we will discuss the relationship between our algorithm and the classic random forest, assuming that decision trees are used as base classifiers.

Dropout

Dropout typically refers to discarding some learned parameters, i.e., setting them to zero. For our learned confidence tensor $\tilde{\mathbf{\Theta}}$, if we apply the dropout idea and drop all the non-diagonal elements of $\tilde{\mathbf{\Theta}}$, our algorithm reduces to the classic random forest weighted by accuracy.

Fixing Parameters

A more specific case is that after dropping out the non-diagonal elements, we fix the diagonal elements to 1. At this point, each decision tree has the same confidence, and each decision is made by equal-confidence voting. This forms the most classic random forest algorithm.

Since our algorithm has more learnable parameters than the classic random forest, it can be viewed as an extension of it. This means that, as long as our loss function is reasonably designed, our algorithm's performance will be at least as good as that of the classic random forest algorithm.

Why Our Algorithm Performs Better

In this section, we will explain why our algorithm performs better using a simple example.

What Additional Information We Obtained

Assume we have learned a confidence matrix $\Theta$ whose contents are shown in Figure 5. There are three classifiers, with confidences 3, 3.6, and 3, respectively. In fact, $\Theta$ is composed of three confusion-matrix-like slices. We find that the diagonal elements of most columns are very large, indicating that the classifiers have learned the correct behavior in most cases, and we trust them more. However, for the first classifier, the values in the third column are all equal, which means it has not learned how to classify the third class.

Figure 5: Learned $\Theta$

Confidence Optimization

Note that although the first classifier has not learned to distinguish the third class, it is relatively confident about the other classes. Similarly, the second classifier has not learned to classify the first class but is confident about the others. With random forest weighted voting, the errors introduced by these classifiers cannot be ignored. In our algorithm, the uncertain confidences become very small, creating a complementary relationship among the confident classifiers and thereby improving accuracy.

From Bagging to Stacking

Our algorithm seems to learn a set of confidences for bagging, but in fact it can easily be extended to a stacking algorithm. We do not require base classifiers to be of the same type; instead, we can combine and learn confidences for heterogeneous base classifiers such as decision trees, support vector machines, and logistic regression, as sketched below. Furthermore, a potential extension is to analyze the performance of heterogeneous base classifiers on different types of datasets.
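A sketch of this stacking variant, reusing the `build_G` and `train` sketches above with heterogeneous scikit-learn base learners; `X_train` and `y_train` are assumed available with classes labeled $0,\dots,c-1$, and all settings are illustrative:

```python
import numpy as np
import torch
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

models = [DecisionTreeClassifier(max_depth=5),
          SVC(),
          LogisticRegression(max_iter=1000)]
for m in models:
    m.fit(X_train, y_train)

c = len(np.unique(y_train))
G = torch.tensor(build_G(models, X_train, c), dtype=torch.float32)
Y = torch.tensor(np.eye(c)[y_train].T, dtype=torch.float32)
w = np.array([m.score(X_train, y_train) for m in models])  # accuracies as confidences
Theta = train(G, Y, w, c)  # one confidence matrix over heterogeneous learners
```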

Conclusion

In this paper, we focus on how to match the performance of a large number of base learners by integrating only a few. To achieve this, we propose that the integration process should simultaneously improve both accuracy and generalization. For this purpose, we introduce the concept of margin into the loss function and design a smooth loss function with good convexity using the log-sum-exp function. We also prove that the column sums of the gradient satisfy a useful property, allowing us to solve the constrained optimization problem through gradient descent and obtain the optimal solution. We compare our method with random forests having 10 times the number of base learners and with other classical methods, demonstrating that our integration of fewer learners outperforms them in the majority of cases. Finally, we discuss the reasons behind the algorithm's effectiveness and its further extensions.

References

  • Biau and Scornet (2016) Biau, G.; and Scornet, E. 2016. A random forest guided tour. Test, 25: 197–227.
  • Boyd and Vandenberghe (2004) Boyd, S.; and Vandenberghe, L. 2004. Convex optimization. Cambridge university press.
  • Calafiore, Gaubert, and Possieri (2019) Calafiore, G. C.; Gaubert, S.; and Possieri, C. 2019. Log-sum-exp neural networks and posynomial models for convex and log-log-convex data. IEEE transactions on neural networks and learning systems, 31(3): 827–838.
  • Calafiore, Gaubert, and Possieri (2020) Calafiore, G. C.; Gaubert, S.; and Possieri, C. 2020. A Universal Approximation Result for Difference of Log-Sum-Exp Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 31(12): 5603–5612.
  • Cao et al. (2020) Cao, Y.; Geddes, T. A.; Yang, J. Y. H.; and Yang, P. 2020. Ensemble deep learning in bioinformatics. Nature Machine Intelligence, 2(9): 500–508.
  • Chen et al. (2023) Chen, L.-J.; Zhang, T.; Shi, X.; and Jin, H. 2023. Incremental and Decremental Optimal Margin Distribution Learning. In IJCAI, 3523–3531.
  • Chen and Guestrin (2016) Chen, T.; and Guestrin, C. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 785–794.
  • Ferrara, Grillo, and Gatto (1973) Ferrara, S.; Grillo, A. F.; and Gatto, R. 1973. Tensor representations of conformal algebra and conformally covariant operator product expansion. Annals of Physics, 76(1): 161–188.
  • Khaled et al. (2023) Khaled, A.; Sebbouh, O.; Loizou, N.; Gower, R. M.; and Richtárik, P. 2023. Unified analysis of stochastic gradient methods for composite convex and smooth optimization. Journal of Optimization Theory and Applications, 199(2): 499–540.
  • Li et al. (2021) Li, B.; Yang, B.; Liu, C.; Liu, F.; Ji, R.; and Ye, Q. 2021. Beyond max-margin: Class margin equilibrium for few-shot object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 7363–7372.
  • Liu et al. (2015) Liu, H.; Liu, T.; Wu, J.; Tao, D.; and Fu, Y. 2015. Spectral ensemble clustering. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 715–724.
  • Liu et al. (2024) Liu, J.; Huang, L.-W.; Shao, Y.-H.; Chen, W.-J.; and Li, C.-N. 2024. A nonlinear kernel SVM classifier via L0/1 soft-margin loss with classification performance. Journal of Computational and Applied Mathematics, 437: 115471.
  • Liu et al. (2023) Liu, Z.; Xu, Z.; Jin, J.; Shen, Z.; and Darrell, T. 2023. Dropout reduces underfitting. In International Conference on Machine Learning, 22233–22248. PMLR.
  • Ngo, Beard, and Chandra (2022) Ngo, G.; Beard, R.; and Chandra, R. 2022. Evolutionary bagging for ensemble learning. Neurocomputing, 510: 1–14.
  • Nie, Hao, and Wang (2024) Nie, F.; Hao, Z.; and Wang, R. 2024. Multi-class support vector machine with maximizing minimum margin. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 14466–14473.
  • Shi et al. (2022) Shi, S.; Nie, F.; Wang, R.; and Li, X. 2022. When multi-view classification meets ensemble learning. Neurocomputing, 490: 17–29.
  • Sun et al. (2024) Sun, Z.; Wang, G.; Li, P.; Wang, H.; Zhang, M.; and Liang, X. 2024. An improved random forest based on the classification accuracy and correlation measurement of decision trees. Expert Systems with Applications, 237: 121549.
  • Townsend (1971) Townsend, J. T. 1971. Theoretical analysis of an alphabetic confusion matrix. Perception & Psychophysics, 9: 40–50.
  • Wimmer, Mehnert, and Condurache (2023) Wimmer, P.; Mehnert, J.; and Condurache, A. P. 2023. Dimensionality reduced training by pruning and freezing parts of a deep neural network: a survey. Artificial Intelligence Review, 56(12): 14257–14295.
  • Wood et al. (2023) Wood, D.; Mu, T.; Webb, A. M.; Reeve, H. W.; Lujan, M.; and Brown, G. 2023. A unified theory of diversity in ensemble learning. Journal of Machine Learning Research, 24(359): 1–49.
  • Zhao et al. (2024) Zhao, P.; Zhang, Y.-J.; Zhang, L.; and Zhou, Z.-H. 2024. Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization. Journal of Machine Learning Research, 25(98): 1–52.
  • Zhou (2014) Zhou, Z.-H. 2014. Large margin distribution learning. In Artificial Neural Networks in Pattern Recognition: 6th IAPR TC 3 International Workshop, ANNPR 2014, Montreal, QC, Canada, October 6-8, 2014. Proceedings 6, 1–11. Springer.
  • Zhou and Feng (2019) Zhou, Z.-H.; and Feng, J. 2019. Deep forest. National science review, 6(1): 74–86.
  • Zounemat-Kermani et al. (2021) Zounemat-Kermani, M.; Batelaan, O.; Fadaee, M.; and Hinkelmann, R. 2021. Ensemble machine learning paradigms in hydrology: A review. Journal of Hydrology, 598: 126266.