
Achieving More with Less: A Tensor-Optimization-Powered Ensemble Method

Jinghui Yuan1,2, Weijin Jiang3, Zhe Cao1, Fangyuan Xie1,2
Rong Wang1*, Feiping Nie1,2, Yuan Yuan1
*Corresponding author
Abstract

Ensemble learning is a method that leverages weak learners to produce a strong learner. However, obtaining a large number of base learners requires substantial time and computational resources. It is therefore worthwhile to study how to match the performance typically obtained with many base learners using only a few. We argue that achieving this requires enhancing both classification performance and generalization ability during the ensemble process. To increase model accuracy, each weak base learner must be integrated more efficiently. We observe that different base learners exhibit varying levels of accuracy on different classes. To capitalize on this, we introduce a confidence tensor $\tilde{\mathbf{\Theta}}$, where $\tilde{\mathbf{\Theta}}_{rst}$ signifies the degree of confidence that the $t$-th base classifier assigns a sample to class $r$ while it actually belongs to class $s$. To the best of our knowledge, this is the first time an evaluation of the performance of base classifiers across individual classes has been proposed. The proposed confidence tensor lets the strengths and weaknesses of the base classifiers on different classes compensate for one another, enabling the method to achieve superior results with a smaller number of base learners. To enhance generalization performance, we design a smooth and convex objective function that leverages the concept of margin, making the strong learner more discriminative. Furthermore, we prove that in the gradient matrix of the loss function the elements of each column sum to zero, which allows us to solve a constrained optimization problem using gradient-based methods. We then compare our algorithm with random forests ten times its size and with other classical methods across numerous datasets, demonstrating the superiority of our approach. Finally, we discuss the reasons for the success of our algorithm and point out that it extends naturally from bagging to stacking.

Introduction

Ensemble learning is a method that constructs a strong learner from weak learners. To date, ensemble learning has achieved success in many fields, such as: (Zhou and Feng 2019) apply the concept of ensemble learning to deep ensemble methods, (Liu et al. 2015) combine deep learning with spectral clustering, (Wood et al. 2023) attempt to explain ensemble learning from the diversity perspective, (Shi et al. 2022) consider multi-view ensemble learning, and (Cao et al. 2020) introduce ensemble learning into bioinformatics for interdisciplinary research. In summary, ensemble learning is a very important concept in the field of artificial intelligence.

However, ensemble learning often requires integrating a large number of base learners. For example, (Chen and Guestrin 2016) design XGBoost using the boosting idea, and (Sun et al. 2024) consider reducing the correlation among random forests to improve their performance. Nevertheless, all these ensemble methods have relatively high time complexity. It is worthwhile to investigate how to achieve the effects of a large ensemble using only a few base learners.

We believe that to achieve this goal, it is necessary to adjust the confidences during the ensemble process. On one hand, the accuracy of the combined classifiers should be as high as possible, and on the other hand, generalization should be as good as possible.

To increase the accuracy of the ensemble model, we need each base learner to be integrated more efficiently. We observe that different base learners have varying accuracy for different classes. For example, one learner may be very accurate in classifying classes 1 and 2 but less accurate for class 3, while another learner may be accurate for classes 2 and 3 but less accurate for class 1. Based on this, we propose a learnable confidence tensor $\tilde{\mathbf{\Theta}}$, where $\tilde{\mathbf{\Theta}}_{rst}$ represents the probability that the $t$-th base classifier classifies an instance as class $r$ when it actually belongs to class $s$. This concept is analogous to the confusion matrix (Townsend 1971). To better represent the tensor, we use a technique similar to that of (Ferrara, Grillo, and Gatto 1973) to unfold the tensor and prove that it has favorable properties, which assist us in solving the constrained optimization problem (Boyd and Vandenberghe 2004).

Another important issue is how to improve the generalization of the ensemble. We introduce the concept of margin into multi-class optimization because (Zhou 2014) points out that margin is a key factor in suppressing overfitting in AdaBoost. Additionally, (Nie, Hao, and Wang 2024) achieve good results by considering margin in multi-class support vector machines. (Chen et al. 2023) propose the ODE theory using the margin concept, demonstrating excellent performance across multiple datasets.

There has been much discussion on how to introduce the margin. For instance, (Liu et al. 2024) define the margin using the $l_{0/1}$ soft-margin loss, and (Li et al. 2021) propose the class margin. However, the margins obtained by these methods are often non-smooth or non-convex, which makes them relatively difficult to optimize (Khaled et al. 2023).

To solve this problem, we use the log-sum-exp technique (Calafiore, Gaubert, and Possieri 2019) to derive a loss function that is smooth and enjoys a certain convexity. Moreover, we prove that the columns of the loss gradient sum to zero, which, given a sufficiently good initialization, allows confidences to be added adaptively to each base learner and the problem to be solved efficiently with gradient-based methods (Zhao et al. 2024).

We compare the proposed algorithm with a random forest that has ten times the number of parameters (Biau and Scornet 2016), as well as with classical boosting algorithms and support vector machine methods. Experiments demonstrate that we have achieved our goal, outperforming a large number of base learners using only a few.

Finally, we conduct an in-depth discussion, analyzing the relationship between our algorithm and dropout (Liu et al. 2023) and parameter freezing (Wimmer, Mehnert, and Condurache 2023). We discuss and analyze why our algorithm performs better. Additionally, we point out that this algorithm can not only be used as bagging (Ngo, Beard, and Chandra 2022) but can also be naturally extended to stacking (Zounemat-Kermani et al. 2021).

We summarize our main contributions below:

  • We propose a learnable tensor $\tilde{\mathbf{\Theta}}$, which encapsulates the confidences of each base learner for different classes. We design a loss function based on the margin concept, which is smooth and partially convex.

  • We prove that the loss function has the desirable property of having a gradient column sum of zero. This allows us to solve the proposed optimization problem with linear constraints using gradient-based methods, and we design an algorithm for this purpose.

  • We conduct extensive comparative experiments. Under the same conditions, the ensemble of a small number of base learners outperforms a random forest with ten times the number of base learners, as well as other classical algorithms, validating our method’s superiority.

  • We further discuss the relationship between our algorithm and random forests, explaining that random forests can be seen as a result of our algorithm after applying dropout. We also elucidate why our algorithm performs well and how it can be naturally extended to stacking methods.

Methodology

In this section, we will introduce the symbols used in the paper, the construction of the loss function, and the specific optimization model employed.

Notations

Assume we are addressing a $c$-class classification problem, where $x_{i}$ denotes the $i$-th of $n$ samples and the ensemble is trained with $k$ base classifiers $G_{1},\dots,G_{k}$. Each base classifier takes $x_{i}$ as input and outputs a $c$-dimensional one-hot indicator column vector $G_{j}(x_{i})=(0,\dots,1,\dots,0)^{T}\in\mathbb{R}^{c\times 1}$. The true label of the $i$-th sample, whose class is $m_{i}$, is a one-hot encoded vector $y_{i}\in\mathbb{R}^{c\times 1}$; stacking these labels column-wise forms the indicator matrix $Y\in\mathbb{R}^{c\times n}$. Let $g_{i}=(G_{1}^{T}(x_{i}),\dots,G_{k}^{T}(x_{i}))^{T}\in\mathbb{R}^{kc\times 1}$ be the concatenated predictions for $x_{i}$; combining all $g_{i}$ gives the matrix $G=(g_{1},\dots,g_{n})\in\mathbb{R}^{kc\times n}$.
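To make the notation concrete, here is a minimal NumPy sketch that builds $g_{i}$ and $G$ from one-hot base-classifier outputs; it is a sketch only, assuming `classifiers` is a list of fitted models with a scikit-learn-style `predict` returning labels in $0,\dots,c-1$:

```python
import numpy as np

def build_G(classifiers, X, c):
    """Stack one-hot predictions into G = (g_1, ..., g_n) of shape (k*c, n).

    classifiers : list of k fitted models whose predict() returns labels 0..c-1
    X           : (n, d) data matrix
    """
    # Each block is the (c, n) matrix of one-hot outputs of one classifier.
    blocks = [np.eye(c)[clf.predict(X)].T for clf in classifiers]
    return np.vstack(blocks)  # column i is g_i = (G_1(x_i)^T, ..., G_k(x_i)^T)^T
```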

Introduction of Margin

Margin represents the distance from data points to the decision boundary. First, let us introduce how predictions are made using the confidence tensor $\tilde{\mathbf{\Theta}}$. The tensor $\tilde{\mathbf{\Theta}}$ has three modes of sizes $c$, $c$, and $k$. Here, $\tilde{\mathbf{\Theta}}_{rst}$ signifies the confidence that the $t$-th base classifier assigns a sample to class $r$ while it actually belongs to class $s$. The confidences for all classes are collected in $\tilde{\mathbf{\Theta}}$. However, tensors are not straightforward to handle in matrix computations. Therefore, we unfold it into a matrix $\Theta\in\mathbb{R}^{c\times kc}$ by forward slicing, as illustrated in Figure 1.

Figure 1: Expansion diagram of $\tilde{\mathbf{\Theta}}$

Define $\mathcal{S}$ as the softmax function. When $\mathcal{S}$ is applied to a column vector, it performs softmax normalization on that vector; when applied to a matrix, it applies the softmax to each column. Based on this, the prediction for the $i$-th sample can be expressed as

\hat{y}_{i}=\underset{j=1,\dots,c}{\arg\max}\ \left(\mathcal{S}\left(\Theta g_{i}\right)\right)_{j} \qquad (1)

A suitable definition of the margin for multi-class classification is the difference between the predicted value at the true-label position and the second-largest predicted value. This can be expressed as

Y_{i}^{T}(Y_{i}\odot\mathcal{S}(\Theta g_{i}))-\max_{2}(\mathcal{S}\left(\Theta g_{i}\right)) \qquad (2)

where $\max_{2}(v)$ denotes the second-largest element of vector $v$, $Y_{i}$ is the $i$-th column of $Y$, and $\odot$ denotes the Hadamard (element-wise) product. This definition is reasonable, as it reflects the confidence in classifying into the true class.
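As a sketch of Eqs. (1) and (2), assuming `Theta`, the stacked one-hot vector `g_i`, and the one-hot label `Y_i` are available as NumPy arrays:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())            # numerically stable softmax
    return e / e.sum()

def predict_and_margin(Theta, g_i, Y_i):
    p = softmax(Theta @ g_i)           # S(Theta g_i)
    y_hat = int(np.argmax(p))          # Eq. (1): predicted class
    true_score = Y_i @ (Y_i * p)       # value at the true-label position
    second = np.sort(p)[-2]            # max_2: second-largest predicted value
    return y_hat, true_score - second  # Eq. (2): raw (non-smooth) margin
```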

However, both the maximum function and the $\max_{2}$ function are non-smooth, and non-smooth functions can significantly affect the convergence behavior and speed of the algorithm. A reasonable solution is to approximate these functions using the log-sum-exp function. For a vector $v$, by choosing a suitable $\alpha$, the log-sum-exp function can effectively approximate the largest element of $v$. Its expression is given in Eq. (3).

f(v)=\frac{1}{\alpha}\log\left(\sum_{j=1}^{c}e^{\alpha v_{j}}\right) \qquad (3)

The choice of the parameter $\alpha$ is straightforward and does not significantly affect the degree of approximation (Calafiore, Gaubert, and Possieri 2020). Through extensive experiments, we found that setting $\alpha$ to 10 already extracts the maximum value with considerable accuracy.
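A quick numeric check of Eq. (3) with $\alpha=10$ (a sketch; the shift by the maximum keeps the exponentials stable):

```python
import numpy as np

def logsumexp_max(v, alpha=10.0):
    # Smooth approximation of max(v) via Eq. (3).
    m = v.max()
    return m + np.log(np.exp(alpha * (v - m)).sum()) / alpha

v = np.array([0.1, 0.7, 0.2])
print(v.max(), logsumexp_max(v))  # 0.7 vs. about 0.701
```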

Using the log-sum-exp function, the $\max_{2}$ function can be easily represented. To extract the second-largest value, we only need to set the maximum position to 0 and then retrieve the maximum value again. We assume that the predicted value at the true class is the largest, which is reasonable, and we will explain this later. This can be expressed using the following formula:

\max_{2}(\mathcal{S}\left(\Theta g_{i}\right))=\frac{1}{\alpha}\log\left(\sum_{j=1}^{c}e^{\alpha(\mathcal{S}\left(\Theta g_{i}\right)-Y_{i}\odot\mathcal{S}\left(\Theta g_{i}\right))_{j}}\right) \qquad (4)

Thus, the margin contributed by the $i$-th sample can be expressed as Eq. (5),

Y_{i}^{T}(Y_{i}\odot\mathcal{S}\left(\Theta g_{i}\right))-\frac{1}{\alpha}\log\left(\sum_{j=1}^{c}e^{\alpha(\mathcal{S}\left(\Theta g_{i}\right)-Y_{i}\odot\mathcal{S}\left(\Theta g_{i}\right))_{j}}\right) \qquad (5)

and the total margin can be expressed as:

\mathcal{M}=\sum_{i=1}^{n}Y_{i}^{T}(Y_{i}\odot\mathcal{S}\left(\Theta g_{i}\right))-\sum_{i=1}^{n}\frac{1}{\alpha}\log\left(\sum_{j=1}^{c}e^{\alpha(\mathcal{S}\left(\Theta g_{i}\right)-Y_{i}\odot\mathcal{S}\left(\Theta g_{i}\right))_{j}}\right) \qquad (6)

Introduction of Loss Function

Now let us return to the assumption made in Eq. (4). In the margin calculation above, the actual $\max_{2}(\mathcal{S}(\Theta g_{i}))$ should be the maximum of

\mathcal{S}\left(\Theta g_{i}\right)-Y^{\mu}\odot\max(\mathcal{S}\left(\Theta g_{i}\right)) \qquad (7)

where $\mu$ is $\underset{j=1,\dots,c}{\arg\max}\,(\mathcal{S}(\Theta g_{i}))_{j}$ and $Y^{\mu}$ denotes the column vector whose $\mu$-th element is 1 and all others are 0. Instead, we compute the maximum of

\mathcal{S}\left(\Theta g_{i}\right)-Y_{i}\odot\mathcal{S}\left(\Theta g_{i}\right) \qquad (8)

This is because subtracting the linear term results in a margin function with better smoothness and convexity. In fact, we make the following assumption:

\max(\mathcal{S}\left(\Theta g_{i}\right))\approx\|Y_{i}\odot\mathcal{S}(\Theta g_{i})\| \qquad (9)

This essentially assumes correct classification. To ensure this, we need to incorporate a penalty for misclassification into the objective. Cross-entropy can effectively measure accuracy in the classification process. Therefore, we add the following accuracy term:

\mathcal{C}=-\sum_{i=1}^{n}Y_{i}^{T}\left(Y_{i}\odot\log(\mathcal{S}\left(\Theta g_{i}\right))\right) \qquad (10)

The final optimization objective is the cross-entropy minus the weighted margin:

\mathcal{L}=\mathcal{C}-\gamma\mathcal{M} \qquad (11)

where $\gamma$ is a trade-off hyperparameter. A larger $\gamma$ indicates a greater preference for higher classification confidence and thus a larger margin. In particular, when $\gamma=0$ the objective reverts to the standard cross-entropy loss used for classification; the cross-entropy loss is therefore a special case of our loss function.
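Putting Eqs. (6), (10), and (11) together, here is a PyTorch sketch of the batched loss; the function name and the $10^{-12}$ stabilizer inside the logarithm are our additions:

```python
import torch

def loss_fn(Theta, G, Y, gamma=5.0, alpha=10.0):
    """L = C - gamma * M over a batch; Theta: (c, k*c), G: (k*c, n), Y: (c, n)."""
    P = torch.softmax(Theta @ G, dim=0)           # S(Theta g_i), column-wise
    true_p = (Y * P).sum(dim=0)                   # prediction at the true label
    # Smoothed max_2 of Eq. (4): zero the true-label entry, then log-sum-exp.
    second = torch.logsumexp(alpha * (P - Y * P), dim=0) / alpha
    margin = (true_p - second).sum()              # total margin M, Eq. (6)
    ce = -(Y * torch.log(P + 1e-12)).sum()        # cross-entropy C, Eq. (10)
    return ce - gamma * margin                    # Eq. (11)
```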

Introduction of Optimization Problem

In this section, we introduce the final optimization problem. In practice, the accuracy of each base classifier varies, so the confidences of the base classifiers should naturally differ. Denote the accuracy of the $i$-th base classifier by $w_{i}$, and let $w=(w_{1},\dots,w_{k})^{T}$; $\tilde{w}$ expands $w$ to $kc$ dimensions by repeating each entry $c$ times:

\tilde{w}=(w_{1},\dots,w_{1},\;w_{2},\dots,w_{2},\;\dots,\;w_{k},\dots,w_{k})^{T} \qquad (12)

Assigning confidences to each classifier can be seen as imposing a constraint on the columns of $\Theta$, namely $\Theta^{T}\mathbf{1}=\tilde{w}$.
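The expansion in Eq. (12) is a single `np.repeat` (a sketch with illustrative accuracies):

```python
import numpy as np

w = np.array([0.90, 0.85, 0.95])  # accuracies of k = 3 base classifiers
c = 4                             # number of classes
w_tilde = np.repeat(w, c)         # Eq. (12): each w_t repeated c times, length k*c
```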

In summary, the final optimization problem can be formulated as:

\underset{\Theta}{\min}\;\mathcal{L}\quad\text{s.t.}\quad\Theta^{T}\mathbf{1}=\tilde{w} \qquad (13)

and the loss function $\mathcal{L}$ is

\mathcal{L}=\mathcal{C}-\gamma\mathcal{M}=-\sum_{i=1}^{n}\left(Y_{i}^{T}\log(\mathcal{S}\left(\Theta g_{i}\right))+\gamma Y_{i}^{T}\mathcal{S}\left(\Theta g_{i}\right)\right)+\gamma\sum_{i=1}^{n}\frac{1}{\alpha}\log\left(\sum_{j=1}^{c}e^{\alpha(\mathcal{S}\left(\Theta g_{i}\right)-Y_{i}\odot\mathcal{S}\left(\Theta g_{i}\right))_{j}}\right) \qquad (14)

Optimization Algorithm

The convexity of the optimization problem is crucial for solving it. Performing the softmax operation inevitably transforms the convex function into a non-convex one. Fortunately, apart from the softmax operation, our problem remains convex. In other words, we have the following theorem.

Theorem 1.

The loss function $\mathcal{L}$ is a convex function with respect to $\mathcal{S}(\Theta g_{i})$.

Proof

Assume the matrix $D$ has a single non-zero element $D_{m_{i}m_{i}}=1$. Then $Y_{i}\odot\mathcal{S}(\Theta g_{i})$ can be seen as a linear mapping of $\mathcal{S}(\Theta g_{i})$, i.e., $D\,\mathcal{S}(\Theta g_{i})$, so that

\mathcal{S}(\Theta g_{i})-Y_{i}\odot\mathcal{S}(\Theta g_{i})=(I-D)\,\mathcal{S}(\Theta g_{i}) \qquad (15)

As is well known, if $f(x)$ is convex, then $f(Ax+b)$ is also convex. The log-sum-exp function is a well-known convex function, and the other two terms are evidently convex as well. A non-negative sum of convex functions is convex, which completes the proof.

This favorable convexity allows gradient methods to quickly reach a local or global minimum. The key to solving the above optimization problem is how to handle the constraint $\Theta^{T}\mathbf{1}=\tilde{w}$. Through the following theorem, the constraint can be satisfied naturally by a simple initialization.

Theorem 2.

Let $\nabla_{kl}\mathcal{L}$ denote the $(k,l)$-th element of the gradient of $\mathcal{L}$ with respect to the matrix $\Theta$, i.e., $\left(\frac{\partial\mathcal{L}}{\partial\Theta}\right)_{kl}$. Then $\sum_{k=1}^{c}\nabla_{kl}\mathcal{L}=0$.

Proof

Without loss of generality, we assume there is only one sample, belonging to the $m$-th class; this corresponds to the $m$-th entry of $Y$ being 1 and all others being 0. The summation then disappears, and the loss function can be expressed as:

\mathcal{L}=\mathcal{L}_{1}+\mathcal{L}_{2}+\mathcal{L}_{3}=-Y^{T}\log(\mathcal{S}\left(\Theta g\right))-\gamma Y^{T}\mathcal{S}\left(\Theta g\right)+\gamma\frac{1}{\alpha}\log\left(\sum_{j=1}^{c}e^{\alpha(\mathcal{S}\left(\Theta g\right)-Y\odot\mathcal{S}\left(\Theta g\right))_{j}}\right) \qquad (16)

Let $\delta_{ij}$ denote the Kronecker delta, which is 1 when $i=j$ and 0 otherwise.

According to the chain rule, it is not difficult to verify that the derivative of the $j$-th entry of $\mathcal{S}(\Theta g)$ with respect to $\Theta_{kl}$ satisfies:

\frac{\partial\mathcal{S}(\Theta g)_{j}}{\partial\Theta_{kl}}=\mathcal{S}(\Theta g)_{j}\left(g_{l}\delta_{jk}-\mathcal{S}(\Theta g)_{k}g_{l}\right) \qquad (17)

Therefore, according to Eq. (17), the derivative of the first term $\mathcal{L}_{1}=-Y^{T}\log(\mathcal{S}(\Theta g))$ with respect to $\Theta_{kl}$ is

\frac{\partial\mathcal{L}_{1}}{\partial\Theta_{kl}}=-\frac{\partial\log(\mathcal{S}(\Theta g))_{m}}{\partial\Theta_{kl}}=-\frac{1}{\mathcal{S}(\Theta g)_{m}}\mathcal{S}(\Theta g)_{m}\left(g_{l}\delta_{mk}-\mathcal{S}(\Theta g)_{k}g_{l}\right)=-\left(g_{l}\delta_{mk}-\mathcal{S}(\Theta g)_{k}g_{l}\right) \qquad (18)

Similarly, the partial derivative of the second term with respect to $\Theta_{kl}$ is

\frac{\partial\mathcal{L}_{2}}{\partial\Theta_{kl}}=-\gamma\,\mathcal{S}(\Theta g)_{m}\left(g_{l}\delta_{mk}-\mathcal{S}(\Theta g)_{k}g_{l}\right) \qquad (19)

Noting that $\sum_{k=1}^{c}(g_{l}\delta_{mk}-\mathcal{S}(\Theta g)_{k}g_{l})=0$ (since $\sum_{k}\mathcal{S}(\Theta g)_{k}=1$), we have $\sum_{k=1}^{c}\nabla_{kl}(\mathcal{L}_{1}+\mathcal{L}_{2})=0$. It remains to prove that the derivative of the third term also sums to zero over $k$. It is easy to verify that this derivative can be expressed in the following form:

\frac{\partial\mathcal{L}_{3}}{\partial\Theta_{kl}}=\gamma\frac{\sum_{j=1}^{c}e^{\alpha\mathcal{S}(\Theta g)_{j}}\mathcal{S}^{2}(\Theta g)_{j}\left(g_{l}\delta_{jk}-\mathcal{S}(\Theta g)_{k}g_{l}\right)}{\sum_{j=1}^{c}e^{\alpha\mathcal{S}(\Theta g)_{j}}-e^{\alpha\mathcal{S}(\Theta g)_{m}}+1}-\gamma\frac{e^{\alpha\mathcal{S}(\Theta g)_{m}}\mathcal{S}^{2}(\Theta g)_{m}\left(g_{l}\delta_{mk}-\mathcal{S}(\Theta g)_{k}g_{l}\right)}{\sum_{j=1}^{c}e^{\alpha\mathcal{S}(\Theta g)_{j}}-e^{\alpha\mathcal{S}(\Theta g)_{m}}+1} \qquad (20)

Since Eq. (20) also contains the factor $(g_{l}\delta_{jk}-\mathcal{S}(\Theta g)_{k}g_{l})$, and we know that $\sum_{k=1}^{c}(g_{l}\delta_{jk}-\mathcal{S}(\Theta g)_{k}g_{l})=0$ for any $j$, we have $\sum_{k=1}^{c}\nabla_{kl}\mathcal{L}_{3}=0$.

Thus, we have

\sum_{k=1}^{c}\nabla_{kl}\mathcal{L}=\sum_{k=1}^{c}\nabla_{kl}(\mathcal{L}_{1}+\mathcal{L}_{2}+\mathcal{L}_{3})=0 \qquad (21)

which completes the proof.

It is worth noting that Eqs. (18)-(20) not only help us prove this theorem but also directly provide the gradient formula. This allows us to compute the gradient in closed form and apply gradient-descent algorithms to optimize the objective. The theorem proven above is a very useful property that helps us quickly obtain the optimal solution of the constrained loss function.
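Theorem 2 can also be sanity-checked numerically. The sketch below reuses the `loss_fn` sketch above with random one-hot inputs, lets PyTorch autograd compute $\nabla_{\Theta}\mathcal{L}$, and inspects its column sums:

```python
import torch

torch.manual_seed(0)
c, k, n = 4, 3, 8
G = torch.zeros(k * c, n)                        # random one-hot base predictions
for t in range(k):
    G[t * c + torch.randint(c, (n,)), torch.arange(n)] = 1.0
Y = torch.zeros(c, n)                            # random one-hot labels
Y[torch.randint(c, (n,)), torch.arange(n)] = 1.0

Theta = torch.randn(c, k * c, requires_grad=True)
loss_fn(Theta, G, Y).backward()
print(Theta.grad.sum(dim=0))                     # every column sums to ~0
```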

In fact, in the general case with more than one sample, let $\{m_{1},\dots,m_{n}\}$ denote the true classes of the samples, where $m_{i}\in\{1,\dots,c\}$. Then the full gradient equals

\frac{\partial\mathcal{L}}{\partial\Theta_{kl}}=-\sum_{i=1}^{n}\left(1+\gamma\mathcal{S}(\Theta g_{i})_{m_{i}}\right)\left(g_{il}\delta_{m_{i}k}-\mathcal{S}(\Theta g_{i})_{k}g_{il}\right)+\sum_{i=1}^{n}\gamma\frac{\sum_{j=1}^{c}e^{\alpha\mathcal{S}(\Theta g_{i})_{j}}\mathcal{S}^{2}(\Theta g_{i})_{j}\left(g_{il}\delta_{jk}-\mathcal{S}(\Theta g_{i})_{k}g_{il}\right)}{\sum_{j=1}^{c}e^{\alpha\mathcal{S}(\Theta g_{i})_{j}}-e^{\alpha\mathcal{S}(\Theta g_{i})_{m_{i}}}+1}-\sum_{i=1}^{n}\gamma\frac{e^{\alpha\mathcal{S}(\Theta g_{i})_{m_{i}}}\mathcal{S}^{2}(\Theta g_{i})_{m_{i}}\left(g_{il}\delta_{m_{i}k}-\mathcal{S}(\Theta g_{i})_{k}g_{il}\right)}{\sum_{j=1}^{c}e^{\alpha\mathcal{S}(\Theta g_{i})_{j}}-e^{\alpha\mathcal{S}(\Theta g_{i})_{m_{i}}}+1} \qquad (22)

where $g_{il}$ denotes the $l$-th entry of $g_{i}$ and $n$ is the number of data points used to compute the gradient: all data points for full gradient descent, the batch size for batch gradient descent, and 1 for stochastic gradient descent.

Since $\sum_{k=1}^{c}\nabla_{kl}\mathcal{L}=0$, with an appropriate initialization the constraint is naturally satisfied by gradient-based methods. Specifically, the gradient-descent update is:

\Theta_{p+1}=\Theta_{p}-\beta\,\nabla_{\Theta_{p}}\mathcal{L}(\Theta_{p}) \qquad (23)

where $\beta$ is the learning rate. If the initialization satisfies $\Theta_{0}^{T}\mathbf{1}=\tilde{w}$, then by mathematical induction it is easy to prove that $\Theta^{*T}\mathbf{1}=\tilde{w}$. Therefore, we can initialize each forward slice of the tensor and then unfold it, as shown below.

Figure 2: Initialization diagram

Due to the convexity of our loss function, we can optimize this problem using gradient descent, selecting a batch of $g_{i}$ at a time to compute the loss function and its gradient. The specific procedure is given in Algorithm 1.

Algorithm 1 Gradient Descent
Input: matrix $G$
Output: $\Theta^{*}$
1: Initialize $\Theta_{0}$ according to Figure 2.
2: while not converged do
3:   Randomly select a batch of columns $g_{i}$ from $G$.
4:   Compute $\nabla\mathcal{L}(\Theta_{p})$ by Eq. (22).
5:   Update $\Theta_{p+1}=\Theta_{p}-\beta\nabla\mathcal{L}(\Theta_{p})$.
6: end while
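A compact PyTorch sketch of Algorithm 1, reusing the `loss_fn` and `init_theta` sketches above; autograd stands in for the closed-form gradient of Eq. (22), and the batch size, learning rate, and epoch count are illustrative:

```python
import torch

def train(G, Y, w, c, epochs=100, batch=64, beta=0.1, gamma=5.0):
    # Step 1: initialize Theta_0 as in Figure 2, so Theta_0^T 1 = w_tilde.
    Theta = torch.tensor(init_theta(w, c), dtype=torch.float32, requires_grad=True)
    n = G.shape[1]
    for _ in range(epochs):
        idx = torch.randperm(n)[:batch]          # random batch of columns g_i
        loss_fn(Theta, G[:, idx], Y[:, idx], gamma=gamma).backward()
        with torch.no_grad():
            Theta -= beta * Theta.grad           # Eq. (23); column sums stay fixed
        Theta.grad.zero_()
    return Theta.detach()
```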

Time Complexity Analysis

In this section, we analyze the time complexity of the algorithm. The algorithm computes gradients on a batch of samples at a time. Note that although $g_{i}$ is a vector of dimension $kc$, by definition it is the concatenation of one-hot predictions from the $k$ base classifiers, so $g_{i}$ is sparse: computing $\Theta g_{i}$ amounts to adding selected columns of $\Theta$, with no multiplications needed.

In other words, computing the gradient of each element according to Eq. (22) has time complexity $\mathcal{O}(kc)$. Many common terms can be cached during the computation to simplify it. With $\mathcal{O}(kc^{2})$ parameters, one update costs at most $\mathcal{O}(k^{2}c^{3})$; assuming convergence after $N$ iterations, the total time complexity is at most $\mathcal{O}(Nk^{2}c^{3})$.

However, in practice $k$ and $c$ are usually very small, many operations reduce to plain additions, and hardware acceleration and automatic differentiation in tools like PyTorch make the actual runtime very fast.
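The sparsity argument can be made explicit: since each base prediction is one-hot, $\Theta g_{i}$ is just a sum of $k$ columns of $\Theta$ (a sketch, with `pred_classes` holding the $k$ predicted labels):

```python
import numpy as np

def theta_times_g(Theta, pred_classes, c):
    """Theta @ g_i without multiplications: add column t*c + p_t of each slice t."""
    cols = [t * c + p for t, p in enumerate(pred_classes)]
    return Theta[:, cols].sum(axis=1)
```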

Experiments on toy datasets

We construct a toy dataset with a double-ring shape, which has the advantage of making classification results easy to visualize. To demonstrate our superiority, we compare with the classical Support Vector Classification (SVC) algorithm and the XGBoost algorithm. To show that confidence optimization can "achieve more with less", we compare Random Forests with 10 trees (RF10), 20 trees (RF20), 30 trees (RF30), and 100 trees (RF100), while our algorithm uses 10 trees (OUR10). All tree models use the same maximum depth. The number of trees in the compared Random Forest models is 1×, 2×, 3×, and 10× that of our model, respectively.

The results and classification boundaries of the compared algorithms on the toy dataset are shown in Figure 3. Our algorithm uses $\alpha=10$ and $\gamma=5$. From the classification boundaries, it can be observed that our algorithm effectively improves the learners' prediction performance near the boundaries.

Figure 3: Performance on the toy dataset. (a) Original dataset. (b) OUR10. (c) SVC. (d) XGBoost. (e) RF10. (f) RF20. (g) RF30. (h) RF100.

We divide the dataset into training and test sets with an 80% to 20% ratio. Below are the performances of the algorithms on the dataset:

Table 1: Classification accuracy (%) on the toy dataset
Method OUR10 SVC XGBoost RF10 RF20 RF30 RF100
TrainData 90.2 83.7 90.4 88.3 89.0 89.5 89.9
TestData 88.3 84.0 86.5 86.3 88.0 87.8 88.0
AllData 89.8 83.8 89.7 87.9 88.8 89.2 89.5

From the experimental results on the toy dataset, we can see that although our algorithm does not have the highest accuracy on the training set, it achieves the highest accuracy on both the test set and the entire dataset. This demonstrates that our algorithm effectively suppresses overfitting on the training set and improves overall accuracy, achieving better results with 10 trees compared to the 100-tree Random Forest.

Experiments on real datasets

To further demonstrate the superiority of our algorithm, we conduct experiments on some real-world datasets of varying sizes.

Experimental Settings

We conduct extensive experiments on ten datasets: TR41, warpPIE10P, Movement, Ecoli, Arcene, REUT, GLI85, Carcinom, Lung, and Isolet. Table 2 lists the number of samples, attributes, and classes for each dataset:

Table 2: Description of the benchmark datasets
Datasets #Object #Attribute #Class
TR41 878 7454 10
warpPIE10P 210 2420 10
Movement 360 90 15
Ecoli 336 7 8
Arcene 200 10000 2
REUT 10000 2000 4
GLI85 85 22283 2
Carcinom 174 9782 11
Lung 203 3312 5
Isolet 1560 617 26

Parameter Selection.

Our algorithm has two hyperparameters, $\alpha$ and $\gamma$. The hyperparameter $\alpha$ is fixed at 10 throughout; as long as it is not too small, it does not affect the algorithm's performance. The other parameter, $\gamma$, is chosen randomly from $\{5,10,15,20,25\}$ for each run; to facilitate reproducibility, the random seed is fixed at 0.

Evaluation Metrics.

In our experiments, we report training-set accuracy, test-set accuracy, and overall accuracy, with the training/test split fixed at 80%/20%. The comparison algorithms are the same as those used for the toy dataset, and the base learners in our algorithm and in the Random Forest baselines always share identical basic parameters.

Experimental Results

Our experimental results are recorded in Tables 3, 4, and 5. Table 3 presents the accuracy of various algorithms on the training set, Table 4 records the accuracy of the algorithms on the test set, and Table 5 shows the accuracy of the algorithms on the entire dataset.

Table 3: Classification accuracy on the training set
Dataset OUR10 XGBoost SVC RF10 RF20 RF30 RF100
Movement 1.000 0.997 0.903 0.851 0.865 0.899 0.927
warpPIE10P 1.000 1.000 1.000 0.994 1.000 1.000 1.000
Lung 1.000 1.000 0.944 0.963 0.994 1.000 1.000
Ecoli 1.000 0.963 0.892 0.978 0.993 0.996 1.000
Arcene 1.000 1.000 0.744 0.994 0.994 1.000 1.000
Carcinom 1.000 1.000 0.928 0.993 1.000 1.000 1.000
GLI_85 1.000 1.000 0.706 1.000 1.000 1.000 1.000
Isolet 1.000 1.000 0.983 0.966 0.984 0.984 0.991
TR41 0.990 1.000 0.999 0.977 0.983 0.976 0.984
REUT 0.997 0.929 0.996 0.993 0.998 0.999 1.000
Table 4: Classification accuracy on the test set
Dataset OUR10 XGBoost SVC RF10 RF20 RF30 RF100
Movement 0.764 0.736 0.744 0.611 0.569 0.681 0.667
warpPIE10P 0.976 0.905 1.000 0.905 0.929 0.929 0.952
Lung 0.951 0.927 0.951 0.878 0.951 0.951 0.951
Ecoli 0.882 0.868 0.868 0.853 0.868 0.853 0.868
Arcene 0.850 0.650 0.700 0.750 0.775 0.750 0.775
Carcinom 0.886 0.743 0.686 0.771 0.800 0.857 0.885
GLI_85 0.941 0.765 0.647 0.824 0.882 0.882 0.882
Isolet 0.881 0.878 0.955 0.827 0.881 0.875 0.881
TR41 0.972 0.972 0.943 0.960 0.955 0.955 0.955
REUT 0.897 0.883 0.959 0.888 0.897 0.900 0.902

From the experiments, we can see that for the Random Forest algorithm, accuracy increases with the number of base classifiers, i.e., decision trees. On the vast majority of datasets, our algorithm outperforms the others in accuracy on the training set, the test set, and the entire dataset. It even surpasses the Random Forest with ten times as many base classifiers as ours, achieving the goal of outperforming large-scale ensembles with a smaller one.

Convergence Analysis

To verify that our algorithm indeed has very good convergence properties, we select several datasets and plot the iteration curves of our algorithm on them, as shown in Figure 4. Because the gradient descent method is very fast, we use full gradient descent in these experiments, i.e., computing the gradient over all samples.

Figure 4: Convergence curves of ANCMM. (a) Movement. (b) TR41. (c) warpPIE10P. (d) Lung. (e) Arcene. (f) Isolet.

Because our objective function is partially convex, the algorithm typically converges quickly; as observed in the experiments, convergence is usually achieved within 10 iterations.

Further Discussion

Relationship with Random Forest

In this section, we will discuss the relationship between our algorithm and the classic random forest, assuming that decision trees are used as base classifiers.

Dropout

Dropout typically refers to discarding some learned parameters, i.e., setting them to zero. For our learned confidence tensor $\tilde{\mathbf{\Theta}}$, if we apply the dropout idea and drop all the non-diagonal elements of $\tilde{\mathbf{\Theta}}$, our algorithm reduces to the classic random forest weighted by accuracy.

Fixing Parameters

A more specific case is that after dropping out the non-diagonal elements, we fix the diagonal elements to 1. At this point, each decision tree has the same confidence, and each decision is made by equal-confidence voting. This forms the most classic random forest algorithm.

Since our algorithm has more learnable parameters than the classic random forest, it can be viewed as an extension of it. This means that, as long as our loss function is reasonably designed, our algorithm's performance will be at least as good as that of the classic random forest algorithm.

Why Our Algorithm Performs Better

In this section, we will explain why our algorithm performs better using a simple example.

What Additional Information We Obtained

Assume we have learned a confidence matrix $\Theta$ whose contents are shown in Figure 5. There are three classifiers, with confidences 3, 3.6, and 3, respectively. In fact, $\Theta$ is composed of three confusion-matrix-like slices. We find that the diagonal elements of most columns are very large, indicating that the classifiers have learned the correct behavior in most cases, and we trust them more. However, for the first classifier, the values in the third column are all equal, which means it has not learned how to classify the third class.

Figure 5: Learned $\Theta$

Confidence Optimization

Note that although the first classifier has not learned to distinguish the third class, it is relatively confident about the other classes. Similarly, the second classifier has not learned to classify the first class but is confident about the others. With random forest weighted voting, the errors introduced by these classifiers cannot be ignored. In our algorithm, the uncertain confidences become very small, creating a complementary relationship among the confident classifiers and thereby improving accuracy.

From Bagging to Stacking

Our algorithm seems to learn a set of confidences for bagging, but in fact it can easily be extended to a stacking algorithm. We do not require base classifiers to be of the same type; instead, we can combine and learn confidences for heterogeneous base classifiers such as decision trees, support vector machines, and logistic regression, as sketched below. Furthermore, a potential extension is to analyze the performance of heterogeneous base classifiers on different types of datasets.
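A sketch of this stacking variant, reusing the `build_G` and `train` sketches above with heterogeneous scikit-learn base learners; `X_train` and `y_train` are assumed available with classes labeled $0,\dots,c-1$, and all settings are illustrative:

```python
import numpy as np
import torch
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

models = [DecisionTreeClassifier(max_depth=5),
          SVC(),
          LogisticRegression(max_iter=1000)]
for m in models:
    m.fit(X_train, y_train)

c = len(np.unique(y_train))
G = torch.tensor(build_G(models, X_train, c), dtype=torch.float32)
Y = torch.tensor(np.eye(c)[y_train].T, dtype=torch.float32)
w = np.array([m.score(X_train, y_train) for m in models])  # accuracies as confidences
Theta = train(G, Y, w, c)  # one confidence matrix over heterogeneous learners
```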

Conclusion

In this paper, we focus on how to match the performance of a large number of base learners by integrating only a few. To achieve this, we propose that the integration process should simultaneously improve both accuracy and generalization. For this purpose, we introduce the concept of margin into the loss function and design a smooth loss function with good convexity using the log-sum-exp function. We also prove that the column sums of the gradient satisfy a useful property, allowing us to solve the constrained optimization problem through gradient descent and obtain the optimal solution. We compare our method with random forests having 10 times the number of base learners and with other classical methods, demonstrating that our integration of fewer learners outperforms them in the majority of cases. Finally, we discuss the reasons behind the algorithm's effectiveness and its further extensions.

References

  • Biau and Scornet (2016) Biau, G.; and Scornet, E. 2016. A random forest guided tour. Test, 25: 197–227.
  • Boyd and Vandenberghe (2004) Boyd, S.; and Vandenberghe, L. 2004. Convex optimization. Cambridge university press.
  • Calafiore, Gaubert, and Possieri (2019) Calafiore, G. C.; Gaubert, S.; and Possieri, C. 2019. Log-sum-exp neural networks and posynomial models for convex and log-log-convex data. IEEE transactions on neural networks and learning systems, 31(3): 827–838.
  • Calafiore, Gaubert, and Possieri (2020) Calafiore, G. C.; Gaubert, S.; and Possieri, C. 2020. A Universal Approximation Result for Difference of Log-Sum-Exp Neural Networks. IEEE Transactions on Neural Networks and Learning Systems, 31(12): 5603–5612.
  • Cao et al. (2020) Cao, Y.; Geddes, T. A.; Yang, J. Y. H.; and Yang, P. 2020. Ensemble deep learning in bioinformatics. Nature Machine Intelligence, 2(9): 500–508.
  • Chen et al. (2023) Chen, L.-J.; Zhang, T.; Shi, X.; and Jin, H. 2023. Incremental and Decremental Optimal Margin Distribution Learning. In IJCAI, 3523–3531.
  • Chen and Guestrin (2016) Chen, T.; and Guestrin, C. 2016. Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining, 785–794.
  • Ferrara, Grillo, and Gatto (1973) Ferrara, S.; Grillo, A. F.; and Gatto, R. 1973. Tensor representations of conformal algebra and conformally covariant operator product expansion. Annals of Physics, 76(1): 161–188.
  • Khaled et al. (2023) Khaled, A.; Sebbouh, O.; Loizou, N.; Gower, R. M.; and Richtárik, P. 2023. Unified analysis of stochastic gradient methods for composite convex and smooth optimization. Journal of Optimization Theory and Applications, 199(2): 499–540.
  • Li et al. (2021) Li, B.; Yang, B.; Liu, C.; Liu, F.; Ji, R.; and Ye, Q. 2021. Beyond max-margin: Class margin equilibrium for few-shot object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 7363–7372.
  • Liu et al. (2015) Liu, H.; Liu, T.; Wu, J.; Tao, D.; and Fu, Y. 2015. Spectral ensemble clustering. In Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 715–724.
  • Liu et al. (2024) Liu, J.; Huang, L.-W.; Shao, Y.-H.; Chen, W.-J.; and Li, C.-N. 2024. A nonlinear kernel SVM classifier via L0/1 soft-margin loss with classification performance. Journal of Computational and Applied Mathematics, 437: 115471.
  • Liu et al. (2023) Liu, Z.; Xu, Z.; Jin, J.; Shen, Z.; and Darrell, T. 2023. Dropout reduces underfitting. In International Conference on Machine Learning, 22233–22248. PMLR.
  • Ngo, Beard, and Chandra (2022) Ngo, G.; Beard, R.; and Chandra, R. 2022. Evolutionary bagging for ensemble learning. Neurocomputing, 510: 1–14.
  • Nie, Hao, and Wang (2024) Nie, F.; Hao, Z.; and Wang, R. 2024. Multi-class support vector machine with maximizing minimum margin. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 14466–14473.
  • Shi et al. (2022) Shi, S.; Nie, F.; Wang, R.; and Li, X. 2022. When multi-view classification meets ensemble learning. Neurocomputing, 490: 17–29.
  • Sun et al. (2024) Sun, Z.; Wang, G.; Li, P.; Wang, H.; Zhang, M.; and Liang, X. 2024. An improved random forest based on the classification accuracy and correlation measurement of decision trees. Expert Systems with Applications, 237: 121549.
  • Townsend (1971) Townsend, J. T. 1971. Theoretical analysis of an alphabetic confusion matrix. Perception & Psychophysics, 9: 40–50.
  • Wimmer, Mehnert, and Condurache (2023) Wimmer, P.; Mehnert, J.; and Condurache, A. P. 2023. Dimensionality reduced training by pruning and freezing parts of a deep neural network: a survey. Artificial Intelligence Review, 56(12): 14257–14295.
  • Wood et al. (2023) Wood, D.; Mu, T.; Webb, A. M.; Reeve, H. W.; Lujan, M.; and Brown, G. 2023. A unified theory of diversity in ensemble learning. Journal of Machine Learning Research, 24(359): 1–49.
  • Zhao et al. (2024) Zhao, P.; Zhang, Y.-J.; Zhang, L.; and Zhou, Z.-H. 2024. Adaptivity and non-stationarity: Problem-dependent dynamic regret for online convex optimization. Journal of Machine Learning Research, 25(98): 1–52.
  • Zhou (2014) Zhou, Z.-H. 2014. Large margin distribution learning. In Artificial Neural Networks in Pattern Recognition: 6th IAPR TC 3 International Workshop, ANNPR 2014, Montreal, QC, Canada, October 6-8, 2014. Proceedings 6, 1–11. Springer.
  • Zhou and Feng (2019) Zhou, Z.-H.; and Feng, J. 2019. Deep forest. National science review, 6(1): 74–86.
  • Zounemat-Kermani et al. (2021) Zounemat-Kermani, M.; Batelaan, O.; Fadaee, M.; and Hinkelmann, R. 2021. Ensemble machine learning paradigms in hydrology: A review. Journal of Hydrology, 598: 126266.