Exhaustive Exploitation of Nature-inspired Computation for Cancer Screening in an Ensemble Manner
Abstract
Accurate screening of cancer types is crucial for effective cancer detection and precise treatment selection. However, the association between gene expression profiles and tumors is often limited to a small number of biomarker genes. While computational methods using nature-inspired algorithms have shown promise in selecting predictive genes, existing techniques are limited by inefficient search and poor generalization across diverse datasets. This study presents a framework termed Evolutionary Optimized Diverse Ensemble Learning (EODE) to improve ensemble learning for cancer classification from gene expression data. The EODE methodology combines an intelligent grey wolf optimization algorithm for selective feature space reduction, guided random injection modeling for ensemble diversity enhancement, and subset model optimization for synergistic classifier combinations. Extensive experiments were conducted across 35 gene expression benchmark datasets encompassing varied cancer types. Results demonstrated that EODE obtained significantly improved screening accuracy over individual and conventionally aggregated models. The integrated optimization of advanced feature selection, directed specialized modeling, and cooperative classifier ensembles helps address key challenges in current nature-inspired approaches. This provides an effective framework for robust and generalized ensemble learning with gene expression biomarkers. The EODE source code is openly available on GitHub at https://github.com/wangxb96/EODE.
Index Terms:
Feature selection, Clustering, Ensemble learning, Grey wolf optimizer, Classification

1 Introduction
Cancer has become one of the leading causes of mortality worldwide, resulting in over 10 million deaths in 2020 alone [1]. The heterogeneity and complexity of various cancer types pose significant challenges for timely and accurate diagnosis, prognosis, and treatment planning [2, 3]. Precision oncology aims to overcome these difficulties by leveraging molecular biomarkers and omics data to guide personalized therapeutic decisions [4]. In particular, analysis of cancer gene expression data enables identification of discriminative genes and pathways involved in pathogenesis, which can inform diagnostic tests, prognostic indicators, and drug targets [5, 6].
However, several analytical difficulties impose barriers to identifying robust molecular biomarkers from gene expression data. Small sample sizes coupled with extremely high dimensionality and sparsity of the data make computational analysis statistically underpowered [7]. Technical noise, batch effects, tumor heterogeneity, and variability between patients also confound analyses [8, 9]. Effective and robust computational methods are therefore urgently needed to overcome these challenges and accurately detect differentially expressed genes from such complex high-dimensional datasets across diverse cancer types. This can support development of gene expression-based biomarkers for precision oncology applications.
A variety of computational approaches have been applied for cancer gene expression analysis and biomarker identification, including machine learning, deep learning, and nature-inspired optimization algorithms [10, 11, 12]. In particular, swarm intelligence and evolutionary algorithms like particle swarm optimization (PSO) [13], ant colony optimization (ACO) [14], genetic algorithms [15], and enhanced optimizer variants [16, 17, 18, 19] have shown promise. While achieving promising results, further improvements in accuracy, robustness, and generalization ability are still possible. A key limitation is that most methods rely on a single learner algorithm, which makes it difficult to determine the universally optimal learner across diverse cancer types and datasets. Different algorithms have distinct strengths and weaknesses, so their performance varies. Relying on just one also reduces robustness.
Ensemble learning methods which combine multiple diverse base learner models can help address these pitfalls [20]. Strategies like bagging [21] and boosting [22] train multiple base models on randomized or reweighted data versions, then aggregate predictions to reduce variance and bias. Such ensembles have proven effective for tasks ranging from cancer subtype classification [23, 24] to drug response modeling [25]. However, naively combining all base learner models can limit diversity, leading to redundant representations and suboptimal performance [26]. Recent studies have explored intelligent optimizer-guided selection of ensemble subsets to promote specialization and synergy among members [27, 28, 29, 30, 31]. For instance, genetic algorithms have been applied to search the space of model combinations, selecting only classifiers that maximize validation accuracy through cooperative interactions [32]. While showing promise, these approaches generally utilize the full, high-dimensional feature space, which can retain irrelevant variables that confuse models and constrain diversity. Advanced feature selection is needed to derive maximally informative biomarker subsets tailored for ensemble learning [33]. Furthermore, diversity enhancement techniques like bagging and boosting are insufficient to fully overcome representation redundancies during model training [34]. Novel forms of controlled randomness injection could better promote specialization by guiding different models to focus on distinct explanatory data facets [35, 27]. Overall there remains great opportunity to advance ensemble classifier performance by integrating intelligent feature selection, guided diversity induction, and metaheuristic optimization of cooperative model combinations [26, 36]. This can further evolve the state-of-the-art in ensemble methods for precision medicine applications.
In this work, we propose a novel nature-inspired feature selection algorithm, optimized ensemble classifier, and diversity-enhancing ensemble strategy by integrating the grey wolf optimizer (GWO). Our approach, called Evolutionary Optimized Diverse Ensemble learning (EODE), synergistically combines GWO-based wrapper feature selection, diversity injection via randomized model training, and evolutionary optimization for constructing optimal ensemble classifiers. Specifically, GWO efficiently searches the high-dimensional gene expression space to identify an informative subset of discriminative features for cancer diagnosis. Multiple diverse base classifiers (e.g., SVM, KNN) are trained on these selected features while introducing randomness to increase diversity. Finally, GWO optimizes the selection and integration of ensemble members to maximize performance on validation data. EODE enhances generalization ability by leveraging GWO's feature selection, controlled randomness injection, and metaheuristic ensemble optimization. We evaluate EODE on cancer gene expression datasets for subtype classification and outcome prediction, while also assessing the size of the selected feature subsets. Results demonstrate that EODE significantly improves accuracy and robustness over 23 state-of-the-art methods on 35 cancer gene expression datasets. The integrated strategy advances biomarker discovery and precision oncology by evolving high-performance diverse ensemble classifiers. The main steps of the EODE approach are as follows:
1. Base classifiers: The diversity among the base classifiers is crucial to the effectiveness of the ensemble. The base classifiers can be any suitable classification algorithms, such as decision trees, support vector machines, or neural networks. In this study, six base classifiers are used: Discriminant Analysis (DISCR), Decision Tree (DT), K-Nearest Neighbor (KNN), Artificial Neural Networks (ANN), Support Vector Machine (SVM), and Naive Bayes (NB).
2. Classifier selection: To mitigate the high computational cost associated with using ensemble methods in the feature selection training process, all base classifiers are initially trained with five-fold cross-validation on the original training data. The best-performing base classifier is then selected to participate in the feature selection stage. This ensures, to a reasonable extent, that a learner well suited to each dataset participates in training.
3. Feature selection: GWO is employed to search for an optimal subset of genes that are most relevant to cancer diagnosis. The fitness function evaluates the quality of each feature subset based on classification performance and the size of the feature subset. GWO optimizes the feature subset by iteratively updating the positions of grey wolves based on their fitness values.
4. Ensemble diversity enhancement: To increase the diversity of the ensemble, techniques such as bagging, boosting, or the random subspace method can be employed. Here, we generate multiple random subspaces through K-means clustering and use the resulting data clusters to train base classifiers, producing a pool of models.
5. Model pool optimization: Directly fusing all models in the pool can lower inference efficiency, and the presence of low-quality models may degrade overall performance. Therefore, before final model evaluation, we optimize the model pool: we first perform pre-optimization, discarding models that perform below average on the validation set, and then apply the GWO algorithm to the remaining models to select the best possible combination.
6. Evaluation and validation: The performance of the EODE model is evaluated using metrics such as accuracy, average performance across datasets, and the size of the selected feature subset. The predictions of the selected models are combined using plurality voting to produce the final classification result. Moreover, cross-validation and independent validation datasets are used to assess the generalization ability of the model.
2 Methods
2.1 Methodology Overview of EODE
In this study, we present a novel nature-inspired method called EODE for rapid identification of biomarker genes for multiple cancer types across multiple cancer gene expression datasets. A schematic overview of the algorithm is provided in Figure 1. The original input gene expression data is denoted $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ represents a sample with $d$ genes, $y_i$ belongs to the set of consensus molecular subtypes, and $n$ is the total number of samples.

In the feature selection step, we employ GWO to extract relevant biomarker genes after training our model on the training gene expression matrix $X_{train}$. Each base classifier from the pool (Discriminant Analysis (DISCR), Decision Tree (DT), K-Nearest Neighbor (KNN), Artificial Neural Networks (ANN), Support Vector Machine (SVM), and Naive Bayes (NB)) is initially trained using the input data. The best-performing classifier is then chosen as the evaluation classifier for feature selection.
The processed data is subsequently utilized to train and optimize a diverse ensemble model. Specifically, the data undergoes five-fold cross-validation to construct the final ensemble model. Initially, the data is partitioned into progressive subspaces using the K-means method to form clusters. These clusters are then used to train base classifiers, which are subsequently incorporated into the model pool. Models in the pool with below-average performance are filtered out. After that, the GWO approach is applied to optimize the model pool and identify the best possible combination. Finally, the model is evaluated on the test data using a plurality voting strategy. The overall framework of EODE is summarized in Algorithm 1.
2.2 Nature-inspired Feature Selection
Consider a training cancer gene expression dataset $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i \in \mathbb{R}^{d}$ represents the feature vector and $d$ denotes the number of features, $y_i$ belongs to the set of class labels, and $n$ is the number of samples. It is important to note that the high-dimensional gene expression data may include many irrelevant genes, which can negatively impact identification accuracy while increasing computational time [7]. Therefore, performing feature selection is crucial to preprocess the data effectively.
The Grey Wolf Optimizer (GWO), initially proposed by Mirjalili [37], is a swarm intelligence algorithm inspired by the social hierarchy and hunting behavior of grey wolves in nature. GWO offers advantages such as good convergence, minimal parameter tuning, and ease of implementation [38]. The core concept of GWO revolves around three primary predation behaviors: encircling prey, hunting, and attacking prey, which are performed based on the social hierarchy among the wolves. The social hierarchy in GWO consists of four levels: $\alpha$, $\beta$, $\delta$, and $\omega$, with $\alpha$ being the dominant wolf, followed by $\beta$ and $\delta$, while the remaining wolves are labeled as $\omega$. Wolves at higher ranks exert dominance over those at lower ranks; $\alpha$, $\beta$, and $\delta$ play key roles in the algorithm, with $\alpha$ being the pack leader and $\beta$ and $\delta$ serving as potential successors. The $\alpha$ wolf represents the fittest solution and guides the pack towards promising search areas. The second and third best solutions are modeled as the $\beta$ and $\delta$ wolves, respectively. The $\omega$ wolves represent the remaining weaker candidate solutions that follow the guidance of the $\alpha$, $\beta$, and $\delta$ wolves. During optimization, the candidate solutions iteratively update their positions towards the best three solutions until convergence upon the global optimum. A schematic representation of GWO is depicted in Fig. 2.
Building upon these foundations, we propose a nature-inspired feature selection method based on GWO, which comprises six essential components: classifier selection, population initialization, encircling prey phase, hunting phase, attacking phase, and feature selection objective function.
2.2.1 Classifier Selection
To evaluate the feature selection results, we consider six base classifiers in a classifier pool: Discriminant Analysis (DISCR), Decision Tree (DT), K-Nearest Neighbor (KNN), Artificial Neural Networks (ANN), Support Vector Machine (SVM), and Naive Bayes (NB). However, incorporating all of these classifiers into the ensemble method during the feature selection phase would be computationally expensive. Therefore, we adopt a pre-training approach to select the best-performing classifier from the pool. The cancer gene expression data is subjected to five-fold cross-validation on each base classifier, and the classifier with the highest performance is chosen as the evaluation classifier for the feature selection phase. This allows us to efficiently select the most suitable classifier for the subsequent feature selection process.
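To make this step concrete, the following is a minimal sketch of the pre-training selection in Python/scikit-learn (the released EODE implementation is in MATLAB, so the function name `select_evaluation_classifier` and any estimator settings beyond those listed in TABLE 2 are illustrative assumptions, not the authors' code):

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

def select_evaluation_classifier(X_train, y_train):
    """Pick the base classifier with the best 5-fold CV accuracy."""
    pool = {
        "DISCR": LinearDiscriminantAnalysis(),
        "DT": DecisionTreeClassifier(),
        "KNN": KNeighborsClassifier(n_neighbors=3),   # K = 3, as in TABLE 2
        "ANN": MLPClassifier(max_iter=500),           # iteration cap assumed
        "SVM": SVC(kernel="rbf"),                     # RBF kernel, as in TABLE 2
        "NB": GaussianNB(),
    }
    scores = {name: cross_val_score(clf, X_train, y_train, cv=5).mean()
              for name, clf in pool.items()}
    best = max(scores, key=scores.get)                # highest mean CV accuracy
    return pool[best], best
```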
2.2.2 Population Initialization
In the beginning, the population is randomly created and represented as real numbers. Each individual, denoted as $X_i$, is a set of genes: $X_i = (x_{i1}, x_{i2}, \ldots, x_{id})$, where $x_{ij}$ represents the $j$th gene and $d$ is the total number of genes.
To convert these real numbers into a binary form, we use a threshold value $\theta$. If a feature value $x_{ij}$ is greater than or equal to $\theta$, it is set to 1, indicating that the corresponding feature is selected. Otherwise, it is set to 0, indicating that the feature is not selected. The conversion is as follows:

$x_{ij} = \begin{cases} 1, & \text{if } x_{ij} \geq \theta \\ 0, & \text{otherwise} \end{cases}$  (1)

After that, the position of each individual is represented by a binary (0/1) string.
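A minimal sketch of this initialization and the Eq. (1) binarization, assuming the threshold $\theta = 0.5$ reported in Section 3.3 (a Python stand-in for the MATLAB original):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def init_population(pop_size, n_genes, theta=0.5):
    """Real-valued wolf positions in [0, 1]; gene j is selected when x_ij >= theta."""
    positions = rng.random((pop_size, n_genes))  # continuous search space
    masks = (positions >= theta).astype(int)     # Eq. (1): binary 0/1 strings
    return positions, masks
```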
2.2.3 Encircling Prey Phase
The "encircling prey" behavior is a strategy employed by the grey wolf pack to search for feature subsets. This behavior is mathematically modeled to simulate how the grey wolf gradually approaches and surrounds its prey. The distance between the grey wolf and the prey is determined by the equation:

$D = |C \cdot X_p(t) - X(t)|$  (2)

where $D$ represents the distance between them, $t$ denotes the current iteration, and $X_p$ and $X$ represent the position vectors of the prey and the grey wolf, respectively.
To update the position of the grey wolf, we utilize the formula $X(t+1) = X_p(t) - A \cdot D$, where $A = 2a \cdot r_1 - a$ and $C = 2 \cdot r_2$. Here, $a$ is the convergence factor that decreases linearly from 2 to 0 as the iterations progress; it is calculated as $a = 2 - 2t/T_{max}$, where $t$ represents the current iteration and $T_{max}$ is the maximum number of iterations defined for the search process. Additionally, $r_1$ and $r_2$ are random numbers between 0 and 1.
By applying this position update formula, the grey wolf adjusts its position towards the prey. The term $A \cdot D$ determines the magnitude and direction of the movement, while the distance $D$ guides the grey wolf in narrowing the gap with the prey. The process continues iteratively until the maximum number of iterations is reached ($t = T_{max}$). Ultimately, the grey wolf is expected to encircle the prey, indicating the discovery of a promising feature subset.

2.2.4 Hunting Phase
Grey wolves possess the ability to identify the general location of their prey and work together to surround it. However, in many unknown situations, they may not have precise knowledge of the exact location of the target. In our study, we simulate this behavior by introducing three key individuals: $\alpha$, $\beta$, and $\delta$. These individuals guide the entire wolf pack in surrounding the prey and searching for the optimal solution.
To track the position of the prey, each individual grey wolf calculates its distance to the prey using the following equations:

$D_\alpha = |C_1 \cdot X_\alpha - X|$  (3)

$D_\beta = |C_2 \cdot X_\beta - X|$  (4)

$D_\delta = |C_3 \cdot X_\delta - X|$  (5)

Here, $D_\alpha$, $D_\beta$, and $D_\delta$ represent the distances between the grey wolves $\alpha$, $\beta$, $\delta$ and the prey, respectively. $X_\alpha$, $X_\beta$, and $X_\delta$ denote the positions of $\alpha$, $\beta$, and $\delta$, while $X$ represents the current position of the grey wolf. Additionally, $C_1$, $C_2$, and $C_3$ are random vectors used to calculate these distances.
Each grey wolf updates its position based on these distance calculations:

$X_1 = X_\alpha - A_1 \cdot D_\alpha$  (6)

$X_2 = X_\beta - A_2 \cdot D_\beta$  (7)

$X_3 = X_\delta - A_3 \cdot D_\delta$  (8)

Here, $X_1$, $X_2$, and $X_3$ represent the new positions of the grey wolves moving towards $\alpha$, $\beta$, and $\delta$, respectively. The coefficients $A_1$, $A_2$, and $A_3$ control the magnitude of the movement towards the prey.
Finally, the position of the grey wolf at the next time step is determined as the average of the positions $X_1$, $X_2$, and $X_3$:

$X(t+1) = \dfrac{X_1 + X_2 + X_3}{3}$  (9)

In this way, the entire wolf pack moves together towards the positions of $\alpha$, $\beta$, and $\delta$, and the new position of each individual is updated accordingly.
2.2.5 Attacking Phase
The final stage of the hunting process is the attack, during which the grey wolves aim to capture their prey and obtain the optimal solution. This phase involves adjusting certain parameters to strike a balance between global exploration and local exploitation.
To achieve this balance, two key parameters are considered: $a$ and $A$. The value of $a$ is progressively decreased from 2 to 0 in a linear manner. Simultaneously, the range of fluctuations in $A$ is reduced, since $A$ takes on values within the range $[-2a, 2a]$. The behavior of the grey wolves is influenced by the magnitude of $A$. When $|A| > 1$, the grey wolves tend to spread out across different areas, enabling a global search for prey. Conversely, when $|A| < 1$, the grey wolves exhibit a more focused, local search.
In addition to these parameters, the influence of the grey wolves' positions on the prey is governed by a random weight, denoted as $C$. This weight, which ranges between 0 and 2, determines the random influence of the grey wolf's location on the prey. A value of $C$ greater than 1 emphasizes the significance of the grey wolf's position in guiding the search, whereas a value less than 1 assigns a lower weight, reducing its impact. This random weight helps prevent the algorithm from converging too early and becoming trapped in a local optimum.
By dynamically adjusting the values of $a$, $A$, and $C$ during the attacking phase, the grey wolves strike a balance between exploration and exploitation, allowing them to efficiently search for and capture the optimal solution while avoiding premature convergence and local optima.
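Pulling Eqs. (2)-(9) together, the sketch below shows one illustrative GWO position update in Python; it is a generic implementation of the standard operator under the parameterization above (not the authors' MATLAB code), with the pack stored as a 2-D array of real-valued positions:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def gwo_step(wolves, x_alpha, x_beta, x_delta, t, t_max):
    """Move every wolf towards the alpha, beta and delta leaders (Eqs. (3)-(9))."""
    a = 2 - 2 * t / t_max                         # convergence factor: 2 -> 0
    new_positions = np.empty_like(wolves)
    for i, x in enumerate(wolves):
        candidates = []
        for leader in (x_alpha, x_beta, x_delta):
            r1, r2 = rng.random(x.shape), rng.random(x.shape)
            A = 2 * a * r1 - a                    # |A| > 1 explores, |A| < 1 exploits
            C = 2 * r2                            # random weight in [0, 2]
            D = np.abs(C * leader - x)            # distance to leader, Eqs. (3)-(5)
            candidates.append(leader - A * D)     # Eqs. (6)-(8)
        new_positions[i] = np.mean(candidates, axis=0)  # Eq. (9): average of X1..X3
    return new_positions
```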
2.2.6 Feature Selection Objective Function
During each iteration of the GWO algorithm, the classification label for each candidate solution $X_i$ is predicted using the evaluation classifier selected in the classifier selection phase. Specifically, the evaluation classifier is initially trained on the original training gene expression dataset with all features using five-fold cross-validation. For each $X_i$ containing a subset of selected features, the evaluation classifier generates predicted labels by classifying the corresponding training data points using only the features selected in $X_i$. The classifier's performance on $X_i$ determines the fitness value assigned to that solution. This allows the GWO algorithm to determine the $\alpha$, $\beta$, and $\delta$ solutions representing the current best feature subsets for classification.
In the feature selection stage, the primary objective is to identify and select relevant features while filtering out redundant ones for subsequent identification purposes in cancer gene expression data. Traditional studies often focus solely on classification accuracy, disregarding the resource costs associated with redundant features. In our study, we address this limitation by considering both classification accuracy and the size of the feature subsets as part of our feature selection objective function [39].
The objective function, denoted as $F$, is defined as follows:

$F = \alpha \cdot error + \beta \cdot \dfrac{R}{dim}$  (10)

Here, $R$ represents the number of selected features during the evolutionary process, and $dim$ represents the total number of features in the dataset. To strike a balance between the two objectives, we introduce the weight coefficients $\alpha$ and $\beta$ to control their relative importance. In our study, we set $\alpha = 0.9$ to emphasize the significance of classification accuracy, while $\beta$ is set to 0.1 to account for the feature subset size. These weight coefficients were determined based on the findings in reference [40], where classification accuracy was identified as the primary objective.
The classification error (error) is a key component of the objective function. It is calculated as the difference between 1 and the accuracy (acc), which is defined as:

$error = 1 - acc$  (11)

$acc = \dfrac{1}{N} \sum_{i=1}^{N} I(\hat{y}_i, y_i)$  (12)

In the above equations, $N$ represents the total number of instances, $\hat{y}_i$ represents the predicted class label for instance $i$, and $y_i$ represents the true class label for instance $i$. The function $I(\hat{y}_i, y_i)$ evaluates to 1 if the predicted and true class labels match, and 0 otherwise.
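A hedged sketch of this fitness computation (Eqs. (10)-(12)) with $\alpha = 0.9$ and $\beta = 0.1$; `clf` is the evaluation classifier chosen in Section 2.2.1, and the guard against empty feature subsets is our own addition:

```python
import numpy as np
from sklearn.model_selection import cross_val_score

def feature_fitness(mask, X, y, clf, alpha=0.9, beta=0.1):
    """Lower is better: weighted five-fold CV error plus relative subset size."""
    selected = np.flatnonzero(mask)
    if selected.size == 0:        # assumed guard: an empty subset is invalid
        return 1.0
    acc = cross_val_score(clf, X[:, selected], y, cv=5).mean()
    error = 1.0 - acc             # Eq. (11)
    return alpha * error + beta * selected.size / X.shape[1]   # Eq. (10)
```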
2.3 Nature-inspired Diverse Ensemble Learning
In this section, we propose a novel nature-inspired diverse ensemble learning method to improve the performance of cancer identification using selected features obtained through nature-inspired feature selection. Our method comprises diverse subspace generation, model pool generation, and model pool optimization.
2.3.1 Diverse Subspace Generation
Given the gene expression data after feature selection, denoted as $\{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ represents the feature vector over the selected features, $y_i$ represents the classification label, and $n$ represents the number of input samples, we employ the K-means method [41] to cluster the input cancer gene expression data into multiple clusters. The clustering process is performed iteratively for $k = 1$ to $K$, generating $k$ clusters in each iteration, where $K$ denotes the total number of iterations. In each run, the clusters are obtained by minimizing the following function:

$J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \| x_i - \mu_j \|^2$  (13)

where $x_i$ represents the feature vector and $\mu_j$ is the centroid of cluster $C_j$. This clustering process generates a set of diverse subspaces composed of all the obtained clusters.
2.3.2 Model Pool Generation
Each cluster in the diverse subspace is used to train six classifiers (DISCR, DT, KNN, ANN, SVM, and NB) to create a model pool. The base classifiers used in this step are independent of one another. The resulting models are added to the base model pool $P$, which consists of $m \times 6$ models, where $m$ is the total number of clusters and 6 is the number of base classifiers. Finally, we employ nature-inspired optimization techniques to refine the base models in the ensemble; any combination of classifiers can be utilized. A sketch of this pool-construction step follows.
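The sketch below approximates the subspace generation and pool construction just described (a Python/scikit-learn stand-in for the MATLAB original; skipping clusters containing a single class is our own guard, since the paper does not specify how such clusters are handled):

```python
import numpy as np
from sklearn.base import clone
from sklearn.cluster import KMeans

def build_model_pool(X, y, base_classifiers, K):
    """Cluster with k = 1..K, then fit every base classifier on each cluster."""
    pool = []
    for k in range(1, K + 1):
        labels = KMeans(n_clusters=k, n_init=10).fit_predict(X)
        for c in range(k):
            idx = labels == c
            if np.unique(y[idx]).size < 2:   # assumed guard for one-class clusters
                continue
            for clf in base_classifiers:     # six independent base learners
                pool.append(clone(clf).fit(X[idx], y[idx]))
    return pool
```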
2.3.3 Model Pool Optimization
After obtaining the diverse base model pool $P$, we propose a pre-optimization step that refines $P$ by removing models with below-average performance. Subsequently, we incorporate a nature-inspired optimization method, namely GWO, to further optimize the pre-optimized base model pool $P'$.
Population Initialization: The population is randomly initialized, and each individual is represented as follows:

$X_i = (m_1, m_2, \ldots, m_S)$  (14)

Here, $m_j$ represents a classifier in the model pool $P'$, and $S$ is the total number of models in $P'$. As in the nature-inspired feature selection, selection or non-selection of a model is indicated by a binary value: "1" indicates that a model is selected, while "0" indicates that it is not. To convert the continuous search space of GWO into a binary search space, we introduce a threshold $\theta$. The conversion from a continuous position to discrete binary values is defined as follows:

$m_j = \begin{cases} 1, & \text{if } m_j \geq \theta \\ 0, & \text{otherwise} \end{cases}$  (15)

Nature-inspired Optimization Process: In this phase, our aim is to discover optimal model subsets by optimizing the pre-optimized base model pool $P'$. The population is used to explore optimal model subsets in the encircling phase, identify potential optimal solutions in the hunting phase, and ultimately obtain the optimal solution in the attacking phase.
Ensemble Optimizing Objective Function: Our objective is to achieve the highest identification performance with the smallest ensemble size. After clustering the data following feature selection and training the base classifiers to create a model pool, we aim to optimize the model pool to obtain the optimal ensemble model with the smallest size. The optimized model ensemble is then evaluated using the test data. The objective function in the model pool optimization stage, denoted as $F_m$, is defined as follows:

$F_m = \alpha \cdot error + \beta \cdot \dfrac{S_{sel}}{S}$  (16)

Here, error represents the identification error rate described in Equation (11), $S_{sel}$ is the number of selected models, and $S$ is the total number of models in $P'$. The settings of $\alpha$ and $\beta$ are identical to those in Section 2.2.6, with the error term accounting for 90% of the importance and the ensemble size for 10%.
However, unlike Section 2.2.6, where the label is predicted by a single classifier, here we consider an ensemble of multiple models. We employ a plurality voting method to combine the predictions of the multiple models, which has been proven to be a simple and effective ensemble fusion technique in many studies [42, 43].
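A minimal sketch of plurality voting over a selected model subset and the Eq. (16) ensemble fitness, assuming integer-encoded class labels (an illustrative Python approximation of the MATLAB pipeline; the empty-ensemble guard is our own addition):

```python
import numpy as np

def plurality_vote(models, X):
    """Return, for each sample, the class predicted by the most models."""
    preds = np.stack([m.predict(X) for m in models]).astype(int)  # (n_models, n_samples)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, preds)

def ensemble_fitness(selection_mask, pool, X_val, y_val, alpha=0.9, beta=0.1):
    """Eq. (16): weighted validation error plus relative ensemble size."""
    chosen = [m for m, keep in zip(pool, selection_mask) if keep]
    if not chosen:                               # assumed guard: empty ensembles
        return 1.0
    error = np.mean(plurality_vote(chosen, X_val) != y_val)
    return alpha * error + beta * len(chosen) / len(pool)
```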
2.3.4 Ensemble Classifier Prediction
During the training process, we obtain the optimized set of models constituting the final ensemble $E$. All models in $E$ are used to predict the test set, and their predicted class labels are fused using the plurality voting method. The identification accuracy can then be calculated using Equation (12).
2.4 Time Complexity Analysis
Here, we analyze the time complexity of our proposed EODE algorithm. The detailed analysis is outlined as follows:
- Feature Selection: The time complexity of the feature selection process depends on the algorithm used. Since we use GWO, the complexity is typically $O(T \cdot P \cdot F \cdot C)$, where $T$ is the number of generations, $P$ is the population size, $F$ is the number of features, and $C$ is the complexity of the fitness evaluation function. The feature selection process therefore has polynomial time complexity.
- Diverse Subspace Generation: The time complexity of the diverse subspace generation mainly depends on the clustering algorithm used. For the K-means algorithm applied here, it is usually $O(K \cdot N \cdot I \cdot d)$, where $K$ is the number of clusters, $N$ is the number of data points, $I$ is the number of iterations, and $d$ is the dimensionality of the data. This step has polynomial time complexity.
- Model Pool Generation: The model pool generation involves training multiple base classifiers on each cluster. Assuming the time complexity of training a base classifier on a single cluster is $O(N \cdot F \cdot C)$, where $N$ is the number of data points, $F$ is the number of selected features, and $C$ is the complexity of the training algorithm, the overall time complexity is $O(L \cdot N \cdot F \cdot C)$, where $L$ is the number of clusters. This process also has polynomial time complexity.
- Model Pool Optimization: The time complexity of this stage depends on the optimization algorithm used. With GWO, it is typically $O(T \cdot P \cdot C)$, where $T$ is the number of generations, $P$ is the population size, and $C$ is the complexity of the fitness evaluation function. Like feature selection, this stage has polynomial time complexity.
- Ensemble Classifier Prediction: The time complexity depends on the number of models in the ensemble and the cost of combining their predictions. With $M$ models and an $O(M)$ combination step (plurality voting), the overall time complexity is $O(M)$, which is linear.
In summary, the overall time complexity is the sum of the stages: $O(T \cdot P \cdot F \cdot C) + O(K \cdot N \cdot I \cdot d) + O(L \cdot N \cdot F \cdot C) + O(T \cdot P \cdot C) + O(M)$.
Since all of these terms are polynomial, the sum is dominated by its largest term. Therefore, the overall time complexity of the EODE algorithm is:
Overall Time Complexity $= O(\max\{T \cdot P \cdot F \cdot C,\ K \cdot N \cdot I \cdot d,\ L \cdot N \cdot F \cdot C,\ T \cdot P \cdot C,\ M\})$
3 Implementation
3.1 Datasets
The cancer gene expression datasets were collected from [44] and can be downloaded from https://schlieplab.org/Static/Supplements/CompCancer/datasets.htm. The 35 datasets cover multiple types of cancers with high-dimensional features (most exceeding 1000 dimensions) and relatively small sample sizes (as shown in TABLE 1). This poses the "Curse of Dimensionality" challenge, necessitating the development of a computational model with high robustness and good generalization capabilities across the different cancers.
To enable rigorous evaluation, the collected raw datasets were randomly split into disjoint training and testing sets in an 80:20 ratio prior to conducting experiments. The training sets, comprising 80% of the data, were used for model training and hyperparameter tuning. The testing sets, comprising the held-out 20% of the data, were only used for final evaluation of the fully trained model's performance, ensuring an unbiased estimate of generalization capability. The splits were generated randomly while preserving class balance in each set. The training and testing datasets can be downloaded from the following links: https://github.com/wangxb96/EODE/tree/master/TrainData and https://github.com/wangxb96/EODE/tree/master/TestData. Five-fold cross-validation was used for model selection and hyperparameter optimization on the training data only. By segregating the training and testing data, we prevent information leakage and overfitting to the test set. This rigorous methodology allows us to evaluate true generalization error and robustness across multiple cancer types.
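For reference, a class-balance-preserving 80:20 split of the kind described above can be reproduced with scikit-learn as sketched below (synthetic stand-in data; the actual fixed splits are the files linked above):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-ins for a real expression matrix (samples x genes) and class labels.
X = np.random.rand(100, 1500)
y = np.random.randint(0, 3, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)  # stratified 80:20 split
```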
Dataset | Tissue | Samples | Features | Classes | Dataset | Tissue | Samples | Features | Classes |
---|---|---|---|---|---|---|---|---|---|
Alizadeh-2000-v1 | Blood | 42 | 1095 | 2 | Alizadeh-2000-v2 | Blood | 62 | 2093 | 3 |
Alizadeh-2000-v3 | Blood | 62 | 2093 | 4 | Armstrong-2002-v1 | Blood | 72 | 1081 | 2 |
Armstrong-2002-v2 | Blood | 72 | 2194 | 3 | Bhattacharjee-2001 | Lung | 203 | 1543 | 5 |
Bittner-2000 | Skin | 38 | 2201 | 2 | Bredel-2005 | Brain | 50 | 1739 | 3 |
Chen-2002 | Liver | 179 | 85 | 2 | Chowdary-2006 | Breast, Colon | 104 | 182 | 2 |
Dyrskjot-2003 | Bladder | 40 | 1203 | 3 | Garber-2001 | Lung | 66 | 4553 | 4 |
Golub-1999-v1 | Bone Marrow | 72 | 1877 | 2 | Golub-1999-v2 | Bone Marrow | 72 | 1877 | 3 |
Gordon-2002 | Lung | 181 | 1626 | 2 | Khan-2001 | Multi-tissue | 83 | 1069 | 4 |
Laiho-2007 | Colon | 37 | 2202 | 2 | Lapointe-2004-v1 | Prostate | 69 | 1625 | 3 |
Lapointe-2004-v2 | Prostate | 110 | 2496 | 4 | Liang-2005 | Brain | 37 | 1411 | 3 |
Nutt-2003-v1 | Brain | 50 | 1377 | 4 | Nutt-2003-v2 | Brain | 28 | 1070 | 2 |
Nutt-2003-v3 | Brain | 22 | 1152 | 2 | Pomeroy-2002-v1 | Brain | 34 | 857 | 2 |
Pomeroy-2002-v2 | Brain | 42 | 1379 | 5 | Ramaswamy-2001 | Multi-tissue | 190 | 1363 | 14 |
Risinger-2003 | Endometrium | 42 | 1771 | 4 | Shipp-2002-v1 | Blood | 77 | 798 | 2 |
Singh-2002 | Prostate | 102 | 339 | 2 | Su-2001 | Multi-tissue | 174 | 1571 | 10 |
Tomlins-2006-v1 | Prostate | 104 | 2315 | 5 | Tomlins-2006-v2 | Prostate | 92 | 1288 | 4 |
West-2001 | Breast | 49 | 1198 | 2 | Yeoh-2002-v1 | Bone Marrow | 248 | 2526 | 2 |
Yeoh-2002-v2 | Bone Marrow | 248 | 2526 | 6 | - | - | - | - | - |
3.2 Baselines
To evaluate the effectiveness of our proposed method, we compared it against several existing classifiers and ensemble algorithms widely used in the literature. Firstly, we compared our model with six base classifiers: DISCR (Discriminant Analysis) [45], DT (Decision Tree) [46], KNN (K-Nearest Neighbor) [47], ANN (Artificial Neural Networks) [48], SVM (Support Vector Machine) [49], and NB (Naive Bayes) [50]. These classifiers serve as the baseline for performance comparison.
Next, we compared our approach with seven evolutionary algorithms: ACO [51], CS [52], DE [53], GA [54], GWO [37], PSO [55], and ABC [56]. These algorithms are widely used for optimization problems. Furthermore, we evaluated our approach against four novel ensemble methods: PSOEL [27], EAEL [57], FESM [58], and GA-Bagging-SVM [59]. These methods were selected to demonstrate the effectiveness of our proposed approach in comparison to recent advancements in ensemble learning.
In addition, we compared our ensemble algorithm with six state-of-the-art ensemble classifiers: Random Forests (RF) [60], ADABOOST [22], RUSBOOST [61], SUBSPACE [62], TOTALBOOST [63], and LPBOOST [64]. Random Forests is a well-known bagging method [60], while ADABOOST is a popular boosting method [22]. RUSBOOST is a random undersampling boosting method designed to address class imbalance [61]. SUBSPACE trains random feature subsets to reduce estimator correlation [62]. TOTALBOOST and LPBOOST aim to maximize the minimal margin of learned ensembles and have the ability to self-terminate [63] [64].
By comparing our method against these diverse algorithms, we aim to showcase its superiority and effectiveness in addressing the cancer gene expression data classification problem. Moreover, all comparison models are publicly available at https://github.com/wangxb96/EODE/tree/master/ComparisonAlgorithms.
3.3 Parameter Settings
Our experiments were conducted on a desktop computer with the following specifications: an Intel(R) Core(TM) i7-10700KF CPU @ 3.80GHz, 32GB of RAM, and a 64-bit Windows 10 operating system, using Matlab 2021a. We utilized six base classifiers, namely DISCR, DT, KNN, ANN, SVM, and NB, to construct the ensemble. The parameters for DISCR, KNN, SVM, and NB are summarized in TABLE 2, while the remaining classifiers were used with their default settings. Additionally, Random Forest (RF) [60], ADABOOST [22], RUSBOOST [61], SUBSPACE [62], TOTALBOOST [63], and LPBOOST [64] were employed with their default parameter values. Furthermore, the parameters for the four ensemble classifier methods PSOEL [27], EAEL [57], FESM [58], and GA-Bagging-SVM [59] were set to be consistent with the original papers.
In our experiments, the original data was randomly divided into training and test datasets with an 80:20 ratio, and five-fold cross-validation was used on the training data. For the GWO algorithm in both feature selection and ensemble optimization, the population size (Pop) was set to 100, the number of iterations was set to 50, and the threshold ($\theta$) was set to 0.5. Note that this threshold serves purely as a binarization criterion within the Grey Wolf Optimizer and is not related to actual gene expression values. In the clustering phase, the cluster-count parameter $K$ was fixed in advance. The detailed parameters of the seven classical evolutionary algorithms, including ACO [51], CS [52], DE [53], GA [54], GWO [37], PSO [55], and ABC [56], are summarized in TABLE 3, where the population size (Pop = 100) and the maximum iteration ($T_{max}$ = 50) are identical across methods.
Methods | Parameters |
---|---|
DISCR | discrimtype = diaglinear |
KNN | K = 3 |
SVM | ’KernelFunction’ = ’rbf’, ’IterationLimit’ = 50000, ’Standardize’= true |
NB | distribution = kernel |
Methods | Parameters |
---|---|
ACO | tau = 1, eta = 1, alpha = 1, beta = 0.1, rho = 0.2, Pop = 100, $T_{max}$ = 50 |
CS | lb = 0, ub = 1, Pa = 0.25, alpha = 1, beta = 1.5, Pop = 100, $T_{max}$ = 50 |
DE | lb = 0, ub = 1, CR = 0.9, F = 0.5, Pop = 100, $T_{max}$ = 50 |
GA | CR = 0.8, MR = 0.01, Pop = 100, $T_{max}$ = 50 |
PSO | lb = 0, ub = 1, c1 = 2, c2 = 2, w = 0.9, Vmax = (ub - lb)/2, Pop = 100, $T_{max}$ = 50 |
ABC | lb = 0, ub = 1, maxlimit = 5, Pop = 100, $T_{max}$ = 50 |
GWO | lb = 0, ub = 1, Pop = 100, $T_{max}$ = 50 |
For ACO, "tau" denotes the pheromone value, "eta" denotes the heuristic desirability, "alpha" denotes the control pheromone, "beta" denotes the control heuristic, and "rho" denotes the pheromone trail decay coefficient. For CS, "Pa" denotes the discovery rate, "alpha" denotes the constant, and "beta" denotes the Levy component. For DE, "CR" denotes the crossover rate, and "F" denotes the scale factor. For GA, "CR" denotes the crossover rate, and "MR" denotes the mutation rate. For PSO, "c1" denotes the cognitive factor, "c2" denotes the social factor, "w" denotes the inertia weight, and "Vmax" denotes the maximum velocity. "lb" and "ub" denote the lower and upper bounds of the search space.
4 Results and Analysis
4.1 Performance Comparisons with Other Nature-inspired Ensemble Learning Algorithms
In our study, we conducted performance comparisons of EODE with several other nature-inspired ensemble learning algorithms, namely PSOEL, EAEL, FESM, and GA-Bagging-SVM. The experimental results are summarized in Figure 3, where Figure 3(A) presents detailed classification results, Figure 3(B) illustrates the performance comparisons of EODE against the other ensemble methods, and Figure 3(C) showcases the average performance values of these methods.
As shown in Figure 3(A), EODE achieved the best results among all methods on 26 out of the 35 datasets. Specifically, EODE attained 100% classification accuracy on 7 datasets and achieved over 90% accuracy on more than half of the datasets. These results highlight the robustness of EODE in handling various types of cancers and its ability to provide highly accurate classifications. From Figure 3(B), it is evident that EODE outperformed the other nature-inspired ensemble learning algorithms. The performance comparisons clearly demonstrate the superiority of EODE in terms of test accuracy. To provide a comprehensive performance overview, we present the average performance across all 35 cancer gene expression datasets in Figure 3(C). The results indicate that EODE outperformed PSOEL by 6% and exhibited more than a 10% improvement compared to the other methods. These findings strongly support the conclusion that EODE performs better than other nature-inspired ensemble methods in the context of cancer gene expression classification.

4.2 Performance Comparisons of Different Machine Learning Algorithms
In our study, we conducted a comprehensive analysis and comparison of the performance between our proposed ensemble approach, EODE, and single classifier approaches. The experimental results, as shown in Figure 4, clearly demonstrate the superiority of EODE in terms of classification accuracy for cancer gene expression datasets. EODE achieved the best classification accuracy for over 55% of the datasets, surpassing all single classifiers. This indicates the effectiveness and robustness of our ensemble approach in handling cancer gene expression classification tasks. Moreover, when considering the average performance across all 35 cancer gene expression datasets, EODE consistently outperformed all single classifiers. Specifically, our ensemble approach exhibited remarkable improvements compared to the worst classifier, with an increase in performance of nearly 33%. Furthermore, EODE consistently achieved performance improvements of more than 10% compared to the majority of the base classifiers.
These findings clearly highlight the advantages of our ensemble approach over traditional single classifier methods. By leveraging the collective wisdom of multiple classifiers, EODE effectively addresses the challenges posed by cancer gene expression classification, resulting in superior classification accuracy and overall performance. Figure 4 provides a visual representation of the experimental results, further supporting the conclusions drawn from our performance comparisons. The results validate the effectiveness of our proposed ensemble approach, highlighting its potential as a valuable tool in the field of cancer gene expression analysis.

4.3 Performance Comparisons of the Different Evolutionary Algorithms
To further evaluate the performance of the proposed EODE method, we compared it against other state-of-the-art evolutionary algorithms, including: Ant Colony Optimization (ACO), Cuckoo Search (CS), Differential Evolution (DE), Genetic Algorithm (GA), Grey Wolf Optimizer (GWO), Particle Swarm Optimization (PSO), and Artificial Bee Colony (ABC).
The experimental results are summarized in supplementary Figures 1 and 2. Supplementary Figure 1 shows the classification accuracy of different methods on each of the 35 cancer gene expression datasets, with the first 7 sub-figures presenting results on individual datasets and the last sub-figure reporting the average performance across all datasets. As seen in supplementary Figure 1, EODE obtains the best classification accuracy on over 60% of the datasets. Notably, there is an improvement of 5-8% in average classification accuracy achieved by EODE compared to other evolutionary algorithms. Supplementary Figure 2 depicts the number of features (i.e. biomarker genes) selected by each method on each dataset. We can observe that EODE selects the smallest feature subset in nearly 60% of datasets, indicating its ability to identify the most informative genes.
Overall, from both supplementary Figures 1 and 2, we can deduce that EODE consistently demonstrates the best average performance across all 35 cancer gene expression datasets, outperforming other state-of-the-art evolutionary methods. This validates the effectiveness and robustness of the proposed EODE approach in discovering critical biomarker genes for cancer classification.
Dataset | Train (WEL) | Train (EODE) | Test (WEL) | Test (EODE) | Dataset | Train (WEL) | Train (EODE) | Test (WEL) | Test (EODE)
---|---|---|---|---|---|---|---|---|---
Alizadeh-2000-v1 | 1.0000 | 1.0000 | 0.5114 | 1.0000 | Lapointe-2004-v2 | 0.8737 | 0.8634 | 0.4339 | 0.9091 |
Alizadeh-2000-v2 | 1.0000 | 1.0000 | 0.6591 | 1.0000 | Liang-2005 | 0.9515 | 0.9667 | 0.6494 | 0.8571 |
Alizadeh-2000-v3 | 0.9745 | 0.9600 | 0.4318 | 0.9167 | Nutt-2003-v1 | 0.8614 | 0.9250 | 0.4273 | 0.9000 |
Armstrong-2002-v1 | 1.0000 | 0.9818 | 0.6558 | 0.9286 | Nutt-2003-v2 | 1.0000 | 1.0000 | 0.5455 | 0.6000 |
Armstrong-2002-v2 | 1.0000 | 1.0000 | 0.4416 | 0.9286 | Nutt-2003-v3 | 1.0000 | 1.0000 | 0.7955 | 1.0000 |
Bhattacharjee-2001 | 0.9944 | 0.9938 | 0.6750 | 0.8750 | Pomeroy-2002-v1 | 0.9697 | 0.9667 | 0.8333 | 0.8333 |
Bittner-2000 | 0.9614 | 0.9667 | 0.5857 | 0.7143 | Pomeroy-2002-v2 | 0.9532 | 0.9381 | 0.2841 | 1.0000 |
Bredel-2005 | 0.9114 | 0.9250 | 0.5727 | 0.9000 | Ramaswamy-2001 | 0.7251 | 0.7966 | 0.1794 | 0.7632 |
Chen-2002 | 0.9918 | 0.9931 | 0.6649 | 0.9714 | Risinger-2003 | 0.8649 | 0.8238 | 0.4545 | 0.7500 |
Chowdary-2006 | 0.9904 | 0.9875 | 0.8682 | 0.9000 | Shipp-2002-v1 | 0.9800 | 0.9833 | 0.7212 | 0.8667 |
Dyrskjot-2003 | 0.9948 | 0.9381 | 0.5568 | 0.7500 | Singh-2002 | 0.9778 | 0.9750 | 0.5955 | 0.9000 |
Garber-2001 | 0.8883 | 0.8873 | 0.5455 | 0.6154 | Su-2001 | 0.9565 | 0.9643 | 0.1925 | 0.9118 |
Golub-1999-v1 | 0.9970 | 1.0000 | 0.6234 | 1.0000 | Tomlins-2006-v1 | 0.8787 | 0.8809 | 0.3364 | 0.8000 |
Golub-1999-v2 | 0.9939 | 1.0000 | 0.5065 | 0.9286 | Tomlins-2006-v2 | 0.8721 | 0.8648 | 0.3889 | 0.9444 |
Gordon-2002 | 1.0000 | 1.0000 | 0.8409 | 0.9444 | West-2001 | 1.0000 | 1.0000 | 0.4646 | 0.5556 |
Khan-2001 | 1.0000 | 1.0000 | 0.3466 | 1.0000 | Yeoh-2002-v1 | 0.9950 | 0.9950 | 0.8108 | 1.0000 |
Laiho-2007 | 1.0000 | 1.0000 | 0.7143 | 0.7143 | Yeoh-2002-v2 | 0.8448 | 0.8995 | 0.2430 | 0.8776 |
Lapointe-2004-v1 | 0.8423 | 0.8591 | 0.5385 | 0.6154 | Average | 0.9498 | 0.9524 | 0.5456 | 0.8620 |
4.4 Performance Comparisons of the Different Ensemble Learning Algorithms

To further validate the effectiveness of the proposed EODE method, we conducted experiments comparing its performance to other state-of-the-art ensemble learning classifiers on the 35 cancer gene expression datasets. The methods considered for comparison include: Random Forest (RF) [60], ADABOOST [22], RUSBOOST [61], SUBSPACE [62], TOTALBOOST [63] and LPBOOST [64].
The results are shown in Figure 5. Figure 5(A) depicts a heat map of the classification accuracy of each method on each of the 35 datasets, where darker colors indicate better performance. This heat map visualization allows us to qualitatively compare the performance of models across various cancer types. Figure 5(B) summarizes the mean classification accuracy of each method averaged over the 35 datasets. The proposed EODE method achieves 6-32% better performance compared to the other ensemble learning classifiers, demonstrating its superior predictive ability. Figure 5(C) presents box plots comparing the distribution of classification accuracies obtained by each method across datasets. The median accuracy of EODE is higher than that of all other methods, indicating stable and robust performance. Moreover, the box plot of EODE is narrower than those of the other methods, showing the consistency of its results.
Overall, these quantitative and qualitative comparisons presented in Figure 5 validate that the proposed EODE method achieves the best classification accuracy on over 70% of the cancer gene expression datasets, outperforming other state-of-the-art ensemble classifiers. This clearly demonstrates the effectiveness and robustness of the EODE approach for cancer classification using gene expression data.
4.5 Ablation Study
4.5.1 Performance of EODE without Ensemble Learning
TABLE 4 presents a comprehensive evaluation of training and testing performance across the 35 datasets for the proposed EODE approach against EODE without ensemble learning (WEL), where WEL means the nature-inspired diverse ensemble learning stage is not employed. The training accuracies of EODE and WEL are comparable, with EODE achieving a slightly higher average training accuracy of 0.9524 versus 0.9498 for WEL. This indicates that the capacity for fitting the training data is similar between the two approaches. However, EODE demonstrates a significant test accuracy advantage over WEL, with average test accuracies of 0.8620 and 0.5456, respectively. This translates to an absolute improvement of over 30% in generalization performance by leveraging ensemble learning.
The key insight is that while ensemble learning does not markedly improve training fit, it provides superior generalization through effectively preventing overfitting. Single models are prone to overfitting the noise in small datasets. Ensemble learning creates multiple diverse models and aggregates their predictions, avoiding these spurious patterns. Across multiple datasets, EODE consistently exhibits stronger generalization, evidenced by the significantly higher test accuracies. This gap is particularly prominent in smaller datasets where individual models tend to overfit more. By reducing variance via ensembling, the proposed approach demonstrates more robust predictions on unseen test data. The results validate the effectiveness of ensemble learning in enhancing model generalization capability and tackling the overfitting challenge.
In conclusion, the ensemble framework shows considerable promise in boosting test performance over single-model baselines across a wide range of conditions. This has important implications for real-world applications such as cancer screening, where avoiding overfitting is critical. The analysis provides strong empirical evidence and rationale for adopting ensemble techniques.
4.5.2 Performance of EODE Ensemble versus Individual Classifiers
Unlike the analysis in Section 4.2, this study does not evaluate each base classifier model in isolation. Rather, this section investigates the impact of using a single base classifier within the nature-inspired diverse ensemble learning phase of the proposed approach, instead of aggregating multiple heterogeneous classifiers concurrently as intended in the ensemble methodology. By focusing on the ensemble learning stage, this analysis provides targeted insight into the benefits of leveraging diversity in the classifier combinations compared to relying on any individual modeling paradigm alone during this critical step.
Dataset | DISCR | DT | KNN | ANN | SVM | NB | EODE |
---|---|---|---|---|---|---|---|
Alizadeh-2000-v1 | 0.8333 | 0.6875 | 0.7750 | 0.6875 | 0.5000 | 0.7500 | 1.0000 |
Alizadeh-2000-v2 | 1.0000 | 0.8452 | 0.9833 | 1.0000 | 0.6667 | 0.7333 | 1.0000 |
Alizadeh-2000-v3 | 0.9306 | 0.7381 | 0.8667 | 0.9271 | 0.3333 | 0.8000 | 0.9167 |
Armstrong-2002-v1 | 0.8929 | 0.9490 | 0.9143 | 0.8929 | 0.6429 | 0.8429 | 0.9286 |
Armstrong-2002-v2 | 0.8690 | 0.7959 | 0.8000 | 0.8482 | 0.4286 | 0.6714 | 0.9286 |
Bhattacharjee-2001 | 0.9292 | 0.8321 | 0.8350 | 0.9063 | 0.6750 | 0.7750 | 0.8750 |
Bittner-2000 | 0.6905 | 0.6531 | 0.7143 | 0.6786 | 0.4286 | 0.5714 | 0.7143 |
Bredel-2005 | 0.8667 | 0.6857 | 0.7800 | 0.8500 | 0.7000 | 0.8400 | 0.9000 |
Chen-2002 | 0.8619 | 0.7837 | 0.8171 | 0.8786 | 0.5829 | 0.8229 | 0.9714 |
Chowdary-2006 | 0.9000 | 0.9214 | 0.9300 | 0.9438 | 0.8900 | 0.8700 | 0.9000 |
Dyrskjot-2003 | 0.7500 | 0.6429 | 0.6750 | 0.6719 | 0.6250 | 0.6500 | 0.7500 |
Garber-2001 | 0.7051 | 0.6923 | 0.6769 | 0.6538 | 0.6154 | 0.6769 | 0.6154 |
Golub-1999-v1 | 0.9286 | 0.9184 | 0.9143 | 0.8482 | 0.6429 | 0.8000 | 1.0000 |
Golub-1999-v2 | 0.8810 | 0.8367 | 0.8857 | 0.7500 | 0.5714 | 0.6714 | 0.9286 |
Gordon-2002 | 0.9861 | 0.9643 | 0.9722 | 0.9757 | 0.8333 | 0.9500 | 0.9444 |
Khan-2001 | 0.8646 | 0.8571 | 0.9125 | 0.9297 | 0.3125 | 0.7000 | 1.0000
Laiho-2007 | 0.7381 | 0.7551 | 0.7143 | 0.6786 | 0.7143 | 0.7429 | 0.7143
Lapointe-2004-v1 | 0.7538 | 0.6264 | 0.8154 | 0.6923 | 0.5385 | 0.6000 | 0.6154 |
Lapointe-2004-v2 | 0.7386 | 0.5195 | 0.5909 | 0.7330 | 0.3636 | 0.7091 | 0.9091 |
Liang-2005 | 0.6786 | 0.7347 | 0.8000 | 0.8750 | 0.7143 | 0.7143 | 0.8571 |
Nutt-2003-v1 | 0.7000 | 0.3857 | 0.5800 | 0.5500 | 0.2800 | 0.3800 | 0.9000 |
Nutt-2003-v2 | 0.8000 | 0.4286 | 0.6800 | 0.6500 | 0.4400 | 0.4000 | 0.6000 |
Nutt-2003-v3 | 1.0000 | 0.8214 | 0.8500 | 1.0000 | 0.7500 | 0.8500 | 1.0000 |
Pomeroy-2002-v1 | 0.6250 | 0.6429 | 0.6333 | 0.6250 | 0.1667 | 0.6000 | 0.8333 |
Pomeroy-2002-v2 | 0.8750 | 0.4107 | 0.5500 | 0.5156 | 0.2500 | 0.2500 | 1.0000 |
Ramaswamy-2001 | 0.6250 | 0.5752 | 0.6263 | 0.5329 | 0.1579 | 0.2526 | 0.7632 |
Risinger-2003 | 0.5938 | 0.5000 | 0.3750 | 0.5625 | 0.3750 | 0.5000 | 0.7500 |
Shipp-2002-v1 | 0.7667 | 0.7619 | 0.8000 | 0.8667 | 0.7333 | 0.7733 | 0.8667 |
Singh-2002 | 0.8125 | 0.7833 | 0.7600 | 0.8188 | 0.5000 | 0.8500 | 0.9000 |
Su-2001 | 0.8676 | 0.7549 | 0.7588 | 0.8750 | 0.1588 | 0.4471 | 0.9118 |
Tomlins-2006-v1 | 0.7875 | 0.5250 | 0.8200 | 0.8000 | 0.3000 | 0.6800 | 0.8000 |
Tomlins-2006-v2 | 0.7778 | 0.6000 | 0.7000 | 0.7986 | 0.3889 | 0.6556 | 0.9444 |
West-2001 | 0.5833 | 0.7778 | 0.6444 | 0.5972 | 0.4667 | 0.6000 | 0.5556 |
Yeoh-2002-v1 | 0.9796 | 0.9714 | 0.9878 | 0.9566 | 0.6694 | 0.8327 | 1.0000 |
Yeoh-2002-v2 | 0.6735 | 0.6408 | 0.7673 | 0.5357 | 0.3265 | 0.3347 | 0.8776 |
Average | 0.8076 | 0.7148 | 0.7687 | 0.7744 | 0.5069 | 0.6656 | 0.8620 |
Across the 35 gene expression datasets analyzed, the best single classifier achieved an average accuracy of 0.8076 using a DISCR model. In contrast, the proposed EODE ensemble approach attained a significantly higher accuracy of 0.8620 by leveraging an integrated combination of diverse classifiers including DISCR, DT, KNN, ANN, SVM and NB. The results highlight that relying on any individual base classifier is suboptimal compared to the ensemble approach. No single modeling paradigm consistently dominates the performance across all datasets, due to the complexity of the classification problem. Different datasets exhibit variability in terms of which individual classifier achieves the best performance when used alone. However, EODE provides equal or higher accuracy relative to the top stand-alone model on 23 out of 35 datasets. The results empirically demonstrate that integrating multiple complementary base classifiers simultaneously is essential to maximize the potential of the ensemble framework and attain optimal classification performance on gene expression data. Reliance on any single constituent classifier within the ensemble learning process fails to harness the full synergistic advantages of the diverse ensemble.
5 Conclusion
Cancer type identification is a critical aspect of cancer research, as it enables early diagnosis and tailored treatment for patients. One key challenge in this field is identifying the highly sensitive biomarker genes that are indicative of specific cancer types. In this study, we propose a novel approach called EODE to address the classification of cancer types, particularly in scenarios where the gene expression profile samples are high-dimensional and small in size. EODE leverages the grey wolf optimizer (GWO) to optimize feature subsets and collaboratively builds an optimized ensemble classifier. By combining nature-inspired feature selection and ensemble learning, EODE significantly improves the model’s identification capability.
We conducted experiments on 35 datasets encompassing various cancer types, and the results demonstrate the effectiveness of our algorithm compared to four nature-inspired ensemble methods (PSOEL, EAEL, FESM, and GA-Bagging-SVM), six benchmark machine learning algorithms (KNN, DT, ANN, SVM, DISCR, and NB), six state-of-the-art ensemble algorithms (RF, ADABOOST, RUSBOOST, SUBSPACE, TOTALBOOST, and LPBOOST), and seven nature-inspired methods (ACO, CS, DE, GA, GWO, PSO, and ABC). Our algorithm outperformed these methods in terms of classification accuracy.
In future work, we aim to enhance the efficiency of the algorithm by improving the screening of redundant and invalid features. Additionally, as biomedical data often exhibit class imbalance, we plan to extend the method so that it remains robust on class-imbalanced data. Beyond computational refinements, we intend to evaluate the proposed methodology on expanded gene expression datasets from diverse clinical cohorts. As cancer subtyping using gene expression data holds great promise for guiding individualized treatment decisions, we hope to transition this computational pipeline into real-world clinical settings.
Acknowledgments
The work described in this paper was substantially supported by the National Natural Science Foundation of China under Grant No. 62076109, and funded by the Natural Science Foundation of Jilin Province under Grant No. 20190103006JH, the Natural Science Funds of Jilin Province under Grant No. 20200201158JC. The work described in this paper was supported by the grant from the Health and Medical Research Fund, the Food and Health Bureau, The Government of the Hong Kong Special Administrative Region [07181426], and the funding from Hong Kong Institute for Data Science (HKIDS) at City University of Hong Kong. The work described in this paper was partially supported by two grants from City University of Hong Kong (CityU 11202219, CityU 11203520). This research is also supported by the National Natural Science Foundation of China under Grant No. 32000464.
References
- [1] Wei Cao, Hong-Da Chen, Yi-Wen Yu, Ni Li, and Wan-Qing Chen. Changing profiles of cancer burden worldwide and in China: a secondary analysis of the global cancer statistics 2020. Chinese Medical Journal, 134(07):783–791, 2021.
- [2] Kyle Swanson, Eric Wu, Angela Zhang, Ash A Alizadeh, and James Zou. From patterns to patients: Advances in clinical machine learning for cancer diagnosis, prognosis, and treatment. Cell, 2023.
- [3] Wenya Linda Bi, Ahmed Hosny, Matthew B Schabath, Maryellen L Giger, Nicolai J Birkbak, Alireza Mehrtash, Tavis Allison, Omar Arnaout, Christopher Abbosh, Ian F Dunn, et al. Artificial intelligence in cancer imaging: clinical challenges and applications. CA: A Cancer Journal for Clinicians, 69(2):127–157, 2019.
- [4] Joaquin Mateo, Lotte Steuten, Philippe Aftimos, Fabrice André, Mark Davies, Elena Garralda, Jan Geissler, Don Husereau, Iciar Martinez-Lopez, Nicola Normanno, et al. Delivering precision oncology to patients with cancer. Nature Medicine, 28(4):658–665, 2022.
- [5] De-Shuang Huang and Chun-Hou Zheng. Independent component analysis-based penalized discriminant method for tumor classification using gene expression data. Bioinformatics, 22(15):1855–1862, 2006.
- [6] Ran Su, Jiahang Zhang, Xiaofeng Liu, and Leyi Wei. Identification of expression signatures for non-small-cell lung carcinoma subtype classification. Bioinformatics, 36(2):339–346, 2020.
- [7] Chiwen Qu, Lupeng Zhang, Jinlong Li, Fang Deng, Yifan Tang, Xiaomin Zeng, and Xiaoning Peng. Improving feature selection performance for classification of gene expression data using Harris hawks optimizer with variable neighborhood learning. Briefings in Bioinformatics, 2021.
- [8] Hilary S Parker, Jeffrey T Leek, Alexander V Favorov, Michael Considine, Xiaoxin Xia, Sameer Chavan, Christine H Chung, and Elana J Fertig. Preserving biological heterogeneity with a permuted surrogate variable analysis for genomics batch correction. Bioinformatics, 30(19):2757–2763, 2014.
- [9] Florian Schmidt, Markus List, Engin Cukuroglu, Sebastian Köhler, Jonathan Göke, and Marcel H Schulz. An ontology-based method for assessing batch effect adjustment approaches in heterogeneous datasets. Bioinformatics, 34(17):i908–i916, 2018.
- [10] Ting Jin, Nam D Nguyen, Flaminia Talos, and Daifeng Wang. ECMarker: interpretable machine learning model identifies gene expression biomarkers predicting clinical outcomes and reveals molecular mechanisms of human disease in early stages. Bioinformatics, 37(8):1115–1124, 2021.
- [11] Bryan He, Ludvig Bergenstråhle, Linnea Stenbeck, Abubakar Abid, Alma Andersson, Åke Borg, Jonas Maaskola, Joakim Lundeberg, and James Zou. Integrating spatial gene expression and breast tumour morphology via deep learning. Nature Biomedical Engineering, 4(8):827–834, 2020.
- [12] Huimin Gao, Chuang Bian, Xvbin Wang, Xiangtao Li, and Yunhe Wang. Exploring cancer biomarker genes from gene expression data via nature-inspired multiobjective optimization. In 2022 34th Chinese Control and Decision Conference (CCDC), pages 5000–5007. IEEE, 2022.
- [13] Xubin Wang and Weijia Jia. A feature weighting particle swarm optimization method to identify biomarker genes. In 2022 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 830–834. IEEE, 2022.
- [14] Shahla Nemati, Mohammad Ehsan Basiri, Nasser Ghasem-Aghaee, and Mehdi Hosseinzadeh Aghdam. A novel ACO–GA hybrid algorithm for feature selection in protein function prediction. Expert Systems with Applications, 36(10):12086–12094, 2009.
- [15] Negar Maleki, Yasser Zeinali, and Seyed Taghi Akhavan Niaki. A k-NN method for lung cancer prognosis with the use of a genetic algorithm for feature selection. Expert Systems with Applications, 164:113981, 2021.
- [16] Rodrigo Clemente Thom de Souza, Camila Andrade de Macedo, Leandro dos Santos Coelho, Juliano Pierezan, and Viviana Cocco Mariani. Binary coyote optimization algorithm for feature selection. Pattern Recognition, 107:107470, 2020.
- [17] Gaurav Dhiman, Diego Oliva, Amandeep Kaur, Krishna Kant Singh, S Vimal, Ashutosh Sharma, and Korhan Cengiz. BEPO: a novel binary emperor penguin optimizer for automatic feature selection. Knowledge-Based Systems, 211:106560, 2021.
- [18] Abdelaziz I Hammouri, Majdi Mafarja, Mohammed Azmi Al-Betar, Mohammed A Awadallah, and Iyad Abu-Doush. An improved dragonfly algorithm for feature selection. Knowledge-Based Systems, 203:106131, 2020.
- [19] Nabil Neggaz, Essam H Houssein, and Kashif Hussain. An efficient Henry gas solubility optimization for feature selection. Expert Systems with Applications, 152:113364, 2020.
- [20] Mahardhika Pratama, Witold Pedrycz, and Edwin Lughofer. Evolving ensemble fuzzy classifier. IEEE Transactions on Fuzzy Systems, 26(5):2552–2567, 2018.
- [21] Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
- [22] Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119–139, 1997.
- [23] Ronglai Shen, Adam B Olshen, and Marc Ladanyi. Integrative clustering of multiple genomic data types using a joint latent variable model with application to breast and lung cancer subtype analysis. Bioinformatics, 25(22):2906–2912, 2009.
- [24] Zhen Cao, Xiaoyong Pan, Yang Yang, Yan Huang, and Hong-Bin Shen. The lncLocator: a subcellular localization predictor for long non-coding RNAs based on a stacked ensemble classifier. Bioinformatics, 34(13):2185–2194, 2018.
- [25] Ran Su, Xinyi Liu, Guobao Xiao, and Leyi Wei. Meta-gdbp: a high-level stacked regression model to improve anticancer drug response prediction. Briefings in Bioinformatics, 21(3):996–1005, 2020.
- [26] Gavin Brown, Jeremy Wyatt, Rachel Harris, and Xin Yao. Diversity creation methods: a survey and categorisation. Information Fusion, 6(1):5–20, 2005.
- [27] Muhammad Zohaib Jan, Juan Carloz Munoz, and Muhammad Asim Ali. A novel method for creating an optimized ensemble classifier by introducing cluster size reduction and diversity. IEEE Transactions on Knowledge and Data Engineering, 2020.
- [28] Tien Thanh Nguyen, Anh Vu Luong, Manh Truong Dang, Alan Wee-Chung Liew, and John McCall. Ensemble selection based on classifier prediction confidence. Pattern Recognition, 100:107104, 2020.
- [29] Yijun Chen, Man-Leung Wong, and Haibing Li. Applying ant colony optimization to configuring stacking ensembles for data mining. Expert Systems with Applications, 41(6):2688–2702, 2014.
- [30] Asit Kumar Das, Soumen Kumar Pati, and Arka Ghosh. Relevant feature selection and ensemble classifier design using bi-objective genetic algorithm. Knowledge and Information Systems, 62(2):423–455, 2020.
- [31] Xiangtao Li, Shixiong Zhang, and Ka-Chun Wong. Single-cell RNA-seq interpretations using evolutionary multiobjective ensemble pruning. Bioinformatics, 35(16):2809–2817, 2019.
- [32] Sujie Zhu, Weikaixin Kong, Jie Zhu, Liting Huang, Shixin Wang, Suzhen Bi, and Zhengwei Xie. The genetic algorithm-aided three-stage ensemble learning method identified a robust survival risk score in patients with glioma. Briefings in Bioinformatics, 23(5):bbac344, 2022.
- [33] Girish Chandrashekar and Ferat Sahin. A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28, 2014.
- [34] Ludmila I Kuncheva and Christopher J Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy. Machine Learning, 51:181–207, 2003.
- [35] Yi Zhang, Samuel Burer, W Nick Street, Kristin P Bennett, and Emilio Parrado-Hernández. Ensemble pruning via semi-definite programming. Journal of Machine Learning Research, 7(7), 2006.
- [36] Lior Rokach. Ensemble-based classifiers. Artificial Intelligence Review, 33:1–39, 2010.
- [37] Seyedali Mirjalili, Seyed Mohammad Mirjalili, and Andrew Lewis. Grey wolf optimizer. Advances in Engineering Software, 69:46–61, 2014.
- [38] Hossam Faris, Ibrahim Aljarah, Mohammed Azmi Al-Betar, and Seyedali Mirjalili. Grey wolf optimizer: a review of recent variants and applications. Neural Computing and Applications, 30(2):413–435, 2018.
- [39] Bing Xue, Mengjie Zhang, and Will N Browne. Particle swarm optimization for feature selection in classification: A multi-objective approach. IEEE Transactions on Cybernetics, 43(6):1656–1671, 2012.
- [40] Xiangyang Wang, Jie Yang, Xiaolong Teng, Weijun Xia, and Richard Jensen. Feature selection based on rough sets and particle swarm optimization. Pattern Recognition Letters, 28(4):459–471, 2007.
- [41] David JC MacKay. Information Theory, Inference and Learning Algorithms. Cambridge University Press, 2003.
- [42] Reshef Meir, Maria Polukarov, Jeffrey Rosenschein, and Nicholas Jennings. Convergence to equilibria in plurality voting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 24, 2010.
- [43] Reshef Meir. Plurality voting under uncertainty. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 29, 2015.
- [44] Marcilio CP de Souto, Ivan G Costa, Daniel SA de Araujo, Teresa B Ludermir, and Alexander Schliep. Clustering cancer gene expression data: a comparative study. BMC Bioinformatics, 9(1):1–14, 2008.
- [45] Peter A Lachenbruch and M Goldstein. Discriminant analysis. Biometrics, pages 69–85, 1979.
- [46] S Rasoul Safavian and David Landgrebe. A survey of decision tree classifier methodology. IEEE Transactions on Systems, Man, and Cybernetics, 21(3):660–674, 1991.
- [47] Naomi S Altman. An introduction to kernel and nearest-neighbor nonparametric regression. The American Statistician, 46(3):175–185, 1992.
- [48] Bayya Yegnanarayana. Artificial Neural Networks. PHI Learning Pvt. Ltd., 2009.
- [49] William S Noble. What is a support vector machine? Nature Biotechnology, 24(12):1565–1567, 2006.
- [50] Kevin P Murphy et al. Naive Bayes classifiers. University of British Columbia, 18(60):1–8, 2006.
- [51] Marco Dorigo, Mauro Birattari, and Thomas Stutzle. Ant colony optimization. IEEE Computational Intelligence Magazine, 1(4):28–39, 2006.
- [52] Xin-She Yang and Suash Deb. Cuckoo search via Lévy flights. In 2009 World Congress on Nature & Biologically Inspired Computing (NaBIC), pages 210–214. IEEE, 2009.
- [53] Swagatam Das and Ponnuthurai Nagaratnam Suganthan. Differential evolution: A survey of the state-of-the-art. IEEE Transactions on Evolutionary Computation, 15(1):4–31, 2010.
- [54] Darrell Whitley. A genetic algorithm tutorial. Statistics and Computing, 4(2):65–85, 1994.
- [55] James Kennedy and Russell Eberhart. Particle swarm optimization. In Proceedings of ICNN’95 - International Conference on Neural Networks, volume 4, pages 1942–1948. IEEE, 1995.
- [56] Dervis Karaboga and Bahriye Basturk. A powerful and efficient algorithm for numerical function optimization: artificial bee colony (ABC) algorithm. Journal of Global Optimization, 39(3):459–471, 2007.
- [57] Zohaib Md. Jan and Brijesh Verma. Evolutionary classifier and cluster selection approach for ensemble classification. ACM Transactions on Knowledge Discovery from Data (TKDD), 14(1):1–18, 2019.
- [58] Muhammad Zohaib Jan. A Novel Framework for Optimised Ensemble Classifiers. PhD thesis, Central Queensland University, 2020.
- [59] Jianying Lin, Hui Chen, Shan Li, Yushuang Liu, Xuan Li, and Bin Yu. Accurate prediction of potential druggable proteins based on genetic algorithm and Bagging-SVM ensemble classifier. Artificial Intelligence in Medicine, 98:35–47, 2019.
- [60] Leo Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
- [61] Chris Seiffert, Taghi M Khoshgoftaar, Jason Van Hulse, and Amri Napolitano. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics - Part A: Systems and Humans, 40(1):185–197, 2009.
- [62] Tin Kam Ho. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(8):832–844, 1998.
- [63] Manfred K Warmuth, Jun Liao, and Gunnar Rätsch. Totally corrective boosting algorithms that maximize the margin. In Proceedings of the 23rd International Conference on Machine Learning, pages 1001–1008, 2006.
- [64] Adam J Grove and Dale Schuurmans. Boosting in the limit: Maximizing the margin of learned ensembles. In AAAI/IAAI, pages 692–699, 1998.