
ASE: Anomaly Scoring Based Ensemble Learning for Highly Imbalanced Datasets

Xiayu Liang, Ying Gao, and Shanrong Xu
South China University of Technology, Guangzhou, Guangdong, China 510006
Abstract.

Nowadays, many classification algorithms are applied across industries to solve problems that arise in real-life scenarios. However, in many binary classification tasks, samples in the minority class make up only a small fraction of all instances, so the datasets we obtain usually suffer from a high imbalance ratio. When confronted with such data skewing, existing models sometimes treat minority-class samples as noise or discard them as outliers. To address this problem, we propose a bagging ensemble learning framework ASE (Anomaly Scoring Based Ensemble Learning). The framework contains a scoring system based on anomaly detection algorithms that guides the resampling strategy by dividing the samples in the majority class into subspaces. A specific number of instances is then under-sampled from each subspace and combined with the minority class to construct training subsets. We calculate the weight of each base classifier trained on a subset according to the classification result of the anomaly detection model and the statistics of the subspaces. Experiments show that our ensemble learning model dramatically improves the performance of base classifiers and is more effective than other existing methods under a wide range of imbalance ratios, data scales and data dimensions. ASE can be combined with various classifiers, and every part of our framework has been shown to be reasonable and necessary.

ensemble learning, imbalanced datasets, resampling, anomaly detection, bagging
ccs: Computing methodologies Bagging; Supervised learning by classification; Anomaly detection; Classification and regression trees

1. Introduction

Classification is a common research area in machine learning and data mining and has a wide range of applications in real-world scenarios (Haixiang et al., 2017). In these scenarios, datasets often suffer from a high imbalance ratio. Besides data imbalance, data overlapping also accounts for the poor performance of existing models on imbalanced datasets (Del Río et al., 2014). Confronted with data skewing and data overlapping, classification models lean toward the majority class and dismiss the minority class as noise. Such models achieve high accuracy on the majority class but perform poorly on the minority class, and they have little practical value because people care more about the classification result of the minority class. For example, in credit evaluation, only a few customers are untrustworthy, and the goal is to identify the users who may default in the future among all applicants. If the potential defaulters are not identified, bad debts will increase and companies will suffer a direct economic loss, so it is vital to classify the minority class correctly under data skewing.

To overcome classification problems in data-skewing scenarios, resampling methods try to mitigate the negative impact of imbalanced data by reducing the high imbalance ratio (Chawla et al., 2002; Tomek, 1976; Smith et al., 2014; Wang, 2008). However, resampling tends to ignore informative instances and distort the original distribution of the raw data, and discarding or generating samples is only feasible when the data does not overlap and can be clustered well. The main idea of cost-sensitive models is reweighting (Tang et al., 2008). These models need domain experts to provide prior knowledge such as a cost matrix so that a higher penalty can be applied to the minority class, which is infeasible in most situations (Krawczyk and Schaefer, 2013). Some ensemble learning models are vulnerable to noise, and most of them simply combine resampling methods with multiple base classifiers (Chawla et al., 2003).

Motivated by the reasons why prevailing methods fail on classification tasks with high imbalance ratios and data overlapping, we design an innovative ensemble learning framework with a more sophisticated resampling strategy. Instead of relying on classification error (Seiffert et al., 2009; Wang and Yao, 2012; Liu et al., 2020a), which may lead to error reinforcement, we use anomaly detection to guide the resampling process. In addition, we exploit the statistics of the subsets to weight each base classifier, so the final ensemble model pays more attention to the boundary between the majority class and the minority class.

In this paper, we propose an ensemble learning framework with a scoring system based on anomaly detection algorithms to quantify the anomaly degree of samples. Our framework is a bagging model, so it can be trained efficiently with parallel computation. We introduce anomaly detection models into our ensemble model ASE to quantify the anomaly degree of samples: instances with higher scores are more likely to belong to the minority class, which makes it possible to detect the overlapping area between the majority class and the minority class. The proposed ASW (Anomaly Scoring Weight) controls the proportions of data splitting, and we resample from different subspaces according to ASW. ASW depends on the contamination coefficient of the anomaly detection models, and we use the proposed CEW (Contamination Entropy Weight) to estimate the generalization ability of the base classifiers and to integrate all weak classifiers into ASE. The pipeline of ASE is shown in Figure 1.

In summary, this paper makes these contributions.

  • 1) We introduce a scoring system based on anomaly detection into the resampling step of imbalanced classification. This scoring system makes the proposed ensemble learning framework pay more attention to the minority class and to the overlapping area between the two classes, so the framework achieves higher performance on datasets that suffer from severe data skewing.

  • 2) We propose an ensemble learning framework called ASE, which is efficient enough to be applied in a great number of real-life situations to handle classification tasks under high imbalance ratios.

  • 3) We show that our proposed method is considerably more effective than other imbalanced classification algorithms on different real-world imbalanced tasks with various base classifiers. With our ensemble framework, no prior knowledge or pre-defined distance metric, which is inaccessible in most application scenarios, is required.

Figure 1. Anomaly Scoring Based Ensemble Learning Model

2. RELATED WORK

At present, there are three main techniques for dealing with imbalanced data: resampling, cost-sensitive learning and ensemble learning (Haixiang et al., 2017). These methods are widely used in imbalanced classification across many application scenarios.

Data resampling: Resampling tries to solve the problem of data imbalance at the data level by reducing a dataset's imbalance ratio or even making the dataset balanced, either by generating minority-class data (over-sampling) or by discarding majority-class data (under-sampling). Resampling methods can be divided into three categories: under-sampling, over-sampling and hybrid-sampling. Common resampling methods include RUS (Seiffert et al., 2009), ROS (Menardi and Torelli, 2014), SMOTE (Chawla et al., 2002), ADASYN (He et al., 2008), TomekLink (Tomek, 1976) and OSS (Kubat et al., 1997).

Under-sampling tries to reduce the imbalance ratio or construct balanced subsets by discarding samples from the majority class. RUS (Seiffert et al., 2009) is the oldest under-sampling method and discards samples randomly. A series of under-sampling methods derived from the Nearest Neighbor criterion followed, including CNN (Hart, 1968), TomekLink (Tomek, 1976) and NCR (Laurikkala, 2001). IHT (Smith et al., 2014) removes instances that have a high probability of being misclassified by training extra classifiers, and OSS (Kubat et al., 1997) discards samples from the majority class using TomekLink. Ideas from evolutionary and genetic algorithms have also been applied to under-sampling, as in GAUS (Ha and Lee, 2016). Since under-sampling loses part of the data, it may cause information loss that degrades classifier performance, and given the limited number of instances, it is not a wise choice for small-scale datasets (Haixiang et al., 2017).

To decrease the imbalance ratio, over-sampling synthesizes new instances based on the minority class (Haixiang et al., 2017). ROS (Menardi and Torelli, 2014) first introduced the idea of over-sampling, and SMOTE (Chawla et al., 2002) is its most representative method. Variants of SMOTE such as Borderline-SMOTE (Han et al., 2005) and SVM-SMOTE (Wang, 2008) followed. However, over-sampling usually breaks the original data distribution, and when data is overlapping and highly imbalanced, this kind of method may not perform well because of overfitting.

Hybrid-sampling combines under-sampling and over-sampling, trying to limit the drawbacks and exploit the advantages of both. SMOTE-RSB (Ramentol et al., 2012) and SMOTE-IPF (Sáez et al., 2015) are common hybrid-sampling methods.

Cost-sensitive learning: In cost-sensitive learning, a larger penalty coefficient is applied to the minority class so that classifiers pay more attention to its classification results, reducing the impact of data imbalance. CSSVM (Tang et al., 2008) and CS-LDM (Cheng et al., 2016) are widely used cost-sensitive learning methods. However, cost-sensitive learning requires prior knowledge such as a cost matrix provided by domain experts, which is rarely available in real-life situations, and the models have a certain probability of overfitting (Haixiang et al., 2017). As a result, compared with resampling and ensemble learning, cost-sensitive learning is a less popular research area and fewer people choose it to solve imbalanced classification problems (Krawczyk and Schaefer, 2013).

Ensemble Learning: The main idea of ensemble learning is to combine a number of weak classifiers to reconcile the impact of the imbalance ratio and of noise on a single classifier (Haixiang et al., 2017). Ensemble methods can be divided into two categories: bagging-based and boosting-based models. Common ensemble learning methods include UnderBagging (Wang and Yao, 2009), OverBagging (Wang and Yao, 2009), SMOTEBagging (Wang and Yao, 2009), Adaboost (Freund and Schapire, 1997), SMOTEBoost (Chawla et al., 2003), RUSBoost (Seiffert et al., 2009), EasyEnsemble (Liu et al., 2008b) and BalanceCascade (Liu et al., 2008b). Ensemble learning improves the generalization ability of classification models by integrating multiple weak classifiers trained on resampled data, but it is more easily affected by noise.

Bagging models can be trained in parallel, so they are more efficient than boosting models, whose base classifiers must be trained on information produced by the previous base classifier. Prior work (Fernández et al., 2013) has verified that bagging methods do not fall behind boosting methods in performance while costing less training time, so they are widely applied in practice. Bagging algorithms include OverBagging (Wang and Yao, 2009), UnderBagging (Wang and Yao, 2009) and SMOTEBagging (Wang and Yao, 2009). BPSO (Li et al., 2022) introduces an under-sampling method named Binary PSO to select the instances used to train base classifiers. HUE (Ng et al., 2020) uses a hashing method to divide samples in the majority class into subspaces. SubFeat (Haque et al., 2021) divides features into overlapping and non-overlapping spaces and uses the subspaces to train individual classifiers. EPX (Hsu et al., 2021) scores all rare-class subjects by clustering the feature space and exploiting the richness of features. DELAK (Yang et al., 2021) introduces a distance-based dynamic ensemble that combines the outputs of base classifiers dynamically for new testing samples.

Boosting models exploit heuristic information, such as the classification results produced during the iterative process, and can only be trained sequentially. Adaboost first put the idea of boosting into practice, and more boosting algorithms were designed afterwards, such as SMOTEBoost (Chawla et al., 2003), RUSBoost (Seiffert et al., 2009), AdaboostNC (Wang and Yao, 2012) and BSIA (Zięba and Tomczak, 2015). Both EasyEnsemble and BalanceCascade (Liu et al., 2008b) use Adaboost as their base classifier, and BalanceCascade adjusts its threshold depending on the false positive rate. ERFADASYN (Balaram and Vasundra, 2022) proposes a new feature selection method called Butterfly optimization to solve the class imbalance problem. In Enslia (Jing et al., 2022), an "excellent and diverse" principle guides the training of base classifiers. SPE (Liu et al., 2020a) and MESA (Liu et al., 2020b) are boosting models trained on subsets under-sampled according to hardness distributions (Li et al., 2019).

3. Proposed Method

3.1. Symbol definition

In this paper, we define the majority class as the negative class, with N the set of all negative samples, and the minority class as the positive class, with P the set of all positive samples. The definitions are shown in Equation 1 and Equation 2.

(1) P = \{(x, y) \mid y = 1\}
(2) N = \{(x, y) \mid y = 0\}

IR (Imbalance Ratio) quantifies the level of data skewing in a dataset, as shown in Equation 3.

(3) IR = \frac{\text{number of majority class samples}}{\text{number of minority class samples}} = \frac{|N|}{|P|}
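As a quick worked example (a minimal sketch with a hypothetical label vector), the imbalance ratio of Equation 3 follows directly from the labels:

```python
import numpy as np

# Hypothetical label vector: 1 marks the minority (positive) class,
# 0 the majority (negative) class.
y = np.array([0] * 960 + [1] * 40)

P_size = np.sum(y == 1)   # |P|, the number of minority-class samples
N_size = np.sum(y == 0)   # |N|, the number of majority-class samples
IR = N_size / P_size      # Equation 3

print(f"|P|={P_size}, |N|={N_size}, IR={IR:.0f}:1")  # IR = 24:1
```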

We use the confusion matrix shown in Table 1 to evaluate the anomaly detection models' performance on the training set and the final performance of our ensemble framework.

Table 1. Confusion Matrix
Predict Positive Predict Negative
Label Positive TP FN
Label Negative FP TN

x_{j} is any sample in the dataset and A_{i} is the i-th anomaly detection model. AS_{i,j} = A_{i}(x_{j}) denotes the anomaly score of x_{j}, and B_{i,l} represents the l-th bin in the i-th base classifier. ASW_{i,l} is the weight of B_{i,l} and CEW_{i} is the weight of the i-th weak classifier. C is our contamination function, which is relevant to the percentage of outliers of the anomaly detection model.

3.2. Anomaly Scoring System

Since under-sampling tends to discard informative instances and disturb the underlying distributions, some methods try to extract hidden information from the raw data by applying Nearest Neighbor rules to guide resampling, as in CNN (Hart, 1968) and NCR (Laurikkala, 2001). Nearest Neighbor is often used in clustering and anomaly detection, so we apply more advanced anomaly detection algorithms such as Isolation Forest (Liu et al., 2008a) and SVDD (Ruff et al., 2018). Anomaly detection models score the examples, and examples whose scores surpass a threshold are classified as outliers. In ASE, we use these scores to guide our resampling strategy. Some methods introduce a concept called hardness distribution (Li et al., 2019; Liu et al., 2020a, b) and use common functions such as absolute error, squared error or cross entropy as the hardness function, which represents how hard an example is to classify correctly. They then split all positive samples into bins and take the mean over the examples in a bin as the weight of that bin.

Inspired by the ideas of instance weighting and splitting data into bins, we now introduce our own concept, ASW. ASW needs one hyper-parameter k, the number of bins. We split all training samples into k bins by the anomaly score AS in Equation 4, given by the anomaly detection model A, and each bin indicates a certain anomaly level, as shown in Equation 5.

(4) AS_{i,j} = A_{i}(x_{j})
(5) B_{i,l} = \{(x_{j}, y_{j}) \mid \frac{l-1}{k} \leq AS_{i,j} < \frac{l}{k}\}
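The binning step of Equations 4 and 5 can be sketched as follows. This is a minimal illustration rather than the paper's exact implementation: it assumes scikit-learn's IsolationForest as the detector A_i and min-max normalization to map raw scores into [0, 1]:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def score_and_bin(X, k=5, contamination=0.1, seed=0):
    """Equations 4-5: score samples with an anomaly detector and
    split them into k bins by normalized anomaly score."""
    A = IsolationForest(contamination=contamination, random_state=seed).fit(X)
    raw = -A.score_samples(X)  # flip sign so higher means more anomalous
    AS = (raw - raw.min()) / (raw.max() - raw.min() + 1e-12)  # normalize to [0, 1]
    # Bin l holds samples with (l-1)/k <= AS < l/k (1-indexed; AS = 1 lands in bin k).
    bins = np.minimum((AS * k).astype(int) + 1, k)
    return A, AS, bins
```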

Since our datasets have a high imbalance ratio, part of the minority class will be regarded as outliers and is therefore more likely to receive higher anomaly scores from the anomaly detection model than most samples in the majority class. However, what makes imbalanced classification intractable is that the majority class usually overlaps with the minority class (Del Río et al., 2014) and there is no clear boundary between the two classes. The overlapping data seriously affects the performance of the base classifiers, so we pay more attention to these samples with the weighting function shown in Equation 6.

(6) ASW_{i,l} = \frac{\log\frac{1}{|B_{i,l}|}}{\sum_{l=1}^{k}\log\frac{1}{|B_{i,l}|}}
(7) d_{i,l} = \frac{1}{\left|\frac{l}{k} - \frac{1}{2k} - (1 - c_{i})\right|}

ASW_{i,l} is the weight of the l-th bin according to the i-th anomaly detection model, and n_{i,l} is the number of instances to be resampled from the majority class in the l-th bin. c_{i} is the contamination coefficient used to train the i-th base classifier, equal to the percentage of samples that the anomaly detection model classifies as outliers; in Equation 7, d_{i,l} captures the distance between the l-th bin and the boundary of the outliers.

(8) n_{i,l} = |N| \cdot ASW_{i,l} \cdot \frac{d_{i,l}}{\sum_{l=1}^{k} d_{i,l}}

We then combine the instances resampled from the different bins with all samples from the minority class to construct a subset, which has a lower imbalance ratio than the original dataset, to train a base classifier.
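A sketch of Equations 6-8 is given below. The empty-bin guard and the random number handling are our own assumptions; everything else follows the formulas above:

```python
import numpy as np

def asw_subset(X, y, bins, k, c_i, seed=0):
    """Equations 6-8: weight each bin (ASW), measure its distance to the
    outlier boundary (d), under-sample n_{i,l} majority samples per bin,
    and combine them with the whole minority class into one subset."""
    rng = np.random.default_rng(seed)
    maj = (y == 0)
    # |B_{i,l}| over all training samples; empty bins are clamped to size 1 (an assumption)
    sizes = np.array([max(np.sum(bins == l), 1) for l in range(1, k + 1)])
    ASW = np.log(1.0 / sizes) / np.sum(np.log(1.0 / sizes))   # Equation 6
    mids = np.arange(1, k + 1) / k - 1.0 / (2 * k)            # bin midpoints l/k - 1/(2k)
    d = 1.0 / (np.abs(mids - (1.0 - c_i)) + 1e-12)            # Equation 7
    n = (maj.sum() * ASW * d / d.sum()).astype(int)           # Equation 8
    picked = [np.flatnonzero(y == 1)]                         # keep all minority samples
    for l in range(1, k + 1):
        idx = np.flatnonzero(maj & (bins == l))
        if len(idx):
            picked.append(rng.choice(idx, size=min(n[l - 1], len(idx)), replace=False))
    sub = np.concatenate(picked)
    return X[sub], y[sub]
```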

Figure 2. An example of how ASW affects the splitting process. We combine 50 weak classifiers to train an ASE model on the dataset Wine.

Figure 2 shows how ASW decides the resampling proportions in ASE. We train 50 weak classifiers with the contamination coefficient increasing from 0.05 to 0.40 to construct our ensemble learning model on the dataset Wine, which includes about 5,000 instances with an imbalance ratio of about 26:1. The number of samples in each bin is displayed, and we show the data distributions from 5 rounds of training in Figure 2.

3.3. Contamination Entropy Weight Ensemble

When constructing an ensemble model from weak classifiers, many mainstream methods do not combine the weak classifiers directly but take a weighted average of them. The weight functions are usually related to the classification results of the base classifiers from previous training rounds (Zhang et al., 2022). In RUSBoost (Seiffert et al., 2009), a pseudo-loss is calculated to update the weight parameters, which strongly influences the resampling in the next iteration and directly affects the weight of the base classifier trained in the current iteration. AdaboostNC (Wang and Yao, 2012) assesses the disagreement degree of the classification with a penalty strength and calculates the weight of base classifiers from error and penalty. Factors derived from the confusion matrix, such as the weight of FP and the ratio of TP to FP instances, are also used to update models in each iteration. In this paper, we propose our weight function CEW.

While training each base classifier, we change the contamination coefficient of the anomaly detection model, which corresponds to the assumed proportion of contaminated instances in the dataset. Since the proportion of outliers changes, the Recall on the training set changes as well. With a higher Recall, the anomaly detection model performs better on the minority class, and minority instances are more likely to be split into bins with higher ASW, so the number of instances resampled from these bins increases. As a result, n_{i,l} increases for l closer to k; that is, the base classifiers focus more on the informative instances of the majority class, such as those in the overlapping area between the majority class and the minority class.

The contamination percentage directly affects the Recall value, but the main reason for attaching an anomaly detection model to our framework is not only to detect all the outliers and classify the minority class correctly in the first place, but also to balance the proportions of the majority class, the minority class and the overlapping data in the constructed subsets. To handle data overlapping, we need the model to pay more attention to the overlapping area of the original dataset. In ASE, we combine the ideas of information entropy and the confusion matrix to weight each base classifier; the proposed weight function CEW is shown in Equation 9.

(9) CEW_{i} = C_{i} \cdot E_{i}

CEW consists of two parts. As shown in Equation 10, C is a contamination function of Recall, which reveals the anomaly detection model's ability to detect the minority class exactly. In Equation 11, E is a function derived from information entropy that quantitatively assesses the anomaly model's generalization ability on the overlapping area between the majority and minority classes.

(10) C = \log\frac{|P|}{FN} = \log\frac{1}{1 - Recall}
(11) E = -\sum_{i} p_{i}\log p_{i} = -\frac{TP}{TP+FP}\log\frac{TP}{TP+FP} - \frac{FP}{TP+FP}\log\frac{FP}{TP+FP}

The core idea of information entropy (Shannon, 1948) is to quantify the value of information. When little information is concealed in the data or an outcome is highly likely, the information entropy is low. When the majority class or the minority class makes up most of the outliers detected by the anomaly detection model, the detection result cannot give enough information to locate the overlapping area between the two classes, and the information entropy is low. On the contrary, when the number of majority-class examples predicted as positive (FP) is close to the number of minority-class examples predicted as positive (TP), the examples in the overlapping area are nearly balanced and the information entropy is higher. Since the overlapping area is more balanced, the weak classifier trained in this iteration will generalize better on the overlapping data, so we assign it a greater weight.
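Given the anomaly detector's confusion matrix on the training set, Equations 9-11 reduce to a few lines. The small epsilon guards against degenerate cases (perfect recall, empty outlier set) are our own additions to this sketch:

```python
import numpy as np

def cew(tp, fp, fn, eps=1e-12):
    """Equations 9-11: weight a base classifier by how well its anomaly
    detector recalls the minority class (C) and how balanced the predicted
    outliers are between the two classes (E)."""
    recall = tp / (tp + fn + eps)
    C = np.log(1.0 / (1.0 - recall + eps))                      # Equation 10
    p = tp / (tp + fp + eps)  # share of true minority among predicted outliers
    E = -(p * np.log(p + eps) + (1 - p) * np.log(1 - p + eps))  # Equation 11
    return C * E                                                # Equation 9
```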

3.4. ASE (Anomaly Scoring Based Ensemble Learning)

After discussing the modules of our model in detail, we now give the general framework of ASE. First, we need an anomaly detection model such as Isolation Forest (Liu et al., 2008a), SVDD (Ruff et al., 2018) or Auto Encoder (Zhou and Paffenroth, 2017) to assign anomaly scores AS to the training data. We then split the majority class into k bins according to the anomaly scores and calculate ASW for each bin. With ASW we can down-sample a specific number of instances from each bin and combine these instances with the minority class to build a subset whose proportions among the majority class, the minority class and the overlapping area have been adjusted. In this way, the classification model pays appropriate attention to different bins based on their contribution to the dataset's chaos. Base classifiers trained on these subsets suffer less from the high imbalance ratio and from resampling's disruption of the raw data distribution, and thus generalize better. In each training round, the proportion of contaminated data is set differently, which directly affects the resampling strategy, and we quantify each base classifier's generalization ability with our weight function CEW. CEW consists of two parts: C, related to the anomaly detection model's ability to find the minority class, and E, related to its ability to discover the overlapping area between the two classes. Finally, using CEW, we combine all the base classifiers into a strong classifier that adapts to various imbalance ratios, data scales and feature dimensions. The framework of ASE is shown in Figure 1 and its details in Algorithm 1.

Algorithm 1 ASE (Anomaly Scoring Based Ensemble Learning)
1:  Input: minority class P, majority class N, base classifier M, anomaly detection model A, number of bins k, number of base classifiers b
2:  for i = 1 to b do
3:     Train anomaly detection model A_i with contamination coefficient c_i and normalize AS_i
4:     Split the training set into k bins w.r.t. AS_i
5:     Compute ASW for each bin and under-sample a specific number of examples from the majority class in each bin
6:     Construct subset s_i by combining the resampled data with all the data in the minority class
7:     Train base classifier M_i on subset s_i
8:     Compute CEW_i for M_i
9:  end for
10:  Ensemble the base classifiers: ASE = \sum_{i=1}^{b} M_i \cdot CEW_i
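Putting the pieces together, a compact sketch of Algorithm 1 might look as follows. It reuses the score_and_bin, asw_subset and cew helpers sketched earlier, sweeps the contamination coefficient from 0.05 to 0.40 as in Figure 2, and reads the weighted sum in step 10 as a CEW-weighted average of predicted probabilities (our interpretation; the pseudocode does not spell this out):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def train_ase(X, y, k=5, b=50):
    """Algorithm 1: one anomaly detector, one ASW-guided subset and one
    CEW-weighted base classifier per round."""
    models, weights = [], []
    for i, c_i in enumerate(np.linspace(0.05, 0.40, b)):
        A, AS, bins = score_and_bin(X, k=k, contamination=c_i, seed=i)  # steps 3-4
        Xs, ys = asw_subset(X, y, bins, k, c_i, seed=i)                 # steps 5-6
        M = DecisionTreeClassifier(max_depth=10, random_state=i).fit(Xs, ys)  # step 7
        out = A.predict(X) == -1        # IsolationForest marks outliers with -1
        tp = np.sum(out & (y == 1))
        fp = np.sum(out & (y == 0))
        fn = np.sum(~out & (y == 1))
        models.append(M)
        weights.append(cew(tp, fp, fn))                                 # step 8
    w = np.array(weights) / (np.sum(weights) + 1e-12)
    return models, w

def predict_ase(models, w, X):
    # Step 10: CEW-weighted average of base-classifier probabilities,
    # thresholded at 0.5 to obtain hard labels.
    proba = sum(wi * m.predict_proba(X)[:, 1] for m, wi in zip(models, w))
    return (proba >= 0.5).astype(int)
```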

4. Experiment

4.1. Parameter Setting and Evaluation Criterion

In all our experiments, we randomly choose 80% of the samples from the majority class and the minority class respectively as the training set; the remaining samples form the testing set.

Unless otherwise specified, the training set is split into 5 bins and the depth of all tree models is set to 10, following (Liu et al., 2020a, b), which allows us to compare our model with others. In addition, all ensemble learning models contain 50 base classifiers, the default value in the Python package imbalanced-learn (Lemaître et al., 2017); other parameters take the default values of imbalanced-learn (Lemaître et al., 2017) and scikit-learn (Pedregosa et al., 2011). In other words, our model's two main hyper-parameters are k = 5 and b = 50 by default.
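Continuing the sketches above, the experimental setup could be reproduced along these lines (a hedged sketch; the stratified split is our reading of "randomly choose 80% of samples from each class"):

```python
from sklearn.model_selection import train_test_split

# 80/20 split, stratified so both classes keep their class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Paper defaults: k = 5 bins, b = 50 base classifiers, tree depth 10
models, w = train_ase(X_train, y_train, k=5, b=50)
```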

We use some common criteria in the imbalanced learning area based on the confusion matrix. If a classification model predicts all testing samples as the majority class, it will still get a high accuracy score because of the high imbalance ratio, so Precision and Recall are usually used to evaluate performance on the minority class. We therefore use Accuracy, Precision, Recall, AUC and F1 to evaluate our model's performance.
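With scikit-learn, these five criteria follow directly from the predictions of the sketch above:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, roc_auc_score, f1_score)

y_pred = predict_ase(models, w, X_test)
print("Accuracy :", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall   :", recall_score(y_test, y_pred))
print("AUC      :", roc_auc_score(y_test, y_pred))  # probabilities could be passed instead of hard labels
print("F1       :", f1_score(y_test, y_pred))
```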

4.2. Datasets

Imbalanced classification has many application scenarios, and the corresponding algorithms are widely used across industries such as finance, manufacturing, scientific research and medicine. Many areas suffer from data imbalance in real life, including risk prediction, financial assessment, loan default, credit card fraud, network intrusion detection, fraud detection, product quality inspection and disease diagnosis. We therefore selected several datasets that are representative of these scenarios to evaluate our proposed framework.

Table 2 contains the statistics of all datasets used in this paper.

Table 2. Datasets Detail
Dataset Repository Instance Feature IR
Credit Fault Kaggle 284,807 31 579:1
Credit Kaggle 150,000 11 14:1
KDD2004 KDD Cup 145,751 74 111:1
Letter UCI 20,000 16 26:1
Mammography UCI 11,183 6 42:1
ISOLET UCI 7,797 617 12:1
Wine UCI 4,898 11 26:1
Ozone Level UCI 2,536 72 34:1
  • Credit Fault: The dataset (Credit Card Fraud Detection) (Dal Pozzolo et al., 2017) contains transactions made with credit cards by European cardholders in September 2013. Each entity has 30 features, which are the result of a PCA transformation, and the dataset contains 492 frauds out of 284,807 transactions. It is significant for credit card companies to recognize fraudulent transactions under a high imbalance ratio of about 579:1.

  • Credit: Give Me Some Credit is a credit-scoring dataset used in a Kaggle featured prediction competition. We randomly select 150,000 instances with an imbalance ratio of 13.96:1 for our experiment. Each instance has 11 features describing credit card users' personal information.

  • KDD2004: This dataset is provided by KDD Cup 2004 and focuses on predicting which proteins are homologous to a native sequence. It has 145,751 instances with 74 features and an imbalance ratio of 111:1.

  • Letter: The UCI dataset Letter Recognition contains a large number of black-and-white rectangular pixel displays representing the 26 capital letters. It has 20,000 instances with 16 features and an imbalance ratio of 26:1.

  • Mammography: Mammography is the most effective method for breast cancer screening available today. Each instance in Mammography (Mammographic Mass Data Set) only contains 6 features and it has only 260 malignant instances out of 11,183 records.

  • ISOLET: ISOLET (Isolated Letter Speech Recognition) is a real-world dataset generated by 150 subjects who spoke the name of each letter of the alphabet. ISOLET is a high-dimensional dataset with 617 attributes, containing 7,797 instances, of which 600 belong to the minority class.

  • Wine: Wine Quality Data Set is a UCI (Asuncion and Newman, 2007) dataset that includes two subsets related to red and white variants of the Portuguese "Vinho Verde" wine. Since there are more normal wines than excellent or poor ones, this dataset has an imbalance ratio of about 26:1 and 4,898 instances with 11 features.

  • Ozone Level: The UCI dataset Ozone Level contains 2,536 instances with 72 dimensions, collected from 1998 to 2004 in the Houston, Galveston and Brazoria areas. This imbalanced dataset, with an imbalance ratio of 34:1, quantifies two ground ozone levels.

4.3. Experiment Design and Result

4.3.1. Experiments on Different Models

Table 3. Experiments on 6 datasets
Dataset Metric DT SMOTE RF GBDT Adaboost SPE BalanceCascade EasyEnsemble RUSBoost SMOTEBoost UnderBagging OverBagging ASE
Credit Fault Acc 0.999 0.991 0.999 0.999 0.999 0.999 0.999 0.977 0.942 0.988 0.981 0.999 0.999
Precision 0.899 0.146 0.941 0.796 0.833 0.796 0.650 0.065 0.028 0.120 0.077 0.959 0.840
Recall 0.756 0.832 0.784 0.760 0.695 0.839 0.852 0.887 0.876 0.875 0.885 0.738 0.873
AUC 0.878 0.912 0.892 0.880 0.848 0.919 0.925 0.932 0.909 0.932 0.933 0.869 0.936
F1 0.820 0.247 0.856 0.777 0.757 0.816 0.737 0.121 0.054 0.210 0.141 0.833 0.856
ISOLET Acc 0.939 0.933 0.965 0.974 0.971 0.977 0.970 0.928 0.891 0.957 0.951 0.963 0.984
Precision 0.579 0.525 0.942 0.949 0.817 0.784 0.722 0.507 0.378 0.668 0.609 0.811 0.822
Recall 0.618 0.828 0.555 0.685 0.774 0.952 0.954 0.941 0.725 0.824 0.936 0.653 0.990
AUC 0.791 0.884 0.776 0.841 0.880 0.966 0.962 0.934 0.814 0.896 0.944 0.820 0.987
F1 0.596 0.642 0.697 0.795 0.794 0.859 0.821 0.658 0.494 0.737 0.737 0.721 0.898
KDD 2004 Acc 0.997 0.973 0.997 0.996 0.997 0.997 0.995 0.965 0.948 0.974 0.979 0.997 0.997
Precision 0.906 0.226 0.981 0.799 0.859 0.851 0.670 0.186 0.133 0.233 0.275 0.959 0.810
Recall 0.689 0.878 0.659 0.746 0.726 0.846 0.876 0.927 0.879 0.904 0.922 0.708 0.914
AUC 0.844 0.926 0.830 0.872 0.863 0.923 0.936 0.946 0.914 0.939 0.950 0.854 0.956
F1 0.783 0.360 0.788 0.771 0.786 0.848 0.758 0.310 0.230 0.370 0.423 0.814 0.859
Mammography Acc 0.983 0.964 0.987 0.986 0.985 0.974 0.677 0.903 0.875 0.932 0.946 0.986 0.990
Precision 0.666 0.363 0.894 0.785 0.754 0.473 0.059 0.175 0.139 0.234 0.283 0.841 0.744
Recall 0.551 0.728 0.506 0.548 0.498 0.810 0.863 0.848 0.764 0.831 0.852 0.483 0.853
AUC 0.772 0.848 0.752 0.772 0.747 0.894 0.768 0.876 0.821 0.883 0.900 0.740 0.923
F1 0.597 0.483 0.645 0.642 0.599 0.595 0.111 0.289 0.233 0.364 0.425 0.613 0.793
Letter Acc 0.994 0.991 0.996 0.997 0.993 0.997 0.996 0.944 0.962 0.986 0.984 0.997 0.997
Precision 0.928 0.840 0.997 0.981 0.919 0.968 0.912 0.388 0.504 0.733 0.698 0.980 0.957
Recall 0.917 0.935 0.887 0.938 0.887 0.955 0.970 0.975 0.926 0.954 0.977 0.929 0.962
AUC 0.957 0.964 0.943 0.969 0.942 0.977 0.980 0.959 0.945 0.971 0.980 0.964 0.980
F1 0.922 0.884 0.938 0.959 0.903 0.960 0.940 0.555 0.643 0.829 0.814 0.953 0.960
Ozone Acc 0.949 0.91 0.97 0.955 0.968 0.869 0.842 0.862 0.847 0.935 0.874 0.971 0.982
Precision 0.192 0.157 0.1 0.235 0.383 0.163 0.132 0.15 0.09 0.241 0.156 0.592 0.632
Recall 0.213 0.484 0.007 0.217 0.218 0.811 0.77 0.778 0.428 0.557 0.739 0.085 0.953
AUC 0.592 0.704 0.503 0.598 0.605 0.841 0.807 0.822 0.644 0.752 0.808 0.542 0.968
F1 0.191 0.234 0.013 0.216 0.271 0.268 0.224 0.249 0.147 0.327 0.256 0.14 0.754

We first use Isolation Forest as the anomaly detection model and Decision Tree as the base classifier. To evaluate the performance of ASE, we use 6 datasets, Credit Fault, ISOLET, KDD2004, Mammography, Letter and Ozone Level, and compare ASE with 12 existing methods, including resampling methods and ensemble learning models widely used on imbalanced datasets: Decision Tree, SMOTE, Random Forest, GBDT, Adaboost, UnderBagging, OverBagging, EasyEnsemble, BalanceCascade, SMOTEBoost, RUSBoost and SPE.

The results are shown in Table 3. ASE obtains the highest Accuracy, AUC and F1 among all 13 compared methods on the 6 representative datasets, which span scenarios that usually suffer from high imbalance ratios. Compared with the second-best F1 score on Ozone Level and Mammography, ASE achieves 181% and 23% performance gains respectively.

The experimental results show that ASE is well designed and outperforms existing methods on the data skewing problem. Its stability across imbalance ratios, dataset scales and application scenarios demonstrates that ASE not only performs well but can also be used in a wide variety of situations.

4.3.2. Experiments on Different Base Classifiers and Anomaly Detection Algorithms

The above experiment uses Decision Tree as the base classifier and Isolation Forest as the anomaly detection model. To illustrate the flexibility of ASE's design, we vary the base classifiers and use different anomaly detection algorithms within ASE.

To compare ASE's performance with different base classifiers, we use Decision Tree, Logistic Regression, SVM, KNN and MLP on the datasets KDD2004, Credit and Wine, compare ASE with SMOTE, SPE, EasyEnsemble and BalanceCascade, and use F1 as the evaluation criterion. As shown in Table 4, ASE achieves the highest F1 in every setting, which demonstrates that ASE is not limited to one specific classifier; the ensemble learning framework generalizes well and can be applied with various classifiers.

Table 4. Performance of ASE with Different Base Classifiers
Dataset Base Classifier None SMOTE SPE BalanceCascade EasyEnsemble ASE
KDD2004 DT 0.798 0.345 0.841 0.743 0.304 0.866
LR 0.778 0.227 0.763 0.660 0.213 0.778
SVM 0.643 0.255 0.740 0.615 0.121 0.747
KNN 0.560 0.239 0.148 0.178 0.098 0.599
MLP 0.828 0.764 0.811 0.334 0.705 0.828
Credit Fault DT 0.288 0.340 0.357 0.179 0.323 0.363
KNN 0.041 0.158 0.160 0.133 0.174 0.245
MLP 0.126 0.300 0.140 0.272 0.360 0.380
Wine DT 0.277 0.208 0.254 0.256 0.229 0.549
KNN 0.067 0.170 0.135 0.082 0.150 0.376
MLP 0.096 0.203 0.198 0.189 0.111 0.236
Table 5. Performance of ASE with Different Anomaly Detection Models on Dataset Wine
DT UnderBagging BalanceCascade SPE Iforest-ASE OCSVM-ASE KNN-ASE AE-ASE ROD-ASE
Accuracy 0.855 0.872 0.846 0.835 0.958 0.928 0.974 0.957 0.971
Precision 0.130 0.182 0.155 0.151 0.448 0.338 0.649 0.492 0.592
Recall 0.531 0.745 0.737 0.789 0.703 0.855 0.657 0.701 0.605
AUC 0.699 0.811 0.793 0.813 0.835 0.893 0.822 0.834 0.794
F1 0.208 0.292 0.256 0.254 0.545 0.472 0.651 0.576 0.595

Besides changing the base classifiers, we also conducted an experiment combining different anomaly detection models with ASE. We use several representative and efficient anomaly detection models, including Isolation Forest, KNN, OCSVM, Auto Encoder and ROD, as the anomaly scoring part of ASE; a sketch of how such detectors plug into ASE follows the list below.

  • Isolation Forest (Liu et al., 2008a) randomly selects a value between the minimum and maximum values to split the data into partitions since anomalous data are few and different.

  • OCSVM (Manevitz and Yousef, 2001) is an unsupervised outlier detection method that estimates the support of a high-dimensional distribution, based on the premise that normal instances cluster in a dense region of the original dataset while outliers fall outside it.

  • KNN (Angiulli and Pizzuti, 2002) is a popular algorithm widely used in classification and anomaly detection, which assumes that outliers usually lie far from clusters of similar instances.

  • Auto Encoder (Zhou and Paffenroth, 2017) detects anomalies by encoding and compressing the data into a lower dimension and then decoding to reconstruct it; samples with large reconstruction error are flagged as outliers.

  • ROD (Almardeny et al., 2020) is a parameter-free algorithm which uses 3D-vectors to represent the raw data and uses Rodrigues rotation formula for scoring the data to find the outliers.
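As the paper does not name its implementation, one hedged way to swap detectors is through the PyOD library, whose estimators share a common interface (fit, decision_scores_, labels_); any of them can then feed ASE's binning step:

```python
from pyod.models.iforest import IForest
from pyod.models.knn import KNN
from pyod.models.ocsvm import OCSVM

def pyod_scores(detector, X):
    """Fit any PyOD-style detector and return min-max-normalized anomaly
    scores plus its 0/1 outlier labels, ready for ASE's binning step."""
    detector.fit(X)
    raw = detector.decision_scores_   # in PyOD, higher means more anomalous
    AS = (raw - raw.min()) / (raw.max() - raw.min() + 1e-12)
    return AS, detector.labels_       # labels_: 1 marks predicted outliers

# Hypothetical usage: score the training set with three different detectors.
for det in (IForest(contamination=0.1),
            KNN(contamination=0.1),
            OCSVM(contamination=0.1)):
    AS, outliers = pyod_scores(det, X_train)
```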

Decision Tree is selected as the base classifier, trained on the dataset Wine, and the results are shown in Table 5. We compare ASE with SMOTE, EasyEnsemble, BalanceCascade, UnderBagging, OverBagging and SPE, but due to space constraints we show only the best-performing baselines: UnderBagging, BalanceCascade and SPE.

4.3.3. Experiments on Sensitivity to Hyper-parameters

Since the number of base classifiers is a key determinant of the performance of ensemble learning models, we compare the training process of ASE with other ensemble learning models, including EasyEnsemble, BalanceCascade and SPE. We also vary the hyper-parameter k to split the training set into different numbers of bins; for example, k is set to 3 in ASE3. We choose the datasets Wine and ISOLET and Decision Tree as the base classifier.

Figure 3. Generalized Performance of ASE: (a) Wine AUC, (b) Wine F1, (c) ISOLET AUC, (d) ISOLET F1.
Table 6. Ablation Experiment
Dataset ASE without CEW ASE without ASW ASE without both ASE
Credit Fault 0.836 0.829 0.816 0.856
Mammography 0.788 0.668 0.590 0.791
Ozone 0.742 0.553 0.549 0.747
Wine 0.515 0.490 0.413 0.520

Figure 3 shows how AUC and F1 evolve while training the ensemble learning models; ASE performs best during nearly the whole training process. ASE performs well even with few base classifiers, converges faster than mainstream ensemble learning models, and is robust to different choices of k.

4.3.4. Ablation Experiment

To verify that ASE is well designed and not redundant, we conduct an ablation experiment showing that each module of ASE, namely CEW and ASW, is reasonable and necessary. We use Decision Tree as the base classifier and combine 20 base classifiers into the final ensemble model. To test whether CEW boosts ASE's performance, we assign equal weights to each base classifier. To remove ASW while keeping CEW working, we replace the splitting strategy of Equation 5: the training set is split into bins by the quantiles of the output anomaly scores, so every bin contains the same number of examples, and we down-sample randomly within each bin to construct a new subset. We also set up a control group without both ASW and CEW. F1 is the criterion of the ablation experiment.
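The ASW-free ablation variant replaces Equation 5 with quantile bins of equal size; a minimal sketch:

```python
import numpy as np

def quantile_bins(AS, k=5):
    """Ablation without ASW: quantile edges give every bin the same number
    of samples; random equal-size down-sampling then replaces Equation 8."""
    edges = np.quantile(AS, np.linspace(0, 1, k + 1))
    return np.clip(np.searchsorted(edges, AS, side="right"), 1, k)
```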

The results of the ablation experiment are shown in Table 6. Removing either ASW or CEW from the framework hurts ASE, and ASW appears to have a greater impact than CEW since it directly changes the resampling strategy, which is what lets the proposed framework surpass existing models. Since CEW is determined by the proportion of samples that the anomaly model classifies as the positive class, which carries information such as the information entropy and the Recall of the anomaly detection model, CEW also helps to improve the performance of the ensemble model.

5. Conclusions

High imbalance ratio, data overlapping, large scale and high feature dimension are common intractable problems in many real-life classification scenarios, and existing machine learning methods and ensemble models applied to them suffer from poor performance because of flaws embedded in their design. In this paper, we illustrate how data overlapping affects the performance of classification algorithms and propose an innovative ensemble learning framework, Anomaly Scoring Based Ensemble Learning, to solve classification problems in data-skewing real-life scenarios. Since our model can identify the overlapping area between the majority class and the minority class efficiently, the ensemble model generalizes better on highly imbalanced datasets with high-dimensional features. Experiments on several datasets from finance, medicine, manufacturing and meteorology verify the performance and applicability of our model. In addition, an ablation experiment confirms that each module of the framework boosts the ensemble model's performance and that the design is logical and not redundant. We believe that our ensemble learning model can be applied to a wide range of real-life scenarios.

References

  • Almardeny et al. (2020) Yahya Almardeny, Noureddine Boujnah, and Frances Cleary. 2020. A novel outlier detection method for multivariate data. IEEE Transactions on Knowledge and Data Engineering (2020).
  • Angiulli and Pizzuti (2002) Fabrizio Angiulli and Clara Pizzuti. 2002. Fast outlier detection in high dimensional spaces. In European conference on principles of data mining and knowledge discovery. Springer, 15–27.
  • Asuncion and Newman (2007) Arthur Asuncion and David Newman. 2007. UCI machine learning repository.
  • Balaram and Vasundra (2022) A Balaram and S Vasundra. 2022. Prediction of software fault-prone classes using ensemble random forest with adaptive synthetic sampling algorithm. Automated Software Engineering 29, 1 (2022), 1–21.
  • Chawla et al. (2002) Nitesh V Chawla, Kevin W Bowyer, Lawrence O Hall, and W Philip Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research 16 (2002), 321–357.
  • Chawla et al. (2003) Nitesh V Chawla, Aleksandar Lazarevic, Lawrence O Hall, and Kevin W Bowyer. 2003. SMOTEBoost: Improving prediction of the minority class in boosting. In European conference on principles of data mining and knowledge discovery. Springer, 107–119.
  • Cheng et al. (2016) Fanyong Cheng, Jing Zhang, and Cuihong Wen. 2016. Cost-sensitive large margin distribution machine for classification of imbalanced data. Pattern Recognition Letters 80 (2016), 107–112.
  • Dal Pozzolo et al. (2017) Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca Bontempi. 2017. Credit card fraud detection: a realistic modeling and a novel learning strategy. IEEE transactions on neural networks and learning systems 29, 8 (2017), 3784–3797.
  • Del Río et al. (2014) Sara Del Río, Victoria López, José Manuel Benítez, and Francisco Herrera. 2014. On the use of mapreduce for imbalanced big data using random forest. Information Sciences 285 (2014), 112–137.
  • Fernández et al. (2013) Alberto Fernández, Victoria LóPez, Mikel Galar, MaríA José Del Jesus, and Francisco Herrera. 2013. Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches. Knowledge-based systems 42 (2013), 97–110.
  • Freund and Schapire (1997) Yoav Freund and Robert E Schapire. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences 55, 1 (1997), 119–139.
  • Ha and Lee (2016) Jihyun Ha and Jong-Seok Lee. 2016. A new under-sampling method using genetic algorithm for imbalanced data classification. In Proceedings of the 10th International Conference on Ubiquitous Information Management and Communication. 1–6.
  • Haixiang et al. (2017) Guo Haixiang, Li Yijing, Jennifer Shang, Gu Mingyun, Huang Yuanyue, and Gong Bing. 2017. Learning from class-imbalanced data: Review of methods and applications. Expert systems with applications 73 (2017), 220–239.
  • Han et al. (2005) Hui Han, Wen-Yuan Wang, and Bing-Huan Mao. 2005. Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International conference on intelligent computing. Springer, 878–887.
  • Haque et al. (2021) HM Fazlul Haque, Muhammod Rafsanjani, Fariha Arifin, Sheikh Adilina, and Swakkhar Shatabda. 2021. Subfeat: Feature subspacing ensemble classifier for function prediction of dna, rna and protein sequences. Computational Biology and Chemistry 92 (2021), 107489.
  • Hart (1968) Peter Hart. 1968. The condensed nearest neighbor rule (corresp.). IEEE transactions on information theory 14, 3 (1968), 515–516.
  • He et al. (2008) Haibo He, Yang Bai, Edwardo A Garcia, and Shutao Li. 2008. ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE international joint conference on neural networks (IEEE world congress on computational intelligence). IEEE, 1322–1328.
  • Hsu et al. (2021) Grace G Hsu, Jabed H Tomal, and William J Welch. 2021. EPX: An R package for the ensemble of subsets of variables for highly unbalanced binary classification. Computers in Biology and Medicine 136 (2021), 104760.
  • Jing et al. (2022) Chao Jing, Yun Wu, and Chaoyuan Cui. 2022. Ensemble dynamic behavior detection method for adversarial malware. Future Generation Computer Systems 130 (2022), 193–206.
  • Krawczyk and Schaefer (2013) Bartosz Krawczyk and Gerald Schaefer. 2013. An improved ensemble approach for imbalanced classification problems. In 2013 IEEE 8th international symposium on applied computational intelligence and informatics (SACI). IEEE, 423–426.
  • Kubat et al. (1997) Miroslav Kubat, Stan Matwin, et al. 1997. Addressing the curse of imbalanced training sets: one-sided selection. In Icml, Vol. 97. Citeseer, 179.
  • Laurikkala (2001) Jorma Laurikkala. 2001. Improving identification of difficult small classes by balancing class distribution. In Conference on artificial intelligence in medicine in Europe. Springer, 63–66.
  • Lemaître et al. (2017) Guillaume Lemaître, Fernando Nogueira, and Christos K. Aridas. 2017. Imbalanced-learn: A Python Toolbox to Tackle the Curse of Imbalanced Datasets in Machine Learning. Journal of Machine Learning Research 18, 17 (2017), 1–5. http://jmlr.org/papers/v18/16-365.html
  • Li et al. (2019) Buyu Li, Yu Liu, and Xiaogang Wang. 2019. Gradient harmonized single-stage detector. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 8577–8584.
  • Li et al. (2022) Jinyan Li, Yaoyang Wu, Simon Fong, Antonio J Tallón-Ballesteros, Xin-she Yang, Sabah Mohammed, and Feng Wu. 2022. A binary PSO-based ensemble under-sampling model for rebalancing imbalanced training data. The Journal of Supercomputing 78, 5 (2022), 7428–7463.
  • Liu et al. (2008a) Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. 2008a. Isolation forest. In 2008 eighth ieee international conference on data mining. IEEE, 413–422.
  • Liu et al. (2008b) Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou. 2008b. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, 2 (2008), 539–550.
  • Liu et al. (2020a) Zhining Liu, Wei Cao, Zhifeng Gao, Jiang Bian, Hechang Chen, Yi Chang, and Tie-Yan Liu. 2020a. Self-paced ensemble for highly imbalanced massive data classification. In 2020 IEEE 36th international conference on data engineering (ICDE). IEEE, 841–852.
  • Liu et al. (2020b) Zhining Liu, Pengfei Wei, Jing Jiang, Wei Cao, Jiang Bian, and Yi Chang. 2020b. MESA: boost ensemble imbalanced learning with meta-sampler. Advances in Neural Information Processing Systems 33 (2020), 14463–14474.
  • Manevitz and Yousef (2001) Larry M Manevitz and Malik Yousef. 2001. One-class SVMs for document classification. Journal of machine Learning research 2, Dec (2001), 139–154.
  • Menardi and Torelli (2014) Giovanna Menardi and Nicola Torelli. 2014. Training and assessing classification rules with imbalanced data. Data mining and knowledge discovery 28, 1 (2014), 92–122.
  • Ng et al. (2020) Wing WY Ng, Shichao Xu, Jianjun Zhang, Xing Tian, Tongwen Rong, and Sam Kwong. 2020. Hashing-based undersampling ensemble for imbalanced pattern classification problems. IEEE Transactions on Cybernetics (2020).
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
  • Ramentol et al. (2012) Enislay Ramentol, Yailé Caballero, Rafael Bello, and Francisco Herrera. 2012. SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory. Knowledge and information systems 33, 2 (2012), 245–265.
  • Ruff et al. (2018) Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. 2018. Deep one-class classification. In International conference on machine learning. PMLR, 4393–4402.
  • Sáez et al. (2015) José A Sáez, Julián Luengo, Jerzy Stefanowski, and Francisco Herrera. 2015. SMOTE–IPF: Addressing the noisy and borderline examples problem in imbalanced classification by a re-sampling method with filtering. Information Sciences 291 (2015), 184–203.
  • Seiffert et al. (2009) Chris Seiffert, Taghi M Khoshgoftaar, Jason Van Hulse, and Amri Napolitano. 2009. RUSBoost: A hybrid approach to alleviating class imbalance. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans 40, 1 (2009), 185–197.
  • Shannon (1948) Claude Elwood Shannon. 1948. A mathematical theory of communication. The Bell system technical journal 27, 3 (1948), 379–423.
  • Smith et al. (2014) Michael R Smith, Tony Martinez, and Christophe Giraud-Carrier. 2014. An instance level analysis of data complexity. Machine learning 95, 2 (2014), 225–256.
  • Tang et al. (2008) Yuchun Tang, Yan-Qing Zhang, Nitesh V Chawla, and Sven Krasser. 2008. SVMs modeling for highly imbalanced classification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, 1 (2008), 281–288.
  • Tomek (1976) Ivan Tomek. 1976. Two modifications of CNN. IEEE Trans. Systems, Man and Cybernetics 6 (1976), 769–772.
  • Wang (2008) He-Yong Wang. 2008. Combination approach of SMOTE and biased-SVM for imbalanced datasets. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence). IEEE, 228–231.
  • Wang and Yao (2009) Shuo Wang and Xin Yao. 2009. Diversity analysis on imbalanced data sets by using ensemble models. In 2009 IEEE symposium on computational intelligence and data mining. IEEE, 324–331.
  • Wang and Yao (2012) Shuo Wang and Xin Yao. 2012. Multiclass imbalance problems: Analysis and potential solutions. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 42, 4 (2012), 1119–1130.
  • Yang et al. (2021) Ping Yang, Dan Wang, Wen-Bing Zhao, Li-Hua Fu, Jin-Lian Du, and Hang Su. 2021. Ensemble of kernel extreme learning machine based random forest classifiers for automatic heartbeat classification. Biomedical Signal Processing and Control 63 (2021), 102138.
  • Zhang et al. (2022) Tianci Zhang, Jinglong Chen, Fudong Li, Kaiyu Zhang, Haixin Lv, Shuilong He, and Enyong Xu. 2022. Intelligent fault diagnosis of machines with small & imbalanced data: A state-of-the-art review and possible extensions. ISA transactions 119 (2022), 152–171.
  • Zhou and Paffenroth (2017) Chong Zhou and Randy C Paffenroth. 2017. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining. 665–674.
  • Zięba and Tomczak (2015) Maciej Zięba and Jakub M Tomczak. 2015. Boosted SVM with active learning strategy for imbalanced data. Soft Computing 19, 12 (2015), 3357–3368.