Digital Object Identifier 10.1109/ACCESS.2024.0429000
Corresponding author: Shonal Chaudhry (Email: [email protected]).
Data Distribution-based Curriculum Learning
Abstract
The order of training samples can have a significant impact on a model’s performance. Curriculum learning is an approach for gradually training a model by ordering samples from ‘easy’ to ‘hard’. This paper proposes a novel curriculum learning strategy called Data Distribution-based Curriculum Learning (DDCL). DDCL uses the inherent data distribution of a dataset to build a curriculum based on the order of samples. The approach incorporates two distinct scoring methods, DDCL-Density and DDCL-Point, to determine the order of training samples. The DDCL-Density method assigns scores based on the density of samples, favoring denser regions that can make initial learning easier. Conversely, DDCL-Point scores samples by their Euclidean distance from the centroid of the dataset, providing an alternative perspective on sample difficulty. We evaluate the proposed DDCL approach by conducting experiments across various classifiers using a diverse set of small to medium-sized medical datasets. Results show that DDCL improves classification accuracy, achieving increases ranging from 2% to 10% compared to baseline methods and other state-of-the-art techniques. Moreover, analysis of the error losses for a single training epoch reveals that DDCL not only improves accuracy but also increases the convergence rate, underlining its potential for more efficient training. The findings suggest that DDCL can be of particular benefit to medical applications, where data is often limited, and indicate promising directions for future research in domains that involve limited datasets.
Index Terms:
classification, curriculum learning, data distribution, machine learning, neural network, random forest, support vector machine
I Introduction
Classification tasks in supervised machine learning use various techniques to create classifiers suited to the problem. Some of the widely used techniques are neural networks, support vector machines (SVM) and decision trees [1]. Neural networks are biologically inspired computing systems [2] that are widely used as classifiers due to their ability to learn and improve their performance through data and experience. SVMs use a threshold to classify a data sample as belonging to one class or another [3]. This is done by starting with the training data in a lower dimension and then mapping it to a higher dimension, where the data can be separated into groups that represent each class. Decision trees perform classification by using a tree-like structure made up of smaller decisions to make an overall decision for a given input. Multiple decision trees are often combined to create an ensemble classifier known as a random forest [4].
The performance of these classifiers is dependent on the quality of the data used and the robustness of the training algorithms. Among the factors that affect data quality are inaccurate, inconsistent, imbalanced, duplicate, missing and outlier samples in a dataset [5, 6]. Studies on data quality have shown that these properties may lead to a significant degradation in prediction performance and cause instability in learning due to high bias and/or high variance [5, 7].
To reduce the negative impact of these factors, the training algorithms applied to datasets are often designed to guide the learning model towards optimum performance [8, 9, 10]. Guidance for smaller datasets is particularly important since their limited size cannot provide the sample diversity present in larger datasets [11].
Gradient descent is the most common method of optimizing a neural network due to its fast convergence towards the minimum error [12]. Variants of gradient descent such as batch gradient descent, stochastic gradient descent (SGD) and mini-batch gradient descent exist for use with specific problem scenarios [13]. These problems may favor computation speed, data size or a balance of speed and size, as frequently seen in practice [14]. Support vector machines are optimized by selecting a value for the C hyperparameter, and the method of optimizing the kernel varies according to the kernel used. The Radial Basis Function (RBF) kernel is commonly used in an SVM and is optimized by selecting the gamma (γ) hyperparameter. Both of these hyperparameters can be found using several search algorithms, including trees of Parzen estimators, particle swarm optimization and Bayesian optimization. Recent findings have concluded that the first two algorithms provide better hyperparameter values with a lower execution time compared to Bayesian optimization, which has a very high computational cost [15]. In a random forest classifier, the number of decision trees (estimators) and the number of features required to split a tree are the hyperparameters that are optimized [16]. The number of estimators for a specific problem is carefully selected since it corresponds with the computation time required.
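As a brief illustration of this kind of hyperparameter search (not taken from the cited study, which compares trees of Parzen estimators, particle swarm optimization and Bayesian optimization), a simple cross-validated grid search over C and gamma for an RBF SVM could look as follows in scikit-learn:

```python
# Illustrative sketch: tuning the C and gamma hyperparameters of an RBF SVM
# with cross-validated grid search. The parameter ranges are assumptions
# chosen for the example, not values from the paper.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

param_grid = {
    "C": [0.1, 1.0, 10.0, 100.0],          # regularization strength
    "gamma": [1e-4, 1e-3, 1e-2, "scale"],  # RBF kernel width
}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```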
Furthermore, the data distribution plays a major role in obtaining quality predictions from classifiers. When using the same data, one distribution of the training data can provide better results than another. Research has shown that the choice of distribution can cause bias in the training data due to under-representation of the minority class [17]. This bias can have a significant impact on a model’s accuracy and precision, which can be critical for medical applications where training data can be limited [18, 19, 20]. Once the distribution of data is known, a decision can be made on how data samples are selected from the dataset for optimal results. This decision may include using a specific criterion for selecting samples in a particular order. The process of selectively choosing samples from a dataset, as well as determining their order, is known as creating a curriculum.
Using a curriculum for improving performance of a machine learning model is known as curriculum learning [21] and it is typically applied on the order of training data samples. Curriculum learning enhances training of a model by starting with simple concepts and then gradually introducing difficult concepts as training progresses [21, 22]. It is inspired by how humans are taught through curriculums in an education system by receiving basic education during childhood and then moving on to advanced education in adulthood [23, 24].
Curriculum learning has been applied to various problems achieving excellent results in the areas of image classification [25, 26], face recognition [27], visual attribute classification [28] and imbalanced data classification [29]. The effectiveness of the ‘easy’ to ‘hard’ strategy, where simpler examples are presented before more complex ones, has been well-documented in literature [30, 31].
However, rival approaches offer different strategies by focusing on sample difficulty and order during training. Self-paced learning (SPL) [32], a training strategy that also suggested presenting training samples ordered from simple to complex, built upon the concepts introduced in curriculum learning by altering the process of defining the difficulty of a sample. In SPL, the current learning progress of a model was considered for selecting the next sample whereas curriculum learning relied on a pre-determined fixed curriculum.
The concepts of SPL were used years later in a framework called self-paced curriculum learning (SPCL) [33]. SPCL sought to unify the ideas of curriculum learning and SPL. The authors of the study reasoned that both types of learning have drawbacks: the curriculum in curriculum learning is heavily reliant on the quality of fixed pre-determined prior knowledge and ignores feedback about the learner, whereas in SPL the curriculum is dynamically determined to adjust to the learning pace of the learner but is prone to over-fitting due to being unable to incorporate prior knowledge. According to the authors, SPCL addressed these drawbacks by introducing a flexible way of including prior knowledge while also dynamically adjusting the curriculum based on feedback from the learner.
Despite these competing strategies, curriculum learning has been successfully used to improve the performance of existing systems that use machine learning, including unsupervised domain adaptation [34], transfer learning [35, 36] and reinforcement learning [30, 37].
In this paper, we propose a data distribution based curriculum learning approach for classification tasks on small to medium-sized medical datasets. The proposed approach first determines the distribution of data and then information from the distribution is used to build a curriculum based on the order of samples. The approach is then evaluated on multiple datasets with classifiers based on three different types of widely used learning methods: neural networks, SVM and random forest classifier. The remainder of the paper is organized as follows. Section II provides a background on curriculum learning as well as applications of curriculum learning to various problems. The proposed curriculum learning approach is discussed in Section III and Section IV outlines its experimental evaluation. Section V examines the experiment results whereas Section VI provides concluding remarks.
II Background
The concept of using a curriculum in machine learning was introduced by Bengio et al. in 2009 [21]. Their work defined curriculum learning as a training strategy where simpler concepts are presented at the beginning of training and more complex concepts are introduced at later stages. The authors conducted experiments on shape recognition and language modelling for scenarios with and without a curriculum. Experiment results showed that by using a curriculum, faster training and better convergence can be achieved.
Hacohen et al. [36] expanded the definition of curriculum learning by describing it as consisting of two tasks with specific functions, termed the scoring function and the pacing function. The scoring function ranks the difficulty of the samples in a curriculum whereas the pacing function determines how often newer samples are presented to a model during training.
Curriculum learning has been applied to a variety of problems achieving promising results. Some of these problems include image classification [25, 26], face recognition [27], visual attribute classification [28] and imbalanced data classification [29]. Moreover, curriculum learning has also been successfully used to improve performance of systems where a curriculum is typically not considered. These include unsupervised domain adaptation [34], transfer learning [35, 36], bipedal walking for robots [37] and medical report generation [19].
Overall, we categorize the reviewed curriculum learning literature into three different groups: the applications of standard curriculum learning [28, 37, 19], variations of learning styles [35, 36, 27, 26] and research that makes use of the data density [25, 34, 29].
II-A Standard curriculum learning
Sarafianos et al. [28] applied curriculum learning to visual attribute classification by introducing a method that combined curriculum learning with multi-task learning. Their method performed end-to-end learning by providing a convolutional neural network (CNN) with a complete image of a human without additional data to aid in classification. Then in multi-task learning, the tasks were split into groups using clustering. Curriculum learning was applied to the groups starting with the highest within-group cross-correlation and moving to the less correlated ones. Experiments performed on three datasets consisting of humans standing resulted in an increase in performance by up to 10%.
Tidd et al. [37] used curriculum learning to train deep reinforcement learning policies for bipedal walking of a robot over challenging terrain. The authors created an easy to hard curriculum using a three stage framework: in the first stage, guiding forces were applied to the joints and the base of the robot to start learning on easy terrain, and the terrain difficulty was gradually increased. At the second stage, when the terrain was most difficult, the guiding forces applied to the robot were slowly decreased. During the final stage, the magnitude of external random perturbations was increased to improve the robustness of the policy. Simulation experiments conducted by the authors demonstrated that a curriculum approach was effective in learning to walk over five types of terrain.

Ma et al. [19] applied curriculum learning to the medical report generation task. Their work introduced a framework capable of learning medical reports from limited medical data while reducing data bias. The framework learned in an easy to hard manner using a two step process where at first simple reports were used and then gradually complex reports consisting of rare and diverse medical abnormalities were attempted. This process effectively simulated the learning process of radiologists. The method was evaluated on two public datasets resulting in a boost in performance of the baselines.
II-B Learning style variations
In [35], Weinshall et al. used transfer learning to implement curriculum learning. They presented an approach where the curriculum was inferred through transfer learning from another network that was pre-trained on a different task. CNNs with two architectures trained using curriculum learning were used to evaluate the proposed method on the CIFAR-100 [38] and STL-10 [39] datasets. Experiment results from their work concluded that with curriculum learning, convergence is faster during the beginning of training and improved generalization is achieved when more difficult tasks are used.
Hacohen et al. applied curriculum learning to the training of deep networks [36]. They used transfer learning and bootstrapping as two different techniques for the easy to hard process which they termed the scoring function. With transfer learning, a ‘teacher’ network was trained on a large dataset and its prediction performance was used as the scoring function. In bootstrapping, a network was trained without a curriculum and its performance on the training data defined the scoring function. Then, the network was retrained from scratch using curriculum learning. For the pacing function, three approaches were used with all pacing functions having comparable performance based on experiments done by the authors. Experiments done on six test cases showed that using a curriculum provided high accuracy and better convergence with the amount of improvement ranging from small to large depending on the test case.
Huang et al. [27] propose a different approach to curriculum learning in the area of face recognition. Instead of using a traditional curriculum created by fixed ordering of samples with increasing difficulty, the authors introduce a loss function which incorporates adaptive curriculum learning. During the training process, samples are randomly selected for each mini-batch and the curriculum is created adaptively from the selected samples. Furthermore, the definition of hard samples is dynamic with a sample classified as hard during the start of training becoming easy towards the end. Overall, the loss function emphasizes easy samples at the start and hard samples later. The authors conducted experiments on benchmark data achieving improved performance over other methods.
Recent research has looked at improving the general curriculum learning process. Zhou et al. [26] presented a curriculum learning method known as dynamics-optimized curriculum learning (DoCL). DoCL selected training samples at each step using weighted sampling based on the scores. The authors conducted experiments on more than nine datasets achieving results that significantly improved the performance and efficiency compared to existing curriculum learning methods.
II-C Data density based curriculum learning
Our review of the curriculum learning literature has found few papers that utilize the density of the data. These works are presented in this section, and our work on DDCL adds to this field of research. Specifically, DDCL addresses a key limitation of other works [25, 34] that require large datasets to utilize data density, by operating effectively on small datasets. Moreover, unlike these works, clustering in DDCL is used to determine the class centroids rather than to estimate the density.
Researchers in [25] presented a method for training deep neural networks on large-scale weakly-supervised web images. Their training strategy utilized curriculum learning to effectively handle the large amount of noisy labels and data imbalance during the training process. They proposed a curriculum created by measuring the complexity of samples using the density of the data. Experiments using the proposed training strategy resulted in state-of-the-art performance on benchmark datasets.
Choi et al. [34] presented another method of using the density of data to generate the curriculum. Their method applied clustering on the data where higher density samples were considered simple and lower density samples were considered as complex. The proposed method was robust against false pseudo-labelled samples due to the use of a pseudo-labelling curriculum. They achieved state-of-the-art classification results on three benchmark datasets.
Curriculum learning has been applied to imbalanced data classification as well. Dynamic Curriculum Learning (DCL), proposed by Wang et al. [29], addressed the problem of requiring prior knowledge to train a system when using conventional techniques. DCL used a two-level curriculum scheduler made up of a sampling scheduler and a loss scheduler. The sampling scheduler found the best samples in a batch to train the model by dynamically managing the target data distributions from imbalanced to balanced and from easy to hard. The loss scheduler controlled the learning weights between the classification loss and the metric learning loss. Experiments achieved state-of-the-art performance on the CelebA face attribute dataset and the RAP pedestrian attribute dataset.


III Proposed Curriculum Learning Approach
The curriculum learning approach proposed in this paper is based on the data distribution of a dataset. This is referred to as Data Distribution-based Curriculum Learning (DDCL) and involves multiple steps, as visualized in Figure 1. It begins with dividing the data into groups based on their target classes. Then, the centroid for each group is calculated (Centroid Determination) and used to compute the Euclidean distance between each sample and its centroid. The data distribution for each class is determined next and utilized to divide the data for individual classes into quantiles (Q1, Q2, Q3). After quantile division, oversampling is optionally performed on quantiles whose number of samples is highly imbalanced compared to the other quantiles. Lastly, the data is scored with either density or point scoring and rearranged for use in training.
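Written out, the distance computation for a sample x_i of class c with centroid μ_c is the standard Euclidean norm followed by min-max normalization; this notation is introduced here for clarity and reused in the algorithm descriptions below:

$$ d_i = \lVert x_i - \mu_c \rVert_2, \qquad \hat{d}_i = \frac{d_i - \min_j d_j}{\max_j d_j - \min_j d_j} $$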
The DDCL process is further detailed in Algorithm 1, Algorithm 2 and Algorithm 3. In the first step of DDCL, the training data samples are grouped according to their classes (X_c), with the number of groups being equal to the number of unique classes in the data. Next, the grouped training data is used to calculate the centroids (μ_c) for each class by performing clustering on each data group. Then, the centroids are used to compute the Euclidean distance from the relevant centroid (d_i) for each data sample (x_i). The third step determines the data distribution for a class (D_c) by plotting the normalized values of the Euclidean distances. Once the distribution of the training data is known, the data samples are divided into quantiles (Q). Each quantile (q_j) is then examined for its number of samples, and the quantiles with the lowest number of samples are oversampled using SMOTE [40] if there are sufficient samples. SMOTE creates synthetic samples from the original data, and different variants of it have been developed [41] since the initial method was proposed. The initial method of SMOTE is used in our proposed approach to address the potential lack of samples in a given quantile. If a quantile contains too few samples for SMOTE, no oversampling is applied. After the samples are divided into quantiles and optionally oversampled, each data sample is scored using either density or point scoring and rearranged accordingly (X_scored). Finally, the rearranged training data (X_curr) is ready to be passed to one of the learning methods for training and evaluation.
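A minimal sketch of these preparation steps, assuming numpy, pandas and scikit-learn, is shown below. The helper name ddcl_prepare and the default of four quantiles are illustrative assumptions, not the authors’ reference implementation:

```python
# Sketch of the DDCL preparation steps. All names and defaults are
# illustrative assumptions; scoring is shown in Sections III-A and III-B.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

def ddcl_prepare(X, y, n_quantiles=4):
    """Return per-sample normalized centroid distances and quantile indices."""
    dist = np.empty(len(X), dtype=float)
    for c in np.unique(y):                                    # group samples by class
        Xc = X[y == c]
        centroid = KMeans(n_clusters=1, n_init=10).fit(Xc).cluster_centers_[0]
        dist[y == c] = np.linalg.norm(Xc - centroid, axis=1)  # Euclidean distance
    dist = (dist - dist.min()) / (dist.max() - dist.min())    # normalize to [0, 1]
    quantile = pd.qcut(dist, n_quantiles, labels=False, duplicates="drop")
    # Sparse quantiles could optionally be oversampled here with SMOTE
    # (imblearn.over_sampling.SMOTE) before scoring; omitted for brevity.
    return dist, np.asarray(quantile, dtype=int)
```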
DDCL consists of two types of scoring methods: a sample density based method and a Euclidean distance based method. These are referred to as DDCL-Density and DDCL-Point respectively and are detailed in the following sections.
TABLE I. Summary of the datasets used in the experiments.
Dataset | Data Instances | Attributes | Class Type |
---|---|---|---|
Breast Cancer Wisconsin (Diagnostic) | 569 | 31 | Binary |
Cancer | 457 | 9 | Binary |
Haberman’s Survival | 306 | 3 | Binary |
Liver Disorder | 345 | 6 | Binary |
Pima Indians Diabetes | 768 | 8 | Binary |
New-Thyroid | 215 | 5 | Multi-class |
Diabetes 130 [42] | 86556 | 46 | Multi-class |
III-A Density based DDCL
Figure 2(a) illustrates the DDCL-Density process, where scoring is applied to the quantiles based on the number of samples in each quantile. Quantiles with the greatest number of samples (Q6) are given a higher score whereas quantiles with the least number of samples (Q1) are given a lower score. Higher scores are considered simple and lower scores are considered difficult. This results in a final quantile order sorted from the highest to the lowest density, thus defining the curriculum to be used during training.
TABLE II. Neural network classification accuracy (%) for the three test scenarios (mean ± standard deviation over 30 runs).
Dataset | Test Scenario | Worst % | Best % | Average %
---|---|---|---|---
Breast Cancer (Diagnostic) | No Curriculum | 91.304 | 100.000 | 96.232 ± 2.442
 | DDCL-Density | 89.130 | 100.000 | 96.594 ± 2.673
 | DDCL-Point | 86.957 | 100.000 | 97.319 ± 2.953
Cancer | No Curriculum | 86.486 | 100.000 | 94.955 ± 3.993
 | DDCL-Density | 91.667 | 100.000 | 96.852 ± 2.658
 | DDCL-Point | 86.111 | 100.000 | 96.574 ± 3.100
Haberman’s Survival | No Curriculum | 45.833 | 79.167 | 65.833 ± 8.898
 | DDCL-Density | 50.000 | 87.500 | 69.167 ± 10.353
 | DDCL-Point | 41.667 | 87.500 | 67.222 ± 9.424
Liver Disorder | No Curriculum | 57.143 | 85.714 | 69.048 ± 6.477
 | DDCL-Density | 46.429 | 85.714 | 68.214 ± 9.052
 | DDCL-Point | 57.143 | 82.143 | 69.524 ± 6.098
Pima Indians Diabetes | No Curriculum | 54.098 | 86.885 | 71.803 ± 6.658
 | DDCL-Density | 55.738 | 85.246 | 72.732 ± 5.933
 | DDCL-Point | 57.377 | 80.328 | 71.803 ± 5.460
New-Thyroid | No Curriculum | 82.353 | 100.000 | 93.137 ± 5.280
 | DDCL-Density | 82.353 | 100.000 | 95.490 ± 5.409
 | DDCL-Point | 82.353 | 100.000 | 94.118 ± 5.683
Diabetes 130 | No Curriculum | 53.235 | 54.694 | 54.029 ± 0.518
 | DDCL-Density | 56.148 | 56.798 | 56.579 ± 0.246
 | DDCL-Point | 54.400 | 56.278 | 55.139 ± 0.696
TABLE III. SVM classification accuracy (%) for the three test scenarios (mean ± standard deviation over 30 runs).
Dataset | Test Scenario | Worst % | Best % | Average %
---|---|---|---|---
Breast Cancer (Diagnostic) | No Curriculum | 94.737 | 99.415 | 97.700 ± 1.188
 | DDCL-Density | 95.322 | 99.415 | 97.758 ± 1.025
 | DDCL-Point | 94.737 | 99.415 | 97.719 ± 1.239
Cancer | No Curriculum | 94.928 | 99.275 | 96.715 ± 1.134
 | DDCL-Density | 94.118 | 99.265 | 96.985 ± 1.145
 | DDCL-Point | 92.647 | 99.265 | 96.225 ± 1.562
Haberman’s Survival | No Curriculum | 65.217 | 78.261 | 72.609 ± 4.051
 | DDCL-Density | 65.217 | 83.696 | 73.841 ± 4.836
 | DDCL-Point | 67.391 | 79.348 | 73.080 ± 3.329
Liver Disorder | No Curriculum | 44.231 | 76.923 | 63.750 ± 7.264
 | DDCL-Density | 49.038 | 75.000 | 66.699 ± 5.515
 | DDCL-Point | 55.769 | 75.962 | 68.558 ± 4.783
Pima Indians Diabetes | No Curriculum | 71.429 | 80.519 | 76.046 ± 2.500
 | DDCL-Density | 70.996 | 82.251 | 77.128 ± 2.991
 | DDCL-Point | 72.294 | 82.684 | 76.508 ± 2.514
New-Thyroid | No Curriculum | 92.308 | 100.000 | 97.128 ± 2.472
 | DDCL-Density | 92.308 | 100.000 | 97.282 ± 1.892
 | DDCL-Point | 92.308 | 100.000 | 96.872 ± 1.563
Diabetes 130 | No Curriculum | 64.285 | 65.487 | 64.948 ± 0.302
 | DDCL-Density | 64.220 | 65.457 | 64.977 ± 0.298
 | DDCL-Point | 64.301 | 65.469 | 65.052 ± 0.266
TABLE IV. Random forest classification accuracy (%) for the three test scenarios (mean ± standard deviation over 30 runs).
Dataset | Test Scenario | Worst % | Best % | Average %
---|---|---|---|---
Breast Cancer (Diagnostic) | No Curriculum | 94.152 | 98.246 | 95.945 ± 1.034
 | DDCL-Density | 92.398 | 98.246 | 95.653 ± 1.581
 | DDCL-Point | 91.228 | 98.246 | 95.945 ± 1.517
Cancer | No Curriculum | 92.754 | 100.000 | 97.150 ± 1.374
 | DDCL-Density | 94.853 | 100.000 | 96.691 ± 1.298
 | DDCL-Point | 94.118 | 100.000 | 97.206 ± 1.361
Haberman’s Survival | No Curriculum | 58.696 | 76.087 | 68.261 ± 4.100
 | DDCL-Density | 65.217 | 79.348 | 70.145 ± 3.825
 | DDCL-Point | 57.609 | 77.174 | 69.746 ± 4.367
Liver Disorder | No Curriculum | 59.615 | 78.846 | 70.417 ± 5.104
 | DDCL-Density | 67.308 | 77.885 | 72.917 ± 2.709
 | DDCL-Point | 60.577 | 78.846 | 70.064 ± 4.077
Pima Indians Diabetes | No Curriculum | 71.861 | 79.654 | 75.483 ± 1.985
 | DDCL-Density | 70.996 | 80.087 | 75.339 ± 2.200
 | DDCL-Point | 65.368 | 81.818 | 76.003 ± 3.461
New-Thyroid | No Curriculum | 95.385 | 100.000 | 97.949 ± 1.214
 | DDCL-Density | 92.308 | 100.000 | 98.000 ± 1.911
 | DDCL-Point | 95.385 | 100.000 | 97.897 ± 1.512
Diabetes 130 | No Curriculum | 65.830 | 66.658 | 66.218 ± 0.228
 | DDCL-Density | 65.626 | 66.948 | 66.242 ± 0.309
 | DDCL-Point | 65.542 | 66.805 | 66.199 ± 0.293
Algorithm 2 provides details of the DDCL-Density scoring process. It shows that the process starts with determining the density of each quantile by taking its cardinality, ρ_j = |q_j|. Once the densities for all quantiles (ρ) are known, the training data is sorted according to the density of each sample’s quantile, from highest to lowest. Finally, the rearranged data, X_curr, is ready for use in training.
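Continuing the earlier sketch (ddcl_prepare is the hypothetical helper defined above), the density scoring step could be expressed as:

```python
import numpy as np

def ddcl_density_order(X, y, quantile):
    """Sort samples so that denser (easier) quantiles come first."""
    counts = np.bincount(quantile)                        # density of each quantile
    order = np.argsort(-counts[quantile], kind="stable")  # densest quantile first
    return X[order], y[order]                             # curriculum-ordered data
```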
III-B Point based DDCL
On the other hand, Figure 2(b) illustrates how DDCL-Point assigns a score to each individual data sample by utilizing its normalized Euclidean distance from the centroid rather than examining quantiles. In this scoring method, data samples with the shorter Euclidean distances (such as samples 1 and 5 in the figure) are given the highest scores whereas samples with the longer Euclidean distances (such as sample 3) are assigned the lowest scores. As with DDCL-Density, higher scores in this point scoring method are considered simple whereas lower scores are considered difficult. Using DDCL-Point results in a curriculum where the training samples are ordered based on their individual characteristics rather than as a group.
Algorithm 3 explains the DDCL-Point scoring method: the data samples are sorted using the normalized Euclidean distance from the centroid, d̂_i. The sorting is done from the shortest distance to the longest regardless of which quantile a sample is assigned to. For example, consider a dataset with four quantiles, each containing at least five data samples. The sample with the lowest d̂ value may belong to quantile 1. Then, the sample with the second lowest value could come from quantile 4. The third lowest value can be a sample from quantile 2. Next, the fourth lowest may come from a sample in quantile 3. This sorting repeats until all data samples are accounted for, with the resulting order of quantiles and samples (X_curr) becoming mixed across quantiles, in contrast to DDCL-Density.
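In the same illustrative style as the previous sketches, DDCL-Point reduces to a single sort over the normalized distances:

```python
import numpy as np

def ddcl_point_order(X, y, dist):
    """Sort samples by normalized centroid distance, shortest (easiest) first."""
    order = np.argsort(dist, kind="stable")
    return X[order], y[order]
```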


IV Experiments
DDCL is evaluated by performing classification tasks on seven datasets obtained from the UCI Machine Learning Repository [43]. As listed in Table I, the datasets used for the experiments focus on small to medium-sized medical data. Three types of classifiers are used for evaluating DDCL: neural network, SVM and random forest.
TABLE V. Comparison of DDCL results with state-of-the-art methods.
Dataset | Method | Year | Accuracy %
---|---|---|---
Breast Cancer Wisconsin (Diagnostic) | SVM [44] | 2021 | 97.20
 | SVM (DDCL) | 2024 | 99.42
Haberman’s Survival | Random Forest [45] | 2023 | 74.00
 | Random Forest (DDCL) | 2024 | 79.35
Liver Disorder | Neural Network [46] | 2023 | 76.67
 | Neural Network (DDCL) | 2024 | 82.14
Pima Indians Diabetes | Random Forest [47] | 2022 | 79.57
 | Random Forest [45] | 2023 | 79.00
 | Random Forest (DDCL) | 2024 | 81.82
Diabetes 130 | Random Forest [42] | 2018 | 55.97
 | Random Forest (DDCL) | 2024 | 66.95
The neural network architecture used for evaluating DDCL consists of one input layer, a variable number of hidden layers and one output layer. The size of the input layer is equal to the number of attributes in the dataset and therefore varies with the selected dataset. Similarly, the number of hidden layers changes based on the dataset and is a hyperparameter that can be tuned; Bayesian optimization is used to tune the hidden layer values. Finally, the size of the output layer can be configured for binary or multi-class classification. The SVM classifier for evaluating DDCL uses a C value of 1.0 with an RBF kernel. The gamma (γ) value of the RBF kernel scales according to the dataset used. An RBF kernel is chosen due to its good generalization ability and robustness to input noise. The random forest classifier in our evaluation of DDCL uses 100 estimators and 2 features for splitting a node.
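In scikit-learn terms, the stated SVM and random forest settings correspond roughly to the following (a sketch of the reported hyperparameters, not the authors’ exact code):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

# SVM: C = 1.0 with an RBF kernel; gamma="scale" lets gamma adapt to the data
svm = SVC(C=1.0, kernel="rbf", gamma="scale")

# Random forest: 100 estimators, 2 candidate features when splitting a node
rf = RandomForestClassifier(n_estimators=100, max_features=2)
```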
During the neural network experiments, the selected dataset is split into training, validation and testing subsets. 20% of the data is used for validation, 10% for testing and the remainder for training. The neural network model is then trained on the training subset for 200 epochs and tested on the testing subset. Likewise in the SVM and random forest experiments, the selected dataset is split into training and testing subsets with 70% and 30% of the data being used respectively. The SVM and random forest classifiers are trained using the training subset of the data and tested on the testing subset.
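The splits described above could be produced as follows (an illustrative sketch; the random seed and the use of load_breast_cancer as a stand-in dataset are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)  # stand-in for any of the datasets

# Neural network experiments: 70% train / 20% validation / 10% test
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=2 / 9, random_state=0)  # 2/9 of the remaining 90% = 20% of the total

# SVM and random forest experiments: 70% train / 30% test
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=0)
```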
For each dataset, 30 experiment runs are performed for each of the three test scenarios: No Curriculum, DDCL-Density and DDCL-Point. The same number of experiment runs is used for all three classifiers. The first test scenario, No Curriculum, trains the neural network, SVM and random forest classifiers on the input data without altering the order of the training samples. After training, classification experiments are conducted on the testing subset of the data. In the second test scenario, DDCL-Density, the input data is processed using the steps outlined in the DDCL method and scored by density. Classification is then performed on the testing subset as stated earlier. The DDCL-Point test scenario processes the input data through DDCL with the point scoring approach used for the scoring step. Experiments for classification are likewise performed on the testing subset.

V Discussion
Observations across the learning methods show that classification performance on every dataset increases when DDCL is applied, with at least one of the two scoring methods outperforming the no curriculum baseline in each case. This holds regardless of the learning method used, as shown in Table II, Table III and Table IV.

Using the neural network classifier provides the highest best-case results for the Haberman’s Survival, Liver Disorder and Pima Indians Diabetes datasets, whereas the SVM method achieves the highest average results on the Breast Cancer (Diagnostic), Haberman’s Survival and Pima Indians Diabetes data. The random forest classifier yields the highest average result for the Cancer, Liver Disorder, New-Thyroid and Diabetes 130 datasets. Notably, using DDCL with the SVM and random forest approaches results in significant increases in average accuracy for the Haberman’s Survival and Liver Disorder data respectively, compared to the other learning methods. Figure 3 shows the precision-recall curves for the best result of each binary classification dataset when using the neural network method. For multi-class data, the confusion matrix is provided in place of the precision-recall curves in Figure 4.
Table V shows a comparison of results obtained using DDCL against approaches used by other state-of-the-art works. It demonstrates DDCL’s ability to achieve superior results compared to the same type of classifier trained using a different algorithm. In particular, it outperforms the balanced stratified method used in [45] on the challenging Haberman’s Survival and Pima Indians Diabetes datasets. Furthermore, DDCL provides greater generalization capability and performance with neural networks on the Liver Disorder data compared to the standard approach used in [46].
The highest average result for Breast Cancer (Diagnostic) was achieved using SVM with density-based DDCL, while the largest improvement (1.087%) came from the neural network with point-based DDCL. For the Cancer dataset, random forest with point-based DDCL provided the highest average accuracy, and the neural network with density-based DDCL gave the largest improvement (1.897%). SVM with density-based DDCL resulted in the highest average accuracy on the Haberman’s Survival data; the largest improvement for this dataset (3.334%) was observed when using density-based DDCL with the neural network. The highest average result for Liver Disorder was achieved using random forest with density-based DDCL, whereas the SVM gave the largest average accuracy improvement at 4.808%. For the Pima Indians Diabetes dataset, the highest average accuracy was obtained using SVM with density-based DDCL, which also represented the largest improvement (1.082%).
In multi-class classification, using random forest with density-based DDCL resulted in the highest average accuracy on the New-Thyroid data and the largest improvement for this dataset (2.353%) came from density-based DDCL applied to the neural network. For the Diabetes 130 data, density-based DDCL with random forest provided the highest average accuracy while the largest improvement in accuracy (2.550%) was obtained through the neural network classifier utilizing density-based DDCL.
The experiment results demonstrate that using DDCL to order the training data for a given learning method leads to improvements in classification accuracy. Across all datasets tested, either the density or the point-based DDCL approach increased performance compared to the no curriculum approach. Figure 5 shows the error loss per epoch when using the neural network classifier for each dataset except Diabetes 130, which is excluded since its average accuracy is not significant [42]. The plot for Breast Cancer (Diagnostic) shows a minimal change in the DDCL error loss trends compared to the no curriculum method, whereas the Cancer dataset plot presents a reduction in the error loss when using DDCL. For the remaining datasets, the use of DDCL shows a significant and quick reduction in the error loss.
In addition, using DDCL to order the training data results in faster convergence. We arrived at this conclusion by examining the error loss for the first five epochs of each dataset using batch gradient descent. Figure 6 shows the error loss plots for each dataset. It can be seen that the overall losses for DDCL-Density and DDCL-Point are reduced faster compared to the no curriculum approach thus leading to faster convergence towards the minimum. This is consistent with the findings of the error loss against epoch plots in Figure 5 where the error loss towards the end is lower for DDCL-Density and DDCL-Point.
VI Conclusion
This paper proposed a curriculum learning approach known as Data Distribution-based Curriculum Learning (DDCL). It used the data distribution of a dataset to build a curriculum based on the order of samples. Two types of scoring methods known as DDCL-Density and DDCL-Point were used to score samples thus determining their training order. DDCL-Density used the sample density to assign scores whereas DDCL-Point utilized the Euclidean distance for scoring.
Experiments were conducted on multiple medical datasets using neural network, SVM and random forest classifiers to evaluate DDCL. Results showed that even though performance varied across datasets and classifiers, the application of DDCL resulted in accuracy increases ranging from 2% to 10% for all datasets compared to other state-of-the-art methods. Furthermore, analysis of the error losses for five training epochs using batch gradient descent revealed that convergence is faster when using DDCL with either of the scoring methods than with the no curriculum method.
The current DDCL approach uses only two types of scoring methods and is limited to using a single scoring method at a time. Moreover, the proposed approach uses a fixed curriculum that is pre-determined before the start of training and does not take into account the current training progress.
Future work will explore the creation of additional scoring methods to determine their impact on training performance and investigate the application of ensemble learning using each scoring method. By introducing these new scoring methods, we aim to assess their effectiveness in comparison to other curriculum approaches such as SPL and Reverse Curriculum Learning (RCL). In addition, self-paced learning concepts will be incorporated into DDCL in order to dynamically determine the curriculum based on feedback from the learner. This integration will allow the model to adjust its learning pace according to its current performance, potentially leading to a more robust and adaptive training process. Furthermore, the algorithms underlying the proposed DDCL scoring methods suggest a versatility that can be adapted beyond the initial application to structured datasets. Specifically, the proposed approach could also be adapted to image and text processing applications, where the inherent complexity and variability of the data types present unique challenges.
Nomenclature
Symbol | Description
---|---
{X_c} | Grouped data samples for all classes
X_c | Data for a specific class c
μ_c | Centroid for a specific class c
d_i | Euclidean distance from the centroid (dimensionless)
D_c | Data distribution for a class
Q | Quantiles for the whole dataset
q_j | A single quantile in a dataset
X_scored | Scored training data
X_curr | Curriculum-prepared training data
ρ | Densities (sample counts) for all quantiles
d̂_i | Normalized Euclidean distance from the centroid
References
- [1] A. E. Maxwell, T. A. Warner, and F. Fang, “Implementation of machine-learning classification in remote sensing: an applied review,” International Journal of Remote Sensing, vol. 39, no. 9, pp. 2784–2817, May 2018.
- [2] A. Dongare, R. Kharde, and A. D. Kachare, “Introduction to artificial neural network,” International Journal of Engineering and Innovative Technology (IJEIT), vol. 2, no. 1, pp. 189–194, 2012.
- [3] O. Chapelle, P. Haffner, and V. Vapnik, “Support vector machines for histogram-based image classification,” IEEE Transactions on Neural Networks, vol. 10, no. 5, pp. 1055–1064, Sep. 1999.
- [4] L. Breiman, “Random Forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, Oct. 2001.
- [5] V. Gudivada, A. Apon, and J. Ding, “Data Quality Considerations for Big Data and Machine Learning: Going Beyond Data Cleaning and Transformations,” International Journal on Advances in Software, vol. 10, pp. 1–20, 07 2017.
- [6] H. Finch, “Distribution of Variables by Method of Outlier Detection,” Frontiers in Psychology, vol. 3, p. 211, 2012.
- [7] I. Taleb, M. A. Serhani, and R. Dssouli, “Big Data Quality: A Survey,” in 2018 IEEE International Congress on Big Data (BigData Congress), Jul. 2018, pp. 166–173.
- [8] D. Weiss, C. Alberti, M. Collins, and S. Petrov, “Structured training for neural network transition-based parsing,” CoRR, vol. abs/1506.06158, 2015.
- [9] M. Andrychowicz, M. Denil, S. Gómez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas, “Learning to learn by gradient descent by gradient descent,” in Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett, Eds., vol. 29. Curran Associates, Inc., 2016.
- [10] Y. Zhang, H. Tang, and K. Jia, “Fine-Grained Visual Categorization using Meta-Learning Optimization with Sample Selection of Auxiliary Data,” in Computer Vision – ECCV 2018. Cham: Springer International Publishing, 2018, pp. 241–256.
- [11] A. Althnian, D. AlSaeed, H. Al-Baity, A. Samha, A. B. Dris, N. Alzakari, A. Abou Elwafa, and H. Kurdi, “Impact of dataset size on classification performance: An empirical evaluation in the medical domain,” Applied Sciences, vol. 11, no. 2, p. 796, 2021.
- [12] A. Sharma, “Guided parallelized stochastic gradient descent for delay compensation,” Applied Soft Computing, vol. 102, p. 107084, 2021.
- [13] ——, “Guided stochastic gradient descent algorithm for inconsistent datasets,” Applied Soft Computing, vol. 73, pp. 1068–1080, 2018.
- [14] S. Ruder, “An overview of gradient descent optimization algorithms,” CoRR, vol. abs/1609.04747, 2016.
- [15] J. Wainer and P. Fonseca, “How to tune the RBF SVM hyperparameters? An empirical evaluation of 18 search algorithms,” Artificial Intelligence Review, vol. 54, no. 6, pp. 4771–4797, Aug. 2021.
- [16] P. Probst, M. N. Wright, and A.-L. Boulesteix, “Hyperparameters and tuning strategies for random forest,” WIREs Data Mining and Knowledge Discovery, vol. 9, no. 3, p. e1301, 2019.
- [17] L. A. Jeni, J. F. Cohn, and F. De La Torre, “Facing Imbalanced Data–Recommendations for the Use of Performance Metrics,” in 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, Sep. 2013, pp. 245–251.
- [18] N. G. Gyori, M. Palombo, C. A. Clark, H. Zhang, and D. C. Alexander, “Training data distribution significantly impacts the estimation of tissue microstructure with machine learning,” Magnetic Resonance in Medicine, vol. 87, no. 2, pp. 932–947, 2022.
- [19] X. Ma, F. Liu, S. Ge, and X. Wu, “Competence-based Multimodal Curriculum Learning for Medical Report Generation,” Oct. 2022, arXiv:2206.14579 [cs].
- [20] X. Li, G. Luo, W. Wang, K. Wang, and S. Li, “Curriculum label distribution learning for imbalanced medical image segmentation,” Medical Image Analysis, vol. 89, p. 102911, 2023.
- [21] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th annual international conference on machine learning, 2009, pp. 41–48.
- [22] K. Faber, D. Zurek, M. Pietron, N. Japkowicz, A. Vergari, and R. Corizzo, “From MNIST to ImageNet and back: benchmarking continual curriculum learning,” Machine Learning, 2024.
- [23] O. Erstad and J. Voogt, “The Twenty-First Century Curriculum: Issues and Challenges,” Springer, 2018, pp. 19–36.
- [24] K. A. Krueger and P. Dayan, “Flexible shaping: How learning in small steps helps,” Cognition, vol. 110, no. 3, pp. 380–394, 2009.
- [25] S. Guo, W. Huang, H. Zhang, C. Zhuang, D. Dong, M. R. Scott, and D. Huang, “Curriculumnet: Weakly supervised learning from large-scale web images,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 135–150.
- [26] T. Zhou, S. Wang, and J. Bilmes, “Curriculum Learning by Optimizing Learning Dynamics,” in Proceedings of The 24th International Conference on Artificial Intelligence and Statistics. PMLR, Mar. 2021, pp. 433–441.
- [27] Y. Huang, Y. Wang, Y. Tai, X. Liu, P. Shen, S. Li, J. Li, and F. Huang, “CurricularFace: Adaptive Curriculum Learning Loss for Deep Face Recognition,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 5901–5910.
- [28] N. Sarafianos, T. Giannakopoulos, C. Nikou, and I. A. Kakadiaris, “Curriculum learning of visual attribute clusters for multi-task classification,” Pattern Recognition, vol. 80, pp. 94–108, 2018.
- [29] Y. Wang, W. Gan, J. Yang, W. Wu, and J. Yan, “Dynamic curriculum learning for imbalanced data classification,” in 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5016–5025.
- [30] S. Narvekar, B. Peng, M. Leonetti, J. Sinapov, M. E. Taylor, and P. Stone, “Curriculum learning for reinforcement learning domains: A framework and survey,” arXiv preprint arXiv:2003.04960, 2020.
- [31] P. Soviany, R. T. Ionescu, P. Rota, and N. Sebe, “Curriculum learning: A survey,” International Journal of Computer Vision, vol. 130, no. 6, pp. 1526–1565, 2022.
- [32] M. Kumar, B. Packer, and D. Koller, “Self-Paced Learning for Latent Variable Models,” in Advances in Neural Information Processing Systems, vol. 23. Curran Associates, Inc., 2010.
- [33] L. Jiang, D. Meng, Q. Zhao, S. Shan, and A. G. Hauptmann, “Self-Paced Curriculum Learning,” in Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence (AAAI’15), Austin, TX, 2015, pp. 2694–2700.
- [34] J. Choi, M. Jeong, T. Kim, and C. Kim, “Pseudo-Labeling Curriculum for Unsupervised Domain Adaptation,” Aug. 2019, arXiv:1908.00262 [cs].
- [35] D. Weinshall, G. Cohen, and D. Amir, “Curriculum learning by transfer learning: Theory and experiments with deep networks,” in International Conference on Machine Learning. PMLR, 2018, pp. 5238–5246.
- [36] G. Hacohen and D. Weinshall, “On the power of curriculum learning in training deep networks,” in International Conference on Machine Learning. PMLR, 2019, pp. 2535–2544.
- [37] B. Tidd, N. Hudson, and A. Cosgun, “Guided Curriculum Learning for Walking Over Complex Terrain,” Feb. 2021, arXiv:2010.03848 [cs].
- [38] A. Krizhevsky, “Learning multiple layers of features from tiny images,” Canadian Institute for Advanced Research, Toronto, ON, Canada, Tech. Rep., 2009. [Online]. Available: https://www.cs.toronto.edu/~kriz/cifar.html
- [39] A. Coates, A. Ng, and H. Lee, “An analysis of single-layer networks in unsupervised feature learning,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 2011, pp. 215–223.
- [40] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, Jun. 2002.
- [41] A. Sharma, P. K. Singh, and R. Chandra, “SMOTified-GAN for Class Imbalanced Pattern Classification Problems,” IEEE Access, vol. 10, pp. 30655–30665, 2022.
- [42] Shane, “Readmission Prediction,” Dec. 2018. [Online]. Available: https://github.com/freesinger/readmission_prediction
- [43] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
- [44] M. A. Naji, S. E. Filali, K. Aarika, E. H. Benlahmar, R. A. Abdelouhahid, and O. Debauche, “Machine learning algorithms for breast cancer prediction and diagnosis,” Procedia Computer Science, vol. 191, pp. 487–492, 2021.
- [45] M. H. Zakaria, J. Jaafar, and S. J. Abdulkadir, “Preliminary investigation of balanced stratified reduction (BSR) for imbalanced datasets,” in 2023 IEEE International Conference on Artificial Intelligence in Engineering and Technology (IICAIET), 2023, pp. 55–60.
- [46] A. Kumar, K. Dev Mahato, C. Azad, and U. Kumar, “Liver disease prediction using different machine learning algorithms,” in 2023 International Conference on Advanced & Global Engineering Challenges (AGEC), 2023, pp. 1–6.
- [47] V. Chang, J. Bailey, Q. A. Xu, and Z. Sun, “Pima indians diabetes mellitus classification based on machine learning (ML) algorithms,” Neural Computing and Applications, vol. 35, no. 22, pp. 16 157–16 173, 2022.
Shonal Chaudhry received the M.S. degree in computing science from The University of the South Pacific, Laucala Campus, Fiji, in 2016. From 2015 to 2021, he was a full stack developer. He is currently a PhD researcher at The University of the South Pacific. His research interests are in artificial intelligence and its applications, with a focus on machine learning, computer vision and curriculum learning.
Dr Anuraganand Sharma (Senior Member, IEEE) received the B.S. and M.S. degrees in computer science from the University of the South Pacific, Fiji, and the Ph.D. degree in artificial intelligence from the University of Canberra, Australia, in 2014. From 2003 to 2005, he was a software developer, and in 2007 he joined the University of the South Pacific as an academic. His research interests are centered on deep learning with CNNs and constraint optimization with meta-heuristic algorithms. His recent work includes SMOTified-GAN for class imbalanced problems and enhancement of SGD for gradient-based learning systems.