Classification Trees for Imbalanced Data: Surface-to-Volume Regularization
Abstract
Classification algorithms face difficulties when one or more classes have limited training data. We are particularly interested in classification trees, due to their interpretability and flexibility. When data are limited in one or more of the classes, the estimated decision boundaries are often irregularly shaped due to the limited sample size, leading to poor generalization error. We propose a novel approach that penalizes the Surface-to-Volume Ratio (SVR) of the decision set, obtaining a new class of SVR-Tree algorithms. We develop a simple and computationally efficient implementation while proving estimation consistency for SVR-Tree and rate of convergence for an idealized empirical risk minimizer of SVR-Tree. SVR-Tree is compared with multiple algorithms that are designed to deal with imbalance through real data applications.
Keywords: CART, Categorical data, Decision boundary, Shape penalization
1 Introduction
We are interested in the common setting in which one has a set of training data $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^n$, with $x_i \in \mathbb{R}^p$ a vector of features and $y_i$ a class label. The goal is to estimate a classifier $f$, which outputs a class label given an input feature vector. This involves splitting the feature space into subsets having different class labels, as illustrated in Figure 1. Classification trees are particularly popular for their combined flexibility and interpretability (Hu et al., 2019; Lin et al., 2020).
Our focus is on the case in which the number of training samples is small for one or more of the classes. For simplicity in exposition, we assume $y_i \in \{0, 1\}$, so that there are only two classes. Without loss of generality, suppose that 0 is the majority class and 1 is the minority class, and call the set $\{x : f(x) = 1\}$ the decision set (of the minority class). Hence, the number of minority samples is relatively small compared to the number of majority samples by this convention.
To illustrate the problems that can arise from this imbalance, we consider a toy example. Let the sample space of be . Let the conditional distribution of given be , . We generate a training data set with minority samples and majority samples. The estimated decision sets of unpruned CART and optimally pruned CART (Breiman et al., 1984) are shown in Figure 1 (a) and (b), respectively. Minority samples are given weight 32 while majority samples are given weight 1 to naively address the imbalance. Both decision sets are inaccurate, with unpruned CART overfitting and pruned CART also poor.
Imbalanced data problems have drawn substantial interest; see Haixiang et al. (2017), He and Garcia (2008), Krawczyk (2016) and Fernández et al. (2017) for reviews. Some early work relied on random under- or over-sampling, which is essentially equivalent to modifying the weights and cannot address the key problem. Chawla et al. (2002) proposed SMOTE, which instead creates synthetic samples. For each minority class sample, they create synthetic samples along the line segments that join each minority class sample with its k nearest neighbors in the minority class. Building on this idea, many other synthetic sampling methods have been proposed, including ADASYN (He et al., 2008), Borderline-SMOTE (Han et al., 2005), SPIDER (Stefanowski and Wilk, 2008), safe-level-SMOTE (Bunkhumpornpat et al., 2009) and WGAN-Based sampling (Wang et al., 2019). These synthetic sampling methods have been demonstrated to be relatively effective.
However, current understanding of synthetic sampling is inadequate. Chawla et al. (2002) motivate SMOTE as designed to “create large and less specific decision regions”, “rather than smaller and more specific regions”. Later papers fail to improve upon this heuristic justification. Practically, the advantage of synthetic sampling over random over-sampling diminishes as the dimension of the feature space increases. In general, for each minority sample, the number of synthetic samples needed to fully describe its neighborhood grows with the dimension of the feature space. This is often infeasible due to the sample size of the majority class and to computational complexity. Hence, it is typical to fix the number of synthetic samples regardless of the dimension of the feature space (Chawla et al., 2002), which may fail to “create large and less specific decision regions” when the dimension is high.
Motivated by these issues, we propose to directly penalize the Surface-to-Volume Ratio (SVR) of the decision set in the construction of decision tree classifiers, based on the following considerations. First, a primary issue with imbalanced data is that the decision set without any regularization typically consists of many small neighborhoods around each minority class sample. As shown in Sections 2 and 3, by penalizing the SVR of the decision set, we favor regularly shaped decision sets that are much less subject to such over-fitting. Second, existing classification methods are usually justified with theoretical properties under strong assumptions on the local properties of the true decision sets, for example, requiring the true decision set to be convex or to have sufficiently smooth boundaries. Such assumptions may be overly restrictive for many real datasets. In contrast, the proposed SVR-regularized trees are only subject to a simple global constraint on the surface-to-volume ratio, which effectively controls the regularity of decision sets and leads to strong theoretical guarantees (see Section 3). Third, we illustrate that SVR regularization can be efficiently implemented for tree classifiers by proposing an algorithm with computational complexity similar to that of standard tree algorithms such as CART.
The rest of the paper is organized as follows. Section 2 describes our methodology and algorithmic implementation. Section 3 analyzes theoretical properties of SVR Trees, including consistency and rate of convergence for excess risk. Section 4 presents numerical studies for real datasets. Section 5 contains a discussion, and proofs are included in an Appendix.



2 Methodology
We first introduce the definitions of surface-to-volume ratio (SVR) and tree impurity, and then define SVR-Tree as the minimizer of a weighted combination of tree impurity and SVR. We then state the algorithm for estimating this tree from training data. We assume readers have basic knowledge of tree-based classifiers such as CART (Breiman et al., 1984) and C4.5 (Quinlan, 2014). In the rest of the paper, the word “tree” refers specifically to classification trees that specify a class label for each leaf node.
2.1 Notation
Training data are denoted as $\mathcal{D}_n = \{(x_i, y_i)\}_{i=1}^n$, with $x_i \in \mathbb{R}^p$ a vector of features and $y_i \in \{0, 1\}$ a class label. Uppercase letters denote random variables, while lowercase letters denote specific values. Denote the $j$th feature of $x_i$ as $x_{ij}$. Let $P$ denote the true distribution of the i.i.d. samples and let $\tilde P$ denote the weighted distribution that up-weights the minority class samples by a constant $w \ge 1$: for any measurable subset, $\tilde P$ multiplies the mass that $P$ assigns to minority class points by $w$ and renormalizes so that $\tilde P$ remains a probability measure.
Up-weighting minority class samples is a conventional technique to counter imbalance, and we focus on the weighted measure in the rest of the paper. Denote by $\tilde{\mathbb{P}}_n$ the weighted empirical distribution, which assigns mass proportional to $w$ to each minority class training sample and mass proportional to 1 to each majority class training sample, with the normalizing constant chosen so that the masses of all training samples add up to one.
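To make the weighting concrete, the following sketch (an illustration, not the paper's implementation) computes an integer minority weight and the normalized per-sample masses of the weighted empirical measure, using the weight choice described later in Section 4.2.

```python
import numpy as np

def weighted_empirical_masses(y):
    """Per-sample masses of the weighted empirical measure.

    Assumes binary labels with 1 = minority class.  The minority weight w is
    the largest integer keeping the total minority weight no greater than the
    total majority weight (the choice used in Section 4.2); masses are then
    normalized to sum to one.
    """
    y = np.asarray(y)
    n_minority = int((y == 1).sum())
    n_majority = int((y == 0).sum())
    w = max(n_majority // n_minority, 1)       # integer up-weighting factor
    raw = np.where(y == 1, float(w), 1.0)      # unnormalized masses
    return raw / raw.sum(), w

masses, w = weighted_empirical_masses([0] * 95 + [1] * 5)
print(w, round(masses.sum(), 6))               # 19 1.0
```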
We use $f$ with various subscripts and diacritics to denote a general classifier from the sample space to $\{0, 1\}$, while $T$ with various subscripts and diacritics denotes a tree classifier. A tree classifier is formed by finitely many splitting steps, and its decision set is a finite union of hyperrectangles. When discussing the probability of events that involve the random training samples, we simply use $\mathbb{P}$ to represent the probability measure on the corresponding product space. We write expectations under the true and empirical weighted measures as $\tilde{\mathbb{E}}$ and $\tilde{\mathbb{E}}_n$, and denote the conditional expectations of $Y$ given $X$ under the true and empirical measures by $\eta$ and $\eta_n$, respectively.
2.2 Surface-to-Volume Ratio
For any $p \ge 1$, consider the space $\mathbb{R}^p$ equipped with Lebesgue measure $\mu$. For any Lebesgue measurable closed set $\Omega$, we define its volume $V(\Omega)$ as the Lebesgue measure of $\Omega$. If $\Omega$ can be represented as a finite union of hyperrectangles, we define the surface $S(\Omega)$ of $\Omega$ as the $(p-1)$-dimensional Lebesgue measure of the boundary $\partial\Omega$; otherwise, if there exists a sequence $\{\Omega_m\}$ such that each $\Omega_m$ is a finite union of hyperrectangles and $\Omega_m$ converges to $\Omega$ in Hausdorff distance, we define the surface of $\Omega$ as $S(\Omega) = \lim_{m \to \infty} S(\Omega_m)$, provided the limit exists. Proposition 2 proves that the limit exists when the approximating sets are regularly shaped. Our definition of surface differs from the conventional notion of surface area. For example, in our definition the surface of a $p$-dimensional ball is computed through its axis-aligned hyperrectangular approximations and is therefore larger than its usual (Euclidean) surface area; in two dimensions, a disc of radius $r$ has surface $8r$, the same as a square with side length $2r$. We adopt this definition of surface since we focus on tree classifiers, whose decision sets are always finite unions of hyperrectangles.
For any set $\Omega$ with $V(\Omega) > 0$, the surface-to-volume ratio (SVR) can be obtained as $\mathrm{SVR}(\Omega) = S(\Omega)/V(\Omega)$. Among sets with the same volume, the $p$-dimensional cube has the smallest SVR, while sets having multiple disconnected subsets and/or irregular boundaries have relatively high SVR. We visualize sets with different SVR in two-dimensional space in Figure 2, with subfigures (a)-(c) corresponding to the decision sets in Figure 1. Decision sets with larger SVR tend to have more branches, disconnected components, or irregularly shaped borders than those with smaller SVR.
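As a concrete illustration of these definitions, the sketch below (ours) computes the volume, axis-aligned surface, and SVR of hyperrectangles; for a union it assumes the boxes are pairwise disjoint and non-touching, so that volumes and surfaces simply add.

```python
import numpy as np

def box_volume(lo, hi):
    """Volume of the axis-aligned box [lo_1, hi_1] x ... x [lo_p, hi_p]."""
    return float(np.prod(np.asarray(hi, dtype=float) - np.asarray(lo, dtype=float)))

def box_surface(lo, hi):
    """Boundary measure of the box: 2 * sum_j prod_{k != j} side_k."""
    sides = np.asarray(hi, dtype=float) - np.asarray(lo, dtype=float)
    vol = np.prod(sides)
    return float(2.0 * np.sum(vol / sides))

def svr_of_union(boxes):
    """SVR of a union of boxes, assuming they are pairwise disjoint and
    non-adjacent so that volumes and surfaces both add up."""
    vol = sum(box_volume(lo, hi) for lo, hi in boxes)
    surf = sum(box_surface(lo, hi) for lo, hi in boxes)
    return surf / vol

# A single square has the smallest SVR among boxes of equal volume ...
print(svr_of_union([((0, 0), (0.5, 0.5))]))                          # 8.0
# ... while splitting the same volume into two thin, separated strips inflates SVR.
print(svr_of_union([((0, 0), (0.05, 1.0)), ((0.5, 0), (0.7, 1.0))]))  # 18.0
```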








Surface-to-Volume Ratio of a Classification Tree
For training data $\mathcal{D}_n$, we define a closed and bounded sample space $\mathcal{X} \subset \mathbb{R}^p$ such that the support of $X$ is a subset of $\mathcal{X}$. A classification tree $T$ divides the sample space into two disjoint subsets $\Omega_0$ and $\Omega_1$, where $\Omega_0 \cup \Omega_1 = \mathcal{X}$. The tree predicts that a new sample $x$ belongs to class 0 if $x \in \Omega_0$, and to class 1 if $x \in \Omega_1$. The surface-to-volume ratio of the classification tree is defined as the surface-to-volume ratio of the minority-class decision set $\Omega_1$: $\mathrm{SVR}(T) = S(\Omega_1)/V(\Omega_1)$.
2.3 Impurity Function and Tree Impurity
A classification tree partitions the sample space into multiple leaf nodes, assigning one class label to each leaf node. The tree should be built to maximize homogeneity of the training sample class labels within nodes. This motivates the following definition.
Definition 1 (Impurity, Definition 2.5 of Breiman et al. 1984).
An impurity function $\phi(p_0, p_1)$ is defined on the set of pairs $(p_0, p_1)$ satisfying $p_0 \ge 0$, $p_1 \ge 0$, $p_0 + p_1 = 1$, with the properties: (i) $\phi$ achieves its maximum only at $(1/2, 1/2)$; (ii) $\phi$ achieves its minimum only at $(1, 0)$ and $(0, 1)$; (iii) $\phi$ is symmetric in $(p_0, p_1)$, i.e., $\phi(p_0, p_1) = \phi(p_1, p_0)$.
Let $p_0$ and $p_1$ represent the probabilities of belonging to the majority and minority class, respectively, within some branch of the tree. Ideally, splits of the tree are chosen so that, after splitting, $p_0$ and $p_1$ move closer to 0 or 1 and further from 1/2. Many different tree-building algorithms use impurity to measure the quality of a split; for example, CART uses Gini impurity and C4.5 uses entropy, which is effectively a type of impurity measure. In the remainder of the paper, the term ‘impurity function’ refers to the Gini impurity, $\phi(p_0, p_1) = 1 - p_0^2 - p_1^2 = 2 p_0 p_1$.
Let $A_1, \ldots, A_K$ be the leaf nodes of a classification tree $T$ and let $c_k \in \{0, 1\}$ be the predicted class label for node $A_k$, for $k = 1, \ldots, K$. Let $\tilde P$ be the weighted probability measure defined in Section 2.1. Then the impurity of leaf node $A_k$ is $\phi\big(\tilde P(Y = 0 \mid X \in A_k), \tilde P(Y = 1 \mid X \in A_k)\big)$. The impurity of node $A_k$ measures the class homogeneity in $A_k$, but does not depend on the predicted class label $c_k$. Let $c_k^*$ denote the dominant class label in $A_k$ under the weighted measure $\tilde P$. We define a signed impurity of node $A_k$ that takes the predicted label $c_k$ into account.
The signed impurity serves as a proxy for classification accuracy. If the predicted class label matches the dominant class label in node $A_k$, the signed impurity of node $A_k$ is equal to the impurity of node $A_k$ and is no greater than 1/2; otherwise, an extra penalty is applied. In other words, the signed impurity uses a quadratic loss instead of the absolute-value loss underlying classification accuracy. Taking weighted averages of the impurities and signed impurities across the leaf nodes, one obtains the tree impurity and the signed tree impurity, respectively.
Definition 2 (Tree Impurity).
Let $T$ be a tree and $A_1, \ldots, A_K$ be all the leaf nodes of this tree. Denoting the weighted probability measure as $\tilde P$, the tree impurity of $T$ is
$$I(T) = \sum_{k=1}^{K} \tilde P(X \in A_k)\, \phi\big(\tilde P(Y = 0 \mid X \in A_k),\, \tilde P(Y = 1 \mid X \in A_k)\big).$$
Definition 3 (Signed Tree Impurity).
Under the notation of Definition 2, with $\phi_s(A_k, c_k)$ denoting the signed impurity of leaf $A_k$ with predicted label $c_k$, the signed tree impurity is
$$I_s(T) = \sum_{k=1}^{K} \tilde P(X \in A_k)\, \phi_s(A_k, c_k).$$
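To make the impurity definitions concrete, the sketch below (ours, not the released code) computes the Gini impurity of each leaf and the weighted tree impurity of Definition 2 from the per-sample masses of Section 2.1; the penalty added in `signed_tree_impurity` when a leaf's predicted label disagrees with its dominant class is an assumed placeholder, not necessarily the paper's exact signed-impurity formula.

```python
import numpy as np

def gini(p1):
    """Gini impurity of a node whose weighted minority-class proportion is p1."""
    return 2.0 * p1 * (1.0 - p1)

def leaf_summaries(masses, y, leaf_ids):
    """Weighted probability and weighted minority proportion of each leaf.

    masses: normalized per-sample masses of the weighted empirical measure;
    y: binary labels (1 = minority); leaf_ids: leaf index of each sample."""
    out = {}
    for leaf in np.unique(leaf_ids):
        idx = leaf_ids == leaf
        prob = masses[idx].sum()                    # weighted mass of the leaf
        p1 = masses[idx][y[idx] == 1].sum() / prob  # weighted P(Y = 1 | leaf)
        out[leaf] = (prob, p1)
    return out

def tree_impurity(masses, y, leaf_ids):
    """Weighted average of leaf Gini impurities, as in Definition 2."""
    return sum(prob * gini(p1)
               for prob, p1 in leaf_summaries(masses, y, leaf_ids).values())

def signed_tree_impurity(masses, y, leaf_ids, labels):
    """Signed version: adds a penalty when a leaf's predicted label disagrees
    with its dominant class.  The penalty used here, |p1 - p0|, is an assumed
    placeholder for illustration, not necessarily the paper's exact formula."""
    total = 0.0
    for leaf, (prob, p1) in leaf_summaries(masses, y, leaf_ids).items():
        dominant = 1 if p1 >= 0.5 else 0
        penalty = abs(2.0 * p1 - 1.0) if labels[leaf] != dominant else 0.0
        total += prob * (gini(p1) + penalty)
    return total
```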
2.4 SVR-Tree Classifiers
The SVR-Tree classifier is the minimizer of a weighted combination of signed tree impurity and surface-to-volume ratio. Letting $\mathcal{T}$ be the collection of possible trees, the SVR-Tree classifier is defined as
$$\hat{T} = \operatorname*{arg\,min}_{T \in \mathcal{T}} \; I_s(T) + \lambda\, \mathrm{SVR}(T), \qquad (1)$$
where $\lambda > 0$ is a penalty parameter. In practice, the unknown probability measure $\tilde P$ is replaced with the weighted empirical measure $\tilde{\mathbb{P}}_n$ defined in Section 2.1. Unfortunately, without restricting the space of trees $\mathcal{T}$, optimization problem (1) is intractable. In the following subsection, we introduce an iterative greedy search algorithm that limits the size of $\mathcal{T}$ in each step to efficiently obtain a tree having provably good performance.
2.5 The SVR-Tree Algorithm
The SVR-Tree Algorithm is designed to find a nearly optimal SVR-Tree classifier. SVR-Tree proceeds in a greedy manner. We begin with the root node. At each step, we operate on one leaf node of the current tree, splitting it into two new leaf nodes by finding the solution of (1). The node to split at each step is uniquely specified by a breadth-first searching order. After splitting, the tree is updated and the node to split in the next step will be specified. The process stops when further splitting of leaf nodes either does not improve the loss or a prespecified maximum number of leaf nodes is achieved.
We first describe how to split a current leaf node to improve the solution to (1). Suppose the current tree is $T$ and the node to split is $A$, containing $m$ training samples. For each feature $j = 1, \ldots, p$, sort the samples in $A$ in increasing order of their $j$th feature. We only allow splits to occur at the midpoints of two adjacent values of each feature, so the total number of candidate splits is no more than $p(m - 1)$. After each such split of $A$, we keep all other leaf nodes unchanged while allowing all possible class label assignments at the two new daughter nodes of $A$. The current set $\mathcal{T}$ of trees to choose from in optimizing (1) includes the initial tree $T$ and all the split trees described above; its cardinality is of order $pm$, linear in the number of samples in $A$. We compute the regularized risk in (1) for all trees in $\mathcal{T}$ to find the minimizer. If the initial tree $T$ is the risk minimizer, we do not make any split in this step and mark the node $A$ as “complete”. Any node marked as complete will no longer be split in subsequent steps.
It remains to specify the ‘breadth-first’ searching order determining which node to split in each step. Let the depth of a node be the number of edges on the path to the root node, which has depth 0. Breadth-first algorithms explore all nodes at the current depth before moving on (Cormen et al. 2009, Chapter 22.2). To keep track of changes in the tree, we use a queue: a dynamic set in which elements are kept in order, inserted at the tail (enqueue) and removed from the head (dequeue); see Chapter 10.1 of Cormen et al. (2009) for details. We begin our algorithm with a queue whose only entry is the root node. At each step, we dequeue the node at the head of the queue and split at this node as described in the previous paragraph. If a split tree is accepted as the risk minimizer over the current set $\mathcal{T}$, we enqueue the two new leaf nodes; otherwise, the unsplit tree is the risk minimizer over the current set $\mathcal{T}$, so we do not enqueue any new node. The nodes at the front of the queue have the lowest depth, so the algorithm naturally follows a breadth-first searching order. We preset a maximal number of leaf nodes. The process stops either when the queue is empty, in which case all the leaf nodes are marked as complete, or when the number of leaf nodes reaches this maximum. On many datasets, Algorithm 1 stops before reaching the maximal number of leaf nodes. The upper bound is necessary in certain circumstances, since our algorithm, like the majority of practical tree algorithms, is greedy and not guaranteed to find the global minimum. If the global minimum can be found, the upper bound is no longer required, as shown in Theorem 2 and discussed thereafter.
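As a rough illustration of this per-node search (not the paper's released implementation), the sketch below enumerates the midpoint splits of a single leaf and scores each candidate with a caller-supplied objective; the helper names and the `score_fn` interface are our own.

```python
import numpy as np

def candidate_splits(X):
    """All (feature, threshold) pairs at midpoints of adjacent sorted values,
    matching the split rule described above."""
    splits = []
    for j in range(X.shape[1]):
        vals = np.unique(X[:, j])             # sorted distinct values of feature j
        mids = (vals[:-1] + vals[1:]) / 2.0
        splits.extend((j, t) for t in mids)
    return splits

def best_split(X, unsplit_score, score_fn):
    """Search over all candidate splits of one leaf and all label assignments
    of the two daughter nodes, keeping the lowest score.

    score_fn(j, t, left_mask, labels) should return the regularized risk of
    the tree after the candidate split; this is where the signed impurity and
    the SVR penalty of objective (1) would enter.  Returns None if no split
    improves on unsplit_score, in which case the node is marked "complete"."""
    best_score, best = unsplit_score, None
    for j, t in candidate_splits(X):
        left = X[:, j] <= t
        for labels in [(0, 0), (0, 1), (1, 0), (1, 1)]:  # labels of left/right child
            score = score_fn(j, t, left, labels)
            if score < best_score:
                best_score, best = score, (j, t, labels)
    return best
```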
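The queue-driven control flow can be sketched as follows; this is again only an illustration, with `root`, `try_split`, and the node objects standing in for the actual bookkeeping of leaf hyperrectangles and sample indices (the `try_split` callback is where the per-node search above and objective (1) would be evaluated).

```python
from collections import deque

def grow_tree_breadth_first(root, try_split, max_leaves):
    """Breadth-first tree growth in the spirit of Algorithm 1.

    root: the root node (any object describing its samples / hyperrectangle);
    try_split(node) -> (left, right) if splitting lowers the regularized risk,
    or None, in which case the node is treated as complete;
    max_leaves: preset cap on the number of leaf nodes.  Nodes are assumed to
    be distinct objects."""
    queue = deque([root])          # nodes still eligible for splitting, FIFO order
    leaves = [root]
    while queue and len(leaves) < max_leaves:
        node = queue.popleft()             # dequeue: lowest-depth node first
        result = try_split(node)
        if result is None:                 # no improving split: node is complete
            continue
        left, right = result
        leaves.remove(node)                # the split node is no longer a leaf
        leaves.extend([left, right])
        queue.append(left)                 # enqueue the two new daughter nodes
        queue.append(right)
    return leaves
```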
Our SVR-Tree algorithm has a coarse to fine tree building style, tending to first split the sample space into larger pieces belonging to two different classes, followed by modifications to the surface of the decision set to decrease tree impurity and SVR. The steps are sketched in Algorithm 1, where feature selection steps are marked as optional. A more detailed and rigorous version is in Section 4.1 of the supplementary material.
The average time complexity of Algorithm 1 is of the same order as that of standard tree algorithms like CART and C4.5. A detailed analysis of computational complexity is given in Section 4.2 of the supplementary material.
Optional Step for Feature Selection
It is likely that some features will not help to predict class labels; these features should be excluded from our estimated tree. Under some mild conditions, a split on a redundant feature has minimal impact on the tree impurity compared to a split on a non-redundant feature, so feature selection can be achieved by thresholding. Suppose we are splitting node $A$ into two new leaf nodes $A'$ and $A''$. Then the (unsigned) tree impurity decrease after this split is
$$\Delta = \tilde{\mathbb{P}}_n(X \in A)\,\phi(A) - \tilde{\mathbb{P}}_n(X \in A')\,\phi(A') - \tilde{\mathbb{P}}_n(X \in A'')\,\phi(A''),$$
where $\phi(\cdot)$ denotes the Gini impurity of a node under the weighted empirical measure.
Let $S$ be the indices of features that have been split in previous tree-building steps. Given that we are splitting node $A$, let $\Delta_j$ be the maximal tree impurity decrease over all splits on feature $j$. Then a split on a new feature $j' \notin S$, with tree impurity decrease $\Delta_{j'}$, is accepted if
$$\Delta_{j'} \;\ge\; \max_{j \in S} \Delta_j + \tau_n, \qquad (2)$$
where the threshold term $\tau_n$ involves a constant $c_0$ that is independent of the training data. By equation (2), a split on a new feature is accepted only if its tree impurity decrease exceeds the maximal tree impurity decrease over all previously split features by at least the threshold $\tau_n$. Theoretical support for this thresholding approach is provided in the supplementary material.
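A minimal sketch of this acceptance rule follows; the specific threshold $c_0\sqrt{\log n / n}$ is an assumed placeholder for illustration and is not the exact rate used in the paper.

```python
import math

def accept_new_feature(delta_new, best_prev_deltas, n, c0=0.01):
    """Thresholded feature-selection rule in the spirit of (2).

    delta_new: tree impurity decrease of the best split on a feature not yet
    used by the tree; best_prev_deltas: maximal impurity decreases of the
    features already split on; n: training sample size.  The threshold
    c0 * sqrt(log(n) / n) is an assumed placeholder; it vanishes as n grows,
    so informative new features are eventually accepted while redundant ones
    are screened out."""
    if not best_prev_deltas:          # no feature split yet: accept the best split
        return True
    threshold = c0 * math.sqrt(math.log(n) / n)
    return delta_new >= max(best_prev_deltas) + threshold
```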
3 Theoretical Properties
In this section, we discuss the theoretical properties of SVR-Tree from two aspects. First, we show that the classifier obtained from Algorithm 1 is consistent, in the sense that a generalized distance between SVR-Tree and the oracle classifier goes to zero as the sample size goes to infinity. Second, we show that for an idealized empirical risk minimizer over a family of SVR-constrained classifiers, the excess risk converges to zero at a rate, given in Theorem 2, that depends on the margin parameter defined in Condition 2. The consistency result provides basic theoretical guarantees for SVR-Tree, while the rate-of-convergence result provides further insight into how the surface-to-volume ratio works in nonparametric classification. We also develop results on the feature selection consistency of Algorithm 1 with the optional steps enabled; due to limited space, these results are provided in the supplementary material.
3.1 Estimation Consistency
Let the classification risk of a classifier $f$ under the weighted measure $\tilde P$ be $R(f) = \tilde P\big(f(X) \ne Y\big)$ and its empirical version under $\tilde{\mathbb{P}}_n$ be $R_n(f)$. Let $\mathcal{F}$ be the collection of all measurable functions from the sample space to $\{0, 1\}$. Then the oracle classifier and oracle risk are defined as $f^* = \arg\min_{f \in \mathcal{F}} R(f)$ and $R^* = R(f^*)$, respectively. We define the consistency of a sequence of classifiers based on its excess risk relative to the oracle risk.
Definition 4.
A sequence of classifiers $\{\hat f_n\}$ is consistent if the excess risk $R(\hat f_n) - R^*$ converges to zero as $n \to \infty$.
For any hyperrectangle $A$, denote the component-wise conditional expectation of $Y$ given the $j$th feature on $A$ by $\eta_j(x_j; A) = \tilde{E}(Y \mid X_j = x_j, X \in A)$. To ensure the consistency of Algorithm 1, we need the following identifiability condition.
Condition 1 (Identifiability).
For any hyperrectangle $A \subseteq \mathcal{X}$, if the component-wise conditional expectation $\eta_j(x_j; A)$ is a constant on $A$ for every $j = 1, \ldots, p$, then the conditional expectation $E(Y \mid X = x)$ is a constant on $A$.
Condition 1 is indispensable for proving the consistency of our SVR-regularized tree classifiers without requiring the diameter of each leaf node to go to zero. This is because CART-type algorithms can only make splits based on the component-wise conditional expectations. If all the component-wise conditional expectations are constant but the overall conditional expectation varies in $A$, then the problem is not identifiable by CART-type split rules. Condition 1 can also be viewed as a relaxed version of Condition (H1) in Scornet et al. (2015), who studied the consistency of random forest regression. Their Condition (H1) requires that the conditional mean of the response given the features follows an additive model in the regression tree setup. The analogous condition for our classification tree setup would require $E(Y \mid X = x)$ to be an additive function of the features, which would imply our Condition 1.
Denote the tree obtained from Algorithm 1 as $\hat T_n$. Theorem 1 shows that $\hat T_n$ is consistent under mild conditions.
Theorem 1 (Estimation consistency).
Let the weighted measure $\tilde P$ have a continuous marginal density on $\mathcal{X}$ that is bounded below by a positive constant. Assume that Condition 1 is satisfied, that $E(Y \mid X = x)$ is continuous everywhere in $\mathcal{X}$, and that the penalty parameter and the maximal number of leaf nodes satisfy suitable rate conditions as $n \to \infty$. Then the classifier obtained from Algorithm 1 is consistent.
3.2 Rate of Convergence for Empirical Risk Minimizer
We now consider the empirical risk minimizer over a class of trees under the SVR constraint and derive the rate of convergence for this empirical risk minimizer. This empirical risk minimizer can be considered as an idealized classifier.
We first introduce the family of SVR-constrained classifiers. For ease of notation, for any classifier $f$, we denote its decision set by $G_f$. Define $\mathcal{F}_T$ as the family of all possible classifiers generated from trees with finite depth. For a suitable constant $a$, the family of SVR-constrained classifiers $\mathcal{F}_{T,a}$ is defined as
(3)
where $B \setminus C$ denotes the set difference between two generic sets $B$ and $C$. The elements of $\mathcal{F}_{T,a}$ are constrained by SVR in the sense that the surface area of the decision set can decrease at most proportionally to the change in its volume. The parameter $a$ is set to be large enough to prevent $\mathcal{F}_{T,a}$ from being empty. The definition of $\mathcal{F}_{T,a}$ matches the intuition in Algorithm 1, where the SVR in every leaf node is checked. Moreover, the following proposition reveals that the regularized risk minimizer over all trees under the true measure, provided it exists, always lies in the family $\mathcal{F}_{T,a}$. Therefore, in the subsequent analysis, we focus on the effect of $a$ on the convergence of the excess risk, since it plays a role equivalent to that of $\lambda$ in the minimization problem (1).
Proposition 1.
Suppose that , the density of the weighted marginal measure exists and satisfies for all . If , then
The family $\mathcal{F}_{T,a}$ only contains tree classifiers and is not closed. It is desirable to consider the class of all classifiers that can be well approximated by tree classifiers. Let $d_H(B, C)$ be the Hausdorff distance between two generic sets $B$ and $C$. We define $\bar{\mathcal{F}}_{T,a}$, the closure of $\mathcal{F}_{T,a}$ with respect to the Hausdorff distance between decision sets, as the set of classifiers whose decision sets are Hausdorff limits of decision sets of classifiers in $\mathcal{F}_{T,a}$.
The elements of $\bar{\mathcal{F}}_{T,a}$ have a well-defined surface area, as shown in Proposition 2.
Proposition 2.
The surface $S(G_f)$ is well defined for all $f \in \bar{\mathcal{F}}_{T,a}$ in the sense that: (i) for every sequence in $\mathcal{F}_{T,a}$ whose decision sets converge to $G_f$ in Hausdorff distance, the corresponding sequence of surfaces is convergent in $\mathbb{R}$; (ii) for any two such sequences whose decision sets both converge to $G_f$, the limits of their surfaces coincide.
The proofs of Propositions 1 and 2 are available in the supplementary material. To better control the rate of convergence of the excess risk, we also need the following margin condition.
Condition 2.
There exist constants , , , such that for all classifiers satisfying , we have
Condition 2 is essentially Assumption (A1) in Tsybakov (2004), which is known as Tsybakov's margin condition and is widely used in classification problems to obtain fast convergence rates for the excess risk (Tsybakov, 2004; Audibert and Tsybakov, 2007). Condition 2 holds as long as $\tilde P\big(0 < |\eta(X) - 1/2| \le t\big) \le C t^{\alpha}$ for all $t > 0$ and some constants $C > 0$ and $\alpha > 0$; see Proposition 1 in Tsybakov (2004). Under Condition 2, our main theorem states the rate of convergence for the empirical risk minimizer over $\mathcal{F}_{T,a}$.
Theorem 2 (Rate of Convergence).
Theorem 2 provides an upper bound, stated in (4), on the rate of convergence of the excess risk of the empirical risk minimizer over all SVR-constrained tree classifiers in $\mathcal{F}_{T,a}$, relative to the theoretically best possible classifier in $\bar{\mathcal{F}}_{T,a}$ with a well-defined surface area. Because we place few restrictions on the split locations of the empirical risk minimizer apart from membership in $\mathcal{F}_{T,a}$, it is possible for a split to land directly on an observed sample under the empirical measure. In such situations, we allow the tree to assign that sample to either leaf node; hence, its leaf nodes can be open, closed, or half open and half closed. On the other hand, the theoretically best classifier can lie in the much larger family $\bar{\mathcal{F}}_{T,a}$. Although the classifiers in $\mathcal{F}_{T,a}$ are tree classifiers, there is no limit on how many leaf nodes they may have, so they can approximate decision sets with smooth boundaries, for example a $p$-dimensional ball. In fact, by adjusting $a$, $\bar{\mathcal{F}}_{T,a}$ will contain most classifiers with regularly shaped decision sets.
The rate of convergence for the excess risk in the upper bound (4) matches the minimax rate of set estimation on the family of sets with first-order differentiable boundaries in Mammen and Tsybakov (1995). Although Mammen and Tsybakov (1995) do not use the margin condition, and hence the margin exponent does not appear explicitly in their rates, their results yield the same rate as ours up to logarithmic factors if the margin condition is applied; see, for example, Theorem 4 of Scott and Nowak (2006). Furthermore, in the upper bound (4) in Theorem 2, we have explicitly characterized the effect of the SVR regularization through the set $\mathcal{F}_{T,a}$. By definition, the SVR constraint in $\mathcal{F}_{T,a}$ becomes weaker as $a$ increases, and as expected the upper bound in (4) increases with $a$, reflecting the increasing complexity of the set. A direct implication of this upper bound is that if the true classifier belongs to a family with a smaller $a$, i.e., a smaller surface-to-volume ratio, then the excess risk of the empirical risk minimizer also tends to be smaller.
Theorem 2 is related to the literature on the classification risk of decision trees, but with important differences. In general, the rate of convergence for the excess risk depends on two ingredients: a margin condition such as our Condition 2, and the complexity, or $\epsilon$-entropy, of the family of decision sets (Scott and Nowak, 2006). Suppose the $\epsilon$-entropy is of order $\epsilon^{-\rho}$ for some $\rho > 0$. If $\rho < 1$, then together with a suitable margin condition like Condition 2, the rate of convergence can be faster than $n^{-1/2}$, or even faster than $n^{-1}$ (Tsybakov, 2004; Audibert and Tsybakov, 2007). If $\rho > 1$, then the rate of convergence is typically of a slower polynomial order determined mainly by $\rho$ (Mammen and Tsybakov, 1995). Although it is possible to slightly improve the exponent by refining the margin condition, the overall rate of convergence is still governed by the $\epsilon$-entropy when $\rho$ is large. Therefore, a major objective of our research is to find a suitable family of sets with controllable $\epsilon$-entropy as well as a practically feasible algorithm. This is not easy, because only a limited number of families of sets have known $\epsilon$-entropy, including convex sets and sets with smooth boundaries. Both smoothness and convexity impose strong assumptions on the local properties of the decision boundary. In contrast, we propose the family of SVR-constrained sets as a new family of decision sets, which imposes a simple and intuitive global assumption on the shape of decision sets and additionally has a tractable $\epsilon$-entropy.
Scott and Nowak (2006) also noticed that the family of sets with smooth boundaries is not very realistic for practical problems. They proposed a family of decision sets named the “box-counting class”, which satisfies the conclusion of our Lemma 5. However, they did not further characterize the behavior of sets in their class, and hence they must compute the tree for all smallest dyadic splits and then merge these trees bottom-up, which can be regarded as an exhaustive search at the lowest split level. The resulting computational complexity grows rapidly with the number of dyadic splits and becomes intractable when the number of features is large, as the authors themselves note. Scott and Nowak (2006) also proposed an alternative to Tsybakov's margin condition (our Condition 2) and obtained a faster convergence rate in some specific situations, such as when the data lie on a low-dimensional manifold. However, their alternative condition directly bounds the excess error between a possible dyadic tree and the oracle classifier, which is much stronger than the standard margin condition we use.
The proof of Theorem 2 is in the Appendix. The key idea is to construct an $\epsilon$-net on the family of SVR-constrained classifiers under the symmetric set difference distance of the true measure and to show that, with high probability, this $\epsilon$-net is also a net of comparable radius under the symmetric set difference distance of the empirical measure. This nontrivial property is the reason why we can derive the rate of convergence for the empirical risk minimizer over the whole family, rather than just for the minimizer over an $\epsilon$-net of this family, as adopted in some previous classification literature (Mammen and Tsybakov, 1995; Tsybakov, 2004).
4 Numerical Studies
We compare SVR trees with popular imbalanced classification methods on real datasets, adding redundant features to these datasets to evaluate feature selection. A confusion matrix (Table 1) is often used to assess classification performance. A common criterion is accuracy, $(TP + TN)/(TP + FP + FN + TN)$. When minority samples are relatively rare, it is often important to give higher priority to true positives, which is accomplished using the true positive rate $TPR = TP/(TP + FN)$ (also known as recall) and precision $= TP/(TP + FP)$. To combine these, the F-measure is often used: $F = 2 \cdot \text{precision} \cdot TPR / (\text{precision} + TPR)$. The geometric mean of the true positive rate and the true negative rate, $\text{G-mean} = \sqrt{TPR \cdot TNR}$ with $TNR = TN/(TN + FP)$, is also used to evaluate the overall performance on both classes.
Table 1: Confusion matrix.
 | True Label 1 | True Label 0
Predicted Label 1 | True Positive (TP) | False Positive (FP)
Predicted Label 0 | False Negative (FN) | True Negative (TN)
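These criteria are computed directly from the confusion-matrix counts; the sketch below uses the standard formulas (and the convention from Section 4.3 of setting precision to zero when no positive predictions are made), and is not tied to the paper's code.

```python
import math

def imbalance_metrics(tp, fp, fn, tn):
    """Accuracy, precision, TPR (recall), F-measure and G-mean from confusion
    matrix counts.  Precision is set to zero when no positive predictions are
    made, matching the convention used in Section 4.3."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    tpr = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    tnr = tn / (tn + fp) if (tn + fp) > 0 else 0.0
    f_measure = (2 * precision * tpr / (precision + tpr)
                 if (precision + tpr) > 0 else 0.0)
    g_mean = math.sqrt(tpr * tnr)
    return dict(accuracy=accuracy, precision=precision, tpr=tpr,
                f_measure=f_measure, g_mean=g_mean)

print(imbalance_metrics(tp=40, fp=10, fn=12, tn=138))
```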
4.1 Datasets
We test our method on 12 datasets from the UCI Machine Learning Repository (Dua and Graff, 2017) and KEEL-dataset repository (Alcalá-Fdez et al., 2011), varying in size, number of features and level of imbalance. For datasets with three or more classes, classes are combined to form binary class datasets. Table 2 gives an overview of the datasets; detailed descriptions and preprocessing of these datasets are available in Section 5.1 of the supplementary material.
Data set name | number of total samples | number of minority samples | proportion of minority samples | number of features
Pima | 768 | 268 | 34.9% | 8 |
Titanic | 2201 | 711 | 32.3% | 3 |
Phoneme | 5404 | 1586 | 29.3% | 5 |
Vehicle | 846 | 199 | 19.0% | 18 |
Ecoli | 336 | 52 | 15.5% | 6 |
Segment | 2308 | 329 | 14.3% | 18 |
Wine | 1599 | 217 | 11.9% | 11 |
Page | 5472 | 559 | 10.2% | 10 |
Satimage | 6435 | 626 | 8.9% | 36 |
Glass | 214 | 17 | 8.0% | 9 |
Abalone | 731 | 42 | 5.4% | 7 |
Yeast | 1484 | 51 | 3.4% | 8 |
4.2 Experimental Setup
We test the performance of SVR-Tree, SVR-Tree with feature selection, CART (Breiman et al., 1984) with duplicated oversampling, CART with SMOTE (Chawla et al., 2002), CART with Borderline-SMOTE (Han et al., 2005), CART with ADASYN (He et al., 2008), and the Hellinger Distance Decision Tree (Cieslak et al., 2012) on all 12 datasets. Features are linearly transformed so that all samples lie in $[0, 1]^p$.
We use nested stratified cross-validation to evaluate the out-of-sample performance of all methods. The nested cross-validation has two layers: the inner layer is used to choose tuning parameters and the outer layer is used to evaluate out-of-sample performance. For each method and each dataset, we run the nested cross-validation procedure 20 times. In each run, we randomly partition the whole dataset into three stratified folds (the proportions of samples of each class are roughly the same in all folds) and run three times, each time selecting one of the three folds as the testing set and the other two folds as the training set. On the training set, we use 5-fold stratified cross-validation to choose the tuning parameter. That is, we further divide the training set into five folds and run five times, each time using four folds as (inner) training sets and the remaining fold for validation. We define the cross-validation TPR and TNR in the same manner as the cross-validation accuracy and use them to compute the cross-validation F-measure. The tuning parameter with the highest cross-validation F-measure is selected. We then train the model with the selected tuning parameter on the whole training set and evaluate its performance on the test set. The cross-validation accuracy, precision, TPR, F-measure and G-mean are recorded. The mean and standard error of these statistics over the 20 nested cross-validation runs are reported in Table 3.
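A skeleton of this nested cross-validation loop is sketched below using scikit-learn's stratified splitters. It is a simplification: `fit_svr_tree` and `f_measure` are hypothetical stand-ins for the actual training and scoring routines, and the sketch averages per-fold F-measures rather than pooling TPR and precision across inner folds as described above.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def nested_cv(X, y, lambdas, fit_svr_tree, f_measure, seed=0):
    """Outer 3-fold loop for testing, inner 5-fold loop for tuning lambda.

    X, y: numpy arrays; lambdas: candidate penalty values.
    fit_svr_tree(X, y, lam) -> fitted classifier with .predict(X), and
    f_measure(y_true, y_pred) -> scalar, are assumed helpers standing in for
    the SVR-Tree training and evaluation code."""
    outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in outer.split(X, y):
        X_tr, y_tr = X[train_idx], y[train_idx]
        # Inner 5-fold CV: pick the lambda with the highest F-measure.
        inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        cv_scores = []
        for lam in lambdas:
            fold_scores = []
            for fit_idx, val_idx in inner.split(X_tr, y_tr):
                model = fit_svr_tree(X_tr[fit_idx], y_tr[fit_idx], lam)
                fold_scores.append(f_measure(y_tr[val_idx],
                                             model.predict(X_tr[val_idx])))
            cv_scores.append(np.mean(fold_scores))
        best_lam = lambdas[int(np.argmax(cv_scores))]
        # Refit on the full training folds and evaluate on the held-out fold.
        model = fit_svr_tree(X_tr, y_tr, best_lam)
        scores.append(f_measure(y[test_idx], model.predict(X[test_idx])))
    return np.mean(scores)
```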
SVR-Tree with and without feature selection are described in Algorithm 1. The weight for the minority class, $w$, is set to be the largest integer such that the total weight of the minority class is no greater than the total weight of the majority class. The maximal number of leaf nodes is fixed in advance. The penalty parameter $\lambda$ for SVR is chosen from a geometric sequence, and the value with the highest F-measure across the 20 runs is selected. For SVR-Tree with feature selection, the constant $c_0$ in equation (2) is held fixed; in practice, the results are insensitive to this constant.
For the other methods except the Hellinger distance decision tree, we first oversample the minority class, so that the number of minority samples is multiplied by the same weight $w$ used for SVR-Tree. We then build the CART tree on the oversampled dataset and prune it, selecting the pruning parameter to maximize the F-measure. With the pruning algorithm of Breiman et al. (1984), the range of candidate pruning parameters becomes available after the tree is built and does not need to be specified ahead of time.
For duplicated oversampling, we duplicate each minority class sample so that it appears $w$ times in the training set, where $w$ is the weight used for SVR-Tree. For SMOTE and Borderline-SMOTE, the number of nearest neighbors $k$ is fixed in advance; for Borderline-SMOTE we adopt the Borderline-SMOTE1 variant of Han et al. (2005). For both SMOTE and Borderline-SMOTE, if the required number of synthetic samples exceeds $k$, some nearest neighbors may be used multiple times to generate synthetic samples, so that the total weight of minority class samples for these oversampling methods matches that of SVR-Tree. For ADASYN, the balance parameter is set so that the amount of oversampling matches that of the other methods, given the numbers of majority and minority class samples.
For the Hellinger distance decision tree, we directly call its Java implementation available in the Weka software (https://www.cs.waikato.ac.nz/ml/weka/). Although the Hellinger distance tree is claimed to be insensitive to imbalance, the current implementation occasionally classifies all samples into a single class. We set the default leaf prediction strategy to “Naive Bayes”, since it works well on most datasets; however, if the algorithm classifies all samples into a single class, we set the leaf prediction strategy to “Majority Voting”. The third leaf prediction strategy, “Naive Bayes Adaptive”, always produces the same results as one of the other two strategies in our experiments.
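With scikit-learn's CART implementation this works as noted above: the candidate pruning parameters become available only after the tree is grown. The sketch below (ours, with sklearn standing in for the CART implementation actually used) enumerates them on the oversampled training set.

```python
from sklearn.tree import DecisionTreeClassifier

def pruned_cart_candidates(X_over, y_over):
    """Fit CART on the oversampled training set and return one pruned tree per
    candidate cost-complexity parameter; the pruning level would then be
    selected by cross-validated F-measure as described above."""
    base = DecisionTreeClassifier(criterion="gini", random_state=0)
    path = base.cost_complexity_pruning_path(X_over, y_over)  # known after building
    trees = []
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(criterion="gini", ccp_alpha=alpha,
                                      random_state=0).fit(X_over, y_over)
        trees.append((alpha, tree))
    return trees
```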
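The synthetic oversamplers can be configured, for example, with the imbalanced-learn package as below. This is only a sketch: treating the matching of total minority weight as a target minority/majority ratio of $\min(w n_1 / n_0, 1)$ is our simplification, and the neighbor counts are illustrative.

```python
import numpy as np
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

def oversamplers(y, w, k=5, seed=0):
    """SMOTE / Borderline-SMOTE / ADASYN configured so the resampled minority
    class has roughly w times its original size, mirroring the weight used for
    SVR-Tree.  Assumes w > 1 so the target ratio exceeds the original one."""
    y = np.asarray(y)
    n1, n0 = int((y == 1).sum()), int((y == 0).sum())
    ratio = min(w * n1 / n0, 1.0)          # desired minority / majority ratio
    return {
        "SMOTE": SMOTE(sampling_strategy=ratio, k_neighbors=k, random_state=seed),
        "BSMOTE": BorderlineSMOTE(sampling_strategy=ratio, k_neighbors=k,
                                  kind="borderline-1", random_state=seed),
        "ADASYN": ADASYN(sampling_strategy=ratio, n_neighbors=k, random_state=seed),
    }

# Usage: X_res, y_res = oversamplers(y_train, w)["SMOTE"].fit_resample(X_train, y_train)
```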
Data set | Method | Accuracy | Precision | TPR | F-measure | G-mean |
Phoneme | SVR | 0.8350(0.0060) | 0.6749(0.0113) | 0.8458(0.0154) | 0.7506(0.0076) | 0.8380(0.0062) |
SVR-Select | 0.8377(0.0040) | 0.6807(0.0064) | 0.8420(0.0098) | 0.7528(0.0060) | 0.8389(0.0049) | |
Duplicate | 0.8554(0.0038) | 0.7567(0.0110) | 0.7482(0.0154) | 0.7522(0.0068) | 0.8205(0.0065) | |
SMOTE | 0.8598(0.0054) | 0.7637(0.0119) | 0.7567(0.0121) | 0.7601(0.0089) | 0.8264(0.0068) | |
BSMOTE | 0.8568(0.0060) | 0.7580(0.0143) | 0.7527(0.0127) | 0.7552(0.0093) | 0.8230(0.0070) | |
ADASYN | 0.8607(0.0050) | 0.7661(0.0116) | 0.7565(0.0115) | 0.7612(0.0082) | 0.8269(0.0063) | |
HDDT | 0.7065(0.0000) | 0.0000(0.0000) | 0.0000(0.0000) | 0.0000(0.0000) | 0.0000(0.0000) |
Segment | SVR | 0.9922(0.0017) | 0.9696(0.0089) | 0.9763(0.0061) | 0.9729(0.0060) | 0.9855(0.0033) |
SVR-Select | 0.9932(0.0013) | 0.9753(0.0080) | 0.9769(0.0053) | 0.9761(0.0047) | 0.9863(0.0027) | |
Duplicate | 0.9913(0.0012) | 0.9707(0.0069) | 0.9685(0.0073) | 0.9696(0.0044) | 0.9817(0.0036) | |
SMOTE | 0.9916(0.0010) | 0.9722(0.0061) | 0.9688(0.0081) | 0.9705(0.0035) | 0.9820(0.0038) | |
BSMOTE | 0.9915(0.0011) | 0.9726(0.0082) | 0.9676(0.0059) | 0.9701(0.0037) | 0.9814(0.0027) | |
ADASYN | 0.9919(0.0011) | 0.9713(0.0063) | 0.9716(0.0064) | 0.9714(0.0040) | 0.9833(0.0032) | |
HDDT | 0.8330(0.0021) | 0.4599(0.0031) | 0.9825(0.0023) | 0.6266(0.0031) | 0.8910(0.0017) | |
Page | SVR | 0.9656(0.0015) | 0.8248(0.0098) | 0.8429(0.0117) | 0.8337(0.0076) | 0.9087(0.0062) |
SVR-Select | 0.9647(0.0016) | 0.8252(0.0124) | 0.8311(0.0140) | 0.8280(0.0077) | 0.9024(0.0073) | |
Duplicate | 0.9686(0.0013) | 0.8465(0.0081) | 0.8463(0.0135) | 0.8463(0.0072) | 0.9119(0.0071) | |
SMOTE | 0.9683(0.0021) | 0.8468(0.0133) | 0.8426(0.0144) | 0.8446(0.0101) | 0.9099(0.0078) | |
BSMOTE | 0.9682(0.0015) | 0.8444(0.0103) | 0.8448(0.0127) | 0.8445(0.0077) | 0.9109 (0.0067) | |
ADASYN | 0.9677(0.0021) | 0.8436(0.0131) | 0.8403(0.0148) | 0.8418(0.0103) | 0.9084(0.0080) | |
HDDT | 0.9024(0.0030) | 0.5479(0.0338) | 0.2720(0.0260) | 0.3622(0.0215) | 0.5141(0.0233) | |
Yeast | SVR | 0.9368(0.0100) | 0.2641(0.0410) | 0.4510(0.0888) | 0.3290(0.0416) | 0.6524(0.0619) |
SVR-Select | 0.9287(0.0122) | 0.2360(0.0397) | 0.4559(0.0715) | 0.3077(0.0391) | 0.6544(0.0496) | |
Duplicate | 0.9641(0.0037) | 0.3907(0.2084) | 0.1196(0.0784) | 0.1727(0.0981) | 0.3077(0.1544) | |
SMOTE | 0.9594(0.0045) | 0.3620(0.0859) | 0.2098(0.0830) | 0.2554(0.0752) | 0.4458(0.0887) | |
BSMOTE | 0.9602(0.0043) | 0.3664(0.1012) | 0.2020(0.0884) | 0.2499(0.0937) | 0.4333(0.1065) | |
ADASYN | 0.9605(0.0045) | 0.3709(0.1246) | 0.2078(0.1099) | 0.2491(0.0986) | 0.4288(0.1436) | |
HDDT | 0.7604(0.0200) | 0.1049(0.0101) | 0.7853(0.0287) | 0.1849(0.0162) | 0.7722(0.0211) | |
Average Ranking | SVR | 4.92 | 5.25 | 2.25 | 2.08 | 2.17 |
SVR-Select | 5.58 | 5.00 | 2.25 | 2.75 | 2.50 | |
Duplicate | 2.79 | 3.29 | 5.13 | 4.88 | 4.96 | |
SMOTE | 2.54 | 2.29 | 5.04 | 4.21 | 4.71 | |
BSMOTE | 3.33 | 2.92 | 5.58 | 5.00 | 5.42 | |
ADASYN | 2.75 | 2.92 | 4.58 | 4.00 | 4.42 | |
HDDT | 6.08 | 6.33 | 3.17 | 5.08 | 3.83 |
4.3 Results
We compute the mean and standard error of accuracy, precision, TPR, F-measure and G-mean across the 20 nested cross validations. Due to limited space, results of 4 representative datasets are shown in Table 3 with the remaining results in Section 5.2 of the supplementary material. If a method classifies all samples into the majority class, precision is not well-defined and we simply set it to be zero. In the column “Method”, SVR = SVR-Tree, SVR-Select = SVR-Tree with feature selection, Duplicate = CART with duplicated oversampling, SMOTE = CART with SMOTE, BSMOTE = CART with Borderline-SMOTE, ADASYN = CART with ADASYN and HDDT = Hellinger Distance Decision Tree. For each dataset and evaluation measure, the method with the highest mean value ranks first and is highlighted in bold. As suggested by Brazdil and Soares (2000), the average rankings of all 12 datasets are displayed in Table 3 to evaluate the overall performance of each method across all datasets. If two methods are tied, the ranking is set to the average.
The results in Table 3 show that oversampling based methods generally perform well in accuracy and precision, with SMOTE the best of these methods. However, SVR and SVR-select outperform the oversampling methods in the rankings of TPR, F-measure and G-mean on many datasets and on average. The behavior of HDDT is unstable across datasets, being competitive in TPR and G-mean for some datasets but sometimes assigning all samples to the majority class. Since all methods (except duplicated sampling) are built on ideal theoretical assumptions, their practical performance is typically data-dependent. For example, SMOTE is outperformed by SVR-select in all measures for the Segment dataset, while the opposite happens for the Page dataset.
5 Discussion
A major challenge in analyzing imbalanced data is the small sample size of the minority class, which leads to overfitting, and it is natural to consider regularization to address this problem. Regularization of classification trees is an old idea; for example, Breiman et al. (1984) proposed penalizing the number of leaf nodes in the tree, and other classification trees like C4.5 (Quinlan, 2014) and Ripper (Cohen, 1995) also prune overgrown trees. However, the number of leaf nodes may not be a good measure of the complexity of a classification tree. More recently, Hahn et al. (2020) place a Bayesian prior on an ensemble of trees, which functions as indirect regularization. Following the idea of directly regularizing the shape of the decision set and the complexity of the decision boundary, we instead penalize the surface-to-volume ratio. To our knowledge, this is a new idea in the statistical literature.
SVR-Tree can be readily generalized to the multi-class case and to balanced datasets. With multiple classes, we can apply SVR to one or more minority classes and take the sum of these ratios as the regularizer, as sketched below. For balanced datasets, we can either compute the SVR of all classes or simply regularize the surface of the decision boundary. The principle behind these generalizations is to regularize the complexity of the decision boundaries and the shapes of the decision sets.
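As a small illustration of the multi-class extension, the regularizer can simply sum the per-class ratios; in the sketch below, `surface` and `volume` are assumed helpers like those in the Section 2 sketch, and the representation of decision sets is left abstract.

```python
def multiclass_svr_penalty(class_regions, minority_classes, surface, volume):
    """Sum of surface-to-volume ratios over the minority-class decision sets.

    class_regions: dict mapping class label -> decision set representation;
    surface / volume: callables returning the axis-aligned surface and the
    volume of a decision set (assumed helpers)."""
    return sum(surface(class_regions[c]) / volume(class_regions[c])
               for c in minority_classes)
```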
SUPPLEMENTARY MATERIAL
Proofs, a Detailed Algorithm and Additional Results: Supplementary Material for “Classification Trees for Imbalanced Data: Surface-to-Volume Regularization”.
Code: https://github.com/YichenZhuDuke/Classification-Tree-with-Surface-to-Volume-ratio-Regularization.git.
References
- Alcalá-Fdez et al. (2011) Alcalá-Fdez, J., A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera (2011). Keel data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing 17, 255–287.
- Audibert and Tsybakov (2007) Audibert, J.-Y. and A. B. Tsybakov (2007). Fast learning rates for plug-in classifiers. The Annals of Statistics 35(2), 608–633.
- Brazdil and Soares (2000) Brazdil, P. B. and C. Soares (2000). A comparison of ranking methods for classification algorithm selection. In European Conference on Machine Learning, pp. 63–75. Springer.
- Breiman et al. (1984) Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984). Classification and Regression Trees. Wadsworth.
- Bunkhumpornpat et al. (2009) Bunkhumpornpat, C., K. Sinapiromsaran, and C. Lursinsap (2009). Safe-level-smote: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 475–482. Springer.
- Chawla et al. (2002) Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002). Smote: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357.
- Cieslak et al. (2012) Cieslak, D. A., T. R. Hoens, N. V. Chawla, and W. P. Kegelmeyer (2012). Hellinger distance decision trees are robust and skew-insensitive. Data Mining and Knowledge Discovery 24(1), 136–158.
- Cohen (1995) Cohen, W. W. (1995). Fast effective rule induction. In Machine Learning Proceedings, pp. 115–123. Elsevier.
- Cormen et al. (2009) Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein (2009). Introduction to Algorithms. MIT press.
- Dua and Graff (2017) Dua, D. and C. Graff (2017). UCI Machine Learning Repository.
- Fernández et al. (2017) Fernández, A., S. del Río, N. V. Chawla, and F. Herrera (2017). An insight into imbalanced big data classification: outcomes and challenges. Complex & Intelligent Systems 3(2), 105–120.
- Györfi et al. (2006) Györfi, L., M. Kohler, A. Krzyzak, and H. Walk (2006). A distribution-free theory of nonparametric regression. Springer Science & Business Media.
- Hahn et al. (2020) Hahn, P. R., J. S. Murray, and C. M. Carvalho (2020). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Bayesian Analysis 15(3), 965–1056.
- Haixiang et al. (2017) Haixiang, G., L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications 73, 220–239.
- Han et al. (2005) Han, H., W.-Y. Wang, and B.-H. Mao (2005). Borderline-smote: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pp. 878–887. Springer.
- He et al. (2008) He, H., Y. Bai, E. A. Garcia, and S. Li (2008). Adasyn: Adaptive synthetic sampling approach for imbalanced learning. In IEEE International Joint Conference on Neural Networks, pp. 1322–1328. IEEE.
- He and Garcia (2008) He, H. and E. A. Garcia (2008). Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering (9), 1263–1284.
- Hu et al. (2019) Hu, X., C. Rudin, and M. Seltzer (2019). Optimal sparse decision trees. In Advances in Neural Information Processing Systems (NeurIPS), Volume 32.
- Krawczyk (2016) Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence 5(4), 221–232.
- Lin et al. (2020) Lin, J., C. Zhong, D. Hu, C. Rudin, and M. Seltzer (2020). Generalized and scalable optimal sparse decision trees. In International Conference on Machine Learning, pp. 6150–6160. PMLR.
- Mammen and Tsybakov (1995) Mammen, E. and A. B. Tsybakov (1995). Asymptotical minimax recovery of sets with smooth boundaries. The Annals of Statistics 23(2), 502–524.
- Nobel (1996) Nobel, A. (1996). Histogram regression estimation using data-dependent partitions. The Annals of Statistics 24(3), 1084–1105.
- Quinlan (2014) Quinlan, J. R. (2014). C4.5: Programs for Machine Learning. Elsevier.
- Scornet et al. (2015) Scornet, E., G. Biau, and J.-P. Vert (2015). Consistency of random forests. The Annals of Statistics 43(4), 1716–1741.
- Scott and Nowak (2006) Scott, C. and R. Nowak (2006). Minimax-optimal classification with dyadic decision trees. IEEE Transactions on Information Theory 52(4), 1135–1153.
- Stefanowski and Wilk (2008) Stefanowski, J. and S. Wilk (2008). Selective pre-processing of imbalanced data for improving classification performance. In International Conference on Data Warehousing and Knowledge Discovery, pp. 283–292. Springer.
- Tsybakov (2004) Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. The Annals of Statistics 32(1), 135–166.
- Wang et al. (2019) Wang, Q., X. Zhou, C. Wang, Z. Liu, J. Huang, Y. Zhou, C. Li, H. Zhuang, and J.-Z. Cheng (2019). Wgan-based synthetic minority over-sampling technique: Improving semantic fine-grained classification for lung nodules in ct images. IEEE Access 7, 18450–18463.
Appendix A Proofs
We prove Theorems 1 and 2 here. Proofs for Corollary 1, the bounds in Examples 1–2, and the lemmas introduced in the Appendix are in the supplementary material. Without loss of generality, we assume in this section.
A.1 Proof of Theorem 1
The proof builds on Nobel (1996), Györfi et al. (2006), Tsybakov (2004) and Scornet et al. (2015). We first establish a sufficient condition for consistency, showing that a classification tree whose signed impurity converges to an oracle bound is consistent. We then break the difference between the signed impurity of $\hat T_n$ and the oracle bound into two parts. The first is the estimation error, which goes to zero if the number of leaves increases sufficiently slowly relative to $n$; the second is the approximation error, which goes to zero if the conditional expectation of $Y$ approaches a constant within each leaf node and the penalty goes to zero.
Recall that the conditional expectation of $Y$ given $X = x$ is denoted $\eta(x)$, and define the oracle lower bound for the signed impurity as the infimum of the signed tree impurity over all classification trees. The following lemma shows that if the signed impurity of a classification tree converges to this oracle bound as $n \to \infty$, the classifier associated with the tree is consistent.
Lemma 1.
Let $\{T_n\}$ be a sequence of classification trees and let $\hat f_n$ be the classifier associated with $T_n$. Then $\hat f_n$ is consistent if the signed tree impurity of $T_n$ converges to the oracle lower bound in probability.
We then decompose the difference between the signed impurity of $\hat T_n$ and the oracle bound into estimation error and approximation error.
Lemma 2.
Let be a classification tree trained from data , be all the leaf nodes of . Define the set of classifiers as:
We have
(5)
The first term on the right-hand side of equation (5) is the “estimation error”, which measures the difference between functions evaluated under the empirical and true distributions. The second term is the “approximation error”, which measures the ability of this set of classifiers to approximate the oracle prediction function. The next two lemmas show both terms go to zero in probability.
Lemma 3.
If we have in probability.
Lemma 4.
As , if and , in probability.
A.2 Proof of Theorem 2
The proof consists of two parts. In the first part, we construct a finite subset of the family of SVR-constrained classifiers and prove it is an $\epsilon$-net under the symmetric set difference distance of the true measure and, with high probability, also a net of comparable radius under the symmetric set difference distance of the empirical measure. In the second part, we prove that, with high probability, any element in the constructed $\epsilon$-net that is far from the true classifier incurs a large excess error. Combining these two parts and selecting a proper value for $\epsilon$ concludes the proof of Theorem 2.
We first construct the -net. We divide the space into hypercubes, each with volume as below:
The SVR constraint of family directly implies that the surface area of the decision set is no greater than . If the shape of the decision set is indeed regular, one may imagine that the border of the decision set will intersect only with a finite number of hypercubes in . The following Lemma formalizes this intuition.
Lemma 5.
There exists a constant only dependent on , such that for , for any ,
where denotes the cardinality.
The general proof idea is to examine the local region of each hypercube in . If intersects with a hypercube, then the SVR constraint and a dimensional isoperimetric inequality will imply that the surface area passing through the local region of that hypercube is lower bounded. Since the total surface area of is upper bounded given the maximum volume is 1 and the SVR constraint, the number of cubes intersecting with is also upper bounded.
Let be a classifier such that is a finite union of hypercubes of . Then for and , we say is a border hypercube of if . Let and define the set as
We have the following lemma regarding the -entropy of .
Lemma 6.
For , we have
The following lemma shows that the constructed set is not only an $\epsilon$-net on the family under the true measure, but also simultaneously a net of comparable radius under the empirical measure with high probability.
Lemma 7.
We have the following two facts regarding :
We now move to the second part. By Lemma 7, there exists , such that . The next lemma shows with high probability, has a smaller empirical risk than all elements in that are sufficiently far away from .
Lemma 8.
Suppose Condition 2 holds. Then we have
The key proof idea is to write the empirical loss as a summation of i.i.d. random variables and to bound the deviation of the summation from its expectation by Bernstein's inequality.
We are now ready to prove Theorem 2.
Proof of Theorem 2.
Define the events as:
We claim on event , . We prove this by contradiction. Suppose that . Then by definition of event , there exists a classifier , such that
(6)
(7)
Combining equation (6) and the assumption that , we have . It then follows from the definition of event that
(8)
By definition of event , , such that
(9)
Combining the equations (7), (8) and (9), we have
which contradicts the definition that is the empirical risk minimizer. Therefore we have on the event . For all , the expected loss can be bounded by
where the second inequality uses Lemmas 5, 6, 7 and for ; the last inequality uses the definition that . We now define a constant as
(10)
In the two inequalities of (10), the left-hand side obviously dominates the right-hand side, so is well defined. Let , where . Then for ,
where the third inequality utilizes the definition of and the last inequality utilizes the definition of . Therefore we have
By definition of , increases with in a polynomial order, so the second term in the display above dominates. Specifically, by definition of , when , we have , and hence
∎