
Classification Trees for Imbalanced Data: Surface-to-Volume Regularization

Yichen Zhu, Cheng Li and David B. Dunson

Yichen Zhu is a Ph.D. student, Department of Statistical Science, Duke University, Durham, NC 27708. Cheng Li is Assistant Professor, Department of Statistics and Applied Probability, National University of Singapore, Singapore, 117546; the research was supported by the Singapore Ministry of Education Academic Research Funds Tier 1 Grant R-155-000-223-114. David B. Dunson is the Arts and Sciences Distinguished Professor, Departments of Statistical Science & Mathematics, Duke University, Durham, NC 27708; the research was partially supported by grant N000141712844 of the United States Office of Naval Research.
Abstract

Classification algorithms face difficulties when one or more classes have limited training data. We are particularly interested in classification trees, due to their interpretability and flexibility. When data are limited in one or more of the classes, the estimated decision boundaries are often irregularly shaped due to the limited sample size, leading to poor generalization error. We propose a novel approach that penalizes the Surface-to-Volume Ratio (SVR) of the decision set, obtaining a new class of SVR-Tree algorithms. We develop a simple and computationally efficient implementation while proving estimation consistency for SVR-Tree and a rate of convergence for an idealized empirical risk minimizer of SVR-Tree. SVR-Tree is compared with multiple algorithms designed to deal with imbalance in applications to real data.


Keywords: CART, Categorical data, Decision boundary, Shape penalization

1 Introduction

We are interested in the common setting in which one has a set of training data $\mathscr{D}_n=\{(X_i,Y_i)\}_{i=1}^n$, with $X_i\in\Omega\subset\mathbb{R}^d$ a vector of features and $Y_i\in\{0,\ldots,K-1\}$ a class label. The goal is to estimate a classifier $f:\Omega\to\{0,\ldots,K-1\}$, which outputs a class label given an input feature vector. This involves splitting the feature space $\Omega$ into subsets having different class labels, as illustrated in Figure 1. Classification trees are particularly popular for their combined flexibility and interpretability (Hu et al., 2019; Lin et al., 2020).

Our focus is on the case in which $n_j=\sum_{i=1}^n \bm{1}_{\{Y_i=j\}}$ is small for one or more $j\in\{0,\ldots,K-1\}$. For simplicity in exposition, we assume $K=2$, so that there are only two classes. Without loss of generality, suppose that $j=0$ is the majority class and $j=1$ is the minority class, and call the set $\{x\in\Omega: f(x)=1\}$ the decision set (of the minority class). Hence, $n_1$ is relatively small compared to $n_0$ by this convention.

To illustrate the problems that can arise from this imbalance, we consider a toy example. Let the sample space of $X$ be $\Omega=[0,1]\times[0,1]\subset\mathbb{R}^2$. Let the conditional distributions of $X$ given $Y$ be $X\,|\,Y=1\sim U([0,0.75]\times[0.25,0.75])$ and $X\,|\,Y=0\sim U(\Omega)$. We generate a training data set with 5 minority samples and 200 majority samples. The estimated decision sets of unpruned CART and optimally pruned CART (Breiman et al., 1984) are shown in Figure 1 (a) and (b), respectively. Minority samples are given weight 32 while majority samples are given weight 1 to naively address the imbalance. Both decision sets are inaccurate: unpruned CART overfits, carving out small regions around individual minority samples, while pruned CART fails to recover the support of the minority class.
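The weighted-CART behavior in this toy example can be reproduced with a few lines of code. The sketch below is illustrative only and assumes scikit-learn; the pruning strength `ccp_alpha=0.01` is an arbitrary stand-in for the optimal pruning used in Figure 1(b).

```python
# Minimal sketch of the toy example, assuming NumPy and scikit-learn are available.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Majority class: uniform on [0,1]^2; minority class: uniform on [0,0.75] x [0.25,0.75].
X0 = rng.uniform(0.0, 1.0, size=(200, 2))
X1 = np.column_stack([rng.uniform(0.0, 0.75, size=5), rng.uniform(0.25, 0.75, size=5)])
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 5)
w = np.where(y == 1, 32.0, 1.0)  # naive reweighting of the minority class

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y, sample_weight=w)
```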

Imbalanced data problems have drawn substantial interest; see Haixiang et al. (2017), He and Garcia (2008), Krawczyk (2016) and Fernández et al. (2017) for reviews. Some early work relied on random under- or over-sampling, which is essentially equivalent to modifying the weights and cannot address the key problem. Chawla et al. (2002) proposed SMOTE, which instead creates synthetic samples: for each minority class sample, synthetic samples are generated along the line segments joining it to its $k$ nearest neighbors within the minority class. Building on this idea, many other synthetic sampling methods have been proposed, including ADASYN (He et al., 2008), Borderline-SMOTE (Han et al., 2005), SPIDER (Stefanowski and Wilk, 2008), safe-level-SMOTE (Bunkhumpornpat et al., 2009) and WGAN-based sampling (Wang et al., 2019). These synthetic sampling methods have been demonstrated to be relatively effective.

However, current understanding of synthetic sampling is inadequate. Chawla et al. (2002) motivate SMOTE as designed to "create large and less specific decision regions", "rather than smaller and more specific regions". Later papers fail to improve upon this heuristic justification. Practically, the advantage of synthetic sampling over random over-sampling diminishes as the dimension of the feature space increases. In general, for each minority sample, we require at least $d$ synthetic samples to fully describe its neighborhood. This is often infeasible due to the sample size of the majority class and to computational complexity. Hence, it is typical to fix the number of synthetic samples regardless of the dimension of the feature space (Chawla et al., 2002), which may fail to "create large and less specific decision regions" when the dimension is high.

Motivated by these issues, we propose to directly penalize the Surface-to-Volume Ratio (SVR) of the decision set in constructing decision tree classifiers, based on the following considerations. First, a primary issue with imbalanced data is that the unregularized decision set typically consists of many small neighborhoods around each minority class sample. As shown in Sections 2 and 3, by penalizing the SVR of the decision set, we favor regularly shaped decision sets that are much less subject to such over-fitting. Second, existing classification methods are usually justified with theoretical properties under strong assumptions on the local properties of the true decision sets, for example, requiring the true decision set to be convex or to have sufficiently smooth boundaries. Such assumptions may be overly restrictive for many real datasets. In contrast, the proposed SVR regularized trees are only subject to a simple global constraint on the surface-to-volume ratio, which effectively controls the regularity of decision sets, leading to strong theoretical guarantees (see Section 3). Third, we show that SVR regularization can be efficiently implemented for tree classifiers by proposing an algorithm with computational complexity similar to standard tree algorithms such as CART.

The rest of the paper is organized as follows. Section 2 describes our methodology and algorithmic implementation. Section 3 analyzes theoretical properties of SVR Trees, including consistency and rate of convergence for excess risk. Section 4 presents numerical studies for real datasets. Section 5 contains a discussion, and proofs are included in an Appendix.

Figure 1: Decision sets for different methods: (a) unpruned CART, (b) optimally pruned CART, (c) SVR-Tree. Red crosses denote minority class samples, while blue points denote majority class samples. Rectangles with dashed frames denote the support of minority class samples, while rectangles filled with green color denote the minority class decision set.

2 Methodology

We first introduce the definition of surface-to-volume ratio (SVR) and tree impurity, and then define SVR-Tree as the minimizer of a weighted average of tree impurity and SVR. We then state the algorithm to estimate this tree from training data $\mathscr{D}_n=\{(X_i,Y_i)\}_{i=1}^n$. We assume readers have basic knowledge of tree-based classifiers like CART (Breiman et al., 1984) and C4.5 (Quinlan, 2014). In the rest of the paper, the word "tree" refers specifically to classification trees that specify a class label associated with each leaf node.

2.1 Notation

Training data are denoted as $\mathscr{D}_n=\{(X_i,Y_i)\}_{i=1}^n$, with $X_i\in\Omega\subset\mathbb{R}^d$ a vector of features and $Y_i\in\{0,1\}$ a class label. Uppercase letters $(X,Y)$ denote random variables, while lowercase $x,y$ denote specific values. Denote the $j$th feature of $X$ as $X[j]$. Let $\mathbb{P}^*$ denote the true distribution of i.i.d. samples $(X_i,Y_i)$ and let $\mathbb{P}$ denote the weighted distribution that upweights the minority class samples by a constant $\alpha\geq 1$. That is, for any measurable subset $A\subset\Omega$,

$$\mathbb{P}(A\times\{1\})=\frac{\alpha\,\mathbb{P}^{*}(A\times\{1\})}{\mathbb{P}^{*}(\Omega\times\{0\})+\alpha\,\mathbb{P}^{*}(\Omega\times\{1\})},\qquad \mathbb{P}(A\times\{0\})=\frac{\mathbb{P}^{*}(A\times\{0\})}{\mathbb{P}^{*}(\Omega\times\{0\})+\alpha\,\mathbb{P}^{*}(\Omega\times\{1\})}.$$

Upweighting minority class samples is a conventional technique to counter imbalance, and we focus on the weighted measure $\mathbb{P}$ in the rest of the paper. Denote by $\mathbb{P}_n$ the weighted empirical distribution, which assigns mass $w_0\alpha/n$ to minority class training samples and $w_0/n$ to majority class training samples. The constant $w_0$ is set so that the masses of all training samples add up to 1.
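As a concrete illustration, the weighted empirical measure $\mathbb{P}_n$ can be computed as below; this is a direct transcription of the definition above, not part of the original paper.

```python
# Weighted empirical masses: w0*alpha/n for minority samples, w0/n for majority samples,
# with w0 chosen so that all masses sum to 1.
import numpy as np

def weighted_empirical_masses(y, alpha):
    """y: array of 0/1 labels; alpha >= 1 is the minority-class weight."""
    y = np.asarray(y)
    raw = np.where(y == 1, alpha / len(y), 1.0 / len(y))
    w0 = 1.0 / raw.sum()      # normalizing constant
    return w0 * raw

masses = weighted_empirical_masses([0, 0, 0, 1], alpha=3)
print(masses, masses.sum())   # masses are (1/6, 1/6, 1/6, 1/2) and sum to 1
```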

We use $f$ with various subscripts and diacritics to denote a general classifier from $\Omega$ to $\{0,1\}$, while $T$ with various subscripts and diacritics denotes a tree classifier from $\Omega$ to $\{0,1\}$. A tree classifier is formed by finitely many splitting steps and its decision set is a finite union of hyperrectangles. When discussing the probability of events involving the $n$ random variables $\{(X_i,Y_i)\}_{i=1}^n$, we simply use $\mathbb{P}$ to represent the probability measure on the $n$-fold product space. For example, for any $a>0$, $\mathbb{P}\left(n^{-1}\sum_{i=1}^n X_i > a\right)$ denotes the probability of the event $\{\{(X_i,Y_i)\}_{i=1}^n : n^{-1}\sum_{i=1}^n X_i > a\}$. We use $\mathbb{E}$ for expectations over $\mathbb{P}$ and $\mathbb{E}_n$ for expectations over $\mathbb{P}_n$. We use $\eta$ and $\eta_n$ to denote the conditional expectation of $Y$ given $X$ under the true and empirical measures, respectively; that is, $\eta(X)=\mathbb{E}(Y\,|\,X)$ and $\eta_n(X)=\mathbb{E}_n(Y\,|\,X)$.

2.2 Surface-to-Volume Ratio

For all $d\in\mathbb{N}$, consider the space $\mathbb{R}^d$ equipped with Lebesgue measure $\mu$. For any Lebesgue measurable closed set $A\subset\Omega$, we define its volume as the Lebesgue measure of $A$: $V(A)=\mu(A)$. If $A$ can be represented as a finite union of hyperrectangles, we define the surface of $A$ as the $(d-1)$-dimensional Lebesgue measure of the boundary of $A$: $S(A)=\mu_{d-1}(\partial A)$; otherwise, if there exist $A_i$, $i\geq 1$, such that each $A_i$ is a finite union of hyperrectangles and $A_i$ converges to $A$ in Hausdorff distance, we define the surface of $A$ as $S(A)=\lim_{i\to\infty}S(A_i)$, provided the limit exists. In general, if all the $A_i$, $i\geq 1$, are regularly shaped, it will be proven in Proposition 2 that the limit exists. Our definition of surface differs from the usual meaning of the word "surface". For example, under our definition a $d$-dimensional ball with radius $1/2$ has the same surface as a $d$-dimensional cube with side length 1, since approximating such a ball with hyperrectangles results in the same surface area as the cube. We adopt this definition of surface because we focus on tree classifiers, whose decision sets are always finite unions of hyperrectangles.

For any set $A$ with $0<\mu(A)<\infty$, the surface-to-volume ratio (SVR) is $r(A)=S(A)/V(A)$. For sets with the same volume, the $d$-dimensional cube has the smallest SVR, while sets having multiple disconnected subsets and/or irregular boundaries have relatively high SVR. We visualize sets with different SVR in two-dimensional space in Figure 2, with subfigures (a)-(c) corresponding to the decision sets in Figure 1. We can see that decision sets with larger SVR tend to have more branches, disconnected components, or irregularly shaped borders than those with smaller SVR.

Figure 2: Examples of sets with different SVR values in $[0,1]^2$: (a) SVR=70.03, (b) SVR=13.59, (c) SVR=7.3407, (d) SVR=4, (e) SVR=40, (f) SVR=6.81, (g) SVR=21.82, (h) SVR=41.65. Rectangles filled with green color denote the minority class decision set.

Surface-to-Volume Ratio of a Classification Tree

For training data $\mathscr{D}_n$, we define a closed bounded sample space $\Omega\subset\mathbb{R}^d$ such that the support of $X$ is a subset of $\Omega$. A classification tree $T$ divides the sample space $\Omega$ into two disjoint subsets $\Omega_1=\{x\in\Omega : T(x)=1\}$ and $\Omega_0=\Omega\backslash\Omega_1$, where $\Omega_0\cup\Omega_1=\Omega$. The tree $T$ predicts that a new sample $X^*$ belongs to class 0 if $X^*\in\Omega_0$, and to class 1 if $X^*\in\Omega_1$. The surface-to-volume ratio of a classification tree is defined as the surface-to-volume ratio of the set $\Omega_1$: $r(T)=r(\Omega_1)$.
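Since the decision set of a tree is a finite union of axis-aligned rectangles, the SVR of a two-dimensional tree classifier can be approximated numerically by rasterizing its decision set. The sketch below is a rough illustration of the definition, not the exact computation used in our implementation.

```python
# Approximate r(T) = S(Omega_1)/V(Omega_1) on [0,1]^2 by evaluating the classifier on a grid:
# volume is the covered area; surface is the total length of cell edges separating covered
# cells from uncovered cells (including edges on the boundary of the unit square).
import numpy as np

def approx_svr_2d(predict, grid=400):
    h = 1.0 / grid
    centers = (np.arange(grid) + 0.5) * h
    xx, yy = np.meshgrid(centers, centers, indexing="ij")
    mask = predict(np.column_stack([xx.ravel(), yy.ravel()])).reshape(grid, grid).astype(bool)
    volume = mask.sum() * h * h
    padded = np.pad(mask, 1, constant_values=False)
    n_edges = np.sum(padded[1:, :] != padded[:-1, :]) + np.sum(padded[:, 1:] != padded[:, :-1])
    surface = h * n_edges
    return surface / volume if volume > 0 else np.inf

# The rectangle [0,0.75] x [0.25,0.75] has surface 2.5 and volume 0.375, so SVR is about 6.67.
in_rect = lambda X: (X[:, 0] <= 0.75) & (X[:, 1] >= 0.25) & (X[:, 1] <= 0.75)
print(approx_svr_2d(in_rect))  # approximately 6.67
```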

2.3 Impurity Function and Tree Impurity

A classification tree partitions the sample space into multiple leaf nodes, assigning one class label to each leaf node. The tree should be built to maximize homogeneity of the training sample class labels within nodes. This motivates the following definition.

Definition 1 (Impurity, Definition 2.5 of Breiman et al. 1984).

An impurity function $\phi(\cdot,\cdot)$ is defined on the set of pairs $(p_0,p_1)$ satisfying $p_0\geq 0$, $p_1\geq 0$, $p_0+p_1=1$, with the properties: (i) $\phi(\cdot,\cdot)$ achieves its maximum only at $(1/2,1/2)$; (ii) $\phi(\cdot,\cdot)$ achieves its minimum only at $(1,0)$ and $(0,1)$; (iii) $\phi(\cdot,\cdot)$ is symmetric in $(p_0,p_1)$, i.e., $\phi(p_0,p_1)=\phi(p_1,p_0)$.

Let $p_0$ and $p_1$ represent the probabilities of belonging to the majority and minority class, respectively, within some branch of the tree. Ideally, splits of the tree are chosen so that, after splitting, $p_0$ and $p_1$ move closer to 0 or 1 and further from $1/2$. Many different tree building algorithms use impurity to measure the quality of a split; for example, CART uses the Gini impurity and C4.5 uses entropy, which is effectively a type of impurity measure. In the remainder of the paper, the term "impurity function" refers to the Gini impurity, that is, $\phi(p_0,p_1)=1-p_0^2-p_1^2$.

Let $A_1,A_2,\ldots,A_m$ be the leaf nodes of a classification tree $T$ and let $z_j\in\{0,1\}$ be the predictive class label for node $A_j$, for $j=1,\ldots,m$. Let $\mathbb{P}$ be the weighted probability measure defined in Section 2.1. The impurity of leaf node $A_j$ is $I(A_j,\mathbb{P})=\phi(\mathbb{P}(Y=0\,|\,X\in A_j),\mathbb{P}(Y=1\,|\,X\in A_j))$. The impurity of node $A_j$ measures the class homogeneity in $A_j$, but does not depend on the predictive class label $z_j$. Let $\tilde{z}_j=\mathbbm{1}_{\{\mathbb{P}(Y=1\,|\,X\in A_j)\geq 1/2\}}$ denote the dominant class label in $A_j$ under the weighted measure $\mathbb{P}$. We define a signed impurity taking into account $z_j$ as

$$\tilde{I}(A_j,\mathbb{P})=\mathbbm{1}_{\{z_j=\tilde{z}_j\}}\,I(A_j,\mathbb{P})+\mathbbm{1}_{\{z_j\neq\tilde{z}_j\}}\,\big(1-I(A_j,\mathbb{P})\big).$$

The signed impurity serves as a proxy for classification accuracy. If the predictive class label $z_j$ matches the dominant class label $\tilde{z}_j$ in node $A_j$, the signed impurity of $A_j$ equals the impurity of $A_j$ and is no greater than $1/2$. Otherwise, an extra penalty of $1-2I(A_j,\mathbb{P})$ is applied. Writing $p_0=\mathbb{P}(Y=0\,|\,X\in A_j)$, it is easy to see that $1-2I(A_j,\mathbb{P})=1-4p_0(1-p_0)=[1-2p_0]^2$. In other words, the signed impurity uses a quadratic loss instead of the absolute value loss of classification accuracy. Taking a weighted average of the signed impurities across the leaf nodes, one obtains the tree impurity and signed tree impurity.
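The Gini impurity and the signed node impurity are simple to compute; the following snippet is a direct transcription of the definitions above.

```python
# Gini impurity and signed impurity of a single node with class proportions (p0, p1),
# p0 + p1 = 1, under the weighted measure.
def gini(p0, p1):
    return 1.0 - p0 ** 2 - p1 ** 2

def signed_impurity(p0, p1, predicted_label):
    dominant = 1 if p1 >= 0.5 else 0
    imp = gini(p0, p1)
    return imp if predicted_label == dominant else 1.0 - imp

# Predicting the dominant class gives the Gini impurity; predicting the other class adds
# the penalty 1 - 2*I(A) = (1 - 2*p0)^2.
print(signed_impurity(0.8, 0.2, predicted_label=0))  # about 0.32
print(signed_impurity(0.8, 0.2, predicted_label=1))  # about 0.68 = 0.32 + (1 - 2*0.8)**2
```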

Definition 2 (Tree Impurity).

Let $T$ be a tree and $A_1,A_2,\ldots,A_m$ be the leaf nodes of this tree. Denoting the weighted probability measure as $\mathbb{P}$, the tree impurity of $T$ is $I(T,\mathbb{P})=\sum_{j=1}^m \mathbb{P}(X\in A_j)\,I(A_j,\mathbb{P})$.

Definition 3 (Signed Tree Impurity).

Under the notation of Definition 2, the signed tree impurity is $\tilde{I}(T,\mathbb{P})=\sum_{j=1}^m \mathbb{P}(X\in A_j)\,\tilde{I}(A_j,\mathbb{P})$.

2.4 SVR-Tree Classifiers

The SVR-Tree classifier is the minimizer of a weighted average of the signed tree impurity and the surface-to-volume ratio. Letting $\mathscr{T}$ be the collection of possible trees, the SVR-Tree classifier is defined as

$$\hat{T}=\operatorname*{argmin}_{T\in\mathscr{T}}\big[\tilde{I}(T,\mathbb{P}_n)+\lambda_n r(T)\big], \qquad (1)$$

where $\lambda_n$ is a penalty parameter and the unknown probability measure $\mathbb{P}$ is replaced with the weighted empirical measure $\mathbb{P}_n$ defined in Section 2.1. Unfortunately, without restricting the space of trees $\mathscr{T}$, the optimization problem (1) is intractable. In the following subsection, we introduce an iterative greedy search algorithm that limits the size of $\mathscr{T}$ in each step to efficiently obtain a tree having provably good performance.

2.5 The SVR-Tree Algorithm

The SVR-Tree algorithm is designed to find a nearly optimal SVR-Tree classifier, and proceeds in a greedy manner. We begin with the root node. At each step, we operate on one leaf node of the current tree, splitting it into two new leaf nodes by finding the solution of (1). The node to split at each step is uniquely specified by a breadth-first searching order. After splitting, the tree is updated and the node to split in the next step is specified. The process stops when further splitting of leaf nodes does not improve the loss or when a prespecified maximum number of leaf nodes is reached.

We first describe how to split a current leaf node to improve the solution to (1). Suppose the current tree is $T$ and the node to split is $A$, with $n'$ training samples. For each feature $j$, sort the samples in $A$ in increasing order of the $j$th feature as $X[j]_{j_1},X[j]_{j_2},\ldots,X[j]_{j_{n'}}$. We only allow splits at $(X[j]_{j_i}+X[j]_{j_{i+1}})/2$, $1\leq i\leq n'-1$, $1\leq j\leq d$, the midpoints of adjacent values of each feature. The total number of different splits is no more than $(n'-1)d$. After each such split of $A$, we keep all other leaf nodes unchanged while allowing all 4 possible class label assignments at the two new daughter nodes of $A$. The current set of trees $\mathscr{T}$ to choose from in optimizing (1) includes the initial $T$ and all the split trees described above. The cardinality of $\mathscr{T}$ is no more than $1+4(n'-1)d$, a linear order of $n'$. We compute the risk for all $T\in\mathscr{T}$ to find the minimizer. If the initial $T$ is the risk minimizer, we do not make any split in this step and mark the node $A$ as "complete". Any node marked as complete will no longer be split in subsequent steps.
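The candidate splits described above are easy to enumerate; the following sketch returns the midpoints of adjacent sorted values for each feature (duplicate values are dropped, so the count is at most $(n'-1)d$).

```python
# Enumerate candidate splits for a node: midpoints of adjacent sorted feature values.
import numpy as np

def candidate_splits(X_node):
    """X_node: (n', d) array of the training samples in the node being split."""
    splits = []
    for j in range(X_node.shape[1]):
        vals = np.unique(X_node[:, j])            # sorted distinct values of feature j
        midpoints = (vals[:-1] + vals[1:]) / 2.0
        splits.extend((j, t) for t in midpoints)
    return splits                                 # list of (feature index, threshold) pairs

print(candidate_splits(np.array([[0.1, 0.9], [0.4, 0.2], [0.3, 0.2]])))
# feature 0: thresholds 0.2 and 0.35; feature 1: threshold 0.55
```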

It remains to specify the breadth-first searching order determining which node to split at each step. Let the depth of a node be the number of edges to the root node, which has depth 0. Breadth-first algorithms explore all nodes at the current depth before moving on (Cormen et al. 2009, Chapter 22.2). To keep track of changes in the tree, we use a queue, a dynamic set in which elements are kept in order and the principal operations are the insertion of elements at the tail, known as enqueue, and the deletion of elements from the head, known as dequeue; see Chapter 10.1 of Cormen et al. (2009) for details. We begin our algorithm with a queue whose only entity is the root node. At each step, we remove the node at the front of the queue and split at this node as described in the previous paragraph. If a split tree is accepted as the risk minimizer over the current set $\mathscr{T}$, we enqueue the two new leaf nodes; otherwise, the unsplit tree is the risk minimizer over the current set $\mathscr{T}$, so we do not enqueue any new node. The nodes at the front of the queue have the lowest depth, so our algorithm naturally follows a breadth-first searching order. We preset the maximal number of leaf nodes as $\bar{a}_n$. The process is stopped when either the queue is empty, in which case all leaf nodes are marked as complete, or the number of leaf nodes reaches $\bar{a}_n$. On many datasets, Algorithm 1 stops before reaching the maximal number of leaf nodes $\bar{a}_n$. The upper bound is necessary in certain circumstances, since our algorithm, like the majority of practical tree algorithms, is greedy and not guaranteed to find the global minimum. Provided we could find the global minimum, the upper bound $\bar{a}_n$ would no longer be required, as shown in Theorem 2 and discussed thereafter.

Our SVR-Tree algorithm has a coarse to fine tree building style, tending to first split the sample space into larger pieces belonging to two different classes, followed by modifications to the surface of the decision set to decrease tree impurity and SVR. The steps are sketched in Algorithm 1, where feature selection steps are marked as optional. A more detailed and rigorous version is in Section 4.1 of the supplementary material.

The average time complexity of Algorithm 1 is $O(dn\log n)$, the same as many tree algorithms such as CART and C4.5. A detailed analysis of computational complexity is available in Section 4.2 of the supplementary material.

Optional Step for Feature Selection

It is likely that some features do not help to predict class labels. These features should be excluded from our estimated tree. Under some mild conditions, a split on a redundant feature has minimal impact on the tree impurity compared to a split on a non-redundant feature. Thus feature selection can be achieved by thresholding. Suppose we are splitting node $A$ into two new leaf nodes $A_1$ and $A_2$. Then the (unsigned) tree impurity decrease after this split is defined as

$$\Delta I(T,\mathbb{P}_n)=\mathbb{P}_n(A)\big[I(A,\mathbb{P}_n)-I(A_1,\mathbb{P}_n)\,\mathbb{P}_n(A_1\,|\,A)-I(A_2,\mathbb{P}_n)\,\mathbb{P}_n(A_2\,|\,A)\big].$$

Let $J_0\subset\{1,2,\ldots,d\}$ be the indices of features that have been split in previous tree building steps. Given that we are splitting node $A$, let $\Delta I_0$ be the maximal tree impurity decrease over all splits on features $X[j]$, $j\in J_0$. Then a split on a new feature $X[j']$, $j'\notin J_0$, with tree impurity decrease $\Delta I(T,\mathbb{P}_n)$, is accepted if

$$\Delta I(T,\mathbb{P}_n)\geq \Delta I_0 + c_0\lambda_n, \qquad (2)$$

where $c_0$ is a constant independent of the training data. By equation (2), a split on a new feature is accepted if its tree impurity decrease is greater than $\Delta I_0$, the maximal tree impurity decrease over all previously split features, plus a threshold term $c_0\lambda_n$. Theoretical support for this thresholding approach is provided in the supplementary material.
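The thresholding rule in equation (2) amounts to a one-line test; the sketch below uses the value $c_0=4$ adopted in the numerical studies of Section 4.

```python
# Accept a split on a previously unused feature only if its impurity decrease exceeds the
# best decrease over already-used features by at least c0 * lambda_n, as in equation (2).
def accept_new_feature_split(delta_I_new, delta_I0, lambda_n, c0=4.0):
    return delta_I_new >= delta_I0 + c0 * lambda_n

print(accept_new_feature_split(0.10, 0.06, lambda_n=0.005))  # True:  0.10 >= 0.06 + 0.02
print(accept_new_feature_split(0.07, 0.06, lambda_n=0.005))  # False: 0.07 <  0.08
```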

Result: the fitted tree
Input: training data $\{(X_i,Y_i)\}_{i=1}^n$, impurity function $\phi(\cdot,\cdot)$, weight for the minority class $\alpha$, SVR penalty parameter $\lambda_n$, and maximal number of leaf nodes $\bar{a}_n\in\mathbb{N}$. Let node_queue be a queue containing only the root node.
for $1\leq j\leq d$ do
      Sort all samples by the values of the $j$th feature;
end for
while node_queue is not empty and the number of leaf nodes is $\leq\bar{a}_n$ do
      Dequeue the first entity in node_queue, denoting it as node;
      for all possible splits of node do
            (optional) Check whether the current split satisfies the feature selection condition; if not, reject the current split and continue;
            Compute the signed tree impurity of the current split;
      end for
      Find the split that minimizes the weighted average of signed impurity and SVR as in equation (1);
      if the weighted average of signed impurity and SVR is decreased then
            Accept the split and enqueue the two child nodes of node into node_queue;
      else
            Reject the split;
      end if
end while
Algorithm 1: Outline of steps of SVR-Tree
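The control flow of Algorithm 1 can be sketched compactly. The code below is a skeleton only: the callables `candidate_splits`, `node_risk`, `split_risk` (signed tree impurity plus $\lambda_n$ times SVR, as in (1)) and `apply_split` are assumed to be supplied by the user and are not part of the paper.

```python
# Breadth-first greedy growth of an SVR-Tree (control-flow sketch).
from collections import deque

def grow_svr_tree(root, candidate_splits, node_risk, split_risk, apply_split, max_leaves):
    queue = deque([root])
    n_leaves = 1
    while queue and n_leaves < max_leaves:
        node = queue.popleft()                           # dequeue: lowest-depth node first
        best_risk, best_split = node_risk(node), None    # keeping the node unsplit is allowed
        for feature, threshold in candidate_splits(node):
            risk = split_risk(node, feature, threshold)  # signed impurity + lambda_n * SVR
            if risk < best_risk:
                best_risk, best_split = risk, (feature, threshold)
        if best_split is None:
            continue                                     # node is "complete"; never revisited
        left, right = apply_split(node, *best_split)
        queue.append(left)                               # enqueue the two new leaf nodes
        queue.append(right)
        n_leaves += 1                                    # one leaf replaced by two
    return root
```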

3 Theoretical Properties

In this section, we discuss the theoretical properties of SVR-Tree from two aspects. First, we show that the classifier obtained from Algorithm 1 is consistent in the sense that a generalized distance between SVR-Tree and the oracle classifier goes to zero as the sample size goes to infinity. Second, we show that for an idealized empirical risk minimizer over a family of SVR constrained classifiers, the excess risk converges to zero at the rate $O((\log n/n)^{\frac{1}{d+1-1/\kappa}})$ as the sample size goes to infinity, where $\kappa\geq 1$ is defined in Condition 2. The consistency result provides basic theoretical guarantees for SVR-Tree, while the rate of convergence result provides further insight into how the surface-to-volume ratio works in nonparametric classification. We also develop results on the feature selection consistency of Algorithm 1 with the optional steps enabled. Due to limited space, these results are provided in the supplementary material.

3.1 Estimation Consistency

Let the classification risk under measure $\mathbb{P}$ be $R(f)=\mathbb{E}|Y-f(X)|$ and its empirical version under measure $\mathbb{P}_n$ be $R_n(f)=\mathbb{E}_n|Y-f(X)|$. Let $\mathscr{F}$ be the collection of all measurable functions from $\Omega$ to $\{0,1\}$. The oracle classifier and oracle risk are defined as $f^*=\arg\min_{f\in\mathscr{F}}R(f)$ and $R^*=R(f^*)$, respectively. We define the consistency of a sequence of classifiers based on the excess risk compared to the oracle risk.

Definition 4.

A sequence of classifiers $f_n$ is consistent if $\lim_{n\to\infty}\mathbb{E}R(f_n)=R^*$.

In the case when $\mathbb{P}(\{X\in\Omega:\eta(X)=1/2\})=0$, consistency of $f_n$ is equivalent to requiring $\lim_{n\to\infty}\mathbb{E}|f_n-f^*|=0$ almost surely. For any hyperrectangle $A\subset\Omega$, denote the component-wise conditional expectation on $A$ as $\eta_{A,j}(X[j])=\mathbb{E}(Y\,|\,X[j],X\in A)$. To ensure the consistency of Algorithm 1, we need the following identifiability condition.

Condition 1 (Identifiability).

For any hyperrectangle $A\subset\Omega$, if the component-wise conditional expectation $\eta_{A,j}(X)$ is constant on $A$ for all $1\leq j\leq d$, then the conditional expectation $\eta(X)$ is constant on $A$.

Condition 1 is indispensable for proving the consistency of our SVR regularized tree classifiers without requiring the diameter of each leaf node to go to zero. This is because CART-type algorithms can only make splits based on the component-wise conditional expectations. If all the component-wise conditional expectations $\eta_{A,j}(X[j])$ are constant but the overall conditional expectation $\eta(x)$ varies in $x$, then the problem is not identifiable to CART-type split rules. Condition 1 can also be viewed as a relaxed version of Condition (H1) in Scornet et al. (2015), who studied the consistency of random forest regression. Their Condition (H1) requires that the mean of $Y$ given $X$ follows an additive model in the regression tree setup. An analogous condition for our classification tree setup would require $\eta(x)$ to be an additive function, which would imply our Condition 1.

Denote the tree obtained from Algorithm 1 as $\hat{T}_n$. Theorem 1 shows that $\hat{T}_n$ is consistent under mild conditions.

Theorem 1 (Estimation consistency).

Let $\Omega=[0,1]^d$ and suppose the weighted measure $\mathbb{P}$ has a continuous density $\rho(x)$ with lower bound $\rho(x)\geq\rho_{\min}>0$ for all $x\in\Omega$. Assume that Condition 1 is satisfied, $\eta(x)$ is continuous everywhere in $\Omega$, and $\bar{a}_n$, $\lambda_n$ satisfy $\lim_{n\to\infty}\bar{a}_n=\infty$, $\lim_{n\to\infty}\lambda_n=0$, and $\lim_{n\to\infty}\frac{\bar{a}_n d\log n}{n}=0$. Then the classifier $\hat{T}_n$ obtained from Algorithm 1 is consistent.

3.2 Rate of Convergence for Empirical Risk Minimizer

We now consider the empirical risk minimizer over a class of trees under the SVR constraint and derive the rate of convergence for this empirical risk minimizer. This empirical risk minimizer can be considered as an idealized classifier.

We first introduce the family of SVR constrained classifiers. For ease of notation, for any classifier $f$, we denote its decision set by $\Omega_1(f)=\{x\in\Omega:f(x)=1\}$. Define $\mathscr{F}_0$ as the family of all possible classifiers generated from trees with finite depth: $\mathscr{F}_0=\{f:\Omega\to[0,1] : \Omega_1(f)\ \text{is a finite union of hyperrectangles}\}$. For $\gamma>0$, the family of SVR constrained classifiers $\mathscr{F}_\gamma$ is defined as

$$\mathscr{F}_\gamma=\left\{f\in\mathscr{F}_0 : \forall\, f'\in\mathscr{F}_0,\ S(f')-S(f)\geq-\gamma\, V\big(\Omega_1(f)\,\triangle\,\Omega_1(f')\big)\right\}, \qquad (3)$$

where $A\,\triangle\,B=(A\cap B^c)\cup(A^c\cap B)$ denotes the symmetric set difference between two generic sets $A$ and $B$. The elements of $\mathscr{F}_\gamma$ are constrained by SVR in the sense that the surface area of $\Omega_1(f)$ can decrease at most proportionally to the change in its volume. The parameter $\gamma$ is set to be no smaller than $2d$ to prevent $\mathscr{F}_\gamma$ from being empty. The definition of $\mathscr{F}_\gamma$ matches the intuition in Algorithm 1, where the SVR is checked at every leaf node split. Moreover, the following proposition reveals that the regularized risk minimizer over all trees under the true measure, provided it exists, always lies in the family $\mathscr{F}_\gamma$. Therefore, in the subsequent analysis, we focus on the effect of $\gamma$ on the convergence of the excess risk, since it plays a role equivalent to $\lambda$ in the minimization problem (1).

Proposition 1.

Suppose that $\Omega=[0,1]^d$, and the density $\rho$ of the weighted marginal measure $\mathbb{P}$ exists and satisfies $\rho(x)\leq\rho_{\max}<\infty$ for all $x\in\Omega$. If $\gamma\geq 2d\lambda+1+\rho_{\max}/\lambda$, then

$$\inf_{f\in\mathscr{F}_\gamma}\big[\tilde{I}(f,\mathbb{P})+\lambda r(f)\big]=\inf_{f\in\mathscr{F}_0}\big[\tilde{I}(f,\mathbb{P})+\lambda r(f)\big],$$
$$\inf_{f\in\mathscr{F}_\gamma}\big[R(f)+\lambda r(f)\big]=\inf_{f\in\mathscr{F}_0}\big[R(f)+\lambda r(f)\big].$$

The family $\mathscr{F}_\gamma$ only contains tree classifiers and is not closed. It is desirable to consider the class of all classifiers that can be well approximated by tree classifiers. Let $d_H(A,B)$ be the Hausdorff distance between two generic sets $A$ and $B$. We define the closure of $\mathscr{F}_\gamma$ with respect to the Hausdorff distance of decision sets as

$$\overline{\mathscr{F}_\gamma}\triangleq\left\{f:\Omega\to[0,1] : \exists\ \text{a sequence}\ f_i\in\mathscr{F}_\gamma\ \text{such that}\ \lim_{i\to\infty}d_H\big(\Omega_1(f_i),\Omega_1(f)\big)=0\right\}.$$

The elements of $\overline{\mathscr{F}_\gamma}$ have a well-defined surface area, as shown in Proposition 2.

Proposition 2.

$S(f)$ is well defined for all $f\in\overline{\mathscr{F}_\gamma}$ in the sense that: (i) for every sequence $f_i\in\mathscr{F}_\gamma$ whose decision sets $\Omega_1(f_i)$ converge in Hausdorff distance, $S(f_i)$ is a convergent sequence in $\mathbb{R}$; (ii) for any two sequences $f_{i1}\in\mathscr{F}_\gamma$ and $f_{i2}\in\mathscr{F}_\gamma$ satisfying $\lim_{i\to\infty}d_H(\Omega_1(f_{i1}),\Omega_1(f))=0$ and $\lim_{i\to\infty}d_H(\Omega_1(f_{i2}),\Omega_1(f))=0$, we have $\lim_{i\to\infty}S(f_{i1})=\lim_{i\to\infty}S(f_{i2})$.

The proofs of Propositions 1 and 2 are available in the supplementary material. To better control the rate of convergence of the excess risk, we also need the following margin condition.

Condition 2.

There exist constants $\epsilon_0\in(0,1)$, $c_1>0$ and $\kappa\geq 1$ such that for all classifiers $f$ satisfying $\mathbb{P}(\Omega_1(f)\,\triangle\,\Omega_1(f^*))<\epsilon_0$, we have

$$R(f)-R^*\geq c_1\big[\mathbb{P}\big(\Omega_1(f)\,\triangle\,\Omega_1(f^*)\big)\big]^\kappa.$$

Condition 2 is essentially Assumption (A1) in Tsybakov (2004), known as Tsybakov's margin condition and widely used in classification problems to obtain fast convergence rates for the excess risk (Tsybakov, 2004; Audibert and Tsybakov, 2007). Condition 2 holds as long as $\mathbb{P}(|\eta(X)-1/2|\leq t)\leq C_\eta t^{1/(\kappa-1)}$ for all $t\in(0,t^*)$ and some constants $C_\eta>0$ and $t^*\in(0,1/2)$; see Proposition 1 in Tsybakov (2004). Under Condition 2, our main theorem states the rate of convergence for the empirical risk minimizer over $\mathscr{F}_\gamma$.

Theorem 2 (Rate of Convergence).

Suppose that $\Omega=[0,1]^d$, $d\geq 2$, and Condition 2 holds. Let the empirical risk minimizer be $\hat{T}_{n,\gamma}=\operatorname*{argmin}_{T\in\mathscr{F}_\gamma}R_n(T)$ and let $\tilde{c}=\min\big(2\log 2,\frac{21c_1^{1/\kappa}}{16}\big)$. Then there exist constants $N_0$ as in equation (10) and $c'$ as in Lemma 5, such that for all $n\geq N_0$,

$$\sup_{f^*\in\overline{\mathscr{F}_\gamma}}\mathbb{E}\left[R(\hat{T}_{n,\gamma})-R(f^*)\right]\leq 16\,(2/\tilde{c})^{\frac{1}{d+1-1/\kappa}}\,(c'\gamma)^{\frac{d}{d+1-1/\kappa}}\left(\frac{\log n}{n}\right)^{\frac{1}{d+1-1/\kappa}}. \qquad (4)$$

Theorem 2 provides an upper bound in (4) on the rate of convergence of the excess risk from the empirical risk minimizer $\hat{T}_{n,\gamma}$ among all SVR constrained tree classifiers in $\mathscr{F}_\gamma$ to the theoretically best possible classifier $f^*$ in $\overline{\mathscr{F}_\gamma}$ with a well-defined surface area. Because we place few restrictions on the split locations of $\hat{T}_{n,\gamma}$ apart from membership in $\mathscr{F}_\gamma$, it is possible for a split to land directly on an observed sample under the empirical measure. In such situations, we allow the tree $\hat{T}_{n,\gamma}$ to assign the sample to either leaf node. Hence, the leaf nodes of $\hat{T}_{n,\gamma}$ can be open, closed, or half-open. On the other hand, the theoretically best classifier $f^*$ can lie in the much larger family $\overline{\mathscr{F}_\gamma}$. Although the classifiers in $\mathscr{F}_\gamma$ are tree classifiers, there is no limitation on how many leaf nodes they can have. Therefore, they can approximate decision sets with smooth boundaries, for example a $d$-dimensional ball. In fact, by adjusting $\gamma$, $\overline{\mathscr{F}_\gamma}$ will contain most classifiers with regularly shaped decision sets.

The rate of convergence for the excess risk in the upper bound (4) is $O((\log n/n)^{\frac{1}{d+1-1/\kappa}})$, which is the same as the minimax rate of set estimation on the family of sets with first-order differentiable boundaries as in Mammen and Tsybakov (1995). Although Mammen and Tsybakov (1995) do not use the margin condition and hence the term $1/\kappa$ does not appear explicitly in their rates, their results yield the same rate as ours up to logarithmic factors if the margin condition is applied; see, for example, Theorem 4 of Scott and Nowak (2006). Furthermore, in the upper bound (4) in Theorem 2, we have explicitly characterized the effect of the SVR regularization $\gamma$ from the set $\mathscr{F}_\gamma$. By definition, the SVR constraint in $\mathscr{F}_\gamma$ becomes weaker as $\gamma$ increases. As expected, the upper bound in (4) increases with $\gamma$, given the increasing complexity of the set $\mathscr{F}_\gamma$. A direct implication of this upper bound is that if the true tree classifier belongs to a family $\overline{\mathscr{F}_\gamma}$ with a smaller $\gamma$, i.e., a smaller SVR, then the excess risk of the empirical risk minimizer $\hat{T}_{n,\gamma}$ also tends to be smaller.

Theorem 2 is related to the literature on classification risk of decision trees, but with important differences. In general, the rate of convergence for the excess risk depends on two assumptions: the margin condition, as in our Condition 2, and the complexity, or the $\epsilon$-entropy, of the family of decision sets (Scott and Nowak, 2006). Suppose the $\epsilon$-entropy is $(1/\epsilon)^\rho$ for some $\rho>0$. If $\rho<1$, then together with a suitable margin condition like Condition 2, the rate of convergence can be faster than $O(n^{-1/2})$ or even faster than $O(n^{-1})$ (Tsybakov, 2004; Audibert and Tsybakov, 2007). If $\rho\geq 1$, then the rate of convergence is typically of order $n^{-\frac{1}{\rho+c}}$ with $c$ a constant independent of $\rho$ (Mammen and Tsybakov, 1995). Although it is possible to slightly improve the constant $c$ by refining the margin condition, the overall rate of convergence is still governed by the $\epsilon$-entropy if $\rho$ is very large. Therefore, the major objective of our research is to find a suitable family of sets with controllable $\epsilon$-entropy as well as a practically feasible algorithm. This is not easy, because there are limited families of sets having known $\epsilon$-entropy, including convex sets and sets with smooth boundaries. Both smoothness and convexity impose strong assumptions on the local properties of the boundary $\partial\Omega_1(f)$. In contrast, we propose to consider the family of SVR constrained sets $\mathscr{F}_\gamma$ as a new family of decision sets, which directly imposes a simple and intuitive assumption on the shape of decision sets and additionally has a tractable $\epsilon$-entropy.

Scott and Nowak (2006) also noticed that the family of sets with smooth boundaries is not very realistic for practical problems. They proposed a family of decision sets, named the "box-counting class", which satisfies the conclusion of our Lemma 5. However, they did not further characterize the behavior of sets in their class, and hence they must compute the tree for all smallest dyadic splits and then merge all these trees bottom-up, which can be regarded as an exhaustive search at the lowest split level. This leads to a computational complexity of $O(dnL^d\log(nL^d))$, with $L$ the number of dyadic splits. The complexity becomes intractable when $d$ is large, with the authors noting that $d\geq 10$ will pose a problem. Scott and Nowak (2006) also proposed an alternative to Tsybakov's margin condition (Condition 2) and obtained a faster convergence rate in some specific situations, such as when $\Omega$ lies in a low dimensional manifold. However, their alternative condition directly bounds the excess error between a possible dyadic tree and the oracle classifier, which is much stronger than the standard margin condition we have used.

The proof of Theorem 2 is in the Appendix. The key idea is to construct an $\epsilon$-net on $\mathscr{F}_\gamma$ under the symmetric set difference with respect to the true measure and to show that this $\epsilon$-net is also a $2\epsilon$-net under the symmetric set difference with respect to the empirical measure with high probability. This nontrivial property is the reason why we can derive the rate of convergence for the empirical risk minimizer over the whole family $\mathscr{F}_\gamma$, rather than just for the minimizer over an $\epsilon$-net of this family, as adopted by some previous classification literature (Mammen and Tsybakov, 1995; Tsybakov, 2004).

4 Numerical Studies

We compare SVR trees with popular imbalanced classification methods on real datasets, adding redundant features to these datasets to evaluate feature selection. A confusion matrix (Table 1) is often used to assess classification performance. A common criterion is accuracy, $\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}$. When minority samples are relatively rare, it is often important to give higher priority to true positives, which is accomplished using the true positive rate (TPR, also known as recall), $\frac{\text{TP}}{\text{TP}+\text{FN}}$, and precision, $\frac{\text{TP}}{\text{TP}+\text{FP}}$. To combine these, the F-measure is often used: $\frac{2\cdot\text{TPR}\cdot\text{Precision}}{\text{TPR}+\text{Precision}}$. The geometric mean of the true positive rate and the true negative rate, $G_{\text{mean}}=\sqrt{\frac{\text{TP}}{\text{TP}+\text{FN}}\cdot\frac{\text{TN}}{\text{TN}+\text{FP}}}$, is also used to evaluate the overall performance on both classes.

Table 1: Confusion matrix for two-class classification.
                      True label 1           True label 0
Predicted label 1     True Positive (TP)     False Positive (FP)
Predicted label 0     False Negative (FN)    True Negative (TN)
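All of the evaluation measures are simple functions of the four counts in Table 1; the helper below computes them, with precision set to zero when undefined, as in Section 4.3.

```python
# Evaluation measures from confusion-matrix counts.
import math

def imbalance_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    tpr = tp / (tp + fn)                                  # recall
    tnr = tn / (tn + fp)
    f_measure = 2 * tpr * precision / (tpr + precision) if tpr + precision > 0 else 0.0
    g_mean = math.sqrt(tpr * tnr)
    return accuracy, precision, tpr, f_measure, g_mean

print(imbalance_metrics(tp=40, fp=20, fn=10, tn=130))
# (0.85, 0.667, 0.8, 0.727, 0.833) up to rounding
```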

4.1 Datasets

We test our method on 12 datasets from the UCI Machine Learning Repository (Dua and Graff, 2017) and KEEL-dataset repository (Alcalá-Fdez et al., 2011), varying in size, number of features and level of imbalance. For datasets with three or more classes, classes are combined to form binary class datasets. Table 2 gives an overview of the datasets; detailed descriptions and preprocessing of these datasets are available in Section 5.1 of the supplementary material.

Table 2: Overview of data sets.
Data set | Total samples | Minority samples | Minority proportion | Features
Pima 768 268 34.9% 8
Titanic 2201 711 32.3% 3
Phoneme 5404 1586 29.3% 5
Vehicle 846 199 19.0% 18
Ecoli 336 52 15.5% 6
Segment 2308 329 14.3% 18
Wine 1599 217 11.9% 11
Page 5472 559 10.2% 10
Satimage 6435 626 8.9% 36
Glass 214 17 8.0% 9
Abalone 731 42 5.4% 7
Yeast 1484 51 3.4% 8

4.2 Experimental Setup

We test the performance of SVR-Tree, SVR-Tree with feature selection, CART (Breiman et al., 1984) with duplicated oversampling, CART with SMOTE (Chawla et al., 2002), CART with Borderline-SMOTE (Han et al., 2005), CART with ADASYN (He et al., 2008), and the Hellinger Distance Decision Tree (Cieslak et al., 2012) on all 12 datasets. Features are linearly transformed so that samples lie in $[0,1]^d$.

We use $3\times 5$ nested stratified cross validation to evaluate the out-of-sample performance of all methods. The nested cross validation has two layers: the inner layer is used to choose tuning parameters and the outer layer is used to evaluate out-of-sample performance. For each method and each dataset, we run the nested cross validation procedure 20 times. In each run, we randomly partition the whole dataset into three stratified folds (the proportions of samples of each class are roughly the same in all folds) and run three times; each time, one of the three folds is the testing set and the other two folds form the training set. On the training set, we use 5-fold stratified cross validation to choose the tuning parameter. That is, we further divide the training set into five folds and run five times; each time we use four folds as (inner) training sets and the remaining fold for validation. We define the cross-validation TPR and TNR in the same manner as the cross-validation accuracy and use them to compute the cross-validation F-measure. The tuning parameter with the highest cross-validation F-measure is selected. We then train the model with the selected tuning parameter on the whole training set and evaluate its performance on the test set. The cross-validation accuracy, precision, TPR, F-measure and G-mean are recorded. The mean and standard error of these statistics over the 20 nested cross validation runs are reported in Table 3.
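For concreteness, one run of the nested cross validation can be organized as below, assuming scikit-learn; `fit_predict(param, X_train, y_train, X_eval)` is an illustrative placeholder for any of the compared tree methods and is not part of the paper.

```python
# One of the 20 nested CV runs: 3-fold outer loop for testing, 5-fold inner loop for tuning,
# with the tuning parameter chosen by the cross-validation F-measure.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def one_nested_cv_run(X, y, param_grid, fit_predict, seed=0):
    outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    test_accuracy = []
    for tr, te in outer.split(X, y):
        inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        best_param, best_f = None, -1.0
        for param in param_grid:
            tprs, precs = [], []
            for itr, ival in inner.split(X[tr], y[tr]):
                pred = fit_predict(param, X[tr][itr], y[tr][itr], X[tr][ival])
                truth = y[tr][ival]
                tp = np.sum((pred == 1) & (truth == 1))
                tprs.append(tp / max(np.sum(truth == 1), 1))
                precs.append(tp / max(np.sum(pred == 1), 1))
            tpr, prec = np.mean(tprs), np.mean(precs)
            f = 2 * tpr * prec / (tpr + prec) if tpr + prec > 0 else 0.0
            if f > best_f:
                best_f, best_param = f, param
        pred = fit_predict(best_param, X[tr], y[tr], X[te])  # refit on the whole training set
        test_accuracy.append(np.mean(pred == y[te]))         # other measures computed analogously
    return test_accuracy
```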

SVR-Tree with and without feature selection are described in Algorithm 1. The weight for the minority class, $\alpha$, is set to be the largest integer such that the total weight of the minority class is no greater than the total weight of the majority class. The maximal number of leaf nodes $\bar{a}_n$ is $2\sqrt{n}$. The penalty parameter $\lambda_n$ for SVR is chosen from a geometric sequence in $[2^0,2^{10}]\times 10^{-3}\times n^{-1/3}$ and the value with the highest F-measure across 20 runs is selected. For SVR-Tree with feature selection, the constant $c_0$ in equation (2) is fixed at $c_0=4$. In practice, the results are insensitive to $c_0$.

For the other methods except the Hellinger distance decision tree, we first oversample the minority class so that the number of minority samples is multiplied by the same weight $\alpha$ used for SVR-Tree. We then build the CART tree on the oversampled dataset and prune it. The pruning parameter of CART is selected to maximize the F-measure. By the algorithm proposed in Breiman et al. (1984), the range from which to choose the pruning parameter becomes available after the tree is built and does not need to be specified ahead of time.

For duplicated oversampling, we duplicate each minority class sample $\alpha-1$ times, where $\alpha$ is the weight used for SVR-Tree. For SMOTE, the number of nearest neighbors is set to $k=5$. For Borderline-SMOTE, we adopt Borderline-SMOTE1 of Han et al. (2005), with the number of nearest neighbors $k=5$. For both SMOTE and Borderline-SMOTE, if $\alpha-1\geq k$, some nearest neighbors may be used multiple times to generate synthetic samples, so that the total weight of minority class samples under these oversampling methods is the same as for SVR-Tree. For ADASYN, denoting the number of majority class samples by $n_0$ and the number of minority class samples by $n_1$, the parameter $\beta$ is set to $\beta=\alpha n_1/n_0$.
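The oversampling baselines can be reproduced approximately with the imbalanced-learn package; the sketch below swaps in that library's SMOTE, Borderline-SMOTE and ADASYN implementations rather than the exact reuse-of-neighbors scheme described above, so it should be read as an approximation.

```python
# CART on an oversampled training set, with the minority count multiplied by alpha.
import numpy as np
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN
from sklearn.tree import DecisionTreeClassifier

def oversampled_cart(X, y, alpha, method="smote", random_state=0):
    target = {1: int(alpha * np.sum(y == 1))}   # desired minority count after resampling
    samplers = {
        "smote": SMOTE(sampling_strategy=target, k_neighbors=5, random_state=random_state),
        "bsmote": BorderlineSMOTE(sampling_strategy=target, k_neighbors=5,
                                  random_state=random_state),  # borderline-1 variant by default
        "adasyn": ADASYN(sampling_strategy=target, random_state=random_state),
    }
    X_res, y_res = samplers[method].fit_resample(X, y)
    return DecisionTreeClassifier(random_state=random_state).fit(X_res, y_res)
```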

For the Hellinger distance decision tree, we directly call its Java implementation available in the Weka software (https://www.cs.waikato.ac.nz/ml/weka/). Although it is claimed that the Hellinger distance tree is insensitive to imbalance, the current implementation occasionally classifies all samples into a single class. We set the default leaf prediction strategy to "Naive Bayes" since it works well on most datasets. However, if the algorithm classifies all samples into a single class, we set the leaf prediction strategy to "Majority Voting". The third leaf prediction strategy, "Naive Bayes Adaptive", always produces the same results as one of the other two strategies in our experiments.

Data set Method Accuracy Precision TPR F-measure G-mean
Phoneme SVR 0.8350(0.0060) 0.6749(0.0113) 0.8458(0.0154) 0.7506(0.0076) 0.8380(0.0062)
SVR-Select 0.8377(0.0040) 0.6807(0.0064) 0.8420(0.0098) 0.7528(0.0060) 0.8389(0.0049)
Duplicate 0.8554(0.0038) 0.7567(0.0110) 0.7482(0.0154) 0.7522(0.0068) 0.8205(0.0065)
SMOTE 0.8598(0.0054) 0.7637(0.0119) 0.7567(0.0121) 0.7601(0.0089) 0.8264(0.0068)
BSMOTE 0.8568(0.0060) 0.7580(0.0143) 0.7527(0.0127) 0.7552(0.0093) 0.8230(0.0070)
ADASYN 0.8607(0.0050) 0.7661(0.0116) 0.7565(0.0115) 0.7612(0.0082) 0.8269(0.0063)
HDDT 0.7065(0.0000) 0.0000(0.0000) 0.0000(0.0000) 0.0000(0.0000) 0.0000(0.0000)
Segment SVR 0.9922(0.0017) 0.9696(0.0089) 0.9763(0.0061) 0.9729(0.0060) 0.9855(0.0033)
SVR-Select 0.9932(0.0013) 0.9753(0.0080) 0.9769(0.0053) 0.9761(0.0047) 0.9863(0.0027)
Duplicate 0.9913(0.0012) 0.9707(0.0069) 0.9685(0.0073) 0.9696(0.0044) 0.9817(0.0036)
SMOTE 0.9916(0.0010) 0.9722(0.0061) 0.9688(0.0081) 0.9705(0.0035) 0.9820(0.0038)
BSMOTE 0.9915(0.0011) 0.9726(0.0082) 0.9676(0.0059) 0.9701(0.0037) 0.9814(0.0027)
ADASYN 0.9919(0.0011) 0.9713(0.0063) 0.9716(0.0064) 0.9714(0.0040) 0.9833(0.0032)
HDDT 0.8330(0.0021) 0.4599(0.0031) 0.9825(0.0023) 0.6266(0.0031) 0.8910(0.0017)
Page SVR 0.9656(0.0015) 0.8248(0.0098) 0.8429(0.0117) 0.8337(0.0076) 0.9087(0.0062)
SVR-Select 0.9647(0.0016) 0.8252(0.0124) 0.8311(0.0140) 0.8280(0.0077) 0.9024(0.0073)
Duplicate 0.9686(0.0013) 0.8465(0.0081) 0.8463(0.0135) 0.8463(0.0072) 0.9119(0.0071)
SMOTE 0.9683(0.0021) 0.8468(0.0133) 0.8426(0.0144) 0.8446(0.0101) 0.9099(0.0078)
BSMOTE 0.9682(0.0015) 0.8444(0.0103) 0.8448(0.0127) 0.8445(0.0077) 0.9109 (0.0067)
ADASYN 0.9677(0.0021) 0.8436(0.0131) 0.8403(0.0148) 0.8418(0.0103) 0.9084(0.0080)
HDDT 0.9024(0.0030) 0.5479(0.0338) 0.2720(0.0260) 0.3622(0.0215) 0.5141(0.0233)
Yeast SVR 0.9368(0.0100) 0.2641(0.0410) 0.4510(0.0888) 0.3290(0.0416) 0.6524(0.0619)
SVR-Select 0.9287(0.0122) 0.2360(0.0397) 0.4559(0.0715) 0.3077(0.0391) 0.6544(0.0496)
Duplicate 0.9641(0.0037) 0.3907(0.2084) 0.1196(0.0784) 0.1727(0.0981) 0.3077(0.1544)
SMOTE 0.9594(0.0045) 0.3620(0.0859) 0.2098(0.0830) 0.2554(0.0752) 0.4458(0.0887)
BSMOTE 0.9602(0.0043) 0.3664(0.1012) 0.2020(0.0884) 0.2499(0.0937) 0.4333(0.1065)
ADASYN 0.9605(0.0045) 0.3709(0.1246) 0.2078(0.1099) 0.2491(0.0986) 0.4288(0.1436)
HDDT 0.7604(0.0200) 0.1049(0.0101) 0.7853(0.0287) 0.1849(0.0162) 0.7722(0.0211)
Average Ranking SVR 4.92 5.25 2.25 2.08 2.17
SVR-Select 5.58 5.00 2.25 2.75 2.50
Duplicate 2.79 3.29 5.13 4.88 4.96
SMOTE 2.54 2.29 5.04 4.21 4.71
BSMOTE 3.33 2.92 5.58 5.00 5.42
ADASYN 2.75 2.92 4.58 4.00 4.42
HDDT 6.08 6.33 3.17 5.08 3.83
Table 3: Results of numerical studies, where the ranking is averaged over all 12 datasets.

4.3 Results

We compute the mean and standard error of accuracy, precision, TPR, F-measure and G-mean across the 20 nested cross validations. Due to limited space, results of 4 representative datasets are shown in Table 3 with the remaining results in Section 5.2 of the supplementary material. If a method classifies all samples into the majority class, precision is not well-defined and we simply set it to be zero. In the column “Method”, SVR = SVR-Tree, SVR-Select = SVR-Tree with feature selection, Duplicate = CART with duplicated oversampling, SMOTE = CART with SMOTE, BSMOTE = CART with Borderline-SMOTE, ADASYN = CART with ADASYN and HDDT = Hellinger Distance Decision Tree. For each dataset and evaluation measure, the method with the highest mean value ranks first and is highlighted in bold. As suggested by Brazdil and Soares (2000), the average rankings of all 12 datasets are displayed in Table 3 to evaluate the overall performance of each method across all datasets. If two methods are tied, the ranking is set to the average.

The results in Table 3 show that oversampling based methods generally perform well in accuracy and precision, with SMOTE the best of these methods. However, SVR and SVR-select outperform the oversampling methods in the rankings of TPR, F-measure and G-mean on many datasets and on average. The behavior of HDDT is unstable across datasets, being competitive in TPR and G-mean for some datasets but sometimes assigning all samples to the majority class. Since all methods (except duplicated sampling) are built on ideal theoretical assumptions, their practical performance is typically data-dependent. For example, SMOTE is outperformed by SVR-select in all measures for the Segment dataset, while the opposite happens for the Page dataset.

5 Discussion

A major challenge in analyzing imbalanced data is the small sample size of the minority class, which leads to overfitting. It is natural to consider using regularization to address this problem. Regularization of classification trees is an old idea; for example, Breiman et al. (1984) proposed penalizing the number of leaf nodes in the tree. Other classification trees like C4.5 (Quinlan, 2014) and Ripper (Cohen, 1995) also prune overgrown trees. However, the number of leaf nodes may not be a good measure of the complexity of a classification tree. Recently, Hahn et al. (2020) added a Bayesian prior to an ensemble of trees, which functions as indirect regularization. Following the idea of directly regularizing the shape of the decision set and the complexity of the decision boundary, we instead penalize the surface-to-volume ratio. To our knowledge, this is a new idea in the statistical literature.

SVR-Tree can be generalized straightforwardly to the multi-class case and to balanced datasets. For multiple classes, we can apply SVR to one or more minority classes and take the sum of these ratios as the regularization. For balanced datasets, we can either compute the SVR of all classes, or simply regularize the surface of the decision boundary. The principle behind these generalizations is to regularize the complexity of the decision boundaries and the shapes of the decision sets.


SUPPLEMENTARY MATERIAL

Proofs, a Detailed Algorithm and Additional Results: Supplementary Material for “Classification Trees for Imbalanced Data: Surface-to-Volume Regularization”.
Codes: https://github.com/YichenZhuDuke/Classification-Tree-with-Surface-to-Volume-ratio-Regularization.git

References

  • Alcalá-Fdez, J., A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera (2011). KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing 17, 255–287.
  • Audibert, J.-Y. and A. B. Tsybakov (2007). Fast learning rates for plug-in classifiers. The Annals of Statistics 35(2), 608–633.
  • Brazdil, P. B. and C. Soares (2000). A comparison of ranking methods for classification algorithm selection. In European Conference on Machine Learning, pp. 63–75. Springer.
  • Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984). Classification and Regression Trees. Wadsworth.
  • Bunkhumpornpat, C., K. Sinapiromsaran, and C. Lursinsap (2009). Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 475–482. Springer.
  • Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357.
  • Cieslak, D. A., T. R. Hoens, N. V. Chawla, and W. P. Kegelmeyer (2012). Hellinger distance decision trees are robust and skew-insensitive. Data Mining and Knowledge Discovery 24(1), 136–158.
  • Cohen, W. W. (1995). Fast effective rule induction. In Machine Learning Proceedings, pp. 115–123. Elsevier.
  • Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein (2009). Introduction to Algorithms. MIT Press.
  • Dua, D. and C. Graff (2017). UCI Machine Learning Repository.
  • Fernández, A., S. del Río, N. V. Chawla, and F. Herrera (2017). An insight into imbalanced big data classification: outcomes and challenges. Complex & Intelligent Systems 3(2), 105–120.
  • Györfi, L., M. Kohler, A. Krzyzak, and H. Walk (2006). A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media.
  • Hahn, P. R., J. S. Murray, and C. M. Carvalho (2020). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Bayesian Analysis 15(3), 965–1056.
  • Haixiang, G., L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications 73, 220–239.
  • Han, H., W.-Y. Wang, and B.-H. Mao (2005). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pp. 878–887. Springer.
  • He, H., Y. Bai, E. A. Garcia, and S. Li (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In IEEE International Joint Conference on Neural Networks, pp. 1322–1328. IEEE.
  • He, H. and E. A. Garcia (2008). Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering (9), 1263–1284.
  • Hu, X., C. Rudin, and M. Seltzer (2019). Optimal sparse decision trees. In Advances in Neural Information Processing Systems (NeurIPS), Volume 32.
  • Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence 5(4), 221–232.
  • Lin, J., C. Zhong, D. Hu, C. Rudin, and M. Seltzer (2020). Generalized and scalable optimal sparse decision trees. In International Conference on Machine Learning, pp. 6150–6160. PMLR.
  • Mammen, E. and A. B. Tsybakov (1995). Asymptotical minimax recovery of sets with smooth boundaries. The Annals of Statistics 23(2), 502–524.
  • Nobel, A. (1996). Histogram regression estimation using data-dependent partitions. The Annals of Statistics 24(3), 1084–1105.
  • Quinlan, J. R. (2014). C4.5: Programs for Machine Learning. Elsevier.
  • Scornet, E., G. Biau, and J.-P. Vert (2015). Consistency of random forests. The Annals of Statistics 43(4), 1716–1741.
  • Scott, C. and R. Nowak (2006). Minimax-optimal classification with dyadic decision trees. IEEE Transactions on Information Theory 52(4), 1135–1153.
  • Stefanowski, J. and S. Wilk (2008). Selective pre-processing of imbalanced data for improving classification performance. In International Conference on Data Warehousing and Knowledge Discovery, pp. 283–292. Springer.
  • Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. The Annals of Statistics 32(1), 135–166.
  • Wang, Q., X. Zhou, C. Wang, Z. Liu, J. Huang, Y. Zhou, C. Li, H. Zhuang, and J.-Z. Cheng (2019). WGAN-based synthetic minority over-sampling technique: Improving semantic fine-grained classification for lung nodules in CT images. IEEE Access 7, 18450–18463.

Appendix A Proofs

We prove Theorems 1 and 2 here. Proofs of Corollary 1, of the $c_{1}$ bounds in Examples 1–2, and of the lemmas introduced in this appendix are given in the supplementary material. Without loss of generality, we assume $\Omega=[0,1]^{d}$ in this section.

A.1 Proof of Theorem 1

The proof builds on Nobel (1996), Györfi et al. (2006), Tsybakov (2004) and Scornet et al. (2015). We first establish a sufficient condition for consistency, showing that a classification tree whose signed impurity converges to an oracle bound is consistent. We then break the difference between the signed impurity of $\hat{T}_{n}$ and the oracle bound into two parts. The first is the estimation error, which goes to zero if the number of leaves $a_{n}$ increases more slowly than $n$; the second is the approximation error, which goes to zero if $\mathbb{E}(Y|X\in A)$ approaches a constant within each leaf node $A$ and the penalty $\lambda_{n}$ goes to zero.

Recall that the conditional expectation is denoted $\mathbb{E}(Y|X)=\eta(X)$, and define the oracle lower bound for the signed impurity as $I^{*}=\int_{\Omega}2\eta(x)(1-\eta(x))\,d\mathbb{P}(x)$. The following lemma shows that if the signed impurity of a classification tree converges to $I^{*}$ as $n\to\infty$, the classifier associated with the tree is consistent.
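Before stating the lemma, we record a remark for intuition (not needed for the proofs): the integrand $2\eta(x)(1-\eta(x))$ is the Gini impurity of the conditional label distribution at $x$, and since $\max\{\eta(x),1-\eta(x)\}\in[1/2,1]$,

$\min\{\eta(x),1-\eta(x)\}\leq 2\eta(x)\big(1-\eta(x)\big)\leq 2\min\{\eta(x),1-\eta(x)\}.$

In particular, $I^{*}=0$ exactly when $\eta(x)\in\{0,1\}$ for $\mathbb{P}$-almost every $x$, that is, when the two classes are perfectly separable.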

Lemma 1.

Let $T_{n}$ be a sequence of classification trees and let $f_{n}:\Omega\to\{0,1\}$ be the classifier associated with $T_{n}$. Then $T_{n}$ is consistent if $\tilde{I}(T_{n},\mathbb{P})\to I^{*}$ in probability.

We then decompose the difference between the signed impurity of $\hat{T}_{n}$ and the oracle bound into estimation error and approximation error.

Lemma 2.

Let $\hat{T}_{n}$ be a classification tree trained from data $\mathscr{D}_{n}$, and let $A_{1},A_{2},\ldots,A_{m}$ be the leaf nodes of $\hat{T}_{n}$. Define the set of classifiers $\mathscr{T}_{n}$ as:

$\mathscr{T}_{n}=\{T:\ T\text{'s associated classifier }f:\Omega\to\{0,1\}\text{ is constant on all }A_{j},\;1\leq j\leq m\}.$

We have

$\tilde{I}(\hat{T}_{n},\mathbb{P})-I^{*}\leq 2\sup_{T\in\mathscr{T}_{n}}|\tilde{I}(T,\mathbb{P})-\tilde{I}(T,\mathbb{P}_{n})|+\inf_{T\in\mathscr{T}_{n}}|\tilde{I}(T,\mathbb{P})+\lambda_{n}r(T)-I^{*}|.$ (5)

The first term on the right-hand side of equation (5) is the “estimation error”, which measures the difference between functions evaluated under the empirical and true distributions. The second term is the “approximation error”, which measures the ability of $\mathscr{T}_{n}$ to approximate the oracle prediction function. The next two lemmas show that both terms go to zero in probability.

Lemma 3.

If $\frac{\bar{a}_{n}d\log n}{n}=o(1)$, then $\sup_{T\in\mathscr{T}_{n}}|\tilde{I}(T,\mathbb{P})-\tilde{I}(T,\mathbb{P}_{n})|\to 0$ in probability.

Lemma 4.

As $n\to\infty$, if $\lambda_{n}\to 0$ and $\bar{a}_{n}\to\infty$, then $\inf_{T\in\mathscr{T}_{n}}|\tilde{I}(T,\mathbb{P})+\lambda_{n}r(T)-I^{*}|\to 0$ in probability.

Proof of Theorem 1.

Combining Lemmas 1, 2, 3 and 4 completes the proof.

A.2 Proof of Theorem 2

The proof consists of two parts. In the first part, we construct a finite class of classifiers and prove that it is an $\epsilon$-net for $\mathscr{F}_{\gamma}$ under the symmetric set difference with respect to the true measure and, with high probability, a $2\epsilon$-net with respect to the empirical measure. In the second part, we prove that, with high probability, any element of this $\epsilon$-net that is far from the true classifier incurs a large excess error. Combining the two parts and selecting a proper value of $\epsilon$ concludes the proof of Theorem 2.

We first construct the $\epsilon$-net. We divide the space $\Omega=[0,1]^{d}$ into $M^{d}$ hypercubes, each with volume $1/M^{d}$, as follows:

$\mathscr{H}=\left\{\Big[\frac{i_{1}}{M},\frac{i_{1}+1}{M}\Big]\times\Big[\frac{i_{2}}{M},\frac{i_{2}+1}{M}\Big]\times\cdots\times\Big[\frac{i_{d}}{M},\frac{i_{d}+1}{M}\Big]:i_{1},i_{2},\ldots,i_{d}\in\{0,1,\ldots,M-1\}\right\}.$

The SVR constraint defining the family $\mathscr{F}_{\gamma}$ directly implies that the surface area of the decision set is no greater than $\gamma$. If the shape of the decision set is regular, one expects the boundary of the decision set to intersect only on the order of $M^{d-1}$ of the $M^{d}$ hypercubes in $\mathscr{H}$. The following lemma formalizes this intuition.

Lemma 5.

There exists a constant $c^{\prime}\geq 1$ depending only on $d$, such that for $1/M\geq 2/(3\gamma)$ and any $f\in\mathscr{F}_{\gamma}$,

$\big|\{A\in\mathscr{H}:A\cap\partial\Omega_{1}(f)\neq\emptyset\}\big|\leq c^{\prime}\gamma M^{d-1},$

where $|\cdot|$ denotes cardinality.

The general proof idea is to examine the local region around each hypercube in $\mathscr{H}$. If $\partial\Omega_{1}(f)$ intersects a hypercube, then the SVR constraint and a $(d-1)$-dimensional isoperimetric inequality imply that the surface area passing through the local region of that hypercube is bounded below. Since the total surface area of $\Omega_{1}(f)$ is bounded above (by the SVR constraint together with the fact that the volume is at most 1), the number of hypercubes intersecting $\partial\Omega_{1}(f)$ is also bounded above.
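The scaling in Lemma 5 can be illustrated numerically in a simple special case. The sketch below is purely illustrative and not part of the proof: it takes $d=2$ and a disc-shaped decision set (the shape, center, radius, and grid sizes are our choices), counts the cells of $\mathscr{H}$ whose closure meets the boundary circle, and shows that the count grows like $M^{d-1}=M$, so the ratio of the count to $M$ stabilizes as $M$ grows.

import math

def cells_hit_by_circle(M, cx=0.5, cy=0.5, r=0.3):
    """Count grid cells [i/M,(i+1)/M] x [j/M,(j+1)/M] whose closure meets the
    circle of radius r centered at (cx, cy)."""
    hits = 0
    for i in range(M):
        for j in range(M):
            x0, x1 = i / M, (i + 1) / M
            y0, y1 = j / M, (j + 1) / M
            # Smallest and largest distance from the center to a point of the cell.
            d_min = math.hypot(max(x0 - cx, cx - x1, 0.0), max(y0 - cy, cy - y1, 0.0))
            d_max = math.hypot(max(abs(x0 - cx), abs(x1 - cx)), max(abs(y0 - cy), abs(y1 - cy)))
            # The closed cell meets the boundary circle iff r lies between these distances.
            if d_min <= r <= d_max:
                hits += 1
    return hits

if __name__ == "__main__":
    for M in (10, 20, 40, 80):
        hits = cells_hit_by_circle(M)
        print(f"M = {M:3d}: boundary cells = {hits:5d}, cells / M = {hits / M:.2f}")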

Let $f$ be a classifier such that $\Omega_{1}(f)$ is a finite union of hypercubes of $\mathscr{H}$. For $A\in\mathscr{H}$ with $A\subset\Omega_{1}(f)$, we say $A$ is a border hypercube of $\Omega_{1}(f)$ if $\partial A\cap\partial\Omega_{1}(f)\neq\emptyset$. Let $\epsilon=c^{\prime}\gamma/M$ and define the set $\mathscr{N}_{\epsilon}$ as

$\mathscr{N}_{\epsilon}=\big\{f:\ \Omega_{1}(f)\text{ is a finite union of elements of }\mathscr{H}\text{, and the number of border hypercubes of }\Omega_{1}(f)\text{ is no greater than }M^{d}\epsilon\big\}.$

We have the following lemma regarding the $\epsilon$-entropy of $\mathscr{N}_{\epsilon}$.

Lemma 6.

For $\epsilon<1/4$, we have

$\log|\mathscr{N}_{\epsilon}|\leq\log\sum_{k=1}^{c^{\prime}\gamma M^{d-1}}{M^{d}\choose k}\leq c^{\prime}\gamma\left(\frac{c^{\prime}\gamma}{\epsilon}\right)^{d-1}\left[\log\left(\frac{c^{\prime}\gamma}{\epsilon}\right)+1-\log(c^{\prime}\gamma)\right].$
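For orientation, one standard route to the second inequality (a sketch, not necessarily the argument in the supplementary material) uses the elementary bound $\sum_{k=0}^{K}{N\choose k}\leq(eN/K)^{K}$ for $1\leq K\leq N$. Taking $K=c^{\prime}\gamma M^{d-1}$ (up to rounding), $N=M^{d}$ and $M=c^{\prime}\gamma/\epsilon$,

$\log\sum_{k=1}^{K}{N\choose k}\leq K\log\frac{eN}{K}=c^{\prime}\gamma M^{d-1}\big[1+\log M-\log(c^{\prime}\gamma)\big]=c^{\prime}\gamma\left(\frac{c^{\prime}\gamma}{\epsilon}\right)^{d-1}\left[\log\left(\frac{c^{\prime}\gamma}{\epsilon}\right)+1-\log(c^{\prime}\gamma)\right].$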

The following lemma shows that $\mathscr{N}_{\epsilon}$ is not only an $\epsilon$-net for $\mathscr{F}_{\gamma}$ under the true measure $\mathbb{P}$, but also, with high probability, simultaneously a $2\epsilon$-net under the empirical measure $\mathbb{P}_{n}$.

Lemma 7.

We have the following two facts regarding $\mathscr{N}_{\epsilon}$:

$\sup_{f\in\mathscr{F}_{\gamma}}\inf_{f_{0}\in\mathscr{N}_{\epsilon}}\mathbb{P}(\Omega_{1}(f)\bigtriangleup\Omega_{1}(f_{0}))\leq\epsilon,$
$\mathbb{P}\left(\sup_{f\in\mathscr{F}_{\gamma}}\inf_{f_{0}\in\mathscr{N}_{\epsilon}}\max\big\{\mathbb{P}_{n}(\Omega_{1}(f)\bigtriangleup\Omega_{1}(f_{0})),2\mathbb{P}(\Omega_{1}(f)\bigtriangleup\Omega_{1}(f_{0}))\big\}>2\epsilon\right)\leq\sum_{k=1}^{c^{\prime}\gamma M^{d-1}}{M^{d}\choose k}\exp\{-2(\log 2)n\epsilon\}.$

We now move to the second part. By Lemma 7, there exists $f_{1}\in\mathscr{N}_{\epsilon}$ such that $R(f_{1})-R^{*}\leq\mathbb{P}(\Omega_{1}(f_{1})\bigtriangleup\Omega_{1}(f^{*}))\leq\epsilon$. The next lemma shows that, with high probability, $f_{1}$ has smaller empirical risk than every element of $\mathscr{N}_{\epsilon}$ that is sufficiently far away from $f^{*}$.

Lemma 8.

Suppose Condition 2 holds. Then we have

$\mathbb{P}\left(\inf_{T\in\mathscr{N}_{\epsilon},\,R(T)-R(f^{*})\geq 7\epsilon}R_{n}(T)-R_{n}(f_{1})>4\epsilon\right)\geq 1-|\mathscr{N}_{\epsilon}|\exp\left(-\frac{21}{16}nc_{1}^{1/\kappa}\epsilon^{2-1/\kappa}\right).$

The key idea of the proof is to write the empirical loss $R_{n}(\cdot)$ as a sum of i.i.d. random variables and to bound the deviation of this sum from its expectation by Bernstein's inequality.
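For reference, the generic form of Bernstein's inequality behind this step is the following; the precise application to $R_{n}$, including the choice of the variables $Z_{i}$ and the role of Condition 2, is spelled out in the supplementary material. If $Z_{1},\ldots,Z_{n}$ are i.i.d. with $|Z_{i}-\mathbb{E}Z_{i}|\leq b$ almost surely and $\mathrm{Var}(Z_{i})\leq\sigma^{2}$, then for any $t>0$,

$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}(Z_{i}-\mathbb{E}Z_{i})\geq t\right)\leq\exp\left(-\frac{nt^{2}}{2\sigma^{2}+2bt/3}\right).$

In arguments of this type, $Z_{i}$ is typically the difference of the losses of $T$ and $f_{1}$ at $(X_{i},Y_{i})$, which is nonzero only on $\Omega_{1}(T)\bigtriangleup\Omega_{1}(f_{1})$, so its variance is controlled by the probability of this symmetric difference.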

We are now ready to prove Theorem 2.

Proof of Theorem 2.

Define the events $E_{1},E_{2}$ as:

$E_{1}=\left\{\sup_{f\in\mathscr{F}_{\gamma}}\inf_{f_{0}\in\mathscr{N}_{\epsilon}}\max\big\{\mathbb{P}_{n}(\Omega_{1}(f)\bigtriangleup\Omega_{1}(f_{0})),2\mathbb{P}(\Omega_{1}(f)\bigtriangleup\Omega_{1}(f_{0}))\big\}\leq 2\epsilon\right\},$
$E_{2}=\left\{\inf_{T\in\mathscr{N}_{\epsilon},\,R(T)-R(f^{*})\geq 7\epsilon}R_{n}(T)-R_{n}(f_{1})>4\epsilon\right\}.$

We claim that on the event $E_{1}\cap E_{2}$, $R(\hat{T}_{n,\gamma})-R(f^{*})\leq 8\epsilon$. We prove this by contradiction. Suppose that $R(\hat{T}_{n,\gamma})-R(f^{*})>8\epsilon$. Then by the definition of event $E_{1}$, there exists a classifier $f_{2}\in\mathscr{N}_{\epsilon}$ such that

$|R(\hat{T}_{n,\gamma})-R(f_{2})|\leq\mathbb{P}(\Omega_{1}(\hat{T}_{n,\gamma})\bigtriangleup\Omega_{1}(f_{2}))\leq\epsilon,$ (6)
$|R_{n}(\hat{T}_{n,\gamma})-R_{n}(f_{2})|\leq\mathbb{P}_{n}(\Omega_{1}(\hat{T}_{n,\gamma})\bigtriangleup\Omega_{1}(f_{2}))\leq 2\epsilon.$ (7)

Combining equation (6) and the assumption that $R(\hat{T}_{n,\gamma})-R(f^{*})>8\epsilon$, we have $R(f_{2})-R(f^{*})>7\epsilon$. It then follows from the definition of event $E_{2}$ that

$R_{n}(f_{2})-R_{n}(f_{1})>4\epsilon.$ (8)

By the definition of event $E_{1}$, there exists $T^{\prime}_{1}\in\mathscr{F}_{\gamma}$ such that

$|R_{n}(T^{\prime}_{1})-R_{n}(f_{1})|\leq\mathbb{P}_{n}(\Omega_{1}(T^{\prime}_{1})\bigtriangleup\Omega_{1}(f_{1}))\leq 2\epsilon.$ (9)

Combining equations (7), (8) and (9), we have

$R_{n}(\hat{T}_{n,\gamma})-R_{n}(T_{1}^{\prime})=\big[R_{n}(\hat{T}_{n,\gamma})-R_{n}(f_{2})\big]+\big[R_{n}(f_{2})-R_{n}(f_{1})\big]+\big[R_{n}(f_{1})-R_{n}(T_{1}^{\prime})\big]>-2\epsilon+4\epsilon-2\epsilon=0,$

which contradicts the fact that $\hat{T}_{n,\gamma}$ is the empirical risk minimizer. Therefore $R(\hat{T}_{n,\gamma})-R(f^{*})\leq 8\epsilon$ on the event $E_{1}\cap E_{2}$. For all $f^{*}\in\mathscr{F}_{\gamma}$, the expected loss can be bounded by

$\mathbb{E}\left[R(\hat{T}_{n,\gamma})-R(f^{*})\right]\leq\mathbb{P}(E_{1}^{c}\cup E_{2}^{c})+8\epsilon$
$\leq\exp\left[c^{\prime}\gamma\left(\frac{c^{\prime}\gamma}{\epsilon}\right)^{d-1}\log\left(\frac{c^{\prime}\gamma}{\epsilon}\right)\right]\left[\exp(-2(\log 2)n\epsilon)+\exp\left(-\frac{21}{16}nc_{1}^{1/\kappa}\epsilon^{2-1/\kappa}\right)\right]+8\epsilon$
$\leq 2\exp\left[c^{\prime}\gamma\left(\frac{c^{\prime}\gamma}{\epsilon}\right)^{d-1}\log\left(\frac{c^{\prime}\gamma}{\epsilon}\right)\right]\exp\left(-\tilde{c}n\epsilon^{2-1/\kappa}\right)+8\epsilon,$

where the second inequality uses Lemmas 5, 6, 7 and the fact that $1-\log(c^{\prime}\gamma)\leq 0$ for $c^{\prime}\geq 1,\;\gamma\geq 2d\geq 4$; the last inequality uses the definition $\tilde{c}=\min\left\{2\log 2,\frac{21c_{1}^{1/\kappa}}{16}\right\}$. We now define a constant $N_{0}$ as

$N_{0}=\inf\Bigg\{n\in\mathbb{N}:\ \frac{(d-1/\kappa)\log n+\log\log n}{d+1-1/\kappa}\geq\log(c^{\prime}\gamma)+\log c_{4},$ (10)
$\frac{\tilde{c}}{2}n^{\frac{d-1}{d+1-1/\kappa}}(\log n)^{\frac{2-1/\kappa}{d+1-1/\kappa}}c_{5}^{2-1/\kappa}\geq-\log(4c_{4})+\frac{\log n-\log\log n}{d+1-1/\kappa}\Bigg\}.$

In both inequalities of (10), the left-hand side grows faster in $n$ than the right-hand side, so $N_{0}$ is well defined. Let $\epsilon=c_{4}(\log n/n)^{\frac{1}{d+1-1/\kappa}}$, where $c_{4}=[2(c^{\prime}\gamma)^{d}/\tilde{c}]^{\frac{1}{d+1-1/\kappa}}$. Then for $n\geq N_{0}$,

$\frac{c^{\prime}\gamma\big(\frac{c^{\prime}\gamma}{\epsilon}\big)^{d-1}\log\big(\frac{c^{\prime}\gamma}{\epsilon}\big)}{n\tilde{c}\epsilon^{2-1/\kappa}}=\frac{(c^{\prime}\gamma)^{d}[\log(c^{\prime}\gamma)+\log(1/\epsilon)]}{\tilde{c}n\epsilon^{d+1-1/\kappa}}$
$=\frac{(c^{\prime}\gamma)^{d}\big[\log(c^{\prime}\gamma)+\log(1/c_{4})+\frac{1}{d+1-1/\kappa}\log(\frac{n}{\log n})\big]}{\tilde{c}c_{5}^{d+1-1/\kappa}\log n}$
$\leq\frac{\log(c^{\prime}\gamma)+\log(1/c_{4})+\frac{1}{d+1-1/\kappa}\log(\frac{n}{\log n})}{2\log n}\leq\frac{1}{2},$

where the first inequality uses the definition of $c_{4}$ and the last inequality uses the definition of $N_{0}$. Therefore we have

$\mathbb{E}\left[R(\hat{T}_{n,\gamma})-R(f^{*})\right]\leq 2\exp\left(-\frac{\tilde{c}}{2}n\epsilon^{2-1/\kappa}\right)+8c_{4}\left(\frac{\log n}{n}\right)^{\frac{1}{d+1-1/\kappa}}.$

By the definition of $\epsilon$, $n\epsilon^{2-1/\kappa}$ grows polynomially in $n$, so the second term in the display above dominates. Specifically, by the definition of $N_{0}$, when $n\geq N_{0}$ we have $2\exp\left(-\frac{\tilde{c}}{2}n\epsilon^{2-1/\kappa}\right)\leq 8c_{4}\left(\frac{\log n}{n}\right)^{\frac{1}{d+1-1/\kappa}}$, and hence, since $16c_{4}=16(2/\tilde{c})^{\frac{1}{d+1-1/\kappa}}(c^{\prime}\gamma)^{\frac{d}{d+1-1/\kappa}}$,

$\mathbb{E}\left[R(\hat{T}_{n,\gamma})-R(f^{*})\right]\leq 16(2/\tilde{c})^{\frac{1}{d+1-1/\kappa}}(c^{\prime}\gamma)^{\frac{d}{d+1-1/\kappa}}\left(\frac{\log n}{n}\right)^{\frac{1}{d+1-1/\kappa}}.$