
Classification Trees for Imbalanced Data: Surface-to-Volume Regularization

Yichen Zhu, Cheng Li and David B. Dunson

Yichen Zhu is a Ph.D. student, Department of Statistical Science, Duke University, Durham, NC 27708. Cheng Li is Assistant Professor, Department of Statistics and Applied Probability, National University of Singapore, Singapore, 117546; the research was supported by the Singapore Ministry of Education Academic Research Funds Tier 1 Grant R-155-000-223-114. David B. Dunson is the Arts and Sciences Distinguished Professor, Departments of Statistical Science & Mathematics, Duke University, Durham, NC 27708; the research was partially supported by grant N000141712844 of the United States Office of Naval Research.
Abstract

Classification algorithms face difficulties when one or more classes have limited training data. We are particularly interested in classification trees, due to their interpretability and flexibility. When data are limited in one or more of the classes, the estimated decision boundaries are often irregularly shaped due to the limited sample size, leading to poor generalization error. We propose a novel approach that penalizes the Surface-to-Volume Ratio (SVR) of the decision set, obtaining a new class of SVR-Tree algorithms. We develop a simple and computationally efficient implementation while proving estimation consistency for SVR-Tree and a rate of convergence for an idealized empirical risk minimizer of SVR-Tree. SVR-Tree is compared with multiple algorithms designed to deal with imbalance in applications to real data.


Keywords: CART, Categorical data, Decision boundary, Shape penalization

1 Introduction

We are interested in the common setting in which one has a set of training data $\mathscr{D}_n=\{(X_i,Y_i)\}_{i=1}^n$, with $X_i\in\Omega\subset\mathbb{R}^d$ a vector of features and $Y_i\in\{0,\ldots,K-1\}$ a class label. The goal is to estimate a classifier $f:\Omega\to\{0,\ldots,K-1\}$, which outputs a class label given an input feature vector. This involves splitting the feature space $\Omega$ into subsets having different class labels, as illustrated in Figure 1. Classification trees are particularly popular for their combined flexibility and interpretability (Hu et al., 2019; Lin et al., 2020).

Our focus is on the case in which $n_j=\sum_{i=1}^n \bm{1}_{\{Y_i=j\}}$ is small for one or more $j\in\{0,\ldots,K-1\}$. For simplicity in exposition, we assume $K=2$, so that there are only two classes. Without loss of generality, suppose that $j=0$ is the majority class and $j=1$ is the minority class, and call the set $\{x\in\Omega: f(x)=1\}$ the decision set (of the minority class). Hence, $n_1$ is relatively small compared to $n_0$ by this convention.

To illustrate the problems that can arise from this imbalance, we consider a toy example. Let the sample space of $X$ be $\Omega=[0,1]\times[0,1]\subset\mathbb{R}^2$. Let the conditional distributions of $X$ given $Y$ be $X\,|\,Y=1\sim U([0,0.75]\times[0.25,0.75])$ and $X\,|\,Y=0\sim U(\Omega)$. We generate a training data set with 5 minority samples and 200 majority samples. The estimated decision sets of unpruned CART and optimally pruned CART (Breiman et al., 1984) are shown in Figure 1 (a) and (b), respectively. Minority samples are given weight 32 while majority samples are given weight 1 to naively address the imbalance. Both decision sets are inaccurate: unpruned CART overfits, carving out small regions around individual minority samples, while pruned CART fails to recover the support of the minority class.
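The weighted-CART behavior in this toy example can be reproduced with a few lines of code. The sketch below is illustrative only and assumes scikit-learn; the pruning strength `ccp_alpha=0.01` is an arbitrary stand-in for the optimal pruning used in Figure 1(b).

```python
# Minimal sketch of the toy example, assuming NumPy and scikit-learn are available.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Majority class: uniform on [0,1]^2; minority class: uniform on [0,0.75] x [0.25,0.75].
X0 = rng.uniform(0.0, 1.0, size=(200, 2))
X1 = np.column_stack([rng.uniform(0.0, 0.75, size=5), rng.uniform(0.25, 0.75, size=5)])
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 5)
w = np.where(y == 1, 32.0, 1.0)  # naive reweighting of the minority class

unpruned = DecisionTreeClassifier(random_state=0).fit(X, y, sample_weight=w)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X, y, sample_weight=w)
```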

Imbalanced data problems have drawn substantial interest; see Haixiang et al. (2017), He and Garcia (2008), Krawczyk (2016) and Fernández et al. (2017) for reviews. Some early work relied on random under- or over-sampling, which is essentially equivalent to modifying the weights and cannot address the key problem. Chawla et al. (2002) proposed SMOTE, which instead creates synthetic samples: for each minority class sample, synthetic samples are generated along the line segments joining it to its $k$ nearest neighbors within the minority class. Building on this idea, many other synthetic sampling methods have been proposed, including ADASYN (He et al., 2008), Borderline-SMOTE (Han et al., 2005), SPIDER (Stefanowski and Wilk, 2008), safe-level-SMOTE (Bunkhumpornpat et al., 2009) and WGAN-based sampling (Wang et al., 2019). These synthetic sampling methods have been demonstrated to be relatively effective.

However, current understanding of synthetic sampling is inadequate. Chawla et al. (2002) motivate SMOTE as designed to "create large and less specific decision regions", "rather than smaller and more specific regions". Later papers fail to improve upon this heuristic justification. Practically, the advantage of synthetic sampling over random over-sampling diminishes as the dimension of the feature space increases. In general, for each minority sample, we require at least $d$ synthetic samples to fully describe its neighborhood. This is often infeasible due to the sample size of the majority class and to computational complexity. Hence, it is typical to fix the number of synthetic samples regardless of the dimension of the feature space (Chawla et al., 2002), which may fail to "create large and less specific decision regions" when the dimension is high.

Motivated by these issues, we propose to directly penalize the Surface-to-Volume Ratio (SVR) of the decision set in constructing decision tree classifiers, based on the following considerations. First, a primary issue with imbalanced data is that the unregularized decision set typically consists of many small neighborhoods around each minority class sample. As shown in Sections 2 and 3, by penalizing the SVR of the decision set, we favor regularly shaped decision sets that are much less subject to such over-fitting. Second, existing classification methods are usually justified with theoretical properties under strong assumptions on the local properties of the true decision sets, for example, requiring the true decision set to be convex or to have sufficiently smooth boundaries. Such assumptions may be overly restrictive for many real datasets. In contrast, the proposed SVR regularized trees are only subject to a simple global constraint on the surface-to-volume ratio, which effectively controls the regularity of decision sets, leading to strong theoretical guarantees (see Section 3). Third, we show that SVR regularization can be efficiently implemented for tree classifiers by proposing an algorithm with computational complexity similar to standard tree algorithms such as CART.

The rest of the paper is organized as follows. Section 2 describes our methodology and algorithmic implementation. Section 3 analyzes theoretical properties of SVR Trees, including consistency and rate of convergence for excess risk. Section 4 presents numerical studies for real datasets. Section 5 contains a discussion, and proofs are included in an Appendix.

Figure 1: Decision sets for different methods: (a) unpruned CART, (b) optimally pruned CART, (c) SVR-Tree. Red crosses denote minority class samples, while blue points denote majority class samples. Rectangles with dashed frames denote the support of minority class samples, while rectangles filled with green color denote the minority class decision set.

2 Methodology

We first introduce the definition of surface-to-volume ratio (SVR) and tree impurity, and then define SVR-Tree as the minimizer of a weighted average of tree impurity and SVR. We then state the algorithm to estimate this tree from training data $\mathscr{D}_n=\{(X_i,Y_i)\}_{i=1}^n$. We assume readers have basic knowledge of tree-based classifiers like CART (Breiman et al., 1984) and C4.5 (Quinlan, 2014). In the rest of the paper, the word "tree" refers specifically to classification trees that specify a class label associated with each leaf node.

2.1 Notation

Training data are denoted as $\mathscr{D}_n=\{(X_i,Y_i)\}_{i=1}^n$, with $X_i\in\Omega\subset\mathbb{R}^d$ a vector of features and $Y_i\in\{0,1\}$ a class label. Uppercase letters $(X,Y)$ denote random variables, while lowercase $x,y$ denote specific values. Denote the $j$th feature of $X$ as $X[j]$. Let $\mathbb{P}^*$ denote the true distribution of i.i.d. samples $(X_i,Y_i)$ and let $\mathbb{P}$ denote the weighted distribution that upweights the minority class samples by a constant $\alpha\geq 1$. That is, for any measurable subset $A\subset\Omega$,

$$\mathbb{P}(A\times\{1\})=\frac{\alpha\,\mathbb{P}^{*}(A\times\{1\})}{\mathbb{P}^{*}(\Omega\times\{0\})+\alpha\,\mathbb{P}^{*}(\Omega\times\{1\})},\qquad \mathbb{P}(A\times\{0\})=\frac{\mathbb{P}^{*}(A\times\{0\})}{\mathbb{P}^{*}(\Omega\times\{0\})+\alpha\,\mathbb{P}^{*}(\Omega\times\{1\})}.$$

Upweighting minority class samples is a conventional technique to counter imbalance, and we focus on the weighted measure $\mathbb{P}$ in the rest of the paper. Denote by $\mathbb{P}_n$ the weighted empirical distribution, which assigns mass $w_0\alpha/n$ to minority class training samples and $w_0/n$ to majority class training samples. The constant $w_0$ is set so that the masses of all training samples add up to 1.
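As a concrete illustration, the weighted empirical measure $\mathbb{P}_n$ can be computed as below; this is a direct transcription of the definition above, not part of the original paper.

```python
# Weighted empirical masses: w0*alpha/n for minority samples, w0/n for majority samples,
# with w0 chosen so that all masses sum to 1.
import numpy as np

def weighted_empirical_masses(y, alpha):
    """y: array of 0/1 labels; alpha >= 1 is the minority-class weight."""
    y = np.asarray(y)
    raw = np.where(y == 1, alpha / len(y), 1.0 / len(y))
    w0 = 1.0 / raw.sum()      # normalizing constant
    return w0 * raw

masses = weighted_empirical_masses([0, 0, 0, 1], alpha=3)
print(masses, masses.sum())   # masses are (1/6, 1/6, 1/6, 1/2) and sum to 1
```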

We use $f$ with various subscripts and diacritics to denote a general classifier from $\Omega$ to $\{0,1\}$, while $T$ with various subscripts and diacritics denotes a tree classifier from $\Omega$ to $\{0,1\}$. A tree classifier is formed by finitely many splitting steps and its decision set is a finite union of hyperrectangles. When discussing the probability of events involving the $n$ random variables $\{(X_i,Y_i)\}_{i=1}^n$, we simply use $\mathbb{P}$ to represent the probability measure on the $n$-fold product space. For example, for any $a>0$, $\mathbb{P}\left(n^{-1}\sum_{i=1}^n X_i > a\right)$ denotes the probability of the event $\{\{(X_i,Y_i)\}_{i=1}^n : n^{-1}\sum_{i=1}^n X_i > a\}$. We use $\mathbb{E}$ for expectations over $\mathbb{P}$ and $\mathbb{E}_n$ for expectations over $\mathbb{P}_n$. We use $\eta$ and $\eta_n$ to denote the conditional expectation of $Y$ given $X$ under the true and empirical measures, respectively; that is, $\eta(X)=\mathbb{E}(Y\,|\,X)$ and $\eta_n(X)=\mathbb{E}_n(Y\,|\,X)$.

2.2 Surface-to-Volume Ratio

For all $d\in\mathbb{N}$, consider the space $\mathbb{R}^d$ equipped with Lebesgue measure $\mu$. For any Lebesgue measurable closed set $A\subset\Omega$, we define its volume as the Lebesgue measure of $A$: $V(A)=\mu(A)$. If $A$ can be represented as a finite union of hyperrectangles, we define the surface of $A$ as the $(d-1)$-dimensional Lebesgue measure of the boundary of $A$: $S(A)=\mu_{d-1}(\partial A)$; otherwise, if there exist $A_i$, $i\geq 1$, such that each $A_i$ is a finite union of hyperrectangles and $A_i$ converges to $A$ in Hausdorff distance, we define the surface of $A$ as $S(A)=\lim_{i\to\infty}S(A_i)$, provided the limit exists. In general, if all the $A_i$, $i\geq 1$, are regularly shaped, it will be proven in Proposition 2 that the limit exists. Our definition of surface differs from the usual meaning of the word "surface". For example, under our definition a $d$-dimensional ball with radius $1/2$ has the same surface as a $d$-dimensional cube with side length 1, since approximating such a ball with hyperrectangles results in the same surface area as the cube. We adopt this definition of surface because we focus on tree classifiers, whose decision sets are always finite unions of hyperrectangles.

For any set $A$ with $0<\mu(A)<\infty$, the surface-to-volume ratio (SVR) is $r(A)=S(A)/V(A)$. For sets with the same volume, the $d$-dimensional cube has the smallest SVR, while sets having multiple disconnected subsets and/or irregular boundaries have relatively high SVR. We visualize sets with different SVR in two-dimensional space in Figure 2, with subfigures (a)-(c) corresponding to the decision sets in Figure 1. We can see that decision sets with larger SVR tend to have more branches, disconnected components, or irregularly shaped borders than those with smaller SVR.

Figure 2: Examples of sets with different SVR values in $[0,1]^2$: (a) SVR=70.03, (b) SVR=13.59, (c) SVR=7.3407, (d) SVR=4, (e) SVR=40, (f) SVR=6.81, (g) SVR=21.82, (h) SVR=41.65. Rectangles filled with green color denote the minority class decision set.

Surface-to-Volume Ratio of a Classification Tree

For training data $\mathscr{D}_n$, we define a closed bounded sample space $\Omega\subset\mathbb{R}^d$ such that the support of $X$ is a subset of $\Omega$. A classification tree $T$ divides the sample space $\Omega$ into two disjoint subsets $\Omega_1=\{x\in\Omega : T(x)=1\}$ and $\Omega_0=\Omega\backslash\Omega_1$, where $\Omega_0\cup\Omega_1=\Omega$. The tree $T$ predicts that a new sample $X^*$ belongs to class 0 if $X^*\in\Omega_0$, and to class 1 if $X^*\in\Omega_1$. The surface-to-volume ratio of a classification tree is defined as the surface-to-volume ratio of the set $\Omega_1$: $r(T)=r(\Omega_1)$.
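Since the decision set of a tree is a finite union of axis-aligned rectangles, the SVR of a two-dimensional tree classifier can be approximated numerically by rasterizing its decision set. The sketch below is a rough illustration of the definition, not the exact computation used in our implementation.

```python
# Approximate r(T) = S(Omega_1)/V(Omega_1) on [0,1]^2 by evaluating the classifier on a grid:
# volume is the covered area; surface is the total length of cell edges separating covered
# cells from uncovered cells (including edges on the boundary of the unit square).
import numpy as np

def approx_svr_2d(predict, grid=400):
    h = 1.0 / grid
    centers = (np.arange(grid) + 0.5) * h
    xx, yy = np.meshgrid(centers, centers, indexing="ij")
    mask = predict(np.column_stack([xx.ravel(), yy.ravel()])).reshape(grid, grid).astype(bool)
    volume = mask.sum() * h * h
    padded = np.pad(mask, 1, constant_values=False)
    n_edges = np.sum(padded[1:, :] != padded[:-1, :]) + np.sum(padded[:, 1:] != padded[:, :-1])
    surface = h * n_edges
    return surface / volume if volume > 0 else np.inf

# The rectangle [0,0.75] x [0.25,0.75] has surface 2.5 and volume 0.375, so SVR is about 6.67.
in_rect = lambda X: (X[:, 0] <= 0.75) & (X[:, 1] >= 0.25) & (X[:, 1] <= 0.75)
print(approx_svr_2d(in_rect))  # approximately 6.67
```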

2.3 Impurity Function and Tree Impurity

A classification tree partitions the sample space into multiple leaf nodes, assigning one class label to each leaf node. The tree should be built to maximize homogeneity of the training sample class labels within nodes. This motivates the following definition.

Definition 1 (Impurity, Definition 2.5 of Breiman et al. 1984).

An impurity function $\phi(\cdot,\cdot)$ is defined on the set of pairs $(p_0,p_1)$ satisfying $p_0\geq 0$, $p_1\geq 0$, $p_0+p_1=1$, with the properties: (i) $\phi(\cdot,\cdot)$ achieves its maximum only at $(1/2,1/2)$; (ii) $\phi(\cdot,\cdot)$ achieves its minimum only at $(1,0)$ and $(0,1)$; (iii) $\phi(\cdot,\cdot)$ is symmetric in $(p_0,p_1)$, i.e., $\phi(p_0,p_1)=\phi(p_1,p_0)$.

Let $p_0$ and $p_1$ represent the probabilities of belonging to the majority and minority class, respectively, within some branch of the tree. Ideally, splits of the tree are chosen so that, after splitting, $p_0$ and $p_1$ move closer to 0 or 1 and further from $1/2$. Many different tree building algorithms use impurity to measure the quality of a split; for example, CART uses the Gini impurity and C4.5 uses entropy, which is effectively a type of impurity measure. In the remainder of the paper, the term "impurity function" refers to the Gini impurity, that is, $\phi(p_0,p_1)=1-p_0^2-p_1^2$.

Let $A_1,A_2,\ldots,A_m$ be the leaf nodes of a classification tree $T$ and let $z_j\in\{0,1\}$ be the predictive class label for node $A_j$, for $j=1,\ldots,m$. Let $\mathbb{P}$ be the weighted probability measure defined in Section 2.1. The impurity of leaf node $A_j$ is $I(A_j,\mathbb{P})=\phi(\mathbb{P}(Y=0\,|\,X\in A_j),\mathbb{P}(Y=1\,|\,X\in A_j))$. The impurity of node $A_j$ measures the class homogeneity in $A_j$, but does not depend on the predictive class label $z_j$. Let $\tilde{z}_j=\mathbbm{1}_{\{\mathbb{P}(Y=1\,|\,X\in A_j)\geq 1/2\}}$ denote the dominant class label in $A_j$ under the weighted measure $\mathbb{P}$. We define a signed impurity taking into account $z_j$ as

$$\tilde{I}(A_j,\mathbb{P})=\mathbbm{1}_{\{z_j=\tilde{z}_j\}}\,I(A_j,\mathbb{P})+\mathbbm{1}_{\{z_j\neq\tilde{z}_j\}}\,\big(1-I(A_j,\mathbb{P})\big).$$

The signed impurity serves as a proxy for classification accuracy. If the predictive class label $z_j$ matches the dominant class label $\tilde{z}_j$ in node $A_j$, the signed impurity of $A_j$ equals the impurity of $A_j$ and is no greater than $1/2$. Otherwise, an extra penalty of $1-2I(A_j,\mathbb{P})$ is applied. Writing $p_0=\mathbb{P}(Y=0\,|\,X\in A_j)$, it is easy to see that $1-2I(A_j,\mathbb{P})=1-4p_0(1-p_0)=[1-2p_0]^2$. In other words, the signed impurity uses a quadratic loss instead of the absolute value loss of classification accuracy. Taking a weighted average of the signed impurities across the leaf nodes, one obtains the tree impurity and signed tree impurity.
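The Gini impurity and the signed node impurity are simple to compute; the following snippet is a direct transcription of the definitions above.

```python
# Gini impurity and signed impurity of a single node with class proportions (p0, p1),
# p0 + p1 = 1, under the weighted measure.
def gini(p0, p1):
    return 1.0 - p0 ** 2 - p1 ** 2

def signed_impurity(p0, p1, predicted_label):
    dominant = 1 if p1 >= 0.5 else 0
    imp = gini(p0, p1)
    return imp if predicted_label == dominant else 1.0 - imp

# Predicting the dominant class gives the Gini impurity; predicting the other class adds
# the penalty 1 - 2*I(A) = (1 - 2*p0)^2.
print(signed_impurity(0.8, 0.2, predicted_label=0))  # about 0.32
print(signed_impurity(0.8, 0.2, predicted_label=1))  # about 0.68 = 0.32 + (1 - 2*0.8)**2
```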

Definition 2 (Tree Impurity).

Let $T$ be a tree and $A_1,A_2,\ldots,A_m$ be the leaf nodes of this tree. Denoting the weighted probability measure as $\mathbb{P}$, the tree impurity of $T$ is $I(T,\mathbb{P})=\sum_{j=1}^m \mathbb{P}(X\in A_j)\,I(A_j,\mathbb{P})$.

Definition 3 (Signed Tree Impurity).

Under the notation of Definition 2, the signed tree impurity is $\tilde{I}(T,\mathbb{P})=\sum_{j=1}^m \mathbb{P}(X\in A_j)\,\tilde{I}(A_j,\mathbb{P})$.

2.4 SVR-Tree Classifiers

The SVR-Tree classifier is the minimizer of a weighted average of the signed tree impurity and the surface-to-volume ratio. Letting $\mathscr{T}$ be the collection of possible trees, the SVR-Tree classifier is defined as

$$\hat{T}=\operatorname*{argmin}_{T\in\mathscr{T}}\big[\tilde{I}(T,\mathbb{P}_n)+\lambda_n r(T)\big], \qquad (1)$$

where $\lambda_n$ is a penalty parameter and the unknown probability measure $\mathbb{P}$ is replaced with the weighted empirical measure $\mathbb{P}_n$ defined in Section 2.1. Unfortunately, without restricting the space of trees $\mathscr{T}$, the optimization problem (1) is intractable. In the following subsection, we introduce an iterative greedy search algorithm that limits the size of $\mathscr{T}$ in each step to efficiently obtain a tree having provably good performance.

2.5 The SVR-Tree Algorithm

The SVR-Tree algorithm is designed to find a nearly optimal SVR-Tree classifier, and proceeds in a greedy manner. We begin with the root node. At each step, we operate on one leaf node of the current tree, splitting it into two new leaf nodes by finding the solution of (1). The node to split at each step is uniquely specified by a breadth-first searching order. After splitting, the tree is updated and the node to split in the next step is specified. The process stops when further splitting of leaf nodes does not improve the loss or when a prespecified maximum number of leaf nodes is reached.

We first describe how to split a current leaf node to improve the solution to (1). Suppose the current tree is $T$ and the node to split is $A$, with $n'$ training samples. For each feature $j$, sort the samples in $A$ in increasing order of the $j$th feature as $X[j]_{j_1},X[j]_{j_2},\ldots,X[j]_{j_{n'}}$. We only allow splits at $(X[j]_{j_i}+X[j]_{j_{i+1}})/2$, $1\leq i\leq n'-1$, $1\leq j\leq d$, the midpoints of adjacent values of each feature. The total number of different splits is no more than $(n'-1)d$. After each such split of $A$, we keep all other leaf nodes unchanged while allowing all 4 possible class label assignments at the two new daughter nodes of $A$. The current set of trees $\mathscr{T}$ to choose from in optimizing (1) includes the initial $T$ and all the split trees described above. The cardinality of $\mathscr{T}$ is no more than $1+4(n'-1)d$, a linear order of $n'$. We compute the risk for all $T\in\mathscr{T}$ to find the minimizer. If the initial $T$ is the risk minimizer, we do not make any split in this step and mark the node $A$ as "complete". Any node marked as complete will no longer be split in subsequent steps.
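The candidate splits described above are easy to enumerate; the following sketch returns the midpoints of adjacent sorted values for each feature (duplicate values are dropped, so the count is at most $(n'-1)d$).

```python
# Enumerate candidate splits for a node: midpoints of adjacent sorted feature values.
import numpy as np

def candidate_splits(X_node):
    """X_node: (n', d) array of the training samples in the node being split."""
    splits = []
    for j in range(X_node.shape[1]):
        vals = np.unique(X_node[:, j])            # sorted distinct values of feature j
        midpoints = (vals[:-1] + vals[1:]) / 2.0
        splits.extend((j, t) for t in midpoints)
    return splits                                 # list of (feature index, threshold) pairs

print(candidate_splits(np.array([[0.1, 0.9], [0.4, 0.2], [0.3, 0.2]])))
# feature 0: thresholds 0.2 and 0.35; feature 1: threshold 0.55
```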

It remains to specify the breadth-first searching order determining which node to split at each step. Let the depth of a node be the number of edges to the root node, which has depth 0. Breadth-first algorithms explore all nodes at the current depth before moving on (Cormen et al. 2009, Chapter 22.2). To keep track of changes in the tree, we use a queue, a dynamic set in which elements are kept in order and the principal operations are the insertion of elements at the tail, known as enqueue, and the deletion of elements from the head, known as dequeue; see Chapter 10.1 of Cormen et al. (2009) for details. We begin our algorithm with a queue whose only entity is the root node. At each step, we remove the node at the front of the queue and split at this node as described in the previous paragraph. If a split tree is accepted as the risk minimizer over the current set $\mathscr{T}$, we enqueue the two new leaf nodes; otherwise, the unsplit tree is the risk minimizer over the current set $\mathscr{T}$, so we do not enqueue any new node. The nodes at the front of the queue have the lowest depth, so our algorithm naturally follows a breadth-first searching order. We preset the maximal number of leaf nodes as $\bar{a}_n$. The process is stopped when either the queue is empty, in which case all leaf nodes are marked as complete, or the number of leaf nodes reaches $\bar{a}_n$. On many datasets, Algorithm 1 stops before reaching the maximal number of leaf nodes $\bar{a}_n$. The upper bound is necessary in certain circumstances, since our algorithm, like the majority of practical tree algorithms, is greedy and not guaranteed to find the global minimum. Provided we could find the global minimum, the upper bound $\bar{a}_n$ would no longer be required, as shown in Theorem 2 and discussed thereafter.

Our SVR-Tree algorithm has a coarse to fine tree building style, tending to first split the sample space into larger pieces belonging to two different classes, followed by modifications to the surface of the decision set to decrease tree impurity and SVR. The steps are sketched in Algorithm 1, where feature selection steps are marked as optional. A more detailed and rigorous version is in Section 4.1 of the supplementary material.

The average time complexity of Algorithm 1 is $O(dn\log n)$, the same as many tree algorithms such as CART and C4.5. A detailed analysis of computational complexity is available in Section 4.2 of the supplementary material.

Optional Step for Feature Selection

It is likely that some features do not help to predict class labels. These features should be excluded from our estimated tree. Under some mild conditions, a split on a redundant feature has minimal impact on the tree impurity compared to a split on a non-redundant feature. Thus feature selection can be achieved by thresholding. Suppose we are splitting node $A$ into two new leaf nodes $A_1$ and $A_2$. Then the (unsigned) tree impurity decrease after this split is defined as

$$\Delta I(T,\mathbb{P}_n)=\mathbb{P}_n(A)\big[I(A,\mathbb{P}_n)-I(A_1,\mathbb{P}_n)\,\mathbb{P}_n(A_1\,|\,A)-I(A_2,\mathbb{P}_n)\,\mathbb{P}_n(A_2\,|\,A)\big].$$

Let $J_0\subset\{1,2,\ldots,d\}$ be the indices of features that have been split in previous tree building steps. Given that we are splitting node $A$, let $\Delta I_0$ be the maximal tree impurity decrease over all splits on features $X[j]$, $j\in J_0$. Then a split on a new feature $X[j']$, $j'\notin J_0$, with tree impurity decrease $\Delta I(T,\mathbb{P}_n)$, is accepted if

$$\Delta I(T,\mathbb{P}_n)\geq \Delta I_0 + c_0\lambda_n, \qquad (2)$$

where $c_0$ is a constant independent of the training data. By equation (2), a split on a new feature is accepted if its tree impurity decrease is greater than $\Delta I_0$, the maximal tree impurity decrease over all previously split features, plus a threshold term $c_0\lambda_n$. Theoretical support for this thresholding approach is provided in the supplementary material.
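The thresholding rule in equation (2) amounts to a one-line test; the sketch below uses the value $c_0=4$ adopted in the numerical studies of Section 4.

```python
# Accept a split on a previously unused feature only if its impurity decrease exceeds the
# best decrease over already-used features by at least c0 * lambda_n, as in equation (2).
def accept_new_feature_split(delta_I_new, delta_I0, lambda_n, c0=4.0):
    return delta_I_new >= delta_I0 + c0 * lambda_n

print(accept_new_feature_split(0.10, 0.06, lambda_n=0.005))  # True:  0.10 >= 0.06 + 0.02
print(accept_new_feature_split(0.07, 0.06, lambda_n=0.005))  # False: 0.07 <  0.08
```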

Result: the fitted tree
Input: training data $\{(X_i,Y_i)\}_{i=1}^n$, impurity function $\phi(\cdot,\cdot)$, weight for the minority class $\alpha$, SVR penalty parameter $\lambda_n$, and maximal number of leaf nodes $\bar{a}_n\in\mathbb{N}$. Let node_queue be a queue containing only the root node.
for $1\leq j\leq d$ do
      Sort all samples by the values of the $j$th feature;
end for
while node_queue is not empty and the number of leaf nodes is $\leq\bar{a}_n$ do
      Dequeue the first entity in node_queue, denoting it as node;
      for all possible splits of node do
            (optional) Check whether the current split satisfies the feature selection condition; if not, reject the current split and continue;
            Compute the signed tree impurity of the current split;
      end for
      Find the split that minimizes the weighted average of signed impurity and SVR as in equation (1);
      if the weighted average of signed impurity and SVR is decreased then
            Accept the split and enqueue the two child nodes of node into node_queue;
      else
            Reject the split;
      end if
end while
Algorithm 1: Outline of steps of SVR-Tree
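The control flow of Algorithm 1 can be sketched compactly. The code below is a skeleton only: the callables `candidate_splits`, `node_risk`, `split_risk` (signed tree impurity plus $\lambda_n$ times SVR, as in (1)) and `apply_split` are assumed to be supplied by the user and are not part of the paper.

```python
# Breadth-first greedy growth of an SVR-Tree (control-flow sketch).
from collections import deque

def grow_svr_tree(root, candidate_splits, node_risk, split_risk, apply_split, max_leaves):
    queue = deque([root])
    n_leaves = 1
    while queue and n_leaves < max_leaves:
        node = queue.popleft()                           # dequeue: lowest-depth node first
        best_risk, best_split = node_risk(node), None    # keeping the node unsplit is allowed
        for feature, threshold in candidate_splits(node):
            risk = split_risk(node, feature, threshold)  # signed impurity + lambda_n * SVR
            if risk < best_risk:
                best_risk, best_split = risk, (feature, threshold)
        if best_split is None:
            continue                                     # node is "complete"; never revisited
        left, right = apply_split(node, *best_split)
        queue.append(left)                               # enqueue the two new leaf nodes
        queue.append(right)
        n_leaves += 1                                    # one leaf replaced by two
    return root
```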

3 Theoretical Properties

In this section, we discuss the theoretical properties of SVR-Tree from two aspects. First, we show that the classifier obtained from Algorithm 1 is consistent in the sense that a generalized distance between SVR-Tree and the oracle classifier goes to zero as the sample size goes to infinity. Second, we show that for an idealized empirical risk minimizer over a family of SVR constrained classifiers, the excess risk converges to zero at the rate $O((\log n/n)^{\frac{1}{d+1-1/\kappa}})$ as the sample size goes to infinity, where $\kappa\geq 1$ is defined in Condition 2. The consistency result provides basic theoretical guarantees for SVR-Tree, while the rate of convergence result provides further insight into how the surface-to-volume ratio works in nonparametric classification. We also develop results on the feature selection consistency of Algorithm 1 with the optional steps enabled. Due to limited space, these results are provided in the supplementary material.

3.1 Estimation Consistency

Let the classification risk under measure $\mathbb{P}$ be $R(f)=\mathbb{E}|Y-f(X)|$ and its empirical version under measure $\mathbb{P}_n$ be $R_n(f)=\mathbb{E}_n|Y-f(X)|$. Let $\mathscr{F}$ be the collection of all measurable functions from $\Omega$ to $\{0,1\}$. The oracle classifier and oracle risk are defined as $f^*=\arg\min_{f\in\mathscr{F}}R(f)$ and $R^*=R(f^*)$, respectively. We define the consistency of a sequence of classifiers based on the excess risk compared to the oracle risk.

Definition 4.

A sequence of classifiers $f_n$ is consistent if $\lim_{n\to\infty}\mathbb{E}R(f_n)=R^*$.

In the case when $\mathbb{P}(\{X\in\Omega:\eta(X)=1/2\})=0$, consistency of $f_n$ is equivalent to requiring $\lim_{n\to\infty}\mathbb{E}|f_n-f^*|=0$ almost surely. For any hyperrectangle $A\subset\Omega$, denote the component-wise conditional expectation on $A$ as $\eta_{A,j}(X[j])=\mathbb{E}(Y\,|\,X[j],X\in A)$. To ensure the consistency of Algorithm 1, we need the following identifiability condition.

Condition 1 (Identifiability).

For any hyperrectangle $A\subset\Omega$, if the component-wise conditional expectation $\eta_{A,j}(X)$ is constant on $A$ for all $1\leq j\leq d$, then the conditional expectation $\eta(X)$ is constant on $A$.

Condition 1 is indispensable for proving the consistency of our SVR regularized tree classifiers without requiring the diameter of each leaf node to go to zero. This is because CART-type algorithms can only make splits based on the component-wise conditional expectations. If all the component-wise conditional expectations $\eta_{A,j}(X[j])$ are constant but the overall conditional expectation $\eta(x)$ varies in $x$, then the problem is not identifiable to CART-type split rules. Condition 1 can also be viewed as a relaxed version of Condition (H1) in Scornet et al. (2015), who studied the consistency of random forest regression. Their Condition (H1) requires that the mean of $Y$ given $X$ follows an additive model in the regression tree setup. An analogous condition for our classification tree setup would require $\eta(x)$ to be an additive function, which would imply our Condition 1.

Denote the tree obtained from Algorithm 1 as $\hat{T}_n$. Theorem 1 shows that $\hat{T}_n$ is consistent under mild conditions.

Theorem 1 (Estimation consistency).

Let $\Omega=[0,1]^d$ and suppose the weighted measure $\mathbb{P}$ has a continuous density $\rho(x)$ with lower bound $\rho(x)\geq\rho_{\min}>0$ for all $x\in\Omega$. Assume that Condition 1 is satisfied, $\eta(x)$ is continuous everywhere in $\Omega$, and $\bar{a}_n$, $\lambda_n$ satisfy $\lim_{n\to\infty}\bar{a}_n=\infty$, $\lim_{n\to\infty}\lambda_n=0$, and $\lim_{n\to\infty}\frac{\bar{a}_n d\log n}{n}=0$. Then the classifier $\hat{T}_n$ obtained from Algorithm 1 is consistent.

3.2 Rate of Convergence for Empirical Risk Minimizer

We now consider the empirical risk minimizer over a class of trees under the SVR constraint and derive the rate of convergence for this empirical risk minimizer. This empirical risk minimizer can be considered as an idealized classifier.

We first introduce the family of SVR constrained classifiers. For ease of notation, for any classifier $f$, we denote its decision set by $\Omega_1(f)=\{x\in\Omega:f(x)=1\}$. Define $\mathscr{F}_0$ as the family of all possible classifiers generated from trees with finite depth: $\mathscr{F}_0=\{f:\Omega\to[0,1] : \Omega_1(f)\ \text{is a finite union of hyperrectangles}\}$. For $\gamma>0$, the family of SVR constrained classifiers $\mathscr{F}_\gamma$ is defined as

$$\mathscr{F}_\gamma=\left\{f\in\mathscr{F}_0 : \forall\, f'\in\mathscr{F}_0,\ S(f')-S(f)\geq-\gamma\, V\big(\Omega_1(f)\,\triangle\,\Omega_1(f')\big)\right\}, \qquad (3)$$

where $A\,\triangle\,B=(A\cap B^c)\cup(A^c\cap B)$ denotes the symmetric set difference between two generic sets $A$ and $B$. The elements of $\mathscr{F}_\gamma$ are constrained by SVR in the sense that the surface area of $\Omega_1(f)$ can decrease at most proportionally to the change in its volume. The parameter $\gamma$ is set to be no smaller than $2d$ to prevent $\mathscr{F}_\gamma$ from being empty. The definition of $\mathscr{F}_\gamma$ matches the intuition in Algorithm 1, where the SVR is checked at every leaf node split. Moreover, the following proposition reveals that the regularized risk minimizer over all trees under the true measure, provided it exists, always lies in the family $\mathscr{F}_\gamma$. Therefore, in the subsequent analysis, we focus on the effect of $\gamma$ on the convergence of the excess risk, since it plays a role equivalent to $\lambda$ in the minimization problem (1).

Proposition 1.

Suppose that $\Omega=[0,1]^d$, and the density $\rho$ of the weighted marginal measure $\mathbb{P}$ exists and satisfies $\rho(x)\leq\rho_{\max}<\infty$ for all $x\in\Omega$. If $\gamma\geq 2d\lambda+1+\rho_{\max}/\lambda$, then

$$\inf_{f\in\mathscr{F}_\gamma}\big[\tilde{I}(f,\mathbb{P})+\lambda r(f)\big]=\inf_{f\in\mathscr{F}_0}\big[\tilde{I}(f,\mathbb{P})+\lambda r(f)\big],$$
$$\inf_{f\in\mathscr{F}_\gamma}\big[R(f)+\lambda r(f)\big]=\inf_{f\in\mathscr{F}_0}\big[R(f)+\lambda r(f)\big].$$

The family $\mathscr{F}_\gamma$ only contains tree classifiers and is not closed. It is desirable to consider the class of all classifiers that can be well approximated by tree classifiers. Let $d_H(A,B)$ be the Hausdorff distance between two generic sets $A$ and $B$. We define the closure of $\mathscr{F}_\gamma$ with respect to the Hausdorff distance of decision sets as

$$\overline{\mathscr{F}_\gamma}\triangleq\left\{f:\Omega\to[0,1] : \exists\ \text{a sequence}\ f_i\in\mathscr{F}_\gamma\ \text{such that}\ \lim_{i\to\infty}d_H\big(\Omega_1(f_i),\Omega_1(f)\big)=0\right\}.$$

The elements of $\overline{\mathscr{F}_\gamma}$ have a well-defined surface area, as shown in Proposition 2.

Proposition 2.

$S(f)$ is well defined for all $f\in\overline{\mathscr{F}_\gamma}$ in the sense that: (i) for every sequence $f_i\in\mathscr{F}_\gamma$ whose decision sets $\Omega_1(f_i)$ converge in Hausdorff distance, $S(f_i)$ is a convergent sequence in $\mathbb{R}$; (ii) for any two sequences $f_{i1}\in\mathscr{F}_\gamma$ and $f_{i2}\in\mathscr{F}_\gamma$ satisfying $\lim_{i\to\infty}d_H(\Omega_1(f_{i1}),\Omega_1(f))=0$ and $\lim_{i\to\infty}d_H(\Omega_1(f_{i2}),\Omega_1(f))=0$, we have $\lim_{i\to\infty}S(f_{i1})=\lim_{i\to\infty}S(f_{i2})$.

The proofs of Propositions 1 and 2 are available in the supplementary material. To better control the rate of convergence of the excess risk, we also need the following margin condition.

Condition 2.

There exist constants $\epsilon_0\in(0,1)$, $c_1>0$ and $\kappa\geq 1$ such that for all classifiers $f$ satisfying $\mathbb{P}(\Omega_1(f)\,\triangle\,\Omega_1(f^*))<\epsilon_0$, we have

$$R(f)-R^*\geq c_1\big[\mathbb{P}\big(\Omega_1(f)\,\triangle\,\Omega_1(f^*)\big)\big]^\kappa.$$

Condition 2 is essentially Assumption (A1) in Tsybakov (2004), known as Tsybakov's margin condition and widely used in classification problems to obtain fast convergence rates for the excess risk (Tsybakov, 2004; Audibert and Tsybakov, 2007). Condition 2 holds as long as $\mathbb{P}(|\eta(X)-1/2|\leq t)\leq C_\eta t^{1/(\kappa-1)}$ for all $t\in(0,t^*)$ and some constants $C_\eta>0$ and $t^*\in(0,1/2)$; see Proposition 1 in Tsybakov (2004). Under Condition 2, our main theorem states the rate of convergence for the empirical risk minimizer over $\mathscr{F}_\gamma$.

Theorem 2 (Rate of Convergence).

Suppose that $\Omega=[0,1]^d$, $d\geq 2$, and Condition 2 holds. Let the empirical risk minimizer be $\hat{T}_{n,\gamma}=\operatorname*{argmin}_{T\in\mathscr{F}_\gamma}R_n(T)$ and let $\tilde{c}=\min\big(2\log 2,\frac{21c_1^{1/\kappa}}{16}\big)$. Then there exist constants $N_0$ as in equation (10) and $c'$ as in Lemma 5, such that for all $n\geq N_0$,

$$\sup_{f^*\in\overline{\mathscr{F}_\gamma}}\mathbb{E}\left[R(\hat{T}_{n,\gamma})-R(f^*)\right]\leq 16\,(2/\tilde{c})^{\frac{1}{d+1-1/\kappa}}\,(c'\gamma)^{\frac{d}{d+1-1/\kappa}}\left(\frac{\log n}{n}\right)^{\frac{1}{d+1-1/\kappa}}. \qquad (4)$$

Theorem 2 provides an upper bound in (4) on the rate of convergence of the excess risk from the empirical risk minimizer $\hat{T}_{n,\gamma}$ among all SVR constrained tree classifiers in $\mathscr{F}_\gamma$ to the theoretically best possible classifier $f^*$ in $\overline{\mathscr{F}_\gamma}$ with a well-defined surface area. Because we place few restrictions on the split locations of $\hat{T}_{n,\gamma}$ apart from membership in $\mathscr{F}_\gamma$, it is possible for a split to land directly on an observed sample under the empirical measure. In such situations, we allow the tree $\hat{T}_{n,\gamma}$ to assign the sample to either leaf node. Hence, the leaf nodes of $\hat{T}_{n,\gamma}$ can be open, closed, or half-open. On the other hand, the theoretically best classifier $f^*$ can lie in the much larger family $\overline{\mathscr{F}_\gamma}$. Although the classifiers in $\mathscr{F}_\gamma$ are tree classifiers, there is no limitation on how many leaf nodes they can have. Therefore, they can approximate decision sets with smooth boundaries, for example a $d$-dimensional ball. In fact, by adjusting $\gamma$, $\overline{\mathscr{F}_\gamma}$ will contain most classifiers with regularly shaped decision sets.

The rate of convergence for the excess risk in the upper bound (4) is $O((\log n/n)^{\frac{1}{d+1-1/\kappa}})$, which is the same as the minimax rate of set estimation on the family of sets with first-order differentiable boundaries as in Mammen and Tsybakov (1995). Although Mammen and Tsybakov (1995) do not use the margin condition and hence the term $1/\kappa$ does not appear explicitly in their rates, their results yield the same rate as ours up to logarithmic factors if the margin condition is applied; see, for example, Theorem 4 of Scott and Nowak (2006). Furthermore, in the upper bound (4) in Theorem 2, we have explicitly characterized the effect of the SVR regularization $\gamma$ from the set $\mathscr{F}_\gamma$. By definition, the SVR constraint in $\mathscr{F}_\gamma$ becomes weaker as $\gamma$ increases. As expected, the upper bound in (4) increases with $\gamma$, given the increasing complexity of the set $\mathscr{F}_\gamma$. A direct implication of this upper bound is that if the true tree classifier belongs to a family $\overline{\mathscr{F}_\gamma}$ with a smaller $\gamma$, i.e., a smaller SVR, then the excess risk of the empirical risk minimizer $\hat{T}_{n,\gamma}$ also tends to be smaller.

Theorem 2 is related to the literature on classification risk of decision trees, but with important differences. In general, the rate of convergence for the excess risk depends on two assumptions: the margin condition, as in our Condition 2, and the complexity, or the $\epsilon$-entropy, of the family of decision sets (Scott and Nowak, 2006). Suppose the $\epsilon$-entropy is $(1/\epsilon)^\rho$ for some $\rho>0$. If $\rho<1$, then together with a suitable margin condition like Condition 2, the rate of convergence can be faster than $O(n^{-1/2})$ or even faster than $O(n^{-1})$ (Tsybakov, 2004; Audibert and Tsybakov, 2007). If $\rho\geq 1$, then the rate of convergence is typically of order $n^{-\frac{1}{\rho+c}}$ with $c$ a constant independent of $\rho$ (Mammen and Tsybakov, 1995). Although it is possible to slightly improve the constant $c$ by refining the margin condition, the overall rate of convergence is still governed by the $\epsilon$-entropy if $\rho$ is very large. Therefore, the major objective of our research is to find a suitable family of sets with controllable $\epsilon$-entropy as well as a practically feasible algorithm. This is not easy, because there are limited families of sets having known $\epsilon$-entropy, including convex sets and sets with smooth boundaries. Both smoothness and convexity impose strong assumptions on the local properties of the boundary $\partial\Omega_1(f)$. In contrast, we propose to consider the family of SVR constrained sets $\mathscr{F}_\gamma$ as a new family of decision sets, which directly imposes a simple and intuitive assumption on the shape of decision sets and additionally has a tractable $\epsilon$-entropy.

Scott and Nowak (2006) also noticed that the family of sets with smooth boundaries is not very realistic for practical problems. They proposed a family of decision sets, named the "box-counting class", which satisfies the conclusion of our Lemma 5. However, they did not further characterize the behavior of sets in their class, and hence they must compute the tree for all smallest dyadic splits and then merge all these trees bottom-up, which can be regarded as an exhaustive search at the lowest split level. This leads to a computational complexity of $O(dnL^d\log(nL^d))$, with $L$ the number of dyadic splits. The complexity becomes intractable when $d$ is large, with the authors noting that $d\geq 10$ will pose a problem. Scott and Nowak (2006) also proposed an alternative to Tsybakov's margin condition (Condition 2) and obtained a faster convergence rate in some specific situations, such as when $\Omega$ lies in a low dimensional manifold. However, their alternative condition directly bounds the excess error between a possible dyadic tree and the oracle classifier, which is much stronger than the standard margin condition we have used.

The proof of Theorem 2 is in the Appendix. The key idea is to construct an $\epsilon$-net on $\mathscr{F}_\gamma$ under the symmetric set difference with respect to the true measure and to show that this $\epsilon$-net is also a $2\epsilon$-net under the symmetric set difference with respect to the empirical measure with high probability. This nontrivial property is the reason why we can derive the rate of convergence for the empirical risk minimizer over the whole family $\mathscr{F}_\gamma$, rather than just for the minimizer over an $\epsilon$-net of this family, as adopted by some previous classification literature (Mammen and Tsybakov, 1995; Tsybakov, 2004).

4 Numerical Studies

We compare SVR trees with popular imbalanced classification methods on real datasets, adding redundant features to these datasets to evaluate feature selection. A confusion matrix (Table 1) is often used to assess classification performance. A common criterion is accuracy, $\frac{\text{TP}+\text{TN}}{\text{TP}+\text{TN}+\text{FP}+\text{FN}}$. When minority samples are relatively rare, it is often important to give higher priority to true positives, which is accomplished using the true positive rate (TPR, also known as recall), $\frac{\text{TP}}{\text{TP}+\text{FN}}$, and precision, $\frac{\text{TP}}{\text{TP}+\text{FP}}$. To combine these, the F-measure is often used: $\frac{2\cdot\text{TPR}\cdot\text{Precision}}{\text{TPR}+\text{Precision}}$. The geometric mean of the true positive rate and the true negative rate, $G_{\text{mean}}=\sqrt{\frac{\text{TP}}{\text{TP}+\text{FN}}\cdot\frac{\text{TN}}{\text{TN}+\text{FP}}}$, is also used to evaluate the overall performance on both classes.

Table 1: Confusion matrix for two-class classification.
                      True label 1           True label 0
Predicted label 1     True Positive (TP)     False Positive (FP)
Predicted label 0     False Negative (FN)    True Negative (TN)
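All of the evaluation measures are simple functions of the four counts in Table 1; the helper below computes them, with precision set to zero when undefined, as in Section 4.3.

```python
# Evaluation measures from confusion-matrix counts.
import math

def imbalance_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    tpr = tp / (tp + fn)                                  # recall
    tnr = tn / (tn + fp)
    f_measure = 2 * tpr * precision / (tpr + precision) if tpr + precision > 0 else 0.0
    g_mean = math.sqrt(tpr * tnr)
    return accuracy, precision, tpr, f_measure, g_mean

print(imbalance_metrics(tp=40, fp=20, fn=10, tn=130))
# (0.85, 0.667, 0.8, 0.727, 0.833) up to rounding
```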

4.1 Datasets

We test our method on 12 datasets from the UCI Machine Learning Repository (Dua and Graff, 2017) and KEEL-dataset repository (Alcalá-Fdez et al., 2011), varying in size, number of features and level of imbalance. For datasets with three or more classes, classes are combined to form binary class datasets. Table 2 gives an overview of the datasets; detailed descriptions and preprocessing of these datasets are available in Section 5.1 of the supplementary material.

Table 2: Overview of data sets.
Data set | Total samples | Minority samples | Minority proportion | Features
Pima 768 268 34.9% 8
Titanic 2201 711 32.3% 3
Phoneme 5404 1586 29.3% 5
Vehicle 846 199 19.0% 18
Ecoli 336 52 15.5% 6
Segment 2308 329 14.3% 18
Wine 1599 217 11.9% 11
Page 5472 559 10.2% 10
Satimage 6435 626 8.9% 36
Glass 214 17 8.0% 9
Abalone 731 42 5.4% 7
Yeast 1484 51 3.4% 8

4.2 Experimental Setup

We test the performance of SVR-Tree, SVR-Tree with feature selection, CART (Breiman et al., 1984) with duplicated oversampling, CART with SMOTE (Chawla et al., 2002), CART with Borderline-SMOTE (Han et al., 2005), CART with ADASYN (He et al., 2008), and the Hellinger Distance Decision Tree (Cieslak et al., 2012) on all 12 datasets. Features are linearly transformed so that samples lie in $[0,1]^d$.

We use $3\times 5$ nested stratified cross validation to evaluate the out-of-sample performance of all methods. The nested cross validation has two layers: the inner layer is used to choose tuning parameters and the outer layer is used to evaluate out-of-sample performance. For each method and each dataset, we run the nested cross validation procedure 20 times. In each run, we randomly partition the whole dataset into three stratified folds (the proportions of samples of each class are roughly the same in all folds) and run three times; each time, one of the three folds is the testing set and the other two folds form the training set. On the training set, we use 5-fold stratified cross validation to choose the tuning parameter. That is, we further divide the training set into five folds and run five times; each time we use four folds as (inner) training sets and the remaining fold for validation. We define the cross-validation TPR and TNR in the same manner as the cross-validation accuracy and use them to compute the cross-validation F-measure. The tuning parameter with the highest cross-validation F-measure is selected. We then train the model with the selected tuning parameter on the whole training set and evaluate its performance on the test set. The cross-validation accuracy, precision, TPR, F-measure and G-mean are recorded. The mean and standard error of these statistics over the 20 nested cross validation runs are reported in Table 3.
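For concreteness, one run of the nested cross validation can be organized as below, assuming scikit-learn; `fit_predict(param, X_train, y_train, X_eval)` is an illustrative placeholder for any of the compared tree methods and is not part of the paper.

```python
# One of the 20 nested CV runs: 3-fold outer loop for testing, 5-fold inner loop for tuning,
# with the tuning parameter chosen by the cross-validation F-measure.
import numpy as np
from sklearn.model_selection import StratifiedKFold

def one_nested_cv_run(X, y, param_grid, fit_predict, seed=0):
    outer = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    test_accuracy = []
    for tr, te in outer.split(X, y):
        inner = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        best_param, best_f = None, -1.0
        for param in param_grid:
            tprs, precs = [], []
            for itr, ival in inner.split(X[tr], y[tr]):
                pred = fit_predict(param, X[tr][itr], y[tr][itr], X[tr][ival])
                truth = y[tr][ival]
                tp = np.sum((pred == 1) & (truth == 1))
                tprs.append(tp / max(np.sum(truth == 1), 1))
                precs.append(tp / max(np.sum(pred == 1), 1))
            tpr, prec = np.mean(tprs), np.mean(precs)
            f = 2 * tpr * prec / (tpr + prec) if tpr + prec > 0 else 0.0
            if f > best_f:
                best_f, best_param = f, param
        pred = fit_predict(best_param, X[tr], y[tr], X[te])  # refit on the whole training set
        test_accuracy.append(np.mean(pred == y[te]))         # other measures computed analogously
    return test_accuracy
```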

SVR-Tree with and without feature selection are described in Algorithm 1. The weight for the minority class, $\alpha$, is set to be the largest integer such that the total weight of the minority class is no greater than the total weight of the majority class. The maximal number of leaf nodes $\bar{a}_n$ is $2\sqrt{n}$. The penalty parameter $\lambda_n$ for SVR is chosen from a geometric sequence in $[2^0,2^{10}]\times 10^{-3}\times n^{-1/3}$ and the value with the highest F-measure across 20 runs is selected. For SVR-Tree with feature selection, the constant $c_0$ in equation (2) is fixed at $c_0=4$. In practice, the results are insensitive to $c_0$.

For the other methods except the Hellinger distance decision tree, we first oversample the minority class so that the number of minority samples is multiplied by the same weight $\alpha$ used for SVR-Tree. We then build the CART tree on the oversampled dataset and prune it. The pruning parameter of CART is selected to maximize the F-measure. By the algorithm proposed in Breiman et al. (1984), the range from which to choose the pruning parameter becomes available after the tree is built and does not need to be specified ahead of time.

For duplicated oversampling, we duplicate each minority class sample $\alpha-1$ times, where $\alpha$ is the weight used for SVR-Tree. For SMOTE, the number of nearest neighbors is set to $k=5$. For Borderline-SMOTE, we adopt Borderline-SMOTE1 of Han et al. (2005), with the number of nearest neighbors $k=5$. For both SMOTE and Borderline-SMOTE, if $\alpha-1\geq k$, some nearest neighbors may be used multiple times to generate synthetic samples, so that the total weight of minority class samples under these oversampling methods is the same as for SVR-Tree. For ADASYN, denoting the number of majority class samples by $n_0$ and the number of minority class samples by $n_1$, the parameter $\beta$ is set to $\beta=\alpha n_1/n_0$.
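The oversampling baselines can be reproduced approximately with the imbalanced-learn package; the sketch below swaps in that library's SMOTE, Borderline-SMOTE and ADASYN implementations rather than the exact reuse-of-neighbors scheme described above, so it should be read as an approximation.

```python
# CART on an oversampled training set, with the minority count multiplied by alpha.
import numpy as np
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN
from sklearn.tree import DecisionTreeClassifier

def oversampled_cart(X, y, alpha, method="smote", random_state=0):
    target = {1: int(alpha * np.sum(y == 1))}   # desired minority count after resampling
    samplers = {
        "smote": SMOTE(sampling_strategy=target, k_neighbors=5, random_state=random_state),
        "bsmote": BorderlineSMOTE(sampling_strategy=target, k_neighbors=5,
                                  random_state=random_state),  # borderline-1 variant by default
        "adasyn": ADASYN(sampling_strategy=target, random_state=random_state),
    }
    X_res, y_res = samplers[method].fit_resample(X, y)
    return DecisionTreeClassifier(random_state=random_state).fit(X_res, y_res)
```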

For the Hellinger distance decision tree, we directly call its Java implementation available in the Weka software (https://www.cs.waikato.ac.nz/ml/weka/). Although it is claimed that the Hellinger distance tree is insensitive to imbalance, the current implementation occasionally classifies all samples into a single class. We set the default leaf prediction strategy to "Naive Bayes" since it works well on most datasets. However, if the algorithm classifies all samples into a single class, we set the leaf prediction strategy to "Majority Voting". The third leaf prediction strategy, "Naive Bayes Adaptive", always produces the same results as one of the other two strategies in our experiments.

Data set Method Accuracy Precision TPR F-measure G-mean
Phoneme SVR 0.8350(0.0060) 0.6749(0.0113) 0.8458(0.0154) 0.7506(0.0076) 0.8380(0.0062)
SVR-Select 0.8377(0.0040) 0.6807(0.0064) 0.8420(0.0098) 0.7528(0.0060) 0.8389(0.0049)
Duplicate 0.8554(0.0038) 0.7567(0.0110) 0.7482(0.0154) 0.7522(0.0068) 0.8205(0.0065)
SMOTE 0.8598(0.0054) 0.7637(0.0119) 0.7567(0.0121) 0.7601(0.0089) 0.8264(0.0068)
BSMOTE 0.8568(0.0060) 0.7580(0.0143) 0.7527(0.0127) 0.7552(0.0093) 0.8230(0.0070)
ADASYN 0.8607(0.0050) 0.7661(0.0116) 0.7565(0.0115) 0.7612(0.0082) 0.8269(0.0063)
HDDT 0.7065(0.0000) 0.0000(0.0000) 0.0000(0.0000) 0.0000(0.0000) 0.0000(0.0000)
Segment SVR 0.9922(0.0017) 0.9696(0.0089) 0.9763(0.0061) 0.9729(0.0060) 0.9855(0.0033)
SVR-Select 0.9932(0.0013) 0.9753(0.0080) 0.9769(0.0053) 0.9761(0.0047) 0.9863(0.0027)
Duplicate 0.9913(0.0012) 0.9707(0.0069) 0.9685(0.0073) 0.9696(0.0044) 0.9817(0.0036)
SMOTE 0.9916(0.0010) 0.9722(0.0061) 0.9688(0.0081) 0.9705(0.0035) 0.9820(0.0038)
BSMOTE 0.9915(0.0011) 0.9726(0.0082) 0.9676(0.0059) 0.9701(0.0037) 0.9814(0.0027)
ADASYN 0.9919(0.0011) 0.9713(0.0063) 0.9716(0.0064) 0.9714(0.0040) 0.9833(0.0032)
HDDT 0.8330(0.0021) 0.4599(0.0031) 0.9825(0.0023) 0.6266(0.0031) 0.8910(0.0017)
Page SVR 0.9656(0.0015) 0.8248(0.0098) 0.8429(0.0117) 0.8337(0.0076) 0.9087(0.0062)
SVR-Select 0.9647(0.0016) 0.8252(0.0124) 0.8311(0.0140) 0.8280(0.0077) 0.9024(0.0073)
Duplicate 0.9686(0.0013) 0.8465(0.0081) 0.8463(0.0135) 0.8463(0.0072) 0.9119(0.0071)
SMOTE 0.9683(0.0021) 0.8468(0.0133) 0.8426(0.0144) 0.8446(0.0101) 0.9099(0.0078)
BSMOTE 0.9682(0.0015) 0.8444(0.0103) 0.8448(0.0127) 0.8445(0.0077) 0.9109 (0.0067)
ADASYN 0.9677(0.0021) 0.8436(0.0131) 0.8403(0.0148) 0.8418(0.0103) 0.9084(0.0080)
HDDT 0.9024(0.0030) 0.5479(0.0338) 0.2720(0.0260) 0.3622(0.0215) 0.5141(0.0233)
Yeast SVR 0.9368(0.0100) 0.2641(0.0410) 0.4510(0.0888) 0.3290(0.0416) 0.6524(0.0619)
SVR-Select 0.9287(0.0122) 0.2360(0.0397) 0.4559(0.0715) 0.3077(0.0391) 0.6544(0.0496)
Duplicate 0.9641(0.0037) 0.3907(0.2084) 0.1196(0.0784) 0.1727(0.0981) 0.3077(0.1544)
SMOTE 0.9594(0.0045) 0.3620(0.0859) 0.2098(0.0830) 0.2554(0.0752) 0.4458(0.0887)
BSMOTE 0.9602(0.0043) 0.3664(0.1012) 0.2020(0.0884) 0.2499(0.0937) 0.4333(0.1065)
ADASYN 0.9605(0.0045) 0.3709(0.1246) 0.2078(0.1099) 0.2491(0.0986) 0.4288(0.1436)
HDDT 0.7604(0.0200) 0.1049(0.0101) 0.7853(0.0287) 0.1849(0.0162) 0.7722(0.0211)
Average Ranking SVR 4.92 5.25 2.25 2.08 2.17
SVR-Select 5.58 5.00 2.25 2.75 2.50
Duplicate 2.79 3.29 5.13 4.88 4.96
SMOTE 2.54 2.29 5.04 4.21 4.71
BSMOTE 3.33 2.92 5.58 5.00 5.42
ADASYN 2.75 2.92 4.58 4.00 4.42
HDDT 6.08 6.33 3.17 5.08 3.83
Table 3: Results of numerical studies, where the ranking is averaged over all 12 datasets.

4.3 Results

We compute the mean and standard error of accuracy, precision, TPR, F-measure and G-mean across the 20 nested cross validations. Due to limited space, results of 4 representative datasets are shown in Table 3 with the remaining results in Section 5.2 of the supplementary material. If a method classifies all samples into the majority class, precision is not well-defined and we simply set it to be zero. In the column “Method”, SVR = SVR-Tree, SVR-Select = SVR-Tree with feature selection, Duplicate = CART with duplicated oversampling, SMOTE = CART with SMOTE, BSMOTE = CART with Borderline-SMOTE, ADASYN = CART with ADASYN and HDDT = Hellinger Distance Decision Tree. For each dataset and evaluation measure, the method with the highest mean value ranks first and is highlighted in bold. As suggested by Brazdil and Soares (2000), the average rankings of all 12 datasets are displayed in Table 3 to evaluate the overall performance of each method across all datasets. If two methods are tied, the ranking is set to the average.

The results in Table 3 show that oversampling based methods generally perform well in accuracy and precision, with SMOTE the best of these methods. However, SVR and SVR-select outperform the oversampling methods in the rankings of TPR, F-measure and G-mean on many datasets and on average. The behavior of HDDT is unstable across datasets, being competitive in TPR and G-mean for some datasets but sometimes assigning all samples to the majority class. Since all methods (except duplicated sampling) are built on ideal theoretical assumptions, their practical performance is typically data-dependent. For example, SMOTE is outperformed by SVR-select in all measures for the Segment dataset, while the opposite happens for the Page dataset.

5 Discussion

A major challenge in analyzing imbalanced data is the small sample size of the minority class, which leads to overfitting. It is natural to consider using regularization to address this problem. Regularization of classification trees is an old idea; for example, Breiman et al. (1984) proposed penalizing the number of leaf nodes in the tree. Other classification trees like C4.5 (Quinlan, 2014) and Ripper (Cohen, 1995) also prune overgrown trees. However, the number of leaf nodes may not be a good measure of the complexity of a classification tree. Recently, Hahn et al. (2020) added a Bayesian prior to an ensemble of trees, which functions as indirect regularization. Following the idea of directly regularizing the shape of the decision set and the complexity of the decision boundary, we instead penalize the surface-to-volume ratio. To our knowledge, this is a new idea in the statistical literature.

SVR-Tree can be generalized straightforwardly to the multi-class case and to balanced datasets. For multiple classes, we can apply SVR to one or more minority classes and take the sum of these ratios as the regularization. For balanced datasets, we can either compute the SVR of all classes, or simply regularize the surface of the decision boundary. The principle behind these generalizations is to regularize the complexity of the decision boundaries and the shapes of the decision sets.


SUPPLEMENTARY MATERIAL

Proofs, a Detailed Algorithm and Additional Results: Supplementary Material for “Classification Trees for Imbalanced Data: Surface-to-Volume Regularization”.
Codes: https://github.com/YichenZhuDuke/Classification-Tree-with-Surface-to-Volume-ratio-Regularization.git

References

  • Alcalá-Fdez, J., A. Fernández, J. Luengo, J. Derrac, S. García, L. Sánchez, and F. Herrera (2011). KEEL data-mining software tool: data set repository, integration of algorithms and experimental analysis framework. Journal of Multiple-Valued Logic & Soft Computing 17, 255–287.
  • Audibert, J.-Y. and A. B. Tsybakov (2007). Fast learning rates for plug-in classifiers. The Annals of Statistics 35(2), 608–633.
  • Brazdil, P. B. and C. Soares (2000). A comparison of ranking methods for classification algorithm selection. In European Conference on Machine Learning, pp. 63–75. Springer.
  • Breiman, L., J. Friedman, R. Olshen, and C. Stone (1984). Classification and Regression Trees. Wadsworth.
  • Bunkhumpornpat, C., K. Sinapiromsaran, and C. Lursinsap (2009). Safe-level-SMOTE: Safe-level-synthetic minority over-sampling technique for handling the class imbalanced problem. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 475–482. Springer.
  • Chawla, N. V., K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research 16, 321–357.
  • Cieslak, D. A., T. R. Hoens, N. V. Chawla, and W. P. Kegelmeyer (2012). Hellinger distance decision trees are robust and skew-insensitive. Data Mining and Knowledge Discovery 24(1), 136–158.
  • Cohen, W. W. (1995). Fast effective rule induction. In Machine Learning Proceedings, pp. 115–123. Elsevier.
  • Cormen, T. H., C. E. Leiserson, R. L. Rivest, and C. Stein (2009). Introduction to Algorithms. MIT Press.
  • Dua, D. and C. Graff (2017). UCI Machine Learning Repository.
  • Fernández, A., S. del Río, N. V. Chawla, and F. Herrera (2017). An insight into imbalanced big data classification: outcomes and challenges. Complex & Intelligent Systems 3(2), 105–120.
  • Györfi, L., M. Kohler, A. Krzyzak, and H. Walk (2006). A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media.
  • Hahn, P. R., J. S. Murray, and C. M. Carvalho (2020). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. Bayesian Analysis 15(3), 965–1056.
  • Haixiang, G., L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing (2017). Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications 73, 220–239.
  • Han, H., W.-Y. Wang, and B.-H. Mao (2005). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pp. 878–887. Springer.
  • He, H., Y. Bai, E. A. Garcia, and S. Li (2008). ADASYN: Adaptive synthetic sampling approach for imbalanced learning. In IEEE International Joint Conference on Neural Networks, pp. 1322–1328. IEEE.
  • He, H. and E. A. Garcia (2008). Learning from imbalanced data. IEEE Transactions on Knowledge & Data Engineering (9), 1263–1284.
  • Hu, X., C. Rudin, and M. Seltzer (2019). Optimal sparse decision trees. In Advances in Neural Information Processing Systems (NeurIPS), Volume 32.
  • Krawczyk, B. (2016). Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence 5(4), 221–232.
  • Lin, J., C. Zhong, D. Hu, C. Rudin, and M. Seltzer (2020). Generalized and scalable optimal sparse decision trees. In International Conference on Machine Learning, pp. 6150–6160. PMLR.
  • Mammen, E. and A. B. Tsybakov (1995). Asymptotical minimax recovery of sets with smooth boundaries. The Annals of Statistics 23(2), 502–524.
  • Nobel, A. (1996). Histogram regression estimation using data-dependent partitions. The Annals of Statistics 24(3), 1084–1105.
  • Quinlan, J. R. (2014). C4.5: Programs for Machine Learning. Elsevier.
  • Scornet, E., G. Biau, and J.-P. Vert (2015). Consistency of random forests. The Annals of Statistics 43(4), 1716–1741.
  • Scott, C. and R. Nowak (2006). Minimax-optimal classification with dyadic decision trees. IEEE Transactions on Information Theory 52(4), 1135–1153.
  • Stefanowski, J. and S. Wilk (2008). Selective pre-processing of imbalanced data for improving classification performance. In International Conference on Data Warehousing and Knowledge Discovery, pp. 283–292. Springer.
  • Tsybakov, A. B. (2004). Optimal aggregation of classifiers in statistical learning. The Annals of Statistics 32(1), 135–166.
  • Wang, Q., X. Zhou, C. Wang, Z. Liu, J. Huang, Y. Zhou, C. Li, H. Zhuang, and J.-Z. Cheng (2019). WGAN-based synthetic minority over-sampling technique: Improving semantic fine-grained classification for lung nodules in CT images. IEEE Access 7, 18450–18463.

Appendix A Proofs

We prove Theorems 1 and 2 here. Proofs of Corollary 1, of the $c_{1}$ bounds in Examples 1–2, and of the lemmas introduced in this appendix are given in the supplementary material. Without loss of generality, we assume $\Omega=[0,1]^{d}$ in this section.

A.1 Proof of Theorem 1

The proof builds on Nobel (1996), Györfi et al. (2006), Tsybakov (2004) and Scornet et al. (2015). We first establish a sufficient condition for consistency, showing that a classification tree whose signed impurity converges to an oracle bound is consistent. We then break the difference between the signed impurity of $\hat{T}_{n}$ and the oracle bound into two parts. The first is the estimation error, which goes to zero if the number of leaves $a_{n}$ increases more slowly than $n$; the second is the approximation error, which goes to zero if $\mathbb{E}(Y|X\in A)$ approaches a constant within each leaf node $A$ and the penalty $\lambda_{n}$ goes to zero.

Recall that the conditional expectation is denoted $\mathbb{E}(Y|X)=\eta(X)$, and define the oracle lower bound for the signed impurity as $I^{*}=\int_{\Omega}2\eta(x)(1-\eta(x))\,d\mathbb{P}(x)$. The following lemma shows that if the signed impurity of a classification tree converges to $I^{*}$ as $n\to\infty$, the classifier associated with the tree is consistent.
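Before stating the lemma, we record a remark for intuition (not needed for the proofs): the integrand $2\eta(x)(1-\eta(x))$ is the Gini impurity of the conditional label distribution at $x$, and since $\max\{\eta(x),1-\eta(x)\}\in[1/2,1]$,

$\min\{\eta(x),1-\eta(x)\}\leq 2\eta(x)\big(1-\eta(x)\big)\leq 2\min\{\eta(x),1-\eta(x)\}.$

In particular, $I^{*}=0$ exactly when $\eta(x)\in\{0,1\}$ for $\mathbb{P}$-almost every $x$, that is, when the two classes are perfectly separable.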

Lemma 1.

Let $T_{n}$ be a sequence of classification trees and let $f_{n}:\Omega\to\{0,1\}$ be the classifier associated with $T_{n}$. Then $T_{n}$ is consistent if $\tilde{I}(T_{n},\mathbb{P})\to I^{*}$ in probability.

We then decompose the difference between the signed impurity of $\hat{T}_{n}$ and the oracle bound into estimation error and approximation error.

Lemma 2.

Let $\hat{T}_{n}$ be a classification tree trained from data $\mathscr{D}_{n}$, and let $A_{1},A_{2},\ldots,A_{m}$ be the leaf nodes of $\hat{T}_{n}$. Define the set of classifiers $\mathscr{T}_{n}$ as:

$\mathscr{T}_{n}=\{T:\ T\text{'s associated classifier }f:\Omega\to\{0,1\}\text{ is constant on all }A_{j},\;1\leq j\leq m\}.$

We have

$\tilde{I}(\hat{T}_{n},\mathbb{P})-I^{*}\leq 2\sup_{T\in\mathscr{T}_{n}}|\tilde{I}(T,\mathbb{P})-\tilde{I}(T,\mathbb{P}_{n})|+\inf_{T\in\mathscr{T}_{n}}|\tilde{I}(T,\mathbb{P})+\lambda_{n}r(T)-I^{*}|.$ (5)

The first term on the right-hand side of equation (5) is the “estimation error”, which measures the difference between functions evaluated under the empirical and true distributions. The second term is the “approximation error”, which measures the ability of $\mathscr{T}_{n}$ to approximate the oracle prediction function. The next two lemmas show that both terms go to zero in probability.

Lemma 3.

If $\frac{\bar{a}_{n}d\log n}{n}=o(1)$, then $\sup_{T\in\mathscr{T}_{n}}|\tilde{I}(T,\mathbb{P})-\tilde{I}(T,\mathbb{P}_{n})|\to 0$ in probability.

Lemma 4.

As $n\to\infty$, if $\lambda_{n}\to 0$ and $\bar{a}_{n}\to\infty$, then $\inf_{T\in\mathscr{T}_{n}}|\tilde{I}(T,\mathbb{P})+\lambda_{n}r(T)-I^{*}|\to 0$ in probability.

Proof of Theorem 1.

Combining Lemmas 1, 2, 3 and 4 completes the proof.

A.2 Proof of Theorem 2

The proof consists of two parts. In the first part, we construct a finite class of classifiers and prove that it is an $\epsilon$-net for $\mathscr{F}_{\gamma}$ under the symmetric set difference with respect to the true measure and, with high probability, a $2\epsilon$-net with respect to the empirical measure. In the second part, we prove that, with high probability, any element of this $\epsilon$-net that is far from the true classifier incurs a large excess error. Combining the two parts and selecting a proper value of $\epsilon$ concludes the proof of Theorem 2.

We first construct the $\epsilon$-net. We divide the space $\Omega=[0,1]^{d}$ into $M^{d}$ hypercubes, each with volume $1/M^{d}$, as follows:

$\mathscr{H}=\left\{\Big[\frac{i_{1}}{M},\frac{i_{1}+1}{M}\Big]\times\Big[\frac{i_{2}}{M},\frac{i_{2}+1}{M}\Big]\times\cdots\times\Big[\frac{i_{d}}{M},\frac{i_{d}+1}{M}\Big]:i_{1},i_{2},\ldots,i_{d}\in\{0,1,\ldots,M-1\}\right\}.$

The SVR constraint defining the family $\mathscr{F}_{\gamma}$ directly implies that the surface area of the decision set is no greater than $\gamma$. If the shape of the decision set is regular, one expects the boundary of the decision set to intersect only on the order of $M^{d-1}$ of the $M^{d}$ hypercubes in $\mathscr{H}$. The following lemma formalizes this intuition.

Lemma 5.

There exists a constant $c^{\prime}\geq 1$ depending only on $d$, such that for $1/M\geq 2/(3\gamma)$ and any $f\in\mathscr{F}_{\gamma}$,

$\big|\{A\in\mathscr{H}:A\cap\partial\Omega_{1}(f)\neq\emptyset\}\big|\leq c^{\prime}\gamma M^{d-1},$

where $|\cdot|$ denotes cardinality.

The general proof idea is to examine the local region around each hypercube in $\mathscr{H}$. If $\partial\Omega_{1}(f)$ intersects a hypercube, then the SVR constraint and a $(d-1)$-dimensional isoperimetric inequality imply that the surface area passing through the local region of that hypercube is bounded below. Since the total surface area of $\Omega_{1}(f)$ is bounded above (by the SVR constraint together with the fact that the volume is at most 1), the number of hypercubes intersecting $\partial\Omega_{1}(f)$ is also bounded above.
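The scaling in Lemma 5 can be illustrated numerically in a simple special case. The sketch below is purely illustrative and not part of the proof: it takes $d=2$ and a disc-shaped decision set (the shape, center, radius, and grid sizes are our choices), counts the cells of $\mathscr{H}$ whose closure meets the boundary circle, and shows that the count grows like $M^{d-1}=M$, so the ratio of the count to $M$ stabilizes as $M$ grows.

import math

def cells_hit_by_circle(M, cx=0.5, cy=0.5, r=0.3):
    """Count grid cells [i/M,(i+1)/M] x [j/M,(j+1)/M] whose closure meets the
    circle of radius r centered at (cx, cy)."""
    hits = 0
    for i in range(M):
        for j in range(M):
            x0, x1 = i / M, (i + 1) / M
            y0, y1 = j / M, (j + 1) / M
            # Smallest and largest distance from the center to a point of the cell.
            d_min = math.hypot(max(x0 - cx, cx - x1, 0.0), max(y0 - cy, cy - y1, 0.0))
            d_max = math.hypot(max(abs(x0 - cx), abs(x1 - cx)), max(abs(y0 - cy), abs(y1 - cy)))
            # The closed cell meets the boundary circle iff r lies between these distances.
            if d_min <= r <= d_max:
                hits += 1
    return hits

if __name__ == "__main__":
    for M in (10, 20, 40, 80):
        hits = cells_hit_by_circle(M)
        print(f"M = {M:3d}: boundary cells = {hits:5d}, cells / M = {hits / M:.2f}")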

Let $f$ be a classifier such that $\Omega_{1}(f)$ is a finite union of hypercubes of $\mathscr{H}$. For $A\in\mathscr{H}$ with $A\subset\Omega_{1}(f)$, we say $A$ is a border hypercube of $\Omega_{1}(f)$ if $\partial A\cap\partial\Omega_{1}(f)\neq\emptyset$. Let $\epsilon=c^{\prime}\gamma/M$ and define the set $\mathscr{N}_{\epsilon}$ as

$\mathscr{N}_{\epsilon}=\big\{f:\ \Omega_{1}(f)\text{ is a finite union of elements of }\mathscr{H}\text{, and the number of border hypercubes of }\Omega_{1}(f)\text{ is no greater than }M^{d}\epsilon\big\}.$

We have the following lemma regarding the $\epsilon$-entropy of $\mathscr{N}_{\epsilon}$.

Lemma 6.

For $\epsilon<1/4$, we have

$\log|\mathscr{N}_{\epsilon}|\leq\log\sum_{k=1}^{c^{\prime}\gamma M^{d-1}}{M^{d}\choose k}\leq c^{\prime}\gamma\left(\frac{c^{\prime}\gamma}{\epsilon}\right)^{d-1}\left[\log\left(\frac{c^{\prime}\gamma}{\epsilon}\right)+1-\log(c^{\prime}\gamma)\right].$
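For orientation, one standard route to the second inequality (a sketch, not necessarily the argument in the supplementary material) uses the elementary bound $\sum_{k=0}^{K}{N\choose k}\leq(eN/K)^{K}$ for $1\leq K\leq N$. Taking $K=c^{\prime}\gamma M^{d-1}$ (up to rounding), $N=M^{d}$ and $M=c^{\prime}\gamma/\epsilon$,

$\log\sum_{k=1}^{K}{N\choose k}\leq K\log\frac{eN}{K}=c^{\prime}\gamma M^{d-1}\big[1+\log M-\log(c^{\prime}\gamma)\big]=c^{\prime}\gamma\left(\frac{c^{\prime}\gamma}{\epsilon}\right)^{d-1}\left[\log\left(\frac{c^{\prime}\gamma}{\epsilon}\right)+1-\log(c^{\prime}\gamma)\right].$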

The following lemma shows that $\mathscr{N}_{\epsilon}$ is not only an $\epsilon$-net for $\mathscr{F}_{\gamma}$ under the true measure $\mathbb{P}$, but also, with high probability, simultaneously a $2\epsilon$-net under the empirical measure $\mathbb{P}_{n}$.

Lemma 7.

We have the following two facts regarding $\mathscr{N}_{\epsilon}$:

$\sup_{f\in\mathscr{F}_{\gamma}}\inf_{f_{0}\in\mathscr{N}_{\epsilon}}\mathbb{P}(\Omega_{1}(f)\bigtriangleup\Omega_{1}(f_{0}))\leq\epsilon,$
$\mathbb{P}\left(\sup_{f\in\mathscr{F}_{\gamma}}\inf_{f_{0}\in\mathscr{N}_{\epsilon}}\max\big\{\mathbb{P}_{n}(\Omega_{1}(f)\bigtriangleup\Omega_{1}(f_{0})),2\mathbb{P}(\Omega_{1}(f)\bigtriangleup\Omega_{1}(f_{0}))\big\}>2\epsilon\right)\leq\sum_{k=1}^{c^{\prime}\gamma M^{d-1}}{M^{d}\choose k}\exp\{-2(\log 2)n\epsilon\}.$

We now move to the second part. By Lemma 7, there exists $f_{1}\in\mathscr{N}_{\epsilon}$ such that $R(f_{1})-R^{*}\leq\mathbb{P}(\Omega_{1}(f_{1})\bigtriangleup\Omega_{1}(f^{*}))\leq\epsilon$. The next lemma shows that, with high probability, $f_{1}$ has smaller empirical risk than every element of $\mathscr{N}_{\epsilon}$ that is sufficiently far away from $f^{*}$.

Lemma 8.

Suppose Condition 2 holds. Then we have

$\mathbb{P}\left(\inf_{T\in\mathscr{N}_{\epsilon},\,R(T)-R(f^{*})\geq 7\epsilon}R_{n}(T)-R_{n}(f_{1})>4\epsilon\right)\geq 1-|\mathscr{N}_{\epsilon}|\exp\left(-\frac{21}{16}nc_{1}^{1/\kappa}\epsilon^{2-1/\kappa}\right).$

The key idea of the proof is to write the empirical loss $R_{n}(\cdot)$ as a sum of i.i.d. random variables and to bound the deviation of this sum from its expectation by Bernstein's inequality.
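For reference, the generic form of Bernstein's inequality behind this step is the following; the precise application to $R_{n}$, including the choice of the variables $Z_{i}$ and the role of Condition 2, is spelled out in the supplementary material. If $Z_{1},\ldots,Z_{n}$ are i.i.d. with $|Z_{i}-\mathbb{E}Z_{i}|\leq b$ almost surely and $\mathrm{Var}(Z_{i})\leq\sigma^{2}$, then for any $t>0$,

$\mathbb{P}\left(\frac{1}{n}\sum_{i=1}^{n}(Z_{i}-\mathbb{E}Z_{i})\geq t\right)\leq\exp\left(-\frac{nt^{2}}{2\sigma^{2}+2bt/3}\right).$

In arguments of this type, $Z_{i}$ is typically the difference of the losses of $T$ and $f_{1}$ at $(X_{i},Y_{i})$, which is nonzero only on $\Omega_{1}(T)\bigtriangleup\Omega_{1}(f_{1})$, so its variance is controlled by the probability of this symmetric difference.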

We are now ready to prove Theorem 2.

Proof of Theorem 2.

Define the events $E_{1},E_{2}$ as:

$E_{1}=\left\{\sup_{f\in\mathscr{F}_{\gamma}}\inf_{f_{0}\in\mathscr{N}_{\epsilon}}\max\big\{\mathbb{P}_{n}(\Omega_{1}(f)\bigtriangleup\Omega_{1}(f_{0})),2\mathbb{P}(\Omega_{1}(f)\bigtriangleup\Omega_{1}(f_{0}))\big\}\leq 2\epsilon\right\},$
$E_{2}=\left\{\inf_{T\in\mathscr{N}_{\epsilon},\,R(T)-R(f^{*})\geq 7\epsilon}R_{n}(T)-R_{n}(f_{1})>4\epsilon\right\}.$

We claim that on the event $E_{1}\cap E_{2}$, $R(\hat{T}_{n,\gamma})-R(f^{*})\leq 8\epsilon$. We prove this by contradiction. Suppose that $R(\hat{T}_{n,\gamma})-R(f^{*})>8\epsilon$. Then by the definition of event $E_{1}$, there exists a classifier $f_{2}\in\mathscr{N}_{\epsilon}$ such that

$|R(\hat{T}_{n,\gamma})-R(f_{2})|\leq\mathbb{P}(\Omega_{1}(\hat{T}_{n,\gamma})\bigtriangleup\Omega_{1}(f_{2}))\leq\epsilon,$ (6)
$|R_{n}(\hat{T}_{n,\gamma})-R_{n}(f_{2})|\leq\mathbb{P}_{n}(\Omega_{1}(\hat{T}_{n,\gamma})\bigtriangleup\Omega_{1}(f_{2}))\leq 2\epsilon.$ (7)

Combining equation (6) and the assumption that $R(\hat{T}_{n,\gamma})-R(f^{*})>8\epsilon$, we have $R(f_{2})-R(f^{*})>7\epsilon$. It then follows from the definition of event $E_{2}$ that

$R_{n}(f_{2})-R_{n}(f_{1})>4\epsilon.$ (8)

By the definition of event $E_{1}$, there exists $T^{\prime}_{1}\in\mathscr{F}_{\gamma}$ such that

$|R_{n}(T^{\prime}_{1})-R_{n}(f_{1})|\leq\mathbb{P}_{n}(\Omega_{1}(T^{\prime}_{1})\bigtriangleup\Omega_{1}(f_{1}))\leq 2\epsilon.$ (9)

Combining equations (7), (8) and (9), we have

$R_{n}(\hat{T}_{n,\gamma})-R_{n}(T_{1}^{\prime})=\big[R_{n}(\hat{T}_{n,\gamma})-R_{n}(f_{2})\big]+\big[R_{n}(f_{2})-R_{n}(f_{1})\big]+\big[R_{n}(f_{1})-R_{n}(T_{1}^{\prime})\big]>-2\epsilon+4\epsilon-2\epsilon=0,$

which contradicts the fact that $\hat{T}_{n,\gamma}$ is the empirical risk minimizer. Therefore $R(\hat{T}_{n,\gamma})-R(f^{*})\leq 8\epsilon$ on the event $E_{1}\cap E_{2}$. For all $f^{*}\in\mathscr{F}_{\gamma}$, the expected loss can be bounded by

$\mathbb{E}\left[R(\hat{T}_{n,\gamma})-R(f^{*})\right]\leq\mathbb{P}(E_{1}^{c}\cup E_{2}^{c})+8\epsilon$
$\leq\exp\left[c^{\prime}\gamma\left(\frac{c^{\prime}\gamma}{\epsilon}\right)^{d-1}\log\left(\frac{c^{\prime}\gamma}{\epsilon}\right)\right]\left[\exp(-2(\log 2)n\epsilon)+\exp\left(-\frac{21}{16}nc_{1}^{1/\kappa}\epsilon^{2-1/\kappa}\right)\right]+8\epsilon$
$\leq 2\exp\left[c^{\prime}\gamma\left(\frac{c^{\prime}\gamma}{\epsilon}\right)^{d-1}\log\left(\frac{c^{\prime}\gamma}{\epsilon}\right)\right]\exp\left(-\tilde{c}n\epsilon^{2-1/\kappa}\right)+8\epsilon,$

where the second inequality uses Lemmas 5, 6, 7 and the fact that $1-\log(c^{\prime}\gamma)\leq 0$ for $c^{\prime}\geq 1,\;\gamma\geq 2d\geq 4$; the last inequality uses the definition $\tilde{c}=\min\left\{2\log 2,\frac{21c_{1}^{1/\kappa}}{16}\right\}$. We now define a constant $N_{0}$ as

$N_{0}=\inf\Bigg\{n\in\mathbb{N}:\ \frac{(d-1/\kappa)\log n+\log\log n}{d+1-1/\kappa}\geq\log(c^{\prime}\gamma)+\log c_{4},$ (10)
$\frac{\tilde{c}}{2}n^{\frac{d-1}{d+1-1/\kappa}}(\log n)^{\frac{2-1/\kappa}{d+1-1/\kappa}}c_{5}^{2-1/\kappa}\geq-\log(4c_{4})+\frac{\log n-\log\log n}{d+1-1/\kappa}\Bigg\}.$

In both inequalities of (10), the left-hand side grows faster in $n$ than the right-hand side, so $N_{0}$ is well defined. Let $\epsilon=c_{4}(\log n/n)^{\frac{1}{d+1-1/\kappa}}$, where $c_{4}=[2(c^{\prime}\gamma)^{d}/\tilde{c}]^{\frac{1}{d+1-1/\kappa}}$. Then for $n\geq N_{0}$,

$\frac{c^{\prime}\gamma\big(\frac{c^{\prime}\gamma}{\epsilon}\big)^{d-1}\log\big(\frac{c^{\prime}\gamma}{\epsilon}\big)}{n\tilde{c}\epsilon^{2-1/\kappa}}=\frac{(c^{\prime}\gamma)^{d}[\log(c^{\prime}\gamma)+\log(1/\epsilon)]}{\tilde{c}n\epsilon^{d+1-1/\kappa}}$
$=\frac{(c^{\prime}\gamma)^{d}\big[\log(c^{\prime}\gamma)+\log(1/c_{4})+\frac{1}{d+1-1/\kappa}\log(\frac{n}{\log n})\big]}{\tilde{c}c_{5}^{d+1-1/\kappa}\log n}$
$\leq\frac{\log(c^{\prime}\gamma)+\log(1/c_{4})+\frac{1}{d+1-1/\kappa}\log(\frac{n}{\log n})}{2\log n}\leq\frac{1}{2},$

where the first inequality uses the definition of $c_{4}$ and the last inequality uses the definition of $N_{0}$. Therefore we have

$\mathbb{E}\left[R(\hat{T}_{n,\gamma})-R(f^{*})\right]\leq 2\exp\left(-\frac{\tilde{c}}{2}n\epsilon^{2-1/\kappa}\right)+8c_{4}\left(\frac{\log n}{n}\right)^{\frac{1}{d+1-1/\kappa}}.$

By the definition of $\epsilon$, $n\epsilon^{2-1/\kappa}$ grows polynomially in $n$, so the second term in the display above dominates. Specifically, by the definition of $N_{0}$, when $n\geq N_{0}$ we have $2\exp\left(-\frac{\tilde{c}}{2}n\epsilon^{2-1/\kappa}\right)\leq 8c_{4}\left(\frac{\log n}{n}\right)^{\frac{1}{d+1-1/\kappa}}$, and hence, since $16c_{4}=16(2/\tilde{c})^{\frac{1}{d+1-1/\kappa}}(c^{\prime}\gamma)^{\frac{d}{d+1-1/\kappa}}$,

$\mathbb{E}\left[R(\hat{T}_{n,\gamma})-R(f^{*})\right]\leq 16(2/\tilde{c})^{\frac{1}{d+1-1/\kappa}}(c^{\prime}\gamma)^{\frac{d}{d+1-1/\kappa}}\left(\frac{\log n}{n}\right)^{\frac{1}{d+1-1/\kappa}}.$