
Interpretable Rule Discovery Through Bilevel Optimization of Split-Rules of Nonlinear Decision Trees for Classification Problems

Yashesh Dhebar and Kalyanmoy Deb. Yashesh Dhebar is with the Department of Mechanical Engineering at Michigan State University, East Lansing, MI 48824, USA, e-mail: [email protected]. Kalyanmoy Deb is with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA, e-mail: [email protected]. COIN Lab website: https://coin-lab.org
Abstract

For supervised classification problems involving design, control, and other practical purposes, users are not only interested in finding a highly accurate classifier, but they also demand that the obtained classifier be easily interpretable. While the definition of interpretability of a classifier can vary from case to case, here we restrict a humanly interpretable classifier to one that can be expressed in simple mathematical terms. As a novel approach, we represent a classifier as an assembly of simple mathematical rules using a nonlinear decision tree (NLDT). Each conditional (non-terminal) node of the tree represents a nonlinear mathematical rule (split-rule) involving features in order to partition the dataset in the given conditional node into two non-overlapping subsets. This partitioning is intended to minimize the impurity of the resulting child nodes. By restricting the structure of the split-rule at each conditional node and the depth of the decision tree, the interpretability of the classifier is assured. The nonlinear split-rule at a given conditional node is obtained using an evolutionary bilevel optimization algorithm, in which the upper level focuses on arriving at an interpretable structure of the split-rule, while the lower level finds the most appropriate weights (coefficients) of individual constituents of the rule so as to minimize the net impurity of the two resulting child nodes. The performance of the proposed algorithm is demonstrated on a number of controlled test problems, existing benchmark problems, and industrial problems. Results on two to 500-feature problems are encouraging and open up further scopes of applying the proposed approach to more challenging and complex classification tasks.

Index Terms:
Classification, Decision trees, Machine learning, Bi-level optimization, Evolutionary algorithms.

I Introduction

In a classification task, a set of labelled data is usually provided, in which each datapoint is described by a set of features and belongs to a specific class. The task is to design an algorithm which works on the given dataset and arrives at one or more classification rules (expressed as functions of problem features) which are able to make predictions with maximum classification accuracy. A classifier can be expressed in many different ways suitable for the purpose of the application. In a quest to make the classifier interpretable, classifier representation through a decision-list or a decision tree (DT) has been a popular choice. In decision trees, the classification rule-set is represented in an inverted tree-based structure where non-terminal conditional nodes comprise conditional statements, mostly involving a single feature, and the terminal leaf-nodes have a class label associated with them. A datapoint traverses through the DT by following the rules at conditional nodes and lands at a particular leaf-node, marking its class. In the artificial neural network (ANN) approach, an implicit relationship among features is captured through a multi-layered network, while in the support-vector-machine (SVM) approach, a complex mathematical relationship of associated support vectors is derived. One common aspect of these methods is that they use an optimization method to arrive at a suitable DT, ANN or SVM based on maximizing the classification accuracy.

Due to their importance in many practical problems involving design, control, identification, and other machine learning related tasks, researchers have spent considerable effort in developing efficient optimization-based classification algorithms [1, 2, 3]. While most algorithms are developed for classifying two-class datasets, the developed methods can be extended to multi-class classification problems, either by treating them as hierarchical two-class classification problems or by extending them to constitute a simultaneous multi-class classification algorithm. In this paper, we do not extend our approach to multi-class classification problems, except for showing a proof-of-principle study.

In most classification tasks, maximizing classification accuracy (or minimizing classification error) on the labelled dataset is usually used as the sole objective. However, besides classification accuracy, in many applications users are also interested in finding an easily interpretable classifier for various practical reasons: (i) it helps to identify the most important rules which are responsible for the classification task, and (ii) it provides a more direct relationship among features, giving better insight into the underlying classification task for knowledge enhancement and future development. The definition of an easily interpretable classifier largely depends on the context, but the existing literature represents them as a linear, polynomial, or posynomial function involving only a few features.

In this paper, we propose a number of novel ideas towards evolving a highly accurate and easily interpretable classifier. First, instead of a DT, we propose a nonlinear decision tree (NLDT) as a classifier, in which every non-terminal conditional node represents a nonlinear function of features to express a split-rule. Each split-rule splits the data into two non-overlapping subsets. Successive hierarchical splits of these subsets are carried out and the tree is allowed to grow until one of the termination criteria is met. We argue that the flexibility of allowing a nonlinear split-rule at conditional nodes (instead of a single-variable based rule, as found in traditional ID3-based DTs [4]) will result in a more compact DT (having fewer nodes). Second, to derive the split-rule at a given conditional node, a dedicated bilevel optimization algorithm is applied, so that learning the structure of the rule and its associated coefficients are treated as two hierarchical and inter-related optimization tasks. Third, our proposed methodology is customized for classification problem solving, so that our overall bilevel optimization algorithm is computationally efficient. Fourth, we emphasize evolving simplistic rule structures through our customized bilevel optimization method so that the obtained rules are also interpretable.

In the remainder of the paper, we provide a brief survey of the existing approaches for inducing DTs in Section II. A detailed description of the problem of inducing interpretable DTs from an optimization perspective and a high-level view of the proposed approach is provided in Section III. Next, we provide an in-depth discussion of the proposed bilevel optimization algorithm which is adopted to derive interpretable split-rules at each conditional node of the NLDT in Section IV. Section V provides a brief overview of the post-processing method which is used to prune the tree and simplify its topology. A compilation of results on standard classification and engineering problems is given in Section VI. The paper ends with concluding remarks and some highlights on future work in Section VII.

II Past Studies

There exist many studies involving machine learning and data analytics methods for discovering rules from data. Here, we provide a brief mention of some studies which are close to our proposed study.

Traditional induction of DTs is done in an axis-parallel fashion [4], wherein each split-rule is of type $x_i \leq \tau_i$ or $x_i \geq \tau_i$. Algorithms such as ID3, CART and C4.5 are among the popular ones in this category. Work done in the past to generate non-axis-parallel trees can be found in [5, 6, 7], where the researchers rely on randomized algorithms to search for multivariate hyperplanes. Work done in [8, 9] uses evolutionary algorithms to induce oblique DTs. The idea of OC1 [7] is extended in [10] to generate nonlinear quadratic split-rules in a DT. Bennett et al. [11] use SVMs to generate linear/nonlinear split-rules. However, these works do not address certain key practical considerations, such as the complexity of the combined split-rules and the handling of biased data.

Next, we discuss some studies which take the aspect of rule complexity into consideration. In [12], ellipsoidal and interval based rules are determined using the set of support-vectors and prototype/vertex points. The authors there primarily focus on coming up with a compact set of if-then-else rules which are comprehensible. Despite its intuitiveness, the approach proposed in [12] does not result in a comprehensible set of rules on high-dimensional datasets. Another approach suggested in [13] uses a decompositional technique to derive rules from the output of a linear-SVM classifier. The rules here are expressed as hypercubes. The method is computationally fast, but its scope does not extend to nonlinear decision boundaries and its performance is limited by that of a regular linear-SVM. Moreover, this approach has a tendency to generate more rules if the data is distributed parallel to the decision boundary. The study conducted in [14] uses a trained neural-network as an oracle to develop a DT of at-least m-of-n type rule-sets (similar to the one described in ID2-of-3 [15]). The strength of this approach lies in its potential to scale up. However, its accuracy on unseen data usually falls by about 3% from the corresponding oracle neural-network counterpart. Authors in [16] use a pedagogical technique to evolve comprehensible rules using genetic programming. The proposed algorithm G-REX considers fidelity (i.e., how closely the evolved AI agent can mimic the behaviour of the oracle NN) as the primary objective, and the compactness of the rule expression is penalized using a parameter to evolve an interpretable set of fuzzy rules for a classification problem. The approach is good enough to produce a comprehensible rule-set, but it needs tweaking and fine-tuning of the penalty parameter. A nice summary of the above-mentioned methods is provided in [17]. Ishibuchi et al. [18] implemented a three-objective strategy to evolve a fuzzy set of rules using a multi-objective genetic algorithm. The objectives considered in that research were the classification accuracy, the number of rules in the evolved rule-set, and the total number of antecedent conditions. In [19], an ANN is used as the final classifier and a multi-objective optimization approach is employed to find a simple and accurate classifier, where simplicity is expressed by the number of nodes. Hand calculations are then used to express the ANN analytically with a mathematical function. This procedure, however, becomes intractable when the evolved ANN has a large number of nodes.

Genetic programming (GP) based methods have been found efficient in deriving relevant features from the set of original features, which are then used to generate the classifier or interpretable rule sets [20, 21, 22, 23, 24, 25]. In some studies, the entire classifier is encoded with a genetic representation and the genome is then evolved using GP. Some works conducted in this regard also simultaneously consider the complexity of the classifier [26, 27, 28, 29], but those have been limited to axis-parallel or oblique splits. Applications of GP to induce nonlinear multivariate decision trees can be found in [30, 31]. Our approach of inducing a nonlinear decision tree is conceptually similar to the idea discussed in [30], where the DT is induced in a top-down way and, at each internal conditional node, the nonlinear split-rule is derived using a GP. There, the fitness of a GP solution is expressed as a weighted sum of misclassification rates and a generalizability term. However, the interpretability aspect of the final split-rule does not get captured anywhere in the fitness assignment, and the authors do not report the complexity of the finally obtained classifiers. Authors in [32] use a three-phase evolutionary algorithm to create a composite rule set of axis-parallel rules for each class. Their algorithm is able to generate interpretable rules but struggles to achieve respectable classification accuracy on datasets which are not separable by axis-parallel boundaries. In [33], GP is used to evolve DTs with linear split-rules; however, the work does not focus on improving the interpretability of the evolved DT classifier. Eggermont et al. [34] induce the DT using GP by allowing multiple splits through their clustering and refined atomic (internal node) representations and conduct a global evolutionary search to estimate the values of attribute thresholds. A multi-layered fitness approach is adopted, with first priority given to the overall classification accuracy followed by the complexity of the DT. Here too, splits occur based on box conditions and hence it becomes difficult to address classification datasets involving nonlinear (or non axis-parallel) decision boundaries. In [35], a decision list is generated using a GP by incorporating concept-mapping and covering techniques. Similar to [32], a dedicated rule-set for each class is generated, and this method too is limited to datasets which can be partitioned through boxed boundaries. An interesting application of GP to derive interpretable classification rules is reported in [36]. Here, a dedicated rule of type IF-ANTECEDENT-THEN-DECISION is generated for each class in the dataset using a GP and a procedure is suggested to handle intermediate cases. The suggested algorithm shows promising results for classification accuracy, but the study does not report statistics regarding the complexity of the evolved rule-set.

In our proposed approach, we attempt to evolve nonlinear DTs which are robust to different biases in the dataset, and we simultaneously target evolving nonlinear yet simple polynomial split-rules at each conditional node of the NLDT with a dedicated bilevel genetic algorithm (shown pictorially in Figure 1). In contrast to the method discussed in [30], where the fitness of a GP individual does not capture the notion of complexity of the rule expression, the bilevel optimization proposed in our work deals with the aspects of interpretability and performance of the split-rule in a logically hierarchical manner (conceptually illustrated in Figure 2). The results indicate that the proposed bilevel algorithm for evolving nonlinear split-rules eventually generates simple and comprehensible classifiers with high or comparable accuracy on all the problems used in this study.

III Proposed Approach

III-A Classifier Representation Using Nonlinear Decision Tree

An interpretable classifier for our proposed method is represented in the form of a nonlinear decision tree (NLDT), as depicted in Figure 1.

Figure 1: An illustration of a nonlinear decision tree (NLDT) classifier for a two-class problem. For a given conditional node $i$, the split-rule function $f_i(\mathbf{x})$ is evolved using a dedicated bilevel optimization procedure.

The decision tree (DT) comprises conditional (non-terminal) nodes and terminal leaf-nodes. Each conditional node of the DT has a rule ($f_i(\mathbf{x})\leq 0$) associated with it, where $\mathbf{x}$ represents the feature vector. To make the DT more interpretable, two aspects are considered:

  1. Simplicity of the split-rule $f_i(\mathbf{x})$ at each conditional node (see Figure 2), and

  2. Simplicity of the topology of the overall DT, which is computed by counting the total number of conditional nodes in the DT.

Under an ideal scenario, the simplest split-rule will involve just one attribute (or feature) and can be expressed as $f_i(\mathbf{x}): x_k - \tau \leq 0$ for the $i$-th rule. Here, the split occurs solely based on the $k$-th feature ($x_k$). In most complex problems, such a simple split-rule involving a single feature may not lead to a small-depth DT, thereby making the overall classifier un-interpretable. DTs induced using algorithms such as ID3 and C4.5 fall under this category. In the present work, we refer to these trees as classification and regression trees (CART) [4].

On the other extreme, a topologically simplest tree will correspond to a DT involving only one conditional node, but the associated rule may be too complex to interpret. SVM based classifiers fall in this category, wherein the decision boundary is expressed in the form of a complicated mathematical equation. Another way to represent a classifier is through an ANN, which, when expressed analytically, results in a complicated nonlinear function $f(\mathbf{x})$ without any easy interpretability.

In this work, we propose a novel compromise between the above two extreme cases, so that the resulting DT is not too deep and the associated split-rule function $f_i(\mathbf{x})$ at each conditional node assumes a nonlinear form with controlled complexity and is easily interpretable. The development of a nonlinear DT is similar in principle to the search of a linear DT in CART, but this task requires the use of an efficient nonlinear optimization method for evolving nonlinear rules $f_i(\mathbf{x})$, for which we exploit recent advances in bilevel optimization. The proposed bilevel approach is discussed in Section IV and a high-level perspective of the optimization process is provided in Figure 1. During the training phase, the NLDT is induced using a recursive algorithm as shown in Algorithm 1. A brief description of the subroutines used in Algorithm 1 is provided below, and the relevant pseudo-codes for the ObtainSplitRule subroutine and the evaluator for the upper-level GA are provided in Algorithms 2 and 3, respectively. For brevity, details of the termination criteria are provided in the supplementary document.

  • ObtainSplitRule(Data)

    • Input: $(N\times(D+1))$-matrix representing the dataset, where the last column indicates the class-label.

    • Output: Nonlinear split-rule $f(\mathbf{x})\leq 0$.

    • Method: Bilevel optimization is used to determine $f(\mathbf{x})$. Details regarding this are discussed in Section IV.

  • SplitNode(Data, SplitRule)

    • Input: Data, split-rule $f(\mathbf{x})\leq 0$.

    • Output: LeftNode and RightNode, which are node-data-structures, where LeftNode.Data contains the datapoints in the input Data satisfying $f(\mathbf{x})\leq 0$, while RightNode.Data contains the datapoints in the input Data violating the split-rule.

Input: Dataset
Function UpdatedNode = InduceNLDT(Node, depth):
       Node.depth = depth;
       if TerminationSatisfied(Node) then
             Node.Type = ‘leaf’;
            
      else
             Node.Type = ‘conditional’;
             Node.SplitRule = ObtainSplitRule(Node.Data);
            
            [LeftNode, RightNode] = SplitNode(Node.Data, Node.SplitRule);
            
            Node.LeftNode = InduceNLDT(LeftNode, depth + 1);
             Node.RightNode = InduceNLDT(RightNode, depth + 1);
            
       end if
      UpdatedNode = Node;
      
// NLDT Induction Algorithm
RootNode.Data = Dataset;
Tree = InduceNLDT(RootNode, 0)
Algorithm 1 Pseudo Code to Recursively Induce NLDT.
Input: Data
Output: SplitRule // Split rule $f(\mathbf{x})$
Function SplitRule = ObtainSplitRule(Data):
       Initialize: $P_U$ // Upper-level population
       // Execute LLGA (Algorithm 3)
       $P_U$ = EvaluateUpperLevelPop($P_U$);
       // Upper-level GA loop
       for gen = 1:MaxGen do
             $P_U^{Parent}$ = Selection($P_U$);
             $P_U^{Child}$ = Crossover($P_U^{Parent}$);
             $P_U$ = Mutation($P_U^{Child}$);
             // Execute LLGA (Algorithm 3)
             $P_U$ = EvaluateUpperLevelPop($P_U$);
             // Elite preservation
             $P_U$ = SelectElite($P_U^{Parent}$, $P_U$);
             if TerminationConditionSatisfied then
                   break;
             end if
       end for
       // Extract best solution in $P_U$
       SplitRule = ObtainBestInd($P_U$);
Algorithm 2 ObtainSplitRule subroutine. Implements the bilevel optimization algorithm to determine a split-rule. The upper level searches the space of block matrix $\mathbf{B}$ and modulus-flag $m$. The constraint violation value for an upper-level individual is obtained by executing the lower-level GA (LLGA), as shown in Algorithm 3.
Input: $P_U$ // Upper-level population
Output: $P_U^{\prime}$ // Evaluated population
Function $P_U^{\prime}$ = EvaluateUpperLevelPop($P_U$):
       for i = 1:PopSize do
             // Execute LLGA (see Section IV-C)
             [$F_L$, $\mathbf{w}$, $\mathbf{\Theta}$] = LLGA(Data, $\mathbf{B}$, $m$);
             [$P_U$[i].$\mathbf{w}$, $P_U$[i].$\mathbf{\Theta}$] = [$\mathbf{w}$, $\mathbf{\Theta}$];
             // Constraint and fitness value
             $P_U$[i].CV = $F_L - \tau_I$;
             $P_U$[i].$F_U$ = MaxActiveTerms($\mathbf{B}$);
       end for
       $P_U^{\prime} = P_U$
Algorithm 3 EvaluateUpperLevelPop subroutine. A dedicated lower-level optimization is executed for each upper-level population member.

III-B Split-Rule Representation

In this paper, we restrict the expression of the split-rule at a conditional node of the decision tree, operating on the feature vector $\mathbf{x}$, to assume the following structure:

$\text{Rule}: \quad f(\mathbf{x},\mathbf{w},\mathbf{\Theta},\mathbf{B})\leq 0$,   (1)

where $f(\mathbf{x},\mathbf{w},\mathbf{\Theta},\mathbf{B})$ can be expressed in two different forms, depending on whether a modulus operator is used or not (indicated by the flag $m$):

$f(\mathbf{x},\mathbf{w},\mathbf{\Theta},\mathbf{B})=\begin{cases}\theta_{1}+w_{1}B_{1}+\ldots+w_{p}B_{p}, & \text{if } m=0,\\ \left|\theta_{1}+w_{1}B_{1}+\ldots+w_{p}B_{p}\right|-\left|\theta_{2}\right|, & \text{if } m=1.\end{cases}$   (2)

Here, the $w_i$'s are coefficients or weights of several power-laws ($B_i$'s), the $\theta_i$'s are biases, $m$ is the modulus-flag which indicates the presence or absence of the modulus operator, $p$ is a user-specified parameter indicating the maximum number of power-laws ($B_i$) which can exist in the expression of $f(\mathbf{x})$, and $B_i$ represents a power-law rule of the type:

$B_{i}=x_{1}^{b_{i1}}\times x_{2}^{b_{i2}}\times\ldots\times x_{d}^{b_{id}}$,   (3)

$\mathbf{B}$ is a block-matrix of exponents $b_{ij}$, given as follows:

$\mathbf{B}=\begin{bmatrix}b_{11}&b_{12}&b_{13}&\dots&b_{1d}\\ b_{21}&b_{22}&b_{23}&\dots&b_{2d}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ b_{p1}&b_{p2}&b_{p3}&\dots&b_{pd}\end{bmatrix}$.   (4)

Exponents $b_{ij}$ are allowed to assume values from a specified discrete set $\mathbf{E}$. In this work, we set $p=3$ and $\mathbf{E}=\{-3,-2,-1,0,1,2,3\}$ to make the rules interpretable; however, these can be changed by the user. The parameters $w_i$ and $\theta_i$ are real-valued variables in $[-1,1]$. The feature vector $\mathbf{x}$ is a data-point in a $d$-dimensional space. Another user-defined parameter $a_{\max}$ controls the maximum number of variables that can be present in each power-law $B_i$. The default is $a_{\max}=d$ (i.e., the dimension of the feature space).
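To make this representation concrete, the following minimal Python sketch (our illustration, not part of the original implementation; the function name evaluate_split_rule is hypothetical) evaluates the split-rule value $f(\mathbf{x})$ of Eq. 2 for one datapoint, assuming features are strictly positive so that negative exponents are well defined:

import numpy as np

def evaluate_split_rule(x, B, w, theta, m):
    """Evaluate f(x) of Eq. 2 for one datapoint.

    x     : (d,) feature vector (assumed positive so negative exponents are valid)
    B     : (p, d) block matrix of exponents b_ij drawn from E = {-3, ..., 3}
    w     : (p,) weights in [-1, 1]
    theta : [theta_1] if m == 0, or [theta_1, theta_2] if m == 1
    m     : modulus flag (0 or 1)
    """
    # Power-law terms B_i = x_1^{b_i1} * x_2^{b_i2} * ... * x_d^{b_id}  (Eq. 3)
    power_laws = np.prod(np.power(x, B), axis=1)       # shape (p,)
    value = theta[0] + np.dot(w, power_laws)
    if m == 1:
        value = abs(value) - abs(theta[1])
    return value                                        # rule is satisfied if value <= 0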

IV Bilevel Approach for Split-Rule Identification

IV-A The Hierarchical Objectives to derive split-rule

Here, we illustrate the need for formulating the problem of split-rule identification as a bilevel problem using Figure 2.

Figure 2: Illustrative sketch representing different split-rules on 2D data. While split-rule $A$ may be overly complex and un-interpretable, split-rules $B$ and $C$ have similarly simple and interpretable structures, but $C$ has better coefficients, making it more accurate than $B$.

The geometry and shape of a split-rule is defined by the exponent terms $b_{ij}$ appearing in its expression (Eqs. 2 and 3), while the orientation and location of the split-rule in the feature space is dictated by the values of the coefficients $w_i$ and biases $\theta_i$ (Eq. 2). Thus, the optimization task of estimating the split-rule $f(\mathbf{x})$ involves two types of variables:

  1. the $\mathbf{B}$-matrix representing exponents of $B$-terms (i.e., $b_{ij}$ as shown in Eq. 4) and the modulus flag $m$ indicating the presence or absence of a modulus operator in the expression of $f(\mathbf{x})$, and

  2. weights $\mathbf{w}$ and biases $\mathbf{\Theta}$ in each rule function $f(\mathbf{x})$.

Identification of a good structure for the $\mathbf{B}$ terms and the value of $m$ is a more difficult task than the weight and bias identification. We argue that if both types of variables ($\mathbf{B}$, $m$, $\mathbf{\Theta}$, $\mathbf{w}$) are concatenated in a single-level solution, a good $(\mathbf{B},m)$ combination may be associated with a not-so-good $(\mathbf{w},\mathbf{\Theta})$ combination (like split-rule $B$ in Fig. 2), thereby making the whole solution vulnerable to deletion during optimization. It may be better to separate the search for a good structure of the $(\mathbf{B},m)$ combination from the weight-bias search at the same level, and search for the best weight-bias combination for every $(\mathbf{B},m)$ pair as a detailed task. This hierarchical structure of the variables motivates us to employ a bilevel optimization approach [37] to handle the two types of variables. The upper-level optimization algorithm searches the space of $\mathbf{B}$ and $m$. Then, for each $(\mathbf{B},m)$ pair, the lower-level optimization algorithm is invoked to determine the optimal values of $\mathbf{w}$ and $\mathbf{\Theta}$. Referring to Fig. 2, the upper level will search for the structure of $f(\mathbf{x})$, which might have a nonlinearity (like rule $A$) or might be linear (like rule $B$). Then, for each upper-level solution (for instance, a solution corresponding to a linear rule structure), the lower level will adjust its weights and biases to determine its optimal location $C$.

The bilevel optimization problem can then be formulated as shown below:

$\begin{array}{rl}\text{Min.} & F_{U}(\mathbf{B},m,\mathbf{w}^{\ast},\mathbf{\Theta}^{\ast}),\\ \text{s.t.} & (\mathbf{w}^{\ast},\mathbf{\Theta}^{\ast})\in{\rm argmin}\left\{F_{L}(\mathbf{w},\mathbf{\Theta})|_{(\mathbf{B},m)}\,\big|\,F_{L}(\mathbf{w},\mathbf{\Theta})|_{(\mathbf{B},m)}\leq\tau_{I},\right.\\ & \quad\left.-1\leq w_{i}\leq 1\ \forall i,\ \mathbf{\Theta}\in[-1,1]^{m+1}\right\},\\ & m\in\{0,1\},\quad b_{ij}\in\{-3,-2,-1,0,1,2,3\},\end{array}$   (5)

where the upper-level objective $F_U$ is a quantification of the simplicity of the split-rule, and the lower-level objective $F_L$ quantifies the quality of the split resulting from the split-rule $f(\mathbf{x})\leq 0$. An upper-level solution is considered feasible only if it is able to partition the data within some acceptable limit, which is set by a parameter $\tau_I$. A more detailed explanation regarding the upper-level objective $F_U$ and the lower-level objective $F_L$ is provided in the next sections. A pseudo-code of the bilevel algorithm to obtain the split-rule is provided in Algorithm 2, and the pseudo-code provided in Algorithm 3 gives an overview of the evaluation of the population pool in the upper-level GA.

IV-A1 Computation of $F_L$

The impurity of a node in a decision tree is defined by the distribution of data present in the node. Entropy and Gini-score are popularly used to quantify the impurity of a data distribution. In this work, we use the Gini-score to gauge the impurity of a node; however, the entropy measure could also be used. For a binary classification task, the Gini-score ranges from 0 to 0.5, while entropy ranges from 0 to 1. Hence, the value of $\tau_I$ (Eq. 5) needs to be changed if we use entropy instead of Gini, while everything else stays the same. For a given parent node $P$ in a decision tree and two child nodes $L$ and $R$ resulting from it, the net impurity of the child nodes ($F_L$) can be computed as follows:

$F_{L}(\mathbf{w},\mathbf{\Theta})|_{(\mathbf{B},m)}=\left(\frac{N_{L}}{N_{P}}\,\text{Gini}(L)+\frac{N_{R}}{N_{P}}\,\text{Gini}(R)\right)_{(\mathbf{w},\mathbf{\Theta},\mathbf{B},m)}$,   (6)

where $N_P$ is the total number of datapoints present in the parent node $P$, and $N_L$ and $N_R$ indicate the total number of points present in the left ($L$) and right ($R$) child nodes, respectively. Datapoints in $P$ which satisfy the split-rule $f_P(\mathbf{x})\leq 0$ (Eq. 1) go to the left child node, while the rest go to the right child node. The objective $F_L$ of minimizing the net impurity of child nodes favors the creation of purer child nodes (i.e., nodes with a low Gini-score).
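For illustration, a possible Python realization of Eq. 6 is given below (our sketch; the names gini and net_impurity are illustrative):

import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels (0 to 0.5 for a two-class node)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    prob = counts / counts.sum()
    return 1.0 - np.sum(prob ** 2)

def net_impurity(X, y, split_rule):
    """Weighted Gini impurity of the two child nodes (Eq. 6).

    split_rule is a callable returning f(x); points with f(x) <= 0 go to the left child.
    """
    f_vals = np.array([split_rule(x) for x in X])
    left, right = y[f_vals <= 0], y[f_vals > 0]
    return (len(left) / len(y)) * gini(left) + (len(right) / len(y)) * gini(right)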

IV-A2 Computation of $F_U$

The objective $F_U$ is subjective in its form, since it targets the subjective notion of generating visually simple equations for the split-rule. Usually, equations with more variables and terms seem visually complicated. Taking this aspect into consideration, $F_U$ is set as the total number of non-zero exponent terms present in the overall equation of the split-rule (Eq. 1). Mathematically, this can be represented with the following equation:

$F_{U}(\mathbf{B},m,\mathbf{w}^{\ast},\mathbf{\Theta}^{\ast})=\sum_{i=1}^{p}\sum_{j=1}^{d}g(b_{ij})$,   (7)

where $g(\alpha)=1$ if $\alpha\neq 0$, and zero otherwise. Here, we use only $\mathbf{B}$ to define $F_U$, but another, more comprehensive simplicity term involving the presence or absence of modulus operators and the relative optimal weight/bias values of the rule can also be used.
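In code form, $F_U$ of Eq. 7 is simply the count of non-zero entries of the block matrix (again an illustrative sketch):

import numpy as np

def upper_objective(B):
    """F_U of Eq. 7: total number of non-zero exponents in the block matrix B."""
    return int(np.count_nonzero(B))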

IV-B Upper Level Optimization (ULGA)

A genetic algorithm (GA) is implemented to explore the search space of power rules $B_i$ and the modulus flag $m$ at the upper level. The genome is represented as a tuple $(\mathbf{B},m)$, wherein $\mathbf{B}$ is a matrix as shown in Equation 4 and $m$ assumes a Boolean value of 0 or 1.

The upper-level GA focuses on estimating a simple equation of the split-rule within a desired value of the net impurity ($F_L$) of child nodes. Thus, the optimization problem for the upper level is formulated as a single-objective constrained optimization problem as shown in Eq. 5. The constraint function $F_L(\mathbf{w},\mathbf{\Theta})|_{(\mathbf{B},m)}$ is evaluated at the lower level of our bilevel optimization framework. The threshold value $\tau_I$ indicates the desired value of the net impurity of the resultant child nodes (Eq. 6). In our experiments, we set $\tau_I$ to 0.05. As mentioned before, a solution satisfying this constraint implies the creation of purer child nodes. Minimization of the objective function $F_U$ should result in a simplistic structure of the rule, and the optimization will also reveal the key variables needed in the overall expression.

IV-B1 Custom Initialization for Upper Level GA

Minimization of the objective $F_U$ (given in Eq. 7) requires having fewer active (or non-zero) exponents in the expression of the split-rule (Eq. 1). To facilitate this, the population is initialized with the restriction of having only one active (or non-zero) exponent in the expression of the split-rule, i.e., any one of the $b_{ij}$'s in the block matrix $\mathbf{B}$ is set to a non-zero value from the user-specified set $\mathbf{E}$ and the rest of the elements of matrix $\mathbf{B}$ are set to zero. Note here that only $2d$ unique individuals ($d$ individuals with $m=0$ and $d$ individuals with $m=1$) can exist which satisfy the above-mentioned restriction. If the population size for the upper-level GA exceeds $2d$, then the remaining individuals are initialized with two non-zero active terms. As the upper-level GA progresses, incremental enhancement in rule complexity is realized through crossover and mutation operations, which are described next.
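A minimal sketch of this initialization strategy is given below (our illustration; the exact placement of the two active terms for members beyond the first $2d$ is an assumption on our part):

import numpy as np

def initialize_upper_population(pop_size, d, p=3, E=(-3, -2, -1, 0, 1, 2, 3)):
    """Seed the upper-level population with maximally simple rule structures."""
    nonzero_E = [e for e in E if e != 0]
    population = []
    for i in range(pop_size):
        B = np.zeros((p, d), dtype=int)
        if i < 2 * d:
            # First 2d members: exactly one active exponent
            B[0, i % d] = np.random.choice(nonzero_E)
            m = 0 if i < d else 1
        else:
            # Remaining members: two active exponents (placement assumed here)
            j1, j2 = np.random.choice(d, size=2, replace=False)
            B[0, j1] = np.random.choice(nonzero_E)
            B[0, j2] = np.random.choice(nonzero_E)
            m = int(np.random.randint(2))
        population.append({"B": B, "m": m})
    return population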

IV-B2 Ranking of Upper Level Solutions

Binary-tournament selection and a $(\mu+\lambda)$ survival selection strategy are implemented for the upper-level GA. The selection operators use the following hierarchical ranking criteria to perform selection:

Definition IV.1.

For two individuals $i$ and $j$ in the upper level, rank($i$) is better than rank($j$) when any of the following is true (a sketch of this comparator is given after the list):

  • $i$ and $j$ are both infeasible AND $F_L(i) < F_L(j)$,

  • $i$ is feasible (i.e., $F_L(i) \leq \tau_I$) AND $j$ is infeasible (i.e., $F_L(j) > \tau_I$),

  • $i$ and $j$ are both feasible AND $F_U(i) < F_U(j)$,

  • $i$ and $j$ are both feasible AND $F_U(i) = F_U(j)$ AND $F_L(i) < F_L(j)$,

  • $i$ and $j$ are both feasible AND $F_U(i) = F_U(j)$ AND $F_L(i) = F_L(j)$ AND $m(i) < m(j)$.
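The comparator below is a direct Python transcription of Definition IV.1 (our sketch; each individual is assumed to be a dict carrying its $F_L$, $F_U$ and $m$ values):

def is_better(ind_i, ind_j, tau_I=0.05):
    """Return True if individual i ranks better than j (Definition IV.1)."""
    feas_i = ind_i["FL"] <= tau_I
    feas_j = ind_j["FL"] <= tau_I
    if not feas_i and not feas_j:          # both infeasible: lower net impurity wins
        return ind_i["FL"] < ind_j["FL"]
    if feas_i and not feas_j:              # feasible beats infeasible
        return True
    if feas_i and feas_j:
        if ind_i["FU"] != ind_j["FU"]:     # simpler rule (fewer active terms) wins
            return ind_i["FU"] < ind_j["FU"]
        if ind_i["FL"] != ind_j["FL"]:     # then purer split wins
            return ind_i["FL"] < ind_j["FL"]
        return ind_i["m"] < ind_j["m"]     # finally, prefer a rule without the modulus
    return False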

IV-B3 Custom Crossover Operator for Upper Level GA

Before the crossover operation, population members are clustered according to their $m$-value: all individuals with $m=0$ belong to one cluster and all individuals with $m=1$ belong to another cluster. The crossover operation is then restricted to individuals belonging to the same cluster. The crossover operation in the upper level takes two block matrices from the parent population ($\mathbf{B_{P_1}}$ and $\mathbf{B_{P_2}}$) to create two child block matrices ($\mathbf{B_{C_1}}$ and $\mathbf{B_{C_2}}$). First, the rows of the block matrices of the participating parents are rearranged in descending order of the magnitude of their corresponding coefficient values (i.e., the weights $w_i$ of Eq. 2). Let the parent block matrices be represented as $\mathbf{B_{P_1}^{\prime}}$ and $\mathbf{B_{P_2}^{\prime}}$ after this rearrangement. The crossover operation is then conducted element-wise on each row of $\mathbf{B_{P_1}^{\prime}}$ and $\mathbf{B_{P_2}^{\prime}}$. For a better understanding of the crossover operation, a pseudo-code is provided in Algorithm 4.

input : Block matrices $\mathbf{B_{P_1}}$ and $\mathbf{B_{P_2}}$, and weight vectors $\mathbf{w_{P_1}}$ and $\mathbf{w_{P_2}}$.
output : Child block matrices $\mathbf{B_{C_1}}$ and $\mathbf{B_{C_2}}$.
$\mathbf{B_{P_1}^{\prime}}$ = SortRows($\mathbf{B_{P_1}}$, $\mathbf{w_{P_1}}$);
$\mathbf{B_{P_2}^{\prime}}$ = SortRows($\mathbf{B_{P_2}}$, $\mathbf{w_{P_2}}$);
$n_{rows}$ = Size($\mathbf{B_{P_1}}$, 1);
$n_{cols}$ = Size($\mathbf{B_{P_1}}$, 2);
for $i \leftarrow 1$ to $n_{rows}$ do
       for $j \leftarrow 1$ to $n_{cols}$ do
             $r$ = rand(); // random number between 0 and 1
             if $r \leq 0.5$ then
                   $\mathbf{B_{C_1}}(i,j) = \mathbf{B_{P_1}^{\prime}}(i,j)$;
                   $\mathbf{B_{C_2}}(i,j) = \mathbf{B_{P_2}^{\prime}}(i,j)$;
             else
                   $\mathbf{B_{C_1}}(i,j) = \mathbf{B_{P_2}^{\prime}}(i,j)$;
                   $\mathbf{B_{C_2}}(i,j) = \mathbf{B_{P_1}^{\prime}}(i,j)$;
             end if
       end for
end for
Algorithm 4 Crossover operation in upper level GA.

IV-B4 Custom Mutation Operator for Upper Level GA

For a given upper-level solution with block matrix $\mathbf{B}$ and modulus flag $m$, the probability with which the $b_{ij}$'s and $m$ get mutated is controlled by the parameter $p_{mut}^{U}$. From experiments, a value of $p_{mut}^{U}=1/d$ is found to work well. The mutation operation then changes the value of the exponents $b_{ij}$ and the modulus flag $m$. Let the domain of $b_{ij}$ be given by $\mathbf{E}$, which is a sorted set of finite values. Since $\mathbf{E}$ is sorted, these ordinal values can be accessed using an integer id $k$, with the id-value $k=1$ representing the smallest value and $k=n_e$ representing the largest value in set $\mathbf{E}$. In our case, $\mathbf{E}=[-3,-2,-1,0,1,2,3]$ (making $n_e=7$). The mechanism with which the mutation operator mutates the value of $b_{ij}$ for any arbitrary sorted array $\mathbf{E}$ is illustrated in Fig. 3.

Figure 3: Mutation operation for upper level GA.

Here, the red tile indicates the id-value ($k$) of $b_{ij}$ in the array $\mathbf{E}$. The orange shaded vertical bars indicate the probability distribution of mutated values. The $b_{ij}$ can be mutated to either of the $k-2$, $k-1$, $k+1$, or $k+2$ id-values with probabilities $\alpha$, $\beta\alpha$, $\beta\alpha$, and $\alpha$, respectively. The parameter $\beta$ is preferred to be greater than one and is supplied by the user. The parameter $\alpha$ can then be computed as $\alpha=\frac{1}{2(1+\beta)}$ by equating the sum of the probabilities to 1. In our experiments, we have set $\beta=3$. The value of the modulus flag $m$ is mutated randomly to either 0 or 1 with 50% probability.

In order to bias the search towards simpler rules (i.e., split-rules with a small number of non-zero $b_{ij}$'s), we introduce a parameter $p_{zero}$. The value of $p_{zero}$ indicates the probability with which a variable $b_{ij}$ participating in mutation is set to zero. In our case, we use $p_{zero}=0.75$, making $b_{ij}\rightarrow 0$ with a net probability of $p_{mut}^{U}\times p_{zero}$.
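The following sketch mutates the id-value $k$ of a single exponent according to the above distribution (our illustration; 0-based indexing is used and the clipping at the boundaries of $\mathbf{E}$ is an assumption, since Fig. 3 does not specify it):

import numpy as np

def mutate_exponent_id(k, n_e, beta=3.0, p_zero=0.75, zero_id=3):
    """Mutate the index k of an exponent b_ij within the sorted set E.

    With probability p_zero the exponent is reset to zero (zero_id is the
    position of 0 in E).  Otherwise k moves to k-2, k-1, k+1 or k+2 with
    probabilities alpha, beta*alpha, beta*alpha, alpha, where
    alpha = 1 / (2 * (1 + beta)).
    """
    if np.random.rand() < p_zero:
        return zero_id
    alpha = 1.0 / (2.0 * (1.0 + beta))
    offsets = np.array([-2, -1, 1, 2])
    probs = np.array([alpha, beta * alpha, beta * alpha, alpha])
    new_k = k + np.random.choice(offsets, p=probs)
    return int(np.clip(new_k, 0, n_e - 1))   # clip to valid ids (assumed behaviour)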

Duplicates in the child population are then detected and changed so that the entire child population comprises unique individuals. This is done to ensure the creation of novel rules and the preservation of diversity. Details about this procedure can be found in the supplementary document.

A $(\mu+\lambda)$ survival selection operation is then applied on the combined child and parent population. The selected elites then go to the next generation as parents and the process repeats. Parameter settings for the upper-level GA are provided in the supplementary document.

IV-C Lower Level Optimization (LLGA)

For a given population member of the upper-level GA (with block matrix $\mathbf{B}$ and modulus flag $m$ as variables), the lower-level optimization determines the coefficients $w_i$ (Eq. 2) and biases $\theta_i$ such that $F_L(\mathbf{w},\mathbf{\Theta})|_{(\mathbf{B},m)}$ (Eq. 6) is minimized. The lower-level optimization problem can be stated as follows:

Minimize: $F_L(\mathbf{w},\mathbf{\Theta})|_{(\mathbf{B},m)}$,
subject to: $\mathbf{w}\in[-1,1]^{p}$, $\mathbf{\Theta}\in[-1,1]^{m+1}$.

IV-C1 Custom Initialization for Lower Level GA

In a bilevel optimization, the lower-level problem must be solved for every upper-level variable vector, thereby requiring a computationally fast algorithm for the lower-level problem. Instead of creating every population member randomly, we use the mixed dipole concept [38, 9, 39] to facilitate faster convergence towards optimal (or near-optimal) values of $\mathbf{w}$ and $\mathbf{\Theta}$. Let $\mathbf{x_A}$ and $\mathbf{x_B}$ be data-points belonging to Class-1 and Class-2, respectively. Then, the weights $\mathbf{w_h}$ and bias $\theta_h$ corresponding to the hyperplane which separates $\mathbf{x_A}$ and $\mathbf{x_B}$, and is orthogonal to the vector $\mathbf{x_A}-\mathbf{x_B}$, can be given by the following equations:

$\mathbf{w_h} = \mathbf{x_A} - \mathbf{x_B}$,   (9)
$\theta_h = \Delta\times(\mathbf{w_h}\cdot\mathbf{x_B}) + (1-\Delta)\times(\mathbf{w_h}\cdot\mathbf{x_A})$,   (10)

where $\Delta$ is a random number between 0 and 1. This is pictorially represented in Fig. 4.

Figure 4: Mixed dipole ($\mathbf{x_A}$, $\mathbf{x_B}$) and a hyperplane $H(\mathbf{w_h},\theta_h)$.

The dipole pairs ($\mathbf{x_A}$, $\mathbf{x_B}$) for all individuals in the initial population pool are chosen randomly from the provided training dataset. The variables $\mathbf{w}$ and $\mathbf{\Theta}$ for an individual are then initialized as mentioned below:

$\mathbf{w} = \mathbf{w_h}$,   (11)
$\theta_1 = \theta_h$,   (12)
$\theta_2 = \min(\Delta, 1-\Delta)$ if $m=1$.   (13)

This method of initialization is found to be more efficient than randomly initializing $\mathbf{w}$ and $\mathbf{\Theta}$. This is because, for the regions beyond the convex hull bounding the training dataset, the landscape of $F_L$ is flat, which makes any optimization algorithm stagnate.
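A sketch of this dipole-based initialization is given below (our illustration). Since $\mathbf{w}$ weights the $p$ power-law terms, the sketch assumes that the dipole points are taken in the transformed space of $B_i$ values of the two datapoints; this dimensional interpretation is our assumption:

import numpy as np

def dipole_init(B_class1, B_class2, m):
    """Initialize (w, Theta) from a randomly chosen mixed dipole (Eqs. 9-13).

    B_class1, B_class2 : arrays of transformed datapoints (rows of B_i values)
                         belonging to Class-1 and Class-2, respectively.
    """
    b_A = B_class1[np.random.randint(len(B_class1))]
    b_B = B_class2[np.random.randint(len(B_class2))]
    delta = np.random.rand()                                           # Delta in (0, 1)
    w = b_A - b_B                                                      # Eq. 9 (and Eq. 11)
    theta1 = delta * np.dot(w, b_B) + (1 - delta) * np.dot(w, b_A)     # Eq. 10 (and Eq. 12)
    theta = [theta1]
    if m == 1:
        theta.append(min(delta, 1.0 - delta))                          # Eq. 13
    return w, np.array(theta)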

IV-C2 Selection, Crossover, and Mutation for Lower Level GA

Binary tournament selection, SBX crossover [40] and polynomial mutation [41] are used to create the offspring solutions. A $(\mu+\lambda)$ survival selection strategy is then adopted to preserve elites.
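For completeness, a minimal sketch of the textbook forms of these two operators is given below (the distribution indices and per-variable mutation probability are illustrative values; the actual settings used in this work are listed in the supplementary document):

import numpy as np

def sbx_crossover(p1, p2, eta_c=15.0, low=-1.0, high=1.0):
    """Simulated binary crossover (SBX) on two real-valued parent vectors."""
    u = np.random.rand(len(p1))
    beta = np.where(u <= 0.5,
                    (2.0 * u) ** (1.0 / (eta_c + 1.0)),
                    (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta_c + 1.0)))
    c1 = 0.5 * ((1.0 + beta) * p1 + (1.0 - beta) * p2)
    c2 = 0.5 * ((1.0 - beta) * p1 + (1.0 + beta) * p2)
    return np.clip(c1, low, high), np.clip(c2, low, high)

def polynomial_mutation(x, eta_m=20.0, p_mut=0.5, low=-1.0, high=1.0):
    """Polynomial mutation applied variable-wise with probability p_mut."""
    y = np.array(x, dtype=float)
    for i in range(len(y)):
        if np.random.rand() < p_mut:
            u = np.random.rand()
            if u < 0.5:
                delta = (2.0 * u) ** (1.0 / (eta_m + 1.0)) - 1.0
            else:
                delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta_m + 1.0))
            y[i] = np.clip(y[i] + delta * (high - low), low, high)
    return y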

IV-C3 Termination Criteria for Lower level GA

The lower-level GA is terminated when a maximum of 50 generations is reached or when the change in the lower-level objective value ($F_L$) is less than 0.01% over the past 10 consecutive generations.

Other parameter values for the lower-level GA can be found in the supplementary document. Results from the ablation studies on the LLGA and the overall bilevel GA are provided in the supplementary document to conceptually demonstrate the efficacy of the proposed approach.

V Overall Tree Induction and Pruning

Once the split-rule for a conditional node is determined using the above bilevel optimization procedure, the resulting child nodes are checked with respect to the following termination criteria:

  • Depth of the node > maximum allowable depth,

  • Number of datapoints in the node $< N_{min}$, and

  • Node impurity $\leq \tau_{min}$.

If any of the above criteria is satisfied, then that node is declared a terminal leaf node. Otherwise, the node is flagged as an additional conditional node. It then undergoes another split (using a split-rule which is derived by running the proposed bilevel GA on the data present in the node) and the process is repeated. This process is illustrated in Algorithm 1. This eventually produces the main nonlinear decision tree $NLDT_{main}$. The fully grown decision tree then undergoes pruning to produce the pruned decision tree $NLDT_{pruned}$.
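A hedged sketch of this check (corresponding to the TerminationSatisfied subroutine of Algorithm 1) is shown below; the threshold values are illustrative, the node attributes depth, data and labels are assumed, and gini() is the helper sketched in Section IV-A1:

def termination_satisfied(node, max_depth=3, n_min=10, tau_min=0.05):
    """Return True if the node should be declared a leaf (illustrative thresholds)."""
    return (node.depth > max_depth              # maximum allowable depth exceeded
            or len(node.data) < n_min           # too few datapoints left
            or gini(node.labels) <= tau_min)    # node already (nearly) pure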

During the pruning phase, splits after the root node are systematically removed as long as the drop in training accuracy of the pruned tree does not exceed a pre-specified threshold value of $\tau_{prune}=3\%$ (set here after trial-and-error runs). This makes the resultant tree topologically less complex and provides better generalizability. In subsequent sections, we provide results on the final NLDT obtained after pruning (i.e., $NLDT_{pruned}$), unless otherwise specified.
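A sketch of this pruning loop is given below (our illustration; the helpers accuracy, conditional_nodes_bottom_up, collapse_to_leaf and restore are hypothetical placeholders, and the node-visiting order of the actual implementation may differ):

def prune_nldt(tree, X_train, y_train, tau_prune=0.03):
    """Collapse conditional nodes while the loss in training accuracy stays within tau_prune."""
    base_acc = accuracy(tree, X_train, y_train)          # accuracy of the unpruned tree
    for node in conditional_nodes_bottom_up(tree):       # try the deepest splits first
        backup = collapse_to_leaf(node)                  # replace the split by a majority-class leaf
        if base_acc - accuracy(tree, X_train, y_train) > tau_prune:
            restore(node, backup)                        # revert: the split was needed
    return tree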

VI Results

This section summarizes results obtained on four customized test problems, two real-world classification problems, three real-world optimization problems, eight multi-objective problems involving 30-500 features, and one multi-class problem. For each experiment, 50 runs are performed with random training and testing sets. The dataset is split into training and testing sets with a ratio of 7:3. The mean and standard deviation of training and testing accuracy scores across all 50 runs are evaluated to gauge the performance of the classifier. Statistics about the number of conditional nodes in the NLDT (given by #Rules), the average number of active terms in the split-rules of the decision tree (i.e., $F_U$/Rule), and the rule length (the total number of active terms appearing in the entire NLDT) are provided to quantify the simplicity and interpretability of the classifier (the smaller this number, the simpler the classifier). The comparison is made against the classical CART tree solution [4] and SVM [42] results. For SVM, the Gaussian kernel is used, and along with overall accuracy scores, statistics about the number of support vectors (which also equals the rule length) are reported. It is worth noting that the decision boundary computed by SVM can be expressed with a single equation; however, the length of the equation usually increases with the number of support vectors [43].

For each set of experiments, the best scores are highlighted in bold and the Wilcoxon signed-rank test is performed on the overall testing accuracy scores to check the statistical similarity with other classifiers in terms of classification accuracy. Statistically similar classifiers (which have their $p$-value greater than 0.05) are italicized.
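As an illustration of this evaluation protocol (not the authors' script), the snippet below runs 50 random 70/30 splits with scikit-learn baselines on the WDBC dataset of Section VI-C and applies the Wilcoxon signed-rank test; the NLDT itself is omitted here:

import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)           # WDBC used as a stand-in dataset
svm_acc, cart_acc = [], []
for seed in range(50):                                # 50 random 70/30 train/test splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    svm_acc.append(SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te))
    cart_acc.append(DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr).score(X_te, y_te))

print("SVM : %.2f +/- %.2f" % (100 * np.mean(svm_acc), 100 * np.std(svm_acc)))
print("CART: %.2f +/- %.2f" % (100 * np.mean(cart_acc), 100 * np.std(cart_acc)))
stat, p = wilcoxon(svm_acc, cart_acc)                 # statistically similar if p > 0.05
print("Wilcoxon p-value:", p)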

VI-A Customized Datasets: DS1 to DS4

The compilation of results of 50 runs on datasets DS1-DS4 (see supplementary for details regarding customized datasets) is presented in Table I. The results clearly indicate the superiority of the proposed nonlinear, bilevel based decision tree approach over classical CART and SVM based classification methods.

TABLE I: Results on DS1 to DS4 datasets. NLDT is not only the most accurate, it also involves only a few features.
Dataset | Method | Training Accuracy | Testing Accuracy | p-value | #Rules | $F_U$/Rule | Rule Length
DS1 | NLDT | 99.78 ± 0.51 | {\bf 99.55 ± 1.08} | - | 1.0 ± 0.0 | 2.3 ± 0.6 | {\bf 2.3 ± 0.6}
DS1 | CART | 97.99 ± 0.96 | 90.32 ± 4.06 | 7.34e-10 | 14.5 ± 1.7 | 1.0 ± 0.0 | 14.5 ± 1.7
DS1 | SVM | 98.67 ± 0.31 | 98.50 ± 1.09 | 1.29e-05 | 1.0 ± 0.0 | 71.5 ± 3.3 | 71.5 ± 3.3
DS2 | NLDT | 99.80 ± 0.40 | {\bf 99.44 ± 0.87} | - | 1.0 ± 0.0 | 2.3 ± 0.7 | {\bf 2.3 ± 0.7}
DS2 | CART | 98.80 ± 0.50 | 95.43 ± 1.50 | 5.90e-10 | 11.0 ± 1.4 | 1.0 ± 0.0 | 11.0 ± 1.4
DS2 | SVM | 96.95 ± 0.73 | 95.24 ± 0.16 | 2.43e-10 | 1.0 ± 0.0 | 44.7 ± 1.9 | 44.7 ± 1.9
DS3 | NLDT | 99.91 ± 0.35 | {\bf 99.77 ± 0.67} | - | 1.0 ± 0.0 | 2.2 ± 0.5 | {\bf 2.2 ± 0.5}
DS3 | CART | 99.42 ± 0.57 | 95.00 ± 2.35 | 7.00e-10 | 11.5 ± 1.3 | 1.0 ± 0.0 | 11.5 ± 1.3
DS3 | SVM | 98.96 ± 0.42 | 98.38 ± 1.15 | 7.16e-09 | 1.0 ± 0.0 | 62.1 ± 3.4 | 62.1 ± 3.4
DS4 | NLDT | 99.34 ± 1.21 | {\bf 98.88 ± 1.65} | - | 1.2 ± 0.4 | 2.6 ± 0.6 | {\bf 3.1 ± 1.4}
DS4 | CART | 96.98 ± 1.19 | 88.68 ± 3.60 | 7.31e-10 | 31.3 ± 4.2 | 1.0 ± 0.0 | 31.3 ± 4.2
DS4 | SVM | 98.19 ± 0.44 | 97.28 ± 1.19 | 1.31e-05 | 1.0 ± 0.0 | 89.8 ± 3.3 | 89.8 ± 3.3

The bilevel approach finds a single rule with two to three appearances of variables in the rule, whereas CART requires 11 to 31 rules involving a single variable per rule, and SVM requires only one rule but involving 44 to 90 appearances of variables in the rule. Moreover, the bilevel approach achieves this with the best accuracy. Thus, the classifiers obtained by the bilevel approach are simpler and simultaneously more accurate.

VI-B Breast Cancer Wisconsin Dataset

This problem was originally proposed in 1991. It has two classes: benign and malignant, with 458 (or 65.5%) data-points belonging to the benign class and 241 (or 34.5%) belonging to the malignant class. Each data-point is represented with 10 attributes. Results are tabulated in Table II. The bilevel method and SVM had similar performance ($p$-value > 0.05), but SVM requires about 90 variable appearances in its rule, compared to only about 6 variable appearances in the single rule obtained by the bilevel approach. The proposed approach outperformed the classifiers generated using the techniques proposed in [12, 17, 16, 13, 14] in terms of both accuracy and comprehensibility/compactness. The NLDT classifier obtained by a specific run of the bilevel approach has five variable appearances and is presented in Fig. 5.

TABLE II: Results on breast cancer Wisconsin dataset. Despite having a high accuracy, SVM’s rule involves about 90 repeated entries of features.
Method | Training Accuracy | Testing Accuracy | p-value | #Rules | $F_U$/Rule | Rule Length
NLDT | 98.07 ± 0.39 | {\it 96.50 ± 1.16} | {\it 0.308} | {\bf 1.0 ± 0.0} | 6.4 ± 1.7 | {\bf 6.4 ± 1.7}
CART | 98.21 ± 0.49 | 94.34 ± 1.92 | 8.51e-09 | 11.6 ± 2.4 | 1.00 ± 0.00 | 11.6 ± 2.4
SVM | 97.65 ± 0.39 | {\bf 96.64 ± 1.16} | - | 1.00 ± 0.00 | 89.4 ± 14.8 | 89.4 ± 14.8
Figure 5: NLDT classifier for breast cancer Wisconsin dataset.

Figure 6 provides a $B$-space visualization of the decision boundary obtained by the bilevel approach, which is able to identify two nonlinear $B$-terms involving variables to split the data linearly and obtain a high accuracy.

Figure 6: B-space plot for Wisconsin breast cancer dataset.

VI-C Wisconsin Diagnostic Breast Cancer Dataset (WDBC)

This dataset is an extension of the dataset of the previous section. It has 30 features, with a total of 356 datapoints belonging to the benign class and 212 to the malignant class. Results shown in Table III indicate that the bilevel-based NLDT is able to outperform the standard CART and SVM algorithms. The NLDT generated by a run of the bilevel approach requires seven out of 30 variables (or features) and is shown in Fig. 7. It is almost as accurate as that obtained by SVM and is more interpretable (seven versus about 107 variable appearances).

TABLE III: Results on WDBC dataset. Despite having a high accuracy, SVM’s rule involves about 107 repeated entries of features.
Method | Training Accuracy | Testing Accuracy | p-value | #Rules | $F_U$/Rule | Rule Length
NLDT | 98.24 ± 0.64 | 96.20 ± 1.49 | 4.65e-05 | {\bf 1.0 ± 0.0} | 9.2 ± 4.1 | {\bf 9.2 ± 4.1}
CART | 98.76 ± 0.60 | 92.11 ± 2.07 | 1.09e-09 | 10.8 ± 2.1 | 1.00 ± 0.00 | 10.8 ± 2.1
SVM | 98.65 ± 0.37 | {\bf 97.39 ± 1.37} | - | 1.00 ± 0.00 | 106.7 ± 6.6 | 106.7 ± 6.6
Figure 7: NLDT classifier for WDBC dataset.

The $B$-space plot (Fig. 8) shows an efficient discovery of $B$-functions to linearly classify the supplied data with a high accuracy.

Figure 8: B-space plot for WDBC dataset.

VI-D Real-World Auto-Industry Problem (RW-problem)

This real-world engineering design optimization problem has 36 variables, eight constraints, and one objective function. The dataset created was highly biased, with 188 points belonging to the good class and 996 belonging to the bad class. Results obtained using the bilevel GA are shown in Table IV. The proposed algorithm is able to achieve nearly 90% accuracy scores while requiring only two split-rules. The best-performing NLDT has a testing accuracy of 93.82% and is shown in Fig. 9.

SVM performed the best in terms of accuracy, but the resulting classifier is complicated, with about 241 variable appearances in the rule. In contrast, the bilevel GA requires only about two rules, each having only about 10 variable appearances, to achieve a slightly less accurate classification. CART requires about 30 rules with a deep DT, making the classifier difficult to interpret easily.

TABLE IV: Results on the real-world auto-industry problem.
Method | Training Accuracy | Testing Accuracy | p-value | #Rules | $F_U$/Rule | Rule Length
NLDT | 94.36 ± 1.47 | 89.93 ± 2.04 | 4.35e-09 | 1.9 ± 0.5 | 10.0 ± 2.9 | {\bf 18.2 ± 5.9}
CART | 98.00 ± 0.55 | 91.13 ± 1.32 | 9.80e-08 | 29.6 ± 3.9 | 1.00 ± 0.00 | 29.6 ± 3.9
SVM | 94.98 ± 0.73 | {\bf 93.24 ± 1.38} | - | 1.00 ± 0.00 | 240.8 ± 9.4 | 240.8 ± 9.4
Figure 9: NLDT classifier for the auto-industry problem. The first split-rule uses five variables and the second one uses 12.

VI-E Results on Multi-Objective Optimization Problems

Here we evaluate the performance of NLDT on two real-world multi-objective problems, welded-beam design and 2D truss design, and on modified versions of the ZDT [44] and DTLZ [45] problems, m-ZDT and m-DTLZ. The data generation process for the multi-objective datasets is provided in the supplementary document. The complexity of these datasets is controlled by two parameters, $\tau_{rank}$ and $g_{ref}$, an explanation of which is provided in the supplementary document. Results for multi-objective problems are compiled in Table V. Due to space limitations, we report scores of only the testing accuracy, the total number of rules, and $F_U$/Rule.

VI-E1 Truss 2D

The Truss 2D problem [46] is a three-variable problem. Clearly, the bilevel NLDT has the best accuracy and fewer variable appearances (meaning easier interpretability) compared to the CART and SVM generated classifiers, as shown in Table V. Fig. 10(a) shows a 100% correct classifier with a single rule obtained by our bilevel NLDT, whereas Fig. 10(b) shows a typical CART classifier with 19 rules, making it difficult to interpret easily.

(a) NLDT classifier requiring only one rule.
(b) CART classifier requiring 11 rules.
Figure 10: Comparison of bilevel and CART methods on the truss problem with the $\tau_{rank}=6$ dataset.

VI-E2 Welded Beam Design Problem

This bi-objective optimization problem has four variables and four constraints [46]. The NLDT (having a single rule) obtained with $g_{ref}=10$ is shown in Fig. 11. Interestingly, the bilevel NLDTs have similar accuracy to that of CART and SVM with a very low complexity, as shown in Table V.

Figure 11: NLDT classifier requiring only one rule for $g_{ref}=10$ for the welded beam problem.

VI-E3 m-ZDT and m-DTLZ Problems

Experiments conducted on datasets involving 500 features (see Table V) and on two- and three-objective optimization problems confirm the scalability of the proposed approach. In all these problems, traditional methods like CART and SVM find it difficult to conduct a proper classification task. The provision of allowing controlled nonlinearity at each conditional node gives the proposed NLDT approach the necessary flexibility to make an appropriate overall classification. Additional results and visualizations of some of the obtained NLDT classifiers are provided in the supplementary document.

TABLE V: Results on multi-objective problems for classifying dominated and non-dominated solutions.
Method | Testing Acc. | # Rules | $F_U$/Rule

Welded Beam design with $g_{ref}=10$ and $\tau_{rank}=3$:
NLDT | 98.58 ± 1.13 | {\bf 1.0 ± 0.0} | 3.9 ± 1.0
CART | 97.72 ± 1.04 | 8.42 ± 1.42 | 1.00 ± 0.00
SVM | {\bf 98.97 ± 0.54} | 1.00 ± 0.00 | 126.0 ± 6.8

m-ZDT1-2-30:
NLDT | 98.96 ± 0.59 | 1.80 ± 0.60 | 4.47 ± 2.21
CART | 94.78 ± 1.02 | 26.42 ± 1.89 | 1.00 ± 0.00
SVM | 82.84 ± 2.77 | 1.00 ± 0.00 | 1192.38 ± 11.34

m-ZDT2-2-30:
NLDT | 98.96 ± 0.57 | 1.92 ± 0.56 | 4.37 ± 1.82
CART | 94.72 ± 0.89 | 27.80 ± 2.19 | 1.00 ± 0.00
SVM | 98.44 ± 0.57 | 1.00 ± 0.00 | 315.54 ± 6.12

m-DTLZ1-3-30:
NLDT | 96.65 ± 2.86 | 3.12 ± 0.59 | 6.14 ± 2.18
CART | 71.74 ± 2.49 | 93.88 ± 3.23 | 1.00 ± 0.00
SVM | 45.86 ± 1.28 | 1.00 ± 0.00 | 1381.56 ± 8.25

m-DTLZ2-3-30:
NLDT | 97.22 ± 2.25 | 3.02 ± 0.62 | 5.81 ± 1.95
CART | 63.41 ± 7.59 | 100.82 ± 6.11 | 1.00 ± 0.00
SVM | 49.61 ± 1.62 | 1.00 ± 0.00 | 1367.42 ± 8.44

Truss 2D with $\tau_{rank}=6$ and $g_{ref}=1$:
NLDT | {\bf 99.54 ± 0.75} | 1.20 ± 0.50 | 2.90 ± 0.50
CART | 98.33 ± 1.10 | 11.06 ± 3.15 | 1.00 ± 0.00
SVM | {\it 99.46 ± 0.50} | 1.00 ± 0.00 | 62.5 ± 2.90

m-ZDT1-2-500:
NLDT | 98.93 ± 0.60 | 1.78 ± 0.54 | 5.66 ± 3.23
CART | 93.48 ± 0.94 | 20.58 ± 1.39 | 1.00 ± 0.00
SVM | {\bf 100.00 ± 0.00} | {\bf 1.00 ± 0.00} | 240.88 ± 4.62

m-ZDT2-2-500:
NLDT | 98.88 ± 0.71 | 1.80 ± 0.69 | 5.30 ± 2.45
CART | 93.95 ± 1.33 | 20.98 ± 1.98 | 1.00 ± 0.00
SVM | 100.00 ± 0.00 | 1.00 ± 0.00 | 248.06 ± 4.37

m-DTLZ1-3-500:
NLDT | 93.77 ± 4.24 | 3.00 ± 0.45 | 7.61 ± 2.36
CART | 58.26 ± 7.37 | 104.88 ± 5.49 | 1.00 ± 0.00
SVM | 47.13 ± 1.19 | 1.00 ± 0.00 | 1382.38 ± 8.84

m-DTLZ2-3-500:
NLDT | 95.33 ± 4.46 | 3.04 ± 0.56 | 7.28 ± 3.16
CART | 70.04 ± 3.46 | 96.08 ± 3.93 | 1.00 ± 0.00
SVM | 47.01 ± 1.20 | 1.00 ± 0.00 | 1382.62 ± 8.46

VI-F Multi-class Iris Dataset

Though the purpose of the current study was to develop an efficient algorithm for two-class problems, its extension to multi-class problems is possible. The iris dataset is a popular multi-class dataset involving three classes and four features, and is used here as a benchmark to check the feasibility of extending our method. The NLDT classifier, producing 94.8% classification accuracy on test data and 98.6% accuracy on training data, is presented in Figure 12 (see the supplementary document for more details). Notice that, of the four features, feature 1 ($x_1$) is not required for the classification task. Statistics on results obtained across 50 runs are provided in the supplementary document. We are currently investigating a more efficient extension of our approach to multi-class problems.

Refer to caption
Figure 12: NLDT classifier for the three-class iris dataset. The orange node is for datapoints belonging to iris-versicolor, the yellow node for iris-virginica, and the green node indicates datapoints belonging to the iris-setosa class.

VII Conclusions and Future work

In this paper, we have addressed the problem of generating interpretable and accurate nonlinear decision trees for classification. Split-rules at the conditional nodes of a decision tree have been represented as a weighted sum of power-laws of feature variables. A provision for integrating a modulus operation into the split-rule has also been formulated, particularly to handle sandwiched classes of datapoints. The proposed algorithm for deriving the split-rule at a given conditional node of a decision tree has two levels of hierarchy, wherein the upper level evolves the power-law structures and operates on a discrete variable space, while the lower level derives optimal weights and biases by searching a continuous variable space to minimize the net impurity of the child nodes after the split. The efficacy of the upper-level and lower-level evolutionary algorithms has been tested on customized datasets. The lower-level GA is able to robustly and efficiently determine weights and biases that minimize the net impurity. A test conducted on an imbalanced dataset has also revealed the superiority of the proposed lower-level GA in achieving high classification accuracy without relying on any synthetic data. Ablation studies conducted on the upper-level GA (see supplementary document) also demonstrate the efficacy of the algorithm in obtaining simpler split-rules without using any prior knowledge. Results obtained on standard classification benchmark problems and a real-world single-objective problem have amply demonstrated the efficacy and usability of the proposed algorithm.

The classification problem has been extended to discriminate Pareto data from non-Pareto data in multi-objective problems. Experiments conducted on multi-objective engineering problems have resulted in simple polynomial rules which can assist in determining whether a given solution is Pareto-optimal or not. These rules can then be used as design principles for generating future optimal designs. As future studies, we plan to integrate non-polynomial and generic terms into the expression of the split-rules. The proposed bilevel framework can also be tested on regression and reinforcement-learning tasks. The upper-level problem can be converted to a bi-objective problem for optimizing both $F_U$ and $F_L$ simultaneously, so that a set of trade-off solutions can be obtained in a single run. Nevertheless, this proof-of-principle study on the use of a customized bilevel optimization method for classification tasks encourages us to launch such future studies.

Acknowledgment

An initial study of this work was supported by General Motors Corporation, USA.

References

  • [1] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas, “Supervised machine learning: A review of classification techniques,” Emerging artificial intelligence applications in computer engineering, vol. 160, pp. 3–24, 2007.
  • [2] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter optimization,” in Advances in neural information processing systems, 2011, pp. 2546–2554.
  • [3] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Auto-weka: Combined selection and hyperparameter optimization of classification algorithms,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2013, pp. 847–855.
  • [4] L. Breiman, Classification and regression trees.   Routledge, 2017.
  • [5] D. Heath, S. Kasif, and S. Salzberg, “Induction of oblique decision trees,” in IJCAI, vol. 1993, 1993, pp. 1002–1007.
  • [6] S. K. Murthy, S. Kasif, and S. Salzberg, “A system for induction of oblique decision trees,” Journal of artificial intelligence research, vol. 2, pp. 1–32, 1994.
  • [7] S. K. Murthy, S. Kasif, S. Salzberg, and R. Beigel, “Oc1: A randomized algorithm for building oblique decision trees,” in Proceedings of AAAI, vol. 93.   Citeseer, 1993, pp. 322–327.
  • [8] E. Cantú-Paz and C. Kamath, “Using evolutionary algorithms to induce oblique decision trees,” in Proceedings of the 2nd Annual Conference on Genetic and Evolutionary Computation.   Morgan Kaufmann Publishers Inc., 2000, pp. 1053–1060.
  • [9] M. Kretowski, “An evolutionary algorithm for oblique decision tree induction,” in International Conference on Artificial Intelligence and Soft Computing.   Springer, 2004, pp. 432–437.
  • [10] A. Ittner and M. Schlosser, “Non-linear decision trees-ndt,” in ICML.   Citeseer, 1996, pp. 252–257.
  • [11] K. P. Bennett and J. A. Blue, “A support vector machine approach to decision trees,” in Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on, vol. 3.   IEEE, 1998, pp. 2396–2401.
  • [12] H. Núñez, C. Angulo, and A. Català, “Rule extraction from support vector machines.” in Esann, 2002, pp. 107–112.
  • [13] G. Fung, S. Sandilya, and R. B. Rao, “Rule extraction from linear support vector machines,” in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining.   ACM, 2005, pp. 32–40.
  • [14] M. Craven and J. W. Shavlik, “Extracting tree-structured representations of trained networks,” in Advances in neural information processing systems, 1996, pp. 24–30.
  • [15] P. M. Murphy and M. J. Pazzani, “Id2-of-3: Constructive induction of m-of-n concepts for discriminators in decision trees,” in Machine Learning Proceedings 1991.   Elsevier, 1991, pp. 183–187.
  • [16] U. Johansson, R. König, and L. Niklasson, “The truth is in there-rule extraction from opaque models using genetic programming.” in FLAIRS Conference.   Miami Beach, FL, 2004, pp. 658–663.
  • [17] D. Martens, B. Baesens, T. Van Gestel, and J. Vanthienen, “Comprehensible credit scoring models using rule extraction from support vector machines,” European journal of operational research, vol. 183, no. 3, pp. 1466–1476, 2007.
  • [18] H. Ishibuchi, T. Nakashima, and T. Murata, “Three-objective genetics-based machine learning for linguistic rule extraction,” Information Sciences, vol. 136, no. 1-4, pp. 109–133, 2001.
  • [19] Y. Jin and B. Sendhoff, “Pareto-based multiobjective machine learning: An overview and case studies,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 3, pp. 397–415, 2008.
  • [20] S. Nguyen, M. Zhang, and K. C. Tan, “Surrogate-assisted genetic programming with simplified models for automated design of dispatching rules,” IEEE transactions on cybernetics, vol. 47, no. 9, pp. 2951–2965, 2016.
  • [21] K. Nag and N. R. Pal, “A multiobjective genetic programming-based ensemble for simultaneous feature selection and classification,” IEEE transactions on cybernetics, vol. 46, no. 2, pp. 499–510, 2015.
  • [22] J. M. Luna, J. R. Romero, C. Romero, and S. Ventura, “On the use of genetic programming for mining comprehensible rules in subgroup discovery,” IEEE transactions on cybernetics, vol. 44, no. 12, pp. 2329–2341, 2014.
  • [23] A. Lensen, B. Xue, and M. Zhang, “Genetic programming for evolving a front of interpretable models for data visualization,” IEEE Transactions on Cybernetics, 2020.
  • [24] H. Guo and A. K. Nandi, “Breast cancer diagnosis using genetic programming generated feature,” Pattern Recognition, vol. 39, no. 5, pp. 980–987, 2006.
  • [25] M. Muharram and G. D. Smith, “Evolutionary constructive induction,” IEEE transactions on knowledge and data engineering, vol. 17, no. 11, pp. 1518–1528, 2005.
  • [26] M. Shirasaka, Q. Zhao, O. Hammami, K. Kuroda, and K. Saito, “Automatic design of binary decision trees based on genetic programming,” in Proc. The Second Asia-Pacific Conference on Simulated Evolution and Learning (SEAL’98.   Citeseer, 1998.
  • [27] S. Haruyama and Q. Zhao, “Designing smaller decision trees using multiple objective optimization based gps,” in IEEE International Conference on Systems, Man and Cybernetics, vol. 6.   IEEE, 2002, pp. 5–pp.
  • [28] C.-S. Kuo, T.-P. Hong, and C.-L. Chen, “Applying genetic programming technique in classification trees,” Soft Computing, vol. 11, no. 12, pp. 1165–1172, 2007.
  • [29] M. C. Bot, “Improving induction of linear classification trees with genetic programming,” in Proceedings of the 2nd Annual Conference on Genetic and Evolutionary Computation, 2000, pp. 403–410.
  • [30] R. E. Marmelstein and G. B. Lamont, “Pattern classification using a hybrid genetic program-decision tree approach,” Genetic Programming, pp. 223–231, 1998.
  • [31] E. M. Mugambi, A. Hunter, G. Oatley, and L. Kennedy, “Polynomial-fuzzy decision tree structures for classifying medical data,” in International Conference on Innovative Techniques and Applications of Artificial Intelligence.   Springer, 2003, pp. 155–167.
  • [32] A. Cano, A. Zafra, and S. Ventura, “An interpretable classification rule mining algorithm,” Information Sciences, vol. 240, pp. 1–20, 2013.
  • [33] M. C. Bot and W. B. Langdon, “Application of genetic programming to induction of linear classification trees,” in European Conference on Genetic Programming.   Springer, 2000, pp. 247–258.
  • [34] J. Eggermont, J. N. Kok, and W. A. Kosters, “Genetic programming for data classification: Partitioning the search space,” in Proceedings of the 2004 ACM symposium on Applied computing, 2004, pp. 1001–1005.
  • [35] K. C. Tan, A. Tay, T. H. Lee, and C. Heng, “Mining multiple comprehensible classification rules using genetic programming,” in Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat. No. 02TH8600), vol. 2.   IEEE, 2002, pp. 1302–1307.
  • [36] I. De Falco, A. Della Cioppa, and E. Tarantino, “Discovering interesting classification rules with genetic programming,” Applied Soft Computing, vol. 1, no. 4, pp. 257–269, 2002.
  • [37] A. Sinha, P. Malo, and K. Deb, “A review on bilevel optimization: From classical to evolutionary approaches and applications,” IEEE Transactions on Evolutionary Computation, vol. 22, no. 2, pp. 276–295, 2017.
  • [38] L. Bobrowski and M. Kretowski, “Induction of multivariate decision trees by using dipolar criteria,” in European Conference on Principles of Data Mining and Knowledge Discovery.   Springer, 2000, pp. 331–336.
  • [39] M. Kretowski and M. Grześ, “Global induction of oblique decision trees: an evolutionary approach,” in Intelligent Information Processing and Web Mining.   Springer, 2005, pp. 309–318.
  • [40] K. Deb and R. B. Agrawal, “Simulated binary crossover for continuous search space,” Complex systems, vol. 9, no. 2, pp. 115–148, 1995.
  • [41] K. Deb, Multi-objective optimization using evolutionary algorithms.   John Wiley & Sons, 2001, vol. 16.
  • [42] V. Vapnik, The nature of statistical learning theory.   Springer science & business media, 2013.
  • [43] J.-P. Vert, K. Tsuda, and B. Schölkopf, “A primer on kernel methods,” Kernel methods in computational biology, vol. 47, pp. 35–70, 2004.
  • [44] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: Nsga-ii,” IEEE transactions on evolutionary computation, vol. 6, no. 2, pp. 182–197, 2002.
  • [45] K. Deb, L. Thiele, M. Laumanns, and E. Zitzler, “Scalable multi-objective optimization test problems,” in Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat. No. 02TH8600), vol. 1.   IEEE, 2002, pp. 825–830.
  • [46] K. Deb and A. Srinivasan, “Innovization: Innovating design principles through optimization,” in Proceedings of the 8th annual conference on Genetic and evolutionary computation.   ACM, 2006, pp. 1629–1636.
  • [47] J. Blank and K. Deb, “pymoo - Multi-objective Optimization in Python,” https://pymoo.org.
  • [48] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002.
Yashesh Dhebar is a PhD student in the Department of Mechanical Engineering at Michigan State University. He received his bachelor's and master's degrees in Mechanical Engineering from the Indian Institute of Technology Kanpur, graduating in 2015 with a thesis on modular robotics and their operation in cluttered environments. Currently, he is actively pursuing research in the fields of explainable and interpretable artificial intelligence, machine learning and knowledge discovery. He has worked on applied optimization and machine learning projects with industry collaborators.
Kalyanmoy Deb is a Fellow of the IEEE and the Koenig Endowed Chair Professor with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan, USA. He received his bachelor's degree in Mechanical Engineering from IIT Kharagpur, India, and his master's and Ph.D. degrees from the University of Alabama, Tuscaloosa, USA, in 1989 and 1991, respectively. He is largely known for his seminal research in evolutionary multi-criterion optimization. He has published over 535 international journal and conference research papers to date. His current research interests include evolutionary optimization and its application in design, modeling, AI, and machine learning. He is one of the top-cited EC researchers, with more than 144,000 Google Scholar citations.

Supplementary Document


In this supplementary document, we provide additional information related to our original paper.

S1 Summary of Supplementary Document

In this document, we first provide details of the parameter settings used for all our proposed algorithms, followed by an explanation of the procedure used to create customized datasets for benchmarking the lower-level and upper-level GAs. Details of the duplicate update operator used in the upper-level GA of our proposed bilevel optimization approach are described next. Thereafter, we present additional results to support the algorithms and discussions of the original paper.

S2 Parameter Setting

S2-A Parameter Setting for NSGA-II and NSGA-III

We used the implementations of the NSGA-II and NSGA-III algorithms provided in pymoo [47]. pymoo is a Python package for multi-objective optimization algorithms and is developed under the supervision of the original developer of the NSGA-II and NSGA-III algorithms. The following parameter settings were adopted to create Pareto and non-Pareto datasets for the two-objective truss and welded beam problems (a minimal pymoo sketch follows the list):

  • Population Size = 500

  • Maximum Generations = 1000

  • Cross Over = SBX

  • SBX $\eta_c$ = 15

  • SBX probability = 0.9

  • Mutation type = Polynomial Mutation

  • Mutation $\eta_m$ = 20

  • Mutation probability $p_m$ = $1/n_{var}$ (where $n_{var}$ is the total number of variables)
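Below is a minimal sketch of how these settings could be configured in pymoo. The module paths follow the pymoo 0.6 API and may differ in other releases; get_problem("zdt1") is used only as a stand-in, since the truss and welded-beam problems are user-defined.

from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.operators.crossover.sbx import SBX
from pymoo.operators.mutation.pm import PM
from pymoo.optimize import minimize
from pymoo.problems import get_problem

problem = get_problem("zdt1")  # stand-in; the truss and welded-beam problems are user-defined

algorithm = NSGA2(
    pop_size=500,
    crossover=SBX(prob=0.9, eta=15),  # SBX with eta_c = 15 and probability 0.9
    mutation=PM(eta=20),              # polynomial mutation; per-variable probability defaults to about 1/n_var
)

res = minimize(problem, algorithm, ("n_gen", 1000), seed=1,
               save_history=True, verbose=False)
# res.history stores the population of every generation, from which the
# g_ref and g_max populations can later be extracted for labelling.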

S2-B Parameter Setting for Upper Level GA

This section provides the general parameter setting used for the upper level of our bilevel GA.

  • Population size = $10\times d$, where $d$ is the dimension of the feature space of the original problem.

  • Selection = Binary tournament selection.

  • Crossover probability ($p_{xover}^{U}$) = 0.9

  • Mutation parameters:

    • Mutation probability ($p^{U}_{mut}$) = $\min(0.33, 1/d)$.

    • Distribution parameter $\beta$ = 3 (any value $>1$ works well).

  • $p_{zero}$ = 0.75

  • Maximum number of generations = 100

S2-C Parameter Setting for Lower Level GA

Following parameter setting is used for lower level GA:

  • Population size = 50,

  • Maximum generations = 50,

  • Variable range = $[-1, 1]$,

  • Crossover type = Simulated Binary Crossover (SBX)

  • Mutation type = Polynomial Mutation

  • Selection type = Binary Tournament Selection

  • Crossover probability = 0.9,

  • Mutation probability = $1/n_{var}$,

  • Number of variables $n_{var} = p+1$ if $m=0$, or $p+2$ if $m=1$, where $p$ is the total number of power-laws allowed and $m$ is the modulus flag,

  • SBX and polynomial mutation parameters: $(\eta_c, \eta_m) = (2, 15)$.

S2-D Parameter Setting for Inducing a Non-linear Decision Tree

  • Number of power-laws per split-rule: $p=3$,

  • Allowable set of exponents: $\mathbf{E} = \{-3, -2, \dots, 3\}$,

  • Impurity Metric: Gini Score.

  • Minimum impurity to conduct a split: $\tau_{\min} = 0.05$

  • Minimum number of points required in a node to conduct a split: $N_{min} = 10$

  • Maximum allowable Depth = 5

  • Pruning threshold: $\tau_{prune} = 3\%$ (a short sketch showing how these parameters act as stopping criteria follows this list)
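As a rough illustration (not the paper's actual implementation), the sketch below shows how these parameters could act as stopping criteria while recursively growing a tree; find_split_rule is a hypothetical placeholder for the bilevel GA that returns a split-rule, and pruning with $\tau_{prune}$ is applied afterwards and omitted here.

import numpy as np

TAU_MIN, N_MIN, MAX_DEPTH = 0.05, 10, 5  # values listed above

def gini(y):
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def grow(X, y, depth, find_split_rule):
    if len(y) == 0:
        return {"leaf": True, "label": None}  # degenerate child (should not occur in practice)
    # Stop when the node is too small, too deep, or already nearly pure.
    if len(y) < N_MIN or depth >= MAX_DEPTH or gini(y) <= TAU_MIN:
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    rule = find_split_rule(X, y)          # bilevel GA (hypothetical placeholder)
    mask = rule(X) <= 0                   # datapoints satisfying the split-rule
    return {"leaf": False, "rule": rule,
            "left":  grow(X[mask],  y[mask],  depth + 1, find_split_rule),
            "right": grow(X[~mask], y[~mask], depth + 1, find_split_rule)}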

S3 Creation of Customized 2D Datasets

Datasets presented in this section were synthetically generated. Datapoints in 2D space were created using a reference curve $\xi(\mathbf{x})$ and were assigned to either Class-1 or Class-2. Two additional parameters, $\delta$ and $\sigma$, controlled the location of a datapoint relative to the reference curve $\xi(\mathbf{x})$. Datapoints $\mathbf{x}^{(i)}$ for a given dataset were generated using the following steps:

  • First, $n$ points falling on the curve $\xi(\mathbf{x})=0$ were initialized for reference. Let us denote these reference points on the curve by $\mathbf{x}_r^{(i)}$ (where $i=1,2,\dots,n$). Thus,

    $\xi(\mathbf{x}_r^{(i)}) = 0, \qquad i = 1, 2, \dots, n.$
  • $n$ points $\mathbf{x}^{(i)}$ for a dataset were then generated using the following equation:

    $\mathbf{x}^{(i)} = \mathbf{x}_r^{(i)} + \delta + \sigma r, \qquad i = 1, 2, \dots, n,$ (14)

    where $r$ is a random number between 0 and 1, $\delta$ represents the offset and $\sigma$ indicates the amount of spread (a short code sketch of this procedure is given after the list).
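A minimal sketch of this generation procedure is given below. It assumes, for illustration, that $\delta$ and $\sigma$ are added to both coordinates and that the reference curve is parametrized by a scalar $t$; make_dataset and line are hypothetical helper names, not code from the paper.

import numpy as np

def make_dataset(curve, n, delta, sigma, seed=0):
    """Generate n 2-D points around the reference curve xi(x) = 0 using Eq. (14)."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 1.0, n)
    x_r = np.array([curve(ti) for ti in t])   # reference points with xi(x_r) = 0
    r = rng.uniform(0.0, 1.0, (n, 1))         # random number between 0 and 1
    return x_r + delta + sigma * r            # x = x_r + delta + sigma * r, Eq. (14)

# DS1-like data around the line 2*x1 + x2 - 3 = 0 (see Table S1).
line = lambda t: np.array([t, 3.0 - 2.0 * t])
class1 = make_dataset(line, 100, delta=0.0,   sigma=0.01)
class2 = make_dataset(line, 100, delta=0.025, sigma=0.2)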

Four datasets generated for conducting ablation studies are graphically represented in Fig. S1.

The parameter settings used to generate these datasets are provided in Table S1. $N_A$ and $N_B$ indicate the number of datapoints belonging to Class-1 and Class-2, respectively.

TABLE S1: Parameter settings to create customized datasets DS1–DS4. For Class-1 datapoints, $(\delta, \sigma) = (0, 0.01)$.
Dataset | $\xi(\mathbf{x})$ | Class-1 $N_A$ | Class-2 $N_B$ | Class-2 $(\delta, \sigma)$
DS1 | $2x_1 + x_2 - 3 = 0$ | 100 | 100 | $(0.025, 0.2)$
DS2 | $2x_1 + x_2 - 3 = 0$ | 10 | 200 | $(0.025, 0.2)$
DS3 | $x_1^2 + x_2 - 1 = 0$ | 100 | 100 | $(0.025, 0.2)$
DS4 | $2x_1 + x_2 - 3 = 0$ | 100 | 50 | $(0.025, 0.2)$
    |                      |     | 50 | $(-0.015, -0.2)$
Refer to caption
(a) DS1 Dataset.
Refer to caption
(b) DS2 Dataset.
Refer to caption
(c) DS3 Dataset.
Refer to caption
(d) DS4 Dataset.
Figure S1: Customized datasets.

S4 Multi-objective Datasets

S4-A ZDT, Truss and Welded Beam Datasets

For each problem, the NSGA-II algorithm [44] is first applied to evolve a population of $N$ individuals for $g_{max}$ generations. The population of each generation is stored. Naturally, populations from later generations are closer to the Pareto-front than populations from the initial generations. We artificially separate the entire dataset into two classes – good and bad – using two parameters: $g_{ref}$ (indicating an intermediate generation for data collection) and $\tau_{rank}$ (indicating the minimum non-dominated rank for defining the bad cluster). First, population members from the $g_{ref}$ and $g_{max}$ generations are merged. Next, non-dominated sorting is executed on the merged population to determine the non-domination rank of each member. Points belonging to the same non-domination rank are further sorted based on their crowding distance (from highest to lowest) [44]. For the good class, the top $N_A$ points from the rank-1 front are chosen, and for the bad class, the top $N_B$ points from the $\tau_{rank}$ front onward are chosen. Increasing $\tau_{rank}$ increases the separation between the good and bad classes, while the $g_{ref}$ parameter has the inverse effect.
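The sketch below illustrates this labelling step under the stated procedure; the crowding-distance sort within a front is omitted for brevity, and label_good_bad is a hypothetical helper, not code from the paper.

import numpy as np

def dominates(f1, f2):
    """f1 dominates f2 (minimization): no worse in every objective, better in at least one."""
    return np.all(f1 <= f2) and np.any(f1 < f2)

def nondominated_ranks(F):
    """Naive O(n^2) non-dominated ranking; rank 1 is the best front."""
    remaining, ranks, rank = list(range(len(F))), np.zeros(len(F), dtype=int), 1
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(F[j], F[i]) for j in remaining if j != i)]
        for i in front:
            ranks[i] = rank
        remaining = [i for i in remaining if i not in front]
        rank += 1
    return ranks

def label_good_bad(F_merged, tau_rank, N_A, N_B):
    """F_merged: objective vectors of the merged g_ref and g_max populations."""
    ranks = nondominated_ranks(F_merged)
    good = np.where(ranks == 1)[0][:N_A]         # rank-1 (Pareto) members
    bad = np.where(ranks >= tau_rank)[0][:N_B]   # members from the tau_rank front onward
    return good, bad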

A visualization of this procedure is provided in Figs. S2 and S3 for the truss and welded-beam problems, respectively.

Refer to caption
(a) F-space plot with $\tau_{rank}=6$.
Refer to caption
(b) X-space plot with $\tau_{rank}=6$.
Refer to caption
(c) F-space plot with $\tau_{rank}=9$.
Refer to caption
(d) X-space plot with $\tau_{rank}=9$.
Figure S2: Truss design problem data visualization. $g_{ref}=1$ is kept fixed.
Refer to caption
(a) F-space plot with $g_{ref}=1$.
Refer to caption
(b) F-space plot with $g_{ref}=10$.
Figure S3: Welded beam design problem data visualization. $\tau_{rank}=3$ is kept fixed. The problem in (b) is more difficult to solve than that in (a).

The parameter settings used to create datasets for the ZDT1, ZDT2, ZDT3, truss and welded-beam problems are shown in Table S2.

TABLE S2: Parameter settings for generating datasets for multi-objective problems. For the truss and welded beam (WB) problems, different experiments are performed with different values of $g_{ref}$ and $\tau_{rank}$; these values are provided in the captions of the result tables.
Problem | Pop-Size ($N$) | $g_{max}$ | $g_{ref}$ | $\tau_{rank}$ | $N_A$ | $N_B$
ZDT-1 | 100 | 200 | 200 | 10 | 100 | 5000
ZDT-2 | 100 | 200 | 200 | 5 | 100 | 5000
ZDT-3 | 100 | 200 | 200 | 10 | 100 | 5000
Truss-2D | 200 | 1000 | 1 | (varies) | 200 | 200
WB | 200 | 1000 | (varies) | 3 | 200 | 200

S4-B m-ZDT, m-DTLZ Problems

These are modified versions of the ZDT and DTLZ problems [44, 45]. For m-ZDT problems, the g-function is modified to the following:

$g_{zdt}(x_2, \ldots, x_n) = 1 + \dfrac{18}{n-1} \sum_{i=2,\ i\ \text{even}}^{n} (x_i + x_{i+1} - 1)^2.$ (15)

Many two-variable relationships ($x_i + x_{i+1} = 1$) must be satisfied for a solution to lie on the Pareto set.
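A small numerical sketch of Eq. (15) is given below; the paper's 1-based indices are mapped to 0-based Python indices, and the summation is taken over even $i$ for which $x_{i+1}$ exists.

import numpy as np

def g_zdt_modified(x):
    """Sketch of the modified g-function of Eq. (15)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = sum((x[i - 1] + x[i] - 1.0) ** 2        # x_i + x_{i+1} - 1 in 1-based notation
            for i in range(2, n, 2))            # even i = 2, 4, ...
    return 1.0 + 18.0 / (n - 1) * s

# On the Pareto set every pair satisfies x_i + x_{i+1} = 1, so g = 1.
print(g_zdt_modified([0.3, 0.4, 0.6, 0.2, 0.8, 0.5]))   # -> 1.0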

For m-DTLZ problems, the g-function for the $\mathbf{x}_m$ variables is

$g_{dtlz}(\mathbf{x}_m) = 100 \times \sum_{x_i \in \mathbf{x}_m,\ i\ \text{even}}^{n} (x_i + x_{i+1} - 1)^2.$ (16)

Pareto points for the m-ZDT and m-DTLZ problems are generated using the exact analytical expression of the Pareto set (locations where $g=0$). The non-Pareto set is generated using two parameters, $\sigma_{spread}$ and $\sigma_{offset}$. To compute the location of a point $\mathbf{x}_{np}$ belonging to the non-Pareto set from the location of a point $\mathbf{x}_p$ on the Pareto set, the following equation is used:

$x_{np}^{(i)} = x_p^{(i)} + r_1(\sigma_{offset} + r_2\,\sigma_{spread}),$ (17)
where $r_1 \in \{-1, 1\}, \quad r_2 \in [0, 1].$ (18)

Here, $r_1$ and $r_2$ are randomly generated for each $x_{np}^{(i)}$. For m-ZDT problems, $i=1,2,\dots,n$, while for m-DTLZ problems $x_{np}^{(i)}, x_p^{(i)} \in \mathbf{x}_m$. The parameter settings for generating the m-ZDT and m-DTLZ datasets are provided in Table S3. For 30-variable problems, all $x_i \in \mathbf{x}_m$ are changed according to Eqs. (17)–(18) to generate non-Pareto points. However, for 500-variable problems, only the first 28 variables in $\mathbf{x}_m$ are modified according to Eqs. (17)–(18) to generate non-Pareto datapoints.
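A short sketch of this perturbation (Eq. (17)) is given below; the per-coordinate sampling of $r_1$ and $r_2$ follows the description above, and perturb_pareto_point is a hypothetical helper name.

import numpy as np

def perturb_pareto_point(x_p, sigma_offset, sigma_spread, rng):
    """Move a Pareto-set point x_p off the Pareto set, as in Eq. (17)."""
    x_p = np.asarray(x_p, dtype=float)
    r1 = rng.choice([-1.0, 1.0], size=x_p.shape)   # random sign, r1 in {-1, +1}
    r2 = rng.uniform(0.0, 1.0, size=x_p.shape)     # random magnitude, r2 in [0, 1]
    return x_p + r1 * (sigma_offset + r2 * sigma_spread)

rng = np.random.default_rng(0)
# e.g., an m-ZDT1-like perturbation with sigma_offset = 0.1 and sigma_spread = 0.3
x_np = perturb_pareto_point(np.full(30, 0.5), 0.1, 0.3, rng)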

TABLE S3: Parameter settings to generate datasets for m-ZDT and m-DTLZ problems. We generate 1,000 datapoints for each class.
Prob. $n_{vars}$ $\sigma_{spread}$ $\sigma_{offset}$
m-ZDT1 30 0.3 0.1
m-ZDT1 500 0.3 0.1
m-ZDT2 30 0.3 0.1
m-ZDT2 500 0.3 0.1
m-DTLZ1 30 0.05 0
m-DTLZ1 500 0.05 0
m-DTLZ2 30 0.05 0
m-DTLZ2 500 0.05 0

Visualization of the datasets for the m-ZDT and m-DTLZ problems is provided in Figures S4 and S5.

Refer to caption
(a) m-ZDT1, 30 vars
Refer to caption
(b) m-ZDT1, 500 vars
Refer to caption
(c) m-ZDT2, 30 vars
Refer to caption
(d) m-ZDT2, 500 vars
Figure S4: m-ZDT Datasets.
Refer to caption
(a) m-DTLZ1, 30 vars
Refer to caption
(b) m-DTLZ1, 500 vars
Refer to caption
(c) m-DTLZ2, 30 vars
Refer to caption
(d) m-DTLZ2, 500 vars
Figure S5: m-DTLZ Datasets.

S5 Duplicate Update Operator for Upper Level GA

After creating the offspring population, we attempt to eliminate duplicate population members. For each duplicate found in the child population, the block matrix $\mathbf{B}$ of that individual is randomly mutated using the following equation:

$\mathbf{B}(i_r, j_r) = \mathbf{E}(k_r),$ (19)

where $i_r$, $j_r$ and $k_r$ are randomly chosen integers within specified limits. This process is repeated for all members of the child population until each member is unique within the child population. This operation maintains diversity and encourages the search to generate multiple novel and optimized equations for the split-rule.
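A small sketch of this operator follows; the choice of random-number generator and the exact uniqueness check are illustrative assumptions.

import numpy as np

def update_duplicate(B, E, rng):
    """Randomly perturb one exponent of a duplicate's block matrix, as in Eq. (19).

    B : (p x d) integer matrix of exponents b_ij
    E : allowable exponent set, e.g. [-3, -2, ..., 3]
    """
    i_r = rng.integers(0, B.shape[0])   # random power-law (row)
    j_r = rng.integers(0, B.shape[1])   # random feature (column)
    k_r = rng.integers(0, len(E))       # random index into the exponent set
    B = B.copy()
    B[i_r, j_r] = E[k_r]
    return B

# Repeated for every duplicate in the offspring population until all block
# matrices (and hence all split-rule structures) are unique.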

S6 Ablation Studies and Comparison

In this section, we test the efficacy of the lower-level and upper-level optimization algorithms by applying them to four customized problems (DS1–DS4), which are shown pictorially in Fig. S1. Their performance is compared against two standard classifiers: CART and support vector machines (SVM). (Here, the support vector machine is applied without any kernel trick.)

S6-A Ablation Studies on Lower Level GA

Since the lower-level GA (LLGA) focuses on determining the coefficients $w_i$ and bias $\theta_i$ of a linear split-rule in B-space, the DS1 and DS2 datasets are used to gauge its efficacy. The block matrix $\mathbf{B}$ is fixed to the $2\times 2$ identity matrix and the modulus flag $m$ is set to 0. Hence, the split-rule (classifier) to be evolved is

$f(\mathbf{x}) = \theta_1 + w_1 x_1 + w_2 x_2,$

where $w_1$, $w_2$ and $\theta_1$ are to be determined by the LLGA.
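A sketch of the corresponding lower-level objective is given below: the weighted Gini impurity of the two child nodes produced by the split $f(\mathbf{x}) \le 0$. The exact normalization used in the paper may differ; this is only an illustration.

import numpy as np

def gini(y):
    """Gini impurity of a set of class labels (0 for an empty or pure node)."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def net_impurity(w, theta, X, y):
    """Weighted Gini impurity of the children induced by f(x) = theta + w.x <= 0."""
    f = theta + X @ np.asarray(w)
    left, right = y[f <= 0], y[f > 0]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

# The LLGA searches w1, w2 in [-1, 1] and theta to minimize net_impurity(w, theta, X, y).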

The classifier generated by our LLGA is compared with that obtained using the standard SVM algorithm (without any kernel trick) on the DS1 dataset. Since the DS2 dataset is imbalanced, the SMOTE algorithm [48] is first applied to over-sample datapoints of the minority class before generating the classifier using SVM. For the sake of completeness, the classifier generated by SVM on the DS2 dataset without any oversampling is also compared against the ones obtained using LLGA and SMOTE+SVM. The results are shown in Figs. S6 and S7 for the DS1 and DS2 datasets, respectively. Type-1 and Type-2 errors are reported for each experiment. (Type-1 error indicates the percentage of datapoints belonging to Class-1 that get classified as Class-2; Type-2 error indicates the percentage of points belonging to Class-2 that get classified as Class-1.)

Refer to caption
(a) LLGA: $(0\%, 0\%)$; $(0\%, 1\%)$
Refer to caption
(b) SVM: $(0\%, 17\%)$; $(0\%, 10\%)$
Figure S6: Results on the DS1 dataset to benchmark LLGA. Numbers in the first parentheses indicate Type-1 and Type-2 errors on training data, and numbers in the second parentheses indicate Type-1 and Type-2 errors on testing data.

It is clearly evident from the results shown in Fig. S6 that the proposed customized LLGA is more efficient than SVM in arriving at the desired split-rule as a classifier. The LLGA is able to find the decision boundary with 100% prediction accuracy on the training and testing datasets. However, the classifier generated by SVM overlaps with the cloud of datapoints belonging to the scattered class, thereby resulting in an accuracy of 89% on the testing dataset.

Refer to caption
(a) LLGA: $(0\%, 0\%)$; $(0\%, 0\%)$
Refer to caption
(b) SVM: $(100\%, 0\%)$; $(100\%, 0\%)$
Refer to caption
(c) SVM+SMOTE: $(0\%, 1.5\%)$; $(0\%, 1.5\%)$
Figure S7: Results on the DS2 dataset to benchmark LLGA.

On the DS2 dataset, SVM needs to rely on SMOTE to synthetically generate datapoints in order to achieve respectable performance, as can be seen from Figs. S7(b) and S7(c). However, LLGA performs better without SMOTE, as shown in Fig. S7(a).

The ablation study conducted above on the LLGA clearly indicates that the customized optimization algorithm developed for estimating the weights $\mathbf{w}$ and bias $\mathbf{\Theta}$ in the expression of a split-rule is reliable and efficient, and can easily outperform the classical SVM algorithm.

S6-B Ablation Studies on Proposed Bilevel GA

As mentioned before, the upper level of our bilevel algorithm aims at determining optimal power-law structures (i.e., $B_i$'s) and the value of the modulus flag $m$. Experiments are performed on the DS3 and DS4 datasets to gauge the efficacy of the upper-level (and thus the overall bilevel) GA. No prior information about the optimal values of the block matrix $\mathbf{B}$ and modulus flag $m$ is supplied. Thus, the structure of the equation of the split-rule is unknown to the algorithm. Results of these experiments are shown in Figs. S8(a) and S8(b).

Refer to caption
(a) DS3: $(0\%, 0\%)$; $(0\%, 0\%)$
Refer to caption
(b) DS4: $(0\%, 0\%)$; $(0\%, 0\%)$
Figure S8: Bilevel GA results on DS3 and DS4.

As can be seen from Fig. S8(a), the proposed bilevel algorithm is able to find classifiers which classify the datasets with no Type-1 or Type-2 error. Results on the DS4 dataset validate the importance of involving the modulus flag $m$ as a search variable for the upper-level GA in addition to the block matrix $\mathbf{B}$. The algorithm is able to converge to the optimal tree topology (Fig. S8(b)) with 100% classification accuracy. Note that the algorithm had no prior knowledge about the optimal block matrix $\mathbf{B}$ of exponents $b_{ij}$. This observation suggests that the upper-level GA is successfully able to navigate the search space of block matrices $\mathbf{B}$ and the modulus flag $m$ (which is a binary variable) to arrive at the optimal power-laws and split-rule.

S6-C Visualization of Split Rule: X-Space and B-Space

A comprehensible visualization of the decision boundary is possible if the dimension of the feature space is at most three. However, if the dimension $d$ of the original feature space (or X-space) is larger than three, the decision boundary can be visualized in the transformed feature space (or B-space). In our experiments, the maximum number of allowable power-law rules ($p$) is fixed to three (i.e., $p=3$). Thus, any datapoint $\mathbf{x}$ in a $d$-dimensional X-space can be mapped to a three-dimensional B-space. A conceptual illustration of this mapping is provided in Fig. S9,

Refer to caption
Figure S9: Feature transformation. A point in the three-dimensional X-space is mapped to a point in a two-dimensional B-space for which $B_i = x_1^{b_{i1}} x_2^{b_{i2}} x_3^{b_{i3}}$.

where, for the sake of simplicity, a three-dimensional X-space ($x_1$ to $x_3$) is mapped to a two-dimensional B-space using two power-law rules ($B_1$ and $B_2$).

Note that the power-law rules $B_i$ are not known a priori. The proposed bilevel optimization method determines them so that the data becomes linearly separable in B-space.
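A short sketch of this transformation is given below; to_B_space is a hypothetical helper name, and positive feature values are assumed so that negative exponents are well defined.

import numpy as np

def to_B_space(X, B):
    """Map (n, d) datapoints to (n, p) B-space values, with B_i = prod_j x_j ** b_ij."""
    X = np.asarray(X, dtype=float)
    B = np.asarray(B)
    return np.prod(X[:, None, :] ** B[None, :, :], axis=2)

# Example matching Fig. S9: two power-laws over a three-dimensional X-space.
B = np.array([[1, 2, 0],     # B_1 = x1 * x2^2
              [0, -1, 1]])   # B_2 = x3 / x2
X = np.random.default_rng(0).uniform(0.5, 1.5, (5, 3))
print(to_B_space(X, B))      # shape (5, 2)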

S7 Additional Results

Here we provide some additional results. Figs. S10 and S11 show plots of a representative decision boundary (classifier) obtained using our bilevel algorithm and its comparison against the decision tree generated using the traditional CART algorithm on the DS3 and DS4 datasets.

Refer to caption
(a) Bilevel GA results: $(0\%, 0\%)$ and $(1\%, 0\%)$.
Refer to caption
(b) Obtained NLDT for the DS3 problem – one rule.
Refer to caption
(c) CART results: $(0\%, 4\%)$ and $(4\%, 3\%)$ – 13 rules.
Figure S10: Results on the DS3 dataset to benchmark the overall bilevel algorithm. The CART tree is difficult to interpret, while the NLDT classifier is easy to interpret.
Refer to caption
(a) Bilevel GA results: $(0\%, 0\%)$ and $(0\%, 0\%)$ – no error.
Refer to caption
(b) Obtained NLDT for the DS4 problem – one rule.
Refer to caption
(c) CART results: $(4\%, 3\%)$ and $(5\%, 18\%)$ – 27 rules.
Figure S11: Results on the DS4 dataset to benchmark the overall bilevel algorithm. The CART tree is difficult to interpret, while the NLDT classifier is easy to interpret.

S7-A 2D Truss Problem

In Table S4, a comparison of the accuracy scores and complexity obtained for datasets generated using $\tau_{rank}=6$ and $\tau_{rank}=9$ for the bi-objective truss design problem is provided.

TABLE S4: 2D truss problem with $\tau_{rank}=6$ and $\tau_{rank}=9$.
$\tau_{rank}$ | Training Accuracy | Testing Accuracy | # Rules | n-vars | $F_U$/Rule
6 | 99.86 ± 0.36 | 99.6 ± 0.58 | 1.20 ± 0.53 | 2.78 ± 0.41 | 2.92 ± 0.52
9 | 100 ± 0 | 99.92 ± 0.19 | 1.36 ± 0.48 | 2.14 ± 0.35 | 2.26 ± 0.52

Since the dataset generated with $\tau_{rank}=6$ is relatively difficult, we conduct further analysis on it and compare the results obtained using our NLDT approach with those of CART and SVM. A compilation of these results is shown in Table S5.

TABLE S5: Results on Truss-2D with $\tau_{rank}=6$ and $g_{ref}=1$.

Method | Training Accuracy | Testing Accuracy | p-value | # Rules | $F_U$/Rule | Rule Length
NLDT | 99.77 ± 0.72 | \textbf{99.54 ± 0.75} | – | 1.2 ± 0.5 | 2.9 ± 0.5 | \textbf{3.3 ± 0.9}
CART | 99.34 ± 0.32 | 98.33 ± 1.10 | 6.04e-08 | 11.06 ± 3.15 | \textbf{1} | 11.06 ± 3.15
SVM | 99.66 ± 0.15 | \textit{99.46 ± 0.50} | 0.135 | \textbf{1} | 62.5 ± 2.9 | 62.5 ± 2.9

S7-B Welded Beam Design Results

Additional results for the welded beam design problem are shown in Table S6. Here, two sets of experiments are conducted for two different values of $g_{ref}$, keeping the value of $\tau_{rank}$ fixed at three. A higher value of $g_{ref}$ results in the good and bad datapoints being too close to each other, thereby making it difficult for the classification algorithm to determine a suitable classifier. This can be verified from the results presented in Table S6. However, bilevel NLDTs have produced one rule with about 2 to 4 variable appearances, compared to 39.5 to 126 (on average) for SVM classifiers. CART does well on this problem with, on average, 2.94 to 8.42 rules.

TABLE S6: Results on the welded beam design problem with $g_{ref}=1$ and $g_{ref}=10$. $\tau_{rank}=3$ is kept fixed.

$g_{ref}$ | Method | Training Accuracy | Testing Accuracy | p-value | # Rules | $F_U$/Rule | Rule Length
1 | NLDT | 99.98 ± 0.06 | \textit{99.38 ± 0.49} | 0.158 | \textbf{1.0 ± 0.0} | 2.6 ± 0.7 | \textbf{2.6 ± 0.7}
1 | CART | 99.97 ± 0.07 | \textbf{99.50 ± 0.59} | – | 2.94 ± 0.71 | \textbf{1} | 2.94 ± 0.71
1 | SVM | 99.37 ± 0.17 | \textit{99.40 ± 0.40} | 0.331 | \textbf{1} | 39.5 ± 2.5 | 39.5 ± 2.5
10 | NLDT | 99.39 ± 0.38 | 98.58 ± 1.13 | 3.42e-02 | \textbf{1.0 ± 0.0} | 3.9 ± 1.0 | \textbf{3.9 ± 1.0}
10 | CART | 99.46 ± 0.27 | 97.72 ± 1.04 | 4.63e-08 | 8.42 ± 1.42 | \textbf{1} | 8.42 ± 1.42
10 | SVM | 99.46 ± 0.19 | \textbf{98.97 ± 0.54} | – | \textbf{1} | 126.0 ± 6.8 | 126.0 ± 6.8

S7-C ZDT Problems

ZDT problems are two-objective optimization problems and their formulations can be found in [44]. The dataset created was highly biased, with 100 datapoints in the Pareto class and 5,000 datapoints in the non-Pareto class. Visualization of these datasets is provided in Figure S12. As can be seen from Figure S12, the classification task was deliberately made more difficult by tightening the gap between the Pareto and non-Pareto classes.

Refer to caption
(a) ZDT1.
Refer to caption
(b) ZDT2.
Refer to caption
(c) ZDT3.
Figure S12: There is no visible separation of the Pareto and non-Pareto datapoints, making the classification task difficult.

Due to the highly biased distribution, a naive classifier predicting all points as non-Pareto would have a classification accuracy of 98.04% ($= 5000/(5000+100)$). Hence, the pruning implementation with $\tau_{prune}=3\%$ will prune the entire tree, resulting in a tree with only one split-rule. Thus, we investigate the performance of the unpruned NLDT and the pruned NLDT with $\tau_{prune}=0.5\%$. A compilation of the results on ZDT problems is provided in Table S7.

TABLE S7: Results on multi-objective ZDT problems. Visualization of the split-rules is provided in Figs. S13–S15.
Problem | Method | Training Accuracy | Testing Accuracy | p-value | # Rules | $F_U$/Rule | Rule Length
ZDT1 | NLDT | 99.97 ± 0.04 | \textit{99.74 ± 0.16} | 5.65e-01 | 3.3 ± 1.0 | 2.6 ± 1.0 | 7.9 ± 1.9
ZDT1 | NLDT-pruned | 99.80 ± 0.05 | \textit{99.66 ± 0.18} | 9.36e-02 | \textbf{1.0 ± 0.0} | 3.6 ± 0.9 | \textbf{3.6 ± 0.9}
ZDT1 | CART | 99.95 ± 0.04 | 99.59 ± 0.15 | 1.10e-05 | 9.0 ± 2.1 | \textbf{1} | 9.0 ± 2.1
ZDT1 | SVM | 99.76 ± 0.05 | \textbf{99.72 ± 0.11} | – | \textbf{1} | 178.3 ± 13.8 | 178.3 ± 13.8
ZDT2 | NLDT | 99.86 ± 0.08 | \textit{99.20 ± 0.23} | 9.41e-01 | 5.7 ± 1.7 | 6.2 ± 1.5 | 33.7 ± 5.4
ZDT2 | NLDT-pruned | 99.54 ± 0.13 | 99.08 ± 0.27 | 1.40e-02 | 1.6 ± 1.0 | 15.9 ± 5.7 | 21.6 ± 4.3
ZDT2 | CART | 99.85 ± 0.07 | 99.04 ± 0.27 | 2.58e-04 | 21.2 ± 2.4 | \textbf{1} | \textbf{21.2 ± 2.4}
ZDT2 | SVM | 99.29 ± 0.07 | \textbf{99.21 ± 0.16} | – | \textbf{1} | 141.9 ± 32.7 | 141.9 ± 32.7
ZDT3 | NLDT | 99.95 ± 0.06 | 99.71 ± 0.17 | 9.71e-06 | 4.4 ± 1.2 | 3.8 ± 1.8 | 15.5 ± 4.6
ZDT3 | NLDT-pruned | 99.75 ± 0.08 | \textbf{99.60 ± 0.19} | – | 1.1 ± 0.3 | 9.4 ± 3.3 | \textbf{9.7 ± 3.5}
ZDT3 | CART | 99.90 ± 0.06 | 99.47 ± 0.20 | 2.30e-03 | 13.5 ± 2.2 | \textbf{1} | 13.5 ± 2.2
ZDT3 | SVM | 98.18 ± 0.19 | 98.02 ± 0.11 | 7.28e-10 | \textbf{1} | 286.7 ± 33.2 | 286.7 ± 33.2

Clearly, the pruned NLDT requires only one or two rules with far fewer variables (or features) in total than CART and SVM (on average, a total rule length of 3.6 versus 9.0 for CART on ZDT1, and 9.7 versus 286.7 for SVM on ZDT3), but achieves a similar classification accuracy. The difficulty with CART solutions is that the trees are too deep to be easily interpretable. Although theoretical Pareto-optimal points for ZDT problems can be identified by checking $x_i = 0$ for $i \geq 2$, NSGA-II, being a stochastic optimization algorithm, is not expected to find the exact Pareto-optimal solutions, thereby forcing classification methods to find a classifier with more than one participating variable.

Visualizations of the NLDTs obtained using our bilevel algorithm on the ZDT problems are provided in Figures S13, S14 and S15.

Refer to caption
Figure S13: NLDT for ZDT-1. It has a testing accuracy of 99.93% and involves 4 variables in the split-rule.
Refer to caption
Figure S14: NLDT for ZDT-2. It has a testing accuracy of 99.54% and involves 22 variables in the equation of the split-rule at the root node.
Refer to caption
Figure S15: NLDT for ZDT-3. It has a testing accuracy of 100% and involves 9 variables in the equation of the split-rule.

S7-D m-ZDT and m-DTLZ problems

Visualizations of some NLDTs obtained for the m-ZDT and m-DTLZ problems are shown in Figures S16 and S17, respectively. For m-ZDT problems, a relationship between two consecutive variables would constitute a relationship present in the Pareto set, but our approach finds a different two-variable interaction which is present in the dataset and is enough to classify Pareto versus non-Pareto data. For m-DTLZ problems, the relationships are a little more complex. Interestingly, for the 500-variable problem, only seven variables are needed in a nonlinear fashion to classify the two classes of data.

Refer to caption
(a) m-ZDT1 30 vars
Refer to caption
(b) m-ZDT1 500 vars
Refer to caption
(c) m-ZDT2 30 vars
Refer to caption
(d) m-ZDT2 500 vars
Figure S16: NLDTs for m-ZDT problems. Class=0 indicates the Pareto dataset. Even for 500-var problems, only a two-variable interaction is enough to make a highly-accurate classification.
Refer to caption
(a) m-DTLZ1 30 vars
Refer to caption
(b) m-DTLZ1 500 vars
Refer to caption
(c) m-DTLZ2 30 vars
Refer to caption
(d) m-DTLZ2 500 vars
Figure S17: NLDTs for m-DTLZ problems. Class=0 indicates the Pareto dataset.

S8 Influence of Parameter $\tau_I$

In this section, we study the influence of the parameter $\tau_I$ on the complexity of the split-rule and the accuracy of the overall decision tree. As can be seen from Figure S18, a value of $\tau_I = 0.05$ appears to be reasonable for a two-class classification dataset.

Refer to caption
Figure S18: The effect of various values of $\tau_I$ on the complexity (horizontal axis) and accuracy of the classifier (vertical axis) is shown here for the 36-dimensional two-class real-world auto-industry problem. A value of $\tau_I$ in the range from 0.03 to 0.05 appears to be reliable for this two-class classification problem.

S9 Multi-Class Iris Dataset Results

Results obtained on the three-class iris dataset using NLDT, CART and SVM are shown in Table S8. From the results, it is clear that the proposed approach can be extended to multi-class problems as well. In future work, we will extend our approach to multi-class problems with more complicated decision boundaries than the standard iris dataset.

TABLE S8: Results on the multi-class iris dataset. NLDT requires fewer rules and achieves higher accuracy than CART, and requires a much smaller rule length (number of features across all rules) than SVM, making it an excellent compromise between the two existing approaches.
Method | Training Acc. | Testing Acc. | # Rules | $F_U$/Rule | Rule Length
NLDT | 98.59 ± 0.69 | 94.80 ± 4.14 | 2.00 ± 0.00 | 1.96 ± 0.57 | 3.92 ± 1.15
CART | 96.57 ± 1.03 | 93.69 ± 2.75 | 3.80 ± 0.49 | 1.00 ± 0.00 | 3.80 ± 0.49
SVM | 96.76 ± 1.17 | 96.13 ± 2.51 | 1.00 ± 0.00 | 47.94 ± 2.17 | 47.94 ± 2.17

S10 Summary

The results of the main paper and the additional results presented in this document amply indicate that the proposed bilevel-based NLDT approach is a viable strategy for classification tasks, requiring fewer rules than a standard decision tree (CART, for example) produces and simpler rules than the single complex rule that an SVM procedure produces. Hence, our nonlinear rules with restricted complexity are more humanly interpretable and useful not only for the mere classification task, but also for a better understanding of the classifier.