
Interpretable Rule Discovery Through Bilevel Optimization of Split-Rules of Nonlinear Decision Trees for Classification Problems

Yashesh Dhebar and Kalyanmoy Deb. Yashesh Dhebar is with the Department of Mechanical Engineering at Michigan State University, East Lansing, MI 48824, USA, e-mail: [email protected]. Kalyanmoy Deb is with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824, USA, e-mail: [email protected]. COIN Lab website: https://coin-lab.org
Abstract

For supervised classification problems involving design, control, and other practical purposes, users are not only interested in finding a highly accurate classifier, but they also demand that the obtained classifier be easily interpretable. While the definition of interpretability of a classifier can vary from case to case, here we restrict a humanly interpretable classifier to one that can be expressed in simple mathematical terms. As a novel approach, we represent a classifier as an assembly of simple mathematical rules using a nonlinear decision tree (NLDT). Each conditional (non-terminal) node of the tree represents a nonlinear mathematical rule (split-rule) involving features in order to partition the dataset in the given conditional node into two non-overlapping subsets. This partitioning is intended to minimize the impurity of the resulting child nodes. By restricting the structure of the split-rule at each conditional node and the depth of the decision tree, the interpretability of the classifier is assured. The nonlinear split-rule at a given conditional node is obtained using an evolutionary bilevel optimization algorithm, in which the upper level focuses on arriving at an interpretable structure of the split-rule, while the lower level finds the most appropriate weights (coefficients) of individual constituents of the rule so as to minimize the net impurity of the two resulting child nodes. The performance of the proposed algorithm is demonstrated on a number of controlled test problems, existing benchmark problems, and industrial problems. Results on two to 500-feature problems are encouraging and open up further scopes of applying the proposed approach to more challenging and complex classification tasks.

Index Terms:
Classification, Decision trees, Machine learning, Bi-level optimization, Evolutionary algorithms.

I Introduction

In a classification task, a set of labelled data is usually provided, in which each datapoint is described by a set of features and belongs to a specific class. The task is to design an algorithm which works on the given dataset and arrives at one or more classification rules (expressed as functions of problem features) which are able to make predictions with maximum classification accuracy. A classifier can be expressed in many different ways suitable for the purpose of the application. In a quest to make the classifier interpretable, classifier representation through a decision-list or a decision tree (DT) has been a popular choice. In decision trees, the classification rule-set is represented in an inverted tree-based structure where non-terminal conditional nodes comprise conditional statements, mostly involving a single feature, and the terminal leaf-nodes have a class label associated with them. A datapoint traverses through the DT by following the rules at conditional nodes and lands at a particular leaf-node, marking its class. In the artificial neural network (ANN) approach, an implicit relationship among features is captured through a multi-layered network, while in the support-vector-machine (SVM) approach, a complex mathematical relationship of associated support vectors is derived. One common aspect of these methods is that they use an optimization method to arrive at a suitable DT, ANN or SVM based on maximizing the classification accuracy.

Due to their importance in many practical problems involving design, control, identification, and other machine learning related tasks, researchers have spent considerable effort in developing efficient optimization-based classification algorithms [1, 2, 3]. While most algorithms are developed for classifying two-class datasets, the developed methods can be extended to multi-class classification problems, either by treating them as hierarchical two-class classification problems or by extending them to constitute a simultaneous multi-class classification algorithm. In this paper, we do not extend our approach to multi-class classification problems, except for showing a proof-of-principle study.

In most classification tasks, maximizing classification accuracy (or minimizing classification error) on the labelled dataset is usually used as the sole objective. However, besides classification accuracy, in many applications users are also interested in finding an easily interpretable classifier for various practical reasons: (i) it helps to identify the most important rules which are responsible for the classification task, and (ii) it provides a more direct relationship among features, giving better insight into the underlying classification task for knowledge enhancement and future development. The definition of an easily interpretable classifier largely depends on the context, but the existing literature represents them as a linear, polynomial, or posynomial function involving only a few features.

In this paper, we propose a number of novel ideas towards evolving a highly accurate and easily interpretable classifier. First, instead of a DT, we propose a nonlinear decision tree (NLDT) as a classifier, in which every non-terminal conditional node represents a nonlinear function of features to express a split-rule. Each split-rule splits the data into two non-overlapping subsets. Successive hierarchical splits of these subsets are carried out and the tree is allowed to grow until one of the termination criteria is met. We argue that the flexibility of allowing a nonlinear split-rule at conditional nodes (instead of a single-variable based rule, as found in traditional ID3-based DTs [4]) will result in a more compact DT (having fewer nodes). Second, to derive the split-rule at a given conditional node, a dedicated bilevel optimization algorithm is applied, so that learning the structure of the rule and its associated coefficients are treated as two hierarchical and inter-related optimization tasks. Third, our proposed methodology is customized for classification problem solving, so that our overall bilevel optimization algorithm is computationally efficient. Fourth, we emphasize evolving simplistic rule structures through our customized bilevel optimization method so that the obtained rules are also interpretable.

In the remainder of the paper, we provide a brief survey of the existing approaches for inducing DTs in Section II. A detailed description of the problem of inducing interpretable DTs from an optimization perspective and a high-level view of the proposed approach is provided in Section III. Next, we provide an in-depth discussion of the proposed bilevel optimization algorithm which is adopted to derive interpretable split-rules at each conditional node of the NLDT in Section IV. Section V provides a brief overview of the post-processing method which is used to prune the tree and simplify its topology. A compilation of results on standard classification and engineering problems is given in Section VI. The paper ends with concluding remarks and some highlights on future work in Section VII.

II Past Studies

There exist many studies involving machine learning and data analytics methods for discovering rules from data. Here, we provide a brief mention of some studies which are close to our proposed study.

Traditional induction of DTs is done in an axis-parallel fashion [4], wherein each split-rule is of type $x_i \leq \tau_i$ or $x_i \geq \tau_i$. Algorithms such as ID3, CART and C4.5 are among the popular ones in this category. Work done in the past to generate non-axis-parallel trees can be found in [5, 6, 7], where the researchers rely on randomized algorithms to search for multivariate hyperplanes. Work done in [8, 9] uses evolutionary algorithms to induce oblique DTs. The idea of OC1 [7] is extended in [10] to generate nonlinear quadratic split-rules in a DT. Bennett et al. [11] use SVMs to generate linear/nonlinear split-rules. However, these works do not address certain key practical considerations, such as the complexity of the combined split-rules and the handling of biased data.

Next, we discuss some studies which take the aspect of rule complexity into consideration. In [12], ellipsoidal and interval based rules are determined using the set of support-vectors and prototype/vertex points. The authors there primarily focus on coming up with a compact set of if-then-else rules which are comprehensible. Despite its intuitiveness, the approach proposed in [12] does not result in a comprehensible set of rules on high-dimensional datasets. Another approach suggested in [13] uses a decompositional technique to derive rules from the output of a linear-SVM classifier. The rules here are expressed as hypercubes. The method is computationally fast, but its scope does not extend to nonlinear decision boundaries and its performance is limited by that of a regular linear-SVM. Moreover, this approach has a tendency to generate more rules if the data is distributed parallel to the decision boundary. The study conducted in [14] uses a trained neural-network as an oracle to develop a DT of at-least m-of-n type rule-sets (similar to the one described in ID2-of-3 [15]). The strength of this approach lies in its potential to scale up. However, its accuracy on unseen data usually falls by about 3% from the corresponding oracle neural-network counterpart. Authors in [16] use a pedagogical technique to evolve comprehensible rules using genetic programming. The proposed algorithm G-REX considers fidelity (i.e., how closely the evolved AI agent can mimic the behaviour of the oracle NN) as the primary objective, and the compactness of the rule expression is penalized using a parameter to evolve an interpretable set of fuzzy rules for a classification problem. The approach is good enough to produce a comprehensible rule-set, but it needs tweaking and fine-tuning of the penalty parameter. A nice summary of the above-mentioned methods is provided in [17]. Ishibuchi et al. [18] implemented a three-objective strategy to evolve a fuzzy set of rules using a multi-objective genetic algorithm. The objectives considered in that research were the classification accuracy, the number of rules in the evolved rule-set, and the total number of antecedent conditions. In [19], an ANN is used as the final classifier and a multi-objective optimization approach is employed to find a simple and accurate classifier, where simplicity is expressed by the number of nodes. Hand calculations are then used to express the ANN analytically with a mathematical function. This procedure, however, becomes intractable when the evolved ANN has a large number of nodes.

Genetic programming (GP) based methods have been found efficient in deriving relevant features from the set of original features, which are then used to generate the classifier or interpretable rule sets [20, 21, 22, 23, 24, 25]. In some studies, the entire classifier is encoded with a genetic representation and the genome is then evolved using GP. Some works conducted in this regard also simultaneously consider the complexity of the classifier [26, 27, 28, 29], but those have been limited to axis-parallel or oblique splits. Applications of GP to induce nonlinear multivariate decision trees can be found in [30, 31]. Our approach of inducing a nonlinear decision tree is conceptually similar to the idea discussed in [30], where the DT is induced in a top-down way and, at each internal conditional node, the nonlinear split-rule is derived using a GP. There, the fitness of a GP solution is expressed as a weighted sum of misclassification rates and a generalizability term. However, the interpretability aspect of the final split-rule does not get captured anywhere in the fitness assignment, and the authors do not report the complexity of the finally obtained classifiers. Authors in [32] use a three-phase evolutionary algorithm to create a composite rule set of axis-parallel rules for each class. Their algorithm is able to generate interpretable rules but struggles to achieve respectable classification accuracy on datasets which are not separable by axis-parallel boundaries. In [33], GP is used to evolve DTs with linear split-rules; however, the work does not focus on improving the interpretability of the evolved DT classifier. Eggermont et al. [34] induce the DT using GP by allowing multiple splits through their clustering and refined atomic (internal node) representations and conduct a global evolutionary search to estimate the values of attribute thresholds. A multi-layered fitness approach is adopted, with first priority given to the overall classification accuracy followed by the complexity of the DT. Here too, splits occur based on box conditions and hence it becomes difficult to address classification datasets involving nonlinear (or non axis-parallel) decision boundaries. In [35], a decision list is generated using a GP by incorporating concept-mapping and covering techniques. Similar to [32], a dedicated rule-set for each class is generated, and this method too is limited to datasets which can be partitioned through boxed boundaries. An interesting application of GP to derive interpretable classification rules is reported in [36]. Here, a dedicated rule of type IF-ANTECEDENT-THEN-DECISION is generated for each class in the dataset using a GP and a procedure is suggested to handle intermediate cases. The suggested algorithm shows promising results for classification accuracy, but the study does not report statistics regarding the complexity of the evolved rule-set.

In our proposed approach, we attempt to evolve nonlinear DTs which are robust to different biases in the dataset, and we simultaneously target evolving nonlinear yet simple polynomial split-rules at each conditional node of the NLDT with a dedicated bilevel genetic algorithm (shown pictorially in Figure 1). In contrast to the method discussed in [30], where the fitness of a GP individual does not capture the notion of complexity of the rule expression, the bilevel optimization proposed in our work deals with the aspects of interpretability and performance of the split-rule in a logically hierarchical manner (conceptually illustrated in Figure 2). The results indicate that the proposed bilevel algorithm for evolving nonlinear split-rules eventually generates simple and comprehensible classifiers with high or comparable accuracy on all the problems used in this study.

III Proposed Approach

III-A Classifier Representation Using Nonlinear Decision Tree

An interpretable classifier for our proposed method is represented in the form of a nonlinear decision tree (NLDT), as depicted in Figure 1.

Figure 1: An illustration of a nonlinear decision tree (NLDT) classifier for a two-class problem. For a given conditional node $i$, the split-rule function $f_i(\mathbf{x})$ is evolved using a dedicated bilevel optimization procedure.

The decision tree (DT) comprises conditional (non-terminal) nodes and terminal leaf-nodes. Each conditional node of the DT has a rule ($f_i(\mathbf{x})\leq 0$) associated with it, where $\mathbf{x}$ represents the feature vector. To make the DT more interpretable, two aspects are considered:

  1. Simplicity of the split-rule $f_i(\mathbf{x})$ at each conditional node (see Figure 2), and

  2. Simplicity of the topology of the overall DT, which is computed by counting the total number of conditional nodes in the DT.

Under an ideal scenario, the simplest split-rule will involve just one attribute (or feature) and can be expressed as $f_i(\mathbf{x}): x_k - \tau \leq 0$ for the $i$-th rule. Here, the split occurs solely based on the $k$-th feature ($x_k$). In most complex problems, such a simple split-rule involving a single feature may not lead to a small-depth DT, thereby making the overall classifier un-interpretable. DTs induced using algorithms such as ID3 and C4.5 fall under this category. In the present work, we refer to these trees as classification and regression trees (CART) [4].

On the other extreme, a topologically simplest tree will correspond to a DT involving only one conditional node, but the associated rule may be too complex to interpret. SVM based classifiers fall in this category, wherein the decision boundary is expressed in the form of a complicated mathematical equation. Another way to represent a classifier is through an ANN, which, when expressed analytically, results in a complicated nonlinear function $f(\mathbf{x})$ without any easy interpretability.

In this work, we propose a novel compromise between the above two extreme cases, so that the resulting DT is not too deep and the associated split-rule function $f_i(\mathbf{x})$ at each conditional node assumes a nonlinear form with controlled complexity and is easily interpretable. The development of a nonlinear DT is similar in principle to the search of a linear DT in CART, but this task requires the use of an efficient nonlinear optimization method for evolving nonlinear rules $f_i(\mathbf{x})$, for which we exploit recent advances in bilevel optimization. The proposed bilevel approach is discussed in Section IV and a high-level perspective of the optimization process is provided in Figure 1. During the training phase, the NLDT is induced using a recursive algorithm as shown in Algorithm 1. A brief description of the subroutines used in Algorithm 1 is provided below, and the relevant pseudo-codes for the ObtainSplitRule subroutine and the evaluator for the upper-level GA are provided in Algorithms 2 and 3, respectively. For brevity, details of the termination criteria are provided in the supplementary document.

  • ObtainSplitRule(Data)

    • Input: $(N\times(D+1))$-matrix representing the dataset, where the last column indicates the class-label.

    • Output: Nonlinear split-rule $f(\mathbf{x})\leq 0$.

    • Method: Bilevel optimization is used to determine $f(\mathbf{x})$. Details regarding this are discussed in Section IV.

  • SplitNode(Data, SplitRule)

    • Input: Data, split-rule $f(\mathbf{x})\leq 0$.

    • Output: LeftNode and RightNode, which are node-data-structures, where LeftNode.Data contains the datapoints in the input Data satisfying $f(\mathbf{x})\leq 0$, while RightNode.Data contains the datapoints in the input Data violating the split-rule.

Input: Dataset
Function UpdatedNode = InduceNLDT(Node, depth):
       Node.depth = depth;
       if TerminationSatisfied(Node) then
             Node.Type = ‘leaf’;
            
      else
             Node.Type = ‘conditional’;
             Node.SplitRule = ObtainSplitRule(Node.Data);
            
            [LeftNode, RightNode] = SplitNode(Node.Data, Node.SplitRule);
            
            Node.LeftNode = InduceNLDT(LeftNode, depth + 1);
             Node.RightNode = InduceNLDT(RightNode, depth + 1);
            
       end if
      UpdatedNode = Node;
      
// NLDT Induction Algorithm
RootNode.Data = Dataset;
Tree = InduceNLDT(RootNode, 0)
Algorithm 1 Pseudo Code to Recursively Induce NLDT.
Input: Data
Output: SplitRule // Split rule $f(\mathbf{x})$
Function SplitRule = ObtainSplitRule(Data):
       Initialize: $P_U$ // Upper-level population
       // Execute LLGA (Algorithm 3)
       $P_U$ = EvaluateUpperLevelPop($P_U$);
       // Upper-level GA loop
       for gen = 1:MaxGen do
             $P_U^{Parent}$ = Selection($P_U$);
             $P_U^{Child}$ = Crossover($P_U^{Parent}$);
             $P_U$ = Mutation($P_U^{Child}$);
             // Execute LLGA (Algorithm 3)
             $P_U$ = EvaluateUpperLevelPop($P_U$);
             // Elite preservation
             $P_U$ = SelectElite($P_U^{Parent}$, $P_U$);
             if TerminationConditionSatisfied then
                   break;
             end if
       end for
       // Extract best solution in $P_U$
       SplitRule = ObtainBestInd($P_U$);
Algorithm 2 ObtainSplitRule subroutine. Implements the bilevel optimization algorithm to determine a split-rule. The upper level searches the space of block matrix $\mathbf{B}$ and modulus-flag $m$. The constraint violation value for an upper-level individual is obtained by executing the lower-level GA (LLGA), as shown in Algorithm 3.
Input: $P_U$ // Upper-level population
Output: $P_U^{\prime}$ // Evaluated population
Function $P_U^{\prime}$ = EvaluateUpperLevelPop($P_U$):
       for i = 1:PopSize do
             // Execute LLGA (see Section IV-C)
             [$F_L$, $\mathbf{w}$, $\mathbf{\Theta}$] = LLGA(Data, $\mathbf{B}$, $m$);
             [$P_U$[i].$\mathbf{w}$, $P_U$[i].$\mathbf{\Theta}$] = [$\mathbf{w}$, $\mathbf{\Theta}$];
             // Constraint and fitness value
             $P_U$[i].CV = $F_L - \tau_I$;
             $P_U$[i].$F_U$ = MaxActiveTerms($\mathbf{B}$);
       end for
       $P_U^{\prime} = P_U$
Algorithm 3 EvaluateUpperLevelPop subroutine. A dedicated lower-level optimization is executed for each upper-level population member.

III-B Split-Rule Representation

In this paper, we restrict the expression of the split-rule at a conditional node of the decision tree, operating on the feature vector $\mathbf{x}$, to assume the following structure:

$\text{Rule}: \quad f(\mathbf{x},\mathbf{w},\mathbf{\Theta},\mathbf{B})\leq 0$,   (1)

where $f(\mathbf{x},\mathbf{w},\mathbf{\Theta},\mathbf{B})$ can be expressed in two different forms, depending on whether a modulus operator is used or not (indicated by the flag $m$):

$f(\mathbf{x},\mathbf{w},\mathbf{\Theta},\mathbf{B})=\begin{cases}\theta_{1}+w_{1}B_{1}+\ldots+w_{p}B_{p}, & \text{if } m=0,\\ \left|\theta_{1}+w_{1}B_{1}+\ldots+w_{p}B_{p}\right|-\left|\theta_{2}\right|, & \text{if } m=1.\end{cases}$   (2)

Here, the $w_i$'s are coefficients or weights of several power-laws ($B_i$'s), the $\theta_i$'s are biases, $m$ is the modulus-flag which indicates the presence or absence of the modulus operator, $p$ is a user-specified parameter indicating the maximum number of power-laws ($B_i$) which can exist in the expression of $f(\mathbf{x})$, and $B_i$ represents a power-law rule of the type:

$B_{i}=x_{1}^{b_{i1}}\times x_{2}^{b_{i2}}\times\ldots\times x_{d}^{b_{id}}$,   (3)

$\mathbf{B}$ is a block-matrix of exponents $b_{ij}$, given as follows:

$\mathbf{B}=\begin{bmatrix}b_{11}&b_{12}&b_{13}&\dots&b_{1d}\\ b_{21}&b_{22}&b_{23}&\dots&b_{2d}\\ \vdots&\vdots&\vdots&\ddots&\vdots\\ b_{p1}&b_{p2}&b_{p3}&\dots&b_{pd}\end{bmatrix}$.   (4)

Exponents $b_{ij}$ are allowed to assume values from a specified discrete set $\mathbf{E}$. In this work, we set $p=3$ and $\mathbf{E}=\{-3,-2,-1,0,1,2,3\}$ to make the rules interpretable; however, these can be changed by the user. The parameters $w_i$ and $\theta_i$ are real-valued variables in $[-1,1]$. The feature vector $\mathbf{x}$ is a data-point in a $d$-dimensional space. Another user-defined parameter $a_{\max}$ controls the maximum number of variables that can be present in each power-law $B_i$. The default is $a_{\max}=d$ (i.e., the dimension of the feature space).
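To make this representation concrete, the following minimal Python sketch (our illustration, not part of the original implementation; the function name evaluate_split_rule is hypothetical) evaluates the split-rule value $f(\mathbf{x})$ of Eq. 2 for one datapoint, assuming features are strictly positive so that negative exponents are well defined:

import numpy as np

def evaluate_split_rule(x, B, w, theta, m):
    """Evaluate f(x) of Eq. 2 for one datapoint.

    x     : (d,) feature vector (assumed positive so negative exponents are valid)
    B     : (p, d) block matrix of exponents b_ij drawn from E = {-3, ..., 3}
    w     : (p,) weights in [-1, 1]
    theta : [theta_1] if m == 0, or [theta_1, theta_2] if m == 1
    m     : modulus flag (0 or 1)
    """
    # Power-law terms B_i = x_1^{b_i1} * x_2^{b_i2} * ... * x_d^{b_id}  (Eq. 3)
    power_laws = np.prod(np.power(x, B), axis=1)       # shape (p,)
    value = theta[0] + np.dot(w, power_laws)
    if m == 1:
        value = abs(value) - abs(theta[1])
    return value                                        # rule is satisfied if value <= 0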

IV Bilevel Approach for Split-Rule Identification

IV-A The Hierarchical Objectives to derive split-rule

Here, we illustrate the need for formulating the problem of split-rule identification as a bilevel problem using Figure 2.

Figure 2: Illustrative sketch representing different split-rules on 2D data. While split-rule $A$ may be overly complex and un-interpretable, split-rules $B$ and $C$ have similarly simple and interpretable structures, but $C$ has better coefficients, making it more accurate than $B$.

The geometry and shape of a split-rule is defined by the exponent terms $b_{ij}$ appearing in its expression (Eqs. 2 and 3), while the orientation and location of the split-rule in the feature space is dictated by the values of the coefficients $w_i$ and biases $\theta_i$ (Eq. 2). Thus, the optimization task of estimating the split-rule $f(\mathbf{x})$ involves two types of variables:

  1. the $\mathbf{B}$-matrix representing exponents of $B$-terms (i.e., $b_{ij}$ as shown in Eq. 4) and the modulus flag $m$ indicating the presence or absence of a modulus operator in the expression of $f(\mathbf{x})$, and

  2. weights $\mathbf{w}$ and biases $\mathbf{\Theta}$ in each rule function $f(\mathbf{x})$.

Identification of a good structure for the $\mathbf{B}$ terms and the value of $m$ is a more difficult task than the weight and bias identification. We argue that if both types of variables ($\mathbf{B}$, $m$, $\mathbf{\Theta}$, $\mathbf{w}$) are concatenated in a single-level solution, a good $(\mathbf{B},m)$ combination may be associated with a not-so-good $(\mathbf{w},\mathbf{\Theta})$ combination (like split-rule $B$ in Fig. 2), thereby making the whole solution vulnerable to deletion during optimization. It may be better to separate the search for a good structure of the $(\mathbf{B},m)$ combination from the weight-bias search at the same level, and search for the best weight-bias combination for every $(\mathbf{B},m)$ pair as a detailed task. This hierarchical structure of the variables motivates us to employ a bilevel optimization approach [37] to handle the two types of variables. The upper-level optimization algorithm searches the space of $\mathbf{B}$ and $m$. Then, for each $(\mathbf{B},m)$ pair, the lower-level optimization algorithm is invoked to determine the optimal values of $\mathbf{w}$ and $\mathbf{\Theta}$. Referring to Fig. 2, the upper level will search for the structure of $f(\mathbf{x})$, which might have a nonlinearity (like rule $A$) or might be linear (like rule $B$). Then, for each upper-level solution (for instance, a solution corresponding to a linear rule structure), the lower level will adjust its weights and biases to determine its optimal location $C$.

The bilevel optimization problem can then be formulated as shown below:

$\begin{array}{rl}\text{Min.} & F_{U}(\mathbf{B},m,\mathbf{w}^{\ast},\mathbf{\Theta}^{\ast}),\\ \text{s.t.} & (\mathbf{w}^{\ast},\mathbf{\Theta}^{\ast})\in{\rm argmin}\left\{F_{L}(\mathbf{w},\mathbf{\Theta})|_{(\mathbf{B},m)}\,\big|\,F_{L}(\mathbf{w},\mathbf{\Theta})|_{(\mathbf{B},m)}\leq\tau_{I},\right.\\ & \quad\left.-1\leq w_{i}\leq 1\ \forall i,\ \mathbf{\Theta}\in[-1,1]^{m+1}\right\},\\ & m\in\{0,1\},\quad b_{ij}\in\{-3,-2,-1,0,1,2,3\},\end{array}$   (5)

where the upper-level objective $F_U$ is a quantification of the simplicity of the split-rule, and the lower-level objective $F_L$ quantifies the quality of the split resulting from the split-rule $f(\mathbf{x})\leq 0$. An upper-level solution is considered feasible only if it is able to partition the data within some acceptable limit, which is set by a parameter $\tau_I$. A more detailed explanation regarding the upper-level objective $F_U$ and the lower-level objective $F_L$ is provided in the next sections. A pseudo-code of the bilevel algorithm to obtain the split-rule is provided in Algorithm 2, and the pseudo-code provided in Algorithm 3 gives an overview of the evaluation of the population pool in the upper-level GA.

IV-A1 Computation of $F_L$

The impurity of a node in a decision tree is defined by the distribution of data present in the node. Entropy and Gini-score are popularly used to quantify the impurity of a data distribution. In this work, we use the Gini-score to gauge the impurity of a node; however, the entropy measure could also be used. For a binary classification task, the Gini-score ranges from 0 to 0.5, while entropy ranges from 0 to 1. Hence, the value of $\tau_I$ (Eq. 5) needs to be changed if we use entropy instead of Gini, while everything else stays the same. For a given parent node $P$ in a decision tree and two child nodes $L$ and $R$ resulting from it, the net impurity of the child nodes ($F_L$) can be computed as follows:

$F_{L}(\mathbf{w},\mathbf{\Theta})|_{(\mathbf{B},m)}=\left(\frac{N_{L}}{N_{P}}\,\text{Gini}(L)+\frac{N_{R}}{N_{P}}\,\text{Gini}(R)\right)_{(\mathbf{w},\mathbf{\Theta},\mathbf{B},m)}$,   (6)

where $N_P$ is the total number of datapoints present in the parent node $P$, and $N_L$ and $N_R$ indicate the total number of points present in the left ($L$) and right ($R$) child nodes, respectively. Datapoints in $P$ which satisfy the split-rule $f_P(\mathbf{x})\leq 0$ (Eq. 1) go to the left child node, while the rest go to the right child node. The objective $F_L$ of minimizing the net impurity of child nodes favors the creation of purer child nodes (i.e., nodes with a low Gini-score).
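For illustration, a possible Python realization of Eq. 6 is given below (our sketch; the names gini and net_impurity are illustrative):

import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels (0 to 0.5 for a two-class node)."""
    if len(labels) == 0:
        return 0.0
    _, counts = np.unique(labels, return_counts=True)
    prob = counts / counts.sum()
    return 1.0 - np.sum(prob ** 2)

def net_impurity(X, y, split_rule):
    """Weighted Gini impurity of the two child nodes (Eq. 6).

    split_rule is a callable returning f(x); points with f(x) <= 0 go to the left child.
    """
    f_vals = np.array([split_rule(x) for x in X])
    left, right = y[f_vals <= 0], y[f_vals > 0]
    return (len(left) / len(y)) * gini(left) + (len(right) / len(y)) * gini(right)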

IV-A2 Computation of $F_U$

The objective $F_U$ is subjective in its form, since it targets the subjective notion of generating visually simple equations for the split-rule. Usually, equations with more variables and terms seem visually complicated. Taking this aspect into consideration, $F_U$ is set as the total number of non-zero exponent terms present in the overall equation of the split-rule (Eq. 1). Mathematically, this can be represented with the following equation:

$F_{U}(\mathbf{B},m,\mathbf{w}^{\ast},\mathbf{\Theta}^{\ast})=\sum_{i=1}^{p}\sum_{j=1}^{d}g(b_{ij})$,   (7)

where $g(\alpha)=1$ if $\alpha\neq 0$, and zero otherwise. Here, we use only $\mathbf{B}$ to define $F_U$, but another, more comprehensive simplicity term involving the presence or absence of modulus operators and the relative optimal weight/bias values of the rule can also be used.
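In code form, $F_U$ of Eq. 7 is simply the count of non-zero entries of the block matrix (again an illustrative sketch):

import numpy as np

def upper_objective(B):
    """F_U of Eq. 7: total number of non-zero exponents in the block matrix B."""
    return int(np.count_nonzero(B))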

IV-B Upper Level Optimization (ULGA)

A genetic algorithm (GA) is implemented to explore the search space of power rules $B_i$ and the modulus flag $m$ at the upper level. The genome is represented as a tuple $(\mathbf{B},m)$, wherein $\mathbf{B}$ is a matrix as shown in Equation 4 and $m$ assumes a Boolean value of 0 or 1.

The upper-level GA focuses on estimating a simple equation of the split-rule within a desired value of the net impurity ($F_L$) of child nodes. Thus, the optimization problem for the upper level is formulated as a single-objective constrained optimization problem as shown in Eq. 5. The constraint function $F_L(\mathbf{w},\mathbf{\Theta})|_{(\mathbf{B},m)}$ is evaluated at the lower level of our bilevel optimization framework. The threshold value $\tau_I$ indicates the desired value of the net impurity of the resultant child nodes (Eq. 6). In our experiments, we set $\tau_I$ to 0.05. As mentioned before, a solution satisfying this constraint implies the creation of purer child nodes. Minimization of the objective function $F_U$ should result in a simplistic structure of the rule, and the optimization will also reveal the key variables needed in the overall expression.

IV-B1 Custom Initialization for Upper Level GA

Minimization of the objective $F_U$ (given in Eq. 7) requires having fewer active (or non-zero) exponents in the expression of the split-rule (Eq. 1). To facilitate this, the population is initialized with the restriction of having only one active (or non-zero) exponent in the expression of the split-rule, i.e., any one of the $b_{ij}$'s in the block matrix $\mathbf{B}$ is set to a non-zero value from the user-specified set $\mathbf{E}$ and the rest of the elements of matrix $\mathbf{B}$ are set to zero. Note here that only $2d$ unique individuals ($d$ individuals with $m=0$ and $d$ individuals with $m=1$) can exist which satisfy the above-mentioned restriction. If the population size for the upper-level GA exceeds $2d$, then the remaining individuals are initialized with two non-zero active terms. As the upper-level GA progresses, incremental enhancement in rule complexity is realized through crossover and mutation operations, which are described next.
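A minimal sketch of this initialization strategy is given below (our illustration; the exact placement of the two active terms for members beyond the first $2d$ is an assumption on our part):

import numpy as np

def initialize_upper_population(pop_size, d, p=3, E=(-3, -2, -1, 0, 1, 2, 3)):
    """Seed the upper-level population with maximally simple rule structures."""
    nonzero_E = [e for e in E if e != 0]
    population = []
    for i in range(pop_size):
        B = np.zeros((p, d), dtype=int)
        if i < 2 * d:
            # First 2d members: exactly one active exponent
            B[0, i % d] = np.random.choice(nonzero_E)
            m = 0 if i < d else 1
        else:
            # Remaining members: two active exponents (placement assumed here)
            j1, j2 = np.random.choice(d, size=2, replace=False)
            B[0, j1] = np.random.choice(nonzero_E)
            B[0, j2] = np.random.choice(nonzero_E)
            m = int(np.random.randint(2))
        population.append({"B": B, "m": m})
    return population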

IV-B2 Ranking of Upper Level Solutions

Binary-tournament selection and a $(\mu+\lambda)$ survival selection strategy are implemented for the upper-level GA. The selection operators use the following hierarchical ranking criteria to perform selection:

Definition IV.1.

For two individuals $i$ and $j$ in the upper level, rank($i$) is better than rank($j$) when any of the following is true (a sketch of this comparator is given after the list):

  • $i$ and $j$ are both infeasible AND $F_L(i) < F_L(j)$,

  • $i$ is feasible (i.e., $F_L(i) \leq \tau_I$) AND $j$ is infeasible (i.e., $F_L(j) > \tau_I$),

  • $i$ and $j$ are both feasible AND $F_U(i) < F_U(j)$,

  • $i$ and $j$ are both feasible AND $F_U(i) = F_U(j)$ AND $F_L(i) < F_L(j)$,

  • $i$ and $j$ are both feasible AND $F_U(i) = F_U(j)$ AND $F_L(i) = F_L(j)$ AND $m(i) < m(j)$.
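The comparator below is a direct Python transcription of Definition IV.1 (our sketch; each individual is assumed to be a dict carrying its $F_L$, $F_U$ and $m$ values):

def is_better(ind_i, ind_j, tau_I=0.05):
    """Return True if individual i ranks better than j (Definition IV.1)."""
    feas_i = ind_i["FL"] <= tau_I
    feas_j = ind_j["FL"] <= tau_I
    if not feas_i and not feas_j:          # both infeasible: lower net impurity wins
        return ind_i["FL"] < ind_j["FL"]
    if feas_i and not feas_j:              # feasible beats infeasible
        return True
    if feas_i and feas_j:
        if ind_i["FU"] != ind_j["FU"]:     # simpler rule (fewer active terms) wins
            return ind_i["FU"] < ind_j["FU"]
        if ind_i["FL"] != ind_j["FL"]:     # then purer split wins
            return ind_i["FL"] < ind_j["FL"]
        return ind_i["m"] < ind_j["m"]     # finally, prefer a rule without the modulus
    return False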

IV-B3 Custom Crossover Operator for Upper Level GA

Before the crossover operation, population members are clustered according to their $m$-value: all individuals with $m=0$ belong to one cluster and all individuals with $m=1$ belong to another cluster. The crossover operation is then restricted to individuals belonging to the same cluster. The crossover operation in the upper level takes two block matrices from the parent population ($\mathbf{B_{P_1}}$ and $\mathbf{B_{P_2}}$) to create two child block matrices ($\mathbf{B_{C_1}}$ and $\mathbf{B_{C_2}}$). First, the rows of the block matrices of the participating parents are rearranged in descending order of the magnitude of their corresponding coefficient values (i.e., the weights $w_i$ of Eq. 2). Let the parent block matrices be represented as $\mathbf{B_{P_1}^{\prime}}$ and $\mathbf{B_{P_2}^{\prime}}$ after this rearrangement. The crossover operation is then conducted element-wise on each row of $\mathbf{B_{P_1}^{\prime}}$ and $\mathbf{B_{P_2}^{\prime}}$. For a better understanding of the crossover operation, a pseudo-code is provided in Algorithm 4.

input : Block matrices $\mathbf{B_{P_1}}$ and $\mathbf{B_{P_2}}$, and weight vectors $\mathbf{w_{P_1}}$ and $\mathbf{w_{P_2}}$.
output : Child block matrices $\mathbf{B_{C_1}}$ and $\mathbf{B_{C_2}}$.
$\mathbf{B_{P_1}^{\prime}}$ = SortRows($\mathbf{B_{P_1}}$, $\mathbf{w_{P_1}}$);
$\mathbf{B_{P_2}^{\prime}}$ = SortRows($\mathbf{B_{P_2}}$, $\mathbf{w_{P_2}}$);
$n_{rows}$ = Size($\mathbf{B_{P_1}}$, 1);
$n_{cols}$ = Size($\mathbf{B_{P_1}}$, 2);
for $i \leftarrow 1$ to $n_{rows}$ do
       for $j \leftarrow 1$ to $n_{cols}$ do
             $r$ = rand(); // random number between 0 and 1
             if $r \leq 0.5$ then
                   $\mathbf{B_{C_1}}(i,j) = \mathbf{B_{P_1}^{\prime}}(i,j)$;
                   $\mathbf{B_{C_2}}(i,j) = \mathbf{B_{P_2}^{\prime}}(i,j)$;
             else
                   $\mathbf{B_{C_1}}(i,j) = \mathbf{B_{P_2}^{\prime}}(i,j)$;
                   $\mathbf{B_{C_2}}(i,j) = \mathbf{B_{P_1}^{\prime}}(i,j)$;
             end if
       end for
end for
Algorithm 4 Crossover operation in upper level GA.

IV-B4 Custom Mutation Operator for Upper Level GA

For a given upper-level solution with block matrix $\mathbf{B}$ and modulus flag $m$, the probability with which the $b_{ij}$'s and $m$ get mutated is controlled by the parameter $p_{mut}^{U}$. From experiments, a value of $p_{mut}^{U}=1/d$ is found to work well. The mutation operation then changes the value of the exponents $b_{ij}$ and the modulus flag $m$. Let the domain of $b_{ij}$ be given by $\mathbf{E}$, which is a sorted set of finite values. Since $\mathbf{E}$ is sorted, these ordinal values can be accessed using an integer id $k$, with the id-value $k=1$ representing the smallest value and $k=n_e$ representing the largest value in set $\mathbf{E}$. In our case, $\mathbf{E}=[-3,-2,-1,0,1,2,3]$ (making $n_e=7$). The mechanism with which the mutation operator mutates the value of $b_{ij}$ for any arbitrary sorted array $\mathbf{E}$ is illustrated in Fig. 3.

Figure 3: Mutation operation for upper level GA.

Here, the red tile indicates the id-value ($k$) of $b_{ij}$ in the array $\mathbf{E}$. The orange shaded vertical bars indicate the probability distribution of mutated values. The $b_{ij}$ can be mutated to either of the $k-2$, $k-1$, $k+1$, or $k+2$ id-values with probabilities $\alpha$, $\beta\alpha$, $\beta\alpha$, and $\alpha$, respectively. The parameter $\beta$ is preferred to be greater than one and is supplied by the user. The parameter $\alpha$ can then be computed as $\alpha=\frac{1}{2(1+\beta)}$ by equating the sum of the probabilities to 1. In our experiments, we have set $\beta=3$. The value of the modulus flag $m$ is mutated randomly to either 0 or 1 with 50% probability.

In order to bias the search towards simpler rules (i.e., split-rules with a small number of non-zero $b_{ij}$'s), we introduce a parameter $p_{zero}$. The value of $p_{zero}$ indicates the probability with which a variable $b_{ij}$ participating in mutation is set to zero. In our case, we use $p_{zero}=0.75$, making $b_{ij}\rightarrow 0$ with a net probability of $p_{mut}^{U}\times p_{zero}$.
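The following sketch mutates the id-value $k$ of a single exponent according to the above distribution (our illustration; 0-based indexing is used and the clipping at the boundaries of $\mathbf{E}$ is an assumption, since Fig. 3 does not specify it):

import numpy as np

def mutate_exponent_id(k, n_e, beta=3.0, p_zero=0.75, zero_id=3):
    """Mutate the index k of an exponent b_ij within the sorted set E.

    With probability p_zero the exponent is reset to zero (zero_id is the
    position of 0 in E).  Otherwise k moves to k-2, k-1, k+1 or k+2 with
    probabilities alpha, beta*alpha, beta*alpha, alpha, where
    alpha = 1 / (2 * (1 + beta)).
    """
    if np.random.rand() < p_zero:
        return zero_id
    alpha = 1.0 / (2.0 * (1.0 + beta))
    offsets = np.array([-2, -1, 1, 2])
    probs = np.array([alpha, beta * alpha, beta * alpha, alpha])
    new_k = k + np.random.choice(offsets, p=probs)
    return int(np.clip(new_k, 0, n_e - 1))   # clip to valid ids (assumed behaviour)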

Duplicates in the child population are then detected and changed so that the entire child population comprises unique individuals. This is done to ensure the creation of novel rules and the preservation of diversity. Details about this procedure can be found in the supplementary document.

A $(\mu+\lambda)$ survival selection operation is then applied on the combined child and parent population. The selected elites then go to the next generation as parents and the process repeats. Parameter settings for the upper-level GA are provided in the supplementary document.

IV-C Lower Level Optimization (LLGA)

For a given population member of the upper-level GA (with block matrix $\mathbf{B}$ and modulus flag $m$ as variables), the lower-level optimization determines the coefficients $w_i$ (Eq. 2) and biases $\theta_i$ such that $F_L(\mathbf{w},\mathbf{\Theta})|_{(\mathbf{B},m)}$ (Eq. 6) is minimized. The lower-level optimization problem can be stated as follows:

Minimize: $F_L(\mathbf{w},\mathbf{\Theta})|_{(\mathbf{B},m)}$,
subject to: $\mathbf{w}\in[-1,1]^{p}$, $\mathbf{\Theta}\in[-1,1]^{m+1}$.

IV-C1 Custom Initialization for Lower Level GA

In a bilevel optimization, the lower-level problem must be solved for every upper-level variable vector, thereby requiring a computationally fast algorithm for the lower-level problem. Instead of creating every population member randomly, we use the mixed dipole concept [38, 9, 39] to facilitate faster convergence towards optimal (or near-optimal) values of $\mathbf{w}$ and $\mathbf{\Theta}$. Let $\mathbf{x_A}$ and $\mathbf{x_B}$ be data-points belonging to Class-1 and Class-2, respectively. Then, the weights $\mathbf{w_h}$ and bias $\theta_h$ corresponding to the hyperplane which separates $\mathbf{x_A}$ and $\mathbf{x_B}$, and is orthogonal to the vector $\mathbf{x_A}-\mathbf{x_B}$, can be given by the following equations:

$\mathbf{w_h} = \mathbf{x_A} - \mathbf{x_B}$,   (9)
$\theta_h = \Delta\times(\mathbf{w_h}\cdot\mathbf{x_B}) + (1-\Delta)\times(\mathbf{w_h}\cdot\mathbf{x_A})$,   (10)

where $\Delta$ is a random number between 0 and 1. This is pictorially represented in Fig. 4.

Figure 4: Mixed dipole ($\mathbf{x_A}$, $\mathbf{x_B}$) and a hyperplane $H(\mathbf{w_h},\theta_h)$.

The dipole pairs ($\mathbf{x_A}$, $\mathbf{x_B}$) for all individuals in the initial population pool are chosen randomly from the provided training dataset. The variables $\mathbf{w}$ and $\mathbf{\Theta}$ for an individual are then initialized as mentioned below:

$\mathbf{w} = \mathbf{w_h}$,   (11)
$\theta_1 = \theta_h$,   (12)
$\theta_2 = \min(\Delta, 1-\Delta)$ if $m=1$.   (13)

This method of initialization is found to be more efficient than randomly initializing $\mathbf{w}$ and $\mathbf{\Theta}$. This is because, for the regions beyond the convex hull bounding the training dataset, the landscape of $F_L$ is flat, which makes any optimization algorithm stagnate.
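A sketch of this dipole-based initialization is given below (our illustration). Since $\mathbf{w}$ weights the $p$ power-law terms, the sketch assumes that the dipole points are taken in the transformed space of $B_i$ values of the two datapoints; this dimensional interpretation is our assumption:

import numpy as np

def dipole_init(B_class1, B_class2, m):
    """Initialize (w, Theta) from a randomly chosen mixed dipole (Eqs. 9-13).

    B_class1, B_class2 : arrays of transformed datapoints (rows of B_i values)
                         belonging to Class-1 and Class-2, respectively.
    """
    b_A = B_class1[np.random.randint(len(B_class1))]
    b_B = B_class2[np.random.randint(len(B_class2))]
    delta = np.random.rand()                                           # Delta in (0, 1)
    w = b_A - b_B                                                      # Eq. 9 (and Eq. 11)
    theta1 = delta * np.dot(w, b_B) + (1 - delta) * np.dot(w, b_A)     # Eq. 10 (and Eq. 12)
    theta = [theta1]
    if m == 1:
        theta.append(min(delta, 1.0 - delta))                          # Eq. 13
    return w, np.array(theta)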

IV-C2 Selection, Crossover, and Mutation for Lower Level GA

Binary tournament selection, SBX crossover [40] and polynomial mutation [41] are used to create the offspring solutions. A $(\mu+\lambda)$ survival selection strategy is then adopted to preserve elites.
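For completeness, a minimal sketch of the textbook forms of these two operators is given below (the distribution indices and per-variable mutation probability are illustrative values; the actual settings used in this work are listed in the supplementary document):

import numpy as np

def sbx_crossover(p1, p2, eta_c=15.0, low=-1.0, high=1.0):
    """Simulated binary crossover (SBX) on two real-valued parent vectors."""
    u = np.random.rand(len(p1))
    beta = np.where(u <= 0.5,
                    (2.0 * u) ** (1.0 / (eta_c + 1.0)),
                    (1.0 / (2.0 * (1.0 - u))) ** (1.0 / (eta_c + 1.0)))
    c1 = 0.5 * ((1.0 + beta) * p1 + (1.0 - beta) * p2)
    c2 = 0.5 * ((1.0 - beta) * p1 + (1.0 + beta) * p2)
    return np.clip(c1, low, high), np.clip(c2, low, high)

def polynomial_mutation(x, eta_m=20.0, p_mut=0.5, low=-1.0, high=1.0):
    """Polynomial mutation applied variable-wise with probability p_mut."""
    y = np.array(x, dtype=float)
    for i in range(len(y)):
        if np.random.rand() < p_mut:
            u = np.random.rand()
            if u < 0.5:
                delta = (2.0 * u) ** (1.0 / (eta_m + 1.0)) - 1.0
            else:
                delta = 1.0 - (2.0 * (1.0 - u)) ** (1.0 / (eta_m + 1.0))
            y[i] = np.clip(y[i] + delta * (high - low), low, high)
    return y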

IV-C3 Termination Criteria for Lower level GA

The lower-level GA is terminated when a maximum of 50 generations is reached or when the change in the lower-level objective value ($F_L$) is less than 0.01% over the past 10 consecutive generations.

Other parameter values for the lower-level GA can be found in the supplementary document. Results from the ablation studies on the LLGA and the overall bilevel GA are provided in the supplementary document to conceptually demonstrate the efficacy of the proposed approach.

V Overall Tree Induction and Pruning

Once the split-rule for a conditional node is determined using the above bilevel optimization procedure, the resulting child nodes are checked with respect to the following termination criteria:

  • Depth of the node > maximum allowable depth,

  • Number of datapoints in the node $< N_{min}$, and

  • Node impurity $\leq \tau_{min}$.

If any of the above criteria is satisfied, then that node is declared a terminal leaf node. Otherwise, the node is flagged as an additional conditional node. It then undergoes another split (using a split-rule which is derived by running the proposed bilevel GA on the data present in the node) and the process is repeated. This process is illustrated in Algorithm 1. This eventually produces the main nonlinear decision tree $NLDT_{main}$. The fully grown decision tree then undergoes pruning to produce the pruned decision tree $NLDT_{pruned}$.
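A hedged sketch of this check (corresponding to the TerminationSatisfied subroutine of Algorithm 1) is shown below; the threshold values are illustrative, the node attributes depth, data and labels are assumed, and gini() is the helper sketched in Section IV-A1:

def termination_satisfied(node, max_depth=3, n_min=10, tau_min=0.05):
    """Return True if the node should be declared a leaf (illustrative thresholds)."""
    return (node.depth > max_depth              # maximum allowable depth exceeded
            or len(node.data) < n_min           # too few datapoints left
            or gini(node.labels) <= tau_min)    # node already (nearly) pure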

During the pruning phase, splits after the root node are systematically removed as long as the drop in training accuracy of the pruned tree does not exceed a pre-specified threshold value of $\tau_{prune}=3\%$ (set here after trial-and-error runs). This makes the resultant tree topologically less complex and provides better generalizability. In subsequent sections, we provide results on the final NLDT obtained after pruning (i.e., $NLDT_{pruned}$), unless otherwise specified.
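A sketch of this pruning loop is given below (our illustration; the helpers accuracy, conditional_nodes_bottom_up, collapse_to_leaf and restore are hypothetical placeholders, and the node-visiting order of the actual implementation may differ):

def prune_nldt(tree, X_train, y_train, tau_prune=0.03):
    """Collapse conditional nodes while the loss in training accuracy stays within tau_prune."""
    base_acc = accuracy(tree, X_train, y_train)          # accuracy of the unpruned tree
    for node in conditional_nodes_bottom_up(tree):       # try the deepest splits first
        backup = collapse_to_leaf(node)                  # replace the split by a majority-class leaf
        if base_acc - accuracy(tree, X_train, y_train) > tau_prune:
            restore(node, backup)                        # revert: the split was needed
    return tree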

VI Results

This section summarizes results obtained on four customized test problems, two real-world classification problems, three real-world optimization problems, eight multi-objective problems involving 30-500 features, and one multi-class problem. For each experiment, 50 runs are performed with random training and testing sets. The dataset is split into training and testing sets with a ratio of 7:3. The mean and standard deviation of training and testing accuracy scores across all 50 runs are evaluated to gauge the performance of the classifier. Statistics about the number of conditional nodes in the NLDT (given by #Rules), the average number of active terms in the split-rules of the decision tree (i.e., $F_U$/Rule), and the rule length (the total number of active terms appearing in the entire NLDT) are provided to quantify the simplicity and interpretability of the classifier (the smaller this number, the simpler the classifier). The comparison is made against the classical CART tree solution [4] and SVM [42] results. For SVM, the Gaussian kernel is used, and along with overall accuracy scores, statistics about the number of support vectors (which also equals the rule length) are reported. It is worth noting that the decision boundary computed by SVM can be expressed with a single equation; however, the length of the equation usually increases with the number of support vectors [43].

For each set of experiments, the best scores are highlighted in bold and the Wilcoxon signed-rank test is performed on the overall testing accuracy scores to check the statistical similarity with other classifiers in terms of classification accuracy. Statistically similar classifiers (which have their $p$-value greater than 0.05) are italicized.
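As an illustration of this evaluation protocol (not the authors' script), the snippet below runs 50 random 70/30 splits with scikit-learn baselines on the WDBC dataset of Section VI-C and applies the Wilcoxon signed-rank test; the NLDT itself is omitted here:

import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)           # WDBC used as a stand-in dataset
svm_acc, cart_acc = [], []
for seed in range(50):                                # 50 random 70/30 train/test splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=seed)
    svm_acc.append(SVC(kernel="rbf").fit(X_tr, y_tr).score(X_te, y_te))
    cart_acc.append(DecisionTreeClassifier(random_state=seed).fit(X_tr, y_tr).score(X_te, y_te))

print("SVM : %.2f +/- %.2f" % (100 * np.mean(svm_acc), 100 * np.std(svm_acc)))
print("CART: %.2f +/- %.2f" % (100 * np.mean(cart_acc), 100 * np.std(cart_acc)))
stat, p = wilcoxon(svm_acc, cart_acc)                 # statistically similar if p > 0.05
print("Wilcoxon p-value:", p)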

VI-A Customized Datasets: DS1 to DS4

The compilation of results of 50 runs on datasets DS1-DS4 (see supplementary for details regarding customized datasets) is presented in Table I. The results clearly indicate the superiority of the proposed nonlinear, bilevel based decision tree approach over classical CART and SVM based classification methods.

TABLE I: Results on DS1 to DS4 datasets. NLDT is not only the most accurate, it also involves only a few features.
Dataset | Method | Training Accuracy | Testing Accuracy | p-value | #Rules | $F_U$/Rule | Rule Length
DS1 | NLDT | 99.78 ± 0.51 | {\bf 99.55 ± 1.08} | - | 1.0 ± 0.0 | 2.3 ± 0.6 | {\bf 2.3 ± 0.6}
DS1 | CART | 97.99 ± 0.96 | 90.32 ± 4.06 | 7.34e-10 | 14.5 ± 1.7 | 1.0 ± 0.0 | 14.5 ± 1.7
DS1 | SVM | 98.67 ± 0.31 | 98.50 ± 1.09 | 1.29e-05 | 1.0 ± 0.0 | 71.5 ± 3.3 | 71.5 ± 3.3
DS2 | NLDT | 99.80 ± 0.40 | {\bf 99.44 ± 0.87} | - | 1.0 ± 0.0 | 2.3 ± 0.7 | {\bf 2.3 ± 0.7}
DS2 | CART | 98.80 ± 0.50 | 95.43 ± 1.50 | 5.90e-10 | 11.0 ± 1.4 | 1.0 ± 0.0 | 11.0 ± 1.4
DS2 | SVM | 96.95 ± 0.73 | 95.24 ± 0.16 | 2.43e-10 | 1.0 ± 0.0 | 44.7 ± 1.9 | 44.7 ± 1.9
DS3 | NLDT | 99.91 ± 0.35 | {\bf 99.77 ± 0.67} | - | 1.0 ± 0.0 | 2.2 ± 0.5 | {\bf 2.2 ± 0.5}
DS3 | CART | 99.42 ± 0.57 | 95.00 ± 2.35 | 7.00e-10 | 11.5 ± 1.3 | 1.0 ± 0.0 | 11.5 ± 1.3
DS3 | SVM | 98.96 ± 0.42 | 98.38 ± 1.15 | 7.16e-09 | 1.0 ± 0.0 | 62.1 ± 3.4 | 62.1 ± 3.4
DS4 | NLDT | 99.34 ± 1.21 | {\bf 98.88 ± 1.65} | - | 1.2 ± 0.4 | 2.6 ± 0.6 | {\bf 3.1 ± 1.4}
DS4 | CART | 96.98 ± 1.19 | 88.68 ± 3.60 | 7.31e-10 | 31.3 ± 4.2 | 1.0 ± 0.0 | 31.3 ± 4.2
DS4 | SVM | 98.19 ± 0.44 | 97.28 ± 1.19 | 1.31e-05 | 1.0 ± 0.0 | 89.8 ± 3.3 | 89.8 ± 3.3

The bilevel approach finds a single rule with two to three appearances of variables in the rule, whereas CART requires 11 to 31 rules involving a single variable per rule, and SVM requires only one rule but involving 44 to 90 appearances of variables in the rule. Moreover, the bilevel approach achieves this with the best accuracy. Thus, the classifiers obtained by the bilevel approach are simpler and simultaneously more accurate.

VI-B Breast Cancer Wisconsin Dataset

This problem was originally proposed in 1991. It has two classes: benign and malignant, with 458 (or 65.5%) data-points belonging to the benign class and 241 (or 34.5%) belonging to the malignant class. Each data-point is represented with 10 attributes. Results are tabulated in Table II. The bilevel method and SVM had similar performance ($p$-value > 0.05), but SVM requires about 90 variable appearances in its rule, compared to only about 6 variable appearances in the single rule obtained by the bilevel approach. The proposed approach outperformed the classifiers generated using the techniques proposed in [12, 17, 16, 13, 14] in terms of both accuracy and comprehensibility/compactness. The NLDT classifier obtained by a specific run of the bilevel approach has five variable appearances and is presented in Fig. 5.

TABLE II: Results on breast cancer Wisconsin dataset. Despite having a high accuracy, SVM’s rule involves about 90 repeated entries of features.
Method | Training Accuracy | Testing Accuracy | p-value | #Rules | $F_U$/Rule | Rule Length
NLDT | 98.07 ± 0.39 | {\it 96.50 ± 1.16} | {\it 0.308} | {\bf 1.0 ± 0.0} | 6.4 ± 1.7 | {\bf 6.4 ± 1.7}
CART | 98.21 ± 0.49 | 94.34 ± 1.92 | 8.51e-09 | 11.6 ± 2.4 | 1.00 ± 0.00 | 11.6 ± 2.4
SVM | 97.65 ± 0.39 | {\bf 96.64 ± 1.16} | - | 1.00 ± 0.00 | 89.4 ± 14.8 | 89.4 ± 14.8
Figure 5: NLDT classifier for breast cancer Wisconsin dataset.

Figure 6 provides a $B$-space visualization of the decision boundary obtained by the bilevel approach, which is able to identify two nonlinear $B$-terms involving variables to split the data linearly and obtain a high accuracy.

Figure 6: B-space plot for Wisconsin breast cancer dataset.

VI-C Wisconsin Diagnostic Breast Cancer Dataset (WDBC)

This dataset is an extension of the dataset of the previous section. It has 30 features, with a total of 356 datapoints belonging to the benign class and 212 to the malignant class. Results shown in Table III indicate that the bilevel-based NLDT is able to outperform the standard CART and SVM algorithms. The NLDT generated by a run of the bilevel approach requires seven out of 30 variables (or features) and is shown in Fig. 7. It is almost as accurate as that obtained by SVM and is more interpretable (seven versus about 107 variable appearances).

TABLE III: Results on WDBC dataset. Despite having a high accuracy, SVM’s rule involves about 107 repeated entries of features.
Method | Training Accuracy | Testing Accuracy | p-value | #Rules | $F_U$/Rule | Rule Length
NLDT | 98.24 ± 0.64 | 96.20 ± 1.49 | 4.65e-05 | {\bf 1.0 ± 0.0} | 9.2 ± 4.1 | {\bf 9.2 ± 4.1}
CART | 98.76 ± 0.60 | 92.11 ± 2.07 | 1.09e-09 | 10.8 ± 2.1 | 1.00 ± 0.00 | 10.8 ± 2.1
SVM | 98.65 ± 0.37 | {\bf 97.39 ± 1.37} | - | 1.00 ± 0.00 | 106.7 ± 6.6 | 106.7 ± 6.6
Figure 7: NLDT classifier for WDBC dataset.

The $B$-space plot (Fig. 8) shows an efficient discovery of $B$-functions to linearly classify the supplied data with a high accuracy.

Figure 8: B-space plot for WDBC dataset.

VI-D Real-World Auto-Industry Problem (RW-problem)

This real-world engineering design optimization problem has 36 variables, eight constraints, and one objective function. The dataset created was highly biased, with 188 points belonging to the good class and 996 belonging to the bad class. Results obtained using the bilevel GA are shown in Table IV. The proposed algorithm is able to achieve nearly 90% accuracy scores while requiring only two split-rules. The best-performing NLDT has a testing accuracy of 93.82% and is shown in Fig. 9.

SVM performed the best in terms of accuracy, but the resulting classifier is complicated, with about 241 variable appearances in the rule. In contrast, the bilevel GA requires only about two rules, each having only about 10 variable appearances, to achieve a slightly less accurate classification. CART requires about 30 rules with a deep DT, making the classifier difficult to interpret easily.

TABLE IV: Results on the real-world auto-industry problem.
Method | Training Accuracy | Testing Accuracy | p-value | #Rules | $F_U$/Rule | Rule Length
NLDT | 94.36 ± 1.47 | 89.93 ± 2.04 | 4.35e-09 | 1.9 ± 0.5 | 10.0 ± 2.9 | {\bf 18.2 ± 5.9}
CART | 98.00 ± 0.55 | 91.13 ± 1.32 | 9.80e-08 | 29.6 ± 3.9 | 1.00 ± 0.00 | 29.6 ± 3.9
SVM | 94.98 ± 0.73 | {\bf 93.24 ± 1.38} | - | 1.00 ± 0.00 | 240.8 ± 9.4 | 240.8 ± 9.4
Figure 9: NLDT classifier for the auto-industry problem. The first split-rule uses five variables and the second one uses 12.

VI-E Results on Multi-Objective Optimization Problems

Here we evaluate the performance of NLDT on two real-world multi-objective problems, welded-beam design and 2D truss design, and on modified versions of the ZDT [44] and DTLZ [45] problems, m-ZDT and m-DTLZ. The data generation process for the multi-objective datasets is provided in the supplementary document. The complexity of these datasets is controlled by two parameters, $\tau_{rank}$ and $g_{ref}$, an explanation of which is provided in the supplementary document. Results for multi-objective problems are compiled in Table V. Due to space limitations, we report scores of only the testing accuracy, the total number of rules, and $F_U$/Rule.

VI-E1 Truss 2D

The Truss 2D problem [46] is a three-variable problem. Clearly, the bilevel NLDT has the best accuracy and fewer variable appearances (meaning easier interpretability) compared to the CART and SVM generated classifiers, as shown in Table V. Fig. 10(a) shows a 100% correct classifier with a single rule obtained by our bilevel NLDT, whereas Fig. 10(b) shows a typical CART classifier with 19 rules, making it difficult to interpret easily.

(a) NLDT classifier requiring only one rule.
(b) CART classifier requiring 11 rules.
Figure 10: Comparison of bilevel and CART methods on the truss problem with the $\tau_{rank}=6$ dataset.

VI-E2 Welded Beam Design Problem

This bi-objective optimization problem has four variables and four constraints [46]. The NLDT (having a single rule) obtained with $g_{ref}=10$ is shown in Fig. 11. Interestingly, the bilevel NLDTs have similar accuracy to that of CART and SVM with a very low complexity, as shown in Table V.

Figure 11: NLDT classifier requiring only one rule for $g_{ref}=10$ for the welded beam problem.

VI-E3 m-ZDT and m-DTLZ Problems

Experiments conducted on datasets involving 500 features (see Table V) and on two- and three-objective optimization problems confirm the scalability of the proposed approach. In all these problems, traditional methods like CART and SVM find it difficult to conduct a proper classification task. The provision of allowing controlled nonlinearity at each conditional node gives the proposed NLDT approach the necessary flexibility to make an appropriate overall classification. Additional results and visualizations of some of the obtained NLDT classifiers are provided in the supplementary document.

TABLE V: Results on multi-objective problems for classifying dominated and non-dominated solutions.
Method | Testing Acc. | # Rules | $F_U$/Rule

Welded Beam design with $g_{ref}=10$ and $\tau_{rank}=3$:
NLDT | 98.58 ± 1.13 | {\bf 1.0 ± 0.0} | 3.9 ± 1.0
CART | 97.72 ± 1.04 | 8.42 ± 1.42 | 1.00 ± 0.00
SVM | {\bf 98.97 ± 0.54} | 1.00 ± 0.00 | 126.0 ± 6.8

m-ZDT1-2-30:
NLDT | 98.96 ± 0.59 | 1.80 ± 0.60 | 4.47 ± 2.21
CART | 94.78 ± 1.02 | 26.42 ± 1.89 | 1.00 ± 0.00
SVM | 82.84 ± 2.77 | 1.00 ± 0.00 | 1192.38 ± 11.34

m-ZDT2-2-30:
NLDT | 98.96 ± 0.57 | 1.92 ± 0.56 | 4.37 ± 1.82
CART | 94.72 ± 0.89 | 27.80 ± 2.19 | 1.00 ± 0.00
SVM | 98.44 ± 0.57 | 1.00 ± 0.00 | 315.54 ± 6.12

m-DTLZ1-3-30:
NLDT | 96.65 ± 2.86 | 3.12 ± 0.59 | 6.14 ± 2.18
CART | 71.74 ± 2.49 | 93.88 ± 3.23 | 1.00 ± 0.00
SVM | 45.86 ± 1.28 | 1.00 ± 0.00 | 1381.56 ± 8.25

m-DTLZ2-3-30:
NLDT | 97.22 ± 2.25 | 3.02 ± 0.62 | 5.81 ± 1.95
CART | 63.41 ± 7.59 | 100.82 ± 6.11 | 1.00 ± 0.00
SVM | 49.61 ± 1.62 | 1.00 ± 0.00 | 1367.42 ± 8.44

Truss 2D with $\tau_{rank}=6$ and $g_{ref}=1$:
NLDT | {\bf 99.54 ± 0.75} | 1.20 ± 0.50 | 2.90 ± 0.50
CART | 98.33 ± 1.10 | 11.06 ± 3.15 | 1.00 ± 0.00
SVM | {\it 99.46 ± 0.50} | 1.00 ± 0.00 | 62.5 ± 2.90

m-ZDT1-2-500:
NLDT | 98.93 ± 0.60 | 1.78 ± 0.54 | 5.66 ± 3.23
CART | 93.48 ± 0.94 | 20.58 ± 1.39 | 1.00 ± 0.00
SVM | {\bf 100.00 ± 0.00} | {\bf 1.00 ± 0.00} | 240.88 ± 4.62

m-ZDT2-2-500:
NLDT | 98.88 ± 0.71 | 1.80 ± 0.69 | 5.30 ± 2.45
CART | 93.95 ± 1.33 | 20.98 ± 1.98 | 1.00 ± 0.00
SVM | 100.00 ± 0.00 | 1.00 ± 0.00 | 248.06 ± 4.37

m-DTLZ1-3-500:
NLDT | 93.77 ± 4.24 | 3.00 ± 0.45 | 7.61 ± 2.36
CART | 58.26 ± 7.37 | 104.88 ± 5.49 | 1.00 ± 0.00
SVM | 47.13 ± 1.19 | 1.00 ± 0.00 | 1382.38 ± 8.84

m-DTLZ2-3-500:
NLDT | 95.33 ± 4.46 | 3.04 ± 0.56 | 7.28 ± 3.16
CART | 70.04 ± 3.46 | 96.08 ± 3.93 | 1.00 ± 0.00
SVM | 47.01 ± 1.20 | 1.00 ± 0.00 | 1382.62 ± 8.46

VI-F Multi-class Iris Dataset

Though the purpose of the current study was to develop an efficient algorithm for two-class problems, its extension to multi-class problems is possible. The iris dataset is a popular multi-class dataset involving three classes and four features, and is used here as a benchmark to check the feasibility of extending our method. The NLDT classifier, producing 94.8% classification accuracy on test data and 98.6% accuracy on training data, is presented in Figure 12 (see the supplementary document for more details). Notice that, of the four features, feature 1 ($x_1$) is not required for the classification task. Statistics on results obtained across 50 runs are provided in the supplementary document. We are currently investigating a more efficient extension of our approach to multi-class problems.

Refer to caption
Figure 12: NLDT classifier for the three-class iris dataset. The orange node is for datapoints belonging to iris-versicolor, the yellow node for iris-virginica, and the green node indicates datapoints belonging to the iris-setosa class.

VII Conclusions and Future work

In this paper, we have addressed the problem of generating interpretable and accurate nonlinear decision trees for classification. Split-rules at the conditional nodes of a decision tree have been represented as a weighted sum of power-laws of feature variables. A provision for integrating a modulus operation into the split-rule has also been formulated, particularly to handle sandwiched classes of datapoints. The proposed algorithm for deriving the split-rule at a given conditional node of a decision tree has two levels of hierarchy, wherein the upper level evolves the power-law structures and operates on a discrete variable space, while the lower level derives optimal weights and biases by searching a continuous variable space to minimize the net impurity of the child nodes after the split. The efficacy of the upper-level and lower-level evolutionary algorithms has been tested on customized datasets. The lower-level GA is able to robustly and efficiently determine weights and biases that minimize the net impurity. A test conducted on an imbalanced dataset has also revealed the superiority of the proposed lower-level GA in achieving high classification accuracy without relying on any synthetic data. Ablation studies conducted on the upper-level GA (see supplementary document) also demonstrate the efficacy of the algorithm in obtaining simpler split-rules without using any prior knowledge. Results obtained on standard classification benchmark problems and a real-world single-objective problem have amply demonstrated the efficacy and usability of the proposed algorithm.

The classification problem has been extended to discriminate Pareto data from non-Pareto data in multi-objective problems. Experiments conducted on multi-objective engineering problems have resulted in simple polynomial rules which can assist in determining whether a given solution is Pareto-optimal or not. These rules can then be used as design principles for generating future optimal designs. As future studies, we plan to integrate non-polynomial and generic terms into the expression of the split-rules. The proposed bilevel framework can also be tested on regression and reinforcement-learning tasks. The upper-level problem can be converted to a bi-objective problem for optimizing both $F_U$ and $F_L$ simultaneously, so that a set of trade-off solutions can be obtained in a single run. Nevertheless, this proof-of-principle study on the use of a customized bilevel optimization method for classification tasks encourages us to launch such future studies.

Acknowledgment

An initial study of this work was supported by General Motors Corporation, USA.

References

  • [1] S. B. Kotsiantis, I. Zaharakis, and P. Pintelas, “Supervised machine learning: A review of classification techniques,” Emerging artificial intelligence applications in computer engineering, vol. 160, pp. 3–24, 2007.
  • [2] J. S. Bergstra, R. Bardenet, Y. Bengio, and B. Kégl, “Algorithms for hyper-parameter optimization,” in Advances in neural information processing systems, 2011, pp. 2546–2554.
  • [3] C. Thornton, F. Hutter, H. H. Hoos, and K. Leyton-Brown, “Auto-weka: Combined selection and hyperparameter optimization of classification algorithms,” in Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining.   ACM, 2013, pp. 847–855.
  • [4] L. Breiman, Classification and regression trees.   Routledge, 2017.
  • [5] D. Heath, S. Kasif, and S. Salzberg, “Induction of oblique decision trees,” in IJCAI, vol. 1993, 1993, pp. 1002–1007.
  • [6] S. K. Murthy, S. Kasif, and S. Salzberg, “A system for induction of oblique decision trees,” Journal of artificial intelligence research, vol. 2, pp. 1–32, 1994.
  • [7] S. K. Murthy, S. Kasif, S. Salzberg, and R. Beigel, “Oc1: A randomized algorithm for building oblique decision trees,” in Proceedings of AAAI, vol. 93.   Citeseer, 1993, pp. 322–327.
  • [8] E. Cantú-Paz and C. Kamath, “Using evolutionary algorithms to induce oblique decision trees,” in Proceedings of the 2nd Annual Conference on Genetic and Evolutionary Computation.   Morgan Kaufmann Publishers Inc., 2000, pp. 1053–1060.
  • [9] M. Kretowski, “An evolutionary algorithm for oblique decision tree induction,” in International Conference on Artificial Intelligence and Soft Computing.   Springer, 2004, pp. 432–437.
  • [10] A. Ittner and M. Schlosser, “Non-linear decision trees-ndt,” in ICML.   Citeseer, 1996, pp. 252–257.
  • [11] K. P. Bennett and J. A. Blue, “A support vector machine approach to decision trees,” in Neural Networks Proceedings, 1998. IEEE World Congress on Computational Intelligence. The 1998 IEEE International Joint Conference on, vol. 3.   IEEE, 1998, pp. 2396–2401.
  • [12] H. Núñez, C. Angulo, and A. Català, “Rule extraction from support vector machines.” in Esann, 2002, pp. 107–112.
  • [13] G. Fung, S. Sandilya, and R. B. Rao, “Rule extraction from linear support vector machines,” in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining.   ACM, 2005, pp. 32–40.
  • [14] M. Craven and J. W. Shavlik, “Extracting tree-structured representations of trained networks,” in Advances in neural information processing systems, 1996, pp. 24–30.
  • [15] P. M. Murphy and M. J. Pazzani, “Id2-of-3: Constructive induction of m-of-n concepts for discriminators in decision trees,” in Machine Learning Proceedings 1991.   Elsevier, 1991, pp. 183–187.
  • [16] U. Johansson, R. König, and L. Niklasson, “The truth is in there-rule extraction from opaque models using genetic programming.” in FLAIRS Conference.   Miami Beach, FL, 2004, pp. 658–663.
  • [17] D. Martens, B. Baesens, T. Van Gestel, and J. Vanthienen, “Comprehensible credit scoring models using rule extraction from support vector machines,” European journal of operational research, vol. 183, no. 3, pp. 1466–1476, 2007.
  • [18] H. Ishibuchi, T. Nakashima, and T. Murata, “Three-objective genetics-based machine learning for linguistic rule extraction,” Information Sciences, vol. 136, no. 1-4, pp. 109–133, 2001.
  • [19] Y. Jin and B. Sendhoff, “Pareto-based multiobjective machine learning: An overview and case studies,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 38, no. 3, pp. 397–415, 2008.
  • [20] S. Nguyen, M. Zhang, and K. C. Tan, “Surrogate-assisted genetic programming with simplified models for automated design of dispatching rules,” IEEE transactions on cybernetics, vol. 47, no. 9, pp. 2951–2965, 2016.
  • [21] K. Nag and N. R. Pal, “A multiobjective genetic programming-based ensemble for simultaneous feature selection and classification,” IEEE transactions on cybernetics, vol. 46, no. 2, pp. 499–510, 2015.
  • [22] J. M. Luna, J. R. Romero, C. Romero, and S. Ventura, “On the use of genetic programming for mining comprehensible rules in subgroup discovery,” IEEE transactions on cybernetics, vol. 44, no. 12, pp. 2329–2341, 2014.
  • [23] A. Lensen, B. Xue, and M. Zhang, “Genetic programming for evolving a front of interpretable models for data visualization,” IEEE Transactions on Cybernetics, 2020.
  • [24] H. Guo and A. K. Nandi, “Breast cancer diagnosis using genetic programming generated feature,” Pattern Recognition, vol. 39, no. 5, pp. 980–987, 2006.
  • [25] M. Muharram and G. D. Smith, “Evolutionary constructive induction,” IEEE transactions on knowledge and data engineering, vol. 17, no. 11, pp. 1518–1528, 2005.
  • [26] M. Shirasaka, Q. Zhao, O. Hammami, K. Kuroda, and K. Saito, “Automatic design of binary decision trees based on genetic programming,” in Proc. The Second Asia-Pacific Conference on Simulated Evolution and Learning (SEAL’98.   Citeseer, 1998.
  • [27] S. Haruyama and Q. Zhao, “Designing smaller decision trees using multiple objective optimization based gps,” in IEEE International Conference on Systems, Man and Cybernetics, vol. 6.   IEEE, 2002, pp. 5–pp.
  • [28] C.-S. Kuo, T.-P. Hong, and C.-L. Chen, “Applying genetic programming technique in classification trees,” Soft Computing, vol. 11, no. 12, pp. 1165–1172, 2007.
  • [29] M. C. Bot, “Improving induction of linear classification trees with genetic programming,” in Proceedings of the 2nd Annual Conference on Genetic and Evolutionary Computation, 2000, pp. 403–410.
  • [30] R. E. Marmelstein and G. B. Lamont, “Pattern classification using a hybrid genetic program-decision tree approach,” Genetic Programming, pp. 223–231, 1998.
  • [31] E. M. Mugambi, A. Hunter, G. Oatley, and L. Kennedy, “Polynomial-fuzzy decision tree structures for classifying medical data,” in International Conference on Innovative Techniques and Applications of Artificial Intelligence.   Springer, 2003, pp. 155–167.
  • [32] A. Cano, A. Zafra, and S. Ventura, “An interpretable classification rule mining algorithm,” Information Sciences, vol. 240, pp. 1–20, 2013.
  • [33] M. C. Bot and W. B. Langdon, “Application of genetic programming to induction of linear classification trees,” in European Conference on Genetic Programming.   Springer, 2000, pp. 247–258.
  • [34] J. Eggermont, J. N. Kok, and W. A. Kosters, “Genetic programming for data classification: Partitioning the search space,” in Proceedings of the 2004 ACM symposium on Applied computing, 2004, pp. 1001–1005.
  • [35] K. C. Tan, A. Tay, T. H. Lee, and C. Heng, “Mining multiple comprehensible classification rules using genetic programming,” in Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat. No. 02TH8600), vol. 2.   IEEE, 2002, pp. 1302–1307.
  • [36] I. De Falco, A. Della Cioppa, and E. Tarantino, “Discovering interesting classification rules with genetic programming,” Applied Soft Computing, vol. 1, no. 4, pp. 257–269, 2002.
  • [37] A. Sinha, P. Malo, and K. Deb, “A review on bilevel optimization: From classical to evolutionary approaches and applications,” IEEE Transactions on Evolutionary Computation, vol. 22, no. 2, pp. 276–295, 2017.
  • [38] L. Bobrowski and M. Kretowski, “Induction of multivariate decision trees by using dipolar criteria,” in European Conference on Principles of Data Mining and Knowledge Discovery.   Springer, 2000, pp. 331–336.
  • [39] M. Kretowski and M. Grześ, “Global induction of oblique decision trees: an evolutionary approach,” in Intelligent Information Processing and Web Mining.   Springer, 2005, pp. 309–318.
  • [40] K. Deb and R. B. Agrawal, “Simulated binary crossover for continuous search space,” Complex systems, vol. 9, no. 2, pp. 115–148, 1995.
  • [41] K. Deb, Multi-objective optimization using evolutionary algorithms.   John Wiley & Sons, 2001, vol. 16.
  • [42] V. Vapnik, The nature of statistical learning theory.   Springer science & business media, 2013.
  • [43] J.-P. Vert, K. Tsuda, and B. Schölkopf, “A primer on kernel methods,” Kernel methods in computational biology, vol. 47, pp. 35–70, 2004.
  • [44] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan, “A fast and elitist multiobjective genetic algorithm: Nsga-ii,” IEEE transactions on evolutionary computation, vol. 6, no. 2, pp. 182–197, 2002.
  • [45] K. Deb, L. Thiele, M. Laumanns, and E. Zitzler, “Scalable multi-objective optimization test problems,” in Proceedings of the 2002 Congress on Evolutionary Computation. CEC’02 (Cat. No. 02TH8600), vol. 1.   IEEE, 2002, pp. 825–830.
  • [46] K. Deb and A. Srinivasan, “Innovization: Innovating design principles through optimization,” in Proceedings of the 8th annual conference on Genetic and evolutionary computation.   ACM, 2006, pp. 1629–1636.
  • [47] J. Blank and K. Deb, “pymoo - Multi-objective Optimization in Python,” https://pymoo.org.
  • [48] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “Smote: synthetic minority over-sampling technique,” Journal of artificial intelligence research, vol. 16, pp. 321–357, 2002.
Yashesh Dhebar is a PhD student in the Department of Mechanical Engineering at Michigan State University. He received his bachelor's and master's degrees in Mechanical Engineering from the Indian Institute of Technology Kanpur, graduating in 2015 with a thesis on modular robotics and their operation in cluttered environments. Currently, he is actively pursuing research in the fields of explainable and interpretable artificial intelligence, machine learning and knowledge discovery. He has worked on applied optimization and machine learning projects with industry collaborators.
Kalyanmoy Deb is a Fellow of the IEEE and the Koenig Endowed Chair Professor with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan, USA. He received his bachelor's degree in Mechanical Engineering from IIT Kharagpur, India, and his master's and Ph.D. degrees from the University of Alabama, Tuscaloosa, USA, in 1989 and 1991, respectively. He is largely known for his seminal research in evolutionary multi-criterion optimization. He has published over 535 international journal and conference research papers to date. His current research interests include evolutionary optimization and its application in design, modeling, AI, and machine learning. He is one of the top-cited EC researchers, with more than 144,000 Google Scholar citations.

Supplementary Document


In this supplementary document, we provide additional information related to our original paper.

S1 Summary of Supplementary Document

In this document, we first provide details of the parameter settings used for all our proposed algorithms, followed by an explanation of the procedure used to create customized datasets for benchmarking the lower-level and upper-level GAs. Details of the duplicate update operator used in the upper-level GA of our proposed bilevel optimization approach are described next. Thereafter, we present additional results to support the algorithms and discussions of the original paper.

S2 Parameter Setting

S2-A Parameter Setting for NSGA-II and NSGA-III

We used the implementations of the NSGA-II and NSGA-III algorithms provided in pymoo [47]. pymoo is a Python package for multi-objective optimization algorithms and is developed under the supervision of the original developer of the NSGA-II and NSGA-III algorithms. The following parameter settings were adopted to create Pareto and non-Pareto datasets for the two-objective truss and welded beam problems (a minimal pymoo sketch follows the list):

  • Population Size = 500

  • Maximum Generations = 1000

  • Cross Over = SBX

  • SBX $\eta_c$ = 15

  • SBX probability = 0.9

  • Mutation type = Polynomial Mutation

  • Mutation $\eta_m$ = 20

  • Mutation probability $p_m$ = $1/n_{var}$ (where $n_{var}$ is the total number of variables)
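Below is a minimal sketch of how these settings could be configured in pymoo. The module paths follow the pymoo 0.6 API and may differ in other releases; get_problem("zdt1") is used only as a stand-in, since the truss and welded-beam problems are user-defined.

from pymoo.algorithms.moo.nsga2 import NSGA2
from pymoo.operators.crossover.sbx import SBX
from pymoo.operators.mutation.pm import PM
from pymoo.optimize import minimize
from pymoo.problems import get_problem

problem = get_problem("zdt1")  # stand-in; the truss and welded-beam problems are user-defined

algorithm = NSGA2(
    pop_size=500,
    crossover=SBX(prob=0.9, eta=15),  # SBX with eta_c = 15 and probability 0.9
    mutation=PM(eta=20),              # polynomial mutation; per-variable probability defaults to about 1/n_var
)

res = minimize(problem, algorithm, ("n_gen", 1000), seed=1,
               save_history=True, verbose=False)
# res.history stores the population of every generation, from which the
# g_ref and g_max populations can later be extracted for labelling.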

S2-B Parameter Setting for Upper Level GA

This section provides the general parameter setting used for the upper level of our bilevel GA.

  • Population size = $10\times d$, where $d$ is the dimension of the feature space of the original problem.

  • Selection = Binary tournament selection.

  • Crossover probability ($p_{xover}^{U}$) = 0.9

  • Mutation parameters:

    • Mutation probability ($p^{U}_{mut}$) = $\min(0.33, 1/d)$.

    • Distribution parameter $\beta$ = 3 (any value $>1$ works well).

  • $p_{zero}$ = 0.75

  • Maximum number of generations = 100

S2-C Parameter Setting for Lower Level GA

Following parameter setting is used for lower level GA:

  • Population size = 50,

  • Maximum generations = 50,

  • Variable range = $[-1, 1]$,

  • Crossover type = Simulated Binary Crossover (SBX)

  • Mutation type = Polynomial Mutation

  • Selection type = Binary Tournament Selection

  • Crossover probability = 0.9,

  • Mutation probability = $1/n_{var}$,

  • Number of variables $n_{var} = p+1$ if $m=0$, or $p+2$ if $m=1$, where $p$ is the total number of power-laws allowed and $m$ is the modulus flag,

  • SBX and polynomial mutation parameters: $(\eta_c, \eta_m) = (2, 15)$.

S2-D Parameter Setting for Inducing a Non-linear Decision Tree

  • Number of power-laws per split-rule: $p=3$,

  • Allowable set of exponents: $\mathbf{E} = \{-3, -2, \dots, 3\}$,

  • Impurity Metric: Gini Score.

  • Minimum impurity to conduct a split: $\tau_{\min} = 0.05$

  • Minimum number of points required in a node to conduct a split: $N_{min} = 10$

  • Maximum allowable Depth = 5

  • Pruning threshold: $\tau_{prune} = 3\%$ (a short sketch showing how these parameters act as stopping criteria follows this list)
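As a rough illustration (not the paper's actual implementation), the sketch below shows how these parameters could act as stopping criteria while recursively growing a tree; find_split_rule is a hypothetical placeholder for the bilevel GA that returns a split-rule, and pruning with $\tau_{prune}$ is applied afterwards and omitted here.

import numpy as np

TAU_MIN, N_MIN, MAX_DEPTH = 0.05, 10, 5  # values listed above

def gini(y):
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def grow(X, y, depth, find_split_rule):
    if len(y) == 0:
        return {"leaf": True, "label": None}  # degenerate child (should not occur in practice)
    # Stop when the node is too small, too deep, or already nearly pure.
    if len(y) < N_MIN or depth >= MAX_DEPTH or gini(y) <= TAU_MIN:
        return {"leaf": True, "label": int(np.bincount(y).argmax())}
    rule = find_split_rule(X, y)          # bilevel GA (hypothetical placeholder)
    mask = rule(X) <= 0                   # datapoints satisfying the split-rule
    return {"leaf": False, "rule": rule,
            "left":  grow(X[mask],  y[mask],  depth + 1, find_split_rule),
            "right": grow(X[~mask], y[~mask], depth + 1, find_split_rule)}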

S3 Creation of Customized 2D Datasets

Datasets presented in this section were synthetically generated. Datapoints in 2D space were created using a reference curve $\xi(\mathbf{x})$ and were assigned to either Class-1 or Class-2. Two additional parameters, $\delta$ and $\sigma$, controlled the location of a datapoint relative to the reference curve $\xi(\mathbf{x})$. Datapoints $\mathbf{x}^{(i)}$ for a given dataset were generated using the following steps:

  • First, $n$ points falling on the curve $\xi(\mathbf{x})=0$ were initialized for reference. Let us denote these reference points on the curve by $\mathbf{x}_r^{(i)}$ (where $i=1,2,\dots,n$). Thus,

    $\xi(\mathbf{x}_r^{(i)}) = 0, \qquad i = 1, 2, \dots, n.$
  • $n$ points $\mathbf{x}^{(i)}$ for a dataset were then generated using the following equation:

    $\mathbf{x}^{(i)} = \mathbf{x}_r^{(i)} + \delta + \sigma r, \qquad i = 1, 2, \dots, n,$ (14)

    where $r$ is a random number between 0 and 1, $\delta$ represents the offset and $\sigma$ indicates the amount of spread (a short code sketch of this procedure is given after the list).
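A minimal sketch of this generation procedure is given below. It assumes, for illustration, that $\delta$ and $\sigma$ are added to both coordinates and that the reference curve is parametrized by a scalar $t$; make_dataset and line are hypothetical helper names, not code from the paper.

import numpy as np

def make_dataset(curve, n, delta, sigma, seed=0):
    """Generate n 2-D points around the reference curve xi(x) = 0 using Eq. (14)."""
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, 1.0, n)
    x_r = np.array([curve(ti) for ti in t])   # reference points with xi(x_r) = 0
    r = rng.uniform(0.0, 1.0, (n, 1))         # random number between 0 and 1
    return x_r + delta + sigma * r            # x = x_r + delta + sigma * r, Eq. (14)

# DS1-like data around the line 2*x1 + x2 - 3 = 0 (see Table S1).
line = lambda t: np.array([t, 3.0 - 2.0 * t])
class1 = make_dataset(line, 100, delta=0.0,   sigma=0.01)
class2 = make_dataset(line, 100, delta=0.025, sigma=0.2)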

Four datasets generated for conducting ablation studies are graphically represented in Fig. S1.

The parameter settings used to generate these datasets are provided in Table S1. $N_A$ and $N_B$ indicate the number of datapoints belonging to Class-1 and Class-2, respectively.

TABLE S1: Parameter settings to create customized datasets DS1–DS4. For Class-1 datapoints, $(\delta, \sigma) = (0, 0.01)$.
Dataset | $\xi(\mathbf{x})$ | Class-1 $N_A$ | Class-2 $N_B$ | Class-2 $(\delta, \sigma)$
DS1 | $2x_1 + x_2 - 3 = 0$ | 100 | 100 | $(0.025, 0.2)$
DS2 | $2x_1 + x_2 - 3 = 0$ | 10 | 200 | $(0.025, 0.2)$
DS3 | $x_1^2 + x_2 - 1 = 0$ | 100 | 100 | $(0.025, 0.2)$
DS4 | $2x_1 + x_2 - 3 = 0$ | 100 | 50 | $(0.025, 0.2)$
    |                      |     | 50 | $(-0.015, -0.2)$
Refer to caption
(a) DS1 Dataset.
Refer to caption
(b) DS2 Dataset.
Refer to caption
(c) DS3 Dataset.
Refer to caption
(d) DS4 Dataset.
Figure S1: Customized datasets.

S4 Multi-objective Datasets

S4-A ZDT, Truss and Welded Beam Datasets

For each problem, the NSGA-II algorithm [44] is first applied to evolve a population of $N$ individuals for $g_{max}$ generations. The population of each generation is stored. Naturally, populations from later generations are closer to the Pareto-front than populations from the initial generations. We artificially separate the entire dataset into two classes – good and bad – using two parameters: $g_{ref}$ (indicating an intermediate generation for data collection) and $\tau_{rank}$ (indicating the minimum non-dominated rank for defining the bad cluster). First, population members from the $g_{ref}$ and $g_{max}$ generations are merged. Next, non-dominated sorting is executed on the merged population to determine the non-domination rank of each member. Points belonging to the same non-domination rank are further sorted based on their crowding distance (from highest to lowest) [44]. For the good class, the top $N_A$ points from the rank-1 front are chosen, and for the bad class, the top $N_B$ points from the $\tau_{rank}$ front onward are chosen. Increasing $\tau_{rank}$ increases the separation between the good and bad classes, while the $g_{ref}$ parameter has the inverse effect.
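The sketch below illustrates this labelling step under the stated procedure; the crowding-distance sort within a front is omitted for brevity, and label_good_bad is a hypothetical helper, not code from the paper.

import numpy as np

def dominates(f1, f2):
    """f1 dominates f2 (minimization): no worse in every objective, better in at least one."""
    return np.all(f1 <= f2) and np.any(f1 < f2)

def nondominated_ranks(F):
    """Naive O(n^2) non-dominated ranking; rank 1 is the best front."""
    remaining, ranks, rank = list(range(len(F))), np.zeros(len(F), dtype=int), 1
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(F[j], F[i]) for j in remaining if j != i)]
        for i in front:
            ranks[i] = rank
        remaining = [i for i in remaining if i not in front]
        rank += 1
    return ranks

def label_good_bad(F_merged, tau_rank, N_A, N_B):
    """F_merged: objective vectors of the merged g_ref and g_max populations."""
    ranks = nondominated_ranks(F_merged)
    good = np.where(ranks == 1)[0][:N_A]         # rank-1 (Pareto) members
    bad = np.where(ranks >= tau_rank)[0][:N_B]   # members from the tau_rank front onward
    return good, bad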

A visualization of this procedure is provided in Figs. S2 and S3 for the truss and welded-beam problems, respectively.

Refer to caption
(a) F-space plot with $\tau_{rank}=6$.
Refer to caption
(b) X-space plot with $\tau_{rank}=6$.
Refer to caption
(c) F-space plot with $\tau_{rank}=9$.
Refer to caption
(d) X-space plot with $\tau_{rank}=9$.
Figure S2: Truss design problem data visualization. $g_{ref}=1$ is kept fixed.
Refer to caption
(a) F-space plot with $g_{ref}=1$.
Refer to caption
(b) F-space plot with $g_{ref}=10$.
Figure S3: Welded beam design problem data visualization. $\tau_{rank}=3$ is kept fixed. The problem in (b) is more difficult to solve than that in (a).

The parameter settings used to create datasets for the ZDT1, ZDT2, ZDT3, truss and welded-beam problems are shown in Table S2.

TABLE S2: Parameter settings for generating datasets for multi-objective problems. For the truss and welded beam (WB) problems, different experiments are performed with different values of $g_{ref}$ and $\tau_{rank}$; these values are provided in the captions of the result tables.
Problem | Pop-Size ($N$) | $g_{max}$ | $g_{ref}$ | $\tau_{rank}$ | $N_A$ | $N_B$
ZDT-1 | 100 | 200 | 200 | 10 | 100 | 5000
ZDT-2 | 100 | 200 | 200 | 5 | 100 | 5000
ZDT-3 | 100 | 200 | 200 | 10 | 100 | 5000
Truss-2D | 200 | 1000 | 1 | (varies) | 200 | 200
WB | 200 | 1000 | (varies) | 3 | 200 | 200

S4-B m-ZDT, m-DTLZ Problems

These are modified versions of the ZDT and DTLZ problems [44, 45]. For m-ZDT problems, the g-function is modified to the following:

$g_{zdt}(x_2, \ldots, x_n) = 1 + \dfrac{18}{n-1} \sum_{i=2,\ i\ \text{even}}^{n} (x_i + x_{i+1} - 1)^2.$ (15)

Many two-variable relationships ($x_i + x_{i+1} = 1$) must be satisfied for a solution to lie on the Pareto set.
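A small numerical sketch of Eq. (15) is given below; the paper's 1-based indices are mapped to 0-based Python indices, and the summation is taken over even $i$ for which $x_{i+1}$ exists.

import numpy as np

def g_zdt_modified(x):
    """Sketch of the modified g-function of Eq. (15)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    s = sum((x[i - 1] + x[i] - 1.0) ** 2        # x_i + x_{i+1} - 1 in 1-based notation
            for i in range(2, n, 2))            # even i = 2, 4, ...
    return 1.0 + 18.0 / (n - 1) * s

# On the Pareto set every pair satisfies x_i + x_{i+1} = 1, so g = 1.
print(g_zdt_modified([0.3, 0.4, 0.6, 0.2, 0.8, 0.5]))   # -> 1.0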

For m-DTLZ problems, the g-function for the $\mathbf{x}_m$ variables is

$g_{dtlz}(\mathbf{x}_m) = 100 \times \sum_{x_i \in \mathbf{x}_m,\ i\ \text{even}}^{n} (x_i + x_{i+1} - 1)^2.$ (16)

Pareto points for the m-ZDT and m-DTLZ problems are generated using the exact analytical expression of the Pareto set (locations where $g=0$). The non-Pareto set is generated using two parameters, $\sigma_{spread}$ and $\sigma_{offset}$. To compute the location of a point $\mathbf{x}_{np}$ belonging to the non-Pareto set from the location of a point $\mathbf{x}_p$ on the Pareto set, the following equation is used:

$x_{np}^{(i)} = x_p^{(i)} + r_1(\sigma_{offset} + r_2\,\sigma_{spread}),$ (17)
where $r_1 \in \{-1, 1\}, \quad r_2 \in [0, 1].$ (18)

Here, $r_1$ and $r_2$ are randomly generated for each $x_{np}^{(i)}$. For m-ZDT problems, $i=1,2,\dots,n$, while for m-DTLZ problems $x_{np}^{(i)}, x_p^{(i)} \in \mathbf{x}_m$. The parameter settings for generating the m-ZDT and m-DTLZ datasets are provided in Table S3. For 30-variable problems, all $x_i \in \mathbf{x}_m$ are changed according to Eqs. (17)–(18) to generate non-Pareto points. However, for 500-variable problems, only the first 28 variables in $\mathbf{x}_m$ are modified according to Eqs. (17)–(18) to generate non-Pareto datapoints.
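A short sketch of this perturbation (Eq. (17)) is given below; the per-coordinate sampling of $r_1$ and $r_2$ follows the description above, and perturb_pareto_point is a hypothetical helper name.

import numpy as np

def perturb_pareto_point(x_p, sigma_offset, sigma_spread, rng):
    """Move a Pareto-set point x_p off the Pareto set, as in Eq. (17)."""
    x_p = np.asarray(x_p, dtype=float)
    r1 = rng.choice([-1.0, 1.0], size=x_p.shape)   # random sign, r1 in {-1, +1}
    r2 = rng.uniform(0.0, 1.0, size=x_p.shape)     # random magnitude, r2 in [0, 1]
    return x_p + r1 * (sigma_offset + r2 * sigma_spread)

rng = np.random.default_rng(0)
# e.g., an m-ZDT1-like perturbation with sigma_offset = 0.1 and sigma_spread = 0.3
x_np = perturb_pareto_point(np.full(30, 0.5), 0.1, 0.3, rng)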

TABLE S3: Parameter settings to generate datasets for m-ZDT and m-DTLZ problems. We generate 1,000 datapoints for each class.
Prob. $n_{vars}$ $\sigma_{spread}$ $\sigma_{offset}$
m-ZDT1 30 0.3 0.1
m-ZDT1 500 0.3 0.1
m-ZDT2 30 0.3 0.1
m-ZDT2 500 0.3 0.1
m-DTLZ1 30 0.05 0
m-DTLZ1 500 0.05 0
m-DTLZ2 30 0.05 0
m-DTLZ2 500 0.05 0

Visualization of the datasets for the m-ZDT and m-DTLZ problems is provided in Figures S4 and S5.

Refer to caption
(a) m-ZDT1, 30 vars
Refer to caption
(b) m-ZDT1, 500 vars
Refer to caption
(c) m-ZDT2, 30 vars
Refer to caption
(d) m-ZDT2, 500 vars
Figure S4: m-ZDT Datasets.
Refer to caption
(a) m-DTLZ1, 30 vars
Refer to caption
(b) m-DTLZ1, 500 vars
Refer to caption
(c) m-DTLZ2, 30 vars
Refer to caption
(d) m-DTLZ2, 500 vars
Figure S5: m-DTLZ Datasets.

S5 Duplicate Update Operator for Upper Level GA

After creating the offspring population, we attempt to eliminate duplicate population members. For each duplicate found in the child population, the block matrix $\mathbf{B}$ of that individual is randomly mutated using the following equation:

$\mathbf{B}(i_r, j_r) = \mathbf{E}(k_r),$ (19)

where $i_r$, $j_r$ and $k_r$ are randomly chosen integers within specified limits. This process is repeated for all members of the child population until each member is unique within the child population. This operation maintains diversity and encourages the search to generate multiple novel and optimized equations for the split-rule.
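A small sketch of this operator follows; the choice of random-number generator and the exact uniqueness check are illustrative assumptions.

import numpy as np

def update_duplicate(B, E, rng):
    """Randomly perturb one exponent of a duplicate's block matrix, as in Eq. (19).

    B : (p x d) integer matrix of exponents b_ij
    E : allowable exponent set, e.g. [-3, -2, ..., 3]
    """
    i_r = rng.integers(0, B.shape[0])   # random power-law (row)
    j_r = rng.integers(0, B.shape[1])   # random feature (column)
    k_r = rng.integers(0, len(E))       # random index into the exponent set
    B = B.copy()
    B[i_r, j_r] = E[k_r]
    return B

# Repeated for every duplicate in the offspring population until all block
# matrices (and hence all split-rule structures) are unique.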

S6 Ablation Studies and Comparison

In this section, we test the efficacy of the lower-level and upper-level optimization algorithms by applying them to four customized problems (DS1–DS4), which are shown pictorially in Fig. S1. Their performance is compared against two standard classifiers: CART and support vector machines (SVM). (Here, the support vector machine is applied without any kernel trick.)

S6-A Ablation Studies on Lower Level GA

Since the lower-level GA (LLGA) focuses on determining the coefficients $w_i$ and bias $\theta_i$ of a linear split-rule in B-space, the DS1 and DS2 datasets are used to gauge its efficacy. The block matrix $\mathbf{B}$ is fixed to the $2\times 2$ identity matrix and the modulus flag $m$ is set to 0. Hence, the split-rule (classifier) to be evolved is

$f(\mathbf{x}) = \theta_1 + w_1 x_1 + w_2 x_2,$

where $w_1$, $w_2$ and $\theta_1$ are to be determined by the LLGA.
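A sketch of the corresponding lower-level objective is given below: the weighted Gini impurity of the two child nodes produced by the split $f(\mathbf{x}) \le 0$. The exact normalization used in the paper may differ; this is only an illustration.

import numpy as np

def gini(y):
    """Gini impurity of a set of class labels (0 for an empty or pure node)."""
    if len(y) == 0:
        return 0.0
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)

def net_impurity(w, theta, X, y):
    """Weighted Gini impurity of the children induced by f(x) = theta + w.x <= 0."""
    f = theta + X @ np.asarray(w)
    left, right = y[f <= 0], y[f > 0]
    return (len(left) * gini(left) + len(right) * gini(right)) / len(y)

# The LLGA searches w1, w2 in [-1, 1] and theta to minimize net_impurity(w, theta, X, y).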

The classifier generated by our LLGA is compared with that obtained using the standard SVM algorithm (without any kernel trick) on the DS1 dataset. Since the DS2 dataset is imbalanced, the SMOTE algorithm [48] is first applied to over-sample datapoints of the minority class before generating the classifier using SVM. For the sake of completeness, the classifier generated by SVM on the DS2 dataset without any oversampling is also compared against the ones obtained using LLGA and SMOTE+SVM. The results are shown in Figs. S6 and S7 for the DS1 and DS2 datasets, respectively. Type-1 and Type-2 errors are reported for each experiment. (Type-1 error indicates the percentage of datapoints belonging to Class-1 that get classified as Class-2; Type-2 error indicates the percentage of points belonging to Class-2 that get classified as Class-1.)

Refer to caption
(a) LLGA: $(0\%, 0\%)$; $(0\%, 1\%)$
Refer to caption
(b) SVM: $(0\%, 17\%)$; $(0\%, 10\%)$
Figure S6: Results on the DS1 dataset to benchmark LLGA. Numbers in the first parentheses indicate Type-1 and Type-2 errors on training data, and numbers in the second parentheses indicate Type-1 and Type-2 errors on testing data.

It is clearly evident from the results shown in Fig. S6 that the proposed customized LLGA is more efficient than SVM in arriving at the desired split-rule as a classifier. The LLGA is able to find the decision boundary with 100% prediction accuracy on the training and testing datasets. However, the classifier generated by SVM overlaps with the cloud of datapoints belonging to the scattered class, thereby resulting in an accuracy of 89% on the testing dataset.

Refer to caption
(a) LLGA: $(0\%, 0\%)$; $(0\%, 0\%)$
Refer to caption
(b) SVM: $(100\%, 0\%)$; $(100\%, 0\%)$
Refer to caption
(c) SVM+SMOTE: $(0\%, 1.5\%)$; $(0\%, 1.5\%)$
Figure S7: Results on the DS2 dataset to benchmark LLGA.

On the DS2 dataset, SVM needs to rely on SMOTE to synthetically generate datapoints in order to achieve respectable performance, as can be seen from Figs. S7(b) and S7(c). However, LLGA performs better without SMOTE, as shown in Fig. S7(a).

The ablation study conducted above on the LLGA clearly indicates that the customized optimization algorithm developed for estimating the weights $\mathbf{w}$ and bias $\mathbf{\Theta}$ in the expression of a split-rule is reliable and efficient, and can easily outperform the classical SVM algorithm.

S6-B Ablation Studies on Proposed Bilevel GA

As mentioned before, the upper level of our bilevel algorithm aims at determining optimal power-law structures (i.e., $B_i$'s) and the value of the modulus flag $m$. Experiments are performed on the DS3 and DS4 datasets to gauge the efficacy of the upper-level (and thus the overall bilevel) GA. No prior information about the optimal values of the block matrix $\mathbf{B}$ and modulus flag $m$ is supplied. Thus, the structure of the equation of the split-rule is unknown to the algorithm. Results of these experiments are shown in Figs. S8(a) and S8(b).

Refer to caption
(a) DS3: $(0\%, 0\%)$; $(0\%, 0\%)$
Refer to caption
(b) DS4: $(0\%, 0\%)$; $(0\%, 0\%)$
Figure S8: Bilevel GA results on DS3 and DS4.

As can be seen from Fig. S8(a), the proposed bilevel algorithm is able to find classifiers which classify the datasets with no Type-1 or Type-2 error. Results on the DS4 dataset validate the importance of involving the modulus flag $m$ as a search variable for the upper-level GA in addition to the block matrix $\mathbf{B}$. The algorithm is able to converge to the optimal tree topology (Fig. S8(b)) with 100% classification accuracy. Note that the algorithm had no prior knowledge about the optimal block matrix $\mathbf{B}$ of exponents $b_{ij}$. This observation suggests that the upper-level GA is successfully able to navigate the search space of block matrices $\mathbf{B}$ and the modulus flag $m$ (which is a binary variable) to arrive at the optimal power-laws and split-rule.

S6-C Visualization of Split Rule: X-Space and B-Space

A comprehensible visualization of the decision boundary is possible if the dimension of the feature space is at most three. However, if the dimension $d$ of the original feature space (or X-space) is larger than three, the decision boundary can be visualized in the transformed feature space (or B-space). In our experiments, the maximum number of allowable power-law rules ($p$) is fixed to three (i.e., $p=3$). Thus, any datapoint $\mathbf{x}$ in a $d$-dimensional X-space can be mapped to a three-dimensional B-space. A conceptual illustration of this mapping is provided in Fig. S9,

Refer to caption
Figure S9: Feature transformation. A point in the three-dimensional X-space is mapped to a point in a two-dimensional B-space for which $B_i = x_1^{b_{i1}} x_2^{b_{i2}} x_3^{b_{i3}}$.

where, for the sake of simplicity, a three-dimensional X-space ($x_1$ to $x_3$) is mapped to a two-dimensional B-space using two power-law rules ($B_1$ and $B_2$).

Note that the power-law rules $B_i$ are not known a priori. The proposed bilevel optimization method determines them so that the data becomes linearly separable in B-space.
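A short sketch of this transformation is given below; to_B_space is a hypothetical helper name, and positive feature values are assumed so that negative exponents are well defined.

import numpy as np

def to_B_space(X, B):
    """Map (n, d) datapoints to (n, p) B-space values, with B_i = prod_j x_j ** b_ij."""
    X = np.asarray(X, dtype=float)
    B = np.asarray(B)
    return np.prod(X[:, None, :] ** B[None, :, :], axis=2)

# Example matching Fig. S9: two power-laws over a three-dimensional X-space.
B = np.array([[1, 2, 0],     # B_1 = x1 * x2^2
              [0, -1, 1]])   # B_2 = x3 / x2
X = np.random.default_rng(0).uniform(0.5, 1.5, (5, 3))
print(to_B_space(X, B))      # shape (5, 2)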

S7 Additional Results

Here we provide some additional results. Figs. S10 and S11 show plots of a representative decision boundary (classifier) obtained using our bilevel algorithm and its comparison against the decision tree generated using the traditional CART algorithm on the DS3 and DS4 datasets.

Refer to caption
(a) Bilevel GA results: $(0\%, 0\%)$ and $(1\%, 0\%)$.
Refer to caption
(b) Obtained NLDT for the DS3 problem – one rule.
Refer to caption
(c) CART results: $(0\%, 4\%)$ and $(4\%, 3\%)$ – 13 rules.
Figure S10: Results on the DS3 dataset to benchmark the overall bilevel algorithm. The CART tree is difficult to interpret, while the NLDT classifier is easy to interpret.
Refer to caption
(a) Bilevel GA results: $(0\%, 0\%)$ and $(0\%, 0\%)$ – no error.
Refer to caption
(b) Obtained NLDT for the DS4 problem – one rule.
Refer to caption
(c) CART results: $(4\%, 3\%)$ and $(5\%, 18\%)$ – 27 rules.
Figure S11: Results on the DS4 dataset to benchmark the overall bilevel algorithm. The CART tree is difficult to interpret, while the NLDT classifier is easy to interpret.

S7-A 2D Truss Problem

In Table S4, a comparison of the accuracy scores and complexity obtained for datasets generated using $\tau_{rank}=6$ and $\tau_{rank}=9$ for the bi-objective truss design problem is provided.

TABLE S4: 2D truss problem with $\tau_{rank}=6$ and $\tau_{rank}=9$.
$\tau_{rank}$ | Training Accuracy | Testing Accuracy | # Rules | n-vars | $F_U$/Rule
6 | 99.86 ± 0.36 | 99.6 ± 0.58 | 1.20 ± 0.53 | 2.78 ± 0.41 | 2.92 ± 0.52
9 | 100 ± 0 | 99.92 ± 0.19 | 1.36 ± 0.48 | 2.14 ± 0.35 | 2.26 ± 0.52

Since the dataset generated with $\tau_{rank}=6$ is relatively difficult, we conduct further analysis on it and compare the results obtained using our NLDT approach with those of CART and SVM. A compilation of these results is shown in Table S5.

TABLE S5: Results on Truss-2D with $\tau_{rank}=6$ and $g_{ref}=1$.

Method | Training Accuracy | Testing Accuracy | p-value | # Rules | $F_U$/Rule | Rule Length
NLDT | 99.77 ± 0.72 | \textbf{99.54 ± 0.75} | – | 1.2 ± 0.5 | 2.9 ± 0.5 | \textbf{3.3 ± 0.9}
CART | 99.34 ± 0.32 | 98.33 ± 1.10 | 6.04e-08 | 11.06 ± 3.15 | \textbf{1} | 11.06 ± 3.15
SVM | 99.66 ± 0.15 | \textit{99.46 ± 0.50} | 0.135 | \textbf{1} | 62.5 ± 2.9 | 62.5 ± 2.9

S7-B Welded Beam Design Results

Additional results for the welded beam design problem are shown in Table S6. Here, two sets of experiments are conducted for two different values of $g_{ref}$, keeping the value of $\tau_{rank}$ fixed at three. A higher value of $g_{ref}$ results in the good and bad datapoints being too close to each other, thereby making it difficult for the classification algorithm to determine a suitable classifier. This can be verified from the results presented in Table S6. However, bilevel NLDTs have produced one rule with about 2 to 4 variable appearances, compared to 39.5 to 126 (on average) for SVM classifiers. CART does well on this problem with, on average, 2.94 to 8.42 rules.

TABLE S6: Results on the welded beam design problem with $g_{ref}=1$ and $g_{ref}=10$. $\tau_{rank}=3$ is kept fixed.

$g_{ref}$ | Method | Training Accuracy | Testing Accuracy | p-value | # Rules | $F_U$/Rule | Rule Length
1 | NLDT | 99.98 ± 0.06 | \textit{99.38 ± 0.49} | 0.158 | \textbf{1.0 ± 0.0} | 2.6 ± 0.7 | \textbf{2.6 ± 0.7}
1 | CART | 99.97 ± 0.07 | \textbf{99.50 ± 0.59} | – | 2.94 ± 0.71 | \textbf{1} | 2.94 ± 0.71
1 | SVM | 99.37 ± 0.17 | \textit{99.40 ± 0.40} | 0.331 | \textbf{1} | 39.5 ± 2.5 | 39.5 ± 2.5
10 | NLDT | 99.39 ± 0.38 | 98.58 ± 1.13 | 3.42e-02 | \textbf{1.0 ± 0.0} | 3.9 ± 1.0 | \textbf{3.9 ± 1.0}
10 | CART | 99.46 ± 0.27 | 97.72 ± 1.04 | 4.63e-08 | 8.42 ± 1.42 | \textbf{1} | 8.42 ± 1.42
10 | SVM | 99.46 ± 0.19 | \textbf{98.97 ± 0.54} | – | \textbf{1} | 126.0 ± 6.8 | 126.0 ± 6.8

S7-C ZDT Problems

ZDT problems are two-objective optimization problems and their formulations can be found in [44]. The dataset created was highly biased, with 100 datapoints in the Pareto class and 5,000 datapoints in the non-Pareto class. Visualization of these datasets is provided in Figure S12. As can be seen from Figure S12, the classification task was deliberately made more difficult by tightening the gap between the Pareto and non-Pareto classes.

Refer to caption
(a) ZDT1.
Refer to caption
(b) ZDT2.
Refer to caption
(c) ZDT3.
Figure S12: There is no visible separation of the Pareto and non-Pareto datapoints, making the classification task difficult.

Due to the highly biased distribution, a naive classifier predicting all points as non-Pareto would have a classification accuracy of 98.04% ($= 5000/(5000+100)$). Hence, the pruning implementation with $\tau_{prune}=3\%$ will prune the entire tree, resulting in a tree with only one split-rule. Thus, we investigate the performance of the unpruned NLDT and the pruned NLDT with $\tau_{prune}=0.5\%$. A compilation of the results on ZDT problems is provided in Table S7.

TABLE S7: Results on multi-objective ZDT problems. Visualization of the split-rules is provided in Figs. S13–S15.
Problem | Method | Training Accuracy | Testing Accuracy | p-value | # Rules | $F_U$/Rule | Rule Length
ZDT1 | NLDT | 99.97 ± 0.04 | \textit{99.74 ± 0.16} | 5.65e-01 | 3.3 ± 1.0 | 2.6 ± 1.0 | 7.9 ± 1.9
ZDT1 | NLDT-pruned | 99.80 ± 0.05 | \textit{99.66 ± 0.18} | 9.36e-02 | \textbf{1.0 ± 0.0} | 3.6 ± 0.9 | \textbf{3.6 ± 0.9}
ZDT1 | CART | 99.95 ± 0.04 | 99.59 ± 0.15 | 1.10e-05 | 9.0 ± 2.1 | \textbf{1} | 9.0 ± 2.1
ZDT1 | SVM | 99.76 ± 0.05 | \textbf{99.72 ± 0.11} | – | \textbf{1} | 178.3 ± 13.8 | 178.3 ± 13.8
ZDT2 | NLDT | 99.86 ± 0.08 | \textit{99.20 ± 0.23} | 9.41e-01 | 5.7 ± 1.7 | 6.2 ± 1.5 | 33.7 ± 5.4
ZDT2 | NLDT-pruned | 99.54 ± 0.13 | 99.08 ± 0.27 | 1.40e-02 | 1.6 ± 1.0 | 15.9 ± 5.7 | 21.6 ± 4.3
ZDT2 | CART | 99.85 ± 0.07 | 99.04 ± 0.27 | 2.58e-04 | 21.2 ± 2.4 | \textbf{1} | \textbf{21.2 ± 2.4}
ZDT2 | SVM | 99.29 ± 0.07 | \textbf{99.21 ± 0.16} | – | \textbf{1} | 141.9 ± 32.7 | 141.9 ± 32.7
ZDT3 | NLDT | 99.95 ± 0.06 | 99.71 ± 0.17 | 9.71e-06 | 4.4 ± 1.2 | 3.8 ± 1.8 | 15.5 ± 4.6
ZDT3 | NLDT-pruned | 99.75 ± 0.08 | \textbf{99.60 ± 0.19} | – | 1.1 ± 0.3 | 9.4 ± 3.3 | \textbf{9.7 ± 3.5}
ZDT3 | CART | 99.90 ± 0.06 | 99.47 ± 0.20 | 2.30e-03 | 13.5 ± 2.2 | \textbf{1} | 13.5 ± 2.2
ZDT3 | SVM | 98.18 ± 0.19 | 98.02 ± 0.11 | 7.28e-10 | \textbf{1} | 286.7 ± 33.2 | 286.7 ± 33.2

Clearly, the pruned NLDT requires only one or two rules with far fewer variables (or features) in total than CART and SVM (on average, a total rule length of 3.6 versus 9.0 for CART on ZDT1, and 9.7 versus 286.7 for SVM on ZDT3), but achieves a similar classification accuracy. The difficulty with CART solutions is that the trees are too deep to be easily interpretable. Although theoretical Pareto-optimal points for ZDT problems can be identified by checking $x_i = 0$ for $i \geq 2$, NSGA-II, being a stochastic optimization algorithm, is not expected to find the exact Pareto-optimal solutions, thereby forcing classification methods to find a classifier with more than one participating variable.

Visualizations of the NLDTs obtained using our bilevel algorithm on the ZDT problems are provided in Figures S13, S14 and S15.

Refer to caption
Figure S13: NLDT for ZDT-1. It has a testing accuracy of 99.93% and involves 4 variables in the split-rule.
Refer to caption
Figure S14: NLDT for ZDT-2. It has a testing accuracy of 99.54% and involves 22 variables in the equation of the split-rule at the root node.
Refer to caption
Figure S15: NLDT for ZDT-3. It has a testing accuracy of 100% and involves 9 variables in the equation of the split-rule.

S7-D m-ZDT and m-DTLZ problems

Visualizations of some NLDTs obtained for the m-ZDT and m-DTLZ problems are shown in Figures S16 and S17, respectively. For m-ZDT problems, a relationship between two consecutive variables would constitute a relationship present in the Pareto set, but our approach finds a different two-variable interaction which is present in the dataset and is enough to classify Pareto versus non-Pareto data. For m-DTLZ problems, the relationships are a little more complex. Interestingly, for the 500-variable problem, only seven variables are needed in a nonlinear fashion to classify the two classes of data.

Refer to caption
(a) m-ZDT1 30 vars
Refer to caption
(b) m-ZDT1 500 vars
Refer to caption
(c) m-ZDT2 30 vars
Refer to caption
(d) m-ZDT2 500 vars
Figure S16: NLDTs for m-ZDT problems. Class=0 indicates the Pareto dataset. Even for 500-var problems, only a two-variable interaction is enough to make a highly-accurate classification.
Refer to caption
(a) m-DTLZ1 30 vars
Refer to caption
(b) m-DTLZ1 500 vars
Refer to caption
(c) m-DTLZ2 30 vars
Refer to caption
(d) m-DTLZ2 500 vars
Figure S17: NLDTs for m-DTLZ problems. Class=0 indicates the Pareto dataset.

S8 Influence of Parameter $\tau_I$

In this section, we study the influence of the parameter $\tau_I$ on the complexity of the split-rule and the accuracy of the overall decision tree. As can be seen from Figure S18, a value of $\tau_I = 0.05$ appears to be reasonable for a two-class classification dataset.

Refer to caption
Figure S18: The effect of various values of $\tau_I$ on the complexity (horizontal axis) and accuracy of the classifier (vertical axis) is shown here for the 36-dimensional two-class real-world auto-industry problem. A value of $\tau_I$ in the range from 0.03 to 0.05 appears to be reliable for this two-class classification problem.

S9 Multi-Class Iris Dataset Results

Results obtained on the three-class iris dataset using NLDT, CART and SVM are shown in Table S8. From the results, it is clear that the proposed approach can be extended to multi-class problems as well. In future work, we will extend our approach to multi-class problems with more complicated decision boundaries than the standard iris dataset.

TABLE S8: Results on the multi-class iris dataset. NLDT requires fewer rules and achieves higher accuracy than CART, and requires a much smaller rule length (number of features across all rules) than SVM, making it an excellent compromise between the two existing approaches.
Method | Training Acc. | Testing Acc. | # Rules | $F_U$/Rule | Rule Length
NLDT | 98.59 ± 0.69 | 94.80 ± 4.14 | 2.00 ± 0.00 | 1.96 ± 0.57 | 3.92 ± 1.15
CART | 96.57 ± 1.03 | 93.69 ± 2.75 | 3.80 ± 0.49 | 1.00 ± 0.00 | 3.80 ± 0.49
SVM | 96.76 ± 1.17 | 96.13 ± 2.51 | 1.00 ± 0.00 | 47.94 ± 2.17 | 47.94 ± 2.17

S10 Summary

The results of the main paper and the additional results presented in this document amply indicate that the proposed bilevel-based NLDT approach is a viable strategy for classification tasks, requiring fewer rules than a standard decision tree (CART, for example) produces and simpler rules than the single complex rule that an SVM procedure produces. Hence, our nonlinear rules with restricted complexity are more humanly interpretable and useful not only for the mere classification task, but also for a better understanding of the classifier.