
Personalized Interpretable Classification

Zengyou He, Yifan Tang, Lianyu Hu, Mudi Jiang and Yan Liu
The authors are with the School of Software, Dalian University of Technology, Dalian, 116024, China. E-mail: zyhe@dlut.edu.cn
Manuscript received XX XX, 2023; revised XX XX, 2023.
Abstract

How to interpret a data mining model has received much attention recently, because people may distrust a black-box predictive model if they do not understand how the model works. Hence, a model is more trustworthy if it can provide a transparent illustration of how its decision is made. Although many rule-based interpretable classification algorithms have been proposed, none of these existing solutions can directly construct an interpretable model that provides a personalized prediction for each individual test sample. In this paper, we make a first step towards formally introducing personalized interpretable classification as a new data mining problem to the literature. In addition to the formulation of this new problem, we present a greedy algorithm called PIC (Personalized Interpretable Classifier) to identify a personalized rule for each individual test sample. To demonstrate the necessity, feasibility and advantages of such a personalized interpretable classification method, we conduct a series of empirical studies on real data sets. The experimental results show that: (1) The new problem formulation enables us to find interesting rules for test samples that may be missed by existing non-personalized classifiers. (2) Our algorithm can achieve the same level of predictive accuracy as state-of-the-art (SOTA) interpretable classifiers. (3) On a real data set for predicting breast cancer metastasis, such a personalized interpretable classifier can outperform SOTA methods in terms of both accuracy and interpretability.

Index Terms:
Interpretable classification, rule discovery, personalization, transductive learning.

1 Introduction

In many real applications such as medical diagnosis and biological data analysis, interpretable machine learning models are preferred since users also want to understand why and how a decision is made [1, 2]. Hence, in addition to accuracy, interpretability has become one of the well-recognized goals to be achieved when developing new machine learning models and algorithms [3, 4]. For classification tasks, rule-based prediction approaches are arguably the most explainable models since they can illustrate the decision process with simple logical rules [5, 6].

To construct interpretable rule-based classification models, numerous research efforts have been made during the past decades. On one hand, existing solutions include classical sequential covering algorithms such as CN2 [7] and RIPPER [8] and classifiers based on association rules [9]. On the other hand, we also observe a recent upsurge of interest in developing new rule-based explainable classification models (e.g. [1, 2, 3, 4, 5, 6]).

Apart from learning an explainable classifier in which the model is assessed as a whole, there is also a need to provide an explanation for each individual prediction [10]. In this direction, two different strategies are typically employed in the literature. The post-processing method learns an interpretable model locally around an individual prediction of any classifier [10, 11, 12]. In contrast, the personalized transductive learning approach constructs a unique classifier for each test sample during the model creation phase [13, 14, 15]. Such a personalized classifier is desirable because the real data distribution in biomedical and clinical applications may be too complicated to be represented by a single general model [15]. Therefore, a personalized model may provide a more accurate prediction and can present an implicit illustration of why the prediction is made.

Overall, a personalized and interpretable classification model should be developed to provide a unique explanation of the prediction result for each individual test sample. Unfortunately, existing solutions cannot achieve this objective well, as elaborated below. Firstly, current rule-based classification methods do provide a logical illustration of the classification process; however, the learned rules are not specifically constructed for a given test sample. Secondly, the post-processing methods (e.g. [10, 11, 12]) approximate each individual prediction of a third-party model, which generally cannot provide explicit rules to illustrate the original classification decision. Finally, personalized transductive learning approaches (e.g. [13, 14, 15]) typically adopt the support vector machine as the base classifier, in which training data is reweighted to fit the test sample. As a result, the interpretability of these classification methods is inherently lacking.

Motivated by the above observations, we intend to address the following new data mining issue: “Can we directly and efficiently construct a personalized yet interpretable classification model for each test sample?” To be interpretable, we try to find a best-matching rule that is as simple as possible for the test sample. To be personalized, such a rule is obtained by searching the training data from scratch based on the feature-value combination of the test sample. However, the number of possible rules is exponential in the number of features, making it infeasible to conduct an exhaustive search. To quickly find such a personalized rule, we present a greedy algorithm that works in a breadth-first search manner. More precisely, we first check candidate rules of length $k$ ($=1$) and then increase the rule length to $k+1$. If the best rule of length $k+1$ cannot beat the best one of length $k$, we terminate the search and return the best rule of length $k$ to classify the test sample. The goodness of each candidate rule is evaluated based on a linear combination of precision and recall.

To empirically demonstrate the feasibility and advantages of such a personalized interpretable classification model, we conducted extensive experiments on real data sets. The experimental results show that: (1) Our formulation can find personalized rules for test samples that may be missed by existing rule-based classification models. These customized rules can provide some interesting explanations of the class assignment for test samples. (2) Our algorithm can achieve the same level of classification accuracy as state-of-the-art (interpretable) classification methods. (3) Although our algorithm is more time-consuming than non-personalized interpretable classifiers, it can finish the rule discovery and classification within a reasonable amount of time.

In short, the main contributions of this paper can be summarized as follows:

  • We make a first step towards formally introducing the problem of personalized interpretable classification.

  • We present a greedy algorithm to identify a personalized rule for each individual test sample. To the best of our knowledge, this is the first piece of work that creates a personalized interpretable classification model in terms of logical rules.

  • We conduct a series of experiments to demonstrate the necessity, feasibility and advantages of such a personalized interpretable classifier.

The remainder of this paper is organized as follows: Section 2 discusses existing research efforts that are closely related to our formulation and algorithm. Section 3 presents the problem formulation of personalized interpretable classification and a greedy algorithm that can fulfill this task. Section 4 shows the experimental results and Section 5 concludes the paper.

2 Related Work

In Section 2.1, we provide a summary of rule-based interpretable classification algorithms. In Section 2.2, we present a discussion of existing personalized classification algorithms. In Section 2.3, we give a brief review of learning algorithms that try to explain an individual prediction of a third-party classifier.

2.1 Rule-based interpretable classification

To date, there is still no consensus on a precise definition of interpretability [16]. In practice, simplicity, predictivity and stability are generally regarded as basic requirements for interpretable models [17]. In the context of rule-based classification models, the simplicity of model structure can be evaluated based on the number of rules, the length of each rule and the overlap among rules. The predictivity corresponds to the classification accuracy, which has been one of the long-term goals for any classifier. The stability refers to the model robustness with respect to small data perturbations [17].

To identify classification rules, heuristic sequential covering methods and divide-and-conquer strategies are typically employed by early algorithms [7, 8]. However, simply being rule-based cannot fully guarantee interpretability [6], and the final rule set is not explicitly optimized with respect to interpretability. That is, these classical rule-based classifiers mainly focus on maximizing the classification accuracy, ignoring other interpretability measures such as simplicity and stability.

Recently, with the renewed interest in rule-based interpretable classification models, researchers have begun to construct classifiers that explicitly optimize both accuracy and simplicity. These algorithms can be roughly divided into two categories. The classification algorithms in the first category [1, 2, 6, 18, 19] typically adopt a two-stage pipeline: rule generation and rule selection. In rule generation, an association rule mining algorithm such as FP-Growth is first employed to produce a set of candidate rules. In rule selection, a small and compact subset of rules is selected via either heuristic algorithms or the solution to a new optimization problem. Alternatively, some algorithms extract interpretable rules from random forests [17, 20]. The classification algorithms in the second category [3, 4, 21, 22, 23, 24, 25, 26] directly learn rules from the data by formulating the rule set discovery issue as different types of optimization problems.

Despite their apparent differences in problem formulation and rule discovery algorithms, all existing rule-based interpretable classification models follow an inductive learning paradigm. That is, these classifiers conduct rule learning only on the training data and are therefore unable to provide a potentially customized rule for each individual test sample.

2.2 Personalized transductive classification

In addition to the training data, the transductive learning approach employs information from the testing data as well [27]. In the context of classification, a personalized transductive classification model creates a unique model for each test sample adaptively [13]. Hence, such a learning paradigm naturally fits the applications where the focus is on the prediction and illustration for each individual sample.

The most representative personalized transductive classification algorithm is the k nearest neighbor (kNN) classifier [28], in which the prediction model is dynamically constructed based on samples within the kNN neighborhood. Recently, the support vector machine (SVM) has been customized to generate various personalized models, which have been applied to tackle different biomedical data analysis tasks [13, 14, 15].

Neither lazy learning approaches such as kNN nor personalized SVMs can provide an explicit explanation of the classification decision. Hence, a personalized yet interpretable classification model is still lacking in the literature.

2.3 The interpretation of an individual prediction

How to interpret a black-box machine learning model has been widely investigated recently [29, 30]. There are two closely related problems in this direction: the interpretation of a learned model and the interpretation of an individual prediction. For existing solutions to the former issue, please refer to a recent review in [29]. Here we only focus on the latter issue since our algorithm also seeks to provide an explanation of the prediction for each test sample.

To date, many effective algorithms have been proposed to learn an interpretation for each individual prediction of any classifier (e.g. [10, 11, 12]). Despite their great success, we have the following remarks.

First of all, these methods provide an explanation after the model selection procedure. That is, they do not construct an interpretable model for each test sample during the model selection phase in an adaptive manner. As a result, we may fail to find a more appropriate interpretation due to the separation of model construction and interpretation. Moreover, these methods generally cannot provide explanations in terms of rules. For example, in [10] the model to be explained is approximated with a linear model in a local region around the test sample, and the feature importance scores derived from the weights of the linear model are employed to explain the corresponding prediction.

3 Method

3.1 Problem formulation

Let $D=\{(x_{i},y_{i})\mid i=1,\ldots,N\}$ denote the training set of $N$ samples, where $x_{i}$ comprises $M$ categorical feature values and $y_{i}\in Y$ is a class label. We use $f_{j}$ to denote the $j$th feature, and a predicate takes the form $f_{j}=x_{ij}$, where $x_{ij}$ is one of the possible feature values for the $j$th feature. An itemset $s$ is defined as a conjunction of $k$ ($1\leqslant k\leqslant M$) predicates. A given sample $(x_{i},y_{i})$ satisfies an itemset $s$ only if all predicates in $s$ are true in the sample. We use $s\subseteq x_{i}$ to denote the fact that $x_{i}$ satisfies $s$. A rule $r$ is a tuple $(s,y)$, where $s$ is an itemset and $y$ is a class label.

For each test sample $(x,?)$ in which the class label is unknown, we try to find a “best” rule $\hat{r}=(\hat{s},\hat{y})$ such that $x$ satisfies $\hat{s}$, and then $\hat{y}$ is the label we predict. Hence, the personalized interpretable classification problem can be cast as the following algorithmic issue:

  • Input: A training dataset $D$, a test sample $x$.

  • Output: The “best” rule $\hat{r}=(\hat{s},\hat{y})$ that matches $x$.

TABLE I: An example dataset.
Class  $f_{1}$  $f_{2}$  $f_{3}$  $f_{4}$
1  $a_{1}$  $b_{1}$  $c_{1}$  $d_{1}$
1  $a_{1}$  $b_{2}$  $c_{1}$  $d_{2}$
1  $a_{2}$  $b_{3}$  $c_{2}$  $d_{1}$
2  $a_{1}$  $b_{2}$  $c_{2}$  $d_{1}$
2  $a_{2}$  $b_{3}$  $c_{1}$  $d_{2}$
2  $a_{3}$  $b_{1}$  $c_{2}$  $d_{1}$
2  $a_{1}$  $b_{2}$  $c_{2}$  $d_{2}$
?  $a_{1}$  $b_{3}$  $c_{2}$  $d_{1}$

To illustrate the above concepts, the dataset in Table I is taken as an example. In Table I, rows stand for samples and columns represent their class label and features. There are 7 training samples, of which 3 samples are drawn from class 1 and the other 4 samples are obtained from class 2. We take the sample in the last row as a test sample. Assume that we find a “best” rule $\hat{r}=(f_{1}=a_{1}\land f_{3}=c_{2},2)$. Obviously, the test sample satisfies this rule and it will be classified to class 2.
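
To make the notation concrete, the following minimal Python sketch (an illustration only, not the paper's C++ implementation) represents the Table I samples as dictionaries of categorical feature values and checks rule satisfaction, i.e., whether $s\subseteq x$ holds for the test sample.

# A minimal sketch of the notation in Section 3.1: samples are dictionaries of
# categorical feature values, an itemset is a dictionary of predicates, and rule
# satisfaction corresponds to the relation s ⊆ x.
train = [  # the example dataset of Table I: (features, class label)
    ({"f1": "a1", "f2": "b1", "f3": "c1", "f4": "d1"}, 1),
    ({"f1": "a1", "f2": "b2", "f3": "c1", "f4": "d2"}, 1),
    ({"f1": "a2", "f2": "b3", "f3": "c2", "f4": "d1"}, 1),
    ({"f1": "a1", "f2": "b2", "f3": "c2", "f4": "d1"}, 2),
    ({"f1": "a2", "f2": "b3", "f3": "c1", "f4": "d2"}, 2),
    ({"f1": "a3", "f2": "b1", "f3": "c2", "f4": "d1"}, 2),
    ({"f1": "a1", "f2": "b2", "f3": "c2", "f4": "d2"}, 2),
]
test_x = {"f1": "a1", "f2": "b3", "f3": "c2", "f4": "d1"}

def satisfies(s, x):
    """True iff every predicate f_j = v in the itemset s holds in sample x (i.e. s ⊆ x)."""
    return all(x.get(f) == v for f, v in s.items())

# The rule from the text, r = (f1=a1 ∧ f3=c2, 2): its itemset is satisfied by the
# test sample, so the predicted label is 2.
rule_itemset, rule_label = {"f1": "a1", "f3": "c2"}, 2
print(satisfies(rule_itemset, test_x), rule_label)   # True 2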

Since our algorithm can only handle categorical features, we adopt a pre-processing procedure to discretize numeric features into categorical ones. More precisely, we employ the equal width discretization method to split the $j$th numeric feature values into $g_{j}$ groups, where $g_{j}$ is a user-specified parameter.
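
As an illustration, a minimal sketch of such an equal width discretization step is given below; the feature values and the choice $g_{j}=5$ are hypothetical and only illustrate the pre-processing described above.

# A minimal sketch of equal width discretization (illustrative values only).
import numpy as np

def equal_width_discretize(values, g):
    """Split a numeric feature into g equal-width bins and return bin indices 0..g-1."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), g + 1)
    # Interior edges only: np.digitize then maps each value to one of the g bins.
    return np.digitize(values, edges[1:-1])

ages = [19, 23, 35, 41, 44, 52, 58, 63, 70]
print(equal_width_discretize(ages, g=5))   # [0 0 1 2 2 3 3 4 4]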

3.2 Goodness of a rule

Obviously, the “best” rule we find should satisfy the test sample. Apart from this necessary condition, we evaluate each candidate rule by its accuracy and simplicity. For accuracy, we use a linear combination of precision and recall. For simplicity, since we only choose one rule, the number of rules and the overlap among rules are irrelevant in our problem. Therefore, the length of a rule is the only indicator of simplicity, and shorter rules are preferred over longer ones.

For a rule $r=(s,y)$, its length $length(r)$ is defined as the number of predicates in $s$. The precision is defined as the ratio between the number of samples satisfying $s$ in class $y$ and the number of all samples satisfying $s$. The recall is the ratio between the number of samples satisfying $s$ in class $y$ and the number of all samples in class $y$.

$precision(r,D)=\frac{|\{(x_{i},y_{i})\mid y_{i}=y,\ s\subseteq x_{i}\}|}{|\{(x_{i},y_{i})\mid s\subseteq x_{i}\}|}.$   (1)
$recall(r,D)=\frac{|\{(x_{i},y_{i})\mid y_{i}=y,\ s\subseteq x_{i}\}|}{|\{(x_{i},y_{i})\mid y_{i}=y\}|}.$   (2)

To make a trade-off between precision and recall, we use a linear combination of the two:

$A(r,D)=\alpha\cdot precision(r,D)+(1-\alpha)\cdot recall(r,D),$   (3)

where $\alpha$ is a user-specified parameter. Then, our two objectives can be summarized as follows:

  • For simplicity: minimize $length(\hat{r})$.

  • For accuracy: maximize $A(\hat{r},D)$.

For example, suppose that there are two rules $r_{1}=(f_{1}=a_{1}\land f_{3}=c_{2},2)$ and $r_{2}=(f_{2}=b_{3}\land f_{3}=c_{2},1)$. Among the training samples in Table I, two samples from class 2 satisfy $f_{1}=a_{1}\land f_{3}=c_{2}$ and one sample from class 1 satisfies $f_{2}=b_{3}\land f_{3}=c_{2}$. The precision of $r_{1}$ is the ratio between the number of samples satisfying $r_{1}$ in class 2 and the number of all samples satisfying $r_{1}$, which is $2\div 2=1$. The recall of $r_{1}$ in class 2 is calculated as the ratio between the number of samples satisfying $r_{1}$ in class 2 and the number of all samples in class 2, which is $2\div 4=0.5$. The precision and recall of $r_{2}$ can be calculated in the same way, whose values are $1$ and $0.333$. If we set $\alpha=0.5$, then $A(r_{1},D)$ is $0.75$ and $A(r_{2},D)$ is $0.667$. Hence, $r_{1}$ is better than $r_{2}$ because it has a higher accuracy score and both rules have the same length of 2.
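
The following minimal Python sketch (an illustration only, not the paper's implementation) reproduces these numbers by computing Eqs. (1)-(3) on the Table I data.

# A minimal sketch that computes Eqs. (1)-(3) for the two example rules on Table I.
train = [
    ({"f1": "a1", "f2": "b1", "f3": "c1", "f4": "d1"}, 1),
    ({"f1": "a1", "f2": "b2", "f3": "c1", "f4": "d2"}, 1),
    ({"f1": "a2", "f2": "b3", "f3": "c2", "f4": "d1"}, 1),
    ({"f1": "a1", "f2": "b2", "f3": "c2", "f4": "d1"}, 2),
    ({"f1": "a2", "f2": "b3", "f3": "c1", "f4": "d2"}, 2),
    ({"f1": "a3", "f2": "b1", "f3": "c2", "f4": "d1"}, 2),
    ({"f1": "a1", "f2": "b2", "f3": "c2", "f4": "d2"}, 2),
]

def satisfies(s, x):
    return all(x.get(f) == v for f, v in s.items())

def score(s, y, train, alpha):
    """A(r, D) = alpha * precision(r, D) + (1 - alpha) * recall(r, D) for the rule r = (s, y)."""
    n_s = sum(1 for x, _ in train if satisfies(s, x))              # samples satisfying s
    n_sy = sum(1 for x, c in train if c == y and satisfies(s, x))  # samples satisfying s in class y
    n_y = sum(1 for _, c in train if c == y)                       # samples in class y
    precision = n_sy / n_s if n_s else 0.0
    recall = n_sy / n_y
    return alpha * precision + (1 - alpha) * recall

r1 = ({"f1": "a1", "f3": "c2"}, 2)   # precision 1, recall 0.5   -> A = 0.75
r2 = ({"f2": "b3", "f3": "c2"}, 1)   # precision 1, recall 0.333 -> A ≈ 0.667
print(score(*r1, train, alpha=0.5), score(*r2, train, alpha=0.5))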

3.3 A greedy algorithm

3.3.1 An overview

If we employ an exhaustive method for finding the “best” rule, the number of candidate rules is $2^{M}$, where $M$ is the number of features. Apparently, such a naive algorithm is quite time-consuming in practice. Hence, we present a greedy algorithm to find a locally optimal solution in a breadth-first manner.

Algorithm 1 The naive greedy algorithm.
Input: A training dataset $D$, a test sample $x$, a maximal length parameter $maxL$ and the parameter $\alpha$.
Output: The “best” rule $\hat{r}=(\hat{s},\hat{y})$ that satisfies $x$.
1: for $k=1$ to $maxL$ do
2:     evaluate all rules of length $k$
3:     $\hat{r}_{k}\leftarrow$ the “best” rule of length $k$
4:     if $A(\hat{r}_{k},D)\leqslant A(\hat{r}_{k-1},D)$ then
5:         return $\hat{r}_{k-1}$
6:     end if
7: end for
8: return $\hat{r}_{k}$

As shown in Algorithm 1, we start from the rules of length $k=1$ to obtain the “best” rule $\hat{r}_{k}$. When $A(\hat{r}_{k},D)\leqslant A(\hat{r}_{k-1},D)$, the algorithm terminates and $\hat{r}_{k-1}$ is returned to classify the test sample. Otherwise, we continue to examine rules of length $k+1$ until the length exceeds the maximal length parameter $maxL$.
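
A minimal Python sketch of Algorithm 1 is shown below (an illustration only; the paper's PIC implementation is in C++). Candidate itemsets of length $k$ are drawn from the test sample's own feature values, and each candidate is labeled with the class that maximizes $A(r,D)$, as in the worked example that follows.

from itertools import combinations

def satisfies(s, x):
    return all(x.get(f) == v for f, v in s.items())

def best_label_and_score(s, train, labels, alpha):
    """Choose the class label that maximizes A(r, D) for the itemset s."""
    n_s = sum(1 for x, _ in train if satisfies(s, x))
    if n_s == 0:                       # no training sample matches s: skip this itemset
        return None, -1.0
    best_y, best_a = None, -1.0
    for y in labels:
        n_sy = sum(1 for x, c in train if c == y and satisfies(s, x))
        n_y = sum(1 for _, c in train if c == y)
        a = alpha * (n_sy / n_s) + (1 - alpha) * (n_sy / n_y)
        if a > best_a:
            best_y, best_a = y, a
    return best_y, best_a

def naive_greedy(train, test_x, max_len, alpha):
    """Grow the rule length k until the best score of length k stops improving."""
    labels = sorted({c for _, c in train})
    prev_rule, prev_score = None, -1.0
    for k in range(1, min(max_len, len(test_x)) + 1):
        best_rule, best_score = None, -1.0
        for feats in combinations(sorted(test_x), k):
            s = {f: test_x[f] for f in feats}
            y, a = best_label_and_score(s, train, labels, alpha)
            if y is not None and a > best_score:
                best_rule, best_score = (s, y), a
        if best_score <= prev_score:   # no improvement: return the best rule of length k-1
            return prev_rule
        prev_rule, prev_score = best_rule, best_score
    return prev_rule

# On the Table I data with alpha = 0.5, this returns ({"f3": "c2"}, 2), i.e. the rule (f3=c2, 2).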

TABLE II: Rules of length 1.
Rule  Class  $A(r,D)$
$f_{1}=a_{1}$  1  0.583
$f_{2}=b_{3}$  1  0.417
$f_{3}=c_{2}$  2  \textbf{0.750}
$f_{4}=d_{1}$  1  0.583
TABLE III: Rules of length 2.
Rule  Class  $A(r,D)$
$f_{1}=a_{1}\land f_{2}=b_{3}$  -  -
$f_{1}=a_{1}\land f_{3}=c_{2}$  2  \textbf{0.750}
$f_{1}=a_{1}\land f_{4}=d_{1}$  1  0.417
$f_{2}=b_{3}\land f_{3}=c_{2}$  1  0.667
$f_{2}=b_{3}\land f_{4}=d_{1}$  1  0.667
$f_{3}=c_{2}\land f_{4}=d_{1}$  2  0.583

To show how the algorithm works in practice, let us still utilize the data in Table I as an example. Firstly, we construct four candidate rules of length 1 from the test sample. Then we determine the class label of each rule by calculating its $A(r,D)$ score ($\alpha=0.5$) in each class. For example, $f_{1}=a_{1}$ has an accuracy score of $0.583$ when it belongs to class 1. Similarly, this score is $0.5$ when it belongs to class 2. We choose the class associated with the larger score, so the rule is $(f_{1}=a_{1},1)$ with a score of $0.583$. All rules of length 1 are listed in Table II. In a similar manner, all rules of length 2 are listed in Table III. The “best” rule in Table II is $\hat{r}_{1}=(f_{3}=c_{2},2)$ and the “best” rule in Table III is $\hat{r}_{2}=(f_{1}=a_{1}\land f_{3}=c_{2},2)$. Since $A(\hat{r}_{2},D)\leqslant A(\hat{r}_{1},D)$, the algorithm terminates and the test sample is classified to class 2. Note that $\hat{r}_{1}$ is preferred over $\hat{r}_{2}$ because $\hat{r}_{1}$ is shorter while their scores are equal.

3.3.2 Pruning based on the upper bound of $A(r,D)$

To further prune the search space, we first provide a direct upper bound on $A(r,D)$. For precision, it has a loose upper bound of $1$. Hence, we have

$A(r,D)\leqslant\alpha+(1-\alpha)\cdot recall(r,D).$   (4)

To deduce the upper bound of recall, we first define the support of the itemset $s$ of a rule $r$ with respect to a class $c$ as follows:

$support(s,c)=\frac{|\{(x_{i},y_{i})\mid y_{i}=c,\ s\subseteq x_{i}\}|}{|\{(x_{i},y_{i})\mid y_{i}=c\}|},$   (5)

where $c$ is a class label in $Y$. According to the definition of recall, we have

$recall(r,D)=support(s,y)\leqslant\max_{c\in Y}\{support(s,c)\},$   (6)

where $y$ is the corresponding class label in the rule $r$. Moreover, the set of all sub-rules of $r=(s,y)$ of length $length(r)-1$ is:

$R^{\prime}=\{r^{\prime}=(s^{\prime},y^{\prime})\mid s^{\prime}\subseteq s,\ length(r^{\prime})=length(r)-1\}.$   (7)

For a fixed class label $c$, we have the following inequality according to the anti-monotonicity of support:

$support(s,c)\leqslant\min_{r^{\prime}\in R^{\prime}}\{support(s^{\prime},c)\}.$   (8)

Hence, we have an upper bound on $recall(r,D)$ and an upper bound $ub(r,D)$ on $A(r,D)$:

$recall(r,D)=support(s,y)\leqslant\max_{c\in Y}\{support(s,c)\}\leqslant\max_{c\in Y}\{\min_{r^{\prime}\in R^{\prime}}\{support(s^{\prime},c)\}\},$   (9)

and

$A(r,D)\leqslant ub(r,D)=\alpha+(1-\alpha)\cdot\max_{c\in Y}\{\min_{r^{\prime}\in R^{\prime}}\{support(s^{\prime},c)\}\}.$   (10)

In the rule search process, we have three different pruning strategies. First of all, we can evaluate $ub(r,D)$ for each rule $r$ before scanning the training set to calculate its $A(r,D)$. If $ub(r,D)$ is no greater than $A(\hat{r},D)$, the score of the best rule $\hat{r}$ found so far, we can delete rule $r$ from the candidate rule set. Secondly, when we generate candidate rules of length $k+1$, we only consider those rules all of whose length-$k$ sub-rules belong to the candidate rule set of length $k$. Thirdly, for any rule $r^{\prime\prime}$ that contains $r$ as a sub-rule, it is easy to see that $A(r^{\prime\prime},D)\leqslant con(r,D)$, where $con(r,D)$ is defined as below:

$con(r,D)=\alpha+(1-\alpha)\cdot\max_{c\in Y}\{support(s,c)\}.$   (11)

By evaluating $con(r,D)$, we obtain an upper bound on the scores of all super-rules of $r$. If $con(r,D)\leqslant A(\hat{r},D)$, we can remove $r$ from the candidate rule set.
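
The following minimal sketch (an illustration under the dictionary-based itemset representation used in the earlier sketches, not the paper's implementation) shows how the two bounds in Eqs. (10) and (11) can be computed. For clarity the supports are recomputed by scanning $D$; PIC would instead reuse the supports cached from the previous level of the breadth-first search.

def support(s, c, train):
    """support(s, c): fraction of class-c samples that satisfy the itemset s (Eq. (5))."""
    in_c = [x for x, y in train if y == c]
    matched = sum(1 for x in in_c if all(x.get(f) == v for f, v in s.items()))
    return matched / len(in_c) if in_c else 0.0

def con_bound(s, labels, train, alpha):
    """con(r, D) in Eq. (11): an upper bound on A for every super-rule of r = (s, y)."""
    return alpha + (1 - alpha) * max(support(s, c, train) for c in labels)

def ub_bound(s, labels, train, alpha):
    """ub(r, D) in Eq. (10), computed from the length-(k-1) sub-itemsets of s (len(s) >= 2)."""
    subs = [{f: v for f, v in s.items() if f != dropped} for dropped in s]
    return alpha + (1 - alpha) * max(
        min(support(sp, c, train) for sp in subs) for c in labels)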

Algorithm 2 The PIC algorithm.
Input: A training dataset $D$, a test sample $x$, a maximal length parameter $maxL$ and the parameter $\alpha$.
Output: The “best” rule $\hat{r}=(\hat{s},\hat{y})$ that satisfies $x$.
1: for $k=1$ to $maxL$ do
2:     $R_{k}\leftarrow createCandidateRules(R_{k-1},k)$
3:     for $r$ in $R_{k}$ do
4:         if $ub(r,D)\leqslant A(\hat{r},D)$ then
5:             remove $r$ from $R_{k}$
6:         end if
7:         evaluate $A(r,D)$ and $con(r,D)$
8:         update $\hat{r}$ and $\hat{r}_{k}$
9:         if $con(r,D)\leqslant A(\hat{r},D)$ then
10:             remove $r$ from $R_{k}$
11:         end if
12:     end for
13:     if $A(\hat{r}_{k},D)\leqslant A(\hat{r}_{k-1},D)$ then
14:         return $\hat{r}_{k-1}$
15:     end if
16: end for
17: return $\hat{r}_{k}$

The improved greedy algorithm, equipped with the upper bound-based pruning, is shown in Algorithm 2. The three pruning strategies are employed in line 2 (strategy two), lines 4-6 (strategy one) and lines 9-11 (strategy three), respectively. In line 7, we scan the training set to evaluate $A(r,D)$ and $con(r,D)$.
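
As an illustration of strategy two, the sketch below gives one possible (hypothetical) realization of $createCandidateRules$: itemsets are frozensets of (feature, value) pairs taken from the test sample, and a length-$k$ candidate is generated only when all of its length-$(k-1)$ sub-itemsets survived the previous level.

from itertools import combinations

def create_candidate_rules(surviving_itemsets, k):
    prev = set(surviving_itemsets)        # surviving itemsets of length k - 1
    candidates = set()
    for a, b in combinations(prev, 2):
        merged = a | b
        # Join two itemsets only if the union has exactly k predicates on k distinct features.
        if len(merged) != k or len({f for f, _ in merged}) != k:
            continue
        if all(frozenset(sub) in prev for sub in combinations(merged, k - 1)):
            candidates.add(merged)
    return candidates

# Usage on the length-1 survivors of the running example (f2=b3 was pruned by strategy three):
level1 = [frozenset({("f1", "a1")}), frozenset({("f3", "c2")}), frozenset({("f4", "d1")})]
print(create_candidate_rules(level1, 2))   # the three length-2 candidates of Table V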

TABLE IV: Rules of length 1 (with pruning).
Rule  Class  $A(r,D)$  $con(r,D)$
$f_{1}=a_{1}$  1  0.583  0.833
$f_{2}=b_{3}$  1  0.417  0.666
$f_{3}=c_{2}$  2  \textbf{0.750}  0.875
$f_{4}=d_{1}$  1  0.583  0.833
TABLE V: Rules of length 2 (with pruning).
Rule  Class  $ub(r,D)$  $A(r,D)$
$f_{1}=a_{1}\land f_{3}=c_{2}$  2  0.833  \textbf{0.750}
$f_{1}=a_{1}\land f_{4}=d_{1}$  1  0.833  0.417
$f_{3}=c_{2}\land f_{4}=d_{1}$  2  0.833  0.583

We now show how these strategies are employed in the search process for finding rules from the data in Table I. Rules of length 1 do not have nonempty sub-rules, so it is unnecessary to evaluate their $ub(r,D)$. Their accuracy scores and $con(r,D)$ values are shown in Table IV. We find that $con(r=(f_{2}=b_{3},1),D)=0.666<A(\hat{r}_{1},D)=0.750$, which means no super-rule of $r=(f_{2}=b_{3},1)$ can achieve a score higher than $0.666$. Therefore, this rule is removed and will not be used for generating $R_{2}$. Three candidate rules of length 2 remain, as listed in Table V. We evaluate their $ub(r,D)$ values and find that they are all $0.833$, which is higher than the best score $0.750$ found so far. So we scan the training set to calculate the $A(r,D)$ score for each rule. Then, the algorithm terminates because $A(\hat{r}_{2},D)\leqslant A(\hat{r}_{1},D)$, and $\hat{r}_{1}=(f_{3}=c_{2},2)$ is returned as the identified “best” rule.

4 Experiments

In order to assess the performance of our algorithm, a series of experiments is conducted. First, we compare our algorithm with existing interpretable classification methods in terms of predictive performance. Second, to verify the personalization and interpretability of our method, we compare the set of all rules found by our method with the rule sets reported by existing algorithms. Finally, we employ the PIC algorithm on a real dataset to demonstrate the effectiveness and rationale of our formulation. The PIC algorithm is implemented in C++ and the experiments are conducted on a workstation with an Intel(R) Core(TM) 11400F CPU (2.60GHz) and 16GB memory.

Baselines. To evaluate the performance of our algorithm, three state-of-the-art rule-based interpretable classifiers are included in the performance comparison: DR-Net [25], BRS [2] and DRS [6]. As the representatives of classic tree-based classification algorithms, classification and regression tree (CART) and random forest (RF) in the scikit-learn package [31] are included in the comparison as well.

Evaluation metrics. In order to test the performance of PIC comprehensively, we choose different evaluation measures for different purposes. For predictivity, we choose the classification accuracy as the performance indicator. For interpretability, we consider the rule length, the total number of rules and the number of distinct “personalized” rules.

Parameter tuning. For DR-Net, we set both $\lambda_{1}$ and $\lambda_{2}$ to be $10^{-2}$. For BRS, we set $\alpha_{+}=\alpha_{-}=500$ and $\beta_{+}=\beta_{-}=1$. For DRS, we fix the mode of the key hyper-parameter $\lambda$ to be Max. Through preliminary experiments, we find that our algorithm achieves better performance when $\alpha\in[0.7,0.9]$.

4.1 Performance on predictivity

4.1.1 The dataset

We conduct the experimental study on 16 public datasets. More precisely, 14 data sets are obtained from the UCI repository [32], COMPAS is a variant of the ProPublica recidivism dataset (https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis), and heloc is from the Fair Isaac Corporation (FICO) dataset [33]. The detailed characteristics of these data sets are summarized in Table VI. A pre-processing procedure is employed to discretize numerical values into categorical ones. That is, the equal width method is used to split the $j$th numeric feature values into $g_{j}$ groups, where $g_{j}$ is a user-specified parameter. This parameter is set to be 10 on german and adult, and it is fixed to be 5 on the other datasets.

TABLE VI: Some important characteristics of the data sets used in the experiment. $N$ represents the number of samples, $M$ is the number of features, $C$ is the number of classes.
Dataset  $N$  $M$  $C$  Type
adult 30162 13 2 mixed
banknote 1372 4 2 numeric
breastcancer 286 9 2 categorical
car 1728 6 4 categorical
COMPAS 7214 6 2 mixed
german 1000 20 2 mixed
heloc(FICO) 10459 23 2 numeric
ILPD 583 10 2 mixed
liver 345 6 2 numeric
magic 19020 10 2 numeric
monks 554 6 2 categorical
mushroom 8124 22 2 categorical
nursery 12959 8 5 categorical
tictactoe 958 9 2 categorical
transfusion 749 4 2 numeric
vote 435 15 2 categorical
 

4.1.2 Results

TABLE VII: The average accuracy values based on the repeated execution of the 5-fold cross-validation procedure 5 times.
Dataset  Ours ($\alpha=0.7$)  Ours ($\alpha=0.9$)  DR-Net  BRS  DRS  CART  RF
adult  0.754  0.751  \textbf{0.836}  0.815  0.459  0.778  0.821
banknote  0.917  0.971  0.843  0.929  0.948  \textbf{0.974}  0.975
breastcancer  0.747  0.713  \textbf{0.772}  0.744  0.746  0.698  0.758
car  0.701  0.728  0.659  0.661  0.791  \textbf{0.974}  0.965
COMPAS  0.564  \textbf{0.633}  0.623  0.623  0.547  0.630  0.636
german  0.703  \textbf{0.741}  0.715  0.723  0.695  0.644  0.752
heloc(FICO)  0.692  0.690  0.693  \textbf{0.694}  0.645  0.623  0.679
ILPD  \textbf{0.711}  0.688  0.705  0.681  0.701  0.619  0.654
liver  \textbf{0.584}  0.558  0.579  0.512  0.525  0.554  0.545
magic  0.672  0.810  0.801  0.784  0.657  \textbf{0.811}  0.814
monks  0.964  0.972  0.747  0.982  \textbf{0.991}  0.971  0.982
mushroom  0.958  0.989  0.995  0.994  0.993  \textbf{1.000}  1.000
nursery  0.839  0.925  0.918  0.727  0.816  \textbf{0.996}  0.991
tictactoe  0.683  0.971  0.404  0.978  \textbf{0.992}  0.943  0.983
transfusion  0.759  \textbf{0.762}  0.760  0.761  0.757  0.752  0.767
vote  \textbf{0.958}  0.954  0.614  0.923  0.905  0.934  0.954

We repeat the 5-fold cross-validation procedure 5 times and compute the average accuracy values as the performance indicators for predictivity. In addition, in order to finish the experiments within an acceptable time, the $maxL$ parameter is set to be 2 on adult, 3 on heloc and mushroom, 4 on german, and 5 on nursery. For all remaining data sets, the $maxL$ parameter is fixed to be 100. Since DR-Net and BRS can only handle binary classification problems, we employ the one-versus-one (OVO) strategy to accomplish the multi-class classification task on car and nursery. That is, we construct $C(C-1)/2$ classifiers, one for each pair of classes, and the final predicted class label is determined by the voting result from the $C(C-1)/2$ classifiers.
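
The following minimal sketch (an illustration of the evaluation protocol only, not the actual experiment scripts) shows 5-fold cross-validation repeated 5 times with a one-versus-one wrapper, using CART from scikit-learn [31] as a stand-in base learner; DR-Net and BRS are external implementations, and the dataset below is a placeholder.

from sklearn.datasets import load_iris
from sklearn.model_selection import RepeatedKFold, cross_val_score
from sklearn.multiclass import OneVsOneClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                    # placeholder for one of the 16 datasets
cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
clf = OneVsOneClassifier(DecisionTreeClassifier())   # C(C-1)/2 pairwise classifiers + voting
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print("average accuracy over 25 runs: %.3f" % scores.mean())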

Table VII presents the comparison between the baseline methods and our PIC algorithm based on the 5-fold cross-validation. The best accuracy values among the five interpretable methods (ours, DR-Net, BRS, DRS and CART) on each dataset are marked in bold. The DR-Net algorithm does not perform very well on tictactoe and vote, probably because the sample size of these two datasets is not big enough for training DR-Net. Table VII shows that PIC can achieve the same level of classification accuracy as the interpretable classifiers chosen in this experiment. On COMPAS, german, ILPD, liver, transfusion, and vote, our algorithm achieves a higher accuracy than the other interpretable methods. On most datasets, the accuracy values of our PIC algorithm are very close to those of the tree-based interpretable method (CART). Furthermore, the performance gap between our PIC algorithm and RF is within 0.02 on all datasets except adult, car and nursery. The PIC algorithm does not perform very well on these datasets, probably because of their class imbalance.

Figure 1: The average running time (s) based on the repeated execution of the 5-fold cross-validation procedure 5 times.

Fig. 1 displays the average running time of different classification algorithms. We find that our algorithm is very efficient on small data sets such as banknote and liver. That is, our algorithm can achieve the same level of efficiency as the tree-based classifiers and takes less running time than the rule-based interpretable classifiers on these data sets. However, the running time of our algorithm on some large datasets, especially adult, heloc and magic, is several orders of magnitude larger than the time consumed by the others. This happens because PIC constructs a personalized model for each test sample. In essence, PIC is a lazy learning method, so it is more time-consuming than the eager learning methods.

4.2 Performance on personalization and interpretability

To date, there is still no universally recognized precise definition of interpretability. In our case, we measure the interpretability of each algorithm via the rule length, the total number of rules, and the comparison between the distinct “personalized” rules and the common rules.

TABLE VIII: The average length of rules reported by each algorithm.
Dataset  Ours ($\alpha=0.7$)  Ours ($\alpha=0.9$)  DR-Net  BRS  DRS
adult 1.91 2.00 10.50 2.95 9.55
banknote 1.44 2.11 9.67 2.97 2.59
breastcancer 1.62 2.86 38.00 2.98 4.03
car 1.08 1.38 16.15 1.45 4.82
COMPAS 2.96 3.73 53.80 3.00 3.21
german 1.39 3.77 24.18 3.00 11.59
heloc(FICO) 1.75 2.84 23.85 2.90 10.01
ILPD 1.00 1.89 43.62 2.82 5.70
liver 1.22 2.51 30.00 3.00 3.60
magic 1.53 3.83 8.99 3.00 3.62
monks 1.57 1.95 9.81 3.00 2.41
mushroom 2.07 2.91 11.82 2.84 6.52
nursery 2.07 2.91 8.61 1.41 2.72
tictactoe 1.05 2.85 13.86 3.00 3.92
transfusion 1.00 1.17 NaN 3.00 2.45
vote 1.13 2.04 45.00 2.97 6.33
 

Table VIII shows the average length of the rules reported by the four methods. For the PIC algorithm, we calculate the average length of all the rules found in a 5-fold cross-validation. In comparison, we use the average length of the 5 rule sets reported by the other methods. The rules reported by DR-Net are longer than those reported by the other methods on most datasets; the exception is transfusion, on which DR-Net does not find any rules. In contrast, the rules found by our PIC algorithm are shorter and their average lengths are all less than $4$. The rules found when $\alpha=0.7$ are shorter than those found when $\alpha=0.9$.

TABLE IX: The average number of rules reported by each algorithm.
Dataset  Ours ($\alpha=0.7$)  Ours ($\alpha=0.9$)  DR-Net  BRS  DRS
adult 5.0 381.4 10.0 3.8 6.6
banknote 7.0 32.8 48.4 7.6 19.8
breastcancer 20.2 30.4 50.0 10.0 63.6
car 13.4 41.6 283.0 66.6 70.4
COMPAS 34.8 85.8 21.6 3.2 4.6
german 4.8 111.6 22.2 5.8 5.6
heloc(FICO) 126.6 624.6 19.4 4.2 3.2
ILPD 2.2 40.8 11.0 8.0 16.4
liver 12.0 29.8 50.0 5.8 23.0
magic 22.2 371.6 21.0 3.2 17.2
monks 11.4 20.6 50.0 7.4 13.4
mushroom 10.6 21.6 49.2 8.2 19.8
nursery 147.2 328.8 258.0 75.6 16.0
tictactoe 10.0 21.6 48.8 8.0 28.4
transfusion 3.0 11.2 NaN 1.2 4.0
vote 4.4 10.8 50.0 7.2 21.2
 
Figure 2: The distribution of the rules reported by PIC on mushroom in a 5-fold cross-validation when $\alpha=0.9$.
TABLE X: The rules reported by PIC on mushroom in a 5-fold cross-validation when $\alpha=0.9$.
Common rules
Rules  Method
{bruises=t} {habitat=d} → 0  DRS
{stalk-surface-above-ring=s} {ring-number=o} {odor=n} → 0  DRS
{stalk-surface-below-ring=s} {ring-number=o} {odor=n} → 0  DRS
{ring-number=o} {gill-size=b} {odor=n} → 0  DRS
{ring-number=o} {odor=n} → 0  DRS
{gill-size=b} {odor=n} → 0  DRS
{bruises=t} {habitat=u} → 1  DR-Net
{odor=p} → 1  BRS, DRS
{gill-attachment=f} {gill-spacing=c} {ring-number=o} {bruises=f} → 1  DRS
{gill-attachment=f} {bruises=f} {population=v} → 1  DRS
{gill-spacing=c} {gill-size=n} {population=v} → 1  BRS, DRS
{gill-spacing=c} {stalk-surface-above-ring=k} → 1  DR-Net, BRS, DRS
{gill-spacing=c} {stalk-surface-below-ring=k} → 1  DRS
{stalk-shape=e} {ring-number=o} {stalk-root=b} {habitat=d} → 1  DRS
{cap-color=g} {bruises=f} {stalk-root=b} → 1  DRS
{gill-color=g} {stalk-root=b} → 1  BRS
{odor=f} → 1  DR-Net, BRS, DRS
{odor=c} → 1  DRS
{gill-color=b} → 1  DR-Net, BRS, DRS
“Personalized” rules
{spore-print-color=k} {gill-size=b} → 0
{gill-size=b} {spore-print-color=n} → 0
{cap-surface=s} {gill-spacing=c} {gill-size=n} → 1
{cap-surface=s} {gill-size=n} {bruises=f} → 1

Table IX shows the average number of rules reported by the four methods. The PIC algorithm finds more rules than any other algorithm on COMPAS, heloc, magic and transfusion, which means that our method is more likely to find “personalized” rules for distinct samples. Fig. 2 shows the distribution of the rules reported by PIC on mushroom in a 5-fold cross-validation when $\alpha=0.9$. PIC reports 23 rules, 8 of which are from class 0 and 15 of which are from class 1. Two rules, ({ring-number=o} {gill-size=b} {odor=n} → 0) and ({gill-spacing=c} {stalk-surface-above-ring=k} → 1), appear much more often than the others, while each of the remaining rules appears fewer than 1000 times in the experiment.

We consider a rule $r_{1}=(s_{1},y_{1})$ reported by PIC to be a common rule if there is a rule $r_{2}=(s_{2},y_{2})$ reported by the other methods that satisfies $s_{1}\subseteq s_{2}\land y_{1}=y_{2}$. Table X shows the common rules found on mushroom by PIC and the other methods, as well as the “personalized” rules reported only by PIC when $\alpha$ is set to 0.9. We find 4 “personalized” rules in this experiment, 2 for class 0 and 2 for class 1. Their frequency values are all in the range of 10 to 1000, so they do not appear by accident. This fact demonstrates that the PIC algorithm can indeed find some “personalized” rules that cannot be discovered by other methods.
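
A minimal sketch of this “common rule” test is given below (an illustration only); the two rules used in the usage example are taken from Table X.

# A PIC rule (s1, y1) is common if some rule (s2, y2) reported by another method
# satisfies s1 ⊆ s2 and y1 = y2; otherwise it is counted as a "personalized" rule.
def is_common(pic_rule, other_rules):
    s1, y1 = pic_rule
    return any(s1 <= s2 and y1 == y2 for s2, y2 in other_rules)

pic_rule = (frozenset({("spore-print-color", "k"), ("gill-size", "b")}), 0)
other_rules = [(frozenset({("odor", "p")}), 1),
               (frozenset({("gill-size", "b"), ("odor", "n")}), 0)]
print(is_common(pic_rule, other_rules))   # False: counted as "personalized"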

4.3 A real application scenario

4.3.1 The dataset

To illustrate why personalized and interpretable classification methods should be developed, here we use the application of predicting breast cancer metastasis [14] as a real example. As a complex and heterogeneous disease, breast cancer typically has many molecular subtypes. Hence, classifiers constructed for one cohort often cannot achieve good performance on other cohorts. To tackle this issue, one feasible solution is to assume that each patient belongs to a distinct subtype and construct different classification models for different patients [14].

The breast cancer metastasis dataset is derived from [14] and is composed of 1522 features and 581 samples. In our experiments, we choose 30 samples as the test set and the remaining samples are used as the training set. That is, there are 551 samples in the training set and 30 samples in the test set. The class labels are binary, so both DR-Net and BRS can handle this data set as well. All the 1522 features are numerical, so we employ the equal width method with $g_{j}=3$ to transform the $j$th numeric feature values into categorical ones in our PIC method.

4.3.2 Results

We conduct the following experiments on a workstation with an AMD Ryzen 5 5600X 6-Core Processor (3.70 GHz) and 32GB memory, which runs approximately twice as fast as the workstation used in Sections 4.1 and 4.2. When running DRS, we encounter an error when transforming the type of variables from type('O') to type('float64') in the digitize function from the numpy package, so we wrote a replacement function to continue the experiment. To finish the experiment within 24 hours, $maxL$ is set to be 2 in the PIC algorithm. All other parameters are set the same as in Sections 4.1 and 4.2.

TABLE XI: Experimental results on the breast cancer dataset.
  Methods Accuracy Time (s) #Rule Length
Ours ($\alpha=0.7$)  0.633  64188  2  2.00
Ours ($\alpha=0.9$)  0.633  84999  26  2.00
DR-Net 0.366 43 NaN NaN
BRS 0.533 37749 38 2.76
DRS 0.366 192 NaN NaN
CART 0.500 1 - -
RF 0.566 1 - -
 

The experimental results are shown in Table XI. DR-Net and DRS accomplish the classification task very fast but they do not find any rules. BRS finds a rule set of 38 rules, 9 of which have a length of 2 and 29 of which have a length of 3. When $\alpha$ is set to be 0.7, PIC finds 2 different rules, and when $\alpha$ is set to be 0.9, PIC finds 26 distinct rules. All the rules reported by PIC have a length of 2, which means that better rules might be found if we further increase the $maxL$ parameter. The results in Table XI also show that PIC can find more “personalized” rules when $\alpha$ is set to be 0.9, at the cost of more running time. More importantly, the PIC algorithm achieves better predictive accuracy than all the other competing classification methods in the comparison. This demonstrates the rationale of developing personalized interpretable classifiers in real applications such as cancer metastasis prediction.

5 Conclusion

In this paper, we introduce the personalized interpretable classification problem and present a greedy algorithm called PIC to identify a personalized rule for each individual test sample. To demonstrate the effectiveness of the PIC algorithm, we conduct a series of experiments on real data sets. The experimental results show that such a personalized interpretable classifier can achieve good performance in terms of both predictivity and interpretability.

Overall, we formally introduce a new data mining problem, namely personalized interpretable classification. By solving this new classification issue, we can identify some “personalized” rules that cannot be found by existing interpretable classification methods. In the future, we will further develop more effective algorithms for mining personalized rules from large data sets in different application domains.

Acknowledgments

This work has been supported by the Natural Science Foundation of China under Grant No. 61972066.

References

  • [1] H. Lakkaraju, S. H. Bach, and J. Leskovec, “Interpretable decision sets: A joint framework for description and prediction,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1675–1684.
  • [2] T. Wang, C. Rudin, F. Doshi-Velez, Y. Liu, E. Klampfl, and P. MacNeille, “A Bayesian framework for learning rule sets for interpretable classification,” Journal of Machine Learning Research, vol. 18, no. 1, pp. 1–37, 2017.
  • [3] S. Dash, O. Gunluk, and D. Wei, “Boolean decision rules via column generation,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, vol. 31, 2018, pp. 4660–4670.
  • [4] F. Yang, K. He, L. Yang, H. Du, J. Yang, B. Yang, and L. Sun, “Learning interpretable decision rule sets: A submodular optimization approach,” in Proceedings of the 35th International Conference on Neural Information Processing Systems, vol. 34, 2021, pp. 27890–27902.
  • [5] J. Yu, A. Ignatiev, P. J. Stuckey, and P. Le Bodic, “Learning optimal decision sets and lists with SAT,” Journal of Artificial Intelligence Research, vol. 72, pp. 1251–1279, 2021.
  • [6] G. Zhang and A. Gionis, “Diverse rule sets,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2020, pp. 1532–1541.
  • [7] P. Clark and T. Niblett, “The CN2 induction algorithm,” Machine learning, vol. 3, no. 4, pp. 261–283, 1989.
  • [8] W. W. Cohen, “Fast effective rule induction,” in Proceedings of the 12th International Conference on Machine Learning, 1995, pp. 115–123.
  • [9] B. Liu, W. Hsu, Y. Ma et al., “Integrating classification and association rule mining.” in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, 1998, pp. 80–86.
  • [10] M. T. Ribeiro, S. Singh, and C. Guestrin, ““Why should i trust you?” explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
  • [11] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, vol. 30, 2017.
  • [12] J. Chen, L. Song, M. Wainwright, and M. Jordan, “Learning to explain: An information-theoretic perspective on model interpretation,” in Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 883–892.
  • [13] S. Pang, T. Ban, Y. Kadobayashi, and N. Kasabov, “Personalized mode transductive spanning SVM classification tree,” Information Sciences, vol. 181, no. 11, pp. 2071–2085, 2011.
  • [14] M. J. Jahid, T. H. Huang, and J. Ruan, “A personalized committee classification approach to improving prediction of breast cancer metastasis,” Bioinformatics, vol. 30, no. 13, pp. 1858–1866, 2014.
  • [15] Y. Zhu, M. Kim, X. Zhu, J. Yan, D. Kaufer, and G. Wu, “Personalized diagnosis for Alzheimer’s disease,” in Proceedings of the 20th International Conference on Medical Image Computing and Computer-assisted Intervention, 2017, pp. 205–213.
  • [16] W. J. Murdoch, C. Singh, K. Kumbier, R. Abbasi-Asl, and B. Yu, “Definitions, methods, and applications in interpretable machine learning,” Proceedings of the National Academy of Sciences, vol. 116, no. 44, pp. 22071–22080, 2019.
  • [17] C. Bénard, G. Biau, S. Da Veiga, and E. Scornet, “SIRUS: Stable and interpretable rule set for classification,” Electronic Journal of Statistics, vol. 15, no. 1, pp. 427–505, 2021.
  • [18] G. Mita, P. Papotti, M. Filippone, and P. Michiardi, “Libre: Learning interpretable boolean rule ensembles,” in Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020, pp. 245–255.
  • [19] J. H. Friedman and B. E. Popescu, “Predictive learning via rule ensembles,” The Annals of Applied Statistics, vol. 2, no. 3, pp. 916–954, 2008.
  • [20] I. Mollas, N. Bassiliades, and G. Tsoumakas, “Conclusive local interpretation rules for random forests,” Data Mining and Knowledge Discovery, vol. 36, no. 4, pp. 1521–1574, 2022.
  • [21] E. Angelino, N. Larus-Stone, D. Alabi, M. Seltzer, and C. Rudin, “Learning certifiably optimal rule lists for categorical data,” Journal of Machine Learning Research, vol. 18, no. 234, pp. 1–78, 2018.
  • [22] D. Malioutov and K. Varshney, “Exact rule learning via boolean compressed sensing,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 765–773.
  • [23] D. Wei, S. Dash, T. Gao, and O. Gunluk, “Generalized linear rule models,” in Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 6687–6696.
  • [24] Z. Wang, W. Zhang, N. Liu, and J. Wang, “Scalable rule-based representation learning for interpretable classification,” in Proceedings of the 35th International Conference on Neural Information Processing Systems, vol. 34, pp. 30479–30491, 2021.
  • [25] L. Qiao, W. Wang, and B. Lin, “Learning accurate and interpretable decision rule sets from neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 5, 2021, pp. 4303–4311.
  • [26] A. Ignatiev, E. Lam, P. J. Stuckey, and J. Marques-Silva, “A scalable two stage approach to computing optimal decision sets,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 5, 2021, pp. 3806–3814.
  • [27] V. Vapnik, The nature of statistical learning theory.   Springer-Verlag, Berlin, Germany, 1999.
  • [28] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
  • [29] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi, “A survey of methods for explaining black box models,” ACM Computing Surveys, vol. 51, no. 5, pp. 1–42, 2018.
  • [30] C. Burns, J. Thomason, and W. Tansey, “Interpreting black box models via hypothesis testing,” in Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, 2020, pp. 47–57.
  • [31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and Édouard Duchesnay, “Scikit-learn: Machine learning in python,” Journal of Machine Learning Research, vol. 12, no. 85, pp. 2825–2830, 2011. [Online]. Available: http://jmlr.org/papers/v12/pedregosa11a.html
  • [32] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
  • [33] FICO. Explainable machine learning challenge, 2018. [Online]. Available: https://community.fico.com/s/explainable-machine-learning-challenge