Personalized Interpretable Classification
Abstract
How to interpret a data mining model has received much attention recently, because people may distrust a black-box predictive model whose inner workings they do not understand. A model is therefore more trustworthy if it can transparently illustrate how its decisions are made. Although many rule-based interpretable classification algorithms have been proposed, none of these existing solutions can directly construct an interpretable model that provides a personalized prediction for each individual test sample. In this paper, we take a first step towards formally introducing personalized interpretable classification as a new data mining problem. In addition to formulating this new problem, we present a greedy algorithm called PIC (Personalized Interpretable Classifier) that identifies a personalized rule for each individual test sample. To demonstrate the necessity, feasibility and advantages of such a personalized interpretable classification method, we conduct a series of empirical studies on real data sets. The experimental results show that: (1) the new problem formulation enables us to find interesting rules for test samples that may be missed by existing non-personalized classifiers; (2) our algorithm achieves the same level of predictive accuracy as state-of-the-art (SOTA) interpretable classifiers; and (3) on a real data set for predicting breast cancer metastasis, such a personalized interpretable classifier can outperform SOTA methods in terms of both accuracy and interpretability.
Index Terms:
Interpretable classification, rule discovery, personalization, transductive learning.

1 Introduction
In many real applications such as medical diagnosis and biological data analysis, interpretable machine learning models are preferred because users also want to understand why and how a decision is made [1, 2]. Hence, in addition to accuracy, interpretability has become one of the well-recognized goals when developing new machine learning models and algorithms [3, 4]. For classification tasks, rule-based prediction approaches are arguably the most explainable models, since they can illustrate the decision process through simple logical rules [5, 6].
To construct interpretable rule-based classification models, numerous research efforts have been made over the past decades. On one hand, existing solutions include classical sequential covering algorithms such as CN2 [7] and RIPPER [8], as well as classifiers based on association rules [9]. On the other hand, we also observe a recent upsurge of interest in developing new rule-based explainable classification models (e.g. [1, 2, 3, 4, 5, 6]).
Apart from learning an explainable classifier that is assessed as a whole, there is also a need to explain each individual prediction [10]. Towards this goal, two different strategies are typically employed in the literature. Post-processing methods learn an interpretable model locally around an individual prediction of an arbitrary classifier [10, 11, 12]. In contrast, personalized transductive learning approaches construct a unique classifier for each test sample during the model creation phase [13, 14, 15]. Such a personalized classifier is desirable because the real data distribution in biomedical and clinical applications can be too complicated to be represented by a single general model [15]. Therefore, a personalized model may provide a more accurate prediction and can implicitly illustrate why the prediction is made.
Overall, a personalized and interpretable classification model should provide a unique explanation of the prediction result for each individual test sample. Unfortunately, existing solutions are insufficient to achieve this objective, as elaborated below. Firstly, current rule-based classification methods do provide a logical illustration of the classification process; however, the learned rules are not specially constructed for a given test sample. Secondly, post-processing methods (e.g. [10, 11, 12]) approximate each individual prediction of a third-party model and generally cannot provide explicit rules to illustrate the original classification decision. Finally, personalized transductive learning approaches (e.g. [13, 14, 15]) typically adopt the support vector machine as the base classifier, in which the training data are reweighted to fit the test sample. As a result, these classification methods inherently lack interpretability.
Motivated by the above observations, we intend to address the following new data mining issue: "Can we directly and efficiently construct a personalized yet interpretable classification model for each test sample?" To be interpretable, we try to find a best-matching rule that is as simple as possible for the test sample. To be personalized, such a rule is obtained by searching the training data from scratch based on the feature-value combination of the test sample. However, the number of possible rules is exponential in the number of features, making an exhaustive search infeasible. To quickly find such a personalized rule, we present a greedy algorithm that works in a breadth-first search manner. More precisely, we first check candidate rules of length $l$ and then increase the rule length to $l+1$. If the best rule of length $l+1$ cannot beat the best one of length $l$, we terminate the search and return the best rule of length $l$ to classify the test sample. The goodness of each candidate rule is evaluated based on a linear combination of precision and recall.
To empirically demonstrate the feasibility and advantages of such a personalized interpretable classification model, we conducted extensive experiments on real data sets. The experimental results show that: (1) Our formulation can find personalized rules for test samples that may be missed by existing rule-based classification models, and these customized rules can provide interesting explanations of the class assignments. (2) Our algorithm achieves the same level of classification accuracy as state-of-the-art (interpretable) classification methods. (3) Although our algorithm is more time-consuming than non-personalized interpretable classifiers, it finishes rule discovery and classification within a reasonable time.
In short, the main contributions of this paper can be summarized as follows:
• We take a first step towards formally introducing the problem of personalized interpretable classification.
• We present a greedy algorithm that identifies a personalized rule for each individual test sample. To the best of our knowledge, this is the first piece of work that creates a personalized interpretable classification model in terms of logical rules.
• We conduct a series of experiments to demonstrate the necessity, feasibility and advantages of such a personalized interpretable classifier.
The remaining parts of this paper are organized as follows: Section 2 discusses existing research efforts that are closely related to our formulation and algorithm. Section 3 presents the problem formulation on personalized interpretable classification and a greedy algorithm that can fulfill this task. Section 4 shows the experimental results and Section 5 concludes the paper.
2 Related Work
In Section 2.1, we summarize rule-based interpretable classification algorithms. In Section 2.2, we discuss existing personalized classification algorithms. In Section 2.3, we briefly review learning algorithms that try to explain an individual prediction of a third-party classifier.
2.1 Rule-based interpretable classification
To date, there is still no consensus on a precise definition of interpretability [16]. In practice, simplicity, predictivity and stability are generally regarded as basic requirements for interpretable models [17]. In the context of rule-based classification models, the simplicity of model structure can be evaluated based on the number of rules, the length of each rule and the overlap among rules. The predictivity corresponds to the classification accuracy, which has been one of the long-term goals for any classifier. The stability refers to the model robustness with respect to small data perturbations [17].
To identify classification rules, early algorithms typically employ heuristic sequential covering methods and divide-and-conquer strategies [7, 8]. However, simply being rule-based cannot fully guarantee interpretability [6], and the final rule set is not explicitly optimized with respect to interpretability. That is, these classical rule-based classifiers mainly focus on maximizing classification accuracy, ignoring other interpretability measures such as simplicity and stability.
Recently, with the renewed interest in rule-based interpretable classification models, researchers have begun to construct classifiers that explicitly optimize both accuracy and simplicity. These algorithms can be roughly divided into two categories. Algorithms in the first category [1, 2, 6, 18, 19] typically adopt a two-stage pipeline: rule generation and rule selection. In rule generation, an association rule mining algorithm such as FP-Growth is first employed to produce a set of candidate rules. In rule selection, a small and compact subset of rules is selected via either heuristic algorithms or the solution to a new optimization problem. Alternatively, some algorithms extract interpretable rules from random forests [17, 20]. Algorithms in the second category [3, 4, 21, 22, 23, 24, 25, 26] directly learn rules from the data by formulating the rule set discovery issue as various types of optimization problems.
Despite their apparent differences in problem formulation and rule discovery algorithms, all existing rule-based interpretable classification models follow an inductive learning paradigm. That is, these classifiers conduct rule learning only on the training data and are therefore unable to provide a potentially customized rule for each individual test sample.
2.2 Personalized transductive classification
In addition to the training data, the transductive learning approach also employs information from the test data [27]. In the context of classification, a personalized transductive classification model adaptively creates a unique model for each test sample [13]. Hence, such a learning paradigm naturally fits applications where the focus is on the prediction and explanation for each individual sample.
The most representative personalized transductive classification algorithm is the k-nearest-neighbor (kNN) classifier [28], in which the prediction model is dynamically constructed from the samples within the kNN neighborhood. More recently, the support vector machine (SVM) has been customized to generate personalized models, which have been applied to various biomedical data analysis tasks [13, 14, 15].
However, neither lazy learning approaches such as kNN nor personalized SVMs can provide an explicit explanation of the classification decision. Hence, a personalized yet interpretable classification model is still lacking in the literature.
2.3 The interpretation of an individual prediction
How to interpret a black-box machine learning model has been widely investigated in recent years [29, 30]. There are two closely related problems in this direction: the interpretation of a learned model and the interpretation of an individual prediction. For existing solutions to the former issue, please refer to a recent survey [29]. Here we focus only on the latter issue, since our algorithm also seeks to explain the prediction for each test sample.
To date, many effective algorithms have been proposed to learn an interpretation for each individual prediction of an arbitrary classifier (e.g. [10, 11, 12]). Despite their great success, we have the following remarks.
First of all, these methods provide an explanation only after the model has made its prediction. That is, they do not adaptively construct an interpretable model for each test sample during the model construction phase. As a result, a more appropriate interpretation may be missed due to the separation of model construction and interpretation. Moreover, these methods generally cannot provide explanations in terms of rules. For example, in [10] the model to be explained is approximated with a linear model in a local region around the test sample, and the feature importance scores derived from the weights of the linear model are employed to explain the corresponding prediction.
3 Method
3.1 Problem formulation
Let $D = \{(x_i, y_i)\}_{i=1}^{n}$ denote the training set of $n$ samples, where each $x_i$ comprises $m$ categorical feature values and $y_i \in Y$ is a class label. We use $F_j$ to denote the $j$th feature, and a predicate takes the form of $F_j = v$, where $v$ is one of the possible feature values of the $j$th feature. An itemset $I$ is defined as a conjunction of predicates. A given sample $x$ satisfies an itemset $I$ only if all predicates in $I$ are true for that sample. We use $x \models I$ to denote the fact that $x$ satisfies $I$. A rule $r$ is a tuple $(I, c)$, where $I$ is an itemset and $c \in Y$ is a class label.
For each test sample $x_t$ whose class label is unknown, we try to find a "best" rule $r = (I, c)$ such that $x_t$ satisfies $I$, and then $c$ is the label we predict. Hence, the personalized interpretable classification problem can be cast as the following algorithmic issue:
• Input: a training dataset $D$ and a test sample $x_t$.
• Output: the "best" rule $r = (I, c)$ that matches $x_t$.
TABLE I: A toy dataset with four categorical features, 7 training samples (3 from class 1 and 4 from class 2), and one test sample with an unknown label.
To illustrate the above concepts, the dataset in Table I is taken as an example. In Table I, rows stand for samples and columns represent their class label and features. There are 7 training samples, in which 3 samples are drawn from class 1 and the other 4 samples from class 2. We take the sample in the last row as a test sample. Assume that we find a "best" rule $r = (I, c)$. Since the test sample satisfies $I$ by construction, it will be classified to class $c$.
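To make the matching semantics concrete, the following minimal Python sketch shows how a sample is tested against an itemset; the feature names and values are hypothetical stand-ins, since Table I is schematic.

```python
def satisfies(sample: dict, itemset: dict) -> bool:
    """A sample satisfies an itemset iff every predicate F_j = v holds for it."""
    return all(sample.get(f) == v for f, v in itemset.items())

# Hypothetical test sample and rule, standing in for the entries of Table I.
test_sample = {"F1": "a", "F2": "b", "F3": "c", "F4": "d"}
rule_itemset, rule_class = {"F2": "b", "F4": "d"}, 2

if satisfies(test_sample, rule_itemset):
    print("predicted class:", rule_class)  # the rule matches, so predict class 2
```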
Since our algorithm can only handle categorical features, we adopt a pre-processing procedure to discretize numeric features into categorical ones. More precisely, we employ the equal-width discretization method to split the values of each numeric feature into $k$ groups, where $k$ is a user-specified parameter.
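A minimal sketch of this pre-processing step is given below, using numpy's digitize to map each value to one of $k$ equal-width bins; the sample values are illustrative only.

```python
import numpy as np

def equal_width_discretize(values: np.ndarray, k: int) -> np.ndarray:
    """Split numeric values into k equal-width bins; returns bin indices 0..k-1."""
    edges = np.linspace(values.min(), values.max(), k + 1)
    # Digitizing against the k-1 interior edges maps each value to its bin index.
    return np.digitize(values, edges[1:-1])

x = np.array([0.1, 0.4, 0.35, 0.8, 1.0])
print(equal_width_discretize(x, k=3))  # -> [0 1 0 2 2]
```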
3.2 Goodness of a rule
Obviously, the "best" rule we find must be satisfied by the test sample. Apart from this necessary condition, we evaluate each candidate rule by its accuracy and simplicity. For accuracy, we use a linear combination of precision and recall. For simplicity, since we choose only one rule, the number of rules and the overlap among rules are irrelevant in our problem. Therefore, the length of a rule is the only indicator of simplicity, and shorter rules are preferred over longer ones.
For a rule $r = (I, c)$, its length $|r|$ is defined as the number of predicates in $I$. The precision is the fraction of samples satisfying $I$ that belong to class $c$, and the recall is the fraction of samples in class $c$ that satisfy $I$:

$$\mathrm{precision}(r) = \frac{|\{(x, y) \in D \mid x \models I,\; y = c\}|}{|\{(x, y) \in D \mid x \models I\}|} \quad (1)$$

$$\mathrm{recall}(r) = \frac{|\{(x, y) \in D \mid x \models I,\; y = c\}|}{|\{(x, y) \in D \mid y = c\}|} \quad (2)$$
To make a trade-off between precision and recall, we use their linear combination as the accuracy score:

$$Q(r) = \lambda \cdot \mathrm{precision}(r) + (1 - \lambda) \cdot \mathrm{recall}(r), \quad (3)$$

where $\lambda \in [0, 1]$ is a user-specified parameter. Then, our two objectives can be summarized as follows:
• For simplicity: minimize $|r|$.
• For accuracy: maximize $Q(r)$.
For example, suppose that there are two rules $r_1 = (I_1, 2)$ and $r_2 = (I_2, 1)$, both of length 2, where in Table I two samples from class 2 satisfy $I_1$ and one sample from class 1 satisfies $I_2$. The precision of $r_1$ is the ratio between the number of samples satisfying $I_1$ in class 2 and the number of all samples satisfying $I_1$. The recall of $r_1$ is the ratio between the number of samples satisfying $I_1$ in class 2 and the number of all samples in class 2, i.e., $2/4$. The precision and recall of $r_2$ are calculated in the same way, with recall $1/3$. Plugging these values into Eq. (3) gives $Q(r_1) > Q(r_2)$. Hence, $r_1$ is better than $r_2$ because it has a higher accuracy score while both rules have the same length of 2.
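The following sketch computes Eqs. (1)-(3) for a candidate rule over the training pairs, reusing the satisfies helper from Section 3.1; the function and argument names are our own, not part of the PIC implementation.

```python
def rule_score(train, itemset, label, lam):
    """Q(r) = lam * precision + (1 - lam) * recall for the rule (itemset -> label),
    where train is a list of (sample, class) pairs."""
    covered = [y for x, y in train if satisfies(x, itemset)]  # samples matching I
    hits = sum(1 for y in covered if y == label)              # of these, in class c
    n_label = sum(1 for _, y in train if y == label)          # size of class c
    precision = hits / len(covered) if covered else 0.0
    recall = hits / n_label if n_label else 0.0
    return lam * precision + (1 - lam) * recall
```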
3.3 A greedy algorithm
3.3.1 An overview
If we employed an exhaustive method to find the "best" rule, the number of candidate rules to examine would be $O(2^m)$, where $m$ is the number of features, since every subset of the test sample's predicates forms a candidate itemset. Apparently, such a naive algorithm is quite time-consuming in practice. Hence, we present a greedy algorithm that finds a locally optimal solution in a breadth-first manner.
As shown in Algorithm 1, we start from the rules of length 1 and keep track of the "best" rule $r_l^*$ of each length $l$. When $Q(r_{l+1}^*) \le Q(r_l^*)$, the algorithm terminates and $r_l^*$ is returned to classify the test sample. Otherwise, we continue to examine rules of greater length until the length exceeds the maximal length parameter $L$.
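The search loop can be sketched in a few lines of Python; this is an illustrative re-implementation of the procedure described above (without the pruning of Section 3.3.2), not the authors' C++ code, and it reuses rule_score from Section 3.2.

```python
from itertools import combinations

def pic_greedy(train, test_sample, labels, lam, max_len):
    """Greedy breadth-first search for a personalized rule (a sketch of
    Algorithm 1 without the upper-bound pruning strategies)."""
    predicates = list(test_sample.items())   # only predicates satisfied by x_t
    best_rule, best_q = None, -1.0
    for length in range(1, max_len + 1):
        level_best, level_q = None, -1.0
        for combo in combinations(predicates, length):
            itemset = dict(combo)
            for c in labels:                 # assign the class with the larger score
                q = rule_score(train, itemset, c, lam)
                if q > level_q:
                    level_best, level_q = (itemset, c), q
        if level_q <= best_q:                # no improvement: keep the shorter rule
            return best_rule, best_q
        best_rule, best_q = level_best, level_q
    return best_rule, best_q
```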
TABLE II: The four candidate rules of length 1 constructed from the test sample in Table I, with their class labels and accuracy scores $Q$.
TABLE III: The candidate rules of length 2, with their class labels and accuracy scores $Q$.
To show how the algorithm works in practice, let us again use the data in Table I as an example. Firstly, we construct four candidate rules of length 1 from the test sample. Then we determine the class label of each rule by calculating its score $Q$ under each class and assigning the class that yields the larger score. All rules of length 1 are listed in Table II, and all rules of length 2 are listed in Table III. The "best" rule of length 1 in Table II and the "best" rule of length 2 in Table III have exactly equal scores. Since $Q(r_2^*) \le Q(r_1^*)$, the algorithm terminates and the test sample is classified to class 2; the length-1 rule is preferred because it is shorter when the scores are equal.
3.3.2 Pruning based on the upper bound of $Q$
To further prune the search space, we first provide a direct upper bound on $Q$. Since precision has a loose upper bound of 1, we have

$$Q(r) \le \lambda + (1 - \lambda) \cdot \mathrm{recall}(r). \quad (4)$$

To derive an upper bound on recall, we first define the support of the itemset $I$ of a rule with respect to a class $c'$ as

$$\mathrm{supp}(I, c') = |\{(x, y) \in D \mid x \models I,\; y = c'\}|, \quad (5)$$

where $c'$ is one of the class labels in $Y$. According to the definition of recall, we have

$$\mathrm{recall}(r) = \frac{\mathrm{supp}(I, c)}{|\{(x, y) \in D \mid y = c\}|}, \quad (6)$$

where $c$ is the class label in the rule $r = (I, c)$. Moreover, the set of all sub-rules of $r$ of length $|r| - 1$ is

$$S(r) = \{(I', c) \mid I' \subset I,\; |I'| = |I| - 1\}. \quad (7)$$

For a fixed class label $c$, we have the following inequality according to the anti-monotonicity of support:

$$\mathrm{supp}(I, c) \le \min_{(I', c) \in S(r)} \mathrm{supp}(I', c). \quad (8)$$

Hence we obtain an upper bound of recall and an upper bound of $Q$:

$$\mathrm{recall}(r) \le \min_{r' \in S(r)} \mathrm{recall}(r') \quad (9)$$

and

$$Q(r) \le \lambda + (1 - \lambda) \cdot \min_{r' \in S(r)} \mathrm{recall}(r'). \quad (10)$$
In the rule search process, we employ three different pruning strategies. First of all, before scanning the training set to calculate $Q(r)$, we can evaluate the bound in Eq. (10) for each candidate rule $r$; if this bound is no greater than $Q^*$, the score of the best rule found so far, rule $r$ can be deleted from the candidate rule set. Secondly, when generating candidate rules of length $l + 1$, we only consider those rules all of whose sub-rules belong to the candidate rule set of length $l$. Thirdly, for any rule $r''$ that contains $r$ as a sub-rule, it is easy to see that $Q(r'') \le UB(r)$, where $UB(r)$ is defined as

$$UB(r) = \lambda + (1 - \lambda) \cdot \mathrm{recall}(r). \quad (11)$$

By evaluating $UB(r)$, we obtain an upper bound on the scores of all super-rules of $r$. If $UB(r) \le Q^*$, we can remove $r$ from the candidate rule set.
The improved greedy algorithm equipped with the upper bound-based pruning is shown in Algorithm 2. The three pruning strategies are employed in line 2 (strategy two), lines 4-6 (strategy one) and lines 9-11 (strategy three), respectively. In line 7, we scan the training set to evaluate $Q(r)$ and $UB(r)$.
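A minimal sketch of the bound and the resulting filtering step is shown below; the candidate bookkeeping is simplified to one tuple per rule, which is our own representation rather than the paper's data structure.

```python
def ub(recall: float, lam: float) -> float:
    """Eq. (11): precision is at most 1 and recall is anti-monotone under
    itemset extension, so Q of any super-rule is at most lam + (1-lam)*recall."""
    return lam + (1 - lam) * recall

def prune(candidates, best_q, lam):
    """Drop candidates whose super-rules cannot beat the best score so far;
    each candidate is an (itemset, label, recall, q) tuple."""
    return [cand for cand in candidates if ub(cand[2], lam) > best_q]
```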
TABLE IV: The candidate rules of length 1, with their class labels, accuracy scores $Q$ and upper bounds $UB$.
TABLE V: The remaining candidate rules of length 2, with their class labels, accuracy scores $Q$ and upper bounds $UB$.
We now show how these strategies are employed when searching for rules in the data of Table I. Rules of length 1 have no nonempty sub-rules, so it is unnecessary to evaluate the bound in Eq. (10) for them; their accuracy scores and $UB$ values are shown in Table IV. One rule has a $UB$ value no greater than the best score found so far, which means that none of its super-rules has an opportunity to achieve a higher score. This rule is therefore removed and not used for generating candidates of length 2. The three remaining candidate rules of length 2 are listed in Table V. Their bounds from Eq. (10) are higher than the best score found so far, so we scan the training set to calculate the score of each rule. The algorithm then terminates because the best score does not improve, and the best rule of length 1 is returned as the identified "best" rule.
4 Experiments
To assess the performance of our algorithm, a series of experiments is conducted. First, we compare our algorithm with existing interpretable classification methods in terms of predictive performance. Second, to verify the personalization and interpretability of our method, we compare the set of all rules found by our method with the rule sets reported by existing algorithms. Finally, we apply the PIC algorithm to a real dataset to demonstrate the effectiveness and rationale of our formulation. The PIC algorithm is implemented in C++, and the experiments are conducted on a workstation with an Intel(R) Core(TM) i5-11400F CPU (2.60 GHz) and 16 GB memory.
Baselines. To evaluate the performance of our algorithm, three state-of-the-art rule-based interpretable classifiers are included in the comparison: DR-Net [25], BRS [2] and DRS [6]. As representatives of classic tree-based classification algorithms, classification and regression tree (CART) and random forest (RF) from the scikit-learn package [31] are included as well.
Evaluation metrics. To test the performance of PIC comprehensively, we choose different evaluation measures for different purposes. For predictivity, we use classification accuracy as the performance indicator. For interpretability, we consider the length of each rule, the number of total rules and the number of distinct "personalized" rules.
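As a sketch, the interpretability indicators for PIC can be computed from the per-test-sample rules as follows; the rule representation is the hypothetical one used in the sketches of Section 3.

```python
def interpretability_metrics(per_sample_rules):
    """per_sample_rules: one (itemset, label) pair per test sample, where an
    itemset is a dict mapping feature -> value. Returns the average rule
    length, the total rule count, and the number of distinct rules."""
    lengths = [len(itemset) for itemset, _ in per_sample_rules]
    distinct = {(frozenset(itemset.items()), label)
                for itemset, label in per_sample_rules}
    return sum(lengths) / len(lengths), len(per_sample_rules), len(distinct)
```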
Parameter tuning. For DR-Net and BRS, we set their key hyper-parameters through empirical tuning. For DRS, we fix the mode of the key hyper-parameter to be Max. Through some experiments, we find that our algorithm achieves a better performance when the trade-off parameter $\lambda$ is large; we use $\lambda = 0.7$ and $\lambda = 0.9$ in the experiments below.
4.1 Performance on predictivity
4.1.1 The datasets
We conduct the experimental study on 16 public datasets. More precisely, 14 data sets are obtained from the UCI repository [32], COMPAS is a variant of the ProPublica recidivism dataset (https://www.propublica.org/datastore/dataset/compas-recidivism-risk-score-data-and-analysis), and heloc comes from the Fair Isaac Corporation (FICO) dataset [33]. The detailed characteristics of these data sets are summarized in Table VI. A pre-processing procedure is employed to discretize numerical values into categorical ones: the equal-width method is used to split the values of each numeric feature into $k$ groups, where $k$ is a user-specified parameter. This parameter is set to 10 on german and adult, and fixed to 5 on the other datasets.
Dataset | #Samples | #Features | #Classes | Type
adult | 30162 | 13 | 2 | mixed |
banknote | 1372 | 4 | 2 | numeric |
breastcancer | 286 | 9 | 2 | categorical |
car | 1728 | 6 | 4 | categorical |
COMPAS | 7214 | 6 | 2 | mixed |
german | 1000 | 20 | 2 | mixed |
heloc(FICO) | 10459 | 23 | 2 | numeric |
ILPD | 583 | 10 | 2 | mixed |
liver | 345 | 6 | 2 | numeric |
magic | 19020 | 10 | 2 | numeric |
monks | 554 | 6 | 2 | categorical |
mushroom | 8124 | 22 | 2 | categorical |
nursery | 12959 | 8 | 5 | categorical |
tictactoe | 958 | 9 | 2 | categorical |
transfusion | 749 | 4 | 2 | numeric |
vote | 435 | 15 | 2 | categorical |
4.1.2 Results
Dataset | Ours | DR-Net | BRS | DRS | CART | RF
adult | 0.754 | 0.751 | 0.815 | 0.459 | 0.778 | 0.821 | |
banknote | 0.917 | 0.971 | 0.843 | 0.929 | 0.948 | 0.975 | |
breastcancer | 0.747 | 0.713 | 0.744 | 0.746 | 0.698 | 0.758 | |
car | 0.701 | 0.728 | 0.659 | 0.661 | 0.791 | 0.965 | |
COMPAS | 0.564 | 0.623 | 0.623 | 0.547 | 0.630 | 0.636 | |
german | 0.703 | 0.715 | 0.723 | 0.695 | 0.644 | 0.752 | |
heloc(FICO) | 0.692 | 0.690 | 0.693 | 0.645 | 0.623 | 0.679 | |
ILPD | 0.688 | 0.705 | 0.681 | 0.701 | 0.619 | 0.654 | |
liver | 0.558 | 0.579 | 0.512 | 0.525 | 0.554 | 0.545 | |
magic | 0.672 | 0.810 | 0.801 | 0.784 | 0.657 | 0.814 | |
monks | 0.964 | 0.972 | 0.747 | 0.982 | 0.971 | 0.982 | |
mushroom | 0.958 | 0.989 | 0.995 | 0.994 | 0.993 | 1.000 | |
nursery | 0.839 | 0.925 | 0.918 | 0.727 | 0.816 | 0.991 | |
tictactoe | 0.683 | 0.971 | 0.404 | 0.978 | 0.943 | 0.983 | |
transfusion | 0.759 | 0.760 | 0.761 | 0.757 | 0.752 | 0.767 | |
vote | 0.954 | 0.614 | 0.923 | 0.905 | 0.934 | 0.954 | |
We repeat the 5-fold cross-validation procedure 5 times and report the average accuracy values as the performance indicators for predictivity. In addition, to finish the experiments within an acceptable time frame, the maximal length parameter $L$ is set to 2 on adult, 3 on heloc and mushroom, 4 on german, and 5 on nursery; on all remaining data sets, it is fixed to 100. Since DR-Net and BRS can only handle binary classification problems, we employ the one-versus-one (OVO) strategy to accomplish the multi-class classification tasks on car and nursery. That is, we construct one classifier for each pair of classes, and the final predicted class label is determined by voting among the resulting $|Y|(|Y|-1)/2$ classifiers.
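The decomposition can be sketched as follows; fit_binary and predict stand for any base learner's training and prediction routines (hypothetical interfaces, not tied to a particular implementation).

```python
from collections import Counter
from itertools import combinations

def ovo_predict(train, test_sample, labels, fit_binary, predict):
    """One-versus-one voting: one binary classifier per pair of classes,
    each casting a vote for the test sample's label."""
    votes = Counter()
    for a, b in combinations(sorted(labels), 2):
        pair_train = [(x, y) for x, y in train if y in (a, b)]
        model = fit_binary(pair_train)        # e.g. a binary learner on {a, b}
        votes[predict(model, test_sample)] += 1
    return votes.most_common(1)[0][0]         # label with the most votes
```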
Table VII presents the comparison between the baseline methods and our PIC algorithm based on the 5-fold cross-validation. The best accuracy values among the five interpretable methods (ours, DR-Net, BRS, DRS and CART) on each dataset are marked in bold. The DR-Net algorithm does not perform very well on tictactoe and vote, probably because the sample sizes of these two datasets are not big enough for training DR-Net. Table VII shows that PIC achieves the same level of classification accuracy as the interpretable classifiers chosen in this experiment. On COMPAS, german, ILPD, liver, transfusion and vote, our algorithm achieves a higher accuracy than the other interpretable methods. On most datasets, the accuracy values of our PIC algorithm are very close to those of the tree-based interpretable method (CART). Furthermore, the performance gap between our PIC algorithm and RF is within 0.02 on all datasets except adult, car and nursery; PIC does not perform very well on these datasets, probably because of their class imbalance.
Fig. 1: The average running time of different classification algorithms.
Fig. 1 displays the average running time of the different classification algorithms. Our algorithm is very efficient on small data sets such as banknote and liver: it achieves the same level of efficiency as the tree-based classifiers and takes less running time than the rule-based interpretable classifiers on these data sets. However, the running time of our algorithm on some large datasets, especially adult, heloc and magic, is several orders of magnitude larger than that of the other methods. This happens because PIC constructs a personalized model for each test sample. In essence, PIC is a lazy learning method, so it is more time-consuming than the other, eager learning methods.
4.2 Performance on personalization and interpretability
To date, there is still no universally recognized precise definition of interpretability. In our case, we measure the interpretability of each algorithm via the length of each rule, the number of total rules, and the comparison between the distinct "personalized" rules and the common rules.
Dataset | Ours ($\lambda$=0.7) | Ours ($\lambda$=0.9) | DR-Net | BRS | DRS
adult | 1.91 | 2.00 | 10.50 | 2.95 | 9.55 |
banknote | 1.44 | 2.11 | 9.67 | 2.97 | 2.59 |
breastcancer | 1.62 | 2.86 | 38.00 | 2.98 | 4.03 |
car | 1.08 | 1.38 | 16.15 | 1.45 | 4.82 |
COMPAS | 2.96 | 3.73 | 53.80 | 3.00 | 3.21 |
german | 1.39 | 3.77 | 24.18 | 3.00 | 11.59 |
heloc(FICO) | 1.75 | 2.84 | 23.85 | 2.90 | 10.01 |
ILPD | 1.00 | 1.89 | 43.62 | 2.82 | 5.70 |
liver | 1.22 | 2.51 | 30.00 | 3.00 | 3.60 |
magic | 1.53 | 3.83 | 8.99 | 3.00 | 3.62 |
monks | 1.57 | 1.95 | 9.81 | 3.00 | 2.41 |
mushroom | 2.07 | 2.91 | 11.82 | 2.84 | 6.52 |
nursery | 2.07 | 2.91 | 8.61 | 1.41 | 2.72 |
tictactoe | 1.05 | 2.85 | 13.86 | 3.00 | 3.92 |
transfusion | 1.00 | 1.17 | NaN | 3.00 | 2.45 |
vote | 1.13 | 2.04 | 45.00 | 2.97 | 6.33 |
Table VIII shows the average length of the rules reported by the four methods. For the PIC algorithm, we calculate the average length of all the rules found in a 5-fold cross-validation. In comparison, we use the average length over the 5 rule sets reported by the other methods. The rules reported by DR-Net are longer than those reported by the other methods on most of the datasets; the exception is transfusion, where DR-Net does not find any rules. In contrast, the rules found by our PIC algorithm are shorter, with average lengths all below 4. Moreover, the rules found when $\lambda = 0.7$ are shorter than those found when $\lambda = 0.9$.
Dataset | Ours ($\lambda$=0.7) | Ours ($\lambda$=0.9) | DR-Net | BRS | DRS
adult | 5.0 | 381.4 | 10.0 | 3.8 | 6.6 |
banknote | 7.0 | 32.8 | 48.4 | 7.6 | 19.8 |
breastcancer | 20.2 | 30.4 | 50.0 | 10.0 | 63.6 |
car | 13.4 | 41.6 | 283.0 | 66.6 | 70.4 |
COMPAS | 34.8 | 85.8 | 21.6 | 3.2 | 4.6 |
german | 4.8 | 111.6 | 22.2 | 5.8 | 5.6 |
heloc(FICO) | 126.6 | 624.6 | 19.4 | 4.2 | 3.2 |
ILPD | 2.2 | 40.8 | 11.0 | 8.0 | 16.4 |
liver | 12.0 | 29.8 | 50.0 | 5.8 | 23.0 |
magic | 22.2 | 371.6 | 21.0 | 3.2 | 17.2 |
monks | 11.4 | 20.6 | 50.0 | 7.4 | 13.4 |
mushroom | 10.6 | 21.6 | 49.2 | 8.2 | 19.8 |
nursery | 147.2 | 328.8 | 258.0 | 75.6 | 16.0 |
tictactoe | 10.0 | 21.6 | 48.8 | 8.0 | 28.4 |
transfusion | 3.0 | 11.2 | NaN | 1.2 | 4.0 |
vote | 4.4 | 10.8 | 50.0 | 7.2 | 21.2 |
Fig. 2: The distribution of the rules reported by PIC on mushroom in a 5-fold cross-validation.
Common rules |
Rules | Method
{bruises=t} ∧ {habitat=d} → 0 | DRS
{stalk-surface-above-ring=s} ∧ {ring-number=o} ∧ {odor=n} → 0 | DRS
{stalk-surface-below-ring=s} ∧ {ring-number=o} ∧ {odor=n} → 0 | DRS
{ring-number=o} ∧ {gill-size=b} ∧ {odor=n} → 0 | DRS
{ring-number=o} ∧ {odor=n} → 0 | DRS
{gill-size=b} ∧ {odor=n} → 0 | DRS
{bruises=t} ∧ {habitat=u} → 1 | DR-Net
{odor=p} → 1 | BRS, DRS
{gill-attachment=f} ∧ {gill-spacing=c} ∧ {ring-number=o} ∧ {bruises=f} → 1 | DRS
{gill-attachment=f} ∧ {bruises=f} ∧ {population=v} → 1 | DRS
{gill-spacing=c} ∧ {gill-size=n} ∧ {population=v} → 1 | BRS, DRS
{gill-spacing=c} ∧ {stalk-surface-above-ring=k} → 1 | DR-Net, BRS, DRS
{gill-spacing=c} ∧ {stalk-surface-below-ring=k} → 1 | DRS
{stalk-shape=e} ∧ {ring-number=o} ∧ {stalk-root=b} ∧ {habitat=d} → 1 | DRS
{cap-color=g} ∧ {bruises=f} ∧ {stalk-root=b} → 1 | DRS
{gill-color=g} ∧ {stalk-root=b} → 1 | BRS
{odor=f} → 1 | DR-Net, BRS, DRS
{odor=c} → 1 | DRS
{gill-color=b} → 1 | DR-Net, BRS, DRS
"Personalized" rules |
{spore-print-color=k} ∧ {gill-size=b} → 0
{gill-size=b} ∧ {spore-print-color=n} → 0
{cap-surface=s} ∧ {gill-spacing=c} ∧ {gill-size=n} → 1
{cap-surface=s} ∧ {gill-size=n} ∧ {bruises=f} → 1
Table IX shows the average number of rules reported by the four methods. The PIC algorithm finds more rules than any other algorithm on COMPAS, heloc, magic and transfusion, which means that our method is more likely to find "personalized" rules for distinct samples. Fig. 2 shows the distribution of the rules reported by PIC on mushroom in a 5-fold cross-validation when $\lambda = 0.9$. PIC reports 23 rules, 8 of which are from class 0 and 15 from class 1. Two rules, ({ring-number=o} ∧ {gill-size=b} ∧ {odor=n} → 0) and ({gill-spacing=c} ∧ {stalk-surface-above-ring=k} → 1), appear far more often than the others, while each of the remaining rules appears fewer than 1000 times in the experiment.
We consider a rule $r$ reported by PIC to be a common rule when some other method reports a rule $r'$ that coincides with it. Table X shows the common rules found on mushroom by PIC and the other methods, together with the "personalized" rules reported only by PIC when $\lambda$ is set to 0.9. We find 4 "personalized" rules in this experiment, 2 for class 0 and 2 for class 1. Their frequency values all lie in the range of 10 to 1000, so they do not appear by accident. This demonstrates that the PIC algorithm really can find "personalized" rules that cannot be discovered by the other methods.
4.3 A real application scenario
4.3.1 The dataset
To illustrate why personalized and interpretable classification methods should be developed, here we use the application of predicting breast cancer metastasis [14] as a real example. As a complex and heterogeneous disease, breast cancer typically has many molecular subtypes. Hence, classifiers constructed for one cohort often cannot achieve good performance on other cohorts. To tackle this issue, one feasible solution is to assume that each patient belongs to a distinct subtype and construct different classification models for different patients [14].
The breast cancer metastasis dataset is derived from [14] and is composed of 1522 features and 581 samples. In our experiments, we choose 30 samples as the test set and use the remaining 551 samples as the training set. The class labels are binary, so both DR-Net and BRS can handle this data set as well. All 1522 features are numerical, so we employ the equal-width method to transform the numeric feature values into categorical ones for our PIC method.
4.3.2 Results
We conduct the following experiments on a workstation with an AMD Ryzen 5 5600X 6-core processor (3.70 GHz) and 32 GB memory, which runs approximately twice as fast as the workstation used in Sections 4.1 and 4.2. When running DRS, we encountered an error when converting variables from dtype('O') to dtype('float64') in the digitize function of the numpy package; we wrote a replacement function to continue the experiment. To finish the experiment within 24 hours, the maximal length parameter $L$ is set to 2 in the PIC algorithm. All other parameters are set as in Sections 4.1 and 4.2.
Methods | Accuracy | Time (s) | #Rule | Length |
Ours ($\lambda$=0.7) | 0.633 | 64188 | 2 | 2.00
Ours ($\lambda$=0.9) | 0.633 | 84999 | 26 | 2.00
DR-Net | 0.366 | 43 | NaN | NaN |
BRS | 0.533 | 37749 | 38 | 2.76 |
DRS | 0.366 | 192 | NaN | NaN |
CART | 0.500 | 1 | - | - |
RF | 0.566 | 1 | - | - |
The experimental results are shown in Table XI. DR-Net and DRS accomplish the classification task very fast but do not find any rules. BRS finds a rule set of 38 rules, 9 of which have a length of 2 and 29 of which have a length of 3. When $\lambda$ is set to 0.7, PIC finds 2 different rules; when $\lambda$ is set to 0.9, PIC finds 26 distinct rules. All the rules reported by PIC have a length of 2, which suggests that better rules might be found if we further increased the maximal length parameter $L$. The results in Table XI also show that PIC finds more "personalized" rules when $\lambda$ is set to 0.9, at the cost of more running time. More importantly, the PIC algorithm achieves better predictive accuracy than all the other competing classification methods in the comparison. This demonstrates the rationale of developing personalized interpretable classifiers in real applications such as cancer metastasis prediction.
5 Conclusion
In this paper, we introduce the personalized interpretable classification problem and present a greedy algorithm called PIC to identify a personalized rule for each individual test sample. To demonstrate the effectiveness of the PIC algorithm, we conduct a series of experiments on real data sets. The experimental results show that such a personalized interpretable classifier can achieve good performance in terms of both predictivity and interpretability.
Overall, we formally introduce a new data mining problem, namely personalized interpretable classification. By solving this new classification issue, we can identify some “personalized” rules that cannot be found by existing interpretable classification methods. In the future, we will further develop more effective algorithms for mining personalized rules from large data sets in different application domains.
Acknowledgments
This work has been supported by the National Natural Science Foundation of China under Grant No. 61972066.
References
- [1] H. Lakkaraju, S. H. Bach, and J. Leskovec, “Interpretable decision sets: A joint framework for description and prediction,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1675–1684.
- [2] T. Wang, C. Rudin, F. Doshi-Velez, Y. Liu, E. Klampfl, and P. MacNeille, “A Bayesian framework for learning rule sets for interpretable classification,” Journal of Machine Learning Research, vol. 18, no. 1, pp. 1–37, 2017.
- [3] S. Dash, O. Gunluk, and D. Wei, “Boolean decision rules via column generation,” in Proceedings of the 32nd International Conference on Neural Information Processing Systems, vol. 31, 2018, pp. 4660–4670.
- [4] F. Yang, K. He, L. Yang, H. Du, J. Yang, B. Yang, and L. Sun, “Learning interpretable decision rule sets: A submodular optimization approach,” in Proceedings of the 35th International Conference on Neural Information Processing Systems, vol. 34, 2021, pp. 27890–27902.
- [5] J. Yu, A. Ignatiev, P. J. Stuckey, and P. Le Bodic, "Learning optimal decision sets and lists with SAT," Journal of Artificial Intelligence Research, vol. 72, pp. 1251–1279, 2021.
- [6] G. Zhang and A. Gionis, “Diverse rule sets,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2020, pp. 1532–1541.
- [7] P. Clark and T. Niblett, “The CN2 induction algorithm,” Machine learning, vol. 3, no. 4, pp. 261–283, 1989.
- [8] W. W. Cohen, “Fast effective rule induction,” in Proceedings of the 12th International Conference on Machine Learning, 1995, pp. 115–123.
- [9] B. Liu, W. Hsu, Y. Ma et al., “Integrating classification and association rule mining.” in Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, 1998, pp. 80–86.
- [10] M. T. Ribeiro, S. Singh, and C. Guestrin, ““Why should i trust you?” explaining the predictions of any classifier,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2016, pp. 1135–1144.
- [11] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” in Proceedings of the 31st International Conference on Neural Information Processing Systems, vol. 30, 2017.
- [12] J. Chen, L. Song, M. Wainwright, and M. Jordan, “Learning to explain: An information-theoretic perspective on model interpretation,” in Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 883–892.
- [13] S. Pang, T. Ban, Y. Kadobayashi, and N. Kasabov, "Personalized mode transductive spanning SVM classification tree," Information Sciences, vol. 181, no. 11, pp. 2071–2085, 2011.
- [14] M. J. Jahid, T. H. Huang, and J. Ruan, “A personalized committee classification approach to improving prediction of breast cancer metastasis,” Bioinformatics, vol. 30, no. 13, pp. 1858–1866, 2014.
- [15] Y. Zhu, M. Kim, X. Zhu, J. Yan, D. Kaufer, and G. Wu, "Personalized diagnosis for Alzheimer's disease," in Proceedings of the 20th International Conference on Medical Image Computing and Computer-assisted Intervention, 2017, pp. 205–213.
- [16] W. J. Murdoch, C. Singh, K. Kumbier, R. Abbasi-Asl, and B. Yu, "Definitions, methods, and applications in interpretable machine learning," Proceedings of the National Academy of Sciences, vol. 116, no. 44, pp. 22071–22080, 2019.
- [17] C. Bénard, G. Biau, S. Da Veiga, and E. Scornet, "SIRUS: Stable and interpretable rule set for classification," Electronic Journal of Statistics, vol. 15, no. 1, pp. 427–505, 2021.
- [18] G. Mita, P. Papotti, M. Filippone, and P. Michiardi, “Libre: Learning interpretable boolean rule ensembles,” in Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics, 2020, pp. 245–255.
- [19] J. H. Friedman and B. E. Popescu, “Predictive learning via rule ensembles,” The Annals of Applied Statistics, vol. 2, no. 3, pp. 916–954, 2008.
- [20] I. Mollas, N. Bassiliades, and G. Tsoumakas, “Conclusive local interpretation rules for random forests,” Data Mining and Knowledge Discovery, vol. 36, no. 4, pp. 1521–1574, 2022.
- [21] E. Angelino, N. Larus-Stone, D. Alabi, M. Seltzer, and C. Rudin, “Learning certifiably optimal rule lists for categorical data,” Journal of Machine Learning Research, vol. 18, no. 234, pp. 1–78, 2018.
- [22] D. Malioutov and K. Varshney, “Exact rule learning via boolean compressed sensing,” in Proceedings of the 30th International Conference on Machine Learning, 2013, pp. 765–773.
- [23] D. Wei, S. Dash, T. Gao, and O. Gunluk, “Generalized linear rule models,” in Proceedings of the 36th International Conference on Machine Learning, 2019, pp. 6687–6696.
- [24] Z. Wang, W. Zhang, N. Liu, and J. Wang, “Scalable rule-based representation learning for interpretable classification,” in Proceedings of the 35th International Conference on Neural Information Processing Systems, vol. 34, pp. 30479–30491, 2021.
- [25] L. Qiao, W. Wang, and B. Lin, “Learning accurate and interpretable decision rule sets from neural networks,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 5, 2021, pp. 4303–4311.
- [26] A. Ignatiev, E. Lam, P. J. Stuckey, and J. Marques-Silva, “A scalable two stage approach to computing optimal decision sets,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 5, 2021, pp. 3806–3814.
- [27] V. Vapnik, The nature of statistical learning theory. Springer-Verlag, Berlin, Germany, 1999.
- [28] T. Cover and P. Hart, “Nearest neighbor pattern classification,” IEEE Transactions on Information Theory, vol. 13, no. 1, pp. 21–27, 1967.
- [29] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi, “A survey of methods for explaining black box models,” ACM Computing Surveys, vol. 51, no. 5, pp. 1–42, 2018.
- [30] C. Burns, J. Thomason, and W. Tansey, “Interpreting black box models via hypothesis testing,” in Proceedings of the 2020 ACM-IMS on Foundations of Data Science Conference, 2020, pp. 47–57.
- [31] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and Édouard Duchesnay, “Scikit-learn: Machine learning in python,” Journal of Machine Learning Research, vol. 12, no. 85, pp. 2825–2830, 2011. [Online]. Available: http://jmlr.org/papers/v12/pedregosa11a.html
- [32] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
- [33] FICO. Explainable machine learning challenge, 2018. [Online]. Available: https://community.fico.com/s/explainable-machine-learning-challenge