An incremental MaxSAT-based model to learn interpretable and balanced classification rules
Abstract
The increasing advancements in the field of machine learning have led to the development of numerous applications that effectively address a wide range of problems with accurate predictions. However, in certain cases, accuracy alone may not be sufficient: many real-world problems also demand explanations and interpretability behind the predictions. Classification rules are among the most popular interpretable models. This work proposes an incremental model for learning interpretable and balanced rules based on MaxSAT, called IMLIB. This new model is based on two other approaches, one based on SAT and the other on MaxSAT. The one based on SAT limits the size of each generated rule, making it possible to balance them. We suggest that such a set of rules seems more natural to understand compared to a mixture of large and small rules. The approach based on MaxSAT, called IMLI, presents a technique to increase performance that involves learning a set of rules by incrementally applying the model to a dataset. Finally, IMLIB and IMLI are compared using diverse databases. IMLIB obtained results comparable to IMLI in terms of accuracy, generating more balanced rules with smaller sizes.
Keywords:
Interpretable Artificial Intelligence · Explainable Artificial Intelligence · Rule Learning · Maximum Satisfiability

1 Introduction
The success of Machine Learning (ML) in recent years has led to a growing advancement of studies in this area [2, 8, 12]. Several applications have emerged with the aim of tackling a variety of problems and situations [4, 14, 20]. One remaining problem is the lack of explainability of prediction models. This directly affects the reliability of using these applications in critical situations involving, for example, finance, autonomous systems, damage to equipment, the environment, and even lives [1, 7, 23]. That said, some works seek to develop approaches that bring explainability to predictions [13, 21, 22].
Obtaining precise predictions with high levels of interpretability is often not a simple task. Some works try to solve this problem by balancing the accuracy of the prediction with its interpretability [5, 9, 17, 24, 6, 16, 15]. Several of these works use approaches based on the Boolean Satisfiability Problem (SAT) and the Maximum Boolean Satisfiability Problem (MaxSAT). The choice of these approaches has become increasingly recurrent in recent years, and the reasons can be seen in the results obtained by these models.
SAT-based approaches have recently been proposed [18, 19] to learn quantifier-free first-order sentences from a set of classified strings. More specifically, given a set of classified strings, the goal is to find a first-order sentence over strings, of minimum size, that correctly classifies all the strings. One of these approaches is SQFSAT (Synthesis of quantifier-free first-order sentences over strings with SAT). Upon receiving a set of classified strings, this approach generates a quantifier-free first-order sentence over strings in disjunctive normal form (DNF) with a given number of terms. What makes this method stand out is that we can limit both the number of terms and the number of formulas per term in the generated formula. In addition, as the approach generates formulas in DNF, each term of the formula can be seen as a rule. Then, for each rule, its explanation is the conjunction of formulas in the rule, which is interesting for interpretability [11, 18]. On the other hand, as the model is based on the SAT problem, in certain situations it may produce results that are less interesting in terms of interpretability and efficiency, such as when the set of strings is large.
Ghosh et al. created a classification model based on MaxSAT called IMLI [6]. The approach takes a set of classified samples, represented by vectors of numerical and categorical data, and generates a set of rules expressed in DNF or in conjunctive normal form (CNF) that correctly classifies as many samples as possible. In this work, we focus on using IMLI for learning rules in DNF. The number of rules in the set of rules can be defined similarly to SQFSAT, but IMLI does not bound the number of elements per rule. Although IMLI focuses on learning a sparse set of rules, it may obtain a combination of both large and small rules. IMLI also offers the option of defining a weight for correct classifications: as the weight increases, the accuracy of the model improves, but at the cost of an increase in the size of the generated set of rules; the smaller the weight, the lower the accuracy, but correspondingly the smaller the generated set of rules tends to be. Furthermore, IMLI uses an incremental approach to achieve better runtime performance. The incremental form consists of dividing the set of samples into partitions in order to generate a set of rules for each partition from the set of rules obtained in the previous partitions.
In this work, we aim to create a new approach for learning interpretable rules based on MaxSAT that unites SQFSAT with the incrementability of IMLI. The motivation for choosing SQFSAT is the possibility of defining the number of literals per clause, allowing us to generate smaller and more balanced rules. The choice of IMLI is motivated by its incrementability technique, which allows the method to train on large sets of samples efficiently. In addition, we propose a technique that reduces the size of the generated rules, removing possible redundancies.
This work is divided into 6 sections. In Section 2, we define general notions and notation. Since all methods presented in this paper use Boolean logic, we also define in Section 2 how these methods binarize datasets with numerical and categorical data. In Section 3, we present SQFSAT and IMLI. We present SQFSAT in the context of our work, where samples are binary vectors instead of strings and the elements of rules are not first-order sentences over strings. In Section 4, our contribution, IMLIB, is presented. In Section 5, we describe the experiments conducted and the results of the comparison of our approach against IMLI. Finally, in the last section, we present conclusions and indicate future work.
2 Preliminaries
We consider the binary classification problem where we are given a set of samples and their classifications. The set of samples is represented by a binary matrix of size $n \times m$ and their classifications by a vector of size $n$. We call the matrix X and the vector y. Each row of X is a sample of the set and we will call it $x_i$ with $1 \le i \le n$. To represent a specific value of $x_i$, we will use $x_i[j]$ with $1 \le j \le m$. Each column of X has a label representing a feature, and the label of the $j$th column is symbolized by $f_j$. To represent a specific value of y, we will use $y_i$.
To represent the opposite value of $y_i$, that is, if it is $1$ the opposite value is $0$ and vice versa, we use $\neg y_i$. Therefore, we will use the symbol $\neg y$ to represent y with all opposite values. To represent the opposite value of $x_i[j]$, we use $\neg x_i[j]$. Therefore, we will use the symbol $\neg x_i$ to represent $x_i$ with all opposite values. Each label $f_j$ also has its opposite label, which is symbolized by $\neg f_j$.
A partition of X is represented by $X^k$ with $1 \le k \le p$, where $p$ is the number of partitions. Likewise, the partitions of vector y are represented by $y^k$. Each element $y_i$ of y represents the class value of sample $x_i$. We use $X^{+} = \{x_i \in X \mid y_i = 1\}$ and $X^{-} = \{x_i \in X \mid y_i = 0\}$. To represent the size of these sets, that is, the number of samples contained in them, we use the notations $|X^{+}|$ and $|X^{-}|$.
Example 1
Let X be the set of samples

$$X=\begin{bmatrix}1&0&1\\0&1&1\\1&1&0\\0&0&1\end{bmatrix}$$

and their classifications $y = (1, 0, 1, 0)$. The samples are: $x_{1}=(1,0,1)$, $x_{2}=(0,1,1)$, $x_{3}=(1,1,0)$ and $x_{4}=(0,0,1)$. The values of sample $x_{1}$, for instance, are $x_{1}[1]=1$, $x_{1}[2]=0$ and $x_{1}[3]=1$. The class values of each sample are: $y_{1}=1$, $y_{2}=0$, $y_{3}=1$ and $y_{4}=0$. We can divide X into two partitions in several different ways, one of which is: $X^{1}=\{x_{1},x_{2}\}$, $y^{1}=(1,0)$, $X^{2}=\{x_{3},x_{4}\}$, and $y^{2}=(1,0)$.
Example 2
Let X be the set of samples from Example 1, then

$$\neg X=\begin{bmatrix}0&1&0\\1&0&0\\0&0&1\\1&1&0\end{bmatrix}$$

and $\neg y = (0, 1, 0, 1)$. The samples are: $\neg x_{1}=(0,1,0)$, $\neg x_{2}=(1,0,0)$, $\neg x_{3}=(0,0,1)$ and $\neg x_{4}=(1,1,0)$. The values of sample $\neg x_{1}$, for instance, are $\neg x_{1}[1]=0$, $\neg x_{1}[2]=1$ and $\neg x_{1}[3]=0$. The class values of each sample are: $\neg y_{1}=0$, $\neg y_{2}=1$, $\neg y_{3}=0$ and $\neg y_{4}=1$. We can divide $\neg X$ in partitions as in Example 1: $\neg X^{1}=\{\neg x_{1},\neg x_{2}\}$, $\neg y^{1}=(0,1)$, $\neg X^{2}=\{\neg x_{3},\neg x_{4}\}$, and $\neg y^{2}=(0,1)$.
We define a set of rules as a disjunction of rules, represented by R. A rule is a conjunction of one or more features. Each rule in R is represented by $r_l$ with $1 \le l \le q$, where $q$ is the number of rules. Moreover, $R(x_i)$ represents the application of R to $x_i$. The notations $|R|$ and $|r_l|$ are used to represent the number of features in R and in $r_l$, respectively.
Example 3
Let Man, Smoke, and Hike be labels of features. Let R be the set of rules $(\text{Man}) \vee (\text{Smoke} \wedge \text{Hike})$. The rules are: $r_{1} = (\text{Man})$ and $r_{2} = (\text{Smoke} \wedge \text{Hike})$. The application of R to $x_i$ is represented as follows: $R(x_{i}) = x_{i}[1] \vee (x_{i}[2] \wedge x_{i}[3])$. For example, let X be the set of samples from Example 1 with feature labels Man, Smoke, and Hike; then $R(x_{1})=1$, $R(x_{2})=1$, $R(x_{3})=1$ and $R(x_{4})=0$. Moreover, we have that $|R| = 3$, $|r_{1}| = 1$ and $|r_{2}| = 2$.
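For concreteness, the application of a set of rules in DNF to a binary sample can be computed as in the following Python sketch; this is our illustration rather than code from either model, and it represents a rule as a list of (feature index, sign) pairs:

```python
# A rule is a conjunction of literals; a literal is (feature_index, sign),
# where sign=False stands for the opposite feature. R(x) is the disjunction
# of the rules.
def apply_rules(rules, x):
    for rule in rules:
        if all(x[j] == (1 if sign else 0) for j, sign in rule):
            return 1  # some rule (conjunction) is satisfied by x
    return 0

# R = (Man) ∨ (Smoke ∧ Hike) over samples with feature order (Man, Smoke, Hike):
R = [[(0, True)], [(1, True), (2, True)]]
print(apply_rules(R, [1, 0, 1]))  # 1: the first rule fires (as for x_1 in Example 3)
print(apply_rules(R, [0, 0, 1]))  # 0: no rule fires (as for x_4)
```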
As we assume a set of binary samples, we need to perform some preprocessing. Preprocessing consists of binarizing a set of samples with numerical or categorical values. The algorithm divides the features into four types: constant, where all samples have the same value; binary, where there are only two distinct variations of the same feature among all the samples; categorical, when the feature does not fit constant or binary and its values form three or more categories; and ordinal, when the feature does not fit constant or binary and has numerical values.
When the feature type is constant, the algorithm discards that feature, since a feature common to all samples makes no difference in the generated rules. When the type is binary, one of the feature variations receives $1$ and the other $0$ as new values. If the type is categorical, we employ the widely recognized technique of one-hot encoding. Finally, for the ordinal type, a quantization is performed, that is, the variations of the feature are divided into quantiles. With this, Boolean values are assigned to each quantile according to the original value.
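A rough sketch of this preprocessing is given below; it is our reading of the description above, assuming pandas quantile binning for the quantization step, and the function and column names are illustrative:

```python
import pandas as pd

def binarize(df: pd.DataFrame, n_quantiles: int = 4) -> pd.DataFrame:
    out = {}
    for col in df.columns:
        values = df[col]
        if values.nunique() == 1:            # constant: discard the feature
            continue
        if values.nunique() == 2:            # binary: map the two variants to 1/0
            out[col] = (values == values.unique()[0]).astype(int)
        elif values.dtype.kind in "ifu":     # ordinal: one indicator per quantile
            binned = pd.qcut(values, q=n_quantiles, duplicates="drop")
            for interval in binned.cat.categories:
                out[f"{col} <= {interval.right}"] = (values <= interval.right).astype(int)
        else:                                # categorical: one-hot encoding
            for cat in values.unique():
                out[f"{col} = {cat}"] = (values == cat).astype(int)
    return pd.DataFrame(out)
```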
We use SAT and MaxSAT solvers to implement the methods presented in this work. A solver receives a formula in CNF, for example: $(x_{1}\vee\neg x_{2})\wedge(x_{2}\vee x_{3})$. Furthermore, a MaxSAT solver receives weights that are assigned to the clauses of the formula. A clause is the disjunction of one or more literals. The weights are represented by $(C;\ w)$, where $C$ is one or more clauses and $w$ represents the weight assigned to each one of them. A SAT solver tries to assign values to the literals in such a way that all clauses are satisfied. A MaxSAT solver tries to assign values to the literals in a way that the sum of the weights of the satisfied clauses is maximum. Clauses with numerical weights are considered soft: the greater the weight, the greater the priority of the clause to be satisfied. Clauses assigned a weight of $\infty$ are considered hard and must be satisfied.
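As an illustration of this interface, the following minimal sketch poses hard and soft clauses to the RC2 solver through PySAT, the solver used in Section 5; the toy clauses and weights are ours:

```python
from pysat.formula import WCNF
from pysat.examples.rc2 import RC2

wcnf = WCNF()
wcnf.append([1, 2])            # hard clause (x1 ∨ x2), weight ∞
wcnf.append([-1, 3])           # hard clause (¬x1 ∨ x3)
wcnf.append([-2], weight=5)    # soft clause (¬x2) with weight 5
wcnf.append([-3], weight=1)    # soft clause (¬x3) with weight 1

with RC2(wcnf) as solver:
    print(solver.compute())    # e.g. [1, -2, 3]: only the weight-1 clause is falsified
    print(solver.cost)         # 1: the minimum total weight of falsified soft clauses
```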
3 Rule learning with SAT and MaxSAT
3.1 SQFSAT
SQFSAT is a SAT-based approach that, given X, y, a number of rules $q$, and a maximum number of features per rule $c$, tries to find a set of rules R with $q$ rules and at most $c$ features per rule that correctly classifies all samples, that is, $R(x_i) = y_i$ for all $x_i \in X$. In general, the approach takes its parameters X, y, $q$, and $c$ and constructs a CNF formula to apply to a SAT solver, which returns an answer that is used to obtain R.
The construction of the SAT clauses is defined by propositional variables $p_{j}^{l,k}$, $n^{l,k}$, $s^{l,k}$, $z_{i}^{l,k}$, and $v_{i}^{l}$, for $1 \le i \le n$, $1 \le j \le m$, $1 \le l \le q$ and $1 \le k \le c$. If the valuation of $p_{j}^{l,k}$ is true, it means that the $j$th feature label will be the $k$th feature of the rule $r_l$. Furthermore, if $n^{l,k}$ is true, it means that the $k$th feature of the rule $r_l$ will be $f_j$, in other words, it will be positive. Otherwise, it will be negative: $\neg f_j$. If $s^{l,k}$ is true, it means that the $k$th feature is skipped in the rule $r_l$. In this case, we ignore $n^{l,k}$. If $z_{i}^{l,k}$ is true, then the $k$th feature of rule $r_l$ contributes to the correct classification of the $i$th sample. If $v_{i}^{l}$ is true, then the rule $r_l$ contributes to the correct classification of the $i$th sample. That said, below, we show the constraints formulated in the model for constructing the SAT clauses.
Conjunction of clauses that guarantees that exactly one of $s^{l,k}, p_{1}^{l,k}, \ldots, p_{m}^{l,k}$ is true for the $k$th feature of the rule $r_l$:

$$\bigwedge_{l=1}^{q}\bigwedge_{k=1}^{c}\left(s^{l,k}\vee p_{1}^{l,k}\vee\cdots\vee p_{m}^{l,k}\right)\tag{1}$$

$$\bigwedge_{l=1}^{q}\bigwedge_{k=1}^{c}\;\bigwedge_{\substack{o,o'\in O^{l,k}\\ o\neq o'}}\left(\neg o\vee\neg o'\right),\quad\text{where }O^{l,k}=\{s^{l,k},p_{1}^{l,k},\ldots,p_{m}^{l,k}\}\tag{2}$$
Conjunction of clauses that ensures that each rule has at least one feature:

$$\bigwedge_{l=1}^{q}\left(\neg s^{l,1}\vee\cdots\vee\neg s^{l,c}\right)\tag{3}$$
We will use $x_i[j]$ to represent the value of the $i$th sample at the $j$th feature label of X. If this value is $1$, the $j$th feature label placed in the $k$th position of the rule $r_l$ contributes to the correct classification of the $i$th sample exactly when that position is positive; therefore, we define the literal $b_{i}^{j,l,k} = n^{l,k}$. Otherwise, $b_{i}^{j,l,k} = \neg n^{l,k}$. That said, the following conjunction of formulas guarantees that $z_{i}^{l,k}$ is true if the $j$th feature label in the $k$th position of the $l$th rule contributes to the correct classification of the sample $x_i$, and false otherwise:

$$\bigwedge_{i=1}^{n}\bigwedge_{l=1}^{q}\bigwedge_{k=1}^{c}\bigwedge_{j=1}^{m}\left(p_{j}^{l,k}\rightarrow\left(z_{i}^{l,k}\leftrightarrow b_{i}^{j,l,k}\right)\right)\tag{4}$$
Conjunction of formulas guaranteeing that if the $k$th feature of a rule is skipped, then the classification of this rule is not interfered with by this feature:

$$\bigwedge_{i=1}^{n}\bigwedge_{l=1}^{q}\bigwedge_{k=1}^{c}\left(s^{l,k}\rightarrow z_{i}^{l,k}\right)\tag{5}$$
Conjunction of formulas indicating that $v_{i}^{l}$ will be set to true if and only if all the features of rule $r_l$ contribute to the correct classification of sample $x_i$:

$$\bigwedge_{i=1}^{n}\bigwedge_{l=1}^{q}\left(v_{i}^{l}\leftrightarrow\left(z_{i}^{l,1}\wedge\cdots\wedge z_{i}^{l,c}\right)\right)\tag{6}$$
Conjunction of clauses that guarantees that R will correctly classify all samples:

$$\bigwedge_{x_{i}\in X^{+}}\left(v_{i}^{1}\vee\cdots\vee v_{i}^{q}\right)\tag{7}$$

$$\bigwedge_{x_{i}\in X^{-}}\left(\neg v_{i}^{1}\wedge\cdots\wedge\neg v_{i}^{q}\right)\tag{8}$$
Next, the formula below is converted to CNF. Then, finally, we have the SAT query that is sent to the solver:

$$Q_{\mathrm{SQFSAT}}=(1)\wedge(2)\wedge(3)\wedge(4)\wedge(5)\wedge(6)\wedge(7)\wedge(8)\tag{9}$$
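To make the encoding concrete, the following condensed Python sketch, which is ours and not the original SQFSAT implementation, builds clauses corresponding to (1)-(8) with PySAT and decodes a set of rules, in the format of the earlier sketch, from the satisfying assignment:

```python
from itertools import combinations
from pysat.formula import IDPool
from pysat.solvers import Glucose3

def sqfsat(X, y, q, c):
    n, m = len(X), len(X[0])
    pool = IDPool()
    p = lambda l, k, j: pool.id(("p", l, k, j))   # f_j is the k-th feature of rule l
    s = lambda l, k: pool.id(("s", l, k))         # slot k of rule l is skipped
    sg = lambda l, k: pool.id(("n", l, k))        # slot k of rule l is positive
    z = lambda i, l, k: pool.id(("z", i, l, k))   # slot k of rule l accepts sample i
    v = lambda i, l: pool.id(("v", i, l))         # rule l accepts sample i
    cls = []
    for l in range(q):
        for k in range(c):
            choices = [s(l, k)] + [p(l, k, j) for j in range(m)]
            cls.append(choices)                                      # (1) at least one
            cls += [[-a, -b] for a, b in combinations(choices, 2)]   # (2) at most one
        cls.append([-s(l, k) for k in range(c)])                     # (3) >= 1 feature
    for i in range(n):
        for l in range(q):
            for k in range(c):
                for j in range(m):
                    b = sg(l, k) if X[i][j] == 1 else -sg(l, k)      # literal b
                    cls.append([-p(l, k, j), -b, z(i, l, k)])        # (4): p ∧ b -> z
                    cls.append([-p(l, k, j), b, -z(i, l, k)])        # (4): p ∧ ¬b -> ¬z
                cls.append([-s(l, k), z(i, l, k)])                   # (5): skipped -> z
                cls.append([-v(i, l), z(i, l, k)])                   # (6): v -> z_k
            cls.append([v(i, l)] + [-z(i, l, k) for k in range(c)])  # (6): all z -> v
        if y[i] == 1:
            cls.append([v(i, l) for l in range(q)])                  # (7)
        else:
            cls += [[-v(i, l)] for l in range(q)]                    # (8)
    with Glucose3(bootstrap_with=cls) as solver:                     # the query (9)
        if not solver.solve():
            return None
        model = set(solver.get_model())
        return [[(j, sg(l, k) in model)
                 for k in range(c) for j in range(m) if p(l, k, j) in model]
                for l in range(q)]
```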
3.2 IMLI
IMLI is an incremental approach based on MaxSAT for learning interpretable rules. Given X, y, a number of rules $k$, and a weight $\lambda$, the model aims to obtain the smallest set of rules M in CNF that correctly classifies as many samples as possible, penalizing classification errors with $\lambda$. In general, the method solves the optimization problem $\min_{M}\ |M| + \lambda\cdot|\{x_{i}\in X \mid M(x_{i})\neq y_{i}\}|$, where $|M|$ represents the number of features in M and $M(x_i)$ denotes the application of the set of rules M to $x_i$. Therefore, the approach takes its parameters X, y, $k$, and $\lambda$ and constructs a MaxSAT query to apply to a MaxSAT solver, which returns an answer that is used to generate M. Note that IMLI generates sets of rules in CNF, whereas our objective is to obtain sets of rules in DNF. For that, we use $\neg y$ as a parameter instead of y and negate the set of rules M to obtain a set of rules R in DNF.
The construction of the MaxSAT clauses is defined by propositional variables $b_{j}^{l}$ and $\eta_{i}$, for $1 \le i \le n$, $1 \le j \le 2m$ and $1 \le l \le k$. The index $j$ ranges from $1$ to $2m$, as it also considers opposite features. If the valuation of $b_{j}^{l}$ is true and $j \le m$, it means that feature $f_j$ will be in the rule $m_l$, where $m_l$ is the $l$th rule of M. If the valuation of $b_{j}^{l}$ is true and $j > m$, it means that feature $\neg f_{j-m}$ will be in the rule $m_l$. If the valuation of $\eta_i$ is true, it means that sample $x_i$ is not classified correctly, that is, $M(x_i) \neq y_i$. That said, below, we show the constraints for constructing the MaxSAT clauses.
Constraints that represent that the cost of a misclassification is $\lambda$:

$$\bigwedge_{i=1}^{n}\left(\neg\eta_{i};\ \lambda\right)\tag{10}$$
Constraints that represent that the model tries to insert as few features as possible in M, taking into account the weights of all clauses:

$$\bigwedge_{l=1}^{k}\bigwedge_{j=1}^{2m}\left(\neg b_{j}^{l};\ 1\right)\tag{11}$$
Even though the constraints in (11) prioritize learning sparse rules, they do so by focusing on the overall set of rules, i.e., on the total number of features in M. Hence, IMLI may generate a set of rules that comprises a combination of both large and small rules. In our approach, presented in Section 4, we address this drawback by limiting the number of features in each rule.
We will use $B^l$ to represent the vector of variables of a rule $m_l$, that is, $B^{l} = (b_{1}^{l}, \ldots, b_{2m}^{l})$, for $1 \le l \le k$. To represent the concatenation of two samples, we will use the symbol $\circ$. We also use the symbol $\odot$ to represent an operation between two vectors of the same size. The operation consists of applying a conjunction between the corresponding elements of the vectors; subsequently, a disjunction between the elements of the result is applied. The following example illustrates how these definitions will be used:
Example 4
Let $x_1$ be as in Example 1, and let $B^{1} = (b_{1}^{1}, \ldots, b_{6}^{1})$. Therefore, $(x_{1}\circ\neg x_{1})\odot B^{1} = (1,0,1,0,1,0)\odot(b_{1}^{1},\ldots,b_{6}^{1}) = b_{1}^{1}\vee b_{3}^{1}\vee b_{5}^{1}$.
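In code, the operation of Example 4 amounts to keeping exactly the variables of $B^l$ at the positions where the concatenated sample holds a $1$, as in this small sketch of ours:

```python
# (x ∘ ¬x) ⊙ B_l: keep the b-variables at positions where the concatenated
# sample x ∘ ¬x has a 1; the returned list is read as a disjunction.
def odot(x, B_l):
    concat = x + [1 - value for value in x]  # x ∘ ¬x
    return [b for b, bit in zip(B_l, concat) if bit == 1]

# With x_1 = (1, 0, 1) from Example 1 and B^1 represented by the integers 1..6:
print(odot([1, 0, 1], [1, 2, 3, 4, 5, 6]))  # [1, 3, 5], i.e. b_1 ∨ b_3 ∨ b_5
```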
The objective of this operation is to generate a disjunction of variables indicating that if any of the features associated with these variables is present in $m_l$, then sample $x_i$ will be correctly classified by $m_l$. Now, we can show the formula that guarantees that if $\eta_i$ is false, then $M(x_i) = y_i$:

$$\bigwedge_{i=1}^{n}\left(\neg\eta_{i}\rightarrow\Big(y_{i}\leftrightarrow\bigwedge_{l=1}^{k}\big((x_{i}\circ\neg x_{i})\odot B^{l}\big)\Big);\ \infty\right)\tag{12}$$
We can see that (12) is not in CNF. Therefore, the formula below must be converted to CNF. With that, finally, we have the MaxSAT query that is sent to the solver:

$$Q_{\mathrm{IMLI}}=(10)\wedge(11)\wedge(12)\tag{13}$$
The set of samples X, in IMLI, can be divided into $p$ partitions: $X^{1}, X^{2}, \ldots, X^{p}$. Every partition, except possibly the last one, contains the same number of positive and negative samples. Also, the samples are randomly distributed across the partitions. Partitioning aims to make the model perform better when generating the set of rules M. Thus, the conjunction of clauses is created from each partition in an incremental way, that is, the set of rules M obtained in the current partition is reused in the next partition. In the first partition, constraints (10), (11) and (12) are created exactly as described above. From the second partition onwards, (11) is replaced by the following constraints:

$$\bigwedge_{l=1}^{k}\bigwedge_{j=1}^{2m}\left(b_{j}^{l}\leftrightarrow\sigma(b_{j}^{l});\ 1\right)\tag{14}$$

where $\sigma(b_{j}^{l})$ is the value assigned to $b_{j}^{l}$ in the solution obtained for the previous partition.
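Schematically, the incremental training can be organized as in the following sketch (ours); build_base_query, solve, and rule_vars are placeholders for the encoding of (10) and (12) on one partition, a MaxSAT call, and the list of $b_{j}^{l}$ variables:

```python
from pysat.formula import WCNF  # only to fix the type produced by build_base_query

# previous is the set of variables assigned true in the last solution; the
# soft clauses below implement (11) on the first partition and (14) afterwards.
def train_incrementally(partitions, rule_vars, build_base_query, solve):
    previous = None
    for Xk, yk in partitions:
        wcnf: WCNF = build_base_query(Xk, yk)        # hard (12) and soft (10)
        for var in rule_vars:
            if previous is None:
                wcnf.append([-var], weight=1)        # (11): prefer few features
            else:                                    # (14): prefer keeping the
                lit = var if var in previous else -var   # previous assignment
                wcnf.append([lit], weight=1)
        previous = {lit for lit in solve(wcnf) if lit > 0}
    return previous                                  # true variables define M
```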
IMLI also has a technique for reducing the size of the generated set of rules. The technique removes possible redundancies in ordinal features, as in Example 5. In the original implementation of the model, this technique is applied at the end of each partition. In our implementation, used for the experiments in Section 5, it is applied only at the end of the last partition, for the sake of training performance.
Example 5
Let R be the following set of rules with redundancy in the same rule:

$$(\text{Age}\le 25\wedge\text{Age}\le 40\wedge\text{Smoke})$$

Then, the technique removes the redundancy and the following set of rules is obtained:

$$(\text{Age}\le 25\wedge\text{Smoke})$$
4 IMLIB
In this section, we present our method IMLIB, an incremental version of SQFSAT based on MaxSAT. IMLIB also has a technique for reducing the size of the generated set of rules. Moreover, our approach partitions the set of samples X, has one additional constraint, and places weights on all clauses. With that, our approach receives five input parameters, X, y, $q$, $c$, and $\lambda$, and tries to obtain the smallest R that correctly classifies as many samples as possible, penalizing classification errors with $\lambda$, that is, $\min_{R}\ |R| + \lambda\cdot|\{x_{i}\in X \mid R(x_{i})\neq y_{i}\}|$. That said, below, we show the constraints of our approach for constructing the MaxSAT clauses.
Constraints that guarantee that exactly one of $s^{l,k}, p_{1}^{l,k}, \ldots, p_{m}^{l,k}$ is true for the $k$th feature of the rule $r_l$:

$$\bigwedge_{l=1}^{q}\bigwedge_{k=1}^{c}\left(s^{l,k}\vee p_{1}^{l,k}\vee\cdots\vee p_{m}^{l,k};\ \infty\right)\tag{15}$$

$$\bigwedge_{l=1}^{q}\bigwedge_{k=1}^{c}\;\bigwedge_{\substack{o,o'\in O^{l,k}\\ o\neq o'}}\left(\neg o\vee\neg o';\ \infty\right),\quad\text{where }O^{l,k}\text{ is as in (2)}\tag{16}$$
Constraints representing that the model will try to insert as few features as possible in R:

$$\bigwedge_{l=1}^{q}\bigwedge_{k=1}^{c}\left(s^{l,k};\ 1\right)\tag{17}$$
Conjunction of clauses that guarantees that each rule has at least one feature:

$$\bigwedge_{l=1}^{q}\left(\neg s^{l,1}\vee\cdots\vee\neg s^{l,c};\ \infty\right)\tag{18}$$
The following conjunction of formulas ensures that $z_{i}^{l,k}$ is true if the $j$th feature label in the $k$th position of the $l$th rule contributes to correctly classifying sample $x_i$, and false otherwise:

$$\bigwedge_{i=1}^{n}\bigwedge_{l=1}^{q}\bigwedge_{k=1}^{c}\bigwedge_{j=1}^{m}\left(p_{j}^{l,k}\rightarrow\left(z_{i}^{l,k}\leftrightarrow b_{i}^{j,l,k}\right);\ \infty\right)\tag{19}$$
Conjunction of formulas guaranteeing that the classification of a specific rule will not be interfered with by skipped features in the rule:

$$\bigwedge_{i=1}^{n}\bigwedge_{l=1}^{q}\bigwedge_{k=1}^{c}\left(s^{l,k}\rightarrow z_{i}^{l,k};\ \infty\right)\tag{20}$$
Conjunction of formulas indicating that the model assigns true to $v_{i}^{l}$ if and only if all the features of rule $r_l$ support the correct classification of sample $x_i$:

$$\bigwedge_{i=1}^{n}\bigwedge_{l=1}^{q}\left(v_{i}^{l}\leftrightarrow\left(z_{i}^{l,1}\wedge\cdots\wedge z_{i}^{l,c}\right);\ \infty\right)\tag{21}$$
Conjunction of clauses designed to generate a set of rules R that correctly classifies as many samples as possible:

$$\bigwedge_{x_{i}\in X^{+}}\left(v_{i}^{1}\vee\cdots\vee v_{i}^{q};\ \lambda\right)\tag{22}$$

$$\bigwedge_{x_{i}\in X^{-}}\bigwedge_{l=1}^{q}\left(\neg v_{i}^{l};\ \lambda\right)\tag{23}$$
Finally, after converting the formula below to CNF, we have the MaxSAT query that is sent to the solver:

$$Q_{\mathrm{IMLIB}}=(15)\wedge(16)\wedge(17)\wedge(18)\wedge(19)\wedge(20)\wedge(21)\wedge(22)\wedge(23)\tag{24}$$
IMLIB can also partition the set of samples X in the same way as IMLI. Therefore, all constraints described above are applied in the first partition. Starting from the second partition, the constraints in (17) are replaced by the following constraints:

$$\bigwedge_{l=1}^{q}\bigwedge_{k=1}^{c}\left(\Big(s^{l,k}\leftrightarrow\sigma(s^{l,k})\Big)\wedge\Big(n^{l,k}\leftrightarrow\sigma(n^{l,k})\Big)\wedge\bigwedge_{j=1}^{m}\Big(p_{j}^{l,k}\leftrightarrow\sigma(p_{j}^{l,k})\Big);\ 1\right)\tag{25}$$

where $\sigma(\cdot)$ again denotes the value assigned to the variable in the solution obtained for the previous partition.
IMLIB also has the technique for reducing the size of the generated set of rules demonstrated in Example 5. Moreover, we add two more cases, which are described in Example 6 and Example 7.
Example 6
Let R be the following set of rules with opposite features in the same rule:

$$(\text{Man}\wedge\neg\text{Man}\wedge\text{Hike})\vee(\text{Smoke})$$

Therefore, the technique removes rules with opposite features in the same rule, obtaining the following set of rules:

$$(\text{Smoke})$$
Example 7
Let R be the following set of rules with the same feature occurring twice in a rule:

$$(\text{Smoke}\wedge\text{Smoke}\wedge\text{Hike})\vee(\text{Man})$$

Accordingly, our technique for removing redundancies eliminates repeated features, resulting in the following set of rules:

$$(\text{Smoke}\wedge\text{Hike})\vee(\text{Man})$$
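The three reductions illustrated in Examples 5-7 can be sketched as follows; the literal format (feature, threshold, positive) is ours, with threshold None for non-ordinal features and a number $t$ for a quantized literal meaning feature $\le t$:

```python
# Simplify a DNF set of rules: drop duplicate literals (Example 7), drop rules
# containing a feature and its opposite (Example 6), and keep only the
# tightest bound per ordinal feature (Example 5).
def simplify(rules):
    simplified = []
    for rule in rules:
        literals = set(rule)                         # duplicates collapse here
        if any((f, t, not pos) in literals for f, t, pos in literals):
            continue                                 # contradictory rule: remove it
        tightest, plain = {}, []
        for f, t, pos in literals:
            if t is None:
                plain.append((f, t, pos))
            elif (f, pos) not in tightest or \
                    (t < tightest[(f, pos)] if pos else t > tightest[(f, pos)]):
                tightest[(f, pos)] = t               # f <= t (or its negation)
        simplified.append(plain + [(f, t, pos) for (f, pos), t in tightest.items()])
    return simplified

# Example 5: (Age <= 25 ∧ Age <= 40 ∧ Smoke) becomes (Age <= 25 ∧ Smoke).
print(simplify([[("Age", 25, True), ("Age", 40, True), ("Smoke", None, True)]]))
```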
5 Experiments
Table 1: Information on the datasets used in the experiments.

| Databases | Samples | $|X^{+}|$ | $|X^{-}|$ | Features |
|---|---|---|---|---|
| lung cancer | 59 | 31 | 28 | 6 |
| iris | 150 | 100 | 50 | 4 |
| parkinsons | 195 | 48 | 147 | 22 |
| ionosphere | 351 | 126 | 225 | 33 |
| wdbc | 569 | 357 | 212 | 30 |
| transfusion | 748 | 570 | 178 | 4 |
| pima | 768 | 500 | 268 | 8 |
| titanic | 1309 | 809 | 500 | 6 |
| depressed | 1429 | 1191 | 238 | 22 |
| mushroom | 8124 | 3916 | 4208 | 22 |
In this section, we present the experiments conducted to compare our method IMLIB against IMLI. The two models were implemented in Python with the MaxSAT solver RC2 [10]; the source code of IMLIB and the implementation of the tests performed can be found at https://github.com/cacajr/decision_set_models. The experiments were carried out on a machine with an Intel(R) Core(TM) i5-4460 3.20GHz processor and 12 GB of RAM. Ten databases from the UCI repository [3] were used to compare IMLI with IMLIB. Information on the datasets can be seen in Table 1. Databases with more than two classes were adapted, considering that both models are designed for binary classification. For purposes of comparison, we measure the following metrics: number of rules, size of the set of rules, size of the largest rule, accuracy on test data, and training time. The number of rules, the size of the set of rules, and the size of the largest rule can be used as interpretability metrics. For example, a set of rules with few, small rules is more interpretable than one with many large rules.
Each dataset was split into training and test sets. Both models were trained and evaluated using the same training and test sets, as well as the same random distribution. Hence, the way the experiments were conducted ensured that both models had exactly the same set of samples to learn the set of rules.
For both IMLI and IMLIB, we consider parameter configurations obtained by combining values of the number of rules $k$, the weight $\lambda$, and the number of samples per partition $lp$. Since IMLIB has the maximum number of features per rule $c$ as an extra parameter, for each parameter configuration of IMLI and its corresponding learned set of rules, we considered $c$ ranging from $1$ to one less than the size of the largest rule learned by IMLI. Thus, the value of $c$ that resulted in the best test accuracy was chosen to be compared with IMLI. Our objective is to evaluate whether IMLIB can achieve higher test accuracy compared to IMLI by employing smaller and more balanced rules. Furthermore, it should be noted that this does not exclude the possibility of our method generating sets of rules with larger sizes than IMLI.
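This selection of $c$ can be sketched as a simple loop; train and evaluate are illustrative placeholders for a full IMLIB run with a given configuration and for computing its test accuracy:

```python
# Pick the maximum rule size c with the best test accuracy, scanning from 1
# up to one less than the size of the largest rule learned by IMLI.
def select_c(config, largest_imli_rule, train, evaluate):
    best = None
    for c in range(1, largest_imli_rule):
        model = train(config, max_rule_size=c)
        accuracy = evaluate(model)
        if best is None or accuracy > best[0]:
            best = (accuracy, c, model)
    return best  # (accuracy, chosen c, trained model)
```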
For each dataset and each parameter configuration of $k$, $\lambda$ and $lp$, we conducted ten independent realizations of this experiment. For each dataset, the parameter configuration with the best average test accuracy for IMLI was chosen to be reported in Table 2, and the one with the best average test accuracy for IMLIB in Table 3. The results presented in both tables are averages over the ten realizations.
Table 2: Average results over the ten realizations with, for each dataset, the parameter configuration that yields the best average test accuracy for IMLI.

| Databases | Models | Number of rules | Size of set of rules | Size of largest rule | Accuracy | Training time |
|---|---|---|---|---|---|---|
| lung cancer | IMLI | 2.00 ± 0.00 | 3.60 ± 0.84 | 2.20 ± 0.63 | 0.93 ± 0.07 | 0.0062 ± 0.0016 |
| | IMLIB | 2.00 ± 0.00 | 2.20 ± 0.63 | 1.10 ± 0.32 | 0.93 ± 0.07 | 0.0146 ± 0.0091 |
| iris | IMLI | 2.00 ± 0.00 | 7.60 ± 1.35 | 4.50 ± 1.08 | 0.90 ± 0.08 | 0.0051 ± 0.0010 |
| | IMLIB | 2.00 ± 0.00 | 4.90 ± 1.20 | 2.50 ± 0.71 | 0.84 ± 0.12 | 0.0523 ± 0.0378 |
| parkinsons | IMLI | 2.00 ± 0.00 | 5.00 ± 2.05 | 2.90 ± 1.37 | 0.80 ± 0.07 | 0.0223 ± 0.0033 |
| | IMLIB | 2.00 ± 0.00 | 3.00 ± 1.41 | 1.60 ± 0.84 | 0.79 ± 0.06 | 0.0631 ± 0.0263 |
| ionosphere | IMLI | 2.90 ± 0.32 | 12.00 ± 1.63 | 5.20 ± 0.63 | 0.81 ± 0.05 | 0.0781 ± 0.0096 |
| | IMLIB | 3.00 ± 0.00 | 7.70 ± 3.02 | 2.70 ± 1.16 | 0.79 ± 0.04 | 0.2797 ± 0.1087 |
| wdbc | IMLI | 2.90 ± 0.32 | 8.70 ± 2.50 | 3.70 ± 1.34 | 0.89 ± 0.03 | 0.0894 ± 0.0083 |
| | IMLIB | 3.00 ± 0.00 | 5.30 ± 2.36 | 1.80 ± 0.79 | 0.86 ± 0.06 | 0.2172 ± 0.0800 |
| transfusion | IMLI | 1.00 ± 0.00 | 3.10 ± 0.88 | 3.10 ± 0.88 | 0.72 ± 0.08 | 0.0291 ± 0.0026 |
| | IMLIB | 1.00 ± 0.00 | 2.00 ± 0.82 | 2.00 ± 0.82 | 0.68 ± 0.08 | 0.5287 ± 0.3849 |
| pima | IMLI | 1.00 ± 0.00 | 5.10 ± 0.74 | 5.10 ± 0.74 | 0.68 ± 0.09 | 0.0412 ± 0.0032 |
| | IMLIB | 1.00 ± 0.00 | 1.90 ± 1.10 | 1.90 ± 1.10 | 0.74 ± 0.04 | 0.6130 ± 0.5093 |
| titanic | IMLI | 1.00 ± 0.00 | 6.90 ± 1.91 | 6.90 ± 1.91 | 0.71 ± 0.07 | 0.0684 ± 0.0040 |
| | IMLIB | 1.00 ± 0.00 | 1.70 ± 0.67 | 1.70 ± 0.67 | 0.75 ± 0.06 | 1.9630 ± 3.2705 |
| depressed | IMLI | 1.80 ± 0.42 | 7.50 ± 2.64 | 5.30 ± 1.89 | 0.74 ± 0.08 | 0.2041 ± 0.0059 |
| | IMLIB | 2.00 ± 0.00 | 6.20 ± 3.36 | 3.30 ± 1.95 | 0.79 ± 0.04 | 0.5175 ± 0.2113 |
| mushroom | IMLI | 2.90 ± 0.32 | 16.30 ± 2.91 | 8.20 ± 2.20 | 0.99 ± 0.01 | 0.3600 ± 0.0340 |
| | IMLIB | 3.00 ± 0.00 | 12.30 ± 7.24 | 4.30 ± 2.54 | 0.97 ± 0.03 | 2.3136 ± 0.6294 |
Table 3: Average results over the ten realizations with, for each dataset, the parameter configuration that yields the best average test accuracy for IMLIB.

| Databases | Models | Number of rules | Size of set of rules | Size of largest rule | Accuracy | Training time |
|---|---|---|---|---|---|---|
| lung cancer | IMLIB | 2.00 ± 0.00 | 2.20 ± 0.63 | 1.10 ± 0.32 | 0.93 ± 0.07 | 0.0146 ± 0.0091 |
| | IMLI | 2.00 ± 0.00 | 3.60 ± 0.84 | 2.20 ± 0.63 | 0.93 ± 0.07 | 0.0062 ± 0.0016 |
| iris | IMLIB | 2.90 ± 0.32 | 6.80 ± 1.48 | 2.50 ± 0.53 | 0.90 ± 0.07 | 0.0373 ± 0.0095 |
| | IMLI | 2.50 ± 0.53 | 9.10 ± 1.91 | 4.80 ± 0.92 | 0.86 ± 0.09 | 0.0062 ± 0.0011 |
| parkinsons | IMLIB | 3.00 ± 0.00 | 4.90 ± 1.66 | 1.70 ± 0.67 | 0.82 ± 0.07 | 0.0868 ± 0.0510 |
| | IMLI | 3.00 ± 0.00 | 8.40 ± 1.90 | 3.70 ± 1.06 | 0.79 ± 0.07 | 0.0295 ± 0.0064 |
| ionosphere | IMLIB | 2.00 ± 0.00 | 5.00 ± 1.70 | 2.50 ± 0.85 | 0.82 ± 0.06 | 0.2002 ± 0.0725 |
| | IMLI | 2.00 ± 0.00 | 7.90 ± 1.79 | 4.90 ± 1.45 | 0.80 ± 0.07 | 0.0531 ± 0.0106 |
| wdbc | IMLIB | 1.00 ± 0.00 | 1.20 ± 0.42 | 1.20 ± 0.42 | 0.89 ± 0.04 | 0.0532 ± 0.0159 |
| | IMLI | 1.00 ± 0.00 | 2.50 ± 0.71 | 2.50 ± 0.71 | 0.86 ± 0.09 | 0.0357 ± 0.0048 |
| transfusion | IMLIB | 1.00 ± 0.00 | 1.70 ± 0.67 | 1.70 ± 0.67 | 0.72 ± 0.03 | 0.2843 ± 0.1742 |
| | IMLI | 1.00 ± 0.00 | 3.10 ± 0.74 | 3.10 ± 0.74 | 0.71 ± 0.06 | 0.0273 ± 0.0032 |
| pima | IMLIB | 1.00 ± 0.00 | 1.90 ± 1.10 | 1.90 ± 1.10 | 0.74 ± 0.04 | 0.6130 ± 0.5093 |
| | IMLI | 1.00 ± 0.00 | 5.10 ± 0.74 | 5.10 ± 0.74 | 0.68 ± 0.09 | 0.0412 ± 0.0032 |
| titanic | IMLIB | 1.00 ± 0.00 | 1.40 ± 0.97 | 1.40 ± 0.97 | 0.76 ± 0.08 | 0.8523 ± 1.7754 |
| | IMLI | 1.00 ± 0.00 | 6.80 ± 1.87 | 6.80 ± 1.87 | 0.68 ± 0.12 | 0.0649 ± 0.0047 |
| depressed | IMLIB | 3.00 ± 0.00 | 13.30 ± 5.12 | 4.70 ± 1.89 | 0.80 ± 0.04 | 0.7263 ± 0.1692 |
| | IMLI | 2.90 ± 0.32 | 14.80 ± 2.25 | 6.70 ± 1.70 | 0.69 ± 0.08 | 0.2520 ± 0.0140 |
| mushroom | IMLIB | 1.00 ± 0.00 | 6.70 ± 0.95 | 6.70 ± 0.95 | 0.99 ± 0.00 | 1.2472 ± 0.2250 |
| | IMLI | 1.00 ± 0.00 | 8.90 ± 1.10 | 8.90 ± 1.10 | 0.99 ± 0.01 | 0.1214 ± 0.0218 |
In Table 2, when considering parameter configurations that favor IMLI, we can see that IMLIB stands out in the size of the generated set of rules and in the size of the largest rule in all datasets. Furthermore, our method achieved equal or higher accuracy compared to IMLI in four out of ten datasets. In the datasets where IMLI outperformed IMLIB in terms of accuracy, our method exhibited a modest average performance gap of only three percentage points. On the other hand, IMLI outperformed our method in terms of training time on all datasets.
In Table 3, when we consider parameter configurations that favor our method, IMLIB continues to stand out in terms of the size of the generated set of rules and the size of the largest rule in all datasets. Moreover, our method achieved equal or higher accuracy than IMLI in all datasets. Again, IMLI consistently demonstrated better training time than IMLIB across all datasets.
As an illustrative example of interpretability, we present a comparison of the sizes of the rules learned by both methods on the mushroom dataset. Table 4 shows the sizes of the rules obtained in all ten realizations of the experiment. We can observe that IMLIB consistently maintains a smaller and more balanced set of rules across the different realizations. This is relevant because unbalanced rules can affect interpretability. See realization 6, for instance. The largest rule learned by IMLI has a size of 10, nearly double the size of the remaining rules. In contrast, IMLIB learned a set of rules in which the size of the largest rule is 6 and the others have similar sizes. Thus, interpreting three rules of size at most 6 is easier than interpreting a rule of size 10. Also as illustrative examples of interpretability, some sets of rules learned by IMLIB can be seen in Table 5.
Table 4: Sizes of the rules learned by IMLI and IMLIB on the mushroom dataset in each of the ten realizations; the column $c$ gives the maximum number of features per rule selected for IMLIB (– for IMLI).

| Realizations | $c$ | Models | Rule sizes |
|---|---|---|---|
| 1 | – | IMLI | (4, 6, 5) |
| | 1 | IMLIB | (1, 1, 1) |
| 2 | – | IMLI | (6, 11, 0) |
| | 8 | IMLIB | (3, 6, 6) |
| 3 | – | IMLI | (3, 8, 5) |
| | 5 | IMLIB | (5, 5, 5) |
| 4 | – | IMLI | (6, 4, 4) |
| | 2 | IMLIB | (2, 2, 2) |
| 5 | – | IMLI | (3, 3, 4) |
| | 2 | IMLIB | (2, 2, 2) |
| 6 | – | IMLI | (5, 10, 6) |
| | 6 | IMLIB | (6, 5, 6) |
| 7 | – | IMLI | (5, 9, 3) |
| | 8 | IMLIB | (8, 8, 8) |
| 8 | – | IMLI | (4, 10, 3) |
| | 6 | IMLIB | (5, 5, 6) |
| 9 | – | IMLI | (9, 5, 4) |
| | 6 | IMLIB | (6, 6, 6) |
| 10 | – | IMLI | (4, 9, 5) |
| | 1 | IMLIB | (1, 1, 1) |
Table 5: Examples of sets of rules learned by IMLIB on the lung cancer, iris, parkinsons, wdbc, and depressed datasets.
6 Conclusion
In this work, we present a new incremental model for learning interpretable and balanced rules: IMLIB. Our method leverages the strengths of SQFSAT, which effectively constrains the size of rules, while incorporating techniques from IMLI, such as incrementability, a cost for classification errors, and minimization of the set of rules. Our experiments demonstrate that the proposed approach generates smaller and more balanced rules than IMLI, while maintaining comparable or even superior accuracy in many cases. We argue that sets of small rules with approximately the same size seem more interpretable when compared to sets with a few large rules. As future work, we plan to develop a version of IMLIB that can classify sets of samples with more than two classes, enabling us to compare this approach with multiclass interpretable rules from the literature [11, 24].
References
- [1] Biran, O., Cotton, C.: Explanation and justification in machine learning: A survey. In: IJCAI-17 workshop on explainable AI (XAI). vol. 8, pp. 8–13 (2017)
- [2] Carleo, G., Cirac, I., Cranmer, K., Daudet, L., Schuld, M., Tishby, N., Vogt-Maranto, L., Zdeborová, L.: Machine learning and the physical sciences. Reviews of Modern Physics 91(4), 045002 (2019)
- [3] Dua, D., Graff, C.: UCI machine learning repository (2017), http://archive.ics.uci.edu/ml
- [4] Ghassemi, M., Oakden-Rayner, L., Beam, A.L.: The false hope of current approaches to explainable artificial intelligence in health care. The Lancet Digital Health 3(11), e745–e750 (2021)
- [5] Ghosh, B., Malioutov, D., Meel, K.S.: Efficient learning of interpretable classification rules. Journal of Artificial Intelligence Research 74, 1823–1863 (2022)
- [6] Ghosh, B., Meel, K.S.: IMLI: An incremental framework for MaxSAT-based learning of interpretable classification rules. In: Proceedings of the 2019 AAAI/ACM Conference on AI, Ethics, and Society. pp. 203–210 (2019)
- [7] Gunning, D., Stefik, M., Choi, J., Miller, T., Stumpf, S., Yang, G.Z.: XAI—Explainable artificial intelligence. Science robotics 4(37), eaay7120 (2019)
- [8] Huang, H.Y., Broughton, M., Mohseni, M., Babbush, R., Boixo, S., Neven, H., McClean, J.R.: Power of data in quantum machine learning. Nature communications 12(1), 2631 (2021)
- [9] Ignatiev, A., Marques-Silva, J., Narodytska, N., Stuckey, P.J.: Reasoning-based learning of interpretable ML models. In: IJCAI. pp. 4458–4465 (2021)
- [10] Ignatiev, A., Morgado, A., Marques-Silva, J.: RC2: an efficient MaxSAT solver. Journal on Satisfiability, Boolean Modeling and Computation 11(1), 53–64 (2019)
- [11] Ignatiev, A., Pereira, F., Narodytska, N., Marques-Silva, J.: A SAT-based approach to learn explainable decision sets. In: Automated Reasoning: 9th International Joint Conference, IJCAR 2018, Held as Part of the Federated Logic Conference, FloC 2018, Oxford, UK, July 14-17, 2018, Proceedings 9. pp. 627–645. Springer (2018)
- [12] Janiesch, C., Zschech, P., Heinrich, K.: Machine learning and deep learning. Electronic Markets 31(3), 685–695 (2021)
- [13] Jiménez-Luna, J., Grisoni, F., Schneider, G.: Drug discovery with explainable artificial intelligence. Nature Machine Intelligence 2(10), 573–584 (2020)
- [14] Kwekha-Rashid, A.S., Abduljabbar, H.N., Alhayani, B.: Coronavirus disease (covid-19) cases analysis using machine-learning applications. Applied Nanoscience pp. 1–13 (2021)
- [15] Lakkaraju, H., Bach, S.H., Leskovec, J.: Interpretable decision sets: A joint framework for description and prediction. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. pp. 1675–1684 (2016)
- [16] Malioutov, D., Meel, K.S.: MLIC: A MaxSAT-based framework for learning interpretable classification rules. In: Principles and Practice of Constraint Programming: 24th International Conference, CP 2018, Lille, France, August 27-31, 2018, Proceedings. pp. 312–327. Springer (2018)
- [17] Mita, G., Papotti, P., Filippone, M., Michiardi, P.: LIBRE: Learning interpretable boolean rule ensembles. In: AISTATS. pp. 245–255. PMLR (2020)
- [18] Rocha, T.A., Martins, A.T.: Synthesis of quantifier-free first-order sentences from noisy samples of strings. In: 2019 8th Brazilian Conference on Intelligent Systems (BRACIS). pp. 12–17. IEEE (2019)
- [19] Rocha, T.A., Martins, A.T., Ferreira, F.M.: Synthesis of a DNF formula from a sample of strings using Ehrenfeucht–Fraïssé games. Theoretical Computer Science 805, 109–126 (2020)
- [20] Sharma, A., Jain, A., Gupta, P., Chowdary, V.: Machine learning applications for precision agriculture: A comprehensive review. IEEE Access 9, 4843–4873 (2020)
- [21] Tjoa, E., Guan, C.: A survey on explainable artificial intelligence (XAI): Toward medical XAI. IEEE transactions on neural networks and learning systems 32(11), 4793–4813 (2020)
- [22] Vilone, G., Longo, L.: Notions of explainability and evaluation approaches for explainable artificial intelligence. Information Fusion 76, 89–106 (2021)
- [23] Yan, L., Zhang, H.T., Goncalves, J., Xiao, Y., Wang, M., Guo, Y., Sun, C., Tang, X., Jing, L., Zhang, M., et al.: An interpretable mortality prediction model for covid-19 patients. Nature machine intelligence 2(5), 283–288 (2020)
- [24] Yu, J., Ignatiev, A., Stuckey, P.J., Le Bodic, P.: Computing optimal decision sets with SAT. In: Principles and Practice of Constraint Programming: 26th International Conference, CP 2020, Louvain-la-Neuve, Belgium, September 7–11, 2020, Proceedings 26. pp. 952–970. Springer (2020)