
Connecting Interpretability and Robustness in Decision Trees through Separation

Michal Moshkovitz    Yao-Yuan Yang    Kamalika Chaudhuri
Abstract

Recent research has recognized interpretability and robustness as essential properties of trustworthy classification. Curiously, a connection between robustness and interpretability was empirically observed, but the theoretical reasoning behind it remained elusive. In this paper, we rigorously investigate this connection. Specifically, we focus on interpretation using decision trees and robustness to \ell_{\infty}-perturbation. Previous works defined the notion of r-separation as a sufficient condition for robustness. We prove upper and lower bounds on the tree size in case the data is r-separated. We then show that a tighter bound on the size is possible when the data is linearly separated. We provide the first algorithm with provable guarantees on robustness, interpretability, and accuracy in the context of decision trees. Experiments confirm that our algorithm yields classifiers that are both interpretable and robust and have high accuracy. The code for the experiments is available at https://github.com/yangarbiter/interpretable-robust-trees.

machine learning, adversarial robustness, interpretability

1 Introduction

Deploying machine learning (ML) models in high-stakes fields like healthcare, transportation, and law requires the ML models to be trustworthy. Essential ingredients of trustworthy models are explainability and robustness: if we do not understand the reasons for the model’s prediction, we cannot trust the model; if small changes in the input modify the model’s prediction, we cannot trust the model. Previous works hypothesized that there is a strong connection between robustness and explainability. They empirically observed that robust models lead to better explanations (Chen et al., 2019; Ross & Doshi-Velez, 2017). In this work, we take a rigorous approach towards understanding the connection between robustness and interpretability.

We focus on binary predictions, where each example has d features and the label of each example is in \{-1,+1\}, so an ML model is a hypothesis f:\mathbb{R}^{d}\rightarrow\{-1,1\}. We want our model to be (i) robust to adversarial \ell_{\infty} perturbations, i.e., for a small distortion \|\delta\|_{\infty}, the model’s response is unchanged, f(x)=f(x+\delta), for most examples x, (ii) interpretable, i.e., the model itself is simple and thus self-explanatory, and (iii) highly accurate. A common type of interpretable model is the decision tree (Molnar, 2019), which we call a tree-based explanation and focus on in this paper.

Prior literature (Yang et al., 2020b) showed that data separation is a sufficient condition for a robust and accurate classifier. A dataset is r-separated if the distance between the two closest examples with different labels is at least 2r. Intuitively, if r is large, then the data is well-separated. Separated data guarantees that points with opposite labels are far from each other, which is essential for constructing a robust model.

In this paper, we examine whether separation implies a tree-based explanation. We first show that for a decision tree to have accuracy strictly above 1/2 (i.e., better than random), the data must be bounded. Henceforth we assume that the data is in [-1,1]^{d}. We start with a trivial algorithm that constructs a tree-based explanation with complexity (i.e., tree size) 2^{O(d/r)}. For constant r, we show a matching lower bound of 2^{\Omega(d)}. Thus, we have matching lower and upper bounds on the explanation size of 2^{\Theta(d)}, and separation implies both robustness and interpretability. Unfortunately, for a large number of features, d, the explanation size is too high to be useful in practice.

In this paper, we show that designing a simpler explanation is possible with a stronger separability assumption — linear separability with a \gamma-margin. This assumption was recently used to gain a better understanding of neural networks (Soudry et al., 2018; Nacson et al., 2019; Shamir, 2020). More formally, this assumption means that there is a vector w with \|w\|_{1}=1 such that yw\cdot x\geq\gamma for each labeled example (x,y)\in\mathbb{R}^{d}\times\{-1,1\} in the data (Shalev-Shwartz & Ben-David, 2014). Our main building block is a proof that, under the linearity assumption, there is always at least one feature that provides non-trivial information for the prediction. To formalize this, we use the known notion of weak learners (Kearns, 1988), which guarantees the existence of a hypothesis with accuracy strictly greater than 1/2.

Our weak-learnability theorem, together with Kearns & Mansour (1999), implies that a popular CART-type algorithm (Breiman et al., 1984) is guaranteed to give a decision tree with size 1/\epsilon^{O(1/\gamma^{2})} and accuracy 1-\epsilon. Therefore, under the linearity assumption, we can design a tree with complexity independent of the number of features. Thus, even if the number of features, d, is large, the interpretation complexity is not affected. This achieves our first goal of constructing an interpretable model with provable guarantees.

Recently, several research papers gave a theoretical justification for CART’s empirical success (Brutzkus et al., 2020, 2019; Blanc et al., 2019, 2020; Fiat & Pechyony, 2004). Those papers assume that the underlying distribution is uniform or that the features are chosen independently. In many cases, this assumption does not hold. For example, in medical data, there is a strong correlation between age and different diseases. In contrast, we give a theoretical justification for CART without resorting to the feature-independence assumption; we use, instead, the linear separability assumption. We believe that our method will allow, in the future, proofs with less restrictive assumptions.

So far, we showed how to construct an interpretable model, but we want a model that is not just interpretable but also robust. Decision trees are not robust by default (Chen et al., 2019). For example, a small change in the feature at the root of the decision tree leads to an entirely different model (and thus to entirely different predictions): the model defined by the left subtree versus the model defined by the right subtree. We are left with the questions: can robustness and interpretability co-exist? How can we construct a model that is guaranteed to be both interpretable and robust? To design such a model, we focus on a specific kind of decision tree — the risk score (Ustun & Rudin, 2017). A risk score is composed of several conditions (e.g., age\geq 75), each matched with a weight, i.e., a small integer. The score s(x) of an example x is the weighted sum of all the satisfied conditions. The label is then a function of the score s(x). A risk score is a specific case of a decision tree, wherein at each level in the tree the same feature is queried. The number of parameters required to represent a risk score is much smaller than that of the corresponding decision tree, hence risk scores might be considered more interpretable than decision trees (Ustun & Rudin, 2017).

We design a new learning algorithm, BBM-RS, for learning risk scores that relies on the Boost-by-Majority (BBM) algorithm (Freund, 1995) and our weak learner theorem. It yields a risk score of size O(\gamma^{-2}\log(1/\epsilon)) and accuracy 1-\epsilon. Thus, we found an algorithm that creates a risk score with provable guarantees on size and accuracy. As a side effect, note that BBM allows us to control the interpretation complexity easily. Importantly, we show that risk scores are also guaranteed to be robust to \ell_{\infty} perturbations, by deliberately adding a small amount of noise to the dataset (but not too much, so that the noisy dataset remains linearly separable). Therefore, we design a model that is guaranteed to have high accuracy and be both interpretable and robust, achieving our final goal.

Finally, in Section 6, we test the validity of the separability assumption and the quality of the new algorithm on real-world datasets that were used previously in tree-based explanation research. On most of the datasets, fewer than 12\% of the points had to be removed to achieve an r-separation with r\geq 0.05. For comparison, for binary feature values \{-1,1\} and \ell_{\infty} distance, the best possible value is r=1. The additional percentage of points that must be removed for the dataset to be linearly separable is less than 7\% on average. Thus, we observe that real datasets are close to being separable and even linearly separable. In the next step, we explored the quality of our new algorithm, BBM-RS. Even though it has provable guarantees only if the data is linearly separable, we ran it on real datasets that do not satisfy this property. We compare BBM-RS to different learning algorithms: decision trees (Quinlan, 1986), small risk scores (Ustun & Rudin, 2017), and robust decision trees (Chen et al., 2019). All algorithms try to maximize accuracy, but different algorithms additionally try to minimize interpretation complexity or maximize robustness; none of them aims to optimize both interpretability and robustness. We compared the (i) interpretation complexity, (ii) robustness, and (iii) accuracy of all four algorithms. We find that our algorithm provides a model with better interpretation complexity and robustness while having comparable accuracy.

To summarize, our main contributions are:

Interpretability under separability: optimal bounds. We show lower and upper bounds of 2^{\Theta(d)} on decision tree size for r-separable data with r=\Theta(1). Namely, our upper bound proves that for any separable data there is a tree of size 2^{O(d)}, and the lower bound proves that separability cannot guarantee an explanation smaller than 2^{\Omega(d)}.

Algorithm with provable guarantees on interpretability and robustness. Designing algorithms with provable guarantees on interpretability, robustness, and accuracy in the context of decision trees is highly sought-after, yet no such algorithm existed before our work. We design the first learning algorithm that has provable guarantees on the interpretability, robustness, and accuracy of the returned model, under the assumption that the data is linearly separable with a margin.

While the CART algorithm is empirically highly effective, its theoretical analysis has been elusive for a long time. As a side effect, we provide an analysis of CART under the assumption of linear separability. To the best of our knowledge, this is the first proof with a distributional assumption that does not include feature independence.

Experiments. We verify the validity of our assumptions empirically and show that for real datasets, removing a small percentage of points yields a linearly separable dataset. We also compare our new algorithm to other algorithms that return interpretable models (Quinlan, 1986; Ustun & Rudin, 2017; Chen et al., 2019) and show that if the goal is to design a model that is both interpretable and robust, then our method is preferable.

2 Related Work

Post-hoc explanations. There are two main types of explanations: post hoc explanations (Ribeiro et al., 2016a) and intrinsic explanations (Rudin, 2019b). Algorithms for post hoc explanation take as input a black-box model and return some form of explanation. Intrinsic explanations are simple models, so the models are self-explanatory. The main advantage of algorithms for post hoc explanations (Lundberg & Lee, 2017; Lundberg et al., 2018; Ribeiro et al., 2016b; Koh & Liang, 2017; Ribeiro et al., 2018; Deutch & Frost, 2019; Li et al., 2020; Boer et al., 2020) is that they can be used on any model. However, they suffer from a variety of problems: they introduce a new source of error stemming from the explanation method (Rudin, 2019b); they can be fooled (Slack et al., 2020); and some explanation methods are not robust to common pre-processing steps (Kindermans et al., 2019) and can be independent both of the model and of the data generating process (Adebayo et al., 2018). Because of these criticisms of post hoc explanations, in this paper we focus on intrinsic explanations.

Explainability and robustness. Several works study the intersection of explanation and robustness for black-box models (Lakkaraju et al., 2020), decision trees (Chen et al., 2019; Andriushchenko & Hein, 2019), and deep neural networks (Szegedy et al., 2013; Goodfellow et al., 2014; Madry et al., 2017; Ross & Doshi-Velez, 2017). Unfortunately, the quality of these methods is only verified empirically. On the theoretical side, most works analyzed explainability and robustness separately. Explainability was researched for supervised learning (Garreau & von Luxburg, 2020b, a; Mardaoui & Garreau, 2020; Hu et al., 2019) and unsupervised learning (Moshkovitz et al., 2020; Frost et al., 2020; Laber & Murtinho, 2021). For robustness, Cohen et al. (2019) showed that the technique of randomized smoothing has robustness guarantees. Ignatiev et al. (2019) connected adversarial examples and a different type of explainability from the point of view of formal logic.

Risk scores. Ustun and Rudin (Ustun & Rudin, 2017) designed a new algorithm for learning risk scores by solving an appropriate optimization problem. They focused on constructing an interpretable model with high accuracy and did not consider robustness, as we do in this work.

3 Preliminaries

We investigate models that are (i) highly accurate, (ii) robust, and (iii) interpretable, as formalized next.

High accuracy. We consider the task of binary classification over a domain {\mathcal{X}}\subseteq{\mathbb{R}}^{d}. Let \mu be an underlying probability distribution over labeled examples {\mathcal{X}}\times\{-1,+1\} (later in the paper, we will assume that \mu has additional properties, like separation or linear separation). The input to a learning algorithm \mathcal{A} consists of a labeled sample S\sim\mu^{m}, and its output is a hypothesis h:{\mathcal{X}}\rightarrow\{-1,+1\}. The accuracy of h is equal to \Pr_{(x,y)\sim\mu}(h(x)=y). The sample complexity of \mathcal{A} under the distribution \mu, denoted m(\epsilon,\delta):(0,1)^{2}\rightarrow\mathbb{N}, is a function mapping the desired accuracy \epsilon and confidence \delta to the minimal positive integer m(\epsilon,\delta) such that for any m\geq m(\epsilon,\delta), with probability at least 1-\delta over the draw of an i.i.d. sample S\sim\mu^{m}, the output \mathcal{A}(S) has accuracy at least 1-\epsilon.

Robustness. We focus on the \ell_{\infty} ball, {\mathbb{B}}, and denote the r-radius ball around a point x\in{\mathcal{X}} as {\mathbb{B}}(x,r). A hypothesis h:{\mathcal{X}}\rightarrow\{-1,+1\} is robust at x with radius r if for all x^{\prime}\in{\mathbb{B}}(x,r) we have that h(x)=h(x^{\prime}). In (Wang et al., 2018), the notion of astuteness was introduced to measure the robustness of a hypothesis h. The astuteness of h at radius r>0 under a distribution \mu is

\Pr_{(x,y)\sim\mu}[\forall x^{\prime}\in{\mathbb{B}}(x,r).\;h(x^{\prime})=y].

For a hypothesis to have high astuteness, the positive and negative examples need to be separated. A binary labeled dataset is r-separated if for every two labeled examples (x^{1},+1),(x^{2},-1), it holds that \|x^{1}-x^{2}\|_{\infty}\geq 2r.
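As a concrete illustration, the following sketch (our own, not code from the paper) computes the largest r for which a labeled sample is r-separated under \ell_{\infty}:

import numpy as np

def separation_radius(X, y):
    """Largest r such that (X, y) is r-separated in l_inf: half the minimum
    l_inf distance between an example labeled +1 and an example labeled -1."""
    pos, neg = X[y == 1], X[y == -1]
    # pairwise l_inf distances between positive and negative examples
    dists = np.abs(pos[:, None, :] - neg[None, :, :]).max(axis=-1)
    return dists.min() / 2.0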

Interpretability. We focus on intrinsic explanations, also called interpretable models (Rudin, 2019a), where the model itself is the explanation. There are several types of interpretable models, e.g., logistic regression, decision rules, and anchors (Molnar, 2019). One of the most fundamental interpretable models, and the one we focus on in this paper, is the decision tree (Quinlan, 1986). In a decision tree, each leaf corresponds to a label, and each inner node corresponds to a threshold and a feature. The label of an example is the label of the leaf at the end of the example’s root-to-leaf path.

In the paper we also focus on a specific type of decision tree, the risk score (Ustun & Rudin, 2019); see Table 1. It is defined by a series of m conditions and a weight for each condition. Each condition compares one feature to a threshold, and the weights are small integers. The score, s(x), of an example x is the number of satisfied conditions out of the m conditions, each multiplied by the corresponding weight. The prediction of the risk model f is the sign of the score, f(x)=sign(s(x)). A risk score can be viewed as a decision tree where at each level there is the same feature-threshold pair. Since the risk-score model has fewer parameters than the corresponding decision tree, it may be considered more interpretable.
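For concreteness, a risk score of this form can be evaluated with a few lines of code; the sketch below is our own illustration (the feature indices, thresholds, and weights are hypothetical), not the authors' implementation:

def risk_score_predict(x, conditions, weights, bias):
    """conditions: list of (feature_index, threshold) pairs read as "x[i] >= threshold";
    weights: small integers, one per condition; bias: integer bias term (always added)."""
    s = bias + sum(w for (i, t), w in zip(conditions, weights) if x[i] >= t)
    return 1 if s > 0 else -1

# hypothetical example in the spirit of Table 1 (binary features)
conditions = [(0, 1), (3, 1), (7, 1)]
weights = [2, 4, 2]
print(risk_score_predict([1, 0, 0, 1, 0, 0, 0, 0], conditions, weights, bias=-7))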

feature                              LCPA weight   BBM-RS weight
Bias term                                -6            -7
Age \geq 75                               -             2
Called in Q1                              1             2
Called in Q2                             -1             -
Called before                             1             4
Previous call was successful              1             2
Employment variation rate < -1            5             4
Consumer price index \geq 93.5            1             -
3 month euribor rate \geq 200            -2             -
3 month euribor rate \geq 400             5             -
3 month euribor rate \geq 500             2             -
total score =
Table 1: Two risk score models: LCPA (Ustun & Rudin, 2019) and our new BBM-RS algorithm on the bank dataset (Moro et al., 2014). Each satisfied condition is multiplied by its weight and summed. The bias term is always satisfied. If the total score >0, the risk model predicts “1” (i.e., the client will open a bank account after a marketing call). All features are binary (either 0 or 1).

4 Separation and Interpretability

We want to understand whether separation implies the existence of a small tree-based explanation. Our first observation is that the data has to be bounded for a tree-based explanation to exist. If the data is unbounded, then to achieve a training error even slightly better than random, the tree size must depend on the size of the training data; see Theorem 1 in Section 4.1.

In Section 4.2 we investigate lower and upper bounds on the decision tree’s size, assuming separation. Specifically, in Theorem 2, we show that if the data is bounded in [-1,1]^{d}, then r-separability implies a tree-based explanation with tree depth O(d/r). Importantly, the depth of the tree is independent of the training size, so a tree-based explanation exists. Nevertheless, even for a constant r, the size of the tree is exponential in d. In Theorem 3, we show that this bound is tight, as there is a 1-separable dataset that requires exponential size to achieve accuracy even negligibly better than random. The conclusion from this section is that if all we know is that the data is r-separable for constant r, we understand the interpretation complexity: it is exactly 2^{\Theta(d)}. However, the explanation is of exponential size in d. In Section 5, we improve the interpretation complexity by assuming a stronger separability assumption: linear separability with a margin. All proofs are in Section A.1.

4.1 Bounded

In Theorem 1, we show that the data has to be bounded for a small decision tree to exist. In fact, boundedness is necessary even if the data is constrained to be linearly separable. For any tree size s and a given accuracy, we can construct a linearly separable dataset such that no tree of size s can achieve the desired accuracy.

Theorem 1.

For any tree size s and \gamma>0, there is a dataset in {\mathbb{R}}^{2} that is linearly separable, and any decision tree of size s has accuracy less than \frac{1}{2}+\gamma.

4.2 Upper and lower bounds

Assuming the data in [-1,1]^{d} is r-separated, Theorem 2 tells us that one can construct a decision tree with depth 6d/r and training error 0 (and, by standard VC arguments, also accuracy 1-\epsilon given enough examples). Importantly, the depth of the tree is independent of the training size n. Nevertheless, it means the size of the tree is exponential in d. The idea of the proof is to partition the data into bins of size about r in each coordinate. From this construction, it is clear that the returned model is robust at every training point.

Theorem 2.

For any labeled data in [-1,1]^{d}\times\{-1,1\} that is r-separated, there is a decision tree of depth at most \frac{6d}{r} which has training error 0.
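To make the binning idea concrete, here is a small lookup-table sketch of the grid partition that the tree in Theorem 2 realizes (our own illustration, not the paper's construction verbatim): r-separation guarantees that no grid cell of side length in [r, 2r) contains points of both labels, so memorizing one label per occupied cell achieves training error 0.

import numpy as np

def fit_grid(X, y, r):
    """X in [-1, 1]^d, y in {-1, +1}; the data is assumed r-separated in l_inf."""
    delta = r  # any bin width in [r, 2r) works
    table = {}
    for xi, yi in zip(X, y):
        cell = tuple(np.floor((xi + 1.0) / delta).astype(int))
        table[cell] = yi  # a cell never mixes labels, so overwriting is safe
    return table, delta

def predict_grid(table, delta, x, default=1):
    return table.get(tuple(np.floor((x + 1.0) / delta).astype(int)), default)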

Theorem 3 proves a matching lower bound by constructing a dataset such that any tree achieving accuracy better than random must have size exponential in d. The dataset proving this lower bound is parity. More specifically, it contains the points \{-1,+1\}^{d}, and the label of each point x is the xor of all of its coordinates.

Theorem 3.

There is a labeled dataset in [-1,1]^{d} which is 1-separated and has the following property. For any \gamma>0 and any decision tree T that achieves accuracy 0.5+\gamma, the size of T is at least \gamma 2^{d}.

5 Linear Separability

In the previous section, we showed that \Theta(1)-separability implies a decision tree of size exponential in d, and we showed a matching lower bound. This section explores a stronger assumption than separability that guarantees a smaller tree, i.e., a simpler explanation. This assumption is that the data is linearly separable with a margin. More formally, data is \gamma-linearly separable if there is w\in{\mathbb{R}}^{d}, \|w\|_{1}=1, such that for each positive example x it holds that w\cdot x\geq\gamma and for each negative example x it holds that w\cdot x\leq-\gamma. Note that without loss of generality w_{i}\geq 0 (if the inequality does not hold, multiply the i-th feature in each example by -1). Thus, we can interpret w as a distribution over the features. One might interpret the highest w_{i} as the most important feature. However, this would be misleading. For example, if all the data has the same value at the i-th feature, this feature is meaningless. In the experimental section, Section 6, we show that real datasets are close to being linearly separable.
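For intuition, checking the \gamma-margin of a candidate w on a sample takes one line; this is our own sketch of the definition above, assuming w is already normalized:

import numpy as np

def margin(w, X, y):
    """Smallest signed margin y * (w . x) over the sample; the data is
    gamma-linearly separable by w exactly when this value is >= gamma."""
    w = np.asarray(w, dtype=float)
    assert np.all(w >= 0) and np.isclose(np.abs(w).sum(), 1.0)  # ||w||_1 = 1, w_i >= 0
    return float(np.min(y * (X @ w)))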

The key step in designing a model which is both interpretable and robust is to show that one feature can provide a nontrivial prediction. In particular, in Section 5.1 we show that the hypothesis class {\mathcal{H}}_{t}=\{h_{i,\theta}\} is a weak learner, where

h_{i,\theta}(x)=\begin{cases}+1&\text{if }x_{i}\geq\theta\\ -1&\text{otherwise}\end{cases}

This class is similar to the well-known decision stumps class, but it does not contain hypotheses of the form “if x_{i}\leq\theta then +1 else -1”. The reason will become apparent in Section 5.3, but for now, we will hint that it helps achieve robustness.
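As an illustration of searching {\mathcal{H}}_{t} under a distribution \mu, the brute-force sketch below enumerates feature-threshold pairs and picks the \mu-weighted most accurate one (Section 5.1 notes this can be done more efficiently with dynamic programming; the function name and interface are ours):

import numpy as np

def best_hypothesis(X, y, mu):
    """Return (i, theta, weighted_acc) maximizing the mu-weighted accuracy of
    h_{i,theta}(x) = +1 if x_i >= theta else -1."""
    best = (0, -np.inf, -1.0)
    for i in range(X.shape[1]):
        # thresholds at observed values, plus one above the maximum (all -1)
        for theta in np.append(np.unique(X[:, i]), X[:, i].max() + 1.0):
            pred = np.where(X[:, i] >= theta, 1, -1)
            acc = float(np.sum(mu * (pred == y)))
            if acc > best[2]:
                best = (i, theta, acc)
    return best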

In Section 5.2, we observe that weak learnability immediately implies that the well-known CART algorithm constructs a tree of size independent of d (Kearns & Mansour, 1999). Unfortunately, decision trees are not necessarily robust. To overcome this difficulty, we focus on one type of decision tree, risk scores, which are interpretable models in their own right. In Section 5.3 we show how to use (Freund, 1995) together with our weak learnability theorem to construct a risk score model. We also show that this model is robust. This concludes our quest to find a model that is guaranteed to be robust, interpretable, and highly accurate under the linear separability assumption. In Section 6 we evaluate the model on several real datasets.

5.1 Weak learner

This section shows that under the linearity assumption, we can always find a feature that gives nontrivial information, which is formally defined using the concept of a weak learner class. We say that a class {\mathcal{H}} is a weak learner if for every function f that is \gamma-linearly separable and for every distribution \mu over the examples, there is a hypothesis h\in{\mathcal{H}} such that \Pr_{x\sim\mu}(h(x)=f(x)) is strictly bigger than 1/2, preferably at least 1/2+\Omega(\gamma). Finding the best hypothesis in {\mathcal{H}}_{t} can be done efficiently using dynamic programming (Shalev-Shwartz & Ben-David, 2014). The interesting question is how to prove that there must be a weak learner in {\mathcal{H}}_{t}. This is the focus of the current section.

One might suspect that if the data is linearly separable by the vector w (i.e., for each labeled example (x,y) it holds that yw\cdot x\geq\gamma), then the h_{i} that corresponds to the highest w_{i} is a weak learner, and conversely, if w_{i} is small, then the corresponding hypothesis h_{i} has low accuracy. These claims are not true. To illustrate this, consider the extreme example where w_{1}=0 but x_{1} completely predicts the label of any example x. From the viewpoint of w, the first feature is irrelevant, as it does not contribute to the term w\cdot x, but the first feature is a perfect predictor.

To prove that there is always a hypothesis in {\mathcal{H}}_{t} with accuracy 0.5+\Omega(\gamma), we view {\mathcal{H}}_{t} as a graph. Namely, we define a bipartite graph where the vertices are the examples and the hypotheses, and there is an edge between a hypothesis h and an example x if h correctly predicts x. The edges of the graph are defined so that (i) the degree of a hypothesis vertex corresponds to its accuracy and (ii) the linearity assumption ensures that the degree of each example vertex is high. These two properties of the graph prove the theorem. All proofs are in Section A.2.

Theorem 4.

Fix \alpha>0. For any data in [-1,1]^{d}\times\{-1,1\} that is labeled by a \gamma-linearly separable hypothesis f and for any distribution \mu on the examples, there is a hypothesis h\in{\mathcal{H}}_{t} such that

\Pr_{x\sim\mu}(h(x)=f(x))\geq\frac{1}{2}+\frac{\gamma}{2}-\alpha.

As a side note, in (Shalev-Shwartz & Singer, 2008), a different connection between linear separability and weak learning was established. They view each example in the hypothesis basis, and in this basis, the famous minimax theorem implies that linear separability is equivalent to weak learnability. In this paper, we focus on the case where the data, in its original form, is linearly separable. Nonetheless, in the case where the features are binary, the two views, the original and the hypothesis basis, coincide. However, our method also holds in the nonbinary case.

So far, we showed the existence of a hypothesis in {\mathcal{H}}_{t} with accuracy 0.5+\Omega(\gamma). Standard arguments in learning theory imply that the hypothesis that maximizes the accuracy on a sample also has accuracy 0.5+\Omega(\gamma). Specifically, for any sample S, denote by h_{S} the best hypothesis in {\mathcal{H}}_{t} on the sample S. Basic arguments in learning theory show that for a sample of size m=O\left(\nicefrac{{d+\log\frac{1}{\delta}}}{{\gamma^{2}}}\right), the hypothesis h_{S} has good accuracy, as the following theorem proves.

Theorem 5 (weak-learner).

Fix \alpha>0. For any distribution \mu over [-1,+1]^{d}\times\{-1,+1\} that satisfies linear separability with a \gamma-margin, and for any \delta\in(0,1), there is m=O\left(\frac{d+\log\frac{1}{\delta}}{\gamma^{2}}\right) such that with probability at least 1-\delta over a sample S of size m, it holds that

\Pr_{(x,y)\sim\mu}(h_{S}(x)=y)\geq\frac{1}{2}+\frac{\gamma}{4}-\alpha.

5.2 Decision tree using CART

CART is a popular algorithm for learning decision trees. In (Kearns & Mansour, 1999) it was shown that if the internal nodes define a \gamma-weak learner and the number of samples is polynomial in t\log(1/\delta)d, then a CART-type algorithm returns a tree with size t=1/\epsilon^{O(1/\gamma^{2})} and accuracy at least 1-\epsilon, with probability at least 1-\delta. Under the linearity assumption, we know from Theorem 5 that the internal nodes indeed define a \gamma-weak learner. Thus, we get a model with a tree size independent of the training size and the dimension. But the model is not necessarily robust.

The above result can be interpreted as a proof of the CART algorithm’s success. This proof does not use the strong assumption of feature independence, which is assumed in recent works (Brutzkus et al., 2020, 2019; Blanc et al., 2019, 2020; Fiat & Pechyony, 2004).

Designing robust decision trees is inherently a difficult task. The reason is that, in general, the models defined by the right and left subtrees can be completely different. The feature i at the root determines whether the model uses the right or the left subtree. Thus, a small change in the i-th feature completely changes the model. To overcome this difficulty we focus on a specific type of decision tree, risk scores (Ustun & Rudin, 2019); see Table 1 for an example. In the decision tree that corresponds to a risk score, the right and left subtrees are the same. In the next section, we design risk scores with guarantees on both robustness and accuracy.

5.3 Risk score

This section designs an algorithm that returns a risk score model with provable guarantees on its accuracy and robustness, assuming that the data is linearly separable. In the previous section, we used (Kearns & Mansour, 1999), which viewed CART as a boosting method. This section uses a more traditional boosting method — the Boost-by-Majority algorithm (BBM) (Freund, 1995). This boosting algorithm gets as input training data and an integer T, and at each step t\leq T it reweights the examples and applies a \gamma-weak learner that returns a hypothesis h_{t}:{\mathbb{R}}^{d}\rightarrow\{-1,+1\}. At the end, after T steps, BBM returns sign\left(\sum_{t=1}^{T}{h_{t}}\right). In (Freund, 1995; Schapire & Freund, 2013) it was shown that BBM returns a hypothesis with accuracy at least 1-\epsilon after at most T=O(\gamma^{-2}\log(1/\epsilon)) rounds.

The translation from BBM, which uses {\mathcal{H}}_{t} as a weak learner, to a risk score model is straightforward. The hypotheses in {\mathcal{H}}_{t} exactly correspond to the conditions in the risk score. Each condition has weight 1. If the number of conditions that hold is at least T/2, then our risk model returns +1; otherwise it returns -1. Together with Theorem 4 and (Freund, 1995), we get that BBM returns a risk score with accuracy at least 1-\epsilon and with T=O(\gamma^{-2}\log(1/\epsilon)) conditions.

We remark that other boosting methods, e.g., (Freund & Schapire, 1997; Kanade & Kalai, 2009), cannot replace BBM in the suggested scheme, since the final combination has to be a simple sum of the weak learners and not an arbitrary linear combination. The latter corresponds to a risk score where the weights are in {\mathbb{R}} and not small integers, as desired.

Our next and final goal is to prove that our risk score model is also robust. For that, we use the concept of monotonicity. For x,y\in{\mathbb{R}}^{d}, we say that x\leq y if and only if for all i\in[d] it holds that x_{i}\leq y_{i}. A model f:{\mathbb{R}}^{d}\rightarrow\{0,1\} is monotone if for all x\leq y it holds that f(x)\leq f(y). We will show that BBM with weak learners from {\mathcal{H}}_{t} yields a monotone model. The reasons are that (i) all conditions are of the form “x_{i}\geq\theta”, (ii) all weights are non-negative, except the bias term, and (iii) the classification of a risk score is determined by the score’s sign. All proofs appear in Section A.3.

Claim 6.

If every condition in a risk-score model R is of the form “x_{i}\geq\theta” and all weights are positive, except the bias term, then R is a monotone model.

In Claim 7 we show that, by carefully adding small noise to each feature, we can transform any algorithm that returns a monotone model into one that returns a robust model.

Claim 7.

Assume a learning algorithm A gets as input \gamma-linearly separable data and returns a monotone model with accuracy 1-\epsilon(\gamma). Then, there is an algorithm that returns a model with astuteness at least 1-\epsilon\left(\frac{\gamma}{2}\right) at radius \gamma/2.

To summarize, Algorithm 1 shows the pseudocode of our new algorithm, BBM-RS. In the first step we add noise to each example by replacing each example (x,y) with (x-\tau y\mathbf{1},y), where \tau\in(0,1) is a parameter that defines the noise level and \mathbf{1} is the all-ones vector. In other words, we shift each feature by -\tau y. In the second step, the algorithm iteratively adds conditions to the risk score. At each iteration, we first find the distribution \mu defined by BBM (Freund, 1995). Then, we find the best hypothesis h_{i,\theta} in {\mathcal{H}}_{t} according to \mu and add to the risk score the condition “x_{i}\geq\theta”. Finally, we add a bias term of -T/2, to check whether at least half of the conditions are satisfied.

Algorithm 1 BBM-RS (BBM-Risk Score)
  input: D: training data, linearly separable by w; WLOG \forall i.\,w_{i}\geq 0
     T: bound on interpretation complexity
     \tau: noise level
  output: risk score RS
  # Add noise:
  for (x,y)\in D do
     replace (x,y) with (x-\tau y\mathbf{1},y)
  end for
  for t=1\ldots T do
     \mu\leftarrow BBM distribution on D
     i,\theta\leftarrow\operatorname*{arg\,max}_{i,\theta}\sum_{(x,y)\in D}\mu(x)I_{(x_{i}-\theta)y>0}
     Add condition “x_{i}\geq\theta” to RS
  end for
  Add a bias term of -T/2 to RS
  return RS
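Below is a self-contained Python sketch of Algorithm 1 under stated assumptions: the BBM weighting is our reconstruction of the potential-based weights from Freund (1995) (the edge parameter gamma is an explicit input here, and scipy supplies the binomial terms), and the weak-learner search is a brute-force stand-in for the argmax step. It illustrates the flow of BBM-RS, not the authors' exact implementation.

import numpy as np
from scipy.stats import binom

def best_hypothesis(X, y, mu):
    # brute-force weak learner over H_t = {h_{i,theta}}: maximize mu-weighted accuracy
    best = (0, -np.inf, -1.0)
    for i in range(X.shape[1]):
        for theta in np.unique(X[:, i]):
            acc = float(np.sum(mu * (np.where(X[:, i] >= theta, 1, -1) == y)))
            if acc > best[2]:
                best = (i, theta, acc)
    return best[0], best[1]

def bbm_weights(correct, t, T, gamma):
    """Hedged reconstruction of the BBM distribution: an example classified correctly
    `correct` times after t rounds is weighted by the chance that it lands exactly on
    the deciding margin, assuming each of the remaining T - t weak hypotheses is
    correct independently with probability 1/2 + gamma."""
    needed = int(np.ceil(T / 2)) - correct        # correct votes still needed to win
    w = binom.pmf(needed, T - t, 0.5 + gamma)
    if w.sum() == 0:                              # degenerate case: fall back to uniform
        w = np.ones_like(w, dtype=float)
    return w / w.sum()

def bbm_rs(X, y, T, tau, gamma=0.05):
    """X: (n, d) array, y: (n,) in {-1, +1}. Returns (conditions, bias), where each
    condition (feature index, threshold) has weight 1 and the bias is -T/2."""
    X = X - tau * y[:, None]                      # noise step: (x, y) -> (x - tau*y*1, y)
    conditions, correct = [], np.zeros(len(y), dtype=int)
    for t in range(T):
        mu = bbm_weights(correct, t, T, gamma)
        i, theta = best_hypothesis(X, y, mu)
        conditions.append((i, theta))
        correct += (np.where(X[:, i] >= theta, 1, -1) == y)
    return conditions, -T / 2.0

def rs_predict(conditions, bias, X):
    score = bias + sum((X[:, i] >= theta).astype(int) for i, theta in conditions)
    return np.where(score > 0, 1, -1)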

6 Experiments

In previous sections, we designed new algorithms and gave provable guarantees for separated data. We next investigate these results on real datasets. Concretely, we ask the following questions:

  • How separated are real datasets?

  • How well does BBM-RS perform compared with other interpretable methods?

  • How do interpretability, robustness, and accuracy trade-off with one another in BBM-RS?

Datasets. To maintain compatibility with prior work on interpretable and robust decision trees (Ustun & Rudin, 2019; Lin et al., 2020), we use the following pre-processed datasets from their repositories: adult, bank, breastcancer, mammo, mushroom, spambase, careval, ficobin, and compasbin. We also use datasets from other sources, such as the LIBSVM datasets (Chang & Lin, 2011) and Moro et al. (2014); these include diabetes, heart, ionosphere, and bank2. All features are normalized to [0,1]. More details can be found in Appendix B. The dataset statistics are shown in Table 2.

dataset | # samples | # features | # binary features | portion of positive label | r-separation: sep., 2r | \gamma-linear separation: sep., \gamma
adult 32561 36 36 0.24 0.88 1.00 0.84 0.001
bank 41188 57 57 0.11 0.97 1.00 0.90 0.33
bank2 41188 63 53 0.11 1.00 0.0004 0.91 0.00002
breastcancer 683 9 0 0.35 1.00 0.11 0.97 0.0003
careval 1728 15 15 0.30 1.00 1.00 0.96 0.003
compasbin 6907 12 12 0.46 0.68 1.00 0.65 0.20
diabetes 768 8 0 0.65 1.00 0.11 0.77 0.0008
ficobin 10459 17 17 0.48 0.79 1.00 0.70 0.33
heart 270 20 13 0.44 1.00 0.13 0.89 0.0003
ionosphere 351 34 1 0.64 1.00 0.80 0.95 0.0007
mammo 961 14 13 0.46 0.83 0.33 0.79 0.14
mushroom 8124 113 113 0.48 1.00 1.00 1.00 0.02
spambase 4601 57 0 0.39 1.00 0.000063 0.94 0.000002
Table 2: Dataset statistics. The “sep.” columns record the separateness of each dataset. The “2r” and “\gamma” columns are calculated after the dataset is made separable by removing a 1-\text{sep.} fraction of the points.

6.1 Separation of real datasets

To understand how separated real datasets are, we measure the closeness of each dataset to being r-separated or linearly separated. The separateness of a dataset is one minus the fraction of examples that need to be removed for it to be r-separated or linearly separated.

For r-separation, we use the algorithm designed by Yang et al. (2020a) that calculates the minimum number of examples that need to be removed for a dataset to be r-separated with r\geq 10^{-5}. This ensures that after removal, there is no pair of examples that are very similar but have different labels. Finding the optimal separateness for linear separation is NP-hard (Ben-David et al., 2003), so we run an \ell_{1}-regularized linear SVM with regularization parameters C\in\{10^{-10},10^{-8},\ldots,10^{10}\} and record the lowest training error as an approximation of one minus the optimal separateness.
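For reference, the linear-separateness approximation described above can be reproduced with scikit-learn; this is our own sketch of the procedure, not the authors' script:

import numpy as np
from sklearn.svm import LinearSVC

def approx_linear_separateness(X, y, Cs=np.logspace(-10, 10, 11)):
    """Best training accuracy of an l1-regularized linear SVM over a grid of C;
    an approximation of one minus the optimal fraction of points to remove."""
    best = 0.0
    for C in Cs:
        clf = LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=C, max_iter=10000)
        clf.fit(X, y)
        best = max(best, clf.score(X, y))
    return best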

The separation results are shown in Table 2. Eight datasets are already r-separated (separateness =100\%). In the five datasets with separateness below 100\%, there are examples with very similar features but different labels. This occurs mostly in binarized datasets; see Appendix C for an example. Three datasets are almost separated, with separateness equal to 97\%, 88\%, and 83\%, and two have separateness 68\% and 79\%. To summarize, 84\% of the datasets are r-separated with r\geq 10^{-5} after removing at most 17\% of the points.

Linear separation is a stricter property than r-separation, so the separateness for linear separation is smaller than or equal to the separateness for r-separation. Seven datasets have separateness \geq 90\%, three have separateness between 79\% and 89\%, and the remaining three have separateness <79\%. After removing the points, all datasets are \gamma-linearly separable and nine datasets have \gamma\geq 0.001. To summarize, (i) 77\% of the datasets are close to being linearly separated, and (ii) requiring linear separability reduces the separateness of the r-separated datasets by only 6.77\% on average. From this we conclude that, for these datasets at least, the assumption of r-separability or linear separability is approximately correct.

dataset | IC (lower=better): DT, RobDT, LCPA, BBM-RS | test accuracy (higher=better): DT, RobDT, LCPA, BBM-RS | ER (higher=better): DT, RobDT, LCPA, BBM-RS
adult | 414.20, 287.90, 14.90, \mathbf{6.00} | \mathbf{0.83}, \mathbf{0.83}, 0.82, 0.81 | \mathbf{0.50}, \mathbf{0.50}, 0.12, \mathbf{0.50}
bank | 30.70, 26.80, 8.90, \mathbf{8.00} | \mathbf{0.90}, \mathbf{0.90}, \mathbf{0.90}, \mathbf{0.90} | \mathbf{0.50}, \mathbf{0.50}, 0.20, \mathbf{0.50}
bank2 | 30.00, 30.70, 13.80, \mathbf{4.50} | \mathbf{0.91}, 0.90, 0.90, 0.90 | 0.12, 0.18, 0.10, \mathbf{0.50}
breastcancer | 15.20, 7.40, \mathbf{6.00}, 11.00 | 0.94, 0.94, \mathbf{0.96}, \mathbf{0.96} | 0.23, \mathbf{0.29}, 0.28, 0.27
careval | 59.30, 28.20, 10.10, \mathbf{8.70} | \mathbf{0.97}, 0.96, 0.91, 0.77 | \mathbf{0.50}, \mathbf{0.50}, 0.19, \mathbf{0.50}
compasbin | 67.80, 33.70, \mathbf{5.40}, 7.60 | \mathbf{0.67}, \mathbf{0.67}, 0.65, 0.66 | \mathbf{0.50}, \mathbf{0.50}, 0.15, 0.33
diabetes | 31.20, 27.90, 6.00, \mathbf{2.10} | 0.74, 0.73, \mathbf{0.76}, 0.65 | 0.08, 0.08, 0.09, \mathbf{0.15}
ficobin | 30.60, 59.60, \mathbf{6.40}, 11.80 | 0.71, 0.71, 0.71, \mathbf{0.72} | \mathbf{0.50}, \mathbf{0.50}, 0.22, \mathbf{0.50}
heart | 20.30, 13.60, 11.90, \mathbf{9.50} | 0.76, 0.79, \mathbf{0.82}, \mathbf{0.82} | 0.23, 0.31, 0.14, \mathbf{0.32}
ionosphere | 11.30, 8.60, 17.90, \mathbf{6.80} | 0.89, \mathbf{0.92}, 0.88, 0.86 | 0.15, 0.25, 0.07, \mathbf{0.28}
mammo | 27.40, 12.40, 7.20, \mathbf{1.90} | \mathbf{0.79}, \mathbf{0.79}, \mathbf{0.79}, 0.77 | 0.47, \mathbf{0.50}, 0.21, \mathbf{0.50}
mushroom | 10.80, \mathbf{9.10}, 23.80, 9.90 | \mathbf{1.00}, \mathbf{1.00}, \mathbf{1.00}, 0.97 | \mathbf{0.50}, \mathbf{0.50}, 0.10, \mathbf{0.50}
spambase | 153.90, 72.30, 29.50, \mathbf{5.60} | \mathbf{0.92}, 0.87, 0.88, 0.79 | 0.00, 0.04, 0.02, \mathbf{0.05}
Table 3: Comparison of BBM-RS with other interpretable models. In bold: the best algorithm for each dataset and criterion. Note that several datasets (adult, bank, careval, compasbin, ficobin, and mushroom) have ER = 0.5 for tree-based models (DT, RobDT, and BBM-RS), because these datasets have all binary features and tree-based models set the threshold in the middle of 0 and 1.

6.2 Performance of BBM-RS

Next, we want to understand how our proposed BBM-RS performs on real datasets. We compare the performance of BBM-RS with three different baselines on three evaluation criteria: interpretability, accuracy, and robustness.

Baselines. We compare BBM-RS with three baselines: (i) LCPA (Ustun & Rudin, 2019), an algorithm for learning risk scores, (ii) DT (Breiman et al., 1984), the standard algorithm for learning decision trees, and (iii) robust decision trees (RobDT) (Chen et al., 2019), an algorithm for learning robust decision trees.

We use 5-fold cross-validation based on accuracy for hyperparameter selection. For DT and RobDT, we search over 5, 10, \ldots, 30 for the maximum depth of the tree. For BBM-RS, we search over 5, 10, \ldots, 30 for the maximum number of weak learners (T). The algorithm stops when it reaches T iterations or when no weak learner can produce a weighted accuracy >0.51. For LCPA, we search over 5, 10, \ldots, 30 for the maximum \ell_{0} norm of the weight vector. We set the robust radius for RobDT and the noise level \tau for BBM-RS to 0.05. More details about the setup of the algorithms can be found in Appendix B.

6.2.1 Evaluation

We evaluate the interpretability, accuracy, and robustness of each baseline. The data is randomly split into training and testing sets in a 2:1 ratio. The experiment is repeated 10 times with different training and testing splits. The mean and standard error of each evaluation criterion are recorded.

Interpretability. We measure a model’s interpretability by evaluating its interpretation complexity (IC), which is the number of parameters in the model. For decision trees (DT and RobDT), the IC is the number of internal nodes in the tree, and for risk scores (LCPA and BBM-RS) it is the number of non-zero terms in the weight vector. The lower the IC, the more interpretable the model.
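As a sketch of how IC can be computed (assuming a scikit-learn tree for the decision-tree baselines; this is our illustration, not the evaluation code used in the paper):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def ic_tree(clf: DecisionTreeClassifier) -> int:
    # internal nodes = nodes that are not leaves (leaves have children_left == -1)
    return int(np.sum(clf.tree_.children_left != -1))

def ic_risk_score(weights) -> int:
    # non-zero terms in the risk score's weight vector
    return int(np.count_nonzero(weights))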

Robustness. We measure a model’s robustness by evaluating its empirical robustness (ER) (Yang et al., 2020a). The ER of a classifier f at an input x is ER(f,x):=\min_{f(x^{\prime})\neq f(x)}\|x^{\prime}-x\|_{\infty}. We evaluate ER on 100 randomly chosen correctly predicted examples in the test set. The larger the ER, the more robust the classifier.
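Computing ER exactly requires an attack tailored to the model (Yang et al., 2020a use such attacks); as a rough stand-in, the sketch below estimates it by binary-searching the smallest \ell_{\infty} radius at which randomly sampled corner perturbations flip the prediction. It is only a crude estimate of ER, included for illustration.

import numpy as np

def empirical_robustness(predict, x, r_max=1.0, n_samples=2000, tol=1e-3, seed=0):
    """Estimate ER(f, x) = min_{f(x') != f(x)} ||x' - x||_inf for a batch predictor."""
    rng = np.random.default_rng(seed)
    y0 = predict(x[None, :])[0]

    def flips_at(r):
        signs = rng.choice([-1.0, 1.0], size=(n_samples, x.shape[0]))
        return np.any(predict(x[None, :] + r * signs) != y0)  # sample ball corners

    lo, hi = 0.0, r_max
    if not flips_at(hi):
        return hi        # no flip found up to r_max
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if flips_at(mid):
            hi = mid
        else:
            lo = mid
    return hi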

6.2.2 Results

The results are shown in Table 3 (only the means are shown; the standard errors can be found in Appendix C). We see that BBM-RS performs well in terms of interpretability and robustness. BBM-RS performs the best on nine and eleven out of thirteen datasets in terms of interpretation complexity and robustness, respectively. In terms of accuracy, on nine out of the thirteen datasets, BBM-RS is the best or within 3\% of the best. These results show that on most datasets, BBM-RS is better than the other algorithms in IC and ER while being comparable in accuracy.

6.3 Tradeoffs in BBM-RS

The parameter \tau gives us the opportunity to explore the tradeoff between interpretability, robustness, and accuracy within BBM-RS. Figure 1 shows that for small \tau, BBM-RS’s IC is high and its ER is low, and when \tau is high, IC is low and ER is high. This empirical observation strengthens the claim that interpretability and robustness are correlated. See Appendix C for experiments on other datasets and on the tradeoff between IC and accuracy.

Figure 1: Interaction of interpretability, accuracy, and robustness for different noise levels \tau on the spambase dataset. The size of each ball represents the accuracy. For \tau=0: IC=22.5, ER=0.006; for higher noise \tau=0.25: IC=2.3, ER=0.33.

7 Conclusion

We found that linear separability is a hidden property of the data that guarantees both interpretability and robustness. We designed an efficient algorithm, BBM-RS, that returns a risk-score model which we prove is interpretable, robust, and highly accurate. An interesting open question is whether a weaker notion than linear separability can give similar guarantees.

Acknowledgements

Kamalika Chaudhuri and Yao-Yuan Yang thank NSF under CIF 1719133 and CNS 1804829 for research support.

References

  • Adebayo et al. (2018) Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., and Kim, B. Sanity checks for saliency maps. Advances in neural information processing systems, 31:9505–9515, 2018.
  • Alon & Spencer (2004) Alon, N. and Spencer, J. H. The probabilistic method. John Wiley & Sons, 2004.
  • Andriushchenko & Hein (2019) Andriushchenko, M. and Hein, M. Provably robust boosted decision stumps and trees against adversarial attacks. In Advances in Neural Information Processing Systems, pp. 13017–13028, 2019.
  • Ben-David et al. (2003) Ben-David, S., Eiron, N., and Long, P. M. On the difficulty of approximately maximizing agreements. Journal of Computer and System Sciences, 66(3):496–514, 2003.
  • Blanc et al. (2019) Blanc, G., Lange, J., and Tan, L.-Y. Top-down induction of decision trees: rigorous guarantees and inherent limitations. arXiv preprint arXiv:1911.07375, 2019.
  • Blanc et al. (2020) Blanc, G., Lange, J., and Tan, L.-Y. Provable guarantees for decision tree induction: the agnostic setting. arXiv preprint arXiv:2006.00743, 2020.
  • Boer et al. (2020) Boer, N., Deutch, D., Frost, N., and Milo, T. Personal insights for altering decisions of tree-based ensembles over time. Proceedings of the VLDB Endowment, 13(6):798–811, 2020.
  • Breiman et al. (1984) Breiman, L., Friedman, J., Stone, C. J., and Olshen, R. A. Classification and regression trees. CRC press, 1984.
  • Brutzkus et al. (2019) Brutzkus, A., Daniely, A., and Malach, E. On the optimality of trees generated by id3. arXiv preprint arXiv:1907.05444, 2019.
  • Brutzkus et al. (2020) Brutzkus, A., Daniely, A., and Malach, E. Id3 learns juntas for smoothed product distributions. In Conference on Learning Theory, pp.  902–915. PMLR, 2020.
  • Chang & Lin (2011) Chang, C.-C. and Lin, C.-J. Libsvm: a library for support vector machines. ACM transactions on intelligent systems and technology (TIST), 2(3):1–27, 2011.
  • Chen et al. (2019) Chen, H., Zhang, H., Boning, D., and Hsieh, C.-J. Robust decision trees against adversarial examples. arXiv preprint arXiv:1902.10660, 2019.
  • Cohen et al. (2019) Cohen, J. M., Rosenfeld, E., and Kolter, J. Z. Certified adversarial robustness via randomized smoothing. arXiv preprint arXiv:1902.02918, 2019.
  • Deutch & Frost (2019) Deutch, D. and Frost, N. Constraints-based explanations of classifications. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp.  530–541. IEEE, 2019.
  • Fiat & Pechyony (2004) Fiat, A. and Pechyony, D. Decision trees: More theoretical justification for practical algorithms. In International Conference on Algorithmic Learning Theory, pp.  156–170. Springer, 2004.
  • Freund (1995) Freund, Y. Boosting a weak learning algorithm by majority. Information and computation, 121(2):256–285, 1995.
  • Freund & Schapire (1997) Freund, Y. and Schapire, R. E. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • Frost et al. (2020) Frost, N., Moshkovitz, M., and Rashtchian, C. ExKMC: Expanding explainable k-means clustering. arXiv preprint arXiv:2006.02399, 2020.
  • Garreau & von Luxburg (2020a) Garreau, D. and von Luxburg, U. Explaining the explainer: A first theoretical analysis of lime. arXiv preprint arXiv:2001.03447, 2020a.
  • Garreau & von Luxburg (2020b) Garreau, D. and von Luxburg, U. Looking deeper into lime. arXiv preprint arXiv:2008.11092, 2020b.
  • Goodfellow et al. (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
  • Hu et al. (2019) Hu, X., Rudin, C., and Seltzer, M. Optimal sparse decision trees. In Advances in Neural Information Processing Systems, pp. 7267–7275, 2019.
  • Ignatiev et al. (2019) Ignatiev, A., Narodytska, N., and Marques-Silva, J. On relating explanations and adversarial examples. In Advances in Neural Information Processing Systems, pp. 15883–15893, 2019.
  • Kanade & Kalai (2009) Kanade, V. and Kalai, A. Potential-based agnostic boosting. Advances in neural information processing systems, 22:880–888, 2009.
  • Kantchelian et al. (2016) Kantchelian, A., Tygar, J. D., and Joseph, A. Evasion and hardening of tree ensemble classifiers. In International Conference on Machine Learning, pp. 2387–2396, 2016.
  • Kearns (1988) Kearns, M. Learning boolean formulae or finite automata is as hard as factoring. Technical Report TR-14-88, Harvard University Aiken Computation Laboratory, 1988.
  • Kearns & Mansour (1999) Kearns, M. and Mansour, Y. On the boosting ability of top–down decision tree learning algorithms. Journal of Computer and System Sciences, 58(1):109–128, 1999.
  • Kindermans et al. (2019) Kindermans, P.-J., Hooker, S., Adebayo, J., Alber, M., Schütt, K. T., Dähne, S., Erhan, D., and Kim, B. The (un)reliability of saliency methods. In Explainable AI: Interpreting, Explaining and Visualizing Deep Learning, pp.  267–280. Springer, 2019.
  • Koh & Liang (2017) Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. arXiv preprint arXiv:1703.04730, 2017.
  • Laber & Murtinho (2021) Laber, E. and Murtinho, L. On the price of explainability for some clustering problems. arXiv preprint arXiv:2101.01576, 2021.
  • Lakkaraju et al. (2020) Lakkaraju, H., Arsov, N., and Bastani, O. Robust and stable black box explanations. In International Conference on Machine Learning, pp. 5628–5638. PMLR, 2020.
  • Li et al. (2020) Li, X.-H., Shi, Y., Li, H., Bai, W., Song, Y., Cao, C. C., and Chen, L. Quantitative evaluations on saliency methods: An experimental study. arXiv preprint arXiv:2012.15616, 2020.
  • Lin et al. (2020) Lin, J., Zhong, C., Hu, D., Rudin, C., and Seltzer, M. Generalized and scalable optimal sparse decision trees. In International Conference on Machine Learning, pp. 6150–6160. PMLR, 2020.
  • Lundberg & Lee (2017) Lundberg, S. M. and Lee, S.-I. A unified approach to interpreting model predictions. In Advances in neural information processing systems, pp. 4765–4774, 2017.
  • Lundberg et al. (2018) Lundberg, S. M., Erion, G. G., and Lee, S.-I. Consistent individualized feature attribution for tree ensembles. arXiv preprint arXiv:1802.03888, 2018.
  • Madry et al. (2017) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
  • Mardaoui & Garreau (2020) Mardaoui, D. and Garreau, D. An analysis of lime for text data. arXiv preprint arXiv:2010.12487, 2020.
  • Molnar (2019) Molnar, C. Interpretable Machine Learning. 2019. https://christophm.github.io/interpretable-ml-book/.
  • Moro et al. (2014) Moro, S., Cortez, P., and Rita, P. A data-driven approach to predict the success of bank telemarketing. Decision Support Systems, 62:22–31, 2014.
  • Moshkovitz et al. (2020) Moshkovitz, M., Dasgupta, S., Rashtchian, C., and Frost, N. Explainable k-means and k-medians clustering. In International Conference on Machine Learning, pp. 7055–7065. PMLR, 2020.
  • Nacson et al. (2019) Nacson, M. S., Lee, J., Gunasekar, S., Savarese, P. H. P., Srebro, N., and Soudry, D. Convergence of gradient descent on separable data. In The 22nd International Conference on Artificial Intelligence and Statistics, pp.  3420–3428. PMLR, 2019.
  • Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Quinlan (1986) Quinlan, J. R. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
  • Ribeiro et al. (2016a) Ribeiro, M. T., Singh, S., and Guestrin, C. Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386, 2016a.
  • Ribeiro et al. (2016b) Ribeiro, M. T., Singh, S., and Guestrin, C. “Why should I trust you?”: Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, pp.  1135–1144, 2016b.
  • Ribeiro et al. (2018) Ribeiro, M. T., Singh, S., and Guestrin, C. Anchors: High-precision model-agnostic explanations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Ross & Doshi-Velez (2017) Ross, A. S. and Doshi-Velez, F. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. arXiv preprint arXiv:1711.09404, 2017.
  • Rudin (2019a) Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1:206–215, May 2019a.
  • Rudin (2019b) Rudin, C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206–215, 2019b.
  • Schapire & Freund (2013) Schapire, R. E. and Freund, Y. Boosting: Foundations and algorithms. Kybernetes, 2013.
  • Shalev-Shwartz & Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • Shalev-Shwartz & Singer (2008) Shalev-Shwartz, S. and Singer, Y. On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms. In 21st Annual Conference on Learning Theory, COLT 2008, 2008.
  • Shamir (2020) Shamir, O. Gradient methods never overfit on separable data. arXiv preprint arXiv:2007.00028, 2020.
  • Slack et al. (2020) Slack, D., Hilgard, S., Jia, E., Singh, S., and Lakkaraju, H. Fooling lime and shap: Adversarial attacks on post hoc explanation methods. In Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society, pp.  180–186, 2020.
  • Soudry et al. (2018) Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
  • Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  • Ustun & Rudin (2017) Ustun, B. and Rudin, C. Optimized risk scores. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining, pp.  1125–1134, 2017.
  • Ustun & Rudin (2019) Ustun, B. and Rudin, C. Learning optimized risk scores. Journal of Machine Learning Research, 20(150):1–75, 2019.
  • Wang et al. (2018) Wang, Y., Jha, S., and Chaudhuri, K. Analyzing the robustness of nearest neighbors to adversarial examples. In International Conference on Machine Learning, pp. 5133–5142. PMLR, 2018.
  • Yang et al. (2020a) Yang, Y.-Y., Rashtchian, C., Wang, Y., and Chaudhuri, K. Robustness for non-parametric classification: A generic attack and defense. In International Conference on Artificial Intelligence and Statistics, pp.  941–951. PMLR, 2020a.
  • Yang et al. (2020b) Yang, Y.-Y., Rashtchian, C., Zhang, H., Salakhutdinov, R., and Chaudhuri, K. Adversarial robustness through local lipschitzness. arXiv preprint arXiv:2003.02460, 2020b.

Appendix A Proofs

A.1 Separation and interpretability

Theorem 1 (restated).

Proof.

Fix an integer s and \gamma>0. We will start by describing the dataset and prove that it is linearly separable. We will then show that any decision tree of size s must have accuracy smaller than 1/2+\gamma.

Dataset.

The dataset size is n, to be fixed later. The dataset includes, for any i=1\dots n/4, a group, G_{i}, of four points: two points (i,-i+\epsilon),(i+\epsilon,-i) that are labeled positive and two points (i,-i-\epsilon),(i-\epsilon,-i) that are labeled negative, for some small \epsilon>0; see Figure 2.

Figure 2: Proof of Theorem 1. Linearly separable dataset that is not interpretable by a small decision tree. Around point (i,i)(i,-i) there are 44 close points: two points (i,i+ϵ),(i+ϵ,i)(i,-i+\epsilon),(i+\epsilon,-i) that are labeled positive and two points (i,iϵ),(iϵ,i)(i,-i-\epsilon),(i-\epsilon,-i) that are labeled negative for some small ϵ>0\epsilon>0.

To prove that the dataset is linearly separable, focus on the vector w=(0.5,0.5)w=(0.5,0.5) with |w|1=1|w|_{1}=1. For each labeled example (x,y)(x,y) in the dataset, the inner product equals ywx=ϵ/2>0yw\cdot x=\epsilon/2>0.
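This construction is easy to check numerically. Below is a minimal sketch (in Python with numpy; the number of groups and the value of eps are illustrative) that builds the groups G_i and verifies that w = (0.5, 0.5) attains margin eps/2 on every example.

import numpy as np

def build_dataset(n_groups=25, eps=0.01):
    # groups G_1, ..., G_{n/4} from the proof of Theorem 1
    X, y = [], []
    for i in range(1, n_groups + 1):
        X += [(i, -i + eps), (i + eps, -i)]   # two positive points near (i, -i)
        y += [+1, +1]
        X += [(i, -i - eps), (i - eps, -i)]   # two negative points near (i, -i)
        y += [-1, -1]
    return np.array(X, dtype=float), np.array(y)

X, y = build_dataset()
w = np.array([0.5, 0.5])                      # |w|_1 = 1
margins = y * (X @ w)                         # y * <w, x> for every example
assert np.allclose(margins, 0.01 / 2)         # every margin equals eps/2 > 0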

Accuracy.

We will prove by induction that the points arriving at each node form a series of consecutive groups, except perhaps a few points in the “tails”. In each node, the number of positive and negative examples is about the same, so one cannot predict well. See the intuition in Figure 3.

Figure 3: Projection of the points in the dataset onto the first feature. The groups appear one after another, so each cut defined by an inner node of the tree leaves the order between the groups as is. In a series of consecutive groups, the number of positive and negative examples is equal; thus, the accuracy is close to 1/21/2. A similar figure holds when projecting onto the second feature.

More formally, a tail GiG^{\prime}_{i} is a subset of GiG_{i}, i.e., Gi(Gi)G^{\prime}_{i}\in\wp(G_{i}), where \wp is the power set. We prove the following claim.

Claim. For each inner node vv, there are integers j0j1j_{0}\leq j_{1} such that the points arriving to this node, PvP_{v}, satisfy

Pv=i=j0j1GiGj01Gj1+1,P_{v}=\cup_{i=j_{0}}^{j_{1}}G_{i}\cup G^{\prime}_{j_{0}-1}\cup G^{\prime}_{j_{1}+1},

where Gj01(Gj01)G^{\prime}_{j_{0}-1}\in\wp(G_{j_{0}-1}) and Gj1+1(Gj1+1).G^{\prime}_{j_{1}+1}\in\wp(G_{j_{1}+1}). In other words, PvP_{v} contains a series of consecutive groups and a few points from the groups before and after the series.

Proof.

We will prove the claim by induction on the level of the tree. The claim holds for the root with j0=1j_{0}=1 and j1=n/4j_{1}=n/4. For the induction step, suppose an inner node uses a threshold θ\theta and the first feature, and let jj^{*} be the integer satisfying jθ<j+1j^{*}\leq\theta<j^{*}+1. The points reaching the left son are of the form

\cup_{i=j_{0}}^{j^{*}-1}G_{i}\cup G_{j_{0}-1}^{\prime}\cup G_{j^{*}}^{\prime},

and the points reaching the right son are of the form

\cup_{i=j^{*}+1}^{j_{1}}G_{i}\cup G_{j^{*}}^{\prime}\cup G_{j_{1}+1}^{\prime},

where the two tails G_{j^{*}}^{\prime} denote the parts of GjG_{j^{*}} that fall on each side of the threshold (if θ\theta falls within ϵ\epsilon of j+1j^{*}+1, the same decomposition holds with j+1j^{*}+1 in place of jj^{*}). A similar argument also holds if vv uses the second feature, which proves our claim. ∎

To finish the proof of Theorem 1, first note that the number of positive and negative examples in each GiG_{i} is exactly equal. Together with the claim that we just proved, for each leaf vv, the number of points that are correctly classified out of the points, PvP_{v}, reaching vv, is at most |Pv|/2+4.|P_{v}|/2+4. Thus the total number of points correctly classified by the entire tree is at most n/2+4s.n/2+4s. Thus, the accuracy of the tree is at most 1/2+4s/n.1/2+4s/n. We take n>4s/γn>4s/\gamma and get that the accuracy is smaller than 1/2+γ,1/2+\gamma, which is what we wanted to prove.

See Theorem 2.

Proof.

Each feature is in [1,1][-1,1], so r1.r\leq 1. Fix rΔ<2rr\leq\Delta<2r and L=1Δ.L=\lceil\frac{1}{\Delta}\rceil. We can bound LL by

L1Δ+11r+12r.L\leq\frac{1}{\Delta}+1\leq\frac{1}{r}+1\leq\frac{2}{r}.

Take two data points, one labeled positive, x+x^{+}, and one negative xx^{-}. By the rr-separation, we know that there is a feature ii^{\prime} such that

|xi+xi|2r>Δ.|x^{+}_{i^{\prime}}-x^{-}_{i^{\prime}}|\geq 2r>\Delta.

This means that we can find a threshold θ\theta among the 2L+12L+1 thresholds LΔ,,0Δ,1Δ,,LΔ-L\cdot\Delta,\ldots,0\cdot\Delta,1\cdot\Delta,\ldots,L\cdot\Delta that distinguishes the examples x+x^{+} and xx^{-}, i.e., there is j{L,,L}j^{\prime}\in\{-L,\ldots,L\}, such that

sign(xi+Δj)sign(xiΔj).sign(x^{+}_{i^{\prime}}-\Delta\cdot{j^{\prime}})\neq sign(x^{-}_{i^{\prime}}-\Delta\cdot{j^{\prime}}).

We focus on the decision tree that uses all possible features and the 2L+12L+1 thresholds. All examples reaching the same leaf have the same label; in other words, the training error is 0.0. Since there are d(2L+1)d\cdot(2L+1) pairs of features and thresholds, the depth of the tree is at most d(2L+1)3dL6drd(2L+1)\leq 3dL\leq\frac{6d}{r}.
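The counting argument above can be illustrated with a short sketch (Python with numpy; Delta and the threshold grid follow the proof, while the example points are illustrative): for any pair of points that differ by at least 2r in some coordinate, one of the d(2L+1) candidate (feature, threshold) splits separates them.

import numpy as np

def candidate_splits(d, r):
    # thresholds -L*Delta, ..., L*Delta from the proof, with Delta = r and L = ceil(1/Delta)
    delta = r
    L = int(np.ceil(1.0 / delta))
    thresholds = delta * np.arange(-L, L + 1)       # 2L + 1 thresholds
    return [(i, t) for i in range(d) for t in thresholds]

def separating_split(x_pos, x_neg, r):
    # return one (feature, threshold) pair that distinguishes an r-separated pair
    for i, t in candidate_splits(len(x_pos), r):
        if np.sign(x_pos[i] - t) != np.sign(x_neg[i] - t):
            return i, t
    return None   # unreachable when ||x_pos - x_neg||_inf >= 2r

x_pos, x_neg, r = np.array([0.8, -0.3]), np.array([0.1, -0.2]), 0.3
print(separating_split(x_pos, x_neg, r))            # e.g. (0, 0.3)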

See Theorem 3.

Proof.

We start by describing the dataset; then we show that any decision tree must be large if we want to achieve 0.5+γ0.5+\gamma accuracy.

The dataset.

The inputs are all strings in {1,1}d.\{-1,1\}^{d}. The hypothesis is the parity function, i.e.,

f(x1,,xd)={1if i=1dxi+121(mod2)1o.w. f(x_{1},\ldots,x_{d})=\begin{cases}1&\mbox{if }\sum_{i=1}^{d}\frac{x_{i}+1}{2}\equiv 1\pmod{2}\\ -1&\mbox{o.w. }\end{cases}
Each node has an equal number of positive and negative examples.

The main idea is that as long as the depth of a node is smaller than dd, exactly the same number of positive and negative examples reach it. This is true since, as long as at most d1d-1 features are fixed, there are exactly the same numbers of positive and negative examples that agree on these features.
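The balance property is easy to verify directly; the following minimal sketch (Python with numpy; the choice of d is illustrative) enumerates {-1,1}^d, computes the parity labels, and checks that fixing any d-1 coordinates leaves exactly one positive and one negative example.

import itertools
import numpy as np

d = 6
X = np.array(list(itertools.product([-1, 1], repeat=d)))
# parity label: +1 iff the number of +1 coordinates is odd
y = np.where(((X + 1) // 2).sum(axis=1) % 2 == 1, 1, -1)

# fixing any d-1 coordinates leaves two examples with opposite labels
for free in range(d):
    fixed = [j for j in range(d) if j != free]
    for vals in itertools.product([-1, 1], repeat=d - 1):
        mask = np.all(X[:, fixed] == np.array(vals), axis=1)
        assert y[mask].sum() == 0        # one positive and one negative example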

Large size.

Denote by N1N_{1} the set of all nodes that exactly one example reaches, and denote its size by |N1|=n1.|N_{1}|=n_{1}. We want to prove that n1n_{1} is large. So far, we proved that for each node that contains more than one example, exactly half of the examples are labeled positive and half negative. The number of examples correctly classified is therefore at most half of the examples not reaching a node of N1N_{1}, plus n1n_{1}. There are 2d2^{d} examples in total, so the accuracy is at most (2dn1)/2+n12d=12+n12d+1\frac{(2^{d}-n_{1})/2+n_{1}}{2^{d}}=\frac{1}{2}+\frac{n_{1}}{2^{d+1}}. The latter should be at least 1/2+γ\nicefrac{{1}}{{2}}+\gamma. Therefore, the size of the tree is at least γ2d.\gamma 2^{d}.

A.2 Linear separability: weak learner

We start with the more restricted case where the features are binary. This will give the necessary foundations for the general case, where features are in [1,1][-1,1]. In this case t{\mathcal{H}}_{t} is simplified to the set {xxi:i[d]}.\{x\mapsto x_{i}:i\in[d]\}.

Theorem 8.

For any data in {1,1}d×{1,1}\{-1,1\}^{d}\times\{-1,1\} that is labeled by a γ\gamma-linearly separable hypothesis ff and for any distribution μ\mu on the examples, there is a hypothesis hth\in{\mathcal{H}}_{t} such that

Prxμ(h(x)=f(x))12+γ2.\Pr_{x\sim\mu}(h(x)=f(x))\geq\frac{1}{2}+\frac{\gamma}{2}.
Proof.

The proof’s high-level idea is to represent the class as a bipartite graph and to lower bound the weighted density of this graph. A large lower bound implies the existence of a high-degree vertex, which corresponds to our desired weak learner.

Recall that, since the data is γ\gamma-linearly separable, there is a vector ww with |w|1=1|w|_{1}=1 such that for each labeled example (x,y)(x,y) it holds that

ywxγ.yw\cdot x\geq\gamma. (1)
Bipartite graph description.

Consider the following bipartite graph. The vertices are the mm examples in the training data and the dd hypotheses in the binary version of t{\mathcal{H}}_{t}. There is an edge (x,hi)E(x,h_{i})\in E between an example xx and a hypothesis hih_{i} if hih_{i} correctly classifies xx, i.e., if f(x)=hi(x)f(x)=h_{i}(x). Each vertex is given a weight: example xjx^{j} gets weight μj\mu_{j} and hypothesis hih_{i} gets weight wiw_{i}, where ww is as in Equation (1).

Assumption: i=1dwi=1\sum_{i=1}^{d}w_{i}=1.

We assume without loss of generality that wi0w_{i}\geq 0; otherwise, if some wi<0w_{i}<0, we can multiply the ii-th feature in all the examples by 1-1. After this multiplication, the linear-separability assumption still holds. We know that i=1d|wi|=1\sum_{i=1}^{d}|w_{i}|=1, and since wi0w_{i}\geq 0, we get that i=1dwi=1\sum_{i=1}^{d}w_{i}=1.

Lower bound weighted-density.

The main idea is to lower bound the weighted density ρ\rho of the bipartite graph, which is the sum over all edges (xj,hi)(x^{j},h_{i}) in the graph, each contributing weight wiμjw_{i}\mu_{j}:

ρ=(xj,hi)Ewiμj.\rho=\sum_{(x^{j},h_{i})\in E}w_{i}\mu_{j}.

To prove a lower bound on ρ\rho, we focus on one labeled example (xj,yj).(x^{j},y^{j}). From the linearity assumption, Equation (1), we know that i=1dyjwixijγ.\sum_{i=1}^{d}y^{j}w_{i}x^{j}_{i}\geq\gamma. Recall that yjxijy^{j}x^{j}_{i} is equal to +1+1 or 1-1, since we are in the binary case. We can separate the sum depending on whether yjxijy^{j}x^{j}_{i} is equal to +1+1 or 1-1 and get that

i:yjxij=1wii:yjxij=1wiγ.\sum_{i:y^{j}x^{j}_{i}=1}w_{i}-\sum_{i:y^{j}x^{j}_{i}=-1}w_{i}\geq\gamma.

We know that i=1dwi=1\sum_{i=1}^{d}w_{i}=1, thus i:yjxij=1wi=1i:yjxij=1wi\sum_{i:y^{j}x^{j}_{i}=-1}w_{i}=1-\sum_{i:y^{j}x^{j}_{i}=1}w_{i}, and the inequality can be rewritten as

2i:yjxij=1wi1γ.2\sum_{i:y^{j}x^{j}_{i}=1}w_{i}-1\geq\gamma.

Notice that (xj,hi)Eyjxij=1(x^{j},h_{i})\in E\Leftrightarrow y^{j}x^{j}_{i}=1. Thus, the inequality can be further rewritten as

i:(xj,hi)Ewi1+γ2.\sum_{i:(x^{j},h_{i})\in E}w_{i}\geq\frac{1+\gamma}{2}.

This inequality holds for any labeled example, so we can sum all these inequalities, each with weight μj\mu_{j} and get that

ρ(1+γ)/2.\rho\geq(1+\gamma)/2.
Finding a weak learner.

Since ww is a probability distribution, we can rewrite ρ\rho and get

𝔼iw[j:(xj,hi)Eμj](1+γ)/2.{\mathbb{E}}_{i\sim w}\left[\sum_{j:(x^{j},h_{i})\in E}\mu_{j}\right]\geq(1+\gamma)/2.

From the probabilistic method (Alon & Spencer, 2004), there is a hypothesis hih_{i} such that

j:(xj,hi)Eμj(1+γ)/2.\sum_{j:(x^{j},h_{i})\in E}\mu_{j}\geq(1+\gamma)/2.

By the definition of the graph, (xj,hi)Ehi(xj)=f(xj)(x^{j},h_{i})\in E\Leftrightarrow h_{i}(x^{j})=f(x^{j}). Thus, we get that

Prxμ(hi(x)=f(x))12+γ2,\Pr_{x\sim\mu}(h_{i}(x)=f(x))\geq\frac{1}{2}+\frac{\gamma}{2},

which is exactly what we wanted to prove.
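The guarantee of Theorem 8 can also be checked empirically. A minimal sketch (Python with numpy; the synthetic data and variable names are illustrative): the data is labeled by a linear separator with non-negative weights summing to 1, and the best single-feature hypothesis x -> x_i attains weighted accuracy at least 1/2 + gamma/2.

import numpy as np

rng = np.random.default_rng(0)
d, m = 10, 2000
w = rng.random(d)
w /= w.sum()                                   # w >= 0 with |w|_1 = 1
X = rng.choice([-1, 1], size=(m, d))
y = np.sign(X @ w)                             # gamma-linearly separable labels
gamma = np.min(y * (X @ w))                    # the (empirical) margin
mu = np.full(m, 1.0 / m)                       # any distribution over the examples

# mu-weighted accuracy of each single-feature hypothesis h_i(x) = x_i
acc = np.array([mu[X[:, i] == y].sum() for i in range(d)])
assert acc.max() >= 0.5 + gamma / 2            # guarantee of Theorem 8
print(f"margin={gamma:.3f}, best single-feature accuracy={acc.max():.3f}")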

Theorem 9 (binary weak-learner).

For any distribution μ\mu over labeled examples {1,+1}d×{1,+1}\{-1,+1\}^{d}\times\{-1,+1\} that satisfies linear separability with a γ\gamma-margin, and for any δ(0,1)\delta\in(0,1) there is m=O(d+log1δγ2)m=O\left(\frac{d+\log\frac{1}{\delta}}{\gamma^{2}}\right), such that with probability at least 1δ1-\delta over the sample SS of size mm, it holds that

Pr(x,y)μ(hS(x)=y)12+γ4.\Pr_{(x,y)\sim\mu}(h_{S}(x)=y)\geq\frac{1}{2}+\frac{\gamma}{4}.
Proof.

We start with a known fact: to bound VC(t)VC({\mathcal{H}}_{t}), note that among any d+1d+1 points in d{\mathbb{R}}^{d} there is at least one point vv that is not the largest in any coordinate among the d+1d+1 points; thus, it is impossible that vv is labeled +1+1 while all the remaining points are labeled 1-1. Therefore,

VC(t)d.VC({\mathcal{H}}_{t})\leq d.

Denote the best hypothesis in the binary version of t{\mathcal{H}}_{t} by hh^{*}. From Theorem 8, we know that Prx(h(x)=f(x))12+γ2.\Pr_{x}(h^{*}(x)=f(x))\geq\frac{1}{2}+\frac{\gamma}{2}. For every sample SS, denote by hSh_{S} the hypothesis in the binary version of t{\mathcal{H}}_{t} that maximizes the accuracy on the sample S.S. From the fundamental theorem of statistical learning (Shalev-Shwartz & Ben-David, 2014), we know that for m=O(d+log1δ(γ/4)2)m=O\left(\frac{d+\log\frac{1}{\delta}}{(\gamma/4)^{2}}\right), with probability at least 1δ1-\delta over the sample SS of size mm, it holds that

Pr(x,y)μ(hS(x)=y)Pr(x,y)μ(h(x)=y)γ412+γ4.\Pr_{(x,y)\sim\mu}(h_{S}(x)=y)\geq\Pr_{(x,y)\sim\mu}(h^{*}(x)=y)-\frac{\gamma}{4}\geq\frac{1}{2}+\frac{\gamma}{4}.

See Theorem 4.

Proof.

The proof is similar in spirit to the proof of Theorem 8. The main difference is the technique used to lower bound ρ\rho, the weighted density. In the current proof we use the fact that if a positive example xx has a high ii-th feature xix_{i}, then its vertex is incident to many edges in the graph; similarly, if a negative example xx has a low ii-th feature xix_{i}, then its vertex is incident to many edges in the graph.

Bipartite graph description.

We first discretize the segment [1,1][-1,1] to 2/α+1\ell\geq 2/\alpha+1 values with Δ=2/(1)\Delta=2/(\ell-1):

Z={1,1+Δ,,1Δ,1}.Z=\{-1,-1+\Delta,\ldots,1-\Delta,1\}.

The value of Δ\Delta was chosen such that |Z|=.|Z|=\ell. We focus on the following subclass t{\mathcal{H}}^{\prime}_{t}, which is a discretization of t{\mathcal{H}}_{t}:

t={hi,θ} for i[d],θZ.{\mathcal{H}}^{\prime}_{t}=\{h_{i,\theta}\}\quad\text{ for }i\in[d],\theta\in Z.

We use a bipartite graph similar to the one in the proof of Theorem 8 for the subclass t{\mathcal{H}}^{\prime}_{t}: the vertices are the mm examples and the hypotheses in t{\mathcal{H}}^{\prime}_{t}; and there is an edge (x,hi,θ)E(x,h_{i,\theta})\in E between an example xx with label yy and a hypothesis hi,θh_{i,\theta} whenever yxiθyx_{i}\geq\theta, i.e., when hi,θh_{i,\theta} correctly classifies xx.

Recall that, since the data is γ\gamma-linearly separable, there is a vector ww with |w|1=1|w|_{1}=1 such that for any labeled example (x,y)(x,y) it holds that

ywxγ.yw\cdot x\geq\gamma. (2)

From the same argument as in Theorem 8, we assume that i=1dwi=1.\sum_{i=1}^{d}w_{i}=1.

We are now ready to give each vertex in the bipartite graph a weight: example xjx^{j} gets weight μj\mu_{j} and a hypothesis hi,θh_{i,\theta} gets weight wi,θ=wi/w_{i,\theta}=w_{i}/\ell, where ww is as in Equation (2). The weights on the hypotheses were chosen such that they will sum up to 11:

i,θwi,θ=i,θwi=iwi=1.\sum_{i,\theta}w_{i,\theta}=\sum_{i,\theta}\frac{w_{i}}{\ell}=\sum_{i}w_{i}=1. (3)
Lower bound weighted-density.

We will lower bound the weighted density ρ\rho of the bipartite graph, which is the sum over all edges (xj,hi,θ)(x^{j},h_{i,\theta}) in the graph, each with weight wi,θμjw_{i,\theta}\mu_{j}:

ρ=(xj,hi,θ)Ewi,θμj.\rho=\sum_{(x^{j},h_{i,\theta})\in E}w_{i,\theta}\mu_{j}.

To show a lower bound on ρ\rho, we focus on one labeled example (x,y)(x,y) and one feature ii and we will lower bound the following sum

θ:(x,hi,θ)Ewi,θ=wiθ:(x,hi,θ)E1.\sum_{\theta:(x,h_{i,\theta})\in E}w_{i,\theta}=\frac{w_{i}}{\ell}\sum_{\theta:(x,h_{i,\theta})\in E}1. (4)

The sum on the RHS is equal to the number of θ\theta’s such that yxiθyx_{i}\geq\theta:

dx,i=|{θZ:yxiθ}|.d_{x,i}=|\{\theta\in Z:yx_{i}\geq\theta\}|. (5)

To lower bound dx,id_{x,i} we separate the analysis depending on the label yy. If xx is a positive example, i.e., y=+1y=+1, we have that

dx,i(1+xiΔ)/Δd_{x,i}\geq(1+x_{i}-\Delta)/\Delta

For a negative example xx, we can lower bound dx,id_{x,i} by

dx,i(1xiΔ)/Δd_{x,i}\geq(1-x_{i}-\Delta)/\Delta

To summarize these two equations we get that

dx,i(1+yxiΔ)/Δd_{x,i}\geq(1+yx_{i}-\Delta)/\Delta

Going back to Equation (4),

θ:(x,hi,θ)Ewi,θwi1+yxiΔΔ=wi(1+yxiΔ)12.\sum_{\theta:(x,h_{i,\theta})\in E}w_{i,\theta}\geq\frac{w_{i}}{\ell}\cdot\frac{1+yx_{i}-\Delta}{\Delta}=w_{i}(1+yx_{i}-\Delta)\cdot\frac{\ell-1}{2\ell}.

Summing over all features

i,θ:(x,hi,θ)Ewi,θ\displaystyle\sum_{i,\theta:(x,h_{i,\theta})\in E}w_{i,\theta} =\displaystyle= 12(iwi(1Δ)+iyxiwi)\displaystyle\frac{\ell-1}{2\ell}\left(\sum_{i}w_{i}(1-\Delta)+\sum_{i}yx_{i}w_{i}\right) (6)
\displaystyle\geq 12(1+γΔ)=(12+γ2Δ2)(11)\displaystyle\frac{\ell-1}{2\ell}(1+\gamma-\Delta)=\left(\frac{1}{2}+\frac{\gamma}{2}-\frac{\Delta}{2}\right)\left(1-\frac{1}{\ell}\right) (7)

Weighted sum over all examples yields the desired lower bound on ρ\rho

ρ(12+γ2Δ2)(11)12+γ2α.\rho\geq\left(\frac{1}{2}+\frac{\gamma}{2}-\frac{\Delta}{2}\right)\left(1-\frac{1}{\ell}\right)\geq\frac{1}{2}+\frac{\gamma}{2}-\alpha.
Finding a weak learner.

The proof now continues similarly to the proof of Theorem 8. Since wi,θw_{i,\theta} is a probability distribution over all hypotheses in t{\mathcal{H}}^{\prime}_{t} (see Equation 3), we can rewrite ρ\rho and get

𝔼hi,θw[j:(xj,hi,θ)Eμj]12+γ2α.{\mathbb{E}}_{h_{i,\theta}\sim w}\left[\sum_{j:(x^{j},h_{i,\theta})\in E}\mu_{j}\right]\geq\frac{1}{2}+\frac{\gamma}{2}-\alpha.

From the probabilistic method (Alon & Spencer, 2004), there is a hypothesis hih_{i} such that

j:(xj,hi)Eμj12+γ2α.\sum_{j:(x^{j},h_{i})\in E}\mu_{j}\geq\frac{1}{2}+\frac{\gamma}{2}-\alpha.

By the definition of the graph, (xj,hi)Ehi(xj)=f(xj)(x^{j},h_{i})\in E\Leftrightarrow h_{i}(x^{j})=f(x^{j}). Thus, we get that

Prxμ(hi(x)=f(x))12+γ2α,\Pr_{x\sim\mu}(h_{i}(x)=f(x))\geq\frac{1}{2}+\frac{\gamma}{2}-\alpha,

which is exactly what we wanted to prove.
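As in the binary case, the guarantee can be checked numerically. A minimal sketch (Python with numpy; the synthetic data and variable names are illustrative) builds the grid Z, scores every hypothesis h_{i,theta} with the edge rule y*x_i >= theta used above, and verifies that the best one has weight at least 1/2 + gamma/2 - alpha.

import numpy as np

rng = np.random.default_rng(1)
d, m, alpha = 8, 1000, 0.05
w = rng.random(d)
w /= w.sum()                                    # w >= 0 with |w|_1 = 1
X = rng.uniform(-1, 1, size=(m, d))
y = np.sign(X @ w)
gamma = np.min(y * (X @ w))                     # the (empirical) margin
mu = np.full(m, 1.0 / m)

ell = int(np.ceil(2 / alpha)) + 1               # ell >= 2/alpha + 1 grid values
Z = np.linspace(-1, 1, ell)                     # the grid Z with spacing Delta = 2/(ell - 1)

# mu-weight of the examples with y * x_i >= theta (the proof's notion of h_{i,theta} being correct)
best = max(mu[(y * X[:, i]) >= theta].sum() for i in range(d) for theta in Z)
assert best >= 0.5 + gamma / 2 - alpha          # guarantee proved above
print(f"margin={gamma:.3f}, best discretized hypothesis weight={best:.3f}")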

See Theorem 5.

Proof.

The proof is similar to the proof of Theorem 9. ∎

A.3 Linear separability: risk scores

See Theorem 6.

Proof.

Fix a risk score model ff which is defined by a series of mm conditions “xi1θ1x_{i_{1}}\geq\theta_{1}”, …, “ximθmx_{i_{m}}\geq\theta_{m}” and weights w0,w1,,wmw_{0},w_{1},\ldots,w_{m}. The score, s(z)s(z), of an example zz is the weighted-sum of all satisfied conditions,

s(z)=w_{0}+\sum_{j=1}^{m}w_{j}I_{z_{i_{j}}\geq\theta_{j}},

where IA=1I_{A}=1 if AA is true, and IA=0I_{A}=0 otherwise. The prediction of the model ff is equal to

f(z)=sign(s(z)).f(z)=sign(s(z)).

Fix examples x,ydx,y\in\mathbb{R}^{d} with xy.x\leq y. Our goal is to show that f(x)f(y).f(x)\leq f(y). The key observation is that any condition xijθjx_{i_{j}}\geq\theta_{j} satisfied by xx is also satisfied by yy, because yijxijθj,y_{i_{j}}\geq x_{i_{j}}\geq\theta_{j}, by our assumption that xy.x\leq y. In other words, we have that

I_{y_{i_{j}}\geq\theta_{j}}\geq I_{x_{i_{j}}\geq\theta_{j}}. (8)

This implies that the score of yy is at least the score of xx, since it holds that

s(y)=w_{0}+\sum_{j=1}^{m}w_{j}I_{y_{i_{j}}\geq\theta_{j}}\geq w_{0}+\sum_{j=1}^{m}w_{j}I_{x_{i_{j}}\geq\theta_{j}}=s(x).

Thus, we get exactly what we wanted to prove: f(y)=sign(s(y))sign(s(x))=f(x).f(y)=sign(s(y))\geq sign(s(x))=f(x).

As an aside, at this point it should become apparent why we restricted our conditions to be of the form “xijθjx_{i_{j}}\geq\theta_{j}” and did not allow natural conditions of the form “xijθjx_{i_{j}}\leq\theta_{j}”: such conditions would not be monotone, and Inequality (8) would not hold. ∎
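The monotonicity argument translates directly into code. A minimal sketch (Python with numpy; the conditions, weights, and example points are illustrative, and the weights w_1, ..., w_m are taken non-negative, as needed for the score comparison above):

import numpy as np

# illustrative risk score: conditions "x_{i_j} >= theta_j" with non-negative weights
features   = np.array([0, 2, 3])                # i_1, ..., i_m
thresholds = np.array([0.5, 0.1, 0.7])          # theta_1, ..., theta_m
weights    = np.array([2.0, 1.0, 3.0])          # w_1, ..., w_m (taken non-negative)
w0 = -3.0                                       # intercept w_0

def score(z):
    return w0 + np.sum(weights * (z[features] >= thresholds))

def predict(z):
    return np.sign(score(z))

x  = np.array([0.6, 0.0, 0.0, 0.8])
y_ = np.array([0.7, 0.2, 0.3, 0.9])             # y_ >= x coordinate-wise
assert score(y_) >= score(x)                    # monotone scores ...
assert predict(y_) >= predict(x)                # ... hence monotone predictions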

See Theorem 7.

Proof.

The high-level idea is to focus on a noisier distribution than the original one. The noise is small enough that the new data remains linearly separable with a (slightly worse) margin, so the learning algorithm can still be applied. We call the learning algorithm with a sample from the noisy distribution and return its result. We then prove that if a point is correctly classified on the noisy data, then the corresponding noiseless point is robust to adversarial perturbations.

Noisy data.

Fix a data D𝒳×{1,+1}D\subseteq{\mathcal{X}}\times\{-1,+1\} and a distribution μ\mu on DD. We will create a noisy distribution μ\mu^{\prime} on the labeled examples by mapping each example (x,y)D(x,y)\in D to a new labeled example (x,y)(x^{\prime},y) where

x=xyγ2𝟏,x^{\prime}=x-y\frac{\gamma}{2}\mathbf{1},

where 𝟏\mathbf{1} is the all-ones vector. Thus, if xx is a positively labeled example (y=+1y=+1) then x=xγ2𝟏x^{\prime}=x-\frac{\gamma}{2}\mathbf{1}. This means that we decrease each coordinate by γ/2\gamma/2; intuitively, we make xx look more negative. Similarly, if xx is a negatively labeled example (y=1y=-1) we create a new example x=x+γ2𝟏x^{\prime}=x+\frac{\gamma}{2}\mathbf{1}; intuitively, we make xx look more positive by adding γ/2\gamma/2 to each coordinate.

If we have a sample from μ\mu, then we can easily sample from μ\mu^{\prime} by subtracting yγ2𝟏y\frac{\gamma}{2}\mathbf{1} from each labeled example (x,y)(x,y) in the sample.
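Creating the noisy sample is a one-line transformation. A minimal sketch (Python with numpy; the sample values are illustrative):

import numpy as np

def add_margin_noise(X, y, gamma):
    # map each labeled example (x, y) to (x - y * (gamma / 2) * 1, y)
    return X - (gamma / 2) * y[:, None], y

X = np.array([[0.4, 0.9], [-0.3, 0.1]])
y = np.array([+1, -1])
X_noisy, _ = add_margin_noise(X, y, gamma=0.2)
# positives are shifted down by gamma/2, negatives are shifted up by gamma/2:
# X_noisy = [[0.3, 0.8], [-0.2, 0.2]]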

Noisy data is γ/2\gamma/2-linearly separable.

Suppose that the original data DD is γ\gamma-linearly separable. This means that there is a vector ww with |w|1=1|w|_{1}=1 such that for every labeled example (x,y)D(x,y)\in D it holds that ywxγ.yw\cdot x\geq\gamma. We know that the corresponding example in the noisy data is equal to x=xyγ2𝟏x^{\prime}=x-y\frac{\gamma}{2}\mathbf{1}. We will prove a lower bound on ywxyw\cdot x^{\prime}, which shows that the noisy data is also linearly separable with a margin. We will use the fact that y2=1y^{2}=1 for any label yy and get that

ywx=yw(xyγ2𝟏)=ywxy2wγ2𝟏=ywxwγ2𝟏yw\cdot x^{\prime}=yw\left(x-y\frac{\gamma}{2}\mathbf{1}\right)=ywx-y^{2}w\cdot\frac{\gamma}{2}\mathbf{1}=ywx-w\cdot\frac{\gamma}{2}\mathbf{1}

Recall that |w|1=1|w|_{1}=1, which means that w𝟏=iwii|wi|=|w|1=1.w\cdot\mathbf{1}=\sum_{i}w_{i}\leq\sum_{i}|w_{i}|=|w|_{1}=1. This means that wγ2𝟏γ/2.-w\cdot\frac{\gamma}{2}\mathbf{1}\geq-\gamma/2. Together with the fact that ywxγyw\cdot x\geq\gamma, we can now give the desired lower bounds on ywxyw\cdot x^{\prime}

ywxywxγ/2γγ/2=γ/2.yw\cdot x^{\prime}\geq ywx-\gamma/2\geq\gamma-\gamma/2=\gamma/2.

In other words, the new data is γ/2\gamma/2-linearly separable.

Model is robust.

In order to construct a robust model, we take our sample SS from μ\mu. Then we transform it into a sample from μ\mu^{\prime} by subtracting yγ2𝟏y\frac{\gamma}{2}\mathbf{1} from each labeled example (x,y)(x,y), and then we call algorithm AA with the noisy training data. From our assumption, the resulting model has accuracy 1ϵ(γ2)1-\epsilon(\frac{\gamma}{2}). We will show that the returned model has astuteness 1ϵ(γ2)1-\epsilon(\frac{\gamma}{2}) at radius γ/2\gamma/2 with respect to μ.\mu. We do this by proving that if (x,y)(x^{\prime},y) is correctly classified then the model is robust at xx with radius γ/2.\gamma/2.

Fix a noisy labeled example (x,y)(x^{\prime},y) and an example zz that is γ/2\gamma/2-close to xx in \ell_{\infty}, i.e., xzγ/2\|x-z\|_{\infty}\leq\gamma/2. If xx is positively labeled then xx^{\prime} is also positively labeled, and from the construction of the noisy dataset it holds that

xz.x^{\prime}\leq z.

Hence, from the monotone property of the model, if xx^{\prime} is labeled correctly then so is zz. A similar argument holds if xx is negatively labeled.

Appendix B Additional Experiment Details

B.1 Setups

The experiments are performed on an Intel Core i9-9940X machine with 128GB of RAM. The code for the experiments is available at https://github.com/yangarbiter/interpretable-robust-trees.

Additional dataset details.

The adult, bank, breastcancer, mammo, mushroom, and spambase datasets are retrieved from a publicly available repository (https://github.com/ustunb/risk-slim/tree/master/examples/data); these datasets are used by Ustun & Rudin (2019). The careval, ficobin, and compasbin datasets are retrieved from another publicly available source (https://github.com/Jimmy-Lin/GeneralizedOptimalSparseDecisionTrees/tree/master/experiments/datasets) and are used by Lin et al. (2020). We also added the diabetes, heart, and ionosphere datasets from https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/ and the bank2 dataset from Moro et al. (2014). All features are scaled to [0,1][0,1] by the formula (xmin)/(maxmin)(x-min)/(max-min), where xx is the feature value, and minmin and maxmax are the minimum and maximum values of the given feature across the entire data.
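This scaling is the standard min-max normalization (equivalently, scikit-learn's MinMaxScaler); a minimal sketch in Python with numpy:

import numpy as np

def min_max_scale(X):
    # scale every feature to [0, 1] via (x - min) / (max - min), computed per column
    mins, maxs = X.min(axis=0), X.max(axis=0)
    return (X - mins) / (maxs - mins)

X = np.array([[1.0, 10.0], [3.0, 20.0], [2.0, 30.0]])
print(min_max_scale(X))      # each column now spans [0, 1]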

In the adult dataset, the target is to predict whether a person’s income is greater than 50,00050,000. For the bank and bank2 datasets, we want to predict whether the client opens a bank account after a marketing call. In the breastcancer dataset, we want to predict whether the given sample is benign. In the heart dataset, we want to detect the presence of heart disease in a patient. In the mammo dataset, we want to predict whether the sample from the mammography is malignant. In the mushroom dataset, we want to predict whether the mushroom is poisonous. In the spambase dataset, we want to predict whether an email is spam. In the careval dataset, the goal is to evaluate cars. In the ficobin dataset, we want to predict a person’s credit risk. In the compasbin dataset, we want to predict whether a convicted criminal will re-offend. In the diabetes dataset, we want to predict whether or not the patients in the dataset have diabetes. In the ionosphere dataset, we want to predict whether the radar data show evidence of some type of structure in the ionosphere.

Baseline implementations.

For DT, we use the implementation from scikit-learn (Pedregosa et al., 2011) and set the splitting criterion to entropy. For LCPA and RobDT, we use the implementations from their original authors (https://github.com/ustunb/risk-slim and https://github.com/chenhongge/RobustTrees).

Measure empirical robustness.

The IR for DT and RobDT can be measured using the method in (Kantchelian et al., 2016; Yang et al., 2020a). The IR for BBM-RS is measured using the method in (Andriushchenko & Hein, 2019). The IR for LCPA can be measured by solving a linear program.
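Purely as an illustration (this is not the authors' implementation), for a monotone model such as the risk scores of the previous section, the minimal flipping perturbation can also be found by a one-dimensional search instead of a linear program: by monotonicity, the worst-case ℓ∞ attack moves every coordinate by the same amount against the current prediction. A sketch in Python with numpy, where predict is any monotone classifier returning ±1 (e.g., the risk-score predictor sketched after Theorem 6):

import numpy as np

def empirical_robustness(x, predict, hi=1.0, iters=50):
    # minimal l_inf perturbation that flips a monotone classifier at x:
    # the strongest attack within budget eps is x -/+ eps * 1, so a
    # one-dimensional binary search over eps suffices
    pred = predict(x)
    direction = -pred * np.ones_like(x)
    if predict(x + hi * direction) == pred:
        return hi                                # cannot flip within the budget
    lo = 0.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if predict(x + mid * direction) == pred:
            lo = mid
        else:
            hi = mid
    return hi

# example usage with the risk-score predictor sketched after Theorem 6:
# eps = empirical_robustness(x, predict)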

Appendix C Additional Results

C.1 Examples that are similar but labeled differently

The compasbin dataset has the lowest rr-separation. Its binary features are:

  • sex:Female

  • age:<21<21

  • age:<23<23

  • age:<26<26

  • age:<46<46

  • juvenile-felonies:=0

  • juvenile-misdemeanors:=0

  • juvenile-crimes:=0

  • priors:=0

  • priors:=1

  • priors:2-3

  • priors:>>3

There are 852852 people who are male, aged between 2626 and 4646, did not commit any juvenile felonies, misdemeanors, or crimes, and have more than 33 previous criminal convictions. These people have exactly the same feature vector, but their labels differ: 542542 recidivated within two years while 310310 did not.
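Such collisions can be found mechanically. A minimal sketch (Python with pandas; the file name and label column name are illustrative placeholders) groups identical feature vectors and counts those that appear with both labels, which is what drives the rr-separation of compasbin to zero.

import pandas as pd

# "compas-binary.csv" and the label column name are illustrative placeholders
df = pd.read_csv("compas-binary.csv")
label = "two_year_recid"
feature_cols = [c for c in df.columns if c != label]

# identical feature vectors that occur with both labels make the data 0-separated
labels_per_vector = df.groupby(feature_cols)[label].nunique()
print((labels_per_vector > 1).sum(), "distinct feature vectors carry both labels")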

C.2 Relationship between explainability, accuracy, and robustness in BBM-RS

To understand the interaction between explainability, accuracy, and robustness, we measure these criteria for BBM-RS with different noise levels τ\tau. The results are shown in Figure 4. We observe that, as the noise level changes, robustness and explanation complexity go hand in hand: for higher noise levels, we obtain higher robustness together with lower explanation complexity and lower accuracy. This shows that by making the model simpler, we can achieve better robustness and explainability while losing some accuracy.

Figure 4: Trade-off between explainability and accuracy for BBM-RS. The size of the ball represents the accuracy. Panels (a)-(m): adult, bank, bank2, breastcancer, careval, compasbin, diabetes, ficobin, heart, ionosphere, mammo, mushroom, spambase.

C.3 Trade-off between explanation complexity and accuracy for BBM-RS

To understand how explanation complexity affects accuracy, we first train a BBM-RS classifier. The learned BBM-RS classifier consists of TT weak learners. We then measure the test accuracy of using only the first ii weak learners for prediction, for i=1Ti=1\ldots T. Finally, we plot accuracy versus explanation complexity (number of unique weak learners) in Figure 5. Note that for the same explanation complexity, there may be more than one test accuracy; in this case, we show the highest test accuracy. In Figure 5, we see that as the explanation complexity increases, the test accuracy generally increases as well.
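A minimal sketch of this truncation (Python with numpy; the (feature, threshold, weight) representation of the weak learners is illustrative, not the actual BBM-RS interface):

import numpy as np

def truncated_accuracy(weak_learners, X, y, k):
    # test accuracy when only the first k weak learners vote;
    # weak_learners is a list of (feature, threshold, weight) triples -- an
    # illustrative representation, not the actual BBM-RS interface
    votes = np.zeros(len(X))
    for feature, threshold, weight in weak_learners[:k]:
        votes += weight * np.where(X[:, feature] >= threshold, 1.0, -1.0)
    return np.mean(np.sign(votes) == y)

# accuracy as a function of explanation complexity (number of weak learners used):
# curve = [truncated_accuracy(weak_learners, X_test, y_test, k)
#          for k in range(1, len(weak_learners) + 1)]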

Figure 5: Trade-off between explanation complexity and test accuracy for BBM-RS. Panels (a)-(m): adult, bank, bank2, breastcancer, careval, compasbin, diabetes, ficobin, heart, ionosphere, mammo, mushroom, spambase.
Columns: explanation complexity (EC), test accuracy, and empirical robustness (ER); within each group, results are reported for DT, RobDT, LCPA, and BBM-RS (in that order).
adult 414.20±5.66414.20\pm 5.66 287.90±35.66287.90\pm 35.66 14.90±1.4614.90\pm 1.46 6.00±.60\mathbf{6.00\pm.60} 0.83±.00\mathbf{0.83\pm.00} 0.83±.00\mathbf{0.83\pm.00} 0.82±.000.82\pm.00 0.81±.000.81\pm.00 0.50±.00\mathbf{0.50\pm.00} 0.50±.00\mathbf{0.50\pm.00} 0.12±.020.12\pm.02 0.50±.00\mathbf{0.50\pm.00}
bank 30.70±.1530.70\pm.15 26.80±.2026.80\pm.20 8.90±.668.90\pm.66 8.00±1.41\mathbf{8.00\pm 1.41} 0.90±.00\mathbf{0.90\pm.00} 0.90±.00\mathbf{0.90\pm.00} 0.90±.00\mathbf{0.90\pm.00} 0.90±.00\mathbf{0.90\pm.00} 0.50±.00\mathbf{0.50\pm.00} 0.50±.00\mathbf{0.50\pm.00} 0.20±.030.20\pm.03 0.50±.00\mathbf{0.50\pm.00}
bank2 30.00±.3030.00\pm.30 30.70±.1530.70\pm.15 13.80±1.5413.80\pm 1.54 4.50±1.34\mathbf{4.50\pm 1.34} 0.91±.00\mathbf{0.91\pm.00} 0.90±.000.90\pm.00 0.90±.000.90\pm.00 0.90±.000.90\pm.00 0.12±.010.12\pm.01 0.18±.020.18\pm.02 0.10±.010.10\pm.01 0.50±.00\mathbf{0.50\pm.00}
breastcancer 15.20±1.2515.20\pm 1.25 7.40±.607.40\pm.60 6.00±.00\mathbf{6.00\pm.00} 11.00±.8911.00\pm.89 0.94±.000.94\pm.00 0.94±.010.94\pm.01 0.96±.00\mathbf{0.96\pm.00} 0.96±.01\mathbf{0.96\pm.01} 0.23±.010.23\pm.01 0.29±.01\mathbf{0.29\pm.01} 0.28±.000.28\pm.00 0.27±.010.27\pm.01
careval 59.30±2.2259.30\pm 2.22 28.20±.6528.20\pm.65 10.10±.9710.10\pm.97 8.70±.47\mathbf{8.70\pm.47} 0.97±.00\mathbf{0.97\pm.00} 0.96±.000.96\pm.00 0.91±.010.91\pm.01 0.77±.000.77\pm.00 0.50±.00\mathbf{0.50\pm.00} 0.50±.00\mathbf{0.50\pm.00} 0.19±.020.19\pm.02 0.50±.00\mathbf{0.50\pm.00}
compasbin 67.80±13.0167.80\pm 13.01 33.70±3.0533.70\pm 3.05 5.40±.22\mathbf{5.40\pm.22} 7.60±.167.60\pm.16 0.67±.00\mathbf{0.67\pm.00} 0.67±.00\mathbf{0.67\pm.00} 0.65±.000.65\pm.00 0.66±.000.66\pm.00 0.50±.00\mathbf{0.50\pm.00} 0.50±.00\mathbf{0.50\pm.00} 0.15±.010.15\pm.01 0.33±.010.33\pm.01
diabetes 31.20±6.9631.20\pm 6.96 27.90±2.9527.90\pm 2.95 6.00±.006.00\pm.00 2.10±.53\mathbf{2.10\pm.53} 0.74±.010.74\pm.01 0.73±.010.73\pm.01 0.76±.01\mathbf{0.76\pm.01} 0.65±.010.65\pm.01 0.08±.010.08\pm.01 0.08±.000.08\pm.00 0.09±.000.09\pm.00 0.15±.05\mathbf{0.15\pm.05}
ficobin 30.60±.2230.60\pm.22 59.60±29.8259.60\pm 29.82 6.40±.16\mathbf{6.40\pm.16} 11.80±.6511.80\pm.65 0.71±.000.71\pm.00 0.71±.000.71\pm.00 0.71±.000.71\pm.00 0.72±.00\mathbf{0.72\pm.00} 0.50±.00\mathbf{0.50\pm.00} 0.50±.00\mathbf{0.50\pm.00} 0.22±.010.22\pm.01 0.50±.00\mathbf{0.50\pm.00}
heart 20.30±1.6020.30\pm 1.60 13.60±.8813.60\pm.88 11.90±1.4611.90\pm 1.46 9.50±.82\mathbf{9.50\pm.82} 0.76±.010.76\pm.01 0.79±.010.79\pm.01 0.82±.01\mathbf{0.82\pm.01} 0.82±.01\mathbf{0.82\pm.01} 0.23±.020.23\pm.02 0.31±.020.31\pm.02 0.14±.010.14\pm.01 0.32±.02\mathbf{0.32\pm.02}
ionosphere 11.30±.9811.30\pm.98 8.60±.768.60\pm.76 17.90±3.1417.90\pm 3.14 6.80±1.96\mathbf{6.80\pm 1.96} 0.89±.010.89\pm.01 0.92±.01\mathbf{0.92\pm.01} 0.88±.010.88\pm.01 0.86±.010.86\pm.01 0.15±.010.15\pm.01 0.25±.010.25\pm.01 0.07±.010.07\pm.01 0.28±.01\mathbf{0.28\pm.01}
mammo 27.40±5.0927.40\pm 5.09 12.40±.6512.40\pm.65 7.20±.657.20\pm.65 1.90±.60\mathbf{1.90\pm.60} 0.79±.00\mathbf{0.79\pm.00} 0.79±.00\mathbf{0.79\pm.00} 0.79±.00\mathbf{0.79\pm.00} 0.77±.000.77\pm.00 0.47±.010.47\pm.01 0.50±.00\mathbf{0.50\pm.00} 0.21±.020.21\pm.02 0.50±.00\mathbf{0.50\pm.00}
mushroom 10.80±.2510.80\pm.25 9.10±.10\mathbf{9.10\pm.10} 23.80±1.5023.80\pm 1.50 9.90±.899.90\pm.89 1.00±.00\mathbf{1.00\pm.00} 1.00±.00\mathbf{1.00\pm.00} 1.00±.00\mathbf{1.00\pm.00} 0.97±.000.97\pm.00 0.50±.00\mathbf{0.50\pm.00} 0.50±.00\mathbf{0.50\pm.00} 0.10±.010.10\pm.01 0.50±.00\mathbf{0.50\pm.00}
spambase 153.90±8.51153.90\pm 8.51 72.30±2.8972.30\pm 2.89 29.50±.7629.50\pm.76 5.60±.48\mathbf{5.60\pm.48} 0.92±.00\mathbf{0.92\pm.00} 0.87±.000.87\pm.00 0.88±.000.88\pm.00 0.79±.010.79\pm.01 0.00±.000.00\pm.00 0.04±.000.04\pm.00 0.02±.000.02\pm.00 0.05±.00\mathbf{0.05\pm.00}
Table 4: The comparison of BBM-RS with other interpretable models (with standard error).