
Decision tree heuristics can fail, even in the smoothed setting

Guy Blanc
Stanford

Jane Lange
MIT

Mingda Qiao
Stanford

Li-Yang Tan
Stanford
Abstract

Greedy decision tree learning heuristics are mainstays of machine learning practice, but theoretical justification for their empirical success remains elusive. In fact, it has long been known that there are simple target functions for which they fail badly (Kearns and Mansour, STOC 1996).

Recent work of Brutzkus, Daniely, and Malach (COLT 2020) considered the smoothed analysis model as a possible avenue towards resolving this disconnect. Within the smoothed setting and for targets $f$ that are $k$-juntas, they showed that these heuristics successfully learn $f$ with depth-$k$ decision tree hypotheses. They conjectured that the same guarantee holds more generally for targets that are depth-$k$ decision trees.

We provide a counterexample to this conjecture: we construct targets that are depth-$k$ decision trees and show that even in the smoothed setting, these heuristics build trees of depth $2^{\Omega(k)}$ before achieving high accuracy. We also show that the guarantees of Brutzkus et al. cannot extend to the agnostic setting: there are targets that are very close to $k$-juntas, for which these heuristics build trees of depth $2^{\Omega(k)}$ before achieving high accuracy.

1 Introduction

Greedy decision tree learning heuristics are among the earliest and most basic algorithms in machine learning. Well-known examples include ID3 [Qui86], its successor C4.5 [Qui93], and CART [BFSO84], all of which continue to be widely employed in everyday ML applications. These simple heuristics build a decision tree for a labeled dataset $S$ in a greedy, top-down fashion. They first identify a “good” attribute to query as the root of the tree. This induces a partition of $S$ into $S_{0}$ and $S_{1}$, and the left and right subtrees are built recursively using $S_{0}$ and $S_{1}$ respectively.

In more detail, each heuristic is associated with an impurity function $\mathscr{G}:[0,1]\to[0,1]$ that is concave, symmetric around $\frac{1}{2}$, and satisfies $\mathscr{G}(0)=\mathscr{G}(1)=0$ and $\mathscr{G}(\frac{1}{2})=1$. Examples include the binary entropy function $\mathscr{G}(p)=\textnormal{H}(p)$ that is used by ID3 and C4.5, and the Gini impurity function $\mathscr{G}(p)=4p(1-p)$ that is used by CART; Kearns and Mansour [KM96] proposed and analyzed the function $\mathscr{G}(p)=2\sqrt{p(1-p)}$. For a target function $f:\mathbb{R}^{n}\to\{0,1\}$ and a distribution $\mathcal{D}$ over $\mathbb{R}^{n}$, these heuristics build a decision tree hypothesis for $f$ as follows:

  1. Split: Query $\mathds{1}[x_{i}\geq\theta]$ as the root of the tree, where $x_{i}$ and $\theta$ are chosen to (approximately) maximize the purity gain with respect to $\mathscr{G}$:

     $$\mathscr{G}\text{-}\mathrm{purity}\text{-}\mathrm{gain}_{\mathcal{D}}(f,x_{i})\coloneqq\mathscr{G}(\operatorname*{\mathbb{E}}[f])-\big(\Pr[x_{i}\geq\theta]\cdot\mathscr{G}(\operatorname*{\mathbb{E}}[f_{x_{i}\geq\theta}])+\Pr[x_{i}<\theta]\cdot\mathscr{G}(\operatorname*{\mathbb{E}}[f_{x_{i}<\theta}])\big),$$

     where the expectations and probabilities above are with respect to randomly drawn labeled examples $(x,f(x))$ where $x\sim\mathcal{D}$, and $f_{x_{i}\geq\theta}$ denotes the restriction of $f$ to inputs satisfying $x_{i}\geq\theta$ (and similarly for $f_{x_{i}<\theta}$).

  2. Recurse: Build the left and right subtrees by recursing on $f_{x_{i}\geq\theta}$ and $f_{x_{i}<\theta}$ respectively.

  3. Terminate: The recursion terminates when the depth of the tree reaches a user-specified depth parameter. Each leaf $\ell$ of the tree is labeled by $\mathrm{round}(\operatorname*{\mathbb{E}}[f_{\ell}])$, where we associate $\ell$ with the restriction corresponding to the root-to-$\ell$ path within the tree and $\mathrm{round}(p)\coloneqq\mathds{1}[p\geq\frac{1}{2}]$. (A minimal code sketch of this top-down procedure is given after the list.)
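To make the top-down procedure concrete, here is a minimal sketch in Python, specialized to binary features and the Gini impurity. It is an illustration of the generic scheme above, not the implementation used by ID3, C4.5, or CART; all names (`build_tree`, `purity_gain`, `gini`) are our own.

```python
import numpy as np

def gini(p):
    # Gini impurity G(p) = 4p(1-p), normalized so that G(1/2) = 1.
    return 4.0 * p * (1.0 - p)

def purity_gain(X, y, i, impurity=gini):
    # Empirical purity gain of splitting on binary feature i.
    mask = X[:, i] == 1
    if mask.all() or (~mask).all():
        return 0.0
    p1 = mask.mean()
    return (impurity(y.mean())
            - p1 * impurity(y[mask].mean())
            - (1 - p1) * impurity(y[~mask].mean()))

def build_tree(X, y, depth_budget, impurity=gini):
    # Terminate: majority-label leaf at the depth budget, or on an empty/pure sample.
    if depth_budget == 0 or len(y) == 0 or y.mean() in (0.0, 1.0):
        return {"leaf": int(round(y.mean())) if len(y) else 0}
    # Split: greedily pick the feature with the largest empirical purity gain.
    gains = [purity_gain(X, y, i, impurity) for i in range(X.shape[1])]
    best = int(np.argmax(gains))
    mask = X[:, best] == 1
    # Recurse on the two restricted samples.
    return {"feature": best,
            "left": build_tree(X[~mask], y[~mask], depth_budget - 1, impurity),
            "right": build_tree(X[mask], y[mask], depth_budget - 1, impurity)}
```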

Given the popularity and empirical success of these heuristics,¹ it is natural to seek theoretical guarantees on their performance:

Let $f:\mathbb{R}^{n}\to\{0,1\}$ be a target function and $\mathcal{D}$ be a distribution over $\mathbb{R}^{n}$. Can we obtain a high-accuracy hypothesis for $f$ by growing a depth-$k^{\prime}$ tree using these heuristics, where $k^{\prime}$ is not too much larger than $k$, the optimal decision tree depth for $f$? ($\diamondsuit$)

¹CART and C4.5 were named as two of the “Top 10 algorithms in data mining” by the International Conference on Data Mining (ICDM) community [WKQ+08]; other algorithms on this list include $k$-means, $k$-nearest neighbors, Adaboost, and PageRank, all of whose theoretical properties are the subjects of intensive study. C4.5 has also been described as “probably the machine learning workhorse most widely used in practice to date” [WFHP16].

1.1 Background and prior work

A simple and well-known impossibility result.

Unfortunately, it has long been known [KM96, Kea96] that no such guarantee is possible even under favorable feature and distributional assumptions. Consider the setting of binary features (i.e. $f:\{0,1\}^{n}\to\{0,1\}$) and the uniform distribution $\mathcal{U}$ over $\{0,1\}^{n}$, and suppose $f$ is the parity of two unknown features, $x_{i}\oplus x_{j}$ for $i,j\in[n]$. It can be easily verified that for all impurity functions $\mathscr{G}$, all features have the same purity gain: $\mathscr{G}\text{-}\mathrm{purity}\text{-}\mathrm{gain}_{\mathcal{U}}(f,x_{\ell})=0$ for all $\ell\in[n]$, regardless of whether $\ell\in\{i,j\}$. Therefore, these heuristics may build a tree of depth $\Omega(n)$, querying irrelevant variables $x_{\ell}$ where $\ell\notin\{i,j\}$, before achieving any nontrivial accuracy. This is therefore an example where the target $f$ is computable by a decision tree of depth $k=2$, and yet these heuristics may build a tree of depth $k^{\prime}=\Omega(n)$ before achieving any nontrivial accuracy.
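The claim that every feature has zero purity gain for this target is easy to verify numerically. The brute-force sketch below is our own illustration, with $n=6$ and the Gini impurity as arbitrary choices; it enumerates $\{0,1\}^{n}$ under the uniform distribution and prints the purity gain of every feature:

```python
import itertools

def gini(p):
    return 4.0 * p * (1.0 - p)

n, i, j = 6, 0, 1                                     # target: x_0 XOR x_1
points = list(itertools.product([0, 1], repeat=n))    # uniform distribution on {0,1}^n
labels = [x[i] ^ x[j] for x in points]

def mean(vals):
    return sum(vals) / len(vals)

for ell in range(n):
    ones = [f for x, f in zip(points, labels) if x[ell] == 1]
    zeros = [f for x, f in zip(points, labels) if x[ell] == 0]
    p = len(ones) / len(points)
    gain = gini(mean(labels)) - p * gini(mean(ones)) - (1 - p) * gini(mean(zeros))
    print(ell, gain)    # prints 0.0 for every feature, relevant or not
```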

Smoothed analysis.

In light of such impossibility results, a line of work has focused on establishing provable guarantees for restricted classes of target functions [FP04, Lee09, BDM19, BLT20b, BLT20a]; we give an overview of these results in Section 1.3.

The focus of our work is instead on smoothed analysis as an alternative route towards evading these impossibility results, an approach that was recently considered by Brutzkus, Daniely, and Malach [BDM20]. Smoothed analysis is by now a standard paradigm for going beyond worst-case analysis. Roughly speaking, positive results in this model show that “hard instances are pathological.” Smoothed analysis has been especially influential in accounting for the empirical effectiveness of algorithms widely used in practice, a notable example being the simplex algorithm for linear programming [ST04]. The idea of analyzing greedy decision tree learning heuristics through the lens of smoothed analysis is therefore very natural.

A smoothed product distribution over $\{0,1\}^{n}$, a notion introduced by Kalai, Samorodnitsky, and Teng [KST09], is obtained by randomly and independently perturbing the bias of each marginal of a product distribution. For smoothed product distributions, Brutzkus et al. proved strong guarantees on the performance of greedy decision tree heuristics when run on targets that are juntas, functions that depend on only a small number of their features. For a given impurity function $\mathscr{G}$, let us write $\mathcal{A}_{\mathscr{G}}$ to denote the corresponding decision tree learning heuristic.

Theorem 1 (Performance guarantee for targets that are $k$-juntas [BDM20]).

For all impurity functions $\mathscr{G}$ and for all target functions $f:\{0,1\}^{n}\to\{0,1\}$ that are $k$-juntas, if $\mathcal{A}_{\mathscr{G}}$ is trained on examples drawn from a smoothed product distribution, it learns a decision tree hypothesis of depth $k$ that achieves perfect accuracy.

(Therefore Theorem 1 shows that the smoothed setting enables one to circumvent the impossibility result discussed above, which was based on targets that are $2$-juntas.)

Every $k$-junta is computable by a depth-$k$ decision tree, but a depth-$k$ decision tree can depend on as many as $2^{k}$ variables. Brutzkus et al. left as an open problem the conjecture that the guarantees of Theorem 1 hold more generally for targets that are depth-$k$ decision trees:

Conjecture 1 (Performance guarantee for targets that are depth-$k$ decision trees).

For all impurity functions $\mathscr{G}$ and for all target functions $f:\{0,1\}^{n}\to\{0,1\}$ that are depth-$k$ decision trees, if $\mathcal{A}_{\mathscr{G}}$ is trained on examples drawn from a smoothed product distribution, it learns a decision tree hypothesis of depth $O(k)$ that achieves high accuracy.

In other words, Conjecture 1 states that for all targets $f:\{0,1\}^{n}\to\{0,1\}$, the sought-for guarantee ($\diamondsuit$) holds if the heuristics are trained on examples drawn from a smoothed product distribution.

1.2 This work: Lower bounds in the smoothed setting

Our main result is a counterexample to Conjecture 1. We construct targets that are depth-$k$ decision trees for which all greedy impurity-based heuristics, even in the smoothed setting, may grow a tree of depth $2^{\Omega(k)}$ before achieving high accuracy. This lower bound is close to being maximally large since Theorem 1 implies an upper bound of $O(2^{k})$. Our result is actually stronger than just a lower bound in the smoothed setting: our lower bound holds with respect to any product distribution that is balanced in the sense that its marginals are not too skewed.

Theorem 2 (Our main result: a counterexample to Conjecture 1; informal).

Conjecture 1 is false: For all $k=k(n)$, there are target functions $f:\{0,1\}^{n}\to\{0,1\}$ that are depth-$k$ decision trees such that for all impurity functions $\mathscr{G}$, if $\mathcal{A}_{\mathscr{G}}$ is trained on examples drawn from any balanced product distribution, its decision tree hypothesis does not achieve high accuracy unless it has depth $2^{\Omega(k)}$.

By building on our proof of Theorem 2, we also show that the guarantees of Brutzkus et al. for $k$-juntas cannot extend to the agnostic setting:

Theorem 3 (Theorem 1 does not extend to the agnostic setting; informal).

For all $\varepsilon$ and $k=k(n)$, there are target functions $f:\{0,1\}^{n}\to\{0,1\}$ that are $\varepsilon$-close to a $k$-junta such that for all impurity functions $\mathscr{G}$, if $\mathcal{A}_{\mathscr{G}}$ is trained on examples drawn from any balanced product distribution, its decision tree hypothesis does not achieve high accuracy unless it has depth $\varepsilon\cdot 2^{\Omega(k)}$.

In particular, there are targets that are $2^{-\Omega(k)}$-close to $k$-juntas, for which these heuristics have to construct a decision tree hypothesis of depth $2^{\Omega(k)}$ before achieving high accuracy. Taken together with the positive result of Brutzkus et al., Theorems 2 and 3 add to our understanding of the strength and limitations of greedy decision tree learning heuristics.

Our lower bounds are based on new generalizations of the addressing function. Since the addressing function is often a useful extremal example in a variety of settings, we are hopeful that these generalizations and our analysis of them will see further utility beyond the applications of this paper.

1.3 Related Work

As mentioned above, there has been a substantial line of work on establishing provable guarantees for greedy decision tree heuristics when run on restricted classes of target functions. Fiat and Pechyony [FP04] considered the class of read-once DNF formulas and halfspaces; the Ph.D. thesis of Lee [Lee09] considered the class of monotone functions; Brutzkus, Daniely, and Malach [BDM19] considered conjunctions and read-once DNF formulas; recent works of [BLT20b, BLT20a] build on the work of Lee and further studied monotone target functions. (All these works focus on the case of binary features and product distributions over examples.)

Kearns and Mansour [KM96], in one of the first papers to study these heuristics from a theoretical perspective, showed that they can be viewed as boosting algorithms, with internal nodes of the decision tree hypothesis playing the role of weak learners. Their subsequent work with Dietterich [DKM96] provides experimental results that complement the theoretical results of [KM96]; see also the survey of Kearns [Kea96].

Finally, we mention that decision trees are one of the most intensively studied concept classes in learning theory. The literature on this problem is rich and vast (see e.g. [EH89, Riv87, Blu92, Han93, Bsh93, KM93, BFJ+94, HJLT96, KM99, MR02, JS06, OS07, GKK08, KS06, KST09, HKY18, CM19, BGLT20]), studying it from a variety of perspectives and providing both positive and negative results. However, the algorithms developed in these works do not resemble the greedy heuristics used in practice, and indeed, most of them are not proper (in the sense of returning a hypothesis that is itself a decision tree).²

²Quoting [KM96], “it seems fair to say that despite their other successes, the models of computational learning theory have not yet provided significant insight into the apparent empirical success of programs like C4.5 and CART.”

2 Preliminaries

Recall that an impurity function $\mathscr{G}:[0,1]\to[0,1]$ is concave, symmetric with respect to $\frac{1}{2}$, and satisfies $\mathscr{G}(0)=\mathscr{G}(1)=0$ and $\mathscr{G}(\frac{1}{2})=1$. We further quantify the concavity and smoothness of $\mathscr{G}$ as follows:

Definition 1 (Impurity functions).

$\mathscr{G}$ is an $(\alpha,L)$-impurity function if $\mathscr{G}$ is $\alpha$-strongly concave and $L$-smooth, i.e., $\mathscr{G}$ is twice-differentiable and $\mathscr{G}^{\prime\prime}(x)\in[-L,-\alpha]$ for every $x\in[0,1]$.

For a boolean function $f:\{0,1\}^{n}\to\{0,1\}$ and index $i\in[n]$, we write $f_{x_{i}=0}$ and $f_{x_{i}=1}$ to denote the restricted functions obtained by fixing the $i$-th input bit of $f$ to either $0$ or $1$. Formally, each $f_{x_{i}=b}$ is a function over $\{0,1\}^{n}$ defined as $f_{x_{i}=b}(x)=f(x^{i\to b})$, where $x^{i\to b}$ denotes the string obtained by setting the $i$-th bit of $x$ to $b$. More generally, a restriction $\pi$ is a list of constraints of form “$x_{i}=b$” in which every index $i$ appears at most once. For restriction $\pi=(x_{i_{1}}=b_{1},x_{i_{2}}=b_{2},\ldots)$, the restricted function $f_{\pi}:\{0,1\}^{n}\to\{0,1\}$ is similarly defined as $f_{\pi}(x)=f(x^{i_{1}\to b_{1},i_{2}\to b_{2},\ldots})$.

Definition 2 (Purity gain).

Let $\mathcal{D}$ be a distribution over $\{0,1\}^{n}$ and $p_{i}=\Pr_{x\sim\mathcal{D}}\left[x_{i}=1\right]$. The $\mathscr{G}$-purity gain of querying variable $x_{i}$ on boolean function $f$ is defined as

$$\mathscr{G}\text{-}\mathrm{purity}\text{-}\mathrm{gain}_{\mathcal{D}}(f,x_{i})\coloneqq\mathscr{G}\left(\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f(x)\right]\right)-p_{i}\mathscr{G}\left(\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f_{x_{i}=1}(x)\right]\right)-(1-p_{i})\mathscr{G}\left(\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f_{x_{i}=0}(x)\right]\right).$$

In a decision tree, each node $v$ naturally corresponds to a restriction $\pi_{v}$ formed by the variables queried by the ancestors of $v$ (excluding $v$ itself). We use $f_{v}$ as a shorthand for $f_{\pi_{v}}$. We say that a decision tree learning algorithm is impurity-based if, in the tree returned by the algorithm, every internal node $v$ queries a variable that maximizes the purity gain with respect to $f_{v}$.

Definition 3 (Impurity-based algorithms).

A decision tree learning algorithm is $\mathscr{G}$-impurity-based if the following holds for every $f:\{0,1\}^{n}\to\{0,1\}$ and distribution $\mathcal{D}$ over $\{0,1\}^{n}$: When learning $f$ on $\mathcal{D}$, the algorithm outputs a decision tree such that for every internal node $v$, the variable $x_{i}$ that is queried at $v$ satisfies $\mathscr{G}\text{-}\mathrm{purity}\text{-}\mathrm{gain}_{\mathcal{D}}(f_{v},x_{i})\geq\mathscr{G}\text{-}\mathrm{purity}\text{-}\mathrm{gain}_{\mathcal{D}}(f_{v},x_{j})$ for every $j\in[n]$.

The above definition assumes that the algorithm exactly maximizes the $\mathscr{G}$-purity gain at every split, whereas in reality the purity gains can only be estimated from a finite dataset. We therefore consider an idealized setting that grants the learning algorithm infinitely many training examples, which, intuitively, strengthens our lower bounds. (Our lower bounds show that in order for an algorithm to recover a good tree, i.e., a high-accuracy hypothesis whose depth is close to that of the target, it would need to query a variable whose purity gain is exponentially smaller than that of the variable with the largest purity gain. Hence, if purity gains are estimated using finitely many random samples as is done in reality, the strength of our lower bounds implies that with extremely high probability, impurity-based heuristics will fail to build a good tree; see Remark 5 for a detailed discussion.)

When a decision tree queries variable $x_{i}$ on function $f$, it naturally induces two restricted functions $f_{x_{i}=0}$ and $f_{x_{i}=1}$. The following lemma states that the purity gain of querying $x_{i}$ is roughly the squared difference between the averages of the two functions, up to a factor that depends on the impurity function $\mathscr{G}$ and the data distribution $\mathcal{D}$. We say that a product distribution over $\{0,1\}^{n}$ is $\delta$-balanced if the expectation of each of the $n$ coordinates is in $[\delta,1-\delta]$.

Lemma 1.

For any $f:\{0,1\}^{n}\to\{0,1\}$, $\delta$-balanced product distribution $\mathcal{D}$ over $\{0,1\}^{n}$ and $(\alpha,L)$-impurity function $\mathscr{G}$, it holds for $\kappa=\max\left(\frac{2}{\alpha\delta(1-\delta)},\frac{L}{8}\right)$ and every $i\in[n]$ that

$$\frac{1}{\kappa}\leq\frac{\mathscr{G}\text{-}\mathrm{purity}\text{-}\mathrm{gain}_{\mathcal{D}}(f,x_{i})}{\left[\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f_{x_{i}=0}(x)\right]-\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f_{x_{i}=1}(x)\right]\right]^{2}}\leq\kappa.$$
Proof of Lemma 1.

Let $p_{i}=\Pr_{x\sim\mathcal{D}}\left[x_{i}=1\right]$ and $\mu_{b}=\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f_{x_{i}=b}(x)\right]$ for $b\in\{0,1\}$. Then, we have $\operatorname*{\mathbb{E}}_{x\sim\mathcal{D}}\left[f(x)\right]=p_{i}\mu_{1}+(1-p_{i})\mu_{0}$, and the purity gain can be written as

$$\mathscr{G}\text{-}\mathrm{purity}\text{-}\mathrm{gain}_{\mathcal{D}}(f,x_{i})=\mathscr{G}(p_{i}\mu_{1}+(1-p_{i})\mu_{0})-p_{i}\mathscr{G}(\mu_{1})-(1-p_{i})\mathscr{G}(\mu_{0}).$$

Since $\mathscr{G}$ is $\alpha$-strongly concave and $L$-smooth, the above is bounded between $\frac{\alpha}{2}\cdot p_{i}(1-p_{i})\cdot(\mu_{0}-\mu_{1})^{2}$ and $\frac{L}{2}\cdot p_{i}(1-p_{i})\cdot(\mu_{0}-\mu_{1})^{2}$. Since $\mathcal{D}$ is $\delta$-balanced, we have $\delta(1-\delta)\leq p_{i}(1-p_{i})\leq\frac{1}{4}$. It follows that

$$\frac{\alpha}{2}\cdot\delta(1-\delta)\leq\frac{\alpha}{2}\cdot p_{i}(1-p_{i})\leq\frac{\mathscr{G}\text{-}\mathrm{purity}\text{-}\mathrm{gain}_{\mathcal{D}}(f,x_{i})}{(\mu_{0}-\mu_{1})^{2}}\leq\frac{L}{2}\cdot p_{i}(1-p_{i})\leq\frac{L}{8}.$$

Thus, the ratio is bounded between 1/κ1/\kappa and κ\kappa. ∎
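As a quick numerical sanity check of Lemma 1 (not part of the proof), the sketch below evaluates the ratio for the Gini impurity, which has $\mathscr{G}^{\prime\prime}\equiv-8$ and is therefore an $(8,8)$-impurity function, on a random boolean function and a random $\delta$-balanced product distribution; all names are our own.

```python
import itertools, random

def gini(p):
    # Gini impurity G(p) = 4p(1-p); G'' = -8, so alpha = L = 8.
    return 4.0 * p * (1.0 - p)

random.seed(0)
n, delta = 4, 0.2
alpha = L = 8.0
probs = [random.uniform(delta, 1 - delta) for _ in range(n)]                 # delta-balanced marginals
f = {x: random.randint(0, 1) for x in itertools.product([0, 1], repeat=n)}  # random target

def pr(x):
    # Probability of x under the product distribution with the given marginals.
    out = 1.0
    for b, q in zip(x, probs):
        out *= q if b == 1 else 1 - q
    return out

def mean_f(condition):
    pts = [x for x in f if condition(x)]
    total = sum(pr(x) for x in pts)
    return sum(pr(x) * f[x] for x in pts) / total

kappa = max(2 / (alpha * delta * (1 - delta)), L / 8)
for i in range(n):
    mu0 = mean_f(lambda x: x[i] == 0)
    mu1 = mean_f(lambda x: x[i] == 1)
    gain = gini(mean_f(lambda x: True)) - probs[i] * gini(mu1) - (1 - probs[i]) * gini(mu0)
    ratio = gain / (mu0 - mu1) ** 2 if mu0 != mu1 else float("nan")
    print(f"x_{i}: ratio = {ratio:.4f}, Lemma 1 bounds = [{1/kappa:.4f}, {kappa:.4f}]")
```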

Our lower bounds hold with respect to all $\delta$-balanced product distributions. We compare this to the definition of a $c$-smoothened $\delta$-balanced product distribution from [BDM20].

Definition 4 (Smooth distributions).

A $c$-smoothened $\delta$-balanced product distribution is a random product distribution over $\{0,1\}^{n}$ where the marginal for the $i$-th bit is $1$ with probability $\widehat{p_{i}}+\Delta_{i}$, for fixed $\widehat{p_{i}}\in(\delta+c,1-\delta-c)$ and $\Delta_{i}$ drawn i.i.d. from $\mathrm{Uniform}([-c,c])$.
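To illustrate Definition 4, the short sketch below (names and parameter values are our own) draws one $c$-smoothened $\delta$-balanced product distribution and then samples a point from it:

```python
import random

def smoothen(p_hat, c):
    # Perturb each base marginal independently by Uniform([-c, c]).
    # Each p_hat[i] is assumed to lie in (delta + c, 1 - delta - c).
    return [p + random.uniform(-c, c) for p in p_hat]

def sample(marginals):
    # One draw x in {0,1}^n from the product distribution with these marginals.
    return [1 if random.random() < p else 0 for p in marginals]

delta, c = 0.1, 0.05
p_hat = [0.3, 0.5, 0.7, 0.4]       # each in (delta + c, 1 - delta - c) = (0.15, 0.85)
marginals = smoothen(p_hat, c)     # one smoothed product distribution
x = sample(marginals)              # a labeled example would then be (x, f(x))
```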

Since our lower bounds hold against all $\delta$-balanced product distributions, they also hold against all $c$-smoothened $\delta$-balanced product distributions.

3 Proof overview and formal statements of our results

Our goal is to construct a target function that can be computed by a depth-$k$ decision tree, but on which impurity-based algorithms must build a tree of depth $2^{\Omega(k)}$ or suffer large error. To do so, we construct a decision tree target $T$ where the variables with the largest purity gain are at the bottom layer of $T$ (adjacent to its leaves). Intuitively, impurity-based algorithms will build their decision tree hypothesis for $T$ by querying all the variables in the bottom layer of $T$ before querying any of the variables higher up in $T$. Our construction will be such that until the higher-up variables are queried, it is impossible to approximate the target to better than trivial error. Summarizing informally, we show that impurity-based algorithms build their decision tree hypotheses for our target by querying variables in exactly the “wrong order”.

The starting point of our construction is the well-known addressing function. For $k\in\mathbb{N}$, the addressing function $f:\{0,1\}^{k+2^{k}}\to\{0,1\}$ is defined as follows: Given “addressing bits” $z\in\{0,1\}^{k}$ and “memory bits” $y\in\{0,1\}^{2^{k}}$, the output $f(y,z)$ is the $z$-th bit of $y$, where “$z$-th bit” is computed by interpreting $z$ as a base-2 integer. Note that the addressing function is computable by a decision tree of depth $k+1$ that first queries the $k$ addressing bits, followed by the appropriate memory bit.
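A direct implementation of the addressing function, mirroring the depth-$(k+1)$ query order just described, might look as follows (a sketch; the function name and the most-significant-bit-first reading of $z$ are our own conventions):

```python
def addressing(y, z):
    # y: list of 2**k memory bits; z: list of k addressing bits.
    # Returns the z-th memory bit, reading z as a base-2 integer (MSB first).
    k = len(z)
    assert len(y) == 2 ** k
    index = int("".join(str(b) for b in z), 2)
    return y[index]

# A depth-(k+1) decision tree computes this by querying the k addressing bits
# and then the single relevant memory bit, e.g.:
assert addressing([0, 1, 1, 0], [1, 0]) == 1   # address "10" = 2, and y[2] == 1
```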

For our lower bound, we would like the variables with the highest purity gain to be the memory bits. However, for smoothed product distributions, the addressing bits might have higher purity gain than the memory bits, and impurity-based algorithms might succeed in learning the addressing function. We therefore modify the addressing function by making each addressing bit the parity of multiple new bits. We show that by making each addressing bit the parity of sufficiently many new bits, we can drive the purity gain of these new bits down to the point where the memory bits have the highest purity gain as desired—in fact, larger than that of the addressing bits by a multiplicative factor of $e^{\Omega(k)}$. (Making each addressing bit the parity of multiple new bits increases the depth of the target, so this introduces technical challenges we have to overcome in order to achieve the strongest parameters.)

Our main theorem is formally restated as follows.

Theorem 4 (Formal version of Theorem 2).

Fix $L\geq\alpha>0$ and $\delta\in(0,\frac{1}{2}]$. There are boolean functions $f_{1},f_{2},\ldots$ such that: (1) $f_{k}$ is computable by a decision tree of depth $O(k/\delta)$; (2) For every $\delta$-balanced product distribution $\mathcal{D}$ over the domain of $f_{k}$ and every $(\alpha,L)$-impurity function $\mathscr{G}$, any $\mathscr{G}$-impurity-based decision tree heuristic, when learning $f_{k}$ on $\mathcal{D}$, returns a tree that has either depth $\geq 2^{k}$ or an $\Omega(\delta)$ error.

An extension of our construction and its analysis shows that the guarantees of Brutzkus et al. for targets that are $k$-juntas cannot extend to the agnostic setting. Roughly speaking, while our variant of the addressing function from Theorem 4 is far from all $k$-juntas, it can be made close to one by fixing most of the memory bits. We obtain our result by showing that our analysis continues to hold under such a restriction.

Theorem 5 (Formal version of Theorem 3).

Fix $L\geq\alpha>0$, $\delta\in(0,\frac{1}{2}]$ and $\varepsilon\in(0,1]$. There are boolean functions $f_{1},f_{2},\ldots$ such that for every $\delta$-balanced product distribution $\mathcal{D}$ over the domain of $f_{k}$: (1) $f_{k}$ is $\varepsilon$-close to an $O(k/\delta)$-junta with respect to $\mathcal{D}$; (2) For every $(\alpha,L)$-impurity function $\mathscr{G}$, any $\mathscr{G}$-impurity-based decision tree heuristic, when learning $f_{k}$ on $\mathcal{D}$, returns a tree that has either a depth of $\Omega(\varepsilon\cdot 2^{k})$ or an $\Omega(1)$ error.

4 Warm-Up: A Weaker Lower Bound

We start by giving a simplified construction that proves a weaker version of Theorem 4, in which the $O(k/\delta)$ depth in condition (1) is relaxed to $O(k^{2}/\delta)$. For integers $c,k\geq 1$, we define a boolean function $f_{c,k}:\{0,1\}^{ck^{2}+2^{k}}\to\{0,1\}$ as follows. The input of $f_{c,k}$ is viewed as two parts: $ck^{2}$ addressing bits $x_{i,j}$ indexed by $i\in[k]$ and $j\in[ck]$, and $2^{k}$ memory bits $y_{a}$ indexed by $a\in\{0,1\}^{k}$. The function value $f_{c,k}(x,y)$ is defined by first computing $z_{i}(x)=\bigoplus_{j=1}^{ck}x_{i,j}$ for every $i\in[k]$, and then assigning $f_{c,k}(x,y)=y_{z(x)}$.

In other words, $f_{c,k}$ is a disjoint composition of the $k$-bit addressing function and the parity function over $ck$ bits. Given addressing bits $x$ and memory bits $y$, the function first computes a $k$-bit address by taking the XOR of the addressing bits in each group of size $ck$, and then retrieves the memory bit with the corresponding address. Clearly, $f_{c,k}$ can be computed by a decision tree of depth $ck^{2}+1$ that first queries all the $ck^{2}$ addressing bits and then queries the relevant memory bit in the last layer.
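The construction translates directly into a few lines of code; the sketch below (names ours) computes each address bit as the XOR of its group of $ck$ addressing bits and then reads off the indexed memory bit, exactly as in the definition:

```python
from functools import reduce
from operator import xor

def f_ck(x, y, c, k):
    # x: dict mapping (i, j), with i in range(k) and j in range(c * k), to a bit
    #    (the c * k**2 addressing bits).
    # y: dict mapping k-bit tuples to a bit (the 2**k memory bits).
    z = tuple(reduce(xor, (x[(i, j)] for j in range(c * k))) for i in range(k))
    return y[z]
```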

4.1 Address is Almost Uniform

Drawing input $(x,y)$ from a distribution $\mathcal{D}$ naturally defines a distribution over $\{0,1\}^{k}$ of the $k$-bit address $z(x)=(z_{1}(x),z_{2}(x),\ldots,z_{k}(x))$. The following lemma states that when $\mathcal{D}$ is a $\delta$-balanced product distribution, the distribution of $z(x)$ is almost uniform in the $\ell_{\infty}$ sense. Furthermore, this almost uniformity holds even if one of the addressing bits $x_{i,j}$ is fixed.

Lemma 2.

Suppose that $c\geq\frac{\ln 5}{\delta}$ and $\mathcal{D}$ is a $\delta$-balanced product distribution over the domain of $f_{c,k}$. Then,

$$\left|\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)=a\right]-2^{-k}\right|\leq 5^{-k},\quad\forall a\in\{0,1\}^{k}.$$

Furthermore, for every $i\in[k]$, $j\in[ck]$ and $b\in\{0,1\}$,

$$\left|\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)=a\mid x_{i,j}=b\right]-2^{-k}\right|\leq 5^{-k},\quad\forall a\in\{0,1\}^{k}.$$

The proof of Lemma 2 uses the following simple fact, which states that the XOR of independent biased random bits is exponentially close to an unbiased coin flip.

Fact 3.

Suppose that $x_{1},x_{2},\ldots,x_{n}$ are independent Bernoulli random variables, each with an expectation between $\delta$ and $1-\delta$. Then, $\left|\Pr\left[x_{1}\oplus x_{2}\oplus\cdots\oplus x_{n}=1\right]-\frac{1}{2}\right|\leq\frac{1}{2}(1-2\delta)^{n}\leq\frac{1}{2}\exp(-2\delta n)$.

Proof of Lemma 2.

Since $z_{i}(x)=\bigoplus_{j=1}^{ck}x_{i,j}$ and $\mathcal{D}$ is $\delta$-balanced, Fact 3 gives

$$\left|\Pr_{(x,y)\sim\mathcal{D}}\left[z_{i}(x)=1\right]-\frac{1}{2}\right|\leq\frac{1}{2}\exp(-2\delta ck)\leq\frac{1}{2}\cdot 5^{-k}.$$

Note that the bits of $z(x)$ are independent, so $\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)=a\right]$ is given by

$$\prod_{i=1}^{k}\Pr_{(x,y)\sim\mathcal{D}}\left[z_{i}(x)=a_{i}\right]\leq\left(\frac{1}{2}+\frac{1}{2}\cdot 5^{-k}\right)^{k}=2^{-k}\cdot(1+5^{-k})^{k}\leq 2^{-k}\cdot(1+(2/5)^{k})=2^{-k}+5^{-k},$$

where the third step applies $(1+x)^{k}\leq 1+2^{k}x$ for $x\in[0,1]$ and integers $k\geq 1$. Similarly,

$$\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)=a\right]\geq\left(\frac{1}{2}-\frac{1}{2}\cdot 5^{-k}\right)^{k}\geq 2^{-k}\cdot(1-k\cdot 5^{-k})\geq 2^{-k}-5^{-k},$$

where the last two steps apply $(1-x)^{k}\geq 1-kx$ and $k\cdot 2^{-k}\leq 1$. This proves the first part.

The proof of the “furthermore” part is essentially the same, except that conditioning on $x_{i,j}=b$, $z_{i}(x)$ becomes the XOR of $ck-1$ independent bits and $b$. By Fact 3, we have

$$\left|\Pr_{(x,y)\sim\mathcal{D}}\left[z_{i}(x)=1\mid x_{i,j}=b\right]-\frac{1}{2}\right|\leq\frac{1}{2}\exp(-2\delta(ck-1))\leq\frac{1}{2}\exp(-\delta ck)\leq\frac{1}{2}\cdot 5^{-k},$$

and the rest of the proof is the same. ∎

4.2 Memory Bits are Queried First

The following technical lemma states that the purity gain of $f_{c,k}$ is maximized by a memory bit, regardless of the impurity function and the data distribution. Therefore, when an impurity-based algorithm (in the sense of Definition 3) learns $f_{c,k}$, the root of the decision tree will always query a memory bit. Furthermore, this property also holds for restrictions of $f_{c,k}$, as long as the restriction only involves the memory bits.

Lemma 4.

Fix $L\geq\alpha>0$ and $\delta\in(0,\frac{1}{2}]$. Let $c_{0}=\frac{\ln 5}{\delta}$ and $k_{0}=\frac{\ln(2\kappa)}{\ln(5/4)}+1$, where $\kappa$ is chosen as in Lemma 1. The following holds for every function $f_{c,k}$ with $c\geq c_{0}$ and $k\geq k_{0}$: For any $(\alpha,L)$-impurity function $\mathscr{G}$, $\delta$-balanced product distribution $\mathcal{D}$ and restriction $\pi$ of size $<2^{k}$ that only contains the memory bits of $f_{c,k}$, the purity gain $\mathscr{G}\text{-}\mathrm{purity}\text{-}\mathrm{gain}_{\mathcal{D}}((f_{c,k})_{\pi},\cdot)$ is maximized by a memory bit.

Proof of Lemma 4.

Fix $c\geq c_{0}$ and $k\geq k_{0}$, and write $f$ as shorthand for $f_{c,k}$. We will prove a stronger claim: with respect to $f_{\pi}$, every memory bit (that is not in $\pi$) gives a much higher purity gain than every addressing bit does.

Purity gain of the memory bits.

Fix a memory bit $y_{a}$ ($a\in\{0,1\}^{k}$) that does not appear in restriction $\pi$. Let $\mu_{b}=\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[f_{\pi,y_{a}=b}(x,y)\right]$ for $b\in\{0,1\}$. By the law of total expectation,

$$\begin{aligned}
\mu_{b} &=\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)=a\right]\cdot\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[f_{\pi,y_{a}=b}(x,y)\mid z(x)=a\right]+\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)\neq a\right]\cdot\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[f_{\pi,y_{a}=b}(x,y)\mid z(x)\neq a\right]\\
&=\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)=a\right]\cdot b+\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)\neq a\right]\cdot\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[f_{\pi}(x,y)\mid z(x)\neq a\right].
\end{aligned}$$

Here the second step holds since $f_{\pi,y_{a}=b}(x,y)$ evaluates to $b$ when the address $z(x)$ equals $a$, and $f_{\pi,y_{a}=b}$ agrees with $f_{\pi}$ when $z(x)\neq a$. Since only the first term above depends on $b$, we have

$$|\mu_{0}-\mu_{1}|=\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)=a\right]\geq 2^{-k}-5^{-k}\geq\frac{1}{2}\cdot 2^{-k},$$

where the second step follows from $c\geq c_{0}$ and Lemma 2. Finally, by Lemma 1, $\mathscr{G}\text{-}\mathrm{purity}\text{-}\mathrm{gain}_{\mathcal{D}}(f_{\pi},y_{a})\geq\frac{1}{\kappa}(\mu_{0}-\mu_{1})^{2}\geq\frac{1}{4\kappa}\cdot 2^{-2k}$.

Purity gain of the addressing bits.

Similarly, we fix an addressing bit $x_{i,j}$ and define the average $\mu_{b}=\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[f_{\pi,x_{i,j}=b}(x,y)\right]$. Since $\mathcal{D}$ is a product distribution, $\mu_{b}$ is equal to the conditional expectation $\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[f_{\pi}(x,y)\mid x_{i,j}=b\right]$. Then, by the law of total expectation, we can write $\mu_{b}$ as

$$\begin{aligned}
\mu_{b} &=\sum_{a\in\{0,1\}^{k}}\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)=a\mid x_{i,j}=b\right]\cdot\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[f_{\pi}(x,y)\mid z(x)=a,x_{i,j}=b\right]\\
&=\sum_{a\in\{0,1\}^{k}}\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)=a\mid x_{i,j}=b\right]\cdot\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[f_{\pi}(x,y)\mid z(x)=a\right].
\end{aligned}$$

Here the second step holds since $f_{\pi}(x,y)$ and $x_{i,j}$ are independent conditioning on the address $z(x)$; in other words, once we know the value of $z(x)$, it doesn’t matter how $x$ is set in determining the output of $f$.

Let $c_{a}$ denote $\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[f_{\pi}(x,y)\mid z(x)=a\right]$, and let $\mathcal{P}_{b}$ be the distribution of $z(x)$ conditioning on $x_{i,j}=b$. Then, $\mu_{b}$ is exactly given by $\operatorname*{\mathbb{E}}_{a\sim\mathcal{P}_{b}}\left[c_{a}\right]$. Since each $c_{a}$ is in $[0,1]$, $|\mu_{0}-\mu_{1}|$ is upper bounded by the total variation distance between $\mathcal{P}_{0}$ and $\mathcal{P}_{1}$:

$$\begin{aligned}
|\mu_{0}-\mu_{1}| &\leq\frac{1}{2}\sum_{a\in\{0,1\}^{k}}|\mathcal{P}_{0}(a)-\mathcal{P}_{1}(a)|\\
&\leq\frac{1}{2}\sum_{a\in\{0,1\}^{k}}\left(|\mathcal{P}_{0}(a)-2^{-k}|+|\mathcal{P}_{1}(a)-2^{-k}|\right)\\
&\leq\frac{1}{2}\cdot 2^{k}\cdot 2\cdot 5^{-k}=(2/5)^{k}. &&\text{(Lemma 2)}
\end{aligned}$$

Finally, applying Lemma 1 shows that $\mathscr{G}\text{-}\mathrm{purity}\text{-}\mathrm{gain}_{\mathcal{D}}(f_{\pi},x_{i,j})\leq\kappa(\mu_{0}-\mu_{1})^{2}\leq\kappa\cdot(2/5)^{2k}$.

Recall that $k\geq k_{0}>\frac{\ln(2\kappa)}{\ln(5/4)}$, so we have $\kappa\cdot(2/5)^{2k}<\frac{1}{4\kappa}\cdot 2^{-2k}$. Therefore, the purity gain of every memory bit outside the restriction is strictly larger than that of any addressing bit, and the lemma follows immediately. ∎

Remark 5.

The proof above bounds the purity gain of each memory bit and each addressing bit by $\Omega((1/2)^{2k})$ and $O((2/5)^{2k})$ respectively. For Lemma 4 to hold when the purity gains are estimated from a finite dataset, it suffices to argue that each estimate is accurate up to an $O((2/5)^{2k})$ additive error. By a standard concentration argument, to estimate the purity gains for all restrictions $\pi$ of size $\leq h$, $2^{O(h+k)}$ training examples are sufficient. When applied later in the proof of Theorem 4, this finite-sample version of Lemma 4 would imply that impurity-based algorithms need to build a tree of depth $h$ as soon as the sample size reaches $2^{\Omega(h+k)}$.

4.3 Proof of the Weaker Version

Now we are ready to prove the weaker version of Theorem 4. We will apply Lemma 4 to argue that the tree returned by an impurity-based algorithm never queries an addressing bit (unless all the $2^{k}$ memory bits have been queried), and then show that every such decision tree must have an error of $\Omega(\delta)$.

Proof of Theorem 4 (weaker version).

Fix integer $c\geq\frac{\ln 5}{\delta}$ and consider the functions $f_{c,1},f_{c,2},\ldots$. Since each $f_{c,k}$ is represented by a decision tree of depth $ck^{2}+1=O(k^{2}/\delta)$, it remains to show that impurity-based algorithms fail to learn $f_{c,k}$. Fix integer $k\geq k_{0}$ (where $k_{0}$ is chosen as in Lemma 4) and $\delta$-balanced product distribution $\mathcal{D}$ over the domain of $f_{c,k}$. In the following, we use $f$ as shorthand for $f_{c,k}$.

Small trees never query addressing bits.

Let $T$ be the decision tree returned by a $\mathscr{G}$-impurity-based algorithm when learning $f$ on $\mathcal{D}$. If $T$ has depth $>2^{k}$, we are done, so we assume that $T$ has depth at most $2^{k}$. We claim that $T$ never queries the addressing bits of $f$. Suppose otherwise, that an addressing bit is queried at node $v$ in $T$, and no addressing bits are queried by the ancestors of $v$. Then, the restriction $\pi_{v}$ associated with node $v$ only contains the memory bits of $f$. Since $T$ has depth $\leq 2^{k}$, the size of $\pi_{v}$ is strictly less than $2^{k}$. Then, by Lemma 4, the $\mathscr{G}$-purity gain with respect to $f_{v}$ is maximized by a memory bit. This contradicts the assumption that the algorithm is $\mathscr{G}$-impurity-based.

Trivial accuracy if no addressing bits are queried.

We have shown that $T$ only queries the memory bits of $f$. We may further assume that $T$ queries all the $2^{k}$ memory bits before reaching any of its leaves, i.e., $T$ is a full binary tree of depth $2^{k}$. This assumption is without loss of generality because we can add dummy queries on the memory bits to the leaves of depth $<2^{k}$, and label all the resulting leaves with the same bit. This change does not modify the function represented by $T$.

Assuming that $T$ is full, every leaf $\ell$ of $T$ specifies $2^{k}$ bits $(c_{a})_{a\in\{0,1\}^{k}}$, meaning that each memory bit $y_{a}$ is fixed to $c_{a}$ on the root-to-$\ell$ path. The expectation of the restricted function $f_{\ell}$ is then given by $\mu_{\ell}\coloneqq\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[c_{z(x)}\right]$. Clearly, the error of $T$ is minimized when each leaf $\ell$ is labeled with $\mathds{1}\left[\mu_{\ell}\geq\frac{1}{2}\right]$, and the conditional error when reaching leaf $\ell$ is $\min(\mu_{\ell},1-\mu_{\ell})$.

It remains to show that for a large fraction of leaves $\ell$, $\mu_{\ell}$ is bounded away from $0$ and $1$, so that $\min(\mu_{\ell},1-\mu_{\ell})$ is large. When leaf $\ell$ is randomly chosen according to distribution $\mathcal{D}$, the corresponding $\mu_{\ell}$ is given by

$$\mu_{\ell}=\sum_{a\in\{0,1\}^{k}}\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)=a\right]\cdot c_{a},\qquad(1)$$

where $(c_{a})_{a\in\{0,1\}^{k}}$ are $2^{k}$ independent Bernoulli random variables with means in $[\delta,1-\delta]$.

By Lemma 2 and our choice of $c\geq c_{0}$, $\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)=a\right]\leq 2\cdot 2^{-k}$ holds for every $a\in\{0,1\}^{k}$. Thus, each term in (1) is bounded between $0$ and $2\cdot 2^{-k}$. Furthermore, since each $c_{a}$ has expectation at least $\delta$, $\operatorname*{\mathbb{E}}\left[\mu_{\ell}\right]\geq\delta$. Then, Hoeffding’s inequality guarantees that over the random choice of $(c_{a})_{a\in\{0,1\}^{k}}$, $\mu_{\ell}\geq\delta/2$ holds with probability at least $1-\exp\left(-\frac{2\cdot(\delta/2)^{2}}{2^{k}\cdot(2\cdot 2^{-k})^{2}}\right)=1-\exp(-2^{k}\delta^{2}/8)$, which is lower bounded by $2/3$ for all sufficiently large $k$. By a symmetric argument, $\mu_{\ell}\leq 1-\delta/2$ also holds with probability $\geq 2/3$. Therefore, with probability $\geq 1/3$ over the choice of leaf $\ell$, $\mu_{\ell}\in[\delta/2,1-\delta/2]$ holds and thus the conditional error on leaf $\ell$ is at least $\delta/2$. This shows that the error of $T$ over distribution $\mathcal{D}$ is lower bounded by $\delta/6$, which completes the proof. ∎

5 Proof of Theorem 4

When proving the weaker version of Theorem 4, each hard instance $f_{c,k}$ has $\Theta(k^{2})$ addressing bits grouped into $k$ disjoint subsets, and the $k$-bit address is defined by the XOR of the bits in each subset. We will prove Theorem 4 using a slightly different construction that computes the address from $k$ overlapping subsets of only $O(k)$ addressing bits.

For integers $c,k\geq 1$ and a list of $k$ sets $S=(S_{1},S_{2},\ldots,S_{k})$ where each $S_{i}\subseteq[ck]$, we define a boolean function $f_{c,k,S}:\{0,1\}^{ck+2^{k}}\to\{0,1\}$ as follows. The input of $f_{c,k,S}$ is again divided into two parts: $ck$ addressing bits $x_{1},x_{2},\ldots,x_{ck}$ and $2^{k}$ memory bits $y_{a}$ indexed by a $k$-bit address $a$. The function value $f(x,y)$ is computed by taking $z_{i}(x)=\bigoplus_{j\in S_{i}}x_{j}$ and then $f(x,y)=y_{z(x)}$. Clearly, $f_{c,k,S}$ can be computed by a decision tree of depth $ck+1$ that first queries all the $ck$ addressing bits $x_{1},x_{2},\ldots,x_{ck}$, and then queries the relevant memory bit $y_{z(x)}$.

Let $\operatorname*{\triangle}_{i=1}^{k}S_{i}$ denote the $k$-ary symmetric difference of sets $S_{1}$ through $S_{k}$, i.e., the set of elements that appear in an odd number of sets. We say that a list of sets $S=(S_{1},S_{2},\ldots,S_{k})$ has distance $d$ if any non-empty collection of sets has a symmetric difference of size at least $d$, i.e., $\left|\operatorname*{\triangle}_{i\in I}S_{i}\right|\geq d$ for every non-empty $I\subseteq[k]$. In the following, we prove analogs of Lemmas 2 and 4 for the function $f_{c,k,S}$ assuming that $S$ has a large distance; Theorem 4 would then follow immediately.
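Both the modified construction and the distance condition translate directly into code. The sketch below (names ours) evaluates $f_{c,k,S}$ and checks the distance of a set family by brute force over all non-empty $I\subseteq[k]$, which is feasible only for small $k$:

```python
from functools import reduce
from itertools import combinations
from operator import xor

def f_ckS(x, y, S):
    # x: list of ck addressing bits; y: dict from k-bit tuples to memory bits;
    # S: list of k subsets of range(len(x)).
    z = tuple(reduce(xor, (x[j] for j in Si), 0) for Si in S)
    return y[z]

def distance(S):
    # Smallest symmetric difference over all non-empty sub-collections of S.
    k = len(S)
    sizes = []
    for r in range(1, k + 1):
        for idx in combinations(range(k), r):
            sym = reduce(lambda A, B: A ^ B, (set(S[i]) for i in idx))
            sizes.append(len(sym))
    return min(sizes)
```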

Lemma 6.

Suppose that $\mathcal{D}$ is a $\delta$-balanced product distribution over the domain of $f_{c,k,S}$ and $S$ has distance $d\geq\frac{\ln 5}{\delta}\cdot k$. Then,

$$\left|\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)=a\right]-2^{-k}\right|\leq 5^{-k},\quad\forall a\in\{0,1\}^{k}.$$

Furthermore, for every $i\in[ck]$ and $b\in\{0,1\}$,

$$\left|\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)=a\mid x_{i}=b\right]-2^{-k}\right|\leq 5^{-k},\quad\forall a\in\{0,1\}^{k}.$$

We prove Lemma 6 by noting that, under these assumptions, the distribution of $z(x)$ has exponentially small Fourier coefficients (except the degree-0 one), and is thus close to the uniform distribution over $\{0,1\}^{k}$. More concretely, our goal is to show that, for every $I\subseteq[k]$, the quantity $\bigoplus_{i\in I}z_{i}(x)$ equals $1$ with probability very close to $\frac{1}{2}$. Afterwards, we will show this is sufficient to guarantee that the distribution of $z(x)$ is close to the uniform distribution.

Proof of Lemma 6.

Since $z_{i}(x)=\bigoplus_{j\in S_{i}}x_{j}$, we have $\bigoplus_{i\in I}z_{i}(x)=\bigoplus_{j\in S_{I}}x_{j}$ for every $I\subseteq[k]$, where $S_{I}=\operatorname*{\triangle}_{i\in I}S_{i}$ is the symmetric difference of the corresponding sets. Since $S$ has distance $d$, $|S_{I}|\geq d$ for every non-empty $I\subseteq[k]$ and thus $\bigoplus_{i\in I}z_{i}(x)$ is the XOR of at least $d$ independent bits. Note that $1-2\bigoplus_{i\in I}z_{i}(x)=\prod_{i\in I}(1-2z_{i}(x))$. By Fact 3 and $d\geq\frac{\ln 5}{\delta}\cdot k$,

$$\left|\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[\prod_{i\in I}(1-2z_{i}(x))\right]\right|=2\cdot\left|\Pr_{(x,y)\sim\mathcal{D}}\left[\bigoplus_{i\in I}z_{i}(x)=1\right]-\frac{1}{2}\right|\leq\exp(-2\delta d)\leq 5^{-k}.\qquad(2)$$

Note that for $b_{1},b_{2}\in\{0,1\}$, we have $\mathds{1}\left[b_{1}=b_{2}\right]=\frac{(1-2b_{1})(1-2b_{2})+1}{2}$. Therefore, for every $a\in\{0,1\}^{k}$,

$$\begin{aligned}
\left|\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)=a\right]-2^{-k}\right| &=\left|\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[\prod_{i=1}^{k}\frac{(1-2a_{i})(1-2z_{i}(x))+1}{2}\right]-2^{-k}\right|\\
&=2^{-k}\left|\sum_{I\subseteq[k]}\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[\prod_{i\in I}(1-2a_{i})(1-2z_{i}(x))\right]-1\right| &&\text{(expansion of product and linearity)}\\
&=2^{-k}\left|\sum_{I\subseteq[k]:I\neq\emptyset}\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[\prod_{i\in I}(1-2a_{i})(1-2z_{i}(x))\right]\right| &&\text{(empty product equals $1$)}\\
&\leq 2^{-k}\sum_{I\subseteq[k]:I\neq\emptyset}\left|\prod_{i\in I}(1-2a_{i})\right|\cdot\left|\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[\prod_{i\in I}(1-2z_{i}(x))\right]\right| &&\text{(triangle inequality and linearity)}\\
&=2^{-k}\sum_{I\subseteq[k]:I\neq\emptyset}\left|\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[\prod_{i\in I}(1-2z_{i}(x))\right]\right| &&\text{($|1-2a_{i}|=1$)}\\
&\leq 2^{-k}\cdot(2^{k}-1)\cdot 5^{-k}<5^{-k}. &&\text{(Inequality (2))}
\end{aligned}$$

The proof of the “furthermore” part is the same, except that after conditioning on $x_{i}=b$, each $\bigoplus_{j\in I}z_{j}(x)$ is now the XOR of at least $d-1$ independent bits, and the remaining proof goes through. ∎

We note that the proof of Lemma 4 depends on the definition of $z(x)$ only through the application of Lemma 2. Thus, Lemma 6 directly implies the following analog of Lemma 4:

Lemma 7.

Fix $L\geq\alpha>0$ and $\delta\in(0,\frac{1}{2}]$. Let $c_{0}=\frac{\ln 5}{\delta}$ and $k_{0}=\frac{\ln(2\kappa)}{\ln(5/4)}+1$, where $\kappa$ is chosen as in Lemma 1. The following holds for every function $f_{c,k,S}$ such that $k\geq k_{0}$ and $S$ has distance $c_{0}k$: For any $(\alpha,L)$-impurity function $\mathscr{G}$, $\delta$-balanced product distribution $\mathcal{D}$ and restriction $\pi$ of size $<2^{k}$ that only contains the memory bits of $f_{c,k,S}$, the purity gain $\mathscr{G}\text{-}\mathrm{purity}\text{-}\mathrm{gain}_{\mathcal{D}}((f_{c,k,S})_{\pi},\cdot)$ is maximized by a memory bit.

Finally, we prove Theorem 4 by showing the existence of a set family $S$ with a good distance.

Proof of Theorem 4.

Fix $\delta\in(0,\frac{1}{2}]$. The Gilbert–Varshamov bound for binary linear codes implies that for some $c=\Theta(1/\delta)$, there exists a binary linear code with rate $\frac{1}{c}$ and relative distance $\frac{\ln 5}{\delta c}$. It follows that for every sufficiently large $k$, there exists $S^{(k)}=(S^{(k)}_{1},S^{(k)}_{2},\ldots,S^{(k)}_{k})$ such that each $S^{(k)}_{i}\subseteq[ck]$ and $S^{(k)}$ has distance $\frac{\ln 5}{\delta}\cdot k$. This can be done by using the $i$-th basis vector of the linear code as the indicator vector of the subset $S^{(k)}_{i}$ for each $i\in[k]$.

We prove Theorem 4 using functions $f_{c,1,S^{(1)}},f_{c,2,S^{(2)}},\ldots$. Since each $f_{c,k,S^{(k)}}$ can be represented by a decision tree of depth $ck+1=O(k/\delta)$, it remains to prove that impurity-based algorithms fail to learn $f_{c,k,S^{(k)}}$. Lemma 7 guarantees that the tree returned by such algorithms either has depth $>2^{k}$, or never queries any addressing bits. In the latter case, by the same calculation as in the proof of the weaker version, the decision tree must have an $\Omega(\delta)$ error on distribution $\mathcal{D}$. ∎

6 Proof of Theorem 5

We prove Theorem 5 using the construction of $f_{c,k,S}$ in Section 5, where $S=(S_{1},S_{2},\ldots,S_{k})$ is a list of $k$ subsets of $[ck]$ and each $S_{i}$ specifies how the $i$-th bit of the address, $z_{i}(x)$, is computed from the addressing bits $x_{1}$ through $x_{ck}$. Note that $f_{c,k,S}$ itself depends on $\Omega(2^{k})$ input bits and is thus not an $O(k)$-junta. Nevertheless, we will show that, after we fix most of the memory bits of $f_{c,k,S}$, the function is indeed close to a $(ck)$-junta with relevant inputs being the $ck$ addressing bits. Then, as in the proof of Theorem 4, we will argue that impurity-based heuristics still query the (unfixed) memory bits before querying any of the addressing bits, resulting in a tree that is either exponentially deep or far from the target function.

Proof of Theorem 5.

As in the proof of Theorem 4, we can find functions $f_{c,1,S^{(1)}},f_{c,2,S^{(2)}},\ldots$ for some $c=\Theta(1/\delta)$ such that each $S^{(k)}$ has distance $\geq\frac{\ln 5}{\delta}\cdot k$. We fix a sufficiently large integer $k$ and write $f$ as shorthand for $f_{c,k,S^{(k)}}$ in the following.

Partition $\{0,1\}^{k}$ into three sets $A^{0}$, $A^{1}$ and $A^{\textrm{free}}$ such that $|A^{0}|=|A^{1}|$ and $\varepsilon\cdot 2^{k-2}\leq|A^{\textrm{free}}|\leq\varepsilon\cdot 2^{k-1}$. Consider the restriction $\pi$ of the function $f$ such that the memory bit $y_{a}$ is fixed to be $0$ for every $a\in A^{0}$ and fixed to be $1$ for every $a\in A^{1}$; the memory bits with addresses in $A^{\textrm{free}}$ are left as “free” variables. We will prove the theorem using $f_{\pi}$ as the $k$-th function in the family.

$f_{\pi}$ is close to a junta.

Consider the function $g:\{0,1\}^{ck+2^{k}}\to\{0,1\}$ defined as $g(x,y)=\mathds{1}\left[z(x)\in A^{1}\right]$, where $z(x)$ denotes $(z_{1}(x),z_{2}(x),\ldots,z_{k}(x))$ and each $z_{i}(x)=\bigoplus_{j\in S^{(k)}_{i}}x_{j}$. Clearly, $g(x,y)$ only depends on $x\in\{0,1\}^{ck}$ and is thus a $(ck)$-junta. Furthermore, for every input $(x,y)$ such that $z(x)\in A^{0}$ (resp. $z(x)\in A^{1}$), both $f_{\pi}$ and $g$ evaluate to $0$ (resp. $1$). Thus, $f_{\pi}$ and $g$ may disagree only if $z(x)\in A^{\textrm{free}}$. It follows that for every $\delta$-balanced product distribution $\mathcal{D}$,

$$\begin{aligned}
\Pr_{(x,y)\sim\mathcal{D}}\left[f_{\pi}(x,y)\neq g(x,y)\right] &\leq\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)\in A^{\textrm{free}}\right]\\
&\leq|A^{\textrm{free}}|\cdot(2^{-k}+5^{-k}) &&\text{(Lemma 6)}\\
&\leq\varepsilon\cdot 2^{k-1}\cdot(2^{-k}+5^{-k})<\varepsilon. &&\text{($|A^{\textrm{free}}|\leq\varepsilon\cdot 2^{k-1}$)}
\end{aligned}$$

Therefore, $f_{\pi}$ is $\varepsilon$-close to an $O(k/\delta)$-junta (namely, $g$) with respect to distribution $\mathcal{D}$.

Impurity-based algorithms fail to learn $f_{\pi}$.

Let $T$ be the decision tree returned by a $\mathscr{G}$-impurity-based algorithm when learning $f_{\pi}$ on distribution $\mathcal{D}$. By Lemma 7, $T$ must query all the free memory bits with addresses in $A^{\textrm{free}}$ before querying any of the addressing bits. Thus, either $T$ has depth $>|A^{\textrm{free}}|=\Omega(\varepsilon\cdot 2^{k})$, or $T$ only queries the free memory bits of $f_{\pi}$.

In the latter case, we may again assume without loss of generality that $T$ queries all the free memory bits $(y_{a})_{a\in A^{\textrm{free}}}$ before reaching any of its leaves, i.e., $T$ is a full binary tree of depth $|A^{\textrm{free}}|$. Then, every leaf $\ell$ naturally specifies $2^{k}$ bits $(c_{a})_{a\in\{0,1\}^{k}}$ defined as

$$c_{a}=\begin{cases}0,&a\in A^{0},\\ 1,&a\in A^{1},\\ b,&a\in A^{\textrm{free}}\text{ and }y_{a}\text{ is fixed to }b\text{ on the root-to-}\ell\text{ path.}\end{cases}$$

Let $\mu_{\ell}\coloneqq\operatorname*{\mathbb{E}}_{(x,y)\sim\mathcal{D}}\left[c_{z(x)}\right]$. Again, the minimum possible error conditioning on reaching leaf $\ell$ is $\min(\mu_{\ell},1-\mu_{\ell})$, achieved by labeling $\ell$ with $\mathds{1}\left[\mu_{\ell}\geq\frac{1}{2}\right]$. On the other hand, we have

$$\begin{aligned}
\mu_{\ell} &\geq\Pr_{(x,y)\sim\mathcal{D}}\left[z(x)\in A^{1}\right]\\
&\geq|A^{1}|\cdot(2^{-k}-5^{-k}) &&\text{(Lemma 6)}\\
&\geq\frac{2^{k}-|A^{\textrm{free}}|}{2}\cdot 2^{-(k+1)} &&\text{($2|A^{1}|+|A^{\textrm{free}}|=2^{k}$)}\\
&\geq\frac{2^{k}-2^{k-1}}{2}\cdot 2^{-(k+1)}=\frac{1}{8}, &&\text{($|A^{\textrm{free}}|\leq\varepsilon\cdot 2^{k-1}\leq 2^{k-1}$)}
\end{aligned}$$

and a similar calculation shows $\mu_{\ell}\leq\frac{7}{8}$. We conclude that the error of the decision tree $T$ over distribution $\mathcal{D}$ is at least $\frac{1}{8}=\Omega(1)$. ∎

7 Conclusion

We have constructed target functions for which greedy decision tree learning heuristics fail badly, even in the smoothed setting. Our lower bounds complement and strengthen the parity-of-two-features example discussed in the introduction, which showed that these heuristics fail badly in the non-smoothed setting.

It can be reasonably argued that real-world data sets do not resemble the target functions considered in this paper or the parity-of-two-features example. Perhaps the sought-for guarantee ($\diamondsuit$), while false for certain target functions even in the smoothed setting, is nonetheless true for broad and natural classes of targets? It would be interesting to reexamine, through the lens of smoothed analysis, the provable guarantees that have been established for restricted classes of functions. For example, can the guarantees of [BLT20b, BLT20a] for monotone target functions and product distributions be further strengthened in the smoothed setting? The target functions considered in this paper, as well as the parity-of-two-features example, are non-monotone.

Acknowledgements

We thank the anonymous reviewers, whose suggestions have helped improve this paper.

References

  • [BDM19] Alon Brutzkus, Amit Daniely, and Eran Malach. On the Optimality of Trees Generated by ID3. ArXiv, abs/1907.05444, 2019.
  • [BDM20] Alon Brutzkus, Amit Daniely, and Eran Malach. ID3 learns juntas for smoothed product distributions. In Proceedings of the 33rd Annual Conference on Learning Theory (COLT), pages 902–915, 2020.
  • [BFJ+94] Avrim Blum, Merrick Furst, Jeffrey Jackson, Michael Kearns, Yishay Mansour, and Steven Rudich. Weakly learning DNF and characterizing statistical query learning using Fourier analysis. In Proceedings of the 26th Annual ACM Symposium on Theory of Computing (STOC), pages 253–262, 1994.
  • [BFSO84] Leo Breiman, Jerome Friedman, Charles Stone, and Richard Olshen. Classification and regression trees. Wadsworth International Group, 1984.
  • [BGLT20] Guy Blanc, Neha Gupta, Jane Lange, and Li-Yang Tan. Universal guarantees for decision tree induction via a higher-order splitting criterion. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), 2020.
  • [BLT20a] Guy Blanc, Jane Lange, and Li-Yang Tan. Provable guarantees for decision tree induction: the agnostic setting. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020. Available at https://arxiv.org/abs/2006.00743.
  • [BLT20b] Guy Blanc, Jane Lange, and Li-Yang Tan. Top-down induction of decision trees: rigorous guarantees and inherent limitations. In Proceedings of the 11th Innovations in Theoretical Computer Science Conference (ITCS), volume 151, pages 1–44, 2020.
  • [Blu92] Avrim Blum. Rank-$r$ decision trees are a subclass of $r$-decision lists. Inform. Process. Lett., 42(4):183–185, 1992.
  • [Bsh93] Nader Bshouty. Exact learning via the monotone theory. In Proceedings of 34th Annual Symposium on Foundations of Computer Science (FOCS), pages 302–311, 1993.
  • [CM19] Sitan Chen and Ankur Moitra. Beyond the low-degree algorithm: mixtures of subcubes and their applications. In Proceedings of the 51st Annual ACM Symposium on Theory of Computing (STOC), pages 869–880, 2019.
  • [DKM96] Tom Dietterich, Michael Kearns, and Yishay Mansour. Applying the weak learning framework to understand and improve C4.5. In Proceedings of the 13th International Conference on Machine Learning (ICML), pages 96–104, 1996.
  • [EH89] Andrzej Ehrenfeucht and David Haussler. Learning decision trees from random examples. Information and Computation, 82(3):231–246, 1989.
  • [FP04] Amos Fiat and Dmitry Pechyony. Decision trees: More theoretical justification for practical algorithms. In Proceedings of the 15th International Conference on Algorithmic Learning Theory (ALT), pages 156–170, 2004.
  • [GKK08] Parikshit Gopalan, Adam Kalai, and Adam Klivans. Agnostically learning decision trees. In Proceedings of the 40th ACM Symposium on Theory of Computing (STOC), pages 527–536, 2008.
  • [Han93] Thomas Hancock. Learning $k\mu$ decision trees on the uniform distribution. In Proceedings of the 6th Annual Conference on Computational Learning Theory (COLT), pages 352–360, 1993.
  • [HJLT96] Thomas Hancock, Tao Jiang, Ming Li, and John Tromp. Lower bounds on learning decision lists and trees. Information and Computation, 126(2):114–122, 1996.
  • [HKY18] Elad Hazan, Adam Klivans, and Yang Yuan. Hyperparameter optimization: A spectral approach. Proceedings of the 6th International Conference on Learning Representations (ICLR), 2018.
  • [JS06] Jeffrey C. Jackson and Rocco A. Servedio. On learning random DNF formulas under the uniform distribution. Theory of Computing, 2(8):147–172, 2006.
  • [Kea96] Michael Kearns. Boosting theory towards practice: recent developments in decision tree induction and the weak learning framework (invited talk). In Proceedings of the 13th National Conference on Artificial intelligence (AAAI), pages 1337–1339, 1996.
  • [KM93] Eyal Kushilevitz and Yishay Mansour. Learning decision trees using the Fourier spectrum. SIAM Journal on Computing, 22(6):1331–1348, December 1993.
  • [KM96] Michael Kearns and Yishay Mansour. On the boosting ability of top-down decision tree learning algorithms. In Proceedings of the 28th Annual Symposium on the Theory of Computing (STOC), pages 459–468, 1996.
  • [KM99] Michael Kearns and Yishay Mansour. On the boosting ability of top-down decision tree learning algorithms. Journal of Computer and System Sciences, 58(1):109–128, 1999.
  • [KS06] Adam Klivans and Rocco Servedio. Toward attribute efficient learning of decision lists and parities. Journal of Machine Learning Research, 7(Apr):587–602, 2006.
  • [KST09] Adam Kalai, Alex Samorodnitsky, and Shang-Hua Teng. Learning and smoothed analysis. In Proceedings of the 50th Annual IEEE Symposium on Foundations of Computer Science (FOCS), pages 395–404, 2009.
  • [Lee09] Homin Lee. On the learnability of monotone functions. PhD thesis, Columbia University, 2009.
  • [MR02] Dinesh Mehta and Vijay Raghavan. Decision tree approximations of boolean functions. Theoretical Computer Science, 270(1-2):609–623, 2002.
  • [OS07] Ryan O’Donnell and Rocco Servedio. Learning monotone decision trees in polynomial time. SIAM Journal on Computing, 37(3):827–844, 2007.
  • [Qui86] Ross Quinlan. Induction of decision trees. Machine learning, 1(1):81–106, 1986.
  • [Qui93] Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
  • [Riv87] Ronald Rivest. Learning decision lists. Machine learning, 2(3):229–246, 1987.
  • [ST04] Daniel A. Spielman and Shang-Hua Teng. Smoothed analysis of algorithms: Why the simplex algorithm usually takes polynomial time. Journal of the ACM, 51(3):385–463, 2004.
  • [WFHP16] Ian Witten, Eibe Frank, Mark Hall, and Christopher Pal. Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, 2016.
  • [WKQ+08] Xindong Wu, Vipin Kumar, J Ross Quinlan, Joydeep Ghosh, Qiang Yang, Hiroshi Motoda, Geoffrey J McLachlan, Angus Ng, Bing Liu, S Yu Philip, et al. Top 10 algorithms in data mining. Knowledge and information systems, 14(1):1–37, 2008.