
On the equivalence of Occam algorithms

Zaman Keinath-Esmail
Abstract

Blumer et al. (1987, 1989) showed that any concept class that is learnable by Occam algorithms is PAC learnable. Board and Pitt (1990) showed a partial converse of this theorem: for concept classes that are closed under exception lists, any class that is PAC learnable is learnable by an Occam algorithm. However, their Occam algorithm outputs a hypothesis whose complexity is $\delta$-dependent, which is an important limitation. In this paper, we show that their partial converse applies to Occam algorithms with $\delta$-independent complexities as well. Thus, we provide a posteriori justification of various theoretical results and algorithm design methods which use the partial converse as a basis for their work.

keywords:
Occam algorithms, polynomial learnability, PAC learning, statistical learning theory
MSC:
[2010] 68Q32, 68W01

Funding: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
journal: Information and Computation
Affiliation: Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Beecroft Building, Parks Road, Oxford, OX1 3PU, UK

1 Introduction

For decades, there has been debate regarding the necessity of simplicity for machine learning [1, 2, 3]. Many of these analyses have focused on the implications and uses of complexity-based algorithms defined by Blumer et al. in two seminal papers [4, 5]. Their algorithms were defined such that they achieved zero training error on a sample, and outputted a hypothesis whose complexity (VC dimension for continuous alphabets; description length for discrete ones) was at most a polynomial in the target concept complexity, multiplied by a sublinear factor in the sample size. These “Occam algorithms” are weak approximations of the minimum-consistent-hypothesis problem [6]. In this paper, we focus on the continuous-alphabet Occam algorithms.

In 1989, Blumer et al. [5] showed that if a concept class was learnable by their Occam algorithm, then it was polynomially learnable; they left open the question of whether the converse of this theorem was true. Board and Pitt [6] proved a partial converse, namely that, for concept classes closed under exception lists, if a class was polynomially learnable in general, then it was learnable by an Occam algorithm. This proof suggested that for concept classes which are closed under exception lists, learnability is equivalent to weak approximability of the minimum-consistent-hypothesis problem [6]. This equivalence formed the basis for subsequent theoretical work analysing the learnability of DFAs [7] and decision trees and lists [8], as well as motivating the design of practical algorithms such as prediction algorithms for optimal data prefetching [9].

However, the Occam algorithm model that was used by Board and Pitt [6] was different to the one defined by Blumer et al. [5]: the former is a construction based on a functional algorithm requiring $\epsilon$, $\delta$, $s$, and the training sample as explicit inputs, while Blumer et al.'s model requires only the training sample [5]. While it has been shown that many frequently studied PAC algorithms form an equivalence class [10], it is not clear whether Board and Pitt's [6] version is a member of this class. Further, Board and Pitt later remove a degree of freedom by defining $\epsilon$ in terms of $m$. In some cases, changing degrees of freedom can alter whether an algorithm always halts or just usually halts: even if Board and Pitt's model of learnability falls under the aforementioned equivalence class, it is unclear whether the equivalent algorithm in Blumer et al.'s definition would always halt [10, 6, 5].

In addition, the complexity of the output of Board and Pitt's Occam algorithm is $\delta$-dependent [6], while in Blumer et al.'s definition it is not [5]. Thus, it is not immediately clear whether Board and Pitt's [6] conclusions hold for more commonly used polynomial PAC learning models and $\delta$-independent output complexities (and are hence a general property of polynomial learnability and Occam algorithms), or whether their theorem results purely from the specific choice of algorithmic model. Therefore, it may not be justified to use their results as the basis for further research on general learnability and algorithm construction, as is currently the case [9, 8, 7].

In this paper, we show that the polynomial PAC model used by Board and Pitt [6] is equivalent to the models used by Blumer et al. and Haussler et al. [5, 10], and that we can construct an Occam algorithm from such a functional algorithm so that its output complexity is $\delta$-independent, thereby explicitly linking the work of Board and Pitt [6] to that of Blumer et al. [5]. Our theorem provides the link needed to justify the use of Board and Pitt's [6] work as a theoretical basis from which to further analyse questions of learnability and algorithm design by showing that their result holds irrespective of the specific algorithmic model they chose.

We will begin by defining concept classes, parameters such as $s$ and $\delta$, and our choice of complexity measure, the Vapnik-Chervonenkis (VC) dimension, along with relevant lemmas. Next, we define functional and oracle algorithmic models and provide an equivalence theorem due to Haussler et al. [10] linking the two. We proceed to provide the definitions of Occam algorithms used by Blumer et al. [5] and Board and Pitt [6], pointing out the differences and showing that the latter is not obviously equivalent to the former. Finally, we prove that the two Occam algorithm definitions are, in fact, equivalent, thus justifying general analyses of learnability based on Board and Pitt's work [6].

1.1 A note on algorithm names

It is worth pausing before the next section to explicitly list the algorithmic models we will be discussing in the remainder of this paper. We will distinguish between four models: $\delta,s$-known functional algorithms, $\delta,s$-unknown functional algorithms, $\delta$-dependent Occam algorithms, and $\delta$-independent Occam algorithms.

The $\delta,s$-unknown functional algorithm is the standard definition of a functional algorithm used by Blumer et al. and Haussler et al. [5, 10], given in Definition 2. The $\delta,s$-known functional algorithm is a version used only by Board and Pitt [6]. We will need to show an equivalence between these two models, as the $\delta,s$-known functional algorithm is not one of the models contained in the equivalence class provided by Haussler et al. in Theorem 5. This result will be Lemma 6.

The $\delta$-dependent and $\delta$-independent Occam algorithms, on the other hand, refer not to whether $\delta$ is an explicit input, but to whether the VC dimension of the hypothesis space is polynomially dependent on $\frac{1}{\delta}$ or not. The $\delta$-independent version is used by Blumer et al. [5], while Board and Pitt [6] use a $\delta$-dependent version. The main result of our paper will be extending the analysis conducted by Board and Pitt to apply to $\delta$-independent Occam algorithms as well, resulting in Theorem 7.

2 Notation and definitions

2.1 Terminology

Let $X$ be a stratified learning domain (i.e. we can write it as $X=\cup_{n\geq 1}X_{n}$, where each $X_{n}$ has dimension $n$). A concept is a subset of $X$; we can consider it as labelling all elements of $X$ with either a 1 or a 0, depending on whether the element is in the concept or not.

Then let $C\subseteq 2^{X}$ be a well-behaved (in a measure-theoretic sense made explicit in Blumer et al.'s work [5]) class of concepts on $X$, and let $\textbf{size}:C\rightarrow\mathbb{Z}^{+}$ be a concept complexity measure. Let there also be associated with $C$ a set of representations for concepts in $C$ given in some representation language $\Gamma$. We assume there is a function $\sigma$ from the set of representations in $\Gamma$ onto $C$, that $\sigma$ is in $\mathbf{P}$ (or in $\mathbf{RP}$), and that given $x\in X$ and a string in $\Gamma$, we can decide in polynomial time if $x$ is in the concept represented by the string (i.e. $C$ is polynomially evaluable). Finally, we say that $C$ is stratified, i.e. $C=\cup_{n\geq 1}C_{n}$.

By $\mathcal{C}$ we denote $(X,C)$ along with their function $\textbf{size}$, $\sigma$ and $\Gamma$, so $\mathcal{C}=(X,C,\Gamma,\sigma,\textbf{size})$. We include $\Gamma$ because learnability may in fact be a property of the representation of a set of concepts, not just the concepts themselves [6]. We use the term concept class to refer to $\mathcal{C}$ as well as to $C$. We say a concept $c$ is in $\mathcal{C}$ if $c\in C$.

We define a hypothesis class $\mathcal{H}$ where, as with $\mathcal{C}$, $\mathcal{H}=(X,H,\Gamma,\sigma,\textbf{size})$, where $H\subseteq 2^{X}$ is also a class of concepts on $X$ but need not be related to $C$.

We use $\mathcal{P}$ to represent probability distributions on the domain $X$. A sample set taken according to $\mathcal{P}$ on $X$ is denoted $M$ and contains $m$ tuples (or points) $(x,c(x))$; we say $M$ has cardinality $m$. Denote the set of $x$-values in $M$ by the set $Z$. The error, or accuracy parameter, $\epsilon$ of a hypothesis $h$ with respect to a concept $c$ is $\epsilon=\mathcal{P}(h\triangle c)$, where $\triangle$ denotes symmetric difference. In words, $\epsilon$ is the probability due to $\mathcal{P}$ of the symmetric difference of the hypothesis and the concept.

We cannot directly measure $\epsilon$, as it is a property over the entire domain and the concept $c$ is unknown. Therefore, it is useful to define a training error $\epsilon^{\prime}=\ell(h(Z)\triangle c(Z))$ for some loss function $\ell$. We say a hypothesis is consistent if $\epsilon^{\prime}$ is zero on a sample. Then $\delta$ is the probability that the algorithm fails to return a consistent hypothesis.

Finally, we assume that $\mathbf{RP}\neq\mathbf{NP}$; otherwise the learnability of any concept class would be trivial.

2.2 Vapnik-Chervonenkis dimension

We have in Section 2.1 merely asserted the existence of a concept complexity measure, $\textbf{size}$, without describing any of its properties other than its domain and range. For the arguments presented in this paper we do not require any specific choice of $\textbf{size}$: it can be semantic (measuring breadth of explanatory power) or syntactic (e.g. the length of the description of a hypothesis in $\Gamma$), or something else entirely. We do make use of the (assumed) property that an algorithm cannot output a hypothesis with $\textbf{size}$ greater than its runtime [6]. What will primarily matter, though, is how we can map from $\textbf{size}$ to a combinatorial, semantic complexity measure called the Vapnik-Chervonenkis (VC) dimension [6, 5].

Definition 1.

Given a nonempty concept class $\mathbf{C}=(C,X)$ with $C\subseteq 2^{X}$ and a set of points $S\subseteq X$, $\Pi_{C}(S)$ denotes the set of all subsets of $S$ that can be obtained by intersecting $S$ with a concept in $C$. That is, $\Pi_{C}(S)=\{S\cap c:c\in C\}$. If $\Pi_{C}(S)=2^{S}$, we say that $S$ is shattered by $C$. The VC dimension of $C$ is the cardinality of the largest set of points $S\subseteq X$ that is shattered by $C$. If arbitrarily large sets of points can be shattered, then the VC dimension of $C$ is infinite. $C$ is trivial if it consists of only one concept, or of two disjoint concepts $c_{1}$ and $c_{2}$ such that $c_{1}\cup c_{2}=X$ (such concept classes require only one example to learn) [5].
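
As a concrete illustration of Definition 1 (this sketch and all names in it are ours and are not part of the formal development), the following Python fragment checks shattering by brute force and computes the VC dimension of a toy class of threshold concepts on a small finite domain; as expected, no two-point set can be shattered by thresholds.

    from itertools import combinations

    def labelings(concepts, S):
        """Pi_C(S): the distinct subsets S ∩ c obtainable from concepts c in C."""
        return {frozenset(S & c) for c in concepts}

    def is_shattered(concepts, S):
        """S is shattered by C iff Pi_C(S) = 2^S, i.e. there are 2^|S| labelings."""
        return len(labelings(concepts, S)) == 2 ** len(S)

    def vc_dimension(concepts, domain):
        """Cardinality of the largest subset of `domain` shattered by `concepts`
        (brute force, so only feasible for tiny finite domains)."""
        best = 0
        for k in range(len(domain) + 1):
            if any(is_shattered(concepts, set(S)) for S in combinations(domain, k)):
                best = k
        return best

    # Toy example: threshold concepts c_t = {x : x >= t} on the domain {0,...,5}.
    X = list(range(6))
    thresholds = [frozenset(x for x in X if x >= t) for t in range(7)]
    print(vc_dimension(thresholds, X))  # prints 1: no two-point set is shattered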

Lemma 1.

For a finite hypothesis class $\mathcal{H}$, the VC dimension of $\mathcal{H}$ is bounded by $\text{VCdim}(\mathcal{H})\leq\log(|H|)$ [11].

Lemma 2.

Let $\mathcal{H}$ be a hypothesis class with $\text{VCdim}(\mathcal{H})\leq d<\infty$, and let $\tau_{\mathcal{H}}(m)=\max_{|M|=m}|\mathcal{H}_{M}|=\max_{|M|=m}|\Pi_{H}(M)|$ be the size of the effective hypothesis space, namely the number of functions from a training set $M$ of size $m$ to $\{0,1\}$ that can be obtained by restricting $\mathcal{H}$ to $M$. Then $\forall\ m$, $\tau_{\mathcal{H}}(m)\leq\Sigma_{i=0}^{d}\binom{m}{i}$, and for $m\geq d+1$ we have $\tau_{\mathcal{H}}(m)\leq(em/d)^{d}$ [11]. Partially restated, we can say that $\tau_{\mathcal{H}}(m)\leq m^{d}+1$ [6].
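
The bounds in Lemma 2 are easy to sanity-check numerically; the short Python fragment below (ours, purely illustrative) compares the exact Sauer-Shelah sum with the two closed-form estimates for an arbitrarily chosen $d=3$.

    from math import comb, e

    def sauer_sum(m, d):
        """Exact Sauer-Shelah bound: sum_{i=0}^{d} C(m, i)."""
        return sum(comb(m, i) for i in range(d + 1))

    # Compare the exact sum with the two closed-form estimates of Lemma 2
    # (d = 3 is an arbitrary choice; all three columns bound tau_H(m)).
    d = 3
    for m in (4, 8, 16, 32):
        print(m, sauer_sum(m, d), (e * m / d) ** d, m ** d + 1)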

Lemma 3.

Let $H\subseteq 2^{X}$ be a nonempty hypothesis class, let $\mathcal{P}$ be a probability distribution on $X$, and let $c\subseteq X$ be the target concept. Then for any $\epsilon>0$ and $m\geq 1$, given $m$ independent examples drawn according to $\mathcal{P}$, the probability that there exists a hypothesis in $H$ that is consistent with all of these examples and has error greater than $\epsilon$ is at most $2\tau_{\mathcal{H}}(2m)2^{-\epsilon m/2}$ [5].
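
For orientation only (this rearrangement is ours and is not used in what follows), Lemma 3 combines with Lemma 2 in the usual way to give a sufficient sample size: with $d$ a bound on $\text{VCdim}(\mathcal{H})$ as in Lemma 2, we have $\tau_{\mathcal{H}}(2m)\leq(2m)^{d}+1\leq 2(2m)^{d}$, so the failure probability is at most $\delta$ whenever

4(2m)^{d}2^{-\epsilon m/2}\leq\delta,\qquad\text{i.e. whenever}\qquad m\geq\frac{2}{\epsilon}\Bigl(d\log_{2}(2m)+\log_{2}\tfrac{4}{\delta}\Bigr).

Blumer et al. [5] derive sharper explicit bounds of this type.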

Lemma 4.

Let $\mathbf{H}=(H,X)$ have VC dimension $d$. Let $H^{\triangle,l}=\{h\triangle E:h\in H,E\subseteq X,|E|\leq l\}$. If $d_{l}\geq 2$ is the VC dimension of $H^{\triangle,l}$, then $\frac{d_{l}}{\log(d_{l})}\leq d+l+2$.

2.3 Functional and oracle algorithms

Definitions similar to the following can be found in papers by Blumer et al., Valiant, and others [5, 10, 12].

Definition 2 (Functional polynomial learnability).

For a concept class $\mathcal{C}$ defined as in 2.1, we say that $\mathcal{C}$ is properly polynomially learnable (poly-learnable) in the functional model if $\exists$ a polynomial-time learning algorithm $A$ that takes as input a sample of a concept in $\mathcal{C}$, outputs a hypothesis in $\mathcal{C}$, and has the property that, $\forall\ 0<\epsilon,\delta<1$ and $n,s\geq 1$, $\exists$ a sample size $m(\epsilon,\delta,n,s)$, polynomial in $1/\epsilon$, $1/\delta$, $n$, and $s$, such that, $\forall$ target concepts $c\in C_{n}$ with $\textbf{size}(c)\leq s$ and $\forall$ probability distributions $P$ on $X$, given a random sample of $c$ of size $m(\epsilon,\delta,n,s)$ drawn independently according to $P$, $A$ produces, with probability at least $1-\delta$, a hypothesis $h\in C_{n}$ that has error at most $\epsilon$ [5].

A functional polynomial learning algorithm receives only the sample of size $m$; it has no access to $\epsilon$, $\delta$, or $s$ [10]. It can always determine $n$ from the dimension of the data points it is given [5]. In the oracle model, however, the algorithm receives at least $\epsilon$ and $\delta$ as input and has access to an oracle $EX()$ that with each call returns a point in $X$ along with a label 0 or 1 indicating whether or not the point is in a fixed target concept $c\in C_{n}$ with $\textbf{size}(c)\leq s$. Each oracle call takes unit time. If $A$ is probabilistic, each call to a fair coin flip also takes unit time. After some time, $A$ halts and outputs a hypothesis in $C$ [5, 10].
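
To make the two calling conventions concrete, the following Python sketch (all class and function names are ours, and the threshold rule and sample-size formula are placeholders rather than claims about any particular learner) contrasts a functional learner, which sees only an $m$-sample, with an oracle learner, which receives $\epsilon$ and $\delta$ and draws its own examples through EX().

    from typing import Callable, List, Tuple

    Example = Tuple[float, int]            # a labelled point (x, c(x))
    Hypothesis = Callable[[float], int]    # a predictor X -> {0, 1}

    def functional_learner(sample: List[Example]) -> Hypothesis:
        """Functional model: the only input is the m-sample; epsilon, delta and s
        are never seen.  A toy threshold rule stands in for a real learner."""
        positives = [x for x, label in sample if label == 1]
        t = min(positives) if positives else float("inf")
        return lambda x: int(x >= t)

    def oracle_learner(epsilon: float, delta: float,
                       EX: Callable[[], Example]) -> Hypothesis:
        """Oracle model: epsilon and delta are explicit inputs, and the learner
        itself decides how many unit-cost EX() calls to make (the sample-size
        formula below is a placeholder, not a proven bound)."""
        m = int((2.0 / epsilon) * (1.0 + 1.0 / delta)) + 1
        sample = [EX() for _ in range(m)]
        return functional_learner(sample)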

Definition 3 (Oracle polynomial learnability).

For a concept class $\mathcal{C}$ defined as in 2.1, we say that $\mathcal{C}$ is properly polynomially learnable (poly-learnable) in the oracle model if $\exists$ an algorithm $A$ with the property that, $\forall\ 0<\epsilon,\delta<1$ and $n,s\geq 1$, $\exists$ a time bound $T_{A}(\epsilon,\delta,n,s)$, polynomial in $1/\epsilon$, $1/\delta$, $n$, and $s$, such that, $\forall$ target concepts $c\in C_{n}$ with $\textbf{size}(c)\leq s$ and $\forall$ probability distributions $P$ on $X$, $A$ runs in time $T_{A}(\epsilon,\delta,n,s)$ and produces, with probability at least $1-\delta$, a hypothesis $h\in C_{n}$ that has error at most $\epsilon$ [5].

These models of learnability can easily be extended to situations in which we are learning $\mathcal{C}$ by $\mathcal{H}$, as opposed to just $\mathcal{C}$ by itself [5, 10]. The following theorem, due to Haussler et al. [10], shows the equivalence between the different models of polynomial learnability. The property “usually halts” refers to probabilistic oracle algorithms that use fair coin tosses to decide whether to call the oracle again or halt.

Theorem 5.

If $\mathcal{C}$ is learnable by hypothesis class $\mathcal{H}$ in any of the following algorithmic models, it is learnable by $\mathcal{H}$ in all of them:

  1. functional($p_{1}$)

  2. oracle($p_{1},p_{2},p_{3}$),

where we have properties:

  • $p_{1}\in\{\textit{deterministic},\textit{randomised}\}$

  • $p_{2}\in\{\textit{s-known},\textit{s-unknown}\}$

  • $p_{3}\in\{\textit{always halts},\textit{usually halts}\}$

and the stipulation that $p_{2}=$ “s-unknown” only when $p_{3}=$ “usually halts” [10].

Definition 4 (Polynomial PAC learnability).

If a class is learnable in one (and hence all) of the algorithmic models in Theorem 5, then we simply call it “polynomially PAC learnable” without specifying the particular model in which it is learnable.

2.4 Occam algorithms

The following definition is used by Blumer et al. in their paper introducing VC-dimension-based Occam algorithms [5]. Note that this is clearly a functional algorithm, matching Definition 2.

Definition 5 (δ\delta-independent Occam algorithm).

Let $A$ be a polynomial learning algorithm that, given a sample of a concept $c\in\mathcal{C}$, produces a hypothesis consistent with the sample. For concept class $\mathcal{C}$ and all $s,n,m\geq 1$, let $S_{\mathcal{C},s,n,m}$ denote the set of all $m$-samples of concepts $c\in C_{n}$ such that $\textbf{size}(c)\leq s$. For polynomial-time learning algorithm $A$, let $\mathcal{C}^{A}_{s,n,m}\subseteq\mathcal{C}$ denote the $A$-image of $S_{\mathcal{C},s,n,m}$, i.e., the set of all hypotheses produced by $A$ when $A$ is given an $m$-sample of a concept $c\in C_{n}$ with $\textbf{size}(c)\leq s$. We also call $\mathcal{C}^{A}_{s,n,m}$ the effective hypothesis space of $A$. Then $A$ is an Occam algorithm for $\mathcal{C}$ if $\exists$ a polynomial $p(s,n)$ and a constant $\alpha$ with $0\leq\alpha<1$ such that $\forall\ s,n,m\geq 1$, the VC dimension of $\mathcal{C}^{A}_{s,n,m}$ is bounded by $p(s,n)m^{\alpha}$ [5].

Board and Pitt [6], on the other hand, define their algorithm as follows (some notation has been altered to match the rest of this paper, but the definition itself is equivalent):

Definition 6 (δ\delta-dependent Occam algorithm).

Let $A$ be a polynomial learning algorithm that, given a sample of a concept $c\in\mathcal{C}$, produces a hypothesis consistent with the sample. For concept class $\mathcal{C}$ and all $s,n,m\geq 1$ and $0<\delta<1$, let $S_{\mathcal{C},s,n,m}$ denote the set of all $m$-samples of concepts $c\in C_{n}$ such that $\textbf{size}(c)\leq s$. For polynomial-time learning algorithm $A$, let $\mathcal{C}^{A}_{s,n,m,\delta}\subseteq\mathcal{C}$ denote the $A$-image of $S_{\mathcal{C},s,n,m}$, i.e., the set of all hypotheses produced by $A$ when $A$ is given as input $s$, $\delta$, and an $m$-sample of a concept $c\in C_{n}$ with $\textbf{size}(c)\leq s$. Then $A$ is an Occam algorithm for $\mathcal{C}$ if $\exists$ a polynomial $p(s,n,\frac{1}{\delta})$ and a constant $\alpha$ with $0\leq\alpha<1$ such that $\forall\ s,n,m\geq 1$ and $0<\delta<1$, the VC dimension of $\mathcal{C}^{A}_{s,n,m,\delta}$ is bounded by $p(s,n,\frac{1}{\delta})m^{\alpha}$ [6].

This is a functional Occam algorithm which takes $s$ and $\delta$ as explicit inputs and whose effective hypothesis space has a VC dimension that depends polynomially on $\frac{1}{\delta}$. Each of these factors makes it distinct from the algorithms analysed by Blumer et al. [5]. We will first show that if we have a functional algorithm dependent on $s$ and $\delta$ as explicit inputs, we can construct a functional algorithm that does not require these inputs explicitly. We will then show that we can construct a $\delta$-independent Occam algorithm from the resulting functional algorithm.

2.5 Exception lists

Our analysis will apply only to concept classes which are closed under exception lists.

Definition 7.

A class $\mathcal{C}$ is closed under exception lists if $\exists$ an algorithm ExList and a polynomial $p_{\text{ex}}$ such that $\forall\ n\geq 1$, on input of any $c\in C_{n}$ with $\textbf{size}(c)=s$ and any finite set $E\subseteq X_{n}$ of cardinality $e$, ExList halts in time bounded by $p_{\text{ex}}(n,s,e)$ and outputs a concept $\text{ExList}(c,E)=c_{E}\in C_{n}$ such that $c_{E}=c\triangle E$. Note that polynomial running time means that $\textbf{size}(c_{E})\leq p_{\text{ex}}(n,s,e)$ (a program cannot output a concept with size larger than its runtime) [6].

In words, the ExList algorithm will output a concept $c_{E}$ that agrees with $c$ on all but a finite set of points in $X$, with that finite set being given by $E$. Classes that are closed under exception lists include decision lists, decision trees, arbitrary programs, Boolean circuits, Boolean formulas, and Deterministic Finite Automata (DFAs) over any fixed alphabet, among others [6]. Many of the above classes happen to be ones where finding a minimal consistent hypothesis is NP-hard, which was the original motivation for developing Occam algorithms [4].
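
For intuition, the sketch below (ours) implements the exception-list operation for a toy representation in which a concept is stored as a base predicate together with a finite exception set; it is meant only to illustrate the identity $c_{E}=c\triangle E$, not the ExList construction for any of the specific classes listed above.

    from dataclasses import dataclass, field
    from typing import Callable, FrozenSet

    @dataclass(frozen=True)
    class ExceptionConcept:
        """A concept stored as a base predicate plus a finite exception set:
        x is in the concept iff base(x) XOR (x in exceptions)."""
        base: Callable[[int], bool]
        exceptions: FrozenSet[int] = field(default_factory=frozenset)

        def __contains__(self, x: int) -> bool:
            return bool(self.base(x)) ^ (x in self.exceptions)

    def ex_list(c: ExceptionConcept, E: FrozenSet[int]) -> ExceptionConcept:
        """Return c_E with c_E = c (symmetric difference) E: points already in the
        exception set are toggled back, new points are added."""
        return ExceptionConcept(c.base, c.exceptions ^ E)

    # Example: start from "even numbers" and flip the labels of the points {2, 7}.
    evens = ExceptionConcept(lambda x: x % 2 == 0)
    flipped = ex_list(evens, frozenset({2, 7}))
    print(2 in evens, 2 in flipped, 7 in evens, 7 in flipped)  # True False False True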

3 Equivalence of Occam algorithms

Board and Pitt's [6] construction of an Occam algorithm from a general learning algorithm $L$ begins as follows: “Run the algorithm $L[_{\text{dependent}}]$, giving it input parameters $s$, $\epsilon=m^{-\frac{1}{k+1}}$, and $\delta$. Whenever $L[_{\text{dependent}}]$ calls the oracle for a data point, choose a random point in the sample set $M$ with uniform probability $\frac{1}{m}$ and supply it to $L$ as the example. Let the output of $L$ be denoted $c^{\prime}$”, where we have added the labels in square brackets.
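
Operationally, the quoted construction amounts to replacing the oracle by uniform sampling from the fixed sample $M$ and fixing $\epsilon$ from $m$ alone. The Python sketch below is ours; $L_{\text{dependent}}$ is assumed, purely for illustration, to accept $(\epsilon,\delta,s)$ together with an EX()-style callable.

    import random
    from typing import Callable, List, Tuple

    Example = Tuple[float, int]

    def simulate_oracle(M: List[Example]) -> Callable[[], Example]:
        """EX() replacement: each call returns a uniformly random element of the
        fixed sample M, so the learner effectively sees the uniform distribution
        over M."""
        return lambda: random.choice(M)

    def run_board_pitt_wrapper(L_dependent, M: List[Example],
                               s: int, delta: float, k: int):
        """Fix epsilon = m^(-1/(k+1)) from the sample size alone, then run the
        delta,s-known learner against the simulated oracle; its output is c'."""
        m = len(M)
        epsilon = m ** (-1.0 / (k + 1))
        return L_dependent(epsilon, delta, s, simulate_oracle(M))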

We know that $\epsilon$ will be given a value that does not require explicit input, as it depends only on $m$. Thus, in practice, the learning algorithm $L_{\text{dependent}}$ in fact only requires $s$, $\delta$, and $M$ as inputs: it is a functional algorithm with $\delta,s$ known. Clearly, this is not one of the learning models considered by [10] in Theorem 5. In contrast, the standard definition of a functional polynomial learning algorithm (Definition 2) can be termed a $\delta,s$-unknown functional algorithm. We would like to show an equivalence between the two definitions:

Lemma 6.

If $\mathcal{H}$ is polynomially learnable by a $\delta,s$-known algorithm, it is polynomially learnable by a $\delta,s$-unknown algorithm.

Proof 1.

We will construct a $\delta,s$-unknown functional algorithm from a $\delta,s$-known functional algorithm, following a method used by Haussler et al. [10] to show the equivalence of functional and oracle models of learning.

The $\delta,s$-known functional algorithm defined by Board and Pitt [6] is an algorithm that takes an $m$-sample as input and runs in polynomially bounded time; we need the same for our $\delta,s$-unknown functional algorithm. The first requirement is explicit in functional models of learnability, while the latter is explicit in oracle models. Thus, in order to be able to make explicit use of both properties, we will make our algorithm, $L_{\text{independent}}$, a functional algorithm, constructed from an oracle algorithm, that takes an input $M$ of length $m$. We will show that this algorithm then leads to the same result as the one considered by Board and Pitt.

Given that a functional algorithm cannot take $\delta$ or $s$ as input, we need to find a way to bound these values. To do so, we use a construction similar to Haussler et al.'s construction of a functional algorithm from an oracle algorithm [10].

If $L_{\text{dependent}}$ has a runtime bounded by a polynomial $T_{L}(\epsilon,\delta,n,s)$, we find a polynomial $p$, monotonically increasing in $x$, such that $p(\epsilon,n,x)\geq T_{L}(\epsilon,\frac{1}{x},n,x)$.

We choose an $x$ such that $p(\epsilon,n,x)\leq m$ and $p(\epsilon,n,2x)>m$. We know by definition that $m$ is polynomial in $\frac{1}{\delta}$ and $s$, so this construction means our $x$ is polynomially related to those values. If we cannot do this because $p(\epsilon,n,1)>m$, then we halt with the default hypothesis. Our constructed algorithm $L_{\text{independent}}$ works as follows: we run the algorithm $L_{\text{dependent}}$, giving it input parameters $s_{\text{in}}=x$, $n_{\text{in}}=n$ (which $L$ can determine directly from the form of each example in $M$), $\epsilon_{\text{in}}=\epsilon$, and $\delta_{\text{in}}=\frac{1}{x}$. Whenever $L_{\text{dependent}}$ calls the oracle for a data point, we choose a random point in the sample set $M$ with uniform probability $\frac{1}{m}$ and supply it to $L_{\text{dependent}}$ as the example, just as before.
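
The required $x$ can be found constructively, for instance by doubling, as in the following sketch (ours; $p$ is the monotone polynomial fixed above, assumed unbounded in $x$):

    def choose_x(p, epsilon: float, n: int, m: int):
        """Find x with p(epsilon, n, x) <= m and p(epsilon, n, 2x) > m, or report
        failure.  Because p is monotonically increasing (and unbounded) in x,
        doubling needs only O(log x) evaluations of p."""
        if p(epsilon, n, 1) > m:
            return None            # caller halts with the default hypothesis
        x = 1
        while p(epsilon, n, 2 * x) <= m:
            x *= 2
        return x                   # now p(eps, n, x) <= m < p(eps, n, 2x)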

We may require that $m$ be bounded from below by a polynomial $q(\epsilon,\delta,n,s)$: choose $q$ such that $q(\epsilon,\delta,n,s)=p(\epsilon,n,2(\frac{1}{\delta}+s))$. Then the $x$ we choose will be bounded from below by $\frac{1}{\delta}+s$, so we are running $L_{\text{dependent}}$ with $\delta_{\text{in}}=\frac{1}{x}\leq\delta$ and $s_{\text{in}}=x\geq s$. Thus we have a functional algorithm $L_{\text{independent}}$ whose input and runtime are both bounded polynomially.

This algorithm is equivalent to the one defined by Board and Pitt [6], as it operates within polynomial bounds with $\delta_{\text{in}}=\frac{1}{x}\leq\delta$ and $s_{\text{in}}=x\geq s$. Clearly, if a concept class is learnable by $L_{\text{dependent}}$, then it is learnable by $L_{\text{independent}}$; this completes the proof of Lemma 6.

Lemma 6 shows that the model of polynomial PAC learnability considered by Board and Pitt is in fact equivalent to the models commonly used in the literature, and is a member of the equivalence class in Theorem 5. Based on this result, we would now like to construct an Occam algorithm whose output complexity is independent of $\delta$.

Theorem 7.

If $\mathcal{C}$ is polynomially learnable, then it is learnable by a $\delta$-independent Occam algorithm.

Proof 2.

We now repeat Board and Pitt's [6] proof of the equivalence between general learning and Occam learnability using our newly constructed $s$- and $\delta$-independent algorithm, $L_{\text{independent}}$, showing that we still obtain a polynomial Occam algorithm, which is now $\delta$-independent.

Define $x$ as in the previous proof. Choose a constant $k$ such that we can bound the VC dimension of the hypothesis space, $d(n,T)$ for runtime $T$, with

d(n,T_{L}(\epsilon,\frac{1}{x},n,x))+2\leq d(n,p(\epsilon,n,x))+2\leq\frac{k}{2}\left(\frac{nx^{2}}{\epsilon}\right)^{k},\quad\forall\ n,x\geq 1\text{ and }0<\epsilon<1, (1)

where $T_{L}$ and $p$ are defined as in the previous proof. Let $\epsilon=m^{-\frac{1}{k+1}}$. Thus, $L_{\text{independent}}$ does not require $\epsilon$, $\delta$, or $s$ as explicit inputs, so it is consistent with the models in Theorem 5, and if it learns $\mathcal{C}$, we can say that $\mathcal{C}$ is polynomially learnable (as defined in Definition 4).

Now, we construct an Occam algorithm $O$ from $L_{\text{independent}}$ as follows (a schematic sketch in code follows the list):

  1. Run $L_{\text{independent}}$ as defined in the previous proof, and let its output be denoted $c^{\prime}$.

  2. Compute the exception list $E=\{x\in M:c(x)\neq c^{\prime}(x)\}$.

  3. Output $c^{\prime}_{E}=\text{ExList}(c^{\prime},E)$.
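
The three steps correspond to the following sketch (ours; it reuses the naming of the earlier sketches and assumes the hypothesis returned by $L_{\text{independent}}$ supports membership queries):

    def occam_algorithm_O(L_independent, ex_list, M):
        """Occam algorithm O of Theorem 7, assembled from L_independent and ExList.
        M is an m-sample [(x, label), ...]; the returned hypothesis is consistent
        with every point of M."""
        # Step 1: run the delta,s-unknown functional learner on the sample.
        c_prime = L_independent(M)
        # Step 2: the points of M on which c_prime disagrees with the given labels.
        E = frozenset(x for x, label in M if int(x in c_prime) != label)
        # Step 3: patch c_prime via the exception list, restoring consistency.
        return ex_list(c_prime, E)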

We define a constant $a_{k}$ such that for any $y\geq a_{k}$, $\log(y)<y^{\frac{1}{k+2}}$. To prove the theorem, we simply need to show that $O$ is an Occam algorithm for $\mathcal{C}$ with corresponding polynomial

p_{O}(n,x)=a_{k}k^{\frac{k+2}{k+1}}(nx^{2})^{\frac{k^{2}+2k}{k+1}} (2)

and constant exponent

\alpha=\frac{k^{2}+2k}{k^{2}+2k+1}. (3)

$L$ runs in polynomial time, as does ExList; thus, $O$ runs in polynomial time. Clearly, any output $c^{\prime}_{E}$ produced by the algorithm is consistent with $M$. Also, since the simulated oracle draws examples uniformly from $M$, the error of $c^{\prime}$ as defined in Section 2.1 is measured with respect to the uniform distribution on $M$, so it must be true that $|E|\leq\epsilon|M|$.

As in Definition 5, let $\mathcal{C}^{O}_{x,n,m}\subseteq\mathcal{C}$ denote the $O$-image of $S_{\mathcal{C},x,n,m}$, i.e., the set of all hypotheses produced by $O$ when $O$ is given an $m$-sample of a concept $c\in C_{n}$ with $\textbf{size}(c)\leq s$. We would like to find the VC dimension of $\mathcal{C}^{O}_{x,n,m}$.

Let the VC dimension of $\mathcal{C}^{O}_{x,n,m}$ be called $d_{O}$. If $d_{O}\leq 1$, it is clearly bounded by $p_{O}(n,x)m^{\alpha}$ and the theorem is proved. Now let us assume $d_{O}\geq 2$.

Let $\mathcal{C}^{L}_{x,n,m}$ be the effective hypothesis space of $L_{\text{independent}}$. An algorithm cannot output a hypothesis with size greater than its runtime $T$, so the size of each element of $\mathcal{C}^{L}_{x,n,m}$ is bounded by $T_{L}(\epsilon,\frac{1}{x},n,x)$. Then $\text{VC-Dim}(\mathcal{C}^{L}_{x,n,m})\leq d(n,T_{L}(\epsilon,\frac{1}{x},n,x))$.

Making use of Lemma 4, we can write

\frac{d_{O}}{\log(d_{O})}\leq d(n,T_{L}(\epsilon,\frac{1}{x},n,x))+\epsilon m+2.

By our choice of $k$ in (1), this means that

\frac{d_{O}}{\log(d_{O})}\leq\frac{k}{2}\left(\frac{nx^{2}}{\epsilon}\right)^{k}+\epsilon m. (4)

If $d_{O}<a_{k}$ then clearly $d_{O}$ is bounded by $p_{O}(n,x)m^{\alpha}$ and the theorem is proved.

If $d_{O}\geq a_{k}$ then by the choice of $a_{k}$, $\log(d_{O})<(d_{O})^{\frac{1}{k+2}}$. Thus,

\frac{d_{O}}{\log(d_{O})}>\frac{d_{O}}{(d_{O})^{\frac{1}{k+2}}}=(d_{O})^{\frac{k+1}{k+2}}.

Combining with (4) we have

(d_{O})^{\frac{k+1}{k+2}} < \frac{k}{2}\left(\frac{nx^{2}}{\epsilon}\right)^{k}+\epsilon m
= \frac{k}{2}(nx^{2})^{k}m^{\frac{k}{k+1}}+m^{\frac{k}{k+1}} \qquad (\text{substituting }\epsilon=m^{-\frac{1}{k+1}})
\leq k(nx^{2})^{k}m^{\frac{k}{k+1}}.

Raising each side to the power $\frac{k+2}{k+1}$, we get

d_{O} \leq k^{\frac{k+2}{k+1}}(nx^{2})^{\frac{k^{2}+2k}{k+1}}m^{\frac{k^{2}+2k}{k^{2}+2k+1}}
\leq p_{O}(n,x)m^{\alpha}.
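
For completeness, the exponent arithmetic behind the last display is the following elementary verification (ours; we take $a_{k}\geq 1$, which is always possible when choosing $a_{k}$):

\Bigl(m^{\frac{k}{k+1}}\Bigr)^{\frac{k+2}{k+1}}=m^{\frac{k^{2}+2k}{k^{2}+2k+1}}=m^{\alpha},\qquad\Bigl((nx^{2})^{k}\Bigr)^{\frac{k+2}{k+1}}=(nx^{2})^{\frac{k^{2}+2k}{k+1}},\qquad k^{\frac{k+2}{k+1}}\leq a_{k}k^{\frac{k+2}{k+1}},

so that $d_{O}\leq p_{O}(n,x)m^{\alpha}$ with $p_{O}$ and $\alpha$ exactly as in (2) and (3).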

It should be noted that $x$ and $s$ are polynomially related, so $p_{O}(n,x)=p_{R}(n,s)$ for some polynomial $p_{R}$. We therefore explicitly satisfy the conditions of Definition 5, and have thus constructed a $\delta$-independent Occam algorithm. This concludes the proof of Theorem 7.

4 Conclusion

We have now shown that the algorithms used in Board and Pitt’s analysis [6] are, in fact, equivalent to the algorithmic models used by Blumer et al. [5]. Thus, the equivalence of learnability and the weak approximability of the minimum-consistent-hypothesis problem is not dependent solely on the choice of algorithmic model, but in fact extends to all algorithmic models considered by Haussler et al. [10].

This analysis focused on strong learning conditions, but, based on the equivalence of certain strong and weak learnability criteria, it should naturally extend to weak learning conditions as well. In addition, the same method of proof shows that Board and Pitt's theorem for finite-alphabet, length-based, $\delta$-dependent Occam algorithms also applies to similarly defined but $\delta$-independent Occam algorithms, as used by [4, 13]. Essentially, we have proven an equivalence which had been widely assumed to be true due to the work of Board and Pitt [6] but had not, in fact, been shown explicitly to be correct. Thus, we provide a posteriori justification of analyses and algorithmic design techniques which made use of this equivalence theorem.

Appendix A Approximate Occam algorithms

If we relax the condition that Occam algorithms be consistent with the training sample, we can define an “approximate Occam algorithm” and show that any polynomially PAC learnable concept class is learnable by such an algorithm [6]. In fact, it is easy to show that any polynomial PAC learning algorithm can be considered an “approximate Occam algorithm”.

Specifically, we can define approximate Occam algorithms as follows:

Definition 8 (δ\delta-independent approximate Occam algorithm).

Let $A$ be a polynomial learning algorithm that, given a sample of a concept $c\in\mathcal{C}$, produces a hypothesis consistent with at least $(1-\epsilon)m$ points in the sample. For concept class $\mathcal{C}$ and all $s,n,m\geq 1$, let $S_{\mathcal{C},s,n,m}$ denote the set of all $m$-samples of concepts $c\in C_{n}$ such that $\textbf{size}(c)\leq s$. For polynomial-time learning algorithm $A$, let $\mathcal{C}^{A}_{s,n,m}\subseteq\mathcal{C}$ denote the $A$-image of $S_{\mathcal{C},s,n,m}$, i.e., the set of all hypotheses produced by $A$ when $A$ is given an $m$-sample of a concept $c\in C_{n}$ with $\textbf{size}(c)\leq s$. We also call $\mathcal{C}^{A}_{s,n,m}$ the effective hypothesis space of $A$. Then $A$ is an approximate Occam algorithm for $\mathcal{C}$ if $\exists$ a polynomial $p(s,n)$ and a constant $\alpha$ with $0\leq\alpha<1$ such that $\forall\ s,n,m\geq 1$, the VC dimension of $\mathcal{C}^{A}_{s,n,m}$ is bounded by $p(s,n)m^{\alpha}$ [6].

Then we can show the following lemma:

Lemma 8.

If $L$ is a polynomial PAC learning algorithm for $\mathcal{C}$, it is an approximate Occam algorithm for $\mathcal{C}$.

Proof 3.

Define $x$ as in the proof of Lemma 6, and consider a learning algorithm of the type $L_{\text{independent}}$. Then, following the proof of Theorem 7, we can choose a constant $k$ such that we can bound the VC dimension of the hypothesis space, $d(n,T)$ for runtime $T$, with

d(n,T_{L}(\epsilon,\frac{1}{x},n,x))+2\leq d(n,p(\epsilon,n,x))+2\leq\frac{k}{2}\left(\frac{nx^{2}}{\epsilon}\right)^{k},\quad\forall\ n,x\geq 1\text{ and }0<\epsilon<1, (5)

where $T_{L}$ and $p$ are defined as in the proof of Lemma 6. Let $\epsilon=m^{-\frac{1}{k+1}}$. Thus, $L_{\text{independent}}$ does not require $\epsilon$, $\delta$, or $s$ as explicit inputs, so it is consistent with the models in Theorem 5, and if it learns $\mathcal{C}$, we can say that $\mathcal{C}$ is polynomially learnable (as defined in Definition 4).

Let $\mathcal{C}^{L}_{x,n,m}$ be the effective hypothesis space of $L_{\text{independent}}$. An algorithm cannot output a hypothesis with size greater than its runtime $T$, so the size of each element of $\mathcal{C}^{L}_{x,n,m}$ is bounded by $T_{L}(\epsilon,\frac{1}{x},n,x)$. Then

\text{VC-Dim}(\mathcal{C}^{L}_{x,n,m}) \leq d(n,T_{L}(\epsilon,\frac{1}{x},n,x))
\leq \frac{k}{2}\left(\frac{nx^{2}}{\epsilon}\right)^{k}
= \frac{k}{2}(nx^{2})^{k}m^{\frac{k}{k+1}},

which is polynomial in $n$ and $x$ and sublinear in $m$. As discussed previously, $x$ is polynomially bounded by $s$, so we have a polynomial in $n$ and $s$ which is multiplied by a sublinear factor in $m$. This completes the proof of Lemma 8.

Acknowledgements

I would like to thank Ard Louis for his mentorship and guidance over the course of this work, and for providing feedback on earlier drafts.

References

  • [1] J. Pearl, On the connection between the complexity and credibility of inferred models, International Journal of General System 4 (4) (1978) 255–264.
  • [2] P. Domingos, The role of Occam's razor in knowledge discovery, Data Mining and Knowledge Discovery 3 (1999) 409–425.
  • [3] G. I. Webb, Further experimental evidence against the utility of Occam’s razor, Journal of Artificial Intelligence Research 4 (1996) 397–417.
  • [4] A. Blumer, A. Ehrenfeucht, D. Haussler, M. K. Warmuth, Occam's razor, Information Processing Letters 24 (6) (1987) 377–380.
  • [5] A. Blumer, A. Ehrenfeucht, D. Haussler, M. K. Warmuth, Learnability and the Vapnik-Chervonenkis dimension, Journal of the ACM (JACM) 36 (4) (1989) 929–965.
  • [6] R. Board, L. Pitt, On the necessity of Occam algorithms, in: Proceedings of the Twenty-second Annual ACM Symposium on Theory of Computing, 1990, pp. 54–63.
  • [7] L. Pitt, M. K. Warmuth, The minimum consistent DFA problem cannot be approximated within any polynomial, Journal of the ACM (JACM) 40 (1) (1993) 95–142.
  • [8] T. Hancock, T. Jiang, M. Li, J. Tromp, Lower bounds on learning decision lists and trees, in: STACS 95: 12th Annual Symposium on Theoretical Aspects of Computer Science Munich, Germany, March 2–4, 1995 Proceedings, Springer, 2005, pp. 527–538.
  • [9] J. S. Vitter, P. Krishnan, Optimal prefetching via data compression, Journal of the ACM (JACM) 43 (5) (1996) 771–793.
  • [10] D. Haussler, M. Kearns, N. Littlestone, M. K. Warmuth, Equivalence of models for polynomial learnability, Information and Computation 95 (2) (1991) 129–161.
  • [11] S. Shalev-Shwartz, S. Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.
  • [12] L. Pitt, L. G. Valiant, Computational limitations on learning from examples, Journal of the ACM (JACM) 35 (4) (1988) 965–984.
  • [13] M. J. Kearns, U. Vazirani, An Introduction to Computational Learning Theory, The MIT Press, 1994. doi:10.7551/mitpress/3897.001.0001.