On the equivalence of Occam algorithms
Abstract
Blumer et al. (1987, 1989) showed that any concept class that is learnable by an Occam algorithm is PAC learnable. Board and Pitt (1990) showed a partial converse of this theorem: for concept classes that are closed under exception lists, any class that is PAC learnable is learnable by an Occam algorithm. However, their Occam algorithm outputs a hypothesis whose complexity is $\epsilon$-dependent, which is an important limitation. In this paper, we show that their partial converse applies to Occam algorithms with $\epsilon$-independent complexities as well. Thus, we provide a posteriori justification for various theoretical results and algorithm-design methods that use the partial converse as the basis for their work.
keywords:
Occam algorithms, polynomial learnability, PAC learning, statistical learning theory
MSC [2010]: 68Q32, 68W01
Funding: This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Affiliation: Rudolf Peierls Centre for Theoretical Physics, University of Oxford, Beecroft Building, Parks Road, Oxford, OX1 3PU, UK
1 Introduction
For decades, there has been debate regarding the necessity of simplicity for machine learning [1, 2, 3]. Many of these analyses have focused on the implications and uses of complexity-based algorithms defined by Blumer et al. in two seminal papers [4, 5]. Their algorithms were defined such that they achieved zero training error on a sample and output a hypothesis whose complexity (VC dimension for continuous alphabets; description length for discrete ones) was at most a polynomial in the target concept complexity, multiplied by a sublinear factor in the sample size. These “Occam algorithms” are weak approximations of the minimum-consistent-hypothesis problem [6]. In this paper, we focus on the continuous-alphabet Occam algorithms.
In 1989, Blumer et al. [5] showed that if a concept class was learnable by their Occam algorithm, then it was polynomially learnable; they left open the question of whether the converse of this theorem was true. Board and Pitt [6] proved a partial converse, namely that, for concept classes closed under exception lists, if a class was polynomially learnable in general, then it was learnable by an Occam algorithm. This proof suggested that, for concept classes which are closed under exception lists, learnability is equivalent to weak approximability of the minimum-consistent-hypothesis problem [6]. This equivalence formed the basis for subsequent theoretical work analysing the learnability of DFAs [7] and decision trees and lists [8], as well as motivating the design of practical algorithms such as prediction algorithms for optimal data prefetching [9].
However, the Occam algorithm model that was used by Board and Pitt [6] differs from the one defined by Blumer et al. [5]: the former is a construction based on a functional algorithm requiring $s$, $\epsilon$, $\delta$, and the training sample as explicit inputs, while Blumer et al.’s model requires only the training sample [5]. While it has been shown that many frequently studied PAC algorithms form an equivalence class [10], it is not clear whether Board and Pitt’s [6] version is a member of this class. Further, Board and Pitt later remove a degree of freedom by defining $\delta$ in terms of $\epsilon$. In some cases, changing degrees of freedom can alter whether an algorithm always halts or just usually halts: even if Board and Pitt’s model of learnability falls under the aforementioned equivalence class, it is unclear whether the equivalent algorithm in Blumer et al.’s definition would always halt [10, 6, 5].
In addition, the complexity of the output of Board and Pitt’s Occam algorithm is $\epsilon$-dependent [6], while in Blumer et al.’s definition it is not [5]. Thus, it is not immediately clear whether Board and Pitt’s [6] conclusions hold for more commonly used polynomial PAC learning models and $\epsilon$-independent output complexities (and are hence a general property of polynomial learnability and Occam algorithms), or whether their theorem results purely from the specific choice of algorithmic model. Therefore, it may not be justified to use their results as the basis for further research on general learnability and algorithm construction, as is currently the case [9, 8, 7].
In this paper, we show that the polynomial PAC model used by Board and Pitt [6] is equivalent to the models used by Blumer et al. and Haussler et al. [5, 10], and that we can construct an Occam algorithm from such a functional algorithm so that its output complexity is $\epsilon$-independent, thereby explicitly linking the work of Board and Pitt [6] to that of Blumer et al. [5]. Our theorem provides the link needed to justify the use of Board and Pitt’s [6] work as a theoretical basis from which to further analyse questions of learnability and algorithm design, by showing that their result holds irrespective of the specific algorithmic model they chose.
We will begin by defining concept classes, the parameters $\epsilon$ and $\delta$, and our choice of complexity measure, the Vapnik-Chervonenkis (VC) dimension, along with relevant lemmas. Next, we define functional and oracle algorithmic models and provide an equivalence theorem due to Haussler et al. [10] linking the two. We then give the definitions of Occam algorithms used by Blumer et al. [5] and Board and Pitt [6], pointing out the differences and showing that the latter is not obviously equivalent to the former. Finally, we prove that the two Occam algorithm definitions are, in fact, equivalent, thus justifying general analyses of learnability based on Board and Pitt’s work [6].
1.1 A note on algorithm names
It is worth pausing before the next section to explicitly list the algorithmic models we will be discussing in the remainder of this paper. We will distinguish between four models: $s$-known functional algorithms, $s$-unknown functional algorithms, $\epsilon$-dependent Occam algorithms, and $\epsilon$-independent Occam algorithms.
The $s$-unknown functional algorithm is the standard definition of a functional algorithm used by Blumer et al. and Haussler et al. [5, 10], given in Definition 2. The $s$-known functional algorithm is a version used only by Board and Pitt [6]. We will need to show an equivalence between these two models, as the $s$-known functional algorithm is not one of the models contained in the equivalence class provided by Haussler et al. in Theorem 5. This result will be Lemma 6.
The $\epsilon$-dependent and $\epsilon$-independent Occam algorithms, on the other hand, refer not to whether $\epsilon$ is an explicit input, but to whether the VC dimension of the hypothesis space depends polynomially on $1/\epsilon$ or not. The $\epsilon$-independent version is used by Blumer et al. [5], while Board and Pitt [6] use an $\epsilon$-dependent version. The main result of our paper will be extending the analysis conducted by Board and Pitt to apply to $\epsilon$-independent Occam algorithms as well, resulting in Theorem 7.
2 Notation and definitions
2.1 Terminology
Let $X$ be a stratified learning domain (i.e. we can write it as $X = \bigcup_{n \geq 1} X_n$, where each $X_n$ has dimension $n$). A concept is a subset of $X$; we can consider it as labelling every element of $X$ with either a 1 or a 0, depending on whether the element is in the concept or not.
Then let $C \subseteq 2^X$ be a well-behaved (in a measure-theoretic sense made explicit in Blumer et al.’s work [5]) class of concepts on $X$, and let size$: C \to \mathbb{N}$ be the concept complexity measure. Let there also be associated with $C$ a set of representations for concepts in $C$ given in some representation language $\Gamma$. We assume there is a function from the set of representations in $\Gamma$ onto $C$, and that, given a point $x \in X$ and a string in $\Gamma$, we can decide in polynomial time whether $x$ is in the concept represented by that string (i.e. $\Gamma$ is polynomially evaluable). Finally, we say that $C$ is stratified, i.e. $C = \bigcup_{n \geq 1} C_n$, where $C_n$ consists of the concepts defined over $X_n$.
By $\mathbf{C}$ we denote the class $C$ together with its complexity measure size and its representation language $\Gamma$, so $\mathbf{C} = (C, \Gamma, \text{size})$. We include $\Gamma$ because learnability may in fact be a property of the representation of a set of concepts, not just of the concepts themselves [6]. We use the term concept class to refer to $\mathbf{C}$ as well as to $C$. We say a concept $c$ is in $\mathbf{C}$ if $c \in C$.
We define a hypothesis class $\mathbf{H}$ analogously: as with $\mathbf{C}$, $\mathbf{H} = (H, \Gamma_H, \text{size})$, where $H$ is also a class of concepts on $X$ but need not be related to $C$.
We use $D$ to denote a probability distribution on the domain $X$. A sample of a concept $c$ taken according to $D$ on $X$ contains $m$ labelled tuples (or points) of the form $(x, c(x))$; we say such a sample has cardinality $m$ and call it an $m$-sample. The set of $x$-values occurring in a sample is called its point set. The error, or accuracy parameter, of a hypothesis $h$ with respect to a concept $c$ is $\epsilon = D(h \,\triangle\, c)$, where $\triangle$ denotes the symmetric difference. In words, $\epsilon$ is the probability, under $D$, of the symmetric difference of the hypothesis and the concept.
We cannot measure $\epsilon$ directly, as it is a property over the entire domain and the target concept is unknown. Therefore, it is useful to define a training error with respect to some loss function, measured on the sample. We say a hypothesis is consistent if its training error on a sample is zero. The parameter $\delta$ then bounds the probability that the algorithm fails to return a consistent hypothesis.
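For concreteness, the following short Python sketch (ours; the function and variable names are purely illustrative and not part of the formal development) contrasts the true error $\epsilon$, which is a property of the whole domain and distribution, with the training error measured on a finite sample.

```python
import random

def true_error(hypothesis, concept, draw, n_draws=100_000):
    """Monte Carlo estimate of D(h symmetric-difference c): the probability, under the
    distribution that `draw` samples from, that the hypothesis and concept disagree."""
    return sum(hypothesis(x) != concept(x) for x in (draw() for _ in range(n_draws))) / n_draws

def training_error(hypothesis, sample):
    """0/1 training error: the fraction of labelled points (x, y) that the hypothesis
    misclassifies; the hypothesis is 'consistent' with the sample when this is zero."""
    return sum(hypothesis(x) != y for x, y in sample) / len(sample)

# Illustrative usage: the target concept is an interval, the hypothesis a slightly shifted one.
concept = lambda x: 0.2 <= x <= 0.7
hypothesis = lambda x: 0.25 <= x <= 0.7
draw = lambda: random.random()                       # D = uniform distribution on [0, 1]
sample = [(x, concept(x)) for x in (draw() for _ in range(50))]
print(true_error(hypothesis, concept, draw))         # approximately 0.05
print(training_error(hypothesis, sample))            # empirical counterpart on the 50-point sample
```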
Finally, we assume that $0 < \epsilon, \delta < 1$; otherwise the learnability of any concept class would be trivial.
2.2 Vapnik-Chervonenkis dimension
In Section 2.1 we asserted the existence of a concept complexity measure, size, but we did not describe any of its properties other than its domain and range. For the arguments presented in this paper we do not require any specific choice of complexity measure for size; it can be semantic (measuring breadth of explanatory power), syntactic (e.g. the length of the description of a hypothesis in $\Gamma_H$), or something else entirely. We do make use of the (assumed) property that an algorithm cannot output a hypothesis with size greater than its runtime [6]. What will primarily matter, though, is how we can pass from size to a combinatorial, semantic complexity measure called the Vapnik-Chervonenkis (VC) dimension [6, 5].
Definition 1.
Given a nonempty concept class $C \subseteq 2^X$ and a set of points $S \subseteq X$, $\Pi_C(S)$ denotes the set of all subsets of $S$ that can be obtained by intersecting $S$ with a concept in $C$; that is, $\Pi_C(S) = \{S \cap c : c \in C\}$. If $\Pi_C(S) = 2^S$, we say that $S$ is shattered by $C$. The VC dimension of $C$ is the cardinality of the largest finite set of points that is shattered by $C$; if arbitrarily large finite sets of points are shattered, then the VC dimension of $C$ is infinite. $C$ is trivial if it consists of only one concept, or of two disjoint concepts $c_1$ and $c_2$ such that $c_1 \cup c_2 = X$ (such concept classes require only one example to learn) [5].
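Shattering can be checked mechanically for small finite examples. The following Python sketch (ours, brute force, purely illustrative) computes $\Pi_C(S)$ for concepts represented as predicates and uses it to find the VC dimension over a tiny domain.

```python
from itertools import chain, combinations

def projections(concepts, points):
    """Pi_C(S): the subsets of `points` obtainable by intersecting with some concept,
    where each concept is represented as a boolean predicate."""
    return {frozenset(x for x in points if c(x)) for c in concepts}

def is_shattered(concepts, points):
    """S is shattered by C exactly when Pi_C(S) contains all 2^|S| subsets of S."""
    return len(projections(concepts, points)) == 2 ** len(points)

def vc_dimension(concepts, domain):
    """Brute-force VC dimension over a small finite domain (exponential; illustration only)."""
    best = 0
    for points in chain.from_iterable(combinations(domain, r) for r in range(len(domain) + 1)):
        if is_shattered(concepts, points):
            best = max(best, len(points))
    return best

# Threshold concepts {x : x >= t} shatter any single point but no pair of points,
# so their VC dimension is 1.
domain = [0, 1, 2, 3]
thresholds = [lambda x, t=t: x >= t for t in range(5)]
print(vc_dimension(thresholds, domain))  # 1
```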
Lemma 1.
For a finite hypothesis class $H$, the VC dimension of $H$ is bounded by $\log_2 |H|$ [11].
Lemma 2.
Lemma 3.
Let $H$ be a nonempty hypothesis class, let $D$ be a probability distribution on $X$, and let the target concept $c$ be a subset of $X$. Then for any $0 < \epsilon < 1$ and $m \geq 1$, given $m$ independent examples drawn according to $D$, the probability that there exists a hypothesis in $H$ that is consistent with all of these examples and has error greater than $\epsilon$ is at most $2\,\Pi_H(2m)\,2^{-\epsilon m/2}$ [5].
Lemma 4.
Let have VC dimension . Let . If is the VC dimension of , then .
2.3 Functional and oracle algorithms
Definitions similar to the following can be found in papers by Blumer et al., Valiant, and others [5, 10, 12].
Definition 2 (Functional polynomial learnability).
For a concept class $\mathbf{C}$ defined as in Section 2.1, we say that $\mathbf{C}$ is properly polynomially learnable (poly-learnable) in the functional model if there exists a polynomial-time learning algorithm $A$ that takes as input a sample of a concept in $C$, outputs a hypothesis in $C$, and has the property that, for all $\epsilon, \delta > 0$, there is a sample size $m$, polynomial in $1/\epsilon$, $1/\delta$, $n$, and $s$, such that, for all target concepts $c \in C$ with $\text{size}(c) \leq s$ and all probability distributions $D$ on $X$, given a random sample of $c$ of size $m$ drawn independently according to $D$, $A$ produces, with probability at least $1 - \delta$, a hypothesis that has error at most $\epsilon$ [5].
A functional polynomial learning algorithm receives only the sample of size $m$; it has no access to $\epsilon$ or $\delta$ [10]. It can always determine $n$ from the dimension of the data points it is given [5]. In the oracle model, by contrast, the algorithm $A$ receives at least $\epsilon$ and $\delta$ as input and has access to an oracle that with each call returns a point in $X$ along with a label 1 or 0 indicating whether or not the point is in a fixed target concept $c$ with $\text{size}(c) \leq s$. Each oracle call takes unit time. If $A$ is probabilistic, each call to a fair coin flip also takes unit time. After some time, $A$ halts and outputs a hypothesis [5, 10].
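The distinction between the two models is essentially a difference of interface. The following Python type sketch (ours; the names and signatures are illustrative assumptions, not notation from [5] or [10]) records which quantities each kind of algorithm actually receives.

```python
from typing import Callable, Protocol, Sequence, Tuple

Point = Sequence[float]            # an element of X_n, given as an n-dimensional point
Example = Tuple[Point, bool]       # a labelled point (x, c(x))
Hypothesis = Callable[[Point], bool]

class FunctionalLearner(Protocol):
    """Functional model: the learner sees only a finite labelled sample.
    It can read n off the dimension of the points, but epsilon, delta and s are not supplied."""
    def __call__(self, sample: Sequence[Example]) -> Hypothesis: ...

class OracleLearner(Protocol):
    """Oracle model: the learner receives epsilon and delta (and here, for illustration, s and n)
    explicitly, and draws labelled examples one at a time from an oracle,
    each call costing unit time."""
    def __call__(self, epsilon: float, delta: float, s: int, n: int,
                 oracle: Callable[[], Example]) -> Hypothesis: ...
```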
Definition 3 (Oracle polynomial learnability).
For a concept class $\mathbf{C}$ defined as in Section 2.1, we say that $\mathbf{C}$ is properly polynomially learnable (poly-learnable) in the oracle model if there exists an algorithm $A$ with the property that, for all $\epsilon, \delta > 0$, there is a time bound $T$, polynomial in $1/\epsilon$, $1/\delta$, $n$, and $s$, such that, for all target concepts $c \in C$ with $\text{size}(c) \leq s$ and all probability distributions $D$ on $X$, $A$ runs in time $T$ and produces, with probability at least $1 - \delta$, a hypothesis that has error at most $\epsilon$ [5].
These models of learnability can easily be extended to the situation in which we are learning $\mathbf{C}$ by a hypothesis class $\mathbf{H}$, as opposed to by $\mathbf{C}$ itself [5, 10]. The following theorem, due to Haussler et al. [10], shows the equivalence between the different models of polynomial learnability. The property “usually halts” refers to probabilistic oracle algorithms that use fair coin tosses to decide whether to call the oracle again or halt.
Theorem 5.
If $\mathbf{C}$ is learnable by hypothesis class $\mathbf{H}$ in any of the following algorithmic models, then it is learnable by $\mathbf{H}$ in all of them:
-
1.
functional()
-
2.
oracle(),
where we have properties:
-
1.
-
2.
-
3.
and the stipulation that “$s$-unknown” is permitted only in combination with “usually halts” [10].
Definition 4 (Polynomial PAC learnability).
If a class is learnable by one (and hence all) of the algorithmic models in Theorem 5, then we simply call it “polynomially PAC learnable” without specifying the particular model in which it is learnable.
2.4 Occam algorithms
The following definition is used by Blumer et al. in their paper introducing VC-dimension-based Occam algorithms [5]. Note that this is clearly a functional algorithm, matching Definition 2.
Definition 5 (-independent Occam algorithm).
Let $A$ be a polynomial learning algorithm that, given a sample of a concept $c \in C$, produces a hypothesis consistent with the sample. For concept class $\mathbf{C}$ and all $m, s \geq 1$, let $S_{m,s}$ denote the set of all $m$-samples of concepts $c \in C$ such that $\text{size}(c) \leq s$. For polynomial-time learning algorithm $A$, let $H_{m,s}$ denote the $A$-image of $S_{m,s}$, i.e., the set of all hypotheses produced by $A$ when $A$ is given an $m$-sample of a concept $c$ with $\text{size}(c) \leq s$. We also call $H_{m,s}$ the effective hypothesis space of $A$. Then $A$ is an Occam algorithm for $\mathbf{C}$ if there exist a polynomial $p(n, s)$ and a constant $\alpha$ with $0 \leq \alpha < 1$ such that, for all $m, n, s \geq 1$, the VC dimension of $H_{m,s}$ is bounded by $p(n, s)\, m^{\alpha}$ [5].
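For intuition, the following back-of-the-envelope calculation (ours, not a statement from [5]; constants and minor logarithmic details are suppressed) indicates why such an $\epsilon$-independent bound on the effective hypothesis space yields PAC learning. If that space has VC dimension $d_m \leq p(n,s)\, m^{\alpha}$ with $0 \leq \alpha < 1$, then a consistent hypothesis drawn from it is $\epsilon$-accurate with probability at least $1-\delta$ once
\[
  m \;\gtrsim\; \frac{1}{\epsilon}\left( d_m \log\frac{1}{\epsilon} + \log\frac{1}{\delta} \right)
      \;=\; \frac{1}{\epsilon}\left( p(n,s)\, m^{\alpha} \log\frac{1}{\epsilon} + \log\frac{1}{\delta} \right),
\]
and because $\alpha < 1$ this requirement is met by a sample size that is polynomial in $n$, $s$, $1/\epsilon$, and $1/\delta$, roughly
\[
  m \;=\; O\!\left( \left( \frac{p(n,s)}{\epsilon}\log\frac{1}{\epsilon} \right)^{\frac{1}{1-\alpha}} + \frac{1}{\epsilon}\log\frac{1}{\delta} \right).
\]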
Board and Pitt [6], on the other hand, define their algorithm as follows (some notation has been altered to match the rest of this paper, but the definition itself is equivalent):
Definition 6 (-dependent Occam algorithm).
Let $O$ be a polynomial learning algorithm that, given a sample of a concept $c \in C$, produces a hypothesis consistent with the sample. For concept class $\mathbf{C}$ and all $m, s \geq 1$ and $\epsilon, \delta > 0$, let $S_{m,s}$ denote the set of all $m$-samples of concepts $c \in C$ such that $\text{size}(c) \leq s$. For polynomial-time learning algorithm $O$, let $H_{m,s,\epsilon,\delta}$ denote the $O$-image of $S_{m,s}$, i.e., the set of all hypotheses produced by $O$ when $O$ is given as input $\epsilon$, $\delta$, and an $m$-sample of a concept $c$ with $\text{size}(c) \leq s$. Then $O$ is an Occam algorithm for $\mathbf{C}$ if there exist a polynomial $p(n, s, 1/\epsilon)$ and a constant $\alpha$ with $0 \leq \alpha < 1$ such that, for all $m, n, s \geq 1$ and all $\epsilon, \delta > 0$, the VC dimension of $H_{m,s,\epsilon,\delta}$ is bounded by $p(n, s, 1/\epsilon)\, m^{\alpha}$ [6].
This is a functional Occam algorithm which takes $\epsilon$ and $\delta$ as explicit inputs and whose effective hypothesis space has a VC dimension that depends polynomially on $1/\epsilon$. Each of these features makes it distinct from the algorithms analysed by Blumer et al. [5]. We will first show that if we have a functional algorithm that requires $s$ as an explicit input, we can construct a functional algorithm that does not require this input explicitly. We will then show that we can construct an $\epsilon$-independent Occam algorithm from the resulting functional algorithm.
2.5 Exception lists
Our analysis will apply only to concept classes which are closed under exception lists.
Definition 7.
A class $\mathbf{C}$ is closed under exception lists if there exist an algorithm ExList and a polynomial $p_{\text{ex}}$ such that for all $n, s \geq 1$, on input of any $c \in C_n$ with $\text{size}(c) \leq s$ and any finite set $E \subset X_n$ of cardinality $e$, ExList halts in time bounded by $p_{\text{ex}}(n, s, e)$ and outputs a concept $c' \in C_n$ such that $c' = c \,\triangle\, E$. Note that the polynomial running time means that $\text{size}(c') \leq p_{\text{ex}}(n, s, e)$ (a program cannot output a concept with size larger than its runtime) [6].
In words, the ExList algorithm outputs a concept that agrees with the input concept $c$ on all but a finite set of points in $X_n$, with that finite set being given by $E$. Classes that are closed under exception lists include decision lists, decision trees, arbitrary programs, Boolean circuits, Boolean formulas, and deterministic finite automata (DFAs) over any fixed alphabet, among others [6]. Many of the above classes happen to be ones for which finding a minimal consistent hypothesis is NP-hard, which was the original motivation for developing Occam algorithms [4].
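As a concrete (and deliberately simplified) illustration, the following Python sketch treats concepts purely as predicates and exception handling purely as a set operation; it ignores the requirement that the patched concept be re-encoded in $\Gamma$ with polynomially bounded size, which is the substantive part of Definition 7. All names are ours.

```python
def exception_list(hypothesis, sample):
    """The finite set E of sample points on which the hypothesis disagrees with the labels."""
    return {x for x, label in sample if hypothesis(x) != label}

def apply_exceptions(hypothesis, exceptions):
    """The set-level ExList operation: return a predicate for h symmetric-difference E,
    i.e. flip the hypothesis's answer on exactly the exception points."""
    exceptions = frozenset(exceptions)
    return lambda x: hypothesis(x) != (x in exceptions)

# The patched hypothesis agrees with the labels on every sample point.
concept = lambda x: x % 3 == 0
hypothesis = lambda x: x % 6 == 0                       # wrong on 3, 9, 15, ...
sample = [(x, concept(x)) for x in range(20)]
patched = apply_exceptions(hypothesis, exception_list(hypothesis, sample))
assert all(patched(x) == label for x, label in sample)
```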
3 Equivalence of Occam algorithms
Board and Pitt’s [6] construction of an Occam algorithm from a general learning algorithm begins as follows: “Run the algorithm [$A$], giving it input parameters [$\epsilon$], [$\delta$], and [$s$]. Whenever [$A$] calls the oracle for a data point, choose a random point in the sample set with uniform probability and supply it to [$A$] as the example. Let the output of [$A$] be denoted [$h$],” where we have added the labels in square brackets.
We know that $\epsilon$ will be given a value that does not require explicit input, as it depends only on $m$. Thus, in practice, the learning algorithm in fact only requires $s$, $n$, and the training sample as inputs: it is a functional algorithm with $s$ known. Clearly, this is not one of the learning models considered by [10] in Theorem 5. In contrast, the standard definition of a functional polynomial learning algorithm (Definition 2) can be termed an $s$-unknown functional algorithm. We would like to show an equivalence between the two definitions:
Lemma 6.
If $\mathbf{C}$ is polynomially learnable by an $s$-known functional algorithm, it is polynomially learnable by an $s$-unknown functional algorithm.
Proof 1.
We will construct an $s$-unknown functional algorithm from an $s$-known functional algorithm, following a method used by Haussler et al. [10] to show the equivalence of functional and oracle models of learning.
The $s$-known functional algorithm defined by Board and Pitt [6] is an algorithm that takes a sample as input and runs in polynomially bounded time; we need the same to hold for our $s$-unknown functional algorithm. The first requirement is explicit in functional models of learnability, while the latter is explicit in oracle models. Thus, in order to be able to make explicit use of both properties, we will make our algorithm, $A'$, a functional algorithm taking an input sample of length $m$ that has been constructed from an oracle algorithm. We will show that this algorithm then leads to the same result as the one considered by Board and Pitt.
Given that a functional algorithm cannot take or as input, we need to find a way to bound these values. To do so, we use a construction similar to Haussler et al.’s construction of a functional algorithm from an oracle [10].
If $A$ has a runtime bounded by a polynomial $r$, we find a polynomial $q$, monotonically increasing in each of its arguments, that dominates $r$.
We choose an $s'$ such that the corresponding bound given by $q$ does not exceed the sample size $m$. We know by definition that $q$ is polynomial in its arguments, so this construction means our choice of $s'$ is polynomially related to those values. If we cannot make such a choice because $m$ is too small, then we halt with the default hypothesis. Our constructed algorithm works as follows: we run the algorithm $A$, giving it input parameters $\epsilon$, $n$ (which $A'$ can determine directly from the form of each example in the sample), $\delta$, and $s'$. Whenever $A$ calls the oracle for a data point, we choose a random point in the sample set with uniform probability and supply it to $A$ as the example, just as before.
We now claim that $s'$ is bounded from below as required. To see this, take $m$ large enough that the bound given by $q$ is satisfied; then the $s'$ we choose will be bounded from below by $s$, so we are running $A$ with valid parameters. Thus we have a functional algorithm whose input and runtime are both bounded polynomially.
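The core of the construction is the way oracle calls are answered from the fixed sample. The following Python sketch (ours) shows only that mechanical wrapping step; how $\epsilon$, $\delta$, and the estimate of $s$ are actually chosen from $m$ is the substance of the argument above and is deliberately not reproduced here.

```python
import random

def functional_from_oracle(oracle_learner, epsilon, delta, s):
    """Wrap an oracle-model learner so that it only ever sees a fixed labelled sample,
    answering each oracle call with a point drawn uniformly at random (with replacement)
    from that sample. The parameters are fixed by the wrapper, not supplied by a user."""
    def learner(sample):
        n = len(sample[0][0])                 # dimension, read off the points (assumed sequences)
        draw = lambda: random.choice(sample)  # the uniform distribution over the sample
        return oracle_learner(epsilon, delta, s, n, draw)
    return learner
```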
Lemma 6 shows that the model of polynomial PAC learnability considered by Board and Pitt is in fact equivalent to the models commonly used in the literature, and is a member of the equivalence class in Theorem 5. Based on this result, we would now like to construct an Occam algorithm whose output complexity is independent of .
Theorem 7.
If $\mathbf{C}$ is closed under exception lists and polynomially learnable, then it is learnable by an $\epsilon$-independent Occam algorithm.
Proof 2.
We now repeat Board and Pitt’s [6] proof of the equivalence between general learnability and Occam learnability using our newly constructed $\epsilon$- and $\delta$-independent algorithm, $A'$, and show that we still obtain a polynomial Occam algorithm, which is now $\epsilon$-independent.
Define $q$ as in the previous proof. Choose a constant $b$ such that we can bound the VC dimension of the hypothesis space, for runtime $r$, by
(1)
where $q$ and $r$ are defined as in the previous proof. Thus, $A'$ does not require $\epsilon$, $\delta$, or $s$ as explicit inputs, so it is consistent with the models in Theorem 5, and if it learns $\mathbf{C}$, we can say that $\mathbf{C}$ is polynomially learnable (as defined in Definition 4).
Now, we construct an Occam algorithm $O$ from $A'$ as follows (a schematic code sketch follows the list):
1. Run $A'$ as defined in the previous proof, and let its output be denoted $h$.
2. Compute the exception list $E$, the set of sample points on which $h$ disagrees with the sample labels.
3. Output ExList($h$, $E$).
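A schematic version of these three steps, with `learner` standing in for $A'$ and `exlist` for ExList (both hypothetical callables; this illustrates the control flow only, not the complexity bookkeeping that follows):

```python
def occam_from_learner(learner, exlist):
    """Build the Occam algorithm O from a functional learner and an ExList routine."""
    def occam(sample):
        h = learner(sample)                                         # step 1: run A'
        exceptions = {x for x, label in sample if h(x) != label}    # step 2: exception list E
        return exlist(h, exceptions)                                # step 3: output ExList(h, E)
    return occam
```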
We define a constant $k$ such that the required bound holds for any $m$. To prove the theorem, we simply need to show that $O$ is an Occam algorithm for $\mathbf{C}$ with corresponding polynomial
(2)
and constant exponent
(3)
$A'$ runs in polynomial time, as does ExList; thus, $O$ runs in polynomial time. Clearly, any output produced by the algorithm is consistent with the sample. Also, by the definition of $\mathbf{H}$ in Section 2.1, it must be true that the output is a hypothesis in $\mathbf{H}$.
As in Definition 5, let $H^{O}_{m,s}$ denote the $O$-image of $S_{m,s}$, i.e., the set of all hypotheses produced by $O$ when $O$ is given an $m$-sample of a concept $c$ with $\text{size}(c) \leq s$. We would like to find the VC dimension of $H^{O}_{m,s}$.
Let the VC dimension of be called . If , it is clearly bounded by and the theorem is proved. Now let us assume .
Let $H'$ be the effective hypothesis space of $A'$. An algorithm cannot output a hypothesis with size greater than its runtime $r$, so the size of each element of $H'$ is bounded by $r$, which in turn bounds VC-Dim($H'$).
If then clearly is bounded by and the theorem is proved.
If then by choice of , . Thus,
Combining with (4) we have
Raising each side to , we get
4 Conclusion
We have now shown that the algorithms used in Board and Pitt’s analysis [6] are, in fact, equivalent to the algorithmic models used by Blumer et al. [5]. Thus, the equivalence of learnability and the weak approximability of the minimum-consistent-hypothesis problem is not dependent solely on the choice of algorithmic model, but in fact extends to all algorithmic models considered by Haussler et al. [10].
This analysis focused on strong learning conditions, but based on the equivalence of certain strong and weak learnability criteria, it should naturally extend to those conditions as well. In addition, the same method of proof shows that Board and Pitt’s theorem for finite-alphabet, length-based $\epsilon$-dependent Occam algorithms also applies to similarly defined but $\epsilon$-independent Occam algorithms as used in [4, 13]. Essentially, we have proven an equivalence which had been widely assumed to be true due to the work of Board and Pitt [6] but had not, in fact, been shown explicitly to be correct. Thus, we provide a posteriori justification of analyses and algorithmic design techniques which made use of this equivalence theorem.
Appendix A Approximate Occam algorithms
If we relax the condition that Occam algorithms be consistent with the training sample, we can define an “approximate Occam algorithm” and show that any polynomially PAC learnable concept class is learnable by such an algorithm [6]. In fact, it is easy to show that any polynomial PAC learning algorithm can be considered an “approximate Occam algorithm”.
Specifically, we can define approximate Occam algorithms as follows:
Definition 8 (-independent approximate Occam algorithm).
Let $A$ be a polynomial learning algorithm that, given a sample of a concept $c \in C$, produces a hypothesis consistent with at least $(1-\epsilon)m$ points in the sample. For concept class $\mathbf{C}$ and all $m, s \geq 1$, let $S_{m,s}$ denote the set of all $m$-samples of concepts $c \in C$ such that $\text{size}(c) \leq s$. For polynomial-time learning algorithm $A$, let $H_{m,s}$ denote the $A$-image of $S_{m,s}$, i.e., the set of all hypotheses produced by $A$ when $A$ is given an $m$-sample of a concept $c$ with $\text{size}(c) \leq s$. We also call $H_{m,s}$ the effective hypothesis space of $A$. Then $A$ is an approximate Occam algorithm for $\mathbf{C}$ if there exist a polynomial $p(n, s)$ and a constant $\alpha$ with $0 \leq \alpha < 1$ such that, for all $m, n, s \geq 1$, the VC dimension of $H_{m,s}$ is bounded by $p(n, s)\, m^{\alpha}$ [6].
Then we can show the following lemma:
Lemma 8.
If $A$ is a polynomial PAC learning algorithm for $\mathbf{C}$, then it is an approximate Occam algorithm for $\mathbf{C}$.
Proof 3.
Define $q$ as in the proof of Lemma 6, and consider a learning algorithm of the type constructed there. Then, following the proof of Theorem 7, we can choose a constant $b$ such that we can bound the VC dimension of the hypothesis space, for runtime $r$, by
(5)
where $q$ and $r$ are defined as in the proof of Lemma 6. Thus, $A'$ does not require $\epsilon$, $\delta$, or $s$ as explicit inputs, so it is consistent with the models in Theorem 5, and if it learns $\mathbf{C}$, we can say that $\mathbf{C}$ is polynomially learnable (as defined in Definition 4).
Let $H'$ be the effective hypothesis space of $A'$. An algorithm cannot output a hypothesis with size greater than its runtime $r$, so the size of each element of $H'$ is bounded by $r$. Then
which is polynomial in $n$ and $s$ and sublinear in $m$. As discussed previously, the runtime $r$ is polynomially bounded in these parameters, so we have a polynomial in $n$ and $s$ which is multiplied by a sublinear factor in $m$. This completes the proof of Lemma 8.
Acknowledgements
I would like to thank Ard Louis for his mentorship and guidance over the course of this work, and for providing feedback on earlier drafts.
References
- [1] J. Pearl, On the connection between the complexity and credibility of inferred models, International Journal of General System 4 (4) (1978) 255–264.
- [2] P. Domingos, The role of Occam’s razor in knowledge discovery, Data mining and knowledge discovery 3 (1999) 409–425.
- [3] G. I. Webb, Further experimental evidence against the utility of Occam’s razor, Journal of Artificial Intelligence Research 4 (1996) 397–417.
- [4] A. Blumer, A. Ehrenfeucht, D. Haussler, M. K. Warmuth, Occam’s razor, Information processing letters 24 (6) (1987) 377–380.
- [5] A. Blumer, A. Ehrenfeucht, D. Haussler, M. K. Warmuth, Learnability and the Vapnik-Chervonenkis dimension, Journal of the ACM (JACM) 36 (4) (1989) 929–965.
- [6] R. Board, L. Pitt, On the necessity of Occam algorithms, in: Proceedings of the Twenty-second Annual ACM Symposium on Theory of Computing, 1990, pp. 54–63.
- [7] L. Pitt, M. K. Warmuth, The minimum consistent DFA problem cannot be approximated within any polynomial, Journal of the ACM (JACM) 40 (1) (1993) 95–142.
- [8] T. Hancock, T. Jiang, M. Li, J. Tromp, Lower bounds on learning decision lists and trees, in: STACS 95: 12th Annual Symposium on Theoretical Aspects of Computer Science Munich, Germany, March 2–4, 1995 Proceedings, Springer, 2005, pp. 527–538.
- [9] J. S. Vitter, P. Krishnan, Optimal prefetching via data compression, Journal of the ACM (JACM) 43 (5) (1996) 771–793.
- [10] D. Haussler, M. Kearns, N. Littlestone, M. K. Warmuth, Equivalence of models for polynomial learnability, Information and Computation 95 (2) (1991) 129–161.
- [11] S. Shalev-Shwartz, S. Ben-David, Understanding Machine Learning: From Theory to Algorithms, Cambridge University Press, 2014.
- [12] L. Pitt, L. G. Valiant, Computational limitations on learning from examples, Journal of the ACM (JACM) 35 (4) (1988) 965–984.
- [13] M. J. Kearns, U. Vazirani, An Introduction to Computational Learning Theory, The MIT Press, 1994. doi:10.7551/mitpress/3897.001.0001.