Fat-Shattering Dimension of $k$-Fold Aggregations (a previous version of this paper was titled "Fat-shattering dimension of $k$-fold maxima")
Abstract
We provide estimates on the fat-shattering dimension of aggregation rules of $k$ real-valued function classes. The latter consists of all ways of choosing $k$ functions, one from each of the classes, and computing a pointwise function of them, such as the median, mean, or maximum. The bound is stated in terms of the fat-shattering dimensions of the component classes. For linear and affine function classes, we provide a considerably sharper upper bound and a matching lower bound, achieving, in particular, an optimal dependence on $k$. Along the way, we improve several known results in addition to pointing out and correcting a number of erroneous claims in the literature.
Keywords: combinatorial dimension, scale-sensitive dimension, fat-shattering dimension, aggregation rules, $k$-fold maximum, ensemble methods
1 Introduction
The fat-shattering dimension, also known as the "scale-sensitive dimension" and "the parametrized variant of the $P$-dimension", was first proposed by Kearns and Schapire (1994); its key role in learning theory lies in characterizing the PAC learnability of real-valued function classes (Alon et al., 1997; Bartlett and Long, 1998).
In this paper, we study the behavior of the fat-shattering dimension under various $k$-fold aggregations. Let $\mathcal{F}_1,\dots,\mathcal{F}_k$ be real-valued function classes and let $G:\mathbb{R}^k\to\mathbb{R}$ be an aggregation rule. We consider the aggregate function class $G(\mathcal{F}_1,\dots,\mathcal{F}_k)$, which consists of all mappings $x\mapsto G(f_1(x),\dots,f_k(x))$ for $f_i\in\mathcal{F}_i$. Some natural aggregation rules include the pointwise $k$-fold maximum, median, mean, and max-min. We seek to bound the shattering complexity of $G(\mathcal{F}_1,\dots,\mathcal{F}_k)$ in terms of the fat-shattering dimensions of the constituent classes $\mathcal{F}_i$. This question arises naturally in the context of ensemble methods, such as boosting and bagging, where the learner's prediction consists of an aggregation of base learners.
The analogous question regarding aggregations of VC classes (VC dimension being the combinatorial complexity controlling the learnability of Boolean function classes) has been studied in detail and largely resolved (Baum and Haussler, 1989; Blumer et al., 1989; Eisenstat and Angluin, 2007; Eisenstat, 2009; Csikós et al., 2019). Furthermore, closure properties were also studied in the context of online classification and private PAC learning (Alon et al., 2020; Ghazi et al., 2021) for the Littlestone and threshold dimensions. However, for real-valued functions, this question remained largely uninvestigated.
Our contributions are as follows:
•
For a natural class of aggregation rules that commute with shifts (see Definition (4)), assuming $\mathrm{fat}_\gamma(\mathcal{F}_i)\le d$ for each $i\in[k]$, we show an upper bound on the fat-shattering dimension of the aggregate class in terms of $d$ and $k$; the formal statement is given in Theorem 1. By using an entirely different approach, for aggregations that are $L$-Lipschitz in supremum norm (see Definition (5)) and for bounded function classes, we show an alternative upper bound; the formal statement is given in Theorem 2.
In particular, our proofs hold for the maximum, minimum, median, mean, and max-min aggregations.
•
For bounded affine functions and for aggregations that are $L$-Lipschitz in supremum norm, we show a dimension-free bound on the fat-shattering dimension of the aggregate class. This result also extends to the hinge-loss class of affine functions. The formal statement is given in Theorem 3. In particular, we improve by a log factor the estimate of Fefferman et al. (2016, Lemma 6) on the fat-shattering dimension of the max-min aggregation of linear functions.
•
For affine functions and the $k$-fold max aggregation, we show dimension-dependent upper and lower bounds that are tight up to constants and that depend on the Euclidean dimension. For the formal statements, see Theorems 7 and 8.
Applications.
The need to analyze the combinatorial complexity of a $k$-fold maximum of function classes (see (3) for the formal definition) arises in a number of diverse settings. One natural example is adversarially robust PAC learning with respect to test-time attacks for real-valued functions (Attias et al., 2022; Attias and Hanneke, 2023). In this setting, the learner observes an i.i.d. labeled sample from an unknown distribution, and the goal is to output a hypothesis with a small error on unseen examples from the same distribution, with high probability. The difference from the standard PAC learning model is that at test time, the learner only observes a corrupted example, while the prediction is tested against the original label. Formally, a labeled example $(x,y)$ is drawn from the unknown distribution, and there is an adversary that can map $x$ to one of $k$ possible corruptions that are known to the learner. The learner observes only the corrupted point, while its loss is measured with respect to the original label $y$. This scenario is naturally captured by the $k$-fold max: the learner aims to learn the maximum aggregation of the loss classes. Attias et al. (2022) showed that uniform convergence holds in this case, and so the sample complexity of an empirical risk minimization algorithm is determined by the complexity measure of the $k$-fold maximum aggregation.
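To make the reduction concrete, the following minimal sketch (not taken from the paper; all names are illustrative) computes the robust loss of a hypothesis as the $k$-fold maximum of per-corruption losses, which is exactly the aggregate quantity whose complexity is analyzed here.

```python
import numpy as np

def robust_loss(predict, loss, corruptions, x, y):
    """Adversarially robust loss: the worst case over the k known corruptions of x.

    predict     -- hypothesis mapping a point to a real-valued prediction
    loss        -- loss(prediction, label) -> float
    corruptions -- maps x to its k possible corrupted versions (known to the learner)
    """
    return max(loss(predict(z), y) for z in corruptions(x))

# Toy usage with a linear predictor, squared loss, and k = 3 corruptions.
w = np.array([1.0, -0.5])
predict = lambda z: float(w @ z)
sq_loss = lambda p, y: (p - y) ** 2
corruptions = lambda x: [x, x + 0.1, x - 0.1]
print(robust_loss(predict, sq_loss, corruptions, np.array([0.3, 0.7]), 1.0))
```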
The $k$-fold maximum also arises in the setting of learning polyhedra with a margin. Gottlieb et al. (2018) provided a learning algorithm that represents polyhedra as intersections of halfspaces defined by bounded affine functions. The sample complexity of the algorithm is determined by the complexity measure of the maximum aggregation of affine function classes.
Another natural example where the $k$-fold maximum and $k$-fold max-min play a role is in analyzing the convergence of $k$-means clustering. Fefferman et al. (2016) bounded the max-min aggregation, and Klochkov et al. (2021); Biau et al. (2008); Appert and Catoni (2021); Zhivotovskiy (2022) bounded the max aggregation. The main challenge in this setting is bounding the covering numbers of the aggregation of function classes, which can be obtained by bounding the Rademacher complexity or the fat-shattering dimension.
Finally, there are numerous ensemble methods for regression that output some aggregation of base learners, such as the median or mean. Examples of these methods include boosting (e.g., Freund and Schapire (1997); Kégl (2003)), bagging (bootstrap aggregation) by Breiman (1996), and its extension to the random forest algorithm (Breiman, 2001).
Related work.
It was claimed in Attias et al. (2019, Theorem 12) that a certain bound on the fat-shattering dimension of the $k$-fold maximum holds, but the proof had a mistake (see Section 5); our Open Problem asks whether the general form of the bound does hold (we believe it does, at least for the max aggregation). Using the recent disambiguation result of Alon et al. (2022), presented in Lemma 9 here, Attias et al. (2022, Lemma 15) obtained a bound that depends on the cardinality of the domain $\mathcal{X}$ of the function classes. The latter is, in general, incomparable to Theorem 1, which is clearly a considerable improvement for large or infinite domains.
Using the covering number results of Mendelson and Vershynin (2003) and Talagrand (2003) (see Section A.1), Duan (2012, Theorem 6.2) obtained a general result, which, when specialized to $k$-fold maxima, yields
(1)
for a universal constant $C$; (1) is an immediate consequence of Theorem 10, Lemma 15, and Lemma 16 in this paper. Our results improve over (1) by removing an extraneous dependence in the scale of the fat-shattering dimensions; however, Duan's general method is applicable to a wider class of uniformly continuous $k$-fold aggregations.
Srebro et al. (2010, Lemma A.2) bounded the fat-shattering dimension in terms of the Rademacher complexity. Foster and Rakhlin (2019) bounded the Rademacher complexity of a smooth $k$-fold aggregate; see also the references therein. Inspired by Appert and Catoni (2021), Zhivotovskiy (2022) obtained the best known upper bound on the Rademacher complexity of the $k$-fold maximum of linear function classes. Raviv et al. (2018) upper bounded the Rademacher complexity of the $k$-fold maximum aggregation of affine functions and hinge-loss affine functions.
2 Preliminaries
Aggregation rules.
A $k$-fold aggregation rule is any mapping $G:\mathbb{R}^k\to\mathbb{R}$. Just as $G$ maps $k$-tuples of reals into reals, it naturally aggregates $k$-tuples of functions into a single function: for $f_1,\dots,f_k\in\mathbb{R}^{\mathcal{X}}$, we define $G(f_1,\dots,f_k)$ as the mapping $x\mapsto G(f_1(x),\dots,f_k(x))$. Finally, the aggregation extends to $k$-tuples of function classes: for $\mathcal{F}_1,\dots,\mathcal{F}_k\subseteq\mathbb{R}^{\mathcal{X}}$, we define
$G(\mathcal{F}_1,\dots,\mathcal{F}_k) := \{\,G(f_1,\dots,f_k) : f_i\in\mathcal{F}_i,\ i\in[k]\,\}.$   (2)
A canonical example of an aggregation rule is the $k$-fold max, induced by the mapping
$G_{\max}(y_1,\dots,y_k) := \max_{i\in[k]} y_i.$   (3)
Next, we consider some properties that an aggregation rule might possess.
Commuting with shifts. We say that an aggregation rule $G$ commutes with shifts if
$G(y_1+c,\dots,y_k+c) = G(y_1,\dots,y_k)+c$ for all $y\in\mathbb{R}^k$ and all $c\in\mathbb{R}$.   (4)
Lipschitz continuity. The mapping $G$ is $L$-Lipschitz with respect to a norm $\|\cdot\|$ on $\mathbb{R}^k$ if
$|G(y)-G(y')| \le L\,\|y-y'\|$ for all $y,y'\in\mathbb{R}^k$.   (5)
Many natural aggregation rules possess these properties, such as the maximum, minimum, median, mean, and max-min. The maximum is defined in (3), the minimum is defined analogously as $G_{\min}(y_1,\dots,y_k):=\min_{i\in[k]}y_i$, the median is defined as the middle order statistic of $y_1,\dots,y_k$, and the mean is defined as $G_{\mathrm{mean}}(y_1,\dots,y_k):=\frac1k\sum_{i=1}^k y_i$.
We also define the max-min aggregation in (6); it is readily verified to commute with shifts, and Lemma 14 shows that it is $1$-Lipschitz with respect to the supremum norm.
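As a quick numerical sanity check (illustrative only, not part of the paper's formal development), the sketch below verifies that the maximum, minimum, median, and mean aggregations commute with shifts and are $1$-Lipschitz with respect to the supremum norm on randomly drawn inputs; numpy's convention is used for the median of an even number of values.

```python
import numpy as np

rng = np.random.default_rng(0)
aggregations = {"max": np.max, "min": np.min, "median": np.median, "mean": np.mean}

k, trials = 5, 1000
for name, G in aggregations.items():
    commutes, lipschitz = True, True
    for _ in range(trials):
        y, yp, c = rng.normal(size=k), rng.normal(size=k), rng.normal()
        # commuting with shifts: G(y_1 + c, ..., y_k + c) = G(y) + c
        commutes &= bool(np.isclose(G(y + c), G(y) + c))
        # 1-Lipschitz in sup norm: |G(y) - G(y')| <= max_i |y_i - y'_i|
        lipschitz &= bool(abs(G(y) - G(yp)) <= np.max(np.abs(y - yp)) + 1e-12)
    print(f"{name:>6}: commutes with shifts: {commutes}, 1-Lipschitz (sup norm): {lipschitz}")
```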
Fat-shattering dimension.
Let $\mathcal{X}$ be a set and $\mathcal{F}\subseteq\mathbb{R}^{\mathcal{X}}$. For $\gamma>0$, a set $S=\{x_1,\dots,x_m\}\subseteq\mathcal{X}$ is said to be $\gamma$-shattered by $\mathcal{F}$ if there is a witness $r=(r_1,\dots,r_m)\in\mathbb{R}^m$ such that for every $b\in\{-1,1\}^m$ there exists an $f\in\mathcal{F}$ satisfying
$b_i\,\bigl(f(x_i)-r_i\bigr)\ \ge\ \gamma \quad\text{for all } i\in[m].$   (7)
The $\gamma$-fat-shattering dimension of $\mathcal{F}$, denoted by $\mathrm{fat}_\gamma(\mathcal{F})$, is the size of the largest $\gamma$-shattered set (possibly $\infty$).
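To make the definition concrete, here is a minimal brute-force checker (illustrative only): the witness $r$ is searched over a finite grid rather than all of $\mathbb{R}^m$, and the class is a toy finite family of threshold functions.

```python
from itertools import product

def is_gamma_shattered(functions, points, gamma, witness_grid):
    """Brute-force check of gamma-shattering of `points` by a finite class.

    functions    -- finite list of callables x -> R (a stand-in for the class F)
    witness_grid -- candidate values for each witness coordinate r_i
    Returns True if some witness r realizes every sign pattern b in {-1, +1}^m.
    """
    m = len(points)
    for r in product(witness_grid, repeat=m):
        if all(
            any(all(b[i] * (f(points[i]) - r[i]) >= gamma for i in range(m))
                for f in functions)
            for b in product((-1, 1), repeat=m)
        ):
            return True
    return False

# Toy example: threshold functions gamma-shatter a single point,
# but (for this finite class and witness grid) not two points.
F = [lambda x, t=t: float(x >= t) for t in (0.0, 0.5, 1.0)]
print(is_gamma_shattered(F, [0.25], gamma=0.4, witness_grid=[0.5]))        # True
print(is_gamma_shattered(F, [0.25, 0.75], gamma=0.4, witness_grid=[0.5]))  # False
```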
Fat-shattering dimension at zero.
Rademacher complexity.
Let $\mathcal{F}$ be a real-valued function class on the domain $\mathcal{X}$. The empirical Rademacher complexity of $\mathcal{F}$ on a given sequence $S=(x_1,\dots,x_n)\in\mathcal{X}^n$ is defined as
$\widehat{\mathcal{R}}_S(\mathcal{F}) := \mathbb{E}_{\sigma}\Bigl[\sup_{f\in\mathcal{F}}\frac1n\sum_{i=1}^n \sigma_i f(x_i)\Bigr],$
where $\sigma_1,\dots,\sigma_n$ are independent random variables uniformly distributed on $\{-1,+1\}$. The Rademacher complexity of $\mathcal{F}$ with respect to a distribution $\mu$ is defined as
$\mathcal{R}_n(\mathcal{F}) := \mathbb{E}_{S\sim\mu^n}\,\widehat{\mathcal{R}}_S(\mathcal{F}).$
It is a classic fact (Mohri et al., 2012, Theorem 3.1) that the Rademacher complexity controls generalization bounds in a wide range of supervised learning settings.
Covering numbers.
We start with some background on covering numbers. Whenever $\mathcal{X}$ is endowed with a probability measure $\mu$, this induces, for $p\ge1$ and $f\in\mathbb{R}^{\mathcal{X}}$, the norm
$\|f\|_{L_p(\mu)} := \Bigl(\int_{\mathcal{X}} |f|^p \,\mathrm{d}\mu\Bigr)^{1/p}$
on $\mathbb{R}^{\mathcal{X}}$. For $p=\infty$, $\|f\|_{L_\infty(\mu)}$ is the essential supremum of $|f|$ with respect to $\mu$. For $\mathcal{F}\subseteq\mathbb{R}^{\mathcal{X}}$ and $\gamma>0$, we say that $\mathcal{G}\subseteq\mathbb{R}^{\mathcal{X}}$ is a $\gamma$-cover of $\mathcal{F}$ under $\|\cdot\|_{L_p(\mu)}$ if for every $f\in\mathcal{F}$ there is a $g\in\mathcal{G}$ with $\|f-g\|_{L_p(\mu)}\le\gamma$. The $\gamma$-covering number of $\mathcal{F}$, denoted by $\mathcal{N}\bigl(\mathcal{F},\gamma,L_p(\mu)\bigr)$, is the cardinality of the smallest $\gamma$-cover of $\mathcal{F}$ (possibly $\infty$). We note the obvious relation
$\mathcal{N}\bigl(\mathcal{F},\gamma,L_p(\mu)\bigr) \;\le\; \mathcal{N}\bigl(\mathcal{F},\gamma,L_q(\mu)\bigr), \qquad 1\le p\le q\le\infty,$   (9)
which holds for all probability measures $\mu$ and all $\gamma>0$.
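As an illustration (not from the paper), the sketch below computes a greedy internal $\gamma$-cover of a finite class restricted to a sample, under the empirical $L_p$ norm induced by the uniform measure on the sample; the size of the greedy cover upper-bounds the corresponding covering number.

```python
import numpy as np

def greedy_cover_size(values, gamma, p=2):
    """Greedy internal gamma-cover of a class restricted to a sample.

    values -- array of shape (num_functions, n); row j holds (f_j(x_1), ..., f_j(x_n))
    Distances are empirical L_p norms with the uniform measure on the n sample points.
    Returns the size of the constructed cover (an upper bound on the covering number).
    """
    F = np.asarray(values, dtype=float)
    uncovered = np.ones(len(F), dtype=bool)
    centers = 0
    while uncovered.any():
        j = int(np.flatnonzero(uncovered)[0])
        centers += 1
        dist = np.mean(np.abs(F - F[j]) ** p, axis=1) ** (1.0 / p)
        uncovered &= dist > gamma
    return centers

# Toy usage: 200 unit-norm linear functions evaluated on 50 sample points.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3)) / np.sqrt(3)
W = rng.normal(size=(200, 3))
W /= np.linalg.norm(W, axis=1, keepdims=True)
print(greedy_cover_size(W @ X.T, gamma=0.25))
```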
We sometimes overload the notation by defining aggregations on $k$-tuples of functions (instead of on $k$-tuples of reals): $G(f_1,\dots,f_k)(x):=G(f_1(x),\dots,f_k(x))$. We say that $G$ is $L$-Lipschitz with respect to a norm $\|\cdot\|$ on functions if
$\bigl\|G(f_1,\dots,f_k)-G(g_1,\dots,g_k)\bigr\| \;\le\; L\,\max_{i\in[k]}\|f_i-g_i\|$
for all tuples $f_1,\dots,f_k$ and $g_1,\dots,g_k$.
Notation.
We write $\mathbb{N}$ to denote the natural numbers and, for $k\in\mathbb{N}$, we write $[k]:=\{1,\dots,k\}$. For any function class $\mathcal{F}$ over a set $\mathcal{X}$ and any $S\subseteq\mathcal{X}$, $\mathcal{F}|_S$ denotes the projection onto (restriction to) $S$. In line with the common convention in functional analysis, absolute numerical constants are denoted by letters such as $c$ and $C$, whose value may change from line to line. Any transformation $T:\mathbb{R}\to\mathbb{R}$ may be applied to a function $f$ via $(Tf)(x):=T(f(x))$, as well as to a function class $\mathcal{F}$ via $T\mathcal{F}:=\{Tf:f\in\mathcal{F}\}$. The sign function thresholds at $0$: $\mathrm{sign}(t)=1$ for $t\ge0$ and $\mathrm{sign}(t)=-1$ otherwise.
3 Main Results
Our main results involve upper-bounding the fat-shattering dimension of aggregation rules in terms of the dimensions of the component classes. We begin with the simplest (to present):
Theorem 1 (General function classes and aggregations that commute with shifts)
For $\gamma>0$, function classes $\mathcal{F}_1,\dots,\mathcal{F}_k$, and an aggregation rule that commutes with shifts (see definition (4)), we have
where . In the degenerate case where , .
In particular, this result holds for natural aggregation rules, such as maximum, minimum, median, and mean.
Remark.
We made no attempt to optimize the constants; these are only provided to give a rough order-of-magnitude sense. In the sequel, we forgo numerical estimates and state the results in terms of unspecified universal constants.
The next result provides an alternative bound based on an entirely different technique:
Theorem 2 (Bounded function classes and Lipschitz aggregations)
For $\gamma>0$, bounded function classes $\mathcal{F}_1,\dots,\mathcal{F}_k$, and an aggregation rule that is $L$-Lipschitz in supremum norm (see definition (5)), we have
where
and the remaining constants are universal.
In Section A.1, we show that the maximum, median, mean, and max-min aggregations are $1$-Lipschitz in supremum norm.
Remark.
The bounds in Theorems 1 and 2 are, in general, incomparable, and not just because of the unspecified constants in the latter. One notable difference is that Theorem 1 depends only on the fat-shattering dimensions at the scale $\gamma$, while Theorem 2 additionally features a (weak) explicit dependence on the aspect ratio. In particular, Theorem 1 is applicable to semi-bounded affine classes (see Section A.4), while Theorem 2 is not. Still, for fixed and large , the latter presents a significant asymptotic improvement over the former.
For the special case of affine functions and hinge-loss affine functions, the technique of Theorem 2 yields a considerably sharper estimate:
Theorem 3 (Dimension-free bound for Lipschitz aggregations of affine functions)
Let $B^n$ be the $n$-dimensional Euclidean unit ball and let
(10)
be collections of bounded affine functions on $B^n$, and let $G$ be an aggregation rule that is $L$-Lipschitz in supremum norm (see definition (5)). Then
(11)
where $C$ is a universal constant. Further, if
(12)
is a family of bounded hinge-loss affine functions for $i\in[k]$ and $G$ is an aggregation rule that is $L$-Lipschitz in supremum norm, then the same bound as in (11) holds for the corresponding aggregate class.
In particular, Theorem 3 improves by a log factor the estimate of Fefferman et al. (2016) on the fat-shattering dimension of the max-min aggregation (defined in Section 2) of linear functions. (The max-min aggregation is shown to be $1$-Lipschitz in supremum norm in Lemma 14 of Section A.1.)
Lemma 4 (Fefferman et al. (2016), Lemma 6)
Let $B^n$ be the $n$-dimensional Euclidean unit ball and let $\mathcal{F}_1=\dots=\mathcal{F}_k$ be (identical) linear function classes defined on $B^n$. If $G$ is the max-min aggregation rule (6), then
where $C$ is a universal constant.
Our Theorem 3 improves the latter by a log factor:
Corollary 5 (Rademacher Complexity for $k$-Fold Maximum of Affine Functions)
Corollary 5 improves upon Raviv et al. (2018, Theorem 7): their upper bound scales linearly with $k$, whereas ours has a considerably milder dependence on $k$.
Note, however, that for linear classes a better bound is known:
Theorem 6 (Zhivotovskiy (2022))
Let $B^n$ be the $n$-dimensional Euclidean unit ball and let $\mathcal{F}_1=\dots=\mathcal{F}_k$ be (identical) linear function classes defined on $B^n$. If $G$ is the maximum aggregation rule, then
where $\mathcal{R}_n$ denotes the Rademacher complexity and $C$ is a universal constant.
The estimate in Theorem 3 is dimension-free in the sense of being independent of the Euclidean dimension $n$. In applications where a dependence on $n$ is admissible, an optimal bound can be obtained:
Theorem 7 (Dimension-dependent bound for -fold maximum of affine functions)
Let $\mathcal{F}_1=\dots=\mathcal{F}_k$ be (identical) function classes consisting of all real-valued affine functions on $\mathbb{R}^n$, and consider their $k$-fold maximum (see definition (3)). Then
where $C$ is a universal constant.
The optimality of the upper bound in Theorem 7 is witnessed by the matching lower bound:
Theorem 8 (Dimension-dependent lower bound for -fold maximum of affine functions)
For $n,k\in\mathbb{N}$ and $\gamma>0$, let $\mathcal{F}$ be the collection of all affine functions over $\mathbb{R}^n$ and consider the $k$-fold maximum of $k$ copies of $\mathcal{F}$ (see definition (3)). Then
where $c$ is a universal constant.
The scaling argument employed in the proof of Theorem 8 can be invoked to show that the claim continues to hold for .
4 Proofs
We start by upper-bounding the fat-shattering dimension of aggregation rules that commute with shifts, in terms of the dimensions of the component classes.
4.1 Proof of Theorem 1
Partial concept classes and disambiguation.
We say that $\mathcal{F}\subseteq\{0,1,\star\}^{\mathcal{X}}$ is a partial concept class over $\mathcal{X}$; this usage is consistent with Alon et al. (2022), while Attias et al. (2019, 2022) used the descriptor ambiguous. For $f\in\mathcal{F}$, define its disambiguation set as
$\mathrm{dis}(f) := \bigl\{\,h\in\{0,1\}^{\mathcal{X}} : h(x)=f(x)\ \text{whenever } f(x)\in\{0,1\}\,\bigr\};$
in words, $\mathrm{dis}(f)$ consists of the total concepts that agree pointwise with $f$ whenever the latter takes a value in $\{0,1\}$. We say that $\bar{\mathcal{F}}\subseteq\{0,1\}^{\mathcal{X}}$ disambiguates $\mathcal{F}$ if for all $f\in\mathcal{F}$, we have $\bar{\mathcal{F}}\cap\mathrm{dis}(f)\neq\emptyset$; in words, every $f\in\mathcal{F}$ must have a disambiguated representative in $\bar{\mathcal{F}}$. (Attias et al. (2022) imposed an additional requirement on the disambiguation, but this is an unnecessary restriction and does not affect any of the results.)
As in Alon et al. (2022) and Attias et al. (2022), we say that $S\subseteq\mathcal{X}$ is VC-shattered by the partial class $\mathcal{F}$ if $\mathcal{F}|_S\supseteq\{0,1\}^S$. (Attias et al. (2019) had incorrectly given a different condition as the shattering condition.) We write $\mathrm{VC}(\mathcal{F})$ to denote the size of the largest VC-shattered set (possibly $\infty$). The obvious relation $\mathrm{VC}(\mathcal{F})\le\mathrm{VC}(\bar{\mathcal{F}})$ always holds between a partial concept class $\mathcal{F}$ and any of its disambiguations $\bar{\mathcal{F}}$. Alon et al. (2022, Theorem 13) proved the following variant of the Sauer-Shelah-Perles Lemma for partial concept classes:
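The following small illustration (not from the paper) spells out the disambiguation set of a single partial concept and the VC-shattering condition for a partial class over a finite domain, with '*' playing the role of the undefined value.

```python
from itertools import product

STAR = "*"

def disambiguations(f):
    """All total concepts over the same finite domain that agree with the
    partial concept f wherever f takes a value in {0, 1}."""
    choices = [(v,) if v != STAR else (0, 1) for v in f]
    return [tuple(c) for c in product(*choices)]

def vc_shattered(partial_class, S):
    """S (a tuple of coordinate indices) is VC-shattered by a partial class
    if the projection of the class onto S contains every binary pattern."""
    projections = {tuple(f[i] for i in S) for f in partial_class}
    return all(b in projections for b in product((0, 1), repeat=len(S)))

# Toy partial class over a 3-point domain; '*' marks the undefined value.
F = [(0, 0, STAR), (1, 0, STAR), (0, 1, STAR), (1, 1, STAR)]
print(vc_shattered(F, (0, 1)))   # True: the first two points are VC-shattered
print(disambiguations(F[0]))     # [(0, 0, 0), (0, 0, 1)]
```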
Lemma 9 (Alon et al. (2022))
We will make use of the elementary fact
and its corollary
(15)
Proof [of Theorem 1] We follow the basic techniques of discretization and -shifting, employed in Attias et al. (2019, 2022). Fix and define the operator as
Observe that for all $f$ and $x$, the displayed relation holds. Let $G$ be a $k$-fold aggregation rule and let $\mathcal{F}_1,\dots,\mathcal{F}_k$ be real-valued function classes. Suppose that some $S$ is $\gamma$-shattered by $G(\mathcal{F}_1,\dots,\mathcal{F}_k)$; proving the claim amounts to upper-bounding $|S|$ appropriately. By (8), there is an $r$ such that the shattering is realized with shift $r$. Since $G$ commutes with shifts, as defined in (4), we have
(16)
Hence, is VC-shattered by and
(17)
Let us assume for now that each ; in this case, there is no loss of generality in assuming . Let be a “good” disambiguation of on , as furnished by Lemma 9:
Observe that is a valid disambiguation of . It follows that
(18)
Thus, (15) implies that , and the latter is an upper bound on — and hence, also on . The claim now follows from (17).
If any one given , we claim that (18) is unaffected. This is because any with has a singleton disambiguation . Indeed, any given can receive at most one of as a label from the members of (otherwise, it would be shattered, forcing ). If any labels with , then all members of are disambiguated to label with (and, mutatis mutandis, ). Any labeled with by every can be disambiguated arbitrarily (say, to ). Disambiguating the degenerate to the singleton has no effect on the product in (18).
The foregoing argument continues to hold if more than one . In particular, in the degenerate case where , we have , which forces .
4.2 Proof of Theorem 2
First, we upper bound the covering numbers of Lipschitz aggregations as a function of the covering numbers of the component classes.
Theorem 10 (Covering number of -Lipschitz aggregations)
Let $\gamma>0$, let $\mathcal{F}_1,\dots,\mathcal{F}_k\subseteq\mathbb{R}^{\mathcal{X}}$, and let $G$ be an aggregation rule that is $L$-Lipschitz. Then, for all probability measures $\mu$ on $\mathcal{X}$,
We proceed to the main proof.
Proof [of Theorem 2]. Let $G$ be an aggregation rule that is $L$-Lipschitz in supremum norm, as defined in (5), and let $\mathcal{F}_1,\dots,\mathcal{F}_k$ be real-valued function classes. Suppose that some $S$ is $\gamma$-shattered by $G(\mathcal{F}_1,\dots,\mathcal{F}_k)$. We upper bound the covering number in terms of the fat-shattering dimension as in Lemma 18 (see Section A.2),
where . Then Theorem 10 implies that
where , (a) follows since and assuming , and (b) follows by the concavity of (see Lemma 25 in Section A.5). We can assume without loss of generality. Combining the monotonicity of the covering number (see (9)) and a lower bound on the covering number in terms of the fat-shattering dimension (see Lemma 15 in Section A.2) yields
whence
Using the elementary fact
(with and ), we get
which implies the claim.
4.3 Proof of Theorem 3
We use the notation and results from the Appendix, and in particular, from Section A.3.
Proof [of Theorem 3] A bound of this form for the $k$-fold maximum aggregation was claimed in Kontorovich (2018); however, the argument there was flawed, see Section 5.
Let $G$ be an aggregation rule that is $L$-Lipschitz in supremum norm, as defined in (5), and let $\mathcal{F}_1,\dots,\mathcal{F}_k$ be bounded affine function classes, as defined in (10). Suppose that some $S$ is $\gamma$-shattered by $G(\mathcal{F}_1,\dots,\mathcal{F}_k)$ and let $\mu$ be the uniform distribution on $S$. We upper bound the covering number as in Lemma 20,
Denote , and consider the covering number of at scale :
Then Theorem 10 implies that
where and (a) follows by the concavity of (see Corollary 24 in Section A.5). Combining the monotonicity of the covering number (see (9)) and a lower bound on the covering number in terms of the fat-shattering dimension (see Lemma 15 in Section A.2) yields
whence
Using the elementary fact
(with and ) we get , which implies the claim.
The result can easily be generalized to hinge-loss affine classes. Let be an affine function class as in (10), define as the function class on given by , and the hinge-loss affine class as the function class on given by . One first observes that the restriction of to any , as a body in , is identical to the restriction of to . Interpreting the hinge loss as a $2$-fold maximum over the singleton class and the bounded affine class lets us invoke Theorem 10 to argue that the two classes have the same covering numbers. Hence, the argument we deployed here to establish (11) for affine classes also applies to $k$-fold $L$-Lipschitz aggregations of hinge-loss classes.
4.4 Proof of Corollary 5
Proof [of Corollary 5] Raviv et al. (2018, Theorem 7) upper-bounded the Rademacher complexity of the maximum aggregation of hinge-loss affine functions by a quantity that scales linearly with $k$. For bounded affine functions or hinge-loss affine functions, the analysis above, combined with the calculation in Kontorovich (2018), yields the improved bound. For completeness, we provide the full proof.
Let $G$ be the $k$-fold maximum aggregation rule, as defined in (3), and let $\mathcal{F}_1,\dots,\mathcal{F}_k$ be bounded affine function classes as in (10) or hinge-loss affine function classes as in (12). Since this aggregation is $1$-Lipschitz in the supremum norm, Theorem 3 implies that
where $C$ is a universal constant.
From fat-shattering to Rademacher.
The fat-shattering estimate above can be used to upper-bound the Rademacher complexity by converting the former to a covering number bound and plugging it into Dudley’s chaining integral (Dudley, 1967):
(19)
where the covering numbers are as defined in Section A.1.
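For the reader's convenience, one standard refined form of the chaining bound (19) reads as follows (the constants vary across references and may differ from those in the original display):
$\widehat{\mathcal{R}}_S(\mathcal{F}) \;\le\; \inf_{\alpha\ge 0}\left(4\alpha \;+\; \frac{12}{\sqrt{n}}\int_{\alpha}^{\sup_{f\in\mathcal{F}}\|f\|_{L_2(\mu_n)}} \sqrt{\log \mathcal{N}\bigl(\mathcal{F},\varepsilon,L_2(\mu_n)\bigr)}\,\mathrm{d}\varepsilon\right),$
where $\mu_n$ is the empirical (uniform) measure on the sample $S$ of size $n$.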
It remains to bound the covering numbers. A simple way of doing so is to invoke Lemmas 2.6, 3.2, and 3.3 in Alon et al. (1997), but this incurs superfluous logarithmic factors. Instead, we use the sharper estimate of Mendelson and Vershynin (2003), stated here in Lemma 16. Substituting, the latter yields
Now
and choosing yields
4.5 Proof of Theorem 7
Proof [of Theorem 7] Let $G$ be the $k$-fold maximum aggregation rule, as defined in (3), and let $\mathcal{F}_1,\dots,\mathcal{F}_k$ be real-valued function classes. Note that $G$ is an aggregation that commutes with shifts, as defined in (4).
Since and commute, we have . We claim that
(20)
Indeed, any that is -shattered with shift by any is also VC-shattered by . (See Section 4.1, and notice that the converse implication—and the reverse inequality—do not hold.) It holds that
where (a) follows from a standard argument (e.g., Mohri et al. (2012, Example 3.2)), (b) holds because any that is VC-shattered by is also -shattered by with shift , (c) follows from Lemma 21, since the class is closed under scalar multiplication, and (d) holds since the shattering remains the same for the shifted class.
4.6 Proof of Theorem 8
Proof [of Theorem 8]
It follows from Mohri et al. (2012, Example 3.2) that . Since is closed under scalar multiplication, a scaling argument shows that any that is VC-shattered by is also -shattered by with shift , whence for all ; invoking Lemma 21 extends this to as well. Now Csikós et al. (2019, Theorem 1) shows that the $k$-fold unions of half-spaces necessarily shatter some set of size at least . Since union is a special case of the max operator, and the latter commutes with , the scaling argument shows that this is -shattered by with shift . Hence, , which proves the claim.
5 Discussion
In this paper, we proved upper and lower bounds on the fat-shattering dimension of aggregation rules as a function of the fat-shattering dimensions of the component classes. We leave some remaining gaps for future work. First, for aggregation rules that commute with shifts, assuming $\mathrm{fat}_\gamma(\mathcal{F}_i)\le d$ for each $i\in[k]$, we show in Theorem 1 that
where $C$ is a universal constant. We pose the following
Open problem.
Let $G$ be an aggregation rule that commutes with shifts. Is it the case that for all $\mathcal{F}_1,\dots,\mathcal{F}_k$ with $\mathrm{fat}_\gamma(\mathcal{F}_i)\le d$, $i\in[k]$, we have
for some universal constant $C$?
In light of Theorem 8, this is the best one could hope for in general. We pose also the following conjecture about bounded affine functions.
Conjecture 11
Theorem 3 is tight up to constants: for bounded affine functions and an aggregation rule that is $1$-Lipschitz in supremum norm,
(22)
where $c$ is a universal constant.
Throughout the paper, we mentioned several mistaken claims in the literature. In this section, we briefly discuss the nature of these mistakes, which are, in a sense, variations on the same kind of error. We begin with Attias et al. (2019, Lemma 14), which incorrectly claimed that any partial function class admits a disambiguation of comparable VC dimension (see Section 4.1 for the definitions). The mistake was pointed out to us by Yann Guermeur, and later, Alon et al. (2022, Theorem 11) showed that there exist partial classes of constant VC dimension for which every disambiguation has arbitrarily large VC dimension.
Kontorovich (2018) attempted to prove the bound stated in our Theorem 3 (up to constants, and only for linear classes). The argument proceeded via a reduction to the Boolean case, as in our proof of Theorem 7. It was correctly observed that if, say, some finite $S$ is $\gamma$-shattered by the aggregate class with shift $r$, then it is also VC-shattered by the corresponding thresholded Boolean class. Neglected was the fact that the latter might shatter additional points, and, in sufficiently high dimension, it necessarily will. The crux of the matter is that (20) holds in the dimension-dependent but not the dimension-free setting; again, this may be seen as a variant of the disambiguation mistake.
Finally, the proof of Hanneke and Kontorovich (2019, Lemma 6) claims, in the first display, that the shattered set can be classified with large margin, which is incorrect — yet another variant of mistaken disambiguation.
Acknowledgments
We thank Steve Hanneke and Ramon van Handel for very helpful discussions; the latter, in particular, patiently explained to us how to prove Lemma 20. Roman Vershynin kindly gave us permission to share his example in Remark 17. This research was partially supported by the Israel Science Foundation (grant No. 1602/19), an Amazon Research Award, the Ben-Gurion University Data Science Research Center, and Cyber Security Research Center, Prime Minister’s Office.
References
- Alon et al. (1997) Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615–631, 1997.
- Alon et al. (2020) Noga Alon, Amos Beimel, Shay Moran, and Uri Stemmer. Closure properties for private classification and online prediction. In Conference on Learning Theory, pages 119–152. PMLR, 2020.
- Alon et al. (2022) Noga Alon, Steve Hanneke, Ron Holzman, and Shay Moran. A theory of PAC learnability of partial concept classes. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 658–671, 2022.
- Appert and Catoni (2021) Gautier Appert and Olivier Catoni. New bounds for $k$-means and information $k$-means. arXiv preprint arXiv:2101.05728, 2021.
- Artstein et al. (2004) S. Artstein, V. Milman, and S. J. Szarek. Duality of metric entropy. Annals of Mathematics, 159(3):1313–1328, 2004.
- Attias and Hanneke (2023) Idan Attias and Steve Hanneke. Adversarially robust PAC learnability of real-valued functions. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 1172–1199, 2023.
- Attias et al. (2019) Idan Attias, Aryeh Kontorovich, and Yishay Mansour. Improved generalization bounds for robust learning. In Algorithmic Learning Theory, ALT 2019, volume 98 of Proceedings of Machine Learning Research, pages 162–183. PMLR, 2019.
- Attias et al. (2022) Idan Attias, Aryeh Kontorovich, and Yishay Mansour. Improved generalization bounds for adversarially robust learning. The Journal of Machine Learning Research, 23(1):7897–7927, 2022.
- Bartlett and Shawe-Taylor (1999) Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers, pages 43–54. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-19416-3.
- Bartlett and Long (1998) Peter L. Bartlett and Philip M. Long. Prediction, learning, uniform convergence, and scale-sensitive dimensions. J. Comput. Syst. Sci., 56(2):174–190, 1998.
- Baum and Haussler (1989) Eric B. Baum and David Haussler. What size net gives valid generalization? Neural Comput., 1(1):151–160, 1989.
- Biau et al. (2008) Gérard Biau, Luc Devroye, and Gábor Lugosi. On the performance of clustering in Hilbert spaces. IEEE Transactions on Information Theory, 54(2):781–790, 2008.
- Blumer et al. (1989) Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach., 36(4):929–965, 1989.
- Breiman (1996) Leo Breiman. Bagging predictors. Machine learning, 24:123–140, 1996.
- Breiman (2001) Leo Breiman. Random forests. Machine learning, 45:5–32, 2001.
- Csikós et al. (2019) Mónika Csikós, Nabil H. Mustafa, and Andrey Kupavskii. Tight lower bounds on the VC-dimension of geometric set systems. J. Mach. Learn. Res., 20:81:1–81:8, 2019.
- Duan (2012) Hubert Haoyang Duan. Bounding the Fat Shattering Dimension of a Composition Function Class Built Using a Continuous Logic Connective. PhD thesis, University of Waterloo, 2012.
- Dudley (1967) Richard M. Dudley. The sizes of compact subsets of Hilbert space and continuity of Gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
- Eisenstat (2009) David Eisenstat. k-fold unions of low-dimensional concept classes. Inf. Process. Lett., 109(23-24):1232–1234, 2009.
- Eisenstat and Angluin (2007) David Eisenstat and Dana Angluin. The VC dimension of k-fold union. Inf. Process. Lett., 101(5):181–184, 2007.
- Fefferman et al. (2016) Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
- Foster and Rakhlin (2019) Dylan J. Foster and Alexander Rakhlin. $\ell_\infty$ vector contraction for Rademacher complexity. CoRR, abs/1911.06468, 2019.
- Freund and Schapire (1997) Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
- Ghazi et al. (2021) Badih Ghazi, Noah Golowich, Ravi Kumar, and Pasin Manurangsi. Near-tight closure bounds for the littlestone and threshold dimensions. In Algorithmic Learning Theory, pages 686–696. PMLR, 2021.
- Gottlieb et al. (2014) Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient classification for metric data (extended abstract: COLT 2010). IEEE Transactions on Information Theory, 60(9):5750–5759, 2014.
- Gottlieb et al. (2018) Lee-Ad Gottlieb, Eran Kaufman, Aryeh Kontorovich, and Gabriel Nivasch. Learning convex polytopes with margin. In Neural Information Processing Systems (NIPS), 2018.
- Hanneke and Kontorovich (2019) Steve Hanneke and Aryeh Kontorovich. Optimality of SVM: novel proofs and tighter bounds. Theor. Comput. Sci., 796:99–113, 2019.
- Kearns and Schapire (1994) Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts. J. Comput. Syst. Sci., 48(3):464–497, 1994.
- Kégl (2003) Balázs Kégl. Robust regression by boosting the median. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pages 258–272. Springer, 2003.
- Klochkov et al. (2021) Yegor Klochkov, Alexey Kroshnin, and Nikita Zhivotovskiy. Robust k-means clustering for distributions with two moments. The Annals of Statistics, 49(4):2206–2230, 2021.
- Kontorovich (2018) Aryeh Kontorovich. Rademacher complexity of $k$-fold maxima of hyperplanes. 2018.
- Mendelson and Vershynin (2003) S. Mendelson and R. Vershynin. Entropy and the combinatorial dimension. Invent. Math., 152(1):37–55, 2003.
- Mohri et al. (2012) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations Of Machine Learning. The MIT Press, 2012.
- Raviv et al. (2018) Dolev Raviv, Tamir Hazan, and Margarita Osadchy. Hinge-minimax learner for the ensemble of hyperplanes. J. Mach. Learn. Res., 19:62:1–62:30, 2018.
- Rudelson and Vershynin (2006) M. Rudelson and R. Vershynin. Combinatorics of random processes and sections of convex bodies. Annals of Mathematics, 164(2):603–648, 2006.
- Srebro et al. (2010) Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. Advances in neural information processing systems, 23, 2010.
- Talagrand (2003) Michel Talagrand. Vapnik–Chervonenkis type conditions and uniform Donsker classes of functions. The Annals of Probability, 31(3):1565–1582, 2003.
- Vershynin (2018) Roman Vershynin. High-dimensional probability, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2018.
- Vershynin (2021) Roman Vershynin, 2021. Private communication.
- Zhang (2002) Tong Zhang. Covering number bounds of certain regularized linear function classes. The Journal of Machine Learning Research, 2:527–550, 2002.
- Zhivotovskiy (2022) Nikita Zhivotovskiy. A bound for $k$-fold maximum. 2022.
A Auxiliary results
A.1 Covering numbers and Lipschitz Aggregations
Lemma 12
If is -Lipschitz under , then is -Lipschitz in .
Proof
where the inequality follows from the assumption that is -Lipschitz in . This proves
and hence the claim.
Proof [of Theorem 10] Suppose , and let . For each , let be a -cover of . Let each be “-covered” by some , in the sense that . Assuming that is -Lipschitz in , Lemma 12 implies that is -Lipschitz in . Then it follows that is -covered by , since
and so .
We conclude that has a -cover of size , which proves the claim. The case is proved analogously (or, alternatively, as a limiting case of ).
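The construction above can be seen concretely in the special case of sup-norm covers on a finite sample and the $1$-Lipschitz $k$-fold maximum: choosing, for each component function, its nearest cover element and aggregating yields an element within the same distance of the aggregate. The sketch below (illustrative, with hypothetical finite classes) checks this numerically.

```python
import numpy as np

rng = np.random.default_rng(2)
sup_dist = lambda u, v: float(np.max(np.abs(u - v)))

def greedy_sup_cover(class_values, gamma):
    """Greedy internal gamma-cover under the sup norm on the sample."""
    cover = []
    for f in class_values:
        if all(sup_dist(f, g) > gamma for g in cover):
            cover.append(f)
    return cover

k, n, gamma = 3, 40, 0.3
classes = [rng.normal(size=(30, n)) for _ in range(k)]   # k finite classes on a sample
covers = [greedy_sup_cover(C, gamma) for C in classes]

G = lambda rows: np.max(rows, axis=0)   # pointwise k-fold maximum

worst = 0.0
for fs in zip(*(C[:10] for C in classes)):   # a subsample of k-tuples
    nearest = [min(cov, key=lambda g, f=f: sup_dist(f, g)) for f, cov in zip(fs, covers)]
    worst = max(worst, sup_dist(G(np.stack(fs)), G(np.stack(nearest))))
print(f"worst distance of aggregated cover element: {worst:.3f} (gamma = {gamma})")
```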
We show that natural aggregations are Lipschitz in $\ell_p$ norms, $p\ge1$, and in the supremum norm.
The following facts are elementary:
(23)
(24)
where $a=(a_1,\dots,a_k)$ and $b=(b_1,\dots,b_k)$.
Lemma 13 (Maximum aggregation is 1-Lipschitz)
Let be the maximum aggregation, then for any and ,
Proof For and , the claim follows from the stronger, pointwise inequality (23). The proof follows by simple induction on $k$. Since , we conclude the proof for .
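For concreteness, the two-fold case of the inequality invoked above can be verified directly (a standard computation, spelled out here for the reader):
$\max(a_1,a_2)-\max(b_1,b_2) \;\le\; \max(a_1-b_1,\,a_2-b_2) \;\le\; \max\bigl(|a_1-b_1|,\,|a_2-b_2|\bigr),$
since $a_i=b_i+(a_i-b_i)\le\max(b_1,b_2)+\max(a_1-b_1,\,a_2-b_2)$ for $i=1,2$. Exchanging the roles of $a$ and $b$ and combining the two bounds gives $|\max(a_1,a_2)-\max(b_1,b_2)|\le\|a-b\|_\infty$, which is the $1$-Lipschitz property of the two-fold maximum with respect to the supremum norm.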
Lemma 14 (Max-Min aggregation is 1-Lipschitz)
If is the max-min aggregation, then for any and ,
A.2 Covering numbers and the fat-shattering dimension
In this section, we summarize some known results connecting the covering numbers of a bounded function class to its fat-shattering dimension.
Lemma 15 (Talagrand (2003), Proposition 1.4)
For any , there exists a probability measure on such that
(25)
where the constant is universal. Moreover, $\mu$ may be taken to be the uniform distribution on any $\gamma$-shattered subset of $\mathcal{X}$.
Remark. The tightness of (25) is trivially demonstrated by the example .
Lemma 16 (Mendelson and Vershynin (2003), Theorem 1)
For all and all probability measures ,
(26)
where the constants are universal.
Remark 17
Lemma 18 (Rudelson and Vershynin (2006))
Suppose that , is a probability measure on , and . If satisfies , then
furthermore, for all , if , then
where the constants are universal.
A.3 Covering numbers of linear and affine classes
Let $B^n$ be the $n$-dimensional Euclidean unit ball and let
be the collection of bounded affine functions on $B^n$.
Remark 19
There is a trivial reduction from a bounded affine class in $n$ dimensions to a bounded linear class in $n+1$ dimensions, via the standard trick of adding an extra dummy coordinate. This only affects the covering number bounds up to constants.
For , , define , and endow with the uniform measure . Zhang (2002, Theorem 4) implies the covering number estimate
where is a universal constant (Zhang’s result is more general and allows to compute explicit constants). We will use the following sharper bound:
Lemma 20
where and is a universal constant.
Proof The result is folklore, but we provide a proof for completeness.
We argue that there is no loss of generality in assuming . Indeed, if , then is spanned by some and is also a -dimensional set. Thus, we assume henceforth. Via a standard infinitesimal perturbation, we can assume that is a linearly independent set (i.e., spans ). If we treat as an matrix, then , which means that is an ellipsoid. We are interested in covering numbers of .
Let be such that , where is the unit cube. (The existence of a such that is obvious, but because we assumed that spans , every point in has a pre-image under .) Let us compute the polar body , defined as
We claim that
Indeed, consider a . Then, for any , we have
where we have used (since ) and Hölder’s inequality. This shows that . On the other hand, consider any . There is no loss of generality in assuming that is spanned by , that is, , for . By definition of , we have
Now because , for each choice of , there is a such that for all . This shows that we must have , and proves .
It is well-known (and easy to verify) that covering numbers enjoy an affine invariance:
where , for two sets , is the smallest number of copies of necessary to cover . Now the seminal result of Artstein et al. (2004) applies: for all ,
where the constants are universal.
This reduces the problem to estimating the -covering numbers of . The latter may be achieved via Maurey’s method (Vershynin, 2018, Corollary 0.0.4 and Exercise 0.0.6): the -covering number of under is at most
where is a universal constant.
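For the reader's convenience, one standard form of Maurey's empirical-method bound (Vershynin, 2018, Corollary 0.0.4), specialized to the cross-polytope $B_1^m=\mathrm{conv}\{\pm e_1,\dots,\pm e_m\}\subseteq B_2^m$, is the following (the constant conventions here may differ from those intended above):
$\mathcal{N}\bigl(B_1^m,\ \varepsilon,\ \ell_2\bigr) \;\le\; (2m)^{\lceil 1/\varepsilon^2\rceil}, \qquad 0<\varepsilon<1,$
so that $\log \mathcal{N}(B_1^m,\varepsilon,\ell_2) \le \lceil 1/\varepsilon^2\rceil\,\log(2m)$.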
A.4 Fat-shattering dimension of linear and affine classes
In this section, $n\in\mathbb{N}$ and $B^n\subseteq\mathbb{R}^n$ denotes the Euclidean unit ball. A function $f:\mathbb{R}^n\to\mathbb{R}$ is said to be affine if it is of the form $f(x)=\langle w,x\rangle+b$, for some $w\in\mathbb{R}^n$ and $b\in\mathbb{R}$, where $\langle\cdot,\cdot\rangle$ denotes the Euclidean inner product.
Throughout the paper, we have referred to bounded affine function classes as those for which both the weight vector and the offset are bounded. In this section, we define the larger class of semi-bounded affine functions as those for which the weight vector is bounded, but the offset may be unbounded. In particular, the covering-number results (and the reduction to linear classes spelled out in Remark 19) do not apply to semi-bounded affine classes.
The following simple result may be of independent interest.
Lemma 21
Let $\mathcal{F}$ be some collection of functions with the closure property
$cf\in\mathcal{F}$ for all $f\in\mathcal{F}$ and all $c>0$.   (27)
Then, for all $\gamma,\gamma'>0$, we have $\mathrm{fat}_\gamma(\mathcal{F})=\mathrm{fat}_{\gamma'}(\mathcal{F})$.
Proof
Suppose that some set $S=\{x_1,\dots,x_m\}$ is $\gamma$-shattered by $\mathcal{F}$. That means that there is an $r\in\mathbb{R}^m$ such that for all $b\in\{-1,1\}^m$, there is an $f\in\mathcal{F}$ for which
$b_i\,\bigl(f(x_i)-r_i\bigr)\ \ge\ \gamma,\qquad i\in[m].$   (28)
Now, for any $\gamma'>0$, let $c:=\gamma'/\gamma$ and $r':=cr$. Then, for each $i\in[m]$, we have $b_i\,\bigl(cf(x_i)-r'_i\bigr)=c\,b_i\,\bigl(f(x_i)-r_i\bigr)\ge c\gamma=\gamma'$. It follows that $cf$ achieves (28), for the given $b$, with $\gamma'$ in place of $\gamma$ and $r'$ in place of $r$. Now (27) implies that the function $cf$ belongs to $\mathcal{F}$, which completes the proof.
Now it is well-known (Bartlett and Shawe-Taylor, 1999, Theorem 4.6) that bounded linear functions, i.e., function classes of the form $\{x\mapsto\langle w,x\rangle : \|w\|\le r\}$, also known as homogeneous hyperplanes, satisfy $\mathrm{fat}_\gamma\le(r/\gamma)^2$. The discussion in Hanneke and Kontorovich (2019, p. 102) shows that the common approach of reducing the general (affine) case to the linear (homogeneous) case, via the addition of a "dummy" coordinate, incurs a large suboptimal factor in the bound. Hanneke and Kontorovich (2019, Lemma 6) is essentially an analysis of the fat-shattering dimension of bounded affine functions. Although this result contains a mistake (see Section 5), much of the proof technique can be salvaged:
Lemma 22
The semi-bounded affine function class on defined by in dimensions satisfies
Proof Since satisfies (27), it suffices to consider , and so the shattering condition simplifies to
(29)
Now the quantity in question is always upper-bounded by the VC-dimension of the corresponding class thresholded at zero. For $n$-dimensional inhomogeneous hyperplanes, the latter is exactly $n+1$ (Mohri et al., 2012, Example 3.2). Having dispensed with the dimension-dependent part of the bound, we now focus on the scale-dependent one.
Let us observe, as in Hanneke and Kontorovich (2019, Lemma 6), that for and , , one can always realize (29) with , which is what we shall assume, without loss of generality, henceforth. Summing up the inequalities in (29) yields
Letting $b$ be drawn uniformly from $\{-1,1\}^m$ and taking expectations, we have
Isolating $m$ on the left-hand side of the inequality proves the claim.
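For intuition, here is the standard form of this averaging argument in the homogeneous case, sketched under the assumptions that the class is $\{x\mapsto\langle w,x\rangle:\|w\|\le r\}$, the points lie in the unit ball, and the shift is zero (the normalization in the lemma itself may differ). If $w_b$ denotes a function realizing the sign pattern $b\in\{-1,1\}^m$, then summing the $m$ inequalities gives
$m\gamma \;\le\; \sum_{i=1}^m b_i\,\langle w_b, x_i\rangle \;=\; \Bigl\langle w_b,\ \sum_{i=1}^m b_i x_i\Bigr\rangle \;\le\; r\,\Bigl\|\sum_{i=1}^m b_i x_i\Bigr\|,$
and taking the expectation over $b$ drawn uniformly from $\{-1,1\}^m$, Jensen's inequality yields
$m\gamma \;\le\; r\,\mathbb{E}\Bigl\|\sum_{i=1}^m b_i x_i\Bigr\| \;\le\; r\sqrt{\mathbb{E}\Bigl\|\sum_{i=1}^m b_i x_i\Bigr\|^2} \;=\; r\sqrt{\sum_{i=1}^m \|x_i\|^2} \;\le\; r\sqrt{m},$
whence $m\le (r/\gamma)^2$.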
Following a referee’s suggestion, we improve the constant as follows. Note that
where the inequality follows from a binomial coefficient estimate via Stirling's approximation. Thus,
which proves that .
A.5 Concavity miscellanea
The results below are routine exercises in differentiation and Jensen’s inequality.
Lemma 23
For , the function is concave on .
Corollary 24
For all and , ,
Lemma 25
For and , the function is concave on . It follows that for as above and , ,