
Fat-Shattering Dimension of $k$-fold Aggregations
(A previous version of this paper was titled “Fat-shattering dimension of $k$-fold maxima”.)

Idan Attias    [email protected]
Aryeh Kontorovich    [email protected]
Department of Computer Science
Ben-Gurion University of the Negev, Beer Sheva, Israel
Abstract

We provide estimates on the fat-shattering dimension of aggregation rules of real-valued function classes. The latter consists of all ways of choosing $k$ functions, one from each of the $k$ classes, and computing a pointwise function of them, such as the median, mean, and maximum. The bound is stated in terms of the fat-shattering dimensions of the component classes. For linear and affine function classes, we provide a considerably sharper upper bound and a matching lower bound, achieving, in particular, an optimal dependence on $k$. Along the way, we improve several known results in addition to pointing out and correcting a number of erroneous claims in the literature.

Keywords: combinatorial dimension, scale-sensitive dimension, fat-shattering dimension, aggregation rules, $k$-fold maximum, ensemble methods

1 Introduction

The fat-shattering dimension, also known as “scale-sensitive” and the “parametrized variant of the $P$-dimension”, was first proposed by Kearns and Schapire (1994); its key role in learning theory lies in characterizing the PAC learnability of real-valued function classes (Alon et al., 1997; Bartlett and Long, 1998).

In this paper, we study the behavior of the fat-shattering dimension under various $k$-fold aggregations. Let $F_{1},\ldots,F_{k}\subseteq\mathbb{R}^{\Omega}$ be real-valued function classes, and $G:\mathbb{R}^{k}\rightarrow\mathbb{R}$ be an aggregation rule. We consider the aggregate function class $G(F_{1},\ldots,F_{k})$, which consists of all mappings $x\mapsto G(f_{1}(x),\ldots,f_{k}(x))$ for $f_{1}\in F_{1},\ldots,f_{k}\in F_{k}$. Some natural aggregation rules include the pointwise $k$-fold maximum, median, mean, and max-min. We seek to bound the shattering complexity of $G(F_{1},\ldots,F_{k})$ in terms of the fat-shattering dimensions of the constituent classes $F_{i}$. This question naturally arises in the context of ensemble methods, such as boosting and bagging, where the learner's prediction consists of an aggregation of base learners.

The analogous question regarding aggregations of VC classes (the VC dimension being the combinatorial complexity controlling the learnability of Boolean function classes) has been studied in detail and largely resolved (Baum and Haussler, 1989; Blumer et al., 1989; Eisenstat and Angluin, 2007; Eisenstat, 2009; Csikós et al., 2019). Furthermore, closure properties were also studied in the context of online classification and private PAC learning (Alon et al., 2020; Ghazi et al., 2021) for the Littlestone and Threshold dimensions. However, for real-valued functions, this question has remained largely uninvestigated.

Our contributions are as follows:

  • For a natural class of aggregation rules that commute with shifts (see Definition (4)), assuming $\operatorname{fat}_{\gamma}(F_{i})\leq d$ for $1\leq i\leq k$, we show that

    \displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq O\left(dk\log^{2}\left(dk\right)\right),\qquad\gamma>0.

    The formal statement is given in Theorem 1. Using an entirely different approach, for aggregations that are $L$-Lipschitz ($L\geq 1$) in the supremum norm (see Definition (5)) and for bounded function classes $F_{1},\ldots,F_{k}\subset[-R,R]^{\Omega}$ with $\operatorname{fat}_{\varepsilon\gamma}(F_{i})\leq d$ for $1\leq i\leq k$, we show that

    \displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq O\left(dk\operatorname{Log}^{1+\varepsilon}\frac{LRk}{\gamma}\right),\qquad 0<\gamma/L<R\text{ and }0<\varepsilon<\log 2,

    where $\operatorname{Log}(x):=\log(e\vee x)$. The formal statement is given in Theorem 2.

    In particular, our proofs hold for the maximum, minimum, median, mean, and max-min aggregations.

  • For $R$-bounded affine functions and for aggregations that are $L$-Lipschitz in the supremum norm, we show the following dimension-free bound,

    \displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq O\left(\frac{L^{2}R^{2}k\log(k)}{\gamma^{2}}\right),\qquad 0<\gamma/L<R.

    This result also extends to the hinge-loss class of affine functions. The formal statement is given in Theorem 3. In particular, we improve by a log factor the estimate of Fefferman et al. (2016, Lemma 6) on the fat-shattering dimension of the max-min aggregation of linear functions.

    Furthermore, in Corollary 5 we show an upper bound on the Rademacher complexity of the $k$-fold maximum aggregation of affine functions and hinge-loss affine functions. Our bound scales with $\sqrt{k}$, improving upon Raviv et al. (2018), where the dependence on $k$ is linear.

  • For affine functions and the $k$-fold max aggregation, we show tight dimension-dependent bounds (up to constants),

    \displaystyle\operatorname{fat}_{\gamma}(G_{\max}(F_{1},\ldots,F_{k}))=\Theta\left(dk\log k\right),\qquad\gamma>0,

    where $d$ is the Euclidean dimension. For the formal statements, see Theorems 7 and 8.

Applications.

The need to analyze the combinatorial complexity of a $k$-fold maximum of function classes (see (3) for the formal definition) arises in a number of diverse settings. One natural example is adversarially robust PAC learning against test-time attacks for real-valued functions (Attias et al., 2022; Attias and Hanneke, 2023). In this setting, the learner observes an i.i.d. labeled sample from an unknown distribution, and the goal is to output a hypothesis with a small error on unseen examples from the same distribution, with high probability. The difference from the standard PAC learning model is that at test time, the learner only observes a corrupted example, while the prediction is tested against the original label. Formally, $(x,y)$ is drawn from the unknown distribution, and an adversary can map $x$ to one of $k$ possible corruptions $z$ that are known to the learner. The learner observes only $z$, while its loss is computed with respect to the original label $y$. This scenario is naturally captured by the $k$-fold max: the learner aims to learn the maximum aggregation of the loss classes. Attias et al. (2022) showed that uniform convergence holds in this case, and so the sample complexity of an empirical risk minimization algorithm is determined by the complexity measure of the $k$-fold maximum aggregation.

Analyzing the $k$-fold maximum arises also in a setting of learning polyhedra with a margin. Gottlieb et al. (2018) provided a learning algorithm that represents polyhedra as intersections of bounded affine functions. The sample complexity of the algorithm is determined by the complexity measure of the maximum aggregation of affine function classes.

Another natural example of where the $k$-fold maximum and $k$-fold max-min play a role is in analyzing the convergence of $k$-means clustering. Fefferman et al. (2016) bounded the max-min aggregation, and Klochkov et al. (2021); Biau et al. (2008); Appert and Catoni (2021); Zhivotovskiy (2022) bounded the max aggregation. The main challenge in this setting is bounding the covering numbers of the aggregation over $k$ function classes, which can be obtained by bounding the Rademacher complexity or the fat-shattering dimension.

Finally, there are numerous ensemble methods for regression that output some aggregation of base learners, such as the median or mean. Examples of these methods include boosting (e.g., Freund and Schapire (1997); Kégl (2003)), bagging (bootstrap aggregation) by Breiman (1996), and its extension to the random forest algorithm (Breiman, 2001).

Related work.

It was claimed in Attias et al. (2019, Theorem 12) that $\operatorname{fat}_{\gamma}(G_{\max})\leq 2\log(3k)\sum_{i=1}^{k}\operatorname{fat}_{\gamma}(F_{i})$, but the proof had a mistake (see Section 5); our Open Problem asks whether the general form of this bound does hold (we believe it does, at least for the max aggregation). Using the recent disambiguation result of Alon et al. (2022), presented in Lemma 9 here, Attias et al. (2022, Lemma 15) obtained the bound $\operatorname{fat}_{\gamma}(G_{\max})\leq O\left(\operatorname{Log}(k)\operatorname{Log}^{2}(|\Omega|)\sum_{i=1}^{k}\operatorname{fat}_{\gamma}(F_{i})\right)$, where $\Omega$ is the domain of the function classes $F_{1},\ldots,F_{k}$. The two bounds are, in general, incomparable, but Theorem 1 is clearly a considerable improvement for large or infinite $\Omega$.

Using the covering number results of Mendelson and Vershynin (2003); Talagrand (2003) (see Section A.1), Duan (2012, Theorem 6.2) obtained a general result, which, when specialized to $k$-fold maxima, yields

\displaystyle\operatorname{fat}_{\gamma}(G_{\max})\leq O\left(\operatorname{Log}\frac{k}{\gamma}\cdot\sum_{i=1}^{k}\operatorname{fat}_{c\gamma/\sqrt{k}}(F_{i})\right) (1)

for a universal constant $c>0$; (1) is an immediate consequence of Theorem 10 (with $p=2$), Lemma 15, and Lemma 16 in this paper. Our results improve over (1) by removing the dependence on $k$ in the scale of the fat-shattering dimensions; however, Duan's general method is applicable to a wider class of uniformly continuous $k$-fold aggregations.

Srebro et al. (2010, Lemma A.2) bounded the fat-shattering dimension in terms of the Rademacher complexity. Foster and Rakhlin (2019) bounded the Rademacher complexity of a smooth $k$-fold aggregate; see also references therein. Inspired by Appert and Catoni (2021), Zhivotovskiy (2022) has obtained the best known upper bound on the Rademacher complexity of the $k$-fold maximum over linear function classes. Raviv et al. (2018) upper bounded the Rademacher complexity of the $k$-fold maximum aggregation of affine functions and hinge-loss affine functions.

2 Preliminaries

Aggregation rules.

A $k$-fold aggregation rule is any mapping $G:\mathbb{R}^{k}\rightarrow\mathbb{R}$. Just as $G$ maps $k$-tuples of reals into reals, it naturally aggregates $k$-tuples of functions into a single one: for $f_{1},\ldots,f_{k}:\Omega\rightarrow\mathbb{R}$, we define $G(f_{1},\ldots,f_{k}):\Omega\rightarrow\mathbb{R}$ as the mapping $x\mapsto G(f_{1}(x),\ldots,f_{k}(x))$. Finally, the aggregation extends to $k$-tuples of function classes: for $F_{1},\ldots,F_{k}\subseteq\mathbb{R}^{\Omega}$, we define

G(F_{1},\ldots,F_{k}):=\left\{x\mapsto G(f_{1}(x),\ldots,f_{k}(x)):f_{i}\in F_{i},i\in[k]\right\}. (2)

A canonical example of an aggregation rule is the $k$-fold max, induced by the mapping

\displaystyle G_{\max}(x_{1},\ldots,x_{k}):=\max_{i\in[k]}x_{i}. (3)
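
To make (2) and (3) concrete, the following minimal Python sketch (illustrative only; the toy classes, sample points, and helper names are ours, not from the paper) shows how an aggregation rule lifts from $k$-tuples of reals to $k$-tuples of functions and then to $k$-tuples of function classes.

    import itertools
    import numpy as np

    def aggregate_functions(G, fs):
        """Lift an aggregation rule G: R^k -> R to a k-tuple of functions,
        returning the pointwise aggregate x |-> G(f_1(x), ..., f_k(x))."""
        return lambda x: G(np.array([f(x) for f in fs]))

    def aggregate_classes(G, classes):
        """Form G(F_1, ..., F_k) as in (2): every way of choosing one function
        per class, aggregated pointwise (finite classes only, for illustration)."""
        return [aggregate_functions(G, fs) for fs in itertools.product(*classes)]

    # Toy example: two tiny classes of affine functions on R, aggregated by max.
    F1 = [lambda x: x, lambda x: -x]
    F2 = [lambda x: 0.5 * x + 1.0]
    for g in aggregate_classes(np.max, [F1, F2]):
        print([g(x) for x in (-1.0, 0.0, 1.0)])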

Next, we consider some properties that an aggregation rule might possess.

Commuting with shifts. We say that an aggregation rule $G$ commutes with shifts if

G(x)-r=G(x-r),\qquad x\in\mathbb{R}^{k},r\in\mathbb{R}. (4)

Lipschitz continuity. The mapping $G:\mathbb{R}^{k}\to\mathbb{R}$ is $L$-Lipschitz with respect to $\left\|\cdot\right\|_{\infty}$ if

\displaystyle\left|G(x)-G(x^{\prime})\right|\leq L\left\|x-x^{\prime}\right\|_{\infty}=L\max_{i\in[k]}|x_{i}-x_{i}^{\prime}|,\qquad x,x^{\prime}\in\mathbb{R}^{k}. (5)

Many natural aggregation rules possess these properties, such as maximum, minimum, median, mean, and max-min. The maximum is defined in (3), the minimum is defined analogously as

\displaystyle G_{\min}(x_{1},\ldots,x_{k}):=\min_{i\in[k]}x_{i},

the median is defined (with $x_{(1)}\leq\ldots\leq x_{(k)}$ denoting the values sorted in nondecreasing order) as

\displaystyle G_{\operatorname{med}}(x_{1},\ldots,x_{k}):=\begin{cases}x_{(\frac{k+1}{2})},&k\text{ is odd},\\ \frac{1}{2}\left(x_{(\frac{k}{2})}+x_{(\frac{k}{2}+1)}\right),&k\text{ is even},\end{cases}

and the mean is defined as

\displaystyle G_{\operatorname{mean}}(x_{1},\ldots,x_{k}):=\frac{1}{k}\sum^{k}_{i=1}x_{i}.

We also define $G_{\operatorname{max-min}}:\mathbb{R}^{\ell\times k}\to\mathbb{R}$ as

\displaystyle G_{\operatorname{max-min}}(x_{11},\ldots,x_{k\ell}):=\max_{j\in[\ell]}\min_{i\in[k]}x_{ij}; (6)

it is readily verified to commute with shifts, and Lemma 14 shows that it is $1$-Lipschitz with respect to $\left\|\cdot\right\|_{\infty}$.
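
As a quick sanity check on these two properties (a sketch with randomly generated test vectors; the helper names are ours), the following verifies numerically that the maximum, median, and mean commute with shifts as in (4) and are $1$-Lipschitz in the supremum norm as in (5).

    import numpy as np

    rng = np.random.default_rng(0)
    aggregations = {"max": np.max, "median": np.median, "mean": np.mean}

    for name, G in aggregations.items():
        for _ in range(1000):
            k = rng.integers(1, 8)
            x, xp = rng.normal(size=k), rng.normal(size=k)
            r = rng.normal()
            # Commuting with shifts: G(x) - r == G(x - r).
            assert np.isclose(G(x) - r, G(x - r))
            # 1-Lipschitz in the supremum norm: |G(x) - G(x')| <= ||x - x'||_inf.
            assert abs(G(x) - G(xp)) <= np.max(np.abs(x - xp)) + 1e-12
        print(name, "commutes with shifts and is 1-Lipschitz on these samples")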

Fat-shattering dimension.

Let $\Omega$ be a set and $F\subset\mathbb{R}^{\Omega}$. For $\gamma>0$, a set $S=\left\{x_{1},\ldots,x_{m}\right\}\subset\Omega$ is said to be $\gamma$-shattered by $F$ if

\displaystyle\sup_{r\in\mathbb{R}^{m}}\;\min_{y\in\left\{-1,1\right\}^{m}}\;\sup_{f\in F}\;\min_{i\in[m]}\;y_{i}(f(x_{i})-r_{i})\geq\gamma. (7)

The $\gamma$-fat-shattering dimension, denoted by $\operatorname{fat}_{\gamma}(F)$, is the size of the largest $\gamma$-shattered set (possibly $\infty$).
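
Unwinding (7): $S$ is $\gamma$-shattered if there is a witness $r\in\mathbb{R}^{m}$ such that every sign pattern $y\in\left\{-1,1\right\}^{m}$ is realized by some $f\in F$ with margin $\gamma$ around $r$. The sketch below (a brute-force check for a finite class and a fixed candidate witness, with toy values of our own choosing; it does not search over $r$) is exponential in $m$ and is meant only to illustrate the definition.

    import itertools
    import numpy as np

    def is_gamma_shattered(values, r, gamma):
        """values: (num_functions, m) array with values[j, i] = f_j(x_i).
        Returns True if every sign pattern y is realized with margin gamma
        around the fixed witness r, i.e., condition (7) holds for this r."""
        m = values.shape[1]
        for y in itertools.product([-1, 1], repeat=m):
            y = np.array(y)
            margins = y * (values - r)          # shape (num_functions, m)
            if not np.any(margins.min(axis=1) >= gamma):
                return False
        return True

    # Toy example: four functions realizing all sign patterns on two points.
    vals = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
    print(is_gamma_shattered(vals, r=np.zeros(2), gamma=1.0))  # True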

Fat-shattering dimension at zero.

As in Gottlieb et al. (2014), we also define the notion of $\gamma$-shattering at $0$, where the “shift” $r$ in (7) is constrained to be $0$. Formally, the shattering condition is $\min_{y\in\left\{-1,1\right\}^{m}}\sup_{f\in F}\min_{i\in[m]}y_{i}f(x_{i})\geq\gamma$, and we denote the corresponding dimension by $\operatorname{\textup{f\aa t}}_{\gamma}(F)$.

Attias et al. (2019, Lemma 13) showed that for all $F\subset\mathbb{R}^{\Omega}$,

\displaystyle\operatorname{fat}_{\gamma}(F)=\max_{r\in\mathbb{R}^{\Omega}}\operatorname{\textup{f\aa t}}_{\gamma}(F-r),\qquad\gamma>0, (8)

where $F-r=\left\{f-r;f\in F\right\}$ is the $r$-shifted class (the maximum is always achieved). Lemma 21 presents another, apparently novel, connection between $\operatorname{fat}$ and $\operatorname{\textup{f\aa t}}$.

Rademacher complexity.

Let $\mathcal{F}$ be a real-valued function class on the domain space $\mathcal{W}$. The empirical Rademacher complexity of $\mathcal{F}$ on a given sequence $(w_{1},\ldots,w_{n})\in\mathcal{W}^{n}$ is defined as

\displaystyle\mathcal{R}_{n}(\mathcal{F}|w_{1},\ldots,w_{n})=\mathbb{E}_{\mathbf{\sigma}}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(w_{i}),

where $\sigma=(\sigma_{1},\ldots,\sigma_{n})$ are independent random variables uniformly chosen from $\left\{-1,1\right\}$. The Rademacher complexity of $\mathcal{F}$ with respect to a distribution $\mathcal{D}$ is defined as

\displaystyle\mathcal{R}_{n}(\mathcal{F})=\mathbb{E}_{w_{1},\ldots,w_{n}\sim\mathcal{D}}\mathcal{R}_{n}(\mathcal{F}|w_{1},\ldots,w_{n}).

It is a classic fact (Mohri et al., 2012, Theorem 3.1) that the Rademacher complexity controls generalization bounds in a wide range of supervised learning settings.
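
When $\mathcal{F}$ is finite (or finitely discretized on the sample), the empirical Rademacher complexity can be estimated by Monte Carlo over random sign vectors. The sketch below uses a toy class of random unit-norm linear functions of our own choosing; it is illustrative, not part of the paper's analysis.

    import numpy as np

    def empirical_rademacher(values, num_trials=2000, seed=0):
        """values: (num_functions, n) array with values[j, i] = f_j(w_i).
        Monte Carlo estimate of E_sigma sup_f (1/n) sum_i sigma_i f(w_i)."""
        rng = np.random.default_rng(seed)
        num_functions, n = values.shape
        total = 0.0
        for _ in range(num_trials):
            sigma = rng.choice([-1.0, 1.0], size=n)
            total += np.max(values @ sigma) / n
        return total / num_trials

    # Toy class: 50 random unit-norm linear functions on 100 points of the unit ball.
    rng = np.random.default_rng(1)
    W = rng.normal(size=(50, 5))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    X = rng.normal(size=(100, 5))
    X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
    print(empirical_rademacher(W @ X.T))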

Covering numbers.

We start with some background on covering numbers. Whenever $\Omega$ is endowed with a probability measure $\mu$, this induces, for $p\in[1,\infty)$ and $f:\Omega\rightarrow\mathbb{R}^{k}$, the norm

\displaystyle\left\|f\right\|_{L_{p}^{(k)}(\mu)}^{p}=\mathop{\mathbb{E}}_{X\sim\mu}\left\|f(X)\right\|_{p}^{p}=\int_{\Omega}\left\|f(x)\right\|_{p}^{p}\,\mathrm{d}\mu(x)

on $L^{(k)}_{p}(\mu):=\left\{f\in(\mathbb{R}^{k})^{\Omega}:\left\|f\right\|_{L_{p}^{(k)}(\mu)}<\infty\right\}$. When $k=1$, we write $L_{p}(\mu):=L^{(1)}_{p}(\mu)$. For $p=\infty$, $\left\|f\right\|_{L_{\infty}^{(k)}(\mu)}$ is the essential supremum of $f$ with respect to $\mu$. For $t>0$ and $H\subset F\subset L_{p}(\mu)$, we say that $H$ is a $t$-cover of $F$ under $\left\|\cdot\right\|_{L_{p}(\mu)}$ if $\sup_{f\in F}\inf_{h\in H}\left\|f-h\right\|_{L_{p}(\mu)}\leq t$. The $t$-covering number of $F$, denoted by $\mathcal{N}(F,L_{p}(\mu),t)$, is the cardinality of the smallest $t$-cover of $F$ (possibly $\infty$). We note the obvious relation

\displaystyle p>q\implies\mathcal{N}(F,L_{p}(\mu),t)\geq\mathcal{N}(F,L_{q}(\mu),t), (9)

which holds for all probability measures $\mu$ and all $t>0$.
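
For intuition, a $t$-cover of a finite class restricted to a sample can be built greedily under the empirical $L_{p}(\mu_{n})$ norm (uniform $\mu_{n}$ on the sample); the size of the returned cover upper-bounds $\mathcal{N}(F,L_{p}(\mu_{n}),t)$. This is a standard greedy sketch with toy inputs of our own choosing, not code from the paper, and it also illustrates the monotonicity in (9).

    import numpy as np

    def lp_dist(u, v, p):
        """Empirical L_p(mu_n) distance between two functions, given their
        value vectors u, v on a sample of size n (uniform measure)."""
        if np.isinf(p):
            return np.max(np.abs(u - v))
        return np.mean(np.abs(u - v) ** p) ** (1.0 / p)

    def greedy_cover(values, t, p=2):
        """values: (num_functions, n) array of function values on the sample.
        Returns indices of an internal t-cover built greedily; its size is an
        upper bound on the covering number at scale t."""
        cover = []
        for j, row in enumerate(values):
            if all(lp_dist(row, values[c], p) > t for c in cover):
                cover.append(j)
        return cover

    rng = np.random.default_rng(2)
    vals = rng.uniform(-1.0, 1.0, size=(200, 30))
    print(len(greedy_cover(vals, t=0.5, p=2)), len(greedy_cover(vals, t=0.5, p=np.inf)))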

We sometimes overload the notation for aggregations by defining $G$ on $k$-tuples of functions (instead of on $k$-tuples of reals), $G:(\mathbb{R}^{\Omega})^{k}\rightarrow\mathbb{R}^{\Omega}$. We say that $G$ is $L$-Lipschitz with respect to $\left\|\cdot\right\|_{L_{p}^{(k)}(\mu)}$ if

\displaystyle\left\|G(f_{1:k})-G(f^{\prime}_{1:k})\right\|_{L_{p}(\mu)}\leq L\left\|(f_{1:k})-(f^{\prime}_{1:k})\right\|_{L_{p}^{(k)}(\mu)},\qquad f_{1:k},f^{\prime}_{1:k}\in(\mathbb{R}^{k})^{\Omega}.

Notation.

We write $\mathbb{N}=\left\{0,1,\ldots\right\}$ to denote the natural numbers. For $n\in\mathbb{N}$, we write $[n]:=\left\{1,2,\ldots,n\right\}$. All of our logarithms are base $e$, unless explicitly denoted otherwise. We use $\max\left\{u,v\right\}$ and $u\vee v$ interchangeably, and write $\operatorname{Log}(x):=\log(e\vee x)$. For any function class $F$ over a set $\Omega$ and $E\subset\Omega$, $F(E)=\left.F\right|_{E}$ denotes the projection on (restriction to) $E$. In line with the common convention in functional analysis, absolute numerical constants will be denoted by letters such as $C,c$, whose value may change from line to line. Any transformation $\varphi:\mathbb{R}\to\mathbb{R}$ may be applied to a function $f\in\mathbb{R}^{\Omega}$ via $\varphi(f):=\varphi\circ f$, as well as to $F\subset\mathbb{R}^{\Omega}$ via $\varphi(F):=\left\{\varphi(f);f\in F\right\}$. The sign function thresholds at $0$: $\operatorname{sign}(t)=\boldsymbol{1}[t\geq 0]$.

3 Main Results

Our main results involve upper-bounding the fat-shattering dimension of aggregation rules in terms of the dimensions of the component classes. We begin with the simplest (to present):

Theorem 1 (General function classes and aggregations that commute with shifts)

For $F_{1},\ldots,F_{k}\subseteq\mathbb{R}^{\Omega}$ and an aggregation rule $G$ that commutes with shifts (see definition (4)), we have

\displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq 25D_{\gamma}\log^{2}(90D_{\gamma}),\qquad\gamma>0,

where $D_{\gamma}:=\sum_{i=1}^{k}\operatorname{fat}_{\gamma}(F_{i})>0$. In the degenerate case where $D_{\gamma}=0$, $\operatorname{fat}_{\gamma}(G)=0$.

In particular, this result holds for natural aggregation rules, such as maximum, minimum, median, and mean.

Remark.

We made no attempt to optimize the constants; these are only provided to give a rough order-of-magnitude sense. In the sequel, we forgo numerical estimates and state the results in terms of unspecified universal constants.

The next result provides an alternative bound based on an entirely different technique:

Theorem 2 (Bounded function classes and Lipschitz aggregations)

For $0<\varepsilon<\log 2$, $F_{1},\ldots,F_{k}\subseteq[-R,R]^{\Omega}$, and an aggregation rule $G$ that is $L$-Lipschitz ($L\geq 1$) in the supremum norm (see definition (5)), we have

\displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq CD\operatorname{Log}^{1+\varepsilon}\frac{LRk}{\gamma},\qquad 0<\gamma/L<R,

where

\displaystyle D=\sum_{i=1}^{k}\operatorname{fat}_{c\varepsilon\gamma}(F_{i})

and $C,c>0$ are universal constants.

In Section A.1, we show that the maximum, median, mean, and max-min aggregations are $1$-Lipschitz.

Remark.

The bounds in Theorems 1 and 2 are, in general, incomparable, and not just because of the unspecified constants in the latter. One notable difference is that Theorem 1 only depends on the shattering scale $\gamma$, while Theorem 2 additionally features a (weak) explicit dependence on the aspect ratio $R/\gamma$. In particular, Theorem 1 is applicable to semi-bounded affine classes (see Section A.4), while Theorem 2 is not. Still, for fixed $R,\gamma$ and large $k$, the latter presents a significant asymptotic improvement over the former.

For the special case of affine functions and hinge-loss affine functions, the technique of Theorem 2 yields a considerably sharper estimate:

Theorem 3 (Dimension-free bound for Lipschitz aggregations of affine functions)

Let $B\subset\mathbb{R}^{d}$ be the $d$-dimensional Euclidean unit ball and

\displaystyle F_{i}=\left\{x\mapsto w\cdot x+b;\left\|w\right\|\vee|b|\leq R_{i},w\in\mathbb{R}^{d},b,R_{i}\in\mathbb{R}\right\},\qquad i\in[k], (10)

be $k$ collections of $R_{i}$-bounded affine functions on $\Omega=B$, and let $G$ be an aggregation rule that is $L$-Lipschitz in the supremum norm (see definition (5)). Then

\displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq\frac{CL^{2}\operatorname{Log}(k)}{\gamma^{2}}\sum_{i=1}^{k}{R_{i}^{2}},\qquad 0<\gamma/L<\min_{i\in[k]}R_{i}, (11)

where $C>0$ is a universal constant. Further, if

\displaystyle F_{i}^{\operatorname{Hinge}}=\left\{(x,y)\mapsto\max\left\{0,1-yf(x)\right\};f\in F_{i}\right\} (12)

is a family of $R_{i}$-bounded hinge-loss affine functions for $i\in[k]$ and $G_{\operatorname{Hinge}}\equiv G(F_{1}^{\operatorname{Hinge}},\ldots,F_{k}^{\operatorname{Hinge}})$ is an aggregation rule that is $L$-Lipschitz in the supremum norm, then the same bound as in (11) holds for $\operatorname{fat}_{\gamma}(G_{\operatorname{Hinge}})$.

In particular, Theorem 3 improves by a log factor the estimate of Fefferman et al. (2016) on the fat-shattering dimension of the max-min aggregation (defined in Section 2) of linear functions (the max-min aggregation is shown to be $1$-Lipschitz in the supremum norm in Lemma 14 of Section A.1):

Lemma 4 (Fefferman et al. (2016), Lemma 6)

Let $B\subset\mathbb{R}^{d}$ be the $d$-dimensional Euclidean unit ball and

\displaystyle F_{ij}=\left\{x\mapsto w\cdot x;\left\|w\right\|\leq 1,w\in\mathbb{R}^{d}\right\},\qquad i\in[k],j\in[\ell],

be $k\ell$ (identical) linear function classes defined on $\Omega=B$. If $G_{\operatorname{max-min}}$ is the max-min aggregation rule (6), then

\displaystyle\operatorname{fat}_{\gamma}(G_{\operatorname{max-min}}(F_{11},\ldots,F_{k\ell}))\leq C\frac{k\ell}{\gamma^{2}}\log^{2}\left(C\frac{k\ell}{\gamma^{2}}\right),

where $C>0$ is a universal constant.

Our Theorem 3 improves the latter by a log factor:

\displaystyle\operatorname{fat}_{\gamma}(G_{\operatorname{max-min}}(F_{11},\ldots,F_{k\ell}))\leq C\frac{k\ell\log\left(k\ell\right)}{\gamma^{2}}.
Corollary 5 (Rademacher complexity for $k$-Fold Maximum of Affine Functions)

Let $F_{i}$ be an $R_{i}$-bounded affine function class as in (10) or a hinge-loss affine function class as in (12), let $G_{\max}$ be the maximum aggregation rule, and let $\tilde{R}=\max_{i}R_{i}$. Then

\displaystyle\mathcal{R}_{n}(G_{\max}(F_{1},\ldots,F_{k}))\leq C\sqrt{\frac{\operatorname{Log}(k)\log^{3}(\tilde{R}n)\tilde{R}^{2}\sum_{i=1}^{k}{R_{i}^{2}}}{n}},

where $\mathcal{R}_{n}$ is the Rademacher complexity and $C>0$ is a universal constant.

Corollary 5 improves upon Raviv et al. (2018, Theorem 7). Their upper bound scales linearly with $k$, whereas ours scales as $\sqrt{k\log k}$.

Note, however, that for linear classes a better bound is known:

Theorem 6 (Zhivotovskiy (2022))

Let $B\subset\mathbb{R}^{d}$ be the $d$-dimensional Euclidean unit ball and

\displaystyle F_{i}=\left\{x\mapsto w\cdot x;\left\|w\right\|\leq 1,w\in\mathbb{R}^{d}\right\},\qquad i\in[k]

be $k$ (identical) linear function classes defined on $\Omega=B$. If $G_{\max}$ is the maximum aggregation rule, then

\displaystyle\mathcal{R}_{n}(G_{\max}(F_{1},\ldots,F_{k}))\leq C\log\left(\frac{n}{k}\right)\sqrt{\frac{k\log k}{n}},

where $\mathcal{R}_{n}$ is the Rademacher complexity and $C>0$ is a universal constant.

The estimate in Theorem 3 is dimension-free in the sense of being independent of $d$. In applications where a dependence on $d$ is admissible, an optimal bound can be obtained:

Theorem 7 (Dimension-dependent bound for $k$-fold maximum of affine functions)

Let $\Omega=\mathbb{R}^{d}$ and let $F_{i}\subset\mathbb{R}^{\Omega}$ be $k$ (identical) function classes consisting of all real-valued affine functions:

\displaystyle F_{i}=\left\{x\mapsto w\cdot x+b;w\in\mathbb{R}^{d},b\in\mathbb{R}\right\},\qquad i\in[k]

and let $G_{\max}$ be their $k$-fold maximum (see definition (3)). Then

\displaystyle\operatorname{fat}_{\gamma}(G_{\max}(F_{1},\ldots,F_{k}))\leq Cdk\operatorname{Log}k,\qquad\gamma>0,

where $C>0$ is a universal constant.

The optimality of the upper bound in Theorem 7 is witnessed by the matching lower bound:

Theorem 8 (Dimension-dependent lower bound for $k$-fold maximum of affine functions)

For $k\geq 1$ and $d\geq 4$, let $F_{1}=F_{2}=\ldots=F_{k}$ be the collection of all affine functions over $\Omega=\mathbb{R}^{d}$ and let $G_{\max}$ be their $k$-fold maximum (see definition (3)). Then

\displaystyle\operatorname{fat}_{\gamma}(G_{\max}(F_{1},\ldots,F_{k}))\geq C\log(k)\sum_{i=1}^{k}\operatorname{fat}_{\gamma}(F_{i})=Cdk\log k,\qquad\gamma>0,

where $C>0$ is a universal constant.

The scaling argument employed in the proof of Theorem 8 can be invoked to show that the claim continues to hold for $\Omega=B$.

Together, Theorems 7 and 8 show that the logarithmic dependence on $k$ is optimal.

4 Proofs

We start by upper-bounding the fat-shattering dimension of aggregation rules that commute with shifts, in terms of the dimensions of the component classes.

4.1 Proof of Theorem 1

Partial concept classes and disambiguation.

We say that $F^{\star}\subseteq\left\{0,1,\star\right\}^{\Omega}$ is a partial concept class over $\Omega$; this usage is consistent with Alon et al. (2022), while Attias et al. (2019, 2022) used the descriptor ambiguous. For $f^{\star}\in F^{\star}$, define its disambiguation set $\mathscr{D}(f^{\star})\subseteq\left\{0,1\right\}^{\Omega}$ as

\displaystyle\mathscr{D}(f^{\star})=\left\{g\in\left\{0,1\right\}^{\Omega}:\forall x\in\Omega,~f^{\star}(x)\neq\star\implies f^{\star}(x)=g(x)\right\};

in words, $\mathscr{D}(f^{\star})$ consists of the total concepts $g:\Omega\to\left\{0,1\right\}$ that agree pointwise with $f^{\star}$ whenever the latter takes a value in $\left\{0,1\right\}$. We say that $\bar{F}\subseteq\left\{0,1\right\}^{\Omega}$ disambiguates $F^{\star}$ if for all $f^{\star}\in F^{\star}$, we have $\bar{F}\cap\mathscr{D}(f^{\star})\neq\emptyset$; in words, every $f^{\star}\in F^{\star}$ must have a disambiguated representative in $\bar{F}$. (Attias et al. (2022) additionally required that $\bar{F}\subseteq\bigcup_{f^{\star}\in F^{\star}}\mathscr{D}(f^{\star})$, but this is an unnecessary restriction and does not affect any of the results.)

As in Alon et al. (2022); Attias et al. (2022), we say that $S\subset\Omega$ is VC-shattered by $F^{\star}$ if $F^{\star}(S)\supseteq\left\{0,1\right\}^{S}$. (Attias et al. (2019) had incorrectly given $F^{\star}(S)=\left\{0,1\right\}^{S}$ as the shattering condition.) We write $\operatorname{vc}(F^{\star})$ to denote the size of the largest VC-shattered set (possibly $\infty$). The obvious relation $\operatorname{vc}(F^{\star})\leq\operatorname{vc}(\bar{F})$ always holds between a partial concept class and any of its disambiguations. Alon et al. (2022, Theorem 13) proved the following variant of the Sauer-Shelah-Perles Lemma for partial concept classes:

Lemma 9 (Alon et al. (2022))

For every $F^{\star}\subseteq\left\{0,1,\star\right\}^{\Omega}$ with $d=\operatorname{vc}(F^{\star})<\infty$ and $|\Omega|<\infty$, there is an $\bar{F}$ disambiguating $F^{\star}$ such that

\displaystyle|\bar{F}(\Omega)|\leq(|\Omega|+1)^{(d+1)\log_{2}|\Omega|+2}. (13)

For $d>0$ and $|\Omega|>1$, this implies the somewhat more wieldy estimate (which does not appear in Alon et al. (2022), but is an elementary consequence of (13))

\displaystyle|\bar{F}(\Omega)|\leq|\Omega|^{5d\log_{2}|\Omega|}. (14)

We will make use of the elementary fact

\displaystyle x\leq A\log_{2}x\implies x\leq 3A\log(3A),\qquad x,A\geq 1

and its corollary

\displaystyle y\leq A(\log_{2}y)^{2}\implies y\leq 5A\log^{2}(18A),\qquad y,A\geq 1. (15)

Proof [of Theorem 1] We follow the basic techniques of discretization and $r$-shifting, employed in Attias et al. (2019, 2022). Fix $\gamma>0$ and define the operator $[\cdot]_{\gamma}^{\star}:\mathbb{R}\to\left\{0,1,\star\right\}$ as

\displaystyle[t]_{\gamma}^{\star}=\begin{cases}0,&t\leq-\gamma\\ 1,&t\geq\gamma\\ \star,&\text{else}.\end{cases}

Observe that for all $F\subseteq\mathbb{R}^{\Omega}$ and $[F]_{\gamma}^{\star}:=\left\{[f]_{\gamma}^{\star};f\in F\right\}$, we have $\operatorname{\textup{f\aa t}}_{\gamma}(F)=\operatorname{vc}([F]_{\gamma}^{\star})$. Let $G:\mathbb{R}^{k}\rightarrow\mathbb{R}$ be a $k$-fold aggregation rule and $F_{1},\ldots,F_{k}\subseteq\mathbb{R}^{\Omega}$ be real-valued function classes. Suppose that some $S=\left\{x_{1},\ldots,x_{\ell}\right\}\subset\Omega$ is $\gamma$-shattered by $G\equiv G(F_{1},\ldots,F_{k})$. Proving the claim amounts to upper-bounding $\ell$ appropriately. By (8), there is an $r\in\mathbb{R}^{\Omega}$ such that $\operatorname{fat}_{\gamma}(G)=\operatorname{\textup{f\aa t}}_{\gamma}(G-r)=\operatorname{vc}([G-r]_{\gamma}^{\star})$. Put $F^{\prime}_{i}:=F_{i}-r$; since $G$ commutes with the $r$-shift, as defined in (4), we have

\displaystyle G^{\prime}:=G(F^{\prime}_{1},\ldots,F^{\prime}_{k})=G(F_{1}-r,\ldots,F_{k}-r)=G(F_{1},\ldots,F_{k})-r. (16)

Hence, $S$ is VC-shattered by $[G^{\prime}]_{\gamma}^{\star}$ and

\displaystyle v_{i}:=\operatorname{vc}([F_{i}^{\prime}]_{\gamma}^{\star})=\operatorname{\textup{f\aa t}}_{\gamma}(F_{i}^{\prime})\leq\operatorname{fat}_{\gamma}(F_{i}^{\prime})=\operatorname{fat}_{\gamma}(F_{i}),\qquad i\in[k]. (17)

Let us assume for now that each $v_{i}>0$; in this case, there is no loss of generality in assuming $\ell>1$. Let $\bar{F}_{i}$ be a “good” disambiguation of $[F_{i}^{\prime}]_{\gamma}^{\star}$ on $S$, as furnished by Lemma 9:

\displaystyle|\bar{F}_{i}(S)|\leq\ell^{5v_{i}\log_{2}\ell}.

Observe that $\bar{G}:=G(\bar{F}_{1},\ldots,\bar{F}_{k})$ is a valid disambiguation of $[G^{\prime}]_{\gamma}^{\star}$. It follows that

\displaystyle 2^{\ell}\;\leq\;|\bar{G}(S)|\;\leq\;\prod_{i=1}^{k}|\bar{F}_{i}(S)|\;\leq\;\ell^{5\log_{2}\ell\sum_{i=1}^{k}v_{i}}. (18)

Thus, (15) implies that $\ell\leq 25(\sum v_{i})\log^{2}(90\sum v_{i})$, and the latter is an upper bound on $\operatorname{vc}(\bar{G})$, hence also on $\operatorname{vc}([G^{\prime}]_{\gamma}^{\star})=\operatorname{fat}_{\gamma}(G)$. The claim now follows from (17).

If any one given $v_{i}=0$, we claim that (18) is unaffected. This is because any $C^{\star}\subset\left\{0,1,\star\right\}^{\Omega}$ with $\operatorname{vc}(C^{\star})=0$ has a singleton disambiguation $\bar{C}=\left\{c\right\}$. Indeed, any given $x\in\Omega$ can receive at most one of $\left\{0,1\right\}$ as a label from the members of $C^{\star}$ (otherwise, it would be shattered, forcing $\operatorname{vc}(C^{\star})\geq 1$). If any $c^{\star}\in C^{\star}$ labels $x$ with $0$, then all members of $C^{\star}$ are disambiguated to label $x$ with $0$ (and, mutatis mutandis, $1$). Any $x$ labeled with $\star$ by every $c^{\star}\in C^{\star}$ can be disambiguated arbitrarily (say, to $0$). Disambiguating the degenerate $[F_{i}^{\prime}]_{\gamma}^{\star}$ to the singleton $\bar{F}_{i}(S)$ has no effect on the product in (18).

The foregoing argument continues to hold if more than one $v_{i}=0$. In particular, in the degenerate case where $\operatorname{fat}_{\gamma}(F_{1})=\operatorname{fat}_{\gamma}(F_{2})=\ldots=\operatorname{fat}_{\gamma}(F_{k})=0$, we have $\prod|\bar{F}_{i}(S)|=1$, which forces $\ell=0$.

4.2 Proof of Theorem 2

First, we upper bound the covering numbers of Lipschitz aggregations as a function of the covering numbers of the component classes.

Theorem 10 (Covering number of $L$-Lipschitz aggregations)

Let $t>0$, $p\in[1,\infty]$, and $F_{1},\ldots,F_{k}\subset L_{p}(\mu)$. Let $G$ be an aggregation rule that is $L$-Lipschitz. Then, for all probability measures $\mu$ on $\Omega$,

\displaystyle\mathcal{N}(G(F_{1},\ldots,F_{k}),L_{p}(\mu),t)\leq\begin{cases}\prod_{i=1}^{k}\mathcal{N}(F_{i},t/Lk^{1/p}),&p<\infty\\ \prod_{i=1}^{k}\mathcal{N}(F_{i},t/L),&p=\infty.\end{cases}

We proceed to the main proof.

Proof [of Theorem 2] Let $G:\mathbb{R}^{k}\rightarrow\mathbb{R}$ be an aggregation rule that is $L$-Lipschitz ($L\geq 1$) in the supremum norm, as defined in (5), and let $F_{1},\ldots,F_{k}\subseteq[-R,R]^{\Omega}$ be real-valued function classes. Suppose that some $\Omega_{\ell}=\left\{x_{1},\ldots,x_{\ell}\right\}\subset\Omega$ is $\gamma$-shattered by $G$, let $F_{i}(\Omega_{\ell})=\left.F_{i}\right|_{\Omega_{\ell}}$, and let $\mu_{\ell}$ be the uniform distribution on $\Omega_{\ell}$. We upper bound the covering number via the fat-shattering dimension as in Lemma 18 (see Section A.2), with $n=\ell$ and $p=\infty$:

\displaystyle\log\mathcal{N}(F_{i}(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma)\leq Cv_{i}\log(R\ell/v_{i}\gamma)\log^{\varepsilon}(\ell/v_{i}),\qquad 0<\gamma<R,

where $v_{i}=\operatorname{fat}_{c\varepsilon\gamma}(F_{i})$. Then Theorem 10 implies that

\displaystyle\log\mathcal{N}(G(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma/2)\leq\sum_{i=1}^{k}\log\mathcal{N}(F_{i}(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma/2L)
\leq C\sum_{i=1}^{k}v_{i}\log(LR\ell/v_{i}\gamma)\log^{\varepsilon}(\ell/v_{i})
\overset{\textup{(a)}}{\leq}C\sum_{i=1}^{k}v_{i}\log^{1+\varepsilon}(LR\ell/v_{i}\gamma)
\overset{\textup{(b)}}{\leq}CD\log^{1+\varepsilon}\frac{LR\ell k}{D\gamma},

where $D:=\sum_{i=1}^{k}v_{i}$, (a) follows since $R/\gamma>1$ and $L\geq 1$, and (b) follows by the concavity of $x\mapsto x\log^{1+\varepsilon}(u/x)$ (see Lemma 25 in Section A.5). We can assume $\ell\geq 2$ without loss of generality. Combining the monotonicity of the covering number (see (9)) and a lower bound on the covering number in terms of the fat-shattering dimension (see Lemma 15 in Section A.2) yields

\displaystyle\log\mathcal{N}(G(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma/2)\geq C\operatorname{fat}_{\gamma}(G)=C\ell,

whence

\displaystyle\ell\leq CD\log^{1+\varepsilon}\frac{LR\ell k}{D\gamma}.

Using the elementary fact

\displaystyle x\leq A\operatorname{Log}^{1+\varepsilon}x\implies x\leq cA\operatorname{Log}^{1+\varepsilon}A,\qquad x,A\geq 1

(with $x=LR\ell k/D\gamma$ and $A=cLRk/\gamma$), we get

\displaystyle\ell\leq CD\operatorname{Log}^{1+\varepsilon}\frac{LRk}{\gamma},

which implies the claim.

4.3 Proof of Theorem 3

We use the notation and results from the Appendix, and in particular, from Section A.3.

Proof [of Theorem 3] A bound of this form for the $k$-fold maximum aggregation was claimed in Kontorovich (2018); however, the argument there was flawed (see Section 5).

Let $G:\mathbb{R}^{k}\rightarrow\mathbb{R}$ be an aggregation rule that is $L$-Lipschitz in the supremum norm, as defined in (5), and let $F_{1},\ldots,F_{k}$ be bounded affine function classes, as defined in (10). Suppose that some $\Omega_{\ell}=\left\{x_{1},\ldots,x_{\ell}\right\}\subset\Omega=B$ is $\gamma$-shattered by $G$, let $F_{i}(\Omega_{\ell})=\left.F_{i}\right|_{\Omega_{\ell}}$, and let $\mu_{\ell}$ be the uniform distribution on $\Omega_{\ell}$. We upper bound the covering number as in Lemma 20 (with $m=\ell$),

\displaystyle\log\mathcal{N}(F_{i}(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma)\leq C\frac{R_{i}^{2}}{\gamma^{2}}\operatorname{Log}\frac{\ell\gamma^{2}}{R_{i}^{2}},\qquad 0<\gamma<R_{i}.

Denote $v_{i}:=L^{2}R_{i}^{2}/\gamma^{2}$, and consider the $L_{\infty}$ covering number of $F_{i}(\Omega_{\ell})$ at scale $\gamma/L$:

\displaystyle\log\mathcal{N}(F_{i}(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma/L)\leq Cv_{i}\operatorname{Log}\frac{\ell}{v_{i}}.

Then Theorem 10 implies that

\displaystyle\log\mathcal{N}(G(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma/2)\leq\sum_{i=1}^{k}\log\mathcal{N}(F_{i}(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma/2L)
\leq C\sum_{i=1}^{k}v_{i}\operatorname{Log}\frac{\ell}{v_{i}}
\overset{\textup{(a)}}{\leq}CD\operatorname{Log}\frac{k\ell}{D},

where $D:=\sum_{i=1}^{k}v_{i}$ and (a) follows by the concavity of $x\mapsto x\log(u/x)$ (see Corollary 24 in Section A.5). Combining the monotonicity of the covering number (see (9)) and a lower bound on the covering number in terms of the fat-shattering dimension (see Lemma 15 in Section A.2) yields

\displaystyle\log\mathcal{N}(G(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma/2)\geq C\operatorname{fat}_{\gamma}(G)=C\ell,

whence

\displaystyle\ell\leq CD\operatorname{Log}\frac{k\ell}{D}.

Using the elementary fact

\displaystyle x\leq A\operatorname{Log}x\implies x\leq cA\operatorname{Log}A,\qquad x,A\geq 1

(with $x=k\ell/D$ and $A=ck$) we get $\ell\leq cD\operatorname{Log}k$, which implies the claim.

The result can easily be generalized to hinge-loss affine classes. Let $F_{i}$ be an affine function class as in (10), define $F^{\prime}_{i}$ as the function class on $B\times\left\{-1,1\right\}$ given by $F^{\prime}_{i}=\left\{(x,y)\mapsto yf(x);f\in F_{i}\right\}$, and define the hinge-loss affine class $F_{i}^{\operatorname{Hinge}}$ as the function class on $B\times\left\{-1,1\right\}$ given by $F_{i}^{\operatorname{Hinge}}=\left\{(x,y)\mapsto\max\left\{0,1-f(x,y)\right\};f\in F^{\prime}_{i}\right\}$. One first observes that the restriction of $F^{\prime}_{i}$ to any $\left\{(x_{1},y_{1}),\ldots,(x_{n},y_{n})\right\}$, as a body in $\mathbb{R}^{n}$, is identical to the restriction of $F_{i}$ to $\left\{x_{1},\ldots,x_{n}\right\}$. Interpreting $F_{i}^{\operatorname{Hinge}}$ as a $2$-fold maximum over the singleton class $H=\left\{h\equiv 0\right\}$ and the bounded affine class $F^{\prime}_{i}$ lets us invoke Theorem 10 to argue that $F_{i}$ and $F_{i}^{\operatorname{Hinge}}$ have the same $L_{\infty}$ covering numbers. Hence, the argument we deployed here to establish (11) for affine classes also applies to $k$-fold $L$-Lipschitz aggregations of hinge-loss classes.

4.4 Proof of Corollary 5

Proof [of Corollary 5] Raviv et al. (2018, Theorem 7) upper-bounded the Rademacher complexity of the maximum aggregation of $k$ hinge-loss affine functions by $k/\sqrt{n}$.

For $R_{i}$-bounded affine functions or hinge-loss affine functions, the analysis above, combined with the calculation in Kontorovich (2018), yields a bound of $O\left(\sqrt{\frac{\operatorname{Log}(k)\log^{3}(n)\sum_{i=1}^{k}{R_{i}^{2}}}{n}}\right)$. For completeness, we provide the full proof.

Let $G_{\max}:\mathbb{R}^{k}\rightarrow\mathbb{R}$ be the $k$-fold maximum aggregation rule, as defined in (3), and let $F_{1},\ldots,F_{k}\subseteq\mathbb{R}^{\Omega}$ be $R_{i}$-bounded affine function classes as in (10) or hinge-loss affine function classes as in (12). Since this aggregation is $1$-Lipschitz in the supremum norm, Theorem 3 implies that

\displaystyle\operatorname{fat}_{\gamma}(G_{\max})\leq\frac{C\operatorname{Log}(k)}{\gamma^{2}}\sum_{i=1}^{k}{R_{i}^{2}},\qquad 0<\gamma<\min_{i\in[k]}R_{i},

where $C>0$ is a universal constant.

From fat-shattering to Rademacher.

The fat-shattering estimate above can be used to upper-bound the Rademacher complexity by converting the former to a covering number bound and plugging it into Dudley’s chaining integral (Dudley, 1967):

\displaystyle\mathcal{R}_{n}(F)\leq\inf_{\alpha\geq 0}\left(4\alpha+12\int_{\alpha}^{\infty}\sqrt{\frac{\log\mathcal{N}(t,F,\left\|\cdot\right\|_{2})}{n}}\,dt\right), (19)

where $\mathcal{N}(\cdot)$ are the $\ell_{2}$ covering numbers (see Section A.1).

It remains to bound the covering numbers. A simple way of doing so is to invoke Lemmas 2.6, 3.2, and 3.3 in Alon et al. (1997), but this incurs superfluous logarithmic factors in $n$. Instead, we use the sharper estimate of Mendelson and Vershynin (2003), stated here in Lemma 16. Putting $\tilde{R}=\max_{i}R_{i}$, the latter yields

\displaystyle\mathcal{R}_{n}(G_{\max})\leq\inf_{\alpha\geq 0}\left(4\alpha+12\int_{\alpha}^{1}\sqrt{\frac{\log\mathcal{N}(t,G_{\max},\left\|\cdot\right\|_{2})}{n}}\,dt\right)
\leq\inf_{\alpha\geq 0}\left(4\alpha+12c^{\prime}\int_{\alpha}^{1}\sqrt{\frac{\operatorname{fat}_{ct/\tilde{R}}(G_{\max})\log\frac{2\tilde{R}}{t}}{n}}\,dt\right)
\leq\inf_{\alpha\geq 0}\left(4\alpha+12c^{\prime\prime}\sqrt{\frac{\operatorname{Log}(k)\sum_{i=1}^{k}{R_{i}^{2}}}{n}}\int_{\alpha}^{1}\frac{\tilde{R}}{t}\sqrt{\log\frac{2\tilde{R}}{t}}\,dt\right).

Now

\displaystyle\int_{\alpha}^{1}\frac{\tilde{R}}{t}\sqrt{\log\frac{2\tilde{R}}{t}}\,dt=\frac{2\tilde{R}}{3}\left(\left(\log(2\tilde{R}/\alpha)\right)^{3/2}-(\log 2\tilde{R})^{3/2}\right)

and choosing $\alpha=1/\sqrt{n}$ yields

\displaystyle\mathcal{R}_{n}(G_{\max})\leq\frac{4}{\sqrt{n}}+12c^{\prime\prime}\sqrt{\frac{\operatorname{Log}(k)\sum_{i=1}^{k}{R_{i}^{2}}}{n}}\cdot\frac{2\tilde{R}}{3}\left(\left(\log(2\tilde{R}\sqrt{n})\right)^{3/2}-(\log 2\tilde{R})^{3/2}\right)
=O\left(\sqrt{\frac{\operatorname{Log}(k)\log^{3}(\tilde{R}n)\tilde{R}^{2}\sum_{i=1}^{k}{R_{i}^{2}}}{n}}\right).

 

4.5 Proof of Theorem 7

Proof [of Theorem 7] Let $G_{\max}:\mathbb{R}^{k}\rightarrow\mathbb{R}$ be the $k$-fold maximum aggregation rule, as defined in (3), and let $F_{1},\ldots,F_{k}\subseteq\mathbb{R}^{\Omega}$ be real-valued function classes. Note that $G_{\max}$ is an aggregation that commutes with shifts, as defined in (4).

By (8), there is an $r\in\mathbb{R}^{\Omega}$ such that $\operatorname{fat}_{\gamma}(G_{\max})=\operatorname{\textup{f\aa t}}_{\gamma}(G_{\max}-r)$. As in (16), put $F^{\prime}_{i}:=F_{i}-r$ and $G^{\prime}_{\max}:=G_{\max}-r=G_{\max}(F^{\prime}_{1},\ldots,F^{\prime}_{k})$. Define $\bar{G}_{\max}=\operatorname{sign}(G^{\prime}_{\max})$ and $\bar{F}_{i}=\operatorname{sign}(F_{i}^{\prime})$.

Since $\operatorname{sign}$ and $\max$ commute, we have $\bar{G}_{\max}=G_{\max}(\bar{F}_{1},\ldots,\bar{F}_{k})$. We claim that

\displaystyle\operatorname{\textup{f\aa t}}_{\gamma}(G_{\max}^{\prime})\leq\operatorname{vc}(\bar{G}_{\max}). (20)

Indeed, any $S\subset\Omega$ that is $\gamma$-shattered with shift $r=0$ by any $G\subset\mathbb{R}^{\Omega}$ is also VC-shattered by $\operatorname{sign}(G)$. (See Section 4.1, and notice that the converse implication, and the reverse inequality, do not hold.) It holds that

\displaystyle d+1\overset{(a)}{=}\operatorname{vc}(\bar{F}_{i})\overset{(b)}{=}\operatorname{\textup{f\aa t}}_{\gamma}(F_{i})\overset{(c)}{=}\operatorname{fat}_{\gamma}(F_{i})\overset{(d)}{=}\operatorname{fat}_{\gamma}(F_{i}^{\prime}),

where (a) follows from a standard argument (e.g., Mohri et al. (2012, Example 3.2)), (b) holds because any $S\subset\mathbb{R}^{d}$ that is VC-shattered by $\operatorname{sign}(F^{\prime}_{i})$ is also $\gamma$-shattered by $F^{\prime}_{i}$ with shift $r=0$, (c) follows from Lemma 21, since the class is closed under scalar multiplication, and (d) holds since the shattering remains the same for the shifted class.

Now the argument of Blumer et al. (1989, Lemma 3.2.3) applies:

\displaystyle\operatorname{vc}(\bar{G}_{\max})\leq 2(d+1)k\log(3k) (21)

(this holds for any $k$-fold aggregation function, not just the maximum). Combining (20) with (21) proves the claim.

4.6 Proof of Theorem 8

Proof [of Theorem 8] It follows from Mohri et al. (2012, Example 3.2) that $\operatorname{vc}(\operatorname{sign}(F_{i}))=d+1$. Since $F_{i}$ is closed under scalar multiplication, a scaling argument shows that any $S\subset\mathbb{R}^{d}$ that is VC-shattered by $\operatorname{sign}(F_{i})$ is also $\gamma$-shattered by $F_{i}$ with shift $r=0$, whence $\operatorname{\textup{f\aa t}}_{\gamma}(F_{i})=d+1$ for all $\gamma>0$; invoking Lemma 21 extends this to $\operatorname{fat}_{\gamma}(F_{i})$ as well. Now Csikós et al. (2019, Theorem 1) shows that $k$-fold unions of half-spaces necessarily shatter some set $S\subset\mathbb{R}^{d}$ of size at least $cdk\log k$. Since union is a special case of the max operator, and the latter commutes with $\operatorname{sign}$, the scaling argument shows that this $S$ is $\gamma$-shattered by $G_{\max}$ with shift $r=0$. Hence, $\operatorname{fat}_{\gamma}(G_{\max})\geq\operatorname{\textup{f\aa t}}_{\gamma}(G_{\max})\geq|S|$, which proves the claim.

5 Discussion

In this paper, we proved upper and lower bounds on the fat-shattering dimension of aggregation rules as a function of the fat-shattering dimensions of the component classes. We leave some remaining gaps for future work. First, for aggregation rules that commute with shifts, assuming $\operatorname{fat}_{\gamma}(F_{i})\leq d$ for $1\leq i\leq k$, we show in Theorem 1 that

\displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq Cdk\log^{2}\left(dk\right),\qquad\gamma>0,

where $C>0$ is a universal constant. We pose the following

Open problem.

Let $G$ be an aggregation rule that commutes with shifts. Is it the case that for all $F_{i}\subseteq\mathbb{R}^{\Omega}$ with $\operatorname{fat}_{\gamma}(F_{i})\leq d$, $i\in[k]$, we have

\displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq Cdk\log\left(k\right),\qquad\gamma>0,

for some universal constant $C>0$?

In light of Theorem 8, this is the best one could hope for in general. We pose also the following conjecture about bounded affine functions.

Conjecture 11

Theorem 3 is tight up to constants. For $R_{i}$-bounded affine functions and an aggregation rule $G$ that is $1$-Lipschitz in the supremum norm,

\displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\geq\frac{C\operatorname{Log}(k)}{\gamma^{2}}\sum_{i=1}^{k}{R_{i}^{2}},\qquad 0<\gamma<\min_{i\in[k]}R_{i}, (22)

where $C>0$ is a universal constant.

Throughout the paper, we mentioned several mistaken claims in the literature. In this section, we briefly discuss the nature of these mistakes, which are, in a sense, variations on the same kind of error. We begin with Attias et al. (2019, Lemma 14), which incorrectly claimed that any partial function class $F^{\star}$ has a disambiguation $\bar{F}$ such that $\operatorname{vc}(\bar{F})\leq\operatorname{vc}(F^{\star})$ (see Section 4.1 for the definitions). The mistake was pointed out to us by Yann Guermeur, and later, Alon et al. (2022, Theorem 11) showed that there exist partial classes $F^{\star}$ with $\operatorname{vc}(F^{\star})=1$ for which every disambiguation $\bar{F}$ has $\operatorname{vc}(\bar{F})=\infty$.

Kontorovich (2018) attempted to prove the bound stated in our Theorem 3 (up to constants, and only for linear classes). The argument proceeded via a reduction to the Boolean case, as in our proof of Theorem 7. It was correctly observed that if, say, some finite $S\subset\Omega$ is $1$-shattered by $F_{i}$ with shift $r=0$, then it is also VC-shattered by $\operatorname{sign}(F_{i})$. Neglected was the fact that $\operatorname{sign}(F_{i})$ might shatter additional points in $\Omega\setminus S$, and, in sufficiently high dimension, it necessarily will. The crux of the matter is that (20) holds in the dimension-dependent but not the dimension-free setting; again, this may be seen as a variant of the disambiguation mistake.

Finally, the proof of Hanneke and Kontorovich (2019, Lemma 6) claims, in the first display, that the shattered set can be classified with large margin, which is incorrect; this is yet another variant of mistaken disambiguation.


Acknowledgments

We thank Steve Hanneke and Ramon van Handel for very helpful discussions; the latter, in particular, patiently explained to us how to prove Lemma 20. Roman Vershynin kindly gave us permission to share his example in Remark 17. This research was partially supported by the Israel Science Foundation (grant No. 1602/19), an Amazon Research Award, the Ben-Gurion University Data Science Research Center, and Cyber Security Research Center, Prime Minister’s Office.

References

  • Alon et al. (1997) Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615–631, 1997.
  • Alon et al. (2020) Noga Alon, Amos Beimel, Shay Moran, and Uri Stemmer. Closure properties for private classification and online prediction. In Conference on Learning Theory, pages 119–152. PMLR, 2020.
  • Alon et al. (2022) Noga Alon, Steve Hanneke, Ron Holzman, and Shay Moran. A theory of PAC learnability of partial concept classes. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 658–671, 2022.
  • Appert and Catoni (2021) Gautier Appert and Olivier Catoni. New bounds for $k$-means and information $k$-means. arXiv preprint arXiv:2101.05728, 2021.
  • Artstein et al. (2004) S. Artstein, V. Milman, and S. J. Szarek. Duality of metric entropy. Annals of Mathematics, 159(3):1313–1328, 2004.
  • Attias and Hanneke (2023) Idan Attias and Steve Hanneke. Adversarially robust PAC learnability of real-valued functions. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 1172–1199, 2023.
  • Attias et al. (2019) Idan Attias, Aryeh Kontorovich, and Yishay Mansour. Improved generalization bounds for robust learning. In Algorithmic Learning Theory, ALT 2019, volume 98 of Proceedings of Machine Learning Research, pages 162–183. PMLR, 2019.
  • Attias et al. (2022) Idan Attias, Aryeh Kontorovich, and Yishay Mansour. Improved generalization bounds for adversarially robust learning. The Journal of Machine Learning Research, 23(1):7897–7927, 2022.
  • Bartlett and Shawe-Taylor (1999) Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers, pages 43–54. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-19416-3.
  • Bartlett and Long (1998) Peter L. Bartlett and Philip M. Long. Prediction, learning, uniform convergence, and scale-sensitive dimensions. J. Comput. Syst. Sci., 56(2):174–190, 1998.
  • Baum and Haussler (1989) Eric B. Baum and David Haussler. What size net gives valid generalization? Neural Comput., 1(1):151–160, 1989.
  • Biau et al. (2008) Gérard Biau, Luc Devroye, and Gábor Lugosi. On the performance of clustering in hilbert spaces. IEEE Transactions on Information Theory, 54(2):781–790, 2008.
  • Blumer et al. (1989) Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach., 36(4):929–965, 1989.
  • Breiman (1996) Leo Breiman. Bagging predictors. Machine learning, 24:123–140, 1996.
  • Breiman (2001) Leo Breiman. Random forests. Machine learning, 45:5–32, 2001.
  • Csikós et al. (2019) Mónika Csikós, Nabil H. Mustafa, and Andrey Kupavskii. Tight lower bounds on the vc-dimension of geometric set systems. J. Mach. Learn. Res., 20:81:1–81:8, 2019.
  • Duan (2012) Hubert Haoyang Duan. Bounding the Fat Shattering Dimension of a Composition Function Class Built Using a Continuous Logic Connective. PhD thesis, University of Waterloo, 2012.
  • Dudley (1967) Richard M Dudley. The sizes of compact subsets of hilbert space and continuity of gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
  • Eisenstat (2009) David Eisenstat. k-fold unions of low-dimensional concept classes. Inf. Process. Lett., 109(23-24):1232–1234, 2009.
  • Eisenstat and Angluin (2007) David Eisenstat and Dana Angluin. The VC dimension of k-fold union. Inf. Process. Lett., 101(5):181–184, 2007.
  • Fefferman et al. (2016) Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
  • Foster and Rakhlin (2019) Dylan J. Foster and Alexander Rakhlin. \ell_{\infty} vector contraction for rademacher complexity. CoRR, abs/1911.06468, 2019.
  • Freund and Schapire (1997) Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • Ghazi et al. (2021) Badih Ghazi, Noah Golowich, Ravi Kumar, and Pasin Manurangsi. Near-tight closure bounds for the littlestone and threshold dimensions. In Algorithmic Learning Theory, pages 686–696. PMLR, 2021.
  • Gottlieb et al. (2014) Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient classification for metric data (extended abstract: COLT 2010). IEEE Transactions on Information Theory, 60(9):5750–5759, 2014.
  • Gottlieb et al. (2018) Lee-Ad Gottlieb, Eran Kaufman, Aryeh Kontorovich, and Gabriel Nivasch. Learning convex polytopes with margin. In Neural Information Processing Systems (NIPS), 2018.
  • Hanneke and Kontorovich (2019) Steve Hanneke and Aryeh Kontorovich. Optimality of SVM: novel proofs and tighter bounds. Theor. Comput. Sci., 796:99–113, 2019.
  • Kearns and Schapire (1994) Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts. J. Comput. Syst. Sci., 48(3):464–497, 1994.
  • Kégl (2003) Balázs Kégl. Robust regression by boosting the median. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pages 258–272. Springer, 2003.
  • Klochkov et al. (2021) Yegor Klochkov, Alexey Kroshnin, and Nikita Zhivotovskiy. Robust k-means clustering for distributions with two moments. The Annals of Statistics, 49(4):2206–2230, 2021.
  • Kontorovich (2018) Aryeh Kontorovich. Rademacher complexity of kk-fold maxima of hyperplanes. 2018.
  • Mendelson and Vershynin (2003) S. Mendelson and R. Vershynin. Entropy and the combinatorial dimension. Invent. Math., 152(1):37–55, 2003.
  • Mohri et al. (2012) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
  • Raviv et al. (2018) Dolev Raviv, Tamir Hazan, and Margarita Osadchy. Hinge-minimax learner for the ensemble of hyperplanes. J. Mach. Learn. Res., 19:62:1–62:30, 2018.
  • Rudelson and Vershynin (2006) M. Rudelson and R. Vershynin. Combinatorics of random processes and sections of convex bodies. Annals of Mathematics, 164(2):603–648, 2006.
  • Srebro et al. (2010) Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. Advances in neural information processing systems, 23, 2010.
  • Talagrand (2003) Michel Talagrand. Vapnik–Chervonenkis type conditions and uniform Donsker classes of functions. The Annals of Probability, 31(3):1565–1582, 2003.
  • Vershynin (2018) Roman Vershynin. High-dimensional probability, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2018.
  • Vershynin (2021) Roman Vershynin, 2021. Private communication.
  • Zhang (2002) Tong Zhang. Covering number bounds of certain regularized linear function classes. The Journal of Machine Learning Research, 2:527–550, 2002.
  • Zhivotovskiy (2022) Nikita Zhivotovskiy. A bound for kk-fold maximum. 2022.

A Auxiliary results

A.1 Covering numbers and Lipschitz Aggregations

Lemma 12

If G:kG:\mathbb{R}^{k}\rightarrow\mathbb{R} is LL-Lipschitz under p\left\|\cdot\right\|_{p}, then G:(Ω)kΩG:(\mathbb{R}^{\Omega})^{k}\rightarrow\mathbb{R}^{\Omega} is LL-Lipschitz in Lp(k)(μ)\left\|\cdot\right\|_{L_{p}^{(k)}(\mu)}.

Proof 

G(f1,,fk)G(f1,,fk)Lp(μ)p\displaystyle\left\|G(f_{1},\ldots,f_{k})-G(f^{\prime}_{1},\ldots,f^{\prime}_{k})\right\|_{L_{p}(\mu)}^{p} =Ω|G(f1,,fk)(x)G(f1,,fk)(x)|pdμ(x)\displaystyle=\int_{\Omega}|G(f_{1},\ldots,f_{k})(x)-G(f^{\prime}_{1},\ldots,f^{\prime}_{k})(x)|^{p}\mathrm{d}\mu(x)
=Ω|G(f1(x),,fk(x))G(f1(x),,fk(x))|pdμ(x)\displaystyle=\int_{\Omega}|G(f_{1}(x),\ldots,f_{k}(x))-G(f^{\prime}_{1}(x),\ldots,f^{\prime}_{k}(x))|^{p}\mathrm{d}\mu(x)
ΩLp(f1(x),,fk(x))(f1(x),,fk(x))ppdμ(x),\displaystyle\leq\int_{\Omega}L^{p}\left\|(f_{1}(x),\ldots,f_{k}(x))-(f^{\prime}_{1}(x),\ldots,f^{\prime}_{k}(x))\right\|_{p}^{p}\mathrm{d}\mu(x),

where the inequality follows from the assumption that G:kG:\mathbb{R}^{k}\rightarrow\mathbb{R} is LL-Lipschitz in p\left\|\cdot\right\|_{p}. This proves

G(f1,,fk)G(f1,,fk)Lp(μ)L(f1,,fk)(f1,,fk)Lp(k)(μ),\displaystyle\left\|G(f_{1},\ldots,f_{k})-G(f^{\prime}_{1},\ldots,f^{\prime}_{k})\right\|_{L_{p}(\mu)}\leq L\left\|(f_{1},\ldots,f_{k})-(f^{\prime}_{1},\ldots,f^{\prime}_{k})\right\|_{L_{p}^{(k)}(\mu)},

and hence the claim.  

Proof [of Theorem 10] Suppose p<\infty, and let g=G(f_{1},\ldots,f_{k})\in G(F_{1},\ldots,F_{k}). For each i\in[k], let \hat{F}_{i}\subset F_{i} be a t/(Lk^{1/p})-cover of F_{i}, and let each f_{i} be t/(Lk^{1/p})-covered by some \hat{f}_{i}\in\hat{F}_{i}, in the sense that \left\|f_{i}-\hat{f}_{i}\right\|_{L_{p}(\mu)}\leq t/(Lk^{1/p}). Since G:\mathbb{R}^{k}\rightarrow\mathbb{R} is assumed to be L-Lipschitz in \left\|\cdot\right\|_{p}, Lemma 12 implies that G:(\mathbb{R}^{\Omega})^{k}\rightarrow\mathbb{R}^{\Omega} is L-Lipschitz in \left\|\cdot\right\|_{L_{p}^{(k)}(\mu)}. It then follows that g is t-covered by G(\hat{f}_{1},\ldots,\hat{f}_{k}), since

\left\|G(f_{1},\ldots,f_{k})-G(\hat{f}_{1},\ldots,\hat{f}_{k})\right\|_{L_{p}(\mu)}^{p} \leq L^{p}\left\|(f_{1},\ldots,f_{k})-(\hat{f}_{1},\ldots,\hat{f}_{k})\right\|_{L_{p}^{(k)}(\mu)}^{p}
= L^{p}\int_{\Omega}\left\|(f_{1}(x),\ldots,f_{k}(x))-(\hat{f}_{1}(x),\ldots,\hat{f}_{k}(x))\right\|_{p}^{p}\,\mathrm{d}\mu(x)
= L^{p}\int_{\Omega}\sum^{k}_{i=1}\left|f_{i}(x)-\hat{f}_{i}(x)\right|^{p}\,\mathrm{d}\mu(x)
= L^{p}\sum^{k}_{i=1}\int_{\Omega}\left|f_{i}(x)-\hat{f}_{i}(x)\right|^{p}\,\mathrm{d}\mu(x)
= L^{p}\sum^{k}_{i=1}\left\|f_{i}-\hat{f}_{i}\right\|_{L_{p}(\mu)}^{p}
\leq L^{p}k\left(\frac{t}{Lk^{1/p}}\right)^{p} = t^{p},

and so \left\|G(f_{1},\ldots,f_{k})-G(\hat{f}_{1},\ldots,\hat{f}_{k})\right\|_{L_{p}(\mu)}\leq t.

We conclude that G(F1,,Fk)G(F_{1},\ldots,F_{k}) has a tt-cover of size |F^1×F^2××F^k||\hat{F}_{1}\times\hat{F}_{2}\times\ldots\times\hat{F}_{k}|, which proves the claim. The case p=p=\infty is proved analogously (or, alternatively, as a limiting case of p<p<\infty).  
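Indeed, the case p=\infty can be spelled out directly (with k^{1/p}=1): take each \hat{F}_{i} to be a (t/L)-cover of F_{i} under L_{\infty}(\mu). Since G is L-Lipschitz in \left\|\cdot\right\|_{\infty}, for \mu-almost every x,

\left|G(f_{1}(x),\ldots,f_{k}(x))-G(\hat{f}_{1}(x),\ldots,\hat{f}_{k}(x))\right| \leq L\max_{i\in[k]}|f_{i}(x)-\hat{f}_{i}(x)| \leq L\max_{i\in[k]}\left\|f_{i}-\hat{f}_{i}\right\|_{L_{\infty}(\mu)} \leq L\cdot\frac{t}{L}=t,

so G(f_{1},\ldots,f_{k}) is t-covered by G(\hat{f}_{1},\ldots,\hat{f}_{k}) in L_{\infty}(\mu), and the same counting argument applies.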
We show that natural aggregations are Lipschitz in p\left\|\cdot\right\|_{p} norms, p[1,)p\in[1,\infty), and in supremum norm. The following facts are elementary:

|abcd|\displaystyle|a\vee b-c\vee d| \displaystyle\leq |ac||bd|,a,b,c,d;\displaystyle|a-c|\vee|b-d|,\qquad a,b,c,d\in\mathbb{R}; (23)
|abcd|\displaystyle|a\wedge b-c\wedge d| \displaystyle\leq |ac||bd|,a,b,c,d,\displaystyle|a-c|\vee|b-d|,\qquad a,b,c,d\in\mathbb{R}, (24)

where st:=max{s,t}s\vee t:=\max\left\{s,t\right\} and st:=min{s,t}s\wedge t:=\min\left\{s,t\right\}.

Lemma 13 (Maximum aggregation is 1-Lipschitz)

Let G_{\max}:\mathbb{R}^{k}\rightarrow\mathbb{R} be the maximum aggregation. Then, for any x,x^{\prime}\in\mathbb{R}^{k} and p\in[1,\infty],

\left|G_{\max}(x)-G_{\max}(x^{\prime})\right|\leq\left\|x-x^{\prime}\right\|_{p}.

Proof  For k=2 and p=\infty, the claim is precisely the pointwise inequality (23); the general case follows by induction on k (the induction step is displayed below). Since \left\|\cdot\right\|_{\infty}\leq\left\|\cdot\right\|_{p}, the claim extends to all p\in[1,\infty].
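For completeness, the induction step reads as follows: using (23),

\left|\max_{i\in[k]}x_{i}-\max_{i\in[k]}x^{\prime}_{i}\right| = \left|\Big(\max_{i\in[k-1]}x_{i}\Big)\vee x_{k}-\Big(\max_{i\in[k-1]}x^{\prime}_{i}\Big)\vee x^{\prime}_{k}\right| \leq \left|\max_{i\in[k-1]}x_{i}-\max_{i\in[k-1]}x^{\prime}_{i}\right|\vee\left|x_{k}-x^{\prime}_{k}\right| \leq \max_{i\in[k]}\left|x_{i}-x^{\prime}_{i}\right| = \left\|x-x^{\prime}\right\|_{\infty}.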

Lemma 14 (Max-Min aggregation is 1-Lipschitz)

If G_{\operatorname{max-min}}:\mathbb{R}^{k\times\ell}\rightarrow\mathbb{R} is the max-min aggregation x\mapsto\max_{i\in[k]}\min_{j\in[\ell]}x_{ij}, then for any x,x^{\prime}\in\mathbb{R}^{k\times\ell} and p\in[1,\infty],

\left|G_{\operatorname{max-min}}(x)-G_{\operatorname{max-min}}(x^{\prime})\right|\leq\left\|x-x^{\prime}\right\|_{p}.

Proof  The inequalities (23), (24) imply that the kk-fold max and min aggregations are both 11-Lipschitz with respect to \left\|\cdot\right\|_{\infty}. Hence, for all x,yk×x,y\in\mathbb{R}^{k\times\ell}, we have

\left|\min_{j\in[\ell]}x_{ij}-\min_{j\in[\ell]}y_{ij}\right|\leq\max_{j\in[\ell]}\left|x_{ij}-y_{ij}\right|,\qquad i\in[k]

and further,

\left|\max_{i\in[k]}\min_{j\in[\ell]}x_{ij}-\max_{i\in[k]}\min_{j\in[\ell]}y_{ij}\right|\leq\max_{i\in[k]}\max_{j\in[\ell]}\left|x_{ij}-y_{ij}\right|.

This proves that \left|G_{\operatorname{max-min}}(x)-G_{\operatorname{max-min}}(y)\right|\leq\left\|x-y\right\|_{\infty}. Since \left\|\cdot\right\|_{\infty}\leq\left\|\cdot\right\|_{p}, the claim holds for all p\in[1,\infty].
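For completeness, similar one-line checks cover two further natural aggregations mentioned in the introduction. For the k-fold mean, for any x,x^{\prime}\in\mathbb{R}^{k} and p\in[1,\infty],

\left|\frac{1}{k}\sum_{i=1}^{k}x_{i}-\frac{1}{k}\sum_{i=1}^{k}x^{\prime}_{i}\right| \leq \frac{1}{k}\sum_{i=1}^{k}\left|x_{i}-x^{\prime}_{i}\right| \leq \left\|x-x^{\prime}\right\|_{\infty} \leq \left\|x-x^{\prime}\right\|_{p},

and for the k-fold median (or any order statistic): if every coordinate of x is within \left\|x-x^{\prime}\right\|_{\infty} of the corresponding coordinate of x^{\prime}, then so is every order statistic, whence the same bound holds.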

A.2 Covering numbers and the fat-shattering dimension

In this section, we summarize some known results connecting the covering numbers of a bounded function class to its fat-shattering dimension.

Lemma 15 (Talagrand (2003), Proposition 1.4)

For any F[R,R]ΩF\subseteq[-R,R]^{\Omega}, there exists a probability measure μ\mu on Ω\Omega such that

𝒩(F,L2(μ),t)\displaystyle\mathcal{N}(F,L_{2}(\mu),t) \displaystyle\geq 2Cfat2t(F),0<t<R,\displaystyle 2^{C\operatorname{fat}_{2t}(F)},\qquad 0<t<R, (25)

where C>0 is a universal constant. Moreover, \mu may be taken to be the uniform distribution on any 2t-shattered subset of \Omega.

Remark. The tightness of (25) is trivially demonstrated by the example F={γ,γ}nF=\left\{-\gamma,\gamma\right\}^{n}.
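To spell the example out (with \Omega=[n] and, as in Lemma 15, \mu uniform on \Omega): every sign pattern y\in\left\{-1,1\right\}^{n} is realized by the function f=\gamma y\in F with witness r\equiv 0, so for all 0<2t\leq\gamma,

\operatorname{fat}_{2t}(F)=n \qquad\text{while}\qquad \mathcal{N}(F,L_{2}(\mu),t)\leq|F|=2^{n}=2^{\operatorname{fat}_{2t}(F)},

so the constant C in (25) cannot be taken larger than 1 for this class.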

Lemma 16 (Mendelson and Vershynin (2003), Theorem 1)

For all F[1,1]ΩF\subseteq[-1,1]^{\Omega} and all probability measures μ\mu,

𝒩(F,L2(μ),t)\displaystyle\mathcal{N}(F,L_{2}(\mu),t) \displaystyle\leq (2t)Cfatct(F),0<t<1,\displaystyle\left(\frac{2}{t}\right)^{C\operatorname{fat}_{ct}(F)},\qquad 0<t<1, (26)

where C,c>0C,c>0 are universal constants.

Remark 17

The following example due to Vershynin (2021) shows that (26) is tight. Take Ω=[n]\Omega=[n] and F=[1,1]ΩF=[-1,1]^{\Omega}. Then, for all sufficiently small t>0t>0, we have fatt(F)=n\operatorname{fat}_{t}(F)=n. However, a simple volumetric calculation shows that 𝒩(F,t)\mathcal{N}(F,t) behaves as (C/t)n(C/t)^{n} for small tt, where C>0C>0 is a constant.
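A sketch of that calculation, with \mu uniform on [n] (so that \left\|\cdot\right\|_{L_{2}(\mu)}=\left\|\cdot\right\|_{2}/\sqrt{n}): a t-ball in L_{2}(\mu) is a Euclidean ball of radius t\sqrt{n}, whose volume is of order (ct)^{n} for a universal constant c, while \operatorname{vol}([-1,1]^{n})=2^{n}. Hence, by the standard volumetric lower bound,

\mathcal{N}(F,L_{2}(\mu),t) \geq \frac{\operatorname{vol}([-1,1]^{n})}{\operatorname{vol}(t\sqrt{n}\,B_{2}^{n})} \geq \left(\frac{C}{t}\right)^{n},

which matches the form of the upper bound (26), since \operatorname{fat}_{ct}(F)=n for sufficiently small t.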

Lemma 18 (Rudelson and Vershynin (2006))

Suppose that p[2,)p\in[2,\infty), μ\mu is a probability measure on Ω\Omega, and R>0R>0. If FLp(Ω,μ)F\subset L_{p}(\Omega,\mu) satisfies supfFfL2p(μ)R\sup_{f\in F}\left\|f\right\|_{L_{2p}(\mu)}\leq R, then

log𝒩(F,Lp(μ),t)\displaystyle\log\mathcal{N}(F,L_{p}(\mu),t) \displaystyle\leq Cp2fatct(F)logRct,0<t<R;\displaystyle Cp^{2}\operatorname{fat}_{ct}(F)\log\frac{R}{ct},\qquad 0<t<R;

furthermore, for all ε>0\varepsilon>0, if supfFfL(μ)R\sup_{f\in F}\left\|f\right\|_{L_{\infty}(\mu)}\leq R, then

log𝒩(F,L(μ),t)\displaystyle\log\mathcal{N}(F,L_{\infty}(\mu),t) \displaystyle\leq Cvlog(Rn/vt)logε(n/v),0<t<R,\displaystyle Cv\log(Rn/vt)\log^{\varepsilon}(n/v),\qquad 0<t<R,

where n=|Ω|n=|\Omega|, v=fatcεt(F)v=\operatorname{fat}_{c\varepsilon t}(F), and C,c>0C,c>0 are universal constants.

A.3 Covering numbers of linear and affine classes

Let BdB\subset\mathbb{R}^{d} be the dd-dimensional Euclidean unit ball and

F={xwx+b;w|b|R}\displaystyle F=\left\{x\mapsto w\cdot x+b;\left\|w\right\|\vee|b|\leq R\right\}

be the collection of RR-bounded affine functions on Ω=B\Omega=B.

Remark 19

There is a trivial reduction from an RR-bounded affine class in dd dimensions to a 2R2R-bounded linear class in d+1d+1 dimensions, via the standard trick of adding an extra dummy dimension. This only affects the covering number bounds up to constants.
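One convenient normalization of this reduction (a sketch; the exact constants are immaterial) maps each x\in B\subset\mathbb{R}^{d} to \tilde{x}=(x,1)/\sqrt{2}, which lies in the unit ball of \mathbb{R}^{d+1}, and each pair (w,b) with \left\|w\right\|\vee|b|\leq R to \tilde{w}=\sqrt{2}(w,b). Then

\tilde{w}\cdot\tilde{x}=w\cdot x+b \qquad\text{and}\qquad \left\|\tilde{w}\right\|=\sqrt{2}\sqrt{\left\|w\right\|^{2}+b^{2}}\leq 2R,

so the R-bounded affine class on B embeds, as a function class, into a 2R-bounded linear class on the unit ball of \mathbb{R}^{d+1}.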

For ΩnB\Omega_{n}\subset B, |Ωn|=n|\Omega_{n}|=n, define F(Ωn)=F|ΩnF(\Omega_{n})=\left.F\right|_{\Omega_{n}}, and endow Ωn\Omega_{n} with the uniform measure μn\mu_{n}. Zhang (2002, Theorem 4) implies the covering number estimate

log𝒩(F(Ωn),L(μn),t)\displaystyle\log\mathcal{N}(F(\Omega_{n}),L_{\infty}(\mu_{n}),t) \displaystyle\leq CR2t2LognRt,t>0,\displaystyle C\frac{R^{2}}{t^{2}}\operatorname{Log}\frac{nR}{t},\qquad t>0,

where C>0 is a universal constant (Zhang's result is more general and allows one to compute explicit constants). We will use the following sharper bound:

Lemma 20
log𝒩(F(Ωn),L(μn),t)\displaystyle\log\mathcal{N}(F(\Omega_{n}),L_{\infty}(\mu_{n}),t) \displaystyle\leq CR2t2Logmt2R2,0<t<R,\displaystyle C\frac{R^{2}}{t^{2}}\operatorname{Log}\frac{mt^{2}}{R^{2}},\qquad 0<t<R,

where m=min{n,d}m=\min\left\{n,d\right\} and C>0C>0 is a universal constant.

Proof  The result is folklore, but we provide a proof for completeness.

Write X=\Omega_{n}=\left\{x_{1},\ldots,x_{n}\right\}. We argue that there is no loss of generality in assuming d\geq n. Indeed, if n>d, then X lies in the span of some subset X^{\prime}=\left\{x_{1}^{\prime},\ldots,x_{d}^{\prime}\right\}\subset B, and the restricted class F(\Omega_{n}) is likewise an (at most) d-dimensional set. Thus, we assume d\geq n henceforth, so that m=n. Via a standard infinitesimal perturbation, we may assume that X is a linearly independent set (equivalently, that X, viewed as an m\times d matrix, has rank m). Normalizing R=1 (the general case follows by replacing t with t/R), we then have F(\Omega_{n})=XB, which means that F(\Omega_{n}) is an ellipsoid. We are interested in its \ell_{\infty} covering numbers.

Let K\subset\mathbb{R}^{d} be such that XK=L, where L=B_{\infty}^{m} is the unit cube. (One may take K to be the full pre-image X^{-1}(L): since X has rank m, every point of [-1,1]^{m} has a pre-image under X, so indeed XK=L.) Let us compute the polar body K^{\circ}, defined as

K={ud:supvKvu1}.\displaystyle K^{\circ}=\left\{u\in\mathbb{R}^{d}:\sup_{v\in K}v\cdot u\leq 1\right\}.

We claim that

K=absconv(X)=:{i=1mαixi;|αi|1}.\displaystyle K^{\circ}=\operatorname{absconv}(X)=:\left\{\sum_{i=1}^{m}\alpha_{i}x_{i};\sum\left|\alpha_{i}\right|\leq 1\right\}.

Indeed, consider a z=i=1mαixiabsconv(X)z=\sum_{i=1}^{m}\alpha_{i}x_{i}\in\operatorname{absconv}(X). Then, for any vKv\in K, we have

vz\displaystyle v\cdot z =\displaystyle= vi=1mαixi\displaystyle v\cdot\sum_{i=1}^{m}\alpha_{i}x_{i}
= \sum_{i=1}^{m}\alpha_{i}(v\cdot x_{i}) \leq \sum_{i=1}^{m}|\alpha_{i}| \leq 1 \qquad\implies\qquad z\in K^{\circ},

where we have used |v\cdot x_{i}|\leq 1 (since XK=L=B_{\infty}^{m}=[-1,1]^{m}) and Hölder's inequality. This shows that \operatorname{absconv}(X)\subseteq K^{\circ}. For the reverse inclusion, consider any u\in K^{\circ}. There is no loss of generality in assuming that u lies in the span of X, that is, u=\sum_{i=1}^{m}\alpha_{i}x_{i} for some \alpha_{i}\in\mathbb{R}. By the definition of K^{\circ}, we have

supvKvu\displaystyle\sup_{v\in K}v\cdot u =\displaystyle= supvKvi=1mαixi=supvKi=1mαi(vxi)1.\displaystyle\sup_{v\in K}v\cdot\sum_{i=1}^{m}\alpha_{i}x_{i}=\sup_{v\in K}\sum_{i=1}^{m}\alpha_{i}(v\cdot x_{i})\leq 1.

Now because XK=[-1,1]^{m}, for each choice of \alpha\in\mathbb{R}^{m} there is a v\in K such that v\cdot x_{i}=\operatorname{sign}(\alpha_{i}) for all i\in[m]; for this v, \sum_{i=1}^{m}\alpha_{i}(v\cdot x_{i})=\sum_{i=1}^{m}|\alpha_{i}|, and so we must have \sum_{i=1}^{m}|\alpha_{i}|\leq 1. This proves K^{\circ}\subseteq\operatorname{absconv}(X).

It is well-known (and easy to verify) that covering numbers enjoy an affine invariance:

N(F,L):=N(XB,XK)=N(B,K),\displaystyle N(F,L):=N(XB,XK)=N(B,K),

where N(A,B)N(A,B), for two sets A,BA,B, is the smallest number of copies of BB necessary to cover AA. Now the seminal result of Artstein et al. (2004) applies: for all t>0t>0,

logN(B,tK)alogN(K,btB),\displaystyle\log N(B,tK)\leq a\log N(K^{\circ},btB),

where a,b>0a,b>0 are universal constants.

This reduces the problem to estimating the 2\ell_{2}-covering numbers of absconv(X)\operatorname{absconv}(X). The latter may be achieved via Maurey’s method (Vershynin, 2018, Corollary 0.0.4 and Exercise 0.0.6): the tt-covering number of absconv(rX)\operatorname{absconv}(rX) under 2\ell_{2} is at most

(c+cmt2/r2)r2/t2,\displaystyle(c+cmt^{2}/r^{2})^{\left\lceil r^{2}/t^{2}\right\rceil},

where c>0 is a universal constant. Applying this with r=1 at a scale proportional to t and taking logarithms gives \log\mathcal{N}(F(\Omega_{n}),L_{\infty}(\mu_{n}),t)\leq C\left\lceil 1/t^{2}\right\rceil\log(c+cmt^{2})\leq C^{\prime}\frac{1}{t^{2}}\operatorname{Log}(mt^{2}); undoing the normalization R=1 (that is, replacing t by t/R) yields the claimed bound.

 

A.4 Fat-shattering dimension of linear and affine classes

In this section, Ω=d\Omega=\mathbb{R}^{d} and BdB\subset\mathbb{R}^{d} denotes the Euclidean unit ball. A function f:Ωf:\Omega\to\mathbb{R} is said to be affine if it is of the form f(x)=wx+bf(x)=w\cdot x+b, for some wdw\in\mathbb{R}^{d} and bb\in\mathbb{R}, where \cdot denotes the Euclidean inner product.

Throughout the paper, we have referred to R-bounded affine function classes as those for which \left\|w\right\|\vee|b|\leq R. In this section, we define the larger class of R-semi-bounded affine functions as those for which \left\|w\right\|\leq R, while b may be unbounded. In particular, the covering-number results (and the reduction to linear classes spelled out in Remark 19) do not apply to semi-bounded affine classes.

The following simple result may be of independent interest.

Lemma 21

Let FΩF\subset\mathbb{R}^{\Omega} be some collection of functions with the closure property

f,gF\displaystyle f,g\in F (fg)/2F.\displaystyle\implies(f-g)/2\in F. (27)

Then, for all γ>0\gamma>0, we have fatγ(F)=fåtγ(F)\operatorname{fat}_{\gamma}(F)=\operatorname{\textup{f\aa t}}_{\gamma}(F) .

Proof 

Suppose that some set \left\{x_{1},\ldots,x_{k}\right\} is \gamma-shattered by F. That means there is an r\in\mathbb{R}^{k} such that for every y\in\left\{-1,1\right\}^{k}, there is an f=f_{y}\in F for which

γ\displaystyle\gamma \displaystyle\leq yi(f(xi)ri),i[k].\displaystyle y_{i}(f(x_{i})-r_{i}),\qquad i\in[k]. (28)

Now for any y{1,1}ky\in\left\{-1,1\right\}^{k}, let f^=fy{\hat{f}}=f_{y} and fˇ=fy{\check{f}}=f_{-y}. Then, for each i[k]i\in[k], we have

γ\displaystyle\gamma \displaystyle\leq yi(f^(xi)ri),\displaystyle\phantom{-}y_{i}({\hat{f}}(x_{i})-r_{i}),
γ\displaystyle\gamma \displaystyle\leq yi(fˇ(xi)ri).\displaystyle-y_{i}({\check{f}}(x_{i})-r_{i}).

Averaging these two inequalities shows that f=({\hat{f}}-{\check{f}})/2 achieves (28), for the given y, with r\equiv 0: for each i\in[k],

y_{i}\left(\frac{{\hat{f}}(x_{i})-{\check{f}}(x_{i})}{2}-0\right) = \frac{y_{i}({\hat{f}}(x_{i})-r_{i})+(-y_{i})({\check{f}}(x_{i})-r_{i})}{2} \geq \frac{\gamma+\gamma}{2}=\gamma.

Now (27) implies that f\in F, which completes the proof.

It is well-known (Bartlett and Shawe-Taylor, 1999, Theorem 4.6) that bounded linear functions — i.e., function classes on B of the form F=\left\{x\mapsto w\cdot x;\left\|w\right\|\leq R\right\}, also known as homogeneous hyperplanes — satisfy \operatorname{fat}_{\gamma}(F)\leq(R/\gamma)^{2}. The discussion in Hanneke and Kontorovich (2019, p. 102) shows that the common approach of reducing the general (affine) case to the linear (homogeneous, b=0) case, via the addition of a “dummy” coordinate, incurs a large suboptimal factor in the bound. Hanneke and Kontorovich (2019, Lemma 6) is essentially an analysis of the fat-shattering dimension of bounded affine functions. Although this result contains a mistake (see Section 5), much of the proof technique can be salvaged:

Lemma 22

The semi-bounded affine function class on BB defined by F={xwx+b;wR}F=\left\{x\mapsto w\cdot x+b;\left\|w\right\|\leq R\right\} in dd dimensions satisfies

fatγ(F)\displaystyle\operatorname{fat}_{\gamma}(F) \displaystyle\leq min{d+1,((1+8π)Rγ)2},0<γR.\displaystyle\min\left\{d+1,\left(\frac{\left(1+\sqrt{\frac{8}{\pi}}\right)R}{\gamma}\right)^{2}\right\},\qquad 0<\gamma\leq R.

Proof  Since FF satisfies (27), it suffices to consider fåtγ(F)\operatorname{\textup{f\aa t}}_{\gamma}(F), and so the shattering condition simplifies to

γ\displaystyle\gamma \displaystyle\leq yi(wxi+b),i[k].\displaystyle y_{i}(w\cdot x_{i}+b),\qquad i\in[k]. (29)

Now \operatorname{\textup{f\aa t}}_{\gamma}(F) is always upper-bounded by the VC-dimension of the corresponding class thresholded at zero, i.e., \operatorname{sign}(F). For d-dimensional inhomogeneous hyperplanes, the latter is exactly d+1 (Mohri et al., 2012, Example 3.2). Having dispensed with the dimension-dependent part of the bound, we now focus on the R-dependent one.

Let us observe, as in Hanneke and Kontorovich (2019, Lemma 6), that for \left\|x_{i}\right\|\leq 1, \left\|w\right\|\leq R, and \gamma\leq R, one can always realize (29) with |b|\leq 2R; this is what we shall assume, without loss of generality, henceforth. Summing the k inequalities in (29) yields

kγ\displaystyle k\gamma \displaystyle\leq wi=1kyixi+bi=1kyiRi=1kyixi+2R|i=1kyi|.\displaystyle w\cdot\sum_{i=1}^{k}y_{i}x_{i}+b\sum_{i=1}^{k}y_{i}\leq R\left\|\sum_{i=1}^{k}y_{i}x_{i}\right\|+2R\left|\sum_{i=1}^{k}y_{i}\right|.

Letting yy be drawn uniformly from {1,1}k\left\{-1,1\right\}^{k} and taking expectations, we have

kγ\displaystyle k\gamma \displaystyle\leq R𝔼i=1kyixi+2R𝔼|i=1kyi|R𝔼i=1kyixi2+2R𝔼(i=1kyi)2\displaystyle R\mathop{\mathbb{E}}\left\|\sum_{i=1}^{k}y_{i}x_{i}\right\|+2R\mathop{\mathbb{E}}\left|\sum_{i=1}^{k}y_{i}\right|\leq R\sqrt{\mathop{\mathbb{E}}\left\|\sum_{i=1}^{k}y_{i}x_{i}\right\|^{2}}+2R\sqrt{\mathop{\mathbb{E}}\left(\sum_{i=1}^{k}y_{i}\right)^{2}}
=\displaystyle= Ri=1kxi2+2Ri=1kyi23Rk.\displaystyle R\sqrt{\sum_{i=1}^{k}\left\|x_{i}\right\|^{2}}+2R\sqrt{\sum_{i=1}^{k}y_{i}^{2}}\leq 3R\sqrt{k}.

where the equality in the second line uses \mathop{\mathbb{E}}[y_{i}y_{j}]=\delta_{ij}, so that the cross terms vanish. Isolating k on the left-hand side of the inequality proves the claim k\leq\left(\frac{3R}{\gamma}\right)^{2}.

Following a referee’s suggestion, we improve the constant as follows. Note that

𝔼|i=1kyi|=12ki=0k(ki)|k2i|=k2k1(k1k2)2πkk+12,\displaystyle\mathop{\mathbb{E}}\left|\sum_{i=1}^{k}y_{i}\right|=\frac{1}{2^{k}}\sum^{k}_{i=0}\binom{k}{i}|k-2i|=\frac{k}{2^{k-1}}\binom{k-1}{\left\lfloor\frac{k}{2}\right\rfloor}\leq\sqrt{\frac{2}{\pi}}\frac{k}{\sqrt{k+\frac{1}{2}}},

where the inequality follows from a binomial-coefficient estimate via Stirling's approximation. Thus,

kγ\displaystyle k\gamma \displaystyle\leq Rk+2R2πkk+12Rk+2R2πk,\displaystyle R\sqrt{k}+2R\sqrt{\frac{2}{\pi}}\frac{k}{\sqrt{k+\frac{1}{2}}}\leq R\sqrt{k}+2R\sqrt{\frac{2}{\pi}}\sqrt{k},

which proves that k((1+8π)Rγ)2k\leq\left(\frac{\left(1+\sqrt{\frac{8}{\pi}}\right)R}{\gamma}\right)^{2}.  
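Numerically, the improvement of the second argument over the first is

\left(1+\sqrt{\tfrac{8}{\pi}}\right)^{2}\approx(1+1.596)^{2}\approx 6.74, \qquad\text{compared with } 3^{2}=9.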

A.5 Concavity miscellanea

The results below are routine exercises in differentiation and Jensen’s inequality.

Lemma 23

For u>0u>0, the function xxlog(u/x)x\mapsto x\log(u/x) is concave on (0,)(0,\infty).

Corollary 24

For all u>0u>0 and vi>0v_{i}>0, i[k]i\in[k],

i=1kvilog(u/vi)\displaystyle\sum_{i=1}^{k}v_{i}\log(u/v_{i}) \displaystyle\leq (vi)logukvi.\displaystyle\left(\sum v_{i}\right)\log\frac{uk}{\sum v_{i}}.
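For completeness, the Jensen step behind Corollary 24 reads as follows (with \varphi(x)=x\log(u/x), which is concave by Lemma 23):

\sum_{i=1}^{k}v_{i}\log(u/v_{i}) = k\cdot\frac{1}{k}\sum_{i=1}^{k}\varphi(v_{i}) \leq k\,\varphi\!\left(\frac{1}{k}\sum_{i=1}^{k}v_{i}\right) = \left(\sum_{i=1}^{k}v_{i}\right)\log\frac{uk}{\sum_{i=1}^{k}v_{i}}.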
Lemma 25

For 0εlog20\leq\varepsilon\leq\log 2 and u2u\geq 2, the function xxlog1+ε(u/x)x\mapsto x\log^{1+\varepsilon}(u/x) is concave on [1,)[1,\infty). It follows that for ε,u\varepsilon,u as above and vi1v_{i}\geq 1, i[k]i\in[k],

i=1kvilog1+ε(u/vi)\displaystyle\sum_{i=1}^{k}v_{i}\log^{1+\varepsilon}(u/v_{i}) \displaystyle\leq (vi)log1+εukvi.\displaystyle\left(\sum v_{i}\right)\log^{1+\varepsilon}\frac{uk}{\sum v_{i}}.