
Fat-Shattering Dimension of $k$-fold Aggregations
(A previous version of this paper was titled “Fat-shattering dimension of $k$-fold maxima”.)

Idan Attias    [email protected]
Aryeh Kontorovich    [email protected]
Department of Computer Science
Ben-Gurion University of the Negev, Beer Sheva, Israel
Abstract

We provide estimates on the fat-shattering dimension of aggregation rules of real-valued function classes. The latter consists of all ways of choosing $k$ functions, one from each of the $k$ classes, and computing a pointwise function of them, such as the median, mean, and maximum. The bound is stated in terms of the fat-shattering dimensions of the component classes. For linear and affine function classes, we provide a considerably sharper upper bound and a matching lower bound, achieving, in particular, an optimal dependence on $k$. Along the way, we improve several known results in addition to pointing out and correcting a number of erroneous claims in the literature.

Keywords: combinatorial dimension, scale-sensitive dimension, fat-shattering dimension, aggregation rules, $k$-fold maximum, ensemble methods

1 Introduction

The fat-shattering dimension, also known as “scale-sensitive” and the “parametrized variant of the $P$-dimension”, was first proposed by Kearns and Schapire (1994); its key role in learning theory lies in characterizing the PAC learnability of real-valued function classes (Alon et al., 1997; Bartlett and Long, 1998).

In this paper, we study the behavior of the fat-shattering dimension under various $k$-fold aggregations. Let $F_{1},\ldots,F_{k}\subseteq\mathbb{R}^{\Omega}$ be real-valued function classes, and $G:\mathbb{R}^{k}\rightarrow\mathbb{R}$ be an aggregation rule. We consider the aggregate function class $G(F_{1},\ldots,F_{k})$, which consists of all mappings $x\mapsto G(f_{1}(x),\ldots,f_{k}(x))$ for $f_{1}\in F_{1},\ldots,f_{k}\in F_{k}$. Some natural aggregation rules include the pointwise $k$-fold maximum, median, mean, and max-min. We seek to bound the shattering complexity of $G(F_{1},\ldots,F_{k})$ in terms of the fat-shattering dimensions of the constituent classes $F_{i}$. This question naturally arises in the context of ensemble methods, such as boosting and bagging, where the learner's prediction consists of an aggregation of base learners.

The analogous question regarding aggregations of VC classes (the VC dimension being the combinatorial complexity controlling the learnability of Boolean function classes) has been studied in detail and largely resolved (Baum and Haussler, 1989; Blumer et al., 1989; Eisenstat and Angluin, 2007; Eisenstat, 2009; Csikós et al., 2019). Furthermore, closure properties were also studied in the context of online classification and private PAC learning (Alon et al., 2020; Ghazi et al., 2021) for the Littlestone and Threshold dimensions. However, for real-valued functions, this question has remained largely uninvestigated.

Our contributions are as follows:

  • For a natural class of aggregation rules that commute with shifts (see Definition (4)), assuming $\operatorname{fat}_{\gamma}(F_{i})\leq d$ for $1\leq i\leq k$, we show that

    \displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq O\left(dk\log^{2}\left(dk\right)\right),\qquad\gamma>0.

    The formal statement is given in Theorem 1. Using an entirely different approach, for aggregations that are $L$-Lipschitz ($L\geq 1$) in the supremum norm (see Definition (5)) and for bounded function classes $F_{1},\ldots,F_{k}\subset[-R,R]^{\Omega}$ with $\operatorname{fat}_{\varepsilon\gamma}(F_{i})\leq d$ for $1\leq i\leq k$, we show that

    \displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq O\left(dk\operatorname{Log}^{1+\varepsilon}\frac{LRk}{\gamma}\right),\qquad 0<\gamma/L<R\text{ and }0<\varepsilon<\log 2,

    where $\operatorname{Log}(x):=\log(e\vee x)$. The formal statement is given in Theorem 2.

    In particular, our proofs hold for the maximum, minimum, median, mean, and max-min aggregations.

  • For $R$-bounded affine functions and for aggregations that are $L$-Lipschitz in the supremum norm, we show the following dimension-free bound,

    \displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq O\left(\frac{L^{2}R^{2}k\log(k)}{\gamma^{2}}\right),\qquad 0<\gamma/L<R.

    This result also extends to the hinge-loss class of affine functions. The formal statement is given in Theorem 3. In particular, we improve by a log factor the estimate of Fefferman et al. (2016, Lemma 6) on the fat-shattering dimension of the max-min aggregation of linear functions.

    Furthermore, in Corollary 5 we show an upper bound on the Rademacher complexity of the $k$-fold maximum aggregation of affine functions and hinge-loss affine functions. Our bound scales with $\sqrt{k}$, improving upon Raviv et al. (2018), where the dependence on $k$ is linear.

  • For affine functions and the $k$-fold max aggregation, we show tight dimension-dependent bounds (up to constants),

    \displaystyle\operatorname{fat}_{\gamma}(G_{\max}(F_{1},\ldots,F_{k}))=\Theta\left(dk\log k\right),\qquad\gamma>0,

    where $d$ is the Euclidean dimension. For the formal statements, see Theorems 7 and 8.

Applications.

The need to analyze the combinatorial complexity of a $k$-fold maximum of function classes (see (3) for the formal definition) arises in a number of diverse settings. One natural example is adversarially robust PAC learning against test-time attacks for real-valued functions (Attias et al., 2022; Attias and Hanneke, 2023). In this setting, the learner observes an i.i.d. labeled sample from an unknown distribution, and the goal is to output a hypothesis with a small error on unseen examples from the same distribution, with high probability. The difference from the standard PAC learning model is that at test time, the learner only observes a corrupted example, while the prediction is tested against the original label. Formally, $(x,y)$ is drawn from the unknown distribution, and an adversary can map $x$ to one of $k$ possible corruptions $z$ that are known to the learner. The learner observes only $z$, while its loss is computed with respect to the original label $y$. This scenario is naturally captured by the $k$-fold max: the learner aims to learn the maximum aggregation of the loss classes. Attias et al. (2022) showed that uniform convergence holds in this case, and so the sample complexity of an empirical risk minimization algorithm is determined by the complexity measure of the $k$-fold maximum aggregation.

Analyzing the $k$-fold maximum arises also in a setting of learning polyhedra with a margin. Gottlieb et al. (2018) provided a learning algorithm that represents polyhedra as intersections of bounded affine functions. The sample complexity of the algorithm is determined by the complexity measure of the maximum aggregation of affine function classes.

Another natural example of where the $k$-fold maximum and $k$-fold max-min play a role is in analyzing the convergence of $k$-means clustering. Fefferman et al. (2016) bounded the max-min aggregation, and Klochkov et al. (2021); Biau et al. (2008); Appert and Catoni (2021); Zhivotovskiy (2022) bounded the max aggregation. The main challenge in this setting is bounding the covering numbers of the aggregation over $k$ function classes, which can be obtained by bounding the Rademacher complexity or the fat-shattering dimension.

Finally, there are numerous ensemble methods for regression that output some aggregation of base learners, such as the median or mean. Examples of these methods include boosting (e.g., Freund and Schapire (1997); Kégl (2003)), bagging (bootstrap aggregation) by Breiman (1996), and its extension to the random forest algorithm (Breiman, 2001).

Related work.

It was claimed in Attias et al. (2019, Theorem 12) that $\operatorname{fat}_{\gamma}(G_{\max})\leq 2\log(3k)\sum_{i=1}^{k}\operatorname{fat}_{\gamma}(F_{i})$, but the proof had a mistake (see Section 5); our Open Problem asks whether the general form of this bound does hold (we believe it does, at least for the max aggregation). Using the recent disambiguation result of Alon et al. (2022), presented in Lemma 9 here, Attias et al. (2022, Lemma 15) obtained the bound $\operatorname{fat}_{\gamma}(G_{\max})\leq O\left(\operatorname{Log}(k)\operatorname{Log}^{2}(|\Omega|)\sum_{i=1}^{k}\operatorname{fat}_{\gamma}(F_{i})\right)$, where $\Omega$ is the domain of the function classes $F_{1},\ldots,F_{k}$. The two bounds are, in general, incomparable, but Theorem 1 is clearly a considerable improvement for large or infinite $\Omega$.

Using the covering number results of Mendelson and Vershynin (2003); Talagrand (2003) (see Section A.1), Duan (2012, Theorem 6.2) obtained a general result, which, when specialized to $k$-fold maxima, yields

\displaystyle\operatorname{fat}_{\gamma}(G_{\max})\leq O\left(\operatorname{Log}\frac{k}{\gamma}\cdot\sum_{i=1}^{k}\operatorname{fat}_{c\gamma/\sqrt{k}}(F_{i})\right) (1)

for a universal constant $c>0$; (1) is an immediate consequence of Theorem 10 (with $p=2$), Lemma 15, and Lemma 16 in this paper. Our results improve over (1) by removing the dependence on $k$ in the scale of the fat-shattering dimensions; however, Duan's general method is applicable to a wider class of uniformly continuous $k$-fold aggregations.

Srebro et al. (2010, Lemma A.2) bounded the fat-shattering dimension in terms of the Rademacher complexity. Foster and Rakhlin (2019) bounded the Rademacher complexity of a smooth $k$-fold aggregate; see also references therein. Inspired by Appert and Catoni (2021), Zhivotovskiy (2022) has obtained the best known upper bound on the Rademacher complexity of the $k$-fold maximum over linear function classes. Raviv et al. (2018) upper bounded the Rademacher complexity of the $k$-fold maximum aggregation of affine functions and hinge-loss affine functions.

2 Preliminaries

Aggregation rules.

A $k$-fold aggregation rule is any mapping $G:\mathbb{R}^{k}\rightarrow\mathbb{R}$. Just as $G$ maps $k$-tuples of reals into reals, it naturally aggregates $k$-tuples of functions into a single one: for $f_{1},\ldots,f_{k}:\Omega\rightarrow\mathbb{R}$, we define $G(f_{1},\ldots,f_{k}):\Omega\rightarrow\mathbb{R}$ as the mapping $x\mapsto G(f_{1}(x),\ldots,f_{k}(x))$. Finally, the aggregation extends to $k$-tuples of function classes: for $F_{1},\ldots,F_{k}\subseteq\mathbb{R}^{\Omega}$, we define

G(F_{1},\ldots,F_{k}):=\left\{x\mapsto G(f_{1}(x),\ldots,f_{k}(x)):f_{i}\in F_{i},i\in[k]\right\}. (2)

A canonical example of an aggregation rule is the $k$-fold max, induced by the mapping

\displaystyle G_{\max}(x_{1},\ldots,x_{k}):=\max_{i\in[k]}x_{i}. (3)
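
To make (2) and (3) concrete, the following minimal Python sketch (illustrative only; the toy classes, sample points, and helper names are ours, not from the paper) shows how an aggregation rule lifts from $k$-tuples of reals to $k$-tuples of functions and then to $k$-tuples of function classes.

    import itertools
    import numpy as np

    def aggregate_functions(G, fs):
        """Lift an aggregation rule G: R^k -> R to a k-tuple of functions,
        returning the pointwise aggregate x |-> G(f_1(x), ..., f_k(x))."""
        return lambda x: G(np.array([f(x) for f in fs]))

    def aggregate_classes(G, classes):
        """Form G(F_1, ..., F_k) as in (2): every way of choosing one function
        per class, aggregated pointwise (finite classes only, for illustration)."""
        return [aggregate_functions(G, fs) for fs in itertools.product(*classes)]

    # Toy example: two tiny classes of affine functions on R, aggregated by max.
    F1 = [lambda x: x, lambda x: -x]
    F2 = [lambda x: 0.5 * x + 1.0]
    for g in aggregate_classes(np.max, [F1, F2]):
        print([g(x) for x in (-1.0, 0.0, 1.0)])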

Next, we consider some properties that an aggregation rule might possess.

Commuting with shifts. We say that an aggregation rule $G$ commutes with shifts if

G(x)-r=G(x-r),\qquad x\in\mathbb{R}^{k},r\in\mathbb{R}. (4)

Lipschitz continuity. The mapping $G:\mathbb{R}^{k}\to\mathbb{R}$ is $L$-Lipschitz with respect to $\left\|\cdot\right\|_{\infty}$ if

\displaystyle\left|G(x)-G(x^{\prime})\right|\leq L\left\|x-x^{\prime}\right\|_{\infty}=L\max_{i\in[k]}|x_{i}-x_{i}^{\prime}|,\qquad x,x^{\prime}\in\mathbb{R}^{k}. (5)

Many natural aggregation rules possess these properties, such as maximum, minimum, median, mean, and max-min. The maximum is defined in (3), the minimum is defined analogously as

\displaystyle G_{\min}(x_{1},\ldots,x_{k}):=\min_{i\in[k]}x_{i},

the median is defined (with $x_{(1)}\leq\ldots\leq x_{(k)}$ denoting the values sorted in nondecreasing order) as

\displaystyle G_{\operatorname{med}}(x_{1},\ldots,x_{k}):=\begin{cases}x_{(\frac{k+1}{2})},&k\text{ is odd},\\ \frac{1}{2}\left(x_{(\frac{k}{2})}+x_{(\frac{k}{2}+1)}\right),&k\text{ is even},\end{cases}

and the mean is defined as

\displaystyle G_{\operatorname{mean}}(x_{1},\ldots,x_{k}):=\frac{1}{k}\sum^{k}_{i=1}x_{i}.

We also define $G_{\operatorname{max-min}}:\mathbb{R}^{\ell\times k}\to\mathbb{R}$ as

\displaystyle G_{\operatorname{max-min}}(x_{11},\ldots,x_{k\ell}):=\max_{j\in[\ell]}\min_{i\in[k]}x_{ij}; (6)

it is readily verified to commute with shifts, and Lemma 14 shows that it is $1$-Lipschitz with respect to $\left\|\cdot\right\|_{\infty}$.
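
As a quick sanity check on these two properties (a sketch with randomly generated test vectors; the helper names are ours), the following verifies numerically that the maximum, median, and mean commute with shifts as in (4) and are $1$-Lipschitz in the supremum norm as in (5).

    import numpy as np

    rng = np.random.default_rng(0)
    aggregations = {"max": np.max, "median": np.median, "mean": np.mean}

    for name, G in aggregations.items():
        for _ in range(1000):
            k = rng.integers(1, 8)
            x, xp = rng.normal(size=k), rng.normal(size=k)
            r = rng.normal()
            # Commuting with shifts: G(x) - r == G(x - r).
            assert np.isclose(G(x) - r, G(x - r))
            # 1-Lipschitz in the supremum norm: |G(x) - G(x')| <= ||x - x'||_inf.
            assert abs(G(x) - G(xp)) <= np.max(np.abs(x - xp)) + 1e-12
        print(name, "commutes with shifts and is 1-Lipschitz on these samples")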

Fat-shattering dimension.

Let $\Omega$ be a set and $F\subset\mathbb{R}^{\Omega}$. For $\gamma>0$, a set $S=\left\{x_{1},\ldots,x_{m}\right\}\subset\Omega$ is said to be $\gamma$-shattered by $F$ if

\displaystyle\sup_{r\in\mathbb{R}^{m}}\;\min_{y\in\left\{-1,1\right\}^{m}}\;\sup_{f\in F}\;\min_{i\in[m]}\;y_{i}(f(x_{i})-r_{i})\geq\gamma. (7)

The $\gamma$-fat-shattering dimension, denoted by $\operatorname{fat}_{\gamma}(F)$, is the size of the largest $\gamma$-shattered set (possibly $\infty$).
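
Unwinding (7): $S$ is $\gamma$-shattered if there is a witness $r\in\mathbb{R}^{m}$ such that every sign pattern $y\in\left\{-1,1\right\}^{m}$ is realized by some $f\in F$ with margin $\gamma$ around $r$. The sketch below (a brute-force check for a finite class and a fixed candidate witness, with toy values of our own choosing; it does not search over $r$) is exponential in $m$ and is meant only to illustrate the definition.

    import itertools
    import numpy as np

    def is_gamma_shattered(values, r, gamma):
        """values: (num_functions, m) array with values[j, i] = f_j(x_i).
        Returns True if every sign pattern y is realized with margin gamma
        around the fixed witness r, i.e., condition (7) holds for this r."""
        m = values.shape[1]
        for y in itertools.product([-1, 1], repeat=m):
            y = np.array(y)
            margins = y * (values - r)          # shape (num_functions, m)
            if not np.any(margins.min(axis=1) >= gamma):
                return False
        return True

    # Toy example: four functions realizing all sign patterns on two points.
    vals = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
    print(is_gamma_shattered(vals, r=np.zeros(2), gamma=1.0))  # True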

Fat-shattering dimension at zero.

As in Gottlieb et al. (2014), we also define the notion of $\gamma$-shattering at $0$, where the “shift” $r$ in (7) is constrained to be $0$. Formally, the shattering condition is $\min_{y\in\left\{-1,1\right\}^{m}}\sup_{f\in F}\min_{i\in[m]}y_{i}f(x_{i})\geq\gamma$, and we denote the corresponding dimension by $\operatorname{\textup{f\aa t}}_{\gamma}(F)$.

Attias et al. (2019, Lemma 13) showed that for all $F\subset\mathbb{R}^{\Omega}$,

\displaystyle\operatorname{fat}_{\gamma}(F)=\max_{r\in\mathbb{R}^{\Omega}}\operatorname{\textup{f\aa t}}_{\gamma}(F-r),\qquad\gamma>0, (8)

where $F-r=\left\{f-r;f\in F\right\}$ is the $r$-shifted class (the maximum is always achieved). Lemma 21 presents another, apparently novel, connection between $\operatorname{fat}$ and $\operatorname{\textup{f\aa t}}$.

Rademacher complexity.

Let $\mathcal{F}$ be a real-valued function class on the domain space $\mathcal{W}$. The empirical Rademacher complexity of $\mathcal{F}$ on a given sequence $(w_{1},\ldots,w_{n})\in\mathcal{W}^{n}$ is defined as

\displaystyle\mathcal{R}_{n}(\mathcal{F}|w_{1},\ldots,w_{n})=\mathbb{E}_{\mathbf{\sigma}}\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}f(w_{i}),

where $\sigma=(\sigma_{1},\ldots,\sigma_{n})$ are independent random variables uniformly chosen from $\left\{-1,1\right\}$. The Rademacher complexity of $\mathcal{F}$ with respect to a distribution $\mathcal{D}$ is defined as

\displaystyle\mathcal{R}_{n}(\mathcal{F})=\mathbb{E}_{w_{1},\ldots,w_{n}\sim\mathcal{D}}\mathcal{R}_{n}(\mathcal{F}|w_{1},\ldots,w_{n}).

It is a classic fact (Mohri et al., 2012, Theorem 3.1) that the Rademacher complexity controls generalization bounds in a wide range of supervised learning settings.
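
When $\mathcal{F}$ is finite (or finitely discretized on the sample), the empirical Rademacher complexity can be estimated by Monte Carlo over random sign vectors. The sketch below uses a toy class of random unit-norm linear functions of our own choosing; it is illustrative, not part of the paper's analysis.

    import numpy as np

    def empirical_rademacher(values, num_trials=2000, seed=0):
        """values: (num_functions, n) array with values[j, i] = f_j(w_i).
        Monte Carlo estimate of E_sigma sup_f (1/n) sum_i sigma_i f(w_i)."""
        rng = np.random.default_rng(seed)
        num_functions, n = values.shape
        total = 0.0
        for _ in range(num_trials):
            sigma = rng.choice([-1.0, 1.0], size=n)
            total += np.max(values @ sigma) / n
        return total / num_trials

    # Toy class: 50 random unit-norm linear functions on 100 points of the unit ball.
    rng = np.random.default_rng(1)
    W = rng.normal(size=(50, 5))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    X = rng.normal(size=(100, 5))
    X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True))
    print(empirical_rademacher(W @ X.T))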

Covering numbers.

We start with some background on covering numbers. Whenever $\Omega$ is endowed with a probability measure $\mu$, this induces, for $p\in[1,\infty)$ and $f:\Omega\rightarrow\mathbb{R}^{k}$, the norm

\displaystyle\left\|f\right\|_{L_{p}^{(k)}(\mu)}^{p}=\mathop{\mathbb{E}}_{X\sim\mu}\left\|f(X)\right\|_{p}^{p}=\int_{\Omega}\left\|f(x)\right\|_{p}^{p}\,\mathrm{d}\mu(x)

on $L^{(k)}_{p}(\mu):=\left\{f\in(\mathbb{R}^{k})^{\Omega}:\left\|f\right\|_{L_{p}^{(k)}(\mu)}<\infty\right\}$. When $k=1$, we write $L_{p}(\mu):=L^{(1)}_{p}(\mu)$. For $p=\infty$, $\left\|f\right\|_{L_{\infty}^{(k)}(\mu)}$ is the essential supremum of $f$ with respect to $\mu$. For $t>0$ and $H\subset F\subset L_{p}(\mu)$, we say that $H$ is a $t$-cover of $F$ under $\left\|\cdot\right\|_{L_{p}(\mu)}$ if $\sup_{f\in F}\inf_{h\in H}\left\|f-h\right\|_{L_{p}(\mu)}\leq t$. The $t$-covering number of $F$, denoted by $\mathcal{N}(F,L_{p}(\mu),t)$, is the cardinality of the smallest $t$-cover of $F$ (possibly $\infty$). We note the obvious relation

\displaystyle p>q\implies\mathcal{N}(F,L_{p}(\mu),t)\geq\mathcal{N}(F,L_{q}(\mu),t), (9)

which holds for all probability measures $\mu$ and all $t>0$.
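
For intuition, a $t$-cover of a finite class restricted to a sample can be built greedily under the empirical $L_{p}(\mu_{n})$ norm (uniform $\mu_{n}$ on the sample); the size of the returned cover upper-bounds $\mathcal{N}(F,L_{p}(\mu_{n}),t)$. This is a standard greedy sketch with toy inputs of our own choosing, not code from the paper, and it also illustrates the monotonicity in (9).

    import numpy as np

    def lp_dist(u, v, p):
        """Empirical L_p(mu_n) distance between two functions, given their
        value vectors u, v on a sample of size n (uniform measure)."""
        if np.isinf(p):
            return np.max(np.abs(u - v))
        return np.mean(np.abs(u - v) ** p) ** (1.0 / p)

    def greedy_cover(values, t, p=2):
        """values: (num_functions, n) array of function values on the sample.
        Returns indices of an internal t-cover built greedily; its size is an
        upper bound on the covering number at scale t."""
        cover = []
        for j, row in enumerate(values):
            if all(lp_dist(row, values[c], p) > t for c in cover):
                cover.append(j)
        return cover

    rng = np.random.default_rng(2)
    vals = rng.uniform(-1.0, 1.0, size=(200, 30))
    print(len(greedy_cover(vals, t=0.5, p=2)), len(greedy_cover(vals, t=0.5, p=np.inf)))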

We sometimes overload the notation for aggregations by defining $G$ on $k$-tuples of functions (instead of on $k$-tuples of reals), $G:(\mathbb{R}^{\Omega})^{k}\rightarrow\mathbb{R}^{\Omega}$. We say that $G$ is $L$-Lipschitz with respect to $\left\|\cdot\right\|_{L_{p}^{(k)}(\mu)}$ if

\displaystyle\left\|G(f_{1:k})-G(f^{\prime}_{1:k})\right\|_{L_{p}(\mu)}\leq L\left\|(f_{1:k})-(f^{\prime}_{1:k})\right\|_{L_{p}^{(k)}(\mu)},\qquad f_{1:k},f^{\prime}_{1:k}\in(\mathbb{R}^{k})^{\Omega}.

Notation.

We write $\mathbb{N}=\left\{0,1,\ldots\right\}$ to denote the natural numbers. For $n\in\mathbb{N}$, we write $[n]:=\left\{1,2,\ldots,n\right\}$. All of our logarithms are base $e$, unless explicitly denoted otherwise. We use $\max\left\{u,v\right\}$ and $u\vee v$ interchangeably, and write $\operatorname{Log}(x):=\log(e\vee x)$. For any function class $F$ over a set $\Omega$ and $E\subset\Omega$, $F(E)=\left.F\right|_{E}$ denotes the projection on (restriction to) $E$. In line with the common convention in functional analysis, absolute numerical constants will be denoted by letters such as $C,c$, whose value may change from line to line. Any transformation $\varphi:\mathbb{R}\to\mathbb{R}$ may be applied to a function $f\in\mathbb{R}^{\Omega}$ via $\varphi(f):=\varphi\circ f$, as well as to $F\subset\mathbb{R}^{\Omega}$ via $\varphi(F):=\left\{\varphi(f);f\in F\right\}$. The sign function thresholds at $0$: $\operatorname{sign}(t)=\boldsymbol{1}[t\geq 0]$.

3 Main Results

Our main results involve upper-bounding the fat-shattering dimension of aggregation rules in terms of the dimensions of the component classes. We begin with the simplest (to present):

Theorem 1 (General function classes and aggregations that commute with shifts)

For $F_{1},\ldots,F_{k}\subseteq\mathbb{R}^{\Omega}$ and an aggregation rule $G$ that commutes with shifts (see definition (4)), we have

\displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq 25D_{\gamma}\log^{2}(90D_{\gamma}),\qquad\gamma>0,

where $D_{\gamma}:=\sum_{i=1}^{k}\operatorname{fat}_{\gamma}(F_{i})>0$. In the degenerate case where $D_{\gamma}=0$, $\operatorname{fat}_{\gamma}(G)=0$.

In particular, this result holds for natural aggregation rules, such as maximum, minimum, median, and mean.

Remark.

We made no attempt to optimize the constants; these are only provided to give a rough order-of-magnitude sense. In the sequel, we forgo numerical estimates and state the results in terms of unspecified universal constants.

The next result provides an alternative bound based on an entirely different technique:

Theorem 2 (Bounded function classes and Lipschitz aggregations)

For $0<\varepsilon<\log 2$, $F_{1},\ldots,F_{k}\subseteq[-R,R]^{\Omega}$, and an aggregation rule $G$ that is $L$-Lipschitz ($L\geq 1$) in the supremum norm (see definition (5)), we have

\displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq CD\operatorname{Log}^{1+\varepsilon}\frac{LRk}{\gamma},\qquad 0<\gamma/L<R,

where

\displaystyle D=\sum_{i=1}^{k}\operatorname{fat}_{c\varepsilon\gamma}(F_{i})

and $C,c>0$ are universal constants.

In Section A.1, we show that the maximum, median, mean, and max-min aggregations are $1$-Lipschitz.

Remark.

The bounds in Theorems 1 and 2 are, in general, incomparable, and not just because of the unspecified constants in the latter. One notable difference is that Theorem 1 only depends on the shattering scale $\gamma$, while Theorem 2 additionally features a (weak) explicit dependence on the aspect ratio $R/\gamma$. In particular, Theorem 1 is applicable to semi-bounded affine classes (see Section A.4), while Theorem 2 is not. Still, for fixed $R,\gamma$ and large $k$, the latter presents a significant asymptotic improvement over the former.

For the special case of affine functions and hinge-loss affine functions, the technique of Theorem 2 yields a considerably sharper estimate:

Theorem 3 (Dimension-free bound for Lipschitz aggregations of affine functions)

Let $B\subset\mathbb{R}^{d}$ be the $d$-dimensional Euclidean unit ball and

\displaystyle F_{i}=\left\{x\mapsto w\cdot x+b;\left\|w\right\|\vee|b|\leq R_{i},w\in\mathbb{R}^{d},b,R_{i}\in\mathbb{R}\right\},\qquad i\in[k], (10)

be $k$ collections of $R_{i}$-bounded affine functions on $\Omega=B$, and let $G$ be an aggregation rule that is $L$-Lipschitz in the supremum norm (see definition (5)). Then

\displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq\frac{CL^{2}\operatorname{Log}(k)}{\gamma^{2}}\sum_{i=1}^{k}{R_{i}^{2}},\qquad 0<\gamma/L<\min_{i\in[k]}R_{i}, (11)

where $C>0$ is a universal constant. Further, if

\displaystyle F_{i}^{\operatorname{Hinge}}=\left\{(x,y)\mapsto\max\left\{0,1-yf(x)\right\};f\in F_{i}\right\} (12)

is a family of $R_{i}$-bounded hinge-loss affine functions for $i\in[k]$ and $G_{\operatorname{Hinge}}\equiv G(F_{1}^{\operatorname{Hinge}},\ldots,F_{k}^{\operatorname{Hinge}})$ is an aggregation rule that is $L$-Lipschitz in the supremum norm, then the same bound as in (11) holds for $\operatorname{fat}_{\gamma}(G_{\operatorname{Hinge}})$.

In particular, Theorem 3 improves by a log factor the estimate of Fefferman et al. (2016) on the fat-shattering dimension of the max-min aggregation (defined in Section 2) of linear functions (the max-min aggregation is shown to be $1$-Lipschitz in the supremum norm in Lemma 14 of Section A.1):

Lemma 4 (Fefferman et al. (2016), Lemma 6)

Let $B\subset\mathbb{R}^{d}$ be the $d$-dimensional Euclidean unit ball and

\displaystyle F_{ij}=\left\{x\mapsto w\cdot x;\left\|w\right\|\leq 1,w\in\mathbb{R}^{d}\right\},\qquad i\in[k],j\in[\ell],

be $k\ell$ (identical) linear function classes defined on $\Omega=B$. If $G_{\operatorname{max-min}}$ is the max-min aggregation rule (6), then

\displaystyle\operatorname{fat}_{\gamma}(G_{\operatorname{max-min}}(F_{11},\ldots,F_{k\ell}))\leq C\frac{k\ell}{\gamma^{2}}\log^{2}\left(C\frac{k\ell}{\gamma^{2}}\right),

where $C>0$ is a universal constant.

Our Theorem 3 improves the latter by a log factor:

\displaystyle\operatorname{fat}_{\gamma}(G_{\operatorname{max-min}}(F_{11},\ldots,F_{k\ell}))\leq C\frac{k\ell\log\left(k\ell\right)}{\gamma^{2}}.
Corollary 5 (Rademacher complexity for $k$-Fold Maximum of Affine Functions)

Let $F_{i}$ be an $R_{i}$-bounded affine function class as in (10) or a hinge-loss affine function class as in (12), let $G_{\max}$ be the maximum aggregation rule, and let $\tilde{R}=\max_{i}R_{i}$. Then

\displaystyle\mathcal{R}_{n}(G_{\max}(F_{1},\ldots,F_{k}))\leq C\sqrt{\frac{\operatorname{Log}(k)\log^{3}(\tilde{R}n)\tilde{R}^{2}\sum_{i=1}^{k}{R_{i}^{2}}}{n}},

where $\mathcal{R}_{n}$ is the Rademacher complexity and $C>0$ is a universal constant.

Corollary 5 improves upon Raviv et al. (2018, Theorem 7). Their upper bound scales linearly with $k$, whereas ours scales as $\sqrt{k\log k}$.

Note, however, that for linear classes a better bound is known:

Theorem 6 (Zhivotovskiy (2022))

Let $B\subset\mathbb{R}^{d}$ be the $d$-dimensional Euclidean unit ball and

\displaystyle F_{i}=\left\{x\mapsto w\cdot x;\left\|w\right\|\leq 1,w\in\mathbb{R}^{d}\right\},\qquad i\in[k]

be $k$ (identical) linear function classes defined on $\Omega=B$. If $G_{\max}$ is the maximum aggregation rule, then

\displaystyle\mathcal{R}_{n}(G_{\max}(F_{1},\ldots,F_{k}))\leq C\log\left(\frac{n}{k}\right)\sqrt{\frac{k\log k}{n}},

where $\mathcal{R}_{n}$ is the Rademacher complexity and $C>0$ is a universal constant.

The estimate in Theorem 3 is dimension-free in the sense of being independent of $d$. In applications where a dependence on $d$ is admissible, an optimal bound can be obtained:

Theorem 7 (Dimension-dependent bound for $k$-fold maximum of affine functions)

Let $\Omega=\mathbb{R}^{d}$ and let $F_{i}\subset\mathbb{R}^{\Omega}$ be $k$ (identical) function classes consisting of all real-valued affine functions:

\displaystyle F_{i}=\left\{x\mapsto w\cdot x+b;w\in\mathbb{R}^{d},b\in\mathbb{R}\right\},\qquad i\in[k]

and let $G_{\max}$ be their $k$-fold maximum (see definition (3)). Then

\displaystyle\operatorname{fat}_{\gamma}(G_{\max}(F_{1},\ldots,F_{k}))\leq Cdk\operatorname{Log}k,\qquad\gamma>0,

where $C>0$ is a universal constant.

The optimality of the upper bound in Theorem 7 is witnessed by the matching lower bound:

Theorem 8 (Dimension-dependent lower bound for $k$-fold maximum of affine functions)

For $k\geq 1$ and $d\geq 4$, let $F_{1}=F_{2}=\ldots=F_{k}$ be the collection of all affine functions over $\Omega=\mathbb{R}^{d}$ and let $G_{\max}$ be their $k$-fold maximum (see definition (3)). Then

\displaystyle\operatorname{fat}_{\gamma}(G_{\max}(F_{1},\ldots,F_{k}))\geq C\log(k)\sum_{i=1}^{k}\operatorname{fat}_{\gamma}(F_{i})=Cdk\log k,\qquad\gamma>0,

where $C>0$ is a universal constant.

The scaling argument employed in the proof of Theorem 8 can be invoked to show that the claim continues to hold for $\Omega=B$.

Together, Theorems 7 and 8 show that the logarithmic dependence on $k$ is optimal.

4 Proofs

We start by upper-bounding the fat-shattering dimension of aggregation rules that commute with shifts, in terms of the dimensions of the component classes.

4.1 Proof of Theorem 1

Partial concept classes and disambiguation.

We say that $F^{\star}\subseteq\left\{0,1,\star\right\}^{\Omega}$ is a partial concept class over $\Omega$; this usage is consistent with Alon et al. (2022), while Attias et al. (2019, 2022) used the descriptor ambiguous. For $f^{\star}\in F^{\star}$, define its disambiguation set $\mathscr{D}(f^{\star})\subseteq\left\{0,1\right\}^{\Omega}$ as

\displaystyle\mathscr{D}(f^{\star})=\left\{g\in\left\{0,1\right\}^{\Omega}:\forall x\in\Omega,~f^{\star}(x)\neq\star\implies f^{\star}(x)=g(x)\right\};

in words, $\mathscr{D}(f^{\star})$ consists of the total concepts $g:\Omega\to\left\{0,1\right\}$ that agree pointwise with $f^{\star}$ whenever the latter takes a value in $\left\{0,1\right\}$. We say that $\bar{F}\subseteq\left\{0,1\right\}^{\Omega}$ disambiguates $F^{\star}$ if for all $f^{\star}\in F^{\star}$, we have $\bar{F}\cap\mathscr{D}(f^{\star})\neq\emptyset$; in words, every $f^{\star}\in F^{\star}$ must have a disambiguated representative in $\bar{F}$. (Attias et al. (2022) additionally required that $\bar{F}\subseteq\bigcup_{f^{\star}\in F^{\star}}\mathscr{D}(f^{\star})$, but this is an unnecessary restriction and does not affect any of the results.)

As in Alon et al. (2022); Attias et al. (2022), we say that $S\subset\Omega$ is VC-shattered by $F^{\star}$ if $F^{\star}(S)\supseteq\left\{0,1\right\}^{S}$. (Attias et al. (2019) had incorrectly given $F^{\star}(S)=\left\{0,1\right\}^{S}$ as the shattering condition.) We write $\operatorname{vc}(F^{\star})$ to denote the size of the largest VC-shattered set (possibly $\infty$). The obvious relation $\operatorname{vc}(F^{\star})\leq\operatorname{vc}(\bar{F})$ always holds between a partial concept class and any of its disambiguations. Alon et al. (2022, Theorem 13) proved the following variant of the Sauer-Shelah-Perles Lemma for partial concept classes:

Lemma 9 (Alon et al. (2022))

For every $F^{\star}\subseteq\left\{0,1,\star\right\}^{\Omega}$ with $d=\operatorname{vc}(F^{\star})<\infty$ and $|\Omega|<\infty$, there is an $\bar{F}$ disambiguating $F^{\star}$ such that

\displaystyle|\bar{F}(\Omega)|\leq(|\Omega|+1)^{(d+1)\log_{2}|\Omega|+2}. (13)

For $d>0$ and $|\Omega|>1$, this implies the somewhat more wieldy estimate (which does not appear in Alon et al. (2022), but is an elementary consequence of (13))

\displaystyle|\bar{F}(\Omega)|\leq|\Omega|^{5d\log_{2}|\Omega|}. (14)

We will make use of the elementary fact

\displaystyle x\leq A\log_{2}x\implies x\leq 3A\log(3A),\qquad x,A\geq 1

and its corollary

\displaystyle y\leq A(\log_{2}y)^{2}\implies y\leq 5A\log^{2}(18A),\qquad y,A\geq 1. (15)

Proof [of Theorem 1] We follow the basic techniques of discretization and $r$-shifting, employed in Attias et al. (2019, 2022). Fix $\gamma>0$ and define the operator $[\cdot]_{\gamma}^{\star}:\mathbb{R}\to\left\{0,1,\star\right\}$ as

\displaystyle[t]_{\gamma}^{\star}=\begin{cases}0,&t\leq-\gamma\\ 1,&t\geq\gamma\\ \star,&\text{else}.\end{cases}

Observe that for all $F\subseteq\mathbb{R}^{\Omega}$ and $[F]_{\gamma}^{\star}:=\left\{[f]_{\gamma}^{\star};f\in F\right\}$, we have $\operatorname{\textup{f\aa t}}_{\gamma}(F)=\operatorname{vc}([F]_{\gamma}^{\star})$. Let $G:\mathbb{R}^{k}\rightarrow\mathbb{R}$ be a $k$-fold aggregation rule and $F_{1},\ldots,F_{k}\subseteq\mathbb{R}^{\Omega}$ be real-valued function classes. Suppose that some $S=\left\{x_{1},\ldots,x_{\ell}\right\}\subset\Omega$ is $\gamma$-shattered by $G\equiv G(F_{1},\ldots,F_{k})$. Proving the claim amounts to upper-bounding $\ell$ appropriately. By (8), there is an $r\in\mathbb{R}^{\Omega}$ such that $\operatorname{fat}_{\gamma}(G)=\operatorname{\textup{f\aa t}}_{\gamma}(G-r)=\operatorname{vc}([G-r]_{\gamma}^{\star})$. Put $F^{\prime}_{i}:=F_{i}-r$; since $G$ commutes with the $r$-shift, as defined in (4), we have

\displaystyle G^{\prime}:=G(F^{\prime}_{1},\ldots,F^{\prime}_{k})=G(F_{1}-r,\ldots,F_{k}-r)=G(F_{1},\ldots,F_{k})-r. (16)

Hence, $S$ is VC-shattered by $[G^{\prime}]_{\gamma}^{\star}$ and

\displaystyle v_{i}:=\operatorname{vc}([F_{i}^{\prime}]_{\gamma}^{\star})=\operatorname{\textup{f\aa t}}_{\gamma}(F_{i}^{\prime})\leq\operatorname{fat}_{\gamma}(F_{i}^{\prime})=\operatorname{fat}_{\gamma}(F_{i}),\qquad i\in[k]. (17)

Let us assume for now that each $v_{i}>0$; in this case, there is no loss of generality in assuming $\ell>1$. Let $\bar{F}_{i}$ be a “good” disambiguation of $[F_{i}^{\prime}]_{\gamma}^{\star}$ on $S$, as furnished by Lemma 9:

\displaystyle|\bar{F}_{i}(S)|\leq\ell^{5v_{i}\log_{2}\ell}.

Observe that $\bar{G}:=G(\bar{F}_{1},\ldots,\bar{F}_{k})$ is a valid disambiguation of $[G^{\prime}]_{\gamma}^{\star}$. It follows that

\displaystyle 2^{\ell}\;\leq\;|\bar{G}(S)|\;\leq\;\prod_{i=1}^{k}|\bar{F}_{i}(S)|\;\leq\;\ell^{5\log_{2}\ell\sum_{i=1}^{k}v_{i}}. (18)

Thus, (15) implies that $\ell\leq 25(\sum v_{i})\log^{2}(90\sum v_{i})$, and the latter is an upper bound on $\operatorname{vc}(\bar{G})$, hence also on $\operatorname{vc}([G^{\prime}]_{\gamma}^{\star})=\operatorname{fat}_{\gamma}(G)$. The claim now follows from (17).

If any one given $v_{i}=0$, we claim that (18) is unaffected. This is because any $C^{\star}\subset\left\{0,1,\star\right\}^{\Omega}$ with $\operatorname{vc}(C^{\star})=0$ has a singleton disambiguation $\bar{C}=\left\{c\right\}$. Indeed, any given $x\in\Omega$ can receive at most one of $\left\{0,1\right\}$ as a label from the members of $C^{\star}$ (otherwise, it would be shattered, forcing $\operatorname{vc}(C^{\star})\geq 1$). If any $c^{\star}\in C^{\star}$ labels $x$ with $0$, then all members of $C^{\star}$ are disambiguated to label $x$ with $0$ (and, mutatis mutandis, $1$). Any $x$ labeled with $\star$ by every $c^{\star}\in C^{\star}$ can be disambiguated arbitrarily (say, to $0$). Disambiguating the degenerate $[F_{i}^{\prime}]_{\gamma}^{\star}$ to the singleton $\bar{F}_{i}(S)$ has no effect on the product in (18).

The foregoing argument continues to hold if more than one $v_{i}=0$. In particular, in the degenerate case where $\operatorname{fat}_{\gamma}(F_{1})=\operatorname{fat}_{\gamma}(F_{2})=\ldots=\operatorname{fat}_{\gamma}(F_{k})=0$, we have $\prod|\bar{F}_{i}(S)|=1$, which forces $\ell=0$.

4.2 Proof of Theorem 2

First, we upper bound the covering numbers of Lipschitz aggregations as a function of the covering numbers of the component classes.

Theorem 10 (Covering number of $L$-Lipschitz aggregations)

Let $t>0$, $p\in[1,\infty]$, and $F_{1},\ldots,F_{k}\subset L_{p}(\mu)$. Let $G$ be an aggregation rule that is $L$-Lipschitz. Then, for all probability measures $\mu$ on $\Omega$,

\displaystyle\mathcal{N}(G(F_{1},\ldots,F_{k}),L_{p}(\mu),t)\leq\begin{cases}\prod_{i=1}^{k}\mathcal{N}(F_{i},t/Lk^{1/p}),&p<\infty\\ \prod_{i=1}^{k}\mathcal{N}(F_{i},t/L),&p=\infty.\end{cases}

We proceed to the main proof.

Proof [of Theorem 2] Let $G:\mathbb{R}^{k}\rightarrow\mathbb{R}$ be an aggregation rule that is $L$-Lipschitz ($L\geq 1$) in the supremum norm, as defined in (5), and let $F_{1},\ldots,F_{k}\subseteq[-R,R]^{\Omega}$ be real-valued function classes. Suppose that some $\Omega_{\ell}=\left\{x_{1},\ldots,x_{\ell}\right\}\subset\Omega$ is $\gamma$-shattered by $G$, let $F_{i}(\Omega_{\ell})=\left.F_{i}\right|_{\Omega_{\ell}}$, and let $\mu_{\ell}$ be the uniform distribution on $\Omega_{\ell}$. We upper bound the covering number via the fat-shattering dimension as in Lemma 18 (see Section A.2), with $n=\ell$ and $p=\infty$:

\displaystyle\log\mathcal{N}(F_{i}(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma)\leq Cv_{i}\log(R\ell/v_{i}\gamma)\log^{\varepsilon}(\ell/v_{i}),\qquad 0<\gamma<R,

where $v_{i}=\operatorname{fat}_{c\varepsilon\gamma}(F_{i})$. Then Theorem 10 implies that

\displaystyle\log\mathcal{N}(G(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma/2)\leq\sum_{i=1}^{k}\log\mathcal{N}(F_{i}(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma/2L)
\leq C\sum_{i=1}^{k}v_{i}\log(LR\ell/v_{i}\gamma)\log^{\varepsilon}(\ell/v_{i})
\overset{\textup{(a)}}{\leq}C\sum_{i=1}^{k}v_{i}\log^{1+\varepsilon}(LR\ell/v_{i}\gamma)
\overset{\textup{(b)}}{\leq}CD\log^{1+\varepsilon}\frac{LR\ell k}{D\gamma},

where $D:=\sum_{i=1}^{k}v_{i}$, (a) follows since $R/\gamma>1$ and $L\geq 1$, and (b) follows by the concavity of $x\mapsto x\log^{1+\varepsilon}(u/x)$ (see Lemma 25 in Section A.5). We can assume $\ell\geq 2$ without loss of generality. Combining the monotonicity of the covering number (see (9)) and a lower bound on the covering number in terms of the fat-shattering dimension (see Lemma 15 in Section A.2) yields

\displaystyle\log\mathcal{N}(G(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma/2)\geq C\operatorname{fat}_{\gamma}(G)=C\ell,

whence

\displaystyle\ell\leq CD\log^{1+\varepsilon}\frac{LR\ell k}{D\gamma}.

Using the elementary fact

\displaystyle x\leq A\operatorname{Log}^{1+\varepsilon}x\implies x\leq cA\operatorname{Log}^{1+\varepsilon}A,\qquad x,A\geq 1

(with $x=LR\ell k/D\gamma$ and $A=cLRk/\gamma$), we get

\displaystyle\ell\leq CD\operatorname{Log}^{1+\varepsilon}\frac{LRk}{\gamma},

which implies the claim.

4.3 Proof of Theorem 3

We use the notation and results from the Appendix, and in particular, from Section A.3.

Proof [of Theorem 3] A bound of this form for the $k$-fold maximum aggregation was claimed in Kontorovich (2018); however, the argument there was flawed (see Section 5).

Let $G:\mathbb{R}^{k}\rightarrow\mathbb{R}$ be an aggregation rule that is $L$-Lipschitz in the supremum norm, as defined in (5), and let $F_{1},\ldots,F_{k}$ be bounded affine function classes, as defined in (10). Suppose that some $\Omega_{\ell}=\left\{x_{1},\ldots,x_{\ell}\right\}\subset\Omega=B$ is $\gamma$-shattered by $G$, let $F_{i}(\Omega_{\ell})=\left.F_{i}\right|_{\Omega_{\ell}}$, and let $\mu_{\ell}$ be the uniform distribution on $\Omega_{\ell}$. We upper bound the covering number as in Lemma 20 (with $m=\ell$),

\displaystyle\log\mathcal{N}(F_{i}(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma)\leq C\frac{R_{i}^{2}}{\gamma^{2}}\operatorname{Log}\frac{\ell\gamma^{2}}{R_{i}^{2}},\qquad 0<\gamma<R_{i}.

Denote $v_{i}:=L^{2}R_{i}^{2}/\gamma^{2}$, and consider the $L_{\infty}$ covering number of $F_{i}(\Omega_{\ell})$ at scale $\gamma/L$:

\displaystyle\log\mathcal{N}(F_{i}(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma/L)\leq Cv_{i}\operatorname{Log}\frac{\ell}{v_{i}}.

Then Theorem 10 implies that

\displaystyle\log\mathcal{N}(G(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma/2)\leq\sum_{i=1}^{k}\log\mathcal{N}(F_{i}(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma/2L)
\leq C\sum_{i=1}^{k}v_{i}\operatorname{Log}\frac{\ell}{v_{i}}
\overset{\textup{(a)}}{\leq}CD\operatorname{Log}\frac{k\ell}{D},

where $D:=\sum_{i=1}^{k}v_{i}$ and (a) follows by the concavity of $x\mapsto x\log(u/x)$ (see Corollary 24 in Section A.5). Combining the monotonicity of the covering number (see (9)) and a lower bound on the covering number in terms of the fat-shattering dimension (see Lemma 15 in Section A.2) yields

\displaystyle\log\mathcal{N}(G(\Omega_{\ell}),L_{\infty}(\mu_{\ell}),\gamma/2)\geq C\operatorname{fat}_{\gamma}(G)=C\ell,

whence

\displaystyle\ell\leq CD\operatorname{Log}\frac{k\ell}{D}.

Using the elementary fact

\displaystyle x\leq A\operatorname{Log}x\implies x\leq cA\operatorname{Log}A,\qquad x,A\geq 1

(with $x=k\ell/D$ and $A=ck$) we get $\ell\leq cD\operatorname{Log}k$, which implies the claim.

The result can easily be generalized to hinge-loss affine classes. Let $F_{i}$ be an affine function class as in (10), define $F^{\prime}_{i}$ as the function class on $B\times\left\{-1,1\right\}$ given by $F^{\prime}_{i}=\left\{(x,y)\mapsto yf(x);f\in F_{i}\right\}$, and define the hinge-loss affine class $F_{i}^{\operatorname{Hinge}}$ as the function class on $B\times\left\{-1,1\right\}$ given by $F_{i}^{\operatorname{Hinge}}=\left\{(x,y)\mapsto\max\left\{0,1-f(x,y)\right\};f\in F^{\prime}_{i}\right\}$. One first observes that the restriction of $F^{\prime}_{i}$ to any $\left\{(x_{1},y_{1}),\ldots,(x_{n},y_{n})\right\}$, as a body in $\mathbb{R}^{n}$, is identical to the restriction of $F_{i}$ to $\left\{x_{1},\ldots,x_{n}\right\}$. Interpreting $F_{i}^{\operatorname{Hinge}}$ as a $2$-fold maximum over the singleton class $H=\left\{h\equiv 0\right\}$ and the bounded affine class $F^{\prime}_{i}$ lets us invoke Theorem 10 to argue that $F_{i}$ and $F_{i}^{\operatorname{Hinge}}$ have the same $L_{\infty}$ covering numbers. Hence, the argument we deployed here to establish (11) for affine classes also applies to $k$-fold $L$-Lipschitz aggregations of hinge-loss classes.

4.4 Proof of Corollary 5

Proof [of Corollary 5] Raviv et al. (2018, Theorem 7) upper-bounded the Rademacher complexity of the maximum aggregation of $k$ hinge-loss affine functions by $k/\sqrt{n}$.

For $R_{i}$-bounded affine functions or hinge-loss affine functions, the analysis above, combined with the calculation in Kontorovich (2018), yields a bound of $O\left(\sqrt{\frac{\operatorname{Log}(k)\log^{3}(n)\sum_{i=1}^{k}{R_{i}^{2}}}{n}}\right)$. For completeness, we provide the full proof.

Let $G_{\max}:\mathbb{R}^{k}\rightarrow\mathbb{R}$ be the $k$-fold maximum aggregation rule, as defined in (3), and let $F_{1},\ldots,F_{k}\subseteq\mathbb{R}^{\Omega}$ be $R_{i}$-bounded affine function classes as in (10) or hinge-loss affine function classes as in (12). Since this aggregation is $1$-Lipschitz in the supremum norm, Theorem 3 implies that

\displaystyle\operatorname{fat}_{\gamma}(G_{\max})\leq\frac{C\operatorname{Log}(k)}{\gamma^{2}}\sum_{i=1}^{k}{R_{i}^{2}},\qquad 0<\gamma<\min_{i\in[k]}R_{i},

where $C>0$ is a universal constant.

From fat-shattering to Rademacher.

The fat-shattering estimate above can be used to upper-bound the Rademacher complexity by converting the former to a covering number bound and plugging it into Dudley’s chaining integral (Dudley, 1967):

\displaystyle\mathcal{R}_{n}(F)\leq\inf_{\alpha\geq 0}\left(4\alpha+12\int_{\alpha}^{\infty}\sqrt{\frac{\log\mathcal{N}(t,F,\left\|\cdot\right\|_{2})}{n}}\,dt\right), (19)

where $\mathcal{N}(\cdot)$ are the $\ell_{2}$ covering numbers (see Section A.1).

It remains to bound the covering numbers. A simple way of doing so is to invoke Lemmas 2.6, 3.2, and 3.3 in Alon et al. (1997), but this incurs superfluous logarithmic factors in $n$. Instead, we use the sharper estimate of Mendelson and Vershynin (2003), stated here in Lemma 16. Putting $\tilde{R}=\max_{i}R_{i}$, the latter yields

\displaystyle\mathcal{R}_{n}(G_{\max})\leq\inf_{\alpha\geq 0}\left(4\alpha+12\int_{\alpha}^{1}\sqrt{\frac{\log\mathcal{N}(t,G_{\max},\left\|\cdot\right\|_{2})}{n}}\,dt\right)
\leq\inf_{\alpha\geq 0}\left(4\alpha+12c^{\prime}\int_{\alpha}^{1}\sqrt{\frac{\operatorname{fat}_{ct/\tilde{R}}(G_{\max})\log\frac{2\tilde{R}}{t}}{n}}\,dt\right)
\leq\inf_{\alpha\geq 0}\left(4\alpha+12c^{\prime\prime}\sqrt{\frac{\operatorname{Log}(k)\sum_{i=1}^{k}{R_{i}^{2}}}{n}}\int_{\alpha}^{1}\frac{\tilde{R}}{t}\sqrt{\log\frac{2\tilde{R}}{t}}\,dt\right).

Now

\displaystyle\int_{\alpha}^{1}\frac{\tilde{R}}{t}\sqrt{\log\frac{2\tilde{R}}{t}}\,dt=\frac{2\tilde{R}}{3}\left(\left(\log(2\tilde{R}/\alpha)\right)^{3/2}-(\log 2\tilde{R})^{3/2}\right)

and choosing $\alpha=1/\sqrt{n}$ yields

\displaystyle\mathcal{R}_{n}(G_{\max})\leq\frac{4}{\sqrt{n}}+12c^{\prime\prime}\sqrt{\frac{\operatorname{Log}(k)\sum_{i=1}^{k}{R_{i}^{2}}}{n}}\cdot\frac{2\tilde{R}}{3}\left(\left(\log(2\tilde{R}\sqrt{n})\right)^{3/2}-(\log 2\tilde{R})^{3/2}\right)
=O\left(\sqrt{\frac{\operatorname{Log}(k)\log^{3}(\tilde{R}n)\tilde{R}^{2}\sum_{i=1}^{k}{R_{i}^{2}}}{n}}\right).

 

4.5 Proof of Theorem 7

Proof [of Theorem 7] Let $G_{\max}:\mathbb{R}^{k}\rightarrow\mathbb{R}$ be the $k$-fold maximum aggregation rule, as defined in (3), and let $F_{1},\ldots,F_{k}\subseteq\mathbb{R}^{\Omega}$ be real-valued function classes. Note that $G_{\max}$ is an aggregation that commutes with shifts, as defined in (4).

By (8), there is an $r\in\mathbb{R}^{\Omega}$ such that $\operatorname{fat}_{\gamma}(G_{\max})=\operatorname{\textup{f\aa t}}_{\gamma}(G_{\max}-r)$. As in (16), put $F^{\prime}_{i}:=F_{i}-r$ and $G^{\prime}_{\max}:=G_{\max}-r=G_{\max}(F^{\prime}_{1},\ldots,F^{\prime}_{k})$. Define $\bar{G}_{\max}=\operatorname{sign}(G^{\prime}_{\max})$ and $\bar{F}_{i}=\operatorname{sign}(F_{i}^{\prime})$.

Since $\operatorname{sign}$ and $\max$ commute, we have $\bar{G}_{\max}=G_{\max}(\bar{F}_{1},\ldots,\bar{F}_{k})$. We claim that

\displaystyle\operatorname{\textup{f\aa t}}_{\gamma}(G_{\max}^{\prime})\leq\operatorname{vc}(\bar{G}_{\max}). (20)

Indeed, any $S\subset\Omega$ that is $\gamma$-shattered with shift $r=0$ by any $G\subset\mathbb{R}^{\Omega}$ is also VC-shattered by $\operatorname{sign}(G)$. (See Section 4.1, and notice that the converse implication, and the reverse inequality, do not hold.) It holds that

\displaystyle d+1\overset{(a)}{=}\operatorname{vc}(\bar{F}_{i})\overset{(b)}{=}\operatorname{\textup{f\aa t}}_{\gamma}(F_{i})\overset{(c)}{=}\operatorname{fat}_{\gamma}(F_{i})\overset{(d)}{=}\operatorname{fat}_{\gamma}(F_{i}^{\prime}),

where (a) follows from a standard argument (e.g., Mohri et al. (2012, Example 3.2)), (b) holds because any $S\subset\mathbb{R}^{d}$ that is VC-shattered by $\operatorname{sign}(F^{\prime}_{i})$ is also $\gamma$-shattered by $F^{\prime}_{i}$ with shift $r=0$, (c) follows from Lemma 21, since the class is closed under scalar multiplication, and (d) holds since the shattering remains the same for the shifted class.

Now the argument of Blumer et al. (1989, Lemma 3.2.3) applies:

\displaystyle\operatorname{vc}(\bar{G}_{\max})\leq 2(d+1)k\log(3k) (21)

(this holds for any $k$-fold aggregation function, not just the maximum). Combining (20) with (21) proves the claim.

4.6 Proof of Theorem 8

Proof [of Theorem 8] It follows from Mohri et al. (2012, Example 3.2) that $\operatorname{vc}(\operatorname{sign}(F_{i}))=d+1$. Since $F_{i}$ is closed under scalar multiplication, a scaling argument shows that any $S\subset\mathbb{R}^{d}$ that is VC-shattered by $\operatorname{sign}(F_{i})$ is also $\gamma$-shattered by $F_{i}$ with shift $r=0$, whence $\operatorname{\textup{f\aa t}}_{\gamma}(F_{i})=d+1$ for all $\gamma>0$; invoking Lemma 21 extends this to $\operatorname{fat}_{\gamma}(F_{i})$ as well. Now Csikós et al. (2019, Theorem 1) shows that $k$-fold unions of half-spaces necessarily shatter some set $S\subset\mathbb{R}^{d}$ of size at least $cdk\log k$. Since union is a special case of the max operator, and the latter commutes with $\operatorname{sign}$, the scaling argument shows that this $S$ is $\gamma$-shattered by $G_{\max}$ with shift $r=0$. Hence, $\operatorname{fat}_{\gamma}(G_{\max})\geq\operatorname{\textup{f\aa t}}_{\gamma}(G_{\max})\geq|S|$, which proves the claim.

5 Discussion

In this paper, we proved upper and lower bounds on the fat-shattering dimension of aggregation rules as a function of the fat-shattering dimensions of the component classes. We leave some remaining gaps for future work. First, for aggregation rules that commute with shifts, assuming $\operatorname{fat}_{\gamma}(F_{i})\leq d$ for $1\leq i\leq k$, we show in Theorem 1 that

\displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq Cdk\log^{2}\left(dk\right),\qquad\gamma>0,

where $C>0$ is a universal constant. We pose the following

Open problem.

Let $G$ be an aggregation rule that commutes with shifts. Is it the case that for all $F_{i}\subseteq\mathbb{R}^{\Omega}$ with $\operatorname{fat}_{\gamma}(F_{i})\leq d$, $i\in[k]$, we have

\displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\leq Cdk\log\left(k\right),\qquad\gamma>0,

for some universal constant $C>0$?

In light of Theorem 8, this is the best one could hope for in general. We pose also the following conjecture about bounded affine functions.

Conjecture 11

Theorem 3 is tight up to constants. For $R_{i}$-bounded affine functions and an aggregation rule $G$ that is $1$-Lipschitz in the supremum norm,

\displaystyle\operatorname{fat}_{\gamma}(G(F_{1},\ldots,F_{k}))\geq\frac{C\operatorname{Log}(k)}{\gamma^{2}}\sum_{i=1}^{k}{R_{i}^{2}},\qquad 0<\gamma<\min_{i\in[k]}R_{i}, (22)

where $C>0$ is a universal constant.

Throughout the paper, we mentioned several mistaken claims in the literature. In this section, we briefly discuss the nature of these mistakes, which are, in a sense, variations on the same kind of error. We begin with Attias et al. (2019, Lemma 14), which incorrectly claimed that any partial function class $F^{\star}$ has a disambiguation $\bar{F}$ such that $\operatorname{vc}(\bar{F})\leq\operatorname{vc}(F^{\star})$ (see Section 4.1 for the definitions). The mistake was pointed out to us by Yann Guermeur, and later, Alon et al. (2022, Theorem 11) showed that there exist partial classes $F^{\star}$ with $\operatorname{vc}(F^{\star})=1$ for which every disambiguation $\bar{F}$ has $\operatorname{vc}(\bar{F})=\infty$.

Kontorovich (2018) attempted to prove the bound stated in our Theorem 3 (up to constants, and only for linear classes). The argument proceeded via a reduction to the Boolean case, as in our proof of Theorem 7. It was correctly observed that if, say, some finite $S\subset\Omega$ is $1$-shattered by $F_{i}$ with shift $r=0$, then it is also VC-shattered by $\operatorname{sign}(F_{i})$. Neglected was the fact that $\operatorname{sign}(F_{i})$ might shatter additional points in $\Omega\setminus S$, and, in sufficiently high dimension, it necessarily will. The crux of the matter is that (20) holds in the dimension-dependent but not the dimension-free setting; again, this may be seen as a variant of the disambiguation mistake.

Finally, the proof of Hanneke and Kontorovich (2019, Lemma 6) claims, in the first display, that the shattered set can be classified with large margin, which is incorrect; this is yet another variant of mistaken disambiguation.


Acknowledgments

We thank Steve Hanneke and Ramon van Handel for very helpful discussions; the latter, in particular, patiently explained to us how to prove Lemma 20. Roman Vershynin kindly gave us permission to share his example in Remark 17. This research was partially supported by the Israel Science Foundation (grant No. 1602/19), an Amazon Research Award, the Ben-Gurion University Data Science Research Center, and Cyber Security Research Center, Prime Minister’s Office.

References

  • Alon et al. (1997) Noga Alon, Shai Ben-David, Nicolò Cesa-Bianchi, and David Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615–631, 1997.
  • Alon et al. (2020) Noga Alon, Amos Beimel, Shay Moran, and Uri Stemmer. Closure properties for private classification and online prediction. In Conference on Learning Theory, pages 119–152. PMLR, 2020.
  • Alon et al. (2022) Noga Alon, Steve Hanneke, Ron Holzman, and Shay Moran. A theory of PAC learnability of partial concept classes. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 658–671, 2022.
  • Appert and Catoni (2021) Gautier Appert and Olivier Catoni. New bounds for $k$-means and information $k$-means. arXiv preprint arXiv:2101.05728, 2021.
  • Artstein et al. (2004) S. Artstein, V. Milman, and S. J. Szarek. Duality of metric entropy. Annals of Mathematics, 159(3):1313–1328, 2004.
  • Attias and Hanneke (2023) Idan Attias and Steve Hanneke. Adversarially robust PAC learnability of real-valued functions. In Proceedings of the 40th International Conference on Machine Learning, volume 202, pages 1172–1199, 2023.
  • Attias et al. (2019) Idan Attias, Aryeh Kontorovich, and Yishay Mansour. Improved generalization bounds for robust learning. In Algorithmic Learning Theory, ALT 2019, volume 98 of Proceedings of Machine Learning Research, pages 162–183. PMLR, 2019.
  • Attias et al. (2022) Idan Attias, Aryeh Kontorovich, and Yishay Mansour. Improved generalization bounds for adversarially robust learning. The Journal of Machine Learning Research, 23(1):7897–7927, 2022.
  • Bartlett and Shawe-Taylor (1999) Peter Bartlett and John Shawe-Taylor. Generalization performance of support vector machines and other pattern classifiers, pages 43–54. MIT Press, Cambridge, MA, USA, 1999. ISBN 0-262-19416-3.
  • Bartlett and Long (1998) Peter L. Bartlett and Philip M. Long. Prediction, learning, uniform convergence, and scale-sensitive dimensions. J. Comput. Syst. Sci., 56(2):174–190, 1998.
  • Baum and Haussler (1989) Eric B. Baum and David Haussler. What size net gives valid generalization? Neural Comput., 1(1):151–160, 1989.
  • Biau et al. (2008) Gérard Biau, Luc Devroye, and Gábor Lugosi. On the performance of clustering in hilbert spaces. IEEE Transactions on Information Theory, 54(2):781–790, 2008.
  • Blumer et al. (1989) Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. J. Assoc. Comput. Mach., 36(4):929–965, 1989.
  • Breiman (1996) Leo Breiman. Bagging predictors. Machine learning, 24:123–140, 1996.
  • Breiman (2001) Leo Breiman. Random forests. Machine learning, 45:5–32, 2001.
  • Csikós et al. (2019) Mónika Csikós, Nabil H. Mustafa, and Andrey Kupavskii. Tight lower bounds on the vc-dimension of geometric set systems. J. Mach. Learn. Res., 20:81:1–81:8, 2019.
  • Duan (2012) Hubert Haoyang Duan. Bounding the Fat Shattering Dimension of a Composition Function Class Built Using a Continuous Logic Connective. PhD thesis, University of Waterloo, 2012.
  • Dudley (1967) Richard M Dudley. The sizes of compact subsets of hilbert space and continuity of gaussian processes. Journal of Functional Analysis, 1(3):290–330, 1967.
  • Eisenstat (2009) David Eisenstat. k-fold unions of low-dimensional concept classes. Inf. Process. Lett., 109(23-24):1232–1234, 2009.
  • Eisenstat and Angluin (2007) David Eisenstat and Dana Angluin. The VC dimension of k-fold union. Inf. Process. Lett., 101(5):181–184, 2007.
  • Fefferman et al. (2016) Charles Fefferman, Sanjoy Mitter, and Hariharan Narayanan. Testing the manifold hypothesis. Journal of the American Mathematical Society, 29(4):983–1049, 2016.
  • Foster and Rakhlin (2019) Dylan J. Foster and Alexander Rakhlin. \ell_{\infty} vector contraction for rademacher complexity. CoRR, abs/1911.06468, 2019.
  • Freund and Schapire (1997) Yoav Freund and Robert E Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.
  • Ghazi et al. (2021) Badih Ghazi, Noah Golowich, Ravi Kumar, and Pasin Manurangsi. Near-tight closure bounds for the littlestone and threshold dimensions. In Algorithmic Learning Theory, pages 686–696. PMLR, 2021.
  • Gottlieb et al. (2014) Lee-Ad Gottlieb, Aryeh Kontorovich, and Robert Krauthgamer. Efficient classification for metric data (extended abstract: COLT 2010). IEEE Transactions on Information Theory, 60(9):5750–5759, 2014.
  • Gottlieb et al. (2018) Lee-Ad Gottlieb, Eran Kaufman, Aryeh Kontorovich, and Gabriel Nivasch. Learning convex polytopes with margin. In Neural Information Processing Systems (NIPS), 2018.
  • Hanneke and Kontorovich (2019) Steve Hanneke and Aryeh Kontorovich. Optimality of SVM: novel proofs and tighter bounds. Theor. Comput. Sci., 796:99–113, 2019.
  • Kearns and Schapire (1994) Michael J. Kearns and Robert E. Schapire. Efficient distribution-free learning of probabilistic concepts. J. Comput. Syst. Sci., 48(3):464–497, 1994.
  • Kégl (2003) Balázs Kégl. Robust regression by boosting the median. In Learning Theory and Kernel Machines: 16th Annual Conference on Learning Theory and 7th Kernel Workshop, COLT/Kernel 2003, Washington, DC, USA, August 24-27, 2003. Proceedings, pages 258–272. Springer, 2003.
  • Klochkov et al. (2021) Yegor Klochkov, Alexey Kroshnin, and Nikita Zhivotovskiy. Robust k-means clustering for distributions with two moments. The Annals of Statistics, 49(4):2206–2230, 2021.
  • Kontorovich (2018) Aryeh Kontorovich. Rademacher complexity of kk-fold maxima of hyperplanes. 2018.
  • Mendelson and Vershynin (2003) S. Mendelson and R. Vershynin. Entropy and the combinatorial dimension. Invent. Math., 152(1):37–55, 2003.
  • Mohri et al. (2012) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. The MIT Press, 2012.
  • Raviv et al. (2018) Dolev Raviv, Tamir Hazan, and Margarita Osadchy. Hinge-minimax learner for the ensemble of hyperplanes. J. Mach. Learn. Res., 19:62:1–62:30, 2018.
  • Rudelson and Vershynin (2006) M. Rudelson and R. Vershynin. Combinatorics of random processes and sections of convex bodies. Annals of Mathematics, 164(2):603–648, 2006.
  • Srebro et al. (2010) Nathan Srebro, Karthik Sridharan, and Ambuj Tewari. Smoothness, low noise and fast rates. Advances in neural information processing systems, 23, 2010.
  • Talagrand (2003) Michel Talagrand. Vapnik–Chervonenkis type conditions and uniform Donsker classes of functions. The Annals of Probability, 31(3):1565–1582, 2003.
  • Vershynin (2018) Roman Vershynin. High-dimensional probability, volume 47 of Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, Cambridge, 2018.
  • Vershynin (2021) Roman Vershynin, 2021. Private communication.
  • Zhang (2002) Tong Zhang. Covering number bounds of certain regularized linear function classes. The Journal of Machine Learning Research, 2:527–550, 2002.
  • Zhivotovskiy (2022) Nikita Zhivotovskiy. A bound for kk-fold maximum. 2022.

A Auxiliary results

A.1 Covering numbers and Lipschitz Aggregations

Lemma 12

If G:kG:\mathbb{R}^{k}\rightarrow\mathbb{R} is LL-Lipschitz under p\left\|\cdot\right\|_{p}, then G:(Ω)kΩG:(\mathbb{R}^{\Omega})^{k}\rightarrow\mathbb{R}^{\Omega} is LL-Lipschitz in Lp(k)(μ)\left\|\cdot\right\|_{L_{p}^{(k)}(\mu)}.

Proof 

G(f1,,fk)G(f1,,fk)Lp(μ)p\displaystyle\left\|G(f_{1},\ldots,f_{k})-G(f^{\prime}_{1},\ldots,f^{\prime}_{k})\right\|_{L_{p}(\mu)}^{p} =Ω|G(f1,,fk)(x)G(f1,,fk)(x)|pdμ(x)\displaystyle=\int_{\Omega}|G(f_{1},\ldots,f_{k})(x)-G(f^{\prime}_{1},\ldots,f^{\prime}_{k})(x)|^{p}\mathrm{d}\mu(x)
=Ω|G(f1(x),,fk(x))G(f1(x),,fk(x))|pdμ(x)\displaystyle=\int_{\Omega}|G(f_{1}(x),\ldots,f_{k}(x))-G(f^{\prime}_{1}(x),\ldots,f^{\prime}_{k}(x))|^{p}\mathrm{d}\mu(x)
ΩLp(f1(x),,fk(x))(f1(x),,fk(x))ppdμ(x),\displaystyle\leq\int_{\Omega}L^{p}\left\|(f_{1}(x),\ldots,f_{k}(x))-(f^{\prime}_{1}(x),\ldots,f^{\prime}_{k}(x))\right\|_{p}^{p}\mathrm{d}\mu(x),

where the inequality follows from the assumption that G:kG:\mathbb{R}^{k}\rightarrow\mathbb{R} is LL-Lipschitz in p\left\|\cdot\right\|_{p}. This proves

G(f1,,fk)G(f1,,fk)Lp(μ)L(f1,,fk)(f1,,fk)Lp(k)(μ),\displaystyle\left\|G(f_{1},\ldots,f_{k})-G(f^{\prime}_{1},\ldots,f^{\prime}_{k})\right\|_{L_{p}(\mu)}\leq L\left\|(f_{1},\ldots,f_{k})-(f^{\prime}_{1},\ldots,f^{\prime}_{k})\right\|_{L_{p}^{(k)}(\mu)},

and hence the claim.  

Proof [of Theorem 10] Suppose p<\infty, and let g=G(f_{1},\ldots,f_{k})\in G(F_{1},\ldots,F_{k}). For each i\in[k], let \hat{F}_{i}\subset F_{i} be a t/(Lk^{1/p})-cover of F_{i}, and let each f_{i} be t/(Lk^{1/p})-covered by some \hat{f}_{i}\in\hat{F}_{i}, in the sense that \left\|f_{i}-\hat{f}_{i}\right\|_{L_{p}(\mu)}\leq t/(Lk^{1/p}). Since G:\mathbb{R}^{k}\rightarrow\mathbb{R} is assumed to be L-Lipschitz in \left\|\cdot\right\|_{p}, Lemma 12 implies that G:(\mathbb{R}^{\Omega})^{k}\rightarrow\mathbb{R}^{\Omega} is L-Lipschitz in \left\|\cdot\right\|_{L_{p}^{(k)}(\mu)}. It then follows that g is t-covered by G(\hat{f}_{1},\ldots,\hat{f}_{k}), since

\left\|G(f_{1},\ldots,f_{k})-G(\hat{f}_{1},\ldots,\hat{f}_{k})\right\|_{L_{p}(\mu)}^{p} \leq L^{p}\left\|(f_{1},\ldots,f_{k})-(\hat{f}_{1},\ldots,\hat{f}_{k})\right\|_{L_{p}^{(k)}(\mu)}^{p}
= L^{p}\int_{\Omega}\left\|(f_{1}(x),\ldots,f_{k}(x))-(\hat{f}_{1}(x),\ldots,\hat{f}_{k}(x))\right\|_{p}^{p}\,\mathrm{d}\mu(x)
= L^{p}\int_{\Omega}\sum^{k}_{i=1}\left|f_{i}(x)-\hat{f}_{i}(x)\right|^{p}\,\mathrm{d}\mu(x)
= L^{p}\sum^{k}_{i=1}\int_{\Omega}\left|f_{i}(x)-\hat{f}_{i}(x)\right|^{p}\,\mathrm{d}\mu(x)
= L^{p}\sum^{k}_{i=1}\left\|f_{i}-\hat{f}_{i}\right\|_{L_{p}(\mu)}^{p}
\leq L^{p}k\left(\frac{t}{Lk^{1/p}}\right)^{p} = t^{p},

and so \left\|G(f_{1},\ldots,f_{k})-G(\hat{f}_{1},\ldots,\hat{f}_{k})\right\|_{L_{p}(\mu)}\leq t.

We conclude that G(F1,,Fk)G(F_{1},\ldots,F_{k}) has a tt-cover of size |F^1×F^2××F^k||\hat{F}_{1}\times\hat{F}_{2}\times\ldots\times\hat{F}_{k}|, which proves the claim. The case p=p=\infty is proved analogously (or, alternatively, as a limiting case of p<p<\infty).  
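Indeed, the case p=\infty can be spelled out directly (with k^{1/p}=1): take each \hat{F}_{i} to be a (t/L)-cover of F_{i} under L_{\infty}(\mu). Since G is L-Lipschitz in \left\|\cdot\right\|_{\infty}, for \mu-almost every x,

\left|G(f_{1}(x),\ldots,f_{k}(x))-G(\hat{f}_{1}(x),\ldots,\hat{f}_{k}(x))\right| \leq L\max_{i\in[k]}|f_{i}(x)-\hat{f}_{i}(x)| \leq L\max_{i\in[k]}\left\|f_{i}-\hat{f}_{i}\right\|_{L_{\infty}(\mu)} \leq L\cdot\frac{t}{L}=t,

so G(f_{1},\ldots,f_{k}) is t-covered by G(\hat{f}_{1},\ldots,\hat{f}_{k}) in L_{\infty}(\mu), and the same counting argument applies.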
We show that natural aggregations are Lipschitz in p\left\|\cdot\right\|_{p} norms, p[1,)p\in[1,\infty), and in supremum norm. The following facts are elementary:

|abcd|\displaystyle|a\vee b-c\vee d| \displaystyle\leq |ac||bd|,a,b,c,d;\displaystyle|a-c|\vee|b-d|,\qquad a,b,c,d\in\mathbb{R}; (23)
|abcd|\displaystyle|a\wedge b-c\wedge d| \displaystyle\leq |ac||bd|,a,b,c,d,\displaystyle|a-c|\vee|b-d|,\qquad a,b,c,d\in\mathbb{R}, (24)

where st:=max{s,t}s\vee t:=\max\left\{s,t\right\} and st:=min{s,t}s\wedge t:=\min\left\{s,t\right\}.

Lemma 13 (Maximum aggregation is 1-Lipschitz)

Let G_{\max}:\mathbb{R}^{k}\rightarrow\mathbb{R} be the maximum aggregation. Then, for any x,x^{\prime}\in\mathbb{R}^{k} and p\in[1,\infty],

\left|G_{\max}(x)-G_{\max}(x^{\prime})\right|\leq\left\|x-x^{\prime}\right\|_{p}.

Proof  For k=2 and p=\infty, the claim is precisely the pointwise inequality (23); the general case follows by induction on k (the induction step is displayed below). Since \left\|\cdot\right\|_{\infty}\leq\left\|\cdot\right\|_{p}, the claim extends to all p\in[1,\infty].
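For completeness, the induction step reads as follows: using (23),

\left|\max_{i\in[k]}x_{i}-\max_{i\in[k]}x^{\prime}_{i}\right| = \left|\Big(\max_{i\in[k-1]}x_{i}\Big)\vee x_{k}-\Big(\max_{i\in[k-1]}x^{\prime}_{i}\Big)\vee x^{\prime}_{k}\right| \leq \left|\max_{i\in[k-1]}x_{i}-\max_{i\in[k-1]}x^{\prime}_{i}\right|\vee\left|x_{k}-x^{\prime}_{k}\right| \leq \max_{i\in[k]}\left|x_{i}-x^{\prime}_{i}\right| = \left\|x-x^{\prime}\right\|_{\infty}.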

Lemma 14 (Max-Min aggregation is 1-Lipschitz)

If G_{\operatorname{max-min}}:\mathbb{R}^{k\times\ell}\rightarrow\mathbb{R} is the max-min aggregation x\mapsto\max_{i\in[k]}\min_{j\in[\ell]}x_{ij}, then for any x,x^{\prime}\in\mathbb{R}^{k\times\ell} and p\in[1,\infty],

\left|G_{\operatorname{max-min}}(x)-G_{\operatorname{max-min}}(x^{\prime})\right|\leq\left\|x-x^{\prime}\right\|_{p}.

Proof  The inequalities (23), (24) imply that the kk-fold max and min aggregations are both 11-Lipschitz with respect to \left\|\cdot\right\|_{\infty}. Hence, for all x,yk×x,y\in\mathbb{R}^{k\times\ell}, we have

\left|\min_{j\in[\ell]}x_{ij}-\min_{j\in[\ell]}y_{ij}\right|\leq\max_{j\in[\ell]}\left|x_{ij}-y_{ij}\right|,\qquad i\in[k]

and further,

\left|\max_{i\in[k]}\min_{j\in[\ell]}x_{ij}-\max_{i\in[k]}\min_{j\in[\ell]}y_{ij}\right|\leq\max_{i\in[k]}\max_{j\in[\ell]}\left|x_{ij}-y_{ij}\right|.

This proves that \left|G_{\operatorname{max-min}}(x)-G_{\operatorname{max-min}}(y)\right|\leq\left\|x-y\right\|_{\infty}. Since \left\|\cdot\right\|_{\infty}\leq\left\|\cdot\right\|_{p}, the claim holds for all p\in[1,\infty].
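For completeness, similar one-line checks cover two further natural aggregations mentioned in the introduction. For the k-fold mean, for any x,x^{\prime}\in\mathbb{R}^{k} and p\in[1,\infty],

\left|\frac{1}{k}\sum_{i=1}^{k}x_{i}-\frac{1}{k}\sum_{i=1}^{k}x^{\prime}_{i}\right| \leq \frac{1}{k}\sum_{i=1}^{k}\left|x_{i}-x^{\prime}_{i}\right| \leq \left\|x-x^{\prime}\right\|_{\infty} \leq \left\|x-x^{\prime}\right\|_{p},

and for the k-fold median (or any order statistic): if every coordinate of x is within \left\|x-x^{\prime}\right\|_{\infty} of the corresponding coordinate of x^{\prime}, then so is every order statistic, whence the same bound holds.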

A.2 Covering numbers and the fat-shattering dimension

In this section, we summarize some known results connecting the covering numbers of a bounded function class to its fat-shattering dimension.

Lemma 15 (Talagrand (2003), Proposition 1.4)

For any F[R,R]ΩF\subseteq[-R,R]^{\Omega}, there exists a probability measure μ\mu on Ω\Omega such that

𝒩(F,L2(μ),t)\displaystyle\mathcal{N}(F,L_{2}(\mu),t) \displaystyle\geq 2Cfat2t(F),0<t<R,\displaystyle 2^{C\operatorname{fat}_{2t}(F)},\qquad 0<t<R, (25)

where C>0 is a universal constant. Moreover, \mu may be taken to be the uniform distribution on any 2t-shattered subset of \Omega.

Remark. The tightness of (25) is trivially demonstrated by the example F={γ,γ}nF=\left\{-\gamma,\gamma\right\}^{n}.
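To spell the example out (with \Omega=[n] and, as in Lemma 15, \mu uniform on \Omega): every sign pattern y\in\left\{-1,1\right\}^{n} is realized by the function f=\gamma y\in F with witness r\equiv 0, so for all 0<2t\leq\gamma,

\operatorname{fat}_{2t}(F)=n \qquad\text{while}\qquad \mathcal{N}(F,L_{2}(\mu),t)\leq|F|=2^{n}=2^{\operatorname{fat}_{2t}(F)},

so the constant C in (25) cannot be taken larger than 1 for this class.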

Lemma 16 (Mendelson and Vershynin (2003), Theorem 1)

For all F[1,1]ΩF\subseteq[-1,1]^{\Omega} and all probability measures μ\mu,

𝒩(F,L2(μ),t)\displaystyle\mathcal{N}(F,L_{2}(\mu),t) \displaystyle\leq (2t)Cfatct(F),0<t<1,\displaystyle\left(\frac{2}{t}\right)^{C\operatorname{fat}_{ct}(F)},\qquad 0<t<1, (26)

where C,c>0C,c>0 are universal constants.

Remark 17

The following example due to Vershynin (2021) shows that (26) is tight. Take Ω=[n]\Omega=[n] and F=[1,1]ΩF=[-1,1]^{\Omega}. Then, for all sufficiently small t>0t>0, we have fatt(F)=n\operatorname{fat}_{t}(F)=n. However, a simple volumetric calculation shows that 𝒩(F,t)\mathcal{N}(F,t) behaves as (C/t)n(C/t)^{n} for small tt, where C>0C>0 is a constant.
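A sketch of that calculation, with \mu uniform on [n] (so that \left\|\cdot\right\|_{L_{2}(\mu)}=\left\|\cdot\right\|_{2}/\sqrt{n}): a t-ball in L_{2}(\mu) is a Euclidean ball of radius t\sqrt{n}, whose volume is of order (ct)^{n} for a universal constant c, while \operatorname{vol}([-1,1]^{n})=2^{n}. Hence, by the standard volumetric lower bound,

\mathcal{N}(F,L_{2}(\mu),t) \geq \frac{\operatorname{vol}([-1,1]^{n})}{\operatorname{vol}(t\sqrt{n}\,B_{2}^{n})} \geq \left(\frac{C}{t}\right)^{n},

which matches the form of the upper bound (26), since \operatorname{fat}_{ct}(F)=n for sufficiently small t.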

Lemma 18 (Rudelson and Vershynin (2006))

Suppose that p[2,)p\in[2,\infty), μ\mu is a probability measure on Ω\Omega, and R>0R>0. If FLp(Ω,μ)F\subset L_{p}(\Omega,\mu) satisfies supfFfL2p(μ)R\sup_{f\in F}\left\|f\right\|_{L_{2p}(\mu)}\leq R, then

log𝒩(F,Lp(μ),t)\displaystyle\log\mathcal{N}(F,L_{p}(\mu),t) \displaystyle\leq Cp2fatct(F)logRct,0<t<R;\displaystyle Cp^{2}\operatorname{fat}_{ct}(F)\log\frac{R}{ct},\qquad 0<t<R;

furthermore, for all ε>0\varepsilon>0, if supfFfL(μ)R\sup_{f\in F}\left\|f\right\|_{L_{\infty}(\mu)}\leq R, then

log𝒩(F,L(μ),t)\displaystyle\log\mathcal{N}(F,L_{\infty}(\mu),t) \displaystyle\leq Cvlog(Rn/vt)logε(n/v),0<t<R,\displaystyle Cv\log(Rn/vt)\log^{\varepsilon}(n/v),\qquad 0<t<R,

where n=|Ω|n=|\Omega|, v=fatcεt(F)v=\operatorname{fat}_{c\varepsilon t}(F), and C,c>0C,c>0 are universal constants.

A.3 Covering numbers of linear and affine classes

Let BdB\subset\mathbb{R}^{d} be the dd-dimensional Euclidean unit ball and

F={xwx+b;w|b|R}\displaystyle F=\left\{x\mapsto w\cdot x+b;\left\|w\right\|\vee|b|\leq R\right\}

be the collection of RR-bounded affine functions on Ω=B\Omega=B.

Remark 19

There is a trivial reduction from an RR-bounded affine class in dd dimensions to a 2R2R-bounded linear class in d+1d+1 dimensions, via the standard trick of adding an extra dummy dimension. This only affects the covering number bounds up to constants.
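One convenient normalization of this reduction (a sketch; the exact constants are immaterial) maps each x\in B\subset\mathbb{R}^{d} to \tilde{x}=(x,1)/\sqrt{2}, which lies in the unit ball of \mathbb{R}^{d+1}, and each pair (w,b) with \left\|w\right\|\vee|b|\leq R to \tilde{w}=\sqrt{2}(w,b). Then

\tilde{w}\cdot\tilde{x}=w\cdot x+b \qquad\text{and}\qquad \left\|\tilde{w}\right\|=\sqrt{2}\sqrt{\left\|w\right\|^{2}+b^{2}}\leq 2R,

so the R-bounded affine class on B embeds, as a function class, into a 2R-bounded linear class on the unit ball of \mathbb{R}^{d+1}.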

For ΩnB\Omega_{n}\subset B, |Ωn|=n|\Omega_{n}|=n, define F(Ωn)=F|ΩnF(\Omega_{n})=\left.F\right|_{\Omega_{n}}, and endow Ωn\Omega_{n} with the uniform measure μn\mu_{n}. Zhang (2002, Theorem 4) implies the covering number estimate

log𝒩(F(Ωn),L(μn),t)\displaystyle\log\mathcal{N}(F(\Omega_{n}),L_{\infty}(\mu_{n}),t) \displaystyle\leq CR2t2LognRt,t>0,\displaystyle C\frac{R^{2}}{t^{2}}\operatorname{Log}\frac{nR}{t},\qquad t>0,

where C>0 is a universal constant (Zhang's result is more general and allows one to compute explicit constants). We will use the following sharper bound:

Lemma 20
log𝒩(F(Ωn),L(μn),t)\displaystyle\log\mathcal{N}(F(\Omega_{n}),L_{\infty}(\mu_{n}),t) \displaystyle\leq CR2t2Logmt2R2,0<t<R,\displaystyle C\frac{R^{2}}{t^{2}}\operatorname{Log}\frac{mt^{2}}{R^{2}},\qquad 0<t<R,

where m=min{n,d}m=\min\left\{n,d\right\} and C>0C>0 is a universal constant.

Proof  The result is folklore, but we provide a proof for completeness.

Write X=\Omega_{n}=\left\{x_{1},\ldots,x_{n}\right\}. We argue that there is no loss of generality in assuming d\geq n. Indeed, if n>d, then X lies in the span of some subset X^{\prime}=\left\{x_{1}^{\prime},\ldots,x_{d}^{\prime}\right\}\subset B, and the restricted class F(\Omega_{n}) is likewise an (at most) d-dimensional set. Thus, we assume d\geq n henceforth, so that m=n. Via a standard infinitesimal perturbation, we may assume that X is a linearly independent set (equivalently, that X, viewed as an m\times d matrix, has rank m). Normalizing R=1 (the general case follows by replacing t with t/R), we then have F(\Omega_{n})=XB, which means that F(\Omega_{n}) is an ellipsoid. We are interested in its \ell_{\infty} covering numbers.

Let K\subset\mathbb{R}^{d} be such that XK=L, where L=B_{\infty}^{m} is the unit cube. (One may take K to be the full pre-image X^{-1}(L): since X has rank m, every point of [-1,1]^{m} has a pre-image under X, so indeed XK=L.) Let us compute the polar body K^{\circ}, defined as

K={ud:supvKvu1}.\displaystyle K^{\circ}=\left\{u\in\mathbb{R}^{d}:\sup_{v\in K}v\cdot u\leq 1\right\}.

We claim that

K=absconv(X)=:{i=1mαixi;|αi|1}.\displaystyle K^{\circ}=\operatorname{absconv}(X)=:\left\{\sum_{i=1}^{m}\alpha_{i}x_{i};\sum\left|\alpha_{i}\right|\leq 1\right\}.

Indeed, consider a z=i=1mαixiabsconv(X)z=\sum_{i=1}^{m}\alpha_{i}x_{i}\in\operatorname{absconv}(X). Then, for any vKv\in K, we have

vz\displaystyle v\cdot z =\displaystyle= vi=1mαixi\displaystyle v\cdot\sum_{i=1}^{m}\alpha_{i}x_{i}
= \sum_{i=1}^{m}\alpha_{i}(v\cdot x_{i}) \leq \sum_{i=1}^{m}|\alpha_{i}| \leq 1 \qquad\implies\qquad z\in K^{\circ},

where we have used |v\cdot x_{i}|\leq 1 (since XK=L=B_{\infty}^{m}=[-1,1]^{m}) and Hölder's inequality. This shows that \operatorname{absconv}(X)\subseteq K^{\circ}. For the reverse inclusion, consider any u\in K^{\circ}. There is no loss of generality in assuming that u lies in the span of X, that is, u=\sum_{i=1}^{m}\alpha_{i}x_{i} for some \alpha_{i}\in\mathbb{R}. By the definition of K^{\circ}, we have

supvKvu\displaystyle\sup_{v\in K}v\cdot u =\displaystyle= supvKvi=1mαixi=supvKi=1mαi(vxi)1.\displaystyle\sup_{v\in K}v\cdot\sum_{i=1}^{m}\alpha_{i}x_{i}=\sup_{v\in K}\sum_{i=1}^{m}\alpha_{i}(v\cdot x_{i})\leq 1.

Now because XK=[-1,1]^{m}, for each choice of \alpha\in\mathbb{R}^{m} there is a v\in K such that v\cdot x_{i}=\operatorname{sign}(\alpha_{i}) for all i\in[m]; for this v, \sum_{i=1}^{m}\alpha_{i}(v\cdot x_{i})=\sum_{i=1}^{m}|\alpha_{i}|, and so we must have \sum_{i=1}^{m}|\alpha_{i}|\leq 1. This proves K^{\circ}\subseteq\operatorname{absconv}(X).

It is well-known (and easy to verify) that covering numbers enjoy an affine invariance:

N(F,L):=N(XB,XK)=N(B,K),\displaystyle N(F,L):=N(XB,XK)=N(B,K),

where N(A,B)N(A,B), for two sets A,BA,B, is the smallest number of copies of BB necessary to cover AA. Now the seminal result of Artstein et al. (2004) applies: for all t>0t>0,

logN(B,tK)alogN(K,btB),\displaystyle\log N(B,tK)\leq a\log N(K^{\circ},btB),

where a,b>0a,b>0 are universal constants.

This reduces the problem to estimating the 2\ell_{2}-covering numbers of absconv(X)\operatorname{absconv}(X). The latter may be achieved via Maurey’s method (Vershynin, 2018, Corollary 0.0.4 and Exercise 0.0.6): the tt-covering number of absconv(rX)\operatorname{absconv}(rX) under 2\ell_{2} is at most

(c+cmt2/r2)r2/t2,\displaystyle(c+cmt^{2}/r^{2})^{\left\lceil r^{2}/t^{2}\right\rceil},

where c>0 is a universal constant. Applying this with r=1 at a scale proportional to t and taking logarithms gives \log\mathcal{N}(F(\Omega_{n}),L_{\infty}(\mu_{n}),t)\leq C\left\lceil 1/t^{2}\right\rceil\log(c+cmt^{2})\leq C^{\prime}\frac{1}{t^{2}}\operatorname{Log}(mt^{2}); undoing the normalization R=1 (that is, replacing t by t/R) yields the claimed bound.

 

A.4 Fat-shattering dimension of linear and affine classes

In this section, Ω=d\Omega=\mathbb{R}^{d} and BdB\subset\mathbb{R}^{d} denotes the Euclidean unit ball. A function f:Ωf:\Omega\to\mathbb{R} is said to be affine if it is of the form f(x)=wx+bf(x)=w\cdot x+b, for some wdw\in\mathbb{R}^{d} and bb\in\mathbb{R}, where \cdot denotes the Euclidean inner product.

Throughout the paper, we have referred to R-bounded affine function classes as those for which \left\|w\right\|\vee|b|\leq R. In this section, we define the larger class of R-semi-bounded affine functions as those for which \left\|w\right\|\leq R, while b may be unbounded. In particular, the covering-number results (and the reduction to linear classes spelled out in Remark 19) do not apply to semi-bounded affine classes.

The following simple result may be of independent interest.

Lemma 21

Let FΩF\subset\mathbb{R}^{\Omega} be some collection of functions with the closure property

f,gF\displaystyle f,g\in F (fg)/2F.\displaystyle\implies(f-g)/2\in F. (27)

Then, for all γ>0\gamma>0, we have fatγ(F)=fåtγ(F)\operatorname{fat}_{\gamma}(F)=\operatorname{\textup{f\aa t}}_{\gamma}(F) .

Proof 

Suppose that some set \left\{x_{1},\ldots,x_{k}\right\} is \gamma-shattered by F. That means there is an r\in\mathbb{R}^{k} such that for every y\in\left\{-1,1\right\}^{k}, there is an f=f_{y}\in F for which

γ\displaystyle\gamma \displaystyle\leq yi(f(xi)ri),i[k].\displaystyle y_{i}(f(x_{i})-r_{i}),\qquad i\in[k]. (28)

Now for any y{1,1}ky\in\left\{-1,1\right\}^{k}, let f^=fy{\hat{f}}=f_{y} and fˇ=fy{\check{f}}=f_{-y}. Then, for each i[k]i\in[k], we have

γ\displaystyle\gamma \displaystyle\leq yi(f^(xi)ri),\displaystyle\phantom{-}y_{i}({\hat{f}}(x_{i})-r_{i}),
γ\displaystyle\gamma \displaystyle\leq yi(fˇ(xi)ri).\displaystyle-y_{i}({\check{f}}(x_{i})-r_{i}).

Averaging these two inequalities shows that f=({\hat{f}}-{\check{f}})/2 achieves (28), for the given y, with r\equiv 0: for each i\in[k],

y_{i}\left(\frac{{\hat{f}}(x_{i})-{\check{f}}(x_{i})}{2}-0\right) = \frac{y_{i}({\hat{f}}(x_{i})-r_{i})+(-y_{i})({\check{f}}(x_{i})-r_{i})}{2} \geq \frac{\gamma+\gamma}{2}=\gamma.

Now (27) implies that f\in F, which completes the proof.

It is well-known (Bartlett and Shawe-Taylor, 1999, Theorem 4.6) that bounded linear functions — i.e., function classes on B of the form F=\left\{x\mapsto w\cdot x;\left\|w\right\|\leq R\right\}, also known as homogeneous hyperplanes — satisfy \operatorname{fat}_{\gamma}(F)\leq(R/\gamma)^{2}. The discussion in Hanneke and Kontorovich (2019, p. 102) shows that the common approach of reducing the general (affine) case to the linear (homogeneous, b=0) case, via the addition of a “dummy” coordinate, incurs a large suboptimal factor in the bound. Hanneke and Kontorovich (2019, Lemma 6) is essentially an analysis of the fat-shattering dimension of bounded affine functions. Although this result contains a mistake (see Section 5), much of the proof technique can be salvaged:

Lemma 22

The semi-bounded affine function class on BB defined by F={xwx+b;wR}F=\left\{x\mapsto w\cdot x+b;\left\|w\right\|\leq R\right\} in dd dimensions satisfies

fatγ(F)\displaystyle\operatorname{fat}_{\gamma}(F) \displaystyle\leq min{d+1,((1+8π)Rγ)2},0<γR.\displaystyle\min\left\{d+1,\left(\frac{\left(1+\sqrt{\frac{8}{\pi}}\right)R}{\gamma}\right)^{2}\right\},\qquad 0<\gamma\leq R.

Proof  Since FF satisfies (27), it suffices to consider fåtγ(F)\operatorname{\textup{f\aa t}}_{\gamma}(F), and so the shattering condition simplifies to

γ\displaystyle\gamma \displaystyle\leq yi(wxi+b),i[k].\displaystyle y_{i}(w\cdot x_{i}+b),\qquad i\in[k]. (29)

Now \operatorname{\textup{f\aa t}}_{\gamma}(F) is always upper-bounded by the VC-dimension of the corresponding class thresholded at zero, i.e., \operatorname{sign}(F). For d-dimensional inhomogeneous hyperplanes, the latter is exactly d+1 (Mohri et al., 2012, Example 3.2). Having dispensed with the dimension-dependent part of the bound, we now focus on the R-dependent one.

Let us observe, as in Hanneke and Kontorovich (2019, Lemma 6), that for \left\|x_{i}\right\|\leq 1, \left\|w\right\|\leq R, and \gamma\leq R, one can always realize (29) with |b|\leq 2R; this is what we shall assume, without loss of generality, henceforth. Summing the k inequalities in (29) yields

kγ\displaystyle k\gamma \displaystyle\leq wi=1kyixi+bi=1kyiRi=1kyixi+2R|i=1kyi|.\displaystyle w\cdot\sum_{i=1}^{k}y_{i}x_{i}+b\sum_{i=1}^{k}y_{i}\leq R\left\|\sum_{i=1}^{k}y_{i}x_{i}\right\|+2R\left|\sum_{i=1}^{k}y_{i}\right|.

Letting yy be drawn uniformly from {1,1}k\left\{-1,1\right\}^{k} and taking expectations, we have

kγ\displaystyle k\gamma \displaystyle\leq R𝔼i=1kyixi+2R𝔼|i=1kyi|R𝔼i=1kyixi2+2R𝔼(i=1kyi)2\displaystyle R\mathop{\mathbb{E}}\left\|\sum_{i=1}^{k}y_{i}x_{i}\right\|+2R\mathop{\mathbb{E}}\left|\sum_{i=1}^{k}y_{i}\right|\leq R\sqrt{\mathop{\mathbb{E}}\left\|\sum_{i=1}^{k}y_{i}x_{i}\right\|^{2}}+2R\sqrt{\mathop{\mathbb{E}}\left(\sum_{i=1}^{k}y_{i}\right)^{2}}
=\displaystyle= Ri=1kxi2+2Ri=1kyi23Rk.\displaystyle R\sqrt{\sum_{i=1}^{k}\left\|x_{i}\right\|^{2}}+2R\sqrt{\sum_{i=1}^{k}y_{i}^{2}}\leq 3R\sqrt{k}.

where the equality in the second line uses \mathop{\mathbb{E}}[y_{i}y_{j}]=\delta_{ij}, so that the cross terms vanish. Isolating k on the left-hand side of the inequality proves the claim k\leq\left(\frac{3R}{\gamma}\right)^{2}.

Following a referee’s suggestion, we improve the constant as follows. Note that

𝔼|i=1kyi|=12ki=0k(ki)|k2i|=k2k1(k1k2)2πkk+12,\displaystyle\mathop{\mathbb{E}}\left|\sum_{i=1}^{k}y_{i}\right|=\frac{1}{2^{k}}\sum^{k}_{i=0}\binom{k}{i}|k-2i|=\frac{k}{2^{k-1}}\binom{k-1}{\left\lfloor\frac{k}{2}\right\rfloor}\leq\sqrt{\frac{2}{\pi}}\frac{k}{\sqrt{k+\frac{1}{2}}},

where the inequality follows from a binomial-coefficient estimate via Stirling's approximation. Thus,

kγ\displaystyle k\gamma \displaystyle\leq Rk+2R2πkk+12Rk+2R2πk,\displaystyle R\sqrt{k}+2R\sqrt{\frac{2}{\pi}}\frac{k}{\sqrt{k+\frac{1}{2}}}\leq R\sqrt{k}+2R\sqrt{\frac{2}{\pi}}\sqrt{k},

which proves that k((1+8π)Rγ)2k\leq\left(\frac{\left(1+\sqrt{\frac{8}{\pi}}\right)R}{\gamma}\right)^{2}.  
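Numerically, the improvement of the second argument over the first is

\left(1+\sqrt{\tfrac{8}{\pi}}\right)^{2}\approx(1+1.596)^{2}\approx 6.74, \qquad\text{compared with } 3^{2}=9.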

A.5 Concavity miscellanea

The results below are routine exercises in differentiation and Jensen’s inequality.

Lemma 23

For u>0u>0, the function xxlog(u/x)x\mapsto x\log(u/x) is concave on (0,)(0,\infty).

Corollary 24

For all u>0u>0 and vi>0v_{i}>0, i[k]i\in[k],

i=1kvilog(u/vi)\displaystyle\sum_{i=1}^{k}v_{i}\log(u/v_{i}) \displaystyle\leq (vi)logukvi.\displaystyle\left(\sum v_{i}\right)\log\frac{uk}{\sum v_{i}}.
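For completeness, the Jensen step behind Corollary 24 reads as follows (with \varphi(x)=x\log(u/x), which is concave by Lemma 23):

\sum_{i=1}^{k}v_{i}\log(u/v_{i}) = k\cdot\frac{1}{k}\sum_{i=1}^{k}\varphi(v_{i}) \leq k\,\varphi\!\left(\frac{1}{k}\sum_{i=1}^{k}v_{i}\right) = \left(\sum_{i=1}^{k}v_{i}\right)\log\frac{uk}{\sum_{i=1}^{k}v_{i}}.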
Lemma 25

For 0εlog20\leq\varepsilon\leq\log 2 and u2u\geq 2, the function xxlog1+ε(u/x)x\mapsto x\log^{1+\varepsilon}(u/x) is concave on [1,)[1,\infty). It follows that for ε,u\varepsilon,u as above and vi1v_{i}\geq 1, i[k]i\in[k],

i=1kvilog1+ε(u/vi)\displaystyle\sum_{i=1}^{k}v_{i}\log^{1+\varepsilon}(u/v_{i}) \displaystyle\leq (vi)log1+εukvi.\displaystyle\left(\sum v_{i}\right)\log^{1+\varepsilon}\frac{uk}{\sum v_{i}}.