This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

The Sample Complexity of One-Hidden-Layer Neural Networks

Gal Vardi Toyota Technological Institute at Chicago and the Hebrew University of Jerusalem, [email protected]. Work done while the author was at the Weizmann Institute of Science    Ohad Shamir Weizmann Institute of Science, Israel, [email protected]    Nathan Srebro Toyota Technological Institute at Chicago, [email protected]
Abstract

We study norm-based uniform convergence bounds for neural networks, aiming at a tight understanding of how these are affected by the architecture and type of norm constraint, for the simple class of scalar-valued one-hidden-layer networks, and inputs bounded in Euclidean norm. We begin by proving that in general, controlling the spectral norm of the hidden layer weight matrix is insufficient to get uniform convergence guarantees (independent of the network width), while a stronger Frobenius norm control is sufficient, extending and improving on previous work. Motivated by the proof constructions, we identify and analyze two important settings where (perhaps surprisingly) a mere spectral norm control turns out to be sufficient: First, when the network’s activation functions are sufficiently smooth (with the result extending to deeper networks); and second, for certain types of convolutional networks. In the latter setting, we study how the sample complexity is additionally affected by parameters such as the amount of overlap between patches and the overall number of patches.

1 Introduction

Understanding why large neural networks are able to generalize is one of the most important puzzles in the theory of deep learning. Since sufficiently large neural networks can approximate any function, their success must be due to a strong inductive bias in the learned network weights, which is still not fully understood.

A useful approach to understand such biases is studying what types of constraints on the network weights can lead to uniform convergence bounds, which ensure that empirical risk minimization will not lead to overfitting. Notwithstanding the ongoing debate on whether uniform convergence can fully explain the learning performance of neural networks (Nagarajan and Kolter, 2019; Negrea et al., 2020; Koehler et al., 2021), these bounds provide us with important insights on what norm-based biases can potentially aid in generalization. For example, for linear predictors, it is well-understood that constraints on the Euclidean norm of the weights imply uniform convergence guarantees independent of the number of parameters. This indicates that minimizing the Euclidean norm (without worrying about the number of parameters) is often a useful inductive bias, whether used explicitly or implicitly, or whether uniform convergence formally holds or not for some specific setup. However, neural networks have a more complicated structure than linear predictors, and we still lack a good understanding of what norm-based constraints imply a good inductive bias.

In this paper, we study this question in the simple case of scalar-valued one-hidden-layer neural networks, which generally compute functions from d\mathbb{R}^{d} to \mathbb{R} of the form

𝐱𝐮σ(W𝐱),\mathbf{x}~{}\mapsto~{}\mathbf{u}^{\top}\sigma(W\mathbf{x})~{}, (1)

with weight matrix Wn×dW\in\mathbb{R}^{n\times d}, weight vector 𝐮\mathbf{u}, and a fixed (generally non-linear) activation function σ\sigma. We focus on an Euclidean setting, where the inputs 𝐱\mathbf{x} and output weight vector 𝐯\mathbf{v} are assumed to have bounded Euclidean norm. Our goal is to understand what kind of norm control on the matrix WW is required to achieve uniform convergence guarantees, independent of the underlying distribution and the network width nn (i.e., the number of neurons). Previous work clearly indicates that a bound on the spectral norm is generally necessary, but (as we discuss below) does not conclusively imply whether it is also sufficient.

Our first contribution (in Subsection 3.1) is formally establishing that spectral norm control is generally insufficient to get width-independent sample complexity bounds in high dimensions, by directly lower bounding the fat-shattering number of the predictor class. On the flip side, if we assume that the Frobenius norm of WW is bounded, then we can prove uniform convergence guarantees, independent of the network width or input dimension. The latter result is based on Rademacher complexity, and extends previous results (e.g., (Neyshabur et al., 2015; Golowich et al., 2018), which crucially required homogeneous activations) to general Lipschitz activations. In Subsection 3.2, we also prove a variant of our lower bound in the case where the input dimension is fixed, pointing at a possibly interesting regime for which good upper bounds are currently lacking.

The constructions used in our lower bounds crucially require activation functions which are non-smooth around 0. Moreover, they employ networks where the matrix WW can be arbitrary (as long as its norm is bounded). Motivated by this, we identify and analyze two important settings where these lower bounds can be circumvented, and where a mere spectral norm control is sufficient to obtain width-independent guarantees:

  • The first case (studied in Sec. 4) is for networks where the activation function σ\sigma is sufficiently smooth: Specifically, when it is analytic and the coefficients of its Taylor expansion decay sufficiently rapidly. Some examples include polynomial activations, sigmoidal functions such as the error function, and appropriate smoothings of the ReLU function. Perhaps surprisingly, the smoothness of the activation allows us to prove uniform convergence guarantees that depend only on the spectral norm of WW and the structure of the activation function, independent of the network width. Moreover, we can extend our results for deeper networks when the activations is a power function (e.g., quadratic activations).

  • A second important case (studied in Sec. 5) is when the network employs weight-sharing on WW, as in convolutional networks. Specifically, we consider two variants of one-hidden-layer convolutional networks, one with a linear output layer, and another employing max-pooling. In both cases, we present bounds on the sample complexity that depend only on the spectral norm, and study how they depend on the convolutional architecture of the network (such as the number of patches or their amount of overlap).

Our work leaves open quite a few questions and possible avenues for future research, which we discuss in Sec. 6. All proofs of our results appear in Appendix A.

Related Work

The literature on the sample complexity of neural networks has rapidly expanded in recent years, and cannot be reasonably surveyed here. In what follows, we discuss only works which deal specifically with the issues we focus on in this paper.

Frobenius vs. spectral norm Control, lower bounds. Fat-shattering lower bounds for neural networks were developed in Anthony and Bartlett (1999), but involve size or dimension dependencies rather than norm control. Bartlett et al. (2017) proved a lower bound on the Rademacher complexity of neural networks, implying that a dependence on the spectral norm is generally necessary. Golowich et al. (2018) extended this to show that a dependence on the network width is also necessary, if only the spectral norm is controlled. However, their construction requires a vector-valued (rather than scalar-valued) output. More importantly, the lower bound is on the Rademacher complexity of the predictor class rather than the fat-shattering dimension, and thus (as we further discuss below) does not necessarily imply that the actual sample complexity with some bounded loss function indeed suffers from such a width dependence. Daniely and Granot (2019) do provide a fat-shattering lower bound, which implies that neural networks on d\mathbb{R}^{d} with bounded spectral norm and width at most dd can shatter Ω~(d2)\tilde{\Omega}(d^{2}) points with constant margin, assuming that the inputs have norm at most d\sqrt{d}. However, this lower bound does not separate between the input norm bound and the width of the hidden layer (which both scale with dd), and thus does not clarify the contribution of the network width to the bound. Moreover, their proof technique appears to crucially rely on the input’s norm scaling with the dimension, rather than being an independent parameter.

Frobenius vs. spectral norm control, upper bounds. A width-independent uniform convergence guarantee, depending on the Frobenius norm, has been established in Neyshabur et al. (2015) for constant-depth networks, and in Golowich et al. (2018) for arbitrary-depth networks. However, these bounds are specific to homogeneous activation functions, whereas we tackle general Lipschitz activations (at least for one-hidden layer networks). Bounds based on other norms include Anthony and Bartlett (1999); Bartlett et al. (2017); Liang (2016), but are potentially more restrictive than the Frobenius norm, or do not lead to width-independence. Also, we note that the bound of Bartlett et al. (2017) has the nice property of depending on the distance to some fixed reference matrix, rather than the norm itself. However, we do not pursue this generalization here as it is not the focus of our work.

Sample complexity with smooth activations. The Rademacher complexity for networks with quadratic activations has been studied in Du and Lee (2018), but assuming Frobenius norm constraints, whereas we show that mere spectral norm constraint is sufficient to bound the Rademacher complexity independent of the network width. The strong influence of the activation function on the sample complexity has been observed in the context of VC-dimension bounds for binary classification (see Anthony and Bartlett (1999, Section 7.2)). However, we are not aware of previous results showing how the smoothness of the activation functions provably affects scale-sensitive bounds such as the Rademacher complexity in our setting.

Sample complexity of convolutional networks. Norm-based uniform convergence bounds for convolutional networks (including more general ones than the one we study) have been provided in Du et al. (2018); Long and Sedghi (2019). However, these bounds either depend on the overall number of parameters, or apply only to average-pooling. For convolutional networks with max-pooling, Ledent et al. (2021) provide a norm-based analysis which we build on (see Sec. 5 for details). Cao and Gu (2019) showed an algorithm-dependent sample complexity of learning one-hidden-layer convolutional networks with non-overlapping filters and general activation functions. Additional works studying the generalization performance of convolutional networks in settings different than ours include Li et al. (2018); Arora et al. (2018); Wei and Ma (2019); Hsu et al. (2020); Brutzkus and Globerson (2021).

2 Preliminaries

Notation. We use bold-face letters to denote vectors, and let [m][m] be shorthand for {1,,m}\{1,\ldots,m\}. Given a matrix MM, Mi,jM_{i,j} is the entry in row ii and column jj. Given a function σ()\sigma(\cdot) on \mathbb{R}, we somewhat abuse notation and let σ(𝐱)\sigma(\mathbf{x}) (for a vector 𝐱\mathbf{x}) or σ(M)\sigma(M) (for a matrix MM) denote applying σ\sigma element-wise. A special case is when σ()=[]+=max{,0}\sigma(\cdot)=[\cdot]_{+}=\max\{\cdot,0\} is the ReLU function. We use standard big-Oh notation, with Ω(),Θ(),𝒪()\Omega(\cdot),\Theta(\cdot),\mathcal{O}(\cdot) hiding constants and Ω~(),Θ~(),𝒪~()\tilde{\Omega}(\cdot),\tilde{\Theta}(\cdot),\tilde{\mathcal{O}}(\cdot) hiding constants and factors polylogarithmic in the problem parameters.

Norms. \|\cdot\| denotes the operator norm: For vectors, it is the Euclidean norm, and for matrices, the spectral norm (i.e., M=sup𝐱:𝐱=1M𝐱\|M\|=\sup_{\mathbf{x}:\|\mathbf{x}\|=1}\|M\mathbf{x}\|). F\|\cdot\|_{F} denotes the Frobenius norm (i.e., MF=i,jMi,j2\|M\|_{F}=\sqrt{\sum_{i,j}M_{i,j}^{2}} ). It is well-known that for any matrix MM, MMF\|M\|\leq\|M\|_{F}, so the class of matrices whose Frobenius norm is bounded by some BB is a subset of the class of matrices whose spectral norm is bounded by the same BB. Moreover, if MM is an n×dn\times d matrix, then MFMmin{n,d}\|M\|_{F}\leq\|M\|\cdot\sqrt{\min\{n,d\}}.

Network Architecture. Most of our results pertain to scalar-valued one-hidden-layer networks, of the form 𝐱𝐮σ(W𝐱)\mathbf{x}\mapsto\mathbf{u}^{\top}\sigma(W\mathbf{x}), where 𝐱d\mathbf{x}\in\mathbb{R}^{d}, Wn×dW\in\mathbb{R}^{n\times d}, 𝐮\mathbf{u} is a vector and σ()\sigma(\cdot) is some fixed non-linear function. The width of the network is nn, the number of rows of WW (or equivalently, the number of neurons in the hidden layer of the network).

Fat-Shattering and Rademacher Complexity. When studying lower bounds on the sample complexity of a given function class, we use the following version of its fat-shattering dimension:

Definition 1.

A class of functions \mathcal{F} on an input domain 𝒳\mathcal{X} shatters mm points {𝐱i}i=1m𝒳\{\mathbf{x}_{i}\}_{i=1}^{m}\subseteq\mathcal{X} with margin ϵ\epsilon, if there exist a number ss, such that for all 𝐲{0,1}m\mathbf{y}\in\{0,1\}^{m} we can find some ff\in\mathcal{F} such that

i[m],f(𝐱i)sϵifyi=0andf(𝐱i)s+ϵifyi=1.\forall i\in[m],~{}~{}f(\mathbf{x}_{i})\leq s-\epsilon~{}~{}~{}\text{if}~{}~{}~{}y_{i}=0~{}~{}~{}\text{and}~{}~{}~{}f(\mathbf{x}_{i})\geq s+\epsilon~{}~{}~{}\text{if}~{}~{}~{}y_{i}=1~{}.

The fat-shattering dimension of \mathcal{F} (at scale ϵ\epsilon) is the cardinality mm of the largest set of points in 𝒳\mathcal{X} for which the above holds.

It is well-known that the fat-shattering dimension lower bounds the number of samples needed to learn in a distribution-free learning setting, up to accuracy ϵ\epsilon (see for example Anthony and Bartlett (1999, Part III)). Thus, by proving the existence of a large set of points shattered by the function class, we get lower bounds on the fat-shattering dimension, which translate to lower bounds on the sample complexity.

As to upper bounds on the sample complexity, our results utilize the Rademacher complexity of a function class \mathcal{F}, which for our purposes can be defined as

m()=sup{𝐱i}i=1m𝒳𝔼ϵ[supf1mi=1mϵifi(𝐱i)],\mathcal{R}_{m}(\mathcal{F})~{}=~{}\sup_{\{\mathbf{x}_{i}\}_{i=1}^{m}\subseteq\mathcal{X}}\mathbb{E}_{\boldsymbol{\epsilon}}\left[\sup_{f\in\mathcal{F}}~{}\frac{1}{m}\sum_{i=1}^{m}\epsilon_{i}f_{i}(\mathbf{x}_{i})\right]~{},

where ϵ=(ϵ1,,ϵm)\boldsymbol{\epsilon}=(\epsilon_{1},\ldots,\epsilon_{m}) is a vector of mm independent random variables ϵi\epsilon_{i} uniformly distributed on {1,+1}\{-1,+1\}. Upper bounds on the Rademacher complexity directly translate to upper bounds on the sample complexity required for learning \mathcal{F}: Specifically, the number of inputs mm required to make m()\mathcal{R}_{m}(\mathcal{F}) smaller than some ϵ\epsilon is generally an upper bound on the number of samples required to learn \mathcal{F} up to accuracy ϵ\epsilon, using any Lipschitz loss (see Bartlett and Mendelson (2002); Shalev-Shwartz and Ben-David (2014); Mohri et al. (2018)). We note that Rademacher complexity bounds can also be easily converted to margin-based bounds (where the 010-1 classification risk is upper-bounded by the proportion of margin violations on the training data) by considering a composition of the hypothesis class with an appropriate ramp loss (which upper bounds the 0-1 loss and lower bounds the margin loss, as was done for example in Bartlett and Mendelson (2002); Bartlett et al. (2017)).

We note that although the fat-shattering dimension and Rademacher complexity of the predictor class are closely related, they do no always behave the same: For example, the class of norm-bounded linear predictors {𝐱𝐰,𝐱:𝐰d,𝐰B}\{\mathbf{x}\mapsto\langle\mathbf{w},\mathbf{x}\rangle:\mathbf{w}\in\mathbb{R}^{d},\|\mathbf{w}\|\leq B\} has Rademacher complexity Θ(B/m)\Theta(B/\sqrt{m}), implying Θ((B/ϵ)2)\Theta((B/\epsilon)^{2}) samples to make it less than ϵ\epsilon. In contrast, the fat-shattering dimension of the class is smaller, Θ(min{d,(B/ϵ)2})\Theta(\min\{d,(B/\epsilon)^{2}\}) (Anthony and Bartlett, 1999; Bartlett and Mendelson, 2002). The reason for this discrepancy is that the Rademacher complexity of the predictor class necessarily scales with the range of the function outputs, which is not necessarily relevant if we use bounded losses (that is, if we are actually interested in the function class of linear predictors composed with a bounded loss). Such bounded losses are common, for example, when we are interested in bounding the expected misclassification error (see for example Bartlett and Mendelson (2002); Bartlett et al. (2017)). For this reason, when considering the predictor class itself, we focus on fat-shattering dimension in our lower bounds, and Rademacher complexity in our upper bounds.

3 Frobenius Norm Control is Necessary for General Networks

We begin by considering one-hidden-layer networks 𝐱𝐮σ(W𝐱)\mathbf{x}\mapsto\mathbf{u}^{\top}\sigma(W\mathbf{x}), where σ\sigma is a function on \mathbb{R} applied element-wise (such as the ReLU activation function). In Subsection 3.1, we consider the dimension-free case (where we are interested in upper and lower bounds that do not depend on the input dimension dd). In Subsection 3.2, we consider the case where the dimension dd is a fixed parameter.

3.1 Dimension-Free Bounds

We focus on the following hypothesis class of scalar-valued, one-hidden-layer neural networks of width nn on inputs in d\mathbb{R}^{d}, where σ\sigma is a function on \mathbb{R} applied element-wise, and where we only bound the operator norms:

b,B,n,dσ:={𝐱𝐮σ(W𝐱):𝐮n,Wn×d,𝐮b,WB}.\mathcal{H}_{b,B,n,d}^{\sigma}~{}:=~{}\left\{\mathbf{x}\mapsto\mathbf{u}^{\top}\sigma(W\mathbf{x})~{}:~{}\mathbf{u}\in\mathbb{R}^{n}~{},~{}W\in\mathbb{R}^{n\times d}~{},~{}\|\mathbf{u}\|\leq b~{},~{}\|W\|\leq B\right\}~{}.

The following theorem shows that if the input dimension is large enough, then under a mild condition on the non-smoothness of σ\sigma around 0, the fat-shattering dimension of this class necessarily scales with the network width nn:

Theorem 1.

Suppose that the activation function σ\sigma (as a function on \mathbb{R}) is 11-Lipschitz on [1,+1][-1,+1], and satisfies σ(0)=0\sigma(0)=0 as well as

infδ(0,1)|σ(δ)+σ(δ)δ|α\inf_{\delta\in(0,1)}\left|\frac{\sigma(\delta)+\sigma(-\delta)}{\delta}\right|\geq\alpha (2)

for some α>0\alpha>0.

Then there exist universal constants c,c>0c,c^{\prime}>0 such that the following hold: For any b,B,bx,n,ϵ>0b,B,b_{x},n,\epsilon>0, there is some d0=poly(b,B,bx,n,1/ϵ)d_{0}=\text{poly}(b,B,b_{x},n,1/\epsilon) such that for any input dimension dd0d\geq d_{0}, b,B,n,dσ\mathcal{H}^{\sigma}_{b,B,n,d} can shatter

cα2(bBbx)2nϵ2c\alpha^{2}\cdot\frac{(bBb_{x})^{2}n}{\epsilon^{2}}

points from {𝐱d:𝐱bx}\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|\leq b_{x}\} with margin ϵ\epsilon, provided the expression above is larger than c(1α2+B2+n)c^{\prime}(\frac{1}{\alpha^{2}}+B^{2}+n).

To understand the condition in Eq. (2), suppose that σ\sigma has a left-hand derivative σ(0)\sigma^{\prime}_{-}(0) and right-hand derivative σ+(0)\sigma^{\prime}_{+}(0) at 0. Recalling that σ(0)=0\sigma(0)=0, the condition stated in the theorem implies that

|σ(δ)σ(0)δσ(0)σ(δ)δ|α\left|\frac{\sigma(\delta)-\sigma(0)}{\delta}-\frac{\sigma(0)-\sigma(-\delta)}{\delta}\right|~{}\geq~{}\alpha

for all δ>0\delta>0. In particular, as δ0\delta\rightarrow 0, we get |σ+(0)σ(0)|>α|\sigma^{\prime}_{+}(0)-\sigma^{\prime}_{-}(0)|>\alpha. Thus, σ\sigma is necessarily non-differentiable at 0. For example, the ReLU activation function satisfies the assumption in the theorem with α=1\alpha=1, and the leaky ReLU function σ(z)=βz+(1β)[z]+\sigma(z)=\beta z+(1-\beta)[z]_{+} (with parameter β\beta) satisfies the assumption with α=1β\alpha=1-\beta.

Remark 1.

The assumption σ(0)=0\sigma(0)=0 is without much loss of generality: If σ(0)0\sigma(0)\neq 0, then let σ^(z):=σ(z)σ(0)\hat{\sigma}(z):=\sigma(z)-\sigma(0) be a centering of σ\sigma, and note that our predictors can be rewritten in the form 𝐱𝐮σ^(W𝐱)+σ(0)𝐮𝟏\mathbf{x}\mapsto\mathbf{u}^{\top}\hat{\sigma}(W\mathbf{x})+\sigma(0)\cdot\mathbf{u}^{\top}\mathbf{1}. Thus, our hypothesis class is contained in the hypothesis class of predictors of the form 𝐱𝐮σ^(W𝐱)+r\mathbf{x}\mapsto\mathbf{u}^{\top}\hat{\sigma}(W\mathbf{x})+r for some bounded bias parameter rr\in\mathbb{R}. This bias term does not change the fat-shattering dimension, and thus is not of much interest.

The theorem implies that with only spectral norm control (i.e. where 𝐮,W\|\mathbf{u}\|,\|W\| is bounded), it is impossible to get bounds independent of the width of the network nn. Initially, the lower bound might appear surprising, since if the activation function σ\sigma is the identity, b,B,n,dσ\mathcal{H}^{\sigma}_{b,B,n,d} simply contains linear predictors of norm bB\leq bB, for which the sample complexity / fat-shattering dimension is well known to be 𝒪(bB/ϵ2)\mathcal{O}(bB/\epsilon^{2}) in high input dimensions, completely independent of nn (see discussion in the previous section). Intuitively, the extra nn term in the lower bound comes from the fact that for random matrices MM, σ(M)\|\sigma(M)\| can typically be much larger than M\|M\|, even when σ\sigma is a Lipschitz function satisfying σ(0)=0\sigma(0)=0. To give a concrete example, if MM is an n×nn\times n matrix with i.i.d. entries uniform on {±1n}\{\pm\frac{1}{\sqrt{n}}\}, then standard concentration results imply that 𝔼[M]\mathbb{E}[\|M\|] is upper-bounded by a universal constant independent of nn, yet the matrix σ(M)\sigma(M) (where σ\sigma is entry-wise absolute value) satisfies σ(M)=n\|\sigma(M)\|=\sqrt{n} (since σ(M)\sigma(M) is just the constant matrix with value 1n\frac{1}{\sqrt{n}} at every entry). The formal proof (in the appendix) relies on constructing a network involving random weights, so that the spectral norm is small yet the network can return sufficiently large values due to the non-linearity.

Remark 2.

Thm. 1 has an interesting connection to the recent work of Bubeck et al. (2021), which implies that in order to fit mm points with bounded norm using a width-nn one-hidden-layer neural network 𝐱𝐯σ(W𝐱)\mathbf{x}\mapsto\mathbf{v}^{\top}\sigma(W\mathbf{x}), the Lipschitz constant of the network (and hence 𝐯W\|\mathbf{v}\|\cdot\|W\|) must be generally at least Ω(m/n)\Omega(\sqrt{m/n}). The lower bound in Thm. 1 implies a related statement in the opposite direction: If we allow 𝐯W\|\mathbf{v}\|\cdot\|W\| to be sufficiently larger than m/n\sqrt{m/n}, then there exist mm points that can be shattered with constant margin. Thus, we seem to get a good characterization of the expressiveness of one-hidden layer neural networks on finite datasets, as a function of their width and the magnitude of the weights.

Considering the lower bound, and noting that B2nB^{2}n is an upper bound on WF2\|W\|_{F}^{2} which is tight in the worst-case, the bound suggests that a control over the Frobenius norm WF\|W\|_{F} would be sufficient to get width-independent bounds. Indeed, such results were previously known when σ\sigma is the ReLU function, or more generally, a positive-homogeneous function of degree 11 (Neyshabur et al., 2015; Golowich et al., 2018), with the proofs crucially relying on that property. In what follows, we will prove such a result for general Lipschitz functions (at least for one-hidden layer networks).

Specifically, consider the following hypothesis class, where the previous spectral norm constraint on WW is replaced by a Frobenius norm constraint:

b,B,n,dσ:={𝐱𝐮σ(W𝐱):𝐮n,Wn×d,𝐮b,WFB}.\mathcal{F}_{b,B,n,d}^{\sigma}~{}:=~{}\left\{\mathbf{x}\mapsto\mathbf{u}^{\top}\sigma(W\mathbf{x})~{}:~{}\mathbf{u}\in\mathbb{R}^{n}~{},~{}W\in\mathbb{R}^{n\times d}~{},~{}\|\mathbf{u}\|\leq b~{},~{}\|W\|_{F}\leq B\right\}~{}.
Theorem 2.

Suppose σ()\sigma(\cdot) (as a function on \mathbb{R}) is LL-Lipschitz and σ(0)=0\sigma(0)=0. Then for any b,B,bx,n,d,ϵ>0b,B,b_{x},n,d,\epsilon>0, the Rademacher complexity of b,B,n,dσ\mathcal{F}_{b,B,n,d}^{\sigma} on mm inputs from {𝐱d:𝐱bx}\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|\leq b_{x}\} is at most ϵ\epsilon, if

mc(bBbxL)2(1+log3(m))ϵ2m~{}\geq~{}c\cdot\frac{(bBb_{x}L)^{2}(1+\log^{3}(m))}{\epsilon^{2}}

for some universal constant c>0c>0. Thus, it suffices to have m=𝒪~((bBbxLϵ)2)m=\tilde{\mathcal{O}}\left(\left(\frac{bBb_{x}L}{\epsilon}\right)^{2}\right).

The bound is indeed independent of the network width nn. Also, the result (as an upper bound on the Rademacher complexity) is clearly tight up to log-factors, since in the special case where σ(z)=Lz\sigma(z)=L\cdot z and we fix 𝐮=b𝐞1\mathbf{u}=b\cdot\mathbf{e}_{1}, then b,B,n,dσ\mathcal{F}_{b,B,n,d}^{\sigma} reduces to the class of linear predictors with Euclidean norm at most bBLbBL (on data of norm at most bxb_{x}), whose Rademacher complexity matches the bound above up to log-factors.

Remark 3 (Connection to Implicit Regularization).

It was recently proved that training neural networks employing homogeneous activations on losses such as the logistic loss, without any explicit regularization, gradient methods are implicitly biased towards models which minimize the squared Euclidean norm of their parameters (Lyu and Li, 2019; Ji and Telgarsky, 2020). In our setting of one-hidden-layer networks 𝐱𝐮σ(W𝐱)\mathbf{x}\mapsto\mathbf{u}^{\top}\sigma(W\mathbf{x}), this reduces to 𝐮2+WF2\|\mathbf{u}\|^{2}+\|W\|_{F}^{2}. For homogeneous activations, multiplying 𝐮\mathbf{u} by some scalar α\alpha and dividing WW by the same scalar leaves the network unchanged. Based on this observation, and the fact that minαα𝐮2+1αWF2=2𝐮WF\min_{\alpha\in\mathbb{R}}\|\alpha\mathbf{u}\|^{2}+\|\frac{1}{\alpha}W\|_{F}^{2}=2\|\mathbf{u}\|\cdot\|W\|_{F}, it follows that minimizing 𝐮2+WF2\|\mathbf{u}\|^{2}+\|W\|_{F}^{2} (under any constraints on the network’s outputs) is equivalent to minimizing 𝐮WF\|\mathbf{u}\|\cdot\|W\|_{F} (under the same constraints). Thus, gradient methods are biased towards models which minimize our bound from Thm. 2 in terms of the norms of 𝐮,W\mathbf{u},W.

3.2 Dimension-Dependent Lower Bound

The bounds presented above are dimension-free, in the sense that the upper bound holds for any input dimension dd, and the lower bound applies once dd is sufficiently large. However, for neural networks the case of dd being a fixed parameter is also of interest, since we often wish to apply large neural networks on inputs whose dimensionality is reasonably bounded (e.g., the number of pixels in an image).

For fixed dd, and for the predictor class itself (without an additional loss composed), it is well-known that there can be a discrepancy between the fat-shattering dimension and the Rademacher complexity, even for linear predictors (see discussion in Sec. 2). Thus, although Thm. 2 is tight as a bound on the Rademacher complexity, one may conjecture that the fat-shattering dimension (and true sample complexity for bounded losses) is actually smaller for fixed dd.

In what follows, we focus on the case of the Frobenius norm, and provide a dimension-dependent lower bound on the fat-shattering dimension. We first state the result for a ReLU activation with a bias term (Thm. 3), and then extend it to the standard ReLU activation under a slightly more stringent condition (Corollary 1).

Theorem 3.

For any b,B,bx,n,ϵb,B,b_{x},n,\epsilon, and any dd larger than some universal constant, there exists a choice of β[0,𝒪~(Bbxdn)]\beta\in[0,\tilde{\mathcal{O}}(\frac{Bb_{x}}{\sqrt{dn}})] such that the following hold: If σ(z)=[zβ]+\sigma(z)=[z-\beta]_{+}, then b,B,n,dσ\mathcal{F}^{\sigma}_{b,B,n,d} can shatter

Ω~(min{nd,bBbxϵd})\tilde{\Omega}\left(\min\left\{nd,\frac{bBb_{x}}{\epsilon}\sqrt{d}\right\}\right) (3)

points from {𝐱d:𝐱bx}\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|\leq b_{x}\} with margin ϵ\epsilon, assuming the expression above is larger than cdcd for some universal constant c>0c>0, and where Ω~\tilde{\Omega} hides factors polylogarithmic in d,n,b,B,bx,1ϵd,n,b,B,b_{x},\frac{1}{\epsilon}.

Corollary 1.

The lower bound of Thm. 3 also holds for the standard ReLU activation σ(z)=[z]+\sigma(z)=[z]_{+}, if βBbxn\beta\leq\frac{Bb_{x}}{\sqrt{n}} (which happens if the input dimension dd is larger than a factor polylogarithmic in the problem parameters).

The lower bound is the minimum of two terms: The first is ndnd, which is the order of the number of parameters in the network. This term is to be expected, since the fat-shattering dimension of \mathcal{F} is at most the pseudodimension of \mathcal{F}, which indeed scales with the number of parameters ndnd (see Anthony and Bartlett (1999); Bartlett et al. (2019)). Hence, we cannot expect to be able to shatter many more than ndnd points. The second term is norm- and dimension-dependent, and dominates the overall lower bound if the network width nn is large enough. Comparing the theorem with the 𝒪~((bBbx/ϵ)2)\tilde{\mathcal{O}}((bBb_{x}/\epsilon)^{2}) upper bound from Thm. 2, it seems to suggest that having a bounded dimension dd may improve the sample complexity compared to the dimension-free case, with a smaller dependence on the norm bounds. However, at the moment we do not have upper bounds which match this lower bound, or even establish that bounds better than Thm. 2 are possible when the dimension dd is small. We leave the question of understanding the sample complexity in the fixed-dimension regime as an interesting problem for future research.

Remark 4 (No contradiction to upper bound in Thm. 2, due to implicit bound on dd).

Thm. 3 requires that Eq. (3) is at least order of dd for the lower bound to be valid. This in turn requires that bBbxϵdd\frac{bBb_{x}}{\epsilon}\sqrt{d}\gg d, or equivalently d(bBbxϵ)2d\ll\left(\frac{bBb_{x}}{\epsilon}\right)^{2}. Thus, the theorem only applies when the dimension dd is not too large with respect to the other parameters. We note that this is to be expected: If we allow d(bBbxϵ)2d\gg\left(\frac{bBb_{x}}{\epsilon}\right)^{2} (and nn sufficiently large), then the lower bound in Eq. (3) will be larger than (bBbxϵ)2\left(\frac{bBb_{x}}{\epsilon}\right)^{2}, and this would violate the 𝒪~((bBbx/ϵ)2)\tilde{\mathcal{O}}\left((bBb_{x}/\epsilon)^{2}\right) upper bound implied by Thm. 2.

4 Spectral Norm Control Suffices for Sufficiently Smooth Activations

The lower bounds in the previous section crucially rely on the non-smoothness of the activation functions. Thus, one may wonder whether smoothness can lead to better upper bounds. In this section, we show that perhaps surprisingly, this is indeed the case: For sufficiently smooth activations (e.g., polynomials), one can provide width-independent Rademacher complexity bounds, using only the spectral norm. Formally, we return to the class of one-hidden-layer neural networks with spectral norm constraints,

b,B,n,dσ={𝐱𝐮σ(W𝐱):𝐮n,Wn×d,𝐮b,WB},\mathcal{H}^{\sigma}_{b,B,n,d}~{}=~{}\left\{\mathbf{x}\mapsto\mathbf{u}^{\top}\sigma(W\mathbf{x})~{}:~{}\mathbf{u}\in\mathbb{R}^{n}~{},~{}W\in\mathbb{R}^{n\times d}~{},~{}\|\mathbf{u}\|\leq b~{},~{}\|W\|\leq B\right\}~{},

and state the following theorem:

Theorem 4.

Fix some b,B,bx,n,d,ϵ>0b,B,b_{x},n,d,\epsilon>0. Suppose σ(z)=j=1ajzj\sigma(z)=\sum_{j=1}^{\infty}a_{j}z^{j} for some a1,a2,a_{1},a_{2},\ldots\in\mathbb{R}, simultaneously for all z:|z|Bbxz:|z|\leq Bb_{x}. Then the Rademacher complexity of b,B,n,dσ\mathcal{H}^{\sigma}_{b,B,n,d} on mm inputs from {𝐱d:𝐱bx}\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|\leq b_{x}\} is at most ϵ\epsilon, if

m(bσ~(Bbx)ϵ)2,whereσ~(z):=j=1k|aj|zjm~{}\geq~{}\left(\frac{b\cdot\tilde{\sigma}(Bb_{x})}{\epsilon}\right)^{2}~{},~{}~{}\text{where}~{}~{}\tilde{\sigma}(z):=\sum_{j=1}^{k}|a_{j}|z^{j}

(assuming the sum converges).

We note that the conditions imply σ(0)=0\sigma(0)=0, which is assumed for simplicity (see Remark 1). We emphasize that this bound depends only on spectral norms of the network and properties of the activation σ\sigma. In particular, it is independent of the network width nn as well as the Frobenius norm of WW. We also note that the bound is clearly tight in some cases: For example, if σ()\sigma(\cdot) is just the identity function, then b,B,n,dσ\mathcal{H}^{\sigma}_{b,B,n,d} reduces to the class of linear predictors of Euclidean norm at most bBbB, whose Rademacher complexity on inputs of norm at most bxb_{x} is well-known to equal Θ((bBbx/ϵ)2)\Theta((bBb_{x}/\epsilon)^{2}). This also demonstrates that the dependence on the spectral norm BB is necessary, even with smooth activations.

The proof of the theorem (in the appendix) depends on algebraic manipulations, which involve ‘unrolling’ the Rademacher complexity as a polynomial function of the network inputs, and employing a certain technical trick to simplify the resulting expression, given a bound on the spectral norm of the weight matrix.

We now turn to provide some specific examples of σ()\sigma(\cdot) and the resulting expression σ~(Bbx)\tilde{\sigma}(Bb_{x}) in the theorem (see also Figure 1):

Example 1.

If σ(z)\sigma(z) is a polynomial of degree kk, then σ~(Bbx)=𝒪((Bbx)k)\tilde{\sigma}(Bb_{x})=\mathcal{O}((Bb_{x})^{k}) for large enough BbxBb_{x}.

In the example above, the output values of predictors in the class are at most 𝒪((Bbx)k)\mathcal{O}((Bb_{x})^{k}), so it is not surprising that the resulting Rademacher complexity scales in the same manner.

The theorem also extends to non-polynomial activations, as long as they are sufficiently smooth (although the dependence on BbxBb_{x} in σ~(Bbx)\tilde{\sigma}(Bb_{x}) generally becomes exponential). The following is an example for a sigmoidal activation based on the error function:

Example 2.

If σ(z)=erf(rz)\sigma(z)=\text{erf}(rz) (where erf is the error function, and r>0r>0 is a scaling parameter), then σ~(Bbx)2rBbxπexp((rBbx)2)\tilde{\sigma}(Bb_{x})\leq\frac{2rBb_{x}}{\sqrt{\pi}}\exp((rBb_{x})^{2}).

Proof.

We have that σ(z)\sigma(z) equals

erf(rz)=2πt=0rzexp(t2)𝑑t=2πt=0rzj=0(t2)jj!dt=2πj=0(1)j(rz)2j+1j!(2j+1)\text{erf}(rz)~{}=~{}\frac{2}{\sqrt{\pi}}\int_{t=0}^{rz}\exp(-t^{2})dt~{}=~{}\frac{2}{\sqrt{\pi}}\int_{t=0}^{rz}\sum_{j=0}^{\infty}\frac{(-t^{2})^{j}}{j!}dt~{}=~{}\frac{2}{\sqrt{\pi}}\sum_{j=0}^{\infty}\frac{(-1)^{j}(rz)^{2j+1}}{j!(2j+1)}

and therefore

σ~(z)=2πj=0(rz)2j+1j!(2j+1)2rzπj=0((rz)2)jj!=2rzπexp((rz)2).\tilde{\sigma}(z)~{}=~{}\frac{2}{\sqrt{\pi}}\sum_{j=0}^{\infty}\frac{(rz)^{2j+1}}{j!(2j+1)}~{}\leq~{}\frac{2rz}{\sqrt{\pi}}\sum_{j=0}^{\infty}\frac{\left((rz)^{2}\right)^{j}}{j!}~{}=~{}\frac{2rz}{\sqrt{\pi}}\exp\left((rz)^{2}\right)~{}.

Refer to caption
Figure 1: Plots of error function activation from Example 2 (blue) and smoothed ReLU activation from Example 3 (red). Best viewed in color.

A sigmoidal activation also allows us to define a smooth approximation of the ReLU function, to which the theorem can be applied:

Example 3.

If σ(y)=12(y+z=0yerf(rz)𝑑z)\sigma(y)=\frac{1}{2}\left(y+\int_{z=0}^{y}\text{erf}(rz)dz\right), then σ~(Bbx)Bbx2+r(Bbx)2πexp((rBbx)2)\tilde{\sigma}(Bb_{x})\leq\frac{Bb_{x}}{2}+\frac{r(Bb_{x})^{2}}{\sqrt{\pi}}\exp((rBb_{x})^{2}).

We note that as rr\rightarrow\infty, σ(y)\sigma(y) converges uniformly to the ReLU function.

Proof.

Using a computation similar to the previous example, σ(y)\sigma(y) equals

12y+1πj=0(1)j(r2j+1y2j+2)j!(2j+1)(2j+2),\frac{1}{2}y+\frac{1}{\sqrt{\pi}}\sum_{j=0}^{\infty}\frac{(-1)^{j}(r^{2j+1}y^{2j+2})}{j!(2j+1)(2j+2)},

and therefore

σ~(z)=z2+1πj=0r2j+1z2j+2j!(2j+1)(2j+2)z2+rz2πj=0((rz)2)jj!=z2+rz2πexp((rz)2).\tilde{\sigma}(z)~{}=~{}\frac{z}{2}+\frac{1}{\sqrt{\pi}}\sum_{j=0}^{\infty}\frac{r^{2j+1}z^{2j+2}}{j!(2j+1)(2j+2)}~{}\leq~{}\frac{z}{2}+\frac{rz^{2}}{\sqrt{\pi}}\sum_{j=0}^{\infty}\frac{\left((rz)^{2}\right)^{j}}{j!}~{}=~{}\frac{z}{2}+\frac{rz^{2}}{\sqrt{\pi}}\exp((rz)^{2})~{}.

Although the last two examples imply an exponential dependence on the spectral norm bound BB in the theorem, they still imply that for any fixed BB, we can get a finite size-independent sample complexity (regardless of the network’s width or input dimension) while controlling only the spectral norm of the weight matrices.

4.1 Extension to Higher Depths for Power Activations

When the activation functions are powers of the form σ(z)=zk\sigma(z)=z^{k} for some kk, then the previous theorem can be extended to deeper networks. To formalize this, fix integers k1k\geq 1 and L1L\geq 1, and consider a depth-(L+1)(L+1) network fL+1(𝐱)f_{L+1}(\mathbf{x}) (parameterized by weight matrices W1,W2,,WLW^{1},W^{2},\ldots,W^{L} of some arbitrary fixed dimensions, and a weight vector 𝐮\mathbf{u}) defined recursively as follows:

f0(𝐱)=𝐱,j{0,,L1},fj+1(𝐱)=(Wj+1fj(𝐱))k,fL+1(𝐱)=𝐮fL(𝐱).f_{0}(\mathbf{x})=\mathbf{x}~{}~{},~{}~{}\forall j\in\{0,\ldots,L-1\},~{}f_{j+1}(\mathbf{x})~{}=~{}(W^{j+1}f_{j}(\mathbf{x}))^{\circ k}~{}~{},~{}~{}f_{L+1}(\mathbf{x})=\mathbf{u}^{\top}f_{L}(\mathbf{x})~{}.

where (𝐯)k(\mathbf{v})^{\circ k} denotes applying the kk-th power element-wise on a vector 𝐯\mathbf{v}.

Theorem 5.

For any integers k,L1k,L\geq 1 and choice of matrix dimensions at each layer, consider the class of neural networks fL+1f_{L+1} as above, over all weight matrices W1WLW^{1}\ldots W^{L} such that WjB\|W^{j}\|\leq B for all jj, and all 𝐮\mathbf{u} such that 𝐮b\|\mathbf{u}\|\leq b. Then the Rademacher complexity of this class on mm inputs from {𝐱:𝐱bx}\{\mathbf{x}:\|\mathbf{x}\|\leq b_{x}\} is at most ϵ\epsilon, if

m(bBk+k2+kLbxkLϵ)2.m~{}\geq~{}\left(\frac{b\cdot B^{k+k^{2}+\ldots k^{L}}\cdot b_{x}^{k^{L}}}{\epsilon}\right)^{2}~{}.

For constant kk and constant-depth networks, the sample complexity bound in the theorem is of the form bpoly(Bbx)/mb\cdot\text{poly}(Bb_{x})/\sqrt{m}, where BB bounds merely the (relatively weak) spectral norm. We also note that the exponential/doubly-exponential dependence on k,Lk,L is to be expected: Even if we consider networks where each matrix is a scalar BB, and the input is exactly bkb_{k}, then multiplying by BB and taking the kk-th power L1L-1 times over leads to the exact same Bk+k2+kLbxkLB^{k+k^{2}+\ldots k^{L}}\cdot b_{x}^{k^{L}} factor. Since the Rademacher complexity depends on the scale of the outputs, such a factor is generally unavoidable. The proof of the theorem (in the appendix) builds on the proof ideas of Thm. 4, which can be extended to deeper networks at least when the activations are power functions.

5 Convolutional Networks

In this section, we study another important example of neural networks which circumvent our lower bounds from Sec. 3, this time by adding additional constraints on the weight matrix. Specifically, we consider one-hidden-layer convolutional neural networks. These networks are defined via a set of patches Φ={ϕj}j=1n\Phi=\{\phi_{j}\}_{j=1}^{n}, where for each jj, the patch ϕj:dn\phi_{j}:\mathbb{R}^{d}\mapsto\mathbb{R}^{n^{\prime}} projects the input vector 𝐱d\mathbf{x}\in\mathbb{R}^{d} into some subset of its coordinates, namely ϕj(𝐱)=(xi1j,,xinj)\phi_{j}(\mathbf{x})=(x_{i^{j}_{1}},\ldots,x_{i^{j}_{n^{\prime}}}) for some {i1j,,inj}{1,,d}\{i^{j}_{1},\ldots,i^{j}_{n^{\prime}}\}\subseteq\{1,\ldots,d\}. The hidden layer is parameterized by a convolutional filter vector 𝐰n\mathbf{w}\in\mathbb{R}^{n^{\prime}}, and given an input 𝐱\mathbf{x}, outputs the vector (σ(𝐰ϕ1(𝐱)),,σ(𝐰ϕn(𝐱)))n(\sigma(\mathbf{w}^{\top}\phi_{1}(\mathbf{x})),\ldots,\sigma(\mathbf{w}^{\top}\phi_{n}(\mathbf{x})))\in\mathbb{R}^{n}, where σ\sigma is some activation function (e.g., ReLU). Note that this can be equivalently written as σ(W𝐱)\sigma(W\mathbf{x}), where each row jj of WW embeds the 𝐰\mathbf{w} vector in the coordinates corresponding to ϕj()\phi_{j}(\cdot). In what follows, we say that a matrix WW conforms to a set of patches Φ={ϕj}j=1n\Phi=\{\phi_{j}\}_{j=1}^{n}, if there exists a vector 𝐰\mathbf{w} such that (W𝐱)j=𝐰ϕj(𝐱)(W\mathbf{x})_{j}=\mathbf{w}^{\top}\phi_{j}(\mathbf{x}) for all 𝐱\mathbf{x}. Thus, our convolutional hidden layer corresponds to a standard hidden layer (same as in previous sections), but with the additional constraint on WW that it must conform to a certain set of patches.

In the first subsection below, we study networks where the convolutional hidden layer is combined with a linear output layer. In the following section, we study the case where the hidden layer is combined with a fixed pooling operation. In both cases, we will get bounds that depend on the spectral norm of WW and the architecture of the patches.

5.1 Convolutional Hidden Layer + Linear Output Layer

We begin by considering convolutional networks consisting of a convolutional hidden layer (with spectral norm control and with respect to some set of patches), followed by a linear output layer:

b,B,n,dσ,Φ={𝐱𝐮σ(W𝐱):𝐮n,Wn×d,𝐮b,WB,Wconforms toΦ}\mathcal{H}^{\sigma,\Phi}_{b,B,n,d}=\{\mathbf{x}\mapsto\mathbf{u}^{\top}\sigma(W\mathbf{x})~{}:~{}\mathbf{u}\in\mathbb{R}^{n},W\in\mathbb{R}^{n\times d},\|\mathbf{u}\|\leq b~{},~{}\|W\|\leq B~{},~{}W~{}\text{conforms to}~{}\Phi\}

The following theorem shows that we can indeed obtain a Rademacher complexity bound depending only on the spectral norm of WW, and independent of the network width nn, under a mild assumption about the architecture of the patches:

Theorem 6.

Suppose σ()\sigma(\cdot) is LL-Lipschitz and σ(0)=0\sigma(0)=0. Fix some set of patches Φ\Phi, and let OΦO_{\Phi} be the maximal number of patches that any single input coordinate (in {1,,d}\{1,\ldots,d\}) appears in. Then for any b,B,bx,n,d,ϵ>0b,B,b_{x},n,d,\epsilon>0, the Rademacher complexity of b,B,n,dσ,Φ\mathcal{H}^{\sigma,\Phi}_{b,B,n,d} on mm inputs from {𝐱d:𝐱bx}\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|\leq b_{x}\} is at most ϵ\epsilon, if

m2OΦ(bBbxLϵ)2.m~{}\geq~{}2\cdot O_{\Phi}\cdot\left(\frac{bBb_{x}L}{\epsilon}\right)^{2}~{}.

The proof of the theorem (in the appendix) is based on an algebraic analysis of the Rademacher complexity, and the observation that the spectral norm of WW necessarily upper bounds the Euclidean norm of the convolutional filter vector 𝐰\mathbf{w}.

Other than the usual parameters, the bound in the theorem also depends on the architectural parameter OΦ{1,,n}O_{\Phi}\in\{1,\ldots,n\}, which quantifies the amount of “overlap” between the patches. Although it can be as large as nn in the worst case (when some single coordinate appears in all patches), for standard convolutional architectures it is usually quite small, and does not scale with the input dimension or the total number of patches. For example, it equals 11 if the patches are disjoint, and more generally it equals the patch size divided by the stride. Nevertheless, an interesting open question is whether the OΦO_{\Phi} factor in the bound can be reduced or avoided altogether.

5.2 Convolutional Hidden Layer + Pooling Layer

We now turn to consider a slightly different one-hidden-layer convolutional networks, where the linear output layer is replaced by a fixed pooling layer. Specifically, we consider networks of the form

𝐱ρσ(W𝐱)=ρ(σ(𝐰ϕ1(𝐱)),,σ(𝐰ϕn(𝐱))),\mathbf{x}~{}\mapsto~{}\rho\circ\sigma(W\mathbf{x})~{}=~{}\rho\left(\sigma(\mathbf{w}^{\top}\phi_{1}(\mathbf{x})),\ldots,\sigma(\mathbf{w}^{\top}\phi_{n}(\mathbf{x}))\right),

where σ:\sigma:\mathbb{R}\to\mathbb{R} is an activation function as before, and ρ:n\rho:\mathbb{R}^{n}\to\mathbb{R} is 11-Lipschitz with respect to the \ell_{\infty} norm. For example, ρ()\rho(\cdot) may correspond to a max-pooling layer 𝐳maxj[n]zj\mathbf{z}\mapsto\max_{j\in[n]}z_{j}, or to an average-pooling layer 𝐳1nj[n]zj\mathbf{z}\mapsto\frac{1}{n}\sum_{j\in[n]}z_{j}. We define the following class of networks:

B,n,dσ,ρ,Φ:={𝐱ρσ(W𝐱):Wn×d,WB,W conforms to Φ}.\displaystyle\mathcal{H}_{B,n,d}^{\sigma,\rho,\Phi}~{}:=~{}\left\{\mathbf{x}\mapsto\rho\circ\sigma(W\mathbf{x})~{}:~{}W\in\mathbb{R}^{n\times d}~{},~{}\|W\|\leq B~{},~{}W\text{ conforms to }\Phi\right\}~{}.

This class is very closely related to a class of convolutional networks recently studied in Ledent et al. (2021) using an elegant covering number argument. Using their proof technique, we first provide a Rademacher complexity upper bound (Thm. 7 below), which depends merely on the spectral norm of WW, as well as a logarithmic dependence on the network width nn. Although a logarithmic dependence is relatively mild, one may wonder if we can remove it and get a fully width-independent bound, same as our previous results. Our main novel contribution in this section (Thm. 8) is to show that this is not the case: The fat-shattering dimension of the class necessarily has a log(n)\log(n) factor, so the upper bound is tight up to factors polylogarithmic in the sample size mm.

Theorem 7.

Suppose that σ:\sigma:\mathbb{R}\to\mathbb{R} is LL-Lipschitz and σ(0)=0\sigma(0)=0, and that ρ:n\rho:\mathbb{R}^{n}\to\mathbb{R} is 11-Lipschitz w.r.t. \ell_{\infty} and satisfies ρ(𝟎)=0\rho(\mathbf{0})=0. Fix some set of patches Φ={ϕj}j=1n\Phi=\{\phi_{j}\}_{j=1}^{n}. Then, for any B,n,d,bx,ϵ>0B,n,d,b_{x},\epsilon>0, the Rademacher complexity of B,n,dσ,ρ,Φ\mathcal{H}_{B,n,d}^{\sigma,\rho,\Phi} on mm inputs from {𝐱d:ϕj(𝐱)bx for all j[n]}\left\{\mathbf{x}\in\mathbb{R}^{d}~{}:~{}\|\phi_{j}(\mathbf{x})\|\leq b_{x}\text{ for all }j\in[n]\right\} is at most ϵ\epsilon if

mc(LBbxϵ)2log2(m)log(mn)m\geq c\cdot\left(\frac{LBb_{x}}{\epsilon}\right)^{2}\cdot\log^{2}(m)\log(mn)

for some universal constant c>0c>0. Thus, it suffices to have m=𝒪~((LBbxϵ)2)m=\tilde{\mathcal{O}}\left(\left(\frac{LBb_{x}}{\epsilon}\right)^{2}\right).

For the lower bound, we focus for simplicity on the case where ρ(𝐳)=maxjzj\rho(\mathbf{z})=\max_{j}z_{j} is a max-pooling layer, and where σ\sigma is the ReLU function (which satisfies the conditions of Thm. 7 with L=1L=1). However, we emphasize that unlike the lower bound we proved in Sec. 3, the construction does not rely on the non-smoothness of σ\sigma, and in fact can easily be verified to apply (up to constants) for any σ\sigma satisfying σ(0)=0\sigma(0)=0 and σ(ϵ)cϵ\sigma(\epsilon)\geq c\cdot\epsilon (where c>0c>0 is a constant).

Theorem 8.

For any B,n,bx,ϵ>0B,n,b_{x},\epsilon>0, there is d,Φd,\Phi such that the following hold: The class B,n,dσ,ρ,Φ\mathcal{H}_{B,n,d}^{\sigma,\rho,\Phi}, with σ\sigma being the ReLU function and ρ\rho being the max function, can shatter

14(Bbxϵ)2log(n)\frac{1}{4}\cdot\left(\frac{Bb_{x}}{\epsilon}\right)^{2}\cdot\log(n)

points from {𝐱d:𝐱bx}\{\mathbf{x}\in\mathbb{R}^{d}:\|\mathbf{x}\|\leq b_{x}\} with margin ϵ\epsilon.

Moreover, this claim holds already where Φ\Phi corresponds to a convolutional layer with a constant stride 11, in the following sense: If we view the input 𝐱d\mathbf{x}\in\mathbb{R}^{d} as a vectorization of a tensor of order p=𝒪(log(n))p=\mathcal{O}(\log(n)), then Φ\Phi corresponds to all patches of certain fixed dimensions s1××sps_{1}\times\ldots\times s_{p} in the tensor.

The proof of the theorem is rather technical, but can be informally described as follows (where we focus just on where the dependence on p=𝒪(log(n))p=\mathcal{O}(\log(n)) comes from): We consider each input 𝐱\mathbf{x} as a tensor of size 3×3××33\times 3\times\ldots\times 3 (with entries indexed by a vector in {1,2,3}p)\{1,2,3\}^{p}), and the patches are all sub-tensors of dimensions 2×2××22\times 2\times\ldots\times 2. We construct inputs 𝐱1,,𝐱p\mathbf{x}_{1},\ldots,\mathbf{x}_{p}, where each 𝐱i\mathbf{x}_{i} contains zeros, and a single 11 value at coordinate (2,2,,2,3,2,,2)(2,2,\ldots,2,3,2,\ldots,2) (with a 33 at position ii, and 22 elsewhere). Given a vector (y1,,yp){0,1}p(y_{1},\ldots,y_{p})\in\{0,1\}^{p} of target values, we construct the convolutional filter 𝐰\mathbf{w} (a pp-th order tensor of dimensions 2×2××22\times 2\times\ldots\times 2) to have zeros, and a single 11 value at coordinate (y1+1,,yp+1)(y_{1}+1,\ldots,y_{p}+1). Thus, we “encode” the full set of target values in 𝐰\mathbf{w}, and a simple calculation reveals that given 𝐱i\mathbf{x}_{i}, the network will output 11 if yi=1y_{i}=1, and 0 otherwise. Thus, we can shatter p=𝒪(log(n))p=\mathcal{O}(\log(n)) points. An extension of this idea allows us to also incorporate the right dependence on the other problem parameters.

6 Conclusions and Open Questions

In this paper, we studied sample complexity upper and lower bounds for one-hidden layer neural networks, based on bounding the norms of the weight matrices. We showed that in general, bounding the spectral norm cannot lead to size-independent guarantees, whereas bounding the Frobenius norm does. However, the constructions also pointed out where the lower bounds can be circumvented, and where a spectral norm control suffices for width-independent guarantees: First, when the activations are sufficiently smooth, and second, for certain types of convolutional networks.

Our work raises many open questions for future research. For example, how does having a fixed input dimension dd affect the sample complexity of neural networks? Our lower bound in Thm. 3 indicates small dd might reduce the sample complexity, but currently we do not have good upper bounds that actually establish that (at least without depending on the network width as well). Alternatively, it could also be that Thm. 3 can be strengthened. In a related vein, our lower bound for convolutional networks (Thm. 8) requires a relatively high dimension, at least on the order of the network width. Can we get smaller bounds if the dimension is constrained?

In a different direction, we showed that spectral norm control does not lead to width-free guarantees with non-smooth activations, whereas such guarantees are possible with very smooth activations. What about other activations? Can we characterize when can we get such guarantees for a given activation function? Or at least, can we improve the dependence on the norm bound for sufficiently smooth non-polynomial activations?

As to convolutional networks, we studied two particular architectures employing weight-sharing: One with a linear output layer, and one with a fixed Lipschitz pooling layer mapping to \mathbb{R}. Even for one-hidden-layer networks, this leaves open the question of characterizing the width-independent sample complexity of networks 𝐱𝐮ρσ(W𝐱)\mathbf{x}\mapsto\mathbf{u}^{\top}\rho\circ\sigma(W\mathbf{x}), where WW implements weight-sharing and ρ\rho is a pooling operator mapping to p\mathbb{R}^{p} with p>1p>1 (Ledent et al. (2021) provide upper bounds in this setting, but they are not size-independent and we conjecture that they can be improved). Moreover, we still do not know whether parameters such as the amount of overlap in the patches (see Thm. 6) are indeed necessary.

All our bounds are in terms of the parameter matrix norm, W\|W\| or WF\|W\|_{F}. Some existing bounds, such as in Bartlett et al. (2017), depend instead on the distance from some fixed data-independent matrix W0W_{0} (e.g., the initialization point), a quantity which can be much smaller. We chose to ignore this issue in our paper for simplicity, but it would be useful to generalize our bounds to incorporate this.

Beyond these, perhaps the most tantalizing open question is whether our results can be extended to deeper networks, and what types of bounds we might expect. Even if we treat the depth as a constant, existing bounds for deeper networks are either not width-independent (e.g., Neyshabur et al. (2018); Daniely and Granot (2019)), utilize norms much stronger than even the Frobenius norm (e.g., Anthony and Bartlett (1999); Bartlett et al. (2017)), or involve products of Frobenius norms, which is quite restrictive (Neyshabur et al., 2015; Golowich et al., 2018). Based on our results, we know that a bound depending purely on the spectral norms is impossible in general, but conjecture that the existing upper bounds are all relatively loose. A similar question can be asked for more specific architectures such as convolutional networks.

Acknowledgements

This research is supported in part by European Research Council (ERC) grant 754705, and NSF-BSF award 1718970. We thank Roey Magen for spotting some bugs in the proof of Thm. 2 in a previous version of this paper.

References

  • Anthony and Bartlett [1999] Martin Anthony and Peter L Bartlett. Neural network learning: Theoretical foundations. Cambridge University Press, 1999.
  • Arora et al. [2018] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, pages 254–263. PMLR, 2018.
  • Bartlett and Mendelson [2002] Peter L Bartlett and Shahar Mendelson. Rademacher and gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
  • Bartlett et al. [2017] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30:6240–6249, 2017.
  • Bartlett et al. [2019] Peter L Bartlett, Nick Harvey, Christopher Liaw, and Abbas Mehrabian. Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. The Journal of Machine Learning Research, 20(1):2285–2301, 2019.
  • Brutzkus and Globerson [2021] Alon Brutzkus and Amir Globerson. An optimization and generalization analysis for max-pooling networks. In Proceedings of the Thirty-Seventh Conference on Uncertainty in Artificial Intelligence, 2021.
  • Bubeck et al. [2021] Sébastien Bubeck, Yuanzhi Li, and Dheeraj M Nagaraj. A law of robustness for two-layers neural networks. In Conference on Learning Theory, pages 804–820. PMLR, 2021.
  • Cao and Gu [2019] Yuan Cao and Quanquan Gu. Tight sample complexity of learning one-hidden-layer convolutional neural networks. Advances in Neural Information Processing Systems, 32, 2019.
  • Daniely and Granot [2019] Amit Daniely and Elad Granot. Generalization bounds for neural networks via approximate description length. Advances in Neural Information Processing Systems, 32:13008–13016, 2019.
  • Du and Lee [2018] Simon Du and Jason Lee. On the power of over-parametrization in neural networks with quadratic activation. In International Conference on Machine Learning, pages 1329–1338. PMLR, 2018.
  • Du et al. [2018] Simon S Du, Yining Wang, Xiyu Zhai, Sivaraman Balakrishnan, Russ R Salakhutdinov, and Aarti Singh. How many samples are needed to estimate a convolutional neural network? Advances in Neural Information Processing Systems, 31, 2018.
  • Golowich et al. [2017] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.
  • Golowich et al. [2018] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299. PMLR, 2018.
  • Hsu et al. [2020] Daniel Hsu, Ziwei Ji, Matus Telgarsky, and Lan Wang. Generalization bounds via distillation. In International Conference on Learning Representations, 2020.
  • Ji and Telgarsky [2020] Ziwei Ji and Matus Telgarsky. Directional convergence and alignment in deep learning. Advances in Neural Information Processing Systems, 33, 2020.
  • Koehler et al. [2021] Frederic Koehler, Lijia Zhou, Danica J Sutherland, and Nathan Srebro. Uniform convergence of interpolators: Gaussian width, norm bounds, and benign overfitting. arXiv preprint arXiv:2106.09276, 2021.
  • Ledent et al. [2021] Antoine Ledent, Waleed Mustafa, Yunwen Lei, and Marius Kloft. Norm-based generalisation bounds for deep multi-class convolutional neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8279–8287, 2021.
  • Ledoux and Talagrand [1991] Michel Ledoux and Michel Talagrand. Probability in Banach Spaces: isoperimetry and processes, volume 23. Springer Science & Business Media, 1991.
  • Li et al. [2018] Xingguo Li, Junwei Lu, Zhaoran Wang, Jarvis Haupt, and Tuo Zhao. On tighter generalization bound for deep neural networks: Cnns, resnets, and beyond. arXiv preprint arXiv:1806.05159, 2018.
  • Liang [2016] Percy Liang. Cs229t/stat231: Statistical learning theory (winter 2016) lecture notes. https://web.stanford.edu/class/cs229t/notes.pdf, 2016.
  • Long and Sedghi [2019] Philip M Long and Hanie Sedghi. Generalization bounds for deep convolutional neural networks. In International Conference on Learning Representations, 2019.
  • Lyu and Li [2019] Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2019.
  • Mohri et al. [2018] Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of machine learning. MIT press, 2018.
  • Nagarajan and Kolter [2019] Vaishnavh Nagarajan and J Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • Negrea et al. [2020] Jeffrey Negrea, Gintare Karolina Dziugaite, and Daniel Roy. In defense of uniform convergence: Generalization via derandomization with an application to interpolating predictors. In International Conference on Machine Learning, pages 7263–7272. PMLR, 2020.
  • Neyshabur et al. [2015] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401. PMLR, 2015.
  • Neyshabur et al. [2018] Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.
  • Seginer [2000] Yoav Seginer. The expected norm of random matrices. Combinatorics, Probability and Computing, 9(2):149–166, 2000.
  • Shalev-Shwartz and Ben-David [2014] Shai Shalev-Shwartz and Shai Ben-David. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.
  • Wei and Ma [2019] Colin Wei and Tengyu Ma. Improved sample complexities for deep networks and robust classification via an all-layer margin. arXiv preprint arXiv:1910.04284, 2019.
  • Zhang [2002] Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2(Mar):527–550, 2002.

Appendix A Proofs

A.1 Proof of Thm. 1

We will assume without loss of generality that the condition infδ(0,1)|σ(δ)+σ(δ)δ|α\inf_{\delta\in(0,1)}\left|\frac{\sigma(\delta)+\sigma(-\delta)}{\delta}\right|\geq\alpha stated in the theorem holds without an absolute value, namely

infδ(0,1)σ(δ)+σ(δ)δα.\inf_{\delta\in(0,1)}\frac{\sigma(\delta)+\sigma(-\delta)}{\delta}~{}\geq~{}\alpha~{}. (4)

To see why, note that if \inf_{\delta\in(0,1)}\left|\frac{\sigma(\delta)+\sigma(-\delta)}{\delta}\right|\geq\alpha\geq 0, then \frac{\sigma(\delta)+\sigma(-\delta)}{\delta} can never change sign as a function of \delta (otherwise, since \sigma is continuous, it would equal 0 for some \delta). Hence, the condition implies that either \frac{\sigma(\delta)+\sigma(-\delta)}{\delta}~{}\geq~{}\alpha for all \delta\in(0,1), or -\frac{\sigma(\delta)+\sigma(-\delta)}{\delta}~{}\geq~{}\alpha for all \delta\in(0,1). We choose to treat the first case; the second case can be handled by a completely identical analysis, only flipping some of the signs.

Fix some sufficiently large dimension dd and integer mdm\leq d to be chosen later. Choose 𝐱1,,𝐱m\mathbf{x}_{1},\ldots,\mathbf{x}_{m} to be some mm orthogonal vectors of norm bxb_{x} in d\mathbb{R}^{d}. Let XX be the d×md\times m matrix whose ii-th column is 𝐱i\mathbf{x}_{i}. Given this input set, it is enough to show that there is some number ss, such that for any 𝐲{0,1}m\mathbf{y}\in\{0,1\}^{m}, we can find a predictor (namely, 𝐮,W\mathbf{u},W depending on 𝐲\mathbf{y}) in our class, such that 𝐮b\|\mathbf{u}\|\leq b, WB\|W\|\leq B, and

i,𝐮σ(W𝐱i)is{sϵyi=0s+ϵyi=1.\forall i~{},\mathbf{u}^{\top}\sigma(W\mathbf{x}_{i})~{}\text{is}~{}\begin{cases}\leq s-\epsilon&y_{i}=0\\ \geq s+\epsilon&y_{i}=1\end{cases}~{}. (5)

We will do so as follows: We let

𝐮=bn𝟏andW=δbx2Vdiag(𝐲)X,\mathbf{u}=\frac{b}{\sqrt{n}}\mathbf{1}~{}~{}~{}\text{and}~{}~{}~{}W=\frac{\delta}{b_{x}^{2}}V\text{diag}(\mathbf{y})X^{\top}~{},

where \delta\in(0,1) is a certain scaling factor and V is a \pm 1-valued matrix of size n\times m, both to be chosen later. In particular, we will assume that V is approximately balanced, in the sense that for any column i\in[m] of V, if p_{i} is the portion of +1 entries in that column, then

maxi|12pi|α8.\max_{i}\left|\frac{1}{2}-p_{i}\right|~{}\leq~{}\frac{\alpha}{8}~{}. (6)

For any i[m]i\in[m], since 𝐱1,,𝐱m\mathbf{x}_{1},\ldots,\mathbf{x}_{m} are orthogonal and of norm bxb_{x}, we have

𝐮σ(W𝐱i)\displaystyle\mathbf{u}^{\top}\sigma(W\mathbf{x}_{i})~{} =𝐮σ(δbx2Vdiag(𝐲)X𝐱i)=𝐮σ(δyi𝐯i)=bnj=1nσ(δyiVj,i)\displaystyle=~{}\mathbf{u}^{\top}\sigma\left(\frac{\delta}{b_{x}^{2}}V\text{diag}(\mathbf{y})X^{\top}\mathbf{x}_{i}\right)~{}=~{}\mathbf{u}^{\top}\sigma(\delta y_{i}\mathbf{v}_{i})~{}=~{}\frac{b}{\sqrt{n}}\sum_{j=1}^{n}\sigma(\delta y_{i}V_{j,i})

where 𝐯i\mathbf{v}_{i} is the ii-th column of VV, and Vj,iV_{j,i} is the entry of VV in the jj-th row and ii-th column. Then we have the following:

  • If yi=0y_{i}=0, this equals bnσ(0)=0b\sqrt{n}\sigma(0)=0.

  • If yi=1y_{i}=1, this equals bn(piσ(δ)+(1pi)σ(δ))b\sqrt{n}\left(p_{i}\sigma(\delta)+(1-p_{i})\sigma(-\delta)\right), where pi[12α8,12+α8]p_{i}\in[\frac{1}{2}-\frac{\alpha}{8},\frac{1}{2}+\frac{\alpha}{8}] is the portion of entries in the ii-th column of VV with value +1+1. Rewriting it and using Eq. (4), Eq. (6) and the fact that σ()\sigma(\cdot) is 11-Lipschitz on [1,+1][-1,+1], we get the expression

    bn(σ(δ)+σ(δ)2(12pi)(σ(δ)σ(δ)))bn(δα2α82δ)=bnδα4.b\sqrt{n}\left(\frac{\sigma(\delta)+\sigma(-\delta)}{2}-\left(\frac{1}{2}-p_{i}\right)\left(\sigma(\delta)-\sigma(-\delta)\right)\right)~{}\geq~{}b\sqrt{n}\left(\frac{\delta\alpha}{2}-\frac{\alpha}{8}\cdot 2\delta\right)~{}=~{}\frac{b\sqrt{n}\delta\alpha}{4}~{}.

Recalling Eq. (5), we get that by fixing s=\frac{b\sqrt{n}\delta\alpha}{8}, we can shatter the dataset as long as

bnδα8ϵδ8ϵαbn.\frac{b\sqrt{n}\delta\alpha}{8}\geq\epsilon~{}~{}~{}~{}\Rightarrow~{}~{}~{}\delta\geq\frac{8\epsilon}{\alpha b\sqrt{n}}~{}. (7)

Leaving this condition aside for a moment, we now turn to specify how \delta and V are chosen, so as to satisfy the condition \|W\|=\|\frac{\delta}{b_{x}^{2}}V\text{diag}(\mathbf{y})X^{\top}\|\leq B. To that end, we let V be any \pm 1-valued n\times m matrix which satisfies Eq. (6) as well as \|V\|\leq c(\sqrt{n}+\sqrt{m}), where c\geq 1 is some universal constant. Such a matrix necessarily exists assuming m is sufficiently larger than \frac{1}{\alpha^{2}}: this follows from the probabilistic method, since if we pick the entries of V uniformly at random, then each of the two conditions holds with arbitrarily high probability (assuming m is sufficiently larger than 1/\alpha^{2}; see for example Seginer [2000]), hence both hold simultaneously with some positive probability. It then follows that \|W\|\leq\frac{\delta}{b_{x}^{2}}\|V\|\cdot\|\text{diag}(\mathbf{y})\|\cdot\|X\|\leq\frac{\delta}{b_{x}^{2}}\cdot c(\sqrt{n}+\sqrt{m})\cdot b_{x}=\frac{\delta}{b_{x}}\cdot c(\sqrt{n}+\sqrt{m}). Therefore, by assuming

δBbxc(n+m),\delta\leq\frac{Bb_{x}}{c(\sqrt{n}+\sqrt{m})},

we ensure that WB\|W\|\leq B.

Collecting the conditions on \delta (namely, that it lies in (0,1), satisfies Eq. (7), and satisfies the displayed equation above), we get that there is an appropriate choice of \delta, and that we can shatter our m points, as long as m is sufficiently larger than 1/\alpha^{2} and

1>Bbxc(n+m)8ϵαbn.1~{}>~{}\frac{Bb_{x}}{c(\sqrt{n}+\sqrt{m})}~{}\geq~{}\frac{8\epsilon}{\alpha b\sqrt{n}}~{}.

The first inequality is satisfied if (say) we can choose m(Bbx/c)2m\geq(Bb_{x}/c)^{2} (which we will indeed do in the sequel). As to the second inequality, it is certainly satisfied if mnm\geq n, as well as

Bbx2cm8ϵαbnm(α16c)2(bBbx)2nϵ2.\frac{Bb_{x}}{2c\sqrt{m}}~{}\geq~{}\frac{8\epsilon}{\alpha b\sqrt{n}}~{}~{}~{}\Longrightarrow~{}~{}~{}m\leq\left(\frac{\alpha}{16c}\right)^{2}\cdot\frac{(bBb_{x})^{2}n}{\epsilon^{2}}~{}.

Thus, we can shatter any number m of points up to this upper bound. Picking m on this order (assuming it is sufficiently larger than 1/\alpha^{2}, B^{2} and n), assuming that the dimension d is larger than m, and renaming the universal constants, the result follows.
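As an informal sanity check on the construction (not part of the formal argument), the following numpy sketch instantiates it with the activation \sigma(z)=|z|, for which the theorem's condition holds with \alpha=2, and with arbitrary illustrative values of n,m,d,b,b_{x},B,\epsilon chosen by us. It draws a random \pm 1 matrix V, verifies the balancedness condition of Eq. (6) and the constraint of Eq. (7), and then checks that the resulting (\mathbf{u},W) satisfies both the margin condition of Eq. (5) and \|W\|\leq B.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 200, 50, 100             # width, number of points, input dimension (m <= d)
b, bx, B, eps = 1.0, 1.0, 1.0, 0.05
alpha = 2.0                        # for sigma(z)=|z|: (sigma(t)+sigma(-t))/t = 2 for all t > 0
sigma = np.abs                     # 1-Lipschitz, sigma(0)=0

X = bx * np.eye(d)[:, :m]          # m orthogonal inputs of norm bx (columns of X)
V = rng.choice([-1.0, 1.0], size=(n, m))
assert np.max(np.abs(0.5 - (V > 0).mean(axis=0))) <= alpha / 8   # Eq. (6)
y = rng.integers(0, 2, size=m).astype(float)                     # arbitrary labels to shatter

c = np.linalg.norm(V, 2) / (np.sqrt(n) + np.sqrt(m))             # so that ||V|| = c(sqrt(n)+sqrt(m))
delta = min(0.99, B * bx / (c * (np.sqrt(n) + np.sqrt(m))))      # guarantees ||W|| <= B
assert delta >= 8 * eps / (alpha * b * np.sqrt(n))               # Eq. (7)

u = (b / np.sqrt(n)) * np.ones(n)
W = (delta / bx**2) * V @ np.diag(y) @ X.T
out = np.array([u @ sigma(W @ X[:, i]) for i in range(m)])
s = b * np.sqrt(n) * delta * alpha / 8

print("||W|| <= B             :", np.linalg.norm(W, 2) <= B + 1e-12)
print("y=0 outputs <= s - eps :", np.all(out[y == 0] <= s - eps))
print("y=1 outputs >= s + eps :", np.all(out[y == 1] >= s + eps))
```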

A.2 Proof of Thm. 2

To simplify notation, we rewrite sup𝐮,W:𝐮b,WFB\sup_{\mathbf{u},W:\|\mathbf{u}\|\leq b,\|W\|_{F}\leq B} as simply sup𝐮,W\sup_{\mathbf{u},W}. Also, we let 𝐰j\mathbf{w}_{j} denote the jj-th row of the matrix WW.

Fix some set of inputs 𝐱1,,𝐱m\mathbf{x}_{1},\ldots,\mathbf{x}_{m} with norm at most bxb_{x}. The Rademacher complexity equals

𝔼ϵsup𝐮,W\displaystyle\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{u},W} 1mi=1mϵi𝐮σ(W𝐱i)=𝔼ϵsup𝐮,W1m𝐮(i=1mϵiσ(W𝐱i))\displaystyle\frac{1}{m}\sum_{i=1}^{m}\epsilon_{i}\mathbf{u}^{\top}\sigma(W\mathbf{x}_{i})~{}=~{}\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{u},W}\frac{1}{m}\mathbf{u}^{\top}\left(\sum_{i=1}^{m}\epsilon_{i}\sigma(W\mathbf{x}_{i})\right)
=bm𝔼ϵsupWi=1mϵiσ(W𝐱i)=bm𝔼ϵsupWj=1n(i=1mϵiσ(𝐰j𝐱i))2.\displaystyle=~{}\frac{b}{m}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{W}\left\|\sum_{i=1}^{m}\epsilon_{i}\sigma(W\mathbf{x}_{i})\right\|~{}=~{}\frac{b}{m}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{W}\sqrt{\sum_{j=1}^{n}\left(\sum_{i=1}^{m}\epsilon_{i}\sigma(\mathbf{w}_{j}^{\top}\mathbf{x}_{i})\right)^{2}}~{}.

Each matrix in the set \{W\in\mathbb{R}^{n\times d}:\|W\|_{F}\leq B\} is composed of rows whose sum of squared norms is at most B^{2}. Thus, the set can be equivalently defined as the set of n\times d matrices, where each row j equals v_{j}\mathbf{w}_{j} for some v_{j}>0, \|\mathbf{w}_{j}\|\leq 1, and \|(v_{1},\ldots,v_{n})\|^{2}=\|\mathbf{v}\|^{2}\leq B^{2}. Noting that each v_{j} is positive, we can upper bound the expression in the displayed equation above as follows:

bm𝔼ϵsup𝐯,{𝐰j}j=1n(i=1mϵiσ(vj𝐰j𝐱i))2\displaystyle\frac{b}{m}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{v},\{\mathbf{w}_{j}\}}\sqrt{\sum_{j=1}^{n}\left(\sum_{i=1}^{m}\epsilon_{i}\sigma(v_{j}\mathbf{w}_{j}^{\top}\mathbf{x}_{i})\right)^{2}}
=bm𝔼ϵsup𝐯,{𝐰j}j=1nvj2(i=1mϵivjσ(vj𝐰j𝐱i))2\displaystyle=~{}\frac{b}{m}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{v},\{\mathbf{w}_{j}\}}\sqrt{\sum_{j=1}^{n}v_{j}^{2}\left(\sum_{i=1}^{m}\frac{\epsilon_{i}}{v_{j}}\sigma(v_{j}\mathbf{w}_{j}^{\top}\mathbf{x}_{i})\right)^{2}}
bm𝔼ϵsup𝐯,𝐯,{𝐰j}j=1nvj2(i=1mϵivjσ(vj𝐰j𝐱i))2,\displaystyle\leq~{}\frac{b}{m}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{v},\mathbf{v}^{\prime},\{\mathbf{w}_{j}\}}\sqrt{\sum_{j=1}^{n}{v^{\prime}}_{j}^{2}\left(\sum_{i=1}^{m}\frac{\epsilon_{i}}{v_{j}}\sigma(v_{j}\mathbf{w}_{j}^{\top}\mathbf{x}_{i})\right)^{2}}~{}, (8)

where \mathbf{v}^{\prime}=(v^{\prime}_{1},\ldots,v^{\prime}_{n}) satisfies \|\mathbf{v}^{\prime}\|^{2}=\sum_{j=1}^{n}{v^{\prime}}_{j}^{2}\leq B^{2} (note that \mathbf{v} must also satisfy this constraint). Considering this constraint in Eq. (8), we see that for any choice of \boldsymbol{\epsilon},\mathbf{v} and \mathbf{w}_{1},\ldots,\mathbf{w}_{n}, the supremum over \mathbf{v}^{\prime} is clearly attained by letting {v^{\prime}}_{j^{*}}=B for some j^{*} for which \left(\sum_{i=1}^{m}\frac{\epsilon_{i}}{v_{j}}\sigma(v_{j}\mathbf{w}_{j}^{\top}\mathbf{x}_{i})\right)^{2} is maximized, and {v^{\prime}}_{j}=0 for all j\neq j^{*}. Plugging this observation back into Eq. (8) and writing the supremum constraints explicitly, we can upper bound the displayed equation by

bBm𝔼ϵsup𝐯:minjvj>0,𝐯Bsup𝐰1,𝐰n:maxj𝐰j1maxj|i=1mϵivjσ(vj𝐰j𝐱i)|\displaystyle\frac{bB}{m}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{v}:\min_{j}v_{j}>0,\|\mathbf{v}\|\leq B}~{}~{}\sup_{\mathbf{w}_{1},\ldots\mathbf{w}_{n}:\max_{j}\|\mathbf{w}_{j}\|\leq 1}\max_{j}\left|\sum_{i=1}^{m}\frac{\epsilon_{i}}{v_{j}}\sigma(v_{j}\mathbf{w}_{j}^{\top}\mathbf{x}_{i})\right|
=bBm𝔼ϵsupv(0,B],𝐰:𝐰1|i=1mϵivσ(v𝐰𝐱i)|\displaystyle~{}~{}~{}~{}=~{}\frac{bB}{m}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{v\in(0,B],\mathbf{w}:\|\mathbf{w}\|\leq 1}\left|\sum_{i=1}^{m}\frac{\epsilon_{i}}{v}\sigma(v\mathbf{w}^{\top}\mathbf{x}_{i})\right|
=bBm𝔼ϵsupv(0,B],𝐰:𝐰1|i=1mϵiψv(𝐰𝐱i)|,\displaystyle~{}~{}~{}=~{}\frac{bB}{m}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{v\in(0,B],\mathbf{w}:\|\mathbf{w}\|\leq 1}\left|\sum_{i=1}^{m}\epsilon_{i}\psi_{v}\left(\mathbf{w}^{\top}\mathbf{x}_{i}\right)\right|~{}, (9)

where \psi_{v}(z):=\frac{\sigma(vz)}{v} for any z\in\mathbb{R}. Since \sigma(\cdot) is L-Lipschitz, it follows that \psi_{v}(\cdot) is also L-Lipschitz regardless of v, since for any z,z^{\prime}\in\mathbb{R},

|ψv(z)ψv(z)|=|σ(vz)σ(vz)|vL|vzvz|v=L|zz|.|\psi_{v}(z)-\psi_{v}(z^{\prime})|~{}=~{}\frac{|\sigma(vz)-\sigma(vz^{\prime})|}{v}~{}\leq~{}\frac{L|vz-vz^{\prime}|}{v}~{}=~{}L|z-z^{\prime}|~{}.

Thus, the supremum over vv in Eq. (9) corresponds to a supremum over a class of LL-Lipschitz functions which all equal 0 at the origin (since ψv(0)=σ(0)v=0\psi_{v}(0)=\frac{\sigma(0)}{v}=0 by assumption). As a result, we can upper bound Eq. (9) by

bBm𝔼ϵsupψΨL,𝐰:𝐰1|i=1mϵiψ(𝐰𝐱i)|,\frac{bB}{m}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\psi\in\Psi_{L},\mathbf{w}:\|\mathbf{w}\|\leq 1}\left|\sum_{i=1}^{m}\epsilon_{i}\psi\left(\mathbf{w}^{\top}\mathbf{x}_{i}\right)\right|~{},

where ΨL\Psi_{L} is the class of all LL-Lipschitz functions which equal 0 at the origin.

To continue, it will be convenient to get rid of the absolute value in the displayed expression above. This can be done by noting that the expression equals

bBm𝔼ϵsupψΨL,𝐰:𝐰1max{i=1mϵiψ(𝐰𝐱i),i=1mϵiψ(𝐰𝐱i)}\displaystyle\frac{bB}{m}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\psi\in\Psi_{L},\mathbf{w}:\|\mathbf{w}\|\leq 1}\max\left\{\sum_{i=1}^{m}\epsilon_{i}\psi\left(\mathbf{w}^{\top}\mathbf{x}_{i}\right)~{},~{}-\sum_{i=1}^{m}\epsilon_{i}\psi\left(\mathbf{w}^{\top}\mathbf{x}_{i}\right)\right\}
()bBm𝔼ϵ[supψΨL,𝐰:𝐰1i=1mϵiψ(𝐰𝐱i)+supψΨL,𝐰:𝐰1i=1mϵiψ(𝐰𝐱i)]\displaystyle\stackrel{{\scriptstyle(*)}}{{\leq}}~{}\frac{bB}{m}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\left[\sup_{\psi\in\Psi_{L},\mathbf{w}:\|\mathbf{w}\|\leq 1}\sum_{i=1}^{m}\epsilon_{i}\psi\left(\mathbf{w}^{\top}\mathbf{x}_{i}\right)+\sup_{\psi\in\Psi_{L},\mathbf{w}:\|\mathbf{w}\|\leq 1}-\sum_{i=1}^{m}\epsilon_{i}\psi\left(\mathbf{w}^{\top}\mathbf{x}_{i}\right)\right]
=()2bBm𝔼ϵsupψΨL,𝐰:𝐰1i=1mϵiψ(𝐰𝐱i),\displaystyle\stackrel{{\scriptstyle(**)}}{{=}}~{}\frac{2bB}{m}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\psi\in\Psi_{L},\mathbf{w}:\|\mathbf{w}\|\leq 1}\sum_{i=1}^{m}\epsilon_{i}\psi\left(\mathbf{w}^{\top}\mathbf{x}_{i}\right)~{}, (10)

where ()(*) follows from the fact that max{a,b}a+b\max\{a,b\}\leq a+b for non-negative a,ba,b and the observation that the supremum is always non-negative (it is only larger, say, than the specific choice of ψ\psi being the zero function), and ()(**) is by symmetry of the function class ΨL\Psi_{L} (if ψΨL\psi\in\Psi_{L}, then ψΨL-\psi\in\Psi_{L} as well).

Considering Eq. (10), this is 2bB2bB times the Rademacher complexity of the function class {𝐱ψ(𝐰𝐱):ψΨL,𝐰1}\{\mathbf{x}\mapsto\psi(\mathbf{w}^{\top}\mathbf{x}):\psi\in\Psi_{L},\|\mathbf{w}\|\leq 1\}. In other words, this class is a composition of all linear functions of norm at most 11, and all univariate LL-Lipschitz functions crossing the origin. Fortunately, the Rademacher complexity of such composed classes was analyzed in Golowich et al. [2017] for a different purpose. In particular, noting that 𝐰𝐱i\mathbf{w}^{\top}\mathbf{x}_{i} is bounded in [bx,bx][-b_{x},b_{x}], and applying Theorem 4 from that paper, we get that Eq. (10) is upper bounded by

2bB\cdot cL\left(\frac{b_{x}}{\sqrt{m}}+\log^{3/2}(m)\cdot\hat{\mathcal{R}}_{m}(\mathcal{H})\right) (11)

for some universal constant c>0, where \mathcal{H}=\{\mathbf{x}\mapsto\mathbf{w}^{\top}\mathbf{x}:\|\mathbf{w}\|\leq 1\}, and \hat{\mathcal{R}}_{m}(\mathcal{H}) is the empirical Rademacher complexity of \mathcal{H} on \mathbf{x}_{1},\ldots,\mathbf{x}_{m}.

To complete the proof, we need to employ a standard upper bound on ^m()\hat{\mathcal{R}}_{m}(\mathcal{H}) (see Bartlett and Mendelson [2002], Shalev-Shwartz and Ben-David [2014]), which we derive below for completeness:

^m()\displaystyle\hat{\mathcal{R}}_{m}(\mathcal{H})~{} =𝔼ϵsuph1mi=1mϵih(𝐱i)=1m𝔼ϵsup𝐰:𝐰1i=1mϵi𝐰𝐱i\displaystyle=~{}\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{h\in\mathcal{H}}\frac{1}{m}\sum_{i=1}^{m}\epsilon_{i}h(\mathbf{x}_{i})~{}=~{}\frac{1}{m}\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{w}:\|\mathbf{w}\|\leq 1}\sum_{i=1}^{m}\epsilon_{i}\mathbf{w}^{\top}\mathbf{x}_{i}
=1m𝔼ϵsup𝐰:𝐰1𝐰(i=1mϵi𝐱i)=()1m𝔼ϵi=1mϵi𝐱i\displaystyle=~{}\frac{1}{m}\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{w}:\|\mathbf{w}\|\leq 1}\mathbf{w}^{\top}\left(\sum_{i=1}^{m}\epsilon_{i}\mathbf{x}_{i}\right)~{}\stackrel{{\scriptstyle(*)}}{{=}}~{}\frac{1}{m}\mathbb{E}_{\boldsymbol{\epsilon}}\left\|\sum_{i=1}^{m}\epsilon_{i}\mathbf{x}_{i}\right\|
()1m𝔼ϵi=1mϵi𝐱i2=1m𝔼ϵi,i=1mϵiϵi𝐱i𝐱i\displaystyle\stackrel{{\scriptstyle(**)}}{{\leq}}~{}\frac{1}{m}\sqrt{\mathbb{E}_{\boldsymbol{\epsilon}}\left\|\sum_{i=1}^{m}\epsilon_{i}\mathbf{x}_{i}\right\|^{2}}~{}=~{}\frac{1}{m}\sqrt{\mathbb{E}_{\boldsymbol{\epsilon}}\sum_{i,i^{\prime}=1}^{m}\epsilon_{i}\epsilon_{i^{\prime}}\mathbf{x}_{i}^{\top}\mathbf{x}_{i^{\prime}}}
=1mi=1m𝐱i21mmbx2=bxm,\displaystyle=~{}\frac{1}{m}\sqrt{\sum_{i=1}^{m}\|\mathbf{x}_{i}\|^{2}}~{}\leq~{}\frac{1}{m}\sqrt{mb_{x}^{2}}~{}=~{}\frac{b_{x}}{\sqrt{m}}~{},

where ()(*) is by the Cauchy-Schwarz inequality, and ()(**) is by Jensen’s inequality. Plugging this back into Eq. (11), we get the following upper bound:

2bBcL(bxm+log3/2(m)bxm)=2cbBbxL1+log3/2(m)m.2bB\cdot cL\left(\frac{b_{x}}{\sqrt{m}}+\log^{3/2}(m)\cdot\frac{b_{x}}{\sqrt{m}}\right)~{}=~{}2cbBb_{x}L\cdot\frac{1+\log^{3/2}(m)}{\sqrt{m}}~{}.

Upper bounding this by ϵ\epsilon, solving for mm and simplifying a bit, the result follows.
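As an aside (ours, purely illustrative), the bound \hat{\mathcal{R}}_{m}(\mathcal{H})\leq b_{x}/\sqrt{m} derived above is easy to probe numerically, since for the linear class the supremum has the closed form \sup_{\|\mathbf{w}\|\leq 1}\mathbf{w}^{\top}\mathbf{v}=\|\mathbf{v}\|. The following sketch estimates the empirical Rademacher complexity by Monte Carlo for arbitrary synthetic inputs of norm b_{x} and compares it to b_{x}/\sqrt{m}.

```python
import numpy as np

rng = np.random.default_rng(0)
d, bx, trials = 20, 1.0, 2000

for m in [50, 200, 800]:
    # m inputs of norm exactly bx
    X = rng.standard_normal((m, d))
    X = bx * X / np.linalg.norm(X, axis=1, keepdims=True)

    # sup_{||w||<=1} (1/m) sum_i eps_i w^T x_i = (1/m) || sum_i eps_i x_i ||
    vals = []
    for _ in range(trials):
        eps = rng.choice([-1.0, 1.0], size=m)
        vals.append(np.linalg.norm(eps @ X) / m)
    print(f"m={m:4d}: empirical Rademacher complexity ~ {np.mean(vals):.4f},"
          f"  bound bx/sqrt(m) = {bx/np.sqrt(m):.4f}")
```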

A.3 Proof of Thm. 3

We fix a number of inputs mm to be chosen later. We let XX be the d×md\times m matrix whose ii-th column is 𝐱i\mathbf{x}_{i}. We choose X to be any matrix such that the following conditions hold for some universal constant c>0c>0:

  • Every entry of X is in \{\pm\frac{b_{x}}{\sqrt{d}}\} (hence \forall i,~{}\|\mathbf{x}_{i}\|=b_{x})

  • maxii|𝐱i𝐱i|cbx2log(d)d\max_{i^{\prime}\neq i}|\mathbf{x}_{i}^{\top}\mathbf{x}_{i^{\prime}}|\leq cb_{x}^{2}\sqrt{\frac{\log(d)}{d}}

  • Xcbx(1+md)\|X\|\leq cb_{x}\left(1+\sqrt{\frac{m}{d}}\right).

The existence of such a matrix follows from the probabilistic method: If we simply choose each entry of XX independently and uniformly from {±1d}\{\pm\frac{1}{\sqrt{d}}\}, then the first condition automatically holds; The second condition holds with high probability by a standard concentration of measure argument and a union bound; And the third condition holds with arbitrarily high constant probability (by Markov’s inequality and the fact that 𝔼[dbxX]c(d+m)\mathbb{E}[\|\frac{\sqrt{d}}{b_{x}}\cdot X\|]\leq c(\sqrt{d}+\sqrt{m}), see for example Seginer [2000]). Thus, by a union bound, a random matrix satisfies all of the above with some positive probability, hence such a matrix XX exists.
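For illustration (ours; the universal constant c and the failure probabilities are not tracked here), the following sketch draws a random sign matrix scaled as above and reports the three quantities controlled by the conditions: the column norms, the maximal off-diagonal inner product against b_{x}^{2}\sqrt{\log(d)/d}, and the spectral norm against b_{x}(1+\sqrt{m/d}).

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, bx = 2000, 500, 1.0

X = rng.choice([-1.0, 1.0], size=(d, m)) * (bx / np.sqrt(d))   # columns are the x_i

G = X.T @ X                                    # Gram matrix of the columns
off = np.abs(G - np.diag(np.diag(G))).max()    # max_{i != i'} |x_i^T x_{i'}|

print("all column norms equal bx   :", np.allclose(np.linalg.norm(X, axis=0), bx))
print("max off-diagonal correlation:", off,
      " vs  bx^2*sqrt(log(d)/d) =", bx**2 * np.sqrt(np.log(d) / d))
print("spectral norm of X          :", np.linalg.norm(X, 2),
      " vs  bx*(1+sqrt(m/d)) =", bx * (1 + np.sqrt(m / d)))
```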

Given this input set, it is enough to show that for any 𝐲{0,1}m\mathbf{y}\in\{0,1\}^{m}, we can find a predictor (namely, 𝐮,W\mathbf{u},W depending on 𝐲\mathbf{y}) in our class, such that 𝐮b\|\mathbf{u}\|\leq b, WB\|W\|\leq B, and

i,𝐮σ(W𝐱i)is{0yi=02ϵyi=1.\forall i~{},\mathbf{u}^{\top}\sigma(W\mathbf{x}_{i})~{}\text{is}~{}\begin{cases}\leq 0&y_{i}=0\\ \geq 2\epsilon&y_{i}=1\end{cases}~{}. (12)

We will do so as follows: Letting a0,p[0,1]a\geq 0,p\in[0,1] be some parameters to be chosen later, we let

𝐮=bn𝟏andW=1bx2Vdiag(𝐲)X,\mathbf{u}=\frac{b}{\sqrt{n}}\mathbf{1}~{}~{}~{}\text{and}~{}~{}~{}W=\frac{1}{b_{x}^{2}}\cdot V\text{diag}(\mathbf{y})X^{\top}~{},

where V\in\mathbb{R}^{n\times m} is a random matrix with i.i.d. entries chosen as follows:

Vk,i={0w.p.1paw.p.p2aw.p.p2.V_{k,i}~{}=~{}\begin{cases}0&\text{w.p.}~{}1-p\\ a&\text{w.p.}~{}\frac{p}{2}\\ -a&\text{w.p.}~{}\frac{p}{2}~{}.\end{cases}

Note that the condition 𝐮b\|\mathbf{u}\|\leq b follows directly from the definition of 𝐮\mathbf{u}. We will show that there is a way to choose the parameters a,pa,p such that the following holds: For any 𝐲{0,1}m\mathbf{y}\in\{0,1\}^{m}, with high probability over the choice of VV, Eq. (12) holds as well as WB\|W\|\leq B. This implies that for any 𝐲\mathbf{y}, there exists some fixed choice of VV (and hence WW) such that WB\|W\|\leq B as well as Eq. (12) holds, implying the theorem statement.

We break this argument into two lemmas:

Lemma 1.

There exists a universal constant c>0c^{\prime}>0 such that the following holds: For any ϵ0\epsilon\geq 0, δ(0,exp(1))\delta\in(0,\exp(-1)) and 𝐲{0,1}m\mathbf{y}\in\{0,1\}^{m}, if we assume

β=calog(d)dlog(mδ)(pm+1)\beta=c^{\prime}a\sqrt{\frac{\log(d)}{d}}\log\left(\frac{m}{\delta}\right)\left(\sqrt{pm}+1\right)

as well as a4βa\geq 4\beta and bapn8ϵbap\sqrt{n}\geq 8\epsilon, then Eq. (12) holds with probability at least 1δmexp(pn/16)1-\delta-m\exp(-pn/16) over the choice of VV.

Proof.

Let 𝐰k\mathbf{w}_{k} be the kk-th row of WW. Fixing some i[m]i\in[m], we have

𝐮σ(W𝐱i)\displaystyle\mathbf{u}^{\top}\sigma(W\mathbf{x}_{i})~{} =𝐮[W𝐱iβ]+=bnk=1n[𝐰k𝐱iβ]+=bnk=1n[i=1m1bx2Vk,iyi𝐱i𝐱iβ]+\displaystyle=~{}\mathbf{u}^{\top}[W\mathbf{x}_{i}-\beta]_{+}~{}=~{}\frac{b}{\sqrt{n}}\sum_{k=1}^{n}[\mathbf{w}_{k}^{\top}\mathbf{x}_{i}-\beta]_{+}~{}=~{}\frac{b}{\sqrt{n}}\sum_{k=1}^{n}\left[\sum_{i^{\prime}=1}^{m}\frac{1}{b_{x}^{2}}V_{k,i^{\prime}}y_{i^{\prime}}\mathbf{x}_{i^{\prime}}^{\top}\mathbf{x}_{i}-\beta\right]_{+}
=bnk=1n[Vk,iyi+ii1bx2Vk,iyi𝐱i𝐱iβ]+.\displaystyle=~{}\frac{b}{\sqrt{n}}\sum_{k=1}^{n}\left[V_{k,i}y_{i}+\sum_{i^{\prime}\neq i}\frac{1}{b_{x}^{2}}V_{k,i^{\prime}}y_{i^{\prime}}\mathbf{x}_{i^{\prime}}^{\top}\mathbf{x}_{i}-\beta\right]_{+}~{}. (13)

Recalling the assumptions on XX and the random choice of VV, note that ii1bx2Vk,iyi𝐱i𝐱i\sum_{i^{\prime}\neq i}\frac{1}{b_{x}^{2}}V_{k,i^{\prime}}y_{i^{\prime}}\mathbf{x}_{i^{\prime}}^{\top}\mathbf{x}_{i} is the sum of m1m-1 independent random variables, each with mean 0, absolute value at most |abx2yi𝐱i𝐱i|aclog(d)d|\frac{a}{b_{x}^{2}}y_{i^{\prime}}\mathbf{x}_{i^{\prime\top}}\mathbf{x}_{i}|\leq ac\sqrt{\frac{\log(d)}{d}}, and standard deviation at most paclog(d)d\sqrt{p}\cdot ac\sqrt{\frac{\log(d)}{d}}. Thus, by Bernstein’s inequality, for any δ(0,exp(1))\delta\in(0,\exp(-1)), it holds with probability at least 1δ1-\delta that

|ii1bx2Vk,iyi𝐱i𝐱i|\displaystyle\left|\sum_{i^{\prime}\neq i}\frac{1}{b_{x}^{2}}V_{k,i^{\prime}}y_{i^{\prime}}\mathbf{x}_{i^{\prime}}^{\top}\mathbf{x}_{i}\right|~{} c(palog(d)d(m1)log(1δ)+alog(d)dlog(1δ))\displaystyle\leq~{}c^{\prime}\left(\sqrt{p}\cdot a\sqrt{\frac{\log(d)}{d}}\cdot\sqrt{(m-1)\log\left(\frac{1}{\delta}\right)}+a\sqrt{\frac{\log(d)}{d}}\cdot\log\left(\frac{1}{\delta}\right)\right)
calog(d)dlog(1δ)(pm+1),\displaystyle\leq~{}c^{\prime}a\sqrt{\frac{\log(d)}{d}}\log\left(\frac{1}{\delta}\right)\left(\sqrt{pm}+1\right)~{},

where c>0c^{\prime}>0 is some universal constant. Applying a union bound over all i[m]i\in[m], we get that with probability at least 1δ1-\delta,

maxi[m]|ii1bx2Vk,iyi𝐱i𝐱i|calog(d)dlog(mδ)(pm+1).\max_{i\in[m]}\left|\sum_{i^{\prime}\neq i}\frac{1}{b_{x}^{2}}V_{k,i^{\prime}}y_{i^{\prime}}\mathbf{x}_{i^{\prime}}^{\top}\mathbf{x}_{i}\right|~{}\leq~{}c^{\prime}a\sqrt{\frac{\log(d)}{d}}\log\left(\frac{m}{\delta}\right)\left(\sqrt{pm}+1\right)~{}.

Recalling that we choose β\beta to equal this upper bound, and plugging back into Eq. (13), we get that with probability at least 1δ1-\delta,

i[m],𝐮σ(W𝐱i)is{bnk=1n[Vk,iyi]+=0ifyi=0bnk=1n[Vk,iyi2β]+=bnk=1n[Vk,i2β]+ifyi=1.\forall i\in[m],~{}~{}\mathbf{u}^{\top}\sigma(W\mathbf{x}_{i})~{}~{}\text{is}~{}~{}\begin{cases}\leq\frac{b}{\sqrt{n}}\sum_{k=1}^{n}[V_{k,i}y_{i}]_{+}=0&\text{if}~{}y_{i}=0\\ \geq\frac{b}{\sqrt{n}}\sum_{k=1}^{n}[V_{k,i}y_{i}-2\beta]_{+}~{}=~{}\frac{b}{\sqrt{n}}\sum_{k=1}^{n}[V_{k,i}-2\beta]_{+}&\text{if}~{}y_{i}=1\end{cases}~{}.

Moreover, by the assumption a4βa\geq 4\beta, we have

bnk=1n[Vk,i2β]+bnk:Vk,i=a[aa2]+ba2nk:Vk,i=a1.\frac{b}{\sqrt{n}}\sum_{k=1}^{n}[V_{k,i}-2\beta]_{+}~{}\geq~{}\frac{b}{\sqrt{n}}\sum_{k:V_{k,i}=a}\left[a-\frac{a}{2}\right]_{+}~{}\geq~{}\frac{ba}{2\sqrt{n}}\sum_{k:V_{k,i}=a}1~{}.

Note that 𝔼V[k:Vk,i=a1]=pn2\mathbb{E}_{V}[\sum_{k:V_{k,i}=a}1]=\frac{pn}{2}. Thus, by a standard multiplicative Chernoff bound and a union bound, k:Vk,i=a1pn4\sum_{k:V_{k,i}=a}1\geq\frac{pn}{4} simultaneously for all i[m]i\in[m], with probability at least 1mexp(pn/16)1-m\exp(-pn/16). Combining with the above using a union bound, we get that with probability at least 1δmexp(pn/16)1-\delta-m\exp(-pn/16) over the choice of VV,

i[m],𝐮σ(W𝐱i)is{0ifyi=0bapn4ifyi=1.\forall i\in[m],~{}~{}\mathbf{u}^{\top}\sigma(W\mathbf{x}_{i})~{}~{}\text{is}~{}~{}\begin{cases}\leq 0&\text{if}~{}y_{i}=0\\ \geq\frac{bap\sqrt{n}}{4}&\text{if}~{}y_{i}=1\end{cases}~{}.

Since we assume bapn42ϵ\frac{bap\sqrt{n}}{4}\geq 2\epsilon, the result follows. ∎

Lemma 2.

For any 𝐲{0,1}m\mathbf{y}\in\{0,1\}^{m}, with probability at least 12\frac{1}{2} over the random choice of VV, the matrix WW satisfies

WFa2nmpbx.\|W\|_{F}~{}\leq~{}\frac{a\sqrt{2nmp}}{b_{x}}~{}.
Proof.

By definition of W,VW,V and XX, we have

𝔼[WF2]\displaystyle\mathbb{E}[\|W\|_{F}^{2}]~{} =k=1ni=1d𝔼[Wk,i2]=k=1ni=1d𝔼[(j=1m1bx2Vk,jyjXi,j)2]\displaystyle=~{}\sum_{k=1}^{n}\sum_{i=1}^{d}\mathbb{E}[W_{k,i}^{2}]~{}=~{}\sum_{k=1}^{n}\sum_{i=1}^{d}\mathbb{E}\left[\left(\sum_{j=1}^{m}\frac{1}{b_{x}^{2}}V_{k,j}y_{j}X_{i,j}\right)^{2}\right]
=1bx4k=1ni=1d𝔼[j,j=1mVk,jVk,jyjyjXi,jXi,j]\displaystyle=~{}\frac{1}{b_{x}^{4}}\cdot\sum_{k=1}^{n}\sum_{i=1}^{d}\mathbb{E}\left[\sum_{j,j^{\prime}=1}^{m}V_{k,j}V_{k,j^{\prime}}y_{j}y_{j^{\prime}}X_{i,j}X_{i,j^{\prime}}\right]
=1bx4k=1ni=1dj=1m𝔼[Vk,j2yj2Xi,j2]1bx4bx2dk=1ni=1dj=1m𝔼[Vk,j2]\displaystyle=~{}\frac{1}{b_{x}^{4}}\cdot\sum_{k=1}^{n}\sum_{i=1}^{d}\sum_{j=1}^{m}\mathbb{E}\left[V_{k,j}^{2}y_{j}^{2}X_{i,j}^{2}\right]~{}\leq~{}\frac{1}{b_{x}^{4}}\cdot\frac{b_{x}^{2}}{d}\cdot\sum_{k=1}^{n}\sum_{i=1}^{d}\sum_{j=1}^{m}\mathbb{E}[V_{k,j}^{2}]
=1bx2dndmpa2=nmpa2bx2.\displaystyle=~{}\frac{1}{b_{x}^{2}d}\cdot ndm\cdot pa^{2}~{}=~{}\frac{nmpa^{2}}{b_{x}^{2}}~{}.

By Markov’s inequality, it follows that with probability at least 12\frac{1}{2}, WF22nmpa2bx2\|W\|_{F}^{2}\leq 2\cdot\frac{nmpa^{2}}{b_{x}^{2}}, from which the result follows. ∎
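The expectation computed in the proof of Lemma 2 can be checked by simulation; in the sketch below (ours), X is a fixed sign matrix scaled as in the construction, \mathbf{y} is taken to be all-ones (for which the y_{j}^{2}\leq 1 step is tight), and the empirical mean of \|W\|_{F}^{2} is compared to nmpa^{2}/b_{x}^{2}. The particular values of n,m,d,p,a are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d, bx = 50, 40, 200, 1.0
p, a = 0.2, 0.7
y = np.ones(m)                                  # with y all-ones the bound is tight

X = rng.choice([-1.0, 1.0], size=(d, m)) * (bx / np.sqrt(d))

vals = []
for _ in range(500):
    U = rng.random((n, m))
    # V_{k,i} = 0 w.p. 1-p, and +-a each w.p. p/2
    V = np.where(U < p, a * rng.choice([-1.0, 1.0], size=(n, m)), 0.0)
    W = (1.0 / bx**2) * V @ np.diag(y) @ X.T
    vals.append(np.linalg.norm(W, 'fro')**2)

print("empirical mean of ||W||_F^2:", np.mean(vals))
print("n*m*p*a^2 / bx^2           :", n * m * p * a**2 / bx**2)
```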

Combining Lemma 1 and Lemma 2, and choosing \delta=1/4, we get that with some positive probability over the choice of V, both the shattering condition in Eq. (12) and the bound \|W\|_{F}\leq B hold, provided the following combination of conditions is met (for a suitable universal constant c_{1}>0):

mexp(pn16)<14,ac1alog(d)dlog(4m)(pm+1),bapn8ϵ,a2nmpBbx.m\exp\left(-\frac{pn}{16}\right)<\frac{1}{4}~{}~{},~{}~{}a\geq c_{1}a\sqrt{\frac{\log(d)}{d}}\log(4m)(\sqrt{pm}+1)~{}~{},~{}~{}bap\sqrt{n}\geq 8\epsilon~{}~{},~{}~{}a\sqrt{2nmp}~{}\leq~{}Bb_{x}~{}.

We now wish to choose the free parameters p,ap,a, to ensure that all these conditions are met (hence we indeed manage to shatter the dataset), while allowing the size mm of the shattered set to be as large as possible. We begin by noting that the first condition is satisfied if p>c2log(m)np>c_{2}\frac{\log(m)}{n}, and the second condition is satisfied if dc3d\geq c_{3} and pc4dlog(d)log2(4m)mp\leq c_{4}\frac{d}{\log(d)\log^{2}(4m)m} (for suitable universal constants c2,c3,c4>0c_{2},c_{3},c_{4}>0). Thus, it is enough to require

dc3,c2log(m)n<pc4dlog(d)log2(4m)m,bapn8ϵ,a2nmpBbx.d\geq c_{3}~{}~{},~{}~{}c_{2}\frac{\log(m)}{n}~{}<~{}p~{}\leq~{}c_{4}\frac{d}{\log(d)\log^{2}(4m)m}~{}~{},~{}~{}bap\sqrt{n}\geq 8\epsilon~{}~{},~{}~{}a\sqrt{2nmp}\leq Bb_{x}~{}. (14)

Let us pick in particular

p=c4dlog(d)log2(4m)mp~{}=~{}c_{4}\frac{d}{\log(d)\log^{2}(4m)m}

(which is valid if it is in [0,1][0,1] and if c2log(m)nc4dlog(d)log2(4m)mc_{2}\frac{\log(m)}{n}\leq c_{4}\frac{d}{\log(d)\log^{2}(4m)m}, or equivalently mlog(m)log2(4m)c4ndc2log(d))m\log(m)\log^{2}(4m)\leq\frac{c_{4}nd}{c_{2}\log(d)}) and

a=8ϵbpn=8ϵlog(d)log2(4m)mc4bdna~{}=~{}\frac{8\epsilon}{bp\sqrt{n}}~{}=~{}\frac{8\epsilon\log(d)\log^{2}(4m)m}{c_{4}bd\sqrt{n}}

(which automatically satisfies the third condition in Eq. (14)). Plugging into Eq. (14), the required conditions hold if we assume

dc3,c4dlog(d)log2(4m)m1,mlog3(4m)c5ndlog(d),c6ϵlog(d)log(4m)mbdBbx\displaystyle d\geq c_{3}~{}~{},~{}~{}\frac{c_{4}d}{\log(d)\log^{2}(4m)m}\leq 1~{}~{},~{}~{}m\log^{3}(4m)\leq\frac{c_{5}nd}{\log(d)}~{}~{},~{}~{}c_{6}\frac{\epsilon\sqrt{\log(d)}\log(4m)m}{b\sqrt{d}}\leq Bb_{x}

for appropriate universal constants c5,c6>0c_{5},c_{6}>0. The first two conditions are satisfied if we require mc7dc8m\geq c_{7}d\geq c_{8} for suitable universal constants c7,c8>0c_{7},c_{8}>0. Thus, it is enough to require the set of conditions

m\geq c_{7}d\geq c_{8}~{}~{},~{}~{}m\log^{3}(4m)\leq\frac{c_{5}nd}{\log(d)}~{}~{},~{}~{}m\log(4m)\leq\frac{bBb_{x}\sqrt{d}}{c_{6}\epsilon\sqrt{\log(d)}}~{}.

All these conditions are satisfied if we assume d\geq c_{8}/c_{7}, pick

m=Θ~(min{nd,bBbxϵd})m=\tilde{\Theta}\left(\min\left\{nd,\frac{bBb_{x}}{\epsilon}\sqrt{d}\right\}\right) (15)

(with the \tilde{\Theta} hiding constants and factors polylogarithmic in d,n,b,B,b_{x},\frac{1}{\epsilon}), and assume that the parameters are such that this expression is sufficiently larger than d, and that d is larger than some universal constant.

It only remains to track what value of β\beta we have chosen (as a function of the problem parameters). Combining Lemma 1, the choice of a,pa,p from earlier, as well as Eq. (15), it follows that

β=Θ~(ad(1+pm))=Θ~(ϵmbddn(1+d))=Θ~(ϵmbdn)=Θ~(min{ϵnb,Bbxdn}),\beta~{}=~{}\tilde{\Theta}\left(\frac{a}{\sqrt{d}}(1+\sqrt{pm})\right)~{}=~{}\tilde{\Theta}\left(\frac{\epsilon m}{bd\sqrt{dn}}(1+\sqrt{d})\right)~{}=~{}\tilde{\Theta}\left(\frac{\epsilon m}{bd\sqrt{n}}\right)~{}=~{}\tilde{\Theta}\left(\min\left\{\frac{\epsilon\sqrt{n}}{b}~{},~{}\frac{Bb_{x}}{\sqrt{dn}}\right\}\right)~{},

which is at most 𝒪~(Bbx/dn)\tilde{\mathcal{O}}(Bb_{x}/\sqrt{dn}).

A.4 Proof of Corollary 1

Thm. 3 implies that a certain dataset \{\mathbf{x}_{i}\}_{i=1}^{m} of points in \mathbb{R}^{d} of norm at most b_{x} (where m is the lower bound stated in the theorem) can be shattered with margin \epsilon, using networks in \mathcal{F}^{\sigma}_{b,B,n,d} of the form \mathbf{x}\mapsto\mathbf{u}^{\top}\sigma(W\mathbf{x}), where \sigma(z)=[z-\beta]_{+} for some \beta\in\left[0,\tilde{\mathcal{O}}(\frac{Bb_{x}}{\sqrt{dn}})\right]. Our key observation is the following: Any network \mathbf{x}\mapsto\mathbf{u}^{\top}\sigma(W\mathbf{x}) can be equivalently written as \tilde{\mathbf{x}}\mapsto\mathbf{u}^{\top}[\tilde{W}\tilde{\mathbf{x}}]_{+}, where \tilde{\mathbf{x}}=(\mathbf{x},b_{x}) and \tilde{W}=[W~{},~{}-\frac{\beta}{b_{x}}\cdot\mathbf{1}] (namely, we add to W another column with every entry equal to -\frac{\beta}{b_{x}}). Note that if \|\mathbf{x}\|\leq b_{x}, then \|\tilde{\mathbf{x}}\|\leq\sqrt{2}b_{x}, and \|\tilde{W}\|\leq\|W\|+\|-\frac{\beta}{b_{x}}\cdot\mathbf{1}\|\leq B+\frac{\beta}{b_{x}}\sqrt{n}\leq 2B under the corollary's conditions. Thus, if we can shatter a set of points \{\mathbf{x}_{i}\}_{i=1}^{m} of norm at most b_{x} in \mathbb{R}^{d} using networks from \mathcal{F}^{\sigma}_{b,B,n,d}, we can also shatter \{\tilde{\mathbf{x}}_{i}\}_{i=1}^{m} in \mathbb{R}^{d+1} (with norm \leq\sqrt{2}b_{x}) using networks from \mathcal{F}^{[\cdot]_{+}}_{b,2B,n,d+1}. Rescaling b_{x},B,d appropriately, we get the same shattering number lower bound for \mathcal{F}^{[\cdot]_{+}}_{b,B,n,d} and inputs with norm \leq b_{x}, up to small numerical constants which get absorbed into the \tilde{\Omega}(\cdot) notation.
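A minimal numeric check of this reduction (ours; the dimensions, \beta and the weights are arbitrary): for a shifted-ReLU network \mathbf{x}\mapsto\mathbf{u}^{\top}[W\mathbf{x}-\beta]_{+} and the augmented pair \tilde{\mathbf{x}}=(\mathbf{x},b_{x}), \tilde{W}=[W,-\frac{\beta}{b_{x}}\mathbf{1}], the two networks agree on every input, and the norm inflation \|\tilde{W}\|\leq\|W\|+\frac{\beta}{b_{x}}\sqrt{n} can be read off directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, bx, beta = 30, 10, 2.0, 0.15
relu = lambda z: np.maximum(z, 0.0)

W = rng.standard_normal((n, d))
u = rng.standard_normal(n)
W_tilde = np.hstack([W, -(beta / bx) * np.ones((n, 1))])   # append the bias column

for _ in range(5):
    x = rng.standard_normal(d)
    x = bx * x / np.linalg.norm(x)                         # ||x|| = bx
    x_tilde = np.append(x, bx)                             # ||x_tilde|| = sqrt(2)*bx
    out1 = u @ relu(W @ x - beta)                          # shifted-ReLU network
    out2 = u @ relu(W_tilde @ x_tilde)                     # plain-ReLU network on augmented input
    assert np.isclose(out1, out2)

print("||x_tilde|| / ||x||                          :", np.sqrt(2))
print("||W_tilde|| <= ||W|| + (beta/bx)*sqrt(n)     :",
      np.linalg.norm(W_tilde, 2) <= np.linalg.norm(W, 2) + (beta / bx) * np.sqrt(n))
```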

A.5 Proofs of Thm. 4 and Thm. 5

In what follows, given a vector 𝐮i\mathbf{u}_{i}, we let ui,ju_{i,j} denote its jj-th entry.

The proofs rely on the following two key technical lemmas:

Lemma 3.

Let W be a matrix such that \|W\|\leq 1, with row vectors \mathbf{w}_{1},\mathbf{w}_{2},\ldots. Then the following holds for any set of vectors \{\mathbf{u}_{i}\} with the same dimensionality as \mathbf{w}_{1}, and any scalars \{z_{i,\ell}\},\{z_{i}\} indexed by i and \ell:

(i(𝐰𝐮i)zi,)2,r(iui,rzi,)2\sum_{\ell}\left(\sum_{i}(\mathbf{w}_{\ell}^{\top}\mathbf{u}_{i})z_{i,\ell}\right)^{2}~{}\leq~{}\sum_{\ell,r}\left(\sum_{i}u_{i,r}z_{i,\ell}\right)^{2}

and

(i(𝐰𝐮i)zi)2r(iui,rzi)2,\sum_{\ell}\left(\sum_{i}(\mathbf{w}_{\ell}^{\top}\mathbf{u}_{i})z_{i}\right)^{2}~{}\leq~{}\sum_{r}\left(\sum_{i}u_{i,r}z_{i}\right)^{2}~{},

where the sum over r is over all coordinates of each \mathbf{u}_{i}.

Proof.

Starting with the first inequality, the left hand side equals

\sum_{\ell}\left(\mathbf{w}_{\ell}^{\top}\left(\sum_{i}\mathbf{u}_{i}z_{i,\ell}\right)\right)^{2}~{}\leq~{}\sum_{\ell,\ell^{\prime}}\left(\mathbf{w}_{\ell^{\prime}}^{\top}\left(\sum_{i}\mathbf{u}_{i}z_{i,\ell}\right)\right)^{2}~{}=~{}\sum_{\ell}\left\|W\left(\sum_{i}\mathbf{u}_{i}z_{i,\ell}\right)\right\|^{2}~{}.

Since \|W\|\leq 1 (where \|\cdot\| denotes the spectral norm), this is at most \sum_{\ell}\left\|\sum_{i}\mathbf{u}_{i}z_{i,\ell}\right\|^{2}~{}=~{}\sum_{\ell,r}\left(\sum_{i}u_{i,r}z_{i,\ell}\right)^{2}. As to the second inequality, the left hand side equals

\sum_{\ell}\left(\mathbf{w}_{\ell}^{\top}\left(\sum_{i}\mathbf{u}_{i}z_{i}\right)\right)^{2}~{}=~{}\left\|W\left(\sum_{i}\mathbf{u}_{i}z_{i}\right)\right\|^{2}~{}\leq~{}\left\|\sum_{i}\mathbf{u}_{i}z_{i}\right\|^{2}=\sum_{r}\left(\sum_{i}u_{i,r}z_{i}\right)^{2}

where we again used the assumption \|W\|\leq 1. ∎
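Both inequalities in Lemma 3 are straightforward to verify numerically; the following sketch (ours, with arbitrary small dimensions) rescales a random matrix to spectral norm 1 and checks them for random \mathbf{u}_{i}, z_{i,\ell} and z_{i}.

```python
import numpy as np

rng = np.random.default_rng(0)
n_rows, dim, n_i = 6, 8, 5               # rows of W, dimension of the u_i, number of i indices

W = rng.standard_normal((n_rows, dim))
W /= np.linalg.norm(W, 2)                # enforce ||W|| <= 1 (spectral norm)

U = rng.standard_normal((n_i, dim))      # rows are the vectors u_i
Z = rng.standard_normal((n_i, n_rows))   # scalars z_{i,ell}, with ell indexing rows of W
z = rng.standard_normal(n_i)             # scalars z_i

WU = U @ W.T                             # WU[i, ell] = w_ell^T u_i

lhs1 = np.sum(np.sum(WU * Z, axis=0) ** 2)   # sum_ell ( sum_i (w_ell^T u_i) z_{i,ell} )^2
rhs1 = np.sum((Z.T @ U) ** 2)                # sum_{ell,r} ( sum_i u_{i,r} z_{i,ell} )^2
print("first inequality holds :", lhs1 <= rhs1 + 1e-10)

lhs2 = np.sum((WU.T @ z) ** 2)               # sum_ell ( sum_i (w_ell^T u_i) z_i )^2
rhs2 = np.sum((z @ U) ** 2)                  # sum_r ( sum_i u_{i,r} z_i )^2
print("second inequality holds:", lhs2 <= rhs2 + 1e-10)
```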

Lemma 4.

Given a vector 𝐮din\mathbf{u}\in\mathbb{R}^{d_{in}}, a matrix Wdout×dinW\in\mathbb{R}^{d_{out}\times d_{in}} with row vectors 𝐰1,𝐰2,\mathbf{w}_{1},\mathbf{w}_{2},\ldots such that WB\|W\|\leq B, and a positive integer kk, define

f(𝐮)=(W𝐮)k,f(\mathbf{u})=(W\mathbf{u})^{\circ k}~{},

where ∘k denotes taking the kk-th power element-wise. Then for any positive integer rr, any vectors 𝐮1,𝐮2,\mathbf{u}_{1},\mathbf{u}_{2},\ldots in din\mathbb{R}^{d_{in}} and any scalars ϵ1,ϵ2,\epsilon_{1},\epsilon_{2},\ldots, it holds that

1,,r=1dout(iϵif(𝐮i)1f(𝐮i)r)2B2rk1,,rk=1din(iϵiui,1ui,rk)2.\sum_{\ell_{1},\ldots,\ell_{r}=1}^{d_{out}}\left(\sum_{i}\epsilon_{i}f(\mathbf{u}_{i})_{\ell_{1}}\cdots f(\mathbf{u}_{i})_{\ell_{r}}\right)^{2}~{}\leq~{}B^{2rk}\cdot\sum_{\ell_{1},\ldots,\ell_{rk}=1}^{d_{in}}\left(\sum_{i}\epsilon_{i}u_{i,\ell_{1}}\cdots u_{i,\ell_{rk}}\right)^{2}~{}.
Proof.

It is enough to prove the result for WW such that W=1\|W\|=1 (and therefore B=1B=1): For any other WW, apply the result on f~(𝐮):=(WW𝐮)k=1Wkf(𝐮)\tilde{f}(\mathbf{u}):=(\frac{W}{\|W\|}\mathbf{u})^{\circ k}=\frac{1}{\|W\|^{k}}f(\mathbf{u}), and rescale accordingly.

The left hand side equals

1r=1dout(iϵi(𝐰1𝐮i)k(𝐰r𝐮i)k)2\sum_{\ell_{1}\ldots\ell_{r}=1}^{d_{out}}\left(\sum_{i}\epsilon_{i}(\mathbf{w}_{\ell_{1}}^{\top}\mathbf{u}_{i})^{\circ k}\cdots(\mathbf{w}_{\ell_{r}}^{\top}\mathbf{u}_{i})^{\circ k}\right)^{2} (16)

Note that the term inside the square involves the product of rkrk terms. We now simplify them one-by-one using Lemma 3: To start, we note that the above can be written as

2r=1dout1=1dout(i(𝐰1𝐮i)ϵi(𝐰1𝐮i)k1(𝐰2𝐮i)k(𝐰r𝐮i)k)2\sum_{\ell_{2}\ldots\ell_{r}=1}^{d_{out}}\sum_{\ell_{1}=1}^{d_{out}}\left(\sum_{i}(\mathbf{w}_{\ell_{1}}^{\top}\mathbf{u}_{i})\cdot\epsilon_{i}(\mathbf{w}_{\ell_{1}}^{\top}\mathbf{u}_{i})^{\circ k-1}(\mathbf{w}_{\ell_{2}}^{\top}\mathbf{u}_{i})^{\circ k}\cdots(\mathbf{w}_{\ell_{r}}^{\top}\mathbf{u}_{i})^{\circ k}\right)^{2}

Denoting ϵi(𝐰1𝐮i)k1(𝐰2𝐮i)k(𝐰r𝐮i)k\epsilon_{i}(\mathbf{w}_{\ell_{1}}^{\top}\mathbf{u}_{i})^{\circ k-1}(\mathbf{w}_{\ell_{2}}^{\top}\mathbf{u}_{i})^{\circ k}\cdots(\mathbf{w}_{\ell_{r}}^{\top}\mathbf{u}_{i})^{\circ k} as zi,1z_{i,\ell_{1}} and plugging the first inequality in Lemma 3, the above is at most

2r=1dout1=1dout1=1din(iui,1ϵi(𝐰1𝐮i)k1(𝐰2𝐮i)k(𝐰r𝐮i)k)2\sum_{\ell_{2}\ldots\ell_{r}=1}^{d_{out}}\sum_{\ell_{1}=1}^{d_{out}}\sum_{\ell^{\prime}_{1}=1}^{d_{in}}\left(\sum_{i}u_{i,\ell^{\prime}_{1}}\epsilon_{i}(\mathbf{w}_{\ell_{1}}^{\top}\mathbf{u}_{i})^{\circ k-1}(\mathbf{w}_{\ell_{2}}^{\top}\mathbf{u}_{i})^{\circ k}\cdots(\mathbf{w}_{\ell_{r}}^{\top}\mathbf{u}_{i})^{\circ k}\right)^{2}

Again pulling out one of the product terms in front, we can rewrite this as

2r=1dout1=1din1=1dout(i(𝐰1𝐮i)ui,1ϵi(𝐰1𝐮i)k2(𝐰2𝐮i)k(𝐰r𝐮i)k)2.\sum_{\ell_{2}\ldots\ell_{r}=1}^{d_{out}}\sum_{\ell^{\prime}_{1}=1}^{d_{in}}\sum_{\ell_{1}=1}^{d_{out}}\left(\sum_{i}(\mathbf{w}_{\ell_{1}}^{\top}\mathbf{u}_{i})\cdot u_{i,\ell^{\prime}_{1}}\epsilon_{i}(\mathbf{w}_{\ell_{1}}^{\top}\mathbf{u}_{i})^{\circ k-2}(\mathbf{w}_{\ell_{2}}^{\top}\mathbf{u}_{i})^{\circ k}\cdots(\mathbf{w}_{\ell_{r}}^{\top}\mathbf{u}_{i})^{\circ k}\right)^{2}~{}.

Again using the first inequality in Lemma 3, this is at most

2r=1dout1,1′′=1din1=1dout(iui,1′′ui,1ϵi(𝐰1𝐮i)k2(𝐰2𝐮i)k(𝐰r𝐮i)k)2.\sum_{\ell_{2}\ldots\ell_{r}=1}^{d_{out}}\sum_{\ell^{\prime}_{1},\ell^{\prime\prime}_{1}=1}^{d_{in}}\sum_{\ell_{1}=1}^{d_{out}}\left(\sum_{i}u_{i,\ell^{\prime\prime}_{1}}u_{i,\ell^{\prime}_{1}}\epsilon_{i}(\mathbf{w}_{\ell_{1}}^{\top}\mathbf{u}_{i})^{\circ k-2}(\mathbf{w}_{\ell_{2}}^{\top}\mathbf{u}_{i})^{\circ k}\cdots(\mathbf{w}_{\ell_{r}}^{\top}\mathbf{u}_{i})^{\circ k}\right)^{2}~{}.

Repeating this to get rid of all but the last (𝐰1𝐮i)(\mathbf{w}_{\ell_{1}}^{\top}\mathbf{u}_{i}) term, we get the upper bound

2r=1dout111k1=1din1=1dout(iui,11ui,1k1ϵi(𝐰1𝐮i)(𝐰2𝐮i)k(𝐰r𝐮i)k)2.\sum_{\ell_{2}\ldots\ell_{r}=1}^{d_{out}}\sum_{\ell_{1}^{1}\ldots\ell_{1}^{k-1}=1}^{d_{in}}\sum_{\ell_{1}=1}^{d_{out}}\left(\sum_{i}u_{i,\ell_{1}^{1}}\cdots u_{i,\ell_{1}^{k-1}}\epsilon_{i}(\mathbf{w}_{\ell_{1}}^{\top}\mathbf{u}_{i})(\mathbf{w}_{\ell_{2}}^{\top}\mathbf{u}_{i})^{\circ k}\cdots(\mathbf{w}_{\ell_{r}}^{\top}\mathbf{u}_{i})^{\circ k}\right)^{2}~{}.

Again pulling the last (𝐰1𝐮i)(\mathbf{w}_{\ell_{1}}^{\top}\mathbf{u}_{i}) term in front, and applying now the second inequality in Lemma 3 (as the remaining terms in the product no longer depend on 1\ell_{1}), we get the upper bound

2r=1dout111k=1din(iui,11ui,1kϵi(𝐰2𝐮i)k(𝐰r𝐮i)k)2.\sum_{\ell_{2}\ldots\ell_{r}=1}^{d_{out}}\sum_{\ell_{1}^{1}\ldots\ell_{1}^{k}=1}^{d_{in}}\left(\sum_{i}u_{i,\ell_{1}^{1}}\cdots u_{i,\ell_{1}^{k}}\epsilon_{i}(\mathbf{w}_{\ell_{2}}^{\top}\mathbf{u}_{i})^{\circ k}\cdots(\mathbf{w}_{\ell_{r}}^{\top}\mathbf{u}_{i})^{\circ k}\right)^{2}~{}.

Recalling that this is an upper bound on Eq. (16), and applying the same procedure now on the (𝐰2𝐮i),(𝐰3𝐮i)(\mathbf{w}_{\ell_{2}}^{\top}\mathbf{u}_{i}),(\mathbf{w}_{\ell_{3}}^{\top}\mathbf{u}_{i})\ldots terms, we get overall an upper bound of the form

111k=1dinr1rk=1din(iui,11ui,rkϵi)2.\sum_{\ell_{1}^{1}\ldots\ell_{1}^{k}=1}^{d_{in}}\cdots\sum_{\ell_{r}^{1}\cdots\ell_{r}^{k}=1}^{d_{in}}\left(\sum_{i}u_{i,\ell_{1}^{1}}\cdots u_{i,\ell_{r}^{k}}\epsilon_{i}\right)^{2}~{}.

Re-labeling the rkrk indices as 1,,rk\ell_{1},\ldots,\ell_{rk}, the result follows. ∎
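Lemma 4 can also be verified numerically. Since both sides are sums of squared symmetric tensor contractions, they collapse to Gram-matrix powers: the left-hand side equals \sum_{i,i^{\prime}}\epsilon_{i}\epsilon_{i^{\prime}}(f(\mathbf{u}_{i})^{\top}f(\mathbf{u}_{i^{\prime}}))^{r} and the right-hand side equals B^{2rk}\sum_{i,i^{\prime}}\epsilon_{i}\epsilon_{i^{\prime}}(\mathbf{u}_{i}^{\top}\mathbf{u}_{i^{\prime}})^{rk}. The sketch below (ours, with arbitrary small sizes) uses this identity.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_i = 6, 5, 8
k, r, B = 2, 3, 1.3

W = rng.standard_normal((d_out, d_in))
W *= B / np.linalg.norm(W, 2)                 # enforce ||W|| = B
U = rng.standard_normal((n_i, d_in))          # rows are the u_i
eps = rng.choice([-1.0, 1.0], size=n_i)

F = (U @ W.T) ** k                            # rows are f(u_i) = (W u_i)^{o k}

# LHS = sum_{i,i'} eps_i eps_{i'} <f(u_i), f(u_{i'})>^r
lhs = eps @ ((F @ F.T) ** r) @ eps
# RHS = B^{2rk} * sum_{i,i'} eps_i eps_{i'} <u_i, u_{i'}>^{rk}
rhs = B ** (2 * r * k) * (eps @ ((U @ U.T) ** (r * k)) @ eps)
print("Lemma 4 inequality holds:", lhs <= rhs + 1e-8)
```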

A.5.1 Proof of Thm. 4

Fixing a dataset \mathbf{x}_{1},\ldots,\mathbf{x}_{m} and applying the Cauchy-Schwarz inequality, the Rademacher complexity satisfies

𝔼ϵsup𝐮,W1mi=1mϵi𝐮σ(W𝐱i)𝔼ϵsupWbmi=1mϵiσ(W𝐱i).\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{u},W}\frac{1}{m}\sum_{i=1}^{m}\epsilon_{i}\mathbf{u}^{\top}\sigma(W\mathbf{x}_{i})~{}\leq~{}\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{W}\frac{b}{m}\left\|\sum_{i=1}^{m}\epsilon_{i}\sigma(W\mathbf{x}_{i})\right\|~{}.

Recalling that σ(z)=j=1ajzj\sigma(z)=\sum_{j=1}^{\infty}a_{j}z^{j}, by the triangle inequality, we have that the above is at most

𝔼ϵsupWbmj=1|aj|i=1mϵi(W𝐱i)jbmj=1|aj|𝔼ϵsupWi=1mϵi(W𝐱i)j\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{W}\frac{b}{m}\sum_{j=1}^{\infty}|a_{j}|\left\|\sum_{i=1}^{m}\epsilon_{i}(W\mathbf{x}_{i})^{j}\right\|~{}\leq~{}\frac{b}{m}\sum_{j=1}^{\infty}|a_{j}|\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{W}\left\|\sum_{i=1}^{m}\epsilon_{i}(W\mathbf{x}_{i})^{j}\right\|

where ()j(\cdot)^{j} is applied element-wise. Recalling that the supremum is over matrices of spectral norm at most BB, and using Jensen’s inequality, the above can be equivalently written as

bmj=1|aj|Bj𝔼ϵsupW:W1i=1mϵi(W𝐱i)jbmj=1|aj|Bj𝔼ϵsupW:W1i=1mϵi(W𝐱i)j2.\frac{b}{m}\sum_{j=1}^{\infty}|a_{j}|B^{j}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{W:\|W\|\leq 1}\left\|\sum_{i=1}^{m}\epsilon_{i}(W\mathbf{x}_{i})^{j}\right\|~{}\leq~{}\frac{b}{m}\sum_{j=1}^{\infty}|a_{j}|B^{j}\sqrt{\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{W:\|W\|\leq 1}\left\|\sum_{i=1}^{m}\epsilon_{i}(W\mathbf{x}_{i})^{j}\right\|^{2}}~{}. (17)

Using Lemma 4, we have that for any W:W1W:\|W\|\leq 1,

i=1mϵi(W𝐱i)j2=(iϵi(W𝐱i)j)21,,j=1d(i=1mϵixi,1xi,j)2.\left\|\sum_{i=1}^{m}\epsilon_{i}(W\mathbf{x}_{i})^{j}\right\|^{2}~{}=~{}\sum_{\ell}\left(\sum_{i}\epsilon_{i}(W\mathbf{x}_{i})_{\ell}^{j}\right)^{2}~{}\leq~{}\sum_{\ell_{1},\ldots,\ell_{j}=1}^{d}\left(\sum_{i=1}^{m}\epsilon_{i}x_{i,\ell_{1}}\cdots x_{i,\ell_{j}}\right)^{2}~{}.

Thus,

𝔼ϵsupW:W1i=1mϵi(W𝐱i)j2𝔼ϵ1,,j=1d(i=1mϵixi,1xi,j)2\displaystyle\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{W:\|W\|\leq 1}\left\|\sum_{i=1}^{m}\epsilon_{i}(W\mathbf{x}_{i})^{j}\right\|^{2}~{}\leq~{}\mathbb{E}_{\boldsymbol{\epsilon}}\sum_{\ell_{1},\ldots,\ell_{j}=1}^{d}\left(\sum_{i=1}^{m}\epsilon_{i}x_{i,\ell_{1}}\cdots x_{i,\ell_{j}}\right)^{2}
=𝔼ϵi,i=1m1,,j=1dϵiϵixi,1xi,1xi,jxi,j\displaystyle=~{}\mathbb{E}_{\boldsymbol{\epsilon}}\sum_{i,i^{\prime}=1}^{m}\sum_{\ell_{1},\ldots,\ell_{j}=1}^{d}\epsilon_{i}\epsilon_{i^{\prime}}x_{i,\ell_{1}}x_{i^{\prime},\ell_{1}}\cdots x_{i,\ell_{j}}x_{i^{\prime},\ell_{j}}
=()i=1m1,,j=1dxi,12xi,j2\displaystyle\stackrel{{\scriptstyle(*)}}{{=}}~{}\sum_{i=1}^{m}\sum_{\ell_{1},\ldots,\ell_{j}=1}^{d}x_{i,\ell_{1}}^{2}\cdots x_{i,\ell_{j}}^{2}
=i=1m(1=1dxi,12)(j=1dxi,j2)\displaystyle=~{}\sum_{i=1}^{m}\left(\sum_{\ell_{1}=1}^{d}x_{i,\ell_{1}}^{2}\right)\cdots\left(\sum_{\ell_{j}=1}^{d}x_{i,\ell_{j}}^{2}\right)
=i=1m𝐱i2ji=1mbx2j=mbx2j,\displaystyle=~{}\sum_{i=1}^{m}\|\mathbf{x}_{i}\|^{2j}~{}\leq~{}\sum_{i=1}^{m}b_{x}^{2j}~{}=~{}m\cdot b_{x}^{2j}~{},

where in ()(*) we used the fact that each ϵi\epsilon_{i} is independent and uniformly distributed on ±1\pm 1. Plugging this bound back into Eq. (17), we get that the Rademacher complexity is at most

bmj=1|aj|(Bbx)jm=bσ~(Bbx)m.\frac{b}{m}\sum_{j=1}^{\infty}|a_{j}|(Bb_{x})^{j}\sqrt{m}~{}=~{}\frac{b\cdot\tilde{\sigma}(Bb_{x})}{\sqrt{m}}~{}.

Upper bounding this by ϵ\epsilon and solving for mm, the result follows.
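The identity used in step (*), namely \mathbb{E}_{\boldsymbol{\epsilon}}\sum_{\ell_{1},\ldots,\ell_{j}}\left(\sum_{i}\epsilon_{i}x_{i,\ell_{1}}\cdots x_{i,\ell_{j}}\right)^{2}=\sum_{i}\|\mathbf{x}_{i}\|^{2j}, again collapses to Gram powers: the sum over the indices \ell equals \sum_{i,i^{\prime}}\epsilon_{i}\epsilon_{i^{\prime}}(\mathbf{x}_{i}^{\top}\mathbf{x}_{i^{\prime}})^{j}, and taking the expectation over \boldsymbol{\epsilon} keeps only the diagonal terms. The following Monte Carlo sketch (ours, with arbitrary sizes) illustrates this.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, j, trials = 7, 5, 3, 200000

X = rng.standard_normal((m, d))            # rows are the x_i
Gj = (X @ X.T) ** j                        # entries (x_i^T x_{i'})^j

E = rng.choice([-1.0, 1.0], size=(trials, m))
est = np.mean(np.einsum('ti,ij,tj->t', E, Gj, E))   # Monte Carlo over Rademacher draws

print("Monte Carlo estimate     :", est)
print("sum_i ||x_i||^(2j) target:", np.sum(np.linalg.norm(X, axis=1) ** (2 * j)))
```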

A.6 Proof of Thm. 5

For simplicity, we use sup𝐮,W1,,WL\sup_{\mathbf{u},W^{1},\ldots,W^{L}} as short for sup𝐮:𝐮b,W1,,WL:maxjWjB\sup_{\mathbf{u}:\|\mathbf{u}\|\leq b,W^{1},\ldots,W^{L}:\max_{j}\|W^{j}\|\leq B}. The Rademacher complexity equals

𝔼ϵsup𝐮,W1,,WL1mi=1mϵifL+1(𝐱i)=𝔼ϵsup𝐮,W1,,WL1mi=1mϵi𝐮fL(𝐱i)\displaystyle\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{u},W^{1},\ldots,W^{L}}\frac{1}{m}\sum_{i=1}^{m}\epsilon_{i}f_{L+1}(\mathbf{x}_{i})~{}=~{}\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{u},W^{1},\ldots,W^{L}}\frac{1}{m}\sum_{i=1}^{m}\epsilon_{i}\mathbf{u}^{\top}f_{L}(\mathbf{x}_{i})
𝔼ϵsup𝐮,W1,,WL𝐮(1mi=1mϵifL(𝐱i))bm𝔼ϵsup𝐮,W1,,WLi=1mϵifL(𝐱i)\displaystyle\leq~{}\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{u},W^{1},\ldots,W^{L}}\mathbf{u}^{\top}\left(\frac{1}{m}\sum_{i=1}^{m}\epsilon_{i}f_{L}(\mathbf{x}_{i})\right)~{}\leq~{}\frac{b}{m}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{u},W^{1},\ldots,W^{L}}\left\|\sum_{i=1}^{m}\epsilon_{i}f_{L}(\mathbf{x}_{i})\right\|
bm𝔼ϵsup𝐮,W1,,WLi=1mϵifL(𝐱i)2=bm𝔼ϵsup𝐮,W1,,WL(i=1mϵi(fL(𝐱i)))2,\displaystyle\leq~{}\frac{b}{m}\sqrt{\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{u},W^{1},\ldots,W^{L}}\left\|\sum_{i=1}^{m}\epsilon_{i}f_{L}(\mathbf{x}_{i})\right\|^{2}}~{}=~{}\frac{b}{m}\sqrt{\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{u},W^{1},\ldots,W^{L}}\sum_{\ell}\left(\sum_{i=1}^{m}\epsilon_{i}(f_{L}(\mathbf{x}_{i}))_{\ell}\right)^{2}}~{}, (18)

where we used Cauchy-Schwartz and the assumption 𝐮b\|\mathbf{u}\|\leq b, and \ell ranges over the indices of fL(𝐱i)f_{L}(\mathbf{x}_{i}). Recalling that fj+1(𝐱)=(Wj+1fj(𝐱))kf_{j+1}(\mathbf{x})=(W^{j+1}f_{j}(\mathbf{x}))^{\circ k} and repeatedly applying Lemma 4, we have

\displaystyle\sum_{\ell}\left(\sum_{i=1}^{m}\epsilon_{i}(f_{L}(\mathbf{x}_{i}))_{\ell}\right)^{2}~{}\leq~{}B^{2k}\sum_{\ell_{1}\ldots\ell_{k}}\left(\sum_{i=1}^{m}\epsilon_{i}(f_{L-1}(\mathbf{x}_{i}))_{\ell_{1}}\cdots(f_{L-1}(\mathbf{x}_{i}))_{\ell_{k}}\right)^{2}
\displaystyle\leq~{}B^{2k+2k^{2}}\sum_{\ell_{1}\ldots\ell_{k^{2}}}\left(\sum_{i=1}^{m}\epsilon_{i}(f_{L-2}(\mathbf{x}_{i}))_{\ell_{1}}\cdots(f_{L-2}(\mathbf{x}_{i}))_{\ell_{k^{2}}}\right)^{2}
\displaystyle\leq~{}\cdots~{}\leq~{}B^{2k+2k^{2}+\ldots+2k^{L}}\sum_{\ell_{1}\ldots\ell_{k^{L}}}\left(\sum_{i=1}^{m}\epsilon_{i}(f_{0}(\mathbf{x}_{i}))_{\ell_{1}}\cdots(f_{0}(\mathbf{x}_{i}))_{\ell_{k^{L}}}\right)^{2}
\displaystyle=~{}B^{2k+2k^{2}+\ldots+2k^{L}}\sum_{\ell_{1}\ldots\ell_{k^{L}}}\left(\sum_{i=1}^{m}\epsilon_{i}x_{i,\ell_{1}}\cdots x_{i,\ell_{k^{L}}}\right)^{2}

Therefore, recalling that ϵ1ϵm\epsilon_{1}\ldots\epsilon_{m} are i.i.d. and uniform on {1,+1}\{-1,+1\}, we have

\displaystyle\mathbb{E}_{\boldsymbol{\epsilon}}\sup_{\mathbf{u},W^{1},\ldots,W^{L}}\sum_{\ell}\left(\sum_{i=1}^{m}\epsilon_{i}(f_{L}(\mathbf{x}_{i}))_{\ell}\right)^{2}~{}\leq~{}B^{2k+2k^{2}+\ldots+2k^{L}}\mathbb{E}_{\boldsymbol{\epsilon}}\sum_{\ell_{1}\ldots\ell_{k^{L}}}\left(\sum_{i=1}^{m}\epsilon_{i}x_{i,\ell_{1}}\cdots x_{i,\ell_{k^{L}}}\right)^{2}
=B2k+2k2+2kL𝔼ϵ1kLi,i=1mϵiϵixi,1xi,1xi,kLxi,kL\displaystyle=~{}B^{2k+2k^{2}+\ldots 2k^{L}}\mathbb{E}_{\boldsymbol{\epsilon}}\sum_{\ell_{1}\ldots\ell_{k^{L}}}\sum_{i,i^{\prime}=1}^{m}\epsilon_{i}\epsilon_{i^{\prime}}x_{i,\ell_{1}}x_{i^{\prime},\ell_{1}}\cdots x_{i,\ell_{k^{L}}}x_{i^{\prime},\ell_{k^{L}}}
=B2k+2k2+2kL1kLi=1mxi,12xi,kL2\displaystyle=~{}B^{2k+2k^{2}+\ldots 2k^{L}}\sum_{\ell_{1}\ldots\ell_{k^{L}}}\sum_{i=1}^{m}x_{i,\ell_{1}}^{2}\cdots x_{i,\ell_{k^{L}}}^{2}
=B2k+2k2+2kLi=1m(1xi,12)(kLxi,kL2)B2k+2k2+2kLmbx2kL,\displaystyle=~{}B^{2k+2k^{2}+\ldots 2k^{L}}\sum_{i=1}^{m}\left(\sum_{\ell_{1}}x_{i,\ell_{1}}^{2}\right)\cdots\left(\sum_{\ell_{k^{L}}}x_{i,\ell_{k^{L}}}^{2}\right)~{}\leq~{}B^{2k+2k^{2}+\ldots 2k^{L}}\cdot m\cdot b_{x}^{2k^{L}}~{},

where in the last step we used the assumption that 𝐱i2bx2\|\mathbf{x}_{i}\|^{2}\leq b_{x}^{2} for all ii. Plugging this back into Eq. (18), and solving for the number of inputs mm required to make the expression less than ϵ\epsilon, the result follows.
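As with Lemma 4, the chain of inequalities above can be tested end-to-end for a small depth, using the same Gram-power identity: the left-hand side equals \sum_{i,i^{\prime}}\epsilon_{i}\epsilon_{i^{\prime}}f_{L}(\mathbf{x}_{i})^{\top}f_{L}(\mathbf{x}_{i^{\prime}}) and the final bound equals B^{2(k+\cdots+k^{L})}\sum_{i,i^{\prime}}\epsilon_{i}\epsilon_{i^{\prime}}(\mathbf{x}_{i}^{\top}\mathbf{x}_{i^{\prime}})^{k^{L}}. The sketch below (ours; the depth, widths and k are arbitrary) checks this for a single Rademacher draw, which suffices since the chain holds pointwise in \boldsymbol{\epsilon}.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m, k, L, B = 5, 6, 2, 2, 1.1
widths = [d, 7, 4]                              # layer widths, length L+1

Ws = []
for lin, lout in zip(widths[:-1], widths[1:]):
    W = rng.standard_normal((lout, lin))
    Ws.append(W * (B / np.linalg.norm(W, 2)))   # each ||W^j|| = B

def fL(x):                                      # f_0(x) = x, f_{j+1}(x) = (W^{j+1} f_j(x))^{o k}
    for W in Ws:
        x = (W @ x) ** k
    return x

X = rng.standard_normal((m, d))
F = np.stack([fL(x) for x in X])
eps = rng.choice([-1.0, 1.0], size=m)

lhs = eps @ (F @ F.T) @ eps
rhs = B ** (2 * sum(k ** j for j in range(1, L + 1))) * (eps @ ((X @ X.T) ** (k ** L)) @ eps)
print("chain of inequalities holds:", lhs <= rhs + 1e-8)
```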

A.7 Proof of Thm. 6

We will need the following lemma, based on a contraction result from Ledoux and Talagrand [1991]:

Lemma 5.

Let 𝒯\mathcal{T} be a set of vectors in m\mathbb{R}^{m} which contains the origin. If ϵ1,,ϵm\epsilon_{1},\ldots,\epsilon_{m} are i.i.d. Rademacher random variables, and σ\sigma is an LL-Lipschitz function on \mathbb{R} with σ(0)=0\sigma(0)=0, then

𝔼ϵ[supt𝒯(i=1mϵiσ(ti))2]2L2𝔼ϵ[(supt𝒯i=1mϵiti)2].\mathbb{E}_{\boldsymbol{\epsilon}}\left[\sup_{t\in\mathcal{T}}\left(\sum_{i=1}^{m}\epsilon_{i}\sigma(t_{i})\right)^{2}\right]~{}\leq~{}2L^{2}\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\left[\left(\sup_{t\in\mathcal{T}}\sum_{i=1}^{m}\epsilon_{i}t_{i}\right)^{2}\right]~{}.
Proof.

For any realization of ϵ\boldsymbol{\epsilon}, supt𝒯|i=1mϵiσ(ti)|\sup_{t\in\mathcal{T}}|\sum_{i=1}^{m}\epsilon_{i}\sigma(t_{i})| equals either supt𝒯i=1mϵiσ(ti)\sup_{t\in\mathcal{T}}\sum_{i=1}^{m}\epsilon_{i}\sigma(t_{i}) or supt𝒯i=1mϵiσ(ti)\sup_{t\in\mathcal{T}}-\sum_{i=1}^{m}\epsilon_{i}\sigma(t_{i}). Thus, the left hand side in the lemma can be upper bounded as follows:

𝔼[(supt𝒯|i=1mϵiσ(ti)|)2]𝔼[(supt𝒯i=1mϵiσ(ti))2+(supt𝒯i=1mϵiσ(ti))2].\mathbb{E}\left[\left(\sup_{t\in\mathcal{T}}\left|\sum_{i=1}^{m}\epsilon_{i}\sigma(t_{i})\right|\right)^{2}\right]~{}\leq~{}\mathbb{E}\left[\left(\sup_{t\in\mathcal{T}}\sum_{i=1}^{m}\epsilon_{i}\sigma(t_{i})\right)^{2}+\left(\sup_{t\in\mathcal{T}}-\sum_{i=1}^{m}\epsilon_{i}\sigma(t_{i})\right)^{2}\right]~{}.

Noting that 𝔼ϵ[(supt𝒯iϵiσ(ti))2]\mathbb{E}_{\boldsymbol{\epsilon}}[(\sup_{t\in\mathcal{T}}\sum_{i}\epsilon_{i}\sigma(t_{i}))^{2}] equals 𝔼ϵ[(supt𝒯iϵiσ(ti))2]\mathbb{E}_{\boldsymbol{\epsilon}}[(\sup_{t\in\mathcal{T}}-\sum_{i}\epsilon_{i}\sigma(t_{i}))^{2}] by symmetry of the ϵi\epsilon_{i} random variables, the expression above equals

2𝔼[(supt𝒯i=1mϵiσ(ti))2]=()2𝔼[[supt𝒯i=1mϵiσ(ti)]+2]=2L2𝔼[[supt𝒯i=1mϵi1Lσ(ti)]+2],2\cdot\mathbb{E}\left[\left(\sup_{t\in\mathcal{T}}\sum_{i=1}^{m}\epsilon_{i}\sigma(t_{i})\right)^{2}\right]~{}\stackrel{{\scriptstyle(*)}}{{=}}~{}2\cdot\mathbb{E}\left[\left[\sup_{t\in\mathcal{T}}\sum_{i=1}^{m}\epsilon_{i}\sigma(t_{i})\right]_{+}^{2}\right]~{}=~{}2L^{2}\cdot\mathbb{E}\left[\left[\sup_{t\in\mathcal{T}}\sum_{i=1}^{m}\epsilon_{i}\frac{1}{L}\sigma(t_{i})\right]_{+}^{2}\right]~{},

where ()(*) follows from the fact that the supremum is always non-negative, since σ(0)=0\sigma(0)=0 and 𝒯\mathcal{T} contains the origin. We now utilize equation (4.20) in Ledoux and Talagrand [1991], which implies that 𝔼ϵg(supt𝒯iϵiϕ(ti))𝔼ϵg(supt𝒯iϵiti)\mathbb{E}_{\boldsymbol{\epsilon}}g(\sup_{t\in\mathcal{T}}\sum_{i}\epsilon_{i}\phi(t_{i}))\leq\mathbb{E}_{\boldsymbol{\epsilon}}g(\sup_{t\in\mathcal{T}}\sum_{i}\epsilon_{i}t_{i}) for any 11-Lipschitz ϕ\phi satisfying ϕ(0)=0\phi(0)=0, and any convex increasing function gg. Plugging into the above, and using the fact that [z]+2z2[z]_{+}^{2}\leq z^{2} for all zz, the lemma follows. ∎
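A small Monte Carlo illustration of Lemma 5 (ours, not needed for the proof): we take a finite random set \mathcal{T} containing the origin and the activation \sigma(t)=L\tanh(t), which is L-Lipschitz with \sigma(0)=0, and compare empirical estimates of the two sides.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n_T, L, trials = 10, 40, 1.5, 20000
sigma = lambda z: L * np.tanh(z)               # L-Lipschitz, sigma(0) = 0

T = rng.standard_normal((n_T, m))
T[0] = 0.0                                     # T must contain the origin

E = rng.choice([-1.0, 1.0], size=(trials, m))
# E[ sup_t ( sum_i eps_i sigma(t_i) )^2 ]
lhs = np.mean(np.max((E @ sigma(T).T) ** 2, axis=1))
# 2 L^2 E[ ( sup_t sum_i eps_i t_i )^2 ]  (the sup is >= 0 since 0 is in T)
rhs = 2 * L**2 * np.mean(np.max(E @ T.T, axis=1) ** 2)
print("Lemma 5 holds empirically:", lhs <= rhs)
```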

We now turn to prove the theorem. The Rademacher complexity times mm equals

𝔼ϵ[supW,𝐮i=1mϵi𝐮σ(W𝐱i)],\mathbb{E}_{\boldsymbol{\epsilon}}\left[\sup_{W,\mathbf{u}}\sum_{i=1}^{m}\epsilon_{i}\mathbf{u}^{\top}\sigma(W\mathbf{x}_{i})\right]~{},

where for notational convenience we drop the conditions on W,\mathbf{u},\mathbf{w} in the supremum. Using the Cauchy-Schwarz and Jensen inequalities, this in turn can be upper bounded as follows:

𝔼ϵ\displaystyle\mathbb{E}_{\boldsymbol{\epsilon}} [supW,𝐮𝐮(i=1mϵiσ(W𝐱i))]b𝔼ϵ[supWi=1mϵiσ(W𝐱i)]\displaystyle\left[\sup_{W,\mathbf{u}}\mathbf{u}^{\top}\left(\sum_{i=1}^{m}\epsilon_{i}\sigma(W\mathbf{x}_{i})\right)\right]~{}\leq~{}b\cdot\mathbb{E}_{\boldsymbol{\epsilon}}\left[\sup_{W}\left\|\sum_{i=1}^{m}\epsilon_{i}\sigma(W\mathbf{x}_{i})\right\|\right]
b𝔼ϵ[supWi=1mϵiσ(W𝐱i)2]=b𝔼ϵ[supWj=1n(i=1mϵiσ(𝐰ϕj(𝐱i)))2]\displaystyle\leq~{}b\sqrt{\mathbb{E}_{\boldsymbol{\epsilon}}\left[\sup_{W}\left\|\sum_{i=1}^{m}\epsilon_{i}\sigma(W\mathbf{x}_{i})\right\|^{2}\right]}~{}=~{}b\sqrt{\mathbb{E}_{\boldsymbol{\epsilon}}\left[\sup_{W}\sum_{j=1}^{n}\left(\sum_{i=1}^{m}\epsilon_{i}\sigma(\mathbf{w}^{\top}\phi_{j}(\mathbf{x}_{i}))\right)^{2}\right]}
bj=1n𝔼ϵ[supW(i=1mϵiσ(𝐰ϕj(𝐱i)))2].\displaystyle\leq~{}b\sqrt{\sum_{j=1}^{n}\mathbb{E}_{\boldsymbol{\epsilon}}\left[\sup_{W}\left(\sum_{i=1}^{m}\epsilon_{i}\sigma(\mathbf{w}^{\top}\phi_{j}(\mathbf{x}_{i}))\right)^{2}\right]}~{}.

Recall that the supremum is over all matrices W which conform to the patches and have spectral norm at most B. By definition, every row of such a matrix has a subset of entries corresponding to the convolutional filter vector \mathbf{w} (and zeros elsewhere). We must therefore have \|\mathbf{w}\|\leq B, since \|\mathbf{w}\| equals the norm of any row of W, and the norm of a row of W is a lower bound on the spectral norm. Hence, we can upper bound the expression above by taking the supremum over all vectors \mathbf{w} such that \|\mathbf{w}\|\leq B (and not just those for which the corresponding matrix has spectral norm \leq B). Thus, we get the upper bound

bj=1n𝔼ϵ[sup𝐰:𝐰B(i=1mϵiσ(𝐰ϕj(𝐱i)))2],b\sqrt{\sum_{j=1}^{n}\mathbb{E}_{\boldsymbol{\epsilon}}\left[\sup_{\mathbf{w}:\|\mathbf{w}\|\leq B}\left(\sum_{i=1}^{m}\epsilon_{i}\sigma(\mathbf{w}^{\top}\phi_{j}(\mathbf{x}_{i}))\right)^{2}\right]},

which, by Lemma 5 and the Cauchy-Schwarz inequality, is at most

\displaystyle bL\sqrt{2\sum_{j=1}^{n}\mathbb{E}_{\boldsymbol{\epsilon}}\left[\sup_{\mathbf{w}:\|\mathbf{w}\|\leq B}\left(\sum_{i=1}^{m}\epsilon_{i}\mathbf{w}^{\top}\phi_{j}(\mathbf{x}_{i})\right)^{2}\right]}~{}\leq~{}bBL\sqrt{2\sum_{j=1}^{n}\mathbb{E}_{\boldsymbol{\epsilon}}\left[\left\|\sum_{i=1}^{m}\epsilon_{i}\phi_{j}(\mathbf{x}_{i})\right\|^{2}\right]}
\displaystyle=~{}bBL\sqrt{2\sum_{j=1}^{n}\mathbb{E}_{\boldsymbol{\epsilon}}\left[\sum_{i,i^{\prime}=1}^{m}\epsilon_{i}\epsilon_{i^{\prime}}\phi_{j}(\mathbf{x}_{i})^{\top}\phi_{j}(\mathbf{x}_{i^{\prime}})\right]}~{}=~{}bBL\sqrt{2\sum_{j=1}^{n}\sum_{i=1}^{m}\|\phi_{j}(\mathbf{x}_{i})\|^{2}}.

Recalling that OΦO_{\Phi} is the maximal number of times any single input coordinate appears across the patches, and letting xi,lx_{i,l} be the ll-th coordinate of 𝐱i\mathbf{x}_{i}, we can upper bound the above by

bBL2i=1ml=1dxi,l2OΦ=bBL2i=1m𝐱i2OΦbBbxL2mOΦ.bBL\sqrt{2\sum_{i=1}^{m}\sum_{l=1}^{d}x_{i,l}^{2}O_{\Phi}}~{}=~{}bBL\sqrt{2\sum_{i=1}^{m}\|\mathbf{x}_{i}\|^{2}\cdot O_{\Phi}}~{}\leq~{}bBb_{x}L\sqrt{2mO_{\Phi}}.

Dividing by mm, and solving for the number mm required to make the resulting expression less than ϵ\epsilon, the result follows.
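The only place the patch structure enters the final bound is through the inequality \sum_{j=1}^{n}\sum_{i=1}^{m}\|\phi_{j}(\mathbf{x}_{i})\|^{2}\leq O_{\Phi}\sum_{i=1}^{m}\|\mathbf{x}_{i}\|^{2} used above. The sketch below (ours) checks it for a simple one-dimensional convolutional patch structure, computing O_{\Phi} directly from the patch index sets; the patch size and stride are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, patch, stride, m = 20, 5, 2, 8

# 1-D convolutional patches: phi_j(x) = x[j*stride : j*stride + patch]
starts = range(0, d - patch + 1, stride)
idx = [np.arange(s, s + patch) for s in starts]

counts = np.zeros(d, dtype=int)
for I in idx:
    counts[I] += 1
O_phi = counts.max()                    # max number of patches any coordinate appears in

X = rng.standard_normal((m, d))
lhs = sum(np.sum(X[:, I] ** 2) for I in idx)    # sum_j sum_i ||phi_j(x_i)||^2
rhs = O_phi * np.sum(X ** 2)                    # O_Phi * sum_i ||x_i||^2
print("patch-overlap inequality holds:", lhs <= rhs + 1e-10)
```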

A.8 Proof of Thm. 7

The proof follows from a covering number argument. We start with some required definitions and lemmas.

Definition 2.

Let \mathcal{F} be a class of functions from 𝒳\mathcal{X} to \mathbb{R}. For 1p1\leq p\leq\infty, ϵ>0\epsilon>0, and {𝐱1,,𝐱m}𝒳\{\mathbf{x}_{1},\ldots,\mathbf{x}_{m}\}\subseteq\mathcal{X}, the empirical covering number 𝒩p(,ϵ;𝐱1,,𝐱m)\mathcal{N}_{p}(\mathcal{F},\epsilon;\mathbf{x}_{1},\ldots,\mathbf{x}_{m}) is the minimal cardinality of a set VmV\subseteq\mathbb{R}^{m}, such that for all ff\in\mathcal{F} there is 𝐯V\mathbf{v}\in V such that

(1mi=1m|f(𝐱i)vi|p)1/pϵ.\left(\frac{1}{m}\sum_{i=1}^{m}|f(\mathbf{x}_{i})-v_{i}|^{p}\right)^{1/p}\leq\epsilon~{}.

We define the covering number 𝒩p(,ϵ,m)=sup𝐱1,,𝐱m𝒩p(,ϵ;𝐱1,,𝐱m)\mathcal{N}_{p}(\mathcal{F},\epsilon,m)=\sup_{\mathbf{x}_{1},\ldots,\mathbf{x}_{m}}\mathcal{N}_{p}(\mathcal{F},\epsilon;\mathbf{x}_{1},\ldots,\mathbf{x}_{m}).
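To make Definition 2 concrete, here is a small sketch (ours, purely illustrative) for the p=\infty case and the simplest one-dimensional linear class \{x\mapsto wx:|w|\leq a\} on inputs with |x_{i}|\leq b: discretizing w with spacing 2\epsilon/b yields an explicit empirical cover of size roughly ab/\epsilon, and the code verifies the covering property on random data.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, eps, m = 2.0, 1.0, 0.1, 50

x = rng.uniform(-b, b, size=m)                        # m inputs with |x_i| <= b
grid = np.arange(-a, a + 2 * eps / b, 2 * eps / b)    # weight grid with spacing 2*eps/b
V = np.outer(grid, x)                                 # cover candidates: rows (w' x_1, ..., w' x_m)

# every predictor f(x) = w*x with |w| <= a should be within eps of some row of V in sup-norm
worst = 0.0
for w in rng.uniform(-a, a, size=1000):
    dists = np.max(np.abs(w * x - V), axis=1)         # sup-norm distance to each cover element
    worst = max(worst, dists.min())
print("cover size:", len(grid), "  worst sup-norm distance:", worst, "  <= eps:", worst <= eps)
```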

Lemma 6 (Zhang [2002]).

Let a,b>0a,b>0, let 𝒳={𝐱d:𝐱b}\mathcal{X}=\{\mathbf{x}\in\mathbb{R}^{d}~{}:~{}\|\mathbf{x}\|\leq b\}, and consider the class of linear predictors ={f𝒳:f(𝐱)=𝐰𝐱,𝐰a}\mathcal{F}=\{f\in\mathbb{R}^{\mathcal{X}}:~{}f(\mathbf{x})=\mathbf{w}^{\top}\mathbf{x},~{}\|\mathbf{w}\|\leq a\}. Then,

log𝒩(,ϵ,m)36a2b2ϵ2log(2m4ab/ϵ+2+1).\log\mathcal{N}_{\infty}(\mathcal{F},\epsilon,m)\leq\frac{36a^{2}b^{2}}{\epsilon^{2}}\log\left(2m\lceil 4ab/\epsilon+2\rceil+1\right)~{}.
Lemma 7 (E.g., Daniely and Granot [2019]).

Let C>0C>0 and let \mathcal{F} be a class of CC-bounded functions from 𝒳\mathcal{X} to \mathbb{R}, i.e., |f(𝐱)|C|f(\mathbf{x})|\leq C for all ff\in\mathcal{F} and 𝐱𝒳\mathbf{x}\in\mathcal{X}. Then, for every integer M1M\geq 1 we have

m()C2M+6Cmk=1M2klog𝒩2(,C2k,m).\mathcal{R}_{m}(\mathcal{F})\leq C2^{-M}+\frac{6C}{\sqrt{m}}\sum_{k=1}^{M}2^{-k}\sqrt{\log\mathcal{N}_{2}(\mathcal{F},C2^{-k},m)}~{}.

We are now ready to prove the theorem. For i[m],j[n]i\in[m],~{}j\in[n] we denote 𝐱i,j=ϕj(𝐱i)n\mathbf{x}^{\prime}_{i,j}=\phi_{j}(\mathbf{x}_{i})\in\mathbb{R}^{n^{\prime}}. Let 𝒳n={𝐱n:𝐱bx}\mathcal{X}_{n^{\prime}}=\{\mathbf{x}^{\prime}\in\mathbb{R}^{n^{\prime}}:\|\mathbf{x}^{\prime}\|\leq b_{x}\}, and let

:={f𝒳n:f(𝐱)=𝐰𝐱,𝐰n,𝐰B}.\mathcal{F}:=\{f\in\mathbb{R}^{\mathcal{X}_{n^{\prime}}}~{}:~{}f(\mathbf{x}^{\prime})=\mathbf{w}^{\top}\mathbf{x}^{\prime},~{}\mathbf{w}\in\mathbb{R}^{n^{\prime}},~{}\|\mathbf{w}\|\leq B\}~{}.

Let VmnV\subseteq\mathbb{R}^{mn} be a set of size at most 𝒩(,ϵ/L,mn)\mathcal{N}_{\infty}(\mathcal{F},\epsilon/L,mn), such that for all ff\in\mathcal{F} there is 𝐯V\mathbf{v}\in V that satisfies the following: Letting vi,j:=v(i1)n+jv_{i,j}:=v_{(i-1)n+j}, we have |f(𝐱i,j)vi,j|ϵ/L|f(\mathbf{x}^{\prime}_{i,j})-v_{i,j}|\leq\epsilon/L for all i[m],j[n]i\in[m],~{}j\in[n].

We define

U:={𝐮m:there is 𝐯V s.t. ui=ρσ(vi,1,,vi,n)=ρ(σ(vi,1),,σ(vi,n)) for all i[m]}.U:=\{\mathbf{u}\in\mathbb{R}^{m}~{}:~{}\text{there is }\mathbf{v}\in V\text{ s.t. }u_{i}=\rho\circ\sigma(v_{i,1},\ldots,v_{i,n})=\rho\left(\sigma(v_{i,1}),\ldots,\sigma(v_{i,n})\right)\text{ for all }i\in[m]\}~{}.

Note that |U||V||U|\leq|V|. Let hB,n,dσ,ρ,Φh\in\mathcal{H}_{B,n,d}^{\sigma,\rho,\Phi} and suppose that the network hh has a filter 𝐰n\mathbf{w}\in\mathbb{R}^{n^{\prime}}. Let WW be the weight matrix that corresponds to Φ\Phi and 𝐰\mathbf{w}. Thus, we have WB\|W\|\leq B. Let 𝐱d\mathbf{x}\in\mathbb{R}^{d} such that ϕ1(𝐱)=𝐰𝐰\phi_{1}(\mathbf{x})=\frac{\mathbf{w}}{\|\mathbf{w}\|} and xk=0x_{k}=0 for every coordinate kk that does not appear in ϕ1\phi_{1}. That is, 𝐱\mathbf{x} is a vector of norm 11 such that (W𝐱)1=𝐰ϕ1(𝐱)=𝐰(W\mathbf{x})_{1}=\mathbf{w}^{\top}\phi_{1}(\mathbf{x})=\|\mathbf{w}\|. Therefore, W𝐱(W𝐱)1=𝐰\|W\mathbf{x}\|\geq(W\mathbf{x})_{1}=\|\mathbf{w}\|, and thus BW𝐰B\geq\|W\|\geq\|\mathbf{w}\|. Let ff be the function in \mathcal{F} that corresponds to 𝐰\mathbf{w}, and let 𝐯V\mathbf{v}\in V such that |f(𝐱i,j)vi,j|ϵ/L|f(\mathbf{x}^{\prime}_{i,j})-v_{i,j}|\leq\epsilon/L for all i[m],j[n]i\in[m],~{}j\in[n]. Let 𝐮U\mathbf{u}\in U that corresponds to 𝐯\mathbf{v}, namely, ui=ρσ(vi,1,,vi,n)u_{i}=\rho\circ\sigma(v_{i,1},\ldots,v_{i,n}) for all i[m]i\in[m]. Note that |h(𝐱i)ui|ϵ|h(\mathbf{x}_{i})-u_{i}|\leq\epsilon for all i[m]i\in[m]. Indeed, we have

|h(\mathbf{x}_{i})-u_{i}|=\left|\rho\circ\sigma\left(f(\mathbf{x}^{\prime}_{i,1}),\ldots,f(\mathbf{x}^{\prime}_{i,n})\right)-\rho\circ\sigma\left(v_{i,1},\ldots,v_{i,n}\right)\right|\leq L\cdot\max_{j\in[n]}\left|f(\mathbf{x}^{\prime}_{i,j})-v_{i,j}\right|\leq L\cdot\frac{\epsilon}{L}=\epsilon~,

where the first inequality follows from the $L$-Lipschitzness of $\rho\circ\sigma$ w.r.t. $\ell_{\infty}$. Hence,

\mathcal{N}_{\infty}\left(\mathcal{H}_{B,n,d}^{\sigma,\rho,\Phi},\epsilon,m\right)\leq|U|\leq|V|\leq\mathcal{N}_{\infty}(\mathcal{F},\epsilon/L,mn)~.
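The step from $V$ to $U$ is simply a coordinatewise application of $\rho\circ\sigma$ to each cover element. The following Python sketch (the names, and the ReLU/max-pooling choices, are ours for illustration) makes this lifting explicit and shows why $|U|\leq|V|$.

import numpy as np

def lift_cover(V, m, n, sigma, rho):
    # Map each cover element v in R^{mn} (indexed as v[(i-1)*n + j]) to u in R^m
    # with u_i = rho(sigma(v_{i,1}), ..., sigma(v_{i,n})).
    # V: array of shape (|V|, m*n).  Returns an array of shape (|U|, m) with |U| <= |V|.
    U = []
    for v in V:
        v_grid = v.reshape(m, n)              # row i holds (v_{i,1}, ..., v_{i,n})
        U.append(rho(sigma(v_grid)))          # rho is applied along each row
    return np.unique(np.array(U), axis=0)     # duplicates collapse, so |U| <= |V|

# Example: ReLU activation and max-pooling (1-Lipschitz w.r.t. l_inf, vanishing at 0).
relu = lambda z: np.maximum(z, 0.0)
maxpool = lambda z: z.max(axis=1)
V = np.random.randn(5, 3 * 4)                 # a toy "cover" with m = 3, n = 4
print(lift_cover(V, m=3, n=4, sigma=relu, rho=maxpool).shape)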

Combining the covering-number bound above with Lemma 6, and using the fact that the $\mathcal{N}_{2}$ covering number is at most the $\mathcal{N}_{\infty}$ covering number (cf. Anthony and Bartlett [1999]), we get

\log\mathcal{N}_{2}\left(\mathcal{H}_{B,n,d}^{\sigma,\rho,\Phi},\epsilon,m\right) \leq\log\mathcal{N}_{\infty}\left(\mathcal{H}_{B,n,d}^{\sigma,\rho,\Phi},\epsilon,m\right)
\leq\log\mathcal{N}_{\infty}(\mathcal{F},\epsilon/L,mn)
\leq\frac{36b_{x}^{2}B^{2}}{(\epsilon/L)^{2}}\log\left(2mn\lceil 4b_{x}B/(\epsilon/L)+2\rceil+1\right)~. (19)

Note that for every $\mathbf{x}\in\mathcal{X}:=\{\mathbf{x}\in\mathbb{R}^{d}:\|\phi_{j}(\mathbf{x})\|\leq b_{x}\text{ for all }j\in[n]\}$ and $h\in\mathcal{H}_{B,n,d}^{\sigma,\rho,\Phi}$ we have $|h(\mathbf{x})|=|\rho(\sigma(\mathbf{w}^{\top}\phi_{1}(\mathbf{x})),\ldots,\sigma(\mathbf{w}^{\top}\phi_{n}(\mathbf{x})))|\leq Lb_{x}B$, since $|\mathbf{w}^{\top}\phi_{j}(\mathbf{x})|\leq Bb_{x}$, the activation $\sigma$ is $L$-Lipschitz and satisfies $\sigma(0)=0$, and $\rho$ is $1$-Lipschitz w.r.t. $\ell_{\infty}$ and satisfies $\rho(\mathbf{0})=0$. By Lemma 7, we conclude that

\mathcal{R}_{m}\left(\mathcal{H}_{B,n,d}^{\sigma,\rho,\Phi}\right)\leq Lb_{x}B2^{-M}+\frac{6Lb_{x}B}{\sqrt{m}}\sum_{\ell=1}^{M}2^{-\ell}\sqrt{\log\mathcal{N}_{2}\left(\mathcal{H}_{B,n,d}^{\sigma,\rho,\Phi},Lb_{x}B2^{-\ell},m\right)}~,

for every integer $M\geq 1$. Plugging in $M=\lceil\log(\sqrt{m})\rceil$ and the covering-number bound from Eq. (19), we get

\mathcal{R}_{m}\left(\mathcal{H}_{B,n,d}^{\sigma,\rho,\Phi}\right) \leq\frac{Lb_{x}B}{\sqrt{m}}+\frac{6Lb_{x}B}{\sqrt{m}}\sum_{\ell=1}^{\lceil\log(\sqrt{m})\rceil}2^{-\ell}\sqrt{\frac{36b_{x}^{2}B^{2}}{(b_{x}B2^{-\ell})^{2}}\log\left(2mn\lceil 4b_{x}B/(b_{x}B2^{-\ell})+2\rceil+1\right)}
=\frac{Lb_{x}B}{\sqrt{m}}+\frac{36Lb_{x}B}{\sqrt{m}}\sum_{\ell=1}^{\lceil\log(\sqrt{m})\rceil}\sqrt{\log\left(2mn\lceil 4\cdot 2^{\ell}+2\rceil+1\right)}
\leq\frac{Lb_{x}B}{\sqrt{m}}+\frac{36Lb_{x}B}{\sqrt{m}}\lceil\log(\sqrt{m})\rceil\cdot\sqrt{\log\left(23mn\sqrt{m}\right)}~.

Hence, for some universal constant $c^{\prime}>0$ the above is at most

c^{\prime}\cdot\frac{Lb_{x}B\log(m)\sqrt{\log\left(mn\right)}}{\sqrt{m}}~.

Requiring this to be at most $\epsilon$ and rearranging, the result follows.
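For intuition about the implied sample complexity, the rough Python sketch below (our own helper, with an arbitrary stand-in value for the universal constant $c^{\prime}$) searches, up to a factor of $2$, for the smallest $m$ at which the displayed bound drops below $\epsilon$.

import numpy as np

def sample_complexity(L, bx, B, n, eps, c=1.0):
    # Smallest m (found by doubling) with c * L*bx*B * log(m) * sqrt(log(m*n)) / sqrt(m) <= eps.
    # The constant c stands in for the unspecified universal constant c'.
    bound = lambda m: c * L * bx * B * np.log(m) * np.sqrt(np.log(m * n)) / np.sqrt(m)
    m = 2
    while bound(m) > eps:
        m *= 2
    return m

print(sample_complexity(L=1.0, bx=1.0, B=1.0, n=100, eps=0.1))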

A.9 Proof of Thm. 8

To help the reader track the main proof ideas, we first prove the claim for the case where $B=b_{x}=1$ and $\epsilon=1/2$ (in Subsection A.9.1), and then extend the proof to arbitrary $B,b_{x},\epsilon>0$ in Subsection A.9.2.

A.9.1 Proof for $B=b_{x}=1$ and $\epsilon=1/2$

Let $m=\log(n)$ and let $d=3^{m}$. Consider $m$ points $\mathbf{x}^{1},\ldots,\mathbf{x}^{m}$, where for every $i\in[m]$ the point $\mathbf{x}^{i}\in\mathbb{R}^{d}$ is a vectorization of an order-$m$ tensor $\hat{\mathbf{x}}^{i}$ whose components are indexed by $(j_{1},\ldots,j_{m})\in[3]^{m}$. We define the components $x^{i}_{j_{1},\ldots,j_{m}}$ of $\hat{\mathbf{x}}^{i}$ such that $x^{i}_{j_{1},\ldots,j_{m}}=1$ if $j_{i}=3$ and $j_{r}=2$ for all $r\neq i$, and $x^{i}_{j_{1},\ldots,j_{m}}=0$ otherwise. Note that $\|\mathbf{x}^{i}\|=1$ for all $i\in[m]$. Consider patches of dimensions $2\times\ldots\times 2$ and stride $1$; thus, the set $\Phi$ corresponds to all the patches of dimensions $2\times\ldots\times 2$ in the tensor. Note that there are $2^{m}=n$ such patches. Indeed, given an index $(j_{1},\ldots,j_{m})\in[2]^{m}$, we can define a patch which contains the indices $\left\{(j_{1},\ldots,j_{m})+(\Delta_{1},\ldots,\Delta_{m}):(\Delta_{1},\ldots,\Delta_{m})\in\{0,1\}^{m}\right\}$. We say that $(j_{1},\ldots,j_{m})$ is the base index of this patch. Note that each $(j_{1},\ldots,j_{m})\in[2]^{m}$ is the base index of exactly one patch. Also, an index $(j_{1},\ldots,j_{m})$ that includes some $r\in[m]$ with $j_{r}=3$ does not induce a patch of this form, since for $\Delta_{r}=1$ we would get an invalid index.

We show that for any $\mathbf{y}\in\{0,1\}^{m}$ we can find a filter $\mathbf{w}$, such that $\mathbf{w}$ is an order-$m$ tensor of dimensions $2\times\ldots\times 2$ and satisfies the following. Let $N_{\mathbf{w}}$ be the neural network that consists of a convolutional layer with the patches $\Phi$ and the filter $\mathbf{w}$, followed by a max-pooling layer. Then, $N_{\mathbf{w}}(\mathbf{x}^{i})=y_{i}$ for all $i\in[m]$. Thus, we can shatter $\mathbf{x}^{1},\ldots,\mathbf{x}^{m}$ with margin $\epsilon=1/2$. Moreover, the spectral norm of the matrix $W$ that corresponds to the convolutional layer is at most $1$.

Consider the filter $\mathbf{w}$ of dimensions $2\times\ldots\times 2$ such that $w_{j_{1},\ldots,j_{m}}=1$ if $(j_{1},\ldots,j_{m})=\mathbf{1}+\mathbf{y}$, and $w_{j_{1},\ldots,j_{m}}=0$ otherwise. We now show that $N_{\mathbf{w}}(\mathbf{x}^{i})=y_{i}$ for all $i\in[m]$. Since the filter $\mathbf{w}$ has a single non-zero component, the inner product between $\mathbf{w}$ and a patch of $\mathbf{x}^{i}$ is non-zero iff the patch of $\mathbf{x}^{i}$ has a non-zero component in the appropriate position. More precisely, for a patch with base index $(j_{1},\ldots,j_{m})$, the inner product between the components of $\mathbf{x}^{i}$ in the indices of the patch and the filter $\mathbf{w}$ is $1$ iff $x^{i}_{(j_{1},\ldots,j_{m})+\mathbf{y}}=1$, and otherwise the inner product is $0$. Since $x^{i}_{q_{1},\ldots,q_{m}}=1$ iff $q_{i}=3$ and $q_{r}=2$ for $r\neq i$, we have $x^{i}_{(j_{1},\ldots,j_{m})+\mathbf{y}}=1$ iff $j_{i}=3-y_{i}$ and $j_{r}=2-y_{r}$ for $r\neq i$. Now, if $y_{i}=0$ then there is no patch whose base index satisfies $j_{i}=3-y_{i}=3$, since all base indices are in $[2]^{m}$, and therefore $N_{\mathbf{w}}(\mathbf{x}^{i})=0$. If $y_{i}=1$ then the patch whose base index satisfies $j_{i}=3-y_{i}$ and $j_{r}=2-y_{r}$ for $r\neq i$ gives output $1$ (and all other patches give output $0$), and hence $N_{\mathbf{w}}(\mathbf{x}^{i})=1$. Thus, we have $N_{\mathbf{w}}(\mathbf{x}^{i})=y_{i}$ as required.

For example, consider the case where $m=2$. Then, the tensor $\hat{\mathbf{x}}^{1}$ is the matrix

\hat{\mathbf{x}}^{1}=\begin{bmatrix}0&0&0\\ 0&0&0\\ 0&1&0\end{bmatrix}.

For $\mathbf{y}=(1,1)^{\top}$ we have $\mathbf{w}=\begin{bmatrix}0&0\\ 0&1\end{bmatrix}$, and hence the patch with base index $(2,1)$ gives output $1$. For $\mathbf{y}=(1,0)^{\top}$ we have $\mathbf{w}=\begin{bmatrix}0&0\\ 1&0\end{bmatrix}$, and hence the patch with base index $(2,2)$ gives output $1$. However, for $\mathbf{y}=(0,1)^{\top}$ we have $\mathbf{w}=\begin{bmatrix}0&1\\ 0&0\end{bmatrix}$, and hence there is no patch that gives output $1$. Thus, in all the above cases we have $N_{\mathbf{w}}(\mathbf{x}^{1})=y_{1}$.
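The $m=2$ example can be checked mechanically. The following NumPy sketch (with 0-based indices, so the filter's single nonzero entry sits at index $\mathbf{y}$ rather than $\mathbf{1}+\mathbf{y}$) enumerates the $2\times 2$ patches with stride $1$, applies max-pooling, and recovers $N_{\mathbf{w}}(\mathbf{x}^{1})=y_{1}$ in all three cases.

import numpy as np

# The tensor x-hat^1 from the text: a 3 x 3 matrix with a single 1.
x1 = np.array([[0, 0, 0],
               [0, 0, 0],
               [0, 1, 0]], dtype=float)

def conv_maxpool(x, w):
    # Max over all 2 x 2 patches (stride 1) of the inner product with the 2 x 2 filter w.
    vals = [np.sum(w * x[i:i+2, j:j+2]) for i in range(2) for j in range(2)]
    return max(vals)

for y in [(1, 1), (1, 0), (0, 1)]:
    w = np.zeros((2, 2))
    w[y[0], y[1]] = 1.0              # 1-based index 1 + y corresponds to 0-based index y
    print(y, conv_maxpool(x1, w))    # prints y_1 in each case: 1, 1, 0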

It remains to show that the spectral norm of the matrix $W$ that corresponds to the convolutional layer with the filter $\mathbf{w}$ is at most $1$. Thus, we show that for every input $\mathbf{x}\in\mathbb{R}^{d}$ with $\|\mathbf{x}\|=1$, the vector of inputs to the hidden layer has norm at most $1$. We view $\mathbf{x}$ as the vectorization of a tensor $\hat{\mathbf{x}}$ with components $x_{j_{1},\ldots,j_{m}}$ for $(j_{1},\ldots,j_{m})\in[3]^{m}$. Since the filter $\mathbf{w}$ contains a single $1$-component and all other components are $0$, the input to each hidden neuron is a different component of $\hat{\mathbf{x}}$. More precisely, since the filter $\mathbf{w}$ contains $1$ at index $\mathbf{1}+\mathbf{y}$, for the patch with base index $(j_{1},\ldots,j_{m})$ the corresponding hidden neuron has input $x_{(j_{1},\ldots,j_{m})+\mathbf{y}}$. Each hidden neuron corresponds to a different base index, and hence the input to each hidden neuron is a different component of $\hat{\mathbf{x}}$. Therefore, the norm of the vector whose components are the inputs to the hidden neurons is at most the norm of the input $\mathbf{x}$, and hence it is at most $1$.
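The spectral-norm claim can likewise be verified directly in the $m=2$ case. The sketch below (our own explicit construction of the $4\times 9$ matrix $W$ whose rows implement the patch-filter inner products) shows that each row selects a distinct input coordinate, so the computed spectral norm is $1$, consistent with $\|W\|\leq 1$.

import numpy as np

def conv_matrix(w):
    # Rows of W realize x -> <w, patch> for each 2 x 2 patch of a 3 x 3 input (stride 1).
    W = np.zeros((4, 9))
    for r, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):   # base indices (0-based)
        patch = np.zeros((3, 3))
        patch[i:i+2, j:j+2] = w
        W[r] = patch.ravel()
    return W

w = np.zeros((2, 2))
w[1, 1] = 1.0                              # filter for y = (1, 1)
print(np.linalg.norm(conv_matrix(w), 2))   # 1.0: each row is a distinct standard basis vector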

A.9.2 Proof for arbitrary $B,b_{x},\epsilon>0$

Let $m=\left(\frac{b_{x}B}{2\epsilon}\right)^{2}\cdot\log(n)$ and let $d=\left(\frac{b_{x}B}{2\epsilon}\right)^{2}\cdot 3^{\log(n)}$. Let $m^{\prime}=\log(n)$ and let $L=\left(\frac{b_{x}B}{2\epsilon}\right)^{2}$. Consider $m$ points $\mathbf{x}^{1},\ldots,\mathbf{x}^{m}$, where for every $i\in[m]$ the point $\mathbf{x}^{i}\in\mathbb{R}^{d}$ is a vectorization of a tensor $\hat{\mathbf{x}}^{i}$ of order $m^{\prime}+1$, whose components are indexed by $(j_{1},\ldots,j_{m^{\prime}},\ell)\in[3]^{m^{\prime}}\times[L]$. Consider a partition of $[m]$ into $L$ disjoint subsets $S_{1},\ldots,S_{L}$, each of size $m/L=m^{\prime}$.

We define the components $x^{i}_{j_{1},\ldots,j_{m^{\prime}},\ell}$ of $\hat{\mathbf{x}}^{i}$ as follows: Suppose that $i\in S_{r}:=\{k_{1},\ldots,k_{m^{\prime}}\}$ for some $r\in[L]$, and that $i=k_{t}$, i.e., $i$ is the $t$-th element in the subset $S_{r}$. For every $\ell\neq r$ we define $x^{i}_{j_{1},\ldots,j_{m^{\prime}},\ell}=0$ for every $(j_{1},\ldots,j_{m^{\prime}})\in[3]^{m^{\prime}}$, namely, if $\ell$ does not correspond to the subset of $i$ then the component is $0$. For $\ell=r$ the component $x^{i}_{j_{1},\ldots,j_{m^{\prime}},\ell}$ is defined in a similar way to the tensor $\hat{\mathbf{x}}^{i}$ from Subsection A.9.1, but with respect to the subset $S_{r}$ and at scale $b_{x}$. Formally, for $\ell=r$ we have $x^{i}_{j_{1},\ldots,j_{m^{\prime}},\ell}=b_{x}$ if $j_{t}=3$ and $j_{k}=2$ for all $k\neq t$, and $x^{i}_{j_{1},\ldots,j_{m^{\prime}},\ell}=0$ otherwise. Note that $\|\mathbf{x}^{i}\|=b_{x}$ for all $i\in[m]$.

Consider patches of dimensions $2\times\ldots\times 2\times L$ and stride $1$; thus, the set $\Phi$ corresponds to all the patches of dimensions $2\times\ldots\times 2\times L$ in the tensor. Note that since the last dimension is $L$, the filter can “move” only in the first $m^{\prime}$ dimensions. Also, note that there are $2^{m^{\prime}}=n$ such patches. Indeed, given $(j_{1},\ldots,j_{m^{\prime}})\in[2]^{m^{\prime}}$, we can define a patch which contains the indices $\left\{(j_{1},\ldots,j_{m^{\prime}},0)+(\Delta_{1},\ldots,\Delta_{m^{\prime}},\Delta_{m^{\prime}+1}):(\Delta_{1},\ldots,\Delta_{m^{\prime}})\in\{0,1\}^{m^{\prime}},\,\Delta_{m^{\prime}+1}\in[L]\right\}$. We say that $(j_{1},\ldots,j_{m^{\prime}})$ is the base index of this patch. Note that each $(j_{1},\ldots,j_{m^{\prime}})\in[2]^{m^{\prime}}$ is the base index of exactly one patch. Also, if $(j_{1},\ldots,j_{m^{\prime}})$ includes some $r\in[m^{\prime}]$ with $j_{r}=3$, then it does not induce a patch of this form, since for $\Delta_{r}=1$ we would get an invalid index.

We show that for any $\mathbf{y}\in\{0,1\}^{m}$ we can find a filter $\mathbf{w}$, such that $\mathbf{w}$ is an order-$(m^{\prime}+1)$ tensor of dimensions $2\times\ldots\times 2\times L$ and satisfies the following. Let $N_{\mathbf{w}}$ be the neural network that consists of a convolutional layer with the patches $\Phi$ and the filter $\mathbf{w}$, followed by a max-pooling layer. Then, for all $i\in[m]$ we have: if $y_{i}=0$ then $N_{\mathbf{w}}(\mathbf{x}^{i})=0$, and if $y_{i}=1$ then $N_{\mathbf{w}}(\mathbf{x}^{i})=2\epsilon$. Thus, we can shatter $\mathbf{x}^{1},\ldots,\mathbf{x}^{m}$ with margin $\epsilon$. Moreover, the spectral norm of the matrix $W$ that corresponds to the convolutional layer is at most $B$.

We now define the filter $\mathbf{w}$ of dimensions $2\times\ldots\times 2\times L$. For every $\ell\in[L]$ we define the components $w_{j_{1},\ldots,j_{m^{\prime}},\ell}$ as follows. Let $\mathbf{y}_{S_{\ell}}\in\{0,1\}^{m^{\prime}}$ be the restriction of $\mathbf{y}$ to the indices in $S_{\ell}$. Then, $w_{j_{1},\ldots,j_{m^{\prime}},\ell}=\frac{2\epsilon}{b_{x}}$ if $(j_{1},\ldots,j_{m^{\prime}})=\mathbf{1}+\mathbf{y}_{S_{\ell}}$, and $w_{j_{1},\ldots,j_{m^{\prime}},\ell}=0$ otherwise. We show that for all $i\in[m]$, if $y_{i}=0$ then $N_{\mathbf{w}}(\mathbf{x}^{i})=0$, and if $y_{i}=1$ then $N_{\mathbf{w}}(\mathbf{x}^{i})=2\epsilon$. Suppose that $i\in S_{r}:=\{k_{1},\ldots,k_{m^{\prime}}\}$ for some $r\in[L]$, and that $i=k_{t}$, i.e., $i$ is the $t$-th element in the subset $S_{r}$. Then, the tensor $\hat{\mathbf{x}}^{i}$ has a non-zero component only at $x^{i}_{j_{1},\ldots,j_{m^{\prime}},r}$ with $j_{t}=3$ and $j_{s}=2$ for all $s\neq t$. Moreover, the filter $\mathbf{w}$ has a non-zero component at index $(q_{1},\ldots,q_{m^{\prime}},r)$ iff $(q_{1},\ldots,q_{m^{\prime}})=\mathbf{1}+\mathbf{y}_{S_{r}}$. Hence, the inner product between $\mathbf{w}$ and a patch of $\mathbf{x}^{i}$ is non-zero iff the patch has a base index $(j_{1},\ldots,j_{m^{\prime}})$ such that $(j_{1},\ldots,j_{m^{\prime}})+\mathbf{y}_{S_{r}}=(p_{1},\ldots,p_{m^{\prime}})$, where $p_{t}=3$ and $p_{s}=2$ for all $s\neq t$. If $y_{i}=0$ then the $t$-th component of $\mathbf{y}_{S_{r}}$ is $0$, and there is no patch whose base index satisfies $j_{t}+(\mathbf{y}_{S_{r}})_{t}=j_{t}+0=p_{t}=3$, since all base indices are in $[2]^{m^{\prime}}$. Therefore, $N_{\mathbf{w}}(\mathbf{x}^{i})=0$. If $y_{i}=1$ then the patch whose base index satisfies $j_{t}=3-(\mathbf{y}_{S_{r}})_{t}=3-1=2$ and $j_{s}=2-(\mathbf{y}_{S_{r}})_{s}$ for $s\neq t$ gives output $\frac{2\epsilon}{b_{x}}\cdot b_{x}=2\epsilon$ (and all other patches give output $0$).
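To illustrate the general construction on a small instance, the sketch below (our own parameter choices: $m^{\prime}=2$, $L=2$, $b_{x}=1$, $\epsilon=1/2$, hence $B=\sqrt{2}$, and 0-based indices throughout) builds the four points and the filter for a particular labeling $\mathbf{y}$, and checks that the network outputs $2\epsilon$ exactly when $y_{i}=1$.

import numpy as np

bx, eps = 1.0, 0.5
B = 2 * eps * np.sqrt(2) / bx              # chosen so that L = (bx * B / (2 * eps))^2 = 2
S = [[0, 1], [2, 3]]                       # partition of [m] = {0,...,3} into L = 2 subsets

def make_point(i):
    # Point x^i: a 3 x 3 x 2 tensor with a single entry equal to bx, in the slice of its subset.
    x = np.zeros((3, 3, 2))
    r = 0 if i in S[0] else 1
    t = S[r].index(i)
    idx = [2 if k == t else 1 for k in range(2)]   # 0-based: j_t = 3, the other index = 2
    x[idx[0], idx[1], r] = bx
    return x

def make_filter(y):
    # Filter w: for each slice l, a single entry 2*eps/bx at 0-based index y_{S_l}.
    w = np.zeros((2, 2, 2))
    for r in range(2):
        y_r = [y[k] for k in S[r]]
        w[y_r[0], y_r[1], r] = 2 * eps / bx
    return w

def net(x, w):
    # Convolutional layer (2 x 2 x 2 patches, stride 1 in the first two dims) + max-pooling.
    return max(np.sum(w * x[i:i+2, j:j+2, :]) for i in range(2) for j in range(2))

y = (1, 0, 0, 1)
for i in range(4):
    print(i, net(make_point(i), make_filter(y)))   # prints 2*eps = 1.0 iff y_i = 1, else 0.0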

It remains to show that the spectral norm of the matrix $W$ that corresponds to the convolutional layer with the filter $\mathbf{w}$ is at most $B$. Thus, we show that for every input $\mathbf{x}\in\mathbb{R}^{d}$ with $\|\mathbf{x}\|=1$, the vector of inputs to the hidden layer has norm at most $B$. We view $\mathbf{x}$ as the vectorization of a tensor $\hat{\mathbf{x}}$ with components $x_{j_{1},\ldots,j_{m^{\prime}},\ell}$ for $(j_{1},\ldots,j_{m^{\prime}},\ell)\in[3]^{m^{\prime}}\times[L]$. The inner product between a patch of $\mathbf{x}$ and the filter $\mathbf{w}$ can be written as

\sum_{\ell\in[L]}\frac{2\epsilon}{b_{x}}\cdot x_{q^{(\ell)}_{1},\ldots,q^{(\ell)}_{m^{\prime}},\ell}~.

Thus, for each $\ell$ there is a single index of $\hat{\mathbf{x}}$ that contributes to the inner product, since for every $\ell$ the filter $\mathbf{w}$ has a single non-zero component, which equals $\frac{2\epsilon}{b_{x}}$. By Cauchy–Schwarz, the above sum is at most

\frac{2\epsilon}{b_{x}}\cdot\sqrt{L}\cdot\sqrt{\sum_{\ell\in[L]}x^{2}_{q^{(\ell)}_{1},\ldots,q^{(\ell)}_{m^{\prime}},\ell}}=\frac{2\epsilon}{b_{x}}\cdot\frac{b_{x}B}{2\epsilon}\cdot\sqrt{\sum_{\ell\in[L]}x^{2}_{q^{(\ell)}_{1},\ldots,q^{(\ell)}_{m^{\prime}},\ell}}=B\cdot\sqrt{\sum_{\ell\in[L]}x^{2}_{q^{(\ell)}_{1},\ldots,q^{(\ell)}_{m^{\prime}},\ell}}~. (20)

Hence, the input to the hidden neuron that corresponds to the patch is bounded by the above expression. Moreover, since for every $\ell\in[L]$ the filter $\mathbf{w}$ has a single non-zero component whose last index is $\ell$, for every two patches with different base indices the bound in the above expression involves different indices of $\hat{\mathbf{x}}$. Namely, if the inner product between one patch of $\mathbf{x}$ and the filter $\mathbf{w}$ is $\sum_{\ell\in[L]}\frac{2\epsilon}{b_{x}}\cdot x_{q^{(\ell)}_{1},\ldots,q^{(\ell)}_{m^{\prime}},\ell}$ and the inner product between another patch of $\mathbf{x}$ and the filter $\mathbf{w}$ is $\sum_{\ell\in[L]}\frac{2\epsilon}{b_{x}}\cdot x_{p^{(\ell)}_{1},\ldots,p^{(\ell)}_{m^{\prime}},\ell}$, then for every $\ell$ we have $(q^{(\ell)}_{1},\ldots,q^{(\ell)}_{m^{\prime}})\neq(p^{(\ell)}_{1},\ldots,p^{(\ell)}_{m^{\prime}})$. Since by Eq. (20) the square of the input to each hidden neuron is bounded by $B^{2}\cdot\sum_{\ell\in[L]}x^{2}_{q^{(\ell)}_{1},\ldots,q^{(\ell)}_{m^{\prime}},\ell}$ for some subset $\left\{x_{q^{(\ell)}_{1},\ldots,q^{(\ell)}_{m^{\prime}},\ell}\right\}_{\ell\in[L]}$ of components, and since for every two hidden neurons these subsets are disjoint, the norm of the vector of inputs to the hidden neurons can be bounded by

\sqrt{B^{2}\cdot\sum_{k\in[d]}x_{k}^{2}}\leq\sqrt{B^{2}\cdot 1}=B~.
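Finally, a short sketch (continuing the toy $m^{\prime}=2$, $L=2$ instance above, where $B=\sqrt{2}$ and $2\epsilon/b_{x}=1$) builds the $4\times 18$ matrix corresponding to the convolutional layer and confirms that its spectral norm does not exceed $B$.

import numpy as np

# Filter for y = (1, 0, 0, 1) in the toy instance: y_{S_1} = (1, 0), y_{S_2} = (0, 1),
# so the nonzero entries (value 2*eps/bx = 1) sit at 0-based indices (1, 0, 0) and (0, 1, 1).
w = np.zeros((2, 2, 2))
w[1, 0, 0] = 1.0
w[0, 1, 1] = 1.0

W = np.zeros((4, 18))                       # one row per 2 x 2 x 2 patch of a 3 x 3 x 2 input
for row, (i, j) in enumerate([(0, 0), (0, 1), (1, 0), (1, 1)]):
    patch = np.zeros((3, 3, 2))
    patch[i:i+2, j:j+2, :] = w
    W[row] = patch.ravel()

print(np.linalg.norm(W, 2), np.sqrt(2))     # spectral norm equals sqrt(2) <= B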