
Random Neural Networks in the Infinite Width Limit as Gaussian Processes

Boris Hanin
Department of Operations Research and Financial Engineering
Princeton University
Abstract

This article gives a new proof that fully connected neural networks with random weights and biases converge to Gaussian processes in the regime where the input dimension, output dimension, and depth are kept fixed, while the hidden layer widths tend to infinity. Unlike prior work, convergence is established assuming only moment conditions on the distribution of weights and for quite general non-linearities.

1 Introduction

In the last decade or so neural networks, originally introduced in the 1940’s and 50’s [29, 45], have become indispensable tools for machine learning tasks ranging from computer vision [32] to natural language processing [9] and reinforcement learning [47]. Their empirical success has raised many new mathematical questions in approximation theory [13, 55, 56], probability (see §1.2.2 for some references), optimization/learning theory [7, 8, 31, 58] and so on. The present article concerns a fundamental probabilistic question about arguably the simplest networks, the so-called fully connected neural networks, defined as follows:

Definition 1.1 (Fully Connected Network).

Fix a positive integer $L$ as well as $L+2$ positive integers $n_{0},\ldots,n_{L+1}$ and a function $\sigma:\mathbb{R}\rightarrow\mathbb{R}$. A fully connected depth $L$ neural network with input dimension $n_{0}$, output dimension $n_{L+1}$, hidden layer widths $n_{1},\ldots,n_{L}$, and non-linearity $\sigma$ is any function $x_{\alpha}\in\mathbb{R}^{n_{0}}\mapsto z_{\alpha}^{(L+1)}\in\mathbb{R}^{n_{L+1}}$ of the following form

z_{\alpha}^{(\ell)}=\begin{cases}W^{(1)}x_{\alpha}+b^{(1)},&\quad\ell=1\\ W^{(\ell)}\sigma(z_{\alpha}^{(\ell-1)})+b^{(\ell)},&\quad\ell=2,\ldots,L+1\end{cases},

where $W^{(\ell)}\in\mathbb{R}^{n_{\ell}\times n_{\ell-1}}$ are matrices, $b^{(\ell)}\in\mathbb{R}^{n_{\ell}}$ are vectors, and $\sigma$ applied to a vector is shorthand for $\sigma$ applied to each component.

The parameters $L,n_{0},\ldots,n_{L+1}$ are called the network architecture, and $z_{\alpha}^{(\ell)}\in\mathbb{R}^{n_{\ell}}$ is called the vector of pre-activations at layer $\ell$ corresponding to input $x_{\alpha}$. A fully connected network with a fixed architecture and given non-linearity $\sigma$ is therefore a finite (but typically high) dimensional family of functions, parameterized by the network weights (entries of the weight matrices $W^{(\ell)}$) and biases (components of the bias vectors $b^{(\ell)}$).
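In code, the recursion of Definition 1.1 amounts to the few lines below. This is a minimal NumPy sketch of ours, included purely as an illustration; the names forward, weights, and biases are not from the paper.

```python
import numpy as np

def forward(x, weights, biases, sigma):
    """Pre-activations z^{(1)}, ..., z^{(L+1)} of Definition 1.1 for a single input x.

    x       : network input, shape (n_0,)
    weights : [W^{(1)}, ..., W^{(L+1)}] with W^{(l)} of shape (n_l, n_{l-1})
    biases  : [b^{(1)}, ..., b^{(L+1)}] with b^{(l)} of shape (n_l,)
    sigma   : non-linearity, applied componentwise
    """
    z = weights[0] @ x + biases[0]            # layer 1: z^{(1)} = W^{(1)} x + b^{(1)}
    zs = [z]
    for W, b in zip(weights[1:], biases[1:]):
        z = W @ sigma(z) + b                  # layer l >= 2: z^{(l)} = W^{(l)} sigma(z^{(l-1)}) + b^{(l)}
        zs.append(z)
    return zs
```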

This article considers the mapping $x_{\alpha}\mapsto z_{\alpha}^{(L+1)}$ when the network's weights and biases are chosen independently at random and the hidden layer widths $n_{1},\ldots,n_{L}$ are sent to infinity while the input dimension $n_{0}$, output dimension $n_{L+1}$, and network depth $L$ are fixed. In this infinite width limit, akin to the large matrix limit in random matrix theory (see §1.2), neural networks with random weights and biases converge to Gaussian processes (see §1.4 for a review of prior work). Unlike prior work, our main result, Theorem 1.2, shows that this holds for general non-linearities $\sigma$ and distributions of network weights (cf. §1.3).

Moreover, in addition to establishing convergence of wide neural networks to a Gaussian process under weak hypotheses, the present article gives a mathematical take, aimed at probabilists, on some of the ideas developed in the recent monograph [44]. This book, written in the language and style of theoretical physics by Roberts and Yaida, is based on research done jointly with the author. It represents a far-reaching development of the breakthrough work of Yaida [50], which was the first to systematically explain how to compute finite width corrections to the infinite width Gaussian process limit of random neural networks for arbitrary depth, width, and non-linearity. Previously, such finite width (and large depth) corrections were only possible for some special observables in linear and ReLU networks [21, 23, 24, 25, 39, 57]. The present article deals only with the asymptotic analysis of random neural networks as the width tends to infinity, leaving to future work a probabilistic elaboration of some aspects of the approach to finite width corrections from [44].

1.1 Roadmap

The rest of this article is organized as follows. First, in §1.2 we briefly motivate the study of neural networks with random weights. Then, in §1.3 we formulate our main result, Theorem 1.2. Before giving its proof in §2, we first indicate in §1.4 the general idea of the proof and its relation to prior work.

1.2 Why Random Neural Networks?

1.2.1 Practical Motivations

It may seem at first glance that studying neural networks with random weights and biases is of no practical interest. After all, a neural network is only useful after it has been “trained,” i.e. one has found a setting of its parameters so that the resulting network function (at least approximately) interpolates a given training dataset of input-output pairs $(x,f(x))$ for an otherwise unknown function $f:\mathbb{R}^{n_{0}}\rightarrow\mathbb{R}^{n_{L+1}}$.

However, the vast majority of neural network training algorithms used in practice are variants of gradient descent starting from a random initialization of the weight matrices $W^{(\ell)}$ and bias vectors $b^{(\ell)}$. Studying networks with random weights and biases therefore provides an understanding of the initial conditions for neural network optimization.

Beyond illuminating the properties of networks at the start of training, the analysis of random neural networks can reveal a great deal about networks after training as well. Indeed, on a heuristic level, just as the behavior of the level spacings of the eigenvalues of large random matrices is a surprisingly good match for emission spectra of heavy atoms [49], it is not unreasonable to believe that certain coarse properties of the incredibly complex networks used in practice will be similar to those of networks with random weights and biases. More rigorously, neural networks used in practice often have many more tunable parameters (weights and biases) than the number of datapoints from the training dataset. Thus, at least in certain regimes, neural network training provably proceeds by an approximate linearization around initialization, since no one parameter needs to move much to fit the data. This so-called NTK analysis [14, 16, 30, 31, 35] shows, with several important caveats related to network size and initialization scheme, that in some cases the statistical properties of neural networks at the start of training are the key determinants of their behavior throughout training.

1.2.2 Motivation from Random Matrix Theory

In addition to being of practical importance, random neural networks are also fascinating mathematical objects, giving rise to new problems in approximation theory [12, 13, 22, 55, 56], random geometry [26, 27], and random matrix theory (RMT). Perhaps the most direct, though by no means only, connection to RMT questions is to set the network biases $b^{(\ell)}$ to zero and consider the very special case when $\sigma(t)=t$ is the identity (in the machine learning literature these are called deep linear networks). The network function

z_{\alpha}^{(L+1)}=W^{(L+1)}\cdots W^{(1)}x_{\alpha} \qquad (1.1)

is then a linear statistic for a product of $L+1$ independent random matrices. Such matrix models have been extensively studied, primarily in two regimes. The first is the multiplicative ergodic theorem regime [10, 17, 18, 46], in which all the layer widths $n_{0},\ldots,n_{L+1}$ are typically set to a fixed value $n$ and the network depth $L$ tends to infinity. The second regime, where $L$ is fixed and the layer widths $n_{\ell}$ (i.e. matrix dimensions) tend to infinity, is the purview of free probability [38, 48].

In the presence of a non-linearity $\sigma$, random neural networks provide non-linear generalizations of the usual RMT questions. For instance, the questions taken up in this article are analogs of the joint normality of linear statistics of random matrix products in the free probability regime. Further, random neural networks give additional motivation for studying the matrix products appearing in (1.1) when the matrix dimensions $n_{\ell}$ and the number of terms $L$ are simultaneously large. This double scaling limit reveals new phenomena [3, 4, 5, 6, 20, 24, 25] but is so far poorly understood relative to the ergodic or free regimes.

Finally, beyond studying linear networks, random matrix theory questions naturally appear in neural network theory via non-linear analogs of the Marchenko-Pastur distribution for empirical covariance matrices of $z_{\alpha}^{(L+1)}$ when $\alpha\in A$ ranges over a random dataset of inputs [1, 28, 41, 43], as well as through the spectrum of the input-output Jacobian [24, 42] and the NTK [2, 16].

1.3 Main Result

Our main result shows that under rather general conditions, when the weights $W^{(\ell)}$ and biases $b^{(\ell)}$ of a fully connected network are chosen at random, the resulting field $x_{\alpha}\mapsto z_{\alpha}^{(L+1)}$ converges to a centered Gaussian field with iid components when the input dimension $n_{0}$ and output dimension $n_{L+1}$ are held fixed but the hidden layer widths $n_{1},\ldots,n_{L}$ tend to infinity. To give the precise statement in Theorem 1.2 below, fix a fully connected neural network with depth $L\geq 1$, input dimension $n_{0}$, output dimension $n_{L+1}$, hidden layer widths $n_{1},\ldots,n_{L}\geq 1$, and non-linearity $\sigma:\mathbb{R}\rightarrow\mathbb{R}$. We assume that $\sigma$ is absolutely continuous and that its almost-everywhere defined derivative (and hence $\sigma$ itself) is polynomially bounded:

\exists k>0\text{ s.t. }\left|\left|\frac{\sigma^{\prime}(x)}{1+\left|x\right|^{k}}\right|\right|_{L^{\infty}(\mathbb{R})}<\infty. \qquad (1.2)

All non-linearities used in practice satisfy these rather mild criteria. Further, let us write $W_{ij}^{(\ell)}$ for the entries of the weight matrices $W^{(\ell)}$ and $b_{i}^{(\ell)}$ for the components of the bias vectors $b^{(\ell)}$. For $\ell\geq 2$, Definition 1.1 of fully connected networks means that the formula for the components of the pre-activations $z_{\alpha}^{(\ell)}$ at layer $\ell$ in terms of those for $z_{\alpha}^{(\ell-1)}$ reads

z_{i;\alpha}^{(\ell)}:=b_{i}^{(\ell)}+\sum_{j=1}^{n_{\ell-1}}W_{ij}^{(\ell)}\sigma(z_{j;\alpha}^{(\ell-1)}),\qquad i=1,\ldots,n_{\ell}, \qquad (1.3)

where we've denoted by $z_{i;\alpha}^{(\ell)}$ the $i^{th}$ component of the $n_{\ell}$-dimensional vector of pre-activations $z_{\alpha}^{(\ell)}$ in layer $\ell$ corresponding to a network input $x_{\alpha}\in\mathbb{R}^{n_{0}}$. We make the following assumption on the network weights:

W_{ij}^{(\ell)}:=\left(\frac{C_{W}}{n_{\ell-1}}\right)^{1/2}\widehat{W}_{ij}^{(\ell)},\qquad\widehat{W}_{ij}^{(\ell)}\sim\mu\quad\text{iid}, \qquad (1.4)

where $\mu$ is a fixed probability distribution on $\mathbb{R}$ such that

\mu\text{ has mean }0,\text{ variance }1\text{, and finite higher moments.} \qquad (1.5)

We further assume the network biases are iid Gaussian\footnote{As explained in §1.4, the universality results in this article are simply not true if the biases are drawn iid from a fixed non-Gaussian distribution.} and independent of the weights:

b_{i}^{(\ell)}\sim\mathcal{N}(0,C_{b})\quad\text{iid}. \qquad (1.6)

In (1.4) and (1.6), $C_{W}>0$ and $C_{b}\geq 0$ are fixed constants. These constants do not play an important role for the analysis in this article but will be crucial for follow-up work. With the network weights and biases chosen at random, the vectors $z_{\alpha}^{(\ell)}$ are also random. Our main result is that, in the infinite width limit, they have independent Gaussian components.
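For concreteness, the initialization (1.4)–(1.6) can be sampled as in the sketch below (ours, not from the paper): the entries $\widehat{W}_{ij}^{(\ell)}$ are drawn iid from a mean-zero, variance-one distribution $\mu$ and rescaled by $(C_{W}/n_{\ell-1})^{1/2}$, while the biases are $\mathcal{N}(0,C_{b})$. The helper name sample_network and the uniform choice of $\mu$ are our own; any $\mu$ satisfying (1.5) would do.

```python
import numpy as np

rng = np.random.default_rng(0)

def mu_uniform(rng, size):
    # one admissible choice of mu: uniform on [-sqrt(3), sqrt(3)] has mean 0 and variance 1
    return rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), size)

def sample_network(widths, C_W=2.0, C_b=0.1, mu=mu_uniform):
    """One draw of the weights and biases of (1.4)-(1.6) for widths = [n_0, ..., n_{L+1}]."""
    weights, biases = [], []
    for n_in, n_out in zip(widths[:-1], widths[1:]):
        W_hat = mu(rng, (n_out, n_in))                              # hat-W entries, iid from mu
        weights.append(np.sqrt(C_W / n_in) * W_hat)                 # (1.4)
        biases.append(np.sqrt(C_b) * rng.standard_normal(n_out))    # (1.6): N(0, C_b) iid
    return weights, biases
```

Feeding the output of sample_network into the forward-pass sketch after Definition 1.1 produces one realization of the random field $x_{\alpha}\mapsto z_{\alpha}^{(L+1)}$ studied below.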

Theorem 1.2.

Fix $n_{0},n_{L+1}$ and a compact set $T\subseteq\mathbb{R}^{n_{0}}$. As the hidden layer widths $n_{1},\ldots,n_{L}$ tend to infinity, the sequence of stochastic processes

x_{\alpha}\in\mathbb{R}^{n_{0}}\quad\mapsto\quad z_{\alpha}^{(L+1)}\in\mathbb{R}^{n_{L+1}}

converges weakly in $C^{0}(T,\mathbb{R}^{n_{L+1}})$ to a centered Gaussian process taking values in $\mathbb{R}^{n_{L+1}}$ with iid coordinates. The coordinate-wise covariance function

K_{\alpha\beta}^{(L+1)}:=\lim_{n_{1},\ldots,n_{L}\rightarrow\infty}\mathrm{Cov}\left(z_{i;\alpha}^{(L+1)},\,z_{i;\beta}^{(L+1)}\right)

for this limiting process satisfies the layerwise recursion

K_{\alpha\beta}^{(\ell+1)}=C_{b}+C_{W}\mathbb{E}\left[\sigma(z_{\alpha})\sigma(z_{\beta})\right],\qquad\left(\begin{array}{c}z_{\alpha}\\ z_{\beta}\end{array}\right)\sim\mathcal{N}\left(0,\left(\begin{array}{cc}K_{\alpha\alpha}^{(\ell)}&K_{\alpha\beta}^{(\ell)}\\ K_{\alpha\beta}^{(\ell)}&K_{\beta\beta}^{(\ell)}\end{array}\right)\right) \qquad (1.7)

for $\ell\geq 2$, with initial condition

K_{\alpha\beta}^{(2)}=C_{b}+C_{W}\mathbb{E}\left[\sigma\left(z_{1;\alpha}^{(1)}\right)\sigma\left(z_{1;\beta}^{(1)}\right)\right], \qquad (1.8)

where the distribution of $(z_{1;\alpha}^{(1)},z_{1;\beta}^{(1)})$ is determined via (1.3) by the distribution of weights and biases in the first layer and hence is not universal.
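As an aside (an illustration of ours, not part of the statement or its proof), the limiting covariance can be computed numerically from (1.7)–(1.8). The sketch below assumes Gaussian weights and biases in layer 1, in which case the initial condition is itself a two-dimensional Gaussian expectation, and estimates $K^{(L+1)}$ for a pair of inputs by Monte Carlo; the names limiting_covariance and n_mc are ours, and ReLU is an arbitrary choice of $\sigma$.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma = lambda t: np.maximum(t, 0.0)     # ReLU; any sigma satisfying (1.2) could be used

def limiting_covariance(x_a, x_b, L, C_W=2.0, C_b=0.0, n_mc=200_000):
    """Iterate (1.7) starting from (1.8) for the two inputs x_a, x_b."""
    n0 = len(x_a)
    # With Gaussian layer-1 weights, z^{(1)} is exactly Gaussian with this 2x2 covariance.
    K = C_b + C_W / n0 * np.array([[x_a @ x_a, x_a @ x_b],
                                   [x_a @ x_b, x_b @ x_b]])
    for _ in range(L):                   # produces K^{(2)}, ..., K^{(L+1)}
        z = rng.multivariate_normal(np.zeros(2), K, size=n_mc)   # (z_alpha, z_beta)
        s = sigma(z)
        K = C_b + C_W * (s.T @ s) / n_mc                         # Monte Carlo version of (1.7)
    return K                             # approximates K^{(L+1)} at (x_a, x_b)

print(limiting_covariance(np.array([1.0, 0.0]), np.array([0.6, 0.8]), L=3))
```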

We prove Theorem 1.2 in §2. First, we explain the main idea and review prior work.

1.4 Theorem 1.2: Discussion, Main Idea, and Relation to Prior Work

At a high level, the proof of Theorem 1.2 (specifically the convergence of finite-dimensional distributions) proceeds as follows:

  1. Conditional on the mapping $x_{\alpha}\mapsto z_{\alpha}^{(L)}$, the components of the neural network output $x_{\alpha}\mapsto z_{\alpha}^{(L+1)}$ are independent sums of $n_{L}$ independent random fields (see (1.3)), and hence, when $n_{L}$ is large, are each approximately Gaussian by the CLT.

  2. The conditional covariance in the CLT from step 1 is random at finite widths (it depends on $z_{\alpha}^{(L)}$). However, it has the special form of an average over $j=1,\ldots,n_{L}$ of the same function applied to each component $z_{j;\alpha}^{(L)}$ of the vector $z_{\alpha}^{(L)}$ of pre-activations at the last hidden layer. We call such objects collective observables (see §2.1.2 and (1.10)).

  3. While the $z_{j;\alpha}^{(\ell)}$ are not independent at finite width when $\ell\geq 2$, they are sufficiently weakly correlated that a LLN still applies to any collective observable in the infinite width limit (see §2.1.2).

  4. The LLN from step 3 allows us to replace the random conditional covariance matrix from steps 1 and 2 by its expectation, asymptotically as $n_{1},\ldots,n_{L}$ tend to infinity.

We turn to giving a few more details on steps 1-4 and reviewing along the way the relation of the present article to prior work. The study of the infinite width limit for random neural networks dates back at least to Neal [37], who considered networks with one hidden layer:

z_{i;\alpha}^{(2)}=b_{i}^{(2)}+\sum_{j=1}^{n_{1}}W_{ij}^{(2)}\sigma\left(z_{j;\alpha}^{(1)}\right),\qquad z_{j;\alpha}^{(1)}=b_{j}^{(1)}+\sum_{k=1}^{n_{0}}W_{jk}^{(1)}x_{k;\alpha},

where $i=1,\ldots,n_{2}$. In the shallow $L=1$ setting of Neal, if in addition $n_{2}=1$, then, neglecting the bias $b_{1}^{(2)}$ for the moment, the scalar field $z_{1;\alpha}^{(2)}$ is a sum of iid random fields with finite moments, and hence the asymptotic normality of its finite-dimensional distributions follows immediately from the multidimensional CLT. Modulo tightness, this explains why $z_{1;\alpha}^{(2)}$ ought to converge to a Gaussian field. Even this simple case, however, holds several useful lessons:

  • If the distribution of the bias $b_{1}^{(2)}$ is fixed independently of $n_{1}$ and is non-Gaussian, then the distribution of $z_{1;\alpha}^{(2)}$ will not be Gaussian, even in the limit when $n_{1}\rightarrow\infty$.

  • If the first layer biases $b_{j}^{(1)}$ are drawn iid from a fixed distribution $\mu_{b}$ and $\sigma$ is non-linear, then higher moments of $\mu_{b}$ will contribute to the variance of each neuron post-activation $\sigma(z_{j;\alpha}^{(1)})$, causing the covariance of the Gaussian field at infinite width to be non-universal.

  • Unlike in deeper layers, as long as $n_{0}$ is fixed, the distribution of each neuron pre-activation $z_{j;\alpha}^{(1)}$ in the first layer will not be Gaussian, unless the weights and biases in layer 1 are themselves Gaussian. This explains why, in the initial condition (1.8), the distribution in the first layer is non-Gaussian.

In light of the first two points, what should one assume about the bias distribution? There are, it seems, two options. The first is to assume that the variance of the biases tends to zero as $n_{1}\rightarrow\infty$, putting them on par with the weights. The second, which we adopt in this article, is to declare all biases to be Gaussian.

The first trick in proving Theorem 1.2 for general depth and width appears already when $L=1$ but the output dimension $n_{2}$ is at least two.\footnote{Neal [37] states erroneously on page 38 of his thesis that $z_{i;\alpha}^{(2)}$ and $z_{j;\alpha}^{(2)}$ will be independent because the weights going into them are independent. This is not true at finite width but becomes true in the infinite width limit.} In this case, even for a single network input $x_{\alpha}$, at finite values of the network width $n_{1}$ different components of the random $n_{2}$-dimensional vector $z_{\alpha}^{(2)}$ are not independent, due to their shared dependence on the vector $z_{\alpha}^{(1)}$. The key observation, which to the author's knowledge was first presented in [34], is to note that the components of $z_{\alpha}^{(2)}$ are independent conditional on the first layer (i.e. on $z_{\alpha}^{(1)}$) and are approximately Gaussian when $n_{1}$ is large by the CLT. The conditional variance, which captures the main dependence on $z_{\alpha}^{(1)}$, has the following form:

\Sigma_{\alpha\alpha}^{(2)}:=\mathrm{Var}\left[z_{i;\alpha}^{(2)}~\big|~z_{\alpha}^{(1)}\right]=C_{b}+\frac{C_{W}}{n_{1}}\sum_{j=1}^{n_{1}}\sigma\left(z_{j;\alpha}^{(1)}\right)^{2}. \qquad (1.9)

This is an example of what we'll call a collective observable, an average over all neurons in a layer of the same function applied to the pre-activations at each neuron (see §2.1.2 for the precise definition). In the shallow $L=1$ setting, $\Sigma_{\alpha\alpha}^{(2)}$ is a sum of $n_{1}$ iid random variables with finite moments. Hence, by the LLN, it converges almost surely to its mean as $n_{1}\rightarrow\infty$. This causes the components of $z_{\alpha}^{(2)}$ to become independent in the infinite width limit, since the source of their shared randomness, $\Sigma_{\alpha\alpha}^{(2)}$, can be replaced asymptotically by its expectation.
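This concentration is easy to observe numerically. The following sketch (ours, with arbitrary parameter choices) estimates the variance of $\Sigma_{\alpha\alpha}^{(2)}$ from (1.9) over independent draws of the first-layer parameters and shows it shrinking as $n_{1}$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = np.tanh
C_W, C_b, n0 = 1.0, 0.1, 5
x = rng.standard_normal(n0)                      # a fixed network input x_alpha

for n1 in [10, 100, 1000, 10_000]:
    draws = []
    for _ in range(500):                          # independent realizations of layer 1
        W1 = np.sqrt(C_W / n0) * rng.standard_normal((n1, n0))
        b1 = np.sqrt(C_b) * rng.standard_normal(n1)
        z1 = W1 @ x + b1
        draws.append(C_b + C_W * np.mean(sigma(z1) ** 2))   # Sigma^{(2)}_{aa} from (1.9)
    print(n1, np.var(draws))                      # decays roughly like 1/n_1
```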

The proof for general $L$ follows a similar pattern. Exactly as before, for any $0\leq\ell\leq L$, the components of the pre-activations at layer $\ell+1$ are still conditionally independent, given the pre-activations at layer $\ell$. When the width $n_{\ell}$ is large, the conditional distribution of each component over any finite collection of inputs is therefore approximately Gaussian by the CLT. Moreover, the conditional covariance across network inputs has the form:

\Sigma_{\alpha\beta}^{(\ell+1)}:=\mathrm{Cov}\left(z_{i;\alpha}^{(\ell+1)},\,z_{i;\beta}^{(\ell+1)}~\big|~z_{\alpha}^{(\ell)},\,z_{\beta}^{(\ell)}\right)=C_{b}+\frac{C_{W}}{n_{\ell}}\sum_{j=1}^{n_{\ell}}\sigma\left(z_{j;\alpha}^{(\ell)}\right)\sigma\left(z_{j;\beta}^{(\ell)}\right). \qquad (1.10)

The summands on the right hand side are no longer independent at finite width if $\ell\geq 2$. However, the $\Sigma_{\alpha\beta}^{(\ell+1)}$ are still collective observables, and the crucial point is to check that their dependence is sufficiently weak that we may still apply the LLN. Verifying this is the heart of the proof of Theorem 1.2 and is carried out in §2.1.2.

Let us mention that, in addition to the approach outlined above, other methods for showing that wide neural networks are asymptotically Gaussian processes are possible. In the prior article [36], for instance, the idea is to use that the entries of $z_{\alpha}^{(\ell)}$ are exchangeable and argue using an exchangeable CLT. This leads to some technical complications which, at least in the way the argument is carried out in [36], result in unnatural restrictions on the class of non-linearities and weight distributions considered there. Let us also mention that in the article [34], the non-independence at finite width of the components of $z_{\alpha}^{(\ell)}$ for large $\ell$ was circumvented by considering only the sequential limit in which $n_{\ell}\rightarrow\infty$ in order of increasing $\ell$. The effect is that for every $\ell$ the conditional covariance $\Sigma_{\alpha\beta}^{(\ell)}$ has already converged to its mean before $n_{\ell+1}$ is taken large. However, this way of taking the infinite width limit seems to the author somewhat unnatural and is in any case not conducive to studying finite width corrections as in [44, 50], which we plan to take up in future work.

We conclude this section by pointing the reader to several other related strands of work. The first are articles such as [11], which quantify the magnitude of the difference

\frac{1}{n_{\ell}}\sum_{i=1}^{n_{\ell}}z_{i;\alpha}^{(\ell)}z_{i;\beta}^{(\ell)}-\lim_{n_{1},\ldots,n_{\ell}\rightarrow\infty}\mathbb{E}\left[\frac{1}{n_{\ell}}\sum_{i=1}^{n_{\ell}}z_{i;\alpha}^{(\ell)}z_{i;\beta}^{(\ell)}\right]

between the empirical overlaps $n_{\ell}^{-1}\left\langle z_{\alpha}^{(\ell)},z_{\beta}^{(\ell)}\right\rangle$ of pre-activations and the corresponding infinite width limit, uniformly over network inputs $x_{\alpha},x_{\beta}$ in a compact subset of $\mathbb{R}^{n_{0}}$. In a similar vein are articles such as [15], which give quantitative estimates at finite width for the distance from $x_{\alpha}\mapsto z_{\alpha}^{(\ell)}$ to a nearby Gaussian process.

The second is the series of articles starting with the work of Yang [51, 52, 53, 54], which develops the study not only of initialization but also of certain aspects of inference with infinitely wide networks using what Yang terms tensor programs. As part of that series, the article [52] establishes that in the infinite width limit many different architectures become Gaussian processes. However, the arguments in those articles are significantly more technical than the ones presented here since they are focused on building the foundation for the tensor program framework. At any rate, to the best of the author's knowledge, no prior article addresses universality of the Gaussian process limit with respect to the weight distribution in deep networks (for shallow networks with $L=1$ this was considered by Neal in [37]). Finally, the convergence of random neural networks to Gaussian processes in the infinite width limit, under various restrictions and for architectures other than fully connected, is taken up in [19, 40, 52].
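Before turning to the proof, here is a small numerical illustration of Theorem 1.2 itself (ours, with hypothetical parameter choices, not an experiment reported in the paper): with non-Gaussian Rademacher weights, the empirical kurtosis of a single output coordinate approaches the Gaussian value 3 as the hidden layer widths grow.

```python
import numpy as np

rng = np.random.default_rng(5)
sigma = np.tanh
C_W, C_b, n0, L = 1.0, 0.1, 3, 2
x = rng.standard_normal(n0)                                 # a fixed input

def output_sample(n):
    """One draw of the output coordinate z^{(L+1)}_1 at x, all hidden widths equal to n."""
    widths = [n0] + [n] * L + [1]
    z = x
    for l, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:]), start=1):
        W = np.sqrt(C_W / n_in) * rng.choice([-1.0, 1.0], size=(n_out, n_in))  # Rademacher weights
        b = np.sqrt(C_b) * rng.standard_normal(n_out)
        z = W @ (z if l == 1 else sigma(z)) + b
    return z[0]

for n in [4, 32, 128]:
    samples = np.array([output_sample(n) for _ in range(10_000)])
    kurt = np.mean(samples ** 4) / np.mean(samples ** 2) ** 2
    print(n, kurt)    # rough Monte Carlo estimate; approaches 3 as the width grows
```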

2 Proof of Theorem 1.2

Let us recall the notation. Namely, we fix a network depth $L\geq 1$, an input dimension $n_{0}\geq 1$, an output dimension $n_{L+1}\geq 1$, hidden layer widths $n_{1},\ldots,n_{L}\geq 1$, and a non-linearity $\sigma$ satisfying (1.2). We further assume that the network weights and biases are independent and random as in (1.4) and (1.6). To prove Theorem 1.2 we must show that the random fields $x_{\alpha}\mapsto z_{\alpha}^{(L+1)}$ converge weakly in distribution to a Gaussian process in the limit where $n_{1},\dots,n_{L}$ tend to infinity. We start with the convergence of finite-dimensional distributions. Let us therefore fix a collection

x_{A}=\left\{x_{\alpha},\quad\alpha\in A\right\}

of $\left|A\right|$ distinct network inputs in $\mathbb{R}^{n_{0}}$ and introduce for each $\ell=0,\ldots,L+1$, every $i=1,\ldots,n_{\ell}$, and all $\alpha\in A$ the vectorized notation

z_{i;A}^{(\ell)}:=\left(z_{i;\alpha}^{(\ell)},\,\alpha\in A\right)\in\mathbb{R}^{\left|A\right|},\qquad z_{A}^{(\ell)}:=\left(z_{i;A}^{(\ell)},\,i=1,\ldots,n_{\ell}\right)\in\mathbb{R}^{n_{\ell}\times\left|A\right|}.

The following result states that the distribution of the random variable $z_{A}^{(L+1)}$ with values in $\mathbb{R}^{n_{L+1}\times\left|A\right|}$ converges to that of the claimed Gaussian field.

Proposition 2.1 (Convergence of Finite-Dimensional Distributions).

Fix $L\geq 1$ and $n_{0},n_{L+1}$. The distribution of $z_{A}^{(L+1)}$ converges weakly as $n_{1},\ldots,n_{L}\rightarrow\infty$ to that of a centered Gaussian in $\mathbb{R}^{n_{L+1}\times\left|A\right|}$ with iid rows for which the covariance

K_{\alpha\beta}^{(L+1)}=\lim_{n_{1},\ldots,n_{L}\rightarrow\infty}\mathrm{Cov}\left(z_{i;\alpha}^{(L+1)},z_{i;\beta}^{(L+1)}\right),\qquad\alpha,\beta\in A

between the entries in each row satisfies the recursion (1.7) with initial condition (1.8).

Once we have proved Proposition 2.1 in §2.1, it remains to show tightness. For this, we fix a compact subset $T\subseteq\mathbb{R}^{n_{0}}$. The tightness of $x_{\alpha}\mapsto z_{\alpha}^{(L+1)}$ in $C^{0}(T,\mathbb{R}^{n_{L+1}})$ follows immediately from the Arzelà-Ascoli Theorem and the following result, which we prove in §2.2.

Proposition 2.2 (High Probability Equicontinuity and Equiboundedness of $z_{\alpha}^{(L+1)}$).

For every $L\geq 1,\,\epsilon>0$ there exists $C=C(\epsilon,\sigma,T,L,C_{b},C_{W})>0$ so that

\sup_{x_{\alpha},x_{\beta}\in T}\frac{\left|\left|z_{\alpha}^{(L+1)}-z_{\beta}^{(L+1)}\right|\right|_{2}}{\left|\left|x_{\alpha}-x_{\beta}\right|\right|_{2}}\leq C\qquad\text{and}\qquad\sup_{x_{\alpha}\in T}\left|\left|z_{\alpha}^{(L+1)}\right|\right|\leq C \qquad (2.1)

with probability at least $1-\epsilon$.

2.1 Finite-Dimensional Distributions: Proof of Proposition 2.1

We will prove Proposition 2.1 in two steps. First, we prove a special case in which we keep the weights in layer 1 completely general as in (1.4) but take weights in layers $\ell\geq 2$ to be independent Gaussians:

W_{ij}^{(\ell)}\sim\mathcal{N}\left(0,C_{W}n_{\ell-1}^{-1}\right),\qquad\text{iid}.

We continue to assume (as in the statement of Theorem 1.2) that all biases are Gaussian:

b_{i}^{(\ell)}\sim\mathcal{N}(0,C_{b}),\qquad\text{iid}.

The argument in this case is the technical heart of this paper and is presented in §2.1.1 - §2.1.2. Ultimately, it relies on the analysis of collective observables, which we isolate in §2.1.2. A simple Lindeberg swapping argument and induction on the layer, detailed in §2.1.3, allow us to extend Proposition 2.1 from the Gaussian case to general weights in layers $\ell\geq 2$.

2.1.1 Proof of Proposition 2.1 with Gaussian Weights in Layers $\ell\geq 2$

Fix

\Xi=\left(\xi_{i},\,i=1,\ldots,n_{L+1}\right)\in\mathbb{R}^{n_{L+1}\times\left|A\right|},\qquad\xi_{i}=\left(\xi_{i;\alpha},\,\alpha\in A\right)\in\mathbb{R}^{\left|A\right|}

and consider the characteristic function

\chi_{A}(\Xi)=\mathbb{E}\left[\exp\left[-i\left\langle z_{A}^{(L+1)},\Xi\right\rangle\right]\right]=\mathbb{E}\left[\exp\left[-i\sum_{\alpha\in A}\sum_{i=1}^{n_{L+1}}z_{i;\alpha}^{(L+1)}\xi_{i;\alpha}\right]\right]

of the random variable $z_{A}^{(L+1)}\in\mathbb{R}^{n_{L+1}\times\left|A\right|}$. By Lévy's continuity theorem, it is sufficient to show that

\lim_{n_{1},\ldots,n_{L}\rightarrow\infty}\chi_{A}(\Xi)=\exp\left[-\frac{1}{2}\sum_{i=1}^{n_{L+1}}\left\langle K_{A}^{(L+1)}\xi_{i},\xi_{i}\right\rangle\right], \qquad (2.2)

where

K_{A}^{(L+1)}=\left(K_{\alpha\beta}^{(L+1)}\right)_{\alpha,\beta\in A}

is the matrix defined by the recursion (1.7) with initial condition (1.8). Writing

\mathcal{F}_{\ell}:=\text{filtration defined by }\left\{W^{(\ell^{\prime})},b^{(\ell^{\prime})},\,\,\ell^{\prime}=1,\ldots,\ell\right\}, \qquad (2.3)

we may use the tower property to write

\chi_{A}(\Xi)=\mathbb{E}\left[\mathbb{E}\left[\exp\left[-i\left\langle z_{A}^{(L+1)},\Xi\right\rangle\right]~\big|~\mathcal{F}_{L}\right]\right]. \qquad (2.4)

Note that conditional on $\mathcal{F}_{L}$, the random vectors $z_{i;A}^{(L+1)}\in\mathbb{R}^{\left|A\right|}$ in layer $L+1$ for each $i=1,\ldots,n_{L+1}$ are iid Gaussians, since we've assumed for now that weights in layers $\ell\geq 2$ are Gaussian. Specifically,

z_{i;A}^{(L+1)}\stackrel{d}{=}\left(\Sigma_{A}^{(L+1)}\right)^{1/2}G_{i},\qquad G_{i}\sim\mathcal{N}\left(0,\mathrm{I}_{\left|A\right|}\right)\text{ iid},\qquad 1\leq i\leq n_{L+1},

where for any $\alpha,\beta\in A$ the conditional covariance is

\left(\Sigma_{A}^{(L+1)}\right)_{\alpha\beta}=\mathrm{Cov}\left(z_{i;\alpha}^{(L+1)},\,z_{i;\beta}^{(L+1)}~\big|~\mathcal{F}_{L}\right)=C_{b}+\frac{C_{W}}{n_{L}}\sum_{j=1}^{n_{L}}\sigma\left(z_{j;\alpha}^{(L)}\right)\sigma\left(z_{j;\beta}^{(L)}\right). \qquad (2.5)

Using (2.4) and the explicit form of the characteristic function of a Gaussian reveals

\chi_{A}(\Xi)=\mathbb{E}\left[\exp\left[-\frac{1}{2}\sum_{i=1}^{n_{L+1}}\left\langle\Sigma_{A}^{(L+1)}\xi_{i},\xi_{i}\right\rangle\right]\right]. \qquad (2.6)

The crucial observation is that each entry of the conditional covariance matrix $\Sigma_{A}^{(L+1)}$ is an average over $j=1,\ldots,n_{L}$ of the same fixed function applied to the vector $z_{j;A}^{(L)}$. While the $z_{j;A}^{(L)}$ are not independent at finite values of $n_{1},\ldots,n_{L-1}$ for $L>1$, they are sufficiently weakly correlated that a weak law of large numbers still holds:

Lemma 2.3.

Fix $n_{0},n_{L+1}$. There exists a $\left|A\right|\times\left|A\right|$ PSD matrix

K_{A}^{(L+1)}=\left(K_{\alpha\beta}^{(L+1)}\right)_{\alpha,\beta\in A}

such that for all $\alpha,\beta\in A$

\lim_{n_{1},\ldots,n_{L}\rightarrow\infty}\mathbb{E}\left[\left(\Sigma_{A}^{(L+1)}\right)_{\alpha\beta}\right]=K_{\alpha\beta}^{(L+1)}\qquad\text{and}\qquad\lim_{n_{1},\ldots,n_{L}\rightarrow\infty}\mathrm{Var}\left[\left(\Sigma_{A}^{(L+1)}\right)_{\alpha\beta}\right]=0.
Proof.

Lemma 2.3 is a special case of Lemma 2.4 (see §2.1.2). ∎

Lemma 2.3 implies that $\Sigma_{A}^{(L+1)}$ converges in distribution to $K_{A}^{(L+1)}$. In view of (2.6) and the definition of weak convergence this immediately implies (2.2). It therefore remains to check that $K_{A}^{(L+1)}$ satisfies the desired recursion. For this, note that at any values of $n_{1},\ldots,n_{L}$ we find

\mathrm{Cov}\left(z_{i;\alpha}^{(L+1)},\,z_{i;\beta}^{(L+1)}\right)=\mathbb{E}\left[\left(\Sigma_{A}^{(L+1)}\right)_{\alpha\beta}\right]=C_{b}+C_{W}\mathbb{E}\left[\sigma\left(z_{1;\alpha}^{(L)}\right)\sigma\left(z_{1;\beta}^{(L)}\right)\right].

When $L=1$, we therefore see that

\mathrm{Cov}\left(z_{i;\alpha}^{(2)},z_{i;\beta}^{(2)}\right)=C_{b}+C_{W}\mathbb{E}\left[\sigma(z_{i;\alpha}^{(1)})\sigma(z_{i;\beta}^{(1)})\right],

where the law of $(z_{i;\alpha}^{(1)},z_{i;\beta}^{(1)})$ is determined by the distribution $\mu$ of weights in layer 1 and does not depend on $n_{1}$. This confirms the initial condition (1.8). Otherwise, if $L>1$, the convergence of finite-dimensional distributions that we've already established yields

K_{\alpha\beta}^{(L+1)}=\lim_{n_{1},\ldots,n_{L}\rightarrow\infty}\mathrm{Cov}\left(z_{i;\alpha}^{(L+1)},\,z_{i;\beta}^{(L+1)}\right)=\lim_{n_{1},\ldots,n_{L-1}\rightarrow\infty}\left(C_{b}+C_{W}\mathbb{E}\left[\sigma\left(z_{1;\alpha}^{(L)}\right)\sigma\left(z_{1;\beta}^{(L)}\right)\right]\right).

Since $\sigma$ is continuous we may invoke the continuous mapping theorem to conclude that

K_{\alpha\beta}^{(L+1)}=C_{b}+C_{W}\mathbb{E}_{(z_{\alpha},z_{\beta})\sim\mathcal{N}(0,K^{(L)})}\left[\sigma(z_{\alpha})\sigma(z_{\beta})\right],

which confirms the recursion (1.7). This completes the proof that the finite-dimensional distributions of $z_{\alpha}^{(L+1)}$ converge to those of the desired Gaussian process, modulo two issues. First, we must prove Lemma 2.3. This is done in §2.1.2 by proving a more general result, Lemma 2.4. Second, we must remove the assumption that the weights in layers $\ell\geq 2$ are Gaussian. This is done in §2.1.3. $\square$

2.1.2 Collective Observables with Gaussian Weights: Generalizing Lemma 2.3

This section contains the key technical argument in our proof of Proposition 2.1. To state the main result, define a collective observable at layer $\ell$ to be any random variable of the form

\mathcal{O}_{n_{\ell},f;A}^{(\ell)}:=\frac{1}{n_{\ell}}\sum_{i=1}^{n_{\ell}}f(z_{i;A}^{(\ell)}),

where $f:\mathbb{R}^{\left|A\right|}\rightarrow\mathbb{R}$ is measurable and polynomially bounded:

\exists C>0,\,k\geq 1\text{ s.t. }\forall z\in\mathbb{R}^{\left|A\right|}\qquad\left|f(z)\right|\leq C\left(1+\left|\left|z\right|\right|_{2}^{k}\right).
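As a numerical aside (ours, not part of the argument), one can watch such an observable concentrate even though the neurons in a deep layer are not independent at finite width. The sketch below takes $|A|=2$ inputs, $f(z_{A})=z_{\alpha}z_{\beta}$ at layer $\ell=3$, and estimates the variance of the observable over independent network draws; all parameter choices are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = np.tanh
C_W, C_b, n0 = 1.0, 0.0, 4
x_a, x_b = rng.standard_normal(n0), rng.standard_normal(n0)

def collective_observable(n):
    """O^{(3)}_{n, f; A} with f(z_A) = z_alpha * z_beta and all hidden widths equal to n."""
    widths = [n0, n, n, n]                        # layers 1, 2, 3
    Z = np.stack([x_a, x_b], axis=1)              # carry both inputs through the network
    for layer, (n_in, n_out) in enumerate(zip(widths[:-1], widths[1:]), start=1):
        W = np.sqrt(C_W / n_in) * rng.standard_normal((n_out, n_in))
        b = np.sqrt(C_b) * rng.standard_normal((n_out, 1))
        Z = W @ (Z if layer == 1 else sigma(Z)) + b
    return np.mean(Z[:, 0] * Z[:, 1])             # average of f over the n neurons in layer 3

for n in [10, 100, 500]:
    samples = [collective_observable(n) for _ in range(200)]
    print(n, np.var(samples))                     # shrinks as n grows, consistent with (2.8)
```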

We continue to assume, as in §2.1.1, that the weights (and biases) in layers $\ell\geq 2$ are Gaussian and will remove this assumption in §2.1.3. Several key properties of collective observables are summarized in the following Lemma:

Lemma 2.4 (Key Properties of Collective Observables).

Let $\mathcal{O}_{n_{\ell},f;A}^{(\ell)}$ be a collective observable at layer $\ell$. There exists a deterministic scalar $\overline{O}_{f;A}^{(\ell)}$ such that

\lim_{n_{1},\ldots,n_{\ell-1}\rightarrow\infty}\mathbb{E}\left[\mathcal{O}_{n_{\ell},f;A}^{(\ell)}\right]=\overline{O}_{f;A}^{(\ell)}. \qquad (2.7)

Moreover,

\lim_{n_{1},\ldots,n_{\ell}\rightarrow\infty}\mathrm{Var}\left[\mathcal{O}_{n_{\ell},f;A}^{(\ell)}\right]=0. \qquad (2.8)

Hence, we have the following convergence in probability:

\mathcal{O}_{n_{\ell},f;A}^{(\ell)}\quad\stackrel{p}{\longrightarrow}\quad\overline{O}_{f;A}^{(\ell)},\qquad\text{as }n_{1},\ldots,n_{\ell}\rightarrow\infty.
Proof.

The proof is by induction on $\ell$, starting with $\ell=1$. In this case, the $z_{i;A}^{(1)}$ are independent for different $i$. Moreover, for each $i,\alpha$ we have

z_{i;\alpha}^{(1)}=b_{i}^{(1)}+\sum_{j=1}^{n_{0}}W_{ij}^{(1)}x_{j;\alpha}=b_{i}^{(1)}+\left(\frac{C_{W}}{n_{0}}\right)^{1/2}\sum_{j=1}^{n_{0}}\widehat{W}_{ij}^{(1)}x_{j;\alpha}.

Hence, the $z_{i;\alpha}^{(1)}$ have finite moments since the $b_{i}^{(1)}$ are iid Gaussian and the $\widehat{W}_{ij}^{(1)}$ are mean 0 with finite higher moments. In particular, since $f$ is polynomially bounded, we find for every $n_{1}$ that

\mathbb{E}\left[\mathcal{O}_{n_{1},f;A}^{(1)}\right]=\mathbb{E}\left[f(z_{1;A}^{(1)})\right],

which is finite and independent of $n_{1}$, confirming (2.7). Further, $\mathcal{O}_{n_{1},f;A}^{(1)}$ is the average of $n_{1}$ iid random variables with all moments finite. Hence, (2.8) follows by the weak law of large numbers, completing the proof of the base case.

Let us now assume that we have shown (2.7) and (2.8) for all $\ell=1,\ldots,L$. We begin by checking that (2.7) holds at layer $L+1$. We have

\mathbb{E}\left[\mathcal{O}_{n_{L+1},f;A}^{(L+1)}\right]=\mathbb{E}\left[f(z_{1;A}^{(L+1)})\right]. \qquad (2.9)

Since the weights and biases in layer $L+1$ are Gaussian and independent of $\mathcal{F}_{L}$, we find

z_{1;A}^{(L+1)}\stackrel{d}{=}\left(\Sigma_{A}^{(L+1)}\right)^{1/2}G, \qquad (2.10)

where $\Sigma_{A}^{(L+1)}$ is the conditional covariance defined in (2.5) and $G$ is an $\left|A\right|$-dimensional standard Gaussian. The key point is that $\Sigma_{A}^{(L+1)}$ is a collective observable at layer $L$. Hence, by the inductive hypothesis, there exists a PSD matrix $\overline{\Sigma}_{A}^{(L+1)}$ such that $\Sigma_{A}^{(L+1)}$ converges in probability to $\overline{\Sigma}_{A}^{(L+1)}$ as $n_{1},\ldots,n_{L}\rightarrow\infty$. To establish (2.7) it therefore suffices in view of (2.9) to check that

\lim_{n_{1},\ldots,n_{L}\rightarrow\infty}\mathbb{E}\left[f\left(\left(\Sigma_{A}^{(L+1)}\right)^{1/2}G\right)\right]=\mathbb{E}\left[f\left(\left(\overline{\Sigma}_{A}^{(L+1)}\right)^{1/2}G\right)\right], \qquad (2.11)

where the right hand side is finite since $f$ is polynomially bounded and all polynomial moments of $G$ are finite. To establish (2.11), let us invoke the Skorohod representation theorem to find a common probability space on which there are versions of $\Sigma_{A}^{(L+1)}$ – which by an abuse of notation we will still denote by $\Sigma_{A}^{(L+1)}$ – that converge to $\overline{\Sigma}_{A}^{(L+1)}$ almost surely. Next, note that since $f$ is polynomially bounded we may repeatedly apply $ab\leq\frac{1}{2}(a^{2}+b^{2})$ to find

\left|f((\Sigma_{A}^{(L+1)})^{1/2}G)\right|\leq p\left((\Sigma_{A}^{(L+1)})^{1/2}\right)+q(G), \qquad (2.12)

where $p$ is a polynomial in the entries of $(\Sigma_{A}^{(L+1)})^{1/2}$, $q$ is a polynomial in the entries of $G$, and the polynomials $p,q$ do not depend on $n_{1},\ldots,n_{L}$. The continuous mapping theorem shows that

\lim_{n_{1},\ldots,n_{L}\rightarrow\infty}\mathbb{E}\left[p\left((\Sigma_{A}^{(L+1)})^{1/2}\right)\right]=\mathbb{E}\left[p\left((\overline{\Sigma}_{A}^{(L+1)})^{1/2}\right)\right].

Thus, since all moments of a Gaussian are finite, (2.11) follows from the generalized dominated convergence theorem. It remains to argue that (2.8) holds at layer $L+1$. To do this, we write

\mathrm{Var}\left[\mathcal{O}_{n_{L+1},f;A}^{(L+1)}\right]=\frac{1}{n_{L+1}}\mathrm{Var}\left[f(z_{1;A}^{(L+1)})\right]+\left(1-\frac{1}{n_{L+1}}\right)\mathrm{Cov}\left(f(z_{1;A}^{(L+1)}),\,f(z_{2;A}^{(L+1)})\right). \qquad (2.13)

Observe that

\mathrm{Var}\left[f(z_{1;A}^{(L+1)})\right]\leq\mathbb{E}\left[f(z_{1;A}^{(L+1)})^{2}\right]=\mathbb{E}\left[\frac{1}{n_{L+1}}\sum_{i=1}^{n_{L+1}}\left(f(z_{i;A}^{(L+1)})\right)^{2}\right]<\infty,

since we already showed that (2.7) holds at layer $L+1$. Next, recall that, conditional on $\mathcal{F}_{L}$, neurons in layer $L+1$ are independent. The law of total variance and Cauchy-Schwarz yield

\left|\mathrm{Cov}\left(f(z_{1;A}^{(L+1)}),\,f(z_{2;A}^{(L+1)})\right)\right|=\left|\mathrm{Cov}\left(\mathbb{E}\left[f(z_{1;A}^{(L+1)})~|~\mathcal{F}_{L}\right],\,\mathbb{E}\left[f(z_{2;A}^{(L+1)})~|~\mathcal{F}_{L}\right]\right)\right|\leq\mathrm{Var}\left[\mathbb{E}\left[f(z_{1;A}^{(L+1)})~|~\mathcal{F}_{L}\right]\right]. \qquad (2.14)

Using (2.10) and the polynomial estimates (2.12) on $f$, we conclude that the conditional expectation on the previous line is some polynomially bounded function of the components of $(\Sigma_{A}^{(L+1)})^{1/2}$. Hence, we may apply dominated convergence as above to find

\lim_{n_{1},\ldots,n_{L}\rightarrow\infty}\mathrm{Var}\left[\mathbb{E}\left[f(z_{1;A}^{(L+1)})~|~\mathcal{F}_{L}\right]\right]=\mathrm{Var}\left[\mathbb{E}\left[f\left(\overline{\Sigma}_{A}^{1/2}G\right)\right]\right]=0,\qquad G\sim\mathcal{N}(0,\mathrm{I}_{\left|A\right|}),

since $\mathbb{E}\left[f\left(\overline{\Sigma}_{A}^{1/2}G\right)\right]$ is a constant. This proves (2.8) for observables at layer $L+1$ and completes the proof of Lemma 2.4. ∎

2.1.3 Proof of Proposition 2.1 for General Weights

In this section, we complete the proof of Proposition 2.1 by removing the assumption from §2.1.1 that weights in layers $\ell\geq 2$ are Gaussian. To do this, let us introduce some notation. Let us write

x_{\alpha}\mapsto z_{\alpha}^{(\ell)}

for the pre-activations at layers $\ell=0,\ldots,L+1$ of a random fully connected network in which, as in the general case of Theorem 1.2, all weights and biases are independent, biases are Gaussian as in (1.6), and weights in all layers are drawn from a general centered distribution with the correct variance and finite higher moments as in (1.4) and (1.5). Next, let us write

x_{\alpha}\mapsto\widetilde{z}_{\alpha}^{(\ell)},

for the vector of pre-activations obtained by replacing, in each layer $\ell=2,\ldots,L+1$, all weights $W_{ij}^{(\ell)}$ by iid centered Gaussians with variance $C_{W}/n_{\ell-1}$. We already saw that the distribution of $\widetilde{z}_{A}^{(L+1)}$ converges weakly to that of the desired Gaussian in the infinite width limit. Our goal is thus to show that, as $n_{1},\ldots,n_{L}$ tend to infinity, $z_{A}^{(L+1)}$ and $\widetilde{z}_{A}^{(L+1)}$ converge weakly to the same distribution. We will proceed by induction on $L$. When $L=0$ the claim is trivial since, by construction, the weight and bias distributions in layer 1 are identical in both $z_{\alpha}^{(1)}$ and $\widetilde{z}_{\alpha}^{(1)}$ (recall that when we proved Proposition 2.1 in §2.1.1 we had Gaussian weights only starting from layer 2 and general weights in layer 1).

Suppose therefore that we have shown the claim for $\ell=0,\ldots,L$. By the Portmanteau theorem and the density of smooth compactly supported functions in the space of continuous compactly supported functions equipped with the supremum norm, it suffices to show that for any smooth function $g:\mathbb{R}^{n_{L+1}\times\left|A\right|}\rightarrow\mathbb{R}$ with compact support we have

\lim_{n_{1},\ldots,n_{L}\rightarrow\infty}\left(\mathbb{E}\left[g(z_{A}^{(L+1)})\right]-\mathbb{E}\left[g(\widetilde{z}_{A}^{(L+1)})\right]\right)=0. \qquad (2.15)

To check (2.15), let us define an intermediate object:

z_{\alpha}^{(L+1),\bullet}=b^{(L+1)}+W^{(L+1),\bullet}\sigma\left(z_{\alpha}^{(L)}\right),

where the entries $W_{ij}^{(L+1),\bullet}$ of $W^{(L+1),\bullet}$ are iid Gaussian with mean 0 and variance $C_{W}/n_{L}$. That is, we take the vector $\sigma(z_{\alpha}^{(L)})$ of post-activations from layer $L$ obtained by using general weights in layers $1,\ldots,L$ and use Gaussian weights only in layer $L+1$. Our first step in checking (2.15) is to show that this relation holds when $z_{A}^{(L+1)}$ is replaced by $z_{A}^{(L+1),\bullet}$.

Lemma 2.5.

Let $x_{A}=\left\{x_{\alpha},\,\alpha\in A\right\}$ be a finite subset of $\mathbb{R}^{n_{0}}$ consisting of $\left|A\right|$ distinct elements. Fix in addition $n_{L+1}\geq 1$ and a smooth compactly supported function $g:\mathbb{R}^{n_{L+1}\times\left|A\right|}\rightarrow\mathbb{R}$. There exist $C>0$ and a collective observable $\mathcal{O}_{n_{L},f;A}^{(L)}$ at layer $L$ so that

\left|\mathbb{E}\left[g\left(z_{A}^{(L+1)}\right)\right]-\mathbb{E}\left[g\left(z_{A}^{(L+1),\bullet}\right)\right]\right|\leq Cn_{L+1}^{3}n_{L}^{-1/2}\,\mathbb{E}\left[\mathcal{O}_{n_{L},f;A}^{(L)}\right].
Proof.

This is a standard Lindeberg swapping argument. Namely, for each $\alpha\in A$ and $k=0,\ldots,n_{L}$ define

z_{\alpha}^{(L+1),k}:=b^{(L+1)}+W^{(L+1),k}\sigma\left(z_{\alpha}^{(L)}\right),

where the first $k$ entries of each row of $W^{(L+1),k}$ are iid Gaussian with mean 0 and variance $C_{W}/n_{L}$, while the remaining entries are $(C_{W}/n_{L})^{1/2}$ times iid draws $\widehat{W}_{ij}^{(L+1)}$ from the general distribution $\mu$ of network weights, as in (1.4) and (1.5). With this notation, we have

z_{\alpha}^{(L+1)}=z_{\alpha}^{(L+1),0},\qquad z_{\alpha}^{(L+1),\bullet}=z_{\alpha}^{(L+1),n_{L}}.

Thus,

\mathbb{E}\left[g\left(z_{A}^{(L+1)}\right)\right]-\mathbb{E}\left[g\left(z_{A}^{(L+1),\bullet}\right)\right]=\sum_{k=1}^{n_{L}}\left(\mathbb{E}\left[g\left(z_{A}^{(L+1),k-1}\right)\right]-\mathbb{E}\left[g\left(z_{A}^{(L+1),k}\right)\right]\right).

For any $Z\in\mathbb{R}^{n_{L+1}\times\left|A\right|}$ and

\delta Z=\left(\delta z_{i;\alpha},\,i=1,\ldots,n_{L+1},\,\alpha\in A\right)\in\mathbb{R}^{n_{L+1}\times\left|A\right|}

consider the third order Taylor expansion of $g$ around $Z$:

g(Z+\delta Z)=g(Z)+\sum_{\begin{subarray}{c}\alpha\in A\\ i=1,\ldots,n_{L+1}\end{subarray}}g_{i;\alpha}\,\delta z_{i;\alpha}+\sum_{\begin{subarray}{c}\alpha_{1},\alpha_{2}\in A\\ i_{1},i_{2}=1,\ldots,n_{L+1}\end{subarray}}g_{i_{1},i_{2};\alpha_{1},\alpha_{2}}\,\delta z_{i_{1};\alpha_{1}}\delta z_{i_{2};\alpha_{2}}+O\left(\sum_{\begin{subarray}{c}\alpha_{1},\alpha_{2},\alpha_{3}\in A\\ i_{1},i_{2},i_{3}=1,\ldots,n_{L+1}\end{subarray}}\left|\delta z_{i_{1};\alpha_{1}}\delta z_{i_{2};\alpha_{2}}\delta z_{i_{3};\alpha_{3}}\right|\right).

Let us write

z_{i;\alpha}^{(L+1),k-1}=z_{i;\alpha}^{(L+1),k}+n_{L}^{-1/2}\,\delta Z_{i,k;\alpha},\qquad\delta Z_{i,k;\alpha}:=C_{W}^{1/2}\left(\widehat{W}_{ik}^{(L+1)}-\widetilde{W}_{ik}^{(L+1)}\right)\sigma(z_{k;\alpha}^{(L)}),

where $\widetilde{W}_{ik}^{(L+1)}\sim\mathcal{N}(0,1)$. Then, Taylor expanding $g$ to third order around $Z_{k}=z_{A}^{(L+1),k}$ and using that the first two moments of $(C_{W}n_{L}^{-1})^{1/2}\widehat{W}_{ik}^{(L+1)}$ match those of $\mathcal{N}(0,C_{W}n_{L}^{-1})$, we find that

\mathbb{E}\left[g\left(z_{A}^{(L+1),k-1}\right)\right]-\mathbb{E}\left[g\left(z_{A}^{(L+1),k}\right)\right]=O\left(n_{L}^{-3/2}n_{L+1}^{3}\,\mathbb{E}\left[p\left(\left|\sigma(z_{k;\alpha}^{(L)})\right|,\,\alpha\in A\right)\right]\right),

where $p$ is a degree 3 polynomial in the $\left|A\right|$-dimensional vector of absolute values $|\sigma(z_{k;\alpha}^{(L)})|,\,\alpha\in A$. Summing this over $k$ completes the proof. ∎

To make use of Lemma 2.5 let us consider any collective observable $\mathcal{O}_{n_{L},f;A}^{(L)}$ at layer $L$. Recall that by (2.9) and (2.13) both the mean and variance of $\mathcal{O}_{n_{L},f;A}^{(L)}$ depend only on the distributions of finitely many components of the vector $z_{A}^{(L)}$. By the inductive hypothesis we therefore find

\lim_{n_{1},\ldots,n_{L}\rightarrow\infty}\mathbb{E}\left[\mathcal{O}_{n_{L},f;A}^{(L)}\right]=\lim_{n_{1},\ldots,n_{L}\rightarrow\infty}\mathbb{E}\left[\widetilde{\mathcal{O}}_{n_{L},f;A}^{(L)}\right], \qquad (2.16)

where the right hand side means that we consider the same collective observable but for $\widetilde{z}_{A}^{(L)}$ instead of $z_{A}^{(L)}$, and the limit exists by Lemma 2.4. Similarly, again using Lemma 2.4, we have

\lim_{n_{1},\ldots,n_{L}\rightarrow\infty}\mathrm{Var}\left[\mathcal{O}_{n_{L},f;A}^{(L)}\right]=0. \qquad (2.17)

Therefore, we conclude that

\mathcal{O}_{n_{L},f;A}^{(L)}-\widetilde{\mathcal{O}}_{n_{L},f;A}^{(L)}\quad\stackrel{d}{\longrightarrow}\quad 0,\qquad\text{as }n_{1},\ldots,n_{L}\rightarrow\infty. \qquad (2.18)

Note that by (2.16) the mean of any collective observable $\mathbb{E}\left[\mathcal{O}_{n_{L},f;A}^{(L)}\right]$ is bounded independently of $n_{1},\ldots,n_{L}$, since we saw in Lemma 2.4 that the limit exists and is bounded when using Gaussian weights. Since $n_{L+1}$ is fixed and finite, the error term $n_{L}^{-1/2}n_{L+1}^{3}\mathbb{E}\left[\mathcal{O}_{n_{L},f;A}^{(L)}\right]$ in Lemma 2.5 therefore tends to zero as $n_{1},\ldots,n_{L}\rightarrow\infty$, and (2.15) is reduced to showing that

\lim_{n_{1},\ldots,n_{L}\rightarrow\infty}\left(\mathbb{E}\left[g(z_{A}^{(L+1),\bullet})\right]-\mathbb{E}\left[g(\widetilde{z}_{A}^{(L+1)})\right]\right)=0. \qquad (2.19)

This follows from (2.17) and the inductive hypothesis. Indeed, by construction, conditional on the filtration $\mathcal{F}_{L}$ defined by weights and biases in layers up to $L$ (see (2.3)), the $\left|A\right|$-dimensional vectors $z_{i;A}^{(L+1),\bullet}$ are iid Gaussians:

z_{i;A}^{(L+1),\bullet}\stackrel{d}{=}\left(\Sigma_{A}^{(L+1)}\right)^{1/2}G_{i},\qquad G_{i}\sim\mathcal{N}\left(0,\mathrm{I}_{\left|A\right|}\right)\quad\text{iid},

where $\Sigma_{A}^{(L+1)}$ is the conditional covariance matrix from (2.5). The key point, as in the proof with all Gaussian weights, is that each entry of the matrix $\Sigma_{A}^{(L+1)}$ is a collective observable at layer $L$. Moreover, since the weights and biases in the final layer are Gaussian for $z_{A}^{(L+1),\bullet}$, the conditional distribution of $g(z_{A}^{(L+1),\bullet})$ given $\mathcal{F}_{L}$ is completely determined by $\Sigma_{A}^{(L+1)}$. In particular, since $g$ is bounded and continuous, we find that

\mathbb{E}\left[g(z_{A}^{(L+1),\bullet})\right]-\mathbb{E}\left[g(\widetilde{z}_{A}^{(L+1)})\right]=\mathbb{E}\left[h\left(\Sigma_{A}^{(L+1)}\right)\right]-\mathbb{E}\left[h\left(\widetilde{\Sigma}_{A}^{(L+1)}\right)\right],

where $h:\mathbb{R}^{\left|A\right|\times\left|A\right|}\rightarrow\mathbb{R}$ is a bounded continuous function and $\widetilde{\Sigma}_{A}^{(L+1)}$ is the conditional covariance matrix at layer $L+1$ for $\widetilde{z}_{A}^{(L+1)}$. Combining this with the convergence in distribution from (2.18) shows that (2.19) holds and completes the proof of Proposition 2.1 for general weight distributions. $\square$

2.2 Tightness: Proof of Proposition 2.2

In this section, we provide a proof of Proposition 2.2. In the course of showing tightness, we will need several elementary Lemmas, which we record in §2.2.1. We then use them in §2.2.2 to complete the proof of Proposition 2.2.

2.2.1 Preparatory Lemmas

For the first Lemma, let us agree to write $\mathcal{C}(A)$ for the cone over a subset $A$ of a Euclidean space and $B_{1}(\mathbb{R}^{n})$ for the unit ball in $\mathbb{R}^{n}$.

Lemma 2.6.

Fix integers $n_{0},n_{1}\geq 1$ and a real number $\lambda\geq 1$. Suppose that $T$ is a compact subset of $\mathbb{R}^{n_{0}}$ and $f:\mathbb{R}^{n_{0}}\rightarrow\mathbb{R}^{n_{1}}$ is $\lambda$-Lipschitz with respect to the $\ell_{2}$-norm on both $\mathbb{R}^{n_{0}}$ and $\mathbb{R}^{n_{1}}$. Define the Minkowski sum

\widehat{T}=f(T)+\mathcal{C}\left(f(T)-f(T)\right)\cap B_{1}(\mathbb{R}^{n_{1}}).

There exist a constant $C>0$, a compact subset $T^{\prime}$ of $\mathbb{R}^{3n_{0}+1}$, and a $C\lambda$-Lipschitz map $g:\mathbb{R}^{3n_{0}+1}\rightarrow\mathbb{R}^{n_{1}}$ (all depending only on $T$ and $\lambda$), so that $\widehat{T}=g(T^{\prime})$.

Proof.

By definition,

\widehat{T}=g(T\times T\times T\times[0,1]),\qquad g(x,y,z,r)=f(x)+r(f(y)-f(z)). \qquad (2.20)

In particular, for some constant $C$ depending on $T,\lambda$, we have

\left|\left|g(x,y,z,r)-g(x^{\prime},y^{\prime},z^{\prime},r^{\prime})\right|\right|_{2}\leq\left|\left|f(x)-f(x^{\prime})\right|\right|_{2}+\left|\left|f(y)-f(y^{\prime})\right|\right|_{2}+\left|\left|f(z)-f(z^{\prime})\right|\right|_{2}+\left|r-r^{\prime}\right|\left|\left|f(y)-f(z)\right|\right|_{2}\leq\lambda\left(\left|\left|x-x^{\prime}\right|\right|_{2}+\left|\left|y-y^{\prime}\right|\right|_{2}+\left|\left|z-z^{\prime}\right|\right|_{2}+\mathrm{Diam}(T)\left|r-r^{\prime}\right|\right)\leq C\lambda\left|\left|(x-x^{\prime},y-y^{\prime},z-z^{\prime},r-r^{\prime})\right|\right|_{2}.

Hence, $\widehat{T}$ is the image under a Lipschitz map, with a Lipschitz constant depending only on $T$ and $\lambda$, of a compact set in $\mathbb{R}^{3n_{0}+1}$. ∎

The second Lemma we need is an elementary inequality.

Lemma 2.7.

Let $a,b,c\geq 0$ be real numbers and $k\geq 1$ be an integer. We have

(a+b+c)^{k}\leq 2^{2k-1}\left(1+a^{2k}\right)+\frac{1}{4}\left[(2+b)^{4k}+\left(1+c\right)^{4k}\right].
Proof.

For any $a,b\geq 0$ we have

(a+b)^{k}=\sum_{j=0}^{k}\binom{k}{j}a^{j}b^{k-j}\leq\sum_{j=0}^{k}\binom{k}{j}\left(1+a\right)^{k}b^{k-j}=(1+a)^{k}(1+b)^{k}\leq\frac{1}{2}\left((1+a)^{2k}+(1+b)^{2k}\right). \qquad (2.21)

Further, breaking into cases depending on whether $0\leq a\leq 1$ or $1\leq a$, we find that

(a+b)^{k}\leq 2^{2k-1}\left(1+a^{2k}\right)+\frac{1}{2}(1+b)^{2k}. \qquad (2.22)

Combining (2.21) with (2.22) we see, as desired, that for any $a,b,c\geq 0$

(a+b+c)^{k}\leq 2^{2k-1}\left(1+a^{2k}\right)+\frac{1}{4}\left[(2+b)^{4k}+\left(1+c\right)^{4k}\right]. ∎

The next Lemma is also an elementary estimate.

Lemma 2.8.

Fix an integer $k\geq 1$, and suppose $X_{1},\ldots,X_{k}$ are non-negative random variables. There exists a positive integer $q=q(k)$ such that

\mathbb{E}\left[\prod_{i=1}^{k}X_{i}\right]\leq\left(\prod_{i=1}^{k}\mathbb{E}\left[X_{i}^{q}\right]\right)^{1/q}.
Proof.

The proof is by induction on kk. For the base cases when k=1k=1, we may take q=1q=1 and when k=2k=2 we may take q=2q=2 by Cauchy-Schwartz. Now suppose we have proved the claim for all k=1,2,,Kk=1,2,\ldots,K for some K3K\geq 3. Note that 1v(K+1)/2K1\leq v\lceil{(K+1)/2}\rceil\leq K. So we may use Cauchy-Schwartz and the inductive hypothesis to obtain

\displaystyle\mathbb{E}\left[\prod_{i=1}^{K+1}X_{i}\right] =\mathbb{E}\left[\prod_{i=1}^{\lceil\frac{K+1}{2}\rceil}X_{i}\prod_{i=\lceil\frac{K+1}{2}\rceil+1}^{K+1}X_{i}\right]
\displaystyle\leq\mathbb{E}\left[\prod_{i=1}^{\lceil\frac{K+1}{2}\rceil}X_{i}^{2}\right]^{1/2}\mathbb{E}\left[\prod_{i=\lceil\frac{K+1}{2}\rceil+1}^{K+1}X_{i}^{2}\right]^{1/2}
\displaystyle\leq\left(\prod_{i=1}^{K+1}\mathbb{E}\left[X_{i}^{2q}\right]\right)^{1/2q},

where q=\max\left\{q\left(\lceil\tfrac{K+1}{2}\rceil\right),q\left(K+1-\lceil\tfrac{K+1}{2}\rceil\right)\right\} and, in the last step, we used Jensen's inequality in the form \left(\mathbb{E}\left[X^{2q^{\prime}}\right]\right)^{1/2q^{\prime}}\leq\left(\mathbb{E}\left[X^{2q}\right]\right)^{1/2q} for q^{\prime}\leq q to pass to the common exponent 2q. ∎
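For illustration (this is not part of the argument above), the case k=3 works out to q(3)=4:

\mathbb{E}\left[X_{1}X_{2}X_{3}\right]\leq\mathbb{E}\left[X_{1}^{2}X_{2}^{2}\right]^{1/2}\mathbb{E}\left[X_{3}^{2}\right]^{1/2}\leq\left(\mathbb{E}\left[X_{1}^{4}\right]\mathbb{E}\left[X_{2}^{4}\right]\right)^{1/4}\mathbb{E}\left[X_{3}^{4}\right]^{1/4}=\left(\prod_{i=1}^{3}\mathbb{E}\left[X_{i}^{4}\right]\right)^{1/4},

using Cauchy–Schwarz twice and Jensen's inequality once. More generally, unwinding the recursion suggests that one may take q(k)=2^{\lceil\log_{2}k\rceil}, though any admissible q=q(k) suffices for our purposes.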

The next Lemma is an elementary result about the moments of marginals of iid random vectors.

Lemma 2.9.

Fix an even integer p\geq 2, and suppose \mu is a probability measure on \mathbb{R} with mean 0 and finite higher moments. Assume also that w=\left(w_{1},\ldots,w_{n}\right) is a vector with iid components, each with distribution \mu. Then there exists a constant C, depending only on p and the first p moments of \mu, such that for all n\geq 1

supu=1𝔼[|wu|p]C.\sup_{\left|\left|u\right|\right|=1}\mathbb{E}\left[\left|w\cdot u\right|^{p}\right]\leq C.
Proof.

We will use the following result of Latała [33, Thm. 2, Cor. 2, Rmk. 2]. Suppose X_{i} are independent mean-0 random variables and p is a positive even integer. Then

\mathbb{E}\left[\left|\sum_{i=1}^{n}X_{i}\right|^{p}\right]^{1/p}\simeq\inf\left\{t>0~\Big{|}~\sum_{i=1}^{n}\log\mathbb{E}\left[\left|1+\frac{X_{i}}{t}\right|^{p}\right]\leq p\right\}, (2.23)

where \simeq means bounded above and below up to universal multiplicative constants. Let us fix a unit vector u=(u1,,un)Sn1u=\left(u_{1},\ldots,u_{n}\right)\in S^{n-1} and apply this to Xi=uiwiX_{i}=u_{i}w_{i}. Since wiw_{i} have mean 0 and pp is even we find

i=1nlog𝔼[|1+Xit|p]\displaystyle\sum_{i=1}^{n}\log\mathbb{E}\left[\left|1+\frac{X_{i}}{t}\right|^{p}\right] i=1nlog(1+k=2p(pk)|ui|k𝔼[|w1|k]tk).\displaystyle\leq\sum_{i=1}^{n}\log\left(1+\sum_{k=2}^{p}\binom{p}{k}\frac{\left|u_{i}\right|^{k}\mathbb{E}\left[\left|w_{1}\right|^{k}\right]}{t^{k}}\right).

Note that for each k=2,,pk=2,\ldots,p we have

𝔼[|w1|k]𝔼[(1+|w1|)p]and|ui|kui2.\mathbb{E}\left[\left|w_{1}\right|^{k}\right]\leq\mathbb{E}\left[(1+\left|w_{1}\right|)^{p}\right]\qquad\text{and}\qquad\left|u_{i}\right|^{k}\leq u_{i}^{2}.

Hence, using that log(1+x)x\log(1+x)\leq x we find

i=1nlog𝔼[|1+Xit|p]\displaystyle\sum_{i=1}^{n}\log\mathbb{E}\left[\left|1+\frac{X_{i}}{t}\right|^{p}\right] i=1nlog(1+ui2𝔼[(1+|w1|)p]k=2p(pk)1tk)\displaystyle\leq\sum_{i=1}^{n}\log\left(1+u_{i}^{2}\mathbb{E}\left[(1+\left|w_{1}\right|)^{p}\right]\sum_{k=2}^{p}\binom{p}{k}\frac{1}{t^{k}}\right)
=i=1nlog(1+ui2𝔼[(1+|w1|)p][(1+1t)p1pt])\displaystyle=\sum_{i=1}^{n}\log\left(1+u_{i}^{2}\mathbb{E}\left[(1+\left|w_{1}\right|)^{p}\right]\left[\left(1+\frac{1}{t}\right)^{p}-1-\frac{p}{t}\right]\right)
𝔼[(1+|w1|)p][(1+1t)p1pt].\displaystyle\leq\mathbb{E}\left[(1+\left|w_{1}\right|)^{p}\right]\left[\left(1+\frac{1}{t}\right)^{p}-1-\frac{p}{t}\right].

Note that for t\geq p, there is a universal constant C>0 so that

|(1+1t)p1pt|Cp2t2.\left|\left(1+\frac{1}{t}\right)^{p}-1-\frac{p}{t}\right|\leq\frac{Cp^{2}}{t^{2}}.
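One elementary way to see this (a sketch, using only the binomial theorem and the fact that p/t\leq 1 when t\geq p) is

\left(1+\frac{1}{t}\right)^{p}-1-\frac{p}{t}=\sum_{k=2}^{p}\binom{p}{k}\frac{1}{t^{k}}\leq\sum_{k=2}^{\infty}\frac{(p/t)^{k}}{k!}=e^{p/t}-1-\frac{p}{t}\leq\frac{e}{2}\left(\frac{p}{t}\right)^{2}.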

Thus, since \mathbb{E}\left[(1+\left|w_{1}\right|)^{p}\right]\geq 1, there exists a constant C^{\prime}>0 depending only on p (any C^{\prime}\geq\max\{\sqrt{p},\sqrt{C}\} will do; the lower bound \sqrt{p} guarantees that the condition below forces t\geq p) so that

t>C^{\prime}\sqrt{p\,\mathbb{E}\left[(1+\left|w_{1}\right|)^{p}\right]}\quad\Rightarrow\quad\sum_{i=1}^{n}\log\mathbb{E}\left[\left|1+\frac{X_{i}}{t}\right|^{p}\right]\leq p.

Combining this with (2.23) shows that \mathbb{E}\left[\left|w\cdot u\right|^{p}\right]^{1/p} is bounded above by a constant depending only on p and the first p moments of \mu, uniformly in n and in the unit vector u. This completes the proof. ∎
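As a quick sanity check (not needed for the argument), if \mu=\mathcal{N}(0,1) then w\cdot u is itself a standard Gaussian for every unit vector u and every n, so

\sup_{\left|\left|u\right|\right|=1}\mathbb{E}\left[\left|w\cdot u\right|^{p}\right]=\mathbb{E}\left[\left|Z\right|^{p}\right]=(p-1)!!,\qquad Z\sim\mathcal{N}(0,1),

which is indeed bounded by a constant depending only on p, uniformly in n, as Lemma 2.9 asserts.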

The final Lemma we need is an integrability statement for the supremum of certain non-Gaussian fields over low-dimensional sets.

Lemma 2.10.

Fix a positive integer n_{0}, an even integer k\geq 2, a compact set T_{0}\subseteq\mathbb{R}^{n_{0}}, a constant \lambda>0, and a probability measure \mu on \mathbb{R} with mean 0, variance 1, and finite higher moments. There exists a constant C=C(T_{0},\lambda,n_{0},k,\mu) with the following property. Fix any integer n_{1}\geq 1 and a \lambda-Lipschitz map f:\mathbb{R}^{n_{0}}\rightarrow\mathbb{R}^{n_{1}}. Define T_{1}:=f(T_{0}), and let w=\left(w_{1},\ldots,w_{n_{1}}\right) be a vector with iid components w_{i}, each drawn from \mu. Then, for any fixed y_{0}\in T_{1}

𝔼[supyT1(w(yy0))k]C.\mathbb{E}\left[\sup_{y\in T_{1}}(w\cdot(y-y_{0}))^{k}\right]\leq C. (2.24)
Proof.

The proof is a standard chaining argument. For each integer q\geq 0 and each y\in T_{1}, write \Pi_{q}(y) for the closest point to y in a 2^{-q} net in T_{1}, and assume without loss of generality that the diameter of T_{1} is bounded above by 1 and that \Pi_{0}(y)=y_{0} for all y\in T_{1}. Using the usual chaining trick, we have

𝔼[(supyT1w(yy0))k]\displaystyle\mathbb{E}\left[\left(\sup_{y\in T_{1}}w\cdot(y-y_{0})\right)^{k}\right] q1,,qk=0𝔼[i=1ksupyiT1|w(Πqi(yi)Πqi1(yi))|].\displaystyle\leq\sum_{q_{1},\ldots,q_{k}=0}^{\infty}\mathbb{E}\left[\prod_{i=1}^{k}\sup_{y_{i}\in T_{1}}\left|w\cdot(\Pi_{q_{i}}(y_{i})-\Pi_{q_{i}-1}(y_{i}))\right|\right]. (2.25)
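The chaining trick referred to here can be spelled out as follows (a sketch; we use that \Pi_{q}(y)\rightarrow y as q\rightarrow\infty and that \Pi_{0}(y)=y_{0}):

w\cdot(y-y_{0})=\sum_{q=1}^{\infty}w\cdot\left(\Pi_{q}(y)-\Pi_{q-1}(y)\right)\leq\sum_{q=1}^{\infty}\sup_{y^{\prime}\in T_{1}}\left|w\cdot\left(\Pi_{q}(y^{\prime})-\Pi_{q-1}(y^{\prime})\right)\right|.

Taking the supremum over y\in T_{1}, raising both sides to the power k, expanding the k-fold product of sums, and taking expectations yields (2.25), where we read the q_{i}=0 terms as zero (e.g. with the convention \Pi_{-1}:=\Pi_{0}).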

By Lemma 2.8, there exists qq depending only on kk so that for any q1,,qkq_{1},\ldots,q_{k} we have

𝔼[i=1ksupyiT1|w(Πqi(yi)Πqi1(yi))|]i=1k(𝔼[supyT1|w(Πqi(y)Πqi1(y))|q])1/q.\mathbb{E}\left[\prod_{i=1}^{k}\sup_{y_{i}\in T_{1}}\left|w\cdot(\Pi_{q_{i}}(y_{i})-\Pi_{q_{i}-1}(y_{i}))\right|\right]\leq\prod_{i=1}^{k}\left(\mathbb{E}\left[\sup_{y\in T_{1}}\left|w\cdot(\Pi_{q_{i}}(y)-\Pi_{q_{i}-1}(y))\right|^{q}\right]\right)^{1/q}. (2.26)

We seek to bound each expectation on the right hand side in (2.26). To do this, write

𝔼[supyT1|w(Πqi(y)Πqi1(y))|q]=0(supyT1|w(Πqi(y)Πqi1(y))|q>t)𝑑t.\mathbb{E}\left[\sup_{y\in T_{1}}\left|w\cdot(\Pi_{q_{i}}(y)-\Pi_{q_{i}-1}(y))\right|^{q}\right]=\int_{0}^{\infty}\mathbb{P}\left(\sup_{y\in T_{1}}\left|w\cdot(\Pi_{q_{i}}(y)-\Pi_{q_{i}-1}(y))\right|^{q}>t\right)dt.

Note that the supremum is only over a finite set of cardinality at most

\left|\Pi_{q_{i}}\right|\left|\Pi_{q_{i}-1}\right|\leq 2^{cn_{0}q_{i}}

for some c>0 depending only on T_{0} and \lambda. This is because, by assumption, T_{1} is the image of T_{0} under a \lambda-Lipschitz map, and Lipschitz maps do not increase covering numbers (up to rescaling the covering radius by \lambda). Thus, by a union bound,

(supyT1|w(Πqi(y)Πqi1(y))|q>t)2cn0qisupyT1(|w(Πqi(y)Πqi1(y))|q>t)\mathbb{P}\left(\sup_{y\in T_{1}}\left|w\cdot(\Pi_{q_{i}}(y)-\Pi_{q_{i}-1}(y))\right|^{q}>t\right)\leq 2^{cn_{0}q_{i}}\sup_{y\in T_{1}}\mathbb{P}\left(\left|w\cdot(\Pi_{q_{i}}(y)-\Pi_{q_{i}-1}(y))\right|^{q}>t\right)

But for any yT1y\in T_{1} and any s>0,p1s>0,p\geq 1 we have

\displaystyle\mathbb{P}\left(\left|w\cdot(\Pi_{q_{i}}(y)-\Pi_{q_{i}-1}(y))\right|^{q}>s\right) \leq\sup_{\left|\left|u\right|\right|=1}\mathbb{E}\left[\left|w\cdot u\right|^{p}\right]\left(\frac{\left|\left|\Pi_{q_{i}}(y)-\Pi_{q_{i}-1}(y)\right|\right|^{p}}{s^{p/q}}\right)
\displaystyle\leq 2^{p(2-q_{i})}s^{-p/q}\sup_{\left|\left|u\right|\right|=1}\mathbb{E}\left[\left|w\cdot u\right|^{p}\right],

since \left|\left|\Pi_{q_{i}}(y)-\Pi_{q_{i}-1}(y)\right|\right|\leq 2^{-q_{i}}+2^{-(q_{i}-1)}\leq 2^{-q_{i}+2}. Putting this all together, and absorbing constants depending only on p and q, we find for any p\geq 2\max\left\{q+2,cn_{0}\right\} that

𝔼[supyT1|w(Πqi(y)Πqi1(y))|q]2cn0qisupu=1𝔼[|wu|p].\mathbb{E}\left[\sup_{y\in T_{1}}\left|w\cdot(\Pi_{q_{i}}(y)-\Pi_{q_{i}-1}(y))\right|^{q}\right]\leq 2^{-cn_{0}q_{i}}\sup_{\left|\left|u\right|\right|=1}\mathbb{E}\left[\left|w\cdot u\right|^{p}\right].

Thus, substituting this into (2.25) yields

𝔼[(supyT1w(yy0))k]\displaystyle\mathbb{E}\left[\left(\sup_{y\in T_{1}}w\cdot(y-y_{0})\right)^{k}\right] supu=1𝔼[|wu|p]q1,,qk=02cn0i=1kqi2ksupu=1𝔼[|wu|p].\displaystyle\leq\sup_{\left|\left|u\right|\right|=1}\mathbb{E}\left[\left|w\cdot u\right|^{p}\right]\sum_{q_{1},\ldots,q_{k}=0}^{\infty}2^{-cn_{0}\sum_{i=1}^{k}q_{i}}\leq 2^{k}\sup_{\left|\left|u\right|\right|=1}\mathbb{E}\left[\left|w\cdot u\right|^{p}\right].

Appealing to Lemma 2.9 completes the proof of Lemma 2.10. ∎

2.2.2 Proof of Proposition 2.2 Using Lemmas from §2.2.1

Let us first establish the equi-Lipschitz condition, which we recall states that for each \epsilon\in(0,1) and each compact set T\subseteq\mathbb{R}^{n_{0}} there exists C>0 so that with probability at least 1-\epsilon we have

supxα,xβTzα(L+1)zβ(L+1)2xαxβ2C.\sup_{x_{\alpha},x_{\beta}\in T}\frac{\left|\left|z_{\alpha}^{(L+1)}-z_{\beta}^{(L+1)}\right|\right|_{2}}{\left|\left|x_{\alpha}-x_{\beta}\right|\right|_{2}}\leq C. (2.27)

To establish (2.27), we need a result about the Lipschitz constant of each layer. To ease the notation, define a normalized single layer with random weights W and random biases b via the map \psi:\mathbb{R}^{n_{1}}\rightarrow\mathbb{R}^{n_{2}}:

ψ(x;W,b)=1n2σ(Wx+b),\psi(x;W,b)=\frac{1}{\sqrt{n_{2}}}\sigma\left(Wx+b\right), (2.28)

where b𝒩(0,CbIn2)b\sim\mathcal{N}(0,C_{b}\mathrm{I}_{n_{2}}) and W=(wij)n2×n1W=\left(w_{ij}\right)\in\mathbb{R}^{n_{2}\times n_{1}} with wijw_{ij} drawn iid from a distribution with mean 0, variance 11, and finite higher moments. We choose the variance of wijw_{ij} to be 11 instead of CW/n1C_{W}/n_{1} since we will later think of xx as the normalized vector (CW/n)1/2σ(zα())(C_{W}/n_{\ell})^{1/2}\sigma(z_{\alpha}^{(\ell)}) of post-activations in a given layer.
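Concretely, and as a preview of the bookkeeping carried out after (2.27) below (here we read the rescaling (1.4) as W^{(\ell)}=(C_{W}/n_{\ell-1})^{1/2}\widehat{W}^{(\ell)}, so that \widehat{W}^{(\ell)} has entries of mean 0 and variance 1): if x=(C_{W}/n_{\ell-1})^{1/2}\sigma(z_{\alpha}^{(\ell-1)}) is the normalized vector of post-activations and we take (n_{1},n_{2})=(n_{\ell-1},n_{\ell}) in (2.28), then

\sqrt{C_{W}}\,\psi\left(x;\widehat{W}^{(\ell)},b^{(\ell)}\right)=\sqrt{\frac{C_{W}}{n_{\ell}}}\,\sigma\left(W^{(\ell)}\sigma(z_{\alpha}^{(\ell-1)})+b^{(\ell)}\right)=\sqrt{\frac{C_{W}}{n_{\ell}}}\,\sigma\left(z_{\alpha}^{(\ell)}\right),\qquad\ell=2,\ldots,L,

so each hidden layer of the network in Definition 1.1, written in these normalized coordinates, is \sqrt{C_{W}} times a map of the form (2.28); the first layer is analogous with x=(C_{W}/n_{0})^{1/2}x_{\alpha}.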

Lemma 2.11.

Fix an integer n01n_{0}\geq 1, a compact set T0n0T_{0}\subseteq\mathbb{R}^{n_{0}}, and a constant λ>0\lambda>0. For every ϵ(0,1)\epsilon\in(0,1) there exists a constant C=C(ϵ,n0,T0,σ,λ)C=C(\epsilon,n_{0},T_{0},\sigma,\lambda) with the following property. Fix any integers n1,n21n_{1},n_{2}\geq 1, and define ψ:n1n2\psi:\mathbb{R}^{n_{1}}\rightarrow\mathbb{R}^{n_{2}} as in (2.28). Suppose that T1n1T_{1}\subseteq\mathbb{R}^{n_{1}} is the image of T0T_{0} under a λ\lambda-Lipschitz map from n0\mathbb{R}^{n_{0}} to n1\mathbb{R}^{n_{1}}. Then,

(supxα,xβT1ψ(xα)ψ(xβ)2xαxβ2C)1ϵ.\mathbb{P}\left(\sup_{x_{\alpha},x_{\beta}\in T_{1}}\frac{\left|\left|\psi(x_{\alpha})-\psi(x_{\beta})\right|\right|_{2}}{\left|\left|x_{\alpha}-x_{\beta}\right|\right|_{2}}\leq C\right)\geq 1-\epsilon.
Proof.

Fix xαxβT1x_{\alpha}\neq x_{\beta}\in T_{1} and define

ξαβ=xαxβxαxβ2.\xi_{\alpha\beta}=\frac{x_{\alpha}-x_{\beta}}{\left|\left|x_{\alpha}-x_{\beta}\right|\right|_{2}}.

Write WiW_{i} for the ii-th row of WW and bib_{i} for the ii-th component of bb. Since ab12(a2+b2)ab\leq\frac{1}{2}(a^{2}+b^{2}) and σ\sigma is absolutely continuous, we have

\displaystyle\left|\left|\psi(x_{\alpha})-\psi(x_{\beta})\right|\right|_{2}^{2} \quad=\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\left(\sigma\left(W_{i}\cdot x_{\alpha}+b_{i}\right)-\sigma\left(W_{i}\cdot x_{\beta}+b_{i}\right)\right)^{2}
=1n2i=1n2(Wiξαβ0xαxβ2σ(Wi(xβ+tξαβ)+bi)𝑑t)2\displaystyle\quad=\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\left(W_{i}\cdot\xi_{\alpha\beta}\int_{0}^{\left|\left|x_{\alpha}-x_{\beta}\right|\right|_{2}}\sigma^{\prime}\left(W_{i}\cdot\left(x_{\beta}+t\xi_{\alpha\beta}\right)+b_{i}\right)dt\right)^{2}
xαxβ221n2i=1n2supyT^(σ(Wiy+bi))2supξT~(Wiξ)2\displaystyle\quad\leq\left|\left|x_{\alpha}-x_{\beta}\right|\right|_{2}^{2}\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\sup_{y\in\widehat{T}}\left(\sigma^{\prime}\left(W_{i}\cdot y+b_{i}\right)\right)^{2}\sup_{\xi\in\widetilde{T}}\left(W_{i}\cdot\xi\right)^{2}
xαxβ221n2i=1n2[supyT^(σ(Wiy+bi))4+supξT~(Wiξ)4],\displaystyle\quad\leq\left|\left|x_{\alpha}-x_{\beta}\right|\right|_{2}^{2}\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\left[\sup_{y\in\widehat{T}}\left(\sigma^{\prime}\left(W_{i}\cdot y+b_{i}\right)\right)^{4}+\sup_{\xi\in\widetilde{T}}\left(W_{i}\cdot\xi\right)^{4}\right], (2.29)

where we’ve set

\widetilde{T}=\mathcal{C}(T_{1}-T_{1})\cap S^{n_{1}-1}\qquad\text{and}\qquad\widehat{T}:=T_{1}+[0,1]\cdot(T_{1}-T_{1}),

and have denoted by \mathcal{C}(A) the cone over a set A. Note that \xi_{\alpha\beta}\in\widetilde{T} by construction, while x_{\beta}+t\xi_{\alpha\beta}=x_{\beta}+\tfrac{t}{\left|\left|x_{\alpha}-x_{\beta}\right|\right|_{2}}(x_{\alpha}-x_{\beta})\in\widehat{T} for every t\in[0,\left|\left|x_{\alpha}-x_{\beta}\right|\right|_{2}], which justifies passing to the suprema above. The estimate (2.29) yields

(supxα,xβTψ(xα)ψ(xβ)22xαxβ22>C)\displaystyle\mathbb{P}\left(\sup_{x_{\alpha},x_{\beta}\in T}\frac{\left|\left|\psi(x_{\alpha})-\psi(x_{\beta})\right|\right|_{2}^{2}}{\left|\left|x_{\alpha}-x_{\beta}\right|\right|_{2}^{2}}>C\right)
(1n2i=1n2[supyT^(σ(Wiy+bi))4+supξT~(Wiξ)4]>C)\displaystyle\qquad\leq\mathbb{P}\left(\frac{1}{n_{2}}\sum_{i=1}^{n_{2}}\left[\sup_{y\in\widehat{T}}\left(\sigma^{\prime}\left(W_{i}\cdot y+b_{i}\right)\right)^{4}+\sup_{\xi\in\widetilde{T}}\left(W_{i}\cdot\xi\right)^{4}\right]>C\right)

Since σ\sigma^{\prime} is polynomially bounded by assumption (1.2), we find by Markov’s inequality that there exists an even integer k2k\geq 2 so that for any C>1C>1

(supxα,xβTψ(xα)ψ(xβ)2xαxβ2>C)\displaystyle\mathbb{P}\left(\sup_{x_{\alpha},x_{\beta}\in T}\frac{\left|\left|\psi(x_{\alpha})-\psi(x_{\beta})\right|\right|^{2}}{\left|\left|x_{\alpha}-x_{\beta}\right|\right|^{2}}>C\right) 1+𝔼[supyT^|W1y+b1|k+supξT~|W1ξ|4]C1.\displaystyle\leq\frac{1+\mathbb{E}\left[\sup_{y\in\widehat{T}}\left|W_{1}\cdot y+b_{1}\right|^{k}+\sup_{\xi\in\widetilde{T}}\left|W_{1}\cdot\xi\right|^{4}\right]}{C-1}. (2.30)

Our goal is now to show that the numerator in (2.30) is bounded above by a constant that depends only on T0,n0,λT_{0},n_{0},\lambda. For this, let us fix any y0T^y_{0}\in\widehat{T} and apply Lemma 2.7 as follows:

|W1y+b1|k\displaystyle\left|W_{1}\cdot y+b_{1}\right|^{k} =|W1(yy0)+W1y0+b1|k\displaystyle=\left|W_{1}\cdot(y-y_{0})+W_{1}\cdot y_{0}+b_{1}\right|^{k}
\displaystyle\leq 2^{2k-1}\left(1+\left|W_{1}\cdot(y-y_{0})\right|^{2k}\right)+\frac{1}{4}\left[\left(2+\left|W_{1}\cdot y_{0}\right|\right)^{4k}+\left(1+\left|b_{1}\right|\right)^{4k}\right].

Substituting this and the analogous estimate for \left|W_{1}\cdot\xi\right|^{4} into (2.30), we see that, since all moments of the entries of the weights and biases exist, there exists a constant C^{\prime}>0 depending on \lambda,T_{0},k so that

(supxα,xβTψ(xα)ψ(xβ)2xαxβ2>C)\displaystyle\mathbb{P}\left(\sup_{x_{\alpha},x_{\beta}\in T}\frac{\left|\left|\psi(x_{\alpha})-\psi(x_{\beta})\right|\right|^{2}}{\left|\left|x_{\alpha}-x_{\beta}\right|\right|^{2}}>C\right)
C+𝔼[supyT^|W1(yy0)|2k+supξT~|W1(ξξ0)|4]C1,\displaystyle\qquad\leq\frac{C^{\prime}+\mathbb{E}\left[\sup_{y\in\widehat{T}}\left|W_{1}\cdot(y-y_{0})\right|^{2k}+\sup_{\xi\in\widetilde{T}}\left|W_{1}\cdot(\xi-\xi_{0})\right|^{4}\right]}{C-1}, (2.31)

where \xi_{0}\in\widetilde{T} is any fixed point. Note that by Lemma 2.6, the sets \widehat{T},\widetilde{T} are both contained in the image of a compact subset T^{\prime}\subseteq\mathbb{R}^{3n_{0}+1} under a Lipschitz map, with Lipschitz constant depending only on \lambda and T_{0}. Thus, an application of Lemma 2.10 (with n_{0} there replaced by 3n_{0}+1) shows that there exists a constant C^{\prime\prime} depending only on \lambda,T_{0},k so that

𝔼[supyT^|W1(yy0)|2k]+𝔼[supξT~|W1(ξξ0)|4]C′′.\mathbb{E}\left[\sup_{y\in\widehat{T}}\left|W_{1}\cdot(y-y_{0})\right|^{2k}\right]+\mathbb{E}\left[\sup_{\xi\in\widetilde{T}}\left|W_{1}\cdot(\xi-\xi_{0})\right|^{4}\right]\leq C^{\prime\prime}.

Substituting this into (2.31) and taking C sufficiently large completes the proof of Lemma 2.11. ∎

Lemma 2.11 directly yields the equi-Lipschitz estimate (2.27). Indeed, fix \epsilon\in(0,1) and a compact set T\subseteq\mathbb{R}^{n_{0}}, and define

hα():={CWn0xα,=0CWnσ(zα()),=1,,L1nL+1zα(L+1),=L+1.h_{\alpha}^{(\ell)}:=\begin{cases}\sqrt{\frac{C_{W}}{n_{0}}}x_{\alpha},&\qquad\ell=0\\ \sqrt{\frac{C_{W}}{n_{\ell}}}\sigma(z_{\alpha}^{(\ell)}),&\quad\ell=1,\ldots,L\\ \frac{1}{\sqrt{n_{L+1}}}z_{\alpha}^{(L+1)},&\qquad\ell=L+1\end{cases}.

For =1,,L+1\ell=1,\ldots,L+1 we have

hα()={CWnσ(W^()hα(1)+b()),=1,,L1nL+1(W^(L+1)hα(L)+b(L+1)),=L+1h_{\alpha}^{(\ell)}=\begin{cases}\sqrt{\frac{C_{W}}{n_{\ell}}}\sigma\left(\widehat{W}^{(\ell)}h_{\alpha}^{(\ell-1)}+b^{(\ell)}\right),&\qquad\ell=1,\ldots,L\\ \frac{1}{\sqrt{n_{L+1}}}\left(\widehat{W}^{(L+1)}h_{\alpha}^{(L)}+b^{(L+1)}\right),&\qquad\ell=L+1\end{cases}

where the rescaled weight matrices \widehat{W}^{(\ell)}, defined in (1.4), have iid mean-0, variance-1 entries with finite higher moments (indeed, since W^{(\ell)}=(C_{W}/n_{\ell-1})^{1/2}\widehat{W}^{(\ell)}, we have \widehat{W}^{(\ell)}h_{\alpha}^{(\ell-1)}=W^{(\ell)}\sigma(z_{\alpha}^{(\ell-1)}) for \ell=2,\ldots,L+1 and \widehat{W}^{(1)}h_{\alpha}^{(0)}=W^{(1)}x_{\alpha}, which gives the displayed recursion). To each of the transformations h_{\alpha}^{(\ell)}\mapsto h_{\alpha}^{(\ell+1)} we may now apply Lemma 2.11. Specifically, applying Lemma 2.11 in the first layer shows that there exists C^{(1)}>0 so that the rescaled first layer map satisfies

supxα,xβThα(1)hβ(1)xαxβC(1)\sup_{x_{\alpha},x_{\beta}\in T}\frac{\left|\left|h_{\alpha}^{(1)}-h_{\beta}^{(1)}\right|\right|}{\left|\left|x_{\alpha}-x_{\beta}\right|\right|}\leq C^{(1)}

with probability at least 1ϵ/(L+1)1-\epsilon/(L+1). Thus, the image

T(1):=h(1)(T)T^{(1)}:=h^{(1)}(T)

of T under the normalized first layer map is the image under a C^{(1)}-Lipschitz map of the compact set T\subseteq\mathbb{R}^{n_{0}}. This allows us to apply Lemma 2.11 again, but this time to the second layer, to conclude that, again, there exists C^{(2)}>0 so that with probability at least 1-2\epsilon/(L+1) the normalized second layer map satisfies

supxα,xβThα(2)hβ(2)xαxβsupxα,xβThα(2)hβ(2)hα(1)hβ(1)supxα,xβThα(1)hβ(1)xαxβC(1)C(2).\sup_{x_{\alpha},x_{\beta}\in T}\frac{\left|\left|h_{\alpha}^{(2)}-h_{\beta}^{(2)}\right|\right|}{\left|\left|x_{\alpha}-x_{\beta}\right|\right|}\leq\sup_{x_{\alpha},x_{\beta}\in T}\frac{\left|\left|h_{\alpha}^{(2)}-h_{\beta}^{(2)}\right|\right|}{\left|\left|h_{\alpha}^{(1)}-h_{\beta}^{(1)}\right|\right|}\sup_{x_{\alpha},x_{\beta}\in T}\frac{\left|\left|h_{\alpha}^{(1)}-h_{\beta}^{(1)}\right|\right|}{\left|\left|x_{\alpha}-x_{\beta}\right|\right|}\leq C^{(1)}C^{(2)}.

Proceeding in this way, we find that with probability at least 1-\epsilon

\sup_{x_{\alpha},x_{\beta}\in T}\frac{\left|\left|z_{\alpha}^{(L+1)}-z_{\beta}^{(L+1)}\right|\right|}{\left|\left|x_{\alpha}-x_{\beta}\right|\right|}=n_{L+1}^{1/2}\sup_{x_{\alpha},x_{\beta}\in T}\frac{\left|\left|h_{\alpha}^{(L+1)}-h_{\beta}^{(L+1)}\right|\right|}{\left|\left|x_{\alpha}-x_{\beta}\right|\right|}\leq n_{L+1}^{1/2}C^{(1)}\cdots C^{(L+1)}.

Since n_{L+1} is fixed and finite, this confirms (2.27). It remains to check the uniform boundedness condition in (2.1). For this, note that for any fixed x_{\beta}\in T, by Lemma 2.4, we have

supn1,,nL1𝔼[1nL+1zβ(L+1)2]<.\displaystyle\sup_{n_{1},\ldots,n_{L}\geq 1}\mathbb{E}\left[\frac{1}{n_{L+1}}\left|\left|z_{\beta}^{(L+1)}\right|\right|^{2}\right]<\infty.

Thus, by Markov's inequality, \left|\left|z_{\beta}^{(L+1)}\right|\right| is bounded above with high probability. Combined with the equi-Lipschitz condition (2.27), which we just saw holds with high probability on T, we conclude that for each \epsilon>0 there exists C>0 so that

\mathbb{P}\left(\sup_{x_{\alpha}\in T}\left|\left|z_{\alpha}^{(L+1)}\right|\right|\leq C\right)\geq 1-\epsilon,

as desired. ∎

References

  • [1] Ben Adlam, Jake Levinson, and Jeffrey Pennington. A random matrix perspective on mixtures of nonlinearities for deep learning. arXiv preprint arXiv:1912.00827, 2019.
  • [2] Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pages 74–84. PMLR, 2020.
  • [3] Andrew Ahn. Fluctuations of beta-Jacobi product processes. arXiv preprint arXiv:1910.00743, 2019.
  • [4] Gernot Akemann and Zdzislaw Burda. Universal microscopic correlation functions for products of independent Ginibre matrices. Journal of Physics A: Mathematical and Theoretical, 45(46):465201, 2012.
  • [5] Gernot Akemann, Zdzislaw Burda, and Mario Kieburg. Universal distribution of Lyapunov exponents for products of Ginibre matrices. Journal of Physics A: Mathematical and Theoretical, 47(39):395202, 2014.
  • [6] Gernot Akemann, Zdzislaw Burda, and Mario Kieburg. From integrable to chaotic systems: Universal local statistics of Lyapunov exponents. EPL (Europhysics Letters), 126(4):40001, 2019.
  • [7] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
  • [8] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
  • [9] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. 2020.
  • [10] Andrea Crisanti, Giovanni Paladin, and Angelo Vulpiani. Products of random matrices: in Statistical Physics, volume 104. Springer Science & Business Media, 2012.
  • [11] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. Proceedings of Neural Information Processing Systems, 2016.
  • [12] Ingrid Daubechies, Ronald DeVore, Simon Foucart, Boris Hanin, and Guergana Petrova. Nonlinear approximation and (deep) ReLU networks. Constructive Approximation, pages 1–46, 2021.
  • [13] Ronald DeVore, Boris Hanin, and Guergana Petrova. Neural network approximation. Acta Numerica 2021 (to appear). arXiv:2012.14501, 2020.
  • [14] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. Proceedings of International Conference on Representation Learning, 2019.
  • [15] Ronen Eldan, Dan Mikulincer, and Tselil Schramm. Non-asymptotic approximations of neural networks by gaussian processes. arXiv preprint arXiv:2102.08668, 2021.
  • [16] Zhou Fan and Zhichao Wang. Spectra of the conjugate kernel and neural tangent kernel for linear-width neural networks. Proceedings of Neural Information Processing Systems, 2020.
  • [17] Harry Furstenberg. Noncommuting random products. Transactions of the American Mathematical Society, 108(3):377–428, 1963.
  • [18] Harry Furstenberg and Harry Kesten. Products of random matrices. The Annals of Mathematical Statistics, 31(2):457–469, 1960.
  • [19] Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow Gaussian processes. arXiv preprint arXiv:1808.05587, 2018.
  • [20] Vadim Gorin and Yi Sun. Gaussian fluctuations for products of random matrices. To appear in American Journal of Mathematics. arXiv:1812.06532, 2018.
  • [21] Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? Proceedings of Neural Information Processing Systems 2018, 2018.
  • [22] Boris Hanin. Universal function approximation by deep neural nets with bounded width and ReLU activations. Mathematics, 7(10):992, 2019.
  • [23] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. Proceedings of International Conference on Representation Learning, 2019.
  • [24] Boris Hanin and Mihai Nica. Products of many large random matrices and gradients in deep neural networks. Communications in Mathematical Physics, pages 1–36, 2019.
  • [25] Boris Hanin and Grigoris Paouris. Non-asymptotic results for singular values of gaussian matrix products. Geometric and Functional Analysis, 31(2):268–324, 2021.
  • [26] Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. In International Conference on Machine Learning, pages 2596–2604. PMLR, 2019.
  • [27] Boris Hanin and David Rolnick. Deep ReLU networks have surprisingly few activation patterns. Proceedings of Neural Information Processing Systems, 2019.
  • [28] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
  • [29] Donald Olding Hebb. The organization of behavior; a neuropsychological theory. A Wiley Book in Clinical Psychology, 62:78, 1949.
  • [30] Jiaoyang Huang and Horng-Tzer Yau. Dynamics of deep neural networks and neural tangent hierarchy. In International Conference on Machine Learning, pages 4542–4551. PMLR, 2020.
  • [31] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Proceedings of Neural Information Processing Systems, 2018.
  • [32] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105, 2012.
  • [33] Rafał Latała. Estimation of moments of sums of independent real random variables. The Annals of Probability, 25(3):1502–1513, 1997.
  • [34] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. Proceedings of International Conference on Representation Learning, 2019.
  • [35] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant. Proceedings of Neural Information Processing Systems, 2020.
  • [36] Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. Proceedings of International Conference on Representation Learning, 2018.
  • [37] Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996.
  • [38] Alexandru Nica and Roland Speicher. Lectures on the combinatorics of free probability, volume 13. Cambridge University Press, 2006.
  • [39] Lorenzo Noci, Gregor Bachmann, Kevin Roth, Sebastian Nowozin, and Thomas Hofmann. Precise characterization of the prior predictive distribution of deep relu networks. arXiv preprint arXiv:2106.06615, 2021.
  • [40] Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. arXiv preprint arXiv:1810.05148, 2018.
  • [41] S Péché. A note on the Pennington–Worah distribution. Electronic Communications in Probability, 24:1–7, 2019.
  • [42] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics, pages 1924–1932. PMLR, 2018.
  • [43] Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124005, 2019.
  • [44] Daniel Roberts, Sho Yaida, and Boris Hanin. The principles of deep learning theory. arXiv:2106.10165 (under contract at Cambridge University Press), 2021.
  • [45] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.
  • [46] David Ruelle. Ergodic theory of differentiable dynamical systems. Publications Mathématiques de l’Institut des Hautes Études Scientifiques, 50(1):27–58, 1979.
  • [47] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
  • [48] Dan Voiculescu. Addition of certain non-commuting random variables. Journal of functional analysis, 66(3):323–346, 1986.
  • [49] Eugene P Wigner. On the distribution of the roots of certain symmetric matrices. Annals of Mathematics, pages 325–327, 1958.
  • [50] Sho Yaida. Non-Gaussian processes and neural networks at finite widths. In Mathematical and Scientific Machine Learning, pages 165–192. PMLR, 2020.
  • [51] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
  • [52] Greg Yang. Tensor programs i: Wide feedforward or recurrent neural networks of any architecture are Gaussian processes. Proceedings of Neural Information Processing Systems, 2019.
  • [53] Greg Yang. Tensor programs ii: Neural tangent kernel for any architecture. arXiv preprint arXiv:2006.14548, 2020.
  • [54] Greg Yang. Tensor programs iii: Neural matrix laws. arXiv preprint arXiv:2009.10685, 2020.
  • [55] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
  • [56] Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory, pages 639–649. PMLR, 2018.
  • [57] Jacob A Zavatone-Veth and Cengiz Pehlevan. Exact priors of finite neural networks. arXiv preprint arXiv:2104.11734, 2021.
  • [58] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.