Random Neural Networks in the Infinite Width Limit as Gaussian Processes
Abstract
This article gives a new proof that fully connected neural networks with random weights and biases converge to Gaussian processes in the regime where the input dimension, output dimension, and depth are kept fixed, while the hidden layer widths tend to infinity. Unlike prior work, convergence is shown assuming only moment conditions on the distribution of the weights, and for quite general non-linearities.
1 Introduction
In the last decade or so, neural networks, originally introduced in the 1940s and 50s [29, 45], have become indispensable tools for machine learning tasks ranging from computer vision [32] to natural language processing [9] and reinforcement learning [47]. Their empirical success has raised many new mathematical questions in approximation theory [13, 55, 56], probability (see §1.2.2 for some references), optimization/learning theory [7, 8, 31, 58] and so on. The present article concerns a fundamental probabilistic question about arguably the simplest networks, the so-called fully connected neural networks, defined as follows:
Definition 1.1 (Fully Connected Network).
Fix a positive integer $L$ as well as positive integers $n_0, \dots, n_{L+1}$ and a function $\sigma : \mathbb{R} \to \mathbb{R}$. A fully connected depth $L$ neural network with input dimension $n_0$, output dimension $n_{L+1}$, hidden layer widths $n_1, \dots, n_L$, and non-linearity $\sigma$ is any function $x_\alpha \in \mathbb{R}^{n_0} \mapsto z_\alpha^{(L+1)} \in \mathbb{R}^{n_{L+1}}$ of the following form
$$z_\alpha^{(\ell)} = \begin{cases} W^{(1)} x_\alpha + b^{(1)}, & \ell = 1 \\ W^{(\ell)} \sigma\left(z_\alpha^{(\ell-1)}\right) + b^{(\ell)}, & \ell = 2, \dots, L+1, \end{cases}$$
where $W^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}$ are matrices, $b^{(\ell)} \in \mathbb{R}^{n_\ell}$ are vectors, and $\sigma$ applied to a vector is shorthand for $\sigma$ applied to each component.
The parameters $L, n_0, \dots, n_{L+1}$ are called the network architecture, and $z_\alpha^{(\ell)} \in \mathbb{R}^{n_\ell}$ is called the vector of pre-activations at layer $\ell$ corresponding to the input $x_\alpha$. A fully connected network with a fixed architecture and given non-linearity is therefore a finite (but typically high) dimensional family of functions, parameterized by the network weights (entries of the weight matrices $W^{(\ell)}$) and biases (components of the bias vectors $b^{(\ell)}$).
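Definition 1.1 can be illustrated numerically. The following is a minimal sketch (not code from the paper) of the forward pass with an arbitrary tanh architecture; the widths, seed, and scalings are illustrative choices:

```python
import numpy as np

def forward(x, weights, biases, sigma):
    """Forward pass of a fully connected network as in Definition 1.1:
    an affine map in the first layer, then alternating componentwise
    non-linearity and affine maps, returning the final pre-activations."""
    z = weights[0] @ x + biases[0]              # layer-1 pre-activations
    for W, b in zip(weights[1:], biases[1:]):
        z = W @ sigma(z) + b                    # z^(l) = W^(l) sigma(z^(l-1)) + b^(l)
    return z

# Hypothetical architecture: depth L = 2, widths (n0, n1, n2, n3) = (3, 5, 5, 2).
widths = [3, 5, 5, 2]
rng = np.random.default_rng(0)
weights = [rng.normal(0.0, 1.0 / np.sqrt(m), size=(n, m))
           for m, n in zip(widths[:-1], widths[1:])]
biases = [rng.normal(0.0, 1.0, size=n) for n in widths[1:]]

out = forward(np.ones(3), weights, biases, np.tanh)
print(out.shape)  # (2,)
```

The output is the vector of pre-activations at layer $L+1$; no non-linearity is applied after the final affine map, matching the definition.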
This article considers the mapping $x_\alpha \mapsto z_\alpha^{(L+1)}$ when the network's weights and biases are chosen independently at random and the hidden layer widths $n_1, \dots, n_L$ are sent to infinity while the input dimension $n_0$, output dimension $n_{L+1}$, and network depth $L$ are fixed. In this infinite width limit, akin to the large matrix limit in random matrix theory (see §1.2), neural networks with random weights and biases converge to Gaussian processes (see §1.4 for a review of prior work). Unlike prior work, our main result, Theorem 1.2, shows that this holds for general non-linearities and distributions of network weights (cf. §1.3).
Moreover, in addition to establishing convergence of wide neural networks to a Gaussian process under weak hypotheses, the present article gives a mathematical treatment, aimed at probabilists, of some of the ideas developed in the recent monograph [44]. This book, written in the language and style of theoretical physics by Roberts and Yaida, is based on research done jointly with the author. It represents a far-reaching development of the breakthrough work of Yaida [50], which was the first to systematically explain how to compute finite width corrections to the infinite width Gaussian process limit of random neural networks for arbitrary depth, width, and non-linearity. Previously, such finite width (and large depth) corrections were only possible for some special observables in linear and ReLU networks [21, 23, 24, 25, 39, 57]. The present article deals only with the asymptotic analysis of random neural networks as the width tends to infinity, leaving to future work a probabilistic elaboration of some aspects of the approach to finite width corrections from [44].
1.1 Roadmap
The rest of this article is organized as follows. First, in §1.2 we briefly motivate the study of neural networks with random weights. Then, in §1.3 we formulate our main result, Theorem 1.2. Before giving its proof in §2, we first indicate in §1.4 the general idea of the proof and its relation to prior work.
1.2 Why Random Neural Networks?
1.2.1 Practical Motivations
It may seem at first glance that studying neural networks with random weights and biases is of no practical interest. After all, a neural network is only useful after it has been “trained,” i.e. one has found a setting of its parameters so that the resulting network function (at least approximately) interpolates a given training dataset of input-output pairs for an otherwise unknown function .
However, the vast majority of neural network training algorithms used in practice are variants of gradient descent starting from a random initialization of the weight matrices and bias vectors . Studying networks with random weights and biases therefore provides an understanding of the initial conditions for neural network optimization.
Beyond illuminating the properties of networks at the start of training, the analysis of random neural networks can reveal a great deal about networks after training as well. Indeed, on a heuristic level, just as the behavior of the level spacings of the eigenvalues of large random matrices is a surprisingly good match for emission spectra of heavy atoms [49], it is not unreasonable to believe that certain coarse properties of the incredibly complex networks used in practice will be similar to those of networks with random weights and biases. More rigorously, neural networks used in practice often have many more tunable parameters (weights and biases) than the number of datapoints from the training dataset. Thus, at least in certain regimes, neural network training provably proceeds by an approximate linearization around initialization, since no one parameter needs to move much to fit the data. This so-called NTK analysis [14, 16, 30, 31, 35] shows, with several important caveats related to network size and initialization scheme, that in some cases the statistical properties of neural networks at the start of training are the key determinants of their behavior throughout training.
1.2.2 Motivation from Random Matrix Theory
In addition to being of practical importance, random neural networks are also fascinating mathematical objects, giving rise to new problems in approximation theory [12, 13, 22, 55, 56], random geometry [26, 27], and random matrix theory (RMT). Perhaps the most direct, though by no means only, connection to RMT questions is to set the network biases to zero and consider the very special case when $\sigma$ is the identity (in the machine learning literature these are called deep linear networks). The network function
$$z_\alpha^{(L+1)} = W^{(L+1)} W^{(L)} \cdots W^{(1)} x_\alpha \tag{1.1}$$
is then a linear statistic for a product of independent random matrices. Such matrix models have been extensively studied, primarily in two regimes. The first is the multiplicative ergodic theorem regime [10, 17, 18, 46], in which all the layer widths are typically set to a fixed value and the network depth $L$ tends to infinity. The second regime, where $L$ is fixed and the layer widths (i.e. matrix dimensions) tend to infinity, is the purview of free probability [38, 48].
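A small simulation sketch (not from the paper) of the deep linear case: with weight entries of variance $1/\text{width}$, each layer of the matrix product preserves the expected squared norm of the propagated vector, which is why the output remains of order one as depth grows. The depth, width, and trial count below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth, n_trials = 100, 5, 2000
x = np.ones(n) / np.sqrt(n)                     # unit-norm input

norms = []
for _ in range(n_trials):
    v = x
    for _ in range(depth):
        # one factor of the random matrix product in (1.1), entries N(0, 1/n)
        v = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n)) @ v
    norms.append(v @ v)

print(np.mean(norms))  # ≈ 1: E||W v||^2 = ||v||^2 at every layer
```

The fluctuations of the squared norm around its mean grow with the ratio of depth to width, a hint of the double scaling phenomena mentioned below.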
In the presence of a non-linearity, random neural networks provide non-linear generalizations of the usual RMT questions. For instance, the questions taken up in this article are analogs of the joint normality of linear statistics of random matrix products in the free probability regime. Further, random neural networks give additional motivation for studying matrix products appearing in (1.1) when the matrix dimensions and the number of terms are simultaneously large. This double scaling limit reveals new phenomena [3, 4, 5, 6, 20, 24, 25] but is so far poorly understood relative to the ergodic or free regimes.
Finally, beyond studying linear networks, random matrix theory questions naturally appear in neural network theory via non-linear analogs of the Marchenko-Pastur distribution for empirical covariance matrices of when ranges over a random dataset of inputs [1, 28, 41, 43] as well as through the spectrum of the input-output Jacobian [24, 42] and the NTK [2, 16].
1.3 Main Result
Our main result shows that under rather general conditions, when the weights and biases of a fully connected network are chosen at random, the resulting field $x_\alpha \mapsto z_\alpha^{(L+1)}$ converges to a centered Gaussian field with iid components when the input dimension $n_0$ and output dimension $n_{L+1}$ are held fixed but the hidden layer widths $n_1, \dots, n_L$ tend to infinity. To give the precise statement in Theorem 1.2 below, fix a fully connected neural network with depth $L$, input dimension $n_0$, output dimension $n_{L+1}$, hidden layer widths $n_1, \dots, n_L$, and non-linearity $\sigma$. We assume that $\sigma$ is absolutely continuous and that its almost-everywhere defined derivative $\sigma'$ (and hence $\sigma$ itself) is polynomially bounded: there exist $C, k > 0$ so that
$$\left|\sigma'(x)\right| \le C\left(1 + |x|^k\right) \quad \text{for almost every } x \in \mathbb{R}. \tag{1.2}$$
All non-linearities used in practice satisfy these rather mild criteria. Further, let us write $W_{ij}^{(\ell)}$ for the entries of the weight matrices $W^{(\ell)}$ and $b_i^{(\ell)}$ for the components of the bias vectors $b^{(\ell)}$. For $\ell = 1, \dots, L$, Definition 1.1 of fully connected networks means that the formula for the components of the pre-activations at layer $\ell + 1$ in terms of those at layer $\ell$ reads
$$z_{i,\alpha}^{(\ell+1)} = b_i^{(\ell+1)} + \sum_{j=1}^{n_\ell} W_{ij}^{(\ell+1)} \sigma\left(z_{j,\alpha}^{(\ell)}\right), \qquad i = 1, \dots, n_{\ell+1}, \tag{1.3}$$
where we've denoted by $z_{i,\alpha}^{(\ell)}$ the $i$-th component of the $n_\ell$-dimensional vector $z_\alpha^{(\ell)}$ of pre-activations in layer $\ell$ corresponding to a network input $x_\alpha$. We make the following assumption on the network weights:
$$W_{ij}^{(\ell)} \sim \left(\frac{C_W}{n_{\ell-1}}\right)^{1/2} \mu \quad \text{iid}, \tag{1.4}$$
where $\mu$ is a fixed probability distribution on $\mathbb{R}$ such that
$$\int_{\mathbb{R}} x \, d\mu(x) = 0, \qquad \int_{\mathbb{R}} x^2 \, d\mu(x) = 1, \qquad \int_{\mathbb{R}} |x|^k \, d\mu(x) < \infty \quad \text{for all } k \ge 1. \tag{1.5}$$
We further assume the network biases are iid Gaussian (as explained in §1.4, the universality results in this article are simply not true if the biases are drawn iid from a fixed non-Gaussian distribution) and independent of the weights:
$$b_i^{(\ell)} \sim \mathcal{N}(0, C_b) \quad \text{iid}. \tag{1.6}$$
In (1.4) and (1.6), $C_W > 0$ and $C_b \ge 0$ are fixed constants. These constants do not play an important role in the analysis of this article but will be crucial for followup work. With the network weights and biases chosen at random, the vectors $z_\alpha^{(\ell)}$ are also random. Our main result is that, in the infinite width limit, they have independent Gaussian components.
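The initialization (1.4)–(1.6) can be sketched in a few lines. Here $\mu$ is taken to be a Rademacher law, one admissible non-Gaussian choice with mean zero, variance one, and all moments finite; the constants $C_W$, $C_b$ and the widths are illustrative, not values from the paper:

```python
import numpy as np

C_W, C_b = 2.0, 0.1
fan_in, fan_out = 512, 512
rng = np.random.default_rng(1)

# weights: (C_W / fan_in)^{1/2} times iid draws from mu, as in (1.4)
mu_draws = rng.choice([-1.0, 1.0], size=(fan_out, fan_in))  # mean 0, variance 1
W = np.sqrt(C_W / fan_in) * mu_draws

# biases: iid N(0, C_b), independent of the weights, as in (1.6)
b = rng.normal(0.0, np.sqrt(C_b), size=fan_out)

print(W.var() * fan_in)  # ≈ C_W = 2.0
```

The $1/\text{fan-in}$ scaling of the weight variance is exactly what keeps the pre-activations of order one as the widths grow.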
Theorem 1.2.
Fix $L \ge 1$, $n_0, n_{L+1} \ge 1$, and a compact set $T \subseteq \mathbb{R}^{n_0}$. As the hidden layer widths $n_1, \dots, n_L$ tend to infinity, the sequence of stochastic processes
$$x_\alpha \in T \longmapsto z_\alpha^{(L+1)}$$
converges weakly in $C(T, \mathbb{R}^{n_{L+1}})$ to a centered Gaussian process taking values in $\mathbb{R}^{n_{L+1}}$ with iid coordinates. The coordinate-wise covariance function $K^{(L+1)}(x_\alpha, x_\beta)$
for this limiting process satisfies the layerwise recursion
$$K^{(\ell+1)}(x_\alpha, x_\beta) = C_b + C_W \, \mathbb{E}\left[\sigma(z_\alpha)\sigma(z_\beta)\right], \qquad (z_\alpha, z_\beta) \sim \mathcal{N}\left(0, \begin{pmatrix} K^{(\ell)}(x_\alpha, x_\alpha) & K^{(\ell)}(x_\alpha, x_\beta) \\ K^{(\ell)}(x_\alpha, x_\beta) & K^{(\ell)}(x_\beta, x_\beta) \end{pmatrix}\right) \tag{1.7}$$
for $2 \le \ell \le L$, with initial condition
$$K^{(2)}(x_\alpha, x_\beta) = C_b + C_W \, \mathbb{E}\left[\sigma\left(z_{1,\alpha}^{(1)}\right)\sigma\left(z_{1,\beta}^{(1)}\right)\right], \tag{1.8}$$
where the distribution of $z_{1,\alpha}^{(1)}$ is determined via (1.3) by the distribution of weights and biases in the first layer and hence is not universal.
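The recursion (1.7) can be evaluated numerically. Below is a Monte Carlo sketch (not code from the paper): given the $2 \times 2$ covariance of the limiting field at a pair of inputs in one layer, the next layer's covariance is $C_b + C_W \, \mathbb{E}[\sigma(z)\sigma(z')]$ over the corresponding bivariate Gaussian. The values of $C_b$, $C_W$, $\sigma = \tanh$, and the starting covariance are illustrative:

```python
import numpy as np

def next_layer_cov(K, C_b=0.1, C_W=1.0, sigma=np.tanh, n_mc=400_000, seed=0):
    """One step of the recursion (1.7), estimated by Monte Carlo."""
    rng = np.random.default_rng(seed)
    z = rng.multivariate_normal([0.0, 0.0], K, size=n_mc)  # (z_a, z_b) pairs
    s = sigma(z)
    return C_b + C_W * (s.T @ s) / n_mc     # 2x2 matrix C_b + C_W E[sigma sigma^T]

K_l = np.array([[1.0, 0.5],
                [0.5, 1.0]])                # covariance at layer l for inputs x, x'
K_next = next_layer_cov(K_l)
print(K_next)
```

Iterating `next_layer_cov` from the (non-universal) initial condition (1.8) traces out the full covariance of the limiting Gaussian process.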
1.4 Theorem 1.2: Discussion, Main Idea, and Relation to Prior Work
At a high level, the proof of Theorem 1.2 (specifically the convergence of finite-dimensional distributions) proceeds as follows:
-
1.
Conditional on the mapping $x_\alpha \mapsto z_\alpha^{(L)}$, the components of the neural network output $z_\alpha^{(L+1)}$ are independent sums of independent random fields (see (1.3)), and hence, when $n_L$ is large, are each approximately Gaussian by the CLT.
-
2.
The conditional covariance in the CLT from step 1 is random at finite widths (it depends on $z_\alpha^{(L)}$). However, it has the special form of an average over $j = 1, \dots, n_L$ of the same function applied to each component of the vector of pre-activations at the last hidden layer. We call such objects collective observables (see §2.1.2 and (1.10)).
-
3.
While the components $z_{j,\alpha}^{(L)}$ are not independent at finite width when $L \ge 2$, they are sufficiently weakly correlated that a LLN still applies to any collective observable in the infinite width limit (see §2.1.2).
-
4.
The LLN from step 3 allows us to replace the random conditional covariance matrix from steps 1 and 2 by its expectation, asymptotically as $n_1, \dots, n_L$ tend to infinity.
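The steps above can be watched in a simulation (illustrative, not from the paper): over random draws of a one-hidden-layer network at a fixed input, the distribution of an output neuron approaches a Gaussian as the width grows. We track the excess kurtosis, which vanishes for a Gaussian; at finite width it is positive precisely because the conditional variance of step 2 (an average over the hidden layer) is still random. For simplicity the hidden pre-activations are taken iid standard Gaussian, an assumption rather than the general setting:

```python
import numpy as np

def excess_kurtosis(n, n_nets=50_000, seed=0):
    rng = np.random.default_rng(seed)
    # hidden-layer pre-activations at a fixed input: iid N(0,1) across neurons
    h = np.tanh(rng.normal(size=(n_nets, n)))
    w = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n_nets, n))
    z = np.sum(w * h, axis=1)               # one output neuron per network
    m2, m4 = np.mean(z ** 2), np.mean(z ** 4)
    return m4 / m2 ** 2 - 3.0               # 0 for an exact Gaussian

print(excess_kurtosis(8), excess_kurtosis(128))  # decreases toward 0 with width
```

Conditionally on the hidden layer, `z` is exactly Gaussian here; the residual kurtosis measures the randomness of the conditional variance, which the LLN of step 3 kills as $n \to \infty$.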
We turn to giving a few more details on steps 1-4 and reviewing along the way the relation of the present article to prior work. The study of the infinite width limit for random neural networks dates back at least to Neal [37], who considered networks with one hidden layer:
where the weights, biases, and pre-activations are as in (1.3). In the shallow setting of Neal, if in addition the output dimension is one, then, neglecting the bias for the moment, the scalar field $z_{1,\alpha}^{(2)}$ is a sum of $n_1$ iid random fields with finite moments, and hence the asymptotic normality of its finite-dimensional distributions follows immediately from the multidimensional CLT. Modulo tightness, this explains why $z_{1,\alpha}^{(2)}$ ought to converge to a Gaussian field. Even this simple case, however, holds several useful lessons:
-
•
If the distribution of the bias is fixed independent of $n_1$ and is non-Gaussian, then the distribution of $z_{1,\alpha}^{(2)}$ will not be Gaussian, even in the limit when $n_1 \to \infty$.
-
•
If the first layer biases are drawn iid from a fixed distribution and $\sigma$ is non-linear, then higher moments of the bias distribution will contribute to the variance of each neuron post-activation $\sigma(z_{j,\alpha}^{(1)})$, causing the covariance of the Gaussian field at infinite width to be non-universal.
-
•
Unlike in deeper layers, as long as $n_0$ is fixed, the distribution of each neuron pre-activation in the first layer will not be Gaussian, unless the weights and biases in layer $1$ are themselves Gaussian. This explains why, in the initial condition (1.8), the distribution is non-Gaussian in the first layer.
In light of the first two points, what should one assume about the bias distribution? There are, it seems, two options. The first is to assume that the variance of the biases tends to zero as the widths grow, putting them on par with the weights. The second, which we adopt in this article, is to declare all biases to be Gaussian.
The first trick in proving Theorem 1.2 for general depth and width appears already when $L = 1$ but the output dimension $n_2$ is at least two. (Neal [37] states erroneously on page 38 of his thesis that $z_{1,\alpha}^{(2)}$ and $z_{2,\alpha}^{(2)}$ will be independent because the weights going into them are independent. This is not true at finite width but becomes true in the infinite width limit.) In this case, even for a single network input $x_\alpha$, at finite values of the network width $n_1$ different components of the random $n_2$-dimensional vector $z_\alpha^{(2)}$ are not independent, due to their shared dependence on the vector $z_\alpha^{(1)}$. The key observation, which to the author's knowledge was first presented in [34], is that the components of $z_\alpha^{(2)}$ are independent conditional on the first layer (i.e. on $z_\alpha^{(1)}$) and are approximately Gaussian when $n_1$ is large by the CLT. The conditional variance, which captures the main dependence on $z_\alpha^{(1)}$, has the following form:
$$\mathrm{Var}\left[z_{i,\alpha}^{(2)} \,\middle|\, z_\alpha^{(1)}\right] = C_b + \frac{C_W}{n_1} \sum_{j=1}^{n_1} \sigma\left(z_{j,\alpha}^{(1)}\right)^2. \tag{1.9}$$
This is an example of what we'll call a collective observable, an average over all neurons in a layer of the same function applied to the pre-activations at each neuron (see §2.1.2 for the precise definition). In the shallow setting, (1.9) is a sum of iid random variables with finite moments. Hence, by the LLN, it converges almost surely to its mean as $n_1 \to \infty$. This causes the components of $z_\alpha^{(2)}$ to become independent in the infinite width limit, since the source of their shared randomness, $z_\alpha^{(1)}$, can be replaced asymptotically by its expectation.
The proof for general $L$ follows a similar pattern. Exactly as before, for any $\ell$, the components of the pre-activations at layer $\ell + 1$ are still conditionally independent, given the pre-activations at layer $\ell$. When the width $n_\ell$ is large, the conditional distribution of each component over any finite collection of inputs is therefore approximately Gaussian by the CLT. Moreover, the conditional covariance across network inputs has the form:
$$\mathrm{Cov}\left[z_{i,\alpha}^{(\ell+1)},\, z_{i,\beta}^{(\ell+1)} \,\middle|\, z^{(\ell)}\right] = C_b + \frac{C_W}{n_\ell} \sum_{j=1}^{n_\ell} \sigma\left(z_{j,\alpha}^{(\ell)}\right) \sigma\left(z_{j,\beta}^{(\ell)}\right). \tag{1.10}$$
The summands on the right hand side are no longer independent at finite width if $\ell \ge 2$. However, these conditional covariances are still collective observables, and the crucial point is to check that the dependence between the summands is sufficiently weak that we may still apply the LLN. Verifying this is the heart of the proof of Theorem 1.2 and is carried out in §2.1.2.
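This weak-correlation phenomenon is easy to see numerically. The sketch below (illustrative, not from the paper) measures the fluctuations of a collective observable, the average of $\sigma(z_j)^2$ over the last hidden layer of a depth-3 tanh network at a fixed input, as a function of width; the fluctuations shrink even though the neurons are correlated at finite width. Widths, depth, and trial counts are illustrative, and the first-layer pre-activations are taken iid standard Gaussian for simplicity:

```python
import numpy as np

def observable_std(n, depth=3, n_trials=300, seed=0):
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_trials):
        z = rng.normal(size=n)                      # layer-1 pre-activations
        for _ in range(depth - 1):
            W = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, n))
            z = W @ np.tanh(z)                      # next-layer pre-activations
        vals.append(np.mean(np.tanh(z) ** 2))       # the collective observable
    return float(np.std(vals))

print(observable_std(16), observable_std(256))  # fluctuations shrink with width
```

This is exactly the concentration that Lemma 2.4 below establishes rigorously, including a rate for the variance.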
Let us mention that, in addition to the approach outlined above, other methods for showing that wide neural networks are asymptotically Gaussian processes are possible. In the prior article [36], for instance, the idea is to use that the entries of $z_\alpha^{(\ell)}$ are exchangeable and argue using an exchangeable CLT. This leads to some technical complications which, at least in the way the argument is carried out in [36], result in unnatural restrictions on the class of non-linearities and weight distributions considered there. Let us also mention that in the article [34], the non-independence at finite width of the components of $z_\alpha^{(\ell)}$ for $\ell \ge 2$ was circumvented by considering only the sequential limit in which $n_1, \dots, n_L$ tend to infinity in order of increasing $\ell$. The effect is that for every $\ell$ the conditional covariance has already converged to its mean before $n_{\ell+1}$ is taken large. However, this way of taking the infinite width limit seems to the author somewhat unnatural and is in any case not conducive to studying finite width corrections as in [44, 50], which we plan to take up in future work.
We conclude this section by pointing the reader to several other related strands of work. The first are articles such as [11], which quantify the magnitude of the difference
between the empirical overlaps of pre-activations and the corresponding infinite width limit uniformly over network inputs in a compact subset of . In a similar vein are articles such as [15], which give quantitative estimates at finite width for the distance from to a nearby Gaussian process.
The second is the series of articles starting with the work of Yang [51, 52, 53, 54], which develops the study not only of initialization but also certain aspects of inference with infinitely wide networks using what Yang terms tensor programs. As part of that series, the article [52] establishes that in the infinite width limit many different architectures become Gaussian processes. However, the arguments in those articles are significantly more technical than the ones presented here since they are focused on building the foundation for the tensor program framework. At any rate, to the best of the author’s knowledge, no prior article addresses universality of the Gaussian process limit with respect to the weight distribution in deep networks (for shallow networks with this was considered by Neal in [37]). Finally, that random neural networks converge to Gaussian processes in the infinite width limit under various restrictions but for architectures other than fully connected is taken up in [19, 40, 52].
2 Proof of Theorem 1.2
Let us recall the notation. Namely, we fix a network depth $L$, an input dimension $n_0$, an output dimension $n_{L+1}$, hidden layer widths $n_1, \dots, n_L$, and a non-linearity $\sigma$ satisfying (1.2). We further assume that the network weights and biases are independent and random as in (1.4) and (1.6). To prove Theorem 1.2 we must show that the random fields $x_\alpha \mapsto z_\alpha^{(L+1)}$ converge weakly to a Gaussian process in the limit where $n_1, \dots, n_L$ tend to infinity. We start with the convergence of finite-dimensional distributions. Let us therefore fix a collection
of distinct network inputs in and introduce for each , every , and all the vectorized notation
The following result states that the distribution of the random variable with values in converges to that of the claimed Gaussian field.
Proposition 2.1 (Convergence of Finite-Dimensional Distributions).
Once we have proved Proposition 2.1 in §2.1, it remains to show tightness. For this, we fix a compact subset . The tightness of in follows immediately from the Arzelà-Ascoli Theorem and the following result, which we prove in §2.2.
Proposition 2.2 (High Probability Equicontinuity and Equiboundedness of ).
For every there exists so that
(2.1) |
with probability at least .
2.1 Finite-Dimensional Distributions: Proof of Proposition 2.1
We will prove Proposition 2.1 in two steps. First, we prove a special case in which we keep the weights in layer $1$ completely general as in (1.4) but take weights in layers $2, \dots, L+1$ to be independent Gaussians:
We continue to assume (as in the statement of Theorem 1.2) that all biases are Gaussian:
The argument in this case is the technical heart of this paper and is presented in §2.1.1 - §2.1.2. Ultimately, it relies on the analysis of collective observables, which we isolate in §2.1.2. A simple Lindeberg swapping argument and an induction on the layer, detailed in §2.1.3, allow us to extend Proposition 2.1 from the Gaussian case to general weights in layers $2, \dots, L+1$.
2.1.1 Proof of Proposition 2.1 with Gaussian Weights in Layers $2, \dots, L+1$
Fix
and consider the characteristic function
of the random variable. By Lévy's continuity theorem, it is sufficient to show that
(2.2) |
where
is the matrix defined by the recursion (1.7) with initial condition (1.8). Writing
(2.3) |
we may use the tower property to write
(2.4) |
Note that conditional on , the random vectors in layer for each are iid Gaussians, since we’ve assumed for now that weights in layers are Gaussian. Specifically,
where for any the conditional covariance is
(2.5) |
Using (2.4) and the explicit form of the characteristic function of a Gaussian reveals
(2.6) |
The crucial observation is that each entry of the conditional covariance matrix is an average over the neurons in the last hidden layer of the same fixed function applied to each pre-activation vector. While these summands are not independent at finite widths when $L \ge 2$, they are sufficiently weakly correlated that a weak law of large numbers still holds:
Lemma 2.3.
Fix . There exists a PSD matrix
such that for all
Lemma 2.3 implies that converges in distribution to . In view of (2.6) and the definition of weak convergence this immediately implies (2.2). It therefore remains to check that satisfies the desired recursion. For this, note that at any values of we find
When , we therefore see that
where the law of is determined by the distribution of weights in layer and does not depend on . This confirms the initial condition (1.8). Otherwise, if , the convergence of finite-dimensional distributions that we’ve already established yields
Since is continuous we may invoke the continuous mapping theorem to conclude that
which confirms the recursion (1.7). This completes the proof that the finite-dimensional distributions of converge to those of the desired Gaussian process, modulo two issues. First, we must prove Lemma 2.3. This is done in §2.1.2 by proving a more general result, Lemma 2.4. Second, we must remove the assumption that the weights in layers are Gaussian. This is done in §2.1.3.
2.1.2 Collective Observables with Gaussian Weights: Generalizing Lemma 2.3
This section contains the key technical argument in our proof of Proposition 2.1. To state the main result, define a collective observable at layer to be any random variable of the form
where is measurable and polynomially bounded:
We continue to assume, as in §2.1.1, that the weights (and biases) in layers $2, \dots, L+1$ are Gaussian and will remove this assumption in §2.1.3. Several key properties of collective observables are summarized in the following Lemma:
Lemma 2.4 (Key Properties of Collective Observables).
Let be a collective observable at layer . There exists a deterministic scalar such that
(2.7) |
Moreover,
(2.8) |
Hence, we have the following convergence in probability
Proof.
The proof is by induction on the layer, starting with the first. In this case, the first-layer pre-activations are independent for different neurons. Moreover, for each we have
Hence, they have finite moments, since the biases are iid Gaussian and the weights are mean zero with finite higher moments. In particular, since the defining function of the observable is polynomially bounded, we find for every width that
which is finite and independent of , confirming (2.7). Further, is the average of iid random variables with all moments finite. Hence, (2.8) follows by the weak law of large numbers, completing the proof of the base case.
Let us now assume that we have shown (2.7) and (2.8) for all . We begin by checking that (2.7) holds at layer . We have
(2.9) |
Since the weights and biases in layer are Gaussian and independent of , we find
(2.10) |
where is the conditional covariance defined in (2.5) and is an -dimensional standard Gaussian. The key point is that is a collective observable at layer . Hence, by the inductive hypothesis, there exists a PSD matrix such that converges in probability to as . To establish (2.7) it therefore suffices in view of (2.9) to check that
(2.11) |
where the right hand side is finite since is polynomially bounded and all polynomial moments of are finite. To establish (2.11), let us invoke the Skorohod representation theorem to find a common probability space on which there are versions of – which by an abuse of notation we will still denote by – that converge to almost surely. Next, note that since is polynomially bounded we may repeatedly apply to find
(2.12) |
where is a polynomial in the entries of , a polynomial in the entries of , and the polynomials don’t depend on . The continuous mapping theorem shows that
Thus, since all moments of Gaussian are finite, (2.11) follows from the generalized dominated convergence theorem. It remains to argue that (2.8) holds at layer . To do this, we write
(2.13) |
Observe that
since we already showed that (2.7) holds at layer . Next, recall that, conditional on , neurons in layer are independent. The law of total variance and Cauchy-Schwarz yield
(2.14) |
Using (2.10) and the polynomial estimates (2.12) on , we conclude that the conditional expectation on the previous line is some polynomially bounded function of the components of . Hence, we may apply dominated convergence as above to find
since is a constant. This proves (2.8) for observables at layer and completes the proof of Lemma 2.4. ∎
2.1.3 Proof of Proposition 2.1 for General Weights
In this section, we complete the proof of Proposition 2.1 by removing the assumption from §2.1.1 that weights in layers are Gaussian. To do this, let us introduce some notation. Let us write
for the pre-activations at layers of a random fully connected network in which, as in the general case of Theorem 1.2, all weights and biases are independent, biases are Gaussian as in (1.6) and weights in all layers are drawn from a general centered distribution with the correct variance and finite higher moments as in (1.4) and (1.5). Next, let us write
for the vector of pre-activations obtained by replacing, in each layer beyond the first, all weights by iid centered Gaussians of the same variance. We already saw that the distribution of this field converges weakly to that of the desired Gaussian in the infinite width limit. Our goal is thus to show that, as $n_1, \dots, n_L$ tend to infinity, the two fields converge weakly to the same distribution. We will proceed by induction on the layer. For the first layer the claim is trivial since, by construction, the weight and bias distributions in layer $1$ are identical in both fields (recall that when we proved Proposition 2.1 in §2.1.1 we had Gaussian weights only starting from layer $2$ and general weights in layer $1$).
Suppose therefore that we have shown the claim for . By the Portmanteau theorem and the density of smooth compactly supported functions in the space of continuous compactly supported functions equipped with the supremum norm, it suffices to show that for any smooth function with compact support we have
(2.15) |
To check (2.15), let us define an intermediate object:
where the entries of the last-layer weight matrix are iid Gaussian with mean $0$ and the appropriate variance. That is, we take the vector of post-activations obtained by using general weights through the earlier layers and use Gaussian weights only in the final layer. Our first step in checking (2.15) is to show that this relation holds when the general-weight field is replaced by this intermediate object.
Lemma 2.5.
Let be a finite subset of consisting of distinct elements. Fix in addition and a smooth compactly supported function . There exists and a collective observable at layer so that
Proof.
This is a standard Lindeberg swapping argument. Namely, for each and define
where the first entries of each row of are iid Gaussian with mean and variance , while the remaining entries are times iid draws from the general distribution of network weights, as in (1.4) and (1.5). With this notation, we have
Thus,
For any and
consider the third order Taylor expansion of around .
Let us write
where . Then, Taylor expanding to third order around and using that the first two moments of match those of , we find that
where is a degree polynomial in the -dimensional vector of absolute values Summing this over completes the proof. ∎
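The effect of the swap can be seen numerically. The sketch below (illustrative, not the paper's argument) replaces iid Rademacher weights by iid Gaussians with the same mean and variance in the last layer of a one-hidden-layer model; the change in the expectation of a smooth bounded test function of an output neuron shrinks as the width $n$ grows, because the first two moments match and only higher-order Taylor terms survive. The test function $\cos$, the widths, and the sample sizes are all illustrative choices:

```python
import numpy as np

def swap_gap(n, n_nets=100_000, seed=0):
    rng = np.random.default_rng(seed)
    h = np.tanh(rng.normal(size=(n_nets, n)))          # hidden layer at a fixed input
    w_gauss = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n_nets, n))
    w_rad = rng.choice([-1.0, 1.0], size=(n_nets, n)) / np.sqrt(n)
    f = np.cos                                         # smooth bounded test function
    return abs(np.mean(f(np.sum(w_gauss * h, axis=1)))
               - np.mean(f(np.sum(w_rad * h, axis=1))))

print(swap_gap(2), swap_gap(64))  # the gap shrinks as the width grows
```

Since $\cos(x)$ and $e^{-x^2/2}$ agree through second order, the per-entry discrepancy is of fourth order in the rescaled weights, which is what makes the total gap vanish with the width.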
To make use of Lemma 2.5 let us consider any collective observable at layer . Recall that by (2.9) and (2.13) both the mean and variance of depend only on the distributions of finitely many components of the vector . By the inductive hypothesis we therefore find
(2.16) |
where the right hand side means that we consider the same collective observable but for instead of , which exists by Lemma 2.4. Similarly, again using Lemma 2.4, we have
(2.17) |
Therefore, we conclude that
(2.18) |
Note that by (2.16) the mean of any collective observable is bounded independently of the widths, since we saw in Lemma 2.4 that the limit exists and is bounded when using Gaussian weights. Since the set of inputs is fixed and finite, the error term in Lemma 2.5 therefore tends to zero as the widths grow, and (2.15) is reduced to showing that
(2.19) |
This follows from (2.17) and the inductive hypothesis. Indeed, by construction, conditional on the filtration defined by weights and biases in layers up to (see (2.3)), the -dimensional vectors are iid Gaussians:
where is the conditional covariance matrix from (2.5). The key point, as in the proof with all Gaussian weights, is that each entry of the matrix is a collective observable at layer . Moreover, since the weights and biases in the final layer are Gaussian for the conditional distribution of given is completely determined by . In particular, since is bounded and continuous, we find that
where is a bounded continuous function and is the conditional covariance matrix at layer for . Combining this with the convergence in distribution from (2.18) shows that (2.19) holds and completes the proof of Proposition 2.1 for general weight distributions.
2.2 Tightness: Proof of Proposition 2.2
In this section, we provide a proof of Proposition 2.2. In the course of showing tightness, we will need several elementary Lemmas, which we record in the §2.2.1. We then use them in §2.2.2 to complete the proof of Proposition 2.2.
2.2.1 Preparatory Lemmas
For the first Lemma, let us agree to write for the cone over a subset in a Euclidean space and for the unit ball in .
Lemma 2.6.
Fix integers and a real number . Suppose that is a compact subset of and is -Lipschitz with respect to the -norm on both and . Define the Minkowski sums
There exists a constant , a compact subset of , and a -Lipschitz map (all depending only on the given data), so that
Proof.
By definition,
(2.20) |
In particular, for some constant depending on , we have
Hence, is the image under a Lipschitz map with a Lipschitz constant depending only on of a compact set in . ∎
The second Lemma we need is an elementary inequality.
Lemma 2.7.
Let be real numbers and be an integer. We have
Proof.
The next Lemma is also an elementary estimate.
Lemma 2.8.
Fix an integer , and suppose are non-negative random variables. There exists a positive integer such that
Proof.
The proof is by induction on . For the base cases when , we may take and when we may take by Cauchy-Schwarz. Now suppose we have proved the claim for all for some . Note that . So we may use Cauchy-Schwarz and the inductive hypothesis to obtain
where . ∎
The next Lemma is an elementary result about the moments of marginals of iid random vectors.
Lemma 2.9.
Fix an even integer and a positive integer, and suppose is a probability measure on with mean and finite higher moments. Assume also that is a vector with iid components, each with distribution . Then, there exists a constant depending only on the first moments of such that for all
Proof.
We will use the following result of Łatała [33, Thm. 2, Cor. 2, Rmk. 2]. Suppose are independent random variables and is a positive even integer. Then
(2.23) |
where means bounded above and below up to universal multiplicative constants. Let us fix a unit vector and apply this to . Since have mean and is even we find
Note that for each we have
Hence, using that we find
Note that for , there is a universal constant so that
Thus, there exists a constant so that
Combining this with (2.23) completes the proof. ∎
The final Lemma we need is an integrability statement for the supremum of certain non-Gaussian fields over low-dimensional sets.
Lemma 2.10.
Fix a positive integer , an even integer , a compact set , a constant , and a probability measure on with mean , variance and finite higher moments. For every there exists a constant with the following property. Fix any integer and a -Lipschitz map . Define , and let be a vector with iid components , each drawn from . Then, for any fixed
(2.24)
Proof.
The proof is a standard chaining argument. For each , write for the closest point to in a net in , and assume without loss of generality that the diameter of is bounded above by and that for all . Using the usual chaining trick, we have
(2.25)
By Lemma 2.8, there exists depending only on so that for any we have
(2.26)
We seek to bound each expectation on the right hand side in (2.26). To do this, write
Note that the supremum is only over a finite set of cardinality at most
for some depending only on . This is because, by assumption, is the image of under a -Lipschitz map, and Lipschitz maps preserve covering numbers. Thus, by a union bound,
But for any and any we have
Putting this all together we find for any that
Thus, substituting this into (2.25) yields (2.24), completing the proof. ∎
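The counting step in this proof rests on the fact that Lipschitz maps preserve covering numbers: a net for a compact set maps to a (scaled) net for its image. A small numerical sketch, using a greedy net construction and an arbitrary 3-Lipschitz map (both illustrative choices, not from the text):

```python
import numpy as np

def greedy_net(points, eps):
    """Greedy epsilon-net: every input point ends up within eps of some net point."""
    net = []
    for p in points:
        if all(np.linalg.norm(p - q) > eps for q in net):
            net.append(p)
    return net

rng = np.random.default_rng(2)
A = rng.uniform(-1, 1, size=(500, 2))   # a (sampled) compact set in R^2
L, eps = 3.0, 0.2
f = lambda x: L * np.tanh(x)            # an L-Lipschitz map, applied componentwise

net_A = greedy_net(A, eps)
# Images of the net points form an (L * eps)-net for the image f(A):
f_net = [f(q) for q in net_A]
dists = [min(np.linalg.norm(f(p) - q) for q in f_net) for p in A]
assert max(dists) <= L * eps + 1e-9
```

In particular, the covering number of the image at scale L·ε is at most that of the original set at scale ε, which is what keeps the union bound over the net affordable.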
2.2.2 Proof of Proposition 2.2 Using Lemmas from §2.2.1
Let us first establish the equi-Lipschitz condition, which we recall states that for each and each compact set there exist so that with probability at least we have
(2.27)
For (2.27) to hold, we need a result about the Lipschitz constant of each layer. To ease the notation, define a normalized single layer with random weights and random biases via the map :
(2.28)
where and with drawn iid from a distribution with mean , variance , and finite higher moments. We choose the variance of to be instead of since we will later think of as the normalized vector of post-activations in a given layer.
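Since the display (2.28) is elided above, the sketch below fixes one concrete normalization consistent with the surrounding discussion (an assumption): a layer x ↦ σ(Wx/√n + b) with iid mean-zero, variance-one weights and biases, taken uniform rather than Gaussian to stress that only moment conditions are assumed. Under this scaling the pre-activations have O(1) variance regardless of width:

```python
import numpy as np

rng = np.random.default_rng(3)

def normalized_layer(x, n_out, sigma=np.tanh, rng=rng):
    """One normalized fully connected layer: x -> sigma(W x / sqrt(n_in) + b).

    W and b have iid mean-0, variance-1 entries; uniform entries are used to
    stress that only moment conditions, not Gaussianity, are assumed."""
    n_in = x.shape[0]
    a = np.sqrt(3.0)  # uniform(-sqrt(3), sqrt(3)) has mean 0, variance 1
    W = rng.uniform(-a, a, size=(n_out, n_in))
    b = rng.uniform(-a, a, size=n_out)
    return sigma(W @ x / np.sqrt(n_in) + b)

x = np.ones(2000)
z = normalized_layer(x, 2000, sigma=lambda t: t)  # identity: inspect pre-activations
# Each pre-activation has variance |x|^2 / n_in + Var(b) = 1 + 1 = 2,
# independent of the width, thanks to the 1/sqrt(n_in) normalization.
assert abs(z.var() - 2.0) < 0.3
```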
Lemma 2.11.
Fix an integer , a compact set , and a constant . For every there exists a constant with the following property. Fix any integers , and define as in (2.28). Suppose that is the image of under a -Lipschitz map from to . Then,
Proof.
Fix and define
Write for the -th row of and for the -th component of . Since and is absolutely continuous, we have
(2.29)
where we’ve set
and have denoted by the cone over a set and by the unit ball in . The estimate (2.29) yields
Since is polynomially bounded by assumption (1.2), we find by Markov’s inequality that there exists an even integer so that for any
(2.30)
Our goal is now to show that the numerator in (2.30) is bounded above by a constant that depends only on . For this, let us fix any and apply Lemma 2.7 as follows:
Substituting this and the analogous estimate for into (2.30), we see that since all moments of the entries of the weights and biases exist, there exists a constant depending on so that
(2.31)
where is any fixed point. Note that by Lemma 2.6, the sets are both contained in the image of a compact subset under a Lipschitz map, with Lipschitz constant depending only on . Thus, an application of Lemma 2.10 shows that there exists a constant depending only on so that
Substituting this into (2.31) and taking sufficiently large completes the proof of Lemma 2.11. ∎
Lemma 2.11 directly yields the equi-Lipschitz estimate (2.27). Indeed, fix and a compact set , and define
For we have
where the rescaled weight matrices , defined in (1.4), have iid mean , variance entries with finite higher moments. To each of the transformations we may now apply Lemma 2.11. Specifically, applying Lemma 2.11 in the first layer shows that there exists so that the rescaled first layer map
with probability at least . Thus, the image
of under the normalized first layer map is the image under a Lipschitz map of the compact set . This allows us to apply Lemma 2.11 again, this time to the second layer, to conclude that there exist so that with probability at least the normalized second layer map satisfies
Proceeding in this way, with probability at least we find that
Since is fixed and finite, this confirms (2.27). It remains to check the uniform boundedness condition in (2.1). For this, note that for any fixed , by Lemma 2.4, we have
Thus, by Markov’s inequality, is bounded above with high probability. Combined with the equi-Lipschitz condition, which we just saw holds with high probability on , we conclude that for each there exists so that
as desired.
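The layer-by-layer composition just described can be mimicked numerically: compose independent normalized random layers and check that empirical Lipschitz ratios over a compact set stay bounded as the hidden widths grow. All concrete choices below (tanh nonlinearity, Gaussian weights, the crude finite-difference estimate, the cutoff 50) are illustrative assumptions, not the text's constants:

```python
import numpy as np

rng = np.random.default_rng(4)

def random_network(widths, sigma=np.tanh, rng=rng):
    """Compose independent normalized random layers x -> sigma(W x / sqrt(n_in) + b)."""
    layers = [(rng.normal(size=(n_out, n_in)), rng.normal(size=n_out))
              for n_in, n_out in zip(widths[:-1], widths[1:])]
    def f(x):
        for W, b in layers:
            x = sigma(W @ x / np.sqrt(W.shape[1]) + b)
        return x
    return f

# Crude finite-difference Lipschitz ratios over a compact set: they should
# remain O(1) as the hidden widths grow, as the union-bound argument predicts.
for n in [50, 200, 800]:
    f = random_network([3, n, n, 2])
    X = rng.uniform(-1, 1, size=(100, 3))
    Y = X + 1e-3 * rng.normal(size=X.shape)
    ratios = [np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y)
              for x, y in zip(X, Y)]
    assert max(ratios) < 50
```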
References
- [1] Ben Adlam, Jake Levinson, and Jeffrey Pennington. A random matrix perspective on mixtures of nonlinearities for deep learning. arXiv preprint arXiv:1912.00827, 2019.
- [2] Ben Adlam and Jeffrey Pennington. The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization. In International Conference on Machine Learning, pages 74–84. PMLR, 2020.
- [3] Andrew Ahn. Fluctuations of beta-Jacobi product processes. arXiv preprint arXiv:1910.00743, 2019.
- [4] Gernot Akemann and Zdzislaw Burda. Universal microscopic correlation functions for products of independent Ginibre matrices. Journal of Physics A: Mathematical and Theoretical, 45(46):465201, 2012.
- [5] Gernot Akemann, Zdzislaw Burda, and Mario Kieburg. Universal distribution of Lyapunov exponents for products of Ginibre matrices. Journal of Physics A: Mathematical and Theoretical, 47(39):395202, 2014.
- [6] Gernot Akemann, Zdzislaw Burda, and Mario Kieburg. From integrable to chaotic systems: Universal local statistics of Lyapunov exponents. EPL (Europhysics Letters), 126(4):40001, 2019.
- [7] Peter L Bartlett, Philip M Long, Gábor Lugosi, and Alexander Tsigler. Benign overfitting in linear regression. Proceedings of the National Academy of Sciences, 117(48):30063–30070, 2020.
- [8] Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32):15849–15854, 2019.
- [9] Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. Advances in Neural Information Processing Systems, 2020.
- [10] Andrea Crisanti, Giovanni Paladin, and Angelo Vulpiani. Products of random matrices: in Statistical Physics, volume 104. Springer Science & Business Media, 2012.
- [11] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. Proceedings of Neural Information Processing Systems, 2016.
- [12] Ingrid Daubechies, Ronald DeVore, Simon Foucart, Boris Hanin, and Guergana Petrova. Nonlinear approximation and (deep) ReLU networks. Constructive Approximation, pages 1–46, 2021.
- [13] Ronald DeVore, Boris Hanin, and Guergana Petrova. Neural network approximation. Acta Numerica 2021 (to appear). arXiv:2012.14501, 2020.
- [14] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. Proceedings of the International Conference on Learning Representations, 2019.
- [15] Ronen Eldan, Dan Mikulincer, and Tselil Schramm. Non-asymptotic approximations of neural networks by Gaussian processes. arXiv preprint arXiv:2102.08668, 2021.
- [16] Zhou Fan and Zhichao Wang. Spectra of the conjugate kernel and neural tangent kernel for linear-width neural networks. Proceedings of Neural Information Processing Systems, 2020.
- [17] Harry Furstenberg. Noncommuting random products. Transactions of the American Mathematical Society, 108(3):377–428, 1963.
- [18] Harry Furstenberg and Harry Kesten. Products of random matrices. The Annals of Mathematical Statistics, 31(2):457–469, 1960.
- [19] Adrià Garriga-Alonso, Carl Edward Rasmussen, and Laurence Aitchison. Deep convolutional networks as shallow Gaussian processes. arXiv preprint arXiv:1808.05587, 2018.
- [20] Vadim Gorin and Yi Sun. Gaussian fluctuations for products of random matrices. To appear in American Journal of Mathematics. arXiv:1812.06532, 2018.
- [21] Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? Proceedings of Neural Information Processing Systems, 2018.
- [22] Boris Hanin. Universal function approximation by deep neural nets with bounded width and ReLU activations. Mathematics, 7(10):992, 2019.
- [23] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. Proceedings of the International Conference on Learning Representations, 2019.
- [24] Boris Hanin and Mihai Nica. Products of many large random matrices and gradients in deep neural networks. Communications in Mathematical Physics, pages 1–36, 2019.
- [25] Boris Hanin and Grigoris Paouris. Non-asymptotic results for singular values of gaussian matrix products. Geometric and Functional Analysis, 31(2):268–324, 2021.
- [26] Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. In International Conference on Machine Learning, pages 2596–2604. PMLR, 2019.
- [27] Boris Hanin and David Rolnick. Deep relu networks have surprisingly few activation patterns. Proceedings Neural Information Processing Systems, 2019.
- [28] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani. Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560, 2019.
- [29] Donald Olding Hebb. The organization of behavior; a neuropsychological theory. A Wiley Book in Clinical Psychology, 62:78, 1949.
- [30] Jiaoyang Huang and Horng-Tzer Yau. Dynamics of deep neural networks and neural tangent hierarchy. In International Conference on Machine Learning, pages 4542–4551. PMLR, 2020.
- [31] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Proceedings of Neural Information Processing Systems, 2018.
- [32] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25:1097–1105, 2012.
- [33] Rafał Latała. Estimation of moments of sums of independent real random variables. The Annals of Probability, 25(3):1502–1513, 1997.
- [34] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S Schoenholz, Jeffrey Pennington, and Jascha Sohl-Dickstein. Deep neural networks as Gaussian processes. Proceedings of the International Conference on Learning Representations, 2019.
- [35] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant. Proceedings of Neural Information Processing Systems, 2020.
- [36] Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and Zoubin Ghahramani. Gaussian process behaviour in wide deep neural networks. Proceedings of the International Conference on Learning Representations, 2018.
- [37] Radford M Neal. Priors for infinite networks. In Bayesian Learning for Neural Networks, pages 29–53. Springer, 1996.
- [38] Alexandru Nica and Roland Speicher. Lectures on the combinatorics of free probability, volume 13. Cambridge University Press, 2006.
- [39] Lorenzo Noci, Gregor Bachmann, Kevin Roth, Sebastian Nowozin, and Thomas Hofmann. Precise characterization of the prior predictive distribution of deep ReLU networks. arXiv preprint arXiv:2106.06615, 2021.
- [40] Roman Novak, Lechao Xiao, Jaehoon Lee, Yasaman Bahri, Greg Yang, Jiri Hron, Daniel A Abolafia, Jeffrey Pennington, and Jascha Sohl-Dickstein. Bayesian deep convolutional networks with many channels are Gaussian processes. arXiv preprint arXiv:1810.05148, 2018.
- [41] S Péché. A note on the Pennington–Worah distribution. Electronic Communications in Probability, 24:1–7, 2019.
- [42] Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. The emergence of spectral universality in deep networks. In International Conference on Artificial Intelligence and Statistics, pages 1924–1932. PMLR, 2018.
- [43] Jeffrey Pennington and Pratik Worah. Nonlinear random matrix theory for deep learning. Journal of Statistical Mechanics: Theory and Experiment, 2019(12):124005, 2019.
- [44] Daniel Roberts, Sho Yaida, and Boris Hanin. The principles of deep learning theory. arXiv:2106.10165 (under contract at Cambridge University Press), 2021.
- [45] Frank Rosenblatt. The perceptron: a probabilistic model for information storage and organization in the brain. Psychological review, 65(6):386, 1958.
- [46] David Ruelle. Ergodic theory of differentiable dynamical systems. Publications Mathématiques de l’Institut des Hautes Études Scientifiques, 50(1):27–58, 1979.
- [47] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
- [48] Dan Voiculescu. Addition of certain non-commuting random variables. Journal of Functional Analysis, 66(3):323–346, 1986.
- [49] Eugene P Wigner. On the distribution of the roots of certain symmetric matrices. Annals of Mathematics, pages 325–327, 1958.
- [50] Sho Yaida. Non-Gaussian processes and neural networks at finite widths. In Mathematical and Scientific Machine Learning, pages 165–192. PMLR, 2020.
- [51] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019.
- [52] Greg Yang. Tensor programs I: Wide feedforward or recurrent neural networks of any architecture are Gaussian processes. Proceedings of Neural Information Processing Systems, 2019.
- [53] Greg Yang. Tensor programs II: Neural tangent kernel for any architecture. arXiv preprint arXiv:2006.14548, 2020.
- [54] Greg Yang. Tensor programs III: Neural matrix laws. arXiv preprint arXiv:2009.10685, 2020.
- [55] Dmitry Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks, 94:103–114, 2017.
- [56] Dmitry Yarotsky. Optimal approximation of continuous functions by very deep ReLU networks. In Conference on Learning Theory, pages 639–649. PMLR, 2018.
- [57] Jacob A Zavatone-Veth and Cengiz Pehlevan. Exact priors of finite neural networks. arXiv preprint arXiv:2104.11734, 2021.
- [58] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.