Deep ReLU Networks Preserve Expected Length
Abstract
Assessing the complexity of functions computed by a neural network helps us understand how the network will learn and generalize. One natural measure of complexity is how the network distorts length – if the network takes a unit-length curve as input, what is the length of the resulting curve of outputs? It has been widely believed that this length grows exponentially in network depth. We prove that in fact this is not the case: the expected length distortion does not grow with depth, and indeed shrinks slightly, for ReLU networks with standard random initialization. We also generalize this result by proving upper bounds both for higher moments of the length distortion and for the distortion of higher-dimensional volumes. These theoretical results are corroborated by our experiments.
1 Introduction
The utility of deep neural networks ultimately arises from their ability to learn functions that are sufficiently complex to fit training data and yet simple enough to generalize well. Understanding the precise sense in which functions computed by a given network have low or high complexity is therefore important for studying when the network will perform well. Despite the fundamental importance of this question, our mathematical understanding of the functions expressed and learned by different neural network architectures remains limited.
A popular way to measure the complexity of a neural network function is to compute how it distorts lengths. This may be done by considering a set of inputs lying along a curve and measuring the length of the corresponding curve of outputs. It has been claimed in prior literature that in a ReLU network this length distortion grows exponentially with the network’s depth [22, 23], and this has been used as a justification of the power of deeper networks. We prove that, in fact, for networks with the typical initialization used in practice, expected length distortion does not grow at all with depth.
Our main contributions are:
1. We prove that for ReLU networks initialized with the usual weight variance, the expected length distortion does not grow with depth at initialization, actually decreasing slightly with depth (Thm. 3.1) and exhibiting an interesting width dependency.
2. We generalize this result by proving upper bounds both on higher moments of the length distortion (Thm. 4.1) and on the distortion of higher-dimensional volumes (Thm. 4.2).
3. We empirically verify that our theoretical results accurately predict observed behavior for networks at initialization, while previous bounds are loose and fail to capture subtle architecture dependencies.
It is worth explaining why our conclusions differ from those of [22, 23]. First, prior authors prove only lower bounds on the expected length distortion, while we use different methodology to calculate tight upper bounds, allowing us to say that the expected length distortion does not grow with depth. As we show in Thm. 5.1 and Fig. 1, the prior bounds are in fact quite loose.
Second, Thm. 3(a) in [23] has a critical typo (we have confirmed this in personal correspondence with the authors, and it is simple to verify: the typo arises in the last step of the proof given in the Supplementary Material of [23]), which has unfortunately been perpetuated in Thm. 1 of [22]. Namely, the "2" was omitted in the following statement (paraphrased from the original):
where $\sigma_w^2$ parameterizes the variance of the weight distribution. As we prove in Thm. 3.1, leaving out the 2 makes the statement false, incorrectly suggesting that length distortion explodes with depth for the standard He initialization $\sigma_w^2 = 2$.
Finally, prior authors drew the conclusion that length distortion grows exponentially with depth by considering the behavior of ReLU networks with unrealistically large weights. If one multiplies the weights and biases of a ReLU network by a constant $\lambda$, the length distortion picks up a factor of $\lambda$ for every layer, so it should come as no surprise that there exist settings of the weights for which the distortion grows exponentially with depth (or for which it decays exponentially). The value of our results comes in analyzing the behavior specifically at He initialization ($\sigma_w^2 = 2$, i.e. weight variance $2/\text{fan-in}$) [17]. This initialization is the one used in practice for ReLU networks, since it is the weight variance that must be used if the outputs [14] and gradients [11] are to remain well-controlled. In the present work, we show that this is also the correct initialization for the expected length distortion to remain well-behaved.
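To make the initialization scale concrete, here is a minimal NumPy sketch (our illustration, not the authors' code) of a fully connected ReLU network drawn from the He normal scheme, i.e. weight variance $2/\text{fan-in}$; the widths, seed, and bias variance below are arbitrary illustrative choices.

```python
import numpy as np

def init_relu_net(widths, rng, bias_var=0.0):
    """widths = [n_in, n_1, ..., n_out]; He-normal weights (variance 2 / fan-in)."""
    Ws, bs = [], []
    for fan_in, fan_out in zip(widths[:-1], widths[1:]):
        Ws.append(rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in)))
        bs.append(rng.normal(0.0, np.sqrt(bias_var), size=fan_out))
    return Ws, bs

def forward(Ws, bs, x):
    """Apply the network to a batch of inputs x with shape (batch, n_in)."""
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = np.maximum(h @ W.T + b, 0.0)   # ReLU on every hidden layer
    return h @ Ws[-1].T + bs[-1]           # affine output layer, no ReLU

rng = np.random.default_rng(0)
Ws, bs = init_relu_net([10, 100, 100, 100, 5], rng)
print(forward(Ws, bs, rng.normal(size=(3, 10))).shape)  # (3, 5)
```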
2 Related Work
A range of complexity measures for functions computed by deep neural networks have been considered in the literature, dividing the prior work into at least three categories. In the first, the emphasis is on worst-case (or best-case) scenarios – what is the maximal possible complexity of functions computed by a given network architecture. These works essentially study the expressive power of neural networks and often focus on showing that deep networks are able to express functions that cannot be expressed by shallower networks. For example, it has been shown that it is possible to set the weights of a deep ReLU network such that the number of linear regions computed by the network grows exponentially in the depth [7, 9, 20, 26, 27]. Other works consider the degree of polynomials approximable by networks of different depths [19, 24] and the topological invariants of networks [5].
While such work has sometimes been used to explain the utility of different neural network architectures (especially deeper ones), a second strand of prior work has shown that a significant gap can exist between the functions expressible by a given architecture and those which may be learned in practice. Such average-case analyses have indicated that some readily expressible functions are provably difficult to learn [25] or vanishingly unlikely for random networks [13, 15, 16], and that some functions learned more easily by deep architectures are nonetheless expressible by shallow ones [3]. (While here we consider neural nets with ReLU activation, it is worth noting that for arithmetic circuits, worst-case and average-case scenarios may be more similar – in both scenarios, a significant gap exists between the matricization rank of the functions computed by deep and shallow architectures [6].) As noted earlier, average-case analyses in [22, 23] provided lower bounds on expected length distortion, while [21] presented a similar analysis for the curvature of output trajectories.
3 Expected Length Distortion
3.1 Motivation
While neural networks are typically overparameterized and trained with little or no explicit regularization, the functions they learn in practice are often able to generalize to unseen data. This phenomenon indicates an implicit regularization that causes these learned functions to be surprisingly simple.

Why this occurs is not well understood theoretically, but there is a high-level intuition. In a nutshell, randomly initialized neural networks compute functions of low complexity. Moreover, as optimization by first-order methods is a kind of greedy local search, network training is attracted to minima of the loss that are not too far from the initialization and hence will still be well-behaved.
While this intuition is compelling, a key challenge in making it rigorous is to devise appropriate notions of complexity that are small throughout training. The present work is intended to provide a piece of the puzzle, making precise the idea that, at the start of training, neural networks compute tame functions. Specifically, we demonstrate that neural networks at initialization have low distortion of length and volume, as defined in §3.2.
An important aspect of our analysis is that we study networks in typical, average-case scenarios. A randomly initialized neural network could, in principle, compute any function expressible by a network of that architecture, since the weights might with low probability take on any set of values. Some settings of weights will lead to functions of high complexity, but these settings may be unlikely to occur in practice, depending on the distribution over weights that is under consideration. As prior work has emphasized [22, 23], atypical distributions of weights can lead to exponentially high length distortion. We show that, in contrast, for deep ReLU networks with the standard initialization, the functions computed have low distortion in expectation and with high probability.
3.2 Definitions
Let $d$ be a positive integer and fix positive integers $n_0, n_1, \ldots, n_d$. We consider a fully connected feed-forward ReLU network $\mathcal{N}$ with input dimension $n_{\mathrm{in}} = n_0$, output dimension $n_{\mathrm{out}} = n_d$, and hidden layer widths $n_1, \ldots, n_{d-1}$. Suppose that the weights and biases of $\mathcal{N}$ are independent and Gaussian, with the weights $W^{(i)}_{jk}$ between layers $i-1$ and $i$ and the biases $b^{(i)}_j$ in layer $i$ satisfying:
$W^{(i)}_{jk} \sim \mathcal{N}\big(0,\, 2/n_{i-1}\big), \qquad b^{(i)}_j \sim \mathcal{N}(0,\, C_b).$   (3.1)
Here, $C_b \ge 0$ is any fixed constant and $\mathcal{N}(\mu, \sigma^2)$ denotes a Gaussian with mean $\mu$ and variance $\sigma^2$. For any $M \subseteq \mathbb{R}^{n_{\mathrm{in}}}$, we denote by $f(M)$ the (random) image of $M$ under the map $f$ computed by $\mathcal{N}$. Our primary object of study will be the size of the output $f(M)$ relative to that of $M$. Specifically, when $M$ is a $1$-dimensional curve of unit length, we define the length distortion to be the length of its image, $\operatorname{len}(f(M))$.
Note that while a priori this random variable depends on $M$, we will find that its statistics depend only on the architecture of $\mathcal{N}$ (see Thms. 3.1 and 5.1).
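The length distortion can be estimated numerically by pushing a finely discretized unit-length segment through a randomly initialized network and summing the lengths of the resulting output chords. The following sketch (our illustration, not the paper's experimental code; biases are set to zero for simplicity, and the architecture, discretization, and trial count are arbitrary) averages this quantity over many independent initializations.

```python
import numpy as np

def random_relu_net(widths, rng):
    # He-normal weight matrices, variance 2 / fan-in, zero biases
    return [rng.normal(0, np.sqrt(2.0 / m), size=(n, m))
            for m, n in zip(widths[:-1], widths[1:])]

def apply_net(Ws, x):
    for W in Ws[:-1]:
        x = np.maximum(x @ W.T, 0.0)
    return x @ Ws[-1].T

rng = np.random.default_rng(0)
n_in, n_out, depth, width = 5, 5, 10, 100
widths = [n_in] + [width] * (depth - 1) + [n_out]

# Unit-length segment M in input space, discretized into a fine polyline.
a = rng.normal(size=n_in); a /= np.linalg.norm(a)
x0 = rng.normal(size=n_in)
ts = np.linspace(0.0, 1.0, 2001)
curve_in = x0 + ts[:, None] * a                       # len(M) = 1

distortions = []
for _ in range(200):                                  # average over initializations
    Ws = random_relu_net(widths, rng)
    curve_out = apply_net(Ws, curve_in)
    distortions.append(np.sum(np.linalg.norm(np.diff(curve_out, axis=0), axis=1)))
print(np.mean(distortions), np.std(distortions))      # empirical mean and spread
```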
3.3 Results
We prove that the expected length distortion does not grow with depth – in fact, it slightly decreases. Our result may be informally stated thus:
Theorem 3.1 (Length distortion: Mean).
Consider a ReLU network of depth $d$, input dimension $n_{\mathrm{in}}$, output dimension $n_{\mathrm{out}}$, and hidden layer widths $n_1, \ldots, n_{d-1}$, with weights given by standard He normal initialization [17]. The expected length distortion is upper bounded by $\sqrt{n_{\mathrm{out}}/n_{\mathrm{in}}}$. More precisely:
$\mathbb{E}\big[\operatorname{len}(f(M))\big] \;\le\; C\,\sqrt{n_{\mathrm{out}}/n_{\mathrm{in}}},$
where $C < 1$ is a constant close to $1$ depending only on the output dimension $n_{\mathrm{out}}$.
For a formal statement (including the exact values of constants) and proof, see Thm. 5.1 and App. C.

Our experiments align with the theorem. In Figure 1, we compute the empirical mean of the length distortion for randomly initialized deep ReLU networks of various architectures with fixed input and output dimension. The figure confirms three of our key predictions: (1) the expected length distortion does not grow with depth, and actually decreases slightly; (2) the decrease happens faster for narrower networks (since the sum of reciprocal widths $\sum_i 1/n_i$ is larger for smaller $n_i$); and (3) for equal input and output dimension $n_{\mathrm{in}} = n_{\mathrm{out}}$, there is an upper bound of 1 (in fact, the tighter upper bound of $C\sqrt{n_{\mathrm{out}}/n_{\mathrm{in}}} = C$ is approximately 0.9515 for the output dimension used here, as shown in the figure). The figure also shows the prior bounds in [23] to be quite loose. For further experimental details, see Appendix B.
In Fig. 2, we instead fix the set of hidden layer widths, but vary input and output dimensions. The results confirm that indeed the expected length distortion grows as the square root of the ratio of output to input dimension.
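The value $\approx 0.9515$ quoted above is consistent with reading the constant as the normalized mean of a chi distribution with $n_{\mathrm{out}}$ degrees of freedom, $\sqrt{2/n_{\mathrm{out}}}\,\Gamma((n_{\mathrm{out}}+1)/2)/\Gamma(n_{\mathrm{out}}/2)$, evaluated at $n_{\mathrm{out}} = 5$; this closed form is our reading of the constant, not a quotation from the theorem. A quick numerical check:

```python
# Normalized chi mean sqrt(2/n) * Gamma((n+1)/2) / Gamma(n/2): it is < 1,
# tends to 1 as n grows, and is ~0.9515 at n = 5 (matching the value quoted
# for the figure). Treating this as the theorem's constant is an assumption.
from scipy.special import gammaln
import numpy as np

def C(n):
    return np.sqrt(2.0 / n) * np.exp(gammaln((n + 1) / 2) - gammaln(n / 2))

for n in (1, 2, 5, 10, 100):
    print(n, round(C(n), 4))
# 1 0.7979, 2 0.8862, 5 0.9515, 10 0.9753, 100 0.9975
```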
Note that our theorem also applies for initializations other than He normal; the result in such cases is simply less interesting. Suppose we initialize the weights in each layer as i.i.d. normal with variance $2\sigma^2/\text{fan-in}$ instead of $2/\text{fan-in}$. This is equivalent to taking a He normal initialization and multiplying the weights by $\sigma$. Multiplying the weights and biases in a depth-$d$ ReLU network by $\sigma$ simply multiplies the output (and hence the length distortion) by $\sigma^d$, so the expected distortion is $\sigma^d$ times the value given in Thm. 3.1. This emphasizes why an exponentially large distortion necessarily occurs if one sets the weights too large.
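The scaling argument can be checked directly. With zero biases, a ReLU network is positively homogeneous in its weights, so multiplying every weight matrix by a constant multiplies the output, and hence the length distortion, by that constant raised to the number of affine layers. A toy verification (illustrative only; architecture and $\lambda$ are arbitrary):

```python
# With zero biases, scaling every weight matrix by lam multiplies the output
# exactly by lam ** L (L = number of affine layers), since ReLU(lam * z) = lam * ReLU(z).
import numpy as np

rng = np.random.default_rng(1)
widths = [4, 50, 50, 50, 3]                  # L = 4 affine layers
Ws = [rng.normal(0, np.sqrt(2.0 / m), (n, m)) for m, n in zip(widths[:-1], widths[1:])]

def net(Ws, x):
    for W in Ws[:-1]:
        x = np.maximum(W @ x, 0.0)           # ReLU is positively homogeneous
    return Ws[-1] @ x

x, lam, L = rng.normal(size=4), 1.3, len(Ws)
print(np.allclose(net([lam * W for W in Ws], x), lam ** L * net(Ws, x)))  # True
```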
3.4 Intuitive explanation
The purpose of this section is to give an intuitive but technical explanation for why, in Theorem 3.1, we find that the distortion of the length of a curve under the neural network is typically not much larger than $\sqrt{n_{\mathrm{out}}/n_{\mathrm{in}}}$, even for deep networks.
Our starting point is that, near a given point $x_0$ on $M$, the length of the image under $f$ of a small portion of $M$ with length $\delta$ is approximately $\lVert J_{x_0} u\rVert\,\delta$, where $J_{x_0}$ is the input-output Jacobian of $f$ at the input $x_0$ and $u$ is the unit tangent vector to $M$ at $x_0$. In fact, in Lemma C.1 of the Supp. Material, we use a simple argument to give upper bounds on the moments of $\operatorname{len}(f(M))$ in terms of the moments of the norm of the Jacobian-vector product $J_{x_0} u$.

Thus, upper bounds on the length distortion reduce to studying the Jacobian $J_{x_0}$, which in a network with $d-1$ hidden layers can be written as a product of the layer-to-layer Jacobians $J^{(1)}, \ldots, J^{(d)}$. In a worst-case analysis, the left singular vectors of $J^{(i)}$ would align with the right singular vectors of $J^{(i+1)}$, causing the resulting product to have a largest singular value that grows exponentially with $d$. However, at initialization, this is provably not the case with high probability. Indeed, the layer-to-layer Jacobians are independent (see Lemma C.3) and their singular vectors are therefore incoherent. We find in particular (Prop. C.2) the following equality in distribution:
$\lVert J_{x_0} u \rVert \;\overset{d}{=}\; \prod_{i=1}^{d} \lVert J^{(i)} e_1 \rVert,$
where $e_1$ is the first standard unit vector (of the appropriate dimension) and the terms in the product are independent. On average each term in this product is close to $1$:
$\mathbb{E}\big[\lVert J^{(i)} e_1 \rVert^{2}\big] \approx 1 \quad\text{(when consecutive layer widths are comparable).}$
This is a consequence of initializing weights to have variance $2/\text{fan-in}$. Put together, the two preceding displayed equations suggest writing
$\lVert J_{x_0} u \rVert \;=\; \exp\Big( \tfrac{1}{2} \sum_{i=1}^{d} \log \lVert J^{(i)} e_1 \rVert^{2} \Big).$
The terms being summed in the exponent are independent, and the argument of each logarithm scales like $1$. With high probability, each term $\log \lVert J^{(i)} e_1 \rVert^{2}$ is of size $O(1/\sqrt{n_i})$, with a mean of order $1/n_i$, where the implied constants depend only on the desired probability. Thus, all together,
$\lVert J_{x_0} u \rVert \;=\; \exp\Big( O\Big( \sum_{i} \tfrac{1}{n_i} + \big(\textstyle\sum_{i} \tfrac{1}{n_i}\big)^{1/2} \Big) \Big)$
with high probability. In particular, in the simple case where the widths $n_i$ are all proportional to a single large constant $n$, we find the typical size of $\lVert J_{x_0} u \rVert$ is exponential in $d/n$ rather than exponential in $d$, as in the worst case. If $\mathcal{N}$ is wider than it is deep, so that $d/n$ is bounded above, the typical size of $\lVert J_{x_0} u \rVert$, and hence of $\operatorname{len}(f(M))$, remains bounded. Theorem 5.1 makes this argument precise. Moreover, let us note that articles like [7, 10, 21] show low distortion of angles and volumes in very wide networks. Our results, however, apply directly to more realistic network widths.
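The heuristic above can also be simulated directly by sampling the per-layer factors rather than instantiating a network. The sketch below (our illustration) assumes weight variance $2\sigma^2/\text{fan-in}$ in every layer, including the output layer, so that each hidden factor is a scaled chi-squared variable with a Binomial$(n_i, 1/2)$ number of degrees of freedom and the final factor has $n_{\mathrm{out}}$ degrees of freedom; at $\sigma = 1$ the typical size of the product stays bounded as the number of hidden layers grows, while for $\sigma > 1$ it grows exponentially.

```python
# Monte Carlo for the product of per-layer Jacobian factors; sigma = 1 is the
# He initialization, sigma > 1 rescales every weight by sigma. The exact output
# constants depend on how the final layer is scaled; this sketch uses
# 2 * sigma^2 / fan-in everywhere, which is an assumption on our part.
import numpy as np

def sample_jvp_norm(widths, sigma, rng, n_samples=100_000):
    """widths = [n_in, hidden widths ..., n_out]; samples of ||J u||."""
    sq = np.ones(n_samples)
    for fan_in, fan_out in zip(widths[:-2], widths[1:-1]):      # hidden layers
        dof = rng.binomial(fan_out, 0.5, size=n_samples)
        chi = np.where(dof > 0, rng.chisquare(np.clip(dof, 1, None)), 0.0)
        sq *= (2 * sigma**2 / fan_in) * chi
    sq *= (2 * sigma**2 / widths[-2]) * rng.chisquare(widths[-1], size=n_samples)
    return np.sqrt(sq)

rng = np.random.default_rng(0)
for n_hidden in (1, 5, 10, 20, 40):
    widths = [10] + [100] * n_hidden + [10]
    for sigma in (1.0, 1.2):
        print(n_hidden, sigma, sample_jvp_norm(widths, sigma, rng).mean())
```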
4 Further Results
4.1 Higher moments
We have seen that the mean length distortion is upper-bounded by 1, and indeed by a function of the architecture that decreases with the depth of the network. However, a small mean is insufficient by itself to ensure that typical networks have low length distortion. For this, we now show that in fact all moments of the length distortion are well-controlled. Specifically, the variance is bounded above by the ratio of output to input dimension, and higher moments are upper-bounded by a function that grows very slowly with depth. Our results may be informally stated thus:
Theorem 4.1 (Length distortion: Higher moments).
Consider, as before, a ReLU network of depth $d$, input dimension $n_{\mathrm{in}}$, output dimension $n_{\mathrm{out}}$, and hidden layer widths $n_1, \ldots, n_{d-1}$, with weights given by He normal initialization. We have the following bounds on higher moments of the length distortion:
$\operatorname{Var}\big[\operatorname{len}(f(M))\big] \;\le\; \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}, \qquad \mathbb{E}\big[\operatorname{len}(f(M))^{k}\big] \;\le\; \Big(\frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}\Big)^{k/2} \exp\Big(c\, k^{2} \sum_{i=1}^{d-1} \frac{1}{n_i}\Big),$
for some universal constant $c$.
We consider the mean and variance of length distortion for a wide range of network architectures in Figure 3. The figure confirms that variance is modest for all architectures and does not increase with depth, and that mean length distortion is consistently decreasing with the sum of layer width reciprocals, as predicted by Theorem 4.1.
4.2 Higher-dimensional volumes
Another natural generalization to consider is how ReLU networks distort higher-dimensional volumes: given a $k$-dimensional input manifold $M$, the output $f(M)$ will in general also be $k$-dimensional, and we can therefore consider the volume distortion $\operatorname{vol}(f(M))/\operatorname{vol}(M)$. The theorems presented in §3.3 for the case $k = 1$ can be extended to all $k$, and the results may be informally stated thus:
Theorem 4.2 (Volume distortion).
Consider a ReLU network of depth $d$, input dimension $n_{\mathrm{in}}$, output dimension $n_{\mathrm{out}}$, and hidden layer widths $n_1, \ldots, n_{d-1}$, with weights given by He normal initialization. Both the squared mean and the variance of the volume distortion are upper-bounded by the explicit, architecture-dependent quantity given in Thm. 5.2 (see (5.5)).
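As an illustration of what volume distortion means operationally, the following sketch (not the paper's code; zero biases, an arbitrary architecture, and an arbitrary flat patch) estimates $\operatorname{vol}(f(M))$ for a $k$-dimensional unit square patch $M$ by integrating the Gram-determinant factor $\sqrt{\det(V^\top J_x^\top J_x V)}$ over the patch, where $V$ is an orthonormal tangent frame of $M$.

```python
# Monte Carlo estimate of vol(f(M)) / vol(M) for a flat k-dimensional unit
# square patch M, using the exact Jacobian of the piecewise-linear network.
import numpy as np

rng = np.random.default_rng(0)
k, n_in, n_out, width, depth = 2, 10, 10, 100, 8
widths = [n_in] + [width] * (depth - 1) + [n_out]
Ws = [rng.normal(0, np.sqrt(2.0 / m), (n, m)) for m, n in zip(widths[:-1], widths[1:])]

V, _ = np.linalg.qr(rng.normal(size=(n_in, k)))      # orthonormal tangent frame of M
x0 = rng.normal(size=n_in)

def jacobian(Ws, x):
    """Input-output Jacobian of the ReLU net at x (defined almost everywhere)."""
    J = np.eye(len(x))
    for W in Ws[:-1]:
        x = W @ x
        J = (x > 0)[:, None] * (W @ J)               # ReLU mask times layer Jacobian
        x = np.maximum(x, 0.0)
    return Ws[-1] @ J

vols = []
for _ in range(2000):                                 # sample points of the unit patch
    u = rng.uniform(0, 1, size=k)
    JV = jacobian(Ws, x0 + V @ u) @ V
    vols.append(np.sqrt(np.linalg.det(JV.T @ JV)))
print(np.mean(vols))                                  # ~ vol(f(M)), since vol(M) = 1
```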
5 Formal Statements
In this section, we provide formal statements of our results. Full proofs are given in the Supp. Material.
5.1 One-dimensional manifolds
Fix a smooth one-dimensional manifold $M \subseteq \mathbb{R}^{n_{\mathrm{in}}}$ (i.e., a curve) with unit length. Each point on $M$ represents a possible input to our ReLU network $\mathcal{N}$, which we recall has input dimension $n_{\mathrm{in}}$, output dimension $n_{\mathrm{out}}$, and hidden layer widths $n_1, \ldots, n_{d-1}$. The function $f$ computed by $\mathcal{N}$ is continuous and piecewise linear. Thus, the image $f(M)$ of $M$ under this map is a piecewise smooth curve in $\mathbb{R}^{n_{\mathrm{out}}}$ with a finite number of pieces, and its length $\operatorname{len}(f(M))$ is a random variable. Our first result states that, provided $\mathcal{N}$ is wider than it is deep, this distortion is small with high probability at initialization.
Theorem 5.1.
Let $\mathcal{N}$ be a fully connected ReLU network with input dimension $n_{\mathrm{in}}$, output dimension $n_{\mathrm{out}}$, hidden layer widths $n_1, \ldots, n_{d-1}$, and independent centered Gaussian weights/biases with the weights having variance $2/\text{fan-in}$ (as in (3.1)). Let $M$ be a $1$-dimensional curve of unit length in $\mathbb{R}^{n_{\mathrm{in}}}$. Then, the mean length $\mathbb{E}[\operatorname{len}(f(M))]$ equals
(5.1) |
where the implied constants in the $O(\cdot)$ terms are universal and $\Gamma$ denotes the Gamma function. Moreover:
$\operatorname{Var}\big[\operatorname{len}(f(M))\big] \;\le\; \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$   (5.2)
Finally, there exist universal constants $c_1, c_2 > 0$ such that if $k \le c_1 \min_i n_i$, then
$\mathbb{E}\big[\operatorname{len}(f(M))^{k}\big] \;\le\; \Big(\frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}\Big)^{k/2} \exp\Big(c_2\, k^{2} \sum_{i=1}^{d-1} \frac{1}{n_i}\Big).$   (5.3)
Theorem 5.1 is proved in §C. Several comments are in order. First, we have assumed for simplicity that $M$ has unit length. For curves of arbitrary length, the equality (5.1) and the bounds (5.2) and (5.3) all hold and are derived in the same way, provided $\operatorname{len}(f(M))$ is replaced by the distortion per unit length, $\operatorname{len}(f(M))/\operatorname{len}(M)$.
Second, in the expression (5.1), note that the exponent tends to $0$ as the hidden layer widths $n_i$ tend to infinity for any fixed depth $d$. More precisely, when the widths are all equal to a large constant $n$, it scales as a negative constant times $d/n$. This shows that, for fixed widths, the mean is actually decreasing with $d$. This somewhat surprising phenomenon is borne out in Fig. 1 and is a consequence of the fact that wide fully connected ReLU nets are strongly contracting (see e.g. Thms. 3 and 4 in [10]).
Third, the pre-factor $\sqrt{n_{\mathrm{out}}/n_{\mathrm{in}}}$ encodes the fact that, on average over initialization, for any vector of inputs $x$, we have (see e.g. Cor. 1 in [14])
$\mathbb{E}\big[\lVert f(x)\rVert^{2}\big] \;\approx\; \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}\, \lVert x\rVert^{2},$   (5.4)
where the approximate equality becomes exact if the bias variance is set to $0$ (see (3.1)). In other words, at the start of training, the network rescales inputs by an average factor of $\sqrt{n_{\mathrm{out}}/n_{\mathrm{in}}}$. This overall scaling factor also dilates $f(M)$ relative to $M$.
5.2 Higher-dimensional manifolds
The results in the previous section generalize to the case when $M$ has higher dimension. To consider this case, fix a positive integer $k$ and a smooth $k$-dimensional manifold $M \subseteq \mathbb{R}^{n_{\mathrm{in}}}$ of unit volume. Note that if $k > n_{\mathrm{out}}$, then $f(M)$ is at most $n_{\mathrm{out}}$-dimensional and its $k$-dimensional volume vanishes. Its image $f(M)$ is a piecewise smooth manifold of dimension at most $k$. The following result, proved in §D, gives an upper bound on the typical value of the $k$-dimensional volume of $f(M)$.
Theorem 5.2.
Let $\mathcal{N}$ be a fully connected ReLU network with input dimension $n_{\mathrm{in}}$, output dimension $n_{\mathrm{out}}$, hidden layer widths $n_1, \ldots, n_{d-1}$, and independent centered Gaussian weights/biases with the variance of the weights given by $2/\text{fan-in}$ (see (3.1)). Let $M$ be a $k$-dimensional smooth submanifold of $\mathbb{R}^{n_{\mathrm{in}}}$ with unit volume, where $k \le n_{\mathrm{out}}$. Then, both the squared mean and the variance of the $k$-dimensional volume of $f(M)$ are bounded above by
(5.5) |
6 Proof Sketch
The purpose of this section is to explain the main steps to obtaining the mean and variance estimates (5.1) and (5.2) from Theorem 5.1. In several places, we will gloss over some mathematical subtleties that we deal with in the detailed proof of Theorem 5.1 given in Appendix C. We will also content ourselves here with proving a slightly weaker estimate than (5.1) and show instead simply that
$\mathbb{E}\big[\operatorname{len}(f(M))\big] \;\le\; \Big(\frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}\Big)^{1/2} \qquad\text{and}\qquad \operatorname{Var}\big[\operatorname{len}(f(M))\big] \;\le\; \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$   (6.1)
We refer the reader to Appendix C for a derivation of the more refined result stated in Theorem 5.1. Since we have $\mathbb{E}[\operatorname{len}(f(M))] \le \big(\mathbb{E}[\operatorname{len}(f(M))^{2}]\big)^{1/2}$ and $\operatorname{Var}[\operatorname{len}(f(M))] \le \mathbb{E}[\operatorname{len}(f(M))^{2}]$, both estimates in (6.1) follow from
$\mathbb{E}\big[\operatorname{len}(f(M))^{2}\big] \;\le\; \frac{n_{\mathrm{out}}}{n_{\mathrm{in}}}.$   (6.2)
To obtain this bound, we begin by relating moments of $\operatorname{len}(f(M))$ to those of the input-output Jacobian of $f$ at a single point, for which we need some notation. Namely, let us choose a unit speed parameterization of $M$; that is, fix a smooth mapping
$x : [0, 1] \to \mathbb{R}^{n_{\mathrm{in}}}$
for which $M$ is the image under $x$ of the interval $[0, 1]$. That $x$ has unit speed means that for every $t \in [0, 1]$ we have $\lVert x'(t) \rVert = 1$, where $x'(t) = \frac{d}{dt} x(t)$. Then, the mapping $t \mapsto f(x(t))$ (for $t \in [0, 1]$) gives a parameterization of the curve $f(M)$. Note that this parameterization is not unit speed. Rather, the length of $f(M)$ is given by
$\operatorname{len}(f(M)) \;=\; \int_0^1 \Big\lVert \frac{d}{dt} f(x(t)) \Big\rVert \, dt.$   (6.3)
Intuitively, the integrand computes, at a given $t$, the infinitesimal length of $f(M)$ traced out as $t$ increases by $dt$. The following Lemma uses (6.3) to bound the moments of $\operatorname{len}(f(M))$ in terms of the moments of the norm of the input-output Jacobian $J_x$, defined for any $x \in \mathbb{R}^{n_{\mathrm{in}}}$ by
$(J_x)_{ij} \;=\; \frac{\partial f_i}{\partial x_j}(x),$   (6.4)
where $f_i$ is the $i$-th component of the network output.
Lemma 6.1.
We have
$\mathbb{E}\big[\operatorname{len}(f(M))^{2}\big] \;\le\; \sup_{x \in \mathbb{R}^{n_{\mathrm{in}}},\ \lVert u \rVert = 1} \mathbb{E}\big[\lVert J_x u \rVert^{2}\big].$   (6.5)
Sketch of Proof.
Take squares in (6.3), use Tonelli's theorem to interchange the expectation and the integrals, bound the resulting integrand using the power mean inequality, and apply the chain rule $\frac{d}{dt} f(x(t)) = J_{x(t)}\, x'(t)$; see Lemma C.1 in the Supp. Material for the detailed argument. ∎
Lemma 6.1, while elementary, appears to be new, allowing us to bound global quantities such as length in terms of local quantities such as the moments of $\lVert J_x u \rVert$. In particular, having obtained the expression (6.5), we have reduced bounding the second moment of $\operatorname{len}(f(M))$ to bounding the second moment of the random variable $\lVert J_x u \rVert$ for a fixed $x$ and unit vector $u$. This is a simple matter since in our setting the distribution of weights and biases is Gaussian and hence, in distribution, $\lVert J_x u \rVert^{2}$ is equal to a product of independent random variables:
Proposition 6.2.
For any $x \in \mathbb{R}^{n_{\mathrm{in}}}$ and any unit vector $u \in \mathbb{R}^{n_{\mathrm{in}}}$, the random variable $\lVert J_x u \rVert^{2}$ is equal in distribution to a product of independent scaled chi-squared random variables,
where the number of degrees of freedom in the $i$-th term of the product (for $i$ ranging over the hidden layers) is given by an independent binomial with $n_i$ trials and success probability $1/2$. The number of degrees of freedom in the final term is deterministic and given by $n_{\mathrm{out}}$.
This Proposition has been derived a number of times in the literature (see Theorem 3 in [11] and Fact 7.2 in [1]). We provide an alternative proof, based on Proposition 2 in [12], in the Supplementary Material. Note that the distribution of $\lVert J_x u \rVert$ is the same at every $x$ and $u$; thus, fixing $x$ and a unit vector $u$, the expectation $\mathbb{E}[\lVert J_x u \rVert^{2}]$ factors as the product of the means of the independent terms above.
To prove (6.2), we note that this product of means telescopes across layers, and we apply Lemma 6.1 together with Proposition 6.2 to find the desired bound on $\mathbb{E}[\operatorname{len}(f(M))^{2}]$.
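Proposition 6.2 can be probed numerically: propagate a fixed unit direction through a freshly sampled network alongside the forward pass (a Jacobian-vector product) and compare its norm with samples of the product of scaled chi-squared factors. The sketch below (our illustration) assumes weight variance $2/\text{fan-in}$ in every layer, including the output layer, and a small positive bias variance; the architecture, bias scale, and sample counts are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [8, 64, 64, 64, 8]                  # [n_in, hidden widths ..., n_out]
x = rng.normal(size=widths[0])
u = rng.normal(size=widths[0]); u /= np.linalg.norm(u)

def jvp_norm(widths, x, u, rng, bias_std=0.3):
    """||J_x u|| for one freshly sampled network (biases affect masks, not the JVP)."""
    v, h = u.copy(), x.copy()
    for m, n in zip(widths[:-2], widths[1:-1]):          # hidden layers
        W = rng.normal(0, np.sqrt(2.0 / m), (n, m))
        b = rng.normal(0, bias_std, n)
        h, v = W @ h + b, W @ v
        v = np.where(h > 0, v, 0.0)                      # ReLU mask hits the JVP
        h = np.maximum(h, 0.0)
    W = rng.normal(0, np.sqrt(2.0 / widths[-2]), (widths[-1], widths[-2]))
    return np.linalg.norm(W @ v)

def chi_product_norm(widths, rng):
    """One sample from the product of independent scaled chi-squared factors."""
    sq = 1.0
    for m, n in zip(widths[:-2], widths[1:-1]):
        dof = rng.binomial(n, 0.5)
        sq *= (2.0 / m) * (rng.chisquare(dof) if dof > 0 else 0.0)
    sq *= (2.0 / widths[-2]) * rng.chisquare(widths[-1])
    return np.sqrt(sq)

a = [jvp_norm(widths, x, u, rng) for _ in range(20000)]
b = [chi_product_norm(widths, rng) for _ in range(20000)]
print(np.mean(a), np.mean(b))                # means should agree up to Monte Carlo error
print(np.quantile(a, 0.9), np.quantile(b, 0.9))
```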
7 Conclusion
We show that deep ReLU networks with appropriate initialization do not appreciably distort lengths and volumes, contrary to prior assumptions of exponential growth. Specifically, we provide an exact expression for the mean length distortion, which is bounded above by 1 and decreases slightly with increasing depth. We also prove that higher moments of the length distortion admit well-controlled upper bounds, and generalize this to distortion of higher dimensional volumes. We show empirically that our theoretical results closely match observations, unlike previous loose lower bounds.
There remain several promising directions for future work. First, we have proved statements for networks at initialization, and our results do not necessarily hold after training; analyzing such behavior formally would require consideration of the loss function, optimizer, and data distribution. Second, our results, as stated, do not apply to convolutional, residual, or recurrent networks. We envision generalizations for networks with skip connections being straightforward, while formulations for convolutional and recurrent networks will likely be more complicated. In Appendix A, we provide preliminary results suggesting that expected length distortion decreases modestly with depth in both convolutional networks and those with skip connections. Third, we believe that various other measures of complexity for neural networks, such as curvature, likely demonstrate similar behavior to length and volume distortion in average-case scenarios with appropriate initialization. While somewhat parallel results for counting linear regions were shown in [15, 16], there remains much more to be understood for other notions of complexity. Finally, we look forward to work that explicitly ties properties such as low length distortion to improved generalization, as well as new learning algorithms that leverage this line of theory in order to better control inductive biases to fit real-world tasks.
References
- [1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning (ICML), 2019.
- [2] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning (ICML), 2018.
- [3] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Conference on Neural Information Processing Systems (NeurIPS), 2014.
- [4] Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. Preprint arXiv:1706.08498, 2017.
- [5] Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553–1565, 2014.
- [6] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory (COLT), 2016.
- [7] Amit Daniely. Depth separation for neural networks. In Conference on Learning Theory (COLT), 2017.
- [8] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Preprint arXiv:1703.11008, 2017.
- [9] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory (COLT), 2016.
- [10] Raja Giryes, Guillermo Sapiro, and Alex M Bronstein. Deep neural networks with random Gaussian weights: A universal classification strategy? IEEE Transactions on Signal Processing, 64(13):3444–3457, 2016.
- [11] Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? In Conference on Neural Information Processing Systems (NeurIPS), 2018.
- [12] Boris Hanin and Mihai Nica. Products of many large random matrices and gradients in deep neural networks. Communications in Mathematical Physics, pages 1–36, 2019.
- [13] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. In International Conference on Learning Representations (ICLR), 2020.
- [14] Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. In Conference on Neural Information Processing Systems (NeurIPS), 2018.
- [15] Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. In International Conference on Machine Learning (ICML), 2019.
- [16] Boris Hanin and David Rolnick. Deep ReLU networks have surprisingly few activation patterns. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
- [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision (ICCV), 2015.
- [18] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. Preprint arXiv:1912.02178, 2019.
- [19] Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017.
- [20] Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Conference on Neural Information Processing Systems (NeurIPS), 2014.
- [21] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Conference on Neural Information Processing Systems (NeurIPS), 2016.
- [22] Ilan Price and Jared Tanner. Trajectory growth lower bounds for random sparse deep ReLU networks. Preprint arXiv:1911.10651, 2019.
- [23] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In International Conference on Machine Learning (ICML), 2017.
- [24] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. In International Conference on Learning Representations, 2018.
- [25] Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In International Conference on Machine Learning (ICML), 2017.
- [26] Matus Telgarsky. Representation benefits of deep feedforward networks. Preprint arXiv:1509.08101, 2015.
- [27] Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory (COLT), 2016.
Appendix A Experiments for Convolutional and Residual Networks
We here provide preliminary experimental results for the mean length distortion in networks with convolutional layers and skip connections.
In Figure 4(a), we show results for networks with sequential convolutional layers (without pooling layers), initialized as before with weights i.i.d. normal with variance $2/\text{fan-in}$, and a final fully connected layer. We use a fixed input dimension and output dimension 5. Our results indicate that the mean length distortion is, as expected, approximately equal to $\sqrt{n_{\mathrm{out}}/n_{\mathrm{in}}}$, and that it decays slightly with depth.
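A minimal PyTorch sketch of such an experiment (our illustration, not the paper's experimental code; the channel counts, kernel size, input length, and the treatment of the final linear layer are arbitrary choices) is:

```python
# Length distortion of a unit-length segment mapped through a toy 1-D
# convolutional ReLU network with He-normal ("kaiming") initialization.
import torch
import torch.nn as nn

torch.manual_seed(0)
depth, channels, kernel, length_in, n_out = 6, 16, 3, 32, 5

layers, in_ch = [], 1
for _ in range(depth):
    conv = nn.Conv1d(in_ch, channels, kernel, padding=1, bias=False)
    nn.init.kaiming_normal_(conv.weight, nonlinearity='relu')   # variance 2 / fan-in
    layers += [conv, nn.ReLU()]
    in_ch = channels
final = nn.Linear(channels * length_in, n_out, bias=False)
nn.init.kaiming_normal_(final.weight, nonlinearity='relu')
net = nn.Sequential(*layers, nn.Flatten(), final)

with torch.no_grad():
    x0 = torch.randn(length_in)
    direction = torch.randn(length_in)
    direction /= direction.norm()                               # unit-length segment
    ts = torch.linspace(0.0, 1.0, 2001).unsqueeze(1)
    curve_in = (x0 + ts * direction).unsqueeze(1)               # (T, 1, length_in)
    curve_out = net(curve_in)
    print(curve_out.diff(dim=0).norm(dim=1).sum().item())       # ~ len(f(M)); len(M) = 1
```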
In Figure 4(b), we consider residual networks. Here, the overall network is defined in terms of residual modules, each combined with the identity according to fixed scaling constants.
We set all residual modules to be two-layer, fully connected ReLU networks. In keeping with [14], we initialize all weights i.i.d. normal with variance $2/\text{fan-in}$, and fix the scaling constants of the residual modules. Our results suggest that mean length distortion again decays modestly overall, with a somewhat sharper decrease for small depths.

Appendix B Experimental Details
For all experiments, weights were initialized from i.i.d. normal distributions with variance $2/\text{fan-in}$ and biases from a fixed-variance Gaussian. We run several experiments that involve computing the length distortion of a given line segment in $\mathbb{R}^{n_{\mathrm{in}}}$. We remark on how this was done; we rely on the notion of linear regions and bent hyperplanes associated with ReLU networks, explored in [16, 26]. Specifically, ReLU networks partition input space into a collection of convex polytopes, which we call linear regions, as (generically) distinct linear functions are defined on each region, corresponding to a different subset of the hidden neurons being activated (where a neuron is activated if its pre-activation is nonnegative). The boundaries of these linear regions are given by sets of points for which a particular neuron has pre-activation equal to $0$.
Say we are given a line segment with endpoints $p, q \in \mathbb{R}^{n_{\mathrm{in}}}$, parametrized by unit speed, for which we aim to compute the length distortion; write $x_t$ for the point with parameter $t$ and $g(t) = f(x_t)$ for its image under the network.
We first approximate the intersections of the line segment with linear region boundaries using a binary search subroutine. Specifically, we initialize a set $T$ of parameter values, which will contain the approximations to these intersections. For each iteration of the subroutine, we run the following procedure: for any three consecutive points $t_1 < t_2 < t_3$ of $T$, consider the difference quotients $\frac{g(t_2) - g(t_1)}{t_2 - t_1}$ and $\frac{g(t_3) - g(t_2)}{t_3 - t_2}$. These values are equal if the three points are in the same linear region, in which case we eliminate $t_1$ and $t_2$ from $T$. If they are not equal, then there exists a linear region boundary between $t_1$ and $t_3$; we add the midpoints of $[t_1, t_2]$ and $[t_2, t_3]$ to $T$. We iterate this procedure (15 times in our implementation); for the final iteration, we ensure that only the last point associated with a given linear region is in $T$.
Now take the set $T$ returned by the binary search procedure; we proceed to compute the parameter values denoting the exact intersections of the line segment with region boundaries, which we shall store in $T^{*}$. For consecutive points of $T$, we determine the set of activated neurons at both points and find the neuron at which these sets differ; we solve a linear equation to determine the parameter value at which this neuron's pre-activation equals $0$, and replace the approximate value with this exact value in $T^{*}$. Finally, we append the two endpoint parameters to the respective ends of $T^{*}$.
Computing the length distortion then reduces to summing the lengths of the output segments corresponding to consecutive pairs in $T^{*}$. Namely, for each such pair $t < t'$, the network is given by a single linear function on the segment of inputs between $x_t$ and $x_{t'}$. We calculate the weight matrix of this linear function; the length of the corresponding output segment is the norm of the product of this matrix with the vector $x_{t'} - x_t$. The total length distortion is then given by the sum of these segment lengths.
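A simplified version of this procedure is sketched below (our illustration, not the released experimental code): activation-pattern changes along the segment are localized by a coarse scan followed by bisection to a numerical tolerance, in place of the exact linear solve, and the output length is then the sum of chord lengths between consecutive breakpoints, using that $f$ is linear on each piece.

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [5, 40, 40, 40, 5]
Ws = [rng.normal(0, np.sqrt(2.0 / m), (n, m)) for m, n in zip(widths[:-1], widths[1:])]
bs = [rng.normal(0, 0.3, n) for n in widths[1:]]

def forward(x, with_pattern=False):
    pattern = []
    for W, b in zip(Ws[:-1], bs[:-1]):
        x = W @ x + b
        pattern.append(x > 0)
        x = np.maximum(x, 0.0)
    y = Ws[-1] @ x + bs[-1]
    return (y, np.concatenate(pattern)) if with_pattern else y

p, q = rng.normal(size=5), rng.normal(size=5)
q = p + (q - p) / np.linalg.norm(q - p)                 # unit-length segment M
x_of = lambda t: (1 - t) * p + t * q
pat = lambda t: forward(x_of(t), with_pattern=True)[1]

# (i) coarse scan, then bisect every interval whose endpoints differ in pattern
ts = list(np.linspace(0.0, 1.0, 513))
breaks = [0.0]
for a, b in zip(ts[:-1], ts[1:]):
    if not np.array_equal(pat(a), pat(b)):
        lo, hi = a, b
        for _ in range(40):                             # bisection to high precision
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if np.array_equal(pat(lo), pat(mid)) else (lo, mid)
        breaks.append(hi)
breaks.append(1.0)

# (ii) f is linear between consecutive breakpoints, so chord lengths suffice
length = sum(np.linalg.norm(forward(x_of(t1)) - forward(x_of(t0)))
             for t0, t1 in zip(breaks[:-1], breaks[1:]))
print(length)                                           # ~ len(f(M)); len(M) = 1
```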
Appendix C Proof of Theorem 5.1
Our first step in proving Theorem 5.1 is to obtain bounds on the moments of $\operatorname{len}(f(M))$ in terms of the input-output Jacobian of $f$ at a single point, which we recall was defined in (6.4). To accomplish this, we recall the notation from §6. Namely, fix a smooth unit speed parameterization $x : [0, 1] \to \mathbb{R}^{n_{\mathrm{in}}}$ of $M$.
The mapping $t \mapsto f(x(t))$
gives a parameterization of the curve $f(M)$, and we have
$\operatorname{len}(f(M)) \;=\; \int_0^1 \Big\lVert \frac{d}{dt} f(x(t)) \Big\rVert \, dt.$   (C.1)
Let us note an important but ultimately benign technicality. The Jacobian of the map $f$ is not defined at every input (namely, at those inputs where some neuron turns from on to off). Thus, a priori, the integrand in (C.1) exists only at those $t$ for which the derivative $\frac{d}{dt} f(x(t))$ exists. However, the formula (C.1) is still valid. Indeed, for any setting of the weights and biases of $\mathcal{N}$, the map $f$ is Lipschitz. Thus, by Rademacher's theorem, $\frac{d}{dt} f(x(t))$ exists for almost every $t$, and the length of the curve $f(M)$ is given by integrating the norm of this almost-everywhere defined derivative. The following simple Lemma is a generalization of Lemma 6.1 and allows us to bound all moments of $\operatorname{len}(f(M))$ in terms of the moments of the norm of the Jacobian-vector product $J_x u$.
Lemma C.1.
For any integer $k \ge 1$, we have
$\mathbb{E}\big[\operatorname{len}(f(M))^{k}\big] \;\le\; \sup_{x \in \mathbb{R}^{n_{\mathrm{in}}},\ \lVert u \rVert = 1} \mathbb{E}\big[\lVert J_x u \rVert^{k}\big].$   (C.2)
Proof.
Taking powers in (C.1) and using Tonelli’s theorem to interchange the expectation and integrals, we obtain
(C.3) |
To proceed, let us recall a special case of the power mean inequality which says that for any we have
Applying this to the integrand in (C.3), we conclude
(C.4) |
Next, fix $t \in [0, 1]$. For any neuron in $\mathcal{N}$, denote by $z(x)$ its pre-activation at input $x$. Note that our bias variance is set to some fixed positive constant, and hence the pre-activation of each neuron has a continuous density. Therefore, with probability $1$ over the randomness in the weights and biases of $\mathcal{N}$, there exists a neighborhood of $x(t)$ on which $z(\cdot)$ has constant sign for every neuron. The Jacobian $J_{x(t)}$ is therefore well-defined, and the chain rule yields
(C.5) |
Substituting this into (C.4) completes the proof. ∎
Having obtained the expression (C.2), we have reduced bounding the moments of to bounding the moments of the random variable for a fixed and unit vector . Prior work (e.g. Thm. 1 in [12]) shows that these moments satisfy
provided
(C.6) |
Substituting these moment estimates in (C.2) completes the derivation of (5.3). However, the results in [12] are subtle because they apply to any distribution of weights and biases. They also give the slightly sub-optimal restriction (C.6) that must be smaller than a constant times the minimum of the ’s. In the special case where the distribution of weights and biases is Gaussian, we can do slightly better by computing the moments of more directly (note that in the statement of Theorem 5.1, we required only that is smaller than a constant times the minimum of the ’s). We will need this anyway to derive the slightly more refined estimates in (5.1) and (5.2). Let us therefore check that, in distribution, is equal to a product of independent random variables:
Proposition C.2.
For any and any unit vector , the random variable is equal in distribution to a product of independent scaled chi-squared random variables
where the number of degrees of freedom in the th term of the product (for ) is given by an independent binomial
with trials and success probability . The number of degrees of freedom in the final term is deterministic and given by .
Proof.
Consider a ReLU network with input dimension , output dimension , and hidden layer widths . Suppose the weight matrices and bias vectors are independent with i.i.d. Gaussian components:
where is any fixed constant. For a fixed network input , with probability , the input-output Jacobian is well-defined. Moreover, it can be written as
where is the matrix of weights from layer to layer and is an diagonal matrix:
whose diagonal entries are or depending on whether the pre-activation of neuron in layer is positive at our fixed input Next, according to Proposition in [12], the marginal distribution of each is that its diagonal entries are independent random variables. Moreover, we have the following equality in distribution:
where are independent of each other (and of ) resampled from the marginal distribution of and is an independent diagonal matrix with independent diagonal entries that are with equal probability. In particular, for a fixed unit vector , we have
We may rewrite this as
(C.7) |
for , where we interpret the expression (C.7) as equal to if is the zero matrix. To complete the proof, we need the following standard observation.
Lemma C.3.
Suppose is an matrix with i.i.d. Gaussian entries and is a random unit vector in that is independent of but otherwise has any distribution. Then is independent of and is equal in distribution to where is any fixed unit vector in .
Proof.
For any fixed orthogonal matrix , it is a standard fact that is equal in distribution to . Thus, for any measurable sets and , since are independent we have
where is the distribution of Fix any and let be any orthogonal matrix so that with is the first standard unit vector. Then, since is equal in distribution to , we have
which is independent of . We therefore find
as desired. ∎
We are now in a position to complete the proof of Proposition C.2. Combining Lemma C.3 with (C.7), we find that, in distribution equals
(C.8) |
Note that these two terms are independent. Repeating this argument, we obtain that is equal in distribution to the product
(C.9) |
The terms in this product are independent. To complete the proof note that for the number of non-zero entries in is a binomial random variable with trials, each with probability of success and that
is precisely times a sum of squares of independent standard Gaussians. Thus, for ,
(C.10) |
Similarly
(C.11) |
Substituting (C.10) and (C.11) into (C.9) completes the proof. ∎
Evaluating the moments of is now a matter of computing the moments of some scaled chi-squared random variables with a random number of degrees of freedom. For instance, recalling that
and applying Lemma C.2 as well as the tower property of the expectation we find
Substituting this into (C.2) with yields
yielding (5.2). To prove (5.1), we need to estimate . By taking expectations in (C.1) and using (C.5), we find
where we’ve used that by Lemma C.2, the distribution of is the same for every and every unit vector . Moreover, again using Lemma C.2, we see that equals
To simplify this, a direct computation shows that
where is the Gamma function. Next, for any positive random variable with we have
Recall that
Using this and the law of total variance yields
Thus, we find that equals
as claimed. Finally, let us check the higher moment estimates (5.3). By Lemma C.2 we have the following estimate
where we used that For any , we have
Therefore, is bounded above by
(C.12) |
Finally, for any fixed positive integer we may write
where are independent random variables and are independent standard Gaussians. A direct computation shows that the centered random variables are sub-exponential with some universal parameters . Thus, by the stability of sub-exponential random variables under summation we have
Again using this property we conclude
where
Therefore, by (C.12), we find that
where . By definition of sub-exponential random variables, we have
provided for some universal constant . This completes the derivation of (5.3) and completes the proof of Theorem 5.1.
Appendix D Proof of Theorem 5.2
The proof of Theorem 5.2 follows closely the argument for Theorem 5.1. For a given input , we will continue to write
for the input-output Jacobian of the network function, which exists for Lebesgue almost every . We will write for the orthogonal projection onto this tangent space. We have
where is the dimensional Hausdorff measure. Indeed, the integrand measures, at each , the volume of the image of an infinitesimal cube on of volume centered at under the map . Arguing precisely as in the proof of Lemma C.1, we find that for any integer
(D.1) |
Fix and denote by
an orthonormal basis of the tangent space of . Then, by the Gram identity, we may write
(D.2) |
where we recall that the wedge product is the anti-symmetrization of the tensor product. Just as in the proof of Theorem 5.1, the key observation is that for Gaussian weights we have that is a product of i.i.d. random variables. The formal statement is the following:
Proposition D.1.
For any and collection of orthonormal unit vectors , the random variable
is equal in distribution to a product of independent scaled chi-squared random variables:
where all the terms in the product are independent and for we’ve written for independent Binomial random variables:
with trials and success probability .
Proof.
The proof is identical to that of Lemma C.2. The only difference is that we must invoke the following amplification of Lemma C.3:
Lemma D.2.
Let be a collection of orthonormal vectors (i.e. a collection) with any distribution. Let be an independent matrix with i.i.d. Gaussian entries. Then
is independent of and is equal in distribution to where is any fixed collection of orthonormal vectors.
Proof.
The proof is identical to that of Lemma C.3, except that we note that for that, given , there exists an orthogonal matrix so that
where are the standard basis vectors. ∎
∎
With Lemma D.1 in hand, we complete the proof of Theorem 5.2 as follows. First, note that for any random variable we have
Hence, Theorem 5.2 will follow once we show that
Combining Proposition D.1 with (D.1) and (D.2), this estimate follows by showing that
(D.3) |
To check this, recall that
Hence,
where in the last line we used that . This completes the proof.