
Deep ReLU Networks Preserve Expected Length

Boris Hanin
Princeton University
[email protected]
Ryan Jeong*
University of Pennsylvania
[email protected]

David Rolnick*
McGill University & Mila
[email protected]

*Equal contribution
Abstract

Assessing the complexity of functions computed by a neural network helps us understand how the network will learn and generalize. One natural measure of complexity is how the network distorts length – if the network takes a unit-length curve as input, what is the length of the resulting curve of outputs? It has been widely believed that this length grows exponentially in network depth. We prove that in fact this is not the case: the expected length distortion does not grow with depth, and indeed shrinks slightly, for ReLU networks with standard random initialization. We also generalize this result by proving upper bounds both for higher moments of the length distortion and for the distortion of higher-dimensional volumes. These theoretical results are corroborated by our experiments.

1 Introduction

The utility of deep neural networks ultimately arises from their ability to learn functions that are sufficiently complex to fit training data and yet simple enough to generalize well. Understanding the precise sense in which functions computed by a given network have low or high complexity is therefore important for studying when the network will perform well. Despite the fundamental importance of this question, our mathematical understanding of the functions expressed and learned by different neural network architectures remains limited.

A popular way to measure the complexity of a neural network function is to compute how it distorts lengths. This may be done by considering a set of inputs lying along a curve and measuring the length of the corresponding curve of outputs. It has been claimed in prior literature that in a ReLU network this length distortion grows exponentially with the network’s depth [22, 23], and this has been used as a justification of the power of deeper networks. We prove that, in fact, for networks with the typical initialization used in practice, expected length distortion does not grow at all with depth.

Our main contributions are:

  1. We prove that for ReLU networks initialized with the usual $2/\text{fan-in}$ weight variance, the expected length distortion does not grow with depth at initialization, actually decreasing slightly with depth (Thm. 3.1) and exhibiting an interesting width dependence.

  2. We prove bounds on higher moments of the length distortion, giving upper bounds that hold with high probability (Thm. 4.1). We also obtain similar results for the distortion in the volume of higher-dimensional manifolds of inputs (Thm. 4.2).

  3. We empirically verify that our theoretical results accurately predict observed behavior for networks at initialization, while previous bounds are loose and fail to capture subtle architecture dependencies.

It is worth explaining why our conclusions differ from those of [22, 23]. First, prior authors prove only lower bounds on the expected length distortion, while we use different methodology to calculate tight upper bounds, allowing us to say that the expected length distortion does not grow with depth. As we show in Thm. 5.1 and Fig. 1, the prior bounds are in fact quite loose.

Second, Thm. 3(a) in [23] has a critical typo (we have confirmed this in personal correspondence with the authors, and it is simple to verify: the typo arises in the last step of the proof given in the Supplementary Material of [23]), which has unfortunately been perpetuated in Thm. 1 of [22]. Namely, the “2” was omitted in the following statement (paraphrased from the original):

\mathbb{E}\left[\text{length distortion}\right]=\Omega\left[\left(\frac{\sigma_{w}\sqrt{\text{width}}}{2\sqrt{\text{width}+1}}\right)^{\text{depth}}\right],

where $\sigma_{w}^{2}/\text{width}$ is the variance of the weight distribution. As we prove in Thm. 3.1, leaving out the 2 makes the statement false, incorrectly suggesting that length distortion explodes with depth for the standard He initialization $\sigma_{w}=\sqrt{2}$.
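For concreteness, here is the arithmetic at hidden width $100$ under He initialization (a worked example, not taken from [22, 23]):

\frac{\sigma_{w}\sqrt{\text{width}}}{2\sqrt{\text{width}+1}}=\frac{\sqrt{2}\cdot\sqrt{100}}{2\sqrt{101}}\approx 0.70<1,\qquad\text{whereas omitting the 2 gives}\qquad\frac{\sqrt{2}\cdot\sqrt{100}}{\sqrt{101}}\approx 1.41>1.

With the 2 in place, the lower bound decays geometrically in depth, consistent with Thm. 3.1; without it, the bound appears to grow exponentially.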

Finally, prior authors drew the conclusion that length distortion grows exponentially with depth by considering the behavior of ReLU networks with unrealistically large weights. If one multiplies the weights and biases of a ReLU network by $C$, one multiplies the length distortion by $C^{\text{depth}}$, so it should come as no surprise that there exist settings of the weights for which the distortion grows exponentially with depth (or for which it decays exponentially). The value of our results comes in analyzing the behavior specifically at He initialization ($\sigma_{w}=\sqrt{2}$) [17]. This initialization is the one used in practice for ReLU networks, since this is the weight variance that must be used if the outputs [14] and gradients [11] are to remain well-controlled. In the present work, we show that this is also the correct initialization for the expected length distortion to remain well-behaved.

2 Related Work

A range of complexity measures for functions computed by deep neural networks have been considered in the literature, dividing the prior work into at least three categories. In the first, the emphasis is on worst-case (or best-case) scenarios – what is the maximal possible complexity of functions computed by a given network architecture. These works essentially study the expressive power of neural networks and often focus on showing that deep networks are able to express functions that cannot be expressed by shallower networks. For example, it has been shown that it is possible to set the weights of a deep ReLU network such that the number of linear regions computed by the network grows exponentially in the depth [7, 9, 20, 26, 27]. Other works consider the degree of polynomials approximable by networks of different depths [19, 24] and the topological invariants of networks [5].

While such work has sometimes been used to explain the utility of different neural network architectures (especially deeper ones), a second strand of prior work has shown that a significant gap can exist between the functions expressible by a given architecture and those which may be learned in practice. Such average-case analyses have indicated that some readily expressible functions are provably difficult to learn [25] or vanishingly unlikely for random networks [13, 15, 16], and that some functions learned more easily by deep architectures are nonetheless expressible by shallow ones [3]. (While here we consider neural nets with ReLU activation, it is worth noting that for arithmetic circuits, worst-case and average-case scenarios may be more similar – in both scenarios, a significant gap exists between the matricization rank of the functions computed by deep and shallow architectures [6].) As noted earlier, average-case analyses in [22, 23] provided lower bounds on expected length distortion, while [21] presented a similar analysis for the curvature of output trajectories.

Finally, a number of authors have sought complexity measures that either empirically or theoretically correlate with generalization [18]. Such measures have been based on classification margins [4], network compressibility [2], and PAC-Bayes considerations [8].

3 Expected Length Distortion

3.1 Motivation

While neural networks are typically overparameterized and trained with little or no explicit regularization, the functions they learn in practice are often able to generalize to unseen data. This phenomenon indicates an implicit regularization that causes these learned functions to be surprisingly simple.

Figure 1: Mean length distortion as a function of depth, for randomly initialized ReLU networks of varying architecture. As described in Thm. 3.1, distortion not only fails to grow, but shrinks with depth, especially for networks of small width. (a) compares the predictions of our Thm. 5.1 (solid lines) to the lower bounds proven in prior work [23] (dashed lines) and the true empirical means (colored dots), which closely track our predictions. (b) zooms in on the upper part of (a), showing the empirical mean length distortion as a function of depth for different widths. The horizontal black line is $y=\frac{\Gamma(3)}{\Gamma(2.5)\sqrt{2.5}}$, the mean length distortion we predict when $n_{0}=n_{L}$ in the limit of infinite width for any fixed depth. In each network, the width is constant across all hidden layers, while input and output dimension are both $5$. The input curve is a fixed line segment of unit length. For each architecture, length distortion is calculated for $500$ different initializations of the weights and biases of the network (with the variance of the weights given by $2/\text{fan-in}$). For further experimental details, see Appendix B.

Why this occurs is not well understood theoretically, but there is a high-level intuition. In a nutshell, randomly initialized neural networks compute functions of low complexity. Moreover, as optimization by first-order methods is a kind of greedy local search, network training is attracted to minima of the loss that are not too far from the initialization and hence will still be well-behaved.

While this intuition is compelling, a key challenge in making it rigorous is to devise appropriate notions of complexity that are small throughout training. The present work is intended to provide a piece of the puzzle, making precise the idea that, at the start of training, neural networks compute tame functions. Specifically, we demonstrate that neural networks at initialization have low distortion of length and volume, as defined in §3.2.

An important aspect of our analysis is that we study networks in typical, average-case scenarios. A randomly initialized neural network could, in principle, compute any function expressible by a network of that architecture, since the weights might with low probability take on any set of values. Some settings of weights will lead to functions of high complexity, but these settings may be unlikely to occur in practice, depending on the distribution over weights that is under consideration. As prior work has emphasized [22, 23], atypical distributions of weights can lead to exponentially high length distortion. We show that, in contrast, for deep ReLU networks with the standard initialization, the functions computed have low distortion in expectation and with high probability.

3.2 Definitions

Let $L\geq 1$ be a positive integer and fix positive integers $n_{0},\ldots,n_{L}$. We consider a fully connected feed-forward ReLU network $\mathcal{N}$ with input dimension $n_{0}$, output dimension $n_{L}$, and hidden layer widths $n_{1},\ldots,n_{L-1}$. Suppose that the weights and biases of $\mathcal{N}$ are independent and Gaussian, with the weights $W_{ij}^{(\ell)}$ between layers $\ell-1$ and $\ell$ and the biases $b_{j}^{(\ell)}$ in layer $\ell$ satisfying:

W_{ij}^{(\ell)}\sim G\left(0,2/n_{\ell-1}\right),\qquad b_{j}^{(\ell)}\sim G(0,C_{b}).\tag{3.1}

Here, $C_{b}>0$ is any fixed constant and $G(\mu,\sigma^{2})$ denotes a Gaussian with mean $\mu$ and variance $\sigma^{2}$. For any $M\subseteq\mathbb{R}^{n_{0}}$, we denote by $\mathcal{N}(M)\subseteq\mathbb{R}^{n_{L}}$ the (random) image of $M$ under the map $x\mapsto\mathcal{N}(x)$. Our primary object of study will be the size of the output $\mathcal{N}(M)$ relative to that of $M$. Specifically, when $M$ is a $1$-dimensional curve, we define

\textit{length distortion}=\frac{\operatorname{len}(\mathcal{N}(M))}{\operatorname{len}(M)}.

Note that while a priori this random variable depends on $M$, we will find that its statistics depend only on the architecture of $\mathcal{N}$ (see Thms. 3.1 and 5.1).
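To make the definition concrete, the following is a minimal illustrative NumPy sketch, not the implementation used for our experiments (that exact, region-by-region computation is described in Appendix B): it simply discretizes a line segment and maps the samples through a randomly initialized network.

```python
# Illustrative sketch: approximate length distortion of a segment under a
# randomly initialized ReLU network with He-normal weights (variance 2/fan-in).
import numpy as np

def init_relu_net(widths, bias_var=0.1, rng=None):
    """Weights with variance 2/fan-in and Gaussian biases, as in (3.1)."""
    rng = np.random.default_rng(rng)
    return [(rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)),
             rng.normal(0.0, np.sqrt(bias_var), size=n_out))
            for n_in, n_out in zip(widths[:-1], widths[1:])]

def forward(params, X):
    """Applies the network to a batch of inputs of shape (num_points, n0)."""
    for i, (W, b) in enumerate(params):
        X = X @ W.T + b
        if i < len(params) - 1:          # ReLU on hidden layers only
            X = np.maximum(X, 0.0)
    return X

def length_distortion(params, p1, p2, num_points=2000):
    """Approximate len(N(M)) / len(M) for the segment M from p1 to p2."""
    ts = np.linspace(0.0, 1.0, num_points)
    inputs = p1[None, :] + ts[:, None] * (p2 - p1)[None, :]
    outputs = forward(params, inputs)
    out_len = np.sum(np.linalg.norm(np.diff(outputs, axis=0), axis=1))
    return out_len / np.linalg.norm(p2 - p1)

# Example: depth 10, hidden width 20, n0 = nL = 5. The empirical mean over
# initializations should fall slightly below 1, as predicted by Thm. 3.1.
widths = [5] + [20] * 9 + [5]
p1, p2 = np.zeros(5), np.eye(5)[0]       # a unit-length segment
vals = [length_distortion(init_relu_net(widths, rng=s), p1, p2) for s in range(50)]
print(np.mean(vals))
```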

3.3 Results

We prove that the expected length distortion does not grow with depth – in fact, it slightly decreases. Our result may be informally stated thus:

Theorem 3.1 (Length distortion: Mean).

Consider a ReLU network of depth $L$, input dimension $n_{0}$, output dimension $n_{L}$, and hidden layer widths $n_{\ell}$, with weights given by standard He normal initialization [17]. The expected length distortion is upper bounded by $\sqrt{n_{L}/n_{0}}$. More precisely:

\mathbb{E}\left[\text{length distortion}\right]\approx C\left(\frac{n_{L}}{n_{0}}\right)^{1/2}\exp\left[-\frac{5}{8}\sum_{\ell=1}^{L-1}\frac{1}{n_{\ell}}\right],

where $C$ is a constant close to $1$ depending only on the output dimension $n_{L}$.

For a formal statement (including the exact values of constants) and proof, see Thm. 5.1 and App. C.
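As a numerical illustration (not an additional result), the approximation can be tabulated directly using the exact constant from Thm. 5.1; the short sketch below does this for a few architectures.

```python
# Evaluate the approximate mean in Thm. 3.1 / the exact expression (5.1):
# C * sqrt(nL/n0) * exp(-(5/8) * sum(1/n_l)), with
# C = Gamma((nL+1)/2) / (Gamma(nL/2) * sqrt(nL/2)).
from math import exp, gamma, sqrt

def predicted_mean_distortion(n0, nL, hidden_widths):
    C = gamma((nL + 1) / 2) / (gamma(nL / 2) * sqrt(nL / 2))
    return C * sqrt(nL / n0) * exp(-(5 / 8) * sum(1 / n for n in hidden_widths))

# For n0 = nL = 5 the prefactor C is about 0.9515, and the prediction decreases
# slowly with depth, faster for narrower hidden layers.
for width, depth in [(20, 5), (20, 20), (10, 20)]:
    print(width, depth, predicted_mean_distortion(5, 5, [width] * (depth - 1)))
```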

Figure 2: Mean length distortion as a function of the ratio of output to input dimension, for ReLU networks with various architectures (e.g. $[10,10,10]$ denotes three hidden layers, each of width 10). All networks are randomly initialized as in Figure 1. We sample $100$ pairs of input dimension $n_{0}$ and output dimension $n_{L}$, each at most $50$, such that the ratio of output to input dimension is distinct for each such pair. For each pair, $200$ different network initializations are tested and the resulting length distortion is calculated; the log of the empirical mean is plotted. The dashed black line plots $\log(y)=\frac{1}{2}\log(x)$, the approximate prediction of Theorem 3.1.

Our experiments align with the theorem. In Figure 1, we compute the empirical mean of the length distortion for randomly initialized deep ReLU networks of various architectures with fixed input and output dimension. The figure confirms three of our key predictions: (1) the expected length distortion does not grow with depth, and actually decreases slightly; (2) the decrease happens faster for narrower networks (since $1/n_{\ell}$ is larger for smaller $n_{\ell}$); and (3) for equal input and output dimension $n_{0}=n_{L}$, there is an upper bound of 1 (in fact, the tighter upper bound of $C$ is approximately 0.9515 for $n_{L}=5$, as shown in the figure). The figure also shows the prior bounds in [23] to be quite loose. For further experimental details, see Appendix B.

In Fig. 2, we instead fix the set of hidden layer widths, but vary input and output dimensions. The results confirm that indeed the expected length distortion grows as the square root of the ratio of output to input dimension.

Note that our theorem also applies for initializations other than He normal; the result in such cases is simply less interesting. Suppose we initialize the weights in each layer $\ell$ as i.i.d. normal with variance $2c^{2}/n_{\ell-1}$ instead of $2/n_{\ell-1}$. This is equivalent to taking a He normal initialization and multiplying the weights by $c$. Multiplying the weights and biases in a depth-$L$ ReLU network by $c$ simply multiplies the output (and hence the length distortion) by $c^{L}$, so the expected distortion is $c^{L}$ times the value given in Thm. 3.1. This emphasizes why an exponentially large distortion necessarily occurs if one sets the weights too large.

3.4 Intuitive explanation

The purpose of this section is to give an intuitive but technical explanation for why, in Theorem 3.1, we find that the distortion $\operatorname{len}(\mathcal{N}(M))/\operatorname{len}(M)$ of the length of a $1$D curve $M$ under the neural network $\mathcal{N}$ is typically not much larger than $1$, even for deep networks.

Our starting point is that, near a given point $x\in M$, the length of the image under $\mathcal{N}$ of a small portion of $M$ with length $dx$ is approximately given by $\left|\left|J_{x}u\right|\right|dx$, where $J_{x}$ is the input-output Jacobian of $\mathcal{N}$ at the input $x$ and $u$ is the unit tangent vector to $M$ at $x$. In fact, in Lemma C.1 of the Supp. Material, we use a simple argument to give upper bounds on the moments of $\operatorname{len}(\mathcal{N}(M))/\operatorname{len}(M)$ in terms of the moments of the norm $\left|\left|J_{x}u\right|\right|$ of the Jacobian-vector product $J_{x}u$.

Figure 3: Length distortion as a function of $\sum_{\ell=1}^{L}n_{\ell}^{-1}$, showing both mean and standard deviation across initializations. We test several types of network architecture – with constant width 20, alternating between widths 30 and 10, and with the first (respectively, second) half of the layers of width 30 and the rest of width 10. Each architecture type is tested for several depths. For each such network, we use $n_{0}=n_{L}=5$ and compute length distortion for $100$ initializations on a fixed line segment. As predicted in Thm. 4.1, the mean length distortion decreases with the sum of width reciprocals. The empirical standard deviation does not, in general, increase with greater depth, remaining modest throughout, consistent with the upper bound on variance in Thm. 4.1.

Thus, upper bounds on the length distortion reduce to studying the Jacobian $J_{x}$, which in a network of depth $L$ can be written as a product $J_{x}=J_{L,x}\cdots J_{1,x}$ of the layer $\ell-1$ to layer $\ell$ Jacobians $J_{\ell,x}$. In a worst-case analysis, the left singular vectors of $J_{\ell,x}$ would align with the right singular vectors of $J_{\ell+1,x}$, causing the resulting product $J_{x}$ to have a largest singular value that grows exponentially with $L$. However, at initialization, this is provably not the case with high probability. Indeed, the Jacobians $J_{\ell,x}$ are independent (see Lemma C.3) and their singular vectors are therefore incoherent. We find in particular (Proposition C.2) the following equality in distribution:

\left|\left|J_{x}u\right|\right|\stackrel{d}{=}\prod_{\ell=1}^{L}\left|\left|J_{\ell,x}e_{1}\right|\right|,

where $e_{1}$ is the first standard unit vector and the terms in the product are independent. On average, each term in this product is close to $1$:

\mathbb{E}\left[\left|\left|J_{\ell,x}e_{1}\right|\right|\right]=1+O(n_{\ell}^{-1}).

This is a consequence of initializing weights to have variance $2/\text{fan-in}$. Put together, the two preceding displayed equations suggest writing

\left|\left|J_{x}u\right|\right|=\exp\left[\sum_{\ell=1}^{L}\log\left(\left|\left|J_{\ell,x}e_{1}\right|\right|\right)\right].

The terms being summed in the exponent are independent and the argument of each logarithm scales like $1+O(n_{\ell}^{-1})$. With probability $1-\delta$ we have, for all $\ell$, that $\log\left(\left|\left|J_{\ell,x}e_{1}\right|\right|\right)\leq cn_{\ell}^{-1}$, where $c$ is a constant depending only on $\delta$. Thus, all together,

\left|\left|J_{x}u\right|\right|\leq\exp\left[c\sum_{\ell=1}^{L}n_{\ell}^{-1}\right]

with high probability. In particular, in the simple case where the $n_{\ell}$ are proportional to a single large constant $n$, we find that the typical size of $\left|\left|J_{x}u\right|\right|$ is exponential in $L/n$ rather than exponential in $L$, as in the worst case. If $\mathcal{N}$ is wider than it is deep, so that $L/n$ is bounded above, the typical size of $\left|\left|J_{x}u\right|\right|$, and hence of $\operatorname{len}(\mathcal{N}(M))/\operatorname{len}(M)$, remains bounded. Theorem 5.1 makes this argument precise. Moreover, let us note that articles like [7, 10, 21] show low distortion of angles and volumes in very wide networks; our results, however, apply directly to more realistic network widths.
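As a numerical illustration of the $1+O(n_{\ell}^{-1})$ claim (an illustration only, not part of the proofs), the mean of the hidden-layer factor in the product representation given later in Prop. 6.2 can be computed exactly and compared with the $\exp(-5/(8n))$ refinement that appears in Thm. 5.1.

```python
# Exact mean of the hidden-layer factor sqrt((2/n) * chi^2_K), K ~ Bin(n, 1/2),
# compared against exp(-5/(8n)); the two agree up to O(1/n^2) corrections.
from math import comb, exp, gamma, sqrt

def mean_layer_factor(n):
    """E[ sqrt((2/n) chi^2_K) ] with K ~ Bin(n, 1/2); the K = 0 term contributes 0."""
    total = 0.0
    for k in range(1, n + 1):
        e_norm = sqrt(2.0) * gamma((k + 1) / 2) / gamma(k / 2)   # E[sqrt(chi^2_k)]
        total += comb(n, k) * 0.5**n * sqrt(2.0 / n) * e_norm
    return total

for n in [5, 10, 50, 200]:
    print(n, mean_layer_factor(n), exp(-5 / (8 * n)))
```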

4 Further Results

4.1 Higher moments

We have seen that the mean length distortion is upper-bounded by 1, and indeed by a function of the architecture that decreases with the depth of the network. However, a small mean is insufficient by itself to ensure that typical networks have low length distortion. For this, we now show that in fact all moments of the length distortion are well-controlled. Specifically, the variance is bounded above by the ratio of output to input dimension, and higher moments are upper-bounded by a function that grows very slowly with depth. Our results may be informally stated thus:

Theorem 4.1 (Length distortion: Higher moments).

Consider, as before, a ReLU network of depth $L$, input dimension $n_{0}$, output dimension $n_{L}$, and hidden layer widths $n_{\ell}$, with weights given by He normal initialization. We have the following bounds on higher moments of the length distortion:

\mathrm{Var}[\text{length distortion}]\leq\frac{n_{L}}{n_{0}}\quad\text{and}\quad\mathbb{E}\left[(\text{length distortion})^{m}\right]\leq\left(\frac{n_{L}}{n_{0}}\right)^{\frac{m}{2}}\exp\left[cm^{2}\sum_{\ell=1}^{L}n_{\ell}^{-1}\right]

for some universal constant $c>0$.

A formal statement and proof of this result are given in Theorem 5.1 and Appendix C.

We consider the mean and variance of length distortion for a wide range of network architectures in Figure 3. The figure confirms that variance is modest for all architectures and does not increase with depth, and that mean length distortion is consistently decreasing with the sum of layer width reciprocals, as predicted by Theorem 4.1.

4.2 Higher-dimensional volumes

Another natural generalization to consider is how ReLU networks distort higher-dimensional volumes: given a $d$-dimensional input $M$, the output $\mathcal{N}(M)$ will in general also be $d$-dimensional, and we can therefore consider the volume distortion $\operatorname{vol}_{d}(\mathcal{N}(M))/\operatorname{vol}_{d}(M)$. The theorems presented in §3.3 for $d=1$ can be extended to all $d$, and the results may be informally stated thus:

Theorem 4.2 (Volume distortion).

Consider a ReLU network of depth $L$, input dimension $n_{0}$, output dimension $n_{L}$, and hidden layer widths $n_{\ell}$, with weights given by He normal initialization. Both the squared mean and the variance of the volume distortion are upper-bounded by:

\left(\frac{n_{L}}{n_{0}}\right)^{d}\exp\left[-\binom{d}{2}\sum_{\ell=1}^{L}n_{\ell}^{-1}\right].

A formal statement and proof of this result are given in Theorem 5.2 and Appendix D.

5 Formal Statements

In this section, we provide formal statements of our results. Full proofs are given in the Supp. Material.

5.1 One-dimensional manifolds

Fix a smooth one-dimensional manifold $M\subseteq\mathbb{R}^{n_{0}}$ (i.e. a curve) with unit length. Each point on $M$ represents a possible input to our ReLU network $\mathcal{N}$, which we recall has input dimension $n_{0}$, output dimension $n_{L}$, and hidden layer widths $n_{1},\ldots,n_{L-1}$. The function $x\mapsto\mathcal{N}(x)$ computed by $\mathcal{N}$ is continuous and piecewise linear. Thus, the image $\mathcal{N}(M)\subseteq\mathbb{R}^{n_{L}}$ of $M$ under this map is a piecewise smooth curve in $\mathbb{R}^{n_{L}}$ with a finite number of pieces, and its length $\operatorname{len}(\mathcal{N}(M))$ is a random variable. Our first result states that, provided $\mathcal{N}$ is wider than it is deep, this distortion is small with high probability at initialization.

Theorem 5.1.

Let $\mathcal{N}$ be a fully connected ReLU network with input dimension $n_{0}$, output dimension $n_{L}$, hidden layer widths $n_{1},\ldots,n_{L-1}$, and independent centered Gaussian weights/biases with the weights having variance $2/\text{fan-in}$ (as in (3.1)). Let $M$ be a $1$-dimensional curve of unit length in $\mathbb{R}^{n_{0}}$. Then, the mean length $\mathbb{E}\left[\operatorname{len}(\mathcal{N}(M))\right]$ equals

\left(\frac{n_{L}}{n_{0}}\right)^{1/2}\frac{\Gamma\left(\frac{n_{L}+1}{2}\right)}{\Gamma\left(\frac{n_{L}}{2}\right)\left(\frac{n_{L}}{2}\right)^{1/2}}\times\exp\left[-\frac{5}{8}\sum_{\ell=1}^{L-1}n_{\ell}^{-1}+O\left(\sum_{\ell=1}^{L-1}n_{\ell}^{-2}\right)\right],\tag{5.1}

where the implied constants in $O(\cdot)$ are universal and $\Gamma(\cdot)$ denotes the Gamma function. Moreover:

\mathrm{Var}[\operatorname{len}(\mathcal{N}(M))]\leq\frac{n_{L}}{n_{0}}.\tag{5.2}

Finally, there exist universal constants $c_{1},c_{2}>0$ such that if $m<c_{1}\min\left\{n_{1},\ldots,n_{L-1}\right\}$, then

\mathbb{E}\left[\left(\operatorname{len}(\mathcal{N}(M))\right)^{m}\right]\leq\left(\frac{n_{L}}{n_{0}}\right)^{m/2}\exp\left[c_{2}m^{2}\sum_{\ell=1}^{L}n_{\ell}^{-1}\right].\tag{5.3}

Theorem 5.1 is proved in §C. Several comments are in order. First, we have assumed for simplicity that $M$ has unit length. For curves of arbitrary length, the equality (5.1) and the bounds (5.2) and (5.3) all hold and are derived in the same way, provided $\operatorname{len}(\mathcal{N}(M))$ is replaced by the distortion $\operatorname{len}(\mathcal{N}(M))/\operatorname{len}(M)$ per unit length.

Second, in the expression (5.1), note that the exponent tends to 0 as $n_{\ell}\rightarrow\infty$ for any fixed $L$. More precisely, when $n_{\ell}=n$ is large and constant, it scales as $-5L/8n$. This shows that, for fixed $n_{0},n_{L},n$, the mean $\mathbb{E}\left[\operatorname{len}(\mathcal{N}(M))\right]$ is actually decreasing in $L$. This somewhat surprising phenomenon is borne out in Fig. 1 and is a consequence of the fact that wide fully connected ReLU nets are strongly contracting (see e.g. Thms. 3 and 4 in [10]).

Third, the pre-factor $(n_{L}/n_{0})^{1/2}$ encodes the fact that, on average over initialization, for any input vector $x\in\mathbb{R}^{n_{0}}$, we have (see e.g. Cor. 1 in [14])

\mathbb{E}\left[\left|\left|\mathcal{N}(x)\right|\right|^{2}\right]\approx\frac{n_{L}}{n_{0}}\left|\left|x\right|\right|^{2},\tag{5.4}

where the approximate equality becomes exact if the bias variance $C_{b}$ is set to 0 (see (3.1)). In other words, at the start of training, the network $\mathcal{N}$ rescales inputs by an average factor of $(n_{L}/n_{0})^{1/2}$. This overall scaling factor also dilates $\operatorname{len}(\mathcal{N}(M))$ relative to $\operatorname{len}(M)$.

Fourth, we stated Theorem 5.1 for Gaussian weights and biases (see (3.1)), but the result holds for any weight distribution that is symmetric around 0, has variance $2/\text{fan-in}$, and has finite higher moments. The only difference is that the constant $5/8$ may need to be slightly enlarged (e.g. around (C.6)).

Finally, all the estimates (5.1), (5.2), and (5.3) are consistent with the statement that, up to the scaling factor $(n_{L}/n_{0})^{1/2}$, the random variable $\operatorname{len}(\mathcal{N}(M))$ is bounded above by a log-normal distribution:

\operatorname{len}(\mathcal{N}(M))\leq\left(\frac{n_{L}}{n_{0}}\right)^{1/2}\exp\left[\frac{1}{2}G\left(\frac{\beta}{2},\beta\right)\right],

where $\beta=c\sum_{\ell=1}^{L}n_{\ell}^{-1}$ and $c$ is a fixed constant.

5.2 Higher-dimensional manifolds

The results in the previous section generalize to the case when $M$ has higher dimension. To consider this case, fix a positive integer $d\leq\min\left\{n_{0},\ldots,n_{L}\right\}$ and a smooth $d$-dimensional manifold $M\subseteq\mathbb{R}^{n_{0}}$ of unit volume $\operatorname{vol}_{d}(M)=1$. (Note that if $d>\min\left\{n_{0},\ldots,n_{L}\right\}$, then $\mathcal{N}(M)$ is at most $(d-1)$-dimensional and its $d$-dimensional volume vanishes.) The image $\mathcal{N}(M)\subseteq\mathbb{R}^{n_{L}}$ of $M$ is a piecewise smooth manifold of dimension at most $d$. The following result, proved in §D, gives an upper bound on the typical value of the $d$-dimensional volume of $\mathcal{N}(M)$.

Theorem 5.2.

Let $\mathcal{N}$ be a fully connected ReLU network with input dimension $n_{0}$, output dimension $n_{L}$, hidden layer widths $n_{1},\ldots,n_{L-1}$, and independent centered Gaussian weights/biases with the variance of the weights given by $2/\text{fan-in}$ (see (3.1)). Let $M$ be a $d$-dimensional smooth submanifold of $\mathbb{R}^{n_{0}}$ with unit volume and $d\leq\min\left\{n_{0},\ldots,n_{L}\right\}$. Then, both the squared mean and the variance of the $d$-dimensional volume $\operatorname{vol}_{d}(\mathcal{N}(M))$ of $\mathcal{N}(M)$ are bounded above by

\left(\frac{n_{L}}{n_{0}}\right)^{d}\exp\left[-\binom{d}{2}\sum_{\ell=1}^{L}n_{\ell}^{-1}\right].\tag{5.5}

6 Proof Sketch

The purpose of this section is to explain the main steps in obtaining the mean and variance estimates (5.1) and (5.2) from Theorem 5.1. In several places, we will gloss over some mathematical subtleties that we deal with in the detailed proof of Theorem 5.1 given in Appendix C. We will also content ourselves here with proving a slightly weaker estimate than (5.1), showing instead simply that

\mathbb{E}\left[\operatorname{len}(\mathcal{N}(M))\right]\leq\left(\frac{n_{L}}{n_{0}}\right)^{1/2}\qquad\text{and}\qquad\mathrm{Var}\left[\operatorname{len}(\mathcal{N}(M))\right]\leq\frac{n_{L}}{n_{0}}.\tag{6.1}

We refer the reader to Appendix C for a derivation of the more refined result stated in Theorem 5.1. Since $\mathbb{E}\left[X\right]^{2}\leq\mathbb{E}\left[X^{2}\right]$ and $\mathrm{Var}[X]\leq\mathbb{E}\left[X^{2}\right]$, both estimates in (6.1) follow from

\mathbb{E}\left[\operatorname{len}(\mathcal{N}(M))^{2}\right]\leq\frac{n_{L}}{n_{0}}.\tag{6.2}

To obtain this bound, we begin by relating moments of $\operatorname{len}(\mathcal{N}(M))$ to those of the input-output Jacobian of $\mathcal{N}$ at a single point, for which we need some notation. Namely, let us choose a unit speed parameterization of $M$; that is, fix a smooth mapping

\gamma:\mathbb{R}\rightarrow\mathbb{R}^{n_{0}},\qquad\gamma(t)=\left(\gamma_{1}(t),\ldots,\gamma_{n_{0}}(t)\right)

for which $M$ is the image under $\gamma$ of the interval $[0,1]$. That $\gamma$ has unit speed means that for every $t\in[0,1]$ we have $\left|\left|\gamma^{\prime}(t)\right|\right|=1$, where $\gamma^{\prime}(t):=\left(\gamma_{1}^{\prime}(t),\ldots,\gamma_{n_{0}}^{\prime}(t)\right)$. Then the mapping $\Gamma:=\mathcal{N}\circ\gamma$ (with $\Gamma:\mathbb{R}\rightarrow\mathbb{R}^{n_{L}}$) gives a parameterization of the curve $\mathcal{N}(M)$. Note that this parameterization is not unit speed. Rather, the length of $\mathcal{N}(M)$ is given by

\operatorname{len}(\mathcal{N}(M))=\int_{0}^{1}\left|\left|\Gamma^{\prime}(t)\right|\right|dt.\tag{6.3}

Intuitively, the integrand $\left|\left|\Gamma^{\prime}(t)\right|\right|dt$ computes, at a given $t$, the length of $\Gamma([t,t+dt])$ as $dt\rightarrow 0$. The following lemma uses (6.3) to bound the moments of $\operatorname{len}(\mathcal{N}(M))$ in terms of the moments of the norm of the input-output Jacobian $J_{x}$, defined for any $x\in\mathbb{R}^{n_{0}}$ by

J_{x}:=\left(\frac{\partial\mathcal{N}_{i}}{\partial x_{j}}(x)\right)_{\substack{1\leq i\leq n_{L}\\ 1\leq j\leq n_{0}}},\tag{6.4}

where $\mathcal{N}_{i}$ is the $i$-th component of the network output.

Lemma 6.1.

We have

\mathbb{E}\left[\operatorname{len}(\mathcal{N}(M))^{2}\right]\leq\int_{0}^{1}\mathbb{E}\left[\left|\left|J_{\gamma(t)}\gamma^{\prime}(t)\right|\right|^{2}\right]dt.\tag{6.5}
Sketch of Proof.

Taking powers in (6.3) and interchanging the expectation and integrals, we obtain

\mathbb{E}\left[\operatorname{len}(\mathcal{N}(M))^{2}\right]=\int_{0}^{1}\int_{0}^{1}\mathbb{E}\left[\left|\left|\Gamma^{\prime}(t_{1})\right|\right|\,\left|\left|\Gamma^{\prime}(t_{2})\right|\right|\right]dt_{1}\,dt_{2}.\tag{6.6}

Applying the inequality $ab\leq\frac{1}{2}(a^{2}+b^{2})$, valid for $a,b\in\mathbb{R}$, to the integrand in (6.6), we conclude

\mathbb{E}\left[\operatorname{len}(\mathcal{N}(M))^{2}\right]\leq\int_{0}^{1}\mathbb{E}\left[\left|\left|\Gamma^{\prime}(t)\right|\right|^{2}\right]dt.\tag{6.7}

The chain rule yields $\Gamma^{\prime}(t)=J_{\gamma(t)}\gamma^{\prime}(t)$. Substituting this into (6.7) completes the proof. ∎

Lemma 6.1, while elementary, appears to be new, allowing us to bound global quantities such as length in terms of local quantities such as the moments of $\left|\left|J_{x}u\right|\right|$. In particular, having obtained the expression (6.5), we have reduced bounding the second moment of $\operatorname{len}(\mathcal{N}(M))$ to bounding the second moment of the random variable $\left|\left|J_{x}u\right|\right|$ for a fixed $x\in\mathbb{R}^{n_{0}}$ and unit vector $u\in\mathbb{R}^{n_{0}}$. This is a simple matter since in our setting the distribution of weights and biases is Gaussian and hence, in distribution, $\left|\left|J_{x}u\right|\right|$ is equal to a product of independent random variables:

Proposition 6.2.

For any $x\in\mathbb{R}^{n_{0}}$ and any unit vector $u\in\mathbb{R}^{n_{0}}$, the random variable $\left|\left|J_{x}u\right|\right|^{2}$ is equal in distribution to a product of independent scaled chi-squared random variables

\left|\left|J_{x}u\right|\right|^{2}\stackrel{d}{=}\frac{n_{L}}{n_{0}}\left(\prod_{\ell=1}^{L-1}\frac{2}{n_{\ell}}\chi_{{\bf n}_{\ell}}^{2}\right)\cdot\frac{1}{n_{L}}\chi_{n_{L}}^{2},

where the number of degrees of freedom in the $\ell$th term of the product (for $\ell=1,\ldots,L-1$) is given by an independent binomial ${\bf n}_{\ell}\stackrel{d}{=}\mathrm{Bin}\left(n_{\ell},1/2\right)$ with $n_{\ell}$ trials and success probability $1/2$. The number of degrees of freedom in the final term is deterministic and given by $n_{L}$.

This proposition has been derived a number of times in the literature (see Theorem 3 in [11] and Fact 7.2 in [1]). We provide an alternative proof based on Proposition 2 in [12] in the Supplementary Material. Note that the distribution of $\left|\left|J_{x}u\right|\right|$ is the same for every $x$ and $u$. Thus, fixing $x\in\mathbb{R}^{n_{0}}$ and a unit vector $u\in\mathbb{R}^{n_{0}}$, we find

\mathbb{E}\left[\operatorname{len}(\mathcal{N}(M))^{2}\right]\leq\mathbb{E}\left[\left|\left|J_{x}u\right|\right|^{2}\right].

To prove (6.2), we note that $\mathbb{E}\left[\chi_{{\bf n}_{\ell}}^{2}\right]=\mathbb{E}\left[{\bf n}_{\ell}\right]=n_{\ell}/2$ and apply Proposition 6.2 to find, as desired,

\mathbb{E}\left[\left|\left|J_{x}u\right|\right|^{2}\right]=\frac{n_{L}}{n_{0}}\left(\prod_{\ell=1}^{L-1}\mathbb{E}\left[\frac{2}{n_{\ell}}\chi_{{\bf n}_{\ell}}^{2}\right]\right)\mathbb{E}\left[n_{L}^{-1}\chi_{n_{L}}^{2}\right]=\frac{n_{L}}{n_{0}}\prod_{\ell=1}^{L-1}\left\{\frac{2}{n_{\ell}}\mathbb{E}\left[{\bf n}_{\ell}\right]\right\}=\frac{n_{L}}{n_{0}}.

∎
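This calculation is easy to spot-check by simulation; the sketch below (an illustration, not part of the proof) samples $\left|\left|J_{x}u\right|\right|^{2}$ from the representation in Proposition 6.2 and verifies that its empirical mean is close to $n_{L}/n_{0}$ regardless of depth.

```python
# Monte Carlo check of E[||J_x u||^2] = nL/n0 via the product-of-chi-squared
# representation in Proposition 6.2.
import numpy as np

rng = np.random.default_rng(0)

def sample_jxu_sq(n0, hidden_widths, nL, num_samples):
    vals = np.full(num_samples, nL / n0)
    for n in hidden_widths:
        dof = rng.binomial(n, 0.5, size=num_samples)     # Bin(n_l, 1/2) active units
        chi = rng.chisquare(np.maximum(dof, 1)) * (dof > 0)   # chi^2_0 is identically 0
        vals *= (2.0 / n) * chi
    vals *= rng.chisquare(nL, size=num_samples) / nL
    return vals

samples = sample_jxu_sq(n0=5, hidden_widths=[20] * 10, nL=5, num_samples=500_000)
print(samples.mean())        # close to nL/n0 = 1 for this depth-11 architecture
```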

7 Conclusion

We show that deep ReLU networks with appropriate initialization do not appreciably distort lengths and volumes, contrary to prior assumptions of exponential growth. Specifically, we provide an exact expression for the mean length distortion, which is bounded above by 1 and decreases slightly with increasing depth. We also prove that higher moments of the length distortion admit well-controlled upper bounds, and generalize this to distortion of higher dimensional volumes. We show empirically that our theoretical results closely match observations, unlike previous loose lower bounds.

There remain several promising directions for future work. First, we have proved statements for networks at initialization, and our results do not necessarily hold after training; analyzing such behavior formally would require consideration of the loss function, optimizer, and data distribution. Second, our results, as stated, do not apply to convolutional, residual, or recurrent networks. We envision generalizations for networks with skip connections being straightforward, while formulations for convolutional and recurrent networks will likely be more complicated. In Appendix A, we provide preliminary results suggesting that expected length distortion decreases modestly with depth in both convolutional networks and those with skip connections. Third, we believe that various other measures of complexity for neural networks, such as curvature, likely demonstrate similar behavior to length and volume distortion in average-case scenarios with appropriate initialization. While somewhat parallel results for counting linear regions were shown in [15, 16], there remains much more to be understood for other notions of complexity. Finally, we look forward to work that explicitly ties properties such as low length distortion to improved generalization, as well as new learning algorithms that leverage this line of theory in order to better control inductive biases to fit real-world tasks.

References

  • [1] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning (ICML), 2019.
  • [2] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning (ICML), 2018.
  • [3] Lei Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Conference on Neural Information Processing Systems (NeurIPS), 2014.
  • [4] Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. Preprint arXiv:1706.08498, 2017.
  • [5] Monica Bianchini and Franco Scarselli. On the complexity of neural network classifiers: A comparison between shallow and deep architectures. IEEE Transactions on Neural Networks and Learning Systems, 25(8):1553–1565, 2014.
  • [6] Nadav Cohen, Or Sharir, and Amnon Shashua. On the expressive power of deep learning: A tensor analysis. In Conference on Learning Theory (COLT), 2016.
  • [7] Amit Daniely. Depth separation for neural networks. In Conference on Learning Theory (COLT), 2017.
  • [8] Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Preprint arXiv:1703.11008, 2017.
  • [9] Ronen Eldan and Ohad Shamir. The power of depth for feedforward neural networks. In Conference on Learning Theory (COLT), 2016.
  • [10] Raja Giryes, Guillermo Sapiro, and Alex M Bronstein. Deep neural networks with random Gaussian weights: A universal classification strategy? IEEE Transactions on Signal Processing, 64(13):3444–3457, 2016.
  • [11] Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? In Conference on Neural Information Processing Systems (NeurIPS), 2018.
  • [12] Boris Hanin and Mihai Nica. Products of many large random matrices and gradients in deep neural networks. Communications in Mathematical Physics, pages 1–36, 2019.
  • [13] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. In International Conference on Learning Representations (ICLR), 2020.
  • [14] Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. In Conference on Neural Information Processing Systems (NeurIPS), 2018.
  • [15] Boris Hanin and David Rolnick. Complexity of linear regions in deep networks. In International Conference on Machine Learning (ICML), 2019.
  • [16] Boris Hanin and David Rolnick. Deep ReLU networks have surprisingly few activation patterns. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
  • [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision (ICCV), 2015.
  • [18] Yiding Jiang, Behnam Neyshabur, Hossein Mobahi, Dilip Krishnan, and Samy Bengio. Fantastic generalization measures and where to find them. Preprint arXiv:1912.02178, 2019.
  • [19] Henry W Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, 2017.
  • [20] Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, and Yoshua Bengio. On the number of linear regions of deep neural networks. In Conference on Neural Information Processing Systems (NeurIPS), 2014.
  • [21] Ben Poole, Subhaneil Lahiri, Maithra Raghu, Jascha Sohl-Dickstein, and Surya Ganguli. Exponential expressivity in deep neural networks through transient chaos. In Conference on Neural Information Processing Systems (NeurIPS), 2016.
  • [22] Ilan Price and Jared Tanner. Trajectory growth lower bounds for random sparse deep ReLU networks. Preprint arXiv:1911.10651, 2019.
  • [23] Maithra Raghu, Ben Poole, Jon Kleinberg, Surya Ganguli, and Jascha Sohl-Dickstein. On the expressive power of deep neural networks. In International Conference on Machine Learning (ICML), 2017.
  • [24] David Rolnick and Max Tegmark. The power of deeper networks for expressing natural functions. In International Conference on Learning Representations, 2018.
  • [25] Shai Shalev-Shwartz, Ohad Shamir, and Shaked Shammah. Failures of gradient-based deep learning. In International Conference on Machine Learning (ICML), 2017.
  • [26] Matus Telgarsky. Representation benefits of deep feedforward networks. Preprint arXiv:1509.08101, 2015.
  • [27] Matus Telgarsky. Benefits of depth in neural networks. In Conference on Learning Theory (COLT), 2016.

Appendix A Experiments for Convolutional and Residual Networks

We here provide preliminary experimental results for the mean length distortion in networks with convolutional layers and skip connections.

In Figure 4(a), we show results for networks with sequential convolutional layers (without pooling layers), initialized as before with weights i.i.d. normal with variance $2/\text{fan-in}$, and a final fully connected layer. We use input dimension $n_{0}=16\times 16\times 3=768$ and output dimension 5. Our results indicate that the mean length distortion is, as expected, approximately equal to $\sqrt{n_{L}/n_{0}}\approx 0.08$, and that it decays slightly with depth.

In Figure 4(b), we consider residual networks. Here, the overall network $\mathcal{N}$ is defined in terms of residual modules $\mathcal{N}_{1},\mathcal{N}_{2},\ldots,\mathcal{N}_{L}$ and scales $\eta_{1},\eta_{2},\ldots,\eta_{L}$ according to:

\mathcal{N}(x)=x+\eta_{1}\mathcal{N}_{1}(x)+\eta_{2}\mathcal{N}_{2}\left(x+\eta_{1}\mathcal{N}_{1}(x)\right)+\cdots+\eta_{L}\mathcal{N}_{L}\left(x+\eta_{1}\mathcal{N}_{1}(x)+\eta_{2}\mathcal{N}_{2}\left(x+\eta_{1}\mathcal{N}_{1}(x)\right)+\cdots\right).

We set all residual modules $\mathcal{N}_{\ell}$ to be two-layer, fully connected ReLU networks. In keeping with [14], we initialize all weights i.i.d. normal with variance $2/\text{fan-in}$, while setting $\eta_{1}=\eta_{2}=\cdots=\eta_{L}=1/L$. Our results suggest that mean length distortion again decays modestly overall, with a somewhat sharper decrease for small depths.
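For concreteness, the following is a minimal illustrative sketch of this residual forward pass (not the exact implementation used for Figure 4(b)); the hidden width of each module and the choice to apply the ReLU only inside each two-layer module are assumptions made for the example, while the uniform scales $\eta_{\ell}=1/L$ follow the description above.

```python
# Illustrative residual forward pass: two-layer fully connected ReLU modules,
# He-normal weights (variance 2/fan-in), and uniform scales eta_l = 1/L.
import numpy as np

def init_residual_net(dim, num_blocks, hidden_width, bias_var=0.1, rng=None):
    rng = np.random.default_rng(rng)
    def layer(n_in, n_out):
        return (rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)),
                rng.normal(0.0, np.sqrt(bias_var), size=n_out))
    # Each residual module maps dim -> hidden_width -> dim (a depth-2 ReLU net).
    return [(layer(dim, hidden_width), layer(hidden_width, dim))
            for _ in range(num_blocks)]

def forward_residual(blocks, x):
    eta = 1.0 / len(blocks)                       # eta_1 = ... = eta_L = 1/L
    for (W1, b1), (W2, b2) in blocks:
        h = np.maximum(W1 @ x + b1, 0.0)          # ReLU hidden layer of the module
        x = x + eta * (W2 @ h + b2)               # skip connection plus scaled module
    return x

# Example: input/output dimension 5, 10 residual blocks of hidden width 20.
blocks = init_residual_net(dim=5, num_blocks=10, hidden_width=20, rng=0)
print(forward_residual(blocks, np.ones(5)))
```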

Figure 4: Mean length distortion as a function of depth, for networks with convolutional layers and skip connections, initialized using He normal initialization [17]. (a) shows results for networks having convolutional layers, where the input dimension $n_{0}$ equals $16\times 16\times 3=768$ and the output dimension $n_{L}$ equals $5$. Here, width corresponds to the number of $3\times 3$ kernels in each layer. As expected, the mean length distortion is approximately $\sqrt{n_{L}/n_{0}}\approx 0.08$ and decays modestly with depth. (b) shows results for networks with skip connections occurring between even layers (i.e. each residual block is a fully-connected ReLU network of depth $2$), again taking a line segment of unit length as the input curve. Here, depth corresponds to the total number of layers in the network, so that a depth-20 network includes $10$ residual blocks. Both plots (a) and (b) show the empirical mean over $100$ initializations of the weights and biases.

Appendix B Experimental Details

For all experiments, weights were initialized from i.i.d. normal distributions with variance $2/\text{fan-in}$ and bias variance $0.1$. We run several experiments that involve computing the length distortion of a given line segment in $\mathbb{R}^{n_{0}}$. We describe here how this was done, relying on the notion of linear regions and bent hyperplanes associated with ReLU networks, explored in [16, 26]. Specifically, ReLU networks partition input space into a collection of convex polytopes, which we call linear regions: (generically) a distinct linear function is defined on each region, corresponding to a different subset of the hidden neurons being activated (where a neuron is activated if its pre-activation is nonnegative). The boundaries of these linear regions are given by sets of points at which a particular neuron has pre-activation equal to 0.

Say we are given a line segment $\ell$ with endpoints $\mathbf{p}_{1},\mathbf{p}_{2}\in\mathbb{R}^{n_{0}}$, parametrized by unit speed on $[0,1]$ and given by the set $\{\mathbf{p}_{1}+t(\mathbf{p}_{2}-\mathbf{p}_{1}):t\in[0,1]\}$, for which we aim to compute the length distortion. Let $\mathbf{w}^{\ell}=\mathbf{p}_{2}-\mathbf{p}_{1}$.

We first approximate the intersections of the line segment with linear region boundaries using a binary search subroutine. Specifically, initialize the set $\mathcal{S}=[0,0.5,1]$, which will contain the parameter values for these approximations. For each iteration of the subroutine, we run the following procedure: for any three consecutive points $t_{1},t_{2},t_{3}\in\mathcal{S}$, let $\mathbf{x}_{1}=\mathbf{p}_{1}+t_{1}(\mathbf{p}_{2}-\mathbf{p}_{1})$, $\mathbf{x}_{2}=\mathbf{p}_{1}+t_{2}(\mathbf{p}_{2}-\mathbf{p}_{1})$, $\mathbf{x}_{3}=\mathbf{p}_{1}+t_{3}(\mathbf{p}_{2}-\mathbf{p}_{1})$, and consider the values $\frac{||\mathcal{N}(\mathbf{x}_{2})-\mathcal{N}(\mathbf{x}_{1})||}{||\mathbf{x}_{2}-\mathbf{x}_{1}||}$ and $\frac{||\mathcal{N}(\mathbf{x}_{3})-\mathcal{N}(\mathbf{x}_{2})||}{||\mathbf{x}_{3}-\mathbf{x}_{2}||}$. These values are equal if the three points are in the same linear region, in which case we eliminate $t_{1}$ and $t_{3}$ from $\mathcal{S}$. If they are not equal, then there exists a linear region boundary in the segment from $\mathbf{x}_{1}$ to $\mathbf{x}_{3}$; we add the points $\frac{t_{1}+t_{2}}{2}$ and $\frac{t_{2}+t_{3}}{2}$ to $\mathcal{S}$. We iterate this procedure (15 times in our implementation); for the final iteration, we ensure that only the last point associated with a given linear region is in $\mathcal{S}$.

Now take the set $\mathcal{S}=[t_{1},t_{2},\dots,t_{n}]$ returned by the binary search procedure; we proceed to compute the parameter values denoting the exact intersections of the line segment with region boundaries, which we shall store in $\mathcal{S}^{*}$. For consecutive points $t_{i},t_{i+1}\in\mathcal{S}$, $i=1,\dots,n-1$, we determine the set of activated neurons at both points $\mathbf{x}_{i}=\mathbf{p}_{1}+t_{i}(\mathbf{p}_{2}-\mathbf{p}_{1})$ and $\mathbf{x}_{i+1}=\mathbf{p}_{1}+t_{i+1}(\mathbf{p}_{2}-\mathbf{p}_{1})$, and find the neuron at which these sets differ; we solve a linear equation to determine the value $t^{*}\in[0,1]$ at which this neuron's pre-activation equals 0, and replace $t_{i}$ with the exact value $t_{i}^{*}$ in $\mathcal{S}^{*}$. Finally, we append $0$ and $1$ to the respective ends of $\mathcal{S}^{*}=[t^{*}_{0}=0,t^{*}_{1},\dots,t^{*}_{n}=1]$.

Computing the length distortion then reduces to summing the lengths of the output segments corresponding to consecutive pairs $t_{i},t_{i+1}\in\mathcal{S}^{*}$. Namely, for each such pair, the network is given by a single linear function on the segment of inputs between $\mathbf{x}_{i}=\mathbf{p}_{1}+t_{i}(\mathbf{p}_{2}-\mathbf{p}_{1})$ and $\mathbf{x}_{i+1}=\mathbf{p}_{1}+t_{i+1}(\mathbf{p}_{2}-\mathbf{p}_{1})$. We calculate the weight matrix $\mathbf{W}_{i}$ of this linear function; the length of the corresponding output segment is then $||\mathbf{W}_{i}\mathbf{w}^{\ell}||\,(t^{*}_{i+1}-t^{*}_{i})$. The total length distortion is given by the sum

\sum_{i=0}^{n-1}||\mathbf{W}_{i}\mathbf{w}^{\ell}||\,(t^{*}_{i+1}-t^{*}_{i}).
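For concreteness, the following is a simplified illustrative sketch of this pipeline (not the exact implementation used for our figures): it assumes the layer format from the sketch in §3.2, locates activation-pattern changes on a fine grid, refines the boundary points by bisection rather than by the exact linear solve described above, and then sums the per-region contributions $||\mathbf{W}_{i}\mathbf{w}^{\ell}||\,(t^{*}_{i+1}-t^{*}_{i})$.

```python
# Simplified region-by-region length computation; within each linear region the
# network acts as a single linear map, so each region contributes ||W_i w|| * dt.
import numpy as np

def preactivations(params, x):
    """Returns the hidden-layer activation masks and the network output at x."""
    masks = []
    for i, (W, b) in enumerate(params):
        z = W @ x + b
        if i < len(params) - 1:
            masks.append(z >= 0)
            x = np.maximum(z, 0.0)
        else:
            x = z
    return masks, x

def pattern(params, x):
    masks, _ = preactivations(params, x)
    return tuple(np.concatenate(masks).tolist())

def region_matrix(params, masks):
    """Weight matrix of the linear map the network computes on a region."""
    W_total = np.diag(masks[0].astype(float)) @ params[0][0]
    for (W, _), m in zip(params[1:-1], masks[1:]):
        W_total = np.diag(m.astype(float)) @ W @ W_total
    return params[-1][0] @ W_total

def exact_length_distortion(params, p1, p2, grid=4096, refine=40):
    w = p2 - p1
    point = lambda t: p1 + t * w
    ts, breakpoints = np.linspace(0.0, 1.0, grid), [0.0]
    for a, b in zip(ts[:-1], ts[1:]):
        if pattern(params, point(a)) != pattern(params, point(b)):
            pat_a, lo, hi = pattern(params, point(a)), a, b
            for _ in range(refine):                 # bisect toward the boundary
                mid = 0.5 * (lo + hi)
                if pattern(params, point(mid)) == pat_a:
                    lo = mid
                else:
                    hi = mid
            breakpoints.append(hi)
    breakpoints.append(1.0)
    total = 0.0
    for a, b in zip(breakpoints[:-1], breakpoints[1:]):
        masks, _ = preactivations(params, point(0.5 * (a + b)))
        total += np.linalg.norm(region_matrix(params, masks) @ w) * (b - a)
    return total / np.linalg.norm(w)
```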

Appendix C Proof of Theorem 5.1

Our first step in proving Theorem 5.1 is to obtain bounds on the moments of $\operatorname{len}(\mathcal{N}(M))$ in terms of the input-output Jacobian of $\mathcal{N}$ at a single point, which we recall was defined in (6.4). To do so, we recall the notation from §6. Namely, fix a smooth unit speed parameterization of $M=\gamma([0,1])$ with

\gamma:\mathbb{R}\rightarrow\mathbb{R}^{n_{0}},\qquad\gamma(t)=\left(\gamma_{1}(t),\ldots,\gamma_{n_{0}}(t)\right).

The mapping

\Gamma:=\mathcal{N}\circ\gamma,\qquad\Gamma:\mathbb{R}\rightarrow\mathbb{R}^{n_{L}}

gives a parameterization of the curve $\mathcal{N}(M)$, and we have

\operatorname{len}(\mathcal{N}(M))=\int_{0}^{1}\left|\left|\Gamma^{\prime}(t)\right|\right|dt.\tag{C.1}

Let us note an important but ultimately benign technicality. The Jacobian $J_{x}$ of the map $x\mapsto\mathcal{N}(x)$ is not defined at every $x$ (namely, at those $x$ where some neuron turns from on to off). Thus, a priori, $\Gamma^{\prime}(t)$ exists only at those $t$ for which $J_{\gamma(t)}$ exists. However, the formula (C.1) is still valid. Indeed, for any setting of the weights and biases of $\mathcal{N}$, the map $\Gamma$ is Lipschitz. Thus, by Rademacher's theorem, $\Gamma^{\prime}(t)$ exists for almost every $t\in[0,1]$, and the length of the curve is given by integrating the norm of this almost-everywhere defined derivative. The following simple lemma is a generalization of Lemma 6.1 and allows us to bound all moments of $\operatorname{len}(\mathcal{N}(M))$ in terms of the moments of the norm of the Jacobian-vector product $\left|\left|J_{x}u\right|\right|$.

Lemma C.1.

For any integer $m\geq 1$, we have

\mathbb{E}\left[\operatorname{len}(\mathcal{N}(M))^{m}\right]\leq\int_{0}^{1}\mathbb{E}\left[\left|\left|J_{\gamma(t)}\gamma^{\prime}(t)\right|\right|^{m}\right]dt.\tag{C.2}
Proof.

Taking powers in (C.1) and using Tonelli’s theorem to interchange the expectation and integrals, we obtain

\mathbb{E}\left[\operatorname{len}(\mathcal{N}(M))^{m}\right]=\int_{0}^{1}\cdots\int_{0}^{1}\mathbb{E}\left[\prod_{j=1}^{m}\left|\left|\Gamma^{\prime}(t_{j})\right|\right|\right]dt_{1}\cdots dt_{m}.\tag{C.3}

To proceed, let us recall a special case of the power mean inequality, which says that for any $a_{i}\geq 0$ we have

\prod_{i=1}^{m}a_{i}\leq\frac{1}{m}\sum_{i=1}^{m}a_{i}^{m}.

Applying this to the integrand in (C.3), we conclude

\mathbb{E}\left[\operatorname{len}(\mathcal{N}(M))^{m}\right]\leq\int_{0}^{1}\mathbb{E}\left[\left|\left|\Gamma^{\prime}(t)\right|\right|^{m}\right]dt.\tag{C.4}

Next, fix $t\in[0,1]$. For any neuron $z$ in $\mathcal{N}$, denote by $x\mapsto z(x)$ its pre-activation. Note that $\mathbb{P}(z(\gamma(t))=0)=0$, since our bias variance $C_{b}$ is set to some fixed positive constant and hence the bias of each neuron has a continuous density. Therefore, with probability $1$ over the randomness in the weights and biases of $\mathcal{N}$, there exists a neighborhood $\mathcal{U}$ of $\gamma(t)$ on which $z(x)$ has constant sign for all $x\in\mathcal{U}$. The Jacobian $J_{\gamma(t)}$ is therefore well-defined and the chain rule yields

\Gamma^{\prime}(t)=J_{\gamma(t)}\gamma^{\prime}(t).\tag{C.5}

Substituting this into (C.4) completes the proof. ∎

Having obtained the expression (C.2), we have reduced bounding the moments of $\operatorname{len}(\mathcal{N}(M))$ to bounding the moments of the random variable $\left|\left|J_{x}u\right|\right|$ for a fixed $x\in\mathbb{R}^{n_{0}}$ and unit vector $u\in\mathbb{R}^{n_{0}}$. Prior work (e.g. Thm. 1 in [12]) shows that these moments satisfy

\mathbb{E}\left[\left|\left|J_{x}u\right|\right|^{2m}\right]=\left(\frac{n_{L}}{n_{0}}\right)^{m}\exp\left(5\binom{m}{2}\sum_{\ell=1}^{L-1}n_{\ell}^{-1}+O\left(\sum_{\ell=1}^{L-1}n_{\ell}^{-2}\right)\right),

provided

\binom{m}{2}<\min_{\ell=1,\ldots,L-1}n_{\ell}.\tag{C.6}

Substituting these moment estimates into (C.2) completes the derivation of (5.3). However, the results in [12] are subtle because they apply to any distribution of weights and biases. They also give the slightly sub-optimal restriction (C.6) that $m^{2}$ must be smaller than a constant times the minimum of the $n_{\ell}$'s. In the special case where the distribution of weights and biases is Gaussian, we can do slightly better by computing the moments of $\left|\left|J_{x}u\right|\right|$ more directly (note that in the statement of Theorem 5.1, we required only that $m$ is smaller than a constant times the minimum of the $n_{\ell}$'s). We will need this anyway to derive the slightly more refined estimates in (5.1) and (5.2). Let us therefore check that, in distribution, $\left|\left|J_{x}u\right|\right|$ is equal to a product of independent random variables:

Proposition C.2.

For any $x\in\mathbb{R}^{n_{0}}$ and any unit vector $u\in\mathbb{R}^{n_{0}}$, the random variable $\left|\left|J_{x}u\right|\right|^{2}$ is equal in distribution to a product of independent scaled chi-squared random variables

\left|\left|J_{x}u\right|\right|^{2}\stackrel{d}{=}\frac{n_{L}}{n_{0}}\left(\prod_{\ell=1}^{L-1}\frac{2}{n_{\ell}}\chi_{{\bf n}_{\ell}}^{2}\right)\cdot\frac{1}{n_{L}}\chi_{n_{L}}^{2},

where the number of degrees of freedom in the $\ell$th term of the product (for $\ell=1,\ldots,L-1$) is given by an independent binomial

{\bf n}_{\ell}\stackrel{d}{=}\mathrm{Bin}\left(n_{\ell},1/2\right)

with $n_{\ell}$ trials and success probability $1/2$. The number of degrees of freedom in the final term is deterministic and given by $n_{L}$.

Proof.

Consider a ReLU network $\mathcal{N}$ with input dimension $n_{0}$, output dimension $n_{L}$, and hidden layer widths $n_{1},\ldots,n_{L-1}$. Suppose the weight matrices $W^{(\ell)}$ and bias vectors $b^{(\ell)}$ are independent with i.i.d. Gaussian components:

W_{ij}^{(\ell)}\sim G(0,2/n_{\ell-1}),\qquad b_{j}^{(\ell)}\sim G(0,C_{b}),

where $C_{b}>0$ is any fixed constant. For a fixed network input $x\in\mathbb{R}^{n_{0}}$, with probability $1$, the input-output Jacobian $J_{x}$ is well-defined. Moreover, it can be written as

J_{x}=W^{(L)}D^{(L-1)}W^{(L-1)}\cdots D^{(1)}W^{(1)},

where $W^{(\ell)}$ is the matrix of weights from layer $\ell-1$ to layer $\ell$ and $D^{(\ell)}$ is an $n_{\ell}\times n_{\ell}$ diagonal matrix:

D^{(\ell)}=\mathrm{Diag}\left({\bf 1}_{\left\{z_{i}^{(\ell)}\geq 0\right\}},\ i=1,\ldots,n_{\ell}\right)

whose diagonal entries are 0 or 1 depending on whether the pre-activation $z_{i}^{(\ell)}$ of neuron $i$ in layer $\ell$ is positive at our fixed input $x$. Next, according to Proposition 2 in [12], the marginal distribution of each $D^{(\ell)}$ is that its diagonal entries are independent $\text{Bernoulli}(1/2)$ random variables. Moreover, we have the following equality in distribution:

J_{x}\stackrel{d}{=}\eta\,W^{(L)}\mathcal{D}^{(L-1)}W^{(L-1)}\cdots\mathcal{D}^{(1)}W^{(1)},

where the $\mathcal{D}^{(\ell)}$ are independent of each other (and of the $W^{(\ell)}$), resampled from the marginal distribution of $D^{(\ell)}$, and $\eta$ is an independent diagonal matrix with independent diagonal entries that are $\pm 1$ with equal probability. In particular, for a fixed unit vector $u\in\mathbb{R}^{n_{0}}$, we have

\left|\left|J_{x}u\right|\right|\stackrel{d}{=}\left|\left|W^{(L)}\mathcal{D}^{(L-1)}W^{(L-1)}\cdots\mathcal{D}^{(1)}W^{(1)}u\right|\right|.

We may rewrite this as

W(L)𝒟(L1)W(L1)𝒟(2)W(2)u(1)𝒟(1)W(1)u,||W^{(L)}\mathcal{D}^{(L-1)}W^{(L-1)}\cdots\mathcal{D}^{(2)}W^{(2)}u^{(1)}||~{}||\mathcal{D}^{(1)}W^{(1)}u||, (C.7)

for u(1):=𝒟(1)W(1)u𝒟(1)W(1)uu^{(1)}:=\frac{\mathcal{D}^{(1)}W^{(1)}u}{||\mathcal{D}^{(1)}W^{(1)}u||}, where we interpret the expression (C.7) as equal to 0 if 𝒟(1)\mathcal{D}^{(1)} is the zero matrix. To complete the proof, we need the following standard observation.

Lemma C.3.

Suppose WW is an n×nn\times n^{\prime} matrix with i.i.d. Gaussian entries and uu is a random unit vector in n\mathbb{R}^{n^{\prime}} that is independent of WW but otherwise has any distribution. Then WuWu is independent of uu and is equal in distribution to WvWv where vv is any fixed unit vector in n\mathbb{R}^{n^{\prime}}.

Proof.

For any fixed orthogonal matrix 𝒪\mathcal{O}, it is a standard fact that W𝒪W\mathcal{O} is equal in distribution to WW. Thus, for any measurable sets AnA\subseteq\mathbb{R}^{n} and BnB\subseteq\mathbb{R}^{n^{\prime}}, since u,Wu,W are independent we have

(WuA,uB)\displaystyle\mathbb{P}\left(Wu\in A,u\in B\right) =Sn1(Wu0A)𝑑u(u0),\displaystyle=\int_{S^{n^{\prime}-1}}\mathbb{P}\left(Wu_{0}\in A\right)d\mathbb{P}_{u}(u_{0}),

where u(u0)\mathbb{P}_{u}(u_{0}) is the distribution of u.u. Fix any u0Sn1u_{0}\in S^{n^{\prime}-1} and let 𝒪=𝒪(u0)\mathcal{O}=\mathcal{O}(u_{0}) be any orthogonal matrix so that u0=𝒪e1u_{0}=\mathcal{O}e_{1} with e1=(1,0,,0)e_{1}=(1,0,\ldots,0) is the first standard unit vector. Then, since W𝒪W\mathcal{O} is equal in distribution to WW, we have

(Wu0A)=(We1A),\mathbb{P}\left(Wu_{0}\in A\right)=\mathbb{P}\left(We_{1}\in A\right),

which is independent of u0u_{0}. We therefore find

(WuA,uB)\displaystyle\mathbb{P}\left(Wu\in A,u\in B\right) =(We1A)(uB),\displaystyle=\mathbb{P}\left(We_{1}\in A\right)\mathbb{P}(u\in B),

as desired. ∎

We are now in a position to complete the proof of Proposition C.2. Combining Lemma C.3 with (C.7), we find that, in distribution Jxu\left|\left|J_{x}u\right|\right| equals

W(L)𝒟(L1)W(L1)𝒟(2)W(2)u𝒟(1)W(1)u.||W^{(L)}\mathcal{D}^{(L-1)}W^{(L-1)}\cdots\mathcal{D}^{(2)}W^{(2)}u||\,||\mathcal{D}^{(1)}W^{(1)}u||. (C.8)

Note that these two terms are independent. Repeating this argument, we obtain that Jxu\left|\left|J_{x}u\right|\right| is equal in distribution to the product

W(L)u𝒟(L1)W(L1)u𝒟(1)W(1)u.||W^{(L)}u||\,||\mathcal{D}^{(L-1)}W^{(L-1)}u||\cdots||\mathcal{D}^{(1)}W^{(1)}u||. (C.9)

The terms in this product are independent. To complete the proof note that for =1,,L1\ell=1,\ldots,L-1 the number of non-zero entries in 𝒟()\mathcal{D}^{(\ell)} is a binomial random variable 𝐧{\bf n}_{\ell} with nn_{\ell} trials, each with probability of success 1/21/2 and that

𝒟()W()u2||\mathcal{D}^{(\ell)}W^{(\ell)}u||^{2}

is precisely 2/n2/n_{\ell} times a sum of 𝐧{\bf n}_{\ell} squares of independent standard Gaussians. Thus, for =1,,L1\ell=1,\ldots,L-1,

𝒟()W()u2=d2n1χ𝐧2.||\mathcal{D}^{(\ell)}W^{(\ell)}u||^{2}\stackrel{{\scriptstyle d}}{{=}}\frac{2}{n_{\ell-1}}\chi_{{\bf n}_{\ell}}^{2}. (C.10)

Similarly

W(L)u2=d2nL1χnL2.||W^{(L)}u||^{2}\stackrel{{\scriptstyle d}}{{=}}\frac{2}{n_{L-1}}\chi_{n_{L}}^{2}. (C.11)

Substituting (C.10) and (C.11) into (C.9) completes the proof. ∎

Evaluating the moments of len(𝒩(M))\operatorname{len}(\mathcal{N}(M)) is now a matter of computing the moments of some scaled chi-squared random variables with a random number of degrees of freedom. For instance, recalling that

𝔼[χk2]=k\mathbb{E}\left[\chi_{k}^{2}\right]=k

and applying Lemma C.2 as well as the tower property of the expectation we find

𝔼[Jxu2]\displaystyle\mathbb{E}\left[\left|\left|J_{x}u\right|\right|^{2}\right] =nLn0(=1L1𝔼[2nχ𝐧2])𝔼[nL1χnL2]\displaystyle=\frac{n_{L}}{n_{0}}\left(\prod_{\ell=1}^{L-1}\mathbb{E}\left[\frac{2}{n_{\ell}}\chi_{{\bf n}_{\ell}}^{2}\right]\right)\mathbb{E}\left[n_{L}^{-1}\chi_{n_{L}}^{2}\right]
=nLn0=1L1{2n𝔼[𝔼[χ𝐧2|𝐧]]}\displaystyle=\frac{n_{L}}{n_{0}}\prod_{\ell=1}^{L-1}\left\{\frac{2}{n_{\ell}}\mathbb{E}\left[\mathbb{E}\left[\chi_{{\bf n}_{\ell}}^{2}~{}|~{}{\bf n_{\ell}}\right]\right]\right\}
=nLn0=1L1{2n𝔼[𝐧]}\displaystyle=\frac{n_{L}}{n_{0}}\prod_{\ell=1}^{L-1}\left\{\frac{2}{n_{\ell}}\mathbb{E}\left[{\bf n}_{\ell}\right]\right\}
=nLn0.\displaystyle=\frac{n_{L}}{n_{0}}.

Substituting this into (C.2) with m=2m=2 yields

Var[len(𝒩(M))]nLn0,\mathrm{Var}\left[\operatorname{len}(\mathcal{N}(M))\right]\leq\frac{n_{L}}{n_{0}},

yielding (5.2). To prove (5.1), we need to estimate 𝔼[len𝒩(M)]\mathbb{E}\left[\operatorname{len}{\mathcal{N}(M)}\right]. By taking expectations in (C.1) and using (C.5), we find

𝔼[len𝒩(M)]=01𝔼[Jγ(t)γ(t)]𝑑t=𝔼[Jxu],\mathbb{E}\left[\operatorname{len}{\mathcal{N}(M)}\right]=\int_{0}^{1}\mathbb{E}\left[\left|\left|J_{\gamma(t)}\gamma^{\prime}(t)\right|\right|\right]dt=\mathbb{E}\left[\left|\left|J_{x}u\right|\right|\right],

where we’ve used that by Lemma C.2, the distribution of Jxu\left|\left|J_{x}u\right|\right| is the same for every xn0x\in\mathbb{R}^{n_{0}} and every unit vector uu. Moreover, again using Lemma C.2, we see that 𝔼[Jxu]\mathbb{E}\left[\left|\left|J_{x}u\right|\right|\right] equals

(n0nL)1/2=1L1𝔼[(2nχ𝐧2)1/2]𝔼[(1nLχnL2)1/2].\left(\frac{n_{0}}{n_{L}}\right)^{1/2}\prod_{\ell=1}^{L-1}\mathbb{E}\left[\left(\frac{2}{n_{\ell}}\chi_{{\bf n}_{\ell}}^{2}\right)^{1/2}\right]\mathbb{E}\left[\left(\frac{1}{n_{L}}\chi_{n_{L}}^{2}\right)^{1/2}\right].

To simplify this, a direct computation shows that

𝔼[(k1χk2)1/2]=Γ(k+12)Γ(k2)(k2)1/2,\mathbb{E}\left[\left(k^{-1}\chi_{k}^{2}\right)^{1/2}\right]=\frac{\Gamma\left(\frac{k+1}{2}\right)}{\Gamma\left(\frac{k}{2}\right)\left(\frac{k}{2}\right)^{1/2}},

where Γ()\Gamma(\cdot) is the Gamma function. Next, for any positive random variable XX with 𝔼[X]=1\mathbb{E}\left[X\right]=1 we have

𝔼[X1/2]\displaystyle\mathbb{E}\left[X^{1/2}\right] =𝔼[(1+(X1))1/2]\displaystyle=\mathbb{E}\left[(1+(X-1))^{1/2}\right]
=118Var[X]+O(|X1|3).\displaystyle=1-\frac{1}{8}\mathrm{Var}[X]+O(|X-1|^{3}).

Recall that

Var[χk2]=2k.\mathrm{Var}[\chi_{k}^{2}]=2k.

Using this and the law of total variance yields

𝔼[(2nχ𝐧2)1/2]\displaystyle\mathbb{E}\left[\left(\frac{2}{n_{\ell}}\chi_{{\bf n}_{\ell}}^{2}\right)^{1/2}\right] =112n2Var[χ𝐧2]+O(n2)\displaystyle=1-\frac{1}{2n_{\ell}^{2}}\mathrm{Var}[\chi_{{\bf n}_{\ell}}^{2}]+O(n_{\ell}^{-2})
=112n2[𝔼[Var[χ𝐧2|𝐧]]\displaystyle=1-\frac{1}{2n_{\ell}^{2}}\bigg{[}\mathbb{E}\left[\mathrm{Var}[\chi_{{\bf n}_{\ell}}^{2}~{}|~{}{\bf n}_{\ell}]\right]
+Var[𝔼[χ𝐧2|𝐧]]]+O(n2)\displaystyle\quad+\mathrm{Var}[\mathbb{E}\left[\chi_{{\bf n}_{\ell}}^{2}~{}|~{}{\bf n}_{\ell}\right]]\bigg{]}+O(n_{\ell}^{-2})
=112n2[𝔼[2𝐧]+Var[𝐧]]+O(n2)\displaystyle=1-\frac{1}{2n_{\ell}^{2}}\left[\mathbb{E}\left[2{\bf n}_{\ell}\right]+\mathrm{Var}[{\bf n}_{\ell}]\right]+O(n_{\ell}^{-2})
=112n2[n+n4]+O(n2)\displaystyle=1-\frac{1}{2n_{\ell}^{2}}\left[n_{\ell}+\frac{n_{\ell}}{4}\right]+O(n_{\ell}^{-2})
=158n+O(n2).\displaystyle=1-\frac{5}{8n_{\ell}}+O(n_{\ell}^{-2}).

Thus, we find that 𝔼[len(𝒩(M))]\mathbb{E}\left[\operatorname{len}(\mathcal{N}(M))\right] equals

(nLn0)1/2Γ(nL+12)Γ(nL2)(nL2)1/2×exp[58=1L1n1+O(=1L1n2)],\left(\frac{n_{L}}{n_{0}}\right)^{1/2}\frac{\Gamma\left(\frac{n_{L}+1}{2}\right)}{\Gamma\left(\frac{n_{L}}{2}\right)\left(\frac{n_{L}}{2}\right)^{1/2}}\times\exp\left[-\frac{5}{8}\sum_{\ell=1}^{L-1}n_{\ell}^{-1}+O\left(\sum_{\ell=1}^{L-1}n_{\ell}^{-2}\right)\right],

as claimed. Finally, let us check the higher moment estimates (5.3). By Lemma C.2 we have the following estimate

Jxu2\displaystyle\left|\left|J_{x}u\right|\right|^{2} =nLn0(=1L12nχ𝐧2)nL1χnL2\displaystyle=\frac{n_{L}}{n_{0}}\left(\prod_{\ell=1}^{L-1}\frac{2}{n_{\ell}}\chi_{{\bf n}_{\ell}}^{2}\right)n_{L}^{-1}\chi_{n_{L}}^{2}
nLn0exp[=1L1{2nχ𝐧21}]nL1χnL2,\displaystyle\leq\frac{n_{L}}{n_{0}}\exp\left[\sum_{\ell=1}^{L-1}\left\{\frac{2}{n_{\ell}}\chi_{{\bf n}_{\ell}}^{2}-1\right\}\right]n_{L}^{-1}\chi_{n_{L}}^{2},

where we used that x=x1+1ex1.x=x-1+1\leq e^{x-1}. For any mm, we have

𝔼[(1nLχnL2)m]\displaystyle\mathbb{E}\left[\left(\frac{1}{n_{L}}\chi_{n_{L}}^{2}\right)^{m}\right] =(1+2nL)(1+2m2nL)\displaystyle=\left(1+\frac{2}{n_{L}}\right)\cdots\left(1+\frac{2m-2}{n_{L}}\right)
exp[j=0m12jnL]exp[m2nL].\displaystyle\leq\exp\left[\sum_{j=0}^{m-1}\frac{2j}{n_{L}}\right]\leq\exp\left[\frac{m^{2}}{n_{L}}\right].

Therefore, Jxu2m\left|\left|J_{x}u\right|\right|^{2m} is bounded above by

(nLn0)mexp[m=1L1{2nχ𝐧21}+m2nL].\displaystyle\left(\frac{n_{L}}{n_{0}}\right)^{m}\exp\left[m\sum_{\ell=1}^{L-1}\left\{\frac{2}{n_{\ell}}\chi_{{\bf n}_{\ell}}^{2}-1\right\}+\frac{m^{2}}{n_{L}}\right]. (C.12)

Finally, for any fixed positive integer nn we may write

2nχ𝐧21=2nk=1n{ξkZk212},\frac{2}{n}\chi_{{\bf n}}^{2}-1=\frac{2}{n}\sum_{k=1}^{n}\left\{\xi_{k}Z_{k}^{2}-\frac{1}{2}\right\},

where ξk\xi_{k} are independent Bernoulli(1/2)\text{Bernoulli}(1/2) random variables and ZkZ_{k} are independent standard Gaussians. A direct computation shows that the centered random variables ξkZk212\xi_{k}Z_{k}^{2}-\frac{1}{2} are sub-exponential SE(ν2,α)\text{SE}(\nu^{2},\alpha) with some universal parameters ν2,α\nu^{2},\alpha. Thus, by the stability of sub-exponential random variables under summation we have

2nχ𝐧21SE(4ν2n,2αn).\frac{2}{n_{\ell}}\chi_{{\bf n}_{\ell}}^{2}-1\in\text{SE}\left(\frac{4\nu^{2}}{n_{\ell}},\frac{2\alpha}{n_{\ell}}\right).

Again using this property we conclude

=1L12nχ𝐧21SE(4ν2=1L11n,2αn),\sum_{\ell=1}^{L-1}\frac{2}{n_{\ell}}\chi_{{\bf n}_{\ell}}^{2}-1\in\text{SE}\left(4\nu^{2}\sum_{\ell=1}^{L-1}\frac{1}{n_{\ell}},\frac{2\alpha}{n_{*}}\right),

where

n=min{n1,,nL1}.n_{*}=\min\left\{n_{1},\ldots,n_{L-1}\right\}.

Therefore, by (C.12), we find that

𝔼[Jxu2m](nLn0)m𝔼[emY]exp[m2nL],\mathbb{E}\left[\left|\left|J_{x}u\right|\right|^{2m}\right]\leq\left(\frac{n_{L}}{n_{0}}\right)^{m}\mathbb{E}\left[e^{mY}\right]\exp\left[\frac{m^{2}}{n_{L}}\right],

where YSE(4ν2=1L11n,2αn)Y\in\text{SE}\left(4\nu^{2}\sum_{\ell=1}^{L-1}\frac{1}{n_{\ell}},\frac{2\alpha}{n_{*}}\right). By definition of sub-exponential random variables, we have

𝔼[emY]exp[4m2ν2=1L11n],\mathbb{E}\left[e^{mY}\right]\leq\exp\left[4m^{2}\nu^{2}\sum_{\ell=1}^{L-1}\frac{1}{n_{\ell}}\right],

provided m<n/2αm<n_{*}/2\alpha for some universal constant c>0c>0. This completes the derivation of (5.3) and completes the proof of Theorem 5.1. \square

Appendix D Proof of Theorem 5.2

The proof of Theorem 5.2 follows closely the argument for Theorem 5.1. For a given input xn0x\in\mathbb{R}^{n_{0}}, we will continue to write

J𝒩(x):=(xj𝒩i(x)){1jn01inL}J_{\mathcal{N}}(x):=\left(\partial_{x_{j}}\mathcal{N}_{i}(x)\right)_{\left\{\begin{subarray}{c}1\leq j\leq n_{0}\\ 1\leq i\leq n_{L}\end{subarray}\right\}}

for the input-output Jacobian of the network function, which exists for Lebesgue almost every xn0x\in\mathbb{R}^{n_{0}}. We will write ΠTxM:n0TxM\Pi_{T_{x}M}:\mathbb{R}^{n_{0}}\rightarrow T_{x}M for the orthogonal projection onto this tangent space. We have

vold(𝒩(M))=M(det(ΠTxMJ𝒩(x)TJ𝒩(x)ΠTxM))1/2vold(dx),\operatorname{vol}_{d}(\mathcal{N}(M))=\int_{M}\left(\det\left(\Pi_{T_{x}M}J_{\mathcal{N}}(x)^{T}J_{\mathcal{N}}(x)\Pi_{T_{x}M}\right)\right)^{1/2}\operatorname{vol}_{d}(dx),

where vold\operatorname{vol}_{d} is the dd-dimensional Hausdorff measure. Indeed, the integrand measures, at each xMx\in M, the volume of the image of an infinitesimal cube on MM of volume vold(dx)\operatorname{vol}_{d}(dx) centered at xx under the map x𝒩(x)x\mapsto\mathcal{N}(x). Arguing precisely as in the proof of Lemma C.1, we find that for any integer m1m\geq 1

𝔼[vold(𝒩(M))m]M𝔼[det(ΠTxMJ𝒩(x)TJ𝒩(x)ΠTxM)]m/2vold(dx)\mathbb{E}\left[\operatorname{vol}_{d}(\mathcal{N}(M))^{m}\right]\leq\int_{M}\mathbb{E}\left[\det\left(\Pi_{T_{x}M}J_{\mathcal{N}}(x)^{T}J_{\mathcal{N}}(x)\Pi_{T_{x}M}\right)\right]^{m/2}\operatorname{vol}_{d}(dx) (D.1)

Fix xMx\in M and denote by

e1(x),,ed(x)e_{1}(x),\ldots,e_{d}(x)

an orthonormal basis of the tangent space of MM. Then, by the Gram identity, we may write

det(ΠTxMJ𝒩(x)TJ𝒩(x)ΠTxM)=J𝒩(x)e1(x)J𝒩(x)ed(x)2,\det\left(\Pi_{T_{x}M}J_{\mathcal{N}}(x)^{T}J_{\mathcal{N}}(x)\Pi_{T_{x}M}\right)=\left|\left|J_{\mathcal{N}}(x)e_{1}(x)\wedge\cdots\wedge J_{\mathcal{N}}(x)e_{d}(x)\right|\right|^{2}, (D.2)

where we recall that the wedge product is the anti-symmetrization of the tensor product. Just as in the proof of Theorem 5.1, the key observation is that for Gaussian weights we have that J𝒩(x)e1(x)J𝒩(x)ed(x)\left|\left|J_{\mathcal{N}}(x)e_{1}(x)\wedge\cdots\wedge J_{\mathcal{N}}(x)e_{d}(x)\right|\right| is a product of i.i.d. random variables. The formal statement is the following:

Proposition D.1.

For any xn0x\in\mathbb{R}^{n_{0}} and collection of orthonormal unit vectors u1,,udn0u_{1},\ldots,u_{d}\in\mathbb{R}^{n_{0}}, the random variable

Jxu1Jxud2\left|\left|J_{x}u_{1}\wedge\cdots\wedge J_{x}u_{d}\right|\right|^{2}

is equal in distribution to a product of independent scaled chi-squared random variables:

(nLn0)d(=1L1j=1d2nχ𝐧j+12)j=1d1nLχ𝐧𝐋j+12,\left(\frac{n_{L}}{n_{0}}\right)^{d}\left(\prod_{\ell=1}^{L-1}\prod_{j=1}^{d}\frac{2}{n_{\ell}}\chi_{{\bf n_{\ell}}-j+1}^{2}\right)\cdot\prod_{j=1}^{d}\frac{1}{n_{L}}\chi_{{\bf n_{L}}-j+1}^{2},

where all the terms in the product are independent and for =1,,L1\ell=1,\ldots,L-1 we’ve written 𝐧{\bf n_{\ell}} for independent Binomial random variables:

𝐧=dBin(n,1/2){\bf n_{\ell}}\stackrel{{\scriptstyle d}}{{=}}\mathrm{Bin}\left(n_{\ell},1/2\right)

with nn_{\ell} trials and success probability 1/21/2.

Proof.

The proof is identical to that of Lemma C.2. The only difference is that we must invoke the following amplification of Lemma C.3:

Lemma D.2.

Let u1,,uknu_{1},\ldots,u_{k}\in\mathbb{R}^{n^{\prime}} be a collection of orthonormal vectors (i.e. a collection) with any distribution. Let WW be an independent n×nn\times n^{\prime} matrix with i.i.d. Gaussian entries. Then

W(u1uk)=Wu1WukW(u_{1}\wedge\cdots\wedge u_{k})=Wu_{1}\wedge\cdots\wedge Wu_{k}

is independent of u1uku_{1}\wedge\cdots\wedge u_{k} and is equal in distribution to Wv1WvkWv_{1}\wedge\cdots Wv_{k} where v1,,vkv_{1},\ldots,v_{k} is any fixed collection of orthonormal vectors.

Proof.

The proof is identical to that of Lemma C.3, except that we note that for that, given u1,,uku_{1},\ldots,u_{k}, there exists an orthogonal matrix 𝒪=𝒪(u1,,uk)\mathcal{O}=\mathcal{O}(u_{1},\ldots,u_{k}) so that

uj=𝒪ej,j=1,,ku_{j}=\mathcal{O}e_{j},\qquad j=1,\ldots,k

where eje_{j} are the standard basis vectors. ∎

With Lemma D.1 in hand, we complete the proof of Theorem 5.2 as follows. First, note that for any random variable XX we have

𝔼[X]𝔼[X2]1/2,Var[X]𝔼[X2].\mathbb{E}\left[X\right]\leq\mathbb{E}\left[X^{2}\right]^{1/2},\qquad\mathrm{Var}[X]\leq\mathbb{E}\left[X^{2}\right].

Hence, Theorem 5.2 will follow once we show that

𝔼[vold(𝒩(M))2](nLn0)dexp[(d2)=1Ln1].\mathbb{E}\left[\operatorname{vol}_{d}(\mathcal{N}(M))^{2}\right]\leq\left(\frac{n_{L}}{n_{0}}\right)^{d}\exp\left[-\binom{d}{2}\sum_{\ell=1}^{L}n_{\ell}^{-1}\right].

Combining Proposition D.1 with (D.1) and (D.2), this estimate follows by showing that

𝔼[Jxu1Jxud2](nLn0)dexp[(d2)=1Ln1].\mathbb{E}\left[\left|\left|J_{x}u_{1}\wedge\cdots\wedge J_{x}u_{d}\right|\right|^{2}\right]\leq\left(\frac{n_{L}}{n_{0}}\right)^{d}\exp\left[-\binom{d}{2}\sum_{\ell=1}^{L}n_{\ell}^{-1}\right]. (D.3)

To check this, recall that

𝔼[χk2]=k.\mathbb{E}\left[\chi_{k}^{2}\right]=k.

Hence,

𝔼[J𝒩(x)e1(x)J𝒩(x)ed(x)2]\displaystyle\mathbb{E}\left[\left|\left|J_{\mathcal{N}}(x)e_{1}(x)\wedge\cdots\wedge J_{\mathcal{N}}(x)e_{d}(x)\right|\right|^{2}\right] =(nLn0)d(=1L1j=1d2n𝔼[χ𝐧j+12])j=1d1nL𝔼[χ𝐧𝐋j+12]\displaystyle=\left(\frac{n_{L}}{n_{0}}\right)^{d}\left(\prod_{\ell=1}^{L-1}\prod_{j=1}^{d}\frac{2}{n_{\ell}}\mathbb{E}\left[\chi_{{\bf n_{\ell}}-j+1}^{2}\right]\right)\cdot\prod_{j=1}^{d}\frac{1}{n_{L}}\mathbb{E}\left[\chi_{{\bf n_{L}}-j+1}^{2}\right]
=(nLn0)d=1Lj=1d(1j1n)\displaystyle=\left(\frac{n_{L}}{n_{0}}\right)^{d}\prod_{\ell=1}^{L}\prod_{j=1}^{d}\left(1-\frac{j-1}{n_{\ell}}\right)
(nLn0)dexp[(d2)=1Ln1],\displaystyle\leq\left(\frac{n_{L}}{n_{0}}\right)^{d}\exp\left[-\binom{d}{2}\sum_{\ell=1}^{L}n_{\ell}^{-1}\right],

where in the last line we used that 1+xex1+x\leq e^{x}. This completes the proof. \square