Better Random Initialization in Deep Learning
1 Priors and Initialization in Deep Learning
We consider a deep linear network where h_ℓ denotes the output of the ℓ-th layer with initialization h_0 = x, a similar setup to [vladimirova2018understanding] except that we assume the network is linear for now. We have the recursion
h_ℓ = W_ℓ h_{ℓ−1} + b_ℓ,   ℓ = 1, 2, …,    (1)
where W_ℓ is the weight matrix of the ℓ-th layer and b_ℓ is the bias vector. We assume that each layer has the same number n of neurons, so that W_ℓ is a square matrix. The paper [vladimirova2018understanding] considers Gaussian initializations of the matrices W_ℓ and shows that the output of the ℓ-th layer (i.e. h_ℓ) has tails distributed as a sub-Weibull distribution with parameter θ = ℓ/2, i.e. with tails decaying like exp(−x^{2/ℓ}). For large ℓ, this tail is heavier than that of the exponential distribution; however, the result still shows superpolynomial decay of the tails; in particular, it does not show that the tails can have polynomial decay such as P(|X| ≥ x) ≈ x^{−α} for some α > 0. On the other hand, recent literature has found that the weights and the gradients in neural network training can often have distributions with polynomial decay.
Here, the purpose is to show that polynomial decay is possible when the number of layers is large enough, improving the results of [vladimirova2018understanding] to show that even heavier tails can arise. Similar to [vladimirova2018understanding], we assume that the network is initialized with a Gaussian distribution, i.e.
(W_ℓ)_{ij} ~ N(0, σ²/n)  i.i.d. over i, j and ℓ,
where σ > 0 is fixed and n is the dimension (number of neurons at each layer). In particular, for the stochastic recursion (1), the top Lyapunov exponent is explicitly known:
λ(σ, n) = ½ [ log(2σ²/n) + ψ(n/2) ]    (2)
(see [newman1986, eqn. (7)]), where ψ(x) = Γ′(x)/Γ(x) is the digamma function, which admits the asymptotic expansion
ψ(x) = log x − 1/(2x) − 1/(12x²) + O(1/x⁴)  as x → ∞.    (3)
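As a quick numerical check of (2) and (3) as reconstructed above, the short Python sketch below evaluates λ(σ, n) with scipy's digamma and compares it with the approximation ½ log σ² − 1/(2n) obtained from the expansion (3); the helper name lyapunov_gaussian is ours.

# Sketch: evaluate the reconstructed formula (2) for i.i.d. N(0, sigma^2/n)
# entries and compare with the approximation implied by the expansion (3).
import numpy as np
from scipy.special import digamma

def lyapunov_gaussian(sigma2, n):
    # lambda(sigma, n) = 0.5 * (log(2 sigma^2 / n) + psi(n / 2))
    return 0.5 * (np.log(2.0 * sigma2 / n) + digamma(n / 2.0))

for n in (10, 100, 1000):
    exact = lyapunov_gaussian(1.0, n)             # sigma^2 = 1
    approx = 0.5 * np.log(1.0) - 1.0 / (2.0 * n)  # 0.5*log(sigma^2) - 1/(2n)
    print(f"n={n:5d}  exact={exact:+.6f}  approx={approx:+.6f}")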
Note that λ(σ, n) is deterministic (see [newman1986, eqn. (7)]). Another issue we will focus on is the initialization. It has been noticed by researchers that if σ is too large, then the variance of h_ℓ can grow exponentially over the iterates. In particular, the authors of [he2015delving] argued that for a ReLU network, a sufficient condition to avoid exponential blow-up would be choosing a layer-dependent entry variance of the form σ_ℓ² = 2/n_ℓ, where n_ℓ is the number of neurons at layer ℓ. In our setup, we take n_ℓ = n for every layer ℓ. A proper initialization method should avoid growing the input exponentially as the number of layers grows. A natural question would be the following:
• If we take σ² = c for some constant c, would the network be stable? In other words, would the moments of h_ℓ be finite? If so, which moments will be finite; can we have, for instance, finite variance? How does the tail depend on the parameters? In particular, if the limit h_∞ exists and is heavy-tailed, it will allow us to explore the space more efficiently in the beginning of the SGD iterations.
The initialization proposed by [he2015delving] corresponds to σ² = 2 in our setting, where the factor 2 arises due to the fact that the ReLU activation is used in their setting. If Z is a random variable that is symmetric around the origin, the second moment of max(Z, 0) is half of the second moment of Z due to symmetry. In other words, applying ReLU to a symmetric random variable decreases its second moment by a factor of 2; that is why they can allow σ² = 2. However, for a linear network, if we follow the arguments of [he2015delving], then a sufficient condition to avoid blow-up over the layers is to choose σ² = 1. In fact, they argue that the choice σ² < 1 for a linear network would make the input exponentially small as a function of the number of layers. This is true if we choose the biases to be zero, along an argument by [kesten1973random]; however, if we choose the biases randomly, this does not have to be the case, as we will explain further. Also, we will prove that we can even choose σ² larger than 1 and yet avoid blow-up of the iterations.
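The variance-halving step above is easy to verify numerically; the snippet below (purely illustrative) checks that E[max(Z, 0)²] = E[Z²]/2 for a symmetric Z.

# Monte Carlo check that ReLU halves the second moment of a symmetric variable.
import numpy as np

z = np.random.default_rng(0).standard_normal(5_000_000)
print(np.mean(np.maximum(z, 0.0) ** 2), 0.5 * np.mean(z ** 2))  # both close to 0.5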
Lemma 1.
Consider a linear network with infinitely many layers where each layer has width n. Suppose we initialize the network with weights
(W_ℓ)_{ij} ~ N(0, σ²/n)  i.i.d.,
and the biases b_ℓ i.i.d., sampled from a distribution that takes values in ℝⁿ. If λ(σ, n) < 0 and E[ log⁺ ‖b_1‖ ] < ∞, then the Markov chain determining the law of the iterates admits a unique stationary distribution, which is given by the law of the almost surely convergent series
h_∞ = Σ_{k=1}^{∞} W_1 W_2 ⋯ W_{k−1} b_k,    (4)
where W_1 W_2 ⋯ W_{k−1} denotes the product of the first k − 1 weight matrices, with the convention that the empty product (the k = 1 term) is the identity matrix I_n.
Proof.
This is a well known result, see e.g. the introduction of [alsmeyer2012tail]. ∎
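To illustrate Lemma 1, the sketch below simulates the recursion (1) with Gaussian weights and i.i.d. Gaussian biases (the bias scale 0.1 is an arbitrary illustrative choice); when λ(σ, n) < 0, the norm of the iterates settles into a stationary regime rather than exploding or collapsing to zero.

# Simulate h_l = W_l h_{l-1} + b_l with (W_l)_{ij} ~ N(0, sigma^2/n); with a
# negative Lyapunov exponent the iterates reach a stationary regime.
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, depth = 100, 1.0, 300
h = rng.standard_normal(n)
norms = []
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(sigma2 / n), size=(n, n))
    b = 0.1 * rng.standard_normal(n)      # illustrative bias distribution
    h = W @ h + b
    norms.append(np.linalg.norm(h))
print("median ||h_l|| over the last 100 layers:", np.median(norms[-100:]))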
Remark 2.
An analogue of this lemma is also available for Lipschitz maps, see [ELTON199039]. This can also handle the ReLU case.
Remark 3.
One can also compute all the moments of h_∞ explicitly and determine for which choices of the parameters which moments blow up.
Theorem 4.
In the setting of Lemma 1, assume that we choose σ such that it satisfies
λ(σ, n) = ½ [ log(2σ²/n) + ψ(n/2) ] < 0.    (5)
Then the Markov chain determining the law of the iterates admits a unique stationary distribution that can be expressed by the almost surely convergent series (4). Furthermore, if we choose σ such that
σ² = 1 + c/n  for some positive constant c < 1,    (6)
then the condition (5) will be satisfied for every n large enough. Furthermore, for 0 < c < 1, the limit h_∞ is heavy tailed in the sense that its second moment is infinite. However, if instead c > 1, then ‖h_ℓ‖ goes to infinity with probability one for n large enough.
Remark 5.
If we take b_ℓ = 0 for every ℓ, we see from (4) that the limit has to be necessarily zero almost surely. However, the limit can be non-trivial if we choose non-zero random biases; nevertheless, the limit will be independent of the initialization h_0 = x [kesten1973random, Thm 6].
Proof.
The proof follows from the formula (5) and Lemma 1. The fact that the choice (6) satisfies (5) is due to the expansion (3). If c > 1, then we have λ(σ, n) > 0 for n large enough and the results follow from [cohen1984stability, Prop 2.1]. From a computation similar to the one used to derive (14), we see that E[ ‖W_ℓ h‖² ] = σ² ‖h‖²; hence if σ² = 1 the second moment of the iterates is constant (for zero biases), and in particular σ² > 1 means that the variance of h_ℓ will blow up. This implies that the stationary distribution also has infinite variance. ∎
1.1 Numerical Experiments for Gaussian Initialization
The website Example Python Code has some example code for testing the stability of the initializations. Let us try the initialization suggested by (6) and compare it with the Kaiming initialization after 100 layers; we should also compare the performance on the CNN tasks reported on that web page.
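A minimal version of the experiment described above (not the code from the linked page) is sketched below: it propagates a unit-norm input through 100 layers and reports the empirical per-layer growth rate of log ‖h_ℓ‖ for the initialization σ² = 1 + c/n of (6) (with an illustrative c) and for the Kaiming/He choice σ² = 2; the CNN comparison is not reproduced here.

# Sketch: per-layer growth rate of log||h_l|| after 100 layers for the
# initialization sigma^2 = 1 + c/n of (6) versus Kaiming/He (sigma^2 = 2).
import numpy as np

def growth_rate(sigma2, n=256, depth=100, relu=False, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(n)
    h /= np.linalg.norm(h)
    for _ in range(depth):
        h = rng.normal(0.0, np.sqrt(sigma2 / n), size=(n, n)) @ h
        if relu:
            h = np.maximum(h, 0.0)
    return np.log(np.linalg.norm(h)) / depth

n, c = 256, 0.5                     # c is an illustrative constant
print("linear, sigma^2 = 1 + c/n:", growth_rate(1.0 + c / n, n))
print("linear, sigma^2 = 2 (He) :", growth_rate(2.0, n))
print("ReLU,   sigma^2 = 2 (He) :", growth_rate(2.0, n, relu=True))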
1.2 Network initialization with symmetric alpha stable distributions
What happens if we initialize the network with a different distribution that may have heavier tails than the Gaussian? For example, recent research shows that the weights of the network may demonstrate heavy-tailed behavior that can be modeled with a symmetric alpha-stable distribution (SαS), whose characteristic function scales like exp(−σ^α |t|^α), where σ is a scale parameter, recovering the Gaussian case when α = 2. If we initialize the network weights as (W_ℓ)_{ij} ~ SαS(σ) i.i.d. with a fixed choice of α ∈ (0, 2), how can we choose the scale parameter σ as a function of n? Would the scaling that works in the Gaussian case still work, or would such an initialization lead to exponential blow-up?
A random variable X is called a standard symmetric stable random variable with tail index α ∈ (0, 2], denoted X ~ SαS(1), if
E[ exp(i t X) ] = exp( −|t|^α )  for every t ∈ ℝ.
For a vector x ∈ ℝⁿ, we also introduce the notation
‖x‖_α := ( Σ_{i=1}^{n} |x_i|^α )^{1/α}.
We say that a random variable X ~ SαS(σ) if X/σ is a standard symmetric stable random variable.
Theorem 6.
For a given α ∈ (0, 2), assume that we initialize the network weights for a deep linear network as
(W_ℓ)_{ij} ~ SαS(σ_n)  i.i.d.
for every i, j and every layer ℓ, while setting b_ℓ = 0 for every ℓ. If
lim sup_{n→∞} σ_n / σ_α(n) < 1,
then ‖h_ℓ‖ converges to 0 with probability one for n large enough, whereas if
lim inf_{n→∞} σ_n / σ_α(n) > 1,
then, for n large enough, ‖h_ℓ‖ goes to infinity with probability one, where
σ_α(n) := exp( −(1/α) E[ log( |S_1|^α + ⋯ + |S_n|^α ) ] )    (7)
and S_1, …, S_n are i.i.d. standard symmetric α-stable random variables.
Proof.
This is basically a restatement of [cohen1984stability, Theorem 2.7]. ∎
A consequence of this theorem is that it is appropriate to set σ_n = σ_α(n) in order to avoid the exponential blow-up or the exponential decay of the dependency on the input when the network weights are initialized with the SαS distribution. Given that the network weights often behave in a heavy-tailed manner, it would make sense to initialize them in a heavy-tailed manner as well.
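The sketch below illustrates an SαS initialization with scipy.stats.levy_stable. The scale C·n^{−1/α} used here is only a rough proxy for the threshold σ_α(n) in (7) (an assumption for illustration, with C arbitrary); the printout is the empirical per-layer growth of log ‖h_ℓ‖.

# Sketch: deep linear network with symmetric alpha-stable weights; the scale
# C * n**(-1/alpha) is a rough proxy for the threshold of (7), not the exact value.
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(2)
alpha, n, depth, C = 1.5, 64, 30, 1.0
scale = C * n ** (-1.0 / alpha)

h = rng.standard_normal(n)
h /= np.linalg.norm(h)
for _ in range(depth):
    W = levy_stable.rvs(alpha, 0.0, loc=0.0, scale=scale,
                        size=(n, n), random_state=rng)
    h = W @ h
print("per-layer growth of log||h_l||:", np.log(np.linalg.norm(h)) / depth)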
2 ReLU Activation
Here we use the ReLU activation:
h_ℓ = φ( W_ℓ h_{ℓ−1} + b_ℓ ),    (8)
where
φ(x) = max(x, 0)  (applied entrywise)
is the ReLU activation and we set the biases b_ℓ = 0 for every ℓ. The analysis of [cohen1984stability] also shows that in fact we can compute, for every ℓ, the distribution of the ratio
R_ℓ := ‖h_ℓ‖ / ‖h_{ℓ−1}‖.
Moreover, the random variables R_ℓ are i.i.d., because the distribution of this ratio does not depend on the choice of h_{ℓ−1} (this stems from the fact that the Gaussian distribution is rotationally symmetric). Therefore, we get
E[ ‖h_L‖^s ] = ‖h_0‖^s ( E[ R_1^s ] )^L  for every s > 0.
This quantity does not depend on the choice of h_{ℓ−1}, so without loss of generality we can choose h_{ℓ−1} = e_1 (the first standard basis vector) in our computations, which yields
E[ R_ℓ² ] = E[ ‖φ(W_ℓ e_1)‖² ] = Σ_{j=1}^{n} E[ max(w_j, 0)² ],    (9)
where w = (w_1, …, w_n) is the first column of W_ℓ. We have w_j ~ N(0, σ²/n) i.i.d. The distribution of w_j is symmetric with respect to the origin; therefore, for each j, we obtain
E[ max(w_j, 0)² ] = σ²/(2n),    (10)
because
E[ max(w_j, 0)² ] = ∫_0^∞ x² p_j(x) dx    (11)
= ½ ∫_{−∞}^{∞} x² p_j(x) dx    (12)
= ½ E[ w_j² ] = σ²/(2n),    (13)
where p_j denotes the (symmetric) density of w_j.
Combining everything, we obtain
E[ R_ℓ² ] = E[ ‖h_ℓ‖² / ‖h_{ℓ−1}‖² ] = Σ_{j=1}^{n} σ²/(2n) = σ²/2.    (14)
From here, we see that if we choose σ² = 2, then this quantity equals 1 and the second moment of the iterates behaves in a stable fashion.
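A quick Monte Carlo check of (14) as derived above: with (W)_{ij} ~ N(0, σ²/n), the expected squared norm of the ReLU of the first column is σ²/2, so σ² = 2 keeps this second moment equal to one.

# Monte Carlo check of (14): E||relu(w)||^2 = sigma^2 / 2 for w ~ N(0, (sigma^2/n) I_n).
import numpy as np

rng = np.random.default_rng(3)
n, sigma2, trials = 64, 2.0, 200_000
w = rng.normal(0.0, np.sqrt(sigma2 / n), size=(trials, n))
print("Monte Carlo:", np.mean(np.sum(np.maximum(w, 0.0) ** 2, axis=1)),
      " theory:", sigma2 / 2)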
It follows through a similar argument that, for the ReLU case, the top Lyapunov exponent admits the formula
λ_ReLU(σ, n) = lim_{L→∞} (1/L) log ‖h_L‖ = E[ log ‖φ(W_ℓ e_1)‖ ]    (15)
= ½ E[ log( Σ_{j=1}^{n} max(w_j, 0)² ) ].    (16)
We can compute this expectation by exploiting the symmetries of the integration. In ℝⁿ there are 2ⁿ orthants (generalized quadrants). Each orthant can be identified with an element of the set {−, +}ⁿ. On every orthant that corresponds to k plus signs and n − k minus signs, the conditional distribution of Σ_{j=1}^{n} max(w_j, 0)² is that of (σ²/n) Q_k, where Q_k is a chi-square random variable with k degrees of freedom. If we pick a random orthant, with probability C(n, k) 2^{−n} we will be in such an orthant. Building on the formula [newman1986, eqn. (7)], we can show that
λ_ReLU(σ, n) = 2^{−n} Σ_{k=1}^{n} C(n, k) · ½ E[ log( (σ²/n) Q_k ) ]    (17)
= 2^{−n} Σ_{k=1}^{n} C(n, k) · ½ [ log(2σ²/n) + ψ(k/2) ],    (18)
where ψ is the digamma function from (2) (the ReLU is zero on the negative orthant, i.e. the set of vectors whose components are all negative; therefore the sum above does not include the term k = 0).
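The binomial–digamma sum (17)–(18), as reconstructed here, is easy to evaluate numerically; the sketch below computes λ_ReLU(σ, n) and locates the critical σ² where it vanishes, which sits slightly above 2.

# Evaluate the orthant formula (17)-(18) and find the critical sigma^2 where
# lambda_relu(sigma, n) = 0 (our reconstruction of that formula).
import numpy as np
from scipy.special import digamma
from scipy.stats import binom
from scipy.optimize import brentq

def lyapunov_relu(sigma2, n):
    k = np.arange(1, n + 1)
    w = binom.pmf(k, n, 0.5)          # C(n,k) / 2^n, k = 1..n
    return float(np.sum(w * 0.5 * (np.log(2.0 * sigma2 / n) + digamma(k / 2.0))))

n = 100
print("lambda_relu at sigma^2 = 2:", lyapunov_relu(2.0, n))
print("critical sigma^2          :", brentq(lambda s: lyapunov_relu(s, n), 1.0, 4.0))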
Estimating λ_ReLU(σ, n):
To bound λ_ReLU(σ, n) we can use Jensen's inequality. The digamma function is the derivative of the log Γ function, i.e.
ψ(x) = (d/dx) log Γ(x) = Γ′(x)/Γ(x),
where Γ is the Gamma function.
where is the Gamma function. We define
Differentiating under the integral, we obtain that
(19) |
This function is called the polygamma function of order and it admits the representation
for . In particular for , we see that the second derivative so that the function is concave for . Therefore, the function is a concave function of for whereas it is not defined for . If we define , we can write
λ_ReLU(σ, n) = ½ (1 − 2^{−n}) ( log(2σ²/n) + E[ g(K̃) ] ),
where K̃ is defined as follows, using the fact that Σ_{k=1}^{n} C(n, k) 2^{−n} = 1 − 2^{−n}. Given a binomial random variable K with support over {0, 1, …, n} and P(K = k) = C(n, k) 2^{−n}, conditioning on {K ≥ 1} defines a probability distribution over {1, …, n} as P(K̃ = k) = C(n, k) 2^{−n} / (1 − 2^{−n}). Let K̃ be the random variable with this distribution; note that E[K̃] = (n/2)/(1 − 2^{−n}).
By Jensen's inequality,
E[ g(K̃) ] = E[ ψ(K̃/2) ]    (20)
≤ ψ( E[K̃]/2 )    (21)
= ψ( n / (4(1 − 2^{−n})) ),    (22)
so that λ_ReLU(σ, n) ≤ ½ (1 − 2^{−n}) ( log(2σ²/n) + ψ( n/(4(1 − 2^{−n})) ) ).
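Numerically, the Jensen bound above is quite tight; the sketch below compares the exact average E[ψ(K̃/2)] with the bound ψ(E[K̃]/2) for a few widths, using the conditioned-binomial weights described above.

# Compare E[psi(K~/2)] (K~ = Binomial(n,1/2) conditioned on K >= 1) with the
# Jensen upper bound psi(E[K~]/2).
import numpy as np
from scipy.special import digamma
from scipy.stats import binom

for n in (20, 100, 500):
    k = np.arange(1, n + 1)
    p = binom.pmf(k, n, 0.5)
    p /= p.sum()                                  # condition on K >= 1
    exact = float(np.sum(p * digamma(k / 2.0)))
    bound = float(digamma(np.sum(p * k) / 2.0))
    print(f"n={n:4d}  exact={exact:.5f}  Jensen bound={bound:.5f}")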
Theorem 7.
Consider a ReLU network with infinitely many layers where each layer has width n. We initialize the network with weights
(W_ℓ)_{ij} ~ N(0, σ²/n)  i.i.d.,
and the biases b_ℓ i.i.d. as in Lemma 1. Then, we can choose σ such that σ² > 2 and λ_ReLU(σ, n) < 0; and for this choice of σ the stationary distribution exists and is heavy tailed.
By a similar approach, we can also compute the moments (23) explicitly:
ρ(n, s) := E[ ( ‖h_ℓ‖ / ‖h_{ℓ−1}‖ )^s ] = E[ ‖φ(W_ℓ e_1)‖^s ].    (23)
If Q_k is a chi-square random variable with k degrees of freedom, its moments are explicitly available and are given by
E[ Q_k^m ] = 2^m Γ( m + k/2 ) / Γ( k/2 ),  for m > −k/2
(see [walck1996hand, Sec. 8]). We compute
ρ(n, s) = 2^{−n} Σ_{k=1}^{n} C(n, k) (2σ²/n)^{s/2} Γ( (s + k)/2 ) / Γ( k/2 ).    (24)
In particular, for s = 2, this implies
ρ(n, 2) = 2^{−n} Σ_{k=1}^{n} C(n, k) (2σ²/n) Γ( 1 + k/2 ) / Γ( k/2 ).    (25)
Since Γ(1 + k/2) = (k/2) Γ(k/2), we have Γ(1 + k/2)/Γ(k/2) = k/2 and therefore,
ρ(n, 2) = (σ²/n) E[K] = σ²/2,  where K ~ Binomial(n, 1/2),    (26)
which recovers the identity (14) we derived earlier. For s = 4, we get
ρ(n, 4) = (2σ²/n)² E[ (K/2)(K/2 + 1) ] = (σ⁴/4)(1 + 5/n).    (27)
It is also known that E[ Z^{2m} ] = (2m − 1)!! σ_Z^{2m} for a centered Gaussian random variable Z with variance σ_Z²,
where (2m − 1)!! is the double factorial. Therefore, we get, alternatively,
E[ ‖φ(w)‖⁴ ] = n E[ max(w_1, 0)⁴ ] + n(n − 1) ( E[ max(w_1, 0)² ] )² = 3σ⁴/(2n) + (n − 1)σ⁴/(4n)    (28)
= (σ⁴/4)(1 + 5/n),    (29)
which matches (27).
Remember that the function ρ(n, s) does not depend on the choice of h_{ℓ−1} as long as it is not zero. So, without loss of generality we can set h_{ℓ−1} = e_1. Note that we have
ρ(n, s) = E[ exp( s log ‖φ(W_ℓ e_1)‖ ) ] =: m(n, s).
Therefore, m(n, s) is the moment generating function of the random variable log ‖φ(W_ℓ e_1)‖; therefore it is a convex function of s by the properties of moment generating functions. Then, it follows that the function ρ(n, s) defined by (24) is a convex function of s. Clearly, we have ρ(n, 0) = 1. Furthermore,
(d/ds) ρ(n, s) |_{s=0} = E[ log ‖φ(W_ℓ e_1)‖ ]    (30)
= λ_ReLU(σ, n).    (31)
In other words, λ_ReLU(σ, n) is the derivative at zero of the function s ↦ ρ(n, s), which is strictly convex (because the second derivative of the log-Gamma function, the trigamma function, is strictly positive on the positive real axis), infinitely differentiable, satisfies ρ(n, 0) = 1, and ρ(n, s) → ∞ as s → ∞. Therefore, if λ_ReLU(σ, n) < 0, there exists s* > 0 such that ρ(n, s) < 1 for s ∈ (0, s*). The behavior of ρ(n, ·) will essentially depend on the sign of λ_ReLU(σ, n). If λ_ReLU(σ, n) = 0, then the higher-order derivatives will determine the behavior: for instance, if λ_ReLU(σ, n) = 0 but some finite-order derivative of s ↦ ρ(n, s) at zero is non-zero, then we can still conclude that there exists ε > 0 such that ρ(n, s) > 1 for every s ∈ (0, ε). We have the following cases:
• Case I: λ_ReLU(σ, n) < 0. There exists s* > 0 such that ρ(n, s*) = 1. The function s ↦ ρ(n, s) is strictly convex and we have ρ(n, s) < 1 for s ∈ (0, s*) and ρ(n, s) > 1 for s > s*. In this case, E[ ‖h_∞‖^s ] < ∞ if s < s*.
• Case II: λ_ReLU(σ, n) = 0. In this case, ρ(n, s) > 1 for any s > 0 by the strict convexity of ρ(n, ·). All the moments of h_∞ of order s > 0 blow up.
• Case III: λ_ReLU(σ, n) > 0. In this case, we have ρ(n, s) > 1 for every s > 0, and ‖h_ℓ‖ goes to infinity with probability one.
In particular, if σ² = 2, then we are in Case I above with s* = 2. If we make σ² larger while keeping λ_ReLU(σ, n) < 0, then s* will get smaller. In particular, if we choose σ to satisfy
ρ(n, s*) = 1  for a prescribed s* ∈ (0, 2),    (32)
i.e. if we choose
σ² = (n/2) ( 2^{−n} Σ_{k=1}^{n} C(n, k) Γ( (s* + k)/2 ) / Γ( k/2 ) )^{−2/s*},
then we will have ρ(n, s) < 1 for every s ∈ (0, s*), the limiting distribution will have a finite s-th moment for every s < s*, whereas any moment of order larger than s* will blow up. This is clearly a heavier tail than the exponential distribution.
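The recipe (32) can be implemented with a root-finder: the sketch below evaluates ρ(n, s) from (24) (as reconstructed here, via log-gamma for numerical stability) and solves ρ(n, s*) = 1 for σ², with an illustrative target s* = 1.

# Solve rho(n, s*) = 1 for sigma^2, where rho is the moment function (24),
# so that s* becomes the critical moment of the stationary distribution.
import numpy as np
from scipy.special import gammaln
from scipy.stats import binom
from scipy.optimize import brentq

def rho(sigma2, n, s):
    k = np.arange(1, n + 1)
    w = binom.pmf(k, n, 0.5)
    ratio = np.exp(gammaln((s + k) / 2.0) - gammaln(k / 2.0))
    return float(np.sum(w * (2.0 * sigma2 / n) ** (s / 2.0) * ratio))

n, s_star = 100, 1.0                               # illustrative target tail index
sigma2 = brentq(lambda v: rho(v, n, s_star) - 1.0, 0.5, 10.0)
print("sigma^2 for critical moment s* = 1:", sigma2)
print("rho(n, 2) at this sigma^2:", rho(sigma2, n, 2.0), "(> 1: infinite variance)")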
In the next subsection, we look at initialization with alpha-stable distributions.
Remark 8.
The law of h_ℓ converges to the stationary distribution; however, this does not directly imply the convergence of the moments; one would typically need additional uniform integrability conditions. Furthermore, by the remark at the top of page 292, we can choose the initialization to recover an alpha-stable distribution in the limit.
2.1 Alpha-Stable Initialization
Theorem 9.
For a given α ∈ (0, 2), assume that we initialize the network weights for a deep ReLU network as
(W_ℓ)_{ij} ~ SαS(σ)  i.i.d.
for every i, j and every ℓ, while setting b_ℓ = 0 for every ℓ. Then the Lyapunov exponent satisfies
λ_α(σ, n) = log σ + (1/α) E[ log( |S_1|^α + ⋯ + |S_K|^α ) ],
where K ~ Binomial(n, 1/2) is a binomial random variable and S_1, S_2, … are i.i.d. standard symmetric α-stable random variables, independent of K. Therefore, λ_α(σ, n) < 0 (resp. λ_α(σ, n) > 0) if σ < σ_α^ReLU(n) (resp. σ > σ_α^ReLU(n)),
where
σ_α^ReLU(n) := exp( −(1/α) E[ log( |S_1|^α + ⋯ + |S_K|^α ) ] ).
A sequence of networks with width n and scale σ = σ_n is asymptotically stable (i.e. the input decays exponentially) if
lim sup_{n→∞} σ_n / σ_α^ReLU(n) < 1,
and asymptotically unstable (i.e. the input grows exponentially) if
lim inf_{n→∞} σ_n / σ_α^ReLU(n) > 1,
where σ_α^ReLU(n) is comparable, up to a factor depending only on α, to the quantity σ_α(n) defined by (7).
Proof.
The Lyapunov exponent will admit the formula
λ_α(σ, n) = lim_{L→∞} (1/L) log ‖h_L‖₂.
But this quantity is invariant if we replace the 2-norm above with an L^p norm, or even with ‖x‖_p = (Σ_i |x_i|^p)^{1/p} for any p > 0; see [cohen1984stability, Introduction]. Then, using the stability property of α-stable distributions, we can show by a similar argument to [cohen1984stability] that, conditionally on h_{ℓ−1}, the components of W_ℓ h_{ℓ−1} are i.i.d. SαS(σ ‖h_{ℓ−1}‖_α). Then, due to the ReLU term, only a Binomial(n, 1/2) number of these components survive, and we get
λ_α(σ, n) = log σ + (1/α) E[ log( |S_1|^α + ⋯ + |S_K|^α ) ].
It is known that the expectation of the logarithm of sums of powers of stable random variables admits explicit expressions. Using the properties of α-stable distributions, the asymptotic statement follows. Proof sketch: approximate the binomial distribution with a Gaussian with mean n/2 and approximate everything by a Gaussian integral; use the limit argument in the Cohen–Newman paper that computes the distribution of the integrand in the Lyapunov exponent.
∎
According to Theorem 9, in order to avoid exponential decay/growth of the input we can choose σ to satisfy
λ_α(σ, n) = 0,
i.e.
σ = σ_α^ReLU(n) = exp( −(1/α) E[ log( |S_1|^α + ⋯ + |S_K|^α ) ] ).
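Under the reconstruction of Theorem 9 used above, the threshold σ_α^ReLU(n) can be estimated by simple Monte Carlo, as sketched below; the parameters are illustrative and levy_stable sampling can be slow for large sizes.

# Monte Carlo estimate of sigma_relu(n) = exp(-(1/alpha) E log(|S_1|^a+...+|S_K|^a))
# with K ~ Binomial(n, 1/2) and S_i i.i.d. standard symmetric alpha-stable.
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(4)
alpha, n, trials = 1.5, 64, 2000
logs = np.empty(trials)
for t in range(trials):
    k = max(rng.binomial(n, 0.5), 1)               # drop the negligible K = 0 orthant
    s = levy_stable.rvs(alpha, 0.0, size=k, random_state=rng)
    logs[t] = np.log(np.sum(np.abs(s) ** alpha))
print("estimated critical scale sigma_relu(n):", np.exp(-logs.mean() / alpha))
print("for comparison, n**(-1/alpha) =", n ** (-1.0 / alpha))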
Analogously, we can look at the moments
ρ_α(n, s) := E[ ( ‖h_ℓ‖_α / ‖h_{ℓ−1}‖_α )^s ]    (33)
= σ^s E[ ( |S_1|^α + ⋯ + |S_K|^α )^{s/α} ],    (34)
which is finite if and only if s < α.    (35)
3 Initialization with Exponential Tails
We consider the Laplace distribution with mean zero. Its density can be expressed as
p_b(x) = (1/(2b)) exp( −|x|/b ),  x ∈ ℝ,
and will be denoted by Laplace(0, b). This density coincides with the exponential distribution density up to a constant factor on the positive real axis. The variance of the Laplace distribution is given by 2b². So if we set b = σ/√2, the distribution will have variance equal to σ².
Let us assume that we set
(W_ℓ)_{ij} ~ Laplace(0, b)  i.i.d. over i, j and ℓ.
Recall that we have the formula
λ(n) = lim_{L→∞} (1/L) log ‖h_L‖₁.    (36)
In particular, the choice of the norm above does not matter: we could use ‖·‖_p for any p ≥ 1. For computational convenience, we choose the L¹ norm here.
It is easy to check from the symmetry with respect to the origin of the Laplace distribution that if w_j ~ Laplace(0, b), then, conditionally on w_j > 0, we have max(w_j, 0) ~ Exp(1/b),
where Exp(λ) denotes the exponential distribution with parameter λ, with density λ exp(−λ x), x ≥ 0.
Therefore, (36) becomes
λ(n) = E[ log ‖φ(W_ℓ e_1)‖₁ ] = E[ log( max(w_1, 0) + ⋯ + max(w_n, 0) ) ],    (37)
where w = (w_1, …, w_n) is the first column of W_ℓ. Conditionally on the signs of the entries, the positive entries are i.i.d. Exp(1/b), so by similar ideas to before we obtain
λ(n) = E[ log( E_1 + ⋯ + E_K ) ],    (38)
where the E_j are i.i.d. with E_j ~ Exp(1/b) and K ~ Binomial(n, 1/2) (with the convention that the random sum is zero when K = 0). We recognize that the sum of k-many i.i.d. exponentially distributed random variables with parameter 1/b is a Gamma random variable G_k with density
g_k(x) = x^{k−1} exp(−x/b) / ( Γ(k) b^k ),  x ≥ 0.
The logarithm of a Gamma random variable follows a log-Gamma distribution (sometimes confused with the exponential-gamma distribution) with expectation
E[ log G_k ] = ψ(k) + log b.
For the ReLU computation,
λ(n) = Σ_{k=1}^{n} 2^{−n} C(n, k) E[ log( E_1 + ⋯ + E_k ) ]    (39)
= Σ_{k=1}^{n} 2^{−n} C(n, k) E[ log G_k ]    (40)
= Σ_{k=1}^{n} 2^{−n} C(n, k) ( ψ(k) + log b )    (41)
= (1 − 2^{−n}) log b + 2^{−n} Σ_{k=1}^{n} C(n, k) ψ(k),    (42)
where ψ is the digamma function.
This suggests that, when we set the entries of the network according to a Laplace (double exponential) distribution, we should choose b so that the expression (42) vanishes. Note that this formula is different than the one in the Gaussian case.
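The closing expression (42) pins down the Laplace scale numerically: the sketch below evaluates 2^{−n} Σ_k C(n, k) ψ(k) and the b that makes (42) vanish (ignoring the negligible 2^{−n} correction), and compares it with the rough guess 2/n.

# Solve log(b) + 2^{-n} sum_k C(n,k) psi(k) = 0 for the Laplace scale b
# (the k = 0 orthant contributes only a 2^{-n} correction, ignored here).
import numpy as np
from scipy.special import digamma
from scipy.stats import binom

n = 100
k = np.arange(1, n + 1)
avg_psi = float(np.sum(binom.pmf(k, n, 0.5) * digamma(k)))
print("b solving (42) = 0:", np.exp(-avg_psi), "  rough guess 2/n:", 2.0 / n)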
4 Initialization with Weibull tails
The idea is to use the fact that if E ~ Exp(1), then E^{1/k} follows a Weibull distribution with shape parameter k (equivalently, the k-th power of a Weibull random variable with shape k is exponentially distributed), and recycle the analysis for the exponential distribution.
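The fact quoted above is a one-line change of variables; the snippet below checks it with a Kolmogorov–Smirnov test (the shape parameter k is an arbitrary illustrative choice).

# If E ~ Exp(1), then E**(1/k) is Weibull with shape parameter k.
import numpy as np
from scipy.stats import kstest, weibull_min

k = 2.5
x = np.random.default_rng(5).exponential(size=200_000) ** (1.0 / k)
print(kstest(x, weibull_min(c=k).cdf))   # large p-value: consistent with Weibull(k)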
5 Other initializations
If the entries are sampled from the Laplace distribution (also called the double exponential distribution), then the absolute value of each entry is exponentially distributed, and a sum of i.i.d. exponentials is an Erlang random variable (which is a special case of the gamma distribution) whose log-moments are explicitly known. In other words, we can compute the optimal scale for the double exponential distribution as well (if the tails are heavier than exponential, we call them heavy tailed); for the Laplace distribution the Lyapunov exponent can be computed explicitly.
Another possibility is to consider initialization with a t-distribution. The square of a t-distributed random variable follows an F-distribution, see e.g.
https://stats.stackexchange.com/questions/78940/distribution-of-sum-of-squares-of-t-distributed-random-variables
Although sums of F-distributed random variables do not have a closed-form density, we can characterize the characteristic function of such sums and compute all the moments from the characteristic function.
If the initialization is a two-sided Weibull distribution, then its characteristic function is known, see e.g. the paper titled "Characteristic and Moment Generating Functions of Three Parameter Weibull Distribution-An Independent Approach" by G. Muraleedharan. Then, we can compute the corresponding Lyapunov exponent based on the characteristic functions of random sums of Weibull random variables.
Another option is to look at a two-sided Pareto distribution and study the Lyapunov exponent with respect to the L¹ norm using characteristic function techniques.
6 Correlated Random Initialization
Here, we make use of explicit computations for the top Lyapunov exponent of correlated Gaussian random matrices. The simplest setting is when each column of the weight matrix is sampled from a Gaussian vector with a given covariance. The idea is to use the result of the following paper (see Theorem 1.1):
https://arxiv.org/pdf/1306.6576.pdf
If the columns of W are i.i.d. N(0, V) vectors in ℝ^p, then W Wᵀ has a Wishart distribution W_p(V, m), where m is the number of columns. The log-expectation of its determinant can be computed as follows; the following formula plays a role in variational Bayes derivations for Bayesian networks involving the Wishart distribution:
E[ log det X ] = ψ_p(m/2) + p log 2 + log det V,  X ~ W_p(V, m),
where ψ_p(a) = Σ_{i=1}^{p} ψ( a + (1 − i)/2 ) is the multivariate digamma function (the derivative of the log of the multivariate gamma function).
Log-variance: the following variance computation could be of help in Bayesian statistics:
Var[ log det X ] = Σ_{i=1}^{p} ψ₁( (m + 1 − i)/2 ),
where ψ₁ is the trigamma function. This comes up when computing the Fisher information of the Wishart random variable; see the Wikipedia page on the Wishart distribution.
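The two Wishart identities quoted above are easy to check against a Monte Carlo average; the sketch below does so for an illustrative W_p(I, m).

# Check E[log det X] = psi_p(m/2) + p log 2 + log det V for X ~ W_p(V, m),
# with psi_p(a) = sum_{i=1}^p psi(a + (1 - i)/2) the multivariate digamma.
import numpy as np
from scipy.special import digamma
from scipy.stats import wishart

p, m = 5, 12
V = np.eye(p)
i = np.arange(1, p + 1)
theory = np.sum(digamma(m / 2.0 + (1.0 - i) / 2.0)) + p * np.log(2.0) + np.log(np.linalg.det(V))
samples = wishart.rvs(df=m, scale=V, size=20_000, random_state=0)
mc = np.mean([np.linalg.slogdet(S)[1] for S in samples])
print("theory:", theory, "  Monte Carlo:", mc)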