
Better Random Initialization in Deep Learning

Mert Gurbuzbalaban
Rutgers Business School
(March 2020)

1 Priors and Initialization in Deep Learning

We consider a deep linear network where x^{(k)} denotes the output of the k-th layer with initialization x^{(0)}, a setup similar to [vladimirova2018understanding] except that for now we assume the network is linear. We have the recursion

x^{(k+1)}=W^{(k+1)}x^{(k)}+b^{(k+1)} (1)

where W^{(k+1)} is the weight matrix of the (k+1)-st layer and b^{(k+1)} is the bias vector. We assume that each layer has the same number of neurons, so that W^{(k+1)} is a square matrix. The paper [vladimirova2018understanding] considers Gaussian initializations of the W^{(k+1)} matrices and shows that the output x^{(k)} of the k-th layer has sub-Weibull tails with parameter \theta_{k}=k/2, i.e. tails decaying like \mathbb{P}(|x^{(k)}|>t)\sim e^{-t^{1/\theta_{k}}}. For large k, this tail is heavier than that of the exponential distribution; however, the result still shows superpolynomial decay of the tails. In particular, it does not show that the tails can decay polynomially, i.e. \mathbb{P}(|x^{(k)}|>t)\sim\frac{1}{t^{\alpha}} for some \alpha>0. On the other hand, recent literature has found that the weights and the gradients in neural network training can often have distributions with polynomial decay.

Here, the purpose is to show that polynomial decay is possible when the number of layers k is large enough, improving the results of [vladimirova2018understanding] by showing that even heavier tails can arise. As in [vladimirova2018understanding], we assume that the network is initialized with a Gaussian distribution, i.e.

W^{(k)}_{i,j}\sim\mathcal{N}(0,\sigma^{2})

where \sigma is fixed and d is the dimension (the number of neurons at each layer). In particular, for the stochastic recursion (1), the top Lyapunov exponent is explicitly known:

\rho=\rho(d)=\log(\sigma)+\frac{1}{2}\left[\log(2)+\Psi\left(\frac{d}{2}\right)\right] (2)

(see [newman1986, eqn. (7)]) where \Psi is the digamma function that admits the asymptotic expansion

\Psi\left(\frac{d}{2}\right)=\log(d/2)-1/d+O(1/d^{2}). (3)

Note that \rho is deterministic (see [newman1986, eqn. (7)]). Another issue we will focus on is the initialization. It has been noticed by researchers that if \sigma is too large, then the variance of x^{(k)} can grow exponentially over the layers. In particular, the authors of [he2015delving] argued that for a ReLU network, a sufficient condition to avoid exponential blow-up is to choose a layer-dependent \sigma(k) of the form \sigma(k)\leq\frac{\sqrt{2}}{\sqrt{d_{k}}}, where d_{k} is the number of neurons at layer k. In our setup, we take d_{k}=d for every layer k. A proper initialization method should avoid growing the input exponentially as the number of layers k grows. A natural question is the following:

  • If we take \sigma=c/\sqrt{d} for some constant c>\sqrt{2}, would the network be stable? In other words, would the moments of x^{(k)} be finite? If so, which moments would be finite; can we have, for instance, finite variance? How does the tail depend on the parameters? In particular, if the limit exists and is heavy tailed, it would allow SGD to explore the space more efficiently in the early iterations. (A numerical check of the sign of \rho for such choices is sketched right after this question.)
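
As a quick numerical check of the question above, one can evaluate the sign of the Lyapunov exponent (2) directly. The following is a minimal Python sketch (assuming NumPy and SciPy are available); the values of c and d are illustrative, not choices made in the paper.

import numpy as np
from scipy.special import digamma

def rho(sigma, d):
    # Top Lyapunov exponent of the linear recursion, eqn. (2)
    return np.log(sigma) + 0.5 * (np.log(2.0) + digamma(d / 2.0))

for d in [10, 100, 1000]:
    for c in [1.0, np.sqrt(2.0), 2.0]:
        sigma = c / np.sqrt(d)
        # A negative value indicates that the iterates contract almost surely.
        print(f"d={d:5d}  c={c:.3f}  rho={rho(sigma, d):+.4f}")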

The initialization proposed by [he2015delving] corresponds to \sigma^{2}=2/d in our setting with b^{(k+1)}=0, and the factor 2 arises because the ReLU activation is used in their setting. If x is a random variable that is symmetric around the origin, the variance of \max(0,x) is half of the variance of x due to symmetry. In other words, applying ReLU to a random variable decreases the variance by a factor of 2, which is why they can allow \sigma^{2}=2/d. However, if we follow the arguments of [he2015delving] for a linear network, we get that a sufficient condition to avoid blow-up over the layers is to choose \sigma^{2}=1/d. In fact, they argue that the choice \sigma^{2}<1/d for a linear network would make the input exponentially small as a function of k. This is true if we choose the bias b^{(k+1)}=0, along an argument of [kesten1973random]; however, if we choose b^{(k+1)} randomly this does not have to be the case, as we will explain further. We will also prove that even if b^{(k)}=0 we can choose \sigma^{2} larger than 1/d and yet avoid blow-up of the iterations.

Lemma 1.

Consider a linear network with infinitely many layers where each layer has width d. Suppose we initialize the network with weights

W^{(k)}_{ij}\sim\mathcal{N}(0,\sigma^{2}_{d})

and the biases b^{(k)} are i.i.d. samples from a distribution b that takes values in \mathbb{R}^{d}. If \rho<0 and \mathbb{E}(\log^{+}\|b\|)<\infty, then the Markov chain determining the law of the iterates x^{(k)} admits a unique stationary distribution X, given by the law of the almost surely convergent series

X:=\sum_{j\geq 0}\Pi^{(j)}b^{(k+1-j)} (4)

where \Pi^{(j)}=W^{(j)}W^{(j-1)}\dots W^{(1)} with the convention that \Pi^{(0)}=I_{d}.

Proof.

This is a well known result, see e.g. the introduction of [alsmeyer2012tail]. ∎

Remark 2.

An analogue of this lemma is also available for Lipschitz maps, see [ELTON199039]. This can also handle the ReLU case.

Remark 3.

One can also compute all the moments of x^{(k)} explicitly and determine, for each choice of parameters, which moments blow up.

Theorem 4.

In the setting of Lemma 1, assume that we choose \sigma_{d} such that it satisfies

\log(\sigma_{d})+\frac{1}{2}\left[\log(2)+\Psi\left(\frac{d}{2}\right)\right]<0. (5)

Then the Markov chain determining the law of the iterates x^{(k)} admits a unique stationary distribution X that can be expressed by the almost surely convergent series (4). Furthermore, if we choose \sigma_{d} such that

\sigma_{d}^{2}=\frac{1}{d}+\frac{c}{d^{2}}+o\left(\frac{1}{d^{3}}\right) (6)

for some positive constant c<1, then condition (5) is satisfied for every d large enough. Furthermore, for \sigma_{d}^{2}>1/d, X is heavy tailed in the sense that its second moment is infinite. However, if c>1, then x^{(k)} goes to infinity with probability one.

Remark 5.

If we take b^{(k)}=0 for every k, we see from (4) that the limit X is necessarily zero almost surely. However, the limit can be non-trivial if we choose b^{(k)}\neq 0; in either case the limit is independent of the initialization x^{(0)} [kesten1973random, Thm 6].

Proof.

The proof follows from the formula (2) and Lemma 1, since condition (5) says precisely that \rho<0. The fact that the choice (6) satisfies (5) is due to the expansion (3): writing \sigma_{d}^{2}=(1+c/d)/d, we have \log(\sigma_{d})+\frac{1}{2}[\log(2)+\Psi(d/2)]=\frac{1}{2}\log(1+c/d)-\frac{1}{2d}+O(1/d^{2})=\frac{c-1}{2d}+O(1/d^{2}), which is negative for all d large enough when c<1. If c>1, then \rho>0 for d large enough and the result follows from [cohen1984stability, Prop 2.1]. From a computation similar to the one used to derive (14), we see that if \sigma_{d}^{2}=1/d then \mathbb{E}(\|x^{(k)}\|^{2}) is constant; in particular, \sigma_{d}^{2}>1/d means that \mathbb{E}(\|x^{(k)}\|^{2}) blows up. This implies that the stationary distribution X also has infinite variance. ∎

1.1 Numerical Experiments for Gaussian Initialization

The website Example Python Code has some example code for testing the stability of the initializations. Let us try the initialization \sigma_{d}^{2}=\frac{1}{d}+\frac{1}{d^{2}} suggested by (6) and compare it with the Kaiming initialization after 100 layers. Also, compare the performance on the CNN tasks reported on that web page.
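
A minimal simulation along these lines is sketched below (this is not the code from the referenced website; NumPy is assumed, and the width, depth, and number of trials are illustrative). It tracks \log\|x^{(k)}\| of a deep linear network with zero biases for a few choices of \sigma_{d}^{2}.

import numpy as np

def mean_log_norm(sigma2, d=128, layers=100, trials=20, seed=0):
    # Average log ||x^(k)|| after `layers` linear layers with N(0, sigma2) weights and zero bias.
    rng = np.random.default_rng(seed)
    logs = []
    for _ in range(trials):
        x = rng.standard_normal(d)
        for _ in range(layers):
            W = rng.normal(0.0, np.sqrt(sigma2), size=(d, d))
            x = W @ x
        logs.append(np.log(np.linalg.norm(x)))
    return np.mean(logs)

d = 128
for label, sigma2 in [("1/d", 1.0 / d),
                      ("1/d + 1/d^2", 1.0 / d + 1.0 / d ** 2),
                      ("2/d (Kaiming, meant for ReLU)", 2.0 / d)]:
    print(f"sigma^2 = {label:30s}  mean log||x^(100)|| = {mean_log_norm(sigma2, d):+.2f}")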

1.2 Network initialization with symmetric alpha stable distributions

What happens if we initialize the network with a different distribution that may have heavier tails than the Gaussian? For example, recent research shows that the weights of a network may demonstrate heavy-tailed behavior that can be modeled with a symmetric alpha-stable distribution (\mathcal{S}\alpha\mathcal{S}), whose characteristic function scales like \exp(-\sigma_{\alpha}|t|^{\alpha}), where \sigma_{\alpha} is a scale parameter; the Gaussian case is recovered when \alpha=2. If we initialize the network weights as W_{ij}^{(k)}\sim\mathcal{S}\alpha\mathcal{S} with a fixed choice of \alpha\in(0,2) and \sigma_{\alpha}, how should we choose the scale parameter \sigma_{\alpha} as a function of d? Would \sigma_{\alpha}=\mathcal{O}(\sqrt{1/d}) work, or would such an initialization lead to exponential blow-up?

A random variable Z is called a standard symmetric stable random variable with tail index \alpha, if

\mathbb{E}(e^{itZ})=\exp(-|t|^{\alpha}/\alpha).

For a vector x\in\mathbb{R}^{d}, we also introduce the notation

\|x\|_{\alpha}:=\left(\sum_{j=1}^{d}|x_{j}|^{\alpha}\right)^{1/\alpha}.

We say that a random variable X\sim\mathcal{S}\alpha\mathcal{S}(\sigma) if X/\sigma is a standard symmetric stable random variable.

Theorem 6.

For a given \alpha\in(0,2), assume that we initialize the network weights for a deep linear network as

W^{(k)}_{i,j}\sim\mathcal{S}\alpha\mathcal{S}(\sigma_{d}),

for every i,j and k, while setting b^{(k)}=0 for every k. If

\limsup_{d\to\infty}(d\log(d))^{1/\alpha}\sigma_{d}<J(\alpha),

then x^{(k)} converges to 0 with probability one for large enough d, whereas if

\liminf_{d\to\infty}(d\log(d))^{1/\alpha}\sigma_{d}>J(\alpha),

then for large enough d, x^{(k)} goes to infinity with probability one, where

J(\alpha)=\left[\frac{\alpha\pi}{2\Gamma(\alpha)\sin(\alpha\pi/2)}\right]^{1/\alpha}. (7)
Proof.

This is basically a restatement of [cohen1984stability, Theorem 2.7]. ∎

A consequence of this theorem is that it is appropriate to set

\sigma_{d}=\frac{J(\alpha)}{(d\log(d))^{1/\alpha}}

to avoid exponential blow-up or exponential decay of the dependency on the input if the network weights are initialized with the \mathcal{S}\alpha\mathcal{S} distribution. Given that the network weights often behave in a heavy-tailed manner, it would make sense to initialize them in a heavy-tailed manner as well.
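
The threshold J(\alpha) and the corresponding critical scale \sigma_{d} can be evaluated numerically; here is a minimal sketch (SciPy is assumed, and the values of \alpha and d are illustrative):

import numpy as np
from scipy.special import gamma

def J(alpha):
    # Threshold from eqn. (7)
    return (alpha * np.pi / (2.0 * gamma(alpha) * np.sin(alpha * np.pi / 2.0))) ** (1.0 / alpha)

def critical_scale(alpha, d):
    # Scale separating almost sure decay from almost sure blow-up in the linear case
    return J(alpha) / (d * np.log(d)) ** (1.0 / alpha)

for alpha in [0.5, 1.0, 1.5, 1.9]:
    print(f"alpha={alpha:.1f}  J={J(alpha):.4f}  sigma_d(d=128)={critical_scale(alpha, 128):.3e}")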

2 ReLu Activation

Here we use the ReLU Activation:

x^{(k+1)}=\phi(x^{(k+1/2)}),\quad\quad x^{(k+1/2)}=W^{(k+1)}x^{(k)}+b^{(k+1)} (8)

where

\phi(x)=\max(x,0)

is the ReLU activation (applied entrywise) and we set the biases b^{(k)}=0 for every k. The analysis of [cohen1984stability] also shows that, for x^{(0)}\neq 0, we can compute

\mathbb{E}\left[\left(\frac{\|x^{(k)}\|}{\|x^{(0)}\|}\right)^{s}\right]=\mathbb{E}\left[\prod_{j=0}^{k-1}\frac{\|x^{(j+1)}\|^{s}}{\|x^{(j)}\|^{s}}\right]

Moreover, the random variables y_{j}=\frac{\|x^{(j+1)}\|^{s}}{\|x^{(j)}\|^{s}} are i.i.d. because this ratio does not depend on the choice of x^{(j)} (this stems from the fact that the Gaussian distribution is rotationally symmetric). Therefore, we get

\mathbb{E}\left[\left(\frac{\|x^{(k)}\|}{\|x^{(0)}\|}\right)^{s}\right]=\prod_{j=0}^{k-1}\mathbb{E}\left[\frac{\|x^{(j+1)}\|^{s}}{\|x^{(j)}\|^{s}}\right]=\left(\mathbb{E}\left[\frac{\|x^{(1)}\|^{s}}{\|x^{(0)}\|^{s}}\right]\right)^{k}

This quantity does not depend on the choice of x^{(0)}, so without loss of generality we can choose x^{(0)}=e_{1} in our computations, which yields

\mathbb{E}\left[\frac{\|x^{(1)}\|^{s}}{\|x^{(0)}\|^{s}}\right]=\mathbb{E}\|\phi(W^{(1)}e_{1})\|^{s}=\mathbb{E}\|\phi(w)\|^{s} (9)

where w is the first column of W^{(1)} and w_{j}\sim\mathcal{N}(0,\sigma_{d}^{2}). Since w is symmetric with respect to the origin, the j-th component of \phi(w) satisfies

[\phi(w)]_{j}=\begin{cases}0&\mbox{with probability}\quad 1/2,\\ w_{j}&\mbox{with probability}\quad 1/2.\end{cases}

Hence, for s=2s=2, we obtain

\mathbb{E}\|\phi(w)\|^{2}=\mathbb{E}\left(\sum_{j=1}^{d}\phi(w_{j})^{2}\right)=\frac{d}{2}\sigma_{d}^{2} (10)

because

\mathbb{E}(\phi(w_{j})^{2}) = \mathbb{E}(\phi(w_{j})^{2}\,|\,w_{j}\leq 0)\,\mathbb{P}(w_{j}\leq 0)+\mathbb{E}(\phi(w_{j})^{2}\,|\,w_{j}\geq 0)\,\mathbb{P}(w_{j}\geq 0) (11)
= 0+\frac{\sigma_{d}^{2}}{2}\,\mathbb{E}\left((w_{j}/\sigma_{d})^{2}\,|\,w_{j}\geq 0\right) (12)
= \sigma_{d}^{2}/2. (13)

Combining everything, we obtain

\mathbb{E}\left[\left(\frac{\|x^{(k)}\|}{\|x^{(0)}\|}\right)^{2}\right]=\left(d\sigma_{d}^{2}/2\right)^{k} (14)

From here, we see that if we choose \sigma_{d}^{2}=2/d, then this quantity stays equal to one for every k, i.e. the second moment neither blows up nor decays over the layers.
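
The one-step identity \mathbb{E}\|\phi(W^{(1)}e_{1})\|^{2}=d\sigma_{d}^{2}/2 behind (14) is easy to verify by Monte Carlo; a minimal sketch (NumPy assumed, illustrative dimensions), using the fact that W^{(1)}e_{1} is just the first column of W^{(1)}:

import numpy as np

rng = np.random.default_rng(0)
d, sigma2, trials = 64, 2.0 / 64, 200_000
w = rng.normal(0.0, np.sqrt(sigma2), size=(trials, d))   # each row plays the role of the first column of W^(1)
y = np.maximum(w, 0.0)                                    # phi(W^(1) e_1)
vals = np.sum(y * y, axis=1)
print(vals.mean(), "vs theory", d * sigma2 / 2)           # both close to 1 for sigma^2 = 2/d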

It follows through a similar argument that for the ReLU case, the top Lyapunov exponent admits the formula

\rho_{ReLu} = \mathbb{E}\log\|\phi(W^{(1)}e_{1})\| = \frac{1}{2}\mathbb{E}\log\|\phi(W^{(1)}e_{1})\|^{2} (15)
= \frac{1}{2}\mathbb{E}\log(\|\phi(w)\|^{2}) (16)

We can compute this expectation by exploiting the symmetries of the integration. In \mathbb{R}^{d} there are 2^{d} orthants, each of which can be identified with an element of the set \{+,-\}^{d}. On every orthant that corresponds to n plus signs and d-n minus signs, \|\phi(w)\|^{2}/\sigma^{2} follows a chi-square distribution with n degrees of freedom. A random sign pattern has n plus signs with probability p(n)={d\choose n}\frac{1}{2^{d}}. Building on the formula [newman1986, eqn. (7)], we can show that

\rho_{ReLU} = \sum_{n=0}^{d}p(n)\rho(n) (17)
= \sum_{n=1}^{d}{d\choose n}\frac{1}{2^{d}}\left[\log(\sigma)+\frac{1}{2}\left[\log(2)+\Psi\left(\frac{n}{2}\right)\right]\right] (18)

where \rho(n) is given by (2) with d replaced by n (on the negative orthant, i.e. the set of vectors whose components are all negative, \phi(w)=0 and its contribution is set to zero; therefore the sum above does not include n=0).
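
Formula (18) is straightforward to evaluate numerically; the following is a minimal sketch (SciPy assumed, illustrative values of d and \sigma^{2}):

import numpy as np
from scipy.special import digamma
from scipy.stats import binom

def rho_relu(sigma, d):
    # eqn. (18): binomial mixture over the number n of positive coordinates of w
    n = np.arange(1, d + 1)
    p = binom.pmf(n, d, 0.5)
    return np.sum(p * (np.log(sigma) + 0.5 * (np.log(2.0) + digamma(n / 2.0))))

d = 128
for sigma2 in [1.0 / d, 2.0 / d, 3.0 / d]:
    print(f"sigma^2 * d = {sigma2 * d:.1f}   rho_ReLU = {rho_relu(np.sqrt(sigma2), d):+.5f}")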

Estimating \rho_{ReLU}:

To bound \rho_{ReLU} we can use Jensen's inequality. The digamma function \psi(z) is the derivative of the \log\Gamma(z) function, i.e.

\psi(z)=\frac{d}{dz}\log\Gamma(z)=\frac{\Gamma^{\prime}(z)}{\Gamma(z)}

where \Gamma(z) is the Gamma function. Differentiating repeatedly, we define

\psi^{(m)}(z):=\frac{d^{m}}{dz^{m}}\psi(z)=\frac{d^{m+1}}{dz^{m+1}}\log(\Gamma(z)). (19)

This function is called the polygamma function of order mm and it admits the representation

\psi^{(m)}(z)=(-1)^{m+1}\int_{0}^{\infty}\frac{t^{m}e^{-zt}}{1-e^{-t}}\,dt

for z>0. In particular, for m=2 we see that the second derivative \psi^{(2)}(z)\leq 0, so that the function \psi(z) is concave for z>0. Therefore, the function \rho(d) is a concave function of d for d\geq 1, whereas it is not defined for d=0. If we define \rho(0)=0, we can write

\rho_{ReLU}=\sum_{n=1}^{d}p(n)\rho(n)=\left(1-\frac{1}{2^{d}}\right)\sum_{n=1}^{d}q_{n}\rho(n)

where p(n)={d\choose n}\frac{1}{2^{d}} and q_{n}:=p(n)/\sum_{m=1}^{d}p(m), using the fact that \sum_{n=1}^{d}p(n)=1-\frac{1}{2^{d}}. Given a binomial random variable Y\sim\mbox{Bi}(d,1/2) with support \{0,1,\dots,d\}, we have q_{n}=\mathbb{P}(Y=n|Y\neq 0), and (q_{n}) defines a probability distribution over \{1,2,\dots,d\} since \sum_{n=1}^{d}q_{n}=1. Let Q be the random variable such that \mathbb{P}(Q=n)=q_{n}.

By Jensen's inequality (applied to the concave function \rho),

\rho_{ReLU}=\left(1-\frac{1}{2^{d}}\right)\mathbb{E}_{Q}\,\rho(Q) \leq \left(1-\frac{1}{2^{d}}\right)\rho(\mathbb{E}_{Q}(Q)) (20)
=\left(1-\frac{1}{2^{d}}\right)\rho\left(\sum_{n=1}^{d}q_{n}n\right)=\left(1-\frac{1}{2^{d}}\right)\rho\left(\frac{1}{1-\frac{1}{2^{d}}}\sum_{n=1}^{d}np_{n}\right) (21)
=\left(1-\frac{1}{2^{d}}\right)\rho\left(\frac{1}{1-\frac{1}{2^{d}}}\,\frac{d}{2}\right) (22)
Theorem 7.

Consider a ReLu network with infinitely many layers where each layer has width dd. We initialize the network with weights

W^{(k)}_{ij}\sim\mathcal{N}(0,\sigma^{2}_{d})

and the biases b^{(k)}=0. Then we can choose \sigma_{d} such that \rho_{ReLu}<0 and \sigma_{d}^{2}>2/d; for this choice of \sigma_{d}, the stationary distribution exists and is heavy tailed.

By a similar approach, we can also compute the moments (23) explicitly

\mathbb{E}\left[\frac{\|x^{(1)}\|^{s}}{\|x^{(0)}\|^{s}}\right]=\mathbb{E}\|\phi(W^{(1)}e_{1})\|^{s}=\mathbb{E}\left[\left(\sum_{j=1}^{d}\phi(w_{j})^{2}\right)^{s/2}\right]. (23)


If Y_{n} is a chi-square random variable with n degrees of freedom, its moments are explicitly available and are given by

m(n,s):=\mathbb{E}(Y_{n}^{s})=2^{s}\frac{\Gamma(n/2+s)}{\Gamma(n/2)}.

(see [walck1996hand, Sec. 8]). We compute

r(s):=\mathbb{E}\left[\frac{\|x^{(1)}\|^{s}}{\|x^{(0)}\|^{s}}\right]=\sigma_{d}^{s}\sum_{n=1}^{d}p(n)\,m(n,s/2) (24)

In particular, for s=2, this implies

\mathbb{E}\left[\frac{\|x^{(1)}\|^{2}}{\|x^{(0)}\|^{2}}\right]=\sigma_{d}^{2}\sum_{n=1}^{d}p(n)\,m(n,1) (25)

Since \Gamma(n/2+1)=(n/2)\Gamma(n/2), we have m(n,1)=2\Gamma(n/2+1)/\Gamma(n/2)=2\cdot n/2=n, and therefore

\mathbb{E}\left[\frac{\|x^{(1)}\|^{2}}{\|x^{(0)}\|^{2}}\right]=\sigma_{d}^{2}\sum_{n=1}^{d}p(n)\,m(n,1)=\sigma_{d}^{2}\sum_{n=1}^{d}p(n)\,n=\sigma_{d}^{2}d/2, (26)

which recovers the one-step identity behind (14). For s=1, we get

\mathbb{E}\left[\frac{\|x^{(1)}\|}{\|x^{(0)}\|}\right]=\sigma_{d}\sum_{n=1}^{d}p(n)\,m(n,1/2)=\sigma_{d}\sum_{n=1}^{d}p(n)\,\sqrt{2}\,\frac{\Gamma(n/2+1/2)}{\Gamma(n/2)}. (27)

It is also known that, for odd n,

\Gamma(n/2)=\frac{(n-2)!!\sqrt{\pi}}{2^{(n-1)/2}}

where n!! is the double factorial, while for even n we have \Gamma(n/2)=(n/2-1)!. Using these identities, the ratio in (27) can be evaluated in closed form:

\frac{\Gamma(n/2+1/2)}{\Gamma(n/2)}=\frac{(n-1)!!}{(n-2)!!}\cdot\frac{1}{\sqrt{\pi}}\quad\mbox{for odd }n, (28)
\frac{\Gamma(n/2+1/2)}{\Gamma(n/2)}=\frac{(n-1)!!}{(n-2)!!}\cdot\frac{\sqrt{\pi}}{2}\quad\mbox{for even }n, (29)

so (27) gives a fully explicit expression for \mathbb{E}\left[\|x^{(1)}\|/\|x^{(0)}\|\right].

Recall that the function r(s) does not depend on the choice of x^{(0)} as long as it is not zero, so without loss of generality we can set x^{(0)}=e_{1}. Note that we have

m(n,s)=\mathbb{E}Y_{n}^{s}=\mathbb{E}[e^{s\log(Y_{n})}]

Thus m(n,s) is the moment generating function of the random variable \log(Y_{n}), and hence a convex function of s by the properties of moment generating functions. It follows that the function r(s) defined by (24) is also a convex function of s. Clearly, we have m(n,0)=1. Furthermore,

\frac{d}{ds}m(n,s)\Big|_{s=0} = \frac{d}{ds}\left(\mathbb{E}(e^{s\log(Y_{n})})\right)\Big|_{s=0}=\mathbb{E}\left[\frac{d}{ds}e^{s\log(Y_{n})}\right]_{s=0} (30)
= \mathbb{E}\left[\log(Y_{n})\right]. (31)

Combining (30)-(31) with (24) shows that the derivative of r(s) at zero equals \rho_{ReLU}. The function r(s) is strictly convex (by the properties of moment generating functions; equivalently, the trigamma function, the second derivative of \log\Gamma, is strictly positive on the positive real axis), infinitely differentiable, satisfies m(n,0)=1, and r(s)\to\infty as s\to\infty. Therefore, if \rho_{ReLU}<0 there exists \alpha>0 such that r(\alpha)=1. The behavior of r(s) essentially depends on the sign of \rho_{ReLU}. If \rho_{ReLU}=0, then the higher order derivatives \frac{d^{m}}{ds^{m}}r(s)\big|_{s=0} determine the behavior; for instance, if \rho_{ReLU}=0 but \frac{d^{m}}{ds^{m}}r(s)\big|_{s=0}<0 for some finite m, then we can still conclude that there exists \alpha>0 such that r(\alpha)=1. We have the following cases (a numerical sketch for locating \alpha follows the list):

  • Case I: \rho_{ReLU}<0: There exists \alpha>0 such that r(\alpha)=1. The function r(s) is strictly convex with r(\alpha)=1 and r(s)<1 for s\in(0,\alpha). In this case \mathbb{E}\|x^{(k)}\|^{s}\to 0 if s\in(0,\alpha).

  • Case II: \rho_{ReLU}=0: In this case, r(s)>1 for any s>0 by the strict convexity of r(s), and all moments of x^{(k)} of order s>0 blow up.

  • Case III: \rho_{ReLU}>0: In this case, we have r(s)>1 for every s>0 and x^{(k)} goes to infinity with probability one.
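
The tail index \alpha of Case I can be located by root-finding on r(s)-1. The following is a minimal sketch (SciPy assumed; the width d and the scale \sigma_{d} are illustrative and chosen so that \rho_{ReLU}<0):

import numpy as np
from scipy.special import gammaln
from scipy.stats import binom
from scipy.optimize import brentq

def r(s, sigma, d):
    # r(s) = sigma^s * sum_n p(n) m(n, s/2) with m(n,t) = 2^t Gamma(n/2+t)/Gamma(n/2), see (24)
    n = np.arange(1, d + 1)
    p = binom.pmf(n, d, 0.5)
    m = np.exp((s / 2.0) * np.log(2.0) + gammaln(n / 2.0 + s / 2.0) - gammaln(n / 2.0))
    return sigma ** s * np.sum(p * m)

d = 128
sigma2 = 2.0 / d + 2.0 / d ** 2        # slightly above the Kaiming value 2/d, still rho_ReLU < 0
alpha = brentq(lambda s: r(s, np.sqrt(sigma2), d) - 1.0, 0.01, 10.0)
print("tail index alpha =", alpha)     # alpha = 2 is recovered exactly when sigma^2 = 2/d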

In particular, if \sigma_{d}^{2}=2/d, then we are in Case I above with \alpha=2. If we make \sigma_{d} larger while keeping \rho_{ReLU}<0, then \alpha gets smaller. In particular, if we choose \sigma_{d} to satisfy

r(\alpha):=\sigma_{d}^{\alpha}\sum_{n=1}^{d}p(n)\,m(n,\alpha/2)=1, (32)

i.e. if we choose

\sigma_{d}=\left(\sum_{n=1}^{d}p(n)\,m(n,\alpha/2)\right)^{-1/\alpha},

then we will have \mathbb{E}\left(\frac{\|x^{(k)}\|^{\alpha}}{\|x^{(0)}\|^{\alpha}}\right)=1 for every k, and the limiting distribution X will have a finite \alpha-th moment, whereas any moment of order p>\alpha will blow up. This is clearly a heavier tail than the exponential distribution.
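
This prescription for \sigma_{d} is easy to evaluate; a minimal sketch (SciPy assumed, illustrative d and target values of \alpha):

import numpy as np
from scipy.special import gammaln
from scipy.stats import binom

def sigma_for_alpha(alpha, d):
    # Choose sigma_d so that r(alpha) = 1 in (32), i.e. sigma_d = (sum_n p(n) m(n, alpha/2))^(-1/alpha)
    n = np.arange(1, d + 1)
    p = binom.pmf(n, d, 0.5)
    m = np.exp((alpha / 2.0) * np.log(2.0) + gammaln(n / 2.0 + alpha / 2.0) - gammaln(n / 2.0))
    return float(np.sum(p * m)) ** (-1.0 / alpha)

d = 128
for alpha in [2.0, 1.5, 1.0]:
    s = sigma_for_alpha(alpha, d)
    print(f"alpha={alpha:.1f}  sigma_d={s:.5f}  sigma_d^2 * d = {s ** 2 * d:.3f}")
# alpha = 2.0 recovers sigma_d^2 = 2/d, i.e. sigma_d^2 * d = 2.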

In the next section, we look at initialization with alpha stable distributions.

Remark 8.

The law of x^{(k)} converges to the stationary distribution almost surely, but this does not directly imply the convergence of the moments; one would typically need additional uniform integrability conditions. Furthermore, by the remark at the top of page 292, we can choose the initialization to recover an alpha-stable distribution in the limit.

2.1 Alpha-Stable Initialization

Theorem 9.

For a given \alpha\in(0,2), assume that we initialize the network weights for a deep ReLU network as

W^{(k)}_{i,j}\sim\mathcal{S}\alpha\mathcal{S}(\sigma_{d}),

for every i,j and k, while setting b^{(k)}=0 for every k. Then the Lyapunov exponent satisfies

\rho_{d}=\frac{1}{\alpha}\left[\log(\sigma_{d}^{\alpha})+\mathbb{E}\log\left(\sum_{i=1}^{N_{d}}|W_{i}|^{\alpha}\right)\right]

where N_{d}\sim\mbox{Bi}(d,1/2) is a binomial random variable and the W_{i} are standard symmetric \alpha-stable. Therefore, \rho<0 (resp. \rho>0) if

\sigma_{d}<(\mbox{resp.}>)\;e^{-Q_{d}(\alpha)}

where

Q_{d}(\alpha)=\frac{1}{\alpha}\,\mathbb{E}\log\left(\sum_{i=1}^{N_{d}}|W_{i}|^{\alpha}\right).

A sequence of d\times d networks is asymptotically stable (i.e. the input decays exponentially) if

\limsup_{d\to\infty}(d\log(d))^{1/\alpha}\sigma_{d}<2^{1/\alpha}J(\alpha),

and asymptotically unstable (i.e. input grows exponentially) if

\liminf_{d\to\infty}(d\log(d))^{1/\alpha}\sigma_{d}>2^{1/\alpha}J(\alpha),

where J(\alpha) is defined by (7).

Proof.

The Lyapunov exponent will admit the formula

\rho_{ReLu}=\mathbb{E}\log\|\phi(W^{(1)}e_{1})\|

But this quantity is invariant if we replace the 2-norm above with an L_{p} norm, or even with

\|x\|_{\alpha}:=\left(\sum_{i=1}^{d}|x_{i}|^{\alpha}\right)^{1/\alpha}

for any \alpha>0; see [cohen1984stability, Introduction]. Then, using the properties of alpha-stable distributions, we can show by an argument similar to [cohen1984stability] that

\|W^{(1)}e_{1}\|_{\alpha}^{\alpha}\sim\sigma_{d}^{\alpha}\sum_{i=1}^{d}|W_{i}|^{\alpha}

Then, due to the ReLU term, we get

\rho=\mathbb{E}\log\|\phi(W^{(1)}e_{1})\|_{\alpha}=\mathbb{E}\log\left(\sigma_{d}\Big(\sum_{i=1}^{N_{d}}|W_{i}|^{\alpha}\Big)^{1/\alpha}\right)=\log(\sigma_{d})+\frac{1}{\alpha}\mathbb{E}\log\left(\sum_{i=1}^{N_{d}}|W_{i}|^{\alpha}\right).

This matches the claimed formula for \rho_{d}. For the asymptotic statements, the proof sketch is as follows: approximate the binomial distribution Bi(d,1/2) by a Gaussian with mean d/2, approximate the resulting expectations by Gaussian integrals, and use the limiting argument in the Cohen-Newman paper that computes the distribution of the integrand in the Lyapunov exponent. ∎

According to Theorem 9, in order to avoid exponential decay/growth of the input we can choose \sigma_{d} to satisfy

(d\log(d))^{1/\alpha}\sigma_{d}=2^{1/\alpha}J(\alpha),

i.e.

\sigma_{d}=\frac{2^{1/\alpha}J(\alpha)}{(d\log(d))^{1/\alpha}}.
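
This ReLU version of the \mathcal{S}\alpha\mathcal{S} scale differs from the linear one only by the factor 2^{1/\alpha}; a minimal sketch of the resulting values (SciPy assumed, illustrative \alpha and d):

import numpy as np
from scipy.special import gamma

def J(alpha):
    # Threshold from eqn. (7)
    return (alpha * np.pi / (2.0 * gamma(alpha) * np.sin(alpha * np.pi / 2.0))) ** (1.0 / alpha)

def sigma_relu_stable(alpha, d):
    # Critical S-alpha-S scale for the ReLU network suggested by Theorem 9
    return 2.0 ** (1.0 / alpha) * J(alpha) / (d * np.log(d)) ** (1.0 / alpha)

for alpha in [0.5, 1.0, 1.5, 1.9]:
    print(f"alpha={alpha:.1f}  sigma_d(d=128) = {sigma_relu_stable(alpha, 128):.3e}")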

Analogously, we can look at the moments

r_{\alpha}(s):=\mathbb{E}\left[\frac{\|x^{(1)}\|_{\alpha}^{s}}{\|x^{(0)}\|_{\alpha}^{s}}\right] = \mathbb{E}\|\phi(W^{(1)}e_{1})\|_{\alpha}^{s}=\mathbb{E}\left[\Big(\sum_{j=1}^{d}\phi(w_{j})^{\alpha}\Big)^{s/\alpha}\right] (33)
= \sigma_{d}^{s}\,\mathbb{E}\left[\Big(\sum_{i=1}^{N_{d}}|W_{i}|^{\alpha}\Big)^{s/\alpha}\right] (34)
= \sigma_{d}^{s}\sum_{n=1}^{d}p(n)\,\mathbb{E}\left[\Big(\sum_{i=1}^{n}|W_{i}|^{\alpha}\Big)^{s/\alpha}\right]. (35)


3 Initialization with Exponential Tails

We consider the Laplace distribution with mean zero. Its density can be expressed as

f(x)=\frac{1}{2c}e^{-|x|/c}

and will be denoted by \mbox{Laplace}(c). This distribution coincides with the exponential density up to a constant factor on the positive real axis. The variance of the Laplace distribution is 2c^{2}, so if we set c=\sigma_{d}/\sqrt{2}, the distribution has variance equal to \sigma_{d}^{2}.

Let us assume that we set

W_{ij}^{(k)}\sim\mbox{Laplace}\left(\frac{\sigma_{d}}{\sqrt{2}}\right).

Recall that we have the formula

\rho_{ReLu}=\mathbb{E}\log\|\phi(W^{(1)}e_{1})\|_{1}=\mathbb{E}\log\big(\phi(W_{11}^{(1)})+\phi(W_{21}^{(1)})+\dots+\phi(W_{d1}^{(1)})\big). (36)

In particular, the choice of the norm above does not matter; we could use \|\cdot\|_{\alpha} for any \alpha>0. For computational convenience, we chose the L_{1} norm above.

It is easy to check from the symmetry of the Laplace distribution with respect to the origin that

X\sim\textrm{Laplace}(c)\implies|X|\sim\textrm{Exponential}\left(c^{-1}\right),

where \textrm{Exponential}(\lambda) is the exponential distribution with rate \lambda and density

f(x;\lambda)=\begin{cases}\lambda e^{-\lambda x}&x\geq 0,\\ 0&x<0.\end{cases}

Therefore, since each coordinate \phi(W_{i1}^{(1)}) is zero with probability 1/2 and otherwise has the law of |W_{i1}^{(1)}|\sim\textrm{Exponential}(c^{-1})=\textrm{Exponential}\left(\frac{\sqrt{2}}{\sigma_{d}}\right), (36) becomes

\rho_{ReLu}=\mathbb{E}\log\|\phi(W^{(1)}e_{1})\|_{1}=\mathbb{E}\log\Big(\sum_{i:\,W_{i1}^{(1)}>0}|W_{i1}^{(1)}|\Big), (37)

and by similar ideas to before, we obtain

\rho_{ReLu}=\mathbb{E}\log\|\phi(W^{(1)}e_{1})\|_{1}=\mathbb{E}\log\Big(\sum_{i=1}^{N_{d}}X_{i}\Big), (38)

where the X_{i} are i.i.d. with X_{i}\sim\textrm{Exponential}\left(\frac{\sqrt{2}}{\sigma_{d}}\right) and N_{d}\sim\mbox{Binomial}(d,1/2) (with the convention that the random sum is zero when N_{d}=0). We recognize that the sum of m i.i.d. exponentially distributed random variables with parameter \lambda is a Gamma distribution \Gamma(m,\lambda) with density

f(x)=\frac{\lambda^{m}}{\Gamma(m)}x^{m-1}e^{-x\lambda}.

The logarithm of a Gamma distribution \Gamma(m,\lambda) follows a log-Gamma distribution Z_{m,\lambda} (sometimes confused with the exponential-gamma distribution) with expectation

\mathbb{E}(Z_{m,\lambda})=\psi(m)+\log\left(\frac{1}{\lambda}\right)

and variance

\mbox{var}(Z_{m,\lambda})=\psi^{\prime}(m)>0

(see the paper by Halliwell). Plugging in \lambda=\frac{\sqrt{2}}{\sigma_{d}}, we obtain

\mathbb{E}(Z_{m,\lambda})=\psi(m)+\log\left(\frac{\sigma_{d}}{\sqrt{2}}\right).

For the ReLu computation,

\rho_{ReLu} = \mathbb{E}\log\|\phi(W^{(1)}e_{1})\|_{1}=\mathbb{E}\log\Big(\sum_{i=1}^{N_{d}}X_{i}\Big) (39)
= \sum_{m=1}^{d}{d\choose m}\Big(\frac{1}{2}\Big)^{d}\,\mathbb{E}Z_{m,\lambda} (40)
= \sum_{m=1}^{d}{d\choose m}\Big(\frac{1}{2}\Big)^{d}\left[\psi(m)+\log\Big(\frac{1}{\lambda}\Big)\right] (41)
= \sum_{m=1}^{d}{d\choose m}\Big(\frac{1}{2}\Big)^{d}\left[\psi(m)+\log\Big(\frac{\sigma_{d}}{\sqrt{2}}\Big)\right] (42)

where \psi is the digamma function.

This suggests that when we initialize the network entries with a Laplace (double exponential) distribution, we should choose \sigma_{d} so that \rho_{ReLu}=0 in the formula above. Note that this formula is different from the one in the Gaussian case.
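
Solving \rho_{ReLu}=0 in (42) for \sigma_{d} gives \log(\sigma_{d}/\sqrt{2})=-\sum_{m=1}^{d}p(m)\psi(m)/\sum_{m=1}^{d}p(m); a minimal numerical sketch (SciPy assumed, illustrative widths d):

import numpy as np
from scipy.special import digamma
from scipy.stats import binom

def sigma_laplace(d):
    # Solve rho_ReLu = 0 in eqn. (42) for the Laplace scale sigma_d
    m = np.arange(1, d + 1)
    p = binom.pmf(m, d, 0.5)
    return np.sqrt(2.0) * np.exp(-np.sum(p * digamma(m)) / np.sum(p))

for d in [16, 64, 256]:
    print(f"d={d:4d}  critical sigma_d = {sigma_laplace(d):.5f}")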

4 Initialization with Weibull tails

The idea is to use the fact that

X\sim\mathrm{Weibull}\left(\lambda,\tfrac{1}{2}\right)\quad\mbox{then}\quad\sqrt{X}\sim\mathrm{Exponential}\left(\tfrac{1}{\sqrt{\lambda}}\right)

and recycle the analysis for the exponential distribution.

5 Other initializations

If the entries are sampled from the Laplace distribution X (also called the double exponential distribution), then |X| is exponentially distributed, and the sum of d exponentials Y follows an Erlang distribution (a special case of the gamma distribution) for which \mathbb{E}\log(Y) is explicitly known. In other words, we can compute the optimal \sigma for the double exponential distribution as well (if tails are heavier than exponential, we call them heavy tailed). In particular, for the Laplace distribution we can compute \mathbb{E}(\|x^{(k)}\|) explicitly.

Another possibility is to consider initialization with a t-distribution. The square of a t-distributed random variable follows an F-distribution, see e.g.
https://stats.stackexchange.com/questions/78940/distribution-of-sum-of-squares-of-t-distributed-random-variables Although sums of F-distributed random variables do not have a closed-form density, we can characterize the characteristic function of such distributions and compute all the moments from the characteristic function.

If the initialization is a two-sided Weibull distribution, then its characteristic function is known; see e.g. the paper titled "Characteristic and Moment Generating Functions of Three Parameter Weibull Distribution-An Independent Approach" by G. Muraleedharan. Then we can compute \mathbb{E}\|x^{(k)}\|_{1} based on characteristic functions of random sums of Weibull distributions.

Another option is to look at a two-sided Pareto distribution and study the Lyapunov exponent with respect to the L1 norm using characteristic function techniques.

6 Correlated Random Initialization

Here, we make use of explicit computations for the top Lyapunov exponent of correlated Gaussian random matrices. The simplest setting is when each column is sampled from a Gaussian vector. The idea is to use the result of this paper

https://arxiv.org/pdf/1306.6576.pdf

see Theorem 1.1

X=W^{T}W has a Wishart distribution. The expectation of \ln|\mathbf{X}| can be computed as follows; the following formula plays a role in variational Bayes derivations for Bayesian networks involving the Wishart distribution:

\operatorname{E}[\,\ln\left|\mathbf{X}\right|\,]=\psi_{p}\left(\frac{n}{2}\right)+p\,\ln(2)+\ln|\mathbf{V}|

where \psi_{p} is the multivariate digamma function (the derivative of the log of the multivariate gamma function).

Log-variance: the following variance formula could also be of help:

\operatorname{Var}\left[\,\ln\left|\mathbf{X}\right|\,\right]=\sum_{i=1}^{p}\psi_{1}\left(\frac{n+1-i}{2}\right)

where \psi_{1} is the trigamma function. This comes up when computing the Fisher information of the Wishart random variable; see the Wikipedia page on the Wishart distribution.
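
Both formulas are easy to evaluate and to sanity-check by Monte Carlo, using the fact that the multivariate digamma function is a sum of ordinary digamma values, \psi_{p}(n/2)=\sum_{i=1}^{p}\psi((n+1-i)/2). A minimal sketch (SciPy assumed; the degrees of freedom n and the scale matrix V are illustrative):

import numpy as np
from scipy.special import digamma, polygamma
from scipy.stats import wishart

def wishart_logdet_moments(n, V):
    # E[ln|X|] and Var[ln|X|] for X ~ Wishart_p(n, V)
    p = V.shape[0]
    i = np.arange(1, p + 1)
    mean = np.sum(digamma((n + 1 - i) / 2.0)) + p * np.log(2.0) + np.log(np.linalg.det(V))
    var = np.sum(polygamma(1, (n + 1 - i) / 2.0))   # trigamma sum
    return mean, var

n, V = 10, np.eye(3)
m, v = wishart_logdet_moments(n, V)
samples = wishart.rvs(df=n, scale=V, size=20000, random_state=0)
logdets = np.linalg.slogdet(samples)[1]
print("E[ln|X|]  :", m, "vs MC", logdets.mean())
print("Var[ln|X|]:", v, "vs MC", logdets.var())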