Better Random Initialization in Deep Learning
1 Priors and Initialization in Deep Learning
We consider a deep linear network where h_ℓ denotes the output of the ℓ-th layer with initialization h_0 = x, a similar setup to [vladimirova2018understanding] except that we assume the network is linear for now. We have the recursion
h_ℓ = W_ℓ h_{ℓ−1} + b_ℓ,   ℓ = 1, 2, …,    (1)
where W_ℓ is the weight matrix of the ℓ-th layer and b_ℓ is the bias vector. We assume that each layer has the same number n of neurons, so that W_ℓ is a square matrix. The paper [vladimirova2018understanding] considers Gaussian initializations of the matrices W_ℓ and shows that the output of the ℓ-th layer (i.e. h_ℓ) has tails distributed as a sub-Weibull distribution with parameter θ = ℓ/2, i.e. with tails decaying like exp(−x^{2/ℓ}). For large ℓ, this tail is heavier than that of the exponential distribution; however, the result still shows superpolynomial decay of the tails; in particular, it does not show that the tails can have polynomial decay such as P(|X| ≥ x) ≈ x^{−α} for some α > 0. On the other hand, recent literature has found that the weights and the gradients in neural network training can often have distributions with polynomial decay.
Here, the purpose is to show that polynomial decay is possible when the number of layers is large enough, improving the results of [vladimirova2018understanding] to show that even heavier tails can arise. Similar to [vladimirova2018understanding], we assume that the network is initialized with a Gaussian distribution, i.e.
(W_ℓ)_{ij} ~ N(0, σ²/n)  i.i.d. over i, j and ℓ,
where σ > 0 is fixed and n is the dimension (number of neurons at each layer). In particular, for the stochastic recursion (1), the top Lyapunov exponent is explicitly known:
λ(σ, n) = ½ [ log(2σ²/n) + ψ(n/2) ]    (2)
(see [newman1986, eqn. (7)]), where ψ(x) = Γ′(x)/Γ(x) is the digamma function, which admits the asymptotic expansion
ψ(x) = log x − 1/(2x) − 1/(12x²) + O(1/x⁴)  as x → ∞.    (3)
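As a quick numerical check of (2) and (3) as reconstructed above, the short Python sketch below evaluates λ(σ, n) with scipy's digamma and compares it with the approximation ½ log σ² − 1/(2n) obtained from the expansion (3); the helper name lyapunov_gaussian is ours.

# Sketch: evaluate the reconstructed formula (2) for i.i.d. N(0, sigma^2/n)
# entries and compare with the approximation implied by the expansion (3).
import numpy as np
from scipy.special import digamma

def lyapunov_gaussian(sigma2, n):
    # lambda(sigma, n) = 0.5 * (log(2 sigma^2 / n) + psi(n / 2))
    return 0.5 * (np.log(2.0 * sigma2 / n) + digamma(n / 2.0))

for n in (10, 100, 1000):
    exact = lyapunov_gaussian(1.0, n)             # sigma^2 = 1
    approx = 0.5 * np.log(1.0) - 1.0 / (2.0 * n)  # 0.5*log(sigma^2) - 1/(2n)
    print(f"n={n:5d}  exact={exact:+.6f}  approx={approx:+.6f}")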
Note that λ(σ, n) is deterministic (see [newman1986, eqn. (7)]). Another issue we will focus on is the initialization. It has been noticed by researchers that if σ is too large, then the variance of h_ℓ can grow exponentially over the iterates. In particular, the authors of [he2015delving] argued that for a ReLU network, a sufficient condition to avoid exponential blow-up would be choosing a layer-dependent entry variance of the form σ_ℓ² = 2/n_ℓ, where n_ℓ is the number of neurons at layer ℓ. In our setup, we take n_ℓ = n for every layer ℓ. A proper initialization method should avoid growing the input exponentially as the number of layers grows. A natural question would be the following:
• If we take σ² = c for some constant c, would the network be stable? In other words, would the moments of h_ℓ be finite? If so, which moments will be finite; can we have, for instance, finite variance? How does the tail depend on the parameters? In particular, if the limit h_∞ exists and is heavy-tailed, it will allow us to explore the space more efficiently in the beginning of the SGD iterations.
The initialization proposed by [he2015delving] corresponds to σ² = 2 in our setting, where the factor 2 arises due to the fact that the ReLU activation is used in their setting. If Z is a random variable that is symmetric around the origin, the second moment of max(Z, 0) is half of the second moment of Z due to symmetry. In other words, applying ReLU to a symmetric random variable decreases its second moment by a factor of 2; that is why they can allow σ² = 2. However, for a linear network, if we follow the arguments of [he2015delving], then a sufficient condition to avoid blow-up over the layers is to choose σ² = 1. In fact, they argue that the choice σ² < 1 for a linear network would make the input exponentially small as a function of the number of layers. This is true if we choose the biases to be zero, along an argument by [kesten1973random]; however, if we choose the biases randomly, this does not have to be the case, as we will explain further. Also, we will prove that we can even choose σ² larger than 1 and yet avoid blow-up of the iterations.
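The variance-halving step above is easy to verify numerically; the snippet below (purely illustrative) checks that E[max(Z, 0)²] = E[Z²]/2 for a symmetric Z.

# Monte Carlo check that ReLU halves the second moment of a symmetric variable.
import numpy as np

z = np.random.default_rng(0).standard_normal(5_000_000)
print(np.mean(np.maximum(z, 0.0) ** 2), 0.5 * np.mean(z ** 2))  # both close to 0.5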
Lemma 1.
Consider a linear network with infinitely many layers where each layer has width n. Suppose we initialize the network with weights
(W_ℓ)_{ij} ~ N(0, σ²/n)  i.i.d.,
and the biases b_ℓ i.i.d., sampled from a distribution that takes values in ℝⁿ. If λ(σ, n) < 0 and E[ log⁺ ‖b_1‖ ] < ∞, then the Markov chain determining the law of the iterates admits a unique stationary distribution, which is given by the law of the almost surely convergent series
h_∞ = Σ_{k=1}^{∞} W_1 W_2 ⋯ W_{k−1} b_k,    (4)
where W_1 W_2 ⋯ W_{k−1} denotes the product of the first k − 1 weight matrices, with the convention that the empty product (the k = 1 term) is the identity matrix I_n.
Proof.
This is a well known result, see e.g. the introduction of [alsmeyer2012tail]. ∎
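To illustrate Lemma 1, the sketch below simulates the recursion (1) with Gaussian weights and i.i.d. Gaussian biases (the bias scale 0.1 is an arbitrary illustrative choice); when λ(σ, n) < 0, the norm of the iterates settles into a stationary regime rather than exploding or collapsing to zero.

# Simulate h_l = W_l h_{l-1} + b_l with (W_l)_{ij} ~ N(0, sigma^2/n); with a
# negative Lyapunov exponent the iterates reach a stationary regime.
import numpy as np

rng = np.random.default_rng(1)
n, sigma2, depth = 100, 1.0, 300
h = rng.standard_normal(n)
norms = []
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(sigma2 / n), size=(n, n))
    b = 0.1 * rng.standard_normal(n)      # illustrative bias distribution
    h = W @ h + b
    norms.append(np.linalg.norm(h))
print("median ||h_l|| over the last 100 layers:", np.median(norms[-100:]))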
Remark 2.
An analogue of this lemma is also available for Lipschitz maps, see [ELTON199039]. This can also handle the ReLU case.
Remark 3.
One can also compute all the moments of h_∞ explicitly and determine for which choices of the parameters which moments blow up.
Theorem 4.
In the setting of Lemma 1, assume that we choose σ such that it satisfies
λ(σ, n) = ½ [ log(2σ²/n) + ψ(n/2) ] < 0.    (5)
Then the Markov chain determining the law of the iterates admits a unique stationary distribution that can be expressed by the almost surely convergent series (4). Furthermore, if we choose σ such that
σ² = 1 + c/n  for some positive constant c < 1,    (6)
then the condition (5) will be satisfied for every n large enough. Furthermore, for 0 < c < 1, the limit h_∞ is heavy tailed in the sense that its second moment is infinite. However, if instead c > 1, then ‖h_ℓ‖ goes to infinity with probability one for n large enough.
Remark 5.
If we take b_ℓ = 0 for every ℓ, we see from (4) that the limit has to be necessarily zero almost surely. However, the limit can be non-trivial if we choose non-zero random biases; nevertheless, the limit will be independent of the initialization h_0 = x [kesten1973random, Thm 6].
Proof.
The proof follows from the formula (5) and Lemma 1. The fact that the choice (6) satisfies (5) is due to the expansion (3). If c > 1, then we have λ(σ, n) > 0 for n large enough and the results follow from [cohen1984stability, Prop 2.1]. From a computation similar to the one used to derive (14), we see that E[ ‖W_ℓ h‖² ] = σ² ‖h‖²; hence if σ² = 1 the second moment of the iterates is constant (for zero biases), and in particular σ² > 1 means that the variance of h_ℓ will blow up. This implies that the stationary distribution also has infinite variance. ∎
1.1 Numerical Experiments for Gaussian Initialization
The website Example Python Code has some example code for testing the stability of the initializations. Let us try the initialization suggested by (6) and compare it with the Kaiming initialization after 100 layers; we should also compare the performance on the CNN tasks reported on that web page.
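A minimal version of the experiment described above (not the code from the linked page) is sketched below: it propagates a unit-norm input through 100 layers and reports the empirical per-layer growth rate of log ‖h_ℓ‖ for the initialization σ² = 1 + c/n of (6) (with an illustrative c) and for the Kaiming/He choice σ² = 2; the CNN comparison is not reproduced here.

# Sketch: per-layer growth rate of log||h_l|| after 100 layers for the
# initialization sigma^2 = 1 + c/n of (6) versus Kaiming/He (sigma^2 = 2).
import numpy as np

def growth_rate(sigma2, n=256, depth=100, relu=False, seed=0):
    rng = np.random.default_rng(seed)
    h = rng.standard_normal(n)
    h /= np.linalg.norm(h)
    for _ in range(depth):
        h = rng.normal(0.0, np.sqrt(sigma2 / n), size=(n, n)) @ h
        if relu:
            h = np.maximum(h, 0.0)
    return np.log(np.linalg.norm(h)) / depth

n, c = 256, 0.5                     # c is an illustrative constant
print("linear, sigma^2 = 1 + c/n:", growth_rate(1.0 + c / n, n))
print("linear, sigma^2 = 2 (He) :", growth_rate(2.0, n))
print("ReLU,   sigma^2 = 2 (He) :", growth_rate(2.0, n, relu=True))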
1.2 Network initialization with symmetric alpha stable distributions
What happens if we initialize the network with a different distribution that may have heavier tails than the Gaussian? For example, recent research shows that the weights of the network may demonstrate heavy-tailed behavior that can be modeled with a symmetric alpha-stable distribution (SαS), whose characteristic function scales like exp(−σ^α |t|^α), where σ is a scale parameter, recovering the Gaussian case when α = 2. If we initialize the network weights as (W_ℓ)_{ij} ~ SαS(σ) i.i.d. with a fixed choice of α ∈ (0, 2), how can we choose the scale parameter σ as a function of n? Would the scaling that works in the Gaussian case still work, or would such an initialization lead to exponential blow-up?
A random variable X is called a standard symmetric stable random variable with tail index α ∈ (0, 2], denoted X ~ SαS(1), if
E[ exp(i t X) ] = exp( −|t|^α )  for every t ∈ ℝ.
For a vector x ∈ ℝⁿ, we also introduce the notation
‖x‖_α := ( Σ_{i=1}^{n} |x_i|^α )^{1/α}.
We say that a random variable X ~ SαS(σ) if X/σ is a standard symmetric stable random variable.
Theorem 6.
For a given α ∈ (0, 2), assume that we initialize the network weights for a deep linear network as
(W_ℓ)_{ij} ~ SαS(σ_n)  i.i.d.
for every i, j and every layer ℓ, while setting b_ℓ = 0 for every ℓ. If
lim sup_{n→∞} σ_n / σ_α(n) < 1,
then ‖h_ℓ‖ converges to 0 with probability one for n large enough, whereas if
lim inf_{n→∞} σ_n / σ_α(n) > 1,
then, for n large enough, ‖h_ℓ‖ goes to infinity with probability one, where
σ_α(n) := exp( −(1/α) E[ log( |S_1|^α + ⋯ + |S_n|^α ) ] )    (7)
and S_1, …, S_n are i.i.d. standard symmetric α-stable random variables.
Proof.
This is basically a restatement of [cohen1984stability, Theorem 2.7]. ∎
A consequence of this theorem is that it is appropriate to set σ_n = σ_α(n) in order to avoid the exponential blow-up or the exponential decay of the dependency on the input when the network weights are initialized with the SαS distribution. Given that the network weights often behave in a heavy-tailed manner, it would make sense to initialize them in a heavy-tailed manner as well.
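The sketch below illustrates an SαS initialization with scipy.stats.levy_stable. The scale C·n^{−1/α} used here is only a rough proxy for the threshold σ_α(n) in (7) (an assumption for illustration, with C arbitrary); the printout is the empirical per-layer growth of log ‖h_ℓ‖.

# Sketch: deep linear network with symmetric alpha-stable weights; the scale
# C * n**(-1/alpha) is a rough proxy for the threshold of (7), not the exact value.
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(2)
alpha, n, depth, C = 1.5, 64, 30, 1.0
scale = C * n ** (-1.0 / alpha)

h = rng.standard_normal(n)
h /= np.linalg.norm(h)
for _ in range(depth):
    W = levy_stable.rvs(alpha, 0.0, loc=0.0, scale=scale,
                        size=(n, n), random_state=rng)
    h = W @ h
print("per-layer growth of log||h_l||:", np.log(np.linalg.norm(h)) / depth)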
2 ReLU Activation
Here we use the ReLU activation:
h_ℓ = φ( W_ℓ h_{ℓ−1} + b_ℓ ),    (8)
where
φ(x) = max(x, 0)  (applied entrywise)
is the ReLU activation and we set the biases b_ℓ = 0 for every ℓ. The analysis of [cohen1984stability] also shows that in fact we can compute, for every ℓ, the distribution of the ratio
R_ℓ := ‖h_ℓ‖ / ‖h_{ℓ−1}‖.
Moreover, the random variables R_ℓ are i.i.d., because the distribution of this ratio does not depend on the choice of h_{ℓ−1} (this stems from the fact that the Gaussian distribution is rotationally symmetric). Therefore, we get
E[ ‖h_L‖^s ] = ‖h_0‖^s ( E[ R_1^s ] )^L  for every s > 0.
This quantity does not depend on the choice of h_{ℓ−1}, so without loss of generality we can choose h_{ℓ−1} = e_1 (the first standard basis vector) in our computations, which yields
E[ R_ℓ² ] = E[ ‖φ(W_ℓ e_1)‖² ] = Σ_{j=1}^{n} E[ max(w_j, 0)² ],    (9)
where w = (w_1, …, w_n) is the first column of W_ℓ. We have w_j ~ N(0, σ²/n) i.i.d. The distribution of w_j is symmetric with respect to the origin; therefore, for each j, we obtain
E[ max(w_j, 0)² ] = σ²/(2n),    (10)
because
E[ max(w_j, 0)² ] = ∫_0^∞ x² p_j(x) dx    (11)
= ½ ∫_{−∞}^{∞} x² p_j(x) dx    (12)
= ½ E[ w_j² ] = σ²/(2n),    (13)
where p_j denotes the (symmetric) density of w_j.
Combining everything, we obtain
E[ R_ℓ² ] = E[ ‖h_ℓ‖² / ‖h_{ℓ−1}‖² ] = Σ_{j=1}^{n} σ²/(2n) = σ²/2.    (14)
From here, we see that if we choose σ² = 2, then this quantity equals 1 and the second moment of the iterates behaves in a stable fashion.
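A quick Monte Carlo check of (14) as derived above: with (W)_{ij} ~ N(0, σ²/n), the expected squared norm of the ReLU of the first column is σ²/2, so σ² = 2 keeps this second moment equal to one.

# Monte Carlo check of (14): E||relu(w)||^2 = sigma^2 / 2 for w ~ N(0, (sigma^2/n) I_n).
import numpy as np

rng = np.random.default_rng(3)
n, sigma2, trials = 64, 2.0, 200_000
w = rng.normal(0.0, np.sqrt(sigma2 / n), size=(trials, n))
print("Monte Carlo:", np.mean(np.sum(np.maximum(w, 0.0) ** 2, axis=1)),
      " theory:", sigma2 / 2)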
It follows through a similar argument that, for the ReLU case, the top Lyapunov exponent admits the formula
λ_ReLU(σ, n) = lim_{L→∞} (1/L) log ‖h_L‖ = E[ log ‖φ(W_ℓ e_1)‖ ]    (15)
= ½ E[ log( Σ_{j=1}^{n} max(w_j, 0)² ) ].    (16)
We can compute this expectation by exploiting the symmetries of the integration. In ℝⁿ there are 2ⁿ orthants (generalized quadrants). Each orthant can be identified with an element of the set {−, +}ⁿ. On every orthant that corresponds to k plus signs and n − k minus signs, the conditional distribution of Σ_{j=1}^{n} max(w_j, 0)² is that of (σ²/n) Q_k, where Q_k is a chi-square random variable with k degrees of freedom. If we pick a random orthant, with probability C(n, k) 2^{−n} we will be in such an orthant. Building on the formula [newman1986, eqn. (7)], we can show that
λ_ReLU(σ, n) = 2^{−n} Σ_{k=1}^{n} C(n, k) · ½ E[ log( (σ²/n) Q_k ) ]    (17)
= 2^{−n} Σ_{k=1}^{n} C(n, k) · ½ [ log(2σ²/n) + ψ(k/2) ],    (18)
where ψ is the digamma function from (2) (the ReLU is zero on the negative orthant, i.e. the set of vectors whose components are all negative; therefore the sum above does not include the term k = 0).
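The binomial–digamma sum (17)–(18), as reconstructed here, is easy to evaluate numerically; the sketch below computes λ_ReLU(σ, n) and locates the critical σ² where it vanishes, which sits slightly above 2.

# Evaluate the orthant formula (17)-(18) and find the critical sigma^2 where
# lambda_relu(sigma, n) = 0 (our reconstruction of that formula).
import numpy as np
from scipy.special import digamma
from scipy.stats import binom
from scipy.optimize import brentq

def lyapunov_relu(sigma2, n):
    k = np.arange(1, n + 1)
    w = binom.pmf(k, n, 0.5)          # C(n,k) / 2^n, k = 1..n
    return float(np.sum(w * 0.5 * (np.log(2.0 * sigma2 / n) + digamma(k / 2.0))))

n = 100
print("lambda_relu at sigma^2 = 2:", lyapunov_relu(2.0, n))
print("critical sigma^2          :", brentq(lambda s: lyapunov_relu(s, n), 1.0, 4.0))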
Estimating λ_ReLU(σ, n):
To bound λ_ReLU(σ, n) we can use Jensen's inequality. The digamma function is the derivative of the log Γ function, i.e.
ψ(x) = (d/dx) log Γ(x) = Γ′(x)/Γ(x),
where Γ is the Gamma function.
where is the Gamma function. We define
Differentiating under the integral, we obtain that
(19) |
This function is called the polygamma function of order and it admits the representation
for . In particular for , we see that the second derivative so that the function is concave for . Therefore, the function is a concave function of for whereas it is not defined for . If we define , we can write
λ_ReLU(σ, n) = ½ (1 − 2^{−n}) ( log(2σ²/n) + E[ g(K̃) ] ),
where K̃ is defined as follows, using the fact that Σ_{k=1}^{n} C(n, k) 2^{−n} = 1 − 2^{−n}. Given a binomial random variable K with support over {0, 1, …, n} and P(K = k) = C(n, k) 2^{−n}, conditioning on {K ≥ 1} defines a probability distribution over {1, …, n} as P(K̃ = k) = C(n, k) 2^{−n} / (1 − 2^{−n}). Let K̃ be the random variable with this distribution; note that E[K̃] = (n/2)/(1 − 2^{−n}).
By Jensen's inequality,
E[ g(K̃) ] = E[ ψ(K̃/2) ]    (20)
≤ ψ( E[K̃]/2 )    (21)
= ψ( n / (4(1 − 2^{−n})) ),    (22)
so that λ_ReLU(σ, n) ≤ ½ (1 − 2^{−n}) ( log(2σ²/n) + ψ( n/(4(1 − 2^{−n})) ) ).
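Numerically, the Jensen bound above is quite tight; the sketch below compares the exact average E[ψ(K̃/2)] with the bound ψ(E[K̃]/2) for a few widths, using the conditioned-binomial weights described above.

# Compare E[psi(K~/2)] (K~ = Binomial(n,1/2) conditioned on K >= 1) with the
# Jensen upper bound psi(E[K~]/2).
import numpy as np
from scipy.special import digamma
from scipy.stats import binom

for n in (20, 100, 500):
    k = np.arange(1, n + 1)
    p = binom.pmf(k, n, 0.5)
    p /= p.sum()                                  # condition on K >= 1
    exact = float(np.sum(p * digamma(k / 2.0)))
    bound = float(digamma(np.sum(p * k) / 2.0))
    print(f"n={n:4d}  exact={exact:.5f}  Jensen bound={bound:.5f}")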
Theorem 7.
Consider a ReLU network with infinitely many layers where each layer has width n. We initialize the network with weights
(W_ℓ)_{ij} ~ N(0, σ²/n)  i.i.d.,
and the biases b_ℓ i.i.d. as in Lemma 1. Then, we can choose σ such that σ² > 2 and λ_ReLU(σ, n) < 0; and for this choice of σ the stationary distribution exists and is heavy tailed.
By a similar approach, we can also compute the moments (23) explicitly:
ρ(n, s) := E[ ( ‖h_ℓ‖ / ‖h_{ℓ−1}‖ )^s ] = E[ ‖φ(W_ℓ e_1)‖^s ].    (23)
If Q_k is a chi-square random variable with k degrees of freedom, its moments are explicitly available and are given by
E[ Q_k^m ] = 2^m Γ( m + k/2 ) / Γ( k/2 ),  for m > −k/2
(see [walck1996hand, Sec. 8]). We compute
ρ(n, s) = 2^{−n} Σ_{k=1}^{n} C(n, k) (2σ²/n)^{s/2} Γ( (s + k)/2 ) / Γ( k/2 ).    (24)
In particular, for s = 2, this implies
ρ(n, 2) = 2^{−n} Σ_{k=1}^{n} C(n, k) (2σ²/n) Γ( 1 + k/2 ) / Γ( k/2 ).    (25)
Since Γ(1 + k/2) = (k/2) Γ(k/2), we have Γ(1 + k/2)/Γ(k/2) = k/2 and therefore,
ρ(n, 2) = (σ²/n) E[K] = σ²/2,  where K ~ Binomial(n, 1/2),    (26)
which recovers the identity (14) we derived earlier. For s = 4, we get
ρ(n, 4) = (2σ²/n)² E[ (K/2)(K/2 + 1) ] = (σ⁴/4)(1 + 5/n).    (27)
It is also known that E[ Z^{2m} ] = (2m − 1)!! σ_Z^{2m} for a centered Gaussian random variable Z with variance σ_Z²,
where (2m − 1)!! is the double factorial. Therefore, we get, alternatively,
E[ ‖φ(w)‖⁴ ] = n E[ max(w_1, 0)⁴ ] + n(n − 1) ( E[ max(w_1, 0)² ] )² = 3σ⁴/(2n) + (n − 1)σ⁴/(4n)    (28)
= (σ⁴/4)(1 + 5/n),    (29)
which matches (27).
Remember that the function ρ(n, s) does not depend on the choice of h_{ℓ−1} as long as it is not zero. So, without loss of generality we can set h_{ℓ−1} = e_1. Note that we have
ρ(n, s) = E[ exp( s log ‖φ(W_ℓ e_1)‖ ) ] =: m(n, s).
Therefore, m(n, s) is the moment generating function of the random variable log ‖φ(W_ℓ e_1)‖; therefore it is a convex function of s by the properties of moment generating functions. Then, it follows that the function ρ(n, s) defined by (24) is a convex function of s. Clearly, we have ρ(n, 0) = 1. Furthermore,
(d/ds) ρ(n, s) |_{s=0} = E[ log ‖φ(W_ℓ e_1)‖ ]    (30)
= λ_ReLU(σ, n).    (31)
In other words, λ_ReLU(σ, n) is the derivative at zero of the function s ↦ ρ(n, s), which is strictly convex (because the second derivative of the log-Gamma function, the trigamma function, is strictly positive on the positive real axis), infinitely differentiable, satisfies ρ(n, 0) = 1, and ρ(n, s) → ∞ as s → ∞. Therefore, if λ_ReLU(σ, n) < 0, there exists s* > 0 such that ρ(n, s) < 1 for s ∈ (0, s*). The behavior of ρ(n, ·) will essentially depend on the sign of λ_ReLU(σ, n). If λ_ReLU(σ, n) = 0, then the higher-order derivatives will determine the behavior: for instance, if λ_ReLU(σ, n) = 0 but some finite-order derivative of s ↦ ρ(n, s) at zero is non-zero, then we can still conclude that there exists ε > 0 such that ρ(n, s) > 1 for every s ∈ (0, ε). We have the following cases:
• Case I: λ_ReLU(σ, n) < 0. There exists s* > 0 such that ρ(n, s*) = 1. The function s ↦ ρ(n, s) is strictly convex and we have ρ(n, s) < 1 for s ∈ (0, s*) and ρ(n, s) > 1 for s > s*. In this case, E[ ‖h_∞‖^s ] < ∞ if s < s*.
• Case II: λ_ReLU(σ, n) = 0. In this case, ρ(n, s) > 1 for any s > 0 by the strict convexity of ρ(n, ·). All the moments of h_∞ of order s > 0 blow up.
• Case III: λ_ReLU(σ, n) > 0. In this case, we have ρ(n, s) > 1 for every s > 0, and ‖h_ℓ‖ goes to infinity with probability one.
In particular, if σ² = 2, then we are in Case I above with s* = 2. If we make σ² larger while keeping λ_ReLU(σ, n) < 0, then s* will get smaller. In particular, if we choose σ to satisfy
ρ(n, s*) = 1  for a prescribed s* ∈ (0, 2),    (32)
i.e. if we choose
σ² = (n/2) ( 2^{−n} Σ_{k=1}^{n} C(n, k) Γ( (s* + k)/2 ) / Γ( k/2 ) )^{−2/s*},
then we will have ρ(n, s) < 1 for every s ∈ (0, s*), the limiting distribution will have a finite s-th moment for every s < s*, whereas any moment of order larger than s* will blow up. This is clearly a heavier tail than the exponential distribution.
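The recipe (32) can be implemented with a root-finder: the sketch below evaluates ρ(n, s) from (24) (as reconstructed here, via log-gamma for numerical stability) and solves ρ(n, s*) = 1 for σ², with an illustrative target s* = 1.

# Solve rho(n, s*) = 1 for sigma^2, where rho is the moment function (24),
# so that s* becomes the critical moment of the stationary distribution.
import numpy as np
from scipy.special import gammaln
from scipy.stats import binom
from scipy.optimize import brentq

def rho(sigma2, n, s):
    k = np.arange(1, n + 1)
    w = binom.pmf(k, n, 0.5)
    ratio = np.exp(gammaln((s + k) / 2.0) - gammaln(k / 2.0))
    return float(np.sum(w * (2.0 * sigma2 / n) ** (s / 2.0) * ratio))

n, s_star = 100, 1.0                               # illustrative target tail index
sigma2 = brentq(lambda v: rho(v, n, s_star) - 1.0, 0.5, 10.0)
print("sigma^2 for critical moment s* = 1:", sigma2)
print("rho(n, 2) at this sigma^2:", rho(sigma2, n, 2.0), "(> 1: infinite variance)")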
In the next subsection, we look at initialization with alpha-stable distributions.
Remark 8.
The law of h_ℓ converges to the stationary distribution; however, this does not directly imply the convergence of the moments; one would typically need additional uniform integrability conditions. Furthermore, by the remark at the top of page 292, we can choose the initialization to recover an alpha-stable distribution in the limit.
2.1 Alpha-Stable Initialization
Theorem 9.
For a given α ∈ (0, 2), assume that we initialize the network weights for a deep ReLU network as
(W_ℓ)_{ij} ~ SαS(σ)  i.i.d.
for every i, j and every ℓ, while setting b_ℓ = 0 for every ℓ. Then the Lyapunov exponent satisfies
λ_α(σ, n) = log σ + (1/α) E[ log( |S_1|^α + ⋯ + |S_K|^α ) ],
where K ~ Binomial(n, 1/2) is a binomial random variable and S_1, S_2, … are i.i.d. standard symmetric α-stable random variables, independent of K. Therefore, λ_α(σ, n) < 0 (resp. λ_α(σ, n) > 0) if σ < σ_α^ReLU(n) (resp. σ > σ_α^ReLU(n)),
where
σ_α^ReLU(n) := exp( −(1/α) E[ log( |S_1|^α + ⋯ + |S_K|^α ) ] ).
A sequence of networks with width n and scale σ = σ_n is asymptotically stable (i.e. the input decays exponentially) if
lim sup_{n→∞} σ_n / σ_α^ReLU(n) < 1,
and asymptotically unstable (i.e. the input grows exponentially) if
lim inf_{n→∞} σ_n / σ_α^ReLU(n) > 1,
where σ_α^ReLU(n) is comparable, up to a factor depending only on α, to the quantity σ_α(n) defined by (7).
Proof.
The Lyapunov exponent will admit the formula
λ_α(σ, n) = lim_{L→∞} (1/L) log ‖h_L‖₂.
But this quantity is invariant if we replace the 2-norm above with an L^p norm, or even with ‖x‖_p = (Σ_i |x_i|^p)^{1/p} for any p > 0; see [cohen1984stability, Introduction]. Then, using the stability property of α-stable distributions, we can show by a similar argument to [cohen1984stability] that, conditionally on h_{ℓ−1}, the components of W_ℓ h_{ℓ−1} are i.i.d. SαS(σ ‖h_{ℓ−1}‖_α). Then, due to the ReLU term, only a Binomial(n, 1/2) number of these components survive, and we get
λ_α(σ, n) = log σ + (1/α) E[ log( |S_1|^α + ⋯ + |S_K|^α ) ].
It is known that the expectation of the logarithm of sums of powers of stable random variables admits explicit expressions. Using the properties of α-stable distributions, the asymptotic statement follows. Proof sketch: approximate the binomial distribution with a Gaussian with mean n/2 and approximate everything by a Gaussian integral; use the limit argument in the Cohen–Newman paper that computes the distribution of the integrand in the Lyapunov exponent.
∎
According to Theorem 9, in order to avoid exponential decay/growth of the input we can choose σ to satisfy
λ_α(σ, n) = 0,
i.e.
σ = σ_α^ReLU(n) = exp( −(1/α) E[ log( |S_1|^α + ⋯ + |S_K|^α ) ] ).
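Under the reconstruction of Theorem 9 used above, the threshold σ_α^ReLU(n) can be estimated by simple Monte Carlo, as sketched below; the parameters are illustrative and levy_stable sampling can be slow for large sizes.

# Monte Carlo estimate of sigma_relu(n) = exp(-(1/alpha) E log(|S_1|^a+...+|S_K|^a))
# with K ~ Binomial(n, 1/2) and S_i i.i.d. standard symmetric alpha-stable.
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(4)
alpha, n, trials = 1.5, 64, 2000
logs = np.empty(trials)
for t in range(trials):
    k = max(rng.binomial(n, 0.5), 1)               # drop the negligible K = 0 orthant
    s = levy_stable.rvs(alpha, 0.0, size=k, random_state=rng)
    logs[t] = np.log(np.sum(np.abs(s) ** alpha))
print("estimated critical scale sigma_relu(n):", np.exp(-logs.mean() / alpha))
print("for comparison, n**(-1/alpha) =", n ** (-1.0 / alpha))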
Analogously, we can look at the moments
ρ_α(n, s) := E[ ( ‖h_ℓ‖_α / ‖h_{ℓ−1}‖_α )^s ]    (33)
= σ^s E[ ( |S_1|^α + ⋯ + |S_K|^α )^{s/α} ],    (34)
which is finite if and only if s < α.    (35)
3 Initialization with Exponential Tails
We consider the Laplace distribution with mean zero. Its density can be expressed as
p_b(x) = (1/(2b)) exp( −|x|/b ),  x ∈ ℝ,
and will be denoted by Laplace(0, b). This density coincides with the exponential distribution density up to a constant factor on the positive real axis. The variance of the Laplace distribution is given by 2b². So if we set b = σ/√2, the distribution will have variance equal to σ².
Let us assume that we set
(W_ℓ)_{ij} ~ Laplace(0, b)  i.i.d. over i, j and ℓ.
Recall that we have the formula
λ(n) = lim_{L→∞} (1/L) log ‖h_L‖₁.    (36)
In particular, the choice of the norm above does not matter: we could use ‖·‖_p for any p ≥ 1. For computational convenience, we choose the L¹ norm here.
It is easy to check from the symmetry with respect to the origin of the Laplace distribution that if w_j ~ Laplace(0, b), then, conditionally on w_j > 0, we have max(w_j, 0) ~ Exp(1/b),
where Exp(λ) denotes the exponential distribution with parameter λ, with density λ exp(−λ x), x ≥ 0.
Therefore, (36) becomes
λ(n) = E[ log ‖φ(W_ℓ e_1)‖₁ ] = E[ log( max(w_1, 0) + ⋯ + max(w_n, 0) ) ],    (37)
where w = (w_1, …, w_n) is the first column of W_ℓ. Conditionally on the signs of the entries, the positive entries are i.i.d. Exp(1/b), so by similar ideas to before we obtain
λ(n) = E[ log( E_1 + ⋯ + E_K ) ],    (38)
where the E_j are i.i.d. with E_j ~ Exp(1/b) and K ~ Binomial(n, 1/2) (with the convention that the random sum is zero when K = 0). We recognize that the sum of k-many i.i.d. exponentially distributed random variables with parameter 1/b is a Gamma random variable G_k with density
g_k(x) = x^{k−1} exp(−x/b) / ( Γ(k) b^k ),  x ≥ 0.
The logarithm of a Gamma random variable follows a log-Gamma distribution (sometimes confused with the exponential-gamma distribution) with expectation
E[ log G_k ] = ψ(k) + log b.
For the ReLU computation,
λ(n) = Σ_{k=1}^{n} 2^{−n} C(n, k) E[ log( E_1 + ⋯ + E_k ) ]    (39)
= Σ_{k=1}^{n} 2^{−n} C(n, k) E[ log G_k ]    (40)
= Σ_{k=1}^{n} 2^{−n} C(n, k) ( ψ(k) + log b )    (41)
= (1 − 2^{−n}) log b + 2^{−n} Σ_{k=1}^{n} C(n, k) ψ(k),    (42)
where ψ is the digamma function.
This suggests that, when we set the entries of the network according to a Laplace (double exponential) distribution, we should choose b so that the expression (42) vanishes. Note that this formula is different than the one in the Gaussian case.
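The closing expression (42) pins down the Laplace scale numerically: the sketch below evaluates 2^{−n} Σ_k C(n, k) ψ(k) and the b that makes (42) vanish (ignoring the negligible 2^{−n} correction), and compares it with the rough guess 2/n.

# Solve log(b) + 2^{-n} sum_k C(n,k) psi(k) = 0 for the Laplace scale b
# (the k = 0 orthant contributes only a 2^{-n} correction, ignored here).
import numpy as np
from scipy.special import digamma
from scipy.stats import binom

n = 100
k = np.arange(1, n + 1)
avg_psi = float(np.sum(binom.pmf(k, n, 0.5) * digamma(k)))
print("b solving (42) = 0:", np.exp(-avg_psi), "  rough guess 2/n:", 2.0 / n)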
4 Initialization with Weibull tails
The idea is to use the fact that if E ~ Exp(1), then E^{1/k} follows a Weibull distribution with shape parameter k (equivalently, the k-th power of a Weibull random variable with shape k is exponentially distributed), and recycle the analysis for the exponential distribution.
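The fact quoted above is a one-line change of variables; the snippet below checks it with a Kolmogorov–Smirnov test (the shape parameter k is an arbitrary illustrative choice).

# If E ~ Exp(1), then E**(1/k) is Weibull with shape parameter k.
import numpy as np
from scipy.stats import kstest, weibull_min

k = 2.5
x = np.random.default_rng(5).exponential(size=200_000) ** (1.0 / k)
print(kstest(x, weibull_min(c=k).cdf))   # large p-value: consistent with Weibull(k)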
5 Other initializations
If the entries are sampled from the Laplace distribution (also called the double exponential distribution), then the absolute value of each entry is exponentially distributed, and a sum of i.i.d. exponentials is an Erlang random variable (which is a special case of the gamma distribution) whose log-moments are explicitly known. In other words, we can compute the optimal scale for the double exponential distribution as well (if the tails are heavier than exponential, we call them heavy tailed); for the Laplace distribution the Lyapunov exponent can be computed explicitly.
Another possibility is to consider initialization with a t-distribution. The square of a t-distributed random variable follows an F-distribution, see e.g.
https://stats.stackexchange.com/questions/78940/distribution-of-sum-of-squares-of-t-distributed-random-variables
Although sums of F-distributed random variables do not have a closed-form density, we can characterize the characteristic function of such sums and compute all the moments from the characteristic function.
If the initialization is a two-sided Weibull distribution, then its characteristic function is known, see e.g. the paper titled "Characteristic and Moment Generating Functions of Three Parameter Weibull Distribution-An Independent Approach" by G. Muraleedharan. Then, we can compute the corresponding Lyapunov exponent based on the characteristic functions of random sums of Weibull random variables.
Another option is to look at a two-sided Pareto distribution and study the Lyapunov exponent with respect to the L¹ norm using characteristic function techniques.
6 Correlated Random Initialization
Here, we make use of explicit computations for the top Lyapunov exponent of correlated Gaussian random matrices. The simplest setting is when each column of the weight matrix is sampled from a Gaussian vector with a given covariance. The idea is to use the result of the following paper (see Theorem 1.1):
https://arxiv.org/pdf/1306.6576.pdf
If the columns of W are i.i.d. N(0, V) vectors in ℝ^p, then W Wᵀ has a Wishart distribution W_p(V, m), where m is the number of columns. The log-expectation of its determinant can be computed as follows; the following formula plays a role in variational Bayes derivations for Bayesian networks involving the Wishart distribution:
E[ log det X ] = ψ_p(m/2) + p log 2 + log det V,  X ~ W_p(V, m),
where ψ_p(a) = Σ_{i=1}^{p} ψ( a + (1 − i)/2 ) is the multivariate digamma function (the derivative of the log of the multivariate gamma function).
Log-variance: the following variance computation could be of help in Bayesian statistics:
Var[ log det X ] = Σ_{i=1}^{p} ψ₁( (m + 1 − i)/2 ),
where ψ₁ is the trigamma function. This comes up when computing the Fisher information of the Wishart random variable; see the Wikipedia page on the Wishart distribution.
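The two Wishart identities quoted above are easy to check against a Monte Carlo average; the sketch below does so for an illustrative W_p(I, m).

# Check E[log det X] = psi_p(m/2) + p log 2 + log det V for X ~ W_p(V, m),
# with psi_p(a) = sum_{i=1}^p psi(a + (1 - i)/2) the multivariate digamma.
import numpy as np
from scipy.special import digamma
from scipy.stats import wishart

p, m = 5, 12
V = np.eye(p)
i = np.arange(1, p + 1)
theory = np.sum(digamma(m / 2.0 + (1.0 - i) / 2.0)) + p * np.log(2.0) + np.log(np.linalg.det(V))
samples = wishart.rvs(df=m, scale=V, size=20_000, random_state=0)
mc = np.mean([np.linalg.slogdet(S)[1] for S in samples])
print("theory:", theory, "  Monte Carlo:", mc)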