
Exact Solutions of a Deep Linear Network

Liu Ziyin1, Botao Li2, Xiangming Meng3
1Department of Physics, The University of Tokyo
2Laboratoire de Physique de l’Ecole normale supérieure, ENS, Université PSL,
CNRS, Sorbonne Université, Université de Paris Cité, Paris, France
3Institute for Physics of Intelligence, Graduate School of Science, The University of Tokyo
Abstract

This work finds the analytical expression of the global minima of a deep linear network with weight decay and stochastic neurons, a fundamental model for understanding the landscape of neural networks. Our result implies that the origin is a special point in the deep neural network loss landscape where highly nonlinear phenomena emerge. We show that weight decay strongly interacts with the model architecture and can create bad minima at zero in a network with more than one hidden layer, qualitatively different from a network with only one hidden layer. Practically, our result implies that common deep learning initialization methods are insufficient to ease the optimization of neural networks in general.

1 Introduction

Applications of neural networks have achieved great success in various fields. One central open question is why neural networks, being nonlinear and containing many saddle points and local minima, can sometimes be optimized easily (Choromanska et al., 2015a) while becoming difficult and requiring many tricks to train in other scenarios (Glorot and Bengio, 2010; Gotmare et al., 2018). One established approach is to study the landscape of deep linear nets (Choromanska et al., 2015b), which are believed to approximate the landscape of nonlinear nets well. A series of works proved the famous result that, for a deep linear net, all local minima are global (Kawaguchi, 2016; Lu and Kawaguchi, 2017; Laurent and Brecht, 2018), which is regarded as a successful explanation of why deep neural networks are easy to train, because it implies that initialization in any attractive basin can reach the global minimum without much effort (Kawaguchi, 2016). However, the theoretical problem of when and why neural networks can be hard to train remains understudied.

In this work, we theoretically study a deep linear net with weight decay and stochastic neurons, whose loss function takes the following form in general:

\underbrace{\mathbb{E}_{x}\mathbb{E}_{\epsilon^{(1)},\epsilon^{(2)},...,\epsilon^{(D)}}\left(\sum_{i,i_{1},i_{2},...,i_{D}}^{d,d_{1},d_{2},...d_{D}}U_{i_{D}}\epsilon^{(D)}_{i_{D}}...\epsilon_{i_{2}}^{(2)}W_{i_{2}i_{1}}^{(2)}\epsilon_{i_{1}}^{(1)}W_{i_{1}i}^{(1)}x_{i}-y\right)^{2}}_{L_{0}}+\underbrace{\gamma_{u}||U||_{2}^{2}+\sum_{i=1}^{D}\gamma_{i}||W^{(i)}||_{F}^{2}}_{L_{2}\ reg.}, (1)

where $\mathbb{E}_{x}$ denotes the expectation over the training set, $U$ and $W^{(i)}$ are the model parameters, $D$ is the depth of the network (in this work, we use "depth" to refer to the number of hidden layers; for example, a linear regressor has depth 0), $\epsilon$ is the noise in the hidden layers (e.g., due to dropout), $d_{i}$ is the width of the $i$-th layer, and $\gamma$ is the strength of the weight decay. Previous works have studied special cases of this loss function. For example, Kawaguchi (2016) and Lu and Kawaguchi (2017) study the landscape of $L_{0}$ when $\epsilon$ is a constant (namely, when there is no noise). Mehta et al. (2021) studies $L_{0}$ with (a more complicated type of) weight decay but without stochasticity and proves that all the stationary points are isolated. Another line of works studies $L_{0}$ when the noise is caused by dropout (Mianjy and Arora, 2019; Cavazza et al., 2018). Our setting is more general than the previous works in two respects. First, apart from the mean square error (MSE) loss $L_{0}$, an $L_{2}$ regularization term (weight decay) with arbitrary strength is included; second, the noise $\epsilon$ is arbitrary. Thus, our setting is arguably closer to actual deep learning practice, where the injection of noise into latent layers is common and the use of weight decay is virtually ubiquitous (Krogh and Hertz, 1992; Loshchilov and Hutter, 2017). One major limitation of our work is that we assume the label $y$ to be 1-dimensional; it is an important future problem to prove whether an exact solution exists when $y$ is high-dimensional.
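
For concreteness, the following minimal numpy sketch evaluates the objective in Eq. (1) by Monte Carlo for a small, illustrative instance (the depth, widths, noise level, weight decay, and data below are assumptions of the sketch and not the settings used in our experiments):

import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes and constants (assumptions, not taken from the paper's experiments).
d, widths = 5, [8, 8]                 # input dim; hidden widths d_1, ..., d_D (so D = 2)
n, sigma2, gamma = 1000, 1.0, 1e-3    # number of samples, noise variance, weight decay

X = rng.normal(size=(n, d))
v = rng.normal(size=d)
y = X @ v                             # an arbitrary (here linear) labeling function y(x)

Ws = [0.1 * rng.normal(size=(widths[0], d))]
for i in range(1, len(widths)):
    Ws.append(0.1 * rng.normal(size=(widths[i], widths[i - 1])))
U = 0.1 * rng.normal(size=widths[-1])

def loss(U, Ws, n_noise=500):
    # Monte Carlo estimate of L_0 (expectation over x and over the noises) plus the L_2 term.
    mse = 0.0
    for _ in range(n_noise):
        h = X.T                                        # shape (d, n)
        for W in Ws:                                   # each hidden layer is followed by noise eps
            eps = 1.0 + np.sqrt(sigma2) * rng.normal(size=(W.shape[0], 1))
            h = eps * (W @ h)
        mse += np.mean((U @ h - y) ** 2)
    reg = gamma * (np.sum(U**2) + sum(np.sum(W**2) for W in Ws))
    return mse / n_noise + reg

print(loss(U, Ws))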

Figure 1: Left: A summary of the network landscape implied by the main results of this work as one increases the weight decay strength $\gamma$ while fixing other terms. We show that the landscape of a depth-1 net can be precisely divided into two regimes, while, for $D\geq 2$, there exist at least three regimes. The solid blue line indicates that the division of the regimes is precisely understood. The dashed lines indicate that the conditions we found are not tight and may be improved in the future. Right: ResNet18 on CIFAR10. The performance of a linear regressor never drops to that of a trivial model, whereas the performance of ResNet18 drops to the level of a trivial model, like a deep linear net of similar depth.

Our foremost contribution is to prove that every global minimum of an arbitrarily deep and wide linear net takes a simple analytical form. In other words, we identify in closed form the global minima of Eq. (1) up to a single scalar, whose analytical expression does not exist in general. We then show that it has nontrivial properties that can explain many phenomena in deep learning. In particular, the implications of our result include (but are not limited to):

  1.

    Weight decay makes the landscape of neural nets more complicated;

    • we show that bad minima (unless otherwise specified, we use "bad minimum" to mean a local minimum that is not a global minimum) emerge as weight decay is applied, whereas there is no bad minimum when there is no weight decay. This highlights the need to escape bad local minima in deep learning with weight decay.

  2.

    Deeper nets are harder to optimize than shallower ones;

    • we show that a $D\geq 2$ linear net contains a bad minimum at zero, whereas a $D=1$ net does not. This partially explains why deep networks are much harder to optimize than shallower ones in deep learning practice.

  3.

    Depending on the task, the common initialization methods (such as the Kaiming init.) can initialize a deep model in the basin of attraction of the bad minimum at zero;

    • common initialization methods initialize the models at a radius of roughly $1/\sqrt{\rm width}$ around the origin; however, we show that the width of the bad minimum is task-dependent and can be larger than the initialization radius for tasks with a small margin ($||\mathbb{E}[xy]||$);

  4.

    Thus, the use of (effective) weight decay is a major cause of various types of collapses in deep learning (for example, see Figure 1).

Organization: In the next section, we discuss the related works. In Section 3, we derive the exact solution for a two-layer net. Section 4 extends the result to an arbitrary depth. In Section 5, we study and discuss the relevance of our results to many commonly encountered problems in deep learning. The last section concludes the work and discusses unresolved open problems. All proofs are deferred to Appendix B. Moreover, additional theoretical results on the effect of including a bias term are presented in Appendix D.

Notation. For a matrix $W$, we use $W_{i:}$ to denote the $i$-th row vector of $W$. $||Z||$ denotes the $L_{2}$ norm if $Z$ is a vector and the Frobenius norm if $Z$ is a matrix. The notation $*$ signals an optimized quantity. Additionally, we use the superscript and subscript interchangeably, whichever leads to a simpler expression. For example, $b_{*}^{2}$ and $(b^{*})^{2}$ denote the same quantity, while the former is "simpler."

2 Related Works

In many ways, linear networks have been used to help understand nonlinear networks. For example, even at depth 0, where the linear net is just a linear regressor, linear nets are shown to be relevant for understanding the generalization behavior of modern overparametrized networks (Hastie et al., 2019). Saxe et al. (2013) studies the training dynamics of a depth-1 network and uses it to understand the dynamics of learning in nonlinear networks. These networks are the same as a linear regression model in terms of expressivity. However, the loss landscape is highly complicated due to the existence of more than one layer, and linear nets are widely believed to approximate the loss landscape of a nonlinear net (Kawaguchi, 2016; Hardt and Ma, 2016; Laurent and Brecht, 2018). In particular, the landscape of linear nets has been studied as early as 1989 by Baldi and Hornik (1989), who proposed the well-known conjecture that all local minima of a deep linear net are global. This conjecture was first proved in Kawaguchi (2016) and extended to other loss functions and greater depths in Lu and Kawaguchi (2017) and Laurent and Brecht (2018). Many relevant contemporary deep learning problems can be understood with deep linear models. For example, two-layer linear VAE models have been used to understand the cause of the posterior collapse problem (Lucas et al., 2019; Wang and Ziyin, 2022). Deep linear nets are also used to understand the neural collapse problem in contrastive learning (Tian, 2022). We also provide more empirical evidence in Section 5.

3 Two-layer Linear Net

This section finds the global minima of a two-layer linear net. The data point is a $d$-dimensional vector $x\in\mathbb{R}^{d}$ drawn from an arbitrary distribution, and the labels are generated through an arbitrary function $y=y(x)\in\mathbb{R}$. For generality, we let different layers have different strengths of weight decay even though they often take the same value in practice. We want to minimize the following objective:

L_{d,d_{1}}(U,W)=\mathbb{E}_{x}\mathbb{E}_{\epsilon}\left(\sum_{j}^{d_{1}}U_{j}\epsilon_{j}\sum_{i}^{d}W_{ji}x_{i}-y\right)^{2}+\gamma_{w}||W||^{2}+\gamma_{u}||U||^{2}, (2)

where $d_{1}$ is the width of the hidden layer and the $\epsilon_{i}$ are independent random variables. $\gamma_{w}>0$ and $\gamma_{u}>0$ are the weight decay parameters. Here, we consider a general type of noise with $\mathbb{E}[\epsilon_{i}]=1$ and $\mathbb{E}[\epsilon_{i}\epsilon_{j}]=\delta_{ij}\sigma^{2}+1$, where $\delta_{ij}$ is the Kronecker delta and $\sigma^{2}>0$. (While we formally require $\gamma$ and $\sigma$ to be nonzero, one can show that the solutions we provide remain global minimizers in the zero limit by applying Theorem 2 of Ziyin and Ueda (2022).) For shorthand, we use the notation $A_{0}:=\mathbb{E}[xx^{T}]$, and the largest and smallest eigenvalues of $A_{0}$ are denoted as $a_{\rm max}$ and $a_{\rm min}$, respectively. $a_{i}$ denotes the $i$-th eigenvalue of $A_{0}$, viewed in any order. For now, it is sufficient to assume that the global minimum of Eq. (2) always exists; we prove a more general existence result in Proposition 1, when we deal with multilayer nets.
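
Because the noise enters Eq. (2) only through its first two moments, the expectation over $\epsilon$ can be carried out in closed form, giving $\mathbb{E}_{\epsilon}[(\sum_j U_j\epsilon_j(Wx)_j-y)^2]=(U^{T}Wx-y)^2+\sigma^2\sum_j U_j^2(Wx)_j^2$. The following numpy sketch (with assumed sizes and hyperparameters) checks this identity against a direct Monte Carlo estimate:

import numpy as np

rng = np.random.default_rng(1)

# Illustrative dimensions and data (assumptions for this sketch only).
d, d1, n, sigma2 = 4, 6, 500, 0.5
gamma_u, gamma_w = 1e-3, 1e-3

X = rng.normal(size=(n, d))
y = np.tanh(X @ rng.normal(size=d))       # an arbitrary labeling function y(x)
U = rng.normal(size=d1)
W = rng.normal(size=(d1, d))

def loss_closed_form(U, W):
    # E_eps integrates out exactly: (U.(Wx) - y)^2 + sigma^2 * sum_j U_j^2 (Wx)_j^2.
    A = X @ W.T                                            # (n, d1), rows are (Wx)^T
    mse = np.mean((A @ U - y) ** 2 + sigma2 * (A**2) @ (U**2))
    return mse + gamma_w * np.sum(W**2) + gamma_u * np.sum(U**2)

def loss_monte_carlo(U, W, n_noise=4000):
    A = X @ W.T
    eps = 1.0 + np.sqrt(sigma2) * rng.normal(size=(n_noise, d1))
    preds = (eps * U) @ A.T                                # (n_noise, n)
    mse = np.mean((preds - y) ** 2)
    return mse + gamma_w * np.sum(W**2) + gamma_u * np.sum(U**2)

print(loss_closed_form(U, W), loss_monte_carlo(U, W))      # agree up to Monte Carlo error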

3.1 Main Result

We first present two lemmas showing that the global minimum can only lie on a rather restrictive subspace of all possible parameter settings due to invariances in the objective.

Lemma 1.

At the global minimum of Eq. (2), $U_{j}^{2}=\frac{\gamma_{w}}{\gamma_{u}}\sum_{i}W_{ji}^{2}$ for all $j$.

Proof Sketch. We use the fact that the first term of Eq. (2) is invariant to a simultaneous rescaling of rows of the weight matrix to find the optimal rescaling, which implies the lemma statement. \square

This lemma implies that, for all $j$, $|U_{j}|$ must be proportional to the norm of its corresponding row vector in $W$. In other words, using weight decay makes all layers of a deep neural network balanced. This property has been referred to as the "weight balancing" condition in recent works (Tanaka et al., 2020) and, in some sense, is a unique and potentially essential feature of neural networks that encourages a sparse solution (Ziyin and Wang, 2023). The following lemma further shows that, at the global minimum, all elements of $U$ must be equal in absolute value.
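
As a sanity check of Lemma 1, one can train the two-layer objective numerically and inspect the balancing condition at convergence. The following PyTorch sketch (with assumed sizes, noise level, and optimizer settings; the noise is integrated out in closed form) does this; at any stationary point reached by the optimizer, $U_j^2\approx(\gamma_w/\gamma_u)\sum_i W_{ji}^2$ should hold approximately:

import torch

torch.manual_seed(0)
d, d1, n = 4, 6, 512
sigma2, gamma_u, gamma_w = 0.5, 1e-2, 2e-2

X = torch.randn(n, d)
y = X @ torch.randn(d)

U = torch.randn(d1, requires_grad=True)
W = torch.randn(d1, d, requires_grad=True)
opt = torch.optim.Adam([U, W], lr=1e-2)

for _ in range(20000):
    A = X @ W.T                                  # rows of A are (Wx)^T
    loss = ((A @ U - y) ** 2).mean() + sigma2 * ((A ** 2) @ (U ** 2)).mean() \
           + gamma_w * (W ** 2).sum() + gamma_u * (U ** 2).sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Lemma 1 predicts U_j^2 = (gamma_w / gamma_u) * sum_i W_ji^2 for every row j.
print(U.detach() ** 2)
print((gamma_w / gamma_u) * (W.detach() ** 2).sum(dim=1))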

Lemma 2.

At the global minimum, for all $i$ and $j$, we have

\begin{cases}U_{i}^{2}=U_{j}^{2};\\ U_{i}W_{i:}=U_{j}W_{j:}.\end{cases} (3)

Proof Sketch. We show that if the condition is not satisfied, then an “averaging" transformation will strictly decrease the objective. \square

This lemma can be seen as a formalization of the intuition suggested in the original dropout paper (Srivastava et al., 2014). Namely, using dropout encourages the neurons to be independent of one another and results in an averaging effect. The second lemma imposes strong conditions on the solution of the problem, and the essence of this lemma is the reduction of the original problem to a lower dimension. We are now ready to prove our first main result.

Theorem 1.

The global minima $U_{*}$ and $W_{*}$ of Eq. (2) are $U_{*}=0$ and $W_{*}=0$ if and only if

||\mathbb{E}[xy]||^{2}\leq\gamma_{u}\gamma_{w}. (4)

When $||\mathbb{E}[xy]||^{2}>\gamma_{u}\gamma_{w}$, the global minima are

\begin{cases}U_{*}=b\mathbf{r};\\ W_{*}=\mathbf{r}\mathbb{E}[xy]^{T}b\left[b^{2}\left(\sigma^{2}+d_{1}\right)A_{0}+\gamma_{w}I\right]^{-1},\end{cases} (5)

where $\mathbf{r}=(\pm 1,...,\pm 1)$ is an arbitrary vertex of a $d_{1}$-dimensional hypercube, and $b$ satisfies:

\bigg{|}\bigg{|}\left[b^{2}\left(\sigma^{2}+d_{1}\right)A_{0}+\gamma_{w}I\right]^{-1}\mathbb{E}[xy]\bigg{|}\bigg{|}^{2}=\frac{\gamma_{u}}{\gamma_{w}}. (6)

Apparently, $b=0$ is the trivial solution that has not learned any feature due to overregularization. Henceforth, we refer to this solution (and similar solutions for deeper nets) as the "trivial" solution. We now analyze the properties of the nontrivial solution $b^{*}$ when it exists.

The condition for the solution to become nontrivial is interesting: $||\mathbb{E}[xy]||^{2}>\gamma_{u}\gamma_{w}$. The term $||\mathbb{E}[xy]||$ can be seen as the effective strength of the signal, and $\gamma_{u}\gamma_{w}$ is the strength of the regularization. This precise condition means that the learning of a two-layer net can be divided into two qualitatively different regimes: an "overregularized regime" where the global minimum is trivial, and a "feature learning regime" where the global minimum involves actual learning. Lastly, note that our main result does not specify the exact value of $b^{*}$. This is because $b^{*}$ must satisfy the condition in Eq. (6), which is equivalent to a high-order polynomial in $b$ whose coefficients are general functions of the eigenvalues of $A_{0}$ and whose roots are generally not analytical by Galois theory. One special case where an analytical formula for $b$ exists is $A_{0}=\sigma_{x}^{2}I$. See Section C for more discussion.
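
Although $b^{*}$ has no closed form in general, it is easy to compute numerically: the left-hand side of Eq. (6) is monotonically decreasing in $b^{2}$, so a simple bisection suffices. A minimal numpy sketch, with assumed data and regularization strengths:

import numpy as np

rng = np.random.default_rng(2)

d, d1, sigma2, gamma_u, gamma_w = 5, 8, 1.0, 1e-2, 1e-2
X = rng.normal(size=(2000, d))
y = X @ rng.normal(size=d)
A0 = X.T @ X / len(X)                               # E[x x^T]
Exy = X.T @ y / len(X)                              # E[x y]

def lhs(b2):
    # || [b^2 (sigma^2 + d_1) A0 + gamma_w I]^{-1} E[xy] ||^2, the left-hand side of Eq. (6)
    M = b2 * (sigma2 + d1) * A0 + gamma_w * np.eye(d)
    v = np.linalg.solve(M, Exy)
    return v @ v

if Exy @ Exy <= gamma_u * gamma_w:
    print("over-regularized regime: the global minimum is trivial, b* = 0")
else:
    lo, hi = 0.0, 1.0
    while lhs(hi) > gamma_u / gamma_w:              # bracket the unique root in b^2
        hi *= 2.0
    for _ in range(100):                            # bisection on b^2
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if lhs(mid) > gamma_u / gamma_w else (lo, mid)
    print("b* =", np.sqrt(0.5 * (lo + hi)))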

3.2 Bounding the General Solution

While the solution $b^{*}$ does not admit an analytical form for a general $A_{0}$, one can find meaningful lower and upper bounds on $b^{*}$ that allow an asymptotic analysis of $b^{*}$. At the global minimum, the following inequality holds:

||\left[b^{2}\left(\sigma^{2}+d_{1}\right)a_{\rm max}I+\gamma_{w}I\right]^{-1}\mathbb{E}[xy]||^{2}\leq||\left[b^{2}\left(\sigma^{2}+d_{1}\right)A_{0}+\gamma_{w}I\right]^{-1}\mathbb{E}[xy]||^{2}\leq||\left[b^{2}\left(\sigma^{2}+d_{1}\right)a_{\rm min}I+\gamma_{w}I\right]^{-1}\mathbb{E}[xy]||^{2}, (7)

where $a_{\rm min}$ and $a_{\rm max}$ are the smallest and largest eigenvalues of $A_{0}$, respectively. The middle term is equal to $\gamma_{u}/\gamma_{w}$ by the global minimum condition in Eq. (33), and so, assuming $a_{\rm min}>0$, this inequality is equivalent to the following inequality for $b^{*}$:

\frac{\sqrt{\frac{\gamma_{w}}{\gamma_{u}}}||\mathbb{E}[xy]||-\gamma_{w}}{(\sigma^{2}+d_{1})a_{\rm max}}\leq b_{*}^{2}\leq\frac{\sqrt{\frac{\gamma_{w}}{\gamma_{u}}}||\mathbb{E}[xy]||-\gamma_{w}}{(\sigma^{2}+d_{1})a_{\rm min}}. (8)

Namely, the general solution $b^{*}$ should scale similarly to the homogeneous solution in Eq. (104) if we treat the eigenvalues of $A_{0}$ as constants.

4 Exact Solution for An Arbitrary-Depth Linear Net

This section extends our result to multiple layers. We first derive the analytical formula for the global minimum of a general arbitrary-depth model. We then show that the landscape for a deeper network is highly nontrivial.

4.1 General Solution

The loss function is

\mathbb{E}_{x}\mathbb{E}_{\epsilon^{(1)},\epsilon^{(2)},...,\epsilon^{(D)}}\left(\sum_{i,i_{1},i_{2},...,i_{D}}^{d,d_{1},d_{2},...d_{D}}U_{i_{D}}\epsilon^{(D)}_{i_{D}}...\epsilon_{i_{2}}^{(2)}W_{i_{2}i_{1}}^{(2)}\epsilon_{i_{1}}^{(1)}W_{i_{1}i}^{(1)}x_{i}-y\right)^{2}+\gamma_{u}||U||^{2}+\sum_{i=1}^{D}\gamma_{i}||W^{(i)}||^{2}, (9)

where all the noises $\epsilon$ are independent, and for all $i$ and $j$, $\mathbb{E}[\epsilon^{(i)}_{j}]=1$ and $\mathbb{E}[(\epsilon^{(i)}_{j})^{2}]=\sigma_{i}^{2}+1>1$. We first show that, for general $D$, the global minimum of this objective exists.

Proposition 1.

For $D\geq 1$ and strictly positive $\gamma_{u},\gamma_{1},...,\gamma_{D}$, the global minimum of Eq. (9) exists.

Note that the positivity of the regularization strength is crucial. If one of the $\gamma_{i}$ is zero, the global minimum may not exist. The following theorem is our second main result.

Theorem 2.

Any global minimum of Eq. (9) is of the form

\begin{cases}U=b_{u}\mathbf{r}_{D};\\ W^{(i)}=b_{i}\mathbf{r}_{i}\mathbf{r}^{T}_{i-1};\\ W^{(1)}=\mathbf{r}_{1}\mathbb{E}[xy]^{T}(b_{u}\prod_{i=2}^{D}b_{i})\mu\left[(b_{u}\prod_{i=2}^{D}b_{i})^{2}s^{2}\left(\sigma^{2}+d_{1}\right)A_{0}+\gamma_{w}I\right]^{-1},\end{cases} (10)

where $\mu=\prod_{i=2}^{D}d_{i}$, $s^{2}=\prod_{i=2}^{D}d_{i}(\sigma^{2}+d_{i})$, $b_{u}\geq 0$ and $b_{i}\geq 0$, and $\mathbf{r}_{i}=(\pm 1,...,\pm 1)$ is an arbitrary vertex of a $d_{i}$-dimensional hypercube for all $i$. Furthermore, with $b_{1}:=\sqrt{||W_{i:}||^{2}/d}$ and $b_{D+1}:=b_{u}$, the $b_{i}$ satisfy

\gamma_{k+1}d_{k+1}b_{k+1}^{2}=\gamma_{k}d_{k-1}b_{k}^{2}. (11)

Proof Sketch. We prove by induction on the depth $D$. The base case is proved in Theorem 1. We then show that, for a general depth, the objective involves optimizing two subproblems: one is a depth-$(D-1)$ problem handled by the induction hypothesis, and the other is a two-layer problem that has been solved in Theorem 1. Putting these two subproblems together, one obtains Eq. (10). $\square$

Remark.

We deal with the technical case of having a bias term for each layer in Appendix D. For example, we will show that if one has preprocessed the data such that $\mathbb{E}[x]=0$ and $\mathbb{E}[y]=0$, our main results remain precisely unchanged.

The condition in Eq. (11) shows that the scaling factors $b_{i}$ are not independent of one another. This automatic balancing of the norms of all layers is a consequence of the rescaling invariance of the multilayer architecture and the use of weight decay. It is well known that this rescaling invariance also exists in a neural network with the ReLU activation, and so this balancing condition is also directly relevant for ReLU networks.

Condition (11) implies that all the $b_{i}$ can be written in terms of one of the $b_{i}$:

b_{u}\prod_{i=2}^{D}b_{i}=c_{0}{\rm sgn}\left(b_{u}\prod_{i=2}^{D}b_{i}\right)|b_{2}^{D}|:=c_{0}{\rm sgn}\left(b_{u}\prod_{i=2}^{D}b_{i}\right)b^{D} (12)

where $c_{0}=\frac{(\gamma_{2}d_{2}d_{1})^{D/2}}{\sqrt{\gamma_{u}\prod_{i=2}^{D}\gamma_{i}}\prod_{i=2}^{D}d_{i}\sqrt{d_{1}}}$ and $b\geq 0$. Considering the first layer ($i=1$), Eq. (11) shows that the global minimum must satisfy the following equation, which is equivalent to a high-order polynomial in $b$ that does not have an analytical solution in general:

||\mathbb{E}[xy]^{T}c_{0}b^{D}\mu\left[c_{0}^{2}b^{2D}s^{2}\left(\sigma^{2}+d_{1}\right)A_{0}+\gamma_{w}I\right]^{-1}||^{2}=d_{2}b^{2}. (13)

Thus, this condition is an extension of the condition (6) for two-layer networks.
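
A rough way to locate the candidate solutions of Eq. (13) numerically is to form the constants $\mu$, $s^{2}$, and $c_{0}$ from their definitions above and scan $b\geq 0$ for sign changes of the residual. The sketch below does this for an assumed instance with equal widths, equal regularization strengths, and a single noise level shared by all layers; it is an illustration, not the procedure used in the paper:

import numpy as np

rng = np.random.default_rng(3)

# Illustrative instance: D = 3 hidden layers (assumptions of this sketch).
D, d = 3, 5
d_layers = [6, 6, 6]                      # d_1, ..., d_D
gammas = [1e-3] * D                       # gamma_1, ..., gamma_D
gamma_u, sigma2 = 1e-3, 0.5

X = rng.normal(size=(2000, d))
y = X @ rng.normal(size=d)
A0 = X.T @ X / len(X)
Exy = X.T @ y / len(X)

d1, gamma_w = d_layers[0], gammas[0]
mu = np.prod(d_layers[1:])                                   # mu = prod_{i=2}^D d_i
s2 = np.prod([di * (sigma2 + di) for di in d_layers[1:]])    # s^2 = prod_{i=2}^D d_i (sigma^2 + d_i)
c0 = (gammas[1] * d_layers[1] * d1) ** (D / 2) / (
    np.sqrt(gamma_u * np.prod(gammas[1:])) * np.prod(d_layers[1:]) * np.sqrt(d1))

def residual(b):
    # l.h.s. minus r.h.s. of Eq. (13); its non-negative roots are the candidate solutions.
    M = c0**2 * b**(2 * D) * s2 * (sigma2 + d1) * A0 + gamma_w * np.eye(d)
    lhs = np.linalg.norm(c0 * b**D * mu * np.linalg.solve(M, Exy)) ** 2
    return lhs - d_layers[1] * b**2

bs = np.linspace(1e-4, 1.0, 5000)          # b = 0 is always the trivial solution when D >= 2
vals = np.array([residual(b) for b in bs])
roots = bs[:-1][np.sign(vals[:-1]) != np.sign(vals[1:])]
print("approximate nontrivial solutions b:", roots)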

At this point, it pays to define the word "solution" clearly, especially because it has a special meaning in this work: it now becomes highly nontrivial to differentiate between the two types of solutions.

Definition 1.

We say that a non-negative real number $b$ is a solution if it satisfies Eq. (13). A solution is trivial if $b=0$ and nontrivial otherwise.

Namely, a global minimum must be a solution, but a solution is not necessarily a global minimum. We have seen that, even in the two-layer case, the global minimum can be the trivial one when the strength of the signal is too weak or the strength of regularization is too strong. It is thus natural to expect 0 to be the global minimum under a similar condition, and one is interested in whether the condition becomes stronger or weaker as the depth of the model increases. However, it turns out that this naive expectation is not true. In fact, when the depth of the model is at least 2, the condition for the trivial global minimum becomes highly nontrivial.

The following proposition shows why the problem becomes more complicated. In particular, we have seen that in the case of a two-layer net, an elementary argument shows that the trivial solution $b=0$ is either a saddle or the global minimum. However, the proposition below shows that, for $D\geq 2$, the landscape becomes more complicated in the sense that the trivial solution is always a local minimum, and it becomes difficult to compare the loss value of the trivial solution with that of the nontrivial solutions because the value of $b^{*}$ is unknown in general.

Proposition 2.

Let $D\geq 2$ in Eq. (9). Then, the solution $U=0$, $W^{(D)}=0$, ..., $W^{(1)}=0$ is a local minimum with a diagonal positive-definite Hessian $\gamma I$.

Comparing the Hessians for $D\geq 2$ and $D=1$, one notices a qualitative difference: for $D\geq 2$, the Hessian at 0 is always diagonal; for $D=1$, in sharp contrast, the off-diagonal terms are nonzero in general, and it is these off-diagonal terms that can break the positive-definiteness of the Hessian. This offers a different perspective on why there is a qualitative difference between $D=1$ and $D=2$.
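
This difference is easy to observe numerically. The sketch below (assumed sizes; noise set to its mean so that only the deterministic part and the weight decay remain) computes the Hessian at the origin with automatic differentiation: for $D=1$ the off-diagonal blocks are nonzero and can make the minimum eigenvalue negative, whereas for $D\geq 2$ the Hessian is diagonal and positive definite:

import torch
from math import prod

torch.manual_seed(0)
d, d0, n, gamma = 3, 4, 256, 1e-2
X = torch.randn(n, d)
y = X @ torch.randn(d)

def make_loss(D):
    shapes = [(d0, d)] + [(d0, d0)] * (D - 1) + [(d0,)]     # W^(1), ..., W^(D), then U
    sizes = [prod(s) for s in shapes]
    def loss(theta):
        mats, k = [], 0
        for s, sz in zip(shapes, sizes):
            mats.append(theta[k:k + sz].reshape(s)); k += sz
        h = X.T
        for W in mats[:-1]:
            h = W @ h
        return ((mats[-1] @ h - y) ** 2).mean() + gamma * (theta ** 2).sum()
    return loss, sum(sizes)

for D in (1, 2):
    loss, p = make_loss(D)
    H = torch.autograd.functional.hessian(loss, torch.zeros(p))
    off = H - torch.diag(torch.diagonal(H))
    print(f"D={D}: max |off-diagonal| = {off.abs().max().item():.4f}, "
          f"min eigenvalue = {torch.linalg.eigvalsh(H).min().item():.4f}")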

Lastly, note that, unlike the depth-1 case, one can no longer find a precise condition under which a $b\neq 0$ solution exists for a general $A_{0}$. The reason is that the condition for the existence of the solution is now a high-order polynomial with quite arbitrary intermediate terms. The following proposition gives a sufficient but stronger-than-necessary condition for the existence of a nontrivial solution when all the $\sigma_{i}$, intermediate widths $d_{i}$, and regularization strengths $\gamma_{i}$ are the same. (This is equivalent to setting $c_{0}=\sqrt{d_{0}}$. The result is qualitatively similar but involves additional factors of $c_{0}$ if $\sigma_{i}$, $d_{i}$, and $\gamma_{i}$ take different values; we present the equal case for notational concision and to emphasize the most relevant terms.) Also, note that this proposition gives a sufficient and necessary condition if $A_{0}=\sigma_{x}^{2}I$ is proportional to the identity.

Proposition 3.

Let $\sigma^{2}_{i}=\sigma^{2}>0$, $d_{i}=d_{0}$ and $\gamma_{i}=\gamma>0$ for all $i$. Assuming $a_{\rm min}>0$, the only solution is trivial if

\frac{D+1}{2D}||\mathbb{E}[xy]||d_{0}^{D-1}\left(\frac{(D-1)||\mathbb{E}[xy]||}{2Dd_{0}(\sigma^{2}+d_{0})^{D}a_{\rm min}}\right)^{\frac{D-1}{D+1}}<\gamma. (14)

Nontrivial solutions exist if

\frac{D+1}{2D}||\mathbb{E}[xy]||d_{0}^{D-1}\left(\frac{(D-1)||\mathbb{E}[xy]||}{2Dd_{0}(\sigma^{2}+d_{0})^{D}a_{\rm max}}\right)^{\frac{D-1}{D+1}}\geq\gamma. (15)

Moreover, the nontrivial solutions are both lower- and upper-bounded. (For $D=1$, we define the lower bound as $\lim_{\eta\to 0^{+}}\lim_{D\to 1^{+}}\frac{1}{d_{0}}\left[\frac{\gamma+\eta}{||\mathbb{E}[xy]||}\right]^{\frac{1}{D-1}}$, which equals zero if $||\mathbb{E}[xy]||\geq\gamma$ and $\infty$ if $||\mathbb{E}[xy]||<\gamma$. With this definition, this proposition applies to a two-layer net as well.)

\frac{1}{d_{0}}\left[\frac{\gamma}{||\mathbb{E}[xy]||}\right]^{\frac{1}{D-1}}\leq b^{*}\leq\left[\frac{||\mathbb{E}[xy]||}{d_{0}(\sigma^{2}+d_{0})^{D}a_{\rm max}}\right]^{\frac{1}{D+1}}. (16)

Proof Sketch. The proof follows from the observation that the l.h.s. of Eq. (13) is a continuous function and must cross the r.h.s. under certain sufficient conditions. \square

One should compare the general condition here with the special condition for $D=1$. One sees that, for $D\geq 2$, many other factors (such as the width, the depth, and the spectrum of the data covariance $A_{0}$) come into play in determining the existence of a solution, apart from the signal strength $||\mathbb{E}[xy]||$ and the regularization strength $\gamma$.

4.2 Which Solution is the Global Minimum?

Again, we set $\gamma_{i}=\gamma>0$, $\sigma^{2}_{i}=\sigma^{2}>0$ and $d_{i}=d_{0}>0$ for all $i$ for notational concision. Under this condition, applying Lemma 3 to Theorem 2, the solution takes the following form, where $b\geq 0$:

\begin{cases}U=\sqrt{d_{0}}b\mathbf{r}_{D};\\ W^{(i)}=b\mathbf{r}_{i}\mathbf{r}_{i-1}^{T};\\ W^{(1)}=\mathbf{r}_{1}\mathbb{E}[xy]^{T}d_{0}^{D-\frac{1}{2}}b^{D}\left[d_{0}^{D}(\sigma^{2}+d_{0})^{D}b^{2D}A_{0}+\gamma\right]^{-1}.\end{cases} (17)

The following theorem gives a sufficient condition for the global minimum to be nontrivial. It also shows that the landscape of the linear net becomes complicated and can contain more than one local minimum.

Theorem 3.

Let $\sigma^{2}_{i}=\sigma^{2}>0$, $d_{i}=d_{0}$ and $\gamma_{i}=\gamma>0$ for all $i$, and assume $a_{\rm min}>0$. Then, if

||\mathbb{E}[xy]||^{2}\geq\frac{\gamma^{\frac{D+1}{D}}D^{2}(\sigma^{2}+d_{0})^{D-1}a_{\rm max}^{\frac{D-1}{D}}}{d_{0}^{D-1}(D-1)^{\frac{D-1}{D}}} (18)

the global minimum of Eq. (9) is one of the nontrivial solutions.

While there are various ways this bound can be improved, it is general enough for our purpose. In particular, one sees that, for a general depth, the condition for having a nontrivial global minimum depends not only on $\mathbb{E}[xy]$ and $\gamma$ but also on the model architecture in general. For a more general architecture with different widths, etc., the architectural constant $c_{0}$ from Eq. (13) will also enter the equation. In the limit $D\to 1^{+}$, relation (18) reduces to

||\mathbb{E}[xy]||^{2}\geq\gamma^{2}, (19)

which is the condition derived for the 2-layer case.

Figure 2: The training loss as a function of $b$ for a $D=1$ network with different activation functions in the hidden layer. For simplicity, dropout is not implemented. The nonlinear activation functions we consider are ReLU, Tanh, and Swish. The left and right panels use different data. Left: $x$ are Gaussian random vectors, and $y=v\cdot x$ is a linear function of $x$. Right: $x$ are Gaussian random vectors, and $y=v\cdot\tanh(x)$ is a nonlinear function of the data; the weight $v$ is a Gaussian random vector.

5 Implications

Relevance to nonlinear models. We first caution the reader that the following discussion comes with a caveat: it is based on the philosophy that deep linear nets approximate nonlinear ones. This approximation certainly holds for fully connected models with differentiable activation functions such as Tanh or Swish because, up to a first-order Taylor expansion, they are deep linear nets around zero, which is the region for which our theory is most relevant. We empirically demonstrate that, close to the origin, the landscape of linear nets can indeed approximate that of nonlinear nets quite well. To compare, we plug the solution in Theorem 4 into both linear and nonlinear models of the same architecture and compare the loss values at different values of $b$ around $b=0$. For simplicity, we only consider the case $D=1$. The activation functions we consider are ReLU, Tanh, and Swish (Ramachandran et al., 2017), a modern and differentiable variant of ReLU. See Fig. 2.

The regressor $x\in\mathbb{R}^{d}$ is sampled as a Gaussian random vector. We consider two methods of generating $y$: the first (left) is $y=v\cdot x$; the second (right) is $y=v\cdot\tanh(x)$, where the weight $v\in\mathbb{R}^{d}$ is a Gaussian random vector. Fig. 2 shows that the Tanh landscape is always close to the linear landscape. Swish is not as close as Tanh, but the Swish landscape shows a similar tendency to the linear landscape. The ReLU landscape is not as close to the linear landscape on one side (either $b>0$ or $b<0$), but it agrees completely with the linear landscape on the other side, as expected. Besides the quantitative closeness, it is also important to note that all the landscapes agree qualitatively, containing the same number of local minima at similar values of $b$.
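
A minimal sketch of this kind of comparison (in the spirit of Figure 2 but not the exact setup): sweep a scalar $b$ along a fixed rank-one weight family and compare the training loss of a linear and a Tanh network; near $b=0$ the two agree because $\tanh(z)\approx z$. The one-parameter family below is an assumed simplification rather than the exact solution family of Eq. (17):

import numpy as np

rng = np.random.default_rng(4)

d, d1, n, gamma = 5, 8, 2000, 1e-3
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)
Exy = X.T @ y / n
w = Exy / np.linalg.norm(Exy)          # assumed first-layer direction

def loss(b, act):
    W = b * np.outer(np.ones(d1), w)   # W = b * r w^T with r = (1, ..., 1)
    U = b * np.ones(d1)
    pred = act(X @ W.T) @ U
    return np.mean((pred - y) ** 2) + gamma * (np.sum(W**2) + np.sum(U**2))

for b in np.linspace(-0.5, 0.5, 11):
    print(f"b={b:+.2f}  linear={loss(b, lambda z: z):.3f}  tanh={loss(b, np.tanh):.3f}")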

Landscape of multi-layer neural networks. The combination of Theorem 3 and Proposition 2 shows that the landscape of a deep neural network can become highly nontrivial when there is weight decay and when the depth of the model is at least 2. This gives an incomplete but meaningful picture of a network's complicated but interesting landscape beyond two layers (see Figure 1 for an incomplete summary of our results). In particular, even when the nontrivial solution is the global minimum, the trivial solution is still a local minimum that needs to be escaped. Our result suggests that the previous understanding that all local minima of a deep linear net are global cannot generalize to many practical settings where deep learning is found to work well. For example, a series of works attribute the existence of bad (non-global) minima to the use of nonlinearities (Kawaguchi, 2016) or the use of a non-regular (non-differentiable) loss function (Laurent and Brecht, 2018). Our result, in contrast, shows that the use of a simple weight decay is sufficient to create a bad minimum. (Some previous works do suggest the existence of bad minima when weight decay is present, but no direct proof existed. For example, Taghvaei et al. (2017) show that when the model is approximated by a linear dynamical system, regularization can cause bad local minima. Mehta et al. (2021) show the existence of bad local minima in deep linear networks with weight decay through numerical simulations.) Moreover, the problem with such a minimum is two-fold: (1) (optimization) it is not global and so needs to be "overcome," and (2) (generalization) it is a minimum that has not learned any feature at all because the model constantly outputs zero. To the best of our knowledge, prior to our work, there has not been any proof that a bad minimum can generically exist in a rather arbitrary network without any restriction on the data. (In the case of nonlinear networks without regularization, a few works proved the existence of bad minima. However, those results strongly depend on the data and are rather independent of the architecture. For example, one major assumption is that the data cannot be perfectly fitted by a linear model (Yun et al., 2018; Liu, 2021; He et al., 2020). Some other works explicitly construct data distributions (Safran and Shamir, 2018; Venturi et al., 2019). Our result, in contrast, is independent of the data.) Thus, our result offers direct theoretical justification for the widely believed importance of escaping local minima in the field of deep learning (Kleinberg et al., 2018; Liu et al., 2021; Mori et al., 2022). In particular, previous works on escaping local minima often hypothesize landscapes of unknown relevance to an actual neural network. With our result, this line of research can now be established with respect to landscapes that are actually deep-learning-relevant.

Previous works also argue that having greater depth does not create bad minima (Lu and Kawaguchi, 2017). While this remains true in that setting, its generality and applicability to practical settings now seem low. Our result shows that, as long as weight decay is used and $D\geq 2$, there is indeed a bad local minimum at 0. In contrast, there is no bad minimum at 0 for a two-layer net ($D=1$): the point $b=0$ is either a saddle or the global minimum. (Of course, in practice, a model trained with SGD can still converge to the trivial solution even if it is a saddle point (Ziyin et al., 2021) because minibatch SGD is, in general, not a good estimator of the local minima.) Having a deeper architecture thus alters the qualitative nature of the landscape, and our results agree better with the common observation that a deeper network is harder, if not impossible, to optimize.

We note that our result can also be relevant for more modern architectures such as ResNets. In a ResNet, one needs to change the dimension of the hidden representation after every bottleneck, and a learnable linear transformation is applied there. Thus, the "effective depth" of a ResNet would be roughly between the number of its bottlenecks and its total number of blocks. For example, a ResNet18 applied to CIFAR10 often has five bottlenecks and 18 layers in total. We thus expect it to have qualitatively similar behavior to a deep linear net with a depth in between. See Figure 1. The experimental details are given in Section A.

Learnability of a neural network. Now we analyze the solution as $D$ tends to infinity. We first note that the existence condition in (15) becomes exponentially harder to satisfy as $D$ becomes large:

||\mathbb{E}[xy]||^{2}\geq 4d_{0}^{2}a_{\rm max}\gamma e^{D\log[(\sigma^{2}+d_{0})/d_{0}]}+O(1). (20)

When this bound is not satisfied, the given neural network cannot learn the data. Recall that for a two-layer net, the existence condition is nothing but $||\mathbb{E}[xy]||^{2}>\gamma^{2}$, independent of the depth, width, or stochasticity in the model. For a deeper network, however, every factor comes into play, and the architecture of the model has a strong (and dominant) influence on the condition. In particular, a factor that increases polynomially in the model width and exponentially in the model depth appears.

A practical implication is that the usual strength of weight decay may be too strong for deep networks. If one increases the depth or width of the model, one should also decrease $\gamma$ roughly according to Eq. (20).
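
As an illustration of this scaling (with assumed, illustrative constants), rearranging Eq. (20) and dropping the $O(1)$ term gives a rough upper limit on the usable weight decay as a function of depth:

Exy_sq, d0, sigma2, a_max = 1.0, 64, 0.5, 1.0          # illustrative assumptions
for D in (1, 2, 4, 8, 16):
    gamma_max = Exy_sq / (4 * d0**2 * a_max * ((sigma2 + d0) / d0) ** D)
    print(f"D={D:2d}  largest weight decay that still permits learning ~ {gamma_max:.2e}")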

Figure 3: Training loss of $D=2$ neural networks with ReLU and Tanh activations across synthetic tasks with different $||\mathbb{E}[xy]||$. We see that, with the Kaiming initialization, both the Tanh net and the ReLU net are stuck at the trivial solution, in agreement with our theory. In contrast, an optimized linear regressor ($D=0$) is better than the trivial solution whenever $||\mathbb{E}[xy]||>0$. See Section A for experimental details.

Insufficiency of the existing initialization schemes. We have shown that 0 is often a bad local minimum for deep learning. Our result further implies that escaping this local minimum can be highly relevant in practice because standard initialization schemes are trapped in this local minimum for tasks where the signal $\mathbb{E}[xy]$ is weak. See Inequality (16): any nontrivial global minimum is lower-bounded by a factor proportional to $(\gamma/||\mathbb{E}[xy]||)^{1/(D-1)}/d_{0}$, which can be seen as an approximation of the radius of the local minimum at the origin. In comparison, standard deep learning initialization schemes such as the Kaiming init. initialize at a radius of roughly $1/\sqrt{d_{0}}$. Thus, for tasks with $||\mathbb{E}[xy]||\ll\gamma/\sqrt{d_{0}}$, these initialization methods are likely to initialize the model in the basin of attraction of the trivial regime, which can cause a serious failure of learning. To demonstrate, we perform a numerical simulation shown in Figure 3, where we train $D=2$ nonlinear networks of width 32 with SGD on tasks with varying $||\mathbb{E}[xy]||$. For sufficiently small $||\mathbb{E}[xy]||$, the model is clearly stuck at the origin. (There are many natural problems where the signal is extremely weak. One well-known example is the problem of future price prediction in finance, where the fundamental theorem of finance forbids a large $||\mathbb{E}[xy]||$ (Fama, 1970).) In contrast, linear regression is never stuck at the origin. Our result thus suggests that it may be desirable to devise initialization methods that are functions of the data distribution.
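
The following back-of-the-envelope sketch (all numbers are illustrative assumptions) compares the Kaiming-like initialization scale $1/\sqrt{d_{0}}$ with the lower bound of Eq. (16), used here as a rough proxy for the radius of the trivial basin:

import numpy as np

d0, gamma, D = 32, 1e-2, 2
init_radius = 1.0 / np.sqrt(d0)                            # typical Kaiming-like scale
for exy in (1.0, 0.1, 1e-3, 1e-4):
    basin_radius = (gamma / exy) ** (1.0 / (D - 1)) / d0   # lower bound in Eq. (16)
    print(f"||E[xy]||={exy:8.4f}  init~{init_radius:.3f}  basin~{basin_radius:.4f}  "
          f"likely stuck at 0: {init_radius < basin_radius}")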

Prediction variance of stochastic nets. A major extension of standard neural networks is to make them stochastic, namely, to make the output a random function of the input. In a broad sense, stochastic neural networks include neural networks trained with dropout (Srivastava et al., 2014; Gal and Ghahramani, 2016), Bayesian networks (Mackay, 1992), variational autoencoders (VAEs) (Kingma and Welling, 2013), and generative adversarial networks (Goodfellow et al., 2014). Stochastic networks are thus of both practical and theoretical importance to study. Our result can also be used to study the theoretical properties of stochastic neural networks. Here, we present a simple application of our general solution to analyze the properties of a stochastic net. The following theorem summarizes our technical results.

Theorem 4.

Let $\sigma^{2}_{i}=\sigma^{2}>0$, $d_{i}=d_{0}$ and $\gamma_{i}=\gamma>0$ for all $i$. Let $A_{0}=\sigma_{x}^{2}I$. Then, at any global minimum of Eq. (9), holding other parameters fixed,

  1.

    in the limit of large $d_{0}$, ${\rm Var}[f(x)]=O\left(d_{0}^{-1}\right)$;

  2.

    in the limit of large $\sigma^{2}$, ${\rm Var}[f(x)]=O\left(\frac{1}{(\sigma^{2})^{D}}\right)$;

  3.

    in the limit of large $D$, ${\rm Var}[f(x)]=O\left(e^{-2D\log[(\sigma^{2}+d_{0})/d_{0}]}\right)$.

Interestingly, the scaling of the prediction variance in asymptotic $\sigma^{2}$ is different for different widths. The third result shows that the prediction variance decreases exponentially fast in $D$. In particular, this result answers a question recently posed in Ziyin et al. (2022): does a stochastic net trained on MSE have a prediction variance that scales towards 0? We improve on their result in the case of a deep linear net by (a) showing that the $d_{0}^{-1}$ rate is tight in general, independent of the depth or other factors of the model, and (b) proving a bound showing that the variance also scales towards zero as the depth increases, which is a novel result of our work. Our result also offers an important insight into the cause of the vanishing prediction variance. Previous works (Alemi et al., 2018) often attribute the cause to the fact that a wide neural network is too expressive. However, our result implies that this is not always the case, because a linear network with limited expressivity can also have a vanishing variance as the model tends to infinite width.
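
The width scaling in item 1 is easy to see in simulation. The sketch below estimates ${\rm Var}[f(x)]$ by Monte Carlo for a balanced rank-one network whose mean map is held fixed as the width $d_{0}$ grows; this construction is a heuristic stand-in for the exact global minimum of Theorem 2, and all sizes are assumed:

import numpy as np

rng = np.random.default_rng(5)

d, D, sigma2 = 5, 2, 1.0
x = rng.normal(size=d)
v = rng.normal(size=d)

def prediction(d0, n_noise=5000):
    # Balanced rank-one network: W^(1) = r v^T, intermediate layers average, and U is scaled so
    # that the mean prediction E_eps[f(x)] = v.x is the same for every width d0.
    W1 = np.outer(np.ones(d0), v)
    mids = [np.ones((d0, d0)) / d0 for _ in range(D - 1)]
    U = np.ones(d0) / d0
    h = np.tile(W1 @ x, (n_noise, 1))                      # one row per noise draw
    h *= 1.0 + np.sqrt(sigma2) * rng.normal(size=h.shape)  # eps^(1)
    for W in mids:
        h = h @ W.T
        h *= 1.0 + np.sqrt(sigma2) * rng.normal(size=h.shape)
    return h @ U

for d0 in (8, 32, 128, 512):
    f = prediction(d0)
    print(f"d0={d0:4d}  Var[f(x)] ~ {f.var():.5f}   d0 * Var ~ {d0 * f.var():.3f}")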

Collapses in deep learning. Lastly, we comment briefly on the apparent similarity between different types of collapse that occur in deep learning. For neural collapse, our result agrees with recent works that identify weight decay as a main cause (Rangamani and Banburski-Fahey, 2022). For Bayesian deep learning, Wang and Ziyin (2022) identified the cause of posterior collapse in a two-layer VAE structure to be that the regularization of the mean of the latent variable $z$ is too strong. More recently, the origin and its stability have also been discussed in the context of dimensional collapse in self-supervised learning (Ziyin et al., 2023). Although appearing in different contexts of deep learning, the three types of collapse share the same phenomenology: the model converges to a "collapsed" regime where the learned representation becomes low-rank or constant, which agrees with the behavior of the trivial regime we identified. We refer the reader to Ziyin and Ueda (2022) for a study of how the second-order phase transition framework of statistical physics can offer a possible unified explanation of these phenomena.

6 Conclusion

In this work, we derived the exact solution of a deep linear net with arbitrary depth and width and with stochasticity. Our work sheds light on the highly complicated landscape of a deep neural network. Compared to the previous works that mostly focus on the qualitative understanding of the linear net, our result offers a more precise quantitative understanding of deep linear nets. Quantitative understanding is one major benefit of knowing the exact solution, whose usefulness we have also demonstrated with the various implications. The results, although derived for linear models, are also empirically shown to be relevant for networks with nonlinear activations. Lastly, our results strengthen the line of thought that analytical approaches to deep linear models can be used to understand deep neural networks, and it is the sincere hope of the authors to attract more attention to this promising field.

Acknowledgement

Ziyin is financially supported by the GSS scholarship of the University of Tokyo and the JSPS fellowship. Li is financially supported by CNRS. X. Meng is supported by JST CREST Grant Number JPMJCR1912, Japan.

References

  • Alemi et al., (2018) Alemi, A., Poole, B., Fischer, I., Dillon, J., Saurous, R. A., and Murphy, K. (2018). Fixing a broken ELBO. In International Conference on Machine Learning, pages 159–168. PMLR.
  • Baldi and Hornik, (1989) Baldi, P. and Hornik, K. (1989). Neural networks and principal component analysis: Learning from examples without local minima. Neural networks, 2(1):53–58.
  • Cavazza et al., (2018) Cavazza, J., Morerio, P., Haeffele, B., Lane, C., Murino, V., and Vidal, R. (2018). Dropout as a low-rank regularizer for matrix factorization. In International Conference on Artificial Intelligence and Statistics, pages 435–444. PMLR.
  • (4) Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2015a). The loss surfaces of multilayer networks. In Artificial Intelligence and Statistics, pages 192–204.
  • (5) Choromanska, A., LeCun, Y., and Arous, G. B. (2015b). Open problem: The landscape of the loss surfaces of multilayer networks. In Conference on Learning Theory, pages 1756–1760. PMLR.
  • Fama, (1970) Fama, E. F. (1970). Efficient capital markets: A review of theory and empirical work. The journal of Finance, 25(2):383–417.
  • Gal and Ghahramani, (2016) Gal, Y. and Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059. PMLR.
  • Glorot and Bengio, (2010) Glorot, X. and Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249–256. JMLR Workshop and Conference Proceedings.
  • Goodfellow et al., (2014) Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems, 27.
  • Gotmare et al., (2018) Gotmare, A., Keskar, N. S., Xiong, C., and Socher, R. (2018). A closer look at deep learning heuristics: Learning rate restarts, warmup and distillation. arXiv preprint arXiv:1810.13243.
  • Hardt and Ma, (2016) Hardt, M. and Ma, T. (2016). Identity matters in deep learning. arXiv preprint arXiv:1611.04231.
  • Hastie et al., (2019) Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2019). Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560.
  • He et al., (2020) He, F., Wang, B., and Tao, D. (2020). Piecewise linear activations substantially shape the loss surfaces of neural networks. arXiv preprint arXiv:2003.12236.
  • Kawaguchi, (2016) Kawaguchi, K. (2016). Deep learning without poor local minima. Advances in Neural Information Processing Systems, 29:586–594.
  • Kingma and Welling, (2013) Kingma, D. P. and Welling, M. (2013). Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  • Kleinberg et al., (2018) Kleinberg, B., Li, Y., and Yuan, Y. (2018). An alternative view: When does sgd escape local minima? In International Conference on Machine Learning, pages 2698–2707. PMLR.
  • Krogh and Hertz, (1992) Krogh, A. and Hertz, J. A. (1992). A simple weight decay can improve generalization. In Advances in neural information processing systems, pages 950–957.
  • Laurent and Brecht, (2018) Laurent, T. and Brecht, J. (2018). Deep linear networks with arbitrary loss: All local minima are global. In International conference on machine learning, pages 2902–2907. PMLR.
  • Liu, (2021) Liu, B. (2021). Spurious local minima are common for deep neural networks with piecewise linear activations. arXiv preprint arXiv:2102.13233.
  • Liu et al., (2021) Liu, K., Ziyin, L., and Ueda, M. (2021). Noise and fluctuation of finite learning rate stochastic gradient descent. In International Conference on Machine Learning, pages 7045–7056. PMLR.
  • Loshchilov and Hutter, (2017) Loshchilov, I. and Hutter, F. (2017). Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  • Lu and Kawaguchi, (2017) Lu, H. and Kawaguchi, K. (2017). Depth creates no bad local minima. arXiv preprint arXiv:1702.08580.
  • Lucas et al., (2019) Lucas, J., Tucker, G., Grosse, R., and Norouzi, M. (2019). Don’t Blame the ELBO! A Linear VAE Perspective on Posterior Collapse.
  • Mackay, (1992) Mackay, D. J. C. (1992). Bayesian methods for adaptive models. PhD thesis, California Institute of Technology.
  • Mehta et al., (2021) Mehta, D., Chen, T., Tang, T., and Hauenstein, J. (2021). The loss surface of deep linear networks viewed through the algebraic geometry lens. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Mianjy and Arora, (2019) Mianjy, P. and Arora, R. (2019). On dropout and nuclear norm regularization. In International Conference on Machine Learning, pages 4575–4584. PMLR.
  • Mori et al., (2022) Mori, T., Ziyin, L., Liu, K., and Ueda, M. (2022). Power-law escape rate of sgd. In International Conference on Machine Learning, pages 15959–15975. PMLR.
  • Ramachandran et al., (2017) Ramachandran, P., Zoph, B., and Le, Q. V. (2017). Searching for activation functions.
  • Rangamani and Banburski-Fahey, (2022) Rangamani, A. and Banburski-Fahey, A. (2022). Neural collapse in deep homogeneous classifiers and the role of weight decay. In ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4243–4247. IEEE.
  • Safran and Shamir, (2018) Safran, I. and Shamir, O. (2018). Spurious local minima are common in two-layer relu neural networks. In International conference on machine learning, pages 4433–4441. PMLR.
  • Saxe et al., (2013) Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv preprint arXiv:1312.6120.
  • Srivastava et al., (2014) Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1):1929–1958.
  • Taghvaei et al., (2017) Taghvaei, A., Kim, J. W., and Mehta, P. (2017). How regularization affects the critical points in linear networks. Advances in neural information processing systems, 30.
  • Tanaka et al., (2020) Tanaka, H., Kunin, D., Yamins, D. L., and Ganguli, S. (2020). Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems, 33:6377–6389.
  • Tian, (2022) Tian, Y. (2022). Deep contrastive learning is provably (almost) principal component analysis. arXiv preprint arXiv:2201.12680.
  • Venturi et al., (2019) Venturi, L., Bandeira, A. S., and Bruna, J. (2019). Spurious valleys in one-hidden-layer neural network optimization landscapes. Journal of Machine Learning Research, 20:133.
  • Wang and Ziyin, (2022) Wang, Z. and Ziyin, L. (2022). Posterior collapse of a linear latent variable model. In Advances in Neural Information Processing Systems.
  • Yun et al., (2018) Yun, C., Sra, S., and Jadbabaie, A. (2018). Small nonlinearities in activation functions create bad local minima in neural networks. arXiv preprint arXiv:1802.03487.
  • Ziyin et al., (2021) Ziyin, L., Li, B., Simon, J. B., and Ueda, M. (2021). Sgd can converge to local maxima. In International Conference on Learning Representations.
  • Ziyin et al., (2023) Ziyin, L., Lubana, E. S., Ueda, M., and Tanaka, H. (2023). What shapes the loss landscape of self supervised learning? In The Eleventh International Conference on Learning Representations.
  • Ziyin and Ueda, (2022) Ziyin, L. and Ueda, M. (2022). Exact phase transitions in deep learning. arXiv preprint arXiv:2205.12510.
  • Ziyin and Wang, (2023) Ziyin, L. and Wang, Z. (2023). spred: Solving L1 Penalty with SGD. In International Conference on Machine Learning.
  • Ziyin et al., (2022) Ziyin, L., Zhang, H., Meng, X., Lu, Y., Xing, E., and Ueda, M. (2022). Stochastic neural networks with infinite width are deterministic.

Checklist

  1.

    For all authors…

    1. (a)

      Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]

    2. (b)

      Did you describe the limitations of your work? [Yes]

    3. (c)

      Did you discuss any potential negative societal impacts of your work? [N/A]

    4. (d)

      Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]

  2.

    If you are including theoretical results…

    1. (a)

      Did you state the full set of assumptions of all theoretical results? [Yes]

    2. (b)

      Did you include complete proofs of all theoretical results? [Yes] See Appendix.

  3.

    If you ran experiments…

    1. (a)

      Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [No] The experiments are only for demonstration and are straightforward to reproduce following the theory.

    2. (b)

      Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]

    3. (c)

      Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [No] The fluctuations are visually negligible.

    4. (d)

      Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [Yes] They are done on a single 3080Ti GPU.

  4.

    If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…

    1. (a)

      If your work uses existing assets, did you cite the creators? [N/A]

    2. (b)

      Did you mention the license of the assets? [N/A]

    3. (c)

      Did you include any new assets either in the supplemental material or as a URL? [N/A]

    4. (d)

      Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]

    5. (e)

      Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]

  5.

    If you used crowdsourcing or conducted research with human subjects…

    1. (a)

      Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]

    2. (b)

      Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]

    3. (c)

      Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix A Experimental Details

For the experiment in Figure 3, the input data consist of 1000 data points sampled from a multivariate Gaussian distribution, $x\sim\mathcal{N}(0,I_{5})$. The target is generated by a linear transformation $y=v\cdot x$, where the norm of $v$ is rescaled to obtain different values of $||\mathbb{E}[xy]||$ as the control parameter of the simulation. The models are $D=2$ neural networks with bias terms and a hidden width of 32 for both hidden layers. Training proceeds with gradient descent with a learning rate of 0.1 for $10^{4}$ iterations, by which point the training loss has stopped decreasing in all the experiments.
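
A minimal PyTorch sketch of this setup is given below. The data generation, width, depth, learning rate, and iteration count follow the description above; the weight decay strength (applied through the optimizer) and the remaining details are assumptions of the sketch:

import torch

torch.manual_seed(0)

n, d, width, lr, steps = 1000, 5, 32, 0.1, 10_000
X = torch.randn(n, d)
v = torch.randn(d)

def run(exy_scale, act):
    # Rescale v so that ||E[xy]|| is approximately exy_scale (A0 is close to the identity).
    y = X @ (v / v.norm() * exy_scale)
    model = torch.nn.Sequential(
        torch.nn.Linear(d, width), act,
        torch.nn.Linear(width, width), act,
        torch.nn.Linear(width, 1))
    # The weight decay value here is an assumption; the text above does not specify it.
    opt = torch.optim.SGD(model.parameters(), lr=lr, weight_decay=5e-2)
    for _ in range(steps):
        opt.zero_grad()
        loss = ((model(X).squeeze(-1) - y) ** 2).mean()
        loss.backward()
        opt.step()
    return loss.item()

for scale in (1.0, 0.1, 0.01, 0.001):
    trivial = scale ** 2            # loss of the constant-zero model, since Var[y] ~ scale^2
    print(f"||E[xy]||~{scale}:  trivial~{trivial:.1e}  "
          f"ReLU={run(scale, torch.nn.ReLU()):.1e}  Tanh={run(scale, torch.nn.Tanh()):.1e}")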

For the CIFAR10 experiments, we train a standard ResNet18 with roughly $10^{7}$ parameters under the standard procedure, with a batch size of 256 for 100 epochs. (Specifically, we use the implementation and training procedure of https://github.com/kuangliu/pytorch-cifar, with standard augmentations such as random crop, etc.) For the linear models, we use a hidden width of 32 without any bias term. Training proceeds with SGD with batch size 256 for 100 epochs and a momentum of 0.9. The learning rate is 0.002, chosen as the best learning rate from a grid search over $[0.001,0.002,...,0.01]$.

Appendix B Proofs

B.1 Proof of Lemma 1

Proof. Note that the first term in the loss function is invariant to the following rescaling for any $a>0$:

\begin{cases}U_{i}\to aU_{i};\\ W_{ij}\to W_{ij}/a;\end{cases} (21)

meanwhile, the $L_{2}$ regularization term changes as $a$ changes. Therefore, the global minimum must have a minimized $a$ with respect to any $U$ and $W$.

One can easily find the solution:

a^{*}=\arg\min_{a}\left(\gamma_{u}a^{2}U_{i}^{2}+\gamma_{w}\sum_{j}\frac{W_{ij}^{2}}{a^{2}}\right)=\left(\frac{\gamma_{w}\sum_{j}W_{ij}^{2}}{\gamma_{u}U_{i}^{2}}\right)^{1/4}. (22)

Therefore, at the global minimum, we must have $\gamma_{u}a^{2}U_{i}^{2}=\gamma_{w}\sum_{j}\frac{W_{ij}^{2}}{a^{2}}$, so that

(U_{i}^{*})^{2}=(a^{*}U_{i})^{2}=\frac{\gamma_{w}}{\gamma_{u}}\sum_{j}(W_{ij}^{*})^{2}, (23)

which completes the proof. \square

B.2 Proof of Lemma 2

Proof. By Lemma 1, we can write $U_{i}$ as $b_{i}$ and $W_{i:}$ as $b_{i}w_{i}$, where $w_{i}$ is a unit vector, and finding the global minimizer of Eq. (2) is equivalent to finding the minimizer of the following objective:

\mathbb{E}_{x,\epsilon}\left[\left(\sum_{i,j}b^{2}_{i}\epsilon_{i}w_{ij}x_{j}-y\right)^{2}\right]+(\gamma_{u}+\gamma_{w})||b||_{2}^{2} (24)
=\mathbb{E}_{x}\left[\left(\sum_{i,j}b^{2}_{i}w_{ij}x_{j}-y\right)^{2}\right]+\sigma^{2}\sum_{i}b_{i}^{4}\,\mathbb{E}_{x}\left(\sum_{k}w_{ik}x_{k}\right)^{2}+(\gamma_{u}+\gamma_{w})||b||_{2}^{2}. (25)

The lemma statement is equivalent to $b_{i}=b_{j}$ for all $i$ and $j$.

We prove this by contradiction. Suppose there exist $i$ and $j$ such that $b_{i}\neq b_{j}$; we choose $i$ to be the index with maximum $b_{i}^{2}$ and $j$ to be the index with minimum $b_{j}^{2}$. Now, we can construct a different solution by the following replacement of $b_{i}w_{i:}$ and $b_{j}w_{j:}$:

{bi2wi:c2v;bj2wj:c2v,\begin{cases}b_{i}^{2}w_{i:}\to c^{2}v;\\ b_{j}^{2}w_{j:}\to c^{2}v,\end{cases} (26)

where cc is a positive scalar and vv is a unit vector such that 2c2v=bi2wi:+bj2wj:2c^{2}v=b_{i}^{2}w_{i:}+b_{j}^{2}w_{j:}. Note that, by the triangular inequality, 2c2bi2+bj22c^{2}\leq b_{i}^{2}+b_{j}^{2}. Meanwhile, all the other terms, bkb_{k} for kik\neq i and kjk\neq j, are left unchanged. This transformation leaves the first term in the loss function (25) unchanged, and we now show that it decreases the other terms.

The change in the second term is

\left(b_{i}^{2}\sum_{k}w_{ik}x_{k}\right)^{2}+\left(b_{j}^{2}\sum_{k}w_{jk}x_{k}\right)^{2}\to 2\left(c^{2}\sum_{k}v_{k}x_{k}\right)^{2}=\frac{1}{2}\left(b_{i}^{2}\sum_{k}w_{ik}x_{k}+b_{j}^{2}\sum_{k}w_{jk}x_{k}\right)^{2}. (27)

By the inequality $a^{2}+b^{2}\geq(a+b)^{2}/2$, we see that the left-hand side is at least as large as the right-hand side.

We now consider the $L_{2}$ regularization term. The change is

(\gamma_{u}+\gamma_{w})(b_{i}^{2}+b_{j}^{2})\to 2(\gamma_{u}+\gamma_{w})c^{2}, (28)

and the left-hand side is again at least as large as the right-hand side by the inequality mentioned above: $2c^{2}\leq b_{i}^{2}+b_{j}^{2}$. Therefore, we have constructed a solution whose loss is strictly smaller than that of the assumed global minimum: a contradiction. Thus, the global minimum must satisfy

U_{i}^{2}=U_{j}^{2} (29)

for all $i$ and $j$.

Likewise, we can show that $U_{i}W_{i:}=U_{j}W_{j:}$ for all $i$ and $j$. This is because the triangle inequality $2c^{2}\leq b_{i}^{2}+b_{j}^{2}$ is an equality only if $U_{i}W_{i:}=U_{j}W_{j:}$. If $U_{i}W_{i:}\neq U_{j}W_{j:}$, following the same argument as above, we arrive at another contradiction. \square

B.3 Proof of Theorem 1

Proof. By Lemma 2, at any global minimum, we can write $U_{*}=b\mathbf{r}$ for some $b\in\mathbb{R}$. We can also write $W_{*}=\mathbf{r}v^{T}$ for a general vector $v\in\mathbb{R}^{d}$. Without loss of generality, we assume that $b\geq 0$ (because the sign of $b$ can be absorbed into $\mathbf{r}$).

The original problem in Eq. (2) now reduces to the following problem because $\mathbf{r}^{T}\mathbf{r}=d_{1}$:

\min_{b,v}\mathbb{E}_{x}\left[\left(bd_{1}\sum_{j}v_{j}x_{j}-y\right)^{2}+b^{2}d_{1}\sigma^{2}\left(\sum_{k}v_{k}x_{k}\right)^{2}\right]+\gamma_{u}d_{1}b^{2}+\gamma_{w}d_{1}||v||_{2}^{2}. (30)

For any fixed $b$, the minimizing $v$ is well known (it is the solution of a ridge regression problem):

v=b\mathbb{E}[xy]^{T}\left[b^{2}\left(\sigma^{2}+d_{1}\right)A_{0}+\gamma_{w}I\right]^{-1}. (31)

By Lemma 1, at a global minimum, $b$ also satisfies the following condition:

b^{2}=\frac{\gamma_{w}}{\gamma_{u}}||v||^{2}. (32)

One solution to this equation is $b=0$, and we are interested in whether solutions with $b\neq 0$ exist. If there is no other solution, then $b=0$ must be the unique global minimum; otherwise, we need to identify which of the solutions are actual global minima. When $b\neq 0$,

\bigg|\bigg|\left[b^{2}\left(\sigma^{2}+d_{1}\right)A_{0}+\gamma_{w}I\right]^{-1}\mathbb{E}[xy]\bigg|\bigg|^{2}=\frac{\gamma_{u}}{\gamma_{w}}. (33)

Note that the left-hand side is monotonically decreasing in $b^{2}$ and is equal to $\gamma^{-2}_{w}||\mathbb{E}[xy]||^{2}$ when $b=0$. As $b\to\infty$, the left-hand side tends to $0$. Because the left-hand side is a continuous and monotonic function of $b$, a unique solution $b_{*}>0$ that satisfies Eq. (33) exists if and only if $\gamma^{-2}_{w}||\mathbb{E}[xy]||^{2}>\gamma_{u}/\gamma_{w}$, or,

||\mathbb{E}[xy]||^{2}>\gamma_{u}\gamma_{w}. (34)

Therefore, at most three candidates for the global minimum of the loss function exist:

\begin{cases}b=0,\ v=0&\text{if }||\mathbb{E}[xy]||^{2}\leq\gamma_{u}\gamma_{w};\\ b=\pm b_{*},\ v=b\left[b^{2}\left(\sigma^{2}+d_{1}\right)A_{0}+\gamma_{w}I\right]^{-1}\mathbb{E}[xy],&\text{if }||\mathbb{E}[xy]||^{2}>\gamma_{u}\gamma_{w},\end{cases} (35)

where $b_{*}>0$.

In the second case, one needs to discern the saddle points from the global minima. Using the expression for $v$, one finds the loss as a function of $b$:

d_{1}(d_{1}+\sigma^{2})b^{4}\sum_{i}\frac{\mathbb{E}[x^{\prime}y]_{i}^{2}a_{i}}{[b^{2}(\sigma^{2}+d_{1})a_{i}+\gamma_{w}]^{2}}-2b^{2}d_{1}\sum_{i}\frac{\mathbb{E}[x^{\prime}y]_{i}^{2}}{b^{2}(\sigma^{2}+d_{1})a_{i}+\gamma_{w}}+\mathbb{E}[y^{2}]+\gamma_{u}d_{1}b^{2}+\gamma_{w}d_{1}\sum_{i}\frac{\mathbb{E}[x^{\prime}y]_{i}^{2}b^{2}}{[b^{2}(\sigma^{2}+d_{1})a_{i}+\gamma_{w}]^{2}}, (36)

where $x^{\prime}=Rx$ for an orthogonal matrix $R$ such that $RA_{0}R^{-1}$ is a diagonal matrix with entries $a_{i}$. We now show that condition (34) is sufficient to guarantee that $b=0$ is not the global minimum.

At $b=0$, the first nonvanishing derivative with respect to $b$ is the second-order one. The second-order derivative at $b=0$ is

-2d_{1}||\mathbb{E}[xy]||^{2}/\gamma_{w}+2\gamma_{u}d_{1}, (37)

which is negative if and only if $||\mathbb{E}[xy]||^{2}>\gamma_{u}\gamma_{w}$. If the second derivative at $b=0$ is negative, $b=0$ cannot be a minimum. It then follows that for $||\mathbb{E}[xy]||^{2}>\gamma_{u}\gamma_{w}$, $b=\pm b_{*}$, $v=b\left[b^{2}\left(\sigma^{2}+d_{1}\right)A_{0}+\gamma_{w}I\right]^{-1}\mathbb{E}[xy]$ are the two global minima (because the loss is invariant to a sign flip of $b$). For the same reason, when $||\mathbb{E}[xy]||^{2}<\gamma_{u}\gamma_{w}$, $b=0$ gives the unique global minimum. This finishes the proof. \square
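The following illustrative snippet (synthetic data; all constants are placeholders) checks this dichotomy numerically: when $||\mathbb{E}[xy]||^{2}>\gamma_{u}\gamma_{w}$, the monotone left-hand side of Eq. (33) crosses $\gamma_{u}/\gamma_{w}$ at a unique $b_{*}>0$, which is located by bisection.

```python
import numpy as np

rng = np.random.default_rng(2)
d, d1, sigma2 = 5, 32, 0.5
gamma_u, gamma_w = 0.05, 0.05
X = rng.standard_normal((2000, d))
y = X @ rng.standard_normal(d)
A0 = X.T @ X / len(X)                    # A_0 = E[xx^T]
Exy = X.T @ y / len(X)                   # E[xy]

def lhs(b):                              # left-hand side of Eq. (33)
    M = b ** 2 * (sigma2 + d1) * A0 + gamma_w * np.eye(d)
    return np.linalg.norm(np.linalg.solve(M, Exy)) ** 2

if np.linalg.norm(Exy) ** 2 <= gamma_u * gamma_w:
    print("condition (34) fails: b = 0 is the unique global minimum")
else:
    lo, hi = 0.0, 1.0
    while lhs(hi) > gamma_u / gamma_w:   # bracket the crossing point
        hi *= 2.0
    for _ in range(80):                  # bisection on the monotone function
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if lhs(mid) > gamma_u / gamma_w else (lo, mid)
    print("b_* is approximately", 0.5 * (lo + hi))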

B.4 Proof of Proposition 1

Proof. We first show that there exists a constant $r$ such that the global minimum must be confined within a closed $r$-ball around the origin. The objective (9) can be lower-bounded by

Eq.~(9)\geq\gamma_{u}||U||^{2}+\sum_{i=1}^{D}\gamma_{i}||W^{(i)}||^{2}\geq\gamma_{\rm min}\left(||U||^{2}+\sum_{i}||W^{(i)}||^{2}\right), (38)

where $\gamma_{\rm min}:=\min_{i\in\{u,1,2,...,D\}}\gamma_{i}>0$. Now, let $w$ denote the collection of all the parameters $(U,W^{(i)})$, viewed as a single vector. The above inequality is then equivalent to

Eq.~(9)\geq\gamma_{\rm min}||w||^{2}. (39)

Now, note that the loss value at the origin is $\mathbb{E}[y^{2}]$, which means that any $w$ with $||w||^{2}\geq\mathbb{E}[y^{2}]/\gamma_{\rm min}$ has a loss value no smaller than that of the origin. Therefore, letting $r=\sqrt{\mathbb{E}[y^{2}]/\gamma_{\rm min}}$, we have shown that a global minimum must lie in the closed $r$-ball around the origin.

As the last step, because the objective is a continuous function of $w$ and the closed $r$-ball is a compact set, the minimum of the objective in this $r$-ball is attained. This completes the proof. \square

B.5 Proof of Theorem 2

We divide the proof into a proposition and a lemma; combining the following proposition and lemma yields the theorem statement.

B.5.1 Proposition 4

Proposition 4.

Any global minimum of Eq. (9) is of the form

{U=bu𝐫D;W(i)=bi𝐫i𝐫i1T;W(1)=𝐫1𝔼[xy]T(bui=2Dbi)μ[(bui=2Dbi)2s2(σ2+d1)A0+γwI]1,\begin{cases}U=b_{u}\mathbf{r}_{D};\\ W^{(i)}=b_{i}\mathbf{r}_{i}\mathbf{r}^{T}_{i-1};\\ W^{(1)}=\mathbf{r}_{1}\mathbb{E}[xy]^{T}(b_{u}\prod_{i=2}^{D}b_{i})\mu\left[(b_{u}\prod_{i=2}^{D}b_{i})^{2}s^{2}\left(\sigma^{2}+d_{1}\right)A_{0}+\gamma_{w}I\right]^{-1},\end{cases} (40)

where $\mu=\prod_{i=2}^{D}d_{i}$, $s^{2}=\prod_{i=2}^{D}d_{i}(\sigma^{2}+d_{i})$, $b_{u}\geq 0$ and $b_{i}\geq 0$, and $\mathbf{r}_{i}=(\pm 1,...,\pm 1)$ is an arbitrary vertex of the $d_{i}$-dimensional hypercube for all $i$.

Proof. Note that the trivial solution is also a special case of this solution with $b=0$. We thus focus on deriving the form of the nontrivial solution.

We prove this by induction on $D$. The base case with depth $1$ is proved in Theorem 1. We now assume that the statement holds for depth $D-1$ and prove that it also holds for depth $D$.

For any fixed $W^{(1)}$, the loss function can be equivalently written as

𝔼x~𝔼ϵ(2),,ϵ(D)(i1,i2,,iDd1,d2,dDUiDϵiD(D)ϵi2(2)Wi2i1(2)x~i1y)2+γuU2+i=2DγiW(i)2+const.,\displaystyle\mathbb{E}_{\tilde{x}}\mathbb{E}_{\epsilon^{(2)},...,\epsilon^{(D)}}\left(\sum_{i_{1},i_{2},...,i_{D}}^{d_{1},d_{2},...d_{D}}U_{i_{D}}\epsilon^{(D)}_{i_{D}}...\epsilon_{i_{2}}^{(2)}W_{i_{2}i_{1}}^{(2)}\tilde{x}_{i_{1}}-y\right)^{2}+\gamma_{u}||U||^{2}+\sum_{i=2}^{D}\gamma_{i}||W^{(i)}||^{2}+const., (41)

where $\tilde{x}_{i_{1}}=\epsilon_{i_{1}}^{(1)}\sum_{i}W^{(1)}_{i_{1}i}x_{i}$. Namely, we have reduced the problem to one involving only a depth-$(D-1)$ linear net with a transformed input $\tilde{x}$.

By the induction assumption, the global minimum of this problem takes the form of Eq. (10), which means that the loss function can be written in the following form:

𝔼x~𝔼ϵ(2),,ϵ(D)(bubDb3i1,i2,,iDd1,d2,dDϵiD(D)ϵi2(2)vi1x~i1y)2+L2reg.,\mathbb{E}_{\tilde{x}}\mathbb{E}_{\epsilon^{(2)},...,\epsilon^{(D)}}\left(b_{u}b_{D}...b_{3}\sum_{i_{1},i_{2},...,i_{D}}^{d_{1},d_{2},...d_{D}}\epsilon^{(D)}_{i_{D}}...\epsilon_{i_{2}}^{(2)}v_{i_{1}}\tilde{x}_{i_{1}}-y\right)^{2}+L_{2}\ reg., (42)

for an arbitrary optimizable vector $v$. The term $\sum_{i_{2},...,i_{D}}^{d_{2},...,d_{D}}\epsilon^{(D)}_{i_{D}}...\epsilon_{i_{2}}^{(2)}:=\eta$ can now be regarded as a single random variable such that $\mathbb{E}[\eta]=\prod_{i=2}^{D}d_{i}:=\mu$ and $\mathbb{E}[\eta^{2}]=\prod_{i=2}^{D}d_{i}(\sigma_{i}^{2}+d_{i}):=s^{2}$. Computing the expectation over all the noises except for $\epsilon^{(1)}$, one finds

𝔼x~(bubDb3si1vi1x~i1μys)2+L2reg.+const.\displaystyle\mathbb{E}_{\tilde{x}}\left(b_{u}b_{D}...b_{3}s\sum_{i_{1}}v_{i_{1}}\tilde{x}_{i_{1}}-\frac{\mu y}{s}\right)^{2}+L_{2}\ reg.+const. (43)
=𝔼x,ϵ(1)(bubDb3si,i1vi1ϵi1(1)Wi1i(1)xiμys)2+L2reg.+const.,\displaystyle=\mathbb{E}_{x,\epsilon^{(1)}}\left(b_{u}b_{D}...b_{3}s\sum_{i,i_{1}}v_{i_{1}}\epsilon_{i_{1}}^{(1)}W^{(1)}_{i_{1}i}{x}_{i}-\frac{\mu y}{s}\right)^{2}+L_{2}\ reg.+const., (44)

where we have ignored the constant term because it does not affect the minimizer of the loss. Namely, we have reduced the original problem to a two-layer linear net problem in which the label is effectively rescaled.

For any fixed $b_{u},...,b_{3}$, we can define $\bar{x}:=b_{u}b_{D}...b_{3}sx$ and obtain the following problem, whose global minimum we have already derived:

\mathbb{E}_{\bar{x}}\mathbb{E}_{\epsilon^{(1)}}\left(\sum_{i,i_{1}}v_{i_{1}}\epsilon^{(1)}_{i_{1}}W^{(1)}_{i_{1}i}\bar{x}_{i}-\frac{\mu y}{s}\right)^{2}. (45)

By Theorem 1, the global minimum is identically $0$ if $||\mathbb{E}[\mu\bar{x}y/s]||^{2}<d_{2}\gamma_{2}\gamma_{1}$, or, equivalently, $||\mathbb{E}[xy]||^{2}\leq\frac{\gamma_{2}\gamma_{1}}{b_{3}^{2}...b_{u}^{2}\left(\prod_{i=3}^{D}d_{i}\right)}$. When $||\mathbb{E}[xy]||^{2}>\frac{\gamma_{2}\gamma_{1}}{b_{3}^{2}...b_{u}^{2}\left(\prod_{i=3}^{D}d_{i}\right)}$, the solution can be non-trivial:

{v=b2𝐫1;W=𝐫1𝔼[xy]Tμb2b3bu[(b2)2d32dD2bu2s2(σ2+d1)A0+γ1I]1,\begin{cases}v_{*}=b_{2}^{*}\mathbf{r}_{1};\\ W_{*}=\mathbf{r}_{1}\mathbb{E}[xy]^{T}\mu b_{2}^{*}b_{3}...b_{u}\left[(b_{2}^{*})^{2}d_{3}^{2}...d_{D}^{2}b_{u}^{2}s^{2}\left(\sigma^{2}+d_{1}\right)A_{0}+\gamma_{1}I\right]^{-1},\end{cases} (46)

for some $b^{*}_{2}$. This proves the proposition. \square

B.6 Lemma 3

Lemma 3.

At any global minimum of Eq. (9), let $b_{1}:=\sqrt{||W_{i:}||^{2}/d}$ and $b_{D+1}:=b_{u}$; then

\gamma_{k+1}d_{k+1}b_{k+1}^{2}=\gamma_{k}d_{k-1}b_{k}^{2}. (47)

Proof. It is sufficient to show that for all $k$,

\gamma_{k+1}\sum_{ij}(W_{ji}^{k+1})^{2}=\gamma_{k}\sum_{ij}(W_{ij}^{k})^{2}. (48)

We prove this by contradiction. Let $U^{*},W^{*}$ be a global minimum of the loss function, and assume that for some $k$,

\gamma_{k+1}\sum_{ij}(W_{ji}^{*,k+1})^{2}\neq\gamma_{k}\sum_{ij}(W_{ij}^{*,k})^{2}. (49)

Now, we introduce $W^{a}$ such that $W_{ji}^{a,k+1}=aW_{ji}^{*,k+1}$ and $W_{ji}^{a,k}=W_{ji}^{*,k}/a$, with all other parameters unchanged. The loss without regularization is invariant under the transformation $W^{*}\to W^{a}$, namely

L_{0}(W^{*})=L_{0}(W^{a}). (50)

In the regularization, all the terms remain invariant except two:

\begin{cases}\gamma_{k+1}\sum_{ij}(W_{ji}^{*,k+1})^{2}\to\gamma_{k+1}\sum_{ij}(W_{ji}^{a,k+1})^{2}=a^{2}\gamma_{k+1}\sum_{ij}(W_{ji}^{*,k+1})^{2}\\ \gamma_{k}\sum_{ij}(W_{ij}^{*,k})^{2}\to\gamma_{k}\sum_{ij}(W_{ji}^{a,k})^{2}=a^{-2}\gamma_{k}\sum_{ij}(W_{ji}^{*,k})^{2}\end{cases} (51)

The sum of $a^{2}\gamma_{k+1}\sum_{ij}(W_{ji}^{*,k+1})^{2}$ and $a^{-2}\gamma_{k}\sum_{ij}(W_{ji}^{*,k})^{2}$ reaches its minimum when $a^{2}=\sqrt{\frac{\gamma_{k}\sum_{ij}(W_{ji}^{*,k})^{2}}{\gamma_{k+1}\sum_{ij}(W_{ji}^{*,k+1})^{2}}}$. If $\gamma_{k+1}\sum_{ij}(W_{ji}^{*,k+1})^{2}\neq\gamma_{k}\sum_{ij}(W_{ij}^{*,k})^{2}$, this minimizer differs from $a=1$, so one can choose $a$ to strictly decrease the regularization terms, giving $L(W^{a})<L(W^{*})$ and contradicting the assumption that $W^{*}$ is a global minimum. Thus, $\gamma_{k+1}\sum_{ij}(W_{ji}^{*,k+1})^{2}=\gamma_{k}\sum_{ij}(W_{ij}^{*,k})^{2}$ must hold. \square

B.7 Proof of Proposition 2

Proof. Let

L_{0}=\mathbb{E}_{x}\mathbb{E}_{\epsilon^{(1)},...,\epsilon^{(D)}}\left(\sum_{i,i_{1},i_{2},...,i_{D}}^{d,d_{1},d_{2},...d_{D}}U_{i_{D}}\epsilon^{(D)}_{i_{D}}...\epsilon_{i_{1}}^{(1)}W_{i_{1}i}^{(1)}x_{i}-y\right)^{2}. (52)

$L_{0}$ is a polynomial containing $(2D+2)$-th order, $(D+1)$-th order, and $0$-th order terms in the parameters $U$ and $W$. The second-order derivative of $L_{0}$ is thus a polynomial containing only $2D$-th order and $(D-1)$-th order terms; no other orders are possible. For $D\geq 2$, there is therefore no constant term in the Hessian of $L_{0}$, and every term contains at least one parameter.

The Hessian of the full loss function with regularization is

\frac{\partial^{2}L}{\partial U_{i}\partial U_{j}} =\frac{\partial^{2}L_{0}}{\partial U_{i}\partial U_{j}}+2\gamma_{u}\delta_{ij}; (53)
\frac{\partial^{2}L}{\partial W^{(i)}_{jk}\partial U_{l}} =\frac{\partial^{2}L_{0}}{\partial W^{(i)}_{jk}\partial U_{l}}; (54)
\frac{\partial^{2}L}{\partial W^{(i)}_{jk}\partial W^{(l)}_{mn}} =\frac{\partial^{2}L_{0}}{\partial W^{(i)}_{jk}\partial W^{(l)}_{mn}}+2\gamma_{w}\delta_{il}\delta_{jm}\delta_{kn}. (55)

At $U=0$, $W=0$, the Hessian of $L_{0}$ vanishes, since (for $D\geq 2$) every term of this Hessian contains at least one parameter. The Hessian of $L$ therefore becomes

\left.\frac{\partial^{2}L}{\partial U_{i}\partial U_{j}}\right|_{U,W=0} =2\gamma_{u}\delta_{ij}; (56)
\left.\frac{\partial^{2}L}{\partial W^{(i)}_{jk}\partial U_{l}}\right|_{U,W=0} =0; (57)
\left.\frac{\partial^{2}L}{\partial W^{(i)}_{jk}\partial W^{(l)}_{mn}}\right|_{U,W=0} =2\gamma_{w}\delta_{il}\delta_{jm}\delta_{kn}. (58)

This Hessian of $L$ is a positive-definite matrix. Thus, $U=0$, $W=0$ is always a local minimum of the loss function $L$. \square
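As an illustrative check (not part of the proof), the following snippet computes a finite-difference Hessian of the regularized loss at the origin for a small $D=2$ linear net, with the noise taken to be trivial for simplicity and all sizes and $\gamma$ chosen as placeholders; its eigenvalues come out close to $2\gamma>0$, as derived above.

```python
import numpy as np

rng = np.random.default_rng(3)
d, d1, d2, gamma = 2, 2, 2, 0.1
X = rng.standard_normal((200, d))
y = X @ rng.standard_normal(d)

def loss(w):
    # unpack the flat parameter vector into W1, W2, U
    W1 = w[:d1 * d].reshape(d1, d)
    W2 = w[d1 * d:d1 * d + d2 * d1].reshape(d2, d1)
    U = w[d1 * d + d2 * d1:]
    pred = X @ W1.T @ W2.T @ U
    return np.mean((pred - y) ** 2) + gamma * np.sum(w ** 2)

n_p = d1 * d + d2 * d1 + d2
H = np.zeros((n_p, n_p))
h = 1e-4
for i in range(n_p):
    for j in range(n_p):
        def f(si, sj):                          # loss near the origin, shifted in coords i and j
            w = np.zeros(n_p)
            w[i] += si * h
            w[j] += sj * h
            return loss(w)
        H[i, j] = (f(1, 1) - f(1, -1) - f(-1, 1) + f(-1, -1)) / (4 * h * h)

print(np.linalg.eigvalsh(H))                    # all eigenvalues approximately 2 * gamma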

B.8 Proof of Proposition 3

Proof. We first apply Lemma 3 to determine the condition for the nontrivial solution to exist. In particular, the lemma must hold for $W^{(2)}$ and $W^{(1)}$, which leads to the following condition:

||b^{D-1}d_{0}^{D-1}[b^{2D}d_{0}^{D}(\sigma^{2}+d_{0})^{D}A_{0}+\gamma]^{-1}\mathbb{E}[xy]||^{2}=1. (59)

Note that the left-hand side is a continuous function that tends to $0$ as $b\to\infty$. Therefore, it is sufficient to find the condition that guarantees the existence of a $b$ such that the l.h.s. is larger than $1$. For any $b$, the l.h.s. is a monotonically decreasing function of any eigenvalue of $A_{0}$, and so the following two inequalities hold:

\begin{cases}||b^{D-1}d_{0}^{D-1}(b^{2D}d_{0}^{D}(\sigma^{2}+d_{0})^{D}A_{0}+\gamma)^{-1}\mathbb{E}[xy]||\leq||b^{D-1}d_{0}^{D-1}(b^{2D}d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{\rm min}+\gamma)^{-1}\mathbb{E}[xy]||\\ ||b^{D-1}d_{0}^{D-1}(b^{2D}d_{0}^{D}(\sigma^{2}+d_{0})^{D}A_{0}+\gamma)^{-1}\mathbb{E}[xy]||\geq||b^{D-1}d_{0}^{D-1}(b^{2D}d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{\rm max}+\gamma)^{-1}\mathbb{E}[xy]||,\end{cases} (60)

where $a_{\rm min}$ and $a_{\rm max}$ denote the smallest and largest eigenvalues of $A_{0}$.

The second inequality implies that if

||b^{D-1}d_{0}^{D-1}[b^{2D}d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{\rm max}+\gamma]^{-1}\mathbb{E}[xy]||>1, (61)

a nontrivial solution must exist. This condition is equivalent to the existence of a $b$ such that

d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{\rm max}b^{2D}-||\mathbb{E}[xy]||b^{D-1}d_{0}^{D-1}<-\gamma, (62)

which is a polynomial inequality that does not admit an explicit solution for $b$ for a general $D$. Since the l.h.s. is a continuous function that increases to infinity as $b\to\infty$, a sufficient condition for (62) to hold for some $b$ is that the minimum value of the l.h.s. is smaller than $-\gamma$.

Note that the left-hand side of Eq. (62) diverges to $\infty$ as $b\to\pm\infty$ and tends to zero as $b\to 0$. Moreover, the left-hand side is lower-bounded and must have a nontrivial minimizer at some $b>0$ because the coefficient of the $b^{D-1}$ term is strictly negative. One can thus find the minimizer by taking the derivative. In particular, the left-hand side is minimized when

b^{D+1}=\frac{(D-1)||\mathbb{E}[xy]||}{2Dd_{0}(\sigma^{2}+d_{0})^{D}a_{\rm max}}, (63)

and we can obtain the following sufficient condition for (62) to be satisfiable, which, in turn, implies that (59) is satisfiable:

\frac{D+1}{2D}||\mathbb{E}[xy]||d_{0}^{D-1}\left(\frac{(D-1)||\mathbb{E}[xy]||}{2Dd_{0}(\sigma^{2}+d_{0})^{D}a_{\rm max}}\right)^{\frac{D-1}{D+1}}>\gamma, (64)

which is identical to the proposition statement in (15).

Now, we come back to condition (60) to derive a sufficient condition for the trivial solution to be the only solution. The first inequality in Condition (60) implies that if

||b^{D-1}d_{0}^{D-1}[b^{2D}d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{\rm min}+\gamma]^{-1}\mathbb{E}[xy]||\leq 1, (65)

for all $b$, then the only possible solution is the trivial one. The condition for this to hold can be found using the same procedure as above to be

\frac{D+1}{2D}||\mathbb{E}[xy]||d_{0}^{D-1}\left(\frac{(D-1)||\mathbb{E}[xy]||}{2Dd_{0}(\sigma^{2}+d_{0})^{D}a_{\rm min}}\right)^{\frac{D-1}{D+1}}\leq\gamma, (66)

which is identical to (14).

We now prove the upper bound in (16). For any $b$, the first inequality in (60) gives an upper bound on the l.h.s. of (59), so any $b$ that makes this upper bound less than $1$ cannot be a solution. This means that any $b$ for which

||b^{D-1}d_{0}^{D-1}[b^{2D}d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{\rm min}+\gamma]^{-1}\mathbb{E}[xy]||<1 (67)

cannot be a solution. This condition holds if and only if

d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{\rm min}b^{2D}-||\mathbb{E}[xy]||b^{D-1}d_{0}^{D-1}>-\gamma. (68)

Because $\gamma>0$, a sufficient condition for (68) to hold is

d_{0}(\sigma^{2}+d_{0})^{D}a_{\rm min}b^{2D}-||\mathbb{E}[xy]||b^{D-1}>0, (69)

which is equivalent to

b>\left[\frac{||\mathbb{E}[xy]||}{d_{0}(\sigma^{2}+d_{0})^{D}a_{\rm min}}\right]^{\frac{1}{D+1}}. (70)

Namely, any solution $b^{*}$ satisfies

b^{*}\leq\left[\frac{||\mathbb{E}[xy]||}{d_{0}(\sigma^{2}+d_{0})^{D}a_{\rm min}}\right]^{\frac{1}{D+1}}. (71)

We can also find a lower bound for all possible solutions. When $D>1$, another sufficient condition for Eq. (68) to hold is

||\mathbb{E}[xy]||d_{0}^{D-1}b^{D-1}<\gamma, (72)

because the $b^{2D}$ term is always positive. This then implies that any solution must satisfy

b^{*}\geq\frac{1}{d_{0}}\left[\frac{\gamma}{||\mathbb{E}[xy]||}\right]^{\frac{1}{D-1}}. (73)

For $D=1$, we have by Theorem 1 that

b^{*}>0 (74)

if and only if $||\mathbb{E}[xy]||>\gamma$. This means that

b^{*}\geq\lim_{\eta\to 0^{+}}\lim_{D\to 1^{+}}\frac{1}{d_{0}}\left[\frac{\gamma+\eta}{||\mathbb{E}[xy]||}\right]^{\frac{1}{D-1}}=\begin{cases}\infty&\text{if $||\mathbb{E}[xy]||\leq\gamma$};\\ 0&\text{if $||\mathbb{E}[xy]||>\gamma$}.\end{cases} (75)

This finishes the proof. \square
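The following illustrative scan (synthetic data, placeholder constants, equal widths $d_{0}$ and uniform regularization $\gamma$) makes the argument concrete: it checks numerically whether the left-hand side of Eq. (59) ever reaches $1$, and compares the outcome with the closed-form sufficient condition (64).

```python
import numpy as np

rng = np.random.default_rng(4)
d, d0, D, sigma2, gamma = 4, 16, 3, 0.5, 0.01
X = rng.standard_normal((5000, d))
y = X @ rng.standard_normal(d)
A0 = X.T @ X / len(X)
Exy = X.T @ y / len(X)
a_max = np.linalg.eigvalsh(A0).max()

def g(b):
    # square root of the left-hand side of Eq. (59); a solution exists iff g(b) = 1 for some b
    M = b ** (2 * D) * d0 ** D * (sigma2 + d0) ** D * A0 + gamma * np.eye(d)
    return b ** (D - 1) * d0 ** (D - 1) * np.linalg.norm(np.linalg.solve(M, Exy))

bs = np.linspace(1e-4, 1.0, 20_000)
print("max_b g(b) =", max(g(b) for b in bs), "(a nontrivial solution exists if this exceeds 1)")

# closed-form sufficient condition (64): l.h.s. > gamma guarantees existence
norm_exy = np.linalg.norm(Exy)
lhs_64 = ((D + 1) / (2 * D)) * norm_exy * d0 ** (D - 1) * \
         ((D - 1) * norm_exy / (2 * D * d0 * (sigma2 + d0) ** D * a_max)) ** ((D - 1) / (D + 1))
print("condition (64):", lhs_64, "> gamma =", gamma, "?", lhs_64 > gamma)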

B.9 Proof of Theorem 3

Proof. When nontrivial solutions exist, we are interested in identifying when $b=0$ is not the global minimum. To achieve this, we compare the loss at $b=0$ with that of the other solutions. Plugging the trivial solution into the loss function in Eq. (9), the loss is easily identified to be $L_{\rm trivial}=\mathbb{E}[y^{2}]$.

For the nontrivial minimum, defining $f$ to be the model,

f(x) :=\sum_{i,i_{1},i_{2},...,i_{D}}^{d,d_{1},d_{2},...d_{D}}U_{i_{D}}\epsilon^{(D)}_{i_{D}}...\epsilon_{i_{2}}^{(2)}W_{i_{2}i_{1}}^{(2)}\epsilon_{i_{1}}^{(1)}W_{i_{1}i}^{(1)}x_{i} (76)
=\eta d_{0}^{D}b^{2D}\mathbb{E}[xy]^{T}[b^{2D}d_{0}^{D}(\sigma^{2}+d_{0})^{D}A_{0}+\gamma I]^{-1}x, (77)

where, similarly to the previous proof, we have defined $\sum_{i_{1},...,i_{D}}^{d_{1},...,d_{D}}\epsilon^{(D)}_{i_{D}}...\epsilon_{i_{1}}^{(1)}:=\eta$ such that $\mathbb{E}[\eta]=\prod_{i}^{D}d_{i}=d_{0}^{D}$ and $\mathbb{E}[\eta^{2}]=\prod_{i}^{D}d_{i}(\sigma_{i}^{2}+d_{i})=d_{0}^{D}(\sigma^{2}+d_{0})^{D}$. With this notation, the loss function becomes

𝔼x𝔼η(f(x)y)2+L2reg.\displaystyle\mathbb{E}_{x}\mathbb{E}_{\eta}(f(x)-y)^{2}+L_{2}\ reg. (78)
=𝔼x,η[f(x)2]2𝔼x,η[yf(x)]+𝔼x[y2]+L2reg.\displaystyle=\mathbb{E}_{x,\eta}[f(x)^{2}]-2\mathbb{E}_{x,\eta}[yf(x)]+\mathbb{E}_{x}[y^{2}]+L_{2}\ reg. (79)
=id03D(σ2+d0)Db4Dai𝔼[xy]i2[d0D(σ2+d0)Daib2D+γ]22id02Db2D𝔼[xy]i2d0D(σ2+d0)Daib2D+γ+𝔼x[y2]+L2reg.\displaystyle=\sum_{i}\frac{d_{0}^{3D}(\sigma^{2}+d_{0})^{D}b^{4D}a_{i}\mathbb{E}[x^{\prime}y]_{i}^{2}}{[d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{i}b^{2D}+\gamma]^{2}}-2\sum_{i}\frac{d_{0}^{2D}b^{2D}\mathbb{E}[x^{\prime}y]^{2}_{i}}{d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{i}b^{2D}+\gamma}+\mathbb{E}_{x}[y^{2}]+L_{2}\ reg. (80)

The last equation is obtained by rotating $x$ with an orthogonal matrix $R$ such that $R^{-1}A_{0}R={\rm diag}(a_{i})$ and denoting the rotated input by $x^{\prime}=Rx$. With $x^{\prime}$, the $L_{2}$ regularization term takes the form

L_{2}\ reg. =\gamma Dd_{0}^{2}b^{2}+\gamma\sum_{i}\frac{d_{0}^{2D}b^{2D}\mathbb{E}[x^{\prime}y]_{i}^{2}}{(d_{0}^{D}(\sigma^{2}+d_{0})^{D}b^{2D}a_{i}+\gamma)^{2}}. (81)

Combining the expressions of (81) and (80), we obtain that the difference between the loss at the non-trivial solution and the loss at 0 is

id02Db2D𝔼[xy]i2[d0D(σ2+d0)Daib2D+γ]+γDd02b2.-\sum_{i}\frac{d_{0}^{2D}b^{2D}\mathbb{E}[x^{\prime}y]_{i}^{2}}{[d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{i}b^{2D}+\gamma]}+\gamma Dd_{0}^{2}b^{2}. (82)

Satisfaction of the following relation thus guarantees that the global minimum is nontrivial:

id02Db2D𝔼[xy]i2[d0D(σ2+d0)Daib2D+γ]γDd02b2.\sum_{i}\frac{d_{0}^{2D}b^{2D}\mathbb{E}[x^{\prime}y]_{i}^{2}}{[d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{i}b^{2D}+\gamma]}\geq\gamma Dd_{0}^{2}b^{2}. (83)

This relation is satisfied if

d02Db2D𝔼[xy]2[d0D(σ2+d0)Damaxb2D+γ]\displaystyle\frac{d_{0}^{2D}b^{2D}||\mathbb{E}[xy]||^{2}}{[d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{max}b^{2D}+\gamma]} γDd02b2\displaystyle\geq\gamma Dd_{0}^{2}b^{2} (84)
b2D2[d0D(σ2+d0)Damaxb2D+γ]\displaystyle\frac{b^{2D-2}}{[d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{max}b^{2D}+\gamma]} γDd02D2𝔼[xy]2.\displaystyle\geq\frac{\gamma D}{d_{0}^{2D-2}||\mathbb{E}[xy]||^{2}}. (85)

The derivative of the l.h.s. of (85) with respect to $b$ is

\frac{b^{2D-3}[(2D-2)\gamma-2d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{max}b^{2D}]}{[d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{max}b^{2D}+\gamma]^{2}}. (86)

For $b,\gamma\in(0,\infty)$, this derivative is positive for small $b$ and negative for large $b$, indicating that the l.h.s. of (85) attains a global maximum at a strictly positive $b$. This maximizer is found by setting the derivative to $0$, namely

\frac{b^{2D-3}[(2D-2)\gamma-2d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{max}b^{2D}]}{[d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{max}b^{2D}+\gamma]^{2}} =0 (87)
(2D-2)\gamma-2d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{max}b^{2D} =0 (88)
b^{2D} =\frac{(D-1)\gamma}{d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{max}}. (89)

The maximum value then takes the form

(D1)D1DDγ1Dd0D1(σ2+d0)D1amaxD1D.\frac{(D-1)^{\frac{D-1}{D}}}{D\gamma^{\frac{1}{D}}d_{0}^{D-1}(\sigma^{2}+d_{0})^{D-1}a_{max}^{\frac{D-1}{D}}}. (90)

The following condition thus guarantees that the global minimum is non-trivial

(D1)D1DDγ1Dd0D1(σ2+d0)D1amaxD1D\displaystyle\frac{(D-1)^{\frac{D-1}{D}}}{D\gamma^{\frac{1}{D}}d_{0}^{D-1}(\sigma^{2}+d_{0})^{D-1}a_{max}^{\frac{D-1}{D}}} γDd02D2𝔼[xy]2\displaystyle\geq\frac{\gamma D}{d_{0}^{2D-2}||\mathbb{E}[xy]||^{2}} (91)
𝔼[xy]2\displaystyle||\mathbb{E}[xy]||^{2} γD+1DD2(σ2+d0)D1amaxD1Dd0D1(D1)D1D.\displaystyle\geq\frac{\gamma^{\frac{D+1}{D}}D^{2}(\sigma^{2}+d_{0})^{D-1}a_{max}^{\frac{D-1}{D}}}{d_{0}^{D-1}(D-1)^{\frac{D-1}{D}}}. (92)

This finishes the proof. \square

B.10 Proof of Theorem 4

Proof. The model prediction is:

f(x) :=\sum_{i,i_{1},i_{2},...,i_{D}}^{d,d_{1},d_{2},...d_{D}}U_{i_{D}}\epsilon^{(D)}_{i_{D}}...\epsilon_{i_{2}}^{(2)}W_{i_{2}i_{1}}^{(2)}\epsilon_{i_{1}}^{(1)}W_{i_{1}i}^{(1)}x_{i} (93)
=\eta d_{0}^{D}b^{2D}\mathbb{E}[xy]^{T}[b^{2D}d_{0}^{D}(\sigma^{2}+d_{0})^{D}\sigma_{x}^{2}I+\gamma I]^{-1}x. (94)

One can find the expectation value and variance of the model prediction over the noise. The expectation is

𝔼η[f(x)]=d02Db2D𝔼[xy]Txb2Dd0D(σ2+d0)Dσx2+γ\mathbb{E}_{\eta}[f(x)]=\frac{d_{0}^{2D}b^{2D}\mathbb{E}[xy]^{T}x}{b^{2D}d_{0}^{D}(\sigma^{2}+d_{0})^{D}\sigma_{x}^{2}+\gamma} (95)

For the trivial solution, the theorem is trivially true. We thus focus on the case when the global minimum is nontrivial.

The variance of the model is

Var[f(x)]\displaystyle{\rm Var}[f(x)] =𝔼[f(x)2]𝔼[f(x)]2\displaystyle=\mathbb{E}[f(x)^{2}]-\mathbb{E}[f(x)]^{2} (96)
=\frac{(\sigma^{2}+d_{0})^{D}d_{0}^{3D}b^{4D}(\mathbb{E}[xy]^{T}x)^{2}}{[b^{2D}d_{0}^{D}(\sigma^{2}+d_{0})^{D}\sigma_{x}^{2}+\gamma]^{2}}-\frac{d_{0}^{4D}b^{4D}(\mathbb{E}[xy]^{T}x)^{2}}{[b^{2D}d_{0}^{D}(\sigma^{2}+d_{0})^{D}\sigma_{x}^{2}+\gamma]^{2}} (97)
=d03D[(σ2+d0)Dd0D]b4D(𝔼[xy]Tx)2[b2Dd0D(σ2+d0)Dσx2+γ]2\displaystyle=\frac{d_{0}^{3D}[(\sigma^{2}+d_{0})^{D}-d_{0}^{D}]b^{4D}(\mathbb{E}[xy]^{T}x)^{2}}{[b^{2D}d_{0}^{D}(\sigma^{2}+d_{0})^{D}\sigma_{x}^{2}+\gamma]^{2}} (98)
=d03D[(σ2+d0)Dd0D]b2D+2(𝔼[xy]Tx)2𝔼[xy]2,\displaystyle=\frac{d_{0}^{3D}[(\sigma^{2}+d_{0})^{D}-d_{0}^{D}]b^{2D+2}(\mathbb{E}[xy]^{T}x)^{2}}{||\mathbb{E}[xy]||^{2}}, (99)

where the last equation follows from Eq. (13). The variance can be upper-bounded by applying (16),

Var[f(x)]d0D[(σ2+d0)Dd0D](𝔼[xy]Tx)2(σ2+d0)2Dσx2d0D[(σ2+d0)Dd0D](σ2+d0)2D.{\rm Var}[f(x)]\leq\frac{d_{0}^{D}[(\sigma^{2}+d_{0})^{D}-d_{0}^{D}](\mathbb{E}[xy]^{T}x)^{2}}{(\sigma^{2}+d_{0})^{2D}\sigma_{x}^{2}}\propto\frac{d_{0}^{D}[(\sigma^{2}+d_{0})^{D}-d_{0}^{D}]}{(\sigma^{2}+d_{0})^{2D}}. (100)

We first consider the limit $d_{0}\to\infty$ with fixed $\sigma^{2}$:

Var[f(x)]Dd02D1σ2(d0+σ2)2D=O(1d0).{\rm Var}[f(x)]\propto\frac{Dd_{0}^{2D-1}\sigma^{2}}{(d_{0}+\sigma^{2})^{2D}}=O\left(\frac{1}{d_{0}}\right). (101)

For the limit $\sigma^{2}\to\infty$ with $d_{0}$ fixed, we have

Var[f(x)]=O(1(σ2)D).{\rm Var}[f(x)]=O\left(\frac{1}{(\sigma^{2})^{D}}\right). (102)

Additionally, we can consider the limit when $D\to\infty$ as we fix both $\sigma^{2}$ and $d_{0}$:

{\rm Var}[f(x)]=O\left(e^{-D\log[(\sigma^{2}+d_{0})/d_{0}]}\right), (103)

which is an exponential decay. \square

Appendix C Exact Form of $b_{*}$ for $D=1$

Note that our main result does not specify the exact value of $b_{*}$. This is because $b_{*}$ must satisfy the condition in Eq. (6), which is equivalent to a high-order polynomial equation in $b$ whose coefficients are general functions of the eigenvalues of $A_{0}$; by Galois theory, such equations generally admit no closed-form solution. One special case where an analytical formula exists is $A_{0}=\sigma_{x}^{2}I$. Practically, this can be achieved for any (full-rank) dataset if we decorrelate and rescale the data by the whitening transformation $x\to\sigma_{x}\sqrt{A_{0}^{-1}}x$. In this case, we have

b_{*}^{2}=\frac{\sqrt{\frac{\gamma_{w}}{\gamma_{u}}}||\mathbb{E}[xy]||-\gamma_{w}}{(\sigma^{2}+d_{1})\sigma_{x}^{2}}, (104)

and

v=\pm\sqrt{\frac{\sqrt{\frac{\gamma_{u}}{\gamma_{w}}}||\mathbb{E}[xy]||-\gamma_{u}}{\sigma_{x}^{2}(\sigma^{2}+d_{1})}}\frac{\mathbb{E}[xy]}{||\mathbb{E}[xy]||}, (105)

where $v=W_{i:}$.
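As a quick illustrative check (all constants below are placeholders), the closed form (104) can be verified against the defining condition, Eq. (33), specialized to $A_{0}=\sigma_{x}^{2}I$:

```python
import numpy as np

rng = np.random.default_rng(5)
d, d1, sigma2, sigma_x2 = 5, 32, 0.5, 1.0
gamma_u, gamma_w = 0.02, 0.03
Exy = rng.standard_normal(d)
Exy *= 2.0 / np.linalg.norm(Exy)          # ensure ||E[xy]||^2 > gamma_u * gamma_w

# closed form (104) for b_*^2 under whitened inputs
b2 = (np.sqrt(gamma_w / gamma_u) * np.linalg.norm(Exy) - gamma_w) / ((sigma2 + d1) * sigma_x2)

# left-hand side of Eq. (33) with A_0 = sigma_x^2 * I, evaluated at b_*^2
lhs = np.linalg.norm(Exy) ** 2 / (b2 * (sigma2 + d1) * sigma_x2 + gamma_w) ** 2
print(lhs, gamma_u / gamma_w)             # the two values coincide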

Appendix D Effect of Bias

This section studies a deep linear network with biases in every layer and compares it with the no-bias network. We first study the general case in which the data does not receive any preprocessing. We then show that the problem reduces to the setting considered in the main text under the common preprocessing scheme that centers the input and output data: $\mathbb{E}[x]=0$ and $\mathbb{E}[y]=0$.

D.1 Two-layer network

The two-layer linear network with bias is defined as

fb(x;U,W,βU,βW)=iϵiUi(Wi:x+βiW)+βU,f_{b}(x;U,W,\beta^{U},\beta^{W})=\sum_{i}\epsilon_{i}U_{i}(W_{i:}\cdot x+\beta^{W}_{i})+\beta^{U}, (106)

where $\beta^{W}\in\mathbb{R}^{d_{1}}$ is the bias in the hidden layer and $\beta^{U}\in\mathbb{R}$ is the bias at the output layer. The loss function is

Lb(U,W,βU,βW)=\displaystyle L_{b}(U,W,\beta^{U},\beta^{W})= 𝔼ϵ,x,y[(iϵiUi(Wi:x+βiW)+βUy)2]+L2\displaystyle\mathbb{E}_{\epsilon,x,y}\left[(\sum_{i}\epsilon_{i}U_{i}(W_{i:}\cdot x+\beta^{W}_{i})+\beta^{U}-y)^{2}\right]+L_{2} (107)
=\displaystyle= 𝔼x,y[(UWx+UβW+βUy)2+σ2iUi2(Wi:x+βiW)2]\displaystyle\mathbb{E}_{x,y}\left[(UWx+U\beta^{W}+\beta^{U}-y)^{2}+\sigma^{2}\sum_{i}U_{i}^{2}(W_{i:}\cdot x+\beta^{W}_{i})^{2}\right]
+γu(U2+(βU)2)+γw(W2+βW2).\displaystyle+\gamma_{u}(||U||^{2}+(\beta^{U})^{2})+\gamma_{w}(||W||^{2}+||\beta^{W}||^{2}). (108)

It is helpful to concatenate $x$ and $1$ into a single vector $x^{\prime}:=(x,1)^{\rm T}$ and to concatenate $W$ and $\beta^{W}$ into a single matrix $W^{\prime}$, such that $W$, $\beta^{W}$, $x$, and $W^{\prime}$, $x^{\prime}$ are related via the following equation:

Wx+\beta^{W}=W^{\prime}x^{\prime}. (109)

Using $W^{\prime}$ and $x^{\prime}$, the model can be written as

f_{b}(x^{\prime},U,W^{\prime},\beta^{U})=\sum_{i}\epsilon_{i}U_{i}W^{\prime}_{i:}\cdot x^{\prime}+\beta^{U}. (110)

The loss function simplifies to

Lb(U,W,β)=𝔼ϵ,x,y[(iϵiUiWi:x+βUy)2]+γu(U2+(βU)2)+γwW2.L_{b}(U,W^{\prime},\beta)=\mathbb{E}_{\epsilon,x,y}[(\sum_{i}\epsilon_{i}U_{i}W^{\prime}_{i:}\cdot x^{\prime}+\beta^{U}-y)^{2}]+\gamma_{u}(||U||^{2}+(\beta^{U})^{2})+\gamma_{w}||W^{\prime}||^{2}. (111)

Note that (111) possesses the same rescaling invariance between $U$ and $W^{\prime}$ and the same invariance under aligning $W^{\prime}_{i:}$ and $W^{\prime}_{j:}$ as before. One can thus obtain the following two propositions, which mirror Lemmas 1 and 2.

Proposition 5.

At the global minimum of (107), $U_{j}^{2}=\frac{\gamma_{w}}{\gamma_{u}}\left(\sum_{i}W_{ji}^{2}+(\beta^{W}_{j})^{2}\right)$.

Proposition 6.

At the global minimum, for all $i$ and $j$, we have

\begin{cases}U_{i}^{2}=U_{j}^{2};\\ U_{i}W_{i:}=U_{j}W_{j:};\\ U_{i}\beta^{W}_{i}=U_{j}\beta^{W}_{j}.\end{cases} (112)

The proofs are omitted because they are the same as those of Lemmas 1 and 2, with $W$ replaced by $W^{\prime}$. Following a procedure similar to the one used to find the solution for the no-bias network, one can prove the following theorem.

Theorem 5.

The global minimum of Eq. (107) is of the form

{U=b𝐫;βU=d1b[(d1+σ2)γuγwb1]v𝔼[x](d1γuγwb21)𝔼[y](d1+σ2)d1γu2γw2b4+(γu2)d1γuγwb2+γu+1;W=𝐫b{𝔼[x][bγuγw(d1+bσ2)1]βU+𝔼[xy]}T[b2(d1+σ2)A0+γwI]1;βW=𝐫γuγwbβU,\begin{cases}U=b\mathbf{r};\\ \beta^{U}=\frac{d_{1}b\left[(d_{1}+\sigma^{2})\frac{\gamma_{u}}{\gamma_{w}}b-1\right]v\mathbb{E}[x]-\left(d_{1}\frac{\gamma_{u}}{\gamma_{w}}b^{2}-1\right)\mathbb{E}[y]}{(d_{1}+\sigma^{2})d_{1}\frac{\gamma_{u}^{2}}{\gamma_{w}^{2}}b^{4}+(\gamma_{u}-2)d_{1}\frac{\gamma_{u}}{\gamma_{w}}b^{2}+\gamma_{u}+1};\\ W=\mathbf{r}b\left\{\mathbb{E}[x]\left[b\frac{\gamma_{u}}{\gamma_{w}}(d_{1}+b\sigma^{2})-1\right]\beta^{U}+\mathbb{E}[xy]\right\}^{T}[b^{2}(d_{1}+\sigma^{2})A_{0}+\gamma_{w}I]^{-1};\\ \beta^{W}=-\mathbf{r}\frac{\gamma_{u}}{\gamma_{w}}b\beta^{U},\end{cases} (113)

where $b$ satisfies

γub2=b2γw(𝔼[y]S1S3S4𝔼[x]𝔼[xy])(M1)2(𝔼[y]S1S3S4𝔼[x]𝔼[xy])T+γu2γw(S3S4𝔼[y]bS2S4𝔼[x]M1𝔼[xy])2(bS2S1S4𝔼[x]M1𝔼[x]T1)2,\gamma_{u}b^{2}=b^{2}\frac{\gamma_{w}(\frac{\mathbb{E}[y]S_{1}S_{3}}{S_{4}}\mathbb{E}[x]-\mathbb{E}[xy])(M^{-1})^{2}(\frac{\mathbb{E}[y]S_{1}S_{3}}{S_{4}}\mathbb{E}[x]-\mathbb{E}[xy])^{T}+\frac{\gamma_{u}^{2}}{\gamma_{w}}\left(\frac{S_{3}}{S_{4}}\mathbb{E}[y]-b\frac{S_{2}}{S_{4}}\mathbb{E}[x]M^{-1}\mathbb{E}[xy]\right)^{2}}{\left(b\frac{S_{2}S_{1}}{S_{4}}\mathbb{E}[x]M^{-1}\mathbb{E}[x]^{T}-1\right)^{2}}, (114)

where $M,S_{1},S_{2},S_{3},S_{4}$ are functions of $b$ and the problem constants, defined in Eq. (120).

Proof. First of all, we derive a handy relation satisfied by $\beta^{U}$ and $\beta^{W}$ at all stationary points. The zero-gradient condition at stationary points gives

{𝔼x,y[2(UWx+UβW+βUy)]U+2γwβW=0;𝔼x,y[2(UWx+UβW+βUy)]+2γuβU=0,\begin{cases}\mathbb{E}_{x,y}[2(UWx+U\beta^{W}+\beta^{U}-y)]U+2\gamma_{w}\beta^{W}=0;\\ \mathbb{E}_{x,y}[2(UWx+U\beta^{W}+\beta^{U}-y)]+2\gamma_{u}\beta^{U}=0,\\ \end{cases} (115)

leading to

UγuβU+γwβW=0\displaystyle U\gamma_{u}\beta^{U}+\gamma_{w}\beta^{W}=0 (116)
βiW=γuγwUiβU.\displaystyle\beta^{W}_{i}=-\frac{\gamma_{u}}{\gamma_{w}}U_{i}\beta^{U}. (117)

Propositions 5 and 6 imply that we can define $b:=|U_{i}|$ and $bv:=U_{i}W_{i:}$. Consequently, $U_{i}\beta^{W}_{i}=-\frac{\gamma_{u}}{\gamma_{w}}b^{2}\beta^{U}$, and the loss function can be written as

𝔼x[(bd1jvjxj(d1γuγwb21)βUy)2+b2d1σ2(kvkxkγuγwbβU)2]\displaystyle\mathbb{E}_{x}\left[\left(bd_{1}\sum_{j}v_{j}x_{j}-(d_{1}\frac{\gamma_{u}}{\gamma_{w}}b^{2}-1)\beta^{U}-y\right)^{2}+b^{2}d_{1}\sigma^{2}\left(\sum_{k}v_{k}x_{k}-\frac{\gamma_{u}}{\gamma_{w}}b\beta^{U}\right)^{2}\right] +γud1b2\displaystyle+\gamma_{u}d_{1}b^{2}
+γwd1v22+γu(b2d1γuγw+1)(βU)2.\displaystyle+\gamma_{w}d_{1}||v||_{2}^{2}+\gamma_{u}\left(\frac{b^{2}d_{1}\gamma_{u}}{\gamma_{w}}+1\right)(\beta^{U})^{2}. (118)

The zero-gradient conditions for $v$ and $\beta^{U}$ imply that, at all stationary points,

{v=[b2(d1+σ2)A0+γwI]1b{𝔼[x][bγuγw(d1+bσ2)1]βU+𝔼[xy]};βU=d1b[(d1+σ2)γuγwb1]v𝔼[x](d1γuγwb21)𝔼[y](d1+σ2)d1γu2γw2b4+(γu2)d1γuγwb2+γu+1.\begin{cases}v=[b^{2}(d_{1}+\sigma^{2})A_{0}+\gamma_{w}I]^{-1}b\left\{\mathbb{E}[x]\left[b\frac{\gamma_{u}}{\gamma_{w}}(d_{1}+b\sigma^{2})-1\right]\beta^{U}+\mathbb{E}[xy]\right\};\\ \beta^{U}=\frac{d_{1}b\left[(d_{1}+\sigma^{2})\frac{\gamma_{u}}{\gamma_{w}}b-1\right]v\mathbb{E}[x]-\left(d_{1}\frac{\gamma_{u}}{\gamma_{w}}b^{2}-1\right)\mathbb{E}[y]}{(d_{1}+\sigma^{2})d_{1}\frac{\gamma_{u}^{2}}{\gamma_{w}^{2}}b^{4}+(\gamma_{u}-2)d_{1}\frac{\gamma_{u}}{\gamma_{w}}b^{2}+\gamma_{u}+1}.\end{cases} (119)

To shorten the expressions, we introduce

{M=b2(d1+σ2)A0+γwI;S1=bγuγw(d1+bσ2)1;S2=d1b[(d1+σ2)γuγwb1];S3=d1γuγwb21;S4=(d1+σ2)d1γu2γw2b4+(γu2)d1γuγwb2+γu+1.\begin{cases}M=b^{2}(d_{1}+\sigma^{2})A_{0}+\gamma_{w}I;\\ S_{1}=b\frac{\gamma_{u}}{\gamma_{w}}(d_{1}+b\sigma^{2})-1;\\ S_{2}=d_{1}b\left[(d_{1}+\sigma^{2})\frac{\gamma_{u}}{\gamma_{w}}b-1\right];\\ S_{3}=d_{1}\frac{\gamma_{u}}{\gamma_{w}}b^{2}-1;\\ S_{4}=(d_{1}+\sigma^{2})d_{1}\frac{\gamma_{u}^{2}}{\gamma_{w}^{2}}b^{4}+(\gamma_{u}-2)d_{1}\frac{\gamma_{u}}{\gamma_{w}}b^{2}+\gamma_{u}+1.\end{cases} (120)

With $M,S_{1},S_{2},S_{3},S_{4}$, we have

{v=M1b(𝔼[x]S1βU+𝔼[xy]);βU=S2v𝔼[x]S3𝔼[y]S4.\begin{cases}v=M^{-1}b(\mathbb{E}[x]S_{1}\beta^{U}+\mathbb{E}[xy]);\\ \beta^{U}=\frac{S_{2}v\mathbb{E}[x]-S_{3}\mathbb{E}[y]}{S_{4}}.\end{cases} (121)

The inner product of $v$ and $\mathbb{E}[x]$ can be solved as

v𝔼[x]=bS3S4𝔼[x]M1𝔼[x]S1𝔼[y]𝔼[x]M1𝔼[xy]bS2S4𝔼[x]M1𝔼[x]S11.v\mathbb{E}[x]=b\frac{\frac{S_{3}}{S_{4}}\mathbb{E}[x]M^{-1}\mathbb{E}[x]S_{1}\mathbb{E}[y]-\mathbb{E}[x]M^{-1}\mathbb{E}[xy]}{b\frac{S_{2}}{S_{4}}\mathbb{E}[x]M^{-1}\mathbb{E}[x]S_{1}-1}. (122)

Inserting the expression of $v\mathbb{E}[x]$ into the expression of $\beta^{U}$, one obtains

βU=S3𝔼[y]bS2𝔼[x]M1𝔼[xy]b𝔼[x]M1𝔼[x]S1S2S4\beta^{U}=\frac{S_{3}\mathbb{E}[y]-bS_{2}\mathbb{E}[x]M^{-1}\mathbb{E}[xy]}{b\mathbb{E}[x]M^{-1}\mathbb{E}[x]S_{1}S_{2}-S_{4}} (123)

The global minimum must thus satisfy

γub2\displaystyle\gamma_{u}b^{2} =γwv2+γub2d1γuγw(βU)2\displaystyle=\gamma_{w}||v||^{2}+\gamma_{u}\frac{b^{2}d_{1}\gamma_{u}}{\gamma_{w}}(\beta^{U})^{2} (124)
=b2γw(𝔼[y]S1S3S4𝔼[x]𝔼[xy])(M1)2(𝔼[y]S1S3S4𝔼[x]𝔼[xy])T+γu2γw(S3S4𝔼[y]bS2S4𝔼[x]M1𝔼[xy])2(bS2S1S4𝔼[x]M1𝔼[x]T1)2.\displaystyle=b^{2}\frac{\gamma_{w}(\frac{\mathbb{E}[y]S_{1}S_{3}}{S_{4}}\mathbb{E}[x]-\mathbb{E}[xy])(M^{-1})^{2}(\frac{\mathbb{E}[y]S_{1}S_{3}}{S_{4}}\mathbb{E}[x]-\mathbb{E}[xy])^{T}+\frac{\gamma_{u}^{2}}{\gamma_{w}}\left(\frac{S_{3}}{S_{4}}\mathbb{E}[y]-b\frac{S_{2}}{S_{4}}\mathbb{E}[x]M^{-1}\mathbb{E}[xy]\right)^{2}}{\left(b\frac{S_{2}S_{1}}{S_{4}}\mathbb{E}[x]M^{-1}\mathbb{E}[x]^{T}-1\right)^{2}}. (125)

This completes the proof. \square

Remark.

As in the no-bias case, we have reduced the original problem to a one-dimensional problem. However, the condition for $b$ becomes so complicated that it is difficult to interpret. That being said, all of our numerical simulations include the bias terms, and they suggest that the mechanisms remain qualitatively similar even with biases, so the approach in the main text is justified.

When $\mathbb{E}[x]=0$, the solution can be simplified a little:

{U=rb;βU=d1γuγwb21(d1+σ2)d1γu2γw2b4+(γu2)d1γuγwb2+γu+1𝔼[y];W=rb𝔼[xy]T[b2(d1+σ2)A0+γwI]1;βW=rγuγwbd1γuγwb21(d1+σ2)d1γu2γw2b4+(γu2)d1γuγwb2+γu+1𝔼[y],\begin{cases}U=\textbf{r}b;\\ \beta^{U}=-\frac{d_{1}\frac{\gamma_{u}}{\gamma_{w}}b^{2}-1}{(d_{1}+\sigma^{2})d_{1}\frac{\gamma_{u}^{2}}{\gamma_{w}^{2}}b^{4}+(\gamma_{u}-2)d_{1}\frac{\gamma_{u}}{\gamma_{w}}b^{2}+\gamma_{u}+1}\mathbb{E}[y];\\ W=\textbf{r}b\mathbb{E}[xy]^{T}[b^{2}(d_{1}+\sigma^{2})A_{0}+\gamma_{w}I]^{-1};\\ \beta^{W}=\textbf{r}\frac{\gamma_{u}}{\gamma_{w}}b\frac{d_{1}\frac{\gamma_{u}}{\gamma_{w}}b^{2}-1}{(d_{1}+\sigma^{2})d_{1}\frac{\gamma_{u}^{2}}{\gamma_{w}^{2}}b^{4}+(\gamma_{u}-2)d_{1}\frac{\gamma_{u}}{\gamma_{w}}b^{2}+\gamma_{u}+1}\mathbb{E}[y],\end{cases} (126)

where the value of $b$ is either $0$ or determined by

γu=γw|𝔼[xy]T[b2(d1+σ2)A0+γwI]1|2+γu2γw𝔼[y]2(d1γuγwb21(d1+σ2)d1γu2γw2b4+(γu2)d1γuγwb2+γu+1)2.\gamma_{u}=\gamma_{w}|\mathbb{E}[xy]^{T}[b^{2}(d_{1}+\sigma^{2})A_{0}+\gamma_{w}I]^{-1}|^{2}+\frac{\gamma_{u}^{2}}{\gamma_{w}}\mathbb{E}[y]^{2}\left(\frac{d_{1}\frac{\gamma_{u}}{\gamma_{w}}b^{2}-1}{(d_{1}+\sigma^{2})d_{1}\frac{\gamma_{u}^{2}}{\gamma_{w}^{2}}b^{4}+(\gamma_{u}-2)d_{1}\frac{\gamma_{u}}{\gamma_{w}}b^{2}+\gamma_{u}+1}\right)^{2}. (127)

In this case, the expression of $W$ is identical to that of the no-bias model. The biases of both layers are proportional to $\mathbb{E}[y]$. The equation determining the value of $b$ is also similar to the no-bias case; the only difference is the term proportional to $\mathbb{E}[y]^{2}$.

Lastly, the solution simplifies significantly when both $\mathbb{E}[x]=0$ and $\mathbb{E}[y]=0$. In this case, the solution reverts to the no-bias case. In practice, it is common and usually recommended to subtract the averages of $x$ and $y$ from the data, which achieves precisely $\mathbb{E}[x]=0$ and $\mathbb{E}[y]=0$. We generalize this result to deeper networks in the next section.

D.2 Deep linear network

Let $\beta$ be a $\left(\sum_{i}^{D}d_{i}+1\right)$-dimensional vector concatenating all of $\beta^{(1)},\beta^{(2)},...,\beta^{(D)},\beta^{U}$, and denote the collection of all the weights $U$, $W^{(D)}$, ..., $W^{(1)}$ by $w$. The model of a deep linear network with bias is defined as

fb(x,W(D),,W(1),U,β(D),,β(1),βU)\displaystyle f_{b}(x,W^{(D)},...,W^{(1)},U,\beta^{(D)},...,\beta^{(1)},\beta^{U}) (128)
=\displaystyle= U(ϵ(D)(W(D)((ϵ(2)(W(2)(ϵ(1)(W(1)x+β(1)))+β(2))))+βD))+βU\displaystyle U(\epsilon^{(D)}\circ(W^{(D)}(...(\epsilon^{(2)}\circ(W^{(2)}(\epsilon^{(1)}\circ(W^{(1)}x+\beta^{(1)}))+\beta^{(2)}))...)+\beta^{D}))+\beta^{U} (129)
=\displaystyle= U(ϵ(D)(W(D)((ϵ(2)(W(2)(ϵ(1)(W(1)x)))))))\displaystyle U(\epsilon^{(D)}\circ(W^{(D)}(...(\epsilon^{(2)}\circ(W^{(2)}(\epsilon^{(1)}\circ(W^{(1)}x)))))))
+U(ϵ(D)(W(D)((ϵ(2)(W(2)(ϵ(1)β(1)))))))\displaystyle+U(\epsilon^{(D)}\circ(W^{(D)}(...(\epsilon^{(2)}\circ(W^{(2)}(\epsilon^{(1)}\circ\beta^{(1)}))))))
+U(ϵ(D)(W(D)((ϵ(2)β(2)))))++U(ϵ(D)β(D))+βU\displaystyle+U(\epsilon^{(D)}\circ(W^{(D)}(...(\epsilon^{(2)}\circ\beta^{(2)}))))+...+U(\epsilon^{(D)}\circ\beta^{(D)})+\beta^{U} (130)
=\displaystyle= U(ϵ(D)(W(D)((ϵ(2)(W(2)(ϵ(1)(W(1)x)))))))+bias(w,β),\displaystyle U(\epsilon^{(D)}\circ(W^{(D)}(...(\epsilon^{(2)}\circ(W^{(2)}(\epsilon^{(1)}\circ(W^{(1)}x)))))))+\rm{bias}(w,\beta), (131)

where

bias(w,β)=U(ϵ(D)(W(D)((ϵ(2)(W(2)(ϵ(1)β(1)))))))+U(ϵ(D)(W(D)((ϵ(2)β(2)))))++U(ϵ(D)β(D))+βU,\displaystyle\begin{split}{\rm bias}(w,\beta)=&U(\epsilon^{(D)}\circ(W^{(D)}(...(\epsilon^{(2)}\circ(W^{(2)}(\epsilon^{(1)}\circ\beta^{(1)}))))))\\ &+U(\epsilon^{(D)}\circ(W^{(D)}(...(\epsilon^{(2)}\circ\beta^{(2)}))))+...+U(\epsilon^{(D)}\circ\beta^{(D)})+\beta^{U},\end{split} (132)

and \circ denotes Hadamard product. The loss function is

Lb(x,y,w,β)=𝔼ϵ,x,y[(fb(x,w,β)y)2]+L2(w,β).L_{b}(x,y,w,\beta)=\mathbb{E}_{\epsilon,x,y}[(f_{b}(x,w,\beta)-y)^{2}]+L_{2}(w,\beta). (133)

Propositions 5 and 6 can be generalized to the deep linear network. Similarly to the no-bias case, we can reduce the landscape to a $1$-dimensional problem by performing induction on $D$ and using the two-layer case as the base step. However, we do not solve this case explicitly here because the resulting expressions become too long and complicated to write down, nor would they directly offer much insight. We thus focus only on the case where the data has been properly preprocessed, namely $\mathbb{E}[x]=0$ and $\mathbb{E}[y]=0$.

For simplicity, we assume that the regularization strengths of all the layers are identically $\gamma$. The following theorem shows that when $\mathbb{E}[x]=0$ and $\mathbb{E}[y]=0$, the biases vanish for an arbitrarily deep linear network.

Theorem 6.

Let $\mathbb{E}[x]=0$ and $\mathbb{E}[y]=0$. The global minima of Eq. (133) have $\beta^{(1)}=0,\beta^{(2)}=0,...,\beta^{(D)}=0,\beta^{U}=0$.

Proof. At the global minimum, the gradient of the loss function vanishes. In particular, the derivatives with respect to $\beta$ vanish:

Lb(x,y,w,β)βi\displaystyle\frac{\partial L_{b}(x,y,w,\beta)}{\partial\beta_{i}} =0;\displaystyle=0; (134)
𝔼ϵ,x,y[fb(x,w,β)βi(fb(x,w,β)y)]+γβi\displaystyle\mathbb{E}_{\epsilon,x,y}\left[\frac{\partial f_{b}(x,w,\beta)}{\partial\beta_{i}}(f_{b}(x,w,\beta)-y)\right]+\gamma\beta_{i} =0;\displaystyle=0; (135)
𝔼ϵ,x,y[bias(w,β)βi(fb(x,w,β)y)]+γβi\displaystyle\mathbb{E}_{\epsilon,x,y}\left[\frac{\partial\rm{bias}(w,\beta)}{\partial\beta_{i}}(f_{b}(x,w,\beta)-y)\right]+\gamma\beta_{i} =0;\displaystyle=0; (136)
𝔼ϵ[bias(w,β)βi(fb(𝔼[x],w,β)𝔼[y])]+γβi\displaystyle\mathbb{E}_{\epsilon}\left[\frac{\partial\rm{bias}(w,\beta)}{\partial\beta_{i}}(f_{b}(\mathbb{E}[x],w,\beta)-\mathbb{E}[y])\right]+\gamma\beta_{i} =0,\displaystyle=0, (137)

where $\beta_{i}$ is the $i$-th element of $\beta$. The last equation follows because $f_{b}(x,w,\beta)$ is a linear function of $x$. Using the conditions $\mathbb{E}[x]=0$ and $\mathbb{E}[y]=0$, Equation (137) becomes

𝔼ϵ[bias(w,β)βibias(w,β)]+γβi=0.\mathbb{E}_{\epsilon}\left[\frac{\partial\rm{bias}(w,\beta)}{\partial\beta_{i}}\rm{bias}(w,\beta)\right]+\gamma\beta_{i}=0. (138)

${\rm bias}(w,\beta)$ is a linear combination of the $\beta_{i}$. Consequently, $\partial{\rm bias}(w,\beta)/\partial\beta_{i}$ does not depend on $\beta$, and ${\rm bias}(w,\beta)\,\partial{\rm bias}(w,\beta)/\partial\beta_{i}$ is a linear combination of the $\beta_{i}$. Furthermore, the linearity of ${\rm bias}(w,\beta)$ implies that $\beta\cdot\partial{\rm bias}(w,\beta)/\partial\beta={\rm bias}(w,\beta)$. The solution to Equation (138) thus takes the form

β=argminβ(bias(w,β)2+γβ2).\beta^{*}=\arg\min_{\beta}(||\rm{bias}(w,\beta)||^{2}+\gamma||\beta||^{2}). (139)

The only choice of $\beta$ that minimizes both the first and the second term in (139), regardless of the value of $w$, is

\beta=0. (140)

This finishes the proof. \square

Thus, for a deep linear network, a model without bias suffices to describe data satisfying $\mathbb{E}[x]=0$ and $\mathbb{E}[y]=0$, which can be achieved by subtracting the mean of the data.
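In code, this preprocessing is the usual centering step; a minimal sketch (with hypothetical array names and random placeholder data) is:

```python
import numpy as np

def center(X, y):
    """Subtract the training means so that E[x] = 0 and E[y] = 0."""
    X_centered = X - X.mean(axis=0)
    y_centered = y - y.mean()
    return X_centered, y_centered

# Example usage with random placeholder data:
rng = np.random.default_rng(6)
X, y = rng.standard_normal((100, 5)) + 3.0, rng.standard_normal(100) - 1.0
Xc, yc = center(X, y)
print(np.allclose(Xc.mean(axis=0), 0), np.isclose(yc.mean(), 0))   # True True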