
Exact Phase Transitions in Deep Learning

Liu Ziyin1 Masahito Ueda1,2,3
1Department of Physics, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033
2Institute for Physics of Intelligence, The University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033
3RIKEN Center for Emergent Matter Science (CEMS), Wako, Saitama 351-0198, Japan
Abstract

This work reports deep-learning-unique first-order and second-order phase transitions, whose phenomenology closely follows that in statistical physics. In particular, we prove that the competition between prediction error and model complexity in the training loss leads to a second-order phase transition for nets with one hidden layer and a first-order phase transition for nets with more than one hidden layer. The proposed theory is directly relevant to the optimization of neural networks and points to an origin of the posterior collapse problem in Bayesian deep learning.

Understanding neural networks is a fundamental problem in both theoretical deep learning and neuroscience. In deep learning, learning proceeds as the parameters of different layers become correlated so that the model responds to an input in a meaningful way. This is reminiscent of an ordered phase in physics, where the microscopic degrees of freedom behave collectively and coherently. Meanwhile, regularization effectively prevents the overfitting of the model by reducing the correlation between model parameters in a manner similar to the effect of an entropic force in physics. One thus expects a phase transition in the model behavior from the regime where the regularization is negligible to the regime where it is dominant. In the long history of statistical physics of learning Hopfield, (1982); Watkin et al., (1993); Martin and Mahoney, (2017); Bahri et al., (2020), a series of works studied the under-to-overparametrization (UO) phase transition in the context of linear regression Krogh and Hertz, (1992a); Krogh and Hertz, (1992b); Watkin et al., (1993); Haussler et al., (1996). Recently, this type of phase transition has seen a resurgence of interest Hastie et al., (2019); Liao et al., (2020). One recent work by Li and Sompolinsky, (2021) deals with the UO transition in a deep linear model. However, the UO phase transition is not unique to deep learning because it appears in both shallow and deep models and also in non-neural-network models Belkin et al., (2020). To understand deep learning, we need to identify what is unique about deep neural networks.

In this work, we address the fundamental problem of the loss landscape of a deep neural network and prove that there exist phase transitions in deep learning that can be described precisely as first- and second-order phase transitions with a striking similarity to physics. We argue that these phase transitions can have profound implications for deep learning, such as the importance of symmetry breaking for learning and the qualitative difference between shallow and deep architectures. We also show that these phase transitions are unique to machine learning and deep learning. They are unique to machine learning because they are caused by the competition between the need to make predictions more accurate and the need to make the model simpler. These phase transitions are also deep-learning-unique because they only appear in “deeper” models. For a multilayer linear net with stochastic neurons and trained with L_2 regularization,

  1. we identify an order parameter and an effective landscape that describe the phase transition between a trivial phase and a feature learning phase as the L_2 regularization hyperparameter is changed (Theorem 3);

  2. we show that finite-depth networks cannot have a zeroth-order phase transition (Theorem 2);

  3. we prove that:

    (a) depth-0 nets (linear regression) do not have a phase transition (Theorem 1);

    (b) depth-1 nets have a second-order phase transition (Theorem 4);

    (c) depth-D nets have a first-order phase transition for D>1 (Theorem 5);

    (d) infinite-depth nets have a zeroth-order phase transition (Theorem 6).

The theorem statements and proofs are presented in the Supplementary Section B. To the best of our knowledge, we are the first to identify second-order and first-order phase transitions in the context of deep learning. Our result implies that one can precisely classify the landscape of deep neural models according to the Ehrenfest classification of phase transitions.

Results

Formal framework. Let ℓ(w,a) be a differentiable loss function that depends on the model parameter w and a hyperparameter a. The loss function ℓ can be decomposed into a data-dependent feature-learning term ℓ_0 and a data-independent term aR(w) that regularizes the model at strength a:

\ell(w,a)=\mathbb{E}_{x}[\ell_{0}(w,x)]+aR(w). (1)

Learning amounts to finding the global minimizer of the loss:

\begin{cases}L(a):=\min_{w}\ell(w,a);\\ w_{*}:=\arg\min_{w}\ell(w,a).\end{cases} (2)

Naively, one expects L(a) to change smoothly as we change a. If L changes drastically or even discontinuously when one perturbs a, it becomes hard to tune a to optimize the model performance. In other words, L(a) is well-behaved if and only if a is an easy-to-tune hyperparameter. We are thus interested in the case where the tuning of a is difficult, which occurs when a phase transition comes into play.
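As a concrete and entirely illustrative instance of Eqs. (1)-(2), the minimal sketch below computes L(a) for a toy ridge-regression loss, for which the inner minimization over w has a closed form; the synthetic data, sizes, and names are our own assumptions rather than part of the setup above.

```python
import numpy as np

# Toy instance of Eq. (1): l(w, a) = E_x[(w^T x - y)^2] + a ||w||^2,
# with L(a) = min_w l(w, a) as in Eq. (2).  (Synthetic data; illustrative only.)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)
A0 = X.T @ X / len(X)        # E[x x^T]
Exy = X.T @ y / len(X)       # E[x y]

def L(a):
    """L(a): evaluate the loss at the closed-form minimizer w_*."""
    w_star = np.linalg.solve(A0 + a * np.eye(5), Exy)
    return np.mean((X @ w_star - y) ** 2) + a * np.sum(w_star ** 2)

# For this depth-0 model, L(a) varies smoothly with the hyperparameter a.
print([round(L(a), 4) for a in (0.01, 0.1, 1.0, 10.0)])
```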

It is standard to treat the first term in Eq. (1) as an energy. To formally identify the regularization term as an entropy, its coefficient must be proportional to the temperature:

aR(w)=\frac{T}{2\sigma^{2}}R(w), (3)

where σ² controls the fluctuation of w at zero temperature. We note that this identification is consistent with many previous works, where the term that encourages a lower model complexity is identified as an “entropy” Haussler et al., (1996); Vapnik, (2006); Benedek and Itai, (1991); Friston, (2009); Li and Sompolinsky, (2021). In this view, learning is a balancing process between the learning error and the model complexity. Intuitively, one expects phase transitions to happen when one term starts to dominate the other, just like thermodynamic phase transitions that take place when the entropy term starts to dominate the energy.

In this setting, the partition function is Z(a)=\int dw\exp[-\ell(w,a)/T]. We consider a special limit of the partition function, where both T and 2σ² are made to vanish with their ratio held fixed at T/2σ²=γ. In this limit, one can find the free energy with the saddle point approximation, which is exact in the zero-temperature limit:

F(a)=\lim_{T\to 0,\ \sigma^{2}\to 0,\ T/2\sigma^{2}=\gamma}-T\log\int dw\exp[-\ell(w,a)/T]=\min_{w}\ell(w,a). (4)

We thus treat L as the free energy.

Definition 1.

L(a) is said to have an nth-order phase transition in a at a=a^* if n is the smallest integer such that \frac{d^{n}}{da^{n}}L(a)|_{a=a^{*}} is discontinuous.

We formally define the order parameter and effective loss as follows.

Definition 2.

b=b(w)∈ℝ is said to be an order parameter of ℓ(w,a) if there exists a function ℓ̄ such that for all a, min_w ℓ̄(b(w),a)=L(a), where ℓ̄ is said to be an effective loss function of ℓ.

In other words, an order parameter is a one-dimensional quantity such that minimizing the effective loss ℓ̄ over it gives L(a). The existence of an order parameter suggests that the original problem ℓ(w,a) can effectively be reduced to a low-dimensional problem that is much easier to understand. Physical examples are the average magnetization in the Ising model and the average density of molecules in a water-to-vapor phase transition. A dictionary of the corresponding concepts between physics and deep learning is given in Table 1.

Our theory deals with deep linear nets, the primary minimal model for deep learning. It is well-established that the landscape of a deep linear net can be used to understand that of nonlinear networks Kawaguchi, (2016); Hardt and Ma, (2016); Laurent and Brecht, (2018). The most general type of deep linear net, with L_2 regularization and stochastic neurons, has the following loss:

\underbrace{\mathbb{E}_{x}\mathbb{E}_{\epsilon^{(1)},\epsilon^{(2)},...,\epsilon^{(D)}}\left(\sum_{i,i_{1},i_{2},...,i_{D}}^{d_{0},d_{0},d_{0},...d_{0}}U_{i_{D}}\epsilon^{(D)}_{i_{D}}...\epsilon_{i_{2}}^{(2)}W_{i_{2}i_{1}}^{(2)}\epsilon_{i_{1}}^{(1)}W_{i_{1}i}^{(1)}x_{i}-y\right)^{2}}_{L_{0}}+\underbrace{\gamma||U||_{2}^{2}+\sum_{i=1}^{D}\gamma||W^{(i)}||_{F}^{2}}_{L_{2}\ {\rm reg.}}, (5)

where x is the input data, y the label, U and W^{(i)} the model parameters, D the network depth, ε the noise in the hidden layers (e.g., dropout), d_0 the width of the model, and γ the weight decay strength. We build on the recent results established in Ziyin et al., (2022). Let b:=||U||/d_0. Ref. Ziyin et al., (2022) shows that all the global minima of Eq. (5) must take the form U=f(b) and W_i=f_i(b), where f and f_i are explicit functions of the hyperparameters. Ref. Ziyin et al., (2022) further shows that there are two regimes of learning: for some range of γ, the global minimum is uniquely given by b=0, and for some other range of γ, some b>0 gives the global minimum. When b=0, the model outputs a constant 0, and so this regime is called the “trivial regime,” while the regime where b=0 is not the global minimum is called the “feature learning regime.” In this work, we prove that the transition between these two regimes corresponds to a phase transition in the Ehrenfest sense (Definition 1), and therefore one can indeed refer to these two regimes as two different phases.
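For concreteness, the following sketch estimates the training loss of Eq. (5) by Monte-Carlo sampling of the multiplicative noise ε; the data, the width, the mean-one Gaussian noise model, and all variable names are illustrative assumptions, not the exact setup used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
d0, D, gamma, sigma = 4, 2, 0.1, 0.5          # width, depth, weight decay, noise scale
X = rng.normal(size=(500, d0))                # inputs x
y = X @ rng.normal(size=d0)                   # labels y
U = 0.1 * rng.normal(size=d0)                 # last layer U
Ws = [0.1 * rng.normal(size=(d0, d0)) for _ in range(D)]   # W^(1), ..., W^(D)

def loss_eq5(U, Ws, n_noise=200):
    """Monte-Carlo estimate of Eq. (5) with mean-one multiplicative noise eps^(i)."""
    mse = 0.0
    for _ in range(n_noise):
        h = X.T                                                # shape (d0, n_samples)
        for W in Ws:
            h = W @ h
            h = h * (1.0 + sigma * rng.normal(size=(d0, 1)))   # eps^(i), variance sigma^2
        mse += np.mean((U @ h - y) ** 2)
    reg = gamma * np.sum(U ** 2) + gamma * sum(np.sum(W ** 2) for W in Ws)
    return mse / n_noise + reg

print(loss_eq5(U, Ws))
```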

machine learning | statistical physics
training loss | free energy
prediction error | internal energy
regularization | negative entropy
learning process | symmetry breaking
norm of model (b) | order parameter
feature learning regime | ordered phase
trivial regime | disordered phase
noise required for learning | latent heat
Table 1: The correspondence between machine learning (left column) and statistical physics (right column), including the correspondence between the learning process and symmetry breaking.

No-phase-transition theorems. The first result we prove is that there is no phase transition in any hyperparameter (γ, E[xx^T], E[xy], E[y²]) for a simple linear regression problem. In our terminology, this corresponds to the case of D=0. The fact that there is no phase transition in any of these hyperparameters means that the model’s behavior is predictable as one tunes the hyperparameters. In the parlance of physics, a linear regressor operates within the linear-response regime.

Theorem 2 shows that a finite-depth net cannot have zeroth-order phase transitions. This theorem can be seen as a worst-case guarantee: the training loss needs to change continuously as one changes the hyperparameter. We also stress that this general theorem applies to standard nonlinear networks as well. Indeed, if we only consider the global minimum of the training loss, the training loss cannot jump. However, in practice, one can often observe jumps because the gradient-based algorithms can be trapped in local minima. The following theory offers a direct explanation for this phenomenon.

Figure 1: Effective landscape given in Eq. (6) for D=1 (left) and D=2 (right). For D=1, zero is either the global minimum or a local maximum. Note that the shape of the loss resembles that of the Landau free energy for the second-order phase transition. For D=2, the landscape becomes more complicated, featuring the emergence of local minima. In particular, zero is always a local minimum.

Phase Transitions in Deeper Networks. Theorem 3 shows that the quantity b is an order parameter describing any phase transition induced by the weight decay parameter in Eq. (5). Let b=||U||/d_u, A_0:=𝔼[xx^T], and let a_i be the i-th eigenvalue of A_0. The effective loss landscape is

\bar{\ell}(b,\gamma):=-\sum_{i}\frac{d_{0}^{2D}b^{2D}\mathbb{E}[x^{\prime}y]_{i}^{2}}{d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{i}b^{2D}+\gamma}+\mathbb{E}_{x}[y^{2}]+\gamma Dd_{0}^{2}b^{2}, (6)

where x' is a rotation of x. See Figure 1 for an illustration. The complicated landscape for D>1 implies that neural networks are sensitive to the initialization scheme and that entrapment in metastable states is common (see Supplementary Section A.1).
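The sketch below evaluates the effective landscape of Eq. (6) on a grid of b for D=1 and D=2, using illustrative values for d_0, σ², γ, the eigenvalues a_i, and 𝔼[x'y]; it is meant only to reproduce the qualitative shapes in Figure 1 (zero is a local maximum for D=1 at small γ, but a local minimum for D=2).

```python
import numpy as np

# Effective landscape, Eq. (6), with illustrative statistics.
d0, sigma2, gamma = 3, 0.5, 0.2
a = np.array([1.0, 0.5, 0.2])          # eigenvalues a_i of A_0
exy2 = np.array([0.30, 0.10, 0.05])    # squared components E[x'y]_i^2
Ey2 = 1.0

def ell_bar(b, D):
    num = d0 ** (2 * D) * b ** (2 * D) * exy2
    den = d0 ** D * (sigma2 + d0) ** D * a * b ** (2 * D) + gamma
    return -np.sum(num / den) + Ey2 + gamma * D * d0 ** 2 * b ** 2

bs = np.linspace(0.0, 0.4, 401)
for D in (1, 2):
    vals = np.array([ell_bar(b, D) for b in bs])
    # sign of ell_bar(0.01) - ell_bar(0): negative means zero is a local max
    print(f"D={D}: ell_bar(0.01)-ell_bar(0) = {ell_bar(0.01, D) - ell_bar(0.0, D):+.2e}, "
          f"global min at b = {bs[vals.argmin()]:.3f}")
```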

Theorem 4 shows that when D=1 in Eq. (5), there is a second-order phase transition precisely at

\gamma=||\mathbb{E}[xy]||. (7)

In machine learning language, γ is the regularization strength and ||𝔼[xy]|| is the signal. The phase transition occurs precisely when the regularization dominates the signal. In physics, γ and ||𝔼[xy]|| are proportional to the temperature and energy, respectively. The phase transition occurs exactly when the entropy dominates the energy. Also, the phase transition for a depth-1 linear net is independent of the number of parameters of the model. For D>1, the size of the model does play a role in influencing the phase transition; however, γ remains the dominant variable controlling it. This independence of the model size is an advantage of the proposed theory because our result is directly relevant for all model sizes, not just the infinitely large ones that previous works often assume.
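As a quick numerical check of Eq. (7), the sketch below scans γ and minimizes the effective loss of Eq. (6) with D=1 over a grid of b, again with illustrative statistics; the order parameter should vanish continuously as γ crosses ||𝔼[xy]||.

```python
import numpy as np

# Locate the D=1 transition of the effective loss, Eq. (6), numerically.
d0, sigma2, D = 3, 0.5, 1
a = np.array([1.0, 0.5, 0.2])          # eigenvalues of A_0
exy = np.array([0.5, 0.3, 0.1])        # E[x'y] in the eigenbasis of A_0
Ey2 = 1.0
bs = np.linspace(0.0, 1.0, 2001)

def b_star(gamma):
    vals = [-np.sum(d0 ** (2 * D) * b ** (2 * D) * exy ** 2 /
                    (d0 ** D * (sigma2 + d0) ** D * a * b ** (2 * D) + gamma))
            + Ey2 + gamma * D * d0 ** 2 * b ** 2 for b in bs]
    return bs[int(np.argmin(vals))]

for gamma in np.linspace(0.40, 0.80, 9):
    print(f"gamma = {gamma:.2f}   b* = {b_star(gamma):.3f}")
print("predicted critical point ||E[xy]|| =", round(np.linalg.norm(exy), 3))
```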

For D≥2, we show that there is a first-order phase transition between the two phases at some γ>0; however, an analytical expression for the critical point is not known. In physics, first-order phase transitions are accompanied by latent heat. Our theory implies that this heat is equivalent to the amount of random noise that we have to inject into the model parameters of a deep model to escape from a local minimum to the global minimum. We illustrate the phase transitions in Figure 2. We also experimentally demonstrate that the same phase transitions take place in deep nonlinear networks with the corresponding depths (Supplementary Section A.3). While infinite-depth networks are not used in practice, they are important from a theoretical point of view Sonoda and Murata, (2019) because they can be used to understand the very deep networks that often appear in deep learning practice. Our result shows that the limiting landscape has a zeroth-order phase transition at γ=0. In fact, zeroth-order phase transitions do not occur in physics; they are a unique feature of deep learning.

Relevance of symmetry breaking. The phase transitions we studied also involve symmetry breaking. This can be seen directly from the effective landscape in Eq. (6). The loss is unaltered as one flips the sign of b, and therefore the loss is symmetric in b. Figure 3 illustrates the effect and importance of symmetry breaking on the gradient descent dynamics. Additionally, this observation may also provide an alternative avenue for studying general symmetry-breaking dynamics because computation with neural networks is both accurate and efficient.

Mean-Field Analysis. The crux of our theory can be understood by applying a simplified “mean-field” analysis of the loss function in Eq. (5). Let each weight matrix be approximated by a scalar, U=b_{D+1} and W_i=b_i, ignore the stochasticity due to ε_i, and let x be one-dimensional. One then obtains a simplified mean-field loss:

\mathbb{E}_{x}\left[\left(c_{0}x\prod_{i=1}^{D+1}b_{i}-y\right)^{2}\right]+\gamma\sum_{i=1}^{D}c_{i}b_{i}^{2}, (8)

where the c_i are constants. The first term can be interpreted as a special type of (D+1)-body interaction. We now perform a second mean-field approximation, where all the b_i take the same value b:

\ell\propto c_{0}^{\prime}\mathbb{E}[x^{2}]b^{2D+2}-c_{1}^{\prime}\mathbb{E}[xy]b^{D+1}+\gamma c_{2}^{\prime}b^{2}+const. (9)

Here, c_0', c_1', and c_2' are structural constants that depend only on the model (depth, width, etc.). The first and the third terms increase monotonically in b, encouraging a smaller b. The second term decreases monotonically in b^{D+1}𝔼[xy], encouraging a positive correlation between b and the feature 𝔼[xy]. The highest- and lowest-order terms regularize the model, while the intermediate term characterizes learning. For D=0, the loss is quadratic and has no transition. For D=1, the loss is identical to the Landau free energy, and a phase transition occurs when the second-order term flips sign: c_2'γ=c_1'𝔼[xy]. For D>1, the origin is always a local minimum, dominated by the quadratic term; this leads to a first-order phase transition. When D→∞, the leading terms become discontinuous in b, and one obtains a zeroth-order phase transition. This simple analysis highlights one important distinction between physics and machine learning: in physics, the most common type of interaction is a two-body interaction, whereas, in machine learning, the common interaction is many-body and tends to infinite-body as D increases.
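To make this argument concrete, the following sketch minimizes the mean-field loss of Eq. (9) on a grid of b with all structural constants and 𝔼[x²], 𝔼[xy] set to 1 (an illustrative choice): for D=1 the order parameter vanishes continuously at the critical γ, while for D=2 it drops to zero discontinuously.

```python
import numpy as np

# Mean-field loss of Eq. (9) with c0' = c1' = c2' = E[x^2] = E[xy] = 1:
#   l(b) = b^(2D+2) - b^(D+1) + gamma * b^2
bs = np.linspace(0.0, 1.5, 3001)

def b_star(gamma, D):
    vals = bs ** (2 * D + 2) - bs ** (D + 1) + gamma * bs ** 2
    return bs[int(np.argmin(vals))]

for D in (1, 2):
    trace = [round(b_star(g, D), 3) for g in np.linspace(0.1, 1.2, 12)]
    print(f"D={D}: b*(gamma) =", trace)   # continuous onset for D=1, a jump for D=2
```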

One implication is that L_2 regularization may be too strong for deep learning because it creates a trivial phase. Our result also suggests a way to avoid the trivial phase: instead of regularizing by γ||w||_2^2, one might consider γ||w||_2^{D+2}, which is the lowest-order regularization that does not lead to a trivial phase. The effectiveness of this suggested method is confirmed in Supplementary Section A.2.

Figure 2: Phase transitions in a linear net. In agreement with the theory, a depth-0 net has no phase transition, a depth-1 net has a second-order phase transition at approximately γ=0.45, close to the theoretical value of ||𝔼[xy]||, and a depth-2 net has a first-order phase transition at roughly γ=0.15. The qualitative differences between networks of different depths are clearly observed in the data. Left: Training loss of a network with 0 (linear regression), 1, and 2 hidden layers. Middle: Magnitude of the regularization term at convergence. As discussed in the main text, this term corresponds to the entropy term TS. We see that for D>1, there is a jump (discontinuity) in TS from a finite value to 0. This jump corresponds to the latent heat of the first-order phase transition. Right: Order parameter as a function of γ. The inset shows that b scales precisely as t^{0.5} with t:=-(γ-γ^*) in the vicinity of the phase transition, in agreement with the standard Landau theory.
Figure 3: Dynamics of training, where the model is initialized at the origin and learning proceeds under gradient descent with injected Gaussian noise. Before training, the models lie roughly in the trivial phase because the model is initialized close to the origin and has not learned anything yet. However, in the feature learning phase, any global minimum must choose a specific b≠0, and so the actual solution does not feature the symmetry in b: a symmetry breaking in b must take place for learning to happen. The recent work of Ref. Tanaka and Kunin, (2021) showed that the symmetries in the loss could become difficult obstacles in the training of a neural network, and our result complements this view by identifying a precise deep-learning-relevant symmetry to be broken. Left: the training loss L; except for D=0, where no symmetry breaking occurs, the dynamics exhibits a wide plateau emerging at initialization that hinders learning. Middle: a zoom-in of the left panel when L is close to the initialized value (≈0.2). For D=1, the loss decreases monotonically. For D>1, in sharp contrast, the loss first increases slowly and then decreases precipitously, a signature of escaping from a local minimum: the height of the peak may be interpreted as the latent heat of the phase transition since this is the “energy barrier” the system has to overcome in order to undergo the first-order phase transition. Right: time evolution of the order parameter b: one sees that D=0 shows a fast increase of b from the beginning. For D≥1, the initial stage is dominated by slow diffusion, where b increases as the square root of time. The diffusion phase only ends after a long period, before a fast learning period begins. One also notices that in the fast learning period, the slope of b versus time is different for different depths, with deeper models considerably faster than shallower ones. The inset shows the corrected order parameter \tilde{b}:=b-D\sqrt{\tau}, where τ is the training step and D is the diffusion constant of the noisy gradient descent. One sees that \tilde{b} stays zero over an extended period of time for D>0.

Posterior Collapse in Bayesian Deep Learning. Our results also identify an origin of the well-known posterior collapse problem in Bayesian deep learning. Posterior collapse refers to the learning failure in which the learned posterior distribution coincides with the prior, so that no learning has happened even after training Dai and Wipf, (2019); Alemi et al., (2018); Lucas et al., (2019). Our results offer a direct explanation for this problem. In the Bayesian interpretation, the training loss in Eq. (5) is the exact negative log posterior, and the trivial phase exactly corresponds to posterior collapse: the global minimum of the loss is identical to the global maximum of the prior term. Our results thus imply that (1) posterior collapse is a problem unique to deep learning because it does not occur in shallow models, and (2) posterior collapse happens as a direct consequence of the competition between the prior and the likelihood. This means that it is not a good idea to assume a Gaussian prior for deep neural network models. The suggested fix also leads to a clean and Bayesian-principled solution to the posterior collapse problem by using a prior \log p(w)\propto-||w||_{2}^{D+2}.

Discussion

The striking similarity between phase transitions in neural networks and statistical-physics phase transitions lends a great impetus to a more thorough investigation of deep learning through the lens of thermodynamics and statistical physics. We now outline a few major future steps:

  1. Instead of classification by analyticity, can we classify neural networks by symmetry and topological invariants?

  2. What are other possible phases for a nonlinear network? Does a new phase emerge?

  3. Can we find any analogy of other thermodynamic quantities such as volume and pressure? More broadly, can we establish thermodynamics for deep learning?

  4. Can we utilize the latent heat picture to devise better algorithms for escaping local minima in deep learning?

This work shows that the Ehrenfest classification of phase transitions aligns precisely with the number of layers in deep neural networks. We believe that the statistical-physics approach to deep learning will bring about fruitful developments in both fields of statistical physics and deep learning.

References

  • Alemi et al., [2018] Alemi, A., Poole, B., Fischer, I., Dillon, J., Saurous, R. A., and Murphy, K. (2018). Fixing a broken elbo. In International Conference on Machine Learning, pages 159–168. PMLR.
  • Bahri et al., [2020] Bahri, Y., Kadmon, J., Pennington, J., Schoenholz, S. S., Sohl-Dickstein, J., and Ganguli, S. (2020). Statistical mechanics of deep learning. Annual Review of Condensed Matter Physics, 11:501–528.
  • Belkin et al., [2020] Belkin, M., Hsu, D., and Xu, J. (2020). Two models of double descent for weak features. SIAM Journal on Mathematics of Data Science, 2(4):1167–1180.
  • Benedek and Itai, [1991] Benedek, G. M. and Itai, A. (1991). Learnability with respect to fixed distributions. Theoretical Computer Science, 86(2):377–389.
  • Dai and Wipf, [2019] Dai, B. and Wipf, D. (2019). Diagnosing and enhancing vae models. arXiv preprint arXiv:1903.05789.
  • Friston, [2009] Friston, K. (2009). The free-energy principle: a rough guide to the brain? Trends in cognitive sciences, 13(7):293–301.
  • Hardt and Ma, [2016] Hardt, M. and Ma, T. (2016). Identity matters in deep learning. arXiv preprint arXiv:1611.04231.
  • Hastie et al., [2019] Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2019). Surprises in high-dimensional ridgeless least squares interpolation. arXiv preprint arXiv:1903.08560.
  • Haussler et al., [1996] Haussler, D., Kearns, M., Seung, H. S., and Tishby, N. (1996). Rigorous learning curve bounds from statistical mechanics. Machine Learning, 25(2):195–236.
  • Hopfield, [1982] Hopfield, J. J. (1982). Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558.
  • Kawaguchi, [2016] Kawaguchi, K. (2016). Deep learning without poor local minima. Advances in Neural Information Processing Systems, 29:586–594.
  • Krogh and Hertz, [1992a] Krogh, A. and Hertz, J. A. (1992a). Generalization in a linear perceptron in the presence of noise. Journal of Physics A: Mathematical and General, 25(5):1135.
  • Krogh and Hertz, [1992b] Krogh, A. and Hertz, J. A. (1992b). A simple weight decay can improve generalization. In Advances in neural information processing systems, pages 950–957.
  • Laurent and Brecht, [2018] Laurent, T. and Brecht, J. (2018). Deep linear networks with arbitrary loss: All local minima are global. In International conference on machine learning, pages 2902–2907. PMLR.
  • Li and Sompolinsky, [2021] Li, Q. and Sompolinsky, H. (2021). Statistical mechanics of deep linear neural networks: The backpropagating kernel renormalization. Physical Review X, 11(3):031059.
  • Liao et al., [2020] Liao, Z., Couillet, R., and Mahoney, M. W. (2020). A random matrix analysis of random fourier features: beyond the gaussian kernel, a precise phase transition, and the corresponding double descent. Advances in Neural Information Processing Systems, 33:13939–13950.
  • Lucas et al., [2019] Lucas, J., Tucker, G., Grosse, R., and Norouzi, M. (2019). Don’t blame the elbo! a linear vae perspective on posterior collapse.
  • Martin and Mahoney, [2017] Martin, C. H. and Mahoney, M. W. (2017). Rethinking generalization requires revisiting old ideas: statistical mechanics approaches and complex learning behavior. arXiv preprint arXiv:1710.09553.
  • Mianjy and Arora, [2019] Mianjy, P. and Arora, R. (2019). On dropout and nuclear norm regularization. In International Conference on Machine Learning, pages 4575–4584. PMLR.
  • Sonoda and Murata, [2019] Sonoda, S. and Murata, N. (2019). Transport analysis of infinitely deep neural network. The Journal of Machine Learning Research, 20(1):31–82.
  • Tanaka and Kunin, [2021] Tanaka, H. and Kunin, D. (2021). Noether’s learning dynamics: Role of symmetry breaking in neural networks. Advances in Neural Information Processing Systems, 34.
  • Vapnik, [2006] Vapnik, V. (2006). Estimation of dependences based on empirical data. Springer Science & Business Media.
  • Watkin et al., [1993] Watkin, T. L., Rau, A., and Biehl, M. (1993). The statistical mechanics of learning a rule. Reviews of Modern Physics, 65(2):499.
  • Ziyin et al., [2022] Ziyin, L., Li, B., and Meng, X. (2022). Exact solutions of a deep linear network. arXiv preprint arXiv:2202.04777.

Appendix A Additional Experiments

A.1 Sensitivity to the Initial Condition

Our result suggests that the learning of a deeper network is quite sensitive to the initialization scheme we use. In particular, for D>1, some initialization schemes converge to the trivial solution more easily, while others converge to the nontrivial solution more easily. Figure 4 plots the converged loss of a D=2 model for two types of initialization: (a) larger initialization, where the parameters are initialized around zero with standard deviation s=0.3, and (b) small initialization with s=0.01. The value of s thus sets the expected norm of the model at initialization: a small s means that the model is initialized closer to the trivial phase, and a larger s means that it is initialized closer to the learning phase. We see that, across a wide range of γ, one of the initialization schemes gets stuck in a local minimum and does not converge to the global minimum. In light of the latent heat picture, the reason for the sensitivity to initial states is clear: one needs to inject additional energy into the system to leave the metastable state; otherwise, the system may remain stuck for a very long time. Existing initialization methods are predominantly data-independent. However, our result (also see [24]) suggests that the size of the trivial minimum is data-dependent, and our result thus highlights the importance of designing data-dependent initialization methods in deep learning.

A.2 Removing the Trivial Phase

We also explore our suggested fix to the trivial learning problem. Here, instead of regularizing the model by γ||w||_2^2, we regularize it by γ||w||_2^{D+2}. The training loss and the model norm b are plotted in Figure 5. We find that the trivial phase now completely disappears even if we go to very high γ. However, we note that this fix only removes the local minimum at zero; zero remains a saddle point from which it takes the system a long time to escape.
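A minimal way to see why this fix works is to repeat the mean-field computation of Eq. (9) with the quadratic penalty replaced by γ b^{D+2}; the constants below are again set to 1 purely for illustration.

```python
import numpy as np

# Mean-field comparison of the two penalties for D = 2:
#   b^2 penalty:      l(b) = b^6 - b^3 + gamma * b^2
#   b^(D+2) penalty:  l(b) = b^6 - b^3 + gamma * b^4
D = 2
bs = np.linspace(0.0, 1.5, 3001)

def b_star(gamma, power):
    vals = bs ** (2 * D + 2) - bs ** (D + 1) + gamma * bs ** power
    return bs[int(np.argmin(vals))]

for gamma in (0.5, 2.0, 10.0):
    print(f"gamma={gamma}:  b* (b^2 penalty) = {b_star(gamma, 2):.3f},  "
          f"b* (b^(D+2) penalty) = {b_star(gamma, D + 2):.3f}")
# With the b^(D+2) penalty the minimizer stays nonzero, i.e., the trivial phase is gone.
```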

Figure 4: Sensitivity of the obtained solution to the initialization of the model. We initialize the model around zero with standard deviation s. The experiment shows that a larger initialization variance (s=0.3) affords a preference for the nontrivial solution over the trivial one, while a smaller initialization leads to the opposite preference.
Figure 5: The training loss L(γ) (left) and the model norm b (right) when we train with a regularization term of the form γ||w||^{D+2}, which is a theoretically justified fix to the trivial learning problem. We see that the trivial phase disappears under this regularization.
Figure 6: Phase transition of a fully connected tanh network. The top row shows the case of D=1, exhibiting a second-order phase transition: the training loss L(γ) (left), its first derivative (middle), and its second derivative (right). The bottom row shows the case of D=2, exhibiting a first-order phase transition: the training loss L(γ) (left) and its first derivative L'(γ) (middle). For D=2, we initialize the model with three initializations at different scales and use the minimum of the respective loss values as an empirical estimate of the actual global minimum.

A.3 Nonlinear Networks

We expect our theory to also apply to deep nonlinear networks that can be locally approximated by a linear net at the origin, e.g., a network with tanh activations. As shown in Figure 6, the data show that a tanh net also features a second-order phase transition for D=1 and a first-order phase transition for D=2.
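A minimal PyTorch sketch of this experiment is given below; the architecture, data, optimizer, and the explicit L_2 penalty are illustrative stand-ins for the actual setup, which is described only qualitatively in the text.

```python
import torch

torch.manual_seed(0)
X = torch.randn(256, 4)
y = torch.tanh(X @ torch.randn(4, 1))

def final_loss(gamma, depth, width=16, steps=3000):
    """Train a small tanh net and return the converged regularized training loss L(gamma)."""
    layers, d_in = [], 4
    for _ in range(depth):                      # `depth` = number of hidden layers (D)
        layers += [torch.nn.Linear(d_in, width), torch.nn.Tanh()]
        d_in = width
    layers.append(torch.nn.Linear(d_in, 1))
    model = torch.nn.Sequential(*layers)
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(steps):
        opt.zero_grad()
        mse = ((model(X) - y) ** 2).mean()
        reg = gamma * sum((p ** 2).sum() for p in model.parameters())
        (mse + reg).backward()
        opt.step()
    return (mse + reg).item()

for gamma in (0.0, 0.05, 0.1, 0.2, 0.4):
    print(f"gamma={gamma}: L={final_loss(gamma, depth=2):.4f}")
```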

One notable exception to which our theory may not apply is networks with the ReLU activation, because these networks are not differentiable at zero (i.e., in the trivial phase). However, there are smoother (and empirically better) alternatives to ReLU, such as the swish activation function, to which the present theory should also be relevant.

Appendix B Main Results

B.1 Theorem Statements

For a simple ridge linear regression, the minimization objective is

\ell(W)=\mathbb{E}_{x}\left(\sum_{i}W_{i}x_{i}-y\right)^{2}+\gamma||W||^{2}. (10)
Theorem 1.

There is no phase transition in any hyperparameter (γ, A_0, E[xy], E[y²]) in a simple ridge linear regression for any γ∈(0,∞).

The following result shows that for a finite depth, L(γ) must be continuous in γ.

Theorem 2.

For any finite D>0 and γ∈[0,∞), L(γ) has no zeroth-order phase transition with respect to γ.

Note that this theorem allows the weight decay parameter to be 0, and so our results also extend to the case when there is no weight decay.

The following theorem shows that there exists an order parameter describing any phase transition induced by the weight decay parameter in Eq. (5).

Theorem 3.

Let b=||U||/d_u, and let

\bar{\ell}(b,\gamma):=-\sum_{i}\frac{d_{0}^{2D}b^{2D}\mathbb{E}[x^{\prime}y]_{i}^{2}}{d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{i}b^{2D}+\gamma}+\mathbb{E}_{x}[y^{2}]+\gamma Dd_{0}^{2}b^{2}. (11)

Then, b is an order parameter of Eq. (5) for the effective loss ℓ̄.

Here, the norm of the last layer serves as the order parameter. The meaning of this choice should be clear: the norm of the last layer is zero if and only if all weights of the last layer are zero, in which case the model is trivial. The model can only learn something when the order parameter is nonzero. Additionally, we note that the choice of the order parameter is not unique; there are other valid choices (e.g., the norm of any other layer, or the sum of the norms of all layers).

The following theorem shows that when D=1 in Eq. (5), there is a second-order phase transition with respect to γ.

Theorem 4.

Equation (5) has a second-order phase transition between the trivial and feature learning phases at

\gamma=||\mathbb{E}[xy]||. (12)

(Footnote: when the two layers have different regularization strengths γ_u and γ_w, one can show that the phase transition occurs precisely at \sqrt{\gamma_{u}\gamma_{w}}=||\mathbb{E}[xy]||.)

Now, we show that for D≥2, there is a first-order phase transition.

Theorem 5.

Let D≥2. There exists a γ^*>0 such that the loss function in Eq. (5) has a first-order phase transition between the trivial and feature learning phases at γ=γ^*.

Theorem 6.

Let L^{(D)}(γ) denote the loss function for a fixed depth D as a function of γ. Then, for γ∈[0,∞) and some constant r, in the limit D→∞,

L^{(D)}(\gamma)\to\begin{cases}r&\text{if $\gamma=0$};\\ \mathbb{E}[y^{2}]&\text{otherwise}.\end{cases} (13)

The constant r is, in general, not equal to 𝔼[y²]. For example, in the limit σ→0, r converges to the loss value of a simple linear regression, which is not equal to 𝔼[y²] as long as 𝔼[xy]≠0.

B.2 Proof of Theorem 1

Proof. The global minimum of Eq. (10) is

W_{*}=(A_{0}+\gamma I)^{-1}E[xy]. (14)

The loss of the global minimum is thus

L =\mathbb{E}_{x}\left(\sum_{i}W_{i}x_{i}-y\right)^{2}+\gamma||W||^{2} (15)
=W^{T}A_{0}W-2W^{T}\mathbb{E}[xy]+\mathbb{E}[y^{2}]+\gamma||W||^{2} (16)
=\mathbb{E}[xy]^{T}\frac{A_{0}}{(A_{0}+\gamma I)^{2}}\mathbb{E}[xy]-2\mathbb{E}[xy]^{T}\frac{1}{A_{0}+\gamma I}\mathbb{E}[xy]+\mathbb{E}[y^{2}]+\gamma\mathbb{E}[xy]^{T}\frac{1}{(A_{0}+\gamma I)^{2}}\mathbb{E}[xy] (17)
=-\mathbb{E}[xy]^{T}(A_{0}+\gamma I)^{-1}\mathbb{E}[xy]+\mathbb{E}[y^{2}], (18)

which is infinitely differentiable for any γ∈(0,∞) (note that A_0 is always positive semi-definite by definition). \square
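The closed form in Eq. (18) is easy to verify numerically; the sketch below compares it against a direct evaluation of the ridge loss at the minimizer of Eq. (14) on synthetic data (all sizes and seeds are arbitrary).

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 6))
y = X @ rng.normal(size=6) + 0.2 * rng.normal(size=500)
A0, Exy, Ey2 = X.T @ X / len(X), X.T @ y / len(X), np.mean(y ** 2)

def L_closed(gamma):                      # Eq. (18)
    return Ey2 - Exy @ np.linalg.solve(A0 + gamma * np.eye(6), Exy)

def L_direct(gamma):                      # evaluate Eq. (10) at the minimizer of Eq. (14)
    w = np.linalg.solve(A0 + gamma * np.eye(6), Exy)
    return np.mean((X @ w - y) ** 2) + gamma * np.sum(w ** 2)

for gamma in (0.01, 0.1, 1.0):
    assert abs(L_closed(gamma) - L_direct(gamma)) < 1e-10
print("Eq. (18) matches the direct evaluation; L(gamma) is smooth in gamma.")
```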

B.3 Proof of Theorem 2

Proof. For any fixed and bounded w, ℓ(w,γ) is continuous in γ. Moreover, ℓ(w,γ) is a monotonically increasing function of γ. This implies that L(γ) is also an increasing function of γ (but may not be strictly increasing).

We now prove by contradiction. We first show that L(γ) is left-continuous. Suppose that for some D, L(γ) is not left-continuous in γ at some γ^*. By definition, we have

L(\gamma^{*}-\epsilon)=\min_{w}\ell(w,\gamma^{*}-\epsilon):=\ell(w^{\prime},\gamma^{*}-\epsilon), (19)

where w' is one of the (potentially many) global minimizers of ℓ(w,γ^*-ε). Since L(γ) is not left-continuous by assumption, there exists δ>0 such that for any ε>0,

L(\gamma^{*}-\epsilon)<L(\gamma^{*})-\delta, (20)

which implies that

\ell(w^{\prime},\gamma^{*}-\epsilon)=L(\gamma^{*}-\epsilon)<L(\gamma^{*})-\delta\leq\ell(w^{\prime},\gamma^{*})-\delta. (21)

Namely, the left-discontinuity implies that for all ε>0,

\ell(w^{\prime},\gamma^{*}-\epsilon)\leq\ell(w^{\prime},\gamma^{*})-\delta. (22)

However, by the definition of ℓ(w,γ), we have

\ell(w,\gamma)-\ell(w,\gamma-\epsilon)=\epsilon||w||^{2}. (23)

Thus, by choosing ε<δ/||w'||², the relation in (21) is violated. Thus, L(γ) must be left-continuous.

In a similar manner, we can prove that L is right-continuous. Suppose that for some D, L(γ) is not right-continuous in γ at some γ^*. By definition, we have

L(\gamma^{*}+\epsilon)=\min_{w}\ell(w,\gamma^{*}+\epsilon):=\ell(w^{\prime},\gamma^{*}+\epsilon), (24)

where w' is one of the (potentially many) global minimizers of ℓ(w,γ^*+ε). Since L(γ) is not right-continuous by assumption, there exists δ>0 such that for any ε>0,

L(\gamma^{*}+\epsilon)>L(\gamma^{*})+\delta, (25)

which implies that

\ell(w^{\prime},\gamma^{*}+\epsilon)=L(\gamma^{*}+\epsilon)>L(\gamma^{*})+\delta\geq\ell(w^{\prime},\gamma^{*})+\delta. (26)

Namely, the right-discontinuity implies that for all ε>0,

\ell(w^{\prime},\gamma^{*}+\epsilon)\geq\ell(w^{\prime},\gamma^{*})+\delta. (27)

However, by the definition of ℓ(w,γ), we have

\ell(w,\gamma+\epsilon)-\ell(w,\gamma)=\epsilon||w||^{2}. (28)

Thus, by choosing ε<δ/||w'||², the relation in (26) is violated. Thus, L(γ) must be right-continuous.

Therefore, L(γ) is continuous for all γ>0. By definition, this means that there is no zeroth-order phase transition in γ for L. Additionally, note that the above proof does not require γ≠0, and so we have also shown that L(γ) is right-continuous at γ=0. \square

B.4 Proof of Theorem 3

Proof. By Theorem 3 of Ref. [24], any global minimum of Eq. (5) is given by the following set of equations for some b≥0:

\begin{cases}U=\sqrt{d_{0}}b\mathbf{r}_{D};\\ W^{(i)}=b\mathbf{r}_{i}\mathbf{r}_{i-1}^{T};\\ W^{(1)}=\mathbf{r}_{1}\mathbb{E}[xy]^{T}d_{0}^{D-\frac{1}{2}}b^{D}\left[d_{0}^{D}(\sigma^{2}+d_{0})^{D}b^{2D}A_{0}+\gamma\right]^{-1},\end{cases} (29)

where \mathbf{r}_{i}=(\pm 1,...,\pm 1) is an arbitrary vertex of a d_i-dimensional hypercube for all i. Therefore, the global minimum must lie on a one-dimensional space indexed by b∈[0,∞). Let f(x) specify the model as

f(x):=\sum_{i,i_{1},i_{2},...,i_{D}}^{d,d_{1},d_{2},...d_{D}}U_{i_{D}}\epsilon^{(D)}_{i_{D}}...\epsilon_{i_{2}}^{(2)}W_{i_{2}i_{1}}^{(2)}\epsilon_{i_{1}}^{(1)}W_{i_{1}i}^{(1)}x, (30)

and let η denote the set of all random noises ε_i.

Substituting Eq. (29) in Eq. (5), one finds that within this subspace, the loss function can be written as

\ell(w,\gamma) =\mathbb{E}_{x}\mathbb{E}_{\eta}(f(x)-y)^{2}+L_{2}\ {\rm reg.} (31)
=\mathbb{E}_{x,\eta}[f(x)^{2}]-2\mathbb{E}_{x,\eta}[yf(x)]+\mathbb{E}_{x}[y^{2}]+L_{2}\ {\rm reg.} (32)
=\sum_{i}\frac{d_{0}^{3D}(\sigma^{2}+d_{0})^{D}b^{4D}a_{i}\mathbb{E}[x^{\prime}y]_{i}^{2}}{[d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{i}b^{2D}+\gamma]^{2}}-2\sum_{i}\frac{d_{0}^{2D}b^{2D}\mathbb{E}[x^{\prime}y]^{2}_{i}}{d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{i}b^{2D}+\gamma}+\mathbb{E}_{x}[y^{2}]+L_{2}\ {\rm reg.}, (33)

where the L_2 reg. term is

L_{2}\ {\rm reg.}=\gamma Dd_{0}^{2}b^{2}+\gamma\sum_{i}\frac{d_{0}^{2D}b^{2D}||\mathbb{E}[x^{\prime}y]_{i}||^{2}}{[d_{0}^{D}(\sigma^{2}+d_{0})^{D}b^{2D}a_{i}+\gamma]^{2}}. (34)

Combining terms, we can simplify the expression for the loss function to be

-\sum_{i}\frac{d_{0}^{2D}b^{2D}\mathbb{E}[x^{\prime}y]_{i}^{2}}{[d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{i}b^{2D}+\gamma]}+\mathbb{E}_{x}[y^{2}]+\gamma Dd_{0}^{2}b^{2}. (35)

We can now define the effective loss by

\bar{\ell}(b,\gamma):=-\sum_{i}\frac{d_{0}^{2D}b^{2D}\mathbb{E}[x^{\prime}y]_{i}^{2}}{[d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{i}b^{2D}+\gamma]}+\mathbb{E}_{x}[y^{2}]+\gamma Dd_{0}^{2}b^{2}. (36)

Then, the above argument shows that, for all γ,

\min_{b}\bar{\ell}(b,\gamma)=\min_{w}\ell(w,\gamma). (37)

By Definition 2, b is an order parameter of ℓ with respect to the effective loss ℓ̄(b,γ). This completes the proof. \square

B.5 Two Useful Lemmas

Before continuing the proofs, we first prove two lemmas that will simplify the following proofs significantly.

Lemma 1.

If L(γ) is differentiable, then for at least one of the global minima b_*,

\frac{d}{d\gamma}L(\gamma)=\sum_{i}\frac{d_{0}^{2D}b_{*}^{2D}\mathbb{E}[x^{\prime}y]_{i}^{2}}{[d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{i}b_{*}^{2D}+\gamma]^{2}}+Dd_{0}^{2}b_{*}^{2}\geq 0. (38)

Proof. Because L is differentiable in γ, one can find the derivative for at least one of the global minima b^*:

\frac{d}{d\gamma}L(\gamma) =\frac{d}{d\gamma}\bar{\ell}(b^{*}(\gamma),\gamma) (39)
=\frac{\partial}{\partial b^{*}}\bar{\ell}(b^{*},\gamma)\frac{\partial b^{*}}{\partial\gamma}+\frac{\partial}{\partial\gamma}\bar{\ell}(b^{*},\gamma) (40)
=\frac{\partial}{\partial\gamma}\bar{\ell}(b^{*},\gamma) (41)
=\sum_{i}\frac{d_{0}^{2D}b_{*}^{2D}\mathbb{E}[x^{\prime}y]_{i}^{2}}{[d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{i}b_{*}^{2D}+\gamma]^{2}}+Dd_{0}^{2}b_{*}^{2}\geq 0, (42)

where we have used the optimality condition \frac{\partial}{\partial b^{*}}\bar{\ell}(b^{*}(\gamma),\gamma)=0 in the second equality. \square

B.6 Proof of Theorem 4

Proof. By Definition 1, it suffices to prove the existence of the phase transition on the effective loss. For D=1, the effective loss is

\bar{\ell}(b,\gamma)=-d_{1}b^{2}E[xy]^{T}[b^{2}(\sigma^{2}+d_{1})A+\gamma I]^{-1}E[xy]+E[y^{2}]+\gamma d_{1}b^{2}. (43)

By Theorem 1 of Ref. [24], the phase transition, if it exists, must occur precisely at γ=||𝔼[xy]||. To prove that there is a second-order phase transition at γ=||𝔼[xy]||, we must check both the first and the second derivative of L(γ).

When γ→||𝔼[xy]|| from the right, all derivatives of L(γ) are zero because the loss is identically equal to 𝔼[y²]. We now consider the derivative of L when γ→||𝔼[xy]|| from the left.

We first need to find the minimizer of Eq. (43). Because Eq. (43) is differentiable, its derivative in b must be equal to 0 at the global minimum:

-2\gamma d_{1}b\mathbb{E}[xy]^{T}[b^{2}(\sigma^{2}+d_{1})A+\gamma I]^{-2}\mathbb{E}[xy]+2\gamma d_{1}b=0. (44)

Finding the minimizer b is thus equivalent to finding the real roots of a high-order polynomial in b. When γ≥||𝔼[xy]||, the solution is unique [24]:

b^{2}_{0}=0, (45)

where we have labeled the solution with the subscript 0 to emphasize that this solution is also the zeroth-order term of the solution in a perturbatively small neighborhood of γ=||𝔼[xy]||. From this point on, we define a shifted regularization strength Δ:=γ-||𝔼[xy]||. When Δ<0, the condition (44) simplifies to

\mathbb{E}[xy]^{T}[b^{2}(\sigma^{2}+d_{1})A+\gamma I]^{-2}\mathbb{E}[xy]=1. (46)

Because the polynomial is not singular in Δ, one can Taylor expand the (squared) solution b² in Δ:

b(\gamma)^{2}=\beta_{0}+\beta_{1}\Delta+O(\Delta^{2}). (47)

We first substitute (47) into (44) to find (note that, alternatively, β_0=0 is implied by the no-zeroth-order-transition theorem)

\beta_{0}=0. (48)

One can then again substitute Eq. (47) into Eq. (44) to find β_1. To the first order in b², Eq. (44) reads

\frac{1}{\gamma^{2}}||\mathbb{E}[xy]||^{2}-2b^{2}\frac{(\sigma^{2}+d_{1})}{\gamma^{3}}||\mathbb{E}[xy]||_{A_{0}}^{2}=1 (49)
\Longleftrightarrow-2\beta_{1}\Delta\frac{(\sigma^{2}+d_{1})}{\gamma^{3}}||\mathbb{E}[xy]||_{A_{0}}^{2}=2\frac{\Delta}{||\mathbb{E}[xy]||} (50)
\Longleftrightarrow\beta_{1}=-\frac{1}{(\sigma^{2}+d_{1})}\frac{||\mathbb{E}[xy]||^{2}}{||\mathbb{E}[xy]||^{2}_{A_{0}}}. (51)

Substituting this first-order solution into Lemma 1, we obtain

\frac{d}{d\gamma}L(\gamma)|_{\gamma=||E[xy]||_{-}}\sim b_{*}^{2}=0=\frac{d}{d\gamma}L(\gamma)|_{\gamma=||E[xy]||_{+}}. (52)

Thus, the first-order derivative of L(γ) is continuous at the phase transition point.

We now find the second-order derivative of L(γ). To achieve this, we also need the second-order term of b² in γ. We expand b² as

b(\gamma)^{2}=0+\beta_{1}\Delta+\beta_{2}\Delta^{2}+O(\Delta^{3}). (53)

To the second order in b², (44) reads

\frac{1}{\gamma^{2}}||\mathbb{E}[xy]||^{2}-2b^{2}\frac{(\sigma^{2}+d_{1})}{\gamma^{3}}||\mathbb{E}[xy]||_{A_{0}}^{2}+3b^{4}\frac{(\sigma^{2}+d_{1})^{2}}{\gamma^{4}}||\mathbb{E}[xy]||_{A_{0}^{2}}^{2}=1 (54)
\Longleftrightarrow{\gamma^{2}}||\mathbb{E}[xy]||^{2}-2b^{2}{(\sigma^{2}+d_{1})}{\gamma}||\mathbb{E}[xy]||_{A_{0}}^{2}+3b^{4}{(\sigma^{2}+d_{1})^{2}}||\mathbb{E}[xy]||_{A_{0}^{2}}^{2}=\gamma^{4} (55)
\Longleftrightarrow{\Delta^{2}}E_{0}^{2}-2\beta_{2}\Delta^{2}{(\sigma^{2}+d_{1})}{E_{0}}E_{1}^{2}-2\beta_{1}\Delta^{2}{(\sigma^{2}+d_{1})}E_{1}^{2}+3\beta_{1}^{2}\Delta^{2}{(\sigma^{2}+d_{1})^{2}}E_{2}^{2}=6E_{0}^{2}\Delta^{2} (56)
\Longleftrightarrow\beta_{2}=\frac{3\beta_{1}^{2}(\sigma^{2}+d_{1})^{2}E_{2}^{2}-5E_{0}^{2}-2\beta_{1}(\sigma^{2}+d_{1})E_{1}^{2}}{2(\sigma^{2}+d_{1})E_{0}E_{1}^{2}}, (57)

where, from the third line on, we have used the shorthand E_0:=||𝔼[xy]||, E_1:=||𝔼[xy]||_{A_0}, and E_2:=||𝔼[xy]||_{A_0^2}. Substituting in β_1, one finds that

\beta_{2}=\frac{3E_{0}(E_{2}^{2}-E_{1}^{2})}{2(\sigma^{2}+d_{1})E_{1}^{4}}. (58)

This allows us to find the second derivative of L(γ). Substituting β_1 and β_2 into Eq. (43) and expanding to the second order in Δ, we obtain

L(\gamma) =-d_{1}b^{2}E[xy]^{T}[b^{2}(\sigma^{2}+d_{1})A+\gamma I]^{-1}E[xy]+E[y^{2}]+\gamma d_{1}b^{2} (59)
=-d_{1}(\beta_{1}\Delta+\beta_{2}\Delta^{2})\mathbb{E}[xy]^{T}[(\beta_{1}\Delta+\beta_{2}\Delta^{2})(\sigma^{2}+d_{1})A_{0}+\gamma I]^{-1}\mathbb{E}[xy]+\gamma d_{1}(\beta_{1}\Delta+\beta_{2}\Delta^{2}). (60)

At the critical point,

\frac{d^{2}}{d\gamma^{2}}L(\gamma)|_{\gamma=||\mathbb{E}[xy]||_{-}} =-d_{1}\beta_{2}E_{0}+d_{1}\beta_{1}^{2}(\sigma^{2}+d_{1})\frac{E_{1}^{2}}{E_{0}^{2}}+d_{1}\beta_{1}+d_{1}\beta_{1}+d_{1}\beta_{2}E_{0} (61)
=2d_{1}\beta_{1}+d_{1}\beta_{1}^{2}(\sigma^{2}+d_{1})\frac{E_{1}^{2}}{E_{0}^{2}} (62)
=d_{1}\beta_{1} (63)
=-\frac{d_{1}}{\sigma^{2}+d_{1}}\frac{||\mathbb{E}[xy]||^{2}}{||\mathbb{E}[xy]||^{2}_{A_{0}}}. (64)

Notably, the second derivative of L from the left is only dependent on β_1 and not on β_2.

\frac{d^{2}}{d\gamma^{2}}L(\gamma)|_{\gamma=||\mathbb{E}[xy]||_{-}}=-\frac{d_{1}}{\sigma^{2}+d_{1}}\frac{||\mathbb{E}[xy]||^{2}}{||\mathbb{E}[xy]||^{2}_{A_{0}}}<0. (65)

Thus, the second derivative of L(γ) is discontinuous at γ=||𝔼[xy]||. This completes the proof. \square

Remark.

Note that the proof suggests that close to the critical point, b\sim\sqrt{|\Delta|}, in agreement with the Landau theory.

B.7 Proof of Theorem 5

Proof. By definition, it suffices to show that \frac{d}{d\gamma}L(\gamma) is not continuous. We prove this by contradiction. Suppose that \frac{d}{d\gamma}L(\gamma) is everywhere continuous on γ∈(0,∞). Then, by Lemma 1, one can find the derivative for at least one of the global minima b_*:

\frac{d}{d\gamma}L(\gamma)=\sum_{i}\frac{d_{0}^{2D}b_{*}^{2D}\mathbb{E}[x^{\prime}y]_{i}^{2}}{[d_{0}^{D}(\sigma^{2}+d_{0})^{D}a_{i}b_{*}^{2D}+\gamma]^{2}}+Dd_{0}^{2}b_{*}^{2}\geq 0. (66)

Both terms on the right-hand side are nonnegative, and so one necessary condition for \frac{d}{d\gamma}L(\gamma) to be continuous is that both of these two terms are continuous in γ.

In particular, one necessary condition is that Dd_0^2 b_*^2 is continuous in γ. By Proposition 3 of Ref. [24], there exist constants c_0, c_1 with 0<c_0≤c_1 such that

\begin{cases}b_{*}>0&\text{if }\gamma<c_{0};\\ b_{*}=0&\text{if }\gamma>c_{1}.\end{cases} (67)

Additionally, whenever b_*>0, b_* must be lower-bounded by a nonzero value [24]:

b_{*}\geq\frac{1}{d_{0}}\left(\frac{\gamma}{||\mathbb{E}[xy]||}\right)^{\frac{1}{D-1}}>0. (68)

Therefore, for any D>1, b_*(γ) cannot be continuous: it is positive for γ<c_0, zero for γ>c_1, and, whenever it is positive, bounded away from zero uniformly on any interval bounded away from γ=0, so it must jump discontinuously to 0 at some γ^*∈[c_{0},c_{1}]. This, in turn, implies that \frac{d}{d\gamma}L(\gamma) jumps from a nonzero value to zero and cannot be continuous, which is the desired contradiction. This completes the proof. \square

B.8 Proof of Theorem 6

Proof. It suffices to show that a nonzero global minimum cannot exist at a sufficiently large D when one fixes γ. By Proposition 3 of Ref. [24], when γ>0, any nonzero global minimum must obey the following two inequalities:

\frac{1}{d_{0}}\left[\frac{\gamma}{||\mathbb{E}[xy]||}\right]^{\frac{1}{D-1}}\leq b^{*}\leq\left[\frac{||\mathbb{E}[xy]||}{d_{0}(\sigma^{2}+d_{0})^{D}a_{\rm max}}\right]^{\frac{1}{D+1}}, (69)

where a_max is the largest eigenvalue of A_0. In the limit D→∞, the lower bound becomes

\frac{1}{d_{0}}\left[\frac{\gamma}{||\mathbb{E}[xy]||}\right]^{\frac{1}{D-1}}\to\frac{1}{d_{0}}. (70)

The upper bound becomes

\left[\frac{||\mathbb{E}[xy]||}{d_{0}(\sigma^{2}+d_{0})^{D}a_{\rm max}}\right]^{\frac{1}{D+1}}\to\frac{1}{\sigma^{2}+d_{0}}. (71)

But for any σ²>0, \frac{1}{\sigma^{2}+d_{0}}<\frac{1}{d_{0}}, so that for sufficiently large D the upper bound falls below the lower bound, and the set of such b^* is empty.

On the other hand, when γ=0, the global minimizer has been found in Ref. [19] and is nonzero, which implies that L(0)<𝔼[y²]. This means that the limiting L(γ) is not continuous at γ=0. This completes the proof. \square