
Law of Balance and Stationary Distribution of Stochastic Gradient Descent

Liu Ziyin Hongchao Li Masahito Ueda
Abstract

The stochastic gradient descent (SGD) algorithm is the workhorse algorithm for training neural networks. However, it remains poorly understood how SGD navigates the highly nonlinear and degenerate loss landscape of a neural network. In this work, we prove that the minibatch noise of SGD regularizes the solution towards a balanced solution whenever the loss function contains a rescaling symmetry. Because the difference between a simple diffusion process and SGD dynamics is most significant when symmetries are present, our theory implies that the symmetries of the loss function constitute an essential probe of how SGD works. We then apply this result to derive the stationary distribution of stochastic gradient flow for a diagonal linear network with arbitrary depth and width. The stationary distribution exhibits complicated nonlinear phenomena such as phase transitions, broken ergodicity, and fluctuation inversion. These phenomena are shown to exist uniquely in deep networks, implying a fundamental difference between deep and shallow models.

1 Introduction

The stochastic gradient descent (SGD) algorithm is defined as

$\Delta\theta_{t}=-\frac{\eta}{S}\sum_{x\in B}\nabla_{\theta}\ell(\theta,x)$, (1)

where $\theta$ is the model parameter and $\ell(\theta,x)$ is a per-sample loss whose expectation over $x$ gives the training loss. $B$ is a randomly sampled minibatch of data points, each independently sampled from the training set, and $S$ is the minibatch size. The training-set average of $\ell$ is the training objective $L(\theta)=\mathbb{E}_{x}\ell(\theta,x)$, where $\mathbb{E}_{x}$ denotes averaging over the training set. Two aspects make the algorithm difficult to understand: (1) its dynamics is discrete in time, and (2) the randomness is highly nonlinear and parameter-dependent. This work relies on the continuous-time approximation and deals with the second aspect.
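To make the notation concrete, here is a minimal NumPy sketch of the update (1) (our own illustration, not code from the paper; the quadratic per-sample loss and all names are placeholder choices):

```python
import numpy as np

def sgd_step(theta, xs, ys, eta, S, rng):
    """One update of Eq. (1): average the per-sample gradient of the
    (assumed) loss ell(theta, x) = (theta*x - y)^2 over a minibatch B
    of size S drawn i.i.d. (with replacement) from the training set."""
    idx = rng.integers(0, len(xs), size=S)                      # minibatch B
    grad = np.mean(2 * (theta * xs[idx] - ys[idx]) * xs[idx])   # (1/S) sum of grads
    return theta - eta * grad

rng = np.random.default_rng(0)
xs = rng.normal(size=1000)
ys = 0.5 * xs + 0.1 * rng.normal(size=1000)
theta = 0.0
for _ in range(2000):
    theta = sgd_step(theta, xs, ys, eta=0.05, S=8, rng=rng)
print(theta)  # approaches the minimizer E[xy]/E[x^2], here about 0.5
```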

In natural and social sciences, the most important object of study of a stochastic system is its stationary distribution, which often offers fundamental insight into the underlying stochastic process [37, 31]. Arguably, a great deal of insight into SGD could be obtained from an analytical understanding of its stationary distribution, which has remained unknown to date. Most existing works study the dynamics and stationary properties of SGD for a strongly convex loss function [43, 44, 20, 46, 27, 52, 21, 42]. The works that touch on the nonlinear aspects of the loss function rely heavily on local approximations of the stationary distribution of SGD close to a local minimum, often with additional unrealistic assumptions about the noise. For example, using a saddle point expansion and assuming that the noise is parameter-independent, Refs. [23, 44, 20] showed that the stationary distribution of SGD is exponential. Taking partial parameter-dependence into account and working near an interpolation minimum, Ref. [27] showed that the stationary distribution is power-law like and proportional to $L(\theta)^{-c_{0}}$ for some constant $c_{0}$. However, the stationary distribution of SGD is unknown when the loss function is beyond quadratic and high-dimensional.

Since the stationary distribution of SGD is unknown, we will compare our results with the most naive theory one can construct for SGD, a continuous-time Langevin equation with a constant noise level:

$\dot{\theta}(t)=-\eta\nabla_{\theta}L(\theta)+\sqrt{2T_{0}}\epsilon(t)$, (2)

where $\epsilon$ is a random time-dependent noise with zero mean and $\mathbb{E}[\epsilon(t)\epsilon(t^{\prime})^{T}]=\eta\delta(t-t^{\prime})I$, with $I$ the identity operator. Here, the naive theory relies on the assumption that one can find a constant scalar $T_{0}$ such that Eq. (2) closely models Eq. (1), at least after some level of coarse-graining. Let us examine some of the predictions of this model to understand when and why it goes wrong.

There are two important predictions of this model. The first is that the stationary distribution of SGD is a Gibbs distribution with temperature $T_{0}$: $p(\theta)\propto\exp[-L(\theta)/T_{0}]$. This implies that the maximum likelihood estimator of $\theta$ under SGD is the same as the global minimizer of $L(\theta)$: $\arg\max p(\theta)=\arg\min L(\theta)$. This relation holds for the local minima as well: every local minimum of $L$ corresponds to a local maximum of $p$. These properties are often required in the popular argument that SGD approximates Bayesian inference [23, 25]. Another implication is ergodicity [39]: any state with the same energy will have an equal probability of being accessed. The second prediction is dynamical: SGD will diffuse. If there is a degenerate direction in the loss function, SGD will diffuse along that direction. (This can also be seen as a dynamical interpretation of ergodicity.)

Figure 1: SGD converges to a balanced solution. Left: the quantity $u^{2}-w^{2}$ is conserved for GD without noise, diverges for GD with isotropic Gaussian noise (which simulates the simple Langevin model), and decays to zero for SGD, a sharp and dramatic contrast. Right: illustration of the three types of dynamics. Gradient descent (GD) moves along the conservation line $u^{2}(t)-w^{2}(t)=u^{2}(0)-w^{2}(0)$. GD with isotropic Gaussian noise expands and diverges along the flat direction of the minimum valley. The actual SGD oscillates around a balanced solution.

However, these predictions of the Langevin model are not difficult to reject. Let us consider a simple two-layer network with the loss function $\ell(u,w,x)=(uwx-y(x))^{2}$. Because of the rescaling symmetry, a valley of degenerate solutions exists at $uw=c_{0}$. Under the simple Langevin model, SGD diverges to infinity due to diffusion. One can also see this from a static perspective: all points on the line $uw=c_{0}$ must have the same probability at stationarity, but such a distribution does not exist because it is not normalizable. This means that the Langevin model of SGD diverges for this loss function.

Does this agree with the empirical observation? Certainly not. (In fact, had it been the case, no linear network or ReLU network could be trained with SGD.) See Fig. 1. Contrary to the prediction of the Langevin model, $|u^{2}-w^{2}|$ converges to zero under SGD. Under GD, this quantity is conserved during training [7]. Only GD with Gaussian noise obeys the prediction of the Langevin model, as expected. This sharp contrast shows that the SGD dynamics is quite special, and a naive theoretical model can be very far from the truth in understanding its behavior. There is one more lesson to be learned: the fact that the Langevin model disagrees the most with the experiments when symmetries are present suggests that symmetry conditions are crucial tools for probing the nature of the SGD noise, which is the main topic of our theory. A minimal simulation reproducing this contrast is sketched below.
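The sketch is our own toy experiment (the data distribution, learning rate, and noise scale are arbitrary choices): the same loss $\ell(u,w,x)=(uwx-y(x))^{2}$ is trained with full-batch GD, GD plus isotropic Gaussian noise, and minibatch SGD, while tracking $u^{2}-w^{2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = x + rng.normal(size=1000)            # y(x) with label noise, so the SGD noise is nonzero

def grad(u, w, xb, yb):
    m = np.mean((u * w * xb - yb) * xb)  # batch average of (uwx - y)x
    return 2 * w * m, 2 * u * m          # d ell/du, d ell/dw

eta, steps = 0.02, 20000
for mode in ("GD", "GD + Gaussian noise", "SGD"):
    u, w = 1.5, 0.5                      # u^2 - w^2 = 2 initially
    for _ in range(steps):
        if mode == "SGD":
            i = rng.integers(0, x.size, size=1)   # minibatch of size 1
            gu, gw = grad(u, w, x[i], y[i])
        else:
            gu, gw = grad(u, w, x, y)             # full batch
        u, w = u - eta * gu, w - eta * gw
        if mode == "GD + Gaussian noise":         # the naive Langevin model
            u += 0.03 * rng.normal()
            w += 0.03 * rng.normal()
    print(f"{mode:22s} u^2 - w^2 = {u*u - w*w:+.4f}")
# GD (approximately) conserves u^2 - w^2; isotropic Gaussian noise lets it wander;
# SGD drives it to zero, in line with the law of balance.
```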

2 Law of Balance

Now, we consider the actual continuous-time limit of SGD [16, 17, 19, 33, 9, 13]:

$d\theta=-\nabla_{\theta}L\,dt+\sqrt{TC(\theta)}\,dW_{t}$, (3)

where $dW_{t}$ is a stochastic process satisfying $dW_{t}\sim N(0,I\,dt)$ and $\mathbb{E}[dW_{t}dW_{t^{\prime}}^{T}]=\delta(t-t^{\prime})I$, and $T=\eta/S$. Evidently, $T$ gives the average noise level in the dynamics. Previous works have suggested that the ratio $\eta/S:=T$ is the main factor determining the behavior of SGD, and a higher $T$ often leads to better generalization performance [32, 20, 50]. The crucial difference between Eq. (3) and Eq. (2) is that in Eq. (3), the noise covariance $C(\theta)$ is parameter-dependent and, in general, low-rank when symmetries exist.

Due to standard architecture designs, a type of invariance, the rescaling symmetry, often appears in the loss function and exists for every sampled minibatch. The per-sample loss $\ell$ is said to have the rescaling symmetry for all $x$ if $\ell(u,w,x)=\ell(\lambda u,w/\lambda,x)$ for an arbitrary scalar $\lambda$. Note that this implies that the expected loss $L$ also has the same symmetry. This type of symmetry appears in many scenarios in deep learning. For example, it appears in any neural network with the ReLU activation. It also appears in the self-attention of transformers, often in the form of the key and query matrices [38]. When this symmetry exists between $u$ and $w$, one can prove the following result, which we refer to as the law of balance.

Theorem 1.

(Law of balance.) Let $u$ and $w$ be vectors of arbitrary dimensions. Let $\ell(u,w,x)$ satisfy $\ell(u,w,x)=\ell(\lambda u,w/\lambda,x)$ for arbitrary $x$ and $\lambda\neq 0$. Then,

$\frac{d}{dt}(||u||^{2}-||w||^{2})=-T(u^{T}C_{1}u-w^{T}C_{2}w)$, (4)

where $C_{1}=\mathbb{E}[A^{T}A]-\mathbb{E}[A^{T}]\mathbb{E}[A]$, $C_{2}=\mathbb{E}[AA^{T}]-\mathbb{E}[A]\mathbb{E}[A^{T}]$, and $A_{ki}=\frac{\partial\tilde{\ell}}{\partial(u_{i}w_{k})}$ with $\tilde{\ell}(u_{i}w_{k},x)\equiv\ell(u_{i},w_{k},x)$.

Our result holds in a stronger form if we consider the effect of a finite step size by using the modified loss function (see Appendix A.7) [2, 34]. For common problems, $C_{1}$ and $C_{2}$ are positive definite, and the theorem then implies that the norms of $u$ and $w$ will be approximately balanced. To see this, we can bound the dynamics as

$-T(\lambda_{1M}||u||^{2}-\lambda_{2m}||w||^{2})\leq\frac{d}{dt}(||u||^{2}-||w||^{2})\leq-T(\lambda_{1m}||u||^{2}-\lambda_{2M}||w||^{2})$, (5)

where $\lambda_{1m(2m)}$ and $\lambda_{1M(2M)}$ denote the minimal and maximal eigenvalues of the matrix $C_{1(2)}$, respectively. In the long-time limit, the value of $||u||^{2}/||w||^{2}$ is restricted by

$\frac{\lambda_{2m}}{\lambda_{1M}}\leq\frac{||u||^{2}}{||w||^{2}}\leq\frac{\lambda_{2M}}{\lambda_{1m}}$, (6)

which implies that the stationary dynamics of the parameters $u,w$ is constrained to a bounded subspace of the unbounded degenerate minimum valley. Conventional analysis shows that the difference between SGD and GD is of order $T^{2}$ per unit time, and it is thus often believed that SGD can be understood perturbatively through GD [13]. However, the law of balance implies that the difference between GD and SGD is not perturbative: as long as there is any level of noise, the difference between GD and SGD at stationarity is $O(1)$. This theorem has an important implication: the noise in SGD creates a qualitative difference between SGD and GD, and we must study SGD noise in its own right. The theorem also implies a loss of ergodicity, an important phenomenon in nonequilibrium physics [29, 35, 24, 36], because not all solutions with the same training loss will be accessed by SGD with equal probability.

The theorem simplifies greatly when both $u$ and $w$ are one-dimensional.

Corollary 1.

If $u,w\in\mathbb{R}$, then $\frac{d}{dt}|u^{2}-w^{2}|=-TC_{0}|u^{2}-w^{2}|$, where $C_{0}={\rm Var}[\frac{\partial\ell}{\partial(uw)}]$.

Before we apply the theorem to study stationary distributions, we stress the importance of this balance condition. The relation is closely related to Noether's theorem [26, 1, 22]. If there is no weight decay or stochasticity in training, the quantity $||u||^{2}-||w||^{2}$ is conserved under gradient flow [7, 15], as is evident from taking the infinite-$S$ limit. The fact that it monotonically decays to zero at a finite $T$ may be a manifestation of some underlying fundamental mechanism. A more recent result in Ref. [40] showed that for a two-layer linear network, the norms of the two layers are within a distance of order $O(\eta^{-1})$, suggesting that the norms of the two layers are balanced. Our result agrees with Ref. [40] in this case, but it is far stronger: it is nonperturbative, relies only on the rescaling symmetry, and is independent of the loss function or the architecture of the model.

Example: two-layer linear network.

It is instructive to illustrate the application of the law to a two-layer linear network, the simplest model that obeys the law. Let $\theta=(w,u)$ denote the set of trainable parameters; the per-sample loss is $\ell(\theta,x)=\left(\sum_{i}^{d}u_{i}w_{i}x-y\right)^{2}+\gamma||\theta||^{2}$. Here, $d$ is the width of the model, $||\theta||^{2}$ is the common $L_{2}$ regularization term that encourages the learned model to have a small norm, and $\gamma\geq 0$ is the strength of regularization; as before, $\mathbb{E}_{x}$ denotes averaging over the training set, which could be a continuous distribution or a discrete sum of delta distributions. It will be convenient to define the shorthand $v:=\sum_{i}^{d}u_{i}w_{i}$. The distribution of $v$ is said to be the distribution of the "model."

Applying the law of balance, we obtain

$\frac{d}{dt}(u_{i}^{2}-w_{i}^{2})=-4[T(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})+\gamma](u_{i}^{2}-w_{i}^{2})$, (7)

where we have introduced the parameters

$\begin{cases}\alpha_{1}:={\rm Var}[x^{2}],\\ \alpha_{2}:=\mathbb{E}[x^{3}y]-\mathbb{E}[x^{2}]\mathbb{E}[xy],\\ \alpha_{3}:={\rm Var}[xy].\end{cases}$ (8)

When $\alpha_{2}^{2}-\alpha_{1}\alpha_{3}<0$ or $\gamma>0$, the time evolution of $|u_{i}^{2}-w_{i}^{2}|$ can be upper-bounded by an exponentially decreasing function of time: $|u_{i}^{2}-w_{i}^{2}|(t)<|u_{i}^{2}-w_{i}^{2}|(0)\exp\left(-4T(\alpha_{1}\alpha_{3}-\alpha_{2}^{2})t/\alpha_{1}-4\gamma t\right)\to 0$. Namely, the quantity $(u_{i}^{2}-w_{i}^{2})$ decays to $0$ with probability $1$. We thus have $u_{i}^{2}=w_{i}^{2}$ for all $i\in\{1,\cdots,d\}$ at stationarity, in agreement with Figure 1. This prediction can be checked directly against the exact discrete-time update, as sketched below.
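For a width-1 network, a short computation gives the discrete-time analogue of Eq. (7): one SGD step yields $u'^{2}-w'^{2}=(u^{2}-w^{2})(1-4\eta^{2}m^{2})$, where $m$ is the minibatch average of $(vx-y)x$, and $\mathbb{E}[m^{2}]=(v\beta_{1}-\beta_{2})^{2}+(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})/S$. The following sketch (our own consistency check, not from the paper; all numbers are placeholders) verifies this identity by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(1)
N, S, eta = 100_000, 4, 0.05
x = rng.normal(size=N)
y = 0.7 * x + 0.2 * rng.normal(size=N)

u, w = 1.2, 0.4
v = u * w
# data moments entering Eq. (8) and the landscape parameters
a1, a3 = np.var(x**2), np.var(x*y)
b1, b2 = np.mean(x**2), np.mean(x*y)
a2 = np.mean(x**3 * y) - b1 * b2                     # cov(x^2, xy)

# Monte Carlo: average u'^2 - w'^2 over many independent minibatches
vals = []
for _ in range(20000):
    i = rng.integers(0, N, size=S)
    m = np.mean((v * x[i] - y[i]) * x[i])            # minibatch average of (vx - y)x
    gu, gw = 2 * w * m, 2 * u * m                    # minibatch gradients
    vals.append((u - eta * gu)**2 - (w - eta * gw)**2)
mc = np.mean(vals)

# exact one-step expectation: (u^2 - w^2) * (1 - 4 eta^2 (mean^2 + var/S))
pred = (u**2 - w**2) * (1 - 4 * eta**2 * ((v*b1 - b2)**2 + (a1*v**2 - 2*a2*v + a3) / S))
print(f"{mc:.6f} {pred:.6f}")   # the two agree up to Monte-Carlo error
```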

3 Stationary Distribution of SGD

As an important application of the law of balance, we solve the stationary distribution of SGD for a deep diagonal linear network. While linear networks are limited in expressivity, their loss landscape and dynamics are highly nonlinear, and they are regarded as minimal models of nonlinear neural networks [14, 18, 48, 41].

3.1 Depth-0 Case

Let us first derive the stationary distribution of a one-dimensional linear regressor, which will serve as a baseline to help us understand what is unique about having "depth" in deep learning. The per-sample loss is $\ell(x,v)=(vx-y)^{2}+\gamma v^{2}$, for which the SGD dynamics is $dv=-2(\beta_{1}v-\beta_{2}+\gamma v)dt+\sqrt{TC(v)}\,dW(t)$, where we have defined

$\begin{cases}\beta_{1}:=\mathbb{E}[x^{2}],\\ \beta_{2}:=\mathbb{E}[xy].\end{cases}$ (9)

Note that the closed-form solution of linear regression gives the global minimizer of the loss function: $v^{*}=\beta_{2}/\beta_{1}$. The gradient variance is also nontrivial: $C(v):={\rm Var}[\partial\ell(v,x)/\partial v]=4(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})$. Note that the loss landscape $L$ depends only on $\beta_{1}$ and $\beta_{2}$, while the gradient noise depends only on $\alpha_{1}$, $\alpha_{2}$, and $\alpha_{3}$. These relations imply that $C$ can be quite independent of $L$, contrary to popular beliefs in the literature [27, 23]. It is thus reasonable to call $\beta$ the landscape parameters and $\alpha$ the noise parameters. We will see that both $\beta$ and $\alpha$ appear in all the stationary distributions we derive, implying that the stationary distributions of SGD depend strongly on the data.

Another important quantity is $\Delta:=\alpha_{1}\alpha_{3}-\alpha_{2}^{2}\geq 0$, which sets the minimal level of noise on the landscape: $\min_{v}C(v)=4\Delta/\alpha_{1}$. For all the examples in this work,

$\Delta={\rm Var}[x^{2}]{\rm Var}[xy]-{\rm cov}^{2}(x^{2},xy)=\alpha_{1}\alpha_{3}-\alpha_{2}^{2}$. (10)

When is $\Delta$ zero? It vanishes when, for all samples $(x,y)$, $xy+c=kx^{2}$ for some constants $k$ and $c$. We focus on the case $\Delta>0$ in the main text, which is the most likely case in practice. The other cases are dealt with in Appendix A.

For $\Delta>0$, the stationary distribution for linear regression is found to be

$p(v)\propto(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})^{-1-\frac{\beta_{1}^{\prime}}{2T\alpha_{1}}}\exp\left[-\frac{1}{T}\frac{\alpha_{2}\beta_{1}^{\prime}-\alpha_{1}\beta_{2}}{\alpha_{1}\sqrt{\Delta}}\arctan\left(\frac{\alpha_{1}v-\alpha_{2}}{\sqrt{\Delta}}\right)\right]$, (11)

where $\beta_{1}^{\prime}:=\beta_{1}+\gamma$, roughly in agreement with the result in Ref. [27]. Two notable features of this distribution stand out: (1) the power exponent of the tail depends on the learning rate and batch size, and (2) the integral of $p(v)$ converges for an arbitrary learning rate. On the one hand, this implies that increasing the learning rate alone cannot introduce new phases of learning to linear regression; on the other hand, the expected error diverges as one increases the learning rate (or the feature variation), which happens at $T=\beta_{1}^{\prime}/\alpha_{1}$. We will see that deeper models differ from the single-layer model in these two crucial aspects. As a sanity check, Eq. (11) can be compared with a direct simulation, as sketched below.
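The sketch below is our own (all hyperparameters are arbitrary choices): it runs SGD on $\ell=(vx-y)^{2}$ with $\gamma=0$ and compares the histogram of the trajectory with the numerically normalized density of Eq. (11). Agreement is only approximate, since the finite-step-size chain merely approximates stochastic gradient flow.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=20_000)
y = 0.5 * x + rng.normal(size=20_000)
eta, S, steps = 0.02, 1, 400_000
T = eta / S

b1, b2 = np.mean(x**2), np.mean(x*y)
a1, a3 = np.var(x**2), np.var(x*y)
a2 = np.mean(x**3 * y) - b1 * b2
Delta = a1 * a3 - a2**2

v, traj = 0.0, np.empty(steps)           # SGD on ell = (vx - y)^2
for t in range(steps):
    i = rng.integers(x.size)
    v -= eta * 2 * (v * x[i] - y[i]) * x[i]
    traj[t] = v

# unnormalized Eq. (11) with gamma = 0 (so beta1' = beta1), normalized on a grid
grid = np.linspace(traj.min(), traj.max(), 500)
logp = (-1 - b1 / (2*T*a1)) * np.log(a1*grid**2 - 2*a2*grid + a3) \
       - (a2*b1 - a1*b2) / (T*a1*np.sqrt(Delta)) * np.arctan((a1*grid - a2)/np.sqrt(Delta))
p = np.exp(logp - logp.max())
p /= np.trapz(p, grid)

hist, edges = np.histogram(traj[steps//10:], bins=50, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - np.interp(centers, grid, p))))  # small if the theory holds
```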

3.2 Deep Diagonal Networks

Now, we consider a diagonal deep linear network, whose loss function can be written as

$\ell=\left[\sum_{i}^{d_{0}}\left(\prod_{k=0}^{D}u_{i}^{(k)}\right)x-y\right]^{2}$, (12)

where $D$ is the depth and $d_{0}$ is the width. When the width $d_{0}=1$, the law of balance is sufficient to solve the model. When $d_{0}>1$, we need to eliminate additional degrees of freedom. Many recent works have studied the properties of diagonal linear networks, which have been found to approximate the dynamics of real networks well [30, 28, 3, 8].

We introduce $v_{i}:=\prod_{k=0}^{D}u_{i}^{(k)}$, so that $v=\sum_{i}v_{i}$, and call $v_{i}$ a "subnetwork" and $v$ the "model." The following theorem shows that the dynamics of this model can be reduced to a one-dimensional form.

Theorem 2.

For all $i\neq j$, one (or more) of the following conditions holds for all trajectories at stationarity:

  1. $v_{i}=0$, or $v_{j}=0$, or $L(\theta)=0$;

  2. ${\rm sgn}(v_{i})={\rm sgn}(v_{j})$. In addition,

     (a) if $D=1$, $\log|v_{i}|-\log|v_{j}|=c_{0}$ for some constant $c_{0}$;

     (b) if $D>1$, $|v_{i}|^{2}-|v_{j}|^{2}=0$.

This theorem has many interesting aspects. First, the three situations in item 1 directly tell us the distribution of $v$, which is the quantity we ultimately care about. ($L\to 0$ is only possible when $\Delta=0$ and $v=\beta_{2}/\beta_{1}$.) This means that to understand the stationary distribution of SGD, we only need to solve the case of item 2. Once the parameters enter the condition of item 2, item 2 continues to hold with probability $1$ for the rest of the trajectory. Second, item 2 implies that all the $v_{i}$ of the model must share the same sign for any network with $D\geq 1$: no subnetwork of the original network can learn an incorrect sign. This is dramatically different from the case of $D=0$; we discuss this point in more detail below. Third, the theorem implies that the dynamics of SGD is qualitatively different for different depths of the model. In particular, $D=1$ and $D>1$ have entirely different dynamics. For $D=1$, the ratio between every pair $v_{i}$ and $v_{j}$ is a conserved quantity. In sharp contrast, for $D>1$, the distance between different $v_{i}$ is no longer conserved but decays to zero: a new balancing condition emerges as we increase the depth. This qualitative distinction corroborates the discoveries in Refs. [48] and [51], where $D=1$ models are found to be qualitatively different from models with $D>1$.

With this theorem, we are now ready to solve for the stationary distribution. It suffices to condition on the event that $v_{i}$ does not converge to zero. Suppose that there are $d$ nonzero $v_{i}$ that obey item 2 of Theorem 2; $d$ can be seen as an effective width of the model. We stress that the effective width $d\leq d_{0}$ depends on the initialization and can be arbitrary. (One can systematically initialize the parameters so that $d$ takes any desired value between $1$ and $d_{0}$; for example, one can initialize on the stationary conditions at the desired value of $d$.) We therefore condition on a fixed value of $d$ to solve for the stationary distribution of $v$ (Appendix A):

$p_{\pm}(|v|)\propto\frac{1}{|v|^{3(1-1/(D+1))}(\alpha_{1}|v|^{2}\mp 2\alpha_{2}|v|+\alpha_{3})}\exp\left(-\frac{1}{T}\int_{0}^{|v|}d|v|\,\frac{d^{1-2/(D+1)}(\beta_{1}|v|\mp\beta_{2})}{(D+1)|v|^{2D/(D+1)}(\alpha_{1}|v|^{2}\mp 2\alpha_{2}|v|+\alpha_{3})}\right)$, (13)

where $p_{-}$ is the distribution of $v$ on $(-\infty,0)$ and $p_{+}$ is that on $(0,\infty)$. Next, we analyze this distribution in detail. Since the result is symmetric in the sign of $\beta_{2}=\mathbb{E}[xy]$, we assume $\mathbb{E}[xy]>0$ from now on.

3.2.1 Depth-1 Nets

Figure 2: Stationary distributions of SGD for linear regression ($D=0$) and a two-layer network ($D=1$) across different $T=\eta/S$: $T=0.05$ (left) and $T=0.5$ (middle). For $D=1$, the stationary distribution is strongly affected by the choice of the learning rate. In contrast, for $D=0$, the stationary distribution is always centered at the global minimizer of the loss function, and the choice of the learning rate only affects the thickness of the tail. Right: the stationary distribution of a one-layer tanh model, $f(x)=\tanh(vx)$ ($D=0$), and a two-layer tanh model, $f(x)=w\tanh(ux)$ ($D=1$). For $D=1$, we define $v:=wu$. The vertical line shows the ground truth. The deeper model never learns the wrong sign of $wu$, whereas the shallow model can.

We focus on the case $\gamma=0$. (When weight decay is present, the stationary distribution is the same, except that one needs to replace $\beta_{2}$ with $\beta_{2}-\gamma$.) Other cases are studied in detail in Appendix A and listed in Table 1. The distribution of $v$ is

$p_{\pm}(|v|)\propto\frac{|v|^{\pm\beta_{2}/2\alpha_{3}T-3/2}}{(\alpha_{1}|v|^{2}\mp 2\alpha_{2}|v|+\alpha_{3})^{1\pm\beta_{2}/4T\alpha_{3}}}\exp\left(-\frac{1}{2T}\frac{\alpha_{3}\beta_{1}-\alpha_{2}\beta_{2}}{\alpha_{3}\sqrt{\Delta}}\arctan\frac{\alpha_{1}|v|\mp\alpha_{2}}{\sqrt{\Delta}}\right)$. (14)

This measure is worth a close examination. First, the exponential term is bounded above and below and well-behaved in all situations. In contrast, the polynomial term dominates both at infinity and close to zero. For $v<0$, the distribution is a delta function at zero: $p(v)=\delta(v)$. To see this, note that the term $|v|^{-\beta_{2}/2\alpha_{3}T-3/2}$ integrates to $|v|^{-\beta_{2}/2\alpha_{3}T-1/2}$ close to the origin, which is infinite, while the integral away from the origin is finite. This signals that the only possible stationary distribution has zero measure for $v\neq 0$. The stationary distribution is thus a delta distribution, meaning that if $x$ and $y$ are positively correlated, the learned subnets $v_{i}$ can never be negative, no matter the initial configuration.

For $v>0$, the distribution is nontrivial. Close to $v=0$, the distribution is dominated by $v^{\beta_{2}/2\alpha_{3}T-3/2}$, which integrates to $v^{\beta_{2}/2\alpha_{3}T-1/2}$. This is finite only below a critical temperature $T_{c}=\beta_{2}/\alpha_{3}$: a phase-transition-like behavior. As $T\to T_{c}^{-}$, the normalization diverges, and the distribution tends to a delta distribution at zero. Namely, if $T>T_{c}$, we have $u_{i}=w_{i}=0$ for all $i$ with probability $1$, and no learning can happen. If $T<T_{c}$, the stationary distribution has a finite variance, and learning may happen. In the more general setting, where weight decay is present, this critical $T$ shifts to

$T_{c}=\frac{\beta_{2}-\gamma}{\alpha_{3}}$. (15)

When $T=0$, the phase transition occurs at $\beta_{2}=\gamma$, in agreement with the critical point identified in Ref. [51]. This critical learning rate also agrees with the discrete-time analysis of Refs. [49, 47] and the approximate continuous-time analysis of Ref. [4]. See Figure 2 for illustrations of the distribution across different values of $T$, where we also compare with the stationary distribution of a depth-0 model. Two characteristics of the two-layer model are striking: (1) the solution becomes a delta distribution at the sparse solution $u=w=0$ at a large learning rate; (2) the two-layer model never learns the incorrect sign ($v$ is always non-negative). See Figure 2. The transition at $T_{c}$ can also be probed directly in simulation, as sketched below.
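The following sketch is our own toy experiment (a width-1, depth-1 network on synthetic, weakly correlated data; all numbers are placeholders): it estimates $T_{c}$ from the data moments via Eq. (15) and runs SGD on both sides of it.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(size=10_000)
y = 0.05 * x + rng.normal(size=10_000)     # weak correlation: small beta2, large alpha3
b1, b2, a3 = np.mean(x**2), np.mean(x*y), np.var(x*y)
Tc = b2 / a3                               # Eq. (15) with gamma = 0
print("Tc =", Tc, " global minimizer beta2/beta1 =", b2 / b1)

def run(T, steps=200_000, S=1):
    eta, (u, w) = T * S, (0.3, 0.3)
    for _ in range(steps):
        i = rng.integers(x.size)
        r = 2 * (u * w * x[i] - y[i]) * x[i]       # per-sample d ell / d v
        u, w = u - eta * r * w, w - eta * r * u
    return u * w

print("T = 0.1 Tc -> v =", run(0.1 * Tc))  # learning phase: v expected near beta2/beta1
print("T = 2.0 Tc -> v =", run(2.0 * Tc))  # collapsed phase: v expected near 0
```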

Therefore, training deeper models with SGD has two simultaneous advantages: (1) a generalization advantage, in that a sparse solution is favored when the underlying data correlation is weak; (2) an optimization advantage, in that the training loss interpolates between that of the global minimizer and that of the sparse saddle and is well-bounded (whereas a depth-0 model can reach an arbitrarily bad objective value at a large learning rate).

Another exotic phenomenon implied by the result is what we call "fluctuation inversion." Naively, the variance of the model parameters should increase as we increase $T$, the noise level of SGD. However, for the distribution we derived, the variances of $v$ and $u$ both decrease to zero as we increase $T$: injecting noise makes the model fluctuation vanish. We discuss this "fluctuation inversion" further in the next section.

Also, while there is no other phase transition below $T_{c}$, there is still an interesting and practically relevant crossover in the parameter distribution as we change the learning rate. When we train a model, we often run SGD only once or a few times. In that case, the most likely parameter we obtain is the maximum likelihood estimator of the distribution, $\hat{v}:=\arg\max p(v)$. Understanding how $\hat{v}(T)$ changes as a function of $T$ is thus crucial. This quantity exhibits nontrivial crossover behavior at critical values of $T$.

When $T<T_{c}$, a nonzero maximizer of $p(v)$ must satisfy

$v^{*}=-\frac{\beta_{1}-10\alpha_{2}T-\sqrt{(\beta_{1}-10\alpha_{2}T)^{2}+28\alpha_{1}T(\beta_{2}-3\alpha_{3}T)}}{14\alpha_{1}T}$. (16)

The existence of this solution is nontrivial, and we analyze it in Appendix A.5. When $T\to 0$, a solution always exists and is given by $v^{*}=\beta_{2}/\beta_{1}$, which does not depend on the learning rate or the noise $C$. Note that $\beta_{2}/\beta_{1}$ is also the minimum of $L(u_{i},w_{i})$. This means that in deep learning, SGD is a consistent estimator of the local minima only in the vanishing learning rate limit. How biased is SGD at a finite learning rate? Two limits can be computed. For a small learning rate, the leading-order correction to the solution is $v^{*}=\frac{\beta_{2}}{\beta_{1}}+\left(\frac{10\alpha_{2}\beta_{2}}{\beta_{1}^{2}}-\frac{7\alpha_{1}\beta_{2}^{2}}{\beta_{1}^{3}}-\frac{3\alpha_{3}}{\beta_{1}}\right)T$. This implies that the common Bayesian analysis relying on a Laplace expansion of the loss fluctuation around a local minimum is improper. The fact that the stationary distribution of SGD is very far from the Bayesian posterior also implies that SGD is a good Bayesian sampler only at a small learning rate. Both limits are easy to check numerically, as sketched below.
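In the sketch below (our own check; the moment values are arbitrary placeholders satisfying $\Delta>0$), the exact maximizer of Eq. (16) and the first-order expansion converge to $\beta_{2}/\beta_{1}$ as $T\to 0$:

```python
import numpy as np

# arbitrary data-moment parameters with Delta = a1*a3 - a2^2 > 0
a1, a2, a3 = 2.0, 0.3, 1.5
b1, b2 = 1.0, 0.5

def v_hat(T):
    """Nonzero maximizer of p(v), Eq. (16)."""
    A = b1 - 10 * a2 * T
    return -(A - np.sqrt(A**2 + 28 * a1 * T * (b2 - 3 * a3 * T))) / (14 * a1 * T)

for T in (1e-1, 1e-2, 1e-3, 1e-4):
    first_order = b2/b1 + (10*a2*b2/b1**2 - 7*a1*b2**2/b1**3 - 3*a3/b1) * T
    print(T, v_hat(T), first_order)   # both tend to beta2/beta1 = 0.5 as T -> 0
```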

Figure 3: Regimes of learning for SGD as a function of $T=\eta/S$ and the noise $\sigma$ in the dataset. According to (1) whether the sparse transition has happened, (2) whether a nontrivial maximum-probability estimator exists, and (3) whether the sparse solution is a maximum-probability estimator, the learning of SGD can be characterized into 5 regimes. In regime I, SGD converges to a sparse solution with zero variance. In regime II, the stationary distribution has a finite spread, and the probability density of the sparse solution diverges; hence, the probability of being close to the sparse solution is very high. In regime III, the probability density of the sparse solution is zero, and therefore the model will learn without much problem. In regime b, a local nontrivial probability maximum exists, and hence SGD has some probability of successful learning. In regime a, the only maximum-probability estimator is the sparse solution.

It is instructive to consider an example of a structured dataset: $y=kx+\epsilon$, where $x\sim\mathcal{N}(0,1)$ and the noise obeys $\epsilon\sim\mathcal{N}(0,\sigma^{2})$. We let $\gamma=0$ for simplicity. If $\sigma^{2}>\frac{8}{21}k^{2}$, there always exists a transitional learning rate $T^{*}=\frac{4k+\sqrt{42}\sigma}{4(21\sigma^{2}-8k^{2})}$. (We say "transitional" to distinguish it from the critical learning rate.) Obviously, $T_{c}/3<T^{*}$. One can characterize the learning of SGD by comparing $T$ with $T_{c}$ and $T^{*}$; for this simple example, SGD falls into roughly 5 different regimes. See Figure 3. A short computation of $T_{c}$ and $T^{*}$ for this dataset is sketched below.
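For this dataset, the relevant moments are available in closed form ($\beta_{2}=k$ and $\alpha_{3}=2k^{2}+\sigma^{2}$ for standard Gaussian $x$) or can be estimated from samples. A short sketch of our own:

```python
import numpy as np

k, sigma = 0.5, 1.0              # y = k x + eps, x ~ N(0,1), eps ~ N(0, sigma^2)
rng = np.random.default_rng(4)
x = rng.normal(size=1_000_000)
y = k * x + sigma * rng.normal(size=x.size)

b2, a3 = np.mean(x*y), np.var(x*y)
Tc = b2 / a3                     # Eq. (15) with gamma = 0; analytically k/(2k^2 + sigma^2)
Tstar = (4*k + np.sqrt(42)*sigma) / (4*(21*sigma**2 - 8*k**2))  # needs sigma^2 > 8k^2/21
print("Tc =", Tc, "(analytic:", k / (2*k**2 + sigma**2), ")")
print("T* =", Tstar, "; Tc/3 < T* :", Tc / 3 < Tstar)
```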

Figure 4: SGD on deep networks leads to a well-controlled distribution and training loss. Left: power law of the tail of the parameter distribution of deep linear nets. The dashed lines show the upper ($-7/2$) and lower ($-5$) bounds of the tail exponent. The predicted power-law scaling agrees with the experiment, and the exponent decreases as the theory predicts. Middle: training loss of a tanh network. $D=0$ is the case where only the input weight is trained, and $D=1$ is the case where both input and output layers are trained. For $D=0$, the model norm increases as the model loses stability. For $D=1$, a "fluctuation inversion" effect appears: the fluctuation of the model vanishes before it loses stability. Right: performance of fully connected tanh nets on MNIST. Scaling the learning rate as $1/D$ keeps the model performance relatively unchanged.

3.3 Power-Law Tail of Deeper Models

An interesting aspect of the depth-1 model is that its distribution is independent of the width $d$ of the model. This is not true for deeper models, as seen from Eq. (13): the $d$-dependent term vanishes only if $D=1$. Another intriguing aspect of the depth-1 distribution is that its tail is independent of any hyperparameter of the problem, in dramatic contrast to the linear regression case. This holds for deeper models as well.

Since $d$ only affects the non-polynomial part of the distribution, the stationary distribution scales as $p(v)\propto\frac{1}{v^{3(1-1/(D+1))}(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})}$. Hence, as $v\to\infty$, the distribution scales as $v^{-5+3/(D+1)}$. The tail becomes monotonically thinner as one increases the depth: for $D=1$, the exponent is $-7/2$, while an infinite-depth network has an exponent of $-5$. Therefore, the tail of the model distribution depends only on the depth and is independent of the data or the details of training, unlike for the depth-0 model. In addition, due to the scaling $v^{-5+3/(D+1)}$ for $v\to\infty$, $\mathbb{E}[v^{2}]$ never diverges no matter how large $T$ is. See Figure 4, middle. A quick numerical check of the tail exponent is sketched below.
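The tail exponent can be read off directly from the polynomial part of Eq. (13). A quick log-log fit (our own check; the noise parameters are placeholders) confirms the $v^{-5+3/(D+1)}$ scaling:

```python
import numpy as np

a1, a2, a3 = 2.0, 0.3, 1.5           # arbitrary noise parameters
v = np.logspace(2, 4, 200)            # large-v regime
for D in (1, 2, 5, 50):
    p = v ** (-3 * (1 - 1/(D+1))) / (a1*v**2 - 2*a2*v + a3)  # polynomial part of Eq. (13)
    slope = np.polyfit(np.log(v), np.log(p), 1)[0]
    print(D, round(slope, 3), -5 + 3/(D+1))   # fitted vs predicted tail exponent
```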

One implication is that neural networks with at least one hidden layer never have a divergent training loss. This directly explains the puzzling observation of the edge-of-stability phenomenon in deep learning: SGD training often brings a neural network to a solution where a slight increment of the learning rate causes discrete-time instability and divergence [43, 5]. These solutions, quite surprisingly, exhibit low training and testing loss even when the learning rate is right at the critical learning rate of instability. This observation contradicts naive theoretical expectations. Let $\eta_{\rm sta}$ denote the largest stable learning rate. Close to a local minimum, one can expand the loss function to second order to show that the value of the loss function $L$ is proportional to ${\rm Tr}[\Sigma]$, where $\Sigma$ is the parameter covariance. However, $\Sigma\propto 1/(\eta_{\rm sta}-\eta)$ should be very large [45, 50, 20], and therefore $L$ should diverge. Thus, the edge-of-stability phenomenon is incompatible with this naive second-order expectation, as pointed out in Ref. [6]. Our theory offers a direct explanation of why the loss does not diverge: for deeper models, SGD always has a finite loss because of the power-law tail and fluctuation inversion. See Figure 4, right.

Figure 5: Loss landscape and noise covariance of a two-layer linear network with one hidden neuron and $\gamma=0.005$. The orange dashed curve shows the noise covariance $C(w,u)$ along $w=u$. The shape of the gradient noise is, in general, a more complicated function than the landscape itself.

3.4 Role of Width

As discussed, for $D>1$, the model width $d$ directly affects the stationary distribution of SGD. However, the integral in the exponent of Eq. (13) cannot be carried out analytically for generic $D$. Only two cases admit an analytical solution: $D=1$ and $D\to\infty$. We thus consider the case $D\to\infty$ to study the effect of $d$.

As $D$ tends to infinity, the distribution becomes

$p(v)\propto\frac{1}{v^{3+k_{1}}(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})^{1-k_{1}/2}}\exp\left(-\frac{d}{DT}\left(\frac{\beta_{2}}{\alpha_{3}v}+\frac{\alpha_{2}\alpha_{3}\beta_{1}-2\alpha_{2}^{2}\beta_{2}+\alpha_{1}\alpha_{3}\beta_{2}}{\alpha_{3}^{2}\sqrt{\Delta}}\arctan\left(\frac{\alpha_{1}v-\alpha_{2}}{\sqrt{\Delta}}\right)\right)\right)$, (17)

where $k_{1}=d(\alpha_{3}\beta_{1}-2\alpha_{2}\beta_{2})/(TD\alpha_{3}^{2})$. The first striking feature is that the architecture ratio $d/D$ always appears together with $1/T$. This implies that for a sufficiently deep neural network, the ratio $D/d$ also becomes proportional to the strength of the noise. Since we know that $T=\eta/S$ determines the performance of SGD (scaling $\eta$ with $1/S$ is known as the learning-rate/batch-size scaling law [12]), our result implies an extended scaling law of training: $\frac{d}{D}\frac{S}{\eta}={\rm const}$. For example, if we want to scale up the depth without changing the width, we can increase the learning rate proportionally or decrease the batch size. This scaling law thus links the learning rate, the batch size, and the model width and depth. The architectural aspect of the scaling law also agrees with the suggestions in Refs. [10, 11], where the optimal architecture is found to have a constant ratio $d/D$. See Figure 4.

Now, let us fix $T$ and understand the different limits of the stationary distribution, which are decided by how $d$ scales as we scale up $D$. There are three situations: (1) $d=o(D)$, (2) $d=c_{0}D$ for a constant $c_{0}$, and (3) $d=\omega(D)$. If $d=o(D)$, then $k_{1}\to 0$ and the distribution converges to $p(v)\propto v^{-3}(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})^{-1}$, which is a delta distribution at $0$. Namely, if the width is far smaller than the depth, the model collapses, and no learning happens under SGD; we should therefore increase the model width as we increase the depth. In the second case, $d/D$ is a constant that can be absorbed into the definition of $T$; this is the only limit in which we obtain a nontrivial distribution with a finite spread. If $d=\omega(D)$, a saddle point approximation shows that the distribution becomes a delta distribution at the global minimum of the loss landscape, $p(v)=\delta(v-\beta_{2}/\beta_{1})$: the learned model sits deterministically at the global minimum.

4 Discussion

The most important implication of our theory is that the behavior of SGD cannot be understood through gradient flow or a simple Langevin approximation. Even a perturbative amount of noise in SGD leads to an order-1 change in the stationary distribution of the solution. Our result suggests that one promising way to understand SGD is to study its behavior on a landscape from the viewpoint of symmetries. We showed that SGD systematically moves towards a balanced solution when the rescaling symmetry exists. Likewise, it is not difficult to imagine that for other types of symmetries, SGD will also exhibit interesting systematic deviations from gradient flow and Brownian motion. An important future direction is thus to understand and characterize the dynamics of SGD on loss functions with different types of symmetries.

By utilizing the symmetry conditions in the loss landscape, we are able to characterize the stationary distribution of SGD analytically. To the best of our knowledge, this is the first analytical expression for the stationary distribution of SGD obtained for a globally nonconvex and highly nonlinear loss function without the need for any approximation. With this solution, we have demonstrated many phenomena of deep learning that were previously unknown. For example, we showed the qualitative difference between networks of different depths, the finiteness of the training loss, the fluctuation inversion effect, the loss of ergodicity, and the incapability of learning a wrong sign for a deep model.

Lastly, let us return to the original question raised in the Introduction: why is the Gibbs measure a bad model of SGD? When the number of data points $N\gg S$, a standard computation shows that the noise covariance of SGD takes the form $C(\theta)=T(\mathbb{E}_{x}[(\nabla_{\theta}\ell)(\nabla_{\theta}\ell)^{T}]-(\nabla_{\theta}L)(\nabla_{\theta}L)^{T})$, which is nothing but the covariance of the per-sample gradients. A key feature of this noise is that it depends on the dynamical variable $\theta$ in a highly nontrivial manner. See Figure 5 for an illustration of the landscape against $C$: the shape of $C(\theta)$ generally changes faster than the loss landscape. For the Gibbs distribution to hold (at least locally), we need $C(\theta)$ to change much more slowly than $L(\theta)$. A good criterion is thus to compare the relative magnitudes of $||\nabla L||$ and $||\nabla{\rm Tr}[C]||$, which tells us which term changes faster. When $||\nabla{\rm Tr}[C]||$ is larger, unexpected phenomena will occur, and one must take the parameter dependence of $C$ into account to understand SGD. This comparison is sketched below for the two-layer example.
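The sketch is our own (the data and the evaluation point are arbitrary choices): it estimates $C(\theta)$ from per-sample gradients and compares $||\nabla L||$ with a finite-difference estimate of $||\nabla{\rm Tr}[C]||$ at a point on the minimum valley.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=50_000)
y = 0.5 * x + 0.5 * rng.normal(size=50_000)
T = 0.1                                   # eta / S

def grads(u, w):
    r = 2 * (u * w * x - y) * x           # per-sample d ell / d v
    return np.stack([r * w, r * u], 1)    # per-sample [d ell/du, d ell/dw]

def trace_C(u, w):
    G = grads(u, w)
    cov = G.T @ G / G.shape[0] - np.outer(G.mean(0), G.mean(0))
    return T * np.trace(cov)              # Tr C(theta), with C = T * gradient covariance

u, w = 1.0, 0.5                           # a point on the valley uw = beta2/beta1
gL = grads(u, w).mean(0)                  # gradient of the training loss L
eps = 1e-4                                # finite-difference gradient of Tr C
gTrC = np.array([(trace_C(u+eps, w) - trace_C(u-eps, w)) / (2*eps),
                 (trace_C(u, w+eps) - trace_C(u, w-eps)) / (2*eps)])
print("||grad L||     =", np.linalg.norm(gL))    # ~0 on the valley (up to sampling error)
print("||grad Tr[C]|| =", np.linalg.norm(gTrC))  # stays finite along the valley
```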

Acknowledgement

We thank Prof. Tsunetsugu for the discussion on ergodicity. We also thank Shi Chen for valuable discussions about symmetry. This work is financially supported by a research grant from JSPS (Grant No. JP22H01152).

References

  • [1] John C Baez and Brendan Fong. A noether theorem for markov processes. Journal of Mathematical Physics, 54(1):013301, 2013.
  • [2] David GT Barrett and Benoit Dherin. Implicit gradient regularization. arXiv preprint arXiv:2009.11162, 2020.
  • [3] Raphaël Berthier. Incremental learning in diagonal linear networks. Journal of Machine Learning Research, 24(171):1–26, 2023.
  • [4] Feng Chen, Daniel Kunin, Atsushi Yamamura, and Surya Ganguli. Stochastic collapse: How gradient noise attracts sgd dynamics towards simpler subnetworks. arXiv preprint arXiv:2306.04251, 2023.
  • [5] Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. arXiv preprint arXiv:2103.00065, 2021.
  • [6] Alex Damian, Eshaan Nichani, and Jason D Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. arXiv preprint arXiv:2209.15594, 2022.
  • [7] Simon S Du, Wei Hu, and Jason D Lee. Algorithmic regularization in learning deep homogeneous models: Layers are automatically balanced. Advances in neural information processing systems, 31, 2018.
  • [8] Mathieu Even, Scott Pesme, Suriya Gunasekar, and Nicolas Flammarion. (S)GD over diagonal linear networks: Implicit regularisation, large stepsizes and edge of stability. arXiv preprint arXiv:2302.08982, 2023.
  • [9] Xavier Fontaine, Valentin De Bortoli, and Alain Durmus. Convergence rates and approximation results for sgd and its continuous-time counterpart. In Mikhail Belkin and Samory Kpotufe, editors, Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 1965–2058. PMLR, 15–19 Aug 2021.
  • [10] Boris Hanin. Which neural net architectures give rise to exploding and vanishing gradients? Advances in neural information processing systems, 31, 2018.
  • [11] Boris Hanin and David Rolnick. How to start training: The effect of initialization and architecture. Advances in Neural Information Processing Systems, 31, 2018.
  • [12] Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1731–1741, 2017.
  • [13] Wenqing Hu, Chris Junchi Li, Lei Li, and Jian-Guo Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017.
  • [14] Kenji Kawaguchi. Deep learning without poor local minima. Advances in Neural Information Processing Systems, 29:586–594, 2016.
  • [15] Daniel Kunin, Javier Sagastuy-Brena, Surya Ganguli, Daniel LK Yamins, and Hidenori Tanaka. Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics. arXiv preprint arXiv:2012.04728, 2020.
  • [16] Jonas Latz. Analysis of stochastic gradient descent in continuous time. Statistics and Computing, 31(4):39, 2021.
  • [17] Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and dynamics of stochastic gradient algorithms i: Mathematical foundations. Journal of Machine Learning Research, 20(40):1–47, 2019.
  • [18] Qianyi Li and Haim Sompolinsky. Statistical mechanics of deep linear neural networks: The backpropagating kernel renormalization. Physical Review X, 11(3):031059, 2021.
  • [19] Zhiyuan Li, Sadhika Malladi, and Sanjeev Arora. On the validity of modeling sgd with stochastic differential equations (sdes), 2021.
  • [20] Kangqiao Liu, Liu Ziyin, and Masahito Ueda. Noise and fluctuation of finite learning rate stochastic gradient descent, 2021.
  • [21] Siyuan Ma, Raef Bassily, and Mikhail Belkin. The power of interpolation: Understanding the effectiveness of sgd in modern over-parametrized learning. In International Conference on Machine Learning, pages 3325–3334. PMLR, 2018.
  • [22] Agnieszka B Malinowska and Moulay Rchid Sidi Ammi. Noether’s theorem for control problems on time scales. arXiv preprint arXiv:1406.0705, 2014.
  • [23] Stephan Mandt, Matthew D Hoffman, and David M Blei. Stochastic gradient descent as approximate bayesian inference. Journal of Machine Learning Research, 18:1–35, 2017.
  • [24] John C Mauro, Prabhat K Gupta, and Roger J Loucks. Continuously broken ergodicity. The Journal of chemical physics, 126(18), 2007.
  • [25] Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, and Ard A Louis. Is sgd a bayesian sampler? well, almost. The Journal of Machine Learning Research, 22(1):3579–3642, 2021.
  • [26] Tetsuya Misawa. Noether’s theorem in symmetric stochastic calculus of variations. Journal of mathematical physics, 29(10):2178–2180, 1988.
  • [27] Takashi Mori, Liu Ziyin, Kangqiao Liu, and Masahito Ueda. Power-law escape rate of sgd. In International Conference on Machine Learning, pages 15959–15975. PMLR, 2022.
  • [28] Mor Shpigel Nacson, Kavya Ravichandran, Nathan Srebro, and Daniel Soudry. Implicit bias of the step size in linear diagonal neural networks. In International Conference on Machine Learning, pages 16270–16295. PMLR, 2022.
  • [29] Richard G Palmer. Broken ergodicity. Advances in Physics, 31(6):669–735, 1982.
  • [30] Scott Pesme, Loucas Pillaud-Vivien, and Nicolas Flammarion. Implicit bias of sgd for diagonal linear networks: a provable benefit of stochasticity. Advances in Neural Information Processing Systems, 34:29218–29230, 2021.
  • [31] Tomasz Rolski, Hanspeter Schmidli, Volker Schmidt, and Jozef L Teugels. Stochastic processes for insurance and finance. John Wiley & Sons, 2009.
  • [32] N. Shirish Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. ArXiv e-prints, September 2016.
  • [33] Justin Sirignano and Konstantinos Spiliopoulos. Stochastic gradient descent in continuous time: A central limit theorem. Stochastic Systems, 10(2):124–151, 2020.
  • [34] Samuel L Smith, Benoit Dherin, David GT Barrett, and Soham De. On the origin of implicit regularization in stochastic gradient descent. arXiv preprint arXiv:2101.12176, 2021.
  • [35] D Thirumalai and Raymond D Mountain. Activated dynamics, loss of ergodicity, and transport in supercooled liquids. Physical Review E, 47(1):479, 1993.
  • [36] Christopher J Turner, Alexios A Michailidis, Dmitry A Abanin, Maksym Serbyn, and Zlatko Papić. Weak ergodicity breaking from quantum many-body scars. Nature Physics, 14(7):745–749, 2018.
  • [37] Nicolaas Godfried Van Kampen. Stochastic processes in physics and chemistry, volume 1. Elsevier, 1992.
  • [38] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  • [39] Peter Walters. An introduction to ergodic theory, volume 79. Springer Science & Business Media, 2000.
  • [40] Yuqing Wang, Minshuo Chen, Tuo Zhao, and Molei Tao. Large learning rate tames homogeneity: Convergence and balancing effect, 2022.
  • [41] Zihao Wang and Liu Ziyin. Posterior collapse of a linear latent variable model. arXiv preprint arXiv:2205.04009, 2022.
  • [42] Blake Woodworth, Kumar Kshitij Patel, Sebastian Stich, Zhen Dai, Brian Bullins, Brendan Mcmahan, Ohad Shamir, and Nathan Srebro. Is local sgd better than minibatch sgd? In International Conference on Machine Learning, pages 10334–10343. PMLR, 2020.
  • [43] Lei Wu, Chao Ma, et al. How sgd selects the global minima in over-parameterized learning: A dynamical stability perspective. Advances in Neural Information Processing Systems, 31, 2018.
  • [44] Zeke Xie, Issei Sato, and Masashi Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. arXiv preprint arXiv:2002.03495, 2020.
  • [45] Sho Yaida. Fluctuation-dissipation relations for stochastic gradient descent. arXiv preprint arXiv:1810.00004, 2018.
  • [46] Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. arXiv preprint arXiv:1803.00195, 2018.
  • [47] Liu Ziyin, Botao Li, Tomer Galanti, and Masahito Ueda. The probabilistic stability of stochastic gradient descent. arXiv preprint arXiv:2303.13093, 2023.
  • [48] Liu Ziyin, Botao Li, and Xiangming Meng. Exact solutions of a deep linear network. In Advances in Neural Information Processing Systems, 2022.
  • [49] Liu Ziyin, Botao Li, James B. Simon, and Masahito Ueda. Sgd may never escape saddle points, 2021.
  • [50] Liu Ziyin, Kangqiao Liu, Takashi Mori, and Masahito Ueda. Strength of minibatch noise in SGD. In International Conference on Learning Representations, 2022.
  • [51] Liu Ziyin and Masahito Ueda. Exact phase transitions in deep learning. arXiv preprint arXiv:2205.12510, 2022.
  • [52] Difan Zou, Jingfeng Wu, Vladimir Braverman, Quanquan Gu, and Sham Kakade. Benign overfitting of constant-stepsize sgd for linear regression. In Conference on Learning Theory, pages 4633–4635. PMLR, 2021.

Appendix A Theoretical Considerations

A.1 Background

A.1.1 Ito’s Lemma

Let us consider the following stochastic differential equation (SDE) for a Wiener process $W(t)$:

$dX_{t}=\mu_{t}dt+\sigma_{t}dW(t)$. (18)

We are interested in the dynamics of a generic function of $X_{t}$. Let $Y_{t}=f(t,X_{t})$; Ito's lemma states that the SDE for the new variable is

$df(t,X_{t})=\left(\frac{\partial f}{\partial t}+\mu_{t}\frac{\partial f}{\partial X_{t}}+\frac{\sigma_{t}^{2}}{2}\frac{\partial^{2}f}{\partial X_{t}^{2}}\right)dt+\sigma_{t}\frac{\partial f}{\partial X_{t}}dW(t)$. (19)

Let us take the variable $Y_{t}=X_{t}^{2}$ as an example. Then the SDE is

$dY_{t}=\left(2\mu_{t}X_{t}+\sigma_{t}^{2}\right)dt+2\sigma_{t}X_{t}dW(t)$. (20)
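A quick way to convince oneself of Eq. (20) is to integrate Eqs. (18) and (20) with the same noise realization and check that $Y_{t}$ tracks $X_{t}^{2}$. A minimal Euler-Maruyama sketch of our own (constant $\mu_{t}$ and $\sigma_{t}$ for simplicity):

```python
import numpy as np

rng = np.random.default_rng(6)
mu, sigma, dt, steps, n = 0.3, 0.5, 1e-3, 2000, 10_000
X = np.ones(n)                 # n independent paths, X_0 = 1
Y = X**2                       # Y_0 = X_0^2
for _ in range(steps):
    dW = rng.normal(scale=np.sqrt(dt), size=n)
    Y += (2*mu*X + sigma**2) * dt + 2*sigma*X*dW   # Eq. (20), same noise as X
    X += mu*dt + sigma*dW                          # Eq. (18)
print(np.abs(Y - X**2).mean())  # small: Y tracks X^2 up to discretization error
```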

Let us consider another example. Let two variables $X_{t}$ and $Y_{t}$ follow

$dX_{t}=\mu_{t}dt+\sigma_{t}dW(t)$,
$dY_{t}=\lambda_{t}dt+\phi_{t}dW(t)$. (21)

The SDE of $X_{t}Y_{t}$ is given by

$d(X_{t}Y_{t})=(\mu_{t}Y_{t}+\lambda_{t}X_{t}+\sigma_{t}\phi_{t})dt+(\sigma_{t}Y_{t}+\phi_{t}X_{t})dW(t)$. (22)

A.1.2 Fokker Planck Equation

The general SDE of a 1d variable $X$ is given by

$dX=-\mu(X)dt+B(X)dW(t)$. (23)

The time evolution of the probability density $P(X,t)$ is given by the Fokker-Planck equation:

$\frac{\partial P(X,t)}{\partial t}=-\frac{\partial}{\partial X}J(X,t)$, (24)

where $J(X,t)=-\mu(X)P(X,t)-\frac{1}{2}\frac{\partial}{\partial X}[B^{2}(X)P(X,t)]$ is the probability current. The stationary distribution, satisfying $\partial P(X,t)/\partial t=0$, is

$P(X)\propto\frac{1}{B^{2}(X)}\exp\left[-\int dX\,\frac{2\mu(X)}{B^{2}(X)}\right]:=\tilde{P}(X)$, (25)

which reduces to a Boltzmann-type distribution when $B$ is a constant. We will apply Eq. (25) to determine the stationary distributions in the following sections. A numerical sanity check of Eq. (25) is sketched below.
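Eq. (25) can be verified on a toy SDE with state-dependent diffusion. Taking $\mu(X)=X$ and $B(X)=\sqrt{1+X^{2}}$ (our own choice), Eq. (25) predicts $P(X)\propto(1+X^{2})^{-2}$, which the following Euler-Maruyama sketch reproduces:

```python
import numpy as np

rng = np.random.default_rng(7)
dt, steps, n = 1e-3, 50_000, 5_000
mu = lambda x: x                    # drift -mu(X) in Eq. (23)
B = lambda x: np.sqrt(1 + x**2)     # state-dependent diffusion

X = np.zeros(n)                     # n independent paths
for _ in range(steps):
    X += -mu(X) * dt + B(X) * rng.normal(scale=np.sqrt(dt), size=n)

# Eq. (25): P(X) propto B^{-2} exp[-int 2 mu/B^2] = (1+X^2)^{-2}; normalization pi/2
grid = np.linspace(-4, 4, 200)
P = (1 + grid**2) ** -2 / (np.pi / 2)
hist, edges = np.histogram(X, bins=40, range=(-4, 4), density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print(np.abs(hist - np.interp(centers, grid, P)).max())  # small sampling error
```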

A.2 Proof of Theorem 1

Proof.

By the definition of the symmetry, $\ell(\mathbf{u},\mathbf{w},x)=\ell(\lambda\mathbf{u},\mathbf{w}/\lambda,x)$; its infinitesimal version is $\ell(\mathbf{u},\mathbf{w},x)=\ell((1+\epsilon)\mathbf{u},(1-\epsilon)\mathbf{w},x)$. Expanding this to first order in $\epsilon$, we obtain

$\sum_{i}u_{i}\frac{\partial\ell}{\partial u_{i}}=\sum_{j}w_{j}\frac{\partial\ell}{\partial w_{j}}$. (26)

The equations of motion are

$\frac{du_{i}}{dt}=-\frac{\partial\ell}{\partial u_{i}}$, (27)
$\frac{dw_{j}}{dt}=-\frac{\partial\ell}{\partial w_{j}}$. (28)

Using Ito's lemma, we can find the equations governing the evolutions of $u_{i}^{2}$ and $w_{j}^{2}$:

$\frac{du_{i}^{2}}{dt}=2u_{i}\frac{du_{i}}{dt}+\frac{(du_{i})^{2}}{dt}=-2u_{i}\frac{\partial\ell}{\partial u_{i}}+TC_{i}^{u}$,
$\frac{dw_{j}^{2}}{dt}=2w_{j}\frac{dw_{j}}{dt}+\frac{(dw_{j})^{2}}{dt}=-2w_{j}\frac{\partial\ell}{\partial w_{j}}+TC_{j}^{w}$, (29)

where $C_{i}^{u}={\rm Var}[\frac{\partial\ell}{\partial u_{i}}]$ and $C_{j}^{w}={\rm Var}[\frac{\partial\ell}{\partial w_{j}}]$. With Eq. (26), we obtain

$\frac{d}{dt}(||u||^{2}-||w||^{2})=-T\left(\sum_{j}C_{j}^{w}-\sum_{i}C_{i}^{u}\right)=-T\left(\sum_{j}{\rm Var}\left[\frac{\partial\ell}{\partial w_{j}}\right]-\sum_{i}{\rm Var}\left[\frac{\partial\ell}{\partial u_{i}}\right]\right)$. (30)

Due to the rescaling symmetry, the loss function can be regarded as a function of the matrix $uw^{T}$. We define a new loss function $\tilde{\ell}(u_{i}w_{j})=\ell(u_{i},w_{j})$. Hence, we have

$\frac{\partial\ell}{\partial w_{j}}=\sum_{i}u_{i}\frac{\partial\tilde{\ell}}{\partial(u_{i}w_{j})},\qquad\frac{\partial\ell}{\partial u_{i}}=\sum_{j}w_{j}\frac{\partial\tilde{\ell}}{\partial(u_{i}w_{j})}$. (31)

We can rewrite Eq. (30) as

$\frac{d}{dt}(||u||^{2}-||w||^{2})=-T(u^{T}C_{1}u-w^{T}C_{2}w)$, (32)

where

$(C_{1})_{ij}=\mathbb{E}\left[\sum_{k}\frac{\partial\tilde{\ell}}{\partial(u_{i}w_{k})}\frac{\partial\tilde{\ell}}{\partial(u_{j}w_{k})}\right]-\sum_{k}\mathbb{E}\left[\frac{\partial\tilde{\ell}}{\partial(u_{i}w_{k})}\right]\mathbb{E}\left[\frac{\partial\tilde{\ell}}{\partial(u_{j}w_{k})}\right]\equiv\mathbb{E}[A^{T}A]-\mathbb{E}[A^{T}]\mathbb{E}[A]$, (33)

$(C_{2})_{kl}=\mathbb{E}\left[\sum_{i}\frac{\partial\tilde{\ell}}{\partial(u_{i}w_{k})}\frac{\partial\tilde{\ell}}{\partial(u_{i}w_{l})}\right]-\sum_{i}\mathbb{E}\left[\frac{\partial\tilde{\ell}}{\partial(u_{i}w_{k})}\right]\mathbb{E}\left[\frac{\partial\tilde{\ell}}{\partial(u_{i}w_{l})}\right]\equiv\mathbb{E}[AA^{T}]-\mathbb{E}[A]\mathbb{E}[A^{T}]$, (34)

where

$(A)_{ki}\equiv\frac{\partial\tilde{\ell}}{\partial(u_{i}w_{k})}$. (35)

The proof is thus complete. ∎

A.3 Proof of Theorem 2

Proof.

This proof is based on the fact that if a certain condition is satisfied for all trajectories with probability 1, this condition is satisfied by the stationary distribution of the dynamics with probability 1.

Let us first consider the case $D>1$. We first show that any trajectory satisfies at least one of the following three conditions: for any $i$, (i) $v_{i}\to 0$, (ii) $L(\theta)\to 0$, or (iii) for any $k\neq l$, $(u_{i}^{(k)})^{2}-(u_{i}^{(l)})^{2}\to 0$.

The SDE for $u_{i}^{(k)}$ is

$\frac{du_{i}^{(k)}}{dt}=-2\frac{v_{i}}{u_{i}^{(k)}}(\beta_{1}v-\beta_{2})+2\frac{v_{i}}{u_{i}^{(k)}}\sqrt{\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})}\frac{dW}{dt}$, (36)

where $v_{i}:=\prod_{k=0}^{D}u_{i}^{(k)}$ and $v=\sum_{i}v_{i}$, as in the main text. There exists a rescaling symmetry between $u_{i}^{(k)}$ and $u_{i}^{(l)}$ for $k\neq l$. By the law of balance, we have

$\frac{d}{dt}[(u_{i}^{(k)})^{2}-(u_{i}^{(l)})^{2}]=-T[(u_{i}^{(k)})^{2}-(u_{i}^{(l)})^{2}]{\rm Var}\left[\frac{\partial\ell}{\partial(u_{i}^{(k)}u_{i}^{(l)})}\right]$, (37)

where

${\rm Var}\left[\frac{\partial\ell}{\partial(u_{i}^{(k)}u_{i}^{(l)})}\right]=\left(\frac{v_{i}}{u_{i}^{(k)}u_{i}^{(l)}}\right)^{2}(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})$ (38)

with $v_{i}/(u_{i}^{(k)}u_{i}^{(l)})=\prod_{s\neq k,l}u_{i}^{(s)}$. In the long-time limit, $(u_{i}^{(k)})^{2}$ converges to $(u_{i}^{(l)})^{2}$ unless ${\rm Var}[\partial\ell/\partial(u_{i}^{(k)}u_{i}^{(l)})]=0$, which is equivalent to $v_{i}/(u_{i}^{(k)}u_{i}^{(l)})=0$ or $\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}=0$. These two cases correspond to conditions (i) and (ii), respectively; the latter because $\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}=0$ holds if and only if $v=\alpha_{2}/\alpha_{1}$ and $\alpha_{2}^{2}-\alpha_{1}\alpha_{3}=0$, which together imply $L(\theta)=0$. Therefore, at stationarity, we must have condition (i), (ii), or (iii).

Now, we prove that when (iii) holds, condition 2-(b) in the theorem statement must hold. When (iii) holds, there are two situations. First, if $v_{i}=0$, we have $u_{i}^{(k)}=0$ for all $k$, and $v_{i}$ will stay $0$ for the rest of the trajectory, which corresponds to condition (i).

If vi0v_{i}\neq 0, we have ui(k)0u_{i}^{(k)}\neq 0 for all kk. Therefore, the dynamics of viv_{i} is

dvidt=2k(viui(k))2(β1vβ2)+2k(viui(k))2η(α1v22α2v+α3)dWdt+4k,l(vi3(ui(k)ui(l))2)η(α1v22α2v+α3).\frac{dv_{i}}{dt}=-2\sum_{k}\left(\frac{v_{i}}{u_{i}^{(k)}}\right)^{2}(\beta_{1}v-\beta_{2})+2\sum_{k}\left(\frac{v_{i}}{u_{i}^{(k)}}\right)^{2}\sqrt{\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})}\frac{dW}{dt}+4\sum_{k,l}\left(\frac{v_{i}^{3}}{(u_{i}^{(k)}u_{i}^{(l)})^{2}}\right)\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (39)

Comparing the dynamics of viv_{i} and vjv_{j} for iji\neq j, we obtain

dvi/dtk(vi/ui(k))2dvj/dtk(vj/uj(k))2\displaystyle\frac{dv_{i}/dt}{\sum_{k}(v_{i}/u_{i}^{(k)})^{2}}-\frac{dv_{j}/dt}{\sum_{k}(v_{j}/u_{j}^{(k)})^{2}} =4(m,lvi3/(ui(m)ui(l))2k(vi/ui(k))2m,lvj3/(uj(m)uj(l))2k(vj/uj(k))2)η(α1v22α2v+α3)\displaystyle=4\left(\frac{\sum_{m,l}v_{i}^{3}/(u_{i}^{(m)}u_{i}^{(l)})^{2}}{\sum_{k}(v_{i}/u_{i}^{(k)})^{2}}-\frac{\sum_{m,l}v_{j}^{3}/(u_{j}^{(m)}u_{j}^{(l)})^{2}}{\sum_{k}(v_{j}/u_{j}^{(k)})^{2}}\right)\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})
=4(vim,lvi2/(ui(m)ui(l))2k(vi/ui(k))2vjm,lvj2/(uj(m)uj(l))2k(vj/uj(k))2)η(α1v22α2v+α3).\displaystyle=4\left(v_{i}\frac{\sum_{m,l}v_{i}^{2}/(u_{i}^{(m)}u_{i}^{(l)})^{2}}{\sum_{k}(v_{i}/u_{i}^{(k)})^{2}}-v_{j}\frac{\sum_{m,l}v_{j}^{2}/(u_{j}^{(m)}u_{j}^{(l)})^{2}}{\sum_{k}(v_{j}/u_{j}^{(k)})^{2}}\right)\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (40)

By condition (iii), we have $|u_{i}^{(0)}|=\cdots=|u_{i}^{(D)}|$, i.e., $(v_{i}/u_{i}^{(k)})^{2}=(v_{i}^{2})^{D/(D+1)}$ and $(v_{i}/(u_{i}^{(m)}u_{i}^{(l)}))^{2}=(v_{i}^{2})^{(D-1)/(D+1)}$, where we only take the root on the positive real axis. Therefore, we obtain

\frac{dv_{i}/dt}{(D+1)(v_{i}^{2})^{D/(D+1)}}-\frac{dv_{j}/dt}{(D+1)(v_{j}^{2})^{D/(D+1)}}=4\left(v_{i}\frac{D(v_{i}^{2})^{(D-1)/(D+1)}}{2(v_{i}^{2})^{D/(D+1)}}-v_{j}\frac{D(v_{j}^{2})^{(D-1)/(D+1)}}{2(v_{j}^{2})^{D/(D+1)}}\right)\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (41)

We first consider the case where viv_{i} and vjv_{j} initially share the same sign (both positive or both negative). When D>1D>1, the left-hand side of Eq. (41) can be written as

11Ddvi2/(D+1)1dt+4Dvi12/(D+1)η(α1v22α2v+α3)11Ddvj2/(D+1)1dt4Dvj12/(D+1)η(α1v22α2v+α3),\frac{1}{1-D}\frac{dv_{i}^{2/(D+1)-1}}{dt}+4Dv_{i}^{1-2/(D+1)}\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})-\frac{1}{1-D}\frac{dv_{j}^{2/(D+1)-1}}{dt}-4Dv_{j}^{1-2/(D+1)}\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}), (42)

which follows from Ito’s lemma:

dvi2/(D+1)1dt\displaystyle\frac{dv_{i}^{2/(D+1)-1}}{dt} =(2D+11)vi2/(D+1)2dvidt+2(2D+11)(2D+12)vi2/(D+1)3(k(viui(k))2η(α1v22α2v+α3))2\displaystyle=\left(\frac{2}{D+1}-1\right)v_{i}^{2/(D+1)-2}\frac{dv_{i}}{dt}+2(\frac{2}{D+1}-1)(\frac{2}{D+1}-2)v_{i}^{2/(D+1)-3}\left(\sum_{k}(\frac{v_{i}}{u_{i}^{(k)}})^{2}\sqrt{\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})}\right)^{2}
=(2D+11)vi2/(D+1)2dvidt+4D(D1)vi12/(D+1)η(α1v22α2v+α3).\displaystyle=(\frac{2}{D+1}-1)v_{i}^{2/(D+1)-2}\frac{dv_{i}}{dt}+4D(D-1)v_{i}^{1-2/(D+1)}\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (43)

Substituting this into Eq. (41), we obtain Eq. (42).
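The coefficient $4D(D-1)$ in Eq. (43) is easy to get wrong; the following symbolic sketch (our own verification, not from the paper) confirms that the second-order Itô term has exactly this form.

```python
import sympy as sp

# Symbolic check of the Ito correction in Eq. (43): the second-order term
# should equal 4 D (D - 1) v^{1 - 2/(D+1)} (times the eta(...) factor).
D, v = sp.symbols('D v', positive=True)
noise_coef = (D + 1) * v**(2*D/(D + 1))          # sum_k (v/u^(k))^2 under (iii)
second_order = (2 * (2/(D + 1) - 1) * (2/(D + 1) - 2)
                * v**(2/(D + 1) - 3) * noise_coef**2)
target = 4 * D * (D - 1) * v**(1 - 2/(D + 1))
print(sp.simplify(second_order - target))        # expected output: 0
```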

Now, we consider the right-hand side of Eq. (41), which is given by

2Dvi12/(D+1)η(α1v22α2v+α3)2Dvj12/(D+1)η(α1v22α2v+α3).2Dv_{i}^{1-2/(D+1)}\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})-2Dv_{j}^{1-2/(D+1)}\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (44)

Combining Eq. (42) and Eq. (44), we obtain

11Ddvi2/(D+1)1dt11Ddvj2/(D+1)1dt=2D(vi12/(D+1)vj12/(D+1))η(α1v22α2v+α3).\frac{1}{1-D}\frac{dv_{i}^{2/(D+1)-1}}{dt}-\frac{1}{1-D}\frac{dv_{j}^{2/(D+1)-1}}{dt}=-2D(v_{i}^{1-2/(D+1)}-v_{j}^{1-2/(D+1)})\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (45)

By defining zi=vi2/(D+1)1z_{i}=v_{i}^{2/(D+1)-1}, we can further simplify the dynamics:

d(zizj)dt\displaystyle\frac{d(z_{i}-z_{j})}{dt} =2D(D1)(1zi1zj)η(α1v22α2v+α3)\displaystyle=2D(D-1)\left(\frac{1}{z_{i}}-\frac{1}{z_{j}}\right)\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})
=2D(D1)zizjzizjη(α1v22α2v+α3).\displaystyle=-2D(D-1)\frac{z_{i}-z_{j}}{z_{i}z_{j}}\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (46)

Hence,

z_{i}(t)-z_{j}(t)=[z_{i}(0)-z_{j}(0)]\exp\left[-\int_{0}^{t}dt^{\prime}\frac{2D(D-1)}{z_{i}z_{j}}\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})\right]. (47)

Therefore, if viv_{i} and vjv_{j} initially have the same sign, they will decay to the same value in the long-time limit tt\to\infty, which gives condition 2-(b). When viv_{i} and vjv_{j} initially have different signs, we can write Eq. (41) as

d|vi|/dt(D+1)(|vi|2)D/(D+1)+d|vj|/dt(D+1)(|vj|2)D/(D+1)=\displaystyle\frac{d|v_{i}|/dt}{(D+1)(|v_{i}|^{2})^{D/(D+1)}}+\frac{d|v_{j}|/dt}{(D+1)(|v_{j}|^{2})^{D/(D+1)}}= (|vi|D(|vi|2)(D1)/(D+1)2(|vi|2)D/(D+1)+|vj|D(|vj|2)(D1)/(D+1)2(|vj|2)D/(D+1))\displaystyle\left(|v_{i}|\frac{D(|v_{i}|^{2})^{(D-1)/(D+1)}}{2(|v_{i}|^{2})^{D/(D+1)}}+|v_{j}|\frac{D(|v_{j}|^{2})^{(D-1)/(D+1)}}{2(|v_{j}|^{2})^{D/(D+1)}}\right)
×η(α1v22α2v+α3).\displaystyle\times\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (48)

Hence, when $D>1$, a similar procedure simplifies the equation to

11Dd|vi|2/(D+1)1dt+11Dd|vj|2/(D+1)1dt=2D(|vi|12/(D+1)+|vj|12/(D+1))η(α1v22α2v+α3).\frac{1}{1-D}\frac{d|v_{i}|^{2/(D+1)-1}}{dt}+\frac{1}{1-D}\frac{d|v_{j}|^{2/(D+1)-1}}{dt}=-2D(|v_{i}|^{1-2/(D+1)}+|v_{j}|^{1-2/(D+1)})\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (49)

Defining zi=|vi|2/(D+1)1z_{i}=|v_{i}|^{2/(D+1)-1}, we obtain

d(zi+zj)dt\displaystyle\frac{d(z_{i}+z_{j})}{dt} =2D(D1)(1zi+1zj)η(α1v22α2v+α3)\displaystyle=2D(D-1)\left(\frac{1}{z_{i}}+\frac{1}{z_{j}}\right)\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})
=2D(D1)zi+zjzizjη(α1v22α2v+α3),\displaystyle=2D(D-1)\frac{z_{i}+z_{j}}{z_{i}z_{j}}\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}), (50)

which implies

z_{i}(t)+z_{j}(t)=[z_{i}(0)+z_{j}(0)]\exp\left[\int_{0}^{t}dt^{\prime}\frac{2D(D-1)}{z_{i}z_{j}}\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})\right]. (51)

From this equation, we conclude that if $v_{i}$ and $v_{j}$ initially have different signs, then $z_{i}+z_{j}$ grows without bound; since $z_{i}=|v_{i}|^{-(D-1)/(D+1)}$, one of $v_{i}$ and $v_{j}$ must converge to $0$ in the long-time limit $t\to\infty$, corresponding to condition 1 in the theorem statement. Hence, for $D>1$, at least one of the conditions is always satisfied as $t\to\infty$.

Now, we prove the theorem for D=1D=1, which is similar to the proof above. The law of balance gives

ddt[(ui(1))2(ui(2))2]=T[(ui(1))2(ui(2))2]Var[(ui(1)ui(2))].\frac{d}{dt}[(u_{i}^{(1)})^{2}-(u_{i}^{(2)})^{2}]=-T[(u_{i}^{(1)})^{2}-(u_{i}^{(2)})^{2}]{\rm Var}\left[\frac{\partial\ell}{\partial(u_{i}^{(1)}u_{i}^{(2)})}\right]. (52)

We can see that $|u_{i}^{(1)}|\to|u_{i}^{(2)}|$ takes place unless ${\rm Var}\left[\frac{\partial\ell}{\partial(u_{i}^{(1)}u_{i}^{(2)})}\right]=0$, which is equivalent to $L(\theta)=0$ and corresponds to condition (ii). Hence, if condition (ii) is violated, condition (iii) holds: $|u_{i}^{(1)}|\to|u_{i}^{(2)}|$, and Eq. (41) can be rewritten as

dvi/dt|vi|dvj/dt|vj|=(sign(vi)sign(vj))η(α1v22α2v+α3).\frac{dv_{i}/dt}{|v_{i}|}-\frac{dv_{j}/dt}{|v_{j}|}=(\text{sign}(v_{i})-\text{sign}(v_{j}))\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (53)

When viv_{i} and vjv_{j} are both positive, we have

dvi/dtvidvj/dtvj=0.\frac{dv_{i}/dt}{v_{i}}-\frac{dv_{j}/dt}{v_{j}}=0. (54)

With Ito’s lemma, we have

dlog(vi)dt=dvividt2η(α1v22α2v+α3).\frac{d\log(v_{i})}{dt}=\frac{dv_{i}}{v_{i}dt}-2\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (55)

Therefore, Eq. (54) can be simplified to

d(log(vi)log(vj))dt=0,\frac{d(\log(v_{i})-\log(v_{j}))}{dt}=0, (56)

which indicates that all $v_{i}$ with the same sign decay at the same rate. This differs from the case of $D>1$, where all $v_{i}$ decay to the same value. Similarly, we can treat the case where $v_{i}$ and $v_{j}$ are both negative.

Now, we consider the case where viv_{i} is positive while vjv_{j} is negative and rewrite Eq. (53) as

dvi/dtvi+d(|vj|)/dt|vj|=2η(α1v22α2v+α3).\frac{dv_{i}/dt}{v_{i}}+\frac{d(|v_{j}|)/dt}{|v_{j}|}=2\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (57)

Furthermore, Ito’s lemma gives the dynamics of $\log|v_{j}|$:

\frac{d\log(|v_{j}|)}{dt}=\frac{d|v_{j}|}{|v_{j}|dt}-2\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (58)

Therefore, Eq. (57) takes the form of

d(log(vi)+log(|vj|))dt=2η(α1v22α2v+α3).\frac{d(\log(v_{i})+\log(|v_{j}|))}{dt}=-2\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (59)

In the long-time limit, $\log(v_{i}|v_{j}|)$ diverges to $-\infty$, indicating that either $v_{i}$ or $v_{j}$ decays to $0$. This corresponds to condition 1 in the theorem statement. Combining Eq. (56) and Eq. (59), we conclude that all surviving $v_{i}$ have the same sign as $t\to\infty$, which indicates condition 2-(a) if the conditions in item 1 are all violated. The proof is thus complete. ∎
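As an informal illustration of the theorem for $D=1$ (our own toy experiment; the data and hyperparameters are arbitrary), one can run minibatch SGD on a width-2 diagonal linear network initialized so that $v_{1}$ and $v_{2}$ have opposite signs; the component whose sign disagrees with the target slope collapses toward zero, as condition 1 predicts.

```python
import numpy as np

# Toy illustration of Theorem 2 for D = 1: with v_i = u_i * w_i of mixed
# signs at initialization, the component whose sign disagrees with the
# target slope should collapse toward zero under minibatch SGD.
# Data and hyperparameters are arbitrary choices of ours.
rng = np.random.default_rng(1)
n = 1024
X = rng.normal(size=n)
Y = X + 0.2 * rng.normal(size=n)      # target slope k = 1 > 0

u = np.array([0.9, 0.8])
w = np.array([0.9, -0.8])             # v = u * w = (0.81, -0.64): mixed signs
eta, S = 1e-2, 4
for t in range(100000):
    idx = rng.integers(0, n, size=S)
    g = np.mean((np.sum(u * w) * X[idx] - Y[idx]) * X[idx])
    u, w = u - eta * g * w, w - eta * g * u
print(u * w)                          # the negative component should be near 0
```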

A.4 Stationary distribution in Eq. (13)

Following Eq. (39), we substitute $u_{i}^{(k)}=v_{i}^{1/(D+1)}$ for arbitrary $k$ and obtain

dvidt=\displaystyle\frac{dv_{i}}{dt}= 2(D+1)|vi|2D/(D+1)(β1vβ2)+2(D+1)|vi|2D/(D+1)η(α1v22α2v+α3)dWdt\displaystyle-2(D+1)|v_{i}|^{2D/(D+1)}(\beta_{1}v-\beta_{2})+2(D+1)|v_{i}|^{2D/(D+1)}\sqrt{\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})}\frac{dW}{dt}
+2(D+1)Dvi3|vi|4/(D+1)η(α1v22α2v+α3).\displaystyle+2(D+1)Dv_{i}^{3}|v_{i}|^{-4/(D+1)}\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (60)

With Eq. (47), we see that for arbitrary $i$ and $j$, $v_{i}$ converges to $v_{j}$ in the long-time limit. In this case, we have $v=d\,v_{i}$ for each $i$, where $d$ is the number of components $v_{i}$ (the width). Then, the SDE for $v$ can be written as

dvdt=\displaystyle\frac{dv}{dt}= 2(D+1)d2/(D+1)1|v|2D/(D+1)(β1vβ2)+2(D+1)d2/(D+1)1|v|2D/(D+1)η(α1v22α2v+α3)dWdt\displaystyle-2(D+1)d^{2/(D+1)-1}|v|^{2D/(D+1)}(\beta_{1}v-\beta_{2})+2(D+1)d^{2/(D+1)-1}|v|^{2D/(D+1)}\sqrt{\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})}\frac{dW}{dt}
+2(D+1)Dd4/(D+1)2v3|v|4/(D+1)η(α1v22α2v+α3).\displaystyle+2(D+1)Dd^{4/(D+1)-2}v^{3}|v|^{-4/(D+1)}\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (61)

If v>0v>0, Eq. (61) becomes

dvdt=\displaystyle\frac{dv}{dt}= 2(D+1)d2/(D+1)1v2D/(D+1)(β1vβ2)+2(D+1)d2/(D+1)1v2D/(D+1)η(α1v22α2v+α3)dWdt\displaystyle-2(D+1)d^{2/(D+1)-1}v^{2D/(D+1)}(\beta_{1}v-\beta_{2})+2(D+1)d^{2/(D+1)-1}v^{2D/(D+1)}\sqrt{\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})}\frac{dW}{dt}
+2(D+1)Dd4/(D+1)2v34/(D+1)η(α1v22α2v+α3).\displaystyle+2(D+1)Dd^{4/(D+1)-2}v^{3-4/(D+1)}\eta(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3}). (62)

Therefore, the stationary distribution of a general deep diagonal network is given by

p(v)\displaystyle p(v) 1v3(11/(D+1))(α1v22α2v+α3)exp(1T𝑑vd12/(D+1)(β1vβ2)(D+1)v2D/(D+1)(α1v22α2v+α3)).\displaystyle\propto\frac{1}{v^{3(1-1/(D+1))}(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})}\exp\left(-\frac{1}{T}\int dv\frac{d^{1-2/(D+1)}(\beta_{1}v-\beta_{2})}{(D+1)v^{2D/(D+1)}(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})}\right). (63)

If v<0v<0, Eq. (61) becomes

d|v|dt=\displaystyle\frac{d|v|}{dt}= 2(D+1)d2/(D+1)1|v|2D/(D+1)(β1|v|+β2)2(D+1)d2/(D+1)1|v|2D/(D+1)η(α1|v|2+2α2|v|+α3)dWdt\displaystyle-2(D+1)d^{2/(D+1)-1}|v|^{2D/(D+1)}(\beta_{1}|v|+\beta_{2})-2(D+1)d^{2/(D+1)-1}|v|^{2D/(D+1)}\sqrt{\eta(\alpha_{1}|v|^{2}+2\alpha_{2}|v|+\alpha_{3})}\frac{dW}{dt}
+2(D+1)Dd4/(D+1)2|v|34/(D+1)η(α1|v|2+2α2|v|+α3).\displaystyle+2(D+1)Dd^{4/(D+1)-2}|v|^{3-4/(D+1)}\eta(\alpha_{1}|v|^{2}+2\alpha_{2}|v|+\alpha_{3}). (64)

The stationary distribution of |v||v| is given by

p(|v|)1|v|3(11/(D+1))(α1|v|2+2α2|v|+α3)exp(1Td|v|d12/(D+1)(β1|v|+β2)(D+1)|v|2D/(D+1)(α1|v|2+2α2|v|+α3)).p(|v|)\propto\frac{1}{|v|^{3(1-1/(D+1))}(\alpha_{1}|v|^{2}+2\alpha_{2}|v|+\alpha_{3})}\exp\left(-\frac{1}{T}\int d|v|\frac{d^{1-2/(D+1)}(\beta_{1}|v|+\beta_{2})}{(D+1)|v|^{2D/(D+1)}(\alpha_{1}|v|^{2}+2\alpha_{2}|v|+\alpha_{3})}\right). (65)

Thus, we have obtained

p±(|v|)1|v|3(11/(D+1))(α1|v|22α2|v|+α3)exp(1Td|v|d12/(D+1)(β1|v|β2)(D+1)|v|2D/(D+1)(α1|v|22α2|v|+α3)).p_{\pm}(|v|)\propto\frac{1}{|v|^{3(1-1/(D+1))}(\alpha_{1}|v|^{2}\mp 2\alpha_{2}|v|+\alpha_{3})}\exp\left(-\frac{1}{T}\int d|v|\frac{d^{1-2/(D+1)}(\beta_{1}|v|\mp\beta_{2})}{(D+1)|v|^{2D/(D+1)}(\alpha_{1}|v|^{2}\mp 2\alpha_{2}|v|+\alpha_{3})}\right). (66)

In particular, when $D=1$, the distribution function simplifies to

p±(|v|)\displaystyle p_{\pm}(|v|) |v|±β2/2α3T3/2(α1|v|22α2|v|+α3)1±β2/4Tα3exp(12Tα3β1α2β2α3Δarctanα1|v|α2Δ),\displaystyle\propto\frac{|v|^{\pm\beta_{2}/2\alpha_{3}T-3/2}}{(\alpha_{1}|v|^{2}\mp 2\alpha_{2}|v|+\alpha_{3})^{1\pm\beta_{2}/4T\alpha_{3}}}\exp\left(-\frac{1}{2T}\frac{\alpha_{3}\beta_{1}-\alpha_{2}\beta_{2}}{\alpha_{3}\sqrt{\Delta}}\arctan\frac{\alpha_{1}|v|\mp\alpha_{2}}{\sqrt{\Delta}}\right), (67)

where we have used the integral

\int dv\frac{\beta_{1}v\mp\beta_{2}}{v(\alpha_{1}v^{2}\mp 2\alpha_{2}v+\alpha_{3})}=\frac{\alpha_{3}\beta_{1}-\alpha_{2}\beta_{2}}{\alpha_{3}\sqrt{\Delta}}\arctan\frac{\alpha_{1}v\mp\alpha_{2}}{\sqrt{\Delta}}\mp\frac{\beta_{2}}{\alpha_{3}}\log(v)\pm\frac{\beta_{2}}{2\alpha_{3}}\log(\alpha_{1}v^{2}\mp 2\alpha_{2}v+\alpha_{3}). (68)
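Since the signs in Eq. (68) are easy to misread, here is a quick symbolic verification (ours; upper-sign branch) that differentiating the right-hand side recovers the integrand.

```python
import sympy as sp

# Verify the upper-sign branch of Eq. (68) by differentiating the
# right-hand side and comparing with the integrand.
v, a1, a2, a3, b1, b2 = sp.symbols('v alpha_1 alpha_2 alpha_3 beta_1 beta_2',
                                   positive=True)
Delta = a1*a3 - a2**2
q = a1*v**2 - 2*a2*v + a3
F = ((a3*b1 - a2*b2) / (a3*sp.sqrt(Delta)) * sp.atan((a1*v - a2)/sp.sqrt(Delta))
     - (b2/a3)*sp.log(v) + (b2/(2*a3))*sp.log(q))
print(sp.simplify(sp.diff(F, v) - (b1*v - b2)/(v*q)))   # expected output: 0
```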

A.5 Analysis of the maximum probability point

To investigate the existence of the maximum point given in Eq. (16), we treat $T$ as a variable and ask whether $A:=(\beta_{1}-10\alpha_{2}T)^{2}+28\alpha_{1}T(\beta_{2}-3\alpha_{3}T)$, the expression under the square root, is always positive. When $T<\frac{\beta_{2}}{3\alpha_{3}}=T_{c}/3$, $A$ is positive for arbitrary data. When $T>\frac{\beta_{2}}{3\alpha_{3}}$, we divide the discussion into several cases. First, when $\alpha_{1}\alpha_{3}>\frac{25}{21}\alpha_{2}^{2}$, the quadratic $A(T)$ is concave in $T$ and always has a root. Hence, we find that

T=5α2β1+7α1β2+73α1α3β1210α1α2β1β2+7α12β222(21α1α325α22):=TT=\frac{-5\alpha_{2}\beta_{1}+7\alpha_{1}\beta_{2}+\sqrt{7}\sqrt{3\alpha_{1}\alpha_{3}\beta_{1}^{2}-10\alpha_{1}\alpha_{2}\beta_{1}\beta_{2}+7\alpha_{1}^{2}\beta_{2}^{2}}}{2(21\alpha_{1}\alpha_{3}-25\alpha_{2}^{2})}:=T^{*} (69)

is a critical point. When Tc/3<T<TT_{c}/3<T<T^{*}, there exists a solution to the maximum condition. When T>TT>T^{*}, there is no solution to the maximum condition.

The second case is $\alpha_{2}^{2}<\alpha_{1}\alpha_{3}<\frac{25}{21}\alpha_{2}^{2}$. In this case, we further compare $5\alpha_{2}\beta_{1}$ with $7\alpha_{1}\beta_{2}$. If $5\alpha_{2}\beta_{1}<7\alpha_{1}\beta_{2}$, we have $A>0$, which indicates that the maximum point exists. If $5\alpha_{2}\beta_{1}>7\alpha_{1}\beta_{2}$, we need to further check the minimum of $A$, which takes the form

minTA(T)=(25α2221α1α3)β12(7α1β25α2β1)225α2221α1α3.{\rm min}_{T}A(T)=\frac{(25\alpha_{2}^{2}-21\alpha_{1}\alpha_{3})\beta_{1}^{2}-(7\alpha_{1}\beta_{2}-5\alpha_{2}\beta_{1})^{2}}{25\alpha_{2}^{2}-21\alpha_{1}\alpha_{3}}. (70)

If $\frac{7\alpha_{1}}{5\alpha_{2}}<\frac{\beta_{1}}{\beta_{2}}<\frac{5\alpha_{2}+\sqrt{25\alpha_{2}^{2}-21\alpha_{1}\alpha_{3}}}{3\alpha_{3}}$, the minimum of $A$ is always positive and the maximum exists. However, if $\frac{\beta_{1}}{\beta_{2}}\geq\frac{5\alpha_{2}+\sqrt{25\alpha_{2}^{2}-21\alpha_{1}\alpha_{3}}}{3\alpha_{3}}$, a critical learning rate always exists. If $\frac{\beta_{1}}{\beta_{2}}=\frac{5\alpha_{2}+\sqrt{25\alpha_{2}^{2}-21\alpha_{1}\alpha_{3}}}{3\alpha_{3}}$, there is exactly one critical learning rate, $T^{*}=\frac{5\alpha_{2}\beta_{1}-7\alpha_{1}\beta_{2}}{2(25\alpha_{2}^{2}-21\alpha_{1}\alpha_{3})}$; when $T_{c}/3<T<T^{*}$, there is a solution to the maximum condition, while there is no solution when $T>T^{*}$. If $\frac{\beta_{1}}{\beta_{2}}>\frac{5\alpha_{2}+\sqrt{25\alpha_{2}^{2}-21\alpha_{1}\alpha_{3}}}{3\alpha_{3}}$, there are two critical points:

T1,2=5α2β1+7α1β273α1α3β1210α1α2β1β2+7α12β222(21α1α325α22).T_{1,2}=\frac{-5\alpha_{2}\beta_{1}+7\alpha_{1}\beta_{2}\mp\sqrt{7}\sqrt{3\alpha_{1}\alpha_{3}\beta_{1}^{2}-10\alpha_{1}\alpha_{2}\beta_{1}\beta_{2}+7\alpha_{1}^{2}\beta_{2}^{2}}}{2(21\alpha_{1}\alpha_{3}-25\alpha_{2}^{2})}. (71)

For $T<T_{1}$ and $T>T_{2}$, there exists a solution to the maximum condition; for $T_{1}<T<T_{2}$, there is none. The last case is $\alpha_{1}\alpha_{3}=\frac{25}{21}\alpha_{2}^{2}$, for which the quadratic term of $A$ vanishes and $A$ simplifies to $\beta_{1}^{2}+(28\alpha_{1}\beta_{2}-20\alpha_{2}\beta_{1})T$. Hence, when $\frac{\beta_{1}}{\beta_{2}}<\frac{7\alpha_{1}}{5\alpha_{2}}$, there is no critical learning rate and the maximum always exists. Nevertheless, when $\frac{\beta_{1}}{\beta_{2}}>\frac{7\alpha_{1}}{5\alpha_{2}}$, there is always a critical learning rate $T^{*}=\frac{\beta_{1}^{2}}{20\alpha_{2}\beta_{1}-28\alpha_{1}\beta_{2}}$. When $T<T^{*}$, there is a solution to the maximum condition, while there is no solution when $T>T^{*}$.
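The case analysis above can be spot-checked numerically. The following sketch uses arbitrary placeholder values of the moments satisfying case 1 ($\alpha_{1}\alpha_{3}>\frac{25}{21}\alpha_{2}^{2}$) and verifies that $A$ vanishes at the $T^{*}$ of Eq. (69).

```python
import numpy as np

# Numerical spot check of Eq. (69): A(T*) = 0 in case 1,
# alpha_1 * alpha_3 > (25/21) * alpha_2^2. Moments are placeholders.
a1, a2, a3, b1, b2 = 1.0, 0.5, 2.0, 1.0, 0.8
A = lambda T: (b1 - 10*a2*T)**2 + 28*a1*T*(b2 - 3*a3*T)
T_star = (-5*a2*b1 + 7*a1*b2
          + np.sqrt(7*(3*a1*a3*b1**2 - 10*a1*a2*b1*b2 + 7*a1**2*b2**2))
          ) / (2*(21*a1*a3 - 25*a2**2))
print(T_star, A(T_star))   # A(T*) should vanish up to rounding error
```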

Table 1: Summary of distributions $p(v)$ in a depth-$1$ neural network. We show the distribution in the nontrivial subspace when the data $x$ and $y$ are positively correlated; $\Theta(1)$ factors are neglected for concision.

single layer. Without weight decay: $(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})^{-1-\beta_{1}/2T\alpha_{1}}$. With weight decay: $\alpha_{1}(v-k)^{-2-(\beta_{1}+\gamma)/T\alpha_{1}}$.

non-interpolation. Without weight decay: $v^{\beta_{2}/2\alpha_{3}T-3/2}(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})^{-1-\beta_{2}/4T\alpha_{3}}$. With weight decay: $v^{(\beta_{2}-\gamma)/2\alpha_{3}T-3/2}(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})^{-1-(\beta_{2}-\gamma)/4T\alpha_{3}}$.

interpolation ($y=kx$). Without weight decay: $v^{-3/2+\beta_{1}/2T\alpha_{1}k}(v-k)^{-2-\beta_{1}/2T\alpha_{1}k}$. With weight decay: $v^{-3/2+\frac{1}{2T\alpha_{1}k}(\beta_{1}-\gamma/k)}(v-k)^{-2-\frac{1}{2T\alpha_{1}k}(\beta_{1}-\gamma/k)}\exp\left(-\frac{\beta_{1}\gamma}{2T\alpha_{1}}\frac{1}{k(k-v)}\right)$.

A.6 Other Cases for D=1D=1

The other cases are also worth studying. For the interpolation case where the data is linear ($y=kx$ for some $k$), the stationary distribution is different and simpler. There exists a nontrivial fixed point of $\sum_{i}(u_{i}^{2}-w_{i}^{2})$ at $\sum_{j}u_{j}w_{j}=\frac{\alpha_{2}}{\alpha_{1}}$, which is the global minimizer of $L$ and also has vanishing noise. It is helpful to note the following relationships for the data distribution when it is linear:

{α1=Var[x2],α2=kVar[x2]=kα1,α3=k2α1,β1=𝔼[x2],β2=k𝔼[x2]=kβ1.\begin{cases}\alpha_{1}={\rm Var}[x^{2}],\\ \alpha_{2}=k{\rm Var}[x^{2}]=k\alpha_{1},\\ \alpha_{3}=k^{2}\alpha_{1},\\ \beta_{1}=\mathbb{E}[x^{2}],\\ \beta_{2}=k\mathbb{E}[x^{2}]=k\beta_{1}.\end{cases} (72)
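These relations are straightforward to confirm by Monte Carlo, assuming the moment definitions $\alpha_{1}={\rm Var}[x^{2}]$, $\alpha_{2}={\rm Cov}[x^{2},xy]$, $\alpha_{3}={\rm Var}[xy]$, $\beta_{1}=\mathbb{E}[x^{2}]$, $\beta_{2}=\mathbb{E}[xy]$, which is how we read Eq. (72):

```python
import numpy as np

# Monte Carlo check of Eq. (72) for exactly linear data y = k x,
# assuming alpha_1 = Var[x^2], alpha_2 = Cov[x^2, xy], alpha_3 = Var[xy],
# beta_1 = E[x^2], beta_2 = E[xy].
rng = np.random.default_rng(2)
x = rng.normal(size=200000)
k = 1.3
y = k * x
a1, a3 = np.var(x**2), np.var(x * y)
a2 = np.cov(x**2, x * y)[0, 1]
b1, b2 = np.mean(x**2), np.mean(x * y)
print(a2 / a1, a3 / a1, b2 / b1)   # should be close to k, k**2, k
```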

Since the analysis of the Fokker-Planck equation is the same, we directly begin with the distribution function in Eq. (14) for $u_{i}=-w_{i}$, which is given by $P(|v|)\propto\delta(|v|)$. Namely, the only possible weights are $u_{i}=w_{i}=0$, the same as in the non-interpolation case. This is because the corresponding stationary distribution is

P(|v|)\propto\frac{1}{|v|^{2}(|v|+k)^{2}}\exp\left(-\frac{1}{2T}\int d|v|\frac{\beta_{1}(|v|+k)-\alpha_{1}T(|v|+k)^{2}}{\alpha_{1}|v|(|v|+k)^{2}}\right)\propto|v|^{-\frac{3}{2}-\frac{\beta_{1}}{2T\alpha_{1}k}}(|v|+k)^{-2+\frac{\beta_{1}}{2T\alpha_{1}k}}. (73)

The integral of Eq. (73) with respect to $|v|$ diverges at the origin due to the factor $|v|^{-\frac{3}{2}-\frac{\beta_{1}}{2T\alpha_{1}k}}$, and hence $P(|v|)$ reduces to $\delta(|v|)$.

For the case ui=wiu_{i}=w_{i}, the stationary distribution is given from Eq. (14) as

P(v)\propto\frac{1}{v^{2}(v-k)^{2}}\exp\left(-\frac{1}{2T}\int dv\frac{\beta_{1}(v-k)-\alpha_{1}T(v-k)^{2}}{\alpha_{1}v(v-k)^{2}}\right)\propto v^{-\frac{3}{2}+\frac{\beta_{1}}{2T\alpha_{1}k}}(v-k)^{-2-\frac{\beta_{1}}{2T\alpha_{1}k}}. (74)

Now, we consider the case of $\gamma\neq 0$. In the non-interpolation regime, when $u_{i}=-w_{i}$, the stationary distribution is still $p(v)=\delta(v)$. For the case of $u_{i}=w_{i}$, the stationary distribution is the same as in Eq. (14) after replacing $\beta_{2}$ with $\beta_{2}^{\prime}=\beta_{2}-\gamma$. It still exhibits a phase transition; the weight decay simply shifts $\beta_{2}$ by $-\gamma$. In the interpolation regime, the stationary distribution is still $p(v)=\delta(v)$ when $u_{i}=-w_{i}$. However, when $u_{i}=w_{i}$, the phase transition still exists since the stationary distribution is

p(v)\displaystyle p(v) v32+θ2(vk)2+θ2exp(β1γ2Tα11k(kv)),\displaystyle\propto\frac{v^{-\frac{3}{2}+\theta_{2}}}{(v-k)^{2+\theta_{2}}}\exp\left(-\frac{\beta_{1}\gamma}{2T\alpha_{1}}\frac{1}{k(k-v)}\right), (75)

where θ2=12Tα1k(β1γk)\theta_{2}=\frac{1}{2T\alpha_{1}k}(\beta_{1}-\frac{\gamma}{k}). The phase transition point is θ2=1/2\theta_{2}=1/2, which is the same as the non-interpolation one.

The last situation is rather special: $\Delta=0$ but $y\neq kx$, namely $y=kx-c/x$ for some $c\neq 0$. In this case, the parameters $\alpha$ and $\beta$ are the same as those given in Eq. (72) except for $\beta_{2}$:

β2=k𝔼[x2]kc=kβ1kc.\beta_{2}=k\mathbb{E}[x^{2}]-kc=k\beta_{1}-kc. (76)

The corresponding stationary distribution is

P(|v|)\displaystyle P(|v|) |v|32ϕ2(|v|+k)2ϕ2exp(c2Tα11k(k+|v|)),\displaystyle\propto\frac{|v|^{-\frac{3}{2}-\phi_{2}}}{(|v|+k)^{2-\phi_{2}}}\exp\left(\frac{c}{2T\alpha_{1}}\frac{1}{k(k+|v|)}\right), (77)

where $\phi_{2}=\frac{1}{2T\alpha_{1}k}(\beta_{1}-c)$. Here, we see that the behavior of the stationary distribution $P(|v|)$ is influenced by the sign of $c$. When $c<0$, the integral of $P(|v|)$ diverges at the origin because $|v|^{-\frac{3}{2}-\phi_{2}}<|v|^{-3/2}$ there, and Eq. (77) reduces to $\delta(|v|)$ again. However, when $c>0$, the integral need not diverge. The critical point is $\frac{3}{2}+\phi_{2}=1$, or equivalently $c=\beta_{1}+T\alpha_{1}k$. Intuitively, when $c<0$ the data points all lie above the line $y=kx$, so $u_{i}=-w_{i}$ can only give a trivial solution, whereas for $c>0$ there is the possibility of learning the negative slope. When $0<c<\beta_{1}+T\alpha_{1}k$, the integral of $P(|v|)$ still diverges and the distribution is equivalent to $\delta(|v|)$. Now, we consider the case of $u_{i}=w_{i}$. The stationary distribution is

P(|v|)\displaystyle P(|v|) |v|32+ϕ2(|v|k)2+ϕ2exp(c2Tα11k|v|).\displaystyle\propto\frac{|v|^{-\frac{3}{2}+\phi_{2}}}{(|v|-k)^{2+\phi_{2}}}\exp\left(-\frac{c}{2T\alpha_{1}}\frac{1}{k-|v|}\right). (78)

It also contains a critical point, $-\frac{3}{2}+\phi_{2}=-1$, or equivalently $c=\beta_{1}-\alpha_{1}kT$. There are two cases. When $c<0$, the probability density has support only on $|v|>k$, since the gradient always pulls the parameter $|v|$ into the region $|v|>k$; hence, the divergence at $|v|=0$ has no effect. When $c>0$, the probability density has support on $0<|v|<k$ for the same reason. Therefore, if $\beta_{1}>\alpha_{1}kT$, there exists a critical value $c^{*}=\beta_{1}-\alpha_{1}kT$. When $c>c^{*}$, the distribution function $P(|v|)$ becomes $\delta(|v|)$; when $c<c^{*}$, the integral of the distribution function is finite on $0<|v|<k$, indicating that the network learns. If $\beta_{1}\leq\alpha_{1}kT$, there is no criticality and $P(|v|)$ is always equivalent to $\delta(|v|)$. The effect of weight decay can be analyzed similarly; the result is obtained by replacing $\beta_{1}$ with $\beta_{1}+\gamma/k$ for the case $u_{i}=-w_{i}$, or with $\beta_{1}-\gamma/k$ for the case $u_{i}=w_{i}$.
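The criticality in $c$ amounts to a one-line comparison of the small-$|v|$ exponent of Eq. (78) against $-1$; the values below are hypothetical.

```python
# Spot check of the critical value c* = beta_1 - alpha_1 * k * T of Eq. (78),
# comparing the small-|v| exponent of P(|v|) with -1. Values are hypothetical;
# here c* = 0.4.
T, a1, k, b1 = 0.1, 1.0, 1.0, 0.5
for c in (0.2, 0.45):
    phi2 = (b1 - c) / (2 * T * a1 * k)
    power = -1.5 + phi2               # small-|v| exponent of P(|v|)
    verdict = "normalizable" if power > -1 else "collapses to delta(|v|)"
    print(c, power, verdict)
```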

A.7 Second-order Law of Balance

Consider the modified loss function:

tot=+14TL2.\ell_{\text{tot}}=\ell+\frac{1}{4}T||\nabla L||^{2}. (79)

In this case, the Langevin equations become

dw_{j}=-\frac{\partial\ell}{\partial w_{j}}dt-\frac{1}{4}T\frac{\partial||\nabla L||^{2}}{\partial w_{j}}dt, (80)
du_{i}=-\frac{\partial\ell}{\partial u_{i}}dt-\frac{1}{4}T\frac{\partial||\nabla L||^{2}}{\partial u_{i}}dt. (81)

Hence, the modified SDEs of ui2u_{i}^{2} and wj2w_{j}^{2} can be rewritten as

\frac{du_{i}^{2}}{dt}=2u_{i}\frac{du_{i}}{dt}+\frac{(du_{i})^{2}}{dt}=-2u_{i}\frac{\partial\ell}{\partial u_{i}}+TC_{i}^{u}-\frac{1}{2}Tu_{i}\nabla_{u_{i}}||\nabla L||^{2}, (82)
\frac{dw_{j}^{2}}{dt}=2w_{j}\frac{dw_{j}}{dt}+\frac{(dw_{j})^{2}}{dt}=-2w_{j}\frac{\partial\ell}{\partial w_{j}}+TC_{j}^{w}-\frac{1}{2}Tw_{j}\nabla_{w_{j}}||\nabla L||^{2}. (83)

In this section, we consider the effect of the last terms in Eqs. (82) and (83). From the infinitesimal transformation of the rescaling symmetry,

jwjwj=iuiui,\sum_{j}w_{j}\frac{\partial\ell}{\partial w_{j}}=\sum_{i}u_{i}\frac{\partial\ell}{\partial u_{i}}, (84)

we take the derivative of both sides of the equation and obtain

Lui+juj2Luiuj=jwj2Luiwj,\displaystyle\frac{\partial L}{\partial u_{i}}+\sum_{j}u_{j}\frac{\partial^{2}L}{\partial u_{i}\partial u_{j}}=\sum_{j}w_{j}\frac{\partial^{2}L}{\partial u_{i}\partial w_{j}}, (85)
juj2Lwiuj=Lwi+jwj2Lwiwj,\displaystyle\sum_{j}u_{j}\frac{\partial^{2}L}{\partial w_{i}\partial u_{j}}=\frac{\partial L}{\partial w_{i}}+\sum_{j}w_{j}\frac{\partial^{2}L}{\partial w_{i}\partial w_{j}}, (86)

where we have also taken the expectation of $\ell$. By substituting these equations into Eqs. (82) and (83), we obtain

\frac{d||u||^{2}}{dt}-\frac{d||w||^{2}}{dt}=T\sum_{i}(C_{i}^{u}+(\nabla_{u_{i}}L)^{2})-T\sum_{j}(C_{j}^{w}+(\nabla_{w_{j}}L)^{2}). (87)

Then, following the procedure in Appendix A.2, we can rewrite Eq. (87) as

du2dtdw2dt\displaystyle\frac{d||u||^{2}}{dt}-\frac{d||w||^{2}}{dt} =T(uTC1u+uTD1uwTC2wwTD2w)\displaystyle=-T(u^{T}C_{1}u+u^{T}D_{1}u-w^{T}C_{2}w-w^{T}D_{2}w)
=T(uTE1uwTE2w),\displaystyle=-T(u^{T}E_{1}u-w^{T}E_{2}w), (88)

where

(D1)ij\displaystyle(D_{1})_{ij} =k𝔼[(uiwk)]𝔼[(ujwk)],\displaystyle=\sum_{k}\mathbb{E}\left[\frac{\partial\ell}{\partial(u_{i}w_{k})}\right]\mathbb{E}\left[\frac{\partial\ell}{\partial(u_{j}w_{k})}\right], (89)
(D2)kl\displaystyle(D_{2})_{kl} =i𝔼[(uiwk)]𝔼[(uiwl)],\displaystyle=\sum_{i}\mathbb{E}\left[\frac{\partial\ell}{\partial(u_{i}w_{k})}\right]\mathbb{E}\left[\frac{\partial\ell}{\partial(u_{i}w_{l})}\right], (90)
(E1)ij\displaystyle(E_{1})_{ij} =𝔼[k(uiwk)(ujwk)],\displaystyle=\mathbb{E}\left[\sum_{k}\frac{\partial\ell}{\partial(u_{i}w_{k})}\frac{\partial\ell}{\partial(u_{j}w_{k})}\right], (91)
(E2)kl\displaystyle(E_{2})_{kl} =𝔼[i(uiwk)(uiwl)].\displaystyle=\mathbb{E}\left[\sum_{i}\frac{\partial\ell}{\partial(u_{i}w_{k})}\frac{\partial\ell}{\partial(u_{i}w_{l})}\right]. (92)

For one-dimensional parameters $u,w$, Eq. (88) reduces to

\frac{d}{dt}(u^{2}-w^{2})=-T\mathbb{E}\left[\left(\frac{\partial\ell}{\partial(uw)}\right)^{2}\right](u^{2}-w^{2}). (93)
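For completeness, Eq. (93) integrates to an explicit exponential decay (a one-line consequence, added here for clarity):

$(u^{2}-w^{2})(t)=(u^{2}-w^{2})(0)\exp\left(-T\int_{0}^{t}\mathbb{E}\left[\left(\frac{\partial\ell}{\partial(uw)}\right)^{2}\right]dt^{\prime}\right),$

so the imbalance decays at a rate set by the full second moment of the gradient rather than its variance alone.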

Therefore, we see that this loss modification increases the speed of convergence toward balance. Now, we move on to the stationary distribution of the parameter $v$. At stationarity, if $u_{i}=-w_{i}$, we again have the distribution $P(v)=\delta(v)$ as before. However, when $u_{i}=w_{i}$, we have

dvdt=4v(β1vβ2)+4Tv(α1v22α2v+α3)4β12Tv(β1vβ2)(3β1vβ2)+4vT(α1v22α2v+α3)dWdt.\frac{dv}{dt}=-4v(\beta_{1}v-\beta_{2})+4Tv(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})-4\beta_{1}^{2}Tv(\beta_{1}v-\beta_{2})(3\beta_{1}v-\beta_{2})+4v\sqrt{T(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})}\frac{dW}{dt}. (94)

Hence, the stationary distribution becomes

P(v)vβ2/2α3T3/2β22/2α3(α1v22α2v+α3)1+β2/4Tα3+K1exp((12Tα3β1α2β2α3Δ+K2)arctanα1vα2Δ),P(v)\propto\frac{v^{\beta_{2}/2\alpha_{3}T-3/2-\beta_{2}^{2}/2\alpha_{3}}}{(\alpha_{1}v^{2}-2\alpha_{2}v+\alpha_{3})^{1+\beta_{2}/4T\alpha_{3}+K_{1}}}\exp\left(-\left(\frac{1}{2T}\frac{\alpha_{3}\beta_{1}-\alpha_{2}\beta_{2}}{\alpha_{3}\sqrt{\Delta}}+K_{2}\right)\arctan\frac{\alpha_{1}v-\alpha_{2}}{\sqrt{\Delta}}\right), (95)

where

K1\displaystyle K_{1} =3α3β12α1β224α1α3,\displaystyle=\frac{3\alpha_{3}\beta_{1}^{2}-\alpha_{1}\beta_{2}^{2}}{4\alpha_{1}\alpha_{3}},
K2\displaystyle K_{2} =3α2α3β124α1α3β1β2+α1α2β222α1α3Δ.\displaystyle=\frac{3\alpha_{2}\alpha_{3}\beta_{1}^{2}-4\alpha_{1}\alpha_{3}\beta_{1}\beta_{2}+\alpha_{1}\alpha_{2}\beta_{2}^{2}}{2\alpha_{1}\alpha_{3}\sqrt{\Delta}}. (96)

From the expressions above, we can see that $K_{1}\ll 1+\beta_{2}/4T\alpha_{3}$ and $K_{2}\ll(\alpha_{3}\beta_{1}-\alpha_{2}\beta_{2})/2T\alpha_{3}\sqrt{\Delta}$ when $T$ is small, since $K_{1}$ and $K_{2}$ are independent of $T$. Hence, the effect of the modification is visible only in the term proportional to $v$. The phase transition point is modified as

Tc=β2α3+β22.T_{c}=\frac{\beta_{2}}{\alpha_{3}+\beta_{2}^{2}}. (97)

Compared with the previous result $T_{c}=\frac{\beta_{2}}{\alpha_{3}}$, we see that the effect of the loss modification is $\alpha_{3}\to\alpha_{3}+\beta_{2}^{2}$, or equivalently, ${\rm Var}[xy]\to\mathbb{E}[x^{2}y^{2}]$. This effect can also be seen from $E_{1}$ and $E_{2}$.
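The replacement ${\rm Var}[xy]\to\mathbb{E}[x^{2}y^{2}]$ can be made concrete with a quick Monte Carlo (hypothetical data; our own sketch):

```python
import numpy as np

# Compare T_c = beta_2 / alpha_3 with the modified T_c = beta_2 / (alpha_3 + beta_2^2).
# Note that alpha_3 + beta_2^2 = Var[xy] + E[xy]^2 = E[(xy)^2]. Data are hypothetical.
rng = np.random.default_rng(3)
x = rng.normal(size=100000)
y = 1.2 * x + 0.5 * rng.normal(size=100000)
b2, a3 = np.mean(x * y), np.var(x * y)
print(b2 / a3)                    # T_c without the modification
print(b2 / (a3 + b2**2))          # modified T_c
print(b2 / np.mean((x * y)**2))   # identical to the modified T_c
```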