Power-law Dynamic arising from machine learning
Abstract
We study a new kind of SDE arising from research on optimization in machine learning; we call it the power-law dynamic because its stationary distribution cannot have a sub-Gaussian tail and instead obeys a power law. We prove that the power-law dynamic is ergodic with a unique stationary distribution, provided the learning rate is small enough. We then investigate its first exit time. In particular, we compare the exit times of the (continuous) power-law dynamic and its discretization. The comparison can help guide machine learning algorithms.
Keywords:
machine learning, stochastic gradient descent, stochastic differential equation, power-law dynamic
1 Introduction
In the past ten years, we have witnessed the rapid development of deep learning. Deep neural networks (DNNs) have been trained successfully, achieving breakthroughs in AI tasks such as computer vision he2015delving ; he2016deep ; krizhevsky2012imagenet , speech recognition oord2016wavenet ; ren2019fastspeech ; shen2018natural , and natural language processing he2016dual ; sundermeyer2012lstm ; vaswani2017attention .
Stochastic gradient descent (SGD) is a mainstream optimization algorithm in deep learning. Specifically, in each iteration SGD randomly samples a mini-batch of data and updates the model by the stochastic gradient. For large DNN models, the gradient computation over each instance is costly. Thus, compared to gradient descent, which updates the model by the gradient over the full batch of data, SGD can train DNNs much more efficiently. In addition, the gradient noise may help SGD escape from local minima of the non-convex optimization landscape.
Researchers are investigating how the noise in SGD influences the optimization and generalization of deep learning. Recently, more and more works treat SGD as a numerical discretization of a stochastic differential equation (SDE) and investigate the dynamic behavior of SGD by analyzing the SDE, including the convergence rate he2018differential ; li2017stochastic ; rakhlin2012making , the first exit time gurbuzbalaban2020heavy ; meng22020dynamic ; wu2018sgd ; xie2020diffusion , the PAC-Bayes generalization bound he2019control ; mou2017generalization ; smith2017bayesian and the optimal hyper-parameters he2019control ; li2017stochastic . Most of the results in this line of research are derived for dynamics with state-independent noise, assuming that the diffusion coefficient of the SDE is a constant matrix independent of the state (i.e., the model parameters of the DNN). However, the covariance of the gradient noise in SGD does depend on the model parameters.
In our recent work meng22020dynamic ; meng2020dynamic , we studied the dynamic behavior of SGD with state-dependent noise. We found that the covariance of the gradient noise of SGD in the local region of a local minimum can be well approximated by a quadratic function of the state. We then proposed to investigate the dynamic behavior of SGD through a stochastic differential equation (SDE) with a quadratic state-dependent diffusion coefficient. As shown in meng22020dynamic ; meng2020dynamic , the new SDE with quadratic diffusion coefficient can better match the behavior of SGD than the SDE with constant diffusion coefficient.
In this paper, we study some mathematical properties of the new SDE with quadratic diffusion coefficient. After briefly introducing its machine learning background and investigating its preliminary properties (Section 2), we show in Section 3 that the stationary distribution of this new SDE is a power-law distribution (hence we call the corresponding dynamic a power-law dynamic) and is heavy-tailed, meaning that it cannot have a sub-Gaussian tail. Employing the coupling method, in Section 4 we prove that the power-law dynamic is ergodic with a unique stationary distribution, provided the learning rate is small enough. In the last two sections we analyze the first exit time of the power-law dynamic: we obtain its asymptotic order in Section 5, and in Section 6 we compare the exit times of the (continuous) power-law dynamic and its discretization. The comparison can help guide machine learning algorithms.
2 Background and preliminaries on power-law dynamic
2.1 Background in Machine Learning
Suppose that we have $n$ training data with inputs $\{x_i\}_{i=1}^{n}$ and outputs $\{y_i\}_{i=1}^{n}$. For a model with parameter (vector) $w$, its loss over the training instance $(x_i, y_i)$ is $\ell(f(x_i;w), y_i)$, where $\ell$ is the loss function. In machine learning, we are minimizing the empirical loss over the training data, i.e.,

$$L(w) = \frac{1}{n}\sum_{i=1}^{n} \ell(f(x_i;w), y_i). \qquad (2.1)$$
Stochastic gradient descent (SGD) and its variants are the mainstream approaches to minimize $L(w)$. In SGD, the update rule at the $t$-th iteration is

$$w_{t+1} = w_t - \eta\, \tilde{g}(w_t), \qquad (2.2)$$
where $\eta$ denotes the learning rate,

$$\tilde{g}(w) = \frac{1}{b}\sum_{i \in S_b} \nabla_w \ell(f(x_i;w), y_i) \qquad (2.3)$$

is the stochastic gradient, with $S_b$ being a randomly sampled subset of $\{1,\dots,n\}$ of size $b$. In the literature, $S_b$ is called a mini-batch.
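As a concrete illustration of the update rule (2.2)-(2.3), the following sketch runs mini-batch SGD on a least-squares problem; the data, loss, and all names here are our illustrative choices, not part of the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: linear model with squared loss (illustrative stand-in).
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = np.ones(d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def loss(w):
    """Empirical loss over the full training data, as in (2.1)."""
    return 0.5 * np.mean((X @ w - y) ** 2)

def stochastic_gradient(w, batch_size=32):
    """Gradient over a randomly sampled mini-batch, as in (2.3)."""
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return Xb.T @ (Xb @ w - yb) / batch_size

eta = 0.05                 # learning rate
w = np.zeros(d)
for t in range(500):
    w = w - eta * stochastic_gradient(w)   # SGD update (2.2)

print(loss(np.zeros(d)), loss(w))
```

Because of the mini-batch gradient noise, the iterates fluctuate around the minimizer instead of converging exactly; this fluctuation is precisely the noise that the SDE model below captures.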
We know that $\tilde{g}(w)$ is an unbiased estimator of the full gradient $\nabla L(w)$. The gap between the full gradient and the stochastic gradient, i.e.,

$$\xi(w) = \nabla L(w) - \tilde{g}(w), \qquad (2.4)$$
is called the gradient noise in SGD. In the literature, e.g. li2017stochastic ; meng22020dynamic ; xie2020diffusion , the gradient noise is assumed to be drawn from a Gaussian distribution,111Under mild conditions the assumption is approximately satisfied by the Central Limit Theorem. that is, $\xi(w) \sim N(0, \Sigma(w))$, where $\Sigma(w)$ is the covariance matrix of $\tilde{g}(w)$. Denoting by $\xi_t$ the noise at the $t$-th iteration, the update rule of SGD in Eq. (2.2) is then approximated by:

$$w_{t+1} = w_t - \eta \nabla L(w_t) + \eta\, \xi_t, \qquad \xi_t \sim N(0, \Sigma(w_t)). \qquad (2.5)$$
Further, for small enough learning rate $\eta$, Eq. (2.5) can be viewed as the numerical discretization of the following stochastic differential equation (SDE) he2018differential ; li2017stochastic ; meng22020dynamic ,

$$dw_t = -\nabla L(w_t)\, dt + \sqrt{\eta\, \Sigma(w_t)}\, dB_t, \qquad (2.6)$$
where $B_t$ is the standard Brownian motion in $\mathbb{R}^d$. This viewpoint enables researchers to investigate the dynamic properties of SGD by means of stochastic analysis. In this line, recent work studied the dynamics of SGD with the help of SDEs. However, most of the quantitative results in this line of work were obtained for dynamics with state-independent noise. More precisely, the authors assumed that the covariance in Eq. (2.6) is a constant matrix independent of the state. This assumption of a constant diffusion coefficient simplifies the calculation and the corresponding analysis, but it is oversimplified, because the noise covariance in SGD does depend on the model parameters.
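The correspondence between the SGD step (2.5) and the SDE (2.6) is an Euler-Maruyama discretization with step size equal to the learning rate: over a step dt = η, the noise increment √(ηΣ)·√(dt)·z equals η√Σ·z, matching (2.5). A one-dimensional sketch follows; the parameter values and the constant covariance placeholder are our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

h = 1.0      # curvature: quadratic loss L(w) = h * w**2 / 2
eta = 0.01   # learning rate = discretization step of the SDE

def grad(w):
    return h * w

def sigma(w):
    """Noise covariance; a constant placeholder here (state-dependent later)."""
    return 1.0

def euler_maruyama_step(w, noisy=True):
    # One Euler-Maruyama step of dw = -grad L dt + sqrt(eta * Sigma) dB
    # with dt = eta; the noise term is eta * sqrt(Sigma) * z, as in (2.5).
    z = rng.normal() if noisy else 0.0
    return w - eta * grad(w) + np.sqrt(eta * sigma(w)) * np.sqrt(eta) * z

w_sde, w_gd = 1.0, 1.0
for _ in range(1000):
    w_sde = euler_maruyama_step(w_sde, noisy=True)
    w_gd = euler_maruyama_step(w_gd, noisy=False)

print(w_gd)   # with the noise switched off: plain gradient descent, (1 - eta*h)**1000
```

With the noise off, the recursion is exactly gradient descent; with it on, the iterate hovers in a neighborhood of the minimum whose size is set by η and Σ.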
In our recent work meng22020dynamic ; meng2020dynamic , we studied the dynamic behavior of SGD with state-dependent noise. The theoretical derivation and empirical observations of our research show that the covariance of the gradient noise of SGD in the local region of a local minimum can be well approximated by a quadratic function of the state, as briefly reviewed below.
Let $w^*$ be a local minimum of the (empirical/training) loss function defined in (2.1). We assume that the loss function in the local region of $w^*$ can be approximated by the second-order Taylor expansion

$$L(w) \approx L(w^*) + \nabla L(w^*)^{\top}(w - w^*) + \frac{1}{2}(w - w^*)^{\top} H (w - w^*), \qquad (2.7)$$

where $H$ is the Hessian matrix of the loss at $w^*$. Since $\nabla L(w^*) = 0$ at the local minimum, (2.7) reduces to

$$L(w) \approx L(w^*) + \frac{1}{2}(w - w^*)^{\top} H (w - w^*). \qquad (2.8)$$
Under the above setting, the full gradient of the training loss is

$$\nabla L(w) = H\,(w - w^*), \qquad (2.9)$$
and the stochastic gradient (2.3) is

$$\tilde{g}(w) = \tilde{g}(w^*) + \tilde{H}\,(w - w^*), \qquad (2.10)$$

where $\tilde{g}(w^*)$ and $\tilde{H}$ are the gradient and Hessian calculated over the mini-batch. More explicitly, the $j$-th component of $\tilde{g}(w)$ is

$$\tilde{g}_j(w) = \tilde{g}_j(w^*) + \sum_{k} \tilde{H}_{jk}\,(w_k - w^*_k). \qquad (2.11)$$
Assuming that the mini-batch gradient and Hessian at the minimum are uncorrelated, 222This assumption holds for the additive noise case and the squared loss wei2020implicit . Specifically, for the squared loss with additive noise, the label equals the model output plus a white noise independent of the input; a direct calculation then shows that the mini-batch gradient and Hessian are uncorrelated. we have
(2.12) |
where is a matrix with elements
Thus, we can convert the noise covariance into an analytically tractable form as follows.
(2.13) |
where the coefficient matrices in (2.13) are positive definite. The empirical observations in meng22020dynamic ; meng2020dynamic are consistent with the covariance structure (2.13). Thus the SDE (2.6) takes the form
(2.14) |
where the diffusion coefficient is given by (2.13). We call the dynamic driven by (2.14) the power-law dynamic, because its stationary distribution obeys a power law (see Theorem 3.1 below). As shown in meng22020dynamic ; meng2020dynamic , the power-law dynamic can better match the behavior of SGD than the SDE with constant diffusion coefficient.
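In one dimension, writing the quadratic covariance as σ²(x) = σ0 + σ1 x² (our notation for one coordinate of the decoupled case appearing in (2.16) below), the dynamic (2.14) is easy to simulate, and the second-moment equation d E[x²]/dt = −2h E[x²] + η(σ0 + σ1 E[x²]) gives the stationary variance ησ0/(2h − ησ1), which is finite only for small η. A sketch with illustrative parameters:

```python
import numpy as np

rng = np.random.default_rng(2)

h, sigma0, sigma1 = 1.0, 1.0, 1.0   # curvature and noise coefficients (illustrative)
eta = 0.01                           # learning rate
dt = 0.01                            # integration step
steps = 200_000

x = 0.0
samples = np.empty(steps)
for k in range(steps):
    diffusion = np.sqrt(eta * (sigma0 + sigma1 * x * x))  # quadratic covariance
    x += -h * x * dt + diffusion * np.sqrt(dt) * rng.normal()
    samples[k] = x

var_theory = eta * sigma0 / (2 * h - eta * sigma1)   # stationary variance
print(samples.var(), var_theory)                     # the two agree closely
```

For larger η the higher moments blow up, which is the heavy-tail phenomenon made precise in Theorem 3.1.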
2.2 Preliminaries on power-law dynamic
For the power-law dynamic (2.14), the infinitesimal generator exists and has the following form:
We will use the infinitesimal generator to specify a coupling in the subsequent sections. Write , then
(2.15) |
where, compared with (2.14), we have slightly abused the notation.
In machine learning, we often assume that the dynamic in (2.15) can be decoupled meng2020dynamic ; xie2020diffusion ; zhang2019algorithmic . More explicitly, we assume that and are codiagonalizable by an orthogonal matrix , then under the affine transformation , (2.15) is decoupled and can be written as
(2.16) |
where , are positive constants.333This decoupling property is empirically observed in machine learning, i.e., the directions of the eigenvectors of the Hessian matrix and of the gradient covariance matrix often nearly coincide at the minimum xie2020diffusion . An explanation of this phenomenon is that, in expectation, the Hessian equals the Fisher information matrix jastrzkebski2017three ; xie2020diffusion ; zhu2019anisotropic .
Following the convention of probabilistic literature, in what follows we shall write
(2.17) |
(2.18) |
Suppose , by the mean value theorem, we have the following inequality,
(2.19) |
Then, it is easy to check that both coefficients are locally Lipschitz and have linear growth. Therefore, by the standard theory of stochastic differential equations, the SDE (2.15) has a unique strong solution, which has continuous paths and possesses the strong Markov property.
Consider the decoupled dynamic in (2.16), we use the fact that as ,
Then, for any fixed we have
and
which implies that each component of will not blow up in finite time.
To conclude, the stochastic differential equation (2.16) admits a unique strong solution, which has continuous paths and does not blow up in finite time. In subsequent sections we shall study further properties of the dynamic.
3 Property of the stationary distribution
In this section, we show that the stationary distribution of the SDE (2.15) is heavy-tailed, and that its decoupled form is a product of power-law distributions. The existence and uniqueness of the stationary distribution will be given in the next section.
Let be an orthogonal matrix such that is a diagonal matrix. Then
(3.1) |
where , . Note that the transformed noise is still a Brownian motion. (3.1) is just the power-law dynamic (2.15) under a new orthogonal coordinate system, so we abuse notation and denote the transformed dynamic by the same symbol. Since we care about the tail behavior of the power-law dynamic, we first show that sufficiently high moments explode as time tends to infinity. This implies that the stationary distribution cannot have an exponentially decaying tail.
Theorem 3.1
(i) We can find m such that the moments of the power-law dynamic (3.1) of order greater than m explode as time tends to infinity.
(ii) For the decoupled case in (2.16), the probability density of the stationary distribution is a product of power-law distributions (the terminology follows zhou ), as below:
(3.2) |
where and is the normalization constant.
Proof
(i) Denote the 2k-th moment of as . Then . By Ito’s formula, we have
Let be the maximal diagonal element of and the minimal diagonal element of ; then we get the following recursion inequality (note that it may not hold for odd-degree moments):
(3.3) |
where is the minimal eigenvalue of the positive definite matrix . Let and , then
the remainder term is defined by . From the above relation, we can prove the following inequality by induction:
(3.4) |
By tracking the related coefficients carefully, it is not difficult to find the recurrence relations for . For example,
Since is positive, becomes positive when i is large. From this fact, we can always find a such that
which means the moment generating function of the stationary distribution blows up. Therefore, the stationary distribution cannot have a sub-Gaussian tail.
(ii) Now we turn to the decoupled case of equation (2.16). Since each coordinate evolves independently, the probability density is of product form. To investigate the probability density of one fixed coordinate, we need to study the backward Kolmogorov equation it satisfies. Since the coefficients have linear growth, we have
Here we adopt an idea from the statistical physics literature zhou , as in the machine learning area (see meng22020dynamic ): we first transform the Kolmogorov equation into the Smoluchowski form:
(3.5) | ||||
Let , then the fluctuation-dissipation relation of and in zhou is satisfied with
Let , then the stationary distribution satisfies the power law:
(3.6) |
The proof is completed.
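The power-law form (3.6) can also be verified directly from the stationary Fokker-Planck equation in the one-dimensional case; below, σ²(x) = σ0 + σ1 x² and the drift −hx are our shorthand for the coefficients of one coordinate of (2.16).

```latex
% Zero-flux stationary Fokker--Planck equation for
%   dx_t = -h x_t\,dt + \sqrt{\eta(\sigma_0 + \sigma_1 x_t^2)}\,dB_t:
\[
  h x\, p(x) + \frac{\eta}{2}\,\frac{d}{dx}\Big[(\sigma_0 + \sigma_1 x^2)\,p(x)\Big] = 0 .
\]
% Dividing by (\sigma_0 + \sigma_1 x^2)\,p(x) and integrating:
\[
  \frac{p'(x)}{p(x)} = -\,\frac{(2h/\eta + 2\sigma_1)\,x}{\sigma_0 + \sigma_1 x^2}
  \quad\Longrightarrow\quad
  p(x) \propto (\sigma_0 + \sigma_1 x^2)^{-\kappa},
  \qquad \kappa = \frac{h}{\eta\,\sigma_1} + 1 ,
\]
% a power law whose tail becomes heavier as the learning rate \eta grows.
```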
Remark 1
(i) In zhou , the tail index (depending on the hyper-parameters) plays an important role in locating the large-learning-rate region. Observe that when the tail index is small enough, the variance of the stationary distribution is infinite.
(ii) Another way to view the power-law dynamic is to apply the results of the groundbreaking article ma . Roughly speaking, the authors of ma gave a complete recipe for constructing SDEs with a prescribed stationary distribution. Following the notation of ma , suppose we write the SDE in the following form:
where D(z) is a positive semi-definite diffusion matrix (a Riemannian metric). Suppose the stationary distribution , then the drift term must satisfy:
where is an arbitrary skew-symmetric matrix (a symplectic form) and is defined by
When , due to the skew-symmetry, . If the stationary distribution is given by (3.6), . Thus,
We get that
In this way, we automatically obtain the fluctuation-dissipation relation.
4 Existence and uniqueness of the stationary distribution
In this section, we shall prove that the power-law dynamic is ergodic with a unique stationary distribution, provided the learning rate is small enough (see Theorem 4.1 (ii) below). Note that, unlike Langevin dynamics, the power-law dynamic has a state-dependent diffusion term and its stationary distribution does not have a sub-Gaussian tail, so the diffusion process fails the log-Sobolev inequality condition. Instead of treating the dynamic as a gradient flow, we shall use the coupling method to bound its convergence to the stationary distribution.
Let the drift vector and the diffusion matrix be defined as in (2.17) and (2.18), respectively. We set
(4.1) | |||
(4.2) |
Theorem 4.1
(i) Let be the transition probability of the power-law dynamic driven by (2.15), we have
(4.3) |
where is the Wasserstein distance between two probability distributions.
(ii) Employing the notation of the previous section, we write for the minimal diagonal element of the matrix , for the maximal element of , and for the sum of the eigenvalues of . Suppose that
(4.4) |
then the power-law dynamic in (2.15) is ergodic and its stationary distribution is unique.
Proof
(i) We shall employ the coupling method for Markov processes in this proof and in the rest of this paper. The reader may refer to Chapter 2 of chen2006eigenvalues , especially page 24 and Example 2.16, for the relevant content. Recall that every infinitesimal generator of an -valued diffusion process has the form . To specify a coupling between two power-law dynamics starting from different points, we define a coupling infinitesimal generator as follows:
where and are specified by (2.17) and (2.18) respectively, corresponds to the second order differentiation and corresponds to the first order differentiation.
Let and let act on , we get
where . Denote by the dynamic starting at and the dynamic starting at , by Ito’s formula, we have
Applying Gronwall's inequality, we get
which implies that
verifying (4.3).
(ii) In view of (4.3), we need only to check that if (4.4) holds, then
We have
therefore
(4.5) |
On the other hand, let be the maximal element of , then for all
Since is preserved under orthogonal transformation, then by the mean value theorem and Cauchy inequality, we can find , such that
where denote the eigenvalues of and denotes the sum of the eigenvalues. Thus,
Consequently,
(4.6) |
Combining (4.5) and (4.6), we see that (4.4) implies the required contraction. Therefore Assertion (ii) holds by virtue of (4.3). The proof is completed.
Remark 2
If we restrict ourselves to the decoupled case (2.16), we can get exponential convergence to the stationary distribution under a much weaker condition. Notice that the diffusion matrix is now diagonal. Using the shorthand notation of the previous sections, we have
Then, we have
where , which does not involve the dimension .
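The contraction behind Theorem 4.1 can be seen numerically in the one-dimensional decoupled case: drive two copies of the dynamic with the same Brownian increments (synchronous coupling), and the distance between them shrinks. The parameters below are illustrative choices of ours satisfying a small-learning-rate condition in the spirit of (4.4).

```python
import numpy as np

rng = np.random.default_rng(3)

h, sigma0, sigma1 = 1.0, 1.0, 1.0
eta, dt, steps = 0.01, 0.01, 1000     # simulate T = 10 time units

def drift(x):
    return -h * x

def diffusion(x):
    return np.sqrt(eta * (sigma0 + sigma1 * x * x))

x, y = 1.0, -1.0                      # two copies, different starting points
d0 = abs(x - y)
for _ in range(steps):
    z = rng.normal()                  # shared noise: synchronous coupling
    x += drift(x) * dt + diffusion(x) * np.sqrt(dt) * z
    y += drift(y) * dt + diffusion(y) * np.sqrt(dt) * z

print(abs(x - y) / d0)                # contraction by several orders of magnitude
```

The shared-noise construction is one admissible coupling; since the Wasserstein distance is an infimum over all couplings, any single coupling yields an upper bound like (4.3).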
5 First exit time: asymptotic order
From now on, we investigate the first exit time of the power-law dynamic from a ball, which is an important issue in machine learning. By leveraging transition rate results from large deviation theory (see e.g. kraaij2019classical ), in this section we obtain the asymptotic order of the first exit time for the decoupled power-law dynamic.
Theorem 5.1
Suppose the origin is the only local minimum of the loss function inside the ball. Let be the first exit time from the ball of the decoupled power-law dynamic in (2.16) with learning rate , starting at the origin; then
(5.1) |
where is a prefactor to be determined.
When , we have an explicit expression of the first exit time from an interval starting at :
(5.2) |
where and .
Proof
Let be a stopping time with finite expectation and let be the infinitesimal generator of , then recall that Dynkin’s formula tells us:
where . Suppose solves the following boundary problem:
(5.3) |
then , where denotes the first exit time of the dynamic, starting at , from the ball .
We first consider the one-dimensional situation. Let be the first exit time from an interval, starting at . Note the diffusion coefficient function ; then by Dynkin's formula, , where solves the following second-order ODE:
(5.4) |
where is the infinitesimal generator of the one-dimensional power-law diffusion. Now we introduce the integrating factor ; following (3.5),
(5.5) |
Then,
where we denote to get the second line. Therefore (5.4) is equivalent to
Integrating the above equation, we recover (13) of du2012power :
(5.6) |
The reader may consult Section 12.3 of weinan2019applied for the asymptotic analysis of (5.6) as the learning rate tends to zero.
We now investigate the general multi-dimensional situation. Here we have only asymptotic estimates on the exit time as the learning rate tends to zero. For this purpose, it is convenient to introduce a geometric reformulation of (2.16). Suppose is the standard Brownian motion; recall that in local coordinates , the Riemannian Brownian motion with metric has the following form (cf. e.g. hsu2002stochastic ):
(5.7) |
where and . Comparing (5.7) with the martingale part of (2.16), we define the inverse metric as . Then the metric is also a diagonal matrix. The Christoffel symbols can be calculated under the new metric:
and , otherwise. Denote the gradient vector field of a smooth function by ; then
where denotes the i-th coordinate tangent vector at . To emphasize the parameter that appears in the diffusion term of the power-law dynamic, we denote the dynamic and its corresponding exit time by and . Let ; then the dynamic in (2.16) can be seen as a diffusion process under the new metric:
(5.8) |
where is the Riemannian Brownian motion given by (5.7) and is a local minimum of the limit function . Note that both the drift term and the diffusion term are intrinsically defined with respect to the metric . By large deviation theory, the rate function of a path is:
where the norm is with respect to the Riemannian metric . It follows that the quasi-potential of the ball is given by
By Theorem 2.2 and Corollary 2.4 of Nils , if 0 is the only local minimum inside the ball, then there exists a constant such that
The proof is completed.
Remark 3
When the dimension is one, we can get similar results with a precise prefactor by applying the semiclassical approximation to the integral (5.6); see kolokoltsov2007semiclassical . Taking the exponential of (5.1), it is clear that the leading order of the average exit time is of power-law form with respect to the radius.
6 First exit time: from continuous to discrete
In this section, we compare the exit times of the continuous power-law dynamic (2.15) and its discretization:
(6.1) |
Note that the first exit time of the discretized dynamic (6.1) is an integer that counts how many steps it takes to escape from the ball; thus time steps correspond to an amount of continuous time. From this viewpoint, the comparison can help guide machine learning algorithms, provided the time interval coincides with the learning rate. However, since in the power-law dynamic the covariance matrix contains the learning rate, for the convenience of the theoretical discussion we shall temporarily distinguish the time step from the learning rate until we arrive at the conclusion.
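Before stating the result, here is a Monte Carlo sanity check in one dimension: the number of steps the discrete dynamic needs to leave a small ball, multiplied by η, should track the exit time of the continuous dynamic. The radius, trial count, and a finer-step surrogate for the continuous dynamic are all our illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(4)

h, sigma0, sigma1 = 1.0, 1.0, 1.0
eta = 0.01           # learning rate = step size of the discrete dynamic
r = 0.2              # exit from the ball (-r, r), starting at the minimum 0
trials = 40

def sigma_sq(x):
    return sigma0 + sigma1 * x * x

def exit_time(dt, max_steps=2_000_000):
    """Simulate dx = -h x dt + sqrt(eta * sigma_sq(x)) dB until |x| >= r."""
    x = 0.0
    for k in range(1, max_steps + 1):
        x += -h * x * dt + np.sqrt(eta * sigma_sq(x) * dt) * rng.normal()
        if abs(x) >= r:
            return k * dt
    return max_steps * dt

# Discrete dynamic (6.1): time step equal to the learning rate eta.
t_disc = np.mean([exit_time(dt=eta) for _ in range(trials)])
# Finer discretization as a surrogate for the continuous dynamic (2.15).
t_cont = np.mean([exit_time(dt=eta / 10) for _ in range(trials)])
print(t_disc, t_cont)    # comparable magnitudes
```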
To shorten the article, we confine ourselves to the one-dimensional situation. Assume the local minimum is the origin. Let be the first exit time from the ball of the one-dimensional continuous power-law dynamic (2.15) and let be the corresponding first exit time of the discretized dynamic (6.1), both starting from the origin. Then we have the following comparison with the corresponding quantities related to the first exit times of the continuous dynamic.
Theorem 6.1
Suppose and satisfy , given a large integer we have
where and as
Proof
For our purpose we introduce an interpolation process as follows:
(6.2) |
where is the discretization step size. More precisely, the drift coefficient and the diffusion coefficient of (6.2) remain unchanged within each step interval. Note that if we rewrite the time index accordingly, (6.2) is expressed as
(6.3) |
which reduces to (6.1) at the grid points.
We shall adopt a strategy similar to nguyen2019first to transfer the average exit time from the power-law dynamic to its discretization. In the discussion below we also follow the notation of the previous sections. Roughly speaking, the proof is divided into two steps:
(i) Fix the number of iteration steps as K, prove that
where is a small positive constant to be determined, and
is the hyper-cube of radius . This can be done by bounding the -distance of the two processes at the grid points. Let
where . For simplicity, denote the exit time of the power-law dynamic from the ball by . For the interpolation process , we denote the corresponding exit time (an integer) with a bar above it: . Then,
(6.4) | ||||
Note that the event indicates that the interpolation process remains in when .
(ii) Step 1 guarantees that if the continuous dynamic is trapped in a ball of a different size at the grid points, then the interpolation process is also trapped in a ball. However, the interpolation process may drift outside the ball between grid points. We define this 'anomalous' random event by
Then obviously,
(6.5) | ||||
where denotes the complement of the event . We would expect the probability of this event to be small if the diffusion coefficient of the dynamic were bounded, in which case we could apply Gaussian concentration results. However, the diffusion part of the power-law dynamic is not bounded, so an additional technical issue must be taken care of.
Now we introduce the same form of coupling as in the previous sections between the two processes at the grid points. Following the notation in the proof of Theorem 4.1, we set the coefficients of the infinitesimal generator of the coupling as:
(6.6) |
The remainder of the proof of the theorem will be accomplished by three lemmas. We first prove the following lemma for the one-dimensional decoupled dynamic (2.16):
Lemma 1
Suppose that the coefficients of (2.16) satisfy:
(6.7) |
(which is fulfilled in the SGD algorithm for large batch size and small learning rate). Let be the coupling process defined by (6.6), and . When , the Wasserstein distance between the marginal distribution of the interpolation process and the marginal distribution of the power-law dynamic is bounded by
(6.8) |
where . Moreover, is independent of the number of steps K and is of order when the time interval .
Proof (Proof of Lemma 1)
Denote the transition probabilities of the two processes from time to by and respectively. Letting the coupling generator act on the function and using Gronwall's inequality, we can deduce the following recursion inequality:
where and is the optimal coupling between and . (Note that in our context the optimal coupling always exists; see e.g. Proposition 1.3.2 and Theorem 2.3.3 in wang .)
For the interpolation process starting at , by Ito’s formula, we have the following estimate:
Similarly, the second moment of the continuous dynamic starting at can be bounded by:
(6.9) |
Notice that the distance between the two processes is zero at initialization; then by applying the recursion relation step by step, we conclude that there exists , such that
where is independent of K, which completes the proof of Lemma 1.
By the definition of the -distance,
(6.10) |
Below we denote the right-hand side above by . For the second step, from (6.4) and (6.5) it follows that
(6.11) |
Therefore, it remains to estimate the probability of the anomalous event. Under the stated condition, we have the following lemma:
Lemma 2
Let be fixed. Conditioning on the event that is inside when for all , we have
where
Remark 4
The above lemma tells us that the fourth moment will not change too much if the time interval is small. Intuitively, since the martingale part has a bounded quadratic variation rate on the event considered, the time-change theorem implies that the marginal distribution of the martingale part behaves like a scaled Gaussian.
Proof (Proof of Lemma 2)
By Ito’s formula,
Define a local martingale , then by Gronwall’s inequality,
Let
Then, for the fixed ,
Applying Doob’s inequality, we get
Now, Ito’s isometry implies that
where we used Gronwall’s inequality to derive the last line. For all , by taking in (6.9), we get
Then, conditioning on the event that , we have
which completes the proof of Lemma 2.
Lemma 3
Let be a positive constant, then for every ,
where when .
Proof (Proof of Lemma 3)
Remark 5
Theorem 6.1 discussed only the one-dimensional case. For the high-dimensional case, there have been some intuitive discussions in the machine learning literature xie2020diffusion . Roughly speaking, when the noise is much smaller than the barrier height, the escaping path concentrates, with high probability, on the critical paths, i.e., the paths in the directions of the eigenvectors of the Hessian. If there are multiple parallel exit paths, the total exit rate, i.e., the inverse of the expected exit time, equals the sum of the exit rates of the individual paths (cf. Rule 1 in xie2020diffusion ).
Acknowledgements.
We had a pleasant collaboration with Shiqi Gong, Huishuai Zhang, and Tie-Yan Liu on the research of power-law dynamics in machine learning meng22020dynamic ; meng2020dynamic . We thank them for their contributions to the previous work and their comments on this work, especially those based on the empirical observations made during our previous collaboration.
References
- (1) Nils Berglund. Kramers’ law: Validity, derivations and generalisations. arXiv preprint arXiv:1106.5799, 2013.
- (2) Mu-Fa Chen. Eigenvalues, inequalities, and ergodic theory. Springer Science & Business Media, 2006.
- (3) Jiu-Lin Du. Power-law distributions and fluctuation-dissipation relation in the stochastic dynamics of two-variable langevin equations. Journal of Statistical Mechanics: Theory and Experiment, 2012(02):P02006, 2012.
- (4) Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. The heavy-tail phenomenon in sgd. arXiv preprint arXiv:2006.04740, 2020.
- (5) Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation. In Advances in neural information processing systems, pages 820–828, 2016.
- (6) Fengxiang He, Tongliang Liu, and Dacheng Tao. Control batch size and learning rate to generalize well: Theoretical and empirical evidence. In Advances in Neural Information Processing Systems, pages 1141–1150, 2019.
- (7) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
- (8) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
- (9) Li He, Qi Meng, Wei Chen, Zhi-Ming Ma, and Tie-Yan Liu. Differential equations for modeling asynchronous algorithms. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2220–2226, 2018.
- (10) Elton P Hsu. Stochastic analysis on manifolds. Number 38. American Mathematical Soc., 2002.
- (11) Stanisław Jastrzębski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017.
- (12) Vassili N Kolokoltsov. Semiclassical analysis for diffusions and stochastic processes. Springer, 2007.
- (13) Richard C Kraaij, Frank Redig, and Rik Versendaal. Classical large deviation theorems on complete riemannian manifolds. Stochastic Processes and their Applications, 129(11):4294–4334, 2019.
- (14) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
- (15) Qianxiao Li, Cheng Tai, et al. Stochastic modified equations and adaptive stochastic gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2101–2110. JMLR. org, 2017.
- (16) Yi-An Ma, Tianqi Chen, and Emily B Fox. A complete recipe for stochastic gradient mcmc. arXiv preprint arXiv:1506.04696, 2015.
- (17) Qi Meng, Shiqi Gong, Wei Chen, Zhi-Ming Ma, and Tie-Yan Liu. Dynamic of stochastic gradient descent with state-dependent noise. arXiv preprint arXiv 2006.13719v3, 2020.
- (18) Qi Meng, Shiqi Gong, Weitao Du, Wei Chen, Zhi-Ming Ma, and Tie-Yan Liu. A fine-grained study on the escaping behavior of stochastic gradient descent. Under review, 2021.
- (19) Wenlong Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. Generalization bounds of sgld for non-convex learning: Two theoretical viewpoints. arXiv preprint arXiv:1707.05947, 2017.
- (20) Thanh Huy Nguyen, Umut Simsekli, Mert Gurbuzbalaban, and Gaël Richard. First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. In Advances in Neural Information Processing Systems, pages 273–283, 2019.
- (21) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
- (22) Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1571–1578, 2012.
- (23) Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263, 2019.
- (24) Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.
- (25) Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.
- (26) Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. Lstm neural networks for language modeling. In Thirteenth annual conference of the international speech communication association, 2012.
- (27) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
- (28) Feng-Yu Wang. Analysis for diffusion processes on Riemannian manifolds, volume 18. World Scientific, 2014.
- (29) Colin Wei, Sham Kakade, and Tengyu Ma. The implicit and explicit regularization effects of dropout. In International Conference on Machine Learning, pages 10181–10192. PMLR, 2020.
- (30) E Weinan, Tiejun Li, and Eric Vanden-Eijnden. Applied stochastic analysis, volume 199. American Mathematical Soc., 2019.
- (31) Lei Wu, Chao Ma, and Weinan E. How sgd selects the global minima in over-parameterized learning: A dynamical stability perspective. Advances in Neural Information Processing Systems, 31:8279–8288, 2018.
- (32) Zeke Xie, Issei Sato, and Masashi Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent escapes from sharp minima exponentially fast. arXiv preprint arXiv:2002.03495, 2020.
- (33) Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George Dahl, Chris Shallue, and Roger B Grosse. Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. In Advances in Neural Information Processing Systems, pages 8196–8207, 2019.
- (34) Yanjun Zhou and Jiulin Du. Kramers escape rate in overdamped systems with the power-law distribution. Physica A: Statistical Mechanics and its Applications, 402:299–305, 2014.
- (35) Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. In Proceedings of International Conference on Machine Learning, pages 7654–7663, 2019.