
1 Microsoft Research Asia, [email protected] and [email protected]; 2 University of Science and Technology of China, [email protected]; 3 University of Chinese Academy of Sciences, [email protected]
†: Alphabetical order
∗: Corresponding authors

Power-law Dynamic arising from machine learning

Wei Chen1,†    Weitao Du2,†,∗    Zhi-Ming Ma3,†,∗    Qi Meng1,†
Abstract

We study a new kind of SDE that arose from research on optimization in machine learning. We call it the power-law dynamic because its stationary distribution cannot have a sub-Gaussian tail and instead obeys a power law. We prove that the power-law dynamic is ergodic with a unique stationary distribution, provided the learning rate is small enough. We investigate its first exit time. In particular, we compare the exit times of the (continuous) power-law dynamic and its discretization. The comparison can help guide machine learning algorithms.

Keywords:
machine learning, stochastic gradient descent, stochastic differential equation, power-law dynamic

1 Introduction

In the past ten years, we have witnessed the rapid development of deep learning technology. Deep neural networks (DNNs) have been trained successfully and have achieved major breakthroughs in AI tasks such as computer vision he2015delving ; he2016deep ; krizhevsky2012imagenet , speech recognition oord2016wavenet ; ren2019fastspeech ; shen2018natural and natural language processing he2016dual ; sundermeyer2012lstm ; vaswani2017attention .

Stochastic gradient descent (SGD) is a mainstream optimization algorithm in deep learning. Specifically, in each iteration, SGD randomly samples a mini-batch of data and updates the model by the stochastic gradient. For large DNN models, the gradient computation over each instance is costly. Thus, compared to gradient descent, which updates the model by the gradient over the full batch of data, SGD can train DNNs much more efficiently. In addition, the gradient noise may help SGD escape from local minima of the non-convex optimization landscape.

Researchers have been investigating how the noise in SGD influences the optimization and generalization of deep learning. Recently, more and more works take SGD as a numerical discretization of a stochastic differential equation (SDE) and investigate the dynamic behavior of SGD by analyzing the SDE, including the convergence rate he2018differential ; li2017stochastic ; rakhlin2012making , the first exit time gurbuzbalaban2020heavy ; meng22020dynamic ; wu2018sgd ; xie2020diffusion , the PAC-Bayes generalization bound he2019control ; mou2017generalization ; smith2017bayesian and the optimal hyper-parameters he2019control ; li2017stochastic . Most of the results in this research line are derived for dynamics with state-independent noise, i.e., assuming that the diffusion coefficient of the SDE is a constant matrix independent of the state (the model parameters of the DNN). However, the covariance of the gradient noise in SGD does depend on the model parameters.

In our recent work meng22020dynamic ; meng2020dynamic , we studied the dynamic behavior of SGD with state-dependent noise. We found that, in the local region around a local minimum, the covariance of the gradient noise of SGD can be well approximated by a quadratic function of the state. We then proposed to investigate the dynamic behavior of SGD through a stochastic differential equation (SDE) with a quadratic state-dependent diffusion coefficient. As shown in meng22020dynamic ; meng2020dynamic , the new SDE with quadratic diffusion coefficient can better match the behavior of SGD than the SDE with constant diffusion coefficient.

In this paper, we study some mathematical properties of the new SDE with quadratic diffusion coefficient. After briefly introducing its machine learning background and investigating its preliminary properties (Section 2), we show in Section 3 that the stationary distribution of this new SDE is a power-law distribution (hence we call the corresponding dynamic a power-law dynamic), and that the distribution is heavy-tailed, which means that it cannot have a sub-Gaussian tail. Employing the coupling method, in Section 4 we prove that the power-law dynamic is ergodic with a unique stationary distribution, provided the learning rate is small enough. In the last two sections we analyze the first exit time of the power-law dynamic. We obtain an asymptotic order of the first exit time in Section 5, and then in Section 6 we compare the exit times of the (continuous) power-law dynamic and its discretization. The comparison can help guide machine learning algorithms.

2 Background and preliminaries on power-law dynamic

2.1 Background in Machine Learning

Suppose that we have training data $S_{n}=\{(x_{1},y_{1}),\cdots,(x_{n},y_{n})\}$ with inputs $\{x_{j}\}_{j=1}^{n}\in\mathbb{R}^{d_{1}\times n}$ and outputs $\{y_{j}\}_{j=1}^{n}\in\mathbb{R}^{d_{2}\times n}$. For a model $f_{w}(x):\mathbb{R}^{d_{1}}\rightarrow\mathbb{R}^{d_{2}}$ with parameter (vector) $w\in\mathbb{R}^{d}$, its loss over the training instance $(x_{j},y_{j})$ is $\ell(f_{w}(x_{j}),y_{j})$, where $\ell(\cdot,\cdot)$ is the loss function. In machine learning, we minimize the empirical loss over the training data, i.e.,

\min_{w}L(w):=\frac{1}{n}\sum_{j=1}^{n}\ell(f_{w}(x_{j}),y_{j}). (2.1)

Stochastic gradient descent (SGD) and its variants are the mainstream approaches to minimize $L(w)$. In SGD, the update rule at the $k$-th iteration is

w_{k+1}=w_{k}-\eta\cdot\tilde{g}(w_{k}), (2.2)

where $\eta$ denotes the learning rate and

\tilde{g}(w):=\frac{1}{b}\sum_{j\in S_{b}}\nabla_{w}\ell(f_{w}(x_{j}),y_{j}) (2.3)

is the stochastic gradient, with $S_{b}$ being a randomly sampled subset of $S_{n}$ of size $b:=|S_{b}|$. In the literature, $S_{b}$ is called a mini-batch.
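
For concreteness, the following is a minimal sketch (ours, not from the paper) of the update rule (2.2) with the mini-batch gradient (2.3) on a toy least-squares problem; the data and the function names `stochastic_gradient` and `sgd` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = X @ w_true + noise, squared loss l(f_w(x), y) = (y - x^T w)^2.
n, d = 1000, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def stochastic_gradient(w, batch_size):
    """Mini-batch gradient, as in (2.3), for the squared loss on a random subset S_b."""
    idx = rng.choice(n, size=batch_size, replace=False)
    Xb, yb = X[idx], y[idx]
    return 2.0 * Xb.T @ (Xb @ w - yb) / batch_size

def sgd(eta=0.05, batch_size=32, num_iters=2000):
    """SGD update rule (2.2): w_{k+1} = w_k - eta * g_tilde(w_k)."""
    w = np.zeros(d)
    for _ in range(num_iters):
        w = w - eta * stochastic_gradient(w, batch_size)
    return w

print("distance to the minimizer:", np.linalg.norm(sgd() - w_true))
```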

We know that $\tilde{g}(w)$ is an unbiased estimator of the full gradient $\nabla L(w)$. The gap between the full gradient and the stochastic gradient, i.e.,

R(w):=\nabla L(w)-\tilde{g}(w), (2.4)

is called the gradient noise in SGD. In the literature, e.g. li2017stochastic ; meng22020dynamic ; xie2020diffusion , the gradient noise $R(w)$ is assumed to be drawn from a Gaussian distribution (under mild conditions, this assumption is approximately satisfied by the Central Limit Theorem), that is, $R(w)\sim\mathcal{N}(0,\Sigma(R(w)))$, where $\Sigma(R(w))$ is the covariance matrix of $R(w)$. Denoting $\Sigma(R(w))$ by $C(w)$, the update rule of SGD in Eq. (2.2) is then approximated by:

w_{k+1}=w_{k}-\eta\nabla L(w_{k})+\eta\xi_{k},\quad\xi_{k}\sim\mathcal{N}(0,C(w_{k})). (2.5)

Further, for a small enough learning rate $\eta$, Eq. (2.5) can be viewed as the numerical discretization of the following stochastic differential equation (SDE) he2018differential ; li2017stochastic ; meng22020dynamic ,

dw_{t}=-\nabla L(w_{t})dt+\sqrt{\eta C(w_{t})}dB_{t}, (2.6)

where $B_{t}$ is the standard Brownian motion in $\mathbb{R}^{d}$. This viewpoint enables researchers to investigate the dynamic properties of SGD by means of stochastic analysis. In this line, recent work studied the dynamic of SGD with the help of the SDE. However, most of the quantitative results in this line of work were obtained for dynamics with state-independent noise. More precisely, the authors assumed that the covariance $C(w_{t})$ in Eq. (2.6) is a constant matrix independent of the state $w_{t}$. This assumption of a constant diffusion coefficient simplifies the calculation and the corresponding analysis, but it is oversimplified because the noise covariance in SGD does depend on the model parameters.

In our recent work meng22020dynamic ; meng2020dynamic , we studied the dynamic behavior of SGD with state-dependent noise. The theoretical derivations and empirical observations in that research show that the covariance of the gradient noise of SGD in the local region of a local minimum can be well approximated by a quadratic function of the state $w_{t}$, as briefly reviewed below.

Let $w^{*}$ be a local minimum of the (empirical/training) loss function defined in (2.1). We assume that the loss function in the local region of $w^{*}$ can be approximated by its second-order Taylor expansion,

L(w)=L(w^{*})+\nabla_{w}L(w^{*})(w-w^{*})+\frac{1}{2}(w-w^{*})^{T}H(w-w^{*}), (2.7)

where $H$ is the Hessian matrix of the loss at $w^{*}$. Since $\nabla_{w}L(w^{*})=0$ at the local minimum $w^{*}$, (2.7) reduces to

L(w)=L(w^{*})+\frac{1}{2}(w-w^{*})^{T}H(w-w^{*}). (2.8)

Under the above setting, the full gradient of the training loss is

\nabla L(w)=H(w-w^{*}), (2.9)

and the stochastic gradient (2.3) is

\tilde{g}(w)=\tilde{g}(w^{*})+\tilde{H}(w-w^{*}), (2.10)

where $\tilde{g}(\cdot)$ and $\tilde{H}(\cdot)$ are the gradient and the Hessian calculated on the mini-batch. More explicitly, the $i$-th component of $\tilde{g}(w)$ is

\tilde{g}_{i}(w)=\tilde{g}_{i}(w^{*})+\sum_{a=1}^{d}\tilde{H}_{ia}(w_{a}-w^{*}_{a}). (2.11)

Assuming that $Cov(\tilde{g}_{i}(w^{*}),\tilde{H}_{jk})=0$ for $i,j,k\in\{1,...,d\}$ (this assumption holds for the additive-noise case with the squared loss wei2020implicit : for $\ell(w)=(y-f_{w}(x))^{2}$, the gradient and Hessian are $g(w)=-2f^{\prime}_{w}(x)(y-f_{w}(x))$ and $H(w)=2(f^{\prime}_{w}(x))^{2}-2f^{\prime\prime}_{w}(x)(y-f_{w}(x))\approx 2(f^{\prime}_{w}(x))^{2}$; with additive noise, $y=f_{w^{*}}(x)+\epsilon$, where $\epsilon$ is white noise independent of the input $x$, so that $g(w^{*})=-2f^{\prime}_{w^{*}}(x)\epsilon$, $H(w^{*})\approx 2(f^{\prime}_{w^{*}}(x))^{2}$, and hence $Cov(\tilde{g}_{i}(w^{*}),\tilde{H}_{jk})\approx 0$), we have

C(w)_{ij}=Cov(\tilde{g}_{i}(w),\tilde{g}_{j}(w))=\Sigma_{ij}+(w-w^{*})^{T}A^{ij}(w-w^{*}), (2.12)

where $\Sigma_{ij}=Cov(\tilde{g}_{i}(w^{*}),\tilde{g}_{j}(w^{*}))$ and $A^{ij}$ is a $d\times d$ matrix with elements $A^{ij}_{ab}=Cov(\tilde{H}_{ia},\tilde{H}_{jb})$.

Thus, we can convert $C(w)$ into an analytically tractable form as follows:

C(w)=\Sigma_{g}(1+(w-w^{*})^{T}\Sigma_{H}(w-w^{*})), (2.13)

where $\Sigma_{g}$ and $\Sigma_{H}$ are positive definite matrices. The empirical observations in meng22020dynamic ; meng2020dynamic are consistent with the covariance structure (2.13). Thus the SDE (2.6) takes the form

dw_{t}=-H(w_{t}-w^{*})dt+\sqrt{\eta C(w_{t})}dB_{t}, (2.14)

where $C(w)$ is given by (2.13). We call the dynamic driven by (2.14) a power-law dynamic because its stationary distribution obeys a power law (see Theorem 3.1 below). As shown in meng22020dynamic ; meng2020dynamic , the power-law dynamic can better match the behavior of SGD than the SDE with constant diffusion coefficient.
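
As a quick numerical illustration of the covariance structure (2.12)-(2.13), the following sketch (ours; a one-dimensional linear model with additive noise, as in the assumption discussed above) estimates the variance of the per-sample gradient on a grid around $w^{*}$ and fits it with a quadratic in $w-w^{*}$; the linear coefficient should be close to zero.

```python
import numpy as np

rng = np.random.default_rng(1)

# One-dimensional linear model f_w(x) = w * x with additive label noise y = w* x + eps.
n = 50000
x = rng.normal(size=n)
w_star = 1.5
y = w_star * x + 0.1 * rng.normal(size=n)

def gradient_noise_variance(w):
    """Empirical variance over samples of the per-sample gradient of (y - w x)^2."""
    per_sample_grad = 2.0 * x * (w * x - y)
    return per_sample_grad.var()

offsets = np.linspace(-0.5, 0.5, 21)                 # values of w - w*
variances = np.array([gradient_noise_variance(w_star + t) for t in offsets])

# Fit variance(w) ~ Sigma + A * (w - w*)^2; the linear coefficient should be ~ 0,
# in line with the quadratic form in (2.12)-(2.13).
quad_coef, lin_coef, const_coef = np.polyfit(offsets, variances, deg=2)
print("quadratic, linear, constant terms:", quad_coef, lin_coef, const_coef)
```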

2.2 Preliminaries on power-law dynamic

For the power-law dynamic (2.14), the infinitesimal generator exists and has the following form:

\mathcal{A}=\sum_{i,j}\frac{\eta}{2}(\Sigma_{g})_{ij}(1+w^{T}\Sigma_{H}w)\frac{\partial}{\partial w^{i}}\frac{\partial}{\partial w^{j}}-\sum_{i,j}H_{ij}w^{j}\frac{\partial}{\partial w^{i}}.

We will use the infinitesimal generator to specify a coupling in the subsequent sections. Writing $v_{t}=w_{t}-w^{*}$, we have

dv_{t}=-Hv_{t}dt+\sqrt{\eta C(v_{t})}dB_{t}, (2.15)

where $C(v)=\Sigma_{g}(1+v^{T}\Sigma_{H}v)$ (compared with (2.14), we slightly abuse the notation $C$ here).

In machine learning, we often assume that the dynamic in (2.15) can be decoupled meng2020dynamic ; xie2020diffusion ; zhang2019algorithmic . More explicitly, we assume that $\Sigma_{g}$, $\Sigma_{H}$ and $H$ are codiagonalizable by an orthogonal matrix $Q$; then, under the affine transformation $v_{t}=Q(w_{t}-w^{*})$, (2.15) is decoupled and can be written as

dv_{t}^{i}=-h_{i}v_{t}^{i}dt+\sqrt{\eta\sigma_{i}+\eta\rho_{i}(v_{t}^{i})^{2}}dB^{i}_{t},\ \ i\in\{1,\dots,d\}, (2.16)

where $\sigma_{i}$, $\rho_{i}$ are positive constants. (This decoupling property is empirically observed in machine learning: the eigenvector directions of the Hessian matrix and of the gradient covariance matrix often nearly coincide at the minimum xie2020diffusion . One explanation of this phenomenon is that, in expectation, the Hessian equals the Fisher information jastrzkebski2017three ; xie2020diffusion ; zhu2019anisotropic .)
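
The decoupled dynamic (2.16) is straightforward to simulate coordinate-wise. Below is a minimal Euler-Maruyama sketch (ours; the parameter values are arbitrary) for a single coordinate.

```python
import numpy as np

def simulate_power_law_1d(h=1.0, sigma=1.0, rho=1.0, eta=0.1,
                          v0=0.0, dt=1e-3, num_steps=200_000, seed=0):
    """Euler-Maruyama discretization of one coordinate of (2.16):
    dv = -h v dt + sqrt(eta*sigma + eta*rho*v^2) dB."""
    rng = np.random.default_rng(seed)
    v = np.empty(num_steps + 1)
    v[0] = v0
    for k in range(num_steps):
        diffusion = np.sqrt(eta * sigma + eta * rho * v[k] ** 2)
        v[k + 1] = v[k] - h * v[k] * dt + diffusion * np.sqrt(dt) * rng.normal()
    return v

path = simulate_power_law_1d()
burned = path[50_000:]  # discard a burn-in segment
print("long-run mean and second moment:", burned.mean(), (burned ** 2).mean())
```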

Following the convention of the probabilistic literature, in what follows we shall write

\mu(v_{t})=-Hv_{t}, (2.17)
\sigma^{2}(v_{t})=\eta C(v_{t}). (2.18)

For $x,y\in\mathbb{R}$ and $a,b>0$, the mean value theorem gives the following inequality:

|\sqrt{a+bx^{2}}-\sqrt{a+by^{2}}|\leq\sqrt{b}|x-y|. (2.19)

It is then easy to check that both $\mu(\cdot)$ and $\sigma(\cdot)$ are locally Lipschitz and have linear growth. Therefore, by the standard theory of stochastic differential equations, the SDE (2.15) has a unique strong solution $v(t)$, which has continuous paths and possesses the strong Markov property.

For the decoupled dynamic in (2.16), we use the fact that, as $|x_{i}|\rightarrow\infty$,

|\frac{\mu_{i}(x_{i})}{\sigma_{i}^{2}(x_{i})}|=|\frac{h_{i}x_{i}}{\eta\sigma_{i}+\eta\rho_{i}x_{i}^{2}}|\sim O(\frac{1}{|x_{i}|}).

Then, for any fixed $x_{0}$, we have

\int_{-\infty}^{x_{0}}\exp(-\int_{x_{0}}^{x}\frac{2\mu_{i}(s)}{\sigma_{i}^{2}(s)}ds)(\int_{x}^{x_{0}}\frac{\exp(\int_{x_{0}}^{y}\frac{2\mu_{i}(s)}{\sigma_{i}^{2}(s)}ds)}{\sigma_{i}^{2}(y)}dy)dx=\infty,

and

\int_{x_{0}}^{\infty}\exp(-\int_{x_{0}}^{x}\frac{2\mu_{i}(s)}{\sigma_{i}^{2}(s)}ds)(\int_{x_{0}}^{x}\frac{\exp(\int_{x_{0}}^{y}\frac{2\mu_{i}(s)}{\sigma_{i}^{2}(s)}ds)}{\sigma_{i}^{2}(y)}dy)dx=\infty,

which implies that each component of $v_{t}$ will not blow up in finite time.

To conclude, the stochastic differential equation (2.16) admits a unique strong solution $v(t)$, which has continuous paths and does not blow up in finite time. In the subsequent sections we shall study further properties of the dynamic $v(t)$.

3 Property of the stationary distribution

In this section, we show that the stationary distribution of the SDE (2.15) is heavy-tailed, and that in the decoupled case it is a product of power-law distributions. The existence and uniqueness of the stationary distribution will be given in the next section.

Let $Q$ be an orthogonal matrix such that $H^{\prime}=QHQ^{T}$ is diagonal. Then

d(Qv_{t})=-H^{\prime}Qv_{t}dt+\sqrt{1+(Qv_{t})^{T}\tilde{\Sigma}_{H}Qv_{t}}\cdot\sqrt{\eta\tilde{\Sigma}_{g}}d\tilde{B}_{t}, (3.1)

where $\tilde{\Sigma}_{H}=Q\Sigma_{H}Q^{T}$ and $\tilde{\Sigma}_{g}=Q\Sigma_{g}Q^{T}$. Note that $\tilde{B}_{t}=QB_{t}$ is still a Brownian motion. Equation (3.1) is just the power-law dynamic (2.15) under a new orthogonal coordinate system, so we will abuse notation and denote the transformed dynamic by $v_{t}$ as well. Since we care about the tail behavior of the power-law dynamic, we first show that higher moments of $v_{t}$ do not remain finite as $t\rightarrow\infty$. This implies that the tail of $v_{t}$ cannot decay exponentially.

Theorem 3.1

(i) There exists $m\geq 2$ such that the moments of the power-law dynamic (3.1) of order greater than $m$ explode as time $t\rightarrow\infty$.
(ii) For the decoupled case (2.16), the probability density of the stationary distribution is a product of power-law distributions (the terminology follows zhou ), namely

p(x)=\frac{1}{Z}\prod_{i=1}^{d}(1+\frac{\rho_{i}}{\sigma_{i}}x_{i}^{2})^{\kappa_{i}}, (3.2)

where $\kappa_{i}=-\frac{\eta\rho_{i}+h_{i}}{\eta\rho_{i}}$ and $Z$ is the normalization constant.

Proof

(i) Denote the $2k$-th moment of $v_{t}$ as $m_{2k}(t):=\sum_{i}\mathbb{E}(v_{t}^{i})^{2k}$ for $k\geq 1$, with the convention $m_{0}(t)=1$. By Ito's formula, we have

d\sum_{i}\mathbb{E}(v_{t}^{i})^{2k}=\sum_{i}\{-H^{\prime}_{ii}(v_{t}^{i})^{2k}+k(2k-1)(1+v^{T}\tilde{\Sigma_{H}}v)\tilde{\Sigma_{g}}_{ii}\mathbb{E}(v_{t}^{i})^{2k-2}\}.

Let $h_{max}$ be the maximal diagonal element of $H^{\prime}$ and $g_{min}$ be the minimal diagonal element of $\tilde{\Sigma}_{g}$; then we get the following recursion inequality (note that it may not hold for the odd-degree moments):

dm_{2k}(t)\geq(-kh_{max}+k(2k-1)g_{min}H_{min})m_{2k}(t)+k(2k-1)g_{min}m_{2k-2}(t), (3.3)

where $H_{min}$ is the minimal eigenvalue of the positive definite matrix $\tilde{\Sigma}_{H}$. Let $a_{k}=-kh_{max}+k(2k-1)g_{min}H_{min}$ and $b_{k}=k(2k-1)g_{min}$; then

m_{2k}(t)\geq e^{a_{k}t}(x_{0}^{2k}+\int_{0}^{t}R_{k}(s)\exp(-a_{k}s)ds),

where the remainder term is defined by $R_{k}(s):=k(2k-1)g_{min}m_{2k-2}(s)$. From the above relation, we can prove the following inequality by induction:

m_{2k}(t)\geq\sum_{i=0}^{k}c_{k}^{i}\exp(a_{k}^{i}t). (3.4)

By tracking the relevant coefficients carefully, it is not difficult to find the recurrence relations for the $c_{k}^{i}$. For example,

c_{k}^{k}=m_{2k}(0)-\sum_{i=0}^{k-1}c_{k}^{i}.

Since $g_{min}H_{min}$ is positive, $a_{k}$ becomes positive when $k$ is large. From this fact, we can always find a $k$ such that

\lim_{t\rightarrow\infty}m_{2k}(t)=\infty,

which means that the moment generating function of the stationary distribution blows up. Therefore, the stationary distribution of $v_{t}$ cannot have a sub-Gaussian tail.

(ii) We now turn to the decoupled case (2.16). Since the coordinates evolve independently, the probability density is of product form. To investigate the probability density $p(t,x)$ of one fixed coordinate $v_{t}^{i}$, we study the Kolmogorov forward equation satisfied by $p(t,x)$. Since $\mu(x)$ and $\sigma(x)$ have linear growth, we have

\frac{\partial p}{\partial t}=\frac{1}{2}\Delta(\sigma^{2}p)-\nabla(\mu p)=\frac{1}{2}\Delta[(\eta\sigma_{i}+\eta\rho_{i}x^{2})p]+\nabla[h_{i}xp].

Here we adopt an idea from the statistical physics literature zhou , which has also been used in the machine learning area (see meng22020dynamic ): we first transform the Kolmogorov equation into the Smoluchowski form,

\frac{1}{2}\Delta[(\eta\sigma_{i}+\eta\rho_{i}x^{2})p]+\nabla[h_{i}xp]=\nabla\cdot[\eta\rho_{i}xp+\frac{\sigma^{2}}{2}\nabla p+h_{i}xp]
=\nabla[(h_{i}x+\eta\rho_{i}x)p]+\nabla[\frac{\sigma^{2}}{2}\nabla p]. (3.5)

Let $\frac{\partial U}{\partial x}:=h_{i}x+\eta\rho_{i}x$; then the fluctuation-dissipation relation between $\sigma^{2}$ and $U$ in zhou is satisfied with

\kappa_{i}=-\frac{\eta\rho_{i}+h_{i}}{\eta\rho_{i}}.

Let $p_{s}:=\lim_{t\rightarrow\infty}p(t,x)$; then the stationary distribution satisfies the power law

p_{s}(x)=\frac{1}{Z}(1+\frac{\rho_{i}}{\sigma_{i}}x^{2})^{\kappa_{i}}. (3.6)

The proof is completed.
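
As a numerical sanity check of (3.6) (ours, with assumed parameter values), one can compute the normalization constant $Z$ and the second moment of $p_{s}$ by quadrature; the second moment can be compared with the long-run average of a simulation such as the Euler-Maruyama sketch given after (2.16).

```python
import numpy as np
from scipy.integrate import quad

h, sigma, rho, eta = 1.0, 1.0, 1.0, 0.1
kappa = -(eta * rho + h) / (eta * rho)      # tail index kappa_i of (3.6); here -11

unnormalized = lambda x: (1.0 + rho / sigma * x ** 2) ** kappa
Z, _ = quad(unnormalized, -np.inf, np.inf)  # normalization constant
second_moment, _ = quad(lambda x: x ** 2 * unnormalized(x) / Z, -np.inf, np.inf)

print("Z =", Z)
print("second moment of p_s =", second_moment)
```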

Remark 1

(i) In zhou , the tail index $\kappa_{i}$ (which depends on the hyper-parameter $\eta$) plays an important role in locating the large-learning-rate region. Observe that when $\kappa_{i}\geq-\frac{1}{2}$, the variance of $p_{s}$ is infinite.

(ii) Another way to view the power-law dynamic is to apply the results of the groundbreaking article ma . Roughly speaking, the authors of ma gave a complete characterization of the SDEs that admit a prescribed stationary distribution. Following the notation of ma , suppose we write the SDE in the form

dz=f(z)dt+\sqrt{2D(z)}dB(t),

where $D(z)$ is a positive semi-definite diffusion matrix (a Riemannian metric). Suppose the stationary distribution is $p_{s}(z)\propto\exp(-H(z))$; then the drift term $f(z)$ must satisfy

f(z)=-[D(z)+Q(z)]\nabla H(z)+\Gamma(z),

where $Q(z)$ is an arbitrary skew-symmetric matrix (a symplectic form) and $\Gamma(z)$ is defined by

\Gamma_{i}(z)=\sum_{j=1}^{d}\frac{\partial}{\partial z_{j}}(D_{ij}(z)+Q_{ij}(z)).

When $d=1$, the skew-symmetry forces $Q(z)\equiv 0$. If the stationary distribution is given by (3.6), then $H(z)=-\kappa\ln(1+\frac{\rho}{\sigma}z^{2})$. Thus,

\nabla H(z)=\frac{-2\kappa\rho z}{\sigma+\rho z^{2}},

and, with $D(z)=\frac{1}{2}(\eta\sigma+\eta\rho z^{2})$ and $\Gamma(z)=\eta\rho z$, we get

f(z)=(1+\kappa)\eta\rho z=-hz.

In this way, we automatically recover the drift of the power-law dynamic, i.e., the fluctuation-dissipation relation.

4 Existence and uniqueness of the stationary distribution

In this section, we shall prove that the power-law dynamic is ergodic with a unique stationary distribution, provided the learning rate $\eta$ is small enough (see Theorem 4.1 (ii) below). Note that, unlike Langevin dynamics, the power-law dynamic has a state-dependent diffusion term and its stationary distribution does not have a sub-Gaussian tail, so the diffusion process does not satisfy the usual log-Sobolev inequality condition. Instead of treating $v_{t}$ as a gradient flow, we shall use the coupling method to bound the convergence of $v_{t}$ to its stationary distribution.

Let the drift vector $\mu(x)=-Hx$ and the diffusion matrix $\sigma^{2}(x)=\eta C(x)$, $x\in\mathbb{R}^{d}$, be defined as in (2.17) and (2.18), respectively. We set

\theta:=\inf_{x,y}\{-<\mu(x)-\mu(y),x-y>/\left\lVert x-y\right\rVert^{2}\}, (4.1)
\lambda:=\sup_{x,y}\{\max_{i}\sum_{1\leq j\leq d}(\sigma_{ij}(x)-\sigma_{ij}(y))^{2}/\left\lVert x-y\right\rVert^{2}\}. (4.2)
Theorem 4.1

(i) Let $p(t,x,\cdot)$ be the transition probability of the power-law dynamic driven by (2.15). Then

\mathbb{W}_{2}(p(t,x,\cdot),p(t,y,\cdot))\leq\parallel x-y\parallel e^{(d\cdot\lambda-\theta)t}, (4.3)

where $\mathbb{W}_{2}(\cdot,\cdot)$ is the Wasserstein distance between two probability distributions.

(ii) Employing the notation of the previous section, write $h_{min}$ for the minimal diagonal element of the matrix $H^{\prime}$, $g_{max}$ for the maximal element of $\sqrt{\tilde{\Sigma_{g}}}$, and $H_{sum}$ for the sum of the eigenvalues of $\tilde{\Sigma_{H}}$. Suppose that

\eta<\frac{h_{min}}{d^{2}\cdot g_{max}^{2}H_{sum}}; (4.4)

then the power-law dynamic in (2.15) is ergodic and its stationary distribution is unique.

Proof

(i) We shall employ the coupling method for Markov processes in this proof and in the rest of this paper; the reader may refer to Chapter 2 of chen2006eigenvalues , especially page 24 and Example 2.16, for the relevant background. Recall that the infinitesimal generator of an $\mathbb{R}^{d}$-valued diffusion process has the form $\mathcal{A}=\sum_{i,j}\alpha_{ij}(x)\frac{\partial^{2}}{\partial x_{i}\partial x_{j}}+\sum_{i}\beta_{i}(x)\frac{\partial}{\partial x_{i}}$. To specify a coupling between two power-law dynamics starting from different points of $\mathbb{R}^{d}$, we define a coupling infinitesimal generator $\mathcal{A}_{s}(x,y)$, $(x,y)\in\mathbb{R}^{2d}$, through

\alpha_{s}(x,y)=\begin{pmatrix}\sigma(x)\sigma(x)^{T},&\sigma(x)\sigma(y)^{T}\\ \sigma(y)\sigma(x)^{T},&\sigma(y)\sigma(y)^{T}\end{pmatrix},\ \ \beta_{s}(x,y)=\begin{pmatrix}\mu(x)\\ \mu(y)\end{pmatrix},

where $\mu(\cdot)$ and $\sigma(\cdot)$ are specified by (2.17) and (2.18), respectively; $\alpha_{s}(x,y)$ corresponds to the second-order differentiation and $\beta_{s}(x,y)$ corresponds to the first-order differentiation.

Let $r(x,y)=\left\lVert x-y\right\rVert^{2}$ and let $\mathcal{A}_{s}$ act on $r(x,y)$; we get

\mathcal{A}_{s}r(x,y)=2<\mu(x)-\mu(y),x-y>+\sum_{i}\sum_{j}(\sigma_{ij}(x)-\sigma_{ij}(y))^{2}
\leq-2\theta\left\lVert x-y\right\rVert^{2}+2d\lambda\left\lVert x-y\right\rVert^{2}
\leq cr(x,y),

where $c:=2d\lambda-2\theta$. Denote by $X_{t}$ the dynamic starting at $x$ and by $Y_{t}$ the dynamic starting at $y$. By Ito's formula, we have

\frac{d\mathbb{E}r(X_{t},Y_{t})}{dt}\leq c\mathbb{E}r(X_{t},Y_{t}).

Applying Gronwall's inequality, we get

\mathbb{E}r(X_{t},Y_{t})\leq r(x,y)e^{ct},

which implies that

\mathbb{W}_{2}(p(t,x,\cdot),p(t,y,\cdot))\leq\sqrt{r(x,y)}e^{ct/2}=\left\lVert x-y\right\rVert e^{(d\cdot\lambda-\theta)t},

verifying (4.3).

(ii) In view of (4.3), we only need to check that (4.4) implies $(d\cdot\lambda-\theta)<0$. We have

<\mu(x)-\mu(y),x-y>=-\sum_{i}H^{\prime}_{i}(x_{i}-y_{i})^{2}\leq-h_{min}\left\lVert x-y\right\rVert^{2},

therefore

\theta\geq h_{min}. (4.5)

On the other hand, recall that $g_{max}$ is the maximal element of $\sqrt{\tilde{\Sigma_{g}}}$; then for all $i$,

\sum_{1\leq j\leq d}(\sigma_{ij}(x)-\sigma_{ij}(y))^{2}\leq\eta\cdot g_{max}^{2}(\sqrt{1+x^{T}\tilde{\Sigma_{H}}x}-\sqrt{1+y^{T}\tilde{\Sigma_{H}}y})^{2}.

Since $\left\lVert x-y\right\rVert$ is preserved under orthogonal transformations, by the mean value theorem and the Cauchy-Schwarz inequality we can find $(\theta_{1},\dots,\theta_{d})$ such that

(\sigma_{ij}(x)-\sigma_{ij}(y))^{2}\leq\eta\cdot g_{max}^{2}|(\frac{h_{1}\theta_{1}}{\sqrt{1+\theta^{T}\tilde{\Sigma_{H}}\theta}},\dots,\frac{h_{d}\theta_{d}}{\sqrt{1+\theta^{T}\tilde{\Sigma_{H}}\theta}})
\cdot(x_{1}-y_{1},\dots,x_{d}-y_{d})|^{2}\leq\eta\cdot g_{max}^{2}H_{sum}\left\lVert x-y\right\rVert^{2},

where $\{h_{i}\}$ denote the eigenvalues of $\tilde{\Sigma_{H}}$ and $H_{sum}$ denotes their sum. Thus,

\max_{i}\sum_{1\leq j\leq d}(\sigma_{ij}(x)-\sigma_{ij}(y))^{2}\leq d\cdot\eta\cdot g_{max}^{2}H_{sum}\left\lVert x-y\right\rVert^{2}.

Consequently,

\lambda\leq d\cdot\eta\cdot g_{max}^{2}H_{sum}. (4.6)

Combining (4.5) and (4.6), we see that (4.4) implies $(d\cdot\lambda-\theta)<0$. Therefore Assertion (ii) holds by virtue of (4.3). The proof is completed.
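
The threshold in (4.4) is easy to evaluate numerically. The following sketch (ours; the matrices are arbitrary examples, not from the paper) computes $h_{min}$, $g_{max}$, $H_{sum}$ and the resulting bound on $\eta$.

```python
import numpy as np

# Example (assumed) matrices in the coordinates that diagonalize H.
H_prime = np.diag([2.0, 1.0, 0.5])                # diagonalized Hessian H'
Sigma_g = np.array([[1.0, 0.2, 0.0],
                    [0.2, 0.8, 0.1],
                    [0.0, 0.1, 0.5]])             # gradient-noise covariance at w*
Sigma_H = np.array([[0.6, 0.1, 0.0],
                    [0.1, 0.4, 0.0],
                    [0.0, 0.0, 0.3]])             # Hessian-noise covariance

d = H_prime.shape[0]
h_min = np.min(np.diag(H_prime))

# Symmetric square root of Sigma_g; g_max is its largest entry (in absolute value).
eigval, eigvec = np.linalg.eigh(Sigma_g)
sqrt_Sigma_g = eigvec @ np.diag(np.sqrt(eigval)) @ eigvec.T
g_max = np.max(np.abs(sqrt_Sigma_g))

H_sum = np.trace(Sigma_H)                         # sum of the eigenvalues of Sigma_H

eta_threshold = h_min / (d ** 2 * g_max ** 2 * H_sum)
print("ergodicity condition (4.4): eta <", eta_threshold)
```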

Remark 2

If we restrict ourselves to the decoupled case (2.16), we can get exponential convergence to the stationary distribution under a much weaker condition on $\eta$. Notice that now $\sigma(x)$ is a diagonal matrix. Using the shorthand $(\sigma_{ii}(x)-\sigma_{ii}(y))^{2}$ for $\sum_{1\leq j\leq d}(\sigma_{ij}(x)-\sigma_{ij}(y))^{2}$, we have

(\sigma_{ii}(x)-\sigma_{ii}(y))^{2}\leq(\sqrt{\eta(\sigma_{i}+\rho_{i}(x^{i})^{2})}-\sqrt{\eta(\sigma_{i}+\rho_{i}(y^{i})^{2})})^{2}
\leq\eta\rho_{i}(x_{i}-y_{i})^{2}.

Then, we have

L_{s}r(x,y)=\sum_{i}[(\sigma_{ii}(x)-\sigma_{ii}(y))^{2}-2h_{i}(x^{i}-y^{i})^{2}]
\leq\sum_{i}(\eta\rho_{i}-h_{i})(x^{i}-y^{i})^{2}
\leq c_{s}r(x,y),

where $c_{s}:=\max_{i}[\eta\rho_{i}-h_{i}]$, which does not involve the dimension $d$. In particular, $c_{s}<0$ as soon as $\eta<\min_{i}h_{i}/\rho_{i}$.

5 First exit time: asymptotic order

From now on, we investigate the first exit time of the power-law dynamic from a ball, which is an important issue in machine learning. By leveraging transition rate results from large deviation theory (see e.g. kraaij2019classical ), in this section we obtain an asymptotic order of the first exit time for the decoupled power-law dynamic.

Theorem 5.1

Suppose $0$ is the only local minimum of the loss function inside $B(0,r)$. Let $\tau_{r}^{x}(\eta)$ be the first exit time from $B(0,r)$ of the decoupled power-law dynamic in (2.16), with learning rate $\eta$, starting at $x\in B(0,r)$. Then

\lim_{\eta\rightarrow 0}\eta\log\mathbb{E}\tau^{x}_{r}(\eta)=C\cdot\inf_{\zeta=(\zeta^{1},\dots,\zeta^{d})\in\partial B(0,r)}\sum_{i}-\frac{h_{i}}{\rho_{i}}\log[\sigma_{i}+\rho_{i}(\zeta^{i})^{2}], (5.1)

where $C$ is a prefactor to be determined.
When $d=1$, we have an explicit expression for the expected first exit time from an interval $[a,b]$ starting at $x\in(a,b)$:

\mathbb{E}\tau_{x}=g(x):=2\int_{x}^{b}\frac{e^{\phi(y)}}{\sigma^{2}(y)}dy\int_{a}^{y}e^{-\phi(z)}dz, (5.2)

where $\phi(x)=-\kappa\ln(1+\frac{\rho}{\sigma}x^{2})$ and $\kappa=-\frac{\eta\rho+h}{\eta\rho}$.

Proof

Let $\tau$ be a stopping time with finite expectation and let $\mathcal{A}$ be the infinitesimal generator of $v_{t}$. Recall that Dynkin's formula tells us

\mathbb{E}[f(v_{\tau})]=f(x)+\mathbb{E}[\int_{0}^{\tau}\mathcal{A}f(v_{s})ds],\ \ \ f\in\mathcal{C}_{0}^{2}(\mathbb{R}^{d}),

where $v_{0}=x$. Suppose $f$ solves the following boundary problem:

\left\{\begin{array}[]{l}\mathcal{A}f(x)=-1,\ \ x\in B(0,r),\\ f(x)=0,\ \ x\in\partial B(0,r);\end{array}\right. (5.3)

then $\mathbb{E}\tau^{x}_{r}=f(x)$, where $\tau^{x}_{r}$ denotes the first exit time of $v_{t}$ starting at $x$ from the ball $B(0,r)$.

We first consider the situation $d=1$. Let $\tau^{x}_{(a,b)}(\eta)$ be the first exit time of $v(t)$ from an interval $[a,b]$ starting at $x\in(a,b)$. Note that the diffusion coefficient function $\sigma(x)=\sqrt{\eta\sigma+\eta\rho x^{2}}>0$; then, by Dynkin's formula, $\mathbb{E}\tau^{x}_{(a,b)}(\eta)=g(x)$, where $g(x)$ solves the following second-order ODE:

\left\{\begin{array}[]{l}\mathcal{A}_{1}g(x)=-1,\ \ x\in(a,b),\\ g(x)=0,\ \ x\in\{a,b\},\end{array}\right. (5.4)

where $\mathcal{A}_{1}=-hx\frac{\partial}{\partial x}+\frac{1}{2}(\eta\sigma+\eta\rho x^{2})\frac{\partial^{2}}{\partial x^{2}}$ is the infinitesimal generator of the one-dimensional power-law diffusion. Now we introduce the integration factor $\phi(x):=-\kappa\ln(1+\frac{\rho}{\sigma}x^{2})$; following (3.5),

\nabla[(hx+\eta\rho x)f(x)]+\nabla[\frac{\sigma^{2}}{2}\nabla f(x)]=\frac{1}{2}\nabla[\sigma^{2}e^{-\phi}\nabla(e^{\phi}f(x))]. (5.5)

Then,

\mathcal{A}_{1}f(x)=\frac{1}{2}\nabla[\sigma^{2}e^{-\phi}\nabla(e^{\phi}f(x))]-\nabla[\eta\rho xf(x)]+hf(x)-\eta\rho x\nabla f(x)-2\nabla[hxf(x)]
=\frac{1}{2}\nabla[\sigma^{2}e^{-\phi}\nabla(e^{\phi}f(x))]-\frac{1}{2}\nabla[\dot{\phi}\sigma^{2}(x)f(x)]-\frac{1}{2}\dot{\phi}\sigma^{2}\nabla f(x)
=\frac{1}{2}e^{\phi}\nabla[e^{-\phi}\sigma^{2}\nabla f(x)],

where we write $\dot{\phi}=2(\eta\rho+h)x/\sigma^{2}$ to get the second line. Therefore (5.4) is equivalent to

\mathcal{A}_{1}g(x)=\frac{1}{2}e^{\phi}\nabla[e^{-\phi}\sigma^{2}\nabla g(x)]=-1.

Integrating the above equation, we recover (13) of du2012power :

\mathbb{E}\tau^{x}_{(a,b)}(\eta)=g(x)=2\int_{x}^{b}dy\frac{e^{\phi(y)}}{\sigma^{2}(y)}\int_{a}^{y}e^{-\phi(z)}dz. (5.6)

The reader can check Section 12.3 of weinan2019applied for the asymptotic analysis of (5.6) as $\eta\rightarrow 0$.
We now investigate the general situation $d>1$, for which we only obtain asymptotic estimates on the exit time as the learning rate $\eta\rightarrow 0$. For this purpose, it is convenient to introduce a geometric reformulation of (2.16). Suppose $B_{t}$ is the standard Brownian motion; recall that, in local coordinates, the Riemannian Brownian motion $W_{t}$ with metric $\{g_{ij}\}$ has the following form (cf. e.g. hsu2002stochastic ):

dW_{t}^{i}=\sigma_{j}^{i}(W_{t})dB^{j}_{t}+\frac{1}{2}b^{i}(W_{t})dt, (5.7)

where $b^{i}(x)=g^{jk}(x)\Gamma_{jk}^{i}(x)$ and $\sigma_{ij}(x)=\sqrt{g^{ij}(x)}$. Comparing (5.7) with the martingale part of (2.16), we define the inverse metric as $g^{ij}=(\sigma_{i}+\rho_{i}(x^{i})^{2})\delta_{ij}$. Then the metric $g_{ij}$ is also a diagonal matrix. The Christoffel symbols can be calculated under the new metric:

\Gamma_{jk}^{i}(x)=\rho_{i}x^{i}\cdot(\rho_{i}+\rho_{i}(x^{i})^{2}),\text{\ when\ }i=j=k,

and $\Gamma_{jk}^{i}(x)=0$ otherwise. Denote the gradient vector field of a smooth function $f(x)$ by $\nabla f(x)$; then

\nabla f(x)=(\partial_{i}f)g^{ij}\partial_{j}(x),

where $\partial_{i}(x):=\frac{\partial}{\partial x_{i}}(x)$, $1\leq i\leq d$, denotes the $i$-th coordinate tangent vector at $x\in\mathbb{R}^{d}$. To emphasize the parameter $\eta$ that appears in the diffusion term of the power-law dynamic, we denote the dynamic and its corresponding exit time by $v_{t}(\eta)$ and $\tau^{x}_{r}(\eta)$. Let $f_{\eta}(x):=-\sum_{i}\frac{h_{i}}{2\rho_{i}}\log(\sigma_{i}+\rho_{i}(x^{i})^{2})-\frac{\sqrt{\eta}}{2}\rho_{i}(\frac{\sigma_{i}}{2}(x^{i})^{2}+\frac{\rho_{i}}{3}(x^{i})^{3})$; then $v_{t}(\eta)$ in (2.16) can be seen as a diffusion process under the new metric:

dv_{t}(\eta)=\nabla f_{\eta}(v_{t})dt+\sqrt{\eta}dW_{t}, (5.8)

where $W_{t}$ is the Riemannian Brownian motion given by (5.7) and $0$ is a local minimum of the limit function $f_{lim}(x):=\lim_{\eta\rightarrow 0}f_{\eta}(x)=-\sum_{i}\frac{h_{i}}{2\rho_{i}}\log(\sigma_{i}+\rho_{i}(x^{i})^{2})$. Note that both the drift term and the diffusion term are intrinsically defined with respect to the metric $\{g_{ij}\}_{1\leq i,j\leq d}$. By large deviation theory, the rate function $I_{lim}(\phi)$ of a path $\phi:[0,T]\rightarrow\mathbb{R}^{d}$ is

I_{lim}(\phi)=\frac{1}{2}\int_{0}^{T}\left\lVert\dot{\phi}(t)-\nabla f_{lim}(\phi(t))\right\rVert^{2}dt+2[f_{lim}(\phi(T))-f_{lim}(\phi(0))],

where the norm is taken with respect to the Riemannian metric $g_{ij}$. It follows that the quasi-potential of the ball $B(0,r)$ is given by

\bar{f}_{lim}=2[\inf_{\zeta=(\zeta^{1},\dots,\zeta^{d})\in\partial B(0,r)}f_{lim}(\zeta)-f_{lim}(0)].

By Theorem 2.2 and Corollary 2.4 of Nils , if $0$ is the only local minimum of $f_{lim}$ in $B(0,r)$, then there exists a constant $C>0$ such that

\lim_{\eta\rightarrow 0}\eta\log\mathbb{E}\tau^{x}_{r}(\eta)=C\cdot 2[\inf_{\zeta=(\zeta^{1},\dots,\zeta^{d})\in\partial B(0,r)}f_{lim}(\zeta)-f_{lim}(0)]
=C\cdot\inf_{\zeta=(\zeta^{1},\dots,\zeta^{d})\in\partial B(0,r)}\sum_{i}-\frac{h_{i}}{\rho_{i}}\log[1+\frac{\rho_{i}}{\sigma_{i}}(\zeta^{i})^{2}].

The proof is completed.

Remark 3

When the dimension $d=1$, we can get similar results with a precise prefactor by applying the semiclassical approximation to the integral (5.6); see kolokoltsov2007semiclassical . Taking the exponential of (5.1), it is clear that the leading order of the average exit time is of power-law form with respect to the radius $r$.
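
For $d=1$, the right-hand side of (5.6) can also be evaluated directly by nested quadrature. The following sketch is ours, with assumed parameter values; the variable `sig` stands for the noise parameter $\sigma$ to avoid clashing with the diffusion coefficient function.

```python
import numpy as np
from scipy.integrate import quad

h, sig, rho, eta = 1.0, 1.0, 1.0, 0.1
kappa = -(eta * rho + h) / (eta * rho)

def phi(x):
    """Integration factor phi(x) = -kappa * ln(1 + rho/sig * x^2)."""
    return -kappa * np.log(1.0 + rho / sig * x ** 2)

def sigma2(x):
    """Squared diffusion coefficient eta*sig + eta*rho*x^2."""
    return eta * sig + eta * rho * x ** 2

def mean_exit_time(x, a, b):
    """Right-hand side of (5.6): 2 * int_x^b e^{phi(y)}/sigma2(y) * int_a^y e^{-phi(z)} dz dy."""
    inner = lambda y: quad(lambda z: np.exp(-phi(z)), a, y)[0]
    outer, _ = quad(lambda y: np.exp(phi(y)) / sigma2(y) * inner(y), x, b)
    return 2.0 * outer

print("E[tau] for x = 0 on (a, b) = (-1, 1):", mean_exit_time(0.0, -1.0, 1.0))
```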

6 First exit time: from continuous to discrete

In this section, we compare the exit times of the continuous power-law dynamic (2.15) and its discretization:

z_{k+1}=z_{k}+\epsilon\mu(z_{k})+\sqrt{\epsilon}\,\epsilon_{k},\ \ \epsilon_{k}\sim\mathcal{N}(0,\sigma^{2}(z_{k})). (6.1)

Note that the first exit time of the discretized dynamic (6.1) is an integer that measures how many steps it takes to escape from the ball; thus $K$ time steps correspond to $K\epsilon$ amount of time. From this viewpoint, the comparison can help guide machine learning algorithms, provided the time interval $\epsilon$ coincides with the learning rate $\eta$. However, since in the power-law dynamic the covariance matrix $\sigma^{2}(z_{k})$ contains $\eta$, for the convenience of the theoretical discussion we temporarily distinguish $\epsilon$ from $\eta$ until we arrive at the conclusion.
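
The following Monte Carlo sketch (ours, with assumed parameters) simulates the one-dimensional discretized dynamic (6.1) and records the number of steps needed to leave $B(0,r)$; multiplying by $\epsilon$ converts the count into the corresponding amount of continuous time.

```python
import numpy as np

def discrete_exit_steps(r=0.8, h=1.0, sig=1.0, rho=1.0, eta=0.1,
                        eps=1e-2, max_steps=10_000_000, seed=0):
    """One realization of the first exit time (in steps) of the 1-D version of (6.1)
    from B(0, r), started at z_0 = 0."""
    rng = np.random.default_rng(seed)
    z = 0.0
    for k in range(1, max_steps + 1):
        noise_std = np.sqrt(eta * sig + eta * rho * z ** 2)        # sigma(z_k)
        z = z - eps * h * z + np.sqrt(eps) * noise_std * rng.normal()
        if abs(z) >= r:
            return k
    return max_steps

eps = 1e-2
steps = [discrete_exit_steps(eps=eps, seed=s) for s in range(10)]
mean_steps = float(np.mean(steps))
print("mean exit steps:", mean_steps, "~ continuous time:", mean_steps * eps)
```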

To shorten the length of the article, we confine ourselves to the situation $d=1$. Assume the local minimum is $0$. Let $\tau_{r}^{0}$ be the first exit time from the ball $B(0,r)$ of the one-dimensional continuous power-law dynamic in (2.15) and let $\bar{\tau}_{r}^{0}$ be the corresponding first exit time of the discretized dynamic (6.1), both starting from $0$. Then we have the following comparison of $\mathbb{P}[\bar{\tau}_{r}^{0}>K]$ with the corresponding quantities related to the first exit times of the continuous dynamic.

Theorem 6.1

Suppose $\delta,\bar{\delta}>0$ satisfy $\delta+\bar{\delta}<r$. Given a large integer $K$, we have

\mathbb{P}[\tau_{r-\delta}^{0}>K\epsilon]-\frac{4}{3\delta^{2}}\frac{E(\epsilon)K}{c\epsilon}\leq\mathbb{P}[\bar{\tau}_{r}^{0}>K]
\leq\mathbb{P}[\tau^{0}_{r+\delta+\bar{\delta}}>K\epsilon]+\frac{4}{3\delta^{2}}\frac{E(\epsilon)K}{c\epsilon}+1-(1-\frac{C(\epsilon,\eta)}{\bar{\delta}^{4}})^{K},

where $E(\epsilon)\sim O(\epsilon^{2})$ and $C(\epsilon,\eta)\sim O(\epsilon)$ as $\epsilon\rightarrow 0$.

Proof

For our purpose we introduce an interpolation process $z_{t}$ as follows:

dz_{t}=-hz_{k}dt+\sqrt{\eta\sigma+\eta\rho(z_{k})^{2}}dB_{t},\ \ t\in[(k-1)\epsilon,k\epsilon), (6.2)

where $\epsilon$ is the discretization step size. More precisely, the drift coefficient and the diffusion coefficient of (6.2) remain unchanged on $t\in[(k-1)\epsilon,k\epsilon)$ for each $1\leq k\leq K$. Note that if we rewrite $\sqrt{\eta\sigma+\eta\rho(z_{k})^{2}}$ as $\sigma(z_{k})$, (6.2) is expressed as

dz_{t}=-hz_{k}dt+\sigma(z_{k})dB_{t},\ \ t\in[(k-1)\epsilon,k\epsilon), (6.3)

which reduces to (6.1) at $t=k\epsilon$ for each $k$.

We adopt a strategy similar to that of nguyen2019first to transfer from the average exit time of the power-law dynamic $v_{t}$ to that of its discretization $z_{t}$. Below we also follow the notation of the previous sections. Roughly speaking, the proof can be divided into two steps:

(i) Fix the number of iteration steps as $K$ and prove that

\mathbb{P}((z_{\epsilon},\dots,z_{K\epsilon})-(v_{\epsilon},\dots,v_{K\epsilon})\notin B_{\delta})\leq\bar{\epsilon},

where $\bar{\epsilon}$ is a small positive constant to be determined, and
$B_{\delta}=\underbrace{B(0,\delta)\times\cdots\times B(0,\delta)}_{K}$ is the hyper-cube of radius $\delta>0$. This can be done by bounding the $W_{2}$-distance between $v_{k\epsilon}$ and $z_{k\epsilon}$ for $1\leq k\leq K$. Let

A_{\delta}=\underbrace{B(0,r+\delta)\times\cdots\times B(0,r+\delta)}_{K},

where $r>|\delta|>0$. For simplicity, denote the exit time of the power-law dynamic from $A_{\delta}$ by $\tau_{\delta}$. For the interpolation process $z_{t}$, we denote the corresponding exit time (an integer) with a bar above it: $\tau_{\delta}\rightarrow\bar{\tau}_{\delta}$. Then,

\mathbb{P}[(v_{\epsilon},\dots,v_{K\epsilon})\in A_{-\delta}]-\bar{\epsilon}\leq\mathbb{P}[\bar{\tau}_{0}>K]
\leq\mathbb{P}[(v_{\epsilon},\dots,v_{K\epsilon})\in A_{\delta}]+\bar{\epsilon}. (6.4)

Note that the event $\{\bar{\tau}_{0}>K\}$ indicates that the interpolation process $z_{t}$ remains in $B(0,r)$ at the grid times $t=\epsilon,\dots,K\epsilon$.

(ii) Step 1 guarantees that if $v(t)$ is trapped in a ball of a slightly different size at $t=\epsilon,2\epsilon,\dots,K\epsilon$, then the interpolation process $z(t)$ is also trapped in a ball. However,

\mathbb{P}[(v_{\epsilon},\dots,v_{k\epsilon})\in A_{\delta}]>\mathbb{P}[\tau_{\delta}>K\epsilon],

since $v(t)$ may drift outside the ball when $t\in[(k-1)\epsilon,k\epsilon)$ for $1\leq k\leq K$. We define this 'anomalous' random event by

R:=\{\max_{0\leq k\leq K-1}\sup_{t\in(k\epsilon,(k+1)\epsilon)}\left\lVert v_{t}-v_{k\epsilon}\right\rVert>\bar{\delta}\}.

Then obviously,

\mathbb{P}[(v_{\epsilon},\dots,v_{k\epsilon})\in A_{\delta}]\leq
\mathbb{P}[\tau_{\delta+\bar{\delta}}>k\epsilon]+\mathbb{P}[R]. (6.5)

We would expect the probability of the anomalous event $R$ to be small if the diffusion coefficient of the dynamic were bounded, in which case we could apply Gaussian concentration results. However, the diffusion part of the power-law dynamic is not bounded, so an additional technical issue has to be taken care of.

Now we introduce the same form of coupling as in the previous sections between $(v_{t},z_{t})$ for $t\in[(k-1)\epsilon,k\epsilon)$, for every $1\leq k\leq K$. Following the notation in the proof of Theorem 4.1, we set the $\alpha_{s}(v,z)$ and $\beta_{s}(v,z)$ of the infinitesimal generator $\mathcal{A}_{s}$ of the coupling as:

\alpha_{s}(v,z)=\begin{pmatrix}\sigma(v)\sigma^{T}(v),&\sigma(v)\sigma(z_{k\epsilon})\\ \sigma(z_{k\epsilon})\sigma(v),&\sigma(z_{k\epsilon})\sigma^{T}(z_{k\epsilon})\end{pmatrix},\ \ \beta_{s}(v,z)=\begin{pmatrix}-hv\\ -hz_{k\epsilon}\end{pmatrix}. (6.6)

The remainder of the proof of the theorem will be accomplished by three lemmas. We first prove the following lemma for the one-dimensional decoupled dynamic (2.16).

Lemma 1

Suppose that the coefficients of (2.16) satisfy

\{\eta\rho,\rho\}\leq h\leq\frac{1}{\epsilon} (6.7)

(which is fulfilled by the SGD algorithm with a large batch size and a small learning rate). Let $(v(t),z(t))$ be the coupling process defined by (6.6), with $v(0)=z(0)=x$. When $t=K\epsilon$, the Wasserstein distance between the marginal distribution $p_{K}^{z}$ of $z(t)$ and the marginal distribution $p_{K}^{v}$ of the power-law dynamic $v(t)$ is bounded by

\mathcal{W}_{2}^{2}(p_{K}^{v},p_{K}^{z})\leq\frac{4}{3}\frac{E(\epsilon)}{c\epsilon}, (6.8)

where $c=2h-\eta\rho>0$. Moreover, $E(\epsilon)>0$ is independent of the number of steps $K$ and is of order $O(\epsilon^{2})$ as the time interval $\epsilon\rightarrow 0$.

Proof (Proof of Lemma 1)

Denote the transition probabilities of $v_{t}$ and $z_{t}$ from time $(k-1)\epsilon$ to $(k-1)\epsilon+t$ by $p^{v}_{(k-1)\epsilon+t}(v_{(k-1)\epsilon},\cdot)$ and $p^{z}_{(k-1)\epsilon+t}(z_{(k-1)\epsilon},\cdot)$, respectively. Letting $\mathcal{A}_{s}$ act on the function $r(v,z):=\left\lVert v-z\right\rVert^{2}$ and using Gronwall's inequality, we can deduce the following recursion inequality for $1\leq k\leq K$:

\mathcal{W}_{2}^{2}(p_{k}^{v}(\cdot),p_{k}^{z}(\cdot))\leq\int_{\mathbb{R}^{d}\times\mathbb{R}^{d}}\mathcal{W}_{2}^{2}(p^{v}_{(k-1)\epsilon+t}(v_{(k-1)\epsilon},\cdot),p^{z}_{(k-1)\epsilon+t}(z_{(k-1)\epsilon},\cdot))\,d\pi(p_{k-1}^{v}(v_{(k-1)\epsilon}),p_{k-1}^{z}(z_{(k-1)\epsilon}))
\leq e^{-c\epsilon}\mathcal{W}_{2}^{2}(p_{k-1}^{v}(\cdot),p_{k-1}^{z}(\cdot))+\int_{\mathbb{R}}\frac{1}{2}\epsilon^{3}\rho(8\eta\rho+4h)\left\lVert x\right\rVert^{2}dp_{k-1}^{v}(x)
+\int_{\mathbb{R}}[\frac{1}{3}\epsilon^{3}h(4\eta\rho+2h)\rho+\frac{1}{2}\epsilon^{2}\sigma(8\eta\rho+3h)]\left\lVert y\right\rVert^{2}dp_{k-1}^{z}(y)
+\epsilon^{2}\eta\rho(4\eta\rho+2h)+\frac{1}{2}\epsilon^{2}\sigma(8\eta\rho+3h),

where $c=2h-\eta\rho>0$ and $\pi(p_{k-1}^{v}(\cdot),p_{k-1}^{z}(\cdot))$ is the optimal coupling between $p_{k-1}^{v}(\cdot)$ and $p_{k-1}^{z}(\cdot)$. (Note that in our context the optimal coupling always exists; see e.g. Proposition 1.3.2 and Theorem 2.3.3 in wang .)

For the interpolation process ztz_{t} starting at y0y_{0}, by Ito’s formula, we have the following estimate:

dy2𝑑pkz(y)e(2hηρ)kϵ(y02ηρ2hηρ)+ηρ2hηρ.\int_{\mathbb{R}^{d}}\left\lVert y\right\rVert^{2}dp_{k}^{z}(y)\leq e^{-(2h-\eta\rho)k\epsilon}(y_{0}^{2}-\frac{\eta\rho}{2h-\eta\rho})+\frac{\eta\rho}{2h-\eta\rho}.

Similarly, the second moment of the continuous dynamic vtv_{t} starting at x0x_{0} can be bounded by:

𝔼vt2e(2hηρ)t(x02ηρ2hηρ)+ηρ2hηρ.\mathbb{E}\left\lVert v_{t}\right\rVert^{2}\leq e^{-(2h-\eta\rho)t}(x_{0}^{2}-\frac{\eta\rho}{2h-\eta\rho})+\frac{\eta\rho}{2h-\eta\rho}. (6.9)

Notice that the distance between $v_{t}$ and $z_{t}$ is zero at initialization; then, by applying the recursion relation from $k=1$ to $k=K$, we conclude that there exists $E(\epsilon)>0$ such that

\mathcal{W}_{2}^{2}(p_{K}^{v},p_{K}^{z})\leq\frac{4}{3}\frac{E(\epsilon)}{c\epsilon},

where $E(\epsilon)\sim O(\epsilon^{2})$ is independent of $K$. This completes the proof of Lemma 1.

By the definition of the $W_{2}$-distance,

\mathbb{P}((z_{\epsilon},\dots,z_{K\epsilon})-(v_{\epsilon},\dots,v_{K\epsilon})\notin B_{\delta})\leq\sum_{k=1}^{K}\frac{W_{2}^{2}(p_{k}^{v},p_{k}^{z})}{\delta^{2}}
\leq\frac{4}{3\delta^{2}}\cdot\frac{E(\epsilon)K}{c\epsilon}. (6.10)

Below we denote the right-hand side $\frac{4}{3\delta^{2}}\frac{E(\epsilon)K}{c\epsilon}$ by $\bar{\epsilon}$. For the second step, from (6.4) and (6.5), it follows that

\mathbb{P}[\tau_{-\delta}>K\epsilon]-\bar{\epsilon}\leq\mathbb{P}[\bar{\tau}_{0}>K]\leq\mathbb{P}[\tau_{\delta+\bar{\delta}}>K\epsilon]+\mathbb{P}[R]+\bar{\epsilon}. (6.11)

Therefore, we are left to estimate $\mathbb{P}[R]$. Under the condition $h>\frac{1}{2}\eta\rho$, we have the following lemma.

Lemma 2

Let $\delta>0$ be fixed. Conditioning on the event that $v_{t}$ is inside $B(0,b+\delta)$ at $t=k\epsilon$ for all $1\leq k\leq K$, we have

\mathbb{E}\sup_{s\in[k\epsilon,(k+1)\epsilon)}(v_{s})^{4}\leq D(\eta,\rho,\epsilon),

where $D(\eta,\rho,\epsilon):=[(2+2/\delta)\frac{(b+\delta)^{2}\eta\rho}{2h-\eta\rho}+(5+\frac{1}{\delta})(\eta\sigma)^{2}\epsilon]\exp\{12(1+\delta)\eta\rho\epsilon\}$.

Remark 4

The above lemma tells us that the fourth moment will not change too much if the time interval $\epsilon$ is small. Intuitively, since the martingale part of $v_{t}$ is $\sqrt{\eta\sigma+\eta\rho v_{t}^{2}}dB_{t}$, if $|v_{t}|$ is bounded, then by the time-change theorem the marginal distribution of the martingale part behaves like a scaled Gaussian.

Proof (Proof of Lemma 2)

By Ito’s formula,

d(v_{t})^{2}=\eta\sigma dt-(2h-\eta\rho)(v_{t})^{2}dt+2v_{t}\sqrt{\eta\sigma+\eta\rho(v_{t})^{2}}dB_{t}.

Define the local martingale $M_{t}:=2\int_{0}^{t}e^{(2h-\eta\rho)s}\sqrt{\eta\sigma+\eta\rho(v_{s})^{2}}v_{s}dB_{s}$; then, by Gronwall's inequality,

(vt)2\displaystyle(v_{t})^{2} e(2hηρ)t(v0)2+e(2hηρ)tMt+\displaystyle\leq e^{-(2h-\eta\rho)t}(v_{0})^{2}+e^{-(2h-\eta\rho)t}M_{t}+
e(2hηρ)t0te(2hηρ)sησ𝑑s.\displaystyle\ \ e^{-(2h-\eta\rho)t}\int_{0}^{t}e^{(2h-\eta\rho)s}\eta\sigma ds.

Let

St:=𝔼sups[kϵ,kϵ+t](vs)4,t[0,ϵ).S_{t}:=\mathbb{E}\sup_{s\in[k\epsilon,k\epsilon+t]}(v_{s})^{4},\ \ t\in[0,\epsilon).

Then, for the fixed δ>0\delta>0,

St\displaystyle S_{t} (1+δ)e2(2hηρ)t𝔼sups[0,t](Ms)2+(2+2/δ)e2(2hηρ)t𝔼(vkϵ)4\displaystyle\leq(1+\delta)e^{-2(2h-\eta\rho)t}\mathbb{E}\sup_{s\in[0,t]}(M_{s})^{2}+(2+2/\delta)e^{-2(2h-\eta\rho)t}\mathbb{E}(v_{k\epsilon})^{4}
+(2+2/δ)e2(2hηρ)t𝔼(0te(2hηρ)sησ𝑑s)2.\displaystyle\ \ \ \ +(2+2/\delta)e^{-2(2h-\eta\rho)t}\mathbb{E}(\int_{0}^{t}e^{(2h-\eta\rho)s}\eta\sigma ds)^{2}.

Applying Doob’s inequality, we get

St2(1+δ)e2(2hηρ)t\displaystyle S_{t}\leq 2(1+\delta)e^{-2(2h-\eta\rho)t} 𝔼Mt2+(1+1δ)e2(2hηρ)t1(ηρ2h)t\displaystyle\mathbb{E}M_{t}^{2}+(1+\frac{1}{\delta})\frac{e^{-2(2h-\eta\rho)t}-1}{(\eta\rho-2h)t}
𝔼0t(ησ)2𝑑s+(2+2/δ)e2(2hηρ)t𝔼(vkϵ)4.\displaystyle\cdot\mathbb{E}\int_{0}^{t}(\eta\sigma)^{2}ds+(2+2/\delta)e^{-2(2h-\eta\rho)t}\mathbb{E}(v_{k\epsilon})^{4}.

Now, Ito’s isometry implies that

St\displaystyle S_{t} 12(1+δ)ηρ0tSs𝑑s+(2+2/δ)e2(2hηρ)t𝔼(vkϵ)4\displaystyle\leq 12(1+\delta)\eta\rho\int_{0}^{t}S_{s}ds+(2+2/\delta)e^{-2(2h-\eta\rho)t}\mathbb{E}(v_{k\epsilon})^{4}
+(5+1δ)(ησ)2t\displaystyle\ \ +(5+\frac{1}{\delta})(\eta\sigma)^{2}t
[(2+2/δ)𝔼(vkϵ)4+(5+1δ)(ησ)2t]exp{12(1+δ)ηρt},\displaystyle\leq[(2+2/\delta)\mathbb{E}(v_{k\epsilon})^{4}+(5+\frac{1}{\delta})(\eta\sigma)^{2}t]\exp\{12(1+\delta)\eta\rho t\},

where we used Gronwall's inequality to derive the last line. For all $1\leq k\leq K$, by taking $x_{0}=0$ in (6.9), we get

\mathbb{E}\{(v_{k\epsilon})^{4}\mathbb{I}_{(v_{\epsilon},\dots,v_{K\epsilon})\in A_{\delta}}\}\leq\frac{(b+\delta)^{2}\eta\rho}{2h-\eta\rho}.

Then, conditioning on the event that {vkϵB(0,b+δ)}\{v_{k\epsilon}\in B(0,b+\delta)\}, we have

𝔼sups[kϵ,(k+1)ϵ)(vs)4\displaystyle\mathbb{E}\sup_{s\in[k\epsilon,(k+1)\epsilon)}(v_{s})^{4} [(2+2/δ)(b+δ)2ηρ2hηρ+(5+1δ)(ησ)2ϵ]\displaystyle\leq[(2+2/\delta)\frac{(b+\delta)^{2}\eta\rho}{2h-\eta\rho}+(5+\frac{1}{\delta})(\eta\sigma)^{2}\epsilon]
exp{12(1+δ)ηρϵ}:=D(η,ρ,ϵ),\displaystyle\cdot\exp\{12(1+\delta)\eta\rho\epsilon\}:=D(\eta,\rho,\epsilon),

which completes the proof of Lemma 2.

Lemma 3

Let $\bar{\delta}$ be a positive constant. Then, for every $k\in\{0,\dots,K-1\}$,

\mathbb{P}(\sup_{t\in[k\epsilon,(k+1)\epsilon)}\left\lVert v_{t}-v_{k\epsilon}\right\rVert>\bar{\delta})\leq\frac{C(\epsilon,\eta)}{\bar{\delta}^{4}},

where $C(\epsilon,\eta)\sim O(\epsilon)$ as $\epsilon\rightarrow 0$.

Proof (Proof of Lemma 3)

Let r(x)=xvkϵ2r(x)=\left\lVert x-v_{k\epsilon}\right\rVert^{2}, then by Ito’s lemma,

dr(vt)\displaystyle dr(v_{t}) 2hvt(vtvkϵ)dt+2ησ+ηρ(vt)2(vtvkϵ)dBt+(ησ+ηρ(vt)2)dt\displaystyle\leq-2hv_{t}(v_{t}-v_{k\epsilon})dt+2\sqrt{\eta\sigma+\eta\rho(v_{t})^{2}}(v_{t}-v_{k\epsilon})dB_{t}+(\eta\sigma+\eta\rho(v_{t})^{2})dt
2h(vtvkϵ)2dt+3h(vkϵ)2dt+ησdt\displaystyle\leq-2h(v_{t}-v_{k\epsilon})^{2}dt+3h(v_{k\epsilon})^{2}dt+\eta\sigma dt
+(ηρ+h)(vt)2dt+2ησ+ηρ(vt)2(vtvkϵ)dBt,\displaystyle\ \ \ +(\eta\rho+h)(v_{t})^{2}dt+2\sqrt{\eta\sigma+\eta\rho(v_{t})^{2}}(v_{t}-v_{k\epsilon})dB_{t},

and obviously we know that r(vkϵ)=0r(v_{k\epsilon})=0. Let

Mt:=2kϵte2hsησ+ηρ(vs)2(vsvkϵ)𝑑Bs,M_{t}:=2\int_{k\epsilon}^{t}e^{2hs}\sqrt{\eta\sigma+\eta\rho(v_{s})^{2}}(v_{s}-v_{k\epsilon})dB_{s},

we have

r(vt)e2htMt+e2htkϵte2hs[3h(vkϵ)2+ησ+(ηρ+h)(vs)2]𝑑s.r(v_{t})\leq e^{-2ht}M_{t}+e^{-2ht}\int_{k\epsilon}^{t}e^{2hs}[3h(v_{k\epsilon})^{2}+\eta\sigma+(\eta\rho+h)(v_{s})^{2}]ds.

Let St:=𝔼sups[kϵ,t]r2(vs),S_{t}:=\mathbb{E}\sup_{s\in[k\epsilon,t]}r^{2}(v_{s}), then

St2e4h(tkϵ)\displaystyle S_{t}\leq 2e^{-4h(t-k\epsilon)} 𝔼sups[kϵ,t](Ms)2+2e4h(tkϵ)\displaystyle\mathbb{E}\sup_{s\in[k\epsilon,t]}(M_{s})^{2}+2e^{-4h(t-k\epsilon)}
𝔼(kϵte2hs[3h(vkϵ)2+ησ+(ηρ+h)(vs)2]𝑑s)2\displaystyle\cdot\mathbb{E}(\int_{k\epsilon}^{t}e^{2hs}[3h(v_{k\epsilon})^{2}+\eta\sigma+(\eta\rho+h)(v_{s})^{2}]ds)^{2}
4e4h(tkϵ)\displaystyle\leq 4e^{4h(t-k\epsilon)} 𝔼(Mt)2+1e4h(tkϵ)h𝔼kϵt[3h(vkϵ)2+ησ+(ηρ+h)(vs)2]2𝑑s\displaystyle\mathbb{E}(M_{t})^{2}+\frac{1-e^{-4h(t-k\epsilon)}}{h}\mathbb{E}\int_{k\epsilon}^{t}[3h(v_{k\epsilon})^{2}+\eta\sigma+(\eta\rho+h)(v_{s})^{2}]^{2}ds
16𝔼kϵt(ησ+ηρ(vs)2)(vsvkϵ)2𝑑s+\displaystyle\leq 16\mathbb{E}\int_{k\epsilon}^{t}(\eta\sigma+\eta\rho(v_{s})^{2})(v_{s}-v_{k\epsilon})^{2}ds+
1e4h(tkϵ)h𝔼kϵt[3h(vkϵ)2+ησ+(ηρ+h)(vs)2]2𝑑s.\displaystyle\ \ \ \ \ \frac{1-e^{-4h(t-k\epsilon)}}{h}\mathbb{E}\int_{k\epsilon}^{t}[3h(v_{k\epsilon})^{2}+\eta\sigma+(\eta\rho+h)(v_{s})^{2}]^{2}ds.

Let t=(k+1)ϵt=(k+1)\epsilon, and denote the above right hand side as C(ϵ,η)C(\epsilon,\eta). Since 𝔼|vs|2𝔼|vs|4,\mathbb{E}|v_{s}|^{2}\leq\sqrt{\mathbb{E}|v_{s}|^{4}}, by Lemma 2,

C(ϵ,η)O(exp(ηϵ)1η)=O(ϵ).C(\epsilon,\eta)\sim O\left(\frac{\exp(\eta\epsilon)-1}{\eta}\right)=O(\epsilon).

Therefore,

(supt[kϵ,(k+1)ϵ)v(t)v(kϵ)>δ¯)𝔼S(k+1)ϵδ¯4C(ϵ,η)δ¯4.\mathbb{P}(\sup_{t\in[k\epsilon,(k+1)\epsilon)}\left\lVert v(t)-v(k\epsilon)\right\rVert>\bar{\delta})\leq\frac{\mathbb{E}S_{(k+1)\epsilon}}{\bar{\delta}^{4}}\leq\frac{C(\epsilon,\eta)}{\bar{\delta}^{4}}.

The proof of Lemma 3 is completed.

From the previous lemma, we can easily deduce the following bound on $\mathbb{P}[R]$:

\mathbb{P}[R]\leq 1-(1-\frac{C(\epsilon,\eta)}{\bar{\delta}^{4}})^{K}. (6.12)

Combining (6.12) and (6.11), we have finished the proof of the theorem.

Remark 5

Theorem 6.1 discussed only the one-dimensional case. For the high-dimensional case, there have been some intuitive discussions in the machine learning literature xie2020diffusion . Roughly speaking, when the noise is much smaller than the barrier height, the escaping paths concentrate with high probability on the critical paths, i.e., the paths along the directions of the eigenvectors of the Hessian. If there are multiple parallel exit paths, the total exit rate, i.e., the inverse of the expected exit time, equals the sum of the exit rates of the individual paths (cf. Rule 1 in xie2020diffusion ).
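
A tiny numerical illustration of this rate-combination rule (the per-path values below are assumed, not computed): with several parallel exit paths, the per-path exit rates add up, and the total expected exit time is the reciprocal of the total rate.

```python
# Assumed expected exit times along three parallel exit paths (arbitrary units).
expected_exit_times = [1.0e4, 5.0e4, 2.0e5]

rates = [1.0 / t for t in expected_exit_times]   # per-path exit rates
total_rate = sum(rates)                          # the rates of parallel paths add up
total_expected_exit_time = 1.0 / total_rate

print("total exit rate:", total_rate)
print("total expected exit time:", total_expected_exit_time)
```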

Acknowledgements.
We had a pleasant collaboration with Shiqi Gong, Huishuai Zhang, and Tie-Yan Liu on the research of power-law dynamics in machine learning meng22020dynamic ; meng2020dynamic . We thank them for their contributions to the previous work and their comments on this work, especially those based on the empirical observations made during our previous collaboration.

References

  • (1) Nils Berglund. Kramers’ law: Validity, derivations and generalisations. arXiv preprint arXiv:1106.5799, 2013.
  • (2) Mu-Fa Chen. Eigenvalues, inequalities, and ergodic theory. Springer Science & Business Media, 2006.
  • (3) Jiu-Lin Du. Power-law distributions and fluctuation-dissipation relation in the stochastic dynamics of two-variable langevin equations. Journal of Statistical Mechanics: Theory and Experiment, 2012(02):P02006, 2012.
  • (4) Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. The heavy-tail phenomenon in sgd. arXiv preprint arXiv:2006.04740, 2020.
  • (5) Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Tie-Yan Liu, and Wei-Ying Ma. Dual learning for machine translation. In Advances in neural information processing systems, pages 820–828, 2016.
  • (6) Fengxiang He, Tongliang Liu, and Dacheng Tao. Control batch size and learning rate to generalize well: Theoretical and empirical evidence. In Advances in Neural Information Processing Systems, pages 1141–1150, 2019.
  • (7) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • (8) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • (9) Li He, Qi Meng, Wei Chen, Zhi-Ming Ma, and Tie-Yan Liu. Differential equations for modeling asynchronous algorithms. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2220–2226, 2018.
  • (10) Elton P Hsu. Stochastic analysis on manifolds. Number 38. American Mathematical Soc., 2002.
  • (11) Stanisław Jastrzkebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three factors influencing minima in sgd. arXiv preprint arXiv:1711.04623, 2017.
  • (12) Vassili N Kolokoltsov. Semiclassical analysis for diffusions and stochastic processes. Springer, 2007.
  • (13) Richard C Kraaij, Frank Redig, and Rik Versendaal. Classical large deviation theorems on complete riemannian manifolds. Stochastic Processes and their Applications, 129(11):4294–4334, 2019.
  • (14) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems, 25:1097–1105, 2012.
  • (15) Qianxiao Li, Cheng Tai, et al. Stochastic modified equations and adaptive stochastic gradient algorithms. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2101–2110. JMLR. org, 2017.
  • (16) Yi-An Ma, Tianqi Chen, and Emily B Fox. A complete recipe for stochastic gradient mcmc. arXiv preprint arXiv:1506.04696, 2015.
  • (17) Qi Meng, Shiqi Gong, Wei Chen, Zhi-Ming Ma, and Tie-Yan Liu. Dynamic of stochastic gradient descent with state-dependent noise. arXiv preprint arXiv 2006.13719v3, 2020.
  • (18) Qi Meng, Shiqi Gong, Weitao Du, Wei Chen, Zhi-Ming Ma, and Tie-Yan Liu. A fine-grained study on the escaping behavior of stochastic gradient descent. Under review, 2021.
  • (19) Wenlong Mou, Liwei Wang, Xiyu Zhai, and Kai Zheng. Generalization bounds of sgld for non-convex learning: Two theoretical viewpoints. arXiv preprint arXiv:1707.05947, 2017.
  • (20) Thanh Huy Nguyen, Umut Simsekli, Mert Gurbuzbalaban, and Gaël Richard. First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. In Advances in Neural Information Processing Systems, pages 273–283, 2019.
  • (21) Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499, 2016.
  • (22) Alexander Rakhlin, Ohad Shamir, and Karthik Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Coference on International Conference on Machine Learning, pages 1571–1578, 2012.
  • (23) Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. Fastspeech: Fast, robust and controllable text to speech. arXiv preprint arXiv:1905.09263, 2019.
  • (24) Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, Rj Skerrv-Ryan, et al. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.
  • (25) Samuel L. Smith and Quoc V. Le. A bayesian perspective on generalization and stochastic gradient descent. arXiv preprint arXiv:1710.06451, 2017.
  • (26) Martin Sundermeyer, Ralf Schlüter, and Hermann Ney. Lstm neural networks for language modeling. In Thirteenth annual conference of the international speech communication association, 2012.
  • (27) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • (28) Feng-Yu Wang. Analysis for diffusion processes on Riemannian manifolds, volume 18. World Scientific, 2014.
  • (29) Colin Wei, Sham Kakade, and Tengyu Ma. The implicit and explicit regularization effects of dropout. In International Conference on Machine Learning, pages 10181–10192. PMLR, 2020.
  • (30) E Weinan, Tiejun Li, and Eric Vanden-Eijnden. Applied stochastic analysis, volume 199. American Mathematical Soc., 2019.
  • (31) Lei Wu, Chao Ma, and Weinan E. How sgd selects the global minima in over-parameterized learning: A dynamical stability perspective. Advances in Neural Information Processing Systems, 31:8279–8288, 2018.
  • (32) Zeke Xie, Issei Sato, and Masashi Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent escapes from sharp minima exponentially fast. arXiv preprint arXiv:2002.03495, 2020.
  • (33) Guodong Zhang, Lala Li, Zachary Nado, James Martens, Sushant Sachdeva, George Dahl, Chris Shallue, and Roger B Grosse. Which algorithmic choices matter at which batch sizes? insights from a noisy quadratic model. In Advances in Neural Information Processing Systems, pages 8196–8207, 2019.
  • (34) Yanjun Zhou and Jiulin Du. Kramers escape rate in overdamped systems with the power-law distribution. Physica A: Statistical Mechanics and its Applications, 402:299–305, 2014.
  • (35) Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. In Proceedings of International Conference on Machine Learning, pages 7654–7663, 2019.