
On the convergence of an improved and adaptive kinetic simulated annealing

Michael C.H. Choi School of Data Science, The Chinese University of Hong Kong, Shenzhen and Shenzhen Institute of Artificial Intelligence and Robotics for Society, Guangdong, 518172, P.R. China [email protected]
Abstract.

Inspired by the work of Fang et al. (1997), who propose an improved simulated annealing algorithm based on a variant of overdamped Langevin diffusion with a state-dependent diffusion coefficient, we cast this idea in the kinetic setting and develop an improved kinetic simulated annealing (IKSA) method for minimizing a target function $U$. To analyze its convergence, we utilize the framework recently introduced by Monmarché (2018) for kinetic simulated annealing (KSA). The core idea of IKSA rests on introducing a parameter $c > \inf U$, which de facto modifies the optimization landscape and clips the critical height of IKSA at a maximum of $c - \inf U$. Consequently, IKSA enjoys improved convergence with a faster logarithmic cooling schedule than KSA. To tune the parameter $c$, we propose an adaptive method, which we call IAKSA, that utilizes the running minimum generated by the algorithm on the fly, thus avoiding the need to adjust $c$ manually for better performance. We present positive numerical results on some standard global optimization benchmark functions that verify the improved convergence of IAKSA over other Langevin-based annealing methods.

AMS 2010 subject classifications: 60J25, 60J60, 46N30

Keywords: simulated annealing; Langevin diffusion; relative entropy; log-Sobolev constant; self-interacting diffusion; landscape modification

1. Introduction

Given a target function $U:\mathbb{R}^{d}\to\mathbb{R}$ to minimize, we are interested in simulated annealing algorithms based on Langevin diffusion and its various variants. Let us begin by briefly recalling the dynamics of the classical overdamped Langevin diffusion $(\mathcal{Z}_{t})_{t\geqslant 0}$ for simulated annealing (SA):

(1.1) $d\mathcal{Z}_{t}=-\nabla U(\mathcal{Z}_{t})\,dt+\sqrt{2\epsilon_{t}}\,dB_{t},$

where $(B_{t})_{t\geqslant 0}$ is the standard $d$-dimensional Brownian motion and $(\epsilon_{t})_{t\geqslant 0}$ is the temperature or cooling schedule. The instantaneous stationary distribution of (1.1) at time $t$ is the Gibbs distribution that we denote by

$\mu_{\epsilon_{t}}^{0}(x)\propto e^{-\frac{1}{\epsilon_{t}}U(x)}.$

We shall explain the seemingly strange superscript that appears in $\mu_{\epsilon_{t}}^{0}$ later in (1.6). This overdamped Langevin dynamics and its convergence have been analyzed in Chiang et al. (1987); Holley et al. (1989); Miclo (1992); Jacquot (1992). It can be shown that under the logarithmic cooling schedule of the form

(1.2) $\epsilon_{t}=\dfrac{E}{\ln t},\quad\text{for large enough } t,$

where $E>E_{*}$, the process $(\mathcal{Z}_{t})_{t\geqslant 0}$ gradually concentrates around the set of global minima of $U$ in the sense that for any $\delta>0$,

$\lim_{t\to\infty}\mathbb{P}\left(U(\mathcal{Z}_{t})>U_{min}+\delta\right)=0.$

Here we write $U_{min}:=\inf U$. We call $E$ the energy level and $E_{*}$ the hill-climbing constant or the critical height associated with $U$. Intuitively speaking, $E_{*}$ is the largest hill one needs to climb when travelling from a local minimum to a fixed global minimum. For a precise definition of $E_{*}$, we refer readers to (2.2) below.
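To fix ideas, the following is a minimal sketch, written by us rather than taken from the paper, of an Euler-Maruyama discretization of the overdamped dynamics (1.1) with the logarithmic cooling schedule (1.2). The step size h, the energy level E and the quadratic test function in the example are illustrative choices only.

```python
import numpy as np

def sa_euler_maruyama(grad_U, x0, E=1.0, h=1e-2, n_steps=100_000, rng=None):
    """Euler-Maruyama discretization of dZ = -grad U(Z) dt + sqrt(2 eps_t) dB,
    with logarithmic cooling eps_t = E / ln(t) (capped for small t)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    for k in range(1, n_steps + 1):
        t = k * h
        eps_t = E / max(np.log(t), 1.0)      # cooling schedule (1.2), capped early on
        noise = rng.standard_normal(x.shape)
        x += -grad_U(x) * h + np.sqrt(2.0 * eps_t * h) * noise
    return x

# Example: U(x) = ||x||^2 / 2, so grad U(x) = x.
x_final = sa_euler_maruyama(lambda x: x, x0=np.ones(2))
```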

While the overdamped Langevin diffusion (1.1) can be seen as the continuous counterpart of gradient descent perturbed by Gaussian noise, the analogue of the momentum method in this context is the kinetic or underdamped Langevin diffusion. The kinetic Langevin dynamics $(\mathcal{X}_{t},\mathcal{Y}_{t})_{t\geqslant 0}$ (KSA) is described by

(1.3) $d\mathcal{X}_{t} = \mathcal{Y}_{t}\,dt,$
(1.4) $d\mathcal{Y}_{t} = -\dfrac{1}{\epsilon_{t}}\mathcal{Y}_{t}\,dt - \nabla U(\mathcal{X}_{t})\,dt + \sqrt{2}\,dB_{t},$

where $(\mathcal{X}_{t})_{t\geqslant 0}$ stands for the position and $(\mathcal{Y}_{t})_{t\geqslant 0}$ is the velocity or momentum variable. The instantaneous stationary distribution of $(\mathcal{X}_{t},\mathcal{Y}_{t})_{t\geqslant 0}$ at time $t$ is the product distribution of $\mu_{\epsilon_{t}}^{0}$ and the Gaussian distribution with variance $\epsilon_{t}$, which we denote by

$\pi_{\epsilon_{t}}^{0}(x,y)\propto e^{-\frac{1}{\epsilon_{t}}U(x)}e^{-\frac{\left\lVert y\right\rVert^{2}}{2\epsilon_{t}}}.$

We will explain the notation $\pi_{\epsilon_{t}}^{0}$ in (1.10) below. Unlike the overdamped Langevin dynamics (1.1), which is reversible, the kinetic counterpart (1.3) is in general non-reversible, which imposes technical difficulties in establishing its long-time convergence. On the other hand, it is known in the literature that using non-reversible dynamics may accelerate convergence in the context of sampling or optimization, see for example Diaconis et al. (2000); Bierkens (2016); Chen and Hwang (2013); Löwe (1997); Duncan et al. (2016, 2017); Hwang et al. (2005); Gao et al. (2018); Hu et al. (2020). Using a distorted entropy approach, Monmarché (2018) proves for the first time a convergence result for kinetic simulated annealing: under the same logarithmic cooling schedule as in (1.2), for any $\delta>0$ we have

$\lim_{t\to\infty}\mathbb{P}\left(U(\mathcal{X}_{t})>U_{min}+\delta\right)=0.$

More recently, Chak et al. (2020) analyze the generalized Langevin dynamics for simulated annealing based on the framework introduced in Monmarché (2018).

Many techniques have been developed in the literature to improve or accelerate the convergence of Langevin dynamics. In this paper, we are particularly interested in an improved variant of Langevin dynamics $(Z_{t})_{t\geqslant 0}$ (ISA) with a state-dependent diffusion coefficient, introduced by Fang et al. (1997), whose dynamics is described by the following:

(1.5) $dZ_{t}=-\nabla U(Z_{t})\,dt+\sqrt{2\left(f((U(Z_{t})-c)_{+})+\epsilon_{t}\right)}\,dB_{t},$

where we write $a_{+}=\max\{a,0\}$ for $a\in\mathbb{R}$. Comparing the improved dynamics (1.5) with the classical one (1.1), we see that the function $f:\mathbb{R}\to\mathbb{R}^{+}$ and the parameter $c$ are introduced. We formally state the assumptions needed on both $f$ and $c$ in Assumption 1.1 below. To briefly summarize, we need to choose $c>U_{min}$ and $f$ to be twice-differentiable, non-negative, bounded and non-decreasing with $f(0)=f^{\prime}(0)=f^{\prime\prime}(0)=0$. It is shown in Fang et al. (1997) that the instantaneous stationary distribution of (1.5) at time $t$ is given by

(1.6) $\mu_{\epsilon_{t}}^{f}(x) = \mu_{\epsilon_{t},c}^{f}(x) \propto e^{-H_{\epsilon_{t}}(x)} = \dfrac{1}{f((U(x)-c)_{+})+\epsilon_{t}}\exp\left(-\int_{U_{min}}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon_{t}}\,du\right),$

where

(1.7) $H_{\epsilon}(x)=H_{\epsilon,c}(x):=\int_{U_{min}}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du+\ln\left(f((U(x)-c)_{+})+\epsilon\right).$

Observe that if $f=0$, (1.5) reduces to the classical overdamped dynamics (1.1). As such, $\mu_{\epsilon}^{f}$ can be considered a generalization of the Gibbs distribution $\mu_{\epsilon}^{0}$. This also explains the notation $\mu_{\epsilon_{t}}^{0}$ used earlier.

One important difference between (1.5) and (1.1) is the state-dependent diffusion coefficient: the larger $U(Z_{t})$ is relative to $c$, the greater (in absolute terms) the Gaussian noise injected, and this extra noise may improve convergence by helping the dynamics escape a local minimum or saddle point. On the other hand, in the region where $U(Z_{t})<c$ the dynamics evolves in the same manner as the classical dynamics. As for the theoretical benefits, in Fang et al. (1997) the authors demonstrate that under the logarithmic cooling schedule of the form (1.2) with $E>c_{*}$, for any $\delta>0$,

$\lim_{t\to\infty}\mathbb{P}\left(U(Z_{t})>U_{min}+\delta\right)=0,$

where we call $c_{*}$ the clipped critical height, to be defined formally in (2.3) below. It can be shown that $E_{*}\geqslant c_{*}$ and $c-U_{min}\geqslant c_{*}$, so one can think of the critical height as being capped at a maximum level of $c-U_{min}$. The key technical insight in Fang et al. (1997) is that both the spectral gap and the log-Sobolev constant are of the order $e^{c_{*}/\epsilon_{t}}$. As a result, we can operate a faster cooling schedule for the improved dynamics (1.5) that still enjoys a convergence guarantee.
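For concreteness, here is a minimal sketch of an Euler-Maruyama discretization of the improved overdamped dynamics (1.5); it is our illustration, not the authors' implementation, and it differs from the classical scheme above only in the state-dependent diffusion coefficient. The choice f = arctan mirrors the example used for Figure 1 below; the remaining parameters are illustrative.

```python
import numpy as np

def isa_euler_maruyama(U, grad_U, x0, c, f=np.arctan, E=1.0, h=1e-2,
                       n_steps=100_000, rng=None):
    """Euler-Maruyama discretization of the improved overdamped dynamics (1.5):
    dZ = -grad U(Z) dt + sqrt(2 (f((U(Z)-c)_+) + eps_t)) dB."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    for k in range(1, n_steps + 1):
        t = k * h
        eps_t = E / max(np.log(t), 1.0)      # logarithmic cooling (1.2)
        extra = f(max(U(x) - c, 0.0))        # state-dependent part of the diffusion
        noise = rng.standard_normal(x.shape)
        x += -grad_U(x) * h + np.sqrt(2.0 * (extra + eps_t) * h) * noise
    return x
```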

The crux of this paper is to cast the idea of Fang et al. (1997) into the kinetic Langevin setting for simulated annealing. One way to do so is to think of altering the target function: in SA the exponent in $\mu_{\epsilon_{t}}^{0}$ is $-(1/\epsilon_{t})U(x)$, while in ISA the exponent in $\mu_{\epsilon_{t}}^{f}$ (1.6) takes the generalized form $-H_{\epsilon_{t}}$. In this way the optimization landscape is de facto modified from $U$ to $H_{\epsilon}$ and, hopefully, improved. We apply this idea and simply substitute $(1/\epsilon_{t})U(x)$ by $H_{\epsilon_{t}}$ in the dynamics. More precisely, we are interested in the following dynamics $(X_{t},Y_{t})_{t\geqslant 0}$ that we call IKSA:

(1.8) $dX_{t} = Y_{t}\,dt,$
(1.9) $dY_{t} = -\dfrac{1}{\epsilon_{t}}Y_{t}\,dt - \epsilon_{t}\nabla_{x}H_{\epsilon_{t},c}(X_{t})\,dt + \sqrt{2}\,dB_{t}.$

Its instantaneous stationary distribution at time $t$ is the product distribution of $\mu_{\epsilon_{t}}^{f}$ and the Gaussian distribution with mean $0$ and variance $\epsilon_{t}$:

(1.10) $\pi_{\epsilon_{t}}^{f}(x,y)=\pi_{\epsilon_{t},c}^{f}(x,y)\propto\mu_{\epsilon_{t}}^{f}(x)e^{-\frac{\left\lVert y\right\rVert^{2}}{2\epsilon_{t}}}\propto e^{-H_{\epsilon_{t}}(x)}e^{-\frac{\left\lVert y\right\rVert^{2}}{2\epsilon_{t}}}.$

Note that when $f=0$, $\nabla_{x}H_{\epsilon_{t}}(x)=\frac{1}{\epsilon_{t}}\nabla_{x}U(x)$, and we retrieve exactly the classical kinetic Langevin diffusion (1.3). As such we can think of IKSA as a generalization of KSA. This also explains the notation $\pi_{\epsilon_{t}}^{0}$ that appears earlier.

For a general $f$ (satisfying Assumption 1.1 below), in the case $U(x)\leqslant c$ we have $\nabla_{x}H_{\epsilon_{t}}(x)=\frac{1}{\epsilon_{t}}\nabla_{x}U(x)$, and hence the improved kinetic dynamics (1.8) evolves in the same way as the classical one (1.3) in this region. In the other case, when $U(x)>c$, we see that

$\nabla_{x}H_{\epsilon}=\dfrac{1+f^{\prime}((U(x)-c)_{+})}{f((U(x)-c)_{+})+\epsilon}\nabla_{x}U.$

Thus, the greater $U(x)$ is relative to $c$, the greater the denominator in $\nabla_{x}H_{\epsilon_{t}}$ in the above equation, and the more dominant the Brownian noise becomes in the velocity update of $Y_{t}$. This effect can hopefully improve the convergence of $U(X_{t})$ when its value is greater than $c$. To illustrate, we plot the function

$U_{0}(x)=\cos(2x)+\frac{1}{2}\sin(x)+\frac{1}{3}\sin(10x)$

as in Monmarché (2018) and compare this landscape with that of $H_{\epsilon,c}$ in Figure 1, where we take $\epsilon=0.5$, $c=-1.5$ and $f=\arctan$. For example, when $x\in(-4,-2)$ in Figure 1, we can see that the gradient of $H_{\epsilon,c}$ is much smaller than that of $U_{0}$, so the Brownian noise plays a relatively more dominant role in the velocity update in this region. While it may not be immediately apparent in Figure 1, we note that both $U_{0}$ (the black, solid curve) and $H_{\epsilon,c}$ (the orange, dashed curve) share exactly the same set of stationary points. As for the theoretical advantage of using IKSA, we shall prove that we can operate a faster logarithmic cooling schedule than KSA, relying on the key technical insight that the instantaneous spectral gap and log-Sobolev constant are of the order $e^{c_{*}/\epsilon_{t}}$.
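The landscape in Figure 1 can be reproduced with a few lines of numerical integration; the paper uses MATLAB's integrator, while the sketch below is our own illustration in Python with the same choices $\epsilon=0.5$, $c=-1.5$ and $f=\arctan$.

```python
import numpy as np
from scipy.integrate import quad

def U0(x):
    return np.cos(2 * x) + 0.5 * np.sin(x) + np.sin(10 * x) / 3.0

# numerical proxy for U_min = inf U0 on the plotting window
grid = np.linspace(-2 * np.pi, 2 * np.pi, 20001)
U_MIN = U0(grid).min()

def H(x, eps=0.5, c=-1.5, f=np.arctan):
    """H_{eps,c}(x) from (1.7): integral of 1/(f((u-c)_+) + eps) over [U_min, U0(x)]
    plus log(f((U0(x)-c)_+) + eps)."""
    integrand = lambda u: 1.0 / (f(max(u - c, 0.0)) + eps)
    integral, _ = quad(integrand, U_MIN, U0(x))
    return integral + np.log(f(max(U0(x) - c, 0.0)) + eps)

xs = np.linspace(-2 * np.pi, 2 * np.pi, 400)
H_vals = np.array([H(x) for x in xs])   # plot against U0(xs) to compare with Figure 1
```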

While there are practical benefits in using IKSA over KSA, or ISA over SA, they come with extra computational costs: in the velocity update of IKSA, in addition to evaluating the gradient of $U$, we need to evaluate both $f$ and its derivative $f^{\prime}$ at $(U(x)-c)_{+}$, as the core idea of IKSA and ISA rests on comparing $U(X_{t})$ with $c$ for all $t$. As such, if we implement the Euler-Maruyama discretization of IKSA, extra function evaluations are required at each iteration.
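To make this cost visible, the sketch below, which is ours and not taken from the paper, implements one possible Euler-Maruyama discretization of IKSA (1.8)-(1.9) using the closed-form expression for $\nabla_{x}H_{\epsilon,c}$ given above; note the extra evaluations of $U$, $f$ and $f^{\prime}$ in every velocity update. The step size and the choice $f=\arctan$ (so that $f^{\prime}(s)=1/(1+s^{2})$) are illustrative.

```python
import numpy as np

def grad_H(x, U, grad_U, eps, c, f=np.arctan,
           f_prime=lambda s: 1.0 / (1.0 + s * s)):
    """grad_x H_{eps,c}(x): equals grad U(x) / eps when U(x) <= c, and
    (1 + f'(U(x)-c)) / (f(U(x)-c) + eps) * grad U(x) when U(x) > c."""
    s = U(x) - c                           # extra evaluation of U at every step
    if s <= 0.0:
        return grad_U(x) / eps
    return (1.0 + f_prime(s)) / (f(s) + eps) * grad_U(x)

def iksa_euler_maruyama(U, grad_U, x0, c, E=1.0, h=1e-3, n_steps=100_000, rng=None):
    """Euler-Maruyama discretization of IKSA (1.8)-(1.9):
    dX = Y dt,  dY = -(Y/eps_t) dt - eps_t grad_x H_{eps_t,c}(X) dt + sqrt(2) dB."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    y = np.zeros_like(x)
    for k in range(1, n_steps + 1):
        t = k * h
        eps_t = E / max(np.log(t), 1.0)    # logarithmic cooling (1.2)
        noise = rng.standard_normal(y.shape)
        y += (-y / eps_t - eps_t * grad_H(x, U, grad_U, eps_t, c)) * h \
             + np.sqrt(2.0 * h) * noise
        x += y * h
    return x, y
```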

The objective of this paper is to promote the idea of landscape modification and state-dependent noise in stochastic optimization. It was brought to our attention by Miclo (2020) that Olivier Catoni has also considered the transformation from $U$ to $g(U)$, in a discrete-time, finite state space setting with concave $g$. Recently, in Guo et al. (2020) the authors study perturbed gradient descent with state-dependent noise, using the notion of occupation time.

Figure 1. Landscape of $U_{0}$ and $H_{\epsilon,c}$, where $\epsilon=0.5$, $c=-1.5$ and $f=\arctan$. We evaluate the integral in $H_{\epsilon,c}$ using the numerical integration package in MATLAB.

We summarize the main contributions of this paper below:

  (1)

    Propose an improved kinetic simulated annealing (IKSA) method and analyze its convergence

    In our main result, Theorem 1.1 below, we prove that under the logarithmic cooling of the form (1.2) with energy level $E>c_{*}$, where both $c$ and $E$ are fixed, the improved kinetic annealing $(X_{t},Y_{t})_{t\geqslant 0}$ converges: for any $\delta>0$,

    $\lim_{t\to\infty}\mathbb{P}\left(U(X_{t})>U_{min}+\delta\right)=0.$

    This will be proved using the framework proposed by Monmarché (2018), along with the key technical insight that the log-Sobolev constant is of the order $e^{c_{*}/\epsilon_{t}}$ at time $t$.

  (2)

    Propose an adaptive method (IAKSA) to tune the parameter $c$ and the energy level $E$, and analyze its convergence

    The convergence behaviour of $(X_{t},Y_{t})_{t\geqslant 0}$ in IKSA depends heavily on the value of the parameter $c>U_{min}$. Ideally we would like to choose $c$ close to $U_{min}$ (so that the clipped critical height $c_{*}$ is as small as possible), but this can be hard to achieve in practice without a priori information on $U$.

    In our second main result, Theorem 1.2 below, we tune both $c$ and $E$ adaptively by incorporating the information of the running minimum $\min_{v\leqslant t}U(X_{v})$, where both $c=(c_{t})_{t\geqslant 0}$ and $E=(E_{t})_{t\geqslant 0}$ depend on the running minimum generated by the algorithm on the fly. We call the resulting non-Markovian diffusion IAKSA. Although the setting is slightly different, this idea is reminiscent of the adaptive biasing method Benaïm and Bréhier (2019); Benaïm et al. (2020); Lelièvre and Minoukadeh (2011); Lelièvre et al. (2008) and the self-interacting annealing method Raimond (2009), and it avoids the need to manually tune the parameter $c$ for better performance. We also mention the related work on memory gradient diffusions Gadat et al. (2013); Gadat and Panloup (2014). Adaptive algorithms are also popular in the Markov chain Monte Carlo (MCMC) literature Roberts and Rosenthal (2009); Andrieu and Thoms (2008); Fort et al. (2011). Note that in our context, the idea of tuning $c$ adaptively on the fly dates back to the work of Fang et al. (1997) for ISA.

  (3)

    Present numerical experiments to illustrate the performance of IAKSA

    We compare the performance of four simulated annealing methods, namely IAKSA, IASA (1.5) (i.e. ISA with the parameter $c$ tuned adaptively in the same way as in IAKSA), KSA (1.3) and SA (1.1), on minimizing three standard global optimization benchmark functions Jamil and Yang (2013). Empirical results demonstrate the improved convergence of IAKSA over the other annealing methods based on Langevin diffusions.

1.1. Notations

Before we discuss our main results, we fix a few notations that are used frequently throughout the paper. For $a,b\in\mathbb{R}$, we write $a\wedge b=\min\{a,b\}$ and $a_{+}=\max\{a,0\}$. For $\mathbf{v}\in\mathbb{R}^{d}$, we write $\left\lVert\mathbf{v}\right\rVert$ for its Euclidean norm. We denote by $\partial_{x}$ the partial derivative with respect to $x$. For two functions $g_{1},g_{2}$ on $\mathbb{R}$, we write $g_{1}=\mathcal{O}(g_{2})$ if there exists a constant $C>0$ such that $g_{1}(x)\leqslant Cg_{2}(x)$ for large enough $x$. We write $g_{1}=\Omega(g_{2})$ if $g_{2}=\mathcal{O}(g_{1})$. We also use the little-o notation: $g_{1}=o(g_{2})$ if $\lim_{x\to\infty}g_{1}(x)/g_{2}(x)=0$. We say that a function $\xi(\epsilon)$ is subexponential if $\lim_{\epsilon\to 0}\epsilon\ln\xi(\epsilon)=0$.

In the rest of the paper, as $f$ is fixed, we shall hide the dependence on $f$ of the various quantities. We will write $\pi_{\epsilon_{t}}=\pi_{\epsilon_{t}}^{f}$ and $\mu_{\epsilon_{t}}=\mu_{\epsilon_{t}}^{f}$.

1.2. Overview of the main results

In this subsection, we state our main results. First, let us clearly state the assumptions on the target function $U$, the function $f$ and the parameter $c$. Note that the critical height $E_{*}$ and the clipped critical height $c_{*}$ will be introduced in Section 2.3. These assumptions are standard in the simulated annealing literature.

Assumption 1.1.
  (1)

    The potential function $U$ is smooth with bounded second derivatives, that is,

    $\left\lVert\nabla_{x}^{2}U\right\rVert_{\infty}=\sup_{x\in\mathbb{R}^{d}}\sum_{i,j=1}^{d}\left(\partial_{x_{i}}\partial_{x_{j}}U(x)\right)^{2}<\infty.$

    Also, there exist constants $a_{1},a_{2},r,M>0$ such that $U$ satisfies

    $a_{1}\left\lVert x\right\rVert^{2}-M \leqslant U(x)\leqslant a_{2}\left\lVert x\right\rVert^{2}+M,$
    $-\nabla_{x}U(x)\cdot x \leqslant-r\left\lVert x\right\rVert^{2}+M,$

    for all $x\in\mathbb{R}^{d}$.

  (2)

    The function $f:\mathbb{R}\to\mathbb{R}^{+}$ is twice-differentiable, bounded, non-negative and non-decreasing. Furthermore, $f$ satisfies

    $f(0)=f^{\prime}(0)=f^{\prime\prime}(0)=0,$

    and there exist constants $M_{3},M_{4}>0$ such that

    $f(x)=M_{4},\quad\text{if }x\geqslant M_{3}.$

    We also denote $M_{5}:=\sup_{0\leqslant x\leqslant M_{3}}f^{\prime}(x)$.

  (3)

    The cooling schedule satisfies, for large enough $t$,

    $|\partial_{t}\epsilon_{t}|=\mathcal{O}\left(\dfrac{1}{t}\right).$

    This is for instance satisfied by the logarithmic cooling schedule (1.2).

  (4)

    The parameter $c$ is picked so that $c>U_{min}$. In the adaptive case, $c_{t}>U_{min}$ for all $t\geqslant 0$.

  (5)

    The initial law of $(X_{0},Y_{0})$ admits a smooth density with respect to the Lebesgue measure, which we denote by $m_{0}$. Its Fisher information $\int\left\lVert\nabla m_{0}\right\rVert^{2}/m_{0}\,dx\,dy$ and moments $\mathbb{E}\left(\left\lVert X_{0}\right\rVert^{p}+\left\lVert Y_{0}\right\rVert^{p}\right)$ are all finite for $p\geqslant 0$.

Our first main result gives a large-time convergence guarantee for IKSA $(X_{t},Y_{t})_{t\geqslant 0}$, introduced earlier in (1.8):

Theorem 1.1.

[Convergence of IKSA] Under Assumption 1.1, for any $\delta>0$, as $t\to\infty$ we have

$\mathbb{P}\left(U(X_{t})>U_{min}+\delta\right)\to 0.$

If we employ the logarithmic cooling schedule of the form $\epsilon_{t}=\frac{E}{\log(t)}$ for large enough $t$, where $E>c_{*}$ and both $c$ and $E$ are fixed, then for any $\alpha,\delta>0$, there exists a constant $A>0$ such that

(1.11) $\mathbb{P}\left(U(X_{t})>U_{min}+\delta\right)\leqslant A\left(\dfrac{1}{t}\right)^{\min\left\{\frac{1-\frac{c_{*}}{E}-\alpha}{2},\,\frac{\delta}{2E}\right\}}.$

If we compare the result of Monmarché (2018) for KSA (1.3) with the result we obtain in Theorem 1.1 for IKSA (1.8), in essence we replace the critical height $E_{*}$ by the clipped critical height $c_{*}$. In retrospect this is perhaps unsurprising, as one can understand the modification in IKSA as clipping the target function from $U$ to $U\wedge c$. While IKSA enjoys improved logarithmic cooling compared with KSA, logarithmic cooling is nevertheless known to be inefficient, see Catoni (1992) and the Remark in (Monmarché, 2018, Section 1.2).

In our second main result, we propose an adaptive method, which we call IAKSA, to tune the parameter $c=(c_{t})_{t\geqslant 0}$ and the energy level $E=(E_{t})_{t\geqslant 0}$ using the running minimum $\min_{v\leqslant t}U(X_{v})$ on the fly. We discuss the technical details in Section 3. Note that the idea of tuning $c$ adaptively on the fly dates back to the work of Fang et al. (1997) for ISA.

Theorem 1.2.

[Convergence of IAKSA] Under Assumption 1.1, consider the kinetic dynamics $(X_{t},Y_{t})_{t\geqslant 0}$ described by

$dX_{t} = Y_{t}\,dt,$
$dY_{t} = -\dfrac{1}{\epsilon_{t}}Y_{t}\,dt - \epsilon_{t}\nabla_{x}H_{\epsilon_{t},c_{t}}(X_{t})\,dt + \sqrt{2}\,dB_{t},$

where $H_{\epsilon_{t},c_{t}}$ is introduced in (1.7), $c_{t}$ is tuned adaptively according to (3.1) and the cooling schedule is

$\epsilon_{t}=\dfrac{E_{t}}{\ln t}$

with $E_{t}$ satisfying (3.2). Given $\delta>0$, for large enough $t$ and a constant $A>0$, we select $\delta_{1},\delta_{2}>0$ such that $0<\delta_{2}-\delta_{1}<\delta$, and consider sufficiently small $\alpha$ with $\alpha\in(0,\frac{\delta_{2}-\delta_{1}}{U(X_{0})-U_{min}+\delta_{2}})$, to yield

$\mathbb{P}\left(U(X_{t})>U_{min}+\delta\right)\leqslant A\left(\dfrac{1}{t}\right)^{a},$

where

$a:=\min\left\{\frac{\frac{\delta_{2}-\delta_{1}}{\delta+\delta_{2}}-\alpha}{2},\,\frac{\frac{\delta_{2}-\delta_{1}}{U(X_{0})-U_{min}+\delta_{2}}-\alpha}{2}\right\}.$
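The overall structure of an IAKSA run can be sketched as follows; this is our illustration only, and the explicit update rules for $c_{t}$ and $E_{t}$ in the code below are placeholders standing in for the actual rules (3.1)-(3.2), which are stated in Section 3. The constants kappa and eta are hypothetical offsets introduced purely for illustration.

```python
import numpy as np

def iaksa_sketch(U, grad_U, x0, kappa=0.1, eta=0.1, h=1e-3, n_steps=100_000,
                 f=np.arctan, f_prime=lambda s: 1.0 / (1.0 + s * s), rng=None):
    """Structure of an adaptive IAKSA run: the running minimum of U along the
    trajectory feeds into c_t and E_t at every step.  The rules for c_t and E_t
    below are placeholders, not the rules (3.1)-(3.2) of the paper."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x0, dtype=float).copy()
    y = np.zeros_like(x)
    running_min = U(x)                          # min_{v <= t} U(X_v)
    for k in range(1, n_steps + 1):
        t = k * h
        c_t = running_min + kappa               # placeholder adaptive rule for c_t
        E_t = (c_t - running_min) + eta         # placeholder adaptive rule for E_t
        eps_t = E_t / max(np.log(t), 1.0)       # cooling schedule eps_t = E_t / ln t
        s = U(x) - c_t
        if s <= 0.0:
            gH = grad_U(x) / eps_t
        else:
            gH = (1.0 + f_prime(s)) / (f(s) + eps_t) * grad_U(x)
        y += (-y / eps_t - eps_t * gH) * h + np.sqrt(2.0 * h) * rng.standard_normal(y.shape)
        x += y * h
        running_min = min(running_min, U(x))    # update the running minimum on the fly
    return x, running_min
```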

1.3. Numerical results

In this subsection, we present our numerical illustrations. The benchmark functions are the Rastrigin function $U_{3}$, the Ackley3 function $U_{2}$ and the Ackley function $U_{1}$. Further details on the experimental setup and the parameters used (such as the initialization, stepsize and cooling schedule) are given in the Appendix.

We mimic Figure 3 in Monmarché (2018) and plot the corresponding results in Figure 2 for the three benchmark functions. On the vertical axis, we plot $\log_{10}\mathbb{P}\left(\min_{v\leqslant t}U(X_{v})>U_{min}+\delta\right)$ or $\log_{10}\mathbb{P}\left(\min_{v\leqslant t}U(Z_{v})>U_{min}+\delta\right)$ against $\log_{10}t$ in Figures (2(a)), (2(c)) and (2(e)), and similarly we plot $\log_{10}\mathbb{P}\left(U(X_{t})>U_{min}+\delta\right)$ or $\log_{10}\mathbb{P}\left(U(Z_{t})>U_{min}+\delta\right)$ against $\log_{10}t$ in Figures (2(b)), (2(d)) and (2(f)). To compute these probabilities, we run 100 independent replicas and count the proportion of replicas for which $U(X_{t})>U_{min}+\delta$ or $\min_{v\leqslant t}U(X_{v})>U_{min}+\delta$. We inject the same sequence of Gaussian noise into each of the 100 replicas across all four annealing methods for a fair comparison.
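As an illustration of how such curves can be produced, the sketch below is ours (the actual setup is described in the Appendix): it estimates $\mathbb{P}(U(X_{t})>U_{min}+\delta)$ and $\mathbb{P}(\min_{v\leqslant t}U(X_{v})>U_{min}+\delta)$ from a number of replicas, with a per-replica seed so that the same Gaussian increments can be reused across methods.

```python
import numpy as np

def estimate_probabilities(run_method, U_min, delta, n_replicas=100,
                           n_steps=10_000, seed=0):
    """run_method(rng, n_steps) should return the length-n_steps trajectory of
    U-values of one replica; reusing seed + i for replica i under every method
    reproduces the common-noise comparison described above."""
    exceed_final = np.zeros(n_steps)    # counts for U(X_t)            > U_min + delta
    exceed_runmin = np.zeros(n_steps)   # counts for min_{v<=t} U(X_v) > U_min + delta
    for i in range(n_replicas):
        rng = np.random.default_rng(seed + i)
        u_traj = np.asarray(run_method(rng, n_steps))
        run_min = np.minimum.accumulate(u_traj)
        exceed_final += (u_traj > U_min + delta)
        exceed_runmin += (run_min > U_min + delta)
    return exceed_final / n_replicas, exceed_runmin / n_replicas
```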

KSA and SA can be considered the baseline algorithms for IAKSA and IASA respectively. In all of the plots in Figure 2, IAKSA outperforms KSA, suggesting that there is an empirical advantage in using IAKSA over classical KSA in some instances.

For the Rastrigin function $U_{3}$ in Figures (2(a)) and (2(b)), we note that IAKSA enjoys improved convergence over the other methods, while SA does not seem to converge near $U_{min}$ at all. For IASA and KSA, although their running minima reach a neighbourhood of $U_{min}$ when $\log_{10}t$ is approximately 3 to 5, as is evident from Figure (2(b)) they do not remain in that neighbourhood.

For the Ackley3 function $U_{2}$ in Figures (2(c)) and (2(d)), we first observe that the curve corresponding to the running minimum of IASA drops quickly but stays flat once $\log_{10}t$ exceeds 5. In Figure (2(d)), the only method that lingers around the neighbourhood of $U_{min}$ is IAKSA.

For the Ackley function $U_{1}$ in Figures (2(e)) and (2(f)), the two overdamped methods IASA and SA seem to outperform the two kinetic methods IAKSA and KSA.

(a) Rastrigin function $U_{3}$: running minimum
(b) Rastrigin function $U_{3}$: $U_{3}(X_{t})$ or $U_{3}(Z_{t})$
(c) Ackley3 function $U_{2}$: running minimum
(d) Ackley3 function $U_{2}$: $U_{2}(X_{t})$ or $U_{2}(Z_{t})$
(e) Ackley function $U_{1}$: running minimum
(f) Ackley function $U_{1}$: $U_{1}(X_{t})$ or $U_{1}(Z_{t})$
Figure 2. Probability plots of the four annealing methods on the three benchmark functions

1.4. Organization of the paper

The rest of this paper is organized as follows. In Section 2, we present the proof of Theorem 1.1. In Section 3, we state the adaptive method IAKSA along with its proof.

2. Proof of Theorem 1.1

In this section, we prove the convergence of the improved kinetic dynamics with fixed $c$ and energy level $E$. We employ the same proof strategy as in Monmarché (2018) and break the proof into smaller parts. Finally, in Section 2.8 we connect the auxiliary results and finish the proof of Theorem 1.1.

2.1. Weak convergence of $\mu_{\epsilon}$

In this subsection, we prove that $\mu_{\epsilon}$ converges weakly to the set of global minima of $U$ as $\epsilon\to 0$. Recall that $\mu_{\epsilon}$ was first introduced in (1.6) as the stationary distribution of the improved annealing method ISA. Note that similar results have been obtained in Fang et al. (1997).

Proposition 2.1.

Under Assumption 1.1, for a fixed $\epsilon>0$, suppose the law of $X$ is $\mu_{\epsilon}$. Then for sufficiently small $\delta>0$ with $\delta\in(0,c-U_{min})$, there exists a constant $D_{2}=D_{2}(\delta)>0$, independent of $\epsilon$, such that

$\mathbb{P}\left(U(X)>U_{min}+\delta\right)\leqslant D_{2}e^{-\frac{\delta}{2\epsilon}}.$

[Proof.] First, denote $a:=U_{min}+\delta<c$. For any Borel set $S\subset\mathbb{R}^{d}$, we denote by $\mathrm{Vol}(S)$ its Lebesgue volume. Note that an equivalent way of writing down $\mu_{\epsilon}$ is

$\mu_{\epsilon}(x) = \dfrac{1}{\overline{\Lambda}_{\mu_{\epsilon}}}\dfrac{1}{f((U(x)-c)_{+})+\epsilon}\exp\bigg\{-\int_{a}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du\bigg\},$
$\overline{\Lambda}_{\mu_{\epsilon}} = \int_{\mathbb{R}^{d}}\dfrac{1}{f((U(x)-c)_{+})+\epsilon}\exp\bigg\{-\int_{a}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du\bigg\}\,dx$
$\geqslant \dfrac{1}{\epsilon}\int_{\{U(x)\leqslant a-\delta/2\}}\exp\bigg\{\int_{U(x)}^{a}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du\bigg\}\,dx$
$\geqslant \dfrac{1}{\epsilon}\int_{\{U(x)\leqslant a-\delta/2\}}\exp\bigg\{\dfrac{1}{\epsilon}(a-U(x))\bigg\}\,dx$
$\geqslant \dfrac{1}{\epsilon}e^{\frac{\delta}{2\epsilon}}\mathrm{Vol}(\{U(x)\leqslant a-\delta/2\})=:\dfrac{1}{\epsilon}e^{\frac{\delta}{2\epsilon}}D_{1},$

where the set $\{U(x)\leqslant a-\delta/2\}$ is compact as $U$ is quadratic at infinity. Define

$A_{\delta}(x):=U_{min}+\delta+\left\lVert x\right\rVert.$

Note that under Assumption 1.1, since $U$ is quadratic at infinity, the set $S_{\delta}:=\{U(x)<A_{\delta}(x)\}$ is compact. Therefore, we have

$\mathbb{P}\left(U(X)>U_{min}+\delta\right) = \dfrac{1}{\overline{\Lambda}_{\mu_{\epsilon}}}\int_{\{U(x)>U_{min}+\delta\}}\dfrac{1}{f((U(x)-c)_{+})+\epsilon}\exp\bigg\{-\int_{a}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du\bigg\}\,dx$
$\leqslant \dfrac{e^{-\frac{\delta}{2\epsilon}}}{D_{1}}\int_{\{A_{\delta}(x)>U(x)>U_{min}+\delta\}}\exp\bigg\{-\dfrac{U(x)-a}{M_{4}+\epsilon}\bigg\}\,dx + \dfrac{e^{-\frac{\delta}{2\epsilon}}}{D_{1}}\int_{\{U(x)>A_{\delta}(x)\}}\exp\bigg\{-\dfrac{U(x)-a}{M_{4}+\epsilon}\bigg\}\,dx$
$\leqslant \dfrac{1}{D_{1}}\mathrm{Vol}(S_{\delta})e^{-\frac{\delta}{2\epsilon}}+\dfrac{1}{D_{1}}e^{-\frac{\delta}{2\epsilon}}\int_{\mathbb{R}^{d}}e^{-\frac{1}{M_{4}+\epsilon}\left\lVert x\right\rVert}\,dx.$

2.2. Existence and regularity for the density of IKSA

First, we note that the infinitesimal generator of the improved kinetic dynamics $(X_{t},Y_{t})_{t\geqslant 0}$ in (1.8) at a fixed temperature $\epsilon>0$ is

(2.1) $L_{\epsilon}=L_{\epsilon,c}:=y\cdot\nabla_{x}-\left(\dfrac{y}{\epsilon}+\epsilon\nabla_{x}H_{\epsilon}\right)\cdot\nabla_{y}+\Delta_{y},$

where we recall that $H_{\epsilon}=H_{\epsilon,c}$ was first introduced in (1.7), and $\nabla_{x}\cdot$ (resp. $\nabla_{y}\cdot$) is the divergence operator with respect to the variable $x$ (resp. $y$). When there is no ambiguity on the parameter $c$, we simply hide the dependence on $c$ and write $H_{\epsilon}=H_{\epsilon,c}$ and $L_{\epsilon}=L_{\epsilon,c}$.

We show that the density of $(X_{t},Y_{t})_{t\geqslant 0}$ is well-behaved:

Proposition 2.2.

Under Assumption 1.1, the process $(X_{t},Y_{t})_{t\geqslant 0}$ is well-defined and the second moment $\mathbb{E}(\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2})$ is finite for all $t$. The Lebesgue density of $(X_{t},Y_{t})$, denoted by $m_{t}$, is smooth and positive. Moreover,

$h_{t}:=\dfrac{dm_{t}}{d\pi_{\epsilon_{t}}}$

is well-defined and smooth.

[Proof.] We follow the same proof as in (Monmarché, 2018, Propositions 4 and 5).

First, we show non-explosiveness and the finiteness of the second moment. Consider the homogeneous Markov process $(X_{t},Y_{t},t)_{t\geqslant 0}$, with generator $\mathcal{L}=L_{\epsilon_{t}}+\partial_{t}$ and Hamiltonian function

$\mathcal{H}(x,y,t)=H_{\epsilon_{t}}(x)-\ln\epsilon_{t}+\dfrac{\left\lVert y\right\rVert^{2}}{2\epsilon_{t}}+1.$

Note that for $s\leqslant t$:

$\mathcal{L}\mathcal{H}(x,y,s) = \dfrac{1}{\epsilon_{s}}-\dfrac{\left\lVert y\right\rVert^{2}}{\epsilon_{s}^{2}}-\epsilon_{s}^{\prime}\left(\int_{U_{min}}^{U(x)}\dfrac{1}{\left(f((u-c)_{+})+\epsilon_{s}\right)^{2}}\,du+\frac{\left\lVert y\right\rVert^{2}}{2\epsilon_{s}^{2}}\right)+\dfrac{\epsilon_{s}^{\prime}}{f((U(x)-c)_{+})+\epsilon_{s}}-\dfrac{\epsilon_{s}^{\prime}}{\epsilon_{s}}$
$\leqslant C\mathcal{H}(x,y,s),$

where the constant $C$ depends on the bounds of $\epsilon_{s}$ and its derivative on $[0,t]$. Using the Markov inequality and Itô's formula, we have $\mathbb{E}(\mathcal{H}(X_{t},Y_{t}))<\infty$, which implies that the second moment is finite.

For the smoothness of $m_{t}$, one can readily check that Hörmander's bracket condition is satisfied.

For the positivity of $m_{t}$, we apply exactly the same argument as in Proposition 5 of Monmarché (2018), with $F_{t}(x,y)$ therein replaced by $F_{t}(x,y)=\nabla_{x}H_{\epsilon_{t}}(x)+\frac{y}{\epsilon_{t}}$.

2.3. The log-Sobolev inequality

In this subsection, we prove the log-Sobolev inequality for the improved kinetic Langevin process $(X_{t},Y_{t})_{t\geqslant 0}$. First, let us briefly recall the concept of the critical height $E_{*}$, which appears frequently in both classical and modern work in the annealing literature Jacquot (1992); Holley and Stroock (1988); Miclo (1992). For two points $x,y\in\mathbb{R}^{d}$, we write $\Gamma_{x,y}$ for the set of $C^{1}$ parametric curves that start at $x$ and end at $y$. Given a target function $U$, the classical critical height $E_{*}$ of $U$ is then defined to be

(2.2) $E_{*}=E_{*}(U):=\sup_{x,y\in\mathbb{R}^{d}}\inf_{\gamma\in\Gamma_{x,y}}\left\{\sup_{t}\{U(\gamma(t))\}-U(x)-U(y)+\inf U\right\}.$

One major motivation of the current work is the introduction of the function $f$ and the parameter $c$, which control the injection of Gaussian noise into the system and allow one to clip the critical height $E_{*}$ at an arbitrary level $c$, thus effectively reducing $E_{*}$. Precisely, for an arbitrary $\delta_{1}>0$ and $c>U_{min}$, we define $c_{*}$ to be

(2.3) $c_{*}=c_{*}(U,c,\delta_{1}):=\sup_{x,y\in\mathbb{R}^{d}}\inf_{\gamma\in\Gamma_{x,y}}\left\{\sup_{t}\{U(\gamma(t))\wedge(c+\delta_{1})\}-U(x)\wedge c-U(y)\wedge c+\inf U\right\}.$
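In dimension one, the infimum over curves in (2.2) and (2.3) reduces to a maximum over the interval between the two endpoints, so both quantities can be computed by brute force on a grid. The sketch below is our own illustration (valid for 1D grids only) and makes the clipping property $c_{*}\leqslant c-U_{min}$ easy to check numerically, e.g. for the function $U_{0}$ of Figure 1.

```python
import numpy as np

def critical_heights_1d(U_vals, c, delta1=0.0):
    """Critical height E_* as in (2.2) and clipped critical height c_* as in (2.3)
    for a function sampled on a 1D grid; in 1D the optimal path between two grid
    points is the segment joining them, so the sup over paths is a running maximum."""
    U_min = U_vals.min()
    n = len(U_vals)
    E_star, c_star = 0.0, 0.0
    for i in range(n):
        run_max = U_vals[i]
        run_max_clip = min(U_vals[i], c + delta1)
        for j in range(i, n):
            run_max = max(run_max, U_vals[j])
            run_max_clip = max(run_max_clip, min(U_vals[j], c + delta1))
            E_star = max(E_star, run_max - U_vals[i] - U_vals[j] + U_min)
            c_star = max(c_star,
                         run_max_clip - min(U_vals[i], c) - min(U_vals[j], c) + U_min)
    return E_star, c_star

xs = np.linspace(-2 * np.pi, 2 * np.pi, 600)
U_vals = np.cos(2 * xs) + 0.5 * np.sin(xs) + np.sin(10 * xs) / 3.0
E_star, c_star = critical_heights_1d(U_vals, c=-1.5)   # c_star <= -1.5 - U_vals.min()
```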

For a probability measure $\nu$ on $\mathbb{R}^{2d}$ and a smooth positive function $h$ on $\mathbb{R}^{2d}$ such that $\nu(h)=1$, we define the relative entropy and the Fisher information respectively by

$\mathrm{Ent}_{\nu}(h) = \int h\ln h\,d\nu,$
$I_{\nu}(h) = \int\dfrac{|\nabla h|^{2}}{h}\,d\nu.$
Proposition 2.3.

Under Assumption 1.1 and with the temperature $\epsilon$ fixed, for any $\delta_{1}>0$ and any positive smooth function $h$ with $\pi_{\epsilon}(h)=1$, there exist $c_{*}=c_{*}(U,c,\delta_{1})$ and a polynomial function $p_{\delta_{1}}(1/\epsilon)$ (which may depend on $\delta_{1}$) such that

$\mathrm{Ent}_{\pi_{\epsilon}}(h) \leqslant\max\left(\dfrac{\epsilon}{2},\,p_{\delta_{1}}(1/\epsilon)e^{\frac{c_{*}}{\epsilon}}\right)I_{\pi_{\epsilon}}(h).$
Remark 2.1.

In Fang et al. (1997), by introducing an assumption on the behaviour of $f$ near $0$ (Assumption $(H4)$ therein), the authors show that the log-Sobolev constant is of the order $\mathcal{O}(e^{\frac{c_{*}(U,c,0)}{\epsilon}})$, while in our Assumption 1.1 we do not place such an assumption on $f$, at the cost of introducing an error $\delta_{1}>0$ that appears in $p_{\delta_{1}}(1/\epsilon)$. Alternatively, we can impose the extra assumptions and simply use the result obtained in Fang et al. (1997) for the log-Sobolev inequality.

[Proof.] First, note that $\pi_{\epsilon}$ can be written as a tensor product of $\mu_{\epsilon}$ (for the $x$ coordinates) and a Gaussian distribution with mean $0$ and variance $\epsilon$ (for the $y$ coordinates). Since the log-Sobolev inequality tensorizes and is stable under perturbation (see e.g. the references in Monmarché (2018); Chak et al. (2020)), it suffices to determine the log-Sobolev constant of $\mu_{\epsilon}$.

Let us recall that the improved overdamped Langevin dynamics at temperature $\epsilon$ is described by

$dZ_{t}=-\nabla U(Z_{t})\,dt+\sqrt{2\left(f((U(Z_{t})-c)_{+})+\epsilon\right)}\,dB_{t},$

with generator $\overline{L}_{\epsilon}$ and $L^{2}(\mu_{\epsilon})$-spectral gap

$\overline{\lambda}_{2}=\inf_{h\in L^{2}(\mu_{\epsilon});~\mu_{\epsilon}(h)=0}\dfrac{\langle-\overline{L}_{\epsilon}h,h\rangle_{\mu_{\epsilon}}}{\langle h,h\rangle_{\mu_{\epsilon}}},$

where we write $\langle g,h\rangle_{\mu_{\epsilon}}=\int gh\,d\mu_{\epsilon}$ for the inner product in $L^{2}(\mu_{\epsilon})$ of $g,h\in L^{2}(\mu_{\epsilon})$. Define

$V_{\epsilon}(x):=\int_{U_{min}}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du,\quad\nu_{\epsilon}(x)\propto e^{-V_{\epsilon}(x)}.$

Since $\{h\in L^{2}(\nu_{\epsilon});~\nu_{\epsilon}(h)=0\}\subset\{h\in L^{2}(\mu_{\epsilon});~\mu_{\epsilon}(h)=0\}$, we thus have

$\overline{\lambda}_{2}\leqslant(M_{4}+\epsilon)\inf_{h\in L^{2}(\nu_{\epsilon});~\nu_{\epsilon}(h)=0}\dfrac{\int\left\lVert\nabla h\right\rVert^{2}\,d\nu_{\epsilon}}{\langle h,h\rangle_{\nu_{\epsilon}}}.$

Therefore, we can pretend that our target function is $V_{\epsilon}$ at temperature $1$ in the classical Langevin dynamics, and utilize the result of Jacquot (1992) to conclude that, for a constant $A>0$ independent of $\epsilon$,

(2.4) $\overline{\lambda}_{2}\leqslant Ae^{E_{*}(V_{\epsilon})}.$

If we show that

(2.5) $E_{*}(V_{\epsilon})\leqslant\frac{c_{*}}{\epsilon}+C_{\delta_{1},U},$

where $C_{\delta_{1},U}$ is a constant that depends on $\delta_{1}$ and $U$ but not on $\epsilon$, then the desired result follows from the above spectral gap estimate and (Fang et al., 1997, Theorem 4.4).

In order to apply the result of Jacquot (1992), we check Conditions $(A)$ and $(B)$ therein:

  • Condition $(A)$: $\left\lVert\nabla_{x}V_{\epsilon}\right\rVert\to\infty$ and $V_{\epsilon}\to\infty$ as $\left\lVert x\right\rVert\to\infty$, since $U$ is quadratic at infinity.

  • Condition $(B)$: Outside a large enough ball we have $U(x)-c>M_{3}$, so that

    $\nabla_{x}V_{\epsilon}(x) = \dfrac{1}{f((U(x)-c)_{+})+\epsilon}\nabla_{x}U(x)=\dfrac{1}{M_{4}+\epsilon}\nabla_{x}U(x),$
    $\Delta V_{\epsilon}(x) = \dfrac{1}{M_{4}+\epsilon}\Delta U(x),$

    and so $\left\lVert\nabla_{x}V_{\epsilon}\right\rVert^{2}-\Delta V_{\epsilon}$ is bounded below outside this ball; by continuity it is also bounded below on the ball, and hence it is bounded below for all $x\in\mathbb{R}^{d}$.

Now, we observe that, for any $x\in\mathbb{R}^{d}$,

$\inf V_{\epsilon} = 0,$
$V_{\epsilon}(x) = \dfrac{1}{\epsilon}\left(U(x)\wedge c-U_{min}\right)+\int_{U(x)\wedge c}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du \geqslant \dfrac{1}{\epsilon}\left(U(x)\wedge c-U_{min}\right),$
$V_{\epsilon}(x) \leqslant \dfrac{1}{\epsilon}\left(U(x)\wedge(c+\delta_{1})-U_{min}\right)+\dfrac{1}{f(\delta_{1})}(U(x)-U(x)\wedge(c+\delta_{1})).$

Writing $B(0,R)$ for the ball centered at $0$ with sufficiently large radius $R$, we finally show (2.5):

$E_{*}(V_{\epsilon}) = \sup_{x,y\in\mathbb{R}^{d}}\inf_{\gamma\in\Gamma_{x,y}}\left\{\sup_{t}\{V_{\epsilon}(\gamma(t))\}-V_{\epsilon}(x)-V_{\epsilon}(y)\right\}$
$= \sup_{x,y\in B(0,R)}\inf_{\gamma\in\Gamma_{x,y}}\left\{\sup_{t}\{V_{\epsilon}(\gamma(t))\}-V_{\epsilon}(x)-V_{\epsilon}(y)\right\}$
$\leqslant \dfrac{c_{*}}{\epsilon}+\sup_{x\in B(0,R)}\dfrac{1}{f(\delta_{1})}(U(x)-U(x)\wedge(c+\delta_{1}))=:\dfrac{c_{*}}{\epsilon}+C_{\delta_{1},U}.$

2.4. Lyapunov function and moment estimates

The aim of this section is to prove the following moment estimate for $(X_{t},Y_{t})_{t\geqslant 0}$:

Proposition 2.4.

For any $p\in\mathbb{N}$, $\alpha>0$ and large enough $t$ (which depends on $p$, $U$, $f$ and the temperature schedule $\epsilon_{t}$), there exists a constant $k$, independent of $t$, such that

$\mathbb{E}\left(\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)^{p}\leqslant k(1+t)^{\alpha}.$

We first prove the following Lyapunov property in Section 2.4.1, and then prove Proposition 2.4 in Section 2.4.2.

Proposition 2.5 (Lyapunov property of $R_{\epsilon}$).

Let

(2.6) $R_{\epsilon}(x,y)=R_{\epsilon,c}(x,y):=\epsilon H_{\epsilon}(x)+\dfrac{\left\lVert y\right\rVert^{2}}{2}+\epsilon^{3}x\cdot y.$

For $\epsilon>1$, there exist constants $c_{1},c_{4}$ and a subexponential function $c_{5}(\epsilon)=\mathcal{O}\left(\epsilon^{3}\ln\left(1/\epsilon\right)\right)$ of $\epsilon$ such that

$L_{\epsilon}(R_{\epsilon})\leqslant-c_{4}\epsilon^{5}R_{\epsilon}+c_{5}(\epsilon)+c_{1}.$

2.4.1. Proof of Proposition 2.5

As we have

$L_{\epsilon}\left(\dfrac{\left\lVert y\right\rVert^{2}}{2}\right) = -\dfrac{\left\lVert y\right\rVert^{2}}{\epsilon}-\epsilon\nabla_{x}H_{\epsilon}\cdot y+d,$
$L_{\epsilon}(\epsilon H_{\epsilon}) = y\cdot\epsilon\nabla_{x}H_{\epsilon},$

summing up these two equations leads to

(2.7) $L_{\epsilon}\left(\epsilon H_{\epsilon}+\dfrac{\left\lVert y\right\rVert^{2}}{2}\right)=-\dfrac{\left\lVert y\right\rVert^{2}}{\epsilon}+d.$

Note that we also have

(2.8) $L_{\epsilon}(x\cdot y)=\left\lVert y\right\rVert^{2}-\dfrac{x\cdot y}{\epsilon}-\epsilon\nabla_{x}H_{\epsilon}\cdot x.$

We proceed to give an upper bound on the second and third terms in the above equation. We first consider lower bounding $\nabla_{x}H_{\epsilon}\cdot x$:

$\nabla_{x}H_{\epsilon}\cdot x = \dfrac{1+f^{\prime}((U(x)-c)_{+})}{f((U(x)-c)_{+})+\epsilon}\nabla_{x}U(x)\cdot x$
$\geqslant \dfrac{1+f^{\prime}((U(x)-c)_{+})}{f((U(x)-c)_{+})+\epsilon}\left(r\left\lVert x\right\rVert^{2}-M\right)$
(2.9) $\geqslant \dfrac{r}{M_{4}+\epsilon_{0}}\left\lVert x\right\rVert^{2}-\dfrac{(M_{5}+1)M}{\epsilon}=:\overline{r}_{2}\left\lVert x\right\rVert^{2}-\overline{M}_{\epsilon},$

where the first inequality follows from Assumption 1.1 item 1, and we recall that both $M_{4}$ and $M_{5}$ are introduced in Assumption 1.1 item 2. Next, as we have

$|x\cdot y|\leqslant\dfrac{\overline{r}_{2}\epsilon^{2}}{2}\left\lVert x\right\rVert^{2}+\dfrac{1}{2\overline{r}_{2}\epsilon^{2}}\left\lVert y\right\rVert^{2},$

dividing by $\epsilon$ leads to

(2.10) $-\dfrac{x\cdot y}{\epsilon}\leqslant\dfrac{|x\cdot y|}{\epsilon}\leqslant\dfrac{\overline{r}_{2}\epsilon}{2}\left\lVert x\right\rVert^{2}+\dfrac{1}{2\overline{r}_{2}\epsilon^{3}}\left\lVert y\right\rVert^{2}.$

We substitute (2.9) and (2.10) into (2.8) to yield

$L_{\epsilon}(x\cdot y) \leqslant\left(1+\dfrac{1}{2\overline{r}_{2}\epsilon^{3}}\right)\left\lVert y\right\rVert^{2}-\dfrac{\epsilon}{2}\left(2\nabla_{x}H_{\epsilon}\cdot x-\overline{r}_{2}\left\lVert x\right\rVert^{2}\right)$
(2.11) $\leqslant\left(1+\dfrac{1}{2\overline{r}_{2}\epsilon^{3}}\right)\left\lVert y\right\rVert^{2}-\dfrac{\epsilon}{2}\left(\overline{r}_{2}\left\lVert x\right\rVert^{2}-2\overline{M}_{\epsilon}\right).$

The next two results give upper and lower bounds for $H_{\epsilon}$ and $R_{\epsilon}$.

Lemma 2.1.

For any $\delta>0$, the upper bound of $H_{\epsilon}$ is

$H_{\epsilon}(x)\leqslant\begin{cases}\dfrac{1}{\epsilon}\left(a_{2}\left\lVert x\right\rVert^{2}+M-U_{min}\right)+\ln(M_{4}+\epsilon_{0}),&\mbox{if }U(x)\leqslant c+\delta,\\ \dfrac{1}{\epsilon}\left(c+\delta-U_{min}\right)+\dfrac{1}{f(\delta)}\left(a_{2}\left\lVert x\right\rVert^{2}+M-(c+\delta)\right)+\ln(M_{4}+\epsilon_{0}),&\mbox{if }U(x)>c+\delta.\end{cases}$

Consequently, we have

$\left\lVert x\right\rVert^{2}\geqslant\begin{cases}\dfrac{1}{a_{2}}\left(\epsilon\left(H_{\epsilon}(x)-\ln(M_{4}+\epsilon_{0})\right)-M+U_{min}\right),&\mbox{if }U(x)\leqslant c+\delta,\\ \dfrac{1}{a_{2}}\left(f(\delta)\left(H_{\epsilon}(x)-\dfrac{1}{\epsilon}\left(c+\delta-U_{min}\right)-\ln(M_{4}+\epsilon_{0})\right)+(c+\delta)-M\right),&\mbox{if }U(x)>c+\delta.\end{cases}$

On the other hand, the lower bound of $H_{\epsilon}$ is

$H_{\epsilon}(x)\geqslant\dfrac{1}{M_{4}+\epsilon}\left(a_{1}\left\lVert x\right\rVert^{2}-M-U_{min}\right)+\ln\epsilon.$

Consequently, we have

$\left\lVert x\right\rVert^{2}\leqslant\dfrac{1}{a_{1}}\left((M_{4}+\epsilon)H_{\epsilon}(x)-(M_{4}+\epsilon)\ln\epsilon+M+U_{min}\right).$

[Proof.] We shall only prove the bounds on $H_{\epsilon}$; from these bounds it is straightforward to deduce the bounds on $\left\lVert x\right\rVert^{2}$. We first prove the upper bound of $H_{\epsilon}$. If $U(x)\leqslant c+\delta$, we have

$H_{\epsilon}(x)\leqslant\dfrac{1}{\epsilon}(U(x)-U_{min})+\ln(M_{4}+\epsilon_{0})\leqslant\dfrac{1}{\epsilon}\left(a_{2}\left\lVert x\right\rVert^{2}+M-U_{min}\right)+\ln(M_{4}+\epsilon_{0}),$

where the second inequality follows from the assumption on $U$ in Assumption 1.1. If $U(x)>c+\delta$,

$H_{\epsilon}(x) = \int_{U_{min}}^{c+\delta}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du+\int_{c+\delta}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du+\ln\left(f((U(x)-c)_{+})+\epsilon\right)$
$\leqslant \dfrac{1}{\epsilon}\left(c+\delta-U_{min}\right)+\dfrac{1}{f(\delta)}\left(a_{2}\left\lVert x\right\rVert^{2}+M-(c+\delta)\right)+\ln(M_{4}+\epsilon_{0}),$

where the inequality follows from Assumption 1.1 again. For the lower bound of $H_{\epsilon}$, we consider

$H_{\epsilon}(x)\geqslant\dfrac{1}{M_{4}+\epsilon}(U(x)-U_{min})+\ln\epsilon\geqslant\dfrac{1}{M_{4}+\epsilon}\left(a_{1}\left\lVert x\right\rVert^{2}-M-U_{min}\right)+\ln\epsilon,$

where the second inequality follows from Assumption 1.1.

Lemma 2.2.

For the upper bound of $R_{\epsilon}$, we have

$R_{\epsilon}(x,y)\leqslant\left(\epsilon+\epsilon^{3}\dfrac{M_{4}+\epsilon}{a_{1}}\right)H_{\epsilon}(x)+\left(\dfrac{1}{2}+\dfrac{\epsilon^{3}}{2}\right)\left\lVert y\right\rVert^{2}+\dfrac{\epsilon^{3}}{a_{1}}\left(-(M_{4}+\epsilon)\ln\epsilon+M+U_{min}\right).$

On the other hand, the lower bound of $R_{\epsilon}$ is

$R_{\epsilon}(x,y)\geqslant\dfrac{\epsilon}{M_{4}+\epsilon}\left(a_{1}\left\lVert x\right\rVert^{2}-M-U_{min}\right)+\epsilon\ln\epsilon+\dfrac{\left\lVert y\right\rVert^{2}}{2}-\epsilon^{3}\left(\left\lVert x\right\rVert^{2}/2+\left\lVert y\right\rVert^{2}/2\right).$

[Proof.] Using $|x\cdot y|\leqslant\left\lVert x\right\rVert^{2}/2+\left\lVert y\right\rVert^{2}/2$, the desired results follow from (2.6) and Lemma 2.1.

Using Lemma 2.1 together with (2.11) leads to, for some constants $c_{1},c_{2},c_{3}$,

$\epsilon^{3}L_{\epsilon}(x\cdot y) \leqslant\left(\epsilon^{3}+\dfrac{1}{2\overline{r}_{2}}\right)\left\lVert y\right\rVert^{2}-c_{3}\epsilon^{5}H_{\epsilon}(x)+c_{1},$

and hence, when combined with (2.7) and (2.6),

$L_{\epsilon}(R_{\epsilon})\leqslant-c_{2}\epsilon^{5}\left\lVert y\right\rVert^{2}-c_{3}\epsilon^{5}H_{\epsilon}(x)+c_{1}.$

Finally, the above equation together with Lemma 2.2 gives, for some constant $c_{4}$ and a subexponential function $c_{5}(\epsilon)$ of $\epsilon$,

$L_{\epsilon}(R_{\epsilon})\leqslant-c_{4}\epsilon^{5}R_{\epsilon}(x,y)+c_{5}(\epsilon)+c_{1}.$

2.4.2. Proof of Proposition 2.4

Proposition 2.6.

For any $p\in\mathbb{N}$ and $\epsilon>1$, there exist subexponential functions $C_{p,1}(\epsilon),C_{p,2}(\epsilon)$ such that

$\dfrac{\partial}{\partial\epsilon}R_{\epsilon}^{p}\leqslant C_{p,1}(\epsilon)R_{\epsilon}^{p}+C_{p,2}(\epsilon)R_{\epsilon}^{p-1}.$

[Proof.] For some constants $c_{6},c_{7}>0$, we compute

$\dfrac{\partial}{\partial\epsilon}R_{\epsilon}^{p} = pR_{\epsilon}^{p-1}\left(\epsilon\dfrac{\partial}{\partial\epsilon}H_{\epsilon}+H_{\epsilon}+3\epsilon^{2}x\cdot y\right)$
$\leqslant pR_{\epsilon}^{p-1}\left(1+\dfrac{1}{\epsilon}\left(a_{2}\left\lVert x\right\rVert^{2}+M-U_{min}\right)+\ln(M_{4}+\epsilon_{0})+3\epsilon^{2}(\left\lVert x\right\rVert^{2}/2+\left\lVert y\right\rVert^{2}/2)\right)$
$\leqslant pR_{\epsilon}^{p-1}\left(\dfrac{c_{6}}{\epsilon}(\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})+1+\dfrac{1}{\epsilon}\left(M-U_{min}\right)+\ln(M_{4}+\epsilon_{0})\right).$

Now, we use the lower bound of $R_{\epsilon}$ in Lemma 2.2 to see that

$R_{\epsilon}(x,y)\geqslant c_{7}\epsilon(\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})-\epsilon\dfrac{M+U_{min}}{M_{4}+\epsilon}+\epsilon\ln\epsilon,$

and so

$\dfrac{\partial}{\partial\epsilon}R_{\epsilon}^{p} \leqslant pR_{\epsilon}^{p-1}\left(\dfrac{c_{6}}{c_{7}\epsilon^{2}}\left(R_{\epsilon}(x,y)+\epsilon\dfrac{M+U_{min}}{M_{4}+\epsilon}-\epsilon\ln\epsilon\right)+1+\dfrac{1}{\epsilon}\left(M-U_{min}\right)+\ln(M_{4}+\epsilon_{0})\right)$
$=: C_{p,1}(\epsilon)R_{\epsilon}^{p}+C_{p,2}(\epsilon)R_{\epsilon}^{p-1}.$

The carré du champ operator $\Gamma_{\epsilon}$ of $L_{\epsilon}$ is $\Gamma_{\epsilon}f=\dfrac{1}{2}L_{\epsilon}f^{2}-fL_{\epsilon}f=\left\lVert\nabla_{y}f\right\rVert^{2}$. Using the lower bound of $R_{\epsilon}$ in Lemma 2.2, we compute, for a constant $c_{8}$ and subexponential functions $C_{3}(\epsilon),C_{4}(\epsilon)$,

$\Gamma_{\epsilon}R_{\epsilon}=\left\lVert\nabla_{y}R_{\epsilon}\right\rVert^{2} = \left\lVert y+\epsilon^{3}x\right\rVert^{2}$
$\leqslant c_{8}\left(\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2}\right)$
(2.12) $\leqslant \dfrac{c_{8}}{c_{7}\epsilon}\left(R_{\epsilon}(x,y)+\epsilon\dfrac{M+U_{min}}{M_{4}+\epsilon}-\epsilon\ln\epsilon\right)=:C_{3}(\epsilon)R_{\epsilon}(x,y)+C_{4}(\epsilon).$
Proposition 2.7.

For any $p\in\mathbb{N}$, $\alpha>0$ and large enough $t$, there exists a constant $\widetilde{C}_{p,\alpha}$ such that

$\mathbb{E}\left(R_{\epsilon_{t}}^{p}(X_{t},Y_{t})\right)\leqslant\widetilde{C}_{p,\alpha}(1+t)^{1+\alpha}.$

[Proof.] We shall prove the result by induction on $p$. We denote $n_{t,p}:=\mathbb{E}\left(R_{\epsilon_{t}}^{p}(X_{t},Y_{t})\right)$. When $p=0$, the result clearly holds. When $p=1$, using Propositions 2.5 and 2.6 we have

$\dfrac{\partial}{\partial t}n_{t,1} = \left(\dfrac{\partial}{\partial t}\epsilon_{t}\right)\dfrac{\partial}{\partial\epsilon_{t}}n_{t,1}+\dfrac{\partial}{\partial s}\mathbb{E}\left(R_{\epsilon_{t}}(X_{t+s},Y_{t+s})\right)\bigg|_{s=0}$
$\leqslant \bigg|\dfrac{\partial}{\partial t}\epsilon_{t}\bigg|\left(C_{1,1}(\epsilon_{t})n_{t,1}+C_{1,2}(\epsilon_{t})\right)+\mathbb{E}\left(L_{\epsilon_{t}}(R_{\epsilon_{t}})(X_{t},Y_{t})\right)$
$\leqslant \bigg|\dfrac{\partial}{\partial t}\epsilon_{t}\bigg|\left(C_{1,1}(\epsilon_{t})n_{t,1}+C_{1,2}(\epsilon_{t})\right)-c_{4}\epsilon_{t}^{5}n_{t,1}+c_{5}(\epsilon_{t})+c_{1}.$

As $\epsilon_{t}=\Omega(\frac{1}{\ln(1+t)})$, $\big|\frac{\partial}{\partial t}\epsilon_{t}\big|=\mathcal{O}(\frac{1}{t})$ and $C_{1,1}(\epsilon_{t}),C_{1,2}(\epsilon_{t}),c_{5}(\epsilon_{t})=o(t^{\beta})$ for any $\beta>0$ as $t\to\infty$, we deduce that, for constants $c_{6}>0$ and $c_{7}$,

(2.13) $\dfrac{\partial}{\partial t}n_{t,1} \leqslant-c_{6}\epsilon_{t}^{5}n_{t,1}+c_{7}(1+t)^{1+\alpha/2}.$

As a result, for a constant $c_{8}$,

$\dfrac{\partial}{\partial t}\left(n_{t,1}e^{c_{6}\int_{0}^{t}\epsilon_{s}^{5}\,ds}\right) \leqslant c_{7}(1+t)^{1+\alpha/2}e^{c_{6}\int_{0}^{t}\epsilon_{s}^{5}\,ds},$
$n_{t,1} \leqslant n_{0,1}+c_{7}(1+t)^{1+\alpha/2}\int_{0}^{t}e^{-c_{6}\int_{s}^{t}\epsilon_{u}^{5}\,du}\,ds$
$\leqslant n_{0,1}+c_{7}(1+t)^{1+\alpha/2}\int_{0}^{t}e^{-c_{6}\epsilon_{t}^{5}(t-s)}\,ds$
$\leqslant n_{0,1}+\dfrac{c_{7}(1+t)^{1+\alpha/2}}{c_{6}\epsilon_{t}^{5}}\leqslant c_{8}(1+t)^{1+\alpha}.$

This proves the result when $p=1$. Assume now that the result holds for all $q<p$, where $p\geqslant 2$. First, using Proposition 2.5 and equation (2.12) we compute

$L_{\epsilon_{t}}(R_{\epsilon_{t}}^{p}) = pR_{\epsilon_{t}}^{p-1}L_{\epsilon_{t}}(R_{\epsilon_{t}})+p(p-1)R^{p-2}_{\epsilon_{t}}\Gamma_{\epsilon_{t}}R_{\epsilon_{t}}$
(2.14) $\leqslant pR_{\epsilon_{t}}^{p-1}\left(-c_{4}\epsilon_{t}^{5}R_{\epsilon_{t}}+c_{5}(\epsilon_{t})+c_{1}\right)+p(p-1)C_{3}(\epsilon_{t})R^{p-1}_{\epsilon_{t}}+p(p-1)C_{4}(\epsilon_{t})R^{p-2}_{\epsilon_{t}}.$

Differentiating with respect to $t$, followed by using Proposition 2.6 and equation (2.14), gives

$\dfrac{\partial}{\partial t}n_{t,p} = \left(\dfrac{\partial}{\partial t}\epsilon_{t}\right)\dfrac{\partial}{\partial\epsilon_{t}}n_{t,p}+\dfrac{\partial}{\partial s}\mathbb{E}\left(R_{\epsilon_{t}}^{p}(X_{t+s},Y_{t+s})\right)\bigg|_{s=0}$
$\leqslant \bigg|\dfrac{\partial}{\partial t}\epsilon_{t}\bigg|\left(C_{p,1}(\epsilon_{t})n_{t,p}+C_{p,2}(\epsilon_{t})n_{t,p-1}\right)+\mathbb{E}\left(L_{\epsilon_{t}}(R_{\epsilon_{t}}^{p})(X_{t},Y_{t})\right)$
$\leqslant -c_{9}\epsilon_{t}^{5}n_{t,p}+c_{10}(p,\alpha)(1+t)^{1+\alpha/2},$

where we use the same asymptotic estimates that led us to (2.13) and the induction hypothesis on $n_{t,p-1}$ and $n_{t,p-2}$.

Now, we wrap up the proof of Proposition 2.4.

Using the lower bound of RϵR_{\epsilon} in Lemma 2.2, we note that, for some constants k1,k2,k3,k4>0k_{1},k_{2},k_{3},k_{4}>0 and arbitrary r>0r>0,

𝔼(Xt2+Yt2)p\displaystyle\mathbb{E}\left(\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)^{p} k1𝔼(1ϵtp(Rϵt(Xt,Yt)+ϵtM+UminM4+ϵt+ϵtln1ϵt)p)\displaystyle\leqslant k_{1}\mathbb{E}\left(\dfrac{1}{\epsilon_{t}^{p}}\left(R_{\epsilon_{t}}(X_{t},Y_{t})+\epsilon_{t}\dfrac{M+U_{min}}{M_{4}+\epsilon_{t}}+\epsilon_{t}\ln\dfrac{1}{\epsilon_{t}}\right)^{p}\right)
=k1𝔼(1ϵtp(j=0p(pj)Rϵt(Xt,Yt)j(ϵtM+UminM4+ϵt+ϵtln1ϵt)pj)p)\displaystyle=k_{1}\mathbb{E}\left(\dfrac{1}{\epsilon_{t}^{p}}\left(\sum_{j=0}^{p}{p\choose j}R_{\epsilon_{t}}(X_{t},Y_{t})^{j}\left(\epsilon_{t}\dfrac{M+U_{min}}{M_{4}+\epsilon_{t}}+\epsilon_{t}\ln\dfrac{1}{\epsilon_{t}}\right)^{p-j}\right)^{p}\right)
k21ϵtp(𝔼Rϵtjr(Xt,Yt))1/rk31ϵtp(1+t)2/rk4(1+t)3/r,\displaystyle\leqslant k_{2}\dfrac{1}{\epsilon_{t}^{p}}\left(\mathbb{E}R_{\epsilon_{t}}^{jr}(X_{t},Y_{t})\right)^{1/r}\leqslant k_{3}\dfrac{1}{\epsilon_{t}^{p}}(1+t)^{2/r}\leqslant k_{4}(1+t)^{3/r},

where the equality follows from the binomial theorem, the second inequality follows from Jensen’s inequality, the third inequality comes from Proposition 2.7, and we use the estimates ϵt=Ω(1ln(1+t))\epsilon_{t}=\Omega(\frac{1}{\ln(1+t)}), ln(1+t)p=𝒪((1+t)1/r)\ln(1+t)^{p}=\mathcal{O}((1+t)^{1/r}) for large enough tt in the last inequality.

2.5. Gamma calculus

For smooth enough ϕ\phi and gg, as LϵtL_{\epsilon_{t}} is a diffusion, recall that we have

Lϵt(ϕ(g))=ϕ(g)Lϵt(g)+ϕ′′(g)Γϵt(g),L_{\epsilon_{t}}(\phi(g))=\phi^{\prime}(g)L_{\epsilon_{t}}(g)+\phi^{\prime\prime}(g)\Gamma_{\epsilon_{t}}(g),

where Γϵt\Gamma_{\epsilon_{t}} is the carré du champ operator with Γϵtg=12Lϵtg2gLϵtg=yg2.\Gamma_{\epsilon_{t}}g=\dfrac{1}{2}L_{\epsilon_{t}}g^{2}-gL_{\epsilon_{t}}g=\left\lVert\nabla_{y}g\right\rVert^{2}. Let LϵtL_{\epsilon_{t}}^{*} be the L2(πϵt)L^{2}(\pi_{\epsilon_{t}}) adjoint of LϵtL_{\epsilon_{t}}, that is,

Lϵt=yx(yϵtϵtxHϵt)y+Δy.L_{\epsilon_{t}}^{*}=-y\cdot\nabla_{x}-\left(\dfrac{y}{\epsilon_{t}}-\epsilon_{t}\nabla_{x}H_{\epsilon_{t}}\right)\cdot\nabla_{y}+\Delta_{y}.

For hh a non-negative function from 2d\mathbb{R}^{2d} to \mathbb{R} which belongs to some functional space 𝒟\mathcal{D}, and a differentiable Φ:𝒟\Phi:\mathcal{D}\to\mathbb{R} whose differential at hh (the directional derivative in hh) is denoted by DhΦD_{h}\Phi, we will be interested in quantities of the form

(2.15) ΓLϵt,Φ(h):=12(LϵtΦ(h)DhΦ(h)(Lϵth)).\displaystyle\Gamma_{L_{\epsilon_{t}}^{*},\Phi}(h):=\dfrac{1}{2}\left(L_{\epsilon_{t}}^{*}\Phi(h)-D_{h}\Phi(h)\cdot(L_{\epsilon_{t}}^{*}h)\right).

We now present three auxiliary lemmas, which are essential in proving the dissipation of the distorted entropy later in Section 2.7. The proofs are essentially the same as corresponding proofs in Monmarché (2018), by replacing the target function UU therein with ϵtHϵt\epsilon_{t}H_{\epsilon_{t}}.

Lemma 2.3.

For non-negative hh and Φ(h)=hlnh\Phi(h)=h\ln h, we have

ΓLϵt,Φ(h)=Γϵt(h)2h.\Gamma_{L_{\epsilon_{t}}^{*},\Phi}(h)=\dfrac{\Gamma_{\epsilon_{t}}(h)}{2h}.

[Proof. ]The proof is the same as (Monmarché, 2018, Lemma 11), by replacing the target function UU therein with ϵtHϵt\epsilon_{t}H_{\epsilon_{t}}.

Lemma 2.4.

For non-negative hh and Φ(h)=Ah2\Phi(h)=\left\lVert Ah\right\rVert^{2}, where A=(A1,A2,,Ak)A=(A_{1},A_{2},\ldots,A_{k}) is a linear operator from 𝒟\mathcal{D} to 𝒟k\mathcal{D}^{k}, we have

ΓLϵt,Φ(h)=Γϵt(Ah)+(Ah)[Lϵt,A]h(Ah)[Lϵt,A]h,\Gamma_{L_{\epsilon_{t}}^{*},\Phi}(h)=\Gamma_{\epsilon_{t}}(Ah)+(Ah)[L_{\epsilon_{t}}^{*},A]h\geqslant(Ah)[L_{\epsilon_{t}}^{*},A]h,

where for two operators C,DC,D, the commutator bracket is [C,D]=CDDC[C,D]=CD-DC, Γϵt(Ah)=i=1kΓϵt(Aih)\Gamma_{\epsilon_{t}}(Ah)=\sum_{i=1}^{k}\Gamma_{\epsilon_{t}}(A_{i}h) and [Lϵt,A]=([Lϵt,A1],[Lϵt,A2],,[Lϵt,Ak])[L_{\epsilon_{t}}^{*},A]=([L_{\epsilon_{t}}^{*},A_{1}],[L_{\epsilon_{t}}^{*},A_{2}],\ldots,[L_{\epsilon_{t}}^{*},A_{k}]).

[Proof. ]The proof is the same as (Monmarché, 2018, Lemma 10), by replacing the target function UU therein with ϵtHϵt\epsilon_{t}H_{\epsilon_{t}}.

Lemma 2.5.

Let

γ(ϵt)\displaystyle\gamma(\epsilon_{t}) =12+2(ϵtxHϵt+1+1ϵt)2,\displaystyle=\dfrac{1}{2}+2\left(\epsilon_{t}\left\lVert\nabla_{x}H_{\epsilon_{t}}\right\rVert_{\infty}+1+\dfrac{1}{\epsilon_{t}}\right)^{2},
Ψϵt(h)\displaystyle\Psi_{\epsilon_{t}}(h) =(x+y)h2h+γ(ϵt)hlnh.\displaystyle=\dfrac{\left\lVert(\nabla_{x}+\nabla_{y})h\right\rVert^{2}}{h}+\gamma(\epsilon_{t})h\ln h.

We have

ΓLϵt,Ψϵt(h)12h2h.\Gamma_{L_{\epsilon_{t}}^{*},\Psi_{\epsilon_{t}}}(h)\geqslant\dfrac{1}{2}\dfrac{\left\lVert\nabla h\right\rVert^{2}}{h}.

[Proof. ]The proof is the same as (Monmarché, 2018, Corollary 13), by replacing the target function UU therein with ϵtHϵt\epsilon_{t}H_{\epsilon_{t}}.

2.6. Truncated differentiation

In this subsection, we present auxiliary results related to mollifiers and truncated differentiation. These results are needed when we discuss the dissipation of the distorted entropy in Section 2.7 below.

Let φ:\varphi:\mathbb{R}\to\mathbb{R} be a mollifier defined to be

φ(x):={e1x21(11e1y21𝑑y)1,if 1<x<1,0,otherwise.\varphi(x):=\begin{cases}e^{\frac{1}{x^{2}-1}}\left(\int_{-1}^{1}e^{\frac{1}{y^{2}-1}}\,dy\right)^{-1},&\mbox{if }-1<x<1,\\ 0,&\mbox{otherwise.}\end{cases}

For mm\in\mathbb{N}, we define

(2.16) φm(x)\displaystyle\varphi_{m}(x) :=1mφ(xm),\displaystyle:=\dfrac{1}{m}\varphi\left(\dfrac{x}{m}\right),
(2.17) vm\displaystyle v_{m} :=φm𝟏(,m2]1,\displaystyle:=\varphi_{m}\star\mathbf{1}_{(-\infty,m^{2}]}\leqslant 1,

where fgf\star g is the convolution of two functions f,gf,g and 𝟏A\mathbf{1}_{A} is the indicator function of the set AA. Recall that in Lemma 2.2 the lower bound of RϵR_{\epsilon} is given by

Rϵ(x,y)(ϵa1M4+ϵϵ32)x2+(1ϵ32)y2(ϵM+UminM4+ϵ+ϵln(1ϵ)).R_{\epsilon}(x,y)\geqslant\left(\dfrac{\epsilon a_{1}}{M_{4}+\epsilon}-\dfrac{\epsilon^{3}}{2}\right)\left\lVert x\right\rVert^{2}+\left(\dfrac{1-\epsilon^{3}}{2}\right)\left\lVert y\right\rVert^{2}-\left(\epsilon\dfrac{M+U_{min}}{M_{4}+\epsilon}+\epsilon\ln\left(\dfrac{1}{\epsilon}\right)\right).

Denoting the third term on the right hand side by

d2,ϵ:=ϵM+UminM4+ϵln(1ϵ),d_{2,\epsilon}:=\epsilon\dfrac{M+U_{min}}{M_{4}}+\epsilon\ln\left(\dfrac{1}{\epsilon}\right),

we define, for any δ>max{ϵ0UminM4,0}\delta>\max\bigg{\{}-\epsilon_{0}\dfrac{U_{min}}{M_{4}},0\bigg{\}} and δ2>0\delta_{2}>0,

(2.18) ηm,ϵ\displaystyle\eta_{m,\epsilon} :=vm(ln(Rϵ+2d2,ϵ+δ+δ2)).\displaystyle:=v_{m}\left(\ln\left(R_{\epsilon}+2d_{2,\epsilon}+\delta+\delta_{2}\right)\right).

Note that for large enough tt (which depends on a1,M4a_{1},M_{4} and the cooling schedule (ϵt)t0(\epsilon_{t})_{t\geqslant 0}), Rϵt+d2,ϵt0R_{\epsilon_{t}}+d_{2,\epsilon_{t}}\geqslant 0 for all x,ydx,y\in\mathbb{R}^{d}. Also, note that d2,ϵd_{2,\epsilon} can be negative as M+UminM+U_{min} may possibly be negative. Our choice of δ\delta ensures that d2,ϵ+δ0d_{2,\epsilon}+\delta\geqslant 0.

Now, we present results that summarize some basic properties of ηm,ϵ\eta_{m,\epsilon}.

Lemma 2.6.

For mm\in\mathbb{N}, ηm,ϵt\eta_{m,\epsilon_{t}}

  (1) is compactly supported;

  (2) converges to 11 pointwise as m\to\infty;

  (3) satisfies, for some constant C>0C>0 independent of mm and time tt and for all large enough tt,

    Lϵtηm,ϵtCm.L_{\epsilon_{t}}\eta_{m,\epsilon_{t}}\leqslant\dfrac{C}{m}.

[Proof. ]Similar to Chak et al. (2020), we have the following estimates:

vm(x)\displaystyle v_{m}(x) =m2φm(xy)𝑑y=x+m2φm(z)𝑑z,\displaystyle=\int_{-\infty}^{m^{2}}\varphi_{m}(x-y)\,dy=\int_{-\infty}^{x+m^{2}}\varphi_{m}(z)\,dz,
vm(x)\displaystyle v_{m}^{\prime}(x) =φm(x+m2)1m(maxφ),\displaystyle=\varphi_{m}(x+m^{2})\leqslant\dfrac{1}{m}\left(\max\varphi\right),
\displaystyle v_{m}^{\prime\prime}(x)=\varphi_{m}^{\prime}(x+m^{2})\leqslant\dfrac{1}{m^{2}}\left(\max|\varphi^{\prime}|\right).

From these it is straightforward to show items (1) and (2). We proceed to prove item (3). First, using Gamma calculus and the above upper bounds on vm,vm′′v_{m}^{\prime},v_{m}^{\prime\prime}, we have, for some constant C¯>0\overline{C}>0,

Lϵtηm,ϵt\displaystyle L_{\epsilon_{t}}\eta_{m,\epsilon_{t}} =vm(ln(Rϵt+2d2,ϵt+δ+δ2))Lϵt(ln(Rϵt+2d2,ϵt+δ+δ2))\displaystyle=v_{m}^{\prime}\left(\ln\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)\right)L_{\epsilon_{t}}\left(\ln\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)\right)
+vm′′(ln(Rϵt+2d2,ϵt+δ+δ2))y(ln(Rϵt+2d2,ϵt+δ+δ2))2\displaystyle\quad+v_{m}^{\prime\prime}\left(\ln\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)\right)\left\lVert\nabla_{y}\left(\ln\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)\right)\right\rVert^{2}
C¯(1m(LϵtRϵtRϵt+2d2,ϵt+δ+δ2yRϵt2(Rϵt+2d2,ϵt+δ+δ2)2)+1m2yRϵt2(Rϵt+2d2,ϵt+δ+δ2)2)\displaystyle\leqslant\overline{C}\left(\dfrac{1}{m}\left(\dfrac{L_{\epsilon_{t}}R_{\epsilon_{t}}}{R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}}-\dfrac{\left\lVert\nabla_{y}R_{\epsilon_{t}}\right\rVert^{2}}{\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)^{2}}\right)+\dfrac{1}{m^{2}}\dfrac{\left\lVert\nabla_{y}R_{\epsilon_{t}}\right\rVert^{2}}{\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)^{2}}\right)
C¯(1m(LϵtRϵtRϵt+2d2,ϵt+δ+δ2)+1m2yRϵt2(Rϵt+2d2,ϵt+δ+δ2)2)\displaystyle\leqslant\overline{C}\left(\dfrac{1}{m}\left(\dfrac{L_{\epsilon_{t}}R_{\epsilon_{t}}}{R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}}\right)+\dfrac{1}{m^{2}}\dfrac{\left\lVert\nabla_{y}R_{\epsilon_{t}}\right\rVert^{2}}{\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)^{2}}\right)
C¯(1m(c4ϵt5Rϵt+c5(ϵt)+c1Rϵt+2d2,ϵt+δ+δ2)+1m2yRϵt2(Rϵt+2d2,ϵt+δ+δ2)2)\displaystyle\leqslant\overline{C}\left(\dfrac{1}{m}\left(\dfrac{-c_{4}\epsilon_{t}^{5}R_{\epsilon_{t}}+c_{5}(\epsilon_{t})+c_{1}}{R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}}\right)+\dfrac{1}{m^{2}}\dfrac{\left\lVert\nabla_{y}R_{\epsilon_{t}}\right\rVert^{2}}{\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)^{2}}\right)
=C¯(1m(c4ϵt5(Rϵt+d2,ϵt+δ)+c4ϵt5(d2,ϵt+δ)+c5(ϵt)+c1Rϵt+2d2,ϵt+δ+δ2)+1m2yRϵt2(Rϵt+2d2,ϵt+δ+δ2)2),\displaystyle=\overline{C}\left(\dfrac{1}{m}\left(\dfrac{-c_{4}\epsilon_{t}^{5}\left(R_{\epsilon_{t}}+d_{2,\epsilon_{t}}+\delta\right)+c_{4}\epsilon_{t}^{5}(d_{2,\epsilon_{t}}+\delta)+c_{5}(\epsilon_{t})+c_{1}}{R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}}\right)+\dfrac{1}{m^{2}}\dfrac{\left\lVert\nabla_{y}R_{\epsilon_{t}}\right\rVert^{2}}{\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)^{2}}\right),

where the last inequality follows from the Lyapunov property of RϵtR_{\epsilon_{t}} in Proposition 2.5. Now, we bound each of the two terms on the right hand side above. Using Rϵt+d2,ϵt0R_{\epsilon_{t}}+d_{2,\epsilon_{t}}\geqslant 0 and d2,ϵt+δ0d_{2,\epsilon_{t}}+\delta\geqslant 0, we observe that

c4ϵt5(Rϵt+d2,ϵt+δ)+c4ϵt5(d2,ϵt+δ)+c5(ϵt)+c1Rϵt+2d2,ϵt+δ+δ2\displaystyle\dfrac{-c_{4}\epsilon_{t}^{5}\left(R_{\epsilon_{t}}+d_{2,\epsilon_{t}}+\delta\right)+c_{4}\epsilon_{t}^{5}(d_{2,\epsilon_{t}}+\delta)+c_{5}(\epsilon_{t})+c_{1}}{R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}} c4ϵt5+c5(ϵt)d2,ϵt+δ+c1δ2,\displaystyle\leqslant c_{4}\epsilon_{t}^{5}+\dfrac{c_{5}(\epsilon_{t})}{d_{2,\epsilon_{t}}+\delta}+\dfrac{c_{1}}{\delta_{2}},
\displaystyle\leqslant c_{4}\epsilon_{0}^{5}+\mathcal{O}(\epsilon_{0}^{2})+\dfrac{c_{1}}{\delta_{2}},

where we use c5(ϵt)=𝒪(ϵt3ln(1/ϵt))c_{5}(\epsilon_{t})=\mathcal{O}(\epsilon_{t}^{3}\ln(1/\epsilon_{t})) as in Proposition 2.5 and d2,ϵt+δ=Ω(ϵtln(1/ϵt))d_{2,\epsilon_{t}}+\delta=\Omega(\epsilon_{t}\ln(1/\epsilon_{t})).

For the second term, using

yRϵt2=y+ϵt3x2=𝒪(ϵt3x2+y2),\left\lVert\nabla_{y}R_{\epsilon_{t}}\right\rVert^{2}=\left\lVert y+\epsilon_{t}^{3}x\right\rVert^{2}=\mathcal{O}(\epsilon_{t}^{3}\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2}),

and the lower bound of RϵtR_{\epsilon_{t}} in Lemma 2.2 again,

Rϵt+d2,ϵt=Ω(ϵtx2+y2),R_{\epsilon_{t}}+d_{2,\epsilon_{t}}=\Omega(\epsilon_{t}\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2}),

we see that, for some constant C¯2>0\overline{C}_{2}>0,

yRϵt2(Rϵt+2d2,ϵt+δ+δ2)2C¯2ϵt3x2+y2(ϵtx2+y2)2+δ22,\dfrac{\left\lVert\nabla_{y}R_{\epsilon_{t}}\right\rVert^{2}}{\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)^{2}}\leqslant\overline{C}_{2}\dfrac{\epsilon_{t}^{3}\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2}}{\left(\epsilon_{t}\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2}\right)^{2}+\delta_{2}^{2}},

which can clearly be bounded from above independently of tt.

2.7. Dissipation of the distorted entropy

Recall the notation in Section 2.2 that ht=dmtdπϵth_{t}=\frac{dm_{t}}{d\pi_{\epsilon_{t}}} is the density of the process at time tt with respect to the distribution πϵt\pi_{\epsilon_{t}}. For non-negative h:2d+h:\mathbb{R}^{2d}\to\mathbb{R}^{+}, we define the functionals

Φ0(h)\displaystyle\Phi_{0}(h) :=hlnh,\displaystyle:=h\ln h,
Φ1(h)\displaystyle\Phi_{1}(h) :=xh+yh2h,\displaystyle:=\dfrac{\left\lVert\nabla_{x}h+\nabla_{y}h\right\rVert^{2}}{h},

and we recall that in Lemma 2.5 we introduced

γ(ϵt)\displaystyle\gamma(\epsilon_{t}) =12+2(ϵtxHϵt+1+1ϵt)2,\displaystyle=\dfrac{1}{2}+2\left(\epsilon_{t}\left\lVert\nabla_{x}H_{\epsilon_{t}}\right\rVert_{\infty}+1+\dfrac{1}{\epsilon_{t}}\right)^{2},
Ψϵt(h)\displaystyle\Psi_{\epsilon_{t}}(h) =Φ1(h)+γ(ϵt)Φ0(h).\displaystyle=\Phi_{1}(h)+\gamma(\epsilon_{t})\Phi_{0}(h).

With these notations in mind we define the distorted entropy as, for t0t\geqslant 0,

(2.19) 𝐇(t):=Ψϵt(ht)𝑑πϵt=Φ1(ht)+γ(ϵt)Φ0(ht)dπϵt=xht+yht2ht𝑑πϵt+γ(ϵt)Entπϵt(ht).\displaystyle\mathbf{H}(t):=\int\Psi_{\epsilon_{t}}(h_{t})\,d\pi_{\epsilon_{t}}=\int\Phi_{1}(h_{t})+\gamma(\epsilon_{t})\Phi_{0}(h_{t})\,d\pi_{\epsilon_{t}}=\int\dfrac{\left\lVert\nabla_{x}h_{t}+\nabla_{y}h_{t}\right\rVert^{2}}{h_{t}}\,d\pi_{\epsilon_{t}}+\gamma(\epsilon_{t})\mathrm{Ent}_{\pi_{\epsilon_{t}}}(h_{t}).

The distorted entropy 𝐇\mathbf{H} is not to be confused with HϵH_{\epsilon} introduced in (1.7). In order to compute the time derivative of the distorted entropy 𝐇\mathbf{H} and pass the derivative into the integral, we first work with its truncated version, defined, for mm\in\mathbb{N}, by

(2.20) 𝐇ηm,ϵt(t):=ηm,ϵt(Φ1(ht)+γ(ϵt)Φ0(ht))𝑑πϵt,\displaystyle\mathbf{H}_{\eta_{m},\epsilon_{t}}(t):=\int\eta_{m,\epsilon_{t}}\left(\Phi_{1}(h_{t})+\gamma(\epsilon_{t})\Phi_{0}(h_{t})\right)\,d\pi_{\epsilon_{t}},

where we recall ηm,ϵt\eta_{m,\epsilon_{t}} is the truncation function introduced in (2.18). Taking the time derivative of (2.20) and using Lemma 2.6, we have

(2.21) ddt𝐇ηm,ϵt(t)=ηm,ϵtddt(Ψϵt(ht))𝑑πϵt+ϵtddϵtηm,ϵtΨϵt(ht)πϵt𝑑x𝑑y.\displaystyle\dfrac{d}{dt}\mathbf{H}_{\eta_{m},\epsilon_{t}}(t)=\int\eta_{m,\epsilon_{t}}\dfrac{d}{dt}\left(\Psi_{\epsilon_{t}}(h_{t})\right)\,d\pi_{\epsilon_{t}}+\epsilon^{\prime}_{t}\int\dfrac{d}{d\epsilon_{t}}\eta_{m,\epsilon_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t}}\,dxdy.

We bound each of the two terms on the right-hand side above separately.

For the first term on the right hand side of (2.21), we have the following upper bound:

Lemma 2.7.

For the same constant C>0C>0 that appears in Lemma 2.6,

ηm,ϵtddt(Ψϵt(ht))𝑑πϵtηm,ϵtht2ht𝑑πϵt+CmΨϵt(ht)+γ(ϵt)e1dπϵt.\int\eta_{m,\epsilon_{t}}\dfrac{d}{dt}\left(\Psi_{\epsilon_{t}}(h_{t})\right)\,d\pi_{\epsilon_{t}}\leqslant-\int\eta_{m,\epsilon_{t}}\dfrac{\left\lVert\nabla h_{t}\right\rVert^{2}}{h_{t}}\,d\pi_{\epsilon_{t}}+\dfrac{C}{m}\int\Psi_{\epsilon_{t}}(h_{t})+\gamma(\epsilon_{t})e^{-1}\,d\pi_{\epsilon_{t}}.

[Proof. ]

ηm,ϵtddt(Ψϵt(ht))𝑑πϵt\displaystyle\int\eta_{m,\epsilon_{t}}\dfrac{d}{dt}\left(\Psi_{\epsilon_{t}}(h_{t})\right)\,d\pi_{\epsilon_{t}} =ηm,ϵtDhtΨϵt(ht)thtdπϵt\displaystyle=\int\eta_{m,\epsilon_{t}}D_{h_{t}}\Psi_{\epsilon_{t}}(h_{t})\cdot\partial_{t}h_{t}\,d\pi_{\epsilon_{t}}
=ηm,ϵtDhtΨϵt(ht)tmtdxdy\displaystyle=\int\eta_{m,\epsilon_{t}}D_{h_{t}}\Psi_{\epsilon_{t}}(h_{t})\cdot\partial_{t}m_{t}\,dxdy
=ηm,ϵtDhtΨϵt(ht)Lϵtht𝑑πϵt\displaystyle=\int\eta_{m,\epsilon_{t}}D_{h_{t}}\Psi_{\epsilon_{t}}(h_{t})\cdot L_{\epsilon_{t}}^{*}h_{t}\,d\pi_{\epsilon_{t}}
=ηm,ϵt2ΓLϵt,Ψϵt(ht)𝑑πϵt+ηm,ϵtLϵt(Ψϵt(ht))𝑑πϵt\displaystyle=-\int\eta_{m,\epsilon_{t}}2\Gamma_{L_{\epsilon_{t}}^{*},\Psi_{\epsilon_{t}}}(h_{t})\,d\pi_{\epsilon_{t}}+\int\eta_{m,\epsilon_{t}}L_{\epsilon_{t}}^{*}\left(\Psi_{\epsilon_{t}}(h_{t})\right)\,d\pi_{\epsilon_{t}}
=ηm,ϵt2ΓLϵt,Ψϵt(ht)𝑑πϵt+Lϵtηm,ϵt(Ψϵt(ht)+γ(ϵt)e1)𝑑πϵt\displaystyle=-\int\eta_{m,\epsilon_{t}}2\Gamma_{L_{\epsilon_{t}}^{*},\Psi_{\epsilon_{t}}}(h_{t})\,d\pi_{\epsilon_{t}}+\int L_{\epsilon_{t}}\eta_{m,\epsilon_{t}}\left(\Psi_{\epsilon_{t}}(h_{t})+\gamma(\epsilon_{t})e^{-1}\right)\,d\pi_{\epsilon_{t}}
ηm,ϵtht2ht𝑑πϵt+CmΨϵt(ht)+γ(ϵt)e1dπϵt,\displaystyle\leqslant-\int\eta_{m,\epsilon_{t}}\dfrac{\left\lVert\nabla h_{t}\right\rVert^{2}}{h_{t}}\,d\pi_{\epsilon_{t}}+\dfrac{C}{m}\int\Psi_{\epsilon_{t}}(h_{t})+\gamma(\epsilon_{t})e^{-1}\,d\pi_{\epsilon_{t}},

where the third equality follows from (Chak et al., 2020, discussion below (C.20)) and the fourth equality comes from classical Gamma calculus as in (2.15). For the fifth equality, we add γ(ϵt)e1\gamma(\epsilon_{t})e^{-1} to ensure that Ψϵt(ht)+γ(ϵt)e10\Psi_{\epsilon_{t}}(h_{t})+\gamma(\epsilon_{t})e^{-1}\geqslant 0. For the last inequality, we make use of Lemma 2.5 for the first term and Lemma 2.6 for the second term.

We proceed to consider the second term on the right hand side of (2.21). Using the chain rule, we have

ϵtddϵt(ηm,ϵtΨϵt(ht)πϵt)𝑑x𝑑y\displaystyle\epsilon^{\prime}_{t}\int\dfrac{d}{d\epsilon_{t}}\left(\eta_{m,\epsilon_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t}}\right)\,dxdy =ϵt(ddϵtηm,ϵt)Ψϵt(ht)πϵt𝑑x𝑑y+ϵtηm,ϵt(ddϵtΨϵt(ht)πϵt)𝑑x𝑑y\displaystyle=\epsilon^{\prime}_{t}\int\left(\dfrac{d}{d\epsilon_{t}}\eta_{m,\epsilon_{t}}\right)\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t}}\,dxdy+\epsilon^{\prime}_{t}\int\eta_{m,\epsilon_{t}}\left(\dfrac{d}{d\epsilon_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t}}\right)\,dxdy
(2.22) =𝒪(1m)+ϵtηm,ϵt(ddϵtΨϵt(ht)πϵt)𝑑x𝑑y,\displaystyle=\mathcal{O}\left(\dfrac{1}{m}\right)+\epsilon^{\prime}_{t}\int\eta_{m,\epsilon_{t}}\left(\dfrac{d}{d\epsilon_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t}}\right)\,dxdy,

where the second equality follows from the proof of Lemma 2.6. Now, as in the proof of (Monmarché, 2018, Lemma 15), we compute that

\dfrac{\partial}{\partial\epsilon_{t}}\ln\pi_{\epsilon_{t}}(x,y)=-\dfrac{\partial}{\partial\epsilon_{t}}H_{\epsilon_{t}}+\dfrac{1}{\epsilon_{t}^{2}}\left\lVert y\right\rVert^{2}-\int\left(-\dfrac{\partial}{\partial\epsilon_{t}}H_{\epsilon_{t}}+\dfrac{\left\lVert y\right\rVert^{2}}{\epsilon_{t}^{2}}\right)\,d\pi_{\epsilon_{t}}.

Using the quadratic growth assumption on UU in Assumption 1.1, there exists a subexponential function ξ1\xi_{1} such that

(2.23) |ϵtlnπϵt(x,y)|+ϵtlnπϵt(x,y)2ξ1(ϵt)(1+x2+y2).\displaystyle|\partial_{\epsilon_{t}}\ln\pi_{\epsilon_{t}}(x,y)|+\left\lVert\nabla\partial_{\epsilon_{t}}\ln\pi_{\epsilon_{t}}(x,y)\right\rVert^{2}\leqslant\xi_{1}(\epsilon_{t})(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2}).
Lemma 2.8.

There exists a subexponential function ξ(ϵt)\xi(\epsilon_{t}) such that

ϵtηm,ϵt(ddϵtΨϵt(ht)πϵt)𝑑x𝑑y|ϵt|ξ(ϵt)(ηm,ϵtΨϵt(ht)𝑑πϵt+(1+x2+y2)mt(x,y)𝑑x𝑑y).\epsilon^{\prime}_{t}\int\eta_{m,\epsilon_{t}}\left(\dfrac{d}{d\epsilon_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t}}\right)\,dxdy\leqslant|\epsilon^{\prime}_{t}|\xi(\epsilon_{t})\left(\int\eta_{m,\epsilon_{t}}\Psi_{\epsilon_{t}}(h_{t})\,d\pi_{\epsilon_{t}}+\int(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})m_{t}(x,y)\,dxdy\right).

[Proof. ]

ηm,ϵtϵt(γ(ϵt)Φ0(ht)πϵt)dxdy\displaystyle\int\eta_{m,\epsilon_{t}}\partial_{\epsilon_{t}}\left(\gamma(\epsilon_{t})\Phi_{0}(h_{t})\pi_{\epsilon_{t}}\right)\,dxdy =ηm,ϵtγ(ϵt)(ϵtlnπϵt)mt(x,y)𝑑x𝑑y+ηm,ϵtγ(ϵt)Φ0(ht)𝑑πϵt\displaystyle=\int\eta_{m,\epsilon_{t}}\gamma(\epsilon_{t})\left(-\partial_{\epsilon_{t}}\ln\pi_{\epsilon_{t}}\right)m_{t}(x,y)\,dxdy+\int\eta_{m,\epsilon_{t}}\gamma^{\prime}(\epsilon_{t})\Phi_{0}(h_{t})d\pi_{\epsilon_{t}}
ηm,ϵtγ(ϵt)ξ1(ϵt)(1+x2+y2)mt(x,y)𝑑x𝑑y\displaystyle\leqslant\int\eta_{m,\epsilon_{t}}\gamma(\epsilon_{t})\xi_{1}(\epsilon_{t})(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})m_{t}(x,y)\,dxdy
+|γ(ϵt)|(ηm,ϵtΦ0(ht)𝑑πϵt+2e),\displaystyle\quad+|\gamma^{\prime}(\epsilon_{t})|\left(\int\eta_{m,\epsilon_{t}}\Phi_{0}(h_{t})d\pi_{\epsilon_{t}}+\dfrac{2}{e}\right),

where we use Φ0(ht)πϵt=ln(mtπϵt)mt\Phi_{0}(h_{t})\pi_{\epsilon_{t}}=\ln\left(\frac{m_{t}}{\pi_{\epsilon_{t}}}\right)m_{t} in the equality, and in the inequality we utilize |Φ0(h)|Φ0(h)+2e|\Phi_{0}(h)|\leqslant\Phi_{0}(h)+\frac{2}{e} as well as (2.23). Next, for matrix M1M_{1} since Φ1(ht)πϵt=M1lnmtπϵt2mt\Phi_{1}(h_{t})\pi_{\epsilon_{t}}=\left\lVert M_{1}\nabla\ln\frac{m_{t}}{\pi_{\epsilon_{t}}}\right\rVert^{2}m_{t}, we consider

ηm,ϵtϵt(Φ1(ht)πϵt)dxdy\displaystyle\int\eta_{m,\epsilon_{t}}\partial_{\epsilon_{t}}\left(\Phi_{1}(h_{t})\pi_{\epsilon_{t}}\right)\,dxdy =2ηm,ϵtM1ln(mtπϵt)M1ϵtlnπϵtmtdxdy\displaystyle=-2\int\eta_{m,\epsilon_{t}}M_{1}\nabla\ln\left(\dfrac{m_{t}}{\pi_{\epsilon_{t}}}\right)\cdot M_{1}\nabla\partial_{\epsilon_{t}}\ln\pi_{\epsilon_{t}}m_{t}\,dxdy
ηm,ϵtΦ1(ht)𝑑πϵt+2ξ1(ϵt)(1+x2+y2)mt(x,y)𝑑x𝑑y,\displaystyle\leqslant\int\eta_{m,\epsilon_{t}}\Phi_{1}(h_{t})\,d\pi_{\epsilon_{t}}+2\xi_{1}(\epsilon_{t})\int(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})m_{t}(x,y)\,dxdy,

where the inequality follows again from (2.23). The desired result follows from taking

ξ(ϵt)=1+2γ(ϵt)ξ1(ϵt)+|γ(ϵt)|γ(ϵt).\xi(\epsilon_{t})=1+2\gamma(\epsilon_{t})\xi_{1}(\epsilon_{t})+\dfrac{|\gamma^{\prime}(\epsilon_{t})|}{\gamma(\epsilon_{t})}.

We write the Fisher information at time tt to be

I(t)=ht2ht𝑑πϵt.I(t)=\int\dfrac{\left\lVert\nabla h_{t}\right\rVert^{2}}{h_{t}}\,d\pi_{\epsilon_{t}}.

Collecting the results of Lemma 2.7, (2.22) and Lemma 2.8 and substituting these back into (2.21), we obtain the following:

Proposition 2.8.
𝐇(t)I(t)+|ϵt|ξ(ϵt)(𝐇(t)+𝔼(1+Xt2+Yt2)).\mathbf{H}^{\prime}(t)\leqslant-I(t)+|\epsilon_{t}^{\prime}|\xi(\epsilon_{t})(\mathbf{H}(t)+\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)).

[Proof. ]First, substituting the results of Lemma 2.7, (2.22) and Lemma 2.8 back into (2.21), we obtain

ddt𝐇ηm,ϵt(t)\displaystyle\dfrac{d}{dt}\mathbf{H}_{\eta_{m},\epsilon_{t}}(t) ηm,ϵtht2ht𝑑πϵt+𝒪(1m)\displaystyle\leqslant-\int\eta_{m,\epsilon_{t}}\dfrac{\left\lVert\nabla h_{t}\right\rVert^{2}}{h_{t}}\,d\pi_{\epsilon_{t}}+\mathcal{O}\left(\dfrac{1}{m}\right)
+|ϵt|ξ(ϵt)(ηm,ϵtΨϵt(ht)𝑑πϵt+(1+x2+y2)mt(x,y)𝑑x𝑑y).\displaystyle\quad+|\epsilon^{\prime}_{t}|\xi(\epsilon_{t})\left(\int\eta_{m,\epsilon_{t}}\Psi_{\epsilon_{t}}(h_{t})\,d\pi_{\epsilon_{t}}+\int(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})m_{t}(x,y)\,dxdy\right).

Integrating from ss to tt leads to

𝐇ηm,ϵt(t)𝐇ηm,ϵs(s)\displaystyle\mathbf{H}_{\eta_{m},\epsilon_{t}}(t)-\mathbf{H}_{\eta_{m},\epsilon_{s}}(s) stηm,ϵuhu2hu𝑑πϵu𝑑u+st𝒪(1m)𝑑u\displaystyle\leqslant-\int_{s}^{t}\int\eta_{m,\epsilon_{u}}\dfrac{\left\lVert\nabla h_{u}\right\rVert^{2}}{h_{u}}\,d\pi_{\epsilon_{u}}du+\int_{s}^{t}\mathcal{O}\left(\dfrac{1}{m}\right)\,du
+st|ϵu|ξ(ϵu)(ηm,ϵuΨϵu(hu)𝑑πϵu+(1+x2+y2)mu(x,y)𝑑x𝑑y)𝑑u.\displaystyle\quad+\int_{s}^{t}|\epsilon^{\prime}_{u}|\xi(\epsilon_{u})\left(\int\eta_{m,\epsilon_{u}}\Psi_{\epsilon_{u}}(h_{u})\,d\pi_{\epsilon_{u}}+\int(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})m_{u}(x,y)\,dxdy\right)\,du.

The desired result follows by letting mm\to\infty and using Fatou’s lemma and Lemma 2.6.

Finally, the objective of this section is to prove the following bound on the distorted entropy:

Proposition 2.9.

Under Assumption 1.1, for any α>0\alpha>0 there exists a constant B>0B>0 such that, for large enough tt,

𝐇(t)B(1t)1cEα.\mathbf{H}(t)\leqslant B\left(\dfrac{1}{t}\right)^{1-\frac{c_{*}}{E}-\alpha}.

[Proof. ]The proof mimics that of (Monmarché, 2018, Lemma 19), except that we have an improved estimate of the log-Sobolev constant in our setting. More precisely, using the log-Sobolev inequality in Proposition 2.3 and Proposition 2.8, we have

𝐇(t)\displaystyle\mathbf{H}^{\prime}(t) 1pδ1(1/ϵt)ecϵt𝐇(t)+|ϵt|ξ(ϵt)(𝐇(t)+𝔼(1+Xt2+Yt2))\displaystyle\leqslant-\dfrac{1}{p_{\delta_{1}}(1/\epsilon_{t})}e^{-\frac{c_{*}}{\epsilon_{t}}}\mathbf{H}(t)+|\epsilon_{t}^{\prime}|\xi(\epsilon_{t})(\mathbf{H}(t)+\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right))
(2.24) =(1pδ1(1/ϵt)ecϵt+|ϵt|ξ(ϵt))𝐇(t)+|ϵt|ξ(ϵt)𝔼(1+Xt2+Yt2)\displaystyle=\left(-\dfrac{1}{p_{\delta_{1}}(1/\epsilon_{t})}e^{-\frac{c_{*}}{\epsilon_{t}}}+|\epsilon_{t}^{\prime}|\xi(\epsilon_{t})\right)\mathbf{H}(t)+|\epsilon_{t}^{\prime}|\xi(\epsilon_{t})\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)

where we recall ξ(ϵt)\xi(\epsilon_{t}) is a subexponential function, and pδ1(1/ϵt)p_{\delta_{1}}(1/\epsilon_{t}) is first introduced in Proposition 2.3.

First, we handle the second term in (2.24). Using |ϵt|=𝒪(1/t)|\epsilon_{t}^{\prime}|=\mathcal{O}(1/t), Proposition 2.4 and (lnt)p=𝒪(tα)(\ln t)^{p}=\mathcal{O}(t^{\alpha}) for large enough tt and any α,p>0\alpha,p>0 leads to

(2.25) |ϵt|ξ(ϵt)𝔼(1+Xt2+Yt2)=𝒪(1t1α).\displaystyle|\epsilon_{t}^{\prime}|\xi(\epsilon_{t})\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)=\mathcal{O}\left(\dfrac{1}{t^{1-\alpha}}\right).

Next, we consider the first term in (2.24). Using the fact that pδ1(1/ϵt)p_{\delta_{1}}(1/\epsilon_{t}) is polynomial in 1/ϵt1/\epsilon_{t} leads to

(2.26) |ϵt|ξ(ϵt)\displaystyle|\epsilon_{t}^{\prime}|\xi(\epsilon_{t}) =𝒪(1t1α),\displaystyle=\mathcal{O}\left(\dfrac{1}{t^{1-\alpha}}\right),
(2.27) 1pδ1(1/ϵt)ecϵt\displaystyle\dfrac{1}{p_{\delta_{1}}(1/\epsilon_{t})}e^{-\frac{c_{*}}{\epsilon_{t}}} =Ω(1t)cE+α.\displaystyle=\Omega\left(\dfrac{1}{t}\right)^{\frac{c_{*}}{E}+\alpha}.

Collecting the results of (2.25), (2.26) and (2.27) and putting these back into (2.24), if we choose α\alpha small enough, there exist constants c1,c2>0c_{1},c_{2}>0 such that

𝐇(t)c1(1t)cE+α𝐇(t)+c2(1t)1α.\mathbf{H}^{\prime}(t)\leqslant-c_{1}\left(\dfrac{1}{t}\right)^{\frac{c_{*}}{E}+\alpha}\mathbf{H}(t)+c_{2}\left(\dfrac{1}{t}\right)^{1-\alpha}.

The rest of the proof follows exactly that of (Monmarché, 2018, Lemma 19) (with EE_{*} therein replaced by our cc_{*}), which further relies on the estimate obtained in (Miclo, 1992, Lemma 6).

2.8. Wrapping up the proof

In this subsection, we finish off the proof of Theorem 1.1 by using the auxiliary results obtained in previous sections.

[Proof of Theorem 1.1. ] We follow the same proof as in Monmarché (2018). Let (Xtπ,Ytπ)(X_{t}^{\pi},Y_{t}^{\pi}) be a random vector with law πϵt\pi_{\epsilon_{t}}. For any δ>0\delta>0, we have

(U(Xt)>Umin+δ)(U(Xtπ)>Umin+δ)+ht1L1(πϵt).\mathbb{P}\left(U\left(X_{t}\right)>U_{min}+\delta\right)\leqslant\mathbb{P}\left(U\left(X_{t}^{\pi}\right)>U_{min}+\delta\right)+\left\lVert h_{t}-1\right\rVert_{L^{1}(\pi_{\epsilon_{t}})}.

Using Pinsker’s inequality,

ht1L1(πϵt)2Entπϵt(ht)2𝐇(t).\left\|h_{t}-1\right\|_{L^{1}(\pi_{\epsilon_{t}})}\leqslant\sqrt{2\mathrm{Ent}_{\pi_{\epsilon_{t}}}\left(h_{t}\right)}\leqslant\sqrt{2\mathbf{H}(t)}.

Together with Proposition 2.9 and Proposition 2.1, this yields, for constants A2,D2>0A_{2},D_{2}>0,

(U(Xt)>Umin+δ)D2eδ2ϵt+A2(1t)1cEα2A(1t)min{1cEα2,δ2E}.\mathbb{P}\left(U\left(X_{t}\right)>U_{min}+\delta\right)\leqslant D_{2}e^{-\frac{\delta}{2\epsilon_{t}}}+A_{2}\left(\dfrac{1}{t}\right)^{\frac{1-\frac{c_{*}}{E}-\alpha}{2}}\leqslant A\left(\dfrac{1}{t}\right)^{\min\bigg{\{}\frac{1-\frac{c_{*}}{E}-\alpha}{2},\frac{\delta}{2E}\bigg{\}}}.
Remark 2.2.

In the adaptive setting as described in Section 3.1 and Section 3, thanks to Proposition 3.1 we have exactly the same result with the same proof (with EE replaced by EtE_{t} and c=c(U,c,δ1)c_{*}=c_{*}(U,c,\delta_{1}) here replaced by c(U,ct,δ1)c_{*}(U,c_{t},\delta_{1})).

3. An adaptive algorithm (IAKSA) and its convergence

Throughout this section, we assume that U(x)0=UminU(x)\geqslant 0=U_{min} for ease of calculation.

Recall that in IKSA, introduced in Theorem 1.1, the performance of the diffusion depends on the parameter cc. In this section, we propose an adaptive method to tune both the parameter c=(ct)t0c=(c_{t})_{t\geqslant 0} and the energy level E=(Et)t0E=(E_{t})_{t\geqslant 0}. First, we rewrite the dynamics of (Xt,Yt)t0(X_{t},Y_{t})_{t\geqslant 0} to emphasize the dependence on cc and EE:

dXt\displaystyle dX_{t} =Ytdt,\displaystyle=Y_{t}\,dt,
dYt\displaystyle dY_{t} =1ϵtYtdtϵtxHϵt,ct(Xt)dt+2dBt,\displaystyle=-\dfrac{1}{\epsilon_{t}}Y_{t}\,dt-\epsilon_{t}\nabla_{x}H_{\epsilon_{t},c_{t}}(X_{t})\,dt+\sqrt{2}\,dB_{t},

where Hϵt,ctH_{\epsilon_{t},c_{t}} is introduced in (1.7). The parameter cc and the cooling schedule are tuned adaptively using the running minimum (t:=minvtU(Xv))t0(\mathcal{M}_{t}:=\min_{v\leqslant t}U(X_{v}))_{t\geqslant 0}. Recall that for m,nm,n\in\mathbb{N}, φm\varphi_{m} is a mollifier introduced in Section 2.6. In the following we shall use the mollifier φ1n\varphi_{\frac{1}{n}} with support on (1n,1n)(-\frac{1}{n},\frac{1}{n}). We tune the parameter (ct)t0(c_{t})_{t\geqslant 0} adaptively by setting

(3.1) ct:=φ1n(1n)+(t),\displaystyle c_{t}:=\varphi_{\frac{1}{n}}\star\mathcal{M}_{\left(\cdot-\frac{1}{n}\right)_{+}}(t),

and hence, according to the definition of cc_{*} in (2.3), for arbitrary δ1>0\delta_{1}>0 an upper bound of cc_{*} is given by

c=c(U,ct,δ1)(t2n)+Umin+δ1=(t2n)++δ1.c_{*}=c_{*}(U,c_{t},\delta_{1})\leqslant\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}-U_{min}+\delta_{1}=\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{1}.

Let δ2>δ1\delta_{2}>\delta_{1}. Now, the energy level (Et)t0(E_{t})_{t\geqslant 0} that appears in the numerator of the cooling schedule is also tuned adaptively:

(3.2) Et:=φ1n(3n)+(t)Umin+δ2(t2n)++δ2>(t2n)++δ1.\displaystyle E_{t}:=\varphi_{\frac{1}{n}}\star\mathcal{M}_{\left(\cdot-\frac{3}{n}\right)_{+}}(t)-U_{min}+\delta_{2}\geqslant\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{2}>\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{1}.

As we shall see in the proof, the primary reason for mollifying t\mathcal{M}_{t} is that it allows us to differentiate ctc_{t} and EtE_{t} with respect to time tt; the particular choice of the mollifier φ1n\varphi_{\frac{1}{n}} is otherwise arbitrary. This differentiability is essential in analyzing the dissipation of the distorted entropy: for instance, we need to consider the time derivative of the cooling schedule, and as such we have to ensure that EtE_{t} is differentiable.
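
To illustrate (3.1) and (3.2) concretely, the following is a minimal numerical sketch of the mollified running-minimum schedules. Only the mollifier and the convolution formulas defining c_{t} and E_{t} come from the definitions above; the quadrature resolution, the vectorized running_min callable and the example parameter values are illustrative assumptions.

    import numpy as np

    # Standard mollifier phi supported on (-1, 1), normalised to unit mass;
    # the normalising constant is computed once by a simple Riemann sum.
    _x = np.linspace(-1.0, 1.0, 4001)[1:-1]
    _Z = float(np.sum(np.exp(1.0 / (_x ** 2 - 1.0))) * (_x[1] - _x[0]))

    def phi(x):
        out = np.zeros_like(x, dtype=float)
        inside = np.abs(x) < 1.0
        out[inside] = np.exp(1.0 / (x[inside] ** 2 - 1.0)) / _Z
        return out

    def phi_eps(x, eps):
        # phi_{1/n} with eps = 1/n: support (-eps, eps), unit mass.
        return phi(x / eps) / eps

    def adaptive_c_E(t, running_min, n, delta2, U_min=0.0):
        """Quadrature approximation of c_t in (3.1) and E_t in (3.2).

        running_min(u) should return M_u = min over v <= u of U(X_v) and accept
        NumPy arrays; in practice it is built from the stored trajectory."""
        eps = 1.0 / n
        s = np.linspace(-eps, eps, 401)      # nodes for the convolution integral
        w, ds = phi_eps(s, eps), s[1] - s[0]
        c_t = float(np.sum(w * running_min(np.maximum(t - s - 1.0 / n, 0.0))) * ds)
        E_t = float(np.sum(w * running_min(np.maximum(t - s - 3.0 / n, 0.0))) * ds) - U_min + delta2
        return c_t, E_t                      # the cooling schedule is then eps_t = E_t / ln(t)

    # Example with a toy, non-increasing running minimum (purely illustrative).
    c_t, E_t = adaptive_c_E(t=10.0, running_min=lambda u: 1.0 / (1.0 + u), n=5, delta2=0.2)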

For the convenience of readers, we restate Theorem 1.2 below:

Theorem 3.1.

Under Assumption 1.1, consider the kinetic dynamics (Xt,Yt)t0(X_{t},Y_{t})_{t\geqslant 0} described by

dXt\displaystyle dX_{t} =Ytdt,\displaystyle=Y_{t}\,dt,
dYt\displaystyle dY_{t} =1ϵtYtdtϵtxHϵt,ct(Xt)dt+2dBt,\displaystyle=-\dfrac{1}{\epsilon_{t}}Y_{t}\,dt-\epsilon_{t}\nabla_{x}H_{\epsilon_{t},c_{t}}(X_{t})\,dt+\sqrt{2}\,dB_{t},

where Hϵt,ctH_{\epsilon_{t},c_{t}} is introduced in (1.7), ctc_{t} is tuned adaptively according to (3.1) and the cooling schedule is

ϵt=Etlnt\epsilon_{t}=\dfrac{E_{t}}{\ln t}

with EtE_{t} satisfying (3.2). Given δ>0\delta>0, select δ1,δ2>0\delta_{1},\delta_{2}>0 such that 0<δ2δ1<δ0<\delta_{2}-\delta_{1}<\delta and a sufficiently small α\alpha such that α(0,δ2δ1U(X0)Umin+δ2)\alpha\in(0,\frac{\delta_{2}-\delta_{1}}{U(X_{0})-U_{min}+\delta_{2}}). Then, for large enough tt and a constant A>0A>0, we have

(U(Xt)>Umin+δ)A(1t)a,\displaystyle\mathbb{P}\left(U(X_{t})>U_{min}+\delta\right)\leqslant A\left(\dfrac{1}{t}\right)^{a},

where

a:=min{δ2δ1δ+δ2α2,δ2δ1U(X0)Umin+δ2α2}.a:=\min\bigg{\{}\frac{\frac{\delta_{2}-\delta_{1}}{\delta+\delta_{2}}-\alpha}{2},\frac{\frac{\delta_{2}-\delta_{1}}{U(X_{0})-U_{min}+\delta_{2}}-\alpha}{2}\bigg{\}}.
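
For concreteness, the following minimal sketch evaluates the exponent a in the display above for one set of illustrative parameter values; the numbers are assumptions chosen only to satisfy the constraints of the theorem and are not taken from the paper.

    def rate_exponent(delta, delta1, delta2, alpha, U_x0, U_min=0.0):
        # a = min{ ((d2-d1)/(delta+d2) - alpha)/2 , ((d2-d1)/(U(X_0)-U_min+d2) - alpha)/2 }
        gap = delta2 - delta1
        return min((gap / (delta + delta2) - alpha) / 2.0,
                   (gap / (U_x0 - U_min + delta2) - alpha) / 2.0)

    # Example: delta = 1.0, delta1 = 0.1, delta2 = 0.5, alpha = 0.01, U(X_0) = 5.0
    # gives a value of roughly 0.031, i.e. a polynomial rate of order t^(-0.031).
    a = rate_exponent(delta=1.0, delta1=0.1, delta2=0.5, alpha=0.01, U_x0=5.0)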

The rest of this section is devoted to the proof of Theorem 1.2.

3.1. Dissipation of the distorted entropy in the adaptive setting

In this subsection, we present some auxiliary results which will be used in Section 3.2, the proof of Theorem 1.2. In a nutshell, we show that under appropriate assumptions on ctc_{t}^{\prime} and EtE_{t}^{\prime}, key results such as Proposition 2.9 also hold in this adaptive setting:

Proposition 3.1.

Under Assumption 1.1, suppose further that the parameter cc is tuned adaptively by (ct)t0(c_{t})_{t\geqslant 0} and the energy level is (Et)t0(E_{t})_{t\geqslant 0}, which are non-increasing with respect to time. We write c=c(U,ct,δ1)c_{*}=c_{*}(U,c_{t},\delta_{1}) as in (2.3), and we assume that

|ct|=𝒪(1t),|Et|=𝒪(1t),Et=Ω(1).\displaystyle|c_{t}^{\prime}|=\mathcal{O}\left(\dfrac{1}{t}\right),\quad|E_{t}^{\prime}|=\mathcal{O}\left(\dfrac{1}{t}\right),\quad E_{t}=\Omega(1).

For sufficiently small α\alpha with α(0,1cEt)\alpha\in(0,1-\frac{c_{*}}{E_{t}}) for all t0t\geqslant 0, there exists a constant B>0B>0 such that, for large enough tt,

𝐇(t)B(1t)1cEtα.\mathbf{H}(t)\leqslant B\left(\dfrac{1}{t}\right)^{1-\frac{c_{*}}{E_{t}}-\alpha}.

Note that both cc_{*} and EtE_{t} are possibly time-dependent.

Remark 3.1.

In Section 3.2, we show that such a choice of α(0,1cEt)\alpha\in(0,1-\frac{c_{*}}{E_{t}}) is possible.

The rest of this subsection is devoted to the proof of Proposition 3.1.

3.1.1. Auxiliary results

In this subsection, we present four auxiliary results which will be used in the proof of Proposition 3.1, namely Lemma 3.1, Proposition 3.2, Proposition 3.3 and Proposition 3.4.

First, to emphasize the dependence of various quantities on ctc_{t}, the truncated distorted entropy 𝐇\mathbf{H}, compared with (2.20), is now written as

𝐇ηm,ϵt,ct(t):=ηm,ϵt,ct(Φ1(ht)+γ(ϵt)Φ0(ht))𝑑πϵt,ct,\mathbf{H}_{\eta_{m},\epsilon_{t},c_{t}}(t):=\int\eta_{m,\epsilon_{t},c_{t}}\left(\Phi_{1}(h_{t})+\gamma(\epsilon_{t})\Phi_{0}(h_{t})\right)\,d\pi_{\epsilon_{t},c_{t}},

where we note that the stationary distribution πϵt,ct\pi_{\epsilon_{t},c_{t}} depends on ctc_{t} and ηm,ϵt,ct\eta_{m,\epsilon_{t},c_{t}} depends on ctc_{t} through Rϵt,ctR_{\epsilon_{t},c_{t}}. Taking the derivative with respect to tt gives

(3.3) ddt𝐇ηm,ϵt,ct(t)\displaystyle\dfrac{d}{dt}\mathbf{H}_{\eta_{m},\epsilon_{t},c_{t}}(t) =ηm,ϵt,ctddt(Ψϵt(ht))𝑑πϵt,ct+ϵtddϵtηm,ϵt,ctΨϵt(ht)πϵt,ct𝑑x𝑑y\displaystyle=\int\eta_{m,\epsilon_{t},c_{t}}\dfrac{d}{dt}\left(\Psi_{\epsilon_{t}}(h_{t})\right)\,d\pi_{\epsilon_{t},c_{t}}+\epsilon^{\prime}_{t}\int\dfrac{d}{d\epsilon_{t}}\eta_{m,\epsilon_{t},c_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t},c_{t}}\,dxdy
+ctddctηm,ϵt,ctΨϵt(ht)πϵt,ct𝑑x𝑑y.\displaystyle\quad+c^{\prime}_{t}\int\dfrac{d}{dc_{t}}\eta_{m,\epsilon_{t},c_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t},c_{t}}\,dxdy.

Comparing with (2.21), the extra term in (3.3) vanishes when ct=cc_{t}=c for all tt. The first two terms in (3.3) can be handled in exactly the same way as in Lemma 2.7, equation (2.22) and Lemma 2.8. We proceed to simplify the third term in (3.3). We note that

ctRϵt,ct=ϵtctHϵt,ct(x)\displaystyle\partial_{c_{t}}R_{\epsilon_{t},c_{t}}=\epsilon_{t}\partial_{c_{t}}H_{\epsilon_{t},c_{t}}(x) =ϵt(UminU(x)f(uct)+(f(uct)++ϵt)2𝑑uf(U(x)ct)+f(U(x)ct)++ϵt)\displaystyle=\epsilon_{t}\left(\int_{U_{min}}^{U(x)}\dfrac{f^{\prime}(u-c_{t})_{+}}{\left(f(u-c_{t})_{+}+\epsilon_{t}\right)^{2}}\,du-\dfrac{f^{\prime}(U(x)-c_{t})_{+}}{f(U(x)-c_{t})_{+}+\epsilon_{t}}\right)
(3.4) M5ϵt(M3+ctUmin)=:ξ5(ϵt),\displaystyle\leqslant\dfrac{M_{5}}{\epsilon_{t}}\left(M_{3}+c_{t}-U_{min}\right)=:\xi_{5}(\epsilon_{t}),
ctηm,ϵt,ct\displaystyle\partial_{c_{t}}\eta_{m,\epsilon_{t},c_{t}} =vm(ln(Rϵt,ct+2d2,ϵt+δ+δ2))1Rϵt,ct+2d2,ϵt+δ+δ2ctRϵt,ct\displaystyle=v_{m}^{\prime}\left(\ln\left(R_{\epsilon_{t},c_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)\right)\dfrac{1}{R_{\epsilon_{t},c_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}}\partial_{c_{t}}R_{\epsilon_{t},c_{t}}
=𝒪(1m),\displaystyle=\mathcal{O}\left(\dfrac{1}{m}\right),
|ctlnπϵt,ct(x,y)|\displaystyle|\partial_{c_{t}}\ln\pi_{\epsilon_{t},c_{t}}(x,y)| |ctHϵt,ct|+|ctHϵt,ct|𝑑πϵt,ct\displaystyle\leqslant|\partial_{c_{t}}H_{\epsilon_{t},c_{t}}|+\int|\partial_{c_{t}}H_{\epsilon_{t},c_{t}}|\,d\pi_{\epsilon_{t},c_{t}}
2M5ϵt2(M3+ctUmin)+2M5ϵt,\displaystyle\leqslant\dfrac{2M_{5}}{\epsilon_{t}^{2}}\left(M_{3}+c_{t}-U_{min}\right)+\dfrac{2M_{5}}{\epsilon_{t}},

where we use Assumption 1.1 in the first inequality. The above computation leads to

(3.5) |ctlnπϵt,ct(x,y)|+ctlnπϵt,ct(x,y)2\displaystyle|\partial_{c_{t}}\ln\pi_{\epsilon_{t},c_{t}}(x,y)|+\left\lVert\nabla\partial_{c_{t}}\ln\pi_{\epsilon_{t},c_{t}}(x,y)\right\rVert^{2} ξ2(ϵt)(1+x2),\displaystyle\leqslant\xi_{2}(\epsilon_{t})(1+\left\lVert x\right\rVert^{2}),

where ξ2(ϵt)\xi_{2}(\epsilon_{t}) is a subexponential function of ϵt\epsilon_{t}. Our first auxiliary result handles the third term in (3.3).

Lemma 3.1.

There exists a subexponential function ξ3(ϵt)\xi_{3}(\epsilon_{t}) such that

ctηm,ϵt,ct(ddctΨϵt(ht)πϵt,ct)𝑑x𝑑y|ct|ξ3(ϵt)(ηm,ϵt,ctΨϵt(ht)𝑑πϵt,ct+(1+x2)mt(x,y)𝑑x𝑑y).c^{\prime}_{t}\int\eta_{m,\epsilon_{t},c_{t}}\left(\dfrac{d}{dc_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t},c_{t}}\right)\,dxdy\leqslant|c^{\prime}_{t}|\xi_{3}(\epsilon_{t})\left(\int\eta_{m,\epsilon_{t},c_{t}}\Psi_{\epsilon_{t}}(h_{t})\,d\pi_{\epsilon_{t},c_{t}}+\int(1+\left\lVert x\right\rVert^{2})m_{t}(x,y)\,dxdy\right).

[Proof. ]

\displaystyle\int\eta_{m,\epsilon_{t},c_{t}}\partial_{c_{t}}\left(\gamma(\epsilon_{t})\Phi_{0}(h_{t})\pi_{\epsilon_{t},c_{t}}\right)\,dxdy=\int\eta_{m,\epsilon_{t},c_{t}}\gamma(\epsilon_{t})\left(-\partial_{c_{t}}\ln\pi_{\epsilon_{t},c_{t}}\right)m_{t}(x,y)\,dxdy
ηm,ϵt,ctγ(ϵt)ξ2(ϵt)(1+x2)mt(x,y)𝑑x𝑑y,\displaystyle\leqslant\int\eta_{m,\epsilon_{t},c_{t}}\gamma(\epsilon_{t})\xi_{2}(\epsilon_{t})(1+\left\lVert x\right\rVert^{2})m_{t}(x,y)\,dxdy,
+(ηm,ϵt,ctΦ0(ht)𝑑πϵt,ct+2e),\displaystyle\quad+\left(\int\eta_{m,\epsilon_{t},c_{t}}\Phi_{0}(h_{t})d\pi_{\epsilon_{t},c_{t}}+\dfrac{2}{e}\right),

where we use Φ0(ht)πϵt,ct=ln(mtπϵt,ct)mt\Phi_{0}(h_{t})\pi_{\epsilon_{t},c_{t}}=\ln\left(\frac{m_{t}}{\pi_{\epsilon_{t},c_{t}}}\right)m_{t} in the equality, and in the inequality we utilize |Φ0(h)|Φ0(h)+2e|\Phi_{0}(h)|\leqslant\Phi_{0}(h)+\frac{2}{e} as well as (3.5). Next, for matrix M1M_{1} since Φ1(ht)πϵt,ct=M1lnmtπϵt,ct2mt\Phi_{1}(h_{t})\pi_{\epsilon_{t},c_{t}}=\left\lVert M_{1}\nabla\ln\frac{m_{t}}{\pi_{\epsilon_{t},c_{t}}}\right\rVert^{2}m_{t}, we consider

ηm,ϵt,ctct(Φ1(ht)πϵt,ct)dxdy\displaystyle\int\eta_{m,\epsilon_{t},c_{t}}\partial_{c_{t}}\left(\Phi_{1}(h_{t})\pi_{\epsilon_{t},c_{t}}\right)\,dxdy =2ηm,ϵt,ctM1ln(mtπϵt,ct)M1ctlnπϵt,ctmtdxdy\displaystyle=-2\int\eta_{m,\epsilon_{t},c_{t}}M_{1}\nabla\ln\left(\dfrac{m_{t}}{\pi_{\epsilon_{t},c_{t}}}\right)\cdot M_{1}\nabla\partial_{c_{t}}\ln\pi_{\epsilon_{t},c_{t}}m_{t}\,dxdy
ηm,ϵt,ctΦ1(ht)𝑑πϵt,ct+2ξ2(ϵt)(1+x2)mt(x,y)𝑑x𝑑y,\displaystyle\leqslant\int\eta_{m,\epsilon_{t},c_{t}}\Phi_{1}(h_{t})\,d\pi_{\epsilon_{t},c_{t}}+2\xi_{2}(\epsilon_{t})\int(1+\left\lVert x\right\rVert^{2})m_{t}(x,y)\,dxdy,

where the inequality follows again from (3.5).

Next, we consider the time derivative of the distorted entropy. The result is essentially the same as Proposition 2.8 for the case of fixed cc, except that in the adaptive setting we have to introduce ctc_{t} and its time derivative:

Proposition 3.2.

There exists a subexponential function ξ4(ϵt)\xi_{4}(\epsilon_{t}) such that

𝐇(t)I(t)+(|ϵt|+|ct|)ξ4(ϵt)(𝐇(t)+𝔼(1+Xt2+Yt2)).\mathbf{H}^{\prime}(t)\leqslant-I(t)+\left(|\epsilon_{t}^{\prime}|+|c_{t}^{\prime}|\right)\xi_{4}(\epsilon_{t})(\mathbf{H}(t)+\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)).

[Proof. ]First, substituting the results of Lemma 2.7, (2.22), Lemma 2.8 and Lemma 3.1 back into (3.3), we obtain

ddt𝐇ηm,ϵt,ct(t)\displaystyle\dfrac{d}{dt}\mathbf{H}_{\eta_{m},\epsilon_{t},c_{t}}(t) ηm,ϵt,ctht2ht𝑑πϵt,ct+𝒪(1m)\displaystyle\leqslant-\int\eta_{m,\epsilon_{t},c_{t}}\dfrac{\left\lVert\nabla h_{t}\right\rVert^{2}}{h_{t}}\,d\pi_{\epsilon_{t},c_{t}}+\mathcal{O}\left(\dfrac{1}{m}\right)
+(|ϵt|+|ct|)ξ4(ϵt)(ηm,ϵt,ctΨϵt(ht)𝑑πϵt,ct+(1+x2+y2)mt(x,y)𝑑x𝑑y).\displaystyle\quad+\left(|\epsilon^{\prime}_{t}|+|c^{\prime}_{t}|\right)\xi_{4}(\epsilon_{t})\left(\int\eta_{m,\epsilon_{t},c_{t}}\Psi_{\epsilon_{t}}(h_{t})\,d\pi_{\epsilon_{t},c_{t}}+\int(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})m_{t}(x,y)\,dxdy\right).

Integrating from ss to tt leads to

𝐇ηm,ϵt,ct(t)\displaystyle\mathbf{H}_{\eta_{m},\epsilon_{t},c_{t}}(t) 𝐇ηm,ϵs,cs(s)stηm,ϵu,cuhu2hu𝑑πϵu,cu𝑑u+st𝒪(1m)𝑑u\displaystyle-\mathbf{H}_{\eta_{m},\epsilon_{s},c_{s}}(s)\leqslant-\int_{s}^{t}\int\eta_{m,\epsilon_{u},c_{u}}\dfrac{\left\lVert\nabla h_{u}\right\rVert^{2}}{h_{u}}\,d\pi_{\epsilon_{u},c_{u}}du+\int_{s}^{t}\mathcal{O}\left(\dfrac{1}{m}\right)\,du
\displaystyle\quad+\int_{s}^{t}\left(|\epsilon^{\prime}_{u}|+|c^{\prime}_{u}|\right)\xi_{4}(\epsilon_{u})\left(\int\eta_{m,\epsilon_{u},c_{u}}\Psi_{\epsilon_{u}}(h_{u})\,d\pi_{\epsilon_{u},c_{u}}+\int(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})m_{u}(x,y)\,dxdy\right)\,du.

The desired result follows by letting mm\to\infty and using Fatou’s lemma and Lemma 2.6.

Our next two results generalize Proposition 2.7 and Proposition 2.4 respectively to the adaptive setting.

Proposition 3.3.

Under the same assumptions as in Proposition 3.1, for any pp\in\mathbb{N}, α>0\alpha>0 and large enough tt, there exists a constant C~p,α\widetilde{C}_{p,\alpha} such that

𝔼(Rϵt,ctp(Xt,Yt))C~p,α(1+t)1+α.\mathbb{E}\left(R_{\epsilon_{t},c_{t}}^{p}(X_{t},Y_{t})\right)\leqslant\widetilde{C}_{p,\alpha}(1+t)^{1+\alpha}.

[Proof. ]We shall prove the result by induction on pp. We denote by nt,p:=𝔼(Rϵt,ctp(Xt,Yt))n_{t,p}:=\mathbb{E}\left(R_{\epsilon_{t},c_{t}}^{p}(X_{t},Y_{t})\right). When p=0p=0, the result clearly holds. When p=1p=1, using Propositions 2.5 and 2.6 we have

tnt,1\displaystyle\dfrac{\partial}{\partial t}n_{t,1} =(tct)ctnt,1+(tϵt)ϵtnt,1+s𝔼(Rϵt,ct(Xt+s,Yt+s))|s=0\displaystyle=\left(\partial_{t}c_{t}\right)\partial_{c_{t}}n_{t,1}+\left(\partial_{t}\epsilon_{t}\right)\partial_{\epsilon_{t}}n_{t,1}+\dfrac{\partial}{\partial s}\mathbb{E}\left(R_{\epsilon_{t},c_{t}}(X_{t+s},Y_{t+s})\right)\bigg{|}_{s=0}
|tct|ξ5(ϵt)+|tϵt|(C1,1(ϵt)nt,1+C1,2(ϵt))+𝔼(Lϵt(Rϵt)(Xt,Yt))\displaystyle\leqslant|\partial_{t}c_{t}|\xi_{5}(\epsilon_{t})+|\partial_{t}\epsilon_{t}|\left(C_{1,1}(\epsilon_{t})n_{t,1}+C_{1,2}(\epsilon_{t})\right)+\mathbb{E}\left(L_{\epsilon_{t}}(R_{\epsilon_{t}})(X_{t},Y_{t})\right)
|tct|ξ5(ϵt)+|tϵt|(C1,1(ϵt)nt,1+C1,2(ϵt))c4ϵt5nt,1+c5(ϵt)+c1,\displaystyle\leqslant|\partial_{t}c_{t}|\xi_{5}(\epsilon_{t})+|\partial_{t}\epsilon_{t}|\left(C_{1,1}(\epsilon_{t})n_{t,1}+C_{1,2}(\epsilon_{t})\right)-c_{4}\epsilon_{t}^{5}n_{t,1}+c_{5}(\epsilon_{t})+c_{1},

where we use (3.4) in the first inequality. As ϵt=Ω(1ln(1+t))\epsilon_{t}=\Omega(\frac{1}{\ln(1+t)}), |tϵt|=𝒪(1t)=|tct||\partial_{t}\epsilon_{t}|=\mathcal{O}(\frac{1}{t})=|\partial_{t}c_{t}| and

C1,1(ϵt),C1,2(ϵt),c5(ϵt),ξ5(ϵt)=o(tβ)C_{1,1}(\epsilon_{t}),C_{1,2}(\epsilon_{t}),c_{5}(\epsilon_{t}),\xi_{5}(\epsilon_{t})=o(t^{\beta})

for any β>0\beta>0 as tt\to\infty, we deduce that, for constants c6>0,c7c_{6}>0,c_{7},

(3.6) tnt,1\displaystyle\dfrac{\partial}{\partial t}n_{t,1} c6ϵt5nt,1+c7(1+t)1+α/2.\displaystyle\leqslant-c_{6}\epsilon_{t}^{5}n_{t,1}+c_{7}(1+t)^{1+\alpha/2}.

The rest of the argument is the same as the one following (2.13). This proves the result when p=1p=1. Assume that the result holds for all q<pq<p, where p2p\geqslant 2. First, using Proposition 2.5 and equation (2.4.2) we compute

Lϵt(Rϵt,ctp)\displaystyle L_{\epsilon_{t}}(R_{\epsilon_{t},c_{t}}^{p}) =pRϵt,ctp1Lϵt(Rϵt,ct)+p(p1)Rϵt,ctp2ΓϵtRϵt,ct\displaystyle=pR_{\epsilon_{t},c_{t}}^{p-1}L_{\epsilon_{t}}(R_{\epsilon_{t},c_{t}})+p(p-1)R^{p-2}_{\epsilon_{t},c_{t}}\Gamma_{\epsilon_{t}}R_{\epsilon_{t},c_{t}}
pRϵt,ctp1(c4ϵt5Rϵt,ct+c5(ϵt)+c1)+p(p1)C3(ϵt)Rϵt,ctp1+p(p1)C4(ϵt)Rϵt,ctp2.\displaystyle\leqslant pR_{\epsilon_{t},c_{t}}^{p-1}\left(-c_{4}\epsilon_{t}^{5}R_{\epsilon_{t},c_{t}}+c_{5}(\epsilon_{t})+c_{1}\right)+p(p-1)C_{3}(\epsilon_{t})R^{p-1}_{\epsilon_{t},c_{t}}+p(p-1)C_{4}(\epsilon_{t})R^{p-2}_{\epsilon_{t},c_{t}}.

Differentiating with respect to tt and then using Proposition 2.6 and equation (2.4.2) gives

tnt,p\displaystyle\dfrac{\partial}{\partial t}n_{t,p} =(tct)ctnt,p+(tϵt)ϵtnt,p+s𝔼(Rϵtp(Xt+s,Yt+s))|s=0\displaystyle=\left(\partial_{t}c_{t}\right)\partial_{c_{t}}n_{t,p}+\left(\partial_{t}\epsilon_{t}\right)\partial_{\epsilon_{t}}n_{t,p}+\dfrac{\partial}{\partial s}\mathbb{E}\left(R_{\epsilon_{t}}^{p}(X_{t+s},Y_{t+s})\right)\bigg{|}_{s=0}
|tct|ξ5(ϵt)+|tϵt|(Cp,1(ϵt)nt,p+Cp,2(ϵ)nt,p1)+𝔼(Lϵt(Rϵtp)(Xt,Yt))\displaystyle\leqslant\left|\partial_{t}c_{t}\right|\xi_{5}(\epsilon_{t})+|\partial_{t}\epsilon_{t}|\left(C_{p,1}(\epsilon_{t})n_{t,p}+C_{p,2}(\epsilon)n_{t,p-1}\right)+\mathbb{E}\left(L_{\epsilon_{t}}(R_{\epsilon_{t}}^{p})(X_{t},Y_{t})\right)
c9ϵt5nt,p+c10(p,α)(1+t)1+α/2,\displaystyle\leqslant-c_{9}\epsilon_{t}^{5}n_{t,p}+c_{10}(p,\alpha)(1+t)^{1+\alpha/2},

where we use the same asymptotic estimates that lead us to (3.6) and the induction assumption on nt,p1n_{t,p-1} and nt,p2n_{t,p-2}.

Proposition 3.4.

Under the same assumptions as in Proposition 3.1, for any pp\in\mathbb{N}, α>0\alpha>0 and large enough tt (which depends on p,U,fp,U,f and the temperature schedule ϵt\epsilon_{t}), there exists a constant kk such that

𝔼(Xt2+Yt2)pk(1+t)α.\displaystyle\mathbb{E}\left(\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)^{p}\leqslant k(1+t)^{\alpha}.

[Proof. ]The proof is exactly the same as that of Proposition 2.4, except that we apply Proposition 3.3 in place of Proposition 2.7.

3.1.2. Proof of Proposition 3.1

Using the log-Sobolev inequality in Proposition 2.3 and Proposition 3.2, we have

𝐇(t)\displaystyle\mathbf{H}^{\prime}(t) 1pδ1(1/ϵt)ecϵt𝐇(t)+(|ϵt|+|ct|)ξ4(ϵt)(𝐇(t)+𝔼(1+Xt2+Yt2))\displaystyle\leqslant-\dfrac{1}{p_{\delta_{1}}(1/\epsilon_{t})}e^{-\frac{c_{*}}{\epsilon_{t}}}\mathbf{H}(t)+\left(|\epsilon_{t}^{\prime}|+|c_{t}^{\prime}|\right)\xi_{4}(\epsilon_{t})(\mathbf{H}(t)+\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right))
(3.7) =(1pδ1(1/ϵt)ecϵt+(|ϵt|+|ct|)ξ4(ϵt))𝐇(t)+(|ϵt|+|ct|)ξ4(ϵt)𝔼(1+Xt2+Yt2)\displaystyle=\left(-\dfrac{1}{p_{\delta_{1}}(1/\epsilon_{t})}e^{-\frac{c_{*}}{\epsilon_{t}}}+\left(|\epsilon_{t}^{\prime}|+|c_{t}^{\prime}|\right)\xi_{4}(\epsilon_{t})\right)\mathbf{H}(t)+\left(|\epsilon_{t}^{\prime}|+|c_{t}^{\prime}|\right)\xi_{4}(\epsilon_{t})\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)

where we recall ξ4(ϵt)\xi_{4}(\epsilon_{t}) is a subexponential function, and pδ1(1/ϵt)p_{\delta_{1}}(1/\epsilon_{t}) is first introduced in Proposition 2.3.

First, we handle the second term in (3.7). Note that

|ϵt|=𝒪(1t+|Et|)=𝒪(1t).\displaystyle|\epsilon_{t}^{\prime}|=\mathcal{O}\left(\dfrac{1}{t}+|E_{t}^{\prime}|\right)=\mathcal{O}\left(\dfrac{1}{t}\right).

Using |ct|=𝒪(1/t)|c_{t}^{\prime}|=\mathcal{O}(1/t), Proposition 3.4 and (lnt)p=𝒪(tα)(\ln t)^{p}=\mathcal{O}(t^{\alpha}) for large enough tt and any α,p>0\alpha,p>0 leads to

(3.8) (|ϵt|+|ct|)ξ4(ϵt)𝔼(1+Xt2+Yt2)=𝒪(1t1α).\displaystyle\left(|\epsilon_{t}^{\prime}|+|c_{t}^{\prime}|\right)\xi_{4}(\epsilon_{t})\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)=\mathcal{O}\left(\dfrac{1}{t^{1-\alpha}}\right).

Next, we consider the first term in (3.7). Using the fact that pδ1(1/ϵt)p_{\delta_{1}}(1/\epsilon_{t}) is polynomial in 1/ϵt1/\epsilon_{t} and Et=Ω(1)E_{t}=\Omega(1) leads to

(3.9) (|ϵt|+|ct|)ξ4(ϵt)\displaystyle\left(|\epsilon_{t}^{\prime}|+|c_{t}^{\prime}|\right)\xi_{4}(\epsilon_{t}) =𝒪(1t1α),\displaystyle=\mathcal{O}\left(\dfrac{1}{t^{1-\alpha}}\right),
(3.10) 1pδ1(1/ϵt)ecϵt\displaystyle\dfrac{1}{p_{\delta_{1}}(1/\epsilon_{t})}e^{-\frac{c_{*}}{\epsilon_{t}}} =Ω(1t)cEt+α.\displaystyle=\Omega\left(\dfrac{1}{t}\right)^{\frac{c_{*}}{E_{t}}+\alpha}.

Collecting the results of (3.8), (3.9) and (3.10) and putting these back into (3.7), if we choose α\alpha small enough, there exist constants c1,c2>0c_{1},c_{2}>0 such that

𝐇(t)c1(1t)cEt+α𝐇(t)+c2(1t)1α.\mathbf{H}^{\prime}(t)\leqslant-c_{1}\left(\dfrac{1}{t}\right)^{\frac{c_{*}}{E_{t}}+\alpha}\mathbf{H}(t)+c_{2}\left(\dfrac{1}{t}\right)^{1-\alpha}.

The rest of the proof follows exactly that of (Chak et al., 2020, Equations (C.32) to (C.34)), and we need to check that, as tt\to\infty,

t(1s)cEs+α𝑑st(1s)𝑑s.\int_{\cdot}^{t}\left(\dfrac{1}{s}\right)^{\frac{c_{*}}{E_{s}}+\alpha}\,ds\geqslant\int_{\cdot}^{t}\left(\dfrac{1}{s}\right)\,ds\to\infty.

3.2. Proof of Theorem 1.2

First, we check that with our choice of (ct)t0(c_{t})_{t\geqslant 0} in (3.1) and (Et)t0(E_{t})_{t\geqslant 0} in (3.2), the assumptions in Proposition 3.1 are satisfied:

Lemma 3.2.

With our choice of (ct)t0(c_{t})_{t\geqslant 0} in (3.1) and (Et)t0(E_{t})_{t\geqslant 0} in (3.2), we have

|ct|=𝒪(1t),|Et|=𝒪(1t),Et=Ω(1).\displaystyle|c_{t}^{\prime}|=\mathcal{O}\left(\dfrac{1}{t}\right),\quad|E_{t}^{\prime}|=\mathcal{O}\left(\dfrac{1}{t}\right),\quad E_{t}=\Omega(1).

[Proof. ]Clearly, Etδ2E_{t}\geqslant\delta_{2} and so Et=Ω(1)E_{t}=\Omega(1). Next, we consider ctc_{t}:

ct\displaystyle c_{t} =(u1n)+φ1n(tu)𝑑u,\displaystyle=\int\mathcal{M}_{\left(u-\frac{1}{n}\right)_{+}}\varphi_{\frac{1}{n}}(t-u)\,du,
ct\displaystyle c_{t}^{\prime} =(u1n)+nφ((tu)n)n22(tu)(((tu)n)21)2𝑑u,\displaystyle=\int\mathcal{M}_{\left(u-\frac{1}{n}\right)_{+}}n\varphi((t-u)n)\dfrac{-n^{2}2(t-u)}{(((t-u)n)^{2}-1)^{2}}\,du,
|ct|\displaystyle|c_{t}^{\prime}| (u1n)+nφ((tu)n)n((tu)n1)2𝑑u.\displaystyle\leqslant\int\mathcal{M}_{\left(u-\frac{1}{n}\right)_{+}}n\varphi((t-u)n)\dfrac{n}{((t-u)n-1)^{2}}\,du.

Using the monotone convergence theorem, as φ(t)1(t1)2\varphi(t)\frac{1}{(t-1)^{2}} is non-increasing in tt, we conclude that t|ct|0t|c_{t}^{\prime}|\to 0 as tt\to\infty. The proof of |Et|=𝒪(1t)|E_{t}^{\prime}|=\mathcal{O}\left(\frac{1}{t}\right) is very similar and is omitted.

We write t\mathcal{F}_{t} to be the canonical filtration generated by t\mathcal{M}_{t} up to time tt. Thanks to Lemma 3.2, Proposition 3.1, Theorem 1.1 and Remark 2.2 we have the following estimate:

(U(Xt)>Umin+δ|(t1n)+)A(1t)min{1cEtα2,δ2Et}.\displaystyle\mathbb{P}\left(U(X_{t})>U_{min}+\delta|\mathcal{F}_{(t-\frac{1}{n})_{+}}\right)\leqslant A\left(\dfrac{1}{t}\right)^{\min\bigg{\{}\frac{1-\frac{c_{*}}{E_{t}}-\alpha}{2},\frac{\delta}{2E_{t}}\bigg{\}}}.

For the exponent of (1/t)(1/t), we select δ1\delta_{1} and δ2\delta_{2} such that 0<δ2δ1<δ0<\delta_{2}-\delta_{1}<\delta which gives

min{1cEtα2,δ2Et}min{1(t2n)++δ1Etα2,δ2Et}\displaystyle\min\bigg{\{}\frac{1-\frac{c_{*}}{E_{t}}-\alpha}{2},\frac{\delta}{2E_{t}}\bigg{\}}\geqslant\min\bigg{\{}\frac{1-\frac{\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{1}}{E_{t}}-\alpha}{2},\frac{\delta}{2E_{t}}\bigg{\}} =1(t2n)++δ1Etα2\displaystyle=\frac{1-\frac{\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{1}}{E_{t}}-\alpha}{2}
1(t2n)++δ1(t2n)++δ2α2.\displaystyle\geqslant\frac{1-\frac{\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{1}}{\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{2}}-\alpha}{2}.

Note that the choice of α\alpha is arbitrary, and we consider sufficiently small α\alpha such that α(0,δ2δ1U(X0)+δ2)\alpha\in(0,\frac{\delta_{2}-\delta_{1}}{U(X_{0})+\delta_{2}}) to ensure the exponent of (1/t)(1/t) is positive, i.e. for all t0t\geqslant 0

1(t2n)++δ1(t2n)++δ2α2>0.\frac{1-\frac{\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{1}}{\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{2}}-\alpha}{2}>0.

This choice of α\alpha also satisfies the requirement in Proposition 3.1. Using the law of iterated expectation yields

(U(Xt)>Umin+δ)\displaystyle\mathbb{P}\left(U(X_{t})>U_{min}+\delta\right) =𝔼((U(Xt)>Umin+δ|(t1n)+))\displaystyle=\mathbb{E}\left(\mathbb{P}\left(U(X_{t})>U_{min}+\delta|\mathcal{F}_{(t-\frac{1}{n})_{+}}\right)\right)
AUminU(X0)(1t)1y+δ1y+δ2α2𝑑((t2n)+y)\displaystyle\leqslant A\int_{U_{min}}^{U(X_{0})}\left(\dfrac{1}{t}\right)^{\frac{1-\frac{y+\delta_{1}}{y+\delta_{2}}-\alpha}{2}}\,d\,\mathbb{P}\left(\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}\leqslant y\right)
=AUminUmin+δ(1t)1y+δ1y+δ2α2𝑑((t2n)+y)\displaystyle=A\int_{U_{min}}^{U_{min}+\delta}\left(\dfrac{1}{t}\right)^{\frac{1-\frac{y+\delta_{1}}{y+\delta_{2}}-\alpha}{2}}\,d\,\mathbb{P}\left(\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}\leqslant y\right)
+AUmin+δU(X0)(1t)1y+δ1y+δ2α2𝑑((t2n)+y)\displaystyle\quad+A\int_{U_{min}+\delta}^{U(X_{0})}\left(\dfrac{1}{t}\right)^{\frac{1-\frac{y+\delta_{1}}{y+\delta_{2}}-\alpha}{2}}\,d\,\mathbb{P}\left(\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}\leqslant y\right)
A(1t)δ2δ1δ+δ2α2+A(1t)1U(X0)+δ1U(X0)+δ2α2\displaystyle\leqslant A\left(\dfrac{1}{t}\right)^{\frac{\frac{\delta_{2}-\delta_{1}}{\delta+\delta_{2}}-\alpha}{2}}+A\left(\dfrac{1}{t}\right)^{\frac{1-\frac{U(X_{0})+\delta_{1}}{U(X_{0})+\delta_{2}}-\alpha}{2}}
AUmin+δU(X0)((t2n)+y)d(1t)1y+δ1y+δ2α2\displaystyle\quad-A\int_{U_{min}+\delta}^{U(X_{0})}\mathbb{P}\left(\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}\leqslant y\right)\,d\left(\dfrac{1}{t}\right)^{\frac{1-\frac{y+\delta_{1}}{y+\delta_{2}}-\alpha}{2}}
2A(1t)δ2δ1δ+δ2α2+AUmin+δU(X0)((t2n)+>y)d(1t)1y+δ1y+δ2α2\displaystyle\leqslant 2A\left(\dfrac{1}{t}\right)^{\frac{\frac{\delta_{2}-\delta_{1}}{\delta+\delta_{2}}-\alpha}{2}}+A\int_{U_{min}+\delta}^{U(X_{0})}\mathbb{P}\left(\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}>y\right)\,d\left(\dfrac{1}{t}\right)^{\frac{1-\frac{y+\delta_{1}}{y+\delta_{2}}-\alpha}{2}}
A(1t)δ2δ1δ+δ2α2+A(1t)δ2δ1U(X0)+δ2α2,\displaystyle\leqslant A\left(\dfrac{1}{t}\right)^{\frac{\frac{\delta_{2}-\delta_{1}}{\delta+\delta_{2}}-\alpha}{2}}+A\left(\dfrac{1}{t}\right)^{\frac{\frac{\delta_{2}-\delta_{1}}{U(X_{0})+\delta_{2}}-\alpha}{2}},

where the second inequality follows from integration by parts.

Appendix: setup of the numerical results

In this section, we describe the experimental setup for the numerical results presented in Section 1.3.

3.2.1. Description of the four annealing methods

We describe the four annealing methods that we test; a short code sketch of the resulting update rules is given right after this list:

  • IAKSA and KSA: KSA is a special case of IAKSA with f=0f=0. Instead of running (1.8), we consider

    dXt\displaystyle dX_{t} =Ytdt,\displaystyle=Y_{t}\,dt,
    dYt\displaystyle dY_{t} =YtdtϵtxHϵt,ct(Xt)dt+2ϵtdBt,\displaystyle=-Y_{t}\,dt-\epsilon_{t}\nabla_{x}H_{\epsilon_{t},c_{t}}(X_{t})\,dt+\sqrt{2\epsilon_{t}}\,dB_{t},

    and apply the Euler-Maruyama discretization with stepsize (η(k))k0(\eta(k))_{k\in\mathbb{N}_{0}}, cooling schedule (ϵ(k))k0(\epsilon(k))_{k\in\mathbb{N}_{0}} and adaptive (c(k))k0(c(k))_{k\in\mathbb{N}_{0}} to obtain (X(k),Y(k))k0(X(k),Y(k))_{k\in\mathbb{N}_{0}}:

    X(k+1)\displaystyle X(k+1) =X(k)+Y(k)η(k),\displaystyle=X(k)+Y(k)\eta(k),
    Y(k+1)\displaystyle Y(k+1) =Y(k)Y(k)η(k)ϵ(k)xHϵ(k),c(k)(X(k))η(k)+2ϵ(k)η(k)N(k),\displaystyle=Y(k)-Y(k)\eta(k)-\epsilon(k)\nabla_{x}H_{\epsilon(k),c(k)}(X(k))\eta(k)+\sqrt{2\epsilon(k)}\sqrt{\eta(k)}N(k),

    where (N(k))(N(k)) is a sequence of i.i.d. standard normal random variables.

  • IASA and SA: SA is a special case of IASA with f=0f=0. We simulate an Euler–Maruyama discretization of (1.5) with stepsize (η(k))k0(\eta(k))_{k\in\mathbb{N}_{0}}, cooling schedule (ϵ(k))k0(\epsilon(k))_{k\in\mathbb{N}_{0}} and adaptive (c(k))k0(c(k))_{k\in\mathbb{N}_{0}} to obtain (Z(k))k0(Z(k))_{k\in\mathbb{N}_{0}}:

    Z(k+1)=Z(k)U(Z(k))η(k)+2(f((U(Z(k))c(k))+)+ϵ(k))η(k)N(k).\displaystyle Z(k+1)=Z(k)-\nabla U(Z(k))\,\eta(k)+\sqrt{2\left(f((U(Z(k))-c(k))_{+})+\epsilon(k)\right)}\sqrt{\eta(k)}N(k).
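
The two displayed recursions can be written compactly as follows. This is a minimal sketch rather than the code used for the experiments: the callables grad_H, grad_U, U and f, as well as the schedule values eta_k, eps_k and c_k, are assumed to be supplied by the caller, with grad_H computed according to (1.7); taking f=0 recovers the KSA and SA updates.

    import numpy as np

    def iaksa_step(x, y, grad_H, eps_k, eta_k, rng):
        """One Euler-Maruyama step of the kinetic recursion for (X(k), Y(k)) above.

        grad_H(x) stands for the gradient of H_{eps(k), c(k)} at x and must be
        supplied according to (1.7)."""
        noise = rng.standard_normal(np.shape(x))
        x_new = x + y * eta_k
        y_new = (y - y * eta_k - eps_k * grad_H(x) * eta_k
                 + np.sqrt(2.0 * eps_k * eta_k) * noise)
        return x_new, y_new

    def iasa_step(z, U, grad_U, f, c_k, eps_k, eta_k, rng):
        """One Euler-Maruyama step of the overdamped recursion for Z(k) above."""
        noise = rng.standard_normal(np.shape(z))
        diff = f(max(U(z) - c_k, 0.0)) + eps_k   # state-dependent diffusion coefficient
        return z - grad_U(z) * eta_k + np.sqrt(2.0 * diff * eta_k) * noise

A full run alternates these one-step updates with the adaptive schedules c(k) and eps(k) described in Section 3.2.2 below.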

3.2.2. Description of the test functions and the parameters

For both IAKSA and IASA, we use f(u)=0.5arctan(u)f(u)=0.5\arctan(u). Note that although this choice of ff does not satisfy Assumption 1.1, it is the choice used in the numerical experiments of Fang et al. (1997). As for the benchmark functions, we use the following (a small code sketch of one such adaptive run is given after the list):

  • Ackley function U1U_{1}: We consider the 22-dimensional Ackley function

    U1(x1,x2)=20exp(0.212i=12xi2)exp(12i=12cos(2πxi))+20+eU_{1}(x_{1},x_{2})=-20\exp\left(-0.2\sqrt{\frac{1}{2}\sum_{i=1}^{2}x_{i}^{2}}\right)-\exp\left(\frac{1}{2}\sum_{i=1}^{2}\cos\left(2\pi x_{i}\right)\right)+20+e

    with initial stepsize η(0)=0.05\eta(0)=0.05. We use a multiplicative stepsize decay strategy: on every 10001000 iterations, the stepsize decreases by a factor of 0.9990.999. Denote Θ(k)=skη(s)\Theta(k)=\sum_{s\leqslant k}\eta(s). We also use

    c(k)\displaystyle c(k) =minvkU1(X(v))+1Θ(k)+1,\displaystyle=\min_{v\leqslant k}U_{1}(X(v))+\dfrac{1}{\Theta(k)+1},
    ϵ(k)\displaystyle\epsilon(k) =2ln(Θ(k)+2).\displaystyle=\dfrac{2}{\ln(\Theta(k)+2)}.

    The initialization is X(0)=(18.5,17.4)X(0)=(18.5,17.4) and for kinetic diffusions Y(0)=0Y(0)=0.

  • Ackley3 function U2U_{2}: We consider the 22-dimensional Ackley3 function

    U2(x1,x2)=200exp(0.2i=12xi2)+5exp(cos(3x1)+sin(3x2))U_{2}(x_{1},x_{2})=-200\exp\left(-0.2\sqrt{\sum_{i=1}^{2}x_{i}^{2}}\right)+5\exp\left(\cos(3x_{1})+\sin(3x_{2})\right)

    with initial stepsize η(0)=0.05\eta(0)=0.05. We use a multiplicative stepsize decay strategy: on every 10001000 iterations, the stepsize decreases by a factor of 0.9990.999. We use

    c(k)\displaystyle c(k) =minvkU2(X(v))+1Θ(k)+1,\displaystyle=\min_{v\leqslant k}U_{2}(X(v))+\dfrac{1}{\Theta(k)+1},
    ϵ(k)\displaystyle\epsilon(k) =2ln(Θ(k)+2).\displaystyle=\dfrac{2}{\ln(\Theta(k)+2)}.

    The initialization is X(0)=(18.4,12.8)X(0)=(18.4,12.8) and for kinetic diffusions Y(0)=0Y(0)=0.

  • Rastrigin function $U_{3}$: We consider the $2$-dimensional Rastrigin function

    U_{3}(x_{1},x_{2}) = 20 + \sum_{i=1}^{2}\left[x_{i}^{2} - 10\cos\left(2\pi x_{i}\right)\right]

    with initial stepsize $\eta(0)=0.5$. We use a multiplicative stepsize decay strategy: every $1000$ iterations, the stepsize is multiplied by $0.999$. We use

    c(k) = \min_{v\leqslant k}U_{3}(X(v)) + \dfrac{1}{\Theta(k)+1},
    \epsilon(k) = \dfrac{0.5}{\ln(\Theta(k)+2)}.

    The initialization is $X(0)=(9.84,3.33)$ and, for the kinetic diffusions, $Y(0)=0$.
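
The following is a minimal Python/NumPy sketch of the settings listed above: the three $2$-dimensional benchmark functions, the choice $f(u)=0.5\arctan(u)$, the multiplicative stepsize decay, the running-minimum-based $c(k)$ and the logarithmic cooling $\epsilon(k)$. The helper name `schedule` and the argument `E` (equal to $2$ for $U_{1},U_{2}$ and $0.5$ for $U_{3}$) are illustrative, not notation from the paper.

    import numpy as np

    def ackley(x):        # U_1
        return (-20.0 * np.exp(-0.2 * np.sqrt(0.5 * np.sum(x**2)))
                - np.exp(0.5 * np.sum(np.cos(2.0 * np.pi * x))) + 20.0 + np.e)

    def ackley3(x):       # U_2
        return (-200.0 * np.exp(-0.2 * np.sqrt(np.sum(x**2)))
                + 5.0 * np.exp(np.cos(3.0 * x[0]) + np.sin(3.0 * x[1])))

    def rastrigin(x):     # U_3
        return 20.0 + np.sum(x**2 - 10.0 * np.cos(2.0 * np.pi * x))

    def f(u):             # landscape-modification function for IAKSA and IASA
        return 0.5 * np.arctan(u)

    def schedule(U_of_X, eta0, E, n_iter):
        """Yield (eta(k), c(k), eps(k)) for k = 0, 1, ..., n_iter - 1.

        U_of_X(k) should return U(X(k)) for the trajectory generated so far;
        Theta(k) accumulates the stepsizes, c(k) is the running minimum plus
        1/(Theta(k)+1), and eps(k) = E/ln(Theta(k)+2)."""
        eta, Theta, run_min = eta0, 0.0, np.inf
        for k in range(n_iter):
            if k > 0 and k % 1000 == 0:
                eta *= 0.999                      # multiplicative stepsize decay
            Theta += eta                          # Theta(k) = sum_{s <= k} eta(s)
            run_min = min(run_min, U_of_X(k))
            c_k = run_min + 1.0 / (Theta + 1.0)   # adaptive c(k)
            eps_k = E / np.log(Theta + 2.0)       # logarithmic cooling
            yield eta, c_k, eps_k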

Acknowledgements

We thank Jing Zhang for careful proofreading, and appreciate the constructive remarks from Xuefeng Gao, Laurent Miclo, Andre Milzarek, Pierre Monmarché and Wenpin Tang on various aspects of an earlier version of this work and on stochastic optimization in general. In particular, we thank Andre Milzarek for suggesting the plot in Figure 1. The author acknowledges the support from The Chinese University of Hong Kong, Shenzhen grant PF01001143 and the financial support from AIRS - Shenzhen Institute of Artificial Intelligence and Robotics for Society Project 2019-INT002.

References

  • Andrieu and Thoms (2008) C. Andrieu and J. Thoms. A tutorial on adaptive MCMC. Stat. Comput., 18(4):343–373, 2008.
  • Benaïm and Bréhier (2019) M. Benaïm and C.-E. Bréhier. Convergence analysis of adaptive biasing potential methods for diffusion processes. Commun. Math. Sci., 17(1):81–130, 2019.
  • Benaïm et al. (2020) M. Benaïm, C.-E. Bréhier, and P. Monmarché. Analysis of an adaptive biasing force method based on self-interacting dynamics. Electron. J. Probab., 25:28 pp., 2020.
  • Bierkens (2016) J. Bierkens. Non-reversible Metropolis-Hastings. Stat. Comput., 26(6):1213–1228, 2016.
  • Catoni (1992) O. Catoni. Rough large deviation estimates for simulated annealing: application to exponential schedules. Ann. Probab., 20(3):1109–1146, 1992.
  • Chak et al. (2020) M. Chak, N. Kantas, and G. A. Pavliotis. On the generalised Langevin equation for simulated annealing. arXiv preprint arXiv:2003.06448, 2020.
  • Chen and Hwang (2013) T.-L. Chen and C.-R. Hwang. Accelerating reversible Markov chains. Statist. Probab. Lett., 83(9):1956–1962, 2013.
  • Chiang et al. (1987) T.-S. Chiang, C.-R. Hwang, and S. J. Sheu. Diffusion for global optimization in 𝐑n{\bf R}^{n}. SIAM J. Control Optim., 25(3):737–753, 1987.
  • Diaconis et al. (2000) P. Diaconis, S. Holmes, and R. M. Neal. Analysis of a nonreversible Markov chain sampler. Ann. Appl. Probab., 10(3):726–752, 2000.
  • Duncan et al. (2016) A. B. Duncan, T. Lelièvre, and G. A. Pavliotis. Variance reduction using nonreversible Langevin samplers. J. Stat. Phys., 163(3):457–491, 2016.
  • Duncan et al. (2017) A. B. Duncan, N. Nüsken, and G. A. Pavliotis. Using perturbed underdamped Langevin dynamics to efficiently sample from probability distributions. J. Stat. Phys., 169(6):1098–1131, 2017.
  • Fang et al. (1997) H. Fang, M. Qian, and G. Gong. An improved annealing method and its large-time behavior. Stochastic Process. Appl., 71(1):55–74, 1997.
  • Fort et al. (2011) G. Fort, E. Moulines, and P. Priouret. Convergence of adaptive and interacting Markov chain Monte Carlo algorithms. Ann. Statist., 39(6):3262–3289, 2011.
  • Gadat and Panloup (2014) S. Gadat and F. Panloup. Long time behaviour and stationary regime of memory gradient diffusions. Ann. Inst. Henri Poincaré Probab. Stat., 50(2):564–601, 2014.
  • Gadat et al. (2013) S. Gadat, F. Panloup, and C. Pellegrini. Large deviation principle for invariant distributions of memory gradient diffusions. Electron. J. Probab., 18:no. 81, 34, 2013.
  • Gao et al. (2018) X. Gao, M. Gurbuzbalaban, and L. Zhu. Breaking reversibility accelerates Langevin dynamics for global non-convex optimization. arXiv preprint arXiv:1812.07725, 2018.
  • Guo et al. (2020) X. Guo, J. Han, and W. Tang. Perturbed gradient descent with occupation time. arXiv preprint arXiv:2005.04507, 2020.
  • Holley and Stroock (1988) R. Holley and D. Stroock. Simulated annealing via Sobolev inequalities. Comm. Math. Phys., 115(4):553–569, 1988.
  • Holley et al. (1989) R. A. Holley, S. Kusuoka, and D. W. Stroock. Asymptotics of the spectral gap with applications to the theory of simulated annealing. J. Funct. Anal., 83(2):333–347, 1989.
  • Hu et al. (2020) Y. Hu, X. Wang, X. Gao, M. Gurbuzbalaban, and L. Zhu. Non-convex stochastic optimization via non-reversible stochastic gradient Langevin dynamics. arXiv preprint arXiv:2004.02823, 2020.
  • Hwang et al. (2005) C.-R. Hwang, S.-Y. Hwang-Ma, and S.-J. Sheu. Accelerating diffusions. Ann. Appl. Probab., 15(2):1433–1444, 2005.
  • Jacquot (1992) S. Jacquot. Comportement asymptotique de la seconde valeur propre des processus de Kolmogorov. J. Multivariate Anal., 40(2):335–347, 1992.
  • Jamil and Yang (2013) M. Jamil and X.-S. Yang. A literature survey of benchmark functions for global optimisation problems. International Journal of Mathematical Modelling and Numerical Optimisation, 4(2):150–194, 2013.
  • Lelièvre and Minoukadeh (2011) T. Lelièvre and K. Minoukadeh. Long-time convergence of an adaptive biasing force method: the bi-channel case. Arch. Ration. Mech. Anal., 202(1):1–34, 2011.
  • Lelièvre et al. (2008) T. Lelièvre, M. Rousset, and G. Stoltz. Long-time convergence of an adaptive biasing force method. Nonlinearity, 21(6):1155–1181, 2008.
  • Löwe (1997) M. Löwe. On the invariant measure of non-reversible simulated annealing. Statist. Probab. Lett., 36(2):189–193, 1997.
  • Miclo (1992) L. Miclo. Recuit simulé sur 𝐑n{\bf R}^{n}. Étude de l’évolution de l’énergie libre. Ann. Inst. H. Poincaré Probab. Statist., 28(2):235–266, 1992.
  • Miclo (2020) L. Miclo. Private communication. 2020.
  • Monmarché (2018) P. Monmarché. Hypocoercivity in metastable settings and kinetic simulated annealing. Probab. Theory Related Fields, 172(3-4):1215–1248, 2018.
  • Raimond (2009) O. Raimond. Self-interacting diffusions: a simulated annealing version. Probab. Theory Related Fields, 144(1-2):247–279, 2009.
  • Roberts and Rosenthal (2009) G. O. Roberts and J. S. Rosenthal. Examples of adaptive MCMC. J. Comput. Graph. Statist., 18(2):349–367, 2009.