On the convergence of an improved and adaptive kinetic simulated annealing

Michael C.H. Choi School of Data Science, The Chinese University of Hong Kong, Shenzhen and Shenzhen Institute of Artificial Intelligence and Robotics for Society, Guangdong, 518172, P.R. China [email protected]

Abstract.

Inspired by the work of Fang et al. (1997), who propose an improved simulated annealing algorithm based on a variant of overdamped Langevin diffusion with state-dependent diffusion coefficient, we cast this idea in the kinetic setting and develop an improved kinetic simulated annealing (IKSA) method for minimizing a target function $U$ . To analyze its convergence, we utilize the framework recently introduced by Monmarché (2018) for the case of kinetic simulated annealing (KSA). The core idea of IKSA rests on introducing a parameter $c>\inf U$ , which de facto modifies the optimization landscape and clips the critical height in IKSA at a maximum of $c-\inf U$ . Consequently IKSA enjoys improved convergence with faster logarithmic cooling than KSA. To tune the parameter $c$ , we propose an adaptive method that we call IAKSA which utilizes the running minimum generated by the algorithm on the fly, thus avoiding the need to manually adjust $c$ for better performance. We present positive numerical results on some standard global optimization benchmark functions that verify the improved convergence of IAKSA over other Langevin-based annealing methods.

AMS 2010 subject classifications: 60J25, 60J60, 46N30

Keywords: simulated annealing; Langevin diffusion; relative entropy; log-Sobolev constant; self-interacting diffusion; landscape modification

1. Introduction

Given a target function $U:\mathbb{R}^{d}\to\mathbb{R}$ to minimize, we are interested in simulated annealing algorithms based on Langevin diffusion and its various variants. Let us begin by briefly recalling the dynamics of the classical overdamped Langevin diffusion $(\mathcal{Z}_{t})_{t\geqslant 0}$ for simulated annealing (SA):

(1.1)

\displaystyle d\mathcal{Z}_{t}=-\nabla U(\mathcal{Z}_{t})\,dt+\sqrt{2\epsilon_{t}}dB_{t},

where $(B_{t})_{t\geqslant 0}$ is the standard $d$ -dimensional Brownian motion and $(\epsilon_{t})_{t\geqslant 0}$ is the temperature or cooling schedule. The instantaneous stationary distribution of (1.1) at time $t$ is the Gibbs distribution that we denote by

\mu_{\epsilon_{t}}^{0}(x)\propto e^{-\frac{1}{\epsilon_{t}}U(x)}.

We shall explain the seemingly strange upper script that appears in $\mu_{\epsilon_{t}}^{0}$ later in (1.6). This overdamped Langevin dynamics and its convergence have been analyzed in Chiang et al. (1987); Holley et al. (1989); Miclo (1992); Jacquot (1992). It can be shown that under the logarithmic cooling schedule of the form

(1.2)

\displaystyle\epsilon_{t}=\dfrac{E}{\ln t},\quad\text{large enough}\,t,

where $E>E_{*}$ , $(\mathcal{Z}_{t})_{t\geqslant 0}$ gradually concentrates around the set of global minima of $U$ in the sense that for any $\delta>0$ ,

\lim_{t\to\infty}\mathbb{P}\left(U(\mathcal{Z}_{t})>U_{min}+\delta\right)=0.

Here we write $U_{min}:=\inf U$ . We call $E$ the energy level and $E_{*}$ the hill-climbing constant or the critical height associated with $U$ . Intuitively speaking, $E_{*}$ is the largest hill one need to climb starting from a local minimum to a fixed global minimum. For a precise definition of $E_{*}$ , we refer readers to (2.2) below.

While the overdamped Langevin diffusion (1.1) can be seen as the continuous counterpart of gradient descent perturbed by Gaussian noise, the analogue of momentum method in this context is the kinetic or underdamped Langevin diffusion. The kinetic Langevin dynamics $(\mathcal{X}_{t},\mathcal{Y}_{t})_{t\geqslant 0}$ (KSA) is described by

(1.3)		$\displaystyle d\mathcal{X}_{t}$	$\displaystyle=\mathcal{Y}_{t}\,dt,$
(1.4)		$\displaystyle d\mathcal{Y}_{t}$	$\displaystyle=-\dfrac{1}{\epsilon_{t}}\mathcal{Y}_{t}\,dt-\nabla U(\mathcal{X}_{t})\,dt+\sqrt{2}\,dB_{t},$

where $(\mathcal{X}_{t})_{t\geqslant 0}$ stands for the position and $(\mathcal{Y}_{t})_{t\geqslant 0}$ is the velocity or momentum variable. The instantaneous stationary distribution of $(\mathcal{X}_{t},\mathcal{Y}_{t})_{t\geqslant 0}$ at time $t$ is the product distribution of $\mu_{\epsilon_{t}}^{0}$ and the Gaussian distribution with variance $\epsilon_{t}$ that we denote by

\pi_{\epsilon_{t}}^{0}(x,y)\propto e^{-\frac{1}{\epsilon_{t}}U(x)}e^{-\frac{\left\lVert y\right\rVert^{2}}{2\epsilon_{t}}}.

We will explain the notation $\pi_{\epsilon_{t}}^{0}$ in (1.10) below. Unlike the overdamped Langevin dynamics (1.1) which is reversible, the kinetic counterpart (1.3) is in general non-reversible, which imposes technical difficulties in establishing its long-time convergence. On the other hand, it is known in the literature that using non-reversible dynamics may accelerate convergence in the context of sampling or optimization, see for example Diaconis et al. (2000); Bierkens (2016); Chen and Hwang (2013); Löwe (1997); Duncan et al. (2016, 2017); Hwang et al. (2005); Gao et al. (2018); Hu et al. (2020). Using a distorted entropy approach, in Monmarché (2018) the author proves for the first time convergence result of kinetic simulated annealing: under the same logarithmic cooling schedule as in (1.2), for any $\delta>0$ we have

\lim_{t\to\infty}\mathbb{P}\left(U(\mathcal{X}_{t})>U_{min}+\delta\right)=0.

More recently, Chak et al. (2020) analyze the generalized Langevin dynamics for simulated annealing based on the framework introduced in Monmarché (2018).

Many techniques have been developed in the literature to improve or to accelerate the convergence of Langevin dynamics. In this paper, we are particularly interested in an improved variant of Langevin dynamics $(Z_{t})_{t\geqslant 0}$ (ISA) with state-dependent diffusion coefficient, introduced by Fang et al. (1997), and its dynamics is described by the following:

(1.5)

\displaystyle dZ_{t}=-\nabla U(Z_{t})\,dt+\sqrt{2\left(f((U(Z_{t})-c)_{+})+\epsilon_{t}\right)}\,dB_{t},

where we write $a_{+}=\max\{a,0\}$ for $a\in\mathbb{R}$ . Comparing the improved dynamics (1.5) with the classical one (1.1), we see that the function $f:\mathbb{R}\to\mathbb{R}^{+}$ and the parameter $c$ are introduced. We formally state the assumptions needed on both $f$ and $c$ in Assumption 1.1 below. To briefly summarize, we need to choose $c>U_{min}$ and $f$ to be twice-differentiable, non-negative, bounded and non-decreasing with $f(0)=f^{\prime}(0)=f^{\prime\prime}(0)=0$ . It is shown in Fang et al. (1997) that the instantaneous stationary distribution at time $t$ of (1.5) is given by

(1.6)

\displaystyle\mu_{\epsilon_{t}}^{f}(x)

\displaystyle=\mu_{\epsilon_{t},c}^{f}(x)\propto e^{-H_{\epsilon_{t}}(x)}=\dfrac{1}{f((U(x)-c)_{+})+\epsilon_{t}}\exp\left(-\int_{U_{min}}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon_{t}}\,du\right),

where

(1.7)

\displaystyle H_{\epsilon}(x)=H_{\epsilon,c}(x):=\int_{U_{min}}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du+\ln\left(f((U(x)-c)_{+})+\epsilon\right).

Observe that if $f=0$ , (1.5) reduces to the classical overdamped dynamics (1.1). As such $\mu_{\epsilon}^{f}$ can be considered as a generalization of the Gibbs distribution $\mu_{\epsilon}^{0}$ . This also explains the notation $\mu_{\epsilon_{t}}^{0}$ earlier.

One important difference between (1.5) and (1.1) is the introduction of state-dependent diffusion coefficient: the greater the difference between $U(Z_{t})$ and $c$ , the greater (in absolute terms) the Gaussian noise is to be injected, and this extra noise may improve the convergence by helping the dynamics to escape a local minimum or saddle point. On the other hand, in the region where $U(Z_{t})<c$ the dynamics evolves in the same manner as the classical dynamics. As for the theoretical benefits, in Fang et al. (1997) the authors demonstrate that under the logarithmic cooling schedule of the form (1.2) with $E>c_{*}$ and for any $\delta>0$ ,

\lim_{t\to\infty}\mathbb{P}\left(U(Z_{t})>U_{min}+\delta\right)=0,

where we call $c_{*}$ the clipped critical height, to be defined formally in (2.3) below. It can be shown that $E_{*}\geqslant c_{*}$ and $c-U_{min}\geqslant c_{*}$ , and hence one can understand as if the critical height is capped at a maximum level $c-U_{min}$ . The key technical insight in Fang et al. (1997) relies on both the spectral gap and the log-Sobolev constant are of the order $e^{c_{*}/\epsilon_{t}}$ . As a result, we can operate a faster cooling schedule for the improved dynamics (1.5) that still enjoys convergence guarantee.

The crux of this paper is to cast the idea of Fang et al. (1997) into the kinetic Langevin setting for simulated annealing. One way to do so is to think of altering the target function: in SA the exponent in $\mu_{\epsilon_{t}}^{0}$ is $-(1/\epsilon_{t})U(x)$ , while in ISA the exponent in $\mu_{\epsilon_{t}}^{f}$ (1.6) takes on the generalized form as $-H_{\epsilon_{t}}$ . In this way the optimization landscape is de facto modified from $U$ to $H_{\epsilon}$ and hopefully improved. We apply this idea and simply substitute $(1/\epsilon_{t})U(x)$ by $H_{\epsilon_{t}}$ in its dynamics. More precisely, we are interested in the following dynamics $(X_{t},Y_{t})_{t\geqslant 0}$ that we call IKSA:

(1.8)		$\displaystyle dX_{t}$	$\displaystyle=Y_{t}\,dt,$
(1.9)		$\displaystyle dY_{t}$	$\displaystyle=-\dfrac{1}{\epsilon_{t}}Y_{t}\,dt-\epsilon_{t}\nabla_{x}H_{\epsilon_{t},c}(X_{t})\,dt+\sqrt{2}\,dB_{t}.$

Its instantaneous stationary distribution at time $t$ is the product distribution of $\mu_{\epsilon_{t}}^{f}$ and the Gaussian distribution with mean $0$ and variance $\epsilon_{t}$ :

(1.10)

\displaystyle\pi_{\epsilon_{t}}^{f}(x,y)=\pi_{\epsilon_{t},c}^{f}(x,y)\propto\mu_{\epsilon_{t}}^{f}(x)e^{-\frac{\left\lVert y\right\rVert^{2}}{2\epsilon_{t}}}\propto e^{-H_{\epsilon_{t}}(x)}e^{-\frac{\left\lVert y\right\rVert^{2}}{2\epsilon_{t}}}.

Note that when $f=0$ , $\nabla_{x}H_{\epsilon_{t}}(x)=\frac{1}{\epsilon_{t}}\nabla_{x}U(x)$ , and we retrieve exactly the classical kinetic Langevin diffusion (1.3). As such we can think of IKSA as a generalization of KSA. This also explains the notation $\pi_{\epsilon_{t}}^{0}$ that appears earlier.

For a general $f$ (that satisfies Assumption 1.1 below), in the case of $U(x)\leqslant c$ we have $\nabla_{x}H_{\epsilon_{t}}(x)=\frac{1}{\epsilon_{t}}\nabla_{x}U(x)$ , and hence the improved kinetic dynamics (1.8) evolves in the same way as the classical one (1.3) in this region. In the other case when $U(x)>c$ , we see that

\nabla_{x}H_{\epsilon}=\dfrac{1+f^{\prime}((U(x)-c)_{+})}{f((U(x)-c)_{+})+\epsilon}\nabla_{x}U.

Thus, the greater $U(x)$ is relative to $c$ , the greater the denominator in $\nabla_{x}H_{\epsilon_{t}}$ in the above equation, and the more dominant is the Brownian noise in the velocity update of $Y_{t}$ . This effect can hopefully improve the convergence of $U(X_{t})$ when its value is greater than $c$ . To illustrate, we plot the following function

U_{0}(x)=\cos(2x)+\frac{1}{2}\sin(x)+\frac{1}{3}\sin(10x)

as in Monmarché (2018) and compare this landscape with that of $H_{\epsilon,c}$ in Figure 1, where we take $\epsilon=0.5$ , $c=-1.5$ and $f=\arctan$ . For example, when $x\in(-4,-2)$ in Figure 1, we can see that the gradient of $H_{\epsilon,c}$ is much smaller than that of $U$ , thus the Brownian noise plays a relatively more dominant role in this region in the velocity update. While it may not be immediately apparent in Figure 1, we note that both $U_{0}$ (the black and solid curve) and $H_{\epsilon,c}$ (the orange and dashed curve) share exactly the same set of stationary points. As for the theoretical advantage of using IKSA, we shall prove that we can operate a faster logarithmic cooling schedule than KSA, relying on the key technical insight that the instantaneous spectral gap and the log-Sobolev constant are of the order $e^{c_{*}/\epsilon_{t}}$ .

While there are practical benefits in using IKSA over KSA or ISA over SA, on the other hand they come along with extra computational costs: in the velocity update of IKSA, in addition to evaluating the gradient of $U$ , we would need to evaluate both $f$ and its derivative $f^{\prime}$ at $(U(x)-c)_{+}$ , as the core idea of IKSA or ISA rests on comparing $U(X_{t})$ and $c$ for all $t$ . As such if we implement the Euler-Maruyama discretization of IKSA, extra function evaluations are required at each iteration.

The objective of this paper is to promote the idea of landscape modification and state-dependent noise in stochastic optimization. It has been brought to us by Miclo (2020) that Olivier Catoni has also considered the operation from $U$ to $g(U)$ in a discrete-time and finite state space setting with concave $g$ . Recently in Guo et al. (2020) the authors study perturbed gradient descent with state-dependent noise, using the notion of occupation time.

Refer to caption — Figure 1. Landscape of $U_{0}$ and $H_{\epsilon,c}$ , where $\epsilon=0.5$ , $c=-1.5$ and $f=\arctan$ . We evaluate the integral in $H_{\epsilon,c}$ using the numerical integration package in MATLAB.

We summarize the main contributions of this paper below:

(1)

Propose an improved kinetic simulated annealing (IKSA) method and analyze its convergence

In our main result Theorem 1.1 below, we will prove that under the logarithmic cooling of the form (1.2) with energy level $E>c_{*}$ , where both $c$ and $E$ are fixed, the improved kinetic annealing $(X_{t},Y_{t})_{t\geqslant 0}$ converges: for any $\delta>0$ ,

$\lim_{t\to\infty}\mathbb{P}\left(U(X_{t})>U_{min}+\delta\right)=0.$

This will be proved using the framework proposed by Monmarché (2018), along with the key technical insight that the log-Sobolev constant is of the order $e^{c_{*}/\epsilon_{t}}$ at time $t$ .
(2)

Propose an adaptive (IAKSA) method to tune the parameter $c$ and the energy level $E$ , and analyze its convergence

The convergence behaviour of $(X_{t},Y_{t})_{t\geqslant 0}$ in IKSA highly depends on the value of the parameter $c>U_{min}$ . Ideally we would like to choose $c$ to be close to $U_{min}$ (so that the clipped critical height $c_{*}$ is as small as possible), but it can be hard to achieve in practice without a priori information on $U$ .

In our second main result Theorem 1.2 below, we tune both $c$ and $E$ adaptively by incorporating the information of the running minimum $\min_{v\leqslant t}U(X_{v})$ , where both $c=(c_{t})_{t\geqslant 0}$ and $E=(E_{t})_{t\geqslant 0}$ depend on the running minimum generated by the algorithm on the fly. We call the resulting non-Markovian diffusion IAKSA. Although the setting is slightly different, this idea is in reminiscence of the adaptive biasing method Benaïm and Bréhier (2019); Benaïm et al. (2020); Lelièvre and Minoukadeh (2011); Lelièvre et al. (2008) or the self-interacting annealing method Raimond (2009), thus avoiding the need to manually tune the parameter $c$ for better performance. We also mention the related work of memory gradient diffusions Gadat et al. (2013); Gadat and Panloup (2014). Adaptive algorithms are popular in the Markov chain Monte Carlo (MCMC) literature as well Roberts and Rosenthal (2009); Andrieu and Thoms (2008); Fort et al. (2011). Note that in our context, the idea of tuning $c$ adaptively on the fly dates back to the work Fang et al. (1997) for ISA.
(3)

Present numerical experiments to illustrate the performance of IAKSA

We compare the performance of four simulated annealing methods, namely IAKSA, IASA (1.5) (i.e. ISA with the parameter $c$ tuned adaptively in the same way as IAKSA), KSA (1.3) and SA (1.1), on minimizing three standard global optimization benchmark functions Jamil and Yang (2013). Empirical results demonstrate the improved convergence performance of IAKSA over other annealing methods that are based on Langevin diffusions.

1.1. Notations

Before we discuss our main results, we fix a few notations that are frequently used throughout the paper. For $a,b\in\mathbb{R}$ , we write $a\wedge b=\min\{a,b\}$ and $a_{+}=\max\{a,0\}$ . For $\mathbf{v}\in\mathbb{R}^{d}$ , we write $\left\lVert\mathbf{v}\right\rVert$ to be its Euclidean norm. We also denote by $\partial_{x}$ to be the partial derivative with respect to $x$ . For two functions $g_{1},g_{2}$ on $\mathbb{R}$ , we write $g_{1}=\mathcal{O}(g_{2})$ if there exists constant $C>0$ such that $g_{1}(x)\leqslant Cg_{2}(x)$ for large enough $x$ . We write $g_{1}=\Omega(g_{2})$ if $g_{2}=\mathcal{O}(g_{1})$ . We also use the little-o notation: $g_{1}=o(g_{2})$ if $\lim_{x\to\infty}g_{1}(x)/g_{2}(x)=0$ . We say that a function $\xi(\epsilon)$ is a subexponential function if $\lim_{\epsilon\to 0}\epsilon\ln\xi(\epsilon)=0$ .

In the rest of the paper, as $f$ is fixed, we shall hide its dependence on various quantities. We will write $\pi_{\epsilon_{t}}=\pi_{\epsilon_{t}}^{f}$ and $\mu_{\epsilon_{t}}=\mu_{\epsilon_{t}}^{f}$ .

1.2. Overview of the main results

In this subsection, we state our main results. First, let us clearly state the assumptions on the target function $U$ , the function $f$ and the parameter $c$ . Note that the critical height $E_{*}$ and clipped critical height $c_{*}$ will be introduced in Section 2.3. These assumptions are standard in the simulated annealing literature.

Assumption 1.1.

(1)

The potential function $U$ is smooth with bounded second derivatives, that is,

\left\lVert\nabla_{x}^{2}U\right\rVert_{\infty}=\sup_{x\in\mathbb{R}^{d}}\sum_{i,j=1}^{d}\left(\partial_{x_{i}}\partial_{x_{j}}U(x)\right)^{2}<\infty.

Also, there exist constants $a_{1},a_{2},r,M>0$ such that $U$ satisfies

	$\displaystyle a_{1}\left\lVert x\right\rVert^{2}-M$	$\displaystyle\leqslant U(x)\leqslant a_{2}\left\lVert x\right\rVert^{2}+M,$
	$\displaystyle-\nabla_{x}U(x)\cdot x$	$\displaystyle\leqslant-r\left\lVert x\right\rVert^{2}+M,$

where $x\in\mathbb{R}^{d}$ .

(2)

The function $f:\mathbb{R}\to\mathbb{R}^{+}$ is twice-differentiable, bounded, non-negative and non-decreasing. Furthermore, $f$ satisfies

$f(0)=f^{\prime}(0)=f^{\prime\prime}(0)=0,$

and there exist constant $M_{3},M_{4}>0$ such that

$f(x)=M_{4},\quad\text{if }x\geqslant M_{3}.$

We also denote $M_{5}:=\sup_{0\leqslant x\leqslant M_{3}}f^{\prime}(x)$ .
(3)

The cooling schedule satisfies, for large enough $t$ ,

$|\partial_{t}\epsilon_{t}|=\mathcal{O}\left(\dfrac{1}{t}\right).$

This is for instance satisfied by the logarithmic cooling schedule (1.2).
(4)

The parameter $c$ is picked so that $c>U_{min}$ . In the adaptive case, $c_{t}>U_{min}$ for all $t\geqslant 0$ .
(5)

The initial law of $(X_{0},Y_{0})$ admits a smooth density with respect to the Lebesgue measure that we denote by $m_{0}$ . Its Fisher information $\int\left\lVert\nabla m_{0}\right\rVert^{2}/m_{0}\,dxdy$ and moments $\mathbb{E}\left(\left\lVert X_{0}\right\rVert^{p}+\left\lVert Y_{0}\right\rVert^{p}\right)$ are all finite, where $p\geqslant 0$ .

Our first main result gives large-time convergence guarantee for IKSA $(X_{t},Y_{t})_{t\geqslant 0}$ , introduced earlier in (1.8):

Theorem 1.1.

[Convergence of IKSA] Under Assumption 1.1, for any $\delta>0$ , as $t\to\infty$ we have

\displaystyle\mathbb{P}\left(U(X_{t})>U_{min}+\delta\right)\to 0.

If we employ the logarithmic cooling schedule of the form $\epsilon_{t}=\frac{E}{\log(t)}$ for large enough $t$ , where $E>c_{*}$ , and both $c$ and $E$ are fixed, then for any $\alpha,\delta>0$ , there exists constant $A>0$ such that

(1.11)

\displaystyle\mathbb{P}\left(U(X_{t})>U_{min}+\delta\right)\leqslant A\left(\dfrac{1}{t}\right)^{\min\bigg{\{}\frac{1-\frac{c_{*}}{E}-\alpha}{2},\frac{\delta}{2E}\bigg{\}}}.

If we compare the result of Monmarché (2018) for KSA (1.3) against the result we obtain in Theorem 1.1 for IKSA (1.8), in essence we replace the critical height $E_{*}$ by the clipped critical height $c_{*}$ . In retrospect this is perhaps unsurprising, as one can understand the modification in IKSA as clipping the target function from $U$ to $U\wedge c$ . While IKSA enjoys improved logarithmic cooling when compared with KSA, logarithmic cooling is however known to be inefficient, see Catoni (1992) and the Remark in (Monmarché, 2018, Section $1.2$ ).

In our second main result, we propose an adaptive method, that we call IAKSA, to tune the parameter $c=(c_{t})_{t\geqslant 0}$ and the energy level $E=(E_{t})_{t\geqslant 0}$ using the running minimum $\min_{v\leqslant t}U(X_{v})$ on the fly. We shall discuss in more technical details in Section 3. Note that the idea of tuning $c$ adaptively on the fly dates back to the work Fang et al. (1997) for ISA.

Theorem 1.2.

[Convergence of IAKSA] Under Assumption 1.1, consider the kinetic dynamics $(X_{t},Y_{t})_{t\geqslant 0}$ described by

	$\displaystyle dX_{t}$	$\displaystyle=Y_{t}\,dt,$
	$\displaystyle dY_{t}$	$\displaystyle=-\dfrac{1}{\epsilon_{t}}Y_{t}\,dt-\epsilon_{t}\nabla_{x}H_{\epsilon_{t},c_{t}}(X_{t})\,dt+\sqrt{2}\,dB_{t},$

where $H_{\epsilon_{t},c_{t}}$ is introduced in (1.7), $c_{t}$ is tuned adaptively according to (3.1) and the cooling schedule is

\epsilon_{t}=\dfrac{E_{t}}{\ln t}

with $E_{t}$ satisfying (3.2). Given $\delta>0$ , for large enough $t$ and a constant $A>0$ , we consider sufficiently small $\alpha$ such that $\alpha\in(0,\frac{\delta_{2}-\delta_{1}}{U(X_{0})-U_{min}+\delta_{2}})$ , and select $\delta_{1},\delta_{2}>0$ such that $0<\delta_{2}-\delta_{1}<\delta$ , to yield

\displaystyle\mathbb{P}\left(U(X_{t})>U_{min}+\delta\right)\leqslant A\left(\dfrac{1}{t}\right)^{a},

where

a:=\min\bigg{\{}\frac{\frac{\delta_{2}-\delta_{1}}{\delta+\delta_{2}}-\alpha}{2},\frac{\frac{\delta_{2}-\delta_{1}}{U(X_{0})-U_{min}+\delta_{2}}-\alpha}{2}\bigg{\}}.

1.3. Numerical results

In this subsection, we present our numerical illustrations. The benchmark functions are the Rastrigin function $U_{3}$ , Ackley3 function $U_{2}$ and Ackley function $U_{1}$ . For further details on the experimental setup and the parameters used (such as initialization, stepsize or the cooling schedule), these are described in the Appendix.

We mimic Figure $3$ in Monmarché (2018), and we plot the corresponding results in Figure 2 for the three benchmark functions. On the vertical axis, we plot $\log_{10}\mathbb{P}\left(\min_{v\leqslant t}U(X_{v})>U_{min}+\delta\right)$ or $\log_{10}\mathbb{P}\left(\min_{v\leqslant t}U(Z_{v})>U_{min}+\delta\right)$ against $\log_{10}t$ in Figure (2(a)), (2(c)) and (2(e)), and similarly we plot $\log_{10}\mathbb{P}\left(U(X_{t})>U_{min}+\delta\right)$ or $\log_{10}\mathbb{P}\left(U(Z_{t})>U_{min}+\delta\right)$ against $\log_{10}t$ in Figure (2(b)), (2(d)) and (2(f)). To compute these probabilities, we run 100 independent replicas and count the proportion of replicas for which $U(X_{t})>U_{min}+\delta$ or $\min_{v\leqslant t}U(X_{v})>U_{min}+\delta$ . We inject the same sequence of Gaussian noise in each of the 100 replicas across all four annealing methods for fair comparison.

KSA and SA can be considered as the baseline algorithms for IAKSA and IASA respectively. In all of the plots in Figure (2) IAKSA outperforms KSA, revealing that there is perhaps empirical advantage in using IAKSA over classical KSA in some instances.

For the Rastrigin function $U_{3}$ in Figure (2(a)) and (2(b)), we note that IAKSA enjoys improved convergence over other methods, while SA does not seem to converge near $U_{min}$ at all. For IASA and KSA, although their running minimum reach the neighbourhood of $U_{min}$ when $\log_{10}t$ is approximately 3 to 5, as evident from Figure (2(b)) they however do not get stuck at the neighbourhood.

For the Ackley3 function $U_{2}$ in Figure (2(c)) and (2(d)), we first observe that the curve corresponding to the running minimum of IASA drops fast but stays flat when $\log_{10}t$ is bigger than 5. In Figure (2(d)), the only method that lingers around the neighbourhood of $U_{min}$ is IAKSA.

For the Ackley function $U_{1}$ in Figure (2(e)) and (2(f)), the two overdamped methods IASA and SA seem to outperform the two kinetic methods IAKSA and KSA.

1.4. Organization of the paper

The rest of this paper is organized as follow. In Section 2, we present the proof of Theorem 1.1. In Section 3, we state the adaptive method IAKSA along with its proof.

2. Proof of Theorem 1.1

In this section, we prove the convergence of the improved kinetic dynamics with fixed $c$ and energy level $E$ . We employ the same proof strategy as in Monmarché (2018) and break down the proof into smaller parts. Finally, in Section 2.8 we connect the auxiliary results and finish off the proof of Theorem 1.1.

2.1. Weak convergence of $\mu_{\epsilon}$

In this subsection, we prove that $\mu_{\epsilon}$ converges weakly to the set of global minima of $U$ as $\epsilon\to 0$ . Recall that $\mu_{\epsilon}$ is first introduced in (1.6) as the stationary distribution of the improved annealing ISA method. Note that similar result have been obtained in Fang et al. (1997).

Proposition 2.1.

Under Assumption 1.1, for a fixed $\epsilon>0$ , suppose the law of $X$ is $\mu_{\epsilon}$ . Then for sufficiently small $\delta>0$ with $\delta\in(0,c-U_{min})$ , there exists a constant $D_{2}=D_{2}(\delta)>0$ , independent of $\epsilon$ , such that

\mathbb{P}\left(U(X)>U_{min}+\delta\right)\leqslant D_{2}e^{-\frac{\delta}{2\epsilon}}.

[Proof. ]First, denote $a:=U_{min}+\delta<c$ . For any Borel set $S\subset\mathbb{R}^{d}$ , we denote $\mathrm{Vol}(S)$ to be its Lebesgue volume. Note that an equivalent way of writing down $\mu_{\epsilon}$ is

	$\displaystyle\mu_{\epsilon}(x)$	$\displaystyle=\dfrac{1}{\overline{\Lambda}_{\mu_{\epsilon}}}\dfrac{1}{f((U(x)-c)_{+})+\epsilon}\exp\bigg{\{}-\int_{a}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du\bigg{\}},$
	$\displaystyle\overline{\Lambda}_{\mu_{\epsilon}}$	$\displaystyle=\int_{\mathbb{R}^{d}}\dfrac{1}{f((U(x)-c)_{+})+\epsilon}\exp\bigg{\{}-\int_{a}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du\bigg{\}}\,dx$
		$\displaystyle\geqslant\dfrac{1}{\epsilon}\int_{\{U(x)\leqslant a-\delta/2\}}\exp\bigg{\{}\int_{U(x)}^{a}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du\bigg{\}}\,dx$
		$\displaystyle\geqslant\dfrac{1}{\epsilon}\int_{\{U(x)\leqslant a-\delta/2\}}\exp\bigg{\{}\dfrac{1}{\epsilon}(a-U(x))\bigg{\}}\,dx$
		$\displaystyle\geqslant\dfrac{1}{\epsilon}e^{\frac{\delta}{2\epsilon}}\mathrm{Vol}(\{U(x)\leqslant a-\delta/2\})=:\dfrac{1}{\epsilon}e^{\frac{\delta}{2\epsilon}}D_{1},$

where the set $\{U(x)\leqslant a-\delta/2\}$ is compact as $U$ is quadratic at infinity. Define

\displaystyle A_{\delta}(x):=U_{min}+\delta+\left\lVert x\right\rVert.

Note that under Assumption 1.1, since $U$ is quadratic at infinity, the set $S_{\delta}:=\{U(x)<A_{\delta}(x)\}$ is compact. Therefore, we have

	$\displaystyle\mathbb{P}\left(U(X)>U_{min}+\delta\right)$	$\displaystyle=\dfrac{1}{\overline{\Lambda}_{\mu_{\epsilon}}}\int_{\{U(x)>U_{min}+\delta\}}\dfrac{1}{f((U(x)-c)_{+})+\epsilon}\exp\bigg{\{}-\int_{a}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du\bigg{\}}\,dx$
		$\displaystyle\leqslant\dfrac{e^{-\frac{\delta}{2\epsilon}}}{D_{1}}\int_{\{A_{\delta}(x)>U(x)>U_{min}+\delta\}}\exp\bigg{\{}-\dfrac{U(x)-a}{M_{4}+\epsilon}\bigg{\}}\,dx$
		$\displaystyle\quad+\dfrac{e^{-\frac{\delta}{2\epsilon}}}{D_{1}}\int_{\{U(x)>A_{\delta}(x)\}}\exp\bigg{\{}-\dfrac{U(x)-a}{M_{4}+\epsilon}\bigg{\}}\,dx$
		$\displaystyle\leqslant\dfrac{1}{D_{1}}\mathrm{Vol}(S_{\delta})e^{-\frac{\delta}{2\epsilon}}+\dfrac{1}{D_{1}}e^{-\frac{\delta}{2\epsilon}}\int_{\mathbb{R}^{d}}e^{-\frac{1}{M_{4}+\epsilon}\left\lVert x\right\rVert}\,dx.$

2.2. Existence and regularity for the density of IKSA

First, we note that the infinitesimal generator of the improved kinetic dynamics $(X_{t},Y_{t})_{t\geqslant 0}$ (1.8) at a fixed temperature $\epsilon>0$ is

(2.1)

\displaystyle L_{\epsilon}=L_{\epsilon,c}:=y\cdot\nabla_{x}-\left(\dfrac{y}{\epsilon}+\epsilon\nabla_{x}H_{\epsilon}\right)\cdot\nabla_{y}+\Delta_{y},

where we recall $H_{\epsilon}=H_{\epsilon,c}$ is first introduced in (1.7), and $\nabla_{x}\cdot$ (resp. $\nabla_{y}\cdot$ ) is the divergence operator with respect to the variable $x$ (resp. $y$ ). When there is no ambiguity on the parameter $c$ , we simply hide the dependency on $c$ and write $H_{\epsilon}=H_{\epsilon,c}$ and $L_{\epsilon}=L_{\epsilon,c}$ .

We show that the density of $(X_{t},Y_{t})_{t\geqslant 0}$ is nice:

Proposition 2.2.

Under Assumption 1.1, the process $(X_{t},Y_{t})_{t\geqslant 0}$ is well-defined and the second moment $\mathbb{E}(\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2})$ is finite for all $t$ . The Lebesgue density of $(X_{t},Y_{t})$ , denoted by $m_{t}$ , is smooth and positive.

h_{t}:=\dfrac{dm_{t}}{d\pi_{\epsilon_{t}}}

is well-defined and smooth.

[Proof. ]We follow the same proof as in (Monmarché, 2018, Proposition 4 and 5).

First, we show the non-explosiveness and the finite second moment result. Consider the homogeneous Markov process $(X_{t},Y_{t},t)_{t\geqslant 0}$ , its generator $\mathcal{L}=L_{\epsilon_{t}}+\partial_{t}$ and Hamiltonian function

\mathcal{H}(x,y,t)=H_{\epsilon_{t}}(x)-\ln\epsilon_{t}+\dfrac{\left\lVert y\right\rVert^{2}}{2\epsilon_{t}}+1.

Note that for $s\leqslant t$ :

	$\displaystyle\mathcal{L}\mathcal{H}(x,y,s)$	$\displaystyle=\dfrac{1}{\epsilon_{s}}-\dfrac{\left\lVert y\right\rVert^{2}}{\epsilon_{s}^{2}}-\epsilon_{s}^{\prime}\left(\int_{U_{min}}^{U(x)}\dfrac{1}{\left(f((u-c)_{+})+\epsilon_{s}\right)^{2}}\,du+\frac{\left\lVert y\right\rVert^{2}}{2\epsilon_{s}^{2}}\right)+\dfrac{\epsilon_{s}^{\prime}}{f((U(x)-c)_{+})+\epsilon_{s}}-\dfrac{\epsilon_{s}^{\prime}}{\epsilon_{s}}$
		$\displaystyle\leqslant C\mathcal{H}(x,y,s),$

where the constant $C$ depends the bounds of $\epsilon_{s}$ and its derivative on $[0,t]$ . Using the Markov inequality and the Ito’s formula, we have $\mathbb{E}(\mathcal{H}(X_{t},Y_{t}))<\infty$ which implies that the second moment is finite.

For the smoothness of $m_{t}$ , one can readily check the Hormander’s bracket condition is satisified.

For the positiveness of $m_{t}$ , we apply exactly the same argument as in Proposition $5$ of Monmarché (2018), with $F_{t}(x,y)$ therein replaced by $F_{t}(x,y)=\nabla_{x}H_{\epsilon_{t}}(x)+\frac{y}{\epsilon_{t}}$ .

2.3. The log-Sobolev inequality

In this subsection, we prove the log-Sobolev inequality for the improved kinetic Langevin process $(X_{t},Y_{t})_{t\geqslant 0}$ . First, let us briefly recall the concept of critical height $E_{*}$ that frequently appears in various classical and modern work of the annealing literature Jacquot (1992); Holley and Stroock (1988); Miclo (1992). For two points $x,y\in\mathbb{R}^{d},$ we write $\Gamma_{x,y}$ to be the set of $C^{1}$ parametric curves that start at $x$ and end at $y$ . Given a target function $U$ , the classical critical height $E_{*}$ of $U$ is then defined to be

(2.2)

\displaystyle E_{*}=E_{*}(U):=\sup_{x,y\in\mathbb{R}^{d}}\inf_{\gamma\in\Gamma_{x,y}}\left\{\sup_{t}\{U(\gamma(t))\}-U(x)-U(y)+\inf U\right\}.

One major motivation of the current work is the introduction of the function $f$ and the parameter $c$ that control the injection of the Gaussian noise into the system, which allows one to clip the critical height $E_{*}$ at an arbitrary level $c$ , thus effectively reducing $E_{*}$ . Precisely, for an arbitrary $\delta_{1}>0$ and $c>U_{min}$ , we define $c_{*}$ to be

(2.3)

\displaystyle c_{*}=c_{*}(U,c,\delta_{1}):=\sup_{x,y\in\mathbb{R}^{d}}\inf_{\gamma\in\Gamma_{x,y}}\left\{\sup_{t}\{U(\gamma(t))\wedge(c+\delta_{1})\}-U(x)\wedge c-U(y)\wedge c+\inf U\right\}.

For probability measure $\nu$ on $\mathbb{R}^{2d}$ and smooth positive function $h$ on $\mathbb{R}^{2d}$ such that $\nu(h)=1$ , we define the relative entropy and Fisher information to be respectively

	$\displaystyle\mathrm{Ent}_{\nu}(h)$	$\displaystyle=\int h\ln h\,d\nu,$
	$\displaystyle I_{\nu}(h)$	$\displaystyle=\int\dfrac{\|\nabla h\|^{2}}{h}\,d\nu.$

Proposition 2.3.

Under Assumption 1.1 and suppose the temperature $\epsilon$ is fixed. For any arbitrary $\delta_{1}>0$ and positive smooth function $h$ with $\pi_{\epsilon}(h)=1$ , there exists $c_{*}=c_{*}(U,c,\delta_{1})$ and a polynomial function $p_{\delta_{1}}(1/\epsilon)$ (which may depend on $\delta_{1}$ ) such that

\displaystyle\mathrm{Ent}_{\pi_{\epsilon}}(h)

\displaystyle\leqslant\max\left(\dfrac{\epsilon}{2},p_{\delta_{1}}(1/\epsilon)e^{\frac{c_{*}}{\epsilon}}\right)I_{\pi_{\epsilon}}(h).

Remark 2.1.

In Fang et al. (1997), by introducing assumption on the behaviour of $f$ near $0$ (Assumption $(H4)$ therein), they demonstrate the log-Sobolev constant is of the order $\mathcal{O}(e^{\frac{c_{*}(U,c,0)}{\epsilon}})$ , while in our Assumption 1.1, we do not place such assumption on $f$ , at the trade-off of introducing an error $\delta_{1}>0$ that appears in $p_{\delta_{1}}(1/\epsilon)$ . Alternatively, we can insert extra assumptions and simply use the result obtained in Fang et al. (1997) for the log-Sobolev inequality.

[Proof. ]First, note that $\pi_{\epsilon}$ can be written as a tensor product of $\mu_{\epsilon}$ (for the $x$ coordinates) and a Gaussian distribution of mean $0$ with variance $\epsilon$ (for the $y$ coordinates). Since the log-Sobolev inequality tensorizes and is stable under perturbation (see e.g. the references as in Monmarché (2018); Chak et al. (2020)), it suffices for us to determine the log-Sobolev constant of $\mu_{\epsilon}$ .

Let us recall the improved overdamped Langevin dynamics at temperature $\epsilon$ is described by the dynamics

dZ_{t}=-\nabla U(Z_{t})\,dt+\sqrt{2\left(f(U(Z_{t})-c)_{+}+\epsilon\right)}\,dB_{t},

with generator $\overline{L}_{\epsilon}$ and $L^{2}(\mu_{\epsilon})$ -spectral gap

\overline{\lambda}_{2}=\inf_{h\in L^{2}(\mu_{\epsilon});~{}\mu_{\epsilon}(h)=0}\dfrac{\langle-\overline{L}_{\epsilon}h,h\rangle_{\mu_{\epsilon}}}{\langle h,h\rangle_{\mu_{\epsilon}}},

where we write $\langle g,h\rangle_{\mu_{\epsilon}}=\int gh\,d\mu_{\epsilon}$ to be the inner product in $L^{2}(\mu_{\epsilon})$ for $g,h\in L^{2}(\mu_{\epsilon})$ . Define

V_{\epsilon}(x):=\int_{U_{min}}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du,\quad\nu_{\epsilon}(x)\propto e^{-V_{\epsilon}(x)},

Since $\{h\in L^{2}(\nu_{\epsilon});~{}\nu_{\epsilon}(h)=0\}\subset\{h\in L^{2}(\mu_{\epsilon});~{}\mu_{\epsilon}(h)=0\}$ , we thus have

\overline{\lambda}_{2}\leqslant(M_{4}+\epsilon)\inf_{h\in L^{2}(\nu_{\epsilon});~{}\nu_{\epsilon}(h)=0}\dfrac{\int\left\lVert\nabla h\right\rVert^{2}\,d\nu_{\epsilon}}{\langle h,h\rangle_{\nu_{\epsilon}}}.

Therefore, we can pretend our target function is $V_{\epsilon}$ at temperature $1$ in the classical Langevin dynamics, and utilize the result of Jacquot (1992) to conclude that, for a constant $A>0$ independent of $\epsilon$ ,

(2.4)

\displaystyle\overline{\lambda}_{2}\leqslant Ae^{E_{*}(V_{\epsilon})}.

If we show that

(2.5)

\displaystyle E_{*}(V_{\epsilon})\leqslant\frac{c_{*}}{\epsilon}+C_{\delta_{1},U},

where $C_{\delta_{1},U}$ is a constant that depends on $\delta_{1}$ and $U$ but not on $\epsilon$ , the desired result follows from the above spectral gap estimate and (Fang et al., 1997, Theorem $4.4$ ).

In order to apply the result of Jacquot (1992), we check Condition $(A)$ and $(B)$ therein:

•

Condition $(A)$ : $\left\lVert\nabla_{x}V_{\epsilon}\right\rVert\to\infty$ and $V_{\epsilon}\to\infty$ as $\left\lVert x\right\rVert\to\infty$ since $U$ is quadratic at infinity.

•

Condition $(B)$ : Outside the ball of $\{U(x)-c>M_{3}\}$ ,

	$\displaystyle\nabla_{x}V_{\epsilon}(x)$	$\displaystyle=\dfrac{1}{f((U(x)-c)_{+})+\epsilon}\nabla_{x}U(x)=\dfrac{1}{M_{4}+\epsilon}\nabla_{x}U(x),$
	$\displaystyle\Delta V_{\epsilon}(x)$	$\displaystyle=\dfrac{1}{M_{4}+\epsilon}\Delta U(x),$

and so $\left\lVert\nabla_{x}V_{\epsilon}\right\rVert^{2}-\Delta V_{\epsilon}$ is bounded below outside the ball $\{U(x)-c>M_{3}\}$ , and hence it is bounded below for all $x\in\mathbb{R}^{d}$ .

Now, we observe that, for any $x\in\mathbb{R}^{d}$ ,

	$\displaystyle\inf V_{\epsilon}$	$\displaystyle=0,$
	$\displaystyle V_{\epsilon}(x)$	$\displaystyle=\dfrac{1}{\epsilon}\left(U(x)\wedge c-U_{min}\right)+\int_{U(x)\wedge c}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du$
		$\displaystyle\geqslant\dfrac{1}{\epsilon}\left(U(x)\wedge c-U_{min}\right),$
	$\displaystyle V_{\epsilon}(x)$	$\displaystyle\leqslant\dfrac{1}{\epsilon}\left(U(x)\wedge(c+\delta_{1})-U_{min}\right)+\dfrac{1}{f(\delta_{1})}(U(x)-U(x)\wedge(c+\delta_{1})).$

Writing $B(0,R)$ to be the ball center at $0$ with sufficiently large radius $R$ , we finally show (2.5):

	$\displaystyle E_{*}(V_{\epsilon})$	$\displaystyle=\sup_{x,y\in\mathbb{R}^{d}}\inf_{\gamma\in\Gamma_{x,y}}\left\{\sup_{t}\{V_{\epsilon}(\gamma(t))\}-V_{\epsilon}(x)-V_{\epsilon}(y)\right\}$
		$\displaystyle=\sup_{x,y\in B(0,R)}\inf_{\gamma\in\Gamma_{x,y}}\left\{\sup_{t}\{V_{\epsilon}(\gamma(t))\}-V_{\epsilon}(x)-V_{\epsilon}(y)\right\}$
		$\displaystyle\leqslant\dfrac{c_{}}{\epsilon}+\sup_{x\in B(0,R)}\dfrac{1}{f(\delta_{1})}(U(x)-U(x)\wedge(c+\delta_{1}))=:\dfrac{c_{}}{\epsilon}+C_{\delta_{1},U}.$

2.4. Lyapunov function and moment estimates

The aim of this section is to prove the following moment estimate of $(X_{t},Y_{t})_{t\geqslant 0}$ :

Proposition 2.4.

For any $p\in\mathbb{N}$ , $\alpha>0$ and large enough $t$ (which depends on $p,U,f$ and the temperature schedule $\epsilon_{t}$ ), there exist a constant $k$ , independent of $t$ , such that

\displaystyle\mathbb{E}\left(\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)^{p}\leqslant k(1+t)^{\alpha}.

We first prove the following Lyapunov property in Section 2.4.1, followed by proving Proposition 2.4 in Section 2.4.2.

Proposition 2.5 (Lyapunov property of $R_{\epsilon}$ ).

Let

(2.6)

\displaystyle R_{\epsilon}(x,y)=R_{\epsilon,c}(x,y):=\epsilon H_{\epsilon}(x)+\dfrac{\left\lVert y\right\rVert^{2}}{2}+\epsilon^{3}x\cdot y.

For $\epsilon>1$ , there exist constants $c_{1},c_{4}$ and $c_{5}(\epsilon)=\mathcal{O}\left(\epsilon^{3}\ln\left(1/\epsilon\right)\right)$ a subexponential function of $\epsilon$ such that

L_{\epsilon}(R_{\epsilon})\leqslant-c_{4}\epsilon^{5}R_{\epsilon}+c_{5}(\epsilon)+c_{1}.

2.4.1. Proof of Proposition 2.5

As we have

	$\displaystyle L_{\epsilon}\left(\dfrac{\left\lVert y\right\rVert^{2}}{2}\right)$	$\displaystyle=-\dfrac{\left\lVert y\right\rVert^{2}}{\epsilon}-\epsilon\nabla_{x}H_{\epsilon}\cdot y+d,$
	$\displaystyle L_{\epsilon}(\epsilon H_{\epsilon})$	$\displaystyle=y\cdot\epsilon\nabla_{x}H_{\epsilon},$

summing up these two equations leads to

(2.7)

\displaystyle L_{\epsilon}\left(\epsilon H_{\epsilon}+\dfrac{\left\lVert y\right\rVert^{2}}{2}\right)=-\dfrac{\left\lVert y\right\rVert^{2}}{\epsilon}+d.

Note that we also have

(2.8)

\displaystyle L_{\epsilon}(x\cdot y)=\left\lVert y\right\rVert^{2}-\dfrac{x\cdot y}{\epsilon}-\epsilon\nabla_{x}H_{\epsilon}\cdot x.

We proceed to give an upper bound on the second and the third term in the above equation. We first consider lower bounding $\nabla_{x}H_{\epsilon}\cdot x$ :

	$\displaystyle\nabla_{x}H_{\epsilon}\cdot x$	$\displaystyle=\dfrac{1+f^{\prime}(U(x)-c)_{+}}{f((U(x)-c)_{+})+\epsilon}\nabla_{x}U(x)\cdot x$
		$\displaystyle\geqslant\dfrac{1+f^{\prime}(U(x)-c)_{+}}{f((U(x)-c)_{+})+\epsilon}\left(r\left\lVert x\right\rVert^{2}-M\right)$
(2.9)			$\displaystyle\geqslant\dfrac{r}{M_{4}+\epsilon_{0}}\left\lVert x\right\rVert^{2}-\dfrac{(M_{5}+1)M}{\epsilon}:=\overline{r}_{2}\left\lVert x\right\rVert^{2}-\overline{M}_{\epsilon},$

where the first inequality follows from Assumption 1.1 item 1, and we recall both $M_{4}$ and $M_{5}$ are introduced in Assumption 1.1 item 2. Next, as we have

\displaystyle|x\cdot y|\leqslant\dfrac{\overline{r}_{2}\epsilon^{2}}{2}\left\lVert x\right\rVert^{2}+\dfrac{1}{2\overline{r}_{2}\epsilon^{2}}\left\lVert y\right\rVert^{2},

dividing by $\epsilon$ leads to

(2.10)

\displaystyle-\dfrac{x\cdot y}{\epsilon}\leqslant\dfrac{|x\cdot y|}{\epsilon}\leqslant\dfrac{\overline{r}_{2}\epsilon}{2}\left\lVert x\right\rVert^{2}+\dfrac{1}{2\overline{r}_{2}\epsilon^{3}}\left\lVert y\right\rVert^{2}.

We substitute (2.4.1) and (2.10) into (2.8) to yield

	$\displaystyle L_{\epsilon}(x\cdot y)$	$\displaystyle\leqslant\left(1+\dfrac{1}{2\overline{r}_{2}\epsilon^{3}}\right)\left\lVert y\right\rVert^{2}-\dfrac{\epsilon}{2}\left(2\nabla_{x}H_{\epsilon}\cdot x-\overline{r}_{2}\left\lVert x\right\rVert^{2}\right)$
(2.11)			$\displaystyle\leqslant\left(1+\dfrac{1}{2\overline{r}_{2}\epsilon^{3}}\right)\left\lVert y\right\rVert^{2}-\dfrac{\epsilon}{2}\left(\overline{r}_{2}\left\lVert x\right\rVert^{2}-2\overline{M}_{\epsilon}\right).$

The next two results give upper and lower bounds for $H_{\epsilon}$ and $R_{\epsilon}$ .

Lemma 2.1.

For any $\delta>0$ , the upper bound of $H_{\epsilon}$ is

H_{\epsilon}(x)\leqslant\begin{cases}\dfrac{1}{\epsilon}\left(a_{2}\left\lVert x\right\rVert^{2}+M-U_{min}\right)+\ln(M_{4}+\epsilon_{0}),&\mbox{if }U(x)\leqslant c+\delta,\\ \dfrac{1}{\epsilon}\left(c+\delta-U_{min}\right)+\dfrac{1}{f(\delta)}\left(a_{2}\left\lVert x\right\rVert^{2}+M-(c+\delta)\right)+\ln(M_{4}+\epsilon_{0}),&\mbox{if }U(x)>c+\delta.\end{cases}

Consequently, we have

\left\lVert x\right\rVert^{2}\geqslant\begin{cases}\dfrac{1}{a_{2}}\left(\epsilon\left(H_{\epsilon}(x)-\ln(M_{4}+\epsilon_{0})\right)-M+U_{min}\right),&\mbox{if }U(x)\leqslant c+\delta,\\ \dfrac{1}{a_{2}}\left(f(\delta)\left(H_{\epsilon}(x)-\dfrac{1}{\epsilon}\left(c+\delta-U_{min}\right)-\ln(M_{4}+\epsilon_{0})\right)+(c+\delta)-M\right),&\mbox{if }U(x)>c+\delta.\end{cases}

On the other hand, the lower bound of $H_{\epsilon}$ is

H_{\epsilon}(x)\geqslant\dfrac{1}{M_{4}+\epsilon}\left(a_{1}\left\lVert x\right\rVert^{2}-M-U_{min}\right)+\ln\epsilon.

Consequently, we have

\left\lVert x\right\rVert^{2}\leqslant\dfrac{1}{a_{1}}\left((M_{4}+\epsilon)H_{\epsilon}(x)-(M_{4}+\epsilon)\ln\epsilon+M+U_{min}\right).

[Proof. ]We shall only prove the bounds on $H_{\epsilon}$ , and from these bounds it is straightforward to deduce the bounds on $\left\lVert x\right\rVert^{2}$ . We first prove the upper bound of $H_{\epsilon}$ . If $U(x)\leqslant c+\delta$ , we have

H_{\epsilon}(x)\leqslant\dfrac{1}{\epsilon}(U(x)-U_{min})+\ln(M_{4}+\epsilon_{0})\leqslant\dfrac{1}{\epsilon}\left(a_{2}\left\lVert x\right\rVert^{2}+M-U_{min}\right)+\ln(M_{4}+\epsilon_{0}),

where the second inequality follows from the assumption on $U$ in Assumption 1.1. If $U(x)>c+\delta$ ,

	$\displaystyle H_{\epsilon}(x)$	$\displaystyle=\int_{U_{min}}^{c+\delta}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du+\int_{c+\delta}^{U(x)}\dfrac{1}{f((u-c)_{+})+\epsilon}\,du+\ln\left(f((U(x)-c)_{+})+\epsilon\right)$
		$\displaystyle\leqslant\dfrac{1}{\epsilon}\left(c+\delta-U_{min}\right)+\dfrac{1}{f(\delta)}\left(a_{2}\left\lVert x\right\rVert^{2}+M-(c+\delta)\right)+\ln(M_{4}+\epsilon_{0}),$

where the inequality follows from Assumption 1.1 again. For the lower bound of $H_{\epsilon}$ , we consider

H_{\epsilon}(x)\geqslant\dfrac{1}{M_{4}+\epsilon}(U(x)-U_{min})+\ln\epsilon\geqslant\dfrac{1}{M_{4}+\epsilon}\left(a_{1}\left\lVert x\right\rVert^{2}-M-U_{min}\right)+\ln\epsilon,

where the second inequality follows from Assumption 1.1.

Lemma 2.2.

For the upper bound of $R_{\epsilon}$ , we have

R_{\epsilon}(x,y)\leqslant\left(\epsilon+\epsilon^{3}\dfrac{M_{4}+\epsilon}{a_{1}}\right)H_{\epsilon}(x)+\left(\dfrac{1}{2}+\dfrac{\epsilon^{3}}{2}\right)\left\lVert y\right\rVert^{2}+\dfrac{\epsilon^{3}}{a_{1}}\left(-(M_{4}+\epsilon)\ln\epsilon+M+U_{min}\right).

On the other hand, the lower bound of $R_{\epsilon}$ is

R_{\epsilon}(x,y)\geqslant\dfrac{\epsilon}{M_{4}+\epsilon}\left(a_{1}\left\lVert x\right\rVert^{2}-M-U_{min}\right)+\epsilon\ln\epsilon+\dfrac{\left\lVert y\right\rVert^{2}}{2}-\epsilon^{3}\left(\left\lVert x\right\rVert^{2}/2+\left\lVert y\right\rVert^{2}/2\right).

[Proof. ]Using $x\cdot y\leqslant\left\lVert x\right\rVert^{2}/2+\left\lVert y\right\rVert^{2}/2$ , the desired results follow from (2.6) and Lemma 2.1.

Using Lemma 2.1 together with (2.4.1) leads to, for some constants $c_{1},c_{2},c_{3}$ ,

\displaystyle\epsilon^{3}L_{\epsilon}(x\cdot y)

\displaystyle\leqslant\left(\epsilon^{3}+\dfrac{1}{2\overline{r}_{2}}\right)\left\lVert y\right\rVert^{2}-c_{3}\epsilon^{5}H_{\epsilon}(x)+c_{1},

and hence, when combined with (2.7) and (2.6),

L_{\epsilon}(R_{\epsilon})\leqslant-c_{2}\epsilon^{5}\left\lVert y\right\rVert^{2}-c_{3}\epsilon^{5}H_{\epsilon}(x)+c_{1}.

Finally, the above equation together with Lemma 2.2 gives, for some constants $c_{4}$ and $c_{5}(\epsilon)$ a subexponential function of $\epsilon$ ,

L_{\epsilon}(R_{\epsilon})\leqslant-c_{4}\epsilon^{5}R_{\epsilon}(x,y)+c_{5}(\epsilon)+c_{1}.

2.4.2. Proof of Proposition 2.4

Proposition 2.6.

For any $p\in\mathbb{N}$ and $\epsilon>1$ , there exist subexponential functions $C_{p,1}(\epsilon),C_{p,2}(\epsilon)$ such that

\dfrac{\partial}{\partial\epsilon}R_{\epsilon}^{p}\leqslant C_{p,1}(\epsilon)R_{\epsilon}^{p}+C_{p,2}(\epsilon)R_{\epsilon}^{p-1}.

[Proof. ]For some constants $c_{6},c_{7}>0$ , we compute

	$\displaystyle\dfrac{\partial}{\partial\epsilon}R_{\epsilon}^{p}$	$\displaystyle=pR_{\epsilon}^{p-1}\left(\epsilon\dfrac{\partial}{\partial\epsilon}H_{\epsilon}+H_{\epsilon}+3\epsilon^{2}x\cdot y\right)$
		$\displaystyle\leqslant pR_{\epsilon}^{p-1}\left(1+\dfrac{1}{\epsilon}\left(a_{2}\left\lVert x\right\rVert^{2}+M-U_{min}\right)+\ln(M_{4}+\epsilon_{0})+3\epsilon^{2}(\left\lVert x\right\rVert^{2}/2+\left\lVert y\right\rVert^{2}/2)\right)$
		$\displaystyle\leqslant pR_{\epsilon}^{p-1}\left(\dfrac{c_{6}}{\epsilon}(\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})+1+\dfrac{1}{\epsilon}\left(M-U_{min}\right)+\ln(M_{4}+\epsilon_{0})\right).$

Now, we use the lower bound of $R_{\epsilon}$ in Lemma 2.2 to see that

R_{\epsilon}(x,y)\geqslant c_{7}\epsilon(\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})-\epsilon\dfrac{M+U_{min}}{M_{4}+\epsilon}+\epsilon\ln\epsilon,

and so

	$\displaystyle\dfrac{\partial}{\partial\epsilon}R_{\epsilon}^{p}$	$\displaystyle\leqslant pR_{\epsilon}^{p-1}\left(\dfrac{c_{6}}{c_{7}\epsilon^{2}}\left(R_{\epsilon}(x,y)+\epsilon\dfrac{M+U_{min}}{M_{4}+\epsilon}-\epsilon\ln\epsilon\right)+1+\dfrac{1}{\epsilon}\left(M-U_{min}\right)+\ln(M_{4}+\epsilon_{0})\right)$
		$\displaystyle=:C_{p,1}(\epsilon)R_{\epsilon}^{p}+C_{p,2}(\epsilon)R_{\epsilon}^{p-1}.$

The carré du champ operator $\Gamma_{\epsilon}$ of $L_{\epsilon}$ is $\Gamma_{\epsilon}f=\dfrac{1}{2}L_{\epsilon}f^{2}-fL_{\epsilon}f=\left\lVert\nabla_{y}f\right\rVert^{2}$ . Using the lower bound of $R_{\epsilon}$ in Lemma 2.2, we compute, for constant $c_{8}$ and subexponential functions $C_{3}(\epsilon),C_{4}(\epsilon)$ ,

	$\displaystyle\Gamma_{\epsilon}R_{\epsilon}=\left\lVert\nabla_{y}R_{\epsilon}\right\rVert^{2}$	$\displaystyle=\left\lVert y+\epsilon^{3}x\right\rVert^{2}$
		$\displaystyle\leqslant c_{8}\left(\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2}\right)$
(2.12)			$\displaystyle\leqslant\dfrac{c_{8}}{c_{7}\epsilon}\left(R_{\epsilon}(x,y)+\epsilon\dfrac{M+U_{min}}{M_{4}+\epsilon}-\epsilon\ln\epsilon\right)=:C_{3}(\epsilon)R_{\epsilon}(x,y)+C_{4}(\epsilon).$

Proposition 2.7.

For any $p\in\mathbb{N}$ , $\alpha>0$ and large enough $t$ , there exist a constant $\widetilde{C}_{p,\alpha}$ such that

\mathbb{E}\left(R_{\epsilon_{t}}^{p}(X_{t},Y_{t})\right)\leqslant\widetilde{C}_{p,\alpha}(1+t)^{1+\alpha}.

[Proof. ]We shall prove the result by induction on $p$ . We denote by $n_{t,p}:=\mathbb{E}\left(R_{\epsilon_{t}}^{p}(X_{t},Y_{t})\right)$ . When $p=0$ , the result clearly holds. When $p=1$ , using Proposition 2.5 and 2.6 we have

	$\displaystyle\dfrac{\partial}{\partial t}n_{t,1}$	$\displaystyle=\left(\dfrac{\partial}{\partial t}\epsilon_{t}\right)\dfrac{\partial}{\partial\epsilon_{t}}n_{t,1}+\dfrac{\partial}{\partial s}\mathbb{E}\left(R_{\epsilon_{t}}(X_{t+s},Y_{t+s})\right)\bigg{\|}_{s=0}$
		$\displaystyle\leqslant\bigg{\|}\dfrac{\partial}{\partial t}\epsilon_{t}\bigg{\|}\left(C_{1,1}(\epsilon_{t})n_{t,1}+C_{1,2}(\epsilon_{t})\right)+\mathbb{E}\left(L_{\epsilon_{t}}(R_{\epsilon_{t}})(X_{t},Y_{t})\right)$
		$\displaystyle\leqslant\bigg{\|}\dfrac{\partial}{\partial t}\epsilon_{t}\bigg{\|}\left(C_{1,1}(\epsilon_{t})n_{t,1}+C_{1,2}(\epsilon_{t})\right)-c_{4}\epsilon_{t}^{5}n_{t,1}+c_{5}(\epsilon_{t})+c_{1}.$

As $\epsilon_{t}=\Omega(\frac{1}{\ln(1+t)})$ , $\bigg{|}\dfrac{\partial}{\partial t}\epsilon_{t}\bigg{|}=\mathcal{O}(\frac{1}{t})$ and $C_{1,1}(\epsilon_{t}),C_{1,2}(\epsilon_{t}),c_{5}(\epsilon_{t})=o(t^{\beta})$ for any $\beta>0$ as $t\to\infty$ , we deduce that, for constants $c_{6}>0,c_{7}$ ,

(2.13)

\displaystyle\dfrac{\partial}{\partial t}n_{t,1}

\displaystyle\leqslant-c_{6}\epsilon_{t}^{5}n_{t,1}+c_{7}(1+t)^{1+\alpha/2}.

As a result, for constant $c_{8}$ ,

	$\displaystyle\dfrac{\partial}{\partial t}\left(n_{t,1}e^{c_{6}\int_{0}^{t}\epsilon_{s}^{5}\,ds}\right)$	$\displaystyle\leqslant c_{7}(1+t)^{1+\alpha/2}e^{c_{6}\int_{0}^{t}\epsilon_{s}^{5}\,ds}$
	$\displaystyle n_{t,1}$	$\displaystyle\leqslant n_{0,1}+c_{7}(1+t)^{1+\alpha/2}\int_{0}^{t}e^{-c_{6}\int_{s}^{t}\epsilon_{u}^{5}\,du}\,ds$
		$\displaystyle\leqslant n_{0,1}+c_{7}(1+t)^{1+\alpha/2}\int_{0}^{t}e^{-c_{6}\epsilon_{t}^{5}(t-s)}\,ds$
		$\displaystyle\leqslant n_{0,1}+\dfrac{c_{7}(1+t)^{1+\alpha/2}}{c_{6}\epsilon_{t}^{5}}\leqslant c_{8}(1+t)^{1+\alpha}.$

This proves the result when $p=1$ . Assume that the result holds for all $q<p$ , where $p\geqslant 2$ . First, using Proposition 2.5 and equation (2.4.2) we compute

	$\displaystyle L_{\epsilon_{t}}(R_{\epsilon_{t}}^{p})$	$\displaystyle=pR_{\epsilon_{t}}^{p-1}L_{\epsilon_{t}}(R_{\epsilon_{t}})+p(p-1)R^{p-2}_{\epsilon_{t}}\Gamma_{\epsilon_{t}}R_{\epsilon_{t}}$
(2.14)			$\displaystyle\leqslant pR_{\epsilon_{t}}^{p-1}\left(-c_{4}\epsilon_{t}^{5}R_{\epsilon_{t}}+c_{5}(\epsilon_{t})+c_{1}\right)+p(p-1)C_{3}(\epsilon_{t})R^{p-1}_{\epsilon_{t}}+p(p-1)C_{4}(\epsilon_{t})R^{p-2}_{\epsilon_{t}}.$

Differentiating with respect to $t$ , followed by using Proposition 2.6 and equation (2.4.2) give

	$\displaystyle\dfrac{\partial}{\partial t}n_{t,p}$	$\displaystyle=\left(\dfrac{\partial}{\partial t}\epsilon_{t}\right)\dfrac{\partial}{\partial\epsilon_{t}}n_{t,p}+\dfrac{\partial}{\partial s}\mathbb{E}\left(R_{\epsilon_{t}}^{p}(X_{t+s},Y_{t+s})\right)\bigg{\|}_{s=0}$
		$\displaystyle\leqslant\bigg{\|}\dfrac{\partial}{\partial t}\epsilon_{t}\bigg{\|}\left(C_{p,1}(\epsilon_{t})n_{t,p}+C_{p,2}(\epsilon_{t})n_{t,p-1}\right)+\mathbb{E}\left(L_{\epsilon_{t}}(R_{\epsilon_{t}}^{p})(X_{t},Y_{t})\right)$
		$\displaystyle\leqslant-c_{9}\epsilon_{t}^{5}n_{t,p}+c_{10}(p,\alpha)(1+t)^{1+\alpha/2},$

where we use the same asymptotic estimates that lead us to (2.13) and the induction assumption on $n_{t,p-1}$ and $n_{t,p-2}$ .

Now, we wrap up the proof of Proposition 2.4.

Using the lower bound of $R_{\epsilon}$ in Lemma 2.2, we note that, for some constants $k_{1},k_{2},k_{3},k_{4}>0$ and arbitrary $r>0$ ,

	$\displaystyle\mathbb{E}\left(\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)^{p}$	$\displaystyle\leqslant k_{1}\mathbb{E}\left(\dfrac{1}{\epsilon_{t}^{p}}\left(R_{\epsilon_{t}}(X_{t},Y_{t})+\epsilon_{t}\dfrac{M+U_{min}}{M_{4}+\epsilon_{t}}+\epsilon_{t}\ln\dfrac{1}{\epsilon_{t}}\right)^{p}\right)$
		$\displaystyle=k_{1}\mathbb{E}\left(\dfrac{1}{\epsilon_{t}^{p}}\left(\sum_{j=0}^{p}{p\choose j}R_{\epsilon_{t}}(X_{t},Y_{t})^{j}\left(\epsilon_{t}\dfrac{M+U_{min}}{M_{4}+\epsilon_{t}}+\epsilon_{t}\ln\dfrac{1}{\epsilon_{t}}\right)^{p-j}\right)^{p}\right)$
		$\displaystyle\leqslant k_{2}\dfrac{1}{\epsilon_{t}^{p}}\left(\mathbb{E}R_{\epsilon_{t}}^{jr}(X_{t},Y_{t})\right)^{1/r}\leqslant k_{3}\dfrac{1}{\epsilon_{t}^{p}}(1+t)^{2/r}\leqslant k_{4}(1+t)^{3/r},$

where the equality follows from the binomial theorem, the second inequality follows from the Jensen’s inequality, the third inequality comes from Proposition 2.7, and we use the estimates $\epsilon_{t}=\Omega(\frac{1}{\ln(1+t)})$ , $\ln(1+t)^{p}=\mathcal{O}((1+t)^{1/r})$ for large enough $t$ in the last inequality.

2.5. Gamma calculus

For smooth enough $\phi$ and $g$ , as $L_{\epsilon_{t}}$ is a diffusion, recall that we have

L_{\epsilon_{t}}(\phi(g))=\phi^{\prime}(g)L_{\epsilon_{t}}(g)+\phi^{\prime\prime}(g)\Gamma_{\epsilon_{t}}(g),

where $\Gamma_{\epsilon_{t}}$ is the carré du champ operator with $\Gamma_{\epsilon_{t}}g=\dfrac{1}{2}L_{\epsilon_{t}}g^{2}-gL_{\epsilon_{t}}g=\left\lVert\nabla_{y}g\right\rVert^{2}.$ Let $L_{\epsilon_{t}}^{*}$ be the $L^{2}(\pi_{\epsilon_{t}})$ adjoint of $L_{\epsilon_{t}}$ , that is,

L_{\epsilon_{t}}^{*}=-y\cdot\nabla_{x}-\left(\dfrac{y}{\epsilon_{t}}-\epsilon_{t}\nabla_{x}H_{\epsilon_{t}}\right)\cdot\nabla_{y}+\Delta_{y}.

For $h$ a non-negative function from $\mathbb{R}^{d}$ to $\mathbb{R}$ which belongs to some functional space $\mathcal{D}$ , and a differentiable $\Phi:\mathcal{D}\to\mathbb{R}$ with differential operator $D_{h}\Phi$ , the directional derivative in $h$ , we will be interested in quantities of the form

(2.15)

\displaystyle\Gamma_{L_{\epsilon_{t}}^{*},\Phi}(h):=\dfrac{1}{2}\left(L_{\epsilon_{t}}^{*}\Phi(h)-D_{h}\Phi(h)\cdot(L_{\epsilon_{t}}^{*}h)\right).

We now present three auxiliary lemmas, which are essential in proving the dissipation of the distorted entropy later in Section 2.7. The proofs are essentially the same as corresponding proofs in Monmarché (2018), by replacing the target function $U$ therein with $\epsilon_{t}H_{\epsilon_{t}}$ .

Lemma 2.3.

For non-negative $h$ and $\Phi(h)=h\ln h$ , then

\Gamma_{L_{\epsilon_{t}}^{*},\Phi}(h)=\dfrac{\Gamma_{\epsilon_{t}}(h)}{2h}.

[Proof. ]The proof is the same as (Monmarché, 2018, Lemma $11$ ), by replacing the target function $U$ therein with $\epsilon_{t}H_{\epsilon_{t}}$ .

Lemma 2.4.

For non-negative $h$ and $\Phi(h)=\left\lVert Ah\right\rVert^{2}$ , where $A=(A_{1},A_{2},\ldots,A_{k})$ is a linear operator from $\mathcal{D}$ to $\mathcal{D}^{k}$ , then

\Gamma_{L_{\epsilon_{t}}^{*},\Phi}(h)=\Gamma_{\epsilon_{t}}(Ah)+(Ah)[L_{\epsilon_{t}}^{*},A]h\geqslant(Ah)[L_{\epsilon_{t}}^{*},A]h,

where for two operators $C,D$ , the commutator bracket is $[C,D]=CD-DC$ , $\Gamma_{\epsilon_{t}}(Ah)=\sum_{i=1}^{k}\Gamma_{\epsilon_{t}}(A_{i}h)$ and $[L_{\epsilon_{t}}^{*},A]=([L_{\epsilon_{t}}^{*},A_{1}],[L_{\epsilon_{t}}^{*},A_{2}],\ldots,[L_{\epsilon_{t}}^{*},A_{k}])$ .

[Proof. ]The proof is the same as (Monmarché, 2018, Lemma $10$ ), by replacing the target function $U$ therein with $\epsilon_{t}H_{\epsilon_{t}}$ .

Lemma 2.5.

Let

	$\displaystyle\gamma(\epsilon_{t})$	$\displaystyle=\dfrac{1}{2}+2\left(\epsilon_{t}\left\lVert\nabla_{x}H_{\epsilon_{t}}\right\rVert_{\infty}+1+\dfrac{1}{\epsilon_{t}}\right)^{2},$
	$\displaystyle\Psi_{\epsilon_{t}}(h)$	$\displaystyle=\dfrac{\left\lVert(\nabla_{x}+\nabla_{y})h\right\rVert^{2}}{h}+\gamma(\epsilon_{t})h\ln h.$

We have

\Gamma_{L_{\epsilon_{t}}^{*},\Psi_{\epsilon_{t}}}(h)\geqslant\dfrac{1}{2}\dfrac{\left\lVert\nabla h\right\rVert^{2}}{h}.

[Proof. ]The proof is the same as (Monmarché, 2018, Corollary $13$ ), by replacing the target function $U$ therein with $\epsilon_{t}H_{\epsilon_{t}}$ .

2.6. Truncated differentiation

In this subsection, we present auxiliary results related to mollifier and truncated differentiation. These results are needed when we discuss the dissipation of distorted entropy in Section 2.7 below.

Let $\varphi:\mathbb{R}\to\mathbb{R}$ be a mollifier defined to be

\varphi(x):=\begin{cases}e^{\frac{1}{x^{2}-1}}\left(\int_{-1}^{1}e^{\frac{1}{y^{2}-1}}\,dy\right)^{-1},&\mbox{if }-1<x<1,\\ 0,&\mbox{otherwise.}\end{cases}

For $m\in\mathbb{N}$ , we define

(2.16)		$\displaystyle\varphi_{m}(x)$	$\displaystyle:=\dfrac{1}{m}\varphi\left(\dfrac{x}{m}\right),$
(2.17)		$\displaystyle v_{m}$	$\displaystyle:=\varphi_{m}\star\mathbf{1}_{(-\infty,m^{2}]}\leqslant 1,$

where $f\star g$ is the convolution of two functions $f,g$ and $\mathbf{1}_{A}$ is the indicator function of the set $A$ . Recall that in Lemma 2.2 the lower bound of $R_{\epsilon}$ is given by

R_{\epsilon}(x,y)\geqslant\left(\dfrac{\epsilon a_{1}}{M_{4}+\epsilon}-\dfrac{\epsilon^{3}}{2}\right)\left\lVert x\right\rVert^{2}+\left(\dfrac{1-\epsilon^{3}}{2}\right)\left\lVert y\right\rVert^{2}-\left(\epsilon\dfrac{M+U_{min}}{M_{4}+\epsilon}+\epsilon\ln\left(\dfrac{1}{\epsilon}\right)\right).

Denoting the third term on the right hand side by

d_{2,\epsilon}:=\epsilon\dfrac{M+U_{min}}{M_{4}}+\epsilon\ln\left(\dfrac{1}{\epsilon}\right),

we define, for any $\delta>\max\bigg{\{}-\epsilon_{0}\dfrac{U_{min}}{M_{4}},0\bigg{\}}$ and $\delta_{2}>0$ ,

(2.18)

\displaystyle\eta_{m,\epsilon}

\displaystyle:=v_{m}\left(\ln\left(R_{\epsilon}+2d_{2,\epsilon}+\delta+\delta_{2}\right)\right).

Note that for large enough $t$ (which depends on $a_{1},M_{4}$ and the cooling schedule $(\epsilon_{t})_{t\geqslant 0}$ ), $R_{\epsilon_{t}}+d_{2,\epsilon_{t}}\geqslant 0$ for all $x,y\in\mathbb{R}^{d}$ . Also, note that $d_{2,\epsilon}$ can be negative as $M+U_{min}$ may possibly be negative. With our choice of $\delta$ it ensures that $d_{2,\epsilon}+\delta\geqslant 0$ .

Now, we present results that summarize some basic properties of $\eta_{m,\epsilon}$ .

Lemma 2.6.

For $m\in\mathbb{N}$ , $\eta_{m,\epsilon_{t}}$

(1)

is compactly supported;
(2)

converges to $1$ pointwise;
(3)

For some constant $C>0$ , independent of $m$ and time $t$ , we have, for large enough $t$ ,

$L_{\epsilon_{t}}\eta_{m,\epsilon_{t}}\leqslant\dfrac{C}{m}.$

[Proof. ]Similar to Chak et al. (2020), we have the following estimates:

	$\displaystyle v_{m}(x)$	$\displaystyle=\int_{-\infty}^{m^{2}}\varphi_{m}(x-y)\,dy=\int_{-\infty}^{x+m^{2}}\varphi_{m}(z)\,dz,$
	$\displaystyle v_{m}^{\prime}(x)$	$\displaystyle=\varphi_{m}(x+m^{2})\leqslant\dfrac{1}{m}\left(\max\varphi\right),$
	$\displaystyle v_{m}^{\prime\prime}(x)$	$\displaystyle=\varphi_{m}^{\prime}(x+m^{2})\leqslant\dfrac{1}{m^{2}}\left(\max\varphi\right).$

From these it is straightforward to show item (1) and (2). We proceed to prove item (3). First, using Gamma calculus and the above upper bounds on $v_{m}^{\prime},v_{m}^{\prime\prime}$ , we have, for some constant $\overline{C}>0$ ,

	$\displaystyle L_{\epsilon_{t}}\eta_{m,\epsilon_{t}}$	$\displaystyle=v_{m}^{\prime}\left(\ln\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)\right)L_{\epsilon_{t}}\left(\ln\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)\right)$
		$\displaystyle\quad+v_{m}^{\prime\prime}\left(\ln\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)\right)\left\lVert\nabla_{y}\left(\ln\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)\right)\right\rVert^{2}$
		$\displaystyle\leqslant\overline{C}\left(\dfrac{1}{m}\left(\dfrac{L_{\epsilon_{t}}R_{\epsilon_{t}}}{R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}}-\dfrac{\left\lVert\nabla_{y}R_{\epsilon_{t}}\right\rVert^{2}}{\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)^{2}}\right)+\dfrac{1}{m^{2}}\dfrac{\left\lVert\nabla_{y}R_{\epsilon_{t}}\right\rVert^{2}}{\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)^{2}}\right)$
		$\displaystyle\leqslant\overline{C}\left(\dfrac{1}{m}\left(\dfrac{L_{\epsilon_{t}}R_{\epsilon_{t}}}{R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}}\right)+\dfrac{1}{m^{2}}\dfrac{\left\lVert\nabla_{y}R_{\epsilon_{t}}\right\rVert^{2}}{\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)^{2}}\right)$
		$\displaystyle\leqslant\overline{C}\left(\dfrac{1}{m}\left(\dfrac{-c_{4}\epsilon_{t}^{5}R_{\epsilon_{t}}+c_{5}(\epsilon_{t})+c_{1}}{R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}}\right)+\dfrac{1}{m^{2}}\dfrac{\left\lVert\nabla_{y}R_{\epsilon_{t}}\right\rVert^{2}}{\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)^{2}}\right)$
		$\displaystyle=\overline{C}\left(\dfrac{1}{m}\left(\dfrac{-c_{4}\epsilon_{t}^{5}\left(R_{\epsilon_{t}}+d_{2,\epsilon_{t}}+\delta\right)+c_{4}\epsilon_{t}^{5}(d_{2,\epsilon_{t}}+\delta)+c_{5}(\epsilon_{t})+c_{1}}{R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}}\right)+\dfrac{1}{m^{2}}\dfrac{\left\lVert\nabla_{y}R_{\epsilon_{t}}\right\rVert^{2}}{\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)^{2}}\right),$

where the last inequality follows from the Lyapunov property of $R_{\epsilon_{t}}$ in Proposition 2.5. Now, we bound each of the two terms on the right hand side above. Using $R_{\epsilon_{t}}+d_{2,\epsilon_{t}}\geqslant 0$ and $d_{2,\epsilon_{t}}+\delta\geqslant 0$ , we observe that

	$\displaystyle\dfrac{-c_{4}\epsilon_{t}^{5}\left(R_{\epsilon_{t}}+d_{2,\epsilon_{t}}+\delta\right)+c_{4}\epsilon_{t}^{5}(d_{2,\epsilon_{t}}+\delta)+c_{5}(\epsilon_{t})+c_{1}}{R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}}$	$\displaystyle\leqslant c_{4}\epsilon_{t}^{5}+\dfrac{c_{5}(\epsilon_{t})}{d_{2,\epsilon_{t}}+\delta}+\dfrac{c_{1}}{\delta_{2}},$
		$\displaystyle\leqslant c_{4}\epsilon_{0}^{2}+\mathcal{O}(\epsilon_{0}^{2})+\dfrac{c_{1}}{\delta_{2}},$

where we use $c_{5}(\epsilon_{t})=\mathcal{O}(\epsilon_{t}^{3}\ln(1/\epsilon_{t}))$ as in Proposition 2.5 and $d_{2,\epsilon_{t}}+\delta=\Omega(\epsilon_{t}\ln(1/\epsilon_{t}))$ .

For the second term, using

\left\lVert\nabla_{y}R_{\epsilon_{t}}\right\rVert^{2}=\left\lVert y+\epsilon_{t}^{3}x\right\rVert^{2}=\mathcal{O}(\epsilon_{t}^{3}\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2}),

and the lower bound of $R_{\epsilon_{t}}$ in Lemma 2.2 again,

R_{\epsilon_{t}}+d_{2,\epsilon_{t}}=\Omega(\epsilon_{t}\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2}),

we see that, for some constant $\overline{C}_{2}>0$ ,

\dfrac{\left\lVert\nabla_{y}R_{\epsilon_{t}}\right\rVert^{2}}{\left(R_{\epsilon_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)^{2}}\leqslant\overline{C}_{2}\dfrac{\epsilon_{t}^{3}\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2}}{\left(\epsilon_{t}\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2}\right)^{2}+\delta_{2}^{2}},

which can clearly be bounded above independent of $t$ .

2.7. Dissipation of the distorted entropy

Recall the notation in Section 2.2 that $h_{t}=\frac{dm_{t}}{d\pi_{\epsilon_{t}}}$ is the density of the process at time $t$ with respect to the distribution $\pi_{\epsilon_{t}}$ . For non-negative $h:\mathbb{R}^{2d}\to\mathbb{R}^{+}$ , we define the functionals

	$\displaystyle\Phi_{0}(h)$	$\displaystyle:=h\ln h,$
	$\displaystyle\Phi_{1}(h)$	$\displaystyle:=\dfrac{\left\lVert\nabla_{x}h+\nabla_{y}h\right\rVert^{2}}{h},$

and we recall that in Lemma 2.5 we introduce

	$\displaystyle\gamma(\epsilon_{t})$	$\displaystyle=\dfrac{1}{2}+2\left(\epsilon_{t}\left\lVert\nabla_{x}H_{\epsilon_{t}}\right\rVert_{\infty}+1+\dfrac{1}{\epsilon_{t}}\right)^{2},$
	$\displaystyle\Psi_{\epsilon_{t}}(h)$	$\displaystyle=\Phi_{1}(h)+\gamma(\epsilon_{t})\Phi_{0}(h).$

With these notations in mind we define the distorted entropy as, for $t\geqslant 0$ ,

(2.19)

\displaystyle\mathbf{H}(t):=\int\Psi_{\epsilon_{t}}(h_{t})\,d\pi_{\epsilon_{t}}=\int\Phi_{1}(h_{t})+\gamma(\epsilon_{t})\Phi_{0}(h_{t})\,d\pi_{\epsilon_{t}}=\int\dfrac{\left\lVert\nabla_{x}h_{t}+\nabla_{y}h_{t}\right\rVert^{2}}{h_{t}}\,d\pi_{\epsilon_{t}}+\gamma(\epsilon_{t})\mathrm{Ent}_{\pi_{\epsilon_{t}}}(h_{t}).

The distorted entropy $\mathbf{H}$ is not to be confused with $H_{\epsilon}$ introduced in (1.7). In order to compute the time derivative of the distorted entropy $\mathbf{H}$ and pass the derivative into the integral, we will first be working with its truncated version, which is defined to be, for $m\in\mathbb{N}$ ,

(2.20)

\displaystyle\mathbf{H}_{\eta_{m},\epsilon_{t}}(t):=\int\eta_{m,\epsilon_{t}}\left(\Phi_{1}(h_{t})+\gamma(\epsilon_{t})\Phi_{0}(h_{t})\right)\,d\pi_{\epsilon_{t}},

where we recall $\eta_{m,\epsilon_{t}}$ is the mollifier introduced in (2.18). Taking the time derivative of (2.20) and thanks to Lemma 2.6, we have

(2.21)

\displaystyle\dfrac{d}{dt}\mathbf{H}_{\eta_{m},\epsilon_{t}}(t)=\int\eta_{m,\epsilon_{t}}\dfrac{d}{dt}\left(\Psi_{\epsilon_{t}}(h_{t})\right)\,d\pi_{\epsilon_{t}}+\epsilon^{\prime}_{t}\int\dfrac{d}{d\epsilon_{t}}\eta_{m,\epsilon_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t}}\,dxdy.

We will handle and give upper bound on each of the two terms on the right hand side above separately.

For the first term on the right hand side of (2.21), we have the following upper bound:

Lemma 2.7.

For the same constant $C>0$ that appears in Lemma 2.6,

\int\eta_{m,\epsilon_{t}}\dfrac{d}{dt}\left(\Psi_{\epsilon_{t}}(h_{t})\right)\,d\pi_{\epsilon_{t}}\leqslant-\int\eta_{m,\epsilon_{t}}\dfrac{\left\lVert\nabla h_{t}\right\rVert^{2}}{h_{t}}\,d\pi_{\epsilon_{t}}+\dfrac{C}{m}\int\Psi_{\epsilon_{t}}(h_{t})+\gamma(\epsilon_{t})e^{-1}\,d\pi_{\epsilon_{t}}.

[Proof. ]

	$\displaystyle\int\eta_{m,\epsilon_{t}}\dfrac{d}{dt}\left(\Psi_{\epsilon_{t}}(h_{t})\right)\,d\pi_{\epsilon_{t}}$	$\displaystyle=\int\eta_{m,\epsilon_{t}}D_{h_{t}}\Psi_{\epsilon_{t}}(h_{t})\cdot\partial_{t}h_{t}\,d\pi_{\epsilon_{t}}$
		$\displaystyle=\int\eta_{m,\epsilon_{t}}D_{h_{t}}\Psi_{\epsilon_{t}}(h_{t})\cdot\partial_{t}m_{t}\,dxdy$
		$\displaystyle=\int\eta_{m,\epsilon_{t}}D_{h_{t}}\Psi_{\epsilon_{t}}(h_{t})\cdot L_{\epsilon_{t}}^{*}h_{t}\,d\pi_{\epsilon_{t}}$
		$\displaystyle=-\int\eta_{m,\epsilon_{t}}2\Gamma_{L_{\epsilon_{t}}^{},\Psi_{\epsilon_{t}}}(h_{t})\,d\pi_{\epsilon_{t}}+\int\eta_{m,\epsilon_{t}}L_{\epsilon_{t}}^{}\left(\Psi_{\epsilon_{t}}(h_{t})\right)\,d\pi_{\epsilon_{t}}$
		$\displaystyle=-\int\eta_{m,\epsilon_{t}}2\Gamma_{L_{\epsilon_{t}}^{*},\Psi_{\epsilon_{t}}}(h_{t})\,d\pi_{\epsilon_{t}}+\int L_{\epsilon_{t}}\eta_{m,\epsilon_{t}}\left(\Psi_{\epsilon_{t}}(h_{t})+\gamma(\epsilon_{t})e^{-1}\right)\,d\pi_{\epsilon_{t}}$
		$\displaystyle\leqslant-\int\eta_{m,\epsilon_{t}}\dfrac{\left\lVert\nabla h_{t}\right\rVert^{2}}{h_{t}}\,d\pi_{\epsilon_{t}}+\dfrac{C}{m}\int\Psi_{\epsilon_{t}}(h_{t})+\gamma(\epsilon_{t})e^{-1}\,d\pi_{\epsilon_{t}},$

where the third equality follows from (Chak et al., 2020, discussion below $(C.20)$ ) and the fourth equality comes from classical Gamma calculus as in (2.15). For the fifth equality, we add $\gamma(\epsilon_{t})e^{-1}$ to ensure that $\Psi_{\epsilon_{t}}(h_{t})+\gamma(\epsilon_{t})e^{-1}\geqslant 0$ . For the last inequality, we make use of Lemma 2.5 for the first term and Lemma 2.6 for the second term.

We proceed to consider the second term on the right hand side of (2.21). Using the chain rule, we have

	$\displaystyle\epsilon^{\prime}_{t}\int\dfrac{d}{d\epsilon_{t}}\left(\eta_{m,\epsilon_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t}}\right)\,dxdy$	$\displaystyle=\epsilon^{\prime}_{t}\int\left(\dfrac{d}{d\epsilon_{t}}\eta_{m,\epsilon_{t}}\right)\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t}}\,dxdy+\epsilon^{\prime}_{t}\int\eta_{m,\epsilon_{t}}\left(\dfrac{d}{d\epsilon_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t}}\right)\,dxdy$
(2.22)			$\displaystyle=\mathcal{O}\left(\dfrac{1}{m}\right)+\epsilon^{\prime}_{t}\int\eta_{m,\epsilon_{t}}\left(\dfrac{d}{d\epsilon_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t}}\right)\,dxdy,$

where the second equality follows from the proof in Lemma 2.6. Now, as in the proof of (Monmarché, 2018, Lemma $15$ ), we compute that

\dfrac{\partial}{\partial\epsilon_{t}}\ln\pi_{\epsilon_{t}}(x,y)=-\dfrac{\partial}{\partial\epsilon_{t}}H_{\epsilon_{t}}+\dfrac{1}{\epsilon_{t}^{2}}\left\lVert y\right\rVert^{2}-\int\left(-\dfrac{\partial}{\partial_{\epsilon_{t}}}H_{\epsilon_{t}}+\dfrac{\left\lVert y\right\rVert^{2}}{\epsilon_{t}^{2}}\,d\pi_{\epsilon_{t}}\right).

Using the quadratic growth assumption on $U$ in Assumption 1.1, there exist a subexponential function $\xi_{1}$ such that

(2.23)

\displaystyle|\partial_{\epsilon_{t}}\ln\pi_{\epsilon_{t}}(x,y)|+\left\lVert\nabla\partial_{\epsilon_{t}}\ln\pi_{\epsilon_{t}}(x,y)\right\rVert^{2}\leqslant\xi_{1}(\epsilon_{t})(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2}).

Lemma 2.8.

There exist subexponential function $\xi(\epsilon_{t})$ such that

\epsilon^{\prime}_{t}\int\eta_{m,\epsilon_{t}}\left(\dfrac{d}{d\epsilon_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t}}\right)\,dxdy\leqslant|\epsilon^{\prime}_{t}|\xi(\epsilon_{t})\left(\int\eta_{m,\epsilon_{t}}\Psi_{\epsilon_{t}}(h_{t})\,d\pi_{\epsilon_{t}}+\int(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})m_{t}(x,y)\,dxdy\right).

[Proof. ]

	$\displaystyle\int\eta_{m,\epsilon_{t}}\partial_{\epsilon_{t}}\left(\gamma(\epsilon_{t})\Phi_{0}(h_{t})\pi_{\epsilon_{t}}\right)\,dxdy$	$\displaystyle=\int\eta_{m,\epsilon_{t}}\gamma(\epsilon_{t})\left(-\partial_{\epsilon_{t}}\ln\pi_{\epsilon_{t}}\right)m_{t}(x,y)\,dxdy+\int\eta_{m,\epsilon_{t}}\gamma^{\prime}(\epsilon_{t})\Phi_{0}(h_{t})d\pi_{\epsilon_{t}}$
		$\displaystyle\leqslant\int\eta_{m,\epsilon_{t}}\gamma(\epsilon_{t})\xi_{1}(\epsilon_{t})(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})m_{t}(x,y)\,dxdy$
		$\displaystyle\quad+\|\gamma^{\prime}(\epsilon_{t})\|\left(\int\eta_{m,\epsilon_{t}}\Phi_{0}(h_{t})d\pi_{\epsilon_{t}}+\dfrac{2}{e}\right),$

where we use $\Phi_{0}(h_{t})\pi_{\epsilon_{t}}=\ln\left(\frac{m_{t}}{\pi_{\epsilon_{t}}}\right)m_{t}$ in the equality, and in the inequality we utilize $|\Phi_{0}(h)|\leqslant\Phi_{0}(h)+\frac{2}{e}$ as well as (2.23). Next, for matrix $M_{1}$ since $\Phi_{1}(h_{t})\pi_{\epsilon_{t}}=\left\lVert M_{1}\nabla\ln\frac{m_{t}}{\pi_{\epsilon_{t}}}\right\rVert^{2}m_{t}$ , we consider

	$\displaystyle\int\eta_{m,\epsilon_{t}}\partial_{\epsilon_{t}}\left(\Phi_{1}(h_{t})\pi_{\epsilon_{t}}\right)\,dxdy$	$\displaystyle=-2\int\eta_{m,\epsilon_{t}}M_{1}\nabla\ln\left(\dfrac{m_{t}}{\pi_{\epsilon_{t}}}\right)\cdot M_{1}\nabla\partial_{\epsilon_{t}}\ln\pi_{\epsilon_{t}}m_{t}\,dxdy$
		$\displaystyle\leqslant\int\eta_{m,\epsilon_{t}}\Phi_{1}(h_{t})\,d\pi_{\epsilon_{t}}+2\xi_{1}(\epsilon_{t})\int(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})m_{t}(x,y)\,dxdy,$

where the inequality follows again from (2.23). The desired result follows from taking

\xi(\epsilon_{t})=1+2\gamma(\epsilon_{t})\xi_{1}(\epsilon_{t})+\dfrac{|\gamma^{\prime}(\epsilon_{t})|}{\gamma(\epsilon_{t})}.

We write the Fisher information at time $t$ to be

I(t)=\int\dfrac{\left\lVert\nabla h_{t}\right\rVert^{2}}{h_{t}}\,d\pi_{\epsilon_{t}}.

Collecting the results of Lemma 2.7, (2.7) and Lemma 2.8 and substitute these back into (2.21), we obtain the following:

Proposition 2.8.

\mathbf{H}^{\prime}(t)\leqslant-I(t)+|\epsilon_{t}^{\prime}|\xi(\epsilon_{t})(\mathbf{H}(t)+\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)).

[Proof. ]First, according to the results of Lemma 2.7, (2.7) and Lemma 2.8, we put these back into (2.21) to obtain

	$\displaystyle\dfrac{d}{dt}\mathbf{H}_{\eta_{m},\epsilon_{t}}(t)$	$\displaystyle\leqslant-\int\eta_{m,\epsilon_{t}}\dfrac{\left\lVert\nabla h_{t}\right\rVert^{2}}{h_{t}}\,d\pi_{\epsilon_{t}}+\mathcal{O}\left(\dfrac{1}{m}\right)$
		$\displaystyle\quad+\|\epsilon^{\prime}_{t}\|\xi(\epsilon_{t})\left(\int\eta_{m,\epsilon_{t}}\Psi_{\epsilon_{t}}(h_{t})\,d\pi_{\epsilon_{t}}+\int(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})m_{t}(x,y)\,dxdy\right).$

Integrating from $s$ to $t$ leads to

	$\displaystyle\mathbf{H}_{\eta_{m},\epsilon_{t}}(t)-\mathbf{H}_{\eta_{m},\epsilon_{s}}(s)$	$\displaystyle\leqslant-\int_{s}^{t}\int\eta_{m,\epsilon_{u}}\dfrac{\left\lVert\nabla h_{u}\right\rVert^{2}}{h_{u}}\,d\pi_{\epsilon_{u}}du+\int_{s}^{t}\mathcal{O}\left(\dfrac{1}{m}\right)\,du$
		$\displaystyle\quad+\int_{s}^{t}\|\epsilon^{\prime}_{u}\|\xi(\epsilon_{u})\left(\int\eta_{m,\epsilon_{u}}\Psi_{\epsilon_{u}}(h_{u})\,d\pi_{\epsilon_{u}}+\int(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})m_{u}(x,y)\,dxdy\right)\,du.$

The desired result follows by taking $m\to\infty$ , Fatou’s lemma and Lemma 2.6.

Finally, the objective of this section is to prove the following bound on the distorted entropy:

Proposition 2.9.

Under Assumption 1.1, we have, for any $\alpha>0$ , there exists constant $B>0$ such that for large enough $t$ ,

\mathbf{H}(t)\leqslant B\left(\dfrac{1}{t}\right)^{1-\frac{c_{*}}{E}-\alpha}.

[Proof. ]The proof mimics that of (Monmarché, 2018, Lemma $19$ ), except that we have an improved estimate of the log-Sobolev constant in our setting. More precisely, using the log-Sobolev inequality Proposition 2.3 and Proposition 2.8 we have

	$\displaystyle\mathbf{H}^{\prime}(t)$	$\displaystyle\leqslant-\dfrac{1}{p_{\delta_{1}}(1/\epsilon_{t})}e^{-\frac{c_{*}}{\epsilon_{t}}}\mathbf{H}(t)+\|\epsilon_{t}^{\prime}\|\xi(\epsilon_{t})(\mathbf{H}(t)+\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right))$
(2.24)			$\displaystyle=\left(-\dfrac{1}{p_{\delta_{1}}(1/\epsilon_{t})}e^{-\frac{c_{*}}{\epsilon_{t}}}+\|\epsilon_{t}^{\prime}\|\xi(\epsilon_{t})\right)\mathbf{H}(t)+\|\epsilon_{t}^{\prime}\|\xi(\epsilon_{t})\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)$

where we recall $\xi(\epsilon_{t})$ is a subexponential function, and $p_{\delta_{1}}(1/\epsilon_{t})$ is first introduced in Proposition 2.3.

First, we handle the second term in (2.7). Using $|\epsilon_{t}^{\prime}|=\mathcal{O}(1/t)$ , Proposition 2.4 and $(\ln t)^{p}=\mathcal{O}(t^{\alpha})$ for large enough $t$ and any $\alpha,p>0$ leads to

(2.25)

\displaystyle|\epsilon_{t}^{\prime}|\xi(\epsilon_{t})\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)=\mathcal{O}\left(\dfrac{1}{t^{1-\alpha}}\right).

Next, we consider the first term in (2.7). Using the fact that $p_{\delta_{1}}(1/\epsilon_{t})$ is polynomial in $1/\epsilon_{t}$ leads to

(2.26)		$\displaystyle\|\epsilon_{t}^{\prime}\|\xi(\epsilon_{t})$	$\displaystyle=\mathcal{O}\left(\dfrac{1}{t^{1-\alpha}}\right),$
(2.27)		$\displaystyle\dfrac{1}{p_{\delta_{1}}(1/\epsilon_{t})}e^{-\frac{c_{*}}{\epsilon_{t}}}$	$\displaystyle=\Omega\left(\dfrac{1}{t}\right)^{\frac{c_{*}}{E}+\alpha}.$

Collecting the results of (2.25), (2.27) and (2.26) and put these back into (2.7), if we choose $\alpha$ small enough, there exists constants $c_{1},c_{2}>0$ such that

\mathbf{H}^{\prime}(t)\leqslant-c_{1}\left(\dfrac{1}{t}\right)^{\frac{c_{*}}{E}+\alpha}\mathbf{H}(t)+c_{2}\left(\dfrac{1}{t}\right)^{1-\alpha}.

The rest of the proof follows exactly that of (Monmarché, 2018, Lemma $19$ ) (with $E_{*}$ therein replaced by our $c_{*}$ ), which further relies on the estimate obtained in (Miclo, 1992, Lemma $6$ ).

2.8. Wrapping up the proof

In this subsection, we finish off the proof of Theorem 1.1 by using the auxiliary results obtained in previous sections.

[Proof of Theorem 1.1. ] We follow the same proof as Monmarché (2018). Let $(X_{t}^{\pi},Y_{t}^{\pi})$ with law $\pi_{\epsilon_{t}}$ . For any $\delta>0$ , we have

\mathbb{P}\left(U\left(X_{t}\right)>U_{min}+\delta\right)\leqslant\mathbb{P}\left(U\left(X_{t}^{\pi}\right)>U_{min}+\delta\right)+\left\lVert h_{t}-1\right\rVert_{L^{1}(\pi_{\epsilon_{t}})}.

Using the Pinsker’s inequality,

\left\|h_{t}-1\right\|_{L^{1}(\pi_{\epsilon_{t}})}\leqslant\sqrt{2\mathrm{Ent}_{\pi_{\epsilon_{t}}}\left(h_{t}\right)}\leqslant\sqrt{2\mathbf{H}(t)}.

Together with Proposition 2.9 and Proposition 2.1 yields, for constants $A_{2},D_{2}>0$ ,

\mathbb{P}\left(U\left(X_{t}\right)>U_{min}+\delta\right)\leqslant D_{2}e^{-\frac{\delta}{2\epsilon_{t}}}+A_{2}\left(\dfrac{1}{t}\right)^{\frac{1-\frac{c_{*}}{E}-\alpha}{2}}\leqslant A\left(\dfrac{1}{t}\right)^{\min\bigg{\{}\frac{1-\frac{c_{*}}{E}-\alpha}{2},\frac{\delta}{2E}\bigg{\}}}.

Remark 2.2.

In the adaptive setting as described in Section 3.1 and Section 3, thanks to Proposition 3.1 we have exactly the same result with the same proof (with $E$ replaced by $E_{t}$ and $c_{*}=c_{*}(U,c,\delta_{1})$ here replaced by $c_{*}(U,c_{t},\delta_{1})$ ).

3. An adaptive algorithm (IAKSA) and its convergence

Throughout this section, we assume that $U(x)\geqslant 0=U_{min}$ for the ease of calculation.

Recall that in IKSA, introduced in Theorem 1.1, the performance of the diffusion depends on the parameter $c$ . In this section, we propose an adaptive method to tune both the parameter $c=(c_{t})_{t\geqslant 0}$ and the energy level $E=(E_{t})_{t\geqslant 0}$ . First, we rewrite the dynamics of $(X_{t},Y_{t})_{t\geqslant 0}$ to emphasize the dependence on $c$ and $E$ :

	$\displaystyle dX_{t}$	$\displaystyle=Y_{t}\,dt,$
	$\displaystyle dY_{t}$	$\displaystyle=-\dfrac{1}{\epsilon_{t}}Y_{t}\,dt-\epsilon_{t}\nabla_{x}H_{\epsilon_{t},c_{t}}(X_{t})\,dt+\sqrt{2}\,dB_{t},$

where $H_{\epsilon_{t},c_{t}}$ is introduced in (1.7). The parameter $c$ and the cooling schedule are tuned adaptively using the running minimum $(\mathcal{M}_{t}:=\min_{v\leqslant t}U(X_{v}))_{t\geqslant 0}$ . Recall that for $m,n\in\mathbb{N}$ , $\varphi_{m}$ is a mollifier introduced in Section 2.6. In the following we shall use the mollifier $\varphi_{\frac{1}{n}}$ with support on $(-\frac{1}{n},\frac{1}{n})$ . We tune the parameter $(c_{t})_{t\geqslant 0}$ adaptively by setting

(3.1)

\displaystyle c_{t}:=\varphi_{\frac{1}{n}}\star\mathcal{M}_{\left(\cdot-\frac{1}{n}\right)_{+}}(t),

and hence, according to the definition of $c_{*}$ in (2.3), for arbitrary $\delta_{1}>0$ an upper bound of $c_{*}$ is given by

c_{*}=c_{*}(U,c_{t},\delta_{1})\leqslant\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}-U_{min}+\delta_{1}=\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{1}.

Let $\delta_{2}>\delta_{1}$ . Now, the energy level $(E_{t})_{t\geqslant 0}$ that appears on the numerator of the cooling schedule is also tuned adaptively:

(3.2)

\displaystyle E_{t}:=\varphi_{\frac{1}{n}}\star\mathcal{M}_{\left(\cdot-\frac{3}{n}\right)_{+}}(t)-U_{min}+\delta_{2}\geqslant\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{2}>\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{1}.

As we shall see in the proof, the primary reason of mollifying $\mathcal{M}_{t}$ allows us to consider the derivative of $c_{t}$ or $E_{t}$ with respect to time $t$ , and the choice of using $\varphi_{\frac{1}{n}}$ is arbitrary. This is essential in analyzing the dissipation of the distorted entropy. For instance, we need to consider the time derivative of the cooling schedule, and as such we have to ensure that $E_{t}$ is differentiable.

For the convenience of readers, we restate Theorem 1.2 below:

Theorem 3.1.

Under Assumption 1.1, consider the kinetic dynamics $(X_{t},Y_{t})_{t\geqslant 0}$ described by

	$\displaystyle dX_{t}$	$\displaystyle=Y_{t}\,dt,$
	$\displaystyle dY_{t}$	$\displaystyle=-\dfrac{1}{\epsilon_{t}}Y_{t}\,dt-\epsilon_{t}\nabla_{x}H_{\epsilon_{t},c_{t}}(X_{t})\,dt+\sqrt{2}\,dB_{t},$

where $H_{\epsilon_{t},c_{t}}$ is introduced in (1.7), $c_{t}$ is tuned adaptively according to (3.1) and the cooling schedule is

\epsilon_{t}=\dfrac{E_{t}}{\ln t}

\displaystyle\mathbb{P}\left(U(X_{t})>U_{min}+\delta\right)\leqslant A\left(\dfrac{1}{t}\right)^{a},

where

a:=\min\bigg{\{}\frac{\frac{\delta_{2}-\delta_{1}}{\delta+\delta_{2}}-\alpha}{2},\frac{\frac{\delta_{2}-\delta_{1}}{U(X_{0})-U_{min}+\delta_{2}}-\alpha}{2}\bigg{\}}.

The rest of this section is devoted to the proof of Theorem 1.2.

3.1. Dissipation of the distorted entropy in the adaptive setting

In this subsection, we present some auxiliary results which will be used in Section 3.2, the proof of Theorem 1.2. In a nutshell, we show that under appropriate assumptions on $c_{t}^{\prime}$ and $E_{t}^{\prime}$ , key results such as Proposition 2.9 also hold in this adaptive setting:

Proposition 3.1.

Under Assumption 1.1, suppose further that the parameter $c$ is tuned adaptively by $(c_{t})_{t\geqslant 0}$ and the energy level is $(E_{t})_{t\geqslant 0}$ , which are non-increasing with respect to time. We write $c_{*}=c_{*}(U,c_{t},\delta_{1})$ as in (2.3), and we assume that

\displaystyle|c_{t}^{\prime}|=\mathcal{O}\left(\dfrac{1}{t}\right),\quad|E_{t}^{\prime}|=\mathcal{O}\left(\dfrac{1}{t}\right),\quad E_{t}=\Omega(1).

For sufficiently small $\alpha$ with $\alpha\in(0,1-\frac{c_{*}}{E_{t}})$ for all $t\geqslant 0$ , there exists constant $B>0$ such that for large enough $t$ ,

\mathbf{H}(t)\leqslant B\left(\dfrac{1}{t}\right)^{1-\frac{c_{*}}{E_{t}}-\alpha}.

Note that both $c_{*}$ and $E_{t}$ are possibly time-dependent.

Remark 3.1.

In Section 3.2, we show that such a choice of $\alpha\in(0,1-\frac{c_{*}}{E_{t}})$ is possible.

The rest of this subsection is devoted to the proof of Proposition 3.1.

3.1.1. Auxiliary results

In this subsection, we present five auxiliary results which will be used in the proof of Proposition 3.1, namely Lemma 3.1, Proposition 3.2, Proposition 3.3 and Proposition 3.4.

First, to emphasize the dependence of $c_{t}$ on various quantities, compared with (2.20) the truncated distorted entropy $\mathbf{H}$ is now written as

\mathbf{H}_{\eta_{m},\epsilon_{t},c_{t}}(t):=\int\eta_{m,\epsilon_{t},c_{t}}\left(\Phi_{1}(h_{t})+\gamma(\epsilon_{t})\Phi_{0}(h_{t})\right)\,d\pi_{\epsilon_{t},c_{t}},

where we note that the stationary distribution $\pi_{\epsilon_{t},c_{t}}$ depends on $c_{t}$ and $\eta_{m,\epsilon_{t},c_{t}}$ depends on $c_{t}$ through $R_{\epsilon_{t},c_{t}}$ . Taking the derivative with respect to $t$ gives

(3.3)		$\displaystyle\dfrac{d}{dt}\mathbf{H}_{\eta_{m},\epsilon_{t},c_{t}}(t)$	$\displaystyle=\int\eta_{m,\epsilon_{t},c_{t}}\dfrac{d}{dt}\left(\Psi_{\epsilon_{t}}(h_{t})\right)\,d\pi_{\epsilon_{t},c_{t}}+\epsilon^{\prime}_{t}\int\dfrac{d}{d\epsilon_{t}}\eta_{m,\epsilon_{t},c_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t},c_{t}}\,dxdy$
		$\displaystyle\quad+c^{\prime}_{t}\int\dfrac{d}{dc_{t}}\eta_{m,\epsilon_{t},c_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t},c_{t}}\,dxdy.$

Comparing with (2.21), the extra term in (3.3) vanishes when $c_{t}=c$ for all $t$ . The first two terms in (3.3) can be handled in exactly the same way as in Lemma 2.7, equation (2.7) and Lemma 2.8. We proceed to simplify the third term in (3.3). We note that

	$\displaystyle\partial_{c_{t}}R_{\epsilon_{t},c_{t}}=\epsilon_{t}\partial_{c_{t}}H_{\epsilon_{t},c_{t}}(x)$	$\displaystyle=\epsilon_{t}\left(\int_{U_{min}}^{U(x)}\dfrac{f^{\prime}(u-c_{t})_{+}}{\left(f(u-c_{t})_{+}+\epsilon_{t}\right)^{2}}\,du-\dfrac{f^{\prime}(U(x)-c_{t})_{+}}{f(U(x)-c_{t})_{+}+\epsilon_{t}}\right)$
(3.4)			$\displaystyle\leqslant\dfrac{M_{5}}{\epsilon_{t}}\left(M_{3}+c_{t}-U_{min}\right)=:\xi_{5}(\epsilon_{t}),$
	$\displaystyle\partial_{c_{t}}\eta_{m,\epsilon_{t},c_{t}}$	$\displaystyle=v_{m}^{\prime}\left(\ln\left(R_{\epsilon_{t},c_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}\right)\right)\dfrac{1}{R_{\epsilon_{t},c_{t}}+2d_{2,\epsilon_{t}}+\delta+\delta_{2}}\partial_{c_{t}}R_{\epsilon_{t},c_{t}}$
		$\displaystyle=\mathcal{O}\left(\dfrac{1}{m}\right),$
	$\displaystyle\|\partial_{c_{t}}\ln\pi_{\epsilon_{t},c_{t}}(x,y)\|$	$\displaystyle\leqslant\|\partial_{c_{t}}H_{\epsilon_{t},c_{t}}\|+\int\|\partial_{c_{t}}H_{\epsilon_{t},c_{t}}\|\,d\pi_{\epsilon_{t},c_{t}}$
		$\displaystyle\leqslant\dfrac{2M_{5}}{\epsilon_{t}^{2}}\left(M_{3}+c_{t}-U_{min}\right)+\dfrac{2M_{5}}{\epsilon_{t}},$

where we use Assumption 1.1 in the first inequality. The above computation leads to

(3.5)

\displaystyle|\partial_{c_{t}}\ln\pi_{\epsilon_{t},c_{t}}(x,y)|+\left\lVert\nabla\partial_{c_{t}}\ln\pi_{\epsilon_{t},c_{t}}(x,y)\right\rVert^{2}

\displaystyle\leqslant\xi_{2}(\epsilon_{t})(1+\left\lVert x\right\rVert^{2}),

where $\xi_{2}(\epsilon_{t})$ is a subexponential function of $\epsilon_{t}$ . Our first auxiliary result handles the third term in (3.3).

Lemma 3.1.

There exist subexponential function $\xi_{3}(\epsilon_{t})$ such that

c^{\prime}_{t}\int\eta_{m,\epsilon_{t},c_{t}}\left(\dfrac{d}{dc_{t}}\Psi_{\epsilon_{t}}(h_{t})\pi_{\epsilon_{t},c_{t}}\right)\,dxdy\leqslant|c^{\prime}_{t}|\xi_{3}(\epsilon_{t})\left(\int\eta_{m,\epsilon_{t},c_{t}}\Psi_{\epsilon_{t}}(h_{t})\,d\pi_{\epsilon_{t},c_{t}}+\int(1+\left\lVert x\right\rVert^{2})m_{t}(x,y)\,dxdy\right).

[Proof. ]

	$\displaystyle\int\eta_{m,\epsilon_{t},c_{t}}\partial_{c_{t}}\left(\gamma(\epsilon_{t})\Phi_{0}(h_{t})\pi_{\epsilon_{t},c_{t}}\right)\,dxdy$	$\displaystyle=\int\eta_{m,\epsilon_{t},c_{t}}\gamma(\epsilon_{t})\left(-\partial_{\epsilon_{t}}\ln\pi_{\epsilon_{t},c_{t}}\right)m_{t}(x,y)\,dxdy$
		$\displaystyle\leqslant\int\eta_{m,\epsilon_{t},c_{t}}\gamma(\epsilon_{t})\xi_{2}(\epsilon_{t})(1+\left\lVert x\right\rVert^{2})m_{t}(x,y)\,dxdy,$
		$\displaystyle\quad+\left(\int\eta_{m,\epsilon_{t},c_{t}}\Phi_{0}(h_{t})d\pi_{\epsilon_{t},c_{t}}+\dfrac{2}{e}\right),$

where we use $\Phi_{0}(h_{t})\pi_{\epsilon_{t},c_{t}}=\ln\left(\frac{m_{t}}{\pi_{\epsilon_{t},c_{t}}}\right)m_{t}$ in the equality, and in the inequality we utilize $|\Phi_{0}(h)|\leqslant\Phi_{0}(h)+\frac{2}{e}$ as well as (3.5). Next, for matrix $M_{1}$ since $\Phi_{1}(h_{t})\pi_{\epsilon_{t},c_{t}}=\left\lVert M_{1}\nabla\ln\frac{m_{t}}{\pi_{\epsilon_{t},c_{t}}}\right\rVert^{2}m_{t}$ , we consider

	$\displaystyle\int\eta_{m,\epsilon_{t},c_{t}}\partial_{c_{t}}\left(\Phi_{1}(h_{t})\pi_{\epsilon_{t},c_{t}}\right)\,dxdy$	$\displaystyle=-2\int\eta_{m,\epsilon_{t},c_{t}}M_{1}\nabla\ln\left(\dfrac{m_{t}}{\pi_{\epsilon_{t},c_{t}}}\right)\cdot M_{1}\nabla\partial_{c_{t}}\ln\pi_{\epsilon_{t},c_{t}}m_{t}\,dxdy$
		$\displaystyle\leqslant\int\eta_{m,\epsilon_{t},c_{t}}\Phi_{1}(h_{t})\,d\pi_{\epsilon_{t},c_{t}}+2\xi_{2}(\epsilon_{t})\int(1+\left\lVert x\right\rVert^{2})m_{t}(x,y)\,dxdy,$

where the inequality follows again from (3.5).

Next, we consider the time derivative of the distorted entropy. The result is essentially the same as Proposition 2.8 for the case of fixed $c$ , except that in the adaptive setting we have to introduce $c_{t}$ and its time derivative:

Proposition 3.2.

There exist subexponential function $\xi_{4}(\epsilon_{t})$ such that

\mathbf{H}^{\prime}(t)\leqslant-I(t)+\left(|\epsilon_{t}^{\prime}|+|c_{t}^{\prime}|\right)\xi_{4}(\epsilon_{t})(\mathbf{H}(t)+\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)).

[Proof. ]First, according to the results of Lemma 2.7, (2.7), Lemma 2.8 and Lemma 3.1, we put these back into (3.3) to obtain

	$\displaystyle\dfrac{d}{dt}\mathbf{H}_{\eta_{m},\epsilon_{t},c_{t}}(t)$	$\displaystyle\leqslant-\int\eta_{m,\epsilon_{t},c_{t}}\dfrac{\left\lVert\nabla h_{t}\right\rVert^{2}}{h_{t}}\,d\pi_{\epsilon_{t},c_{t}}+\mathcal{O}\left(\dfrac{1}{m}\right)$
		$\displaystyle\quad+\left(\|\epsilon^{\prime}_{t}\|+\|c^{\prime}_{t}\|\right)\xi_{4}(\epsilon_{t})\left(\int\eta_{m,\epsilon_{t},c_{t}}\Psi_{\epsilon_{t}}(h_{t})\,d\pi_{\epsilon_{t},c_{t}}+\int(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})m_{t}(x,y)\,dxdy\right).$

Integrating from $s$ to $t$ leads to

	$\displaystyle\mathbf{H}_{\eta_{m},\epsilon_{t},c_{t}}(t)$	$\displaystyle-\mathbf{H}_{\eta_{m},\epsilon_{s},c_{s}}(s)\leqslant-\int_{s}^{t}\int\eta_{m,\epsilon_{u},c_{u}}\dfrac{\left\lVert\nabla h_{u}\right\rVert^{2}}{h_{u}}\,d\pi_{\epsilon_{u},c_{u}}du+\int_{s}^{t}\mathcal{O}\left(\dfrac{1}{m}\right)\,du$
		$\displaystyle\quad+\int_{s}^{t}\|\epsilon^{\prime}_{u}\|\xi_{4}(\epsilon_{u})\left(\int\eta_{m,\epsilon_{u},c_{u}}\Psi_{\epsilon_{u}}(h_{u})\,d\pi_{\epsilon_{u},c_{u}}+\int(1+\left\lVert x\right\rVert^{2}+\left\lVert y\right\rVert^{2})m_{u}(x,y)\,dxdy\right)\,du.$

The desired result follows by taking $m\to\infty$ , Fatou’s lemma and Lemma 2.6.

Our next two results generalize Proposition 2.7 and Proposition 2.4 respectively to the adaptive setting.

Proposition 3.3.

Under the same assumptions as in Proposition 3.1, for any $p\in\mathbb{N}$ , $\alpha>0$ and large enough $t$ , there exist a constant $\widetilde{C}_{p,\alpha}$ such that

\mathbb{E}\left(R_{\epsilon_{t},c_{t}}^{p}(X_{t},Y_{t})\right)\leqslant\widetilde{C}_{p,\alpha}(1+t)^{1+\alpha}.

[Proof. ]We shall prove the result by induction on $p$ . We denote by $n_{t,p}:=\mathbb{E}\left(R_{\epsilon_{t},c_{t}}^{p}(X_{t},Y_{t})\right)$ . When $p=0$ , the result clearly holds. When $p=1$ , using Proposition 2.5 and 2.6 we have

	$\displaystyle\dfrac{\partial}{\partial t}n_{t,1}$	$\displaystyle=\left(\partial_{t}c_{t}\right)\partial_{c_{t}}n_{t,1}+\left(\partial_{t}\epsilon_{t}\right)\partial_{\epsilon_{t}}n_{t,1}+\dfrac{\partial}{\partial s}\mathbb{E}\left(R_{\epsilon_{t},c_{t}}(X_{t+s},Y_{t+s})\right)\bigg{\|}_{s=0}$
		$\displaystyle\leqslant\|\partial_{t}c_{t}\|\xi_{5}(\epsilon_{t})+\|\partial_{t}\epsilon_{t}\|\left(C_{1,1}(\epsilon_{t})n_{t,1}+C_{1,2}(\epsilon_{t})\right)+\mathbb{E}\left(L_{\epsilon_{t}}(R_{\epsilon_{t}})(X_{t},Y_{t})\right)$
		$\displaystyle\leqslant\|\partial_{t}c_{t}\|\xi_{5}(\epsilon_{t})+\|\partial_{t}\epsilon_{t}\|\left(C_{1,1}(\epsilon_{t})n_{t,1}+C_{1,2}(\epsilon_{t})\right)-c_{4}\epsilon_{t}^{5}n_{t,1}+c_{5}(\epsilon_{t})+c_{1},$

where we use (3.4) in the first inequality. As $\epsilon_{t}=\Omega(\frac{1}{\ln(1+t)})$ , $|\partial_{t}\epsilon_{t}|=\mathcal{O}(\frac{1}{t})=|\partial_{t}c_{t}|$ and

C_{1,1}(\epsilon_{t}),C_{1,2}(\epsilon_{t}),c_{5}(\epsilon_{t}),\xi_{5}(\epsilon_{t})=o(t^{\beta})

for any $\beta>0$ as $t\to\infty$ , we deduce that, for constants $c_{6}>0,c_{7}$ ,

(3.6)

\displaystyle\dfrac{\partial}{\partial t}n_{t,1}

\displaystyle\leqslant-c_{6}\epsilon_{t}^{5}n_{t,1}+c_{7}(1+t)^{1+\alpha/2}.

The rest of the argument is the same as Proposition 3.3. This proves the result when $p=1$ . Assume that the result holds for all $q<p$ , where $p\geqslant 2$ . First, using Proposition 2.5 and equation (2.4.2) we compute

	$\displaystyle L_{\epsilon_{t}}(R_{\epsilon_{t},c_{t}}^{p})$	$\displaystyle=pR_{\epsilon_{t},c_{t}}^{p-1}L_{\epsilon_{t}}(R_{\epsilon_{t},c_{t}})+p(p-1)R^{p-2}_{\epsilon_{t},c_{t}}\Gamma_{\epsilon_{t}}R_{\epsilon_{t},c_{t}}$
		$\displaystyle\leqslant pR_{\epsilon_{t},c_{t}}^{p-1}\left(-c_{4}\epsilon_{t}^{5}R_{\epsilon_{t},c_{t}}+c_{5}(\epsilon_{t})+c_{1}\right)+p(p-1)C_{3}(\epsilon_{t})R^{p-1}_{\epsilon_{t},c_{t}}+p(p-1)C_{4}(\epsilon_{t})R^{p-2}_{\epsilon_{t},c_{t}}.$

Differentiating with respect to $t$ , followed by using Proposition 2.6 and equation (2.4.2) give

	$\displaystyle\dfrac{\partial}{\partial t}n_{t,p}$	$\displaystyle=\left(\partial_{t}c_{t}\right)\partial_{c_{t}}n_{t,p}+\left(\partial_{t}\epsilon_{t}\right)\partial_{\epsilon_{t}}n_{t,p}+\dfrac{\partial}{\partial s}\mathbb{E}\left(R_{\epsilon_{t}}^{p}(X_{t+s},Y_{t+s})\right)\bigg{\|}_{s=0}$
		$\displaystyle\leqslant\left\|\partial_{t}c_{t}\right\|\xi_{5}(\epsilon_{t})+\|\partial_{t}\epsilon_{t}\|\left(C_{p,1}(\epsilon_{t})n_{t,p}+C_{p,2}(\epsilon)n_{t,p-1}\right)+\mathbb{E}\left(L_{\epsilon_{t}}(R_{\epsilon_{t}}^{p})(X_{t},Y_{t})\right)$
		$\displaystyle\leqslant-c_{9}\epsilon_{t}^{5}n_{t,p}+c_{10}(p,\alpha)(1+t)^{1+\alpha/2},$

where we use the same asymptotic estimates that lead us to (3.6) and the induction assumption on $n_{t,p-1}$ and $n_{t,p-2}$ .

Proposition 3.4.

Under the same assumptions as in Proposition 3.1, for any $p\in\mathbb{N}$ , $\alpha>0$ and large enough $t$ (which depends on $p,U,f$ and the temperature schedule $\epsilon_{t}$ ), there exist a constant $k$ such that

\displaystyle\mathbb{E}\left(\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)^{p}\leqslant k(1+t)^{\alpha}.

[Proof. ]The proof is exactly the same as Proposition 2.4, except that we apply Proposition 3.3.

3.1.2. Proof of Proposition 3.1

Using the log-Sobolev inequality Proposition 2.3 and Proposition 3.2 we have

	$\displaystyle\mathbf{H}^{\prime}(t)$	$\displaystyle\leqslant-\dfrac{1}{p_{\delta_{1}}(1/\epsilon_{t})}e^{-\frac{c_{*}}{\epsilon_{t}}}\mathbf{H}(t)+\left(\|\epsilon_{t}^{\prime}\|+\|c_{t}^{\prime}\|\right)\xi_{4}(\epsilon_{t})(\mathbf{H}(t)+\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right))$
(3.7)			$\displaystyle=\left(-\dfrac{1}{p_{\delta_{1}}(1/\epsilon_{t})}e^{-\frac{c_{*}}{\epsilon_{t}}}+\left(\|\epsilon_{t}^{\prime}\|+\|c_{t}^{\prime}\|\right)\xi_{4}(\epsilon_{t})\right)\mathbf{H}(t)+\left(\|\epsilon_{t}^{\prime}\|+\|c_{t}^{\prime}\|\right)\xi_{4}(\epsilon_{t})\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)$

where we recall $\xi_{4}(\epsilon_{t})$ is a subexponential function, and $p_{\delta_{1}}(1/\epsilon_{t})$ is first introduced in Proposition 2.3.

First, we handle the second term in (3.1.2). Note that

\displaystyle|\epsilon_{t}^{\prime}|=\mathcal{O}\left(\dfrac{1}{t}+|E_{t}^{\prime}|\right)=\mathcal{O}\left(\dfrac{1}{t}\right).

Using $|c_{t}^{\prime}|=\mathcal{O}(1/t)$ , Proposition 2.4 and $(\ln t)^{p}=\mathcal{O}(t^{\alpha})$ for large enough $t$ and any $\alpha,p>0$ leads to

(3.8)

\displaystyle\left(|\epsilon_{t}^{\prime}|+|c_{t}^{\prime}|\right)\xi_{4}(\epsilon_{t})\mathbb{E}\left(1+\left\lVert X_{t}\right\rVert^{2}+\left\lVert Y_{t}\right\rVert^{2}\right)=\mathcal{O}\left(\dfrac{1}{t^{1-\alpha}}\right).

Next, we consider the first term in (3.1.2). Using the fact that $p_{\delta_{1}}(1/\epsilon_{t})$ is polynomial in $1/\epsilon_{t}$ and $E_{t}=\Omega(1)$ lead to

(3.9)		$\displaystyle\left(\|\epsilon_{t}^{\prime}\|+\|c_{t}^{\prime}\|\right)\xi_{4}(\epsilon_{t})$	$\displaystyle=\mathcal{O}\left(\dfrac{1}{t^{1-\alpha}}\right),$
(3.10)		$\displaystyle\dfrac{1}{p_{\delta_{1}}(1/\epsilon_{t})}e^{-\frac{c_{*}}{\epsilon_{t}}}$	$\displaystyle=\Omega\left(\dfrac{1}{t}\right)^{\frac{c_{*}}{E_{t}}+\alpha}.$

Collecting the results of (3.8), (3.10) and (3.9) and put these back into (3.1.2), if we choose $\alpha$ small enough, there exists constants $c_{1},c_{2}>0$ such that

\mathbf{H}^{\prime}(t)\leqslant-c_{1}\left(\dfrac{1}{t}\right)^{\frac{c_{*}}{E_{t}}+\alpha}\mathbf{H}(t)+c_{2}\left(\dfrac{1}{t}\right)^{1-\alpha}.

The rest of the proof follows exactly that of (Chak et al., 2020, Equation $(C.32)$ to $(C.34)$ ), and we need to check that, as $t$ to $\infty$ ,

\int_{\cdot}^{t}\left(\dfrac{1}{s}\right)^{\frac{c_{*}}{E_{s}}+\alpha}\,ds\geqslant\int_{\cdot}^{t}\left(\dfrac{1}{s}\right)\,ds\to\infty.

3.2. Proof of Theorem 1.2

First, we check that with our choice of $(c_{t})_{t\geqslant 0}$ in (3.1) and $(E_{t})_{t\geqslant 0}$ in (3.2), the assumptions in Proposition 3.1 are satisfied:

Lemma 3.2.

With our choice of $(c_{t})_{t\geqslant 0}$ in (3.1) and $(E_{t})_{t\geqslant 0}$ in (3.2), we have

\displaystyle|c_{t}^{\prime}|=\mathcal{O}\left(\dfrac{1}{t}\right),\quad|E_{t}^{\prime}|=\mathcal{O}\left(\dfrac{1}{t}\right),\quad E_{t}=\Omega(1).

[Proof. ]Clearly, $E_{t}\geqslant\delta_{2}$ and so $E_{t}=\Omega(1)$ . Next, we consider $c_{t}$ :

	$\displaystyle c_{t}$	$\displaystyle=\int\mathcal{M}_{\left(u-\frac{1}{n}\right)_{+}}\varphi_{\frac{1}{n}}(t-u)\,du,$
	$\displaystyle c_{t}^{\prime}$	$\displaystyle=\int\mathcal{M}_{\left(u-\frac{1}{n}\right)_{+}}n\varphi((t-u)n)\dfrac{-n^{2}2(t-u)}{(((t-u)n)^{2}-1)^{2}}\,du,$
	$\displaystyle\|c_{t}^{\prime}\|$	$\displaystyle\leqslant\int\mathcal{M}_{\left(u-\frac{1}{n}\right)_{+}}n\varphi((t-u)n)\dfrac{n}{((t-u)n-1)^{2}}\,du.$

Using the monotone convergence theorem, as $\varphi(t)\frac{1}{(t-1)^{2}}$ is non-increasing in $t$ , we conclude that $t|c_{t}^{\prime}|\to 0$ as $t\to\infty$ . The proof of $|E_{t}^{\prime}|=\mathcal{O}\left(\frac{1}{t}\right)$ is very similar and is omitted.

We write $\mathcal{F}_{t}$ to be the canonical filtration generated by $\mathcal{M}_{t}$ up to time $t$ . Thanks to Lemma 3.2, Proposition 3.1, Theorem 1.1 and Remark 2.2 we have the following estimate:

\displaystyle\mathbb{P}\left(U(X_{t})>U_{min}+\delta|\mathcal{F}_{(t-\frac{1}{n})_{+}}\right)\leqslant A\left(\dfrac{1}{t}\right)^{\min\bigg{\{}\frac{1-\frac{c_{*}}{E_{t}}-\alpha}{2},\frac{\delta}{2E_{t}}\bigg{\}}}.

For the exponent of $(1/t)$ , we select $\delta_{1}$ and $\delta_{2}$ such that $0<\delta_{2}-\delta_{1}<\delta$ which gives

	$\displaystyle\min\bigg{\{}\frac{1-\frac{c_{*}}{E_{t}}-\alpha}{2},\frac{\delta}{2E_{t}}\bigg{\}}\geqslant\min\bigg{\{}\frac{1-\frac{\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{1}}{E_{t}}-\alpha}{2},\frac{\delta}{2E_{t}}\bigg{\}}$	$\displaystyle=\frac{1-\frac{\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{1}}{E_{t}}-\alpha}{2}$
		$\displaystyle\geqslant\frac{1-\frac{\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{1}}{\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{2}}-\alpha}{2}.$

Note that the choice of $\alpha$ is arbitrary, and we consider sufficiently small $\alpha$ such that $\alpha\in(0,\frac{\delta_{2}-\delta_{1}}{U(X_{0})+\delta_{2}})$ to ensure the exponent of $(1/t)$ is positive, i.e. for all $t\geqslant 0$

\frac{1-\frac{\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{1}}{\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}+\delta_{2}}-\alpha}{2}>0.

This choice of $\alpha$ also satisfies the requirement in Proposition 3.1. Using the law of iterated expectation yields

	$\displaystyle\mathbb{P}\left(U(X_{t})>U_{min}+\delta\right)$	$\displaystyle=\mathbb{E}\left(\mathbb{P}\left(U(X_{t})>U_{min}+\delta\|\mathcal{F}_{(t-\frac{1}{n})_{+}}\right)\right)$
		$\displaystyle\leqslant A\int_{U_{min}}^{U(X_{0})}\left(\dfrac{1}{t}\right)^{\frac{1-\frac{y+\delta_{1}}{y+\delta_{2}}-\alpha}{2}}\,d\,\mathbb{P}\left(\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}\leqslant y\right)$
		$\displaystyle=A\int_{U_{min}}^{U_{min}+\delta}\left(\dfrac{1}{t}\right)^{\frac{1-\frac{y+\delta_{1}}{y+\delta_{2}}-\alpha}{2}}\,d\,\mathbb{P}\left(\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}\leqslant y\right)$
		$\displaystyle\quad+A\int_{U_{min}+\delta}^{U(X_{0})}\left(\dfrac{1}{t}\right)^{\frac{1-\frac{y+\delta_{1}}{y+\delta_{2}}-\alpha}{2}}\,d\,\mathbb{P}\left(\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}\leqslant y\right)$
		$\displaystyle\leqslant A\left(\dfrac{1}{t}\right)^{\frac{\frac{\delta_{2}-\delta_{1}}{\delta+\delta_{2}}-\alpha}{2}}+A\left(\dfrac{1}{t}\right)^{\frac{1-\frac{U(X_{0})+\delta_{1}}{U(X_{0})+\delta_{2}}-\alpha}{2}}$
		$\displaystyle\quad-A\int_{U_{min}+\delta}^{U(X_{0})}\mathbb{P}\left(\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}\leqslant y\right)\,d\left(\dfrac{1}{t}\right)^{\frac{1-\frac{y+\delta_{1}}{y+\delta_{2}}-\alpha}{2}}$
		$\displaystyle\leqslant 2A\left(\dfrac{1}{t}\right)^{\frac{\frac{\delta_{2}-\delta_{1}}{\delta+\delta_{2}}-\alpha}{2}}+A\int_{U_{min}+\delta}^{U(X_{0})}\mathbb{P}\left(\mathcal{M}_{\left(t-\frac{2}{n}\right)_{+}}>y\right)\,d\left(\dfrac{1}{t}\right)^{\frac{1-\frac{y+\delta_{1}}{y+\delta_{2}}-\alpha}{2}}$
		$\displaystyle\leqslant A\left(\dfrac{1}{t}\right)^{\frac{\frac{\delta_{2}-\delta_{1}}{\delta+\delta_{2}}-\alpha}{2}}+A\left(\dfrac{1}{t}\right)^{\frac{\frac{\delta_{2}-\delta_{1}}{U(X_{0})+\delta_{2}}-\alpha}{2}},$

where the second inequality follows from integration by part.

Appendix: setup of the numerical results

In this section, we describe the experimental setup for the numerical results presented in Section 1.3.

3.2.1. Description of the four annealing methods

We describe the four annealing methods that we test on:

•

IAKSA and KSA: KSA is a special case of IAKSA with $f=0$ . Instead of running (1.8), we consider

	$\displaystyle dX_{t}$	$\displaystyle=Y_{t}\,dt,$
	$\displaystyle dY_{t}$	$\displaystyle=-Y_{t}\,dt-\epsilon_{t}\nabla_{x}H_{\epsilon_{t},c_{t}}(X_{t})\,dt+\sqrt{2\epsilon_{t}}\,dB_{t},$

and apply the Euler-Maruyama discretization with stepsize $(\eta(k))_{k\in\mathbb{N}_{0}}$ , cooling schedule $(\epsilon(k))_{k\in\mathbb{N}_{0}}$ and adaptive $(c(k))_{k\in\mathbb{N}_{0}}$ to obtain $(X(k),Y(k))_{k\in\mathbb{N}_{0}}$ :

	$\displaystyle X(k+1)$	$\displaystyle=X(k)+Y(k)\eta(k),$
	$\displaystyle Y(k+1)$	$\displaystyle=Y(k)-Y(k)\eta(k)-\epsilon(k)\nabla_{x}H_{\epsilon(k),c(k)}(X(k))\eta(k)+\sqrt{2\epsilon(k)}\sqrt{\eta(k)}N(k),$

where $(N(k))$ is a sequence of i.i.d. standard normal random variables.

•

IASA and SA: SA is a special case of IASA with $f=0$ . We simulate an Euler–Maruyama discretization of (1.5) with stepsize $(\eta(k))_{k\in\mathbb{N}_{0}}$ , cooling schedule $(\epsilon(k))_{k\in\mathbb{N}_{0}}$ and adaptive $(c(k))_{k\in\mathbb{N}_{0}}$ to obtain $(Z(k))_{k\in\mathbb{N}_{0}}$ :

\displaystyle Z(k+1)=Z(k)-\nabla U(Z(k))\,\eta(k)+\sqrt{2\left(f((U(Z(k))-c(k))_{+})+\epsilon(k)\right)}\sqrt{\eta(k)}N(k).

3.2.2. Description of the test functions and the parameters

For both IAKSA and IASA, we use $f(u)=0.5\arctan(u)$ . Note that although this choice of $f$ does not satisfy Assumption 1.1, this is used in the numerical experiments in Fang et al. (1997). As for the benchmark functions, we use the following:

•

Ackley function $U_{1}$ : We consider the $2$ -dimensional Ackley function

U_{1}(x_{1},x_{2})=-20\exp\left(-0.2\sqrt{\frac{1}{2}\sum_{i=1}^{2}x_{i}^{2}}\right)-\exp\left(\frac{1}{2}\sum_{i=1}^{2}\cos\left(2\pi x_{i}\right)\right)+20+e

with initial stepsize $\eta(0)=0.05$ . We use a multiplicative stepsize decay strategy: on every $1000$ iterations, the stepsize decreases by a factor of $0.999$ . Denote $\Theta(k)=\sum_{s\leqslant k}\eta(s)$ . We also use

	$\displaystyle c(k)$	$\displaystyle=\min_{v\leqslant k}U_{1}(X(v))+\dfrac{1}{\Theta(k)+1},$
	$\displaystyle\epsilon(k)$	$\displaystyle=\dfrac{2}{\ln(\Theta(k)+2)}.$

The initialization is $X(0)=(18.5,17.4)$ and for kinetic diffusions $Y(0)=0$ .

•

Ackley3 function $U_{2}$ : We consider the $2$ -dimensional Ackley3 function

U_{2}(x_{1},x_{2})=-200\exp\left(-0.2\sqrt{\sum_{i=1}^{2}x_{i}^{2}}\right)+5\exp\left(\cos(3x_{1})+\sin(3x_{2})\right)

with initial stepsize $\eta(0)=0.05$ . We use a multiplicative stepsize decay strategy: on every $1000$ iterations, the stepsize decreases by a factor of $0.999$ . We use

	$\displaystyle c(k)$	$\displaystyle=\min_{v\leqslant k}U_{2}(X(v))+\dfrac{1}{\Theta(k)+1},$
	$\displaystyle\epsilon(k)$	$\displaystyle=\dfrac{2}{\ln(\Theta(k)+2)}.$

The initialization is $X(0)=(18.4,12.8)$ and for kinetic diffusions $Y(0)=0$ .

•

Rastrigin function $U_{3}$ : We consider the $2$ -dimensional Rastrigin function

U_{3}(x_{1},x_{2})=20+\sum_{i=1}^{2}\left[x_{i}^{2}-10\cos\left(2\pi x_{i}\right)\right]

with initial stepsize $\eta(0)=0.5$ . We use a multiplicative stepsize decay strategy: on every $1000$ iterations, the stepsize decreases by a factor of $0.999$ . We use

	$\displaystyle c(k)$	$\displaystyle=\min_{v\leqslant k}U_{3}(X(v))+\dfrac{1}{\Theta(k)+1},$
	$\displaystyle\epsilon(k)$	$\displaystyle=\dfrac{0.5}{\ln(\Theta(k)+2)}.$

The initialization is $X(0)=(9.84,3.33)$ and for kinetic diffusions $Y(0)=0$ .

Acknowledgements

We thank Jing Zhang for a careful proofreading, and appreciate the constructive remarks from Xuefeng Gao, Laurent Miclo, Andre Milzarek, Pierre Monmarché and Wenpin Tang on various aspects of an earlier version of this work and stochastic optimization in general. In particular, we thank Andre Milzarek for suggesting the plot in Figure 1. The author acknowledges the support from The Chinese University of Hong Kong, Shenzhen grant PF01001143 and the financial support from AIRS - Shenzhen Institute of Artificial Intelligence and Robotics for Society Project 2019-INT002.

References

Andrieu and Thoms (2008) C. Andrieu and J. Thoms. A tutorial on adaptive MCMC. Stat. Comput., 18(4):343–373, 2008.
Benaïm and Bréhier (2019) M. Benaïm and C.-E. Bréhier. Convergence analysis of adaptive biasing potential methods for diffusion processes. Commun. Math. Sci., 17(1):81–130, 2019.
Benaïm et al. (2020) M. Benaïm, C.-E. Bréhier, and P. Monmarché. Analysis of an adaptive biasing force method based on self-interacting dynamics. Electron. J. Probab., 25:28 pp., 2020.
Bierkens (2016) J. Bierkens. Non-reversible Metropolis-Hastings. Stat. Comput., 26(6):1213–1228, 2016.
Catoni (1992) O. Catoni. Rough large deviation estimates for simulated annealing: application to exponential schedules. Ann. Probab., 20(3):1109–1146, 1992.
Chak et al. (2020) M. Chak, N. Kantas, and G. A. Pavliotis. On the generalised langevin equation for simulated annealing. arXiv preprint arXiv:2003.06448, 2020.
Chen and Hwang (2013) T.-L. Chen and C.-R. Hwang. Accelerating reversible Markov chains. Statist. Probab. Lett., 83(9):1956–1962, 2013.
Chiang et al. (1987) T.-S. Chiang, C.-R. Hwang, and S. J. Sheu. Diffusion for global optimization in ${\bf R}^{n}$ . SIAM J. Control Optim., 25(3):737–753, 1987.
Diaconis et al. (2000) P. Diaconis, S. Holmes, and R. M. Neal. Analysis of a nonreversible Markov chain sampler. Ann. Appl. Probab., 10(3):726–752, 2000.
Duncan et al. (2016) A. B. Duncan, T. Lelièvre, and G. A. Pavliotis. Variance reduction using nonreversible Langevin samplers. J. Stat. Phys., 163(3):457–491, 2016.
Duncan et al. (2017) A. B. Duncan, N. Nüsken, and G. A. Pavliotis. Using perturbed underdamped Langevin dynamics to efficiently sample from probability distributions. J. Stat. Phys., 169(6):1098–1131, 2017.
Fang et al. (1997) H. Fang, M. Qian, and G. Gong. An improved annealing method and its large-time behavior. Stochastic Process. Appl., 71(1):55–74, 1997.
Fort et al. (2011) G. Fort, E. Moulines, and P. Priouret. Convergence of adaptive and interacting Markov chain Monte Carlo algorithms. Ann. Statist., 39(6):3262–3289, 2011.
Gadat and Panloup (2014) S. Gadat and F. Panloup. Long time behaviour and stationary regime of memory gradient diffusions. Ann. Inst. Henri Poincaré Probab. Stat., 50(2):564–601, 2014.
Gadat et al. (2013) S. Gadat, F. Panloup, and C. Pellegrini. Large deviation principle for invariant distributions of memory gradient diffusions. Electron. J. Probab., 18:no. 81, 34, 2013.
Gao et al. (2018) X. Gao, M. Gurbuzbalaban, and L. Zhu. Breaking reversibility accelerates langevin dynamics for global non-convex optimization. arXiv preprint arXiv:1812.07725, 2018.
Guo et al. (2020) X. Guo, J. Han, and W. Tang. Perturbed gradient descent with occupation time. arXiv preprint arXiv:2005.04507, 2020.
Holley and Stroock (1988) R. Holley and D. Stroock. Simulated annealing via Sobolev inequalities. Comm. Math. Phys., 115(4):553–569, 1988.
Holley et al. (1989) R. A. Holley, S. Kusuoka, and D. W. Stroock. Asymptotics of the spectral gap with applications to the theory of simulated annealing. J. Funct. Anal., 83(2):333–347, 1989.
Hu et al. (2020) Y. Hu, X. Wang, X. Gao, M. Gurbuzbalaban, and L. Zhu. Non-convex stochastic optimization via non-reversible stochastic gradient langevin dynamics. arXiv preprint arXiv:2004.02823, 2020.
Hwang et al. (2005) C.-R. Hwang, S.-Y. Hwang-Ma, and S.-J. Sheu. Accelerating diffusions. Ann. Appl. Probab., 15(2):1433–1444, 2005.
Jacquot (1992) S. Jacquot. Comportement asymptotique de la seconde valeur propre des processus de Kolmogorov. J. Multivariate Anal., 40(2):335–347, 1992.
Jamil and Yang (2013) M. Jamil and X.-S. Yang. A literature survey of benchmark functions for global optimisation problems. International Journal of Mathematical Modelling and Numerical Optimisation, 4(2):150–194, 2013.
Lelièvre and Minoukadeh (2011) T. Lelièvre and K. Minoukadeh. Long-time convergence of an adaptive biasing force method: the bi-channel case. Arch. Ration. Mech. Anal., 202(1):1–34, 2011.
Lelièvre et al. (2008) T. Lelièvre, M. Rousset, and G. Stoltz. Long-time convergence of an adaptive biasing force method. Nonlinearity, 21(6):1155–1181, 2008.
Löwe (1997) M. Löwe. On the invariant measure of non-reversible simulated annealing. Statist. Probab. Lett., 36(2):189–193, 1997.
Miclo (1992) L. Miclo. Recuit simulé sur ${\bf R}^{n}$ . Étude de l’évolution de l’énergie libre. Ann. Inst. H. Poincaré Probab. Statist., 28(2):235–266, 1992.
Miclo (2020) L. Miclo. Private communcation. 2020.
Monmarché (2018) P. Monmarché. Hypocoercivity in metastable settings and kinetic simulated annealing. Probab. Theory Related Fields, 172(3-4):1215–1248, 2018.
Raimond (2009) O. Raimond. Self-interacting diffusions: a simulated annealing version. Probab. Theory Related Fields, 144(1-2):247–279, 2009.
Roberts and Rosenthal (2009) G. O. Roberts and J. S. Rosenthal. Examples of adaptive MCMC. J. Comput. Graph. Statist., 18(2):349–367, 2009.

	$\displaystyle\dfrac{\partial}{\partial t}n_{t,1}$	$\displaystyle=\left(\dfrac{\partial}{\partial t}\epsilon_{t}\right)\dfrac{\partial}{\partial\epsilon_{t}}n_{t,1}+\dfrac{\partial}{\partial s}\mathbb{E}\left(R_{\epsilon_{t}}(X_{t+s},Y_{t+s})\right)\bigg{\|}_{s=0}$
		$\displaystyle\leqslant\bigg{\|}\dfrac{\partial}{\partial t}\epsilon_{t}\bigg{\|}\left(C_{1,1}(\epsilon_{t})n_{t,1}+C_{1,2}(\epsilon_{t})\right)+\mathbb{E}\left(L_{\epsilon_{t}}(R_{\epsilon_{t}})(X_{t},Y_{t})\right)$
		$\displaystyle\leqslant\bigg{\|}\dfrac{\partial}{\partial t}\epsilon_{t}\bigg{\|}\left(C_{1,1}(\epsilon_{t})n_{t,1}+C_{1,2}(\epsilon_{t})\right)-c_{4}\epsilon_{t}^{5}n_{t,1}+c_{5}(\epsilon_{t})+c_{1}.$

	$\displaystyle\dfrac{\partial}{\partial t}n_{t,1}$	$\displaystyle=\left(\partial_{t}c_{t}\right)\partial_{c_{t}}n_{t,1}+\left(\partial_{t}\epsilon_{t}\right)\partial_{\epsilon_{t}}n_{t,1}+\dfrac{\partial}{\partial s}\mathbb{E}\left(R_{\epsilon_{t},c_{t}}(X_{t+s},Y_{t+s})\right)\bigg{\|}_{s=0}$
		$\displaystyle\leqslant\|\partial_{t}c_{t}\|\xi_{5}(\epsilon_{t})+\|\partial_{t}\epsilon_{t}\|\left(C_{1,1}(\epsilon_{t})n_{t,1}+C_{1,2}(\epsilon_{t})\right)+\mathbb{E}\left(L_{\epsilon_{t}}(R_{\epsilon_{t}})(X_{t},Y_{t})\right)$
		$\displaystyle\leqslant\|\partial_{t}c_{t}\|\xi_{5}(\epsilon_{t})+\|\partial_{t}\epsilon_{t}\|\left(C_{1,1}(\epsilon_{t})n_{t,1}+C_{1,2}(\epsilon_{t})\right)-c_{4}\epsilon_{t}^{5}n_{t,1}+c_{5}(\epsilon_{t})+c_{1},$

On the convergence of an improved and adaptive kinetic simulated annealing

Abstract.

1. Introduction

1.1. Notations

1.2. Overview of the main results

Assumption 1.1.

Theorem 1.1.

Theorem 1.2.

1.3. Numerical results

1.4. Organization of the paper

2. Proof of Theorem 1.1

2.1. Weak convergence of μϵ\mu_{\epsilon}

Proposition 2.1.

2.2. Existence and regularity for the density of IKSA

Proposition 2.2.

2.3. The log-Sobolev inequality

Proposition 2.3.

Remark 2.1.

2.4. Lyapunov function and moment estimates

Proposition 2.4.

Proposition 2.5 (Lyapunov property of RϵR_{\epsilon}).

2.4.1. Proof of Proposition 2.5

Lemma 2.1.

Lemma 2.2.

2.4.2. Proof of Proposition 2.4

Proposition 2.6.

Proposition 2.7.

2.5. Gamma calculus

Lemma 2.3.

Lemma 2.4.

Lemma 2.5.

2.6. Truncated differentiation

Lemma 2.6.

2.7. Dissipation of the distorted entropy

Lemma 2.7.

Lemma 2.8.

Proposition 2.8.

Proposition 2.9.

2.8. Wrapping up the proof

Remark 2.2.

3. An adaptive algorithm (IAKSA) and its convergence

Theorem 3.1.

3.1. Dissipation of the distorted entropy in the adaptive setting

Proposition 3.1.

Remark 3.1.

3.1.1. Auxiliary results

Lemma 3.1.

Proposition 3.2.

Proposition 3.3.

Proposition 3.4.

3.1.2. Proof of Proposition 3.1

3.2. Proof of Theorem 1.2

Lemma 3.2.

Appendix: setup of the numerical results

3.2.1. Description of the four annealing methods

3.2.2. Description of the test functions and the parameters

Acknowledgements

References

2.1. Weak convergence of $\mu_{\epsilon}$

Proposition 2.5 (Lyapunov property of $R_{\epsilon}$ ).