
Global optimization via Schrödinger-Föllmer diffusion

Yin Dai, Yuling Jiao, Lican Kang, Xiliang Lu, and Jerry Zhijian Yang

Yin Dai: School of Mathematics and Statistics, Wuhan University, Wuhan 430072, P.R. China ([email protected])
Yuling Jiao: School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China ([email protected])
Lican Kang: Center for Quantitative Medicine, Duke-NUS Medical School, Singapore ([email protected])
Xiliang Lu: School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China ([email protected])
Jerry Zhijian Yang: School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China ([email protected])
Abstract.

We study the problem of finding global minimizers of $V:\mathbb{R}^{d}\rightarrow\mathbb{R}$ approximately via sampling from a probability distribution $\mu_{\sigma}$ with density $p_{\sigma}(x)=\frac{\exp(-V(x)/\sigma)}{\int_{\mathbb{R}^{d}}\exp(-V(y)/\sigma)\,dy}$ with respect to the Lebesgue measure, for $\sigma\in(0,1]$ small enough. We analyze a sampler based on the Euler-Maruyama discretization of the Schrödinger-Föllmer diffusion process with stochastic approximation, under appropriate assumptions on the step size $s$ and the potential $V$. We prove that the output of the proposed sampler is an approximate global minimizer of $V$ with high probability, at the cost of sampling $\mathcal{O}(d^{3})$ standard normal random variables. Numerical studies illustrate the effectiveness of the proposed method and its superiority over the Langevin method.

Key words and phrases:
Global optimization, Schrödinger-Föllmer diffusion, Stochastic approximation

1. Introduction

In this paper we study the challenging problem of finding the global minimizers of a non-convex smooth function $V:\mathbb{R}^{d}\rightarrow\mathbb{R}$. Suppose $N:=\{x_{1}^{*},\ldots,x_{\kappa}^{*}\}\subset B_{R}$ is the set of the global minima of $V$ with finite cardinality, i.e.,

\[ x_{i}^{*}\in\operatorname*{argmin}_{x}V(x),\quad\text{for any }i=1,\ldots,\kappa, \tag{1} \]

where $B_{R}$ denotes the ball centered at the origin with radius $R>0$. More precisely, $\mathscr{L}(V(x)<V(x_{i}^{*})-\varepsilon)=0$ and $\mathscr{L}(V(x)<V(x_{i}^{*})+\varepsilon)>0$ for any $\varepsilon>0$, where $\mathscr{L}$ denotes the Lebesgue measure on $\mathbb{R}^{d}$. Without loss of generality, we may assume $V(x)=\|x\|_{2}^{2}/2$ outside the ball $B_{R}$; this does not change the global minimizers of $V$. For any given $\sigma\in(0,1]$, we define a constant $C_{\sigma}$ and a probability density function $p_{\sigma}(x)$ on $\mathbb{R}^{d}$ as

\[ p_{\sigma}(x):=\frac{\exp(-V(x)/\sigma)}{C_{\sigma}},\quad\text{for}\ C_{\sigma}:=\int_{\mathbb{R}^{d}}\exp(-V(x)/\sigma)\,dx<\infty. \]

Let $\mu_{\sigma}$ be the probability distribution corresponding to the density function $p_{\sigma}$. If the function $V$ is twice continuously differentiable, then the measure $\mu_{\sigma}$ converges weakly, as $\sigma$ goes to $0$, to a probability measure with weight proportional to $\left(\det\nabla^{2}V(x_{i}^{*})\right)^{-\frac{1}{2}}\big/\sum_{j=1}^{\kappa}\left(\det\nabla^{2}V(x_{j}^{*})\right)^{-\frac{1}{2}}$ at $x_{i}^{*}$, i.e.,

\[ \lim_{\sigma\rightarrow 0}\mu_{\sigma}=\frac{\sum_{i=1}^{\kappa}\left(\det\nabla^{2}V(x_{i}^{*})\right)^{-\frac{1}{2}}\delta_{x_{i}^{*}}}{\sum_{j=1}^{\kappa}\left(\det\nabla^{2}V(x_{j}^{*})\right)^{-\frac{1}{2}}}. \]

Therefore, solving the optimization problem (1) can be converted into sampling from the probability measure $\mu_{\sigma}$ for sufficiently small $\sigma$.

An efficient method for sampling from $\mu_{\sigma}$ is based on the overdamped Langevin stochastic differential equation (SDE), which is given by

\[ \mathrm{d}Z_{t}=-\nabla V(Z_{t})\,\mathrm{d}t+\sqrt{2\sigma}\,\mathrm{d}B_{t}, \tag{2} \]

where $(B_{t})_{t\geq 0}$ is a $d$-dimensional Brownian motion. Under certain conditions, the Langevin SDE (2) admits $\mu_{\sigma}$ as its unique invariant measure (Bakry et al., 2008). Hence the Langevin sampler applies the Euler-Maruyama discretization to this process in order to sample from $\mu_{\sigma}$. Convergence properties of the Langevin sampler under the strongly convex potential assumption have been established by Durmus and Moulines (2016); Dalalyan (2017a, b); Cheng and Bartlett (2018); Durmus and Moulines (2019); Dalalyan and Karagulyan (2019). Moreover, the strong convexity assumption can be replaced by other conditions that guarantee a log-Sobolev inequality for the target distribution, including a dissipativity condition on the drift term (Raginsky et al., 2017; Zhang et al., 2019; Mou et al., 2022) and local convexity of the potential outside a ball (Durmus and Moulines, 2017; Ma et al., 2019; Bou-Rabee et al., 2020).
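As an illustration, the Euler-Maruyama discretization of (2) (the unadjusted Langevin algorithm) can be sketched in a few lines of NumPy. This sketch is ours, not the authors' code; for the separable quadratic potential $V(x)=\|x\|_{2}^{2}/2$ the chain's stationary law is approximately $N(0,\sigma\mathbf{I}_{d})$, which gives a simple sanity check.

```python
import numpy as np

def langevin_sampler(grad_V, d, sigma, step=1e-2, n_iter=2000, seed=0):
    """Unadjusted Langevin algorithm: Euler-Maruyama discretization of
    dZ_t = -grad V(Z_t) dt + sqrt(2 sigma) dB_t, started at the origin."""
    rng = np.random.default_rng(seed)
    z = np.zeros(d)
    for _ in range(n_iter):
        z = z - step * grad_V(z) + np.sqrt(2.0 * sigma * step) * rng.standard_normal(d)
    return z
```

The stationary variance of the discretized chain is biased by a term of order of the step size, so smaller steps track $\mu_{\sigma}$ more closely at the cost of slower mixing.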

An alternative to the Langevin sampler is the class of algorithms based on the Schrödinger-Föllmer diffusion process (3). This process has been proposed for sampling and generative modelling (Tzen and Raginsky, 2019; Huang et al., 2021; Jiao et al., 2021; Wang et al., 2021). Subsequently, Ruzayqat et al. (2022) studied unbiased estimation of expectations based on the Schrödinger-Föllmer diffusion process. Vargas et al. (2021) applied this process to Bayesian inference for large datasets, where the posterior is reached in finite time. Zhang and Chen (2021) proposed the Path Integral Sampler (PIS), a generic sampler that generates samples by simulating a target-dependent SDE and can be trained with a free-form network architecture; the PIS is built on the Schrödinger-Föllmer diffusion process to reach the terminal distribution. Different from these existing works, the purpose of this paper is to solve non-convex smooth optimization problems. To that end, we rescale the Schrödinger-Föllmer diffusion process to sample from the target distribution $\mu_{\sigma}$.

To be precise, the Schrödinger-Föllmer diffusion process associated with $\mu_{\sigma}$ is defined as

\[ \mathrm{d}X_{t}=\sigma b\left(X_{t},t\right)\mathrm{d}t+\sqrt{\sigma}\,\mathrm{d}B_{t},\quad t\in[0,1],\quad X_{0}=0, \tag{3} \]

where the drift function is

\[ b(x,t)=\frac{\mathbb{E}_{Z\sim N(0,\mathbf{I}_{d})}[\nabla f_{\sigma}(x+\sqrt{(1-t)\sigma}Z)]}{\mathbb{E}_{Z\sim N(0,\mathbf{I}_{d})}[f_{\sigma}(x+\sqrt{(1-t)\sigma}Z)]}:\mathbb{R}^{d}\times[0,1]\rightarrow\mathbb{R}^{d} \]

with the density ratio $f_{\sigma}(\cdot)=p_{\sigma}(\cdot)/\phi_{\sigma}(\cdot)$, where $\phi_{\sigma}(\cdot)$ is the density of the normal distribution $N(0,\sigma\mathbf{I}_{d})$. According to Léonard (2014) and Eldan et al. (2020), the process $\{X_{t}\}_{t\in[0,1]}$ in (3) was first formulated by Föllmer (Föllmer, 1985, 1986, 1988) when studying the Schrödinger bridge problem (Schrödinger, 1932). The main feature of the Schrödinger-Föllmer diffusion process is that it interpolates between $\delta_{0}$ and $\mu_{\sigma}$ on the time interval $[0,1]$, i.e., $X_{1}\sim\mu_{\sigma}$; see Proposition 2.3. We can therefore solve the optimization problem (1) by sampling from $\mu_{\sigma}$ via the following Euler-Maruyama discretization of (3),

\[ Y_{t_{k+1}}=Y_{t_{k}}+\sigma s\,b\left(Y_{t_{k}},t_{k}\right)+\sqrt{\sigma s}\,\epsilon_{k+1},\quad Y_{t_{0}}=0,\quad k=0,1,\ldots,K-1, \]

where $s=1/K$ is the step size, $t_{k}=ks$, and $\{\epsilon_{k}\}_{k=1}^{K}$ are independent and identically distributed $N(0,\mathbf{I}_{d})$ random vectors. If the expectations in the drift term $b(x,t)$ do not have analytical forms, one can evaluate $b\left(Y_{t_{k}},t_{k}\right)$ approximately by the Monte Carlo method, i.e., one can sample from $\mu_{\sigma}$ according to

\[ \widetilde{Y}_{t_{k+1}}=\widetilde{Y}_{t_{k}}+\sigma s\,\tilde{b}_{m}\left(\widetilde{Y}_{t_{k}},t_{k}\right)+\sqrt{\sigma s}\,\epsilon_{k+1},\quad\widetilde{Y}_{t_{0}}=0,\quad k=0,1,\ldots,K-1, \]

where $\tilde{b}_{m}(\widetilde{Y}_{t_{k}},t_{k}):=\dfrac{\frac{1}{m}\sum_{j=1}^{m}\nabla f_{\sigma}(\widetilde{Y}_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z_{j})}{\frac{1}{m}\sum_{j=1}^{m}f_{\sigma}(\widetilde{Y}_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z_{j})}$ with $Z_{1},\ldots,Z_{m}$ i.i.d. $N(0,\mathbf{I}_{d})$.

The main result of this paper is summarized in the following.

Theorem 1.1.

(Informal) Under condition $(\textbf{A})$, for any $0<\delta\ll 1$, with probability at least $1-\sqrt{\delta}$, $\widetilde{Y}_{t_{K}}$ is a $\tau$-global minimizer of $V$, i.e., $V(\widetilde{Y}_{t_{K}})\leq\tau+\inf V(x)$, provided that the number of iterations satisfies $K\geq\mathcal{O}\left(\frac{d^{2}}{\delta}\right)$, the number of Gaussian samples per iteration satisfies $m\geq\mathcal{O}\left(\frac{d}{\delta}\right)$, and $\sigma\leq\mathcal{O}\left(\frac{\tau}{\log(1/\delta)}\right)$.

The rest of this paper is organized as follows. In Section 2, we give the motivation and the details of the approximation method, i.e., Algorithm 1. In Section 3, we present the theoretical analysis of the proposed method. In Section 4, a numerical example is given to validate the efficiency of the method. We conclude the manuscript in Section 5. Proofs of all propositions and theorems are provided in Appendix 7.

2. Methodology Description

In this section we first provide some background on the Schrödinger-Föllmer diffusion. Then we propose Algorithm 1 to solve the minimization problem (1) based on the Euler-Maruyama discretization scheme of the Schrödinger-Föllmer diffusion.

2.1. Background on Schrödinger-Föllmer diffusion

We first recall the Schrödinger bridge problem, then introduce the Schrödinger-Föllmer diffusion.

2.1.1. Schrödinger bridge problem

Let $\Omega=C([0,1],\mathbb{R}^{d})$ be the space of $\mathbb{R}^{d}$-valued continuous functions on the time interval $[0,1]$. Denote by $Z=(Z_{t})_{t\in[0,1]}$ the canonical process on $\Omega$, where $Z_{t}(\omega)=\omega_{t}$ for $\omega=(\omega_{s})_{s\in[0,1]}\in\Omega$. The canonical $\sigma$-field on $\Omega$ is then $\mathscr{F}=\sigma(Z_{t},t\in[0,1])=\left\{\{\omega:(Z_{t}(\omega))_{t\in[0,1]}\in H\}:H\in\mathcal{B}(\mathbb{R}^{d})\right\}$, where $\mathcal{B}(\mathbb{R}^{d})$ denotes the Borel $\sigma$-algebra of $\mathbb{R}^{d}$. Denote by $\mathcal{P}(\Omega)$ the space of probability measures on the path space $\Omega$, and by $\mathbf{W}^{\mathbf{x}}_{\sigma}\in\mathcal{P}(\Omega)$ the Wiener measure with variance $\sigma$ whose initial marginal is $\delta_{\mathbf{x}}$. The law of the reversible Brownian motion is then defined as $\mathbf{P}_{\sigma}=\int\mathbf{W}_{\sigma}^{\mathbf{x}}\,\mathrm{d}\mathbf{x}$, an unbounded measure on $\Omega$. Observe that the marginal distribution of $\mathbf{P}_{\sigma}$ coincides with the Lebesgue measure $\mathscr{L}$ at each $t$. Schrödinger (1932) studied the problem of finding the most likely random evolution between two probability distributions $\widetilde{\nu},\widetilde{\mu}\in\mathcal{P}(\mathbb{R}^{d})$. This problem is referred to as the Schrödinger bridge problem (SBP). The SBP can be formulated as seeking a probability law on the path space that interpolates between $\widetilde{\nu}$ and $\widetilde{\mu}$ and is closest to the prior law of the Brownian diffusion in relative entropy (Jamison, 1975; Léonard, 2014), i.e., finding a path measure $\mathbf{Q}^{*}\in\mathcal{P}(\Omega)$ with marginals $\mathbf{Q}^{*}_{t}=(Z_{t})_{\#}\mathbf{Q}^{*}=\mathbf{Q}^{*}\circ Z_{t}^{-1}$, $t\in[0,1]$, such that

\[ \mathbf{Q}^{*}\in\arg\min\mathbb{D}_{\mathrm{KL}}(\mathbf{Q}\,\|\,\mathbf{P}_{\sigma}),\quad\text{with}\ \mathbf{Q}_{0}=\widetilde{\nu},\ \mathbf{Q}_{1}=\widetilde{\mu}, \]

where the relative entropy is defined by $\mathbb{D}_{\mathrm{KL}}(\mathbf{Q}\,\|\,\mathbf{P}_{\sigma})=\int\log\left(\frac{d\mathbf{Q}}{d\mathbf{P}_{\sigma}}\right)d\mathbf{Q}$ if $\mathbf{Q}\ll\mathbf{P}_{\sigma}$ (i.e., $\mathbf{Q}$ is absolutely continuous with respect to $\mathbf{P}_{\sigma}$), and $\mathbb{D}_{\mathrm{KL}}(\mathbf{Q}\,\|\,\mathbf{P}_{\sigma})=+\infty$ otherwise. The following theorem characterizes the solution of the SBP.

Theorem 2.1 (Léonard (2014)).

If the measures $\widetilde{\nu},\widetilde{\mu}\ll\mathscr{L}$, then the SBP admits a unique solution $d\mathbf{Q}^{*}=f^{*}(Z_{0})g^{*}(Z_{1})\,d\mathbf{P}_{\sigma}$, where $f^{*}$ and $g^{*}$ are $\mathscr{L}$-measurable non-negative functions satisfying the Schrödinger system

\[ \left\{\begin{array}{l} f^{*}(\mathbf{x})\,\mathbb{E}_{\mathbf{P}_{\sigma}}\left[g^{*}\left(Z_{1}\right)\mid Z_{0}=\mathbf{x}\right]=\dfrac{d\widetilde{\nu}}{d\mathscr{L}}(\mathbf{x}),\quad\mathscr{L}\text{-a.e.},\\ g^{*}(\mathbf{y})\,\mathbb{E}_{\mathbf{P}_{\sigma}}\left[f^{*}\left(Z_{0}\right)\mid Z_{1}=\mathbf{y}\right]=\dfrac{d\widetilde{\mu}}{d\mathscr{L}}(\mathbf{y}),\quad\mathscr{L}\text{-a.e.} \end{array}\right. \]

Furthermore, the pair $(\mathbf{Q}^{*}_{t},\mathbf{v}^{*}_{t})$ with

\[ \mathbf{v}^{*}_{t}(\mathbf{x})=\nabla_{\mathbf{x}}\log\mathbb{E}_{\mathbf{P}_{\sigma}}\left[g^{*}\left(Z_{1}\right)\mid Z_{t}=\mathbf{x}\right] \]

solves the minimum action problem

\[ \min_{\mu_{t},\mathbf{v}_{t}}\int_{0}^{1}\mathbb{E}_{\mathbf{z}\sim\mu_{t}}[\|\mathbf{v}_{t}(\mathbf{z})\|^{2}]\,\mathrm{d}t \]

such that

\[ \left\{\begin{array}{l} \partial_{t}\mu_{t}=-\nabla\cdot(\mu_{t}\mathbf{v}_{t})+\frac{\sigma\Delta\mu_{t}}{2},\quad\text{on }(0,1)\times\mathbb{R}^{d},\\ \mu_{0}=\widetilde{\nu},\quad\mu_{1}=\widetilde{\mu}. \end{array}\right. \]

Let $K_{\sigma}(s,\mathbf{x},t,\mathbf{y})=[2\pi\sigma(t-s)]^{-d/2}\exp\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^{2}}{2\sigma(t-s)}\right)$ be the transition density of the Wiener process with variance $\sigma$, and let $\widetilde{q}(\mathbf{x})$ and $\widetilde{p}(\mathbf{y})$ be the densities of $\widetilde{\nu}$ and $\widetilde{\mu}$, respectively. Denote by

\[ f_{0}(\mathbf{x})=f^{*}(\mathbf{x}),\qquad g_{1}(\mathbf{y})=g^{*}(\mathbf{y}), \]
\[ f_{1}(\mathbf{y})=\mathbb{E}_{\mathbf{P}_{\sigma}}\left[f^{*}\left(Z_{0}\right)\mid Z_{1}=\mathbf{y}\right]=\int K_{\sigma}(0,\mathbf{x},1,\mathbf{y})f_{0}(\mathbf{x})\,d\mathbf{x}, \]
\[ g_{0}(\mathbf{x})=\mathbb{E}_{\mathbf{P}_{\sigma}}\left[g^{*}\left(Z_{1}\right)\mid Z_{0}=\mathbf{x}\right]=\int K_{\sigma}(0,\mathbf{x},1,\mathbf{y})g_{1}(\mathbf{y})\,d\mathbf{y}. \]

Then the Schrödinger system in Theorem 2.1 can also be characterized by

\[ \widetilde{q}(\mathbf{x})=f_{0}(\mathbf{x})g_{0}(\mathbf{x}),\qquad\widetilde{p}(\mathbf{y})=f_{1}(\mathbf{y})g_{1}(\mathbf{y}), \]

with the following forward and backward time harmonic equations (Chen et al., 2021),

\[ \left\{\begin{array}{l} \partial_{t}f_{t}(\mathbf{x})=\frac{\sigma\Delta}{2}f_{t}(\mathbf{x}),\\ \partial_{t}g_{t}(\mathbf{x})=-\frac{\sigma\Delta}{2}g_{t}(\mathbf{x}), \end{array}\right.\quad\text{on }(0,1)\times\mathbb{R}^{d}. \]

Let $q_{t}$ denote the marginal density of $\mathbf{Q}_{t}^{*}$, i.e., $q_{t}(\mathbf{x})=\frac{d\mathbf{Q}_{t}^{*}}{d\mathscr{L}}(\mathbf{x})$; it can be represented as the product of $g_{t}$ and $f_{t}$ (Chen et al., 2021). Let $\mathcal{V}$ consist of admissible Markov controls with finite energy. Then the vector field

\[ \mathbf{v}^{*}_{t}=\sigma\nabla_{\mathbf{x}}\log g_{t}(\mathbf{x})=\sigma\nabla_{\mathbf{x}}\log\int K_{\sigma}(t,\mathbf{x},1,\mathbf{y})g_{1}(\mathbf{y})\,\mathrm{d}\mathbf{y} \tag{4} \]

solves the following stochastic control problem.

Theorem 2.2 (Dai Pra (1991)).
\[ \mathbf{v}^{*}_{t}(\mathbf{x})\in\arg\min_{\mathbf{v}\in\mathcal{V}}\mathbb{E}\left[\int_{0}^{1}\frac{1}{2\sigma}\|\mathbf{v}_{t}\|^{2}\,\mathrm{d}t\right] \]

such that

\[ \left\{\begin{array}{l} \mathrm{d}\mathbf{x}_{t}=\mathbf{v}_{t}\,\mathrm{d}t+\sqrt{\sigma}\,\mathrm{d}B_{t},\\ \mathbf{x}_{0}\sim\widetilde{q}(\mathbf{x}),\quad\mathbf{x}_{1}\sim\widetilde{p}(\mathbf{x}). \end{array}\right. \tag{5} \]

According to Theorem 2.2, the dynamics determined by the SDE in (5) with the time-varying drift $\mathbf{v}^{*}_{t}$ in (4) drive particles sampled from the initial distribution $\widetilde{\nu}$ to particles distributed according to the target distribution $\widetilde{\mu}$ over the unit time interval. This property is exactly what we need for designing samplers: we can sample from the target distribution $\widetilde{\mu}$ by pushing forward a simple reference distribution $\widetilde{\nu}$. In particular, if we take the initial distribution $\widetilde{\nu}$ to be $\delta_{0}$, the degenerate distribution at $0$, then the Schrödinger-Föllmer diffusion process (7) defined below solves (5), i.e., it transports $\delta_{0}$ to the target distribution.

2.1.2. Schrödinger-Föllmer diffusion process

From now on, without loss of generality, we assume that the minimum value of $V$ is $0$, i.e., $V(x_{i}^{*})=0$ for $i=1,\ldots,\kappa$; otherwise we replace $V$ by $V-\min_{x}V(x)$. Since $\mu_{\sigma}$ is absolutely continuous with respect to the $d$-dimensional Gaussian distribution $N(0,\sigma\mathbf{I}_{d})$, we denote the Radon-Nikodym derivative of $\mu_{\sigma}$ with respect to $N(0,\sigma\mathbf{I}_{d})$ by

\[ f_{\sigma}(x):=\frac{d\mu_{\sigma}}{dN(0,\sigma\mathbf{I}_{d})}(x)=\frac{p_{\sigma}(x)}{\phi_{\sigma}(x)},\quad x\in\mathbb{R}^{d}. \tag{6} \]

Let $Q_{t}$ be the heat semigroup defined by

\[ Q_{t}f_{\sigma}(x):=\mathbb{E}_{Z\sim N(0,\mathbf{I}_{d})}[f_{\sigma}(x+\sqrt{t\sigma}Z)],\quad t\in[0,1]. \]

The Schrödinger-Föllmer diffusion process $\{X_{t}\}_{t\in[0,1]}$ (Föllmer, 1985, 1986, 1988) is defined as

\[ \mathrm{d}X_{t}=\sigma b(X_{t},t)\,\mathrm{d}t+\sqrt{\sigma}\,\mathrm{d}B_{t},\quad X_{0}=0,\quad t\in[0,1], \tag{7} \]

where $b(x,t):\mathbb{R}^{d}\times[0,1]\rightarrow\mathbb{R}^{d}$ is the drift term given by

\[ b(x,t)=\nabla\log Q_{1-t}f_{\sigma}(x). \]

The process $\{X_{t}\}_{t\in[0,1]}$ defined in (7) is a solution to (5) with $\tilde{\nu}=\delta_{0}$, $\tilde{\mu}=\mu_{\sigma}$, and $\mathbf{v}_{t}(x)=b(x,t)$ (Dai Pra, 1991; Lehec, 2013; Eldan et al., 2020). Let

\[ \hat{f}_{\sigma}(x):=C_{\sigma}(2\pi\sigma)^{-\frac{d}{2}}f_{\sigma}(x). \]

The drift term $b(x,t)$ is scale-invariant with respect to $f_{\sigma}$ in the sense that $b(x,t)=\nabla\log Q_{1-t}Cf_{\sigma}(x)$ for any $C>0$. Therefore, the Schrödinger-Föllmer diffusion can be used for sampling from an unnormalized distribution $\mu_{\sigma}$; that is, the normalizing constant $C_{\sigma}$ of $\mu_{\sigma}$ need not be known. Since $\mu_{\sigma}(dx)=\exp(-V(x)/\sigma)\,dx/C_{\sigma}$ with normalizing constant $C_{\sigma}$, we have $f_{\sigma}(x)=\frac{\left(\sqrt{2\pi\sigma}\right)^{d}}{C_{\sigma}}\exp\left(-\frac{V(x)}{\sigma}+\frac{\|x\|_{2}^{2}}{2\sigma}\right)$ and $\hat{f}_{\sigma}(x)=\exp\left(-\frac{V(x)}{\sigma}+\frac{\|x\|^{2}_{2}}{2\sigma}\right)$.
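In practice, evaluating $\hat{f}_{\sigma}(x)=\exp(-V(x)/\sigma+\|x\|_{2}^{2}/(2\sigma))$ directly overflows for quite moderate $\|x\|_{2}/\sigma$. Since the drift is scale-invariant, a Monte Carlo estimate of $b(x,t)$ can be computed entirely in log-space with self-normalized (softmax-style) weights. The sketch below is our illustration of the gradient-free Stein form of the drift, not the authors' implementation; `V` is assumed to act on an $(m,d)$ array of points.

```python
import numpy as np

def drift_mc(y, t, V, sigma, m=2000, rng=None):
    """Monte Carlo drift estimate in Stein form, E[Z f_hat]/(c E[f_hat]),
    evaluated in log-space: exponentiating f_hat directly would overflow.
    V maps an (m, d) array of points to an (m,) array; requires t < 1."""
    rng = np.random.default_rng() if rng is None else rng
    c = np.sqrt((1.0 - t) * sigma)
    Z = rng.standard_normal((m, len(y)))
    X = y + c * Z
    logf = -V(X) / sigma + 0.5 * (X ** 2).sum(axis=1) / sigma  # log f_hat
    w = np.exp(logf - logf.max())   # self-normalized weights: the scale
    w /= w.sum()                    # (normalizing constant) cancels exactly
    return (w[:, None] * Z).sum(axis=0) / c
```

For the toy quadratic $V(x)=\|x-a\|_{2}^{2}/2$ (which ignores the outside-ball convention but makes the drift exactly $a/\sigma$ at every $(x,t)$), the estimate can be checked against the closed form.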

To ensure that the SDE (7) admits a unique strong solution, we assume that

  • $(\textbf{A})$ $V(x)$ is twice continuously differentiable on $\mathbb{R}^{d}$, and $V(x)=\|x\|_{2}^{2}/2$ outside a ball $B_{R}$.

As mentioned earlier, Assumption $(\mathbf{A})$ can be ensured by smoothing, and it does not change the global minimizers of $V$. Under Assumption $(\mathbf{A})$, we have the following two properties, which in turn imply that the SDE (7) admits a unique strong solution.

  • $(\textbf{P1})$ For each $\sigma\in(0,1]$, $\hat{f}_{\sigma}$ and $\nabla\hat{f}_{\sigma}$ are Lipschitz continuous with constant $\gamma_{\sigma}>0$, where
\[ \gamma_{\sigma}:=\max_{\|x\|_{2}\leq R}\exp\left(-\frac{V(x)}{\sigma}+\frac{\|x\|_{2}^{2}}{2\sigma}\right)\left\{\left\|\frac{x}{\sigma}-\frac{\nabla V(x)}{\sigma}\right\|_{2}^{2}+\left\|\frac{\mathbf{I}_{d}}{\sigma}-\frac{\nabla^{2}V(x)}{\sigma}\right\|_{2}\right\}=\left\{\left(M_{2,R}/\sigma\right)^{2}+M_{3,R}/\sigma\right\}\exp\left(M_{1,R}/\sigma\right), \]
and
\[ M_{1,R}:=\max_{\|x\|_{2}\leq R}\left\{-V(x)+\frac{\|x\|^{2}_{2}}{2}\right\},\quad M_{2,R}:=\max_{\|x\|_{2}\leq R}\left\|x-\nabla V(x)\right\|_{2},\quad M_{3,R}:=\max_{\|x\|_{2}\leq R}\left\|\mathbf{I}_{d}-\nabla^{2}V(x)\right\|_{2}. \tag{8} \]
  • $(\textbf{P2})$ For each $\sigma\in(0,1]$, there exists a constant $\xi_{\sigma}>0$ such that $\hat{f}_{\sigma}\geq\xi_{\sigma}$, where
\[ \xi_{\sigma}:=\exp\left(m_{1,R}/\sigma\right)\quad\text{with}\quad m_{1,R}:=\min_{\|x\|_{2}\leq R}\left\{-V(x)+\frac{\|x\|_{2}^{2}}{2}\right\}. \tag{9} \]

Properties (P1)-(P2) are proved in Section 7.1 of the Appendix. We note that (P1)-(P2) are used as assumptions in Lehec (2013) and Tzen and Raginsky (2019).
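For intuition about the size of these constants, they can be evaluated numerically for a toy example (ours, not from the paper): the one-dimensional double well $V(x)=(x^{2}-1)^{2}$ on $B_{R}$ with $R=2$, where $M_{1,R}$, $M_{2,R}$, $M_{3,R}$ and $m_{1,R}$ reduce to simple maxima and minima over a grid.

```python
import numpy as np

# constants (8)-(9) for the toy 1-d potential V(x) = (x^2 - 1)^2 on B_R, R = 2
R, sigma = 2.0, 0.5
x = np.linspace(-R, R, 200001)
V = (x ** 2 - 1.0) ** 2
dV = 4.0 * x * (x ** 2 - 1.0)     # V'(x)
d2V = 12.0 * x ** 2 - 4.0         # V''(x)

M1 = np.max(-V + 0.5 * x ** 2)    # M_{1,R}
M2 = np.max(np.abs(x - dV))       # M_{2,R}  (|.| is the norm in d = 1)
M3 = np.max(np.abs(1.0 - d2V))    # M_{3,R}
m1 = np.min(-V + 0.5 * x ** 2)    # m_{1,R}

gamma = ((M2 / sigma) ** 2 + M3 / sigma) * np.exp(M1 / sigma)   # gamma_sigma
xi = np.exp(m1 / sigma)                                         # xi_sigma
```

The ratio $\gamma_{\sigma}/\xi_{\sigma}$ scales like $\exp((M_{1,R}-m_{1,R})/\sigma)$, which is why the non-asymptotic bounds in Section 3 pay an exponential price as $\sigma\downarrow 0$.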

Thanks to (P1) and (P2), some calculations show that

\[ \|\nabla\hat{f}_{\sigma}\|_{2}\leq\gamma_{\sigma},\qquad\|\nabla^{2}\hat{f}_{\sigma}\|_{2}\leq\gamma_{\sigma}, \]
\[ \sup_{x\in\mathbb{R}^{d},\,t\in[0,1]}\|\nabla Q_{1-t}\hat{f}_{\sigma}(x)\|_{2}\leq\gamma_{\sigma},\qquad\sup_{x\in\mathbb{R}^{d},\,t\in[0,1]}\|\nabla^{2}(Q_{1-t}\hat{f}_{\sigma}(x))\|_{2}\leq\gamma_{\sigma}, \]

and

\[ b(x,t)=\frac{\nabla Q_{1-t}\hat{f}_{\sigma}(x)}{Q_{1-t}\hat{f}_{\sigma}(x)},\qquad\nabla b(x,t)=\frac{\nabla^{2}(Q_{1-t}\hat{f}_{\sigma})(x)}{Q_{1-t}\hat{f}_{\sigma}(x)}-b(x,t)b(x,t)^{\top}. \]

We conclude that

\[ \sup_{x\in\mathbb{R}^{d},\,t\in[0,1]}\|b(x,t)\|_{2}\leq\gamma_{\sigma}/\xi_{\sigma},\qquad\sup_{x\in\mathbb{R}^{d},\,t\in[0,1]}\|\nabla b(x,t)\|_{2}\leq\gamma_{\sigma}/\xi_{\sigma}+\gamma^{2}_{\sigma}/\xi_{\sigma}^{2}. \]

Furthermore, we can also easily deduce that the drift term $b$ satisfies a linear growth condition and a Lipschitz continuity condition (Revuz and Yor, 2013; Pavliotis, 2014), that is,

\[ \|b(x,t)\|_{2}^{2}\leq C_{0,\sigma}(1+\|x\|_{2}^{2}),\quad x\in\mathbb{R}^{d},\ t\in[0,1], \tag{C1} \]

and

\[ \|b(x,t)-b(y,t)\|_{2}\leq C_{1,\sigma}\|x-y\|_{2},\quad x,y\in\mathbb{R}^{d},\ t\in[0,1], \tag{C2} \]

where $C_{0,\sigma}$ and $C_{1,\sigma}$ are finite positive constants depending only on $\sigma$. The linear growth condition (C1) and the Lipschitz continuity condition (C2) ensure the existence of a unique strong solution of the Schrödinger-Föllmer SDE (7). We summarize the above discussion in the following proposition, whose proof is given in Appendix 7.

Proposition 2.3.

Under assumption $(\textbf{A})$, the Schrödinger-Föllmer SDE (7) has a unique strong solution $\{X_{t}\}_{t\in[0,1]}$ with $X_{0}\sim\delta_{0}$ and $X_{1}\sim\mu_{\sigma}$.

2.2. Euler-Maruyama discretization for Schrödinger-Föllmer Diffusion

Proposition 2.3 shows that the Schrödinger-Föllmer diffusion transports $\delta_{0}$ to the probability measure $\mu_{\sigma}$ over the unit time interval. Because the drift term $b(x,t)$ is scale-invariant with respect to $f_{\sigma}$ in the sense that $b(x,t)=\nabla\log Q_{1-t}Cf_{\sigma}(x)$ for all $C>0$, the Schrödinger-Föllmer diffusion can be used for sampling from $\mu_{\sigma}(dx)=\exp(-V(x)/\sigma)\,dx/C_{\sigma}$, where the normalizing constant $C_{\sigma}$ may be unknown. To this end, we use the Euler-Maruyama method to discretize the Schrödinger-Föllmer diffusion (7). Letting

\[ t_{k}=ks,\quad k=0,1,\ldots,K,\quad\text{with}\ s=1/K,\qquad Y_{t_{0}}=0, \]

the Euler-Maruyama discretization scheme reads

\[ Y_{t_{k+1}}=Y_{t_{k}}+\sigma s\,b\left(Y_{t_{k}},t_{k}\right)+\sqrt{\sigma s}\,\epsilon_{k+1},\quad k=0,1,\ldots,K-1, \tag{10} \]

where $\{\epsilon_{k}\}_{k=1}^{K}$ are i.i.d. $N(0,\mathbf{I}_{d})$ and

\[ b(Y_{t_{k}},t_{k})=\frac{\mathbb{E}_{Z}[\nabla\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z)]}{\mathbb{E}_{Z}[\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z)]}=\frac{\mathbb{E}_{Z}[Z\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z)]}{\mathbb{E}_{Z}[\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z)]\sqrt{(1-t_{k})\sigma}}, \tag{11} \]

where the second equality follows from Stein's lemma (Stein, 1972, 1986; Landsman and Nešlehová, 2008). The expectations in (11) may not admit explicit expressions. We therefore consider an estimator $\tilde{b}_{m}$ of $b$ obtained by replacing each $\mathbb{E}_{Z}$ in the drift term with an $m$-sample mean, i.e.,

\[ \tilde{b}_{m}(Y_{t_{k}},t_{k})=\frac{\frac{1}{m}\sum_{j=1}^{m}\nabla\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z_{j})}{\frac{1}{m}\sum_{j=1}^{m}\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z_{j})},\quad k=0,\ldots,K-1, \tag{12} \]

or

\[ \tilde{b}_{m}(Y_{t_{k}},t_{k})=\frac{\frac{1}{m}\sum_{j=1}^{m}Z_{j}\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z_{j})}{\frac{1}{m}\sum_{j=1}^{m}\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z_{j})\cdot\sqrt{(1-t_{k})\sigma}},\quad k=0,\ldots,K-1, \tag{13} \]

where $Z_{1},\ldots,Z_{m}$ are i.i.d. $N(0,\mathbf{I}_{d})$. The proposed method is summarized in Algorithm 1 below.

Algorithm 1 Solving (1) via Euler-Maruyama discretization of Schrödinger-Föllmer Diffusion
1:  Input: $\sigma$, $m$, $K$. Initialize $s=1/K$, $\widetilde{Y}_{t_{0}}=0$.
2:  for $k=0,1,\ldots,K-1$ do
3:     Sample $\epsilon_{k+1}\sim N(0,\mathbf{I}_{d})$.
4:     Sample $Z_{i}$, $i=1,\ldots,m$, from $N(0,\mathbf{I}_{d})$.
5:     Compute $\tilde{b}_{m}$ according to (12) or (13).
6:     $\widetilde{Y}_{t_{k+1}}=\widetilde{Y}_{t_{k}}+\sigma s\,\tilde{b}_{m}\left(\widetilde{Y}_{t_{k}},t_{k}\right)+\sqrt{\sigma s}\,\epsilon_{k+1}$.
7:  end for
8:  Output: $\{\widetilde{Y}_{t_{k}}\}_{k=1}^{K}$.
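A compact NumPy sketch of Algorithm 1 (our illustration, not the authors' code) using the gradient-based estimator (12) might read as follows. Under assumption $(\textbf{A})$, $\nabla\hat{f}_{\sigma}(x)=\hat{f}_{\sigma}(x)\,(x-\nabla V(x))/\sigma$, so the ratio in (12) becomes a softmax-weighted average once $\hat{f}_{\sigma}$ is evaluated in log-space. `V` and `grad_V` are assumed to act on an $(m,d)$ array of points.

```python
import numpy as np

def sf_sampler(V, grad_V, d, sigma=0.1, K=100, m=100, seed=0):
    """One run of Algorithm 1: Euler-Maruyama for the Schrodinger-Follmer
    diffusion with the m-sample drift estimator (12).
    V: (m, d) -> (m,), grad_V: (m, d) -> (m, d)."""
    rng = np.random.default_rng(seed)
    s = 1.0 / K
    y = np.zeros(d)
    for k in range(K):
        c = np.sqrt((1.0 - k * s) * sigma)
        X = y + c * rng.standard_normal((m, d))
        # log f_hat(x) = -V(x)/sigma + ||x||^2/(2 sigma), kept in log-space
        logf = -V(X) / sigma + 0.5 * (X ** 2).sum(axis=1) / sigma
        w = np.exp(logf - logf.max())
        w /= w.sum()
        # grad f_hat = f_hat * (x - grad V(x))/sigma, so (12) is the
        # softmax-weighted average of (x_j - grad V(x_j))/sigma
        b = (w[:, None] * (X - grad_V(X)) / sigma).sum(axis=0)
        y = y + sigma * s * b + np.sqrt(sigma * s) * rng.standard_normal(d)
    return y
```

For $V(x)=\|x\|_{2}^{2}/2$ one has $\hat{f}_{\sigma}\equiv 1$, the drift vanishes, and $\widetilde{Y}_{t_{K}}\sim N(0,\sigma\mathbf{I}_{d})=\mu_{\sigma}$ exactly, which gives a convenient sanity check.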

In the next section, we establish a probability bound for $\widetilde{Y}_{t_{K}}$ being a $\tau$-global minimizer (Theorem 3.5), and derive a bound on the Wasserstein-2 distance between the law of $\widetilde{Y}_{t_{K}}$ generated by Algorithm 1 and the probability measure $\mu_{\sigma}$ under certain conditions (Theorem 3.7).

3. Theoretical Property

In this section, we show that the Gibbs measure $\mu_{\sigma}$ converges weakly to a distribution concentrated on the optimal points $\{x_{1}^{*},\ldots,x_{\kappa}^{*}\}$. Since the minimum value of $V$ is $0$, we estimate the probabilities of $V(X_{1})>\tau$ and $V(\widetilde{Y}_{t_{K}})>\tau$ for any $\tau>0$, and establish non-asymptotic bounds in the Wasserstein-2 distance between the law of the samples generated by Algorithm 1 and the target distribution $\mu_{\sigma}$. Recall that the linear growth condition (C1) and the Lipschitz condition (C2) hold under properties (P1) and (P2), which ensure that the Schrödinger-Föllmer SDE (7) has a unique strong solution. In addition, we assume that the drift term $b(x,t)$ is Lipschitz continuous in $x$ and $\frac{1}{2}$-Hölder continuous in $t$,

\[ \|b(x,t)-b(y,s)\|_{2}\leq C_{2,\sigma}\left(\|x-y\|_{2}+|t-s|^{\frac{1}{2}}\right), \tag{C3} \]

where $C_{2,\sigma}$ is a positive constant depending on $\sigma$.

Remark 3.1.

(C1) and (C2) are essentially the sufficient conditions under which the Schrödinger-Föllmer SDE (7) admits a unique strong solution. (C3) was introduced in Theorem 4.1 of Tzen and Raginsky (2019) and is similar to condition (H2) of Chau et al. (2021) and Assumption 3.2 of Barkhagen et al. (2021). Obviously, (C3) implies (C2), and (C1) holds whenever the drift term $b(x,t)$ is bounded over $\mathbb{R}^{d}\times[0,1]$.

First, we show that the Gibbs measure $\mu_{\sigma}$ converges weakly to a limiting distribution supported on the global minimizers. This result can be traced back to the 1980s; for the continuity of the exposition, we use Laplace's method as in Hwang (1980, 1981) to give a detailed proof. The key idea is to show that for each sufficiently small $\delta^{\prime}>0$, $\mu_{\sigma}(\{x:\|x-x_{i}^{*}\|_{2}<\delta^{\prime}\})$ converges to $\dfrac{\left(\det\nabla^{2}V(x_{i}^{*})\right)^{-\frac{1}{2}}}{\sum_{j=1}^{\kappa}\left(\det\nabla^{2}V(x_{j}^{*})\right)^{-\frac{1}{2}}}$ as $\sigma\downarrow 0$.

Next, we estimate the probabilities of $V(X_{1})>\tau$ and $V(\widetilde{Y}_{t_{K}})>\tau$ for any $\tau>0$. The second estimate is more involved due to the discretization; the main idea comes from Dalalyan (2017b) and Cheng et al. (2018), who construct a continuous-time interpolation SDE for the Euler-Maruyama discretization. In their works, the relative entropy is controlled by using Girsanov's theorem to estimate Radon-Nikodym derivatives.

Another method for controlling the relative entropy was proposed by Mou et al. (2022). By direct calculation, the time derivative of the relative entropy between the interpolated process and the original SDE (7) is controlled by the mean squared difference between the drift terms of the Fokker-Planck equations of the original and interpolated processes. Compared with the bound obtained from Lemma 7.2, the bound in Mou et al. (2022) has an additional backward conditional expectation inside the norm, which is a key reason for the higher precision order. However, it requires a dissipativity condition on the drift term of the SDE and smoothness of the initial distribution.

The precise results are stated in Theorems 3.2, 3.5 and 3.7 below; see Appendix 7 for detailed proofs.

Theorem 3.2.

Let $V:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be a twice continuously differentiable function. Suppose the set $N:=\{x\in\mathbb{R}^{d}:V(x)=\inf_{x}V(x)\}=\{x_{1}^{*},\ldots,x_{\kappa}^{*}\}$ is finite with $\kappa\geq 1$, and $\int_{\mathbb{R}^{d}}\exp(-V(x))\,dx<\infty$. Then

μσ𝑤i=1κ(det2V(xi))12δxij=1κ(det2V(xj))12,as σ0.\mu_{\sigma}\xrightarrow{w}\dfrac{\sum_{i=1}^{\kappa}\left(\det\nabla^{2}V(x_{i}^{*})\right)^{-\frac{1}{2}}\delta_{x_{i}^{*}}}{\sum_{j=1}^{\kappa}\left(\det\nabla^{2}V(x_{j}^{*})\right)^{-\frac{1}{2}}},\quad\text{as $\sigma\downarrow 0.$}
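As a sanity check of this limit, the Gibbs mass near each minimizer can be evaluated numerically. The sketch below uses a hypothetical one-dimensional potential of our own choosing (not from the paper) with two global minimizers of different curvature, so the limiting weights are unequal:

```python
import numpy as np

# Illustrative potential (our choice, not from the paper):
#   V(x) = (x^2 - 1)^2 * exp(x/2),
# with exactly two global minimizers x = -1, 1 (V = 0 at both) and
# curvatures V''(-1) = 8 e^{-1/2}, V''(1) = 8 e^{1/2}.  Theorem 3.2 then
# predicts limiting mass near x = -1 equal to
#   (V''(-1))^{-1/2} / [(V''(-1))^{-1/2} + (V''(1))^{-1/2}] = 1/(1 + e^{-1/2}).
def V(x):
    return (x**2 - 1.0)**2 * np.exp(x / 2.0)

sigma = 1e-3
x = np.linspace(-4.0, 4.0, 400001)
w = np.exp(-V(x) / sigma)                  # unnormalized Gibbs density
mass_left = w[np.abs(x + 1.0) < 0.3].sum() / w.sum()
predicted = 1.0 / (1.0 + np.exp(-0.5))     # limiting weight of x = -1
print(mass_left, predicted)
```

For small σ\sigma the computed mass agrees with the predicted weight up to corrections that vanish with σ\sigma.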

Given Theorem 3.2, a natural question concerns the rate at which the measure μσ\mu_{\sigma} converges to this limiting distribution. To that end, we apply tools from the large deviation theory of Gibbs measures that are absolutely continuous with respect to the Lebesgue measure; see Chiang et al. (1987); Holley et al. (1989); Márquez (1997). We obtain the following property.

Proposition 3.3.

Assume that the conditions of Theorem 3.2 hold. Then, for all τ>0\tau>0,

limσ0σlogμσ(V(x)minxV(x)τ)=τ.\lim_{\sigma\to 0}\sigma\log\mu_{\sigma}\left(V(x)-\min_{x}V(x)\geq\tau\right)=-\tau.
Remark 3.4.

Proposition 3.3 follows directly from the large deviation principle. It shows that the Gibbs measure μσ\mu_{\sigma} concentrates around the set of global minimizers of the potential function VV and quantifies the corresponding rate of convergence. However, the specific limit distribution in Theorem 3.2 cannot be deduced from Proposition 3.3 alone.
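The rate in Proposition 3.3 is easy to verify numerically in a simple case. The sketch below uses the illustrative quadratic potential V(x)=x²/2 in d=1 (our choice, not from the paper), for which μσ\mu_{\sigma} is the N(0,σ) law and the tail probability has a closed form:

```python
import math

# For V(x) = x^2/2 in d = 1 (min V = 0), mu_sigma = N(0, sigma) and
#   mu_sigma(V(x) >= tau) = P(|X| >= sqrt(2*tau)) = erfc(sqrt(tau/sigma)),
# so sigma * log mu_sigma(V >= tau) should approach -tau as sigma -> 0.
tau = 0.5
rates = {s: s * math.log(math.erfc(math.sqrt(tau / s)))
         for s in (1e-1, 1e-2, 1e-3)}
for s, r in rates.items():
    print(s, r)    # approaches -tau = -0.5 from below
```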

Theorem 3.5.

Suppose (A)(\textbf{A}) holds. Then, for each ε(0,τ)\varepsilon\in(0,\tau) and σ(0,1)\sigma\in(0,1), there exists a constant Cτ,ε,dC_{\tau,\varepsilon,d} (depending on τ,ε,d\tau,\varepsilon,d), given in (A.15), such that

(V(X1)>τ)Cτ,ε,dexp(τεσ),\displaystyle\mathbb{P}(V(X_{1})>\tau)\leq C_{\tau,\varepsilon,d}\exp\left(-\frac{\tau-\varepsilon}{\sigma}\right),
(V(Y~tK)>τ)Cτ,ε,dexp(τεσ)+C1,σd(2d+3)s+C2,σ4dm,\displaystyle\mathbb{P}(V(\widetilde{Y}_{t_{K}})>\tau)\leq C_{\tau,\varepsilon,d}\exp\left(-\frac{\tau-\varepsilon}{\sigma}\right)+C^{\sharp}_{1,\sigma}\sqrt{d(2d+3)s}+C^{\sharp}_{2,\sigma}\sqrt{\frac{4d}{m}}, (14)

where

C1,σ:=γσξσ+γσ3ξσ3,γσξσ={(M2,Rσ)2+M3,Rσ}exp(M1,Rm1,Rσ),\displaystyle C^{\sharp}_{1,\sigma}:=\frac{\gamma_{\sigma}}{\xi_{\sigma}}+\frac{\gamma^{3}_{\sigma}}{\xi^{3}_{\sigma}},\quad\frac{\gamma_{\sigma}}{\xi_{\sigma}}=\left\{\left(\frac{M_{2,R}}{\sigma}\right)^{2}+\frac{M_{3,R}}{\sigma}\right\}\exp\left(\frac{M_{1,R}-m_{1,R}}{\sigma}\right),
C2,σ:=γσ2ξσ2+γσζσξσ2,γσξσ={(M2,Rσ)2+M3,Rσ}exp(M1,Rm1,Rσ),\displaystyle C^{\sharp}_{2,\sigma}:=\frac{\gamma^{2}_{\sigma}}{\xi^{2}_{\sigma}}+\frac{\gamma_{\sigma}\zeta_{\sigma}}{\xi^{2}_{\sigma}},\quad\frac{\gamma_{\sigma}}{\xi_{\sigma}}=\left\{\left(\frac{M_{2,R}}{\sigma}\right)^{2}+\frac{M_{3,R}}{\sigma}\right\}\exp\left(\frac{M_{1,R}-m_{1,R}}{\sigma}\right),

M1,RM_{1,R}, M2,RM_{2,R}, M3,RM_{3,R} and m1,Rm_{1,R} are given in (8)-(9), and ζσ=exp(M1,R/σ)\zeta_{\sigma}=\exp(M_{1,R}/\sigma).

Remark 3.6.

The first term on the right-hand side of (14) originates from the gap between Gibbs sampling and optimization, which is intrinsic to the simulated annealing approach itself. The latter two terms are caused by the Euler discretization of SDE (7) and the Monte Carlo estimation of the drift coefficient b(x,t)b(x,t) in (11); they depend on dd polynomially and on σ\sigma exponentially. This contrasts with some Langevin methods, whose convergence error bounds depend on dd exponentially (Wang, 2009; Hale, 2010; Menz and Schlichting, 2014; Raginsky et al., 2017), implying that the efficiency of Langevin samplers may suffer from the curse of dimensionality.

By Theorem 3.5, for any 0<δ10<\delta\ll 1, Y~tK\widetilde{Y}_{t_{K}} is a τ\tau-global minimizer of VV with probability at least 1δ1-\sqrt{\delta}, i.e., V(Y~tK)τ+infV(x)V(\widetilde{Y}_{t_{K}})\leq\tau+\inf V(x), provided the number of iterations K𝒪(d2δ)K\geq\mathcal{O}\left(\frac{d^{2}}{\delta}\right), the number of Gaussian samples per iteration m𝒪(dδ)m\geq\mathcal{O}\left(\frac{d}{\delta}\right), and σ𝒪(τlog(1/δ))\sigma\leq\mathcal{O}\left(\frac{\tau}{\log(1/\delta)}\right).

Finally, we establish non-asymptotic bounds in the Wasserstein-2 distance between the law of the samples generated by Algorithm 1 and the distribution μσ\mu_{\sigma}. We first recall the definition of the Wasserstein distance. Let 𝒟(ν1,ν2)\mathcal{D}(\nu_{1},\nu_{2}) be the collection of coupling probability measures on (2d,(2d))\left(\mathbb{R}^{2d},\mathcal{B}(\mathbb{R}^{2d})\right) whose respective marginal distributions are ν1\nu_{1} and ν2\nu_{2}. The Wasserstein distance of order p1p\geq 1 between ν1\nu_{1} and ν2\nu_{2} is defined as

Wp(ν1,ν2)=infν𝒟(ν1,ν2)(ddθ1θ22pdν(θ1,θ2))1/p.\displaystyle W_{p}(\nu_{1},\nu_{2})=\inf_{\nu\in\mathcal{D}(\nu_{1},\nu_{2})}\left(\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\left\|\theta_{1}-\theta_{2}\right\|_{2}^{p}\mathrm{d}\nu\left(\theta_{1},\theta_{2}\right)\right)^{1/p}.
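In one dimension this infimum has a closed form for empirical measures with equally many atoms: the optimal coupling matches order statistics (quantile coupling). A small illustrative sketch, not part of the paper's method:

```python
import numpy as np

# W_p between two empirical measures on R with n atoms each: the optimal
# coupling pairs sorted samples (quantile coupling), so
#   W_p^p = (1/n) * sum_i |x_(i) - y_(i)|^p.
def wasserstein_p_1d(x, y, p=2):
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

print(wasserstein_p_1d([0.0, 1.0], [2.0, 3.0], p=2))  # -> 2.0
```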
Theorem 3.7.

Assume that (A) and (C3) hold. Then

W2(Law(Y~tK),μσ)Cσ(C3,σs+C2,σ16dm),\displaystyle W_{2}(\text{Law}(\widetilde{Y}_{t_{K}}),\mu_{\sigma})\leq C^{\sharp}_{\sigma}\left(C^{\sharp}_{3,\sigma}\sqrt{s}+C^{\sharp}_{2,\sigma}\sqrt{\frac{16d}{m}}\right),

where

Cσ:=exp(1/2+8C2,σ2),C3,σ:=2C1,σ+4C1,σ2{1+C0,σexp(2C0,σ+1)}12,\displaystyle C^{\sharp}_{\sigma}:=\exp(1/2+8C_{2,\sigma}^{2}),\quad C^{\sharp}_{3,\sigma}:=2C_{1,\sigma}+4C^{2}_{1,\sigma}\left\{1+C_{0,\sigma}\exp\left(2\sqrt{C_{0,\sigma}}+1\right)\right\}^{\frac{1}{2}},
C2,σ:=(γσξσ)2+γσζσξσ2,γσξσ={(M2,Rσ)2+M3,Rσ}exp(M1,Rm1,Rσ),\displaystyle C^{\sharp}_{2,\sigma}:=\left(\frac{\gamma_{\sigma}}{\xi_{\sigma}}\right)^{2}+\frac{\gamma_{\sigma}\zeta_{\sigma}}{\xi^{2}_{\sigma}},\quad\frac{\gamma_{\sigma}}{\xi_{\sigma}}=\left\{\left(\frac{M_{2,R}}{\sigma}\right)^{2}+\frac{M_{3,R}}{\sigma}\right\}\exp\left(\frac{M_{1,R}-m_{1,R}}{\sigma}\right),

with C0,σ=C1,σ=γσ/ξσ+γσ2/ξσ2C_{0,\sigma}=C_{1,\sigma}=\gamma_{\sigma}/\xi_{\sigma}+\gamma^{2}_{\sigma}/\xi^{2}_{\sigma}.

Remark 3.8.

The Langevin sampling method has been studied under the (strongly) convex potential assumption (Durmus and Moulines, 2016, 2017, 2019; Dalalyan, 2017a, b; Cheng and Bartlett, 2018; Dalalyan and Karagulyan, 2019); see Garbuno-Inigo et al. (2020); Wang and Li (2022) for mean-field Langevin methods. However, the Langevin diffusion process reaches the target distribution only as time tt goes to infinity, whereas the Schrödinger bridge achieves this in unit time. The Schrödinger bridge has close connections with a number of problems in statistical physics, optimal transport and optimal control (Léonard, 2014), yet only a few recent works use this tool for statistical sampling. Bernton et al. (2019) proposed the Schrödinger bridge sampler based on the iterative proportional fitting method (the Sinkhorn algorithm (Sinkhorn, 1964; Peyré and Cuturi, 2019)). For the Gibbs measure μσ\mu_{\sigma}, Schrödinger bridge samplers iteratively modify the transition kernels of a reference Markov chain, via regression-based approximations of the iterative proportional fitting recursion, to obtain a process whose marginal distribution at the terminal time approximates μσ\mu_{\sigma}.

4. Simulation studies

In this section, we conduct numerical simulations to evaluate the performance of the proposed method (Algorithm 1) and compare it with the Langevin method. We consider the following non-convex smooth function from d\mathbb{R}^{d} to \mathbb{R} (Carrillo et al., 2021),

V(x)=1di=1d[(xiB)210cos(2π(xiB))+10]+C,V(x)=\frac{1}{d}\sum_{i=1}^{d}\left[\left(x_{i}-B\right)^{2}-10\cos\left(2\pi\left(x_{i}-B\right)\right)+10\right]+C,

with B=argminxdV(x)B=\operatorname{argmin}_{x\in\mathbb{R}^{d}}V(x) and C=minxdV(x)C=\min_{x\in\mathbb{R}^{d}}V(x). Figure 1 depicts this target function VV with B=0,C=0,d=1B=0,C=0,d=1; it has infinitely many local minimizers in \mathbb{R} but only one global minimizer, i.e., x=0x=0. In the numerical experiment, we consider the case d=2,B=0,C=0d=2,B=0,C=0 in VV. We set K=200,m=1000,σ=0.01K=200,m=1000,\sigma=0.01 in Algorithm 1, and the Langevin method is implemented with the R package yuima (Iacus and Yoshida, 2018). As shown in Proposition 2.3, the target distribution is reached exactly at time one. Thus, we only need to keep the last iterate of the Euler-Maruyama discretization of the Schrödinger-Föllmer diffusion in each run and repeat this scheme until the desired sample size is reached; that is, each run of Algorithm 1 yields one sample, at a cost of 200×1000200\times 1000 Gaussian samples per run. In contrast, the Langevin diffusion (3) approaches the target distribution only as time tt goes to infinity, so sufficiently many iterates must be discarded as burn-in in each Euler-Maruyama discretization. To make the comparison fair, we run the Langevin method 50 times; at each run we generate 200×1000200\times 1000 samples and keep the one with the best function value. Figure 2 shows the results of 50 independent runs, where the red line marks the global minimizer (GM) (x1=0,x2=0)(x_{1}=0,x_{2}=0). From Figure 2, we conclude that the proposed method approximately attains the global minimizer and performs better than the Langevin method.
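To make the scheme concrete, the following minimal Python sketch mimics the structure of Algorithm 1 under an assumed normalization of the Schrödinger-Föllmer dynamics (the exact forms of SDE (7) and the drift estimator (11) are given earlier in the paper). The function names, hyperparameters and the quadratic demo potential are our own illustrative choices.

```python
import numpy as np

# Sketch of an Euler-Maruyama Schrodinger-Follmer sampler in the spirit of
# Algorithm 1.  ASSUMED normalization (see the paper for the exact SDE (7)):
#   dY_t = sigma * grad log Q_{1-t} f_hat(Y_t) dt + sqrt(sigma) dB_t,  Y_0 = 0,
# where f_hat(x) = exp((|x|^2/2 - V(x)) / sigma) and Q_u is the heat semigroup
# at scale sigma * u.  The drift is estimated with m Gaussian samples as in
# (11); since grad f_hat = f_hat * grad log f_hat with
# grad log f_hat(x) = (x - grad V(x)) / sigma, the ratio of Monte Carlo sums
# is a softmax-weighted average, which is numerically stable.
def sf_sampler(grad_V, log_fhat, d, sigma=0.05, K=100, m=300, seed=0):
    rng = np.random.default_rng(seed)
    s = 1.0 / K                                    # step size, t_k = k * s
    y = np.zeros(d)
    for k in range(K):
        t = k * s
        pts = y + np.sqrt(sigma * (1.0 - t)) * rng.standard_normal((m, d))
        logw = np.array([log_fhat(p) for p in pts])
        w = np.exp(logw - logw.max())
        w /= w.sum()                               # softmax weights
        g = np.array([(p - grad_V(p)) / sigma for p in pts])
        b = w @ g                                  # Monte Carlo drift estimate
        y = y + s * sigma * b + np.sqrt(sigma * s) * rng.standard_normal(d)
    return y

# Demo on the illustrative quadratic potential V(x) = |x - c|^2 / 2, whose
# Gibbs measure is N(c, sigma I): the output should land near c.
sigma, c = 0.05, np.array([1.0, -0.5])
y = sf_sampler(lambda x: x - c, lambda x: (c @ x - c @ c / 2.0) / sigma,
               d=2, sigma=sigma)
print(y)
```

For the Rastrigin-type VV above one would plug in its gradient and the corresponding log f̂σ instead; we make no claim that these hyperparameters match the experiment reported here.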

Refer to caption
Figure 1. B=0B=0, C=0C=0, d=1d=1.
Refer to caption
Refer to caption
Figure 2. Proposed method vs. the Langevin method.

5. Conclusion

We study the problem of finding global minimizers of V(x):dV(x):\mathbb{R}^{d}\rightarrow\mathbb{R} approximately via sampling from a probability distribution μσ\mu_{\sigma} with density pσ(x)=exp(V(x)/σ)dexp(V(y)/σ)𝑑yp_{\sigma}(x)=\dfrac{\exp(-V(x)/\sigma)}{\int_{\mathbb{R}^{d}}\exp(-V(y)/\sigma)dy} with respect to the Lebesgue measure for σ(0,1]\sigma\in(0,1] small enough. We analyze a sampler based on the Euler discretization scheme of the Schrödinger-Föllmer diffusion process with stochastic approximation under appropriate assumptions on the step size ss and the potential VV. We prove that the output of the proposed sampler is an approximate global minimizer of V(x)V(x) with high probability. Moreover, simulation studies verify the effectiveness of the proposed method for solving non-convex smooth optimization problems and show that it performs better than the Langevin method.

6. Acknowledgments

We would like to thank the anonymous referees for their useful comments and suggestions, which have led to considerable improvements in the paper. This work is supported by the National Key Research and Development Program of China (No. 2020YFA0714200) and by the National Science Foundation of China (No. 12125103, No. 12071362, No. 11871474, No. 11871385). The numerical calculations were done at the Supercomputing Center of Wuhan University.

7. Appendix

In the appendix, we verify (P1)-(P2) and prove all the propositions and theorems, i.e., Propositions 2.3 and 3.3 and Theorems 3.2, 3.5 and 3.7.

7.1. Verify (P1)-(P2)

Proof.

Recall the definition of fσf_{\sigma} in (6) and the assumption V(x)=x22/2V(x)=\|x\|_{2}^{2}/2 for x2>R\|x\|_{2}>R. Some simple calculations give

f^σ(x)\displaystyle\nabla\hat{f}_{\sigma}(x) =exp(x222σV(x)σ)(xσV(x)σ),\displaystyle=\exp\left(\frac{\|x\|_{2}^{2}}{2\sigma}-\frac{V(x)}{\sigma}\right)\cdot\left(\frac{x}{\sigma}-\frac{\nabla V(x)}{\sigma}\right),
2f^σ(x)\displaystyle\nabla^{2}\hat{f}_{\sigma}(x) =exp(x222σV(x)σ)(xσV(x)σ)(xσV(x)σ)\displaystyle=\exp\left(\frac{\|x\|_{2}^{2}}{2\sigma}-\frac{V(x)}{\sigma}\right)\cdot\left(\frac{x}{\sigma}-\frac{\nabla V(x)}{\sigma}\right)\cdot\left(\frac{x}{\sigma}-\frac{\nabla V(x)}{\sigma}\right)^{\top}
+exp(x222σV(x)σ)(Idσ2V(x)σ).\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ +\exp\left(\frac{\|x\|_{2}^{2}}{2\sigma}-\frac{V(x)}{\sigma}\right)\cdot\left(\frac{\mbox{\bf I}_{d}}{\sigma}-\frac{\nabla^{2}V(x)}{\sigma}\right).

Then, the properties (P1)-(P2) hold if for each σ(0,1]\sigma\in(0,1],

limR~supx2R~exp(V(x)σ+x222σ)𝐈d2V(x)σ2\displaystyle\lim_{\widetilde{R}\rightarrow\infty}\sup_{\|x\|_{2}\geq\widetilde{R}}\exp\left(-\frac{V(x)}{\sigma}+\frac{\|x\|_{2}^{2}}{2\sigma}\right)\left\|\frac{\mathbf{I}_{d}-\operatorname{\nabla^{2}}V(x)}{\sigma}\right\|_{2}
\displaystyle\leq limR~supx2R~exp(V(x)+x222)𝐈d2V(x)2<,\displaystyle\lim_{\widetilde{R}\rightarrow\infty}\sup_{\|x\|_{2}\geq\widetilde{R}}\exp\left(-V(x)+\frac{\|x\|_{2}^{2}}{2}\right)\left\|\mathbf{I}_{d}-\operatorname{\nabla^{2}}V(x)\right\|_{2}<\infty,

and

limR~supx2R~exp(V(x)σ+x222σ)xV(x)σ2\displaystyle\lim_{\widetilde{R}\rightarrow\infty}\sup_{\|x\|_{2}\geq\widetilde{R}}\exp\left(-\frac{V(x)}{\sigma}+\frac{\|x\|_{2}^{2}}{2\sigma}\right)\cdot\left\|\frac{x-\nabla V(x)}{\sigma}\right\|_{2}
\displaystyle\leq limR~supx2R~exp(V(x)+x222)xV(x)2<,\displaystyle\lim_{\widetilde{R}\rightarrow\infty}\sup_{\|x\|_{2}\geq\widetilde{R}}\exp\left(-V(x)+\frac{\|x\|_{2}^{2}}{2}\right)\cdot\left\|x-\nabla V(x)\right\|_{2}<\infty,

where we use the fact that aaexp(ax)a\mapsto a\exp(-ax) is non-increasing for a1,x+a\geq 1,x\in\mathbb{R}_{+}. ∎

7.2. Proof of Proposition 2.3

Proof.

By (P1) and (P2), it yields that for all xdx\in\mathbb{R}^{d} and t[0,1]t\in[0,1],

b(x,t)2=Q1tf^σ(x)2Q1tf^σ(x)γσξσ.\displaystyle\|b(x,t)\|_{2}=\frac{\left\|\nabla Q_{1-t}\hat{f}_{\sigma}(x)\right\|_{2}}{Q_{1-t}\hat{f}_{\sigma}(x)}\leq\frac{\gamma_{\sigma}}{\xi_{\sigma}}. (A.1)

Then, by (P1)-(P2) and (A.1), for all x,ydx,y\in\mathbb{R}^{d} and t[0,1]t\in[0,1],

b(x,t)b(y,t)2\displaystyle\left\|b(x,t)-b\left(y,t\right)\right\|_{2} =Q1tf^σ(x)Q1tf^σ(x)Q1tf^σ(y)Q1tf^σ(y)2\displaystyle=\left\|\frac{\nabla Q_{1-t}\hat{f}_{\sigma}(x)}{Q_{1-t}\hat{f}_{\sigma}(x)}-\frac{\nabla Q_{1-t}\hat{f}_{\sigma}\left(y\right)}{Q_{1-t}\hat{f}_{\sigma}\left(y\right)}\right\|_{2}
=Q1tf^σ(x)Q1tf^σ(y)Q1tf^σ(y)+Q1tf^σ(x)(Q1tf^σ(y)Q1tf^σ(x))Q1tf^σ(x)Q1tf^σ(y)2\displaystyle=\left\|\frac{\nabla Q_{1-t}\hat{f}_{\sigma}(x)-\nabla Q_{1-t}\hat{f}_{\sigma}(y)}{Q_{1-t}\hat{f}_{\sigma}(y)}+\frac{\nabla Q_{1-t}\hat{f}_{\sigma}(x)\left(Q_{1-t}\hat{f}_{\sigma}(y)-Q_{1-t}\hat{f}_{\sigma}(x)\right)}{Q_{1-t}\hat{f}_{\sigma}(x)Q_{1-t}\hat{f}_{\sigma}(y)}\right\|_{2}
Q1tf^σ(x)Q1tf^σ(y)2Q1tf^σ(y)+b(x,t)2|Q1tf^σ(x)Q1tf^σ(y)|Q1tf^σ(y)\displaystyle\leq\frac{\left\|\nabla Q_{1-t}\hat{f}_{\sigma}(x)-\nabla Q_{1-t}\hat{f}_{\sigma}\left(y\right)\right\|_{2}}{Q_{1-t}\hat{f}_{\sigma}\left(y\right)}+\|b(x,t)\|_{2}\cdot\frac{\left|Q_{1-t}\hat{f}_{\sigma}(x)-Q_{1-t}\hat{f}_{\sigma}\left(y\right)\right|}{Q_{1-t}\hat{f}_{\sigma}\left(y\right)}
(γσξσ+γσ2ξσ2)xy2.\displaystyle\leq\left(\frac{\gamma_{\sigma}}{\xi_{\sigma}}+\frac{\gamma_{\sigma}^{2}}{\xi_{\sigma}^{2}}\right)\left\|x-y\right\|_{2}.

Setting C1,σ:=γσξσ+γσ2ξσ2C_{1,\sigma}:=\frac{\gamma_{\sigma}}{\xi_{\sigma}}+\frac{\gamma_{\sigma}^{2}}{\xi_{\sigma}^{2}} yields the Lipschitz continuity condition (C2). Combining (A.1) and (C2) with the triangle inequality, we have

b(x,t)2b(0,t)2+C1,σx2γσξσ+C1,σx2.\|b(x,t)\|_{2}\leq\|b(0,t)\|_{2}+C_{1,\sigma}\|x\|_{2}\leq\frac{\gamma_{\sigma}}{\xi_{\sigma}}+C_{1,\sigma}\|x\|_{2}.

Let C0,σ:=max{γσξσ,C1,σ}C_{0,\sigma}:=\max\left\{\frac{\gamma_{\sigma}}{\xi_{\sigma}},C_{1,\sigma}\right\}; then (C1) holds. Since the drift term b(x,t)b(x,t) satisfies the linear growth condition (C1) and the Lipschitz condition (C2), the Schrödinger-Föllmer diffusion SDE (7) has a unique strong solution (Revuz and Yor, 2013; Pavliotis, 2014).

Moreover, the Schrödinger-Föllmer diffusion process {Xt}t[0,1]\{X_{t}\}_{t\in[0,1]} defined in (7) admits the transition probability density

ps,t(x,y):=p~s,t(x,y)Q1tfσ(y)Q1sfσ(x),\displaystyle p_{s,t}(x,y):=\widetilde{p}_{s,t}(x,y)\frac{Q_{1-t}f_{\sigma}(y)}{Q_{1-s}f_{\sigma}(x)},

where

p~s,t(x,y)=1(2πσ(ts))d/2exp(12σ(ts)xy22)\displaystyle\widetilde{p}_{s,t}(x,y)=\frac{1}{(2\pi\sigma(t-s))^{d/2}}\exp\left(-\frac{1}{2\sigma(t-s)}\|x-y\|^{2}_{2}\right)

is the transition probability density of a dd-dimensional Brownian motion σBt\sqrt{\sigma}B_{t}. See Dai Pra (1991); Lehec (2013) for details. It follows that for any Borel measurable set A(d)A\in\mathcal{B}(\mathbb{R}^{d}),

(X1A)\displaystyle\mathbb{P}(X_{1}\in A) =Ap0,1(0,y)dy\displaystyle=\int_{A}p_{0,1}(0,y)\mathrm{d}y
=Ap~0,1(0,y)Q0fσ(y)Q1fσ(0)dy\displaystyle=\int_{A}\widetilde{p}_{0,1}(0,y)\frac{Q_{0}f_{\sigma}(y)}{Q_{1}f_{\sigma}(0)}\mathrm{d}y
=μσ(A),\displaystyle=\mu_{\sigma}(A),

where the last equality follows from Q1fσ(0)=μσ(d)=1Q_{1}f_{\sigma}(0)=\mu_{\sigma}(\mathbb{R}^{d})=1 and Q0fσ(y)=fσ(y)Q_{0}f_{\sigma}(y)=f_{\sigma}(y). Therefore, X1X_{1} is distributed according to μσ\mu_{\sigma}. This completes the proof. ∎

7.3. Proof of Proposition 3.3

Proof.

Under the conditions of Theorem 3.2, Cσ=dexp(V(x)/σ)𝑑xdexp(V(x))𝑑x<C_{\sigma}=\int_{\mathbb{R}^{d}}\exp\left(-V(x)/\sigma\right)dx\leq\int_{\mathbb{R}^{d}}\exp\left(-V(x)\right)dx<\infty for all σ(0,1]\sigma\in(0,1]. By Varadhan’s theorem (Theorem 1.2.3 in Dupuis and Ellis, 2011), the family {μσ}σ[0,1]\{\mu_{\sigma}\}_{\sigma\in[0,1]} on (d)\mathcal{B}(\mathbb{R}^{d}) satisfies a large deviation principle with rate function V(x)minxV(x)V(x)-\min_{x}V(x); that is, for every function FCb(d)F\in C_{b}(\mathbb{R}^{d}), the space of bounded continuous functions on d\mathbb{R}^{d},

limσ0σlogdCσ1exp(F(x)V(x)σ)𝑑x=supx{F(x)I(x)},\lim_{\sigma\to 0}\sigma\log\int_{\mathbb{R}^{d}}C_{\sigma}^{-1}\exp\left(\frac{F(x)-V(x)}{\sigma}\right)dx=\sup_{x}\{F(x)-I(x)\}, (A.2)

where the rate function I(x)I(x) is defined by

I(x):=V(x)minxV(x).I(x):=V(x)-\min_{x}V(x).

Next, we only need to prove (A.2). On the one hand,

logdCσ1exp(F(x)V(x)σ)𝑑x=logdexp(V(x)σ)𝑑x+logdexp(F(x)V(x)σ)𝑑x.\log\int_{\mathbb{R}^{d}}C_{\sigma}^{-1}\exp\left(\frac{F(x)-V(x)}{\sigma}\right)dx=-\log\int_{\mathbb{R}^{d}}\exp\left(-\frac{V(x)}{\sigma}\right)dx+\log\int_{\mathbb{R}^{d}}\exp\left(\frac{F(x)-V(x)}{\sigma}\right)dx.

By Lemma 7.1, we have

σlogdexp(V(x)σ)𝑑x\displaystyle-\sigma\log\int_{\mathbb{R}^{d}}\exp\left(-\frac{V(x)}{\sigma}\right)dx =σlogdexp(V(x)minxV(x)σ)𝑑x+minxV(x)\displaystyle=-\sigma\log\int_{\mathbb{R}^{d}}\exp\left(-\frac{V(x)-\min_{x}V(x)}{\sigma}\right)dx+\min_{x}V(x)
minxV(x)+d2limσ0σlogσ=minxV(x),as σ0.\displaystyle\rightarrow\min_{x}V(x)+\frac{d}{2}\lim_{\sigma\to 0}\sigma\log\sigma=\min_{x}V(x),\quad\text{as $\sigma\downarrow 0$.} (A.3)

On the other hand, for any positive K>0K>0, combining Lemma 2.2 in Varadhan (1966) and the dominated convergence theorem yields that

limσ0σlogdexp(F(x)V(x)σ)𝑑x\displaystyle\lim_{\sigma\to 0}\sigma\log\int_{\mathbb{R}^{d}}\exp\left(\frac{F(x)-V(x)}{\sigma}\right)dx =limKlimσ0σlogdexp(F(x)V(x)Kσ)𝑑x\displaystyle=\lim_{K\to\infty}\lim_{\sigma\to 0}\sigma\log\int_{\mathbb{R}^{d}}\exp\left(\frac{F(x)-V(x)\land K}{\sigma}\right)dx
=limKsupx{F(x)V(x)K}=supx{F(x)V(x)}.\displaystyle=\lim_{K\to\infty}\sup_{x}\{F(x)-V(x)\land K\}=\sup_{x}\{F(x)-V(x)\}. (A.4)

Hence, by (A.3) and (A.4), we get

limσ0σlogdCσ1exp(F(x)V(x)σ)𝑑x=supx{F(x)V(x)+minxV(x)}.\lim_{\sigma\to 0}\sigma\log\int_{\mathbb{R}^{d}}C_{\sigma}^{-1}\exp\left(\frac{F(x)-V(x)}{\sigma}\right)dx=\sup_{x}\{F(x)-V(x)+\min_{x}V(x)\}.

Since the measure μσ\mu_{\sigma} satisfies the large deviation principle with rate function II, if we take the closed set F:={xd;V(x)minxV(x)τ}F:=\{x\in\mathbb{R}^{d};V(x)-\min_{x}V(x)\geq\tau\}, then

limσ0σlogμσ(F)=limσ0σlogμσ(V(x)minxV(x)τ)=infxFI(x)=τ.\lim_{\sigma\to 0}\sigma\log\mu_{\sigma}(F)=\lim_{\sigma\to 0}\sigma\log\mu_{\sigma}\left(V(x)-\min_{x}V(x)\geq\tau\right)=-\inf_{x\in F}I(x)=-\tau.

7.4. Preliminary lemmas for Theorem 3.2 and Theorem 3.5

In order to prove that the Gibbs measure μσ\mu_{\sigma} weakly converges to a mixture of Dirac measures and to estimate the probabilities of V(X1)>τV(X_{1})>\tau and V(Y~tK)>τV(\widetilde{Y}_{t_{K}})>\tau for any τ>0\tau>0, we first prove the following Lemmas 7.1-7.2.

Lemma 7.1.

Assume VV is twice continuously differentiable and satisfies V(x0)=0,V(x0)=0V(x_{0})=0,\nabla V(x_{0})=0, and that the Hessian matrix 2V(x0)\nabla^{2}V(x_{0}) is positive definite. Then, for δ>0\delta>0 sufficiently small,

limt+td2Uδ(x0)etV(x1,,xd)𝑑x1𝑑xd=(2π)d2(det2V(x0))12\lim_{t\to+\infty}t^{\frac{d}{2}}\int_{U_{\delta}(x_{0})}e^{-tV(x_{1},\ldots,x_{d})}dx_{1}\cdots dx_{d}=\frac{(2\pi)^{\frac{d}{2}}}{\left(\det\nabla^{2}V(x_{0})\right)^{\frac{1}{2}}}

where Uδ(x0):={xd;xx02<δ}U_{\delta}(x_{0}):=\{x\in\mathbb{R}^{d};\|x-x_{0}\|_{2}<\delta\}.

Proof.

For each ε(0,1)\varepsilon\in(0,1), there exists δ>0\delta>0 such that

(1ε)(xx0)2V(x0)(xx0)2V(x)(1+ε)(xx0)2V(x0)(xx0)2\frac{(1-\varepsilon)(x-x_{0})^{\top}\nabla^{2}V(x_{0})(x-x_{0})}{2}\leq V(x)\leq\frac{(1+\varepsilon)(x-x_{0})^{\top}\nabla^{2}V(x_{0})(x-x_{0})}{2}

holds for any xUδx\in U_{\delta}. Moreover, we have

Uδ(x0)exp(tV(x))𝑑xUδ(x0)exp(t2(1ε)(xx0)2V(x0)(xx0))𝑑x.\int_{U_{\delta}(x_{0})}\exp\left(-tV(x)\right)dx\leq\int_{U_{\delta}(x_{0})}\exp\left(-\frac{t}{2}(1-\varepsilon)(x-x_{0})^{\top}\nabla^{2}V(x_{0})(x-x_{0})\right)dx. (A.5)

There is an orthogonal matrix PP such that P2V(x0)P=diag(λ1,,λd)P^{\top}\nabla^{2}V(x_{0})P=\mathrm{diag}(\lambda_{1},\ldots,\lambda_{d}), where λ1,,λd\lambda_{1},\ldots,\lambda_{d} are the eigenvalues of the Hessian matrix 2V(x0)\nabla^{2}V(x_{0}). Denoting (y1,,yd):=P(xx0)(y_{1},\ldots,y_{d}):=P(x-x_{0}), we have

(xx0)2V(x0)(xx0)=i=1dλi(yi)2.(x-x_{0})^{\top}\nabla^{2}V(x_{0})(x-x_{0})=\sum_{i=1}^{d}\lambda_{i}(y_{i})^{2}.

Hence

Uδ(x0)exp(t2(1ε)(xx0)2V(x0)(xx0))𝑑x=Uδ(0)exp(t(1ε)2i=1dλi(yi)2)𝑑y.\displaystyle\int_{U_{\delta}(x_{0})}\exp\left(-\frac{t}{2}(1-\varepsilon)(x-x_{0})^{\top}\nabla^{2}V(x_{0})(x-x_{0})\right)dx=\int_{U_{\delta}(0)}\exp\left(-\frac{t(1-\varepsilon)}{2}\sum_{i=1}^{d}\lambda_{i}(y_{i})^{2}\right)dy. (A.6)

Further, we can get

Uδ(0)exp(t(1ε)2i=1dλi(yi)2)𝑑y=(2t(1ε))d2z2t(1ε)2δexp(i=1dλi(zi)2)𝑑z,\int_{U_{\delta}(0)}\exp\left(-\frac{t(1-\varepsilon)}{2}\sum_{i=1}^{d}\lambda_{i}(y_{i})^{2}\right)dy=\left(\frac{2}{t(1-\varepsilon)}\right)^{\frac{d}{2}}\int_{\|z\|_{2}\leq\sqrt{\frac{t(1-\varepsilon)}{2}}\delta}\exp\left(-\sum_{i=1}^{d}\lambda_{i}(z_{i})^{2}\right)dz, (A.7)

where the equality holds by setting z=t(1ε)2yz=\sqrt{\dfrac{t(1-\varepsilon)}{2}}y. By (A.5), (A.6) and (A.7), it follows that

lim supt+td2Uδ(x0)etV(x)𝑑x\displaystyle\limsup_{t\to+\infty}t^{\frac{d}{2}}\int_{U_{\delta}(x_{0})}e^{-tV(x)}dx (21ε)d2lim supt+z2<t(1ε)2δexp(i=1dλi(zi)2)𝑑z\displaystyle\leq\left(\frac{2}{1-\varepsilon}\right)^{\frac{d}{2}}\limsup_{t\to+\infty}\int_{\|z\|_{2}<\sqrt{\frac{t(1-\varepsilon)}{2}}\delta}\exp\left(-\sum_{i=1}^{d}\lambda_{i}(z_{i})^{2}\right)dz
(21ε)d2dexp(i=1dλi(zi)2)𝑑z1𝑑zd\displaystyle\leq\left(\frac{2}{1-\varepsilon}\right)^{\frac{d}{2}}\int_{\mathbb{R}^{d}}\exp\left(-\sum_{i=1}^{d}\lambda_{i}(z_{i})^{2}\right)dz_{1}\cdots dz_{d}
=(2π1ε)d2(i=1d1λi)=(2π1ε)d21(det2V(x0))12.\displaystyle=\left(\frac{2\pi}{1-\varepsilon}\right)^{\frac{d}{2}}\left(\prod_{i=1}^{d}\frac{1}{\sqrt{\lambda_{i}}}\right)=\left(\frac{2\pi}{1-\varepsilon}\right)^{\frac{d}{2}}\frac{1}{\left(\det\nabla^{2}V(x_{0})\right)^{\frac{1}{2}}}.

Similarly, we have

lim inft+td2Uδ(x0)etV(x)𝑑x\displaystyle\liminf_{t\to+\infty}t^{\frac{d}{2}}\int_{U_{\delta}(x_{0})}e^{-tV(x)}dx (21+ε)d2lim inft+z2<t(1+ε)2δexp(i=1dλi(zi)2)𝑑z\displaystyle\geq\left(\frac{2}{1+\varepsilon}\right)^{\frac{d}{2}}\liminf_{t\to+\infty}\int_{\|z\|_{2}<\sqrt{\frac{t(1+\varepsilon)}{2}}\delta}\exp\left(-\sum_{i=1}^{d}\lambda_{i}(z_{i})^{2}\right)dz
(21+ε)d2dexp(i=1dλi(zi)2)𝑑z1𝑑zd\displaystyle\geq\left(\frac{2}{1+\varepsilon}\right)^{\frac{d}{2}}\int_{\mathbb{R}^{d}}\exp\left(-\sum_{i=1}^{d}\lambda_{i}(z_{i})^{2}\right)dz_{1}\cdots dz_{d}
=(2π1+ε)d2(i=1d1λi)=(2π1+ε)d21(det2V(x0))12.\displaystyle=\left(\frac{2\pi}{1+\varepsilon}\right)^{\frac{d}{2}}\left(\prod_{i=1}^{d}\frac{1}{\sqrt{\lambda_{i}}}\right)=\left(\frac{2\pi}{1+\varepsilon}\right)^{\frac{d}{2}}\frac{1}{\left(\det\nabla^{2}V(x_{0})\right)^{\frac{1}{2}}}.

Therefore, letting ε0+\varepsilon\rightarrow 0^{+}, we get

limt+td2Uδ(x0)etV(x)𝑑x=(2π)d2(i=1d1λi)=(2π)d2(det2V(x0))12.\lim_{t\to+\infty}t^{\frac{d}{2}}\int_{U_{\delta}(x_{0})}e^{-tV(x)}dx=(2\pi)^{\frac{d}{2}}\left(\prod_{i=1}^{d}\frac{1}{\sqrt{\lambda_{i}}}\right)=\frac{(2\pi)^{\frac{d}{2}}}{\left(\det\nabla^{2}V(x_{0})\right)^{\frac{1}{2}}}.
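The convergence in Lemma 7.1 can be checked numerically. The sketch below uses the illustrative one-dimensional potential V(x)=x²/2+x⁴ (our choice), for which x₀=0, V(x₀)=0 and V″(x₀)=1, so the limit is √(2π):

```python
import numpy as np

# Riemann-sum check of Lemma 7.1 in d = 1 for V(x) = x^2/2 + x^4:
#   t^{1/2} * int_{|x| < delta} exp(-t V(x)) dx  ->  sqrt(2 pi)  as t -> infty.
def laplace_value(t, delta=0.5, n=200001):
    x, dx = np.linspace(-delta, delta, n, retstep=True)
    return np.sqrt(t) * np.sum(np.exp(-t * (x**2 / 2.0 + x**4))) * dx

for t in (1e2, 1e4, 1e6):
    print(t, laplace_value(t))    # approaches sqrt(2*pi) ~ 2.5066
```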

Lemma 7.2.

Let X=(Xt,t),Y=(Yt,t)X=(X_{t},\mathcal{F}_{t}),Y=(Y_{t},\mathcal{F}_{t}) be strong solutions of the following two stochastic differential equations

dXt=a(Xt,t)dt+σdBt,t[0,1]\displaystyle dX_{t}=a(X_{t},t)dt+\sigma dB_{t},\quad t\in[0,1]
dYt=b(Yt,t)dt+σdBt,Y0=X0,t[0,1],\displaystyle dY_{t}=b(Y_{t},t)dt+\sigma dB_{t},\quad Y_{0}=X_{0},\leavevmode\nobreak\ t\in[0,1],

and X0X_{0} is an 0\mathcal{F}_{0}-measurable random variable. In addition, if the drift terms a(Xt,t)a(X_{t},t) and b(Xt,t)b(X_{t},t) satisfy 𝔼[exp(01a(Xt,t)22+b(Xt,t)22dt)]<\mathbb{E}\left[\exp\left(\int_{0}^{1}\|a(X_{t},t)\|_{2}^{2}+\|b(X_{t},t)\|_{2}^{2}dt\right)\right]<\infty, then we have

dYdX(X)=exp(σ101b(Xt,t)a(Xt,t),dBt12σ201b(Xt,t)a(Xt,t)22𝑑t),\frac{d\mathbb{P}_{Y}}{d\mathbb{P}_{X}}(X)=\exp\left(\sigma^{-1}\int_{0}^{1}\left\langle b(X_{t},t)-a(X_{t},t),dB_{t}\right\rangle-\frac{1}{2\sigma^{2}}\int_{0}^{1}\|b(X_{t},t)-a(X_{t},t)\|_{2}^{2}dt\right), (A.8)

and the relative entropy of X\mathbb{P}_{X} with respect to Y\mathbb{P}_{Y} satisfies

𝔻KL(X||Y)=12σ201𝔼[b(Xt,t)a(Xt,t)22]dt,\mathbb{D}_{\mathrm{KL}}(\mathbb{P}_{X}||\mathbb{P}_{Y})=\frac{1}{2\sigma^{2}}\int_{0}^{1}\mathbb{E}\left[\|b(X_{t},t)-a(X_{t},t)\|_{2}^{2}\right]dt,

where probability distributions X,Y\mathbb{P}_{X},\mathbb{P}_{Y} are induced by process (Xt,0t1)(X_{t},0\leq t\leq 1) and (Yt,0t1)(Y_{t},0\leq t\leq 1), respectively.

Proof.

By the Novikov condition (Revuz and Yor, 2013), we know that

Mt:=exp(σ10tb(Xu,u)a(Xu,u),dBu12σ20tb(Xu,u)a(Xu,u)22𝑑u)M_{t}:=\exp\left(\sigma^{-1}\int_{0}^{t}\left\langle b(X_{u},u)-a(X_{u},u),dB_{u}\right\rangle-\frac{1}{2\sigma^{2}}\int_{0}^{t}\|b(X_{u},u)-a(X_{u},u)\|_{2}^{2}du\right)

is an exponential martingale with 𝔼Mt=1\mathbb{E}M_{t}=1 for all t[0,1]t\in[0,1]. Define a new probability measure \mathbb{Q} by d=M1dd\mathbb{Q}=M_{1}d\mathbb{P}. By Girsanov’s theorem (Revuz and Yor, 2013), under the new probability measure \mathbb{Q},

B~t:=Btσ10t(b(Xu,u)a(Xu,u))𝑑u\widetilde{B}_{t}:=B_{t}-\sigma^{-1}\int_{0}^{t}(b(X_{u},u)-a(X_{u},u))du

is a \mathbb{Q}-Brownian motion. Hence, under the new probability measure \mathbb{Q},

b(Xt,t)dt+σdB~t\displaystyle b(X_{t},t)dt+\sigma d\widetilde{B}_{t} =b(Xt,t)dt+σdBt(b(Xt,t)a(Xt,t))dt\displaystyle=b(X_{t},t)dt+\sigma dB_{t}-(b(X_{t},t)-a(X_{t},t))dt
=a(Xt,t)dt+σdBt=dXt.\displaystyle=a(X_{t},t)dt+\sigma dB_{t}=dX_{t}.

Thus, we have X=Y\mathbb{Q}_{X}=\mathbb{P}_{Y}, where X\mathbb{Q}_{X} is the distribution of XX under the measure \mathbb{Q}, and (A.8) follows. On the other hand, by the definition of the relative entropy of X\mathbb{P}_{X} with respect to Y\mathbb{P}_{Y}, we have

𝔻KL(X||Y)=𝔼[log(dYdX(X))]=12σ201𝔼[b(Xt,t)a(Xt,t)22]dt.\mathbb{D}_{\mathrm{KL}}(\mathbb{P}_{X}||\mathbb{P}_{Y})=\mathbb{E}\left[-\log\left(\frac{d\mathbb{P}_{Y}}{d\mathbb{P}_{X}}(X)\right)\right]=\frac{1}{2\sigma^{2}}\int_{0}^{1}\mathbb{E}\left[\|b(X_{t},t)-a(X_{t},t)\|_{2}^{2}\right]dt.

Therefore, the proof of Lemma 7.2 is completed. ∎
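As an illustrative special case of Lemma 7.2 (our own example, not from the paper): take a(x,t)0a(x,t)\equiv 0 and b(x,t)vb(x,t)\equiv v for a fixed vector vdv\in\mathbb{R}^{d}, so that Xt=X0+σBtX_{t}=X_{0}+\sigma B_{t} and the Novikov-type condition holds trivially. Then (A.8) reduces to the classical Cameron-Martin density and the entropy formula can be evaluated in closed form:

```latex
% Constant drifts: a \equiv 0, b \equiv v.  Then (A.8) gives the
% Cameron-Martin density
\frac{d\mathbb{P}_Y}{d\mathbb{P}_X}(X)
    = \exp\!\Big(\sigma^{-1}\langle v, B_1\rangle
                 - \frac{\|v\|_2^2}{2\sigma^2}\Big),
% which integrates to one under \mathbb{P}_X because
% \mathbb{E}\exp(\sigma^{-1}\langle v, B_1\rangle) = \exp(\|v\|_2^2/(2\sigma^2)).
% The relative entropy formula then reads
\mathbb{D}_{\mathrm{KL}}(\mathbb{P}_X \,\|\, \mathbb{P}_Y)
    = \frac{1}{2\sigma^2}\int_0^1 \|v\|_2^2\, dt
    = \frac{\|v\|_2^2}{2\sigma^2}.
```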

7.5. Proof of Theorem 3.2

Proof.

The result can be traced back to the 1980s; to keep the article self-contained, we use Laplace’s method as in Hwang (1980, 1981) to give a detailed proof. The key idea is to prove that, for each δ>0\delta^{\prime}>0 small enough, μσ({x;xxi2<δ})\mu_{\sigma}(\{x;\|x-x_{i}^{*}\|_{2}<\delta^{\prime}\}) converges to (det2V(xi))12j=1κ(det2V(xj))12\dfrac{\left(\det\nabla^{2}V(x_{i}^{*})\right)^{-\frac{1}{2}}}{\sum_{j=1}^{\kappa}\left(\det\nabla^{2}V(x_{j}^{*})\right)^{-\frac{1}{2}}} as σ0\sigma\downarrow 0. We first introduce the following notation

a(δ):=inf{V(x);xxi2δ,1iκ},\displaystyle a(\delta^{\prime}):=\inf\{V(x);\|x-x_{i}^{*}\|_{2}\geq\delta^{\prime},1\leq i\leq\kappa\},
m~i(σ,δ):=xxi2<δexp(V(x)σ)𝑑x,1iκ,\displaystyle\widetilde{m}_{i}(\sigma,\delta^{\prime}):=\int_{\|x-x_{i}^{*}\|_{2}<\delta^{\prime}}\exp\left(-\frac{V(x)}{\sigma}\right)dx,\quad 1\leq i\leq\kappa,
m~(σ,δ):=i=1κxxiδexp(V(x)σ)𝑑x.\displaystyle\widetilde{m}(\sigma,\delta^{\prime}):=\int_{\bigcup_{i=1}^{\kappa}\|x-x_{i}^{*}\|\geq\delta^{\prime}}\exp\left(-\frac{V(x)}{\sigma}\right)dx.

Hence, we have

μσ({x;xxi2<δ})=xxi2<δexp(V(x)σ)𝑑xdexp(V(x)σ)𝑑x=m~i(σ,δ)j=1κm~j(σ,δ)+m~(σ,δ).\mu_{\sigma}(\{x;\|x-x_{i}^{*}\|_{2}<\delta^{\prime}\})=\dfrac{\int_{\|x-x_{i}^{*}\|_{2}<\delta^{\prime}}\exp\left(-\frac{V(x)}{\sigma}\right)dx}{\int_{\mathbb{R}^{d}}\exp\left(-\frac{V(x)}{\sigma}\right)dx}=\frac{\widetilde{m}_{i}(\sigma,\delta^{\prime})}{\sum_{j=1}^{\kappa}\widetilde{m}_{j}(\sigma,\delta^{\prime})+\widetilde{m}(\sigma,\delta^{\prime})}. (A.9)

On the one hand, 2V(xi)\nabla^{2}V(x_{i}^{*}) is a positive definite symmetric matrix. For any ε(0,1)\varepsilon\in(0,1), we can choose 0<δ<ε0<\delta^{\prime}<\varepsilon such that

(xxi)(2V(xi)εId)(xxi)2V(x)V(xi)(xxi)(2V(xi)+εId)(xxi)2\frac{(x-x_{i}^{*})^{\top}(\nabla^{2}V(x_{i}^{*})-\varepsilon\mbox{\bf I}_{d})(x-x_{i}^{*})}{2}\leq V(x)-V(x_{i}^{*})\leq\frac{(x-x_{i}^{*})^{\top}(\nabla^{2}V(x_{i}^{*})+\varepsilon\mbox{\bf I}_{d})(x-x_{i}^{*})}{2}

holds for any xxi2<δ\|x-x_{i}^{*}\|_{2}<\delta^{\prime}. Thus, for any i=1,,κi=1,\ldots,\kappa, we obtain

(2πσ)d2eV(xi)σm~i(σ,δ)(2πσ)d2xxi2<δexp((xxi)(2V(xi)εId)(xxi)2σ)𝑑x,\displaystyle(2\pi\sigma)^{-\frac{d}{2}}e^{\frac{V(x_{i}^{*})}{\sigma}}\widetilde{m}_{i}(\sigma,\delta^{\prime})\leq(2\pi\sigma)^{-\frac{d}{2}}\int_{\|x-x_{i}^{*}\|_{2}<\delta^{\prime}}\exp\left(-\frac{(x-x_{i}^{*})^{\top}(\nabla^{2}V(x_{i}^{*})-\varepsilon\mbox{\bf I}_{d})(x-x_{i}^{*})}{2\sigma}\right)dx,
(2πσ)d2eV(xi)σm~i(σ,δ)(2πσ)d2xxi2<δexp((xxi)(2V(xi)+εId)(xxi)2σ)𝑑x.\displaystyle(2\pi\sigma)^{-\frac{d}{2}}e^{\frac{V(x_{i}^{*})}{\sigma}}\widetilde{m}_{i}(\sigma,\delta^{\prime})\geq(2\pi\sigma)^{-\frac{d}{2}}\int_{\|x-x_{i}^{*}\|_{2}<\delta^{\prime}}\exp\left(-\frac{(x-x_{i}^{*})^{\top}(\nabla^{2}V(x_{i}^{*})+\varepsilon\mbox{\bf I}_{d})(x-x_{i}^{*})}{2\sigma}\right)dx.

By Lemma 7.1 and letting σ0\sigma\rightarrow 0, we have

(det(2V(xi)+εId))12\displaystyle\left(\det(\nabla^{2}V(x_{i}^{*})+\varepsilon\mbox{\bf I}_{d})\right)^{-\frac{1}{2}} lim infσ0(2πσ)d2eV(xi)σm~i(σ,δ)\displaystyle\leq\liminf_{\sigma\to 0}(2\pi\sigma)^{-\frac{d}{2}}e^{\frac{V(x_{i}^{*})}{\sigma}}\widetilde{m}_{i}(\sigma,\delta^{\prime})
lim supσ0(2πσ)d2eV(xi)σm~i(σ,δ)\displaystyle\leq\limsup_{\sigma\to 0}(2\pi\sigma)^{-\frac{d}{2}}e^{\frac{V(x_{i}^{*})}{\sigma}}\widetilde{m}_{i}(\sigma,\delta^{\prime})
(det(2V(xi)εId))12.\displaystyle\leq\left(\det(\nabla^{2}V(x_{i}^{*})-\varepsilon\mbox{\bf I}_{d})\right)^{-\frac{1}{2}}.

As ε0\varepsilon\downarrow 0, we get

limσ0(2πσ)d2exp(V(xi)σ)m~i(σ,δ)=(det2V(xi))12.\lim_{\sigma\to 0}(2\pi\sigma)^{-\frac{d}{2}}\exp\left(\frac{V(x_{i}^{*})}{\sigma}\right)\widetilde{m}_{i}(\sigma,\delta^{\prime})=\left(\det\nabla^{2}V(x_{i}^{*})\right)^{-\frac{1}{2}}. (A.10)

On the other hand, we have

(2πσ)d2exp(V(xi)σ)m~(σ,δ)\displaystyle(2\pi\sigma)^{-\frac{d}{2}}\exp\left(\frac{V(x_{i}^{*})}{\sigma}\right)\widetilde{m}(\sigma,\delta^{\prime}) =(2πσ)d2exp(a(δ)V(xi)σ)\displaystyle=(2\pi\sigma)^{-\frac{d}{2}}\exp\left(-\frac{a(\delta^{\prime})-V(x_{i}^{*})}{\sigma}\right)
×i=1κxxiδexp(V(x)a(δ)σ)dx.\displaystyle\qquad\quad\times\int_{\bigcup_{i=1}^{\kappa}\|x-x_{i}^{*}\|\geq\delta^{\prime}}\exp\left(-\frac{V(x)-a(\delta^{\prime})}{\sigma}\right)dx.

Since a(δ):=inf{V(x);xxi2δ,1iκ}>V(xi)a(\delta^{\prime}):=\inf\{V(x);\|x-x_{i}^{*}\|_{2}\geq\delta^{\prime},1\leq i\leq\kappa\}>V(x_{i}^{*}) and V(x)a(δ)V(x)\geq a(\delta^{\prime}) holds on {xd:xxi2δ,1iκ}\{x\in\mathbb{R}^{d}:\|x-x_{i}^{*}\|_{2}\geq\delta^{\prime},1\leq i\leq\kappa\}, for any σ(0,1]\sigma\in(0,1] we have

i=1κxxiδexp(V(x)a(δ)σ)𝑑x\displaystyle\int_{\bigcup_{i=1}^{\kappa}\|x-x_{i}^{*}\|\geq\delta^{\prime}}\exp\left(-\frac{V(x)-a(\delta^{\prime})}{\sigma}\right)dx i=1κxxiδexp((V(x)a(δ)))𝑑x\displaystyle\leq\int_{\bigcup_{i=1}^{\kappa}\|x-x_{i}^{*}\|\geq\delta^{\prime}}\exp\left(-(V(x)-a(\delta^{\prime}))\right)dx
exp(a(δ))dexp(V(x))𝑑x<.\displaystyle\leq\exp(a(\delta^{\prime}))\int_{\mathbb{R}^{d}}\exp(-V(x))dx<\infty.

Also, it follows that

(2πσ)d2exp(a(δ)V(xi)σ)0as σ0.(2\pi\sigma)^{-\frac{d}{2}}\exp\left(-\frac{a(\delta^{\prime})-V(x_{i}^{*})}{\sigma}\right)\rightarrow 0\quad\text{as $\sigma\downarrow 0.$}

Thus we get

limσ0(2πσ)d2exp(V(xi)σ)m~(σ,δ)=0.\lim_{\sigma\to 0}(2\pi\sigma)^{-\frac{d}{2}}\exp\left(\frac{V(x_{i}^{*})}{\sigma}\right)\widetilde{m}(\sigma,\delta^{\prime})=0. (A.11)

Taking the limit $\sigma\downarrow 0$ in (A.9) and applying (A.10) and (A.11), we get

μσ({x;xxi2<δ})(det2V(xi))12j=1κ(det2V(xj))12as σ0.\mu_{\sigma}(\{x;\|x-x_{i}^{*}\|_{2}<\delta^{\prime}\})\rightarrow\frac{\left(\det\nabla^{2}V(x_{i}^{*})\right)^{-\frac{1}{2}}}{\sum_{j=1}^{\kappa}\left(\det\nabla^{2}V(x_{j}^{*})\right)^{-\frac{1}{2}}}\quad\text{as $\sigma\downarrow 0.$}

This completes the proof of Theorem 3.2. ∎
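The limiting weights $(\det\nabla^{2}V(x_{i}^{*}))^{-1/2}$ in Theorem 3.2 are easy to probe numerically in one dimension. The sketch below uses a toy double-well of our own, $V(x)=(x^{2}-1)^{2}(x^{2}+x+1)$, whose two global minima $x=\pm 1$ both have $V=0$ but different curvatures, $V''(-1)=8$ and $V''(1)=24$, so the mass ratio should approach $\sqrt{24/8}=\sqrt{3}$:

```python
import numpy as np

# Toy double-well (our own example): global minima at x = -1 and x = 1,
# both with V = 0, but different curvatures V''(-1) = 8, V''(1) = 24.
def V(x):
    return (x**2 - 1.0)**2 * (x**2 + x + 1.0)

sigma = 0.002
x = np.linspace(-3.0, 3.0, 600001)
dx = x[1] - x[0]
p = np.exp(-V(x) / sigma)  # unnormalised density of mu_sigma

mass_left = p[np.abs(x + 1.0) < 0.5].sum() * dx
mass_right = p[np.abs(x - 1.0) < 0.5].sum() * dx
ratio = mass_left / mass_right  # -> sqrt(V''(1)/V''(-1)) = sqrt(3) as sigma -> 0
```

At $\sigma=0.002$ the Laplace correction is already negligible and the computed ratio agrees with $\sqrt{3}\approx 1.732$ to a few parts per thousand.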

7.6. Proof of Theorem 3.5

Proof.

Note that

(V(X1)>τ)=V(x)>τexp(V(x)/σ)𝑑xdexp(V(x)/σ)𝑑x.\mathbb{P}(V(X_{1})>\tau)=\frac{\int_{V(x)>\tau}\exp(-V(x)/\sigma)dx}{\int_{\mathbb{R}^{d}}\exp(-V(x)/\sigma)dx}. (A.12)

Since $V(x)=\|x\|_{2}^{2}/2$ for all $\|x\|_{2}\geq R$, the potential $V$ has at least linear growth at infinity; that is, there exists a constant $C>0$ such that, for $R^{\star}$ large enough,

V(x)miny2=RV(y)+C(x2R) for x2>R.V(x)\geq\min_{\|y\|_{2}=R^{\star}}V(y)+C(\|x\|_{2}-R^{\star})\quad\text{ for $\|x\|_{2}>R^{\star}$.}

We can choose $R^{\star}$ sufficiently large that $\min_{\|y\|_{2}=R^{\star}}V(y)>\tau$. Hence,

V(x)τexp(V(x)/σ)𝑑x\displaystyle\int_{V(x)\geq\tau}\exp(-V(x)/\sigma)dx =V(x)τ,x2Rexp(V(x)/σ)𝑑x+V(x)τ,x2>Rexp(V(x)/σ)𝑑x\displaystyle=\int_{V(x)\geq\tau,\|x\|_{2}\leq R^{\star}}\exp(-V(x)/\sigma)dx+\int_{V(x)\geq\tau,\|x\|_{2}>R^{\star}}\exp(-V(x)/\sigma)dx
exp(τσ)Vol(BR)+V(x)τ,x2>Rexp(τ+C(x2R)σ)𝑑x\displaystyle\leq\exp\left(-\frac{\tau}{\sigma}\right)\text{Vol}(B_{R^{\star}})+\int_{V(x)\geq\tau,\|x\|_{2}>R^{\star}}\exp\left(-\frac{\tau+C(\|x\|_{2}-R^{\star})}{\sigma}\right)dx
exp(τσ)(Vol(BR)+dVol(B1)Rrd1exp{C(rR)}𝑑r)\displaystyle\leq\exp\left(-\frac{\tau}{\sigma}\right)\left(\text{Vol}(B_{R^{\star}})+d\text{Vol}(B_{1})\int_{R^{\star}}^{\infty}r^{d-1}\exp\left\{-C(r-R^{\star})\right\}dr\right)
2Vol(BR)exp(τσ),\displaystyle\leq 2\text{Vol}(B_{R^{\star}})\exp\left(-\frac{\tau}{\sigma}\right), (A.13)

where $\text{Vol}(B_{R^{\star}})$ denotes the volume of the ball of radius $R^{\star}$. On the other hand, since $\min_{x}V(x)=0$ and $V$ is continuous, there exists $r>0$ such that $V(x)<\varepsilon$ whenever $\|x\|_{2}<r$; hence

dexp(V(x)/σ)𝑑xx2<rexp(V(x)/σ)𝑑x>exp(εσ)Vol(Br).\int_{\mathbb{R}^{d}}\exp(-V(x)/\sigma)dx\geq\int_{\|x\|_{2}<r}\exp(-V(x)/\sigma)dx>\exp\left(-\frac{\varepsilon}{\sigma}\right)\text{Vol}(B_{r}). (A.14)

Injecting (A.13) and (A.14) into (A.12), we get

(V(X1)>τ)Cτ,ε,dexp(τεσ),\mathbb{P}(V(X_{1})>\tau)\leq C_{\tau,\varepsilon,d}\exp\left(-\frac{\tau-\varepsilon}{\sigma}\right),

where

Cτ,ε,d:=2Vol(BR)Vol(Br).C_{\tau,\varepsilon,d}:=\frac{2\text{Vol}(B_{R^{\star}})}{\text{Vol}(B_{r})}. (A.15)
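In the quadratic case the bound can be made fully explicit: for $V(x)=x^{2}/2$ in one dimension, $\mu_{\sigma}=N(0,\sigma)$ and $\mathbb{P}(V(X_{1})>\tau)=\mathrm{erfc}(\sqrt{\tau/\sigma})\leq\exp(-\tau/\sigma)$, matching the rate above with $\varepsilon\downarrow 0$. A quick numerical check (the sketch and all names are our own):

```python
import math

def tail_prob(tau, sigma):
    # For V(x) = x^2/2 on R, mu_sigma = N(0, sigma), so
    # P(V(X_1) > tau) = P(|X_1| > sqrt(2*tau)) = erfc(sqrt(tau/sigma)).
    return math.erfc(math.sqrt(tau / sigma))

for sigma in (0.5, 0.1, 0.02):
    for tau in (0.5, 1.0, 2.0):
        # the bound of the proof with epsilon -> 0:
        assert tail_prob(tau, sigma) <= math.exp(-tau / sigma)
```

The tail mass decays exponentially in $\tau/\sigma$, exactly as (A.15) predicts.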

Next, we prove that the second conclusion holds in the discrete case. Recall that $s=1/K$ is the step size, $t_{k}:=ks$ is the cumulative time up to iteration $k$, and $\{X_{t}\}_{t\in[0,1]}$ is the Schrödinger-Föllmer diffusion process (7). Let $\mu_{t_{k}}$ be the probability measure of $Y_{t_{k}}$ defined by (10). Then for fixed $\tau>0$, we have

\mathbb{P}(V(\widetilde{Y}_{t_{K}})>\tau) \leq\mathbb{P}(V(Y_{t_{K}})>\tau)+|\mathbb{P}(V(\widetilde{Y}_{t_{K}})>\tau)-\mathbb{P}(V(Y_{t_{K}})>\tau)|
\leq\mathbb{P}(V(Y_{t_{K}})>\tau)+\|\widetilde{Y}_{t_{K}}-Y_{t_{K}}\|_{TV}
\leq\mathbb{P}(V(X_{t_{K}})>\tau)+\|X_{t_{K}}-Y_{t_{K}}\|_{TV}+\|\widetilde{Y}_{t_{K}}-Y_{t_{K}}\|_{TV}
\leq\mathbb{P}(V(X_{t_{K}})>\tau)+\sqrt{2\mathbb{D}_{\mathrm{KL}}(\mu_{t_{K}}||\mu_{\sigma})}+\sqrt{2\mathbb{D}_{\mathrm{KL}}(\text{Law}(\widetilde{Y}_{t_{K}})||\mu_{t_{K}})}, (A.16)

where the last inequality uses Pinsker’s inequality (Bakry et al., 2014), and the second inequality holds because, letting $g(x):=\mathbbm{1}_{V(x)>\tau}$, we have

|\mathbb{P}(V(\widetilde{Y}_{t_{K}})>\tau)-\mathbb{P}(V(Y_{t_{K}})>\tau)| =\left|\mathbb{E}\left(g(\widetilde{Y}_{t_{K}})-g(Y_{t_{K}})\right)\right|
=\left|\int_{\mathbb{R}^{d}}g(x)d\left(\mathbb{P}_{\widetilde{Y}_{t_{K}}}(x)-\mathbb{P}_{Y_{t_{K}}}(x)\right)\right|
\leq|\mathbb{P}_{\widetilde{Y}_{t_{K}}}-\mathbb{P}_{Y_{t_{K}}}|(\mathbb{R}^{d})=\|\widetilde{Y}_{t_{K}}-Y_{t_{K}}\|_{TV},

where the total variation distance between two probability measures $\mu,\nu$ on $\mathbb{R}^{d}$ is defined by $\|\mu-\nu\|_{TV}:=\left|\mu-\nu\right|(\mathbb{R}^{d})=2\sup_{A}|\mu(A)-\nu(A)|$, the supremum running over Borel sets $A\subseteq\mathbb{R}^{d}$.
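With this normalisation, Pinsker’s inequality reads $\|\mu-\nu\|_{TV}\leq\sqrt{2\mathbb{D}_{\mathrm{KL}}(\mu||\nu)}$. For two unit-variance Gaussians $N(0,1)$ and $N(a,1)$ both sides are explicit, $\mathbb{D}_{\mathrm{KL}}=a^{2}/2$ and $\|\cdot\|_{TV}=2\,\mathrm{erf}(a/(2\sqrt{2}))$, which gives a quick sanity check (our own example, not from the paper):

```python
import math

def tv(a):
    # ||N(0,1) - N(a,1)||_TV with the 2*sup normalisation used above:
    # densities cross at x = a/2, so sup-difference = 2*Phi(a/2) - 1.
    return 2.0 * math.erf(a / (2.0 * math.sqrt(2.0)))

def pinsker_rhs(a):
    # sqrt(2 * KL(N(0,1) || N(a,1))) = sqrt(2 * a^2/2) = |a|
    return abs(a)

for a in (0.1, 0.5, 1.0, 2.0, 4.0):
    assert tv(a) <= pinsker_rhs(a)
```

For small $a$ the ratio of the two sides approaches $2/\sqrt{2\pi}\approx 0.798$, so the inequality is tight up to a constant in this regime.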

First, from the first part of the proof we obtain a bound for the first term on the right-hand side of (A.16): for each $\varepsilon\in(0,\tau)$, with the constant $C_{\tau,\varepsilon,d}$ defined in (A.15),

(V(XtK)>τ)Cτ,ε,dexp(τεσ).\mathbb{P}(V(X_{t_{K}})>\tau)\leq C_{\tau,\varepsilon,d}\exp\left(-\frac{\tau-\varepsilon}{\sigma}\right). (A.17)

Secondly, we bound $\mathbb{D}_{\mathrm{KL}}(\mu_{t_{K}}||\mu_{\sigma})$. To make use of continuous-time tools, we construct a continuous-time interpolation of the discrete-time algorithm (10). In particular, we define a stochastic process $\{Y_{t}\}_{t\in[0,1]}$ via the SDE

dYt=σb^(Yt,t)dt+σdBt,t[0,1],Y0=0,dY_{t}=\sigma\hat{b}(Y_{t},t)dt+\sqrt{\sigma}dB_{t},\quad t\in[0,1],Y_{0}=0, (A.18)

with the non-anticipative drift $\hat{b}(Y_{t},t):=\sum_{k=0}^{K-1}b(Y_{t_{k}},t_{k})\mathbbm{1}_{[t_{k},t_{k+1})}(t)$. By Proposition 2.3, the process $\{X_{t}\}_{t\in[0,1]}$ in (7) satisfies $X_{1}\sim\mu_{\sigma}$. Denote by $\mu_{1}^{x}$ and $\nu_{1}^{y}$ the laws on $C([0,1],\mathbb{R}^{d})$ of $(X_{t})_{t\in[0,1]}$ and $(Y_{t})_{t\in[0,1]}$, respectively. Combining (10), (A.18) and Lemma 7.2, we obtain

𝔻KL(μtK||μσ)\displaystyle\mathbb{D}_{\mathrm{KL}}(\mu_{t_{K}}||\mu_{\sigma}) 𝔻KL(ν1y||μ1x)=σ201𝔼(b(Yt,t)b^(Yt,t)22)dt\displaystyle\leq\mathbb{D}_{\mathrm{KL}}(\nu^{y}_{1}||\mu^{x}_{1})=\frac{\sigma}{2}\int_{0}^{1}\mathbb{E}\left(\|b(Y_{t},t)-\hat{b}(Y_{t},t)\|_{2}^{2}\right)dt
=σ2k=0K1tktk+1𝔼(b(Yt,t)b(Ytk,tk)22)𝑑t\displaystyle=\frac{\sigma}{2}\sum_{k=0}^{K-1}\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left(\|b(Y_{t},t)-b(Y_{t_{k}},t_{k})\|_{2}^{2}\right)dt
σdC1,σ2k=0K1tktk+1𝔼(YtYtk22+(ttk))𝑑t\displaystyle\leq\sigma dC_{1,\sigma}^{2}\sum_{k=0}^{K-1}\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left(\|Y_{t}-Y_{t_{k}}\|_{2}^{2}+(t-t_{k})\right)dt
=σdC1,σ2[k=0K1tktk+1𝔼(b(Ytk)(ttk)σ+σ(BtBtk)22)𝑑t+12K]\displaystyle=\sigma dC_{1,\sigma}^{2}\left[\sum_{k=0}^{K-1}\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left(\|b(Y_{t_{k}})(t-t_{k})\sigma+\sqrt{\sigma}(B_{t}-B_{t_{k}})\|_{2}^{2}\right)dt+\frac{1}{2K}\right]
2σdC1,σ2k=0K1tktk+1𝔼(b(Ytk,tk)22σ2(ttk)2+σd(ttk))𝑑t+σd2KC1,σ2\displaystyle\leq 2\sigma dC_{1,\sigma}^{2}\sum_{k=0}^{K-1}\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left(\|b(Y_{t_{k}},t_{k})\|_{2}^{2}\sigma^{2}(t-t_{k})^{2}+\sigma d(t-t_{k})\right)dt+\frac{\sigma d}{2K}C_{1,\sigma}^{2}
2dC1,σ2k=0K1tktk+1[(γσξσ)2σ3(ttk)2+d(ttk)σ2]𝑑t+σd2KC1,σ2\displaystyle\leq 2dC_{1,\sigma}^{2}\sum_{k=0}^{K-1}\int_{t_{k}}^{t_{k+1}}\left[\left(\frac{\gamma_{\sigma}}{\xi_{\sigma}}\right)^{2}\sigma^{3}(t-t_{k})^{2}+d(t-t_{k})\sigma^{2}\right]dt+\frac{\sigma d}{2K}C_{1,\sigma}^{2}
4dσ33K2(γσ4ξσ4+γσ6ξσ6)+σd(2dσ+1)K(γσ2ξσ2+γσ4ξσ4)d(2d+3)KC1,σ,\displaystyle\leq\frac{4d\sigma^{3}}{3K^{2}}\left(\frac{\gamma^{4}_{\sigma}}{\xi^{4}_{\sigma}}+\frac{\gamma^{6}_{\sigma}}{\xi^{6}_{\sigma}}\right)+\frac{\sigma d(2d\sigma+1)}{K}\left(\frac{\gamma^{2}_{\sigma}}{\xi^{2}_{\sigma}}+\frac{\gamma^{4}_{\sigma}}{\xi^{4}_{\sigma}}\right)\leq\frac{d(2d+3)}{K}C^{\star}_{1,\sigma}, (A.19)

where

C1,σ:=γσ2ξσ2+γσ4ξσ4+γσ6ξσ6,γσξσ={(M2,Rσ)2+M3,Rσ}exp(M1,Rm1,Rσ),C^{\star}_{1,\sigma}:=\frac{\gamma^{2}_{\sigma}}{\xi^{2}_{\sigma}}+\frac{\gamma^{4}_{\sigma}}{\xi^{4}_{\sigma}}+\frac{\gamma^{6}_{\sigma}}{\xi^{6}_{\sigma}},\quad\frac{\gamma_{\sigma}}{\xi_{\sigma}}=\left\{\left(\frac{M_{2,R}}{\sigma}\right)^{2}+\frac{M_{3,R}}{\sigma}\right\}\exp\left(\frac{M_{1,R}-m_{1,R}}{\sigma}\right),

where the second inequality holds by Remark 4.1 in Huang et al. (2021), the third inequality by $(a+b)^{2}\leq 2(a^{2}+b^{2})$, the second equality holds by the continuous-time interpolation (A.18), and the fourth inequality follows from $\|b(x,t)\|_{2}^{2}\leq\gamma_{\sigma}^{2}/\xi_{\sigma}^{2}$.

It remains to bound the relative entropy $\mathbb{D}_{\mathrm{KL}}(\text{Law}(\widetilde{Y}_{t_{K}})||\mu_{t_{K}})$. As in the bound for $\mathbb{D}_{\mathrm{KL}}(\mu_{t_{K}}||\mu_{\sigma})$, we construct a continuous-time interpolation process $\{\widetilde{Y}_{t}\}_{t\in[0,1]}$ defined by

dY~t=σb^m(Y~t,t)dt+σdBt,t[0,1],Y~0=0,d\widetilde{Y}_{t}=\sigma\hat{b}_{m}(\widetilde{Y}_{t},t)dt+\sqrt{\sigma}dB_{t},\quad t\in[0,1],\quad\widetilde{Y}_{0}=0, (A.20)

with the non-anticipative drift $\hat{b}_{m}(\widetilde{Y}_{t},t):=\sum_{k=0}^{K-1}\widetilde{b}_{m}(\widetilde{Y}_{t_{k}},t_{k})\mathbbm{1}_{[t_{k},t_{k+1})}(t)$, where $\widetilde{b}_{m}(\widetilde{Y}_{t_{k}},t_{k})$ is defined by (12) or (13). Denote by $\nu_{1}^{y}$ and $\widetilde{\nu}_{1}^{y}$ the laws on $C([0,1],\mathbb{R}^{d})$ of $(Y_{t})_{t\in[0,1]}$ and $(\widetilde{Y}_{t})_{t\in[0,1]}$, respectively. Combining (A.18), (A.20), Lemma 7.2 and Lemma 7.5, we get

𝔻KL(Law(Y~tK)||μtK)𝔻KL(ν~1y||ν1y)\displaystyle\mathbb{D}_{\mathrm{KL}}(\text{Law}(\widetilde{Y}_{t_{K}})||\mu_{t_{K}})\leq\mathbb{D}_{\mathrm{KL}}(\widetilde{\nu}_{1}^{y}||\nu^{y}_{1}) =σ201𝔼(b^(Y~t,t)b^m(Y~t,t)22)𝑑t\displaystyle=\frac{\sigma}{2}\int_{0}^{1}\mathbb{E}\left(\|\hat{b}(\widetilde{Y}_{t},t)-\hat{b}_{m}(\widetilde{Y}_{t},t)\|_{2}^{2}\right)dt
=σ2k=0K1tktk+1𝔼(b(Y~tk,tk)b~m(Y~tk,tk)22)𝑑t\displaystyle=\frac{\sigma}{2}\sum_{k=0}^{K-1}\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left(\|b(\widetilde{Y}_{t_{k}},t_{k})-\widetilde{b}_{m}(\widetilde{Y}_{t_{k}},t_{k})\|_{2}^{2}\right)dt
4dσmC2,σ,\displaystyle\leq\frac{4d\sigma}{m}C^{\star}_{2,\sigma}, (A.21)

where

C2,σ:=γσ4ξσ4+γσ2ζσ2ξσ4,γσξσ={(M2,Rσ)2+M3,Rσ}exp(M1,Rm1,Rσ),C^{\star}_{2,\sigma}:=\frac{\gamma^{4}_{\sigma}}{\xi^{4}_{\sigma}}+\frac{\gamma_{\sigma}^{2}\zeta^{2}_{\sigma}}{\xi^{4}_{\sigma}},\quad\frac{\gamma_{\sigma}}{\xi_{\sigma}}=\left\{\left(\frac{M_{2,R}}{\sigma}\right)^{2}+\frac{M_{3,R}}{\sigma}\right\}\exp\left(\frac{M_{1,R}-m_{1,R}}{\sigma}\right),

with $\zeta_{\sigma}=\exp\left(M_{1,R}/\sigma\right)$. Injecting (A.17), (A.19) and (A.21) into (A.16) yields the desired results. ∎
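For reference, the discrete-time sampler analysed above fits in a few lines of NumPy. The sketch below is our reading of the Euler-Maruyama scheme (10) with the Monte Carlo drift estimate (12), taking $\hat{f}_{\sigma}(x)=\exp\{(-V(x)+\|x\|_{2}^{2}/2)/\sigma\}$ so that $\nabla\hat{f}_{\sigma}(x)=\hat{f}_{\sigma}(x)(x-\nabla V(x))/\sigma$; all function and variable names are our own:

```python
import numpy as np

def sf_sampler(V, grad_V, d, sigma, K=100, m=100, rng=None):
    """One draw of Y_tK: Euler-Maruyama with Monte Carlo drift (a sketch)."""
    rng = rng or np.random.default_rng(0)
    s = 1.0 / K                       # step size, t_k = k*s
    y = np.zeros(d)                   # Y_0 = 0
    for k in range(K):
        t = k * s
        # m Gaussian proposals for the expectations h and e in b = h/e
        z = y + np.sqrt((1.0 - t) * sigma) * rng.standard_normal((m, d))
        logw = (-V(z) + 0.5 * np.sum(z * z, axis=1)) / sigma  # log f_hat
        w = np.exp(logw - logw.max())
        w /= w.sum()                  # self-normalised weights: b_m = h_m / e_m
        b = w @ ((z - grad_V(z)) / sigma)
        y = y + sigma * s * b + np.sqrt(sigma * s) * rng.standard_normal(d)
    return y
```

For the quadratic potential $V(x)=\|x\|_{2}^{2}/2$ one has $\hat{f}_{\sigma}\equiv 1$, the drift vanishes, and $Y_{t_{K}}\sim N(0,\sigma\mbox{\bf I}_{d})$ exactly, which makes a convenient correctness check.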

7.7. Preliminary lemmas for Theorem 3.7

First, we introduce Lemmas 7.3-7.5, which prepare for the proof of Theorem 3.7.

Lemma 7.3.

Assume that (P1) and (P2) hold. Then

𝔼[Xt22]2(C0,σ+d)exp(2C0,σt).\displaystyle\mathbb{E}\left[\|X_{t}\|_{2}^{2}\right]\leq 2(C_{0,\sigma}+d)\exp(2C_{0,\sigma}t).
Proof.

From the definition of XtX_{t} in (7), we have Xt2σ0tb(Xu,u)2du+σBt2.\|X_{t}\|_{2}\leq\sigma\int_{0}^{t}\|b(X_{u},u)\|_{2}\mathrm{d}u+\sqrt{\sigma}\|B_{t}\|_{2}. Then, we can get

Xt22\displaystyle\|X_{t}\|_{2}^{2} 2σ2(0tb(Xu,u)2du)2+2σBt22\displaystyle\leq 2\sigma^{2}\left(\int_{0}^{t}\|b(X_{u},u)\|_{2}\mathrm{d}u\right)^{2}+2\sigma\|B_{t}\|_{2}^{2}
2t0tb(Xu,u)22du+2Bt22\displaystyle\leq 2t\int_{0}^{t}\|b(X_{u},u)\|_{2}^{2}\mathrm{d}u+2\|B_{t}\|_{2}^{2}
2t0tC0,σ(Xu22+1)du+2Bt22,\displaystyle\leq 2t\int_{0}^{t}C_{0,\sigma}\left(\|X_{u}\|_{2}^{2}+1\right)\mathrm{d}u+2\|B_{t}\|_{2}^{2},

where the first inequality holds by $(a+c)^{2}\leq 2a^{2}+2c^{2}$, the second by the Cauchy-Schwarz inequality together with $\sigma\leq 1$, and the last by (C1). Thus,

𝔼[Xt22]\displaystyle\mathbb{E}\left[\|X_{t}\|_{2}^{2}\right] 2t0tC0,σ(𝔼[Xu22]+1)du+2𝔼[Bt22]\displaystyle\leq 2t\int_{0}^{t}C_{0,\sigma}\left(\mathbb{E}\left[\|X_{u}\|_{2}^{2}\right]+1\right)\mathrm{d}u+2\mathbb{E}\left[\|B_{t}\|_{2}^{2}\right]
2C0,σ0t𝔼[Xu22]du+2(C0,σ+d).\displaystyle\leq 2C_{0,\sigma}\int_{0}^{t}\mathbb{E}\left[\|X_{u}\|_{2}^{2}\right]\mathrm{d}u+2(C_{0,\sigma}+d).

By the Grönwall inequality, we have

𝔼[Xt22]2(C0,σ+d)exp(2C0,σt).\displaystyle\mathbb{E}\left[\|X_{t}\|_{2}^{2}\right]\leq 2(C_{0,\sigma}+d)\exp(2C_{0,\sigma}t).

Lemma 7.4.

Assume that (P1) and (P2) hold. Then for any $0\leq t_{1}\leq t_{2}\leq 1$,

𝔼[Xt2Xt122]2(t2t1){1+C0,σexp(2C0,σ+1)}.\displaystyle\mathbb{E}\left[\|X_{t_{2}}-X_{t_{1}}\|_{2}^{2}\right]\leq 2(t_{2}-t_{1})\left\{1+C_{0,\sigma}\exp(2\sqrt{C_{0,\sigma}}+1)\right\}.
Proof.

Using (C1) and the elementary inequality $2ab\leq\varepsilon a^{2}+b^{2}/\varepsilon$, one can derive that for any $\varepsilon>0$,

2x,b(x,t)εx22+b(x,t)22εεx22+C0,σε(1+x22).2\left\langle x,b(x,t)\right\rangle\leq\varepsilon\|x\|^{2}_{2}+\frac{\|b(x,t)\|^{2}_{2}}{\varepsilon}\leq\varepsilon\|x\|^{2}_{2}+\frac{C_{0,\sigma}}{\varepsilon}(1+\|x\|^{2}_{2}).

Letting ε=C0,σ\varepsilon=\sqrt{C_{0,\sigma}} yields

x,b(x,t)C0,σ(1+x22).\left\langle x,b(x,t)\right\rangle\leq\sqrt{C_{0,\sigma}}(1+\|x\|^{2}_{2}). (A.22)

From the definition of XtX_{t} in (7), we have

Xt=σ0tb(Xs,s)𝑑s+σ0t𝑑Bs,t[0,1].X_{t}=\sigma\int_{0}^{t}b(X_{s},s)ds+\sqrt{\sigma}\int_{0}^{t}dB_{s},\quad\forall t\in[0,1].

On the one hand, by the Itô formula and (A.22), for any t[0,1]t\in[0,1], we have

1+Xt22\displaystyle 1+\|X_{t}\|^{2}_{2} =1+2σ0tXs,b(Xs,s)𝑑s+0tσ𝑑s+20tXs,σdBs\displaystyle=1+2\sigma\int_{0}^{t}\left\langle X_{s},b(X_{s},s)\right\rangle ds+\int_{0}^{t}\sigma ds+2\int_{0}^{t}\left\langle X_{s},\sqrt{\sigma}dB_{s}\right\rangle
1+20t(Xsb(Xs,s)+12)𝑑s+20tXs,dBs\displaystyle\leq 1+2\int_{0}^{t}\left(X^{\top}_{s}b(X_{s},s)+\frac{1}{2}\right)ds+2\int_{0}^{t}\left\langle X_{s},dB_{s}\right\rangle
1+2α0t(1+Xs22)𝑑s+20tXs,dBs,\displaystyle\leq 1+2\alpha\int_{0}^{t}\left(1+\|X_{s}\|^{2}_{2}\right)ds+2\int_{0}^{t}\left\langle X_{s},dB_{s}\right\rangle,

where $\alpha:=\sqrt{C_{0,\sigma}}+1/2$. Taking expectations (the stochastic integral has mean zero), we get

𝔼[1+Xt22]1+2α0t𝔼[1+Xs22]𝑑s.\mathbb{E}\left[1+\|X_{t}\|^{2}_{2}\right]\leq 1+2\alpha\int_{0}^{t}\mathbb{E}\left[1+\|X_{s}\|^{2}_{2}\right]ds.

The Grönwall inequality yields

\mathbb{E}\left[1+\|X_{t}\|_{2}^{2}\right]\leq\exp\left(2\alpha t\right)\leq\exp\left(2\sqrt{C_{0,\sigma}}+1\right),\quad\forall t\in[0,1]. (A.23)

On the other hand, by the elementary inequality $(a+b)^{2}\leq 2(a^{2}+b^{2})$, we get

𝔼[Xt2Xt122]2𝔼[(t1t2σb(Xs,s)𝑑s)2]+2𝔼[(t1t2σ𝑑Bs)2].\mathbb{E}\left[\|X_{t_{2}}-X_{t_{1}}\|^{2}_{2}\right]\leq 2\mathbb{E}\left[\left(\int_{t_{1}}^{t_{2}}\sigma b(X_{s},s)ds\right)^{2}\right]+2\mathbb{E}\left[\left(\int_{t_{1}}^{t_{2}}\sqrt{\sigma}dB_{s}\right)^{2}\right].

Thus, using the Cauchy-Schwarz inequality, Burkholder-Davis-Gundy inequality, (C1) and (A.23), we deduce that

𝔼[Xt2Xt122]\displaystyle\mathbb{E}\left[\|X_{t_{2}}-X_{t_{1}}\|^{2}_{2}\right] 2σ2(t2t1)t1t2𝔼[b(Xs,s)22]𝑑s+2σ(t2t1)\displaystyle\leq 2\sigma^{2}(t_{2}-t_{1})\int_{t_{1}}^{t_{2}}\mathbb{E}\left[\|b(X_{s},s)\|_{2}^{2}\right]ds+2\sigma(t_{2}-t_{1})
2C0,σt1t2𝔼[1+Xs22]𝑑s+2(t2t1)\displaystyle\leq 2C_{0,\sigma}\int_{t_{1}}^{t_{2}}\mathbb{E}\left[1+\|X_{s}\|^{2}_{2}\right]ds+2(t_{2}-t_{1})
{2+2C0,σexp(2C0,σ+1)}(t2t1).\displaystyle\leq\left\{2+2C_{0,\sigma}\exp(2\sqrt{C_{0,\sigma}}+1)\right\}(t_{2}-t_{1}).

Lemma 7.5.

Assume that (P1) and (P2) hold. Then

supxd,t[0,1]𝔼[b(x,t)b~m(t,x)22]4dmC2,σ,\displaystyle\underset{x\in\mathbb{R}^{d},t\in[0,1]}{\sup}\mathbb{E}\left[\|b(x,t)-\tilde{b}_{m}(t,x)\|_{2}^{2}\right]\leq\frac{4d}{m}C^{\star}_{2,\sigma},

where

C2,σ:=γσ4ξσ4+γσ2ζσ2ξσ4,γσξσ={(M2,Rσ)2+M3,Rσ}exp(M1,Rm1,Rσ),C^{\star}_{2,\sigma}:=\frac{\gamma^{4}_{\sigma}}{\xi^{4}_{\sigma}}+\frac{\gamma_{\sigma}^{2}\zeta^{2}_{\sigma}}{\xi^{4}_{\sigma}},\quad\frac{\gamma_{\sigma}}{\xi_{\sigma}}=\left\{\left(\frac{M_{2,R}}{\sigma}\right)^{2}+\frac{M_{3,R}}{\sigma}\right\}\exp\left(\frac{M_{1,R}-m_{1,R}}{\sigma}\right),

with ζσ=exp(M1,R/σ)\zeta_{\sigma}=\exp\left(M_{1,R}/\sigma\right).

Proof.

Let $\mathbf{Z}=\{Z_{1},\ldots,Z_{m}\}$ and $\mathbf{Z}^{\prime}=\{Z_{1}^{\prime},\ldots,Z_{m}^{\prime}\}$ be two independent sequences of i.i.d. copies of $Z\sim N(0,\mbox{\bf I}_{d})$. For notational convenience, we also denote

h:=𝔼Z[f^σ(x+(1t)σZ)],hm:=i=1mf^σ(x+(1t)σZi)m,\displaystyle h:=\mathbb{E}_{Z}\left[\nabla\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z)\right],\leavevmode\nobreak\ h_{m}:=\frac{\sum_{i=1}^{m}\nabla\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{i})}{m},
e:=𝔼Z[f^σ(x+(1t)σZ)],em:=i=1mf^σ(x+(1t)σZi)m,\displaystyle e:=\mathbb{E}_{Z}\left[\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z)\right],\leavevmode\nobreak\ e_{m}:=\frac{\sum_{i=1}^{m}\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{i})}{m},
hm:=i=1mf^σ(x+(1t)σZi)m,em:=i=1mf^σ(x+(1t)σZi)m.\displaystyle h_{m}^{\prime}:=\frac{\sum_{i=1}^{m}\nabla\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{i}^{\prime})}{m},\leavevmode\nobreak\ e_{m}^{\prime}:=\frac{\sum_{i=1}^{m}\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{i}^{\prime})}{m}.

Since $h-h_{m}=\mathbb{E}\left[h_{m}^{\prime}-h_{m}|\mathbf{Z}\right]$, Jensen’s inequality gives $\|h-h_{m}\|^{2}_{2}\leq\mathbb{E}\left[\|h_{m}^{\prime}-h_{m}\|^{2}_{2}|\mathbf{Z}\right]$. Moreover, we have

𝔼[hhm2]\displaystyle\mathbb{E}\left[\|h-h_{m}\|^{2}\right] 𝔼{𝔼[hmhm22|𝐙]}=𝔼[hmhm22]\displaystyle\leq\mathbb{E}\left\{\mathbb{E}\left[\|h_{m}^{\prime}-h_{m}\|^{2}_{2}|{\bf Z}\right]\right\}=\mathbb{E}\left[\|h_{m}^{\prime}-h_{m}\|^{2}_{2}\right]
=𝔼Z1,Z1[f^σ(x+(1t)σZ1)f^σ(x+(1t)σZ1)22]m\displaystyle=\frac{\mathbb{E}_{Z_{1},Z_{1}^{\prime}}\left[\left\|\nabla\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{1})-\nabla\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{1}^{\prime})\right\|^{2}_{2}\right]}{m}
σ(1t)γσ2m𝔼Z1,Z1[Z1Z122]\displaystyle\leq\frac{\sigma(1-t)\gamma_{\sigma}^{2}}{m}\mathbb{E}_{Z_{1},Z_{1}^{\prime}}\left[\left\|Z_{1}-Z_{1}^{\prime}\right\|^{2}_{2}\right]
2dγσ2m,\displaystyle\leq\frac{2d\gamma_{\sigma}^{2}}{m}, (A.24)

where the second inequality holds by (P1), and the last inequality follows from $\sigma(1-t)\leq 1$ and the fact that $Z_{1}$ and $Z_{1}^{\prime}$ are independent standard normal vectors, so that $\mathbb{E}_{Z_{1},Z_{1}^{\prime}}[\|Z_{1}-Z_{1}^{\prime}\|_{2}^{2}]=2d$. Similarly, we also have

𝔼[|eem|2]\displaystyle\mathbb{E}\left[|e-e_{m}|^{2}\right] 𝔼[|emem|2]\displaystyle\leq\mathbb{E}\left[|e_{m}^{\prime}-e_{m}|^{2}\right]
=𝔼Z1,Z1[|f^σ(x+(1t)σZ1)f^σ(x+(1t)σZ1)|2]m\displaystyle=\frac{\mathbb{E}_{Z_{1},Z_{1}^{\prime}}\left[\left|\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{1})-\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{1}^{\prime})\right|^{2}\right]}{m}\
σ(1t)γσ2m𝔼Z1,Z1[Z1Z122]\displaystyle\leq\frac{\sigma(1-t)\gamma_{\sigma}^{2}}{m}\mathbb{E}_{Z_{1},Z_{1}^{\prime}}\left[\left\|Z_{1}-Z_{1}^{\prime}\right\|^{2}_{2}\right]
2dγσ2m,\displaystyle\leq\frac{2d\gamma_{\sigma}^{2}}{m}, (A.25)

where the second inequality holds due to (P1). Thus, by (A.24) and (A.25), it follows that

supxd,t[0,1]𝔼[hhm22]2dγσ2m,\displaystyle\underset{x\in\mathbb{R}^{d},t\in[0,1]}{\sup}\mathbb{E}\left[\left\|h-h_{m}\right\|_{2}^{2}\right]\leq\frac{2d\gamma_{\sigma}^{2}}{m}, (A.26)
supxd,t[0,1]𝔼[|eem|2]2dγσ2m.\displaystyle\underset{x\in\mathbb{R}^{d},t\in[0,1]}{\sup}\mathbb{E}\left[|e-e_{m}|^{2}\right]\leq\frac{2d\gamma_{\sigma}^{2}}{m}. (A.27)

Then, by (P1) and (P2), some simple calculations yield that

b(x,t)b~m(x,t)2\displaystyle\|b(x,t)-\tilde{b}_{m}(x,t)\|_{2} =hehmem2\displaystyle=\left\|\frac{h}{e}-\frac{h_{m}}{e_{m}}\right\|_{2}
h2|eme|+hhm2|e||eem|\displaystyle\leq\frac{\|h\|_{2}|e_{m}-e|+\|h-h_{m}\|_{2}|e|}{|ee_{m}|}
γσ|eme|+hhm2|e|ξσ2.\displaystyle\leq\frac{\gamma_{\sigma}|e_{m}-e|+\|h-h_{m}\|_{2}|e|}{\xi_{\sigma}^{2}}. (A.28)

Recall that ζσ=exp(M1,R/σ)\zeta_{\sigma}=\exp(M_{1,R}/\sigma) with

M1,R:=maxx2R{V(x)+x222},\quad M_{1,R}:=\max_{\|x\|_{2}\leq R}\left\{-V(x)+\frac{\|x\|^{2}_{2}}{2}\right\},

then $\hat{f}_{\sigma}(x)\leq\zeta_{\sigma}$ for any $x\in B_{R}$. Further, by (A.28), it follows that for all $x\in\mathbb{R}^{d}$ and $t\in[0,1]$,

b(x,t)b~m(x,t)22\displaystyle\|b(x,t)-\tilde{b}_{m}(x,t)\|_{2}^{2} 2γσ2|eme|2+ζσ2hhm22ξσ4.\displaystyle\leq 2\frac{\gamma_{\sigma}^{2}|e_{m}-e|^{2}+\zeta_{\sigma}^{2}\|h-h_{m}\|_{2}^{2}}{\xi_{\sigma}^{4}}. (A.29)

Then, combining (A.26)-(A.27) and (A.29), it follows that

supxd,t[0,1]𝔼[b(x,t)b~m(t,x)22]4dmC2,σ,\displaystyle\underset{x\in\mathbb{R}^{d},t\in[0,1]}{\sup}\mathbb{E}\left[\|b(x,t)-\tilde{b}_{m}(t,x)\|_{2}^{2}\right]\leq\frac{4d}{m}C^{\star}_{2,\sigma},

where

C2,σ:=γσ4ξσ4+γσ2ζσ2ξσ4,γσξσ={(M2,Rσ)2+M3,Rσ}exp(M1,Rm1,Rσ),ζσ=exp(M1,Rσ).C^{\star}_{2,\sigma}:=\frac{\gamma^{4}_{\sigma}}{\xi^{4}_{\sigma}}+\frac{\gamma_{\sigma}^{2}\zeta^{2}_{\sigma}}{\xi^{4}_{\sigma}},\quad\frac{\gamma_{\sigma}}{\xi_{\sigma}}=\left\{\left(\frac{M_{2,R}}{\sigma}\right)^{2}+\frac{M_{3,R}}{\sigma}\right\}\exp\left(\frac{M_{1,R}-m_{1,R}}{\sigma}\right),\quad\zeta_{\sigma}=\exp\left(\frac{M_{1,R}}{\sigma}\right).
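The $4dC^{\star}_{2,\sigma}/m$ bound above is the usual Monte Carlo $O(1/m)$ mean-squared error, which is easy to probe empirically. In the sketch below we use $\tanh$ as a generic bounded integrand (our own stand-in for $\hat{f}_{\sigma}$ and $\nabla\hat{f}_{\sigma}$; all names are ours):

```python
import numpy as np

def mc_mse(m, reps=4000, rng=None):
    # Mean-squared error of the m-sample Monte Carlo estimate of
    # E[tanh(Z)] = 0, Z ~ N(0,1), averaged over `reps` replications.
    rng = rng or np.random.default_rng(7)
    est = np.tanh(rng.standard_normal((reps, m))).mean(axis=1)
    return float(np.mean(est**2))

mse_10, mse_160 = mc_mse(10), mc_mse(160)
# O(1/m) rate: 16x more samples should shrink the MSE by roughly 16x
```

The measured ratio `mse_10 / mse_160` lands close to 16, consistent with the $1/m$ rate in Lemma 7.5.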

Lemma 7.6.

Assume that (P1) and (P2) hold. Then for any $k=0,1,\ldots,K$,

𝔼[Y~tk22]6γσ2ξσ2+3d.\displaystyle\mathbb{E}\left[\|\widetilde{Y}_{t_{k}}\|^{2}_{2}\right]\leq\frac{6\gamma_{\sigma}^{2}}{\xi_{\sigma}^{2}}+3d.
Proof.

Define $\Theta_{k,t}:=\widetilde{Y}_{t_{k}}+\sigma(t-t_{k})\tilde{b}_{m}(\widetilde{Y}_{t_{k}},t_{k})$, so that $\widetilde{Y}_{t}=\Theta_{k,t}+\sqrt{\sigma}(B_{t}-B_{t_{k}})$ for $t_{k}\leq t\leq t_{k+1}$ with $k=0,1,\ldots,K-1$. By (P1) and (P2), it follows that for all $x\in\mathbb{R}^{d}$ and $t\in[0,1]$,

b(x,t)22γσ2ξσ2,b~m(x,t)22γσ2ξσ2.\displaystyle\|b(x,t)\|_{2}^{2}\leq\frac{\gamma_{\sigma}^{2}}{\xi_{\sigma}^{2}},\leavevmode\nobreak\ \leavevmode\nobreak\ \|\tilde{b}_{m}(x,t)\|_{2}^{2}\leq\frac{\gamma_{\sigma}^{2}}{\xi^{2}_{\sigma}}. (A.30)

Then, by (A.30) and s=1/Ks=1/K, we have

Θk,t22\displaystyle\|\Theta_{k,t}\|^{2}_{2} =Y~tk22+σ2(ttk)2b~m(Y~tk,tk)22+2σ(ttk)Y~tk,b~m(Y~tk,tk)\displaystyle=\|\widetilde{Y}_{t_{k}}\|^{2}_{2}+\sigma^{2}(t-t_{k})^{2}\|\tilde{b}_{m}(\widetilde{Y}_{t_{k}},t_{k})\|^{2}_{2}+2\sigma(t-t_{k})\left<\widetilde{Y}_{t_{k}},\tilde{b}_{m}(\widetilde{Y}_{t_{k}},t_{k})\right>
(1+sσ)Y~tk22+γσ2(sσ+s2σ2)ξσ2.\displaystyle\leq(1+s\sigma)\|\widetilde{Y}_{t_{k}}\|^{2}_{2}+\frac{\gamma_{\sigma}^{2}(s\sigma+s^{2}\sigma^{2})}{\xi^{2}_{\sigma}}.

Further, we can get

𝔼[Y~t22|Y~tk]\displaystyle\mathbb{E}\left[\|\left.\widetilde{Y}_{t}\|^{2}_{2}\right|\widetilde{Y}_{t_{k}}\right] =𝔼[Θk,t22|Y~tk]+σ(ttk)d\displaystyle=\mathbb{E}\left[\|\left.\Theta_{k,t}\|^{2}_{2}\right|\widetilde{Y}_{t_{k}}\right]+\sigma(t-t_{k})d
(1+σs)Y~tk22+(sσ+s2σ2)γσ2ξσ2+σsd.\displaystyle\leq(1+\sigma s)\|\widetilde{Y}_{t_{k}}\|^{2}_{2}+\frac{(s\sigma+s^{2}\sigma^{2})\gamma_{\sigma}^{2}}{\xi^{2}_{\sigma}}+\sigma sd.

Therefore,

𝔼[Y~tk+122](1+sσ)𝔼[Y~tk22]+(sσ+s2σ2)γσ2ξσ2+σsd.\displaystyle\mathbb{E}\left[\|\widetilde{Y}_{t_{k+1}}\|^{2}_{2}\right]\leq(1+s\sigma)\mathbb{E}\left[\|\widetilde{Y}_{t_{k}}\|^{2}_{2}\right]+\frac{(s\sigma+s^{2}\sigma^{2})\gamma_{\sigma}^{2}}{\xi^{2}_{\sigma}}+\sigma sd.

Since $\widetilde{Y}_{t_{0}}=0$, by induction we have

𝔼[Y~tk+122]e(k+1)sσ(d+(1+sσ)γσ2ξσ2)e(d+2γσ2ξσ2)6γσ2ξσ2+3d.\displaystyle\mathbb{E}\left[\|\widetilde{Y}_{t_{k+1}}\|^{2}_{2}\right]\leq e^{(k+1)s\sigma}\left(d+\frac{(1+s\sigma)\gamma_{\sigma}^{2}}{\xi^{2}_{\sigma}}\right)\leq e\left(d+\frac{2\gamma_{\sigma}^{2}}{\xi^{2}_{\sigma}}\right)\leq\frac{6\gamma_{\sigma}^{2}}{\xi^{2}_{\sigma}}+3d.

7.8. Proof of Theorem 3.7

Proof.

From the definitions of Y~tk\widetilde{Y}_{t_{k}} and XtkX_{t_{k}}, we have

Y~tkXtk22\displaystyle\|\widetilde{Y}_{t_{k}}-X_{t_{k}}\|_{2}^{2} Y~tk1Xtk122+(tk1tkσb(Xu,u)b~m(Y~tk1,tk1)2du)2\displaystyle\leq\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}+\left(\int_{t_{k-1}}^{t_{k}}\sigma\|b(X_{u},u)-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}\mathrm{d}u\right)^{2}
+2σY~tk1Xtk12(tk1tkb(Xu,u)b~m(Y~tk1,tk1)2du)\displaystyle\qquad+2\sigma\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}\left(\int_{t_{k-1}}^{t_{k}}\|b(X_{u},u)-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}\mathrm{d}u\right)
(1+s)Y~tk1Xtk122+(1+s)tk1tkb(Xu,u)b~m(Y~tk1,tk1)22du\displaystyle\leq(1+s)\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}+(1+s)\int_{t_{k-1}}^{t_{k}}\|b(X_{u},u)-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2}\mathrm{d}u
(1+s)Y~tk1Xtk122+2(1+s)tk1tkb(Xu,u)b(Y~tk1,tk1)22du\displaystyle\leq(1+s)\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}+2(1+s)\int_{t_{k-1}}^{t_{k}}\|b(X_{u},u)-b(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2}\mathrm{d}u
+2s(1+s)b(Y~tk1,tk1)b~m(Y~tk1,tk1)22\displaystyle\qquad+2s(1+s)\|b(\widetilde{Y}_{t_{k-1}},t_{k-1})-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2}
(1+s)Y~tk1Xtk122+4C2,σ2(1+s)tk1tk[XuY~tk122+|utk1|]du\displaystyle\leq(1+s)\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}+4C_{2,\sigma}^{2}(1+s)\int_{t_{k-1}}^{t_{k}}\left[\|X_{u}-\widetilde{Y}_{t_{k-1}}\|_{2}^{2}+|u-t_{k-1}|\right]\mathrm{d}u
+2s(1+s)b(Y~tk1,tk1)b~m(Y~tk1,tk1)22\displaystyle\qquad+2s(1+s)\|b(\widetilde{Y}_{t_{k-1}},t_{k-1})-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2}
(1+s)Y~tk1Xtk122+8C2,σ2(1+s)tk1tkXuXtk122du\displaystyle\leq(1+s)\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}+8C_{2,\sigma}^{2}(1+s)\int_{t_{k-1}}^{t_{k}}\|X_{u}-X_{t_{k-1}}\|_{2}^{2}\mathrm{d}u
+8C2,σ2s(1+s)Xtk1Y~tk122+4C2,σ2(1+s)s2\displaystyle\qquad+8C_{2,\sigma}^{2}s(1+s)\|X_{t_{k-1}}-\widetilde{Y}_{t_{k-1}}\|_{2}^{2}+4C_{2,\sigma}^{2}(1+s)s^{2}
+2s(1+s)b(Y~tk1,tk1)b~m(Y~tk1,tk1)22\displaystyle\qquad+2s(1+s)\|b(\widetilde{Y}_{t_{k-1}},t_{k-1})-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2}
(1+s+8C2,σ2(s+s2))Y~tk1Xtk122+8C2,σ2(1+s)tk1tkXuXtk122du\displaystyle\leq(1+s+8C_{2,\sigma}^{2}(s+s^{2}))\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}+8C_{2,\sigma}^{2}(1+s)\int_{t_{k-1}}^{t_{k}}\|X_{u}-X_{t_{k-1}}\|_{2}^{2}\mathrm{d}u
+4C2,σ2(1+s)s2+2s(1+s)b(Y~tk1,tk1)b~m(Y~tk1,tk1)22,\displaystyle\qquad+4C_{2,\sigma}^{2}(1+s)s^{2}+2s(1+s)\|b(\widetilde{Y}_{t_{k-1}},t_{k-1})-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2},

where the second inequality holds due to 2acsa2+c2/s2ac\leq sa^{2}+c^{2}/s for s=1/Ks=1/K and the fourth inequality holds by condition (C3). Then, we obtain

𝔼[Y~tkXtk22]\displaystyle\mathbb{E}\left[\|\widetilde{Y}_{t_{k}}-X_{t_{k}}\|_{2}^{2}\right]
(1+s+8C2,σ2(s+s2))𝔼[Y~tk1Xtk122]+8C2,σ2(1+s)tk1tk𝔼[XuXtk122]du\displaystyle\leq(1+s+8C_{2,\sigma}^{2}(s+s^{2}))\mathbb{E}\left[\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}\right]+8C_{2,\sigma}^{2}(1+s)\int_{t_{k-1}}^{t_{k}}\mathbb{E}\left[\|X_{u}-X_{t_{k-1}}\|_{2}^{2}\right]\mathrm{d}u
+4C2,σ2(s2+s3)+2s(1+s)𝔼[b(Y~tk1,tk1)b~m(Y~tk1,tk1)22]\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ +4C_{2,\sigma}^{2}(s^{2}+s^{3})+2s(1+s)\mathbb{E}\left[\|b(\widetilde{Y}_{t_{k-1}},t_{k-1})-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2}\right]
(1+s+8C2,σ2(s+s2))𝔼[Y~tk1Xtk122]+H(s,σ)+4C2,σ2(s2+s3)\displaystyle\leq(1+s+8C_{2,\sigma}^{2}(s+s^{2}))\mathbb{E}\left[\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}\right]+H(s,\sigma)+4C_{2,\sigma}^{2}(s^{2}+s^{3})
+2s(1+s)𝔼[b(Y~tk1,tk1)b~m(Y~tk1,tk1)22]\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ +2s(1+s)\mathbb{E}\left[\|b(\widetilde{Y}_{t_{k-1}},t_{k-1})-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2}\right]
(1+s+8C2,σ2(s+s2))𝔼[Y~tk1Xtk122]+H(s,σ)+8s(1+s)dmC2,σ,\displaystyle\leq(1+s+8C_{2,\sigma}^{2}(s+s^{2}))\mathbb{E}\left[\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}\right]+H(s,\sigma)+\frac{8s(1+s)d}{m}C^{\star}_{2,\sigma},

where $H(s,\sigma):=16C_{2,\sigma}^{2}(s^{2}+s^{3})\left\{1+C_{0,\sigma}\exp(2\sqrt{C_{0,\sigma}}+1)\right\}+4C^{2}_{2,\sigma}(s^{2}+s^{3})$, the second inequality follows from Lemma 7.4, and the last inequality holds by Lemma 7.5. Since $\widetilde{Y}_{t_{0}}=X_{t_{0}}=0$, we can conclude that there exists a constant $C_{\sigma}>0$ such that

𝔼[Y~tKXtK22]\displaystyle\mathbb{E}\left[\|\widetilde{Y}_{t_{K}}-X_{t_{K}}\|_{2}^{2}\right] (1+s+8C2,σ2(s+s2))K1s+8C2,σ2(s+s2)[H(s,σ)+8s(1+s)dmC2,σ]\displaystyle\leq\frac{(1+s+8C_{2,\sigma}^{2}(s+s^{2}))^{K}-1}{s+8C_{2,\sigma}^{2}(s+s^{2})}\left[H(s,\sigma)+\frac{8s(1+s)d}{m}C^{\star}_{2,\sigma}\right]
Cσ(sC3,σ+16dmC2,σ),\displaystyle\leq C_{\sigma}\left(sC^{\star}_{3,\sigma}+\frac{16d}{m}C^{\star}_{2,\sigma}\right),

where

Cσ:=exp(1+16C2,σ2),C3,σ:=40C2,σ2+32C2,σ2C0,σexp(2C0,σ+1).\displaystyle C_{\sigma}:=\exp(1+16C_{2,\sigma}^{2}),\quad C^{\star}_{3,\sigma}:=40C^{2}_{2,\sigma}+32C^{2}_{2,\sigma}C_{0,\sigma}\exp(2\sqrt{C_{0,\sigma}}+1).

Therefore,

W22(Law(Y~tK),μσ)Cσ(sC3,σ+16dmC2,σ).\displaystyle W^{2}_{2}(\text{Law}(\widetilde{Y}_{t_{K}}),\mu_{\sigma})\leq C_{\sigma}\left(sC^{\star}_{3,\sigma}+\frac{16d}{m}C^{\star}_{2,\sigma}\right).

This completes the proof. ∎

References

  • Bakry et al. (2008) Bakry D, Cattiaux P, Guillin A (2008) Rate of convergence for ergodic continuous markov processes: Lyapunov versus poincaré. Journal of Functional Analysis 254(3):727–759
  • Bakry et al. (2014) Bakry D, Gentil I, Ledoux M (2014) Analysis and geometry of Markov diffusion operators, vol 103. Springer
  • Barkhagen et al. (2021) Barkhagen M, Chau NH, Moulines É, Rásonyi M, Sabanis S, Zhang Y (2021) On stochastic gradient langevin dynamics with dependent data streams in the logconcave case. Bernoulli 27(1):1–33
  • Bernton et al. (2019) Bernton E, Heng J, Doucet A, Jacob PE (2019) Schrödinger bridge samplers. arXiv preprint arXiv:191213170
  • Bou-Rabee et al. (2020) Bou-Rabee N, Eberle A, Zimmer R (2020) Coupling and convergence for hamiltonian monte carlo. The Annals of applied probability 30(3):1209–1250
  • Carrillo et al. (2021) Carrillo JA, Jin S, Li L, Zhu Y (2021) A consensus-based global optimization method for high dimensional machine learning problems. ESAIM: Control, Optimisation and Calculus of Variations 27:S5
  • Chau et al. (2021) Chau NH, Moulines É, Rásonyi M, Sabanis S, Zhang Y (2021) On stochastic gradient langevin dynamics with dependent data streams: The fully non-convex case. SIAM Journal on Mathematics of Data Science 3(3):959–986
  • Chen et al. (2021) Chen Y, Georgiou TT, Pavon M (2021) Stochastic control liaisons: Richard sinkhorn meets gaspard monge on a schrödinger bridge. SIAM Review 63(2):249–313
  • Cheng and Bartlett (2018) Cheng X, Bartlett P (2018) Convergence of langevin mcmc in kl-divergence. Proceedings of Machine Learning Research, Volume 83: Algorithmic Learning Theory pp 186–211
  • Cheng et al. (2018) Cheng X, Chatterji NS, Bartlett PL, Jordan MI (2018) Underdamped langevin mcmc: A non-asymptotic analysis. In: Conference on learning theory, PMLR, pp 300–323
  • Chiang et al. (1987) Chiang TS, Hwang CR, Sheu SJ (1987) Diffusion for global optimization in r^n. SIAM Journal on Control and Optimization 25(3):737–753
  • Dai Pra (1991) Dai Pra P (1991) A stochastic control approach to reciprocal diffusion processes. Applied mathematics and Optimization 23(1):313–329
  • Dalalyan (2017a) Dalalyan AS (2017a) Further and stronger analogy between sampling and optimization: Langevin monte carlo and gradient descent. pp 678–689
  • Dalalyan (2017b) Dalalyan AS (2017b) Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(3):651–676
  • Dalalyan and Karagulyan (2019) Dalalyan AS, Karagulyan AG (2019) User-friendly guarantees for the langevin monte carlo with inaccurate gradient. Stochastic Processes and their Applications 129(12):5278–5311
  • Dupuis and Ellis (2011) Dupuis P, Ellis RS (2011) A weak convergence approach to the theory of large deviations. John Wiley & Sons
  • Durmus and Moulines (2016) Durmus A, Moulines E (2016) Sampling from a strongly log-concave distribution with the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559
  • Durmus and Moulines (2017) Durmus A, Moulines E (2017) Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability 27(3):1551–1587
  • Durmus and Moulines (2019) Durmus A, Moulines E (2019) High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli 25(4A):2854–2882
  • Eldan et al. (2020) Eldan R, Lehec J, Shenfeld Y (2020) Stability of the logarithmic Sobolev inequality via the Föllmer process. In: Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, Institut Henri Poincaré, vol 56, pp 2253–2269
  • Föllmer (1985) Föllmer H (1985) An entropy approach to the time reversal of diffusion processes. In: Stochastic Differential Systems Filtering and Control, Springer, pp 156–163
  • Föllmer (1986) Föllmer H (1986) Time reversal on Wiener space. In: Stochastic processes–mathematics and physics, Springer, pp 119–129
  • Föllmer (1988) Föllmer H (1988) Random fields and diffusion processes. In: École d’Été de Probabilités de Saint-Flour XV–XVII, 1985–87, Springer, pp 101–203
  • Garbuno-Inigo et al. (2020) Garbuno-Inigo A, Hoffmann F, Li W, Stuart AM (2020) Interacting Langevin diffusions: Gradient structure and ensemble Kalman sampler. SIAM Journal on Applied Dynamical Systems 19(1):412–441
  • Hale (2010) Hale JK (2010) Asymptotic behavior of dissipative systems. Mathematical Surveys and Monographs, vol 25, American Mathematical Society
  • Holley et al. (1989) Holley RA, Kusuoka S, Stroock DW (1989) Asymptotics of the spectral gap with applications to the theory of simulated annealing. Journal of Functional Analysis 83(2):333–347
  • Huang et al. (2021) Huang J, Jiao Y, Kang L, Liao X, Liu J, Liu Y (2021) Schrödinger-Föllmer sampler: Sampling without ergodicity. arXiv preprint arXiv:2106.10880
  • Hwang (1980) Hwang CR (1980) Laplace’s method revisited: weak convergence of probability measures. The Annals of Probability pp 1177–1182
  • Hwang (1981) Hwang CR (1981) A generalization of Laplace’s method. Proceedings of the American Mathematical Society 82(3):446–451
  • Iacus and Yoshida (2018) Iacus SM, Yoshida N (2018) Simulation and inference for stochastic processes with YUIMA. Springer
  • Jamison (1975) Jamison B (1975) The Markov processes of Schrödinger. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 32(4):323–331
  • Jiao et al. (2021) Jiao Y, Kang L, Liu Y, Zhou Y (2021) Convergence analysis of Schrödinger-Föllmer sampler without convexity. arXiv preprint arXiv:2107.04766
  • Landsman and Nešlehová (2008) Landsman Z, Nešlehová J (2008) Stein’s lemma for elliptical random vectors. Journal of Multivariate Analysis 99(5):912–927
  • Lehec (2013) Lehec J (2013) Representation formula for the entropy and functional inequalities. In: Annales de l’IHP Probabilités et statistiques, vol 49, pp 885–899
  • Léonard (2014) Léonard C (2014) A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete & Continuous Dynamical Systems-A 34(4):1533
  • Ma et al. (2019) Ma YA, Chen Y, Jin C, Flammarion N, Jordan MI (2019) Sampling can be faster than optimization. Proceedings of the National Academy of Sciences 116(42):20881–20885
  • Márquez (1997) Márquez D (1997) Convergence rates for annealing diffusion processes. The Annals of Applied Probability pp 1118–1139
  • Menz and Schlichting (2014) Menz G, Schlichting A (2014) Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. Annals of Probability 42(5):1809–1884
  • Mou et al. (2022) Mou W, Flammarion N, Wainwright MJ, Bartlett PL (2022) Improved bounds for discretization of Langevin diffusions: Near-optimal rates without convexity. Bernoulli 28(3):1577–1601
  • Pavliotis (2014) Pavliotis GA (2014) Stochastic processes and applications: diffusion processes, the Fokker-Planck and Langevin equations, vol 60. Springer
  • Peyré and Cuturi (2019) Peyré G, Cuturi M (2019) Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning 11(5-6):355–607
  • Raginsky et al. (2017) Raginsky M, Rakhlin A, Telgarsky M (2017) Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In: Conference on Learning Theory, PMLR, pp 1674–1703
  • Revuz and Yor (2013) Revuz D, Yor M (2013) Continuous martingales and Brownian motion, vol 293. Springer Science & Business Media
  • Ruzayqat et al. (2022) Ruzayqat H, Beskos A, Crisan D, Jasra A, Kantas N (2022) Unbiased estimation using a class of diffusion processes. arXiv preprint arXiv:2203.03013
  • Schrödinger (1932) Schrödinger E (1932) Sur la théorie relativiste de l’électron et l’interprétation de la mécanique quantique. In: Annales de l’Institut Henri Poincaré, vol 2, pp 269–310
  • Sinkhorn (1964) Sinkhorn R (1964) A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics 35(2):876–879
  • Stein (1972) Stein C (1972) A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In: Proceedings of the sixth Berkeley symposium on mathematical statistics and probability, volume 2: Probability theory, University of California Press, pp 583–602
  • Stein (1986) Stein C (1986) Approximate computation of expectations. Institute of Mathematical Statistics
  • Tzen and Raginsky (2019) Tzen B, Raginsky M (2019) Theoretical guarantees for sampling and inference in generative models with latent diffusions. In: Conference on Learning Theory, PMLR, pp 3084–3114
  • Varadhan (1966) Varadhan SS (1966) Asymptotic probabilities and differential equations. Communications on Pure and Applied Mathematics 19(3):261–286
  • Vargas et al. (2021) Vargas F, Ovsianas A, Fernandes D, Girolami M, Lawrence N, Nüsken N (2021) Bayesian learning via neural Schrödinger-Föllmer flows. arXiv preprint arXiv:2111.10510
  • Wang (2009) Wang FY (2009) Log-Sobolev inequalities: different roles of Ric and Hess. The Annals of Probability 37(4):1587–1604
  • Wang et al. (2021) Wang G, Jiao Y, Xu Q, Wang Y, Yang C (2021) Deep generative learning via Schrödinger bridge. In: International Conference on Machine Learning, PMLR, pp 10794–10804
  • Wang and Li (2022) Wang Y, Li W (2022) Accelerated information gradient flow. Journal of Scientific Computing 90(1):1–47
  • Zhang and Chen (2021) Zhang Q, Chen Y (2021) Path integral sampler: a stochastic control approach for sampling. arXiv preprint arXiv:2111.15141
  • Zhang et al. (2019) Zhang Y, Akyildiz ÖD, Damoulas T, Sabanis S (2019) Nonasymptotic estimates for stochastic gradient Langevin dynamics under local conditions in nonconvex optimization. arXiv preprint arXiv:1910.02008