
Global optimization via Schrödinger-Föllmer diffusion

Yin Dai, Yuling Jiao, Lican Kang, Xiliang Lu, and Jerry Zhijian Yang

Yin Dai: School of Mathematics and Statistics, Wuhan University, Wuhan 430072, P.R. China ([email protected])
Yuling Jiao: School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China ([email protected])
Lican Kang: Center for Quantitative Medicine, Duke-NUS Medical School, Singapore ([email protected])
Xiliang Lu: School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China ([email protected])
Jerry Zhijian Yang: School of Mathematics and Statistics, and Hubei Key Laboratory of Computational Science, Wuhan University, Wuhan 430072, P.R. China ([email protected])
Abstract.

We study the problem of finding global minimizers of $V:\mathbb{R}^{d}\rightarrow\mathbb{R}$ approximately via sampling from a probability distribution $\mu_{\sigma}$ with density $p_{\sigma}(x)=\frac{\exp(-V(x)/\sigma)}{\int_{\mathbb{R}^{d}}\exp(-V(y)/\sigma)\,dy}$ with respect to the Lebesgue measure, for $\sigma\in(0,1]$ small enough. We analyze a sampler based on the Euler-Maruyama discretization of the Schrödinger-Föllmer diffusion process with stochastic approximation, under appropriate assumptions on the step size $s$ and the potential $V$. We prove that the output of the proposed sampler is an approximate global minimizer of $V$ with high probability, at the cost of sampling $\mathcal{O}(d^{3})$ standard normal random variables. Numerical studies illustrate the effectiveness of the proposed method and its superiority over the Langevin method.

Key words and phrases:
Global optimization, Schrödinger-Föllmer diffusion, Stochastic approximation

1. Introduction

In this paper we study the challenging problem of finding the global minimizers of a non-convex smooth function $V:\mathbb{R}^{d}\rightarrow\mathbb{R}$. Suppose $N:=\{x_{1}^{*},\ldots,x_{\kappa}^{*}\}\subset B_{R}$ is the set of the global minima of $V$ with finite cardinality, i.e.,

\[ x_{i}^{*}\in\operatorname*{argmin}_{x}V(x),\quad\text{for any }i=1,\ldots,\kappa, \tag{1} \]

where $B_{R}$ denotes the ball centered at the origin with radius $R>0$. More precisely, $\mathscr{L}(V(x)<V(x_{i}^{*})-\varepsilon)=0$ and $\mathscr{L}(V(x)<V(x_{i}^{*})+\varepsilon)>0$ for any $\varepsilon>0$, where $\mathscr{L}$ denotes the Lebesgue measure on $\mathbb{R}^{d}$. Without loss of generality, we may assume $V(x)=\|x\|_{2}^{2}/2$ outside the ball $B_{R}$; this does not change the global minimizers of $V$. For any given $\sigma\in(0,1]$, we define a constant $C_{\sigma}$ and a probability density function $p_{\sigma}(x)$ on $\mathbb{R}^{d}$ as

\[ p_{\sigma}(x):=\frac{\exp(-V(x)/\sigma)}{C_{\sigma}},\quad\text{for}\ C_{\sigma}:=\int_{\mathbb{R}^{d}}\exp(-V(x)/\sigma)\,dx<\infty. \]

Let $\mu_{\sigma}$ be the probability distribution corresponding to the density function $p_{\sigma}$. If the function $V$ is twice continuously differentiable, then the measure $\mu_{\sigma}$ converges weakly, as $\sigma$ goes to $0$, to a probability measure with weight proportional to $\left(\det\nabla^{2}V(x_{i}^{*})\right)^{-\frac{1}{2}}\big/\sum_{j=1}^{\kappa}\left(\det\nabla^{2}V(x_{j}^{*})\right)^{-\frac{1}{2}}$ at $x_{i}^{*}$, i.e.,

\[ \lim_{\sigma\rightarrow 0}\mu_{\sigma}=\frac{\sum_{i=1}^{\kappa}\left(\det\nabla^{2}V(x_{i}^{*})\right)^{-\frac{1}{2}}\delta_{x_{i}^{*}}}{\sum_{j=1}^{\kappa}\left(\det\nabla^{2}V(x_{j}^{*})\right)^{-\frac{1}{2}}}. \]

Therefore, solving the optimization problem (1) can be converted into sampling from the probability measure $\mu_{\sigma}$ for sufficiently small $\sigma$.

An efficient method for sampling from $\mu_{\sigma}$ is based on the overdamped Langevin stochastic differential equation (SDE), which is given by

\[ \mathrm{d}Z_{t}=-\nabla V(Z_{t})\,\mathrm{d}t+\sqrt{2\sigma}\,\mathrm{d}B_{t}, \tag{2} \]

where $(B_{t})_{t\geq 0}$ is a $d$-dimensional Brownian motion. Under certain conditions, the Langevin SDE (2) admits $\mu_{\sigma}$ as its unique invariant measure (Bakry et al., 2008). Hence the Langevin sampler applies the Euler-Maruyama discretization to this process in order to sample from $\mu_{\sigma}$. Convergence properties of the Langevin sampler under the strongly convex potential assumption have been established by Durmus and Moulines (2016); Dalalyan (2017a, b); Cheng and Bartlett (2018); Durmus and Moulines (2019); Dalalyan and Karagulyan (2019). Moreover, the strong convexity assumption can be replaced by other conditions that guarantee a log-Sobolev inequality for the target distribution, including a dissipativity condition on the drift term (Raginsky et al., 2017; Zhang et al., 2019; Mou et al., 2022) and local convexity of the potential outside a ball (Durmus and Moulines, 2017; Ma et al., 2019; Bou-Rabee et al., 2020).
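As an illustration, the Euler-Maruyama discretization of (2) (the unadjusted Langevin algorithm) can be sketched in a few lines of NumPy. This sketch is ours, not the authors' code; for the separable quadratic potential $V(x)=\|x\|_{2}^{2}/2$ the chain's stationary law is approximately $N(0,\sigma\mathbf{I}_{d})$, which gives a simple sanity check.

```python
import numpy as np

def langevin_sampler(grad_V, d, sigma, step=1e-2, n_iter=2000, seed=0):
    """Unadjusted Langevin algorithm: Euler-Maruyama discretization of
    dZ_t = -grad V(Z_t) dt + sqrt(2 sigma) dB_t, started at the origin."""
    rng = np.random.default_rng(seed)
    z = np.zeros(d)
    for _ in range(n_iter):
        z = z - step * grad_V(z) + np.sqrt(2.0 * sigma * step) * rng.standard_normal(d)
    return z
```

The stationary variance of the discretized chain is biased by a term of order of the step size, so smaller steps track $\mu_{\sigma}$ more closely at the cost of slower mixing.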

An alternative to the Langevin sampler is the class of algorithms based on the Schrödinger-Föllmer diffusion process (3). This process has been proposed for sampling and generative modelling (Tzen and Raginsky, 2019; Huang et al., 2021; Jiao et al., 2021; Wang et al., 2021). Subsequently, Ruzayqat et al. (2022) studied unbiased estimation of expectations based on the Schrödinger-Föllmer diffusion process. Vargas et al. (2021) applied this process to Bayesian inference for large datasets, where the posterior is reached in finite time. Zhang and Chen (2021) proposed the Path Integral Sampler (PIS), a generic sampler that generates samples by simulating a target-dependent SDE and can be trained with a free-form network architecture; the PIS is built on the Schrödinger-Föllmer diffusion process to reach the terminal distribution. Different from these existing works, the purpose of this paper is to solve non-convex smooth optimization problems. To that end, we rescale the Schrödinger-Föllmer diffusion process to sample from the target distribution $\mu_{\sigma}$.

To be precise, the Schrödinger-Föllmer diffusion process associated with $\mu_{\sigma}$ is defined as

\[ \mathrm{d}X_{t}=\sigma b\left(X_{t},t\right)\mathrm{d}t+\sqrt{\sigma}\,\mathrm{d}B_{t},\quad t\in[0,1],\quad X_{0}=0, \tag{3} \]

where the drift function is

\[ b(x,t)=\frac{\mathbb{E}_{Z\sim N(0,\mathbf{I}_{d})}[\nabla f_{\sigma}(x+\sqrt{(1-t)\sigma}Z)]}{\mathbb{E}_{Z\sim N(0,\mathbf{I}_{d})}[f_{\sigma}(x+\sqrt{(1-t)\sigma}Z)]}:\mathbb{R}^{d}\times[0,1]\rightarrow\mathbb{R}^{d} \]

with the density ratio $f_{\sigma}(\cdot)=p_{\sigma}(\cdot)/\phi_{\sigma}(\cdot)$, where $\phi_{\sigma}(\cdot)$ is the density of the normal distribution $N(0,\sigma\mathbf{I}_{d})$. According to Léonard (2014) and Eldan et al. (2020), the process $\{X_{t}\}_{t\in[0,1]}$ in (3) was first formulated by Föllmer (Föllmer, 1985, 1986, 1988) when studying the Schrödinger bridge problem (Schrödinger, 1932). The main feature of the Schrödinger-Föllmer diffusion process is that it interpolates between $\delta_{0}$ and $\mu_{\sigma}$ on the time interval $[0,1]$, i.e., $X_{1}\sim\mu_{\sigma}$; see Proposition 2.3. We can therefore solve the optimization problem (1) by sampling from $\mu_{\sigma}$ via the following Euler-Maruyama discretization of (3),

\[ Y_{t_{k+1}}=Y_{t_{k}}+\sigma s\,b\left(Y_{t_{k}},t_{k}\right)+\sqrt{\sigma s}\,\epsilon_{k+1},\quad Y_{t_{0}}=0,\quad k=0,1,\ldots,K-1, \]

where $s=1/K$ is the step size, $t_{k}=ks$, and $\{\epsilon_{k}\}_{k=1}^{K}$ are independent and identically distributed $N(0,\mathbf{I}_{d})$ random vectors. If the expectations in the drift term $b(x,t)$ do not have analytical forms, one can evaluate $b\left(Y_{t_{k}},t_{k}\right)$ approximately by the Monte Carlo method, i.e., one can sample from $\mu_{\sigma}$ according to

\[ \widetilde{Y}_{t_{k+1}}=\widetilde{Y}_{t_{k}}+\sigma s\,\tilde{b}_{m}\left(\widetilde{Y}_{t_{k}},t_{k}\right)+\sqrt{\sigma s}\,\epsilon_{k+1},\quad\widetilde{Y}_{t_{0}}=0,\quad k=0,1,\ldots,K-1, \]

where $\tilde{b}_{m}(\widetilde{Y}_{t_{k}},t_{k}):=\dfrac{\frac{1}{m}\sum_{j=1}^{m}\nabla f_{\sigma}(\widetilde{Y}_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z_{j})}{\frac{1}{m}\sum_{j=1}^{m}f_{\sigma}(\widetilde{Y}_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z_{j})}$ with $Z_{1},\ldots,Z_{m}$ i.i.d. $N(0,\mathbf{I}_{d})$.

The main result of this paper is summarized in the following.

Theorem 1.1.

(Informal) Under condition $(\textbf{A})$, for any $0<\delta\ll 1$, with probability at least $1-\sqrt{\delta}$, $\widetilde{Y}_{t_{K}}$ is a $\tau$-global minimizer of $V$, i.e., $V(\widetilde{Y}_{t_{K}})\leq\tau+\inf V(x)$, provided that the number of iterations satisfies $K\geq\mathcal{O}\left(\frac{d^{2}}{\delta}\right)$, the number of Gaussian samples per iteration satisfies $m\geq\mathcal{O}\left(\frac{d}{\delta}\right)$, and $\sigma\leq\mathcal{O}\left(\frac{\tau}{\log(1/\delta)}\right)$.

The rest of this paper is organized as follows. In Section 2, we give the motivation and the details of the approximation method, i.e., Algorithm 1. In Section 3, we present the theoretical analysis of the proposed method. In Section 4, a numerical example is given to validate the efficiency of the method. We conclude the manuscript in Section 5. Proofs of all propositions and theorems are provided in Appendix 7.

2. Methodology Description

In this section we first provide some background on the Schrödinger-Föllmer diffusion. Then we propose Algorithm 1 to solve the minimization problem (1) based on the Euler-Maruyama discretization scheme of the Schrödinger-Föllmer diffusion.

2.1. Background on Schrödinger-Föllmer diffusion

We first recall the Schrödinger bridge problem, then introduce the Schrödinger-Föllmer diffusion.

2.1.1. Schrödinger bridge problem

Let $\Omega=C([0,1],\mathbb{R}^{d})$ be the space of $\mathbb{R}^{d}$-valued continuous functions on the time interval $[0,1]$. Denote by $Z=(Z_{t})_{t\in[0,1]}$ the canonical process on $\Omega$, where $Z_{t}(\omega)=\omega_{t}$ for $\omega=(\omega_{s})_{s\in[0,1]}\in\Omega$. The canonical $\sigma$-field on $\Omega$ is then $\mathscr{F}=\sigma(Z_{t},t\in[0,1])=\left\{\{\omega:(Z_{t}(\omega))_{t\in[0,1]}\in H\}:H\in\mathcal{B}(\mathbb{R}^{d})\right\}$, where $\mathcal{B}(\mathbb{R}^{d})$ denotes the Borel $\sigma$-algebra of $\mathbb{R}^{d}$. Denote by $\mathcal{P}(\Omega)$ the space of probability measures on the path space $\Omega$, and by $\mathbf{W}^{\mathbf{x}}_{\sigma}\in\mathcal{P}(\Omega)$ the Wiener measure with variance $\sigma$ whose initial marginal is $\delta_{\mathbf{x}}$. The law of the reversible Brownian motion is then defined as $\mathbf{P}_{\sigma}=\int\mathbf{W}_{\sigma}^{\mathbf{x}}\,\mathrm{d}\mathbf{x}$, an unbounded measure on $\Omega$. Observe that the marginal distribution of $\mathbf{P}_{\sigma}$ coincides with the Lebesgue measure $\mathscr{L}$ at each $t$. Schrödinger (1932) studied the problem of finding the most likely random evolution between two probability distributions $\widetilde{\nu},\widetilde{\mu}\in\mathcal{P}(\mathbb{R}^{d})$. This problem is referred to as the Schrödinger bridge problem (SBP). The SBP can be formulated as seeking a probability law on the path space that interpolates between $\widetilde{\nu}$ and $\widetilde{\mu}$ and is closest to the prior law of the Brownian diffusion in relative entropy (Jamison, 1975; Léonard, 2014), i.e., finding a path measure $\mathbf{Q}^{*}\in\mathcal{P}(\Omega)$ with marginals $\mathbf{Q}^{*}_{t}=(Z_{t})_{\#}\mathbf{Q}^{*}=\mathbf{Q}^{*}\circ Z_{t}^{-1}$, $t\in[0,1]$, such that

\[ \mathbf{Q}^{*}\in\arg\min\mathbb{D}_{\mathrm{KL}}(\mathbf{Q}\,\|\,\mathbf{P}_{\sigma}),\quad\text{with}\ \mathbf{Q}_{0}=\widetilde{\nu},\ \mathbf{Q}_{1}=\widetilde{\mu}, \]

where the relative entropy is defined by $\mathbb{D}_{\mathrm{KL}}(\mathbf{Q}\,\|\,\mathbf{P}_{\sigma})=\int\log\left(\frac{d\mathbf{Q}}{d\mathbf{P}_{\sigma}}\right)d\mathbf{Q}$ if $\mathbf{Q}\ll\mathbf{P}_{\sigma}$ (i.e., $\mathbf{Q}$ is absolutely continuous with respect to $\mathbf{P}_{\sigma}$), and $\mathbb{D}_{\mathrm{KL}}(\mathbf{Q}\,\|\,\mathbf{P}_{\sigma})=+\infty$ otherwise. The following theorem characterizes the solution of the SBP.

Theorem 2.1 (Léonard (2014)).

If the measures $\widetilde{\nu},\widetilde{\mu}\ll\mathscr{L}$, then the SBP admits a unique solution $d\mathbf{Q}^{*}=f^{*}(Z_{0})g^{*}(Z_{1})\,d\mathbf{P}_{\sigma}$, where $f^{*}$ and $g^{*}$ are $\mathscr{L}$-measurable non-negative functions satisfying the Schrödinger system

\[ \left\{\begin{array}{l} f^{*}(\mathbf{x})\,\mathbb{E}_{\mathbf{P}_{\sigma}}\left[g^{*}\left(Z_{1}\right)\mid Z_{0}=\mathbf{x}\right]=\dfrac{d\widetilde{\nu}}{d\mathscr{L}}(\mathbf{x}),\quad\mathscr{L}\text{-a.e.},\\ g^{*}(\mathbf{y})\,\mathbb{E}_{\mathbf{P}_{\sigma}}\left[f^{*}\left(Z_{0}\right)\mid Z_{1}=\mathbf{y}\right]=\dfrac{d\widetilde{\mu}}{d\mathscr{L}}(\mathbf{y}),\quad\mathscr{L}\text{-a.e.} \end{array}\right. \]

Furthermore, the pair $(\mathbf{Q}^{*}_{t},\mathbf{v}^{*}_{t})$ with

\[ \mathbf{v}^{*}_{t}(\mathbf{x})=\nabla_{\mathbf{x}}\log\mathbb{E}_{\mathbf{P}_{\sigma}}\left[g^{*}\left(Z_{1}\right)\mid Z_{t}=\mathbf{x}\right] \]

solves the minimum action problem

\[ \min_{\mu_{t},\mathbf{v}_{t}}\int_{0}^{1}\mathbb{E}_{\mathbf{z}\sim\mu_{t}}[\|\mathbf{v}_{t}(\mathbf{z})\|^{2}]\,\mathrm{d}t \]

such that

\[ \left\{\begin{array}{l} \partial_{t}\mu_{t}=-\nabla\cdot(\mu_{t}\mathbf{v}_{t})+\frac{\sigma\Delta\mu_{t}}{2},\quad\text{on }(0,1)\times\mathbb{R}^{d},\\ \mu_{0}=\widetilde{\nu},\quad\mu_{1}=\widetilde{\mu}. \end{array}\right. \]

Let $K_{\sigma}(s,\mathbf{x},t,\mathbf{y})=[2\pi\sigma(t-s)]^{-d/2}\exp\left(-\frac{\|\mathbf{x}-\mathbf{y}\|^{2}}{2\sigma(t-s)}\right)$ be the transition density of the Wiener process with variance $\sigma$, and let $\widetilde{q}(\mathbf{x})$ and $\widetilde{p}(\mathbf{y})$ be the densities of $\widetilde{\nu}$ and $\widetilde{\mu}$, respectively. Denote by

\[ f_{0}(\mathbf{x})=f^{*}(\mathbf{x}),\qquad g_{1}(\mathbf{y})=g^{*}(\mathbf{y}), \]
\[ f_{1}(\mathbf{y})=\mathbb{E}_{\mathbf{P}_{\sigma}}\left[f^{*}\left(Z_{0}\right)\mid Z_{1}=\mathbf{y}\right]=\int K_{\sigma}(0,\mathbf{x},1,\mathbf{y})f_{0}(\mathbf{x})\,d\mathbf{x}, \]
\[ g_{0}(\mathbf{x})=\mathbb{E}_{\mathbf{P}_{\sigma}}\left[g^{*}\left(Z_{1}\right)\mid Z_{0}=\mathbf{x}\right]=\int K_{\sigma}(0,\mathbf{x},1,\mathbf{y})g_{1}(\mathbf{y})\,d\mathbf{y}. \]

Then the Schrödinger system in Theorem 2.1 can also be characterized by

\[ \widetilde{q}(\mathbf{x})=f_{0}(\mathbf{x})g_{0}(\mathbf{x}),\qquad\widetilde{p}(\mathbf{y})=f_{1}(\mathbf{y})g_{1}(\mathbf{y}), \]

with the following forward and backward time harmonic equations (Chen et al., 2021),

\[ \left\{\begin{array}{l} \partial_{t}f_{t}(\mathbf{x})=\frac{\sigma\Delta}{2}f_{t}(\mathbf{x}),\\ \partial_{t}g_{t}(\mathbf{x})=-\frac{\sigma\Delta}{2}g_{t}(\mathbf{x}), \end{array}\right.\quad\text{on }(0,1)\times\mathbb{R}^{d}. \]

Let $q_{t}$ denote the marginal density of $\mathbf{Q}_{t}^{*}$, i.e., $q_{t}(\mathbf{x})=\frac{d\mathbf{Q}_{t}^{*}}{d\mathscr{L}}(\mathbf{x})$; it can be represented as the product of $g_{t}$ and $f_{t}$ (Chen et al., 2021). Let $\mathcal{V}$ consist of admissible Markov controls with finite energy. Then the vector field

\[ \mathbf{v}^{*}_{t}=\sigma\nabla_{\mathbf{x}}\log g_{t}(\mathbf{x})=\sigma\nabla_{\mathbf{x}}\log\int K_{\sigma}(t,\mathbf{x},1,\mathbf{y})g_{1}(\mathbf{y})\,\mathrm{d}\mathbf{y} \tag{4} \]

solves the following stochastic control problem.

Theorem 2.2 (Dai Pra (1991)).
\[ \mathbf{v}^{*}_{t}(\mathbf{x})\in\arg\min_{\mathbf{v}\in\mathcal{V}}\mathbb{E}\left[\int_{0}^{1}\frac{1}{2\sigma}\|\mathbf{v}_{t}\|^{2}\,\mathrm{d}t\right] \]

such that

\[ \left\{\begin{array}{l} \mathrm{d}\mathbf{x}_{t}=\mathbf{v}_{t}\,\mathrm{d}t+\sqrt{\sigma}\,\mathrm{d}B_{t},\\ \mathbf{x}_{0}\sim\widetilde{q}(\mathbf{x}),\quad\mathbf{x}_{1}\sim\widetilde{p}(\mathbf{x}). \end{array}\right. \tag{5} \]

According to Theorem 2.2, the dynamics determined by the SDE in (5) with the time-varying drift $\mathbf{v}^{*}_{t}$ in (4) drive particles sampled from the initial distribution $\widetilde{\nu}$ to particles distributed according to the target distribution $\widetilde{\mu}$ over the unit time interval. This property is exactly what we need for designing samplers: we can sample from the target distribution $\widetilde{\mu}$ by pushing forward a simple reference distribution $\widetilde{\nu}$. In particular, if we take the initial distribution $\widetilde{\nu}$ to be $\delta_{0}$, the degenerate distribution at $0$, then the Schrödinger-Föllmer diffusion process (7) defined below solves (5), i.e., it transports $\delta_{0}$ to the target distribution.

2.1.2. Schrödinger-Föllmer diffusion process

From now on, without loss of generality, we assume that the minimum value of $V$ is $0$, i.e., $V(x_{i}^{*})=0$ for $i=1,\ldots,\kappa$; otherwise we replace $V$ by $V-\min_{x}V(x)$. Since $\mu_{\sigma}$ is absolutely continuous with respect to the $d$-dimensional Gaussian distribution $N(0,\sigma\mathbf{I}_{d})$, we denote the Radon-Nikodym derivative of $\mu_{\sigma}$ with respect to $N(0,\sigma\mathbf{I}_{d})$ by

\[ f_{\sigma}(x):=\frac{d\mu_{\sigma}}{dN(0,\sigma\mathbf{I}_{d})}(x)=\frac{p_{\sigma}(x)}{\phi_{\sigma}(x)},\quad x\in\mathbb{R}^{d}. \tag{6} \]

Let $Q_{t}$ be the heat semigroup defined by

\[ Q_{t}f_{\sigma}(x):=\mathbb{E}_{Z\sim N(0,\mathbf{I}_{d})}[f_{\sigma}(x+\sqrt{t\sigma}Z)],\quad t\in[0,1]. \]

The Schrödinger-Föllmer diffusion process $\{X_{t}\}_{t\in[0,1]}$ (Föllmer, 1985, 1986, 1988) is defined as

\[ \mathrm{d}X_{t}=\sigma b(X_{t},t)\,\mathrm{d}t+\sqrt{\sigma}\,\mathrm{d}B_{t},\quad X_{0}=0,\quad t\in[0,1], \tag{7} \]

where $b(x,t):\mathbb{R}^{d}\times[0,1]\rightarrow\mathbb{R}^{d}$ is the drift term given by

\[ b(x,t)=\nabla\log Q_{1-t}f_{\sigma}(x). \]

The process $\{X_{t}\}_{t\in[0,1]}$ defined in (7) is a solution to (5) with $\tilde{\nu}=\delta_{0}$, $\tilde{\mu}=\mu_{\sigma}$, and $\mathbf{v}_{t}(x)=b(x,t)$ (Dai Pra, 1991; Lehec, 2013; Eldan et al., 2020). Let

\[ \hat{f}_{\sigma}(x):=C_{\sigma}(2\pi\sigma)^{-\frac{d}{2}}f_{\sigma}(x). \]

The drift term $b(x,t)$ is scale-invariant with respect to $f_{\sigma}$ in the sense that $b(x,t)=\nabla\log Q_{1-t}Cf_{\sigma}(x)$ for any $C>0$. Therefore, the Schrödinger-Föllmer diffusion can be used for sampling from an unnormalized distribution $\mu_{\sigma}$; that is, the normalizing constant $C_{\sigma}$ of $\mu_{\sigma}$ need not be known. Since $\mu_{\sigma}(dx)=\exp(-V(x)/\sigma)\,dx/C_{\sigma}$ with normalizing constant $C_{\sigma}$, we have $f_{\sigma}(x)=\frac{\left(\sqrt{2\pi\sigma}\right)^{d}}{C_{\sigma}}\exp\left(-\frac{V(x)}{\sigma}+\frac{\|x\|_{2}^{2}}{2\sigma}\right)$ and $\hat{f}_{\sigma}(x)=\exp\left(-\frac{V(x)}{\sigma}+\frac{\|x\|^{2}_{2}}{2\sigma}\right)$.
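In practice, evaluating $\hat{f}_{\sigma}(x)=\exp(-V(x)/\sigma+\|x\|_{2}^{2}/(2\sigma))$ directly overflows for quite moderate $\|x\|_{2}/\sigma$. Since the drift is scale-invariant, a Monte Carlo estimate of $b(x,t)$ can be computed entirely in log-space with self-normalized (softmax-style) weights. The sketch below is our illustration of the gradient-free Stein form of the drift, not the authors' implementation; `V` is assumed to act on an $(m,d)$ array of points.

```python
import numpy as np

def drift_mc(y, t, V, sigma, m=2000, rng=None):
    """Monte Carlo drift estimate in Stein form, E[Z f_hat]/(c E[f_hat]),
    evaluated in log-space: exponentiating f_hat directly would overflow.
    V maps an (m, d) array of points to an (m,) array; requires t < 1."""
    rng = np.random.default_rng() if rng is None else rng
    c = np.sqrt((1.0 - t) * sigma)
    Z = rng.standard_normal((m, len(y)))
    X = y + c * Z
    logf = -V(X) / sigma + 0.5 * (X ** 2).sum(axis=1) / sigma  # log f_hat
    w = np.exp(logf - logf.max())   # self-normalized weights: the scale
    w /= w.sum()                    # (normalizing constant) cancels exactly
    return (w[:, None] * Z).sum(axis=0) / c
```

For the toy quadratic $V(x)=\|x-a\|_{2}^{2}/2$ (which ignores the outside-ball convention but makes the drift exactly $a/\sigma$ at every $(x,t)$), the estimate can be checked against the closed form.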

To ensure that the SDE (7) admits a unique strong solution, we assume that

  • $(\textbf{A})$ $V(x)$ is twice continuously differentiable on $\mathbb{R}^{d}$, and $V(x)=\|x\|_{2}^{2}/2$ outside a ball $B_{R}$.

As mentioned earlier, Assumption $(\mathbf{A})$ can be ensured by smoothing, and it does not change the global minimizers of $V$. Under Assumption $(\mathbf{A})$, we have the following two properties, which in turn imply that the SDE (7) admits a unique strong solution.

  • $(\textbf{P1})$ For each $\sigma\in(0,1]$, $\hat{f}_{\sigma}$ and $\nabla\hat{f}_{\sigma}$ are Lipschitz continuous with constant $\gamma_{\sigma}>0$, where
\[ \gamma_{\sigma}:=\max_{\|x\|_{2}\leq R}\exp\left(-\frac{V(x)}{\sigma}+\frac{\|x\|_{2}^{2}}{2\sigma}\right)\left\{\left\|\frac{x}{\sigma}-\frac{\nabla V(x)}{\sigma}\right\|_{2}^{2}+\left\|\frac{\mathbf{I}_{d}}{\sigma}-\frac{\nabla^{2}V(x)}{\sigma}\right\|_{2}\right\}=\left\{\left(M_{2,R}/\sigma\right)^{2}+M_{3,R}/\sigma\right\}\exp\left(M_{1,R}/\sigma\right), \]
and
\[ M_{1,R}:=\max_{\|x\|_{2}\leq R}\left\{-V(x)+\frac{\|x\|^{2}_{2}}{2}\right\},\quad M_{2,R}:=\max_{\|x\|_{2}\leq R}\left\|x-\nabla V(x)\right\|_{2},\quad M_{3,R}:=\max_{\|x\|_{2}\leq R}\left\|\mathbf{I}_{d}-\nabla^{2}V(x)\right\|_{2}. \tag{8} \]
  • $(\textbf{P2})$ For each $\sigma\in(0,1]$, there exists a constant $\xi_{\sigma}>0$ such that $\hat{f}_{\sigma}\geq\xi_{\sigma}$, where
\[ \xi_{\sigma}:=\exp\left(m_{1,R}/\sigma\right)\quad\text{with}\quad m_{1,R}:=\min_{\|x\|_{2}\leq R}\left\{-V(x)+\frac{\|x\|_{2}^{2}}{2}\right\}. \tag{9} \]

Properties (P1)-(P2) are proved in Section 7.1 of the Appendix. We note that (P1)-(P2) are used as assumptions in Lehec (2013) and Tzen and Raginsky (2019).
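For intuition about the size of these constants, they can be evaluated numerically for a toy example (ours, not from the paper): the one-dimensional double well $V(x)=(x^{2}-1)^{2}$ on $B_{R}$ with $R=2$, where $M_{1,R}$, $M_{2,R}$, $M_{3,R}$ and $m_{1,R}$ reduce to simple maxima and minima over a grid.

```python
import numpy as np

# constants (8)-(9) for the toy 1-d potential V(x) = (x^2 - 1)^2 on B_R, R = 2
R, sigma = 2.0, 0.5
x = np.linspace(-R, R, 200001)
V = (x ** 2 - 1.0) ** 2
dV = 4.0 * x * (x ** 2 - 1.0)     # V'(x)
d2V = 12.0 * x ** 2 - 4.0         # V''(x)

M1 = np.max(-V + 0.5 * x ** 2)    # M_{1,R}
M2 = np.max(np.abs(x - dV))       # M_{2,R}  (|.| is the norm in d = 1)
M3 = np.max(np.abs(1.0 - d2V))    # M_{3,R}
m1 = np.min(-V + 0.5 * x ** 2)    # m_{1,R}

gamma = ((M2 / sigma) ** 2 + M3 / sigma) * np.exp(M1 / sigma)   # gamma_sigma
xi = np.exp(m1 / sigma)                                         # xi_sigma
```

The ratio $\gamma_{\sigma}/\xi_{\sigma}$ scales like $\exp((M_{1,R}-m_{1,R})/\sigma)$, which is why the non-asymptotic bounds in Section 3 pay an exponential price as $\sigma\downarrow 0$.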

Thanks to (P1) and (P2), some calculations show that

\[ \|\nabla\hat{f}_{\sigma}\|_{2}\leq\gamma_{\sigma},\qquad\|\nabla^{2}\hat{f}_{\sigma}\|_{2}\leq\gamma_{\sigma}, \]
\[ \sup_{x\in\mathbb{R}^{d},\,t\in[0,1]}\|\nabla Q_{1-t}\hat{f}_{\sigma}(x)\|_{2}\leq\gamma_{\sigma},\qquad\sup_{x\in\mathbb{R}^{d},\,t\in[0,1]}\|\nabla^{2}(Q_{1-t}\hat{f}_{\sigma}(x))\|_{2}\leq\gamma_{\sigma}, \]

and

\[ b(x,t)=\frac{\nabla Q_{1-t}\hat{f}_{\sigma}(x)}{Q_{1-t}\hat{f}_{\sigma}(x)},\qquad\nabla b(x,t)=\frac{\nabla^{2}(Q_{1-t}\hat{f}_{\sigma})(x)}{Q_{1-t}\hat{f}_{\sigma}(x)}-b(x,t)b(x,t)^{\top}. \]

We conclude that

\[ \sup_{x\in\mathbb{R}^{d},\,t\in[0,1]}\|b(x,t)\|_{2}\leq\gamma_{\sigma}/\xi_{\sigma},\qquad\sup_{x\in\mathbb{R}^{d},\,t\in[0,1]}\|\nabla b(x,t)\|_{2}\leq\gamma_{\sigma}/\xi_{\sigma}+\gamma^{2}_{\sigma}/\xi_{\sigma}^{2}. \]

Furthermore, we can also easily deduce that the drift term $b$ satisfies a linear growth condition and a Lipschitz continuity condition (Revuz and Yor, 2013; Pavliotis, 2014), that is,

\[ \|b(x,t)\|_{2}^{2}\leq C_{0,\sigma}(1+\|x\|_{2}^{2}),\quad x\in\mathbb{R}^{d},\ t\in[0,1], \tag{C1} \]

and

\[ \|b(x,t)-b(y,t)\|_{2}\leq C_{1,\sigma}\|x-y\|_{2},\quad x,y\in\mathbb{R}^{d},\ t\in[0,1], \tag{C2} \]

where $C_{0,\sigma}$ and $C_{1,\sigma}$ are finite positive constants depending only on $\sigma$. The linear growth condition (C1) and the Lipschitz continuity condition (C2) ensure the existence of a unique strong solution of the Schrödinger-Föllmer SDE (7). We summarize the above discussion in the following proposition, whose proof is given in Appendix 7.

Proposition 2.3.

Under assumption $(\textbf{A})$, the Schrödinger-Föllmer SDE (7) has a unique strong solution $\{X_{t}\}_{t\in[0,1]}$ with $X_{0}\sim\delta_{0}$ and $X_{1}\sim\mu_{\sigma}$.

2.2. Euler-Maruyama discretization for Schrödinger-Föllmer Diffusion

Proposition 2.3 shows that the Schrödinger-Föllmer diffusion transports $\delta_{0}$ to the probability measure $\mu_{\sigma}$ over the unit time interval. Because the drift term $b(x,t)$ is scale-invariant with respect to $f_{\sigma}$ in the sense that $b(x,t)=\nabla\log Q_{1-t}Cf_{\sigma}(x)$ for all $C>0$, the Schrödinger-Föllmer diffusion can be used for sampling from $\mu_{\sigma}(dx)=\exp(-V(x)/\sigma)\,dx/C_{\sigma}$, where the normalizing constant $C_{\sigma}$ may be unknown. To this end, we use the Euler-Maruyama method to discretize the Schrödinger-Föllmer diffusion (7). Letting

\[ t_{k}=ks,\quad k=0,1,\ldots,K,\quad\text{with}\ s=1/K,\qquad Y_{t_{0}}=0, \]

the Euler-Maruyama discretization scheme reads

\[ Y_{t_{k+1}}=Y_{t_{k}}+\sigma s\,b\left(Y_{t_{k}},t_{k}\right)+\sqrt{\sigma s}\,\epsilon_{k+1},\quad k=0,1,\ldots,K-1, \tag{10} \]

where $\{\epsilon_{k}\}_{k=1}^{K}$ are i.i.d. $N(0,\mathbf{I}_{d})$ and

\[ b(Y_{t_{k}},t_{k})=\frac{\mathbb{E}_{Z}[\nabla\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z)]}{\mathbb{E}_{Z}[\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z)]}=\frac{\mathbb{E}_{Z}[Z\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z)]}{\mathbb{E}_{Z}[\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z)]\sqrt{(1-t_{k})\sigma}}, \tag{11} \]

where the second equality follows from Stein's lemma (Stein, 1972, 1986; Landsman and Nešlehová, 2008). The expectations in (11) may not admit explicit expressions. We therefore consider an estimator $\tilde{b}_{m}$ of $b$ obtained by replacing each $\mathbb{E}_{Z}$ in the drift term with an $m$-sample mean, i.e.,

\[ \tilde{b}_{m}(Y_{t_{k}},t_{k})=\frac{\frac{1}{m}\sum_{j=1}^{m}\nabla\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z_{j})}{\frac{1}{m}\sum_{j=1}^{m}\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z_{j})},\quad k=0,\ldots,K-1, \tag{12} \]

or

\[ \tilde{b}_{m}(Y_{t_{k}},t_{k})=\frac{\frac{1}{m}\sum_{j=1}^{m}Z_{j}\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z_{j})}{\frac{1}{m}\sum_{j=1}^{m}\hat{f}_{\sigma}(Y_{t_{k}}+\sqrt{(1-t_{k})\sigma}Z_{j})\cdot\sqrt{(1-t_{k})\sigma}},\quad k=0,\ldots,K-1, \tag{13} \]

where $Z_{1},\ldots,Z_{m}$ are i.i.d. $N(0,\mathbf{I}_{d})$. The proposed method is summarized in Algorithm 1 below.

Algorithm 1 Solving (1) via Euler-Maruyama discretization of Schrödinger-Föllmer Diffusion
1:  Input: $\sigma$, $m$, $K$. Initialize $s=1/K$, $\widetilde{Y}_{t_{0}}=0$.
2:  for $k=0,1,\ldots,K-1$ do
3:     Sample $\epsilon_{k+1}\sim N(0,\mathbf{I}_{d})$.
4:     Sample $Z_{i}$, $i=1,\ldots,m$, from $N(0,\mathbf{I}_{d})$.
5:     Compute $\tilde{b}_{m}$ according to (12) or (13).
6:     $\widetilde{Y}_{t_{k+1}}=\widetilde{Y}_{t_{k}}+\sigma s\,\tilde{b}_{m}\left(\widetilde{Y}_{t_{k}},t_{k}\right)+\sqrt{\sigma s}\,\epsilon_{k+1}$.
7:  end for
8:  Output: $\{\widetilde{Y}_{t_{k}}\}_{k=1}^{K}$.
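A compact NumPy sketch of Algorithm 1 (our illustration, not the authors' code) using the gradient-based estimator (12) might read as follows. Under assumption $(\textbf{A})$, $\nabla\hat{f}_{\sigma}(x)=\hat{f}_{\sigma}(x)\,(x-\nabla V(x))/\sigma$, so the ratio in (12) becomes a softmax-weighted average once $\hat{f}_{\sigma}$ is evaluated in log-space. `V` and `grad_V` are assumed to act on an $(m,d)$ array of points.

```python
import numpy as np

def sf_sampler(V, grad_V, d, sigma=0.1, K=100, m=100, seed=0):
    """One run of Algorithm 1: Euler-Maruyama for the Schrodinger-Follmer
    diffusion with the m-sample drift estimator (12).
    V: (m, d) -> (m,), grad_V: (m, d) -> (m, d)."""
    rng = np.random.default_rng(seed)
    s = 1.0 / K
    y = np.zeros(d)
    for k in range(K):
        c = np.sqrt((1.0 - k * s) * sigma)
        X = y + c * rng.standard_normal((m, d))
        # log f_hat(x) = -V(x)/sigma + ||x||^2/(2 sigma), kept in log-space
        logf = -V(X) / sigma + 0.5 * (X ** 2).sum(axis=1) / sigma
        w = np.exp(logf - logf.max())
        w /= w.sum()
        # grad f_hat = f_hat * (x - grad V(x))/sigma, so (12) is the
        # softmax-weighted average of (x_j - grad V(x_j))/sigma
        b = (w[:, None] * (X - grad_V(X)) / sigma).sum(axis=0)
        y = y + sigma * s * b + np.sqrt(sigma * s) * rng.standard_normal(d)
    return y
```

For $V(x)=\|x\|_{2}^{2}/2$ one has $\hat{f}_{\sigma}\equiv 1$, the drift vanishes, and $\widetilde{Y}_{t_{K}}\sim N(0,\sigma\mathbf{I}_{d})=\mu_{\sigma}$ exactly, which gives a convenient sanity check.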

In the next section, we establish a probability bound for $\widetilde{Y}_{t_{K}}$ being a $\tau$-global minimizer (Theorem 3.5), and derive a bound on the Wasserstein-2 distance between the law of $\widetilde{Y}_{t_{K}}$ generated by Algorithm 1 and the probability measure $\mu_{\sigma}$ under certain conditions (Theorem 3.7).

3. Theoretical Property

In this section, we show that the Gibbs measure $\mu_{\sigma}$ converges weakly to a distribution concentrated on the optimal points $\{x_{1}^{*},\ldots,x_{\kappa}^{*}\}$. Since the minimum value of $V$ is $0$, we estimate the probabilities of $V(X_{1})>\tau$ and $V(\widetilde{Y}_{t_{K}})>\tau$ for any $\tau>0$, and establish non-asymptotic bounds in the Wasserstein-2 distance between the law of the samples generated by Algorithm 1 and the target distribution $\mu_{\sigma}$. Recall that the linear growth condition (C1) and the Lipschitz condition (C2) hold under properties (P1) and (P2), which ensure that the Schrödinger-Föllmer SDE (7) has a unique strong solution. In addition, we assume that the drift term $b(x,t)$ is Lipschitz continuous in $x$ and $\frac{1}{2}$-Hölder continuous in $t$,

\[ \|b(x,t)-b(y,s)\|_{2}\leq C_{2,\sigma}\left(\|x-y\|_{2}+|t-s|^{\frac{1}{2}}\right), \tag{C3} \]

where $C_{2,\sigma}$ is a positive constant depending on $\sigma$.

Remark 3.1.

(C1) and (C2) are essentially the sufficient conditions under which the Schrödinger-Föllmer SDE (7) admits a unique strong solution. (C3) was introduced in Theorem 4.1 of Tzen and Raginsky (2019) and is similar to condition (H2) of Chau et al. (2021) and Assumption 3.2 of Barkhagen et al. (2021). Obviously, (C3) implies (C2), and (C1) holds whenever the drift term $b(x,t)$ is bounded over $\mathbb{R}^{d}\times[0,1]$.

First, we show that the Gibbs measure $\mu_{\sigma}$ converges weakly to a limiting distribution supported on the global minimizers. This result can be traced back to the 1980s; for the continuity of the exposition, we use Laplace's method as in Hwang (1980, 1981) to give a detailed proof. The key idea is to show that for each sufficiently small $\delta^{\prime}>0$, $\mu_{\sigma}(\{x:\|x-x_{i}^{*}\|_{2}<\delta^{\prime}\})$ converges to $\dfrac{\left(\det\nabla^{2}V(x_{i}^{*})\right)^{-\frac{1}{2}}}{\sum_{j=1}^{\kappa}\left(\det\nabla^{2}V(x_{j}^{*})\right)^{-\frac{1}{2}}}$ as $\sigma\downarrow 0$.

Next, we estimate the probabilities of $V(X_{1})>\tau$ and $V(\widetilde{Y}_{t_{K}})>\tau$ for any $\tau>0$. The second estimate is more involved due to the discretization; the main idea comes from Dalalyan (2017b) and Cheng et al. (2018), who construct a continuous-time interpolation SDE for the Euler-Maruyama discretization. In their works, the relative entropy is controlled by using Girsanov's theorem to estimate Radon-Nikodym derivatives.

Another method for controlling the relative entropy was proposed by Mou et al. (2022). By direct calculation, the time derivative of the relative entropy between the interpolated process and the original SDE (7) is controlled by the mean squared difference between the drift terms of the Fokker-Planck equations of the original and interpolated processes. Compared with the bound obtained from Lemma 7.2, the bound in Mou et al. (2022) has an additional backward conditional expectation inside the norm, which is a key reason for the higher precision order. However, it requires a dissipativity condition on the drift term of the SDE and smoothness of the initial distribution.

The precise results are stated in Theorems 3.2, 3.5 and 3.7 below; see Appendix 7 for detailed proofs.

Theorem 3.2.

Let $V:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be a twice continuously differentiable function. Suppose the set $N:=\{x\in\mathbb{R}^{d}:V(x)=\inf_{x}V(x)\}=\{x_{1}^{*},\ldots,x_{\kappa}^{*}\}$ is finite with $\kappa\geq 1$, and $\int_{\mathbb{R}^{d}}\exp(-V(x))\,dx<\infty$. Then

μσ𝑤i=1κ(det2V(xi))12δxij=1κ(det2V(xj))12,as σ0.\mu_{\sigma}\xrightarrow{w}\dfrac{\sum_{i=1}^{\kappa}\left(\det\nabla^{2}V(x_{i}^{*})\right)^{-\frac{1}{2}}\delta_{x_{i}^{*}}}{\sum_{j=1}^{\kappa}\left(\det\nabla^{2}V(x_{j}^{*})\right)^{-\frac{1}{2}}},\quad\text{as $\sigma\downarrow 0.$}
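As a sanity check of this limit, the Gibbs mass near each minimizer can be evaluated numerically. The sketch below uses a hypothetical one-dimensional potential of our own choosing (not from the paper) with two global minimizers of different curvature, so the limiting weights are unequal:

```python
import numpy as np

# Illustrative potential (our choice, not from the paper):
#   V(x) = (x^2 - 1)^2 * exp(x/2),
# with exactly two global minimizers x = -1, 1 (V = 0 at both) and
# curvatures V''(-1) = 8 e^{-1/2}, V''(1) = 8 e^{1/2}.  Theorem 3.2 then
# predicts limiting mass near x = -1 equal to
#   (V''(-1))^{-1/2} / [(V''(-1))^{-1/2} + (V''(1))^{-1/2}] = 1/(1 + e^{-1/2}).
def V(x):
    return (x**2 - 1.0)**2 * np.exp(x / 2.0)

sigma = 1e-3
x = np.linspace(-4.0, 4.0, 400001)
w = np.exp(-V(x) / sigma)                  # unnormalized Gibbs density
mass_left = w[np.abs(x + 1.0) < 0.3].sum() / w.sum()
predicted = 1.0 / (1.0 + np.exp(-0.5))     # limiting weight of x = -1
print(mass_left, predicted)
```

For small σ\sigma the computed mass agrees with the predicted weight up to corrections that vanish with σ\sigma.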

Given Theorem 3.2, a natural question concerns the rate at which the measure μσ\mu_{\sigma} converges to this limiting distribution. To that end, we apply tools from the large deviation theory of Gibbs measures that are absolutely continuous with respect to the Lebesgue measure; see Chiang et al. (1987); Holley et al. (1989); Márquez (1997). We obtain the following property.

Proposition 3.3.

Assume that the conditions of Theorem 3.2 hold. Then, for all τ>0\tau>0,

limσ0σlogμσ(V(x)minxV(x)τ)=τ.\lim_{\sigma\to 0}\sigma\log\mu_{\sigma}\left(V(x)-\min_{x}V(x)\geq\tau\right)=-\tau.
Remark 3.4.

Proposition 3.3 follows directly from the large deviation principle. It shows that the Gibbs measure μσ\mu_{\sigma} concentrates around the set of global minimizers of the potential function VV and quantifies the corresponding rate of convergence. However, the specific limit distribution in Theorem 3.2 cannot be deduced from Proposition 3.3 alone.
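The rate in Proposition 3.3 is easy to verify numerically in a simple case. The sketch below uses the illustrative quadratic potential V(x)=x²/2 in d=1 (our choice, not from the paper), for which μσ\mu_{\sigma} is the N(0,σ) law and the tail probability has a closed form:

```python
import math

# For V(x) = x^2/2 in d = 1 (min V = 0), mu_sigma = N(0, sigma) and
#   mu_sigma(V(x) >= tau) = P(|X| >= sqrt(2*tau)) = erfc(sqrt(tau/sigma)),
# so sigma * log mu_sigma(V >= tau) should approach -tau as sigma -> 0.
tau = 0.5
rates = {s: s * math.log(math.erfc(math.sqrt(tau / s)))
         for s in (1e-1, 1e-2, 1e-3)}
for s, r in rates.items():
    print(s, r)    # approaches -tau = -0.5 from below
```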

Theorem 3.5.

Suppose (A)(\textbf{A}) holds. Then, for each ε(0,τ)\varepsilon\in(0,\tau) and σ(0,1)\sigma\in(0,1), there exists a constant Cτ,ε,dC_{\tau,\varepsilon,d} (depending on τ,ε,d\tau,\varepsilon,d), given in (A.15), such that

(V(X1)>τ)Cτ,ε,dexp(τεσ),\displaystyle\mathbb{P}(V(X_{1})>\tau)\leq C_{\tau,\varepsilon,d}\exp\left(-\frac{\tau-\varepsilon}{\sigma}\right),
(V(Y~tK)>τ)Cτ,ε,dexp(τεσ)+C1,σd(2d+3)s+C2,σ4dm,\displaystyle\mathbb{P}(V(\widetilde{Y}_{t_{K}})>\tau)\leq C_{\tau,\varepsilon,d}\exp\left(-\frac{\tau-\varepsilon}{\sigma}\right)+C^{\sharp}_{1,\sigma}\sqrt{d(2d+3)s}+C^{\sharp}_{2,\sigma}\sqrt{\frac{4d}{m}}, (14)

where

C1,σ:=γσξσ+γσ3ξσ3,γσξσ={(M2,Rσ)2+M3,Rσ}exp(M1,Rm1,Rσ),\displaystyle C^{\sharp}_{1,\sigma}:=\frac{\gamma_{\sigma}}{\xi_{\sigma}}+\frac{\gamma^{3}_{\sigma}}{\xi^{3}_{\sigma}},\quad\frac{\gamma_{\sigma}}{\xi_{\sigma}}=\left\{\left(\frac{M_{2,R}}{\sigma}\right)^{2}+\frac{M_{3,R}}{\sigma}\right\}\exp\left(\frac{M_{1,R}-m_{1,R}}{\sigma}\right),
C2,σ:=γσ2ξσ2+γσζσξσ2,γσξσ={(M2,Rσ)2+M3,Rσ}exp(M1,Rm1,Rσ),\displaystyle C^{\sharp}_{2,\sigma}:=\frac{\gamma^{2}_{\sigma}}{\xi^{2}_{\sigma}}+\frac{\gamma_{\sigma}\zeta_{\sigma}}{\xi^{2}_{\sigma}},\quad\frac{\gamma_{\sigma}}{\xi_{\sigma}}=\left\{\left(\frac{M_{2,R}}{\sigma}\right)^{2}+\frac{M_{3,R}}{\sigma}\right\}\exp\left(\frac{M_{1,R}-m_{1,R}}{\sigma}\right),

M1,RM_{1,R}, M2,RM_{2,R}, M3,RM_{3,R} and m1,Rm_{1,R} are given in (8)-(9), and ζσ=exp(M1,R/σ)\zeta_{\sigma}=\exp(M_{1,R}/\sigma).

Remark 3.6.

The first term on the right-hand side of (14) originates from the gap between Gibbs sampling and optimization, which is intrinsic to the simulated annealing approach itself. The latter two terms are caused by the Euler discretization of SDE (7) and the Monte Carlo estimation of the drift coefficient b(x,t)b(x,t) in (11); they depend on dd polynomially and on σ\sigma exponentially. This contrasts with some Langevin methods, whose convergence error bounds depend on dd exponentially (Wang, 2009; Hale, 2010; Menz and Schlichting, 2014; Raginsky et al., 2017), implying that the efficiency of Langevin samplers may suffer from the curse of dimensionality.

By Theorem 3.5, for any 0<δ10<\delta\ll 1, Y~tK\widetilde{Y}_{t_{K}} is a τ\tau-global minimizer of VV with probability at least 1δ1-\sqrt{\delta}, i.e., V(Y~tK)τ+infV(x)V(\widetilde{Y}_{t_{K}})\leq\tau+\inf V(x), provided the number of iterations K𝒪(d2δ)K\geq\mathcal{O}\left(\frac{d^{2}}{\delta}\right), the number of Gaussian samples per iteration m𝒪(dδ)m\geq\mathcal{O}\left(\frac{d}{\delta}\right), and σ𝒪(τlog(1/δ))\sigma\leq\mathcal{O}\left(\frac{\tau}{\log(1/\delta)}\right).

Finally, we establish non-asymptotic bounds in the Wasserstein-2 distance between the law of the samples generated by Algorithm 1 and the distribution μσ\mu_{\sigma}. We first recall the definition of the Wasserstein distance. Let 𝒟(ν1,ν2)\mathcal{D}(\nu_{1},\nu_{2}) be the collection of coupling probability measures on (2d,(2d))\left(\mathbb{R}^{2d},\mathcal{B}(\mathbb{R}^{2d})\right) whose respective marginal distributions are ν1\nu_{1} and ν2\nu_{2}. The Wasserstein distance of order p1p\geq 1 between ν1\nu_{1} and ν2\nu_{2} is defined as

Wp(ν1,ν2)=infν𝒟(ν1,ν2)(ddθ1θ22pdν(θ1,θ2))1/p.\displaystyle W_{p}(\nu_{1},\nu_{2})=\inf_{\nu\in\mathcal{D}(\nu_{1},\nu_{2})}\left(\int_{\mathbb{R}^{d}}\int_{\mathbb{R}^{d}}\left\|\theta_{1}-\theta_{2}\right\|_{2}^{p}\mathrm{d}\nu\left(\theta_{1},\theta_{2}\right)\right)^{1/p}.
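In one dimension this infimum has a closed form for empirical measures with equally many atoms: the optimal coupling matches order statistics (quantile coupling). A small illustrative sketch, not part of the paper's method:

```python
import numpy as np

# W_p between two empirical measures on R with n atoms each: the optimal
# coupling pairs sorted samples (quantile coupling), so
#   W_p^p = (1/n) * sum_i |x_(i) - y_(i)|^p.
def wasserstein_p_1d(x, y, p=2):
    x = np.sort(np.asarray(x, dtype=float))
    y = np.sort(np.asarray(y, dtype=float))
    return np.mean(np.abs(x - y) ** p) ** (1.0 / p)

print(wasserstein_p_1d([0.0, 1.0], [2.0, 3.0], p=2))  # -> 2.0
```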
Theorem 3.7.

Assume that (A) and (C3) hold. Then

W2(Law(Y~tK),μσ)Cσ(C3,σs+C2,σ16dm),\displaystyle W_{2}(\text{Law}(\widetilde{Y}_{t_{K}}),\mu_{\sigma})\leq C^{\sharp}_{\sigma}\left(C^{\sharp}_{3,\sigma}\sqrt{s}+C^{\sharp}_{2,\sigma}\sqrt{\frac{16d}{m}}\right),

where

Cσ:=exp(1/2+8C2,σ2),C3,σ:=2C1,σ+4C1,σ2{1+C0,σexp(2C0,σ+1)}12,\displaystyle C^{\sharp}_{\sigma}:=\exp(1/2+8C_{2,\sigma}^{2}),\quad C^{\sharp}_{3,\sigma}:=2C_{1,\sigma}+4C^{2}_{1,\sigma}\left\{1+C_{0,\sigma}\exp\left(2\sqrt{C_{0,\sigma}}+1\right)\right\}^{\frac{1}{2}},
C2,σ:=(γσξσ)2+γσζσξσ2,γσξσ={(M2,Rσ)2+M3,Rσ}exp(M1,Rm1,Rσ),\displaystyle C^{\sharp}_{2,\sigma}:=\left(\frac{\gamma_{\sigma}}{\xi_{\sigma}}\right)^{2}+\frac{\gamma_{\sigma}\zeta_{\sigma}}{\xi^{2}_{\sigma}},\quad\frac{\gamma_{\sigma}}{\xi_{\sigma}}=\left\{\left(\frac{M_{2,R}}{\sigma}\right)^{2}+\frac{M_{3,R}}{\sigma}\right\}\exp\left(\frac{M_{1,R}-m_{1,R}}{\sigma}\right),

with C0,σ=C1,σ=γσ/ξσ+γσ2/ξσ2C_{0,\sigma}=C_{1,\sigma}=\gamma_{\sigma}/\xi_{\sigma}+\gamma^{2}_{\sigma}/\xi^{2}_{\sigma}.

Remark 3.8.

The Langevin sampling method has been studied under the (strongly) convex potential assumption (Durmus and Moulines, 2016, 2017, 2019; Dalalyan, 2017a, b; Cheng and Bartlett, 2018; Dalalyan and Karagulyan, 2019); see Garbuno-Inigo et al. (2020); Wang and Li (2022) for mean-field Langevin methods. However, the Langevin diffusion process reaches the target distribution only as time tt goes to infinity, whereas the Schrödinger bridge achieves this in unit time. The Schrödinger bridge has close connections with a number of problems in statistical physics, optimal transport and optimal control (Léonard, 2014), yet only a few recent works use this tool for statistical sampling. Bernton et al. (2019) proposed the Schrödinger bridge sampler based on the iterative proportional fitting method (the Sinkhorn algorithm (Sinkhorn, 1964; Peyré and Cuturi, 2019)). For the Gibbs measure μσ\mu_{\sigma}, Schrödinger bridge samplers iteratively modify the transition kernels of a reference Markov chain, via regression-based approximations of the iterative proportional fitting recursion, to obtain a process whose marginal distribution at the terminal time approximates μσ\mu_{\sigma}.

4. Simulation studies

In this section, we conduct numerical simulations to evaluate the performance of the proposed method (Algorithm 1) and compare it with the Langevin method. We consider the following non-convex smooth function from d\mathbb{R}^{d} to \mathbb{R} (Carrillo et al., 2021),

V(x)=1di=1d[(xiB)210cos(2π(xiB))+10]+C,V(x)=\frac{1}{d}\sum_{i=1}^{d}\left[\left(x_{i}-B\right)^{2}-10\cos\left(2\pi\left(x_{i}-B\right)\right)+10\right]+C,

with B=argminxdV(x)B=\operatorname{argmin}_{x\in\mathbb{R}^{d}}V(x) and C=minxdV(x)C=\min_{x\in\mathbb{R}^{d}}V(x). Figure 1 depicts this target function VV with B=0,C=0,d=1B=0,C=0,d=1; it has infinitely many local minimizers in \mathbb{R} but only one global minimizer, i.e., x=0x=0. In the numerical experiment, we consider the case d=2,B=0,C=0d=2,B=0,C=0 in VV. We set K=200,m=1000,σ=0.01K=200,m=1000,\sigma=0.01 in Algorithm 1, and the Langevin method is implemented with the R package yuima (Iacus and Yoshida, 2018). As shown in Proposition 2.3, the target distribution is reached exactly at time one. Thus, we only need to keep the last iterate of the Euler-Maruyama discretization of the Schrödinger-Föllmer diffusion in each run and repeat this scheme until the desired sample size is reached; that is, each run of Algorithm 1 yields one sample, at a cost of 200×1000200\times 1000 Gaussian samples per run. In contrast, the Langevin diffusion (3) approaches the target distribution only as time tt goes to infinity, so sufficiently many iterates must be discarded as burn-in in each Euler-Maruyama discretization. To make the comparison fair, we run the Langevin method 50 times; at each run we generate 200×1000200\times 1000 samples and keep the one with the best function value. Figure 2 shows the results of 50 independent runs, where the red line marks the global minimizer (GM) (x1=0,x2=0)(x_{1}=0,x_{2}=0). From Figure 2, we conclude that the proposed method approximately attains the global minimizer and performs better than the Langevin method.
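To make the scheme concrete, the following minimal Python sketch mimics the structure of Algorithm 1 under an assumed normalization of the Schrödinger-Föllmer dynamics (the exact forms of SDE (7) and the drift estimator (11) are given earlier in the paper). The function names, hyperparameters and the quadratic demo potential are our own illustrative choices.

```python
import numpy as np

# Sketch of an Euler-Maruyama Schrodinger-Follmer sampler in the spirit of
# Algorithm 1.  ASSUMED normalization (see the paper for the exact SDE (7)):
#   dY_t = sigma * grad log Q_{1-t} f_hat(Y_t) dt + sqrt(sigma) dB_t,  Y_0 = 0,
# where f_hat(x) = exp((|x|^2/2 - V(x)) / sigma) and Q_u is the heat semigroup
# at scale sigma * u.  The drift is estimated with m Gaussian samples as in
# (11); since grad f_hat = f_hat * grad log f_hat with
# grad log f_hat(x) = (x - grad V(x)) / sigma, the ratio of Monte Carlo sums
# is a softmax-weighted average, which is numerically stable.
def sf_sampler(grad_V, log_fhat, d, sigma=0.05, K=100, m=300, seed=0):
    rng = np.random.default_rng(seed)
    s = 1.0 / K                                    # step size, t_k = k * s
    y = np.zeros(d)
    for k in range(K):
        t = k * s
        pts = y + np.sqrt(sigma * (1.0 - t)) * rng.standard_normal((m, d))
        logw = np.array([log_fhat(p) for p in pts])
        w = np.exp(logw - logw.max())
        w /= w.sum()                               # softmax weights
        g = np.array([(p - grad_V(p)) / sigma for p in pts])
        b = w @ g                                  # Monte Carlo drift estimate
        y = y + s * sigma * b + np.sqrt(sigma * s) * rng.standard_normal(d)
    return y

# Demo on the illustrative quadratic potential V(x) = |x - c|^2 / 2, whose
# Gibbs measure is N(c, sigma I): the output should land near c.
sigma, c = 0.05, np.array([1.0, -0.5])
y = sf_sampler(lambda x: x - c, lambda x: (c @ x - c @ c / 2.0) / sigma,
               d=2, sigma=sigma)
print(y)
```

For the Rastrigin-type VV above one would plug in its gradient and the corresponding log f̂σ instead; we make no claim that these hyperparameters match the experiment reported here.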

Refer to caption
Figure 1. B=0B=0, C=0C=0, d=1d=1.
Refer to caption
Refer to caption
Figure 2. Proposed method vs. the Langevin method.

5. Conclusion

We study the problem of finding global minimizers of V(x):dV(x):\mathbb{R}^{d}\rightarrow\mathbb{R} approximately via sampling from a probability distribution μσ\mu_{\sigma} with density pσ(x)=exp(V(x)/σ)dexp(V(y)/σ)𝑑yp_{\sigma}(x)=\dfrac{\exp(-V(x)/\sigma)}{\int_{\mathbb{R}^{d}}\exp(-V(y)/\sigma)dy} with respect to the Lebesgue measure for σ(0,1]\sigma\in(0,1] small enough. We analyze a sampler based on the Euler discretization scheme of the Schrödinger-Föllmer diffusion process with stochastic approximation under appropriate assumptions on the step size ss and the potential VV. We prove that the output of the proposed sampler is an approximate global minimizer of V(x)V(x) with high probability. Moreover, simulation studies verify the effectiveness of the proposed method for solving non-convex smooth optimization problems and show that it performs better than the Langevin method.

6. Acknowledgments

We would like to thank the anonymous referees for their useful comments and suggestions, which have led to considerable improvements in the paper. This work is supported by the National Key Research and Development Program of China (No. 2020YFA0714200) and by the National Science Foundation of China (No. 12125103, No. 12071362, No. 11871474, No. 11871385). The numerical calculations were done at the Supercomputing Center of Wuhan University.

7. Appendix

In the appendix, we verify (P1)-(P2) and prove all the propositions and theorems, i.e., Propositions 2.3 and 3.3 and Theorems 3.2, 3.5 and 3.7.

7.1. Verify (P1)-(P2)

Proof.

Recall the definition of fσf_{\sigma} in (6) and the assumption V(x)=x22/2V(x)=\|x\|_{2}^{2}/2 for x2>R\|x\|_{2}>R. Some simple calculations give

f^σ(x)\displaystyle\nabla\hat{f}_{\sigma}(x) =exp(x222σV(x)σ)(xσV(x)σ),\displaystyle=\exp\left(\frac{\|x\|_{2}^{2}}{2\sigma}-\frac{V(x)}{\sigma}\right)\cdot\left(\frac{x}{\sigma}-\frac{\nabla V(x)}{\sigma}\right),
2f^σ(x)\displaystyle\nabla^{2}\hat{f}_{\sigma}(x) =exp(x222σV(x)σ)(xσV(x)σ)(xσV(x)σ)\displaystyle=\exp\left(\frac{\|x\|_{2}^{2}}{2\sigma}-\frac{V(x)}{\sigma}\right)\cdot\left(\frac{x}{\sigma}-\frac{\nabla V(x)}{\sigma}\right)\cdot\left(\frac{x}{\sigma}-\frac{\nabla V(x)}{\sigma}\right)^{\top}
+exp(x222σV(x)σ)(Idσ2V(x)σ).\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ +\exp\left(\frac{\|x\|_{2}^{2}}{2\sigma}-\frac{V(x)}{\sigma}\right)\cdot\left(\frac{\mbox{\bf I}_{d}}{\sigma}-\frac{\nabla^{2}V(x)}{\sigma}\right).

Then, the properties (P1)-(P2) hold if for each σ(0,1]\sigma\in(0,1],

limR~supx2R~exp(V(x)σ+x222σ)𝐈d2V(x)σ2\displaystyle\lim_{\widetilde{R}\rightarrow\infty}\sup_{\|x\|_{2}\geq\widetilde{R}}\exp\left(-\frac{V(x)}{\sigma}+\frac{\|x\|_{2}^{2}}{2\sigma}\right)\left\|\frac{\mathbf{I}_{d}-\operatorname{\nabla^{2}}V(x)}{\sigma}\right\|_{2}
\displaystyle\leq limR~supx2R~exp(V(x)+x222)𝐈d2V(x)2<,\displaystyle\lim_{\widetilde{R}\rightarrow\infty}\sup_{\|x\|_{2}\geq\widetilde{R}}\exp\left(-V(x)+\frac{\|x\|_{2}^{2}}{2}\right)\left\|\mathbf{I}_{d}-\operatorname{\nabla^{2}}V(x)\right\|_{2}<\infty,

and

limR~supx2R~exp(V(x)σ+x222σ)xV(x)σ2\displaystyle\lim_{\widetilde{R}\rightarrow\infty}\sup_{\|x\|_{2}\geq\widetilde{R}}\exp\left(-\frac{V(x)}{\sigma}+\frac{\|x\|_{2}^{2}}{2\sigma}\right)\cdot\left\|\frac{x-\nabla V(x)}{\sigma}\right\|_{2}
\displaystyle\leq limR~supx2R~exp(V(x)+x222)xV(x)2<,\displaystyle\lim_{\widetilde{R}\rightarrow\infty}\sup_{\|x\|_{2}\geq\widetilde{R}}\exp\left(-V(x)+\frac{\|x\|_{2}^{2}}{2}\right)\cdot\left\|x-\nabla V(x)\right\|_{2}<\infty,

where we use the fact that aaexp(ax)a\mapsto a\exp(-ax) is non-increasing for a1,x+a\geq 1,x\in\mathbb{R}_{+}. ∎

7.2. Proof of Proposition 2.3

Proof.

By (P1) and (P2), it yields that for all xdx\in\mathbb{R}^{d} and t[0,1]t\in[0,1],

b(x,t)2=Q1tf^σ(x)2Q1tf^σ(x)γσξσ.\displaystyle\|b(x,t)\|_{2}=\frac{\left\|\nabla Q_{1-t}\hat{f}_{\sigma}(x)\right\|_{2}}{Q_{1-t}\hat{f}_{\sigma}(x)}\leq\frac{\gamma_{\sigma}}{\xi_{\sigma}}. (A.1)

Then, by (P1)-(P2) and (A.1), for all x,ydx,y\in\mathbb{R}^{d} and t[0,1]t\in[0,1],

b(x,t)b(y,t)2\displaystyle\left\|b(x,t)-b\left(y,t\right)\right\|_{2} =Q1tf^σ(x)Q1tf^σ(x)Q1tf^σ(y)Q1tf^σ(y)2\displaystyle=\left\|\frac{\nabla Q_{1-t}\hat{f}_{\sigma}(x)}{Q_{1-t}\hat{f}_{\sigma}(x)}-\frac{\nabla Q_{1-t}\hat{f}_{\sigma}\left(y\right)}{Q_{1-t}\hat{f}_{\sigma}\left(y\right)}\right\|_{2}
=Q1tf^σ(x)Q1tf^σ(y)Q1tf^σ(y)+Q1tf^σ(x)(Q1tf^σ(y)Q1tf^σ(x))Q1tf^σ(x)Q1tf^σ(y)2\displaystyle=\left\|\frac{\nabla Q_{1-t}\hat{f}_{\sigma}(x)-\nabla Q_{1-t}\hat{f}_{\sigma}(y)}{Q_{1-t}\hat{f}_{\sigma}(y)}+\frac{\nabla Q_{1-t}\hat{f}_{\sigma}(x)\left(Q_{1-t}\hat{f}_{\sigma}(y)-Q_{1-t}\hat{f}_{\sigma}(x)\right)}{Q_{1-t}\hat{f}_{\sigma}(x)Q_{1-t}\hat{f}_{\sigma}(y)}\right\|_{2}
Q1tf^σ(x)Q1tf^σ(y)2Q1tf^σ(y)+b(x,t)2|Q1tf^σ(x)Q1tf^σ(y)|Q1tf^σ(y)\displaystyle\leq\frac{\left\|\nabla Q_{1-t}\hat{f}_{\sigma}(x)-\nabla Q_{1-t}\hat{f}_{\sigma}\left(y\right)\right\|_{2}}{Q_{1-t}\hat{f}_{\sigma}\left(y\right)}+\|b(x,t)\|_{2}\cdot\frac{\left|Q_{1-t}\hat{f}_{\sigma}(x)-Q_{1-t}\hat{f}_{\sigma}\left(y\right)\right|}{Q_{1-t}\hat{f}_{\sigma}\left(y\right)}
(γσξσ+γσ2ξσ2)xy2.\displaystyle\leq\left(\frac{\gamma_{\sigma}}{\xi_{\sigma}}+\frac{\gamma_{\sigma}^{2}}{\xi_{\sigma}^{2}}\right)\left\|x-y\right\|_{2}.

Setting C1,σ:=γσξσ+γσ2ξσ2C_{1,\sigma}:=\frac{\gamma_{\sigma}}{\xi_{\sigma}}+\frac{\gamma_{\sigma}^{2}}{\xi_{\sigma}^{2}} yields the Lipschitz continuity condition (C2). Combining (A.1) and (C2) with the triangle inequality, we have

b(x,t)2b(0,t)2+C1,σx2γσξσ+C1,σx2.\|b(x,t)\|_{2}\leq\|b(0,t)\|_{2}+C_{1,\sigma}\|x\|_{2}\leq\frac{\gamma_{\sigma}}{\xi_{\sigma}}+C_{1,\sigma}\|x\|_{2}.

Let C0,σ:=max{γσξσ,C1,σ}C_{0,\sigma}:=\max\left\{\frac{\gamma_{\sigma}}{\xi_{\sigma}},C_{1,\sigma}\right\}; then (C1) holds. Since the drift term b(x,t)b(x,t) satisfies the linear growth condition (C1) and the Lipschitz condition (C2), the Schrödinger-Föllmer diffusion SDE (7) has a unique strong solution (Revuz and Yor, 2013; Pavliotis, 2014).

Moreover, the Schrödinger-Föllmer diffusion process {Xt}t[0,1]\{X_{t}\}_{t\in[0,1]} defined in (7) admits the transition probability density

ps,t(x,y):=p~s,t(x,y)Q1tfσ(y)Q1sfσ(x),\displaystyle p_{s,t}(x,y):=\widetilde{p}_{s,t}(x,y)\frac{Q_{1-t}f_{\sigma}(y)}{Q_{1-s}f_{\sigma}(x)},

where

p~s,t(x,y)=1(2πσ(ts))d/2exp(12σ(ts)xy22)\displaystyle\widetilde{p}_{s,t}(x,y)=\frac{1}{(2\pi\sigma(t-s))^{d/2}}\exp\left(-\frac{1}{2\sigma(t-s)}\|x-y\|^{2}_{2}\right)

is the transition probability density of a dd-dimensional Brownian motion σBt\sqrt{\sigma}B_{t}. See Dai Pra (1991); Lehec (2013) for details. It follows that for any Borel measurable set A(d)A\in\mathcal{B}(\mathbb{R}^{d}),

(X1A)\displaystyle\mathbb{P}(X_{1}\in A) =Ap0,1(0,y)dy\displaystyle=\int_{A}p_{0,1}(0,y)\mathrm{d}y
=Ap~0,1(0,y)Q0fσ(y)Q1fσ(0)dy\displaystyle=\int_{A}\widetilde{p}_{0,1}(0,y)\frac{Q_{0}f_{\sigma}(y)}{Q_{1}f_{\sigma}(0)}\mathrm{d}y
=μσ(A),\displaystyle=\mu_{\sigma}(A),

where the last equality follows from Q1fσ(0)=μσ(d)=1Q_{1}f_{\sigma}(0)=\mu_{\sigma}(\mathbb{R}^{d})=1 and Q0fσ(y)=fσ(y)Q_{0}f_{\sigma}(y)=f_{\sigma}(y). Therefore, X1X_{1} is distributed according to μσ\mu_{\sigma}. This completes the proof. ∎

7.3. Proof of Proposition 3.3

Proof.

Under the conditions of Theorem 3.2, Cσ=dexp(V(x)/σ)𝑑xdexp(V(x))𝑑x<C_{\sigma}=\int_{\mathbb{R}^{d}}\exp\left(-V(x)/\sigma\right)dx\leq\int_{\mathbb{R}^{d}}\exp\left(-V(x)\right)dx<\infty for all σ(0,1]\sigma\in(0,1]. By Varadhan’s theorem (Theorem 1.2.3 in Dupuis and Ellis, 2011), the family {μσ}σ[0,1]\{\mu_{\sigma}\}_{\sigma\in[0,1]} on (d)\mathcal{B}(\mathbb{R}^{d}) satisfies a large deviation principle with rate function V(x)minxV(x)V(x)-\min_{x}V(x); that is, for every function FCb(d)F\in C_{b}(\mathbb{R}^{d}), the space of bounded continuous functions on d\mathbb{R}^{d},

limσ0σlogdCσ1exp(F(x)V(x)σ)𝑑x=supx{F(x)I(x)},\lim_{\sigma\to 0}\sigma\log\int_{\mathbb{R}^{d}}C_{\sigma}^{-1}\exp\left(\frac{F(x)-V(x)}{\sigma}\right)dx=\sup_{x}\{F(x)-I(x)\}, (A.2)

where the rate function I(x)I(x) is defined by

I(x):=V(x)minxV(x).I(x):=V(x)-\min_{x}V(x).

Next, we only need to prove (A.2). On the one hand,

logdCσ1exp(F(x)V(x)σ)𝑑x=logdexp(V(x)σ)𝑑x+logdexp(F(x)V(x)σ)𝑑x.\log\int_{\mathbb{R}^{d}}C_{\sigma}^{-1}\exp\left(\frac{F(x)-V(x)}{\sigma}\right)dx=-\log\int_{\mathbb{R}^{d}}\exp\left(-\frac{V(x)}{\sigma}\right)dx+\log\int_{\mathbb{R}^{d}}\exp\left(\frac{F(x)-V(x)}{\sigma}\right)dx.

By Lemma 7.1, we have

σlogdexp(V(x)σ)𝑑x\displaystyle-\sigma\log\int_{\mathbb{R}^{d}}\exp\left(-\frac{V(x)}{\sigma}\right)dx =σlogdexp(V(x)minxV(x)σ)𝑑x+minxV(x)\displaystyle=-\sigma\log\int_{\mathbb{R}^{d}}\exp\left(-\frac{V(x)-\min_{x}V(x)}{\sigma}\right)dx+\min_{x}V(x)
minxV(x)+d2limσ0σlogσ=minxV(x),as σ0.\displaystyle\rightarrow\min_{x}V(x)+\frac{d}{2}\lim_{\sigma\to 0}\sigma\log\sigma=\min_{x}V(x),\quad\text{as $\sigma\downarrow 0$.} (A.3)

On the other hand, for any positive K>0K>0, combining Lemma 2.2 in Varadhan (1966) and the dominated convergence theorem yields that

limσ0σlogdexp(F(x)V(x)σ)𝑑x\displaystyle\lim_{\sigma\to 0}\sigma\log\int_{\mathbb{R}^{d}}\exp\left(\frac{F(x)-V(x)}{\sigma}\right)dx =limKlimσ0σlogdexp(F(x)V(x)Kσ)𝑑x\displaystyle=\lim_{K\to\infty}\lim_{\sigma\to 0}\sigma\log\int_{\mathbb{R}^{d}}\exp\left(\frac{F(x)-V(x)\land K}{\sigma}\right)dx
=limKsupx{F(x)V(x)K}=supx{F(x)V(x)}.\displaystyle=\lim_{K\to\infty}\sup_{x}\{F(x)-V(x)\land K\}=\sup_{x}\{F(x)-V(x)\}. (A.4)

Hence, by (A.3) and (A.4), we get

limσ0σlogdCσ1exp(F(x)V(x)σ)𝑑x=supx{F(x)V(x)+minxV(x)}.\lim_{\sigma\to 0}\sigma\log\int_{\mathbb{R}^{d}}C_{\sigma}^{-1}\exp\left(\frac{F(x)-V(x)}{\sigma}\right)dx=\sup_{x}\{F(x)-V(x)+\min_{x}V(x)\}.

Since the measure μσ\mu_{\sigma} satisfies the large deviation principle with rate function II, if we take the closed set F:={xd;V(x)minxV(x)τ}F:=\{x\in\mathbb{R}^{d};V(x)-\min_{x}V(x)\geq\tau\}, then

limσ0σlogμσ(F)=limσ0σlogμσ(V(x)minxV(x)τ)=infxFI(x)=τ.\lim_{\sigma\to 0}\sigma\log\mu_{\sigma}(F)=\lim_{\sigma\to 0}\sigma\log\mu_{\sigma}\left(V(x)-\min_{x}V(x)\geq\tau\right)=-\inf_{x\in F}I(x)=-\tau.

7.4. Preliminary lemmas for Theorem 3.2 and Theorem 3.5

In order to prove that the Gibbs measure μσ\mu_{\sigma} weakly converges to a mixture of Dirac measures and to estimate the probabilities of V(X1)>τV(X_{1})>\tau and V(Y~tK)>τV(\widetilde{Y}_{t_{K}})>\tau for any τ>0\tau>0, we first prove the following Lemmas 7.1-7.2.

Lemma 7.1.

Assume VV is twice continuously differentiable and satisfies V(x0)=0,V(x0)=0V(x_{0})=0,\nabla V(x_{0})=0, and that the Hessian matrix 2V(x0)\nabla^{2}V(x_{0}) is positive definite. Then, for δ>0\delta>0 sufficiently small,

limt+td2Uδ(x0)etV(x1,,xd)𝑑x1𝑑xd=(2π)d2(det2V(x0))12\lim_{t\to+\infty}t^{\frac{d}{2}}\int_{U_{\delta}(x_{0})}e^{-tV(x_{1},\ldots,x_{d})}dx_{1}\cdots dx_{d}=\frac{(2\pi)^{\frac{d}{2}}}{\left(\det\nabla^{2}V(x_{0})\right)^{\frac{1}{2}}}

where Uδ(x0):={xd;xx02<δ}U_{\delta}(x_{0}):=\{x\in\mathbb{R}^{d};\|x-x_{0}\|_{2}<\delta\}.

Proof.

For each ε(0,1)\varepsilon\in(0,1), there exists δ>0\delta>0 such that

(1ε)(xx0)2V(x0)(xx0)2V(x)(1+ε)(xx0)2V(x0)(xx0)2\frac{(1-\varepsilon)(x-x_{0})^{\top}\nabla^{2}V(x_{0})(x-x_{0})}{2}\leq V(x)\leq\frac{(1+\varepsilon)(x-x_{0})^{\top}\nabla^{2}V(x_{0})(x-x_{0})}{2}

holds for any xUδx\in U_{\delta}. Moreover, we have

Uδ(x0)exp(tV(x))𝑑xUδ(x0)exp(t2(1ε)(xx0)2V(x0)(xx0))𝑑x.\int_{U_{\delta}(x_{0})}\exp\left(-tV(x)\right)dx\leq\int_{U_{\delta}(x_{0})}\exp\left(-\frac{t}{2}(1-\varepsilon)(x-x_{0})^{\top}\nabla^{2}V(x_{0})(x-x_{0})\right)dx. (A.5)

There is an orthogonal matrix PP such that P2V(x0)P=diag(λ1,,λd)P^{\top}\nabla^{2}V(x_{0})P=\mathrm{diag}(\lambda_{1},\ldots,\lambda_{d}), where λ1,,λd\lambda_{1},\ldots,\lambda_{d} are the eigenvalues of the Hessian matrix 2V(x0)\nabla^{2}V(x_{0}). Denoting (y1,,yd):=P(xx0)(y_{1},\ldots,y_{d}):=P(x-x_{0}), we have

(xx0)2V(x0)(xx0)=i=1dλi(yi)2.(x-x_{0})^{\top}\nabla^{2}V(x_{0})(x-x_{0})=\sum_{i=1}^{d}\lambda_{i}(y_{i})^{2}.

Hence

Uδ(x0)exp(t2(1ε)(xx0)2V(x0)(xx0))𝑑x=Uδ(0)exp(t(1ε)2i=1dλi(yi)2)𝑑y.\displaystyle\int_{U_{\delta}(x_{0})}\exp\left(-\frac{t}{2}(1-\varepsilon)(x-x_{0})^{\top}\nabla^{2}V(x_{0})(x-x_{0})\right)dx=\int_{U_{\delta}(0)}\exp\left(-\frac{t(1-\varepsilon)}{2}\sum_{i=1}^{d}\lambda_{i}(y_{i})^{2}\right)dy. (A.6)

Further, we can get

Uδ(0)exp(t(1ε)2i=1dλi(yi)2)𝑑y=(2t(1ε))d2z2t(1ε)2δexp(i=1dλi(zi)2)𝑑z,\int_{U_{\delta}(0)}\exp\left(-\frac{t(1-\varepsilon)}{2}\sum_{i=1}^{d}\lambda_{i}(y_{i})^{2}\right)dy=\left(\frac{2}{t(1-\varepsilon)}\right)^{\frac{d}{2}}\int_{\|z\|_{2}\leq\sqrt{\frac{t(1-\varepsilon)}{2}}\delta}\exp\left(-\sum_{i=1}^{d}\lambda_{i}(z_{i})^{2}\right)dz, (A.7)

where the equality holds by setting z=t(1ε)2yz=\sqrt{\dfrac{t(1-\varepsilon)}{2}}y. By (A.5), (A.6) and (A.7), it follows that

lim supt+td2Uδ(x0)etV(x)𝑑x\displaystyle\limsup_{t\to+\infty}t^{\frac{d}{2}}\int_{U_{\delta}(x_{0})}e^{-tV(x)}dx (21ε)d2lim supt+z2<t(1ε)2δexp(i=1dλi(zi)2)𝑑z\displaystyle\leq\left(\frac{2}{1-\varepsilon}\right)^{\frac{d}{2}}\limsup_{t\to+\infty}\int_{\|z\|_{2}<\sqrt{\frac{t(1-\varepsilon)}{2}}\delta}\exp\left(-\sum_{i=1}^{d}\lambda_{i}(z_{i})^{2}\right)dz
(21ε)d2dexp(i=1dλi(zi)2)𝑑z1𝑑zd\displaystyle\leq\left(\frac{2}{1-\varepsilon}\right)^{\frac{d}{2}}\int_{\mathbb{R}^{d}}\exp\left(-\sum_{i=1}^{d}\lambda_{i}(z_{i})^{2}\right)dz_{1}\cdots dz_{d}
=(2π1ε)d2(i=1d1λi)=(2π1ε)d21(det2V(x0))12.\displaystyle=\left(\frac{2\pi}{1-\varepsilon}\right)^{\frac{d}{2}}\left(\prod_{i=1}^{d}\frac{1}{\sqrt{\lambda_{i}}}\right)=\left(\frac{2\pi}{1-\varepsilon}\right)^{\frac{d}{2}}\frac{1}{\left(\det\nabla^{2}V(x_{0})\right)^{\frac{1}{2}}}.

Similarly, we have

lim inft+td2Uδ(x0)etV(x)𝑑x\displaystyle\liminf_{t\to+\infty}t^{\frac{d}{2}}\int_{U_{\delta}(x_{0})}e^{-tV(x)}dx (21+ε)d2lim inft+z2<t(1+ε)2δexp(i=1dλi(zi)2)𝑑z\displaystyle\geq\left(\frac{2}{1+\varepsilon}\right)^{\frac{d}{2}}\liminf_{t\to+\infty}\int_{\|z\|_{2}<\sqrt{\frac{t(1+\varepsilon)}{2}}\delta}\exp\left(-\sum_{i=1}^{d}\lambda_{i}(z_{i})^{2}\right)dz
(21+ε)d2dexp(i=1dλi(zi)2)𝑑z1𝑑zd\displaystyle\geq\left(\frac{2}{1+\varepsilon}\right)^{\frac{d}{2}}\int_{\mathbb{R}^{d}}\exp\left(-\sum_{i=1}^{d}\lambda_{i}(z_{i})^{2}\right)dz_{1}\cdots dz_{d}
=(2π1+ε)d2(i=1d1λi)=(2π1+ε)d21(det2V(x0))12.\displaystyle=\left(\frac{2\pi}{1+\varepsilon}\right)^{\frac{d}{2}}\left(\prod_{i=1}^{d}\frac{1}{\sqrt{\lambda_{i}}}\right)=\left(\frac{2\pi}{1+\varepsilon}\right)^{\frac{d}{2}}\frac{1}{\left(\det\nabla^{2}V(x_{0})\right)^{\frac{1}{2}}}.

Therefore, letting ε0+\varepsilon\rightarrow 0^{+}, we get

limt+td2Uδ(x0)etV(x)𝑑x=(2π)d2(i=1d1λi)=(2π)d2(det2V(x0))12.\lim_{t\to+\infty}t^{\frac{d}{2}}\int_{U_{\delta}(x_{0})}e^{-tV(x)}dx=(2\pi)^{\frac{d}{2}}\left(\prod_{i=1}^{d}\frac{1}{\sqrt{\lambda_{i}}}\right)=\frac{(2\pi)^{\frac{d}{2}}}{\left(\det\nabla^{2}V(x_{0})\right)^{\frac{1}{2}}}.
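The convergence in Lemma 7.1 can be checked numerically. The sketch below uses the illustrative one-dimensional potential V(x)=x²/2+x⁴ (our choice), for which x₀=0, V(x₀)=0 and V″(x₀)=1, so the limit is √(2π):

```python
import numpy as np

# Riemann-sum check of Lemma 7.1 in d = 1 for V(x) = x^2/2 + x^4:
#   t^{1/2} * int_{|x| < delta} exp(-t V(x)) dx  ->  sqrt(2 pi)  as t -> infty.
def laplace_value(t, delta=0.5, n=200001):
    x, dx = np.linspace(-delta, delta, n, retstep=True)
    return np.sqrt(t) * np.sum(np.exp(-t * (x**2 / 2.0 + x**4))) * dx

for t in (1e2, 1e4, 1e6):
    print(t, laplace_value(t))    # approaches sqrt(2*pi) ~ 2.5066
```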

Lemma 7.2.

Let X=(Xt,t),Y=(Yt,t)X=(X_{t},\mathcal{F}_{t}),Y=(Y_{t},\mathcal{F}_{t}) be strong solutions of the following two stochastic differential equations

dXt=a(Xt,t)dt+σdBt,t[0,1]\displaystyle dX_{t}=a(X_{t},t)dt+\sigma dB_{t},\quad t\in[0,1]
dYt=b(Yt,t)dt+σdBt,Y0=X0,t[0,1],\displaystyle dY_{t}=b(Y_{t},t)dt+\sigma dB_{t},\quad Y_{0}=X_{0},\leavevmode\nobreak\ t\in[0,1],

and X0X_{0} is an 0\mathcal{F}_{0}-measurable random variable. In addition, if the drift terms a(Xt,t)a(X_{t},t) and b(Xt,t)b(X_{t},t) satisfy 𝔼[exp(01a(Xt,t)22+b(Xt,t)22dt)]<\mathbb{E}\left[\exp\left(\int_{0}^{1}\|a(X_{t},t)\|_{2}^{2}+\|b(X_{t},t)\|_{2}^{2}dt\right)\right]<\infty, then we have

dYdX(X)=exp(σ101b(Xt,t)a(Xt,t),dBt12σ201b(Xt,t)a(Xt,t)22𝑑t),\frac{d\mathbb{P}_{Y}}{d\mathbb{P}_{X}}(X)=\exp\left(\sigma^{-1}\int_{0}^{1}\left\langle b(X_{t},t)-a(X_{t},t),dB_{t}\right\rangle-\frac{1}{2\sigma^{2}}\int_{0}^{1}\|b(X_{t},t)-a(X_{t},t)\|_{2}^{2}dt\right), (A.8)

and the relative entropy of X\mathbb{P}_{X} with respect to Y\mathbb{P}_{Y} satisfies

𝔻KL(X||Y)=12σ201𝔼[b(Xt,t)a(Xt,t)22]dt,\mathbb{D}_{\mathrm{KL}}(\mathbb{P}_{X}||\mathbb{P}_{Y})=\frac{1}{2\sigma^{2}}\int_{0}^{1}\mathbb{E}\left[\|b(X_{t},t)-a(X_{t},t)\|_{2}^{2}\right]dt,

where probability distributions X,Y\mathbb{P}_{X},\mathbb{P}_{Y} are induced by process (Xt,0t1)(X_{t},0\leq t\leq 1) and (Yt,0t1)(Y_{t},0\leq t\leq 1), respectively.

Proof.

By the Novikov condition (Revuz and Yor, 2013), we know that

Mt:=exp(σ10tb(Xu,u)a(Xu,u),dBu12σ20tb(Xu,u)a(Xu,u)22𝑑u)M_{t}:=\exp\left(\sigma^{-1}\int_{0}^{t}\left\langle b(X_{u},u)-a(X_{u},u),dB_{u}\right\rangle-\frac{1}{2\sigma^{2}}\int_{0}^{t}\|b(X_{u},u)-a(X_{u},u)\|_{2}^{2}du\right)

is an exponential martingale with 𝔼Mt=1\mathbb{E}M_{t}=1 for all t[0,1]t\in[0,1]. Define a new probability measure \mathbb{Q} by d=M1dd\mathbb{Q}=M_{1}d\mathbb{P}. By Girsanov’s theorem (Revuz and Yor, 2013), under the new probability measure \mathbb{Q},

B~t:=Btσ10t(b(Xu,u)a(Xu,u))𝑑u\widetilde{B}_{t}:=B_{t}-\sigma^{-1}\int_{0}^{t}(b(X_{u},u)-a(X_{u},u))du

is a \mathbb{Q}-Brownian motion. Hence, under the new probability measure \mathbb{Q},

b(Xt,t)dt+σdB~t\displaystyle b(X_{t},t)dt+\sigma d\widetilde{B}_{t} =b(Xt,t)dt+σdBt(b(Xt,t)a(Xt,t))dt\displaystyle=b(X_{t},t)dt+\sigma dB_{t}-(b(X_{t},t)-a(X_{t},t))dt
=a(Xt,t)dt+σdBt=dXt.\displaystyle=a(X_{t},t)dt+\sigma dB_{t}=dX_{t}.

Thus, we have X=Y\mathbb{Q}_{X}=\mathbb{P}_{Y}, where X\mathbb{Q}_{X} is the distribution of XX under the measure \mathbb{Q}, and (A.8) follows. On the other hand, by the definition of the relative entropy of X\mathbb{P}_{X} with respect to Y\mathbb{P}_{Y}, we have

𝔻KL(X||Y)=𝔼[log(dYdX(X))]=12σ201𝔼[b(Xt,t)a(Xt,t)22]dt.\mathbb{D}_{\mathrm{KL}}(\mathbb{P}_{X}||\mathbb{P}_{Y})=\mathbb{E}\left[-\log\left(\frac{d\mathbb{P}_{Y}}{d\mathbb{P}_{X}}(X)\right)\right]=\frac{1}{2\sigma^{2}}\int_{0}^{1}\mathbb{E}\left[\|b(X_{t},t)-a(X_{t},t)\|_{2}^{2}\right]dt.

Therefore, the proof of Lemma 7.2 is completed. ∎
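As an illustrative special case of Lemma 7.2 (our own example, not from the paper): take a(x,t)0a(x,t)\equiv 0 and b(x,t)vb(x,t)\equiv v for a fixed vector vdv\in\mathbb{R}^{d}, so that Xt=X0+σBtX_{t}=X_{0}+\sigma B_{t} and the Novikov-type condition holds trivially. Then (A.8) reduces to the classical Cameron-Martin density and the entropy formula can be evaluated in closed form:

```latex
% Constant drifts: a \equiv 0, b \equiv v.  Then (A.8) gives the
% Cameron-Martin density
\frac{d\mathbb{P}_Y}{d\mathbb{P}_X}(X)
    = \exp\!\Big(\sigma^{-1}\langle v, B_1\rangle
                 - \frac{\|v\|_2^2}{2\sigma^2}\Big),
% which integrates to one under \mathbb{P}_X because
% \mathbb{E}\exp(\sigma^{-1}\langle v, B_1\rangle) = \exp(\|v\|_2^2/(2\sigma^2)).
% The relative entropy formula then reads
\mathbb{D}_{\mathrm{KL}}(\mathbb{P}_X \,\|\, \mathbb{P}_Y)
    = \frac{1}{2\sigma^2}\int_0^1 \|v\|_2^2\, dt
    = \frac{\|v\|_2^2}{2\sigma^2}.
```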

7.5. Proof of Theorem 3.2

Proof.

The result can be traced back to the 1980s; to keep the article self-contained, we use Laplace’s method as in Hwang (1980, 1981) to give a detailed proof. The key idea is to prove that, for each δ>0\delta^{\prime}>0 small enough, μσ({x;xxi2<δ})\mu_{\sigma}(\{x;\|x-x_{i}^{*}\|_{2}<\delta^{\prime}\}) converges to (det2V(xi))12j=1κ(det2V(xj))12\dfrac{\left(\det\nabla^{2}V(x_{i}^{*})\right)^{-\frac{1}{2}}}{\sum_{j=1}^{\kappa}\left(\det\nabla^{2}V(x_{j}^{*})\right)^{-\frac{1}{2}}} as σ0\sigma\downarrow 0. We first introduce the following notation

a(δ):=inf{V(x);xxi2δ,1iκ},\displaystyle a(\delta^{\prime}):=\inf\{V(x);\|x-x_{i}^{*}\|_{2}\geq\delta^{\prime},1\leq i\leq\kappa\},
m~i(σ,δ):=xxi2<δexp(V(x)σ)𝑑x,1iκ,\displaystyle\widetilde{m}_{i}(\sigma,\delta^{\prime}):=\int_{\|x-x_{i}^{*}\|_{2}<\delta^{\prime}}\exp\left(-\frac{V(x)}{\sigma}\right)dx,\quad 1\leq i\leq\kappa,
m~(σ,δ):=i=1κxxiδexp(V(x)σ)𝑑x.\displaystyle\widetilde{m}(\sigma,\delta^{\prime}):=\int_{\bigcup_{i=1}^{\kappa}\|x-x_{i}^{*}\|\geq\delta^{\prime}}\exp\left(-\frac{V(x)}{\sigma}\right)dx.

Hence, we have

μσ({x;xxi2<δ})=xxi2<δexp(V(x)σ)𝑑xdexp(V(x)σ)𝑑x=m~i(σ,δ)j=1κm~j(σ,δ)+m~(σ,δ).\mu_{\sigma}(\{x;\|x-x_{i}^{*}\|_{2}<\delta^{\prime}\})=\dfrac{\int_{\|x-x_{i}^{*}\|_{2}<\delta^{\prime}}\exp\left(-\frac{V(x)}{\sigma}\right)dx}{\int_{\mathbb{R}^{d}}\exp\left(-\frac{V(x)}{\sigma}\right)dx}=\frac{\widetilde{m}_{i}(\sigma,\delta^{\prime})}{\sum_{j=1}^{\kappa}\widetilde{m}_{j}(\sigma,\delta^{\prime})+\widetilde{m}(\sigma,\delta^{\prime})}. (A.9)

On the one hand, 2V(xi)\nabla^{2}V(x_{i}^{*}) is a positive definite symmetric matrix. For any ε(0,1)\varepsilon\in(0,1), we can choose 0<δ<ε0<\delta^{\prime}<\varepsilon such that

(xxi)(2V(xi)εId)(xxi)2V(x)V(xi)(xxi)(2V(xi)+εId)(xxi)2\frac{(x-x_{i}^{*})^{\top}(\nabla^{2}V(x_{i}^{*})-\varepsilon\mbox{\bf I}_{d})(x-x_{i}^{*})}{2}\leq V(x)-V(x_{i}^{*})\leq\frac{(x-x_{i}^{*})^{\top}(\nabla^{2}V(x_{i}^{*})+\varepsilon\mbox{\bf I}_{d})(x-x_{i}^{*})}{2}

holds for any xxi2<δ\|x-x_{i}^{*}\|_{2}<\delta^{\prime}. Thus, for any i=1,,κi=1,\ldots,\kappa, we obtain

(2πσ)d2eV(xi)σm~i(σ,δ)(2πσ)d2xxi2<δexp((xxi)(2V(xi)εId)(xxi)2σ)𝑑x,\displaystyle(2\pi\sigma)^{-\frac{d}{2}}e^{\frac{V(x_{i}^{*})}{\sigma}}\widetilde{m}_{i}(\sigma,\delta^{\prime})\leq(2\pi\sigma)^{-\frac{d}{2}}\int_{\|x-x_{i}^{*}\|_{2}<\delta^{\prime}}\exp\left(-\frac{(x-x_{i}^{*})^{\top}(\nabla^{2}V(x_{i}^{*})-\varepsilon\mbox{\bf I}_{d})(x-x_{i}^{*})}{2\sigma}\right)dx,
(2πσ)d2eV(xi)σm~i(σ,δ)(2πσ)d2xxi2<δexp((xxi)(2V(xi)+εId)(xxi)2σ)𝑑x.\displaystyle(2\pi\sigma)^{-\frac{d}{2}}e^{\frac{V(x_{i}^{*})}{\sigma}}\widetilde{m}_{i}(\sigma,\delta^{\prime})\geq(2\pi\sigma)^{-\frac{d}{2}}\int_{\|x-x_{i}^{*}\|_{2}<\delta^{\prime}}\exp\left(-\frac{(x-x_{i}^{*})^{\top}(\nabla^{2}V(x_{i}^{*})+\varepsilon\mbox{\bf I}_{d})(x-x_{i}^{*})}{2\sigma}\right)dx.

By Lemma 7.1 and letting σ0\sigma\rightarrow 0, we have

(det(2V(xi)+εId))12\displaystyle\left(\det(\nabla^{2}V(x_{i}^{*})+\varepsilon\mbox{\bf I}_{d})\right)^{-\frac{1}{2}} lim infσ0(2πσ)d2eV(xi)σm~i(σ,δ)\displaystyle\leq\liminf_{\sigma\to 0}(2\pi\sigma)^{-\frac{d}{2}}e^{\frac{V(x_{i}^{*})}{\sigma}}\widetilde{m}_{i}(\sigma,\delta^{\prime})
lim supσ0(2πσ)d2eV(xi)σm~i(σ,δ)\displaystyle\leq\limsup_{\sigma\to 0}(2\pi\sigma)^{-\frac{d}{2}}e^{\frac{V(x_{i}^{*})}{\sigma}}\widetilde{m}_{i}(\sigma,\delta^{\prime})
(det(2V(xi)εId))12.\displaystyle\leq\left(\det(\nabla^{2}V(x_{i}^{*})-\varepsilon\mbox{\bf I}_{d})\right)^{-\frac{1}{2}}.

As ε0\varepsilon\downarrow 0, we get

limσ0(2πσ)d2exp(V(xi)σ)m~i(σ,δ)=(det2V(xi))12.\lim_{\sigma\to 0}(2\pi\sigma)^{-\frac{d}{2}}\exp\left(\frac{V(x_{i}^{*})}{\sigma}\right)\widetilde{m}_{i}(\sigma,\delta^{\prime})=\left(\det\nabla^{2}V(x_{i}^{*})\right)^{-\frac{1}{2}}. (A.10)

On the other hand, we have

(2πσ)d2exp(V(xi)σ)m~(σ,δ)\displaystyle(2\pi\sigma)^{-\frac{d}{2}}\exp\left(\frac{V(x_{i}^{*})}{\sigma}\right)\widetilde{m}(\sigma,\delta^{\prime}) =(2πσ)d2exp(a(δ)V(xi)σ)\displaystyle=(2\pi\sigma)^{-\frac{d}{2}}\exp\left(-\frac{a(\delta^{\prime})-V(x_{i}^{*})}{\sigma}\right)
×i=1κxxiδexp(V(x)a(δ)σ)dx.\displaystyle\qquad\quad\times\int_{\bigcup_{i=1}^{\kappa}\|x-x_{i}^{*}\|\geq\delta^{\prime}}\exp\left(-\frac{V(x)-a(\delta^{\prime})}{\sigma}\right)dx.

Since a(δ):=inf{V(x);xxi2δ,1iκ}>V(xi)a(\delta^{\prime}):=\inf\{V(x);\|x-x_{i}^{*}\|_{2}\geq\delta^{\prime},1\leq i\leq\kappa\}>V(x_{i}^{*}) and V(x)a(δ)V(x)\geq a(\delta^{\prime}) holds on {xd:xxi2δ,1iκ}\{x\in\mathbb{R}^{d}:\|x-x_{i}^{*}\|_{2}\geq\delta^{\prime},1\leq i\leq\kappa\}, for any σ(0,1]\sigma\in(0,1] we have

i=1κxxiδexp(V(x)a(δ)σ)𝑑x\displaystyle\int_{\bigcup_{i=1}^{\kappa}\|x-x_{i}^{*}\|\geq\delta^{\prime}}\exp\left(-\frac{V(x)-a(\delta^{\prime})}{\sigma}\right)dx i=1κxxiδexp((V(x)a(δ)))𝑑x\displaystyle\leq\int_{\bigcup_{i=1}^{\kappa}\|x-x_{i}^{*}\|\geq\delta^{\prime}}\exp\left(-(V(x)-a(\delta^{\prime}))\right)dx
exp(a(δ))dexp(V(x))𝑑x<.\displaystyle\leq\exp(a(\delta^{\prime}))\int_{\mathbb{R}^{d}}\exp(-V(x))dx<\infty.

Also, it follows that

(2πσ)d2exp(a(δ)V(xi)σ)0as σ0.(2\pi\sigma)^{-\frac{d}{2}}\exp\left(-\frac{a(\delta^{\prime})-V(x_{i}^{*})}{\sigma}\right)\rightarrow 0\quad\text{as $\sigma\downarrow 0.$}

Thus we get

limσ0(2πσ)d2exp(V(xi)σ)m~(σ,δ)=0.\lim_{\sigma\to 0}(2\pi\sigma)^{-\frac{d}{2}}\exp\left(\frac{V(x_{i}^{*})}{\sigma}\right)\widetilde{m}(\sigma,\delta^{\prime})=0. (A.11)

Taking the limit $\sigma\downarrow 0$ in (A.9) and applying (A.10) and (A.11), we get

μσ({x;xxi2<δ})(det2V(xi))12j=1κ(det2V(xj))12as σ0.\mu_{\sigma}(\{x;\|x-x_{i}^{*}\|_{2}<\delta^{\prime}\})\rightarrow\frac{\left(\det\nabla^{2}V(x_{i}^{*})\right)^{-\frac{1}{2}}}{\sum_{j=1}^{\kappa}\left(\det\nabla^{2}V(x_{j}^{*})\right)^{-\frac{1}{2}}}\quad\text{as $\sigma\downarrow 0.$}

This completes the proof of Theorem 3.2. ∎
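The limiting weights $(\det\nabla^{2}V(x_{i}^{*}))^{-1/2}$ in Theorem 3.2 are easy to probe numerically in one dimension. The sketch below uses a toy double-well of our own, $V(x)=(x^{2}-1)^{2}(x^{2}+x+1)$, whose two global minima $x=\pm 1$ both have $V=0$ but different curvatures, $V''(-1)=8$ and $V''(1)=24$, so the mass ratio should approach $\sqrt{24/8}=\sqrt{3}$:

```python
import numpy as np

# Toy double-well (our own example): global minima at x = -1 and x = 1,
# both with V = 0, but different curvatures V''(-1) = 8, V''(1) = 24.
def V(x):
    return (x**2 - 1.0)**2 * (x**2 + x + 1.0)

sigma = 0.002
x = np.linspace(-3.0, 3.0, 600001)
dx = x[1] - x[0]
p = np.exp(-V(x) / sigma)  # unnormalised density of mu_sigma

mass_left = p[np.abs(x + 1.0) < 0.5].sum() * dx
mass_right = p[np.abs(x - 1.0) < 0.5].sum() * dx
ratio = mass_left / mass_right  # -> sqrt(V''(1)/V''(-1)) = sqrt(3) as sigma -> 0
```

At $\sigma=0.002$ the Laplace correction is already negligible and the computed ratio agrees with $\sqrt{3}\approx 1.732$ to a few parts per thousand.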

7.6. Proof of Theorem 3.5

Proof.

Note that

(V(X1)>τ)=V(x)>τexp(V(x)/σ)𝑑xdexp(V(x)/σ)𝑑x.\mathbb{P}(V(X_{1})>\tau)=\frac{\int_{V(x)>\tau}\exp(-V(x)/\sigma)dx}{\int_{\mathbb{R}^{d}}\exp(-V(x)/\sigma)dx}. (A.12)

Since $V(x)=\|x\|_{2}^{2}/2$ for all $\|x\|_{2}\geq R$, the potential $V$ has at least linear growth at infinity; that is, there exists a constant $C>0$ such that, for $R^{\star}$ large enough,

V(x)miny2=RV(y)+C(x2R) for x2>R.V(x)\geq\min_{\|y\|_{2}=R^{\star}}V(y)+C(\|x\|_{2}-R^{\star})\quad\text{ for $\|x\|_{2}>R^{\star}$.}

We can choose $R^{\star}$ sufficiently large that $\min_{\|y\|_{2}=R^{\star}}V(y)>\tau$. Hence,

V(x)τexp(V(x)/σ)𝑑x\displaystyle\int_{V(x)\geq\tau}\exp(-V(x)/\sigma)dx =V(x)τ,x2Rexp(V(x)/σ)𝑑x+V(x)τ,x2>Rexp(V(x)/σ)𝑑x\displaystyle=\int_{V(x)\geq\tau,\|x\|_{2}\leq R^{\star}}\exp(-V(x)/\sigma)dx+\int_{V(x)\geq\tau,\|x\|_{2}>R^{\star}}\exp(-V(x)/\sigma)dx
exp(τσ)Vol(BR)+V(x)τ,x2>Rexp(τ+C(x2R)σ)𝑑x\displaystyle\leq\exp\left(-\frac{\tau}{\sigma}\right)\text{Vol}(B_{R^{\star}})+\int_{V(x)\geq\tau,\|x\|_{2}>R^{\star}}\exp\left(-\frac{\tau+C(\|x\|_{2}-R^{\star})}{\sigma}\right)dx
exp(τσ)(Vol(BR)+dVol(B1)Rrd1exp{C(rR)}𝑑r)\displaystyle\leq\exp\left(-\frac{\tau}{\sigma}\right)\left(\text{Vol}(B_{R^{\star}})+d\text{Vol}(B_{1})\int_{R^{\star}}^{\infty}r^{d-1}\exp\left\{-C(r-R^{\star})\right\}dr\right)
2Vol(BR)exp(τσ),\displaystyle\leq 2\text{Vol}(B_{R^{\star}})\exp\left(-\frac{\tau}{\sigma}\right), (A.13)

where $\text{Vol}(B_{R^{\star}})$ denotes the volume of the ball of radius $R^{\star}$. On the other hand, since $\min_{x}V(x)=0$ and $V$ is continuous, there exists $r>0$ such that $V(x)<\varepsilon$ whenever $\|x\|_{2}<r$; hence

dexp(V(x)/σ)𝑑xx2<rexp(V(x)/σ)𝑑x>exp(εσ)Vol(Br).\int_{\mathbb{R}^{d}}\exp(-V(x)/\sigma)dx\geq\int_{\|x\|_{2}<r}\exp(-V(x)/\sigma)dx>\exp\left(-\frac{\varepsilon}{\sigma}\right)\text{Vol}(B_{r}). (A.14)

Injecting (A.13) and (A.14) into (A.12), we get

(V(X1)>τ)Cτ,ε,dexp(τεσ),\mathbb{P}(V(X_{1})>\tau)\leq C_{\tau,\varepsilon,d}\exp\left(-\frac{\tau-\varepsilon}{\sigma}\right),

where

Cτ,ε,d:=2Vol(BR)Vol(Br).C_{\tau,\varepsilon,d}:=\frac{2\text{Vol}(B_{R^{\star}})}{\text{Vol}(B_{r})}. (A.15)
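In the quadratic case the bound can be made fully explicit: for $V(x)=x^{2}/2$ in one dimension, $\mu_{\sigma}=N(0,\sigma)$ and $\mathbb{P}(V(X_{1})>\tau)=\mathrm{erfc}(\sqrt{\tau/\sigma})\leq\exp(-\tau/\sigma)$, matching the rate above with $\varepsilon\downarrow 0$. A quick numerical check (the sketch and all names are our own):

```python
import math

def tail_prob(tau, sigma):
    # For V(x) = x^2/2 on R, mu_sigma = N(0, sigma), so
    # P(V(X_1) > tau) = P(|X_1| > sqrt(2*tau)) = erfc(sqrt(tau/sigma)).
    return math.erfc(math.sqrt(tau / sigma))

for sigma in (0.5, 0.1, 0.02):
    for tau in (0.5, 1.0, 2.0):
        # the bound of the proof with epsilon -> 0:
        assert tail_prob(tau, sigma) <= math.exp(-tau / sigma)
```

The tail mass decays exponentially in $\tau/\sigma$, exactly as (A.15) predicts.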

Next, we prove that the second conclusion holds in the discrete case. Recall that $s=1/K$ is the step size, $t_{k}:=ks$ is the cumulative time up to iteration $k$, and $\{X_{t}\}_{t\in[0,1]}$ is the Schrödinger-Föllmer diffusion process (7). Let $\mu_{t_{k}}$ be the probability measure of $Y_{t_{k}}$ defined by (10). Then for fixed $\tau>0$, we have

\mathbb{P}(V(\widetilde{Y}_{t_{K}})>\tau) \leq\mathbb{P}(V(Y_{t_{K}})>\tau)+|\mathbb{P}(V(\widetilde{Y}_{t_{K}})>\tau)-\mathbb{P}(V(Y_{t_{K}})>\tau)|
\leq\mathbb{P}(V(Y_{t_{K}})>\tau)+\|\widetilde{Y}_{t_{K}}-Y_{t_{K}}\|_{TV}
\leq\mathbb{P}(V(X_{t_{K}})>\tau)+\|X_{t_{K}}-Y_{t_{K}}\|_{TV}+\|\widetilde{Y}_{t_{K}}-Y_{t_{K}}\|_{TV}
\leq\mathbb{P}(V(X_{t_{K}})>\tau)+\sqrt{2\mathbb{D}_{\mathrm{KL}}(\mu_{t_{K}}||\mu_{\sigma})}+\sqrt{2\mathbb{D}_{\mathrm{KL}}(\text{Law}(\widetilde{Y}_{t_{K}})||\mu_{t_{K}})}, (A.16)

where the last inequality uses Pinsker’s inequality (Bakry et al., 2014), and the second inequality holds because, letting $g(x):=\mathbbm{1}_{V(x)>\tau}$, we have

|\mathbb{P}(V(\widetilde{Y}_{t_{K}})>\tau)-\mathbb{P}(V(Y_{t_{K}})>\tau)| =\left|\mathbb{E}\left(g(\widetilde{Y}_{t_{K}})-g(Y_{t_{K}})\right)\right|
=\left|\int_{\mathbb{R}^{d}}g(x)d\left(\mathbb{P}_{\widetilde{Y}_{t_{K}}}(x)-\mathbb{P}_{Y_{t_{K}}}(x)\right)\right|
\leq|\mathbb{P}_{\widetilde{Y}_{t_{K}}}-\mathbb{P}_{Y_{t_{K}}}|(\mathbb{R}^{d})=\|\widetilde{Y}_{t_{K}}-Y_{t_{K}}\|_{TV},

where the total variation distance between two probability measures $\mu,\nu$ on $\mathbb{R}^{d}$ is defined by $\|\mu-\nu\|_{TV}:=\left|\mu-\nu\right|(\mathbb{R}^{d})=2\sup_{A}|\mu(A)-\nu(A)|$, the supremum running over Borel sets $A\subseteq\mathbb{R}^{d}$.
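With this normalisation, Pinsker’s inequality reads $\|\mu-\nu\|_{TV}\leq\sqrt{2\mathbb{D}_{\mathrm{KL}}(\mu||\nu)}$. For two unit-variance Gaussians $N(0,1)$ and $N(a,1)$ both sides are explicit, $\mathbb{D}_{\mathrm{KL}}=a^{2}/2$ and $\|\cdot\|_{TV}=2\,\mathrm{erf}(a/(2\sqrt{2}))$, which gives a quick sanity check (our own example, not from the paper):

```python
import math

def tv(a):
    # ||N(0,1) - N(a,1)||_TV with the 2*sup normalisation used above:
    # densities cross at x = a/2, so sup-difference = 2*Phi(a/2) - 1.
    return 2.0 * math.erf(a / (2.0 * math.sqrt(2.0)))

def pinsker_rhs(a):
    # sqrt(2 * KL(N(0,1) || N(a,1))) = sqrt(2 * a^2/2) = |a|
    return abs(a)

for a in (0.1, 0.5, 1.0, 2.0, 4.0):
    assert tv(a) <= pinsker_rhs(a)
```

For small $a$ the ratio of the two sides approaches $2/\sqrt{2\pi}\approx 0.798$, so the inequality is tight up to a constant in this regime.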

First, from the first part of the proof we obtain a bound for the first term on the right-hand side of (A.16): for each $\varepsilon\in(0,\tau)$, with the constant $C_{\tau,\varepsilon,d}$ defined in (A.15),

(V(XtK)>τ)Cτ,ε,dexp(τεσ).\mathbb{P}(V(X_{t_{K}})>\tau)\leq C_{\tau,\varepsilon,d}\exp\left(-\frac{\tau-\varepsilon}{\sigma}\right). (A.17)

Secondly, we bound $\mathbb{D}_{\mathrm{KL}}(\mu_{t_{K}}||\mu_{\sigma})$. To make use of continuous-time tools, we construct a continuous-time interpolation of the discrete-time algorithm (10). In particular, we define a stochastic process $\{Y_{t}\}_{t\in[0,1]}$ via the SDE

dYt=σb^(Yt,t)dt+σdBt,t[0,1],Y0=0,dY_{t}=\sigma\hat{b}(Y_{t},t)dt+\sqrt{\sigma}dB_{t},\quad t\in[0,1],Y_{0}=0, (A.18)

with the non-anticipative drift $\hat{b}(Y_{t},t):=\sum_{k=0}^{K-1}b(Y_{t_{k}},t_{k})\mathbbm{1}_{[t_{k},t_{k+1})}(t)$. By Proposition 2.3, the process $\{X_{t}\}_{t\in[0,1]}$ in (7) satisfies $X_{1}\sim\mu_{\sigma}$. Denote by $\mu_{1}^{x}$ and $\nu_{1}^{y}$ the laws on $C([0,1],\mathbb{R}^{d})$ of $(X_{t})_{t\in[0,1]}$ and $(Y_{t})_{t\in[0,1]}$, respectively. Combining (10), (A.18) and Lemma 7.2, we obtain

𝔻KL(μtK||μσ)\displaystyle\mathbb{D}_{\mathrm{KL}}(\mu_{t_{K}}||\mu_{\sigma}) 𝔻KL(ν1y||μ1x)=σ201𝔼(b(Yt,t)b^(Yt,t)22)dt\displaystyle\leq\mathbb{D}_{\mathrm{KL}}(\nu^{y}_{1}||\mu^{x}_{1})=\frac{\sigma}{2}\int_{0}^{1}\mathbb{E}\left(\|b(Y_{t},t)-\hat{b}(Y_{t},t)\|_{2}^{2}\right)dt
=σ2k=0K1tktk+1𝔼(b(Yt,t)b(Ytk,tk)22)𝑑t\displaystyle=\frac{\sigma}{2}\sum_{k=0}^{K-1}\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left(\|b(Y_{t},t)-b(Y_{t_{k}},t_{k})\|_{2}^{2}\right)dt
σdC1,σ2k=0K1tktk+1𝔼(YtYtk22+(ttk))𝑑t\displaystyle\leq\sigma dC_{1,\sigma}^{2}\sum_{k=0}^{K-1}\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left(\|Y_{t}-Y_{t_{k}}\|_{2}^{2}+(t-t_{k})\right)dt
=σdC1,σ2[k=0K1tktk+1𝔼(b(Ytk)(ttk)σ+σ(BtBtk)22)𝑑t+12K]\displaystyle=\sigma dC_{1,\sigma}^{2}\left[\sum_{k=0}^{K-1}\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left(\|b(Y_{t_{k}})(t-t_{k})\sigma+\sqrt{\sigma}(B_{t}-B_{t_{k}})\|_{2}^{2}\right)dt+\frac{1}{2K}\right]
2σdC1,σ2k=0K1tktk+1𝔼(b(Ytk,tk)22σ2(ttk)2+σd(ttk))𝑑t+σd2KC1,σ2\displaystyle\leq 2\sigma dC_{1,\sigma}^{2}\sum_{k=0}^{K-1}\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left(\|b(Y_{t_{k}},t_{k})\|_{2}^{2}\sigma^{2}(t-t_{k})^{2}+\sigma d(t-t_{k})\right)dt+\frac{\sigma d}{2K}C_{1,\sigma}^{2}
2dC1,σ2k=0K1tktk+1[(γσξσ)2σ3(ttk)2+d(ttk)σ2]𝑑t+σd2KC1,σ2\displaystyle\leq 2dC_{1,\sigma}^{2}\sum_{k=0}^{K-1}\int_{t_{k}}^{t_{k+1}}\left[\left(\frac{\gamma_{\sigma}}{\xi_{\sigma}}\right)^{2}\sigma^{3}(t-t_{k})^{2}+d(t-t_{k})\sigma^{2}\right]dt+\frac{\sigma d}{2K}C_{1,\sigma}^{2}
4dσ33K2(γσ4ξσ4+γσ6ξσ6)+σd(2dσ+1)K(γσ2ξσ2+γσ4ξσ4)d(2d+3)KC1,σ,\displaystyle\leq\frac{4d\sigma^{3}}{3K^{2}}\left(\frac{\gamma^{4}_{\sigma}}{\xi^{4}_{\sigma}}+\frac{\gamma^{6}_{\sigma}}{\xi^{6}_{\sigma}}\right)+\frac{\sigma d(2d\sigma+1)}{K}\left(\frac{\gamma^{2}_{\sigma}}{\xi^{2}_{\sigma}}+\frac{\gamma^{4}_{\sigma}}{\xi^{4}_{\sigma}}\right)\leq\frac{d(2d+3)}{K}C^{\star}_{1,\sigma}, (A.19)

where

C1,σ:=γσ2ξσ2+γσ4ξσ4+γσ6ξσ6,γσξσ={(M2,Rσ)2+M3,Rσ}exp(M1,Rm1,Rσ),C^{\star}_{1,\sigma}:=\frac{\gamma^{2}_{\sigma}}{\xi^{2}_{\sigma}}+\frac{\gamma^{4}_{\sigma}}{\xi^{4}_{\sigma}}+\frac{\gamma^{6}_{\sigma}}{\xi^{6}_{\sigma}},\quad\frac{\gamma_{\sigma}}{\xi_{\sigma}}=\left\{\left(\frac{M_{2,R}}{\sigma}\right)^{2}+\frac{M_{3,R}}{\sigma}\right\}\exp\left(\frac{M_{1,R}-m_{1,R}}{\sigma}\right),

where the second inequality holds by Remark 4.1 in Huang et al. (2021), the third inequality by $(a+b)^{2}\leq 2(a^{2}+b^{2})$, the second equality holds by the continuous-time interpolation (A.18), and the fourth inequality follows from $\|b(x,t)\|_{2}^{2}\leq\gamma_{\sigma}^{2}/\xi_{\sigma}^{2}$.

It remains to bound the relative entropy $\mathbb{D}_{\mathrm{KL}}(\text{Law}(\widetilde{Y}_{t_{K}})||\mu_{t_{K}})$. As in the bound for $\mathbb{D}_{\mathrm{KL}}(\mu_{t_{K}}||\mu_{\sigma})$, we construct a continuous-time interpolation process $\{\widetilde{Y}_{t}\}_{t\in[0,1]}$ defined by

dY~t=σb^m(Y~t,t)dt+σdBt,t[0,1],Y~0=0,d\widetilde{Y}_{t}=\sigma\hat{b}_{m}(\widetilde{Y}_{t},t)dt+\sqrt{\sigma}dB_{t},\quad t\in[0,1],\quad\widetilde{Y}_{0}=0, (A.20)

with the non-anticipative drift $\hat{b}_{m}(\widetilde{Y}_{t},t):=\sum_{k=0}^{K-1}\widetilde{b}_{m}(\widetilde{Y}_{t_{k}},t_{k})\mathbbm{1}_{[t_{k},t_{k+1})}(t)$, where $\widetilde{b}_{m}(\widetilde{Y}_{t_{k}},t_{k})$ is defined by (12) or (13). Denote by $\nu_{1}^{y}$ and $\widetilde{\nu}_{1}^{y}$ the laws on $C([0,1],\mathbb{R}^{d})$ of $(Y_{t})_{t\in[0,1]}$ and $(\widetilde{Y}_{t})_{t\in[0,1]}$, respectively. Combining (A.18), (A.20), Lemma 7.2 and Lemma 7.5, we get

𝔻KL(Law(Y~tK)||μtK)𝔻KL(ν~1y||ν1y)\displaystyle\mathbb{D}_{\mathrm{KL}}(\text{Law}(\widetilde{Y}_{t_{K}})||\mu_{t_{K}})\leq\mathbb{D}_{\mathrm{KL}}(\widetilde{\nu}_{1}^{y}||\nu^{y}_{1}) =σ201𝔼(b^(Y~t,t)b^m(Y~t,t)22)𝑑t\displaystyle=\frac{\sigma}{2}\int_{0}^{1}\mathbb{E}\left(\|\hat{b}(\widetilde{Y}_{t},t)-\hat{b}_{m}(\widetilde{Y}_{t},t)\|_{2}^{2}\right)dt
=σ2k=0K1tktk+1𝔼(b(Y~tk,tk)b~m(Y~tk,tk)22)𝑑t\displaystyle=\frac{\sigma}{2}\sum_{k=0}^{K-1}\int_{t_{k}}^{t_{k+1}}\mathbb{E}\left(\|b(\widetilde{Y}_{t_{k}},t_{k})-\widetilde{b}_{m}(\widetilde{Y}_{t_{k}},t_{k})\|_{2}^{2}\right)dt
4dσmC2,σ,\displaystyle\leq\frac{4d\sigma}{m}C^{\star}_{2,\sigma}, (A.21)

where

C2,σ:=γσ4ξσ4+γσ2ζσ2ξσ4,γσξσ={(M2,Rσ)2+M3,Rσ}exp(M1,Rm1,Rσ),C^{\star}_{2,\sigma}:=\frac{\gamma^{4}_{\sigma}}{\xi^{4}_{\sigma}}+\frac{\gamma_{\sigma}^{2}\zeta^{2}_{\sigma}}{\xi^{4}_{\sigma}},\quad\frac{\gamma_{\sigma}}{\xi_{\sigma}}=\left\{\left(\frac{M_{2,R}}{\sigma}\right)^{2}+\frac{M_{3,R}}{\sigma}\right\}\exp\left(\frac{M_{1,R}-m_{1,R}}{\sigma}\right),

with $\zeta_{\sigma}=\exp\left(M_{1,R}/\sigma\right)$. Injecting (A.17), (A.19) and (A.21) into (A.16) yields the desired results. ∎
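For reference, the discrete-time sampler analysed above fits in a few lines of NumPy. The sketch below is our reading of the Euler-Maruyama scheme (10) with the Monte Carlo drift estimate (12), taking $\hat{f}_{\sigma}(x)=\exp\{(-V(x)+\|x\|_{2}^{2}/2)/\sigma\}$ so that $\nabla\hat{f}_{\sigma}(x)=\hat{f}_{\sigma}(x)(x-\nabla V(x))/\sigma$; all function and variable names are our own:

```python
import numpy as np

def sf_sampler(V, grad_V, d, sigma, K=100, m=100, rng=None):
    """One draw of Y_tK: Euler-Maruyama with Monte Carlo drift (a sketch)."""
    rng = rng or np.random.default_rng(0)
    s = 1.0 / K                       # step size, t_k = k*s
    y = np.zeros(d)                   # Y_0 = 0
    for k in range(K):
        t = k * s
        # m Gaussian proposals for the expectations h and e in b = h/e
        z = y + np.sqrt((1.0 - t) * sigma) * rng.standard_normal((m, d))
        logw = (-V(z) + 0.5 * np.sum(z * z, axis=1)) / sigma  # log f_hat
        w = np.exp(logw - logw.max())
        w /= w.sum()                  # self-normalised weights: b_m = h_m / e_m
        b = w @ ((z - grad_V(z)) / sigma)
        y = y + sigma * s * b + np.sqrt(sigma * s) * rng.standard_normal(d)
    return y
```

For the quadratic potential $V(x)=\|x\|_{2}^{2}/2$ one has $\hat{f}_{\sigma}\equiv 1$, the drift vanishes, and $Y_{t_{K}}\sim N(0,\sigma\mbox{\bf I}_{d})$ exactly, which makes a convenient correctness check.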

7.7. Preliminary lemmas for Theorem 3.7

First, we introduce Lemmas 7.3-7.5, which prepare for the proof of Theorem 3.7.

Lemma 7.3.

Assume that (P1) and (P2) hold. Then

𝔼[Xt22]2(C0,σ+d)exp(2C0,σt).\displaystyle\mathbb{E}\left[\|X_{t}\|_{2}^{2}\right]\leq 2(C_{0,\sigma}+d)\exp(2C_{0,\sigma}t).
Proof.

From the definition of XtX_{t} in (7), we have Xt2σ0tb(Xu,u)2du+σBt2.\|X_{t}\|_{2}\leq\sigma\int_{0}^{t}\|b(X_{u},u)\|_{2}\mathrm{d}u+\sqrt{\sigma}\|B_{t}\|_{2}. Then, we can get

Xt22\displaystyle\|X_{t}\|_{2}^{2} 2σ2(0tb(Xu,u)2du)2+2σBt22\displaystyle\leq 2\sigma^{2}\left(\int_{0}^{t}\|b(X_{u},u)\|_{2}\mathrm{d}u\right)^{2}+2\sigma\|B_{t}\|_{2}^{2}
2t0tb(Xu,u)22du+2Bt22\displaystyle\leq 2t\int_{0}^{t}\|b(X_{u},u)\|_{2}^{2}\mathrm{d}u+2\|B_{t}\|_{2}^{2}
2t0tC0,σ(Xu22+1)du+2Bt22,\displaystyle\leq 2t\int_{0}^{t}C_{0,\sigma}\left(\|X_{u}\|_{2}^{2}+1\right)\mathrm{d}u+2\|B_{t}\|_{2}^{2},

where the first inequality holds by $(a+c)^{2}\leq 2a^{2}+2c^{2}$, the second by the Cauchy-Schwarz inequality together with $\sigma\leq 1$, and the last by (C1). Thus,

𝔼[Xt22]\displaystyle\mathbb{E}\left[\|X_{t}\|_{2}^{2}\right] 2t0tC0,σ(𝔼[Xu22]+1)du+2𝔼[Bt22]\displaystyle\leq 2t\int_{0}^{t}C_{0,\sigma}\left(\mathbb{E}\left[\|X_{u}\|_{2}^{2}\right]+1\right)\mathrm{d}u+2\mathbb{E}\left[\|B_{t}\|_{2}^{2}\right]
2C0,σ0t𝔼[Xu22]du+2(C0,σ+d).\displaystyle\leq 2C_{0,\sigma}\int_{0}^{t}\mathbb{E}\left[\|X_{u}\|_{2}^{2}\right]\mathrm{d}u+2(C_{0,\sigma}+d).

By the Grönwall inequality, we have

𝔼[Xt22]2(C0,σ+d)exp(2C0,σt).\displaystyle\mathbb{E}\left[\|X_{t}\|_{2}^{2}\right]\leq 2(C_{0,\sigma}+d)\exp(2C_{0,\sigma}t).

Lemma 7.4.

Assume that (P1) and (P2) hold. Then for any $0\leq t_{1}\leq t_{2}\leq 1$,

𝔼[Xt2Xt122]2(t2t1){1+C0,σexp(2C0,σ+1)}.\displaystyle\mathbb{E}\left[\|X_{t_{2}}-X_{t_{1}}\|_{2}^{2}\right]\leq 2(t_{2}-t_{1})\left\{1+C_{0,\sigma}\exp(2\sqrt{C_{0,\sigma}}+1)\right\}.
Proof.

Using (C1) and the elementary inequality $2ab\leq\varepsilon a^{2}+b^{2}/\varepsilon$, one can derive that for any $\varepsilon>0$,

2x,b(x,t)εx22+b(x,t)22εεx22+C0,σε(1+x22).2\left\langle x,b(x,t)\right\rangle\leq\varepsilon\|x\|^{2}_{2}+\frac{\|b(x,t)\|^{2}_{2}}{\varepsilon}\leq\varepsilon\|x\|^{2}_{2}+\frac{C_{0,\sigma}}{\varepsilon}(1+\|x\|^{2}_{2}).

Letting ε=C0,σ\varepsilon=\sqrt{C_{0,\sigma}} yields

x,b(x,t)C0,σ(1+x22).\left\langle x,b(x,t)\right\rangle\leq\sqrt{C_{0,\sigma}}(1+\|x\|^{2}_{2}). (A.22)

From the definition of XtX_{t} in (7), we have

Xt=σ0tb(Xs,s)𝑑s+σ0t𝑑Bs,t[0,1].X_{t}=\sigma\int_{0}^{t}b(X_{s},s)ds+\sqrt{\sigma}\int_{0}^{t}dB_{s},\quad\forall t\in[0,1].

On the one hand, by the Itô formula and (A.22), for any t[0,1]t\in[0,1], we have

1+Xt22\displaystyle 1+\|X_{t}\|^{2}_{2} =1+2σ0tXs,b(Xs,s)𝑑s+0tσ𝑑s+20tXs,σdBs\displaystyle=1+2\sigma\int_{0}^{t}\left\langle X_{s},b(X_{s},s)\right\rangle ds+\int_{0}^{t}\sigma ds+2\int_{0}^{t}\left\langle X_{s},\sqrt{\sigma}dB_{s}\right\rangle
1+20t(Xsb(Xs,s)+12)𝑑s+20tXs,dBs\displaystyle\leq 1+2\int_{0}^{t}\left(X^{\top}_{s}b(X_{s},s)+\frac{1}{2}\right)ds+2\int_{0}^{t}\left\langle X_{s},dB_{s}\right\rangle
1+2α0t(1+Xs22)𝑑s+20tXs,dBs,\displaystyle\leq 1+2\alpha\int_{0}^{t}\left(1+\|X_{s}\|^{2}_{2}\right)ds+2\int_{0}^{t}\left\langle X_{s},dB_{s}\right\rangle,

where $\alpha:=\sqrt{C_{0,\sigma}}+1/2$. Taking expectations (the stochastic integral has mean zero), we get

𝔼[1+Xt22]1+2α0t𝔼[1+Xs22]𝑑s.\mathbb{E}\left[1+\|X_{t}\|^{2}_{2}\right]\leq 1+2\alpha\int_{0}^{t}\mathbb{E}\left[1+\|X_{s}\|^{2}_{2}\right]ds.

The Grönwall inequality yields

\mathbb{E}\left[1+\|X_{t}\|_{2}^{2}\right]\leq\exp\left(2\alpha t\right)\leq\exp\left(2\sqrt{C_{0,\sigma}}+1\right),\quad\forall t\in[0,1]. (A.23)

On the other hand, by the elementary inequality $(a+b)^{2}\leq 2(a^{2}+b^{2})$, we get

𝔼[Xt2Xt122]2𝔼[(t1t2σb(Xs,s)𝑑s)2]+2𝔼[(t1t2σ𝑑Bs)2].\mathbb{E}\left[\|X_{t_{2}}-X_{t_{1}}\|^{2}_{2}\right]\leq 2\mathbb{E}\left[\left(\int_{t_{1}}^{t_{2}}\sigma b(X_{s},s)ds\right)^{2}\right]+2\mathbb{E}\left[\left(\int_{t_{1}}^{t_{2}}\sqrt{\sigma}dB_{s}\right)^{2}\right].

Thus, using the Cauchy-Schwarz inequality, Burkholder-Davis-Gundy inequality, (C1) and (A.23), we deduce that

𝔼[Xt2Xt122]\displaystyle\mathbb{E}\left[\|X_{t_{2}}-X_{t_{1}}\|^{2}_{2}\right] 2σ2(t2t1)t1t2𝔼[b(Xs,s)22]𝑑s+2σ(t2t1)\displaystyle\leq 2\sigma^{2}(t_{2}-t_{1})\int_{t_{1}}^{t_{2}}\mathbb{E}\left[\|b(X_{s},s)\|_{2}^{2}\right]ds+2\sigma(t_{2}-t_{1})
2C0,σt1t2𝔼[1+Xs22]𝑑s+2(t2t1)\displaystyle\leq 2C_{0,\sigma}\int_{t_{1}}^{t_{2}}\mathbb{E}\left[1+\|X_{s}\|^{2}_{2}\right]ds+2(t_{2}-t_{1})
{2+2C0,σexp(2C0,σ+1)}(t2t1).\displaystyle\leq\left\{2+2C_{0,\sigma}\exp(2\sqrt{C_{0,\sigma}}+1)\right\}(t_{2}-t_{1}).

Lemma 7.5.

Assume that (P1) and (P2) hold. Then

supxd,t[0,1]𝔼[b(x,t)b~m(t,x)22]4dmC2,σ,\displaystyle\underset{x\in\mathbb{R}^{d},t\in[0,1]}{\sup}\mathbb{E}\left[\|b(x,t)-\tilde{b}_{m}(t,x)\|_{2}^{2}\right]\leq\frac{4d}{m}C^{\star}_{2,\sigma},

where

C2,σ:=γσ4ξσ4+γσ2ζσ2ξσ4,γσξσ={(M2,Rσ)2+M3,Rσ}exp(M1,Rm1,Rσ),C^{\star}_{2,\sigma}:=\frac{\gamma^{4}_{\sigma}}{\xi^{4}_{\sigma}}+\frac{\gamma_{\sigma}^{2}\zeta^{2}_{\sigma}}{\xi^{4}_{\sigma}},\quad\frac{\gamma_{\sigma}}{\xi_{\sigma}}=\left\{\left(\frac{M_{2,R}}{\sigma}\right)^{2}+\frac{M_{3,R}}{\sigma}\right\}\exp\left(\frac{M_{1,R}-m_{1,R}}{\sigma}\right),

with ζσ=exp(M1,R/σ)\zeta_{\sigma}=\exp\left(M_{1,R}/\sigma\right).

Proof.

Let $\mathbf{Z}=\{Z_{1},\ldots,Z_{m}\}$ and $\mathbf{Z}^{\prime}=\{Z_{1}^{\prime},\ldots,Z_{m}^{\prime}\}$ be two independent sequences of i.i.d. copies of $Z\sim N(0,\mbox{\bf I}_{d})$. For notational convenience, we also denote

h:=𝔼Z[f^σ(x+(1t)σZ)],hm:=i=1mf^σ(x+(1t)σZi)m,\displaystyle h:=\mathbb{E}_{Z}\left[\nabla\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z)\right],\leavevmode\nobreak\ h_{m}:=\frac{\sum_{i=1}^{m}\nabla\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{i})}{m},
e:=𝔼Z[f^σ(x+(1t)σZ)],em:=i=1mf^σ(x+(1t)σZi)m,\displaystyle e:=\mathbb{E}_{Z}\left[\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z)\right],\leavevmode\nobreak\ e_{m}:=\frac{\sum_{i=1}^{m}\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{i})}{m},
hm:=i=1mf^σ(x+(1t)σZi)m,em:=i=1mf^σ(x+(1t)σZi)m.\displaystyle h_{m}^{\prime}:=\frac{\sum_{i=1}^{m}\nabla\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{i}^{\prime})}{m},\leavevmode\nobreak\ e_{m}^{\prime}:=\frac{\sum_{i=1}^{m}\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{i}^{\prime})}{m}.

Since $h-h_{m}=\mathbb{E}\left[h_{m}^{\prime}-h_{m}|\mathbf{Z}\right]$, Jensen’s inequality gives $\|h-h_{m}\|^{2}_{2}\leq\mathbb{E}\left[\|h_{m}^{\prime}-h_{m}\|^{2}_{2}|\mathbf{Z}\right]$. Moreover, we have

𝔼[hhm2]\displaystyle\mathbb{E}\left[\|h-h_{m}\|^{2}\right] 𝔼{𝔼[hmhm22|𝐙]}=𝔼[hmhm22]\displaystyle\leq\mathbb{E}\left\{\mathbb{E}\left[\|h_{m}^{\prime}-h_{m}\|^{2}_{2}|{\bf Z}\right]\right\}=\mathbb{E}\left[\|h_{m}^{\prime}-h_{m}\|^{2}_{2}\right]
=𝔼Z1,Z1[f^σ(x+(1t)σZ1)f^σ(x+(1t)σZ1)22]m\displaystyle=\frac{\mathbb{E}_{Z_{1},Z_{1}^{\prime}}\left[\left\|\nabla\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{1})-\nabla\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{1}^{\prime})\right\|^{2}_{2}\right]}{m}
σ(1t)γσ2m𝔼Z1,Z1[Z1Z122]\displaystyle\leq\frac{\sigma(1-t)\gamma_{\sigma}^{2}}{m}\mathbb{E}_{Z_{1},Z_{1}^{\prime}}\left[\left\|Z_{1}-Z_{1}^{\prime}\right\|^{2}_{2}\right]
2dγσ2m,\displaystyle\leq\frac{2d\gamma_{\sigma}^{2}}{m}, (A.24)

where the second inequality holds by (P1), and the last inequality follows from $\sigma(1-t)\leq 1$ and the fact that $Z_{1}$ and $Z_{1}^{\prime}$ are independent standard normal vectors, so that $\mathbb{E}_{Z_{1},Z_{1}^{\prime}}[\|Z_{1}-Z_{1}^{\prime}\|_{2}^{2}]=2d$. Similarly, we also have

𝔼[|eem|2]\displaystyle\mathbb{E}\left[|e-e_{m}|^{2}\right] 𝔼[|emem|2]\displaystyle\leq\mathbb{E}\left[|e_{m}^{\prime}-e_{m}|^{2}\right]
=𝔼Z1,Z1[|f^σ(x+(1t)σZ1)f^σ(x+(1t)σZ1)|2]m\displaystyle=\frac{\mathbb{E}_{Z_{1},Z_{1}^{\prime}}\left[\left|\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{1})-\hat{f}_{\sigma}(x+\sqrt{(1-t)\sigma}Z_{1}^{\prime})\right|^{2}\right]}{m}\
σ(1t)γσ2m𝔼Z1,Z1[Z1Z122]\displaystyle\leq\frac{\sigma(1-t)\gamma_{\sigma}^{2}}{m}\mathbb{E}_{Z_{1},Z_{1}^{\prime}}\left[\left\|Z_{1}-Z_{1}^{\prime}\right\|^{2}_{2}\right]
2dγσ2m,\displaystyle\leq\frac{2d\gamma_{\sigma}^{2}}{m}, (A.25)

where the second inequality holds due to (P1). Thus, by (A.24) and (A.25), it follows that

supxd,t[0,1]𝔼[hhm22]2dγσ2m,\displaystyle\underset{x\in\mathbb{R}^{d},t\in[0,1]}{\sup}\mathbb{E}\left[\left\|h-h_{m}\right\|_{2}^{2}\right]\leq\frac{2d\gamma_{\sigma}^{2}}{m}, (A.26)
supxd,t[0,1]𝔼[|eem|2]2dγσ2m.\displaystyle\underset{x\in\mathbb{R}^{d},t\in[0,1]}{\sup}\mathbb{E}\left[|e-e_{m}|^{2}\right]\leq\frac{2d\gamma_{\sigma}^{2}}{m}. (A.27)

Then, by (P1) and (P2), some simple calculations yield that

b(x,t)b~m(x,t)2\displaystyle\|b(x,t)-\tilde{b}_{m}(x,t)\|_{2} =hehmem2\displaystyle=\left\|\frac{h}{e}-\frac{h_{m}}{e_{m}}\right\|_{2}
h2|eme|+hhm2|e||eem|\displaystyle\leq\frac{\|h\|_{2}|e_{m}-e|+\|h-h_{m}\|_{2}|e|}{|ee_{m}|}
γσ|eme|+hhm2|e|ξσ2.\displaystyle\leq\frac{\gamma_{\sigma}|e_{m}-e|+\|h-h_{m}\|_{2}|e|}{\xi_{\sigma}^{2}}. (A.28)

Recall that ζσ=exp(M1,R/σ)\zeta_{\sigma}=\exp(M_{1,R}/\sigma) with

M1,R:=maxx2R{V(x)+x222},\quad M_{1,R}:=\max_{\|x\|_{2}\leq R}\left\{-V(x)+\frac{\|x\|^{2}_{2}}{2}\right\},

then $\hat{f}_{\sigma}(x)\leq\zeta_{\sigma}$ for any $x\in B_{R}$. Further, by (A.28), it follows that for all $x\in\mathbb{R}^{d}$ and $t\in[0,1]$,

b(x,t)b~m(x,t)22\displaystyle\|b(x,t)-\tilde{b}_{m}(x,t)\|_{2}^{2} 2γσ2|eme|2+ζσ2hhm22ξσ4.\displaystyle\leq 2\frac{\gamma_{\sigma}^{2}|e_{m}-e|^{2}+\zeta_{\sigma}^{2}\|h-h_{m}\|_{2}^{2}}{\xi_{\sigma}^{4}}. (A.29)

Then, combining (A.26)-(A.27) and (A.29), it follows that

supxd,t[0,1]𝔼[b(x,t)b~m(t,x)22]4dmC2,σ,\displaystyle\underset{x\in\mathbb{R}^{d},t\in[0,1]}{\sup}\mathbb{E}\left[\|b(x,t)-\tilde{b}_{m}(t,x)\|_{2}^{2}\right]\leq\frac{4d}{m}C^{\star}_{2,\sigma},

where

C2,σ:=γσ4ξσ4+γσ2ζσ2ξσ4,γσξσ={(M2,Rσ)2+M3,Rσ}exp(M1,Rm1,Rσ),ζσ=exp(M1,Rσ).C^{\star}_{2,\sigma}:=\frac{\gamma^{4}_{\sigma}}{\xi^{4}_{\sigma}}+\frac{\gamma_{\sigma}^{2}\zeta^{2}_{\sigma}}{\xi^{4}_{\sigma}},\quad\frac{\gamma_{\sigma}}{\xi_{\sigma}}=\left\{\left(\frac{M_{2,R}}{\sigma}\right)^{2}+\frac{M_{3,R}}{\sigma}\right\}\exp\left(\frac{M_{1,R}-m_{1,R}}{\sigma}\right),\quad\zeta_{\sigma}=\exp\left(\frac{M_{1,R}}{\sigma}\right).
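The $4dC^{\star}_{2,\sigma}/m$ bound above is the usual Monte Carlo $O(1/m)$ mean-squared error, which is easy to probe empirically. In the sketch below we use $\tanh$ as a generic bounded integrand (our own stand-in for $\hat{f}_{\sigma}$ and $\nabla\hat{f}_{\sigma}$; all names are ours):

```python
import numpy as np

def mc_mse(m, reps=4000, rng=None):
    # Mean-squared error of the m-sample Monte Carlo estimate of
    # E[tanh(Z)] = 0, Z ~ N(0,1), averaged over `reps` replications.
    rng = rng or np.random.default_rng(7)
    est = np.tanh(rng.standard_normal((reps, m))).mean(axis=1)
    return float(np.mean(est**2))

mse_10, mse_160 = mc_mse(10), mc_mse(160)
# O(1/m) rate: 16x more samples should shrink the MSE by roughly 16x
```

The measured ratio `mse_10 / mse_160` lands close to 16, consistent with the $1/m$ rate in Lemma 7.5.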

Lemma 7.6.

Assume that (P1) and (P2) hold. Then for any $k=0,1,\ldots,K$,

𝔼[Y~tk22]6γσ2ξσ2+3d.\displaystyle\mathbb{E}\left[\|\widetilde{Y}_{t_{k}}\|^{2}_{2}\right]\leq\frac{6\gamma_{\sigma}^{2}}{\xi_{\sigma}^{2}}+3d.
Proof.

Define $\Theta_{k,t}:=\widetilde{Y}_{t_{k}}+\sigma(t-t_{k})\tilde{b}_{m}(\widetilde{Y}_{t_{k}},t_{k})$, so that $\widetilde{Y}_{t}=\Theta_{k,t}+\sqrt{\sigma}(B_{t}-B_{t_{k}})$ for $t_{k}\leq t\leq t_{k+1}$ with $k=0,1,\ldots,K-1$. By (P1) and (P2), it follows that for all $x\in\mathbb{R}^{d}$ and $t\in[0,1]$,

b(x,t)22γσ2ξσ2,b~m(x,t)22γσ2ξσ2.\displaystyle\|b(x,t)\|_{2}^{2}\leq\frac{\gamma_{\sigma}^{2}}{\xi_{\sigma}^{2}},\leavevmode\nobreak\ \leavevmode\nobreak\ \|\tilde{b}_{m}(x,t)\|_{2}^{2}\leq\frac{\gamma_{\sigma}^{2}}{\xi^{2}_{\sigma}}. (A.30)

Then, by (A.30) and s=1/Ks=1/K, we have

Θk,t22\displaystyle\|\Theta_{k,t}\|^{2}_{2} =Y~tk22+σ2(ttk)2b~m(Y~tk,tk)22+2σ(ttk)Y~tk,b~m(Y~tk,tk)\displaystyle=\|\widetilde{Y}_{t_{k}}\|^{2}_{2}+\sigma^{2}(t-t_{k})^{2}\|\tilde{b}_{m}(\widetilde{Y}_{t_{k}},t_{k})\|^{2}_{2}+2\sigma(t-t_{k})\left<\widetilde{Y}_{t_{k}},\tilde{b}_{m}(\widetilde{Y}_{t_{k}},t_{k})\right>
(1+sσ)Y~tk22+γσ2(sσ+s2σ2)ξσ2.\displaystyle\leq(1+s\sigma)\|\widetilde{Y}_{t_{k}}\|^{2}_{2}+\frac{\gamma_{\sigma}^{2}(s\sigma+s^{2}\sigma^{2})}{\xi^{2}_{\sigma}}.

Further, we can get

𝔼[Y~t22|Y~tk]\displaystyle\mathbb{E}\left[\|\left.\widetilde{Y}_{t}\|^{2}_{2}\right|\widetilde{Y}_{t_{k}}\right] =𝔼[Θk,t22|Y~tk]+σ(ttk)d\displaystyle=\mathbb{E}\left[\|\left.\Theta_{k,t}\|^{2}_{2}\right|\widetilde{Y}_{t_{k}}\right]+\sigma(t-t_{k})d
(1+σs)Y~tk22+(sσ+s2σ2)γσ2ξσ2+σsd.\displaystyle\leq(1+\sigma s)\|\widetilde{Y}_{t_{k}}\|^{2}_{2}+\frac{(s\sigma+s^{2}\sigma^{2})\gamma_{\sigma}^{2}}{\xi^{2}_{\sigma}}+\sigma sd.

Therefore,

𝔼[Y~tk+122](1+sσ)𝔼[Y~tk22]+(sσ+s2σ2)γσ2ξσ2+σsd.\displaystyle\mathbb{E}\left[\|\widetilde{Y}_{t_{k+1}}\|^{2}_{2}\right]\leq(1+s\sigma)\mathbb{E}\left[\|\widetilde{Y}_{t_{k}}\|^{2}_{2}\right]+\frac{(s\sigma+s^{2}\sigma^{2})\gamma_{\sigma}^{2}}{\xi^{2}_{\sigma}}+\sigma sd.

Since $\widetilde{Y}_{t_{0}}=0$, by induction we have

𝔼[Y~tk+122]e(k+1)sσ(d+(1+sσ)γσ2ξσ2)e(d+2γσ2ξσ2)6γσ2ξσ2+3d.\displaystyle\mathbb{E}\left[\|\widetilde{Y}_{t_{k+1}}\|^{2}_{2}\right]\leq e^{(k+1)s\sigma}\left(d+\frac{(1+s\sigma)\gamma_{\sigma}^{2}}{\xi^{2}_{\sigma}}\right)\leq e\left(d+\frac{2\gamma_{\sigma}^{2}}{\xi^{2}_{\sigma}}\right)\leq\frac{6\gamma_{\sigma}^{2}}{\xi^{2}_{\sigma}}+3d.

7.8. Proof of Theorem 3.7

Proof.

From the definitions of Y~tk\widetilde{Y}_{t_{k}} and XtkX_{t_{k}}, we have

Y~tkXtk22\displaystyle\|\widetilde{Y}_{t_{k}}-X_{t_{k}}\|_{2}^{2} Y~tk1Xtk122+(tk1tkσb(Xu,u)b~m(Y~tk1,tk1)2du)2\displaystyle\leq\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}+\left(\int_{t_{k-1}}^{t_{k}}\sigma\|b(X_{u},u)-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}\mathrm{d}u\right)^{2}
+2σY~tk1Xtk12(tk1tkb(Xu,u)b~m(Y~tk1,tk1)2du)\displaystyle\qquad+2\sigma\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}\left(\int_{t_{k-1}}^{t_{k}}\|b(X_{u},u)-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}\mathrm{d}u\right)
(1+s)Y~tk1Xtk122+(1+s)tk1tkb(Xu,u)b~m(Y~tk1,tk1)22du\displaystyle\leq(1+s)\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}+(1+s)\int_{t_{k-1}}^{t_{k}}\|b(X_{u},u)-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2}\mathrm{d}u
(1+s)Y~tk1Xtk122+2(1+s)tk1tkb(Xu,u)b(Y~tk1,tk1)22du\displaystyle\leq(1+s)\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}+2(1+s)\int_{t_{k-1}}^{t_{k}}\|b(X_{u},u)-b(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2}\mathrm{d}u
+2s(1+s)b(Y~tk1,tk1)b~m(Y~tk1,tk1)22\displaystyle\qquad+2s(1+s)\|b(\widetilde{Y}_{t_{k-1}},t_{k-1})-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2}
(1+s)Y~tk1Xtk122+4C2,σ2(1+s)tk1tk[XuY~tk122+|utk1|]du\displaystyle\leq(1+s)\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}+4C_{2,\sigma}^{2}(1+s)\int_{t_{k-1}}^{t_{k}}\left[\|X_{u}-\widetilde{Y}_{t_{k-1}}\|_{2}^{2}+|u-t_{k-1}|\right]\mathrm{d}u
+2s(1+s)b(Y~tk1,tk1)b~m(Y~tk1,tk1)22\displaystyle\qquad+2s(1+s)\|b(\widetilde{Y}_{t_{k-1}},t_{k-1})-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2}
(1+s)Y~tk1Xtk122+8C2,σ2(1+s)tk1tkXuXtk122du\displaystyle\leq(1+s)\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}+8C_{2,\sigma}^{2}(1+s)\int_{t_{k-1}}^{t_{k}}\|X_{u}-X_{t_{k-1}}\|_{2}^{2}\mathrm{d}u
+8C2,σ2s(1+s)Xtk1Y~tk122+4C2,σ2(1+s)s2\displaystyle\qquad+8C_{2,\sigma}^{2}s(1+s)\|X_{t_{k-1}}-\widetilde{Y}_{t_{k-1}}\|_{2}^{2}+4C_{2,\sigma}^{2}(1+s)s^{2}
+2s(1+s)b(Y~tk1,tk1)b~m(Y~tk1,tk1)22\displaystyle\qquad+2s(1+s)\|b(\widetilde{Y}_{t_{k-1}},t_{k-1})-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2}
(1+s+8C2,σ2(s+s2))Y~tk1Xtk122+8C2,σ2(1+s)tk1tkXuXtk122du\displaystyle\leq(1+s+8C_{2,\sigma}^{2}(s+s^{2}))\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}+8C_{2,\sigma}^{2}(1+s)\int_{t_{k-1}}^{t_{k}}\|X_{u}-X_{t_{k-1}}\|_{2}^{2}\mathrm{d}u
+4C2,σ2(1+s)s2+2s(1+s)b(Y~tk1,tk1)b~m(Y~tk1,tk1)22,\displaystyle\qquad+4C_{2,\sigma}^{2}(1+s)s^{2}+2s(1+s)\|b(\widetilde{Y}_{t_{k-1}},t_{k-1})-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2},

where the second inequality holds due to 2acsa2+c2/s2ac\leq sa^{2}+c^{2}/s for s=1/Ks=1/K and the fourth inequality holds by condition (C3). Then, we obtain

𝔼[Y~tkXtk22]\displaystyle\mathbb{E}\left[\|\widetilde{Y}_{t_{k}}-X_{t_{k}}\|_{2}^{2}\right]
(1+s+8C2,σ2(s+s2))𝔼[Y~tk1Xtk122]+8C2,σ2(1+s)tk1tk𝔼[XuXtk122]du\displaystyle\leq(1+s+8C_{2,\sigma}^{2}(s+s^{2}))\mathbb{E}\left[\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}\right]+8C_{2,\sigma}^{2}(1+s)\int_{t_{k-1}}^{t_{k}}\mathbb{E}\left[\|X_{u}-X_{t_{k-1}}\|_{2}^{2}\right]\mathrm{d}u
+4C2,σ2(s2+s3)+2s(1+s)𝔼[b(Y~tk1,tk1)b~m(Y~tk1,tk1)22]\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ +4C_{2,\sigma}^{2}(s^{2}+s^{3})+2s(1+s)\mathbb{E}\left[\|b(\widetilde{Y}_{t_{k-1}},t_{k-1})-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2}\right]
(1+s+8C2,σ2(s+s2))𝔼[Y~tk1Xtk122]+H(s,σ)+4C2,σ2(s2+s3)\displaystyle\leq(1+s+8C_{2,\sigma}^{2}(s+s^{2}))\mathbb{E}\left[\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}\right]+H(s,\sigma)+4C_{2,\sigma}^{2}(s^{2}+s^{3})
+2s(1+s)𝔼[b(Y~tk1,tk1)b~m(Y~tk1,tk1)22]\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ +2s(1+s)\mathbb{E}\left[\|b(\widetilde{Y}_{t_{k-1}},t_{k-1})-\tilde{b}_{m}(\widetilde{Y}_{t_{k-1}},t_{k-1})\|_{2}^{2}\right]
(1+s+8C2,σ2(s+s2))𝔼[Y~tk1Xtk122]+H(s,σ)+8s(1+s)dmC2,σ,\displaystyle\leq(1+s+8C_{2,\sigma}^{2}(s+s^{2}))\mathbb{E}\left[\|\widetilde{Y}_{t_{k-1}}-X_{t_{k-1}}\|_{2}^{2}\right]+H(s,\sigma)+\frac{8s(1+s)d}{m}C^{\star}_{2,\sigma},

where $H(s,\sigma):=16C_{2,\sigma}^{2}(s^{2}+s^{3})\left\{1+C_{0,\sigma}\exp(2\sqrt{C_{0,\sigma}}+1)\right\}+4C^{2}_{2,\sigma}(s^{2}+s^{3})$, the second inequality follows from Lemma 7.4, and the last inequality holds by Lemma 7.5. Since $\widetilde{Y}_{t_{0}}=X_{t_{0}}=0$, we can conclude that there exists a constant $C_{\sigma}>0$ such that

𝔼[Y~tKXtK22]\displaystyle\mathbb{E}\left[\|\widetilde{Y}_{t_{K}}-X_{t_{K}}\|_{2}^{2}\right] (1+s+8C2,σ2(s+s2))K1s+8C2,σ2(s+s2)[H(s,σ)+8s(1+s)dmC2,σ]\displaystyle\leq\frac{(1+s+8C_{2,\sigma}^{2}(s+s^{2}))^{K}-1}{s+8C_{2,\sigma}^{2}(s+s^{2})}\left[H(s,\sigma)+\frac{8s(1+s)d}{m}C^{\star}_{2,\sigma}\right]
Cσ(sC3,σ+16dmC2,σ),\displaystyle\leq C_{\sigma}\left(sC^{\star}_{3,\sigma}+\frac{16d}{m}C^{\star}_{2,\sigma}\right),

where

Cσ:=exp(1+16C2,σ2),C3,σ:=40C2,σ2+32C2,σ2C0,σexp(2C0,σ+1).\displaystyle C_{\sigma}:=\exp(1+16C_{2,\sigma}^{2}),\quad C^{\star}_{3,\sigma}:=40C^{2}_{2,\sigma}+32C^{2}_{2,\sigma}C_{0,\sigma}\exp(2\sqrt{C_{0,\sigma}}+1).

Therefore,

W22(Law(Y~tK),μσ)Cσ(sC3,σ+16dmC2,σ).\displaystyle W^{2}_{2}(\text{Law}(\widetilde{Y}_{t_{K}}),\mu_{\sigma})\leq C_{\sigma}\left(sC^{\star}_{3,\sigma}+\frac{16d}{m}C^{\star}_{2,\sigma}\right).

This completes the proof. ∎

References

  • Bakry et al. (2008) Bakry D, Cattiaux P, Guillin A (2008) Rate of convergence for ergodic continuous markov processes: Lyapunov versus poincaré. Journal of Functional Analysis 254(3):727–759
  • Bakry et al. (2014) Bakry D, Gentil I, Ledoux M (2014) Analysis and geometry of Markov diffusion operators, vol 103. Springer
  • Barkhagen et al. (2021) Barkhagen M, Chau NH, Moulines É, Rásonyi M, Sabanis S, Zhang Y (2021) On stochastic gradient langevin dynamics with dependent data streams in the logconcave case. Bernoulli 27(1):1–33
  • Bernton et al. (2019) Bernton E, Heng J, Doucet A, Jacob PE (2019) Schrödinger bridge samplers. arXiv preprint arXiv:191213170
  • Bou-Rabee et al. (2020) Bou-Rabee N, Eberle A, Zimmer R (2020) Coupling and convergence for hamiltonian monte carlo. The Annals of applied probability 30(3):1209–1250
  • Carrillo et al. (2021) Carrillo JA, Jin S, Li L, Zhu Y (2021) A consensus-based global optimization method for high dimensional machine learning problems. ESAIM: Control, Optimisation and Calculus of Variations 27:S5
  • Chau et al. (2021) Chau NH, Moulines É, Rásonyi M, Sabanis S, Zhang Y (2021) On stochastic gradient langevin dynamics with dependent data streams: The fully non-convex case. SIAM Journal on Mathematics of Data Science 3(3):959–986
  • Chen et al. (2021) Chen Y, Georgiou TT, Pavon M (2021) Stochastic control liaisons: Richard sinkhorn meets gaspard monge on a schrödinger bridge. SIAM Review 63(2):249–313
  • Cheng and Bartlett (2018) Cheng X, Bartlett P (2018) Convergence of langevin mcmc in kl-divergence. Proceedings of Machine Learning Research, Volume 83: Algorithmic Learning Theory pp 186–211
  • Cheng et al. (2018) Cheng X, Chatterji NS, Bartlett PL, Jordan MI (2018) Underdamped langevin mcmc: A non-asymptotic analysis. In: Conference on learning theory, PMLR, pp 300–323
  • Chiang et al. (1987) Chiang TS, Hwang CR, Sheu SJ (1987) Diffusion for global optimization in r^n. SIAM Journal on Control and Optimization 25(3):737–753
  • Dai Pra (1991) Dai Pra P (1991) A stochastic control approach to reciprocal diffusion processes. Applied mathematics and Optimization 23(1):313–329
  • Dalalyan (2017a) Dalalyan AS (2017a) Further and stronger analogy between sampling and optimization: Langevin monte carlo and gradient descent. pp 678–689
  • Dalalyan (2017b) Dalalyan AS (2017b) Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 79(3):651–676
  • Dalalyan and Karagulyan (2019) Dalalyan AS, Karagulyan AG (2019) User-friendly guarantees for the langevin monte carlo with inaccurate gradient. Stochastic Processes and their Applications 129(12):5278–5311
  • Dupuis and Ellis (2011) Dupuis P, Ellis RS (2011) A weak convergence approach to the theory of large deviations. John Wiley & Sons
  • Durmus and Moulines (2016) Durmus A, Moulines E (2016) Sampling from a strongly log-concave distribution with the unadjusted Langevin algorithm. arXiv preprint arXiv:1605.01559
  • Durmus and Moulines (2017) Durmus A, Moulines E (2017) Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability 27(3):1551–1587
  • Durmus and Moulines (2019) Durmus A, Moulines E (2019) High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli 25(4A):2854–2882
  • Eldan et al. (2020) Eldan R, Lehec J, Shenfeld Y (2020) Stability of the logarithmic Sobolev inequality via the Föllmer process. In: Annales de l’Institut Henri Poincaré, Probabilités et Statistiques, Institut Henri Poincaré, vol 56, pp 2253–2269
  • Föllmer (1985) Föllmer H (1985) An entropy approach to the time reversal of diffusion processes. In: Stochastic Differential Systems Filtering and Control, Springer, pp 156–163
  • Föllmer (1986) Föllmer H (1986) Time reversal on Wiener space. In: Stochastic processes–mathematics and physics, Springer, pp 119–129
  • Föllmer (1988) Föllmer H (1988) Random fields and diffusion processes. In: École d’Été de Probabilités de Saint-Flour XV–XVII, 1985–87, Springer, pp 101–203
  • Garbuno-Inigo et al. (2020) Garbuno-Inigo A, Hoffmann F, Li W, Stuart AM (2020) Interacting Langevin diffusions: Gradient structure and ensemble Kalman sampler. SIAM Journal on Applied Dynamical Systems 19(1):412–441
  • Hale (2010) Hale JK (2010) Asymptotic behavior of dissipative systems. Mathematical Surveys and Monographs, vol 25, American Mathematical Society
  • Holley et al. (1989) Holley RA, Kusuoka S, Stroock DW (1989) Asymptotics of the spectral gap with applications to the theory of simulated annealing. Journal of Functional Analysis 83(2):333–347
  • Huang et al. (2021) Huang J, Jiao Y, Kang L, Liao X, Liu J, Liu Y (2021) Schrödinger-Föllmer sampler: Sampling without ergodicity. arXiv preprint arXiv:2106.10880
  • Hwang (1980) Hwang CR (1980) Laplace’s method revisited: weak convergence of probability measures. The Annals of Probability pp 1177–1182
  • Hwang (1981) Hwang CR (1981) A generalization of Laplace’s method. Proceedings of the American Mathematical Society 82(3):446–451
  • Iacus and Yoshida (2018) Iacus SM, Yoshida N (2018) Simulation and inference for stochastic processes with YUIMA. Springer
  • Jamison (1975) Jamison B (1975) The Markov processes of Schrödinger. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 32(4):323–331
  • Jiao et al. (2021) Jiao Y, Kang L, Liu Y, Zhou Y (2021) Convergence analysis of Schrödinger-Föllmer sampler without convexity. arXiv preprint arXiv:2107.04766
  • Landsman and Nešlehová (2008) Landsman Z, Nešlehová J (2008) Stein’s lemma for elliptical random vectors. Journal of Multivariate Analysis 99(5):912–927
  • Lehec (2013) Lehec J (2013) Representation formula for the entropy and functional inequalities. In: Annales de l’IHP Probabilités et statistiques, vol 49, pp 885–899
  • Léonard (2014) Léonard C (2014) A survey of the Schrödinger problem and some of its connections with optimal transport. Discrete & Continuous Dynamical Systems-A 34(4):1533
  • Ma et al. (2019) Ma YA, Chen Y, Jin C, Flammarion N, Jordan MI (2019) Sampling can be faster than optimization. Proceedings of the National Academy of Sciences 116(42):20881–20885
  • Márquez (1997) Márquez D (1997) Convergence rates for annealing diffusion processes. The Annals of Applied Probability pp 1118–1139
  • Menz and Schlichting (2014) Menz G, Schlichting A (2014) Poincaré and logarithmic Sobolev inequalities by decomposition of the energy landscape. Annals of Probability 42(5):1809–1884
  • Mou et al. (2022) Mou W, Flammarion N, Wainwright MJ, Bartlett PL (2022) Improved bounds for discretization of Langevin diffusions: Near-optimal rates without convexity. Bernoulli 28(3):1577–1601
  • Pavliotis (2014) Pavliotis GA (2014) Stochastic processes and applications: diffusion processes, the Fokker-Planck and Langevin equations, vol 60. Springer
  • Peyré and Cuturi (2019) Peyré G, Cuturi M (2019) Computational optimal transport: With applications to data science. Foundations and Trends® in Machine Learning 11(5-6):355–607
  • Raginsky et al. (2017) Raginsky M, Rakhlin A, Telgarsky M (2017) Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In: Conference on Learning Theory, PMLR, pp 1674–1703
  • Revuz and Yor (2013) Revuz D, Yor M (2013) Continuous martingales and Brownian motion, vol 293. Springer Science & Business Media
  • Ruzayqat et al. (2022) Ruzayqat H, Beskos A, Crisan D, Jasra A, Kantas N (2022) Unbiased estimation using a class of diffusion processes. arXiv preprint arXiv:2203.03013
  • Schrödinger (1932) Schrödinger E (1932) Sur la théorie relativiste de l’électron et l’interprétation de la mécanique quantique. In: Annales de l’Institut Henri Poincaré, vol 2, pp 269–310
  • Sinkhorn (1964) Sinkhorn R (1964) A relationship between arbitrary positive matrices and doubly stochastic matrices. The Annals of Mathematical Statistics 35(2):876–879
  • Stein (1972) Stein C (1972) A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In: Proceedings of the sixth Berkeley symposium on mathematical statistics and probability, volume 2: Probability theory, University of California Press, pp 583–602
  • Stein (1986) Stein C (1986) Approximate computation of expectations. Institute of Mathematical Statistics
  • Tzen and Raginsky (2019) Tzen B, Raginsky M (2019) Theoretical guarantees for sampling and inference in generative models with latent diffusions. In: Conference on Learning Theory, PMLR, pp 3084–3114
  • Varadhan (1966) Varadhan SS (1966) Asymptotic probabilities and differential equations. Communications on Pure and Applied Mathematics 19(3):261–286
  • Vargas et al. (2021) Vargas F, Ovsianas A, Fernandes D, Girolami M, Lawrence N, Nüsken N (2021) Bayesian learning via neural Schrödinger-Föllmer flows. arXiv preprint arXiv:2111.10510
  • Wang (2009) Wang FY (2009) Log-Sobolev inequalities: different roles of Ric and Hess. The Annals of Probability 37(4):1587–1604
  • Wang et al. (2021) Wang G, Jiao Y, Xu Q, Wang Y, Yang C (2021) Deep generative learning via Schrödinger bridge. In: International Conference on Machine Learning, PMLR, pp 10794–10804
  • Wang and Li (2022) Wang Y, Li W (2022) Accelerated information gradient flow. Journal of Scientific Computing 90(1):1–47
  • Zhang and Chen (2021) Zhang Q, Chen Y (2021) Path integral sampler: a stochastic control approach for sampling. arXiv preprint arXiv:2111.15141
  • Zhang et al. (2019) Zhang Y, Akyildiz ÖD, Damoulas T, Sabanis S (2019) Nonasymptotic estimates for stochastic gradient Langevin dynamics under local conditions in nonconvex optimization. arXiv preprint arXiv:1910.02008