
Wasserstein distance estimates for the distributions of numerical approximations to ergodic stochastic differential equations.

J. M. Sanz-Serna$^{1}$ and Konstantinos C. Zygalakis$^{2}$
Abstract

We present a framework that allows for the non-asymptotic study of the $2$-Wasserstein distance between the invariant distribution of an ergodic stochastic differential equation and the distribution of its numerical approximation in the strongly log-concave case. This allows us to study in a unified way a number of different integrators proposed in the literature for the overdamped and underdamped Langevin dynamics. In addition, we analyse a novel splitting method for the underdamped Langevin dynamics which only requires one gradient evaluation per time step. Under an additional smoothness assumption on a $d$-dimensional strongly log-concave distribution with condition number $\kappa$, the algorithm is shown to produce with an $\mathcal{O}\big(\kappa^{5/4}d^{1/4}\epsilon^{-1/2}\big)$ complexity samples from a distribution that, in Wasserstein distance, is at most $\epsilon>0$ away from the target distribution.

$^{1}$Departamento de Matemáticas, Universidad Carlos III de Madrid, Leganés (Madrid), Spain. $^{2}$School of Mathematics, University of Edinburgh, Edinburgh, Scotland.

1 Introduction

The problem of sampling from a target probability distribution $\pi^{\star}(x)$ in $\mathbb{R}^{d}$ is ubiquitous throughout applied mathematics, statistics, molecular dynamics, statistical physics and other fields. A typical approach for solving such problems is to construct a Markov process on $\mathbb{R}^{m}$, $m\geq d$, whose equilibrium distribution (or a suitable marginal of it) is designed to coincide with $\pi^{\star}$ [7]. Often such Markov processes are obtained by solving stochastic differential equations (SDEs). Two typical examples of such SDEs are the overdamped Langevin equation, $c>0$,

dx = -c\nabla f(x)\,dt + \sqrt{2c}\,dW(t), \qquad (1.1)

and the underdamped Langevin equation, $c,\gamma>0$,

dv = -\gamma v\,dt - c\nabla f(x)\,dt + \sqrt{2\gamma c}\,dW(t), \qquad (1.2a)
dx = v\,dt. \qquad (1.2b)

Under mild assumptions on $f(x)$ one can show that the dynamics of (1.1) is ergodic with respect to the distribution $\pi^{\star}$ with density $\propto\exp(-f(x))$, while the dynamics of (1.2) is ergodic with respect to $\pi^{\star}$ with density $\propto\exp(-f(x)-\frac{1}{2c}\|v\|^{2})$.

Equations (1.1), (1.2) [43] provide the basis for different computational devices for sampling from $\pi^{\star}$. In particular, one can obtain samples from $\pi^{\star}$ by discretizing the SDEs and generating numerical solutions over a long time interval [37]. One needs to be careful with the integrator that is used, since the discrete Markov chain resulting from the numerical discretization might not be ergodic [42]. In addition, even if that chain is ergodic, it is normally the case that the stationary distribution $\widehat{\pi}^{\star}$ of the numerical solution is different from $\pi^{\star}$. The study of the asymptotic error between $\widehat{\pi}^{\star}$ and $\pi^{\star}$ has received a lot of attention in the literature. The work in [34] investigates the effect of the numerical discretization on the convergence of the ergodic averages, while [1] presents general order conditions for the numerical invariant measure $\widehat{\pi}^{\star}$ to approximate $\pi^{\star}$ with high order of accuracy, by exploiting the connections between partial differential equations and SDEs [31]. A number of recent papers have applied this framework to numerical integrators for the underdamped Langevin equation [2], as well as to the case of stochastic gradient Langevin dynamics [51, 30]. In addition, [50, 27] extended this framework to the case of post-processed integrators and to SDEs on manifolds.

Another line of research that has received much attention in the last few years deals with the study of the non-asymptotic error between the numerical approximation and the invariant measure $\pi^{\star}$. In particular, for the case of the overdamped Langevin equation (1.1) and log-concave and strongly log-concave distributions, [12] established non-asymptotic bounds in total variation distance for the Euler-Maruyama method and an explicit extension of it based on further smoothness assumptions. These results have been extended to the Wasserstein distance $W_{2}$ in e.g. [17, 11, 18, 13, 16], while the paper [25] obtains similar bounds for implicit methods applied to (1.1). Similar non-asymptotic analyses for the case of the underdamped Langevin equation appear in [10, 15, 39, 48, 21]. One of the aims of all that literature is to study the number of steps $n$ that the integrators require to achieve a target accuracy $\epsilon$ when applied to $d$-dimensional targets with condition number $\kappa$. Underdamped discretizations may lead to a better dependence of $n$ on $\epsilon$ and $d$ than their overdamped counterparts. The case of non-strongly log-concave distributions and the non-asymptotic behaviour of numerical algorithms has also received attention recently [14, 33].

In this work, we present a unified framework that allows for the non-asymptotic study of numerical methods for ergodic stochastic differential equations (including equations (1.1) and (1.2)) in the case of strongly log-concave distributions. In particular, we obtain a general bound for the error in $W_{2}$ between $\pi^{\star}$ and the probability distribution of the numerical solution after $n$ iterations. This bound depends on two factors: the first can be controlled by understanding the contractivity properties of the numerical method, while the second is directly related to the local strong error of the integrator. Moving to integrators with smaller local strong error results in a better performance when the dimensionality grows and the error level $\epsilon$ decreases; moving to integrators that are contractive for larger step sizes improves the performance for large condition numbers. This is consistent with what has been suggested in the literature [25, 41].

As an application of the suggested framework, we study two numerical methods for the underdamped Langevin dynamics. The first is the method, that we shall call EE, used in [10]; the second is a splitting method called UBU. Both require the same computational effort, namely one gradient evaluation per time step, but UBU has better convergence properties [3, 4].

  • For the integrator EE, we prove that, in $2$-Wasserstein distance and for a strongly log-concave $d$-dimensional distribution with condition number $\kappa$, the algorithm produces a distribution that is $\epsilon$-away from the target in a number of steps that (up to logarithmic terms) behaves like $\mathcal{O}(\epsilon^{-1}\kappa^{3/2}d^{1/2})$. This improves on the $\mathcal{O}(\epsilon^{-1}\kappa^{2}d^{1/2})$ estimate in [10]. EE has also been analysed in [15]; however, the analysis in that reference has severe limitations, as discussed in Section 6.

  • UBU, under the same hypotheses as EE, shares the $\mathcal{O}(\epsilon^{-1}\kappa^{3/2}d^{1/2})$ estimate. However, unlike EE, UBU is capable of leveraging additional smoothness properties of the log-density of the target. With such an additional smoothness assumption, we prove an estimate that depends on $\epsilon$, $\kappa$ and $d$ as $\mathcal{O}(\epsilon^{-1/2}\kappa^{5/4}d^{1/4})$ (there is also a dependence on a bound for the third derivatives of the target log-density).

Even though a detailed comparison between UBU and alternative algorithms is not within the scope of the present paper, the following comments are in order.

  • As we will discuss in Remark 6.3, for fixed $\kappa$, the improvement from the $\epsilon^{-1}d^{1/2}$ EE estimate to the $\epsilon^{-1/2}d^{1/4}$ UBU estimate arises from EE having strong order one and UBU having strong order two. This shows the importance of strong second-order integrators. A strong second-order discretization of the underdamped Langevin dynamics that requires evaluation of the Hessian has been introduced in [15]. However the analysis in that reference only holds for unrealistic values of the stepsize, see Section 6.2.

  • The randomized midpoint method in [48] uses two gradient evaluations per time step and may be regarded as optimal [9] among the integrators of the underdamped Langevin dynamics that use the driving Brownian motion, weighted integrals of it, and an oracle for $\nabla f$. For Lipschitz gradients, the estimate of the mixing time is $\mathcal{O}(\epsilon^{-1/3}\kappa^{7/6}d^{1/6}+\epsilon^{-2/3}\kappa d^{1/3})$, where we note the favourable dependence on $\kappa$, which stems from the random nature of the algorithm (see [9] and its references). For fixed $\kappa$ we then find an $\epsilon^{-2/3}d^{1/3}$ behaviour, to be compared with the $\epsilon^{-1/2}d^{1/4}$ estimate of UBU when the extra smoothness assumption holds. See Remark 6.3.

  • The algorithm suggested in [40] is not based on integrating the underdamped Langevin equation but an alternative system of stochastic differential equations where $x$ has additional smoothness (see Remark 3.3). For fixed $\kappa$, the authors prove an $\mathcal{O}(\epsilon^{-1/2}d^{1/4})$ estimate of the mixing time (i.e. the behaviour established here for UBU). That reference does not investigate the dependence of the mixing time on $\kappa$; numerical experiments suggest the algorithm does not operate satisfactorily when the condition number is large.

The main contributions of this work are:

  1. The use of an appropriate state-space form representation of SDEs and their numerical integrators that allows one to establish contractivity estimates both for the time-continuous process and its numerical solution.

  2. A study of the contractivity of integrators for the underdamped Langevin dynamics that takes into account the possible impact of increasing condition numbers.

  3. A general result that allows one to obtain bounds for the $2$-Wasserstein distance between the target distribution and its numerical approximations for general SDEs. In particular the result may be applied to discretizations of the overdamped and underdamped Langevin equations.

  4. We improve on the analysis in [10] and explain the reasons why similar improvements may be expected when analysing other integrators.

  5. We suggest the use in sampling of UBU, a splitting integrator for the underdamped Langevin equations that only requires one gradient evaluation per step and possesses second order weak and strong accuracy.

  6. We provide non-asymptotic estimates of the sampling accuracy of UBU.

The rest of the paper is organised as follows. In Section 2 we set up notation and discuss the different smoothness assumptions on $f$ that we will employ throughout the paper. In Section 3 we present the stochastic differential equations (SDEs) we are concerned with. These are written in a state-space form framework, similar to that used (for other purposes) in [32, 20, 45]. This framework is useful here because it makes it easy (see Propositions 3.8 and 3.10) to investigate the contractivity properties that underlie the SDE Wasserstein distance estimates between the push-forward in time of two initial probability distributions (Proposition 3.6). Section 4, parallel to Section 3, is devoted to the integrators and their contractivity. Again a state-space framework is used that makes it possible to easily investigate the contractivity of the integrators. Section 5 contains one of the main contributions of this paper, Theorem 5.2, which provides a general result for getting bounds of the Wasserstein distance between the invariant distribution $\pi^{\star}$ of the SDE and the distribution of the numerical solution. To apply Theorem 5.2 one needs (1) to establish a contractivity estimate for the integrator and (2) to prove what we call a local error bound. The latter is essentially a mean square bound of the difference between a single step of the integrator and a corresponding step with the SDE, under the assumption that the initial data for the step follows the distribution $\pi^{\star}$. Section 6 applies the general result to investigate two discretizations of the underdamped Langevin dynamics. The final Section 7 contains some additional results and also the more technical proofs of the results in the preceding sections.

The extension of the material in this paper to variable step sizes and to inaccurate gradients is certainly possible, but will not be considered.

2 Preliminaries

We will now discuss some assumptions on $f$, as well as set up some notation that we will later use.

2.1 Smoothness properties of $f$

The symbol $\|\cdot\|$ always refers to the standard Euclidean norm. Throughout the paper we shall assume that the following two conditions hold:

Assumption 2.1.

$f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is twice differentiable and $L$-smooth, i.e.

\forall x,y\in\mathbb{R}^{d},\qquad \|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|. \qquad (2.1)
Assumption 2.2.

$f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is $m$-strongly convex, i.e.

\forall x,y\in\mathbb{R}^{d},\qquad f(y)\geq f(x)+\left\langle\nabla f(x),y-x\right\rangle+\frac{m}{2}\|x-y\|^{2}.

It is well known that these two assumptions are equivalent to the Hessian of $f$, which we will denote by $\mathcal{H}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d\times d}$, being positive definite and satisfying $mI_{d\times d}\preceq\mathcal{H}(x)\preceq LI_{d\times d}$. In studies like the present one, Assumptions 2.1 and 2.2 are standard in the literature: see, among others, [11, 17, 18, 13] for the overdamped Langevin dynamics and [10, 15, 39, 21] for the underdamped case.

In addition to these two assumptions, the following further smoothness assumption on $f$ will be used when it comes to analysing higher strong-order discretizations for the underdamped Langevin equation. The symbol $\mathcal{H}^{\prime}$ denotes the tensor of third derivatives (derivative of the Hessian); at each $x\in\mathbb{R}^{d}$, $\mathcal{H}^{\prime}(x)$ is a bilinear operator mapping pairs $(w_{1},w_{2})\in\mathbb{R}^{d}\times\mathbb{R}^{d}$ into vectors in $\mathbb{R}^{d}$.

Assumption 2.3.

$f$ is three times differentiable and there is a constant $L_{1}\geq 0$ such that at each point $x\in\mathbb{R}^{d}$, for arbitrary $w_{1}$, $w_{2}$:

\|\mathcal{H}^{\prime}(x)[w_{1},w_{2}]\|\leq L_{1}\|w_{1}\|\,\|w_{2}\|.

2.2 Wasserstein distance

Let $\pi_{1}$ and $\pi_{2}$ be two probability measures on $\mathbb{R}^{N}$. The $2$-Wasserstein distance between $\pi_{1},\pi_{2}$ is given by

W_{2}(\pi_{1},\pi_{2})=\left(\inf_{\zeta\in Z}\int\|x-y\|^{2}\,d\zeta(x,y)\right)^{1/2},

where $Z$ is the set of all couplings [19] between $\pi_{1}$ and $\pi_{2}$, i.e. the set of all probability distributions in $\mathbb{R}^{N}\times\mathbb{R}^{N}$ whose marginals on the first and second factors are $\pi_{1}$ and $\pi_{2}$ respectively. More generally, if $P$ is an $N\times N$ positive definite symmetric matrix, we will use the distance

W_{P}(\pi_{1},\pi_{2})=\left(\inf_{\zeta\in Z}\int\|x-y\|_{P}^{2}\,d\zeta(x,y)\right)^{1/2},

where the $P$-norm is defined by $\|\xi\|_{P}=(\xi^{T}P\xi)^{1/2}$. Since the $P$-norm and the standard Euclidean norm are related by

p_{\min}\|\cdot\|^{2}\leq\|\cdot\|_{P}^{2}\leq p_{\max}\|\cdot\|^{2}, \qquad (2.2)

where $p_{\min}$ and $p_{\max}$ are the smallest and largest eigenvalues of $P$, we also have

p_{\min}W_{2}^{2}(\pi_{1},\pi_{2})\leq W_{P}^{2}(\pi_{1},\pi_{2})\leq p_{\max}W_{2}^{2}(\pi_{1},\pi_{2}),

for arbitrary $\pi_{1}$, $\pi_{2}$. Therefore bounds for the metric $W_{P}$ may immediately be translated into bounds for $W_{2}$ and vice versa.
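As a concrete illustration of the relation between $W_{P}$ and $W_{2}$ (this sketch is ours, not part of the original paper), the code below estimates both distances between two empirical measures with equal numbers of atoms by solving the optimal assignment problem, and checks the sandwich inequality above; the matrix $P$ and the sample clouds are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
N, n = 2, 200                                 # dimension and atoms per empirical measure
X = rng.normal(size=(n, N))                   # samples from pi_1 (illustrative)
Y = 1.5 * rng.normal(size=(n, N)) + 1.0       # samples from pi_2 (illustrative)
P = np.array([[1.0, 1.0], [1.0, 2.0]])        # positive definite weight matrix

def w2sq_empirical(X, Y, M=None):
    """Squared 2-Wasserstein distance between two equally weighted empirical
    measures, in the norm ||z||_M^2 = z^T M z (M = identity by default)."""
    if M is None:
        M = np.eye(X.shape[1])
    diff = X[:, None, :] - Y[None, :, :]               # all pairwise differences
    cost = np.einsum('ijk,kl,ijl->ij', diff, M, diff)  # pairwise squared M-norms
    row, col = linear_sum_assignment(cost)             # optimal coupling = assignment
    return cost[row, col].mean()

W2sq = w2sq_empirical(X, Y)
WPsq = w2sq_empirical(X, Y, P)
pmin, pmax = np.linalg.eigvalsh(P)[[0, -1]]
# p_min W_2^2 <= W_P^2 <= p_max W_2^2 holds for the empirical measures
print(WPsq, pmin * W2sq, pmax * W2sq)
assert pmin * W2sq - 1e-9 <= WPsq <= pmax * W2sq + 1e-9
```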

3 Stochastic differential equations

In this section we will study some properties of a class of ergodic stochastic differential equations that includes (1.1) and (1.2). In particular, we will extend to the stochastic case a control theoretical framework used in [20, 32] to analyse optimization algorithms, and study properties of such SDEs, including the existence of an invariant measure, and the speed of convergence to equilibrium in the Wasserstein distance.

3.1 State-space form

We are concerned with sampling algorithms obtained by discretizing SDEs with additive noise that may be written as linear systems in state-space form (note that this excludes algorithms, like the Riemann manifold MALA in [22], that use multiplicative noise; also Hamiltonian Monte Carlo [6] and similar piecewise deterministic samplers that use jumps do not fit in the present study):

d\xi(t) = A\xi(t)\,dt + Bu(t)\,dt + \sigma\,dW(t), \qquad (3.1a)
x(t) = C\xi(t), \qquad (3.1b)
u(t) = \nabla f(x(t)). \qquad (3.1c)

Here $\xi\in\mathbb{R}^{N}$ is the state, $u\in\mathbb{R}^{d}$ is the input, $x\in\mathbb{R}^{d}$ is the output that is mapped to $u$ by the nonlinear map $\nabla f$, and $W$ represents the standard $M$-dimensional Brownian motion. The real matrices $A$, $B$, $C$ and $\sigma$ are constant, with sizes $N\times N$, $N\times d$, $d\times N$ and $N\times M$ respectively. We define

D=(1/2)\,\sigma\sigma^{T},

and note that, since the right-hand side of (3.1a) is globally Lipschitz continuous, the solution exists and is unique.

Example 3.1.

The simplest case corresponds to the overdamped Langevin equation (1.1) (the positive constant $c$ may be set $=1$ by rescaling $t$) and $W$ $d$-dimensional. Here, $N=d$, $M=d$, $\xi=x$, $A=0_{d\times d}$, $B=-cI_{d}$, $C=I_{d}$, $\sigma=\sqrt{2c}I_{d}$, $D=cI_{d}$.

Example 3.2.

The underdamped Langevin dynamics (1.2) ($\gamma$ and $c$ are positive constants and $W$ is $d$-dimensional) has $N=2d$, $M=d$, $\xi=[v^{T},x^{T}]^{T}$, and ($0$ stands for $0_{d\times d}$)

A=\begin{bmatrix}-\gamma I_{d}&0\\ I_{d}&0\end{bmatrix},\quad B=\begin{bmatrix}-cI_{d}\\ 0\end{bmatrix},\quad C=\begin{bmatrix}0&I_{d}\end{bmatrix},\quad \sigma=\begin{bmatrix}\sqrt{2\gamma c}\,I_{d}\\ 0\end{bmatrix},\quad D=\begin{bmatrix}\gamma cI_{d}&0\\ 0&0\end{bmatrix}.
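For readers who prefer to see these block matrices assembled explicitly, the following sketch (ours, not from the paper) builds them with Kronecker products, anticipating the structure (3.6) used in Section 3.4; the values of $\gamma$, $c$ and $d$ are arbitrary.

```python
import numpy as np

gamma, c, d = 2.0, 0.25, 3          # illustrative parameter values
I = np.eye(d)

# "hat" matrices: A = A_hat (x) I_d, etc., for the state xi = [v; x]
A_hat = np.array([[-gamma, 0.0],
                  [   1.0, 0.0]])
B_hat = np.array([[-c],
                  [0.0]])
C_hat = np.array([[0.0, 1.0]])
sigma_hat = np.array([[np.sqrt(2.0 * gamma * c)],
                      [0.0]])

A = np.kron(A_hat, I)               # 2d x 2d
B = np.kron(B_hat, I)               # 2d x d
C = np.kron(C_hat, I)               # d  x 2d
sigma = np.kron(sigma_hat, I)       # 2d x d
D = 0.5 * sigma @ sigma.T           # equals [[gamma*c*I, 0], [0, 0]]

assert np.allclose(D[:d, :d], gamma * c * I) and np.allclose(D[d:, d:], 0.0)
```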
Remark 3.3.

As distinct from the situation in (1.1), in (1.2) the noise $W(t)$ does not enter the $x$ equation directly; it does so only through the auxiliary variable $v$. This results in $x(t)$ being smoother in the underdamped case than in the overdamped case. This idea may be taken further: additional auxiliary variables may be introduced so as to increase the smoothness of $x(t)$, see e.g. [40].

The following proposition, whose proof is given in Section 7.1, relates (3.1) and the pdf $\propto\exp\big(-f(x)\big)$. The proposition may be used to check that the target is in fact the invariant density for the overdamped Langevin dynamics (1.1) and that the underdamped Langevin system (1.2) has the invariant density $\propto\exp\big(-f(x)-\|v\|^{2}/(2c)\big)$.

Proposition 3.4.

Assume that $S$ is an $N\times N$ positive semidefinite symmetric matrix.

  • The relations

    {\rm Tr}(A+DS) = 0, \qquad (3.2a)
    CB+CDC^{T} = 0, \qquad (3.2b)
    CA+B^{T}S+2CDS = 0, \qquad (3.2c)
    SA+A^{T}S+2SDS = 0, \qquad (3.2d)

    imply that (3.1) has the invariant probability distribution $\pi^{\star}$ with density

    \propto\exp\big(-f(C\xi)-(1/2)\xi^{T}S\xi\big).

  • If $SC^{T}=0$, then the marginal of $\propto\exp\big(-f(C\xi)-(1/2)\xi^{T}S\xi\big)$ on $x=C\xi$ is the target $\propto\exp(-f(x))$.

If $f$ is regarded as being arbitrary, then the relations (3.2) are also necessary for the probability distribution with density $\propto\exp\big(-f(C\xi)-(1/2)\xi^{T}S\xi\big)$ to be invariant, see Section 7.1. The next result may be useful to check the hypotheses of Proposition 3.4. The proof is a simple exercise and will not be given.

Proposition 3.5.

The relations (3.2) hold if

A=-(D+R)S,\qquad B=-(D+R)C^{T},

where $R$ is an arbitrary $N\times N$ skew-symmetric matrix.
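As a quick sanity check (not in the original paper), the sketch below verifies relations (3.2) and the marginal condition $SC^{T}=0$ numerically for the underdamped Langevin matrices of Example 3.2, and confirms that they arise from the Proposition 3.5 construction with the particular skew-symmetric choice $R=\begin{bmatrix}0&cI_{d}\\ -cI_{d}&0\end{bmatrix}$; the parameter values are arbitrary.

```python
import numpy as np

gamma, c, d = 2.0, 0.25, 3
I, Z = np.eye(d), np.zeros((d, d))

# Example 3.2 matrices for the underdamped Langevin dynamics (state xi = [v; x])
A = np.block([[-gamma * I, Z], [I, Z]])
B = np.vstack([-c * I, Z])
C = np.hstack([Z, I])
sigma = np.vstack([np.sqrt(2 * gamma * c) * I, Z])
D = 0.5 * sigma @ sigma.T
S = np.block([[I / c, Z], [Z, Z]])          # (1/2) xi^T S xi = ||v||^2 / (2c)

# Relations (3.2) of Proposition 3.4 and the marginal condition S C^T = 0
assert abs(np.trace(A + D @ S)) < 1e-12                     # (3.2a)
assert np.allclose(C @ B + C @ D @ C.T, 0)                  # (3.2b)
assert np.allclose(C @ A + B.T @ S + 2 * C @ D @ S, 0)      # (3.2c)
assert np.allclose(S @ A + A.T @ S + 2 * S @ D @ S, 0)      # (3.2d)
assert np.allclose(S @ C.T, 0)

# Proposition 3.5: the same A, B arise from A = -(D+R)S, B = -(D+R)C^T
R = np.block([[Z, c * I], [-c * I, Z]])     # skew-symmetric
assert np.allclose(A, -(D + R) @ S) and np.allclose(B, -(D + R) @ C.T)
```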

3.2 Convergence to the invariant distribution

We assume hereafter that (3.1) has the unique invariant distribution $\pi^{\star}$. If $\pi$ denotes the probability distribution of the initial value $\xi(0)$ for (3.1) and $\Phi_{t}\pi$, $t\geq 0$, represents the resulting probability distribution of $\xi(t)$, we will investigate the convergence, in the Wasserstein distance, of $\Phi_{t}\pi$ towards $\pi^{\star}$ as $t\rightarrow\infty$.

In order to estimate $W_{P}(\Phi_{t}\pi_{1},\Phi_{t}\pi_{2})$ we use the following well-known approach. We introduce the auxiliary $2N$-dimensional SDE:

d\xi^{(1)}(t) = A\xi^{(1)}(t)\,dt + B\nabla f(C\xi^{(1)}(t))\,dt + \sigma\,dW(t), \qquad (3.3a)
d\xi^{(2)}(t) = A\xi^{(2)}(t)\,dt + B\nabla f(C\xi^{(2)}(t))\,dt + \sigma\,dW(t), \qquad (3.3b)

where the same Brownian motion $W(t)$ drives $\xi^{(1)}(t)$ and $\xi^{(2)}(t)$. If $\xi^{(1)}(0)\sim\pi_{1}$ and $\xi^{(2)}(0)\sim\pi_{2}$, and $\zeta$ is a coupling between $\pi_{1}$ and $\pi_{2}$, then the pushforward of $\zeta$ by the solution of (3.3) provides a coupling for the distributions $\Phi_{t}\pi_{1}$ and $\Phi_{t}\pi_{2}$ of $\xi^{(1)}(t)$ and $\xi^{(2)}(t)$. In this setting it is easy to prove the following result.

Proposition 3.6.

Assume that $P\succ 0$ and $\lambda>0$ exist such that for (3.3), almost surely,

\|\xi^{(2)}(t)-\xi^{(1)}(t)\|_{P}^{2}\leq e^{-\lambda t}\|\xi^{(2)}(0)-\xi^{(1)}(0)\|_{P}^{2},\qquad t>0. \qquad (3.4)

Then, for arbitrary distributions $\pi_{1}$ and $\pi_{2}$,

W_{P}(\Phi_{t}\pi_{1},\Phi_{t}\pi_{2})\leq e^{-\lambda t/2}\,W_{P}(\pi_{1},\pi_{2}),\qquad t>0,

and, in particular, for arbitrary $\pi$,

W_{P}(\Phi_{t}\pi,\pi^{\star})\leq e^{-\lambda t/2}\,W_{P}(\pi,\pi^{\star}),\qquad t>0. \qquad (3.5)

3.3 Contractivity

We now identify sufficient conditions for (3.4) to hold.

Lemma 3.7.

Let $P\succ 0$ be an $N\times N$ symmetric matrix and $\lambda>0$. For solutions of (3.3),

d\Big(e^{\lambda t}[\xi^{(2)}(t)-\xi^{(1)}(t)]^{T}P[\xi^{(2)}(t)-\xi^{(1)}(t)]\Big) = e^{\lambda t}\Big([\xi^{(2)}(t)-\xi^{(1)}(t)]^{T}(\lambda P+A^{T}P+PA)[\xi^{(2)}(t)-\xi^{(1)}(t)] + [u^{(2)}(t)-u^{(1)}(t)]^{T}B^{T}P[\xi^{(2)}(t)-\xi^{(1)}(t)] + [\xi^{(2)}(t)-\xi^{(1)}(t)]^{T}PB[u^{(2)}(t)-u^{(1)}(t)]\Big)\,dt.
Proof.

It is enough to apply Ito’s rule to

F(t,\xi^{(1)}(t),\xi^{(2)}(t))=e^{\lambda t}[\xi^{(2)}(t)-\xi^{(1)}(t)]^{T}P[\xi^{(2)}(t)-\xi^{(1)}(t)];

the Ito correction is

{\rm Tr}\left(\begin{bmatrix}\sigma^{T}&\sigma^{T}\end{bmatrix}\begin{bmatrix}P&-P\\ -P&P\end{bmatrix}\begin{bmatrix}\sigma\\ \sigma\end{bmatrix}\right)=0. ∎

The inputs $u^{(1)}(t)$, $u^{(2)}(t)$ that appear in the lemma may be eliminated by using that $\nabla f(x)$ is continuously differentiable. In fact, by the mean value theorem,

u^{(2)}(t)-u^{(1)}(t) = \bar{\mathcal{H}}(x^{(2)}(t),x^{(1)}(t))\,[x^{(2)}(t)-x^{(1)}(t)] = \bar{\mathcal{H}}(x^{(2)}(t),x^{(1)}(t))\,C\,[\xi^{(2)}(t)-\xi^{(1)}(t)],

where, for each pair of vectors $y_{1}$, $y_{2}$ in $\mathbb{R}^{d}$, we have defined

\bar{\mathcal{H}}(y_{2},y_{1})=\int_{0}^{1}\mathcal{H}\big(y_{1}+z[y_{2}-y_{1}]\big)\,dz

($\mathcal{H}$ is the Hessian of $f$). After elimination of the inputs, Lemma 3.7 yields

d\Big(e^{\lambda t}[\xi^{(2)}(t)-\xi^{(1)}(t)]^{T}P[\xi^{(2)}(t)-\xi^{(1)}(t)]\Big) = e^{\lambda t}[\xi^{(2)}(t)-\xi^{(1)}(t)]^{T}\Big(\lambda P+P\big(A+B\bar{\mathcal{H}}(x^{(2)}(t),x^{(1)}(t))C\big)+\big(A+B\bar{\mathcal{H}}(x^{(2)}(t),x^{(1)}(t))C\big)^{T}P\Big)[\xi^{(2)}(t)-\xi^{(1)}(t)]\,dt,

an equality that implies our next result.

Proposition 3.8.

Let $P\succ 0$ be an $N\times N$ symmetric matrix and $\lambda>0$. Assume that, for each $y_{1},y_{2}\in\mathbb{R}^{d}$, the matrix

\mathcal{T}(\lambda,P,y_{1},y_{2})=\lambda P+P\big(A+B\bar{\mathcal{H}}(y_{1},y_{2})C\big)+\big(A+B\bar{\mathcal{H}}(y_{1},y_{2})C\big)^{T}P

is $\preceq 0$. Then, for solutions of (3.3) the contractivity estimate (3.4) holds almost surely.

3.4 Checking contractivity

We next provide a result that is useful when checking the hypothesis $\mathcal{T}\preceq 0$ in the last proposition.

Typically, in (3.1)

A=\widehat{A}\otimes I_{d},\qquad B=\widehat{B}\otimes I_{d},\qquad C=\widehat{C}\otimes I_{d}, \qquad (3.6)

with $\widehat{A}$, $\widehat{B}$, and $\widehat{C}$ of sizes $\widehat{N}\times\widehat{N}$, $\widehat{N}\times 1$, and $1\times\widehat{N}$ respectively (which implies that $N=\widehat{N}d$). This is for instance the situation for the overdamped and underdamped Langevin equations presented above, where $\widehat{N}=1$ and $\widehat{N}=2$ respectively. In general $\widehat{N}$ will be a small integer and therefore the matrices $\widehat{A}$, $\widehat{B}$, and $\widehat{C}$ will also be small.

When (3.6) holds and also $\sigma=\widehat{\sigma}\otimes I_{d}$ (with $\widehat{\sigma}$ of size $\widehat{N}\times\widehat{M}$) and $S=\widehat{S}\otimes I_{d}$, the hypotheses of Proposition 3.4 may be stated in terms of the matrices with a hat, i.e., in the second item, $\widehat{S}\widehat{C}^{T}=0$ and, in the first item, ${\rm Tr}(\widehat{A}+\widehat{D}\widehat{S})=0$, etc. (here $\widehat{D}=(1/2)\widehat{\sigma}\widehat{\sigma}^{T}$). The same observation applies to Proposition 3.5. In addition, it makes sense to consider that the matrix $P\succ 0$ is of the form $\widehat{P}\otimes I_{d}$ with $\widehat{P}$ of size $\widehat{N}\times\widehat{N}$. Note that the eigenvalues of $P$ are obtained by repeating $d$ times each eigenvalue of $\widehat{P}$ and in particular $P\succ 0$ if and only if $\widehat{P}\succ 0$. We then have:

Lemma 3.9.

Assume that (3.6) holds and $P=\widehat{P}\otimes I_{d}$. The set of the $N=\widehat{N}d$ eigenvalues of $\mathcal{T}(\lambda,P,y_{1},y_{2})$ is the union of the sets of eigenvalues of the matrices (of size $\widehat{N}\times\widehat{N}$)

\lambda\widehat{P}+\widehat{P}\big(\widehat{A}+H_{i}(y_{1},y_{2})\widehat{B}\widehat{C}\big)+\big(\widehat{A}+H_{i}(y_{1},y_{2})\widehat{B}\widehat{C}\big)^{T}\widehat{P}, \qquad (3.7)

where $H_{i}(y_{1},y_{2})$, $i=1,\dots,d$, are the eigenvalues of $\bar{\mathcal{H}}(y_{1},y_{2})$.

Proof.

After using (3.6) and $\bar{\mathcal{H}}=1\otimes\bar{\mathcal{H}}$, the mixed product property of $\otimes$ implies:

\mathcal{T}=(\lambda\widehat{P}+\widehat{P}\widehat{A}+\widehat{A}^{T}\widehat{P})\otimes I_{d}+(\widehat{P}\widehat{B}\widehat{C}+\widehat{C}^{T}\widehat{B}^{T}\widehat{P})\otimes\bar{\mathcal{H}}.

Now factorize $\bar{\mathcal{H}}=Q\mathcal{D}Q^{T}$ with $\mathcal{D}$ diagonal and $Q$ orthogonal (both $d\times d$). It follows that

\mathcal{T}=(I_{\widehat{N}}\otimes Q)\Big[(\lambda\widehat{P}+\widehat{P}\widehat{A}+\widehat{A}^{T}\widehat{P})\otimes I_{d}+(\widehat{P}\widehat{B}\widehat{C}+\widehat{C}^{T}\widehat{B}^{T}\widehat{P})\otimes\mathcal{D}\Big](I_{\widehat{N}}\otimes Q)^{T},

and, as a consequence, the eigenvalues of $\mathcal{T}$ are those of the matrix in square brackets in the display. This matrix consists of $\widehat{N}^{2}$ blocks, where each block is diagonal of size $d\times d$. After reordering, the matrix in square brackets becomes a direct sum of the $d$ matrices in (3.7). ∎

We now describe how to find, for a given $\widehat{P}\succ 0$, the decay rate $\lambda$ in (3.4). The hypotheses on $f$ guarantee that, in (3.7), $H_{i}(y_{1},y_{2})\in[m,L]$. After defining the matrix-valued function of the real variable $H\in[m,L]$ given by

\widehat{Z}(H)=-\widehat{P}\big(\widehat{A}+H\widehat{B}\widehat{C}\big)-\big(\widehat{A}+H\widehat{B}\widehat{C}\big)^{T}\widehat{P}, \qquad (3.8)

we see from Lemma 3.9 that if, for each $H\in[m,L]$, $\lambda\widehat{P}-\widehat{Z}(H)\preceq 0$, then $\mathcal{T}\preceq 0$. We factorize $\widehat{P}=\widehat{L}\widehat{L}^{T}$ with $\widehat{L}$ invertible; for instance $\widehat{L}$ may be chosen to be lower triangular with positive diagonal entries (Cholesky factorization), but other possibilities of course exist. The condition $\lambda\widehat{P}-\widehat{Z}(H)\preceq 0$ is equivalent to the condition $\lambda I\preceq\widehat{L}^{-1}\widehat{Z}(H)\widehat{L}^{-T}$. Therefore we will have $\mathcal{T}\preceq 0$ if, as $H$ varies in $[m,L]$, the eigenvalues of $\widehat{L}^{-1}\widehat{Z}(H)\widehat{L}^{-T}$ are positive and bounded away from zero. When that is the case, $\lambda$ may be chosen to be the infimum of those eigenvalues. We also note that the eigenvalues of $\widehat{L}^{-1}\widehat{Z}(H)\widehat{L}^{-T}$ are the eigenvalues of the generalized eigenvalue problem $\widehat{Z}(H)x=\Lambda\widehat{P}x$. To sum up:

Proposition 3.10.

Given the symmetric, positive definite $\widehat{P}$, define $\widehat{Z}(H)$ by (3.8). Assume that, as $H$ varies in $[m,L]$, the eigenvalues $\Lambda$ of the generalized eigenvalue problem $\widehat{Z}(H)x=\Lambda\widehat{P}x$ are positive and bounded away from zero and let $\lambda>0$ be the infimum of those eigenvalues. Then the contractivity bound (3.4) with $P=\widehat{P}\otimes I_{d}$ holds almost surely. Alternatively, $\lambda$ may be defined as the infimum of the eigenvalues of the matrices $\widehat{L}^{-1}\widehat{Z}(H)\widehat{L}^{-T}$, where $\widehat{L}$ is any matrix with $\widehat{P}=\widehat{L}\widehat{L}^{T}$.

The following two examples show this framework applied to the case of equations (1.1) and (1.2).

Example 3.11.

In the case of the overdamped Langevin equation (1.1), if we make the choice $\widehat{P}=1$, a simple calculation gives $\widehat{Z}(H)=2cH$. We hence see that in this case $\lambda=2cm$, a well-known result.

Example 3.12.

The paper [10] studies the underdamped Langevin equation (1.2) and fixes $\gamma=2$. This does not entail any loss of generality, as the value of $\gamma>0$ may be chosen arbitrarily by rescaling the variable $t$. (Other authors, see e.g. [15], use different scalings. When we quote estimates from papers that use alternative scalings, we have translated them to the scale in [10] in order to have meaningful comparisons.) Furthermore, [10] sets $c=1/L$ and

\widehat{P}=\begin{bmatrix}1&1\\ 1&2\end{bmatrix},\qquad \widehat{L}=\begin{bmatrix}1&0\\ 1&1\end{bmatrix}. \qquad (3.9)

For these choices, we find

\widehat{L}^{-1}\widehat{Z}(H)\widehat{L}^{-T}=\begin{bmatrix}2&H/L-2\\ H/L-2&2\end{bmatrix};

the eigenvalues of this matrix are $H/L$ and $4-H/L$ and, since $H\in[m,L]$, they are $\geq m/L=1/\kappa$ ($\kappa$ denotes the condition number). In this case $\lambda=1/\kappa$ and (3.4) becomes

\|x_{2}(t)-x_{1}(t)\|^{2}+\|x_{2}(t)+v_{2}(t)-x_{1}(t)-v_{1}(t)\|^{2}\leq \exp(-t/\kappa)\Big(\|x_{2}(0)-x_{1}(0)\|^{2}+\|x_{2}(0)+v_{2}(0)-x_{1}(0)-v_{1}(0)\|^{2}\Big),

which is the contraction estimate used in [10].

We note that the use of the inner product associated with $P$ for $(v,x)$ is equivalent to working with the variables $(x+v,v)$ and the standard Euclidean inner product. This $P$-inner product often appears in the construction of Lyapunov functions for damped oscillators [47, 5].
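The rate obtained in this example can also be recovered mechanically from Proposition 3.10. The sketch below (our illustration, not part of the paper) scans $H\in[m,L]$ and solves the generalized eigenvalue problem $\widehat{Z}(H)x=\Lambda\widehat{P}x$ numerically, recovering $\lambda=1/\kappa$ for $c=1/L$; the values of $m$ and $L$ are arbitrary.

```python
import numpy as np
from scipy.linalg import eigh

m, L = 1.0, 10.0                    # illustrative values; kappa = L/m
gamma, c = 2.0, 1.0 / L             # the choices of Example 3.12
A_hat = np.array([[-gamma, 0.0], [1.0, 0.0]])
B_hat = np.array([[-c], [0.0]])
C_hat = np.array([[0.0, 1.0]])
P_hat = np.array([[1.0, 1.0], [1.0, 2.0]])

def Z_hat(H):
    # Z(H) = -P(A + H B C) - (A + H B C)^T P, see (3.8)
    M = A_hat + H * B_hat @ C_hat
    return -P_hat @ M - M.T @ P_hat

# Infimum over H in [m, L] of the generalized eigenvalues Z(H) x = Lambda P x
lam = min(eigh(Z_hat(H), P_hat, eigvals_only=True).min()
          for H in np.linspace(m, L, 1001))
print(lam, m / L)                   # both should equal 1/kappa = 0.1
```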

Example 3.13.

In the setting of the preceding example, we keep $\gamma=2$ and $\widehat{P}$ as in (3.9), but do not assume $c=1/L$. The eigenvalues of the $2\times 2$ matrix $\widehat{L}^{-1}\widehat{Z}(H)\widehat{L}^{-T}$ are found to be $\Lambda^{+}(H)=cH$ and $\Lambda^{-}(H)=4-cH$; for future reference, we note that they depend on $H$ and $c$ through the combination $cH$ (as was to be expected from (1.2), where $\nabla f(x)$ is multiplied by $c$). We distinguish four cases:

  1. $c<4/(L+m)$. As $H$ varies in $[m,L]$, we have $\min(\Lambda^{+}(H))=cm$ and $\min(\Lambda^{-}(H))=4-cL>cm$. Therefore in this case $\lambda=cm$ and an increase in $c$ results in an increase in $\lambda$. In particular, for $1/L<c<4/(L+m)$ the contraction rate improves on the value $1/\kappa$ corresponding to the choice $c=1/L$ in [10] discussed in the preceding example.

  2. $c=4/(L+m)$. In this case $\min(\Lambda^{+}(H))=cm$ and $\min(\Lambda^{-}(H))=4-cL$ have the common value $4/(\kappa+1)$.

  3. $c\in(4/(L+m),4/L)$. Now $\min(\Lambda^{+}(H))=cm$ is larger than $\min(\Lambda^{-}(H))=4-cL$ and therefore $\lambda=4-cL$, which decreases as $c$ increases.

  4. $c\geq 4/L$. In this case $\min(\Lambda^{-}(H))\leq 0$ and there is no contractivity.

Therefore, with $\gamma=2$ and $\widehat{P}$ in (3.9), the choice $c=4/(L+m)$ yields the best mixing: $\lambda=4/(\kappa+1)$. We prove in Section 7.2 that the mixing cannot be improved by using alternative choices of $\widehat{P}$.

More sophisticated choices of $\widehat{P}$ are considered in [15]. (The matrix $\widehat{P}$ is not used in that reference, which only works with a non-triangular $\widehat{L}$ such that $\widehat{P}=\widehat{L}^{T}\widehat{L}$. In turn, $\widehat{L}$ is defined indirectly by choosing the columns of $\widehat{L}^{-T}$ to be eigenvectors of a suitable known matrix that depends on a real parameter. The parameter is tuned to enhance the rate of contraction.) While those choices allow, for some values of $c$, a degree of improvement on the value of $\lambda$ we have obtained by using (3.9) in Proposition 3.10, they do not yield values of $\lambda$ above $4/(\kappa+1)$ (which is of course in agreement with the analysis in Section 7.2 below). In addition the study in [15] assumes that the variable $v$ is started at stationarity and only monitors the mixing in the variable $x$.

A useful reference on contractivity is [38].

Remark 3.14.

In the examples above it was assumed that $\widehat{P}$ was known at the outset. Due to the small dimension of this matrix in applications, it is not difficult to find favourable choices of $\widehat{P}$. This is illustrated in Section 7.2 (see also [15]).

4 Discretizations

Having established properties for solutions of SDEs of the type (3.1), we now turn our attention to the properties of their numerical discretizations. We derive a result analogous to Proposition 3.10 to establish the contractivity of the numerical solutions for integrators that use only one gradient evaluation per time step. Such integrators are particularly attractive in problems of high dimensionality.

4.1 Discrete state-space form

To discretize (3.1) on the grid points $t_{n}=nh$, $h>0$, $n=0,1,2,\dots$, we use schemes of the form:

\xi_{n+1} = A_{h}\xi_{n}+B_{h}u_{n}+\sigma^{\xi}_{h}\Omega_{n}, \qquad (4.1a)
y_{n} = C_{h}\xi_{n}+\sigma^{y}_{h}\Omega_{n}, \qquad (4.1b)
u_{n} = \nabla f(y_{n}). \qquad (4.1c)

Here, at each step, $y_{n}\in\mathbb{R}^{d}$ is the feedback output at which the gradient $\nabla f$ will be evaluated and $\Omega_{n}$ represents a random vector in $\mathbb{R}^{\bar{M}}$ suitably derived from the restriction to $[t_{n},t_{n+1}]$ of the Brownian motion $W$ in (3.1). The real matrices $A_{h}$, $B_{h}$, $C_{h}$, $\sigma^{\xi}_{h}$ and $\sigma^{y}_{h}$ are constant, with sizes $N\times N$, $N\times d$, $d\times N$, $N\times\bar{M}$ and $d\times\bar{M}$ respectively. As the examples that follow will illustrate, consistency requires that $h^{-1}(A_{h}-I)$ be an approximation to $A$ in (3.1), while $h^{-1}B_{h}$ and $C_{h}$ approximate $B$ and $C$. Note also the noise in (4.1b), which has no counterpart in (3.1b).

Example 4.1.

The Euler-Maruyama scheme for the SDE (1.1)

x_{n+1}=x_{n}-hc\nabla f(x_{n})+\sqrt{2c}\,(W(t_{n+1})-W(t_{n}))

is of the form (4.1) with $N=d$, $\bar{M}=d$, $\xi=x$, $y=x$, $\Omega_{n}=W(t_{n+1})-W(t_{n})$, $A_{h}=I_{d}$, $B_{h}=-hcI_{d}$, $C_{h}=I_{d}$, $\sigma^{\xi}_{h}=\sqrt{2c}I_{d}$, $\sigma^{y}_{h}=0_{d\times d}$.
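For concreteness, here is a minimal sketch (ours, not from the paper) of this Euler-Maruyama scheme applied to a Gaussian target $f(x)=\frac12 x^{T}\Sigma^{-1}x$; the target, step size and run length are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, c, h, n_steps = 2, 1.0, 0.05, 20000
Sigma_inv = np.diag([1.0, 4.0])            # target density ∝ exp(-x^T Sigma_inv x / 2)
grad_f = lambda x: Sigma_inv @ x

x = np.zeros(d)
samples = np.empty((n_steps, d))
for n in range(n_steps):
    # Euler-Maruyama step: W(t_{n+1}) - W(t_n) ~ N(0, h I_d)
    x = x - h * c * grad_f(x) + np.sqrt(2.0 * c * h) * rng.standard_normal(d)
    samples[n] = x

# Empirical covariance should be close to Sigma = diag(1, 1/4), up to an O(h) bias
print(np.cov(samples[5000:].T))
```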

Example 4.2.

To shorten the notation, we introduce the functions:

\mathcal{E}(t)=\exp(-\gamma t),\qquad \mathcal{F}(t)=\int_{0}^{t}\mathcal{E}(s)\,ds=\frac{1-\exp(-\gamma t)}{\gamma},

and

\mathcal{G}(t)=\int_{0}^{t}\mathcal{F}(s)\,ds=\frac{\gamma t+\exp(-\gamma t)-1}{\gamma^{2}}.

For the integration of (1.2) Cheng et al. [10] use the scheme:

v_{n+1} = \mathcal{E}(h)v_{n}-\mathcal{F}(h)c\nabla f(x_{n})+\sqrt{2\gamma c}\int_{t_{n}}^{t_{n+1}}\mathcal{E}(t_{n+1}-s)\,dW(s), \qquad (4.2a)
x_{n+1} = x_{n}+\mathcal{F}(h)v_{n}-\mathcal{G}(h)c\nabla f(x_{n})+\sqrt{2\gamma c}\int_{t_{n}}^{t_{n+1}}\mathcal{F}(t_{n+1}-s)\,dW(s). \qquad (4.2b)

In this example, $N=2d$, $\bar{M}=2d$, $\xi=[v^{T},x^{T}]^{T}$, $y=x$,

\Omega_{n}=\begin{bmatrix}\int_{t_{n}}^{t_{n+1}}\mathcal{E}(t_{n+1}-s)\,dW(s)\\ \int_{t_{n}}^{t_{n+1}}\mathcal{F}(t_{n+1}-s)\,dW(s)\end{bmatrix},

and

A_{h}=\begin{bmatrix}\mathcal{E}(h)I_{d}&0_{d\times d}\\ \mathcal{F}(h)I_{d}&I_{d}\end{bmatrix},\qquad B_{h}=\begin{bmatrix}-\mathcal{F}(h)cI_{d}\\ -\mathcal{G}(h)cI_{d}\end{bmatrix},
C_{h}=[0_{d\times d},I_{d}],\qquad \sigma_{h}^{\xi}=\sqrt{2c\gamma}\,I_{2d},\qquad \sigma_{h}^{y}=0_{d\times 2d}.

The recipe for simulating the Gaussian random variables $\Omega_{n}$ may be seen in [10].

In the absence of noise, this integrator is the well-known Euler exponential integrator [24], based, via the variation of constants formula/Duhamel's principle, on the exact integration of the system $dv/dt=-\gamma v$, $dx/dt=v$. In the stochastic scenario the algorithm is first order in both the weak and strong senses. The paper [21] calls this scheme the left point method. In what follows we shall refer to it as the Euler exponential (EE) integrator.
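The following sketch (ours, not taken from [10]) implements one EE step (4.2). To stay agnostic about the exact sampling recipe of [10], the per-coordinate covariance of the pair of stochastic integrals in $\Omega_{n}$ is obtained here from the Ito isometry by numerical quadrature and then factorized by Cholesky.

```python
import numpy as np
from scipy.integrate import quad

def ee_step(v, x, grad_f, h, gamma, c, rng):
    """One step of the EE integrator (4.2) for the underdamped Langevin dynamics."""
    E = lambda t: np.exp(-gamma * t)
    F = lambda t: (1.0 - np.exp(-gamma * t)) / gamma
    G = lambda t: (gamma * t + np.exp(-gamma * t) - 1.0) / gamma**2

    # Per-coordinate covariance of sqrt(2*gamma*c) * (int E dW, int F dW) over [t_n, t_{n+1}]
    c11 = 2 * gamma * c * quad(lambda u: E(u) ** 2, 0.0, h)[0]
    c22 = 2 * gamma * c * quad(lambda u: F(u) ** 2, 0.0, h)[0]
    c12 = 2 * gamma * c * quad(lambda u: E(u) * F(u), 0.0, h)[0]
    Lchol = np.linalg.cholesky(np.array([[c11, c12], [c12, c22]]))

    eta_v, eta_x = Lchol @ rng.standard_normal((2, x.size))   # correlated Gaussian noises
    g = grad_f(x)
    v_new = E(h) * v - F(h) * c * g + eta_v
    x_new = x + F(h) * v - G(h) * c * g + eta_x
    return v_new, x_new

# usage sketch: rng = np.random.default_rng(0); v = x = np.zeros(10)
# v, x = ee_step(v, x, lambda z: z, h=0.1, gamma=2.0, c=1.0, rng=rng)
```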

Example 4.3.

Another instance of an underdamped Langevin integrator of the form (4.1) is the following UBU algorithm:

v_{n+1} = \mathcal{E}(h)v_{n}-h\mathcal{E}(h/2)c\nabla f(y_{n})+\sqrt{2\gamma c}\int_{t_{n}}^{t_{n+1}}\mathcal{E}(t_{n+1}-s)\,dW(s), \qquad (4.3a)
x_{n+1} = x_{n}+\mathcal{F}(h)v_{n}-h\mathcal{F}(h/2)c\nabla f(y_{n})+\sqrt{2\gamma c}\int_{t_{n}}^{t_{n+1}}\mathcal{F}(t_{n+1}-s)\,dW(s), \qquad (4.3b)
y_{n} = x_{n}+\mathcal{F}(h/2)v_{n}+\sqrt{2\gamma c}\int_{t_{n}}^{t_{n+1/2}}\mathcal{F}(t_{n+1/2}-s)\,dW(s). \qquad (4.3c)

Here and later $t_{n+1/2}=t_{n}+h/2$. UBU is a splitting integrator [35] that is second order in both the weak and strong senses. See Section 7.3 for details and note that both EE and UBU use stochastic integrals of the form $\int\mathcal{F}\,dW$ and therefore are not covered by the analysis in [9].
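One convenient way to realize (4.3) in code, sketched below under our own conventions (this is not code from the paper), is through the splitting structure suggested by the name: a half-step of the exact Ornstein-Uhlenbeck/free flow (U), a gradient kick over the full step (B), and another U half-step; composing the three maps reproduces (4.3). As in the EE sketch, the OU covariance is evaluated by quadrature.

```python
import numpy as np
from scipy.integrate import quad

def ou_half_step(v, x, tau, gamma, c, rng):
    """Exact solve of dv = -gamma v dt + sqrt(2 gamma c) dW, dx = v dt over a time tau."""
    E = lambda t: np.exp(-gamma * t)
    F = lambda t: (1.0 - np.exp(-gamma * t)) / gamma
    c11 = 2 * gamma * c * quad(lambda u: E(u) ** 2, 0.0, tau)[0]
    c22 = 2 * gamma * c * quad(lambda u: F(u) ** 2, 0.0, tau)[0]
    c12 = 2 * gamma * c * quad(lambda u: E(u) * F(u), 0.0, tau)[0]
    Lchol = np.linalg.cholesky(np.array([[c11, c12], [c12, c22]]))
    eta_v, eta_x = Lchol @ rng.standard_normal((2, x.size))
    return E(tau) * v + eta_v, x + F(tau) * v + eta_x

def ubu_step(v, x, grad_f, h, gamma, c, rng):
    """One step of the UBU integrator (4.3): U(h/2), gradient kick B, U(h/2)."""
    v, x = ou_half_step(v, x, h / 2, gamma, c, rng)   # U: this x equals y_n of (4.3c)
    v = v - h * c * grad_f(x)                          # B: kick with the full step h
    v, x = ou_half_step(v, x, h / 2, gamma, c, rng)   # U: fresh noise on [t_{n+1/2}, t_{n+1}]
    return v, x
```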

Remark 4.4.

The format (4.1) only caters for schemes that use a single evaluation of the gradient $\nabla f$ per step. By increasing the dimension of $u$, the format may be easily adapted to integrators that use several gradient evaluations, cf. [32, 20, 45]. However, the technique used below to establish the contractivity of the integrators cannot be immediately extended to schemes with several gradient evaluations; several gradient evaluations would bring in Hessian matrices evaluated at different locations and it would not be possible to diagonalize those Hessians simultaneously, as we did when proving Lemma 3.9. For the contractivity of algorithms involving several gradient evaluations see e.g. [44] and its references.

4.2 The evolution of probability distributions in the discrete case

We will denote by $\Psi_{h,n}\pi$ the probability distribution for $\xi_{n}$ in (4.1) when $\pi$ is the distribution of $\xi_{0}$ (thus $\Psi_{h,n}$ is an operator on measures). After introducing (cf. (3.3))

\xi^{(1)}_{n+1} = A_{h}\xi^{(1)}_{n}+B_{h}\nabla f(C_{h}\xi^{(1)}_{n}+\sigma_{h}^{y}\Omega_{n})+\sigma_{h}^{\xi}\Omega_{n}, \qquad (4.4a)
\xi^{(2)}_{n+1} = A_{h}\xi^{(2)}_{n}+B_{h}\nabla f(C_{h}\xi^{(2)}_{n}+\sigma_{h}^{y}\Omega_{n})+\sigma_{h}^{\xi}\Omega_{n}, \qquad (4.4b)

we have the following discrete counterpart of Proposition 3.6, whose proof will not be given:

Proposition 4.5.

Assume that $P_{h}\succ 0$ and $\rho_{h}\in(0,1)$ exist such that for (4.4), almost surely,

\|\xi^{(2)}_{n+1}-\xi^{(1)}_{n+1}\|_{P_{h}}^{2}\leq\rho_{h}\|\xi^{(2)}_{n}-\xi^{(1)}_{n}\|_{P_{h}}^{2},\qquad n=0,1,\dots \qquad (4.5)

Then, for arbitrary distributions $\pi_{1}$ and $\pi_{2}$,

W_{P_{h}}(\Psi_{h,n}\pi_{1},\Psi_{h,n}\pi_{2})\leq\rho_{h}^{n/2}\,W_{P_{h}}(\pi_{1},\pi_{2}),\qquad n=0,1,\dots \qquad (4.6)

4.3 Checking discrete contractivity

The proof of the following result is similar to that of Proposition 3.8 and will be omitted:

Proposition 4.6.

Let $P_{h}\succ 0$ be an $N\times N$ symmetric matrix and $\rho_{h}\in(0,1)$. Assume that, for each $y_{1},y_{2}\in\mathbb{R}^{d}$, the matrix

\mathcal{T}_{h}(\rho_{h},P_{h},y_{1},y_{2})=\rho_{h}P_{h}-\big(A_{h}+B_{h}\bar{\mathcal{H}}(y_{1},y_{2})C_{h}\big)^{T}P_{h}\big(A_{h}+B_{h}\bar{\mathcal{H}}(y_{1},y_{2})C_{h}\big)

is $\succeq 0$. Then, for solutions of (4.4) the contractivity estimate (4.5) holds almost surely.

In a similar way to the continuous-time case, we can prove a discrete-time counterpart of Proposition 3.10.

Proposition 4.7.

Given the symmetric, positive definite $\widehat{P}_{h}$, set

\widehat{Z}_{h}(H)=\big(\widehat{A}_{h}+H\widehat{B}_{h}\widehat{C}_{h}\big)^{T}\widehat{P}_{h}\big(\widehat{A}_{h}+H\widehat{B}_{h}\widehat{C}_{h}\big).

Assume that, as $H$ varies in $[m,L]$, the supremum $\rho_{h}$ of the eigenvalues $R$ of the generalized eigenvalue problems $\widehat{Z}_{h}(H)x=R\widehat{P}_{h}x$ is $<1$. Then the contractivity bound (4.5) with $P_{h}=\widehat{P}_{h}\otimes I_{d}$ holds almost surely. Alternatively, $\rho_{h}$ may be defined as the supremum of the eigenvalues of the matrices $\widehat{L}^{-1}_{h}\widehat{Z}_{h}(H)\widehat{L}^{-T}_{h}$, where $\widehat{L}_{h}$ is any matrix with $\widehat{P}_{h}=\widehat{L}_{h}\widehat{L}^{T}_{h}$.

The remainder of this section is devoted to the application of the last proposition to the investigation of the contractivity of the integrators (4.2) or (4.3) applied to the underdamped Langevin system (1.2) with $\gamma=2$ and $\widehat{P}_{h}$ chosen to coincide with $\widehat{P}$ in (3.9). We have computed symbolically the eigenvalues $R=R_{h}^{\pm}(H)$ in the proposition in closed form, but the resulting expressions are complicated and will not be reproduced here. (For each fixed $H$ we attach the $+$ superscript to the discrete eigenvalue $R(H)$ closest to $\Lambda^{+}(H)=cH$ and the $-$ superscript to the other.) Rather than analysing directly the discrete eigenvalues, we follow an alternative approach based on leveraging the contractivity of the SDE (studied in Example 3.13) and the consistency of the discretizations. The key observation is that, by definition of consistency, for fixed $H\in[m,L]$ and as $h\downarrow 0$, the numerical propagator matrix $\widehat{A}_{h}+H\widehat{B}_{h}\widehat{C}_{h}$ in Proposition 4.7 differs from the differential equation propagator $\exp(h(\widehat{A}+H\widehat{B}\widehat{C}))$ in Proposition 3.10 by an $\mathcal{O}(h^{p+1})$ amount, where $p=1$ for the first order EE integrator and $p=2$ for the second order UBU. As a consequence, $h^{-1}\log(\widehat{A}_{h}+H\widehat{B}_{h}\widehat{C}_{h})$ is $\mathcal{O}(h^{p})$ away from $\widehat{A}+H\widehat{B}\widehat{C}$ (cf. [46, Example 10.1]), and, for the eigenvalues, we have $-h^{-1}\log(R_{h}^{\pm}(H))=\Lambda^{\pm}(H)+\mathcal{O}(h^{p})$. It is convenient for our purposes to work, rather than with $-h^{-1}\log(R_{h}^{\pm}(H))$, in terms of the quantities

\widetilde{\Lambda}^{\pm}_{h}(H)=2h^{-1}\big(1-R_{h}^{\pm}(H)^{1/2}\big); \qquad (4.7)

for these (since, as $\zeta\rightarrow 1$, $-\log\zeta\sim 2(1-\zeta^{1/2})$) we have $\widetilde{\Lambda}_{h}^{\pm}(H)=\Lambda^{\pm}(H)+\mathcal{O}(h)$. An illustration of the convergence of $\widetilde{\Lambda}_{h}^{\pm}(H)$ to $\Lambda^{\pm}(H)$ may be seen in Figure 1.

Remark 4.8.

Note that $R_{h}^{\pm}(H)^{1/2}=1-\widetilde{\Lambda}_{h}^{\pm}h/2$ is an approximation to $\exp(-\Lambda^{\pm}(H)h/2)$; compare with the relation between the discrete decay factor $\rho_{h}^{1/2}$ over one time step in (4.6) and the SDE decay factor $\exp(-\lambda h/2)$ in (3.5) over a time interval of length $h$.

Because the discrete eigenvalues depend smoothly on $H$ and this variable ranges in the compact interval $[m,L]$, the convergence $\widetilde{\Lambda}_{h}^{\pm}(H)\rightarrow\Lambda^{\pm}(H)$ is uniform in $H$. Therefore $2(1-\rho_{h}^{1/2})/h$, which is the minimum of $2(1-R_{h}^{\pm}(H)^{1/2})/h$, converges to the minimum of $\Lambda^{\pm}(H)$ in the limit where $h\downarrow 0$ with $c$, $m$ and $L$ fixed. We conclude that, for $h$ small enough, the discretizations will behave contractively when the SDE does, i.e. whenever $c\leq 4/(L+m)$. However, this conclusion is per se rather weak because, as we have seen in Example 3.13, as $L$ and $m$ vary with $\kappa\rightarrow\infty$, the contraction rate $\lambda$ behaves like $\mathcal{O}(\kappa^{-1})$ and it may be feared that the discretizations be contractive only for values of $h$ that, as $\kappa$ increases unboundedly, tend to $0$. If that were the case, the usefulness of the integrators (4.2) or (4.3) could be doubted. The next result proves that those fears are not warranted if $c$ is chosen appropriately: the discretizations are contractive with a rate that essentially coincides with the SDE rate, provided that $h$ is below a threshold independent of $L$ and $m$.

Figure 1: On the left, $\gamma=2$, $L=10$, $m=1$, $c=3/(L+m)$ and $\widehat{P}$ is as in (3.9); the condition number is very low so as to enhance the clarity of the figure. The solid lines correspond to the SDE eigenvalues $\Lambda^{+}(H)=cH$, $\Lambda^{-}(H)=4-cH$. The value of $\lambda$ is the minimum eigenvalue and occurs for $\Lambda^{+}$ evaluated at $m$, so that $\lambda=3m/(L+m)=3/11$. The discontinuous lines represent the UBU discrete counterparts $\widetilde{\Lambda}_{h}^{+}(H)$ and $\widetilde{\Lambda}_{h}^{-}(H)$ for $h=2,1,1/2,1/4$; as $h$ decreases, $\widetilde{\Lambda}_{h}^{+}(H)$ and $\widetilde{\Lambda}_{h}^{-}(H)$ converge to the SDE eigenvalues. For $h=2$ and $H$ large, $\widetilde{\Lambda}_{h}^{-}(H)<0$ and the numerical scheme is not contractive. The parameters in the right panel are the same as those on the left, with the exception that $L=10^{9}$, leading to an extremely high condition number. Now the minimum of the SDE eigenvalues is $\lambda\approx 10^{-9}$. As on the left, there is numerical contractivity for $h=1,1/2,1/4$, but not for $h=2$; see the leftmost column in Table 1.
Theorem 4.9.

Consider the SDE (1.2) with $\gamma=2$, $c=\bar{c}/(L+m)$, where the constant $\bar{c}\in(0,4)$ is independent of $L$ and $m$. For the discretization provided by the integrators (4.2) or (4.3), to any $\bar{r}<\bar{c}/2$ there corresponds a value $h_{0}=h_{0}(\bar{r})$ such that, for $h\leq h_{0}$, the discrete contraction estimate (4.5) holds with $P_{h}=\widehat{P}\otimes I_{d}$ ($\widehat{P}$ is the matrix in (3.9)) and $\rho_{h}=1-\bar{r}h/(\kappa+1)$.

Proof.

We begin by recalling, from Example 3.13, that the SDE eigenvalues are $\Lambda^{+}=cH$ and $\Lambda^{-}=4-cH$. In addition, and for the reasons we pointed out in the continuous case, the discrete eigenvalues $R_{h}^{\pm}(H)$, for fixed $h$, are functions of the combination $\widetilde{H}=cH$. In other words, for fixed $\widetilde{H}$, $R_{h}^{\pm}$ depend only on $h$ (i.e. they are independent of $L$ and $m$). In this way, when thinking in terms of the variable $\widetilde{H}$, changing $L$ and $m$ only impacts the convergence $\widetilde{\Lambda}_{h}^{\pm}\rightarrow\Lambda^{\pm}$ by changing the interval $[cm,cL]=[\bar{c}/(\kappa+1),\bar{c}\kappa/(\kappa+1)]\subset[0,\bar{c}]$ of values of $\widetilde{H}$ that have to be considered. Therefore, how much $h$ has to be reduced to get $|\widetilde{\Lambda}^{\pm}_{h}(H)-\Lambda^{\pm}(H)|$ below a target error tolerance is independent of $H\in[m,L]$, $m>0$ and $L\geq m$.

Now the theorem is clearly true when $\kappa$ ranges in a bounded interval and we may suppose in what follows that $\kappa$ is so large that $\bar{c}/(\kappa+1)\leq 2-\bar{c}/2$. We consider the two eigenvalues $-$ and $+$ successively.

For $\widetilde{\Lambda}_{h}^{-}$ we note that, as $H$ varies in $[m,L]$,

\Lambda^{-}(H)\geq\Lambda^{-}(L)=4-cL=4-\frac{\bar{c}L}{L+m}=\frac{(4-\bar{c})L+4m}{L+m}=\frac{(4-\bar{c})\kappa+4}{\kappa+1}\geq 4-\bar{c}.

As a consequence, for $h$ small enough, $\widetilde{\Lambda}_{h}^{-}(H)>(1/2)(4-\bar{c})$. In view of the restriction on $\kappa$, $\widetilde{\Lambda}_{h}^{-}(H)>\bar{c}/(\kappa+1)$.

The discussion of the behaviour of $\widetilde{\Lambda}_{h}^{+}$ is more delicate. Here we need to take into account that, if we set $c=0$ in (1.2) so as to switch off the force $\nabla f(x)$ and the noise, then both integrators under consideration are exact. This implies that, at $\widetilde{H}=0$ and for arbitrary $h$, $\widetilde{\Lambda}_{h}^{+}$ coincides exactly with the continuous eigenvalue $\Lambda^{+}=0$. This in turn entails that the error $\widetilde{\Lambda}_{h}^{+}(H)-\Lambda^{+}(H)$ vanishes at $\widetilde{H}=0$ for each $h$ and must then have an expression of the form $h\widetilde{H}G(h,\widetilde{H})$, where $G$ is a smooth function (this is borne out in Figure 1, where the difference $\widetilde{\Lambda}_{h}^{+}(H)-\Lambda^{+}(H)$ decreases as $\widetilde{H}$ decreases with fixed $h$). Since $\Lambda^{+}(H)=\widetilde{H}$, for the relative error we may write $\big(\widetilde{\Lambda}_{h}^{+}(H)-\Lambda^{+}(H)\big)/\Lambda^{+}(H)=hG(h,\widetilde{H})$. By taking $h$ sufficiently small we may guarantee that $\widetilde{\Lambda}_{h}^{+}(H)\geq(2\bar{r}/\bar{c})\Lambda^{+}(H)$ and, since $\Lambda^{+}(H)\geq\Lambda^{+}(m)=cm=\bar{c}/(\kappa+1)$, we have $\widetilde{\Lambda}_{h}^{+}(H)\geq 2\bar{r}/(\kappa+1)$ and the proof is complete. ∎

Remark 4.10.

The same proof shows that if $c=1/L$ (as in [10]) a similar result holds with a rate $\rho_{h}=1-\bar{r}h/\kappa$, where $\bar{r}<1/2$ may be chosen arbitrarily.

Remark 4.11.

The choice $c=4/(L+m)$ guarantees contractivity in the SDE, but has to be excluded from Theorem 4.9. For this value, the proof breaks down because, as the condition number increases, $\Lambda^{-}(L)=4m/(L+m)$ is not bounded away from zero. By using the expressions of the eigenvalues $R^{\pm}_{h}(H)$ at $H=L$, it may be seen that contractivity requires $h=\mathcal{O}(\kappa^{-1})$ for the EE integrator and $h=\mathcal{O}(\kappa^{-1/2})$ for the second order UBU.

Remark 4.12.

Only two properties of the integrators EE and UBU have been used in the proof: (i) they are consistent, (ii) they are exact if the force and noise are switched off. The second of these was required to prove that, for each $h$, $\widetilde{\Lambda}_{h}^{+}(0)=0$, or equivalently, $R_{h}^{+}(0)=1$ (see (4.7)), which means that $\widehat{A}_{h}$ has $1$ as an eigenvalue. In fact, for all reasonable discretizations, it is true that $\widehat{A}_{h}$ has the eigenvalue $1$. This happens because with $v_{0}=0$ and $c=0$ (no velocity, no force) any reasonable discretization will yield $v_{1}=v_{0}$, $x_{1}=x_{0}$ (the particle stands still). Therefore the theorem is true for all integrators of interest.

  h    |      c=1/L              |     c=2/(L+m)           |     c=3/(L+m)
       |  EE          UBU        |  EE          UBU        |  EE          UBU
  2    |  ***         5.000(-10) |  ***         ***        |  ***         ***
  1    |  5.000(-10)  5.000(-10) |  ***         1.000(-9)  |  ***         1.500(-9)
  1/2  |  5.000(-10)  5.000(-10) |  1.000(-9)   1.000(-9)  |  ***         1.500(-9)
  1/4  |  5.000(-10)  5.000(-10) |  1.000(-9)   1.000(-9)  |  1.500(-9)   1.500(-9)
Table 1: Contractivity of the integrators EE and UBU for the underdamped Langevin equations with $\gamma=2$, $P_{h}=\widehat{P}\otimes I_{d}$ ($\widehat{P}$ as in (3.9)). The table gives the value of $(1-\rho_{h}^{1/2})/h$, where $\rho_{h}$ is as in (4.5). The symbol *** means that for that combination of $h$ and $c$ the integrator is not contractive. The table is for the large condition number $\kappa=10^{9}$. In the corresponding tables for $\kappa=10^{3}$ or $\kappa=10^{6}$ (not reproduced in this paper), the symbol *** appears at exactly the same locations as in the case $\kappa=10^{9}$ reported above, but the values of $(1-\rho_{h}^{1/2})/h$ are multiplied by $10^{6}$ or $10^{3}$ respectively, showing a $1/\kappa$ behaviour.
Example 4.13.

The proof of the theorem sheds no light on the size of the threshold h_{0}. With a view to ascertaining the range of values of h for which the integrators (4.2) and (4.3) behave contractively, we have computed numerically the eigenvalues in Proposition 4.7 and taken the supremum over H with fixed h. Table 1 provides information on the quotient (1-\rho_{h}^{1/2})/h that approximates the decay constant \lambda/2 in the estimate (3.5). For the parameter choice c=1/L used in [10], the table shows that, for h sufficiently small (say h\leq 1), there is numerical contractivity and the quotient coincides (within the number of digits reported) with the SDE rate \lambda/2=1/(2\kappa) found in Section 3.4. The table also gives results for the choices c=2/(L+m) and c=3/(L+m), where again the values of (1-\rho_{h}^{1/2})/h reported are in agreement with the SDE rates \lambda/2=1/(\kappa+1) and \lambda/2=(3/2)/(\kappa+1) found in Example 3.13.

For all choices of c, the UBU integrator (4.3) operates contractively for values of h that are larger than those required by the EE integrator (4.2). In addition, as c increases with fixed h, EE loses contractivity before UBU does. This is consistent with Remark 4.11.
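To indicate how computations of this kind may be reproduced, the following Python sketch evaluates the quotient (1-\rho_{h}^{1/2})/h by maximizing, over a grid of H\in[m,L], the \widehat{P}-norm contraction factor of the one-step mean map. It assumes the exponential-Euler form of EE recalled in Section 7.5 (with \mathcal{E}(t)=e^{-\gamma t}, \mathcal{F}(t)=(1-e^{-\gamma t})/\gamma), takes for \widehat{P} the Choleski factor \ell_{11}=\ell_{21}=\ell_{22}=1 found in Section 7.2 with the state ordered as (v,x), and uses function names of our own; it is a sketch of the kind of computation behind Table 1, not the code actually used there.

```python
import numpy as np

def ee_mean_map(H, h, c, gamma=2.0):
    # One-step mean map of the EE integrator for the scalar model f(x) = (H/2) x^2,
    # state ordered as (v, x): the force in (7.4) is frozen at x_n over the step
    # (exponential-Euler form recalled in Section 7.5).
    E = np.exp(-gamma * h)            # assumed \mathcal{E}(h) = exp(-gamma h)
    F = (1.0 - E) / gamma             # assumed \mathcal{F}(h)
    G = (h - F) / gamma               # \int_0^h \mathcal{F}(s) ds
    return np.array([[E, -c * H * F],
                     [F, 1.0 - c * H * G]])

def contraction_quotient(h, c, m, L, gamma=2.0, n_grid=2000):
    # rho_h = sup_H ||A_h(H)||_P^2 with P = Lhat Lhat^T, Lhat = [[1,0],[1,1]].
    # Returns (1 - rho_h^{1/2})/h, or None if the map is not contractive.
    Lhat = np.array([[1.0, 0.0], [1.0, 1.0]])
    Linv = np.linalg.inv(Lhat)
    rho = 0.0
    for H in np.linspace(m, L, n_grid):
        A = ee_mean_map(H, h, c, gamma)
        # the P-norm of A equals the spectral norm of Lhat^T A Lhat^{-T}
        rho = max(rho, np.linalg.norm(Lhat.T @ A @ Linv.T, 2) ** 2)
    return None if rho >= 1.0 else (1.0 - np.sqrt(rho)) / h

kappa, m = 1.0e9, 1.0
L = kappa * m
print(contraction_quotient(h=0.5, c=1.0 / L, m=m, L=L))  # should be close to 1/(2*kappa), cf. Table 1
```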

5 A general theorem

In this section we consider integrators for (3.1) of the form ξn+1=ψh(ξn,tn)\xi_{n+1}=\psi_{h}(\xi_{n},t_{n}), tn=nht_{n}=nh, n=0,1,n=0,1,\dots, h>0h>0, where, following the terminology in [36], ψh(ξ,t)\psi_{h}(\xi,t) represents the one-step approximation; ψh(ξ,t)\psi_{h}(\xi,t) uses the restriction of the Brownian motion WW in (3.1) to the interval [t,t+h][t,t+h], but this fact is not reflected in the notation. Integrators of the form (4.1) provide a particular class of integrators of this form (cf. Remark 4.4). If PhP_{h} is an N×NN\times N matrix 0\succ 0 and π\pi is an arbitrary probability distribution for the initial condition ξ0\xi_{0}, we wish to study the distance WPh(π,Ψh,nπ)W_{P_{h}}(\pi^{\star},\Psi_{h,n}\pi) between the invariant distribution π\pi^{\star} and the distribution Ψh,nπ\Psi_{h,n}\pi of ξn\xi_{n}.

In the analysis, for random vectors XNX\in\mathbb{R}^{N}, we use the Hilbert-space norm XL2,Ph=𝔼(XPh2)1/2\|X\|_{L^{2},P_{h}}=\mathbb{E}(\|X\|_{P_{h}}^{2})^{1/2}. The symbol ,L2,Ph\langle\cdot,\cdot\rangle_{L^{2},P_{h}} will be used for the corresponding inner product. We denote by ϕh(ξ,t)\phi_{h}(\xi,t) the exact counterpart of ψh(ξ,t)\psi_{h}(\xi,t), so that if ξ(t)\xi(t) is a solution of (3.1) then ξ(tn+1)=ϕh(ξ(tn),tn)\xi(t_{n+1})=\phi_{h}(\xi(t_{n}),t_{n}). At each time-level nn, n=0,1,2,n=0,1,2,\dots, we introduce a random variable ξ^nπ\widehat{\xi}_{n}\sim\pi^{\star} such that WP(π,Ψh,nπ)=ξ^nξnL2,PhW_{P}(\pi^{\star},\Psi_{h,n}\pi)=\|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}}. For the difference ϕh(ξ^n,tn)ψh(ξ^n,tn)\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n}) (that may be seen as a local error), we consider the following assumption:

Assumption 5.1.

There is a decomposition

ϕh(ξn^,tn)ψh(ξ^n,tn)=αh(ξ^n,tn)+βh(ξ^n,tn),\phi_{h}(\widehat{\xi_{n}},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n})=\alpha_{h}(\widehat{\xi}_{n},t_{n})+\beta_{h}(\widehat{\xi}_{n},t_{n}),

and positive constants pp, h0h_{0}, C0C_{0}, C1C_{1}, C2C_{2} such that for n0n\geq 0 and hh0h\leq h_{0}:

|ψh(ξ^n,tn)ψh(ξn,tn),αh(ξ^n,tn)L2,Ph|C0hξ^nξnL2,Phαh(ξ^n,tn)L2,Ph\left|\Big{\langle}\psi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\xi_{n},t_{n}),\alpha_{h}(\widehat{\xi}_{n},t_{n})\Big{\rangle}_{L^{2},P_{h}}\right|\leq C_{0}h\>\|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}}\>\|\alpha_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}} (5.1)

and

αh(ξ^n,tn)L2,PhC1hp+1/2,βh(ξ^n,tn)L2,PhC2hp+1.\|\alpha_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}\leq C_{1}h^{p+1/2},\qquad\|\beta_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}\leq C_{2}h^{p+1}. (5.2)

We can now state our general theorem, which gives a bound for the Wasserstein distance between the invariant measure \pi^{\star} and the distribution of the (n+1)-th iteration of the numerical scheme.

Theorem 5.2.

Assume that the integrator satisfies Assumption 5.1 and that, in addition, there are constants h_{0}>0, r>0 such that for h\leq h_{0} the contractivity estimate (4.5) holds with \rho_{h}\leq(1-rh)^{2}. Then, for any initial distribution \pi, stepsize h\leq h_{0}, and n=0,1,\dots,

WPh(π,Ψh,nπ)(1hRh)nWPh(π,π)+(2C1Rh+C2Rh)hp,{W_{P_{h}}(\pi^{\star},\Psi_{h,n}\pi)\leq(1-hR_{h})^{n}W_{P_{h}}(\pi^{\star},\pi)+\left(\frac{\sqrt{2}C_{1}}{\sqrt{R_{h}}}+\frac{C_{2}}{R_{h}}\right)h^{p},} (5.3)

with

Rh=1h(1(1rh)2+C0h2)=r+o(1),ash0.R_{h}=\frac{1}{h}\Big{(}1-\sqrt{(1-rh)^{2}+C_{0}h^{2}}\Big{)}=r+o(1),\quad{\rm as}\quad h\downarrow 0.
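The stated behaviour of R_{h} follows from a first-order Taylor expansion (for h small enough that rh<1):

\sqrt{(1-rh)^{2}+C_{0}h^{2}}=(1-rh)\sqrt{1+\frac{C_{0}h^{2}}{(1-rh)^{2}}}=1-rh+\mathcal{O}(h^{2}),\qquad\text{so that}\quad R_{h}=r+\mathcal{O}(h).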
Figure 2: Different approaches for obtaining a bound for W_{2}(\Psi_{h,n+1}\pi,\pi^{\star}). The vertical axes represent probability distributions and the horizontal axes correspond to time. Solid lines indicate evolution with the SDE and broken lines evolution with the integrator. On the left, the technique in [11, 13] and in this paper, where the contractivity of the integrator is used to propagate forward the distance between the target distribution \pi^{\star} and the distribution \Psi_{h,n}\pi of the numerical solution at time t_{n}. On the right, the technique in [10, 15], which leverages the contractivity of the SDE. On the left, the “local error” is based at the target; on the right, it is based at the numerical approximation.

Before going into the proof we make some observations:

  • The theorem is in the spirit of classic “consistency and stability imply convergence” results in numerical analysis. The main idea of the proof is schematically illustrated in the left panel of Figure 2. The error of interest at time level n+1 can be decomposed into two terms. The first is the distance between \Psi_{h}(\Psi_{h,n}\pi) and \Psi_{h}\pi^{\star} and can be bounded in terms of the error at time level n by using the contractivity of the numerical integrator. The second is the distance between \Psi_{h}\pi^{\star} and \Phi_{h}\pi^{\star}=\pi^{\star} and can be bounded by estimating the strong local error by means of Assumption 5.1.

  • The local error needs to be bounded in the strong sense. This should be compared with studies about weak convergence of the numerical distribution [1, 31] where the estimates depend on the weak order of the integrator.

  • The \beta=\mathcal{O}(h^{p+1}) part of the local error results in an \mathcal{O}(h^{p}) contribution to the bound on W_{P_{h}}(\pi^{\star},\Psi_{h,n}\pi). One power of h is lost in going from local to global errors, as in the classical analysis of (deterministic) numerical integrators.

  • The \alpha=\mathcal{O}(h^{p+1/2}) part of the local error is asked to satisfy the requirement (5.1) and only loses a factor h^{1/2} when going from local to global errors. This is reminiscent of the situation for the strong convergence of numerical solutions of SDEs (see e.g. [36, Theorem 1.1]), where, for instance, the Euler-Maruyama scheme with \mathcal{O}(h^{3/2}) strong local errors yields \mathcal{O}(h) strong global errors (assuming additivity of the noise). Typically, the \alpha part of the local error will consist of Ito integrals that, conditional on the events occurring up to the beginning of the time step, have zero expectation.
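The role of the two parts of the local error in the bound (5.3) can also be understood through a simple heuristic (not part of the proof): the contractive recursion damps the influence of the past after roughly (R_{h}h)^{-1} steps; the \alpha contributions, having zero conditional mean, then accumulate in mean square, while the \beta contributions accumulate linearly,

\Big((R_{h}h)^{-1}\big(C_{1}h^{p+1/2}\big)^{2}\Big)^{1/2}=\frac{C_{1}}{\sqrt{R_{h}}}\,h^{p},\qquad (R_{h}h)^{-1}\,C_{2}h^{p+1}=\frac{C_{2}}{R_{h}}\,h^{p},

which, up to the factor \sqrt{2}, are exactly the two contributions to the bias term in (5.3).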

Proof.

We may write

ϕh(ξ^n,tn)ξn+1\displaystyle\phi_{h}(\widehat{\xi}_{n},t_{n})-\xi_{n+1} =\displaystyle= (ψh(ξ^n,tn)ψh(ξn,tn))+(ϕh(ξ^n,tn)ψh(ξ^n,tn))\displaystyle\big{(}\psi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\xi_{n},t_{n})\big{)}+\big{(}\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n})\big{)}
=\displaystyle= (ψh(ξ^n,tn)ψh(ξn,tn)+αh(ξ^n,tn))+βh(ξ^n,tn),\displaystyle\big{(}\psi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\xi_{n},t_{n})+\alpha_{h}(\widehat{\xi}_{n},t_{n})\big{)}+\beta_{h}(\widehat{\xi}_{n},t_{n}),

and therefore, by the triangle inequality and (5.1), for hh0h\leq h_{0},

ϕh(ξ^n,tn)ξn+1L2,Ph\displaystyle\|\phi_{h}(\widehat{\xi}_{n},t_{n})-\xi_{n+1}\|_{L^{2},P_{h}}
ψh(ξ^n,tn)ψh(ξn,tn)+αh(ξ^n,tn)L2,Ph+βh(ξ^n,tn)L2,Ph\displaystyle\qquad\leq\|\psi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\xi_{n},t_{n})+\alpha_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}+\|\beta_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}
(ψh(ξ^n,tn)ψh(ξn,tn)L2,Ph2\displaystyle\qquad\leq\big{(}\|\psi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\xi_{n},t_{n})\|_{L^{2},P_{h}}^{2}
+2C0hξ^nξnL2,Phαh(ξ^n,tn)L2,Ph+αh(ξ^n,tn)L2,Ph2)1/2\displaystyle\qquad\qquad\qquad+2C_{0}h\|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}}\|\alpha_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}+\|\alpha_{h}(\widehat{\xi}_{n},t_{n})\|^{2}_{L^{2},P_{h}}\big{)}^{1/2}
+βh(ξ^n,tn)L2,Ph.\displaystyle\qquad\quad+\|\beta_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}.

We next apply the contractivity hypothesis, the elementary bound 2aba2+b22ab\leq a^{2}+b^{2}, and (5.2):

ϕh(ξ^n,tn)ξn+1L2,Ph\displaystyle\|\phi_{h}(\widehat{\xi}_{n},t_{n})-\xi_{n+1}\|_{L^{2},P_{h}}
((1rh)2ξ^nξnL2,Ph2+C02h2ξ^nξnL2,Ph2+2αh(ξ^n,tn)L2,Ph2)1/2\displaystyle\quad\leq\Big{(}(1-rh)^{2}\|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}}^{2}+C_{0}^{2}h^{2}\|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}}^{2}+2\|\alpha_{h}(\widehat{\xi}_{n},t_{n})\|^{2}_{L^{2},P_{h}}\Big{)}^{1/2}
+βh(ξ^n,tn)L2,Ph\displaystyle\qquad\qquad+\|\beta_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}
(((1rh)2+C0h2)ξ^nξnL2,Ph2+2C12h2p+1)1/2+C2hp+1.\displaystyle\quad\leq\Big{(}\big{(}(1-rh)^{2}+C_{0}h^{2}\big{)}\|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}}^{2}+2C_{1}^{2}h^{2p+1}\Big{)}^{1/2}+C_{2}h^{p+1}.

Therefore, in view of the choice of ξ^n\widehat{\xi}_{n}, and taking into account that ϕh(ξ^n,tn)π\phi_{h}(\widehat{\xi}_{n},t_{n})\sim\pi^{*},

WPh(π,Ψh,n+1π)\displaystyle W_{P_{h}}(\pi^{\star},\Psi_{h,n+1}\pi)
(((1rh)2+C0h2)WPh(π,Ψh,nπ)2+2C12h2p+1)1/2+C2hp+1.\displaystyle\qquad\qquad\leq\Big{(}\big{(}(1-rh)^{2}+C_{0}h^{2}\big{)}W_{P_{h}}(\pi^{\star},\Psi_{h,n}\pi)^{2}+2C_{1}^{2}h^{2p+1}\Big{)}^{1/2}+C_{2}h^{p+1}.

The conclusion follows after applying the lemma in Section 7.4. ∎

The paper [13] analyzes the Euler-Maruyama discretization of (1.1). Under two different smoothness assumptions on ff, it derives two different estimates similar to (5.3), one with p=1/2p=1/2 and the other with p=1p=1. The application of Theorem 5.2 to those two cases retrieves the estimates in [13]; in addition the proof of our theorem as applied to those two particular cases coincides with the proofs provided in that paper. (In fact we derived Theorem 5.2 as a generalization of the material in [13] to a more general scenario.)

By considering the case where the target distribution is a product of uncorrelated univariate Gaussians, one sees that the estimates of the mixing time in [13] are optimal in their dependence on ϵ\epsilon and dd. [16] have shown that those estimates are not optimal in their dependence on κ\kappa. This implies that the result in Theorem 5.2 is not necessarily the best that may be achieved in each application.

6 Application to underdamped Langevin dynamics

We now apply Theorem 5.2 to integrators for (1.2).

6.1 The EE integrator

We begin with the EE integrator (4.2). For its local error we have the following theorem, proved in Section 7.5. Recall that setting γ=2\gamma=2 does not restrict the generality, as such a choice is equivalent to choosing the units of time. Once this value of γ\gamma has been fixed, the choice of PhP_{h} in the theorem that follows is the one that allows the best contraction rate for the SDE (1.2).

Theorem 6.1.

Set \gamma=2 and P_{h}=\widehat{P}\otimes I_{d} with \widehat{P} as in (3.9). Then, for h\leq 1, the discretization (4.2) satisfies Assumption 5.1 with p=1, C_{0}=C_{1}=0 and C_{2}=Kc^{3/2}Ld^{1/2}, where K is an absolute constant (K=\sqrt{6+2\sqrt{5}}/3).

We may now apply Theorem 5.2 to the situation at hand; to do so we need the contractivity of the scheme in the P-norm studied in Theorem 4.9. The discussion that follows may immediately be extended to all choices of c=c(L,m) that lead to contractivity; however, for the sake of clarity, we fix c=1/L as in [10] (but recall that c=1/L is suboptimal in terms of the contraction rate). For this choice, we know from Remark 4.10 that, for h\leq h_{0}, the numerical rate (1-\rho_{h}^{1/2})/h will be of the form \bar{r}/\kappa, where \bar{r}<1/2 may be chosen arbitrarily close to 1/2 (h_{0} depends of course on \bar{r} but is independent of d, L and m). Theorem 5.2 yields, for h\leq h_{0},

WP(π,Ψn,hπ)(1r¯hκ)nWP(π,π)+Kκd1/2r¯L1/2h.W_{P}(\pi^{\star},\Psi_{n,h}\pi)\leq\left(1-\frac{\bar{r}h}{\kappa}\right)^{n}W_{P}(\pi^{\star},\pi)+\frac{K\kappa d^{1/2}}{\bar{r}L^{1/2}}h. (6.1)
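For the record, the constant in (6.1) is obtained by inserting the constants of Theorem 6.1 and the rate of Remark 4.10 into (5.3): since C_{0}=C_{1}=0 and \rho_{h}\leq(1-\bar{r}h/\kappa)^{2}, we have R_{h}=\bar{r}/\kappa and, with p=1,

\frac{C_{2}}{R_{h}}\,h=Kc^{3/2}Ld^{1/2}\,\frac{\kappa}{\bar{r}}\,h=\frac{K\kappa d^{1/2}}{\bar{r}L^{1/2}}\,h,

because c^{3/2}L=L^{-1/2} for the choice c=1/L.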

Assume now that, given any initial distribution π\pi for the integrator and ϵ>0\epsilon>0, we wish to find hh and nn to guarantee that WP(π,Ψn,hπ)<ϵW_{P}(\pi^{\star},\Psi_{n,h}\pi)<\epsilon. We may achieve this goal by first choosing hh0h\leq h_{0} small enough to ensure that

Kκd1/2r¯L1/2h<ϵ2\frac{K\kappa d^{1/2}}{\bar{r}L^{1/2}}h<\frac{\epsilon}{2}

and then choosing nn large enough to get

(1r¯hκ)nWP(π,π)<ϵ2.\left(1-\frac{\bar{r}h}{\kappa}\right)^{n}W_{P}(\pi^{\star},\pi)<\frac{\epsilon}{2}.

(Instead of splitting the target ϵ\epsilon as ϵ/2+ϵ/2\epsilon/2+\epsilon/2, one may use aϵ+(1a)ϵa\epsilon+(1-a)\epsilon, a(0,1)a\in(0,1), and tune aa to improve slightly some of the error constants below.) This leads to the conditions

h<min(h0,r¯2K(m1/2ϵ)κ1/2d1/2),h<\min\left(h_{0},\frac{\bar{r}}{2K}(m^{1/2}\epsilon)\kappa^{-1/2}d^{-1/2}\right), (6.2)

(m1/2ϵm^{1/2}\epsilon is a nondimensional combination whose value does not change if the scale of xx is changed) and

n>1log(1r¯h/κ)log2WP(π,π)ϵ.n>\frac{-1}{\log(1-\bar{r}h/\kappa)}\log\frac{2W_{P}(\pi^{\star},\pi)}{\epsilon}. (6.3)
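The following Python sketch (names, default values and the contractivity threshold h_{0} are our own choices, not prescribed by the theory) turns (6.2)–(6.3) into a step-size/iteration-count recipe for EE with c=1/L; W0 stands for W_{P}(\pi^{\star},\pi).

```python
import math

def ee_stepsize_and_steps(m, L, d, eps, W0, rbar=0.49, h0=1.0):
    # h from (6.2) and n from (6.3) for the EE integrator with c = 1/L.
    # K is the constant of Theorem 6.1; rbar < 1/2 is arbitrary; h0 is the
    # (problem-dependent) contractivity threshold, here simply a user input.
    # For simplicity the strict bound (6.2) is taken at its boundary.
    kappa = L / m
    K = math.sqrt(6.0 + 2.0 * math.sqrt(5.0)) / 3.0
    h = min(h0, (rbar / (2.0 * K)) * math.sqrt(m) * eps / math.sqrt(kappa * d))
    n = math.ceil(-math.log(2.0 * W0 / eps) / math.log1p(-rbar * h / kappa))
    return h, n

print(ee_stepsize_and_steps(m=1.0, L=100.0, d=10, eps=0.1, W0=10.0))
```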

According to the bound (6.2), hh has to be scaled like (m1/2ϵ)κ1/2d1/2(m^{1/2}\epsilon)\kappa^{-1/2}d^{-1/2} as ϵ0\epsilon\downarrow 0, κ\kappa\uparrow\infty or dd\uparrow\infty; then the number of steps to achieve a target value of the contraction factor (1r¯h/κ)n(1-\bar{r}h/\kappa)^{n} scales as

(m1/2ϵ)1κ3/2d1/2.(m^{1/2}\epsilon)^{-1}\kappa^{3/2}d^{1/2}. (6.4)

The analysis of the same integrator in [10, Theorem 1] yields scalings that are more pessimistic in their dependence on κ\kappa: there, hh scales as (m1/2ϵ)κ1d1/2(m^{1/2}\epsilon)\kappa^{-1}d^{-1/2} and nn as (m1/2ϵ)1κ2d1/2(m^{1/2}\epsilon)^{-1}\kappa^{2}d^{1/2}. In addition, the initial distribution π\pi, which is arbitrary in the present study, is assumed in [10] to be a Dirac delta located at v=0v=0 and x=x0x=x_{0}; the estimates become worse as the distance between the initial position x0x_{0} used in the integrator and the mode of exp(f(x))\exp(-f(x)) increases.

It is perhaps useful to compare the technique of proof in [10] with our approach by means of Figure 2. While we employ the contractivity of the algorithm, [10] leverages the contractivity of the SDE itself. These two alternative approaches are well known in deterministic numerical differential equations (see e.g. the discussion in [23, Chapter 2], where a cartoon similar to Figure 2 is presented). On the other hand, while we investigate \phi_{h}(\cdot,t_{n})-\psi_{h}(\cdot,t_{n}) evaluated at a random variable \widehat{\xi}_{n} whose marginal distribution is \pi^{\star}, [10] has to evaluate that difference at the numerical solution \xi_{n}. It is for this reason that in [10] one needs to have information on the distribution \pi of \xi_{0} and to establish a priori bounds on the distributions of \xi_{n} as n varies (this is done in Lemma 12 of [10]). Generally speaking, once contractivity estimates are available for the numerical solution, the approach on the left of Figure 2 is to be preferred.

The reference [15] also investigates Wasserstein error bounds for the integrator EE. The general approach is the same as that in [10], but the technical details differ. For c\leq 4/(L+m) (because [15] uses the SDE contractivity, it does not have to exclude the limit case c=4/(L+m) as we do; see Remark 4.11, which implies that for this integrator bounds valid at c=4/(L+m) are only possible for h=\mathcal{O}(\kappa^{-1})), an upper bound very similar to (6.1) is derived for the 2-Wasserstein distance between the x-marginals of \pi^{\star} and \Psi_{n,h}\pi. That bound depends on \kappa, L, d and h in the same way as (6.1) does. The constants in the estimates are nevertheless different, as expected. For instance, for the choice c=1/L we have been discussing, the factor 1-\bar{r}h/\kappa in (6.1) (where \bar{r} is arbitrarily close to 1/2) is replaced in [15] by the slightly worse factor 1-0.375h/\kappa. However, the bound in [15] is only proved for very small values of h (h\leq 1/(8\kappa) when c=1/L). This is an extremely severe limitation because we know from (6.2) that, as the condition number increases, EE may be operated with a value of h of the order of 1/\sqrt{\kappa} rather than 1/\kappa. The unwelcome step size restriction originates from estimating \phi_{h}(\cdot,t_{n})-\psi_{h}(\cdot,t_{n}) at the numerical solution rather than at the SDE solution.

6.2 The UBU integrator

We now turn our attention to the UBU integrator. Under the standard smoothness Assumptions 2.1 and 2.2, the strong order of convergence of UBU is p=1p=1 (see Section 7.7 for a detailed analysis of the UBU local error under those assumptions). The analysis for UBU is then very similar to the one presented above for EE, and leads to an 𝒪(ϵ1κ3/2d1/2)\mathcal{O}(\epsilon^{-1}\kappa^{3/2}d^{1/2}) estimate for the mixing time. When ff satisfies the additional smoothness Assumption 2.3, UBU exhibits strong order p=2p=2 and this may be used in our context to improve on the estimates (6.2)–(6.3).

The proof of the next result is given in Section 7.6.

Theorem 6.2.

Assume that f satisfies Assumptions 2.1–2.3. Set \gamma=2 and P_{h}=\widehat{P}\otimes I_{d} with \widehat{P} as in (3.9). Then, for h\leq 2, the discretization (4.3) satisfies Assumption 5.1 with p=2,

C0\displaystyle C_{0} =\displaystyle= K0(2+cL),\displaystyle K_{0}(2+cL),
C1\displaystyle C_{1} =\displaystyle= K1c3/2Ld1/2,\displaystyle K_{1}c^{3/2}Ld^{1/2},
C2\displaystyle C_{2} =\displaystyle= K2((1+43)c2L3/2+(3+422)c3/2L+6cL1/2+3c2L1)d1/2,\displaystyle K_{2}\Big{(}(1+4\sqrt{3})c^{2}L^{3/2}+(3+\frac{\sqrt{42}}{2})c^{3/2}L+6cL^{1/2}+\sqrt{3}c^{2}L_{1}\Big{)}d^{1/2},

where KjK_{j}, j=0,1,2,j=0,1,2, are the following absolute constants

K0=2235,K1=312,K2=1243+52.K_{0}=\sqrt{\frac{2\sqrt{2}}{3-\sqrt{5}}},\qquad K_{1}=\frac{\sqrt{3}}{12},\qquad K_{2}=\frac{1}{24}\sqrt{\frac{3+\sqrt{5}}{2}}.

The contractivity of the scheme in the PP-norm necessary to use Theorem 5.2 was established in Theorem 4.9. As we did for the first-order integrator, for the sake of clarity, we fix c=1/Lc=1/L (but other values of cc may be discussed similarly, provided that they ensure the contractivity of the algorithm). Note that the constant C0=K0(2+cL)C_{0}=K_{0}(2+cL) is then 3K0\leq 3K_{0}. After choosing r¯<1/2\bar{r}<1/2 arbitrarily as in Remark 4.10, Theorem 5.2 yields, for hh0h\leq h_{0}:

WP(π,Ψn,hπ)(1r¯hκ)nWP(π,π)+K¯(1L+L1L2)κd1/2h2,W_{P}(\pi^{\star},\Psi_{n,h}\pi)\leq\left(1-\frac{\bar{r}h}{\kappa}\right)^{n}W_{P}(\pi^{\star},\pi)+\bar{K}\left(\frac{1}{\sqrt{L}}+\frac{L_{1}}{L^{2}}\right)\kappa d^{1/2}h^{2}, (6.5)

where K¯\bar{K} denotes an absolute constant. To ensure WP(π,Ψn,hπ)<ϵW_{P}(\pi^{\star},\Psi_{n,h}\pi)<\epsilon, we take

K¯(1L+L1L2)κd1/2h2<ϵ2,\bar{K}\left(\frac{1}{\sqrt{L}}+\frac{L_{1}}{L^{2}}\right)\kappa d^{1/2}h^{2}<\frac{\epsilon}{2},

and then increase nn as in (6.3). Thus, for UBU, the scaling of hh is

(m1/2ϵ)1/2κ1/4(1+L3/2L1)1/2d1/4,(m^{1/2}\epsilon)^{1/2}\kappa^{-1/4}(1+L^{-3/2}L_{1})^{-1/2}d^{-1/4},

and, as a consequence, the number of steps nn to guarantee a target contraction factor (1r¯h/κ)n(1-\bar{r}h/\kappa)^{n} scales as

(m1/2ϵ)1/2κ5/4(1+L3/2L1)1/2d1/4.(m^{1/2}\epsilon)^{-1/2}\kappa^{5/4}(1+L^{-3/2}L_{1})^{1/2}d^{1/4}. (6.6)
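For completeness, these scalings follow from the bias condition above by solving for h and writing L=m\kappa, so that L^{1/4}=m^{1/4}\kappa^{1/4}; up to absolute constants,

h\lesssim\left(\frac{\epsilon\sqrt{L}}{\kappa d^{1/2}\big(1+L^{-3/2}L_{1}\big)}\right)^{1/2}=(m^{1/2}\epsilon)^{1/2}\kappa^{-1/4}(1+L^{-3/2}L_{1})^{-1/2}d^{-1/4},\qquad n\gtrsim\frac{\kappa}{\bar{r}h}\log\frac{2W_{P}(\pi^{\star},\pi)}{\epsilon},

and substituting the former into the latter yields (6.6), up to the logarithmic factor.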

The dependence of nn on m1/2ϵm^{1/2}\epsilon, κ\kappa and dd in this estimate is far more favourable than it was for EE (see (6.4)). However here we have the (L1L_{1} dependent) factor (1+L3/2L1)1/2(1+L^{-3/2}L_{1})^{1/2} and one could easily concoct examples of distributions where this factor is large even if the condition number is of moderate size. In those, arguably artificial, particular cases, it may be advantageous to see UBU as a first order method as discussed at the beginning of this section.

Remark 6.3.

A comparison between (6.1) and (6.5) makes it clear that (for fixed m, L, L_{1}) the \epsilon^{-1/2}d^{1/4} dependence of the mixing time of UBU stems from having strong order two. In the second term of the right hand-side of the inequalities (6.1) and (6.5) (i.e. the bias), the exponent of h coincides with the strong order of the integrator. In order to make those second terms of size \epsilon one needs to scale h as \epsilon d^{-1/2} for EE and as \epsilon^{1/2}d^{-1/4} for UBU. The first terms of the right hand-side of the inequalities (6.1) and (6.5) then show that n has to be scaled as h^{-1}, i.e. as \epsilon^{-1}d^{1/2} for the first-order method and as \epsilon^{-1/2}d^{1/4} for the second-order method.

Integrators of strong order higher than two would have even more favourable dependence of the mixing time on ϵ\epsilon and dd. Unfortunately such high-order integrators [37] are invariably too complicated to be of much practical significance. In particular there is no splitting algorithm that achieves strong order larger than two [4]. In addition, an increase of the order may be expected to require an increase of the required smoothness of ff.

The randomized algorithm in [48] has a bias that behaves as d1/2h3/2d^{1/2}h^{3/2} leading to an ϵ2/3d1/3\epsilon^{-2/3}d^{1/3} estimate of the mixing time.

Remark 6.4.

The paper [13] considers a weaker form of the extra-smoothness assumption Assumption 2.3 where (x)\mathcal{H}(x), rather than assumed to be differentiable with derivative upper-bounded by L1L_{1}, is only assumed to be Lipschitz continuous with constant L1L_{1}. It is likely that, by means of the technique in the proof of [13, Lemma 6], Theorem 6.2 may be proved under that alternative, weaker version of Assumption 2.3, but we have not yet studied that possibility.

A second order discretization of the underdamped Langevin equation that, unlike UBU, requires one evaluation of the Hessian of f per step has been suggested in [15]. A bound similar to (6.5) is derived, which is valid only for small values h\leq\mathcal{O}(\kappa^{-1})\wedge\mathcal{O}(L^{1/2}md^{-1/2}L_{1}^{-1}). This is very restrictive because, as we have just found, for UBU h scales with \kappa as \kappa^{-1/4}.

The reference [21] suggests a novel approach to obtaining high order discretizations of the underdamped Langevin dynamics (1.2). At each step, in addition to generating suitable random variables, one has to integrate a so-called “shifted ODE”, whose solutions are smoother than the solutions of (1.2). The analysis in that reference examines the case where the integration is exact; in practice, the shifted ODE has of course to be discretized by a suitable numerical method and [21] provides numerical examples based on two different choices of such a method.

7 Additional results and proofs

7.1 Proof of Proposition 3.4

The second item in the Proposition is proved as follows. By standard linear algebra results, \xi may be uniquely decomposed as \xi=n+m with n in the kernel of C and m in the image of C^{T}; furthermore there is a bijection between values of m and values of x=C\xi. Under the assumption SC^{T}=0, which implies CS=0, Sm and m^{T}S vanish, and then \xi^{T}S\xi=n^{T}Sn is independent of m, i.e. of x. Therefore the x-marginal of the density \propto\exp(-f(C\xi)-(1/2)\xi^{T}S\xi) coincides with the x-marginal of \propto\exp(-f(C\xi)).
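As a concrete illustration (the matrices below are our own choice, consistent with the Gaussian velocity marginal used in Section 7.5), for the underdamped system one may write \xi=(v,x) and take

C=\begin{bmatrix}0&I_{d}\end{bmatrix},\qquad S=\begin{bmatrix}c^{-1}I_{d}&0\\ 0&0\end{bmatrix},\qquad SC^{T}=0,

so that the x-marginal of \propto\exp\big(-f(x)-\|v\|^{2}/(2c)\big) is indeed \propto\exp(-f(x)).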

For the first item, we have to show that the pdf kexp(f(x)(1/2)ξTSξ)k\exp\big{(}-f(x)-(1/2)\xi^{T}S\xi\big{)} (kk is the normalizing constant), that with some abuse of notation we denote by π\pi^{\star} satisfies the Fokker-Planck equation

ξ(π(Aξ+Bf(x)))+ξ(Dξπ)=0.-\nabla_{\xi}\cdot\big{(}\pi^{\star}(A\xi+B\nabla f(x))\big{)}+\nabla_{\xi}\cdot(D\nabla_{\xi}\pi^{\star})=0.

Here \nabla_{\xi}\cdot and \nabla_{\xi} respectively denote the standard divergence and gradient operators in the space \mathbb{R}^{N} of the variable \xi. The computations that follow use repeatedly the well-known identity \nabla_{\xi}\cdot(cF)=c\,\nabla_{\xi}\cdot F+F^{T}\nabla_{\xi}c, where c=c(\xi) is a scalar valued function and F=F(\xi) is an \mathbb{R}^{N}-valued function. We will also use that if R is any M\times d constant matrix, then \nabla_{\xi}\cdot(R\nabla f(x))=CR:H(x), where H(x) denotes the Hessian of f(x) and : stands for the Frobenius product of matrices (equivalently, CR:H(x)={\rm Tr}((CR)^{T}H(x))).

We observe that

ξπ=π(Sξ+CTf(x))\nabla_{\xi}\pi^{\star}=-\pi^{\star}\big{(}S\xi+C^{T}\nabla f(x)\big{)}

and therefore

ξ(πAξ)=πTr(A)πξTATSξπξTATCTf(x)\nabla_{\xi}\cdot(\pi^{\star}A\xi)=\pi^{\star}{\rm Tr}(A)-\pi^{\star}\xi^{T}A^{T}S\xi-\pi^{\star}\xi^{T}A^{T}C^{T}\nabla f(x)

and

ξ(πBf(x))\displaystyle\nabla_{\xi}\cdot\big{(}\pi^{\star}B\nabla f(x)\big{)} =\displaystyle= π(CB:H(x))\displaystyle\pi^{\star}(CB:H(x))
π(f(x))TBTSξπ(f(x))TBTCTf(x).\displaystyle-\pi^{\star}(\nabla f(x))^{T}B^{T}S\xi-\pi^{\star}(\nabla f(x))^{T}B^{T}C^{T}\nabla f(x).

Furthermore

ξ(Dξπ)\displaystyle\nabla_{\xi}\cdot(D\nabla_{\xi}\pi^{\star}) =\displaystyle= ξ(πDSξ)ξ(πDCTf(x))\displaystyle-\nabla_{\xi}\cdot(\pi^{\star}DS\xi)-\nabla_{\xi}\cdot(\pi^{\star}DC^{T}\nabla f(x))
=\displaystyle= πTr(DS)+πξTSDSξ+πξTSDCTf(x)\displaystyle-\pi^{\star}{\rm Tr}(DS)+\pi^{\star}\xi^{T}SDS\xi+\pi^{\star}\xi^{T}SDC^{T}\nabla f(x)
π(CDCT:H(x))\displaystyle-\pi^{\star}(CDC^{T}:H(x))
+π(f(x))TCDSξ+π(f(x))TCDCTf(x).\displaystyle+\pi^{\star}(\nabla f(x))^{T}CDS\xi+\pi^{\star}(\nabla f(x))^{T}CDC^{T}\nabla f(x).

From the last three displays we conclude that the left hand-side of the Fokker-Planck equation is the product of \pi^{\star} and

Tr(A+DS)\displaystyle-{\rm Tr}(A+DS)
(CB+CDCT):H(x)+(f(x))T(BTCT+CDCT)f(x)\displaystyle-(CB+CDC^{T}):H(x)+(\nabla f(x))^{T}(B^{T}C^{T}+CDC^{T})\nabla f(x)
+ξT(ATCT+SB+2SDCT)f(x)\displaystyle+\xi^{T}(A^{T}C^{T}+SB+2SDC^{T})\nabla f(x)
+ξT(ATS+SDS)ξ;\displaystyle+\xi^{T}(A^{T}S+SDS)\xi;

each of the first three relations in (3.2) is sufficient for the corresponding line in this display to vanish. (In addition, if ff is regarded as arbitrary, then those three relations are also necessary.) The quadratic form in the fourth line in the display vanishes if and only if ATS+SDSA^{T}S+SDS is skew-symmetric as demanded by the fourth relation in (3.2). This completes the proof.

7.2 Contraction estimates for the underdamped Langevin equations

We consider the underdamped Langevin equations (1.2) where, after rescaling tt, we may assume that γ=2\gamma=2. We apply Proposition 3.10 to determine P^\widehat{P} and cc so as to maximize the decay rate λ\lambda. We exclude the case L=mL=m, which has no practical relevance.

If

\widehat{L}=\left[\begin{matrix}\ell_{11}&0\\ \ell_{21}&\ell_{22}\end{matrix}\right],

11,22>0\ell_{11},\ell_{22}>0, denotes the unknown Choleski factor of P^\widehat{P}, the eigenvalues of L^1Z^iL^T\widehat{L}^{-1}\widehat{Z}_{i}\widehat{L}^{-T} are found to be

2±(112cHi21121+212222)2+4222(1121)21122.2\pm\frac{\sqrt{\left(\ell^{2}_{11}cH_{i}-2\ell_{11}\ell_{21}+\ell^{2}_{21}-\ell^{2}_{22}\right)^{2}+4\ell^{2}_{22}\left(\ell_{11}-\ell_{21}\right)^{2}}}{\ell_{11}\ell_{22}}. (7.1)

Since our aim is to ensure that these eigenvalues have a lower bound as large as possible, we only have to consider the minus sign in (7.1).

Without loss of generality, we may set 11=1\ell_{11}=1 and then have to find c>0c>0, 22>0\ell_{22}>0 and 21\ell_{21} to minimize

supmHL{1222[cH(222+221212)]2+4(121)2}.\sup_{m\leq H\leq L}\left\{\frac{1}{\ell_{22}^{2}}\Big{[}cH-(\ell_{22}^{2}+2\ell_{21}-\ell_{21}^{2})\Big{]}^{2}+4(1-\ell_{21})^{2}\right\}. (7.2)

Consider a local minimum cc, 21\ell_{21}, 22\ell_{22} of the minimization problem. We claim that

cL+m2=ac\frac{L+m}{2}=a (7.3)

where

a=222+221212.a=\ell_{22}^{2}+2\ell_{21}-\ell_{21}^{2}.

In other words a has to coincide with the midpoint of the interval [cm,cL] of possible values of cH. In fact, assume that c(L+m)/2>a, i.e. the point a is to the left of the midpoint. Then the supremum in (7.2) is attained at H=L, because |cm-a|<|cL-a|; we could lower the value of the supremum by decreasing slightly c. If c(L+m)/2<a, the supremum decreases by increasing slightly c.

When (7.3) holds the supremum in (7.2) is the common value that the expression in braces takes at H=mH=m and H=LH=L. This common value is:

1222(LmL+m)2(222+221212)2+4(121)2,\frac{1}{\ell_{22}^{2}}\left(\frac{L-m}{L+m}\right)^{2}(\ell_{22}^{2}+2\ell_{21}-\ell_{21}^{2})^{2}+4(1-\ell_{21})^{2},

that we rewrite as

1222(LmL+m)2(222+1(121)2)2+4(121)2.\frac{1}{\ell_{22}^{2}}\left(\frac{L-m}{L+m}\right)^{2}\big{(}\ell_{22}^{2}+1-(1-\ell_{21})^{2}\big{)}^{2}+4(1-\ell_{21})^{2}.

With b=(Lm)2/(L+m)2<1b=(L-m)^{2}/(L+m)^{2}<1, our task is to find X=222>0X=\ell_{22}^{2}>0, Y=(121)20Y=(1-\ell_{21})^{2}\geq 0 so as to minimize

F(X,Y)=bX(X+1Y)2+4Y.F(X,Y)=\frac{b}{X}(X+1-Y)^{2}+4Y.

We first fix Y[0,1)Y\in[0,1). Then FF\rightarrow\infty as X0X\rightarrow 0 or XX\rightarrow\infty. By setting F/X=0\partial F/\partial X=0, we easily see that FF has a unique minimum at X=1Y(0,1]X=1-Y\in(0,1]. At that minimum

F=4(b1)X+4,F=4(b-1)X+4,

which, in turn, is minimized by taking X as large as possible, i.e. X=1. Then Y=0 and F=4b<4. We then fix Y\geq 1. In this case, F\geq 4Y\geq 4, which is worse than the best F that can be achieved with Y\in[0,1). To sum up: the optimum value of F is 4b and is achieved when X=1, Y=0, i.e. \ell_{22}=1, \ell_{21}=1; then the matrix \widehat{P} is the one in (3.9). Taking these values of \ell_{ij} into (7.3), we find that c=4/(L+m) provides the optimal parameter choice. Finally, since the best value of (7.2) is 4b, the expression (7.1) shows that the best decay rate is \lambda=2-\sqrt{4b}=4m/(L+m).
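The conclusion may also be double-checked numerically; the Python sketch below (names our own) evaluates the objective (7.2) on a crude grid of parameters and compares the best value found with the value 4b attained at c=4/(L+m), \ell_{21}=\ell_{22}=1.

```python
import numpy as np

def objective(c, l21, l22, m, L):
    # The supremum in (7.2); the expression in braces is a convex quadratic in H,
    # so the supremum over [m, L] is attained at one of the endpoints.
    a = l22 ** 2 + 2.0 * l21 - l21 ** 2
    val = lambda H: (c * H - a) ** 2 / l22 ** 2 + 4.0 * (1.0 - l21) ** 2
    return max(val(m), val(L))

m, L = 1.0, 25.0
b = ((L - m) / (L + m)) ** 2
grid = np.linspace(0.05, 2.0, 80)
best = min(objective(c, l21, l22, m, L)
           for c in grid for l21 in grid for l22 in grid)
# the grid minimum is close to (and never below) 4b, attained near the stated optimum
print(best, objective(4.0 / (L + m), 1.0, 1.0, m, L), 4.0 * b)
```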

7.3 Integrators for the underdamped Langevin equations

Due to the importance of (1.2) in statistical physics and molecular dynamics, the literature on its numerical integration is by now enormous. It is completely out of our scope to summarize it and we limit ourselves to a few comments on splitting algorithms.

The different terms in the right hand-side of (1.2a) and (1.2b) correspond to different, separate physical effects, like inertia, noise, damping, etc. Therefore the system (1.2) is ideally suited to splitting algorithms [35]. A possible way of carrying out the splitting is

(A)\displaystyle{\rm(A)} (d/dt)v=0,\displaystyle\qquad(d/dt)v=0,\> (d/dt)x=v,\displaystyle(d/dt)x=v,
(B)\displaystyle{\rm(B)} (d/dt)v=cf(x),\displaystyle\qquad(d/dt)v=-c\nabla f(x),\> (d/dt)x=0,\displaystyle(d/dt)x=0,
(O)\displaystyle{\rm(O)} dv=γvdt+2γcdW,\displaystyle\qquad dv=-\gamma vdt+\sqrt{2\gamma c}dW,\> (d/dt)x=0.\displaystyle(d/dt)x=0.

Each of these subsystems may be integrated in closed form. This partitioning gives rise to schemes like ABOBA, BAOAB and OBABO [28, 29, 39]. For instance, a step of ABOBA first advances the numerical solution over [tn,tn+1/2][t_{n},t_{n+1/2}] using the exact solution of (A), then over [tn,tn+1/2][t_{n},t_{n+1/2}] using the exact solution of (B), then over [tn,tn+1][t_{n},t_{n+1}] with (O), then over [tn+1/2,tn+1][t_{n+1/2},t_{n+1}] with (B) and finally closes symmetrically with (A) over [tn+1/2,tn+1][t_{n+1/2},t_{n+1}]. ABOBA, BAOAB and OBABO have weak order two but only possess strong order one. In fact it is easy to check that with the A-B-O splitting it is impossible to generate the Ito integrals that are required to ensure strong order two.
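As an illustration of how these schemes are assembled, here is a minimal Python sketch (names and signatures are our own) of the three exact sub-flows and of one ABOBA step; grad_f stands for \nabla f, the parameters are those of (1.2), and v, x are numpy arrays of shape (d,).

```python
import numpy as np

def flow_A(v, x, t):                    # (A): free flight
    return v, x + t * v

def flow_B(v, x, t, c, grad_f):         # (B): force kick
    return v - t * c * grad_f(x), x

def flow_O(v, x, t, c, gamma, rng):     # (O): exact Ornstein-Uhlenbeck velocity step
    e = np.exp(-gamma * t)
    return e * v + np.sqrt(c * (1.0 - e ** 2)) * rng.standard_normal(v.shape), x

def aboba_step(v, x, h, c, gamma, grad_f, rng):
    v, x = flow_A(v, x, 0.5 * h)
    v, x = flow_B(v, x, 0.5 * h, c, grad_f)
    v, x = flow_O(v, x, h, c, gamma, rng)
    v, x = flow_B(v, x, 0.5 * h, c, grad_f)
    v, x = flow_A(v, x, 0.5 * h)
    return v, x
```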

Another subsystem that may be integrated exactly in closed form is

(U)\displaystyle{\rm(U)} dv=γvdt+2γcdW,\displaystyle\qquad dv=-\gamma vdt+\sqrt{2\gamma c}dW,\> (d/dt)x=v.\displaystyle(d/dt)x=v.

The algorithm BUB, used e.g. in [8], advances with (B) over [t_{n},t_{n+1/2}], then with (U) over [t_{n},t_{n+1}] and closes the step with (B) over [t_{n+1/2},t_{n+1}]. To the best of our knowledge it was first suggested by Skeel [49]. In [21] it is referred to as Strang splitting, but this terminology may be confusing because it is standard to use the expression Strang splitting for any splitting algorithm with a symmetric pattern XYX [46]. The authors of [21], a reference that compares different integrators, “believe that BUB offers an attractive compromise between accuracy and computational cost”.

Changing the roles of B and U we obtain the UBU scheme suggested in the thesis [52], where it is proved that both BUB and UBU have weak and strong order two. In fact BUB and UBU are closely related, because UBU is the algorithm that advances the BUB solution from the midpoint t_{n+1/2} of one step to the midpoint t_{n+3/2} of the next (and vice versa).
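In the same (hypothetical) notation, and reusing flow_B from the previous sketch, one UBU step may be written as follows; the exact (U) map updates v and x jointly, and the covariance of its Gaussian part is the standard one for an Ornstein–Uhlenbeck velocity and its time integral, consistent with the forms \mathcal{E}(t)=e^{-\gamma t}, \mathcal{F}(t)=(1-e^{-\gamma t})/\gamma assumed in the sketch of Section 4.

```python
import numpy as np

def flow_U(v, x, t, c, gamma, rng):
    # Exact solution of (U): dv = -gamma v dt + sqrt(2 gamma c) dW, dx = v dt.
    E = np.exp(-gamma * t)                    # \mathcal{E}(t)
    F = (1.0 - E) / gamma                     # \mathcal{F}(t)
    var_v = c * (1.0 - E ** 2)                # variance of the velocity noise
    var_x = (2.0 * c / gamma) * (t - 2.0 * F + (1.0 - E ** 2) / (2.0 * gamma))
    cov_vx = c * (1.0 - E) ** 2 / gamma       # covariance of velocity/position noise
    chol = np.linalg.cholesky(np.array([[var_v, cov_vx], [cov_vx, var_x]]))
    z = chol @ rng.standard_normal((2,) + v.shape)
    return E * v + z[0], x + F * v + z[1]

def ubu_step(v, x, h, c, gamma, grad_f, rng):
    v, x = flow_U(v, x, 0.5 * h, c, gamma, rng)
    v, x = flow_B(v, x, h, c, grad_f)   # full-length force kick, evaluated at the midpoint y_n
    v, x = flow_U(v, x, 0.5 * h, c, gamma, rng)
    return v, x
```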

The thesis [52] also describes a method to boost to strong order two any method whose strong order is only one. The boosting is achieved by generating auxiliary Gaussian random variables and may be relevant in our context, where we are interested in the strong order of the integrators. A general technique to investigate the weak and strong order of splitting algorithms for general SDEs may be seen in [3, 4]; those references provide a detailed study of splitting Langevin integrators.

7.4 A lemma

The following result is a variant of [13, Lemma 7] and may be proved in a similar way.

Lemma 7.1.

Assume that the sequence of nonnegative numbers (zn)(z_{n}) is such that for some constants A(0,1)A\in(0,1), B0B\geq 0, C0C\geq 0 and each n=0,1,2,n=0,1,2,\dots

zn+1(1A)2zn2+B+C.z_{n+1}\leq\sqrt{(1-A)^{2}z_{n}^{2}+B}+C.

Then, for n=0,1,2,n=0,1,2,\dots

zn(1A)nz0+BA+CA.z_{n}\leq(1-A)^{n}z_{0}+\sqrt{\frac{B}{A}}+\frac{C}{A}.
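A quick numerical sanity check of the lemma (a sketch, not a substitute for the proof) iterates the recursion at equality, which is the worst case, and compares with the claimed bound:

```python
import math, random

def check_lemma(A, B, C, z0, n_steps=200):
    # Iterate z_{n+1} = sqrt((1-A)^2 z_n^2 + B) + C and compare with the bound
    # (1-A)^n z0 + sqrt(B/A) + C/A at every step.
    z = z0
    for n in range(n_steps + 1):
        bound = (1.0 - A) ** n * z0 + math.sqrt(B / A) + C / A
        assert z <= bound + 1e-12, (n, z, bound)
        z = math.sqrt((1.0 - A) ** 2 * z * z + B) + C

random.seed(0)
for _ in range(1000):
    check_lemma(A=random.uniform(0.01, 0.99), B=random.uniform(0.0, 2.0),
                C=random.uniform(0.0, 2.0), z0=random.uniform(0.0, 10.0))
print("bound verified on all sampled cases")
```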

7.5 The local error for the EE integrator

To analyze ϕh(ξ^n,tn)ψh(ξ^n,tn)\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n}), we have to take the random variable ξ^n=(v^n,x^n)π\widehat{\xi}_{n}=(\widehat{v}_{n},\widehat{x}_{n})\sim\pi^{*} (see Theorem 5.2) as initial data, first to move the solution of the SDE forward in the interval [tn,tn+h][t_{n},t_{n}+h] and then to perform a step of the integrator over the same interval.

Solutions of (1.2) satisfy, for t\geq t_{n}, n=0,1,2,\dots,
v(t)\displaystyle v(t) =\displaystyle= (ttn)v(tn)tnt(ts)cf(x(s))𝑑s+2γctnt(ts)𝑑W(s),\displaystyle\mathcal{E}(t-t_{n})v(t_{n})-\int_{t_{n}}^{t}\mathcal{E}(t-s)c\nabla f(x(s))\,ds+\sqrt{2\gamma c}\int_{t_{n}}^{t}\mathcal{E}(t-s)\,dW(s), (7.4a)
x(t)\displaystyle x(t) =\displaystyle= x(tn)+(ttn)v(tn)tnt(ts)cf(x(s))𝑑s+2γctnt(ts)𝑑W(s),\displaystyle x(t_{n})+\mathcal{F}(t-t_{n})v(t_{n})-\int_{t_{n}}^{t}\mathcal{F}(t-s)c\nabla f(x(s))\,ds+\sqrt{2\gamma c}\int_{t_{n}}^{t}\mathcal{F}(t-s)\,dW(s), (7.4b)

The proof is divided into steps.

First step. From (7.4a) and (4.2a), we find that the vv-component of ϕh(ξ^n,tn)ψh(ξ^n,tn)\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n}), denoted as Δv\Delta_{v}, is

Δv=tntn+1(tn+1s)c(f(x(s))f(x^n))𝑑s;\Delta_{v}=-\int_{t_{n}}^{t_{n+1}}\mathcal{E}(t_{n+1}-s)c\big{(}\nabla f(x(s))-\nabla f(\widehat{x}_{n})\big{)}\,ds;

where x(s)x(s) is the xx-component of the solution of (1.2) that at tnt_{n} takes the value ξ^n\widehat{\xi}_{n} (we shall later need the vv component, to be denoted by v(s)v(s)). Using successively the Cauchy-Schwartz inequality, the bound (t)1\mathcal{E}(t)\leq 1 for t0t\geq 0, the Lipschitz continuity of f(x)\nabla f(x), and (1.2b), we find:

𝔼(Δv2)\displaystyle\mathbb{E}\big{(}\|\Delta_{v}\|^{2}\big{)} \displaystyle\leq c2𝔼(tntn+1(tn+1s)2𝑑stntn+1f(x(s))f(x^n)2𝑑s)\displaystyle c^{2}\mathbb{E}\left(\int_{t_{n}}^{t_{n+1}}\mathcal{E}(t_{n+1}-s)^{2}ds\>\int_{t_{n}}^{t_{n+1}}\|\nabla f(x(s))-\nabla f(\widehat{x}_{n})\|^{2}ds\right)
\displaystyle\leq hc2tntn+1𝔼(f(x(s))f(x^n)2)𝑑s\displaystyle hc^{2}\int_{t_{n}}^{t_{n+1}}\mathbb{E}\big{(}\|\nabla f(x(s))-\nabla f(\widehat{x}_{n})\|^{2}\big{)}ds
\displaystyle\leq hc2L2tntn+1𝔼(x(s)x^n2)𝑑s\displaystyle hc^{2}L^{2}\int_{t_{n}}^{t_{n+1}}\mathbb{E}\big{(}\|x(s)-\widehat{x}_{n}\|^{2}\big{)}ds
\displaystyle\leq hc2L2tntn+1𝔼(tnsv(s)𝑑s2)𝑑s.\displaystyle hc^{2}L^{2}\int_{t_{n}}^{t_{n+1}}\mathbb{E}\big{(}\|\int_{t_{n}}^{s}v(s^{\prime})ds^{\prime}\|^{2}\big{)}ds.

An application of the Cauchy-Schwartz inequality to the inner integral yields

\mathbb{E}\big(\|\Delta_{v}\|^{2}\big)\leq hc^{2}L^{2}\int_{t_{n}}^{t_{n+1}}(s-t_{n})\left(\int_{t_{n}}^{s}\mathbb{E}\big(\|v(s^{\prime})\|^{2}\big)ds^{\prime}\right)ds.

Now, since the initial datum \widehat{\xi}_{n} is distributed according to \pi^{\star}, so is (v(s^{\prime}),x(s^{\prime})) for all s^{\prime}\geq t_{n}. Hence, since the distribution of each of the d scalar components of v is a centered Gaussian with second moment equal to c, we obtain the final bound

\mathbb{E}\big(\|\Delta_{v}\|^{2}\big)\leq hc^{2}L^{2}d\int_{t_{n}}^{t_{n+1}}(s-t_{n})\left(\int_{t_{n}}^{s}c\,ds^{\prime}\right)ds=\frac{h^{4}}{3}c^{3}L^{2}d.

Second step. Turning now to the xx component of ϕh(ξ^n,tn)ψh(ξ^n,tn)\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n}), we have

Δx=tntn+1(tn+1s)c(f(x(s))f(x^n))𝑑s,\Delta_{x}=-\int_{t_{n}}^{t_{n+1}}\mathcal{F}(t_{n+1}-s)c\big{(}\nabla f(x(s))-\nabla f(\widehat{x}_{n})\big{)}\,ds,

and, applying the Cauchy-Schwartz inequality and the bound (t)t\mathcal{F}(t)\leq t,

\mathbb{E}\big(\|\Delta_{x}\|^{2}\big) \leq c^{2}\,\mathbb{E}\left(\int_{t_{n}}^{t_{n+1}}\mathcal{F}(t_{n+1}-s)^{2}\,ds\int_{t_{n}}^{t_{n+1}}\|\nabla f(x(s))-\nabla f(\widehat{x}_{n})\|^{2}\,ds\right)
\leq \frac{h^{3}}{3}c^{2}\int_{t_{n}}^{t_{n+1}}\mathbb{E}\big(\|\nabla f(x(s))-\nabla f(\widehat{x}_{n})\|^{2}\big)\,ds.

Therefore, by proceeding with the last integral as we did above, we find

𝔼(Δx2)h69c3L2d.\mathbb{E}\big{(}\|\Delta_{x}\|^{2}\big{)}\leq\frac{h^{6}}{9}c^{3}L^{2}d.

Third step. The preceding analysis is valid for all values of the parameters. We now assume that tt is measured in units for which γ=2\gamma=2 and that hh is chosen 1\leq 1. Then, combining the estimates for the vv and xx components:

𝔼(ϕh(ξ^n,tn)ψh(ξ^n,tn)2)h43c3L2d+h69c3L2d49h4c3L2d.\mathbb{E}\big{(}\|\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n})\|^{2}\big{)}\leq\frac{h^{4}}{3}c^{3}L^{2}d+\frac{h^{6}}{9}c^{3}L^{2}d\leq\frac{4}{9}h^{4}c^{3}L^{2}d. (7.5)

For the matrix P^\widehat{P} in (3.9) the constant pmaxp_{\max} in (2.2) is easily computed as (3+5)/2(3+\sqrt{5})/2 and we finally have:

ϕh(ξ^n,tn)ψh(ξ^n,tn)L2,P26+259h4c3L2d.\|\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n})\|^{2}_{L^{2},P}\leq\frac{6+2\sqrt{5}}{9}h^{4}c^{3}L^{2}d.

7.6 The local error for UBU

We first provide bounds for the local error for UBU under Assumptions 2.12.3 that ensure strong order two. As in the previous Subsection we have to take (v^n,x^n)(\widehat{v}_{n},\widehat{x}_{n}) as a starting point for the SDE solution and the integrator. As with the EE integrator, v(t)v(t) and x(t)x(t) denote the solution of (1.2) that starts at tnt_{n} from (v^n,x^n)(\widehat{v}_{n},\widehat{x}_{n}). The analysis is now substantially more involved as the Ito-Taylor expansions have to be taken to higher order.

First step. We begin by estimating the difference Δy\Delta_{y} between x(tn+h/2)x(t_{n}+h/2) and the point yny_{n} where the integrator evaluates the force f-\nabla f (see (4.3c)). By using (7.4b) and (4.3c), we find

Δy=x(tn+h/2)yn=tntn+1/2(tn+1/2s)cf(x(s))𝑑s.\Delta_{y}=x(t_{n}+h/2)-y_{n}=-\int_{t_{n}}^{t_{n+1/2}}\mathcal{F}(t_{n+1/2}-s)c\nabla f(x(s))\,ds.

We apply the Cauchy-Schwartz inequality (in a similar way to what we did at Step 2 in the preceding subsection) to get

\mathbb{E}\big(\|\Delta_{y}\|^{2}\big)\leq\frac{h^{3}}{24}c^{2}\int_{t_{n}}^{t_{n+1/2}}\mathbb{E}\big(\|\nabla f(x(s))\|^{2}\big)\,ds.

As proved in [11, Lemma 2], when x¯\bar{x} has the distribution π\pi^{\star},

𝔼(f(x¯)2)Ld\mathbb{E}\big{(}\|\nabla f(\bar{x})\|^{2}\big{)}\leq Ld (7.6)

and, accordingly,

𝔼(Δy2)h448c2Ld.\mathbb{E}\big{(}\|\Delta_{y}\|^{2}\big{)}\leq\frac{h^{4}}{48}c^{2}Ld. (7.7)

Second step. From (7.4a) and (4.3a), the vv-component of ϕh(ξ^n,tn)ψh(ξ^n,tn)\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n}) is found to be

Δv=tntn+1(tn+1s)cf(x(s))𝑑s+h(h/2)cf(yn);\Delta_{v}=-\int_{t_{n}}^{t_{n+1}}\mathcal{E}(t_{n+1}-s)c\nabla f(x(s))\,ds+h\mathcal{E}(h/2)c\nabla f(y_{n}); (7.8)

thus UBU replaces the integral by a midpoint-rule approximation. We Ito-Taylor expand (see e.g. [26, 4]) the integral around tn+1/2t_{n+1/2} as follows. Denote by χ(s)\chi(s) the (differentiable) integrand, i.e.

χ(s)=(tn+1s)cf(x(s)).\chi(s)=\mathcal{E}(t_{n+1}-s)c\nabla f(x(s)).

Then (the dot indicates differentiation),

\int_{t_{n}}^{t_{n+1}}\chi(s)ds=\int_{t_{n}}^{t_{n+1}}\chi(t_{n+1/2})ds+\int_{t_{n}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}\dot{\chi}(s^{\prime})ds^{\prime}, (7.9)

and taking the expansion one step further, we find

tntn+1χ(s)𝑑s\displaystyle\int_{t_{n}}^{t_{n+1}}\chi(s)ds =\displaystyle= hχ(tn+1/2)+tn+1/2tn+1𝑑stn+1/2s(χ˙(s)χ˙(2tn+1/2s))𝑑s\displaystyle h\chi(t_{n+1/2})+\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}\big{(}\dot{\chi}(s^{\prime})-\dot{\chi}(2t_{n+1/2}-s^{\prime})\big{)}ds^{\prime} (7.10)
=\displaystyle= hχ(tn+1/2)+tn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ss𝑑χ˙(s′′).\displaystyle h\chi(t_{n+1/2})+\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}d\dot{\chi}(s^{\prime\prime}).

We now replace dχ˙(s′′)d\dot{\chi}(s^{\prime\prime}) by its expression given by Ito’s lemma. (While χ\chi is differentiable, χ˙\dot{\chi} is a diffusion process.) There is no Ito correction because χ˙\dot{\chi} is linear in vv and there is no forcing noise in the xx equation (1.2b). After computing dχ˙(s′′)d\dot{\chi}(s^{\prime\prime}) and substituting back in (7.8), we have

Δv=hc(h/2)(f(x(tn+1/2))f(yn))+I1+I2+I3+I4+I5,\Delta_{v}=-hc\mathcal{E}(h/2)\big{(}\nabla f(x(t_{n+1/2}))-\nabla f(y_{n})\big{)}+I_{1}+I_{2}+I_{3}+I_{4}+I_{5}, (7.11)

with

I1\displaystyle I_{1} =\displaystyle= tn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ss(tn+1s′′)γ2cf(x(s′′))𝑑s′′,\displaystyle-\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\mathcal{E}(t_{n+1}-s^{\prime\prime})\gamma^{2}c\nabla f(x(s^{\prime\prime}))ds^{\prime\prime},
I2\displaystyle I_{2} =\displaystyle= tn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ss(tn+1s′′)γc(x(s′′))v(s′′)𝑑s′′,\displaystyle-\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\mathcal{E}(t_{n+1}-s^{\prime\prime})\gamma c\mathcal{H}(x(s^{\prime\prime}))v(s^{\prime\prime})ds^{\prime\prime},
I3\displaystyle I_{3} =\displaystyle= tn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ss(tn+1s′′)c(x(s′′))[v(s′′),v(s′′)]𝑑s′′,\displaystyle-\int_{t_{n+1/2}}^{t_{n+1}}\!\!\!\!ds\int_{t_{n+1/2}}^{s}\!\!\!\!\!ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\!\!\!\!\mathcal{E}(t_{n+1}-s^{\prime\prime})c\mathcal{H}^{\prime}(x(s^{\prime\prime}))[v(s^{\prime\prime}),v(s^{\prime\prime})]ds^{\prime\prime},
I4\displaystyle I_{4} =\displaystyle= tn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ss(tn+1s′′)c2(x(s′′))f(x(s′′))𝑑s′′,\displaystyle\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\mathcal{E}(t_{n+1}-s^{\prime\prime})c^{2}\mathcal{H}(x(s^{\prime\prime}))\nabla f(x(s^{\prime\prime}))ds^{\prime\prime},
I5\displaystyle I_{5} =\displaystyle= 2γctn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ss(tn+1s′′)c(x(s′′))𝑑W(s′′).\displaystyle-\sqrt{2\gamma c}\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\mathcal{E}(t_{n+1}-s^{\prime\prime})c\mathcal{H}(x(s^{\prime\prime}))dW(s^{\prime\prime}).

We now successively bound each term in the right hand-side of (7.11). From (7.7) and the Lipschitz continuity of f(x)\nabla f(x)

𝔼(hc(h/2)(f(x(tn+1/2))f(yn))2)h648c4L3d.\mathbb{E}\Big{(}\|hc\mathcal{E}(h/2)\big{(}\nabla f(x(t_{n+1/2}))-\nabla f(y_{n})\big{)}\|^{2}\Big{)}\leq\frac{h^{6}}{48}c^{4}L^{3}d.

The integral I1I_{1} may be bounded as follows:

𝔼(I12)\displaystyle\mathbb{E}\big{(}\|I_{1}\|^{2}\big{)} \displaystyle\leq γ4c2𝔼[(tn+1/2tn+1dstn+1/2sds2tn+1/2ss(tn+1s′′)2ds′′)\displaystyle\gamma^{4}c^{2}\mathbb{E}\left[\left(\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\mathcal{E}(t_{n+1}-s^{\prime\prime})^{2}ds^{\prime\prime}\right)\right.
(tn+1/2tn+1dstn+1/2sds2tn+1/2ssf(x(s′′))2ds′′)]\displaystyle\left.\left(\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\|\nabla f(x(s^{\prime\prime}))\|^{2}ds^{\prime\prime}\right)\right]

Now using the fact that (t)1\mathcal{E}(t)\leq 1, for t0t\geq 0, we can bound the first term in the equation above by observing that

tn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ss𝑑s′′=h324\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}ds^{\prime\prime}=\frac{h^{3}}{24}

and then take into account (7.6), to get

𝔼(I12)h6576γ4c2Ld.\mathbb{E}\big{(}\|I_{1}\|^{2}\big{)}\leq\frac{h^{6}}{576}\gamma^{4}c^{2}Ld.
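For completeness, the value h^{3}/24 used above follows from an elementary computation:

\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}ds^{\prime\prime}=2(s^{\prime}-t_{n+1/2}),\qquad\int_{t_{n+1/2}}^{s}2(s^{\prime}-t_{n+1/2})\,ds^{\prime}=(s-t_{n+1/2})^{2},\qquad\int_{t_{n+1/2}}^{t_{n+1}}(s-t_{n+1/2})^{2}\,ds=\frac{(h/2)^{3}}{3}=\frac{h^{3}}{24}.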

A bound for I2I_{2} may be derived similarly, by using, instead of (7.6),

𝔼((x¯)v¯2)L2𝔼(v¯2)=L2cd.\mathbb{E}\big{(}\|\mathcal{H}(\bar{x})\bar{v}\|^{2}\big{)}\leq L^{2}\mathbb{E}\big{(}\|\bar{v}\|^{2}\big{)}=L^{2}cd.

((v¯,x¯)π(\bar{v},\bar{x})\sim\pi^{\star}). Then

𝔼(I22)h6576γ2c3L2d.\mathbb{E}\big{(}\|I_{2}\|^{2}\big{)}\leq\frac{h^{6}}{576}\gamma^{2}c^{3}L^{2}d.

For I3I_{3} we use (each scalar component of v¯\bar{v} is a centered Gaussian with variance cc and fourth moment 3c23c^{2})

𝔼((x¯)[v¯,v¯]2)=L12𝔼(v¯4)3L12c2d,\mathbb{E}\big{(}\|\mathcal{H}^{\prime}(\bar{x})[\bar{v},\bar{v}]\|^{2}\big{)}=L_{1}^{2}\mathbb{E}\big{(}\|\bar{v}\|^{4}\big{)}\leq 3L_{1}^{2}c^{2}d,

that leads to

𝔼(I32)3h6576c4L12d.\mathbb{E}\big{(}\|I_{3}\|^{2}\big{)}\leq\frac{3h^{6}}{576}c^{4}L_{1}^{2}d.

Turning now to I4I_{4}, a new application of (7.6) gives

𝔼((x¯)f(x¯)2)L2𝔼(f(x¯)2)L3d,\mathbb{E}\big{(}\|\mathcal{H}(\bar{x})\nabla f(\bar{x})\|^{2}\big{)}\leq L^{2}\mathbb{E}\big{(}\|\nabla f(\bar{x})\|^{2}\big{)}\leq L^{3}d,

and then

𝔼(I42)h6576c4L3d.\mathbb{E}\big{(}\|I_{4}\|^{2}\big{)}\leq\frac{h^{6}}{576}c^{4}L^{3}d.

Using the Cauchy-Schwartz inequality for the inner product associated with the integration on ss and ss^{\prime}, we have

𝔼(I52)2γc3𝔼[(tn+1/2tn+1dstn+1/2sds)×\displaystyle\mathbb{E}\big{(}\|I_{5}\|^{2}\big{)}\leq 2\gamma c^{3}\mathbb{E}\left[\left(\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\right)\right.\times
(tn+1/2tn+1dstn+1/2s2tn+1/2ss(tn+1s′′)(x(s′′))dW(s′′)2ds)]\displaystyle\left.\left(\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}\left\|\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\mathcal{E}(t_{n+1}-s^{\prime\prime})\mathcal{H}(x(s^{\prime\prime}))dW(s^{\prime\prime})\right\|^{2}ds^{\prime}\right)\right]

and, with the help of the Ito isometry, since the squared Frobenius norm of \mathcal{H} is bounded by L^{2}d,

𝔼(I52)2γc3h28tn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ssL2𝑑ds′′=h596γc3L2d.\mathbb{E}\big{(}\|I_{5}\|^{2}\big{)}\leq 2\gamma c^{3}\frac{h^{2}}{8}\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}L^{2}d\,ds^{\prime\prime}=\frac{h^{5}}{96}\gamma c^{3}L^{2}d.

We have now bounded each term in the right-hand side of (7.11); the dominant term is I_{5}, with (\mathbb{E}(\|I_{5}\|^{2}))^{1/2}=\mathcal{O}(h^{5/2}). In Assumption 5.1 we have to split \phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n}) into two parts, \alpha and \beta, with \beta of higher order; for the v component we then define \alpha_{v}=I_{5} and \beta_{v}=\Delta_{v}-\alpha_{v}. The bounds obtained above yield

(𝔼(αv2))1/2624h5/2γ1/2c3/2Ld1/2\big{(}\mathbb{E}(\|\alpha_{v}\|^{2})\big{)}^{1/2}\leq\frac{\sqrt{6}}{24}h^{5/2}\gamma^{1/2}c^{3/2}Ld^{1/2}

and

(𝔼(βv2))1/2h324((1+23)c2L3/2+γ2cL1/2+γc3/2L+3c2L1)d1/2.\big{(}\mathbb{E}(\|\beta_{v}\|^{2})\big{)}^{1/2}\leq\frac{h^{3}}{24}\Big{(}(1+2\sqrt{3})c^{2}L^{3/2}+\gamma^{2}cL^{1/2}+\gamma c^{3/2}L+\sqrt{3}c^{2}L_{1}\Big{)}d^{1/2}.

Third step. From (7.4b) and (4.3b), the xx-component of ϕh(ξ^n,tn)ψh(ξ^n,tn)\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n}) is given by

Δx=tntn+1(tn+1s)cf(x(s))𝑑s+h(h/2)cf(yn);\Delta_{x}=-\int_{t_{n}}^{t_{n+1}}\mathcal{F}(t_{n+1}-s)c\nabla f(x(s))\,ds+h\mathcal{F}(h/2)c\nabla f(y_{n}); (7.12)

again UBU replaces the integral by a midpoint approximation. By expanding the integrand by means of the fundamental theorem of calculus as

(tn+1s)cf(x(s))=(h/2)cf(x(tn+1/2))\displaystyle\mathcal{F}(t_{n+1}-s)c\nabla f(x(s))=\mathcal{F}(h/2)c\nabla f(x(t_{n+1/2}))
+tn+1/2s((tn+1s)c(x(s))v(s)(tn+1s)cf(x(s)))𝑑s,\displaystyle\qquad+\int_{t_{n+1/2}}^{s}\Big{(}\mathcal{F}(t_{n+1}-s^{\prime})c\mathcal{H}(x(s^{\prime}))v(s^{\prime})-\mathcal{E}(t_{n+1}-s^{\prime})c\nabla f(x(s^{\prime}))\Big{)}ds^{\prime},

(7.12) becomes

Δx=hc(h/2)(f(x(tn+1/2))f(yn))+I6+I7\Delta_{x}=-hc\mathcal{F}(h/2)\big{(}\nabla f(x(t_{n+1/2}))-\nabla f(y_{n})\big{)}+I_{6}+I_{7} (7.13)

with

I6\displaystyle I_{6} =\displaystyle= tntn+1𝑑stn+1/2s(tn+1s)c(x(s))v(s)𝑑s,\displaystyle-\int_{t_{n}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}\mathcal{F}(t_{n+1}-s^{\prime})c\mathcal{H}(x(s^{\prime}))v(s^{\prime})ds^{\prime},
I7\displaystyle I_{7} =\displaystyle= tntn+1𝑑stn+1/2s(tn+1s)cf(x(s))𝑑s.\displaystyle\int_{t_{n}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}\mathcal{E}(t_{n+1}-s^{\prime})c\nabla f(x(s^{\prime}))ds^{\prime}.

Now, in (7.13), taking (7.7) into account and recalling that (t)t\mathcal{F}(t)\leq t,

𝔼(hc(h/2)(f(x(tn+1/2))f(yn))2)h8192c4L3d.\mathbb{E}\Big{(}\|hc\mathcal{F}(h/2)\big{(}\nabla f(x(t_{n+1/2}))-\nabla f(y_{n})\big{)}\|^{2}\Big{)}\leq\frac{h^{8}}{192}c^{4}L^{3}d.

Next, for I6I_{6} (it is necessary to put absolute value bars around the inner integrals because ss could be <tn+1/2<t_{n+1/2}),

𝔼(I62)\displaystyle\mathbb{E}\big{(}\|I_{6}\|^{2}\big{)} \displaystyle\leq c2𝔼[(tntn+1ds|tn+1/2s(tn+1s)2ds|)×\displaystyle c^{2}\mathbb{E}\left[\left(\int_{t_{n}}^{t_{n+1}}ds\left|\int_{t_{n+1/2}}^{s}\mathcal{F}(t_{n+1}-s^{\prime})^{2}ds^{\prime}\right|\right)\right.\times
(tntn+1ds|tn+1/2s(x(s))v(s)2ds|)]\displaystyle\qquad\qquad\qquad\left.\left(\int_{t_{n}}^{t_{n+1}}ds\left|\int_{t_{n+1/2}}^{s}\|\mathcal{H}(x(s^{\prime}))v(s^{\prime})\|^{2}ds^{\prime}\right|\right)\right]
\displaystyle\leq c27h496×h24L2cd=7h6384c3L2d.\displaystyle c^{2}\frac{7h^{4}}{96}\times\frac{h^{2}}{4}L^{2}cd=\frac{7h^{6}}{384}c^{3}L^{2}d.

The integral I7I_{7} may be rewritten as

I_{7}=\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\frac{d}{ds^{\prime\prime}}\Big(\mathcal{E}(t_{n+1}-s^{\prime\prime})c\nabla f(x(s^{\prime\prime}))\Big)ds^{\prime\prime},

and, after performing the differentiation in the integrand, I7=γ1(I1+I2)I_{7}=-\gamma^{-1}(I_{1}+I_{2}), so that we may use the available bounds for I1I_{1} and I2I_{2}.

Taking all the partial bounds into (7.13),

(𝔼(Δx2))1/2h324(3hc2L3/2+(422+1)c3/2L+γcL1/2)d1/2.\big{(}\mathbb{E}(\|\Delta_{x}\|^{2})\big{)}^{1/2}\leq\frac{h^{3}}{24}\Big{(}\sqrt{3}hc^{2}L^{3/2}+(\frac{\sqrt{42}}{2}+1)c^{3/2}L+\gamma cL^{1/2}\Big{)}d^{1/2}.

As we see, for the xx component there is no 𝒪(h5/2)\mathcal{O}(h^{5/2}) contribution and therefore we take αx=0\alpha_{x}=0 and βx=Δx\beta_{x}=\Delta_{x}.

Fourth step. With a view to checking at Step 5 condition (5.1) in Assumption 5.1, we estimate |\mathbb{E}\big(\langle\widetilde{v}_{n+1}-v_{n+1},\alpha_{v}\rangle\big)|; here \langle\cdot,\cdot\rangle is the standard inner product in \mathbb{R}^{d}, v_{n+1} is the velocity component of \xi_{n+1} and \widetilde{v}_{n+1} denotes the velocity component of a numerical step starting from \widehat{\xi}_{n}=(\widehat{v}_{n},\widehat{x}_{n}). (This should not be confused with \widehat{v}_{n+1}, the v-component of the random variable \widehat{\xi}_{n+1} to be introduced at the next time level in the construction leading to Theorem 5.2.)

Since, conditional on v^n\widehat{v}_{n}, vnv_{n}, the expectation of the stochastic integral αv=I5\alpha_{v}=I_{5} is zero, we may write

|𝔼(v~n+1vn+1,αv)|\displaystyle\left|\mathbb{E}\big{(}\langle\widetilde{v}_{n+1}-v_{n+1},\alpha_{v}\rangle\big{)}\right| =\displaystyle= |𝔼(v~n+1v^nvn+1+vn,αv)|\displaystyle\left|\mathbb{E}\big{(}\langle\widetilde{v}_{n+1}-\widehat{v}_{n}-v_{n+1}+v_{n},\alpha_{v}\rangle\big{)}\right|
\displaystyle\leq (𝔼(v~n+1v^nvn+1+vn2))1/2(𝔼(αv2))1/2.\displaystyle\Big{(}\mathbb{E}(\|\widetilde{v}_{n+1}-\widehat{v}_{n}-v_{n+1}+v_{n}\|^{2})\Big{)}^{1/2}\Big{(}\mathbb{E}(\|\alpha_{v}\|^{2})\Big{)}^{1/2}.

Now, from (4.3a),

v~n+1v^nvn+1+vn=((h)1)(v^nvn)h(h/2)c(f(y~n)f(yn))\widetilde{v}_{n+1}-\widehat{v}_{n}-v_{n+1}+v_{n}=(\mathcal{E}(h)-1)(\widehat{v}_{n}-v_{n})-h\mathcal{E}(h/2)c(\nabla f(\widetilde{y}_{n})-\nabla f(y_{n}))

with (see (4.3c))

y~n=x^n+(h/2)v^n+2γctntn+1/2(tn+1/2s)𝑑W(s),\widetilde{y}_{n}=\widehat{x}_{n}+\mathcal{F}(h/2)\widehat{v}_{n}+\sqrt{2\gamma c}\int_{t_{n}}^{t_{n+1/2}}\mathcal{F}(t_{n+1/2}-s)dW(s),

and thus, since 1(h)γh1-\mathcal{E}(h)\leq\gamma h and (h/2)1\mathcal{E}(h/2)\leq 1,

(𝔼(v~n+1v^nvn+1+vn2))1/2\displaystyle\Big{(}\mathbb{E}(\|\widetilde{v}_{n+1}-\widehat{v}_{n}-v_{n+1}+v_{n}\|^{2})\Big{)}^{1/2}
hγ(𝔼(v^nvn2))1/2+hcL(𝔼(y~nyn2))1/2.\displaystyle\qquad\qquad\qquad\leq h\gamma\Big{(}\mathbb{E}(\|\widehat{v}_{n}-v_{n}\|^{2})\Big{)}^{1/2}+hcL\Big{(}\mathbb{E}(\|\widetilde{y}_{n}-y_{n}\|^{2})\Big{)}^{1/2}.

Taking into account (4.3c) and the definition of y~n\widetilde{y}_{n}

(𝔼y~nyn2)1/2(𝔼(x^nxn2))1/2+h2(𝔼(v^nvn2))1/2,\Big{(}\mathbb{E}\|\widetilde{y}_{n}-y_{n}\|^{2}\Big{)}^{1/2}\leq\Big{(}\mathbb{E}(\|\widehat{x}_{n}-x_{n}\|^{2})\Big{)}^{1/2}+\frac{h}{2}\Big{(}\mathbb{E}(\|\widehat{v}_{n}-v_{n}\|^{2})\Big{)}^{1/2},

and we conclude that |𝔼(v~n+1vn+1,αv)||\mathbb{E}\big{(}\langle\widetilde{v}_{n+1}-v_{n+1},\alpha_{v}\rangle\big{)}| is bounded above by

h((γ+h2cL)(𝔼(v^nvn2))1/2+cL(𝔼(x^nxn2))1/2)(𝔼(αv2))1/2.h\left(\Big{(}\gamma+\frac{h}{2}cL\Big{)}\Big{(}\mathbb{E}(\|\widehat{v}_{n}-v_{n}\|^{2})\Big{)}^{1/2}+cL\Big{(}\mathbb{E}(\|\widehat{x}_{n}-x_{n}\|^{2})\Big{)}^{1/2}\right)\Big{(}\mathbb{E}(\|\alpha_{v}\|^{2})\Big{)}^{1/2}. (7.14)

Fifth step. The preceding analysis holds for all values of the parameters. We now focus on the case where \gamma=2 and h\leq 2, as in the statement of Theorem 6.2. To complete the proof it is enough to translate the Euclidean norm bounds in Steps 1–4 into P-norm bounds.

To establish (5.1), we note that, because \alpha_{x}=0,

\left|\Big\langle\psi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\xi_{n},t_{n}),\alpha_{h}(\widehat{\xi}_{n},t_{n})\Big\rangle_{L^{2},P_{h}}\right|=\big|\mathbb{E}\big(\langle\widetilde{v}_{n+1}-v_{n+1},\alpha_{v}\rangle\big)\big|.

The right-hand side of this expression was bounded in (7.14) in terms of \big(\mathbb{E}(\|\widehat{v}_{n}-v_{n}\|^{2})\big)^{1/2}, \big(\mathbb{E}(\|\widehat{x}_{n}-x_{n}\|^{2})\big)^{1/2} and \big(\mathbb{E}(\|\alpha_{v}\|^{2})\big)^{1/2}. We now take into account (2.2) to bound \big(\mathbb{E}(\|\widehat{v}_{n}-v_{n}\|^{2})\big)^{1/2} and \big(\mathbb{E}(\|\widehat{x}_{n}-x_{n}\|^{2})\big)^{1/2} by multiples of \|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}} and to bound \big(\mathbb{E}(\|\alpha_{v}\|^{2})\big)^{1/2} by a multiple of \|\alpha_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}.
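As a minimal illustration of this translation (the symbols c_{P}, C_{P} are introduced here purely for exposition; the actual constants are those supplied by (2.2)), suppose (2.2) yields c_{P}\|\zeta\|^{2}\leq\|\zeta\|_{P}^{2}\leq C_{P}\|\zeta\|^{2} for all vectors \zeta. Then, since the v and x components are sub-vectors of \xi=(v,x) and \alpha_{h}=(\alpha_{v},0),

\Big(\mathbb{E}(\|\widehat{v}_{n}-v_{n}\|^{2})\Big)^{1/2}\leq c_{P}^{-1/2}\,\|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}},\qquad \Big(\mathbb{E}(\|\alpha_{v}\|^{2})\Big)^{1/2}\leq c_{P}^{-1/2}\,\|\alpha_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}},

and similarly for the x component; substituting these bounds into (7.14) gives an estimate of the form required in (5.1).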

The estimates in (5.2) are established in a similar way.

7.7 The local error for UBU without extra smoothness

Let us now assume that Assumptions 2.1 and 2.2 hold but Assumption 2.3 does not. Then the strong order of UBU is one. The bound for \mathbb{E}(\|\Delta_{x}\|^{2}) derived in the third step of the preceding subsection remains valid (note that it does not involve the constant L_{1}). However, for the component \Delta_{y} of the local error, the Itô–Taylor expansion leading to (7.10) cannot be taken beyond (7.9), because now d\dot{\chi} does not make sense. After replacing \dot{\chi}(s) by its expression in terms of f, one obtains the bound

\big(\mathbb{E}(\|\Delta_{y}\|^{2})\big)^{1/2}\leq\frac{h^{2}}{4}\gamma c^{3/2}Ld^{1/2}+\frac{h^{2}}{4}\gamma cL^{1/2}d^{1/2}+\frac{\sqrt{3}}{12}h^{3}c^{2}L^{3/2}d^{1/2}.

Then, after combining the x and v contributions and setting \gamma=2, we have the bound

\Big(\mathbb{E}\big(\|\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n})\|^{2}\big)\Big)^{1/2} \leq \frac{h^{2}}{4}\left[1+\left(\frac{1}{6}+\frac{\sqrt{42}}{12}\right)h\right]c^{3/2}Ld^{1/2}
\qquad+\frac{h^{2}}{2}\left(1+\frac{h}{6}\right)cL^{1/2}d^{1/2}
\qquad+\frac{\sqrt{3}}{12}h^{3}\left(1+\frac{h}{2}\right)c^{2}L^{3/2}d^{1/2}.

Recall that, for contractivity, the integrator has to be operated with cL bounded above, so that the combinations c^{2}L^{3/2}, c^{3/2}L and cL^{1/2} are all \mathcal{O}(L^{-1/2}) as L increases. The leading terms in the UBU bound in the display are (h^{2}/4)\,c^{3/2}Ld^{1/2}+(h^{2}/2)\,cL^{1/2}d^{1/2}. For comparison, we note from (7.5) that for E the corresponding leading term in the bound is (\sqrt{3}/3)\,h^{2}c^{3/2}Ld^{1/2}. The conclusion is that, under Assumptions 2.1 and 2.2, the properties of UBU are very close to those of E.
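To make the comparison concrete, take for instance c=1/L (one admissible normalization with cL bounded; this particular value is chosen here only for illustration). The leading terms above then become

\frac{h^{2}}{4}c^{3/2}Ld^{1/2}+\frac{h^{2}}{2}cL^{1/2}d^{1/2}=\frac{3}{4}\,h^{2}L^{-1/2}d^{1/2}\quad\text{(UBU)},\qquad \frac{\sqrt{3}}{3}h^{2}c^{3/2}Ld^{1/2}=\frac{\sqrt{3}}{3}\,h^{2}L^{-1/2}d^{1/2}\approx 0.58\,h^{2}L^{-1/2}d^{1/2}\quad\text{(E)},

so that the two bounds have the same dependence on h, L and d and comparable constants.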


Acknowledgement. J.M.S.-S. has been supported by project PID2019-104927GB-C21 (AEI/FEDER, UE). K.C.Z. acknowledges support from a Leverhulme Research Fellowship (RF/2020-310), the EPSRC grant EP/V006177/1, and the Alan Turing Institute.

References

  • [1] A. Abdulle, G. Vilmart, and K. C. Zygalakis. High order numerical approximation of the invariant measure of ergodic SDEs. SIAM Journal on Numerical Analysis, 52(4):1600–1622, 2014.
  • [2] A. Abdulle, G. Vilmart, and K. C. Zygalakis. Long time accuracy of Lie–Trotter splitting methods for Langevin dynamics. SIAM Journal on Numerical Analysis, 53(1):1–16, 2015.
  • [3] A. Alamo and J. M. Sanz-Serna. A technique for studying strong and weak local errors of splitting stochastic integrators. SIAM Journal on Numerical Analysis, 54(6):3239–3257, 2016.
  • [4] A. Alamo and J. M. Sanz-Serna. Word combinatorics for stochastic differential equations: Splitting integrator. Communications on Pure & Applied Analysis, 18:2163, 2019.
  • [5] N. Bou-Rabee and J. M. Sanz-Serna. Randomized Hamiltonian Monte Carlo. The Annals of Applied Probability, 27(4):2159 – 2194, 2017.
  • [6] N. Bou-Rabee and J. M. Sanz-Serna. Geometric integrators and the Hamiltonian Monte Carlo method. Acta Numerica, 27:113–206, 2018.
  • [7] S. Brooks, A. Gelman, G. Jones, and X. L. Meng. Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. CRC Press, 2011.
  • [8] E. Buckwar, M. Tamborrino, and I. Tubikanec. Spectral density-based and measure-preserving ABC for partially observed diffusion processes. An illustration on Hamiltonian SDEs. Statistics and Computing, 30(3):627–648, 2020.
  • [9] Y. Cao, J. Lu, and L. Wang. Complexity of randomized algorithms for underdamped Langevin dynamics. arXiv:2003.09906, 2020.
  • [10] X. Cheng, N. S. Chatterji, P. L. Bartlett, and M. I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Proceedings of the 2018 Conference on Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 300–323, 2018.
  • [11] A. S. Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 678–689, 2017.
  • [12] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
  • [13] A. S. Dalalyan and A. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019.
  • [14] A. S. Dalalyan, A. Karagulyan, and L. Riou-Durand. Bounding the error of discretized Langevin algorithms for non-strongly log-concave targets. arXiv:1906.08530, 2019.
  • [15] A. S. Dalalyan and L. Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. Bernoulli, 26(3):1956 – 1988, 2020.
  • [16] A. Durmus, S. Majewski, and B. Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
  • [17] A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551 – 1587, 2017.
  • [18] A. Durmus and E. Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854 – 2882, 2019.
  • [19] A. Eberle, A. Guillin, and R. Zimmer. Couplings and quantitative contraction rates for Langevin dynamics. The Annals of Probability, 47(4):1982 – 2010, 2019.
  • [20] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado. Analysis of optimization algorithms via integral quadratic constraints: nonstrongly convex problems. SIAM Journal on Optimization, 28(3):2654–2689, 2018.
  • [21] J. Foster, T. Lyons, and H. Oberhauser. The shifted ODE method for underdamped Langevin MCMC. arXiv:2101.03446, 2021.
  • [22] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
  • [23] E. Hairer, S. P. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I: Nonstiff Problems. Springer, Berlin, second edition, 2000.
  • [24] M. Hochbruck and A. Ostermann. Exponential integrators. Acta Numerica, 19:209–286, 2010.
  • [25] L. Hodgkinson, R. Salomone, and F. Roosta. Implicit Langevin algorithms for sampling from log-concave densities. Journal of Machine Learning Research, 22(136):1–30, 2021.
  • [26] P. E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer-Verlag, Berlin, 1992.
  • [27] A. Laurent and G. Vilmart. Order conditions for sampling the invariant measure of ergodic stochastic differential equations on manifolds. arXiv:2006.09743, 2020.
  • [28] B. Leimkuhler and C. Matthews. Rational construction of stochastic numerical methods for molecular sampling. Applied Mathematics Research eXpress, 2013(1):34–56, 2012.
  • [29] B. Leimkuhler and C. Matthews. Molecular Dynamics: With Deterministic and Stochastic Numerical Methods. Interdisciplinary Applied Mathematics. Springer, 2015.
  • [30] B. Leimkuhler and X. Shang. Adaptive thermostats for noisy gradient systems. SIAM Journal on Scientific Computing, 38(2):A712–A736, 2016.
  • [31] T. Lelièvre and G. Stoltz. Partial differential equations and stochastic methods in molecular dynamics. Acta Numerica, 25:681–880, 2016.
  • [32] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
  • [33] M. B. Majka, A. Mijatović, and L. Szpruch. Nonasymptotic bounds for sampling algorithms without log-concavity. The Annals of Applied Probability, 30(4):1534 – 1581, 2020.
  • [34] J. C. Mattingly, A. M. Stuart, and M. V. Tretyakov. Convergence of numerical time-averaging and stationary measures via Poisson equations. SIAM Journal on Numerical Analysis, 48(2):552–577, 2010.
  • [35] R. I. McLachlan and G. R. W. Quispel. Splitting methods. Acta Numerica, 11:341–434, 2002.
  • [36] G. N. Milstein and M. V. Tretyakov. Stochastic Numerics for Mathematical Physics. Springer-Verlag, Berlin, 2004.
  • [37] G.N. Milstein and M.V. Tretyakov. Computing ergodic limits for Langevin equations. Physica D, 229(1):81 – 95, 2007.
  • [38] P. Monmarché. Almost sure contraction for diffusions on \mathbb{R}^{d}. Application to generalised Langevin diffusions. arXiv:2009.10828, 2020.
  • [39] P. Monmarché. High-dimensional MCMC with a standard splitting scheme for the underdamped Langevin diffusion. Electronic Journal of Statistics, 15(2):4117–4166, 2021.
  • [40] W. Mou, Y. A. Ma, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan. High-order Langevin diffusion yields an accelerated MCMC algorithm. Journal of Machine Learning Research, 22(42):1–41, 2021.
  • [41] M. Pereyra, L. V. Mieles, and K. C. Zygalakis. Accelerating proximal Markov chain Monte Carlo by using an explicit stabilized method. SIAM Journal on Imaging Sciences, 13(2):905–935, 2020.
  • [42] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
  • [43] J. M. Sanz-Serna. Markov chain Monte Carlo and numerical differential equations. In Current Challenges in Stability Issues for Numerical Differential Equations, volume 2082 of Lect. Notes in Math., pages 39–88. Springer, 2014.
  • [44] J. M. Sanz-Serna and K. C. Zygalakis. Contractivity of Runge–Kutta methods for convex gradient systems. SIAM Journal on Numerical Analysis, 58(4):2079–2092, 2020.
  • [45] J. M. Sanz-Serna and K. C. Zygalakis. The connections between Lyapunov functions for some optimization algorithms and differential equations. SIAM Journal on Numerical Analysis, 59(3):1542–1565, 2021.
  • [46] J. M. Sanz-Serna and M. P. Calvo. Numerical Hamiltonian Problems. Dover Books on Mathematics. Dover Publications, 2018.
  • [47] J. M. Sanz-Serna and A. M. Stuart. Ergodicity of dissipative differential equations subject to random impulses. Journal of Differential Equations, 155(2):262–284, 1999.
  • [48] R. Shen and Y. T. Lee. The randomized midpoint method for log-concave sampling. In Advances in Neural Information Processing Systems, volume 32, pages 2098–2109, 2019.
  • [49] R. D. Skeel. Integration schemes for molecular dynamics and related applications. In The Graduate Student’s Guide to Numerical Analysis ’98: Lecture Notes from the VIII EPSRC Summer School in Numerical Analysis, pages 119–176. Springer, 1999.
  • [50] G. Vilmart. Postprocessed integrators for the high order integration of ergodic SDEs. SIAM Journal on Scientific Computing, 37(1):A201–A220, 2015.
  • [51] S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh. Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(159):1–48, 2016.
  • [52] A. Alamo Zapatero. Word series for the numerical integration of stochastic differential equations. PhD thesis, Universidad de Valladolid, 2019.