
Wasserstein distance estimates for the distributions of numerical approximations to ergodic stochastic differential equations.

J. M. Sanz-Serna$^{1}$ and Konstantinos C. Zygalakis$^{2}$
Abstract

We present a framework that allows for the non-asymptotic study of the $2$-Wasserstein distance between the invariant distribution of an ergodic stochastic differential equation and the distribution of its numerical approximation in the strongly log-concave case. This allows us to study in a unified way a number of different integrators proposed in the literature for the overdamped and underdamped Langevin dynamics. In addition, we analyse a novel splitting method for the underdamped Langevin dynamics which only requires one gradient evaluation per time step. Under an additional smoothness assumption on a $d$-dimensional strongly log-concave distribution with condition number $\kappa$, the algorithm is shown to produce with an $\mathcal{O}\big(\kappa^{5/4}d^{1/4}\epsilon^{-1/2}\big)$ complexity samples from a distribution that, in Wasserstein distance, is at most $\epsilon>0$ away from the target distribution.

$^{1}$Departamento de Matemáticas, Universidad Carlos III de Madrid, Leganés (Madrid), Spain. $^{2}$School of Mathematics, University of Edinburgh, Edinburgh, Scotland.

1 Introduction

The problem of sampling from a target probability distribution $\pi^{\star}(x)$ in $\mathbb{R}^{d}$ is ubiquitous throughout applied mathematics, statistics, molecular dynamics, statistical physics and other fields. A typical approach for solving such problems is to construct a Markov process on $\mathbb{R}^{m}$, $m\geq d$, whose equilibrium distribution (or a suitable marginal of it) is designed to coincide with $\pi^{\star}$ [7]. Often such Markov processes are obtained by solving stochastic differential equations (SDEs). Two typical examples of such SDEs are the overdamped Langevin equation, $c>0$,

dx = -c\nabla f(x)\,dt + \sqrt{2c}\,dW(t), \qquad (1.1)

and the underdamped Langevin equation, $c,\gamma>0$,

dv = -\gamma v\,dt - c\nabla f(x)\,dt + \sqrt{2\gamma c}\,dW(t), \qquad (1.2a)
dx = v\,dt. \qquad (1.2b)

Under mild assumptions on $f(x)$ one can show that the dynamics of (1.1) is ergodic with respect to the distribution $\pi^{\star}$ with density $\propto\exp(-f(x))$, while the dynamics of (1.2) is ergodic with respect to $\pi^{\star}$ with density $\propto\exp(-f(x)-\frac{1}{2c}\|v\|^{2})$.

Equations (1.1), (1.2) [43] provide the basis for different computational devices for sampling from $\pi^{\star}$. In particular, one can obtain samples from $\pi^{\star}$ by discretizing the SDEs and generating numerical solutions over a long time interval [37]. One needs to be careful with the integrator that is used, since the discrete Markov chain resulting from the numerical discretization might not be ergodic [42]. In addition, even if that chain is ergodic, it is normally the case that the stationary distribution $\widehat{\pi}^{\star}$ of the numerical solution is different from $\pi^{\star}$. The study of the asymptotic error between $\widehat{\pi}^{\star}$ and $\pi^{\star}$ has received a lot of attention in the literature. The work in [34] investigates the effect of the numerical discretization on the convergence of the ergodic averages, while [1] presents general order conditions for the numerical invariant measure $\widehat{\pi}^{\star}$ to approximate $\pi^{\star}$ with high order of accuracy, by exploiting the connections between partial differential equations and SDEs [31]. A number of recent papers have applied this framework to numerical integrators for the underdamped Langevin equation [2], as well as to the case of stochastic gradient Langevin dynamics [51, 30]. In addition, [50, 27] extended this framework to the case of post-processed integrators and to SDEs on manifolds.

Another line of research that has received much attention in the last few years deals with the study of the non-asymptotic error between the numerical approximation and the invariant measure $\pi^{\star}$. In particular, for the case of the overdamped Langevin equation (1.1) and log-concave and strongly log-concave distributions, [12] established non-asymptotic bounds in total variation distance for the Euler-Maruyama method and an explicit extension of it based on further smoothness assumptions. These results have been extended to the Wasserstein distance $W_{2}$ in e.g. [17, 11, 18, 13, 16], while the paper [25] obtains similar bounds for implicit methods applied to (1.1). Similar non-asymptotic analyses for the case of the underdamped Langevin equation appear in [10, 15, 39, 48, 21]. One of the aims of all that literature is to study the number of steps $n$ that the integrators require to achieve a target accuracy $\epsilon$ when applied to $d$-dimensional targets with condition number $\kappa$. Underdamped discretizations may lead to a better dependence of $n$ on $\epsilon$ and $d$ than their overdamped counterparts. The case of non-strongly log-concave distributions and the non-asymptotic behaviour of numerical algorithms has also received attention recently [14, 33].

In this work, we present a unified framework that allows for the non-asymptotic study of numerical methods for ergodic stochastic differential equations (including equations (1.1) and (1.2)) in the case of strongly log-concave distributions. In particular, we obtain a general bound for the error in $W_{2}$ between $\pi^{\star}$ and the probability distribution of the numerical solution after $n$ iterations. This bound depends on two factors: the first can be controlled by understanding the contractivity properties of the numerical method, while the second is directly related to the local strong error of the integrator. Moving to integrators with smaller local strong error results in a better performance when the dimensionality grows and the error level $\epsilon$ decreases; moving to integrators that are contractive for larger step sizes improves the performance for large condition numbers. This is consistent with what has been suggested in the literature [25, 41].

As an application of the suggested framework, we study two numerical methods for the underdamped Langevin dynamics. The first is the method, that we shall call EE, used in [10]; the second is a splitting method called UBU. Both require the same computational effort, namely one gradient evaluation per time step, but UBU has better convergence properties [3, 4].

  • For the integrator EE, we prove that, in $2$-Wasserstein distance and for a strongly log-concave $d$-dimensional distribution with condition number $\kappa$, the algorithm produces a distribution that is $\epsilon$-away from the target in a number of steps that (up to logarithmic terms) behaves like $\mathcal{O}(\epsilon^{-1}\kappa^{3/2}d^{1/2})$. This improves on the $\mathcal{O}(\epsilon^{-1}\kappa^{2}d^{1/2})$ estimate in [10]. EE has also been analysed in [15]; however, the analysis in that reference has severe limitations, as discussed in Section 6.

  • UBU, under the same hypotheses as EE, shares the $\mathcal{O}(\epsilon^{-1}\kappa^{3/2}d^{1/2})$ estimate. However, unlike EE, UBU is capable of leveraging additional smoothness properties of the log-density of the target. With such an additional smoothness assumption, we prove an estimate that depends on $\epsilon$, $\kappa$ and $d$ as $\mathcal{O}(\epsilon^{-1/2}\kappa^{5/4}d^{1/4})$ (there is also a dependence on a bound for the third derivatives of the target log-density).

Even though a detailed comparison between UBU and alternative algorithms is not within the scope of the present paper, the following comments are in order.

  • As we will discuss in Remark 6.3, for fixed $\kappa$, the improvement from the $\epsilon^{-1}d^{1/2}$ EE estimate to the $\epsilon^{-1/2}d^{1/4}$ UBU estimate arises from EE having strong order one and UBU having strong order two. This shows the importance of strong second-order integrators. A strong second-order discretization of the underdamped Langevin dynamics that requires evaluation of the Hessian has been introduced in [15]. However the analysis in that reference only holds for unrealistic values of the stepsize, see Section 6.2.

  • The randomized midpoint method in [48] uses two gradient evaluations per time step and may be regarded as optimal [9] among the integrators of the underdamped Langevin dynamics that use the driving Brownian motion, weighted integrals of it, and an oracle for $\nabla f$. For Lipschitz gradients, the estimate of the mixing time is $\mathcal{O}(\epsilon^{-1/3}\kappa^{7/6}d^{1/6}+\epsilon^{-2/3}\kappa d^{1/3})$, where we note the favourable dependence on $\kappa$, which stems from the random nature of the algorithm (see [9] and its references). For fixed $\kappa$ we then find an $\epsilon^{-2/3}d^{1/3}$ behaviour, to be compared with the $\epsilon^{-1/2}d^{1/4}$ estimate of UBU when the extra smoothness assumption holds. See Remark 6.3.

  • The algorithm suggested in [40] is not based on integrating the underdamped Langevin equation but an alternative system of stochastic differential equations where $x$ has additional smoothness (see Remark 3.3). For fixed $\kappa$, the authors prove an $\mathcal{O}(\epsilon^{-1/2}d^{1/4})$ estimate of the mixing time (i.e. the behaviour established here for UBU). That reference does not investigate the dependence of the mixing time on $\kappa$; numerical experiments suggest the algorithm does not operate satisfactorily when the condition number is large.

The main contributions of this work are:

  1. The use of an appropriate state-space form representation of SDEs and their numerical integrators that allows one to establish contractivity estimates both for the time-continuous process and its numerical solution.

  2. A study of the contractivity of integrators for the underdamped Langevin dynamics that takes into account the possible impact of increasing condition numbers.

  3. A general result that allows one to obtain bounds for the $2$-Wasserstein distance between the target distribution and its numerical approximations for general SDEs. In particular the result may be applied to discretizations of the overdamped and underdamped Langevin equations.

  4. We improve on the analysis in [10] and explain the reasons why similar improvements may be expected when analysing other integrators.

  5. We suggest the use in sampling of UBU, a splitting integrator for the underdamped Langevin equations that only requires one gradient evaluation per step and possesses second order weak and strong accuracy.

  6. We provide non-asymptotic estimates of the sampling accuracy of UBU.

The rest of the paper is organised as follows. In Section 2 we set up notation and discuss the different smoothness assumptions on $f$ that we will employ throughout the paper. In Section 3 we present the stochastic differential equations (SDEs) we are concerned with. These are written in a state-space form framework, similar to that used (for other purposes) in [32, 20, 45]. This framework is useful here because it makes it easy (see Propositions 3.8 and 3.10) to investigate the contractivity properties that underlie the SDE Wasserstein distance estimates between the push-forward in time of two initial probability distributions (Proposition 3.6). Section 4, parallel to Section 3, is devoted to the integrators and their contractivity. Again a state-space framework is used that makes it possible to easily investigate the contractivity of the integrators. Section 5 contains one of the main contributions of this paper, Theorem 5.2, which provides a general result for getting bounds of the Wasserstein distance between the invariant distribution $\pi^{\star}$ of the SDE and the distribution of the numerical solution. To apply Theorem 5.2 one needs (1) to establish a contractivity estimate for the integrator and (2) to prove what we call a local error bound. The latter is essentially a mean square bound of the difference between a single step of the integrator and a corresponding step with the SDE, under the assumption that the initial data for the step follows the distribution $\pi^{\star}$. Section 6 applies the general result to investigate two discretizations of the underdamped Langevin dynamics. The final Section 7 contains some additional results and also the more technical proofs of the results in the preceding sections.

The extension of the material in this paper to variable step sizes and to inaccurate gradients is certainly possible, but will not be considered.

2 Preliminaries

We will now discuss some assumptions on $f$, as well as set up some notation that we will later use.

2.1 Smoothness properties of $f$

The symbol $\|\cdot\|$ always refers to the standard Euclidean norm. Throughout the paper we shall assume that the following two conditions hold:

Assumption 2.1.

$f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is twice differentiable and $L$-smooth, i.e.

\forall x,y\in\mathbb{R}^{d},\qquad \|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|. \qquad (2.1)
Assumption 2.2.

$f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is $m$-strongly convex, i.e.

\forall x,y\in\mathbb{R}^{d},\qquad f(y)\geq f(x)+\left\langle\nabla f(x),y-x\right\rangle+\frac{m}{2}\|x-y\|^{2}.

It is well known that these two assumptions are equivalent to the Hessian of $f$, which we will denote by $\mathcal{H}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d\times d}$, being positive definite and satisfying $mI_{d\times d}\preceq\mathcal{H}(x)\preceq LI_{d\times d}$. In studies like the present one, Assumptions 2.1 and 2.2 are standard in the literature: see, among others, [11, 17, 18, 13] for the overdamped Langevin dynamics and [10, 15, 39, 21] for the underdamped case.

In addition to these two assumptions, the following further smoothness assumption on $f$ will be used when it comes to analysing higher strong-order discretizations for the underdamped Langevin equation. The symbol $\mathcal{H}^{\prime}$ denotes the tensor of third derivatives (derivative of the Hessian); at each $x\in\mathbb{R}^{d}$, $\mathcal{H}^{\prime}(x)$ is a bilinear operator mapping pairs $(w_{1},w_{2})\in\mathbb{R}^{d}\times\mathbb{R}^{d}$ into vectors in $\mathbb{R}^{d}$.

Assumption 2.3.

$f$ is three times differentiable and there is a constant $L_{1}\geq 0$ such that at each point $x\in\mathbb{R}^{d}$, for arbitrary $w_{1}$, $w_{2}$:

\|\mathcal{H}^{\prime}(x)[w_{1},w_{2}]\|\leq L_{1}\|w_{1}\|\,\|w_{2}\|.

2.2 Wasserstein distance

Let $\pi_{1}$ and $\pi_{2}$ be two probability measures on $\mathbb{R}^{N}$. The $2$-Wasserstein distance between $\pi_{1},\pi_{2}$ is given by

W_{2}(\pi_{1},\pi_{2})=\left(\inf_{\zeta\in Z}\int\|x-y\|^{2}\,d\zeta(x,y)\right)^{1/2},

where $Z$ is the set of all couplings [19] between $\pi_{1}$ and $\pi_{2}$, i.e. the set of all probability distributions in $\mathbb{R}^{N}\times\mathbb{R}^{N}$ whose marginals on the first and second factors are $\pi_{1}$ and $\pi_{2}$ respectively. More generally, if $P$ is an $N\times N$ positive definite symmetric matrix, we will use the distance

W_{P}(\pi_{1},\pi_{2})=\left(\inf_{\zeta\in Z}\int\|x-y\|_{P}^{2}\,d\zeta(x,y)\right)^{1/2},

where the $P$-norm is defined by $\|\xi\|_{P}=(\xi^{T}P\xi)^{1/2}$. Since the $P$-norm and the standard Euclidean norm are related by

p_{\min}\|\cdot\|^{2}\leq\|\cdot\|_{P}^{2}\leq p_{\max}\|\cdot\|^{2}, \qquad (2.2)

where $p_{\min}$ and $p_{\max}$ are the smallest and largest eigenvalues of $P$, we also have

p_{\min}W_{2}^{2}(\pi_{1},\pi_{2})\leq W_{P}^{2}(\pi_{1},\pi_{2})\leq p_{\max}W_{2}^{2}(\pi_{1},\pi_{2}),

for arbitrary $\pi_{1}$, $\pi_{2}$. Therefore bounds for the metric $W_{P}$ may immediately be translated into bounds for $W_{2}$ and vice versa.
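As a concrete illustration of the relation between $W_{P}$ and $W_{2}$ (this sketch is ours, not part of the original paper), the code below estimates both distances between two empirical measures with equal numbers of atoms by solving the optimal assignment problem, and checks the sandwich inequality above; the matrix $P$ and the sample clouds are arbitrary illustrative choices.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
N, n = 2, 200                                 # dimension and atoms per empirical measure
X = rng.normal(size=(n, N))                   # samples from pi_1 (illustrative)
Y = 1.5 * rng.normal(size=(n, N)) + 1.0       # samples from pi_2 (illustrative)
P = np.array([[1.0, 1.0], [1.0, 2.0]])        # positive definite weight matrix

def w2sq_empirical(X, Y, M=None):
    """Squared 2-Wasserstein distance between two equally weighted empirical
    measures, in the norm ||z||_M^2 = z^T M z (M = identity by default)."""
    if M is None:
        M = np.eye(X.shape[1])
    diff = X[:, None, :] - Y[None, :, :]               # all pairwise differences
    cost = np.einsum('ijk,kl,ijl->ij', diff, M, diff)  # pairwise squared M-norms
    row, col = linear_sum_assignment(cost)             # optimal coupling = assignment
    return cost[row, col].mean()

W2sq = w2sq_empirical(X, Y)
WPsq = w2sq_empirical(X, Y, P)
pmin, pmax = np.linalg.eigvalsh(P)[[0, -1]]
# p_min W_2^2 <= W_P^2 <= p_max W_2^2 holds for the empirical measures
print(WPsq, pmin * W2sq, pmax * W2sq)
assert pmin * W2sq - 1e-9 <= WPsq <= pmax * W2sq + 1e-9
```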

3 Stochastic differential equations

In this section we will study some properties of a class of ergodic stochastic differential equations that includes (1.1) and (1.2). In particular, we will extend to the stochastic case a control theoretical framework used in [20, 32] to analyse optimization algorithms, and study properties of such SDEs, including the existence of an invariant measure, and the speed of convergence to equilibrium in the Wasserstein distance.

3.1 State-space form

We are concerned with sampling algorithms obtained by discretizing SDEs with additive noise that may be written as linear systems in state-space form (note that this excludes algorithms, like the Riemann manifold MALA in [22], that use multiplicative noise; also Hamiltonian Monte Carlo [6] and similar piecewise deterministic samplers that use jumps do not fit in the present study):

d\xi(t) = A\xi(t)\,dt + Bu(t)\,dt + \sigma\,dW(t), \qquad (3.1a)
x(t) = C\xi(t), \qquad (3.1b)
u(t) = \nabla f(x(t)). \qquad (3.1c)

Here $\xi\in\mathbb{R}^{N}$ is the state, $u\in\mathbb{R}^{d}$ is the input, $x\in\mathbb{R}^{d}$ is the output that is mapped to $u$ by the nonlinear map $\nabla f$, and $W$ represents the standard $M$-dimensional Brownian motion. The real matrices $A$, $B$, $C$ and $\sigma$ are constant, with sizes $N\times N$, $N\times d$, $d\times N$ and $N\times M$ respectively. We define

D=(1/2)\,\sigma\sigma^{T},

and note that, since the right-hand side of (3.1a) is globally Lipschitz continuous, the solution exists and is unique.

Example 3.1.

The simplest case corresponds to the overdamped Langevin equation (1.1) (the positive constant $c$ may be set $=1$ by rescaling $t$) and $W$ $d$-dimensional. Here, $N=d$, $M=d$, $\xi=x$, $A=0_{d\times d}$, $B=-cI_{d}$, $C=I_{d}$, $\sigma=\sqrt{2c}I_{d}$, $D=cI_{d}$.

Example 3.2.

The underdamped Langevin dynamics (1.2) ($\gamma$ and $c$ are positive constants and $W$ is $d$-dimensional) has $N=2d$, $M=d$, $\xi=[v^{T},x^{T}]^{T}$, and ($0$ stands for $0_{d\times d}$)

A=\begin{bmatrix}-\gamma I_{d}&0\\ I_{d}&0\end{bmatrix},\quad B=\begin{bmatrix}-cI_{d}\\ 0\end{bmatrix},\quad C=\begin{bmatrix}0&I_{d}\end{bmatrix},\quad \sigma=\begin{bmatrix}\sqrt{2\gamma c}\,I_{d}\\ 0\end{bmatrix},\quad D=\begin{bmatrix}\gamma cI_{d}&0\\ 0&0\end{bmatrix}.
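For readers who prefer to see these block matrices assembled explicitly, the following sketch (ours, not from the paper) builds them with Kronecker products, anticipating the structure (3.6) used in Section 3.4; the values of $\gamma$, $c$ and $d$ are arbitrary.

```python
import numpy as np

gamma, c, d = 2.0, 0.25, 3          # illustrative parameter values
I = np.eye(d)

# "hat" matrices: A = A_hat (x) I_d, etc., for the state xi = [v; x]
A_hat = np.array([[-gamma, 0.0],
                  [   1.0, 0.0]])
B_hat = np.array([[-c],
                  [0.0]])
C_hat = np.array([[0.0, 1.0]])
sigma_hat = np.array([[np.sqrt(2.0 * gamma * c)],
                      [0.0]])

A = np.kron(A_hat, I)               # 2d x 2d
B = np.kron(B_hat, I)               # 2d x d
C = np.kron(C_hat, I)               # d  x 2d
sigma = np.kron(sigma_hat, I)       # 2d x d
D = 0.5 * sigma @ sigma.T           # equals [[gamma*c*I, 0], [0, 0]]

assert np.allclose(D[:d, :d], gamma * c * I) and np.allclose(D[d:, d:], 0.0)
```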
Remark 3.3.

As distinct from the situation in (1.1), in (1.2) the noise $W(t)$ does not enter the $x$ equation directly; it does so only through the auxiliary variable $v$. This results in $x(t)$ being smoother in the underdamped case than in the overdamped case. This idea may be taken further: additional auxiliary variables may be introduced so as to increase the smoothness of $x(t)$, see e.g. [40].

The following proposition, whose proof is given in Section 7.1, relates (3.1) and the pdf $\propto\exp\big(-f(x)\big)$. The proposition may be used to check that the target is in fact the invariant density for the overdamped Langevin dynamics (1.1) and that the underdamped Langevin system (1.2) has the invariant density $\propto\exp\big(-f(x)-\|v\|^{2}/(2c)\big)$.

Proposition 3.4.

Assume that $S$ is an $N\times N$ positive semidefinite symmetric matrix.

  • The relations

    {\rm Tr}(A+DS) = 0, \qquad (3.2a)
    CB+CDC^{T} = 0, \qquad (3.2b)
    CA+B^{T}S+2CDS = 0, \qquad (3.2c)
    SA+A^{T}S+2SDS = 0, \qquad (3.2d)

    imply that (3.1) has the invariant probability distribution $\pi^{\star}$ with density

    \propto\exp\big(-f(C\xi)-(1/2)\xi^{T}S\xi\big).

  • If $SC^{T}=0$, then the marginal of $\propto\exp\big(-f(C\xi)-(1/2)\xi^{T}S\xi\big)$ on $x=C\xi$ is the target $\propto\exp(-f(x))$.

If $f$ is regarded as being arbitrary, then the relations (3.2) are also necessary for the probability distribution with density $\propto\exp\big(-f(C\xi)-(1/2)\xi^{T}S\xi\big)$ to be invariant, see Section 7.1. The next result may be useful to check the hypotheses of Proposition 3.4. The proof is a simple exercise and will not be given.

Proposition 3.5.

The relations (3.2) hold if

A=-(D+R)S,\qquad B=-(D+R)C^{T},

where $R$ is an arbitrary $N\times N$ skew-symmetric matrix.
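As a quick sanity check (not in the original paper), the sketch below verifies relations (3.2) and the marginal condition $SC^{T}=0$ numerically for the underdamped Langevin matrices of Example 3.2, and confirms that they arise from the Proposition 3.5 construction with the particular skew-symmetric choice $R=\begin{bmatrix}0&cI_{d}\\ -cI_{d}&0\end{bmatrix}$; the parameter values are arbitrary.

```python
import numpy as np

gamma, c, d = 2.0, 0.25, 3
I, Z = np.eye(d), np.zeros((d, d))

# Example 3.2 matrices for the underdamped Langevin dynamics (state xi = [v; x])
A = np.block([[-gamma * I, Z], [I, Z]])
B = np.vstack([-c * I, Z])
C = np.hstack([Z, I])
sigma = np.vstack([np.sqrt(2 * gamma * c) * I, Z])
D = 0.5 * sigma @ sigma.T
S = np.block([[I / c, Z], [Z, Z]])          # (1/2) xi^T S xi = ||v||^2 / (2c)

# Relations (3.2) of Proposition 3.4 and the marginal condition S C^T = 0
assert abs(np.trace(A + D @ S)) < 1e-12                     # (3.2a)
assert np.allclose(C @ B + C @ D @ C.T, 0)                  # (3.2b)
assert np.allclose(C @ A + B.T @ S + 2 * C @ D @ S, 0)      # (3.2c)
assert np.allclose(S @ A + A.T @ S + 2 * S @ D @ S, 0)      # (3.2d)
assert np.allclose(S @ C.T, 0)

# Proposition 3.5: the same A, B arise from A = -(D+R)S, B = -(D+R)C^T
R = np.block([[Z, c * I], [-c * I, Z]])     # skew-symmetric
assert np.allclose(A, -(D + R) @ S) and np.allclose(B, -(D + R) @ C.T)
```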

3.2 Convergence to the invariant distribution

We assume hereafter that (3.1) has the unique invariant distribution $\pi^{\star}$. If $\pi$ denotes the probability distribution of the initial value $\xi(0)$ for (3.1) and $\Phi_{t}\pi$, $t\geq 0$, represents the resulting probability distribution of $\xi(t)$, we will investigate the convergence, in the Wasserstein distance, of $\Phi_{t}\pi$ towards $\pi^{\star}$ as $t\rightarrow\infty$.

In order to estimate $W_{P}(\Phi_{t}\pi_{1},\Phi_{t}\pi_{2})$ we use the following well-known approach. We introduce the auxiliary $2N$-dimensional SDE:

d\xi^{(1)}(t) = A\xi^{(1)}(t)\,dt + B\nabla f(C\xi^{(1)}(t))\,dt + \sigma\,dW(t), \qquad (3.3a)
d\xi^{(2)}(t) = A\xi^{(2)}(t)\,dt + B\nabla f(C\xi^{(2)}(t))\,dt + \sigma\,dW(t), \qquad (3.3b)

where the same Brownian motion $W(t)$ drives $\xi^{(1)}(t)$ and $\xi^{(2)}(t)$. If $\xi^{(1)}(0)\sim\pi_{1}$ and $\xi^{(2)}(0)\sim\pi_{2}$, and $\zeta$ is a coupling between $\pi_{1}$ and $\pi_{2}$, then the pushforward of $\zeta$ by the solution of (3.3) provides a coupling for the distributions $\Phi_{t}\pi_{1}$ and $\Phi_{t}\pi_{2}$ of $\xi^{(1)}(t)$ and $\xi^{(2)}(t)$. In this setting it is easy to prove the following result.

Proposition 3.6.

Assume that $P\succ 0$ and $\lambda>0$ exist such that for (3.3), almost surely,

\|\xi^{(2)}(t)-\xi^{(1)}(t)\|_{P}^{2}\leq e^{-\lambda t}\|\xi^{(2)}(0)-\xi^{(1)}(0)\|_{P}^{2},\qquad t>0. \qquad (3.4)

Then, for arbitrary distributions $\pi_{1}$ and $\pi_{2}$,

W_{P}(\Phi_{t}\pi_{1},\Phi_{t}\pi_{2})\leq e^{-\lambda t/2}\,W_{P}(\pi_{1},\pi_{2}),\qquad t>0,

and, in particular, for arbitrary $\pi$,

W_{P}(\Phi_{t}\pi,\pi^{\star})\leq e^{-\lambda t/2}\,W_{P}(\pi,\pi^{\star}),\qquad t>0. \qquad (3.5)

3.3 Contractivity

We now identify sufficient conditions for (3.4) to hold.

Lemma 3.7.

Let $P\succ 0$ be an $N\times N$ symmetric matrix and $\lambda>0$. For solutions of (3.3),

d\Big(e^{\lambda t}[\xi^{(2)}(t)-\xi^{(1)}(t)]^{T}P[\xi^{(2)}(t)-\xi^{(1)}(t)]\Big) = e^{\lambda t}\Big([\xi^{(2)}(t)-\xi^{(1)}(t)]^{T}(\lambda P+A^{T}P+PA)[\xi^{(2)}(t)-\xi^{(1)}(t)] + [u^{(2)}(t)-u^{(1)}(t)]^{T}B^{T}P[\xi^{(2)}(t)-\xi^{(1)}(t)] + [\xi^{(2)}(t)-\xi^{(1)}(t)]^{T}PB[u^{(2)}(t)-u^{(1)}(t)]\Big)\,dt.
Proof.

It is enough to apply Ito’s rule to

F(t,\xi^{(1)}(t),\xi^{(2)}(t))=e^{\lambda t}[\xi^{(2)}(t)-\xi^{(1)}(t)]^{T}P[\xi^{(2)}(t)-\xi^{(1)}(t)];

the Ito correction is

{\rm Tr}\left(\begin{bmatrix}\sigma^{T}&\sigma^{T}\end{bmatrix}\begin{bmatrix}P&-P\\ -P&P\end{bmatrix}\begin{bmatrix}\sigma\\ \sigma\end{bmatrix}\right)=0. ∎

The inputs $u^{(1)}(t)$, $u^{(2)}(t)$ that appear in the lemma may be eliminated by using that $\nabla f(x)$ is continuously differentiable. In fact, by the mean value theorem,

u^{(2)}(t)-u^{(1)}(t) = \bar{\mathcal{H}}(x^{(2)}(t),x^{(1)}(t))\,[x^{(2)}(t)-x^{(1)}(t)] = \bar{\mathcal{H}}(x^{(2)}(t),x^{(1)}(t))\,C\,[\xi^{(2)}(t)-\xi^{(1)}(t)],

where, for each pair of vectors $y_{1}$, $y_{2}$ in $\mathbb{R}^{d}$, we have defined

\bar{\mathcal{H}}(y_{2},y_{1})=\int_{0}^{1}\mathcal{H}\big(y_{1}+z[y_{2}-y_{1}]\big)\,dz

($\mathcal{H}$ is the Hessian of $f$). After elimination of the inputs, Lemma 3.7 yields

d\Big(e^{\lambda t}[\xi^{(2)}(t)-\xi^{(1)}(t)]^{T}P[\xi^{(2)}(t)-\xi^{(1)}(t)]\Big) = e^{\lambda t}[\xi^{(2)}(t)-\xi^{(1)}(t)]^{T}\Big(\lambda P+P\big(A+B\bar{\mathcal{H}}(x^{(2)}(t),x^{(1)}(t))C\big)+\big(A+B\bar{\mathcal{H}}(x^{(2)}(t),x^{(1)}(t))C\big)^{T}P\Big)[\xi^{(2)}(t)-\xi^{(1)}(t)]\,dt,

an equality that implies our next result.

Proposition 3.8.

Let $P\succ 0$ be an $N\times N$ symmetric matrix and $\lambda>0$. Assume that, for each $y_{1},y_{2}\in\mathbb{R}^{d}$, the matrix

\mathcal{T}(\lambda,P,y_{1},y_{2})=\lambda P+P\big(A+B\bar{\mathcal{H}}(y_{1},y_{2})C\big)+\big(A+B\bar{\mathcal{H}}(y_{1},y_{2})C\big)^{T}P

is $\preceq 0$. Then, for solutions of (3.3) the contractivity estimate (3.4) holds almost surely.

3.4 Checking contractivity

We next provide a result that is useful when checking the hypothesis $\mathcal{T}\preceq 0$ in the last proposition.

Typically, in (3.1)

A=\widehat{A}\otimes I_{d},\qquad B=\widehat{B}\otimes I_{d},\qquad C=\widehat{C}\otimes I_{d}, \qquad (3.6)

with $\widehat{A}$, $\widehat{B}$, and $\widehat{C}$ of sizes $\widehat{N}\times\widehat{N}$, $\widehat{N}\times 1$, and $1\times\widehat{N}$ respectively (which implies that $N=\widehat{N}d$). This is for instance the situation for the overdamped and underdamped Langevin equations presented above, where $\widehat{N}=1$ and $\widehat{N}=2$ respectively. In general $\widehat{N}$ will be a small integer and therefore the matrices $\widehat{A}$, $\widehat{B}$, and $\widehat{C}$ will also be small.

When (3.6) holds and also $\sigma=\widehat{\sigma}\otimes I_{d}$ (with $\widehat{\sigma}$ of size $\widehat{N}\times\widehat{M}$) and $S=\widehat{S}\otimes I_{d}$, the hypotheses of Proposition 3.4 may be stated in terms of the matrices with a hat, i.e., in the second item, $\widehat{S}\widehat{C}^{T}=0$ and, in the first item, ${\rm Tr}(\widehat{A}+\widehat{D}\widehat{S})=0$, etc. (here $\widehat{D}=(1/2)\widehat{\sigma}\widehat{\sigma}^{T}$). The same observation applies to Proposition 3.5. In addition, it makes sense to consider that the matrix $P\succ 0$ is of the form $\widehat{P}\otimes I_{d}$ with $\widehat{P}$ of size $\widehat{N}\times\widehat{N}$. Note that the eigenvalues of $P$ are obtained by repeating $d$ times each eigenvalue of $\widehat{P}$ and in particular $P\succ 0$ if and only if $\widehat{P}\succ 0$. We then have:

Lemma 3.9.

Assume that (3.6) holds and $P=\widehat{P}\otimes I_{d}$. The set of the $N=\widehat{N}d$ eigenvalues of $\mathcal{T}(\lambda,P,y_{1},y_{2})$ is the union of the sets of eigenvalues of the matrices (of size $\widehat{N}\times\widehat{N}$)

\lambda\widehat{P}+\widehat{P}\big(\widehat{A}+H_{i}(y_{1},y_{2})\widehat{B}\widehat{C}\big)+\big(\widehat{A}+H_{i}(y_{1},y_{2})\widehat{B}\widehat{C}\big)^{T}\widehat{P}, \qquad (3.7)

where $H_{i}(y_{1},y_{2})$, $i=1,\dots,d$, are the eigenvalues of $\bar{\mathcal{H}}(y_{1},y_{2})$.

Proof.

After using (3.6) and $\bar{\mathcal{H}}=1\otimes\bar{\mathcal{H}}$, the mixed product property of $\otimes$ implies:

\mathcal{T}=(\lambda\widehat{P}+\widehat{P}\widehat{A}+\widehat{A}^{T}\widehat{P})\otimes I_{d}+(\widehat{P}\widehat{B}\widehat{C}+\widehat{C}^{T}\widehat{B}^{T}\widehat{P})\otimes\bar{\mathcal{H}}.

Now factorize $\bar{\mathcal{H}}=Q\mathcal{D}Q^{T}$ with $\mathcal{D}$ diagonal and $Q$ orthogonal (both $d\times d$). It follows that

\mathcal{T}=(I_{\widehat{N}}\otimes Q)\Big[(\lambda\widehat{P}+\widehat{P}\widehat{A}+\widehat{A}^{T}\widehat{P})\otimes I_{d}+(\widehat{P}\widehat{B}\widehat{C}+\widehat{C}^{T}\widehat{B}^{T}\widehat{P})\otimes\mathcal{D}\Big](I_{\widehat{N}}\otimes Q)^{T},

and, as a consequence, the eigenvalues of $\mathcal{T}$ are those of the matrix in square brackets in the display. This matrix consists of $\widehat{N}^{2}$ blocks, where each block is diagonal of size $d\times d$. After reordering, the matrix in square brackets becomes a direct sum of the $d$ matrices in (3.7). ∎

We now describe how to find, for a given $\widehat{P}\succ 0$, the decay rate $\lambda$ in (3.4). The hypotheses on $f$ guarantee that, in (3.7), $H_{i}(y_{1},y_{2})\in[m,L]$. After defining the matrix-valued function of the real variable $H\in[m,L]$ given by

\widehat{Z}(H)=-\widehat{P}\big(\widehat{A}+H\widehat{B}\widehat{C}\big)-\big(\widehat{A}+H\widehat{B}\widehat{C}\big)^{T}\widehat{P}, \qquad (3.8)

we see from Lemma 3.9 that if, for each $H\in[m,L]$, $\lambda\widehat{P}-\widehat{Z}(H)\preceq 0$, then $\mathcal{T}\preceq 0$. We factorize $\widehat{P}=\widehat{L}\widehat{L}^{T}$ with $\widehat{L}$ invertible; for instance $\widehat{L}$ may be chosen to be lower triangular with positive diagonal entries (Cholesky factorization), but other possibilities of course exist. The condition $\lambda\widehat{P}-\widehat{Z}(H)\preceq 0$ is equivalent to the condition $\lambda I\preceq\widehat{L}^{-1}\widehat{Z}(H)\widehat{L}^{-T}$. Therefore we will have $\mathcal{T}\preceq 0$ if, as $H$ varies in $[m,L]$, the eigenvalues of $\widehat{L}^{-1}\widehat{Z}(H)\widehat{L}^{-T}$ are positive and bounded away from zero. When that is the case, $\lambda$ may be chosen to be the infimum of those eigenvalues. We also note that the eigenvalues of $\widehat{L}^{-1}\widehat{Z}(H)\widehat{L}^{-T}$ are the eigenvalues of the generalized eigenvalue problem $\widehat{Z}(H)x=\Lambda\widehat{P}x$. To sum up:

Proposition 3.10.

Given the symmetric, positive definite $\widehat{P}$, define $\widehat{Z}(H)$ by (3.8). Assume that, as $H$ varies in $[m,L]$, the eigenvalues $\Lambda$ of the generalized eigenvalue problem $\widehat{Z}(H)x=\Lambda\widehat{P}x$ are positive and bounded away from zero and let $\lambda>0$ be the infimum of those eigenvalues. Then the contractivity bound (3.4) with $P=\widehat{P}\otimes I_{d}$ holds almost surely. Alternatively, $\lambda$ may be defined as the infimum of the eigenvalues of the matrices $\widehat{L}^{-1}\widehat{Z}(H)\widehat{L}^{-T}$, where $\widehat{L}$ is any matrix with $\widehat{P}=\widehat{L}\widehat{L}^{T}$.

The following two examples show this framework applied to the case of equations (1.1) and (1.2).

Example 3.11.

In the case of the overdamped Langevin equation (1.1), if we make the choice $\widehat{P}=1$, a simple calculation gives $\widehat{Z}(H)=2cH$. We hence see that in this case $\lambda=2cm$, a well-known result.

Example 3.12.

The paper [10] studies the underdamped Langevin equation (1.2) and fixes $\gamma=2$. This does not entail any loss of generality, as the value of $\gamma>0$ may be chosen arbitrarily by rescaling the variable $t$. (Other authors, see e.g. [15], use different scalings. When we quote estimates from papers that use alternative scalings, we have translated them to the scale in [10] in order to have meaningful comparisons.) Furthermore, [10] sets $c=1/L$ and

\widehat{P}=\begin{bmatrix}1&1\\ 1&2\end{bmatrix},\qquad \widehat{L}=\begin{bmatrix}1&0\\ 1&1\end{bmatrix}. \qquad (3.9)

For these choices, we find

\widehat{L}^{-1}\widehat{Z}(H)\widehat{L}^{-T}=\begin{bmatrix}2&H/L-2\\ H/L-2&2\end{bmatrix};

the eigenvalues of this matrix are $H/L$ and $4-H/L$ and, since $H\in[m,L]$, they are $\geq m/L=1/\kappa$ ($\kappa$ denotes the condition number). In this case $\lambda=1/\kappa$ and (3.4) becomes

\|x_{2}(t)-x_{1}(t)\|^{2}+\|x_{2}(t)+v_{2}(t)-x_{1}(t)-v_{1}(t)\|^{2}\leq \exp(-t/\kappa)\Big(\|x_{2}(0)-x_{1}(0)\|^{2}+\|x_{2}(0)+v_{2}(0)-x_{1}(0)-v_{1}(0)\|^{2}\Big),

which is the contraction estimate used in [10].

We note that the use of the inner product associated with $P$ for $(v,x)$ is equivalent to working with the variables $(x+v,v)$ and the standard Euclidean inner product. This $P$-inner product often appears in the construction of Lyapunov functions for damped oscillators [47, 5].
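The rate obtained in this example can also be recovered mechanically from Proposition 3.10. The sketch below (our illustration, not part of the paper) scans $H\in[m,L]$ and solves the generalized eigenvalue problem $\widehat{Z}(H)x=\Lambda\widehat{P}x$ numerically, recovering $\lambda=1/\kappa$ for $c=1/L$; the values of $m$ and $L$ are arbitrary.

```python
import numpy as np
from scipy.linalg import eigh

m, L = 1.0, 10.0                    # illustrative values; kappa = L/m
gamma, c = 2.0, 1.0 / L             # the choices of Example 3.12
A_hat = np.array([[-gamma, 0.0], [1.0, 0.0]])
B_hat = np.array([[-c], [0.0]])
C_hat = np.array([[0.0, 1.0]])
P_hat = np.array([[1.0, 1.0], [1.0, 2.0]])

def Z_hat(H):
    # Z(H) = -P(A + H B C) - (A + H B C)^T P, see (3.8)
    M = A_hat + H * B_hat @ C_hat
    return -P_hat @ M - M.T @ P_hat

# Infimum over H in [m, L] of the generalized eigenvalues Z(H) x = Lambda P x
lam = min(eigh(Z_hat(H), P_hat, eigvals_only=True).min()
          for H in np.linspace(m, L, 1001))
print(lam, m / L)                   # both should equal 1/kappa = 0.1
```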

Example 3.13.

In the setting of the preceding example, we keep $\gamma=2$ and $\widehat{P}$ as in (3.9), but do not assume $c=1/L$. The eigenvalues of the $2\times 2$ matrix $\widehat{L}^{-1}\widehat{Z}(H)\widehat{L}^{-T}$ are found to be $\Lambda^{+}(H)=cH$ and $\Lambda^{-}(H)=4-cH$; for future reference, we note that they depend on $H$ and $c$ through the combination $cH$ (as was to be expected from (1.2), where $\nabla f(x)$ is multiplied by $c$). We distinguish four cases:

  1. $c<4/(L+m)$. As $H$ varies in $[m,L]$, we have $\min(\Lambda^{+}(H))=cm$ and $\min(\Lambda^{-}(H))=4-cL>cm$. Therefore in this case $\lambda=cm$ and an increase in $c$ results in an increase in $\lambda$. In particular, for $1/L<c<4/(L+m)$ the contraction rate improves on the value $1/\kappa$ corresponding to the choice $c=1/L$ in [10] discussed in the preceding example.

  2. $c=4/(L+m)$. In this case $\min(\Lambda^{+}(H))=cm$ and $\min(\Lambda^{-}(H))=4-cL$ have the common value $4/(\kappa+1)$.

  3. $c\in(4/(L+m),4/L)$. Now $\min(\Lambda^{+}(H))=cm$ is larger than $\min(\Lambda^{-}(H))=4-cL$ and therefore $\lambda=4-cL$, which decreases as $c$ increases.

  4. $c\geq 4/L$. In this case $\min(\Lambda^{-}(H))\leq 0$ and there is no contractivity.

Therefore, with $\gamma=2$ and $\widehat{P}$ in (3.9), the choice $c=4/(L+m)$ yields the best mixing: $\lambda=4/(\kappa+1)$. We prove in Section 7.2 that the mixing cannot be improved by using alternative choices of $\widehat{P}$.

More sophisticated choices of $\widehat{P}$ are considered in [15]. (The matrix $\widehat{P}$ is not used in that reference, which only works with a non-triangular $\widehat{L}$ such that $\widehat{P}=\widehat{L}^{T}\widehat{L}$. In turn, $\widehat{L}$ is defined indirectly by choosing the columns of $\widehat{L}^{-T}$ to be eigenvectors of a suitable known matrix that depends on a real parameter. The parameter is tuned to enhance the rate of contraction.) While those choices allow, for some values of $c$, a degree of improvement on the value of $\lambda$ we have obtained by using (3.9) in Proposition 3.10, they do not yield values of $\lambda$ above $4/(\kappa+1)$ (which is of course in agreement with the analysis in Section 7.2 below). In addition the study in [15] assumes that the variable $v$ is started at stationarity and only monitors the mixing in the variable $x$.

A useful reference on contractivity is [38].

Remark 3.14.

In the examples above it was assumed that $\widehat{P}$ was known at the outset. Due to the small dimension of this matrix in applications, it is not difficult to find favourable choices of $\widehat{P}$. This is illustrated in Section 7.2 (see also [15]).

4 Discretizations

Having established properties for solutions of SDEs of the type (3.1), we now turn our attention to the properties of their numerical discretizations. We derive a result analogous to Proposition 3.10 to establish the contractivity of the numerical solutions for integrators that use only one gradient evaluation per time step. Such integrators are particularly attractive in problems of high dimensionality.

4.1 Discrete state-space form

To discretize (3.1) on the grid points $t_{n}=nh$, $h>0$, $n=0,1,2,\dots$, we use schemes of the form:

\xi_{n+1} = A_{h}\xi_{n}+B_{h}u_{n}+\sigma^{\xi}_{h}\Omega_{n}, \qquad (4.1a)
y_{n} = C_{h}\xi_{n}+\sigma^{y}_{h}\Omega_{n}, \qquad (4.1b)
u_{n} = \nabla f(y_{n}). \qquad (4.1c)

Here, at each step, $y_{n}\in\mathbb{R}^{d}$ is the feedback output at which the gradient $\nabla f$ will be evaluated and $\Omega_{n}$ represents a random vector in $\mathbb{R}^{\bar{M}}$ suitably derived from the restriction to $[t_{n},t_{n+1}]$ of the Brownian motion $W$ in (3.1). The real matrices $A_{h}$, $B_{h}$, $C_{h}$, $\sigma^{\xi}_{h}$ and $\sigma^{y}_{h}$ are constant, with sizes $N\times N$, $N\times d$, $d\times N$, $N\times\bar{M}$ and $d\times\bar{M}$ respectively. As the examples that follow will illustrate, consistency requires that $h^{-1}(A_{h}-I)$ be an approximation to $A$ in (3.1), while $h^{-1}B_{h}$ and $C_{h}$ approximate $B$ and $C$. Note also the noise in (4.1b), which has no counterpart in (3.1b).

Example 4.1.

The Euler-Maruyama scheme for the SDE (1.1)

x_{n+1}=x_{n}-hc\nabla f(x_{n})+\sqrt{2c}\,(W(t_{n+1})-W(t_{n}))

is of the form (4.1) with $N=d$, $\bar{M}=d$, $\xi=x$, $y=x$, $\Omega_{n}=W(t_{n+1})-W(t_{n})$, $A_{h}=I_{d}$, $B_{h}=-hcI_{d}$, $C_{h}=I_{d}$, $\sigma^{\xi}_{h}=\sqrt{2c}I_{d}$, $\sigma^{y}_{h}=0_{d\times d}$.
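For concreteness, here is a minimal sketch (ours, not from the paper) of this Euler-Maruyama scheme applied to a Gaussian target $f(x)=\frac12 x^{T}\Sigma^{-1}x$; the target, step size and run length are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
d, c, h, n_steps = 2, 1.0, 0.05, 20000
Sigma_inv = np.diag([1.0, 4.0])            # target density ∝ exp(-x^T Sigma_inv x / 2)
grad_f = lambda x: Sigma_inv @ x

x = np.zeros(d)
samples = np.empty((n_steps, d))
for n in range(n_steps):
    # Euler-Maruyama step: W(t_{n+1}) - W(t_n) ~ N(0, h I_d)
    x = x - h * c * grad_f(x) + np.sqrt(2.0 * c * h) * rng.standard_normal(d)
    samples[n] = x

# Empirical covariance should be close to Sigma = diag(1, 1/4), up to an O(h) bias
print(np.cov(samples[5000:].T))
```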

Example 4.2.

To shorten the notation, we introduce the functions:

\mathcal{E}(t)=\exp(-\gamma t),\qquad \mathcal{F}(t)=\int_{0}^{t}\mathcal{E}(s)\,ds=\frac{1-\exp(-\gamma t)}{\gamma},

and

\mathcal{G}(t)=\int_{0}^{t}\mathcal{F}(s)\,ds=\frac{\gamma t+\exp(-\gamma t)-1}{\gamma^{2}}.

For the integration of (1.2) Cheng et al. [10] use the scheme:

v_{n+1} = \mathcal{E}(h)v_{n}-\mathcal{F}(h)c\nabla f(x_{n})+\sqrt{2\gamma c}\int_{t_{n}}^{t_{n+1}}\mathcal{E}(t_{n+1}-s)\,dW(s), \qquad (4.2a)
x_{n+1} = x_{n}+\mathcal{F}(h)v_{n}-\mathcal{G}(h)c\nabla f(x_{n})+\sqrt{2\gamma c}\int_{t_{n}}^{t_{n+1}}\mathcal{F}(t_{n+1}-s)\,dW(s). \qquad (4.2b)

In this example, $N=2d$, $\bar{M}=2d$, $\xi=[v^{T},x^{T}]^{T}$, $y=x$,

\Omega_{n}=\begin{bmatrix}\int_{t_{n}}^{t_{n+1}}\mathcal{E}(t_{n+1}-s)\,dW(s)\\ \int_{t_{n}}^{t_{n+1}}\mathcal{F}(t_{n+1}-s)\,dW(s)\end{bmatrix},

and

A_{h}=\begin{bmatrix}\mathcal{E}(h)I_{d}&0_{d\times d}\\ \mathcal{F}(h)I_{d}&I_{d}\end{bmatrix},\qquad B_{h}=\begin{bmatrix}-\mathcal{F}(h)cI_{d}\\ -\mathcal{G}(h)cI_{d}\end{bmatrix},
C_{h}=[0_{d\times d},I_{d}],\qquad \sigma_{h}^{\xi}=\sqrt{2c\gamma}\,I_{2d},\qquad \sigma_{h}^{y}=0_{d\times 2d}.

The recipe for simulating the Gaussian random variables $\Omega_{n}$ may be seen in [10].

In the absence of noise, this integrator is the well-known Euler exponential integrator [24], based, via the variation of constants formula/Duhamel's principle, on the exact integration of the system $dv/dt=-\gamma v$, $dx/dt=v$. In the stochastic scenario the algorithm is first order in both the weak and strong senses. The paper [21] calls this scheme the left point method. In what follows we shall refer to it as the Euler exponential (EE) integrator.
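The following sketch (ours, not taken from [10]) implements one EE step (4.2). To stay agnostic about the exact sampling recipe of [10], the per-coordinate covariance of the pair of stochastic integrals in $\Omega_{n}$ is obtained here from the Ito isometry by numerical quadrature and then factorized by Cholesky.

```python
import numpy as np
from scipy.integrate import quad

def ee_step(v, x, grad_f, h, gamma, c, rng):
    """One step of the EE integrator (4.2) for the underdamped Langevin dynamics."""
    E = lambda t: np.exp(-gamma * t)
    F = lambda t: (1.0 - np.exp(-gamma * t)) / gamma
    G = lambda t: (gamma * t + np.exp(-gamma * t) - 1.0) / gamma**2

    # Per-coordinate covariance of sqrt(2*gamma*c) * (int E dW, int F dW) over [t_n, t_{n+1}]
    c11 = 2 * gamma * c * quad(lambda u: E(u) ** 2, 0.0, h)[0]
    c22 = 2 * gamma * c * quad(lambda u: F(u) ** 2, 0.0, h)[0]
    c12 = 2 * gamma * c * quad(lambda u: E(u) * F(u), 0.0, h)[0]
    Lchol = np.linalg.cholesky(np.array([[c11, c12], [c12, c22]]))

    eta_v, eta_x = Lchol @ rng.standard_normal((2, x.size))   # correlated Gaussian noises
    g = grad_f(x)
    v_new = E(h) * v - F(h) * c * g + eta_v
    x_new = x + F(h) * v - G(h) * c * g + eta_x
    return v_new, x_new

# usage sketch: rng = np.random.default_rng(0); v = x = np.zeros(10)
# v, x = ee_step(v, x, lambda z: z, h=0.1, gamma=2.0, c=1.0, rng=rng)
```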

Example 4.3.

Another instance of an underdamped Langevin integrator of the form (4.1) is the following UBU algorithm:

v_{n+1} = \mathcal{E}(h)v_{n}-h\mathcal{E}(h/2)c\nabla f(y_{n})+\sqrt{2\gamma c}\int_{t_{n}}^{t_{n+1}}\mathcal{E}(t_{n+1}-s)\,dW(s), \qquad (4.3a)
x_{n+1} = x_{n}+\mathcal{F}(h)v_{n}-h\mathcal{F}(h/2)c\nabla f(y_{n})+\sqrt{2\gamma c}\int_{t_{n}}^{t_{n+1}}\mathcal{F}(t_{n+1}-s)\,dW(s), \qquad (4.3b)
y_{n} = x_{n}+\mathcal{F}(h/2)v_{n}+\sqrt{2\gamma c}\int_{t_{n}}^{t_{n+1/2}}\mathcal{F}(t_{n+1/2}-s)\,dW(s). \qquad (4.3c)

Here and later $t_{n+1/2}=t_{n}+h/2$. UBU is a splitting integrator [35] that is second order in both the weak and strong senses. See Section 7.3 for details and note that both EE and UBU use stochastic integrals of the form $\int\mathcal{F}\,dW$ and therefore are not covered by the analysis in [9].
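One convenient way to realize (4.3) in code, sketched below under our own conventions (this is not code from the paper), is through the splitting structure suggested by the name: a half-step of the exact Ornstein-Uhlenbeck/free flow (U), a gradient kick over the full step (B), and another U half-step; composing the three maps reproduces (4.3). As in the EE sketch, the OU covariance is evaluated by quadrature.

```python
import numpy as np
from scipy.integrate import quad

def ou_half_step(v, x, tau, gamma, c, rng):
    """Exact solve of dv = -gamma v dt + sqrt(2 gamma c) dW, dx = v dt over a time tau."""
    E = lambda t: np.exp(-gamma * t)
    F = lambda t: (1.0 - np.exp(-gamma * t)) / gamma
    c11 = 2 * gamma * c * quad(lambda u: E(u) ** 2, 0.0, tau)[0]
    c22 = 2 * gamma * c * quad(lambda u: F(u) ** 2, 0.0, tau)[0]
    c12 = 2 * gamma * c * quad(lambda u: E(u) * F(u), 0.0, tau)[0]
    Lchol = np.linalg.cholesky(np.array([[c11, c12], [c12, c22]]))
    eta_v, eta_x = Lchol @ rng.standard_normal((2, x.size))
    return E(tau) * v + eta_v, x + F(tau) * v + eta_x

def ubu_step(v, x, grad_f, h, gamma, c, rng):
    """One step of the UBU integrator (4.3): U(h/2), gradient kick B, U(h/2)."""
    v, x = ou_half_step(v, x, h / 2, gamma, c, rng)   # U: this x equals y_n of (4.3c)
    v = v - h * c * grad_f(x)                          # B: kick with the full step h
    v, x = ou_half_step(v, x, h / 2, gamma, c, rng)   # U: fresh noise on [t_{n+1/2}, t_{n+1}]
    return v, x
```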

Remark 4.4.

The format (4.1) only caters for schemes that use a single evaluation of the gradient $\nabla f$ per step. By increasing the dimension of $u$, the format may be easily adapted to integrators that use several gradient evaluations, cf. [32, 20, 45]. However, the technique used below to establish the contractivity of the integrators cannot be immediately extended to schemes with several gradient evaluations; several gradient evaluations would bring in Hessian matrices evaluated at different locations and it would not be possible to diagonalize those Hessians simultaneously, as we did when proving Lemma 3.9. For the contractivity of algorithms involving several gradient evaluations see e.g. [44] and its references.

4.2 The evolution of probability distributions in the discrete case

We will denote by $\Psi_{h,n}\pi$ the probability distribution for $\xi_{n}$ in (4.1) when $\pi$ is the distribution of $\xi_{0}$ (thus $\Psi_{h,n}$ is an operator on measures). After introducing (cf. (3.3))

\xi^{(1)}_{n+1} = A_{h}\xi^{(1)}_{n}+B_{h}\nabla f(C_{h}\xi^{(1)}_{n}+\sigma_{h}^{y}\Omega_{n})+\sigma_{h}^{\xi}\Omega_{n}, \qquad (4.4a)
\xi^{(2)}_{n+1} = A_{h}\xi^{(2)}_{n}+B_{h}\nabla f(C_{h}\xi^{(2)}_{n}+\sigma_{h}^{y}\Omega_{n})+\sigma_{h}^{\xi}\Omega_{n}, \qquad (4.4b)

we have the following discrete counterpart of Proposition 3.6, whose proof will not be given:

Proposition 4.5.

Assume that $P_{h}\succ 0$ and $\rho_{h}\in(0,1)$ exist such that for (4.4), almost surely,

\|\xi^{(2)}_{n+1}-\xi^{(1)}_{n+1}\|_{P_{h}}^{2}\leq\rho_{h}\|\xi^{(2)}_{n}-\xi^{(1)}_{n}\|_{P_{h}}^{2},\qquad n=0,1,\dots \qquad (4.5)

Then, for arbitrary distributions $\pi_{1}$ and $\pi_{2}$,

W_{P_{h}}(\Psi_{h,n}\pi_{1},\Psi_{h,n}\pi_{2})\leq\rho_{h}^{n/2}\,W_{P_{h}}(\pi_{1},\pi_{2}),\qquad n=0,1,\dots \qquad (4.6)

4.3 Checking discrete contractivity

The proof of the following result is similar to that of Proposition 3.8 and will be omitted:

Proposition 4.6.

Let $P_{h}\succ 0$ be an $N\times N$ symmetric matrix and $\rho_{h}\in(0,1)$. Assume that, for each $y_{1},y_{2}\in\mathbb{R}^{d}$, the matrix

\mathcal{T}_{h}(\rho_{h},P_{h},y_{1},y_{2})=\rho_{h}P_{h}-\big(A_{h}+B_{h}\bar{\mathcal{H}}(y_{1},y_{2})C_{h}\big)^{T}P_{h}\big(A_{h}+B_{h}\bar{\mathcal{H}}(y_{1},y_{2})C_{h}\big)

is $\succeq 0$. Then, for solutions of (4.4) the contractivity estimate (4.5) holds almost surely.

In a similar way to the continuous-time case, we can prove a discrete-time counterpart of Proposition 3.10.

Proposition 4.7.

Given the symmetric, positive definite $\widehat{P}_{h}$, set

\widehat{Z}_{h}(H)=\big(\widehat{A}_{h}+H\widehat{B}_{h}\widehat{C}_{h}\big)^{T}\widehat{P}_{h}\big(\widehat{A}_{h}+H\widehat{B}_{h}\widehat{C}_{h}\big).

Assume that, as $H$ varies in $[m,L]$, the supremum $\rho_{h}$ of the eigenvalues $R$ of the generalized eigenvalue problems $\widehat{Z}_{h}(H)x=R\widehat{P}_{h}x$ is $<1$. Then the contractivity bound (4.5) with $P_{h}=\widehat{P}_{h}\otimes I_{d}$ holds almost surely. Alternatively, $\rho_{h}$ may be defined as the supremum of the eigenvalues of the matrices $\widehat{L}^{-1}_{h}\widehat{Z}_{h}(H)\widehat{L}^{-T}_{h}$, where $\widehat{L}_{h}$ is any matrix with $\widehat{P}_{h}=\widehat{L}_{h}\widehat{L}^{T}_{h}$.

The remainder of this section is devoted to the application of the last proposition to the investigation of the contractivity of the integrators (4.2) or (4.3) applied to the underdamped Langevin system (1.2) with $\gamma=2$ and $\widehat{P}_{h}$ chosen to coincide with $\widehat{P}$ in (3.9). We have computed symbolically the eigenvalues $R=R_{h}^{\pm}(H)$ in the proposition in closed form, but the resulting expressions are complicated and will not be reproduced here. (For each fixed $H$ we attach the $+$ superscript to the discrete eigenvalue $R(H)$ closest to $\Lambda^{+}(H)=cH$ and the $-$ superscript to the other.) Rather than analysing directly the discrete eigenvalues, we follow an alternative approach based on leveraging the contractivity of the SDE (studied in Example 3.13) and the consistency of the discretizations. The key observation is that, by definition of consistency, for fixed $H\in[m,L]$ and as $h\downarrow 0$, the numerical propagator matrix $\widehat{A}_{h}+H\widehat{B}_{h}\widehat{C}_{h}$ in Proposition 4.7 differs from the differential equation propagator $\exp(h(\widehat{A}+H\widehat{B}\widehat{C}))$ in Proposition 3.10 by an $\mathcal{O}(h^{p+1})$ amount, where $p=1$ for the first order EE integrator and $p=2$ for the second order UBU. As a consequence, $h^{-1}\log(\widehat{A}_{h}+H\widehat{B}_{h}\widehat{C}_{h})$ is $\mathcal{O}(h^{p})$ away from $\widehat{A}+H\widehat{B}\widehat{C}$ (cf. [46, Example 10.1]), and, for the eigenvalues, we have $-h^{-1}\log(R_{h}^{\pm}(H))=\Lambda^{\pm}(H)+\mathcal{O}(h^{p})$. It is convenient for our purposes to work, rather than with $-h^{-1}\log(R_{h}^{\pm}(H))$, in terms of the quantities

\widetilde{\Lambda}^{\pm}_{h}(H)=2h^{-1}\big(1-R_{h}^{\pm}(H)^{1/2}\big); \qquad (4.7)

for these (since, as $\zeta\rightarrow 1$, $-\log\zeta\sim 2(1-\zeta^{1/2})$) we have $\widetilde{\Lambda}_{h}^{\pm}(H)=\Lambda^{\pm}(H)+\mathcal{O}(h)$. An illustration of the convergence of $\widetilde{\Lambda}_{h}^{\pm}(H)$ to $\Lambda^{\pm}(H)$ may be seen in Figure 1.

Remark 4.8.

Note that $R_{h}^{\pm}(H)^{1/2}=1-\widetilde{\Lambda}_{h}^{\pm}h/2$ is an approximation to $\exp(-\Lambda^{\pm}(H)h/2)$; compare with the relation between the discrete decay factor $\rho_{h}^{1/2}$ over one time step in (4.6) and the SDE decay factor $\exp(-\lambda h/2)$ in (3.5) over a time interval of length $h$.

Because the discrete eigenvalues depend smoothly on $H$ and this variable ranges in the compact interval $[m,L]$, the convergence $\widetilde{\Lambda}_{h}^{\pm}(H)\rightarrow\Lambda^{\pm}(H)$ is uniform in $H$. Therefore $2(1-\rho_{h}^{1/2})/h$, which is the minimum of $2(1-R_{h}^{\pm}(H)^{1/2})/h$, converges to the minimum of $\Lambda^{\pm}(H)$ in the limit where $h\downarrow 0$ with $c$, $m$ and $L$ fixed. We conclude that, for $h$ small enough, the discretizations will behave contractively when the SDE does, i.e. whenever $c\leq 4/(L+m)$. However, this conclusion is per se rather weak because, as we have seen in Example 3.13, as $L$ and $m$ vary with $\kappa\rightarrow\infty$, the contraction rate $\lambda$ behaves like $\mathcal{O}(\kappa^{-1})$ and it may be feared that the discretizations be contractive only for values of $h$ that, as $\kappa$ increases unboundedly, tend to $0$. If that were the case, the usefulness of the integrators (4.2) or (4.3) could be doubted. The next result proves that those fears are not warranted if $c$ is chosen appropriately: the discretizations are contractive with a rate that essentially coincides with the SDE rate, provided that $h$ is below a threshold independent of $L$ and $m$.

Figure 1: On the left, $\gamma=2$, $L=10$, $m=1$, $c=3/(L+m)$ and $\widehat{P}$ is as in (3.9); the condition number is very low so as to enhance the clarity of the figure. The solid lines correspond to the SDE eigenvalues $\Lambda^{+}(H)=cH$, $\Lambda^{-}(H)=4-cH$. The value of $\lambda$ is the minimum eigenvalue and occurs for $\Lambda^{+}$ evaluated at $m$, so that $\lambda=3m/(L+m)=3/11$. The discontinuous lines represent the UBU discrete counterparts $\widetilde{\Lambda}_{h}^{+}(H)$ and $\widetilde{\Lambda}_{h}^{-}(H)$ for $h=2,1,1/2,1/4$; as $h$ decreases, $\widetilde{\Lambda}_{h}^{+}(H)$ and $\widetilde{\Lambda}_{h}^{-}(H)$ converge to the SDE eigenvalues. For $h=2$ and $H$ large, $\widetilde{\Lambda}_{h}^{-}(H)<0$ and the numerical scheme is not contractive. The parameters in the right panel are the same as those on the left, with the exception that $L=10^{9}$, leading to an extremely high condition number. Now the minimum of the SDE eigenvalues is $\lambda\approx 10^{-9}$. As on the left, there is numerical contractivity for $h=1,1/2,1/4$, but not for $h=2$; see the leftmost column in Table 1.
Theorem 4.9.

Consider the SDE (1.2) with $\gamma=2$, $c=\bar{c}/(L+m)$, where the constant $\bar{c}\in(0,4)$ is independent of $L$ and $m$. For the discretization provided by the integrators (4.2) or (4.3), to any $\bar{r}<\bar{c}/2$ there corresponds a value $h_{0}=h_{0}(\bar{r})$ such that, for $h\leq h_{0}$, the discrete contraction estimate (4.5) holds with $P_{h}=\widehat{P}\otimes I_{d}$ ($\widehat{P}$ is the matrix in (3.9)) and $\rho_{h}=1-\bar{r}h/(\kappa+1)$.

Proof.

We begin by recalling, from Example 3.13, that the SDE eigenvalues are $\Lambda^{+}=cH$ and $\Lambda^{-}=4-cH$. In addition, and for the reasons we pointed out in the continuous case, the discrete eigenvalues $R_{h}^{\pm}(H)$, for fixed $h$, are functions of the combination $\widetilde{H}=cH$. In other words, for fixed $\widetilde{H}$, $R_{h}^{\pm}$ depend only on $h$ (i.e. they are independent of $L$ and $m$). In this way, when thinking in terms of the variable $\widetilde{H}$, changing $L$ and $m$ only impacts the convergence $\widetilde{\Lambda}_{h}^{\pm}\rightarrow\Lambda^{\pm}$ by changing the interval $[cm,cL]=[\bar{c}/(\kappa+1),\bar{c}\kappa/(\kappa+1)]\subset[0,\bar{c}]$ of values of $\widetilde{H}$ that have to be considered. Therefore, how much $h$ has to be reduced to get $|\widetilde{\Lambda}^{\pm}_{h}(H)-\Lambda^{\pm}(H)|$ below a target error tolerance is independent of $H\in[m,L]$, $m>0$ and $L\geq m$.

Now the theorem is clearly true when $\kappa$ ranges in a bounded interval and we may suppose in what follows that $\kappa$ is so large that $\bar{c}/(\kappa+1)\leq 2-\bar{c}/2$. We consider the two eigenvalues $-$ and $+$ successively.

For $\widetilde{\Lambda}_{h}^{-}$ we note that, as $H$ varies in $[m,L]$,

\Lambda^{-}(H)\geq\Lambda^{-}(L)=4-cL=4-\frac{\bar{c}L}{L+m}=\frac{(4-\bar{c})L+4m}{L+m}=\frac{(4-\bar{c})\kappa+4}{\kappa+1}\geq 4-\bar{c}.

As a consequence, for $h$ small enough, $\widetilde{\Lambda}_{h}^{-}(H)>(1/2)(4-\bar{c})$. In view of the restriction on $\kappa$, $\widetilde{\Lambda}_{h}^{-}(H)>\bar{c}/(\kappa+1)$.

The discussion of the behaviour of $\widetilde{\Lambda}_{h}^{+}$ is more delicate. Here we need to take into account that, if we set $c=0$ in (1.2) so as to switch off the force $\nabla f(x)$ and the noise, then both integrators under consideration are exact. This implies that, at $\widetilde{H}=0$ and for arbitrary $h$, $\widetilde{\Lambda}_{h}^{+}$ coincides exactly with the continuous eigenvalue $\Lambda^{+}=0$. This in turn entails that the error $\widetilde{\Lambda}_{h}^{+}(H)-\Lambda^{+}(H)$ vanishes at $\widetilde{H}=0$ for each $h$ and must then have an expression of the form $h\widetilde{H}G(h,\widetilde{H})$, where $G$ is a smooth function (this is borne out in Figure 1, where the difference $\widetilde{\Lambda}_{h}^{+}(H)-\Lambda^{+}(H)$ decreases as $\widetilde{H}$ decreases with fixed $h$). Since $\Lambda^{+}(H)=\widetilde{H}$, for the relative error we may write $\big(\widetilde{\Lambda}_{h}^{+}(H)-\Lambda^{+}(H)\big)/\Lambda^{+}(H)=hG(h,\widetilde{H})$. By taking $h$ sufficiently small we may guarantee that $\widetilde{\Lambda}_{h}^{+}(H)\geq(2\bar{r}/\bar{c})\Lambda^{+}(H)$ and, since $\Lambda^{+}(H)\geq\Lambda^{+}(m)=cm=\bar{c}/(\kappa+1)$, we have $\widetilde{\Lambda}_{h}^{+}(H)\geq 2\bar{r}/(\kappa+1)$ and the proof is complete. ∎

Remark 4.10.

The same proof shows that if $c=1/L$ (as in [10]) a similar result holds with a rate $\rho_{h}=1-\bar{r}h/\kappa$, where $\bar{r}<1/2$ may be chosen arbitrarily.

Remark 4.11.

The choice $c=4/(L+m)$ guarantees contractivity in the SDE, but has to be excluded from Theorem 4.9. For this value, the proof breaks down because, as the condition number increases, $\Lambda^{-}(L)=4m/(L+m)$ is not bounded away from zero. By using the expressions of the eigenvalues $R^{\pm}_{h}(H)$ at $H=L$, it may be seen that contractivity requires $h=\mathcal{O}(\kappa^{-1})$ for the EE integrator and $h=\mathcal{O}(\kappa^{-1/2})$ for the second order UBU.

Remark 4.12.

Only two properties of the integrators EE and UBU have been used in the proof: (i) they are consistent, (ii) they are exact if the force and noise are switched off. The second of these was required to prove that, for each $h$, $\widetilde{\Lambda}_{h}^{+}(0)=0$, or equivalently, $R_{h}^{+}(0)=1$ (see (4.7)), which means that $\widehat{A}_{h}$ has $1$ as an eigenvalue. In fact, for all reasonable discretizations, it is true that $\widehat{A}_{h}$ has the eigenvalue $1$. This happens because with $v_{0}=0$ and $c=0$ (no velocity, no force) any reasonable discretization will yield $v_{1}=v_{0}$, $x_{1}=x_{0}$ (the particle stands still). Therefore the theorem is true for all integrators of interest.

  h    |      c=1/L              |     c=2/(L+m)           |     c=3/(L+m)
       |  EE          UBU        |  EE          UBU        |  EE          UBU
  2    |  ***         5.000(-10) |  ***         ***        |  ***         ***
  1    |  5.000(-10)  5.000(-10) |  ***         1.000(-9)  |  ***         1.500(-9)
  1/2  |  5.000(-10)  5.000(-10) |  1.000(-9)   1.000(-9)  |  ***         1.500(-9)
  1/4  |  5.000(-10)  5.000(-10) |  1.000(-9)   1.000(-9)  |  1.500(-9)   1.500(-9)
Table 1: Contractivity of the integrators EE and UBU for the underdamped Langevin equations with $\gamma=2$, $P_{h}=\widehat{P}\otimes I_{d}$ ($\widehat{P}$ as in (3.9)). The table gives the value of $(1-\rho_{h}^{1/2})/h$, where $\rho_{h}$ is as in (4.5). The symbol *** means that for that combination of $h$ and $c$ the integrator is not contractive. The table is for the large condition number $\kappa=10^{9}$. In the corresponding tables for $\kappa=10^{3}$ or $\kappa=10^{6}$ (not reproduced in this paper), the symbol *** appears at exactly the same locations as in the case $\kappa=10^{9}$ reported above, but the values of $(1-\rho_{h}^{1/2})/h$ are multiplied by $10^{6}$ or $10^{3}$ respectively, showing a $1/\kappa$ behaviour.
Example 4.13.

The proof of the theorem sheds no light on the size of the threshold h_{0}. With a view to ascertaining the range of values of h for which the integrators (4.2) and (4.3) behave contractively, we have computed numerically the eigenvalues in Proposition 4.7 and taken the supremum over H with fixed h. Table 1 provides information on the quotient (1-\rho_{h}^{1/2})/h that approximates the decay constant \lambda/2 in the estimate (3.5). For the parameter choice c=1/L used in [10], the table shows that, for h sufficiently small (say h\leq 1), there is numerical contractivity and the quotient coincides (within the number of digits reported) with the SDE rate \lambda/2=1/(2\kappa) found in Section 3.4. The table also gives results for the choices c=2/(L+m) and c=3/(L+m), where again the values of (1-\rho_{h}^{1/2})/h reported are in agreement with the SDE rates \lambda/2=1/(\kappa+1) and \lambda/2=(3/2)/(\kappa+1) found in Example 3.13.

For all choices of c, the UBU integrator (4.3) operates contractively for values of h that are larger than those required by the EE integrator (4.2). In addition, as c increases with fixed h, EE loses contractivity before UBU does. This is consistent with Remark 4.11.
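To indicate how computations of this kind may be reproduced, the following Python sketch evaluates the quotient (1-\rho_{h}^{1/2})/h by maximizing, over a grid of H\in[m,L], the \widehat{P}-norm contraction factor of the one-step mean map. It assumes the exponential-Euler form of EE recalled in Section 7.5 (with \mathcal{E}(t)=e^{-\gamma t}, \mathcal{F}(t)=(1-e^{-\gamma t})/\gamma), takes for \widehat{P} the Choleski factor \ell_{11}=\ell_{21}=\ell_{22}=1 found in Section 7.2 with the state ordered as (v,x), and uses function names of our own; it is a sketch of the kind of computation behind Table 1, not the code actually used there.

```python
import numpy as np

def ee_mean_map(H, h, c, gamma=2.0):
    # One-step mean map of the EE integrator for the scalar model f(x) = (H/2) x^2,
    # state ordered as (v, x): the force in (7.4) is frozen at x_n over the step
    # (exponential-Euler form recalled in Section 7.5).
    E = np.exp(-gamma * h)            # assumed \mathcal{E}(h) = exp(-gamma h)
    F = (1.0 - E) / gamma             # assumed \mathcal{F}(h)
    G = (h - F) / gamma               # \int_0^h \mathcal{F}(s) ds
    return np.array([[E, -c * H * F],
                     [F, 1.0 - c * H * G]])

def contraction_quotient(h, c, m, L, gamma=2.0, n_grid=2000):
    # rho_h = sup_H ||A_h(H)||_P^2 with P = Lhat Lhat^T, Lhat = [[1,0],[1,1]].
    # Returns (1 - rho_h^{1/2})/h, or None if the map is not contractive.
    Lhat = np.array([[1.0, 0.0], [1.0, 1.0]])
    Linv = np.linalg.inv(Lhat)
    rho = 0.0
    for H in np.linspace(m, L, n_grid):
        A = ee_mean_map(H, h, c, gamma)
        # the P-norm of A equals the spectral norm of Lhat^T A Lhat^{-T}
        rho = max(rho, np.linalg.norm(Lhat.T @ A @ Linv.T, 2) ** 2)
    return None if rho >= 1.0 else (1.0 - np.sqrt(rho)) / h

kappa, m = 1.0e9, 1.0
L = kappa * m
print(contraction_quotient(h=0.5, c=1.0 / L, m=m, L=L))  # should be close to 1/(2*kappa), cf. Table 1
```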

5 A general theorem

In this section we consider integrators for (3.1) of the form ξn+1=ψh(ξn,tn)\xi_{n+1}=\psi_{h}(\xi_{n},t_{n}), tn=nht_{n}=nh, n=0,1,n=0,1,\dots, h>0h>0, where, following the terminology in [36], ψh(ξ,t)\psi_{h}(\xi,t) represents the one-step approximation; ψh(ξ,t)\psi_{h}(\xi,t) uses the restriction of the Brownian motion WW in (3.1) to the interval [t,t+h][t,t+h], but this fact is not reflected in the notation. Integrators of the form (4.1) provide a particular class of integrators of this form (cf. Remark 4.4). If PhP_{h} is an N×NN\times N matrix 0\succ 0 and π\pi is an arbitrary probability distribution for the initial condition ξ0\xi_{0}, we wish to study the distance WPh(π,Ψh,nπ)W_{P_{h}}(\pi^{\star},\Psi_{h,n}\pi) between the invariant distribution π\pi^{\star} and the distribution Ψh,nπ\Psi_{h,n}\pi of ξn\xi_{n}.

In the analysis, for random vectors XNX\in\mathbb{R}^{N}, we use the Hilbert-space norm XL2,Ph=𝔼(XPh2)1/2\|X\|_{L^{2},P_{h}}=\mathbb{E}(\|X\|_{P_{h}}^{2})^{1/2}. The symbol ,L2,Ph\langle\cdot,\cdot\rangle_{L^{2},P_{h}} will be used for the corresponding inner product. We denote by ϕh(ξ,t)\phi_{h}(\xi,t) the exact counterpart of ψh(ξ,t)\psi_{h}(\xi,t), so that if ξ(t)\xi(t) is a solution of (3.1) then ξ(tn+1)=ϕh(ξ(tn),tn)\xi(t_{n+1})=\phi_{h}(\xi(t_{n}),t_{n}). At each time-level nn, n=0,1,2,n=0,1,2,\dots, we introduce a random variable ξ^nπ\widehat{\xi}_{n}\sim\pi^{\star} such that WP(π,Ψh,nπ)=ξ^nξnL2,PhW_{P}(\pi^{\star},\Psi_{h,n}\pi)=\|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}}. For the difference ϕh(ξ^n,tn)ψh(ξ^n,tn)\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n}) (that may be seen as a local error), we consider the following assumption:

Assumption 5.1.

There is a decomposition

ϕh(ξn^,tn)ψh(ξ^n,tn)=αh(ξ^n,tn)+βh(ξ^n,tn),\phi_{h}(\widehat{\xi_{n}},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n})=\alpha_{h}(\widehat{\xi}_{n},t_{n})+\beta_{h}(\widehat{\xi}_{n},t_{n}),

and positive constants pp, h0h_{0}, C0C_{0}, C1C_{1}, C2C_{2} such that for n0n\geq 0 and hh0h\leq h_{0}:

|ψh(ξ^n,tn)ψh(ξn,tn),αh(ξ^n,tn)L2,Ph|C0hξ^nξnL2,Phαh(ξ^n,tn)L2,Ph\left|\Big{\langle}\psi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\xi_{n},t_{n}),\alpha_{h}(\widehat{\xi}_{n},t_{n})\Big{\rangle}_{L^{2},P_{h}}\right|\leq C_{0}h\>\|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}}\>\|\alpha_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}} (5.1)

and

αh(ξ^n,tn)L2,PhC1hp+1/2,βh(ξ^n,tn)L2,PhC2hp+1.\|\alpha_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}\leq C_{1}h^{p+1/2},\qquad\|\beta_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}\leq C_{2}h^{p+1}. (5.2)

We can now state our general theorem, which gives a bound for the Wasserstein distance between the invariant measure \pi^{\star} and the distribution of the (n+1)-th iteration of the numerical scheme.

Theorem 5.2.

Assume that the integrator satisfies Assumption 5.1 and that, in addition, there are constants h_{0}>0, r>0 such that for h\leq h_{0} the contractivity estimate (4.5) holds with \rho_{h}\leq(1-rh)^{2}. Then, for any initial distribution \pi, stepsize h\leq h_{0}, and n=0,1,\dots,

WPh(π,Ψh,nπ)(1hRh)nWPh(π,π)+(2C1Rh+C2Rh)hp,{W_{P_{h}}(\pi^{\star},\Psi_{h,n}\pi)\leq(1-hR_{h})^{n}W_{P_{h}}(\pi^{\star},\pi)+\left(\frac{\sqrt{2}C_{1}}{\sqrt{R_{h}}}+\frac{C_{2}}{R_{h}}\right)h^{p},} (5.3)

with

Rh=1h(1(1rh)2+C0h2)=r+o(1),ash0.R_{h}=\frac{1}{h}\Big{(}1-\sqrt{(1-rh)^{2}+C_{0}h^{2}}\Big{)}=r+o(1),\quad{\rm as}\quad h\downarrow 0.
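The stated behaviour of R_{h} follows from a first-order Taylor expansion (for h small enough that rh<1):

\sqrt{(1-rh)^{2}+C_{0}h^{2}}=(1-rh)\sqrt{1+\frac{C_{0}h^{2}}{(1-rh)^{2}}}=1-rh+\mathcal{O}(h^{2}),\qquad\text{so that}\quad R_{h}=r+\mathcal{O}(h).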
Figure 2: Different approaches for obtaining a bound for W_{2}(\Psi_{h,n+1}\pi,\pi^{\star}). The vertical axes represent probability distributions and the horizontal axes correspond to time. Solid lines indicate evolution with the SDE and broken lines evolution with the integrator. On the left, the technique in [11, 13] and in this paper, where the contractivity of the integrator is used to propagate forward the distance between the target distribution \pi^{\star} and the distribution \Psi_{h,n}\pi of the numerical solution at time t_{n}. On the right, the technique in [10, 15], which leverages the contractivity of the SDE. On the left, the “local error” is based at the target; on the right, it is based at the numerical approximation.

Before going into the proof we make some observations:

  • The theorem is in the spirit of classic “consistency and stability imply convergence” results in numerical analysis. The main idea of the proof is schematically illustrated in the left panel of Figure 2. The error of interest at time level n+1 can be decomposed into two terms. The first is the distance between \Psi_{h}(\Psi_{h,n}\pi) and \Psi_{h}\pi^{\star} and can be bounded in terms of the error at time level n by using the contractivity of the numerical integrator. The second is the distance between \Psi_{h}\pi^{\star} and \Phi_{h}\pi^{\star}=\pi^{\star} and can be bounded by estimating the strong local error by means of Assumption 5.1.

  • The local error needs to be bounded in the strong sense. This should be compared with studies about weak convergence of the numerical distribution [1, 31] where the estimates depend on the weak order of the integrator.

  • The \beta=\mathcal{O}(h^{p+1}) part of the local error results in an \mathcal{O}(h^{p}) contribution to the bound on W_{P_{h}}(\pi^{\star},\Psi_{h,n}\pi). One power of h is lost in going from local to global errors, as in the classical analysis of (deterministic) numerical integrators.

  • The \alpha=\mathcal{O}(h^{p+1/2}) part of the local error is asked to satisfy the requirement (5.1) and only loses a factor h^{1/2} when going from local to global errors. This is reminiscent of the situation for the strong convergence of numerical solutions of SDEs (see e.g. [36, Theorem 1.1]), where, for instance, the Euler-Maruyama scheme with \mathcal{O}(h^{3/2}) strong local errors yields \mathcal{O}(h) strong global errors (assuming additivity of the noise). Typically, the \alpha part of the local error will consist of Ito integrals that, conditional on the events occurring up to the beginning of the time step, have zero expectation.
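The role of the two parts of the local error in the bound (5.3) can also be understood through a simple heuristic (not part of the proof): the contractive recursion damps the influence of the past after roughly (R_{h}h)^{-1} steps; the \alpha contributions, having zero conditional mean, then accumulate in mean square, while the \beta contributions accumulate linearly,

\Big((R_{h}h)^{-1}\big(C_{1}h^{p+1/2}\big)^{2}\Big)^{1/2}=\frac{C_{1}}{\sqrt{R_{h}}}\,h^{p},\qquad (R_{h}h)^{-1}\,C_{2}h^{p+1}=\frac{C_{2}}{R_{h}}\,h^{p},

which, up to the factor \sqrt{2}, are exactly the two contributions to the bias term in (5.3).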

Proof.

We may write

ϕh(ξ^n,tn)ξn+1\displaystyle\phi_{h}(\widehat{\xi}_{n},t_{n})-\xi_{n+1} =\displaystyle= (ψh(ξ^n,tn)ψh(ξn,tn))+(ϕh(ξ^n,tn)ψh(ξ^n,tn))\displaystyle\big{(}\psi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\xi_{n},t_{n})\big{)}+\big{(}\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n})\big{)}
=\displaystyle= (ψh(ξ^n,tn)ψh(ξn,tn)+αh(ξ^n,tn))+βh(ξ^n,tn),\displaystyle\big{(}\psi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\xi_{n},t_{n})+\alpha_{h}(\widehat{\xi}_{n},t_{n})\big{)}+\beta_{h}(\widehat{\xi}_{n},t_{n}),

and therefore, by the triangle inequality and (5.1), for hh0h\leq h_{0},

ϕh(ξ^n,tn)ξn+1L2,Ph\displaystyle\|\phi_{h}(\widehat{\xi}_{n},t_{n})-\xi_{n+1}\|_{L^{2},P_{h}}
ψh(ξ^n,tn)ψh(ξn,tn)+αh(ξ^n,tn)L2,Ph+βh(ξ^n,tn)L2,Ph\displaystyle\qquad\leq\|\psi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\xi_{n},t_{n})+\alpha_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}+\|\beta_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}
(ψh(ξ^n,tn)ψh(ξn,tn)L2,Ph2\displaystyle\qquad\leq\big{(}\|\psi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\xi_{n},t_{n})\|_{L^{2},P_{h}}^{2}
+2C0hξ^nξnL2,Phαh(ξ^n,tn)L2,Ph+αh(ξ^n,tn)L2,Ph2)1/2\displaystyle\qquad\qquad\qquad+2C_{0}h\|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}}\|\alpha_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}+\|\alpha_{h}(\widehat{\xi}_{n},t_{n})\|^{2}_{L^{2},P_{h}}\big{)}^{1/2}
+βh(ξ^n,tn)L2,Ph.\displaystyle\qquad\quad+\|\beta_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}.

We next apply the contractivity hypothesis, the elementary bound 2aba2+b22ab\leq a^{2}+b^{2}, and (5.2):

ϕh(ξ^n,tn)ξn+1L2,Ph\displaystyle\|\phi_{h}(\widehat{\xi}_{n},t_{n})-\xi_{n+1}\|_{L^{2},P_{h}}
((1rh)2ξ^nξnL2,Ph2+C02h2ξ^nξnL2,Ph2+2αh(ξ^n,tn)L2,Ph2)1/2\displaystyle\quad\leq\Big{(}(1-rh)^{2}\|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}}^{2}+C_{0}^{2}h^{2}\|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}}^{2}+2\|\alpha_{h}(\widehat{\xi}_{n},t_{n})\|^{2}_{L^{2},P_{h}}\Big{)}^{1/2}
+βh(ξ^n,tn)L2,Ph\displaystyle\qquad\qquad+\|\beta_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}
(((1rh)2+C0h2)ξ^nξnL2,Ph2+2C12h2p+1)1/2+C2hp+1.\displaystyle\quad\leq\Big{(}\big{(}(1-rh)^{2}+C_{0}h^{2}\big{)}\|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}}^{2}+2C_{1}^{2}h^{2p+1}\Big{)}^{1/2}+C_{2}h^{p+1}.

Therefore, in view of the choice of ξ^n\widehat{\xi}_{n}, and taking into account that ϕh(ξ^n,tn)π\phi_{h}(\widehat{\xi}_{n},t_{n})\sim\pi^{*},

WPh(π,Ψh,n+1π)\displaystyle W_{P_{h}}(\pi^{\star},\Psi_{h,n+1}\pi)
(((1rh)2+C0h2)WPh(π,Ψh,nπ)2+2C12h2p+1)1/2+C2hp+1.\displaystyle\qquad\qquad\leq\Big{(}\big{(}(1-rh)^{2}+C_{0}h^{2}\big{)}W_{P_{h}}(\pi^{\star},\Psi_{h,n}\pi)^{2}+2C_{1}^{2}h^{2p+1}\Big{)}^{1/2}+C_{2}h^{p+1}.

The conclusion follows after applying the lemma in Section 7.4. ∎

The paper [13] analyzes the Euler-Maruyama discretization of (1.1). Under two different smoothness assumptions on ff, it derives two different estimates similar to (5.3), one with p=1/2p=1/2 and the other with p=1p=1. The application of Theorem 5.2 to those two cases retrieves the estimates in [13]; in addition the proof of our theorem as applied to those two particular cases coincides with the proofs provided in that paper. (In fact we derived Theorem 5.2 as a generalization of the material in [13] to a more general scenario.)

By considering the case where the target distribution is a product of uncorrelated univariate Gaussians, one sees that the estimates of the mixing time in [13] are optimal in their dependence on ϵ\epsilon and dd. [16] have shown that those estimates are not optimal in their dependence on κ\kappa. This implies that the result in Theorem 5.2 is not necessarily the best that may be achieved in each application.

6 Application to underdamped Langevin dynamics

We now apply Theorem 5.2 to integrators for (1.2).

6.1 The EE integrator

We begin with the EE integrator (4.2). For its local error we have the following theorem, proved in Section 7.5. Recall that setting γ=2\gamma=2 does not restrict the generality, as such a choice is equivalent to choosing the units of time. Once this value of γ\gamma has been fixed, the choice of PhP_{h} in the theorem that follows is the one that allows the best contraction rate for the SDE (1.2).

Theorem 6.1.

Set \gamma=2 and P_{h}=\widehat{P}\otimes I_{d} with \widehat{P} as in (3.9). Then, for h\leq 1, the discretization (4.2) satisfies Assumption 5.1 with p=1, C_{0}=C_{1}=0 and C_{2}=Kc^{3/2}Ld^{1/2}, where K is an absolute constant (K=\sqrt{6+2\sqrt{5}}/3).

We may now apply Theorem 5.2 to the situation at hand; to do so we need the contractivity of the scheme in the P-norm studied in Theorem 4.9. The discussion that follows may immediately be extended to all choices of c=c(L,m) that lead to contractivity; however, for the sake of clarity, we fix c=1/L as in [10] (but recall that c=1/L is suboptimal in terms of the contraction rate). For this choice, we know from Remark 4.10 that, for h\leq h_{0}, the numerical rate (1-\rho_{h}^{1/2})/h will be of the form \bar{r}/\kappa, where \bar{r}<1/2 may be chosen arbitrarily close to 1/2 (h_{0} depends of course on \bar{r} but is independent of d, L and m). Theorem 5.2 yields, for h\leq h_{0},

WP(π,Ψn,hπ)(1r¯hκ)nWP(π,π)+Kκd1/2r¯L1/2h.W_{P}(\pi^{\star},\Psi_{n,h}\pi)\leq\left(1-\frac{\bar{r}h}{\kappa}\right)^{n}W_{P}(\pi^{\star},\pi)+\frac{K\kappa d^{1/2}}{\bar{r}L^{1/2}}h. (6.1)
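For the record, the constant in (6.1) is obtained by inserting the constants of Theorem 6.1 and the rate of Remark 4.10 into (5.3): since C_{0}=C_{1}=0 and \rho_{h}\leq(1-\bar{r}h/\kappa)^{2}, we have R_{h}=\bar{r}/\kappa and, with p=1,

\frac{C_{2}}{R_{h}}\,h=Kc^{3/2}Ld^{1/2}\,\frac{\kappa}{\bar{r}}\,h=\frac{K\kappa d^{1/2}}{\bar{r}L^{1/2}}\,h,

because c^{3/2}L=L^{-1/2} for the choice c=1/L.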

Assume now that, given any initial distribution π\pi for the integrator and ϵ>0\epsilon>0, we wish to find hh and nn to guarantee that WP(π,Ψn,hπ)<ϵW_{P}(\pi^{\star},\Psi_{n,h}\pi)<\epsilon. We may achieve this goal by first choosing hh0h\leq h_{0} small enough to ensure that

Kκd1/2r¯L1/2h<ϵ2\frac{K\kappa d^{1/2}}{\bar{r}L^{1/2}}h<\frac{\epsilon}{2}

and then choosing nn large enough to get

(1r¯hκ)nWP(π,π)<ϵ2.\left(1-\frac{\bar{r}h}{\kappa}\right)^{n}W_{P}(\pi^{\star},\pi)<\frac{\epsilon}{2}.

(Instead of splitting the target ϵ\epsilon as ϵ/2+ϵ/2\epsilon/2+\epsilon/2, one may use aϵ+(1a)ϵa\epsilon+(1-a)\epsilon, a(0,1)a\in(0,1), and tune aa to improve slightly some of the error constants below.) This leads to the conditions

h<min(h0,r¯2K(m1/2ϵ)κ1/2d1/2),h<\min\left(h_{0},\frac{\bar{r}}{2K}(m^{1/2}\epsilon)\kappa^{-1/2}d^{-1/2}\right), (6.2)

(m1/2ϵm^{1/2}\epsilon is a nondimensional combination whose value does not change if the scale of xx is changed) and

n>1log(1r¯h/κ)log2WP(π,π)ϵ.n>\frac{-1}{\log(1-\bar{r}h/\kappa)}\log\frac{2W_{P}(\pi^{\star},\pi)}{\epsilon}. (6.3)
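The following Python sketch (names, default values and the contractivity threshold h_{0} are our own choices, not prescribed by the theory) turns (6.2)–(6.3) into a step-size/iteration-count recipe for EE with c=1/L; W0 stands for W_{P}(\pi^{\star},\pi).

```python
import math

def ee_stepsize_and_steps(m, L, d, eps, W0, rbar=0.49, h0=1.0):
    # h from (6.2) and n from (6.3) for the EE integrator with c = 1/L.
    # K is the constant of Theorem 6.1; rbar < 1/2 is arbitrary; h0 is the
    # (problem-dependent) contractivity threshold, here simply a user input.
    # For simplicity the strict bound (6.2) is taken at its boundary.
    kappa = L / m
    K = math.sqrt(6.0 + 2.0 * math.sqrt(5.0)) / 3.0
    h = min(h0, (rbar / (2.0 * K)) * math.sqrt(m) * eps / math.sqrt(kappa * d))
    n = math.ceil(-math.log(2.0 * W0 / eps) / math.log1p(-rbar * h / kappa))
    return h, n

print(ee_stepsize_and_steps(m=1.0, L=100.0, d=10, eps=0.1, W0=10.0))
```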

According to the bound (6.2), hh has to be scaled like (m1/2ϵ)κ1/2d1/2(m^{1/2}\epsilon)\kappa^{-1/2}d^{-1/2} as ϵ0\epsilon\downarrow 0, κ\kappa\uparrow\infty or dd\uparrow\infty; then the number of steps to achieve a target value of the contraction factor (1r¯h/κ)n(1-\bar{r}h/\kappa)^{n} scales as

(m1/2ϵ)1κ3/2d1/2.(m^{1/2}\epsilon)^{-1}\kappa^{3/2}d^{1/2}. (6.4)

The analysis of the same integrator in [10, Theorem 1] yields scalings that are more pessimistic in their dependence on κ\kappa: there, hh scales as (m1/2ϵ)κ1d1/2(m^{1/2}\epsilon)\kappa^{-1}d^{-1/2} and nn as (m1/2ϵ)1κ2d1/2(m^{1/2}\epsilon)^{-1}\kappa^{2}d^{1/2}. In addition, the initial distribution π\pi, which is arbitrary in the present study, is assumed in [10] to be a Dirac delta located at v=0v=0 and x=x0x=x_{0}; the estimates become worse as the distance between the initial position x0x_{0} used in the integrator and the mode of exp(f(x))\exp(-f(x)) increases.

It is perhaps useful to compare the technique of proof in [10] with our approach by means of Figure 2. While we employ the contractivity of the algorithm, [10] leverages the contractivity of the SDE itself. These two alternative approaches are well known in deterministic numerical differential equations (see e.g. the discussion in [23, Chapter 2], where a cartoon similar to Figure 2 is presented). On the other hand, while we investigate \phi_{h}(\cdot,t_{n})-\psi_{h}(\cdot,t_{n}) evaluated at a random variable \widehat{\xi}_{n} whose marginal distribution is \pi^{\star}, [10] has to evaluate that difference at the numerical solution \xi_{n}. It is for this reason that in [10] one needs to have information on the distribution \pi of \xi_{0} and to establish a priori bounds on the distributions of \xi_{n} as n varies (this is done in Lemma 12 of [10]). Generally speaking, once contractivity estimates are available for the numerical solution, the approach on the left of Figure 2 is to be preferred.

The reference [15] also investigates Wasserstein error bounds for the integrator EE. The general approach is the same as that in [10], but the technical details differ. For c\leq 4/(L+m) (because [15] uses the SDE contractivity, it does not have to exclude the limit case c=4/(L+m) as we do; see Remark 4.11, which implies that for this integrator bounds valid at c=4/(L+m) are only possible for h=\mathcal{O}(\kappa^{-1})), an upper bound very similar to (6.1) is derived for the 2-Wasserstein distance between the x-marginals of \pi^{\star} and \Psi_{n,h}\pi. That bound depends on \kappa, L, d and h in the same way as (6.1) does. The constants in the estimates are nevertheless different, as expected. For instance, for the choice c=1/L we have been discussing, the factor 1-\bar{r}h/\kappa in (6.1) (where \bar{r} is arbitrarily close to 1/2) is replaced in [15] by the slightly worse factor 1-0.375h/\kappa. However, the bound in [15] is only proved for very small values of h (h\leq 1/(8\kappa) when c=1/L). This is an extremely severe limitation because we know from (6.2) that, as the condition number increases, EE may be operated with a value of h of the order of 1/\sqrt{\kappa} rather than 1/\kappa. The unwelcome step size restriction originates from estimating \phi_{h}(\cdot,t_{n})-\psi_{h}(\cdot,t_{n}) at the numerical solution rather than at the SDE solution.

6.2 The UBU integrator

We now turn our attention to the UBU integrator. Under the standard smoothness Assumptions 2.1 and 2.2, the strong order of convergence of UBU is p=1p=1 (see Section 7.7 for a detailed analysis of the UBU local error under those assumptions). The analysis for UBU is then very similar to the one presented above for EE, and leads to an 𝒪(ϵ1κ3/2d1/2)\mathcal{O}(\epsilon^{-1}\kappa^{3/2}d^{1/2}) estimate for the mixing time. When ff satisfies the additional smoothness Assumption 2.3, UBU exhibits strong order p=2p=2 and this may be used in our context to improve on the estimates (6.2)–(6.3).

The proof of the next result is given in Section 7.6.

Theorem 6.2.

Assume that f satisfies Assumptions 2.1–2.3. Set \gamma=2 and P_{h}=\widehat{P}\otimes I_{d} with \widehat{P} as in (3.9). Then, for h\leq 2, the discretization (4.3) satisfies Assumption 5.1 with p=2,

C0\displaystyle C_{0} =\displaystyle= K0(2+cL),\displaystyle K_{0}(2+cL),
C1\displaystyle C_{1} =\displaystyle= K1c3/2Ld1/2,\displaystyle K_{1}c^{3/2}Ld^{1/2},
C2\displaystyle C_{2} =\displaystyle= K2((1+43)c2L3/2+(3+422)c3/2L+6cL1/2+3c2L1)d1/2,\displaystyle K_{2}\Big{(}(1+4\sqrt{3})c^{2}L^{3/2}+(3+\frac{\sqrt{42}}{2})c^{3/2}L+6cL^{1/2}+\sqrt{3}c^{2}L_{1}\Big{)}d^{1/2},

where KjK_{j}, j=0,1,2,j=0,1,2, are the following absolute constants

K0=2235,K1=312,K2=1243+52.K_{0}=\sqrt{\frac{2\sqrt{2}}{3-\sqrt{5}}},\qquad K_{1}=\frac{\sqrt{3}}{12},\qquad K_{2}=\frac{1}{24}\sqrt{\frac{3+\sqrt{5}}{2}}.

The contractivity of the scheme in the PP-norm necessary to use Theorem 5.2 was established in Theorem 4.9. As we did for the first-order integrator, for the sake of clarity, we fix c=1/Lc=1/L (but other values of cc may be discussed similarly, provided that they ensure the contractivity of the algorithm). Note that the constant C0=K0(2+cL)C_{0}=K_{0}(2+cL) is then 3K0\leq 3K_{0}. After choosing r¯<1/2\bar{r}<1/2 arbitrarily as in Remark 4.10, Theorem 5.2 yields, for hh0h\leq h_{0}:

WP(π,Ψn,hπ)(1r¯hκ)nWP(π,π)+K¯(1L+L1L2)κd1/2h2,W_{P}(\pi^{\star},\Psi_{n,h}\pi)\leq\left(1-\frac{\bar{r}h}{\kappa}\right)^{n}W_{P}(\pi^{\star},\pi)+\bar{K}\left(\frac{1}{\sqrt{L}}+\frac{L_{1}}{L^{2}}\right)\kappa d^{1/2}h^{2}, (6.5)

where K¯\bar{K} denotes an absolute constant. To ensure WP(π,Ψn,hπ)<ϵW_{P}(\pi^{\star},\Psi_{n,h}\pi)<\epsilon, we take

K¯(1L+L1L2)κd1/2h2<ϵ2,\bar{K}\left(\frac{1}{\sqrt{L}}+\frac{L_{1}}{L^{2}}\right)\kappa d^{1/2}h^{2}<\frac{\epsilon}{2},

and then increase nn as in (6.3). Thus, for UBU, the scaling of hh is

(m1/2ϵ)1/2κ1/4(1+L3/2L1)1/2d1/4,(m^{1/2}\epsilon)^{1/2}\kappa^{-1/4}(1+L^{-3/2}L_{1})^{-1/2}d^{-1/4},

and, as a consequence, the number of steps nn to guarantee a target contraction factor (1r¯h/κ)n(1-\bar{r}h/\kappa)^{n} scales as

(m1/2ϵ)1/2κ5/4(1+L3/2L1)1/2d1/4.(m^{1/2}\epsilon)^{-1/2}\kappa^{5/4}(1+L^{-3/2}L_{1})^{1/2}d^{1/4}. (6.6)
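For completeness, these scalings follow from the bias condition above by solving for h and writing L=m\kappa, so that L^{1/4}=m^{1/4}\kappa^{1/4}; up to absolute constants,

h\lesssim\left(\frac{\epsilon\sqrt{L}}{\kappa d^{1/2}\big(1+L^{-3/2}L_{1}\big)}\right)^{1/2}=(m^{1/2}\epsilon)^{1/2}\kappa^{-1/4}(1+L^{-3/2}L_{1})^{-1/2}d^{-1/4},\qquad n\gtrsim\frac{\kappa}{\bar{r}h}\log\frac{2W_{P}(\pi^{\star},\pi)}{\epsilon},

and substituting the former into the latter yields (6.6), up to the logarithmic factor.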

The dependence of nn on m1/2ϵm^{1/2}\epsilon, κ\kappa and dd in this estimate is far more favourable than it was for EE (see (6.4)). However here we have the (L1L_{1} dependent) factor (1+L3/2L1)1/2(1+L^{-3/2}L_{1})^{1/2} and one could easily concoct examples of distributions where this factor is large even if the condition number is of moderate size. In those, arguably artificial, particular cases, it may be advantageous to see UBU as a first order method as discussed at the beginning of this section.

Remark 6.3.

A comparison between (6.1) and (6.5) makes it clear that (for fixed m, L, L_{1}) the \epsilon^{-1/2}d^{1/4} dependence of the mixing time of UBU stems from having strong order two. In the second term of the right hand-side of the inequalities (6.1) and (6.5) (i.e. the bias), the exponent of h coincides with the strong order of the integrator. In order to make those second terms of size \epsilon one needs to scale h as \epsilon d^{-1/2} for EE and as \epsilon^{1/2}d^{-1/4} for UBU. The first terms of the right hand-side of the inequalities (6.1) and (6.5) then show that n has to be scaled as h^{-1}, i.e. as \epsilon^{-1}d^{1/2} for the first-order method and as \epsilon^{-1/2}d^{1/4} for the second-order method.

Integrators of strong order higher than two would have even more favourable dependence of the mixing time on ϵ\epsilon and dd. Unfortunately such high-order integrators [37] are invariably too complicated to be of much practical significance. In particular there is no splitting algorithm that achieves strong order larger than two [4]. In addition, an increase of the order may be expected to require an increase of the required smoothness of ff.

The randomized algorithm in [48] has a bias that behaves as d1/2h3/2d^{1/2}h^{3/2} leading to an ϵ2/3d1/3\epsilon^{-2/3}d^{1/3} estimate of the mixing time.

Remark 6.4.

The paper [13] considers a weaker form of the extra-smoothness assumption Assumption 2.3 where (x)\mathcal{H}(x), rather than assumed to be differentiable with derivative upper-bounded by L1L_{1}, is only assumed to be Lipschitz continuous with constant L1L_{1}. It is likely that, by means of the technique in the proof of [13, Lemma 6], Theorem 6.2 may be proved under that alternative, weaker version of Assumption 2.3, but we have not yet studied that possibility.

A second order discretization of the underdamped Langevin equation that, unlike UBU, requires one evaluation of the Hessian of f per step has been suggested in [15]. A bound similar to (6.5) is derived, which is valid only for small values h\leq\mathcal{O}(\kappa^{-1})\wedge\mathcal{O}(L^{1/2}md^{-1/2}L_{1}^{-1}). This is very restrictive because, as we have just found, for UBU h scales with \kappa as \kappa^{-1/4}.

The reference [21] suggests a novel approach to obtaining high order discretizations of the underdamped Langevin dynamics (1.2). At each step, in addition to generating suitable random variables, one has to integrate a so-called “shifted ODE”, whose solutions are smoother than the solutions of (1.2). The analysis in that reference examines the case where the integration is exact; in practice, the shifted ODE has of course to be discretized by a suitable numerical method and [21] provides numerical examples based on two different choices of such a method.

7 Additional results and proofs

7.1 Proof of Proposition 3.4

The second item in the Proposition is proved as follows. By standard linear algebra results, \xi may be uniquely decomposed as \xi=n+m with n in the kernel of C and m in the image of C^{T}; furthermore there is a bijection between values of m and values of x=C\xi. Under the assumption SC^{T}=0, which implies CS=0, Sm and m^{T}S vanish, and then \xi^{T}S\xi=n^{T}Sn is independent of m, i.e. of x. Therefore the x-marginal of the density \propto\exp(-f(C\xi)-(1/2)\xi^{T}S\xi) coincides with the x-marginal of \propto\exp(-f(C\xi)).
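As a concrete illustration (the matrices below are our own choice, consistent with the Gaussian velocity marginal used in Section 7.5), for the underdamped system one may write \xi=(v,x) and take

C=\begin{bmatrix}0&I_{d}\end{bmatrix},\qquad S=\begin{bmatrix}c^{-1}I_{d}&0\\ 0&0\end{bmatrix},\qquad SC^{T}=0,

so that the x-marginal of \propto\exp\big(-f(x)-\|v\|^{2}/(2c)\big) is indeed \propto\exp(-f(x)).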

For the first item, we have to show that the pdf kexp(f(x)(1/2)ξTSξ)k\exp\big{(}-f(x)-(1/2)\xi^{T}S\xi\big{)} (kk is the normalizing constant), that with some abuse of notation we denote by π\pi^{\star} satisfies the Fokker-Planck equation

ξ(π(Aξ+Bf(x)))+ξ(Dξπ)=0.-\nabla_{\xi}\cdot\big{(}\pi^{\star}(A\xi+B\nabla f(x))\big{)}+\nabla_{\xi}\cdot(D\nabla_{\xi}\pi^{\star})=0.

Here \nabla_{\xi}\cdot and \nabla_{\xi} respectively denote the standard divergence and gradient operators in the space \mathbb{R}^{N} of the variable \xi. The computations that follow use repeatedly the well-known identity \nabla_{\xi}\cdot(cF)=c\,\nabla_{\xi}\cdot F+F^{T}\nabla_{\xi}c, where c=c(\xi) is a scalar valued function and F=F(\xi) is an \mathbb{R}^{N}-valued function. We will also use that if R is any M\times d constant matrix, then \nabla_{\xi}\cdot(R\nabla f(x))=CR:H(x), where H(x) denotes the Hessian of f(x) and : stands for the Frobenius product of matrices (equivalently, CR:H(x)={\rm Tr}((CR)^{T}H(x))).

We observe that

ξπ=π(Sξ+CTf(x))\nabla_{\xi}\pi^{\star}=-\pi^{\star}\big{(}S\xi+C^{T}\nabla f(x)\big{)}

and therefore

ξ(πAξ)=πTr(A)πξTATSξπξTATCTf(x)\nabla_{\xi}\cdot(\pi^{\star}A\xi)=\pi^{\star}{\rm Tr}(A)-\pi^{\star}\xi^{T}A^{T}S\xi-\pi^{\star}\xi^{T}A^{T}C^{T}\nabla f(x)

and

ξ(πBf(x))\displaystyle\nabla_{\xi}\cdot\big{(}\pi^{\star}B\nabla f(x)\big{)} =\displaystyle= π(CB:H(x))\displaystyle\pi^{\star}(CB:H(x))
π(f(x))TBTSξπ(f(x))TBTCTf(x).\displaystyle-\pi^{\star}(\nabla f(x))^{T}B^{T}S\xi-\pi^{\star}(\nabla f(x))^{T}B^{T}C^{T}\nabla f(x).

Furthermore

ξ(Dξπ)\displaystyle\nabla_{\xi}\cdot(D\nabla_{\xi}\pi^{\star}) =\displaystyle= ξ(πDSξ)ξ(πDCTf(x))\displaystyle-\nabla_{\xi}\cdot(\pi^{\star}DS\xi)-\nabla_{\xi}\cdot(\pi^{\star}DC^{T}\nabla f(x))
=\displaystyle= πTr(DS)+πξTSDSξ+πξTSDCTf(x)\displaystyle-\pi^{\star}{\rm Tr}(DS)+\pi^{\star}\xi^{T}SDS\xi+\pi^{\star}\xi^{T}SDC^{T}\nabla f(x)
π(CDCT:H(x))\displaystyle-\pi^{\star}(CDC^{T}:H(x))
+π(f(x))TCDSξ+π(f(x))TCDCTf(x).\displaystyle+\pi^{\star}(\nabla f(x))^{T}CDS\xi+\pi^{\star}(\nabla f(x))^{T}CDC^{T}\nabla f(x).

From the last three displays we conclude that the left hand-side of the Fokker-Planck equation is the product of \pi^{\star} and

Tr(A+DS)\displaystyle-{\rm Tr}(A+DS)
(CB+CDCT):H(x)+(f(x))T(BTCT+CDCT)f(x)\displaystyle-(CB+CDC^{T}):H(x)+(\nabla f(x))^{T}(B^{T}C^{T}+CDC^{T})\nabla f(x)
+ξT(ATCT+SB+2SDCT)f(x)\displaystyle+\xi^{T}(A^{T}C^{T}+SB+2SDC^{T})\nabla f(x)
+ξT(ATS+SDS)ξ;\displaystyle+\xi^{T}(A^{T}S+SDS)\xi;

each of the first three relations in (3.2) is sufficient for the corresponding line in this display to vanish. (In addition, if ff is regarded as arbitrary, then those three relations are also necessary.) The quadratic form in the fourth line in the display vanishes if and only if ATS+SDSA^{T}S+SDS is skew-symmetric as demanded by the fourth relation in (3.2). This completes the proof.

7.2 Contraction estimates for the underdamped Langevin equations

We consider the underdamped Langevin equations (1.2) where, after rescaling tt, we may assume that γ=2\gamma=2. We apply Proposition 3.10 to determine P^\widehat{P} and cc so as to maximize the decay rate λ\lambda. We exclude the case L=mL=m, which has no practical relevance.

If

\widehat{L}=\left[\begin{matrix}\ell_{11}&0\\ \ell_{21}&\ell_{22}\end{matrix}\right],

11,22>0\ell_{11},\ell_{22}>0, denotes the unknown Choleski factor of P^\widehat{P}, the eigenvalues of L^1Z^iL^T\widehat{L}^{-1}\widehat{Z}_{i}\widehat{L}^{-T} are found to be

2±(112cHi21121+212222)2+4222(1121)21122.2\pm\frac{\sqrt{\left(\ell^{2}_{11}cH_{i}-2\ell_{11}\ell_{21}+\ell^{2}_{21}-\ell^{2}_{22}\right)^{2}+4\ell^{2}_{22}\left(\ell_{11}-\ell_{21}\right)^{2}}}{\ell_{11}\ell_{22}}. (7.1)

Since our aim is to ensure that these eigenvalues have a lower bound as large as possible, we only have to consider the minus sign in (7.1).

Without loss of generality, we may set 11=1\ell_{11}=1 and then have to find c>0c>0, 22>0\ell_{22}>0 and 21\ell_{21} to minimize

supmHL{1222[cH(222+221212)]2+4(121)2}.\sup_{m\leq H\leq L}\left\{\frac{1}{\ell_{22}^{2}}\Big{[}cH-(\ell_{22}^{2}+2\ell_{21}-\ell_{21}^{2})\Big{]}^{2}+4(1-\ell_{21})^{2}\right\}. (7.2)

Consider a local minimum cc, 21\ell_{21}, 22\ell_{22} of the minimization problem. We claim that

cL+m2=ac\frac{L+m}{2}=a (7.3)

where

a=222+221212.a=\ell_{22}^{2}+2\ell_{21}-\ell_{21}^{2}.

In other words a has to coincide with the midpoint of the interval [cm,cL] of possible values of cH. In fact, assume that c(L+m)/2>a, i.e. the point a is to the left of the midpoint. Then the supremum in (7.2) is attained at H=L, because |cm-a|<|cL-a|; we could lower the value of the supremum by decreasing slightly c. If c(L+m)/2<a, the supremum decreases by increasing slightly c.

When (7.3) holds the supremum in (7.2) is the common value that the expression in braces takes at H=mH=m and H=LH=L. This common value is:

1222(LmL+m)2(222+221212)2+4(121)2,\frac{1}{\ell_{22}^{2}}\left(\frac{L-m}{L+m}\right)^{2}(\ell_{22}^{2}+2\ell_{21}-\ell_{21}^{2})^{2}+4(1-\ell_{21})^{2},

that we rewrite as

1222(LmL+m)2(222+1(121)2)2+4(121)2.\frac{1}{\ell_{22}^{2}}\left(\frac{L-m}{L+m}\right)^{2}\big{(}\ell_{22}^{2}+1-(1-\ell_{21})^{2}\big{)}^{2}+4(1-\ell_{21})^{2}.

With b=(Lm)2/(L+m)2<1b=(L-m)^{2}/(L+m)^{2}<1, our task is to find X=222>0X=\ell_{22}^{2}>0, Y=(121)20Y=(1-\ell_{21})^{2}\geq 0 so as to minimize

F(X,Y)=bX(X+1Y)2+4Y.F(X,Y)=\frac{b}{X}(X+1-Y)^{2}+4Y.

We first fix Y[0,1)Y\in[0,1). Then FF\rightarrow\infty as X0X\rightarrow 0 or XX\rightarrow\infty. By setting F/X=0\partial F/\partial X=0, we easily see that FF has a unique minimum at X=1Y(0,1]X=1-Y\in(0,1]. At that minimum

F=4(b1)X+4,F=4(b-1)X+4,

which, in turn, is minimized by taking X as large as possible, i.e. X=1. Then Y=0 and F=4b<4. We then fix Y\geq 1. In this case, F\geq 4Y\geq 4, which is worse than the best F that can be achieved with Y\in[0,1). To sum up: the optimum value of F is 4b and is achieved when X=1, Y=0, i.e. \ell_{22}=1, \ell_{21}=1; then the matrix \widehat{P} is the one in (3.9). Taking these values of \ell_{ij} into (7.3), we find that c=4/(L+m) provides the optimal parameter choice. Finally, since the best value of (7.2) is 4b, the expression (7.1) shows that the best decay rate is \lambda=2-\sqrt{4b}=4m/(L+m).
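The conclusion may also be double-checked numerically; the Python sketch below (names our own) evaluates the objective (7.2) on a crude grid of parameters and compares the best value found with the value 4b attained at c=4/(L+m), \ell_{21}=\ell_{22}=1.

```python
import numpy as np

def objective(c, l21, l22, m, L):
    # The supremum in (7.2); the expression in braces is a convex quadratic in H,
    # so the supremum over [m, L] is attained at one of the endpoints.
    a = l22 ** 2 + 2.0 * l21 - l21 ** 2
    val = lambda H: (c * H - a) ** 2 / l22 ** 2 + 4.0 * (1.0 - l21) ** 2
    return max(val(m), val(L))

m, L = 1.0, 25.0
b = ((L - m) / (L + m)) ** 2
grid = np.linspace(0.05, 2.0, 80)
best = min(objective(c, l21, l22, m, L)
           for c in grid for l21 in grid for l22 in grid)
# the grid minimum is close to (and never below) 4b, attained near the stated optimum
print(best, objective(4.0 / (L + m), 1.0, 1.0, m, L), 4.0 * b)
```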

7.3 Integrators for the underdamped Langevin equations

Due to the importance of (1.2) in statistical physics and molecular dynamics, the literature on its numerical integration is by now enormous. It is completely out of our scope to summarize it and we limit ourselves to a few comments on splitting algorithms.

The different terms in the right hand-side of (1.2a) and (1.2b) correspond to different, separate physical effects, like inertia, noise, damping, etc. Therefore the system (1.2) is ideally suited to splitting algorithms [35]. A possible way of carrying out the splitting is

(A)\displaystyle{\rm(A)} (d/dt)v=0,\displaystyle\qquad(d/dt)v=0,\> (d/dt)x=v,\displaystyle(d/dt)x=v,
(B)\displaystyle{\rm(B)} (d/dt)v=cf(x),\displaystyle\qquad(d/dt)v=-c\nabla f(x),\> (d/dt)x=0,\displaystyle(d/dt)x=0,
(O)\displaystyle{\rm(O)} dv=γvdt+2γcdW,\displaystyle\qquad dv=-\gamma vdt+\sqrt{2\gamma c}dW,\> (d/dt)x=0.\displaystyle(d/dt)x=0.

Each of these subsystems may be integrated in closed form. This partitioning gives rise to schemes like ABOBA, BAOAB and OBABO [28, 29, 39]. For instance, a step of ABOBA first advances the numerical solution over [tn,tn+1/2][t_{n},t_{n+1/2}] using the exact solution of (A), then over [tn,tn+1/2][t_{n},t_{n+1/2}] using the exact solution of (B), then over [tn,tn+1][t_{n},t_{n+1}] with (O), then over [tn+1/2,tn+1][t_{n+1/2},t_{n+1}] with (B) and finally closes symmetrically with (A) over [tn+1/2,tn+1][t_{n+1/2},t_{n+1}]. ABOBA, BAOAB and OBABO have weak order two but only possess strong order one. In fact it is easy to check that with the A-B-O splitting it is impossible to generate the Ito integrals that are required to ensure strong order two.
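As an illustration of how these schemes are assembled, here is a minimal Python sketch (names and signatures are our own) of the three exact sub-flows and of one ABOBA step; grad_f stands for \nabla f, the parameters are those of (1.2), and v, x are numpy arrays of shape (d,).

```python
import numpy as np

def flow_A(v, x, t):                    # (A): free flight
    return v, x + t * v

def flow_B(v, x, t, c, grad_f):         # (B): force kick
    return v - t * c * grad_f(x), x

def flow_O(v, x, t, c, gamma, rng):     # (O): exact Ornstein-Uhlenbeck velocity step
    e = np.exp(-gamma * t)
    return e * v + np.sqrt(c * (1.0 - e ** 2)) * rng.standard_normal(v.shape), x

def aboba_step(v, x, h, c, gamma, grad_f, rng):
    v, x = flow_A(v, x, 0.5 * h)
    v, x = flow_B(v, x, 0.5 * h, c, grad_f)
    v, x = flow_O(v, x, h, c, gamma, rng)
    v, x = flow_B(v, x, 0.5 * h, c, grad_f)
    v, x = flow_A(v, x, 0.5 * h)
    return v, x
```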

Another subsystem that may be integrated exactly in closed form is

(U)\displaystyle{\rm(U)} dv=γvdt+2γcdW,\displaystyle\qquad dv=-\gamma vdt+\sqrt{2\gamma c}dW,\> (d/dt)x=v.\displaystyle(d/dt)x=v.

The algorithm BUB, used e.g. in [8], advances with (B) over [t_{n},t_{n+1/2}], then with (U) over [t_{n},t_{n+1}] and closes the step with (B) over [t_{n+1/2},t_{n+1}]. To the best of our knowledge it was first suggested by Skeel [49]. In [21] it is referred to as Strang splitting, but this terminology may be confusing because it is standard to use the expression Strang splitting for any splitting algorithm with a symmetric pattern XYX [46]. The authors of [21], a reference that compares different integrators, “believe that BUB offers an attractive compromise between accuracy and computational cost”.

Changing the roles of B and U we obtain the UBU scheme suggested in the thesis [52], where it is proved that both BUB and UBU have weak and strong order two. In fact BUB and UBU are closely related, because UBU is the algorithm that advances the BUB solution from the midpoint t_{n+1/2} of one step to the midpoint t_{n+3/2} of the next (and vice versa).
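In the same (hypothetical) notation, and reusing flow_B from the previous sketch, one UBU step may be written as follows; the exact (U) map updates v and x jointly, and the covariance of its Gaussian part is the standard one for an Ornstein–Uhlenbeck velocity and its time integral, consistent with the forms \mathcal{E}(t)=e^{-\gamma t}, \mathcal{F}(t)=(1-e^{-\gamma t})/\gamma assumed in the sketch of Section 4.

```python
import numpy as np

def flow_U(v, x, t, c, gamma, rng):
    # Exact solution of (U): dv = -gamma v dt + sqrt(2 gamma c) dW, dx = v dt.
    E = np.exp(-gamma * t)                    # \mathcal{E}(t)
    F = (1.0 - E) / gamma                     # \mathcal{F}(t)
    var_v = c * (1.0 - E ** 2)                # variance of the velocity noise
    var_x = (2.0 * c / gamma) * (t - 2.0 * F + (1.0 - E ** 2) / (2.0 * gamma))
    cov_vx = c * (1.0 - E) ** 2 / gamma       # covariance of velocity/position noise
    chol = np.linalg.cholesky(np.array([[var_v, cov_vx], [cov_vx, var_x]]))
    z = chol @ rng.standard_normal((2,) + v.shape)
    return E * v + z[0], x + F * v + z[1]

def ubu_step(v, x, h, c, gamma, grad_f, rng):
    v, x = flow_U(v, x, 0.5 * h, c, gamma, rng)
    v, x = flow_B(v, x, h, c, grad_f)   # full-length force kick, evaluated at the midpoint y_n
    v, x = flow_U(v, x, 0.5 * h, c, gamma, rng)
    return v, x
```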

The thesis [52] also describes a method to boost to strong order two any method whose strong order is only one. The boosting is achieved by generating auxiliary Gaussian random variables and may be relevant in our context, where we are interested in the strong order of the integrators. A general technique to investigate the weak and strong order of splitting algorithms for general SDEs may be seen in [3, 4]; those references provide a detailed study of splitting Langevin integrators.

7.4 A lemma

The following result is a variant of [13, Lemma 7] and may be proved in a similar way.

Lemma 7.1.

Assume that the sequence of nonnegative numbers (zn)(z_{n}) is such that for some constants A(0,1)A\in(0,1), B0B\geq 0, C0C\geq 0 and each n=0,1,2,n=0,1,2,\dots

zn+1(1A)2zn2+B+C.z_{n+1}\leq\sqrt{(1-A)^{2}z_{n}^{2}+B}+C.

Then, for n=0,1,2,n=0,1,2,\dots

zn(1A)nz0+BA+CA.z_{n}\leq(1-A)^{n}z_{0}+\sqrt{\frac{B}{A}}+\frac{C}{A}.
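A quick numerical sanity check of the lemma (a sketch, not a substitute for the proof) iterates the recursion at equality, which is the worst case, and compares with the claimed bound:

```python
import math, random

def check_lemma(A, B, C, z0, n_steps=200):
    # Iterate z_{n+1} = sqrt((1-A)^2 z_n^2 + B) + C and compare with the bound
    # (1-A)^n z0 + sqrt(B/A) + C/A at every step.
    z = z0
    for n in range(n_steps + 1):
        bound = (1.0 - A) ** n * z0 + math.sqrt(B / A) + C / A
        assert z <= bound + 1e-12, (n, z, bound)
        z = math.sqrt((1.0 - A) ** 2 * z * z + B) + C

random.seed(0)
for _ in range(1000):
    check_lemma(A=random.uniform(0.01, 0.99), B=random.uniform(0.0, 2.0),
                C=random.uniform(0.0, 2.0), z0=random.uniform(0.0, 10.0))
print("bound verified on all sampled cases")
```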

7.5 The local error for the EE integrator

To analyze ϕh(ξ^n,tn)ψh(ξ^n,tn)\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n}), we have to take the random variable ξ^n=(v^n,x^n)π\widehat{\xi}_{n}=(\widehat{v}_{n},\widehat{x}_{n})\sim\pi^{*} (see Theorem 5.2) as initial data, first to move the solution of the SDE forward in the interval [tn,tn+h][t_{n},t_{n}+h] and then to perform a step of the integrator over the same interval.

Solutions of (1.2) satisfy, for t\geq t_{n}, n=0,1,2,\dots,
v(t)\displaystyle v(t) =\displaystyle= (ttn)v(tn)tnt(ts)cf(x(s))𝑑s+2γctnt(ts)𝑑W(s),\displaystyle\mathcal{E}(t-t_{n})v(t_{n})-\int_{t_{n}}^{t}\mathcal{E}(t-s)c\nabla f(x(s))\,ds+\sqrt{2\gamma c}\int_{t_{n}}^{t}\mathcal{E}(t-s)\,dW(s), (7.4a)
x(t)\displaystyle x(t) =\displaystyle= x(tn)+(ttn)v(tn)tnt(ts)cf(x(s))𝑑s+2γctnt(ts)𝑑W(s),\displaystyle x(t_{n})+\mathcal{F}(t-t_{n})v(t_{n})-\int_{t_{n}}^{t}\mathcal{F}(t-s)c\nabla f(x(s))\,ds+\sqrt{2\gamma c}\int_{t_{n}}^{t}\mathcal{F}(t-s)\,dW(s), (7.4b)

The proof is divided into steps.

First step. From (7.4a) and (4.2a), we find that the vv-component of ϕh(ξ^n,tn)ψh(ξ^n,tn)\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n}), denoted as Δv\Delta_{v}, is

Δv=tntn+1(tn+1s)c(f(x(s))f(x^n))𝑑s;\Delta_{v}=-\int_{t_{n}}^{t_{n+1}}\mathcal{E}(t_{n+1}-s)c\big{(}\nabla f(x(s))-\nabla f(\widehat{x}_{n})\big{)}\,ds;

where x(s)x(s) is the xx-component of the solution of (1.2) that at tnt_{n} takes the value ξ^n\widehat{\xi}_{n} (we shall later need the vv component, to be denoted by v(s)v(s)). Using successively the Cauchy-Schwartz inequality, the bound (t)1\mathcal{E}(t)\leq 1 for t0t\geq 0, the Lipschitz continuity of f(x)\nabla f(x), and (1.2b), we find:

𝔼(Δv2)\displaystyle\mathbb{E}\big{(}\|\Delta_{v}\|^{2}\big{)} \displaystyle\leq c2𝔼(tntn+1(tn+1s)2𝑑stntn+1f(x(s))f(x^n)2𝑑s)\displaystyle c^{2}\mathbb{E}\left(\int_{t_{n}}^{t_{n+1}}\mathcal{E}(t_{n+1}-s)^{2}ds\>\int_{t_{n}}^{t_{n+1}}\|\nabla f(x(s))-\nabla f(\widehat{x}_{n})\|^{2}ds\right)
\displaystyle\leq hc2tntn+1𝔼(f(x(s))f(x^n)2)𝑑s\displaystyle hc^{2}\int_{t_{n}}^{t_{n+1}}\mathbb{E}\big{(}\|\nabla f(x(s))-\nabla f(\widehat{x}_{n})\|^{2}\big{)}ds
\displaystyle\leq hc2L2tntn+1𝔼(x(s)x^n2)𝑑s\displaystyle hc^{2}L^{2}\int_{t_{n}}^{t_{n+1}}\mathbb{E}\big{(}\|x(s)-\widehat{x}_{n}\|^{2}\big{)}ds
\displaystyle\leq hc2L2tntn+1𝔼(tnsv(s)𝑑s2)𝑑s.\displaystyle hc^{2}L^{2}\int_{t_{n}}^{t_{n+1}}\mathbb{E}\big{(}\|\int_{t_{n}}^{s}v(s^{\prime})ds^{\prime}\|^{2}\big{)}ds.

An application of the Cauchy-Schwartz inequality to the inner integral yields

\mathbb{E}\big(\|\Delta_{v}\|^{2}\big)\leq hc^{2}L^{2}\int_{t_{n}}^{t_{n+1}}(s-t_{n})\left(\int_{t_{n}}^{s}\mathbb{E}\big(\|v(s^{\prime})\|^{2}\big)ds^{\prime}\right)ds.

Now, since the initial datum \widehat{\xi}_{n} is distributed according to \pi^{\star}, so is (v(s^{\prime}),x(s^{\prime})) for all s^{\prime}\geq t_{n}. Hence, since the distribution of each of the d scalar components of v is a centered Gaussian with second moment equal to c, we obtain the final bound

\mathbb{E}\big(\|\Delta_{v}\|^{2}\big)\leq hc^{2}L^{2}d\int_{t_{n}}^{t_{n+1}}(s-t_{n})\left(\int_{t_{n}}^{s}c\,ds^{\prime}\right)ds=\frac{h^{4}}{3}c^{3}L^{2}d.

Second step. Turning now to the xx component of ϕh(ξ^n,tn)ψh(ξ^n,tn)\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n}), we have

Δx=tntn+1(tn+1s)c(f(x(s))f(x^n))𝑑s,\Delta_{x}=-\int_{t_{n}}^{t_{n+1}}\mathcal{F}(t_{n+1}-s)c\big{(}\nabla f(x(s))-\nabla f(\widehat{x}_{n})\big{)}\,ds,

and, applying the Cauchy-Schwartz inequality and the bound (t)t\mathcal{F}(t)\leq t,

\mathbb{E}\big(\|\Delta_{x}\|^{2}\big) \leq c^{2}\,\mathbb{E}\left(\int_{t_{n}}^{t_{n+1}}\mathcal{F}(t_{n+1}-s)^{2}\,ds\int_{t_{n}}^{t_{n+1}}\|\nabla f(x(s))-\nabla f(\widehat{x}_{n})\|^{2}\,ds\right)
\leq \frac{h^{3}}{3}c^{2}\int_{t_{n}}^{t_{n+1}}\mathbb{E}\big(\|\nabla f(x(s))-\nabla f(\widehat{x}_{n})\|^{2}\big)\,ds.

Therefore, by proceeding with the last integral as we did above, we find

𝔼(Δx2)h69c3L2d.\mathbb{E}\big{(}\|\Delta_{x}\|^{2}\big{)}\leq\frac{h^{6}}{9}c^{3}L^{2}d.

Third step. The preceding analysis is valid for all values of the parameters. We now assume that tt is measured in units for which γ=2\gamma=2 and that hh is chosen 1\leq 1. Then, combining the estimates for the vv and xx components:

𝔼(ϕh(ξ^n,tn)ψh(ξ^n,tn)2)h43c3L2d+h69c3L2d49h4c3L2d.\mathbb{E}\big{(}\|\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n})\|^{2}\big{)}\leq\frac{h^{4}}{3}c^{3}L^{2}d+\frac{h^{6}}{9}c^{3}L^{2}d\leq\frac{4}{9}h^{4}c^{3}L^{2}d. (7.5)

For the matrix P^\widehat{P} in (3.9) the constant pmaxp_{\max} in (2.2) is easily computed as (3+5)/2(3+\sqrt{5})/2 and we finally have:

ϕh(ξ^n,tn)ψh(ξ^n,tn)L2,P26+259h4c3L2d.\|\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n})\|^{2}_{L^{2},P}\leq\frac{6+2\sqrt{5}}{9}h^{4}c^{3}L^{2}d.

7.6 The local error for UBU

We first provide bounds for the local error for UBU under Assumptions 2.12.3 that ensure strong order two. As in the previous Subsection we have to take (v^n,x^n)(\widehat{v}_{n},\widehat{x}_{n}) as a starting point for the SDE solution and the integrator. As with the EE integrator, v(t)v(t) and x(t)x(t) denote the solution of (1.2) that starts at tnt_{n} from (v^n,x^n)(\widehat{v}_{n},\widehat{x}_{n}). The analysis is now substantially more involved as the Ito-Taylor expansions have to be taken to higher order.

First step. We begin by estimating the difference Δy\Delta_{y} between x(tn+h/2)x(t_{n}+h/2) and the point yny_{n} where the integrator evaluates the force f-\nabla f (see (4.3c)). By using (7.4b) and (4.3c), we find

Δy=x(tn+h/2)yn=tntn+1/2(tn+1/2s)cf(x(s))𝑑s.\Delta_{y}=x(t_{n}+h/2)-y_{n}=-\int_{t_{n}}^{t_{n+1/2}}\mathcal{F}(t_{n+1/2}-s)c\nabla f(x(s))\,ds.

We apply the Cauchy-Schwartz inequality (in a similar way to what we did at Step 2 in the preceding subsection) to get

\mathbb{E}\big(\|\Delta_{y}\|^{2}\big)\leq\frac{h^{3}}{24}c^{2}\int_{t_{n}}^{t_{n+1/2}}\mathbb{E}\big(\|\nabla f(x(s))\|^{2}\big)\,ds.

As proved in [11, Lemma 2], when x¯\bar{x} has the distribution π\pi^{\star},

𝔼(f(x¯)2)Ld\mathbb{E}\big{(}\|\nabla f(\bar{x})\|^{2}\big{)}\leq Ld (7.6)

and, accordingly,

𝔼(Δy2)h448c2Ld.\mathbb{E}\big{(}\|\Delta_{y}\|^{2}\big{)}\leq\frac{h^{4}}{48}c^{2}Ld. (7.7)

Second step. From (7.4a) and (4.3a), the vv-component of ϕh(ξ^n,tn)ψh(ξ^n,tn)\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n}) is found to be

Δv=tntn+1(tn+1s)cf(x(s))𝑑s+h(h/2)cf(yn);\Delta_{v}=-\int_{t_{n}}^{t_{n+1}}\mathcal{E}(t_{n+1}-s)c\nabla f(x(s))\,ds+h\mathcal{E}(h/2)c\nabla f(y_{n}); (7.8)

thus UBU replaces the integral by a midpoint-rule approximation. We Ito-Taylor expand (see e.g. [26, 4]) the integral around tn+1/2t_{n+1/2} as follows. Denote by χ(s)\chi(s) the (differentiable) integrand, i.e.

χ(s)=(tn+1s)cf(x(s)).\chi(s)=\mathcal{E}(t_{n+1}-s)c\nabla f(x(s)).

Then (the dot indicates differentiation),

\int_{t_{n}}^{t_{n+1}}\chi(s)ds=\int_{t_{n}}^{t_{n+1}}\chi(t_{n+1/2})ds+\int_{t_{n}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}\dot{\chi}(s^{\prime})ds^{\prime}, (7.9)

and taking the expansion one step further, we find

tntn+1χ(s)𝑑s\displaystyle\int_{t_{n}}^{t_{n+1}}\chi(s)ds =\displaystyle= hχ(tn+1/2)+tn+1/2tn+1𝑑stn+1/2s(χ˙(s)χ˙(2tn+1/2s))𝑑s\displaystyle h\chi(t_{n+1/2})+\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}\big{(}\dot{\chi}(s^{\prime})-\dot{\chi}(2t_{n+1/2}-s^{\prime})\big{)}ds^{\prime} (7.10)
=\displaystyle= hχ(tn+1/2)+tn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ss𝑑χ˙(s′′).\displaystyle h\chi(t_{n+1/2})+\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}d\dot{\chi}(s^{\prime\prime}).

We now replace dχ˙(s′′)d\dot{\chi}(s^{\prime\prime}) by its expression given by Ito’s lemma. (While χ\chi is differentiable, χ˙\dot{\chi} is a diffusion process.) There is no Ito correction because χ˙\dot{\chi} is linear in vv and there is no forcing noise in the xx equation (1.2b). After computing dχ˙(s′′)d\dot{\chi}(s^{\prime\prime}) and substituting back in (7.8), we have

Δv=hc(h/2)(f(x(tn+1/2))f(yn))+I1+I2+I3+I4+I5,\Delta_{v}=-hc\mathcal{E}(h/2)\big{(}\nabla f(x(t_{n+1/2}))-\nabla f(y_{n})\big{)}+I_{1}+I_{2}+I_{3}+I_{4}+I_{5}, (7.11)

with

I1\displaystyle I_{1} =\displaystyle= tn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ss(tn+1s′′)γ2cf(x(s′′))𝑑s′′,\displaystyle-\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\mathcal{E}(t_{n+1}-s^{\prime\prime})\gamma^{2}c\nabla f(x(s^{\prime\prime}))ds^{\prime\prime},
I2\displaystyle I_{2} =\displaystyle= tn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ss(tn+1s′′)γc(x(s′′))v(s′′)𝑑s′′,\displaystyle-\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\mathcal{E}(t_{n+1}-s^{\prime\prime})\gamma c\mathcal{H}(x(s^{\prime\prime}))v(s^{\prime\prime})ds^{\prime\prime},
I3\displaystyle I_{3} =\displaystyle= tn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ss(tn+1s′′)c(x(s′′))[v(s′′),v(s′′)]𝑑s′′,\displaystyle-\int_{t_{n+1/2}}^{t_{n+1}}\!\!\!\!ds\int_{t_{n+1/2}}^{s}\!\!\!\!\!ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\!\!\!\!\mathcal{E}(t_{n+1}-s^{\prime\prime})c\mathcal{H}^{\prime}(x(s^{\prime\prime}))[v(s^{\prime\prime}),v(s^{\prime\prime})]ds^{\prime\prime},
I4\displaystyle I_{4} =\displaystyle= tn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ss(tn+1s′′)c2(x(s′′))f(x(s′′))𝑑s′′,\displaystyle\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\mathcal{E}(t_{n+1}-s^{\prime\prime})c^{2}\mathcal{H}(x(s^{\prime\prime}))\nabla f(x(s^{\prime\prime}))ds^{\prime\prime},
I5\displaystyle I_{5} =\displaystyle= 2γctn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ss(tn+1s′′)c(x(s′′))𝑑W(s′′).\displaystyle-\sqrt{2\gamma c}\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\mathcal{E}(t_{n+1}-s^{\prime\prime})c\mathcal{H}(x(s^{\prime\prime}))dW(s^{\prime\prime}).

We now successively bound each term in the right hand-side of (7.11). From (7.7) and the Lipschitz continuity of f(x)\nabla f(x)

𝔼(hc(h/2)(f(x(tn+1/2))f(yn))2)h648c4L3d.\mathbb{E}\Big{(}\|hc\mathcal{E}(h/2)\big{(}\nabla f(x(t_{n+1/2}))-\nabla f(y_{n})\big{)}\|^{2}\Big{)}\leq\frac{h^{6}}{48}c^{4}L^{3}d.

The integral I1I_{1} may be bounded as follows:

𝔼(I12)\displaystyle\mathbb{E}\big{(}\|I_{1}\|^{2}\big{)} \displaystyle\leq γ4c2𝔼[(tn+1/2tn+1dstn+1/2sds2tn+1/2ss(tn+1s′′)2ds′′)\displaystyle\gamma^{4}c^{2}\mathbb{E}\left[\left(\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\mathcal{E}(t_{n+1}-s^{\prime\prime})^{2}ds^{\prime\prime}\right)\right.
(tn+1/2tn+1dstn+1/2sds2tn+1/2ssf(x(s′′))2ds′′)]\displaystyle\left.\left(\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\|\nabla f(x(s^{\prime\prime}))\|^{2}ds^{\prime\prime}\right)\right]

Now using the fact that (t)1\mathcal{E}(t)\leq 1, for t0t\geq 0, we can bound the first term in the equation above by observing that

tn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ss𝑑s′′=h324\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}ds^{\prime\prime}=\frac{h^{3}}{24}

and then take into account (7.6), to get

𝔼(I12)h6576γ4c2Ld.\mathbb{E}\big{(}\|I_{1}\|^{2}\big{)}\leq\frac{h^{6}}{576}\gamma^{4}c^{2}Ld.
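For completeness, the value h^{3}/24 used above follows from an elementary computation:

\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}ds^{\prime\prime}=2(s^{\prime}-t_{n+1/2}),\qquad\int_{t_{n+1/2}}^{s}2(s^{\prime}-t_{n+1/2})\,ds^{\prime}=(s-t_{n+1/2})^{2},\qquad\int_{t_{n+1/2}}^{t_{n+1}}(s-t_{n+1/2})^{2}\,ds=\frac{(h/2)^{3}}{3}=\frac{h^{3}}{24}.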

A bound for I2I_{2} may be derived similarly, by using, instead of (7.6),

𝔼((x¯)v¯2)L2𝔼(v¯2)=L2cd.\mathbb{E}\big{(}\|\mathcal{H}(\bar{x})\bar{v}\|^{2}\big{)}\leq L^{2}\mathbb{E}\big{(}\|\bar{v}\|^{2}\big{)}=L^{2}cd.

((v¯,x¯)π(\bar{v},\bar{x})\sim\pi^{\star}). Then

𝔼(I22)h6576γ2c3L2d.\mathbb{E}\big{(}\|I_{2}\|^{2}\big{)}\leq\frac{h^{6}}{576}\gamma^{2}c^{3}L^{2}d.

For I3I_{3} we use (each scalar component of v¯\bar{v} is a centered Gaussian with variance cc and fourth moment 3c23c^{2})

𝔼((x¯)[v¯,v¯]2)=L12𝔼(v¯4)3L12c2d,\mathbb{E}\big{(}\|\mathcal{H}^{\prime}(\bar{x})[\bar{v},\bar{v}]\|^{2}\big{)}=L_{1}^{2}\mathbb{E}\big{(}\|\bar{v}\|^{4}\big{)}\leq 3L_{1}^{2}c^{2}d,

that leads to

𝔼(I32)3h6576c4L12d.\mathbb{E}\big{(}\|I_{3}\|^{2}\big{)}\leq\frac{3h^{6}}{576}c^{4}L_{1}^{2}d.

Turning now to I4I_{4}, a new application of (7.6) gives

𝔼((x¯)f(x¯)2)L2𝔼(f(x¯)2)L3d,\mathbb{E}\big{(}\|\mathcal{H}(\bar{x})\nabla f(\bar{x})\|^{2}\big{)}\leq L^{2}\mathbb{E}\big{(}\|\nabla f(\bar{x})\|^{2}\big{)}\leq L^{3}d,

and then

𝔼(I42)h6576c4L3d.\mathbb{E}\big{(}\|I_{4}\|^{2}\big{)}\leq\frac{h^{6}}{576}c^{4}L^{3}d.

Using the Cauchy-Schwartz inequality for the inner product associated with the integration on ss and ss^{\prime}, we have

𝔼(I52)2γc3𝔼[(tn+1/2tn+1dstn+1/2sds)×\displaystyle\mathbb{E}\big{(}\|I_{5}\|^{2}\big{)}\leq 2\gamma c^{3}\mathbb{E}\left[\left(\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\right)\right.\times
(tn+1/2tn+1dstn+1/2s2tn+1/2ss(tn+1s′′)(x(s′′))dW(s′′)2ds)]\displaystyle\left.\left(\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}\left\|\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\mathcal{E}(t_{n+1}-s^{\prime\prime})\mathcal{H}(x(s^{\prime\prime}))dW(s^{\prime\prime})\right\|^{2}ds^{\prime}\right)\right]

and, with the help of the Ito isometry, since the squared Frobenius norm of \mathcal{H} is bounded by L^{2}d,

𝔼(I52)2γc3h28tn+1/2tn+1𝑑stn+1/2s𝑑s2tn+1/2ssL2𝑑ds′′=h596γc3L2d.\mathbb{E}\big{(}\|I_{5}\|^{2}\big{)}\leq 2\gamma c^{3}\frac{h^{2}}{8}\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}L^{2}d\,ds^{\prime\prime}=\frac{h^{5}}{96}\gamma c^{3}L^{2}d.

We have now bounded each term in the right-hand side of (7.11); the dominant term is I_{5}, with (\mathbb{E}(\|I_{5}\|^{2}))^{1/2}=\mathcal{O}(h^{5/2}). In Assumption 5.1 we have to split \phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n}) into two parts, \alpha and \beta, with \beta of higher order; for the v component we then define \alpha_{v}=I_{5} and \beta_{v}=\Delta_{v}-\alpha_{v}. The bounds obtained above yield

(𝔼(αv2))1/2624h5/2γ1/2c3/2Ld1/2\big{(}\mathbb{E}(\|\alpha_{v}\|^{2})\big{)}^{1/2}\leq\frac{\sqrt{6}}{24}h^{5/2}\gamma^{1/2}c^{3/2}Ld^{1/2}

and

(𝔼(βv2))1/2h324((1+23)c2L3/2+γ2cL1/2+γc3/2L+3c2L1)d1/2.\big{(}\mathbb{E}(\|\beta_{v}\|^{2})\big{)}^{1/2}\leq\frac{h^{3}}{24}\Big{(}(1+2\sqrt{3})c^{2}L^{3/2}+\gamma^{2}cL^{1/2}+\gamma c^{3/2}L+\sqrt{3}c^{2}L_{1}\Big{)}d^{1/2}.

Third step. From (7.4b) and (4.3b), the xx-component of ϕh(ξ^n,tn)ψh(ξ^n,tn)\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n}) is given by

Δx=tntn+1(tn+1s)cf(x(s))𝑑s+h(h/2)cf(yn);\Delta_{x}=-\int_{t_{n}}^{t_{n+1}}\mathcal{F}(t_{n+1}-s)c\nabla f(x(s))\,ds+h\mathcal{F}(h/2)c\nabla f(y_{n}); (7.12)

again UBU replaces the integral by a midpoint approximation. By expanding the integrand by means of the fundamental theorem of calculus as

(tn+1s)cf(x(s))=(h/2)cf(x(tn+1/2))\displaystyle\mathcal{F}(t_{n+1}-s)c\nabla f(x(s))=\mathcal{F}(h/2)c\nabla f(x(t_{n+1/2}))
+tn+1/2s((tn+1s)c(x(s))v(s)(tn+1s)cf(x(s)))𝑑s,\displaystyle\qquad+\int_{t_{n+1/2}}^{s}\Big{(}\mathcal{F}(t_{n+1}-s^{\prime})c\mathcal{H}(x(s^{\prime}))v(s^{\prime})-\mathcal{E}(t_{n+1}-s^{\prime})c\nabla f(x(s^{\prime}))\Big{)}ds^{\prime},

(7.12) becomes

Δx=hc(h/2)(f(x(tn+1/2))f(yn))+I6+I7\Delta_{x}=-hc\mathcal{F}(h/2)\big{(}\nabla f(x(t_{n+1/2}))-\nabla f(y_{n})\big{)}+I_{6}+I_{7} (7.13)

with

I6\displaystyle I_{6} =\displaystyle= tntn+1𝑑stn+1/2s(tn+1s)c(x(s))v(s)𝑑s,\displaystyle-\int_{t_{n}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}\mathcal{F}(t_{n+1}-s^{\prime})c\mathcal{H}(x(s^{\prime}))v(s^{\prime})ds^{\prime},
I7\displaystyle I_{7} =\displaystyle= tntn+1𝑑stn+1/2s(tn+1s)cf(x(s))𝑑s.\displaystyle\int_{t_{n}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}\mathcal{E}(t_{n+1}-s^{\prime})c\nabla f(x(s^{\prime}))ds^{\prime}.

Now, in (7.13), taking (7.7) into account and recalling that (t)t\mathcal{F}(t)\leq t,

𝔼(hc(h/2)(f(x(tn+1/2))f(yn))2)h8192c4L3d.\mathbb{E}\Big{(}\|hc\mathcal{F}(h/2)\big{(}\nabla f(x(t_{n+1/2}))-\nabla f(y_{n})\big{)}\|^{2}\Big{)}\leq\frac{h^{8}}{192}c^{4}L^{3}d.

Next, for I6I_{6} (it is necessary to put absolute value bars around the inner integrals because ss could be <tn+1/2<t_{n+1/2}),

𝔼(I62)\displaystyle\mathbb{E}\big{(}\|I_{6}\|^{2}\big{)} \displaystyle\leq c2𝔼[(tntn+1ds|tn+1/2s(tn+1s)2ds|)×\displaystyle c^{2}\mathbb{E}\left[\left(\int_{t_{n}}^{t_{n+1}}ds\left|\int_{t_{n+1/2}}^{s}\mathcal{F}(t_{n+1}-s^{\prime})^{2}ds^{\prime}\right|\right)\right.\times
(tntn+1ds|tn+1/2s(x(s))v(s)2ds|)]\displaystyle\qquad\qquad\qquad\left.\left(\int_{t_{n}}^{t_{n+1}}ds\left|\int_{t_{n+1/2}}^{s}\|\mathcal{H}(x(s^{\prime}))v(s^{\prime})\|^{2}ds^{\prime}\right|\right)\right]
\displaystyle\leq c27h496×h24L2cd=7h6384c3L2d.\displaystyle c^{2}\frac{7h^{4}}{96}\times\frac{h^{2}}{4}L^{2}cd=\frac{7h^{6}}{384}c^{3}L^{2}d.

The integral I7I_{7} may be rewritten as

I_{7}=\int_{t_{n+1/2}}^{t_{n+1}}ds\int_{t_{n+1/2}}^{s}ds^{\prime}\int_{2t_{n+1/2}-s^{\prime}}^{s^{\prime}}\frac{d}{ds^{\prime\prime}}\Big(\mathcal{E}(t_{n+1}-s^{\prime\prime})c\nabla f(x(s^{\prime\prime}))\Big)ds^{\prime\prime},

and, after performing the differentiation in the integrand, I7=γ1(I1+I2)I_{7}=-\gamma^{-1}(I_{1}+I_{2}), so that we may use the available bounds for I1I_{1} and I2I_{2}.

Taking all the partial bounds into (7.13),

(𝔼(Δx2))1/2h324(3hc2L3/2+(422+1)c3/2L+γcL1/2)d1/2.\big{(}\mathbb{E}(\|\Delta_{x}\|^{2})\big{)}^{1/2}\leq\frac{h^{3}}{24}\Big{(}\sqrt{3}hc^{2}L^{3/2}+(\frac{\sqrt{42}}{2}+1)c^{3/2}L+\gamma cL^{1/2}\Big{)}d^{1/2}.

As we see, for the xx component there is no 𝒪(h5/2)\mathcal{O}(h^{5/2}) contribution and therefore we take αx=0\alpha_{x}=0 and βx=Δx\beta_{x}=\Delta_{x}.

Fourth step. With a view to checking at Step 5 condition (5.1) in Assumption 5.1, we estimate |\mathbb{E}\big(\langle\widetilde{v}_{n+1}-v_{n+1},\alpha_{v}\rangle\big)|; here \langle\cdot,\cdot\rangle is the standard inner product in \mathbb{R}^{d}, v_{n+1} is the velocity component of \xi_{n+1} and \widetilde{v}_{n+1} denotes the velocity component of a numerical step starting from \widehat{\xi}_{n}=(\widehat{v}_{n},\widehat{x}_{n}). (This should not be confused with \widehat{v}_{n+1}, the v-component of the random variable \widehat{\xi}_{n+1} to be introduced at the next time level in the construction leading to Theorem 5.2.)

Since, conditional on v^n\widehat{v}_{n}, vnv_{n}, the expectation of the stochastic integral αv=I5\alpha_{v}=I_{5} is zero, we may write

|𝔼(v~n+1vn+1,αv)|\displaystyle\left|\mathbb{E}\big{(}\langle\widetilde{v}_{n+1}-v_{n+1},\alpha_{v}\rangle\big{)}\right| =\displaystyle= |𝔼(v~n+1v^nvn+1+vn,αv)|\displaystyle\left|\mathbb{E}\big{(}\langle\widetilde{v}_{n+1}-\widehat{v}_{n}-v_{n+1}+v_{n},\alpha_{v}\rangle\big{)}\right|
\displaystyle\leq (𝔼(v~n+1v^nvn+1+vn2))1/2(𝔼(αv2))1/2.\displaystyle\Big{(}\mathbb{E}(\|\widetilde{v}_{n+1}-\widehat{v}_{n}-v_{n+1}+v_{n}\|^{2})\Big{)}^{1/2}\Big{(}\mathbb{E}(\|\alpha_{v}\|^{2})\Big{)}^{1/2}.

Now, from (4.3a),

v~n+1v^nvn+1+vn=((h)1)(v^nvn)h(h/2)c(f(y~n)f(yn))\widetilde{v}_{n+1}-\widehat{v}_{n}-v_{n+1}+v_{n}=(\mathcal{E}(h)-1)(\widehat{v}_{n}-v_{n})-h\mathcal{E}(h/2)c(\nabla f(\widetilde{y}_{n})-\nabla f(y_{n}))

with (see (4.3c))

y~n=x^n+(h/2)v^n+2γctntn+1/2(tn+1/2s)𝑑W(s),\widetilde{y}_{n}=\widehat{x}_{n}+\mathcal{F}(h/2)\widehat{v}_{n}+\sqrt{2\gamma c}\int_{t_{n}}^{t_{n+1/2}}\mathcal{F}(t_{n+1/2}-s)dW(s),

and thus, since 1(h)γh1-\mathcal{E}(h)\leq\gamma h and (h/2)1\mathcal{E}(h/2)\leq 1,

(𝔼(v~n+1v^nvn+1+vn2))1/2\displaystyle\Big{(}\mathbb{E}(\|\widetilde{v}_{n+1}-\widehat{v}_{n}-v_{n+1}+v_{n}\|^{2})\Big{)}^{1/2}
hγ(𝔼(v^nvn2))1/2+hcL(𝔼(y~nyn2))1/2.\displaystyle\qquad\qquad\qquad\leq h\gamma\Big{(}\mathbb{E}(\|\widehat{v}_{n}-v_{n}\|^{2})\Big{)}^{1/2}+hcL\Big{(}\mathbb{E}(\|\widetilde{y}_{n}-y_{n}\|^{2})\Big{)}^{1/2}.

Taking into account (4.3c) and the definition of y~n\widetilde{y}_{n}

(𝔼y~nyn2)1/2(𝔼(x^nxn2))1/2+h2(𝔼(v^nvn2))1/2,\Big{(}\mathbb{E}\|\widetilde{y}_{n}-y_{n}\|^{2}\Big{)}^{1/2}\leq\Big{(}\mathbb{E}(\|\widehat{x}_{n}-x_{n}\|^{2})\Big{)}^{1/2}+\frac{h}{2}\Big{(}\mathbb{E}(\|\widehat{v}_{n}-v_{n}\|^{2})\Big{)}^{1/2},

and we conclude that |𝔼(v~n+1vn+1,αv)||\mathbb{E}\big{(}\langle\widetilde{v}_{n+1}-v_{n+1},\alpha_{v}\rangle\big{)}| is bounded above by

h((γ+h2cL)(𝔼(v^nvn2))1/2+cL(𝔼(x^nxn2))1/2)(𝔼(αv2))1/2.h\left(\Big{(}\gamma+\frac{h}{2}cL\Big{)}\Big{(}\mathbb{E}(\|\widehat{v}_{n}-v_{n}\|^{2})\Big{)}^{1/2}+cL\Big{(}\mathbb{E}(\|\widehat{x}_{n}-x_{n}\|^{2})\Big{)}^{1/2}\right)\Big{(}\mathbb{E}(\|\alpha_{v}\|^{2})\Big{)}^{1/2}. (7.14)

Fifth step. The preceding analysis holds for all values of the parameters. We now focus on the case where \gamma=2 and h\leq 2, as in the statement of Theorem 6.2. To complete the proof it is enough to translate the Euclidean norm bounds in Steps 1–4 into P-norm bounds.

To establish (5.1), we note that, because \alpha_{x}=0,

\left|\Big\langle\psi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\xi_{n},t_{n}),\alpha_{h}(\widehat{\xi}_{n},t_{n})\Big\rangle_{L^{2},P_{h}}\right|=\big|\mathbb{E}\big(\langle\widetilde{v}_{n+1}-v_{n+1},\alpha_{v}\rangle\big)\big|.

The right-hand side of this expression was bounded in (7.14) in terms of \big(\mathbb{E}(\|\widehat{v}_{n}-v_{n}\|^{2})\big)^{1/2}, \big(\mathbb{E}(\|\widehat{x}_{n}-x_{n}\|^{2})\big)^{1/2} and \big(\mathbb{E}(\|\alpha_{v}\|^{2})\big)^{1/2}. We now take into account (2.2) to bound \big(\mathbb{E}(\|\widehat{v}_{n}-v_{n}\|^{2})\big)^{1/2} and \big(\mathbb{E}(\|\widehat{x}_{n}-x_{n}\|^{2})\big)^{1/2} by multiples of \|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}} and to bound \big(\mathbb{E}(\|\alpha_{v}\|^{2})\big)^{1/2} by a multiple of \|\alpha_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}}.
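As a minimal illustration of this translation (the symbols c_{P}, C_{P} are introduced here purely for exposition; the actual constants are those supplied by (2.2)), suppose (2.2) yields c_{P}\|\zeta\|^{2}\leq\|\zeta\|_{P}^{2}\leq C_{P}\|\zeta\|^{2} for all vectors \zeta. Then, since the v and x components are sub-vectors of \xi=(v,x) and \alpha_{h}=(\alpha_{v},0),

\Big(\mathbb{E}(\|\widehat{v}_{n}-v_{n}\|^{2})\Big)^{1/2}\leq c_{P}^{-1/2}\,\|\widehat{\xi}_{n}-\xi_{n}\|_{L^{2},P_{h}},\qquad \Big(\mathbb{E}(\|\alpha_{v}\|^{2})\Big)^{1/2}\leq c_{P}^{-1/2}\,\|\alpha_{h}(\widehat{\xi}_{n},t_{n})\|_{L^{2},P_{h}},

and similarly for the x component; substituting these bounds into (7.14) gives an estimate of the form required in (5.1).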

The estimates in (5.2) are established in a similar way.

7.7 The local error for UBU without extra smoothness

Let us now assume that Assumptions 2.1 and 2.2 hold but Assumption 2.3 does not. Then the strong order of UBU is one. The bound for \mathbb{E}(\|\Delta_{x}\|^{2}) derived in the third step of the preceding subsection remains valid (note that it does not involve the constant L_{1}). However, for the component \Delta_{y} of the local error, the Itô–Taylor expansion leading to (7.10) cannot be taken beyond (7.9), because now d\dot{\chi} does not make sense. After replacing \dot{\chi}(s) by its expression in terms of f, one obtains the bound

\big(\mathbb{E}(\|\Delta_{y}\|^{2})\big)^{1/2}\leq\frac{h^{2}}{4}\gamma c^{3/2}Ld^{1/2}+\frac{h^{2}}{4}\gamma cL^{1/2}d^{1/2}+\frac{\sqrt{3}}{12}h^{3}c^{2}L^{3/2}d^{1/2}.

Then, after combining the x and v contributions and setting \gamma=2, we have the bound

\Big(\mathbb{E}\big(\|\phi_{h}(\widehat{\xi}_{n},t_{n})-\psi_{h}(\widehat{\xi}_{n},t_{n})\|^{2}\big)\Big)^{1/2} \leq \frac{h^{2}}{4}\left[1+\left(\frac{1}{6}+\frac{\sqrt{42}}{12}\right)h\right]c^{3/2}Ld^{1/2}
\qquad+\frac{h^{2}}{2}\left(1+\frac{h}{6}\right)cL^{1/2}d^{1/2}
\qquad+\frac{\sqrt{3}}{12}h^{3}\left(1+\frac{h}{2}\right)c^{2}L^{3/2}d^{1/2}.

Recall that, for contractivity, the integrator has to be operated with cL bounded above, so that the combinations c^{2}L^{3/2}, c^{3/2}L and cL^{1/2} are all \mathcal{O}(L^{-1/2}) as L increases. The leading terms in the UBU bound in the display are (h^{2}/4)\,c^{3/2}Ld^{1/2}+(h^{2}/2)\,cL^{1/2}d^{1/2}. For comparison, we note from (7.5) that for E the corresponding leading term in the bound is (\sqrt{3}/3)\,h^{2}c^{3/2}Ld^{1/2}. The conclusion is that, under Assumptions 2.1 and 2.2, the properties of UBU are very close to those of E.
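To make the comparison concrete, take for instance c=1/L (one admissible normalization with cL bounded; this particular value is chosen here only for illustration). The leading terms above then become

\frac{h^{2}}{4}c^{3/2}Ld^{1/2}+\frac{h^{2}}{2}cL^{1/2}d^{1/2}=\frac{3}{4}\,h^{2}L^{-1/2}d^{1/2}\quad\text{(UBU)},\qquad \frac{\sqrt{3}}{3}h^{2}c^{3/2}Ld^{1/2}=\frac{\sqrt{3}}{3}\,h^{2}L^{-1/2}d^{1/2}\approx 0.58\,h^{2}L^{-1/2}d^{1/2}\quad\text{(E)},

so that the two bounds have the same dependence on h, L and d and comparable constants.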


Acknowledgement. J.M.S.-S. has been supported by project PID2019-104927GB-C21 (AEI/FEDER, UE). K.C.Z. acknowledges support from a Leverhulme Research Fellowship (RF/2020-310), the EPSRC grant EP/V006177/1, and the Alan Turing Institute.

References

  • [1] A. Abdulle, G. Vilmart, and K. C. Zygalakis. High order numerical approximation of the invariant measure of ergodic SDEs. SIAM Journal on Numerical Analysis, 52(4):1600–1622, 2014.
  • [2] A. Abdulle, G. Vilmart, and K. C. Zygalakis. Long time accuracy of Lie–Trotter splitting methods for Langevin dynamics. SIAM Journal on Numerical Analysis, 53(1):1–16, 2015.
  • [3] A. Alamo and J. M. Sanz-Serna. A technique for studying strong and weak local errors of splitting stochastic integrators. SIAM Journal on Numerical Analysis, 54(6):3239–3257, 2016.
  • [4] A. Alamo and J. M. Sanz-Serna. Word combinatorics for stochastic differential equations: Splitting integrator. Communications on Pure & Applied Analysis, 18:2163, 2019.
  • [5] N. Bou-Rabee and J. M. Sanz-Serna. Randomized Hamiltonian Monte Carlo. The Annals of Applied Probability, 27(4):2159 – 2194, 2017.
  • [6] N. Bou-Rabee and J. M. Sanz-Serna. Geometric integrators and the Hamiltonian Monte Carlo method. Acta Numerica, 27:113–206, 2018.
  • [7] S. Brooks, A. Gelman, G. Jones, and X. L. Meng. Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. CRC Press, 2011.
  • [8] E. Buckwar, M. Tamborrino, and I. Tubikanec. Spectral density-based and measure-preserving ABC for partially observed diffusion processes. An illustration on Hamiltonian SDEs. Statistics and Computing, 30(3):627–648, 2020.
  • [9] Y. Cao, J. Lu, and L. Wang. Complexity of randomized algorithms for underdamped Langevin dynamics. arXiv:2003.09906, 2020.
  • [10] X. Cheng, N. S. Chatterji, P. L. Bartlett, and M. I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Proceedings of the 2018 Conference on Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 300–323, 2018.
  • [11] A. S. Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 678–689, 2017.
  • [12] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
  • [13] A. S. Dalalyan and A. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019.
  • [14] A. S. Dalalyan, A. Karagulyan, and L. Riou-Durand. Bounding the error of discretized Langevin algorithms for non-strongly log-concave targets. arXiv:1906.08530, 2019.
  • [15] A. S. Dalalyan and L. Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. Bernoulli, 26(3):1956 – 1988, 2020.
  • [16] A. Durmus, S. Majewski, and B. Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
  • [17] A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551 – 1587, 2017.
  • [18] A. Durmus and E. Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854 – 2882, 2019.
  • [19] A. Eberle, A. Guillin, and R. Zimmer. Couplings and quantitative contraction rates for Langevin dynamics. The Annals of Probability, 47(4):1982 – 2010, 2019.
  • [20] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado. Analysis of optimization algorithms via integral quadratic constraints: nonstrongly convex problems. SIAM Journal on Optimization, 28(3):2654–2689, 2018.
  • [21] J. Foster, T. Lyons, and H. Oberhauser. The shifted ODE method for underdamped Langevin MCMC. arXiv:2101.03446, 2021.
  • [22] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
  • [23] E. Hairer, S. P. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I: Nonstiff Problems. Springer, Berlin, second edition, 2000.
  • [24] M. Hochbruck and A. Ostermann. Exponential integrators. Acta Numerica, 19:209–286, 2010.
  • [25] L. Hodgkinson, R. Salomone, and F. Roosta. Implicit Langevin algorithms for sampling from log-concave densities. Journal of Machine Learning Research, 22(136):1–30, 2021.
  • [26] P. E. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer-Verlag, Berlin, 1992.
  • [27] A. Laurent and G. Vilmart. Order conditions for sampling the invariant measure of ergodic stochastic differential equations on manifolds. arXiv:2006.09743, 2020.
  • [28] B. Leimkuhler and C. Matthews. Rational construction of stochastic numerical methods for molecular sampling. Applied Mathematics Research eXpress, 2013(1):34–56, 2012.
  • [29] B. Leimkuhler and C. Matthews. Molecular Dynamics: With Deterministic and Stochastic Numerical Methods. Interdisciplinary Applied Mathematics. Springer, 2015.
  • [30] B. Leimkuhler and X. Shang. Adaptive thermostats for noisy gradient systems. SIAM Journal on Scientific Computing, 38(2):A712–A736, 2016.
  • [31] T. Lelièvre and G. Stoltz. Partial differential equations and stochastic methods in molecular dynamics. Acta Numerica, 25:681–880, 2016.
  • [32] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
  • [33] M. B. Majka, A. Mijatović, and L. Szpruch. Nonasymptotic bounds for sampling algorithms without log-concavity. The Annals of Applied Probability, 30(4):1534 – 1581, 2020.
  • [34] J. C. Mattingly, A. M. Stuart, and M. V. Tretyakov. Convergence of numerical time-averaging and stationary measures via Poisson equations. SIAM Journal on Numerical Analysis, 48(2):552–577, 2010.
  • [35] R. I. McLachlan and G. R. W. Quispel. Splitting methods. Acta Numerica, 11:341–434, 2002.
  • [36] G. N. Milstein and M. V. Tretyakov. Stochastic Numerics for Mathematical Physics. Springer-Verlag, Berlin, 2004.
  • [37] G.N. Milstein and M.V. Tretyakov. Computing ergodic limits for Langevin equations. Physica D, 229(1):81 – 95, 2007.
  • [38] P. Monmarché. Almost sure contraction for diffusions on \mathbb{R}^{d}. Application to generalised Langevin diffusions. arXiv:2009.10828, 2020.
  • [39] P. Monmarché. High-dimensional MCMC with a standard splitting scheme for the underdamped Langevin diffusion. Electronic Journal of Statistics, 15(2):4117–4166, 2021.
  • [40] W. Mou, Y. A. Ma, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan. High-order Langevin diffusion yields an accelerated MCMC algorithm. Journal of Machine Learning Research, 22(42):1–41, 2021.
  • [41] M. Pereyra, L. V. Mieles, and K. C. Zygalakis. Accelerating proximal Markov chain Monte Carlo by using an explicit stabilized method. SIAM Journal on Imaging Sciences, 13(2):905–935, 2020.
  • [42] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
  • [43] J. M. Sanz-Serna. Markov chain Monte Carlo and numerical differential equations. In Current Challenges in Stability Issues for Numerical Differential Equations, volume 2082 of Lect. Notes in Math., pages 39–88. Springer, 2014.
  • [44] J. M. Sanz-Serna and K. C. Zygalakis. Contractivity of Runge–Kutta methods for convex gradient systems. SIAM Journal on Numerical Analysis, 58(4):2079–2092, 2020.
  • [45] J. M. Sanz-Serna and K. C. Zygalakis. The connections between Lyapunov functions for some optimization algorithms and differential equations. SIAM Journal on Numerical Analysis, 59(3):1542–1565, 2021.
  • [46] J. M. Sanz-Serna and M. P. Calvo. Numerical Hamiltonian Problems. Dover Books on Mathematics. Dover Publications, 2018.
  • [47] J. M. Sanz-Serna and A. M. Stuart. Ergodicity of dissipative differential equations subject to random impulses. Journal of Differential Equations, 155(2):262–284, 1999.
  • [48] R. Shen and Y. T. Lee. The randomized midpoint method for log-concave sampling. In Advances in Neural Information Processing Systems, volume 32, pages 2098–2109, 2019.
  • [49] R. D. Skeel. Integration schemes for molecular dynamics and related applications. In The Graduate Student’s Guide to Numerical Analysis ’98: Lecture Notes from the VIII EPSRC Summer School in Numerical Analysis, pages 119–176. Springer, 1999.
  • [50] G. Vilmart. Postprocessed integrators for the high order integration of ergodic SDEs. SIAM Journal on Scientific Computing, 37(1):A201–A220, 2015.
  • [51] S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh. Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(159):1–48, 2016.
  • [52] A. Alamo Zapatero. Word series for the numerical integration of stochastic differential equations. PhD thesis, Universidad de Valladolid, 2019.