Wasserstein distance estimates for the distributions of numerical approximations to ergodic stochastic differential equations.
Abstract
We present a framework that allows for the non-asymptotic study of the -Wasserstein distance between the invariant distribution of an ergodic stochastic differential equation and the distribution of its numerical approximation in the strongly log-concave case. This allows us to study in a unified way a number of different integrators proposed in the literature for the overdamped and underdamped Langevin dynamics. In addition, we analyse a novel splitting method for the underdamped Langevin dynamics which only requires one gradient evaluation per time step. Under an additional smoothness assumption on a –dimensional strongly log-concave distribution with condition number , the algorithm is shown to produce with an complexity samples from a distribution that, in Wasserstein distance, is at most away from the target distribution.
1 Introduction
The problem of sampling from a target probability distribution in is ubiquitous throughout applied mathematics, statistics, molecular dynamics, statistical physics and other fields. A typical approach for solving such problems is to construct a Markov process on whose equilibrium distribution (or a suitable marginal of it) is designed to coincide with [7]. Often such Markov processes are obtained by solving stochastic differential equations (SDEs). Two typical examples of such SDEs are the overdamped Langevin equation, ,
(1.1) |
and the underdamped Langevin equation, ,
(1.2a) | |||||
(1.2b) |
Under mild assumptions on one can show that the dynamics of (1.1) is ergodic with respect to the distribution with density , while the dynamics of (1.2) is ergodic with respect to with density .
Equations (1.1), (1.2) [43] provide the basis for different computational devises for sampling from . In particular, one can obtain samples from by discretizing the SDEs and generating numerical solutions over a long time interval [37]. One needs to be careful with the integrator that is used, since it could be the case that the discrete Markov chain resulting from the numerical discretization might not be ergodic [42]. In addition, even if that chain is ergodic, it is normally the case that the stationary distribution of the numerical solution is different from . The study of the asymptotic error between and has received a lot of attention in the literature. The work in [34] investigates the effect of the numerical discretization on the convergence of the ergodic averages, while [1] present general order conditions for the numerical invariant measure to approximate with high order of accuracy, by exploiting the connections between partial differential equations and SDEs [31]. A number of recent papers have applied this framework to numerical integrators for the underdamped Langevin equation [2], as well as to the case of stochastic gradient Langevin dynamics [51, 30]. In addition, [50, 27] extended this framework to the case of post processed integrators and to SDEs on manifolds.
Another line of research that has received much attention in the last few years deals with the study of the non-asymptotic error between the numerical approximation and the invariant measure . In particular, for the case of the overdamped Langevin equation (1.1) and log-concave and strongly log-concave distributions [12] established non-asymptotic bounds in total variation distance for the Euler-Maruyama method and an explicit extension of it based on further smoothness assumptions. These results have been extended to the Wasserstein distance in e.g. [17, 11, 18, 13, 16], while the paper [25] obtains similar bounds for implicit methods applied to (1.1). Similar non-asymptotic analyses for the case of the underdamped Langevin equation appear in [10, 15, 39, 48, 21]. One of the aims of all that literature is to study the number of steps that the integrators require to achieve a target accuracy when applied to -dimensional targets with condition number . Underdamped discretizations may lead to a better dependence of on and than their overdamped counterparts. The case of non-strongly log-concave distributions and the non-asymptotic behaviour of numerical algorithms has also received attention recently [14, 33].
In this work, we present a unified framework that allows for the non-asymptotic study of numerical methods for ergodic stochastic differential equations (including equations (1.1) and (1.2)) in the case of strongly log-concave distributions. In particular, we obtain a general bound for the error in between and the probability distribution of the numerical solution after -iterations. This bound depends on two factors, the first can be controlled by understanding the contractivity properties of the numerical method, while the second is directly related to the local strong error of the integrator. Moving to integrators with smaller strong local error results in a better performance when the dimensionality grows and the error level decreases. Also moving to integrators that are contractive for larger step sizes improves the performance for large condition numbers. This is consistent with what has been suggested in the literature [25, 41].
As an application of the suggested framework, we study two numerical methods for the underdamped Langevin dynamics. The first is the method, that we shall call EE, used in [10]; the second is a splitting method called UBU. Both require the same computational effort, namely one gradient evaluation per time-step, but UBU has better convergence properties [3, 4].
-
•
For the integrator EE, we prove that, in -Wasserstein distance and for a strongly log-concave -dimensional distribution with condition number , the algorithm produces a distribution that is –away from the target in a number of steps that (up to logarithmic terms) behaves like . This improves on the estimate in [10]. EE has also been analysed in [15]; however, the analysis in that reference has severe limitations as discussed in Section 6.
-
•
UBU, under the same hypotheses as EE, shares the estimate. However, unlike EE, UBU is capable of leveraging additional smoothness properties of the log-density of the target. With such an additional smoothness assumption, we prove an estimate that depends on , and as (there is also a dependence on a bound for the third derivatives of the target log-density).
Even though a detailed comparison between UBU and alternative algorithms is not within the scope of the present paper, the following comments are in order.
-
•
As we will discuss in Remark 6.3, for fixed , the improvement from the EE estimate to the UBU estimate arises from EE having strong order one and UBU having strong order two. This shows the importance of strong second-order integrators. A strong second-order discretization of the underdamped Langevin dynamics that requires evaluation of the Hessian has been introduced in [15]. However the analysis in that reference only holds for unrealistic values of the stepsize, see Section 6.2.
-
•
The randomized midpoint method in [48] uses two gradient evaluations per time-step and may be regarded as optimal [9] among the integrators of the underdamped Langevin dynamics that use the driving Brownian motion, its weighted integration and target and an oracle of the . For Lipschitz gradients, the estimate of the mixing time is , where we note the favourable dependence on , which stems from the random nature of the algorithm (see [9] and its references). For fixed we then find an behaviour, to be compared with the estimate of UBU when the extra smoothness assumption holds. See Remark 6.3.
-
•
The algorithm suggested in [40] is not based on integrating the underdamped Langevin equation but an alternative system of stochastic differential equations where has additional smoothness (see Remark 3.3). For fixed , the authors prove a estimate of the mixing time (i.e. the behaviour established here for UBU). That reference does not investigate the dependence of the mixing time on ; numerical experiments suggest the algorithm does not operate satisfactorily when the condition number is large.
The main contributions of this work are:
-
1.
The use of an appropriate state-form representation of SDEs and their numerical integrators that allows to establish contractivity estimates both for the time-continuous process and its numerical solution.
-
2.
A study of the contractivity of integrators for the underdamped Langevin dynamics that takes into account the possible impact of increasing condition numbers.
-
3.
A general result that allows to obtain bounds for the -Wasserstein distance between the target distribution and its numerical approximations for general SDEs. In particular the result may be applied to discretizations of the overdamped and underdamped Langevin equations.
-
4.
We improve on the analysis in [10] and explain the reasons why similar improvements may be expected when analysing other integrators.
-
5.
We suggest the use in sampling of UBU, a splitting integrator for the underdamped Langevin equations that only requires one gradient evaluation per step and possesses second order weak and strong accuracy.
-
6.
We provide non-asymptotic estimates of the sampling accuracy of UBU.
The rest of the paper is organised as follows. In Section 2 we set up notation and discuss the different smoothness assumption on that we will employ through out the paper. In Section 3 we present the stochastic differential equations (SDEs) we are concerned with. These are written in a state-space form framework, similar to that used (for other purposes) in [32, 20, 45]. This framework is useful here because it makes it easy (see Propositions 3.8 and 3.10) to investigate the contractivity properties that underlie the SDE Wassertein distance estimates between the push-forward in time of two initial probability distribution (Proposition 3.6). Section 4, parallel to Section 3, is devoted to the integrators and their contractivity. Again a state-space framework is used that makes it possible to easily investigate the contractivity of the integrators. Section 5 contains one of the main contributions of this paper, Theorem 5.2, which provides a general result for getting bounds of the Wasserstein distance between the invariant distribution of the SDE and the distribution of the numerical solution. To apply Theorem 5.2 one needs (1) to establish a contractivity estimate for the integrator and (2) to prove what we call a local error bound. The latter is essentially a mean square bound of the difference between a single step of the integrator and a corresponding step with the SDE, under the assumption that the initial data for the step follows the distribution . Section 6 applies the general result to investigate two discretizations of the underdamped Langevin dynamics. The final Section 7 contains some additional results and also the more technical proofs of the results in the preceding sections.
The extension of the material in this paper to variable step sizes and to inaccurate gradients is certainly possible, but will not be considered.
2 Preliminaries
We will now discuss some assumptions on , as well as set up some notation that we will later use.
2.1 Smoothness properties of
The symbol always refers to the standard Euclidean norm. Throughout the paper we shall assume that the following two conditions hold:
Assumption 2.1.
is twice differentiable and -smooth, i.e
(2.1) |
Assumption 2.2.
is -strongly convex, i.e
It is well known that these two assumptions are equivalent to the Hessian of , which we will denote by , being positive definite and satisfying . In studies like the present one, Assumptions 2.1 and 2.2 are standard in the literature: see, among others, [11, 17, 18, 13] for the overdamped Langevin dynamics and [10, 15, 39, 21] for the underdamped case.
In addition to these two assumptions, the following further smoothness assumption on will be used when it comes to analysing higher strong-order discretizations for the underdamped Langevin equation. The symbol denotes the tensor of third derivatives (derivative of the Hessian); at each , is a bilinear operator mapping pairs into vectors in .
Assumption 2.3.
is three times differentiable and there is a constant such that at each point , for arbitrary , :
2.2 Wasserstein distance
Let and be two probability measures on . The -Wasserstein distance between is given by
where is the set of all couplings [19] between and , i.e. the set of all probability distributions in whose marginals on the first and second factors are and respectively. More generally, if is an positive definite symmetric matrix, we will use the distance
where in the -norm defined by . Since the -norm and the standard Euclidean norm are related by
(2.2) |
where and are the smallest and largest eigenvalues of , we also have
for arbitrary , . Therefore bounds for the metric may immediately be translated into bounds for and viceversa.
3 Stochastic differential equations
In this section we will study some properties of a class of ergodic stochastic differential equations that includes (1.1) and (1.2). In particular, we will extend to the stochastic case a control theoretical framework used in [20, 32] to analyse optimization algorithms, and study properties of such SDEs, including the existence of an invariant measure, and the speed of convergence to equilibrium in the Wasserstein distance.
3.1 State-space form
We are concerned with sampling algorithms obtained by discretizing SDEs with additive noise that may be written as linear systems in state-space form:111 Note that this excludes algorithms, like the Riemann manifold MALA in [22], that use multiplicative noise. Also Hamiltonian Montecarlo [6] and similar piecewise deterministic samplers that use jumps do not fit in the present study.
(3.1a) | |||||
(3.1b) | |||||
(3.1c) |
Here is the state, is the input, is the output that is mapped to by the nonlinear map and represents the standard -dimensional Brownian motion. The real matrices , , and are constant, with sizes , , and respectively. We define
and note that, since the right hand-side of (3.1a) is globally Lipschitz continuous, the solution exists and is unique.
Example 3.1.
The simplest case corresponds to the overdamped Langevin equation (1.1) (the positive constant may be set by rescaling ) and -dimensional. Here, , , , , , , , .
Example 3.2.
The underdamped Langevin dynamics (1.2) ( and are positive constants and is -dimensional) has , , , and ( stands for )
Remark 3.3.
As distinct from the situation in (1.1), in (1.2) the noise does not enter the equation directly; it does so only through the auxiliary variable . This results in being smoother in the underdamped case than in the overdamped case. This idea may be taken further: additional auxiliary variables may be introduced so as to increase the smoothness of , see e.g. [40].
The following proposition, whose proof is given in Section 7.1, relates (3.1) and the pdf . The proposition may be used to check that the target is in fact the invariant density for the overdamped Langevin dynamics (1.1) and that the underdamped Langevin system (1.2) has the invariant density .
Proposition 3.4.
Assume that is an positive semidefinite symmetric matrix.
-
•
The relations
(3.2a) (3.2b) (3.2c) (3.2d) imply that (3.1) has the invariant probability distribution with density
-
•
If , then the marginal of on is the target .
If is regarded as being arbitrary, then the relations (3.2) are also necessary for the probability distribution with density to be invariant, see Section 7.1. The next result may be useful to check the hypotheses of Proposition 3.4. The proof is a simple exercise and will not be given.
Proposition 3.5.
3.2 Convergence to the invariant distribution
We assume hereafter that (3.1) has the unique invariant distribution . If denotes the probability distribution of the initial value for (3.1) and , represents the resulting probability distribution of , we will investigate the convergence, in the Wasserstein distance, of towards , as .
In order to estimate we use the following well-known approach. We introduce the auxiliary -dimensional SDE:
(3.3a) | |||||
(3.3b) |
where the same Brownian motion drives and . If and , and is a coupling between and then the pushforward of by the solution of (3.3) provides a coupling for the distributions and of and . In this setting it is easy to prove the following result.
Proposition 3.6.
Assume that and exist such that for (3.3), almost surely,
(3.4) |
Then, for arbitrary distributions, and ,
and, in particular, for arbitrary ,
(3.5) |
3.3 Contractivity
We now identify sufficient conditions for (3.4) to hold.
Lemma 3.7.
Let be an symmetric matrix and . For solutions of (3.3),
Proof.
It is enough to apply Ito’s rule to
the Ito correction is
∎
The inputs , that appear in the lemma may be eliminated by using that is continuously differentiable. In fact, by the mean value theorem,
where, for each pair of vectors , in , we have defined
( is the Hessian of ). After elimination of the inputs, Lemma 3.7 yields
an equality that implies our next result.
3.4 Checking contractivity
We next provide a result that is useful when checking the hypothesis in the last proposition.
Typically, in (3.1)
(3.6) |
with , , and of sizes , , and respectively (which implies that ). This is for instance the situation for the overdamped and underdamped Langevin equations presented above, where and respectively. In general will be a small integer and therefore the matrices , , and will also be small.
When (3.6) holds and also (with of size ) and , the hypotheses of Proposition 3.4 may be stated in terms of the matrices with a hat, i.e., in the second item, and, in the first item, , etc. (here ). The same observation applies to Proposition 3.5. In addition, it makes sense to consider that the matrix is of the form with of size . Note that the eigenvalues of are obtained by repeating times each eigenvalue of and in paticular if and only if . We then have:
Lemma 3.9.
Assume that (3.6) holds and . The set of the eigenvalues of is the union of the sets of eigenvalues of the matrices (of size )
(3.7) |
where , , are the eigenvalues of .
Proof.
After using (3.6) and , the mixed product property of implies:
Now factorize with diagonal and orthogonal (both ). It follows that
and, as a consequence, the eigenvalues of are those of the matrix in square brackets in the display. This matrix consists of blocks, where each block is diagonal of size . After reordering, the matrix in square brackets becomes a direct sum of the matrices in (3.7). ∎
We now describe how to find, for a given , the decay rate in (3.4). The hypotheses on guarantee that, in (3.7), . After defining the matrix-valued function of the real variable given by
(3.8) |
we see from Lemma 3.9 that, if, for each , , then . We factorize with invertible; for instance may be chosen to be lower triangular with positive diagonal entries —Choleski’s factorization—, but other possibilities of course exist. The condition is equivalent to the condition . Therefore we will have if, as varies in , the eigenvalues of are positive and bounded away from zero. When that is the case, may be chosen to be the infimum of those eigenvalues. We also note that the eigenvalues of are the eigenvalues of the generalized eigenvalue problem . To sum up:
Proposition 3.10.
Given the symmetric, positive definite , define by (3.8). Assume that, as varies in , the eigenvalues of the generalized eigenvalue problem are positive and bounded away from zero and let be the infimum of those eigenvalues. Then the contractivity bound (3.4) with holds almost surely. Alternatively, may be defined as the infimum of the eigenvalues of the matrices , where is any matrix with .
Example 3.11.
In the case of the overdamped Langevin equation (1.1) if we make the choice , a simple calculations gives . We hence see that in this case , a well-known result.
Example 3.12.
The paper [10] studies the underdamped Langevin equation (1.2) and fixes . This does not entail any loss of generality as the value of may be chosen arbitrarily by rescaling the variable .222Other authors, see e.g. [15], use different scalings. When we quote estimates from papers that use alternative scalings, we have translated them to the scale in [10] in order to have meaningful comparisons. Furthermore, [10] sets and
(3.9) |
For these choices, we find
the eigenvalues of this matrix are and and, since , they are ( denotes the condition number). In this case and (3.4) becomes
which is the contraction estimate used in [10].
Example 3.13.
In the setting of the preceding example, we keep and as in (3.9), but do not assume . The eigenvalues of the matrix are found to be and ; for future reference, we note that they depend on and through the combination (as it was to be expected from (1.2), where is multiplied by ). We distinguish four cases:
-
1.
. As varies in , we have and . Therefore in this case the and an increase in results in an increase in . In particular, for the contraction rate improves on the value corresponding to the choice in [10] discussed in the preceding example.
-
2.
. In this case and have the common value .
-
3.
. Now is larger than and therefore , which decreases as increases.
-
4.
. In this case and there is no contractivity.
Therefore, with and in (3.9), the choice yields the best mixing: . We prove in Section 7.2 that the mixing cannot be improved by using alternative choices of .
More sophisticated choices of are considered in [15].333The matrix is not used in that reference, which only works with a non-triangular such that . In turn, is defined indirectly by choosing the columns of to be eigenvectors of a suitable known matrix that depends on a real parameter. The parameter is tuned to enhance the rate of contraction. While those choices allow, for some values of , a degree of improvement on the value of we have obtained by using (3.9) in Proposition 3.10, they do not yield values of above (which is of course in agreement with the analysis in Section 7.2 below). In addition the study in [15] assumes that the variable is started at stationarity and only monitors the mixing in the variable .
A useful reference on contractivity is [38].
4 Discretizations
Having established properties for solutions of SDEs of the type (3.1), we now turn our attention to the properties of their numerical discretizations. We derive a result analogous to Proposition 3.10 to establish the contractivity of the numerical solutions for integrators that use only one gradient evaluation per time step. Such integrators are particularly attractive in problems of high dimensionality.
4.1 Discrete state-space form
To discretize (3.1) on the grid points , , , we use schemes of the form:
(4.1a) | |||||
(4.1b) | |||||
(4.1c) |
Here, at each step, is the feedback output at which the gradient will be evaluated and represents a random vector in suitably derived from the restriction to of the Brownian motion in (3.1). The real matrices , , , and are constant, with sizes , , , and respectively. As the examples that follow will illustrate, consistency requires that be an approximation to in (3.1), while and approximate and . Note also the noise in (4.1b), which has no countepart in (3.1b).
Example 4.1.
Example 4.2.
To shorten the notation, we introduce the functions:
and
For the integration of (1.2) Cheng et al. [10] use the scheme:
(4.2a) | |||||
(4.2b) |
In this example, , , , ,
and
The recipe for simulating the Gaussian random variables may be seen in [10].
In the absence of noise, this integrator is the well-known Euler exponential integrator [24], based, via the variation of constants formula/Duhamel’s principle, on the exact integration of the system , . In the stochastic scenario the algorithm is first order in both the weak and strong senses. The paper [21] calls this scheme the left point method. In what follows we shall refer to it as the Euler exponential (EE) integrator.
Example 4.3.
Another instance of an underdamped Langevin integrator of the form (4.1) is the following UBU algorithm:
(4.3a) | |||||
(4.3b) | |||||
(4.3c) |
Here and later . UBU is a splitting integrator [35] that is second order in both the weak and strong senses. See Section 7.3 for details and note that both EE and UBU use stochastic integrals of the form and therefore are not covered by the analysis in [9].
Remark 4.4.
The format (4.1) only caters for schemes that use a single evaluation of the gradient per step. By increasing the dimension of , the format may be easily adapted to integrators that use several gradient evaluations, cf. [32, 20, 45]. However, the technique used below to establish the contractivity of the integrators cannot be immediately extended to schemes with several gradient evaluations; several gradient evaluations would bring in Hessian matrices evaluated at different locations and it would not be possible to diagonalize those Hessians simultaneously, as we did when proving Lemma 3.9. For the contractivity of algorithms involving several gradient evaluations see e.g. [44] and its references.
4.2 The evolution of probability distributions in the discrete case
We will denote by the probability distribution for in (4.1) when is the distribution of (thus is an operator on measures). After introducing (cf. (3.3))
(4.4a) | |||||
(4.4b) |
we have the following discrete counterpart of Proposition 3.6, whose proof will not be given:
Proposition 4.5.
Assume that and exist such that for (4.4), almost surely,
(4.5) |
Then, for arbitrary distributions, and ,
(4.6) |
4.3 Checking discrete contractivity
The proof of the following result is similar to that of Proposition 3.8 and will be omitted:
Proposition 4.6.
In a similar way as for the continuous time case we can prove a discrete time counterpart for Proposition 3.10.
Proposition 4.7.
Given the symmetric, positive definite , set
Assume that, as varies in , the supremum of the eigenvalues of the generalized eigenvalue problems is . Then the contractivity bound (4.5) with holds almost surely. Alternatively, may be defined as the suppremum of the eigenvalues of the matrices , where is any matrix with .
The remainder of this section is devoted to the application of the last proposition to the investigation of the contractivity of the integrators (4.2) or (4.3) applied to the underdamped Langevin system (1.2) with and chosen to coincide with in (3.9). We have computed symbolically the eigenvalues in the proposition in closed form, but the resulting expressions are complicated and will not be reproduced here. (For each fixed we attach the superscript to the discrete eigenvalue closest to and the superscript to the other.) Rather than analysing directly the discrete eigenvalues, we follow an alternative approach based on leveraging the contractivity of the SDE (studied in Example 3.13) and the consistency of the discretizations. The key observation is that, by definition of consistency, for fixed and as , the numerical propagator matrix in Proposition 4.7 differs from the differential equation propagator in Proposition 3.10 by an amount, where for the first order EE integrator and for the second order UBU. As a consequence, is away from (cf. [46, Example 10.1]), and, for the eigenvalues, we have . It is convenient for our purposes to work, rather than with , in terms of the quantities
(4.7) |
for these (since, as , ) we have . An illustration of the convergence of to may be seen in Figure 1.
Remark 4.8.
Because the discrete eigenvalues depend smoothly on and this variable ranges in the compact interval , the convergence is uniform in . Therefore , which is the minimum of , converges to the minimum of , in the limit where with , and fixed. We conclude that, for small enough, the discretizations will behave contractively when the SDE does, i.e. whenever . However, this conclusion is per se rather weak because, as we have seen in Example 3.13, as and vary with , the contraction rate behaves like and it may be feared that the discretizations be contractive only for values of that, as increases unboundedly, tend to . If that were the case, the usefulness of the integrators (4.2) or (4.3) could be doubted. The next result proves that those fears are not warranted if is chosen appropiately: the discretizations are contractive with a rate that essentially coincides with the SDE rate, provided that is below a threshold independent of and .


Theorem 4.9.
Proof.
We begin by recalling, from Example 3.13, that the SDE eigenvalues are and . In addition, and for the reasons we pointed out in the continuous case, the discrete eigenvalues , for fixed , are functions of the combination . In other words, for fixed , depend only on (i.e. they are independent of and ). In this way, when thinking in terms of the variable , changing and only impacts the convergence by changing the interval of values of that have to be considered. Therefore, how much has to be reduced to get below a target error tolerance is independent of , and .
Now the theorem is clearly true when ranges in a bounded interval and we may suppose in what follows that is so large that . We consider the two eigenvalues and successively.
For we note that, as varies in ,
As a consequence, for small enough, . In view of the restriction on , .
The discussion of the behaviour of is more delicate. Here we need to take into account that, if we set in (1.1) so as to switch off the force and the noise, then both integrators under consideration are exact. This implies that, at and for arbitrary , coincides exactly with the continuous eigenvalue . This in turn entails that the error vanishes at for each and must then have an expression of the form , where is a smooth function (this is born out in Figure 1, where the difference decreases as decreases with fixed ). Since , for the relative error we may write . By taking sufficiently small we may guarantee that and, since , we have and the proof is complete. ∎
Remark 4.10.
The same proof shows that if (as in [10]) a similar results holds with a rate , where may be chosen arbitrarily.
Remark 4.11.
The choice guarantees contractivity in the SDE, but has to be excluded from Theorem 4.9. For this value, the proof breaks down because, as the condition number increases, is not bounded away from zero. By using the expressions of the eigenvalues at , it may be seen that contractivity requires for the EE integrator and for the second order UBU.
Remark 4.12.
Only two properties of the integrators EE and UBU have been used in the proof: (i) they are consistent, (ii) they are exact if the force and noise are switched off. The second of these was required to prove that, for each , , or equivalently, (see (4.7)), which means that that has as an eigenvalue. In fact, for all reasonable discretizations, it is true that has the eigenvalue . This happens because with and (no velocity, no force) any reasonable discretization will yield , (the particle stands still). Therefore the theorem is true for all integrators of interest.
EE | UBU | EE | UBU | EE | UBU | |
---|---|---|---|---|---|---|
2 | *** | 5.000(-10) | *** | *** | *** | *** |
1 | 5.000(-10) | 5.000(-10) | *** | 1.000(-9) | *** | 1.500(-9) |
1/2 | 5.000(-10) | 5.000(-10) | 1.000(-9) | 1.000(-9) | *** | 1.500(-9) |
1/4 | 5.000(-10) | 5.000(-10) | 1.000(-9) | 1.000(-9) | 1.500(-9) | 1.500(-9) |
Example 4.13.
The proof of the theorem sheds no light on the size of the threshold . With a view to ascertaining the range of values of for which the integrators (4.2) and (4.3) behave contractively, we have computed numerically the eigenvalues in Proposition 4.7 and taken the suppremum over with fixed . Table 1 provides information on the quotient that approximates the decay constant in the estimate (3.5). For the parameter choice used in [10], the table shows that, for sufficiently small (say ), there is numerical contractivity and the quotient coincides (within the number of digits reported) with the SDE rate found in Section 3.4. The table also gives results for the choices and , where again the values of reported are in agreement with the SDE rates and found in Example 3.13.
5 A general theorem
In this section we consider integrators for (3.1) of the form , , , , where, following the terminology in [36], represents the one-step approximation; uses the restriction of the Brownian motion in (3.1) to the interval , but this fact is not reflected in the notation. Integrators of the form (4.1) provide a particular class of integrators of this form (cf. Remark 4.4). If is an matrix and is an arbitrary probability distribution for the initial condition , we wish to study the distance between the invariant distribution and the distribution of .
In the analysis, for random vectors , we use the Hilbert-space norm . The symbol will be used for the corresponding inner product. We denote by the exact counterpart of , so that if is a solution of (3.1) then . At each time-level , , we introduce a random variable such that . For the difference (that may be seen as a local error), we consider the following assumption:
Assumption 5.1.
There is a decomposition
and positive constants , , , , such that for and :
(5.1) |
and
(5.2) |
We can now state our general theorem that gives a bound for the Wasserstein distance between the invariant measure and the distribution of the -th iteration of the numerical scheme.
Theorem 5.2.


Before going into the proof we make some observations:
-
•
The theorem is in the spirit of classic “consistency and stability imply convergence” numerical analysis results. The main idea of the proof is schematically illustrated in the left panel of Figure 2. The error of interest at time level can be decomposed into two terms. The first is the distance between and and can be bounded in terms of the error at time level by using the contractivity of the numerical integrator. The second is the distance between and and can be bounded by estimating the strong local error by means of Assumption 5.2.
- •
-
•
The part of the local error results in an contribution to the bound on . One power of is lost from going from local to global as in the classical analysis of (deterministic) numerical integrators.
-
•
The part of the local error is asked to satisfy the requirement (5.1) and only looses a factor when going from local to global. This is reminiscent of the situation for the strong convergence of numerical solutions of SDEs (see e.g. [36, Theorem 1.1]), where, for instance, the Euler-Maruyama scheme with strong local errors yields strong global errors (assuming additivity of the noise). Typically, the part of the local error will consist of Ito integrals that, conditional on events occurring up to the beginning of the time step, have zero expectation.
Proof.
The paper [13] analyzes the Euler-Maruyama discretization of (1.1). Under two different smoothness assumptions on , it derives two different estimates similar to (5.3), one with and the other with . The application of Theorem 5.2 to those two cases retrieves the estimates in [13]; in addition the proof of our theorem as applied to those two particular cases coincides with the proofs provided in that paper. (In fact we derived Theorem 5.2 as a generalization of the material in [13] to a more general scenario.)
By considering the case where the target distribution is a product of uncorrelated univariate Gaussians, one sees that the estimates of the mixing time in [13] are optimal in their dependence on and . [16] have shown that those estimates are not optimal in their dependence on . This implies that the result in Theorem 5.2 is not necessarily the best that may be achieved in each application.
6 Application to underdamped Langevin dynamics
6.1 The EE integrator
We begin with the EE integrator (4.2). For its local error we have the following theorem, proved in Section 7.5. Recall that setting does not restrict the generality, as such a choice is equivalent to choosing the units of time. Once this value of has been fixed, the choice of in the theorem that follows is the one that allows the best contraction rate for the SDE (1.2).
Theorem 6.1.
We now may apply Theorem 5.2 to the situation at hand and to do so we need the contractivity of the scheme in the –norm studied in Theorem 4.9. The discussion that follows may immediately be extended to all choices of that lead to contractivity; however, for the sake of clarity, we fix as in [10]. (But recall that is suboptimal in terms of the contraction rate). For this choice, we know from Remark 4.10 that, for , the numerical rate will be of the form , where may be chosen arbitrarily close to ( depends of course on but it is independent of , and ). Theorem 5.2 yields, for ,
(6.1) |
Assume now that, given any initial distribution for the integrator and , we wish to find and to guarantee that . We may achieve this goal by first choosing small enough to ensure that
and then choosing large enough to get
(Instead of splitting the target as , one may use , , and tune to improve slightly some of the error constants below.) This leads to the conditions
(6.2) |
( is a nondimensional combination whose value does not change if the scale of is changed) and
(6.3) |
According to the bound (6.2), has to be scaled like as , or ; then the number of steps to achieve a target value of the contraction factor scales as
(6.4) |
The analysis of the same integrator in [10, Theorem 1] yields scalings that are more pessimistic in their dependence on : there, scales as and as . In addition, the initial distribution , which is arbitrary in the present study, is assumed in [10] to be a Dirac delta located at and ; the estimates become worse as the distance between the initial position used in the integrator and the mode of increases.
It is perhaps useful to compare the technique of proof in [10] with our approach by means of Figure 2. While we employ the contractivity of the algorithm, [10] leverages the contractivity of the SDE itself. These two alternative approaches are well known in deteministic numerical differential equations (see e.g. the discussion in [23, Chapter 2] where a cartoon similar to Figure 2 is presented). On the other hand, while we investigate evaluated at a random variable whose marginal distribution is , [10] has to evaluate that difference at the numerical . It is for this reason that in [10] one needs to have information on the distribution of and to establish a priori bounds on the distributions of as varies (this is done in Lemma 12 in [10]). Generally speaking, once contractivity estimates are available for the numerical solution, the approach on the left of Figure 2 is to be preferred.
The reference [15] also investigates Wasserstein error bounds for the integrator EE. The general approach is the same as that in [10], but the technical details differ. For ,444Because [15] uses the SDE contractivity, it does not exclude the limit case as we have to do. See Remark 4.11 that implies that for this integrator bounds that hold for are only possible for . an upper bound very similar to (6.1) is derived for the -Wasserstein distance between the -marginals of and . That bound depends on , , and in the same way as (6.1) does. The constants in the estimates are nevertheless different, as expected. For instance, for the choice we have been discussing, the factor in (6.1) (where is arbitrarily close to ) is replaced in [15] by the slightly worse factor . However the bound in [15] is only proved for very small values of ( when ). This is an extremely severe limitation because we know from (6.2) that, as the condition number increases, EE may be operated with a value of of the order of rather than . The unwelcome step size restriction originates from estimating evaluated at the numerical solution rather than at the SDE solution.
6.2 The UBU integrator
We now turn our attention to the UBU integrator. Under the standard smoothness Assumptions 2.1 and 2.2, the strong order of convergence of UBU is (see Section 7.7 for a detailed analysis of the UBU local error under those assumptions). The analysis for UBU is then very similar to the one presented above for EE, and leads to an estimate for the mixing time. When satisfies the additional smoothness Assumption 2.3, UBU exhibits strong order and this may be used in our context to improve on the estimates (6.2)–(6.3).
The proof of the next result is given in Section 7.6.
Theorem 6.2.
The contractivity of the scheme in the -norm necessary to use Theorem 5.2 was established in Theorem 4.9. As we did for the first-order integrator, for the sake of clarity, we fix (but other values of may be discussed similarly, provided that they ensure the contractivity of the algorithm). Note that the constant is then . After choosing arbitrarily as in Remark 4.10, Theorem 5.2 yields, for :
(6.5) |
where denotes an absolute constant. To ensure , we take
and then increase as in (6.3). Thus, for UBU, the scaling of is
and, as a consequence, the number of steps to guarantee a target contraction factor scales as
(6.6) |
The dependence of on , and in this estimate is far more favourable than it was for EE (see (6.4)). However here we have the ( dependent) factor and one could easily concoct examples of distributions where this factor is large even if the condition number is of moderate size. In those, arguably artificial, particular cases, it may be advantageous to see UBU as a first order method as discussed at the beginning of this section.
Remark 6.3.
A comparison between (6.1) and (6.5) makes it clear that (for fixed , , ) the dependence of the mixing time of UBU stems from having strong order two. In the second term of the right hand-side of the inequalities (6.1) and (6.5) (i.e. the bias), the exponent of coincides with the strong order of the integrator. In order to make those second terms of size one needs to scale as for EE and as for UBU. The first terms of the right hand-side of the inequalities (6.1) and (6.5), then show that has to be scaled as , i.e. as for the first-order method and for the second-order method.
Integrators of strong order higher than two would have even more favourable dependence of the mixing time on and . Unfortunately such high-order integrators [37] are invariably too complicated to be of much practical significance. In particular there is no splitting algorithm that achieves strong order larger than two [4]. In addition, an increase of the order may be expected to require an increase of the required smoothness of .
The randomized algorithm in [48] has a bias that behaves as leading to an estimate of the mixing time.
Remark 6.4.
The paper [13] considers a weaker form of the extra-smoothness assumption Assumption 2.3 where , rather than assumed to be differentiable with derivative upper-bounded by , is only assumed to be Lipschitz continuous with constant . It is likely that, by means of the technique in the proof of [13, Lemma 6], Theorem 6.2 may be proved under that alternative, weaker version of Assumption 2.3, but we have not yet studied that possibility.
A second order discretization of the underdamped Langevin equation, that unlike UBU, requires to evaluate the Hessian of once per step, has been suggested in [15]. A bound similar to (6.5) is derived which is valid only for small values . This is very restritive because, as we have just found, for UBU scales with as .
The reference [21] suggests a novel approach to obtaining high order discretizations of the underdamped Langevin dynamics (1.2). At each step, in addition to generating suitable random variables, one has to integrate a so-called “shifted ODE”, whose solutions are smoother than the solutions of (1.2). The analysis in that reference examines the case where the integration is exact; in practice, the shifted ODE has of course to be discretized by a suitable numerical method and [21] provides numerical examples based on two different choices of such a method.
7 Additional results and proofs
7.1 Proof of Proposition 3.4
The second item in the Proposition is proved as follows. By standard linear algebra results, may be uniquely decomposed as with in the kernel of and in the image of ; furthermore there is a bijection between values of and values of . Under the assumption , which implies , and vanish, and then is independent of , i.e. of . Therefore the marginal of coincides with the marginal of .
For the first item, we have to show that the pdf ( is the normalizing constant), that with some abuse of notation we denote by satisfies the Fokker-Planck equation
Here and respectively denote the standard divergence and gradient operators in the space of the variable . The computations that follow use repeatedly the well-known identity , where is a scalar valued function and is an -valued function. We will also use that if is any constant matrix, then , where denotes the Hessian of and stands for the Frobenius product of matrices (equivalently ).
We observe that
and therefore
and
Furthermore
From the last three displays we conclude that the left hand-side of the Fokker-Planck equations is the product of and
each of the first three relations in (3.2) is sufficient for the corresponding line in this display to vanish. (In addition, if is regarded as arbitrary, then those three relations are also necessary.) The quadratic form in the fourth line in the display vanishes if and only if is skew-symmetric as demanded by the fourth relation in (3.2). This completes the proof.
7.2 Contraction estimates for the underdamped Langevin equations
We consider the underdamped Langevin equations (1.2) where, after rescaling , we may assume that . We apply Proposition 3.10 to determine and so as to maximize the decay rate . We exclude the case , which has no practical relevance.
If
, denotes the unknown Choleski factor of , the eigenvalues of are found to be
(7.1) |
Since our aim is to ensure that these have a lower bound as large as possible and we only have to consider the minus sign in (7.1).
Without loss of generality, we may set and then have to find , and to minimize
(7.2) |
Consider a local minimum , , of the minimization problem. We claim that
(7.3) |
where
In other words has to coincide with the midpoint of the interval of possible values of . In fact, assume that , i.e. the point is to the left of the midpoint. Then the supremum in (7.2) is attained at , because ; we could lower the value of the supremum by decreasing slightly . If the supremum decreases by increasing slightly .
When (7.3) holds the supremum in (7.2) is the common value that the expression in braces takes at and . This common value is:
that we rewrite as
With , our task is to find , so as to minimize
We first fix . Then as or . By setting , we easily see that has a unique minimum at . At that minimum
which, in turn, is minimized by taking as large as possible, i.e. . Then and . We then fix . In this case, , which is worse than the best that can be achieved with . To sum up: the optimum value of is and is achieved when , , i.e. , ; then the matrix is the one in (3.9). Taking these values of to (7.3), we find that provides the optimal parameter choice. Finally, since the best value of (7.2) is , the expression (7.1) shows that the best decay rate is .
7.3 Integrators for the underdamped Langevin equations
Due to the importance of (1.2) in statistical physics and molecular dynamics, the literature on its numerical integration is by now enormous. It is completely out of our scope to summarize it and we limit ourselves to a few comments on splitting algorithms.
The different terms in the right hand-side of (1.2a) and (1.2b) correspond to different, separate physical effects, like inertia, noise, damping, etc. Therefore the system (1.2) it is ideally suited to splitting algorithms [35]. A possible way of carrying out the splitting is
Each of these subsystems may be integrated in closed form. This partitioning gives rise to schemes like ABOBA, BAOAB and OBABO [28, 29, 39]. For instance, a step of ABOBA first advances the numerical solution over using the exact solution of (A), then over using the exact solution of (B), then over with (O), then over with (B) and finally closes symmetrically with (A) over . ABOBA, BAOAB and OBABO have weak order two but only possess strong order one. In fact it is easy to check that with the A-B-O splitting it is impossible to generate the Ito integrals that are required to ensure strong order two.
Another subsystem that may be integrated exactly in closed form is
The algorithm BUB, used e.g. in [8], advances with (B) over , then with (U) over and closes the step with (B) over . To our best knowledge it was first suggested by Skeel [49]. In [21] is referred to as Strang splitting, but this terminology may be confusing because it is standard to use the expression Strang splitting to mean any splitting algorithms with a symmetric pattern XYX [46]. The authors of [21], a reference that compares different integrators, “believe that BUB offers an attractive compromise between accuracy and computational cost”.
Changing the roles of B and U we obtain the UBU scheme suggested in the thesis [52], where it is proved that both BUB and UBU have weak and strong order two. In fact BUB and UBU are closely related because, UBU is the algorithm that advances the BUB solution from the midpoint of one step to the midpoint of the next (and vice versa).
The thesis [52] also describes a method to boost to strong order two any method whose strong order is only one. The boosting is achieved by generating auxiliary Gaussian random variables and may be relevant in our context where we are interested in the strong order of the integrators. A general technique to investigate the weak and strong order of splitting algorithms for general SDEs may be seen in [3, 4]; those reference provide a detailed study of splitting Langevin integrators.
7.4 A lemma
The following result is a variant of [13, Lemma 7] and may be proved in a similar way.
Lemma 7.1.
Assume that the sequence of nonnegative numbers is such that for some constants , , and each
Then, for
7.5 The local error for EE integrator
To analyze , we have to take the random variable (see Theorem 5.2) as initial data, first to move the solution of the SDE forward in the interval and then to perform a step of the integrator over the same interval.
Solutions of (1.1) satisfy, for , , | |||||
(7.4a) | |||||
(7.4b) |
The proof is divided up in steps.
First step. From (7.4a) and (4.2a), we find that the -component of , denoted as , is
where is the -component of the solution of (1.2) that at takes the value (we shall later need the component, to be denoted by ). Using successively the Cauchy-Schwartz inequality, the bound for , the Lipschitz continuity of , and (1.2b), we find:
An application of the Cauchy-Schwartz inequality to the inner integral yields
Now, using the fact that the initial data is distributed according to , this will be the distribution of for all . Hence, since the distribution of each of the scalar components of is centered Gaussian with second moment equal to , we obtain the final bound
Second step. Turning now to the component of , we have
and, applying the Cauchy-Schwartz inequality and the bound ,
Therefore, by proceeding in the last integral as we when we found it above, we find
7.6 The local error for UBU
We first provide bounds for the local error for UBU under Assumptions 2.1–2.3 that ensure strong order two. As in the previous Subsection we have to take as a starting point for the SDE solution and the integrator. As with the EE integrator, and denote the solution of (1.2) that starts at from . The analysis is now substantially more involved as the Ito-Taylor expansions have to be taken to higher order.
First step. We begin by estimating the difference between and the point where the integrator evaluates the force (see (4.3c)). By using (7.4b) and (4.3c), we find
We apply the Cauchy-Schwartz inequality (in a similar way to what we did at Step 2 in the preceding subsection) to get
As proved in [11, Lemma 2], when has the distribution ,
(7.6) |
and, accordingly,
(7.7) |
Second step. From (7.4a) and (4.3a), the -component of is found to be
(7.8) |
thus UBU replaces the integral by a midpoint-rule approximation. We Ito-Taylor expand (see e.g. [26, 4]) the integral around as follows. Denote by the (differentiable) integrand, i.e.
Then (the dot indicates differentiation),
(7.9) |
and taking the expansion one step further, we find
(7.10) | |||||
We now replace by its expression given by Ito’s lemma. (While is differentiable, is a diffusion process.) There is no Ito correction because is linear in and there is no forcing noise in the equation (1.2b). After computing and substituting back in (7.8), we have
(7.11) |
with
We now successively bound each term in the right hand-side of (7.11). From (7.7) and the Lipschitz continuity of
The integral may be bounded as follows:
Now using the fact that , for , we can bound the first term in the equation above by observing that
and then take into account (7.6), to get
For we use (each scalar component of is a centered Gaussian with variance and fourth moment )
that leads to
Using the Cauchy-Schwartz inequality for the inner product associated with the integration on and , we have
and, with the help of the Ito isommetry, since the Frobenius norm of is bounded by ,
We have now bounded each term in the right-hand side of (7.11); the dominant term is , with . In Assumption 5.2 we have to split into two parts, and , with of higher order; for the component we then define and . The bounds obtained above yield
and
Third step. From (7.4b) and (4.3b), the -component of is given by
(7.12) |
again UBU replaces the integral by a midpoint approximation. By expanding the integrand by means of the fundamental theorem of calculus as
(7.12) becomes
(7.13) |
with
Next, for (it is necessary to put absolute value bars around the inner integrals because could be ),
The integral may be rewritten as
and, after performing the differentiation in the integrand, , so that we may use the available bounds for and .
Taking all the partial bounds to (7.13),
As we see, for the component there is no contribution and therefore we take and .
Fourth step. With a view to checking at Step 5 condition (5.2) in Assumption 5.2, we estimate ; here is the standard inner product in , is the velocity component of and denotes the velocity component of a numerical step starting from . (This should not be confused with , the -component of the random variable to be introduced at the next time level in the construction leading to Theorem 5.2.)
Since, conditional on , , the expectation of the stochastic integral is zero, we may write
Now, from (4.3a),
with (see (4.3c))
and thus, since and ,
Taking into account (4.3c) and the definition of
and we conclude that is bounded above by
(7.14) |
Fifth step. The preceding analysis holds for all values of the parameters. We now focus in the case where and as in the statement of Theorem 6.2. To complete the proof it is enough to translate the Euclidean norm bounds in Steps 1–4 into -norm bounds.
To establish (5.1), we note that, because ,
The right hand-side of this expression was bounded in (7.14) in terms of , and . We now take into account (2.2) to bound and by a multiple of and to bound by a multiple of .
The estimates in (5.2) are established in a similar way.
7.7 The local error for UBU without extrasmoothness
Let us now assume that Assumptions 2.1–2.2 hold but Assumption 2.3 does not. Then the strong order of UBU is one. The bound for derived in the third step of the preceding subsection remains valid (note that it does not involve the constant ). However for the component of the local error, the Ito-Taylor expansion leading to (7.10) cannot be taken beyond (7.9) because now does not make sense. After replacing by its expression in terms of , one obtains the bound
Then, after combining the and contributions and setting , we have the bound
Recall that, for contractivity the integrator has to be operated with bounded above, so that the combinations , , and are all as increases. The leading terms in the UBU bound in the display are . For comparison, we note from (7.5), that for EE the corresponding leading term in the bound is . The conclusion is that, under Assumptions 2.1–2.2, the properties of UBU are very close to those of EE.
Acknowledgement. J.M.S.-S. has been supported by project PID2019-104927GB-C21 (AEI/FEDER, UE). K.C. Z acknowledges support from a Leverhulme Research Fellowship (RF/ 2020-310), the EPSRC grant EP/V006177/1, and the Alan Turing Institute.
References
- [1] A. Abdulle, G. Vilmart, and K. C. Zygalakis. High order numerical approximation of the invariant measure of ergodic sdes. SIAM Journal on Numerical Analysis, 52(4):1600–1622, 2014.
- [2] A. Abdulle, G. Vilmart, and K. C. Zygalakis. Long time accuracy of Lie–Trotter splitting methods for Langevin dynamics. SIAM Journal on Numerical Analysis, 53(1):1–16, 2015.
- [3] A. Alamo and J. M. Sanz-Serna. A technique for studying strong and weak local errors of splitting stochastic integrators. SIAM Journal on Numerical Analysis, 54(6):3239–3257, 2016.
- [4] A. Alamo and J. M. Sanz-Serna. Word combinatorics for stochastic differential equations: Splitting integrators. Communications on Pure & Applied Analysis, 18:2163, 2019.
- [5] N. Bou-Rabee and J. M. Sanz-Serna. Randomized Hamiltonian Monte Carlo. The Annals of Applied Probability, 27(4):2159–2194, 2017.
- [6] N. Bou-Rabee and J. M. Sanz-Serna. Geometric integrators and the Hamiltonian Monte Carlo method. Acta Numerica, 27:113–206, 2018.
- [7] S. Brooks, A. Gelman, G. Jones, and X. L. Meng. Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC Handbooks of Modern Statistical Methods. CRC Press, 2011.
- [8] E. Buckwar, M. Tamborrino, and I. Tubikanec. Spectral density-based and measure-preserving ABC for partially observed diffusion processes. An illustration on Hamiltonian SDEs. Statistics and Computing, 30(3):627–648, 2020.
- [9] Y. Cao, J. Lu, and L. Wang. Complexity of randomized algorithms for underdamped Langevin dynamics. arXiv:2003.09906, 2020.
- [10] X. Cheng, N. S. Chatterji, P. L. Bartlett, and M. I. Jordan. Underdamped Langevin MCMC: A non-asymptotic analysis. In Proceedings of the 2018 Conference on Learning Theory, volume 75 of Proceedings of Machine Learning Research, pages 300–323, 2018.
- [11] A. S. Dalalyan. Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent. In Proceedings of the 2017 Conference on Learning Theory, volume 65 of Proceedings of Machine Learning Research, pages 678–689, 2017.
- [12] A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017.
- [13] A. S. Dalalyan and A. Karagulyan. User-friendly guarantees for the Langevin Monte Carlo with inaccurate gradient. Stochastic Processes and their Applications, 129(12):5278–5311, 2019.
- [14] A. S. Dalalyan, A. Karagulyan, and L. Riou-Durand. Bounding the error of discretized Langevin algorithms for non-strongly log-concave targets. arXiv:1906.08530, 2019.
- [15] A. S. Dalalyan and L. Riou-Durand. On sampling from a log-concave density using kinetic Langevin diffusions. Bernoulli, 26(3):1956–1988, 2020.
- [16] A. Durmus, S. Majewski, and B. Miasojedow. Analysis of Langevin Monte Carlo via convex optimization. Journal of Machine Learning Research, 20(73):1–46, 2019.
- [17] A. Durmus and E. Moulines. Nonasymptotic convergence analysis for the unadjusted Langevin algorithm. The Annals of Applied Probability, 27(3):1551–1587, 2017.
- [18] A. Durmus and E. Moulines. High-dimensional Bayesian inference via the unadjusted Langevin algorithm. Bernoulli, 25(4A):2854–2882, 2019.
- [19] A. Eberle, A. Guillin, and R. Zimmer. Couplings and quantitative contraction rates for Langevin dynamics. The Annals of Probability, 47(4):1982–2010, 2019.
- [20] M. Fazlyab, A. Ribeiro, M. Morari, and V. M. Preciado. Analysis of optimization algorithms via integral quadratic constraints: nonstrongly convex problems. SIAM Journal on Optimization, 28(3):2654–2689, 2018.
- [21] J. Foster, T. Lyons, and H. Oberhauser. The shifted ODE method for underdamped Langevin MCMC. arXiv:2101.03446, 2021.
- [22] M. Girolami and B. Calderhead. Riemann manifold Langevin and Hamiltonian Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(2):123–214, 2011.
- [23] E. Hairer, S. P. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I: Nonstiff Problems. Springer, Berlin, second edition, 2000.
- [24] M. Hochbruck and A. Ostermann. Exponential integrators. Acta Numerica, 19:209–286, 2010.
- [25] L. Hodgkinson, R. Salomone, and F. Roosta. Implicit Langevin algorithms for sampling from log-concave densities. Journal of Machine Learning Research, 22(136):1–30, 2021.
- [26] P. E. Kloeden and E. Platen. Numerical solution of stochastic differential equations. Springer-Verlag, Berlin, 1992.
- [27] A. Laurent and G. Vilmart. Order conditions for sampling the invariant measure of ergodic stochastic differential equations on manifolds. arXiv:2006.09743, 2020.
- [28] B. Leimkuhler and C. Matthews. Rational construction of stochastic numerical methods for molecular sampling. Applied Mathematics Research eXpress, 2013(1):34–56, 2012.
- [29] B. Leimkuhler and C. Matthews. Molecular Dynamics: With Deterministic and Stochastic Numerical Methods. Interdisciplinary Applied Mathematics. Springer, 2015.
- [30] B. Leimkuhler and X. Shang. Adaptive thermostats for noisy gradient systems. SIAM Journal on Scientific Computing, 38(2):A712–A736, 2016.
- [31] T. Lelièvre and G. Stoltz. Partial differential equations and stochastic methods in molecular dynamics. Acta Numerica, 25:681–880, 2016.
- [32] L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
- [33] M. B. Majka, A. Mijatović, and L. Szpruch. Nonasymptotic bounds for sampling algorithms without log-concavity. The Annals of Applied Probability, 30(4):1534–1581, 2020.
- [34] J. C. Mattingly, A. M. Stuart, and M. V. Tretyakov. Convergence of numerical time-averaging and stationary measures via Poisson equations. SIAM Journal on Numerical Analysis, 48(2):552–577, 2010.
- [35] R. I. McLachlan and G. R. W. Quispel. Splitting methods. Acta Numerica, 11:341–434, 2002.
- [36] G. N. Milstein and M. V. Tretyakov. Stochastic numerics for mathematical physics. Springer-Verlag, Berlin, 2004.
- [37] G. N. Milstein and M. V. Tretyakov. Computing ergodic limits for Langevin equations. Physica D, 229(1):81–95, 2007.
- [38] P. Monmarché. Almost sure contraction for diffusions on . Application to generalised Langevin diffusions. arXiv:2009.10828, 2020.
- [39] P. Monmarché. High-dimensional MCMC with a standard splitting scheme for the underdamped Langevin diffusion. Electronic Journal of Statistics, 15(2):4117–4166, 2021.
- [40] W. Mou, Y. A Ma, M. J. Wainwright, P. L. Bartlett, and M. I. Jordan. High-order Langevin diffusion yields an accelerated MCMC algorithm. Journal of Machine Learning Research, 22(42):1–41, 2021.
- [41] M. Pereyra, L. V. Mieles, and K. C. Zygalakis. Accelerating proximal Markov chain Monte Carlo by using an explicit stabilized method. SIAM Journal on Imaging Sciences, 13(2):905–935, 2020.
- [42] G. O. Roberts and R. L. Tweedie. Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli, 2(4):341–363, 1996.
- [43] J. M. Sanz-Serna. Markov chain Monte Carlo and numerical differential equations. In Current Challenges in Stability Issues for Numerical Differential Equations, volume 2082 of Lect. Notes in Math., pages 39–88. Springer, 2014.
- [44] J. M. Sanz-Serna and K. C. Zygalakis. Contractivity of Runge–Kutta methods for convex gradient systems. SIAM Journal on Numerical Analysis, 58(4):2079–2092, 2020.
- [45] J. M. Sanz-Serna and K. C. Zygalakis. The connections between Lyapunov functions for some optimization algorithms and differential equations. SIAM Journal on Numerical Analysis, 59(3):1542–1565, 2021.
- [46] J. M. Sanz-Serna and M. P. Calvo. Numerical Hamiltonian Problems. Dover Books on Mathematics. Dover Publications, 2018.
- [47] J. M. Sanz-Serna and A. M. Stuart. Ergodicity of dissipative differential equations subject to random impulses. Journal of Differential Equations, 155(2):262–284, 1999.
- [48] R. Shen and Y. T. Lee. The randomized midpoint method for log-concave sampling. In Advances in Neural Information Processing Systems, volume 32, pages 2098–2109, 2019.
- [49] R. D. Skeel. Integration schemes for molecular dynamics and related applications. In The Graduate Student’s Guide to Numerical Analysis ’98: Lecture Notes from the VIII EPSRC Summer School in Numerical Analysis, pages 119–176. Springer, 1999.
- [50] G. Vilmart. Postprocessed integrators for the high order integration of ergodic SDEs. SIAM Journal on Scientific Computing, 37(1):A201–A220, 2015.
- [51] S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh. Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. Journal of Machine Learning Research, 17(159):1–48, 2016.
- [52] A. Alamo Zapatero. Word series for the numerical integration of stochastic differential equations. PhD thesis, Universidad de Valladolid, 2019.