Monte Carlo sampling with integrator snippets
Abstract
Assume interest is in sampling from a probability distribution defined on . We develop a framework to construct sampling algorithms taking full advantage of numerical integrators of ODEs, say for one integration step, to explore efficiently and robustly. The popular Hybrid/Hamiltonian Monte Carlo (HMC) algorithm [17, 29] and its derivatives are examples of such a use of numerical integrators. However, we show how the potential of integrators can be exploited beyond current ideas and HMC sampling in order to take into account aspects of the geometry of the target distribution. A key idea is the notion of integrator snippet, a fragment of the orbit of an ODE numerical integrator , and its associated probability distribution , which takes the form of a mixture of distributions derived from and . Exploiting properties of mixtures, we show how samples from can be used to estimate expectations with respect to . We focus here primarily on Sequential Monte Carlo (SMC) algorithms, but the approach can be used in the context of Markov chain Monte Carlo algorithms, as discussed at the end of the manuscript. We illustrate the performance of these new algorithms through numerical experimentation and provide preliminary theoretical results supporting the observed performance.
School of Mathematics, University of Bristol
1 Overview and motivation: SMC sampler with HMC
Assume interest is in sampling from a probability distribution on a probability space . The main ideas of sequential Monte Carlo (SMC) samplers to simulate from are (a) to define a sequence of probability distributions on where , is chosen by the user, simple to sample from and the sequence “interpolates” and , (b) and to propagate a cloud of samples for to represent using an importance sampling/resampling mechanism [16]. Notation and definitions used throughout this paper can be found in Appendix A.
After initialisation, for , samples , representing , are propagated thanks to a user-defined mutation Markov kernel , as follows. For sample and compute the importance weights, assumed to exist for the moment,
(1)
where is a user-defined “backward” Markov kernel required to define importance sampling on and swap the rôles of and , in the sense that for -integrable,
We adopt this measure theoretic formulation of importance weights out of necessity in order to take into account situations such as when and both have densities with respect to the Lebesgue measure but is a Metropolis-Hastings (MH) update, which does not possess such a density – more details are provided in Appendices A-B but can be omitted to understand how the algorithms considered in the manuscript proceed.
The mutation step is followed by a selection step where for , for the random variable taking values in with . The procedure is summarized in Alg. 1.
Given , theoretically optimal choice of is well understood but tractability is typically obtained by assuming that is -invariant, or considering approximations of and that is -invariant. This makes Markov chain Monte Carlo (MCMC) kernels very attractive choices for , and the use of measure theoretic tools inevitable.
A possible choice of MCMC kernel is that of the hybrid Monte Carlo method, a MH update using a discretization of Hamilton’s equations [17, 29], which can be thought of as a particular instance of a more general strategy relying on numerical integrators of ODEs possessing properties of interest. More specifically, assume that interest is in sampling defined on . First the problem is embedded into that of sampling from the joint distribution defined on , where is an auxiliary variable facilitating sampling. Following the SMC framework we set for a sequence of distributions on with , probabilities on and on . With an integrator of an ODE of interest, one can use for some as a proposal in a MH update mechanism; is a source of randomness allowing “exploration”, resampled every now and then. Again, hereafter we let be the corresponding components of .
Example 1 (Leapfrog integrator of Hamilton’s equations).
Assume that , that have densities, denoted and , with respect to the Lebesgue measure and let . For and differentiable, Hamilton’s equations for the potential are
(2)
and possess the important property that for . The corresponding leapfrog integrator is given, for some , by
(3)
We point out that, with the exception of the first step, only one evaluation of is required per integration step, since the rightmost term in (3) recycles the computation from the previous iteration. Let such that for any , , it is standard to check that and note that in this particular case. In its most basic form the hybrid Monte Carlo MH update leaving invariant proceeds as follows, for
(4) |
with, for some user defined
(5) |
and .
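For concreteness, the leapfrog recursion (3) can be sketched as follows. The names `grad_U`, `eps` and `T`, and the Gaussian illustration, are ours and not fixed by the text; the point is that the whole orbit (the integrator snippet) is retained, with a single gradient evaluation per step.

```python
import numpy as np

def leapfrog(x, v, grad_U, eps, T):
    """Run T leapfrog steps of Hamilton's equations for a potential U.

    Returns the whole orbit (the "integrator snippet"), not just the endpoint.
    Only one gradient evaluation per step: the end-of-step gradient is
    recycled as the start-of-step gradient of the next iteration.
    """
    xs, vs = [x.copy()], [v.copy()]
    g = grad_U(x)                      # the only "extra" gradient evaluation
    for _ in range(T):
        v_half = v - 0.5 * eps * g     # half step on the velocity
        x = x + eps * v_half           # full step on the position
        g = grad_U(x)                  # recycled at the next iteration
        v = v_half - 0.5 * eps * g     # second half step on the velocity
        xs.append(x.copy())
        vs.append(v.copy())
    return np.array(xs), np.array(vs)
```

For a standard Gaussian potential, `grad_U = lambda x: x`, the total energy along the snippet is approximately conserved, reflecting the near-invariance property used throughout.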
Other ODEs, capturing other properties of the target density, or other types of integrators are possible. However a common feature of integrator based updates is the need to compute recursively an integrator snippet , for a given mapping , of which only the endpoint is used. This raises the question of recycling intermediate states, all the more so as computation of the snippet often involves quantities shared with the evaluation of . In Example 1, for instance, expressions for and often involve the same computationally costly quantities and evaluation of the density where has already been evaluated is therefore often virtually free; consider for example for a covariance matrix , then .
In turn these quantities offer the promise of being able to exploit points used to generate the snippet while preserving accuracy of the estimators of interest, through importance sampling or reweighting. For example one may consider virtual HMC updates i.e. given and define
(6) |
Noting that one deduces that for
is an unbiased estimator of . This post-processing procedure is akin to waste-recycling [14] and its generalizations but, crucially, does not affect the dynamic of the Markov chain generating the samples. An alternative algorithm exploiting the snippet fully, with an active effect on the dynamics, could use the following mixture of Markov chain transition kernels [23, 29, 22],
which is also related to the strategy adopted in [11] with the extra chance algorithm. As we shall see our work shares the same objective but we adopt a strategy more closely related to [28]; see Section 5 for a more detailed discussion.
The present paper is concerned with exploring such recycling procedures in the context of SMC samplers, although the ideas we develop can be straightforwardly applied to particle filters in the context of state-space models and to MCMC as discussed later. The manuscript is organised as follows.
In Section 2 we first provide a high level description of particular instances of the class of algorithms considered and provide a justification through reinterpretation as standard SMC algorithms in Subsection 2.2. In Subsection 2.3 we discuss direct extensions of our algorithms, some of which we explore in the present manuscript. This work has links with related recent attempts in the literature [33, 14, 38]; these links are discussed and contrasted with our work in Subsection 2.4 where some of the motivations behind these algorithms are also discussed. Initial exploratory simulations demonstrating the interest of our approach are provided in Subsections 2.5 and 2.6.
In Section 3 we introduce the more general framework of Markov snippets Monte Carlo and associated formal justifications. Subsection 3.3 details the link with the scenario considered in Section 2. In Subsection 3.4 we provide general results facilitating the practical calculation of some of the Radon-Nikodym involved, highlighting why some of the usual constraints on mutation and backward kernels in SMC can be lifted here.
In Section 4 we provide elements of a theoretical analysis explaining expected properties of the algorithms proposed in this manuscript, although a fully rigorous theoretical analysis is beyond the present methodological contribution.
In Section 5 we explore the use of some of the ideas developed here in the context of MCMC algorithms and establish links with earlier suggestions, such as “windows of states” techniques proposed in the context of HMC [28, 29]. Notation, definitions and basic mathematical background can be found in Appendices A-B.
A Python implementation of the algorithms developed in this paper is available at https://github.com/MauroCE/IntegratorSnippets.
2 An introductory example
Assume interest is in sampling from a probability distribution on as described above Example 1 using an SMC sampler relying on the leapfrog integrator of Hamilton’s equations. As in the previous section we introduce an interpolating sequence of distributions on and assume for now the existence of densities for with respect to a common measure , say the Lebesgue measure on , denoted for and and denote the corresponding integrator, which again is measure preserving in this setup.
2.1 An SMC-like algorithm
Primary interest in this paper is in algorithms of the type given in Alg. 2 and variations thereof; throughout .
The SMC sampler-like algorithm in Alg. 2 therefore involves propagating “seed” particles , with a mutation mechanism consisting of the generation of integrator snippets started at every seed particle , resulting in particles which are then whittled down to a set of seed particles using a standard resampling scheme; after rejuvenation of velocities this yields the next generation of seed particles–this is illustrated in Fig. 1. This algorithm should be contrasted with standard implementations of SMC samplers where, after resampling, a seed particle normally gives rise to a single particle in the mutation step, in Fig. 1 the last state on the snippet. Intuitively validity of the algorithm follows from the fact that if represent , then represents in the sense that for summable, one can use the approximation
(7) |
where the self-normalization of the weights is only required in situations where the ratio is only known up to a constant. We provide justification for the correctness of Alg. 2 and the estimator (7) by recasting the procedure as a standard SMC sampler targeting a particular sequence of distributions in Section 2.2 and using properties of mixtures. Direct generalizations are provided in Section 2.3.
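A self-normalised version of the estimator (7) can be sketched as follows, in the volume-preserving setting of Example 2 where the weight of the k-th snippet state is proportional to the current target evaluated at that state over the previous target evaluated at the seed. All names (`psi`, `log_mu_n`, `log_mu_prev`) are our own illustrative choices.

```python
import numpy as np

def snippet_estimate(seeds_x, seeds_v, psi, log_mu_n, log_mu_prev, f, T):
    """Self-normalised estimate of the expectation of f using ALL states of
    every integrator snippet, not just the endpoints.

    seeds_x, seeds_v : seed positions / velocities (one snippet per seed)
    psi              : one integrator step (x, v) -> (x, v), volume preserving
    log_mu_n, log_mu_prev : unnormalised log-densities of the current and
                            previous targets on the extended space (x, v)
    """
    log_w, fs = [], []
    for x, v in zip(seeds_x, seeds_v):
        base = log_mu_prev(x, v)        # denominator: seed under previous target
        for _ in range(T + 1):          # k = 0 is the seed itself
            log_w.append(log_mu_n(x, v) - base)
            fs.append(f(x))
            x, v = psi(x, v)            # extend the snippet
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())     # numerically stabilised weights
    return float(np.sum(w * np.array(fs)) / np.sum(w))
```

With a rotation as integrator step (which preserves the Gaussian extended target exactly) all weights are equal and the estimator reduces to a plain average over all snippet states.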

2.2 Justification outline
We now outline the main ideas underpinning the theoretical justification of Alg. 2. Key to this is establishing that Alg. 2 is a standard SMC sampler targeting a particular sequence of probability distributions from which samples can be processed to approximate expectations with respect to . This has the advantage that no fundamentally new theory is required and that standard methodological ideas can be re-used in the present scenario, while the particular structure of can be exploited for new developments. This section focuses on identifying and establishing some of their important properties. Similar ideas are briefly touched upon in [14], but we will provide full details and show how these ideas can be pushed further in interesting directions.
First for let be measurable and invertible mappings, define , i.e. the distribution of when . It is worth pointing out that invertibility of these mappings is not necessary, but facilitates interpretation throughout, as illustrated by the last statement about the distribution of . Earlier we have focused on the scenario where for an integrator , but this turns out not to be a requirement, although it is our main motivation. Useful applications of this general perspective can be found in Subsection 2.3. Introduce the probability distributions on
for . We will show that Alg. 2 can be interpreted as an SMC sampler targeting the sequence of marginal distributions on
(8) |
which we may refer to as a mixture, for . Note that the conditional distribution is
(9) |
which can be computed in most scenarios of interest.
Example 2.
For assume the existence of a -finite dominating measure , therefore implying the existence of a density ; could be the Lebesgue measure. Assuming that is -invariant, or “volume preserving”, then Lemma 32 implies, for ,
and therefore
When is the Lebesgue measure and are not volume preserving, additional multiplicative Jacobian determinant-like terms may be required (see Lemma 32 and the additional requirement that be differentiable). However this extra term may be more complex and require application specific treatment; adoption of the measure theoretic notation circumvents such peripheral considerations at this stage, which can be ignored until actual implementation of the algorithm.
A central point throughout this paper is how samples from can be used to obtain samples from the marginal and vice-versa, thanks to the mixture structure relating the two distributions. Naturally, given , sampling and returning yields a sample from the marginal . Now assuming and then sampling naturally yields and hence intuitively . We now formally establish a more general result concerned with the estimation of expectations with respect to from samples from . For -integrable and , a change of variable (see Theorem 31) yields
which formalizes the fact, at first sight of limited interest, that is the distribution of for and therefore . The relevance of this remark comes from the identity
(10)
(11) |
which implies that samples from the mixture can be used to unbiasedly estimate thanks to an appropriate weighted average along the snippet . Using for formally establishes the earlier claim that if then . Note that, as suggested by Example 2, construction of the estimator will only require evaluations of the density and function at .
We now turn to the description of an SMC algorithm targeting , Alg. 3, and then establish that it is probabilistically equivalent to Alg. 2 in a sense made precise below. With for we introduce the following mutation kernel
(12) |
where we note that the refreshment kernel has the property that . One can show that the near-optimal kernel is given for any by
(13) |
and is well defined -a.s. (see Lemma 4, given at the end of this subsection) and satisfies the property that for any
that is is a conditional probability of given for the joint probability distribution . With the assumption , from Lemma 4, the SMC sampler importance weights at step are
(14) |
that is for such that ,
The corresponding standard SMC sampler is given in Alg. 3 where,
-
•
the weighted particles represent and represent from (11),
- •
-
•
represent ,
-
•
represent , and so do .
Notice that we assume here , and hence that , and that the weights appear as being computed twice in step 3 and step 14 when evaluating the resampling weights at the previous iteration, only because this facilitates exposition. The identities (11) and (9) suggest, for the estimator of , for -integrable,
(15) |
where the second line is correct under the assumptions of Example 2. Further, when for one can also write (11) for summable as
which can be estimated, using self-renormalization when required, with
(16) |
therefore justifying the estimator suggested in (7) in the particular case where the conditions of Example 2 are satisfied, once we establish the equivalence of Alg. 3 and Alg. 2 in Proposition 3. In fact it can be shown (Proposition 3) that, with referring to the expectation of the probability distribution underpinning Alg. ,
a form of Rao-Blackwellization implying lower variance for while the two estimators share the same bias. The result is in fact stronger since it states that the estimators are convex ordered, a property which we however do not exploit further here. Interestingly a result we establish later in the paper, Proposition 15, suggests that the variance is smaller than that of the standard Monte Carlo estimator assuming samples , due to the control variate nature of integrator snippets estimators.
An interesting point is that computation of the weight only requires knowledge of up to a normalizing constant, that is the estimator is unbiased if for even if is not completely known, while the estimator (7) will most often require self-normalisation, hence inducing a bias.
We now provide the probabilistic argument justifying Alg. 2 and the shared notation in Alg. 2 and Alg. 3.
Proposition 3.
Proof of Proposition 3.
In Alg. 3 for any we have with the random vector taking values in involved in the resampling step,
Letting for
from the tower property we have
Since
we deduce
Now notice that
and the first statement follows from conditional independence of and the fact that . Since by construction for any the second statement follows from a standard Markov chain argument. Now for and
Therefore for any ,
The third statement follows. ∎
Since the latter interpretation of the algorithm, as a standard SMC sampler targeting the instrumental distributions , is straightforward to justify and allows for further easy generalisations, we adopt this perspective in the remainder of the manuscript for simplicity.
This reinterpretation also allows the use of known facts about SMC sampler algorithms. For example it is well known that the output of SMC samplers can be used to estimate unbiasedly unknown normalising constants by virtue of the fact that, in the present scenario,
has expectation under . Now assume that the densities of are known up to a constant only, say and for , then
and the measure shares the same normalising constant as ,
Consequently
is an unbiased estimator of .
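The resulting unbiased estimator of the ratio of normalising constants can be sketched as follows: it is the standard SMC identity, the product over steps of the average unnormalised incremental weight. The toy one-step illustration below, with unnormalised density γ(x) = exp(−x²) and proposal N(0, 1), is ours; its normalising constant is √π.

```python
import numpy as np

def log_normalising_constant(unnormalised_weights_per_step):
    """Standard SMC estimate of log(Z_P / Z_0): the sum over SMC steps of
    the log of the average unnormalised incremental weight of the particles."""
    return float(sum(np.log(np.mean(np.asarray(w)))
                     for w in unnormalised_weights_per_step))
```

For a single importance sampling step with proposal N(0, 1) and target γ(x) = exp(−x²), the weights are w(x) = √(2π) exp(−x²/2) and the estimator converges to log √π ≈ 0.572.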
The results of the following lemma can be deduced from Lemma 8 but we provide more direct arguments for the present scenario.
Lemma 4.
Proof.
The first statement follows from the definition in (12) of and the identity (11). The second statement is a consequence of the following classical arguments, justifying Bayes’ rule. For fixed, consider the measure
implying from which we deduce the existence of a Radon-Nikodym derivative such that
For , we let
and note that almost surely and . For the third statement note that from the second statement, for we have and from Fubini’s and the theorem [7, Theorems 3.1 and 3.2] and are probability distributions coinciding on . To conclude proof of the third statement, for measurable, we successively apply the definition of the Radon-Nikodym derivative, Fubini’s theorem, use that , the first statement of this lemma, Fubini again, the second statement
therefore establishing that -almost surely
∎
2.3 Direct extensions
It should be clear that the algorithm we have described lends itself to numerous generalizations, which we briefly discuss below.
There is no reason to limit the number of snippets arising from a seed particle to one, therefore offering the possibility to take advantage of parallel machines. For example the velocity of a given seed particle can be refreshed multiple times, resulting in partial copies of the seed particle from which integrator snippets can be grown.
The main scenario motivating this work corresponds to the choice, for , of for a given . As should be apparent from the theoretical justification, this can be replaced by a general family of invertible mappings where the ’s are now not required to be measure preserving in general, in which case the expression for may involve additional terms of the “Jacobian” type. These mappings may correspond to integrators other than those of Hamilton’s equations but may be more general deterministic mappings, possibly with no temporal meaning, only representing the number of deterministic transformations used in the algorithm. More specifically, assuming that and for some -finite dominating measure and letting , the required weights are now of the form (see Lemmas 32-34)
Again when is the Lebesgue measure and a diffeomorphism, then is the absolute value of the determinant of the Jacobian of . Non-uniform weights may be ascribed to each of these transformations in the definition of (24). A more useful application of this generality is the situation where it is believed that using multiple integrators, specialised in capturing different geometric features of the target density, could be beneficial. Hereafter we simplify notation by setting and . For the purpose of illustration consider two distinct integrators , (again disappears from the notation) we wish to use, each for time steps with proportions such that . Again with define the mixture
which still possesses the fundamental property that if (resp. ), then with (resp. ) we have . It is possible to aim to sample from , in which case the pair plays the rôle played by in the earlier simpler scenario. However, in order to introduce the sought persistency, that is use either or when constructing a snippet, we focus on the scenario where the pair plays the rôle played by earlier. In other words the target distribution is now , which is to what in (24) was to . The mutation kernel corresponding to (24) is given by
with now the requirement that . A sensible choice seems to be with . With these choices using the near-optimal backward kernel simple substitutions and yields
and justifies the earlier abstract presentation.
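For a general, not necessarily volume-preserving, diffeomorphism the snippet log-weights acquire the “Jacobian” correction discussed above, with log-determinants accumulating along the snippet by the chain rule for compositions. The sketch below is ours; `psi_step` is assumed to return one transformed state together with the log absolute determinant of its Jacobian.

```python
import numpy as np

def snippet_log_weights(x0, v0, psi_step, log_mu_n, log_mu_prev, T):
    """Log-weights log w_k = log mu_n(psi^k(z)) + log|det J_{psi^k}(z)|
                             - log mu_{n-1}(z),   k = 0, ..., T,
    for a general (not necessarily volume-preserving) diffeomorphism psi.
    psi_step returns (x', v', log|det J_psi|); the log-determinant of the
    k-fold composition is the running sum of the per-step log-determinants.
    """
    base = log_mu_prev(x0, v0)
    x, v, cum_logdet = x0, v0, 0.0
    lw = [log_mu_n(x, v) - base]       # k = 0: identity map, log-det 0
    for _ in range(T):
        x, v, ld = psi_step(x, v)
        cum_logdet += ld
        lw.append(log_mu_n(x, v) + cum_logdet - base)
    return np.array(lw)
```

Setting the per-step log-determinant to zero recovers the volume-preserving weights of Example 2.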
Another direct extension consists of generalizing the definition of by assigning non-uniform and possibly state-dependent weights to the integrator snippet particles as follows
with for and such that for any ; this should be contrasted with (8). Such choices ensure that the central identity (11) can be generalized as follows
Note the analogy with the identity behind umbrella sampling [37, 39]. In the light of the developments of Subsection 4.3.2, a potentially useful choice could be , which however requires additional computations as
which may require computation of additional states for .
We leave exploration of some of these generalisations for future work, although we think that the choice of the leapfrog integrator is particularly natural and attractive here.
2.4 Rationale and computational considerations
Our initial motivation for this work was that computation of and most often involve common quantities, leading to negligible computational overhead, and reweighting offers the possibility to use all the states of a snippet in a simulation algorithm rather than the endpoint only. There are other reasons for which this approach may be beneficial.
The benefit of using all states along integrator snippets in sampling algorithms has been noted in the literature. For example, in the context of Hamiltonian integrators, with and , it is known that for the mapping is typically oscillatory, motivating for example the extra chance algorithm of [11]. The “windows of states” strategy of [28, 29] effectively makes use of the mixture as an instrumental distribution and improved performance is noted, with performance seemingly improving with dimension on particular problems; see Section 5 for a more detailed discussion. Further, averaging is well known to address scenarios where components of evolve on different scales and no single integration time can accommodate all scales [29, 22, Section 3.2]; averaging addresses this issue effectively, see Example 27. We also note that, keeping constant, in the limit as the average in (14) corresponds to Riemann-like numerical integration along the contour of , effectively leading to some form of Rao-Blackwellization of this contour; this is discussed in detail in Section 4.
Another benefit we have observed in the SMC context is the following. Use of a WF-SMC strategy involves comparing particles within each Markov snippet arising from a single seed particle, while our strategy involves comparing all particles across snippets, which proves highly beneficial in practice. This seems to bring particular robustness to the choice of the integrator parameters and , and can be combined with highly robust adaptation schemes taking advantage of the population of samples, in the spirit of [21, 22]; see Subsections 2.5 and 2.6.
At a computational level we note that, in contrast with standard SMC or WF-SMC implementations relying on an MH mechanism, the construction of integrator snippets does not involve an accept-reject mechanism, therefore removing control-flow operations and enabling lockstep computations on GPUs – actual implementation of our algorithms on such architectures is, however, left to future work.
Finally, when using integrators of Hamilton’s equations we naturally expect similar benefits to those enjoyed by HMC. Let and let . We know that in certain scenarios [4, 10], the distributions are such that for , , that is the weights do not degenerate to zero or one: in the context of SMC this means that the part of the importance weight (14) arising from the mutation mechanism does not degenerate. Further, an appropriate choice of schedule, i.e. of the sequence for , ensures that the contribution to the importance weights (14) is also stable as . As shown in [4, 2, 10], while direct importance sampling may require an exponential number of samples as grows, the use of such a schedule may reduce complexity to a polynomial order.
2.5 Numerical illustration: logistic regression
In this section, we consider sampling from the posterior distribution of a logistic regression model, focusing on the computation of the normalising constant. We follow [14] and consider the sonar dataset, previously used in [13]. With intercept terms, the dataset has responses and covariates , where . The likelihood of the parameters is then given by
(17) |
where . We ascribe a product of independent zero-mean normal distributions as a prior, with standard deviation equal to 20 for the intercept and 5 for the other parameters. Denoting the prior distribution of , we define a sequence of tempered distributions with densities of the form for non-decreasing and such that and . We apply both the Hamiltonian Snippet SMC and the implementation of waste-free SMC of [14] and compare their performance.
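The unnormalised tempered log-density used here can be sketched as follows; the names `X`, `y`, `t` and `prior_sd` are ours, labels are taken in {−1, +1}, and `prior_sd` would be the vector of prior standard deviations described above (20 for the intercept, 5 otherwise).

```python
import numpy as np

def log_tempered_posterior(beta, X, y, t, prior_sd):
    """Unnormalised log-density of pi_t(beta) ∝ prior(beta) * L(beta)^t,
    0 <= t <= 1, for the logistic likelihood (17) with labels y_i in {-1, +1}.
    log sigma(z) = -log(1 + exp(-z)) is evaluated stably via logaddexp."""
    log_prior = -0.5 * np.sum((beta / prior_sd) ** 2)  # indep. zero-mean normals, up to a constant
    z = y * (X @ beta)                                 # margins y_i * x_i^T beta
    log_lik = -np.sum(np.logaddexp(0.0, -z))           # sum_i log sigma(y_i x_i^T beta)
    return log_prior + t * log_lik
```

At t = 0 this reduces to the prior and at t = 1 to the posterior, matching the interpolating sequence above.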
For both algorithms, we set the total number of particles at each SMC step to be . For waste-free SMC, the Markov kernel is chosen to be a random-walk Metropolis-Hastings kernel with covariances adaptively computed as , where is the empirical covariance matrix obtained from the particles in the previous SMC step. For the Hamiltonian Snippet SMC algorithm, we set to be the one-step leapfrog integrator with stepsize , and a . To investigate the stability of our algorithm, we ran Hamiltonian Snippet SMC with and . For both algorithms, the temperatures are adaptively chosen so that the effective sample size (ESS) at the current SMC step is , where is the maximum ESS achievable at the current step. In our experiments, we have chosen and for both algorithms.
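The adaptive choice of the next temperature can be sketched as a bisection on the ESS of the incremental weights, which depend on the temperature increment only through the per-particle log-likelihoods. This is a common implementation of the rule described above; the function and parameter names are ours.

```python
import numpy as np

def next_temperature(log_lik, t_prev, target_ess_frac, tol=1e-10):
    """Find t in (t_prev, 1] such that the ESS fraction of the incremental
    weights w_i ∝ exp((t - t_prev) * log_lik_i) equals target_ess_frac,
    by bisection (the ESS fraction decreases from 1 as t grows)."""
    def ess_frac(t):
        lw = (t - t_prev) * log_lik
        w = np.exp(lw - lw.max())
        return np.sum(w) ** 2 / (len(w) * np.sum(w ** 2))
    if ess_frac(1.0) >= target_ess_frac:
        return 1.0                      # can jump straight to the final target
    lo, hi = t_prev, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess_frac(mid) > target_ess_frac:
            lo = mid                    # ESS still too high: increase t
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Each SMC iteration then sets the new temperature to `next_temperature(log_lik, t_prev, alpha)` for the chosen ESS fraction α.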
Performance comparison
Figure 2 shows the boxplots of estimates of the logarithm of the normalising constant obtained from both algorithms, for different choices of and for the Hamiltonian Snippet SMC algorithm. The boxplots are obtained by running both algorithms 100 times for different values of the algorithm parameters, with in all setups. Several points are worth observing. For a suitable choice of , the Hamiltonian Snippet SMC algorithm can produce stable and consistent estimates of the normalising constant with particles at each iteration. On the other hand, the waste-free SMC algorithm fails to produce accurate results for the same computational budget. It is also clear that with larger values of (meaning smaller values of and hence shorter snippets), the waste-free SMC algorithm produces results with larger biases and variability. For the Hamiltonian Snippet SMC algorithm, the results are stable both for short and long snippets when is equal to or . Another point is that when or , the Hamiltonian Snippet SMC algorithm becomes unstable with short (i.e. ) and long (i.e. ) snippets. A possible reason is that for too small a stepsize the algorithm is not able to explore the target distribution efficiently, resulting in unstable performance. On the other hand, when the stepsize is too large, the leapfrog integrator becomes unstable, affecting the variability of the weights; this is a common problem of HMC algorithms, and a long trajectory then results in deteriorated estimates. Hence, to obtain the best performance, one should tune the stepsize and trajectory length along with the SMC steps.

In Figure 3 we display boxplots of the estimates of the posterior expectations of the mean of all coefficients, i.e. . This quantity is referred to as the mean of marginals in [14] and we use this terminology. The same behaviour can be observed for the estimates of the mean of marginals, with the instability problems exacerbated for small and large stepsizes.

Computational Cost
In this section, we compare the running times of both algorithms. Since the calculations of the potential energy and its gradient often share common intermediate steps, we can recycle these to save computational cost. As waste-free SMC also requires density evaluations, the Hamiltonian Snippet SMC algorithm does not require significant additional computation. Figure 4 shows boxplots of the simulation times of both algorithms over 100 runs. The simulations were run on an Apple M1 Pro CPU with 16GB of memory. One can see that in comparison to waste-free SMC the additional computational time of the Hamiltonian Snippet SMC algorithm is only marginal, and mostly due to the larger memory needed to store the intermediate values.

2.6 Numerical illustration: simulating from filamentary distributions
We now illustrate the interest of integrator snippets in a scenario where the target distribution possesses specific geometric features. Specifically, we focus here on distributions concentrated around a manifold defined as the zero level set of a smooth function
sometimes referred to as filamentary distributions [12, 24, 25]. Such distributions arise naturally in various scenarios, including inverse problems, relaxations of problems where the support of the distribution of interest is included in , or generalisations of the Gibbs sampler through disintegration [12, 24, 25]. Such is the case for ABC methods in Bayesian statistics [3]. Assume that is a probability density with respect to the Lebesgue measure defined on and for consider a “kernel” function where and define the probability density
The corresponding probability distribution can be thought of as an approximation of the probability distribution of density with respect to the Hausdorff measure on . Typical choices for the kernel are or . Strong anisotropy may result from such constraints and make exploration of the support of such distributions awkward for standard Monte Carlo algorithms. This is illustrated in Fig. 5 where a standard MH-HMC algorithm is used to sample from defined on , for three values , and performance is observed to deteriorate as decreases. The samples produced are displayed in blue: for HMC-MH mixes well and explores the support rapidly, but for the chain gets stuck in a narrow region of at the bottom, near initialisation, while for , no proposal is ever accepted in the experiment.

To illustrate the properties of integrator snippet techniques we consider the following toy example. For and let
for a symmetric positive matrix and , that is we consider the restriction of a standard normal distribution around an ellipsoid defined by .
In order to explore the support of the target distribution we use two mechanisms based on reflections either through tangent hyperplanes of equicontours of or through the corresponding orthogonal complements. More specifically for such that let and define the tangential HUG (THUG) and symmetrically-normal HUG (SNUG) as , , and respectively, with and as in Example 1 and (25). Both updates can be understood as discretisations of ODEs and are volume preserving. Intuitively for with for and we have and . This is illustrated in Fig. 6 for . Further, for an initial state , trajectories of the first component of remain close to while follows the gradient field and leads to hops across equicontours.
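Single THUG and SNUG steps can be sketched as follows under our reading of the description above: a half position step, a velocity reflection at the midpoint gradient (through the tangent hyperplane for THUG, through the normal direction for SNUG), and another half position step. The exact formulation in (25) may differ; `grad_f` and `delta` are our names.

```python
import numpy as np

def thug_step(x, v, grad_f, delta):
    """One tangential Hug (THUG) step: half position step, reflect the
    velocity through the tangent hyperplane of the level set of f at the
    midpoint, half position step. Volume preserving; keeps the trajectory
    close to a contour of f."""
    x = x + 0.5 * delta * v
    g = grad_f(x)
    g = g / np.linalg.norm(g)
    v = v - 2.0 * np.dot(v, g) * g     # keep tangential component, flip normal
    x = x + 0.5 * delta * v
    return x, v

def snug_step(x, v, grad_f, delta):
    """One SNUG step: reflect through the normal direction instead, so the
    normal component is kept and the tangential component is flipped,
    producing hops across the contours of f."""
    x = x + 0.5 * delta * v
    g = grad_f(x)
    g = g / np.linalg.norm(g)
    v = 2.0 * np.dot(v, g) * g - v
    x = x + 0.5 * delta * v
    return x, v
```

For f(x) = |x|²/2 the THUG step preserves |x| exactly on a circle, and both reflections preserve the speed |v|, consistent with the volume-preservation claim above.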

The sequence of target distributions on we consider is of the form
and the Integrator Snippet SMC is defined through the mixture, for ,
We compare performance of integrator snippet with an SMC Sampler relying on a mutation kernel consisting of a mixture of two updates targeting each iterating times an MH kernel applying integrator or once after refreshing the velocity; the backward kernels are chosen to correspond to the default choices discussed earlier. In the experiments below we set , , and is the diagonal matrix alternating ’s and ’s along the diagonal. We used particles, sampled from at time zero and is set to the maximum distance of these particles in the domain from and for both algorithms. We compared the two samplers across three metrics; all results are averaged over runs.
Robustness and accuracy: we fix the step size for SNUG to and run both algorithms for a grid of values of and using a standard adaptive scheme based on ESS to determine until a criterion described below is satisfied, and retain the final tolerance achieved, recorded in Fig. 7. As a result the terminal value and computational costs differ between the two algorithms: the point of this experiment is only to demonstrate the robustness and potential accuracy of Integrator Snippets. Both the SMC sampler and the Integrator Snippet are stopped when the average probability of leaving the seed particle drops below .
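The ESS-based adaptive scheme referred to above can be sketched as follows; the bisection rule, Gaussian kernel and particle values are illustrative assumptions, with the incremental weight of particle i taken to be k_ε(f_i)/k_{ε_prev}(f_i).

```python
import math

def ess(logw):
    """Effective sample size from unnormalised log-weights."""
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    s = sum(w)
    return s * s / sum(wi * wi for wi in w)

def next_tolerance(f_vals, eps_prev, alpha=0.5, iters=50):
    """Bisect for eps <= eps_prev at which the ESS of the incremental
    weights k_eps(f) / k_eps_prev(f) is roughly alpha * N."""
    target = alpha * len(f_vals)
    lo, hi = 0.0, eps_prev
    for _ in range(iters):
        eps = 0.5 * (lo + hi)
        logw = [0.5 * (f / eps_prev) ** 2 - 0.5 * (f / eps) ** 2 for f in f_vals]
        if ess(logw) < target:
            lo = eps   # tolerance too aggressive: ESS fell below target
        else:
            hi = eps   # ESS still above target: try a smaller tolerance
    return hi

f_vals = [0.1 * i for i in range(1, 21)]   # hypothetical |f| values of 20 particles
eps = next_tolerance(f_vals, eps_prev=1.0)
print(eps)
```

The returned tolerance is the smallest value for which the incremental ESS stays at or above the chosen fraction of the particle count.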
Our proposed algorithm consistently achieves a final value of that is two orders of magnitude smaller and is more robust to the choice of stepsize.

Variance: for this experiment we set steps, and determine and compare the variances of the estimates of the mean of for the final SMC step, for which . To improve comparison, and in particular ensure comparable computational costs, both algorithms share the same schedule , determined adaptively by the SMC algorithm in a separate pre-run.
The results are reported as componentwise boxplots in Fig. 8 where we observe a significant variance reduction for comparable computational cost.

Efficiency: we report the Expected Squared Jump Distance (ESJD) as a proxy of distance travelled by the two algorithms. For Integrator Snippets, it is possible to estimate this quantity as follows for a function
We report the average of this metric in Table 1 for the functions , normalised by total runtime in seconds for all particles (first row), for the particles using the THUG update (second row) and those using the SNUG update (third row), with standard deviations shown in parentheses. Our proposed algorithm is several orders of magnitude more efficient in the exploration of than its SMC counterpart, thanks to its ability to take full advantage of all the intermediary states of the snippet; the SMC sampler, in contrast, creates trajectories of random length.
Table 1: average ESJD normalised by runtime, Integrator Snippet vs SMC (standard deviations in parentheses).
2.7 Links to the literature
Alg. 2, and its reinterpretation Alg. 3, are reminiscent of various earlier contributions and we discuss here parallels and differences. This work was initiated in [42] and pursued and extended in [18].
Readers familiar with the “waste-free SMC” algorithm (WF-SMC) of [14] will notice similarities: there, as in Alg. 2, seed particles are extended with an MCMC kernel leaving invariant at iteration , yielding particles subsequently whittled down to new seed particles. A first difference is that while generation of an integrator snippet can be interpreted as applying a sequence of deterministic Markov kernels (see Subsection 3.3 for a detailed discussion), we show that the mutation kernels involved do not have to leave invariant; in fact we show in Section 3 that this indeed is not a requirement, therefore offering more freedom. Further, it is instructive to compare our procedure with the following two implementations of WF-SMC using an HMC kernel (4) for the mutation. A first possibility for the mutation stage is to run steps of an HMC update in sequence, where each update uses one integrator step. Assuming no velocity refreshment along the trajectory, this would lead to the exploration of a random number of states of our integrator snippet due to the accept/reject mechanism involved; incorporating refreshment or partial refreshment would similarly lead to a random number of useful samples. Alternatively one could consider a mutation mechanism consisting of an HMC kernel built around integration steps, where the endpoint of the trajectory is the mutated particle; this would obviously mean discarding potentially useful candidates. To avoid possible reader confusion, we note apparent typos in [14, Proposition 1], provided without proof, where the intermediate target distribution of the SMC algorithm, in our notation (see (18)), seems improperly stated. The statement is however not used further in that paper.
Our work shares, at first sight, similarities with [33, 38], but differs in several respects. In the discrete time setup considered in [38] a mixture of distributions similar to ours is also introduced. Specifically, for a sequence such that and , the following mixture is considered (in the notation of [38] their transformation is ; we stick to our notation for the sake of comparison and to avoid confusion with our ),
The intention is to use the distribution as an importance sampling proposal to estimate expectations with respect to ,
where the last line holds when , and . The rearrangement on the second and third lines simply captures the fact, noted earlier in the present paper, that with and then . As such NEO relies on generating samples from first, typically exact samples, which then undergo a transformation and are then properly reweighted. In contrast we aim to sample from a mixture of the type directly, typically using an iterative method, and then exploit the mixture structure to estimate expectations with respect to . Note also that some of the terms involved in the renormalization may require additional computations beyond terms for which . Our approach relies on a different importance sampling identity
We note however that the conformal Hamiltonian integrator used in [33, 38] could be used in our framework, which we leave for future investigations. Their NEO-MCMC targets a “posterior distribution” and is inspired by the identity
which is an expectation of with respect to
and they consider algorithms targeting which rely on exact samples from to construct proposals, resulting in a strategy which shares the weaknesses of an independent MH algorithm.
Much closer in spirit to our work is the “window of states” idea proposed in [28, 29] in the context of HMC algorithms. Although the MCMC algorithms of [28, 29] ultimately target and are not reversible, they involve a reversible MH update with respect to and additional transitions permitting moves from to and from to ; see Section 5 for full details.
A link we have not explored in the present manuscript is that to normalising flows [36, 26] and the related literature. In particular the ability to tune the parameter of the leapfrog integrator suggests that our methodology lends itself to learning normalising flows. We finally note an analogy of such approaches with umbrella sampling [37, 39], although the targeted mixture is not of the same form, and with the “Warp Bridge Sampling” idea of [27].
3 Sampling Markov snippets
In this section we develop the Markov snippet framework, largely inspired by the WF-SMC framework of [14] but provide here a detailed derivation following the standard SMC sampler framework [16] which allows us to consider much more general mutation kernels; integrator snippet SMC is recovered as a particular case. Importantly we provide recipes to compute some of the quantities involved using simple criteria (see Lemma 9 and Corollary 10) which allow us to consider unusual scenarios such as in Subsection 3.4.
3.1 Markov snippet SMC sampler or waste free SMC with a difference
Given a sequence of probability distributions defined on a measurable space introduce the sequence of distributions defined on such that for any
where for , and any ,
is assumed to exist for now and . This yields the marginals
(18)
Further, consider such that and define
Note that [14] set , which we do not want in our later application, and further assume that is invariant, which is not necessary and too constraining for our application in Section 3.3 (corresponding to the application in the introductory Subsection 2.2). Only the condition is required. As in Subsection 2.2 we consider the optimal backward kernel , given for by
and as established in Lemma 6 and Lemma 8, one obtains
(19)
The corresponding folded version of the algorithm is given in Alg. 4, with …:
- represents ,
- represent ,
- represents and so do .
Remark 5.
Other choices are possible and we detail here an alternative, related to the Waste-free framework of [14]. With notation as above here we take
with the weights now defined as
With these choices we retain the fundamental property that for , with then . Now with
assuming , we have the property that for any , yielding
Finally choosing to be the optimized backward kernel, the importance weight of the algorithm is,
We note that continuous time Markov process snippets could also be used. For example piecewise deterministic Markov processes such as the Zig-Zag process [6] or the Bouncy Particle Sampler [8] could be used in practice since finite time horizon trajectories can be parametrized in terms of a finite number of parameters. We do not pursue this here.
3.2 Theoretical justification
In this section we provide the theoretical justification for the correctness of Alg. 4 and an alternative proof for Alg. 3, seen as a particular case of Alg. 4.
Throughout this section we use the following notation where (resp. ) plays the rôle of (resp. ), for notational simplicity. Let and be two Markov kernels such that the following condition holds. For let and for assume that , implying the existence of the Radon-Nikodym derivatives
(20)
with the convention . We let . For , define
and introduce the mixture of distributions, defined on ,
(21)
where for
so that one can write with .
We first establish properties of , showing how samples from can be used to estimate expectations with respect to .
Lemma 6.
For any such that
1. we have for
2. with ,
   (a) then
   (b) in particular,
   (22)
Proof.
The first statement follows directly from the definition of the Radon-Nikodym derivative
The second statement follows from the definition of and
and the first statement. The last statement follows from the tower property for expectations. ∎
Corollary 7.
Assume that , then
is an unbiased estimator of since we notice that for ,
This justifies algorithms which sample from the mixture directly in order to estimate expectations with respect to .
Let and be derived from in the same way is from in (21), but for possibly different Markov kernels . Now define the Markov kernel
(23)
with . Remark that and can be made dependent on both and , provided they satisfy all the conditions stated above and below.
The following justifies the existence of in (19) for a particular choice of backward kernel and provides a simplified expression for particular choices of .
Lemma 8.
Proof.
The first statement is a standard result and is a conditional expectation. We however provide the short argument taking advantage of the specific scenario. For a fixed consider the finite measure
such that , that is . Consequently we have the existence of a Radon-Nikodym derivative such that
which indeed has the sought property. This is a kernel since for any fixed , for any , and therefore -a.s., with equality for . For the second statement we have, for any ,
and we can apply (22) in Lemma 6 with the substitutions and to conclude. For the third statement, since then because for any ,
where is such that . In either case, since , this implies that
that is from the definition of . Consequently
and indeed from Fubini’s and Dynkin’s theorems and
Now for and using the second statement and ,
We therefore conclude that
where the last inequality holds only when . ∎
The following result is important in two respects. First it establishes that if satisfy a simple property then always have a simple expression in terms of certain densities of and – this implies in particular that in Subsection 3.1 the kernel is not required to leave invariant to make the method implementable [14]. Second it provides a direct justification of the validity of advanced schemes – see Example 14. This therefore establishes that generic and widely applicable sufficient conditions for to be tractable are and the notion of -reversibility.
Lemma 9.
Let , be a -finite measure on such that and assume that we have a pair of Markov kernels such that
(24)
We call this property -reversibility. Then for such that we have
Proof.
For such that we have
∎
Corollary 10.
Let and be a -finite measure such that , let such that
then for any such that and
and provided we can deduce
where we have used Lemma 34.
We have shown earlier that standard integrator-based mutation kernels used in the context of Monte Carlo methods satisfy (24) with the Lebesgue measure, but other scenarios are possible, such as that of the preconditioned Crank–Nicolson (pCN) algorithm where is the distribution of a Gaussian process.
3.3 Revisiting sampling with integrator snippets
In this scenario we have assumed to have a density w.r.t. a -finite measure , and is an invertible mapping such that and with such that . In this manuscript we focus primarily on the scenario where is a discretization of Hamilton’s equations for a potential , e.g. a leapfrog integrator. We consider now the scenario where, in the framework developed in Section 3.1, we let be the deterministic kernel which maps the current state to . Define and ; we exploit the ideas of [1, Proposition 4] to establish that is the adjoint of if is invariant under .
Lemma 11.
Let be a probability measure and a -finite measure, on such that . Denote for any . Let be an invertible and volume preserving mapping, i.e. such that for all , then
1. form a reversible triplet, that is for all ,
2. for all such that
Proof.
For the first statement
We have
As a result
∎
Corollary 12.
With the assumptions of Lemma 11 above for the weight (20) for admits the expression
Further, for a -invariant Markov kernel the expression for the weight (19) becomes
hence recovering the expression used in Section 2. This together with the results of Section 3.1 provides an alternative justification of correctness of Alg. 3 and hence Alg. 2. Note that this choice for corresponds to the so-called optimal scenario; this can be deduced from Lemma 11 or by noting that
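The weight expression of Corollary 12 can be checked numerically in a toy setting. The following is a minimal sketch, assuming a standard normal target in both position and velocity and one leapfrog step for the (volume-preserving) integrator, so that the unnormalised weight of the k-th state of a snippet started at z is the joint density ratio π̄(ψᵏ(z))/π̄(z).

```python
import math, random

def leapfrog(x, v, step):
    """One leapfrog step for H(x, v) = x^2/2 + v^2/2 (standard normal target,
    grad U(x) = x); the map is volume preserving."""
    v = v - 0.5 * step * x
    x = x + step * v
    v = v - 0.5 * step * x
    return x, v

def log_pi_bar(x, v):
    return -0.5 * (x * x + v * v)    # unnormalised joint log-density

random.seed(1)
T, step, N = 10, 0.3, 4000
num = den = 0.0
for _ in range(N):
    x, v = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)   # seed particle z
    lz = log_pi_bar(x, v)
    for _k in range(T + 1):
        w = math.exp(log_pi_bar(x, v) - lz)   # pi_bar(psi^k(z)) / pi_bar(z)
        num += w * x * x                      # accumulate for E[x^2] = 1
        den += w
        x, v = leapfrog(x, v, step)
est = num / den
print(est)
```

Self-normalising the weights over all snippet states yields a consistent estimator; here the estimate of the second moment should be recovered up to Monte Carlo error.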
3.4 More complex integrator snippets
Here we provide examples of more complex integrator snippets to which earlier theory immediately applies thanks to the abstract point of view adopted throughout.
Example 13.
Let , so that is well defined. Let, for , be invertible and such that , for such that and . Then one can consider the delayed rejection MH transition probability
with and given below
A particular example is when is the Lebesgue measure on and is volume preserving, for instance the leapfrog integrator for Hamilton’s equations for some potential or the bounce update (25). Following [1] notice that has density with respect to the measure . Now define , we have
and with
For instance can be an HMC update, while
(25)
for some unit length vector field . This can therefore be used as part of Alg. 4; care must be taken when computing the weights and ; Subsections 3.1–3.3 provide the tools to achieve this. For example, in the situation where is the Lebesgue measure on and then we recover Alg. 3 or equivalently Alg. 2.
Example 14.
An interesting instance of Example 13 is concerned with the scenario where interest is in sampling constrained to some set such that . Define constrained to , and similarly . We let be defined as above but targeting . Naturally has a density w.r.t. , for and for we have , . Consequently and and as a result
The corresponding kernel is described algorithmically in Alg. 5. In the situation where for a continuously differentiable function , the bounces described in (25) can be defined in terms of the field such that
This justifies the ideas of [5], where a process of the type given in Alg. 5 is used as a proposal within a MH update, although the possibility of a rejection after the second stage seems to have been overlooked in that reference.
Naturally a rejection of both transformations and of the current state means that the algorithm gets stuck. We note that it is also possible to replace the third update case with a full refreshment of the velocity, which can be interpreted as a third delayed rejection update, of acceptance probability one.
3.5 Numerical illustration: orthant probabilities
In this section, we consider the problem of calculating Gaussian orthant probabilities, given by
where are known vectors of dimension and with a covariance matrix of size . Consider the Cholesky decomposition of which is given by , where is a lower triangular matrix with positive diagonal entries. It is clear that can be viewed as , where . Consequently, one can instead rewrite as a product of probabilities given by
(26)
and
(27)
for . For notational simplicity, we let denote the interval with the convention . Then, can be written as the product of for . Moreover, one can see that is also the normalising constant of the conditional distribution of given . To calculate the orthant probability, [32] proposed an SMC algorithm targeting the sequence of distributions for , given by
(28)
(29)
where denotes the probability density of a . One could also note that
and , where represents the probability of a standard normal random variable being in the region . Therefore, the SMC algorithm proposed by [32] proceeds as follows. (1) At time , particles are extended by sampling . (2) Particles are then reweighted by multiplying their weights by the incremental weights . (3) If the ESS is below a certain threshold, resample the particles and move them through an MCMC kernel that leaves invariant for iterations. For the MCMC kernel at step (3), [32] recommended a Gibbs sampler leaving invariant. The orthant probability we are interested in can then be viewed as the normalising constant of and can be estimated as a by-product of the SMC algorithm.
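A compact sketch of the sequential decomposition above, in the spirit of [32] but omitting the reweighting, resampling and move steps: coordinates of η are drawn one at a time from the appropriate truncated standard normals by inverse-CDF sampling, and the conditional probabilities are multiplied. Function and variable names are hypothetical.

```python
import math, random
from statistics import NormalDist

STD = NormalDist()

def orthant_prob(L, a, b, n_particles=2000, seed=0):
    """Estimate P(a <= L eta <= b) for eta ~ N(0, I): coordinates of eta are
    drawn sequentially from truncated standard normals and the product of the
    conditional probabilities p_t is averaged over particles."""
    rng = random.Random(seed)
    d = len(a)
    total = 0.0
    for _ in range(n_particles):
        eta = []
        prod = 1.0
        for t in range(d):
            shift = sum(L[t][s] * eta[s] for s in range(t))
            lo = (a[t] - shift) / L[t][t]
            hi = (b[t] - shift) / L[t][t]
            plo, phi = STD.cdf(lo), STD.cdf(hi)
            prod *= phi - plo                      # conditional probability p_t
            u = plo + rng.random() * (phi - plo)   # inverse-CDF truncated draw
            eta.append(STD.inv_cdf(u))
        total += prod
    return total / n_particles

# Independent 2-d check: P(X1 >= 0, X2 >= 0) = 0.25 for the identity Cholesky factor
print(orthant_prob([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [math.inf, math.inf]))
```

For a diagonal Cholesky factor the conditional probabilities are deterministic and the estimator is exact; correlation enters only through the shift term built from the previously drawn coordinates.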
Since we are sampling from constrained Gaussian distributions, Hamilton’s equations can be solved exactly and is always . As a result, the incremental weights for the trajectories simplify to and each particle on the trajectory starting from has an incremental weight proportional to . To obtain a trajectory, we follow [30], who perform HMC with reflections. As the dimension increases, the number of reflections performed under will also increase for a fixed integration time. We adaptively tuned the integration time to ensure that the average number of reflections at each SMC step does not exceed a given threshold, set to in our experiment. To show that the waste-recycling HSMC algorithm scales well in high dimension, we set , and . Also, we use the same covariance matrix as in [14] and perform variable re-ordering as suggested in [32] before the simulation.


Figures 9 and 10 show the results obtained with and various values of . With a quarter of the number of particles used in [14], the waste-recycling HSMC algorithm achieves comparable performance when estimating the normalising constant (i.e. the orthant probability). Moreover, the estimates are stable for different choices of , although one observes that the algorithm achieves its best performance when (i.e. each trajectory contains particles). This suggests that the integration time should be adaptively tuned in a different way to achieve the best performance for a fixed computational budget. Estimates of the function with respect to the Gaussian distribution truncated between and are also stable for different choices of , although they are more variable than those obtained in [14]. This indicates that the waste-recycling HSMC algorithm does scale well in high dimension. We note that this higher variance compared to the waste-free SMC of [14] is obtained in a scenario where they are able to exploit the particular structure of the problem and implement an exact Gibbs sampler to move the particles. The integrator snippet SMC we propose is however more general and applicable to scenarios where such structure is not present.
4 Preliminary theoretical exploration
In this section we omit the index and provide precise elements to understand the properties of the algorithms and estimators considered in this manuscript.
4.1 Variance using folded mixture samples
This section establishes variance reduction for an estimator of of the type (15), but where it is assumed that are iid distributed according to . While this is not the exact distribution in the algorithm, it provides a proxy representative of the quantities one needs to control when analysing SMC samplers [15, Chapter 9]: the variance of estimators in an SMC sampler can be decomposed as the sum of local variances at each iteration, which converge to the variance terms considered hereafter as . The main message is that can be understood as playing the rôle of control variates, with the potential of reducing variance.
We use the following simplified notation throughout. Let
with . The fundamental property exploited throughout is (11), which can be rephrased as, for -integrable,
or written differently with
The estimator we use is therefore a so-called “Rao-Blackwellized estimator”, assuming ,
of variance . The following relates this variance to that of the standard estimator using iid samples from , which would likewise play a role in the asymptotic properties of SMC algorithms.
Proposition 15.
We have
Proof.
The variance decomposition identity yields
but from the fundamental property
∎
The following provides us with a notion of effective sample size for estimators of the form .
Theorem 16.
Assume that for , then with ,
1. for
2. further
Remark 17.
Multiplying by any constant does not alter the values of the upper bounds, making the results relevant to the scenario where is known up to a constant only. From the second statement we can define a notion of effective sample size (ESS)
which could be compared to or .
Proof.
4.2 Variance using unfolded mixture samples
For invertible, a probability distribution such that and such that exists we have the identity
As a result for , all invertible and such that for , we have
(30)
which suggests the Pushforward Importance Sampling (PISA) estimator, for , and with
(31)
This is the estimator in (7) when and .
4.2.1 Relative efficiency for unfolded estimators
In order to define the notion of relative efficiency for the estimator in (31) we first establish the following bounds.
Theorem 18.
With the notation above and throughout, for any
1.
2. more simply
(32)
3. and
Remark 19.
The upper bound in the first statement confirms the control variate nature of integrator snippets, even when using the unfolded perspective, a property missed by the rougher bounds of the last two statements.
Remark 20 (ESS for PISA).
The notion of efficiency is usually defined relative to the “perfect” Monte Carlo scenario that is the standard estimator of relying on iid samples from for which we have
(33)
The is determined by the ratio of the upper bound in (32) to (33). Our point below is that the notion of efficiency can be defined relative to any competing algorithm, virtual or not, in order to characterize particular properties. For example we can compute the efficiency relative to that of the “ideal” PISA estimator, i.e. for which is replaced with , and
(34)
The corresponding captures the loss incurred because of dependence along a snippet. However, given our initial motivation of recycling the computation of a standard HMC based SMC algorithm we opt to define the relative to the estimator relying on both ends of the snippet only, i.e.
In the SMC scenario considered in this manuscript (see Section 1) the above can be thought of as a proxy for estimators obtained by a “Rao-Blackwellized” SMC algorithm using in (6), where particles in Alg. 1 give rise to weighted particles
with defined in (5). Resampling with these weights is then applied to obtain particles and followed by an update of velocities to yield . Now, we observe the similarity between
(35)
and the corresponding weight in Alg. 2, in particular when and are similar, and hence and . This motivates our choice of reference to define which has a clear computational advantage since it involves ignoring terms only. In the present scenario, following Lemma 21, we have
which leads to the relative efficiency for PISA,
which can be estimated using empirical averages.
Proof of Theorem 18.
For clarity we reproduce the very useful lemma [38, Lemma 6], correcting a couple of minor typos along the way.
Lemma 21.
Let be two integrable random variables satisfying almost surely for some and let and . Then
4.2.2 Optimal weights
As mentioned in Subsection 2.3 it is possible to consider the more general scenario where unequal probability weights are ascribed to the elements of the snippets, yielding the estimator
and a natural question is that of the optimal choice of . Note that in the context of PISA the condition is not required, as suggested by the justification of the identity (30). However it should be clear that this condition should be enforced in the context of integrator snippet SMC, since the probabilistic interpretation is otherwise lost, unless the expectation is known to be non-negative. Here we discuss optimization of the variance upper bound provided by Lemma 21,
where for ,
It is a classical result that , and that the minimum is reached at the eigenvector(s) corresponding to the smallest eigenvalue of . If the constraint of non-negative entries is to be enforced then a standard quadratic programming procedure should be used. The same ideas can be applied to the function-independent upper bounds used in our definitions of efficiency.
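For intuition, a small closed-form illustration in a hypothetical 2×2 case: the unit-norm weight vector minimising the quadratic form wᵀAw is the eigenvector associated with the smallest eigenvalue of A, and the attained minimum equals that eigenvalue.

```python
import math

def min_quadratic_unit_norm(A):
    """Unit-norm minimiser of w^T A w for a symmetric 2x2 matrix A: the
    eigenvector of the smallest eigenvalue, computed in closed form."""
    a, b, c = A[0][0], A[0][1], A[1][1]
    lam = 0.5 * (a + c) - math.sqrt(0.25 * (a - c) ** 2 + b * b)  # smallest eigenvalue
    if abs(b) > 1e-15:
        w = (lam - c, b)          # satisfies (A - lam I) w = 0
    else:
        w = (1.0, 0.0) if a <= c else (0.0, 1.0)
    n = math.hypot(w[0], w[1])
    return (w[0] / n, w[1] / n), lam

w, lam = min_quadratic_unit_norm([[2.0, 1.0], [1.0, 2.0]])
print(w, lam)   # the minimum value w^T A w equals the smallest eigenvalue
```

In larger dimensions the same computation would be delegated to a symmetric eigensolver, or to a quadratic program when non-negativity of the weights is enforced.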
4.3 More on variance reduction and optimal flow
We now focus on some properties of the estimator of in (16). To facilitate the analysis and later developments we consider the scenario where, with the flow solution of an ODE, the dominating measure is and is invariant under the flow. We consider
and notice that similarly to the integrator scenario, for any , we have . where for any , where
the following estimator of is considered, with
We now show that in the examples considered in this paper our approach can be understood as implementing unbiased Riemann sum approximations of line integrals along contours. Adopting a continuous time approach can be justified as follows: the conditions of the following proposition are satisfied for numerous integrators; this is the case for the leapfrog integrator of Hamilton’s equations, see e.g. [20, 38, Theorem 3.4] and [38, Appendix 3.1, Theorem 9] for detailed results.
Proposition 22.
Let , for any let be a flow and for let be a discretization of such that for any there exists such that for any
Then for any continuous such that the Riemann integral
exists we have
Proof.
We have
and we can conclude. ∎
Remark 23.
Naturally convergence is uniform in with additional assumptions and we note that in some scenarios dependence on can be characterized, allowing in principle to grow with .
4.3.1 Hamiltonian contour decomposition
Assume has a probability density with respect to the Lebesgue measure and let be Lipschitz continuous such that for all , then the co-area formula states that
where is the -dimensional Hausdorff measure, used here to measure length along the contours . For example in the HMC setup where and one may choose , leading to a decomposition of the expectation according to equi-energy contours of
We now show how the solutions of Hamilton’s equations could be used as the basis for estimators mixing Riemann-sum-like and Monte Carlo estimation techniques.
Favourable scenario where .
Let such that and assume that for some Hamilton’s equations and have solutions with for some , that is the contours can be parametrised with the solutions of Hamilton’s equation at corresponding level.
Proposition 24.
We have
implying that, assuming the integral along the path is tractable,
(36)
is an unbiased estimator of .
Proof.
A remarkable point is that the strategy developed in this manuscript provides a general methodology to implement numerically the ideas underpinning the decomposition of Proposition 24 by using the estimator in (15) and invoking Proposition 22, assuming to be known. This point is valid outside of the SMC framework and it is worth pointing out that in (15) is unbiased if the samples used are marginally from .
Note the promise of dimension-free estimators if the one-dimensional line integrals in (36) were easily computed and sampling from the one-dimensional energy distribution were routine – the scenario is however more subtle.
General scenario
In the scenario where the co-area decomposition still holds, but the solution to Hamilton’s equations can in general not be used to compute integrals over the hypersurface . This would require a form of ergodicity [35, 40] of the form, for ,
where the limit always exists in the sense and constitutes von Neumann’s mean ergodic theorem [41, 35], and the rightmost equality forms Boltzmann’s conjecture. An interesting property in the present context is that for and one could replicate the variance reduction properties developed earlier. Boltzmann’s conjecture has long been disproved, as numerous Hamiltonians can be shown not to lead to ergodic systems, although some sub-classes do. However a weaker, or local, form of ergodicity can hold on the sets of a partition of
Example 25 (Double well potential).
Consider the scenario where , and kinetic energy [34]. Elementary manipulations show that satisfying Hamilton’s equations (2) imposes and therefore requires – importantly the intervals are not connected for . Rearranging terms any solution of (2) must satisfy
that is, the velocity is a function of the position in the double well, maximal for , vanishing as , with a sign flip at the endpoints of the intervals. Therefore the system is not ergodic, but ergodicity trivially occurs in each well.
In general this partitioning of can be intricate but it should be clear that in principle this could reduce variability of an estimator. In the toy Example 25, a purely mathematical algorithm inspired by the discussions above would choose the right or left well with probability and then integrate deterministically, producing samples taking at most two values which could be averaged to compute . We further note that in our context a relevant result would be concerned with the limit, for any ,
where we have used a polar reparametrization , and whether a marginal version of Boltzmann’s conjecture holds for this limit. Example 25 indicates that this is not true in general but the cardinality of the partition of may be reduced.
4.3.2 Advantage of averaging and control variate interpretation
Consider the scenario where , for some is the flow solution of an ODE of the form for some field . For we have
and we are interested in determining (or ) in order to minimize and maximize improvement over simple Monte Carlo. We recall that the Jacobian determinant of the flow is given by [19, p. 174108-5]
Lemma 26.
Let be a flow solution of and assume that with the Lebesgue measure, . Then
with
In particular for the flow solution of Hamilton’s equations associated with we have
Proof.
Using Fubini’s theorem we have
Further, we have
with . It is straightforward that
Consequently
For the second statement we have and
which is akin to the integrated autocorrelation time encountered in MCMC. ∎
This is somewhat reminiscent of what is advocated in the literature in the context of HMC or randomized HMC where the integration time (or when using an integrator) is randomized [29, 21].
Example 27.
Consider the Gaussian scenario where , with diagonal covariance matrix such that for , and . Using the reparametrization the solution of Hamilton’s equations is given for by
We have for each component , which clearly does not vanish as , or , increase: this is a particularly negative result for standard HMC where the final state of the computed integrator is used. Worse, as noted early on in [29] this implies that for heterogeneous values no integration time may be suitable simultaneously for all coordinates. This motivated the introduction of random integration times [29] which leads to the average correlation
where it is assumed that is independent of the initial state. This should be contrasted with the fixed integration time scenario since as increases this vanishes and is even negative for some values of (the minimum is here reached for about for a given component).
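The effect of randomizing the integration time can be checked numerically in one dimension, where Hamilton's equations for a N(0, sigma^2) target with unit kinetic energy are solved exactly by x(t) = x(0) cos(t/sigma) + sigma v(0) sin(t/sigma). The correlation between x(0) and x(t) is cos(t/sigma) for a fixed time t, whereas averaging t uniformly over [0, T] gives (sigma/T) sin(T/sigma), which vanishes as T grows. The values sigma = 2 and T = 4*pi below are illustrative choices, with T picked so that the fixed-time correlation is essentially 1:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, T, n = 2.0, 4.0 * np.pi, 200_000
x0 = rng.normal(0.0, sigma, n)       # x(0) ~ N(0, sigma^2)
v0 = rng.normal(0.0, 1.0, n)         # unit kinetic energy

def flow_x(x, v, t):
    # exact Hamiltonian flow for the Gaussian target
    return x * np.cos(t / sigma) + sigma * v * np.sin(t / sigma)

# fixed integration time: correlation cos(T/sigma) = cos(2*pi) = 1
corr_fixed = np.corrcoef(x0, flow_x(x0, v0, T))[0, 1]

# randomized integration time t ~ Uniform(0, T): correlation
# (sigma/T) * sin(T/sigma) = 0 on average
t_rand = rng.uniform(0.0, T, n)
corr_rand = np.corrcoef(x0, flow_x(x0, v0, t_rand))[0, 1]
print(corr_fixed, corr_rand)   # close to 1 and close to 0 respectively
```

This reproduces the contrast discussed above: no fixed integration time decorrelates all scales simultaneously, while averaging over integration times does.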
The example therefore illustrates that our approach implements this averaging feature, and hence shares its benefits, within the context of an iterative algorithm. The example also highlights a control variate interpretation. More specifically, in the discrete time scenario can be interpreted as control variates, but can induce both positive and negative correlations.
4.3.3 Towards an optimal flow?
In this section we are looking to determine a flow for some of an ODE of the form for some field which defines as above the probability model
which has the property that for any , defining , then . This suggests the use of a Rao-Blackwellized estimator inspired by . Assuming and that has a density with respect to the Lebesgue measure, then for
(38) |
In the light of Lemma 26 we aim to find for any the flow solutions of ODEs such that the function decreases as fast as possible. This is motivated by the fact that the integral on of this mapping appears in the variance upper bound for (38) in Lemma 26, which we want to minimize. Note that we also expect this mapping to be smooth under general conditions not detailed in this preliminary work. For smooth enough flow and we have, with ,
Pointwise, the steepest descent direction is given by
for a positive function to be determined optimally. In this scenario we therefore have
and by Cauchy–Schwarz the (positive) expectation is maximized for
and the trajectories we are interested in must be such that, for some ,
Note that for any the term is only a change of speed and that the trajectory of is independent of this factor, despite the remarkable fact that depends on this flow. The result seems fairly natural.
5 MCMC with integrator snippets
We restrict this discussion to integrator snippet based algorithms, but more general Markov snippet algorithms could be considered.
Consider again the target distribution
with
where for . Assume that we are in the context of Example 1, dropping for simplicity; the HMC algorithm using the integrator is then as follows
(39) |
with here
where the last equality holds when , and we let . In other words the snippet is shifted along the orbit by . Naturally this needs to be combined with updates of the velocity to lead to a viable ergodic MCMC algorithm. This can be achieved with the following kernel, with and ,
described algorithmically in Alg. 6. Indeed for any , using identity (11),
and hence
Again samples from this MCMC targeting can be used to estimate expectations with respect to using the identity (11). This is closely related to the “windows of states” approach of [28, 29, 31], where a window of states is what we call a Hamiltonian snippet in the present manuscript. Indeed the windows of states approach corresponds to the Markov update
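The windows-of-states update can be sketched in code. The toy implementation below (all choices, namely a standard Gaussian target, the leapfrog step size and the snippet length, are illustrative and not taken from the manuscript) positions the current state uniformly within a window of L + 1 leapfrog states, recovers the rest of the snippet backwards and forwards using the reversibility of the integrator, and selects the next state with probability proportional to the unnormalised target weights; the velocity is fully refreshed at each iteration:

```python
import numpy as np

rng = np.random.default_rng(1)

def H(x, v):
    # Hamiltonian for a standard Gaussian target with unit kinetic energy
    return 0.5 * x**2 + 0.5 * v**2

def leapfrog_step(x, v, h):
    v = v - 0.5 * h * x          # grad U(x) = x for the standard Gaussian
    x = x + h * v
    v = v - 0.5 * h * x
    return x, v

def snippet_update(x, h=0.2, L=20):
    v = rng.normal()                        # full velocity refresh
    K = int(rng.integers(0, L + 1))         # uniform position within the window
    states = [None] * (L + 1)
    states[K] = (x, v)
    xb, vb = x, -v                          # K steps backwards (flip velocity)
    for j in range(K - 1, -1, -1):
        xb, vb = leapfrog_step(xb, vb, h)
        states[j] = (xb, -vb)
    xf, vf = x, v                           # and L - K steps forwards
    for j in range(K + 1, L + 1):
        xf, vf = leapfrog_step(xf, vf, h)
        states[j] = (xf, vf)
    w = np.array([np.exp(-H(*z)) for z in states])
    k = rng.choice(L + 1, p=w / w.sum())    # select a state prop. to its weight
    return states[k][0]

x = 0.0
xs = np.empty(10_000)
for i in range(xs.size):
    x = snippet_update(x)
    xs[i] = x
print(xs.mean(), xs.var())   # close to 0 and 1 for the N(0, 1) target
```

Because the window here coincides with the whole snippet, the accept and reject windows of [28] merge and no accept/reject step between windows is needed.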
(40) |
which, we show below, leaves invariant. Indeed, note that for and ,
and therefore
where we obtain the last line from (11). is not -reversible in general, making theoretical comparisons challenging.
6 Discussion
We have shown how mappings used in various Monte Carlo schemes relying on numerical integrators of ODEs can be implemented to fully exploit all computations to design robust and efficient sampling algorithms. Numerous questions remain open, including the tradeoff between and . A precise analysis of this question is made particularly difficult by the fact that integration along snippets is straightforwardly parallelizable, while resampling does not lend itself to straightforward parallelisation.
Another question concerns the particular choice of mutation Markov kernel , or , in (12) or (23). Indeed such a kernel starts with a transition from samples approximating the snippet distribution to , which is then followed by a reweighting of samples leading to a representation of . Instead, for illustration, one could use an SMC sampler with (39) as mutation kernel.
In relation to the discussion in Remark 20, a natural question is how our scheme would compare with a “Rao-Blackwellized” SMC where weights of the type (35), derived from (6) are used.
We leave all these questions for future investigations.
Acknowledgements
The authors would like to thank Carl Dettmann for very useful discussions on Boltzmann’s conjecture. Research of CA and MCE was supported by EPSRC grant ‘CoSInES (COmputational Statistical INference for Engineering and Security)’ (EP/R034710/1), and EPSRC grant Bayes4Health, ‘New Approaches to Bayesian Data Science: Tackling Challenges from the Health Sciences’ (EP/R018561/1). Research of CZ was supported by a CSC Scholarship.
Appendix A Notation and definitions
We will write for the set of natural numbers and for positive real numbers. Throughout this section is a generic measurable space.
-
•
For we let be its complement.
-
•
(resp. ) is the set of measures (resp. probability distributions) on
-
•
For a set , its complement in is denoted by . We denote the corresponding indicator function by and may use the notation .
-
•
For a probability measure on and a measurable function and , we let .
-
•
For two probability measures and on we let be a measure on such that for .
-
•
For a Markov kernel on , we write
-
–
for the probability measure on such that for , the minimal product -algebra, .
-
–
for the probability measure on such that for .
-
–
-
•
For probability distributions on and kernels such that then we denote
the corresponding Radon-Nikodym derivative such that for ,
-
•
A point mass distribution at will be denoted by ; it is such that for
-
•
In order to alleviate notation, for , , , we refer to as weighted samples to mean where but .
-
•
We say that a set of weighted samples, or particles, for represents a distribution whenever for -integrable
either in the sense for some .
-
•
For , , , we let mean that .
-
•
For , , , we let mean that .
-
•
For we let
-
–
be the transpose of the Jacobian
-
–
for we let be the gradient,
-
–
be the divergence.
-
–
Appendix B Radon-Nikodym derivative
The general formalism required and used throughout the paper relies on a single measure-theoretic tool, the Radon–Nikodym derivative. We gather here definitions and intermediate results used throughout, pointing out the simplicity of the tools involved and the benefits they bring.
Definition 28 (Pushforward).
Let be a measure on and a measurable function. The pushforward of by is defined by
where is the preimage of under .
If is a probability distribution then is the probability measure associated with when .
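The definition can be verified by simulation: if x has law mu then f(x) has law given by the pushforward, so expectations under the pushforward are expectations of composed functions under mu (this is Theorem 31 below). The sketch uses the illustrative choice mu = N(0, 1) and f = exp, for which the pushforward is the log-normal law with known mean exp(1/2):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200_000)   # x ~ mu = N(0, 1)

f = np.exp                     # pushforward of mu by f is log-normal(0, 1)
phi = lambda y: y              # test function on the image space

# expectation of phi under the pushforward, computed as the Monte Carlo
# average of phi(f(x)) under mu; the closed form here is exp(1/2)
est = phi(f(x)).mean()
print(est, np.exp(0.5))        # the two values agree to Monte Carlo error
```

The same mechanism, with f an integrator step, underlies the identities used in the body of the paper.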
Definition 29 (Dominating and equivalent measures).
For two measures and on the same measurable space ,
-
1.
is said to dominate if for all measurable , – this is denoted .
-
2.
and are equivalent, written , if and .
We will need the notion of Radon-Nikodym derivative [7, Theorems 32.2 & 16.11]:
Theorem 30 (Radon–Nikodym).
Let and be -finite measures on . Then if and only if there exists an essentially unique, measurable, non-negative function such that
Therefore we can view as the density of w.r.t. and in particular if is integrable w.r.t. then
If is a measure and a non-negative, measurable function then is the measure , i.e. the measure such that is the Radon–Nikodym derivative .
The following establishes the expression of an expectation with respect to the pushforward in terms of expectations with respect to [7, Theorem 16.13].
Theorem 31 (Change of variables).
A function is integrable w.r.t. if and only if is integrable w.r.t. , in which case
(41) |
We now establish results useful throughout the manuscript. The central identity used throughout the manuscript is a direct application of Theorem 31 for invertible
which seems tautological since it can be summarized as follows: for , then and ! However the interest of the approach stems from the following properties.
Lemma 32.
Let be measurable and integrable, and be -finite measures on such that and . Then
-
1.
and therefore ,
-
2.
for -almost all ,
-
3.
we have
in which case for -almost all
and therefore
Proof.
Corollary 33.
In the scenario when and are such that then .
Lemma 34 (Billingsley, Problem 32.6.).
Assume and are -finite and that . Then if and only if , in which case
Proof.
Let . For integrable we always have
Assume then from above for any integrable
and therefore and we conclude from the first identity above. Now assume that , then
The equivalence is therefore established, and when either condition is satisfied we have
and we conclude. ∎
References
- [1] Christophe Andrieu, Anthony Lee, and Sam Livingstone. A general perspective on the Metropolis-Hastings kernel. arXiv preprint arXiv:2012.14881, 2020.
- [2] Christophe Andrieu, James Ridgway, and Nick Whiteley. Sampling normalizing constants in high dimensions using inhomogeneous diffusions, 2018.
- [3] Mark A Beaumont. Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 41:379–406, 2010.
- [4] Alexandros Beskos, Natesh Pillai, Gareth Roberts, Jesus-Maria Sanz-Serna, and Andrew Stuart. Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534, 2013.
- [5] Michael Betancourt. Nested sampling with constrained Hamiltonian Monte Carlo. In AIP Conference Proceedings, volume 1305, pages 165–172. American Institute of Physics, 2011.
- [6] Joris Bierkens and Gareth Roberts. A piecewise deterministic scaling limit of lifted Metropolis–Hastings in the Curie–Weiss model. The Annals of Applied Probability, 27(2):846–882, 2017.
- [7] P. Billingsley. Probability and Measure. 3rd edition. Wiley, New York, 1995.
- [8] Alexandre Bouchard-Côté, Sebastian J Vollmer, and Arnaud Doucet. The bouncy particle sampler: a nonreversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical Association, 113(522):855–867, 2018.
- [9] Dmitri Burago, Yuri Burago, and Sergei Ivanov. A course in metric geometry, volume 33. American Mathematical Society, 2022.
- [10] Mari Paz Calvo, Daniel Sanz-Alonso, and Jesús María Sanz-Serna. HMC: reducing the number of rejections by not using leapfrog and some results on the acceptance rate. Journal of Computational Physics, 437:110333, 2021.
- [11] Cédric M Campos and Jesús María Sanz-Serna. Extra chance generalized hybrid Monte Carlo. Journal of Computational Physics, 281:365–374, 2015.
- [12] Joseph T Chang and David Pollard. Conditioning as disintegration. Statistica Neerlandica, 51(3):287–317, 1997.
- [13] Nicolas Chopin and James Ridgway. Leave Pima Indians alone: binary regression as a benchmark for Bayesian computation. Statistical Science, 32(1):64–87, 2017.
- [14] Hai-Dang Dau and Nicolas Chopin. Waste-free sequential Monte Carlo. arXiv preprint arXiv:2011.02328, 2020.
- [15] Pierre Del Moral. Feynman-Kac Formulae. Springer, 2004.
- [16] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.
- [17] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics letters B, 195(2):216–222, 1987.
- [18] Mauro Camara Escudero. Approximate Manifold Sampling: Robust Bayesian Inference for Machine Learning. PhD thesis, School of Mathematics, January 2024.
- [19] Youhan Fang, J. M. Sanz-Serna, and Robert D. Skeel. Compressible generalized hybrid Monte Carlo. The Journal of Chemical Physics, 140(17):174108, 05 2014.
- [20] Ernst Hairer, Syvert P. Nørsett, and Gerhard Wanner. Solving Ordinary Differential Equations I: Nonstiff Problems. Springer-Verlag, 1993.
- [21] Matthew Hoffman, Alexey Radul, and Pavel Sountsov. An adaptive-MCMC scheme for setting trajectory lengths in Hamiltonian Monte Carlo. In International Conference on Artificial Intelligence and Statistics, pages 3907–3915. PMLR, 2021.
- [22] Matthew D Hoffman and Pavel Sountsov. Tuning-Free Generalized Hamiltonian Monte Carlo. In International Conference on Artificial Intelligence and Statistics, pages 7799–7813. PMLR, 2022.
- [23] Paul B Mackenzie. An improved hybrid Monte Carlo method. Physics Letters B, 226(3-4):369–371, 1989.
- [24] Florian Maire and Pierre Vandekerkhove. On Markov chain Monte Carlo for sparse and filamentary distributions. ArXiv e-prints, 2018.
- [25] Florian Maire and Pierre Vandekerkhove. Markov kernels local aggregation for noise vanishing distribution sampling. SIAM Journal on Mathematics of Data Science, 4(4):1293–1319, 2022.
- [26] Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini. Sampling via Measure Transport: An Introduction, pages 1–41. Springer International Publishing, Cham, 2016.
- [27] Xiao-Li Meng and Stephen Schilling. Warp bridge sampling. Journal of Computational and Graphical Statistics, 11(3):552–586, 2002.
- [28] Radford M Neal. An improved acceptance procedure for the hybrid Monte Carlo algorithm. Journal of Computational Physics, 111(1):194–203, 1994.
- [29] Radford M. Neal. MCMC Using Hamiltonian Dynamics, chapter 5. CRC Press, 2011.
- [30] Ari Pakman and Liam Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542, 2014.
- [31] Zhaohui S. Qin and Jun S. Liu. Multipoint Metropolis method with application to Hybrid Monte Carlo. Journal of Computational Physics, 172(2):827–840, 2001.
- [32] James Ridgway. Computation of Gaussian orthant probabilities in high dimension. Statistics and Computing, 26(4):899–916, 2016.
- [33] Grant M Rotskoff and Eric Vanden-Eijnden. Dynamical computation of the density of states and Bayes factors using nonequilibrium importance sampling. Physical review letters, 122(15):150602, 2019.
- [34] Gabriel Stoltz, Mathias Rousset, et al. Free energy computations: A mathematical perspective. World Scientific, 2010.
- [35] D. Szász. Boltzmann’s ergodic hypothesis, a conjecture for centuries? In Hard Ball Systems and the Lorentz Gas, pages 421–446. Springer, 2000.
- [36] Esteban G Tabak and Eric Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.
- [37] Erik H Thiede, Brian Van Koten, Jonathan Weare, and Aaron R Dinner. Eigenvector method for umbrella sampling enables error analysis. The Journal of chemical physics, 145(8), 2016.
- [38] Achille Thin, Yazid Janati El Idrissi, Sylvain Le Corff, Charles Ollion, Eric Moulines, Arnaud Doucet, Alain Durmus, and Christian X Robert. Neo: Non equilibrium sampling on the orbits of a deterministic transform. Advances in Neural Information Processing Systems, 34:17060–17071, 2021.
- [39] G.M. Torrie and J.P. Valleau. Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling. Journal of Computational Physics, 23(2):187–199, 1977.
- [40] Paul F Tupper. Ergodicity and the numerical simulation of Hamiltonian systems. SIAM Journal on Applied Dynamical Systems, 4(3):563–587, 2005.
- [41] J. von Neumann. Proof of the quasi-ergodic hypothesis. Proc. Natl. Acad. Sci. USA, 18:70–82, 1932.
- [42] Chang Zhang. On the Improvements and Innovations of Monte Carlo Methods. PhD thesis, School of Mathematics, https://research-information.bris.ac.uk/en/studentTheses/on-the-improvements-and-innovations-of-monte-carlo-methods, June 2022.