Monte Carlo sampling with integrator snippets
Abstract
Assume interest is in sampling from a probability distribution defined on . We develop a framework to construct sampling algorithms taking full advantage of numerical integrators of ODEs, say for one integration step, to explore efficiently and robustly. The popular Hybrid/Hamiltonian Monte Carlo (HMC) algorithm [17, 29] and its derivatives are examples of such a use of numerical integrators. However, we show how the potential of integrators can be exploited beyond current ideas and HMC sampling in order to take into account aspects of the geometry of the target distribution. A key idea is the notion of integrator snippet, a fragment of the orbit of an ODE numerical integrator , and its associated probability distribution , which takes the form of a mixture of distributions derived from and . Exploiting properties of mixtures, we show how samples from can be used to estimate expectations with respect to . We focus here primarily on Sequential Monte Carlo (SMC) algorithms, but the approach can be used in the context of Markov chain Monte Carlo algorithms, as discussed at the end of the manuscript. We illustrate the performance of these new algorithms through numerical experimentation and provide preliminary theoretical results supporting the observed performance.
School of Mathematics, University of Bristol
1 Overview and motivation: SMC sampler with HMC
Assume interest is in sampling from a probability distribution on a probability space . The main ideas of sequential Monte Carlo (SMC) samplers to simulate from are (a) to define a sequence of probability distributions on where , is chosen by the user, simple to sample from and the sequence “interpolates” and , (b) and to propagate a cloud of samples for to represent using an importance sampling/resampling mechanism [16]. Notation and definitions used throughout this paper can be found in Appendix A.
After initialisation, for , samples , representing , are propagated thanks to a user-defined mutation Markov kernel , as follows. For sample and compute the importance weights, assumed to exist for the moment,
(1)
where is a user-defined “backward” Markov kernel required to define importance sampling on and swap the rôles of and , in the sense that for -integrable,
We adopt this measure theoretic formulation of importance weights out of necessity in order to take into account situations such as when and both have densities with respect to the Lebesgue measure but is a Metropolis-Hastings (MH) update, which does not possess such a density – more details are provided in Appendices A-B but can be omitted to understand how the algorithms considered in the manuscript proceed.
The mutation step is followed by a selection step where for , for the random variable taking values in with . The procedure is summarized in Alg. 1.
Given , theoretically optimal choice of is well understood but tractability is typically obtained by assuming that is -invariant, or considering approximations of and that is -invariant. This makes Markov chain Monte Carlo (MCMC) kernels very attractive choices for , and the use of measure theoretic tools inevitable.
A possible choice of MCMC kernel is that of the hybrid Monte Carlo method, a MH update using a discretization of Hamilton’s equations [17, 29], which can be thought of as a particular instance of a more general strategy relying on numerical integrators of ODEs possessing properties of interest. More specifically, assume that interest is in sampling defined on . First the problem is embedded into that of sampling from the joint distribution defined on , where is an auxiliary variable facilitating sampling. Following the SMC framework we set for a sequence of distributions on with , probabilities on and on . With an integrator of an ODE of interest, one can use for some as a proposal in a MH update mechanism; is a source of randomness allowing “exploration”, resampled every now and then. Again, hereafter we let be the corresponding components of .
Example 1 (Leapfrog integrator of Hamilton’s equations).
Assume that , that have densities, denoted and , with respect to the Lebesgue measure and let . For and differentiable, Hamilton’s equations for the potential are
(2)
and possess the important property that for . The corresponding leapfrog integrator is given, for some , by
(3)
We point out that, with the exception of the first step, only one evaluation of is required per integration step, since the rightmost term in (3) recycles the computation from the previous iteration. Let such that for any , , it is standard to check that and note that in this particular case. In its most basic form the hybrid Monte Carlo MH update leaving invariant proceeds as follows, for
(4) |
with, for some user defined
(5) |
and .
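For concreteness, the leapfrog recursion (3) can be sketched as follows. The names `grad_U`, `eps` and `T`, and the Gaussian illustration, are ours and not fixed by the text; the point is that the whole orbit (the integrator snippet) is retained, with a single gradient evaluation per step.

```python
import numpy as np

def leapfrog(x, v, grad_U, eps, T):
    """Run T leapfrog steps of Hamilton's equations for a potential U.

    Returns the whole orbit (the "integrator snippet"), not just the endpoint.
    Only one gradient evaluation per step: the end-of-step gradient is
    recycled as the start-of-step gradient of the next iteration.
    """
    xs, vs = [x.copy()], [v.copy()]
    g = grad_U(x)                      # the only "extra" gradient evaluation
    for _ in range(T):
        v_half = v - 0.5 * eps * g     # half step on the velocity
        x = x + eps * v_half           # full step on the position
        g = grad_U(x)                  # recycled at the next iteration
        v = v_half - 0.5 * eps * g     # second half step on the velocity
        xs.append(x.copy())
        vs.append(v.copy())
    return np.array(xs), np.array(vs)
```

For a standard Gaussian potential, `grad_U = lambda x: x`, the total energy along the snippet is approximately conserved, reflecting the near-invariance property used throughout.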
Other ODEs, capturing other properties of the target density, or other types of integrators are possible. However a common feature of integrator based updates is the need to compute recursively an integrator snippet , for a given mapping , of which only the endpoint is used. This raises the question of recycling intermediate states, all the more so as computation of the snippet often involves quantities shared with the evaluation of . In Example 1, for instance, expressions for and often involve the same computationally costly quantities and evaluation of the density where has already been evaluated is therefore often virtually free; consider for example for a covariance matrix , then .
In turn these quantities offer the promise of being able to exploit points used to generate the snippet while preserving accuracy of the estimators of interest, through importance sampling or reweighting. For example one may consider virtual HMC updates i.e. given and define
(6) |
Noting that one deduces that for
is an unbiased estimator of . This post-processing procedure is akin to waste-recycling [14] and its generalizations but, crucially, does not affect the dynamic of the Markov chain generating the samples. An alternative algorithm exploiting the snippet fully, with an active effect on the dynamics, could use the following mixture of Markov chain transition kernels [23, 29, 22],
which is also related to the strategy adopted in [11] with the extra chance algorithm. As we shall see our work shares the same objective but we adopt a strategy more closely related to [28]; see Section 5 for a more detailed discussion.
The present paper is concerned with exploring such recycling procedures in the context of SMC samplers, although the ideas we develop can be straightforwardly applied to particle filters in the context of state-space models and to MCMC as discussed later. The manuscript is organised as follows.
In Section 2 we first provide a high level description of particular instances of the class of algorithms considered and provide a justification through reinterpretation as standard SMC algorithms in Subsection 2.2. In Subsection 2.3 we discuss direct extensions of our algorithms, some of which we explore in the present manuscript. This work has links with related recent attempts in the literature [33, 14, 38]; these links are discussed and contrasted with our work in Subsection 2.4 where some of the motivations behind these algorithms are also discussed. Initial exploratory simulations demonstrating the interest of our approach are provided in Subsections 2.5 and 2.6.
In Section 3 we introduce the more general framework of Markov snippets Monte Carlo and associated formal justifications. Subsection 3.3 details the link with the scenario considered in Section 2. In Subsection 3.4 we provide general results facilitating the practical calculation of some of the Radon-Nikodym involved, highlighting why some of the usual constraints on mutation and backward kernels in SMC can be lifted here.
In Section 4 we provide elements of a theoretical analysis explaining expected properties of the algorithms proposed in this manuscript, although a fully rigorous theoretical analysis is beyond the present methodological contribution.
In Section 5 we explore the use of some of the ideas developed here in the context of MCMC algorithms and establish links with earlier suggestions, such as “windows of states” techniques proposed in the context of HMC [28, 29]. Notation, definitions and basic mathematical background can be found in Appendices A-B.
A Python implementation of the algorithms developed in this paper is available at https://github.com/MauroCE/IntegratorSnippets.
2 An introductory example
Assume interest is in sampling from a probability distribution on as described above Example 1 using an SMC sampler relying on the leapfrog integrator of Hamilton’s equations. As in the previous section we introduce an interpolating sequence of distributions on and assume for now the existence of densities for with respect to a common measure , say the Lebesgue measure on , denoted for and and denote the corresponding integrator, which again is measure preserving in this setup.
2.1 An SMC-like algorithm
Primary interest in this paper is in algorithms of the type given in Alg. 2 and variations thereof; throughout .
The SMC sampler-like algorithm in Alg. 2 therefore involves propagating “seed” particles , with a mutation mechanism consisting of the generation of integrator snippets started at every seed particle , resulting in particles which are then whittled down to a set of seed particles using a standard resampling scheme; after rejuvenation of velocities this yields the next generation of seed particles–this is illustrated in Fig. 1. This algorithm should be contrasted with standard implementations of SMC samplers where, after resampling, a seed particle normally gives rise to a single particle in the mutation step, in Fig. 1 the last state on the snippet. Intuitively validity of the algorithm follows from the fact that if represent , then represents in the sense that for summable, one can use the approximation
(7) |
where the self-normalization of the weights is only required in situations where the ratio is only known up to a constant. We provide justification for the correctness of Alg. 2 and the estimator (7) by recasting the procedure as a standard SMC sampler targeting a particular sequence of distributions in Section 2.2 and using properties of mixtures. Direct generalizations are provided in Section 2.3.
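A self-normalised version of the estimator (7) can be sketched as follows, in the volume-preserving setting of Example 2 where the weight of the k-th snippet state is proportional to the current target evaluated at that state over the previous target evaluated at the seed. All names (`psi`, `log_mu_n`, `log_mu_prev`) are our own illustrative choices.

```python
import numpy as np

def snippet_estimate(seeds_x, seeds_v, psi, log_mu_n, log_mu_prev, f, T):
    """Self-normalised estimate of the expectation of f using ALL states of
    every integrator snippet, not just the endpoints.

    seeds_x, seeds_v : seed positions / velocities (one snippet per seed)
    psi              : one integrator step (x, v) -> (x, v), volume preserving
    log_mu_n, log_mu_prev : unnormalised log-densities of the current and
                            previous targets on the extended space (x, v)
    """
    log_w, fs = [], []
    for x, v in zip(seeds_x, seeds_v):
        base = log_mu_prev(x, v)        # denominator: seed under previous target
        for _ in range(T + 1):          # k = 0 is the seed itself
            log_w.append(log_mu_n(x, v) - base)
            fs.append(f(x))
            x, v = psi(x, v)            # extend the snippet
    log_w = np.array(log_w)
    w = np.exp(log_w - log_w.max())     # numerically stabilised weights
    return float(np.sum(w * np.array(fs)) / np.sum(w))
```

With a rotation as integrator step (which preserves the Gaussian extended target exactly) all weights are equal and the estimator reduces to a plain average over all snippet states.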

2.2 Justification outline
We now outline the main ideas underpinning the theoretical justification of Alg. 2. Key to this is establishing that Alg. 2 is a standard SMC sampler targeting a particular sequence of probability distributions from which samples can be processed to approximate expectations with respect to . This has the advantage that no fundamentally new theory is required and that standard methodological ideas can be re-used in the present scenario, while the particular structure of can be exploited for new developments. This section focuses on identifying and establishing some of their important properties. Similar ideas are briefly touched upon in [14], but we will provide full details and show how these ideas can be pushed further in interesting directions.
First for let be measurable and invertible mappings, define , i.e. the distribution of when . It is worth pointing out that invertibility of these mappings is not necessary, but facilitates interpretation throughout, as illustrated by the last statement about the distribution of . Earlier we have focused on the scenario where for an integrator , but this turns out not to be a requirement, although it is our main motivation. Useful applications of this general perspective can be found in Subsection 2.3. Introduce the probability distributions on
for . We will show that Alg. 2 can be interpreted as an SMC sampler targeting the sequence of marginal distributions on
(8) |
which we may refer to as a mixture, for . Note that the conditional distribution is
(9) |
which can be computed in most scenarios of interest.
Example 2.
For assume the existence of a -finite dominating measure , therefore implying the existence of a density ; could be the Lebesgue measure. Assuming that is -invariant, or “volume preserving”, then Lemma 32 implies, for ,
and therefore
When is the Lebesgue measure and are not volume preserving, additional multiplicative Jacobian determinant-like terms may be required (see Lemma 32 and the additional requirement that be differentiable). However this extra term may be more complex and require application specific treatment; adoption of the measure theoretic notation circumvents such peripheral considerations at this stage, which can be ignored until actual implementation of the algorithm.
A central point throughout this paper is how samples from can be used to obtain samples from the marginal and vice-versa, thanks to the mixture structure relating the two distributions. Naturally, given , sampling and returning yields a sample from the marginal . Now assuming and then sampling naturally yields and hence intuitively . We now formally establish a more general result concerned with the estimation of expectations with respect to from samples from . For -integrable and , a change of variable (see Theorem 31) yields
which formalizes the fact, at first sight of limited interest, that is the distribution of for and therefore . The relevance of this remark comes from the identity
(10)
(11) |
which implies that samples from the mixture can be used to unbiasedly estimate thanks to an appropriate weighted average along the snippet . Using for formally establishes the earlier claim that if then . Note that, as suggested by Example 2, construction of the estimator will only require evaluations of the density and function at .
We now turn to the description of an SMC algorithm targeting , Alg. 3, and then establish that it is probabilistically equivalent to Alg. 2 in a sense made precise below. With for we introduce the following mutation kernel
(12) |
where we note that the refreshment kernel has the property that . One can show that the near-optimal kernel is given for any by
(13) |
and is well defined -a.s. (see Lemma 4, given at the end of this subsection) and satisfies the property that for any
that is is a conditional probability of given for the joint probability distribution . With the assumption , from Lemma 4, the SMC sampler importance weights at step are
(14) |
that is for such that ,
The corresponding standard SMC sampler is given in Alg. 3 where,
-
•
the weighted particles represent and represent from (11),
- •
-
•
represent ,
-
•
represent , and so do .
Notice that we assume here , and hence that , and that the weights appear as being computed twice in step 3 and step 14 when evaluating the resampling weights at the previous iteration, only because this facilitates exposition. The identities (11) and (9) suggest, for the estimator of , for -integrable,
(15) |
where the second line is correct under the assumptions of Example 2. Further, when for one can also write (11) for summable as
which can be estimated, using self-renormalization when required, with
(16) |
therefore justifying the estimator suggested in (7) in the particular case where the conditions of Example 2 are satisfied, once we establish the equivalence of Alg. 3 and Alg. 2 in Proposition 3. In fact it can be shown (Proposition 3) that, with referring to the expectation of the probability distribution underpinning Alg. ,
a form of Rao-Blackwellization implying lower variance for while the two estimators share the same bias. The result is in fact stronger since it states that the estimators are convex ordered, a property which we however do not exploit further here. Interestingly a result we establish later in the paper, Proposition 15, suggests that the variance is smaller than that of the standard Monte Carlo estimator assuming samples , due to the control variate nature of integrator snippets estimators.
An interesting point is that computation of the weight only requires knowledge of up to a normalizing constant, that is the estimator is unbiased if for even if is not completely known, while the estimator (7) will most often require self-normalisation, hence inducing a bias.
We now provide the probabilistic argument justifying Alg. 2 and the shared notation in Alg. 2 and Alg. 3.
Proposition 3.
Proof of Proposition 3.
In Alg. 3 for any we have with the random vector taking values in involved in the resampling step,
Letting for
from the tower property we have
Since
we deduce
Now notice that
and the first statement follows from conditional independence of and the fact that . Since by construction for any the second statement follows from a standard Markov chain argument. Now for and
Therefore for any ,
The third statement follows. ∎
Since the latter interpretation of the algorithm, as a standard SMC sampler targeting the instrumental distributions , is straightforward to justify and allows for further easy generalisations, we adopt this perspective in the remainder of the manuscript for simplicity.
This reinterpretation also allows the use of known facts about SMC sampler algorithms. For example it is well known that the output of SMC samplers can be used to estimate unbiasedly unknown normalising constants by virtue of the fact that, in the present scenario,
has expectation under . Now assume that the densities of are known up to a constant only, say and for , then
and the measure shares the same normalising constant as ,
Consequently
is an unbiased estimator of .
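The resulting unbiased estimator of the ratio of normalising constants can be sketched as follows: it is the standard SMC identity, the product over steps of the average unnormalised incremental weight. The toy one-step illustration below, with unnormalised density γ(x) = exp(−x²) and proposal N(0, 1), is ours; its normalising constant is √π.

```python
import numpy as np

def log_normalising_constant(unnormalised_weights_per_step):
    """Standard SMC estimate of log(Z_P / Z_0): the sum over SMC steps of
    the log of the average unnormalised incremental weight of the particles."""
    return float(sum(np.log(np.mean(np.asarray(w)))
                     for w in unnormalised_weights_per_step))
```

For a single importance sampling step with proposal N(0, 1) and target γ(x) = exp(−x²), the weights are w(x) = √(2π) exp(−x²/2) and the estimator converges to log √π ≈ 0.572.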
The results of the following lemma can be deduced from Lemma 8 but we provide more direct arguments for the present scenario.
Lemma 4.
Proof.
The first statement follows from the definition in (12) of and the identity (11). The second statement is a consequence of the following classical arguments, justifying Bayes’ rule. For fixed, consider the measure
implying from which we deduce the existence of a Radon-Nikodym derivative such that
For , we let
and note that almost surely and . For the third statement note that from the second statement, for we have and from Fubini’s and the theorem [7, Theorems 3.1 and 3.2] and are probability distributions coinciding on . To conclude proof of the third statement, for measurable, we successively apply the definition of the Radon-Nikodym derivative, Fubini’s theorem, use that , the first statement of this lemma, Fubini again, the second statement
therefore establishing that -almost surely
∎
2.3 Direct extensions
It should be clear that the algorithm we have described lends itself to numerous generalizations, which we briefly discuss below.
There is no reason to limit the number of snippets arising from a seed particle to one, therefore offering the possibility to take advantage of parallel machines. For example the velocity of a given seed particle can be refreshed multiple times, resulting in partial copies of the seed particle from which integrator snippets can be grown.
The main scenario motivating this work corresponds to the choice, for , of for a given . As should be apparent from the theoretical justification, this can be replaced by a general family of invertible mappings where the ’s are now not required to be measure preserving in general, in which case the expression for may involve additional terms of the “Jacobian” type. These mappings may correspond to integrators other than those of Hamilton’s equations but may be more general deterministic mappings, possibly with no temporal meaning, only representing the number of deterministic transformations used in the algorithm. More specifically, assuming that and for some -finite dominating measure and letting , the required weights are now of the form (see Lemmas 32-34)
Again when is the Lebesgue measure and a diffeomorphism, then is the absolute value of the determinant of the Jacobian of . Non-uniform weights may be ascribed to each of these transformations in the definition of (24). A more useful application of this generality is the situation where it is believed that using multiple integrators, specialised in capturing different geometric features of the target density, could be beneficial. Hereafter we simplify notation by setting and . For the purpose of illustration consider two distinct integrators , (again disappears from the notation) we wish to use, each for time steps with proportions such that . Again with define the mixture
which still possesses the fundamental property that if (resp. ), then with (resp. ) we have . It is possible to aim to sample from , in which case the pair plays the rôle played by in the earlier simpler scenario. However, in order to introduce the sought persistency, that is use either or when constructing a snippet, we focus on the scenario where the pair plays the rôle played by earlier. In other words the target distribution is now , which is to what in (24) was to . The mutation kernel corresponding to (24) is given by
with now the requirement that . A sensible choice seems to be with . With these choices using the near-optimal backward kernel simple substitutions and yields
and justifies the earlier abstract presentation.
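For a general, not necessarily volume-preserving, diffeomorphism the snippet log-weights acquire the “Jacobian” correction discussed above, with log-determinants accumulating along the snippet by the chain rule for compositions. The sketch below is ours; `psi_step` is assumed to return one transformed state together with the log absolute determinant of its Jacobian.

```python
import numpy as np

def snippet_log_weights(x0, v0, psi_step, log_mu_n, log_mu_prev, T):
    """Log-weights log w_k = log mu_n(psi^k(z)) + log|det J_{psi^k}(z)|
                             - log mu_{n-1}(z),   k = 0, ..., T,
    for a general (not necessarily volume-preserving) diffeomorphism psi.
    psi_step returns (x', v', log|det J_psi|); the log-determinant of the
    k-fold composition is the running sum of the per-step log-determinants.
    """
    base = log_mu_prev(x0, v0)
    x, v, cum_logdet = x0, v0, 0.0
    lw = [log_mu_n(x, v) - base]       # k = 0: identity map, log-det 0
    for _ in range(T):
        x, v, ld = psi_step(x, v)
        cum_logdet += ld
        lw.append(log_mu_n(x, v) + cum_logdet - base)
    return np.array(lw)
```

Setting the per-step log-determinant to zero recovers the volume-preserving weights of Example 2.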
Another direct extension consists of generalizing the definition of by assigning non-uniform and possibly state-dependent weights to the integrator snippet particles as follows
with for and such that for any ; this should be contrasted with (8). Such choices ensure that the central identity (11) can be generalized as follows
Note the analogy with the identity behind umbrella sampling [37, 39]. In the light of the developments of Subsection 4.3.2, a potentially useful choice could be , which however requires additional computations as
which may require computation of additional states for .
We leave exploration of some of these generalisations for future work, although we think that the choice of the leapfrog integrator is particularly natural and attractive here.
2.4 Rationale and computational considerations
Our initial motivation for this work was that computation of and most often involve common quantities, leading to negligible computational overhead, and reweighting offers the possibility to use all the states of a snippet in a simulation algorithm rather than the endpoint only. There are other reasons for which this approach may be beneficial.
The benefit of using all states along integrator snippets in sampling algorithms has been noted in the literature. For example, in the context of Hamiltonian integrators, with and , it is known that for the mapping is typically oscillatory, motivating for example the extra chance algorithm of [11]. The “windows of states” strategy of [28, 29] effectively makes use of the mixture as an instrumental distribution and improved performance is noted, with performance seemingly improving with dimension on particular problems; see Section 5 for a more detailed discussion. Further, averaging is well known to address scenarios where components of evolve on different scales and no single integration time can accommodate all scales [29, 22, Section 3.2]; averaging addresses this issue effectively, see Example 27. We also note that, keeping constant, in the limit as the average in (14) corresponds to Riemann-like numerical integration along the contour of , effectively leading to some form of Rao-Blackwellization of this contour; this is discussed in detail in Section 4.
Another benefit we have observed in the SMC context is the following. Use of a WF-SMC strategy involves comparing particles within each Markov snippet arising from a single seed particle, while our strategy involves comparing all particles across snippets, which proves highly beneficial in practice. This seems to bring particular robustness to the choice of the integrator parameters and , and can be combined with highly robust adaptation schemes taking advantage of the population of samples, in the spirit of [21, 22]; see Subsections 2.5 and 2.6.
At a computational level we note that, in contrast with standard SMC or WF-SMC implementations relying on an MH mechanism, the construction of integrator snippets does not involve an accept-reject mechanism, therefore removing control-flow operations and enabling lockstep computations on GPUs – actual implementation of our algorithms on such architectures is, however, left to future work.
Finally, when using integrators of Hamilton’s equations we naturally expect similar benefits to those enjoyed by HMC. Let and let . We know that in certain scenarios [4, 10], the distributions are such that for , , that is the weights do not degenerate to zero or one: in the context of SMC this means that the part of the importance weight (14) arising from the mutation mechanism does not degenerate. Further, an appropriate choice of schedule, i.e. of the sequence for , ensures that the contribution to the importance weights (14) is also stable as . As shown in [4, 2, 10], while direct importance sampling may require an exponential number of samples as grows, the use of such a schedule may reduce complexity to a polynomial order.
2.5 Numerical illustration: logistic regression
In this section, we consider sampling from the posterior distribution of a logistic regression model, focusing on the computation of the normalising constant. We follow [14] and consider the sonar dataset, previously used in [13]. With intercept terms, the dataset has responses and covariates , where . The likelihood of the parameters is then given by
(17) |
where . We ascribe a product of independent zero-mean normal distributions as a prior, with standard deviation equal to 20 for the intercept and 5 for the other parameters. Denoting the prior distribution of , we define a sequence of tempered distributions with densities of the form for non-decreasing and such that and . We apply both the Hamiltonian Snippet SMC and the implementation of waste-free SMC of [14] and compare their performance.
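The unnormalised tempered log-density used here can be sketched as follows; the names `X`, `y`, `t` and `prior_sd` are ours, labels are taken in {−1, +1}, and `prior_sd` would be the vector of prior standard deviations described above (20 for the intercept, 5 otherwise).

```python
import numpy as np

def log_tempered_posterior(beta, X, y, t, prior_sd):
    """Unnormalised log-density of pi_t(beta) ∝ prior(beta) * L(beta)^t,
    0 <= t <= 1, for the logistic likelihood (17) with labels y_i in {-1, +1}.
    log sigma(z) = -log(1 + exp(-z)) is evaluated stably via logaddexp."""
    log_prior = -0.5 * np.sum((beta / prior_sd) ** 2)  # indep. zero-mean normals, up to a constant
    z = y * (X @ beta)                                 # margins y_i * x_i^T beta
    log_lik = -np.sum(np.logaddexp(0.0, -z))           # sum_i log sigma(y_i x_i^T beta)
    return log_prior + t * log_lik
```

At t = 0 this reduces to the prior and at t = 1 to the posterior, matching the interpolating sequence above.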
For both algorithms, we set the total number of particles at each SMC step to be . For waste-free SMC, the Markov kernel is chosen to be a random-walk Metropolis-Hastings kernel with covariances adaptively computed as , where is the empirical covariance matrix obtained from the particles in the previous SMC step. For the Hamiltonian Snippet SMC algorithm, we set to be the one-step leapfrog integrator with stepsize , and a . To investigate the stability of our algorithm, we ran Hamiltonian Snippet SMC with and . For both algorithms, the temperatures are adaptively chosen so that the effective sample size (ESS) at the current SMC step is , where is the maximum ESS achievable at the current step. In our experiments, we have chosen and for both algorithms.
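The adaptive choice of the next temperature can be sketched as a bisection on the ESS of the incremental weights, which depend on the temperature increment only through the per-particle log-likelihoods. This is a common implementation of the rule described above; the function and parameter names are ours.

```python
import numpy as np

def next_temperature(log_lik, t_prev, target_ess_frac, tol=1e-10):
    """Find t in (t_prev, 1] such that the ESS fraction of the incremental
    weights w_i ∝ exp((t - t_prev) * log_lik_i) equals target_ess_frac,
    by bisection (the ESS fraction decreases from 1 as t grows)."""
    def ess_frac(t):
        lw = (t - t_prev) * log_lik
        w = np.exp(lw - lw.max())
        return np.sum(w) ** 2 / (len(w) * np.sum(w ** 2))
    if ess_frac(1.0) >= target_ess_frac:
        return 1.0                      # can jump straight to the final target
    lo, hi = t_prev, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if ess_frac(mid) > target_ess_frac:
            lo = mid                    # ESS still too high: increase t
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Each SMC iteration then sets the new temperature to `next_temperature(log_lik, t_prev, alpha)` for the chosen ESS fraction α.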
Performance comparison
Figure 2 shows the boxplots of estimates of the logarithm of the normalising constant obtained from both algorithms, for different choices of and for the Hamiltonian Snippet SMC algorithm. The boxplots are obtained by running both algorithms 100 times for different values of the algorithm parameters, with in all setups. Several points are worth observing. For a suitable choice of , the Hamiltonian Snippet SMC algorithm can produce stable and consistent estimates of the normalising constant with particles at each iteration. On the other hand, the waste-free SMC algorithm fails to produce accurate results for the same computational budget. It is also clear that with larger values of (meaning smaller values of and hence shorter snippets), the waste-free SMC algorithm produces results with larger biases and variability. For the Hamiltonian Snippet SMC algorithm, the results are stable both for short and long snippets when is equal to or . Another point is that when or , the Hamiltonian Snippet SMC algorithm becomes unstable with short (i.e. ) and long (i.e. ) snippets. A possible reason is that for too small a stepsize the algorithm is not able to explore the target distribution efficiently, resulting in unstable performance. On the other hand, when the stepsize is too large, the leapfrog integrator becomes unstable, affecting the variability of the weights; this is a common problem of HMC algorithms, and a long trajectory then results in deteriorated estimates. Hence, to obtain the best performance, one should tune the stepsize and trajectory length along with the SMC steps.

In Figure 3 we display boxplots of the estimates of the posterior expectations of the mean of all coefficients, i.e. . This quantity is referred to as the mean of marginals in [14] and we use this terminology. The same behaviour can be observed for the estimates of the mean of marginals, with the instability problems exacerbated for small and large stepsizes.

Computational Cost
In this section, we compare the running times of both algorithms. Since the calculations of the potential energy and its gradient often share common intermediate steps, we can recycle these to save computational cost. As waste-free SMC also requires density evaluations, the Hamiltonian Snippet SMC algorithm does not require significant additional computation. Figure 4 shows boxplots of the simulation times of both algorithms over 100 runs. The simulations were run on an Apple M1 Pro CPU with 16GB of memory. One can see that in comparison to waste-free SMC the additional computational time of the Hamiltonian Snippet SMC algorithm is only marginal, and mostly due to the larger memory needed to store the intermediate values.

2.6 Numerical illustration: simulating from filamentary distributions
We now illustrate the interest of integrator snippets in a scenario where the target distribution possesses specific geometric features. Specifically, we focus here on distributions concentrated around a manifold defined as the zero level set of a smooth function
sometimes referred to as filamentary distributions [12, 24, 25]. Such distributions arise naturally in various scenarios, including inverse problems, relaxations of problems where the support of the distribution of interest is included in , or generalisations of the Gibbs sampler through disintegration [12, 24, 25]. Such is the case for ABC methods in Bayesian statistics [3]. Assume that is a probability density with respect to the Lebesgue measure defined on and for consider a “kernel” function where and define the probability density
The corresponding probability distribution can be thought of as an approximation of the probability distribution of density with respect to the Hausdorff measure on . Typical choices for the kernel are or . Strong anisotropy may result from such constraints and make exploration of the support of such distributions awkward for standard Monte Carlo algorithms. This is illustrated in Fig. 5 where a standard MH-HMC algorithm is used to sample from defined on , for three values , and performance is observed to deteriorate as decreases. The samples produced are displayed in blue: for HMC-MH mixes well and explores the support rapidly, but for the chain gets stuck in a narrow region of at the bottom, near initialisation, while for , no proposal is ever accepted in the experiment.

To illustrate the properties of integrator snippet techniques we consider the following toy example. For and let
for a symmetric positive matrix and , that is we consider the restriction of a standard normal distribution around an ellipsoid defined by .
In order to explore the support of the target distribution we use two mechanisms based on reflections either through tangent hyperplanes of equicontours of or through the corresponding orthogonal complements. More specifically for such that let and define the tangential HUG (THUG) and symmetrically-normal HUG (SNUG) as , , and respectively, with and as in Example 1 and (25). Both updates can be understood as discretisations of ODEs and are volume preserving. Intuitively for with for and we have and . This is illustrated in Fig. 6 for . Further, for an initial state , trajectories of the first component of remain close to while follows the gradient field and leads to hops across equicontours.
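Single THUG and SNUG steps can be sketched as follows under our reading of the description above: a half position step, a velocity reflection at the midpoint gradient (through the tangent hyperplane for THUG, through the normal direction for SNUG), and another half position step. The exact formulation in (25) may differ; `grad_f` and `delta` are our names.

```python
import numpy as np

def thug_step(x, v, grad_f, delta):
    """One tangential Hug (THUG) step: half position step, reflect the
    velocity through the tangent hyperplane of the level set of f at the
    midpoint, half position step. Volume preserving; keeps the trajectory
    close to a contour of f."""
    x = x + 0.5 * delta * v
    g = grad_f(x)
    g = g / np.linalg.norm(g)
    v = v - 2.0 * np.dot(v, g) * g     # keep tangential component, flip normal
    x = x + 0.5 * delta * v
    return x, v

def snug_step(x, v, grad_f, delta):
    """One SNUG step: reflect through the normal direction instead, so the
    normal component is kept and the tangential component is flipped,
    producing hops across the contours of f."""
    x = x + 0.5 * delta * v
    g = grad_f(x)
    g = g / np.linalg.norm(g)
    v = 2.0 * np.dot(v, g) * g - v
    x = x + 0.5 * delta * v
    return x, v
```

For f(x) = |x|²/2 the THUG step preserves |x| exactly on a circle, and both reflections preserve the speed |v|, consistent with the volume-preservation claim above.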

The sequence of target distributions on we consider is of the form
and the Integrator Snippet SMC is defined through the mixture, for ,
We compare performance of integrator snippet with an SMC Sampler relying on a mutation kernel consisting of a mixture of two updates targeting each iterating times an MH kernel applying integrator or once after refreshing the velocity; the backward kernels are chosen to correspond to the default choices discussed earlier. In the experiments below we set , , and is the diagonal matrix alternating ’s and ’s along the diagonal. We used particles, sampled from at time zero and is set to the maximum distance of these particles in the domain from and for both algorithms. We compared the two samplers across three metrics; all results are averaged over runs.
Robustness and accuracy: we fix the step size for SNUG to and run both algorithms for a grid of values of and using a standard adaptive scheme based on ESS to determine until a criterion described below is satisfied, and retain the final tolerance achieved, recorded in Fig. 7. As a result the terminal value and computational costs differ between the two algorithms: the point of this experiment is only to demonstrate the robustness and potential accuracy of Integrator Snippets. Both the SMC sampler and the Integrator Snippet are stopped when the average probability of leaving the seed particle drops below .
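The ESS-based adaptive scheme referred to above can be sketched as follows; the bisection rule, Gaussian kernel and particle values are illustrative assumptions, with the incremental weight of particle i taken to be k_ε(f_i)/k_{ε_prev}(f_i).

```python
import math

def ess(logw):
    """Effective sample size from unnormalised log-weights."""
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]
    s = sum(w)
    return s * s / sum(wi * wi for wi in w)

def next_tolerance(f_vals, eps_prev, alpha=0.5, iters=50):
    """Bisect for eps <= eps_prev at which the ESS of the incremental
    weights k_eps(f) / k_eps_prev(f) is roughly alpha * N."""
    target = alpha * len(f_vals)
    lo, hi = 0.0, eps_prev
    for _ in range(iters):
        eps = 0.5 * (lo + hi)
        logw = [0.5 * (f / eps_prev) ** 2 - 0.5 * (f / eps) ** 2 for f in f_vals]
        if ess(logw) < target:
            lo = eps   # tolerance too aggressive: ESS fell below target
        else:
            hi = eps   # ESS still above target: try a smaller tolerance
    return hi

f_vals = [0.1 * i for i in range(1, 21)]   # hypothetical |f| values of 20 particles
eps = next_tolerance(f_vals, eps_prev=1.0)
print(eps)
```

The returned tolerance is the smallest value for which the incremental ESS stays at or above the chosen fraction of the particle count.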
Our proposed algorithm consistently achieves a final value of that is two orders of magnitude smaller and is more robust to the choice of stepsize.

Variance: for this experiment we set steps, and determine and compare the variances of the estimates of the mean of for the final SMC step, for which . To improve comparison, and in particular ensure comparable computational costs, both algorithms share the same schedule , determined adaptively by the SMC algorithm in a separate pre-run.
The results are reported as componentwise boxplots in Fig. 8 where we observe a significant variance reduction for comparable computational cost.

Efficiency: we report the Expected Squared Jump Distance (ESJD) as a proxy of distance travelled by the two algorithms. For Integrator Snippets, it is possible to estimate this quantity as follows for a function
We report the average of this metric in Table 1 for the functions , normalised by total runtime in seconds for all particles (first row), for the particles using the THUG update (second row) and those using the SNUG update (third row), with standard deviations shown in parentheses. Our proposed algorithm is several orders of magnitude more efficient in the exploration of than its SMC counterpart, thanks to its ability to take full advantage of all the intermediary states of the snippet; the SMC sampler, in contrast, creates trajectories of random length.
Table 1: average ESJD normalised by runtime, Integrator Snippet vs SMC (standard deviations in parentheses).
2.7 Links to the literature
Alg. 2, and its reinterpretation Alg. 3, are reminiscent of various earlier contributions and we discuss here parallels and differences. This work was initiated in [42] and pursued and extended in [18].
Readers familiar with the “waste-free SMC” algorithm (WF-SMC) of [14] will notice similarities: there, as in Alg. 2, seed particles are extended with an MCMC kernel leaving invariant at iteration , yielding particles subsequently whittled down to new seed particles. A first difference is that while generation of an integrator snippet can be interpreted as applying a sequence of deterministic Markov kernels (see Subsection 3.3 for a detailed discussion), we show that the mutation kernels involved do not have to leave invariant; in fact we show in Section 3 that this indeed is not a requirement, therefore offering more freedom. Further, it is instructive to compare our procedure with the following two implementations of WF-SMC using an HMC kernel (4) for the mutation. A first possibility for the mutation stage is to run steps of an HMC update in sequence, where each update uses one integrator step. Assuming no velocity refreshment along the trajectory, this would lead to the exploration of a random number of states of our integrator snippet due to the accept/reject mechanism involved; incorporating refreshment or partial refreshment would similarly lead to a random number of useful samples. Alternatively one could consider a mutation mechanism consisting of an HMC kernel built around integration steps, where the endpoint of the trajectory is the mutated particle; this would obviously mean discarding potentially useful candidates. To avoid possible reader confusion, we note apparent typos in [14, Proposition 1], provided without proof, where the intermediate target distribution of the SMC algorithm, in our notation (see (18)), seems improperly stated. The statement is however not used further in that paper.
Our work shares, at first sight, similarities with [33, 38], but differs in several respects. In the discrete time setup considered in [38] a mixture of distributions similar to ours is also introduced. Specifically, for a sequence such that and , the following mixture is considered (in the notation of [38] their transformation is ; we stick to our notation for the sake of comparison and to avoid confusion with our ),
The intention is to use the distribution as an importance sampling proposal to estimate expectations with respect to ,
where the last line holds when , and . The rearrangement on the second and third lines simply captures the fact, noted earlier in the present paper, that with and then . As such NEO relies on generating samples from first, typically exact samples, which then undergo a transformation and are then properly reweighted. In contrast we aim to sample from a mixture of the type directly, typically using an iterative method, and then exploit the mixture structure to estimate expectations with respect to . Note also that some of the terms involved in the renormalization may require additional computations beyond terms for which . Our approach relies on a different importance sampling identity
We note however that the conformal Hamiltonian integrator used in [33, 38] could be used in our framework, which we leave for future investigations. Their NEO-MCMC targets a “posterior distribution” and is inspired by the identity
which is an expectation of with respect to
and they consider algorithms targeting which rely on exact samples from to construct proposals, resulting in a strategy which shares the weaknesses of an independent MH algorithm.
Much closer in spirit to our work is the “window of states” idea proposed in [28, 29] in the context of HMC algorithms. Although the MCMC algorithms of [28, 29] ultimately target and are not reversible, they involve a reversible MH update with respect to and additional transitions permitting moves from to and from to ; see Section 5 for full details.
A link we have not explored in the present manuscript is that to normalising flows [36, 26] and the related literature. In particular the ability to tune the parameter of the leapfrog integrator suggests that our methodology lends itself to learning normalising flows. We finally note an analogy of such approaches with umbrella sampling [37, 39], although the targeted mixture is not of the same form, and with the “Warp Bridge Sampling” idea of [27].
3 Sampling Markov snippets
In this section we develop the Markov snippet framework, largely inspired by the WF-SMC framework of [14] but provide here a detailed derivation following the standard SMC sampler framework [16] which allows us to consider much more general mutation kernels; integrator snippet SMC is recovered as a particular case. Importantly we provide recipes to compute some of the quantities involved using simple criteria (see Lemma 9 and Corollary 10) which allow us to consider unusual scenarios such as in Subsection 3.4.
3.1 Markov snippet SMC sampler or waste free SMC with a difference
Given a sequence of probability distributions defined on a measurable space introduce the sequence of distributions defined on such that for any
where for , and any ,
is assumed to exist for now and . This yields the marginals
(18)
Further, consider such that and define
Note that [14] set , which we do not want in our later application, and further assume that is invariant, which is not necessary and too constraining for our application in Section 3.3 (corresponding to the application in the introductory Subsection 2.2). Only the condition is required. As in Subsection 2.2 we consider the optimal backward kernel , given for by
and as established in Lemma 6 and Lemma 8, one obtains
(19)
The corresponding folded version of the algorithm is given in Alg. 4, with …:
- represents ,
- represent ,
- represents and so do .
Remark 5.
Other choices are possible and we detail here an alternative, related to the Waste-free framework of [14]. With notation as above here we take
with the weights now defined as
With these choices we retain the fundamental property that for , with then . Now with
assuming , we have the property that for any , yielding
Finally choosing to be the optimized backward kernel, the importance weight of the algorithm is,
We note that continuous time Markov process snippets could also be used. For example piecewise deterministic Markov processes such as the Zig-Zag process [6] or the Bouncy Particle Sampler [8] could be used in practice since finite time horizon trajectories can be parametrized in terms of a finite number of parameters. We do not pursue this here.
3.2 Theoretical justification
In this section we provide the theoretical justification for the correctness of Alg. 4 and an alternative proof for Alg. 3, seen as a particular case of Alg. 4.
Throughout this section we use the following notation where (resp. ) plays the rôle of (resp. ), for notational simplicity. Let and be two Markov kernels such that the following condition holds. For let and for assume that , implying the existence of the Radon-Nikodym derivatives
(20)
with the convention . We let . For , define
and introduce the mixture of distributions, defined on ,
(21)
where for
so that one can write with .
We first establish properties of , showing how samples from can be used to estimate expectations with respect to .
Lemma 6.
For any such that
1. we have for
2. with ,
   (a) then
   (b) in particular,
   (22)
Proof.
The first statement follows directly from the definition of the Radon-Nikodym derivative
The second statement follows from the definition of and
and the first statement. The last statement follows from the tower property for expectations. ∎
Corollary 7.
Assume that , then
is an unbiased estimator of since we notice that for ,
This justifies algorithms which sample from the mixture directly in order to estimate expectations with respect to .
Let and be derived from in the same way is from in (21), but for possibly different Markov kernels . Now define the Markov kernel
(23)
with . Remark that and can be made dependent on both and , provided they satisfy all the conditions stated above and below.
The following justifies the existence of in (19) for a particular choice of backward kernel and provides a simplified expression for particular choices of .
Lemma 8.
Proof.
The first statement is a standard result and is a conditional expectation. We however provide the short argument taking advantage of the specific scenario. For a fixed consider the finite measure
such that , that is . Consequently we have the existence of a Radon-Nikodym derivative such that
which indeed has the sought property. This is a kernel since for any fixed , for any , and therefore -a.s., with equality for . For the second statement we have, for any ,
and we can apply (22) in Lemma 6 with the substitutions and to conclude. For the third statement, since then because for any ,
where is such that . In either case, since , this implies that
that is from the definition of . Consequently
and indeed from Fubini’s and Dynkin’s theorems and
Now for and using the second statement and ,
We therefore conclude that
where the last inequality holds only when . ∎
The following result is important in two respects. First it establishes that if satisfy a simple property then always have a simple expression in terms of certain densities of and – this implies in particular that in Subsection 3.1 the kernel is not required to leave invariant to make the method implementable [14]. Second it provides a direct justification of the validity of advanced schemes – see Example 14. This therefore establishes that generic and widely applicable sufficient conditions for to be tractable are and the notion of -reversibility.
Lemma 9.
Let , be a -finite measure on such that and assume that we have a pair of Markov kernels such that
(24)
We call this property -reversibility. Then for such that we have
Proof.
For such that we have
∎
Corollary 10.
Let and be a -finite measure such that , let such that
then for any such that and
and provided we can deduce
where we have used Lemma 34.
We have shown earlier that standard integrator-based mutation kernels used in the context of Monte Carlo methods satisfy (24) with the Lebesgue measure, but other scenarios are possible, such as that of the preconditioned Crank–Nicolson (pCN) algorithm where is the distribution of a Gaussian process.
3.3 Revisiting sampling with integrator snippets
In this scenario we have assumed to have a density w.r.t. a -finite measure , and is an invertible mapping such that and with such that . In this manuscript we focus primarily on the scenario where is a discretization of Hamilton’s equations for a potential , e.g. a leapfrog integrator. We consider now the scenario where, in the framework developed in Section 3.1, we let be the deterministic kernel which maps the current state to . Define and ; we exploit the ideas of [1, Proposition 4] to establish that is the adjoint of if is invariant under .
Lemma 11.
Let be a probability measure and a -finite measure, on such that . Denote for any . Let be an invertible and volume preserving mapping, i.e. such that for all , then
1. form a reversible triplet, that is for all ,
2. for all such that
Proof.
For the first statement
We have
As a result
∎
Corollary 12.
With the assumptions of Lemma 11 above for the weight (20) for admits the expression
Further, for a -invariant Markov kernel the expression for the weight (19) becomes
hence recovering the expression used in Section 2. This together with the results of Section 3.1 provides an alternative justification of correctness of Alg. 3 and hence Alg. 2. Note that this choice for corresponds to the so-called optimal scenario; this can be deduced from Lemma 11 or by noting that
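The weight expression of Corollary 12 can be checked numerically in a toy setting. The following is a minimal sketch, assuming a standard normal target in both position and velocity and one leapfrog step for the (volume-preserving) integrator, so that the unnormalised weight of the k-th state of a snippet started at z is the joint density ratio π̄(ψᵏ(z))/π̄(z).

```python
import math, random

def leapfrog(x, v, step):
    """One leapfrog step for H(x, v) = x^2/2 + v^2/2 (standard normal target,
    grad U(x) = x); the map is volume preserving."""
    v = v - 0.5 * step * x
    x = x + step * v
    v = v - 0.5 * step * x
    return x, v

def log_pi_bar(x, v):
    return -0.5 * (x * x + v * v)    # unnormalised joint log-density

random.seed(1)
T, step, N = 10, 0.3, 4000
num = den = 0.0
for _ in range(N):
    x, v = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)   # seed particle z
    lz = log_pi_bar(x, v)
    for _k in range(T + 1):
        w = math.exp(log_pi_bar(x, v) - lz)   # pi_bar(psi^k(z)) / pi_bar(z)
        num += w * x * x                      # accumulate for E[x^2] = 1
        den += w
        x, v = leapfrog(x, v, step)
est = num / den
print(est)
```

Self-normalising the weights over all snippet states yields a consistent estimator; here the estimate of the second moment should be recovered up to Monte Carlo error.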
3.4 More complex integrator snippets
Here we provide examples of more complex integrator snippets to which earlier theory immediately applies thanks to the abstract point of view adopted throughout.
Example 13.
Let , so that is well defined. Let, for , be invertible and such that , for such that and . Then one can consider the delayed rejection MH transition probability
with and given below
A particular example is when is the Lebesgue measure on and is volume preserving, for instance the leapfrog integrator for Hamilton’s equations for some potential or the bounce update (25). Following [1] notice that has density with respect to the measure . Now define , we have
and with
For instance can be an HMC update, while
(25)
for some unit length vector field . This can therefore be used as part of Alg. 4; care must be taken when computing the weights and ; Subsections 3.1–3.3 provide the tools to achieve this. For example, in the situation where is the Lebesgue measure on and then we recover Alg. 3 or equivalently Alg. 2.
Example 14.
An interesting instance of Example 13 is concerned with the scenario where interest is in sampling constrained to some set such that . Define constrained to , and similarly . We let be defined as above but targeting . Naturally has a density w.r.t. , for and for we have , . Consequently and and as a result
The corresponding kernel is described algorithmically in Alg. 5. In the situation where for a continuously differentiable function , the bounces described in (25) can be defined in terms of the field such that
This justifies the ideas of [5], where a process of the type given in Alg. 5 is used as a proposal within a MH update, although the possibility of a rejection after the second stage seems to have been overlooked in that reference.
Naturally a rejection of both transformations and of the current state means that the algorithm gets stuck. We note that it is also possible to replace the third update case with a full refreshment of the velocity, which can be interpreted as a third delayed rejection update, of acceptance probability one.
3.5 Numerical illustration: orthant probabilities
In this section, we consider the problem of calculating Gaussian orthant probabilities, given by
where are known vectors of dimension and with a covariance matrix of size . Consider the Cholesky decomposition of which is given by , where is a lower triangular matrix with positive diagonal entries. It is clear that can be viewed as , where . Consequently, one can instead rewrite as a product of probabilities given by
(26)
and
(27)
for . For notational simplicity, we let denote the interval with the convention . Then, can be written as the product of for . Moreover, one can see that is also the normalising constant of the conditional distribution of given . To calculate the orthant probability, [32] proposed an SMC algorithm targeting the sequence of distributions for , given by
(28)
(29)
where denotes the probability density of a . One could also note that
and , where represents the probability of a standard normal random variable being in the region . Therefore, the SMC algorithm proposed by [32] proceeds as follows. (1) At time , particles are extended by sampling . (2) Particles are then reweighted by multiplying their weights by the incremental weights . (3) If the ESS is below a certain threshold, resample the particles and move them through an MCMC kernel that leaves invariant for iterations. For the MCMC kernel at step (3), [32] recommended a Gibbs sampler leaving invariant. The orthant probability we are interested in can then be viewed as the normalising constant of and can be estimated as a by-product of the SMC algorithm.
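A compact sketch of the sequential decomposition above, in the spirit of [32] but omitting the reweighting, resampling and move steps: coordinates of η are drawn one at a time from the appropriate truncated standard normals by inverse-CDF sampling, and the conditional probabilities are multiplied. Function and variable names are hypothetical.

```python
import math, random
from statistics import NormalDist

STD = NormalDist()

def orthant_prob(L, a, b, n_particles=2000, seed=0):
    """Estimate P(a <= L eta <= b) for eta ~ N(0, I): coordinates of eta are
    drawn sequentially from truncated standard normals and the product of the
    conditional probabilities p_t is averaged over particles."""
    rng = random.Random(seed)
    d = len(a)
    total = 0.0
    for _ in range(n_particles):
        eta = []
        prod = 1.0
        for t in range(d):
            shift = sum(L[t][s] * eta[s] for s in range(t))
            lo = (a[t] - shift) / L[t][t]
            hi = (b[t] - shift) / L[t][t]
            plo, phi = STD.cdf(lo), STD.cdf(hi)
            prod *= phi - plo                      # conditional probability p_t
            u = plo + rng.random() * (phi - plo)   # inverse-CDF truncated draw
            eta.append(STD.inv_cdf(u))
        total += prod
    return total / n_particles

# Independent 2-d check: P(X1 >= 0, X2 >= 0) = 0.25 for the identity Cholesky factor
print(orthant_prob([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [math.inf, math.inf]))
```

For a diagonal Cholesky factor the conditional probabilities are deterministic and the estimator is exact; correlation enters only through the shift term built from the previously drawn coordinates.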
Since we are sampling from constrained Gaussian distributions, Hamilton’s equations can be solved exactly and is always . As a result, the incremental weights for the trajectories simplify to and each particle on the trajectory starting from has an incremental weight proportional to . To obtain a trajectory, we follow [30], who perform HMC with reflections. As the dimension increases, the number of reflections performed under will also increase for a fixed integration time. We adaptively tuned the integration time to ensure that the average number of reflections at each SMC step does not exceed a given threshold, set to in our experiment. To show that the waste-recycling HSMC algorithm scales well in high dimension, we set , and . Also, we use the same covariance matrix as in [14] and perform variable re-ordering as suggested in [32] before the simulation.


Figures 9 and 10 show the results obtained with and various values of . With a quarter of the number of particles used in [14], the waste-recycling HSMC algorithm achieves comparable performance when estimating the normalising constant (i.e. the orthant probability). Moreover, the estimates are stable for different choices of , although one observes that the algorithm achieves its best performance when (i.e. each trajectory contains particles). This suggests that the integration time should be adaptively tuned in a different way to achieve the best performance for a fixed computational budget. Estimates of the function with respect to the Gaussian distribution truncated between and are also stable for different choices of , although they are more variable than those obtained in [14]. This indicates that the waste-recycling HSMC algorithm does scale well in high dimension. We note that this higher variance compared to the waste-free SMC of [14] is obtained in a scenario where they are able to exploit the particular structure of the problem and implement an exact Gibbs sampler to move the particles. The integrator snippet SMC we propose is however more general and applicable to scenarios where such structure is not present.
4 Preliminary theoretical exploration
In this section we omit the index and provide precise elements to understand the properties of the algorithms and estimators considered in this manuscript.
4.1 Variance using folded mixture samples
This section establishes variance reduction for an estimator of of the type (15), but where it is assumed that are iid distributed according to . While this is not the exact distribution in the algorithm, it provides a proxy representative of the quantities one needs to control when analysing SMC samplers [15, Chapter 9]: the variance of estimators in an SMC sampler can be decomposed as the sum of local variances at each iteration, which converge to the variance terms considered hereafter as . The main message is that can be understood as playing the rôle of control variates, with the potential of reducing variance.
We use the following simplified notation throughout. Let
with . The fundamental property exploited throughout is (11), which can be rephrased as, for -integrable,
or written differently with
The estimator we use is therefore a so-called “Rao-Blackwellized estimator”, assuming ,
of variance . The following relates this variance to that of the standard estimator using iid samples from , which would likewise play a role in the asymptotic properties of SMC algorithms.
Proposition 15.
We have
Proof.
The variance decomposition identity yields
but from the fundamental property
∎
The following provides us with a notion of effective sample size for estimators of the form .
Theorem 16.
Assume that for , then with ,
1. for
2. further
Remark 17.
Multiplying by any constant does not alter the values of the upper bounds, making the results relevant to the scenario where is known up to a constant only. From the second statement we can define a notion of effective sample size (ESS)
which could be compared to or .
Proof.
4.2 Variance using unfolded mixture samples
For invertible, a probability distribution such that and such that exists we have the identity
As a result for , all invertible and such that for , we have
(30)
which suggests the Pushforward Importance Sampling (PISA) estimator, for , and with
(31)
This is the estimator in (7) when and .
4.2.1 Relative efficiency for unfolded estimators
In order to define the notion of relative efficiency for the estimator in (31) we first establish the following bounds.
Theorem 18.
With the notation above and throughout, for any
1.
2. more simply
(32)
3. and
Remark 19.
The upper bound in the first statement confirms the control variate nature of integrator snippets, even when using the unfolded perspective, a property missed by the rougher bounds of the last two statements.
Remark 20 (ESS for PISA).
The notion of efficiency is usually defined relative to the “perfect” Monte Carlo scenario that is the standard estimator of relying on iid samples from for which we have
(33)
The is determined by the ratio of the upper bound in (32) to (33). Our point below is that the notion of efficiency can be defined relative to any competing algorithm, virtual or not, in order to characterize particular properties. For example we can compute the efficiency relative to that of the “ideal” PISA estimator, i.e. for which is replaced with , and
(34)
The corresponding captures the loss incurred because of dependence along a snippet. However, given our initial motivation of recycling the computation of a standard HMC based SMC algorithm we opt to define the relative to the estimator relying on both ends of the snippet only, i.e.
In the SMC scenario considered in this manuscript (see Section 1) the above can be thought of as a proxy for estimators obtained by a “Rao-Blackwellized” SMC algorithm using in (6), where particles in Alg. 1 give rise to weighted particles
with defined in (5). Resampling with these weights is then applied to obtain particles and followed by an update of velocities to yield . Now, we observe the similarity between
(35)
and the corresponding weight in Alg. 2, in particular when and are similar, and hence and . This motivates our choice of reference to define which has a clear computational advantage since it involves ignoring terms only. In the present scenario, following Lemma 21, we have
which leads to the relative efficiency for PISA,
which can be estimated using empirical averages.
Proof of Theorem 18.
For clarity we reproduce the very useful lemma [38, Lemma 6], correcting a couple of minor typos along the way.
Lemma 21.
Let be two integrable random variables satisfying almost surely for some and let and . Then
4.2.2 Optimal weights
As mentioned in Subsection 2.3 it is possible to consider the more general scenario where unequal probability weights are ascribed to the elements of the snippets, yielding the estimator
and a natural question is that of the optimal choice of . Note that in the context of PISA the condition is not required, as suggested by the justification of the identity (30). However it should be clear that this condition should be enforced in the context of integrator snippet SMC, since the probabilistic interpretation is otherwise lost, unless the expectation is known to be non-negative. Here we discuss optimization of the variance upper bound provided by Lemma 21,
where for ,
It is a classical result that , and that the minimum is reached at the eigenvector(s) corresponding to the smallest eigenvalue of . If the constraint of non-negative entries is to be enforced then a standard quadratic programming procedure should be used. The same ideas can be applied to the function-independent upper bounds used in our definitions of efficiency.
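For intuition, a small closed-form illustration in a hypothetical 2×2 case: the unit-norm weight vector minimising the quadratic form wᵀAw is the eigenvector associated with the smallest eigenvalue of A, and the attained minimum equals that eigenvalue.

```python
import math

def min_quadratic_unit_norm(A):
    """Unit-norm minimiser of w^T A w for a symmetric 2x2 matrix A: the
    eigenvector of the smallest eigenvalue, computed in closed form."""
    a, b, c = A[0][0], A[0][1], A[1][1]
    lam = 0.5 * (a + c) - math.sqrt(0.25 * (a - c) ** 2 + b * b)  # smallest eigenvalue
    if abs(b) > 1e-15:
        w = (lam - c, b)          # satisfies (A - lam I) w = 0
    else:
        w = (1.0, 0.0) if a <= c else (0.0, 1.0)
    n = math.hypot(w[0], w[1])
    return (w[0] / n, w[1] / n), lam

w, lam = min_quadratic_unit_norm([[2.0, 1.0], [1.0, 2.0]])
print(w, lam)   # the minimum value w^T A w equals the smallest eigenvalue
```

In larger dimensions the same computation would be delegated to a symmetric eigensolver, or to a quadratic program when non-negativity of the weights is enforced.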
4.3 More on variance reduction and optimal flow
We now focus on some properties of the estimator of in (16). To facilitate the analysis and later developments we consider the scenario where, with the flow solution of an ODE, the dominating measure is and is invariant under the flow. We consider
and notice that similarly to the integrator scenario, for any , we have . where for any , where
the following estimator of is considered, with
We now show that in the examples considered in this paper our approach can be understood as implementing unbiased Riemann sum approximations of line integrals along contours. Adopting a continuous time approach can be justified as follows: the conditions of the following proposition are satisfied for numerous integrators; this is the case for the leapfrog integrator of Hamilton’s equations, see e.g. [20, 38, Theorem 3.4] and [38, Appendix 3.1, Theorem 9] for detailed results.
Proposition 22.
Let , for any let be a flow and for let be a discretization of such that for any there exists such that for any
Then for any continuous such that the Riemann integral
exists we have
Proof.
We have
and we can conclude. ∎
Remark 23.
Naturally convergence is uniform in with additional assumptions and we note that in some scenarios dependence on can be characterized, allowing in principle to grow with .
4.3.1 Hamiltonian contour decomposition
Assume has a probability density with respect to the Lebesgue measure and let be Lipschitz continuous such that for all , then the co-area formula states that
where is the -dimensional Hausdorff measure, used here to measure length along the contours . For example in the HMC setup where and one may choose , leading to a decomposition of the expectation according to equi-energy contours of
We now show how the solutions of Hamilton’s equations could be used as the basis for estimators mixing Riemann-sum-like and Monte Carlo estimation techniques.
Favourable scenario where .
Let such that and assume that for some Hamilton’s equations and have solutions with for some , that is the contours can be parametrised with the solutions of Hamilton’s equation at corresponding level.
Proposition 24.
We have
implying that, assuming the integral along the path is tractable,
(36)
is an unbiased estimator of .
Proof.
A remarkable point is that the strategy developed in this manuscript provides a general methodology to implement numerically the ideas underpinning the decomposition of Proposition 24 by using the estimator in (15) and invoking Proposition 22, assuming to be known. This point is valid outside of the SMC framework and it is worth pointing out that in (15) is unbiased if the samples used are marginally from .
Note the promise of dimension-free estimators if the one-dimensional line integrals in (36) were easily computed and sampling from the one-dimensional energy distribution were routine – the scenario is however more subtle.
General scenario
In the scenario where the co-area decomposition still holds, but the solution to Hamilton’s equations can in general not be used to compute integrals over the hypersurface . This would require a form of ergodicity [35, 40] of the form, for ,
where the limit always exists in the sense and constitutes von Neumann’s mean ergodic theorem [41, 35], and the rightmost equality forms Boltzmann’s conjecture. An interesting property in the present context is that for and one could replicate the variance reduction properties developed earlier. Boltzmann’s conjecture has long been disproved, as numerous Hamiltonians can be shown not to lead to ergodic systems, although some sub-classes do. However a weaker, or local, form of ergodicity can hold on the sets of a partition of
Example 25 (Double well potential).
Consider the scenario where , and kinetic energy [34]. Elementary manipulations show that satisfying Hamilton’s equations (2) imposes and therefore requires – importantly the intervals are not connected for . Rearranging terms any solution of (2) must satisfy
that is, the velocity is a function of the position in the double well, maximal for , vanishing as , with a sign flip at the endpoints of the intervals. Therefore the system is not ergodic, but ergodicity trivially occurs in each well.
In general this partitioning of can be intricate but it should be clear that in principle this could reduce variability of an estimator. In the toy Example 25, a purely mathematical algorithm inspired by the discussions above would choose the right or left well with probability and then integrate deterministically, producing samples taking at most two values which could be averaged to compute . We further note that in our context a relevant result would be concerned with the limit, for any ,
where we have used a polar reparametrization , and whether a marginal version of Boltzmann’s conjecture holds for this limit. Example 25 indicates that this is not true in general but the cardinality of the partition of may be reduced.
4.3.2 Advantage of averaging and control variate interpretation
Consider the scenario where , for some is the flow solution of an ODE of the form for some field . For we have
and we are interested in determining (or ) in order to minimize and maximize improvement over simple Monte Carlo. We recall that the Jacobian determinant of the flow is given by [19, p. 174108-5]
Lemma 26.
Let be a flow solution of and assume that with the Lebesgue measure, . Then
with
In particular for the flow solution of Hamilton’s equations associated with we have
Proof.
Using Fubini’s theorem we have
Further, we have
with . It is straightforward that
Consequently
For the second statement we have and
which is akin to the integrated autocorrelation time encountered in MCMC. ∎
This is somewhat reminiscent of what is advocated in the literature in the context of HMC or randomized HMC where the integration time (or when using an integrator) is randomized [29, 21].
Example 27.
Consider the Gaussian scenario where , with diagonal covariance matrix such that for , and . Using the reparametrization the solution of Hamilton’s equations is given for by
We have for each component , which clearly does not vanish as , or , increase: this is a particularly negative result for standard HMC where the final state of the computed integrator is used. Worse, as noted early on in [29] this implies that for heterogeneous values no integration time may be suitable simultaneously for all coordinates. This motivated the introduction of random integration times [29] which leads to the average correlation
where it is assumed that is independent of the initial state. This should be contrasted with the fixed integration time scenario since as increases this vanishes and is even negative for some values of (the minimum is here reached for about for a given component).
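The effect of randomizing the integration time can be checked numerically in one dimension, where Hamilton's equations for a N(0, sigma^2) target with unit kinetic energy are solved exactly by x(t) = x(0) cos(t/sigma) + sigma v(0) sin(t/sigma). The correlation between x(0) and x(t) is cos(t/sigma) for a fixed time t, whereas averaging t uniformly over [0, T] gives (sigma/T) sin(T/sigma), which vanishes as T grows. The values sigma = 2 and T = 4*pi below are illustrative choices, with T picked so that the fixed-time correlation is essentially 1:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, T, n = 2.0, 4.0 * np.pi, 200_000
x0 = rng.normal(0.0, sigma, n)       # x(0) ~ N(0, sigma^2)
v0 = rng.normal(0.0, 1.0, n)         # unit kinetic energy

def flow_x(x, v, t):
    # exact Hamiltonian flow for the Gaussian target
    return x * np.cos(t / sigma) + sigma * v * np.sin(t / sigma)

# fixed integration time: correlation cos(T/sigma) = cos(2*pi) = 1
corr_fixed = np.corrcoef(x0, flow_x(x0, v0, T))[0, 1]

# randomized integration time t ~ Uniform(0, T): correlation
# (sigma/T) * sin(T/sigma) = 0 on average
t_rand = rng.uniform(0.0, T, n)
corr_rand = np.corrcoef(x0, flow_x(x0, v0, t_rand))[0, 1]
print(corr_fixed, corr_rand)   # close to 1 and close to 0 respectively
```

This reproduces the contrast discussed above: no fixed integration time decorrelates all scales simultaneously, while averaging over integration times does.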
The example therefore illustrates that our approach implements this averaging feature, and hence shares its benefits, within the context of an iterative algorithm. The example also highlights a control variate interpretation. More specifically, in the discrete time scenario can be interpreted as control variates, but can induce both positive and negative correlations.
4.3.3 Towards an optimal flow?
In this section we are looking to determine a flow for some of an ODE of the form for some field which defines as above the probability model
which has the property that for any , defining , then . This suggests the use of a Rao-Blackwellized estimator inspired by . Assuming and that has a density with respect to the Lebesgue measure, then for
(38) |
In the light of Lemma 26 we aim to find for any the flow solutions of ODEs such that the function decreases as fast as possible. This is motivated by the fact that the integral on of this mapping appears in the variance upper bound for (38) in Lemma 26, which we want to minimize. Note that we also expect this mapping to be smooth under general conditions not detailed in this preliminary work. For smooth enough flow and we have, with ,
Pointwise, the steepest descent direction is given by
for a positive function to be determined optimally. In this scenario we therefore have
and by Cauchy–Schwarz the (positive) expectation is maximized for
and the trajectories we are interested in must be such that, for some ,
Note that for any the term is only a change of speed and that the trajectory of is independent of this factor, despite the remarkable fact that depends on this flow. The result seems fairly natural.
5 MCMC with integrator snippets
We restrict this discussion to integrator snippet based algorithms, but more general Markov snippet algorithms could be considered.
Consider again the target distribution
with
where for . Assume that we are in the context of Example 1, dropping for simplicity; the HMC algorithm using the integrator is then as follows
(39) |
with here
where the last equality holds when , and we let . In other words the snippet is shifted along the orbit by . Naturally this needs to be combined with updates of the velocity to lead to a viable ergodic MCMC algorithm. This can be achieved with the following kernel, with and ,
described algorithmically in Alg. 6. Indeed for any , using identity (11),
and hence
Again samples from this MCMC targeting can be used to estimate expectations with respect to using the identity (11). This is closely related to the “windows of states” approach of [28, 29, 31], where a window of states is what we call a Hamiltonian snippet in the present manuscript. Indeed the windows of states approach corresponds to the Markov update
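The windows-of-states update can be sketched in code. The toy implementation below (all choices, namely a standard Gaussian target, the leapfrog step size and the snippet length, are illustrative and not taken from the manuscript) positions the current state uniformly within a window of L + 1 leapfrog states, recovers the rest of the snippet backwards and forwards using the reversibility of the integrator, and selects the next state with probability proportional to the unnormalised target weights; the velocity is fully refreshed at each iteration:

```python
import numpy as np

rng = np.random.default_rng(1)

def H(x, v):
    # Hamiltonian for a standard Gaussian target with unit kinetic energy
    return 0.5 * x**2 + 0.5 * v**2

def leapfrog_step(x, v, h):
    v = v - 0.5 * h * x          # grad U(x) = x for the standard Gaussian
    x = x + h * v
    v = v - 0.5 * h * x
    return x, v

def snippet_update(x, h=0.2, L=20):
    v = rng.normal()                        # full velocity refresh
    K = int(rng.integers(0, L + 1))         # uniform position within the window
    states = [None] * (L + 1)
    states[K] = (x, v)
    xb, vb = x, -v                          # K steps backwards (flip velocity)
    for j in range(K - 1, -1, -1):
        xb, vb = leapfrog_step(xb, vb, h)
        states[j] = (xb, -vb)
    xf, vf = x, v                           # and L - K steps forwards
    for j in range(K + 1, L + 1):
        xf, vf = leapfrog_step(xf, vf, h)
        states[j] = (xf, vf)
    w = np.array([np.exp(-H(*z)) for z in states])
    k = rng.choice(L + 1, p=w / w.sum())    # select a state prop. to its weight
    return states[k][0]

x = 0.0
xs = np.empty(10_000)
for i in range(xs.size):
    x = snippet_update(x)
    xs[i] = x
print(xs.mean(), xs.var())   # close to 0 and 1 for the N(0, 1) target
```

Because the window here coincides with the whole snippet, the accept and reject windows of [28] merge and no accept/reject step between windows is needed.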
(40) |
which, we show below, leaves invariant. Indeed, note that for and ,
and therefore
where we obtain the last line from (11). is not -reversible in general, making theoretical comparisons challenging.
6 Discussion
We have shown how mappings used in various Monte Carlo schemes relying on numerical integrators of ODEs can be implemented to fully exploit all computations to design robust and efficient sampling algorithms. Numerous questions remain open, including the tradeoff between and . A precise analysis of this question is made particularly difficult by the fact that integration along snippets is straightforwardly parallelizable, while resampling does not lend itself to straightforward parallelisation.
Another question concerns the particular choice of mutation Markov kernel , or , in (12) or (23). Indeed such a kernel starts with a transition from samples approximating the snippet distribution to , which is then followed by a reweighting of samples leading to a representation of . Instead, for illustration, one could use an SMC sampler with (39) as mutation kernel.
In relation to the discussion in Remark 20, a natural question is how our scheme would compare with a “Rao-Blackwellized” SMC where weights of the type (35), derived from (6) are used.
We leave all these questions for future investigations.
Acknowledgements
The authors would like to thank Carl Dettmann for very useful discussions on Boltzmann’s conjecture. Research of CA and MCE was supported by EPSRC grant ‘CoSInES (COmputational Statistical INference for Engineering and Security)’ (EP/R034710/1), and EPSRC grant Bayes4Health, ‘New Approaches to Bayesian Data Science: Tackling Challenges from the Health Sciences’ (EP/R018561/1). Research of CZ was supported by a CSC Scholarship.
Appendix A Notation and definitions
We will write for the set of natural numbers and for positive real numbers. Throughout this section is a generic measurable space.
-
•
For we let be its complement.
-
•
(resp. ) is the set of measures (resp. probability distributions) on
-
•
For a set , its complement in is denoted by . We denote the corresponding indicator function by and may use the notation .
-
•
For a probability measure on and a measurable function and , we let .
-
•
For two probability measures and on we let be a measure on such that for .
-
•
For a Markov kernel on , we write
-
–
for the probability measure on such that for , the minimal product -algebra, .
-
–
for the probability measure on such that for .
-
–
-
•
For probability distributions on and kernels such that then we denote
the corresponding Radon-Nikodym derivative such that for ,
-
•
A point mass distribution at will be denoted by ; it is such that for
-
•
In order to alleviate notation, for , , , we refer to as weighted samples to mean where but .
-
•
We say that a set of weighted samples, or particles, for represents a distribution whenever for -integrable
either in the sense for some .
-
•
For , , , we let mean that .
-
•
For , , , we let mean that .
-
•
For we let
-
–
be the transpose of the Jacobian
-
–
for we let be the gradient,
-
–
be the divergence.
-
–
Appendix B Radon-Nikodym derivative
The general formalism required and used throughout the paper relies on a single measure-theoretic tool, the Radon–Nikodym derivative. We gather here definitions and intermediate results used throughout, pointing out the simplicity of the tools involved and the benefits they bring.
Definition 28 (Pushforward).
Let be a measure on and a measurable function. The pushforward of by is defined by
where is the preimage of under .
If is a probability distribution then is the probability measure associated with when .
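The definition can be verified by simulation: if x has law mu then f(x) has law given by the pushforward, so expectations under the pushforward are expectations of composed functions under mu (this is Theorem 31 below). The sketch uses the illustrative choice mu = N(0, 1) and f = exp, for which the pushforward is the log-normal law with known mean exp(1/2):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=200_000)   # x ~ mu = N(0, 1)

f = np.exp                     # pushforward of mu by f is log-normal(0, 1)
phi = lambda y: y              # test function on the image space

# expectation of phi under the pushforward, computed as the Monte Carlo
# average of phi(f(x)) under mu; the closed form here is exp(1/2)
est = phi(f(x)).mean()
print(est, np.exp(0.5))        # the two values agree to Monte Carlo error
```

The same mechanism, with f an integrator step, underlies the identities used in the body of the paper.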
Definition 29 (Dominating and equivalent measures).
For two measures and on the same measurable space ,
-
1.
is said to dominate if for all measurable , – this is denoted .
-
2.
and are equivalent, written , if and .
We will need the notion of Radon-Nikodym derivative [7, Theorems 32.2 & 16.11]:
Theorem 30 (Radon–Nikodym).
Let and be -finite measures on . Then if and only if there exists an essentially unique, measurable, non-negative function such that
Therefore we can view as the density of w.r.t. and in particular if is integrable w.r.t. then
If is a measure and a non-negative, measurable function then is the measure , i.e. the measure such that is the Radon–Nikodym derivative .
The following establishes the expression of an expectation with respect to the pushforward in terms of expectations with respect to [7, Theorem 16.13].
Theorem 31 (Change of variables).
A function is integrable w.r.t. if and only if is integrable w.r.t. , in which case
(41) |
We now establish results useful throughout the manuscript. The central identity used throughout the manuscript is a direct application of Theorem 31 for invertible
which seems tautological since it can be summarized as follows: for , then and ! However the interest of the approach stems from the following properties.
Lemma 32.
Let be measurable and integrable, and be -finite measures on such that and . Then
-
1.
and therefore ,
-
2.
for -almost all ,
-
3.
we have
in which case for -almost all
and therefore
Proof.
Corollary 33.
In the scenario when and are such that then .
Lemma 34 (Billingsley, Problem 32.6.).
Assume and are -finite and that . Then if and only if , in which case
Proof.
Let . For integrable we always have
Assume then from above for any integrable
and therefore and we conclude from the first identity above. Now assume that , then
The equivalence is therefore established, and when either condition is satisfied we have
and we conclude. ∎
References
- [1] Christophe Andrieu, Anthony Lee, and Sam Livingstone. A general perspective on the Metropolis-Hastings kernel. arXiv preprint arXiv:2012.14881, 2020.
- [2] Christophe Andrieu, James Ridgway, and Nick Whiteley. Sampling normalizing constants in high dimensions using inhomogeneous diffusions, 2018.
- [3] Mark A Beaumont. Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 41:379–406, 2010.
- [4] Alexandros Beskos, Natesh Pillai, Gareth Roberts, Jesus-Maria Sanz-Serna, and Andrew Stuart. Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534, 2013.
- [5] Michael Betancourt. Nested sampling with constrained Hamiltonian Monte Carlo. In AIP Conference Proceedings, volume 1305, pages 165–172. American Institute of Physics, 2011.
- [6] Joris Bierkens and Gareth Roberts. A piecewise deterministic scaling limit of lifted Metropolis–Hastings in the Curie–Weiss model. The Annals of Applied Probability, 27(2):846–882, 2017.
- [7] P. Billingsley. Probability and Measure. 3rd edition. Wiley, New York, 1995.
- [8] Alexandre Bouchard-Côté, Sebastian J Vollmer, and Arnaud Doucet. The bouncy particle sampler: a nonreversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical Association, 113(522):855–867, 2018.
- [9] Dmitri Burago, Yuri Burago, and Sergei Ivanov. A course in metric geometry, volume 33. American Mathematical Society, 2022.
- [10] Mari Paz Calvo, Daniel Sanz-Alonso, and Jesús María Sanz-Serna. HMC: reducing the number of rejections by not using leapfrog and some results on the acceptance rate. Journal of Computational Physics, 437:110333, 2021.
- [11] Cédric M Campos and Jesús María Sanz-Serna. Extra chance generalized hybrid Monte Carlo. Journal of Computational Physics, 281:365–374, 2015.
- [12] Joseph T Chang and David Pollard. Conditioning as disintegration. Statistica Neerlandica, 51(3):287–317, 1997.
- [13] Nicolas Chopin and James Ridgway. Leave Pima Indians alone: binary regression as a benchmark for Bayesian computation. Statistical Science, 32(1):64–87, 2017.
- [14] Hai-Dang Dau and Nicolas Chopin. Waste-free sequential Monte Carlo. arXiv preprint arXiv:2011.02328, 2020.
- [15] Pierre Del Moral. Feynman-Kac Formulae. Springer, 2004.
- [16] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.
- [17] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics letters B, 195(2):216–222, 1987.
- [18] Mauro Camara Escudero. Approximate Manifold Sampling: Robust Bayesian Inference for Machine Learning. PhD thesis, School of Mathematics, January 2024.
- [19] Youhan Fang, J. M. Sanz-Serna, and Robert D. Skeel. Compressible generalized hybrid Monte Carlo. The Journal of Chemical Physics, 140(17):174108, 05 2014.
- [20] Ernst Hairer, Syvert P. Nørsett, and Gerhard Wanner. Solving Ordinary Differential Equations I: Nonstiff Problems. Springer-Verlag, 1993.
- [21] Matthew Hoffman, Alexey Radul, and Pavel Sountsov. An adaptive-MCMC scheme for setting trajectory lengths in Hamiltonian Monte Carlo. In International Conference on Artificial Intelligence and Statistics, pages 3907–3915. PMLR, 2021.
- [22] Matthew D Hoffman and Pavel Sountsov. Tuning-Free Generalized Hamiltonian Monte Carlo. In International Conference on Artificial Intelligence and Statistics, pages 7799–7813. PMLR, 2022.
- [23] Paul B Mackenzie. An improved hybrid Monte Carlo method. Physics Letters B, 226(3-4):369–371, 1989.
- [24] Florian Maire and Pierre Vandekerkhove. On Markov chain Monte Carlo for sparse and filamentary distributions. ArXiv e-prints, 2018.
- [25] Florian Maire and Pierre Vandekerkhove. Markov kernels local aggregation for noise vanishing distribution sampling. SIAM Journal on Mathematics of Data Science, 4(4):1293–1319, 2022.
- [26] Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini. Sampling via Measure Transport: An Introduction, pages 1–41. Springer International Publishing, Cham, 2016.
- [27] Xiao-Li Meng and Stephen Schilling. Warp bridge sampling. Journal of Computational and Graphical Statistics, 11(3):552–586, 2002.
- [28] Radford M Neal. An improved acceptance procedure for the hybrid Monte Carlo algorithm. Journal of Computational Physics, 111(1):194–203, 1994.
- [29] Radford M. Neal. MCMC Using Hamiltonian Dynamics, chapter 5. CRC Press, 2011.
- [30] Ari Pakman and Liam Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542, 2014.
- [31] Zhaohui S. Qin and Jun S. Liu. Multipoint Metropolis method with application to Hybrid Monte Carlo. Journal of Computational Physics, 172(2):827–840, 2001.
- [32] James Ridgway. Computation of Gaussian orthant probabilities in high dimension. Statistics and Computing, 26(4):899–916, 2016.
- [33] Grant M Rotskoff and Eric Vanden-Eijnden. Dynamical computation of the density of states and Bayes factors using nonequilibrium importance sampling. Physical review letters, 122(15):150602, 2019.
- [34] Gabriel Stoltz, Mathias Rousset, et al. Free energy computations: A mathematical perspective. World Scientific, 2010.
- [35] D. Szász. Boltzmann’s ergodic hypothesis, a conjecture for centuries? In Hard Ball Systems and the Lorentz Gas, pages 421–446. Springer, 2000.
- [36] Esteban G Tabak and Eric Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.
- [37] Erik H Thiede, Brian Van Koten, Jonathan Weare, and Aaron R Dinner. Eigenvector method for umbrella sampling enables error analysis. The Journal of chemical physics, 145(8), 2016.
- [38] Achille Thin, Yazid Janati El Idrissi, Sylvain Le Corff, Charles Ollion, Eric Moulines, Arnaud Doucet, Alain Durmus, and Christian X Robert. Neo: Non equilibrium sampling on the orbits of a deterministic transform. Advances in Neural Information Processing Systems, 34:17060–17071, 2021.
- [39] G.M. Torrie and J.P. Valleau. Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling. Journal of Computational Physics, 23(2):187–199, 1977.
- [40] Paul F Tupper. Ergodicity and the numerical simulation of Hamiltonian systems. SIAM Journal on Applied Dynamical Systems, 4(3):563–587, 2005.
- [41] J. von Neumann. Proof of the quasi-ergodic hypothesis. Proc. Natl. Acad. Sci. USA, 18:70–82, 1932.
- [42] Chang Zhang. On the Improvements and Innovations of Monte Carlo Methods. PhD thesis, School of Mathematics, https://research-information.bris.ac.uk/en/studentTheses/on-the-improvements-and-innovations-of-monte-carlo-methods, June 2022.