
Monte Carlo sampling with integrator snippets

Christophe Andrieu, Mauro Camara Escudero and Chang Zhang
Abstract

Assume interest is in sampling from a probability distribution $\mu$ defined on $(\mathsf{Z},\mathscr{Z})$. We develop a framework to construct sampling algorithms taking full advantage of numerical integrators of ODEs, say $\psi\colon\mathsf{Z}\rightarrow\mathsf{Z}$ for one integration step, to explore $\mu$ efficiently and robustly. The popular Hybrid/Hamiltonian Monte Carlo (HMC) algorithm [17, 29] and its derivatives are examples of such a use of numerical integrators. However we show how the potential of integrators can be exploited beyond current ideas and HMC sampling in order to take into account aspects of the geometry of the target distribution. A key idea is the notion of integrator snippet, a fragment of the orbit of an ODE numerical integrator $\psi$, and its associated probability distribution $\bar{\mu}$, which takes the form of a mixture of distributions derived from $\mu$ and $\psi$. Exploiting properties of mixtures we show how samples from $\bar{\mu}$ can be used to estimate expectations with respect to $\mu$. We focus here primarily on Sequential Monte Carlo (SMC) algorithms, but the approach can be used in the context of Markov chain Monte Carlo algorithms as discussed at the end of the manuscript. We illustrate the performance of these new algorithms through numerical experimentation and provide preliminary theoretical results supporting the observed performance.

School of Mathematics, University of Bristol

1 Overview and motivation: SMC sampler with HMC

Assume interest is in sampling from a probability distribution $\mu$ on a probability space $(\mathsf{Z},\mathscr{Z})$. The main ideas of sequential Monte Carlo (SMC) samplers to simulate from $\mu$ are (a) to define a sequence of probability distributions $\{\mu_{n},n\in\llbracket 0,P\rrbracket\}$ on $(\mathsf{Z},\mathscr{Z})$ where $\mu_{P}=\mu$, $\mu_{0}$ is chosen by the user, simple to sample from, and the sequence $\{\mu_{n},n\in\llbracket P-1\rrbracket\}$ “interpolates” $\mu_{0}$ and $\mu_{P}$, and (b) to propagate a cloud of samples $\{z_{n}^{(i)}\in\mathsf{Z},i\in\llbracket N\rrbracket\}$ for $n\in\llbracket 0,P\rrbracket$ to represent $\{\mu_{n},n\in\llbracket 0,P\rrbracket\}$ using an importance sampling/resampling mechanism [16]. Notation and definitions used throughout this paper can be found in Appendix A.

After initialisation, for $n\in\llbracket P\rrbracket$, samples $\{z_{n-1}^{(i)},i\in\llbracket N\rrbracket\}$, representing $\mu_{n-1}$, are propagated thanks to a user-defined mutation Markov kernel $M_{n}\colon\mathsf{Z}\times\mathscr{Z}\rightarrow[0,1]$, as follows. For $i\in\llbracket N\rrbracket$ sample $\tilde{z}_{n}^{(i)}\sim M_{n}(z_{n-1}^{(i)},\cdot)$ and compute the importance weights, assumed to exist for the moment,

\omega_{n}^{(i)}=\frac{{\rm d}\mu_{n}\accentset{\curvearrowleft}{\otimes}L_{n-1}}{{\rm d}\mu_{n-1}\otimes M_{n}}\big(z_{n-1}^{(i)},\tilde{z}_{n}^{(i)}\big)\,, \qquad (1)

where $L_{n-1}\colon\mathsf{Z}\times\mathscr{Z}\rightarrow[0,1]$ is a user-defined “backward” Markov kernel required to define importance sampling on $\mathsf{Z}\times\mathsf{Z}$ and swap the rôles of $z_{n-1}^{(i)}$ and $\tilde{z}_{n}^{(i)}$, in the sense that for $f\colon\mathsf{Z}\rightarrow\mathbb{R}$ $\mu_{n}$-integrable,

\int f(z^{\prime})\frac{{\rm d}\mu_{n}\accentset{\curvearrowleft}{\otimes}L_{n-1}}{{\rm d}\mu_{n-1}\otimes M_{n}}\big(z,z^{\prime}\big)\,\mu_{n-1}({\rm d}z)M_{n}(z,{\rm d}z^{\prime})=\mu_{n}(f)\,.

We adopt this measure theoretic formulation of importance weights out of necessity, in order to take into account situations such as when $\mu_{n-1}$ and $\mu_{n}$ both have densities with respect to the Lebesgue measure but $M_{n}$ is a Metropolis–Hastings (MH) update, which does not possess such a density – more details are provided in Appendices A–B but can be omitted to understand how the algorithms considered in the manuscript proceed.

The mutation step is followed by a selection step where, for $i\in\llbracket N\rrbracket$, $z_{n}^{(i)}=\tilde{z}_{n}^{(a_{i})}$ with $a_{i}$ the random variable taking values in $\llbracket N\rrbracket$ such that $\mathbb{P}(a_{i}=k)\propto\omega_{n}^{(k)}$. The procedure is summarized in Alg. 1.

Given $\{M_{n},n\in\llbracket P\rrbracket\}$, the theoretically optimal choice of $\{L_{n-1},n\in\llbracket P\rrbracket\}$ is well understood, but tractability is typically obtained by assuming that $M_{n}$ is $\mu_{n-1}$-invariant, or by considering approximations of $\{L_{n-1},n\in\llbracket P\rrbracket\}$ and assuming that $M_{n}$ is $\mu_{n}$-invariant. This makes Markov chain Monte Carlo (MCMC) kernels very attractive choices for $M_{n}$, and the use of measure theoretic tools inevitable.

for $i\in\llbracket N\rrbracket$ do
      Sample $z_{0}^{(i)}\sim\mu_{0}(\cdot)$;
      Set $\omega_{0}^{(i)}=1$
end for
for $n\in\llbracket P\rrbracket$ do
      for $i\in\llbracket N\rrbracket$ do
            Sample $\tilde{z}_{n}^{(i)}\sim M_{n}\big(z_{n-1}^{(i)},\cdot\big)$;
            Compute $\omega_{n}^{(i)}$ as in (1).
      end for
      for $i\in\llbracket N\rrbracket$ do
            Sample $a_{i}\sim{\rm Cat}\big(\omega_{n}^{(1)},\ldots,\omega_{n}^{(N)}\big)$
            Set $z_{n}^{(i)}=\tilde{z}_{n}^{(a_{i})}$
      end for
end for

Algorithm 1 Generic SMC sampler
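The loop structure of Alg. 1 is simple enough to sketch directly in Python (the language of the implementation referenced at the end of this section). In the following minimal sketch, sample_mu0, mutate and log_weight are hypothetical user-supplied callables standing in for $\mu_{0}$, the mutation kernels $M_{n}$ and the logarithm of the weight (1); it is meant as a reading aid, not as the reference implementation.

import numpy as np

def generic_smc(sample_mu0, mutate, log_weight, N, P, rng=None):
    """Minimal sketch of Alg. 1; sample_mu0, mutate and log_weight are hypothetical
    user-supplied callables for mu_0, the mutation kernels M_n and the log-weight (1)."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.array([sample_mu0(rng) for _ in range(N)])            # z_0^(i) ~ mu_0
    for n in range(1, P + 1):
        z_tilde = np.array([mutate(n, zi, rng) for zi in z])     # z~_n^(i) ~ M_n(z_{n-1}^(i), .)
        logw = np.array([log_weight(n, zi, zti) for zi, zti in zip(z, z_tilde)])
        w = np.exp(logw - logw.max()); w /= w.sum()              # normalised weights omega_n
        a = rng.choice(N, size=N, p=w)                           # a_i ~ Cat(omega_n^(1), ..., omega_n^(N))
        z = z_tilde[a]                                           # selection: z_n^(i) = z~_n^(a_i)
    return z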

A possible choice of MCMC kernel is that of the hybrid Monte Carlo method, an MH update using a discretization of Hamilton’s equations [17, 29], which can be thought of as a particular instance of a more general strategy relying on numerical integrators of ODEs possessing properties of interest. More specifically, assume that interest is in sampling $\pi$ defined on $(\mathsf{X},\mathscr{X})$. First the problem is embedded into that of sampling from the joint distribution $\mu({\rm d}z):=\pi\otimes\varpi\big({\rm d}z\big)=\pi({\rm d}x)\varpi({\rm d}v)$ defined on $(\mathsf{Z},\mathscr{Z})=(\mathsf{X}\times\mathsf{V},\mathscr{X}\otimes\mathscr{V})$, where $v$ is an auxiliary variable facilitating sampling. Following the SMC framework we set $\mu_{n}({\rm d}z):=\pi_{n}\otimes\varpi_{n}\big({\rm d}z\big)$ for $n\in\llbracket 0,P\rrbracket$, a sequence of distributions on $(\mathsf{Z},\mathscr{Z})$ with $\pi_{P}=\pi$, $\{\pi_{n},n\in\llbracket 0,P-1\rrbracket\}$ probabilities on $\big(\mathsf{X},\mathscr{X}\big)$ and $\{\varpi_{n},n\in\llbracket 0,P\rrbracket\}$ on $\big(\mathsf{V},\mathscr{V}\big)$. With $\psi\colon\mathsf{Z}\rightarrow\mathsf{Z}$ an integrator of an ODE of interest, one can use $\psi^{k}(z)$ for some $k\in\mathbb{N}$ as a proposal in an MH update mechanism; $v$ is a source of randomness allowing “exploration”, resampled every now and then. Again, hereafter we let $z=:(x,v)\in\mathsf{X}\times\mathsf{V}$ be the corresponding components of $z$.

Example 1 (Leapfrog integrator of Hamilton’s equations).

Assume that $\mathsf{X}=\mathsf{V}=\mathbb{R}^{d}$, that $\{\pi_{n},\varpi_{n},n\in\llbracket 0,P\rrbracket\}$ have densities, denoted $\pi_{n}(x)$ and $\varpi_{n}(v)$, with respect to the Lebesgue measure, and let $x\mapsto U_{n}(x):=-\log\pi_{n}(x)$. For $n\in\llbracket 0,P\rrbracket$ and $U_{n}$ differentiable, Hamilton’s equations for the Hamiltonian $H_{n}(x,v):=U_{n}(x)+\frac{1}{2}|v|^{2}$ are

\dot{x}_{t}=v_{t},\quad\dot{v}_{t}=-\nabla U_{n}(x_{t})\,, \qquad (2)

and possess the important property that $H_{n}(x_{t},v_{t})=H_{n}(x_{0},v_{0})$ for $t\geq 0$. The corresponding leapfrog integrator is given, for some $\varepsilon>0$, by

\psi_{n}(x,v)={}_{\textsc{b}}\psi_{n}\circ{}_{\textsc{a}}\psi\circ{}_{\textsc{b}}\psi_{n}(x,v)\,, \qquad (3)
{}_{\textsc{b}}\psi_{n}(x,v):=\big(x,v-\tfrac{1}{2}\varepsilon\nabla U_{n}(x)\big),\quad{}_{\textsc{a}}\psi(x,v)=(x+\varepsilon\,v,v)\,.

We point out that, with the exception of the first step, only one evaluation of $\nabla U_{n}(x)$ is required per integration step, since the rightmost ${}_{\textsc{b}}\psi_{n}$ in (3) recycles the computation performed at the end of the previous iteration. Let $\sigma\colon\mathsf{Z}\rightarrow\mathsf{Z}$ be the velocity flip, that is $\sigma(x,v)=(x,-v)$; it is standard to check that $\psi_{n}^{-1}=\sigma\circ\psi_{n}\circ\sigma$, and note that $\mu_{n}\circ\sigma=\mu_{n}$ in this particular case. In its most basic form the hybrid Monte Carlo MH update leaving $\mu_{n}$ invariant proceeds as follows, for $(z,A)\in\mathsf{Z}\times\mathscr{Z}$,

M_{n+1}(z,A)=\int\varpi_{n}({\rm d}v^{\prime})\big[\alpha_{n}(x,v^{\prime};T)\mathbf{1}\{\psi_{n}^{T}(x,v^{\prime})\in A\}+\bar{\alpha}_{n}(x,v^{\prime};T)\mathbf{1}\{\sigma(x,v^{\prime})\in A\}\big]\,, \qquad (4)

with, for some user defined $T\in\mathbb{N}$,

\alpha_{n}(z;T):=1\wedge\frac{\mu_{n}\circ\psi_{n}^{T}(z)}{\mu_{n}(z)}\,, \qquad (5)

and $\bar{\alpha}_{n}(z;T)=1-\alpha_{n}(z;T)$.
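For concreteness, the following minimal Python sketch implements the leapfrog map (3) and a single update of the kernel (4)-(5); U and grad_U are assumed user-supplied callables for $U_{n}$ and $\nabla U_{n}$, and the sketch does not attempt the gradient-recycling optimisation mentioned above nor reproduce the released implementation.

import numpy as np

def leapfrog(x, v, grad_U, eps, T):
    """T steps of the leapfrog map (3); returns the whole orbit, i.e. the integrator snippet."""
    xs, vs = [x.copy()], [v.copy()]
    for _ in range(T):
        v = v - 0.5 * eps * grad_U(x)     # half kick (B step)
        x = x + eps * v                   # drift (A step)
        v = v - 0.5 * eps * grad_U(x)     # half kick (B step); a careful implementation reuses this gradient
        xs.append(x.copy()); vs.append(v.copy())
    return np.array(xs), np.array(vs)

def hmc_update(x, U, grad_U, eps, T, rng):
    """One update of the kernel (4): refresh the velocity, integrate, accept or reject as in (5)."""
    v = rng.standard_normal(x.shape)                   # v' ~ varpi_n = N(0, Id)
    xs, vs = leapfrog(x, v, grad_U, eps, T)
    log_alpha = (U(x) + 0.5 * v @ v) - (U(xs[-1]) + 0.5 * vs[-1] @ vs[-1])
    if np.log(rng.uniform()) < log_alpha:              # accept with probability alpha_n(x, v'; T)
        return xs[-1]
    return x                                           # rejection; the velocity is refreshed at the next call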

Other ODEs, capturing other properties of the target density, or other types of integrators are possible. However a common feature of integrator based updates is the need to compute recursively an integrator snippet $\mathsf{z}:=\big(z,\psi(z),\psi^{2}(z),\ldots,\psi^{T}(z)\big)$, for a given mapping $\psi\colon\mathsf{Z}\rightarrow\mathsf{Z}$, of which only the endpoint $\psi^{T}(z)$ is used. This raises the question of recycling intermediate states, all the more so as computation of the snippet often involves quantities shared with the evaluation of $U_{n}$. In Example 1, for instance, expressions for $\nabla U_{n}(x)$ and $U_{n}(x)$ often involve the same computationally costly quantities, and evaluation of the density $\mu_{n}$ at a point where $\nabla U_{n}(x)$ has already been evaluated is therefore often virtually free; consider for example $U(x)=x^{\top}\Sigma^{-1}x$ for a covariance matrix $\Sigma$, for which $\nabla U(x)=2\Sigma^{-1}x$.

In turn these quantities offer the promise of being able to exploit the points used to generate the snippet while preserving accuracy of the estimators of interest, through importance sampling or reweighting. For example one may consider virtual HMC updates, i.e. given $(z,A)\in\mathsf{Z}\times\mathscr{Z}$ and $k\in\llbracket 0,T\rrbracket$ define

P_{n,k}(z,A):=\alpha_{n}(z;k)\mathbf{1}\{\psi_{n}^{k}(z)\in A\}+\bar{\alpha}_{n}(z;k)\mathbf{1}\{\sigma(z)\in A\}\,. \qquad (6)

Noting that $\mu_{n}P_{n,k}=\mu_{n}$ one deduces that for $z\sim\mu_{n}$

\frac{1}{T+1}\sum_{k=0}^{T}\big[\alpha_{n}(z;k)\,f\circ\psi_{n}^{k}(z)+\bar{\alpha}_{n}(z;k)\,f(z)\big]\,,

is an unbiased estimator of $\mu_{n}(f)$. This post-processing procedure is akin to waste-recycling [14] and its generalizations but, crucially, does not affect the dynamics of the Markov chain generating the samples.
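As a small illustration, the unbiased average above can be computed directly from a single integrator snippet; in the following hedged sketch, snippet is the array $(z,\psi_{n}(z),\ldots,\psi_{n}^{T}(z))$ computed for a sample $z\sim\mu_{n}$, f a real-valued test function and log_mu_n the (possibly unnormalised) log-density of $\mu_{n}$.

import numpy as np

def recycled_estimate(snippet, f, log_mu_n):
    """Waste-recycling estimate (1/(T+1)) sum_k [alpha_n(z;k) f(psi^k z) + (1 - alpha_n(z;k)) f(z)]
    for z ~ mu_n; snippet is the array (z, psi(z), ..., psi^T(z))."""
    log_ratios = np.array([log_mu_n(zk) - log_mu_n(snippet[0]) for zk in snippet])
    alphas = np.minimum(1.0, np.exp(log_ratios))        # alpha_n(z; k), k = 0, ..., T
    fvals = np.array([f(zk) for zk in snippet])
    return np.mean(alphas * fvals + (1.0 - alphas) * fvals[0])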

An alternative algorithm exploiting the snippet fully, with an active effect on the dynamics, could use the following mixture of Markov chain transition kernels [23, 29, 22],

\mathfrak{M}_{n+1}(z,A)=\frac{1}{T+1}\sum_{k=0}^{T}\int\varpi_{n}({\rm d}v^{\prime})P_{n,k}(x,v^{\prime};A)\,,

which is also related to the strategy adopted in [11] with the extra chance algorithm. As we shall see our work shares the same objective but we adopt a strategy more closely related to [28]; see Section 5 for a more detailed discussion.

The present paper is concerned with exploring such recycling procedures in the context of SMC samplers, although the ideas we develop can be straightforwardly applied to particle filters in the context of state-space models and to MCMC as discussed later. The manuscript is organised as follows.

In Section 2 we first provide a high level description of particular instances of the class of algorithms considered and provide a justification through reinterpretation as standard SMC algorithms in Subsection 2.2. In Subsection 2.3 we discuss direct extensions of our algorithms, some of which we explore in the present manuscript. This work has links with related recent attempts in the literature [33, 14, 38]; these links are discussed and contrasted with our work in Subsection 2.4 where some of the motivations behind these algorithms are also discussed. Initial exploratory simulations demonstrating the interest of our approach are provided in Subsections 2.5 and 2.6.

In Section 3 we introduce the more general framework of Markov snippets Monte Carlo and associated formal justifications. Subsection 3.3 details the link with the scenario considered in Section 2. In Subsection 3.4 we provide general results facilitating the practical calculation of some of the Radon–Nikodym derivatives involved, highlighting why some of the usual constraints on mutation and backward kernels in SMC can be lifted here.

In Section 4 we provide elements of a theoretical analysis explaining expected properties of the algorithms proposed in this manuscript, although a fully rigorous theoretical analysis is beyond the present methodological contribution.

In Section 5 we explore the use of some of the ideas developed here in the context of MCMC algorithms and establish links with earlier suggestions, such as “windows of states” techniques proposed in the context of HMC [28, 29]. Notation, definitions and basic mathematical background can be found in Appendices A-B.

A Python implementation of the algorithms developed in this paper is available at https://github.com/MauroCE/IntegratorSnippets.

2 An introductory example

Assume interest is in sampling from a probability distribution $\mu$ on $(\mathsf{Z},\mathscr{Z})$ as described above Example 1, using an SMC sampler relying on the leapfrog integrator of Hamilton’s equations. As in the previous section we introduce an interpolating sequence of distributions $\{\mu_{n},n\in\llbracket 0,P\rrbracket\}$ on $(\mathsf{Z},\mathscr{Z})$ and assume for now the existence of densities for $\{\mu_{n},n\in\llbracket 0,P\rrbracket\}$ with respect to a common measure $\upsilon$, say the Lebesgue measure on $\mathbb{R}^{2d}$, denoted $\mu_{n}(z):={\rm d}\mu_{n}/{\rm d}\upsilon(z)$ for $z\in\mathsf{Z}$ and $n\in\llbracket 0,P\rrbracket$, and denote $\psi_{n}$ the corresponding integrator, which again is measure preserving in this setup.

2.1 An SMC-like algorithm

Primary interest in this paper is in algorithms of the type given in Alg. 2 and variations thereof; throughout $T\in\mathbb{N}\setminus\{0\}$.

Sample $z_{0}^{(i)}\overset{{\rm iid}}{\sim}\mu_{0}$ for $i\in\llbracket N\rrbracket$.
for $n\in\llbracket P\rrbracket$ do
      for $i\in\llbracket N\rrbracket$ do
            for $k\in\llbracket 0,T\rrbracket$ do
                  Compute $z_{n-1,k}^{(i)}:=\psi_{n}^{k}(z_{n-1}^{(i)})$ and

\bar{w}_{n,k}\big(z_{n-1}^{(i)}\big):=\frac{\mu_{n}\big(z_{n-1,k}^{(i)}\big)}{\mu_{n-1}\big(z_{n-1}^{(i)}\big)}=\frac{\mu_{n}\circ\psi_{n}^{k}\big(z_{n-1}^{(i)}\big)}{\mu_{n-1}\big(z_{n-1}^{(i)}\big)}\,,

            end for
      end for
      for $j\in\llbracket N\rrbracket$ do
            Sample $\llbracket N\rrbracket\times\llbracket 0,T\rrbracket\ni(b_{j},a_{j})\sim{\rm Cat}\big(\{\bar{w}_{n,k}(z_{n-1}^{(i)}),(i,k)\in\llbracket N\rrbracket\times\llbracket 0,T\rrbracket\}\big)$
            Set $\bar{z}_{n}^{(j)}:=(\bar{x}_{n-1}^{(j)},\bar{v}_{n-1}^{(j)})=z_{n-1,a_{j}}^{(b_{j})}$
            Rejuvenate the velocity, $z_{n}^{(j)}=(\bar{x}_{n-1}^{(j)},v_{n}^{(j)})$ with $v_{n}^{(j)}\sim\varpi_{n}$.
      end for
end for

Algorithm 2 Unfolded Hamiltonian Snippet SMC algorithm

The SMC sampler-like algorithm in Alg. 2 therefore involves propagating $N$ “seed” particles $\{z_{n-1}^{(i)},i\in\llbracket N\rrbracket\}$, with a mutation mechanism consisting of the generation of $N$ integrator snippets $\mathsf{z}:=\big(z,\psi_{n}(z),\psi_{n}^{2}(z),\ldots,\psi_{n}^{T}(z)\big)$ started at every seed particle $z\in\{z_{n-1}^{(i)},i\in\llbracket N\rrbracket\}$, resulting in $N\times(T+1)$ particles which are then whittled down to a set of $N$ seed particles using a standard resampling scheme; after rejuvenation of the velocities this yields the next generation of seed particles – this is illustrated in Fig. 1. This algorithm should be contrasted with standard implementations of SMC samplers where, after resampling, a seed particle normally gives rise to a single particle in the mutation step, in Fig. 1 the last state on the snippet. Intuitively, validity of the algorithm follows from the fact that if $\big\{(z_{n-1}^{(i)},1),i\in\llbracket N\rrbracket\big\}$ represent $\mu_{n-1}$, then $\big\{\big(z_{n-1,k}^{(i)},\bar{w}_{n,k}(z_{n-1}^{(i)})\big),(i,k)\in\llbracket N\rrbracket\times\llbracket 0,T\rrbracket\big\}$ represents $\mu_{n}$ in the sense that for $f\colon\mathsf{Z}\rightarrow\mathbb{R}$ summable, one can use the approximation

\mu_{n}(f)\approx\sum_{i=1}^{N}\sum_{k=0}^{T}f\circ\psi_{n}^{k}\big(z_{n-1}^{(i)}\big)\,\frac{\mu_{n}\circ\psi_{n}^{k}\big(z_{n-1}^{(i)}\big)/\mu_{n-1}\big(z_{n-1}^{(i)}\big)}{\sum_{j=1}^{N}\sum_{l=0}^{T}\mu_{n}\circ\psi_{n}^{l}\big(z_{n-1}^{(j)}\big)/\mu_{n-1}\big(z_{n-1}^{(j)}\big)}\,, \qquad (7)

where the self-normalization of the weights is only required in situations where the ratio $\mu_{n}(z)/\mu_{n-1}(z)$ is known up to a constant only. We provide justification for the correctness of Alg. 2 and the estimator (7) by recasting the procedure as a standard SMC sampler targeting a particular sequence of distributions in Section 2.2 and using properties of mixtures. Direct generalizations are provided in Section 2.3.
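A compact Python sketch of one $\mu_{n-1}\rightarrow\mu_{n}$ transition of Alg. 2, returning the material needed for the estimator (7), could look as follows; it reuses the hypothetical leapfrog routine sketched after Example 1, takes $\varpi_{n}=\mathcal{N}(0,\mathrm{Id})$, and assumes log_mu_prev and log_mu_next evaluate $\log\mu_{n-1}$ and $\log\mu_{n}$ up to constants. It is a reading aid under these assumptions, not the released implementation.

import numpy as np

def unfolded_snippet_smc_step(z_seed, log_mu_prev, log_mu_next, grad_U_next, eps, T, rng):
    """One mu_{n-1} -> mu_n transition of Alg. 2. z_seed has shape (N, 2, d), storing (x, v)
    for each seed particle; leapfrog is the sketch given after Example 1."""
    N, _, d = z_seed.shape
    states, logw = [], []
    for i in range(N):
        xs, vs = leapfrog(z_seed[i, 0], z_seed[i, 1], grad_U_next, eps, T)   # snippet of length T+1
        for k in range(T + 1):
            zk = np.stack([xs[k], vs[k]])
            states.append(zk)
            # log of bar{w}_{n,k}(z) = mu_n(psi_n^k(z)) / mu_{n-1}(z)
            logw.append(log_mu_next(zk) - log_mu_prev(z_seed[i]))
    states, logw = np.array(states), np.array(logw)
    w = np.exp(logw - logw.max()); w /= w.sum()
    idx = rng.choice(len(states), size=N, p=w)           # resample N seeds among the N(T+1) states
    z_new = states[idx].copy()
    z_new[:, 1] = rng.standard_normal((N, d))            # velocity rejuvenation, v ~ varpi_n = N(0, Id)
    return z_new, states, w                              # (states, w) carry the estimator (7)

For a test function f, the self-normalised estimate (7) is then simply np.sum(w * np.array([f(z) for z in states])).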

Figure 1: Illustration of the transition from $\mu_{n-1}$ to $\mu_{n}$ with integrator snippet SMC. A snippet grows (black dots) out of each seed particle (blue). The middle snippet gives rise through selection (dashed red) to two seed particles while the bottom snippet does not produce any seed particle.

2.2 Justification outline

We now outline the main ideas underpinning the theoretical justification of Alg. 2. Key to this is establishing that Alg. 2 is a standard SMC sampler targeting a particular sequence of probability distributions $\big\{\bar{\mu}_{n},n\in\llbracket 0,P\rrbracket\big\}$ from which samples can be processed to approximate expectations with respect to $\big\{\mu_{n},n\in\llbracket 0,P\rrbracket\big\}$. This has the advantage that no fundamentally new theory is required and that standard methodological ideas can be re-used in the present scenario, while the particular structure of $\big\{\bar{\mu}_{n},n\in\llbracket 0,P\rrbracket\big\}$ can be exploited for new developments. This section focuses on identifying $\big\{\bar{\mu}_{n},n\in\llbracket 0,P\rrbracket\big\}$ and establishing some of their important properties. Similar ideas are briefly touched upon in [14], but we will provide full details and show how these ideas can be pushed further, in interesting directions.

First, for $(n,k)\in\llbracket 0,P\rrbracket\times\llbracket 0,T\rrbracket$ let $\psi_{n,k}\colon\mathsf{Z}\rightarrow\mathsf{Z}$ be measurable and invertible mappings, and define $\mu_{n,k}({\rm d}z):=\mu_{n}^{\psi_{n,k}^{-1}}({\rm d}z)$, i.e. the distribution of $\psi_{n,k}^{-1}(z)$ when $z\sim\mu_{n}$. It is worth pointing out that invertibility of these mappings is not necessary, but facilitates interpretation throughout, as illustrated by the last statement about the distribution of $\psi_{n,k}^{-1}(z)$. Earlier we focused on the scenario where $\psi_{n,k}=\psi_{n}^{k}$ for an integrator $\psi_{n}\colon\mathsf{Z}\rightarrow\mathsf{Z}$, but this turns out not to be a requirement, although it is our main motivation. Useful applications of this general perspective can be found in Subsection 2.3. Introduce the probability distributions on $\big(\llbracket 0,T\rrbracket\times\mathsf{Z},\mathscr{P}(\llbracket 0,T\rrbracket)\otimes\mathscr{Z}\big)$

\bar{\mu}_{n}(k,{\rm d}z)=\frac{1}{T+1}\mu_{n,k}({\rm d}z)\,,

for $n\in\llbracket 0,P\rrbracket$. We will show that Alg. 2 can be interpreted as an SMC sampler targeting the sequence of marginal distributions on $\big(\mathsf{Z},\mathscr{Z}\big)$

\bar{\mu}_{n}({\rm d}z)=\frac{1}{T+1}\sum_{k=0}^{T}\mu_{n,k}({\rm d}z)\,, \qquad (8)

which we may refer to as a mixture, for $n\in\llbracket 0,P\rrbracket$. Note that the corresponding conditional distribution is

\bar{\mu}_{n}(k\mid z)=\frac{1}{T+1}\frac{{\rm d}\mu_{n,k}}{{\rm d}\bar{\mu}_{n}}(z)\,, \qquad (9)

which can be computed in most scenarios of interest.

Example 2.

For $n\in\llbracket 0,P\rrbracket$ assume the existence of a $\sigma$-finite dominating measure $\upsilon\gg\mu_{n}$, therefore implying the existence of a density $\mu_{n}(z):={\rm d}\mu_{n}/{\rm d}\upsilon(z)$; $\upsilon$ could be the Lebesgue measure. Assuming that $\upsilon$ is $\psi_{n,k}$-invariant, or “volume preserving”, then Lemma 32 implies, for $k\in\llbracket 0,T\rrbracket$,

\mu_{n,k}(z):=\frac{{\rm d}\mu_{n,k}}{{\rm d}\upsilon}(z)=\mu_{n}\circ\psi_{n,k}(z)\quad\text{ and }\quad\bar{\mu}_{n}(z):=\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\upsilon}(z)=\frac{1}{T+1}\sum_{k=0}^{T}\mu_{n}\circ\psi_{n,k}(z)\,,

and therefore

w_{n,k}(z):=\frac{{\rm d}\mu_{n,k}}{{\rm d}\big(\sum_{l=0}^{T}\mu_{n,l}\big)}(z)=\frac{\mu_{n}\circ\psi_{n,k}(z)}{\sum_{l=0}^{T}\mu_{n}\circ\psi_{n,l}(z)}\,.

When $\upsilon$ is the Lebesgue measure and $\{\psi_{n,k},k\in\llbracket 0,T\rrbracket\}$ are not volume preserving, additional multiplicative Jacobian determinant-like terms may be required (see Lemma 32 and the additional requirement that $\{\psi_{n,k},k\in\llbracket 0,T\rrbracket\}$ be differentiable). However this extra term may be more complex and require application specific treatment; adoption of the measure theoretic notation circumvents such peripheral considerations at this stage, which can be ignored until actual implementation of the algorithm.

A central point throughout this paper is how samples from $\mu_{n}$ can be used to obtain samples from the marginal $\bar{\mu}_{n}$ and vice-versa, thanks to the mixture structure relating the two distributions. Naturally, given $z\sim\mu_{n}$, sampling $k\sim\mathcal{U}\big(\llbracket 0,T\rrbracket\big)$ and returning $\psi_{n,k}^{-1}(z)$ yields a sample from the marginal $\bar{\mu}_{n}$. Now assuming $z\sim\bar{\mu}_{n}$ and then sampling $k\sim\bar{\mu}_{n}(\cdot\mid z)$ naturally yields $(k,z)\sim\bar{\mu}_{n}$ and hence, intuitively, $\psi_{n,k}(z)\sim\mu_{n}$. We now formally establish a more general result concerned with the estimation of expectations with respect to $\mu_{n}$ from samples from $\bar{\mu}_{n}$. For $f\colon\mathsf{Z}\rightarrow\mathbb{R}$ $\mu_{n}$-integrable and $k\in\llbracket 0,T\rrbracket$, a change of variable (see Theorem 31) yields

\int f\circ\psi_{n,k}(z)\,\mu_{n,k}({\rm d}z)=\int f\circ\psi_{n,k}(z)\,\mu_{n}^{\psi_{n,k}^{-1}}({\rm d}z)=\int f(z)\,\mu_{n}({\rm d}z)\,,

which formalizes the fact, at first sight of limited interest, that $\mu_{n}^{\psi_{n,k}^{-1}}$ is the distribution of $\psi_{n,k}^{-1}(z)$ for $z\sim\mu_{n}$ and that therefore $\psi_{n,k}\circ\psi_{n,k}^{-1}(z)\sim\mu_{n}$. The relevance of this remark comes from the identity

\int f(z)\,\mu_{n}({\rm d}z)=\frac{1}{T+1}\sum_{k=0}^{T}\int f\circ\psi_{n,k}(z)\,\mu_{n,k}({\rm d}z)
=\sum_{k=0}^{T}\int f\circ\psi_{n,k}(z)\,\bar{\mu}_{n}(k,{\rm d}z) \qquad (10)
=\int\Big\{\sum_{k=0}^{T}f\circ\psi_{n,k}(z)\,\bar{\mu}_{n}(k\mid z)\Big\}\,\bar{\mu}_{n}({\rm d}z)\,, \qquad (11)

which implies that samples from the mixture $\bar{\mu}_{n}$ can be used to unbiasedly estimate $\mu_{n}(f)$ thanks to an appropriate weighted average along the snippet $k\mapsto f\circ\psi_{n,k}(z)$. Using $f=\mathbf{1}_{A}$ for $A\in\mathscr{Z}$ formally establishes the earlier claim that if $(k,z)\sim\bar{\mu}_{n}$ then $\psi_{n,k}(z)\sim\mu_{n}$. Note that, as suggested by Example 2, construction of the estimator only requires evaluations of the density $\mu_{n}$ and of the function $f$ at $z,\psi_{n,1}(z),\psi_{n,2}(z),\ldots,\psi_{n,T}(z)$.

We now turn to the description of an SMC algorithm targeting $\big\{\bar{\mu}_{n},n\in\llbracket 0,P\rrbracket\big\}$, Alg. 3, and then establish that it is probabilistically equivalent to Alg. 2 in a sense made precise below. With $z=(x,v)$, for $n\in\llbracket P\rrbracket$ we introduce the following mutation kernel

\bar{M}_{n}(z,{\rm d}z^{\prime}):=\sum_{k=0}^{T}\bar{\mu}_{n-1}(k\mid z)\,R_{n}(\psi_{n-1,k}(z),{\rm d}z^{\prime}),\quad R_{n}(z,{\rm d}z^{\prime}):=(\delta_{x}\otimes\varpi_{n-1})({\rm d}z^{\prime})\,, \qquad (12)

where we note that the refreshment kernel has the property that $\mu_{n-1}R_{n}=\mu_{n-1}$. One can show that the near optimal kernel $\bar{L}_{n-1}\colon\mathsf{Z}\times\mathscr{Z}\rightarrow[0,1]$ is given for any $(z,A)\in\mathsf{Z}\times\mathscr{Z}$ by

\bar{L}_{n-1}(z,A):=\frac{{\rm d}\bar{\mu}_{n-1}\otimes\bar{M}_{n}(A\times\cdot)}{{\rm d}\bar{\mu}_{n-1}\bar{M}_{n}}(z)\,, \qquad (13)

and is well defined $\bar{\mu}_{n-1}\bar{M}_{n}$-a.s. (see Lemma 4, given at the end of this subsection), and satisfies the property that for any $A,B\in\mathscr{Z}$

\bar{\mu}_{n-1}\otimes\bar{M}_{n}(A\times B)=\int_{B}\bar{\mu}_{n-1}\bar{M}_{n}({\rm d}z)\,\bar{L}_{n-1}(z,A)\,,

that is, $\bar{L}_{n-1}(z,A)$ is a conditional probability of $A$ given $z$ for the joint probability distribution $\bar{\mu}_{n-1}\otimes\bar{M}_{n}$. With the assumption $\mu_{n-1}\gg\bar{\mu}_{n}$, from Lemma 4, the SMC sampler importance weights at step $n\in\llbracket P\rrbracket$ are

\bar{w}_{n}(z,z^{\prime}):=\frac{{\rm d}\bar{\mu}_{n}\accentset{\curvearrowleft}{\otimes}\bar{L}_{n-1}}{{\rm d}\bar{\mu}_{n-1}\otimes\bar{M}_{n}}(z,z^{\prime})=\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}(z^{\prime})\,, \qquad (14)

that is, for $f\colon\mathsf{Z}\rightarrow\mathbb{R}$ such that $\bar{\mu}_{n}(|f|)<\infty$,

\int f(z^{\prime})\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}(z^{\prime})\,\bar{\mu}_{n-1}\otimes\bar{M}_{n}\big({\rm d}(z,z^{\prime})\big)=\bar{\mu}_{n}(f)\,.

The corresponding standard SMC sampler is given in Alg. 3 where,

  • the weighted particles $\{(\check{z}_{n}^{(i)},1),i\in\llbracket N\rrbracket\}$ represent $\bar{\mu}_{n}$ and $\{(\check{z}_{n,k}^{(i)},w_{n,k}(\check{z}_{n}^{(i)})),(i,k)\in\llbracket N\rrbracket\times\llbracket 0,T\rrbracket\}$ represent $\mu_{n}$ from (11),

  • the steps computing the snippet states $\check{z}_{n,k}^{(i)}$, sampling $a_{i}$ and refreshing the velocities correspond to sampling from the mutation kernel $\bar{M}_{n+1}$ in (12),

  • $\{(z_{n}^{(i)},1),i\in\llbracket N\rrbracket\}$ represent $\mu_{n}$,

  • $\{\big(z_{n}^{(i)},\bar{w}_{n+1}(z_{n}^{(i)})\big),i\in\llbracket N\rrbracket\}$ represent $\bar{\mu}_{n+1}$, and so do $\{(\check{z}_{n+1}^{(i)},1),i\in\llbracket N\rrbracket\}$.

sample $\check{z}_{0}^{(i)}\overset{{\rm iid}}{\sim}\bar{\mu}_{0}=\mu_{0}$ for $i\in\llbracket N\rrbracket$.
for $n=0,\ldots,P-1$ do
      for $i\in\llbracket N\rrbracket$ do
            for $k\in\llbracket 0,T\rrbracket$ do
                  compute $\check{z}_{n,k}^{(i)}=(\check{x}_{n,k}^{(i)},\check{v}_{n,k}^{(i)}):=\psi_{n,k}(\check{z}_{n}^{(i)})$ and $w_{n,k}\big(\check{z}_{n}^{(i)}\big)$
            end for
            sample $a_{i}\sim{\rm Cat}\left(w_{n,0}\big(\check{z}_{n}^{(i)}\big),w_{n,1}\big(\check{z}_{n}^{(i)}\big),\ldots,w_{n,T}\big(\check{z}_{n}^{(i)}\big)\right)$
            set $z_{n}^{(i)}=(\check{x}_{n,a_{i}}^{(i)},v_{n}^{(i)})$ with $v_{n}^{(i)}\sim\varpi_{n}$
            compute

\bar{w}_{n+1}\big(z_{n}^{(i)}\big)=\frac{{\rm d}\bar{\mu}_{n+1}}{{\rm d}\mu_{n}}\big(z_{n}^{(i)}\big)\,,

      end for
      for $j\in\llbracket N\rrbracket$ do
            sample $b_{j}\sim{\rm Cat}\left(\bar{w}_{n+1}\big(z_{n}^{(1)}\big),\ldots,\bar{w}_{n+1}\big(z_{n}^{(N)}\big)\right)$
            set $\check{z}_{n+1}^{(j)}=z_{n}^{(b_{j})}$
      end for
end for

Algorithm 3 Folded Hamiltonian Snippet SMC algorithm

Notice that we assume here $\psi_{0}={\rm Id}$, and hence that $\bar{\mu}_{0}=\mu_{0}$, and that the weights $w_{n,k}$ appear as being computed twice, once when constructing the snippets and again when evaluating the resampling weights of the previous iteration, for the only reason that this facilitates exposition. The identities (11) and (9) suggest, for $n\in\llbracket P\rrbracket$ and $f\colon\mathsf{Z}\rightarrow\mathbb{R}$ $\mu_{n}$-integrable, the estimator of $\mu_{n}(f)$

\check{\mu}_{n}(f)=\frac{1}{N}\sum_{i=1}^{N}\sum_{k=0}^{T}\frac{1}{T+1}\frac{{\rm d}\mu_{n,k}}{{\rm d}\bar{\mu}_{n}}(\check{z}_{n}^{(i)})\,f\circ\psi_{n,k}\big(\check{z}_{n}^{(i)}\big)=\frac{1}{N}\sum_{i=1}^{N}\sum_{k=0}^{T}\frac{\mu_{n}\circ\psi_{n,k}}{\sum_{l=0}^{T}\mu_{n}\circ\psi_{n,l}}(\check{z}_{n}^{(i)})\,f\circ\psi_{n,k}\big(\check{z}_{n}^{(i)}\big)\,, \qquad (15)

where the second equality holds under the assumptions of Example 2. Further, when $\mu_{n-1}\gg\mu_{n,k}$ for $k\in\llbracket 0,T\rrbracket$ one can also write (11), for $f\colon\mathsf{Z}\rightarrow\mathbb{R}$ summable, as

\mu_{n}(f)=\int\Big\{\frac{1}{T+1}\sum_{k=0}^{T}f\circ\psi_{n,k}(z)\,\frac{{\rm d}\mu_{n,k}}{{\rm d}\mu_{n-1}}(z)\Big\}\,\mu_{n-1}({\rm d}z)\,,

which can be estimated, using self-normalization when required, with

\hat{\mu}_{n}(f)=\sum_{i=1}^{N}\sum_{k=0}^{T}\frac{\frac{{\rm d}\mu_{n,k}}{{\rm d}\mu_{n-1}}(z_{n-1}^{(i)})}{\sum_{j=1}^{N}\sum_{l=0}^{T}\frac{{\rm d}\mu_{n,l}}{{\rm d}\mu_{n-1}}(z_{n-1}^{(j)})}\,f\circ\psi_{n,k}\big(z_{n-1}^{(i)}\big)\,, \qquad (16)

therefore justifying the estimator suggested in (7) in the particular case where the conditions of Example 2 are satisfied, once we establish the equivalence of Alg. 3 and Alg. 2 in Proposition 3. In fact it can be shown (Proposition 3) that, with $\mathbb{E}_{i}$ referring to the expectation under the probability distribution underpinning Alg. $i$,

\mathbb{E}_{3}\bigl(\check{\mu}_{n}(f)\mid z_{n-1}^{(j)},j\in\llbracket N\rrbracket\bigr)=\hat{\mu}_{n}(f)\,,

a form of Rao-Blackwellization implying a lower variance for $\hat{\mu}_{n}(f)$, while the two estimators share the same bias. The result is in fact stronger since it states that the estimators are convex ordered, a property which we however do not exploit further here. Interestingly a result we establish later in the paper, Proposition 15, suggests that the variance of $\check{\mu}_{n}(f)$ is smaller than that of the standard Monte Carlo estimator using $N$ samples $z_{n}^{(i)}\overset{{\rm iid}}{\sim}\mu_{n}$, due to the control variate nature of integrator snippet estimators.

An interesting point is that computation of the weight $\mu_{n}\circ\psi_{n,k}/\bar{\mu}_{n}$ only requires knowledge of $\mu_{n}$ up to a normalizing constant, that is, the estimator is unbiased if $\check{z}_{n}^{(i)}\sim\bar{\mu}_{n}$ for $i\in\llbracket N\rrbracket$ even if $\mu_{n}$ is not completely known, while the estimator (7) will most often require self-normalisation, hence inducing a bias.
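Under the assumptions of Example 2 the estimator (15) is straightforward to evaluate from the unnormalised density alone, as the following hedged Python sketch illustrates; psi is a hypothetical callable with psi(z, k) applying $\psi_{n,k}$ (and psi(z, 0) = z), and log_mu_n the unnormalised log-density of $\mu_{n}$.

import numpy as np

def folded_estimate(z_check, f, log_mu_n, psi, T):
    """Estimator (15): for each particle z ~ bar{mu}_n, average f along the snippet with
    weights mu_n o psi_{n,k} / sum_l mu_n o psi_{n,l}, then average over particles."""
    total = 0.0
    for z in z_check:
        orbit = [psi(z, k) for k in range(T + 1)]               # z, psi_{n,1}(z), ..., psi_{n,T}(z)
        logs = np.array([log_mu_n(zk) for zk in orbit])
        w = np.exp(logs - logs.max()); w /= w.sum()
        total += np.sum(w * np.array([f(zk) for zk in orbit]))
    return total / len(z_check)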

We now provide the probabilistic argument justifying Alg. 2 and the shared notation $\{z_{n}^{(i)},i\in\llbracket N\rrbracket\}$ in Alg. 2 and Alg. 3.

Proposition 3.

Alg. 2 and Alg. 3 are probabilistically equivalent. More precisely, letting $\mathbb{P}_{i}$ refer to the probability distribution underpinning Algorithm $i$ for $i\in\{2,3\}$,

  1. for $n\in\llbracket P-1\rrbracket$ the distributions of $\{z_{n}^{(i)},i\in\llbracket N\rrbracket\}$ conditional upon $\big\{z_{n-1}^{(i)},i\in\llbracket N\rrbracket\big\}$ in Alg. 3 and Alg. 2 are the same,

  2. the joint distributions of $\{z_{n}^{(i)},i\in\llbracket N\rrbracket,n\in\llbracket 0,P-1\rrbracket\}$ are the same under $\mathbb{P}_{2}$ and $\mathbb{P}_{3}$,

  3. for $n\in\llbracket P\rrbracket$ and any $f\colon\mathsf{Z}\rightarrow\mathbb{R}$ $\mu_{n}$-integrable, with $\check{\mu}_{n}(f)$ and $\hat{\mu}_{n}(f)$ as in (15) and (16),

    \mathbb{E}_{3}\bigl(\check{\mu}_{n}(f)\mid z_{n-1}^{(j)},j\in\llbracket N\rrbracket\bigr)=\hat{\mu}_{n}(f)\,.
Proof of Proposition 3.

In Alg. 3, for any $A\in\mathscr{Z}^{\otimes N}$ we have, with $(b_{1},\ldots,b_{N})$ the random vector taking values in $\llbracket N\rrbracket^{N}$ involved in the resampling step,

\mathbb{P}_{3}\big(z_{n}^{(1:N)}\in A\mid z_{n-1}^{(1:N)}\big)=\mathbb{E}_{3}\left\{\mathbf{1}_{A}\{(\check{x}_{n,a_{i}}^{(i)},v_{n}^{(i)}),i\in\llbracket N\rrbracket\}\mathbf{1}\{\check{z}_{n}^{(i)}=z_{n-1}^{(b_{i})},i\in\llbracket N\rrbracket\}\mid z_{n-1}^{(j)},j\in\llbracket N\rrbracket\right\}
=\mathbb{E}_{3}\left\{\mathbf{1}_{A}\{(x_{n-1,a_{i}}^{(b_{i})},v_{n}^{(i)}),i\in\llbracket N\rrbracket\}\mid z_{n-1}^{(j)},j\in\llbracket N\rrbracket\right\}\,.

Letting, for $(\alpha,\beta)\in\llbracket 0,T\rrbracket^{N}\times\llbracket N\rrbracket^{N}$,

E_{A}(\alpha,\beta)=E_{A}(\alpha,\beta,z_{n-1}^{(1:N)}):=\int\mathbf{1}_{A}\{(x_{n-1,\alpha_{i}}^{(\beta_{i})},v_{n}^{(i)}),i\in\llbracket N\rrbracket\}\,\varpi_{n}^{\otimes N}({\rm d}v_{n}^{(1:N)})\,,

from the tower property we have

\mathbb{E}_{3}\left\{\mathbf{1}_{A}\{(x_{n-1,a_{i}}^{(b_{i})},v_{n}^{(i)}),i\in\llbracket N\rrbracket\}\mid z_{n-1}^{(j)},b_{j}=\beta_{j},j\in\llbracket N\rrbracket\right\}=\sum_{\alpha\in\llbracket 0,T\rrbracket^{N}}E_{A}(\alpha,\beta)\prod_{i=1}^{N}\frac{1}{T+1}\frac{{\rm d}\mu_{n,\alpha_{i}}}{{\rm d}\bar{\mu}_{n}}\big(z_{n-1}^{(\beta_{i})}\big)\,.

Since

\mathbb{P}_{3}\left(b_{1}=\beta_{1},\ldots,b_{N}=\beta_{N}\mid z_{n-1}^{(i)},i\in\llbracket N\rrbracket\right)=\biggl(\sum_{j=1}^{N}\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}\big(z_{n-1}^{(j)}\big)\biggr)^{-N}\prod_{i=1}^{N}\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}\big(z_{n-1}^{(\beta_{i})}\big)\,,

we deduce

\mathbb{P}_{3}\big(z_{n}^{(1:N)}\in A\mid z_{n-1}^{(1:N)}\big)\propto\sum_{\alpha,\beta}E_{A}(\alpha,\beta,z_{n-1}^{(1:N)})\prod_{i=1}^{N}\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}\big(z_{n-1}^{(\beta_{i})}\big)\prod_{i=1}^{N}\frac{{\rm d}\mu_{n,\alpha_{i}}}{{\rm d}\bar{\mu}_{n}}\big(z_{n-1}^{(\beta_{i})}\big)
=\sum_{\alpha,\beta}E_{A}(\alpha,\beta,z_{n-1}^{(1:N)})\prod_{i=1}^{N}\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}\big(z_{n-1}^{(\beta_{i})}\big)\frac{{\rm d}\mu_{n,\alpha_{i}}}{{\rm d}\bar{\mu}_{n}}\big(z_{n-1}^{(\beta_{i})}\big)
=\sum_{\alpha,\beta}E_{A}(\alpha,\beta,z_{n-1}^{(1:N)})\prod_{i=1}^{N}\frac{{\rm d}\mu_{n,\alpha_{i}}}{{\rm d}\mu_{n-1}}\big(z_{n-1}^{(\beta_{i})}\big)\,.

Now notice that

\mathbb{P}_{2}\left(\bar{z}_{n}^{(j)}=z_{n-1,\alpha_{j}}^{(\beta_{j})}\mid z_{n-1}^{(1:N)}\right)=\frac{\frac{{\rm d}\mu_{n,\alpha_{j}}}{{\rm d}\mu_{n-1}}\big(z_{n-1}^{(\beta_{j})}\big)}{\sum_{i,k}\frac{{\rm d}\mu_{n,k}}{{\rm d}\mu_{n-1}}\big(z_{n-1}^{(i)}\big)}\propto\frac{\frac{{\rm d}\mu_{n,\alpha_{j}}}{{\rm d}\mu_{n-1}}\big(z_{n-1}^{(\beta_{j})}\big)}{\sum_{i=1}^{N}\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}\big(z_{n-1}^{(i)}\big)}\,,

and the first statement follows from the conditional independence of $(b_{1},a_{1}),\ldots,(b_{N},a_{N})$ and the fact that $\mathbb{P}_{2}\big(z_{n}^{(1:N)}\in A\mid\bar{z}_{n-1}^{(1:N)}\big)=E_{A}(\alpha,\beta,\bar{z}_{n-1}^{(1:N)})$. Since by construction $\mathbb{P}_{3}\big((z_{0}^{(1)},\ldots,z_{0}^{(N)})\in A\big)=\mathbb{P}_{2}\big((z_{0}^{(1)},\ldots,z_{0}^{(N)})\in A\big)$ for any $A\in\mathscr{Z}$, the second statement follows from a standard Markov chain argument. Now for $g\colon\mathsf{Z}\rightarrow\mathbb{R}$ and $\iota\in\llbracket N\rrbracket$

\mathbb{E}_{3}\big(g(\check{z}_{n}^{(\iota)})\mid z_{n-1}^{(j)},j\in\llbracket N\rrbracket\big)=\mathbb{E}_{3}\big(g(z_{n-1}^{(b_{\iota})})\mid z_{n-1}^{(j)},j\in\llbracket N\rrbracket\big)=\sum_{i=1}^{N}g(z_{n-1}^{(i)})\frac{\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}\big(z_{n-1}^{(i)}\big)}{\sum_{j=1}^{N}\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}\big(z_{n-1}^{(j)}\big)}\,.

Therefore for any $k\in\llbracket 0,T\rrbracket$,

\mathbb{E}_{3}\left(\frac{{\rm d}\mu_{n,k}}{{\rm d}\bar{\mu}_{n}}(\check{z}_{n}^{(\iota)})\,f\circ\psi_{n,k}(\check{z}_{n}^{(\iota)})\mid z_{n-1}^{(j)},j\in\llbracket N\rrbracket\right)=\sum_{i=1}^{N}\frac{{\rm d}\mu_{n,k}}{{\rm d}\bar{\mu}_{n}}(z_{n-1}^{(i)})\,f\circ\psi_{n,k}(z_{n-1}^{(i)})\,\frac{\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}\big(z_{n-1}^{(i)}\big)}{\sum_{j=1}^{N}\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}\big(z_{n-1}^{(j)}\big)}
=\sum_{i=1}^{N}\frac{\frac{{\rm d}\mu_{n,k}}{{\rm d}\mu_{n-1}}(z_{n-1}^{(i)})}{\sum_{j=1}^{N}\sum_{l=0}^{T}\frac{{\rm d}\mu_{n,l}}{{\rm d}\mu_{n-1}}\big(z_{n-1}^{(j)}\big)}\,f\circ\psi_{n,k}(z_{n-1}^{(i)})\,.

The third statement follows. ∎

Since the justification of the latter interpretation of the algorithm is straightforward, as a standard SMC sampler targeting the instrumental distributions $\big\{\bar{\mu}_{n},n\in\llbracket 0,P\rrbracket\big\}$, and allows for further easy generalisations, we adopt this perspective in the remainder of the manuscript for simplicity.

This reinterpretation also allows the use of known facts about SMC sampler algorithms. For example it is well known that the output of SMC samplers can be used to estimate unknown normalising constants unbiasedly, by virtue of the fact that, in the present scenario,

\prod_{n=0}^{P-1}\Big[\frac{1}{N}\sum_{i=1}^{N}\frac{{\rm d}\bar{\mu}_{n+1}}{{\rm d}\mu_{n}}\big(z_{n}^{(i)}\big)\Big]

has expectation $1$ under $\mathbb{P}_{3}$. Now assume that the densities of $\{\mu_{n},n\in\llbracket 0,P\rrbracket\}$ are known up to a constant only, say ${\rm d}\mu_{n}/{\rm d}\upsilon(z)=\tilde{\mu}_{n}(z)/Z_{n}$, and that $\upsilon\gg\upsilon^{\psi_{n,k}^{-1}}$ for $k\in\llbracket 0,T\rrbracket$; then

Z_{n}=Z_{n}\int\mu_{n}^{\psi_{n,k}^{-1}}({\rm d}z)=\int\tilde{\mu}_{n}\circ\psi_{n,k}(z)\,\tfrac{{\rm d}\upsilon^{\psi_{n,k}^{-1}}}{{\rm d}\upsilon}(z)\,\upsilon({\rm d}z)\,,

and the measure $Z_{n}\bar{\mu}_{n}({\rm d}z)$ shares the same normalising constant as $\tilde{\mu}_{n}(z)\,\upsilon({\rm d}z)$,

Z_{n}=\int Z_{n}\bar{\mu}_{n}({\rm d}z)=\frac{1}{T+1}\sum_{k=0}^{T}\int\tilde{\mu}_{n}\circ\psi_{n,k}(z)\,\tfrac{{\rm d}\upsilon^{\psi_{n,k}^{-1}}}{{\rm d}\upsilon}(z)\,\upsilon({\rm d}z)\,.

Consequently

\prod_{n=0}^{P-1}\Big[\frac{1}{N(T+1)}\sum_{i=1}^{N}\sum_{k=0}^{T}\frac{\tilde{\mu}_{n+1}\circ\psi_{n+1,k}}{\tilde{\mu}_{n}}\big(z_{n}^{(i)}\big)\,\frac{{\rm d}\upsilon^{\psi_{n+1,k}^{-1}}}{{\rm d}\upsilon}\big(z_{n}^{(i)}\big)\Big]

is an unbiased estimator of $Z_{P}/Z_{0}$.
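In the volume-preserving setting of Example 2 the Jacobian terms above equal one and the estimator only involves the incremental log-weights already computed by Alg. 2; a minimal Python sketch of its evaluation on the log scale, under that assumption, is the following.

import numpy as np

def log_normalising_constant_ratio(logw_per_step):
    """Estimate of log(Z_P / Z_0) from the unnormalised incremental log-weights
    log( tilde{mu}_{n+1} o psi_{n+1,k}(z_n^(i)) / tilde{mu}_n(z_n^(i)) );
    logw_per_step is a list with one array of N*(T+1) values per SMC step."""
    log_ratio = 0.0
    for logw in logw_per_step:
        m = logw.max()
        log_ratio += m + np.log(np.mean(np.exp(logw - m)))   # log of the bracketed average
    return log_ratio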

The results of the following lemma can be deduced from Lemma 8 but we provide more direct arguments for the present scenario.

Lemma 4.

Assume $\mu_{n-1}\gg\bar{\mu}_{n}$ and $\mu_{n-1}R_{n}=\mu_{n-1}$, and let $\bar{M}_{n}\colon\mathsf{Z}\times\mathscr{Z}\rightarrow[0,1]$ be as in (12). Then for $n\in\llbracket P\rrbracket$,

  1. $\bar{\mu}_{n-1}\bar{M}_{n}=\mu_{n-1}$,

  2. there exists $\bar{L}_{n-1}\colon\mathsf{Z}\times\mathscr{Z}\rightarrow[0,1]$ as in (13) such that for any $A,B\in\mathscr{Z}$

    \bar{\mu}_{n-1}\otimes\bar{M}_{n}(A\times B)=\int_{B}\bar{\mu}_{n-1}\bar{M}_{n}({\rm d}z)\,\bar{L}_{n-1}(z,A)\,,

  3. the importance weight in (14) is well defined and

    \bar{w}_{n}(z,z^{\prime})=\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}(z^{\prime})\,.
Proof.

The first statement follows from the definition (12) of $\bar{M}_{n}$ and the identity (11). The second statement is a consequence of the following classical arguments, justifying Bayes’ rule. For $A\in\mathscr{Z}$ fixed, consider the measure

\mathscr{Z}\ni B\mapsto\bar{\mu}_{n-1}\otimes\bar{M}_{n}\big(A\times B\big)\leq\bar{\mu}_{n-1}\otimes\bar{M}_{n}\big(\mathsf{Z}\times B\big)=\bar{\mu}_{n-1}\bar{M}_{n}(B)\,,

implying $\bar{\mu}_{n-1}\bar{M}_{n}\gg\bar{\mu}_{n-1}\otimes\bar{M}_{n}\big(A\times\cdot\big)$, from which we deduce the existence of a Radon–Nikodym derivative such that

\bar{\mu}_{n-1}\otimes\bar{M}_{n}\big(A\times B\big)=\int_{B}\frac{{\rm d}\bar{\mu}_{n-1}\otimes\bar{M}_{n}\big(A\times\cdot\big)}{{\rm d}\bar{\mu}_{n-1}\bar{M}_{n}}(z^{\prime})\,\bar{\mu}_{n-1}\bar{M}_{n}({\rm d}z^{\prime})\,.

For $(z,A)\in\mathsf{Z}\times\mathscr{Z}$, we let

\bar{L}_{n-1}(z,A):=\frac{{\rm d}\bar{\mu}_{n-1}\otimes\bar{M}_{n}\big(A\times\cdot\big)}{{\rm d}\bar{\mu}_{n-1}\bar{M}_{n}}(z)\,,

and note that almost surely $\bar{L}_{n-1}(z,A)\in[0,1]$ and $\bar{L}_{n-1}(z,\mathsf{Z})=1$. For the third statement note that, from the second statement, for $A,B\in\mathscr{Z}$ we have $\bar{\mu}_{n-1}\otimes\bar{M}_{n}\big(A\times B\big)=\bar{\mu}_{n-1}\bar{M}_{n}\accentset{\curvearrowleft}{\otimes}\bar{L}_{n-1}\big(A\times B\big)$, and from Fubini’s and the $\pi$-$\lambda$ theorems [7, Theorems 3.1 and 3.2] $\bar{\mu}_{n-1}\otimes\bar{M}_{n}$ and $\bar{\mu}_{n-1}\bar{M}_{n}\accentset{\curvearrowleft}{\otimes}\bar{L}_{n-1}$ are probability distributions coinciding on $\mathscr{Z}\otimes\mathscr{Z}$. To conclude the proof of the third statement, for $f\colon\mathsf{Z}^{2}\rightarrow[0,1]$ measurable, we successively use this equality, the first statement of the lemma together with the definition of $\accentset{\curvearrowleft}{\otimes}$, and the assumption $\mu_{n-1}\gg\bar{\mu}_{n}$:

\int f(z,z^{\prime})\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}(z^{\prime})\,\bar{\mu}_{n-1}\otimes\bar{M}_{n}\big({\rm d}(z,z^{\prime})\big)=\int f(z,z^{\prime})\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}(z^{\prime})\,\bar{\mu}_{n-1}\bar{M}_{n}\accentset{\curvearrowleft}{\otimes}\bar{L}_{n-1}\big({\rm d}(z,z^{\prime})\big)
=\int f(z,z^{\prime})\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}(z^{\prime})\,\mu_{n-1}({\rm d}z^{\prime})\bar{L}_{n-1}\big(z^{\prime},{\rm d}z\big)
=\int f(z,z^{\prime})\,\bar{\mu}_{n}({\rm d}z^{\prime})\bar{L}_{n-1}\big(z^{\prime},{\rm d}z\big)\,,

therefore establishing that, $\bar{\mu}_{n-1}\otimes\bar{M}_{n}$-almost surely,

\frac{{\rm d}\bar{\mu}_{n}\accentset{\curvearrowleft}{\otimes}\bar{L}_{n-1}}{{\rm d}\bar{\mu}_{n-1}\otimes\bar{M}_{n}}(z,z^{\prime})=\frac{{\rm d}\bar{\mu}_{n}}{{\rm d}\mu_{n-1}}(z^{\prime})\,. ∎

2.3 Direct extensions

It should be clear that the algorithm we have described lends itself to numerous generalizations, which we briefly discuss below.

There is no reason to limit the number of snippets arising from a seed particle to one, therefore offering the possibility to take advantage of parallel machines. For example the velocity of a given seed particle can be refreshed multiple times, resulting in partial copies of the seed particle from which integrator snippets can be grown.

The main scenario motivating this work corresponds to the choice, for $n\in\llbracket 0,P\rrbracket$, of $\{\psi_{n,k}=\psi_{n}^{k},k\in\llbracket 0,T\rrbracket\}$ for a given $\psi_{n}\colon\mathsf{Z}\rightarrow\mathsf{Z}$. As should be apparent from the theoretical justification, this can be replaced by a general family of invertible mappings $\{\psi_{n,k}\colon\mathsf{Z}\rightarrow\mathsf{Z},k\in\llbracket 0,T\rrbracket\}$ where the $\psi_{n,k}$’s are now not required to be measure preserving in general, in which case the expression for $w_{n,k}(z)$ may involve additional terms of the “Jacobian” type. These mappings may correspond to integrators other than those of Hamilton’s equations, or may be more general deterministic mappings, and $T$ may have no temporal meaning, only representing the number of deterministic transformations used in the algorithm. More specifically, assuming that $\upsilon\gg\mu_{n}$ and $\upsilon\gg\upsilon^{\psi_{n,k}^{-1}}$ for some $\sigma$-finite dominating measure $\upsilon$, and letting $\mu_{n}(z):={\rm d}\mu_{n}/{\rm d}\upsilon(z)$, the required weights are now of the form (see Lemmas 32-34)

\bar{w}_{n,k}(z):=\frac{1}{T+1}\frac{\mu_{n}\circ\psi_{n,k}(z)}{\mu_{n-1}(z)}\frac{{\rm d}\upsilon^{\psi_{n,k}^{-1}}}{{\rm d}\upsilon}(z)\,.

Again, when $\upsilon$ is the Lebesgue measure and $\psi_{n,k}$ a diffeomorphism, ${\rm d}\upsilon^{\psi_{n,k}^{-1}}/{\rm d}\upsilon$ is the absolute value of the determinant of the Jacobian of $\psi_{n,k}$. Non-uniform weights may be ascribed to each of these transformations in the definition of $\bar{\mu}$ (24). A more useful application of this generality is the situation where it is believed that using multiple integrators, specialised in capturing different geometric features of the target density, could be beneficial. Hereafter we simplify notation by setting $\nu\leftarrow\mu_{n}$ and $\mu\leftarrow\mu_{n+1}$. For the purpose of illustration consider two distinct integrators $\psi_{i}$, $i\in\llbracket 2\rrbracket$ (again $n$ disappears from the notation), which we wish to use, each for $T_{i}\in\mathbb{N}$ time steps, with proportions $\gamma_{i}\geq 0$ such that $\gamma_{1}+\gamma_{2}=1$. Again with $\mu=\pi\otimes\varpi$ define the mixture

\bar{\mu}(i,k,{\rm d}z):=\frac{\gamma_{i}}{T_{i}+1}\mu^{\psi_{i,k}^{-1}}({\rm d}z)\mathbf{1}\{k\in\llbracket 0,T_{i}\rrbracket\}\,,

which still possesses the fundamental property that if $z\sim\bar{\mu}$ (resp. $(i,z)\sim\bar{\mu}$), then with $(i,k)\sim\bar{\mu}(k,i\mid z)$ (resp. $k\sim\bar{\mu}(k\mid i,z)$) we have $\psi_{i,k}(z)\sim\mu$. It is possible to aim to sample from $\bar{\mu}({\rm d}z)$, in which case the pair $(i,k)$ plays the rôle played by $k$ in the earlier simpler scenario. However, in order to introduce the sought persistency, that is, to use either $\psi_{1}$ or $\psi_{2}$ throughout the construction of a snippet, we focus on the scenario where the pair $(i,z)$ plays the rôle played by $z$ earlier. In other words the target distribution is now $\bar{\mu}(i,{\rm d}z)$, which is to $\mu(i,{\rm d}z):=\gamma_{i}\cdot\mu({\rm d}z)$ what $\bar{\mu}_{n}({\rm d}z)$ in (24) was to $\mu_{n}({\rm d}z)$. The mutation kernel corresponding to (24) is given by

\bar{M}_{\nu,\mu}(i,z;j,{\rm d}z^{\prime}):=\sum_{k=0}^{T_{i}}\bar{\nu}(k\mid i,z)\,R(i,\psi_{i,k}(z);j,{\rm d}z^{\prime})\,,

with now the requirement that $\bar{\nu}R(i,{\rm d}z)=\nu(i,{\rm d}z)=\gamma_{i}\cdot\nu({\rm d}z)$. A sensible choice seems to be $R(i,z;j,{\rm d}z^{\prime})=\gamma_{j}\cdot R_{0}(z,{\rm d}z^{\prime})$ with $\nu R_{0}({\rm d}z^{\prime})=\nu({\rm d}z^{\prime})$. With these choices, using the near-optimal backward kernel, the simple substitutions $(i,z)\leftarrow z$ and $(j,z^{\prime})\leftarrow z^{\prime}$ yield

\frac{{\rm d}\bar{\mu}\accentset{\curvearrowleft}{\otimes}\bar{L}_{\nu,\mu}}{{\rm d}\bar{\nu}\otimes\bar{M}_{\nu,\mu}}(i,z;j,z^{\prime})=\frac{{\rm d}\bar{\mu}}{{\rm d}\nu}(j,z^{\prime})\,,

and justifies the earlier abstract presentation.
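A possible organisation of this mutation in code, under the volume-preserving assumptions of Example 2 and with the choice $R(i,z;j,{\rm d}z^{\prime})=\gamma_{j}R_{0}(z,{\rm d}z^{\prime})$ discussed above, is sketched below; snippets, log_nu and refresh are hypothetical callables and the sketch is only meant to illustrate the structure of $\bar{M}_{\nu,\mu}$.

import numpy as np

def mixed_integrator_mutation(i, z, gammas, snippets, log_nu, refresh, rng):
    """One draw from bar{M}_{nu,mu}(i, z; ., .): pick k ~ bar{nu}(k | i, z), move to
    psi_{i,k}(z), then draw the next label j ~ gamma and refresh with the nu-invariant R_0.
    snippets[i](z) returns the orbit (psi_{i,k}(z), k = 0, ..., T_i)."""
    orbit = snippets[i](z)
    logs = np.array([log_nu(zk) for zk in orbit])
    w = np.exp(logs - logs.max()); w /= w.sum()        # bar{nu}(k | i, z) in the volume-preserving case
    k = rng.choice(len(orbit), p=w)
    j = rng.choice(len(gammas), p=np.asarray(gammas))  # new integrator label, R(i, z; j, .) = gamma_j R_0
    return j, refresh(orbit[k], rng)                   # new state z' ~ R_0(psi_{i,k}(z), .)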

Another direct extension consists of generalizing the definition of $\bar{\mu}_{n}$ by assigning non-uniform and possibly state-dependent weights to the integrator snippet particles as follows,

\bar{\mu}_{n}(k,{\rm d}z)=\omega_{n,k}\circ\psi_{n,k}(z)\,\mu_{n,k}({\rm d}z)\,,

with $\omega_{n,k}\colon\mathsf{Z}\rightarrow\mathbb{R}_{+}$ for $k\in\llbracket 0,T\rrbracket$ and such that $\sum_{k=0}^{T}\omega_{n,k}(z)=1$ for any $z\in\mathsf{Z}$; this should be contrasted with (8). Such choices ensure that the central identity (11) can be generalized as follows

\sum_{k=0}^{T}\int f\circ\psi_{n,k}(z)\,\frac{\omega_{n,k}\circ\psi_{n,k}(z)\cdot w_{n,k}(z)}{\sum_{l=0}^{T}\omega_{n,l}\circ\psi_{n,l}(z)\,w_{n,l}(z)}\,\bar{\mu}_{n}({\rm d}z)=\sum_{k=0}^{T}\int f\circ\psi_{n,k}(z)\,\bar{\mu}_{n}(k\mid z)\,\bar{\mu}_{n}({\rm d}z)
=\int f(z)\,\mu_{n}({\rm d}z)\,.

Note the analogy with the identity behind umbrella sampling [37, 39]. In the light of the developments of Subsection 4.3.2, a potentially useful choice could be $\omega_{n,k}(z)\propto\|z-\psi_{n,k}^{-1}(z)\|$, which however requires additional computations since

\omega_{n,k}\circ\psi_{n,k}(z)=\frac{\|\psi_{n,k}(z)-z\|}{\sum_{l=0}^{T}\|\psi_{n,k}(z)-\psi_{n,k}\circ\psi_{n,l}^{-1}(z)\|}\,,

which may require the computation of additional states for $l>k$.

We leave exploration of some of these generalisations for future work, although we think that the choice of the leapfrog integrator is particularly natural and attractive here.

2.4 Rationale and computational considerations

Our initial motivation for this work was that computation of $\{z_{n,k},k\in\llbracket T\rrbracket\}$ and $\{w_{n,k},k\in\llbracket T\rrbracket\}$ most often involves common quantities, leading to negligible computational overhead, and that reweighting offers the possibility to use all the states of a snippet in a simulation algorithm rather than the endpoint only. There are other reasons why this approach may be beneficial.

The benefit of using all states along integrator snippets in sampling algorithms has been noted in the literature. For example, in the context of Hamiltonian integrators, with $H_{n}(z):=-\log\mu_{n}(z)$ and $\psi_{n,k}=\psi_{n}^{k}$, it is known that for $z\in\mathsf{Z}$ the mapping $k\mapsto H_{n}\circ\psi_{n}^{k}(z)$ is typically oscillatory, motivating for example the extra chance algorithm of [11]. The “windows of states” strategy of [28, 29] effectively makes use of the mixture $\bar{\mu}$ as an instrumental distribution and improved performance is noted, with performance seemingly improving with dimension on particular problems; see Section 5 for a more detailed discussion. Further, averaging is well known to address scenarios where components of $x_{t}$ evolve on different scales and no choice of a unique integration time $\tau:=T\times\varepsilon$ can accommodate all scales [29, 22, Section 3.2]; averaging addresses this issue effectively, see Example 27. We also note that, keeping $T\times\varepsilon$ constant, in the limit as $\varepsilon\rightarrow 0$ the average in (14) corresponds to a numerical, Riemann-like, integration along the contour of $H_{n}(z)$, effectively leading to some form of Rao-Blackwellization along this contour; this is discussed in detail in Section 4.

Another benefit we have observed in the SMC context is the following. A waste-free SMC (WF-SMC) strategy involves comparing particles within each Markov snippet arising from a single seed particle, while our strategy involves comparing all particles across snippets, which proves highly beneficial in practice. This seems to bring particular robustness to the choice of the integrator parameters $\varepsilon$ and $T$, and can be combined with highly robust adaptation schemes taking advantage of the population of samples, in the spirit of [21, 22]; see Subsections 2.5 and 2.6.

At a computational level we note that, in contrast with standard SMC or WF-SMC implementations relying on an MH mechanism, the construction of integrator snippets does not involve an accept reject mechanism, therefore removing control flow operations and enabling lockstep computations on GPUs – actual implementation of our algorithms on such architectures is, however, left to future work.

Finally, when using integrators of Hamilton’s equations we naturally expect benefits similar to those enjoyed by HMC. Let $d\in\mathbb{N}$ and $\mathsf{X}=\mathbb{R}^{d}$. We know that in certain scenarios [4, 10] the distributions $\{\mu_{n,d},d\in\mathbb{N},n\in\llbracket T(d)\rrbracket\}$ are such that for $n\in\mathbb{N}$, $\log\big(\mu_{n,d}\circ\psi_{n,d}^{k}(z)/\mu_{n,d}(z)\big)\rightarrow_{d\rightarrow\infty}\mathcal{N}(\mu_{n},\sigma_{n}^{2})$, that is, the weights do not degenerate to zero or one: in the context of SMC this means that the part of the importance weight (14) arising from the mutation mechanism does not degenerate. Further, an appropriate choice of schedule, i.e. of the sequence $\{\mu_{n,d},n\in\llbracket T(d)\rrbracket\}$ for $d\in\mathbb{N}$, ensures that the contribution $\mu_{n,d}(z)/\mu_{n-1,d}(z)$ to the importance weights (14) is also stable as $d\rightarrow\infty$. As shown in [4, 2, 10], while direct importance sampling may require an exponential number of samples as $d$ grows, the use of such a schedule may reduce the complexity to a polynomial order.

2.5 Numerical illustration: logistic regression

In this section we consider sampling from the posterior distribution of a logistic regression model, focusing on the computation of the normalising constant. We follow [14] and consider the sonar dataset, previously used in [13]. Including an intercept term, the dataset has responses y_{i}\in\{-1,1\} and covariates z_{i}\in\mathbb{R}^{p}, where p=61. The likelihood of the parameters x\in\mathsf{X}:=\mathbb{R}^{p} is then given by

L(x)=\prod_{i=1}^{n}F(z_{i}^{\top}x\cdot y_{i})\,, (17)

where F(x):=1/(1+\exp(-x)). We ascribe x a product of independent zero-mean normal distributions as a prior, with standard deviation equal to 20 for the intercept and 5 for the other parameters. Denoting p({\rm d}x) the prior distribution of x, we define a sequence of tempered distributions \pi_{n}({\rm d}x)\propto p({\rm d}x)L(x)^{\lambda_{n}} for \lambda\colon\llbracket 0,P\rrbracket\rightarrow[0,1] non-decreasing and such that \lambda_{0}=0 and \lambda_{P}=1. We apply both the Hamiltonian Snippet SMC algorithm and the implementation of waste-free SMC of [14] and compare their performance.

For both algorithms, we set the total number of particles at each SMC step to be N(T+1)=10{,}000. For the waste-free SMC, the Markov kernel is chosen to be a random-walk Metropolis-Hastings kernel with covariance adaptively set to 2.38^{2}/d\,\hat{\Sigma}, where \hat{\Sigma} is the empirical covariance matrix obtained from the particles of the previous SMC step. For the Hamiltonian Snippet SMC algorithm, we set \psi_{n} to be the one-step leapfrog integrator with stepsize \varepsilon, U_{n}(x)=-\log\pi_{n}(x) and \varpi=\mathcal{N}(0,\mathrm{Id}). To investigate the stability of our algorithm, we ran Hamiltonian Snippet SMC with \varepsilon=0.05,0.1,0.2 and 0.3. For both algorithms, the temperatures \lambda_{n} are adaptively chosen so that the effective sample size (ESS) at the current SMC step equals \alpha\,ESS_{max}, where ESS_{max} is the maximum ESS achievable at that step; in our experiments we used \alpha=0.3,0.5 and 0.7.
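
For concreteness, a minimal sketch of how one Hamiltonian integrator snippet and its within-snippet weights can be computed in this setting is given below; it assumes user-supplied functions log_pi and grad_U returning \log\pi_{n}(x) and \nabla U_{n}(x), and all names are illustrative rather than a description of the implementation used in our experiments.

```python
import numpy as np

def hamiltonian_snippet(x0, v0, grad_U, log_pi, eps, T):
    """Build one snippet z_k = psi^k(z_0), k = 0,...,T, with the leapfrog
    integrator for the potential U_n = -log pi_n, and return the states
    together with the within-snippet weights w_k = mu_n(z_k)/mu_n(z_0),
    where mu_n(z) = pi_n(x) N(v; 0, Id)."""
    xs, vs = [x0], [v0]
    x, v = x0, v0
    for _ in range(T):
        v = v - 0.5 * eps * grad_U(x)   # half step on the velocity
        x = x + eps * v                 # full step on the position
        v = v - 0.5 * eps * grad_U(x)   # half step on the velocity
        xs.append(x)
        vs.append(v)
    log_mu = np.array([log_pi(xk) - 0.5 * np.dot(vk, vk) for xk, vk in zip(xs, vs)])
    return np.array(xs), np.array(vs), np.exp(log_mu - log_mu[0])
```

Since the gradient is evaluated at each leapfrog step in any case, computing the weights along the snippet essentially only requires storing the intermediate states, in line with the negligible overhead mentioned earlier.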

Performance comparison

Figure 2 shows boxplots of estimates of the logarithm of the normalising constant obtained from both algorithms, for different choices of N and, for the Hamiltonian Snippet SMC algorithm, of \varepsilon. The boxplots are obtained by running both algorithms 100 times for each choice of the algorithm parameters, with \alpha=0.5 in all setups. Several points are worth observing. For a suitable choice of \varepsilon, the Hamiltonian Snippet SMC algorithm produces stable and consistent estimates of the normalising constant with 10{,}000 particles at each iteration. In contrast, the waste-free SMC algorithm fails to produce accurate results for the same computational budget. It is also clear that with larger values of N (meaning smaller values of T and hence shorter snippets), the waste-free SMC algorithm produces results with larger biases and variability. For the Hamiltonian Snippet SMC algorithm, the results are stable both for short and long snippets when \varepsilon is equal to 0.1 or 0.2, whereas for \varepsilon=0.05 or \varepsilon=0.3 the algorithm becomes unstable. A possible explanation is that too small a stepsize prevents efficient exploration of the target distribution, resulting in unstable performance, while too large a stepsize makes the leapfrog integrator unstable, increasing the variability of the weights; this is a common problem of HMC algorithms, and long trajectories then result in deteriorated estimates. To obtain the best performance, one should therefore tune the stepsize and trajectory length along the SMC steps.

Figure 2: Estimates of the normalising constant (in log scale) obtained from both Hamiltonian Snippet SMC and waste-free SMC algorithm. Left: Estimate obtained from Hamiltonian Snippet SMC algorithm with different values of ε\varepsilon. Right: Estimates obtained from waste-free SMC algorithm.

In Figure 3 we display boxplots of the estimates of the posterior expectation of the mean of all coefficients, i.e. \mathbb{E}_{\pi_{P}}(d^{-1}\sum_{i=1}^{d}x_{i}). This quantity is referred to as the mean of marginals in [14] and we adopt this terminology. The same behaviour is observed for the estimates of the mean of marginals, with the instability problems exacerbated for small and large stepsizes.

Figure 3: Estimates of the mean of marginals obtained from both the Hamiltonian Snippet SMC and the waste-free SMC algorithms. Left: Estimates obtained from the Hamiltonian Snippet SMC algorithm with different values of \varepsilon. Right: Estimates obtained from the waste-free SMC algorithm.

Computational Cost

We now compare the running times of both algorithms. Since the calculations of the potential energy and its gradient often share common intermediate quantities, these can be recycled to save computational cost. As the waste-free SMC also requires density evaluations, the Hamiltonian Snippet SMC algorithm does not require significant additional computation. Figure 4 shows boxplots of the simulation times of both algorithms over 100 runs. The simulations were run on an Apple M1 Pro CPU with 16GB of memory. One can see that, in comparison to the waste-free SMC, the additional computational time of the Hamiltonian Snippet SMC algorithm is only marginal and mostly due to the larger memory needed to store the intermediate values.

Figure 4: Simulation times for both algorithms. Left: Simulation times for the Hamiltonian Snippet SMC algorithm, with N=100,\varepsilon=0.2,\alpha=0.5. Right: Simulation times for the waste-free SMC algorithm with N=100.

2.6 Numerical illustration: simulating from filamentary distributions

We now illustrate the interest of integrator snippets in a scenario where the target distribution possesses specific geometric features. Specifically, we focus here on distributions concentrated around a manifold 𝖷=d\mathcal{M}\subset\mathsf{X}=\mathbb{R}^{d} defined as the zero level set of a smooth function :dm\ell:\mathbb{R}^{d}\to\mathbb{R}^{m}

:={x𝖷:(x)=0},\mathcal{M}:=\left\{x\in\mathsf{X}:\ell(x)=0\right\},

sometimes referred to as filamentary distributions [12, 24, 25]. Such distributions arise naturally in various scenarios, for example in inverse problems, as a relaxation of problems where the support of the distribution of interest is included in \mathcal{M}, or in generalisations of the Gibbs sampler through disintegration [12, 24, 25]; such is the case for ABC methods in Bayesian statistics [3]. Assume that \pi is a probability density with respect to the Lebesgue measure on \mathsf{X} and for \epsilon>0 consider a “kernel” function k_{\epsilon}(u):=\epsilon^{-m}k(u/\epsilon) where k\colon\mathbb{R}^{m}\to\mathbb{R}_{+}, and define the probability density

πϵ(x)kϵ(x)π(x).\pi_{\epsilon}(x)\propto k_{\epsilon}\circ\ell(x)\,\pi(x)\,.

The corresponding probability distribution can be thought of as an approximation of the probability distribution of density \pi_{0}(x)\propto\pi(x)\mathbf{1}\{x\in\mathcal{M}\} with respect to the Hausdorff measure on \mathcal{M}. Typical choices for the kernel are k(u)=\mathbf{1}\{\|u\|\leq 1\} or k(u)=\mathcal{N}(u;0,\mathbf{I}_{m}). Strong anisotropy may result from such constraints and make exploration of the support of such distributions awkward for standard Monte Carlo algorithms. This is illustrated in Fig. 5 where a standard MH-HMC algorithm is used to sample from \pi_{\epsilon} defined on \mathbb{R}^{2}, for three values \epsilon=0.5,0.1,0.05, and performance is observed to deteriorate as \epsilon decreases. The samples produced are displayed in blue: for \epsilon=0.5 MH-HMC mixes well and explores the support rapidly, but for \epsilon=0.1 the chain gets stuck in a narrow region of \pi_{\epsilon} at the bottom, near initialisation, while for \epsilon=0.05 no proposal is ever accepted in the experiment.

Figure 5: Heat map of πϵ(x)𝒩(yF(x),ϵ2)𝒩(x0,𝐈2)\pi_{\epsilon}(x)\propto\mathcal{N}(y\mid F(x),\epsilon^{2})\mathcal{N}(x\mid 0,\mathbf{I}_{2}) with F(x1,x2):=x22+x12(x1212)F(x_{1},x_{2}):=x_{2}^{2}+x_{1}^{2}\left(x_{1}^{2}-\frac{1}{2}\right), defined on 2\mathbb{R}^{2}, for yy\in\mathbb{R} and ϵ=0.5,0.1,0.05\epsilon=0.5,0.1,0.05 (left to right) superimposed with 20002000 samples (in blue) obtained from an MH-HMC with step size 0.050.05 and involving T=20T=20 leapfrog steps.

To illustrate the properties of integrator snippet techniques we consider the following toy example. For ϵ>0\epsilon>0 and dd\in\mathbb{N}_{*} let

πϵ(x)1ϵm𝟏{(x)ϵ}𝒩(x;0,𝐈d)with(x)=xΣ1xc,\pi_{\epsilon}(x)\propto\frac{1}{\epsilon^{m}}\mathbf{1}\{\left\|\ell(x)\right\|\leq\epsilon\}\,\mathcal{N}(x;0,\mathrm{\mathbf{I}}_{d})\quad\text{with}\quad\ell(x)=x^{\top}\Sigma^{-1}x-c\,,

for Σ\Sigma a d×dd\times d symmetric positive matrix and cc\in\mathbb{R}, that is we consider the restriction of a standard normal distribution around an ellipsoid defined by (x)=0\ell(x)=0.

In order to explore the support of the target distribution we use two mechanisms based on reflections, either through tangent hyperplanes of equicontours of \ell(x) or through the corresponding orthogonal complements. More specifically, for x\in\mathsf{X} such that \nabla\ell(x)\neq 0 let n(x):=\nabla\ell(x)/\|\nabla\ell(x)\| and define the tangential HUG (THUG) and symmetrically-normal HUG (SNUG) updates as \psi_{\mathbin{\scalebox{0.4}{$\!/\mkern-5.0mu/\!$}}}:={}_{\textsc{a}}\psi\circ{\rm b}\circ{}_{\textsc{a}}\psi and \psi_{\rotatebox[origin={c}]{180.0}{$\scalebox{0.45}{$\top$}$}}:={}_{\textsc{a}}\psi\circ(-{\rm b})\circ{}_{\textsc{a}}\psi respectively, with {}_{\textsc{a}}\psi and {\rm b} as in Example 1 and (25). Both updates can be understood as discretisations of ODEs and are volume preserving. Intuitively, for (x,v)\in\mathsf{Z} and \varepsilon>0, writing v_{\rotatebox[origin={c}]{180.0}{$\scalebox{0.45}{$\top$}$}}=v_{\rotatebox[origin={c}]{180.0}{$\scalebox{0.45}{$\top$}$}}(x+\varepsilon v):=n(x+\varepsilon v)n(x+\varepsilon v)^{\top}v for the component of the velocity along the normal at x+\varepsilon v and v_{\mathbin{\scalebox{0.4}{$\!/\mkern-5.0mu/\!$}}}=v-v_{\rotatebox[origin={c}]{180.0}{$\scalebox{0.45}{$\top$}$}} for its tangential component, we have \psi_{\mathbin{\scalebox{0.4}{$\!/\mkern-5.0mu/\!$}}}(x,v)=(x+2\varepsilon v_{\mathbin{\scalebox{0.4}{$\!/\mkern-5.0mu/\!$}}},v_{\mathbin{\scalebox{0.4}{$\!/\mkern-5.0mu/\!$}}}-v_{\rotatebox[origin={c}]{180.0}{$\scalebox{0.45}{$\top$}$}}) and \psi_{\rotatebox[origin={c}]{180.0}{$\scalebox{0.45}{$\top$}$}}(x,v)=(x+2\varepsilon v_{\rotatebox[origin={c}]{180.0}{$\scalebox{0.45}{$\top$}$}},v_{\rotatebox[origin={c}]{180.0}{$\scalebox{0.45}{$\top$}$}}-v_{\mathbin{\scalebox{0.4}{$\!/\mkern-5.0mu/\!$}}}). This is illustrated in Fig. 6 for d=2. Further, for an initial state z_{0}=(x_{0},v_{0})\in\mathsf{Z}, trajectories of the first component of k\mapsto\psi_{\mathbin{\scalebox{0.4}{$\!/\mkern-5.0mu/\!$}}}^{k}(z_{0}) remain close to \ell^{-1}(\{\ell(x_{0})\}) while k\mapsto\psi_{\rotatebox[origin={c}]{180.0}{$\scalebox{0.45}{$\top$}$}}^{k}(z_{0}) follows the gradient field x\mapsto\nabla\ell(x) and leads to hops across equicontours.
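
To make the two reflection updates concrete, the following sketch implements a single THUG or SNUG step; it assumes a user-supplied function grad_l returning \nabla\ell(x) and the names are purely illustrative.

```python
import numpy as np

def thug_snug_step(x, v, grad_l, eps, snug=False):
    """One THUG step (advance, reflect through the tangent hyperplane at the
    midpoint, advance) or, if snug=True, one SNUG step (same update with the
    reflected velocity negated)."""
    x_mid = x + eps * v                   # first advance
    g = grad_l(x_mid)
    n = g / np.linalg.norm(g)             # unit normal at the midpoint
    v_perp = np.dot(v, n) * n             # component of v along the normal
    v_new = v - 2.0 * v_perp              # bounce b: reflect through the tangent plane
    if snug:
        v_new = -v_new                    # SNUG uses -b in place of b
    x_new = x_mid + eps * v_new           # second advance
    return x_new, v_new
```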

Figure 6: Black line: manifold of interest =1(0)\mathcal{M}=\ell^{-1}(0), grey line: level set of 1({(x0)})\ell^{-1}(\{\ell(x_{0})\}) the \ell-level set x0x_{0} belongs to. Left: THUG trajectory (blue) remains close to 1({(x0)})\ell^{-1}(\{\ell(x_{0})\}), SNUG trajectory (orange) explores different contours of (x)\ell(x). Right: vthug=v//vv_{{\rm thug}}=v_{\mathbin{\scalebox{0.4}{$\!/\mkern-5.0mu/\!$}}}-v_{\rotatebox[origin={c}]{180.0}{$\scalebox{0.45}{$\top$}$}} and vsnug=(v//v)v_{{\rm snug}}=-(v_{\mathbin{\scalebox{0.4}{$\!/\mkern-5.0mu/\!$}}}-v_{\rotatebox[origin={c}]{180.0}{$\scalebox{0.45}{$\top$}$}}).

The sequence of target distributions on (𝖹,𝒵)(\mathsf{Z},\mathcal{Z}) we consider is of the form

μn(dz)πϵn(dx)𝒩(dv;0,𝐈d),ϵn>0,n0,P,\mu_{n}({\rm d}z)\propto\pi_{\epsilon_{n}}({\rm d}x)\mathcal{N}({\rm d}v;0,\mathbf{I}_{d}),\qquad\qquad\epsilon_{n}>0,n\in\llbracket 0,P\rrbracket\,,

and the Integrator Snippet SMC is defined through the mixture, for α[0,1]\alpha\in[0,1],

μ¯n(dz)=αT+1k=0Tμnψ//k(dz)+1αT+1k=0Tμnψk(dz).\bar{\mu}_{n}({\rm d}z)=\frac{\alpha}{T+1}\sum_{k=0}^{T}\mu_{n}^{\psi_{\mathbin{\scalebox{0.4}{$\!/\mkern-5.0mu/\!$}}}^{-k}}({\rm d}z)+\frac{1-\alpha}{T+1}\sum_{k=0}^{T}\mu_{n}^{\psi_{\rotatebox[origin={c}]{180.0}{$\scalebox{0.45}{$\top$}$}}^{-k}}({\rm d}z)\,.

We compare the performance of the integrator snippet with an SMC sampler relying on a mutation kernel M_{n} consisting of a mixture of two updates targeting \mu_{n-1}, each iterating T times an MH kernel which, after refreshing the velocity, applies the integrator \psi_{\shortparallel} or \psi_{\rotatebox[origin={c}]{180.0}{$\scalebox{0.45}{$\top$}$}} once; the backward kernels are chosen to correspond to the default choices discussed earlier. In the experiments below we set d=50, c=12, and \Sigma is the diagonal matrix alternating 1’s and 0.1’s along the diagonal. We used N=5000 particles, sampled from \mathcal{N}(0,\mathbf{I}_{d}) at time zero; \epsilon_{0} is set to the maximum distance of these particles from 0 in the \ell domain, and \alpha=0.8 for both algorithms. We compared the two samplers across three metrics; all results are averaged over 20 runs.

Robustness and Accuracy: we fix the step size for SNUG to ε=0.1\varepsilon_{\rotatebox[origin={c}]{180.0}{$\scalebox{0.45}{$\top$}$}}=0.1 and run both algorithms for a grid of values of TT and ε//\varepsilon_{\mathbin{\scalebox{0.4}{$\!/\mkern-5.0mu/\!$}}} using a standard adaptive scheme based on ESS to determine {ϵn,nP}\{\epsilon_{n},n\in\llbracket P\rrbracket\} until a criterion described below is satisfied and retain the final tolerance achieved, recorded in Fig. 7. As a result the terminal value ϵP\epsilon_{P} and computational costs are different for both algorithms: the point of this experiment is only to demonstrate robustness and potential accuracy of Integrator Snippets. Both the SMC sampler and Integrator Snippet are stopped when the average probability of leaving the seed particle drops below 0.010.01.

Our proposed algorithm consistently achieves a final value of \epsilon that is two orders of magnitude smaller and is more robust to the choice of stepsize.

Figure 7: Final tolerances achieved by the two algorithms, averaged over 2020 runs.

Variance: for this experiment we set T=50T=50 steps, ε//=0.01\varepsilon_{\mathbin{\scalebox{0.4}{$\!/\mkern-5.0mu/\!$}}}=0.01 and determine and compare the variances of the estimates of the mean of πϵP\pi_{\epsilon_{P}} for the final SMC step, for which ϵP=1×107\epsilon_{P}=1\times 10^{-7}. To improve comparison, and in particular ensure comparable computational costs, both algorithms share the same schedule {ϵn,nP}\{\epsilon_{n},n\in\llbracket P\rrbracket\}, determined adaptively by the SMC algorithm in a separate pre-run.

The results are reported as componentwise boxplots in Fig. 8 where we observe a significant variance reduction for comparable computational cost.

Figure 8: Variability of the estimates of the mean of \pi_{\epsilon} for the SMC sampler and the Integrator Snippet, using the estimator \hat{\mu}_{P}(f) with test function f(x)=x. The Integrator Snippet uniformly achieves smaller variance, in particular on the odd components, which correspond to the larger variances in \Sigma.

Efficiency: we report the Expected Squared Jump Distance (ESJD) as a proxy for the distance travelled by the two algorithms. For Integrator Snippets, this quantity can be estimated as follows, for a function f\colon\mathsf{Z}\to\mathbb{R},

ESJDn(f)i=1Nk=0Tl=k+1T(f(Zn1,l(i))W¯n,l(i)f(Zn1,k(i))W¯n,k(i))2,W¯n,k(i)=w¯n,k(Zn1(i))l=0Tw¯n,l(Zn1,l(i)).\text{ESJD}_{n}(f)\approx\sum_{i=1}^{N}\sum_{k=0}^{T}\sum_{l=k+1}^{T}\left(f(Z_{n-1,l}^{(i)})\bar{W}_{n,l}^{(i)}-f(Z_{n-1,k}^{(i)})\bar{W}_{n,k}^{(i)}\right)^{2},\qquad\bar{W}_{n,k}^{(i)}=\frac{\bar{w}_{n,k}(Z_{n-1}^{(i)})}{{\displaystyle\sum_{l=0}^{T}\bar{w}_{n,l}(Z_{n-1,l}^{(i)})}}\,.
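
A direct transcription of this estimator, assuming arrays f_vals and w_bar of shape (N, T+1) holding the evaluations f(Z_{n-1,k}^{(i)}) and the normalised weights \bar{W}_{n,k}^{(i)}, might read as follows; this is only an illustrative sketch.

```python
import numpy as np

def esjd(f_vals, w_bar):
    """Estimate ESJD_n(f) from weighted snippet states."""
    g = f_vals * w_bar                    # weighted evaluations
    total = 0.0
    for k in range(g.shape[1]):
        for l in range(k + 1, g.shape[1]):
            total += np.sum((g[:, l] - g[:, k]) ** 2)
    return total
```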

We report the average of this metric in Table 1 for the functions f_{i}(x)=x_{i}, i\in\llbracket d\rrbracket, normalised by the total runtime in seconds, for all particles (first row), for the particles using the THUG update (second row) and for those using the SNUG update (third row), with standard deviations shown in parentheses. Our proposed algorithm is several orders of magnitude more efficient than its SMC counterpart in the exploration of \pi_{\epsilon}, thanks to its ability to take full advantage of all the intermediate states of the snippet; the SMC sampler, in contrast, creates trajectories of random length.

                    Integrator Snippet                                        SMC
\text{ESJD}/s       \mathbf{5.3\times 10^{-5}}\,\,(\pm 5.9\times 10^{-6})     1.2\times 10^{-7}\,\,(\pm 3.7\times 10^{-9})
\text{ESJD-THUG}/s  \mathbf{6.6\times 10^{-5}}\,\,(\pm 7.5\times 10^{-6})     1.7\times 10^{-7}\,\,(\pm 4.5\times 10^{-9})
\text{ESJD-SNUG}/s  \mathbf{2.7\times 10^{-4}}\,\,(\pm 3.2\times 10^{-5})     1.6\times 10^{-20}\,\,(\pm 6.0\times 10^{-21})
Table 1: d1i=1dESJDn(fi)d^{-1}\sum_{i=1}^{d}\text{ESJD}_{n}(f_{i}) normalised by time for Integrator Snippet and an SMC sampler.

2.7 Links to the literature

Alg. 2, and its reinterpretation Alg. 3, are reminiscent of various earlier contributions and we discuss here parallels and differences. This work was initiated in [42] and pursued and extended in [18].

Readers familiar with the “waste-free SMC” algorithm (WF-SMC) of [14] will recognise similarities: there, as in Alg. 2, seed particles are extended with an MCMC kernel leaving \mu_{n-1} invariant at iteration n, yielding N\times(T+1) particles subsequently whittled down to N new seed particles. A first difference is that, while the generation of an integrator snippet can be interpreted as applying a sequence of deterministic Markov kernels (see Subsection 3.3 for a detailed discussion), the mutation kernels involved do not have to leave \mu_{n-1} invariant; we show in Section 3 that this is indeed not a requirement, therefore offering more freedom. Further, it is instructive to compare our procedure with the following two implementations of WF-SMC using an HMC kernel (4) for the mutation. A first possibility for the mutation stage is to run T steps of an HMC update in sequence, where each update uses one integrator step. Assuming no velocity refreshment along the trajectory, this would lead to the exploration of a random number of states of our integrator snippet due to the accept/reject mechanism involved; incorporating refreshment or partial refreshment would similarly lead to a random number of useful samples. Alternatively one could consider a mutation mechanism consisting of an HMC kernel built around T integration steps, where the endpoint of the trajectory is the mutated particle; this would obviously mean discarding T-1 potentially useful candidates. To avoid possible reader confusion, we note apparent typos in [14, Proposition 1], provided without proof, where the intermediate target distribution of the SMC algorithm, \bar{\mu}_{n} in our notation (see (18)), seems improperly stated. The statement is however not used further in that paper.

Our work shares, at first sight, similarities with [33, 38], but differs in several respects. In the discrete time setup considered in [38] a mixture of distributions similar to ours is also introduced. Specifically, for a sequence \{\omega_{k}\geq 0,k\in\mathbb{Z}\} such that \#\{k\in\mathbb{Z}\colon\omega_{k}\neq 0\}<\infty and \sum_{k\in\mathbb{Z}}\omega_{k}=1, the following mixture is considered (in the notation of [38] their transformation is T=\psi^{-1}; we stick to our notation for the sake of comparison and to avoid confusion with our T),

μ¯(dz)=kωkμk(dz).\bar{\mu}({\rm d}z)=\sum_{k\in\mathbb{Z}}\omega_{k}\mu_{k}({\rm d}z)\,.

The intention is to use the distribution μ¯\bar{\mu} as an importance sampling proposal to estimate expectations with respect to μ\mu,

f(z)dμdμ¯(z)μ¯(dz)\displaystyle\int f(z)\frac{{\rm d}\mu}{{\rm d}\bar{\mu}}(z)\bar{\mu}({\rm d}z) =kωkf(z)dμdμ¯(z)μψk(dz)\displaystyle=\sum_{k\in\mathbb{Z}}\omega_{k}\int f(z)\frac{{\rm d}\mu}{{\rm d}\bar{\mu}}(z)\mu^{\psi^{-k}}({\rm d}z)
=kωkfψk(z)dμdμ¯ψk(z)μ(dz)\displaystyle=\sum_{k\in\mathbb{Z}}\omega_{k}\int f\circ\psi^{-k}(z)\frac{{\rm d}\mu}{{\rm d}\bar{\mu}}\circ\psi^{-k}(z)\mu({\rm d}z)
=kfψk(z)ωkμψk(z)lωlμψlk(z)μ(dz),\displaystyle=\sum_{k\in\mathbb{Z}}f\circ\psi^{-k}(z)\omega_{k}\frac{\mu\circ\psi^{-k}(z)}{\sum_{l\in\mathbb{Z}}\omega_{l}\,\mu\circ\psi^{l-k}(z)}\mu({\rm d}z)\,,

where the last line holds when \upsilon\gg\mu, \upsilon^{\psi}=\upsilon and \mu(z)={\rm d}\mu/{\rm d}\upsilon(z). The rearrangements on the second and third lines simply capture the fact, noted earlier in the present paper, that if z\sim\mu and k\sim{\rm Cat}\big{(}\{\omega_{l}:l\in\mathbb{Z}\}\big{)} then \psi^{-k}(z)\sim\bar{\mu}. As such NEO relies on generating samples from \mu first, typically exact samples, which then undergo a transformation and are then properly reweighted. In contrast we aim to sample from a mixture of the type \bar{\mu} directly, typically using an iterative method, and then exploit the mixture structure to estimate expectations with respect to \mu. Note also that some of the \psi^{l-k}(z) terms involved in the renormalization may require additional computations beyond the terms for which \omega_{l-k}\neq 0. Our approach relies on a different importance sampling identity

fψk(z)dμψkdμ¯(z)μ¯(dz)=μ(f).\int f\circ\psi^{k}(z)\frac{{\rm d}\mu^{\psi^{-k}}}{{\rm d}\bar{\mu}}(z)\bar{\mu}({\rm d}z)=\mu(f)\,.

We note however that the conformal Hamiltonian integrator used in [33, 38] could be used in our framework; we leave this for future investigations. Their NEO-MCMC targets a “posterior distribution” \mu^{\prime}({\rm d}z)=Z^{-1}\mu({\rm d}z)L(z) and is inspired by the identity

\int f(z)L(z)\frac{{\rm d}\mu}{{\rm d}\bar{\mu}}(z)\bar{\mu}({\rm d}z)=\sum_{k\in\mathbb{Z}}\omega_{k}\int f\circ\psi^{-k}(z)\,L\circ\psi^{-k}(z)\,\frac{{\rm d}\mu}{{\rm d}\bar{\mu}}\circ\psi^{-k}(z)\,\mu({\rm d}z)\,,

which is an expectation of \bar{f}(k,z)=f\circ\psi^{-k}(z) with respect to

μ¯(k,dz)μ(dz)ωkdμdμ¯ψk(z)Lψk(z)\bar{\mu}^{\prime}(k,{\rm d}z)\propto\mu({\rm d}z)\cdot\omega_{k}\frac{{\rm d}\mu}{{\rm d}\bar{\mu}}\circ\psi^{-k}(z)\cdot L\circ\psi^{-k}(z)

and they consider algorithms targeting \bar{\mu}^{\prime}({\rm d}z) which rely on exact samples from \mu to construct proposals, resulting in a strategy which shares the weaknesses of an independent MH algorithm.

Much closer in spirit to our work is the “window of states” idea proposed in [28, 29] in the context of HMC algorithms. Although the MCMC algorithms of [28, 29] ultimately target \mu and are not reversible, they involve a reversible MH update with respect to \bar{\mu}, together with additional moves allowing transitions from \mu to \bar{\mu} and from \bar{\mu} to \mu; see Section 5 for full details.

A link we have not explored in the present manuscript is that to normalizing flows [36, 26] and the related literature. In particular the ability to tune the parameters of the leapfrog integrator suggests that our methodology lends itself to learning normalising flows. We finally note an analogy of such approaches with umbrella sampling [37, 39], although the targeted mixture is not of the same form, and with the “Warp Bridge Sampling” idea of [27].

3 Sampling Markov snippets

In this section we develop the Markov snippet framework, largely inspired by the WF-SMC framework of [14] but provide here a detailed derivation following the standard SMC sampler framework [16] which allows us to consider much more general mutation kernels; integrator snippet SMC is recovered as a particular case. Importantly we provide recipes to compute some of the quantities involved using simple criteria (see Lemma 9 and Corollary 10) which allow us to consider unusual scenarios such as in Subsection 3.4.

3.1 Markov snippet SMC sampler or waste free SMC with a difference

Given a sequence {μn,n0,P}\big{\{}\mu_{n},n\in\llbracket 0,P\rrbracket\big{\}} of probability distributions defined on a measurable space (𝖹,𝒵)(\mathsf{Z},\mathscr{Z}) introduce the sequence of distributions defined on (0,T×𝖹T+1,𝒫(0,T)𝒵(T+1))\big{(}\llbracket 0,T\rrbracket\times\mathsf{Z}^{T+1},\mathscr{P}(\llbracket 0,T\rrbracket)\otimes\mathscr{Z}^{\otimes(T+1)}\big{)} such that for any (n,k,𝗓)0,P×0,T×𝖹(n,k,\mathsf{z})\in\llbracket 0,P\rrbracket\times\llbracket 0,T\rrbracket\times\mathsf{Z}

μ¯n(k,d𝗓):=1T+1wn,k(𝗓)μnMnT(d𝗓)\bar{\mu}_{n}(k,{\rm d}\mathsf{z}):=\frac{1}{T+1}w_{n,k}(\mathsf{z})\mu_{n}\otimes M_{n}^{\otimes T}({\rm d}\mathsf{z})

where for M_{n},L_{n-1,k}\colon\mathsf{Z}\times\mathscr{Z}\rightarrow[0,1], k\in\llbracket 0,T\rrbracket and any \mathsf{z}=(z_{0},z_{1},\ldots,z_{T})\in\mathsf{Z}^{T+1},

wn,k(𝗓):=dμnLn1,kdμnMnk(z0,zk),w_{n,k}(\mathsf{z}):=\frac{{\rm d}\mu_{n}\accentset{\curvearrowleft}{\otimes}L_{n-1,k}}{{\rm d}\mu_{n}\otimes M_{n}^{k}}(z_{0},z_{k})\,,

is assumed to exist for now and wn,0(𝗓):=1w_{n,0}(\mathsf{z}):=1. This yields the marginals

\bar{\mu}_{n}({\rm d}\mathsf{z}):=\sum_{k=0}^{T}\bar{\mu}_{n}(k,{\rm d}\mathsf{z})=\frac{1}{T+1}\sum_{k=0}^{T}w_{n,k}(\mathsf{z})\,\mu_{n}\otimes M_{n}^{\otimes T}({\rm d}\mathsf{z})\,. (18)

Further, consider Rn:𝖹×𝒵[0,1]R_{n}\colon\mathsf{Z}\times\mathscr{Z}\rightarrow[0,1] such that μn1Rn=μn1\mu_{n-1}R_{n}=\mu_{n-1} and define M¯n:𝖹T+1×𝒵(T+1)[0,1]\bar{M}_{n}\colon\mathsf{Z}^{T+1}\times\mathscr{Z}^{\otimes(T+1)}\rightarrow[0,1]

M¯n(𝗓,d𝗓):=k=0Tμ¯n1(k𝗓)Rn(zk,dz0)MnT(z0,d𝗓0).\bar{M}_{n}(\mathsf{z},{\rm d}\mathsf{z}^{\prime}):=\sum_{k=0}^{T}\bar{\mu}_{n-1}(k\mid\mathsf{z})R_{n}(z_{k},{\rm d}z^{\prime}_{0})M_{n}^{\otimes T}\big{(}z^{\prime}_{0},{\rm d}\mathsf{z}^{\prime}_{-0}\big{)}\,.

Note that [14] sets R_{n}=M_{n}, which we do not want in our later application, and further assumes that M_{n} is \mu_{n-1}-invariant, which is not necessary and is too constraining for our application in Section 3.3 (corresponding to the application in the introductory Subsection 2.2): we only require that \mu_{n-1}R_{n}=\mu_{n-1}. As in Subsection 2.2 we consider the optimal backward kernel \bar{L}_{n-1}\colon\mathsf{Z}^{T+1}\times\mathscr{Z}^{\otimes(T+1)}\rightarrow[0,1], given for (\mathsf{z},A)\in\mathsf{Z}^{T+1}\times\mathscr{Z}^{\otimes(T+1)} by

L¯n1(𝗓,A)=dμ¯n1M¯n(A×)dμ¯n1M¯n(𝗓),\bar{L}_{n-1}(\mathsf{z},A)=\frac{{\rm d}\bar{\mu}_{n-1}\otimes\bar{M}_{n}(A\times\cdot)}{{\rm d}\bar{\mu}_{n-1}\bar{M}_{n}}(\mathsf{z})\,,

and as established in Lemma 6 and Lemma 8, one obtains

w¯n(𝗓)\displaystyle\bar{w}_{n}(\mathsf{z}^{\prime}) =dμ¯nL¯n1dμ¯n1M¯n(𝗓,𝗓)=dμndμn1(z0)1T+1k=0Twn,k(𝗓).\displaystyle=\frac{{\rm d}\bar{\mu}_{n}\accentset{\curvearrowleft}{\otimes}\bar{L}_{n-1}}{{\rm d}\bar{\mu}_{n-1}\otimes\bar{M}_{n}}(\mathsf{z},\mathsf{z}^{\prime})=\frac{{\rm d}\mu_{n}}{{\rm d}\mu_{n-1}}(z^{\prime}_{0})\frac{1}{T+1}\sum_{k=0}^{T}w_{n,k}(\mathsf{z}^{\prime})\,. (19)

The corresponding folded version of the algorithm is given in Alg. 4, where \check{\mathsf{z}}_{n}:=(\check{z}_{n,0},\check{z}_{n,1},\ldots,\check{z}_{n,T}) and, at iteration n:

  • {(𝗓ˇn(i),1),iN}\big{\{}(\check{\mathsf{z}}_{n}^{(i)},1),i\in\llbracket N\rrbracket\big{\}} represents μ¯n(d𝗓)\bar{\mu}_{n}({\rm d}\mathsf{z}),

  • {(zn,k(i),wn+1,k),(i,k)N×0,T}\big{\{}(z_{n,k}^{(i)},w_{n+1,k}),(i,k)\in\llbracket N\rrbracket\times\llbracket 0,T\rrbracket\big{\}} represent μn(dz)\mu_{n}({\rm d}z),

  • {(𝗓n(i),w¯n+1(𝗓n(i))),iN}\big{\{}\big{(}\mathsf{z}_{n}^{(i)},\bar{w}_{n+1}\big{(}\mathsf{z}_{n}^{(i)}\big{)}\big{)},i\in\llbracket N\rrbracket\big{\}} represents μ¯n+1(d𝗓)\bar{\mu}_{n+1}({\rm d}\mathsf{z}) and so do {(𝗓ˇn+1(i),1),iN}\big{\{}(\check{\mathsf{z}}_{n+1}^{(i)},1),i\in\llbracket N\rrbracket\big{\}}.

sample \check{\mathsf{z}}_{0}^{(i)}\overset{{\rm iid}}{\sim}\mu_{0}^{\otimes(T+1)}, set w_{0,k}(\check{\mathsf{z}}_{0}^{(i)})=1 for (i,k)\in\llbracket N\rrbracket\times\llbracket T\rrbracket.
for n=0,\ldots,P-1 do
      for i\in\llbracket N\rrbracket do
            sample a_{i}\sim{\rm Cat}\left(1,w_{n,1}(\check{\mathsf{z}}_{n}^{(i)}),w_{n,2}(\check{\mathsf{z}}_{n}^{(i)}),\ldots,w_{n,T}(\check{\mathsf{z}}_{n}^{(i)})\right)
            sample z_{n,0}^{(i)}=z_{n}^{(i)}\sim R_{n+1}(\check{z}_{n,a_{i}},\cdot)
            for k\in\llbracket T\rrbracket do
                  sample z_{n,k}^{(i)}\sim M_{n+1}(z_{n,k-1}^{(i)},\cdot)
                  compute
                  w_{n+1,k}\big{(}\mathsf{z}_{n}^{(i)}\big{)}=\frac{{\rm d}\mu_{n+1}\accentset{\curvearrowleft}{\otimes}L_{n}^{k}}{{\rm d}\mu_{n+1}\otimes M_{n+1}^{k}}(z_{n,0}^{(i)},z_{n,k}^{(i)})
            end for
            compute
            \bar{w}_{n+1}\big{(}\mathsf{z}_{n}^{(i)}\big{)}=\frac{{\rm d}\mu_{n+1}}{{\rm d}\mu_{n}}\big{(}z_{n,0}^{(i)}\big{)}\frac{1}{T+1}\sum_{k=0}^{T}w_{n+1,k}\big{(}\mathsf{z}_{n}^{(i)}\big{)}\,,
      end for
      for j\in\llbracket N\rrbracket do
            sample b_{j}\sim{\rm Cat}\left(\bar{w}_{n+1}\big{(}\mathsf{z}_{n}^{(1)}\big{)},\bar{w}_{n+1}\big{(}\mathsf{z}_{n}^{(2)}\big{)},\ldots,\bar{w}_{n+1}\big{(}\mathsf{z}_{n}^{(N)}\big{)}\right)
            set \check{\mathsf{z}}_{n+1}^{(j)}=\mathsf{z}_{n}^{(b_{j})}
      end for
end for
Algorithm 4 (Folded) Markov Snippet SMC algorithm
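
The structure of Alg. 4 can be illustrated with the following sketch of a single iteration in the integrator snippet special case of Subsection 3.3, i.e. with M_{n+1} the deterministic kernel induced by an invertible map \psi and R_{n+1} a full refreshment of the velocity, which leaves \mu_{n} invariant when \mu_{n}({\rm d}z)=\pi_{n}({\rm d}x)\mathcal{N}({\rm d}v;0,\mathrm{Id}). The functions log_mu_n, log_mu_np1 and psi are assumed to be supplied by the user and the code is only indicative of the structure of the algorithm.

```python
import numpy as np

def markov_snippet_smc_step(seed_x, seed_v, seed_w, log_mu_n, log_mu_np1, psi, T, rng):
    """One (folded) Markov Snippet SMC iteration. seed_x, seed_v: (N, T+1, d)
    arrays holding the seed snippets; seed_w: (N, T+1) within-snippet weights
    w_{n,k}; log_mu_n, log_mu_np1 return log-densities of mu_n, mu_{n+1};
    psi maps (x, v) to the next state of the snippet."""
    N = seed_x.shape[0]
    # select one state within each seed snippet, proportionally to w_{n,k}
    probs = seed_w / seed_w.sum(axis=1, keepdims=True)
    ks = np.array([rng.choice(seed_w.shape[1], p=p) for p in probs])
    x0 = seed_x[np.arange(N), ks]
    v0 = rng.standard_normal(x0.shape)          # R_{n+1}: refresh the velocity
    # build the new snippets by iterating psi
    xs, vs = [x0], [v0]
    for _ in range(T):
        x_new, v_new = psi(xs[-1], vs[-1])
        xs.append(x_new)
        vs.append(v_new)
    xs, vs = np.stack(xs, axis=1), np.stack(vs, axis=1)
    log_mu = log_mu_np1(xs, vs)                 # log mu_{n+1} along the snippets, (N, T+1)
    w_new = np.exp(log_mu - log_mu[:, :1])      # within-snippet weights w_{n+1,k}
    log_wbar = log_mu[:, 0] - log_mu_n(xs[:, 0], vs[:, 0]) + np.log(w_new.mean(axis=1))
    # resample N new seed snippets proportionally to the snippet weights
    p = np.exp(log_wbar - log_wbar.max())
    idx = rng.choice(N, size=N, p=p / p.sum())
    return xs[idx], vs[idx], w_new[idx]
```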
Remark 5.

Other choices are possible and we detail here an alternative, related to the waste-free framework of [14]. With the notation above, we now take

μ¯n(k,d𝗓):=1T+1wn,k(𝗓)μn1MnT(d𝗓)\bar{\mu}_{n}(k,{\rm d}\mathsf{z}):=\frac{1}{T+1}w_{n,k}(\mathsf{z})\mu_{n-1}\otimes M_{n}^{\otimes T}({\rm d}\mathsf{z})

with the weights now defined as

wn,k(𝗓):=dμnLn1,kdμn1Mnk(z0,zk).w_{n,k}(\mathsf{z}):=\frac{{\rm d}\mu_{n}\accentset{\curvearrowleft}{\otimes}L_{n-1,k}}{{\rm d}\mu_{n-1}\otimes M_{n}^{k}}(z_{0},z_{k})\,.

With these choices we retain the fundamental property that for f:𝖹f\colon\mathsf{Z}\rightarrow\mathbb{R}, with f¯(k,𝗓):=f(zk)\bar{f}(k,\mathsf{z}):=f(z_{k}) then μ¯n(f¯)=μn(f)\bar{\mu}_{n}\big{(}\bar{f}\big{)}=\mu_{n}(f). Now with

M¯n(𝗓,d𝗓):=k=0Tμ¯n1(k𝗓)Rn(zk,dz0)MnT(z0,d𝗓0),\bar{M}_{n}(\mathsf{z},{\rm d}\mathsf{z}^{\prime}):=\sum_{k=0}^{T}\bar{\mu}_{n-1}(k\mid\mathsf{z})R_{n}(z_{k},{\rm d}z^{\prime}_{0})M_{n}^{\otimes T}\big{(}z^{\prime}_{0},{\rm d}\mathsf{z}^{\prime}_{-0}\big{)}\,,

assuming μn1Rn=μn1\mu_{n-1}R_{n}=\mu_{n-1}, we have the property that for any A𝒵(T+1)A\in\mathscr{Z}^{\otimes(T+1)} μ¯n1M¯n(A)=μn1MnT(A)\bar{\mu}_{n-1}\bar{M}_{n}(A)=\mu_{n-1}\otimes M_{n}^{\otimes T}\big{(}A\big{)}, yielding

w¯n(𝗓)=1T+1k=0Twn,k(𝗓).\bar{w}_{n}(\mathsf{z}^{\prime})=\frac{1}{T+1}\sum_{k=0}^{T}w_{n,k}(\mathsf{z}^{\prime})\,.

Finally, choosing L_{n-1,k} to be the optimal backward kernel, the importance weight of the algorithm becomes

wn,k(𝗓)=dμnLn1,kdμn1Mnk(z0,zk)=dμndμn1(zk).w_{n,k}(\mathsf{z}^{\prime})=\frac{{\rm d}\mu_{n}\accentset{\curvearrowleft}{\otimes}L_{n-1,k}}{{\rm d}\mu_{n-1}\otimes M_{n}^{k}}(z_{0},z_{k})=\frac{{\rm d}\mu_{n}}{{\rm d}\mu_{n-1}}(z_{k})\,.

We note that continuous time Markov process snippets could also be used. For example piecewise deterministic Markov processes such as the Zig-Zag process [6] or the Bouncy Particle Sampler [8] could be used in practice since finite time horizon trajectories can be parametrized in terms of a finite number of parameters. We do not pursue this here.

3.2 Theoretical justification

In this section we provide the theoretical justification for the correctness of Alg. 4 and an alternative proof for Alg. 3, seen as a particular case of Alg. 4.

Throughout this section we use the following notation where μ\mu (resp. ν\nu) plays the rôle of μn+1\mu_{n+1} (resp. μn\mu_{n}), for notational simplicity. Let μ𝒫(𝖹,𝒵)\mu\in\mathcal{P}\big{(}\mathsf{Z},\mathscr{Z}\big{)} and M,L:𝖹×𝒵[0,1]M,L\colon\mathsf{Z}\times\mathscr{Z}\rightarrow[0,1] be two Markov kernels such that the following condition holds. For T{0}T\in\mathbb{N}\setminus\{0\} let 𝗓:=(z0,z1,,zT)𝖹T+1\mathsf{z}:=(z_{0},z_{1},\ldots,z_{T})\in\mathsf{Z}^{T+1} and for k0,Tk\in\llbracket 0,T\rrbracket assume that μMkμ(Lk)\mu\otimes M^{k}\gg\mu\accentset{\curvearrowleft}{\otimes}(L^{k}), implying the existence of the Radon-Nikodym derivatives

wk(𝗓):=dμ(Lk)dμMk(z0,zk),w_{k}(\mathsf{z}):=\frac{{\rm d}\mu\accentset{\curvearrowleft}{\otimes}(L^{k})}{{\rm d}\mu\otimes M^{k}}(z_{0},z_{k})\,, (20)

with the convention w0(𝗓)=1w_{0}(\mathsf{z})=1. We let w(𝗓):=(T+1)1k=0Twk(𝗓)w(\mathsf{z}):=(T+1)^{-1}\sum_{k=0}^{T}w_{k}(\mathsf{z}). For 𝗓𝖹T+1\mathsf{z}\in\mathsf{Z}^{T+1}, define

MT(z0,d𝗓0):\displaystyle M^{\otimes T}(z_{0},{\rm d}\mathsf{z}_{-0}): =i=1TM(zi1,dzi),\displaystyle=\prod_{i=1}^{T}M(z_{i-1},{\rm d}z_{i})\,,

and introduce the mixture of distributions, defined on (𝖹T+1,𝒵(T+1))\big{(}\mathsf{Z}^{T+1},\mathscr{Z}^{\otimes(T+1)}\big{)},

μ¯(d𝗓)=k=0Tμ¯(k,d𝗓),\bar{\mu}({\rm d}\mathsf{z})=\sum_{k=0}^{T}\bar{\mu}(k,{\rm d}\mathsf{z})\,, (21)

where for (k,𝗓)0,T×𝖹T+1(k,\mathsf{z})\in\llbracket 0,T\rrbracket\times\mathsf{Z}^{T+1}

μ¯(k,d𝗓):=1T+1wk(𝗓)μMT(d𝗓),\bar{\mu}(k,{\rm d}\mathsf{z}):=\frac{1}{T+1}w_{k}(\mathsf{z})\mu\otimes M^{\otimes T}({\rm d}\mathsf{z})\,,

so that one can write μ¯(d𝗓)=w(𝗓)μMT(d𝗓)\bar{\mu}({\rm d}\mathsf{z})=w(\mathsf{z})\mu\otimes M^{\otimes T}({\rm d}\mathsf{z}) with w(𝗓):=(T+1)1k=0Twk(𝗓)w(\mathsf{z}):=(T+1)^{-1}\sum_{k=0}^{T}w_{k}(\mathsf{z}).

We first establish properties of μ¯\bar{\mu}, showing how samples from μ¯\bar{\mu} can be used to estimate expectations with respect to μ\mu.

Lemma 6.

For any f:𝖹f\colon\mathsf{Z}\rightarrow\mathbb{R} such that μ(|f|)<\mu(|f|)<\infty

  1.

    we have for k0,Tk\in\llbracket 0,T\rrbracket

    f(z)dμ(Lk)dμMk(z,z)μ(dz)Mk(z,dz)=μ(f),\int f(z^{\prime})\frac{{\rm d}\mu\accentset{\curvearrowleft}{\otimes}(L^{k})}{{\rm d}\mu\otimes M^{k}}(z,z^{\prime})\mu({\rm d}z)M^{k}(z,{\rm d}z^{\prime})=\mu(f)\,,
  2.

    with f¯(k,𝗓):=f(zk)\bar{f}(k,\mathsf{z}):=f(z_{k}),

    (a)

      then

      μ¯(f¯)\displaystyle\bar{\mu}(\bar{f}) =μ(f),\displaystyle=\mu(f)\,,
    (b)

      in particular,

      k=0Tf¯(k,𝗓)μ¯(k𝗓)μ¯(d𝗓)=μ(f).\int\sum_{k=0}^{T}\bar{f}(k,\mathtt{\mathsf{z}})\bar{\mu}(k\mid\mathsf{z})\bar{\mu}({\rm d}\mathsf{z})=\mu(f)\,. (22)
Proof.

The first statement follows directly from the definition of the Radon-Nikodym derivative

f(z)dμLkdμMk(z,z)μMk(d(z,z))\displaystyle\int f(z^{\prime})\frac{{\rm d}\mu\accentset{\curvearrowleft}{\otimes}L^{k}}{{\rm d}\mu\otimes M^{k}}(z,z^{\prime})\mu\otimes M^{k}\big{(}{\rm d(}z,z^{\prime})\big{)} =f(z)μ(dz)Lk(z,dz)\displaystyle=\int f(z^{\prime})\mu({\rm d}z^{\prime})L^{k}(z^{\prime},{\rm d}z)
=f(z)μ(dz).\displaystyle=\int f(z^{\prime})\mu({\rm d}z^{\prime})\,.

The second statement follows from the definition of μ¯\bar{\mu} and

1T+1k=0Tf¯(k,𝗓)wk(𝗓)μMT(d𝗓)=1T+1k=0Tf(zk)dμ(Lk)dμMk(z0,zk)μMk(d(z0,zk)),\frac{1}{T+1}\sum_{k=0}^{T}\int\bar{f}(k,\mathsf{z})w_{k}(\mathsf{z})\mu\otimes M^{\otimes T}({\rm d}\mathsf{z})=\frac{1}{T+1}\sum_{k=0}^{T}\int f(z_{k})\frac{{\rm d}\mu\accentset{\curvearrowleft}{\otimes}(L^{k})}{{\rm d}\mu\otimes M^{k}}(z_{0},z_{k})\mu\otimes M^{k}\big{(}{\rm d}(z_{0},z_{k})\big{)}\,,

and the first statement. The last statement follows from the tower property for expectations. ∎

Corollary 7.

Assume that 𝗓μ¯\mathsf{z}\sim\bar{\mu}, then

k=0Tf(zk)wk(𝗓)l=0Twl(𝗓)\sum_{k=0}^{T}f(z_{k})\frac{w_{k}(\mathsf{z})}{\sum_{l=0}^{T}w_{l}(\mathsf{z})}

is an unbiased estimator of μ(f)\mu(f) since we notice that for k0,Tk\in\llbracket 0,T\rrbracket,

μ¯(k𝗓)=wk(𝗓)l=0Twl(𝗓).\bar{\mu}(k\mid\mathsf{z})=\frac{w_{k}(\mathsf{z})}{\sum_{l=0}^{T}w_{l}(\mathsf{z})}\,.

This justifies algorithms which sample from the mixture μ¯\bar{\mu} directly in order to estimate expectations with respect to μ\mu.
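
In practice this estimator is immediate to compute once the states and unnormalised weights of a snippet are available; a minimal sketch, with f_vals[k] = f(z_{k}) and w[k] = w_{k}(\mathsf{z}) as in (20), is given below. Averaging such quantities over snippets drawn from \bar{\mu} yields an estimate of \mu(f).

```python
import numpy as np

def snippet_estimate(f_vals, w):
    """Self-normalised estimate of mu(f) from a single snippet drawn from mu_bar."""
    return float(np.sum(f_vals * w) / np.sum(w))
```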

Let \nu\in\mathcal{P}\big{(}\mathsf{Z},\mathscr{Z}\big{)} and \bar{\nu}\in\mathcal{P}\big{(}\mathsf{Z}^{T+1},\mathscr{Z}^{\otimes(T+1)}\big{)} be derived from \nu in the same way \bar{\mu} is from \mu in (21), but for possibly different Markov kernels M_{\nu},L_{\nu}\colon\mathsf{Z}\times\mathscr{Z}\rightarrow[0,1]. Now let \bar{M}\colon\mathsf{Z}^{T+1}\times\mathscr{Z}^{\otimes(T+1)}\rightarrow[0,1] be the Markov kernel

M¯(𝗓,d𝗓):=k=0Tν¯(k𝗓)R(zk,dz0)MT(z0,d𝗓0),\bar{M}(\mathsf{z},{\rm d}\mathsf{z}^{\prime}):=\sum_{k=0}^{T}\bar{\nu}(k\mid\mathsf{z})R(z_{k},{\rm d}z^{\prime}_{0})M^{\otimes T}\big{(}z^{\prime}_{0},{\rm d}\mathsf{z}^{\prime}_{-0}\big{)}\,, (23)

with R:𝖹×𝒵[0,1]R\colon\mathsf{Z}\times\mathscr{Z}\rightarrow[0,1]. Remark that M,L,Mν,LνM,L,M_{\nu},L_{\nu} and RR can be made dependent on both μ\mu and ν\nu, provided they satisfy all the conditions stated above and below.

The following justifies the existence of \bar{w}_{n+1} in (19) for a particular choice of backward kernel and provides a simplified expression for particular choices of R.

Lemma 8.

With the notation (20)-(21) and (23),

  1.

    there exists M¯:𝖹T+1×𝒵(T+1)[0,1]\bar{M}^{*}\colon\mathsf{Z}^{T+1}\times\mathscr{Z}^{\otimes(T+1)}\rightarrow[0,1] such that for any A,B𝒵(T+1)A,B\in\mathscr{Z}^{\otimes(T+1)}

    ν¯M¯(A×B)=(ν¯M¯)M¯(B×A)\bar{\nu}\otimes\bar{M}(A\times B)=(\bar{\nu}\bar{M})\otimes\bar{M}^{*}(B\times A)
  2.

    we have

    ν¯M¯=(νR)MT,\bar{\nu}\bar{M}=(\nu R)\otimes M^{\otimes T}\,,
  3.

    assuming νRμ\nu R\gg\mu then with the choice L¯=M¯\bar{L}=\bar{M}^{*} we have ν¯M¯μ¯L¯\bar{\nu}\otimes\bar{M}\gg\bar{\mu}\accentset{\curvearrowleft}{\otimes}\bar{L} and for 𝗓,𝗓𝖹T+1\mathsf{z},\mathsf{z}^{\prime}\in\mathsf{\mathsf{Z}}^{T+1} almost surely

    w¯(𝗓,𝗓)=dμ¯L¯dν¯M¯(𝗓,𝗓)=dμdνR(z0)1T+1k=0Twk(𝗓)=dμdνR(z0)w(𝗓)=:w¯(𝗓),\bar{w}(\mathsf{z},\mathsf{z}^{\prime})=\frac{{\rm d}\bar{\mu}\accentset{\curvearrowleft}{\otimes}\bar{L}}{{\rm d}\bar{\nu}\otimes\bar{M}}(\mathsf{z},\mathsf{z}^{\prime})=\frac{{\rm d}\mu}{{\rm d}\nu R}(z^{\prime}_{0})\frac{1}{T+1}\sum_{k=0}^{T}w_{k}(\mathsf{z}^{\prime})=\frac{{\rm d}\mu}{{\rm d}\nu R}(z^{\prime}_{0})w(\mathsf{z}^{\prime})=:\bar{w}(\mathsf{z}^{\prime})\,,

    that is, we introduce notation reflecting this last simplification.

  4.

    Notice the additional simplification when νR=ν\nu R=\nu.

Proof.

The first statement is a standard result and M¯\bar{M}^{*} is a conditional expectation. We however provide the short argument taking advantage of the specific scenario. For a fixed A𝒵(T+1)A\in\mathscr{Z}^{\otimes(T+1)} consider the finite measure

Bν¯M¯(A×B)ν¯M¯(B),B\mapsto\bar{\nu}\otimes\bar{M}(A\times B)\leq\bar{\nu}\bar{M}(B)\,,

such that ν¯M¯(B)=0ν¯M¯(A×B)=0\bar{\nu}\bar{M}(B)=0\implies\bar{\nu}\otimes\bar{M}(A\times B)=0, that is ν¯M¯ν¯M¯(A×)\bar{\nu}\bar{M}\gg\bar{\nu}\otimes\bar{M}(A\times\cdot). Consequently we have the existence of a Radon-Nikodym derivative such that

ν¯M¯(A×B)=Bdν¯M¯(A×)d(ν¯M¯)(𝗓)ν¯M¯(d𝗓)\bar{\nu}\otimes\bar{M}(A\times B)=\int_{B}\frac{{\rm d}\bar{\nu}\otimes\bar{M}(A\times\cdot)}{{\rm d}(\bar{\nu}\bar{M})}(\mathsf{z})\bar{\nu}\bar{M}({\rm d}\mathsf{z})

which indeed has the sought property. This is a kernel since for any fixed A\in\mathscr{Z}^{\otimes(T+1)} and any B\in\mathscr{Z}^{\otimes(T+1)}, \bar{\nu}\otimes\bar{M}(A\times B)\leq\bar{\nu}\bar{M}(B), and therefore \bar{M}^{*}(\mathsf{z},A)\leq 1, \bar{\nu}\bar{M}-a.s., with equality for A=\mathsf{Z}^{T+1}. For the second statement we have, for any (\mathsf{z},A)\in\mathsf{Z}^{T+1}\times\mathscr{Z}^{\otimes(T+1)},

M¯(𝗓,A)=k=0Tν¯(k𝗓)RMT(z,kA),\bar{M}(\mathsf{z},A)=\sum_{k=0}^{T}\bar{\nu}(k\mid\mathsf{z})R\otimes M^{\otimes T}\big{(}z{}_{k},A\big{)},

and we can apply (22) in Lemma 6 with the substitutions μν\mu\leftarrow\nu and MMν,LLνM\leftarrow M_{\nu},L\leftarrow L_{\nu} to conclude. For the third statement, since νRμ\nu R\gg\mu then ν¯M¯=(νR)MTμMT\bar{\nu}\bar{M}=(\nu R)\otimes M^{\otimes T}\gg\mu\otimes M^{\otimes T} because for any A𝒵(T+1)A\in\mathscr{Z}^{\otimes(T+1)},

𝟏{𝗓A}νR(dz0)MT(z0,d𝗓0)=0{𝟏{𝗓A}MT(z0,d𝗓0)=0νRa.s.orνR(𝔓(A))=0\int\mathbf{1}\{\mathsf{z}\in A\}\nu R({\rm d}z_{0})M^{\otimes T}(z_{0},{\rm d}\mathsf{z}_{-0})=0\implies\begin{cases}\int\mathbf{1}\{\mathsf{z}\in A\}M^{\otimes T}(z_{0},{\rm d}\mathsf{z}_{-0})=0&\nu R-a.s.\\ {\rm or}\\ \nu R\big{(}\mathfrak{P}(A)\big{)}=0\end{cases}

where 𝔓:𝖹T+1𝖹\mathfrak{P}\colon\mathsf{Z}^{T+1}\rightarrow\mathsf{Z} is such that 𝔓(𝗓)=z0\mathfrak{P}(\mathsf{z})=z_{0}. In either case, since νRμ\nu R\gg\mu, this implies that

𝟏{𝗓A}μ(dz0)MT(z0,d𝗓0)=0,\int\mathbf{1}\{\mathsf{z}\in A\}\mu({\rm d}z_{0})M^{\otimes T}(z_{0},{\rm d}\mathsf{z}_{-0})=0\,,

that is ν¯M¯μ¯\bar{\nu}\bar{M}\gg\bar{\mu} from the definition of μ¯\bar{\mu}. Consequently

μ¯M¯(A×B)\displaystyle\bar{\mu}\otimes\bar{M}^{*}(A\times B) =AM¯(𝗓,B)dμ¯d(ν¯M¯)(𝗓)ν¯M¯(d𝗓)\displaystyle=\int_{A}\bar{M}^{*}(\mathsf{z},B)\frac{{\rm d}\bar{\mu}}{{\rm d}(\bar{\nu}\bar{M})}(\mathsf{z})\bar{\nu}\bar{M}({\rm d}\mathsf{z})
=Adμ¯d(ν¯M¯)(𝗓)M¯(𝗓,B)ν¯M¯(d𝗓)\displaystyle=\int_{A}\frac{{\rm d}\bar{\mu}}{{\rm d}(\bar{\nu}\bar{M})}(\mathsf{z})\bar{M}^{*}(\mathsf{z},B)\bar{\nu}\bar{M}({\rm d}\mathsf{z})
=𝟏{𝗓A,𝗓B}dμ¯d(ν¯M¯)(𝗓)ν¯(d𝗓)M¯(𝗓,d𝗓)\displaystyle=\int\mathbf{1}\{\mathsf{z}\in A,\mathsf{z}^{\prime}\in B\}\frac{{\rm d}\bar{\mu}}{{\rm d}(\bar{\nu}\bar{M})}(\mathsf{z})\bar{\nu}({\rm d}\mathsf{z}^{\prime})\bar{M}(\mathsf{z}^{\prime},{\rm d}\mathsf{z})

and indeed from Fubini’s and Dynkin’s πλ\pi-\lambda theorems ν¯M¯μ¯L¯\bar{\nu}\otimes\bar{M}\gg\bar{\mu}\accentset{\curvearrowleft}{\otimes}\bar{L} and

dμ¯L¯dν¯M¯(𝗓,𝗓)=dμ¯d(ν¯M¯)(𝗓).\frac{{\rm d}\bar{\mu}\accentset{\curvearrowleft}{\otimes}\bar{L}}{{\rm d}\bar{\nu}\otimes\bar{M}}(\mathsf{z},\mathsf{z}^{\prime})=\frac{{\rm d}\bar{\mu}}{{\rm d}(\bar{\nu}\bar{M})}(\mathsf{z}^{\prime})\,.

Now for f:𝖹T+1[0,1]f\colon\mathsf{Z}^{T+1}\rightarrow[0,1] and using the second statement and νRμ\nu R\gg\mu,

f(𝗓)dμ¯d(ν¯M¯)(𝗓)(ν¯M¯)(d𝗓)\displaystyle\int f(\mathsf{z})\frac{{\rm d}\bar{\mu}}{{\rm d}(\bar{\nu}\bar{M})}(\mathsf{z})\,(\bar{\nu}\bar{M})({\rm d}\mathsf{z}) =f(𝗓)w(𝗓)μMT(d𝗓)\displaystyle=\int f(\mathsf{z})w(\mathsf{z})\mu\otimes M^{\otimes T}({\rm d}\mathsf{z})
=f(𝗓)w(𝗓)dμd(νR)(z0)(νR)(dz0)MT(z0,d𝗓0)\displaystyle=\int f(\mathsf{z})w(\mathsf{z})\frac{{\rm d}\mu}{{\rm d}(\nu R)}(z_{0})(\nu R)({\rm d}z_{0})M^{\otimes T}(z_{0},{\rm d}\mathsf{z}_{-0})
=f(𝗓)w(𝗓)dμd(νR)(z0)ν¯M¯(d𝗓).\displaystyle=\int f(\mathsf{z})w(\mathsf{z})\frac{{\rm d}\mu}{{\rm d}(\nu R)}(z_{0})\bar{\nu}\bar{M}({\rm d}\mathsf{z})\,.

We therefore conclude that

dμ¯L¯dν¯M¯(𝗓,𝗓)=w(𝗓)dμd(νR)(z0)=w(𝗓)dμdν(z0)\frac{{\rm d}\bar{\mu}\accentset{\curvearrowleft}{\otimes}\bar{L}}{{\rm d}\bar{\nu}\otimes\bar{M}}(\mathsf{z},\mathsf{z}^{\prime})=w(\mathsf{z}^{\prime})\frac{{\rm d}\mu}{{\rm d}(\nu R)}(z^{\prime}_{0})=w(\mathsf{z}^{\prime})\frac{{\rm d}\mu}{{\rm d}\nu}(z^{\prime}_{0})

where the last equality holds only when \nu R=\nu. ∎

The following result is important in two respects. First it establishes that if M,L satisfy a simple property then w_{k} always has a simple expression in terms of certain densities of \mu and \nu – this implies in particular that in Subsection 3.1 the kernel M_{n} is not required to leave \mu_{n-1} invariant to make the method implementable [14]. Second it provides a direct justification of the validity of advanced schemes – see Example 14. This therefore establishes that generic and widely applicable sufficient conditions for \bar{w}(\mathsf{z},\mathsf{z}^{\prime}) to be tractable are \nu R=\nu and the notion of (\upsilon,M,M^{*})-reversibility.

Lemma 9.

Let μ,ν𝒫(𝖹,𝒵)\mu,\nu\in\mathcal{P}\big{(}\mathsf{Z},\mathscr{Z}\big{)}, υ\upsilon be a σ\sigma-finite measure on (𝖹,𝒵)\big{(}\mathsf{Z},\mathscr{Z}\big{)} such that υμ,ν\upsilon\gg\mu,\nu and assume that we have a pair of Markov kernels M,M:𝖹×𝒵[0,1]M,M^{*}\colon\mathsf{Z}\times\mathscr{Z}\rightarrow[0,1] such that

υ(dz)M(z,dz)\displaystyle\upsilon({\rm d}z)M(z,{\rm d}z^{\prime}) =υ(dz)M(z,dz).\displaystyle=\upsilon({\rm d}z^{\prime})M^{*}(z^{\prime},{\rm d}z)\,. (24)

We call this property (υ,M,M)(\upsilon,M,M^{*})-reversibility. Then for z,z𝖹z,z^{\prime}\in\mathsf{Z} such that ν(z):=dν/dυ(z)>0\nu(z):={\rm d}\nu/{\rm d}\upsilon(z)>0 we have

dμMdνM(z,z)=dμ/dυ(z)dν/dυ(z).\frac{{\rm d}\mu\accentset{\curvearrowleft}{\otimes}M^{*}}{{\rm d}\nu\otimes M}(z,z^{\prime})=\frac{{\rm d}\mu/{\rm d}\upsilon(z^{\prime})}{{\rm d}\nu/{\rm d}\upsilon(z)}\,.
Proof.

For z,z𝖹z,z^{\prime}\in\mathsf{Z} such that dν/dυ(z)>0{\rm d}\nu/{\rm d}\upsilon(z)>0 we have

μ(dz)M(z,dz)\displaystyle\mu({\rm d}z^{\prime})M^{*}(z^{\prime},{\rm d}z) =dμdυ(z)υ(dz)M(z,dz)\displaystyle=\frac{{\rm d}\mu}{{\rm d}\upsilon}(z^{\prime})\upsilon({\rm d}z^{\prime})M^{*}(z^{\prime},{\rm d}z)
=dμdυ(z)υ(dz)M(z,dz)\displaystyle=\frac{{\rm d}\mu}{{\rm d}\upsilon}(z^{\prime})\upsilon({\rm d}z)M(z,{\rm d}z^{\prime})
=dμ/dυ(z)dν/dυ(z)dνdυ(z)υ(dz)M(z,dz)\displaystyle=\frac{{\rm d}\mu/{\rm d}\upsilon(z^{\prime})}{{\rm d}\nu/{\rm d}\upsilon(z)}\frac{{\rm d}\nu}{{\rm d}\upsilon}(z)\upsilon({\rm d}z)M(z,{\rm d}z^{\prime})
=dμ/dυ(z)dν/dυ(z)ν(dz)M(z,dz).\displaystyle=\frac{{\rm d}\mu/{\rm d}\upsilon(z^{\prime})}{{\rm d}\nu/{\rm d}\upsilon(z)}\nu({\rm d}z)M(z,{\rm d}z^{\prime})\,.

Corollary 10.

Let \mu_{n-1},\mu_{n}\in\mathcal{P}\big{(}\mathsf{Z},\mathscr{Z}\big{)} and \upsilon be a \sigma-finite measure such that \upsilon\gg\mu_{n-1},\mu_{n}, and let M_{n},L_{n-1}\colon\mathsf{Z}\times\mathscr{Z}\rightarrow[0,1] be such that

υ(dz)Mn(z,dz)=υ(dz)Ln1(z,dz),\upsilon({\rm d}z)M_{n}(z,{\rm d}z^{\prime})=\upsilon({\rm d}z^{\prime})L_{n-1}(z^{\prime},{\rm d}z)\,,

then for any z,z𝖹z,z^{\prime}\in\mathsf{Z} such that dμn/dυ(z)>0{\rm d}\mu_{n}/{\rm d}\upsilon(z)>0 and kk\in\mathbb{N}

\frac{{\rm d}\mu_{n}\accentset{\curvearrowleft}{\otimes}L_{n-1}^{k}}{{\rm d}\mu_{n}\otimes M_{n}^{k}}(z,z^{\prime})=\frac{{\rm d}\mu_{n}/{\rm d}\upsilon(z^{\prime})}{{\rm d}\mu_{n}/{\rm d}\upsilon(z)}\,.

and provided μn1Rn=μn1\mu_{n-1}R_{n}=\mu_{n-1} we can deduce

w¯n(𝗓)\displaystyle\bar{w}_{n}(\mathsf{z}) =dμn/dυ(z0)dμn1/dυ(z0)1T+1k=0Tdμn/dυ(zk)dμn/dυ(z0)\displaystyle=\frac{{\rm d}\mu_{n}/{\rm d}\upsilon(z_{0})}{{\rm d}\mu_{n-1}/{\rm d}\upsilon(z_{0})}\frac{1}{T+1}\sum_{k=0}^{T}\frac{{\rm d}\mu_{n}/{\rm d}\upsilon(z_{k})}{{\rm d}\mu_{n}/{\rm d}\upsilon(z_{0})}
=1T+1k=0Tdμn/dυ(zk)dμn1/dυ(z0)\displaystyle=\frac{1}{T+1}\sum_{k=0}^{T}\frac{{\rm d}\mu_{n}/{\rm d}\upsilon(z_{k})}{{\rm d}\mu_{n-1}/{\rm d}\upsilon(z_{0})}

where we have used Lemma  34.

We have shown earlier that standard integrator-based mutation kernels used in the context of Monte Carlo methods satisfy (24) with \upsilon the Lebesgue measure, but other scenarios are covered, such as that of the preconditioned Crank–Nicolson (pCN) algorithm, where \upsilon is the distribution of a Gaussian process.

3.3 Revisiting sampling with integrator snippets

In this scenario we have \mu({\rm d}z)=\pi({\rm d}x)\varpi({\rm d}v), assumed to have a density w.r.t. a \sigma-finite measure \upsilon, and \psi is an invertible mapping \psi\colon\mathsf{Z}\rightarrow\mathsf{Z} such that \upsilon^{\psi}=\upsilon and \psi^{-1}=\sigma\circ\psi\circ\sigma with \sigma\colon\mathsf{Z}\rightarrow\mathsf{Z} such that \mu\circ\sigma(z)=\mu(z). In this manuscript we focus primarily on the scenario where \psi is a discretization of Hamilton’s equations for a potential U\colon\mathsf{X}\rightarrow\mathbb{R}, e.g. a leapfrog integrator. We consider now the scenario where, in the framework developed in Section 3.1, we let M(z,{\rm d}z^{\prime}):=\delta_{\psi(z)}\big{(}{\rm d}z^{\prime}\big{)} be the deterministic kernel which maps the current state z\in\mathsf{Z} to \psi(z). Define \Psi(z,{\rm d}z^{\prime}):=\delta_{\psi(z)}({\rm d}z^{\prime}) and \Psi^{*}(z,{\rm d}z^{\prime}):=\delta_{\psi^{-1}(z)}({\rm d}z^{\prime}); we exploit the ideas of [1, Proposition 4] to establish that \Psi^{*} is the \upsilon-adjoint of \Psi when \upsilon is invariant under \psi.

Lemma 11.

Let μ\mu be a probability measure and υ\upsilon a σ\sigma-finite measure, on (𝖹,𝒵)(\mathsf{Z},\mathscr{Z}) such that υμ\upsilon\gg\mu. Denote μ(z):=dμ/dυ(z)\mu(z):={\rm d}\mu/{\rm d}\upsilon(z) for any z𝖹z\in\mathsf{Z}. Let ψ:𝖹𝖹\psi\colon\mathsf{Z}\rightarrow\mathsf{Z} be an invertible and volume preserving mapping, i.e. such that υψ(A)=υ(ψ1(A))=υ(A)\upsilon^{\psi}(A)=\upsilon\big{(}\psi^{-1}(A)\big{)}=\upsilon(A) for all A𝒵A\in\mathscr{Z}, then

  1.

    (υ,Ψ,Ψ)(\upsilon,\Psi,\Psi^{*}) form a reversible triplet, that is for all z,z𝖹z,z^{\prime}\in\mathsf{Z},

    υ(dz)δψ(z)(dz)=υ(dz)δψ1(z)(dz),\upsilon({\rm d}z)\delta_{\psi(z)}({\rm d}z^{\prime})=\upsilon({\rm d}z^{\prime})\delta_{\psi^{-1}(z^{\prime})}({\rm d}z),
  2.

    for all z,z𝖹z,z^{\prime}\in\mathsf{Z} such that μ(z)>0\mu(z)>0

    μ(dz)δψ1(z)(dz)=μψ(z)μ(z)μ(dz)δψ(z)(dz).\mu({\rm d}z^{\prime})\delta_{\psi^{-1}(z^{\prime})}({\rm d}z)=\frac{\mu\circ\psi(z)}{\mu(z)}\mu({\rm d}z)\delta_{\psi(z)}({\rm d}z^{\prime})\,.
Proof.

For the first statement

f(z)gψ(z)υ(dz)\displaystyle\int f(z)g\circ\psi(z)\upsilon({\rm d}z) =fψ1ψ(z)gψ(z)υ(dz)\displaystyle=\int f\circ\psi^{-1}\circ\psi(z)g\circ\psi(z)\upsilon({\rm d}z)
=fψ1(z)g(z)υψ(dz)\displaystyle=\int f\circ\psi^{-1}(z)g(z)\upsilon^{\psi}({\rm d}z)
=fψ1(z)g(z)υ(dz).\displaystyle=\int f\circ\psi^{-1}(z)g(z)\upsilon({\rm d}z).

We have

μ(dz)δψ1(z)(dz)\displaystyle\mu({\rm d}z^{\prime})\delta_{\psi^{-1}(z^{\prime})}({\rm d}z) =μ(z)υ(dz)δψ1(z)(dz)\displaystyle=\mu(z^{\prime})\upsilon({\rm d}z^{\prime})\delta_{\psi^{-1}(z^{\prime})}({\rm d}z)
=μ(z)υ(dz)δψ(z)(dz)\displaystyle=\mu(z^{\prime})\upsilon({\rm d}z)\delta_{\psi(z)}({\rm d}z^{\prime})
=μψ(z)υ(dz)δψ(z)(dz)\displaystyle=\mu\circ\psi(z)\upsilon({\rm d}z)\delta_{\psi(z)}({\rm d}z^{\prime})
=μψ(z)μ(z)μ(z)υ(dz)δψ(z)(dz)\displaystyle=\frac{\mu\circ\psi(z)}{\mu(z)}\mu(z)\upsilon({\rm d}z)\delta_{\psi(z)}({\rm d}z^{\prime})
=μψ(z)μ(z)μ(dz)δψ(z)(dz)\displaystyle=\frac{\mu\circ\psi(z)}{\mu(z)}\mu({\rm d}z)\delta_{\psi(z)}({\rm d}z^{\prime})
f(z)g(z)μ(dz)δψ1(z)(dz)\displaystyle\int f(z^{\prime})g(z)\mu({\rm d}z^{\prime})\delta_{\psi^{-1}(z^{\prime})}({\rm d}z) =f(z)g(z)μ(z)υ(dz)δψ1(z)(dz)\displaystyle=\int f(z^{\prime})g(z)\mu(z^{\prime})\upsilon({\rm d}z^{\prime})\delta_{\psi^{-1}(z^{\prime})}({\rm d}z)
=f(z)g(z)μ(z)υ(dz)δψ(z)(dz)\displaystyle=\int f(z^{\prime})g(z)\mu(z^{\prime})\upsilon({\rm d}z)\delta_{\psi(z)}({\rm d}z^{\prime})
=fψ(z)g(z)μψ(z)υ(dz)\displaystyle=\int f\circ\psi(z)g(z)\mu\circ\psi(z)\upsilon({\rm d}z)
=fψ(z)g(z)μψ(z)μ(z)μ(z)υ(dz)\displaystyle=\int f\circ\psi(z)g(z)\frac{\mu\circ\psi(z)}{\mu(z)}\mu(z)\upsilon({\rm d}z)
=f(z)g(z)μψ(z)μ(z)μ(dz)δψ(z)(dz)\displaystyle=\int f(z^{\prime})g(z)\frac{\mu\circ\psi(z)}{\mu(z)}\mu({\rm d}z)\delta_{\psi(z)}({\rm d}z^{\prime})

As a result

f(z)μ(dz)\displaystyle\int f(z^{\prime})\mu({\rm d}z^{\prime}) =f(z)μ(dz)δψ1(z)(dz)\displaystyle=\int f(z^{\prime})\mu({\rm d}z^{\prime})\delta_{\psi^{-1}(z^{\prime})}({\rm d}z)
=f(z)μψ(z)μ(z)μ(dz)δψ(z)(dz)\displaystyle=\int f(z^{\prime})\frac{\mu\circ\psi(z)}{\mu(z)}\mu({\rm d}z)\delta_{\psi(z)}({\rm d}z^{\prime})
=fψ(z)μψ(z)μ(z)μ(dz)\displaystyle=\int f\circ\psi(z)\frac{\mu\circ\psi(z)}{\mu(z)}\mu({\rm d}z)

Corollary 12.

With the assumptions of Lemma 11 above for k0,Tk\in\llbracket 0,T\rrbracket the weight (20) for Mμ=Ψk,Lμ=ΨkM_{\mu}=\Psi^{k},L_{\mu}=\Psi^{-k} admits the expression

wk(𝗓)=dμΨkdμΨk(z0,zk)=μψk(z0)μ(z0),w_{k}(\mathsf{z})=\frac{{\rm d}\mu\accentset{\curvearrowleft}{\otimes}\Psi{}^{-k}}{{\rm d}\mu\otimes\Psi^{k}}(z_{0},z_{k})=\frac{\mu\circ\psi^{k}(z_{0})}{\mu(z_{0})}\,,

Further, for RR a ν\nu-invariant Markov kernel the expression for the weight (19) becomes

w¯(𝗓,𝗓)\displaystyle\bar{w}(\mathtt{\mathsf{z}},\mathtt{\mathsf{z}}^{\prime}) =μ(z0)ν(z0)1T+1k=0Tμψk(z0)μ(z0),\displaystyle=\frac{\mu(z^{\prime}_{0})}{\nu(z^{\prime}_{0})}\frac{1}{T+1}\sum_{k=0}^{T}\frac{\mu\circ\psi^{k}(z^{\prime}_{0})}{\mu(z^{\prime}_{0})},

hence recovering the expression used in Section 2. This together with the results of Section 3.1 provides an alternative justification of correctness of Alg. 3 and hence Alg. 2. Note that this choice for LμL_{\mu} corresponds to the so-called optimal scenario; this can be deduced from Lemma 11 or by noting that

f(z,z)μΨ(dz)Ψ1(z,dz)\displaystyle\int f(z,z^{\prime})\mu\Psi({\rm d}z^{\prime})\Psi^{-1}(z^{\prime},{\rm d}z) =f(ψ1(z),z)μΨ(dz)\displaystyle=\int f(\psi^{-1}(z^{\prime}),z^{\prime})\mu\Psi({\rm d}z^{\prime})
=f(z,ψ(z))μ(dz)\displaystyle=\int f(z,\psi(z))\mu({\rm d}z)
\displaystyle=\int f(z,z^{\prime})\mu({\rm d}z)\Psi(z,{\rm d}z^{\prime})\,.

3.4 More complex integrator snippets

Here we provide examples of more complex integrator snippets to which earlier theory immediately applies thanks to the abstract point of view adopted throughout.

Example 13.

Let υμ\upsilon\gg\mu, so that μ(z):=dμ/dυ(z)\mu(z):={\rm d}\mu/{\rm d}\upsilon(z) is well defined. Let, for i{1,2}i\in\{1,2\}, ψi:𝖹𝖹\psi_{i}\colon\mathsf{Z}\rightarrow\mathsf{Z} be invertible and such that υψi=υ\upsilon^{\psi_{i}}=\upsilon, ψi1=σψiσ\psi_{i}^{-1}=\sigma\circ\psi_{i}\circ\sigma for σ:𝖹𝖹\sigma\colon\mathsf{Z}\rightarrow\mathsf{Z} such that σ2=Id\sigma^{2}={\rm Id} and υσ=υ\upsilon^{\sigma}=\upsilon. Then one can consider the delayed rejection MH transition probability

M(z,dz)=α1(z)δψ1(z)(dz)+α¯1(z)[α2(z)δψ2(z)(dz)+α¯2(z)δz(dz)],M(z,{\rm d}z^{\prime})=\alpha_{1}(z)\delta_{\psi_{1}(z)}\big{(}{\rm d}z^{\prime}\big{)}+\bar{\alpha}_{1}(z)\big{[}\alpha_{2}(z)\delta_{\psi_{2}(z)}({\rm d}z^{\prime})+\bar{\alpha}_{2}(z)\delta_{z}({\rm d}z^{\prime})\big{]}\,,

with \alpha_{i}(z)=1\wedge r_{i}(z), \bar{\alpha}_{i}(z)=1-\alpha_{i}(z) and, for z\in S_{1}\cap S_{2} with S_{1},S_{2} given below,

r1(z)=dυψ11dυ(z),r2(z)=α¯1σψ2(z)α¯1(z)dυψ21dυ(z).r_{1}(z)=\frac{{\rm d}\upsilon^{\psi_{1}^{-1}}}{{\rm d}\upsilon}(z),\quad r_{2}(z)=\frac{\bar{\alpha}_{1}\circ\sigma\circ\psi_{2}(z)}{\bar{\alpha}_{1}(z)}\frac{{\rm d}\upsilon^{\psi_{2}^{-1}}}{{\rm d}\upsilon}(z)\,.

A particular example is when υ\upsilon is the Lebesgue measure on 𝖷×𝖵\mathsf{X}\times\mathsf{V} and ψ1\psi_{1} is volume preserving, for instance the leapfrog integrator for Hamilton’s equations for some potential UU or the bounce update (25). Following [1] notice that υ\upsilon has density υ(z)=1/2\upsilon(z)=1/2 with respect to the measure υ+υψi=2υ\upsilon+\upsilon^{\psi_{i}}=2\upsilon. Now define S1:={z𝖹:υ(z)υψ1(z)>0}S_{1}:=\big{\{}z\in\mathsf{Z}\colon\upsilon(z)\wedge\upsilon\circ\psi_{1}(z)>0\big{\}}, we have

r1(z):={υψ1(z)υ(z)=1for zS10otherwise,r_{1}(z):=\begin{cases}\frac{\upsilon\circ\psi_{1}(z)}{\upsilon(z)}=1&\text{for }z\in S_{1}\\ 0&\text{otherwise}\end{cases},

and with S2:={z𝖹:[α¯1(z)υ(z)][α¯1σψ2(z)υψ2(z)]>0}S_{2}:=\big{\{}z\in\mathsf{Z}\colon\big{[}\bar{\alpha}_{1}(z)\,\upsilon(z)\big{]}\wedge\big{[}\bar{\alpha}_{1}\circ\sigma\circ\psi_{2}(z)\,\upsilon\circ\psi_{2}(z)\big{]}>0\big{\}}

r2(z):={α¯1σψ2(z)υψ2(z)α¯1(z)υ(z)=1for zS20otherwise.r_{2}(z):=\begin{cases}\frac{\bar{\alpha}_{1}\circ\sigma\circ\psi_{2}(z)\,\upsilon\circ\psi_{2}(z)}{\bar{\alpha}_{1}(z)\,\upsilon(z)}=1&\text{for }z\in S_{2}\\ 0&\text{otherwise}\end{cases}.

For instance ψ1\psi_{1} can be a HMC update, while

ψ2=ψ1bψ1withb(x,v)=(x,v2v,n(x)n(x))\psi_{2}=\psi_{1}\circ{\rm b}\circ\psi_{1}\,\text{with}\,{\rm b}(x,v)=\big{(}x,v-2\bigl{\langle}v,n(x)\bigr{\rangle}n(x)\big{)} (25)

for some unit length vector field n\colon\mathsf{X}\rightarrow\mathsf{X}. This can therefore be used as part of Alg. 4; care must be taken when computing the weights w_{n,k} and \bar{w}_{n}, and Sections 3 to 3.3 provide the tools to achieve this. For example, in the situation where \upsilon is the Lebesgue measure on \mathsf{Z}=\mathbb{R}^{d} and \psi_{2}={\rm Id} we recover Alg. 3 or, equivalently, Alg. 2.

Example 14.

An interesting instance of Example 13 is concerned with the scenario where interest is in sampling \mu constrained to some set C\subset\mathsf{Z} such that \upsilon(C)<\infty. Define \mu constrained to C, \mu_{C}(\cdot):=\mu(C\cap\cdot)/\mu(C), and similarly \upsilon_{C}(\cdot):=\upsilon(C\cap\cdot)/\upsilon(C). We let M be defined as above but targeting \upsilon_{C}. Naturally \upsilon_{C} has a density w.r.t. \upsilon, \upsilon_{C}(z)=\mathbf{1}_{C}(z)/\upsilon(C) for z\in\mathsf{Z}, and for i\in\{1,2\} we have \upsilon_{C}\circ\psi_{i}(z)\propto\mathbf{1}_{\psi_{i}^{-1}(C)}(z), \upsilon_{C}\circ\psi_{1}\circ\psi_{2}(z)\propto\mathbf{1}_{\psi_{2}^{-1}\circ\psi_{1}^{-1}(C)}(z). Consequently S_{1}:=\big\{z\in\mathsf{Z}\colon\mathbf{1}_{C\cap\psi_{1}^{-1}(C)}(z)>0\big\} and S_{2}=\big\{z\in\mathsf{Z}\colon\mathbf{1}_{C\cap\psi_{1}^{-1}(C^{\complement})}(z)\,\mathbf{1}_{\psi_{2}^{-1}(C)\cap\psi_{2}^{-1}\circ\psi_{1}^{-1}(C^{\complement})}(z)>0\big\} and as a result

α1(z)\displaystyle\alpha_{1}(z) :=𝟏Aψ11(C)(z)\displaystyle:=\mathbf{1}_{A\cap\psi_{1}^{-1}(C)}(z)
α2(z)\displaystyle\alpha_{2}(z) :=𝟏Aψ11(C)ψ21(C)ψ21ψ11(C)(z).\displaystyle:=\mathbf{1}_{A\cap\psi_{1}^{-1}(C^{\complement})\cap\psi_{2}^{-1}(C)\cap\psi_{2}^{-1}\circ\psi_{1}^{-1}(C^{\complement})}(z)\,.

The corresponding kernel M^{\otimes T} is described algorithmically in Alg. 5. In the situation where C:=\{x\in\mathsf{X}\colon c(x)=0\} for a continuously differentiable function c\colon\mathsf{X}\rightarrow\mathbb{R}, the bounces described in (25) can be defined in terms of the field x\mapsto n(x) such that

n(x)\displaystyle n(x) :={c(x)/|c(x)|forc(x)00otherwise.\displaystyle:=\begin{cases}\nabla c(x)/|\nabla c(x)|&\text{for}\,\nabla c(x)\neq 0\\ 0&\text{otherwise.}\end{cases}

This justifies the ideas of [5], where a process of the type given in Alg. 5 is used as a proposal within an MH update, although the possibility of a rejection after the second stage seems to have been overlooked in that reference.

Given z_{0}=z\in C\subset\mathsf{Z}
for k=1,\ldots,T do
      if \psi_{1}(z_{k-1})\in C then
            z_{k}=\psi_{1}(z_{k-1})
      else if \psi_{2}(z_{k-1})\in C and \psi_{1}\circ\psi_{2}(z_{k-1})\notin C then
            z_{k}=\psi_{2}(z_{k-1})
      else
            z_{k}=z_{k-1}
      end if
end for
Algorithm 5 M^{\otimes T} for M the delayed rejection algorithm targeting the uniform distribution on C

Naturally a rejection of both transformations ψ1\psi_{1} and ψ2\psi_{2} of the current state means that the algorithm gets stuck. We note that it is also possible to replace the third update case with a full refreshment of the velocity, which can be interpreted as a third delayed rejection update, of acceptance probability one.
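For concreteness, the following is a minimal Python sketch of the deterministic construction in Alg. 5; the leapfrog map \psi_{1} for a quadratic potential, the bounce-based \psi_{2} of (25) built from the constraint c(x)=|x|^{2}-r^{2}, and the ball constraint set C are illustrative choices of ours, meant only to exhibit the control flow.

```python
import numpy as np

EPS = 0.1                       # leapfrog step size, illustrative

def grad_U(x):
    # potential U(x) = |x|^2 / 2, an assumption made for this sketch
    return x

def psi1(z):
    # one leapfrog step (volume preserving); z = (x, v)
    x, v = z
    v = v - 0.5 * EPS * grad_U(x)
    x = x + EPS * v
    v = v - 0.5 * EPS * grad_U(x)
    return (x, v)

def bounce(z):
    # b(x, v) = (x, v - 2 <v, n(x)> n(x)) with n(x) = grad c(x)/|grad c(x)|, cf. (25);
    # for the illustrative constraint c(x) = |x|^2 - r^2 one has grad c(x) = 2x
    x, v = z
    n = x / np.linalg.norm(x)
    return (x, v - 2.0 * np.dot(v, n) * n)

def psi2(z):
    # psi2 = psi1 o b o psi1, cf. (25)
    return psi1(bounce(psi1(z)))

def in_C(z, r=2.0):
    # constraint set C = {(x, v): |x| <= r}, an illustrative choice
    return np.linalg.norm(z[0]) <= r

def snippet(z0, T):
    # Alg. 5: delayed-rejection construction of the snippet (z_0, ..., z_T)
    zs, z = [z0], z0
    for _ in range(T):
        z1 = psi1(z)
        if in_C(z1):
            z = z1
        else:
            z2 = psi2(z)
            if in_C(z2) and not in_C(psi1(z2)):
                z = z2
            # otherwise z is left unchanged, i.e. the trajectory "gets stuck"
        zs.append(z)
    return zs

z0 = (np.array([1.5, 0.0]), np.array([0.3, 1.0]))
print(len(snippet(z0, T=20)))   # T + 1 states
```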

3.5 Numerical illustration: orthant probabilities

In this section we consider the problem of computing Gaussian orthant probabilities, given by

p(𝐚,𝐛,Σ):=(𝐚𝐗𝐛),p(\mathbf{a},\mathbf{b},\Sigma):=\mathbb{P}(\mathbf{a}\leq\mathbf{X}\leq\mathbf{b}),

where 𝐚,𝐛d\mathbf{a},\mathbf{b}\in\mathbb{R}^{d} are known vectors of dimension dd and 𝐗𝒩d(𝟎,Σ)\mathbf{X}\sim\mathcal{N}_{d}(\mathbf{0},\Sigma) with Σ\Sigma a covariance matrix of size d×dd\times d. Consider the Cholesky decomposition of Σ\Sigma which is given by Σ:=LL\Sigma:=LL^{\top}, where L:=(lij,i,jd)L:=(l_{ij},i,j\in\llbracket d\rrbracket) is a lower triangular matrix with positive diagonal entries. It is clear that 𝐗\mathbf{X} can be viewed as 𝐗:=Lη\mathbf{X}:=L\mathbf{\eta}, where η𝒩d(𝟎,Idd)\mathbf{\eta}\sim\mathcal{N}_{d}(\mathbf{0},{\rm Id}_{d}). Consequently, one can instead rewrite p(𝐚,𝐛,Σ)p(\mathbf{a},\mathbf{b},\Sigma) as a product of dd probabilities given by

p1=(a1l11η1b1)=(a1/l11η1b1/l11),\displaystyle p_{1}=\mathbb{P}(a_{1}\leq l_{11}\eta_{1}\leq b_{1})=\mathbb{P}\left(a_{1}/l_{11}\leq\eta_{1}\leq b_{1}/l_{11}\right)\,, (26)

and

\displaystyle p_{n}=\mathbb{P}\left(a_{n}\leq{\textstyle\sum}_{j=1}^{n}l_{nj}\eta_{j}\leq b_{n}\right)=\mathbb{P}\left(\nicefrac{(a_{n}-\sum_{j=1}^{n-1}l_{nj}\eta_{j})}{l_{nn}}\leq\eta_{n}\leq\nicefrac{(b_{n}-\sum_{j=1}^{n-1}l_{nj}\eta_{j})}{l_{nn}}\right)\,, (27)

for n=2,\ldots,d. For notational simplicity, we let \mathcal{B}_{n}(\eta_{1:n-1}) denote the interval \left[\nicefrac{(a_{n}-\sum_{j=1}^{n-1}l_{nj}\eta_{j})}{l_{nn}},\nicefrac{(b_{n}-\sum_{j=1}^{n-1}l_{nj}\eta_{j})}{l_{nn}}\right], with the convention \mathcal{B}_{1}(\eta_{1:0}):=[a_{1}/l_{11},b_{1}/l_{11}]. Then p(\mathbf{a},\mathbf{b},\Sigma) can be written as the product of the p_{n} for n=1,2,\ldots,d. Moreover, p_{n} is also the normalising constant of the conditional distribution of \eta_{n} given \eta_{1:n-1}. To calculate the orthant probability, [32] proposed an SMC algorithm targeting the sequence of distributions \pi_{n}(\eta_{1:n}):=\pi_{1}(\eta_{1})\prod_{k=2}^{n}\pi_{k}(\eta_{k}|\eta_{1:k-1}) for n\in\llbracket 1,d\rrbracket, given by

π1(η1)ϕ(η1)𝟏{η11}=γ1(η1)\displaystyle\pi_{1}(\eta_{1})\propto\phi(\eta_{1})\mathbf{1}\{\eta_{1}\in\mathcal{B}_{1}\}=\gamma_{1}(\eta_{1}) (28)
πn(ηn|η1:n1)ϕ(ηn)𝟏{ηnn(η1:n1)}=γn(ηn|η1:n1).\displaystyle\pi_{n}(\eta_{n}|\eta_{1:n-1})\propto\phi(\eta_{n})\mathbf{1}\{\eta_{n}\in\mathcal{B}_{n}(\eta_{1:n-1})\}=\gamma_{n}(\eta_{n}|\eta_{1:n-1}). (29)

where ϕ\phi denotes the probability density of a 𝒩(0,1)\mathcal{N}(0,1). One could also note that

πn(ηn|η1:n1)=1/Φ(n(η1:n1))ϕ(ηn)𝟏{ηnn(η1:n1)}\pi_{n}(\eta_{n}|\eta_{1:n-1})=1/\Phi(\mathcal{B}_{n}(\eta_{1:n-1}))\phi(\eta_{n})\mathbf{1}\{\eta_{n}\in\mathcal{B}_{n}(\eta_{1:n-1})\}

and \gamma_{n}(\eta_{1:n})=\phi(\eta_{1:n})\prod_{k=1}^{n}\mathbf{1}\{\eta_{k}\in\mathcal{B}_{k}(\eta_{1:k-1})\}, where \Phi(\mathcal{B}_{n}(\eta_{1:n-1})) denotes the probability that a standard normal random variable falls in the region \mathcal{B}_{n}(\eta_{1:n-1}). The SMC algorithm proposed by [32] then proceeds as follows. (1) At time n, each particle \eta_{1:n-1}^{(i)} is extended by sampling \eta_{n}^{(i)}\sim\pi_{n}({\rm d}\eta_{n}|\eta_{1:n-1}^{(i)}). (2) The particle \eta_{1:n}^{(i)} is then reweighted by multiplying w_{n-1}^{(i)} by the incremental weight \Phi(\mathcal{B}_{n}(\eta_{1:n-1}^{(i)})). (3) If the ESS falls below a given threshold, the particles are resampled and moved through k iterations of an MCMC kernel leaving \pi_{n} invariant; for this move [32] recommend using a Gibbs sampler. The orthant probability of interest can then be viewed as the normalising constant of \pi_{d}, which is estimated as a by-product of the SMC algorithm.
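To fix ideas, the following minimal Python sketch (with our own function names) shows how the interval \mathcal{B}_{n}(\eta_{1:n-1}), the incremental weight \Phi(\mathcal{B}_{n}(\eta_{1:n-1})) and the extension step (1) can be computed from the Cholesky factor L; it is illustrative only.

```python
import numpy as np
from scipy.stats import norm

def interval(n, eta_prev, a, b, L):
    # B_n(eta_{1:n-1}) = [(a_n - sum_{j<n} l_{nj} eta_j)/l_{nn}, (b_n - ...)/l_{nn}],
    # with 0-based indexing so that row n of L corresponds to coordinate n+1
    s = L[n, :n] @ eta_prev if n > 0 else 0.0
    return (a[n] - s) / L[n, n], (b[n] - s) / L[n, n]

def incremental_weight(lo, hi):
    # Phi(B_n): probability that a standard normal variable falls in [lo, hi]
    return norm.cdf(hi) - norm.cdf(lo)

def extend(lo, hi, rng):
    # sample eta_n from the truncated conditional (29) by inversion of the CDF
    u = rng.uniform(norm.cdf(lo), norm.cdf(hi))
    return norm.ppf(u)
```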

Since we are sampling from constrained Gaussian distributions, Hamilton's equations can be solved exactly and w_{n,k} is always 1. As a result, the incremental weights for the trajectories simplify to \Phi(\mathcal{B}_{n}(u_{n-1})) and each particle on the trajectory starting from z_{t} has an incremental weight proportional to \Phi(\mathcal{B}_{n}(u_{n-1})). To obtain a trajectory, we follow [30], who perform HMC with reflections. As the dimension increases, the number of reflections performed under \psi_{n} also increases for a fixed integration time. We therefore tuned the integration time \varepsilon adaptively to ensure that the average number of reflections at each SMC step does not exceed a given threshold, set to 5 in our experiment. To show that the waste-recycling HSMC algorithm scales well in high dimension, we set d=150, a=(1.5,1.5,\ldots) and b=(\infty,\infty,\ldots). We use the same covariance matrix as in [14] and perform variable re-ordering as suggested in [32] before the simulation.

Figure 9: Orthant probability example: estimates of the normalising constant (i.e. the orthant probability) obtained from the waste-recycling HSMC algorithm with N=50,000.
Figure 10: Orthant probability example: estimates of the mean of marginals obtained from the integrator snippet SMC algorithm with N=50,000.

Figures 9 and 10 show the results obtained with N\times(T+1)=50,000 and various values of N. With a quarter of the number of particles used in [14], the waste-recycling HSMC algorithm achieves comparable performance when estimating the normalising constant (i.e. the orthant probability). Moreover, the estimates are stable across the choices of N, although the algorithm performs best for N=500 (i.e. each trajectory contains 100 particles). This also suggests that the integration time should be tuned adaptively in a different way to achieve the best performance for a fixed computational budget. Estimates of the expectation of \varphi(x_{1:d})=\frac{1}{d}\sum_{i=1}^{d}x_{i} with respect to the Gaussian distribution \mathcal{N}_{d}(\mathbf{0},\Sigma) truncated between \mathbf{a} and \mathbf{b} are also stable across the choices of N, although they are more variable than those obtained in [14]. This indicates that the waste-recycling HSMC algorithm does scale well in high dimension. We note that this higher variance, compared to the waste-free SMC of [14], is obtained in a scenario where the latter is able to exploit the particular structure of the problem and implement an exact Gibbs sampler to move the particles. The integrator snippet SMC we propose is however more general and applicable to scenarios where such structure is not present.

4 Preliminary theoretical exploration

In this section we omit the index n0,Pn\in\llbracket 0,P\rrbracket and provide precise elements to understand the properties of the algorithms and estimators considered in this manuscript.

4.1 Variance using folded mixture samples

This section establishes variance reduction for an estimator of \mu(f) of the type (15), but where it is assumed that \{\check{z}_{n}^{(i)},i\in\llbracket N\rrbracket\} are iid distributed according to \bar{\mu}_{n}. While this is not the exact distribution in the algorithm, this provides a proxy representative of the quantities one needs to control when analysing SMC samplers [15, Chapter 9]: the variance of estimators in an SMC algorithm can be decomposed as the sum of local variances at each iteration, which converge to the variance terms considered hereafter as N\rightarrow\infty. The main message is that \{f\circ\psi^{k}(z),k\in\llbracket P\rrbracket\} can be understood as playing the rôle of control variates, with the potential of reducing variance.

We use the following simplified notation throughout. Let

μ¯(k,dz):=1T+1μk(dz),\bar{\mu}(k,{\rm d}z):=\frac{1}{T+1}\mu_{k}({\rm d}z)\,,

with μk:=μψk\mu_{k}:=\mu^{\psi^{-k}}. The fundamental property exploited throughout is (11), which can be rephrased as, for f:𝖹f\colon\mathsf{Z}\rightarrow\mathbb{R} μ\mu-integrable,

{k=0Tfψk(z)μ¯(kz)}μ¯(dz)=μ(f),\int\Bigl{\{}\sum_{k=0}^{T}f\circ\psi^{k}(z)\bar{\mu}(k\mid z)\Bigr{\}}\bar{\mu}({\rm d}z)=\mu(f)\,,

or written differently with f¯(k,z):=fψk(z)\bar{f}(k,z):=f\circ\psi^{k}(z)

𝔼μ¯(𝔼μ¯(f¯(T,Zˇ)Zˇ))=𝔼μ¯(f¯(T,Zˇ))=𝔼μ(f(Z)).\mathbb{E}_{\bar{\mu}}\left(\mathbb{E}_{\bar{\mu}}(\bar{f}(\mathrm{T},\check{Z})\mid\check{Z})\right)=\mathbb{E}_{\bar{\mu}}\left(\bar{f}(\mathrm{T},\check{Z})\right)=\mathbb{E}_{\mu}\left(f(Z)\right).

The estimator we use is therefore a so-called “Rao-Blackwellized estimator”, assuming Zˇ(1),,Zˇ(N)iidμ¯\check{Z}^{(1)},\ldots,\check{Z}^{(N)}\overset{{\rm iid}}{\sim}\bar{\mu},

μˇ(f)=N1i=1N𝔼μ¯(f¯(T,Zˇ(i))Zˇ(i)),\check{\mu}(f)=N^{-1}\sum_{i=1}^{N}\mathbb{E}_{\bar{\mu}}\big{(}\bar{f}(\mathrm{T},\check{Z}^{(i)})\mid\check{Z}^{(i)}\big{)}\,,

of variance varμ¯(𝔼μ¯(f¯(T,Zˇ)Zˇ))/N{\rm var}_{\bar{\mu}}\left(\mathbb{E}_{\bar{\mu}}\big{(}\bar{f}(\mathrm{T},\check{Z})\mid\check{Z}\big{)}\right)/N. The following relates this variance to that of the standard estimator using NN iid samples from μ\mu, which would likewise play a role in the asymptotic properties of SMC algorithms.
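For concreteness, a minimal numpy sketch of the folded estimator \check{\mu}(f) is given below; it assumes the snippet states have been precomputed in an array, with seeds (approximately) distributed according to \bar{\mu}, and that the unnormalised density \mu and the test function f are vectorised. These names and assumptions are ours.

```python
import numpy as np

def folded_estimate(snippets, mu, f):
    # snippets: array of shape (N, T+1, d) holding (z, psi(z), ..., psi^T(z)) for each seed z;
    # mu: unnormalised target density, f: test function, both vectorised over the last axis.
    w = mu(snippets)                         # (N, T+1), values mu(psi^k(z))
    w = w / w.sum(axis=1, keepdims=True)     # bar{mu}(k | z), normalised within each snippet
    rb = (w * f(snippets)).sum(axis=1)       # E_bar{mu}(bar f(T, Z) | Z) for each snippet
    return rb.mean()                         # check{mu}(f): average of the N conditional means
```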

Proposition 15.

We have

varμ¯(𝔼μ¯(f¯(T,Zˇ)Zˇ))=varμ(f(Zˇ))𝔼μ¯(varμ¯(f¯(T,Zˇ)Zˇ)){\rm var}_{\bar{\mu}}\left(\mathbb{E}_{\bar{\mu}}\big{(}\bar{f}(\mathrm{T},\check{Z})\mid\check{Z}\big{)}\right)={\rm var}_{\mu}\left(f(\check{Z})\right)-\mathbb{E}_{\bar{\mu}}\big{(}{\rm var}_{\bar{\mu}}(\bar{f}(\mathrm{T},\check{Z})\mid\check{Z})\big{)}
Proof.

The variance decomposition identity yields

varμ¯(f¯(T,Zˇ))=varμ¯(𝔼μ¯(f¯(T,Zˇ)Zˇ))+𝔼μ¯(varμ¯(f¯(T,Zˇ)Zˇ)),{\rm var}_{\bar{\mu}}\left(\bar{f}(\mathrm{T},\check{Z})\right)={\rm var}_{\bar{\mu}}\left(\mathbb{E}_{\bar{\mu}}\big{(}\bar{f}(\mathrm{T},\check{Z})\mid\check{Z}\big{)}\right)+\mathbb{E}_{\bar{\mu}}\big{(}{\rm var}_{\bar{\mu}}(\bar{f}(\mathrm{T},\check{Z})\mid\check{Z})\big{)}\,,

but from the fundamental property

varμ¯(f¯(T,Zˇ))\displaystyle{\rm var}_{\bar{\mu}}\left(\bar{f}(\mathrm{T},\check{Z})\right) =𝔼μ¯(f¯(T,Zˇ)2)𝔼μ¯(f¯(T,Zˇ))2\displaystyle=\mathbb{E}_{\bar{\mu}}\left(\bar{f}(\mathrm{T},\check{Z})^{2}\right)-\mathbb{E}_{\bar{\mu}}\left(\bar{f}(\mathrm{T},\check{Z})\right)^{2}
=𝔼μ(f(Zˇ)2)𝔼μ(f(Zˇ))2\displaystyle=\mathbb{E}_{\mu}\left(f(\check{Z})^{2}\right)-\mathbb{E}_{\mu}\left(f(\check{Z})\right)^{2}
=varμ(f(Zˇ)).\displaystyle={\rm var}_{\mu}\left(f(\check{Z})\right).

The following provides us with a notion of effective sample size for estimators of the form μˇ(f)\check{\mu}(f).

Theorem 16.

Assume that υμψk\upsilon\gg\mu^{\psi^{-k}} for kTk\in\llbracket T\rrbracket, then with Zˇμ¯\check{Z}\sim\bar{\mu},

  1.

    for f:𝖹,f\colon\mathsf{Z}\rightarrow\mathbb{R},

    var(μˇ(f))2Nvar(k=0Tμψk(Zˇ)fψk(Zˇ))+f2var(k=0Tμψk(Zˇ))𝔼μ¯(k=0Tμψk(Zˇ))2,{\rm var}\left(\check{\mu}(f)\right)\leq\frac{2}{N}\frac{{\rm var}\big{(}{\textstyle\sum}_{k=0}^{T}\mu\circ\psi^{k}(\check{Z})f\circ\psi^{k}(\check{Z})\big{)}+\|f\|_{\infty}^{2}{\rm var}\big{(}{\textstyle\sum}_{k=0}^{T}\mu\circ\psi^{k}(\check{Z})\big{)}}{\mathbb{E}_{\bar{\mu}}\big{(}{\textstyle\sum}_{k=0}^{T}\mu\circ\psi^{k}(\check{Z})\big{)}^{2}}\,,
  2.

    further

    var(μˇ(f))2f2N{2𝔼μ¯((k=0Tμψk(Zˇ))2)𝔼μ¯(k=0Tμψk(Zˇ))21}.{\rm var}\left(\check{\mu}(f)\right)\leq\frac{2\|f\|_{\infty}^{2}}{N}\left\{\frac{2\mathbb{E}_{\bar{\mu}}\big{(}({\textstyle\sum}_{k=0}^{T}\mu\circ\psi^{k}(\check{Z}))^{2}\big{)}}{\mathbb{E}_{\bar{\mu}}\big{(}{\textstyle\sum}_{k=0}^{T}\mu\circ\psi^{k}(\check{Z})\big{)}^{2}}-1\right\}\,.
Remark 17.

Multiplying μ\mu by any constant Cμ>0C_{\mu}>0 does not alter the values of the upper bounds, making the results relevant to the scenario where μ\mu is known up to a constant only. From the second statement we can define a notion of effective sample size (ESS)

N2𝔼μ¯(k=0Tμψk(Zˇ))22𝔼μ¯((k=0Tμψk(z))2)𝔼μ¯(k=0Tμψk(Zˇ))2,\frac{N}{2}\frac{\mathbb{E}_{\bar{\mu}}\big{(}{\textstyle\sum}_{k=0}^{T}\mu\circ\psi^{k}(\check{Z})\big{)}^{2}}{2\mathbb{E}_{\bar{\mu}}\big{(}({\textstyle\sum}_{k=0}^{T}\mu\circ\psi^{k}(z))^{2}\big{)}-\mathbb{E}_{\bar{\mu}}\big{(}{\textstyle\sum}_{k=0}^{T}\mu\circ\psi^{k}(\check{Z})\big{)}^{2}}\,,

which could be compared to N(T+1)N(T+1) or NN.

Proof.

We have

var(μˇ(f))=1Nvarμ¯(k=0Tμψk(Zˇ)fψk(Zˇ)l=0Tμψl(Zˇ)){\rm var}\left(\check{\mu}(f)\right)=\frac{1}{N}{\rm var}_{\bar{\mu}}\left(\frac{\sum_{k=0}^{T}\mu\circ\psi^{k}(\check{Z})f\circ\psi^{k}(\check{Z})}{\sum_{l=0}^{T}\mu\circ\psi^{l}(\check{Z})}\right)

and we apply Lemma 21

varμ¯(k=0Tμψk(Zˇ)fψk(Zˇ)l=0Tμψl(Zˇ))\displaystyle{\rm var}_{\bar{\mu}}\left(\frac{\sum_{k=0}^{T}\mu\circ\psi^{k}(\check{Z})f\circ\psi^{k}(\check{Z})}{\sum_{l=0}^{T}\mu\circ\psi^{l}(\check{Z})}\right) 2var(k=0Tμψk(Zˇ)fψk(Zˇ))+f2var(k=0Tμψk(Zˇ))𝔼μ¯(k=0Tμψk(z))2\displaystyle\leq 2\frac{{\rm var}\big{(}{\textstyle\sum}_{k=0}^{T}\mu\circ\psi^{k}(\check{Z})f\circ\psi^{k}(\check{Z})\big{)}+\|f\|_{\infty}^{2}{\rm var}\big{(}{\textstyle\sum}_{k=0}^{T}\mu\circ\psi^{k}(\check{Z})\big{)}}{\mathbb{E}_{\bar{\mu}}\big{(}{\textstyle\sum}_{k=0}^{T}\mu\circ\psi^{k}(z)\big{)}^{2}}
2f2{2𝔼μ¯((k=0Tμψk(Zˇ))2)𝔼μ¯(k=0Tμψk(Zˇ))21}.\displaystyle\leq 2\|f\|_{\infty}^{2}\left\{\frac{2\mathbb{E}_{\bar{\mu}}\big{(}({\textstyle\sum}_{k=0}^{T}\mu\circ\psi^{k}(\check{Z}))^{2}\big{)}}{\mathbb{E}_{\bar{\mu}}\big{(}{\textstyle\sum}_{k=0}^{T}\mu\circ\psi^{k}(\check{Z})\big{)}^{2}}-1\right\}\,.
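The ESS of Remark 17 can be estimated empirically from the per-snippet sums \sum_{k}\mu\circ\psi^{k}(\check{Z}); a minimal numpy sketch, under the same assumptions as above (precomputed snippet states and a vectorised unnormalised density), is the following. Note that the expression is invariant to rescaling of \mu, consistently with Remark 17.

```python
import numpy as np

def folded_ess(snippets, mu):
    # s_i = sum_{k=0}^{T} mu(psi^k(z_i)), one value per snippet
    s = mu(snippets).sum(axis=1)
    N = s.shape[0]
    m1, m2 = s.mean() ** 2, (s ** 2).mean()
    # ESS of Remark 17: (N/2) * E[s]^2 / (2 E[s^2] - E[s]^2), to be compared with N or N(T+1)
    return 0.5 * N * m1 / (2.0 * m2 - m1)
```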

4.2 Variance using unfolded mixture samples

For \psi\colon\mathsf{Z}\rightarrow\mathsf{Z} invertible, \nu a probability distribution such that \nu\gg\mu^{\psi^{-1}} and f\colon\mathsf{Z}\rightarrow\mathbb{R} such that \mu(f) exists, we have the identity

μ(f)=fψ(z)μψ1(dz)=fψ(z)dμψ1dν(z)ν(dz).\mu(f)=\int f\circ\psi(z)\mu^{\psi^{-1}}({\rm d}z)=\int f\circ\psi(z)\frac{{\rm d}\mu^{\psi^{-1}}}{{\rm d}\nu}(z)\nu({\rm d}z)\,.

As a result, for \{\psi_{k}\colon\mathsf{Z}\rightarrow\mathsf{Z},k\in\llbracket K\rrbracket\} all invertible and such that \nu\gg\mu^{\psi_{k}^{-1}} for k\in\llbracket K\rrbracket, we have

μ(f)={1KkKfψk(z)dμψk1dν(z)}ν(dz),\mu(f)=\int\Bigl{\{}\frac{1}{K}\sum_{k\in\llbracket K\rrbracket}f\circ\psi_{k}(z)\frac{{\rm d}\mu^{\psi_{k}^{-1}}}{{\rm d}\nu}(z)\Bigr{\}}\nu({\rm d}z)\,, (30)

which suggests the Pushforward Importance Sampling (PISA) estimator, for ZiiidνZ^{i}\overset{{\rm iid}}{\sim}\nu, iNi\in\llbracket N\rrbracket and with μ¯/ν(fz):=K1kKfψk(z)dμψk1dν(z)\bar{\mu}_{/\nu}(f\mid z):=K^{-1}\sum_{k\in\llbracket K\rrbracket}f\circ\psi_{k}(z)\frac{{\rm d}\mu^{\psi_{k}^{-1}}}{{\rm d}\nu}(z)

μ^(f)\displaystyle\hat{\mu}(f) =i=1Nμ¯/ν(fZi)j=1Nμ¯/ν(1Zj)=1Ni=1Nμ¯/ν(fZi)1Nj=1Nμ¯/ν(1Zj),\displaystyle=\sum_{i=1}^{N}\frac{\bar{\mu}_{/\nu}(f\mid Z^{i})}{\sum_{j=1}^{N}\bar{\mu}_{/\nu}(\mathrm{1}\mid Z^{j})}=\frac{\frac{1}{N}\sum_{i=1}^{N}\bar{\mu}_{/\nu}(f\mid Z^{i})}{\frac{1}{N}\sum_{j=1}^{N}\bar{\mu}_{/\nu}(\mathrm{1}\mid Z^{j})}\,, (31)

This is the estimator in (7) when ν=μn1\nu=\mu_{n-1} and μ=μn\mu=\mu_{n}.
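In the common situation where the \psi_{k} are iterates \psi^{k} of a single integrator preserving a dominating measure shared by \mu and \nu, the Radon-Nikodym derivatives reduce to ratios of densities, {\rm d}\mu^{\psi_{k}^{-1}}/{\rm d}\nu(z)=\mu\circ\psi^{k}(z)/\nu(z), and (31) can be computed as in the following sketch; the vectorised functions, and the reading of \llbracket K\rrbracket as \{1,\ldots,K\}, are assumptions of ours.

```python
import numpy as np

def pisa_estimate(z, psi, mu, nu, f, K):
    # z: (N, d) seeds assumed iid from nu; psi: one integrator step, vectorised;
    # mu, nu: unnormalised densities; f: test function. Assumes psi preserves a
    # common dominating measure so that d mu^{psi_k^{-1}}/d nu (z) = mu(psi^k(z)) / nu(z).
    num, den, zk = 0.0, 0.0, z.copy()
    for _ in range(K):
        zk = psi(zk)                          # psi^k(z) for k = 1, ..., K
        w = mu(zk) / nu(z)                    # unnormalised weight of the k-th snippet state
        num += (w * f(zk)).sum()
        den += w.sum()
    return num / den                          # self-normalised estimator (31)
```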

4.2.1 Relative efficiency for unfolded estimators

In order to define the notion of relative efficiency for the estimator μ^(f)\hat{\mu}(f) in (31) we first establish the following bounds.

Theorem 18.

With the notation above and Z\sim\nu throughout, for any f\colon\mathsf{Z}\rightarrow\mathbb{R},

  1.
    var(μ^(f))𝔼(|μ^(f)μ(f)|2)\displaystyle{\rm var}\big{(}\hat{\mu}(f)\big{)}\leq\mathbb{E}\Bigl{(}|\hat{\mu}(f)-\mu(f)|^{2}\Bigr{)} 2N{var(μ¯/ν(fZ))+f2var(μ¯/ν(1Z))},\displaystyle\leq\frac{2}{N}\big{\{}{\rm var}\big{(}\bar{\mu}_{/\nu}(f\mid Z)\big{)}+\|f\|_{\infty}^{2}{\rm var}\big{(}\bar{\mu}_{/\nu}(1\mid Z)\big{)}\big{\}}\,,
  2.

    more simply

    𝔼(|μ^(f)μ(f)|2)2f2KN{2𝔼[(kKdμψk1/dν(Z))2]KK},\mathbb{E}\Bigl{(}|\hat{\mu}(f)-\mu(f)|^{2}\Bigr{)}\leq\frac{2\|f\|_{\infty}^{2}}{KN}\Bigl{\{}\frac{2\mathbb{E}\Bigl{[}\bigl{(}\sum_{k\in\llbracket K\rrbracket}{\rm d}\mu^{\psi_{k}^{-1}}/{\rm d}\nu(Z)\bigr{)}^{2}\Bigr{]}}{K}-K\Bigr{\}}\,, (32)
  3.

    and

    |𝔼[μ^(f)]μ(f)|2f2KN𝔼[(kKdμψk1/dν(Z))2]K.|\mathbb{E}[\hat{\mu}(f)]-\mu(f)|\leq\frac{2\|f\|_{\infty}^{2}}{KN}\frac{\mathbb{E}\Bigl{[}\bigl{(}\sum_{k\in\llbracket K\rrbracket}{\rm d}\mu^{\psi_{k}^{-1}}/{\rm d}\nu(Z)\bigr{)}^{2}\Bigr{]}}{K}\,.
Remark 19.

The upper bound in the first statement confirms the control variate nature of integrator snippets, even when using the unfolded perspective, a property missed by the rougher bounds of the last two statements.

Remark 20 (ESS for PISA).

The notion of efficiency is usually defined relative to the “perfect” Monte Carlo scenario, that is, the standard estimator \hat{\mu}_{0} of \mu(f) relying on KN iid samples from \mu, for which we have

var(μ^0(f))=varμ(f)KNf2KN.{\rm var}\Bigl{(}\hat{\mu}_{0}(f)\Bigr{)}=\frac{{\rm var}_{\mu}(f)}{KN}\leq\frac{\|f\|_{\infty}^{2}}{KN}\,. (33)

The relative efficiency \mathrm{RE}_{0} is then defined as the ratio of the upper bound in (32) to (33). Our point below is that the notion of efficiency can be defined relative to any competing algorithm, virtual or not, in order to characterize particular properties. For example we can compute the efficiency relative to that of the “ideal” PISA estimator, i.e. that for which \bar{\mu}_{/\nu}(\bar{f}\mid Z^{i}) is replaced with K^{-1}\sum_{k\in\llbracket K\rrbracket}f\circ\psi_{k}(Z^{i,k})\frac{{\rm d}\mu^{\psi_{k}^{-1}}}{{\rm d}\nu}(Z^{i,k}) for Z^{i,k}\overset{{\rm iid}}{\sim}\nu, and

var(μ^1(f))2f2KN{2kK𝔼[(dμψk1/dν(Z))2]KK}.{\rm var}\Bigl{(}\hat{\mu}_{1}(f)\Bigr{)}\leq\frac{2\|f\|_{\infty}^{2}}{KN}\left\{\frac{2\sum_{k\in\llbracket K\rrbracket}\mathbb{E}\Bigl{[}\bigl{(}{\rm d}\mu^{\psi_{k}^{-1}}/{\rm d}\nu(Z)\bigr{)}^{2}\Bigr{]}}{K}-K\right\}\,. (34)

The corresponding RE1\mathrm{RE}_{1} captures the loss incurred because of dependence along a snippet. However, given our initial motivation of recycling the computation of a standard HMC based SMC algorithm we opt to define the RE2\mathrm{RE}_{2} relative to the estimator relying on both ends of the snippet only, i.e.

\hat{\mu}_{2}(f)=\frac{\sum_{i=1}^{N}\frac{1}{2}\frac{{\rm d}\mu}{{\rm d}\nu}(Z^{i})f(Z^{i})+\frac{1}{2}\frac{{\rm d}\mu^{\psi_{K}^{-1}}}{{\rm d}\nu}(Z^{i})f\circ\psi_{K}(Z^{i})}{\sum_{j=1}^{N}\frac{1}{2}\frac{{\rm d}\mu}{{\rm d}\nu}(Z^{j})+\frac{1}{2}\frac{{\rm d}\mu^{\psi_{K}^{-1}}}{{\rm d}\nu}(Z^{j})}\,.

In the SMC scenario considered in this manuscript (see Section 1) the above can be thought of as a proxy for estimators obtained by a “Rao-Blackwellized” SMC algorithm using Pn1,TP_{n-1,T} in (6), where NN particles {zn1(i),iN}\{z_{n-1}^{(i)},i\in\llbracket N\rrbracket\} in Alg. 1 give rise to 2N2N weighted particles

{(zi,α¯n1(zi;T)μnμn1(zi));(zi,αn1(zi;T)μnμn1ψnT(zi)),iN},\{\big{(}z^{i},\bar{\alpha}_{n-1}(z_{i};T)\tfrac{\mu_{n}}{\mu_{n-1}}(z_{i})\big{)};\big{(}z^{i},\alpha_{n-1}(z_{i};T)\tfrac{\mu_{n}}{\mu_{n-1}}\circ\psi_{n}^{T}(z_{i})\big{)},i\in\llbracket N\rrbracket\}\>,

with αn1(;T)\alpha_{n-1}(\cdot;T) defined in (5). Resampling with these weights is then applied to obtain NN particles and followed by an update of velocities to yield {zn(i),iN}\{z_{n}^{(i)},i\in\llbracket N\rrbracket\}. Now, we observe the similarity between

αn1(z;T)μnμn1ψnT(z)=min{1,μn1ψn1Tμn1(z)}×μnψnTμn1ψnT(z),\alpha_{n-1}(z;T)\tfrac{\mu_{n}}{\mu_{n-1}}\circ\psi_{n}^{T}(z)=\min\left\{1,\frac{\mu_{n-1}\circ\psi_{n-1}^{T}}{\mu_{n-1}}(z)\right\}\times\frac{\mu_{n}\circ\psi_{n}^{T}}{\mu_{n-1}\circ\psi_{n}^{T}}(z)\,, (35)

and the corresponding weight in Alg. 2, in particular when μn1\mu_{n-1} and μn\mu_{n} are similar, and hence ψn1\psi_{n-1} and ψn\psi_{n}. This motivates our choice of reference to define ESS2\mathrm{ESS}_{2} which has a clear computational advantage since it involves ignoring T1T-1 terms only. In the present scenario, following Lemma 21, we have

var(μ^2(f))\displaystyle{\rm var}\big{(}\hat{\mu}_{2}(f)\big{)} f2N{𝔼[(dμdν(Z)+dμψK1dν(Z))2]12},\displaystyle\leq\frac{\|f\|_{\infty}^{2}}{N}\left\{\mathbb{E}\left[\big{(}\frac{{\rm d}\mu}{{\rm d}\nu}(Z)+\frac{{\rm d}\mu^{\psi_{K}^{-1}}}{{\rm d}\nu}(Z)\big{)}^{2}\right]-\frac{1}{2}\right\}\,,

which leads to the relative efficiency for PISA,

RE2=4𝔼[(kKdμψk1/dν(Z))2]K22𝔼[(dμdν(Z)+dμψK1dν(Z))2]2,{\rm RE}_{2}=\frac{\frac{4\mathbb{E}\Bigl{[}\bigl{(}\sum_{k\in\llbracket K\rrbracket}{\rm d}\mu^{\psi_{k}^{-1}}/{\rm d}\nu(Z)\bigr{)}^{2}\Bigr{]}}{K^{2}}-2}{\mathbb{E}\left[\big{(}\frac{{\rm d}\mu}{{\rm d}\nu}(Z)+\frac{{\rm d}\mu^{\psi_{K}^{-1}}}{{\rm d}\nu}(Z)\big{)}^{2}\right]-2}\,,

which can be estimated using empirical averages.
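For illustration, \mathrm{RE}_{2} may be approximated by replacing the expectations with empirical averages over the seeds Z^{i}\sim\nu, as in the sketch below; it again assumes that the \psi_{k} are iterates of a single measure-preserving integrator, and that \mu and \nu are properly normalised densities (otherwise the constant terms in \mathrm{RE}_{2} are not meaningful).

```python
import numpy as np

def re2_estimate(z, psi, mu, nu, K):
    # z: (N, d) seeds assumed iid from nu; psi: one measure-preserving integrator step,
    # so that d mu^{psi_k^{-1}}/d nu (z) = mu(psi^k(z)) / nu(z); [K] is read as {1,...,K}.
    r0, rs, zk = mu(z) / nu(z), [], z.copy()
    for _ in range(K):
        zk = psi(zk)
        rs.append(mu(zk) / nu(z))
    rs = np.stack(rs)                          # (K, N) density ratios along the snippets
    s = rs.sum(axis=0)                         # sum_k d mu^{psi_k^{-1}}/d nu (Z^i)
    ends = r0 + rs[-1]                         # both-ends weights entering hat{mu}_2
    return (4.0 * (s ** 2).mean() / K ** 2 - 2.0) / ((ends ** 2).mean() - 2.0)
```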

Proof of Theorem 18.

We apply Lemma 21 with

A(Z1,,ZN)=1Ni=1Nμ¯/ν(fZi)B(Z1,,ZN)=1Nj=1Nμ¯/ν(1Zj).A(Z^{1},\ldots,Z^{N})=\frac{1}{N}\sum_{i=1}^{N}\bar{\mu}_{/\nu}(f\mid Z^{i})\quad B(Z^{1},\ldots,Z^{N})=\frac{1}{N}\sum_{j=1}^{N}\bar{\mu}_{/\nu}(\mathrm{1}\mid Z^{j})\,.

With ZiiidνZ^{i}\overset{{\rm iid}}{\sim}\nu, we have directly

a=𝔼(A)=𝔼(μ¯/ν(fZ))=μ(f)b=𝔼(B)=1,a=\mathbb{E}\big{(}A\big{)}=\mathbb{E}\big{(}\bar{\mu}_{/\nu}(f\mid Z)\big{)}=\mu(f)\quad b=\mathbb{E}\big{(}B\big{)}=1\,,

and

var(B)\displaystyle{\rm var}\big{(}B\big{)} =1Nvar(μ¯/ν(1Zj))\displaystyle=\frac{1}{N}{\rm var}\big{(}\bar{\mu}_{/\nu}(1\mid Z^{j})\big{)}
=1K2N{𝔼[(kKdμψk1/dν(Z))2]K2}.\displaystyle=\frac{1}{K^{2}N}\left\{\mathbb{E}\Bigl{[}\bigl{(}{\textstyle\sum}_{k\in\llbracket K\rrbracket}{\rm d}\mu^{\psi_{k}^{-1}}/{\rm d}\nu(Z)\bigr{)}^{2}\Bigr{]}-K^{2}\right\}\,.

Now with f<\|f\|_{\infty}<\infty

var(A)\displaystyle{\rm var}\big{(}A\big{)} =1Nvar(μ¯/ν(fZ))\displaystyle=\frac{1}{N}{\rm var}\big{(}\bar{\mu}_{/\nu}(f\mid Z)\big{)}
\displaystyle\leq\frac{1}{K^{2}N}\Bigl{\{}\|f\|_{\infty}^{2}\mathbb{E}\Bigl{[}\bigl{(}{\textstyle\sum}_{k\in\llbracket K\rrbracket}{\rm d}\mu^{\psi_{k}^{-1}}/{\rm d}\nu(Z)\bigr{)}^{2}\Bigr{]}-K^{2}\mu(f)^{2}\Bigr{\}}
f2K2N𝔼[(kKdμψk1/dν(Z))2].\displaystyle\leq\frac{\|f\|_{\infty}^{2}}{K^{2}N}\mathbb{E}\Bigl{[}\bigl{(}{\textstyle\sum}_{k\in\llbracket K\rrbracket}{\rm d}\mu^{\psi_{k}^{-1}}/{\rm d}\nu(Z)\bigr{)}^{2}\Bigr{]}\,.

We conclude noting that we have |A/B|f|A/B|\leq\|f\|_{\infty}. ∎

For clarity we reproduce the very useful lemma [38, Lemma 6], correcting a couple of minor typos along the way.

Lemma 21.

Let A,BA,B be two integrable random variables satisfying |A/B|M|A/B|\leq M almost surely for some M>0M>0 and let a=𝔼(A)a=\mathbb{E}(A) and b=𝔼(B)0b=\mathbb{E}(B)\neq 0. Then

var(A/B)𝔼[|A/Ba/b|2]2b2{var(A)+M2var(B)},\displaystyle{\rm var}(A/B)\leq\mathbb{E}[|A/B-a/b|^{2}]\leq\frac{2}{b^{2}}\Bigl{\{}{\rm var}(A)+M^{2}{\rm var}(B)\Bigr{\}}\,,
|𝔼[A/B]a/b|var(A/B)var(B)b.\displaystyle|\mathbb{E}[A/B]-a/b|\leq\frac{\sqrt{{\rm var}(A/B){\rm var}(B)}}{b}\,.

4.2.2 Optimal weights

As mentioned in Subsection 2.3 it is possible to consider the more general scenario where unequal probability weights ω={ωk,k0,T:k=0Tωk=1}\omega=\{\omega_{k}\in\mathbb{R},k\in\llbracket 0,T\rrbracket\colon\sum_{k=0}^{T}\omega_{k}=1\} are ascribed to the elements of the snippets, yielding the estimator

μ^ω(f)\displaystyle\hat{\mu}_{\omega}(f) :=i=1Nk=0Tωkdμψk1dν(Zi)fψk(Zi)j=1Nl=0Tωldμψl1dν(Zj),\displaystyle:=\frac{\sum_{i=1}^{N}\sum_{k=0}^{T}\omega_{k}\frac{{\rm d}\mu^{\psi_{k}^{-1}}}{{\rm d}\nu}(Z^{i})f\circ\psi_{k}(Z^{i})}{\sum_{j=1}^{N}\sum_{l=0}^{T}\omega_{l}\frac{{\rm d}\mu^{\psi_{l}^{-1}}}{{\rm d}\nu}(Z^{j})}\,,

and a natural question is that of the optimal choice of \omega. Note that in the context of PISA the condition \omega_{k}\geq 0 is not required, as suggested by the justification of the identity (30). However it should be clear that this condition should be enforced in the context of integrator snippet SMC, since the probabilistic interpretation is otherwise lost, or when the expectation to be estimated is known to be non-negative. Here we discuss optimization of the variance upper bound provided by Lemma 21,

2f2N{var(kKωkdμψk1dν(Z)fψkf(Z))+\displaystyle\frac{2\|f\|_{\infty}^{2}}{N}\Bigl{\{}{\rm var}\biggl{(}{\textstyle\sum_{k\in\llbracket K\rrbracket}}\omega_{k}\frac{{\rm d}\mu^{\psi_{k}^{-1}}}{{\rm d}\nu}(Z)\frac{f\circ\psi_{k}}{\|f\|_{\infty}}(Z)\biggr{)}+ var(kKωkdμψk1dν(Z))}\displaystyle{\rm var}\biggl{(}{\textstyle\sum_{k\in\llbracket K\rrbracket}}\omega_{k}\frac{{\rm d}\mu^{\psi_{k}^{-1}}}{{\rm d}\nu}(Z)\biggr{)}\Bigr{\}}
2f2N\displaystyle\leq\frac{2\|f\|_{\infty}^{2}}{N} ω(Σ(f;ψ)+Σψ(1;ψ))ω\displaystyle\omega^{\top}\big{(}\Sigma(f;\psi)+\Sigma_{\psi}(1;\psi)\big{)}\omega

where for k,lKk,l\in\llbracket K\rrbracket,

Σkl(f;ψ)=𝔼[dμψk1dν(Z)fψkf(Z)dμψl1dν(Z)fψlf(Z)]μ(f/f)2.\Sigma_{kl}(f;\psi)=\mathbb{E}\biggl{[}\frac{{\rm d}\mu^{\psi_{k}^{-1}}}{{\rm d}\nu}(Z)\frac{f\circ\psi_{k}}{\|f\|_{\infty}}(Z)\frac{{\rm d}\mu^{\psi_{l}^{-1}}}{{\rm d}\nu}(Z)\frac{f\circ\psi_{l}}{\|f\|_{\infty}}(Z)\biggr{]}-\mu(f/\|f\|_{\infty})^{2}\,.

It is a classical result that \omega^{\top}\big{(}\Sigma(f;\psi)+\Sigma_{\psi}(1;\psi)\big{)}\omega\geq\lambda_{\min}\big{(}\Sigma(f;\psi)+\Sigma_{\psi}(1;\psi)\big{)}\omega^{\top}\omega and that this lower bound is reached for the eigenvector(s) \omega_{{\rm min}} corresponding to the smallest eigenvalue \lambda_{\min}\big{(}\Sigma(f;\psi)+\Sigma_{\psi}(1;\psi)\big{)} of \Sigma(f;\psi)+\Sigma_{\psi}(1;\psi). If the constraint of non-negative entries is to be enforced then a standard quadratic programming procedure can be used, as sketched below. The same ideas can be applied to the function-independent upper bounds used in our definitions of efficiency.
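As an illustration of the quadratic programming step mentioned above, the sketch below minimises \omega^{\top}\big(\Sigma(f;\psi)+\Sigma_{\psi}(1;\psi)\big)\omega over the simplex \{\omega\colon\sum_{k}\omega_{k}=1,\omega_{k}\geq 0\} using scipy; the matrix A is assumed to have been estimated beforehand from samples.

```python
import numpy as np
from scipy.optimize import minimize

def optimal_weights(A):
    # A = Sigma(f; psi) + Sigma(1; psi), a symmetric (T+1) x (T+1) matrix estimated from samples
    T1 = A.shape[0]
    w0 = np.full(T1, 1.0 / T1)                 # start from uniform weights
    res = minimize(lambda w: w @ A @ w, w0,
                   jac=lambda w: 2.0 * A @ w,
                   method="SLSQP",
                   bounds=[(0.0, None)] * T1,
                   constraints={"type": "eq", "fun": lambda w: w.sum() - 1.0})
    return res.x
```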

4.3 More on variance reduction and optimal flow

We now focus on some properties of the estimator \hat{\mu}(f) of \mu(f) in (16). To facilitate the analysis and later developments we consider the scenario where t\mapsto\psi_{t}(z) is the flow solution of an ODE, and assume that the dominating measure \upsilon\gg\mu is invariant under the flow. We consider

μ¯(dt,dz)=1Tμψt(dz)𝟏{0tT}dt,\bar{\mu}({\rm d}t,{\rm d}z)=\frac{1}{T}\mu^{\psi_{-t}}({\rm d}z)\mathbf{1}\{0\leq t\leq T\}{\rm d}t\,,

and notice that, similarly to the integrator scenario, for any f\colon\mathsf{Z}\rightarrow\mathbb{R} and \bar{f}(t,z):=f\circ\psi_{t}(z), we have \mathbb{E}_{\bar{\mu}}\left(\bar{\mu}(\bar{f}\mid\check{Z})\right)=\bar{\mu}(\bar{f})=\mu(f), where for any z\in\mathsf{Z}

μ¯(f¯z)=1T0Tfψt(z)μψt(z)μψu(z)dudt\bar{\mu}(\bar{f}\mid z)=\frac{1}{T}\int_{0}^{T}f\circ\psi_{t}(z)\frac{\mu\circ\psi_{t}(z)}{\int\mu\circ\psi_{u}(z){\rm d}u}{\rm d}t

the following estimator of μ(f)\mu(f) is considered, with Zˇiiidμ¯\check{Z}_{i}\overset{{\rm iid}}{\sim}\bar{\mu}

μ^(f)=1Ni=1Nμ¯(f¯(T,Zˇi)Zˇi).\hat{\mu}(f)=\frac{1}{N}\sum_{i=1}^{N}\bar{\mu}\big{(}\bar{f}(\mathrm{T},\check{Z}_{i})\mid\check{Z}_{i}\big{)}\,.

We now show that in the examples considered in this paper our approach can be understood as implementing unbiased Riemann sum approximations of line integrals along contours. Adopting a continuous time approach can be justified as follows. For numerous integrators the conditions of the following proposition are satisfied; this is the case for the leapfrog integrator of Hamilton's equations, see e.g. [20, 38, Theorem 3.4] and [38, Appendix 3.1, Theorem 9] for detailed results.

Proposition 22.

Let \tau>0, for any z\in\mathsf{Z} let [0,\tau]\ni t\mapsto\psi_{t}(z) be a flow and for \epsilon>0 let \{\psi^{k}(z;\epsilon),0\leq k\epsilon\leq\tau\} be a discretization of t\mapsto\psi_{t} such that for any z\in\mathsf{Z} there exists C>0 such that for any (k,\epsilon)\in\mathbb{N}\times\mathbb{R}_{+}

|ψk(z;ϵ)ψkϵ(z)|Cϵ2.|\psi^{k}(z;\epsilon)-\psi_{k\epsilon}(z)|\leq C\epsilon^{2}\,.

Then for any continuous g:𝖹g\colon\mathsf{Z}\rightarrow\mathbb{R} such that the Riemann integral

I(g):=0τgψt(z)dt,I(g):=\int_{0}^{\tau}g\circ\psi_{t}(z){\rm{\rm d}}t\,,

exists we have

\lim_{n\rightarrow\infty}\frac{\tau}{n}\sum_{k=0}^{n-1}g\circ\psi^{k}(z;\tau/n)=I(g)\,.
Proof.

We have

\lim_{n\rightarrow\infty}\frac{\tau}{n}\sum_{k=0}^{n-1}g\circ\psi_{k\tau/n}(z)=\int_{0}^{\tau}g\circ\psi_{t}(z){\rm d}t
\lim_{n\rightarrow\infty}\frac{\tau}{n}\biggl{|}\sum_{k=0}^{n-1}g\circ\psi_{k\tau/n}(z)-g\circ\psi^{k}(z;\tau/n)\biggr{|}=0\,,

and we can conclude. ∎

Remark 23.

Naturally convergence is uniform in z\in\mathsf{Z} under additional assumptions and we note that in some scenarios the dependence on \tau can be characterized, allowing in principle \tau to grow with n.
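As a toy numerical check of Proposition 22, consider the harmonic oscillator H(x,v)=(x^{2}+v^{2})/2, whose exact flow is a rotation and for which the leapfrog integrator satisfies the stated assumption; the sketch below, with illustrative choices of g, z and \tau, shows the discretised line integral converging to I(g).

```python
import numpy as np

def leapfrog(z, eps):
    # one leapfrog step for H(x, v) = (x^2 + v^2) / 2
    x, v = z
    v -= 0.5 * eps * x
    x += eps * v
    v -= 0.5 * eps * x
    return np.array([x, v])

def exact_line_integral(z0, tau, g, m=200000):
    # reference value of I(g) = int_0^tau g(psi_t(z0)) dt, using the exact rotation flow
    t = np.linspace(0.0, tau, m)
    x = z0[0] * np.cos(t) + z0[1] * np.sin(t)
    v = -z0[0] * np.sin(t) + z0[1] * np.cos(t)
    return np.trapz(g(x, v), t)

def riemann_sum(z0, tau, g, n):
    # (tau / n) * sum_k g(psi^k(z0; tau/n)), the discretised line integral of Proposition 22
    eps, z, acc = tau / n, np.array(z0, dtype=float), 0.0
    for _ in range(n):
        acc += g(z[0], z[1])
        z = leapfrog(z, eps)
    return eps * acc

g = lambda x, v: x ** 2
z0, tau = (1.0, 0.0), 2.0 * np.pi
print(exact_line_integral(np.array(z0), tau, g))     # I(g) = pi here
for n in (10, 100, 1000):
    print(n, riemann_sum(z0, tau, g, n))
```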

4.3.1 Hamiltonian contour decomposition

Assume μ\mu has a probability density with respect to the Lebesgue measure and let ζ:𝖹d\zeta\colon\mathsf{Z}\subset\mathbb{R}^{d}\rightarrow\mathbb{R} be Lipschitz continuous such that ζ(z)0\nabla\zeta(z)\neq 0 for all z𝖹z\in\mathsf{Z}, then the co-area formula states that

𝖹f(z)μ(z)dz=[ζ1(s)f(z)|ζ(z)|1μ(z)d1(dz)]ds,\int_{\mathsf{Z}}f(z)\mu(z){\rm d}z=\int\big{[}\int_{\zeta^{-1}(s)}f(z)|\nabla\zeta(z)|^{-1}\mu(z)\mathcal{H}_{d-1}({\rm d}z)\big{]}{\rm d}s,

where d1\mathcal{H}_{d-1} is the (d1)(d-1)-dimensional Hausdorff measure, used here to measure length along the contours ζ1(s)\zeta^{-1}(s). For example in the HMC setup where μ(z)exp(H(z))\mu(z)\propto\exp\big{(}-H(z)\big{)} and z=(x,v)z=(x,v) one may choose ζ(z)=H(z)=log(μ(z))\zeta(z)=H(z)=-\log\big{(}\mu(z)\big{)}, leading to a decomposition of the expectation μ(f)\mu(f) according to equi-energy contours of H(z)H(z)

μ(f)=exp(s)[ζ1(s)f(z)|H(z)|1d1(dz)]ds.\mu(f)=\int\exp(-s)\big{[}\int_{\zeta^{-1}(s)}f(z)|\nabla H(z)|^{-1}\mathcal{H}_{d-1}({\rm d}z)\big{]}{\rm d}s.

We now show how the solutions of Hamilton's equations could be used as the basis for estimators of \mu(f) mixing Riemann sum-like and Monte Carlo estimation techniques.

Favourable scenario where d=2d=2.

Let s\in\mathbb{R} such that H^{-1}(s)\neq\emptyset and assume that for some (x_{0},v_{0})\in H^{-1}(s) Hamilton's equations \dot{x}=\nabla_{v}H(x,v) and \dot{v}=-\nabla_{x}H(x,v) have solutions t\mapsto(x_{t},v_{t})=\psi_{t}(z_{0})\in H^{-1}(s) with \big{(}x_{\tau(s)},v_{\tau(s)}\big{)}=(x_{0},v_{0})=z_{0} for some \tau(s)>0, that is the contours H^{-1}(s) can be parametrised by the solutions of Hamilton's equations at the corresponding level.

Proposition 24.

We have

𝖹f(z)μ(dz)=𝖹[[τH(z0)]10τH(z0)f(zt)dt]μ(dz0),\int_{\mathsf{Z}}f(z)\mu({\rm d}z)=\int_{\mathsf{Z}}\left[[\tau\circ H(z_{0})]^{-1}\int_{0}^{\tau\circ H(z_{0})}f(z_{t}){\rm d}t\right]\mu({\rm d}z_{0})\,,

implying that, assuming the integral along the path tztt\mapsto z_{t} is tractable,

[τH(z0)]10τH(z0)f(zt)dtforz0μ,[\tau\circ H(z_{0})]^{-1}\int_{0}^{\tau\circ H(z_{0})}f(z_{t}){\rm d}t\,\text{for}\,z_{0}\sim\mu\,, (36)

is an unbiased estimator of μ(f)\mu(f).

Proof.

Since H^{-1}(s) is rectifiable we have, from [9, Theorem 2.6.2], that for g\colon\mathsf{Z}\rightarrow\mathbb{R},

H1(s)g(z)d1(dz)=0τ(s)g(zt)|H(zt)|dt.\int_{H^{-1}(s)}g(z)\mathcal{H}_{d-1}({\rm d}z)=\int_{0}^{\tau(s)}g(z_{t})|\nabla H(z_{t})|{\rm d}t\,.

Consequently

𝖹f(z)μ(dz)\displaystyle\int_{\mathsf{Z}}f(z)\mu({\rm d}z) =τ(s)exp(s)[τ(s)10τ(s)f(zs,t)dt]ds,\displaystyle=\int\tau(s)\exp(-s)\left[\tau(s)^{-1}\int_{0}^{\tau(s)}f(z_{s,t}){\rm d}t\right]{\rm d}s\,, (37)

where the notation now reflects dependence on ss of the flow and notice that since

H^{-1}(s)\ni z_{0}\mapsto[\tau\circ H(z_{0})]^{-1}\int_{0}^{\tau\circ H(z_{0})}f(z_{t})\,{\rm d}t

is constant, z_{0,s}\in H^{-1}(s) can be arbitrary, since indeed from (37)

𝖹𝟏{H(z)A}μ(dz)=\displaystyle\int_{\mathsf{Z}}\mathbf{1}\{H(z)\in A\}\mu({\rm d}z)= τ(s)exp(s)[τ(s)10τ(s)𝟏{sA}dt]ds\displaystyle\int\tau(s)\exp(-s)\left[\tau(s)^{-1}\int_{0}^{\tau(s)}\mathbf{1}\{s\in A\}{\rm d}t\right]{\rm d}s
=\displaystyle= τ(s)exp(s)𝟏{sA}ds.\displaystyle\int\tau(s)\exp(-s)\mathbf{1}\{s\in A\}{\rm d}s.

Note that we can write, since for s\in\mathbb{R}, H^{-1}(s)\ni z_{0}\mapsto[\tau\circ H(z_{0})]^{-1}\int_{0}^{\tau\circ H(z_{0})}f(z_{t})\,{\rm d}t is constant,

𝖹[[τH(z0)]10τH(z0)f(zt)dt]μ(dz0)\displaystyle\int_{\mathsf{Z}}\left[[\tau\circ H(z_{0})]^{-1}\int_{0}^{\tau\circ H(z_{0})}f(z_{t}){\rm d}t\right]\mu({\rm d}z_{0})
=\int\tau(s)\exp(-s)\left\{\tau(s)^{-1}\int_{0}^{\tau(s)}\left[\big{(}\tau\circ H(z_{s,0})\big{)}^{-1}\int_{0}^{\tau\circ H(z_{s,0})}f(z_{s,t})\,{\rm d}t\right]{\rm d}u\right\}{\rm d}s
=\int\tau(s)\exp(-s)\left[\tau(s)^{-1}\int_{0}^{\tau(s)}f(z_{s,t})\,{\rm d}t\right]{\rm d}s\,.

A remarkable point is that the strategy developed in this manuscript provides a general methodology to implement numerically the ideas underpinning the decomposition of Proposition 24 by using the estimator μˇ(f)\check{\mu}(f) in (15) and invoking Proposition 22, assuming sτ(s)s\mapsto\tau(s) to be known. This point is valid outside of the SMC framework and it is worth pointing out that μˇ(f)\check{\mu}(f) in (15) is unbiased if the samples used are marginally from μ¯\bar{\mu}.

Note the promise of dimension-free estimators if the one-dimensional line integrals in (36) were easily computed and sampling from the one-dimensional energy distribution was routine; the scenario d\geq 3 is however more subtle.
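As a toy illustration of Proposition 24, take d=2 with H(x,v)=(x^{2}+v^{2})/2, so that every contour is a circle of period \tau(s)=2\pi and the flow is an explicit rotation; the sketch below, with our own choice f(x,v)=x^{4}, computes the line averages (36) by quadrature and recovers \mu(f)=3.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x, v: x ** 4                          # illustrative test function; E_mu[X^4] = 3

def contour_average(z0, m=512):
    # [tau o H(z0)]^{-1} int_0^{tau o H(z0)} f(z_t) dt; here tau(s) = 2*pi for every contour
    t = np.linspace(0.0, 2.0 * np.pi, m, endpoint=False)
    x = z0[0] * np.cos(t) + z0[1] * np.sin(t)    # exact Hamiltonian flow on the circle
    v = -z0[0] * np.sin(t) + z0[1] * np.cos(t)
    return f(x, v).mean()

z0s = rng.standard_normal((10000, 2))            # z_0 ~ mu = N(0, I_2), i.e. exp(-H) normalised
print(np.mean([contour_average(z) for z in z0s]))  # unbiased estimate of mu(f) = 3
```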

General scenario

In the scenario where d\geq 3 the co-area decomposition still holds, but the solutions of Hamilton's equations can in general not be used to compute integrals over the hypersurface H^{-1}(s). This would require a form of ergodicity [35, 40]: for z_{0}\in H^{-1}(s),

limτ1τ0τfψt(z0)dt=f¯(z0)=ζ1(s)f(z)|H(z)|1d1(dz)ζ1(s)|H(z)|1d1(dz),\lim_{\tau\rightarrow\infty}\frac{1}{\tau}\int_{0}^{\tau}f\circ\psi_{t}(z_{0}){\rm d}t=\bar{f}(z_{0})=\frac{\int_{\zeta^{-1}(s)}f(z)|\nabla H(z)|^{-1}\mathcal{H}_{d-1}({\rm d}z)}{\int_{\zeta^{-1}(s)}|\nabla H(z)|^{-1}\mathcal{H}_{d-1}({\rm d}z)}\,,

where the limit always exists in the {\rm L}^{2}(\mu) sense, this being von Neumann's mean ergodic theorem [41, 35], and the rightmost equality forms Boltzmann's conjecture. An interesting property in the present context is that \mathbb{E}_{\bar{\mu}}\bigl{(}\bar{f}(Z)\bigr{)}=\mathbb{E}_{\mu}\bigl{(}f(Z)\bigr{)} for f\in\mathrm{L}^{2}(\mu) and one could replicate the variance reduction properties developed earlier. Boltzmann's conjecture has long been disproved, as numerous Hamiltonians can be shown not to lead to ergodic systems, although some sub-classes do. However a weaker, or local, form of ergodicity can hold on the sets of a partition of \mathsf{Z}.

Example 25 (Double well potential).

Consider the scenario where 𝖷=𝖵=\mathsf{X}=\mathsf{V}=\mathbb{R}, U(x)=(x21)2U(x)=(x^{2}-1)^{2} and kinetic energy v2v^{2} [34]. Elementary manipulations show that satisfying Hamilton’s equations (2) imposes tH(xt,x˙t)=(xt21)2+x˙t2=s>0t\mapsto H(x_{t},\dot{x}_{t})=(x_{t}^{2}-1)^{2}+\dot{x}_{t}^{2}=s>0 and therefore requires txt[1+s,1s][1s,1+s]t\mapsto x_{t}\in[-\sqrt{1+\sqrt{s}},-\sqrt{1-\sqrt{s}}]\cup[\sqrt{1-\sqrt{s}},\sqrt{1+\sqrt{s}}] – importantly the intervals are not connected for s<1s<1. Rearranging terms any solution of (2) must satisfy

x˙t\displaystyle\dot{x}_{t} =±s(xt21)2,\displaystyle=\pm\sqrt{s-(x_{t}^{2}-1)^{2}}\,,

that is the velocity is a function of the position in the double well, maximal for x_{t}^{2}=1, vanishing as x_{t}^{2}\rightarrow 1\pm\sqrt{s}, with a sign flip at the endpoints of the intervals. Therefore the system is not ergodic, but ergodicity trivially occurs in each well.

In general this partitioning of 𝖹\mathsf{Z} can be intricate but it should be clear that in principle this could reduce variability of an estimator. In the toy Example 25, a purely mathematical algorithm inspired by the discussions above would choose the right or left well with probability 1/21/2 and then integrate deterministically, producing samples taking at most two values which could be averaged to compute μ(f)\mu(f). We further note that in our context a relevant result would be concerned with the limit, for any x0𝖷x_{0}\in\mathsf{X},

limτ[1τ0τfψt(x0,ρ𝐞)dt]ϖ𝐞(d𝐞),\lim_{\tau\rightarrow\infty}\int\big{[}\frac{1}{\tau}\int_{0}^{\tau}f\circ\psi_{t}(x_{0},\rho\mathbf{e}){\rm d}t\big{]}\varpi_{\mathbf{e}}({\rm d}\mathbf{e})\,,

where we have used the polar reparametrization v=\rho\mathbf{e} with (\rho,\mathbf{e})\in\mathbb{R}_{+}\times\mathcal{S}_{d-1}, and whether a marginal version of Boltzmann's conjecture holds for this limit. Example 25 indicates that this is not true in general, but the cardinality of the partition of \mathsf{X} may be reduced.

4.3.2 Advantage of averaging and control variate interpretation

Consider the scenario where (\mathsf{T},\check{Z})\sim\bar{\mu}({\rm d}t,{\rm d}z)=\frac{1}{\tau}\mu^{\psi_{-t}}({\rm d}z)\mathbf{1}\{0\leq t\leq\tau\}{\rm d}t, where [0,\tau]\ni t\mapsto\psi_{t}, for some \tau>0, is the flow solution of an ODE of the form \dot{z}_{t}=F\circ z_{t} for some field F\colon\mathsf{Z}\rightarrow\mathsf{Z}. For f\colon\mathsf{Z}\rightarrow\mathbb{R} we have

varμ(f(Z))=varμ¯(𝔼μ¯(f¯(𝖳,Zˇ)Zˇ))+𝔼μ¯(varμ¯(f¯(𝖳,Zˇ)Zˇ)).{\rm var}_{\mu}\left(f(Z)\right)={\rm var}_{\bar{\mu}}\left(\mathbb{E}_{\bar{\mu}}\big{(}\bar{f}(\mathsf{T},\check{Z})\mid\check{Z}\big{)}\right)+\mathbb{E}_{\bar{\mu}}\big{(}{\rm var}_{\bar{\mu}}(\bar{f}(\mathsf{T},\check{Z})\mid\check{Z})\big{)}\,.

and we are interested in determining t\mapsto\psi_{t} (or F) in order to minimize {\rm var}_{\bar{\mu}}\left(\mathbb{E}_{\bar{\mu}}\big{(}\bar{f}(\mathsf{T},\check{Z})\mid\check{Z}\big{)}\right) and maximize the improvement over simple Monte Carlo. We recall that the Jacobian determinant of the flow t\mapsto\psi_{t}(z) is given by [19, p. 174108-5]

J_{t}(z)=|\det\big{(}\nabla\otimes\psi_{t}(z)\big{)}|=\exp\left(\int_{0}^{t}(\nabla\cdot F)\circ\psi_{u}(z)\,{\rm d}u\right)=:\mathcal{J}\circ\psi_{t}(z)\,.
Lemma 26.

Let tψtt\mapsto\psi_{t} be a flow solution of z˙t=Fzt\dot{z}_{t}=F\circ z_{t} and assume that with υ\upsilon the Lebesgue measure, υμ\upsilon\gg\mu. Then

𝔼μ¯(𝔼μ¯(f¯(𝖳,Zˇ)Zˇ)2)\displaystyle\mathbb{E}_{\bar{\mu}}\left(\mathbb{E}_{\bar{\mu}}\big{(}\bar{f}(\mathsf{T},\check{Z})\mid\check{Z}\big{)}^{2}\right) =20τ{0τu[μ¯ψt(z)]1dt}{μ(dz)f(z)fψu(z)μψu(z)𝒥ψu(z)}du\displaystyle=2\int_{0}^{\tau}\left\{\int_{0}^{\tau-u}[\bar{\mu}\circ\psi_{-t}(z)]^{-1}{\rm d}t\right\}\Bigl{\{}\int\mu({\rm d}z)f(z)f\circ\psi_{u}(z)\mu\circ\psi_{u}(z)\mathcal{J}\circ\psi_{u}(z)\Bigr{\}}{\rm d}u
2{0τ[μ¯ψt(z)]1dt}0τ{μ(dz)f(z)fψu(z)μψu(z)𝒥ψu(z)}du,\displaystyle\leq 2\left\{\int_{0}^{\tau}[\bar{\mu}\circ\psi_{-t}(z)]^{-1}{\rm d}t\right\}\int_{0}^{\tau}\Bigl{\{}\int\mu({\rm d}z)f(z)f\circ\psi_{u}(z)\mu\circ\psi_{u}(z)\mathcal{J}\circ\psi_{u}(z)\Bigr{\}}{\rm d}u\,,

with

μ¯ψt(z)\displaystyle\bar{\mu}\circ\psi_{-t}(z) =tτtμψu(z)𝒥ψu(z)du.\displaystyle=\int_{-t}^{\tau-t}\mu\circ\psi_{u}(z)\,\mathcal{J}\circ\psi_{u}(z){\rm d}u\,.

In particular for tψtt\mapsto\psi_{t} the flow solution of Hamilton’s equations associated with μ\mu we have

𝔼μ¯(𝔼μ¯(f¯(𝖳,Zˇ)Zˇ)2)\displaystyle\mathbb{E}_{\bar{\mu}}\left(\mathbb{E}_{\bar{\mu}}\big{(}\bar{f}(\mathsf{T},\check{Z})\mid\check{Z}\big{)}^{2}\right) =2(1t/τ)μ(dz)f(z)fψt(z)dt\displaystyle=2\int\big{(}1-t/\tau\big{)}\int\mu({\rm d}z)f(z)f\circ\psi_{t}(z){\rm d}t\,\cdot
Proof.

Using Fubini’s theorem we have

{0τfψt(z)dμψtdμ¯(z)dt}2μ¯(dz)\displaystyle\int\left\{\int_{0}^{\tau}f\circ\psi_{t}(z)\frac{{\rm d}\mu^{\psi_{-t}}}{{\rm d}\bar{\mu}}(z){\rm d}t\right\}^{2}\bar{\mu}({\rm d}z) =\displaystyle=
=20τ0τ𝟏{tt}μ¯(dz)\displaystyle=2\int_{0}^{\tau}\int_{0}^{\tau}\mathbf{1}\{t^{\prime}\geq t\}\int\bar{\mu}({\rm d}z) fψt(z)dμψtdμ¯(z)fψtt+t(z)dμψtdμ¯ψtψt(z)dtdt\displaystyle f\circ\psi_{t}(z)\frac{{\rm d}\mu^{\psi_{-t}}}{{\rm d}\bar{\mu}}(z)f\circ\psi_{t^{\prime}-t+t}(z)\frac{{\rm d}\mu^{\psi_{-t^{\prime}}}}{{\rm d}\bar{\mu}}\circ\psi_{-t}\circ\psi_{t}(z){\rm d}t^{\prime}{\rm d}t
=20τ0τ𝟏{tt}μ(dz)\displaystyle=2\int_{0}^{\tau}\int_{0}^{\tau}\mathbf{1}\{t^{\prime}\geq t\}\int\mu({\rm d}z) f(z)fψtt(z)dμψtdμ¯ψt(z)dtdt.\displaystyle f(z)f\circ\psi_{t^{\prime}-t}(z)\frac{{\rm d}\mu^{\psi_{-t^{\prime}}}}{{\rm d}\bar{\mu}}\circ\psi_{-t}(z){\rm d}t^{\prime}{\rm d}t\,.

Further, we have

dμψtdμ¯(z)\displaystyle\frac{{\rm d}\mu^{\psi_{-t^{\prime}}}}{{\rm d}\bar{\mu}}(z) =μψt(z)𝒥ψt(z)μ¯(z),\displaystyle=\frac{\mu\circ\psi_{t^{\prime}}(z)\,\mathcal{J}\circ\psi_{t^{\prime}}(z)}{\bar{\mu}(z)}\,,

with μ¯(z)=0τμψu(z)𝒥ψu(z)du\bar{\mu}(z)=\int_{0}^{\tau}\mu\circ\psi_{u}(z)\,\mathcal{J}\circ\psi_{u}(z){\rm d}u. It is straightforward that

μ¯ψt(z)\displaystyle\bar{\mu}\circ\psi_{-t}(z) =0τμψut(z)𝒥ψut(z)du\displaystyle=\int_{0}^{\tau}\mu\circ\psi_{u-t}(z)\,\mathcal{J}\circ\psi_{u-t}(z){\rm d}u
=tτtμψu(z)𝒥ψu(z)du.\displaystyle=\int_{-t}^{\tau-t}\mu\circ\psi_{u^{\prime}}(z)\,\mathcal{J}\circ\psi_{u^{\prime}}(z){\rm d}u^{\prime}\,.

Consequently

{0τfψt(z)dμψtdμ¯(z)dt}2μ¯(dz)=\displaystyle\int\left\{\int_{0}^{\tau}f\circ\psi_{t}(z)\frac{{\rm d}\mu^{\psi_{-t}}}{{\rm d}\bar{\mu}}(z){\rm d}t\right\}^{2}\bar{\mu}({\rm d}z)=
=20τ0τ\displaystyle=2\int_{0}^{\tau}\int_{0}^{\tau} 𝟏{τtu0}μ¯ψt(z)μ(dz)f(z)fψu(z)μψu(z)𝒥ψu(z)dudt\displaystyle\frac{\mathbf{1}\{\tau-t\geq u\geq 0\}}{\bar{\mu}\circ\psi_{-t}(z)}\int\mu({\rm d}z)f(z)f\circ\psi_{u}(z)\mu\circ\psi_{u}(z)\mathcal{J}\circ\psi_{u}(z){\rm d}u{\rm d}t
=20τ\displaystyle=2\int_{0}^{\tau} {0τu[μ¯ψt(z)]1dt}μ(dz)f(z)fψu(z)μψu(z)𝒥ψu(z)du\displaystyle\left\{\int_{0}^{\tau-u}[\bar{\mu}\circ\psi_{-t}(z)]^{-1}{\rm d}t\right\}\int\mu({\rm d}z)f(z)f\circ\psi_{u}(z)\mu\circ\psi_{u}(z)\mathcal{J}\circ\psi_{u}(z){\rm d}u

For the second statement we have μ¯ψt(z)=τμ(z)\bar{\mu}\circ\psi_{-t}(z)=\tau\,\mu(z) and

{0τfψt(z)dμψtdμ¯(z)dt}2μ¯(dz)=\displaystyle\int\left\{\int_{0}^{\tau}f\circ\psi_{t}(z)\frac{{\rm d}\mu^{\psi_{-t}}}{{\rm d}\bar{\mu}}(z){\rm d}t\right\}^{2}\bar{\mu}({\rm d}z)=
\displaystyle=2\int_{0}^{\tau}\bigl{(}1-u/\tau\bigr{)}\int\mu({\rm d}z)f(z)f\circ\psi_{u}(z)\,{\rm d}u

which is akin to the integrated correlation time encountered in MCMC. ∎

This is somewhat reminiscent of what is advocated in the literature in the context of HMC or randomized HMC where the integration time tt (or TT when using an integrator) is randomized [29, 21].

Example 27.

Consider the Gaussian scenario where 𝖷=𝖵=d\mathsf{X}=\mathsf{V}=\mathbb{R}^{d}, π(x)=𝒩(x;0,Σ)\pi(x)=\mathcal{N}(x;0,\Sigma) with diagonal covariance matrix such that for idi\in\llbracket d\rrbracket, Σii=σi2\Sigma_{ii}=\sigma_{i}^{2} and ϖ(v)=𝒩(v;0,Id)\varpi(v)=\mathcal{N}(v;0,{\rm Id}). Using the reparametrization (x0(i),v0(i))=(aiσisin(ϕi),aicosϕi)(x_{0}(i),v_{0}(i))=(a_{i}\sigma_{i}\sin(\phi_{i}),a_{i}\cos\phi_{i}) the solution of Hamilton’s equations is given for idi\in\llbracket d\rrbracket by

\displaystyle x_{t}(i)=a_{i}\sigma_{i}\sin(t/\sigma_{i}+\phi_{i})
vt(i)\displaystyle v_{t}(i) =aicos(t/σi+ϕi).\displaystyle=a_{i}\cos(t/\sigma_{i}+\phi_{i})\,.

We have for each component \mathbb{E}_{\mu}[X_{0}(i)X_{t}(i)]=\sigma_{i}^{2}\cos(t/\sigma_{i}), which clearly does not vanish as t, or \tau, increases: this is a particularly negative result for standard HMC, where only the final state of the computed integrator orbit is used. Worse, as noted early on in [29], this implies that for heterogeneous values \{\sigma_{i}^{2},i\in\llbracket d\rrbracket\} no integration time may be suitable simultaneously for all coordinates. This motivated the introduction of random integration times [29], which leads to the average correlation

\mathbb{E}_{\mu}\left\{\frac{1}{\tau}\int_{0}^{\tau}X_{0}(i)X_{t}(i){\rm d}t\right\}=\sigma_{i}^{2}\,\frac{\sin(\tau/\sigma_{i})}{\tau/\sigma_{i}}\,,

where it is assumed that τ\tau is independent of the initial state. This should be contrasted with the fixed integration time scenario since as τ\tau increases this vanishes and is even negative for some values of τ\tau (the minimum is here reached for about τi=4.5σi\tau_{i}=4.5\sigma_{i} for a given component).
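A quick Monte Carlo check of this averaged correlation, using the exact solution of Hamilton's equations for a single component and illustrative values of \sigma and \tau of our choosing, is the following.

```python
import numpy as np

rng = np.random.default_rng(1)
sigma, tau, N = 2.0, 20.0, 100000

x0 = sigma * rng.standard_normal(N)              # X_0(i) ~ N(0, sigma^2)
v0 = rng.standard_normal(N)                      # V_0(i) ~ N(0, 1)

t = np.linspace(0.0, tau, 2000)
# exact solution for this component: x_t = x_0 cos(t/sigma) + sigma v_0 sin(t/sigma)
xt = x0[:, None] * np.cos(t / sigma) + sigma * v0[:, None] * np.sin(t / sigma)

avg_corr = np.trapz(x0[:, None] * xt, t, axis=1).mean() / tau
print(avg_corr, sigma ** 2 * np.sin(tau / sigma) / (tau / sigma))   # the two should agree
```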

The example therefore illustrates that our approach implements this averaging feature, and therefore shares its benefits, within the context of an iterative algorithm. The example also highlights a control variate interpretation. More specifically, in the discrete time scenario \{f\circ\psi^{k}(z),k\in\llbracket P\rrbracket\} can be interpreted as control variates, which can induce both positive and negative correlations.

4.3.3 Towards an optimal flow?

In this section we are looking to determine, for some \tau>0, a flow [0,\tau]\ni t\mapsto\psi_{t} of an ODE of the form \dot{z}_{t}=F\circ z_{t}, for some field F\colon\mathsf{Z}\rightarrow\mathsf{Z}, which defines as above the probability model

μ¯(dt,dz)=1τμψt(dz)𝟏{0tτ}dt,\bar{\mu}({\rm d}t,{\rm d}z)=\frac{1}{\tau}\mu^{\psi_{-t}}({\rm d}z)\mathbf{1}\{0\leq t\leq\tau\}{\rm d}t\,,

which has the property that for any f\colon\mathsf{Z}\rightarrow\mathbb{R}, with \bar{f}(t,z):=f\circ\psi_{t}(z), we have \bar{\mu}(\bar{f})=\mu(f). This suggests the use of a Rao-Blackwellized estimator inspired by \mathbb{E}_{\bar{\mu}}\left(\bar{\mu}(\bar{f}\mid\check{Z})\right). Assuming \mathsf{Z}=\mathbb{R}^{d}\times\mathbb{R}^{d} and that \mu has a density with respect to the Lebesgue measure, we have for z\in\mathsf{Z}

μ¯(f¯z)=1τ0τfψt(z)μψt(z)𝒥ψt(z)μψu(z)𝒥ψududt\bar{\mu}(\bar{f}\mid z)=\frac{1}{\tau}\int_{0}^{\tau}f\circ\psi_{t}(z)\frac{\mu\circ\psi_{t}(z)\,\mathcal{J}\circ\psi_{t}(z)}{\int\mu\circ\psi_{u}(z)\,\mathcal{J}\circ\psi_{u}{\rm d}u}{\rm d}t (38)

In the light of Lemma 26 we aim to find for any z𝖹z\in\mathsf{Z} the flow solutions tψt(z)t\mapsto\psi_{t}(z) of ODEs z˙t=Fz(zt)\dot{z}_{t}=F_{z}(z_{t}) such that the function tμ(dz)f(z)fψt(z)μψt(z)𝒥ψt(z)t\mapsto\int\mu({\rm d}z)f(z)f\circ\psi_{t}(z)\mu\circ\psi_{t}(z)\,\mathcal{J}\circ\psi_{t}(z) decreases as fast as possible. This is motivated by the fact that the integral on [0,τ][0,\tau] of this mapping appears in the variance upper bound for (38) in Lemma 26, which we want to minimize. Note that we also expect this mapping to be smooth under general conditions not detailed in this preliminary work. For smooth enough flow and ff we have, with g:=f×μ×𝒥g:=f\times\mu\times\mathcal{J},

ddt𝔼μ[f(Z)gψt(Z)μ¯(Z)]\displaystyle\frac{{\rm d}}{{\rm d}t}\mathbb{E}_{\mu}\left[f(Z)\frac{g\circ\psi_{t}(Z)}{\bar{\mu}(Z)}\right] =𝔼μ[f(Z)μ¯(Z)gψt(Z),ψ˙t(Z)]\displaystyle=\mathbb{E}_{\mu}\left[\frac{f(Z)}{\bar{\mu}(Z)}\langle\nabla g\circ\psi_{t}(Z),\dot{\psi}_{t}(Z)\rangle\right]

Pointwise, the steepest descent direction is given by

ψ˙t(z)=Ct(z)f(z)gψt(z)|f(z)gψt(z)|\dot{\psi}_{t}(z)=-C_{t}(z)\frac{f(z)\,\nabla g\circ\psi_{t}(z)}{|f(z)\,\nabla g\circ\psi_{t}(z)|}

for a positive function Ct(z)C_{t}(z) to be determined optimally. In this scenario we therefore have

ddt𝔼μ[f(Z)gψt(Z)μ¯(Z)]=𝔼μ[Ct(z)μ¯(Z)|f(Z)gψt(Z)|]\frac{{\rm d}}{{\rm d}t}\mathbb{E}_{\mu}\left[f(Z)\frac{g\circ\psi_{t}(Z)}{\bar{\mu}(Z)}\right]=-\mathbb{E}_{\mu}\left[\frac{C_{t}(z)}{\bar{\mu}(Z)}|f(Z)\,\nabla g\circ\psi_{t}(Z)|\right]

and by the Cauchy-Schwarz inequality the (positive) expectation is maximized for

Ct(z)=Ct|f(z)gψt(z)|μ¯(z).C_{t}(z)=C_{t}\frac{|f(z)\,\nabla g\circ\psi_{t}(z)|}{\bar{\mu}(z)}\,.

and the trajectories we are interested in must be such that, for some C>0C>0,

ψ˙t(z)=Cμ¯(z)f(z)gψt(z)=Cμ¯(z)f(z)Fψt(z).\dot{\psi}_{t}(z)=-\frac{C}{\bar{\mu}(z)}f(z)\,\nabla g\circ\psi_{t}(z)=-\frac{C}{\bar{\mu}(z)}f(z)F\circ\psi_{t}(z)\,.

Note that for any zz the term |f(z)/μ¯(z)||f(z)/\bar{\mu}(z)| is only a change of speed and that the trajectory of +tψt(z)\mathbb{R}_{+}\ni t\mapsto\psi_{t}(z) is independent of this factor, despite the remarkable fact that μ¯(z)\bar{\mu}(z) depends on this flow. The result seems fairly natural.

5 MCMC with integrator snippets

We restrict this discussion to integrator snippet based algorithms, but more general Markov snippet algorithms could be considered.

Consider again the target distribution

μ¯(dz)=k=0Tμ¯(k,dz),\bar{\mu}({\rm d}z)=\sum_{k=0}^{T}\bar{\mu}(k,{\rm d}z)\,,

with

μ¯(k,dz)=1T+1μk(dz),\bar{\mu}(k,{\rm d}z)=\frac{1}{T+1}\mu_{k}({\rm d}z)\,,

where \mu_{k}({\rm d}z):=\mu^{\psi^{-k}}({\rm d}z) for k\in\llbracket 0,T\rrbracket. Assume that we are in the context of Example 1, dropping n for simplicity; the HMC algorithm using the integrator \psi^{s} is then as follows

\bar{P}(z,{\rm d}z^{\prime})=\alpha(z)\delta_{\psi^{s}(z)}({\rm d}z^{\prime})+\bar{\alpha}(z)\delta_{\sigma(z)}({\rm d}z^{\prime}) (39)

with here

\alpha(z)=\min\left\{1,\frac{{\rm d}\bar{\mu}^{\sigma\circ\psi^{s}}}{{\rm d}\bar{\mu}}(z)\right\}=\min\left\{1,\frac{\bar{\mu}\circ\psi^{s}(z)}{\bar{\mu}(z)}\right\}

where the last equality holds when \upsilon\gg\bar{\mu}, \upsilon^{\psi^{s}}=\upsilon and we let \bar{\mu}(z)={\rm d}\bar{\mu}/{\rm d}\upsilon(z). In other words the snippet \mathsf{z}=\big{(}z,\psi(z),\psi^{2}(z),\ldots,\psi^{T}(z)\big{)} is shifted along the orbit \{\psi^{k}(z),k\in\mathbb{Z}\} by \psi^{s}. Naturally this needs to be combined with updates of the velocity to lead to a viable ergodic MCMC algorithm. This can be achieved with the following kernel, with \psi^{k}(z)=\big{(}\psi_{x}^{k}(z),\psi_{v}^{k}(z)\big{)} and A\in\mathscr{Z},

Q¯(z,A):=k,l=0Tμ¯(kz)1T+1𝟏{ψl(ψxk(z),v)A}ϖ(dv),\bar{Q}(z,A):=\sum_{k,l=0}^{T}\bar{\mu}(k\mid z)\frac{1}{T+1}\int\mathbf{1}\{\psi^{-l}\big{(}\psi_{x}^{k}(z),v^{\prime}\big{)}\in A\}\varpi({\rm d}v^{\prime})\,,

described algorithmically in Alg. 6. Indeed for any (l,v)0,T×𝖵(l,v^{\prime})\in\llbracket 0,T\rrbracket\times\mathsf{V}, using identity (11),

\int\sum_{k=0}^{T}\bar{\mu}(k\mid z)\mathbf{1}\{\psi^{-l}\big{(}\psi_{x}^{k}(z),v^{\prime}\big{)}\in A\}\bar{\mu}({\rm d}z)=\int\mathbf{1}\{\psi^{-l}\big{(}x,v^{\prime}\big{)}\in A\}\mu({\rm d}z)\,,

and hence

μ¯Q¯(A)\displaystyle\bar{\mu}\bar{Q}(A) =1T+1l=0T𝟏{ψl(x,v)A}ϖ(dv)μ(dz)\displaystyle=\frac{1}{T+1}\sum_{l=0}^{T}\int\mathbf{1}\{\psi^{-l}\big{(}x,v^{\prime}\big{)}\in A\}\varpi({\rm d}v^{\prime})\mu({\rm d}z)
=1T+1l=0T𝟏{ψl(x,v)A}μ(dz)\displaystyle=\frac{1}{T+1}\sum_{l=0}^{T}\int\mathbf{1}\{\psi^{-l}\big{(}x,v\big{)}\in A\}\mu({\rm d}z)
=𝟏{zA}1T+1l=0Tμψl(dz)\displaystyle=\int\mathbf{1}\{z\in A\}\frac{1}{T+1}\sum_{l=0}^{T}\mu^{\psi^{-l}}({\rm d}z)
=μ¯(A).\displaystyle=\bar{\mu}(A)\,.

Again, samples from this MCMC targeting \bar{\mu} can be used to estimate expectations with respect to \mu using the identity (11). This is closely related to the “windows of states” approach of [28, 29, 31], where a window of states is what we call a Hamiltonian snippet in the present manuscript. Indeed the windows of states approach corresponds to the Markov update

P(z,A)=\sum_{k,l=0}^{T}\frac{1}{T+1}\int\bar{P}\big{(}\psi^{-k}(z),{\rm d}z^{\prime}\big{)}\bar{\mu}(l\mid z^{\prime})\mathbf{1}\{\psi^{l}(z^{\prime})\in A\}\,, (40)

which, we show below, leaves μ\mu invariant. Indeed, note that for k,l0,Tk,l\in\llbracket 0,T\rrbracket and A𝒵A\in\mathscr{Z},

\displaystyle\int\mu({\rm d}z)\int\bar{P}\big{(}\psi^{-k}(z),{\rm d}z^{\prime}\big{)}\bar{\mu}(l\mid z^{\prime})\mathbf{1}\{\psi^{l}(z^{\prime})\in A\}=\int\mu^{\psi^{-k}}({\rm d}z)\int\bar{P}\big{(}z,{\rm d}z^{\prime}\big{)}\bar{\mu}(l\mid z^{\prime})\mathbf{1}\{\psi^{l}(z^{\prime})\in A\}\,,

and therefore

μ(dz)P(z,dz)𝟏{zA}=\displaystyle\int\mu({\rm d}z)\int P(z,{\rm d}z^{\prime})\mathbf{1}\{z^{\prime}\in A\}= μ¯(dz)P¯(z,dz)l=0Tμ¯(lz)𝟏{ψl(z)A}\displaystyle\int\bar{\mu}({\rm d}z)\bar{P}(z,{\rm d}z^{\prime})\sum_{l=0}^{T}\bar{\mu}(l\mid z^{\prime})\mathbf{1}\{\psi^{l}(z^{\prime})\in A\}
=\displaystyle= μ¯(dz)l=0Tμ¯(lz)𝟏{ψl(z)A}\displaystyle\int\bar{\mu}({\rm d}z^{\prime})\sum_{l=0}^{T}\bar{\mu}(l\mid z^{\prime})\mathbf{1}\{\psi^{l}(z^{\prime})\in A\}
=\displaystyle= μ(dz)𝟏{zA},\displaystyle\int\mu({\rm d}z)\mathbf{1}\{z\in A\}\,,

where we obtain the last line from (11). PP is not μ\mu-reversible in general, making theoretical comparisons challenging.

1. Given zz.
2. Sample kμ¯(z)k\sim\bar{\mu}(\cdot\mid z) and l𝒰(0,T)l\sim\mathcal{U}(\llbracket 0,T\rrbracket).
3. Compute ψk(z)=(ψxk(z),ψvk(z))\psi^{k}(z)=\big{(}\psi_{x}^{k}(z),\psi_{v}^{k}(z)\big{)}.
4. Sample vϖ()v^{\prime}\sim\varpi(\cdot).
5. Return ψl(ψxk(z),v)\psi^{-l}(\psi_{x}^{k}(z),v^{\prime}).
Algorithm 6 Kernel Q¯\bar{Q}
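
For concreteness, the following Python sketch implements one draw from Q¯\bar{Q} in the Hamiltonian setting. It is a minimal illustration only, under the assumptions that z=(x,v)z=(x,v), μ(d(x,v))exp(U(x))𝒩(v;0,I)dxdv\mu({\rm d}(x,v))\propto\exp(-U(x))\mathcal{N}(v;0,I){\rm d}x{\rm d}v, ψ\psi is a single leapfrog step of size eps (so that μ¯(kz)exp(H(ψk(z)))\bar{\mu}(k\mid z)\propto\exp(-H(\psi^{k}(z)))) and ϖ=𝒩(0,I)\varpi=\mathcal{N}(0,I); the functions U, grad_U and the values eps, T are placeholders to be supplied by the user.

import numpy as np

def leapfrog(x, v, grad_U, eps):
    # One leapfrog step for H(x, v) = U(x) + |v|^2 / 2; this plays the role of psi.
    v = v - 0.5 * eps * grad_U(x)
    x = x + eps * v
    v = v - 0.5 * eps * grad_U(x)
    return x, v

def leapfrog_inv(x, v, grad_U, eps):
    # Inverse of one leapfrog step, i.e. psi^{-1} (the step run backwards in time).
    v = v + 0.5 * eps * grad_U(x)
    x = x - eps * v
    v = v + 0.5 * eps * grad_U(x)
    return x, v

def Q_bar_step(x, v, U, grad_U, eps, T, rng):
    # One draw from the kernel Q_bar of Alg. 6 (schematic; U, grad_U, eps, T are placeholders).
    # Build the snippet psi^0(z), ..., psi^T(z) and weights proportional to exp(-H(psi^k(z))).
    xs, vs = [x], [v]
    for _ in range(T):
        x, v = leapfrog(x, v, grad_U, eps)
        xs.append(x); vs.append(v)
    log_w = np.array([-U(xk) - 0.5 * np.dot(vk, vk) for xk, vk in zip(xs, vs)])
    w = np.exp(log_w - log_w.max()); w /= w.sum()
    k = rng.choice(T + 1, p=w)                     # k ~ mu_bar(. | z)
    l = rng.integers(0, T + 1)                     # l ~ Uniform{0, ..., T}
    v_new = rng.standard_normal(np.shape(xs[k]))   # v' ~ varpi = N(0, I)
    x_out, v_out = xs[k], v_new                    # state (psi_x^k(z), v')
    for _ in range(l):                             # apply psi^{-l}
        x_out, v_out = leapfrog_inv(x_out, v_out, grad_U, eps)
    return x_out, v_out

Iterating Q_bar_step produces a μ¯\bar{\mu}-invariant chain; expectations with respect to μ\mu can then be estimated through identity (11), as discussed above.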

6 Discussion

We have shown how mappings used in various Monte Carlo schemes relying on numerical integrators of ODEs can be exploited so that all computations along an integrator orbit contribute to the design of robust and efficient sampling algorithms. Numerous questions remain open, including the tradeoff between NN and TT. A precise analysis of this question is made particularly difficult by the fact that integration along snippets is straightforwardly parallelisable, while resampling does not lend itself to straightforward parallelisation.

Another point concerns the particular choice of mutation Markov kernel M¯n\bar{M}_{n}, or M¯\bar{M}, in (12) or (23). Such a kernel starts with a transition from samples approximating the snippet distribution μ¯n1\bar{\mu}_{n-1} to samples approximating μn1\mu_{n-1}, followed by a reweighting leading to a representation of μ¯n\bar{\mu}_{n}. One could instead, for illustration, use an SMC sampler with (39) as mutation kernel.

In relation to the discussion in Remark 20, a natural question is how our scheme would compare with a “Rao-Blackwellized” SMC sampler where weights of the type (35), derived from (6), are used.

We leave all these questions for future investigations.

Acknowledgements

The authors would like to thank Carl Dettman for very useful discussions on Boltzmann’s conjecture. The research of CA and MCE was supported by the EPSRC grant ‘CoSInES (COmputational Statistical INference for Engineering and Security)’ (EP/R034710/1) and the EPSRC grant Bayes4Health, ‘New Approaches to Bayesian Data Science: Tackling Challenges from the Health Sciences’ (EP/R018561/1). The research of CZ was supported by a CSC Scholarship.

Appendix A Notation and definitions

We will write ={0,1,2,}\mathbb{N}=\left\{0,1,2,\dots\right\} for the set of natural numbers and +=(0,)\mathbb{R}_{+}=\left(0,\infty\right) for positive real numbers. Throughout this section (𝖤,)(\mathsf{E},\mathscr{E}) is a generic measurable space.

  • For A𝖤A\subset\mathsf{E} we let AA^{\complement} be its complement.

  • (𝖤,)\mathcal{M}(\mathsf{E},\mathscr{E}) (resp. 𝒫(𝖤,)\mathcal{P}(\mathsf{E},\mathscr{E})) is the set of measures (resp. probability distributions) on (𝖤,)(\mathsf{E},\mathscr{E}).

  • For a set AA\in\mathscr{E}, we denote the corresponding indicator function by 𝟏A:𝖤{0,1}\mathbf{1}_{A}:\mathsf{E}\to\left\{0,1\right\} and may use the notation 𝟏{zA}:=𝟏A(z)\mathbf{1}\{z\in A\}:=\mathbf{1}_{A}(z).

  • For μ\mu a probability measure on (𝖤,)(\mathsf{E},\mathscr{E}) and a measurable function f:𝖤f\colon\mathsf{E}\rightarrow\mathbb{R}, we let μ(f):=f(x)μ(dx)\mu(f):=\int f(x)\mu({\rm d}x).

  • For two probability measures μ\mu and ν\nu on (𝖤,)(\mathsf{E},\mathscr{E}) we let μν\mu\otimes\nu be a measure on (𝖤×𝖤,)(\mathsf{E}\times\mathsf{E},\mathscr{E}\otimes\mathscr{E}) such that μν(A×B)=μ(A)ν(B)\mu\otimes\nu(A\times B)=\mu(A)\nu(B) for A,BA,B\in\mathscr{E}.

  • For a Markov kernel P(x,dy)P(x,{\rm d}y) on 𝖤×\mathsf{E}\times\mathscr{E}, we write

    • μP\mu\otimes P for the probability measure on (𝖤×𝖤,)(\mathsf{E}\times\mathsf{E},\mathscr{E}\otimes\mathscr{E}) such that for A¯\bar{A}\in\mathscr{E}\otimes\mathscr{E}, the minimal product σ\sigma-algebra, μP(A¯)=A¯μ(dx)P(x,dy)\mu\otimes P(\bar{A})=\int_{\bar{A}}\mu({\rm d}x)P(x,{\rm d}y).

    • μP\mu\accentset{\curvearrowleft}{\otimes}P for the probability measure on (𝖤×𝖤,)(\mathsf{E}\times\mathsf{E},\mathscr{E}\otimes\mathscr{E}) such that for A,BA,B\in\mathscr{E} μP(A×B)=μP(B×A)\mu\accentset{\curvearrowleft}{\otimes}P(A\times B)=\mu\otimes P(B\times A).

  • For μ,ν\mu,\nu probability distributions on (𝖤,)(\mathsf{E},\mathscr{E}) and kernels M,L:𝖤×[0,1]M,L\colon\mathsf{E}\times\mathscr{E}\rightarrow[0,1] such that μMνL\mu\otimes M\gg\nu\accentset{\curvearrowleft}{\otimes}L, we denote by

    dνLdμM(z,z)\frac{{\rm d}\nu\accentset{\curvearrowleft}{\otimes}L}{{\rm d}\mu\otimes M}(z,z^{\prime})

    the corresponding Radon-Nikodym derivative such that for f:𝖤×𝖤f\colon\mathsf{E}\times\mathsf{E}\rightarrow\mathbb{R},

    f(z,z)dνLdμM(z,z)μM(d(z,z))=f(z,z)νL(d(z,z)).\int f(z,z^{\prime})\frac{{\rm d}\nu\accentset{\curvearrowleft}{\otimes}L}{{\rm d}\mu\otimes M}(z,z^{\prime})\mu\otimes M\big{(}{\rm d}(z,z^{\prime})\big{)}=\int f(z,z^{\prime})\nu\accentset{\curvearrowleft}{\otimes}L\big{(}{\rm d}(z^{\prime},z)\big{)}\,.
  • A point mass distribution at xx will be denoted by δx(dy)\delta_{x}({\rm d}y); it is such that for f:𝖤f\colon\mathsf{E}\rightarrow\mathbb{R}

    \int f(y)\delta_{x}({\rm d}y)=f(x)
  • In order to alleviate notation, for MM\in\mathbb{N}, (z(i),wi)𝖤×[0,)(z^{(i)},w_{i})\in\mathsf{E}\times[0,\infty), iMi\in\llbracket M\rrbracket, we refer to {(z(i),wi),iM}\big{\{}(z^{(i)},w_{i}),i\in\llbracket M\rrbracket\big{\}} as weighted samples to mean {(z(i),w~i),iM}\big{\{}(z^{(i)},\tilde{w}_{i}),i\in\llbracket M\rrbracket\big{\}} where w~iwi\tilde{w}_{i}\propto w_{i} but i=1Mw~i=1\sum_{i=1}^{M}\tilde{w}_{i}=1.

  • We say that a set of weighted samples, or particles, {(zi,wi)𝖹×+:iN}\{(z_{i},w_{i})\in\mathsf{Z}\times\mathbb{R}_{+}\colon i\in\llbracket N\rrbracket\} for N1N\geq 1 represents a distribution μ\mu whenever for ff μ\mu-integrable

    i=1Nwij=1Nwjf(zi)μ(f),\sum_{i=1}^{N}\frac{w_{i}}{\sum_{j=1}^{N}w_{j}}f(z_{i})\approx\mu(f)\,,

    in the LpL^{p} sense for some p1p\geq 1 (a short numerical sketch is given at the end of this appendix).

  • For MM\in\mathbb{N}, wi[0,)w_{i}\in[0,\infty), iMi\in\llbracket M\rrbracket, we let KCat(w1,w2,,wM)K\sim{\rm Cat}\left(w_{1},w_{2},\ldots,w_{M}\right) mean that (K=k)wk\mathbb{P}(K=k)\propto w_{k}.

  • For M,NM,N\in\mathbb{N}, wij[0,)w_{ij}\in[0,\infty), (i,j)M×N(i,j)\in\llbracket M\rrbracket\times\llbracket N\rrbracket, we let KCat(wij,(i,j)M×N)K\sim{\rm Cat}\left(w_{ij},(i,j)\in\llbracket M\rrbracket\times\llbracket N\rrbracket\right) mean that (K=(k,l))wkl\mathbb{P}(K=(k,l))\propto w_{kl}.

  • For f:mnf\colon\mathbb{R}^{m}\rightarrow\mathbb{R}^{n} we let

    • f\nabla\otimes f be the transpose of the Jacobian

    • for n=1n=1 we let f=(f)\nabla f=(\nabla\otimes f)^{\top} be the gradient,

    • f\nabla\cdot f be the divergence.
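
The short numerical sketch announced above illustrates the weighted-sample representation of a distribution and the categorical distribution Cat\rm Cat; the particular samples, weights and test function below are arbitrary placeholders.

import numpy as np

rng = np.random.default_rng(0)

# Weighted samples {(z_i, w_i), i in [N]} (placeholder values).
z = rng.standard_normal(1000)
w = np.exp(-0.5 * (z - 1.0) ** 2)          # unnormalised weights w_i

# Self-normalised estimate of mu(f): sum_i w_i f(z_i) / sum_j w_j.
f = lambda x: x ** 2
mu_f_hat = np.sum(w * f(z)) / np.sum(w)

# K ~ Cat(w_1, ..., w_N): P(K = k) proportional to w_k, as used for resampling.
K = rng.choice(len(w), p=w / np.sum(w))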

Appendix B Radon-Nikodym derivative

The general formalism required and used throughout the paper relies on a single measure-theoretic tool, the Radon-Nikodym derivative. We gather here definitions and intermediate results used throughout, pointing out the simplicity of the tools involved and the benefits they bring.

Definition 28 (Pushforward).

Let μ\mu be a measure on (𝖹,𝒵)(\mathsf{Z},\mathscr{Z}) and ψ:(𝖹,𝒵)(𝖹,𝒵)\psi:(\mathsf{Z},\mathscr{Z})\to(\mathsf{Z}^{\prime},\mathscr{Z}^{\prime}) a measurable function. The pushforward of μ\mu by ψ\psi is defined by

μψ(A):=μ(ψ1(A)),A𝒵,\mu^{\psi}(A):=\mu(\psi^{-1}(A)),\qquad A\in\mathscr{Z}^{\prime},

where ψ1(A)={z𝖹:ψ(z)A}\psi^{-1}(A)=\{z\in\mathsf{Z}:\psi(z)\in A\} is the preimage of AA under ψ\psi.

If μ\mu is a probability distribution then μψ\mu^{\psi} is the probability measure associated with ψ(Z)\psi(Z) when ZμZ\sim\mu.

Definition 29 (Dominating and equivalent measures).

For two measures μ\mu and ν\nu on the same measurable space (𝖹,𝒵)(\mathsf{Z},\mathscr{Z}),

  1.

    ν\nu is said to dominate μ\mu if for all measurable A𝒵A\in\mathscr{Z}, μ(A)>0ν(A)>0\mu(A)>0\Rightarrow\nu(A)>0 – this is denoted νμ\nu\gg\mu.

  2.

    μ\mu and ν\nu are equivalent, written μν\mu\equiv\nu, if μν\mu\gg\nu and νμ\nu\gg\mu.

We will need the notion of Radon-Nikodym derivative [7, Theorems 32.2 & 16.11]:

Theorem 30 (Radon–Nikodym).

Let μ\mu and ν\nu be σ\sigma-finite measures on (𝖹,𝒵)(\mathsf{Z},\mathscr{Z}). Then νμ\nu\ll\mu if and only if there exists an essentially unique, measurable, non-negative function f:𝖹[0,)f\colon\mathsf{Z}\rightarrow[0,\infty) such that

\int_{A}f(z)\mu({\rm d}z)=\nu(A),\qquad A\in\mathscr{Z}.

Therefore we can view dν/dμ:=f{\rm d}\nu/{\rm d}\mu:=f as the density of ν\nu w.r.t. μ\mu and in particular if gg is integrable w.r.t. ν\nu then

g(z)dνdμ(z)μ(dz)=g(z)ν(dz).\int g(z)\frac{{\rm d}\nu}{{\rm d}\mu}(z)\mu({\rm d}z)=\int g(z)\nu({\rm d}z)\,.

If μ\mu is a measure and ff a non-negative, measurable function then μf\mu\cdot f is the measure (μf)(A)=𝟏A(z)f(z)μ(dz)(\mu\cdot f)(A)=\int{\bf 1}_{A}(z)f(z)\mu({\rm d}z), i.e. the measure ν=μf\nu=\mu\cdot f such that ff is the Radon–Nikodym derivative dν/dμ=f{\rm d}\nu/{\rm d}\mu=f.

The following establishes the expression of an expectation with respect to the pushforward μψ\mu^{\psi} in terms of expectations with respect to μ\mu [7, Theorem 16.13].

Theorem 31 (Change of variables).

A function f:𝖹f:\mathsf{Z}^{\prime}\to\mathbb{R} is integrable w.r.t. μψ\mu^{\psi} if and only if fψf\circ\psi is integrable w.r.t. μ\mu, in which case

𝖹f(z)μψ(dz)=𝖹fψ(z)μ(dz).\int_{\mathsf{Z}^{\prime}}f(z)\mu^{\psi}({\rm d}z)=\int_{\mathsf{Z}}f\circ\psi(z)\mu({\rm d}z)\,. (41)
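
As an elementary numerical illustration of (41), one may check by Monte Carlo that both sides agree; the distribution, map and test function below are arbitrary choices made only for this sketch.

import numpy as np

rng = np.random.default_rng(1)
f = lambda z: np.cos(z)
psi = lambda z: 2.0 * z + 1.0                     # pushforward of N(0,1) under psi is N(1, 4)

Z = rng.standard_normal(200_000)                  # Z ~ mu = N(0, 1)
lhs = np.mean(f(rng.normal(1.0, 2.0, 200_000)))   # int f d(mu^psi), sampling mu^psi directly
rhs = np.mean(f(psi(Z)))                          # int (f o psi) dmu, the right-hand side of (41)
# lhs and rhs agree up to Monte Carlo error.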

We now establish results used throughout the manuscript. The central identity is the following direct application of Theorem 31 for ψ:𝖹𝖹\psi\colon\mathsf{Z}\rightarrow\mathsf{Z} invertible,

\int_{\mathsf{Z}}f\circ\psi(z)\mu^{\psi^{-1}}({\rm d}z)=\int_{\mathsf{Z}}f(z)\mu({\rm d}z)

which may seem tautological since it can be summarized as follows: if ZμZ\sim\mu then ψ1(Z)μψ1\psi^{-1}(Z)\sim\mu^{\psi^{-1}} and ψψ1(Z)=Zμ\psi\circ\psi^{-1}(Z)=Z\sim\mu. The interest of the approach stems from the following properties.

Lemma 32.

Let ψ:𝖹𝖹\psi\colon\mathsf{Z}\rightarrow\mathsf{Z} be measurable and invertible, μ\mu and υ\upsilon be σ\sigma-finite measures on (𝖹,𝒵)(\mathsf{Z},\mathscr{Z}) such that υμ\upsilon\gg\mu and υυψ1\upsilon\gg\upsilon^{\psi^{-1}}. Then

  1.

    υψ1μψ1\upsilon^{\psi^{-1}}\gg\mu^{\psi^{-1}} and therefore υμψ1\upsilon\gg\mu^{\psi^{-1}},

  2.

    for υ\upsilon-almost all z𝖹z\in\mathsf{Z},

    \frac{{\rm d}\mu^{\psi^{-1}}}{{\rm d}\upsilon}(z)=\frac{{\rm d}\mu}{{\rm d}\upsilon}\circ\psi(z)\,\frac{{\rm d}\upsilon^{\psi^{-1}}}{{\rm d}\upsilon}(z)
  3.

    we have

    μμψ1υ({z𝖹:dμψ1/dυ(z)>0,dμ/dυ(z)=0})=0\mu\gg\mu^{\psi^{-1}}\iff\upsilon\big{(}\big{\{}z\in\mathsf{Z}\colon{\rm d}\mu^{\psi^{-1}}/{\rm d}\upsilon(z)>0,{\rm d}\mu/{\rm d}\upsilon(z)=0\big{\}}\big{)}=0

    in which case for υ\upsilon-almost all z𝖹z\in\mathsf{Z}

    dμψ1dμ(z)={μψμ(z)μ(z)>00otherwise\frac{{\rm d}\mu^{\psi^{-1}}}{{\rm d}\mu}(z)=\begin{cases}\frac{\mu\circ\psi}{\mu}(z)&\mu(z)>0\\ 0&\text{otherwise}\end{cases}

    and therefore

    \int_{\mathsf{Z}}f\circ\psi(z)\mu^{\psi^{-1}}({\rm d}z)=\int_{\mathsf{Z}}f\circ\psi(z)\frac{\mu\circ\psi(z)}{\mu(z)}\frac{{\rm d}\upsilon^{\psi}}{{\rm d}\upsilon}(z)\,\mu({\rm d}z).
Proof.

For the first part of the first statement, let A𝒵A\in\mathscr{Z} such that υψ1(A)=υ(ψ(A))>0\upsilon^{\psi^{-1}}(A)=\upsilon\big{(}\psi(A)\big{)}>0, then since ψ(A)𝒵\psi(A)\in\mathscr{Z} and υμ\upsilon\gg\mu we deduce μ(ψ(A))=μψ1(A)>0\mu\big{(}\psi(A)\big{)}=\mu^{\psi^{-1}}(A)>0 and we conclude; the second part follows from υυψ1\upsilon\gg\upsilon^{\psi^{-1}}. For the second statement, for f:𝖹f\colon\mathsf{Z}\rightarrow\mathbb{R} bounded and measurable,

f(z)dμψ1dυ(z)υ(dz)\displaystyle\int f(z)\frac{{\rm d}\mu^{\psi^{-1}}}{{\rm d}\upsilon}(z)\upsilon({\rm d}z) =f(z)μψ1(dz)\displaystyle=\int f(z)\mu^{\psi^{-1}}({\rm d}z)
=fψ1(z)μ(dz)\displaystyle=\int f\circ\psi^{-1}(z)\mu({\rm d}z)
=fψ1(z)dμdυ(z)υ(dz)\displaystyle=\int f\circ\psi^{-1}(z)\frac{{\rm d}\mu}{{\rm d}\upsilon}(z)\upsilon({\rm d}z)
=f(z)dμdυψ(z)υψ1(dz)\displaystyle=\int f(z)\frac{{\rm d}\mu}{{\rm d}\upsilon}\circ\psi(z)\upsilon^{\psi^{-1}}({\rm d}z)
=f(z)dμdυψ(z)dυψ1dυ(z)υ(dz)\displaystyle=\int f(z)\frac{{\rm d}\mu}{{\rm d}\upsilon}\circ\psi(z)\frac{{\rm d}\upsilon^{\psi^{-1}}}{{\rm d}\upsilon}(z)\upsilon({\rm d}z)

The third statement is given as [7, Problem 32.6.], which we solve in Lemma 34. ∎

Corollary 33.

If ψ:𝖹𝖹\psi\colon\mathsf{Z}\rightarrow\mathsf{Z} and υ\upsilon are such that υψ=υ\upsilon^{\psi}=\upsilon, then dυψ/dυ1{\rm d}\upsilon^{\psi}/{\rm d}\upsilon\equiv 1.
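
As an illustrative special case (not part of the statements above): if υ\upsilon is the Lebesgue measure, ψ\psi is one leapfrog step, which preserves Lebesgue measure, and μμψ1\mu\gg\mu^{\psi^{-1}}, then Lemma 32 and Corollary 33 combine to give, for υ\upsilon-almost all zz with μ(z)>0\mu(z)>0,

\frac{{\rm d}\mu^{\psi^{-1}}}{{\rm d}\mu}(z)=\frac{\mu\circ\psi(z)}{\mu(z)}\,,

that is, the density ratio appearing in acceptance probabilities involving volume-preserving integrators.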

Lemma 34 (Billingsley, Problem 32.6.).

Assume μ,ν\mu,\nu and υ\upsilon are σ\sigma-finite and that υν,μ\upsilon\gg\nu,\mu. Then μν\mu\gg\nu if and only if υ({z𝖹:dν/dυ(z)>0,dμ/dυ(z)=0})=0\upsilon\big{(}\big{\{}z\in\mathsf{Z}\colon{\rm d}\nu/{\rm d}\upsilon(z)>0,{\rm d}\mu/{\rm d}\upsilon(z)=0\big{\}}\big{)}=0, in which case

dνdμ(z)=𝟏{z𝖹:dμ/dυ(z)>0}dνdυ/dμdυ(z).\frac{{\rm d}\nu}{{\rm d}\mu}(z)=\mathbf{1}\{z\in\mathsf{Z}:{\rm d}\mu/{\rm d}\upsilon(z)>0\}\frac{{\rm d}\nu}{{\rm d}\upsilon}/\frac{{\rm d}\mu}{{\rm d}\upsilon}(z)\,.
Proof.

Let S:={z𝖹:dν/dυ(z)>0,dμ/dυ(z)=0}S:=\big{\{}z\in\mathsf{Z}\colon{\rm d}\nu/{\rm d}\upsilon(z)>0,{\rm d}\mu/{\rm d}\upsilon(z)=0\big{\}}. For f:𝖹f\colon\mathsf{Z}\rightarrow\mathbb{R} integrable we always have

f(z)ν(dz)\displaystyle\int f(z)\nu({\rm d}z) =𝟏{zS}f(z)dνdυ(z)υ(dz)+𝟏{zS}f(z)dνdυ/dμdυ(z)μ(dz).\displaystyle=\int\mathbf{1}\{z\in S\}f(z)\frac{{\rm d}\nu}{{\rm d}\upsilon}(z)\upsilon({\rm d}z)+\int\mathbf{1}\big{\{}z\in S^{\complement}\big{\}}f(z)\frac{{\rm d}\nu}{{\rm d}\upsilon}/\frac{{\rm d}\mu}{{\rm d}\upsilon}(z)\mu({\rm d}z)\,.

Assume μν\mu\gg\nu then from above for any f:𝖹f\colon\mathsf{Z}\rightarrow\mathbb{R} integrable

𝟏{zS}f(z)dνdμ(z)μ(dz)=\displaystyle\int\mathbf{1}\{z\in S\}f(z)\frac{{\rm d}\nu}{{\rm d}\mu}(z)\mu({\rm d}z)= 𝟏{zS}f(z)dνdμ(z)dμdυ(z)υ(dz)\displaystyle\int\mathbf{1}\{z\in S\}f(z)\frac{{\rm d}\nu}{{\rm d}\mu}(z)\frac{{\rm d}\mu}{{\rm d}\upsilon}(z)\upsilon({\rm d}z)
=\displaystyle= 0,\displaystyle 0\,,

and therefore, taking f1f\equiv 1, ν(S)=0\nu(S)=0; since dν/dυ>0{\rm d}\nu/{\rm d}\upsilon>0 on SS this implies υ(S)=0\upsilon(S)=0, and we conclude from the first identity above. Now assume that υ(S)=0\upsilon\big{(}S\big{)}=0, then

f(z)ν(dz)\displaystyle\int f(z)\nu({\rm d}z) =𝟏{zS}f(z)dνdυ(z)υ(dz)+𝟏{zS}f(z)dνdυ/dμdυ(z)μ(dz)\displaystyle=\int\mathbf{1}\{z\in S\}f(z)\frac{{\rm d}\nu}{{\rm d}\upsilon}(z)\upsilon({\rm d}z)+\int\mathbf{1}\big{\{}z\in S^{\complement}\big{\}}f(z)\frac{{\rm d}\nu}{{\rm d}\upsilon}/\frac{{\rm d}\mu}{{\rm d}\upsilon}(z)\mu({\rm d}z)
=𝟏{zS}f(z)dνdυ/dμdυ(z)μ(dz).\displaystyle=\int\mathbf{1}\big{\{}z\in S^{\complement}\big{\}}f(z)\frac{{\rm d}\nu}{{\rm d}\upsilon}/\frac{{\rm d}\mu}{{\rm d}\upsilon}(z)\mu({\rm d}z)\,.

The equivalence is therefore established and when either condition is satisfied we have

dνdμ(z)=𝟏{zS}dνdυ/dμdυ(z)\frac{{\rm d}\nu}{{\rm d}\mu}(z)=\mathbf{1}\big{\{}z\in S^{\complement}\big{\}}\frac{{\rm d}\nu}{{\rm d}\upsilon}/\frac{{\rm d}\mu}{{\rm d}\upsilon}(z)

and we conclude. ∎

References

  • [1] Christophe Andrieu, Anthony Lee, and Sam Livingstone. A general perspective on the Metropolis-Hastings kernel. arXiv preprint arXiv:2012.14881, 2020.
  • [2] Christophe Andrieu, James Ridgway, and Nick Whiteley. Sampling normalizing constants in high dimensions using inhomogeneous diffusions, 2018.
  • [3] Mark A Beaumont. Approximate Bayesian computation in evolution and ecology. Annual Review of Ecology, Evolution, and Systematics, 41:379–406, 2010.
  • [4] Alexandros Beskos, Natesh Pillai, Gareth Roberts, Jesus-Maria Sanz-Serna, and Andrew Stuart. Optimal tuning of the hybrid Monte Carlo algorithm. Bernoulli, 19(5A):1501–1534, 2013.
  • [5] Michael Betancourt. Nested sampling with constrained Hamiltonian Monte Carlo. In AIP Conference Proceedings, volume 1305, pages 165–172. American Institute of Physics, 2011.
  • [6] Joris Bierkens and Gareth Roberts. A piecewise deterministic scaling limit of lifted Metropolis–Hastings in the Curie–Weiss model. The Annals of Applied Probability, 27(2):846–882, 2017.
  • [7] P. Billingsley. Probability and Measure. 3rd edition, Wiley, New York, 1995.
  • [8] Alexandre Bouchard-Côté, Sebastian J Vollmer, and Arnaud Doucet. The bouncy particle sampler: A nonreversible rejection-free Markov chain Monte Carlo method. Journal of the American Statistical Association, 113(522):855–867, 2018.
  • [9] Dmitri Burago, Yuri Burago, and Sergei Ivanov. A course in metric geometry, volume 33. American Mathematical Society, 2022.
  • [10] Mari Paz Calvo, Daniel Sanz-Alonso, and Jesús María Sanz-Serna. HMC: reducing the number of rejections by not using leapfrog and some results on the acceptance rate. Journal of Computational Physics, 437:110333, 2021.
  • [11] Cédric M Campos and Jesús María Sanz-Serna. Extra chance generalized hybrid Monte Carlo. Journal of Computational Physics, 281:365–374, 2015.
  • [12] Joseph T Chang and David Pollard. Conditioning as disintegration. Statistica Neerlandica, 51(3):287–317, 1997.
  • [13] Nicolas Chopin and James Ridgway. Leave Pima Indians alone: binary regression as a benchmark for Bayesian computation. Statistical Science, 32(1):64–87, 2017.
  • [14] Hai-Dang Dau and Nicolas Chopin. Waste-free sequential Monte Carlo. arXiv preprint arXiv:2011.02328, 2020.
  • [15] Pierre Del Moral. Feynman-Kac Formulae. Springer, 2004.
  • [16] Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.
  • [17] Simon Duane, Anthony D Kennedy, Brian J Pendleton, and Duncan Roweth. Hybrid Monte Carlo. Physics letters B, 195(2):216–222, 1987.
  • [18] Mauro Camara Escudero. Approximate Manifold Sampling: Robust Bayesian Inference for Machine Learning. PhD thesis, School of Mathematics, January 2024.
  • [19] Youhan Fang, J. M. Sanz-Serna, and Robert D. Skeel. Compressible generalized hybrid Monte Carlo. The Journal of Chemical Physics, 140(17):174108, 05 2014.
  • [20] Ernst Hairer, S. P. Nørsett, and G. Wanner. Solving Ordinary Differential Equations I, Nonstiff Problems. Springer-Verlag, 1993.
  • [21] Matthew Hoffman, Alexey Radul, and Pavel Sountsov. An adaptive-MCMC scheme for setting trajectory lengths in Hamiltonian Monte Carlo. In International Conference on Artificial Intelligence and Statistics, pages 3907–3915. PMLR, 2021.
  • [22] Matthew D Hoffman and Pavel Sountsov. Tuning-Free Generalized Hamiltonian Monte Carlo. In International Conference on Artificial Intelligence and Statistics, pages 7799–7813. PMLR, 2022.
  • [23] Paul B. Mackenzie. An improved hybrid Monte Carlo method. Physics Letters B, 226(3-4):369–371, 1989.
  • [24] Florian Maire and Pierre Vandekerkhove. On Markov chain Monte Carlo for sparse and filamentary distributions. ArXiv e-prints, 2018.
  • [25] Florian Maire and Pierre Vandekerkhove. Markov kernels local aggregation for noise vanishing distribution sampling. SIAM Journal on Mathematics of Data Science, 4(4):1293–1319, 2022.
  • [26] Youssef Marzouk, Tarek Moselhy, Matthew Parno, and Alessio Spantini. Sampling via Measure Transport: An Introduction, pages 1–41. Springer International Publishing, Cham, 2016.
  • [27] Xiao-Li Meng and Stephen Schilling. Warp bridge sampling. Journal of Computational and Graphical Statistics, 11(3):552–586, 2002.
  • [28] Radford M Neal. An improved acceptance procedure for the hybrid Monte Carlo algorithm. Journal of Computational Physics, 111(1):194–203, 1994.
  • [29] Radford M. Neal. MCMC Using Hamiltonian Dynamics, chapter 5. CRC Press, 2011.
  • [30] Ari Pakman and Liam Paninski. Exact Hamiltonian Monte Carlo for truncated multivariate Gaussians. Journal of Computational and Graphical Statistics, 23(2):518–542, 2014.
  • [31] Zhaohui S. Qin and Jun S. Liu. Multipoint Metropolis method with application to Hybrid Monte Carlo. Journal of Computational Physics, 172(2):827–840, 2001.
  • [32] James Ridgway. Computation of Gaussian orthant probabilities in high dimension. Statistics and Computing, 26(4):899–916, 2016.
  • [33] Grant M Rotskoff and Eric Vanden-Eijnden. Dynamical computation of the density of states and Bayes factors using nonequilibrium importance sampling. Physical review letters, 122(15):150602, 2019.
  • [34] Gabriel Stoltz, Mathias Rousset, et al. Free energy computations: A mathematical perspective. World Scientific, 2010.
  • [35] D. Szász. Boltzmann’s ergodic hypothesis, a conjecture for centuries? In Hard Ball Systems and the Lorentz Gas, pages 421–446. Springer, 2000.
  • [36] Esteban G Tabak and Eric Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.
  • [37] Erik H Thiede, Brian Van Koten, Jonathan Weare, and Aaron R Dinner. Eigenvector method for umbrella sampling enables error analysis. The Journal of chemical physics, 145(8), 2016.
  • [38] Achille Thin, Yazid Janati El Idrissi, Sylvain Le Corff, Charles Ollion, Eric Moulines, Arnaud Doucet, Alain Durmus, and Christian X Robert. Neo: Non equilibrium sampling on the orbits of a deterministic transform. Advances in Neural Information Processing Systems, 34:17060–17071, 2021.
  • [39] G.M. Torrie and J.P. Valleau. Nonphysical sampling distributions in Monte Carlo free-energy estimation: Umbrella sampling. Journal of Computational Physics, 23(2):187–199, 1977.
  • [40] Paul F Tupper. Ergodicity and the numerical simulation of Hamiltonian systems. SIAM Journal on Applied Dynamical Systems, 4(3):563–587, 2005.
  • [41] J. von Neumann. Proof of the quasi-ergodic hypothesis. Proc. Natl. Acad. Sci. USA, 18:70–82, 1932.
  • [42] Chang Zhang. On the Improvements and Innovations of Monte Carlo Methods. PhD thesis, School of Mathematics, https://research-information.bris.ac.uk/en/studentTheses/on-the-improvements-and-innovations-of-monte-carlo-methods, June 2022.