Scaling of piecewise deterministic Monte Carlo for anisotropic targets
Abstract
Piecewise deterministic Markov processes (PDMPs) are a type of continuous-time Markov process that combine deterministic flows with jumps. Recently, PDMPs have garnered attention within the Monte Carlo community as a potential alternative to traditional Markov chain Monte Carlo (MCMC) methods. The Zig-Zag sampler and the Bouncy Particle Sampler are commonly used examples of the PDMP methodology which have also yielded impressive theoretical properties, but little is known about their robustness to extreme dependence or anisotropy of the target density. It turns out that PDMPs may suffer from poor mixing due to anisotropy, and this paper investigates this effect in detail in the stylised but important Gaussian case. To this end, we employ a multi-scale analysis framework. Our results show that when the Gaussian target distribution has two scales, of order $1$ and $\epsilon$, the computational cost of the Bouncy Particle Sampler is of order $\epsilon^{-1}$, and the computational cost of the Zig-Zag sampler is of order $\epsilon^{-2}$. In comparison, the cost of traditional MCMC methods such as the random walk Metropolis (RWM) algorithm is of order $\epsilon^{-2}$, at least when the dimensionality of the small component is greater than $1$. Therefore, there is a robustness advantage to using PDMPs in this context.
1 Introduction
Piecewise deterministic Markov processes (PDMPs) are continuous-time non-reversible Markov processes driven by deterministic flows and jumps. Recently, PDMPs have attracted the attention of the Monte Carlo community as a possible alternative to traditional Markov chain Monte Carlo methods. The Zig-Zag sampler BierkensFearnheadRoberts2016 and the Bouncy Particle Sampler PhysRevE.85.026703; BouchardCote2017 are commonly used examples of the PDMP methodology to approximate target distributions defined on $\mathbb{R}^d$. Both Markov samplers operate on subsets of $\mathbb{R}^d \times \mathbb{R}^d$, with states represented as position–velocity pairs $(x, v)$. When these processes are ergodic, the empirical averages of the Markov processes in the $x$-coordinate converge to the corresponding expectations under the original target distribution, by the law of large numbers.
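To make the flow-plus-jumps structure concrete, the following minimal sketch simulates a Bouncy Particle Sampler for a centered bivariate Gaussian target. The function name, the specific precision matrix and the refresh rate are ours and purely illustrative; for a Gaussian target the event times can be inverted in closed form, so no thinning is needed here.

```python
import numpy as np

def bps_gaussian(Prec, T, refresh_rate, rng):
    """Minimal Bouncy Particle Sampler sketch for a centered Gaussian target
    with density proportional to exp(-x' Prec x / 2).

    Between events the state (x, v) follows the deterministic flow x + v t.
    Bounces occur with intensity (v . grad U(x + v t))_+, which is linear in t
    for a Gaussian target and can be inverted exactly; velocity refreshments
    occur at a constant rate to ensure ergodicity.
    Returns the skeleton of event times, positions and velocities.
    """
    d = Prec.shape[0]
    x, v = np.zeros(d), rng.standard_normal(d)
    t, skeleton = 0.0, [(0.0, x.copy(), v.copy())]
    while t < T:
        a, b = v @ Prec @ x, v @ Prec @ v      # bounce intensity is (a + b s)_+, with b > 0
        E = rng.exponential()
        if a >= 0.0:                           # intensity positive from the start
            tau_bounce = (-a + np.sqrt(a * a + 2.0 * b * E)) / b
        else:                                  # intensity is zero until s = -a / b
            tau_bounce = -a / b + np.sqrt(2.0 * E / b)
        tau_refresh = rng.exponential(1.0 / refresh_rate)
        tau = min(tau_bounce, tau_refresh)
        t += tau
        x = x + v * tau                        # deterministic flow between events
        if tau_bounce <= tau_refresh:
            grad = Prec @ x
            v = v - 2.0 * (v @ grad) / (grad @ grad) * grad   # reflect in the gradient
        else:
            v = rng.standard_normal(d)         # full velocity refreshment
        skeleton.append((t, x.copy(), v.copy()))
    return skeleton

# Illustrative run on a mildly correlated bivariate Gaussian.
skeleton = bps_gaussian(Prec=np.array([[1.0, -0.3], [-0.3, 2.0]]),
                        T=100.0, refresh_rate=1.0, rng=np.random.default_rng(0))
```

Expectations under the target are then estimated by integrating test functions of the position along the piecewise linear trajectory between skeleton points, rather than by averaging over the skeleton points themselves.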
These samplers have yielded impressive theoretical properties, for example in high-dimensional contexts deligiannidisrandomized; andrieu2021hypocoercivity; bierkens2018high, in terms of asymptotic variance reduction BierkensDuncan2016, and in the context of MCMC with large data sets BierkensFearnheadRoberts2016.
It is well known that all MCMC methods suffer from a deterioration of mixing properties when the target distribution is anisotropic. For example, the mixing of random walk Metropolis, Metropolis-adjusted Langevin and Hamiltonian schemes can become arbitrarily bad in any given problem as the proposal distribution becomes increasingly inappropriately scaled (see for example the studies in RGG; roberts1998optimal; roberts2001optimal; beskos2013optimal; beskos2018; Graham2022). Meanwhile, the Gibbs sampler is known to perform poorly for highly correlated target distributions; see for example robsah97; zanella2021multilevel. In principle these problems can be overcome using preconditioning (a priori reparameterisations of the state space), often carried out using adaptive MCMC (andrieu2008tutorial; roberts2009examples). However, effective preconditioning can be very difficult in higher-dimensional problems. Therefore, it is important to understand the extent of this deterioration of mixing in the face of anisotropy or dependence.
PDMPs can also suffer from poor mixing due to anisotropy, and this paper studies this effect in detail in the stylised but important Gaussian case. In principle, continuous-time processes are well suited to this problem because, unlike traditional MCMC methods, they do not require a rejection scheme and therefore avoid the associated scaling issue. However, this does not guarantee that the processes will be effective for this problem. Even if the process mixes well, it may require a large number of jumps, resulting in a high computational cost. To understand this quantitatively, it is important to determine the influence that anisotropies in the target distribution have on the speed of convergence to equilibrium.
However, for continuous-time processes a further consideration relevant to the computational efficiency of an algorithm is some measure of the computational effort expended per unit of stochastic process time. It is difficult to be precise about how to measure this in general. Here we will use a commonly adopted surrogate: the number of jumps per unit time. Effectively, this assumes that we have access to an efficient implementation scheme for our PDMP via an appropriate Poisson thinning scheme, which is not always available in practice (see Section LABEL:sec:2dim-discussion for details).
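For concreteness, the following generic sketch shows the thinning step used to generate a single event time. The constant bound M on the switching intensity along the current trajectory is assumed to be available; obtaining such a bound is precisely the non-trivial ingredient referred to above.

```python
import numpy as np

def first_event_by_thinning(rate, M, rng, t_max=np.inf):
    """Sample the first event time of an inhomogeneous Poisson process with
    intensity rate(t) <= M along the current deterministic trajectory,
    by thinning a homogeneous Poisson process of intensity M."""
    t = 0.0
    while t < t_max:
        t += rng.exponential(1.0 / M)        # candidate event from the dominating process
        if rng.uniform() < rate(t) / M:      # accept with probability rate(t) / M
            return t
    return np.inf                            # no event before t_max
```

The expected number of candidate events per accepted event is governed by how tight the bound M is; a loose bound inflates the computational effort without changing the law of the sampler.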
This quantification of the computational cost can be achieved through scaling analysis, which first appeared in the literature, in a mathematically rigorous form, in RGG. In this paper, we follow beskos2018, which studied the random walk Metropolis algorithm for anisotropic target distributions. To avoid technical difficulties, that work considered a simple target distribution with both small- and large-scale components, where the two components are conditionally independent. For such algorithms the scale of the proposal noise must be chosen appropriately in advance, otherwise the algorithm effectively freezes. By contrast, piecewise deterministic Markov processes adapt to the scale naturally, without artificial parameter tuning. From a theoretical point of view, however, this makes the analysis more complicated, since the process changes its dynamics depending on the target distribution. We therefore study a simple Gaussian example and identify the factors that change the dynamics, in order to better understand the behaviour of these processes for anisotropic target distributions.
The main contributions of this paper are thus the following:
1. We give a theoretical and rigorous study of the robustness of the Zig-Zag and BPS algorithms in the case of increasingly extreme anisotropy (where one component is scaled by a factor $\epsilon$ which converges to $0$ in the limit). Specifically, we derive weak convergence results that characterise the limiting dynamics of the slow component in the extreme anisotropy limit.
2. For the Zig-Zag sampler, we see that, unless the slow component is either (a) perfectly aligned with one of the possible velocity directions, or (b) independent of the fast component, the limiting behaviour is a reversible Ornstein–Uhlenbeck process that mixes in time of order $\epsilon^{-1}$. Considering that the number of switches per unit time scales as $\epsilon^{-1}$ (by Theorem LABEL:theo:zz-costs), the total complexity grows as $\epsilon^{-2}$.
3. For the BPS, in Theorem LABEL:theo:bps-main we show that its limiting behaviour is a deterministic dynamical system for which we can explicitly write down the generating ODE. Mixing occurs in time of order $1$, which, when combined with a cost of order $\epsilon^{-1}$ switches per unit time, yields an overall complexity of $\epsilon^{-1}$; see the schematic summary following this list.
The paper is organised as follows. In Section 2 we explore the problem in a 2-dimensional setting where we can easily illustrate and motivate our results. Here we describe the limiting processes using the homogenisation and averaging techniques of Pavliotis2008. Section LABEL:sect:main then describes our main results in detail and signposts the proofs. Section LABEL:sec:leading then gives the detailed proofs for the Zig-Zag and BPS respectively, supported by technical material from the supplementary material (BierkensKamatanRoberts2024). More specifically, in Section LABEL:sec:leading, we illustrate the properties of the leading-order generator. Additionally, in the supplementary material, we demonstrate weak convergence for the Zig-Zag sampler through homogenisation, and for the Bouncy Particle Sampler through averaging methods. A final discussion is given in Section LABEL:sect:disc.
2 Exhibit of main results in the two-dimensional case
To better understand our results, we first consider the two-dimensional case, which is easier to interpret than the general result. Consider the situation of a centered two-dimensional Gaussian target distribution with one large and one small component, the latter of scale $\epsilon$. Indeed, let the target be the centered Gaussian distribution with covariance matrix given by
(1)
We will show that the Markov processes of the Zig-Zag sampler and the Bouncy Particle Sampler converge to limiting processes as $\epsilon \to 0$, provided that we choose the correct scaling in space and time.
To illustrate the impact of $\epsilon$, we will use a reparametrised variable for our scaling analysis:
(2)
By this variable scaling, the invariant distribution of the Markov semigroup corresponding to the reparametrised process has the same position marginal for both samplers, while the velocity marginals differ: uniform on $\{-1,+1\}^2$ for the Zig-Zag sampler, and standard Gaussian on $\mathbb{R}^2$ for the Bouncy Particle Sampler.
2.1 The Zig-Zag sampler in the two-dimensional case
The Markov generator of the two-dimensional Zig-Zag sampler, after the reparametrisation (2), is
(3)
where the switching rates are the positive parts of the directional derivatives of the potential, and $F_i$ represents the operation that flips the sign of the $i$-th velocity component. Figure 1 displays the trajectories of the Zig-Zag sampler in a two-dimensional scenario, for particular choices of $\epsilon$ and of the covariance in (1).
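For orientation, recall that before any reparametrisation the standard two-dimensional Zig-Zag generator for a target density proportional to $\exp(-U(x))$ takes the form (see BierkensFearnheadRoberts2016)
\[
\mathcal{L}f(x,v) \;=\; \sum_{i=1}^{2} v_i\,\partial_{x_i} f(x,v) \;+\; \sum_{i=1}^{2} \big(v_i\,\partial_{x_i} U(x)\big)_{+}\,\big(f(x, F_i v) - f(x,v)\big),
\qquad v \in \{-1,+1\}^{2},
\]
and (3) is the analogue of this expression after the change of variables (2), in which the terms associated with the small-scale component pick up factors of $\epsilon^{-1}$.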
Figure 1: Trajectories of the Zig-Zag sampler in the two-dimensional Gaussian example.
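Since the panels of Figure 1 are not reproduced here, the following sketch can be used to generate qualitatively similar trajectories. It simulates a two-dimensional Zig-Zag process by exact inversion of the linear switching intensities; the rotated covariance with scales $1$ and $\epsilon$ used at the end is our own illustrative choice and is not necessarily the parametrisation of (1).

```python
import numpy as np

def first_event_linear_rate(a, b, E):
    """Smallest tau >= 0 with  int_0^tau (a + b s)_+ ds = E  (np.inf if no such tau).

    Along a straight Zig-Zag trajectory x + v t for a Gaussian target, every
    switching intensity has exactly this (a + b t)_+ form, so event times can
    be drawn in closed form and no thinning is needed."""
    if b > 0.0:
        if a >= 0.0:
            return (-a + np.sqrt(a * a + 2.0 * b * E)) / b
        return -a / b + np.sqrt(2.0 * E / b)          # intensity is zero until s = -a/b
    if b == 0.0:
        return E / a if a > 0.0 else np.inf
    # b < 0: the intensity reaches zero at s = -a/b and stays there
    if a <= 0.0 or E >= -a * a / (2.0 * b):
        return np.inf
    return (-a + np.sqrt(a * a + 2.0 * b * E)) / b

def zigzag_gaussian(Prec, T, rng):
    """Zig-Zag skeleton for a centered Gaussian target with precision matrix Prec."""
    d = Prec.shape[0]
    x = np.zeros(d)
    v = rng.choice([-1.0, 1.0], size=d)
    t, skeleton = 0.0, [(0.0, x.copy(), v.copy())]
    while t < T:
        grad_coef = Prec @ x                  # gradient of U at the current position
        slope_coef = Prec @ v                 # change of the gradient along the flow
        taus = [first_event_linear_rate(v[i] * grad_coef[i], v[i] * slope_coef[i],
                                        rng.exponential()) for i in range(d)]
        i = int(np.argmin(taus))              # component whose velocity flips first
        t += taus[i]
        x = x + v * taus[i]                   # deterministic drift between events
        v[i] = -v[i]
        skeleton.append((t, x.copy(), v.copy()))
    return skeleton

# Illustrative anisotropic target: principal scales 1 and eps along rotated axes.
eps, theta = 0.05, np.pi / 6.0
Q = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
Sigma = Q @ np.diag([1.0, eps ** 2]) @ Q.T
skeleton = zigzag_gaussian(np.linalg.inv(Sigma), T=50.0, rng=np.random.default_rng(1))
```

Plotting the two position coordinates of the skeleton against time shows the behaviour described in the following paragraph: the motion in the narrow direction decorrelates almost immediately, while the motion in the wide direction proceeds through many rapid velocity flips and explores its range only slowly.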
Our first observation from Figure 1 is that the small-scale component exhibits good mixing, whereas the large-scale component does not. This result is expected, as the terms in the expression for the generator that drive the fast dynamics are proportional to $\epsilon^{-1}$, whereas the terms driving the dynamics of the large-scale component are of order $1$. When $\epsilon$ is sufficiently small, the fast dynamics are described by a leading-order operator, which is derived from the full generator by retaining only the terms proportional to $\epsilon^{-1}$: