
Zig-zag sampling for discrete structures and non-reversible phylogenetic MCMC

Jere Koskela
[email protected]
Department of Statistics
University of Warwick
Coventry CV4 7AL, UK
Abstract

We construct a zig-zag process targeting a posterior distribution defined on a hybrid state space consisting of both discrete and continuous variables. The construction does not require any assumptions on the structure among discrete variables. We demonstrate our method on two examples in genetics based on the Kingman coalescent, showing that the zig-zag process can lead to efficiency gains of up to several orders of magnitude over classical Metropolis–Hastings algorithms, and that it is well suited to parallel computation. Our construction resembles existing techniques for Hamiltonian Monte Carlo on a hybrid state space, which suffer from boundary crossings that are complex both to implement and to analyze when applied to the coalescent. We demonstrate that the continuous-time zig-zag process avoids these complications.

Keywords: Bayesian inference; coalescent; hybrid state space; piecewise deterministic Markov process; zig-zag process

1 Introduction

The zig-zag process is a non-reversible, piecewise deterministic Markov process introduced by [BR17, BFR19b] for continuous-time MCMC. It has several advantages over reversible methods such as Metropolis–Hastings [Has70] and Gibbs sampling [GS90]: it avoids diffusive backtracking which slows their mixing, and is rejection-free so that no computation is wasted on rejected moves.

In brief, the generator of the zig-zag process $(\mathbf{x}_{t},\mathbf{v}_{t})_{t\geq 0}$ targeting the probability density (with respect to the product of the Lebesgue and counting measures) $\tilde{\pi}(\mathbf{x},\mathbf{v}):=\pi(\mathbf{x})/2^{d}$ on a state space $X\times\{-1,1\}^{d}\subseteq\mathbb{R}^{d}\times\{-1,1\}^{d}$ is

$Lf(\mathbf{x},\mathbf{v})=\sum_{i=1}^{d}v_{i}\partial_{i}f(\mathbf{x},\mathbf{v})+\sum_{i=1}^{d}\lambda_{i}(\mathbf{x},\mathbf{v})(f(\mathbf{x},F_{i}\mathbf{v})-f(\mathbf{x},\mathbf{v})),$

where $\partial_{i}$ is the derivative with respect to $x_{i}$, and $F_{i}$ flips the sign of $v_{i}$. The flip rates

$\lambda_{i}(\mathbf{x},\mathbf{v}):=(-v_{i}\partial_{i}\log(\tilde{\pi}(\mathbf{x},\mathbf{v})))^{+}$ (1)

with $(x)^{+}:=\max\{x,0\}$, ensure that $(\mathbf{x}_{t},\mathbf{v}_{t})_{t\geq 0}$ leaves $\tilde{\pi}(\mathbf{x},\mathbf{v})$ invariant [BFR19b, Theorem 2.2]. In words, the coordinates of $\mathbf{x}_{t}$ move at constant velocities $\mathbf{v}_{t}$ until a flip at coordinate $i$, when the corresponding velocity changes sign.
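To make the dynamics concrete, the following minimal sketch (an illustration, not part of the paper) simulates a one-dimensional zig-zag process for a standard Gaussian target, where the integrated rate in (1) can be inverted in closed form; the function name and the ergodic-average outputs are assumptions for this example.

```python
import math
import random

def zigzag_gaussian(t_end, seed=1):
    """Simulate a 1-d zig-zag process targeting N(0, 1).

    For pi(x) proportional to exp(-x^2/2), the flip rate (1) is
    lambda(x, v) = (v x)^+, and the event time rho solving
    int_0^rho (v x + s)^+ ds = U is available in closed form.
    """
    rng = random.Random(seed)
    x, v, t = 0.0, 1.0, 0.0
    # accumulate time integrals of x and x^2 along the piecewise-linear path
    m1, m2 = 0.0, 0.0
    while t < t_end:
        u = rng.expovariate(1.0)
        a = v * x
        rho = -a + math.sqrt(max(a, 0.0) ** 2 + 2.0 * u)  # next flip time
        tau = min(rho, t_end - t)
        # segment integrals: int_0^tau (x + v s) ds and int_0^tau (x + v s)^2 ds
        m1 += x * tau + v * tau ** 2 / 2.0
        m2 += x ** 2 * tau + x * v * tau ** 2 + tau ** 3 / 3.0
        x += v * tau
        t += tau
        v = -v  # velocity flip
    return m1 / t_end, m2 / t_end

mean, var = zigzag_gaussian(50000.0)
```

The time averages of $x$ and $x^{2}$ along the trajectory estimate the target's mean and variance, illustrating that the continuous path itself (not just event locations) carries the Monte Carlo information.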

To date, zig-zag processes have been applied to targets such as the Curie–Weiss model [BR17] and logistic regression [BFR19b], whose state spaces have simple geometric structures with natural notions of direction and velocity. Discrete variables (other than the velocities) have been restricted to cases with simple partial orders, such as model selection [CFS20, GD21]. We construct a zig-zag process on a general hybrid state space with both continuous and discrete coordinates, without imposing any structure on the discrete coordinates. This is done by introducing a separate space of continuous variables for each value of the discrete variable, introducing boundaries into the continuous spaces, and updating the discrete variable when boundaries are hit. The strategy takes advantage of the full generality of piecewise-deterministic Markov processes [Dav93, Section 24], and resembles similar work for Hamiltonian Monte Carlo (HMC) [DBZM17, NDL20]. Our method can also be seen as a generalization of the zig-zag process on a restricted domain [BBCD+18] to a union of many restricted sub-domains, with jumps between sub-domains at boundary hitting events. We illustrate our method with an application to the coalescent [Kin82]: a tree-valued target with continuous branch lengths, discrete tree topologies with no natural partial order, and no canonical geometric structure.

The coalescent examples illustrate the need for methods which are implementable on complex state spaces. They are also of interest because existing MCMC algorithms for coalescents tend to mix slowly. The key difficulty lies in designing Metropolis–Hastings proposal distributions which combine efficient exploration with a high acceptance rate [MV05, HDPD08, LVH+08]. Workarounds consist of empirical searches for efficient proposals [HD12, ASR16] or Metropolis-coupled MCMC [Gey92]. The former does not scale to problems for which empirical optimization is infeasible. The latter helps mixing between modes, but does not overcome low acceptance rates or the backtracking behavior of reversible MCMC.

The zig-zag process has some similarities with HMC [Nea10], which augments the state space with momentum and uses Hamiltonian dynamics to propose large steps which are accepted with high probability, though HMC is not rejection-free. Like the zig-zag process, HMC requires gradient information and a suitable geometric embedding of the target. [DBZM17] provided both for the coalescent using an orthant complex construction of phylogenetic tree space [BHV01]. Our examples differ from the method of [DBZM17] in three ways. Firstly, we replace the embedding of [BHV01] with $\tau$-space [GD16], which is better suited to ultrametric trees. Secondly, the zig-zag process is readily implementable on $\tau$-space via Poisson thinning, without a numerical integrator such as leapfrog [DBZM17, Algorithm 1]. Finally, the zig-zag process has simple boundary behavior between orthants and does not require boundary smoothing [DBZM17, Section 3.3], chiefly because discontinuous gradients are easier to handle on continuous rather than discretized paths.

The rest of the paper is structured as follows. Section 2 defines the zig-zag algorithm with discrete and continuous variables, and proves that it has the desired invariant distribution. Section 3 recalls the coalescent and the $\tau$-space embedding. In Sections 4 and 5 we recall the popular infinite and finite sites models of mutation, derive zig-zag processes for each, and demonstrate their performance via simulation studies. Section 6 concludes with a discussion. The algorithms and data sets used in the simulation studies are available at https://github.com/JereKoskela/tree-zig-zag.

2 Zig-zag on hybrid spaces

The definition of our zig-zag process follows [Dav93, Section 24]. Let $\mathbb{F}$ be a countable set. For each $m\in\mathbb{F}$, let $\Omega_{m}^{o}$ be an open subset of $\mathbb{R}^{d}$, $\overline{\Omega_{m}^{o}}$ its closure, and $\partial\Omega_{m}^{*}:=\overline{\Omega_{m}^{o}}\setminus\Omega_{m}^{o}$ its boundary. We assume that $\{\partial\Omega_{m}^{*}\}_{m\in\mathbb{F}}$ are piecewise Lipschitz and denote by $\partial\Omega_{m}$ the restriction of $\partial\Omega_{m}^{*}$ to non-corner points. Let $\Omega^{o}:=\cup_{m\in\mathbb{F}}\Omega_{m}^{o}=\{(m,\mathbf{x}):m\in\mathbb{F},\mathbf{x}\in\Omega_{m}^{o}\}$ and $\partial\Omega:=\cup_{m\in\mathbb{F}}\partial\Omega_{m}$. For a point $(m,\mathbf{x})\in\partial\Omega_{m}$, let $n(m,\mathbf{x})$ be the outward unit normal, and let $\Gamma^{\pm}(m,\mathbf{x}):=\{\mathbf{v}\in\{-1,1\}^{d}:\pm(\mathbf{v}\cdot n(m,\mathbf{x}))>0\}$ be the sets of velocities with which a zig-zag process can exit ($+$) or enter ($-$) $\Omega_{m}^{o}$ at $\mathbf{x}$. Zig-zag dynamics imply $\mathbf{v}\in\Gamma^{+}(m,\mathbf{x})\Leftrightarrow-\mathbf{v}\in\Gamma^{-}(m,\mathbf{x})$. We also define $\Gamma^{\pm}(\partial\Omega):=\cup_{(m,\mathbf{x})\in\partial\Omega}(\{(m,\mathbf{x})\}\times\Gamma^{\pm}(m,\mathbf{x}))$ and $\Omega^{*}:=(\Omega^{o}\times\{-1,1\}^{d})\cup\Gamma^{-}(\partial\Omega)$. Integrals over $\Omega^{o}$ and $\partial\Omega$, or subsets thereof, are taken to incorporate discrete sums in the $m\in\mathbb{F}$ coordinate.

The zig-zag process $(m_{t},\mathbf{x}_{t},\mathbf{v}_{t})_{t\geq 0}$ is defined on $\Omega^{*}$, with target $\tilde{\pi}(m,\mathbf{x},\mathbf{v}):=\pi(m,\mathbf{x})/2^{d}$ on $\Omega^{*}\cup\Gamma^{+}(\partial\Omega)$ for a given density $\pi(m,\mathbf{x})$. At $(m,\mathbf{x},\mathbf{v})\in\Omega^{*}$, the process moves with velocity $\mathbf{v}$ and each coordinate $v_{i}$ flips at rate $\lambda_{i}(m,\mathbf{x},\mathbf{v})$, defined as in (1) since $m$ is fixed between boundary hitting events. When $(m,\mathbf{x},\mathbf{v})\in\Gamma^{+}(\partial\Omega)$, the process jumps according to a Markov kernel $Q:\Gamma^{+}(\partial\Omega)\mapsto\mathcal{P}(\Gamma^{-}(\partial\Omega))$, where $\mathcal{P}(A)$ denotes the set of probability measures on $(A,\mathcal{B}(A))$. We assume that $Q$ and $\tilde{\pi}$ satisfy the skew-detailed balance condition

$\tilde{\pi}(m,\mathbf{x},\mathbf{v})Q(m,\mathbf{x},\mathbf{v};j,\mathrm{d}\mathbf{y},\mathbf{w})\mathrm{d}\mathbf{x}=\tilde{\pi}(j,\mathbf{y},-\mathbf{w})Q(j,\mathbf{y},-\mathbf{w};m,\mathrm{d}\mathbf{x},-\mathbf{v})\mathrm{d}\mathbf{y}$ (2)

for any $(m,\mathbf{x},\mathbf{v})\in\Gamma^{+}(\partial\Omega)$ and $(j,\mathbf{y},\mathbf{w})\in\Gamma^{-}(\partial\Omega)$, as well as

$\int_{(j,\mathbf{y})\in\partial\Omega}\sum_{\mathbf{w}\in\Gamma^{-}(j,\mathbf{y})}(\mathbf{w}\cdot n(j,\mathbf{y}))Q(m,\mathbf{x},\mathbf{v};j,\mathrm{d}\mathbf{y},\mathbf{w})=-\mathbf{v}\cdot n(m,\mathbf{x}),$ (3)

for any $(m,\mathbf{x},\mathbf{v})\in\Gamma^{+}(\partial\Omega)$, and exclude jumps to paths pointing into corners by assuming

$\int_{(m,\mathbf{x})\in\partial\Omega}\sum_{\mathbf{v}\in\Gamma^{+}(m,\mathbf{x})}\int_{(j,\mathbf{y})\in\partial\Omega}\sum_{\mathbf{w}\in\Gamma^{-}(j,\mathbf{y})}\mathds{1}_{(\cup_{m\in\mathbb{F}}\partial\Omega_{m}^{*})\setminus\partial\Omega}(j,\mathbf{y}+T_{\Gamma^{+}(\partial\Omega)}(j,\mathbf{y},\mathbf{w})\mathbf{w})\,Q(m,\mathbf{x},\mathbf{v};j,\mathrm{d}\mathbf{y},\mathbf{w})\mathrm{d}\mathbf{x}=0,$ (4)

where $T_{\Gamma^{+}(\partial\Omega)}(j,\mathbf{y},\mathbf{w})$ is the time at which the line $(j,\mathbf{y}+t\mathbf{w})_{t\geq 0}$ hits $\Gamma^{+}(\partial\Omega)$. We will abuse notation and use $\tilde{\pi}(m,\mathbf{x},\mathbf{v})\mathrm{d}\mathbf{x}$ for the target density and Lebesgue measure on $\Omega^{o}$, and for their restrictions to the surface $\partial\Omega$, on which $\tilde{\pi}$ is not a probability density.

By [Dav93, Theorem 26.14], the zig-zag process with dynamics defined above is a piecewise-deterministic Markov process with extended generator

$Lf(m,\mathbf{x},\mathbf{v})=\sum_{i=1}^{d}v_{i}\partial_{i}f(m,\mathbf{x},\mathbf{v})+\sum_{i=1}^{d}\lambda_{i}(m,\mathbf{x},\mathbf{v})(f(m,\mathbf{x},F_{i}\mathbf{v})-f(m,\mathbf{x},\mathbf{v})),$ (5)

whose domain $D(L)$ consists of measurable functions $f(m,\mathbf{x},\mathbf{v})$ on $\Omega^{*}$ satisfying:

  1. For each $(m,\mathbf{x},\mathbf{v})\in\Gamma^{+}(\partial\Omega)$, the limit $\lim_{t\to 0}f(m,\mathbf{x}-t\mathbf{v},\mathbf{v})=:f(m,\mathbf{x},\mathbf{v})$ exists.

  2. For each $(m,\mathbf{x},\mathbf{v})\in\Omega^{*}$, the function $t\mapsto f(m,\mathbf{x}+t\mathbf{v},\mathbf{v})$ is absolutely continuous on $t\in[0,T_{\Gamma^{+}(\partial\Omega)}(m,\mathbf{x},\mathbf{v}))$.

  3. For $(m,\mathbf{x},\mathbf{v})\in\Gamma^{+}(\partial\Omega)$,

     $f(m,\mathbf{x},\mathbf{v})=\int_{(j,\mathbf{y})\in\partial\Omega}\sum_{\mathbf{w}\in\Gamma^{-}(j,\mathbf{y})}f(j,\mathbf{y},\mathbf{w})Q(m,\mathbf{x},\mathbf{v};j,\mathrm{d}\mathbf{y},\mathbf{w}).$ (6)
  4. The random variable $\sum_{k=1}^{n}[f(m_{T_{k}},\mathbf{x}_{T_{k}},\mathbf{v}_{T_{k}})-f(m_{T_{k}-},\mathbf{x}_{T_{k}-},\mathbf{v}_{T_{k}-})]$ is integrable for each $n\in\mathbb{N}$, where $\{T_{k}\}_{k\geq 1}$ are the successive jump times (both velocity flips and jumps due to hitting a boundary) of $(m_{t},\mathbf{x}_{t},\mathbf{v}_{t})_{t\geq 0}$.

For a set $A$, let $B(A)$ and $C^{1}(A)$ be the respective spaces of bounded and continuously differentiable functions on $A$. Let $C_{b}^{1}(A):=B(A)\cap C^{1}(A)$. For $t>0$, we define

$N_{t}:=\#\{\text{velocity flips and boundary jumps in }(m_{s},\mathbf{x}_{s},\mathbf{v}_{s})_{s=0}^{t}\}.$
Theorem 1.

Suppose $\mathbb{F}$ is finite, $\tilde{\pi}(m,\cdot,\mathbf{v})\in C^{1}(\Omega_{m}^{o})$ for each $\mathbf{v}\in\{-1,1\}^{d}$ and $m\in\mathbb{F}$, $\tilde{\pi}>0$ on $\Omega^{o}\times\{-1,1\}^{d}$, that $Q(m,\mathbf{x},\mathbf{v};\cdot)$ has compact support for each $(m,\mathbf{x},\mathbf{v})\in\Gamma^{+}(\partial\Omega)$, and that for each $t>0$ and $(m,\mathbf{x},\mathbf{v})\in\Omega^{*}$,

$\mathbb{E}[N_{t}|(m_{0},\mathbf{x}_{0},\mathbf{v}_{0})=(m,\mathbf{x},\mathbf{v})]<\infty.$ (7)

Suppose the initial distribution of $(m,\mathbf{x})$ has a density on $\Omega^{*}$ and that (2), (3), and (4) hold. Then the zig-zag process with generator (5) and domain $D(L)$ as described above has stationary distribution $\tilde{\pi}$.

Proof.

Provided in the appendix. ∎

Remark 1.

In addition to having the right invariant distribution, a practical algorithm needs to be ergodic. To discuss ergodicity of the zig-zag process $(m_{t},\mathbf{x}_{t},\mathbf{v}_{t})_{t\geq 0}$ from Theorem 1, let $\{(m_{t}^{j},\mathbf{x}_{t}^{j},\mathbf{v}_{t}^{j})_{t\geq 0}\}_{j\in\mathbb{F}}$ be zig-zag processes restricted to the respective spaces $\Omega_{j}^{o}$ by boundary jump kernels $Q^{j}:\cup_{\mathbf{x}\in\partial\Omega_{j}}[\{(j,\mathbf{x})\}\times\Gamma^{+}(j,\mathbf{x})]\mapsto\mathcal{P}(\cup_{\mathbf{x}\in\partial\Omega_{j}}[\{(j,\mathbf{x})\}\times\Gamma^{-}(j,\mathbf{x})])$, each with target proportional to $\tilde{\pi}(j,\cdot,\cdot)$. When $\mathbb{F}$ is finite, a sufficient condition for ergodicity of the global process $(m_{t},\mathbf{x}_{t},\mathbf{v}_{t})_{t\geq 0}$ is that $\{(m_{t}^{j},\mathbf{x}_{t}^{j},\mathbf{v}_{t}^{j})_{t\geq 0}\}_{j\in\mathbb{F}}$ are all ergodic, and that

$\int_{\mathbf{x}\in\partial\Omega_{m}}\sum_{\mathbf{v}\in\Gamma^{+}(m,\mathbf{x})}\int_{\mathbf{y}\in\partial\Omega_{j}}\sum_{\mathbf{w}\in\Gamma^{-}(j,\mathbf{y})}Q(m,\mathbf{x},\mathbf{v};j,\mathrm{d}\mathbf{y},\mathbf{w})\tilde{\pi}(m,\mathbf{x},\mathbf{v})\mathrm{d}\mathbf{x}>0,$ (8)

for ordered pairs $(m,j)\in\mathbb{F}^{2}$ which form a cycle spanning the support of $\tilde{\pi}$. [BRZ19] provide conditions for ergodicity of single-domain zig-zag processes.

We conclude this section with a pseudocode specification of our zig-zag algorithm.

Algorithm 1 Simulating the zig-zag process targeting $\tilde{\pi}$
1: Input: initial condition $(m,\mathbf{x}_{0},\mathbf{v}_{0})$, target $\tilde{\pi}$, jump kernel $Q$, terminal time $t_{\text{end}}$
2: Set $t\leftarrow 0$, $\mathbf{x}\leftarrow\mathbf{x}_{0}$, $\mathbf{v}\leftarrow\mathbf{v}_{0}$
3: while $t<t_{\text{end}}$ do
4:     Set $\tau\leftarrow T_{\Gamma^{+}(\partial\Omega)}(m,\mathbf{x},\mathbf{v})$ and $I\leftarrow 0$. $\triangleright$ $I=0\Rightarrow$ boundary hit.
5:     for $i\in\{1,2,\ldots,d\}$ do
6:         Sample $U\sim\text{Exp}(1)$.
7:         Set $\rho$ to be the solution of $\int_{0}^{\rho}\lambda_{i}(m,\mathbf{x}+s\mathbf{v},\mathbf{v})\,\mathrm{d}s=U$.
8:         if $\rho<\tau$ then
9:             Set $\tau\leftarrow\rho$ and $I\leftarrow i$
10:     Set $t\leftarrow t+\tau$, $\mathbf{x}\leftarrow\mathbf{x}+\tau\mathbf{v}$
11:     if $I>0$ then
12:         Set $v_{I}\leftarrow-v_{I}$
13:     else
14:         Set $(m,\mathbf{x},\mathbf{v})\leftarrow(j,\mathbf{y},\mathbf{w})\sim Q(m,\mathbf{x},\mathbf{v};\cdot,\cdot,\cdot)$
15: end while
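As a toy instance of Algorithm 1 (an illustration, not the paper's coalescent implementation), take $\mathbb{F}=\{0,1\}$ with $\Omega_{0}^{o}=\Omega_{1}^{o}=(0,1)$ glued into a circle, a density that is smooth across the gluing points, and a deterministic boundary kernel $Q$ carrying the process onto the other copy with its velocity intact; the exact inversion on line 7 is replaced by Poisson thinning with a constant dominating rate. All function names and the specific density are assumptions for this sketch.

```python
import math
import random

def log_pi(m, x):
    # smooth density on two unit intervals glued into a circle:
    # theta = m + x in [0, 2), pi(theta) prop. to exp(cos(pi theta) + 0.5 sin(pi theta))
    th = m + x
    return math.cos(math.pi * th) + 0.5 * math.sin(math.pi * th)

def dlog_pi(m, x):
    th = m + x
    return -math.pi * math.sin(math.pi * th) + 0.5 * math.pi * math.cos(math.pi * th)

def hybrid_zigzag(t_end, seed=3):
    rng = random.Random(seed)
    m, x, v, t = 0, 0.5, 1.0, 0.0
    lam_bar = 2.0 * math.pi          # constant dominating flip rate
    time_in = [0.0, 0.0]             # occupation time of each copy
    while t < t_end:
        # time to the next boundary of the current copy (line 4)
        tau_b = (1.0 - x) / v if v > 0 else x / -v
        # candidate flip time by thinning (lines 5-9)
        tau_f = rng.expovariate(lam_bar)
        tau = min(tau_b, tau_f, t_end - t)
        time_in[m] += tau
        x += v * tau
        t += tau
        if tau == tau_f:
            # accept the flip with probability lambda/lam_bar, lambda as in (1)
            if rng.random() < max(-v * dlog_pi(m, x), 0.0) / lam_bar:
                v = -v
        elif tau == tau_b:
            # boundary jump Q: continue onto the other copy (line 14)
            m, x = 1 - m, 0.0 if v > 0 else 1.0
    return time_in[0] / t_end

frac0 = hybrid_zigzag(20000.0)
```

Because the density matches across the gluing, the deterministic crossing kernel satisfies the skew-detailed balance condition (2), and the occupation fraction of copy $m=0$ estimates its posterior mass.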

3 The coalescent and a geometric embedding

An ultrametric binary tree with $n$ labeled leaves is a rooted binary tree in which each leaf is an equal graph distance away from the root. We follow [GD16] and encode such a tree with leaf labels $\{1,\ldots,n\}$ as the pair $(E_{n},\mathbf{t}_{n})$, where $E_{n}$ is a ranked topology and $\mathbf{t}_{n}\in(0,\infty)^{n-1}$. The continuous variables $\mathbf{t}_{n}$ encode times between mergers. The time from the leaves to the first merger is $t_{1}$, and subsequent $t_{i}$ variables are times between successive mergers. The ranked topology $E_{n}$ is an $(n-1)$-tuple of pairs of labels, where the $i$th pair specifies the two child nodes of the $i$th merger. Non-leaf nodes are labeled by the leaves they subtend. For example, the ranked topology

$E_{4}=(E_{4,1},E_{4,2},E_{4,3}):=(\{1,2\},\{\{1,2\},3\},\{\{1,2,3\},4\})$

encodes the four-leaf caterpillar tree depicted in Figures 3 and 7 with nodes labeled left to right. We order the two entries of $E_{n,i}$ by their least element for definiteness.

The coalescent [Kin82] is a seminal model for the genetic ancestry of samples from large populations. Under the coalescent, a tree $(E_{n},\mathbf{t}_{n})$ has probability density

$\pi(E_{n},\mathbf{t}_{n})\mathrm{d}\mathbf{t}_{n}:=\exp\Big(-\sum_{i=1}^{n-1}\binom{n+1-i}{2}t_{i}\Big)\mathrm{d}\mathbf{t}_{n},$ (9)

which arises as the law of a tree constructed by starting a lineage from each leaf, and merging each pair of lineages at rate 1 until the most recent common ancestor (MRCA) is reached. The success of the coalescent is due to robustness: distributions of ancestries of a large class of individual-based models converge to the coalescent in the infinite population limit under suitable rescalings of time. For details, see e.g. [Wak09].
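The constructive description above is straightforward to simulate. The sketch below samples a pair $(E_{n},\mathbf{t}_{n})$ and evaluates the log of (9), with nodes labeled by the sets of leaves they subtend as in the encoding above; the function names and the frozenset representation are illustrative assumptions, not the paper's implementation.

```python
import random

def binom2(k):
    # number of lineage pairs among k lineages
    return k * (k - 1) // 2

def sample_coalescent(n, rng):
    """Sample a ranked topology E_n and holding times t_n from (9).

    Nodes are frozensets of the leaves they subtend; each merger is
    recorded as a pair ordered by least element.
    """
    lineages = [frozenset([i]) for i in range(1, n + 1)]
    E, t = [], []
    while len(lineages) > 1:
        k = len(lineages)
        t.append(rng.expovariate(binom2(k)))  # each pair merges at rate 1
        i, j = rng.sample(range(k), 2)
        a, b = lineages[i], lineages[j]
        pair = (a, b) if min(a) < min(b) else (b, a)
        E.append(pair)
        lineages = [l for l in lineages if l not in (a, b)] + [a | b]
    return tuple(E), t

def log_density(t):
    """log of (9): holding time t_i has rate binom(n+1-i, 2)."""
    n = len(t) + 1
    return -sum(binom2(n + 1 - i) * t[i - 1] for i in range(1, n))
```

For instance, with $n=4$ the first holding time $t_{1}$ is exponential with rate $\binom{4}{2}=6$, so its sample mean should be close to $1/6$.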

We define swap operators $s_{i}$ for $i\in\{2,\ldots,n-2\}:E_{n,i-1}\notin E_{n,i}$ via

$s_{i}(E_{n}):=(E_{n,1}^{\prime},\ldots,E_{n,n-1}^{\prime})\text{ where }E_{n,k}^{\prime}:=\begin{cases}E_{n,k}&\text{for }k<i-1\text{ or }k>i,\\E_{n,i}&\text{for }k=i-1,\\E_{n,i-1}&\text{for }k=i.\end{cases}$

In words, $s_{i}$ swaps the order of the $(i-1)$th and $i$th mergers. We also define pivot operators $p_{i}^{\downarrow}$ and $p_{i}^{\uparrow}$ for $i\in\{2,\ldots,n-1\}:E_{n,i-1}\in E_{n,i}$ as

$p_{i}^{\downarrow}(E_{n}):=(E_{n,1}^{\prime},\ldots,E_{n,n-1}^{\prime})\text{ with }E_{n,k}^{\prime}:=\begin{cases}E_{n,k}&\text{for }k\notin\{i-1,i\},\\\{E_{n,i-1}^{\downarrow},E_{n,i}^{s}\}&\text{for }k=i-1,\\\{E_{n,i-1}^{\prime},E_{n,i-1}^{\uparrow}\}&\text{for }k=i,\end{cases}$ (10)

where $E_{n,i}^{\uparrow}$ (resp. $E_{n,i}^{\downarrow}$) is the entry of $E_{n,i}$ with the higher (resp. lower) least element, and $E_{n,i}^{s}$ is the sibling: the entry of $E_{n,i}$ that is not $E_{n,i-1}$. The pivot $p_{i}^{\uparrow}$ is defined by interchanging $\downarrow$ and $\uparrow$ in (10). The pivots are the two nearest-neighbor interchanges between the $i$th merger and the merger involving its nearest child. Figure 1 illustrates all three operators.

[Figure 1 diagram: $p_{2}^{\uparrow}(E_{3})=(\{2,3\},\{1,\{2,3\}\})$ and $p_{2}^{\downarrow}(E_{3})=(\{1,3\},\{\{1,3\},2\})$ for $E_{3}=(\{1,2\},\{\{1,2\},3\})$; $s_{2}(E_{4})=(\{1,2\},\{3,4\},\{\{1,2\},\{3,4\}\})$ for $E_{4}=(\{3,4\},\{1,2\},\{\{1,2\},\{3,4\}\})$.]
Figure 1: Operators $p_{2}^{\uparrow}$, $p_{2}^{\downarrow}$, and $s_{2}$. The horizontal arrangement of leaves is arbitrary throughout this paper; only vertical distance is meaningful.
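The operators $s_{i}$, $p_{i}^{\downarrow}$, and $p_{i}^{\uparrow}$ act directly on the tuple encoding of ranked topologies. The sketch below assumes each merger is represented as a pair of frozensets of leaf labels, ordered by least element, with a merger's node label given by the union of its entries; the representation and function names are illustrative assumptions.

```python
def label(merger):
    # node label of a merger: the set of leaves it subtends
    a, b = merger
    return a | b

def ordered(a, b):
    # order a pair of node labels by their least element, as in the text
    return (a, b) if min(a) < min(b) else (b, a)

def swap(E, i):
    """s_i: exchange the (i-1)th and i-th mergers (i is 1-based)."""
    E = list(E)
    E[i - 2], E[i - 1] = E[i - 1], E[i - 2]
    return tuple(E)

def pivot(E, i, down=True):
    """p_i^down (down=True) or p_i^up of (10), valid when the label of
    merger i-1 is an entry of merger i."""
    prev = E[i - 2]                               # merger E_{n,i-1}
    lab = label(prev)
    (sib,) = [e for e in E[i - 1] if e != lab]    # sibling E_{n,i}^s
    lo, hi = ordered(*prev)                       # E^down, E^up entries
    keep, lift = (lo, hi) if down else (hi, lo)
    new_prev = ordered(keep, sib)
    new_cur = ordered(label(new_prev), lift)
    E = list(E)
    E[i - 2], E[i - 1] = new_prev, new_cur
    return tuple(E)
```

Applied to $E_{3}=(\{1,2\},\{\{1,2\},3\})$, the two pivots reproduce the trees of Figure 1.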

Next we describe $\tau$-space, which gives a geometric structure to the set of pairs $(E_{n},\mathbf{t}_{n})$. For fixed $E_{n}$, the space of $\mathbf{t}_{n}$-vectors is the orthant $[0,\infty)^{n-1}$. Each boundary point with $t_{i}=0$ corresponds to a degenerate tree in which one of three things happens:

  1. The two leaves of $E_{n,1}$ merge at time 0 if $i=1$.

  2. There are two simultaneous mergers if $E_{n,i-1}\notin E_{n,i}$.

  3. There is a simultaneous merger of three lineages if $E_{n,i-1}\in E_{n,i}$.

Type 1 boundaries are boundaries of the whole $\tau$-space. Type 2 boundaries separate orthants corresponding to two ranked topologies separated by an $s_{i}$-step; trajectories crossing the boundary move from one ranked topology to the other. Type 3 boundaries separate the three orthants which resolve the triple merger into two binary mergers, which differ by a $p_{i}^{\downarrow}$- or $p_{i}^{\uparrow}$-step; a trajectory that crosses the boundary visits a tree with the triple merger. Figure 2 depicts the $\tau$-space with three leaves, and a type 2 boundary in a general $\tau$-space. An example with five leaves is depicted in [GD16, Figure 2], and the $\mathbf{t}_{n}$ variables are illustrated in Figures 3 and 7 in Sections 4 and 5.
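The three boundary cases can be read off the ranked topology directly. A small sketch, assuming each merger is encoded as a pair of frozensets of leaf labels (an illustrative representation, not the paper's):

```python
def boundary_type(E, i):
    """Classify the tau-space boundary hit when t_i -> 0 (1-based i).

    E is a ranked topology: a tuple of mergers, each a pair of
    frozensets of leaf labels; the label of merger i-1 is the union
    of its two entries.
    """
    if i == 1:
        return 1  # the first two leaves merge at time 0
    prev_label = E[i - 2][0] | E[i - 2][1]
    # type 3 if merger i-1 feeds directly into merger i, else type 2
    return 3 if prev_label in E[i - 1] else 2
```

For the four-leaf caterpillar topology every interior boundary is type 3, whereas the balanced topology $(\{3,4\},\{1,2\},\{\{1,2\},\{3,4\}\})$ has a type 2 boundary at $t_{2}=0$.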

Figure 2: (Left) $\tau$-space with $n=3$ embedded into $\mathbb{R}^{3}$. Each square is a copy of $[0,\infty)^{2}$ associated with the given topology. The coordinates $(t_{1},t_{2})$ are the time of the first merger and the time between the first and second mergers, respectively. The dot is the origin, and the line on which all three orthants intersect is a type 3 boundary consisting of trees in which all three leaves merge simultaneously at time $t_{1}$. The dashed lines are boundaries at $\infty$. (Right) A segment of $\tau$-space depicting a type 2 boundary, in which each square represents $[0,\infty)^{n-1}$. Only the two orthants adjacent to the boundary are shown.

We will use $\tau$-space to construct zig-zag processes whose state spaces consist of tree topologies, branch lengths, and a scalar parameter introduced in the next section. The discrete variables $\mathbb{F}$ will be the ranked topologies, with the boundary crossings described above defining the boundary jump kernel $Q$. For more on $\tau$-space, e.g. existence and uniqueness of geodesics and Fréchet means, we refer to [GD16].

4 The infinite sites model

The infinite sites model [Wat75] connects the coalescent tree to DNA sequence data by associating the MRCA with the unit interval $(0,1)$. Mutations with independent, $U(0,1)$-distributed locations accrue along branches of the tree (with branch lengths as specified by $\mathbf{t}_{n}$) at rate $\theta/2$. The type of a leaf consists of the mutations along the branches separating it from the MRCA. We denote the resulting list of leaf types by $D_{n}$. A realization of a coalescent tree and associated $D_{n}$ is shown in Figure 3. The typical task is to sample from the conditional law $(E_{n},\mathbf{t}_{n},\theta)|D_{n}$ corresponding to observing $D_{n}$, but not the tree which gave rise to it.

Figure 3: A realization of the infinite sites model with $n=4$, two mutations, three types, and $D_{n}=(\{0.7\},\{0.7\},\{\},\{0.2\})$. The holding times $\mathbf{t}_{4}$ are shown on the left.

To handle mutations, we define $F_{n}$ as the rooted graphical tree with $2n-1$ nodes, the first $n$ of which are leaves labeled $1,\ldots,n$, while the remaining $n-1$ are labeled as in $E_{n}$. Edges connect children to their parents, and edge lengths are determined by $\mathbf{t}_{n}$. For an edge $\gamma\in F_{n}$, we denote by $c_{\gamma}$ and $p_{\gamma}$ the respective labels of the child and parent nodes of $\gamma$, by $m_{\gamma}$ the number of mutations on $\gamma$, and by $l_{\gamma}:=\sum_{j:t_{j}\in\gamma}t_{j}$ the edge length, where we write $t_{j}\in\gamma$ if $t_{j}$ contributes to the length of $\gamma$, in which case we say $\gamma$ spans $t_{j}$.

Given a prior density $\pi_{0}(\theta)$ for $\theta$, the posterior distribution of $(E_{n},\mathbf{t}_{n},\theta)|D_{n}$ follows from (9), the fact that the number of mutations on branch $\gamma\in F_{n}$ is Poisson($\theta l_{\gamma}/2$)-distributed given $l_{\gamma}$, and the fact that mutations on distinct branches are independent. In particular,

$\pi(E_{n},\mathbf{t}_{n},\theta|D_{n})\propto\Bigg\{\prod_{\gamma\in F_{n}}\Big(\frac{\theta l_{\gamma}}{2}\Big)^{m_{\gamma}}\Bigg\}\exp\Bigg(-\sum_{i=1}^{n-1}\frac{(n+1-i)(n+\theta-i)}{2}t_{i}\Bigg)\pi_{0}(\theta)$ (11)

provided $E_{n}$ is consistent with $D_{n}$, and $\pi=0$ otherwise. This distribution can be sampled using a zig-zag algorithm by taking $\mathbb{F}$ to be the set of ranked topologies on $n$ leaves which are consistent with $D_{n}$, as well as $\Omega_{E_{n}}^{o}:=\{(\mathbf{t}_{n},\theta)\in(0,\infty)^{n}\}$ and $\partial\Omega_{E_{n}}:=\{(\mathbf{t}_{n},\theta)\in[0,\infty)^{n}:t_{i}=0\text{ for exactly one }i\in\{1,\ldots,n-1\}\text{ or }\theta=0\}$ for each $E_{n}\in\mathbb{F}$. In the boundary classification of Section 3, $\theta=0$ is another type 1 boundary. For $(\mathbf{t}_{n},\theta)\in\partial\Omega$, we define $k(\mathbf{t}_{n},\theta)$ as the index of the zero variable, taken to be $n$ in the case of $\theta$.

We augment the state space with $n$ zig-zag velocities $\mathbf{v}_{n}$, of which $(v_{1},\ldots,v_{n-1})$ drive $\mathbf{t}_{n}$ and $v_{n}$ drives $\theta$. For $\gamma\in F_{n}$, we also define $v_{\gamma}:=\sum_{j:t_{j}\in\gamma}v_{j}$ as the rate of change of $l_{\gamma}$. The boundary kernel $Q$ is defined separately on each boundary type:

$Q(E_{n},(\mathbf{t}_{n},\theta),\mathbf{v}_{n};\cdot,\cdot,\cdot):=\begin{cases}\delta_{\{E_{n}\}}(\cdot)\otimes\delta_{\{(\mathbf{t}_{n},\theta)\}}(\cdot)\otimes\delta_{\{F_{k(\mathbf{t}_{n},\theta)}(\mathbf{v}_{n})\}}(\cdot)&\text{for type }1,\\\delta_{\{s_{k(\mathbf{t}_{n},\theta)}(E_{n})\}}(\cdot)\otimes\delta_{\{(\mathbf{t}_{n},\theta)\}}(\cdot)\otimes\delta_{\{F_{k(\mathbf{t}_{n},\theta)}(\mathbf{v}_{n})\}}(\cdot)&\text{for type }2,\\\Big(\tfrac{1}{2}\delta_{\{p_{k(\mathbf{t}_{n},\theta)}^{\uparrow}(E_{n})\}}(\cdot)+\tfrac{1}{2}\delta_{\{p_{k(\mathbf{t}_{n},\theta)}^{\downarrow}(E_{n})\}}(\cdot)\Big)\otimes\delta_{\{(\mathbf{t}_{n},\theta)\}}(\cdot)\otimes\delta_{\{F_{k(\mathbf{t}_{n},\theta)}(\mathbf{v}_{n})\}}(\cdot)&\text{for type }3.\end{cases}$ (12)

At a type 1 boundary, the process reflects back into $\Omega_{E_{n}}^{o}$ via a velocity flip. At a type 2 boundary, it undergoes an $s_{i}$-step and a velocity flip to pass through the boundary. At a type 3 boundary, it moves into one of the two adjacent orthants chosen uniformly at random.

In the interiors of orthants, velocity flip rates are

$\lambda_{i}(E_{n},\mathbf{t}_{n},\theta;\mathbf{v}_{n}):=\Bigg[v_{i}\Bigg(\frac{(n+1-i)(n+\theta-i)}{2}-\sum_{\gamma\in F_{n}:t_{i}\in\gamma}\frac{m_{\gamma}}{l_{\gamma}}\Bigg)\Bigg]^{+},$ (13)

$\lambda_{\theta}(E_{n},\mathbf{t}_{n},\theta;\mathbf{v}_{n}):=\Bigg[v_{n}\Bigg(\sum_{i=1}^{n-1}\frac{n+1-i}{2}t_{i}-\frac{1}{\theta}\sum_{\gamma\in F_{n}}m_{\gamma}-\partial_{\theta}\log(\pi_{0}(\theta))\Bigg)\Bigg]^{+}.$ (14)
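As a concrete check of (13) and (14), the sketch below evaluates the flip rates for the caterpillar tree of Figure 3 under a flat prior, so the $\partial_{\theta}\log\pi_{0}$ term vanishes. The hard-coded edge list and the function signature are illustrative assumptions, not the paper's implementation.

```python
# Edges of F_4 for the caterpillar topology of Figure 3, encoded as
# (set of indices j with t_j in gamma, number of mutations m_gamma).
EDGES = [
    ({1}, 0),        # leaf 1 -> node {1,2}
    ({1}, 0),        # leaf 2 -> node {1,2}
    ({2}, 1),        # node {1,2} -> node {1,2,3}: carries the 0.7 mutation
    ({1, 2}, 0),     # leaf 3 -> node {1,2,3}
    ({3}, 0),        # node {1,2,3} -> root
    ({1, 2, 3}, 1),  # leaf 4 -> root: carries the 0.2 mutation
]

def flip_rates(t, theta, v, edges):
    """Flip rates (13)-(14) for a flat prior pi_0.

    t has length n-1, v has length n with v[-1] driving theta.
    """
    n = len(t) + 1
    lengths = [sum(t[j - 1] for j in span) for span, _ in edges]
    rates = []
    for i in range(1, n):
        grad = (n + 1 - i) * (n + theta - i) / 2.0
        grad -= sum(m / l for (span, m), l in zip(edges, lengths) if i in span)
        rates.append(max(v[i - 1] * grad, 0.0))
    # theta-rate (14), with the prior term dropped for the flat prior
    grad_theta = sum((n + 1 - i) / 2.0 * t[i - 1] for i in range(1, n))
    grad_theta -= sum(m for _, m in edges) / theta
    rates.append(max(v[-1] * grad_theta, 0.0))
    return rates
```

With $\mathbf{t}=(1,1,1)$, $\theta=2$, and all velocities positive, the rates can be verified by hand against (13) and (14).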

Simulating holding times with these rates is difficult due to the time intervals during which they vanish. One strategy is Poisson thinning via dominating rates consisting of only those terms in the round brackets in (13) and (14) whose sign matches that of the corresponding velocity $v_{i}$, but these can result in loose bounds and inefficient algorithms. Instead, we set $t_{n}:=\theta$ for brevity, and for $i\in\{1,\ldots,n-1\}$ define $\gamma(E_{n},i):=\operatorname{argmin}\{l_{\gamma}:p_{\gamma}=E_{n,i}\}$ as the shorter child branch of parent node $E_{n,i}$. For $i\in\{1,n\}$ we adopt the short-hands

$m_{\gamma(E_{n},n)}:=\sum_{\gamma\in F_{n}}m_{\gamma}\qquad\text{and}\qquad m_{\gamma(E_{n},1)}:=\sum_{\gamma\in F_{n}:p_{\gamma}=E_{n,1}}m_{\gamma},$

and finally define the time localization $T\equiv T(E_{n},\mathbf{t}_{n},\theta;\mathbf{v}_{n})$ as

$T:=\min\Bigg\{\min_{i\in\{1,\ldots,n\}:v_{i}<0}\Bigg\{\frac{-t_{i}}{\{1+c[\mathds{1}_{E_{n,i}}(E_{n,i-1})+\mathds{1}_{\{1,n\}}(i)]\mathds{1}_{\mathbb{Z}_{+}}(m_{\gamma(E_{n},i)})\}v_{i}}\Bigg\},K\Bigg\},$ (15)

where $\min_{\emptyset}=\infty$, and $K\gg 0$ is a maximum increment for the case when all velocities are positive. The indicator functions in the denominator pick out boundaries at which (11) vanishes: type 1 or 3 boundaries corresponding to length-0 branches which carry at least one mutation, and the $\theta=0$ boundary if there is at least one mutation in total. The parameter $c>0$ ensures that, when the current process time is $t$, such boundaries cannot be hit during $[t,t+T]$, and that at most one other boundary can be reached. We found $c=4$ gave good performance across our tests in Sections 4 and 5. A larger value results in tighter bounds on (13) and (14), but wastes more computation as $t+T$ is reached more often before an accepted velocity flip.

On the time interval $[t,t+T]$, the flip rates (13) and (14) are bounded above by the constant rates

$\lambda_{i}^{*}:=\Bigg[v_{i}\Bigg(\frac{(n+1-i)(n+\theta+(v_{n}T)^{\pm}-i)}{2}-\sum_{\gamma\in F_{n}:t_{i}\in\gamma}\frac{m_{\gamma}}{l_{\gamma}+(v_{\gamma}T)^{\pm}}\Bigg)\Bigg]^{+},$

$\lambda_{\theta}^{*}:=\Bigg[v_{n}\Bigg(\sum_{i=1}^{n-1}\frac{n+1-i}{2}[t_{i}+(v_{i}T)^{\pm}]-\frac{1}{\theta+v_{n}T}\sum_{\gamma\in F_{n}}m_{\gamma}-\inf_{s\in[0,T]}\{\partial_{\theta}\log(\pi_{0}(\theta+v_{n}s))\}\Bigg)\Bigg]^{+},$

where, for each $\lambda_{i}^{*}$ with $i\in\{1,\ldots,n-1,\theta\}$, $(x)^{\pm}:=(x)^{+}$ if $v_{i}>0$ and $(x)^{\pm}:=(x)^{-}:=\min\{x,0\}$ if $v_{i}<0$. Algorithms 2 and 3 in the appendix give pseudocode for simulating holding times, velocity flips, and boundary crossings as outlined above.
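The localized-thinning pattern itself (bound the rate by a constant on a window $[t,t+T]$, propose events at the bound, accept with probability $\lambda/\lambda^{*}$, and re-bound when the window is exhausted) can be sketched generically. The helper below and its toy rate are assumptions for illustration, not the paper's Algorithms 2 and 3.

```python
import random

def thinned_event_time(lam, lam_bar_fn, horizon_fn, rng, t_max):
    """Sample the first event of an inhomogeneous Poisson process with
    rate lam(t) by localized thinning: on each window [t, t+T] (with T
    from horizon_fn, in the spirit of (15)) a constant dominating rate
    lam_bar >= lam is supplied by lam_bar_fn.
    """
    t = 0.0
    while t < t_max:
        T = horizon_fn(t)
        lam_bar = lam_bar_fn(t, T)
        s = t
        while True:
            if lam_bar <= 0.0:
                s = t + T          # rate is zero on this window
                break
            s += rng.expovariate(lam_bar)
            if s >= t + T:
                s = t + T          # window exhausted without an event
                break
            if rng.random() < lam(s) / lam_bar:
                return s           # accepted event (velocity flip) time
        t = s                      # move to the next window and re-bound
    return float("inf")
```

For a linearly growing rate $\lambda(t)=t$, with windows of length $0.5$ and the bound $\lambda^{*}=t+T$, the first event time has mean $\sqrt{\pi/2}\approx 1.2533$, which the sampler reproduces.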

Proposition 1.

Suppose that the initial condition $(E_{n},\mathbf{t}_{n},\theta)$ has a positive density on $\mathbb{F}\times(0,\infty)^{n}$, that $\pi_{0}>0$, and that $\partial_{\theta}\log(\pi_{0}(\theta))$ is bounded on compact subsets of $(0,\infty)$. Then (11) is stationary under the dynamics simulated by Algorithms 2 and 3.

Proof.

Provided in the appendix. ∎

We compared the zig-zag process to a Metropolis–Hastings algorithm by reanalyzing the data of [WFDP91] with n=55n=55, 14 distinct types, and 18 mutations. We used the improper prior π0(θ)𝟙(0,)(θ)\pi_{0}(\theta)\propto\mathds{1}_{(0,\infty)}(\theta), set vi=±2/[(n+1i)(ni)]v_{i}=\pm 2/[(n+1-i)(n-i)], and chose the θ\theta-velocity vnv_{n} from trial runs so that the process crosses the θ\theta-mode in unit time. The appendix details the Metropolis–Hastings algorithm; Table 3 lists the remaining tuning parameters. We also compared both methods to a hybrid combining zig-zag dynamics with continuous-time Metropolis–Hastings moves at rate κ=10\kappa=10. Performance was insensitive to κ\kappa provided it was not extreme: small values resemble a zig-zag process, while large values resemble Metropolis–Hastings.

Figure 4: Trace plots under the infinite sites model and the data set of [WFDP91].

Figure 4 shows that the zig-zag and hybrid methods mix visibly better than Metropolis–Hastings over the latent tree, as measured by the tree height Hn:=t1++tn1H_{n}:=t_{1}+\ldots+t_{n-1}. However, they are not as effective at exploring the upper tail of the θ\theta-marginal, likely because they do not stay in regions of short trees for long enough for θ\theta to increase into the tail.

To assess scaling, we simulated two data sets: one of size n=550n=550 with mutation rate θ=5.5\theta=5.5 (the approximate posterior mean in Figure 4) and one with n=55n=55 and θ=55\theta=55, which models a segment of DNA 10 times longer. Figures 5 and 6 demonstrate that the zig-zag and hybrid processes scale far better than Metropolis–Hastings, particularly when θ=55\theta=55. Estimates in Table 1 place the improvement at 1–3 orders of magnitude.

Figure 5: Trace plots for the infinite sites model and the data set with n=550n=550, θ=5.5\theta=5.5, 30 distinct types, and 38 mutations.
Figure 6: Trace plots for the infinite sites model and the data set with n=55n=55, θ=55\theta=55, 40 distinct types, and 252 mutations.
Data set Method ESS(θ\theta)/sec ESS(HnH_{n})/sec Run time (min)
[WFDP91] Metropolis 5 3 3.5
Zig-zag 160 179 0.5
Hybrid 128 67 0.5
n=550n=550, θ=5.5\theta=5.5 Metropolis 0.10.1^{*} 0.0020.002^{*} 70
Zig-zag 1.7 1.2 45
Hybrid 0.5 0.3 115
n=55n=55, θ=55\theta=55 Metropolis 0.10.1^{*} 0.10.1^{*} 50
Zig-zag 36 37 1
Hybrid 22 22 1.5
Table 1: Effective sample sizes and run times for all three methods and data sets for the infinite sites model. Estimates were computed with the ess method [FHVD17] under default settings for Metropolis–Hastings, and as in [BFR19a, Section 2] for the other two. Stars highlight poorly mixing chains with unreliable ESS estimates.
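Effective sample sizes of the kind reported above can be sketched with a generic batch-means estimator. The routine below is a simplified stand-in for illustration, not the exact estimators of [FHVD17] or [BFR19a], and the two synthetic chains are our own test inputs.

```python
import numpy as np

def ess_batch_means(x, n_batches=25):
    """Batch-means effective sample size: n * Var(x) / sigma2_hat, where
    sigma2_hat estimates the asymptotic variance of the chain's mean."""
    n = (len(x) // n_batches) * n_batches
    x = np.asarray(x[:n], dtype=float)
    batch_means = x.reshape(n_batches, -1).mean(axis=1)
    sigma2_hat = (n // n_batches) * batch_means.var(ddof=1)
    return n * x.var(ddof=1) / sigma2_hat

rng = np.random.default_rng(0)
iid = rng.normal(size=10000)   # uncorrelated chain: ESS close to chain length
ar = np.empty(10000)           # AR(1) chain with strong positive autocorrelation
ar[0] = 0.0
for t in range(1, len(ar)):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()
```

For independent draws the estimate is close to the number of samples, while positive autocorrelation shrinks it, which is the contrast Table 1 quantifies per second of run time.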

5 The finite sites model

The finite sites model [JC69] is more detailed than the infinite sites model, but has greater computational cost. Consider a finite set of sites SS and a finite set HH of possible types at each site; for example H={0,1}H=\{0,1\} or H={A,T,C,G}H=\{A,T,C,G\}. Mutations occur along branches of the coalescent tree at each site with rate θ/(2|S|)\theta/(2|S|), and the type of a mutant child is drawn from a stochastic matrix PP with unique stationary distribution ν\nu. We denote the transition matrix of the HH-valued compound Poisson process with jump rate θ/(2|S|)\theta/(2|S|) and jump transition matrix PP by (Qhgθ(t))h,gH;t0(Q_{hg}^{\theta}(t))_{h,g\in H;t\geq 0}. Figure 7 depicts a realization of the finite sites coalescent.
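The transition matrix of this compound Poisson process is the exponential of its generator: with per-site jump rate λ = θ/(2|S|), the generator is λ(P − I) and Q^θ(t) = exp(tλ(P − I)). A minimal sketch, using a truncated Taylor series for the matrix exponential to stay dependency-free; the numeric rate and branch length are illustrative:

```python
import numpy as np

def transition_matrix(P, lam, t, terms=60):
    """Q(t) = exp(t * lam * (P - I)): transition matrix of a compound Poisson
    process with jump rate lam and jump transition matrix P."""
    G = lam * (P - np.eye(P.shape[0]))  # generator of the mutation process
    Q = np.eye(P.shape[0])
    term = np.eye(P.shape[0])
    for k in range(1, terms):           # truncated Taylor series of expm(t*G)
        term = term @ (t * G) / k
        Q = Q + term
    return Q

# Two-state jump chain that always switches state, as in Section 5's example.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
lam, t = 0.25, 2.0  # illustrative values of theta/(2|S|) and a branch length
Q = transition_matrix(P, lam, t)
```

For this always-switching PP the closed form is Q_hh(t) = ½(1 + e^{−2λt}), rows sum to one, and ν = (1/2, 1/2) is stationary.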

As in Section 4, we denote the configuration of types at the leaves by DnD_{n}, and seek to sample from the posterior π(En,𝐭n,θ|Dn)\pi(E_{n},\mathbf{t}_{n},\theta|D_{n}), which can be written as a sum over the types of internal nodes:

π(En,𝐭n,θ|Dn)\displaystyle\pi(E_{n},\mathbf{t}_{n},\theta|D_{n})\propto{} {sSh(s,cγ)H for γFn:|cγ|>1γFnQh(s;pγ)h(s;cγ)θ(lγ)}\displaystyle\Bigg{\{}\prod_{s\in S}\sum_{\begin{subarray}{c}h(s,c_{\gamma})\in H\\ \text{ for }\gamma\in F_{n}:|c_{\gamma}|>1\end{subarray}}\prod_{\gamma\in F_{n}}Q_{h(s;p_{\gamma})h(s;c_{\gamma})}^{\theta}(l_{\gamma})\Bigg{\}}
×exp(i=1n1(n+1i2)ti)π0(θ),\displaystyle\times\exp\Bigg{(}-\sum_{i=1}^{n-1}\binom{n+1-i}{2}t_{i}\Bigg{)}\pi_{0}(\theta), (16)

where h(s;η)Hh(s;\eta)\in H is the type at site sSs\in S on the node with label ηEn\eta\in E_{n}, and γFn:|cγ|>1\gamma\in F_{n}:|c_{\gamma}|>1 denotes edges which do not end in a leaf. The target (16) can be evaluated efficiently using the pruning algorithm of [Fel81].
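The pruning recursion computes, for one site, the partial likelihoods of the leaf data below each node in a single post-order pass, which is what makes the sum over internal types in (16) tractable. A minimal sketch for a two-state model; the tuple-based tree encoding is our own, chosen for brevity:

```python
import numpy as np

def transition(theta, t):
    """Two-state transition function of the always-switching jump chain."""
    e = np.exp(-theta * t)
    return np.array([[0.5 * (1 + e), 0.5 * (1 - e)],
                     [0.5 * (1 - e), 0.5 * (1 + e)]])

def prune(node, theta):
    """Partial likelihoods of the leaf data below `node` for a single site.
    A node is ('leaf', type) or ('node', (child, branch_len), (child, branch_len))."""
    if node[0] == 'leaf':
        L = np.zeros(2)
        L[node[1]] = 1.0
        return L
    _, (left, t_left), (right, t_right) = node
    left_like = transition(theta, t_left) @ prune(left, theta)
    right_like = transition(theta, t_right) @ prune(right, theta)
    return left_like * right_like

# Cherry of leaves with types 0 and 1, joined to a third leaf of type 0.
tree = ('node',
        (('node', (('leaf', 0), 0.3), (('leaf', 1), 0.3)), 0.2),
        (('leaf', 0), 0.5))
root = prune(tree, theta=1.0)
likelihood = 0.5 * root.sum()  # uniform root distribution nu = (1/2, 1/2)
```

With θ = 0 no mutation can occur, so this tree with two distinct leaf types has likelihood exactly zero, a useful sanity check.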

Figure 7: A realization of the finite sites model with n=4n=4, S=H={0,1}S=H=\{0,1\}, three mutations, three types, and Dn=(#00,#10,#01,#11)=(2,1,1,0)D_{n}=(\#00,\#10,\#01,\#11)=(2,1,1,0).

The posterior (16) can be sampled using zig-zag dynamics with the same construction as in Section 4. Velocity flip rates can be written in terms of branch-specific gradients as

λi(En,\displaystyle\lambda_{i}(E_{n},{} 𝐭n,θ;𝐯n)=[vi((n+1i2)\displaystyle\mathbf{t}_{n},\theta;\mathbf{v}_{n})=\Bigg{[}v_{i}\Bigg{(}\binom{n+1-i}{2}
sSh(s,cγ)H for γFn:|cγ|>1δFn:tiδ[iQh(s;pδ)h(s;cδ)θ(lδ)]γFn:γδQh(s;pγ)h(s;cγ)θ(lγ)h(s,cγ)H for γFn:|cγ|>1γFnQh(s;pγ)h(s;cγ)θ(lγ))]+,\displaystyle-\sum_{s\in S}\sum_{\begin{subarray}{c}h(s,c_{\gamma})\in H\\ \text{ for }\gamma\in F_{n}:|c_{\gamma}|>1\end{subarray}}\sum_{\delta\in F_{n}:t_{i}\in\delta}\frac{\displaystyle[\partial_{i}Q_{h(s;p_{\delta})h(s;c_{\delta})}^{\theta}(l_{\delta})]\prod_{\gamma\in F_{n}:\gamma\neq\delta}Q_{h(s;p_{\gamma})h(s;c_{\gamma})}^{\theta}(l_{\gamma})}{\displaystyle\sum_{\begin{subarray}{c}h(s,c_{\gamma})\in H\\ \text{ for }\gamma\in F_{n}:|c_{\gamma}|>1\end{subarray}}\prod_{\gamma\in F_{n}}Q_{h(s;p_{\gamma})h(s;c_{\gamma})}^{\theta}(l_{\gamma})}\Bigg{)}\Bigg{]}^{+}, (17)
λθ(En,\displaystyle\lambda_{\theta}(E_{n},{} 𝐭n,θ;𝐯n)=[vn(θlog(π0(θ))\displaystyle\mathbf{t}_{n},\theta;\mathbf{v}_{n})=\Bigg{[}-v_{n}\Bigg{(}\partial_{\theta}\log(\pi_{0}(\theta))
+sSh(s,cγ)H for γFn:|cγ|>1δFn[θQh(s;pδ)h(s;cδ)θ(lδ)]γFn:γδQh(s;pγ)h(s;cγ)θ(lγ)h(s,cγ)H for γFn:|cγ|>1γFnQh(s;pγ)h(s;cγ)θ(lγ))]+,\displaystyle+\sum_{s\in S}\sum_{\begin{subarray}{c}h(s,c_{\gamma})\in H\\ \text{ for }\gamma\in F_{n}:|c_{\gamma}|>1\end{subarray}}\sum_{\delta\in F_{n}}\frac{\displaystyle[\partial_{\theta}Q_{h(s;p_{\delta})h(s;c_{\delta})}^{\theta}(l_{\delta})]\prod_{\gamma\in F_{n}:\gamma\neq\delta}Q_{h(s;p_{\gamma})h(s;c_{\gamma})}^{\theta}(l_{\gamma})}{\displaystyle\sum_{\begin{subarray}{c}h(s,c_{\gamma})\in H\\ \text{ for }\gamma\in F_{n}:|c_{\gamma}|>1\end{subarray}}\prod_{\gamma\in F_{n}}Q_{h(s;p_{\gamma})h(s;c_{\gamma})}^{\theta}(l_{\gamma})}\Bigg{)}\Bigg{]}^{+}, (18)

which can be evaluated using the linear-cost method of [JZH+20].

We show that events with rates (17) and (18) can be simulated using the example of [GT94, Section 7.4], in which S={1,,20}S=\{1,\ldots,20\}, H={0,1}H=\{0,1\} and PP is the 2×22\times 2 matrix which always changes state, corresponding to ν=(1/2,1/2)\nu=(1/2,1/2) and

Qhgθ(t):=12+(𝟙{h}(g)12)eθt.Q_{hg}^{\theta}(t):=\frac{1}{2}+\Big{(}\mathds{1}_{\{h\}}(g)-\frac{1}{2}\Big{)}e^{-\theta t}. (19)

As (19) is not bounded away from 0 when hgh\neq g, (17) and (18) lack simple bounds for Poisson thinning. As in (15), bounds can be obtained by time localization using

TT(En,𝐭n,θ;𝐯n):=min{mini{1,,n}:vi<0{ti[1+c𝟙{1,θ}(i)𝟙+(mγ(En,i))]vi},K},T\equiv T(E_{n},\mathbf{t}_{n},\theta;\mathbf{v}_{n}):=\displaystyle\min\Bigg{\{}\min_{i\in\{1,\ldots,n\}:v_{i}<0}\Big{\{}\frac{-t_{i}}{[1+c\mathds{1}_{\{1,\theta\}}(i)\mathds{1}_{\mathbb{Z}_{+}}(m_{\gamma(E_{n},i)})]v_{i}}\Big{\}},K\Bigg{\}},

where K0K\gg 0 is a default increment in case all velocities are positive. The variable TT localizes the next zig-zag time step beginning at time tt so that at most one branch can shrink to length zero on [t,t+T][t,t+T], θ\theta can fall by at most 1/(1+c)1/(1+c) of its present value, and t1t_{1} can fall by at most 1/(1+c)1/(1+c) of its present value if the first two leaves to merge are of distinct types. This treatment of θ\theta and t1t_{1} is needed as (17) and (18) diverge in these cases, rendering the θ=0\theta=0 and t1=0t_{1}=0 boundaries inaccessible.
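The horizon computation itself is elementary: TT is the smallest boundary-hitting time over coordinates with negative velocity, with the flagged coordinates' hitting times inflated by 1+c1+c, capped at KK. A sketch in which boolean flags stand in for the indicator functions in the display (variable names are ours):

```python
def localization_horizon(times, velocities, inflate, c=4.0, K=10.0):
    """Largest safe time step: a coordinate with v_i < 0 reaches its inflated
    boundary at -t_i / ((1 + c * inflate_i) * v_i); default to K otherwise."""
    T = K
    for t_i, v_i, flagged in zip(times, velocities, inflate):
        if v_i < 0.0:
            d = 1.0 + c if flagged else 1.0
            T = min(T, -t_i / (d * v_i))
    return T

# Three holding times plus theta; only theta's boundary is inflated here.
times = [0.4, 0.8, 1.5, 2.0]
velocities = [-1.0, 1.0, -0.5, -1.0]
inflate = [False, False, False, True]
T = localization_horizon(times, velocities, inflate)
```

With these values T = 0.4, no coordinate can go negative within the step, and the flagged coordinate falls by at most a fraction 1/(1+c) of its current value, as required.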

Given T(0,)T\in(0,\infty), we have the following bounds on (19) on the time interval [t,t+T][t,t+T]:

Qhhθ(lγ)\displaystyle Q_{hh}^{\theta}(l_{\gamma}) 12{1+exp([θ+(vnT)][lγ+(vγT)])},\displaystyle\leq\frac{1}{2}\{1+\exp(-[\theta+(v_{n}T)^{-}][l_{\gamma}+(v_{\gamma}T)^{-}])\},
Qhgθ(lγ)\displaystyle Q_{hg}^{\theta}(l_{\gamma}) 12{1exp([θ+(vnT)+][lγ+(vγT)+])},\displaystyle\leq\frac{1}{2}\{1-\exp(-[\theta+(v_{n}T)^{+}][l_{\gamma}+(v_{\gamma}T)^{+}])\},
Qhhθ(lγ)\displaystyle Q_{hh}^{\theta}(l_{\gamma}) 12{1+exp([θ+(vnT)+][lγ+(vγT)+])},\displaystyle\geq\frac{1}{2}\{1+\exp(-[\theta+(v_{n}T)^{+}][l_{\gamma}+(v_{\gamma}T)^{+}])\},
Qhgθ(lγ)\displaystyle Q_{hg}^{\theta}(l_{\gamma}) 12{1exp([θ+(vnT)][lγ+(vγT)])},\displaystyle\geq\frac{1}{2}\{1-\exp(-[\theta+(v_{n}T)^{-}][l_{\gamma}+(v_{\gamma}T)^{-}])\},

where hgh\neq g. Substituting these bounds into [JZH+20, Equation (9)] provides bounds on flip rates that can be evaluated with O(|S|n)O(|S|n) cost.

Figure 8: Trace plots for the finite sites model and data from [GT94].

Figure 8 demonstrates that the zig-zag process again mixes faster than Metropolis–Hastings over latent trees, but struggles to explore the upper tail of the θ\theta-marginal. The hybrid method was run with κ=100\kappa=100 to compensate for shorter run lengths than in Section 4, and thus resembles Metropolis–Hastings rather than the zig-zag process.

Figures 9 and 10 show results for two further simulated data sets: one with n=500n=500 and S=20S=20, and one with n=50n=50 and S=200S=200. The superior mixing of the zig-zag process over the latent tree is clear. The lack of mixing in the upper tail of the θ\theta-marginal is also stark, particularly in Figure 9 where zig-zag significantly underestimates posterior variance. The estimated posterior means of all three methods coincide in all cases (results not shown).

Figure 9: Trace plots for the finite sites model and data set with n=500n=500 and S=20S=20 consisting of five distinct sequences.
Figure 10: Trace plots for the finite sites model and data set with n=50n=50 and S=200S=200 consisting of 18 distinct sequences.
Data set Method ESS(θ\theta)/sec ESS(HnH_{n})/sec Run time (min)
[GT94] Metropolis 1.8 0.2 29
Zig-zag 2.3 1.0 18
Hybrid 2.1 0.6 16
n=500n=500, S=20S=20 Metropolis 0.09 0.00060.0006^{*} 1567
Zig-zag 0.010.01^{*} 0.004 1749
Hybrid 0.01 0.003 1800
n=50n=50, S=200S=200 Metropolis 0.04 0.03 257
Zig-zag 0.05 0.13 175
Hybrid 0.08 0.07 183
Table 2: Effective sample sizes and run times for all three methods and data sets. Estimates were computed as in Table 1. Stars highlight unmixed chains with unreliable ESS estimates.

6 Discussion

We have presented a general method for using zig-zag processes to sample targets defined on hybrid spaces consisting of discrete and continuous variables. This was done by introducing boundaries into the state space of continuous variables and updating discrete components via a Markov jump kernel QQ whenever a boundary was hit. The resulting algorithm remains a piecewise-deterministic Markov process in the sense of [Dav93, Section 24], and generalizes existing zig-zag processes for restricted domains [BBCD+18]. Crucially, no assumptions on the structure among the discrete variables are required. The key conditions on QQ are the skew-detailed balance (2), which is local, and (3), which involves an integral with respect to QQ but not the target π~\tilde{\pi}. Both are verifiable in applications, and do not require π~\tilde{\pi} to be normalized. Our method is reminiscent of discrete Hamiltonian Monte Carlo [DBZM17, NDL20], but the lack of time discretization simplifies boundary crossings (though see [NDL20, Section S6.4]).

We have demonstrated the method on two examples involving the coalescent, which is a gold-standard model in phylogenetics. It is defined on the space of binary trees consisting of discrete tree topologies and continuous branch lengths, which lacks a simple geometric structure, e.g. a partial order or a tractable norm. We have also shown that the zig-zag process can improve mixing over trees relative to Metropolis–Hastings, particularly under the infinite sites model. This model is widely used to analyze ever-larger data sets, and the zig-zag process shows promise for expanding the scope of feasible MCMC computations.

The zig-zag process was more efficient than Metropolis–Hastings under the infinite sites model in terms of effective sample size, but struggled to explore the tails of the θ\theta-marginal. A likely reason is correlation in the target: high mutation rates are only attainable when branch lengths are short. A Metropolis–Hastings algorithm can jump to a high mutation rate as soon as the latent tree has short branches, while the zig-zag process must traverse all intervening mutation rates before branch lengths grow. The speed-up zig-zag method of [VR21] has state-dependent velocities, and could provide further improvement. The hybrid method with both zig-zag motion and Metropolis–Hastings updates interpolated between the two algorithms.

All three algorithms exhibited much longer run times under the finite sites model than under infinite sites. For the zig-zag and hybrid methods, this is due to the O(|S|n)O(|S|n) cost per evaluation of (17) and (18), of which O(n)O(n) are needed at each step. However, flip times for different velocities are conditionally independent given the current state and can be generated in parallel, unlike steps of the Metropolis–Hastings algorithm. Hence the zig-zag process is well suited to parallel architectures.
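As a sketch of this parallel structure: candidate flip times under the dominating rates are independent exponentials, so a whole vector can be drawn in one call and only the minimum retained. The dominating rates below are placeholders, and a full implementation would then thin the winning candidate against its exact rate:

```python
import numpy as np

def fastest_candidate(rate_bounds, rng):
    """Draw one candidate flip time per coordinate from its dominating rate,
    in a single vectorized call, and return the winning coordinate and time."""
    taus = rng.exponential(1.0 / rate_bounds)  # independent Exp(rate_bounds[i])
    i = int(np.argmin(taus))
    return i, float(taus[i])

rng = np.random.default_rng(42)
rate_bounds = np.array([2.0, 0.5, 1.5, 3.0])  # placeholder dominating rates
i, tau = fastest_candidate(rate_bounds, rng)
```

The minimum of independent Exp(λ_i) variables is Exp(Σ_i λ_i), which gives a quick distributional check of the race.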

Acknowledgements

The author was supported by EPSRC grant EP/R044732/1, and is grateful to Jure Vogrinc and Andi Wang for productive conversations on MCMC for coalescent processes, and on non-reversible MCMC in general.

Appendix

Proof of Theorem 1.

Since the initial point (m,𝐱)(m,\mathbf{x}) has a density, the Poisson construction of velocity flips and (4) ensure that corners (where the process is undefined) are hit with probability 0.

Next, we will verify the four conditions of [Dav93, Section (24.8)]. Firstly, the deterministic zig-zag motion is at a constant, bounded velocity, and hence forms a non-explosive, Lipschitz vector field. The flip rates λi(m,𝐱,𝐯)\lambda_{i}(m,\mathbf{x},\mathbf{v}) are bounded, and hence integrable, on compact subsets of [0,TΓ+(Ω)(m,𝐱,𝐯))[0,T_{\Gamma^{+}(\partial\Omega)}(m,\mathbf{x},\mathbf{v})) because π~\tilde{\pi} is continuously differentiable in 𝐱\mathbf{x} away from boundaries. Velocity flips and boundary transitions both change the state (m,𝐱,𝐯)(m,\mathbf{x},\mathbf{v}) with probability 1, the former by explicit construction and the latter because Γ+(Ω)Γ(Ω)=\Gamma^{+}(\partial\Omega)\cap\Gamma^{-}(\partial\Omega)=\emptyset. The final standard condition is (7). In addition, the first three conditions of [CPWF21, Theorem 1] hold by construction, and hence it suffices for stationarity of π~\tilde{\pi} to verify their fourth condition by checking that for all fD(L)f\in D(L),

𝔼π~[Lf(m,𝐱,𝐯)]=i=1d\displaystyle\mathbb{E}_{\tilde{\pi}}[Lf(m,\mathbf{x},\mathbf{v})]=\sum_{i=1}^{d} (m,𝐱)Ωo𝐯{1,1}d{viif(m,𝐱,𝐯)\displaystyle\int_{(m,\mathbf{x})\in\Omega^{o}}\sum_{\mathbf{v}\in\{-1,1\}^{d}}\{v_{i}\partial_{i}f(m,\mathbf{x},\mathbf{v})
+λi(m,𝐱,𝐯)[f(m,𝐱,Fi𝐯)f(m,𝐱,𝐯)]}π~(m,𝐱,𝐯)d𝐱=0.\displaystyle+\lambda_{i}(m,\mathbf{x},\mathbf{v})[f(m,\mathbf{x},F_{i}\mathbf{v})-f(m,\mathbf{x},\mathbf{v})]\}\tilde{\pi}(m,\mathbf{x},\mathbf{v})\mathrm{d}\mathbf{x}=0.

The change of variable 𝐯Fi𝐯\mathbf{v}\mapsto F_{i}\mathbf{v} combined with the fact that π~(m,𝐱,)\tilde{\pi}(m,\mathbf{x},\cdot) is uniform, then (1), and then an integration by parts yield

i=1d(m,𝐱)Ωo𝐯{1,1}d{viif(m,𝐱,𝐯)+λi(m,𝐱,𝐯)[f(m,𝐱,Fi𝐯)f(m,𝐱,𝐯)]}π~(m,𝐱,𝐯)d𝐱\displaystyle\sum_{i=1}^{d}\int_{(m,\mathbf{x})\in\Omega^{o}}\sum_{\mathbf{v}\in\{-1,1\}^{d}}\{v_{i}\partial_{i}f(m,\mathbf{x},\mathbf{v})+\lambda_{i}(m,\mathbf{x},\mathbf{v})[f(m,\mathbf{x},F_{i}\mathbf{v})-f(m,\mathbf{x},\mathbf{v})]\}\tilde{\pi}(m,\mathbf{x},\mathbf{v})\mathrm{d}\mathbf{x}
=i=1d(m,𝐱)Ωo𝐯{1,1}d{viif(m,𝐱,𝐯)+f(m,𝐱,𝐯)[λi(m,𝐱,Fi𝐯)λi(m,𝐱,𝐯)]}π~(m,𝐱,𝐯)d𝐱\displaystyle=\sum_{i=1}^{d}\int_{(m,\mathbf{x})\in\Omega^{o}}\sum_{\mathbf{v}\in\{-1,1\}^{d}}\{v_{i}\partial_{i}f(m,\mathbf{x},\mathbf{v})+f(m,\mathbf{x},\mathbf{v})[\lambda_{i}(m,\mathbf{x},F_{i}\mathbf{v})-\lambda_{i}(m,\mathbf{x},\mathbf{v})]\}\tilde{\pi}(m,\mathbf{x},\mathbf{v})\mathrm{d}\mathbf{x}
=(m,𝐱)Ωo𝐯{1,1}d𝐯{𝐱f(m,𝐱,𝐯)f(m,𝐱,𝐯)𝐱log(π~(m,𝐱,𝐯))}π~(m,𝐱,𝐯)d𝐱\displaystyle=\int_{(m,\mathbf{x})\in\Omega^{o}}\sum_{\mathbf{v}\in\{-1,1\}^{d}}\mathbf{v}\cdot\{\nabla_{\mathbf{x}}f(m,\mathbf{x},\mathbf{v})-f(m,\mathbf{x},\mathbf{v})\nabla_{\mathbf{x}}\log(\tilde{\pi}(m,\mathbf{x},\mathbf{v}))\}\tilde{\pi}(m,\mathbf{x},\mathbf{v})\mathrm{d}\mathbf{x}
=(m,𝐱)Ω𝐯Γ+(m,𝐱)f(m,𝐱,𝐯)(𝐯n(m,𝐱))π~(m,𝐱,𝐯)d𝐱\displaystyle=-\int_{(m,\mathbf{x})\in\partial\Omega}\sum_{\mathbf{v}\in\Gamma^{+}(m,\mathbf{x})}f(m,\mathbf{x},\mathbf{v})(\mathbf{v}\cdot n(m,\mathbf{x}))\tilde{\pi}(m,\mathbf{x},\mathbf{v})\mathrm{d}\mathbf{x}
(m,𝐱)Ω𝐯Γ(m,𝐱)f(m,𝐱,𝐯)(𝐯n(m,𝐱))π~(m,𝐱,𝐯)d𝐱,\displaystyle\phantom{=-}-\int_{(m,\mathbf{x})\in\partial\Omega}\sum_{\mathbf{v}\in\Gamma^{-}(m,\mathbf{x})}f(m,\mathbf{x},\mathbf{v})(\mathbf{v}\cdot n(m,\mathbf{x}))\tilde{\pi}(m,\mathbf{x},\mathbf{v})\mathrm{d}\mathbf{x}, (20)

where 𝐱\nabla_{\mathbf{x}} is the gradient operator with respect to 𝐱\mathbf{x}. To conclude, we will show that the first term on the right hand side of (20) cancels with the second. Using (6), then (2), and then {𝐯Γ+(m,𝐱)}={𝐯:𝐯Γ(m,𝐱)}\{\mathbf{v}\in\Gamma^{+}(m,\mathbf{x})\}=\{-\mathbf{v}:\mathbf{v}\in\Gamma^{-}(m,\mathbf{x})\},

(m,𝐱)Ω𝐯Γ+(m,𝐱)f(m,𝐱,𝐯)(𝐯n(m,𝐱))π~(m,𝐱,𝐯)d𝐱\displaystyle-\int_{(m,\mathbf{x})\in\partial\Omega}\sum_{\mathbf{v}\in\Gamma^{+}(m,\mathbf{x})}f(m,\mathbf{x},\mathbf{v})(\mathbf{v}\cdot n(m,\mathbf{x}))\tilde{\pi}(m,\mathbf{x},\mathbf{v})\mathrm{d}\mathbf{x}
=(m,𝐱)Ω𝐯Γ+(m,𝐱)[(j,𝐲)Ω𝐰Γ(j,𝐲)f(j,𝐲,𝐰)Q(m,𝐱,𝐯;j,d𝐲,𝐰)]\displaystyle=-\int_{(m,\mathbf{x})\in\partial\Omega}\sum_{\mathbf{v}\in\Gamma^{+}(m,\mathbf{x})}\Bigg{[}\int_{(j,\mathbf{y})\in\partial\Omega}\sum_{\mathbf{w}\in\Gamma^{-}(j,\mathbf{y})}f(j,\mathbf{y},\mathbf{w})Q(m,\mathbf{x},\mathbf{v};j,\mathrm{d}\mathbf{y},\mathbf{w})\Bigg{]}
×(𝐯n(m,𝐱))π~(m,𝐱,𝐯)d𝐱\displaystyle\phantom{=-\int_{(m,\mathbf{x})\in\partial\Omega}\sum_{\mathbf{v}\in\Gamma^{+}(m,\mathbf{x})}\Bigg{[}\int_{(j,\mathbf{y})\in\partial\Omega}\sum_{\mathbf{w}\in\Gamma^{-}(j,\mathbf{y})}f(j,\mathbf{y},\mathbf{w})}\times(\mathbf{v}\cdot n(m,\mathbf{x}))\tilde{\pi}(m,\mathbf{x},\mathbf{v})\mathrm{d}\mathbf{x}
=(m,𝐱)Ω𝐯Γ+(m,𝐱)(j,𝐲)Ω𝐰Γ(j,𝐲)f(m,𝐲,𝐰)(𝐯n(m,𝐱))Q(j,𝐲,𝐰;m,d𝐱,𝐯)\displaystyle=-\int_{(m,\mathbf{x})\in\partial\Omega}\sum_{\mathbf{v}\in\Gamma^{+}(m,\mathbf{x})}\int_{(j,\mathbf{y})\in\partial\Omega}\sum_{\mathbf{w}\in\Gamma^{-}(j,\mathbf{y})}f(m,\mathbf{y},\mathbf{w})(\mathbf{v}\cdot n(m,\mathbf{x}))Q(j,\mathbf{y},-\mathbf{w};m,\mathrm{d}\mathbf{x},-\mathbf{v})
×π~(j,𝐲,𝐰)d𝐲\displaystyle\phantom{=-\int_{(m,\mathbf{x})\in\partial\Omega}\sum_{\mathbf{v}\in\Gamma^{+}(m,\mathbf{x})}\int_{(j,\mathbf{y})\in\partial\Omega}\sum_{\mathbf{w}\in\Gamma^{-}(j,\mathbf{y})}f(m,\mathbf{y},\mathbf{w})(\mathbf{v}\cdot n(m,\mathbf{x}))}\times\tilde{\pi}(j,\mathbf{y},-\mathbf{w})\mathrm{d}\mathbf{y}
=(m,𝐱)Ω𝐯Γ(m,𝐱)(j,𝐲)Ω𝐰Γ(j,𝐲)f(j,𝐲,𝐰)(𝐯n(m,𝐱))Q(j,𝐲,𝐰;m,d𝐱,𝐯)\displaystyle=-\int_{(m,\mathbf{x})\in\partial\Omega}\sum_{\mathbf{v}\in\Gamma^{-}(m,\mathbf{x})}\int_{(j,\mathbf{y})\in\partial\Omega}\sum_{\mathbf{w}\in\Gamma^{-}(j,\mathbf{y})}f(j,\mathbf{y},\mathbf{w})(\mathbf{v}\cdot n(m,\mathbf{x}))Q(j,\mathbf{y},-\mathbf{w};m,\mathrm{d}\mathbf{x},\mathbf{v})
×π~(j,𝐲,𝐰)d𝐲.\displaystyle\phantom{=-\int_{(m,\mathbf{x})\in\partial\Omega}\sum_{\mathbf{v}\in\Gamma^{-}(m,\mathbf{x})}\int_{(j,\mathbf{y})\in\partial\Omega}\sum_{\mathbf{w}\in\Gamma^{-}(j,\mathbf{y})}f(j,\mathbf{y},\mathbf{w})(\mathbf{v}\cdot n(m,\mathbf{x}))}\times\tilde{\pi}(j,\mathbf{y},-\mathbf{w})\mathrm{d}\mathbf{y}.

We now exchange the order of integration (justified by the fact that ff is bounded), then apply (3) and the fact that π~(m,𝐱,)\tilde{\pi}(m,\mathbf{x},\cdot) is uniform to obtain

-\int_{(m,\mathbf{x})\in\partial\Omega}\sum_{\mathbf{v}\in\Gamma^{+}(m,\mathbf{x})}f(m,\mathbf{x},\mathbf{v})(\mathbf{v}\cdot n(m,\mathbf{x}))\tilde{\pi}(m,\mathbf{x},\mathbf{v})\mathrm{d}\mathbf{x}
=-\int_{(j,\mathbf{y})\in\partial\Omega}\sum_{\mathbf{w}\in\Gamma^{-}(j,\mathbf{y})}f(j,\mathbf{y},\mathbf{w})\Bigg{[}\int_{(m,\mathbf{x})\in\partial\Omega}\sum_{\mathbf{v}\in\Gamma^{-}(m,\mathbf{x})}(\mathbf{v}\cdot n(m,\mathbf{x}))Q(j,\mathbf{y},-\mathbf{w};m,\mathrm{d}\mathbf{x},\mathbf{v})\Bigg{]}\tilde{\pi}(j,\mathbf{y},-\mathbf{w})\mathrm{d}\mathbf{y}
=\int_{(j,\mathbf{y})\in\partial\Omega}\sum_{\mathbf{w}\in\Gamma^{-}(j,\mathbf{y})}f(j,\mathbf{y},\mathbf{w})(\mathbf{w}\cdot n(j,\mathbf{y}))\tilde{\pi}(j,\mathbf{y},\mathbf{w})\mathrm{d}\mathbf{y}. (21)

Substituting (21) into (20) concludes the proof. ∎

Proof of Proposition 1.

The proof consists of checking the hypotheses of Theorem 1. The set of ranked topologies of binary trees with nn leaves is finite, and hence so is its restriction to topologies consistent with DnD_{n}. The initial condition has a density by assumption. The set of points from which deterministic motion at a constant velocity first hits a boundary at a corner consists of finitely many Lebesgue-null lines. Hence, since the initial condition has a density and the boundary kernel QQ does not change (𝐭n,θ)(\mathbf{t}_{n},\theta), corners are hit with probability 0.

The augmented posterior π~\tilde{\pi} corresponding to (11) is C1(Ω)C^{1}(\Omega^{*}) by inspection, and only vanishes at a boundary when θ=0\theta=0 or a branch with mutations has length zero.

The unit outward normal is n(𝐭n,θ)=(0,,0,1,0,,0)n(\mathbf{t}_{n},\theta)=(0,\ldots,0,-1,0,\ldots,0), where the 1-1 is in the k(𝐭n,θ)k(\mathbf{t}_{n},\theta)th place. We also have 𝐯n=Fi(Fi(𝐯n))\mathbf{v}_{n}=F_{i}(F_{i}(\mathbf{v}_{n})), En=si(si(En))E_{n}=s_{i}(s_{i}(E_{n})), the third case of (12) is invariant under permutation of {En,pi(En),pi(En)}\{E_{n},p_{i}^{\uparrow}(E_{n}),p_{i}^{\downarrow}(E_{n})\}, and (11) is continuous even at the boundaries (though its gradient is not), so verifying (2) and (3) is a routine calculation.

It remains to verify (7), which we do by stochastic domination. By construction, the zig-zag process crosses boundaries only when a corresponding velocity is negative. The boundary crossing flips the velocity, and a boundary corresponding to the same coordinate cannot be crossed again until a further flip. Hence, the number of jumps in the zig-zag process started at (En,𝐭n,θ)(E_{n},\mathbf{t}_{n},\theta) by time t(0,)t\in(0,\infty) is dominated by n+2Y(𝐭n,θ,t)n+2Y(\mathbf{t}_{n},\theta,t), where

Y(𝐭n,θ,t)Pois(|𝔽|(𝐯n1t)ni=1n1{(n+1i)2[|vi|(n+θ+|vn|ti)+|vn|(ti+|vi|t)]}),Y(\mathbf{t}_{n},\theta,t)\sim\operatorname{Pois}\Bigg{(}|\mathbb{F}|(\|\mathbf{v}_{n}\|_{1}t)^{n}\sum_{i=1}^{n-1}\Big{\{}\frac{(n+1-i)}{2}[|v_{i}|(n+\theta+|v_{n}|t-i)+|v_{n}|(t_{i}+|v_{i}|t)]\Big{\}}\Bigg{)},

which arises as the volume of the set which is reachable from (En,𝐭n,θ)(E_{n},\mathbf{t}_{n},\theta) by time tt multiplied by upper bounds for (13) and (14) for positive velocities on the reachable set. Here 1\|\cdot\|_{1} is the L1L^{1}-norm, which is invariant for all valid zig-zag velocities 𝐯n\mathbf{v}_{n}. In the construction of the dominating random variable, negative velocities are assumed to flip to positive instantaneously which accounts for the initial summand of nn (the maximum number of initial negative velocities) and the factor of 2. ∎

The remainder of the appendix details the zig-zag and Metropolis–Hastings algorithms used in Sections 4 and 5. Algorithms 2 and 3 provide pseudocode implementations of the zig-zag methods. Line 7 of Algorithm 2 is given twice, with the first applicable to Section 4 and the second to Section 5. Otherwise, both algorithms apply to both sections.

Algorithm 2 Simulation of zig-zag process targeting π\pi
1:EnE_{n}, 𝐭n\mathbf{t}_{n}, θ\theta, 𝐯n\mathbf{v}_{n}, tendt_{\text{end}}, cc, KK
2:Set t0t\leftarrow 0
3:while t<tendt<t_{\text{end}} do
4:     Set τK\tau\leftarrow K and I0I\leftarrow 0 \triangleright K(0,)K\in(0,\infty) is a default increment if all vi>0v_{i}>0.
5:     for i{1,,n}i\in\{1,\ldots,n\} do \triangleright tn:=θt_{n}:=\theta for brevity.
6:         if vi<0v_{i}<0 then Set d1d\leftarrow 1 and JiJ\leftarrow i
7:             if (i{1,n}i\in\{1,n\} or En,i1En,iE_{n,i-1}\in E_{n,i}) and mγ(En,i)>0m_{\gamma(E_{n},i)}>0 then \triangleright For Section 4
7:             if i{1,n}i\in\{1,n\} and mγ(En,i)>0m_{\gamma(E_{n},i)}>0 then \triangleright For Section 5
8:                  Set d1+cd\leftarrow 1+c and J0J\leftarrow 0 \triangleright I=0I=0\Rightarrow localization refresh with no flip.
9:              if ti/(d×vi)<τ-t_{i}/(d\times v_{i})<\tau then
10:                  Set τti/(d×vi)\tau\leftarrow-t_{i}/(d\times v_{i}) and IJI\leftarrow J                             
11:     for i{1,,n}i\in\{1,\ldots,n\} do \triangleright tn:=θt_{n}:=\theta as above.
12:         Sample ρ\rho\leftarrow NextFlip(En,𝐭n,θ,𝐯n;i,τ)(E_{n},\mathbf{t}_{n},\theta,\mathbf{v}_{n};i,\tau)
13:         if ρ<τ\rho<\tau then
14:              Set τρ\tau\leftarrow\rho and IiI\leftarrow i               
15:     Set tt+τt\leftarrow t+\tau
16:     for i{1,,n}i\in\{1,\ldots,n\} do \triangleright tn:=θt_{n}:=\theta as above.
17:         Set titi+viτt_{i}\leftarrow t_{i}+v_{i}\tau      
18:     if I0I\neq 0 then
19:         Set vIvIv_{I}\leftarrow-v_{I}      
20:     if I{0,1,n}I\notin\{0,1,n\} and tI=0t_{I}=0 then
21:         if En,I1En,IE_{n,I-1}\in E_{n,I} then
22:              Sample YU({,})Y\sim U(\{\uparrow,\downarrow\})
23:              Set EnpIY(En)E_{n}\leftarrow p_{I}^{Y}(E_{n})
24:         else
25:              Set EnsI(En)E_{n}\leftarrow s_{I}(E_{n})               
26:end while

The for-loop on lines 4–10 of Algorithm 2 implements the time localization and ensures that time steps cannot cause negative entries in 𝐭n\mathbf{t}_{n}. The distribution of the increment τ\tau thus has an atom at the value determined by lines 4–10, signifying either a boundary crossing or a need to recompute the time localization and rates {λi:i{1,,n1,θ}}\{\lambda_{i}^{*}:i\in\{1,\ldots,n-1,\theta\}\}. However, the continuous-time motion of the trajectory is not interrupted in either case; see also [BBCD+18, Section 2.2].

Iterations of the outer while-loop of Algorithm 2 flip a velocity for one of two reasons: a flip event occurs or a boundary is hit. In the former case the reason for the flip is clear by construction. In the latter, a trajectory arriving at a boundary must have vi<0v_{i}<0 for the corresponding velocity. After crossing, the trajectory moves into the interior of an adjacent orthant, which corresponds to vi>0v_{i}>0 in its local coordinates.

Algorithm 3 NextFlip
1:EnE_{n}, 𝐭n\mathbf{t}_{n}, θ\theta, 𝐯n\mathbf{v}_{n}, ii, τ\tau
2:Set ρ0\rho\leftarrow 0
3:repeat
4:     Sample sExp(λi)s\sim\operatorname{Exp}(\lambda_{i}^{*})
5:     Set ρρ+s\rho\leftarrow\rho+s
6:     if ρ<τ\rho<\tau then
7:         Set αλi(En,𝐭n+𝐯nρ,θ+vnρ;𝐯n)/λi\alpha\leftarrow\lambda_{i}(E_{n},\mathbf{t}_{n}+\mathbf{v}_{n}\rho,\theta+v_{n}\rho;\mathbf{v}_{n})/\lambda_{i}^{*}
8:     else\triangleright Guaranteed not to be the shortest waiting time.
9:         Set α1\alpha\leftarrow 1      
10:     Sample UU(0,1)U\sim U(0,1)
11:until U<αU<\alpha
12:return ρ\rho
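Algorithm 3 translates directly into code. The sketch below mirrors the pseudocode, with a toy rate function standing in for (13)–(14) and a constant bound valid on [0, τ]; proposals past the localization horizon are accepted unconditionally, since they cannot be the shortest waiting time and are discarded by the caller anyway.

```python
import random

def next_flip(rate, rate_bound, tau):
    """Algorithm 3: thinned waiting time for one coordinate.  Requires
    rate(s) <= rate_bound for all s in [0, tau]."""
    rho = 0.0
    while True:
        rho += random.expovariate(rate_bound)       # proposal from dominating process
        alpha = rate(rho) / rate_bound if rho < tau else 1.0
        if random.random() < alpha:
            return rho                              # accepted flip (or capped) time

random.seed(3)
tau = 1.0
rate = lambda s: max(0.2 + s, 0.0)  # toy stand-in for lambda_i along the trajectory
rate_bound = 0.2 + tau              # dominates the rate on [0, tau]
rho = next_flip(rate, rate_bound, tau)
```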

The Metropolis–Hastings algorithms in Sections 4 and 5 both used the same proposal mechanism, consisting of three steps. An iteration of the algorithm is one scan through all three steps, with an accept/reject correction after each step. The hybrid method only used steps 1 and 3.

  1. A Gaussian random walk update of θ\theta reflected at 0 with proposal variance σθ2\sigma_{\theta}^{2} tuned to obtain an average acceptance probability αθ1/4\alpha_{\theta}\approx 1/4.

  2. An update of holding times 𝐭n\mathbf{t}_{n} under a fixed topology.

  3. A subtree-prune-regraft step.
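The reflected random-walk update in step 1 can be sketched as follows; the log-posterior is abstracted to a callable, and a half-normal stand-in target is used here in place of the coalescent posterior. Reflection at 0 keeps the proposal symmetric, so the plain Metropolis ratio applies.

```python
import math
import random

def reflected_rw_step(theta, log_post, sigma):
    """One Metropolis step for theta > 0 with a Gaussian proposal reflected at 0."""
    prop = abs(theta + random.gauss(0.0, sigma))    # reflected proposal, still symmetric
    if math.log(random.random()) < log_post(prop) - log_post(theta):
        return prop   # accept
    return theta      # reject

# Stand-in target: half-normal, log density -theta^2/2 on (0, inf).
log_post = lambda th: -0.5 * th * th
random.seed(7)
theta = 1.0
for _ in range(1000):
    theta = reflected_rw_step(theta, log_post, sigma=1.0)
```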

Type 2 updates first additively perturb the initial holding time t1t_{1} as

ξ1𝒩(t1,σ𝐭n2/[n(n1)2])|(ξ1>0).\xi_{1}\sim\mathcal{N}(t_{1},\sigma_{\mathbf{t}_{n}}^{2}/[n(n-1)^{2}])|(\xi_{1}>0).

Subsequent holding times tit_{i}, i{2,,n1}i\in\{2,\ldots,n-1\}, are perturbed conditionally independently as

ξi|(ξ1,,ξi1)𝒩(ti,σ𝐭n2/[(n1)(ni+1)(ni)])|(ξi>ξci,1ξci,2),\xi_{i}|(\xi_{1},\ldots,\xi_{i-1})\sim\mathcal{N}(t_{i},\sigma_{\mathbf{t}_{n}}^{2}/[(n-1)(n-i+1)(n-i)])|(\xi_{i}>\xi_{c_{i,1}}\vee\xi_{c_{i,2}}),

where ξci,1\xi_{c_{i,1}} and ξci,2\xi_{c_{i,2}} are the perturbed times of the child nodes of the node at time tit_{i}. Again, σ𝐭n\sigma_{\mathbf{t}_{n}} was tuned so that the average acceptance probability was α𝐭n1/4\alpha_{\mathbf{t}_{n}}\approx 1/4.
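Step 2's sequential proposal can be sketched with naive rejection sampling for the truncated normals. The child-node structure is simplified here to each perturbed time exceeding its predecessor, as on a caterpillar topology; in general the lower bound is the maximum over the node's two children, and all names below are ours.

```python
import random

def truncated_normal(mean, sd, lower):
    """Naive rejection sampler for N(mean, sd^2) conditioned on (lower, inf)."""
    while True:
        x = random.gauss(mean, sd)
        if x > lower:
            return x

def propose_times(t, n, sigma):
    """Sequentially perturb node times t[0], ..., t[n-2] with the variances
    of the text, each proposal constrained above the previous one."""
    xi = [truncated_normal(t[0], sigma / (n * (n - 1) ** 2) ** 0.5, 0.0)]
    for i in range(2, n):
        sd = sigma / ((n - 1) * (n - i + 1) * (n - i)) ** 0.5
        xi.append(truncated_normal(t[i - 1], sd, xi[-1]))
    return xi

random.seed(11)
n = 6
t = [0.2, 0.5, 0.9, 1.4, 2.0]  # illustrative increasing node times
xi = propose_times(t, n, sigma=0.5)
```

The naive rejection loop is fine for a sketch because the proposal means sit well above the truncation points; a production implementation would use a dedicated truncated-normal sampler.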

Type 3 updates sample an ordered pair of edges (γ,γ)U(Fn×[FnγMRCA])(\gamma,\gamma^{\prime})\sim U(F_{n}\times[F_{n}\cup\gamma_{MRCA}]), where γMRCA\gamma_{MRCA} is an edge connecting the MRCA to \infty. Edge γ\gamma is deleted, and its child cγc_{\gamma} is reattached into γ\gamma^{\prime}. Letting tηt_{\eta} be the time of the node with label ηEn\eta\in E_{n} in an abuse of notation, the reattachment time tt^{\prime} has distribution

t\displaystyle t^{\prime} U(tcγtcγ,tpγ) if γγMRCA,\displaystyle\sim U(t_{c_{\gamma}}\wedge t_{c_{\gamma^{\prime}}},t_{p_{\gamma^{\prime}}})\text{ if }\gamma^{\prime}\neq\gamma_{MRCA},
ttcγMRCA\displaystyle t^{\prime}-t_{c_{\gamma_{MRCA}}} Exp(1) otherwise.\displaystyle\sim\operatorname{Exp}(1)\text{ otherwise}.

Under the infinite sites model, moves into topologies incompatible with observed mutations were rejected essentially instantaneously, before costly likelihood evaluations.

Table 3 summarizes hyperparameters and quantities of interest used in the simulation.

Data set Method vθv_{\theta} σθ\sigma_{\theta} σ𝐭n\sigma_{\mathbf{t}_{n}} αθ\alpha_{\theta} α𝐭n\alpha_{\mathbf{t}_{n}} αSPR\alpha_{\text{SPR}} κ\kappa
[WFDP91] Metropolis - 8 0.6 0.27 0.25 0.06 -
Zig-zag 8 - - - - - -
Hybrid 8 10 - 0.24 - 0.06 10
n=550n=550, θ=5.5\theta=5.5 Metropolis - 6 0.25 0.23 0.24 0.03 -
Zig-zag 6 - - - - - -
Hybrid 6 6 - 0.23 - 0.03 10
n=55n=55, θ=55\theta=55 Metropolis - 18 0.4 0.26 0.23 0.02 -
Zig-zag 40 - - - - - -
Hybrid 40 18 - 0.24 - 0.01 10
[GT94] Metropolis - 4 0.7 0.23 0.25 0.12 -
Zig-zag 4 - - - - - -
Hybrid 4 4 - 0.24 - 0.12 100
n=500n=500, S=20S=20 Metropolis - 4 0.3 0.21 0.21 0.31 -
Zig-zag 4 - - - - - -
Hybrid 4 3 - 0.29 - 0.32 100
n=50n=50, S=200S=200 Metropolis - 14 0.6 0.20 0.27 0.04 -
Zig-zag 20 - - - - - -
Hybrid 20 14 - 0.20 - 0.04 100
Table 3: Hyperparameters and acceptance probabilities for simulations in Sections 4 and 5.

References

  • [ASR16] A. J. Aberer, A. Stamatakis, and F. Ronquist. An efficient independence sampler for updating branches in Bayesian Markov chain Monte Carlo sampling of phylogenetic trees. Syst. Biol., 65(1):161–176, 2016.
  • [BBCD+18] J. Bierkens, A. Bouchard-Côte, A. Doucet, A. B. Duncan, P. Fearnhead, T. Lienart, G. Roberts, and S. J. Vollmer. Piecewise deterministic Markov processes for scalable Monte Carlo on restricted domains. Stat. Probab. Lett., 136:148–154, 2018.
  • [BFR19a] J. Bierkens, P. Fearnhead, and G. Roberts. Supplement to the zig-zag process and super-efficient sampling for Bayesian analysis of big data. Ann. Stat., 47, 2019.
  • [BFR19b] J. Bierkens, P. Fearnhead, and G. Roberts. The zig-zag process and super-efficient sampling for Bayesian analysis of big data. Ann. Stat., 47(3):1288–1320, 2019.
  • [BHV01] L. J. Billera, S. P. Holmes, and K. Vogtmann. Geometry of the space of phylogenetic trees. Adv. Appl. Math., 27:733–767, 2001.
  • [BR17] J. Bierkens and G. Roberts. A piecewise deterministic scaling limit of lifted Metropolis-Hastings for the Curie-Weiss model. Ann. Appl. Probab., 27(2):846–882, 2017.
  • [BRZ19] J. Bierkens, G. Roberts, and P.-A. Zitt. Ergodicity of the zigzag process. Ann. Appl. Probab., 29(4):2266–2301, 2019.
  • [CFS20] A. Chevalier, P. Fearnhead, and M. Sutton. Reversible jump PDMP samplers for variable selection. Preprint, arXiv:2010.11771, 2020+.
  • [CPWF21] A. Chevalier, S. Power, A. Wang, and P. Fearnhead. PDMP Monte Carlo methods for piecewise-smooth densities. Preprint, arXiv:2111.05859, 2021+.
  • [Dav93] M. H. A. Davis. Markov Models and Optimization. Chapman & Hall, 1993.
  • [DBZM17] V. Dinh, A. Bilge, C. Zhang, and F. A. Matsen IV. Probabilistic path Hamiltonian Monte Carlo. In Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, 2017.
  • [Fel81] J. Felsenstein. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol., 17:368–376, 1981.
  • [FHVD17] J. M. Flegal, J. Hughes, D. Vats, and N. Dai. mcmcse: Monte Carlo Standard Errors for MCMC, 2017.
  • [GD16] A. Gavryushkin and A. J. Drummond. The space of ultrametric phylogenetic trees. J. Theor. Biol., 403:197–208, 2016.
  • [GD21] P. Gagnon and A. Doucet. Nonreversible jump algorithms for Bayesian nested model selection. J. Comput. Graph. Stat., 30(2):312–323, 2021.
  • [Gey92] C. J. Geyer. Practical Markov chain Monte Carlo. Statist. Sci., 7:473–483, 1992.
  • [GS90] A. E. Gelfand and A. F. M. Smith. Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc., 85(410):398–409, 1990.
  • [GT94] R. C. Griffiths and S. Tavaré. Simulating probability distributions in the coalescent. Theor. Popln Biol., 46:131–159, 1994.
  • [Has70] W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57:97–109, 1970.
  • [HD12] S. Höhna and A. J. Drummond. Guided tree topology proposals for Bayesian phylogenetic inference. Syst. Biol., 61(1):1–11, 2012.
  • [HDPD08] S. Höhna, M. Defoin-Platel, and A. J. Drummond. Clock-constrained tree proposal operators in Bayesian phylogenetic inference. In 8th IEEE International Conference on BioInformatics and BioEngineering, pages 1–7. IEEE, 2008.
  • [JC69] T. H. Jukes and C. R. Cantor. Evolution of protein molecules. In H. N. Munro, editor, Mammalian protein metabolism, pages 21–132. Academic Press, New York, 1969.
  • [JZH+20] X. Ji, Z. Zhang, A. Holbrook, A. Nishimura, G. Baele, A. Rambaut, P. Lemey, and M. A. Suchard. Gradients do grow on trees: a linear-time O(N)O(N)-dimensional gradient for statistical phylogenetics. Mol. Biol. Evol., 37(10):3047–3060, 2020.
  • [Kin82] J. F. C. Kingman. The coalescent. Stochast. Process. Applic., 13(3):235–248, 1982.
  • [LVH+08] C. Lakner, P. Van Der Mark, J. P. Huelsenbeck, B. Larget, and F. Ronquist. Efficiency of Markov chain Monte Carlo tree proposals in Bayesian phylogenetics. Syst. Biol., 57:86–103, 2008.
  • [MV05] E. Mossel and E. Vigoda. Phylogenetic MCMC algorithms are misleading on mixtures of trees. Science, 309:2207–2209, 2005.
  • [NDL20] A. Nishimura, D. B. Dunson, and J. Lu. Discontinuous Hamiltonian Monte Carlo for discrete parameters and discontinuous likelihoods. Biometrika, 107(2):365–380, 2020.
  • [Nea10] R. M. Neal. MCMC using Hamiltonian dynamics. In Handbook of Markov chain Monte Carlo. CRC Press, 2010.
  • [VR21] G. Vasdekis and G. O. Roberts. Speed up zig-zag. Preprint, arXiv:2103.16620, 2021+.
  • [Wak09] J. Wakeley. Coalescent theory: an introduction. Roberts & Co, 2009.
  • [Wat75] G. A. Watterson. On the number of segregating sites in genetical models without recombination. Theor. Popln Biol., 7:256–276, 1975.
  • [WFDP91] R. H. Ward, B. L. Frazier, K. Dew, and S. Pääbo. Extensive mitochondrial diversity within a single Amerindian tribe. Proc. Natl. Acad. Sci. USA, 88:8720–8724, 1991.