
Approximate filtering via discrete dual processes

Guillaume Kon Kam King, Université Paris-Saclay - INRAE
Andrea Pandolfi, Bocconi University
Marco Piretto, BrandDelta
Matteo Ruggiero, University of Torino and Collegio Carlo Alberto (corresponding author: ESOMAS Dept., Corso Unione Sovietica 218/bis, 10134, Torino, Italy; [email protected])

We consider the task of filtering a dynamic parameter evolving as a diffusion process, given data collected at discrete times from a likelihood which is conjugate to the reversible law of the diffusion, when a generic dual process on a discrete state space is available. Recently, it was shown that duality with respect to a death-like process implies that the filtering distributions are finite mixtures, making exact filtering and smoothing feasible through recursive algorithms with polynomial complexity in the number of observations. Here we provide general results for the case where the dual is a regular jump continuous-time Markov chain on a discrete state space, which typically leads to filtering distributions given by countable mixtures indexed by the dual process state space. We investigate the performance of several approximation strategies on two hidden Markov models driven by Cox–Ingersoll–Ross and Wright–Fisher diffusions, which admit duals of birth-and-death type, and compare them with the available exact strategies based on death-type duals and with bootstrap particle filtering on the diffusion state space as a general benchmark.

Keywords: Bayesian inference; Diffusion; Duality; Hidden Markov models; Particle filtering; Smoothing.

1 Introduction

Hidden Markov models are widely used statistical models for time series that assume an unobserved Markov process $(X_{t})_{t\geq 0}$, or hidden signal, driving the process that generates the observations $(Y_{t_{i}})_{i=0,\dots,n}$, e.g., by specifying the dynamics of one or more parameters of the observation density $f_{X_{t}}(y)$, called the emission distribution. See [11] for a general treatment of hidden Markov models. In this framework, the first task is to estimate the trajectory of the signal given observations collected at discrete times $0=t_{0}<t_{1}<\cdots<t_{n}=T$, which amounts to performing sequential Bayesian inference by computing the so-called filtering distributions $p(x_{t_{i}}|y_{t_{0}},\ldots,y_{t_{i}})$, i.e., the marginal distributions of the signal at time $t_{i}$ conditional on observations collected up to time $t_{i}$. Originally motivated by real-time tracking and navigation systems and pioneered by [46, 47], classical and widely known explicit results for this problem include: the Kalman–Bucy filter, when both the signal and the observation process are formulated as a Gaussian linear system; the Baum–Welch filter, when the signal has a finite state space and the observations are categorical; the Wonham filter, when the signal has a finite state space and the observations are Gaussian. These scenarios allow the derivation of so-called finite-dimensional filters, i.e., a sequence of filtering distributions whose explicit identification is obtained through a parameter update based on the collected observations and on the time intervals between the collection times. In such cases, the resulting computational cost of the algorithm increases linearly with the number of observations. Other explicit results include [56, 26, 27, 55, 30, 31, 16]. Outside these classes, explicit solutions are difficult to obtain, and their derivation typically relies on ad hoc computations.
This is especially true when the map $x\mapsto f_{x}$ is non-linear and when the signal transition kernel is known only up to a series expansion, often intractable, as is the case for many widely used stochastic models. When exact solutions are not available, one must typically resort to approximate strategies, whose state of the art is most prominently based on extensions of the Kalman and particle filters. See, for example, [6, 15].

A somewhat weaker but useful notion with respect to that of a finite-dimensional filter was formulated in [13], who introduced the concept of a computable filter. This extends the former class to a larger class of filters whose marginal distributions are finite mixtures of elementary kernels, rather than single kernels. Unlike the former case, such a scenario entails a higher computational cost, usually polynomial in the number of observations, but avoids the infinite-dimensionality typically implied by a series expansion of the signal transition kernel. See [13, 14] for two examples.

Recently, [53] derived sufficient conditions for computable filtering based on duality. A dual process is a process $D_{t}$ which enjoys the identity

\mathbb{E}[h(X_{t},d)|X_{0}=x]=\mathbb{E}[h(x,D_{t})|D_{0}=d]. (1)

Here the expectation on the left-hand side is taken with respect to the transition law of the signal $X_{t}$, and that on the right-hand side with respect to that of $D_{t}$, while the functions $h(x,d)$ which satisfy the above identity are called duality functions. See [44] for a review and for the technical details we have glossed over here. The use of duality is well established in probability and statistical physics; see for example [21, 51, 7, 33, 23, 34, 35, 24, 52, 12, 25, 38, 1, 8, 18, 42, 28, 43]. The use of duality for inference was initiated by [53], who showed that for a reversible signal whose marginal distributions are conjugate to the emission distribution (i.e., the Bayesian update at a fixed $t$ can be computed in closed form), computable filtering is guaranteed if the stochastic part of the dual process evolves on a finite state space. Examples of such scenarios were given for non-linear hidden Markov models involving signals driven by Cox–Ingersoll–Ross (CIR) and $K$-dimensional Wright–Fisher (WF) diffusions, for which recursive formulae for the filtering distributions were derived. Along similar lines, duality was exploited for computable smoothing in [50], whereby the signal is also conditioned on data collected at later times, and for nonparametric hidden Markov models driven by Fleming–Viot and Dawson–Watanabe diffusions in [54, 2, 3].

In this paper, we investigate filtering problems for hidden Markov models when the dual process takes the more general form of a continuous-time Markov chain on a discrete state space. This is of interest, for example, in some population genetic models with selection [7] or interaction [4, 25], whose known dual processes are of birth-and-death (B&D) type and whose specific filtering problems are currently under investigation by some of the authors. When the dual process evolves in a countable state space, the filtering distributions can in general be expected to be countably infinite mixtures. This leads to inferential procedures which are not computable in the sense specified above, since the computation of the filtering distribution can no longer be exact. It is thus natural to ask how the inferential procedures obtained in such a duality-based scenario, possibly aided by suitable approximation strategies, would perform.

The paper is organized as follows. In Section 2 we identify sufficient conditions for filtering based on discrete duals and provide a general description of the filtering operations in this setting. In Section 3, we apply these results to devise practical algorithms which allow one to evaluate filtering and smoothing distributions in recursive form under this formulation. Sections 4 and 5 investigate hidden Markov models driven by a Cox–Ingersoll–Ross diffusion, which admits a dual given by a one-dimensional B&D process, and by a $K$-dimensional Wright–Fisher diffusion, which is shown to admit a dual given by a $K$-dimensional Moran model. The latter can be seen as a multidimensional B&D process with constant total population size. Section 6 discusses several approximation strategies used to implement the above algorithms with these dual processes, and compares their performance with exact filtering based on the results in [53] and with a bootstrap particle filter as a general benchmark. Finally, we conclude with some brief remarks.

2 Filtering via discrete dual processes

Consider a hidden signal $(X_{t})_{t\geq 0}$ given by a diffusion process on $\mathcal{X}\subset\mathbb{R}^{K}$, for $K\geq 1$. This takes here the role of a temporally evolving parameter which is the main target of estimation. Observations $Y_{t_{i}}\in\mathcal{Y}\subset\mathbb{R}^{D}$, $D\geq 1$, are collected at discrete times $0=t_{0}<t_{1}<\cdots<t_{n}=T$ with distribution $Y_{t_{i}}\overset{\text{ind}}{\sim}f_{x}(\cdot)$, given $X_{t_{i}}=x$. Knowledge of the signal state $x$ thus makes the observations conditionally independent. Given an observation $Y=y$, define the update operator $\phi_{y}$ acting on densities $\xi$ on $\mathcal{X}$ by

\phi_{y}(\xi)(x):=\frac{f_{x}(y)\xi(x)}{\mu_{\xi}(y)},\qquad\mu_{\xi}(y):=\int_{\mathcal{X}}f_{x}(y)\xi(x)\,\mathrm{d}x. (2)

Here and later we assume all densities of interest exist with respect to an appropriate dominating measure. In (2), $\xi$ acts as a prior distribution on the signal state, which encodes the current knowledge (or lack thereof) on $X_{t}$, whereas $\mu_{\xi}(y)$ is interpreted as the marginal likelihood of a data point $y$ when $X_{t}$ has distribution $\xi$. The update operator thus amounts to an application of Bayes' theorem for conditioning $\xi$ on a new observation $y$, leading to the updated density $\phi_{y}(\xi)$. For example, if $\xi$ is a $\text{Beta}(a,b)$ density on $[0,1]$ and $f_{x}$ is $\text{Bern}(x)$, then $\phi_{y}(\xi)$ is $\text{Beta}(a+y,b+1-y)$, as in a classical Beta-Bernoulli update.

Define also the propagation operator $\psi_{t}$ by

\psi_{t}(\xi)(x^{\prime}):=\int_{\mathcal{X}}\xi(x)P_{t}(x^{\prime}|x)\,\mathrm{d}x, (3)

where $P_{t}$ is the transition density of the signal. Here $\psi_{t}(\xi)$ is the probability density at time $t$ obtained by propagating forward the density $\xi$ of the signal at time 0 by means of the signal semigroup.

We will make three assumptions, the first two of which are the same as in [53].

Assumption 1 (Reversibility). The signal $X_{t}$ is reversible with respect to the density $\pi$, i.e., $\pi(x)P_{t}(x^{\prime}|x)=\pi(x^{\prime})P_{t}(x|x^{\prime})$.

See the discussion in [53] on the possibility of relaxing the above assumption. For $K\in\mathbb{Z}_{+}$, define now the space of multi-indices $\mathcal{M}=\mathbb{Z}_{+}^{K}$ to be

\mathcal{M}=\{\mathbf{m}=(m_{1},\ldots,m_{K}):\ m_{j}\in\mathbb{Z}_{+},\text{ for }j=1,\ldots,K\}, (4)

whose origin is denoted $\mathbf{0}=(0,\ldots,0)$.

Assumption 2 (Conjugacy). For $\Theta\subset\mathbb{R}^{l}$, $l\in\mathbb{Z}_{+}$, let $h:\mathcal{X}\times\mathcal{M}\times\Theta\rightarrow\mathbb{R}_{+}$ be such that $\sup_{x\in\mathcal{X}}h(x,\mathbf{m},\theta)<\infty$ for all $\mathbf{m}\in\mathcal{M},\theta\in\Theta$, and $h(x,\mathbf{0},\theta^{\prime})=1$ for some $\theta^{\prime}\in\Theta$. Then $f_{x}(\cdot)$ is conjugate to distributions in the family

\mathcal{F}=\{g(x,\mathbf{m},\theta)=h(x,\mathbf{m},\theta)\pi(x),\ \mathbf{m}\in\mathcal{M},\theta\in\Theta\},

i.e., there exist functions $t:\mathcal{Y}\times\mathcal{M}\rightarrow\mathcal{M}$ and $T:\mathcal{Y}\times\Theta\rightarrow\Theta$ such that if $X_{t}\sim g(x,\mathbf{m},\theta)$ and $Y_{t}|X_{t}=x\sim f_{x}$, we have $X_{t}|Y_{t}=y\sim g(x,t(y,\mathbf{m}),T(y,\theta))$.

Here $g(x,\mathbf{m},\theta)$ takes the role of the "current" prior distribution, i.e., the prior information on the signal state, possibly based on past observations through previous conditionings and propagations, and $g(x,t(y,\mathbf{m}),T(y,\theta))$ takes the role of the posterior, i.e., $g(x,\mathbf{m},\theta)$ conditional on a data point $y$ observed from $f_{x}$ for $X_{t}=x$. The functions $t(y,\mathbf{m}),T(y,\theta)$ provide the transformations that update the parameters based on $y$. In the absence of data, the condition $h(x,\mathbf{0},\theta^{\prime})=1$ reduces $g(x,\mathbf{m},\theta)$ to $\pi(x)$.
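As a concrete illustration of Assumption 2, consider a Gamma-Poisson pair: take $\pi$ Gamma, $h(x,m,\theta)$ tilting it to a $\text{Ga}(a+m,\theta)$ density, and Poisson$(x)$ emissions, so that plausibly $t(y,m)=m+y$ and $T(y,\theta)=\theta+1$. This choice is an assumption made here for illustration (a similar Gamma family reappears in the CIR section below); the sketch checks the update numerically by crude quadrature.

```python
import math

# Hedged numerical check of a Gamma-Poisson instance of Assumption 2:
# prior g(x, m, theta) = Ga(a + m, theta), emission f_x(y) = Poisson(x),
# claimed update t(y, m) = m + y, T(y, theta) = theta + 1 (an assumption
# chosen for illustration).

def gamma_pdf(x, shape, rate):
    return rate**shape * x**(shape - 1) * math.exp(-rate * x) / math.gamma(shape)

def poisson_pmf(y, x):
    return x**y * math.exp(-x) / math.factorial(y)

a, m, theta, y = 1.5, 2, 1.0, 3
h = 0.001
xs = [i * h + h / 2 for i in range(40000)]          # midpoint grid on (0, 40)
post = [poisson_pmf(y, x) * gamma_pdf(x, a + m, theta) for x in xs]
Z = sum(post) * h                                    # marginal likelihood (9)
target = [gamma_pdf(x, a + m + y, theta + 1) for x in xs]
err = max(abs(p / Z - t) for p, t in zip(post, target))
print(err)  # small: the updated density matches Ga(a + m + y, theta + 1)
```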

The third assumption weakens Assumption 3 in [53] by assuming the dual process has finite activity on a discrete state space, and possibly has a deterministic companion.

Assumption 3 (Duality). Given a deterministic process $\Theta_{t}\in\Theta$ and a regular jump continuous-time Markov chain $M_{t}$ on $\mathbb{Z}_{+}^{K}$ with transition probabilities

p_{\mathbf{m},\mathbf{n}}(t;\theta):=\mathbb{P}(M_{t}=\mathbf{n}|M_{0}=\mathbf{m},\Theta_{0}=\theta), (5)

equation (1) holds with $D_{t}=(M_{t},\Theta_{t})$ and $h$ as in Assumption 2.

The following result provides a full description of the propagation and update steps which allow one to compute the filtering distribution.

Proposition 2.1.

Let Assumptions 1-3 hold, and let $\sum_{\mathbf{m}\in\mathcal{M}}w_{\mathbf{m}}g(x,\mathbf{m},\theta)$ be a countable mixture with $\sum_{\mathbf{m}\in\mathcal{M}}w_{\mathbf{m}}=1$. Then, for $\psi_{t}$ as in (3), we have

\psi_{t}\bigg(\sum_{\mathbf{m}\in\mathcal{M}}w_{\mathbf{m}}g(x,\mathbf{m},\theta)\bigg)=\sum_{\mathbf{n}\in\mathcal{M}}w^{\prime}_{\mathbf{n}}(t)g(x,\mathbf{n},\Theta_{t}), (6)

where

w^{\prime}_{\mathbf{n}}(t)=\sum_{\mathbf{m}\in\mathcal{M}}w_{\mathbf{m}}p_{\mathbf{m},\mathbf{n}}(t;\theta), (7)

and $p_{\mathbf{m},\mathbf{n}}(t;\theta)$ are as in (5). Furthermore, for $\phi_{y}$ as in (2), we have

\phi_{y}\bigg(\sum_{\mathbf{m}\in\mathcal{M}}w_{\mathbf{m}}g(x,\mathbf{m},\theta)\bigg)=\sum_{\mathbf{m}\in\mathcal{M}}\hat{w}_{\mathbf{m},\theta}(y)g(x,t(y,\mathbf{m}),T(y,\theta)), (8)

where $\hat{w}_{\mathbf{m},\theta}(y)\propto w_{\mathbf{m}}\mu_{\mathbf{m},\theta}(y)$ and

\mu_{\mathbf{m},\theta}(y):=\int_{\mathcal{X}}f_{x}(y)g(x,\mathbf{m},\theta)\,\mathrm{d}x. (9)
Proof.

First observe that $\psi_{t}(g(x,\mathbf{m},\theta))=\sum_{\mathbf{n}\in\mathcal{M}}p_{\mathbf{m},\mathbf{n}}(t;\theta)g(x,\mathbf{n},\Theta_{t})$, which follows similarly to Proposition 2.2 in [53]. Then the claim follows by linearity, using the fact that

\psi_{t}\bigg(\sum_{i\geq 1}w_{i}\xi_{i}\bigg)=\sum_{i\geq 1}w_{i}\psi_{t}(\xi_{i}), (10)

so that

\begin{aligned}
\psi_{t}\bigg(\sum_{\mathbf{m}\in\mathcal{M}}w_{\mathbf{m}}g(x,\mathbf{m},\theta)\bigg)&=\sum_{\mathbf{m}\in\mathcal{M}}w_{\mathbf{m}}\psi_{t}(g(x,\mathbf{m},\theta))\\
&=\sum_{\mathbf{m}\in\mathcal{M}}w_{\mathbf{m}}\sum_{\mathbf{n}\in\mathcal{M}}p_{\mathbf{m},\mathbf{n}}(t;\theta)g(x,\mathbf{n},\Theta_{t})=\sum_{\mathbf{n}\in\mathcal{M}}\sum_{\mathbf{m}\in\mathcal{M}}w_{\mathbf{m}}p_{\mathbf{m},\mathbf{n}}(t;\theta)g(x,\mathbf{n},\Theta_{t}),
\end{aligned}
which yields (6) with weights (7).

Using now the fact that

\phi_{y}\bigg(\sum_{i\geq 1}w_{i}\xi_{i}\bigg)=\sum_{i\geq 1}\frac{w_{i}\mu_{\xi_{i}}(y)}{\sum_{j}w_{j}\mu_{\xi_{j}}(y)}\phi_{y}(\xi_{i}), (11)

we also find that

\phi_{y}\bigg(\sum_{\mathbf{m}\in\mathcal{M}}w_{\mathbf{m}}g(x,\mathbf{m},\theta)\bigg)=\sum_{\mathbf{m}\in\mathcal{M}}\frac{w_{\mathbf{m}}\mu_{\mathbf{m},\theta}(y)}{\sum_{\mathbf{n}\in\mathcal{M}}w_{\mathbf{n}}\mu_{\mathbf{n},\theta}(y)}g(x,t(y,\mathbf{m}),T(y,\theta)), (12)

where

\mu_{\mathbf{m},\theta}(y)=\int_{\mathcal{X}}f_{x}(y)g(x,\mathbf{m},\theta)\,\mathrm{d}x=\int_{\mathcal{X}}f_{x}(y)h(x,\mathbf{m},\theta)\pi(x)\,\mathrm{d}x. (13) ∎

The expression (6), together with (7), provides a general recipe for computing the forward propagation of the current marginal distribution of the signal $g(x,\mathbf{m},\theta)$, based on the transition probabilities of the dual continuous-time Markov chain. Since the update operator (2) can be easily applied to the resulting distribution, Proposition 2.1 then shows that under these assumptions all filtering distributions are countable mixtures of elementary kernels indexed by the state space of the dual process, with mixture weights proportional to the dual process transition probabilities $p_{\mathbf{m},\mathbf{n}}(t;\theta)$. When these transition probabilities give positive mass only to points $\{\mathbf{n}\in\mathcal{M}:\ \mathbf{n}\leq\mathbf{m}\}$, as is the case for a pure-death process, the right-hand side of (6) reduces to a finite sum, and one can construct an exact filter with a computational cost that is polynomial in the number of observations, as shown in [53].
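On a truncated dual state space, the weight redistribution in (7) is just a matrix-vector operation; a minimal sketch, with a placeholder transition matrix standing in for $p_{\mathbf{m},\mathbf{n}}(t;\theta)$:

```python
# Sketch of the propagation step (6)-(7): redistribute mixture weights via the
# dual transition probabilities. p[m][n] is a toy transition matrix over a
# truncated index set {0, ..., M-1}; in practice it would come from (5).

def propagate_weights(w, p):
    """w'_n = sum_m w_m p_{m,n}, for a truncated dual state space."""
    M = len(w)
    return [sum(w[m] * p[m][n] for m in range(M)) for n in range(M)]

p = [[0.7, 0.3, 0.0],
     [0.2, 0.5, 0.3],
     [0.0, 0.4, 0.6]]          # rows sum to one
w = [0.5, 0.5, 0.0]
w_prop = propagate_weights(w, p)
print(w_prop)                  # [0.45, 0.4, 0.15]; still sums to one
```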

The above approach can be seen as an alternative to deriving the filtering distribution of the signal by leveraging a spectral expansion of the transition function $P_{t}$ in (3), which typically requires ad hoc computations and does not lend itself easily to explicit update operations through (2). Note also that expressions like (6) can be used, by taking appropriate limits of $p_{\mathbf{m},\mathbf{n}}(t;\theta)$ as $t\rightarrow 0$, to identify the transition kernel $P_{t}$ of the signal itself; see, e.g., [7, 53, 38].

3 Recursive formulae for filtering and smoothing

In order to translate Proposition 2.1 into practical recursive formulae for filtering and smoothing, we assume for simplicity of exposition that the time intervals between successive data collections $t_{i}-t_{i-1}$ equal $\Delta$ for all $i$. For ease of reading, we will therefore write $P_{\Delta}$ instead of $P_{t_{i}-t_{i-1}}$ for the signal transition function over the interval $\Delta=t_{i}-t_{i-1}$. We will also use the established notation whereby $i|0:i-1$ indicates that the argument refers to time $t_{i}=i\Delta$, conditional on the data collected at times from $0$ to $t_{i-1}=(i-1)\Delta$.

Define the filtering density

\nu_{i|0:i}(x_{i}):=p(x_{i}|y_{0:i})\propto\int_{\mathcal{X}^{i}}p(x_{0:i},y_{0:i})\,\mathrm{d}x_{0:i-1}, (14)

i.e., the law of the signal at time $t_{i}$ given data up to time $t_{i}$, obtained by integrating out the past trajectory. Define also the predictive density

\nu_{i+1|0:i}(x_{i+1}):=p(x_{i+1}|y_{0:i})=\int_{\mathcal{X}}p(x_{i}|y_{0:i})P_{\Delta}(x_{i+1}|x_{i})\,\mathrm{d}x_{i}, (15)

i.e., the marginal density of the signal at time $t_{i+1}$, given data up to time $t_{i}$. This can be expressed recursively as a function of the previous filtering density $p(x_{i}|y_{0:i})$, as displayed. Finally, define the marginal smoothing density

\nu_{i|0:n}(x_{i}):=p(x_{i}|y_{0:n})\propto\int_{\mathcal{X}^{n}}p(x_{0:n},y_{0:n})\,\mathrm{d}x_{0:i-1}\,\mathrm{d}x_{i+1:n}, (16)

where the signal is evaluated at time $t_{i}$ conditional on all available data. The first two distributions above are natural objects of inferential interest, whereas the latter is typically used to improve previous estimates once additional data become available. Finally, for $\Theta_{\Delta}$ as in Assumption 3 and $t(\cdot,\cdot),T(\cdot,\cdot)$ as in Assumption 2, define for $i=0,\dots,n$ the quantities

\vartheta_{i|0:i}:=T(y_{i},\vartheta_{i|0:i-1}),\qquad\vartheta_{i|0:i-1}:=\Theta_{\Delta}(\vartheta_{i-1|0:i-1}),\qquad\vartheta_{0|0:-1}:=\theta_{0}. (17)

Here, $\vartheta_{i|0:i-1}$ denotes the state of the deterministic component of the dual process at time $i$, after the propagation from time $i-1$ and before updating with the datum collected at time $i$, and $\vartheta_{i|0:i}$ the state after such an update.
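The deterministic recursion (17) alternates an update through $T$ and a propagation through $\Theta_{\Delta}$; a minimal sketch, where the specific maps passed in (a $\theta\mapsto\theta+1$ update and a linear relaxation) are hypothetical placeholders, not the maps of any particular model in the paper:

```python
# Sketch of the recursion (17): alternate the update T and the deterministic
# propagation Theta_Delta. The maps used in the example call are placeholders
# chosen for illustration.

def theta_sequence(ys, theta0, Theta_Delta, T):
    filt, pred = [], []
    th = theta0                       # vartheta_{0|0:-1} = theta_0
    for y in ys:
        filt.append(T(y, th))         # vartheta_{i|0:i} = T(y_i, vartheta_{i|0:i-1})
        th = Theta_Delta(filt[-1])    # vartheta_{i+1|0:i}
        pred.append(th)
    return filt, pred

filt, pred = theta_sequence([1, 0, 2], 1.0,
                            Theta_Delta=lambda th: 0.5 + 0.8 * (th - 0.5),
                            T=lambda y, th: th + 1.0)
print(filt)   # updated states vartheta_{i|0:i}, i = 0, 1, 2
```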

The following Corollary of Proposition 2.1 extends a result of [53] (see also Theorem 1 in [50] for an easier comparison in a similar notation).

Corollary 3.1.

Let Assumptions 1-3 hold, and assume that

\nu_{i-1|0:i-1}(x)=\sum_{\mathbf{m}\in\mathcal{M}}w_{\mathbf{m}}^{(i-1)}g(x,\mathbf{m},\vartheta_{i-1|0:i-1}).

Then (15) can be written, through (3), as

\nu_{i|0:i-1}(x)=\psi_{\Delta}(\nu_{i-1|0:i-1}(x))=\sum_{\mathbf{m}\in\mathcal{M}}w_{\mathbf{m}}^{(i-1)^{\prime}}g(x,\mathbf{m},\vartheta_{i|0:i-1}), (18)
w_{\mathbf{m}}^{(i-1)^{\prime}}=\sum_{\mathbf{n}\in\mathcal{M}}w_{\mathbf{n}}^{(i-1)}p_{\mathbf{n},\mathbf{m}}(\Delta;\vartheta_{i-1|0:i-1}),\quad\mathbf{m}\in\mathcal{M},

with $p_{\mathbf{n},\mathbf{m}}(\Delta;\vartheta_{i-1|0:i-1})$ as in (5). Furthermore, given the observation $y_{i}$, (14) can be written, through (2), as

\nu_{i|0:i}(x)=\phi_{y_{i}}(\nu_{i|0:i-1}(x))=\sum_{\mathbf{m}\in\mathcal{M}}w_{\mathbf{m}}^{(i)}g(x,\mathbf{m},\vartheta_{i|0:i}), (19)
w_{\mathbf{m}}^{(i)}\propto\mu_{\mathbf{n},\vartheta_{i|0:i-1}}(y_{i})w_{\mathbf{n}}^{(i-1)^{\prime}},\quad\mathbf{m}=t(y_{i},\mathbf{n}),\ \mathbf{n}\in\mathcal{M},

with $\mu_{\mathbf{m},\theta}$ as in (9).

Input: $Y_{0:n}$, $t_{0:n}$
Result: $\vartheta_{i|0:i}$, $\mathbf{M}_{i|0:i}$ and $W_{i}=\{w_{\mathbf{m}}^{(i)},\mathbf{m}\in\mathbf{M}_{i|0:i}\}$
Initialise:
    Set $\vartheta_{0|0}=T(Y_{0},\theta_{0})$, with $T$ as in Assumption 2
    Set $\mathbf{M}_{0|0}=\{t(Y_{0},\mathbf{0})\}=\{\mathbf{m}^{*}\}$ and $W_{0}=\{1\}$, with $t$ as in Assumption 2
    Compute $\vartheta_{1|0}$ from $\vartheta_{0|0}$ as in (17)
    Set $\mathbf{M}^{*}=\mathcal{B}(\mathbf{M}_{0|0})$ and $W^{*}=\{p_{\mathbf{m}^{*},\mathbf{n}}(\Delta,\vartheta_{0|0}),\ \mathbf{n}\in\mathbf{M}^{*}\}$, with $p_{\mathbf{m},\mathbf{n}}$ as in (5)
for $i$ from $1$ to $n$ do
    Update:
        Set $\vartheta_{i|0:i}=T(Y_{i},\vartheta_{i|0:i-1})$
        Set $W_{i}=\Big\{\frac{w_{\mathbf{m}}^{*}\mu_{\mathbf{m},\vartheta_{i|0:i-1}}(Y_{i})}{\sum_{\mathbf{n}\in\mathbf{M}^{*}}w_{\mathbf{n}}^{*}\mu_{\mathbf{n},\vartheta_{i|0:i-1}}(Y_{i})},\ \mathbf{m}\in\mathbf{M}^{*}\Big\}$, with $\mu_{\mathbf{m},\theta}$ as in (9)
        Set $\mathbf{M}_{i|0:i}=\{t(Y_{i},\mathbf{m}),\ \mathbf{m}\in\mathbf{M}^{*}\}$ and update the labels in $W_{i}$
        Copy $\vartheta_{i|0:i}$, $\mathbf{M}_{i|0:i}$ and $W_{i}$ to be reported as the output
    Propagation:
        Compute $\vartheta_{i+1|0:i}$ from $\vartheta_{i|0:i}$ as in (17)
        Set $\mathbf{M}^{*}=\mathcal{B}(\mathbf{M}_{i|0:i})$ and $W^{*}=\Big\{\sum_{\mathbf{m}\in\mathbf{M}_{i|0:i}}w_{\mathbf{m}}^{(i)}p_{\mathbf{m},\mathbf{n}}(\Delta,\vartheta_{i|0:i}),\ \mathbf{n}\in\mathbf{M}^{*}\Big\}$
end for

Note: $\mathbf{M}_{i|0:i}=\{\mathbf{m}\in\mathcal{M}:\ w_{\mathbf{m}}^{(i)}>0\}\subset\mathcal{M}$ is the support of the weights of $\nu_{i|0:i}$; $\mathcal{B}(\mathbf{m})$ denotes the states reached by the dual process from $\mathbf{m}$, and $\mathcal{B}(\mathbf{M})$ those reached from all $\mathbf{m}\in\mathbf{M}$.

Algorithm 1: Filtering

Algorithm 1 outlines the pseudo-code for implementing the update and propagation steps of Corollary 3.1. How to use these results efficiently can depend on the model at hand. When the transition probabilities $p_{\mathbf{m},\mathbf{n}}(t;\theta)$ are available in closed form, their use could lead to the best performance, but it can also at times face numerical instability issues (as pointed out in Section 4 below). When they are not available in closed form, one can approximate them by simulating $N$ replicates of the dual component $M_{t}$ and then regrouping probability masses according to the arrival states, as done in (18). In our framework, the dual process is typically easier to simulate than the original process, given its discrete state space. For instance, pure-death or B&D processes are easily simulated using a Gillespie algorithm [36], whereby one alternates sampling waiting times and jump transitions for the embedded chain. Depending on the dual process, there may also be more efficient simulation strategies.
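The Monte Carlo approximation just described can be sketched as follows; the birth and death rate functions `lam` and `mu` passed in the example are generic placeholders (the CIR rates of Section 4 would be one possible choice).

```python
import random

# Minimal Gillespie sketch for a one-dimensional B&D dual over [0, t]:
# alternate exponential waiting times and jumps of the embedded chain.

def simulate_bd(m0, t, lam, mu, rng):
    m, clock = m0, 0.0
    while True:
        total = lam(m) + (mu(m) if m > 0 else 0.0)
        if total <= 0.0:
            return m                        # absorbing state
        clock += rng.expovariate(total)     # exponential holding time
        if clock > t:
            return m
        # embedded chain: up with probability lam/total, down otherwise
        m += 1 if rng.random() < lam(m) / total else -1

# Monte Carlo estimate of p_{m,n}(t): regroup N replicates by arrival state.
def estimate_transition(m0, t, lam, mu, N=2000, seed=7):
    rng = random.Random(seed)
    counts = {}
    for _ in range(N):
        n = simulate_bd(m0, t, lam, mu, rng)
        counts[n] = counts.get(n, 0) + 1
    return {n: c / N for n, c in counts.items()}

probs = estimate_transition(3, 0.5, lam=lambda m: 0.5, mu=lambda m: 1.0 * m)
print(abs(sum(probs.values()) - 1.0) < 1e-9)  # estimated masses sum to one
```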

A different type of approximation of the propagation step (18) in Corollary 3.1 can be based on pruning the transition probabilities or the arrival weights below a given threshold, followed by a renormalisation of the remaining weights. Both this approximation strategy and that outlined above assign positive weights only to a finite subset of $\mathcal{M}$, hence they overcome the infinite dimensionality of the dual process state space. [50] showed that the latter strategy allows one to control the approximation error while retaining almost all of the distributional information, thus affecting the inference negligibly. In the next sections we investigate such strategies for two hidden Markov models driven by Cox–Ingersoll–Ross and $K$-dimensional Wright–Fisher diffusions.
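The pruning step amounts to thresholding and renormalising the weight dictionary; a minimal sketch under an illustrative threshold:

```python
# Sketch of the pruning approximation described above: drop mixture components
# whose weight falls below a threshold eps, then renormalise the survivors.

def prune(weights, eps=1e-4):
    """weights: dict mapping dual states m to w_m. Returns pruned, renormalised."""
    kept = {m: w for m, w in weights.items() if w >= eps}
    total = sum(kept.values())
    return {m: w / total for m, w in kept.items()}

w = {0: 0.6, 1: 0.39, 2: 0.00999, 3: 1e-5}
wp = prune(w, eps=1e-2)
print(sorted(wp))                          # states 2 and 3 are discarded
print(abs(sum(wp.values()) - 1.0) < 1e-9)  # weights renormalised to one
```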

Now, in order to describe the marginal smoothing densities (16), we need an additional assumption and some further notation.

Assumption 4. For $h$ as in Assumption 2, there exist functions $d:\mathcal{M}^{2}\to\mathcal{M}$ and $e:\Theta^{2}\to\Theta$ such that for all $x\in\mathcal{X}$, $\mathbf{m},\mathbf{m}^{\prime}\in\mathcal{M}$, $\theta,\theta^{\prime}\in\Theta$,

h(x,\mathbf{m},\theta)h(x,\mathbf{m}^{\prime},\theta^{\prime})=C_{\mathbf{m},\mathbf{m}^{\prime},\theta,\theta^{\prime}}\,h(x,d(\mathbf{m},\mathbf{m}^{\prime}),e(\theta,\theta^{\prime})), (20)

where $C_{\mathbf{m},\mathbf{m}^{\prime},\theta,\theta^{\prime}}$ is constant in $x$.

Denote by $\overset{\leftarrow}{\vartheta}_{i},\overset{\leftarrow}{\vartheta}_{i}^{\prime}$ the quantities defined in (17) computed backwards. Equivalently, these are computed as in (17) with the data in reverse order, i.e., using $y_{n:0}$ in place of $y_{0:n}$, namely

\overset{\leftarrow}{\vartheta}_{i|i+1:T}=\Theta_{\Delta}(\overset{\leftarrow}{\vartheta}_{i+1|i+1:T}),\qquad\overset{\leftarrow}{\vartheta}_{i|i:T}=T(y_{i},\overset{\leftarrow}{\vartheta}_{i|i+1:T}),\qquad\overset{\leftarrow}{\vartheta}_{T|T}=T(y_{T},\theta_{0}). (21)

The following result extends Proposition 3 and Theorem 4 of [50]:

Proposition 3.2.

Let Assumptions 1-4 hold, and let $\nu_{0}=\pi$. Then, for $0\leq i\leq n-1$, we have

p(x_{i}|y_{0:n})=\sum_{\mathbf{m},\mathbf{n}\in\mathcal{M}}w_{\mathbf{m},\mathbf{n}}^{(i)}\,g(x_{i},d(\mathbf{m},\mathbf{n}),e(\overset{\leftarrow}{\vartheta}_{i|i+1:n},\vartheta_{i|0:i})), (22)

with

w_{\mathbf{m},\mathbf{n}}^{(i)}\propto\overset{\leftarrow}{\omega}_{\mathbf{m}}^{(i+1)}w_{\mathbf{n}}^{(i)}C_{\mathbf{m},\mathbf{n},\overset{\leftarrow}{\vartheta}_{i|i+1:n},\vartheta_{i|0:i}}, (23)
\overset{\leftarrow}{\omega}_{\mathbf{m}}^{(i+1)}=\sum_{\mathbf{n}\in\mathcal{M}}\overset{\leftarrow}{\omega}_{\mathbf{n}}^{(i+2)}\mu_{\mathbf{n},\overset{\leftarrow}{\vartheta}_{i+1|i+2:n}}(y_{i+1})\,p_{t(y_{i+1},\mathbf{n}),\mathbf{m}}(\Delta;\overset{\leftarrow}{\vartheta}_{i+1|i+1:n}),

with $w_{\mathbf{n}}^{(i)}$ as in (19) and $C_{\mathbf{m},\mathbf{n},\overset{\leftarrow}{\vartheta}_{i|i+1:n},\vartheta_{i|0:i}}$ as in (20).

Proof.

Note that Bayes' theorem and conditional independence allow one to write (16) as

\nu_{i|0:n}(x_{i})=p(x_{i}|y_{0:n})\propto p(y_{i+1:n}|x_{i})\,\nu_{i|0:i}(x_{i}), (24)

where the right-hand side involves the filtering distribution, available from Corollary 3.1, and the so-called cost-to-go function $p(y_{i+1:n}|x_{i})$ (sometimes called the information filter), which is the likelihood of future observations given the current signal state. Along the same lines as Proposition 3 in [50], we find that

p(y_{i+1:n}|x_{i})=\sum_{\mathbf{m}\in\mathcal{M}}\overset{\leftarrow}{\omega}_{\mathbf{m}}^{(i+1)}h(x_{i},\mathbf{m},\overset{\leftarrow}{\vartheta}_{i|i+1:n}), (25)

with $\overset{\leftarrow}{\omega}_{\mathbf{m}}^{(i+1)}$ as in the statement. The main claim can now be proved along the same lines as Theorem 4 in [50]. ∎

The main difference between the above result and Theorem 4 in [50] lies in the fact that the support of the weights $\{\overset{\leftarrow}{\omega}_{\mathbf{m}}^{(i+1)},\mathbf{m}\in\mathcal{M}\}$ (which in [50] is denoted by $\overset{\leftarrow}{M}_{i|i+1:n}$) can be countably infinite and coincide with the whole of $\mathcal{M}$. Indeed, which points of $\mathcal{M}$ receive positive weight is determined by the transition probabilities of the dual process, which in the present framework is no longer assumed to make only downward moves in $\mathcal{M}$. Section 6 will deal with this possibly infinite support in a concrete implementation of the inferential strategy.

4 Cox–Ingersoll–Ross hidden Markov models

The Cox–Ingersoll–Ross diffusion, also known as the square-root process, is widely used in financial mathematics for modelling short-term interest rates and stochastic volatility, see [10, 9, 29, 40, 37]. It also belongs to the class of continuous-state branching processes with immigration, arising as the large-population scaling limit of certain branching Markov chains [49] and as the time-evolving total mass of a Dawson–Watanabe branching measure-valued diffusion [20].

Let $X_{t}$ be a CIR diffusion on $\mathbb{R}_{+}$ that solves the one-dimensional SDE

\mathrm{d}X_{t}=\left(\delta\sigma^{2}-2\gamma X_{t}\right)\mathrm{d}t+2\sigma\sqrt{X_{t}}\,\mathrm{d}B_{t},\qquad X_{0}\geq 0, (26)

where $\delta,\gamma,\sigma>0$, which is reversible with respect to the Gamma density $\pi=\text{Ga}(\delta/2,\gamma/\sigma^{2})$. The following proposition identifies a B&D process as dual to the CIR diffusion.

Proposition 4.1.

Let $X_{t}$ be as in (26), let $M_{t}$ be a B&D process on $\mathbb{Z}_{+}$ which jumps from $m$ to $m+1$ at rate $\lambda_{m}=2\sigma^{2}(\delta/2+m)(\theta-\gamma/\sigma^{2})$ and to $m-1$ at rate $\mu_{m}=2\sigma^{2}\theta m$, and let

h(x,m,\theta)=\frac{\Gamma(\delta/2)}{\Gamma(\delta/2+m)}\Big(\frac{\gamma}{\sigma^{2}}\Big)^{-\delta/2}\theta^{\delta/2+m}x^{m}e^{-(\theta-\gamma/\sigma^{2})x}. (27)

Then (1) holds with $D_{t}=M_{t}$.

Proof.

The infinitesimal generator associated to (26) is

\mathcal{A}f(x)=(\delta\sigma^{2}-2\gamma x)f^{\prime}(x)+2\sigma^{2}xf^{\prime\prime}(x),

for f:+f:\mathbb{R}_{+}\rightarrow\mathbb{R} vanishing at infinity. Letting h(x,m)=h(x,m,θ)h(x,m)=h(x,m,\theta) denote (27), omitting the dependence on θ\theta to lighten notation, a direct computation yields

𝒜h(,m)(x)=\displaystyle\mathcal{A}h(\cdot,m)(x)= (δσ22γx)(mxm1xm(θγ/σ2))Γ(δ/2)Γ(δ/2+m)(γσ2)δ/2θδ/2+me(θγ/σ2)x\displaystyle\,(\delta\sigma^{2}-2\gamma x)\Big{(}mx^{m-1}-x^{m}(\theta-\gamma/\sigma^{2})\Big{)}\frac{\Gamma(\delta/2)}{\Gamma(\delta/2+m)}\Big{(}\frac{\gamma}{\sigma^{2}}\Big{)}^{-\delta/2}\theta^{\delta/2+m}e^{-(\theta-\gamma/\sigma^{2})x} (28)
+2σ2x(m(m1)xm2+xm(θγ/σ2)22mxm1(θγ/σ2))\displaystyle\,+2\sigma^{2}x\Big{(}m(m-1)x^{m-2}+x^{m}(\theta-\gamma/\sigma^{2})^{2}-2mx^{m-1}(\theta-\gamma/\sigma^{2})\Big{)} (29)
×Γ(δ/2)Γ(δ/2+m)(γσ2)δ/2θδ/2+me(θγ/σ2)x\displaystyle\,\times\frac{\Gamma(\delta/2)}{\Gamma(\delta/2+m)}\Big{(}\frac{\gamma}{\sigma^{2}}\Big{)}^{-\delta/2}\theta^{\delta/2+m}e^{-(\theta-\gamma/\sigma^{2})x} (30)
=\displaystyle= δσ2mθδ/2+m1h(x,m1)+2γ(θγ/σ2)δ/2+mθh(x,m+1)\displaystyle\,\frac{\delta\sigma^{2}m\theta}{\delta/2+m-1}h(x,m-1)+2\gamma(\theta-\gamma/\sigma^{2})\frac{\delta/2+m}{\theta}h(x,m+1) (31)
[2γm+δσ2(θγ/σ2)]h(x,m)+2σ2m(m1)θδ/2+m1h(x,m1)\displaystyle\,-[2\gamma m+\delta\sigma^{2}(\theta-\gamma/\sigma^{2})]h(x,m)+2\sigma^{2}m(m-1)\frac{\theta}{\delta/2+m-1}h(x,m-1) (32)
+2σ2(θγ/σ2)2δ/2+mθh(x,m+1)4σ2m(θγ/σ2)h(x,m)\displaystyle\,+2\sigma^{2}(\theta-\gamma/\sigma^{2})^{2}\frac{\delta/2+m}{\theta}h(x,m+1)-4\sigma^{2}m(\theta-\gamma/\sigma^{2})h(x,m) (33)
=\displaystyle=  2σ2θmh(x,m1)+2σ2(δ/2+m)(θγ/σ2)h(x,m+1)\displaystyle\,2\sigma^{2}\theta mh(x,m-1)+2\sigma^{2}(\delta/2+m)(\theta-\gamma/\sigma^{2})h(x,m+1)
[2γm+σ2(δ+4m)(θγ/σ2)]h(x,m).\displaystyle\,-[2\gamma m+\sigma^{2}(\delta+4m)(\theta-\gamma/\sigma^{2})]h(x,m). (34)

where it can be checked that

2σ2θm+2σ2(δ/2+m)(θγ/σ2)=2γm+σ2(δ+4m)(θγ/σ2).\displaystyle 2\sigma^{2}\theta m+2\sigma^{2}(\delta/2+m)(\theta-\gamma/\sigma^{2})=2\gamma m+\sigma^{2}(\delta+4m)(\theta-\gamma/\sigma^{2}). (35)

Hence the r.h.s. equals

g(m)=λm[g(m+1)g(m)]+μm[g(m1)g(m)]\mathcal{B}g(m)=\lambda_{m}[g(m+1)-g(m)]+\mu_{m}[g(m-1)-g(m)]

with g():=h(x,)g(\cdot):=h(x,\cdot), λm=2σ2(δ/2+m)(θγ/σ2)\lambda_{m}=2\sigma^{2}(\delta/2+m)(\theta-\gamma/\sigma^{2}), and μm=2σ2θm\mu_{m}=2\sigma^{2}\theta m, which is the infinitesimal generator of a B&D process with rates λm,μm\lambda_{m},\mu_{m}. The claim now follows from Proposition 1.2 in [44]. ∎

Assign now prior ν0=Ga(δ/2,γ/σ2)\nu_{0}=\text{Ga}(\delta/2,\gamma/\sigma^{2}) to X0X_{0}, and assume Poisson observations are collected at equally spaced intervals of length Δ\Delta. Specifically, Y|Xt=xiidPo(τx)Y|X_{t}=x\overset{\text{iid}}{\sim}\text{Po}(\tau x), for some τ>0\tau>0. By the well-known conjugacy to Gamma priors, we have that Xt|Y=yGa(δ/2+y,γ/σ2+τ)X_{t}|Y=y\sim\text{Ga}(\delta/2+y,\gamma/\sigma^{2}+\tau). Without loss of generality, we can set τ=1\tau=1, which allows one to interpret the update of the gamma rate parameter as the size of the conditioning data set. The filtering algorithm starts by updating the prior ν0\nu_{0} to ν0|0:=ϕY0(ν0)\nu_{0|0}:=\phi_{Y_{0}}(\nu_{0}). If we observe Y0=(Y0,1,,Y0,k)Y_{0}=(Y_{0,1},\ldots,Y_{0,k}) at time 0, then ν0|0\nu_{0|0} is the law of X0|j=1ky0,j=mGa(δ/2+m,γ/σ2+k)X_{0}|\sum_{j=1}^{k}y_{0,j}=m\sim\text{Ga}(\delta/2+m,\gamma/\sigma^{2}+k). Then ν0|0\nu_{0|0} is propagated forward for a Δ\Delta time interval, yielding ν1|0:=ψΔ(ν0|0)\nu_{1|0}:=\psi_{\Delta}(\nu_{0|0}). In light of Proposition 4.1, an application of (6) to ν0|0\nu_{0|0} yields the infinite mixture

ψΔ(Ga(δ/2+m,γ/σ2+k))=n0pm,n(Δ)Ga(δ/2+n,γ/σ2+k),\psi_{\Delta}\left(\text{Ga}(\delta/2+m,\gamma/\sigma^{2}+k)\right)=\sum_{n\geq 0}p_{m,n}(\Delta)\text{Ga}(\delta/2+n,\gamma/\sigma^{2}+k), (36)

where pm,n(t)p_{m,n}(t) are the transition probabilities of MtM_{t} in Proposition 4.1. Hence, the law of the signal is indexed by +\mathbb{Z}_{+}, the state space of the dual process. While after the update at time 0 mass one is assigned to the sum of the observations j=1ky0,j=m\sum_{j=1}^{k}y_{0,j}=m, after the propagation the mass is spread over the whole +\mathbb{Z}_{+} by the effect of the dual process. We then observe Y1fX1Y_{1}\sim f_{X_{1}} at time 1, which is used to update ν1|0\nu_{1|0} to ν1|1\nu_{1|1} and has the effect of shifting the probability masses of the mixture weights. For example, the weight pm,n(Δ)p_{m,n}(\Delta) in (36) is assigned to n+n\in\mathbb{Z}_{+}, but after the update based on Y1=(Y1,1,,Y1,k)Y_{1}=(Y_{1,1},\ldots,Y_{1,k^{\prime}}) it will be assigned to n+mn+m^{\prime} if j=1ky1,j=m\sum_{j=1}^{k^{\prime}}y_{1,j}=m^{\prime}, on top of being transformed according to (11). We then propagate forward again and proceed analogously.

When the current distribution of the signal, after the update, is given by a mixture of type m+wmGa(δ/2+m,γ/σ2+k)\sum_{m\in\mathbb{Z}_{+}}w_{m}\text{Ga}(\delta/2+m,\gamma/\sigma^{2}+k), it is enough to rearrange the mixture weights after the propagation step as done in (7).
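In a concrete implementation the countable mixture must be truncated to a finite set of dual states, and the weight rearrangement just described reduces to a matrix–vector product. The following is a minimal sketch, assuming a precomputed (or Monte Carlo estimated) matrix `P` of dual transition probabilities over the truncated state space; the function name and the final renormalization, which absorbs the truncation error, are illustrative rather than part of the paper's algorithm.

```python
import numpy as np

def propagate_weights(w, P):
    """Propagation step: w'_n = sum_m w_m p_{m,n}(Delta), cf. (36).

    w : weights of the current Gamma mixture over dual states {0, ..., M}
    P : (M+1) x (M+1) matrix, P[m, n] approximating p_{m,n}(Delta)
    """
    w_new = w @ P
    return w_new / w_new.sum()  # renormalize to absorb truncation error
```

After the next data batch is collected, each component's Gamma parameters are updated by conjugacy and the weights are reweighted and shifted as described above, according to (11).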

The main difference with qualitatively similar equations found in [50] is now given by the transition probabilities pm,n(t)p_{m,n}(t) in (36), which are those of the B&D process in Proposition 4.1. Before tackling the problem of how to use the above expressions for inference, we try to provide further intuition about the extent and implications of such differences. To this end, consider the simplified parameterization α=δ/2,β=γ/σ2,σ2=1/2,τ=1\alpha=\delta/2,\beta=\gamma/\sigma^{2},\sigma^{2}=1/2,\tau=1, under which one can check that the embedded chain of the B&D process of Proposition 4.1 has jump probabilities

pm,m+1=k(α+m)k(α+m)+m(β+k),pm,m1=1pm,m+1.\displaystyle p_{m,m+1}=\frac{k(\alpha+m)}{k(\alpha+m)+m(\beta+k)},\quad\quad p_{m,m-1}=1-p_{m,m+1}.

Here m,km,k are the same as in the left-hand side of (36), so m/km/k is the sample mean. It is easily verified that pm,m+1<pm,m1p_{m,m+1}<p_{m,m-1} if m/k>α/βm/k>\alpha/\beta and vice versa. Therefore, the dual evolves on +\mathbb{Z}_{+} so as to revert m/km/k toward the prior mean α/β\alpha/\beta. Indeed, the dual has Negative Binomial ergodic distribution NBin(α,β/(β+k))\text{NBin}\left(\alpha,\beta/(\beta+k)\right), whose mean is kα/βk\alpha/\beta, i.e., such that m/km/k on average coincides with α/β\alpha/\beta.
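The mean-reverting behaviour of the embedded chain is easy to verify numerically. The snippet below is a small sanity check with illustrative values of α\alpha, β\beta and kk, not tied to the experiments of Section 6.

```python
from fractions import Fraction

def p_up(m, k, alpha, beta):
    """Jump probability m -> m+1 of the embedded chain of the B&D dual."""
    num = k * (alpha + m)
    return Fraction(num, num + m * (beta + k))

# With alpha/beta = 2, the chain drifts downward when the sample mean
# m/k exceeds 2 and upward when it falls below 2.
alpha, beta, k = 4, 2, 10
assert p_up(30, k, alpha, beta) < Fraction(1, 2)  # m/k = 3 > alpha/beta
assert p_up(10, k, alpha, beta) > Fraction(1, 2)  # m/k = 1 < alpha/beta
```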

Recall now that the dual process elicited in [53] for the CIR model is Dt=(Mt,Θt)D_{t}=(M_{t},\Theta_{t}), with MtM_{t} a pure-death process with rates from mm to m1m-1 equal to 2σ2θm2\sigma^{2}\theta m and Θt\Theta_{t} a deterministic process that solves dΘt/dt=2σ2Θt(Θtγ/σ2)\mathrm{d}\Theta_{t}/\mathrm{d}t=-2\sigma^{2}\Theta_{t}(\Theta_{t}-\gamma/\sigma^{2}), Θ0=θ\Theta_{0}=\theta. This dual has a single ergodic state given by (0,β)(0,\beta) (note that [53] uses a slightly different parameterization, where the ergodic state (0,β)(0,\beta) means that, in the limit for tt\rightarrow\infty, the gamma parameters are the prior parameters). In particular, as tt\rightarrow\infty, this entails the convergence of pm,n(t)p_{m,n}(t) in (36) to 1 if n=0n=0 and 0 elsewhere. Whence the strong ergodic convergence ψt(h(x,m,θ)π)π\psi_{t}(h(x,m,\theta)\pi)\rightarrow\pi as tt\rightarrow\infty, whereby the effect of the observed data becomes progressively negligible as tt increases. One could then argue that in the long run, the filtering strategy based on the pure-death dual process in [53] completely forgets the collected data. As a consequence, one could expect filtering with long-spaced observations (relative to the forward process autocorrelation) to be similar to using independent priors at each data collection point. On the other hand, the B&D dual can be thought of as not forgetting but rather spreading around the probability mass in such a way as to preserve convergence of the empirical mean to the prior mean. It is not obvious a priori which of these two scenarios could be more beneficial in terms of filtering, hence in Section 6.1 we provide numerical experiments for comparing the performance of strategies based on these different duals.

In view of such experiments, note that the transition probabilities of the above B&D dual are in principle available in closed form (cf. [5, 17]), but their computation is prone to numerical instability. Alternatively, we can approximate the transition probabilities pm,n(t)p_{m,n}(t) in (36) by drawing NN sample paths of the dual started at mm and using the empirical distribution of the arrival points. This can in principle be done through the Gillespie algorithm [36], which alternates sampling waiting times and jumps of the embedded chain. A faster strategy can be achieved by writing the B&D rates in Proposition 4.1 as λm=λm+β\lambda_{m}=\lambda m+\beta and μm=μm\mu_{m}=\mu m with

λ=2σ2(θγ/σ2),β=σ2δ(θγ/σ2),μ=2σ2θ,\displaystyle\lambda=2\sigma^{2}(\theta-\gamma/\sigma^{2}),\qquad\beta=\sigma^{2}\delta(\theta-\gamma/\sigma^{2}),\qquad\mu=2\sigma^{2}\theta, (37)

where λ,μ\lambda,\mu represent the per capita birth and death rates and β\beta is the immigration rate. Then write Mt=At+BtM_{t}=A_{t}+B_{t}, where AtA_{t} is the number of descendants of the autochthonous individuals (those already in the population at t=0t=0) and BtB_{t} the number of descendants of the immigrants. These rates define a linear B&D process with immigration, for which [57] suggests simulating AtA_{t} by drawing, given A0=iA_{0}=i,

FBin(i,g(t)),AtNBin(F,h(t))+F,F\sim\text{Bin}(i,g(t)),\qquad A_{t}\sim\text{NBin}(F,h(t))+F, (38)

with h(t)=(λμ)/(λexp{(λμ)t}μ)h(t)=(\lambda-\mu)/(\lambda\exp\{(\lambda-\mu)t\}-\mu) and g(t)=h(t)exp{(λμ)t}g(t)=h(t)\exp\{(\lambda-\mu)t\}, with the convention NBin(0,p)=δ0(0,p)=\delta_{0}. Let now NsN_{s} be the number of immigrants up to time ss, which follows a simple Poisson process with rate β\beta, so given NtN_{t} the arrival times are uniformly distributed on [0,t][0,t]. Once in the population, the lineage of each immigrating individual follows again a B&D process and can be simulated using (38) starting at i=1i=1. Summing the sizes of the immigrant families at time tt yields BtB_{t}.
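The two-step construction above can be sketched in code. The following Python fragment draws one realization of Mt=At+BtM_{t}=A_{t}+B_{t}; it is an illustrative sketch rather than the authors' implementation, and it assumes that NumPy's negative binomial (number of failures before a given number of successes) matches the convention in (38).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_linear_bd(i, lam, mu, t):
    """One draw of A_t for a linear B&D process with per capita birth rate
    lam and death rate mu, started at i individuals, via (38).
    Assumes lam != mu (in the CIR dual, lam - mu = -2*gamma < 0)."""
    if i == 0 or t == 0.0:
        return i
    r = lam - mu
    h = r / (lam * np.exp(r * t) - mu)      # success probability h(t)
    g = h * np.exp(r * t)                   # thinning probability g(t)
    F = rng.binomial(i, g)
    if F == 0:                              # convention NBin(0, p) = delta_0
        return 0
    return F + rng.negative_binomial(F, h)

def sample_bd_with_immigration(m0, lam, beta, mu, t):
    """One draw of M_t = A_t + B_t for the dual with rates lam*m + beta (up)
    and mu*m (down): autochthonous part plus immigrant families."""
    A = sample_linear_bd(m0, lam, mu, t)
    arrivals = rng.uniform(0.0, t, size=rng.poisson(beta * t))
    B = sum(sample_linear_bd(1, lam, mu, t - s) for s in arrivals)
    return A + B
```

Repeating `sample_bd_with_immigration` NN times from a given mm and tabulating the arrival states gives the Monte Carlo estimate of pm,n(Δ)p_{m,n}(\Delta) used in (36).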

5 Wright–Fisher hidden Markov models

The KK-dimensional WF diffusion is a widely studied classical model in population genetics (see [21, 19, 22, 41] and references therein), recently used also in a statistical framework [53, 50]. See also [12] for connections with statistical physics. It takes values in the simplex

ΔK={𝐱[0,1]K:1iKxi=1}\Delta_{K}=\bigg{\{}\mathbf{x}\in[0,1]^{K}:\sum_{1\leq i\leq K}x_{i}=1\bigg{\}}

and, in the population genetics interpretation, it models the temporal evolution of KK proportions of types in an underlying large population. Its infinitesimal generator on C2(ΔK)C^{2}(\Delta_{K}) is

𝒜=12i,j=1Kxi(δijxj)2xixj+12i=1K(αiθxi)xi\mathcal{A}=\frac{1}{2}\sum_{i,j=1}^{K}x_{i}(\delta_{ij}-x_{j})\frac{\partial^{2}}{\partial x_{i}\partial x_{j}}+\frac{1}{2}\sum_{i=1}^{K}(\alpha_{i}-\theta x_{i})\frac{\partial}{\partial x_{i}} (39)

for 𝜶=(α1,,αK)+K\bm{\alpha}=(\alpha_{1},\ldots,\alpha_{K})\in\mathbb{R}_{+}^{K}, θ=i=1Kαi\theta=\sum_{i=1}^{K}\alpha_{i}, and its reversible measure is the Dirichlet distribution whose density with respect to Lebesgue measure is

π𝜶(𝐱)=Γ(θ)i=1KΓ(αi)x1α11xKαK1,xK=1i=1K1xi.\pi_{\bm{\alpha}}(\mathbf{x})=\frac{\Gamma(\theta)}{\prod_{i=1}^{K}\Gamma(\alpha_{i})}x_{1}^{\alpha_{1}-1}\cdots x_{K}^{\alpha_{K}-1},\quad x_{K}=1-\sum_{i=1}^{K-1}x_{i}. (40)

See for example [21], Chapter 10. The transition density of this model is (cf., e.g., [19], eqn. (1.27))

pt(𝐱,𝐱)=m=0dm(t)𝐦+K:|𝐦|=mMN(𝐦;m,𝐱)π𝜶+𝐦(𝐱),p_{t}(\mathbf{x},\mathbf{x}^{\prime})=\sum_{m=0}^{\infty}d_{m}(t)\sum_{\mathbf{m}\in\mathbb{Z}_{+}^{K}:|\mathbf{m}|=m}\text{MN}(\mathbf{m};m,\mathbf{x})\pi_{\bm{\alpha}+\mathbf{m}}(\mathbf{x}^{\prime}), (41)

where MN(𝐦;m,𝐱)=(mm1,,mK)i=1Kximi\text{MN}(\mathbf{m};m,\mathbf{x})=\binom{m}{m_{1},\ldots,m_{K}}\prod_{i=1}^{K}x_{i}^{m_{i}}, and where dm(t)d_{m}(t) are the transition probabilities of the block counting process of Kingman’s coalescent on +\mathbb{Z}_{+}, which has an entrance boundary at \infty. Cf., e.g., [19], eqn. (1.12).

It is well known that a version of Kingman’s typed coalescent with mutation is dual to the WF diffusion. This can be seen as a death process on +K\mathbb{Z}_{+}^{K} which jumps from 𝐦\mathbf{m} to 𝐦𝐞i\mathbf{m}-\mathbf{e}_{i} at rate

q𝐦,𝐦𝐞i=mi(θ+|𝐦|1)/2.q_{\mathbf{m},\mathbf{m}-\mathbf{e}_{i}}=m_{i}(\theta+|\mathbf{m}|-1)/2. (42)

Here 𝐞i\mathbf{e}_{i} is the canonical vector in the ii-th direction. See, for example, [23, 39, 24]; see also [53], Section 3.3. The above death process with transitions dm(t)d_{m}(t) is indeed the process that counts the surviving blocks of the typed version without keeping track of which types have been removed.

Recall now that a Moran model with NN individuals of KK types is a particle process with overlapping generations whereby at discrete times a uniformly chosen individual is removed and another, uniformly chosen from the remaining individuals, produces one offspring of its own type, leaving the total population size constant. See, e.g., [22]. In the presence of mutation, upon reproduction, the offspring can mutate to type jj at parent-independent rate αj\alpha_{j}. The generator of such process on the set B(+K)B(\mathbb{Z}_{+}^{K}) of bounded functions on +K\mathbb{Z}_{+}^{K} can be written in terms of the multiplicities of types 𝐧+K\mathbf{n}\in\mathbb{Z}_{+}^{K} as

f(𝐧)=121ijKni(αj+nj)f(𝐧𝐞i+𝐞j)121ijKni(αj+nj)f(𝐧),\mathcal{B}f(\mathbf{n})=\frac{1}{2}\sum_{1\leq i\neq j\leq K}n_{i}(\alpha_{j}+n_{j})f(\mathbf{n}-\mathbf{e}_{i}+\mathbf{e}_{j})-\frac{1}{2}\sum_{1\leq i\neq j\leq K}n_{i}(\alpha_{j}+n_{j})f(\mathbf{n}), (43)

where an individual of type ii is removed at rate nin_{i}, the number of individuals of type ii, and is replaced by an individual of type jj at rate αj+nj\alpha_{j}+n_{j}.

The following proposition extends a result in [12] (cf. Section 5) and shows that the above Moran model is dual to the WF diffusion with generator (39).

Proposition 5.1.

Let XtX_{t} have generator (39), let Nt+KN_{t}\in\mathbb{Z}_{+}^{K} be a Moran model which from 𝐧\mathbf{n} jumps to 𝐧𝐞i+𝐞j\mathbf{n}-\mathbf{e}_{i}+\mathbf{e}_{j} at rate ni(αj+nj)/2n_{i}(\alpha_{j}+n_{j})/2, and let

h(𝐱,𝐧)=Γ(θ+|𝐧|)Γ(θ)i=1KΓ(αi)Γ(αi+ni)xini,θ=i=1Kαi.h(\mathbf{x},\mathbf{n})=\frac{\Gamma(\theta+|\mathbf{n}|)}{\Gamma(\theta)}\prod_{i=1}^{K}\frac{\Gamma\left(\alpha_{i}\right)}{\Gamma\left(\alpha_{i}+n_{i}\right)}x_{i}^{n_{i}},\quad\quad\theta=\sum_{i=1}^{K}\alpha_{i}. (44)

Then (1) holds with Dt=NtD_{t}=N_{t} and hh as above.

Proof.

From (39), since θ=i=1Kαi\theta=\sum_{i=1}^{K}\alpha_{i}, we can write

2𝒜=\displaystyle 2\mathcal{A}= 1iKxi(1xi)2xi21ijKxixj2xixj+1iK(αi(1xi)xi1jK,jiαj)xi\displaystyle\,\sum_{1\leq i\leq K}x_{i}(1-x_{i})\frac{\partial^{2}}{\partial x_{i}^{2}}-\sum_{1\leq i\neq j\leq K}x_{i}x_{j}\frac{\partial^{2}}{\partial x_{i}\partial x_{j}}+\sum_{1\leq i\leq K}(\alpha_{i}(1-x_{i})-x_{i}\sum_{1\leq j\leq K,j\neq i}\alpha_{j})\frac{\partial}{\partial x_{i}} (45)
=\displaystyle=\,\sum_{1\leq i\neq j\leq K}x_{i}x_{j}\frac{\partial^{2}}{\partial x_{i}^{2}}-\sum_{1\leq i\neq j\leq K}x_{i}x_{j}\frac{\partial^{2}}{\partial x_{i}\partial x_{j}}+\sum_{1\leq i\leq K}\alpha_{i}\sum_{1\leq j\leq K,j\neq i}x_{j}\frac{\partial}{\partial x_{i}}-\sum_{1\leq i\neq j\leq K}\alpha_{j}x_{i}\frac{\partial}{\partial x_{i}}. (46)

Then one can check that

2𝒜h(𝐱,𝐧)=\displaystyle 2\mathcal{A}h(\mathbf{x},\mathbf{n})= 1ijKni(αi+ni1)Γ(θ+|𝐧|)Γ(θ)𝐱𝐧𝐞i+𝐞jh=1KΓ(αh)Γ(αh+nh)\displaystyle\,\sum_{1\leq i\neq j\leq K}n_{i}(\alpha_{i}+n_{i}-1)\frac{\Gamma(\theta+|\mathbf{n}|)}{\Gamma(\theta)}\mathbf{x}^{\mathbf{n}-\mathbf{e}_{i}+\mathbf{e}_{j}}\prod_{h=1}^{K}\frac{\Gamma\left(\alpha_{h}\right)}{\Gamma\left(\alpha_{h}+n_{h}\right)} (47)
1ijKni(αj+nj)𝐱𝐧Γ(θ+|𝐧|)Γ(θ)h=1KΓ(αh)Γ(αh+nh)\displaystyle\,-\sum_{1\leq i\neq j\leq K}n_{i}(\alpha_{j}+n_{j})\mathbf{x}^{\mathbf{n}}\frac{\Gamma(\theta+|\mathbf{n}|)}{\Gamma(\theta)}\prod_{h=1}^{K}\frac{\Gamma\left(\alpha_{h}\right)}{\Gamma\left(\alpha_{h}+n_{h}\right)} (48)
=\displaystyle= 1ijKni(αj+nj)h(𝐱,𝐧𝐞i+𝐞j)1ijKni(αj+nj)h(𝐱,𝐧).\displaystyle\,\sum_{1\leq i\neq j\leq K}n_{i}(\alpha_{j}+n_{j})h(\mathbf{x},\mathbf{n}-\mathbf{e}_{i}+\mathbf{e}_{j})-\sum_{1\leq i\neq j\leq K}n_{i}(\alpha_{j}+n_{j})h(\mathbf{x},\mathbf{n}). (49)

Hence we have

(𝒜h(,𝐧))(𝐱)=(h(𝐱,))(𝐧)(\mathcal{A}h(\cdot,\mathbf{n}))(\mathbf{x})=(\mathcal{B}h(\mathbf{x},\cdot))(\mathbf{n})

where the right hand side is (43) applied to h(𝐱,𝐧)h(\mathbf{x},\mathbf{n}) as a function of 𝐧\mathbf{n}. The claim now follows from Proposition 1.2 in [44]. ∎

Assign now prior ν0=π𝜶\nu_{0}=\pi_{\bm{\alpha}} to X0X_{0}, and assume categorical observations so that (Y=j|Xt=x)=xj\mathbb{P}(Y=j|X_{t}=x)=x_{j}. By the well-known conjugacy to Dirichlet priors, we have Xt|Y=yπ𝜶+δyX_{t}|Y=y\sim\pi_{\bm{\alpha}+\delta_{y}}, where 𝜶+δy=(α1,,αj+1,,αK)\bm{\alpha}+\delta_{y}=(\alpha_{1},\ldots,\alpha_{j}+1,\ldots,\alpha_{K}) if y=jy=j. When multiple categorical observations with vector of multiplicities 𝐦+K\mathbf{m}\in\mathbb{Z}_{+}^{K} are collected, we write π𝜶+𝐦\pi_{\bm{\alpha}+\mathbf{m}}. The filtering algorithm then proceeds by first updating ν0\nu_{0} to ν0|0:=ϕY0(ν0)=π𝜶+𝐦\nu_{0|0}:=\phi_{Y_{0}}(\nu_{0})=\pi_{\bm{\alpha}+\mathbf{m}}, if Y0=(Y0,1,,Y0,k)Y_{0}=(Y_{0,1},\ldots,Y_{0,k}) yields multiplicities 𝐦\mathbf{m}, then propagating ν0|0\nu_{0|0} to ν1|0:=ψΔ(ν0|0)\nu_{1|0}:=\psi_{\Delta}(\nu_{0|0}). In light of the previous result, an application of (6) to ν0|0=π𝜶+𝐦\nu_{0|0}=\pi_{\bm{\alpha}+\mathbf{m}} yields the mixture

ψΔ(π𝜶+𝐦)=𝐧:|𝐧|=|𝐦|p𝐦,𝐧(Δ)π𝜶+𝐧,\psi_{\Delta}\left(\pi_{\bm{\alpha}+\mathbf{m}}\right)=\sum_{\mathbf{n}:|\mathbf{n}|=|\mathbf{m}|}p_{\mathbf{m},\mathbf{n}}(\Delta)\pi_{\bm{\alpha}+\mathbf{n}}, (50)

where p𝐦,𝐧(Δ)p_{\mathbf{m},\mathbf{n}}(\Delta) are the transition probabilities of NtN_{t} in Proposition 5.1 over the interval Δ\Delta. We then observe Y1|X1Y_{1}|X_{1}, which is in turn used to update ν1|0\nu_{1|0} to ν1|1\nu_{1|1}, and so forth. We again refer the reader to [50], Section 2.4.2, for details on qualitatively similar recursive formulae.

In (50), the overall multiplicity |𝐧||\mathbf{n}| equals the original |𝐦||\mathbf{m}|, as an effect of the population size preservation provided by the Moran model. The space {𝐧:|𝐧|=|𝐦|}\{\mathbf{n}:|\mathbf{n}|=|\mathbf{m}|\} is finite, which shows that Assumption 3 does not require a death-like dual process in order for the filtering distributions to be finite mixtures. However, it is not obvious a priori how (50) compares in terms of practical implementation with the different representation obtained in [53], namely

ψΔ(π𝜶+𝐦)=𝐧:|𝐧||𝐦|p^𝐦,𝐧(Δ)π𝜶+𝐧\psi_{\Delta}\left(\pi_{\bm{\alpha}+\mathbf{m}}\right)=\sum_{\mathbf{n}:|\mathbf{n}|\leq|\mathbf{m}|}\hat{p}_{\mathbf{m},\mathbf{n}}(\Delta)\pi_{\bm{\alpha}+\mathbf{n}} (51)

where p^𝐦,𝐧(Δ)\hat{p}_{\mathbf{m},\mathbf{n}}(\Delta) are the transition probabilities of the death process on +K\mathbb{Z}_{+}^{K} with rates (42). Similarly to what has already been discussed for the CIR case, the death process dual has a single ergodic state given by the origin (0,,0)(0,\ldots,0), which entails the convergence of p^𝐦,𝐧(t)\hat{p}_{\mathbf{m},\mathbf{n}}(t) to 11 if 𝐧=(0,,0)\mathbf{n}=(0,\ldots,0) and 0 elsewhere, implying the strong convergence ψt(π𝜶+𝐦)π𝜶\psi_{t}\left(\pi_{\bm{\alpha}+\mathbf{m}}\right)\rightarrow\pi_{\bm{\alpha}} in (51). This is ultimately determined by the fact that Kingman’s coalescent removes lineages by coalescence and mutation until absorption to the empty set.

At first glance, a similar convergence seems precluded for (50). However, we note in the first sum of (43) that the new particle’s type is either resampled from the surviving particles or drawn from the baseline distribution, in which case the new particle is of type jj with (parent-independent) probability αj/i=1Kαi\alpha_{j}/\sum_{i=1}^{K}\alpha_{i}. Hence each particle will be resampled from the baseline distribution in finite time. Together with the fact that j=1Kπ𝜶+δjαj/i=1Kαi=π𝜶\sum_{j=1}^{K}\pi_{\bm{\alpha}+\delta_{j}}\alpha_{j}/\sum_{i=1}^{K}\alpha_{i}=\pi_{\bm{\alpha}}, which follows from Proposition G.9 in [32], and considering that the number of particles is finite, we can expect that, as tt\rightarrow\infty, the convergence ψt(π𝜶+𝐦)π𝜶\psi_{t}\left(\pi_{\bm{\alpha}+\mathbf{m}}\right)\rightarrow\pi_{\bm{\alpha}} holds in (50) also for this case.

The transition probabilities p𝐦,𝐧(t)p_{\mathbf{m},\mathbf{n}}(t) in (50), induced by the Moran model, are not available in closed form. This poses a limit on the direct applicability of the presented algorithms for numerical experiments. The first alternative is then to approximate them by drawing NN points from the discrete distribution on the dual space before the propagation, making use of the Gillespie algorithm to draw as many paths, and evaluating the empirical distribution of the arrival points. Alternative approximations are suggested by the fact that an appropriately rescaled version of the Moran model converges in distribution to a WF diffusion (see, e.g., [22], Lemma 2.39). Indeed, a spatial rescaling of the Moran model in (43) to get proportions in place of multiplicities of types results in the generator

𝒞|𝐧|f(𝐱)=1ijKni|𝐧|αj+nj|𝐧|[f(𝐱𝐞i|𝐧|+𝐞j|𝐧|)f(𝐱)],\mathcal{C}_{|\mathbf{n}|}f(\mathbf{x})=\sum_{1\leq i\neq j\leq K}\frac{n_{i}}{|\mathbf{n}|}\frac{\alpha_{j}+n_{j}}{|\mathbf{n}|}\bigg{[}f\bigg{(}\mathbf{x}-\frac{\mathbf{e}_{i}}{|\mathbf{n}|}+\frac{\mathbf{e}_{j}}{|\mathbf{n}|}\bigg{)}-f(\mathbf{x})\bigg{]}, (52)

where xi:=ni/|𝐧|x_{i}:=n_{i}/|\mathbf{n}|. A classical argument based on a Taylor expansion for ff now leads to writing |𝐧|2𝒞|𝐧|f=𝒜f+O(|𝐧|1)|\mathbf{n}|^{2}\mathcal{C}_{|\mathbf{n}|}f=\mathcal{A}f+O(|\mathbf{n}|^{-1}), with 𝒜\mathcal{A} as in (39) and where O(|𝐧|1)O(|\mathbf{n}|^{-1}) represents a remainder term which goes to zero as |𝐧|1|\mathbf{n}|^{-1}. The claim can then be established by classical arguments following, e.g., [21], Theorem 4.8.7. We could therefore use a WF diffusion to approximate the Moran dual transitions in (50). Since the spatially rescaled Moran model takes values 𝐦/|𝐧|\mathbf{m}/|\mathbf{n}| with 𝐦\mathbf{m}\in\mathcal{M} such that 0|𝐦||𝐧|0\leq|\mathbf{m}|\leq|\mathbf{n}|, to the above end it suffices to discretize the states of the WF diffusion through binning: given a state 𝐱\mathbf{x} of the approximating WF diffusion, we take as state of the Moran model the point [|𝐧|𝐱]:=([|𝐧|x1],,[|𝐧|xK])[|\mathbf{n}|\mathbf{x}]:=([|\mathbf{n}|x_{1}],\ldots,[|\mathbf{n}|x_{K}]), where [|𝐧|xi][|\mathbf{n}|x_{i}] is the approximation of |𝐧|xi|\mathbf{n}|x_{i} to the closest integer in {0,,|𝐧|}\{0,\ldots,|\mathbf{n}|\}. The functionals of interest can thus be evaluated through the same procedure which uses the original Moran model, i.e., through (50), based on the WF diffusion transition probabilities. This strategy in principle has the drawback of having to deal with the intractable terms dm(t)d_{m}(t) in the transition function expansion (41) of the diffusion, a hurdle overcome by adopting the solution proposed by [45].

It is known that one could also construct a sequence of WF discrete Markov chains with non-overlapping generations, indexed by the population size, which upon appropriate rescaling converges weakly to the desired WF diffusion (see, e.g., [48], Sec. 15.2.F, or [22], Sec. 4.1). Since two sequences that converge to the same limit can to some extent be considered close to each other, one could then consider a WF discrete chain indexed by |𝐧||\mathbf{n}| with a parameterization that makes it converge to (39), and use it to approximate the Moran transition probabilities. This would permit a straightforward implementation, since WF discrete chains have multinomial transitions. In Section 6.2 we compare the performance of all the above-mentioned strategies.
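As an illustration of the latter strategy, the sketch below propagates the Moran dual state through a discrete WF chain with multinomial transitions. The scaling chosen here (a population of N=|𝐧|N=|\mathbf{n}| individuals, N t\lceil Nt\rceil generations per time interval tt, and a per-generation drift (𝜶θ𝐱)/(2N)(\bm{\alpha}-\theta\mathbf{x})/(2N)) is one parameterization consistent with convergence to (39), but the constants are our assumption and should be checked against the intended limit; the function name is illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def wf_chain_propagate(m, alpha, t):
    """One approximate draw of the Moran dual arrival state started at
    counts m, via a WF discrete chain with multinomial transitions."""
    m = np.asarray(m, dtype=float)
    N = int(m.sum())
    if N == 0:
        return m.astype(int)
    theta = alpha.sum()
    x = m / N
    for _ in range(int(np.ceil(N * t))):
        p = x + (alpha - theta * x) / (2.0 * N)  # mutation pull toward alpha/theta
        p = np.clip(p, 0.0, None)
        p = p / p.sum()
        x = rng.multinomial(N, p) / N            # resampling step
    return np.rint(N * x).astype(int)            # counts; |n| is preserved
```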

6 Numerical experiments

To illustrate how the above results can be used in practice and how they perform in comparison with other methods, we are going to consider particle approximations of the dual processes for evaluating their transition probabilities, which in turn are used in (7) in place of the true transition probabilities to evaluate the predictive distributions for the signal, denoted here p^(xk+1|y1:k)\hat{p}(x_{k+1}|y_{1:k}). We compare these distributions with the exact predictive distribution obtained through the results in [53] and with those obtained through bootstrap particle filtering, which makes use of the signal transition function. Particle filtering can be considered the state of the art for this type of inferential problem, a general reference being [15]. The notable difference between these two approaches is that bootstrap particle filtering operates on the original state space of the signal, whereas filtering based on dual processes indexes the filtering mixtures using the dual state space, which in the present framework is discrete.

We first briefly describe the specific particle approximation on the dual space we are going to use. To approximate a predictive distribution νi|0:i1(xi)\nu_{i|0:i-1}(x_{i}), the classical particle approximation used in bootstrap particle filtering can be described as follows:

  • \bullet

    sample Xi1(m)iidνi1|0:i1X_{i-1}^{(m)}\overset{\text{iid}}{\sim}\nu_{i-1|0:i-1}, m=1,,Nm=1,\ldots,N;

  • \bullet

    propagate the particles by sampling Xi(m)pt(Xi1(m),)X_{i}^{(m)}\sim p_{t}(X_{i-1}^{(m)},\cdot), with ptp_{t} the signal transition density;

  • \bullet

    estimate νi|0:i1\nu_{i|0:i-1} with ν^i|0:i1:=N1m=1NδXi(m)\hat{\nu}_{i|0:i-1}:=N^{-1}\sum_{m=1}^{N}\delta_{X_{i}^{(m)}}.
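For the CIR example, the propagation step above can be sketched as follows, using a Gamma–Poisson mixture representation of the CIR transition density; the specific parameterization below, chosen to match the stationary law Ga(δ/2,γ/σ2)\text{Ga}(\delta/2,\gamma/\sigma^{2}) of (26), is our reconstruction and should be checked against [50], Section 5.

```python
import numpy as np

rng = np.random.default_rng(0)

def cir_propagate_particles(x, delta, gamma, sigma2, dt):
    """Bootstrap propagation: one CIR transition of length dt applied to each
    particle. Assumed representation: X_{t+dt} | X_t = x ~ Ga(delta/2 + J, b)
    with J ~ Po(x * exp(-2*gamma*dt) * b), b = gamma / (sigma2 * (1 - exp(-2*gamma*dt)))."""
    e = np.exp(-2.0 * gamma * dt)
    b = gamma / (sigma2 * (1.0 - e))
    j = rng.poisson(x * e * b)                  # Poisson mixing indices
    return rng.gamma(delta / 2.0 + j, 1.0 / b)  # numpy gamma uses scale = 1/rate
```

The predictive estimate ν^i|0:i1\hat{\nu}_{i|0:i-1} is then the empirical measure of the returned particles.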

For what concerns the use of dual processes, we are going to operate similarly to bootstrap particle filtering but on the dual space. The filtering distributions considered in this work are mixtures of the form νi1|0:i1(xi1)=𝐦w𝐦h(xi1,𝐦)π(xi1)\nu_{i-1|0:i-1}(x_{i-1})=\sum\nolimits_{\mathbf{m}}w_{\mathbf{m}}h(x_{i-1},\mathbf{m})\pi(x_{i-1}). An estimate of these can be obtained through a particle approximation of the discrete mixing measure, that is we draw 𝐦(n)iid𝐦w𝐦δ𝐦\mathbf{m}^{(n)}\overset{\text{iid}}{\sim}\sum_{\mathbf{m}}w_{\mathbf{m}}\delta_{\mathbf{m}}, n=1,,Nn=1,\ldots,N, to obtain ν^i1|0:i1(xi1):=N1n=1Nh(xi1,𝐦(n))π(xi1)\hat{\nu}_{i-1|0:i-1}(x_{i-1}):=N^{-1}\sum\nolimits_{n=1}^{N}h(x_{i-1},\mathbf{m}^{(n)})\pi(x_{i-1}). The natural approximation of νi|0:i1(xi)\nu_{i|0:i-1}(x_{i}) is therefore as follows:

  • \bullet

    sample 𝐦(n)iid𝐦w𝐦δ𝐦\mathbf{m}^{(n)}\overset{\text{iid}}{\sim}\sum_{\mathbf{m}}w_{\mathbf{m}}\delta_{\mathbf{m}};

  • \bullet

    propagate the particles by sampling 𝐧(n)p𝐦(n),(t)\mathbf{n}^{(n)}\sim p_{\mathbf{m}^{(n)},\cdot}(t), with p𝐦(n),(t)p_{\mathbf{m}^{(n)},\cdot}(t) the transition probabilities of the dual process;

  • \bullet

    estimate νi|0:i1(xi)\nu_{i|0:i-1}(x_{i}) with ν^i|0:i1(xi):=N1n=1Nh(xi,𝐧(n))π(xi)\hat{\nu}_{i|0:i-1}(x_{i}):=N^{-1}\sum\nolimits_{n=1}^{N}h(x_{i},\mathbf{n}^{(n)})\pi(x_{i}).
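The steps above admit a compact generic implementation. The sketch below assumes scalar dual states, as in the CIR example, and a user-supplied sampler `dual_step` of the dual transition; for vector-valued duals one would tabulate arrival states with `np.unique(..., axis=0)` instead.

```python
import numpy as np

rng = np.random.default_rng(0)

def dual_particle_predictive(states, weights, dual_step, n_particles):
    """Particle approximation of the predictive mixing measure on the dual
    space: resample dual states by weight, propagate each through the dual
    transition, and return the empirical law of the arrival states."""
    idx = rng.choice(len(states), size=n_particles, p=weights)
    arrivals = [dual_step(states[i]) for i in idx]
    out_states, counts = np.unique(arrivals, return_counts=True)
    return out_states, counts / n_particles
```

Each arrival state then contributes a mixture component h(x,𝐧)π(x)h(x,\mathbf{n})\pi(x) with weight equal to its empirical frequency.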

Here some important remarks are in order. The above dual particle approximation is a finite mixture approximation of a mixture which can be either finite or infinite. Hence the above strategy can be applied to filtering both with death-like duals and with general duals on discrete state spaces. The quality of the dual particle approximation may in general differ from that obtained through the particle filtering approximation, since the particles live on a discrete space in the first case and on a continuous space in the second. This is the object of the following sections, at least for two specific examples. Finally, the ease of implementation of the two approximations may be very different, because simulating from the original Markov process may be much harder than simulating from the dual process. An example is the simulation of Kingman’s typed coalescent, which is immediate, as compared to simulation from (41), which would be unfeasible without [45].

6.1 Cox–Ingersoll–Ross numerical experiments

The CIR diffusion admits two different duals:

  • \bullet

    the death-like dual given by Dt=(Mt,Θt)D_{t}=(M_{t},\Theta_{t}), with MtM_{t} a pure-death process on +\mathbb{Z}_{+} with rates 2σ2θm2\sigma^{2}\theta m from mm to m1m-1 and Θt\Theta_{t} a deterministic process that solves the ODE dΘt/dt=2σ2Θt(Θtγ/σ2)\mathrm{d}\Theta_{t}/\mathrm{d}t=-2\sigma^{2}\Theta_{t}(\Theta_{t}-\gamma/\sigma^{2}), Θ0=θ\Theta_{0}=\theta. Cf. [53], Section 3.1.

  • \bullet

    the B&D dual MtM_{t} on +\mathbb{Z}_{+} with birth rates from mm to m+1m+1 given by λm=2σ2(δ/2+m)(θγ/σ2)\lambda_{m}=2\sigma^{2}(\delta/2+m)(\theta-\gamma/\sigma^{2}) and death rates from mm to m1m-1 given by μm=2σ2θm\mu_{m}=2\sigma^{2}\theta m respectively. Cf. Proposition 4.1.

Note that the latter is time-homogeneous, while the former is not. In general, temporal homogeneity is to be preferred, since a direct simulation with a Gillespie algorithm in the inhomogeneous case would require a time rescaling. However, for this specific case there is a convenient closed-form expression for the transition density of the first dual, which can be used to simulate arbitrary time transitions (see the third displayed equation at page 2011 in [53]). The second dual, by virtue of its temporal homogeneity, can be simulated directly using a Gillespie algorithm. This may be slow if the event rate becomes large, but as suggested in Section 4 we can see it as a linear B&D process with immigration, and a convenient closed-form expression can be used to simulate arbitrary time transitions.

We compare these two particle approximations with an exact computation of the predictive distribution following [53] and with a bootstrap particle filtering approach on the original state space of the signal, which is easy to implement for arbitrary time transitions thanks to the Gamma–Poisson expansion of the CIR transition density (see details in [50], Section 5).

Refer to caption
Figure 1: Comparison of the signal predictive distribution p^(xk+1|y1:k)\hat{p}(x_{k+1}|y_{1:k}) obtained through the approximation approach to the death-process dual and the B&D dual, and through the bootstrap particle filter, with the exact predictive. The number of particles used for the approximations are 50, 100, 500, 1000, 1500 and are indicated in the panel labels. The acronyms are BD: Birth-and-Death, PD: Pure-Death, BPF: Bootstrap Particle Filter.

Figure 1 shows the comparison of the above-illustrated strategies, with prediction performed for a forecast time horizon of 0.05. The CIR parameters were specified to δ=11,σ=1,γ=1.1\delta=11,\sigma=1,\gamma=1.1. The starting distribution for the prediction is a filtering distribution for a dataset whose last Poisson observation equals 4, so the starting distribution is a mixture of Gamma densities roughly centred around this point. The density estimates for the bootstrap particle filter were obtained from a Gamma kernel density estimator with bandwidth estimated by cross-validation. This is expected to induce a negligible error because the target distribution is a finite mixture of Gamma distributions.

The figure suggests that the bootstrap particle filter is slower to converge to the exact predictive distribution. Instead, with only 50 particles, both dual approximations, using a pure-death and a B&D dual (PD and BD in the figure legend, respectively), are already almost indistinguishable from the exact predictive distribution. This suggests that accurately approximating the mixing measure on the discrete dual space requires fewer particles than approximating the continuous distribution on the original continuous state space. Shorter and longer time horizons than that used in Figure 1 were also tested and provided qualitatively similar results.

Next, we turn to investigating the error on the filtering distributions, which combines successive particle approximations. Since the update operation can be performed exactly through (8), particle filtering using the dual process is conveniently implemented as a bootstrap particle approximation to a Baum–Welch filter with systematic resampling. We quantify the error on the filtering distributions by measuring the absolute error on the first moment and the standard deviation of the filtering distributions (with respect to the exact computation). We also include the error on the signal retrieval, measured as the absolute difference between the first moment of the filtering distributions and the value of the hidden signal to be retrieved. The mean filtering error is averaged over the second half of the sequence of observations, to avoid possible transient effects at the beginning of the observation sequence, and estimated over 50 different simulated datasets. The parameter specification is again δ=11,σ=1,γ=1.1\delta=11,\sigma=1,\gamma=1.1, with a single Poisson observation at each of 200 observation times, and intervals between consecutive observations equal to 0.1. Figure 2 shows that the pure-death particle approximation performs better than the B&D particle approximation, but the latter performs comparably to the bootstrap particle filter, possibly with a modest advantage.

[Uncaptioned image]
Figure 2: Mean filtering error as a function of the number of particles for the various particle approximation methods. The error bars represent the confidence interval on the error estimate from the 50 repetitions. The acronyms are BD: Birth-and-Death, PD: Pure-Death, BPF: Bootstrap Particle Filter.

6.2 Wright–Fisher numerical experiments

The WF diffusion admits two different duals:

  • Kingman’s typed coalescent with mutation, given by a pure-death process on +K\mathbb{Z}_{+}^{K} with rates λ𝐦,𝐦𝐞i=mi(|𝜶|+|𝒎|1)/2\lambda_{\mathbf{m},\mathbf{m}-\mathbf{e}_{i}}=m_{i}(|\bm{\alpha}|+|\bm{m}|-1)/2 from 𝐦\mathbf{m} to 𝐦𝐞i\mathbf{m}-\mathbf{e}_{i}; cf. [53], Section 3.3.

  • a Moran dual process, given by a homogeneous B&D process on +K\mathbb{Z}_{+}^{K} with rates λ𝐦,𝐦𝐞i+𝐞j=mi(αj+mj)/2\lambda_{\mathbf{m},\mathbf{m}-\mathbf{e}_{i}+\mathbf{e}_{j}}=m_{i}(\alpha_{j}+m_{j})/2 from 𝐦\mathbf{m} to 𝐦𝐞i+𝐞j\mathbf{m}-\mathbf{e}_{i}+\mathbf{e}_{j}; cf. Proposition 5.1.

Here both processes are time-homogeneous and can thus be easily simulated using a Gillespie algorithm, with the only caveat that the simulation can be inefficient when the infinitesimal rates are large. As in the CIR case, a closed-form expression for the transition probabilities is available in the first case, and can be used to simulate transitions over arbitrary time intervals (see Theorem 3.1 in [54]). Unlike the one-dimensional CIR case, handling this expression is challenging in the multi-dimensional WF case, with significant numerical stability issues raised by the need to compute sums of alternating series whose terms can both overflow and underflow. In [50], these hurdles were addressed using arbitrary-precision computation libraries and a careful re-use of previous computations, applicable when the data are equally spaced. The Gillespie simulation strategy presents no such restriction and may be significantly faster when the event rates remain low.
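As an illustration of the Gillespie strategy just described, the Kingman-type pure-death dual can be simulated as follows. This is a Python sketch with hypothetical names (the authors' package is in Julia), using the death rates mi(|𝜶|+|𝒎|1)/2m_{i}(|\bm{\alpha}|+|\bm{m}|-1)/2 given above:

```python
import numpy as np

def gillespie_pure_death(m0, alpha, t, rng):
    """Simulate the Kingman-type pure-death dual on Z_+^K over [0, t].

    Coordinate i loses one unit at rate m_i * (|alpha| + |m| - 1) / 2;
    the state can only decrease, so the total event rate shrinks over time.
    """
    m = np.array(m0, dtype=int)
    alpha_tot = float(np.sum(alpha))
    time = 0.0
    while m.sum() > 0:
        rates = m * (alpha_tot + m.sum() - 1) / 2.0
        total = rates.sum()
        time += rng.exponential(1.0 / total)       # waiting time to next death
        if time > t:
            break
        i = rng.choice(len(m), p=rates / total)    # which type loses one unit
        m[i] -= 1
    return m

rng = np.random.default_rng(1)
m_final = gillespie_pure_death([4, 0, 9, 2], alpha=[3, 3, 3, 3], t=0.1, rng=rng)
# componentwise, m_final <= initial state, since only deaths occur
```

The cost per simulated path is proportional to the number of death events before time t, which is what makes the algorithm slow when the initial rates are large.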

As mentioned in Section 5, no closed-form expression is available for the transitions of the Moran dual, and the Gillespie algorithm is the main option, likely resulting in a slow algorithm. Alternatively, we can approximate the Moran dual process by a finite-population Wright–Fisher chain, with the quality of the approximation increasing with the population size. The interest of this approximation is that the event rate of the latter is lower than that of the Moran process. This relates to the fact that weak convergence of a sequence of WF chains to a WF diffusion occurs when time is rescaled by a factor of NN (cf. [48], Sec. 15.2.F), whereas a Moran model whose individual updates occur at the times of a Poisson process with rate 1 needs a rescaling by a factor of N2N^{2} to obtain a similar convergence. In other words, in order to establish weak convergence to the diffusion, time must be measured in units of NN generations in the WF chain and in units of N2N^{2} generations in the Moran model; see the discussion in Section 5. For this reason, the resulting Gillespie simulation is expected to be faster using a WF chain approximation to the Moran model.
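To illustrate why the WF chain is cheaper to simulate per unit of diffusion time, compare one WF generation, which resamples all NN individuals in a single multinomial draw, with a single Moran event, which moves one individual at a time. The sketch below is illustrative Python (not the authors' implementation), with birth weights proportional to αj+mj\alpha_{j}+m_{j} so as to mirror the Moran dual rates stated above:

```python
import numpy as np

def wf_chain_generation(m, alpha, rng):
    """One generation of a finite-population Wright–Fisher chain.

    All N individuals are resampled at once: each offspring picks type j
    with probability proportional to m_j + alpha_j (current count plus
    mutation weight), mirroring the Moran replacement probabilities.
    """
    n = int(np.sum(m))
    weights = np.asarray(m, dtype=float) + np.asarray(alpha, dtype=float)
    return rng.multinomial(n, weights / weights.sum())

def moran_step(m, alpha, rng):
    """One Moran event: one individual dies, one is born.

    Of order N such events are needed to move as far as one WF generation,
    which is why the WF chain requires fewer simulated events.
    """
    m = np.array(m, dtype=int)
    i = rng.choice(len(m), p=m / m.sum())               # uniform death
    weights = m + np.asarray(alpha, dtype=float)
    j = rng.choice(len(m), p=weights / weights.sum())   # birth prop. to alpha_j + m_j
    m[i] -= 1
    m[j] += 1
    return m

rng = np.random.default_rng(3)
g = wf_chain_generation([4, 0, 9, 2], [3, 3, 3, 3], rng)
s = moran_step([4, 0, 9, 2], [3, 3, 3, 3], rng)
```

Both updates conserve the population size; the WF chain simply compresses many small Moran moves into one larger, cheaper-to-simulate jump.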

The above considerations also suggest another possibility. Since the Moran process converges weakly to a Wright–Fisher diffusion, the diffusion itself could be used as an approximation instead of a WF chain. In this case, one can sample directly from (41) over arbitrary time intervals using the algorithm in [45]. Hence we would be using a WF diffusion to approximate the dual Moran transitions in (50).

A standard bootstrap particle filter performed directly on the Wright–Fisher diffusion state space also crucially relies on the algorithm of [45] for the prediction step, without which approximate sampling recipes from the transition density would be needed.

In Figure 3, we compare prediction strategies for a WF diffusion with K=4K=4 types using:

  • the closed-form transition of the pure-death dual (“Exact” in the Fig. 3 legend);

  • an approximation of the pure-death dual using a Gillespie algorithm (“PD”);

  • an approximation of the Moran dual using a Gillespie algorithm (“BD Gillespie Moran”);

  • a WF chain approximation of the Moran dual using a Gillespie algorithm (“BD Gillespie WF”);

  • a WF diffusion approximation of the Moran dual using [45] (“BD diffusion WF”);

  • a bootstrap particle filtering approximation using [45] (“Bootstrap PF”).

In Figure 3, prediction was performed over a forecast time horizon equal to 0.1, with WF parameter 𝜶=(3,3,3,3)\bm{\alpha}=(3,3,3,3). The starting distribution for the prediction is a filtering distribution for a dataset whose last multinomial observation equals (4,0,9,2)(4,0,9,2), so the starting distribution is a mixture of Dirichlet distributions roughly centred around (4/15,0,9/15,2/15)(4/15,0,9/15,2/15). Various values of the parameter 𝜶\bm{\alpha} were also tested and provided results qualitatively similar to those in Figure 1. The density estimates for the bootstrap particle filter are obtained from a Dirichlet kernel density estimator with bandwidth chosen by cross-validation, using the Julia package KernelEstimator (https://github.com/panlanfeng/KernelEstimator.jl). This is expected to induce a negligible error, because the target distribution is a finite mixture of Dirichlet distributions. Figure 3 shows that, among these particle approximations of p(xk+1|y1:k)p(x_{k+1}|y_{1:k}), the Wright–Fisher diffusion approximation of the Moran dual seems to converge slowest, followed by the bootstrap particle filter, whereas the other strategies based on the dual process converge quickly to the exact distribution.
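For concreteness, a Dirichlet kernel density estimator of the type used above can be sketched as follows. This is one common formulation, written in Python for illustration; the kernel parametrisation Dirichlet(s/h+1)(s/h+1) and all names are assumptions and may differ from the KernelEstimator package's implementation:

```python
import numpy as np
from math import lgamma

def dirichlet_logpdf(x, a):
    """Log-density of Dirichlet(a) evaluated at x on the open simplex."""
    return (lgamma(float(np.sum(a))) - sum(lgamma(ai) for ai in a)
            + float(np.sum((np.asarray(a) - 1.0) * np.log(x))))

def dirichlet_kde(x, samples, h):
    """Dirichlet kernel density estimate at x: each particle s contributes
    a Dirichlet(s/h + 1) kernel, which concentrates around s as h shrinks."""
    logs = np.array([dirichlet_logpdf(x, s / h + 1.0) for s in samples])
    m = logs.max()
    return np.exp(m) * np.mean(np.exp(logs - m))   # log-sum-exp for stability

rng = np.random.default_rng(2)
samples = rng.dirichlet([4, 1, 9, 2], size=200)    # stand-in for filter particles
x = np.array([0.25, 0.05, 0.55, 0.15])
density = dirichlet_kde(x, samples, h=0.05)
```

Dirichlet kernels respect the simplex boundary, avoiding the edge bias a Gaussian kernel would exhibit near the faces of the simplex.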

[Uncaptioned image]
Figure 3: Convergence of the WF predictive distribution (only the first dimension) with the number of particles for the various particle approximations. The acronyms are PD: Pure-Death, BD: Birth-and-Death, WF: Wright-Fisher, PF: Particle Filter.

Figure 4 evaluates the filtering error for a WF process with K=3K=3 and parameters 𝜶=(1.1,1.1,1.1)\bm{\alpha}=(1.1,1.1,1.1), given 20 categorical observations collected at each of 10 collection times spaced by intervals equal to 1. We consider increasing numbers of particles and use 100 replications to estimate the error. The figure shows that the particle approximation of the pure-death dual process based on the closed-form transition performs best. The bootstrap particle approximation improves fastest as the number of particles increases. Overall, the Moran dual performs better than or comparably to bootstrap particle filtering.

[Uncaptioned image]
Figure 4: Mean filtering error as a function of the number of particles for the various particle approximation methods. The error bars represent the confidence interval on the error estimate from the 100 repetitions. The acronyms are PD: Pure-Death, BD: Birth-and-Death, WF: Wright-Fisher, PF: Particle Filter.

7 Concluding remarks

We have provided conditions for filtering diffusion processes on multidimensional continuous spaces which avoid computations on the state space of the forward process when a dual process given by a discrete Markov chain is available. Motivated by certain diffusion models for which only duals with a countable state space are known (e.g., B&D-like duals for WF diffusions with selection), we have investigated the performance of filtering based on a B&D dual for the CIR diffusion and on a Moran process dual for the WF diffusion. All the approximation methods proposed appear to be valuable strategies, despite resting on different simulation schemes. The optimal strategy will depend on the application at hand, together with several other details such as the lengths of the intervals between data collection times, and may be constrained by which of these tools are available. For example, the transition function of coupled WF diffusions [4] is not available, whereas a discrete dual was found in [25]. Overall, approximate filtering using B&D-like duals may perform better than or comparably to bootstrap particle filtering, with the advantage of operating on a discrete state space. The computational effort of each of these strategies will also depend on a number of factors whose identification is beyond the scope of this contribution.

The code to reproduce the analyses illustrated above will be made available in the Supporting Material and is based on the package freely available at https://github.com/konkam/DualOptimalFiltering.jl.

8 Acknowledgements

The authors are grateful to two anonymous referees for carefully reading the manuscript and for providing constructive suggestions that led to improving the paper. They also gratefully acknowledge insightful conversations with Omiros Papaspiliopoulos on performing particle filtering on the dual space.
The last author is partially supported by MUR, PRIN project 2022CLTYP4.

References

  • [1] Athreya, S. and Swart, J. (2005). Branching-coalescing particle systems. Probab. Theory Relat. Fields 131, 376–414.
  • [2] Ascolani, F., Lijoi, A. and Ruggiero, M. (2021). Predictive inference with Fleming–Viot-driven dependent Dirichlet processes. Bayesian Anal. 16, 371–395.
  • [3] Ascolani, F., Lijoi, A. and Ruggiero, M. (2023). Smoothing distributions for conditional Fleming–Viot and Dawson–Watanabe diffusions. Bernoulli 29, 1410–1434.
  • [4] Aurell, E., Ekeberg, M. and Koski, T. (2019). On a multilocus Wright–Fisher model with mutation and a Svirezhev-Shahshahani gradient-like selection dynamics. arXiv:1906.00716.
  • [5] Bailey, N.T.J. (1964). The elements of stochastic processes with applications to the natural sciences. Wiley, New York
  • [6] Bain, A. and Crisan, D. (2009). Fundamentals of stochastic filtering. Springer.
  • [7] Barbour, A.D., Ethier, S.N. and Griffiths, R.C. (2000). A transition function expansion for a diffusion model with selection. Ann. Appl. Probab. 10, 123–162.
  • [8] Birkner, M.C., Blath, J., Moehle, M., Steinruecken, M. and Tams, J. (2008). A modified lookdown construction for the Xi-Fleming–Viot process with mutation and populations with recurrent bottlenecks. Alea 6, 25–61.
  • [9] Chen, R. and Scott, L. (1992). Pricing interest rate options in a two-factor Cox–Ingersoll–Ross model of the term structure. Rev. Financial Stud. 5, 613–636.
  • [10] Cox, J.C., Ingersoll, J.E. and Ross, S.A. (1985). A theory of the term structure of interest rates. Econometrica 53, 385–407.
  • [11] Cappé, O., Moulines, E. and Rydén, T. (2005). Inference in hidden Markov models. Springer.
  • [12] Carinci, G., Giardinà, C., Giberti, C. and Redig, F. (2015). Dualities in population genetics: A fresh look with new dualities. Stoch. Proc. Appl. 125, 941–969.
  • [13] Chaleyat-Maurel, M. and Genon-Catalot, V. (2006). Computable infinite-dimensional filters with applications to discretized diffusion processes. Stoch. Proc. Appl. 116, 1447–1467.
  • [14] Chaleyat-Maurel, M. and Genon-Catalot, V. (2009). Filtering the Wright–Fisher diffusion. ESAIM Probab. Stat. 13, 197–217.
  • [15] Chopin, N. and Papaspiliopoulos, O. (2020). An introduction to sequential Monte Carlo. Springer.
  • [16] Comte, F., Genon-Catalot, V. and Kessler, M. (2011). Multiplicative Kalman filtering. Test 20, 389–411.
  • [17] Crawford, F.W. and Suchard, M.A. (2012). Transition probabilities for general birth-death processes with applications in ecology, genetics, and evolution. J. Math. Biol. 65, 553–580.
  • [18] Depperschmidt, A., Greven, A. and Pfaffelhuber, P. (2019). Duality and the well-posedness of a martingale problem. arXiv:1904.01564.
  • [19] Ethier, S.N. and Griffiths, R.C. (1993a). The transition function of a Fleming–Viot process. Ann. Probab. 21, 1571–1590.
  • [20] Ethier, S.N. and Griffiths, R.C. (1993b). The transition function of a measure-valued branching diffusion with immigration. In Stochastic Processes. A Festschrift in Honour of Gopinath Kallianpur (S. Cambanis, J. Ghosh, R. L. Karandikar and P. K. Sen, eds.) 71–79. Springer, New York.
  • [21] Ethier, S.N. and Kurtz, T.G. (1986). Markov processes: characterization and convergence. Wiley, New York.
  • [22] Etheridge, A.M. (2009). Some mathematical models from population genetics. École d’été de Probabilités de Saint-Flour XXXIX. Lecture Notes in Math. 2012. Springer.
  • [23] Etheridge, A.M. and Griffiths, R.C. (2009). A coalescent dual process in a Moran model with genic selection. Theor. Pop. Biol. 75, 320–330.
  • [24] Etheridge, A.M., Griffiths, R.C. and Taylor, J.E. (2010). A coalescent dual process in a Moran model with genic selection and the lambda coalescent limit. Theor. Pop. Biol. 78, 77–92.
  • [25] Favero, M., Hult, H. and Koski, T. (2021). A dual process for the coupled Wright–Fisher diffusion. J. Math. Biol. 82:6.
  • [26] Ferrante, M. and Runggaldier, W.J. (1990). On necessary conditions for the existence of finite-dimensional filters in discrete time. Systems Control Lett. 14, 63–69.
  • [27] Ferrante, M. and Vidoni, P. (1998). Finite dimensional filters for nonlinear stochastic difference equations with multiplicative noises. Stoch. Proc. Appl. 77, 69–81.
  • [28] Franceschini, C., Giardinà, C. and Groenevelt, W. (2018). Self-duality of Markov processes and intertwining functions. Math. Phys. Anal. Geom. 21, 29.
  • [29] Frydman, H. (1994). Asymptotic inference for the parameters of a discrete-time square-root process. Math. Finance 4, 169–181.
  • [30] Genon-Catalot, V. (2003). A non-linear explicit filter. Statist. Probab. Lett. 61, 145–154.
  • [31] Genon-Catalot, V. and Kessler, M. (2004). Random scale perturbation of an AR(1) process and its properties as a nonlinear explicit filter. Bernoulli, 10, 701–720.
  • [32] Ghosal, S. and van der Vaart, A. (2017). Fundamentals of nonparametric Bayesian inference. Cambridge University Press.
  • [33] Giardinà, C., Kurchan, J. and Redig, F. (2007). Duality and exact correlations for a model of heat conduction. J. Math. Phys. 48, 033301.
  • [34] Giardinà, C., Kurchan, J., Redig, F. and Vafayi, K. (2009a). Duality and hidden symmetries in interacting particle systems. J. Stat. Phys. 135, 25–55.
  • [35] Giardinà, C., Redig, F. and Vafayi, K. (2009b). Correlation inequalities for interacting particle systems with duality. J. Stat. Phys. 141, 242–263.
  • [36] Gillespie, D.T. (2007). Stochastic simulation of chemical kinetics. Annu. Rev. Phys. Chem. 58, 35–55.
  • [37] Göing-Jaeschke, A. and Yor, M. (2003). A survey and some generalizations of Bessel processes. Bernoulli, 9, 313–349.
  • [38] Griffiths, R.C., Ruggiero, M., Spanò, D. and Zhou, Y. (2022). Dual process in the two-parameter Poisson–Dirichlet Petrov diffusion. arXiv:2102.08520.
  • [39] Griffiths, R.C. and Spanò, D. (2009). Diffusion processes and coalescent trees. In Probability and Mathematical Genetics: Papers in Honour of Sir John Kingman (N.H. Bingham and C.M. Goldie, eds.). London Mathematical Society Lecture Notes Series, Cambridge University Press, 2010.
  • [40] Heston, S.L. (1993). A closed-form solution for options with stochastic volatility, with applications to bond and currency options. Rev. Financial Stud. 6, 327–343.
  • [41] Huillet, T. (2007). On Wright–Fisher diffusion and its relatives. J. Stat. Mech., P11006.
  • [42] Huillet, T. and Martinez, S. (2011). Duality and intertwining for discrete Markov kernels: relations and examples. Adv. Appl. Probab. 43, 437–460.
  • [43] Hutzenthaler, M. and Wakolbinger, A. (2007). Ergodic behavior of locally regulated branching populations. Ann. Appl. Probab. 17, 474–501.
  • [44] Jansen, S. and Kurt, N. (2014). On the notion(s) of duality for Markov processes. Probab. Surv. 11, 59–120.
  • [45] Jenkins, P. A. and Spanò, D. (2017). Exact simulation of the Wright–Fisher diffusion. Ann. Appl. Probab. 27(3), 1478–1509.
  • [46] Kalman, R.E. (1960). A new approach to linear filtering and prediction problems. J. Basic Eng. 82, 35–45.
  • [47] Kalman, R.E. and Bucy, R.S. (1961). New results in linear filtering and prediction theory. J. Basic Eng. 83 95–108.
  • [48] Karlin, S. and Taylor, H.M. (1981). A second course in stochastic processes. Academic Press, New York.
  • [49] Kawazu, K. and Watanabe, S. (1971). Branching processes with immigration and related limit theorems. Theory Prob. Appl. 16, 36–54.
  • [50] Kon Kam King, G., Papaspiliopoulos, O. and Ruggiero, M. (2021). Exact inference for a class of hidden Markov models on general state spaces. Electron. J. Stat. 15, 2832–2875.
  • [51] Möhle, M. (1999). The concept of duality and applications to Markov processes arising in neutral population genetics models. Bernoulli 5, 761–777.
  • [52] Ohkubo, J. (2010). Duality in interacting particle systems and boson representation. J. Stat. Phys. 139, 454–465.
  • [53] Papaspiliopoulos, O. and Ruggiero, M. (2014). Optimal filtering and the dual process. Bernoulli 20, 1999–2019.
  • [54] Papaspiliopoulos, O., Ruggiero, M. and Spanò, D. (2016). Conjugacy properties of time-evolving Dirichlet and gamma random measures. Electron. J. Stat. 10, 3452–3489.
  • [55] Runggaldier, W.J. and Spizzichino, F. (2001). Sufficient conditions for finite dimensionality of filters in discrete time: a Laplace transform-based approach. Bernoulli 7, 211–221.
  • [56] Sawitzki, G. (1981). Finite-dimensional filter systems in discrete time. Stochastics 5, 107–114.
  • [57] Tavaré, S. (2018). The linear birth-and-death process: an inferential retrospective. Adv. Appl. Probab. 50, 253–269.