
Online Smoothing for Diffusion Processes Observed with Noise

Shouto Yonekura  
Department of Statistical Science, University College London,
Alan Turing Institute,
and
Alexandros Beskos
Department of Statistical Science, University College London,
Alan Turing Institute
SY acknowledges funding from the Alan Turing Institute under grant number TU/C/000013. AB acknowledges funding via a Leverhulme Trust Prize.
Abstract

We introduce a methodology for online estimation of smoothing expectations for a class of additive functionals, in the context of a rich family of diffusion processes (that may include jumps) – observed at discrete-time instances. We overcome the unavailability of the transition density of the underlying SDE by working on the augmented pathspace. The new method can be applied, for instance, to carry out online parameter inference for the designated class of models. Algorithms defined on the infinite-dimensional pathspace have been developed in recent years mainly in the context of MCMC techniques. There, the main benefit is the achievement of mesh-free mixing times for the practical time-discretised algorithm used on a PC. Our own methodology sets up the framework for infinite-dimensional online filtering – an important positive practical consequence is the construction of estimates with variance that does not increase with decreasing mesh-size. Besides regularity conditions, our method is, in principle, applicable under the weak assumption – relative to the restrictive conditions often required in the MCMC or filtering literature of methods defined on pathspace – that the SDE covariance matrix is invertible.


Keywords: Jump Diffusion; Data Augmentation; Forward-Only Smoothing; Sequential Monte Carlo; Online Parameter Estimation.

1 Introduction

Research in Hidden Markov Models (HMMs) has – thus far – provided effective online algorithms for the estimation of expectations of the smoothing distribution for the case of a class of additive functionals of the underlying signal. Such methods necessitate knowledge of the transition density of the Markovian part of the model between observation times. We carry out a related exploration for the (common in applications) case when the signal corresponds to a diffusion process, thus we are faced with the challenge that such transition densities are typically unavailable. Standard data augmentation schemes that work with the multivariate density of a large enough number of imputed points of the continuous-time signal will lead to ineffective algorithms. The latter will have the abnormal characteristic that – for given Monte-Carlo iterates – the variability of the produced estimates will increase rapidly as the resolution of the imputation becomes finer. One of the ideas underpinning the work in this paper is that development of effective algorithms instead requires respecting the structural properties of the diffusion process, thus we build up imputation schemes on the infinite-dimensional diffusion pathspace itself. As a consequence, the time-discretised algorithm used in practice on a PC will be stable under mesh-refinement.

We consider continuous-time jump-diffusion models observed at discrete-time instances. The $d_{x}$-dimensional process, $X=\{X_{t};t\geq 0\}$, $d_{x}\geq 1$, is defined via the following time-homogeneous stochastic differential equation (SDE), with $X_{t-}:=\lim_{s\uparrow t}X_{s}$,

dX_{t}=b(X_{t-})dt+\sigma(X_{t-})dW_{t}+dJ_{t},\quad X_{0}=x_{0}\in\mathbb{R}^{d_{x}},\quad t\geq 0.   (1.1)

The solution $X$ is driven by the $d_{w}$-dimensional Brownian motion, $W=\{W_{t};t\geq 0\}$, $d_{w}\geq 1$, and the compound Poisson process, $J=\{J_{t};t\geq 0\}$. The SDE involves a drift function $b=b_{\theta}:\mathbb{R}^{d_{x}}\mapsto\mathbb{R}^{d_{x}}$ and coefficient matrix $\sigma=\sigma_{\theta}:\mathbb{R}^{d_{x}}\mapsto\mathbb{R}^{d_{x}\times d_{w}}$, for parameter $\theta\in\mathbb{R}^{p}$, $p\geq 1$. Let $\{N_{t};t\geq 0\}$ be a Poisson process with intensity function $\lambda_{\theta}(\cdot)$, and $\{\xi_{k}\}_{k\geq 1}$ an i.i.d. sequence of random variables with Lebesgue density $h_{\theta}(\cdot)$; the càdlàg process $J$ is determined as $J_{t}=J_{t,\theta}=\sum_{i=1}^{N_{t}}\xi_{i}$. We work under standard assumptions (e.g. linear growth, Lipschitz continuity for $b$, $\sigma$) that guarantee a unique global solution of (1.1), in a weak or strong sense, see e.g. Øksendal and Sulem, (2007).

SDE (1.1) is observed with noise at discrete-time instances $0=t_{0}<t_{1}<t_{2}<\cdots<t_{n}$, $n\geq 1$. Without loss of generality, we assume equidistant observation times, with $\Delta:=t_{1}-t_{0}$. We consider data $Y_{t_{0}},\ldots,Y_{t_{n}}$, and for simplicity we set,

x_{i}:=X_{t_{i}},\quad y_{i}=Y_{t_{i}},\quad 0\leq i\leq n.

Let $\mathcal{F}_{i}$ be the filtration generated by $X_{s}$ for $s\in[t_{i-1},t_{i}]$, $0<i\leq n$. We assume,

\big{[}\,Y_{t_{i}}\,\big{|}\,\{Y_{t_{j}};j<i\},\,\{X_{s};s\in[0,t_{i}]\}\,\big{]}\sim g_{\theta}\big{(}dY_{t_{i}}\,\big{|}\,Y_{t_{i-1}},\mathcal{F}_{i}\big{)},\quad 0\leq i\leq n,   (1.2)

for conditional distribution $g_{\theta}(\cdot|Y_{t_{i-1}},\mathcal{F}_{i})$ on $\mathbb{R}^{d_{y}}$, $d_{y}\geq 1$, under the convention $Y_{t_{-1}}=y_{-1}=\emptyset$. We write,

[\,x_{i}\,|\,x_{i-1}\,]\sim f_{\theta}(dx_{i}|x_{i-1}),   (1.3)

where $f_{\theta}(dx_{i}|x_{i-1})$ is the transition distribution of the driving SDE process (1.1). We consider the density functions of $g_{\theta}(dy_{i}|y_{i-1},\mathcal{F}_{i})$ and $f_{\theta}(dx_{i}|x_{i-1})$, and – with some abuse of notation – we write $g_{\theta}(dy_{i}|y_{i-1},\mathcal{F}_{i})=g_{\theta}(y_{i}|y_{i-1},\mathcal{F}_{i})dy_{i}$, $f_{\theta}(dx_{i}|x_{i-1})=f_{\theta}(x_{i}|x_{i-1})dx_{i}$, where $dy_{i}$, $dx_{i}$ denote Lebesgue measures. Our work develops under the following regime.

Assumption 1.

The transition density $f_{\theta}(x^{\prime}|x)$ is intractable; the density $g_{\theta}(y^{\prime}|y,\mathcal{F})$ is analytically available – for appropriate $x^{\prime}$, $x$, $y^{\prime}$, $y$, $\mathcal{F}$ consistent with the preceding definitions.

The intractability of the transition density $f_{\theta}(\cdot|\cdot)$ will pose challenges for the main problems this paper aims to address.
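While $f_{\theta}(\cdot|\cdot)$ is intractable, forward simulation of (1.1) is straightforward under time-discretisation (a point we return to in Section 2). Below is a minimal Julia sketch, assuming a scalar process, a constant jump intensity and a user-supplied jump-size sampler; the function name `simulate_jump_diffusion` and its interface are ours, for illustration only.

```julia
using Random

# Euler-Maruyama simulation of a scalar jump diffusion
#   dX_t = b(X_t)dt + sigma(X_t)dW_t + dJ_t,
# with J a compound Poisson process of constant intensity `lambda`
# and jump sizes drawn by `jumpsize()`.  Illustrative only: forward
# sampling is easy, while the transition density remains intractable.
function simulate_jump_diffusion(b, sigma, x0, T; M=100, lambda=0.0,
                                 jumpsize=() -> 0.0)
    dt = T / M
    x  = x0
    path = Float64[x]
    # jump times on (0, T]: exponential inter-arrival times
    jump_times = Float64[]
    t = lambda > 0 ? -log(rand()) / lambda : Inf
    while t <= T
        push!(jump_times, t)
        t += -log(rand()) / lambda
    end
    for m in 1:M
        x += b(x) * dt + sigma(x) * sqrt(dt) * randn()
        # add the jumps that fall in the current time step
        for tau in jump_times
            if (m - 1) * dt < tau <= m * dt
                x += jumpsize()
            end
        end
        push!(path, x)
    end
    return path
end

# example: O-U drift/diffusion with uniform jumps on (-0.5, 0.5)
path = simulate_jump_diffusion(x -> 0.5 * (0.0 - x), x -> 0.4, 0.0, 1.0;
                               M=100, lambda=0.5, jumpsize=() -> rand() - 0.5)
```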

Models defined via (1.1)-(1.3) are extensively used, e.g., in finance and econometrics, for instance for capturing the market microstructure noise, see Aït-Sahalia et al., (2005); Aït-Sahalia and Yu, (2008); Hansen and Lunde, (2006). The above setting belongs to the general class of HMMs, with a signal defined in continuous-time. See Cappé et al., (2005); Douc et al., (2014) for a general treatment of HMMs fully specified in discrete-time. A number of methods have been suggested in the literature for approximating the unavailable transition density – mainly in the case of processes without jumps – including: asymptotic expansion techniques (Aït-Sahalia et al., 2005; Aït-Sahalia, 2002, 2008; Kessler, 1997); martingale estimating functions (Kessler and Sørensen, 1999); generalized method of moments (Hansen and Scheinkman, 1993); Monte-Carlo approaches (Wagner, 1989; Durham and Gallant, 2002; Beskos et al., 2006). See, e.g., Kessler et al., (2012) for a detailed review.

For a given sequence $\{a_{m}\}_{m}$, we use the notation $a_{i:j}:=(a_{i},\ldots,a_{j})$, for integers $i\leq j$. Let $p_{\theta}(y_{0:n})$ denote the joint density of $y_{0:n}$. Throughout the paper, $p_{\theta}(\cdot)$ is used generically to represent probability distributions or densities of random variables appearing as its arguments.

Consider the maximum likelihood estimator (MLE),

\hat{\theta}_{n}:=\arg\max_{\theta\in\Theta}\log p_{\theta}(y_{0:n}).

Except for limited cases, one cannot obtain the MLE analytically for HMMs (even for a discrete-time signal) due to the intractability of $p_{\theta}(y_{0:n})$.

We have set up the modelling context for this work. The main contributions of the paper in this setting – several of which relate to overcoming the intractability of the transition density of the SDE, and to developing a methodology that is well-posed on the infinite-dimensional pathspace – will be as follows:

  • (i)

    We present an online algorithm that delivers Monte-Carlo estimators of smoothing expectations,

    \mathcal{S}_{\theta,n}=\int S_{\theta}(\mathbf{x}_{0:n})\,p_{\theta}(d\mathbf{x}_{0:n}|y_{0:n}),\quad n\geq 1,   (1.4)

    for the class of additive functionals $S_{\theta}(\cdot)$ of the structure,

    S_{\theta}(\mathbf{x}_{0:n})=\sum_{k=0}^{n}s_{\theta,k}(x_{k-1},\mathbf{x}_{k}),   (1.5)

    under the conventions $x_{-1}=\emptyset$, $\mathbf{x}_{0}=x_{0}$. The bold type notation $\mathbf{x}_{k}$, $k\geq 0$, is reserved for carefully defined variables involving elements of the infinite-dimensional pathspace. The specific construction of $\mathbf{x}_{k}$ will depend on the model at hand, and will be explained in the main part of the paper. We note that the class of additive functionals considered supposes additivity over time, and therefore includes the general case of evaluating the score function. The online solution of the smoothing problem is often used as the means to solve some concrete inferential problems – see, e.g., (ii) below.

  • (ii)

    We take advantage of the new approach to show numerical applications, with emphasis on carrying out online parameter inference for the designated class of models via a gradient-ascent approach (in a Robbins-Monro stochastic gradient framework). A critical aspect of this particular online algorithm (partly likelihood-based, when concerned with parameter estimation; partly Bayesian, with regards to identification of filtering/smoothing expectations) is that it delivers estimates of the evolving score function, of the model parameters, together with particle representations of the filtering distributions, through a single passage of the data. This is a unique favourable algorithmic characteristic, when contrasted with alternative algorithms with similar objectives, such as, e.g., Particle MCMC (Andrieu et al., 2010), or SMC$^2$ (Chopin et al., 2013).

  • (iii)

    In this work, we will not characterise analytically the size of the time-discretisation bias relevant to the SDE models at hand, and are content with the facts that: (I) the bias can be decreased by increasing the resolution of the numerical scheme (typically an Euler-Maruyama one, or some other Taylor scheme, see e.g. Kloeden and Platen, (2013)); (II) critically, the Monte-Carlo algorithms are developed in a manner that the variance does not increase (in the limit, up to infinity) when increasing the resolution of the time-discretisation method; to achieve such an effect, the algorithms are (purposely) defined on the infinite-dimensional pathspace, and SDE paths are only discretised when implementing the algorithm on a PC (to allow, necessarily, for finite computations).

  • (iv)

    Our method draws inspiration from earlier works, in the context of online filtering for discrete-time HMMs and infinite-dimensional pathspace MCMC methods. The complete construct is novel; one consequence of this is that it is applicable, in principle, for a wide class of SDEs, under the following weak assumption (relative to the restrictive conditions often imposed in the literature of infinite-dimensional MCMC methods).

    Assumption 2.

    The diffusion covariance matrix function,

    \Sigma_{\theta}(v):=\sigma_{\theta}(v)\sigma_{\theta}(v)^{\top}\in\mathbb{R}^{d_{x}\times d_{x}},

    is invertible, for all relevant $v$, $\theta$.

Thus, the only class of models excluded by the methodology, as defined here, is that of hypoelliptic SDEs.

An elegant solution to the online smoothing problem posed above in (i), for the case of a standard HMM with discrete-time signal of known transition density $f_{\theta}(x^{\prime}|x)$, is given in Del Moral et al., (2010); Poyiadjis et al., (2011). Our own work overcomes the unavailability of the transition density in the continuous-time scenario by following the above literature but augmenting the hidden state with the complete continuous-time SDE path. Related augmentation approaches in this setting – though for different inferential objectives – have appeared in Fearnhead et al., (2008); Ströjby and Olsson, (2009); Gloaguen et al., (2018), where the auxiliary variables are derived via the Poisson estimator of transition densities for SDEs (under strict conditions on the class of SDEs; no jumps), introduced in Beskos et al., (2006), and in Särkkä and Sottinen, (2008) where the augmentation involves indeed the continuous-time path (the objective therein is to solve the filtering problem and the method is applicable for SDEs with additive Wiener noise; no jump processes are considered).

A Motivating Example: Fig. 1.1 (Top-Panel) shows estimates of the score function, evaluated at the true parameter value $\theta=\theta^{\dagger}$, for parameter $\theta_{3}$ of the Ornstein–Uhlenbeck (O–U) process, $dX_{t}=\theta_{1}(\theta_{2}-X_{t})dt+\theta_{3}dW_{t}$, $X_{0}=0.0$, for $n=10$ observations $y_{i}=x_{i}+\epsilon_{i}$, $\epsilon_{i}\stackrel{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,0.1^{2})$, $1\leq i\leq n$. Data were simulated from $\theta^{\dagger}=(0.5,0.0,0.4)$ with an Euler-Maruyama scheme of $M^{\dagger}=1,000$ grid points per unit of time. Fig. 1.1 (Top-Panel) illustrates the ‘abnormal’ effect of a standard data-augmentation scheme, where for $N=100$ particles, the Monte-Carlo method (see later sections for details) produces estimates of increasing variability as algorithmic resolution increases with $M=10,50,100,200$ – i.e., as it approaches the ‘true’ resolution used for the data generation. In contrast, Fig. 1.1 (Bottom-Panel) shows results for the same score function as estimated by our proposed method in this paper. Our method is well-defined on the infinite-dimensional pathspace, and, consequently, it manages to robustly estimate the score function under mesh-refinement.

Figure 1.1: Boxplots of estimated score functions of $\theta_{3}$ for the O–U process, for different resolution sizes $M$, over $R=50$ experiment replications. $N=100$ particles were used in all cases, for the same $n=10$ data-points. Top-Panel: Results are produced via a Monte-Carlo algorithm that uses a naive ‘finite-dimensional’ data augmentation approach. Bottom-Panel: Results are produced via a Monte-Carlo algorithm that uses the ‘infinite-dimensional’ data augmentation approach introduced in this paper.

The rest of the paper is organised as follows. Section 2 reviews the Forward-Only algorithm for the online approximation of expectations of a class of additive functionals described in Del Moral et al., (2010). Section 3 sets up the framework for the treatment of pathspace-valued SDEs, first for the conceptually simpler case of SDEs without jumps, and then proceeding with incorporating jumps. Section 4 provides the complete online approximation algorithm, constructed on the infinite-dimensional pathspace. Section 5 discusses the adaptation of the developed methodology for the purposes of online inference for unknown parameters of the given SDE model. Section 6 shows numerical applications of the developed methodology. Section 7 contains conclusions and directions for future research.

2 Forward-Only Smoothing

A bootstrap filter (Gordon et al., 1993) is applicable in the continuous-time setting, as it only requires forward sampling of the underlying signal $\{X_{t}\}_{t\geq 0}$; this is trivially possible – under numerous approaches – and is typically associated with the introduction of some time-discretisation bias (Kloeden and Platen, 2013). However, the transition density is still required for the smoothing problem we have posed in the Introduction. In this section, we assume a standard discrete-time HMM, with initial distribution $p_{\theta}(dx_{0})$, transition density $f_{\theta}(x^{\prime}|x)$, and likelihood $g_{\theta}(y|x)$, for appropriate $x,x^{\prime}\in\mathbb{R}^{d_{x}}$, $y\in\mathbb{R}^{d_{y}}$, and review the online algorithm developed in Del Moral et al., (2010) for this setting. Implementation of the bootstrap filter provides an approximation of the smoothing distribution $p_{\theta}(dx_{0:n}|y_{0:n})$ by following the genealogy of the particles. This method is studied, e.g., in Cappé, (2009); Dahlhaus and Neddermeyer, (2010). Let $\{x_{0:n}^{(i)},W_{n}^{(i)}\}_{i=1}^{N}$, $N\geq 1$, be a particle approximation of the smoothing distribution $p_{\theta}(dx_{0:n}|y_{0:n})$, in the sense that we have the estimate,

\widehat{p}_{\theta}(dx_{0:n}|y_{0:n})=\sum_{i=1}^{N}W_{n}^{(i)}\delta_{x_{0:n}^{(i)}}(dx_{0:n}),\quad\sum_{i=1}^{N}W_{n}^{(i)}=1,   (2.1)

with $\delta_{x_{0:n}^{(i)}}(dx_{0:n})$ the Dirac measure with an atom at $x_{0:n}^{(i)}$. Then, replacing $p_{\theta}(dx_{0:n}|y_{0:n})$ with its estimate in (2.1) provides consistent estimators of expectations of the HMM smoothing distributions. Though the method is online and the computational cost per time step is $\mathcal{O}(N)$, it typically suffers from the well-documented path-degeneracy problem – as illustrated via theoretical results or numerically (Del Moral et al., 2010; Kantas et al., 2015). That is, as $n$ increases, the particles representing $p_{\theta}(dx_{0:n}|y_{0:n})$ obtained by the above method will eventually all share the same ancestral particle due to the resampling steps, and the approximation collapses for big enough $n$. This is well-understood not to be a solution to the approximation of the smoothing distribution for practical applications.

An approach which overcomes path-degeneracy is the Forward Filtering Backward Smoothing (FFBS) algorithm of Doucet et al., (2000). We briefly review the method here, following closely the notation and development in Del Moral et al., (2010). In the forward direction, assume that a filtering algorithm (e.g. bootstrap) has provided a particle approximation of the filtering distribution $p_{\theta}(dx_{k-1}|y_{0:k-1})$ – assuming a relevant $k$ –,

\widehat{p}_{\theta}(dx_{k-1}|y_{0:k-1})=\sum_{i=1}^{N}W_{k-1}^{(i)}\delta_{x_{k-1}^{(i)}}(dx_{k-1}),   (2.2)

for weighted particles $\{x_{k-1}^{(i)},W_{k-1}^{(i)}\}_{i=1}^{N}$. In the backward direction, assume that one is given the particle approximation of the marginal smoothing distribution $p_{\theta}(dx_{k}|y_{0:n})$,

\widehat{p}_{\theta}(dx_{k}|y_{0:n})=\sum_{i=1}^{N}W_{k|n}^{(i)}\delta_{x_{k}^{(i)}}(dx_{k}).   (2.3)

One has that (Kitagawa, 1987),

p_{\theta}(dx_{k-1:k}|y_{0:n}) = p_{\theta}(dx_{k}|y_{0:n})\otimes p_{\theta}(dx_{k-1}|x_{k},y_{0:k-1})
= p_{\theta}(dx_{k}|y_{0:n})\otimes p_{\theta}(dx_{k-1}|y_{0:k-1})\frac{f_{\theta}(x_{k}|x_{k-1})}{\int f_{\theta}(x_{k}|x_{k-1})p_{\theta}(x_{k-1}|y_{0:k-1})dx_{k-1}}.   (2.4)

Using (2.2)-(2.3), and based on equation (2.4), we obtain the approximation,

\widehat{p}_{\theta}(dx_{k-1:k}|y_{0:n})=\sum_{j=1}^{N}W_{k|n}^{(j)}\sum_{i=1}^{N}\frac{f_{\theta}(x_{k}^{(j)}|x_{k-1}^{(i)})W_{k-1}^{(i)}}{\sum_{l=1}^{N}f_{\theta}(x_{k}^{(j)}|x_{k-1}^{(l)})W_{k-1}^{(l)}}\,\delta_{(x_{k-1}^{(i)},x_{k}^{(j)})}(dx_{k-1:k}).   (2.5)

Recalling the expectation of additive functionals in (1.4)-(1.5) – where now, in the discrete-time setting, we can ignore the bold $\mathbf{x}_{k}$ elements, and simply use $x_{k}$ instead – the above calculations give rise to the following estimator of the target quantity $\mathcal{S}_{\theta,n}$ in (1.4),

\widehat{\mathcal{S}}_{\theta,n}=\sum_{k=0}^{n}\int s_{\theta,k}(x_{k-1},x_{k})\,\widehat{p}_{\theta}(dx_{k-1:k}|y_{0:n}).

To be able to apply the above method, the marginal smoothing approximation in (2.3) is obtained via a backward recursive approach. In particular, starting from $k=n$ (where the approximation is provided by the standard forward particle filter), one proceeds as follows. Given $k$, the quantity for $k-1$ is directly obtained by integrating out $x_{k}$ in (2.5), thus we have,

\widehat{p}_{\theta}(dx_{k-1}|y_{0:n})=\sum_{i=1}^{N}W_{k-1|n}^{(i)}\delta_{x_{k-1}^{(i)}}(dx_{k-1}),

for the normalised weights,

W_{k-1|n}^{(i)}\propto\sum_{j=1}^{N}W_{k|n}^{(j)}\,\frac{f_{\theta}(x_{k}^{(j)}|x_{k-1}^{(i)})W_{k-1}^{(i)}}{\sum_{l=1}^{N}f_{\theta}(x_{k}^{(j)}|x_{k-1}^{(l)})W_{k-1}^{(l)}}.

Notice that – in this version of FFBS – the same particles $\{x_{k}^{(i)}\}_{i=1}^{N}$ are used in both directions (the ones before resampling at the forward filter), but with different weights.
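For concreteness, the following is a minimal Julia sketch of the backward weight update above, assuming the transition density values have been evaluated into a matrix; the function name `ffbs_backward_step` is ours, for illustration.

```julia
# One backward step of FFBS: given forward filter weights W_{k-1},
# current smoothing weights W_{k|n}, and the matrix
#   F[i, j] = f_theta(x_k^(j) | x_{k-1}^(i)),
# return the smoothing weights W_{k-1|n}.
function ffbs_backward_step(Wprev::Vector{Float64},
                            Wsmooth::Vector{Float64},
                            F::Matrix{Float64})
    N = length(Wprev)
    Wout = zeros(N)
    for j in 1:N
        denom = sum(F[l, j] * Wprev[l] for l in 1:N)
        for i in 1:N
            Wout[i] += Wsmooth[j] * F[i, j] * Wprev[i] / denom
        end
    end
    return Wout / sum(Wout)   # normalise
end
```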

An important development made in Del Moral et al., (2010) is transforming the above offline algorithm into an online one. This is achieved by consideration of the sequence of instrumental functionals,

T_{\theta,0}(x_{0})=s_{\theta,0}(x_{0});\qquad T_{\theta,n}(x_{n}):=\int S_{\theta,n}(x_{0:n})\,p_{\theta}(dx_{0:n-1}|y_{0:n-1},x_{n}),\quad n\geq 1.

Notice that, first,

\mathcal{S}_{\theta,n}=\int T_{\theta,n}(x_{n})\,p_{\theta}(dx_{n}|y_{0:n}).

We also have that,

T_{\theta,n}(x_{n}) =\int\big{[}T_{\theta,n-1}(x_{n-1})+s_{\theta,n}(x_{n-1},x_{n})\big{]}\,p_{\theta}(dx_{n-1}|y_{0:n-1},x_{n})
\equiv\frac{\int\big{[}T_{\theta,n-1}(x_{n-1})+s_{\theta,n}(x_{n-1},x_{n})\big{]}\,f_{\theta}(x_{n}|x_{n-1})\,p_{\theta}(dx_{n-1}|y_{0:n-1})}{\int f_{\theta}(x_{n}|x_{n-1})\,p_{\theta}(dx_{n-1}|y_{0:n-1})},   (2.6)

– see Proposition 2.1 of Del Moral et al., (2010) for a (simple) proof. This recursion provides an online – forward-only – advancement of FFBS for estimating the smoothing expectation of additive functionals. The complete method is summarised in Algorithm 1: one key ingredient is that, during the recursion, values of the functional $T_{\theta,n}(x_{n})$ are only required at the discrete positions $x_{n}^{(i)}$ determined by the forward particle filter.

In the SDE context, under Assumption 1, the transition density $f_{\theta}(\cdot|\cdot)$ is considered intractable, thus Algorithm 1 – apart from serving as a review of the method in Del Moral et al., (2010) – does not appear to be practical in the continuous-time case.

Algorithm 1 Online Forward-Only Smoothing (Del Moral et al., 2010)
  1. (i)

    Initialise particles $\{x_{0}^{(i)},W_{0}^{(i)}\}_{i=1}^{N}$, with $x_{0}^{(i)}\stackrel{\mathrm{i.i.d.}}{\sim}p_{\theta}(dx_{0})$, $W_{0}^{(i)}\propto g_{\theta}(y_{0}|x_{0}^{(i)})$, and functionals $\widehat{T}_{\theta,0}(x_{0}^{(i)})=s_{\theta,0}(x_{0}^{(i)})$, for $1\leq i\leq N$.

  2. (ii)

    Assume that at time $n-1$, one has a particle approximation $\{x_{n-1}^{(i)},W_{n-1}^{(i)}\}_{i=1}^{N}$ of the filtering law $p_{\theta}(dx_{n-1}|y_{0:n-1})$ and estimators $\widehat{T}_{\theta,n-1}(x_{n-1}^{(i)})$ of $T_{\theta,n-1}(x_{n-1}^{(i)})$, for $1\leq i\leq N$.

  3. (iii)

    At time $n$, sample $x_{n}^{(i)}$, for $1\leq i\leq N$, from the mixture (Gordon et al., 1993),

    x_{n}^{(i)}\sim\widehat{p}_{\theta}(x_{n}|y_{0:n-1})=\sum_{j=1}^{N}W_{n-1}^{(j)}f_{\theta}(x_{n}|x_{n-1}^{(j)}),

    and assign particle weights $W_{n}^{(i)}\propto g_{\theta}(y_{n}|x_{n}^{(i)})$, $1\leq i\leq N$.

  4. (iv)

    Then set, for $1\leq i\leq N$,

    \widehat{T}_{\theta,n}(x_{n}^{(i)}) =\sum_{j=1}^{N}\frac{W_{n-1}^{(j)}f_{\theta}(x_{n}^{(i)}|x_{n-1}^{(j)})}{\sum_{l=1}^{N}W_{n-1}^{(l)}f_{\theta}(x_{n}^{(i)}|x_{n-1}^{(l)})}\,\big{[}\widehat{T}_{\theta,n-1}(x_{n-1}^{(j)})+s_{\theta,n}(x_{n-1}^{(j)},x_{n}^{(i)})\big{]}.
  5. (v)

    Obtain an estimate of $\mathcal{S}_{\theta,n}$ as,

    \widehat{\mathcal{S}}_{\theta,n}=\sum_{i=1}^{N}W_{n}^{(i)}\widehat{T}_{\theta,n}(x_{n}^{(i)}).
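As an illustration of the structure of Algorithm 1, the following is a minimal Julia sketch for a toy scalar linear-Gaussian HMM with known transition density; the additive functional chosen here, $\sum_{k}x_{k-1}x_{k}$, and all function names are ours, and this is not the paper's implementation.

```julia
using Random

# Gaussian density, used for both f_theta and g_theta of the toy HMM
#   x_k = a*x_{k-1} + sx*xi_k,   y_k = x_k + sy*eta_k.
normpdf(z, m, s) = exp(-0.5 * ((z - m) / s)^2) / (s * sqrt(2pi))

function forward_only_smoothing(y::Vector{Float64}; N=100, a=0.9, sx=0.5, sy=0.1)
    n = length(y)
    x = randn(N)                               # (i) x_0^(i) ~ N(0, 1)
    W = normpdf.(y[1], x, sy); W ./= sum(W)
    T = zeros(N)                               # take s_0(x_0) = 0 here
    Shat = Float64[]
    for k in 2:n
        # (iii) sample from the mixture sum_j W^(j) f(. | x^(j))
        cw   = cumsum(W)
        anc  = [searchsortedfirst(cw, rand()) for _ in 1:N]
        xnew = a .* x[anc] .+ sx .* randn(N)
        # (iv) update the T functionals; s_k(x_{k-1}, x_k) = x_{k-1} * x_k
        Tnew = zeros(N)
        for i in 1:N
            num = [W[j] * normpdf(xnew[i], a * x[j], sx) for j in 1:N]
            num ./= sum(num)
            Tnew[i] = sum(num[j] * (T[j] + x[j] * xnew[i]) for j in 1:N)
        end
        # new particle weights and (v) estimate of the smoothing expectation
        Wnew = normpdf.(y[k], xnew, sy); Wnew ./= sum(Wnew)
        push!(Shat, sum(Wnew .* Tnew))
        x, W, T = xnew, Wnew, Tnew
    end
    return Shat
end

Shat = forward_only_smoothing(0.5 .* randn(50))   # dummy data
```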

3 Data Augmentation on Diffusion Pathspace

To overcome the intractability of the transition density $f_{\theta}(\cdot|\cdot)$ of the SDE, we will work with an algorithm that is defined in continuous-time and makes use of the complete SDE path-particles in its development. The new method has connections with earlier works in the literature. Särkkä and Sottinen, (2008) focus on the filtering problem for a class of models related to (1.1)-(1.3), and come up with an approach that requires the complete SDE path, for a limited class of diffusions with additive noise and no jumps. Fearnhead et al., (2008) also deal with the filtering problem, and – equipped with an unbiased estimator of the unknown transition density – recast the problem as one of filtering over an augmented space that incorporates the randomness for the unbiased estimate. The method is accompanied by strict conditions on the drift and diffusion coefficient (one should be able to transform the SDE – no jumps – into one of unit diffusion coefficient; the drift of the SDE must have a gradient form). Our contribution requires, in principle, solely the diffusion coefficient invertibility Assumption 2; arguably, the weakened condition we require stems from the fact that our approach appears to be the most ‘natural’ extension (compared to alternatives) of the standard discrete-time algorithm of Del Moral et al., (2010).

The latter discrete-time method requires the density $f_{\theta}(x^{\prime}|x)=f_{\theta}(dx^{\prime}|x)/dx^{\prime}$. In continuous-time, we obtain an analytically available Radon-Nikodym derivative of $p_{\theta}(d\mathbf{x}^{\prime}|x)$, for a properly defined variate $\mathbf{x}^{\prime}$ that involves information about the continuous-time path for moving from $x$ to $x^{\prime}$ within time $\Delta$. We will give the complete algorithm in Section 4. In this section, we prepare the ground by carefully determining $\mathbf{x}^{\prime}$ given $x$, and calculating the relevant densities to be later plugged into our method.

3.1 SDEs with Continuous Paths

We work first with the process with continuous sample-paths, i.e. of dynamics,

dX_{t}=b_{\theta}(X_{t})dt+\sigma_{\theta}(X_{t})dW_{t}.   (3.1)

We adopt an approach motivated by techniques used for MCMC algorithms (Chib et al., 2004; Golightly and Wilkinson, 2008; Roberts and Stramer, 2001). Assume we are given starting point $x\in\mathbb{R}^{d_{x}}$, ending point $x^{\prime}\in\mathbb{R}^{d_{x}}$, and the complete continuous-time path for the signal process in (3.1) on $[0,T]$, for some $T>0$. That is, we now work with the path process,

\big{[}\,\{X_{t};t\in[0,T]\}\,\big{|}\,X_{0}=x,X_{T}=x^{\prime}\,\big{]}.   (3.2)

Let $\mathbb{P}_{\theta,x,x^{\prime}}$ denote the probability distribution of the pathspace-valued variable in (3.2). We consider the auxiliary bridge process $\tilde{X}=\{\tilde{X}_{t};t\in[0,T]\}$ defined as,

d\tilde{X}_{t}=\Big{\{}b_{\theta}(\tilde{X}_{t})+\frac{x^{\prime}-\tilde{X}_{t}}{T-t}\Big{\}}dt+\sigma_{\theta}(\tilde{X}_{t})dW_{t},\quad\tilde{X}_{0}=x,\quad t\in[0,T],   (3.3)

with corresponding probability distribution $\mathbb{Q}_{\theta,x,x^{\prime}}$. Critically, a path of $\tilde{X}$ starts at point $x$ and finishes at $x^{\prime}$, w.p. 1. Under regularity conditions, Delyon and Hu, (2006) prove that the probability measures $\mathbb{P}_{\theta,x,x^{\prime}}$, $\mathbb{Q}_{\theta,x,x^{\prime}}$ are absolutely continuous with respect to each other. We treat the auxiliary SDE (3.3) as a transform from the driving noise to the solution, whence a sample path, $X$, of the process $\tilde{X}=\{\tilde{X}_{t};t\in[0,T]\}$ is produced by a mapping – determined by (3.3) – of a corresponding sample path, say $Z$, of the Wiener process. That is, we have set up a map, and – under Assumption 2 – its inverse,

Z\mapsto X=:F_{\theta}(Z;x,x^{\prime}),\quad Z=F^{-1}_{\theta}(X;x,x^{\prime}).   (3.4)

More analytically, $F_{\theta}^{-1}$ is given via the transform,

dZ_{t}=\sigma_{\theta}(X_{t})^{-1}\Big{\{}dX_{t}-b_{\theta}(X_{t})dt-\frac{x^{\prime}-X_{t}}{T-t}dt\Big{\}}.
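In practice the maps in (3.4) are applied on a time grid. The following Julia sketch gives an Euler-type discretisation of $F_{\theta}$ (driving a path of the auxiliary bridge (3.3) by Wiener increments) and of its inverse, for a scalar SDE; the interface – paths represented by their $M-1$ interior grid points, with end points pinned at $x$, $x^{\prime}$ – and the function names are ours, for illustration.

```julia
# Euler-type discretisation of Z -> X = F_theta(Z; x, xp) from the
# auxiliary bridge (3.3), and of its inverse, for a scalar SDE.  `dZ`
# holds the M-1 Wiener increments on a grid of mesh delta = T/M; the
# returned `X` holds the M-1 interior points (end points pinned).
function bridge_forward(dZ::Vector{Float64}, b, sigma, x, xp, T)
    M = length(dZ) + 1
    delta = T / M
    X = zeros(M - 1)
    xcur = x
    for m in 1:(M - 1)
        t = (m - 1) * delta
        drift = b(xcur) + (xp - xcur) / (T - t)
        xcur += drift * delta + sigma(xcur) * dZ[m]
        X[m] = xcur
    end
    return X                      # interior points; X_T = xp by the pinning
end

function bridge_inverse(X::Vector{Float64}, b, sigma, x, xp, T)
    M = length(X) + 1
    delta = T / M
    dZ = zeros(M - 1)
    xcur = x
    for m in 1:(M - 1)
        t = (m - 1) * delta
        drift = b(xcur) + (xp - xcur) / (T - t)
        dZ[m] = (X[m] - xcur - drift * delta) / sigma(xcur)
        xcur = X[m]
    end
    return dZ
end

# round-trip check on a random Wiener increment path
dZ = sqrt(1.0 / 10) .* randn(9)
X  = bridge_forward(dZ, x -> 0.5 * (0.0 - x), x -> 0.4, 0.0, 0.3, 1.0)
maximum(abs.(bridge_inverse(X, x -> 0.5 * (0.0 - x), x -> 0.4, 0.0, 0.3, 1.0) .- dZ))
```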

In this case we define,

\mathbf{x}^{\prime}:=(x^{\prime},Z),

and the probability measure of interest is,

p_{\theta}(d\mathbf{x}^{\prime}|x):=f_{\theta}(dx^{\prime}|x)\otimes p_{\theta}(dZ|x^{\prime},x).   (3.5)

Let $\mathbb{W}$ be the standard Wiener probability measure on $[0,T]$. Due to the 1–1 transform, we have that,

\frac{p_{\theta}(dZ|x^{\prime},x)}{\mathbb{W}(dZ)}\equiv\frac{d\mathbb{P}_{\theta,x,x^{\prime}}}{d\mathbb{Q}_{\theta,x,x^{\prime}}}(F_{\theta}(Z;x,x^{\prime})),

so it remains to obtain the density $d\mathbb{P}_{\theta,x,x^{\prime}}/d\mathbb{Q}_{\theta,x,x^{\prime}}$.

Such a Radon-Nikodym derivative has been an object of interest in many works. Delyon and Hu, (2006) provided detailed conditions and a proof, but (seemingly) omit an expression for the normalising constant, which is important in our case as it involves the parameter $\theta$ – in our applications later in the paper, we aim to infer $\theta$. Papaspiliopoulos et al., (2013); Papaspiliopoulos and Roberts, (2012) provide an expression based on a conditioning argument for the projection of the probability measures on $[0,t)$, $t<T$, and passage to the limit $t\uparrow T$. The derivations in Delyon and Hu, (2006) are extremely rigorous, so we will make use of the expressions in that paper. Following carefully the proofs of some of their main results (Theorem 5, together with Lemmas 7, 8) one can indeed retrieve the constant in the deduced density. In particular, Delyon and Hu, (2006) impose the following conditions.

Assumption 3.
  • (i)

    SDE (3.1) admits a strong solution;

  • (ii)

    $v\mapsto\sigma_{\theta}(v)$ is in $\mathcal{C}^{2}_{b}$ (i.e., twice continuously differentiable and bounded, with bounded first and second derivatives), and it is invertible, with bounded inverse;

  • (iii)

    $v\mapsto b_{\theta}(v)$ is locally Lipschitz, locally bounded.

Under Assumption 3, Delyon and Hu, (2006) prove that,

\frac{d\mathbb{P}_{\theta,x,x^{\prime}}}{d\mathbb{Q}_{\theta,x,x^{\prime}}}(X)=\frac{|\Sigma_{\theta}(x^{\prime})|^{1/2}}{|\Sigma_{\theta}(x)|^{1/2}}\times\frac{\mathcal{N}(x^{\prime};x,T\Sigma_{\theta}(x))}{f_{\theta}(x^{\prime}|x)}\times\varphi_{\theta}(X;x,x^{\prime}),   (3.6)

where $\mathcal{N}(v;\mu,V)$ is the density function of the Gaussian law on $\mathbb{R}^{d_{x}}$ with mean $\mu$, variance $V$, evaluated at $v$, $|\cdot|$ is the matrix determinant, and $\varphi_{\theta}(X;x,x^{\prime})$ is such that,

\log\varphi_{\theta}(X;x,x^{\prime})=\int_{0}^{T}\big{\langle}\,b_{\theta}(X_{t}),\Sigma^{-1}_{\theta}(X_{t})dX_{t}\,\big{\rangle}-\tfrac{1}{2}\int_{0}^{T}\big{\langle}\,b_{\theta}(X_{t}),\Sigma_{\theta}^{-1}(X_{t})b_{\theta}(X_{t})\,\big{\rangle}dt
\;-\tfrac{1}{2}\int_{0}^{T}\tfrac{\left\langle\,(x^{\prime}-X_{t}),\,d\Sigma^{-1}_{\theta}(X_{t})(x^{\prime}-X_{t})\,\right\rangle}{T-t}-\tfrac{1}{2}\int_{0}^{T}\tfrac{\sum_{i,j=1}^{d_{x}}d\,[\,\Sigma_{\theta,ij}^{-1},(x^{\prime}_{i}-X_{t,i})(x^{\prime}_{j}-X_{t,j})\,]}{T-t}.

Here, $[\cdot,\cdot]$ denotes the quadratic variation process for semi-martingales; also, $\langle\cdot,\cdot\rangle$ is the standard inner-product on $\mathbb{R}^{d_{x}}$. We note that transforms different from (3.4) have been proposed in the literature (Dellaportas et al., 2006; Kalogeropoulos et al., 2010) to achieve the same effect of obtaining a 1–1 mapping of the latent path that has a density with respect to a measure that does not depend on $x$, $x^{\prime}$ or $\theta$. However, such methods are mostly applicable for scalar diffusions (Aït-Sahalia, 2008). Auxiliary variables involving a random, finite selection of points of the latent path, based on the (generalised) Poisson estimator of Fearnhead et al., (2008), are similarly restrictive. In contrast to other attempts, our methodology may be applied for a much more general class of SDEs, as determined by Assumption 2 – and further regularity conditions, e.g. as in Assumption 3. Thus, continuing from (3.5) we have obtained that,

\frac{p_{\theta}(d\mathbf{x}^{\prime}|x)}{(\mathrm{Leb}^{\otimes d_{x}}\otimes\mathbb{W})(d\mathbf{x}^{\prime})} =\varphi_{\theta}\big{(}F_{\theta}(Z;x,x^{\prime});x,x^{\prime}\big{)}\times\mathcal{N}(x^{\prime};x,T\Sigma_{\theta}(x))\times\frac{|\Sigma_{\theta}(x^{\prime})|^{1/2}}{|\Sigma_{\theta}(x)|^{1/2}}
=:p_{\theta}(\mathbf{x}^{\prime}|x)\equiv p_{\theta}(\mathbf{x}^{\prime}|x;T).   (3.7)

We have added the extra argument involving the length of the path, $T$, in the last expression, as it will be of use in the next section.

Remark 1.

A critical point here is that the above density is analytically tractable; thus, by working on the pathspace we have overcome the unavailability of the transition density $f_{\theta}(x^{\prime}|x)$.
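On a PC, the integrals appearing in $\log\varphi_{\theta}$ are approximated on the time grid. The following Julia sketch evaluates the log of the density (3.7) for a scalar SDE with constant diffusion coefficient $\sigma$ – in which case $\Sigma_{\theta}^{-1}$ is constant, the last two integrals in $\log\varphi_{\theta}$ vanish and the determinant ratio equals one. It is a sketch under these simplifying assumptions, not a general implementation; the function name is ours.

```julia
# Time-discretised evaluation of log p_theta(x', Z | x) in (3.7) for a
# scalar SDE with *constant* diffusion coefficient `sigma`.  `Xint` holds
# the M-1 interior grid points of the bridge path (end points x, xp).
function log_pathspace_density(Xint::Vector{Float64}, b, sigma, x, xp, T)
    M = length(Xint) + 1
    delta = T / M
    grid = vcat(x, Xint, xp)          # X_0, ..., X_M
    logphi = 0.0
    for m in 1:M
        xm = grid[m]
        dX = grid[m + 1] - grid[m]
        # Riemann/Ito sums of the first two integrals in log(phi)
        logphi += b(xm) * dX / sigma^2 - 0.5 * b(xm)^2 * delta / sigma^2
    end
    # log of the Gaussian density N(x'; x, T*sigma^2)
    lognorm = -0.5 * log(2pi * T * sigma^2) - (xp - x)^2 / (2 * T * sigma^2)
    return logphi + lognorm
end

# dummy path of 9 interior points between x = 0.0 and xp = 0.3
Xdummy = collect(range(0.0, 0.3, length=11))[2:10]
log_pathspace_density(Xdummy, x -> 0.5 * (0.0 - x), 0.4, 0.0, 0.3, 1.0)
```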

3.2 SDEs with Jumps

We extend the above developments to the more general case of the $d_{x}$-dimensional jump diffusion model given in (1.1), which we re-write here for convenience,

dX_{t}=b_{\theta}(X_{t-})dt+\sigma_{\theta}(X_{t-})dW_{t}+dJ_{t},\quad X_{0}=x\in\mathbb{R}^{d_{x}},\quad t\in[0,T].   (3.8)

Recall that $J_{\theta}=J=\{J_{t}\}$ denotes a compound Poisson process with jump intensity $\lambda_{\theta}(\cdot)$ and jump-size density $h_{\theta}(\cdot)$. Let $\mathbb{F}_{\theta,x}(\cdot)$ denote the law of the unconditional process (3.8) and $\mathbb{L}_{\theta}$ the law of the involved compound Poisson process. We write $J=((\tau_{1},b_{1}),\ldots,(\tau_{\kappa},b_{\kappa}))$ to denote the jump process, where $\{\tau_{i}\}$ are the times of events, $\{b_{i}\}$ the jump sizes and $\kappa\geq 0$ the total number of events. In addition, we consider the reference measure $\mathbb{L}$, corresponding to the unit rate Poisson process measure on $[0,T]$ multiplied with $\otimes_{i=1}^{\kappa+1}\mathrm{Leb}^{\otimes d_{x}}$.

Construct One

We consider the random variate,

\mathbf{x}^{\prime}=\big{(}J,\{x_{\tau_{i}-}\}_{i=1}^{\kappa+1},\{Z(i)\}_{i=1}^{\kappa+1}\big{)},

under the conventions $x_{\tau_{0}-}\equiv x$, $x_{\tau_{\kappa+1}-}\equiv x^{\prime}$, where we have defined,

Z(i)=F^{-1}_{\theta}(X(i)\,;\,x_{\tau_{i-1}},x_{\tau_{i}-}),\quad X(i):=\{X_{t};t\in[\tau_{i-1},\tau_{i})\},\quad 1\leq i\leq\kappa+1.

We have that,

p_{\theta}(d\mathbf{x}^{\prime}|x):=\mathbb{L}_{\theta}(dJ)\otimes\Big{[}\otimes_{i=1}^{\kappa+1}\big{\{}\,f_{\theta}(dx_{\tau_{i}-}|x_{\tau_{i-1}})\otimes p_{\theta}(dZ(i)|x_{\tau_{i-1}},x_{\tau_{i}-})\,\big{\}}\Big{]}.

Using the results about SDEs without jumps in Section 3.1, and in particular expression (3.7), upon defining,

\mathbf{x}^{\prime}(i):=(x_{\tau_{i}-},Z(i)),

we have that – under an apparent adaptation of Assumption 3,

\frac{f_{\theta}(dx_{\tau_{i}-}|x_{\tau_{i-1}})\otimes p_{\theta}(dZ(i)|x_{\tau_{i-1}},x_{\tau_{i}-})}{\mathrm{Leb}^{\otimes d_{x}}(dx_{\tau_{i}-})\otimes\mathbb{W}(dZ(i))}=p_{\theta}(\mathbf{x}^{\prime}(i)|x_{\tau_{i-1}};\tau_{i}-\tau_{i-1}),

with the latter density $p_{\theta}(\mathbf{x}^{\prime}(i)|x_{\tau_{i-1}};\tau_{i}-\tau_{i-1})$ determined as in (3.7), given clear adjustments. Thus, the density of $p_{\theta}(d\mathbf{x}^{\prime}|x)$ with respect to the reference measure,

\mu(d\mathbf{x}^{\prime}):=\mathbb{L}(dJ)\otimes\Big{[}\otimes_{i=1}^{\kappa+1}\big{\{}\mathrm{Leb}^{\otimes d_{x}}(dx_{\tau_{i}-})\otimes\mathbb{W}(dZ(i))\big{\}}\Big{]},

is equal to,

\frac{p_{\theta}(d\mathbf{x}^{\prime}|x)}{\mu(d\mathbf{x}^{\prime})}=\frac{e^{-\int_{0}^{T}\lambda_{\theta}(t)dt}}{e^{-T}}\cdot\prod_{i=1}^{\kappa}\big{\{}\lambda_{\theta}(\tau_{i})h_{\theta}(b_{i})\big{\}}\times\prod_{i=1}^{\kappa+1}p_{\theta}(\mathbf{x}^{\prime}(i)|x_{\tau_{i-1}};\tau_{i}-\tau_{i-1}).
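The first two factors above involve only the compound Poisson process; a small Julia sketch of their logarithm is given below, with the intensity integral approximated on a grid (the last factor is a product of bridge densities over the inter-jump intervals, computed as in the previous sketch). The function name `log_jump_factor` is ours, for illustration.

```julia
# Log of the compound-Poisson part of the Construct One density:
#   exp(-int_0^T lambda(t) dt) / exp(-T) * prod_i lambda(tau_i) h(b_i),
# with the time integral of the intensity approximated by a midpoint rule.
function log_jump_factor(taus::Vector{Float64}, bs::Vector{Float64},
                         lambda, logh, T; M=1000)
    delta = T / M
    int_lambda = sum(lambda((m - 0.5) * delta) for m in 1:M) * delta
    out = -int_lambda + T
    for (tau, b) in zip(taus, bs)
        out += log(lambda(tau)) + logh(b)
    end
    return out
end

# constant intensity 0.5, uniform jump sizes on (-0.5, 0.5) (density 1)
log_jump_factor([0.3, 0.7], [0.1, -0.2], t -> 0.5, b -> log(1.0), 1.0)
```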

Construct Two

We adopt an idea used – for a different problem – in Gonçalves and Roberts, (2014). Given $x,x^{\prime}\in\mathbb{R}^{d_{x}}$, we define an auxiliary process $\tilde{X}_{t}$ as follows,

d\tilde{X}_{t}=\Big{\{}b_{\theta}(\tilde{X}_{t})+\frac{x^{\prime}-J_{T}-\tilde{X}_{t}+J_{t}}{T-t}\Big{\}}dt+\sigma_{\theta}(\tilde{X}_{t})dW_{t}+dJ_{t},\quad\tilde{X}_{0}=x,   (3.9)

so that $\tilde{X}_{T}=x^{\prime}$, w.p. 1. As with (3.4), we view (3.9) as a transform, projecting a path, $Z$, of the Wiener process and the compound Poisson process, $J$, onto a path, $X$, of the jump diffusion. That is, we consider the 1–1 maps,

(J,Z)\mapsto X=:G_{\theta}(J,Z;x,x^{\prime}),\quad(J,Z)=G^{-1}_{\theta}(X;x,x^{\prime}).

Notice that for the inverse transform, the $J$-part is obtained immediately, whereas for the $Z$-part one uses the expression – well-defined due to Assumption 2 –,

dZ_{t}=\sigma_{\theta}(X_{t})^{-1}\Big{\{}dX_{t}-dJ_{t}-b_{\theta}(X_{t})dt-\frac{x^{\prime}-J_{T}-X_{t}+J_{t}}{T-t}dt\Big{\}}.

We denote by $\overline{\mathbb{P}}_{\theta,x,x^{\prime}}$ the law of the original process in (3.8) conditionally on hitting $x^{\prime}$ at time $T$. Also, we denote the distribution on pathspace induced by (3.9) as $\overline{\mathbb{Q}}_{\theta,x,x^{\prime}}$. Consider the variate,

\mathbf{x}^{\prime}=(x^{\prime},J,Z),

so that,

p_{\theta}(d\mathbf{x}^{\prime}|x):=f_{\theta}(dx^{\prime}|x)\otimes p_{\theta}\big{(}d(J,Z)|x^{\prime},x\big{)}.

Due to the employed 1–1 transforms, we have that,

\frac{p_{\theta}(d\mathbf{x}^{\prime}|x)}{\big{(}\mathrm{Leb}^{\otimes d_{x}}\otimes\mathbb{L}_{\theta}\otimes\mathbb{W}\big{)}(d\mathbf{x}^{\prime})}=f_{\theta}(x^{\prime}|x)\times\frac{d\overline{\mathbb{P}}_{\theta,x,x^{\prime}}}{d\overline{\mathbb{Q}}_{\theta,x,x^{\prime}}}\big{(}G_{\theta}(J,Z;x,x^{\prime})\big{)}.

Thus, using the parameter-free reference measure,

\mu(d\mathbf{x}^{\prime}):=\mathrm{Leb}^{\otimes d_{x}}\otimes\mathbb{L}\otimes\mathbb{W},

one obtains that,

\frac{p_{\theta}(d\mathbf{x}^{\prime}|x)}{\mu(d\mathbf{x}^{\prime})} =f_{\theta}(x^{\prime}|x)\times\frac{e^{-\int_{0}^{T}\lambda_{\theta}(t)dt}}{e^{-T}}\cdot\prod_{i=1}^{\kappa}\big{\{}\lambda_{\theta}(\tau_{i})h_{\theta}(b_{i})\big{\}}\times\frac{d\overline{\mathbb{P}}_{\theta,x,x^{\prime}}}{d\overline{\mathbb{Q}}_{\theta,x,x^{\prime}}}\big{(}G_{\theta}(J,Z;x,x^{\prime})\big{)}.   (3.10)
Remark 2.

Delyon and Hu, (2006) obtained the Radon-Nikodym derivative expression in (3.6) after a great amount of rigorous analysis. A similar development for the case of conditioned jump diffusions does not follow from their work, and can only be the subject of dedicated research at the scale of a separate paper. This is beyond the scope of our work. In practice, one can proceed as follows. For grid size $M\geq 1$, and $\delta=T/M$, let $\overline{\mathbb{P}}^{M}_{\theta,x,x^{\prime}}(X_{\delta},\ldots,X_{(M-1)\delta}\,|\,X_{0}=x,X_{M\delta}=x^{\prime})$ denote the time-discretised Lebesgue density of the $(M-1)$ positions of the conditioned diffusion with law $\overline{\mathbb{P}}_{\theta,x,x^{\prime}}$. Once (3.10) is obtained, a time-discretisation approach will give,

\overline{\mathbb{P}}^{M}_{\theta,x,x^{\prime}}(X_{\delta},\ldots,X_{(M-1)\delta}\,|\,X_{0}=x,X_{M\delta}=x^{\prime})=\frac{\overline{\mathbb{P}}^{M}_{\theta,x,x^{\prime}}(X_{\delta},\ldots,X_{(M-1)\delta},X_{M\delta}=x^{\prime}\,|\,X_{0}=x)}{\overline{\mathbb{P}}^{M}_{\theta,x,x^{\prime}}(X_{M\delta}=x^{\prime}\,|\,X_{0}=x)}.

In this time-discretised setting, $f_{\theta}(x^{\prime}|x)$ in (3.6) will be replaced by $\overline{\mathbb{P}}^{M}_{\theta,x,x^{\prime}}(X_{M\delta}=x^{\prime}\,|\,X_{0}=x)$. Thus, the intractable transition density over the complete time period will cancel out, and one is left with an explicit expression to use on a PC. Compared to the method in Section 3.1, and Construct One in the current section, we do not have explicit theoretical evidence of a density on the pathspace. Yet, all numerical experiments we tried showed that the deduced algorithm was stable under mesh-refinement. We thus adopt the approach (or, conjecture) that the density in (3.10) exists – under assumptions – and can be obtained pending future research.

4 Forward-Only Smoothing for SDEs

4.1 Pathspace Algorithm

We are ready to develop a forward-only particle smoothing method, under the scenario in (1.4)-(1.5), in the pathspace setting. We will work with the pairs of random elements

(x_{k-1},\mathbf{x}_{k}),\quad 1\leq k\leq n,

with $\mathbf{x}_{k}$ as defined in Section 3, i.e. containing pathspace elements, and given by a 1–1 transform of $\{X_{s};s\in[t_{k-1},t_{k}]\}$, such that we can obtain a density for $p_{\theta}(\mathbf{x}_{k}|x_{k-1})$ with respect to a reference measure that does not involve $\theta$. Recall that $p_{\theta}(d\mathbf{x}_{k}|x_{k-1})$ denotes the probability law for the augmented variable $\mathbf{x}_{k}$ given $x_{k-1}$. We also write the corresponding density as,

p_{\theta}(\mathbf{x}_{k}|x_{k-1}):=\frac{p_{\theta}(d\mathbf{x}_{k}|x_{k-1})}{\mu(d\mathbf{x}_{k})}.

The quantity of interest is now,

\mathcal{S}_{\theta,n}=\int S_{\theta}(\mathbf{x}_{0:n})\,p_{\theta}(d\mathbf{x}_{0:n}|y_{0:n}),\quad n\geq 1,

for the class of additive functionals $S_{\theta}(\cdot)$ of the structure,

S_{\theta}(\mathbf{x}_{0:n})=\sum_{k=0}^{n}s_{\theta,k}(x_{k-1},\mathbf{x}_{k}),

under the convention that $x_{-1}=\emptyset$. Notice that we now allow $s_{\theta,k}(\cdot,\cdot)$ to be a function of $x_{k-1}$ and $\mathbf{x}_{k}$; thus, $s_{\theta,k}(\cdot,\cdot)$ can potentially correspond to integrals, or other pathspace functionals. We will work with a transition density on the enlarged space of $\mathbf{x}_{k}$.

Similarly to the discrete-time case in Section 2, we define the functional,

T_{\theta,n}(\mathbf{x}_{n}):=\int S_{\theta}(\mathbf{x}_{0:n})\,p_{\theta}(d\mathbf{x}_{0:n-1}|y_{0:n-1},\mathbf{x}_{n}).
Proposition 1.

We have that,

\mathcal{S}_{\theta,n}=\int T_{\theta,n}(\mathbf{x}_{n})\,p_{\theta}(d\mathbf{x}_{n}|y_{0:n}).
Proof.

See Appendix A. ∎

Critically, as in (2.6), we obtain the following recursion.

Proposition 2.

For any $n\geq 1$, we have that,

T_{\theta,n}(\mathbf{x}_{n}) =\int\big{[}T_{\theta,n-1}(\mathbf{x}_{n-1})+s_{\theta,n}(x_{n-1},\mathbf{x}_{n})\big{]}\,p_{\theta}(d\mathbf{x}_{n-1}|y_{0:n-1},\mathbf{x}_{n})
\equiv\frac{\int\big{[}T_{\theta,n-1}(\mathbf{x}_{n-1})+s_{\theta,n}(x_{n-1},\mathbf{x}_{n})\big{]}\,p_{\theta}(\mathbf{x}_{n}|x_{n-1})\,p_{\theta}(d\mathbf{x}_{n-1}|y_{0:n-1})}{\int p_{\theta}(\mathbf{x}_{n}|x_{n-1})\,p_{\theta}(dx_{n-1}|y_{0:n-1})}.
Proof.

See Appendix B. ∎

Proposition 2 gives rise to a Monte-Carlo methodology for a forward-only, online approximation of the smoothing expectation of interest. This is given in Algorithm 2.

Algorithm 2 Online Forward-Only Particle Smoothing on Pathspace
  1. (i)

    Initialise the particles $\{x_{0}^{(i)},W_{0}^{(i)}\}_{i=1}^{N}$, with $x_{0}^{(i)}\stackrel{\mathrm{i.i.d.}}{\sim}p_{\theta}(dx_{0})$, $W_{0}^{(i)}\propto g_{\theta}(y_{0}|x_{0}^{(i)})$, and functionals $\widehat{T}_{\theta,0}(x_{0}^{(i)})=s_{\theta,0}(\mathbf{x}_{0}^{(i)})$, where $\mathbf{x}_{0}^{(i)}\equiv x_{0}^{(i)}$, $1\leq i\leq N$.

  2. (ii)

    Assume that at time $n-1$, one has a particle approximation $\{\mathbf{x}_{n-1}^{(i)},W_{n-1}^{(i)}\}_{i=1}^{N}$ of the filtering law $p_{\theta}(d\mathbf{x}_{n-1}|y_{0:n-1})$ and estimators $\widehat{T}_{\theta,n-1}(\mathbf{x}_{n-1}^{(i)})$ of $T_{\theta,n-1}(\mathbf{x}_{n-1}^{(i)})$, $1\leq i\leq N$.

  3. (iii)

    At time $n$, sample $\mathbf{x}_{n}^{(i)}$, for $1\leq i\leq N$, from

    \mathbf{x}_{n}^{(i)}\sim\widehat{p}_{\theta}(d\mathbf{x}_{n}|y_{0:n-1})=\sum_{j=1}^{N}W_{n-1}^{(j)}p_{\theta}(d\mathbf{x}_{n}|x_{n-1}^{(j)}),

    and assign particle weights $W_{n}^{(i)}\propto g_{\theta}(y_{n}|y_{n-1},\mathcal{F}_{n}^{(i)})$, $1\leq i\leq N$.

  4. (iv)

    Then set, for $1\leq i\leq N$,

    \widehat{T}_{\theta,n}(\mathbf{x}_{n}^{(i)}) =\sum_{j=1}^{N}\frac{W_{n-1}^{(j)}\,p_{\theta}(\mathbf{x}_{n}^{(i)}|x_{n-1}^{(j)})}{\sum_{l=1}^{N}W_{n-1}^{(l)}\,p_{\theta}(\mathbf{x}_{n}^{(i)}|x_{n-1}^{(l)})}\,\big{[}\widehat{T}_{\theta,n-1}(\mathbf{x}_{n-1}^{(j)})+s_{\theta,n}(x_{n-1}^{(j)},\mathbf{x}_{n}^{(i)})\big{]}.
  5. (v)

    Obtain an estimate of $\mathcal{S}_{\theta,n}$ as,

    \widehat{\mathcal{S}}_{\theta,n}=\sum_{i=1}^{N}W_{n}^{(i)}\,\widehat{T}_{\theta,n}(\mathbf{x}_{n}^{(i)}).
Remark 3.

Algorithm 2 uses a simple bootstrap filter with multinomial resampling applied at each step. The variability of the Monte-Carlo estimates can be further reduced by incorporating: more effective resampling (e.g., systematic resampling (Carpenter et al., 1999), stratified resampling (Kitagawa, 1996)); dynamic resampling via use of the Effective Sample Size; non-blind proposals in the propagation of the particles.
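The weights appearing in step (iv) of Algorithm 2 are ratios of pathspace densities $p_{\theta}(\mathbf{x}_{n}^{(i)}|x_{n-1}^{(j)})$; when implementing the step it can help to work with log-densities and shift by their maximum before exponentiating. A generic Julia sketch of the update for a single new particle is given below; the function name and interface are ours, for illustration.

```julia
# Step (iv) of Algorithm 2 for one new particle i, written generically:
# logp[j] = log p_theta(x_n^(i) | x_{n-1}^(j)), W are the previous weights,
# T the previous functional estimates, s[j] = s_{theta,n}(x_{n-1}^(j), x_n^(i)).
function update_T(logp::Vector{Float64}, W::Vector{Float64},
                  T::Vector{Float64}, s::Vector{Float64})
    w = W .* exp.(logp .- maximum(logp))   # shift for numerical stability
    w ./= sum(w)
    return sum(w .* (T .+ s))
end

update_T([-1.2, -0.4, -2.0], [0.2, 0.5, 0.3], [0.0, 0.1, -0.1], [0.3, 0.2, 0.4])
```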

4.2 Pathspace versus Finite-Dimensional Construct

One can attempt to define an algorithm without reference to the underlying pathspace. That is, in the case of no jumps (for simplicity), an alternative approach can involve working with a regular grid on the period $[0,T]$, say $\{s_{j}=j\delta\}_{j=0}^{M}$, with $\delta=T/M$ for chosen size $M\geq 1$. Then, defining $\mathbf{x}^{\prime}=(x_{\delta},x_{2\delta},\ldots,x_{M\delta})$, and using, e.g., an Euler-Maruyama time-discretisation scheme to obtain the joint density of such an $\mathbf{x}^{\prime}$ given $x=x_{0}$, a Radon-Nikodym derivative, $p_{\theta}^{M}(\mathbf{x}^{\prime}|x)$ on $\mathbb{R}^{M\times d_{x}}$, can be obtained with respect to the Lebesgue reference measure $\mathrm{Leb}^{\otimes(d_{x}\times M)}$, as a product of $M$ conditionally Gaussian densities. As shown e.g. in the motivating example in the Introduction, such an approach would lead to estimates with variability that increases rapidly with $M$, for a fixed number of Monte-Carlo particles $N$. A central argument in this work is that one should develop the algorithm in a manner that respects the probabilistic properties of the SDE pathspace, before applying (necessarily) a time-discretisation for implementation on a PC. This procedure is not followed merely for purposes of mathematical rigour; it has practical effects on algorithmic performance.

4.3 Consistency

For completeness, we provide an asymptotic result for Algorithm 2 following standard results from the literature. We consider the following assumptions.

Assumption 4.

Let $\mathbf{X}$ and $\mathsf{X}$ denote the state spaces of $\mathbf{x}$ and $x$, respectively.

  1. (i)

    For any relevant $y^{\prime}$, $y$, $\mathcal{F}$, $x^{\prime}$, we have that $g_{\theta}(y^{\prime}|y,\mathcal{F})\equiv g_{\theta}(y^{\prime}|x^{\prime})$, where the latter is a positive function such that, for any $y$, $\sup_{x\in\mathsf{X}}g_{\theta}(y|x)<\infty$.

  2. (ii)

    $\sup_{(x,\mathbf{x}^{\prime})\in\mathsf{X}\times\mathbf{X}}p_{\theta}(\mathbf{x}^{\prime}|x)<\infty$.

Proposition 3.
  1. (i)

    Under Assumption 4, for any $n\geq 0$, there exist constants $b_{n},c_{n}>0$, such that for any $\epsilon>0$,

    \mathrm{Prob}\big{[}\,|\mathcal{S}_{\theta,n}-\widehat{\mathcal{S}}_{\theta,n}|>\epsilon\,\big{]}\leq b_{n}e^{-c_{n}N\epsilon^{2}}.
  2. (ii)

    For any $n\geq 0$, $\widehat{\mathcal{S}}_{\theta,n}\rightarrow\mathcal{S}_{\theta,n}$ w.p. 1, as $N\rightarrow\infty$.

Proof.

See Appendix C. ∎

5 Online Parameter/State Estimation for SDEs

In this section, we derive an online gradient-ascent algorithm for partially observed SDEs. Poyiadjis et al., (2011) use the score function estimation methodology to propose an online gradient-ascent algorithm for obtaining an MLE-type parameter estimate, following ideas in LeGland and Mevel, (1997). In more detail, the method is based on a Robbins-Monro-type recursion,

\theta_{n+1} =\theta_{n}+\gamma_{n+1}\nabla\log p_{\theta_{0:n}}(y_{n}|y_{0:n-1})
=\theta_{n}+\gamma_{n+1}\big{\{}\nabla\log p_{\theta_{0:n}}(y_{0:n})-\nabla\log p_{\theta_{0:n-1}}(y_{0:n-1})\big{\}},   (5.1)

where $\{\gamma_{k}\}_{k}$ is a positive decreasing sequence with,

\sum_{k=1}^{\infty}\gamma_{k}=\infty,\quad\sum_{k=1}^{\infty}\gamma_{k}^{2}<\infty.

The meaning of the quantity $\nabla\log p_{\theta_{0:n}}(y_{0:n})$ is that – given a recursive method (in $n$) for the estimation of $\theta\mapsto\nabla\log p_{\theta}(y_{0:n})$ as we describe below and based on the methodology of Algorithm 2 – one uses $\theta_{n-1}$ when incorporating $y_{n-1}$, then $\theta_{n}$ for $y_{n}$, and similarly for $k>n$. See LeGland and Mevel, (1997); Tadic and Doucet, (2018) for analytical studies of the convergence properties of the deduced algorithm, where under strong conditions the recursion is shown to converge to the ‘true’ parameter value, say $\theta^{\dagger}$, as $n\rightarrow\infty$.

Observe that, from Fisher’s identity (see e.g. Poyiadjis et al., (2011)) we have that,

\nabla\log p_{\theta}(y_{0:n})=\int\nabla\log p_{\theta}(\mathbf{x}_{0:n},y_{0:n})\,p_{\theta}(d\mathbf{x}_{0:n}|y_{0:n}).

Thus, in the context of Algorithm 2, estimation of the score function corresponds to the choice,

S_{\theta}(\mathbf{x}_{0:n})=\nabla\log p_{\theta}(\mathbf{x}_{0:n},y_{0:n})\equiv\sum_{k=0}^{n}\nabla\log p_{\theta}(\mathbf{x}_{k},y_{k}|x_{k-1}),

and,

s_{\theta,k}(x_{k-1},\mathbf{x}_{k})\equiv\nabla\log p_{\theta}(\mathbf{x}_{k},y_{k}|x_{k-1}).   (5.2)
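When an analytical expression for the gradient in (5.2) is inconvenient, a central finite-difference approximation of $\theta\mapsto\log p_{\theta}(\mathbf{x}_{k},y_{k}|x_{k-1})$ (evaluated via its time-discretised version) can be substituted. The Julia sketch below is one such generic option, not necessarily the route followed in our examples; the function name is ours.

```julia
# Central finite-difference approximation of the gradient in (5.2),
# given a function theta -> logp(theta) that evaluates the (time-discretised)
# log-density for fixed arguments (x_{k-1}, x_k, y_k).
function fd_gradient(logp, theta::Vector{Float64}; h=1e-5)
    grad = similar(theta)
    for j in eachindex(theta)
        tp = copy(theta); tp[j] += h
        tm = copy(theta); tm[j] -= h
        grad[j] = (logp(tp) - logp(tm)) / (2h)
    end
    return grad
end

fd_gradient(th -> -0.5 * sum(abs2, th), [0.3, -0.2])   # gradient of a toy log-density
```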

Combination of the Robbins-Monro recursion (5.1) with the one in Algorithm 2 delivers Algorithm 3, which we have presented here in some detail for clarity.

Algorithm 3 Online Gradient-Ascent for SDEs via Forward-Only Smoothing
  1. (i)

    Assume that at time $n\geq 0$, one has a current parameter estimate $\hat{\theta}_{n}$, a particle approximation $\{\mathbf{x}_{n}^{(i)},W_{n}^{(i)}\}_{i=1}^{N}$ of the ‘filter’ $p_{\hat{\theta}_{0:n}}(d\mathbf{x}_{n}|y_{0:n})$ and estimators $\widehat{T}_{\hat{\theta}_{0:n},n}(\mathbf{x}_{n}^{(i)})$ of $T_{\hat{\theta}_{0:n},n}(\mathbf{x}_{n}^{(i)})$, for $1\leq i\leq N$.

  2. (ii)

    Apply the iteration,

    \hat{\theta}_{n+1}=\hat{\theta}_{n}+\gamma_{n+1}\big{\{}\nabla\log p_{\hat{\theta}_{0:n}}(y_{0:n})-\nabla\log p_{\hat{\theta}_{0:n-1}}(y_{0:n-1})\big{\}}.
  3. (iii)

    At time $n+1$, sample $\mathbf{x}_{n+1}^{(i)}$, for $1\leq i\leq N$, from the mixture,

    \mathbf{x}_{n+1}^{(i)}\sim\widehat{p}_{\hat{\theta}_{n+1}}(d\mathbf{x}_{n+1}|y_{0:n})=\sum_{j=1}^{N}W_{n}^{(j)}\,p_{\hat{\theta}_{n+1}}(d\mathbf{x}_{n+1}|x_{n}^{(j)}),

    and assign particle weights $W_{n+1}^{(i)}\propto g_{\hat{\theta}_{n+1}}(y_{n+1}|y_{n},\mathcal{F}_{n+1}^{(i)})$, $1\leq i\leq N$.

  4. (iv)

    Then set, for $1\leq i\leq N$,

    \widehat{T}_{\hat{\theta}_{0:n+1},n+1}(\mathbf{x}_{n+1}^{(i)}) =\sum_{j=1}^{N}\frac{W_{n}^{(j)}\,p_{\theta}(\mathbf{x}_{n+1}^{(i)}|x_{n}^{(j)})}{\sum_{l=1}^{N}W_{n}^{(l)}\,p_{\theta}(\mathbf{x}_{n+1}^{(i)}|x_{n}^{(l)})}\,\big{[}\widehat{T}_{\hat{\theta}_{0:n},n}(\mathbf{x}_{n}^{(j)})+s_{\theta,n+1}(x_{n}^{(j)},\mathbf{x}_{n+1}^{(i)})\big{]},

    where, on the right-hand-side we use the parameter,

    \theta=\hat{\theta}_{n+1}.
  5. (v)

    Obtain the estimate,

    \widehat{\mathcal{S}}_{\hat{\theta}_{n+1},n+1}=\sum_{i=1}^{N}W_{n+1}^{(i)}\,\widehat{T}_{\hat{\theta}_{0:n+1},n+1}(\mathbf{x}_{n+1}^{(i)}).
Remark 4.

When the joint density of $(\mathbf{x}_{0:n},y_{0:n})$ is in the exponential family, an online EM algorithm can also be developed; see Del Moral et al., (2010) for the discrete-time case.

6 Numerical Applications

All numerical experiments shown in this section were obtained on a standard PC, with code written in Julia. Unless otherwise stated, executed algorithms discretised the pathspace using the Euler–Maruyama scheme with $M=10$ imputed points between consecutive observations, the latter assumed to be separated by 1 time unit. Algorithmic costs scale linearly with $M$.

To specify the schedule for the scaling parameters {γk}\{\gamma_{k}\} in (5.1), we use the standard adaptive method termed ADAM, by Kingma and Ba, (2014). Assume that after nn steps, we have cn:=(logpθ^0:n(y0:n)logpθ^0:n1(y0:n1))c_{n}:=-(\nabla\log p_{\hat{\theta}_{0:n}}(y_{0:n})-\nabla\log p_{\hat{\theta}_{0:n-1}}(y_{0:n-1})). ADAM applies the following iterative procedure,

mn=mn1β1+(1β1)cn,vn=vn1β2+(1β2)cn2,\displaystyle m_{n}=m_{n-1}\beta_{1}+(1-\beta_{1})c_{n},\quad v_{n}=v_{n-1}\beta_{2}+(1-\beta_{2})c_{n}^{2},
m^n=mn/(1β1n),v^n=vn/(1β2n),\displaystyle\hat{m}_{n}=m_{n}/(1-\beta_{1}^{n}),\quad\hat{v}_{n}=v_{n}/(1-\beta_{2}^{n}),
θ^n+1=θ^nαm^n/(v^n+ϵ),\displaystyle\hat{\theta}_{n+1}=\hat{\theta}_{n}-\alpha\hat{m}_{n}/(\sqrt{\hat{v}_{n}}+\epsilon),

where (β1,β2,α,ϵ)(\beta_{1},\beta_{2},\alpha,\epsilon) are tuning parameters. Convergence properties of ADAM have been widely studied (Goodfellow et al.,, 2016; Kingma and Ba,, 2014; Reddi et al.,, 2019). Following Kingma and Ba, (2014), in all cases below we set (β1,β2,α,ϵ)=(0.9,0.999,0.001,108)(\beta_{1},\beta_{2},\alpha,\epsilon)=(0.9,0.999,0.001,10^{-8}). ADAM is nowadays a standard and very effective addition to the type of recursive inference algorithms we consider here, even more so for an increasing dimension of the unknown parameter vector. See the above references for more motivation and details.
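As an illustration of the above recursion, the following is a minimal Julia sketch of one ADAM step, written for a generic gradient increment c; it is a sketch, not the exact code used in our experiments.

    # Minimal ADAM sketch: state (m, v, n) plus tuning parameters (α, β1, β2, ϵ).
    mutable struct Adam
        α::Float64; β1::Float64; β2::Float64; ϵ::Float64
        m::Vector{Float64}; v::Vector{Float64}; n::Int
    end

    Adam(d; α=0.001, β1=0.9, β2=0.999, ϵ=1e-8) =
        Adam(α, β1, β2, ϵ, zeros(d), zeros(d), 0)

    # One ADAM step: returns the updated parameter vector θ given increment c.
    function step!(opt::Adam, θ::Vector{Float64}, c::Vector{Float64})
        opt.n += 1
        opt.m .= opt.β1 .* opt.m .+ (1 - opt.β1) .* c
        opt.v .= opt.β2 .* opt.v .+ (1 - opt.β2) .* c .^ 2
        mhat = opt.m ./ (1 - opt.β1^opt.n)             # bias-corrected first moment
        vhat = opt.v ./ (1 - opt.β2^opt.n)             # bias-corrected second moment
        return θ .- opt.α .* mhat ./ (sqrt.(vhat) .+ opt.ϵ)
    end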

6.1 Ornstein-Uhlenbeck SDE

We consider the model,

\displaystyle dX_{t}=\theta_{1}(\theta_{2}-X_{t})\,dt+\theta_{3}\,dW_{t}+dJ_{t},\quad X_{0}=0,
yi=xi+ϵi,i1,ϵii.i.d.𝒩(0,0.12),Δ=1,\displaystyle y_{i}=x_{i}+\epsilon_{i},\quad i\geq 1,\quad\epsilon_{i}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}\mathcal{N}(0,0.1^{2}),\quad\Delta=1, (6.1)

where Jt=i=1NtξiJ_{t}=\sum_{i=1}^{N_{t}}\xi_{i}, NtN_{t} is a Poisson process of intensity λ0\lambda\geq 0, and ξii.i.d.𝒰(ζ,ζ)\xi_{i}\overset{i.i.d.}{\sim}\mathcal{U}\left(-\zeta,\zeta\right), ζ>0\zeta>0.
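For reference, the following is a minimal Julia sketch (an illustration, not the exact experimental code) of Euler-Maruyama simulation of (6.1) with M imputed points per unit observation interval.

    # Simulate the jump O-U model (6.1): Euler-Maruyama for the diffusion part,
    # compound Poisson increments for the jumps, Gaussian observation noise.
    using Random, Distributions

    function simulate_ou_jump(θ1, θ2, θ3, λ, ζ, n; M=10, σy=0.1, rng=Random.default_rng())
        δ = 1.0 / M                                   # Euler step within each unit interval
        x = 0.0
        xs = zeros(n); ys = zeros(n)
        for i in 1:n
            for _ in 1:M
                x += θ1 * (θ2 - x) * δ + θ3 * sqrt(δ) * randn(rng)
                k = λ > 0 ? rand(rng, Poisson(λ * δ)) : 0   # number of jumps in the step
                k > 0 && (x += sum(rand(rng, Uniform(-ζ, ζ), k)))
            end
            xs[i] = x
            ys[i] = x + σy * randn(rng)               # noisy observation at integer times
        end
        return xs, ys
    end

    xs, ys = simulate_ou_jump(0.4, 0.0, 0.5, 0.0, 0.5, 1_000)   # e.g. Experiment 1 setting (λ=0)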

O–U: Experiment 1. Here, we choose λ=0\lambda=0, so the score function is analytically available; we also fix θ2=0\theta_{2}=0. We simulated datasets of size n=2,500,5,000,7,500,10,000n=2,500,5,000,7,500,10,000 under true remaining parameter values (θ1,θ3)=(0.4,0.5)(\theta_{1}^{\dagger},\theta_{3}^{\dagger})=(0.4,0.5). We executed Algorithm 2 to approximate the score function with N=50,100,150N=50,100,150 particles, for all of the above datasets. Fig. 6.1 summarises the results of R=50R=50 replications of the estimates of the bivariate score function, with the dashed lines showing the true values.

O–U: Experiment 2. We now present two scenarios: one still without jumps (i.e., λ=0\lambda=0); and one with jumps, where we fix λ=0.5\lambda=0.5, ζ=0.5\zeta=0.5. In both scenarios, the parameters to be estimated are θ=(θ1,θ2,θ3)\theta=(\theta_{1},\theta_{2},\theta_{3}). Beyond the above fixed values for λ\lambda, ζ\zeta, in both scenarios we generated n=20,000n=20,000 data points using the true parameter values θ=(θ1,θ2,θ3)=(0.2,0.0,0.2)\theta^{\dagger}=(\theta_{1}^{\dagger},\theta_{2}^{\dagger},\theta_{3}^{\dagger})=(0.2,0.0,0.2). We applied Algorithm 3 with N=100N=100 particles, and initial parameter values θ^0=(1.0,1.0,1.0)\hat{\theta}_{0}=(1.0,1.0,1.0) in both scenarios. We used Construct 1 for the scenario with jumps. Fig. 6.2 summarises the results.

O–U: Experiment 3. To compare the efficiency of Constructs 1 and 2 for the case of jump processes, as developed in Section 3.2, we applied Algorithm 2 – for each case – to approximate the score function (at the true values for (θ1,θ2,θ3)(\theta_{1},\theta_{2},\theta_{3})), with N=50,100,150,200N=50,100,150,200 particles, for n=10n=10 observations generated from the O–U process (6.1) with fixed λ=0.5\lambda=0.5, ζ=0.5\zeta=0.5, and true remaining parameters (θ1,θ2,θ3)=(0.3,0.0,0.2)(\theta_{1}^{\dagger},\theta_{2}^{\dagger},\theta_{3}^{\dagger})=(0.3,0.0,0.2). Fig. 6.3 shows boxplots summarising the results from R=50R=50 replications of the obtained estimates. We note that the performance of Construct 2 is, in this case, much better than that of Construct 1.

Figure 6.1: O–U: Experiment 1: Boxplots of estimated score functions of θ1\theta_{1} at θ1=0.4\theta_{1}=0.4, over R=50R=50 experiment replications, for the O–U model (6.1), without jumps (λ=0\lambda=0), fixed θ2=0\theta_{2}=0, and free parameters (θ1,θ3)(\theta_{1},\theta_{3}), with true values (θ1,θ3)=(0.4,0.5)(\theta_{1}^{\dagger},\theta_{3}^{\dagger})=(0.4,0.5) used for data generation. The left, middle and right boxplots in each panel correspond to values N=50,100,150N=50,100,150, respectively. The horizontal dashed lines show the correct values (33.1,24.7,154.2,48.7)(-33.1,24.7,-154.2,-48.7) for datasizes n=2,500,5,000,7,500,10,000n=2,500,5,000,7,500,10,000, respectively.
Figure 6.2: O–U: Experiment 2: Trajectories from execution of Algorithm 3 for the estimation of parameters (θ1,θ2,θ3)(\theta_{1},\theta_{2},\theta_{3}) of the O–U model (6.1). We used N=100N=100 particles and initial value θ^0=(1.0,1.0,1.0)\hat{\theta}_{0}=(1.0,1.0,1.0). Left Panel: results for O–U model with jumps (λ=0.5\lambda=0.5, ζ=0.5\zeta=0.5). Right Panel: Results for O–U model without jumps (λ=0\lambda=0). The horizontal dashed lines in the plots show the true parameter values (θ1,θ2,θ3)=(0.2,0.0,0.2)(\theta_{1}^{\dagger},\theta_{2}^{\dagger},\theta_{3}^{\dagger})=(0.2,0.0,0.2) used in both scenarios.
Figure 6.3: O–U: Experiment 3: Boxplots of estimated score function of θ1\theta_{1}, at θ1=0.3\theta_{1}=0.3, over R=50R=50 experiment replications, for O–U model (6.1), with fixed λ=0.5\lambda=0.5, ζ=0.5\zeta=0.5, and true parameters (θ1,θ2,θ3)=(0.3,0.0,0.2)(\theta_{1}^{\dagger},\theta_{2}^{\dagger},\theta_{3}^{\dagger})=(0.3,0.0,0.2), used to generate n=10n=10 observations. Boxplots on the right (left) of each panel correspond to Construct 1 (Construct 2).

6.2 Periodic Drift SDE

We consider the (highly) nonlinear model,

dXt=sin(Xtθ1)dt+θ2dWt,X0=0,\displaystyle dX_{t}=\sin\left(X_{t}-\theta_{1}\right)dt+\theta_{2}dW_{t},\quad X_{0}=0,
yi=xi+ϵi,i1,ϵii.i.d.𝒩(0,0.12),Δ=1,\displaystyle y_{i}=x_{i}+\epsilon_{i},\quad i\geq 1,\quad\epsilon_{i}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}\mathcal{N}(0,0.1^{2}),\quad\Delta=1, (6.2)

for θ1[0,2π)\theta_{1}\in[0,2\pi), θ2>0\theta_{2}>0. We used true values (θ1,θ2)=(π/4,0.9)(\theta_{1}^{\dagger},\theta_{2}^{\dagger})=(\pi/4,0.9), generated n=104n=10^{4} data points, and applied Algorithm 3 with N=100N=100 particles and initial values (θ^1,0,θ^2,0)=(0.1,2)(\hat{\theta}_{1,0},\hat{\theta}_{2,0})=(0.1,2). To illustrate the mesh-free performance of Algorithm 3, we consider two values, M=10M=10 and M=100M=100, and compare the obtained numerical results. Figure 6.4 shows the data (Left Panel) and the trajectories of the estimated parameter values (Right Panel). Note that the SDE in (6.2) makes infrequent transitions between a number of modes. Indeed, in this case, the results are stable when increasing MM from 1010 to 100100.

Figure 6.4: Left Panel: Data generated from model (6.2) for true values θ=(π/4,0.9)\theta^{\dagger}=(\pi/4,0.9). Right Panel: Trajectories obtained from Algorithm 3, with N=100N=100 and initial values (θ^1,0,θ^2,0)=(0.1,2)(\hat{\theta}_{1,0},\hat{\theta}_{2,0})=(0.1,2). Solid lines show the results for M=10M=10 and dotted lines for M=100M=100.

6.3 Heston Model

Consider the Heston model (Heston,, 1993), for price StS_{t} and volatility XtX_{t}, modelled as,

dSt\displaystyle dS_{t} =θ4Stdt+StXt1/2dBt,\displaystyle=\theta_{4}S_{t}dt+S_{t}X_{t}^{1/2}dB_{t},
dXt\displaystyle dX_{t} =θ1(θ2Xt)dt+θ3Xt1/2dWt,X0=θ2,\displaystyle=\theta_{1}(\theta_{2}-X_{t})dt+\theta_{3}X_{t}^{1/2}dW_{t},\quad X_{0}=\theta_{2},

where BtB_{t}, WtW_{t} are independent standard Brownian motions. Define the log-price Ut=log(St)U_{t}=\log(S_{t}), so that application of Itô’s lemma gives,

dUt\displaystyle dU_{t} =(θ4Xt2)dt+Xt1/2dBt,\displaystyle=(\theta_{4}-\tfrac{X_{t}}{2})dt+X_{t}^{1/2}dB_{t},
dXt\displaystyle dX_{t} =θ1(θ2Xt)dt+θ3Xt1/2dWt,X0=θ2.\displaystyle=\theta_{1}(\theta_{2}-X_{t})dt+\theta_{3}X_{t}^{1/2}dW_{t},\quad X_{0}=\theta_{2}. (6.3)

We make the standard assumption 2θ1θ2>θ322\theta_{1}\theta_{2}>\theta_{3}^{2}, so that the CIR process X={Xt}X=\{X_{t}\} will not hit 0; also, we have θ1,θ2,θ3,θ4>0\theta_{1},\theta_{2},\theta_{3},\theta_{4}>0. Process UtU_{t} is observed at discrete times, so that yi=Utiy_{i}=U_{t_{i}}, i1i\geq 1. Thus, we have,

[yi|yi1,{Xs;s[ti1,ti]}]\displaystyle\big{[}\,y_{i}\,\big{|}\,y_{i-1},\,\{X_{s};s\in[t_{i-1},t_{i}]\}\,\big{]} 𝒩(yi;μi,Σi),\displaystyle\sim\mathcal{N}(y_{i};\mu_{i},\Sigma_{i}),

where we have set,

μi=yi1+ti1ti(θ4Xs2)𝑑s,Σi\displaystyle\mu_{i}=y_{i-1}+\int_{t_{i-1}}^{t_{i}}(\theta_{4}-\tfrac{X_{s}}{2})ds,\quad\Sigma_{i} =ti1tiXs𝑑s.\displaystyle=\int_{t_{i-1}}^{t_{i}}X_{s}ds.
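In the time-discretised algorithm, these integrals are approximated over the M imputed volatility values. As a minimal illustration (assuming simple left-point Riemann sums, which is one possible choice and not necessarily the exact scheme used in our code), the Gaussian observation density given an imputed volatility path could be evaluated as follows.

    # Gaussian density of y_i given y_{i-1} and the imputed volatility path on
    # [t_{i-1}, t_i]; μ_i and Σ_i approximated by left-point Riemann sums (an
    # illustrative assumption).
    using Distributions

    function heston_obs_density(yi, yprev, Xpath, θ4, Δ)
        M = length(Xpath)                  # imputed values of X_s on [t_{i-1}, t_i)
        δ = Δ / M
        μ = yprev + sum(θ4 .- Xpath ./ 2) * δ
        Σ = sum(Xpath) * δ
        return pdf(Normal(μ, sqrt(Σ)), yi)
    end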

We chose true parameter value θ=(0.1,1.0,0.2,0.45)\theta^{\dagger}=(0.1,1.0,0.2,0.45), and generated n=104n=10^{4} observations, with Δ=1\Delta=1. We applied Algorithm 3 with θ^0=(0.005,0.1,0.4,0.3)\hat{\theta}_{0}=(0.005,0.1,0.4,0.3) and N=100N=100 particles. The trajectories of the estimated parameters are given in Figure 6.5.

Figure 6.5: Parameter estimate trajectories produced by Algorithm 3, for data generated from the Heston model (6.3). The algorithm used initial values θ^0=(0.005,0.1,0.4,0.3)\hat{\theta}_{0}=(0.005,0.1,0.4,0.3) and N=100N=100 particles. The horizontal dashed lines indicate the true parameter values θ=(0.1,1.0,0.2,0.45)\theta^{\dagger}=(0.1,1.0,0.2,0.45).

6.4 Sequential Model Selection – Real Data

We use our methodology to carry out sequential model selection, motivated by applications in Eraker et al., (2003); Johannes et al., (2009). Recall that BIC (Schwarz,, 1978) for model \mathcal{M}, with parameter vector θ\theta, and data y1:ny_{1:n} is given by,

BIC()\displaystyle BIC(\mathcal{M}) :=2θ^MLE(y1:n)+dim(θ)logn,\displaystyle:=-2\ell_{\hat{\theta}_{MLE}}(y_{1:n})+\mathrm{dim}(\theta)\log n,

where θ^MLE\hat{\theta}_{MLE} denotes the MLE, and ()\ell(\cdot) the log-likelihood. BIC and the (closely related) Bayes Factor are known to have good asymptotic properties, e.g. they are consistent model selection criteria, under the assumption of model-correctness, in the context of nested models, for specific classes of models (see e.g. Chib and Kuffner, (2016); Nishii, (1988); Sin and White, (1996), and Yonekura et al., (2018) for the case of discrete-time nested HMMs). One can plausibly conjecture that such criteria will also perform well for the type of continuous-time HMMs we consider here. See also Eguchi and Masuda, (2018) for a rigorous analysis of BIC for diffusion-type models. Given models a,b\mathcal{M}_{a},\mathcal{M}_{b}, with ab\mathcal{M}_{a}\subset\mathcal{M}_{b}, we use the difference,

\displaystyle BIC(\mathcal{M}_{ab}):=-2\big{(}\ell_{\hat{\theta}_{MLE,a}}(y_{1:n})-\ell_{\hat{\theta}_{MLE,b}}(y_{1:n})\big{)}+(\mathrm{dim}(\theta_{a})-\mathrm{dim}(\theta_{b}))\log n,

with the involved terms defined in the obvious manner, to choose between a\mathcal{M}_{a} and b\mathcal{M}_{b}.

We note that Eraker et al., (2003) use the Bayes Factor for model selection in the context of SDE processes; that work uses MCMC to estimate parameters, and is not sequential (or online). We stress that our methodology allows us to carry out inference for the parameters and the signal part of the model, while simultaneously allowing for model selection, all in an online fashion. Also, it is worth noting that Johannes et al., (2009) use the sequential likelihood ratio for model comparison, but such a quantity can overshoot, i.e. the likelihood ratio will tend to favour the larger model. Moreover, that work uses fixed calibrated parameters, so it does not relate to online parameter inference.

Remark 5.

We will use the running estimate of the parameter vector as a proxy for the MLE given the data taken under consideration so far. Similarly, we use the weights of the particle filter to obtain a running proxy of the log-likelihood evaluated at the MLE, thus, overall, an online approximation of BIC. Such an approach, even if not fully justified theoretically, can provide reasonable practical insights when performing model comparison in an online manner – particularly so, given that, to the best of our knowledge, there is no alternative for such an objective.
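As a minimal illustration of this running approximation, consider the Julia sketch below; here loglik_increment stands for the particle-filter estimate of the log-likelihood increment log p(y_n | y_{0:n-1}) at the current running parameter estimate (e.g. the log of the average unnormalised weight) – this is an assumption made for the purpose of the sketch.

    # Running BIC per model: accumulate the particle-filter log-likelihood
    # increments and penalise by the model dimension.
    mutable struct OnlineBIC
        loglik::Float64
        dim::Int
    end

    update!(b::OnlineBIC, loglik_increment::Float64) = (b.loglik += loglik_increment; b)

    bic(b::OnlineBIC, n::Int) = -2 * b.loglik + b.dim * log(n)

    # Running BIC difference between nested models M_a ⊂ M_b after n observations.
    bic_diff(ba::OnlineBIC, bb::OnlineBIC, n::Int) = bic(ba, n) - bic(bb, n)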

We consider the following family of nested SDE models,

\displaystyle dX_{t}=b_{\theta}^{(i)}(X_{t})\,dt+\theta_{4}\sqrt{X_{t}}\,dW_{t},

where,

1:\displaystyle\mathcal{M}_{1}: bθ(1)(Xt)=θ0+θ1Xt,\displaystyle\quad b_{\theta}^{(1)}(X_{t})=\theta_{0}+\theta_{1}X_{t},
2:\displaystyle\mathcal{M}_{2}: bθ(2)(Xt)=θ0+θ1Xt+θ22Xt2,\displaystyle\quad b_{\theta}^{(2)}(X_{t})=\theta_{0}+\theta_{1}X_{t}+\theta_{2}^{2}X_{t}^{2},
3:\displaystyle\mathcal{M}_{3}: bθ(3)(Xt)=θ0+θ1Xt+θ22Xt2+θ3Xt.\displaystyle\quad b_{\theta}^{(3)}(X_{t})=\theta_{0}+\theta_{1}X_{t}+\theta_{2}^{2}X_{t}^{2}+\frac{\theta_{3}}{X_{t}}.

Such models have been used for short-term interest rates; see Jones, (2003); Durham, (2003); Aït-Sahalia, (1996); Bali and Wu, (2006) and references therein for more details. Motivated by Dellaportas et al., (2006); Stanton, (1997), we applied our methodology to daily 3-month Treasury Bill rates, from 2 January 1970 to 29 December 2000, providing n=7,739n=7,739 observations. The data can be obtained from the Federal Reserve Bank of St. Louis, at the webpage https://fred.stlouisfed.org/series/TB3MS. The dataset is shown in Fig. 6.6.

Figure 6.6: Daily values of 3-Month Treasury Bill rates from 2 Jan 1970 to 29 Dec 2000.

Fig. 6.7 shows the BIC differences, obtained online, for each of the three pairs of models. The results suggest that 2\mathcal{M}_{2} does not provide a good fit to the data – relative to 1\mathcal{M}_{1}, 3\mathcal{M}_{3} – for almost the entire period under consideration. In terms of 1\mathcal{M}_{1} against 3\mathcal{M}_{3}, there is evidence that once the data from around 1993 onwards are taken into consideration, 3\mathcal{M}_{3} is preferred to 1\mathcal{M}_{1}. In general, one can claim that models with non-linear drift should be taken into consideration when fitting daily 3-month Treasury Bill rates, without strong evidence in favour of or against linearity. This non-definite conclusion is in partial agreement with the empirical studies in Chapman and Pearson, (2000); Durham, (2003); Dellaportas et al., (2006).

Figure 6.7: Online estimation of BIC differences. The solid line stands for BIC(32)BIC(\mathcal{M}_{32}), the dashed line for BIC(31)BIC(\mathcal{M}_{31}), and the dotted line for BIC(21)BIC(\mathcal{M}_{21}).

7 Conclusions and Future Directions

We have introduced an online particle smoothing methodology for discretely observed (jump) diffusions with intractable transition densities. Our approach overcomes such intractability by formulating the problem on pathspace, thus delivering an algorithm that – besides regularity conditions – requires only the weak invertibility Assumption 2. We have thus covered a rich family of SDE models, whereas related literature imposes quite strong restrictions. Combining our online smoothing algorithm with a Robbins-Monro-type approach of Recursive Maximum-Likelihood, we have set up an online stochastic gradient-ascent procedure for the likelihood function of the SDEs under consideration. The algorithm delivers a wealth of interesting output that can yield useful insights in statistical applications. The numerical examples indicate that the methodology performs promisingly. Our framework opens up a number of routes for insights and future research, including the ones described below.

  • (i)

    In the case of SDEs with jumps, we have focused on jump dynamics driven by compound Poisson processes. There is great scope for generalisation here: one can extend the algorithm to different classes of jump processes, also characterised by more complex dependencies between the jumps and the paths of the solution of the SDE, X={Xt}X=\{X_{t}\}. Extensions to time-inhomogeneous cases with b(v)=b(t,v)b(v)=b(t,v), σ(v)=σ(t,v)\sigma(v)=\sigma(t,v), are immediate; we have chosen the time-homogeneous setting only for simplicity of presentation.

  • (ii)

    Since the seminal work of Delyon and Hu, (2006), more ‘tuned’ auxiliary bridge processes have appeared in the literature, see e.g. the works of Schauer et al., (2017); van der Meulen and Schauer, (2017). Indicatively, Schauer et al., (2017) consider bridges of the form (one of the several options put forward in that work),

    dX~t={b(X~t)+Σ(X~t)Σ1(x)xX~tTt}dt+σ(X~t)dWt.d\tilde{X}_{t}=\Big{\{}b(\tilde{X}_{t})+\Sigma(\tilde{X}_{t})\Sigma^{-1}(x^{\prime})\frac{x^{\prime}-\tilde{X}_{t}}{T-t}\Big{\}}dt+\sigma(\tilde{X}_{t})dW_{t}. (7.1)

    Auxiliary bridge processes that are closer in dynamics to the diffusion bridges of the given signal are expected to reduce the variability of the Monte-Carlo algorithm, thus progress along the above direction can be immediately incorporated into our methodology to improve its performance. For instance, as noted in Schauer et al., (2017), use of (7.1) will give a Radon-Nikodym derivative where stochastic integrals cancel out. Such a setting is known to considerably reduce the variability of Monte-Carlo methods, see e.g. the numerical examples in Durham and Gallant, (2002) and the discussion in (Papaspiliopoulos and Roberts,, 2012, Section 4). A small simulation sketch of (7.1) is given after this list.

  • (iii)

    The exact specification of the recursion used for the online estimation of unknown parameters is in itself a problem of intensive research in the field of stochastic optimisation. One would ideally aim for the recursive procedure to provide parameter estimates which are as close to the unknown parameter as the data (considered thus far) permit. In our case, we have used a fairly ‘vanilla’ recursion, perhaps with the exception of the ADAM variation. E.g., recent works in the Machine Learning community have pointed to the use of ‘velocity’ components in the recursion to speed up convergence, see, e.g., Sutskever et al., (2013); Yuan et al., (2016).

  • (iv)

    We mentioned in the main text several modifications that can improve algorithmic performance: dynamic resampling, stratified resampling, non-blind proposals in the filtering steps, choice of auxiliary processes. Parallelisation and use of HPC are obvious additions to this list.

  • (v)

    We stress that the algorithm involves a filtering step, and a step that approximates the values of the instrumental function. These two procedures should be thought of separately. One reason for including two approaches in the case of jump diffusions (Constructs 1 and 2) is indeed to highlight this point. The two Constructs are identical in terms of the filtering part. Construct 1 incorporates in 𝐱\mathbf{x}^{\prime} the location of the path at all times of jumps; thus, when the algorithm ‘mixes’ all pairs of {xk1(j)}\{x_{k-1}^{(j)}\}, {𝐱k(i)}\{\mathbf{x}_{k}^{(i)}\}, at the update of the instrumental function (see Step (iv) of Algorithm 2), many of such pairs can be incompatible (simply consider any such pair with iji\neq j, when {𝐱k(i)}\{\mathbf{x}_{k}^{(i)}\} has in fact been generated conditionally on {xk1(i)}\{x_{k-1}^{(i)}\}). Such an effect is even stronger in the case of the standard algorithm applied in the motivating example in the Introduction, and partially explains the inefficiency of that algorithm. In contrast, in Construct 2, 𝐱\mathbf{x}^{\prime} contains less information about the underlying paths, thereby improving the compatibility of pairs selected from particle populations {xk1(j)}\{x_{k-1}^{(j)}\}, {𝐱k(i)}\{\mathbf{x}_{k}^{(i)}\}, thus – not surprisingly – Construct 2 seems more effective than Construct 1.

  • (vi)

    The experiments included in Section 6 of the paper aim to highlight the qualitative aspects of the introduced approach. The key property of robustness to mesh-refinement underlies all numerical applications, and is explicitly illustrated in the experiment in Section 6.2 and within Figure 6.4. In all cases, the code was written in Julia, was not parallelised or optimised and was run on a PC, with execution times that were quite small due to the online nature of the algorithm (the most expensive experiment took about 30 minutes). We have avoided a comparison with potential alternative approaches as we believe this would obscure the main messages we have tried to convey within the numerical results section. Such a detailed numerical study can be the subject of future work.
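Referring to point (ii) above, the following is a minimal Julia sketch (not taken from the paper) of an Euler discretisation of the guided bridge (7.1) in the scalar case; the drift b, the diffusion coefficient σ and the endpoint are user-supplied, and pinning the final point is a choice made here for illustration.

# Euler discretisation of the guided bridge (7.1), scalar case: a path on [0,T]
# started at x0 and guided towards the endpoint xT (illustrative sketch only).
using Random

function guided_bridge(b, σ, x0, xT, T; M=100, rng=Random.default_rng())
    δ = T / M
    path = zeros(M + 1); path[1] = x0
    ΣxT = σ(xT)^2                                  # Σ(x′) in (7.1), scalar case
    for k in 1:M
        t = (k - 1) * δ
        x = path[k]
        drift = b(x) + (σ(x)^2 / ΣxT) * (xT - x) / (T - t)
        path[k + 1] = x + drift * δ + σ(x) * sqrt(δ) * randn(rng)
    end
    path[end] = xT                                 # pin the endpoint of the bridge
    return path
end

# Example: an O-U-type bridge from 0 to 1 over one observation interval.
path = guided_bridge(x -> 0.4 * (0.0 - x), x -> 0.5, 0.0, 1.0, 1.0; M=10)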

Acknowledgments

We are grateful to the AE and the referees for several useful comments that helped us to improve the content of the paper.

References

  • Aït-Sahalia, (1996) Aït-Sahalia, Y. (1996). Testing continuous-time models of the spot interest rate. The Review of Financial Studies, 9(2):385–426.
  • Aït-Sahalia, (2002) Aït-Sahalia, Y. (2002). Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach. Econometrica, 70(1):223–262.
  • Aït-Sahalia, (2008) Aït-Sahalia, Y. (2008). Closed-form likelihood expansions for multivariate diffusions. The Annals of Statistics, 36(2):906–937.
  • Aït-Sahalia et al., (2005) Aït-Sahalia, Y., Mykland, P. A., and Zhang, L. (2005). How often to sample a continuous-time process in the presence of market microstructure noise. The Review of Financial Studies, 18(2):351–416.
  • Aït-Sahalia and Yu, (2008) Aït-Sahalia, Y. and Yu, J. (2008). High frequency market microstructure noise estimates and liquidity measures. Technical report, National Bureau of Economic Research.
  • Andrieu et al., (2010) Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov Chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342.
  • Azuma, (1967) Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357–367.
  • Bali and Wu, (2006) Bali, T. G. and Wu, L. (2006). A comprehensive analysis of the short-term interest-rate dynamics. Journal of Banking & Finance, 30(4):1269–1290.
  • Beskos et al., (2006) Beskos, A., Papaspiliopoulos, O., Roberts, G. O., and Fearnhead, P. (2006). Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):333–382.
  • Cappe, (2009) Cappe, O. (2009). Online sequential Monte Carlo EM algorithm. In 2009 IEEE/SP 15th Workshop on Statistical Signal Processing.
  • Cappé et al., (2005) Cappé, O., Moulines, E., and Ryden, T. (2005). Inference in Hidden Markov Models. Springer Science & Business Media.
  • Carpenter et al., (1999) Carpenter, J., Clifford, P., and Fearnhead, P. (1999). Improved particle filter for nonlinear problems. IEE Proceedings-Radar, Sonar and Navigation, 146(1):2–7.
  • Chapman and Pearson, (2000) Chapman, D. A. and Pearson, N. D. (2000). Is the short rate drift actually nonlinear? The Journal of Finance, 55(1):355–388.
  • Chib and Kuffner, (2016) Chib, S. and Kuffner, T. A. (2016). Bayes factor consistency. arXiv preprint, arXiv:1607.00292.
  • Chib et al., (2004) Chib, S., Pitt, M. K., and Shephard, N. (2004). Likelihood based inference for diffusion driven models.
  • Chopin et al., (2013) Chopin, N., Jacob, P. E., and Papaspiliopoulos, O. (2013). SMC2: an efficient algorithm for sequential analysis of state space models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):397–426.
  • Dahlhaus and Neddermeyer, (2010) Dahlhaus, R. and Neddermeyer, J. C. (2010). Particle filter-based on-line estimation of spot volatility with nonlinear market microstructure noise models. arXiv preprint arXiv:1006.1860.
  • Del Moral et al., (2010) Del Moral, P., Doucet, A., and Singh, S. (2010). Forward smoothing using sequential Monte Carlo. arXiv preprint arXiv:1012.5390.
  • Dellaportas et al., (2006) Dellaportas, P., Friel, N., and Roberts, G. O. (2006). Bayesian model selection for partially observed diffusion models. Biometrika, 93(4):809–825.
  • Delyon and Hu, (2006) Delyon, B. and Hu, Y. (2006). Simulation of conditioned diffusion and application to parameter estimation. Stochastic Processes and their Applications, 116(11):1660–1675.
  • Douc et al., (2011) Douc, R., Garivier, A., Moulines, E., and Olsson, J. (2011). Sequential Monte Carlo smoothing for general state space hidden Markov models. The Annals of Applied Probability, 21(6):2109–2145.
  • Douc et al., (2014) Douc, R., Moulines, E., and Stoffer, D. (2014). Nonlinear time series: Theory, methods and applications with R examples. Chapman and Hall/CRC.
  • Doucet et al., (2000) Doucet, A., Godsill, S., and Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208.
  • Durham, (2003) Durham, G. B. (2003). Likelihood-based specification analysis of continuous-time models of the short-term interest rate. Journal of Financial Economics, 70(3):463–487.
  • Durham and Gallant, (2002) Durham, G. B. and Gallant, A. R. (2002). Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes. Journal of Business & Economic Statistics, 20(3):297–338.
  • Eguchi and Masuda, (2018) Eguchi, S. and Masuda, H. (2018). Schwarz type model comparison for LAQ models. Bernoulli, 24(3):2278–2327.
  • Eraker et al., (2003) Eraker, B., Johannes, M., and Polson, N. (2003). The impact of jumps in volatility and returns. The Journal of Finance, 58(3):1269–1300.
  • Fearnhead et al., (2008) Fearnhead, P., Papaspiliopoulos, O., and Roberts, G. O. (2008). Particle filters for partially observed diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):755–777.
  • Gloaguen et al., (2018) Gloaguen, P., Étienne, M.-P., and Le Corff, S. (2018). Online sequential Monte Carlo smoother for partially observed diffusion processes. EURASIP Journal on Advances in Signal Processing, 2018(1):9.
  • Golightly and Wilkinson, (2008) Golightly, A. and Wilkinson, D. J. (2008). Bayesian inference for nonlinear multivariate diffusion models observed with error. Computational Statistics & Data Analysis, 52(3):1674–1693.
  • Gonçalves and Roberts, (2014) Gonçalves, F. B. and Roberts, G. O. (2014). Exact simulation problems for jump-diffusions. Methodology and Computing in Applied Probability, 16(4):907–930.
  • Goodfellow et al., (2016) Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT press.
  • Gordon et al., (1993) Gordon, N. J., Salmond, D. J., and Smith, A. F. (1993). Novel approach to nonlinear/non-gaussian bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107–113. IET.
  • Hansen and Scheinkman, (1993) Hansen, L. P. and Scheinkman, J. A. (1993). Back to the future: Generating moment implications for continuous-time Markov processes. Technical report, National Bureau of Economic Research.
  • Hansen and Lunde, (2006) Hansen, P. R. and Lunde, A. (2006). Realized variance and market microstructure noise. Journal of Business & Economic Statistics, 24(2):127–161.
  • Heston, (1993) Heston, S. L. (1993). A closed-form solution for options with stochastic volatility with applications to bond and currency options. The Review of Financial Studies, 6(2):327–343.
  • Johannes et al., (2009) Johannes, M. S., Polson, N. G., and Stroud, J. R. (2009). Optimal filtering of jump diffusions: Extracting latent states from asset prices. The Review of Financial Studies, 22(7):2759–2799.
  • Jones, (2003) Jones, C. S. (2003). Nonlinear mean reversion in the short-term interest rate. The Review of Financial Studies, 16(3):793–843.
  • Kalogeropoulos et al., (2010) Kalogeropoulos, K., Roberts, G. O., and Dellaportas, P. (2010). Inference for stochastic volatility models using time change transformations. The Annals of Statistics, 38(2):784–807.
  • Kantas et al., (2015) Kantas, N., Doucet, A., Singh, S. S., Maciejowski, J., and Chopin, N. (2015). On particle methods for parameter estimation in state-space models. Statistical Science, 30(3):328–351.
  • Kessler, (1997) Kessler, M. (1997). Estimation of an ergodic diffusion from discrete observations. Scandinavian Journal of Statistics, 24(2):211–229.
  • Kessler et al., (2012) Kessler, M., Lindner, A., and Sorensen, M. (2012). Statistical methods for stochastic differential equations. Chapman and Hall/CRC.
  • Kessler and Sørensen, (1999) Kessler, M. and Sørensen, M. (1999). Estimating equations based on eigenfunctions for a discretely observed diffusion process. Bernoulli, 5(2):299–314.
  • Kingma and Ba, (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kitagawa, (1987) Kitagawa, G. (1987). Non-gaussian state space modeling of nonstationary time series. Journal of the American Statistical Association, 82(400):1032–1041.
  • Kitagawa, (1996) Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1):1–25.
  • Kloeden and Platen, (2013) Kloeden, P. E. and Platen, E. (2013). Numerical solution of stochastic differential equations, volume 23. Springer Science & Business Media.
  • LeGland and Mevel, (1997) LeGland, F. and Mevel, L. (1997). Recursive estimation in hidden Markov models. In Decision and Control, 1997, Proceedings of the 36th IEEE Conference, volume 4, pages 3468–3473. IEEE.
  • Nishii, (1988) Nishii, R. (1988). Maximum likelihood principle and model selection when the true model is unspecified. Journal of Multivariate Analysis, 27(2):392–403.
  • Øksendal and Sulem, (2007) Øksendal, B. and Sulem, A. (2007). Applied Stochastic Control of Jump Diffusions. Springer Science & Business Media.
  • Olsson and Westerborn, (2017) Olsson, J. and Westerborn, J. (2017). Efficient particle-based online smoothing in general hidden Markov models: the PaRIS algorithm. Bernoulli, 23(3):1951–1996.
  • Papaspiliopoulos and Roberts, (2012) Papaspiliopoulos, O. and Roberts, G. (2012). Importance sampling techniques for estimation of diffusion models. Statistical methods for stochastic differential equations, 124:311–340.
  • Papaspiliopoulos et al., (2013) Papaspiliopoulos, O., Roberts, G. O., and Stramer, O. (2013). Data augmentation for diffusions. Journal of Computational and Graphical Statistics, 22(3):665–688.
  • Poyiadjis et al., (2011) Poyiadjis, G., Doucet, A., and Singh, S. S. (2011). Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika, 98(1):65–80.
  • Reddi et al., (2019) Reddi, S. J., Kale, S., and Kumar, S. (2019). On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237.
  • Roberts and Stramer, (2001) Roberts, G. O. and Stramer, O. (2001). On inference for partially observed nonlinear diffusion models using the Metropolis–Hastings algorithm. Biometrika, 88(3):603–621.
  • Särkkä and Sottinen, (2008) Särkkä, S. and Sottinen, T. (2008). Application of Girsanov theorem to particle filtering of discretely observed continuous-time non-linear systems. Bayesian Analysis, 3(3):555–584.
  • Schauer et al., (2017) Schauer, M., Van Der Meulen, F., and Van Zanten, H. (2017). Guided proposals for simulating multi-dimensional diffusion bridges. Bernoulli, 23(4A):2917–2950.
  • Schwarz, (1978) Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2):461–464.
  • Sin and White, (1996) Sin, C.-Y. and White, H. (1996). Information criteria for selecting possibly misspecified parametric models. Journal of Econometrics, 71(1-2):207–225.
  • Stanton, (1997) Stanton, R. (1997). A nonparametric model of term structure dynamics and the market price of interest rate risk. The Journal of Finance, 52(5):1973–2002.
  • Ströjby and Olsson, (2009) Ströjby, J. and Olsson, J. (2009). Efficient particle-based likelihood estimation in partially observed diffusion processes. IFAC Proceedings Volumes, 42(10):1364–1369.
  • Sutskever et al., (2013) Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in Deep Learning. In International Conference on Machine Learning, pages 1139–1147.
  • Tadic and Doucet, (2018) Tadic, V. Z. and Doucet, A. (2018). Asymptotic properties of recursive maximum likelihood estimation in non-linear state-space models. arXiv preprint arXiv:1806.09571.
  • van der Meulen and Schauer, (2017) van der Meulen, F. and Schauer, M. (2017). Bayesian estimation of discretely observed multi-dimensional diffusion processes using guided proposals. Electronic Journal of Statistics, 11(1):2358–2396.
  • Wagner, (1989) Wagner, W. (1989). Unbiased Monte Carlo estimators for functionals of weak solutions of stochastic differential equations. Stochastics: An International Journal of Probability and Stochastic Processes, 28(1):1–20.
  • Yonekura et al., (2018) Yonekura, S., Beskos, A., and Singh, S. (2018). Asymptotic analysis of model selection criteria for general hidden Markov models. arXiv preprint arXiv:1811.11834.
  • Yuan et al., (2016) Yuan, K., Ying, B., and Sayed, A. H. (2016). On the influence of momentum acceleration on online learning. The Journal of Machine Learning Research, 17(1):6602–6667.

Appendix A Proof of Proposition 1

Proof.

We have the integral,

Tθ,n(𝐱n)pθ(d𝐱n|y0:n)=Sθ(𝐱0:n)pθ(d𝐱0:n1|y0:n1,𝐱n)pθ(d𝐱n|y0:n).\displaystyle\int T_{\theta,n}(\mathbf{x}_{n})\,p_{\theta}(d\mathbf{x}_{n}|y_{0:n})=\int S_{\theta}(\mathbf{x}_{0:n})\,p_{\theta}(d\mathbf{x}_{0:n-1}|y_{0:n-1},\mathbf{x}_{n})\otimes p_{\theta}(d\mathbf{x}_{n}|y_{0:n}). (A.1)

Also, simple calculations give,

p_{\theta}(d\mathbf{x}_{0:n-1}|y_{0:n-1},\mathbf{x}_{n})\otimes p_{\theta}(d\mathbf{x}_{n}|y_{0:n})\equiv p_{\theta}(d\mathbf{x}_{0:n-1}|y_{0:n},\mathbf{x}_{n})\otimes p_{\theta}(d\mathbf{x}_{n}|y_{0:n})=p_{\theta}(d\mathbf{x}_{0:n}|y_{0:n}).

Using this expression in the integral on the right side of (A.1) completes the proof. ∎

Appendix B Proof of Proposition 2

Proof.

Simply note that,

Tθ,n1(𝐱n1)\displaystyle\int T_{\theta,n-1}(\mathbf{x}_{n-1})\, pθ(d𝐱n1|y0:n1,𝐱n)=\displaystyle p_{\theta}(d\mathbf{x}_{n-1}|y_{0:n-1},\mathbf{x}_{n})=
=Sθ,n1(𝐱0:n1)pθ(d𝐱0:n2|y0:n2,𝐱n1)pθ(d𝐱n1|y0:n1,𝐱n).\displaystyle=\int S_{\theta,n-1}(\mathbf{x}_{0:n-1})\,p_{\theta}(d\mathbf{x}_{0:n-2}|y_{0:n-2},\mathbf{x}_{n-1})\,p_{\theta}(d\mathbf{x}_{n-1}|y_{0:n-1},\mathbf{x}_{n}).

Then, we have that,

pθ\displaystyle p_{\theta} (d𝐱0:n2|y0:n2,𝐱n1)pθ(d𝐱n1|y0:n1,𝐱n)\displaystyle(d\mathbf{x}_{0:n-2}|y_{0:n-2},\mathbf{x}_{n-1})\,p_{\theta}(d\mathbf{x}_{n-1}|y_{0:n-1},\mathbf{x}_{n})
pθ(d𝐱0:n2|𝐱n1,y0:n1,𝐱n)pθ(d𝐱n1|y0:n1,𝐱n)\displaystyle\qquad\equiv p_{\theta}(d\mathbf{x}_{0:n-2}|\mathbf{x}_{n-1},y_{0:n-1},\mathbf{x}_{n})\,p_{\theta}(d\mathbf{x}_{n-1}|y_{0:n-1},\mathbf{x}_{n})
pθ(d𝐱0:n1|y0:n1,𝐱n).\displaystyle\qquad\equiv p_{\theta}(d\mathbf{x}_{0:n-1}|y_{0:n-1},\mathbf{x}_{n}).

Replacing the probability measure on the left side of the above equality with its equivalent on the right side, and using the latter in the integral above, completes the proof of the first equation in the statement of the proposition. The second equation follows from a trivial application of Bayes’ rule. ∎

Appendix C Proof of Proposition 3

Proof.

Part (i) follows via the same arguments as in Olsson and Westerborn, (2017, Corollary 2), based on Azuma, (1967) and Douc et al., (2011, Lemma 4). For part (ii), one can proceed as follows. Given n0n\geq 0, we define the event AN(1/j):={|𝒮θ,n𝒮^θ,nN|>1j}A_{N}(1/j):=\{|\mathcal{S}_{\theta,n}-\hat{\mathcal{S}}_{\theta,n}^{N}|>\frac{1}{j}\}, where we added superscript NN to stress the dependency of the estimate on the number of particles. Then,

Prob[limN𝒮^θ,nN=𝒮θ,n]=1Prob\displaystyle\mathrm{Prob}\,\big{[}\,\lim_{N\rightarrow\infty}\hat{\mathcal{S}}_{\theta,n}^{N}=\mathcal{S}_{\theta,n}\,\big{]}=1-\mathrm{Prob} [j=1lim supNAN(1/j)]\displaystyle\,\big{[}\,\cup_{j=1}^{\infty}\limsup_{N\rightarrow\infty}A_{N}(1/j)\,\big{]}
1j=1Prob[lim supNAN(1/j)].\displaystyle\geq 1-\sum_{j=1}^{\infty}\mathrm{Prob}\,\big{[}\,\limsup_{N\rightarrow\infty}A_{N}(1/j)\,\big{]}. (C.1)

From (i), we get Prob[AN(1/j)]bnecnN(1j)2\mathrm{Prob}\,[\,A_{N}(1/j)\,]\leq b_{n}e^{-c_{n}N\left(\frac{1}{j}\right)^{2}}, so N=1Prob[AN(1/j)]<\sum_{N=1}^{\infty}\mathrm{Prob}\,[\,A_{N}(1/j)\,]<\infty. The Borel–Cantelli lemma gives Prob[lim supNAN(1/j)]=0\mathrm{Prob}\,[\,\limsup_{N\rightarrow\infty}A_{N}(1/j)\,]=0, therefore the result follows from (C.1). ∎