
Online Smoothing for Diffusion Processes Observed with Noise

Shouto Yonekura  
Department of Statistical Science, University College London,
Alan Turing Institute,
and
Alexandros Beskos
Department of Statistical Science, University College London,
Alan Turing Institute
SY acknowledges funding from the Alan Turing Institute under grant number TU/C/000013. AB acknowledges funding via a Leverhulme Trust Prize.
Abstract

We introduce a methodology for online estimation of smoothing expectations for a class of additive functionals, in the context of a rich family of diffusion processes (that may include jumps) – observed at discrete-time instances. We overcome the unavailability of the transition density of the underlying SDE by working on the augmented pathspace. The new method can be applied, for instance, to carry out online parameter inference for the designated class of models. Algorithms defined on the infinite-dimensional pathspace have been developed in recent years mainly in the context of MCMC techniques. There, the main benefit is the achievement of mesh-free mixing times for the practical time-discretised algorithm used on a PC. Our own methodology sets up the framework for infinite-dimensional online filtering – an important positive practical consequence is the construction of estimates with variance that does not increase with decreasing mesh-size. Besides regularity conditions, our method is, in principle, applicable under the weak assumption – relative to the restrictive conditions often required in the MCMC or filtering literature of methods defined on pathspace – that the SDE covariance matrix is invertible.


Keywords: Jump Diffusion; Data Augmentation; Forward-Only Smoothing; Sequential Monte Carlo; Online Parameter Estimation.

1 Introduction

Research in Hidden Markov Models (HMMs) has – thus far – provided effective online algorithms for the estimation of expectations of the smoothing distribution for the case of a class of additive functionals of the underlying signal. Such methods necessitate knowledge of the transition density of the Markovian part of the model between observation times. We carry out a related exploration for the (common in applications) case when the signal corresponds to a diffusion process, thus we are faced with the challenge that such transition densities are typically unavailable. Standard data augmentation schemes that work with the multivariate density of a large enough number of imputed points of the continuous-time signal will lead to ineffective algorithms. The latter will have the abnormal characteristic that – for given Monte-Carlo iterates – the variability of the produced estimates will increase rapidly as the resolution of the imputation becomes finer. One of the ideas underpinning the work in this paper is that development of effective algorithms instead requires respecting the structural properties of the diffusion process, thus we build up imputation schemes on the infinite-dimensional diffusion pathspace itself. As a consequence, the time-discretised algorithm used in practice on a PC will be stable under mesh-refinement.

We consider continuous-time jump-diffusion models observed at discrete-time instances. The $d_{x}$-dimensional process, $X=\{X_{t};t\geq 0\}$, $d_{x}\geq 1$, is defined via the following time-homogeneous stochastic differential equation (SDE), with $X_{t-}:=\lim_{s\uparrow t}X_{s}$,

dX_{t}=b(X_{t-})dt+\sigma(X_{t-})dW_{t}+dJ_{t},\quad X_{0}=x_{0}\in\mathbb{R}^{d_{x}},\quad t\geq 0.   (1.1)

The solution $X$ is driven by the $d_{w}$-dimensional Brownian motion, $W=\{W_{t};t\geq 0\}$, $d_{w}\geq 1$, and the compound Poisson process, $J=\{J_{t};t\geq 0\}$. The SDE involves a drift function $b=b_{\theta}:\mathbb{R}^{d_{x}}\mapsto\mathbb{R}^{d_{x}}$ and coefficient matrix $\sigma=\sigma_{\theta}:\mathbb{R}^{d_{x}}\mapsto\mathbb{R}^{d_{x}\times d_{w}}$, for parameter $\theta\in\mathbb{R}^{p}$, $p\geq 1$. Let $\{N_{t};t\geq 0\}$ be a Poisson process with intensity function $\lambda_{\theta}(\cdot)$, and $\{\xi_{k}\}_{k\geq 1}$ an i.i.d. sequence of random variables with Lebesgue density $h_{\theta}(\cdot)$; the càdlàg process $J$ is determined as $J_{t}=J_{t,\theta}=\sum_{i=1}^{N_{t}}\xi_{i}$. We work under standard assumptions (e.g. linear growth, Lipschitz continuity for $b$, $\sigma$) that guarantee a unique global solution of (1.1), in a weak or strong sense, see e.g. Øksendal and Sulem, (2007).

SDE (1.1) is observed with noise at discrete-time instances $0=t_{0}<t_{1}<t_{2}<\cdots<t_{n}$, $n\geq 1$. Without loss of generality, we assume equidistant observation times, with $\Delta:=t_{1}-t_{0}$. We consider data $Y_{t_{0}},\ldots,Y_{t_{n}}$, and for simplicity we set,

x_{i}:=X_{t_{i}},\quad y_{i}=Y_{t_{i}},\quad 0\leq i\leq n.

Let $\mathcal{F}_{i}$ be the filtration generated by $X_{s}$ for $s\in[t_{i-1},t_{i}]$, $0<i\leq n$. We assume,

\big{[}\,Y_{t_{i}}\,\big{|}\,\{Y_{t_{j}};j<i\},\,\{X_{s};s\in[0,t_{i}]\}\,\big{]}\sim g_{\theta}\big{(}dY_{t_{i}}\,\big{|}\,Y_{t_{i-1}},\mathcal{F}_{i}\big{)},\quad 0\leq i\leq n,   (1.2)

for conditional distribution $g_{\theta}(\cdot|Y_{t_{i-1}},\mathcal{F}_{i})$ on $\mathbb{R}^{d_{y}}$, $d_{y}\geq 1$, under the convention $Y_{t_{-1}}=y_{-1}=\emptyset$. We write,

[\,x_{i}\,|\,x_{i-1}\,]\sim f_{\theta}(dx_{i}|x_{i-1}),   (1.3)

where $f_{\theta}(dx_{i}|x_{i-1})$ is the transition distribution of the driving SDE process (1.1). We consider the density functions of $g_{\theta}(dy_{i}|y_{i-1},\mathcal{F}_{i})$ and $f_{\theta}(dx_{i}|x_{i-1})$, and – with some abuse of notation – we write $g_{\theta}(dy_{i}|y_{i-1},\mathcal{F}_{i})=g_{\theta}(y_{i}|y_{i-1},\mathcal{F}_{i})dy_{i}$, $f_{\theta}(dx_{i}|x_{i-1})=f_{\theta}(x_{i}|x_{i-1})dx_{i}$, where $dy_{i}$, $dx_{i}$ denote Lebesgue measures. Our work develops under the following regime.

Assumption 1.

The transition density $f_{\theta}(x^{\prime}|x)$ is intractable; the density $g_{\theta}(y^{\prime}|y,\mathcal{F})$ is analytically available – for appropriate $x^{\prime}$, $x$, $y^{\prime}$, $y$, $\mathcal{F}$ consistent with the preceding definitions.

The intractability of the transition density $f_{\theta}(\cdot|\cdot)$ will pose challenges for the main problems this paper aims to address.
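While $f_{\theta}(\cdot|\cdot)$ is intractable, forward simulation of (1.1) is straightforward under time-discretisation (a point we return to in Section 2). Below is a minimal Julia sketch, assuming a scalar process, a constant jump intensity and a user-supplied jump-size sampler; the function name `simulate_jump_diffusion` and its interface are ours, for illustration only.

```julia
using Random

# Euler-Maruyama simulation of a scalar jump diffusion
#   dX_t = b(X_t)dt + sigma(X_t)dW_t + dJ_t,
# with J a compound Poisson process of constant intensity `lambda`
# and jump sizes drawn by `jumpsize()`.  Illustrative only: forward
# sampling is easy, while the transition density remains intractable.
function simulate_jump_diffusion(b, sigma, x0, T; M=100, lambda=0.0,
                                 jumpsize=() -> 0.0)
    dt = T / M
    x  = x0
    path = Float64[x]
    # jump times on (0, T]: exponential inter-arrival times
    jump_times = Float64[]
    t = lambda > 0 ? -log(rand()) / lambda : Inf
    while t <= T
        push!(jump_times, t)
        t += -log(rand()) / lambda
    end
    for m in 1:M
        x += b(x) * dt + sigma(x) * sqrt(dt) * randn()
        # add the jumps that fall in the current time step
        for tau in jump_times
            if (m - 1) * dt < tau <= m * dt
                x += jumpsize()
            end
        end
        push!(path, x)
    end
    return path
end

# example: O-U drift/diffusion with uniform jumps on (-0.5, 0.5)
path = simulate_jump_diffusion(x -> 0.5 * (0.0 - x), x -> 0.4, 0.0, 1.0;
                               M=100, lambda=0.5, jumpsize=() -> rand() - 0.5)
```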

Models defined via (1.1)-(1.3) are extensively used, e.g., in finance and econometrics, for instance for capturing the market microstructure noise, see Aït-Sahalia et al., (2005); Aït-Sahalia and Yu, (2008); Hansen and Lunde, (2006). The above setting belongs to the general class of HMMs, with a signal defined in continuous-time. See Cappé et al., (2005); Douc et al., (2014) for a general treatment of HMMs fully specified in discrete-time. A number of methods have been suggested in the literature for approximating the unavailable transition density – mainly in the case of processes without jumps – including: asymptotic expansion techniques (Aït-Sahalia et al., 2005; Aït-Sahalia, 2002, 2008; Kessler, 1997); martingale estimating functions (Kessler and Sørensen, 1999); generalized method of moments (Hansen and Scheinkman, 1993); Monte-Carlo approaches (Wagner, 1989; Durham and Gallant, 2002; Beskos et al., 2006). See, e.g., Kessler et al., (2012) for a detailed review.

For a given sequence $\{a_{m}\}_{m}$, we use the notation $a_{i:j}:=(a_{i},\ldots,a_{j})$, for integers $i\leq j$. Let $p_{\theta}(y_{0:n})$ denote the joint density of $y_{0:n}$. Throughout the paper, $p_{\theta}(\cdot)$ is used generically to represent probability distributions or densities of random variables appearing as its arguments.

Consider the maximum likelihood estimator (MLE),

\hat{\theta}_{n}:=\arg\max_{\theta\in\Theta}\log p_{\theta}(y_{0:n}).

Except for limited cases, one cannot obtain the MLE analytically for HMMs (even for a discrete-time signal) due to the intractability of $p_{\theta}(y_{0:n})$.

We have set up the modelling context for this work. The main contributions of the paper in this setting – several of which relate to overcoming the intractability of the transition density of the SDE, and to developing a methodology that is well-posed on the infinite-dimensional pathspace – will be as follows:

  • (i)

    We present an online algorithm that delivers Monte-Carlo estimators of smoothing expectations,

    \mathcal{S}_{\theta,n}=\int S_{\theta}(\mathbf{x}_{0:n})\,p_{\theta}(d\mathbf{x}_{0:n}|y_{0:n}),\quad n\geq 1,   (1.4)

    for the class of additive functionals $S_{\theta}(\cdot)$ of the structure,

    S_{\theta}(\mathbf{x}_{0:n})=\sum_{k=0}^{n}s_{\theta,k}(x_{k-1},\mathbf{x}_{k}),   (1.5)

    under the conventions $x_{-1}=\emptyset$, $\mathbf{x}_{0}=x_{0}$. The bold type notation $\mathbf{x}_{k}$, $k\geq 0$, is reserved for carefully defined variables involving elements of the infinite-dimensional pathspace. The specific construction of $\mathbf{x}_{k}$ will depend on the model at hand, and will be explained in the main part of the paper. We note that the class of additive functionals considered supposes additivity over time, and therefore includes the general case of evaluating the score function. The online solution of the smoothing problem is often used as the means to solve some concrete inferential problems – see, e.g., (ii) below.

  • (ii)

    We take advantage of the new approach to show numerical applications, with emphasis on carrying out online parameter inference for the designated class of models via a gradient-ascent approach (in a Robbins-Monro stochastic gradient framework). A critical aspect of this particular online algorithm (partly likelihood-based, when concerned with parameter estimation; partly Bayesian, with regards to identification of filtering/smoothing expectations) is that it delivers estimates of the evolving score function, of the model parameters, together with particle representations of the filtering distributions, through a single passage of the data. This is a unique favourable algorithmic characteristic, when contrasted with alternative algorithms with similar objectives, such as, e.g., Particle MCMC (Andrieu et al., 2010), or SMC$^2$ (Chopin et al., 2013).

  • (iii)

    In this work, we will not characterise analytically the size of the time-discretisation bias relevant to the SDE models at hand, and are content with the facts that: (I) the bias can be decreased by increasing the resolution of the numerical scheme (typically an Euler-Maruyama one, or some other Taylor scheme, see e.g. Kloeden and Platen, (2013)); (II) critically, the Monte-Carlo algorithms are developed in a manner that the variance does not increase (in the limit, up to infinity) when increasing the resolution of the time-discretisation method; to achieve such an effect, the algorithms are (purposely) defined on the infinite-dimensional pathspace, and SDE paths are only discretised when implementing the algorithm on a PC (to allow, necessarily, for finite computations).

  • (iv)

    Our method draws inspiration from earlier works, in the context of online filtering for discrete-time HMMs and infinite-dimensional pathspace MCMC methods. The complete construct is novel; one consequence of this is that it is applicable, in principle, for a wide class of SDEs, under the following weak assumption (relative to the restrictive conditions often imposed in the literature of infinite-dimensional MCMC methods).

    Assumption 2.

    The diffusion covariance matrix function,

    \Sigma_{\theta}(v):=\sigma_{\theta}(v)\sigma_{\theta}(v)^{\top}\in\mathbb{R}^{d_{x}\times d_{x}},

    is invertible, for all relevant $v$, $\theta$.

Thus, the only class of models excluded by the methodology, as defined here, is that of hypoelliptic SDEs.

An elegant solution to the online smoothing problem posed above in (i), for the case of a standard HMM with discrete-time signal of known transition density $f_{\theta}(x^{\prime}|x)$, is given in Del Moral et al., (2010); Poyiadjis et al., (2011). Our own work overcomes the unavailability of the transition density in the continuous-time scenario by following the above literature but augmenting the hidden state with the complete continuous-time SDE path. Related augmentation approaches in this setting – though for different inferential objectives – have appeared in Fearnhead et al., (2008); Ströjby and Olsson, (2009); Gloaguen et al., (2018), where the auxiliary variables are derived via the Poisson estimator of transition densities for SDEs (under strict conditions on the class of SDEs; no jumps), introduced in Beskos et al., (2006), and in Särkkä and Sottinen, (2008) where the augmentation involves indeed the continuous-time path (the objective therein is to solve the filtering problem and the method is applicable for SDEs with additive Wiener noise; no jump processes are considered).

A Motivating Example: Fig. 1.1 (Top-Panel) shows estimates of the score function, evaluated at the true parameter value $\theta=\theta^{\dagger}$, for parameter $\theta_{3}$ of the Ornstein–Uhlenbeck (O–U) process, $dX_{t}=\theta_{1}(\theta_{2}-X_{t})dt+\theta_{3}dW_{t}$, $X_{0}=0.0$, for $n=10$ observations $y_{i}=x_{i}+\epsilon_{i}$, $\epsilon_{i}\stackrel{\mathrm{i.i.d.}}{\sim}\mathcal{N}(0,0.1^{2})$, $1\leq i\leq n$. Data were simulated from $\theta^{\dagger}=(0.5,0.0,0.4)$ with an Euler-Maruyama scheme of $M^{\dagger}=1,000$ grid points per unit of time. Fig. 1.1 (Top-Panel) illustrates the ‘abnormal’ effect of a standard data-augmentation scheme, where for $N=100$ particles, the Monte-Carlo method (see later sections for details) produces estimates of increasing variability as algorithmic resolution increases with $M=10,50,100,200$ – i.e., as it approaches the ‘true’ resolution used for the data generation. In contrast, Fig. 1.1 (Bottom-Panel) shows results for the same score function as estimated by our proposed method in this paper. Our method is well-defined on the infinite-dimensional pathspace, and, consequently, it manages to robustly estimate the score function under mesh-refinement.

Figure 1.1: Boxplots of estimated score functions of $\theta_{3}$ for the O–U process, for different resolution sizes $M$, over $R=50$ experiment replications. $N=100$ particles were used in all cases, for the same $n=10$ data-points. Top-Panel: Results are produced via a Monte-Carlo algorithm that uses a naive ‘finite-dimensional’ data augmentation approach. Bottom-Panel: Results are produced via a Monte-Carlo algorithm that uses the ‘infinite-dimensional’ data augmentation approach introduced in this paper.

The rest of the paper is organised as follows. Section 2 reviews the Forward-Only algorithm for the online approximation of expectations of a class of additive functionals described in Del Moral et al., (2010). Section 3 sets up the framework for the treatment of pathspace-valued SDEs, first for the conceptually simpler case of SDEs without jumps, and then proceeding with incorporating jumps. Section 4 provides the complete online approximation algorithm, constructed on the infinite-dimensional pathspace. Section 5 discusses the adaptation of the developed methodology for the purposes of online inference for unknown parameters of the given SDE model. Section 6 shows numerical applications of the developed methodology. Section 7 contains conclusions and directions for future research.

2 Forward-Only Smoothing

A bootstrap filter (Gordon et al., 1993) is applicable in the continuous-time setting, as it only requires forward sampling of the underlying signal $\{X_{t}\}_{t\geq 0}$; this is trivially possible – under numerous approaches – and is typically associated with the introduction of some time-discretisation bias (Kloeden and Platen, 2013). However, the transition density is still required for the smoothing problem we have posed in the Introduction. In this section, we assume a standard discrete-time HMM, with initial distribution $p_{\theta}(dx_{0})$, transition density $f_{\theta}(x^{\prime}|x)$, and likelihood $g_{\theta}(y|x)$, for appropriate $x,x^{\prime}\in\mathbb{R}^{d_{x}}$, $y\in\mathbb{R}^{d_{y}}$, and review the online algorithm developed in Del Moral et al., (2010) for this setting. Implementation of the bootstrap filter provides an approximation of the smoothing distribution $p_{\theta}(dx_{0:n}|y_{0:n})$ by following the genealogy of the particles. This method is studied, e.g., in Cappé, (2009); Dahlhaus and Neddermeyer, (2010). Let $\{x_{0:n}^{(i)},W_{n}^{(i)}\}_{i=1}^{N}$, $N\geq 1$, be a particle approximation of the smoothing distribution $p_{\theta}(dx_{0:n}|y_{0:n})$, in the sense that we have the estimate,

\widehat{p}_{\theta}(dx_{0:n}|y_{0:n})=\sum_{i=1}^{N}W_{n}^{(i)}\delta_{x_{0:n}^{(i)}}(dx_{0:n}),\quad\sum_{i=1}^{N}W_{n}^{(i)}=1,   (2.1)

with $\delta_{x_{0:n}^{(i)}}(dx_{0:n})$ the Dirac measure with an atom at $x_{0:n}^{(i)}$. Then, replacing $p_{\theta}(dx_{0:n}|y_{0:n})$ with its estimate in (2.1) provides consistent estimators of expectations of the HMM smoothing distributions. Though the method is online and the computational cost per time step is $\mathcal{O}(N)$, it typically suffers from the well-documented path-degeneracy problem – as illustrated via theoretical results or numerically (Del Moral et al., 2010; Kantas et al., 2015). That is, as $n$ increases, the particles representing $p_{\theta}(dx_{0:n}|y_{0:n})$ obtained by the above method will eventually all share the same ancestral particle due to the resampling steps, and the approximation collapses for big enough $n$. This is well-understood not to be a solution to the approximation of the smoothing distribution for practical applications.

An approach which overcomes path-degeneracy is the Forward Filtering Backward Smoothing (FFBS) algorithm of Doucet et al., (2000). We briefly review the method here, following closely the notation and development in Del Moral et al., (2010). In the forward direction, assume that a filtering algorithm (e.g. bootstrap) has provided a particle approximation of the filtering distribution $p_{\theta}(dx_{k-1}|y_{0:k-1})$ – assuming a relevant $k$ –,

\widehat{p}_{\theta}(dx_{k-1}|y_{0:k-1})=\sum_{i=1}^{N}W_{k-1}^{(i)}\delta_{x_{k-1}^{(i)}}(dx_{k-1}),   (2.2)

for weighted particles $\{x_{k-1}^{(i)},W_{k-1}^{(i)}\}_{i=1}^{N}$. In the backward direction, assume that one is given the particle approximation of the marginal smoothing distribution $p_{\theta}(dx_{k}|y_{0:n})$,

\widehat{p}_{\theta}(dx_{k}|y_{0:n})=\sum_{i=1}^{N}W_{k|n}^{(i)}\delta_{x_{k}^{(i)}}(dx_{k}).   (2.3)

One has that (Kitagawa, 1987),

p_{\theta}(dx_{k-1:k}|y_{0:n}) = p_{\theta}(dx_{k}|y_{0:n})\otimes p_{\theta}(dx_{k-1}|x_{k},y_{0:k-1})
= p_{\theta}(dx_{k}|y_{0:n})\otimes p_{\theta}(dx_{k-1}|y_{0:k-1})\frac{f_{\theta}(x_{k}|x_{k-1})}{\int f_{\theta}(x_{k}|x_{k-1})p_{\theta}(x_{k-1}|y_{0:k-1})dx_{k-1}}.   (2.4)

Using (2.2)-(2.3), and based on equation (2.4), we obtain the approximation,

\widehat{p}_{\theta}(dx_{k-1:k}|y_{0:n})=\sum_{j=1}^{N}W_{k|n}^{(j)}\sum_{i=1}^{N}\frac{f_{\theta}(x_{k}^{(j)}|x_{k-1}^{(i)})W_{k-1}^{(i)}}{\sum_{l=1}^{N}f_{\theta}(x_{k}^{(j)}|x_{k-1}^{(l)})W_{k-1}^{(l)}}\,\delta_{(x_{k-1}^{(i)},x_{k}^{(j)})}(dx_{k-1:k}).   (2.5)

Recalling the expectation of additive functionals in (1.4)-(1.5) – where now, in the discrete-time setting, we can ignore the bold $\mathbf{x}_{k}$ elements, and simply use $x_{k}$ instead – the above calculations give rise to the following estimator of the target quantity $\mathcal{S}_{\theta,n}$ in (1.4),

\widehat{\mathcal{S}}_{\theta,n}=\sum_{k=0}^{n}\int s_{\theta,k}(x_{k-1},x_{k})\,\widehat{p}_{\theta}(dx_{k-1:k}|y_{0:n}).

To be able to apply the above method, the marginal smoothing approximation in (2.3) is obtained via a backward recursive approach. In particular, starting from $k=n$ (where the approximation is provided by the standard forward particle filter), one proceeds as follows. Given $k$, the quantity for $k-1$ is directly obtained by integrating out $x_{k}$ in (2.5), thus we have,

\widehat{p}_{\theta}(dx_{k-1}|y_{0:n})=\sum_{i=1}^{N}W_{k-1|n}^{(i)}\delta_{x_{k-1}^{(i)}}(dx_{k-1}),

for the normalised weights,

W_{k-1|n}^{(i)}\propto\sum_{j=1}^{N}W_{k|n}^{(j)}\,\frac{f_{\theta}(x_{k}^{(j)}|x_{k-1}^{(i)})W_{k-1}^{(i)}}{\sum_{l=1}^{N}f_{\theta}(x_{k}^{(j)}|x_{k-1}^{(l)})W_{k-1}^{(l)}}.

Notice that – in this version of FFBS – the same particles $\{x_{k}^{(i)}\}_{i=1}^{N}$ are used in both directions (the ones before resampling at the forward filter), but with different weights.
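For concreteness, the following is a minimal Julia sketch of the backward weight update above, assuming the transition density values have been evaluated into a matrix; the function name `ffbs_backward_step` is ours, for illustration.

```julia
# One backward step of FFBS: given forward filter weights W_{k-1},
# current smoothing weights W_{k|n}, and the matrix
#   F[i, j] = f_theta(x_k^(j) | x_{k-1}^(i)),
# return the smoothing weights W_{k-1|n}.
function ffbs_backward_step(Wprev::Vector{Float64},
                            Wsmooth::Vector{Float64},
                            F::Matrix{Float64})
    N = length(Wprev)
    Wout = zeros(N)
    for j in 1:N
        denom = sum(F[l, j] * Wprev[l] for l in 1:N)
        for i in 1:N
            Wout[i] += Wsmooth[j] * F[i, j] * Wprev[i] / denom
        end
    end
    return Wout / sum(Wout)   # normalise
end
```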

An important development made in Del Moral et al., (2010) is transforming the above offline algorithm into an online one. This is achieved by consideration of the sequence of instrumental functionals,

T_{\theta,0}(x_{0})=s_{\theta,0}(x_{0});\qquad T_{\theta,n}(x_{n}):=\int S_{\theta,n}(x_{0:n})\,p_{\theta}(dx_{0:n-1}|y_{0:n-1},x_{n}),\quad n\geq 1.

Notice that, first,

\mathcal{S}_{\theta,n}=\int T_{\theta,n}(x_{n})\,p_{\theta}(dx_{n}|y_{0:n}).

We also have that,

T_{\theta,n}(x_{n}) =\int\big{[}T_{\theta,n-1}(x_{n-1})+s_{\theta,n}(x_{n-1},x_{n})\big{]}\,p_{\theta}(dx_{n-1}|y_{0:n-1},x_{n})
\equiv\frac{\int\big{[}T_{\theta,n-1}(x_{n-1})+s_{\theta,n}(x_{n-1},x_{n})\big{]}\,f_{\theta}(x_{n}|x_{n-1})\,p_{\theta}(dx_{n-1}|y_{0:n-1})}{\int f_{\theta}(x_{n}|x_{n-1})\,p_{\theta}(dx_{n-1}|y_{0:n-1})},   (2.6)

– see Proposition 2.1 of Del Moral et al., (2010) for a (simple) proof. This recursion provides an online – forward-only – advancement of FFBS for estimating the smoothing expectation of additive functionals. The complete method is summarised in Algorithm 1: one key ingredient is that, during the recursion, values of the functional $T_{\theta,n}(x_{n})$ are only required at the discrete positions $x_{n}^{(i)}$ determined by the forward particle filter.

In the SDE context, under Assumption 1, the transition density $f_{\theta}(\cdot|\cdot)$ is considered intractable, thus Algorithm 1 – apart from serving as a review of the method in Del Moral et al., (2010) – does not appear to be practical in the continuous-time case.

Algorithm 1 Online Forward-Only Smoothing (Del Moral et al., 2010)
  1. (i)

    Initialise particles $\{x_{0}^{(i)},W_{0}^{(i)}\}_{i=1}^{N}$, with $x_{0}^{(i)}\stackrel{\mathrm{i.i.d.}}{\sim}p_{\theta}(dx_{0})$, $W_{0}^{(i)}\propto g_{\theta}(y_{0}|x_{0}^{(i)})$, and functionals $\widehat{T}_{\theta,0}(x_{0}^{(i)})=s_{\theta,0}(x_{0}^{(i)})$, for $1\leq i\leq N$.

  2. (ii)

    Assume that at time $n-1$, one has a particle approximation $\{x_{n-1}^{(i)},W_{n-1}^{(i)}\}_{i=1}^{N}$ of the filtering law $p_{\theta}(dx_{n-1}|y_{0:n-1})$ and estimators $\widehat{T}_{\theta,n-1}(x_{n-1}^{(i)})$ of $T_{\theta,n-1}(x_{n-1}^{(i)})$, for $1\leq i\leq N$.

  3. (iii)

    At time $n$, sample $x_{n}^{(i)}$, for $1\leq i\leq N$, from the mixture (Gordon et al., 1993),

    x_{n}^{(i)}\sim\widehat{p}_{\theta}(x_{n}|y_{0:n-1})=\sum_{j=1}^{N}W_{n-1}^{(j)}f_{\theta}(x_{n}|x_{n-1}^{(j)}),

    and assign particle weights $W_{n}^{(i)}\propto g_{\theta}(y_{n}|x_{n}^{(i)})$, $1\leq i\leq N$.

  4. (iv)

    Then set, for $1\leq i\leq N$,

    \widehat{T}_{\theta,n}(x_{n}^{(i)}) =\sum_{j=1}^{N}\frac{W_{n-1}^{(j)}f_{\theta}(x_{n}^{(i)}|x_{n-1}^{(j)})}{\sum_{l=1}^{N}W_{n-1}^{(l)}f_{\theta}(x_{n}^{(i)}|x_{n-1}^{(l)})}\,\big{[}\widehat{T}_{\theta,n-1}(x_{n-1}^{(j)})+s_{\theta,n}(x_{n-1}^{(j)},x_{n}^{(i)})\big{]}.
  5. (v)

    Obtain an estimate of $\mathcal{S}_{\theta,n}$ as,

    \widehat{\mathcal{S}}_{\theta,n}=\sum_{i=1}^{N}W_{n}^{(i)}\widehat{T}_{\theta,n}(x_{n}^{(i)}).
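As an illustration of the structure of Algorithm 1, the following is a minimal Julia sketch for a toy scalar linear-Gaussian HMM with known transition density; the additive functional chosen here, $\sum_{k}x_{k-1}x_{k}$, and all function names are ours, and this is not the paper's implementation.

```julia
using Random

# Gaussian density, used for both f_theta and g_theta of the toy HMM
#   x_k = a*x_{k-1} + sx*xi_k,   y_k = x_k + sy*eta_k.
normpdf(z, m, s) = exp(-0.5 * ((z - m) / s)^2) / (s * sqrt(2pi))

function forward_only_smoothing(y::Vector{Float64}; N=100, a=0.9, sx=0.5, sy=0.1)
    n = length(y)
    x = randn(N)                               # (i) x_0^(i) ~ N(0, 1)
    W = normpdf.(y[1], x, sy); W ./= sum(W)
    T = zeros(N)                               # take s_0(x_0) = 0 here
    Shat = Float64[]
    for k in 2:n
        # (iii) sample from the mixture sum_j W^(j) f(. | x^(j))
        cw   = cumsum(W)
        anc  = [searchsortedfirst(cw, rand()) for _ in 1:N]
        xnew = a .* x[anc] .+ sx .* randn(N)
        # (iv) update the T functionals; s_k(x_{k-1}, x_k) = x_{k-1} * x_k
        Tnew = zeros(N)
        for i in 1:N
            num = [W[j] * normpdf(xnew[i], a * x[j], sx) for j in 1:N]
            num ./= sum(num)
            Tnew[i] = sum(num[j] * (T[j] + x[j] * xnew[i]) for j in 1:N)
        end
        # new particle weights and (v) estimate of the smoothing expectation
        Wnew = normpdf.(y[k], xnew, sy); Wnew ./= sum(Wnew)
        push!(Shat, sum(Wnew .* Tnew))
        x, W, T = xnew, Wnew, Tnew
    end
    return Shat
end

Shat = forward_only_smoothing(0.5 .* randn(50))   # dummy data
```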

3 Data Augmentation on Diffusion Pathspace

To overcome the intractability of the transition density $f_{\theta}(\cdot|\cdot)$ of the SDE, we will work with an algorithm that is defined in continuous-time and makes use of the complete SDE path-particles in its development. The new method has connections with earlier works in the literature. Särkkä and Sottinen, (2008) focus on the filtering problem for a class of models related to (1.1)-(1.3), and come up with an approach that requires the complete SDE path, for a limited class of diffusions with additive noise and no jumps. Fearnhead et al., (2008) also deal with the filtering problem, and – equipped with an unbiased estimator of the unknown transition density – recast the problem as one of filtering over an augmented space that incorporates the randomness for the unbiased estimate. The method is accompanied by strict conditions on the drift and diffusion coefficient (one should be able to transform the SDE – no jumps – into one of unit diffusion coefficient; the drift of the SDE must have a gradient form). Our contribution requires, in principle, solely the diffusion coefficient invertibility Assumption 2; arguably, the weakened condition we require stems from the fact that our approach appears to be the most ‘natural’ extension (compared to alternatives) of the standard discrete-time algorithm of Del Moral et al., (2010).

The latter discrete-time method requires the density $f_{\theta}(x^{\prime}|x)=f_{\theta}(dx^{\prime}|x)/dx^{\prime}$. In continuous-time, we obtain an analytically available Radon-Nikodym derivative of $p_{\theta}(d\mathbf{x}^{\prime}|x)$, for a properly defined variate $\mathbf{x}^{\prime}$ that involves information about the continuous-time path for moving from $x$ to $x^{\prime}$ within time $\Delta$. We will give the complete algorithm in Section 4. In this section, we prepare the ground by carefully determining $\mathbf{x}^{\prime}$ given $x$, and calculating the relevant densities to be later plugged into our method.

3.1 SDEs with Continuous Paths

We work first with the process with continuous sample-paths, i.e. of dynamics,

dX_{t}=b_{\theta}(X_{t})dt+\sigma_{\theta}(X_{t})dW_{t}.   (3.1)

We adopt an approach motivated by techniques used for MCMC algorithms (Chib et al., 2004; Golightly and Wilkinson, 2008; Roberts and Stramer, 2001). Assume we are given starting point $x\in\mathbb{R}^{d_{x}}$, ending point $x^{\prime}\in\mathbb{R}^{d_{x}}$, and the complete continuous-time path for the signal process in (3.1) on $[0,T]$, for some $T>0$. That is, we now work with the path process,

\big{[}\,\{X_{t};t\in[0,T]\}\,\big{|}\,X_{0}=x,X_{T}=x^{\prime}\,\big{]}.   (3.2)

Let $\mathbb{P}_{\theta,x,x^{\prime}}$ denote the probability distribution of the pathspace-valued variable in (3.2). We consider the auxiliary bridge process $\tilde{X}=\{\tilde{X}_{t};t\in[0,T]\}$ defined as,

d\tilde{X}_{t}=\Big{\{}b_{\theta}(\tilde{X}_{t})+\frac{x^{\prime}-\tilde{X}_{t}}{T-t}\Big{\}}dt+\sigma_{\theta}(\tilde{X}_{t})dW_{t},\quad\tilde{X}_{0}=x,\quad t\in[0,T],   (3.3)

with corresponding probability distribution $\mathbb{Q}_{\theta,x,x^{\prime}}$. Critically, a path of $\tilde{X}$ starts at point $x$ and finishes at $x^{\prime}$, w.p. 1. Under regularity conditions, Delyon and Hu, (2006) prove that the probability measures $\mathbb{P}_{\theta,x,x^{\prime}}$, $\mathbb{Q}_{\theta,x,x^{\prime}}$ are absolutely continuous with respect to each other. We treat the auxiliary SDE (3.3) as a transform from the driving noise to the solution, whence a sample path, $X$, of the process $\tilde{X}=\{\tilde{X}_{t};t\in[0,T]\}$ is produced by a mapping – determined by (3.3) – of a corresponding sample path, say $Z$, of the Wiener process. That is, we have set up a map, and – under Assumption 2 – its inverse,

Z\mapsto X=:F_{\theta}(Z;x,x^{\prime}),\quad Z=F^{-1}_{\theta}(X;x,x^{\prime}).   (3.4)

More analytically, $F_{\theta}^{-1}$ is given via the transform,

dZ_{t}=\sigma_{\theta}(X_{t})^{-1}\Big{\{}dX_{t}-b_{\theta}(X_{t})dt-\frac{x^{\prime}-X_{t}}{T-t}dt\Big{\}}.
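In practice the maps in (3.4) are applied on a time grid. The following Julia sketch gives an Euler-type discretisation of $F_{\theta}$ (driving a path of the auxiliary bridge (3.3) by Wiener increments) and of its inverse, for a scalar SDE; the interface – paths represented by their $M-1$ interior grid points, with end points pinned at $x$, $x^{\prime}$ – and the function names are ours, for illustration.

```julia
# Euler-type discretisation of Z -> X = F_theta(Z; x, xp) from the
# auxiliary bridge (3.3), and of its inverse, for a scalar SDE.  `dZ`
# holds the M-1 Wiener increments on a grid of mesh delta = T/M; the
# returned `X` holds the M-1 interior points (end points pinned).
function bridge_forward(dZ::Vector{Float64}, b, sigma, x, xp, T)
    M = length(dZ) + 1
    delta = T / M
    X = zeros(M - 1)
    xcur = x
    for m in 1:(M - 1)
        t = (m - 1) * delta
        drift = b(xcur) + (xp - xcur) / (T - t)
        xcur += drift * delta + sigma(xcur) * dZ[m]
        X[m] = xcur
    end
    return X                      # interior points; X_T = xp by the pinning
end

function bridge_inverse(X::Vector{Float64}, b, sigma, x, xp, T)
    M = length(X) + 1
    delta = T / M
    dZ = zeros(M - 1)
    xcur = x
    for m in 1:(M - 1)
        t = (m - 1) * delta
        drift = b(xcur) + (xp - xcur) / (T - t)
        dZ[m] = (X[m] - xcur - drift * delta) / sigma(xcur)
        xcur = X[m]
    end
    return dZ
end

# round-trip check on a random Wiener increment path
dZ = sqrt(1.0 / 10) .* randn(9)
X  = bridge_forward(dZ, x -> 0.5 * (0.0 - x), x -> 0.4, 0.0, 0.3, 1.0)
maximum(abs.(bridge_inverse(X, x -> 0.5 * (0.0 - x), x -> 0.4, 0.0, 0.3, 1.0) .- dZ))
```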

In this case we define,

\mathbf{x}^{\prime}:=(x^{\prime},Z),

and the probability measure of interest is,

p_{\theta}(d\mathbf{x}^{\prime}|x):=f_{\theta}(dx^{\prime}|x)\otimes p_{\theta}(dZ|x^{\prime},x).   (3.5)

Let $\mathbb{W}$ be the standard Wiener probability measure on $[0,T]$. Due to the 1–1 transform, we have that,

\frac{p_{\theta}(dZ|x^{\prime},x)}{\mathbb{W}(dZ)}\equiv\frac{d\mathbb{P}_{\theta,x,x^{\prime}}}{d\mathbb{Q}_{\theta,x,x^{\prime}}}(F_{\theta}(Z;x,x^{\prime})),

so it remains to obtain the density $d\mathbb{P}_{\theta,x,x^{\prime}}/d\mathbb{Q}_{\theta,x,x^{\prime}}$.

Such a Radon-Nikodym derivative has been an object of interest in many works. Delyon and Hu, (2006) provided detailed conditions and a proof, but (seemingly) omit an expression for the normalising constant, which is important in our case as it involves the parameter $\theta$ – in our applications later in the paper, we aim to infer $\theta$. Papaspiliopoulos et al., (2013); Papaspiliopoulos and Roberts, (2012) provide an expression based on a conditioning argument for the projection of the probability measures on $[0,t)$, $t<T$, and passage to the limit $t\uparrow T$. The derivations in Delyon and Hu, (2006) are extremely rigorous, so we will make use of the expressions in that paper. Following carefully the proofs of some of their main results (Theorem 5, together with Lemmas 7, 8) one can indeed retrieve the constant in the deduced density. In particular, Delyon and Hu, (2006) impose the following conditions.

Assumption 3.
  • (i)

    SDE (3.1) admits a strong solution;

  • (ii)

    $v\mapsto\sigma_{\theta}(v)$ is in $\mathcal{C}^{2}_{b}$ (i.e., twice continuously differentiable and bounded, with bounded first and second derivatives), and it is invertible, with bounded inverse;

  • (iii)

    $v\mapsto b_{\theta}(v)$ is locally Lipschitz, locally bounded.

Under Assumption 3, Delyon and Hu, (2006) prove that,

\frac{d\mathbb{P}_{\theta,x,x^{\prime}}}{d\mathbb{Q}_{\theta,x,x^{\prime}}}(X)=\frac{|\Sigma_{\theta}(x^{\prime})|^{1/2}}{|\Sigma_{\theta}(x)|^{1/2}}\times\frac{\mathcal{N}(x^{\prime};x,T\Sigma_{\theta}(x))}{f_{\theta}(x^{\prime}|x)}\times\varphi_{\theta}(X;x,x^{\prime}),   (3.6)

where $\mathcal{N}(v;\mu,V)$ is the density function of the Gaussian law on $\mathbb{R}^{d_{x}}$ with mean $\mu$, variance $V$, evaluated at $v$, $|\cdot|$ is the matrix determinant, and $\varphi_{\theta}(X;x,x^{\prime})$ is such that,

\log\varphi_{\theta}(X;x,x^{\prime})=\int_{0}^{T}\big{\langle}\,b_{\theta}(X_{t}),\Sigma^{-1}_{\theta}(X_{t})dX_{t}\,\big{\rangle}-\tfrac{1}{2}\int_{0}^{T}\big{\langle}\,b_{\theta}(X_{t}),\Sigma_{\theta}^{-1}(X_{t})b_{\theta}(X_{t})\,\big{\rangle}dt
\;-\tfrac{1}{2}\int_{0}^{T}\tfrac{\left\langle\,(x^{\prime}-X_{t}),\,d\Sigma^{-1}_{\theta}(X_{t})(x^{\prime}-X_{t})\,\right\rangle}{T-t}-\tfrac{1}{2}\int_{0}^{T}\tfrac{\sum_{i,j=1}^{d_{x}}d\,[\,\Sigma_{\theta,ij}^{-1},(x^{\prime}_{i}-X_{t,i})(x^{\prime}_{j}-X_{t,j})\,]}{T-t}.

Here, $[\cdot,\cdot]$ denotes the quadratic variation process for semi-martingales; also, $\langle\cdot,\cdot\rangle$ is the standard inner-product on $\mathbb{R}^{d_{x}}$. We note that transforms different from (3.4) have been proposed in the literature (Dellaportas et al., 2006; Kalogeropoulos et al., 2010) to achieve the same effect of obtaining a 1–1 mapping of the latent path that has a density with respect to a measure that does not depend on $x$, $x^{\prime}$ or $\theta$. However, such methods are mostly applicable for scalar diffusions (Aït-Sahalia, 2008). Auxiliary variables involving a random, finite selection of points of the latent path, based on the (generalised) Poisson estimator of Fearnhead et al., (2008), are similarly restrictive. In contrast to other attempts, our methodology may be applied for a much more general class of SDEs, as determined by Assumption 2 – and further regularity conditions, e.g. as in Assumption 3. Thus, continuing from (3.5) we have obtained that,

\frac{p_{\theta}(d\mathbf{x}^{\prime}|x)}{(\mathrm{Leb}^{\otimes d_{x}}\otimes\mathbb{W})(d\mathbf{x}^{\prime})} =\varphi_{\theta}\big{(}F_{\theta}(Z;x,x^{\prime});x,x^{\prime}\big{)}\times\mathcal{N}(x^{\prime};x,T\Sigma_{\theta}(x))\times\frac{|\Sigma_{\theta}(x^{\prime})|^{1/2}}{|\Sigma_{\theta}(x)|^{1/2}}
=:p_{\theta}(\mathbf{x}^{\prime}|x)\equiv p_{\theta}(\mathbf{x}^{\prime}|x;T).   (3.7)

We have added the extra argument involving the length of the path, $T$, in the last expression, as it will be of use in the next section.

Remark 1.

A critical point here is that the above density is analytically tractable; thus, by working on the pathspace we have overcome the unavailability of the transition density $f_{\theta}(x^{\prime}|x)$.
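On a PC, the integrals appearing in $\log\varphi_{\theta}$ are approximated on the time grid. The following Julia sketch evaluates the log of the density (3.7) for a scalar SDE with constant diffusion coefficient $\sigma$ – in which case $\Sigma_{\theta}^{-1}$ is constant, the last two integrals in $\log\varphi_{\theta}$ vanish and the determinant ratio equals one. It is a sketch under these simplifying assumptions, not a general implementation; the function name is ours.

```julia
# Time-discretised evaluation of log p_theta(x', Z | x) in (3.7) for a
# scalar SDE with *constant* diffusion coefficient `sigma`.  `Xint` holds
# the M-1 interior grid points of the bridge path (end points x, xp).
function log_pathspace_density(Xint::Vector{Float64}, b, sigma, x, xp, T)
    M = length(Xint) + 1
    delta = T / M
    grid = vcat(x, Xint, xp)          # X_0, ..., X_M
    logphi = 0.0
    for m in 1:M
        xm = grid[m]
        dX = grid[m + 1] - grid[m]
        # Riemann/Ito sums of the first two integrals in log(phi)
        logphi += b(xm) * dX / sigma^2 - 0.5 * b(xm)^2 * delta / sigma^2
    end
    # log of the Gaussian density N(x'; x, T*sigma^2)
    lognorm = -0.5 * log(2pi * T * sigma^2) - (xp - x)^2 / (2 * T * sigma^2)
    return logphi + lognorm
end

# dummy path of 9 interior points between x = 0.0 and xp = 0.3
Xdummy = collect(range(0.0, 0.3, length=11))[2:10]
log_pathspace_density(Xdummy, x -> 0.5 * (0.0 - x), 0.4, 0.0, 0.3, 1.0)
```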

3.2 SDEs with Jumps

We extend the above developments to the more general case of the $d_{x}$-dimensional jump diffusion model given in (1.1), which we re-write here for convenience,

dX_{t}=b_{\theta}(X_{t-})dt+\sigma_{\theta}(X_{t-})dW_{t}+dJ_{t},\quad X_{0}=x\in\mathbb{R}^{d_{x}},\quad t\in[0,T].   (3.8)

Recall that $J_{\theta}=J=\{J_{t}\}$ denotes a compound Poisson process with jump intensity $\lambda_{\theta}(\cdot)$ and jump-size density $h_{\theta}(\cdot)$. Let $\mathbb{F}_{\theta,x}(\cdot)$ denote the law of the unconditional process (3.8) and $\mathbb{L}_{\theta}$ the law of the involved compound Poisson process. We write $J=((\tau_{1},b_{1}),\ldots,(\tau_{\kappa},b_{\kappa}))$ to denote the jump process, where $\{\tau_{i}\}$ are the times of events, $\{b_{i}\}$ the jump sizes and $\kappa\geq 0$ the total number of events. In addition, we consider the reference measure $\mathbb{L}$, corresponding to the unit rate Poisson process measure on $[0,T]$ multiplied with $\otimes_{i=1}^{\kappa+1}\mathrm{Leb}^{\otimes d_{x}}$.

Construct One

We consider the random variate,

\mathbf{x}^{\prime}=\big{(}J,\{x_{\tau_{i}-}\}_{i=1}^{\kappa+1},\{Z(i)\}_{i=1}^{\kappa+1}\big{)},

under the conventions $x_{\tau_{0}-}\equiv x$, $x_{\tau_{\kappa+1}-}\equiv x^{\prime}$, where we have defined,

Z(i)=F^{-1}_{\theta}(X(i)\,;\,x_{\tau_{i-1}},x_{\tau_{i}-}),\quad X(i):=\{X_{t};t\in[\tau_{i-1},\tau_{i})\},\quad 1\leq i\leq\kappa+1.

We have that,

p_{\theta}(d\mathbf{x}^{\prime}|x):=\mathbb{L}_{\theta}(dJ)\otimes\Big{[}\otimes_{i=1}^{\kappa+1}\big{\{}\,f_{\theta}(dx_{\tau_{i}-}|x_{\tau_{i-1}})\otimes p_{\theta}(dZ(i)|x_{\tau_{i-1}},x_{\tau_{i}-})\,\big{\}}\Big{]}.

Using the results about SDEs without jumps in Section 3.1, and in particular expression (3.7), upon defining,

\mathbf{x}^{\prime}(i):=(x_{\tau_{i}-},Z(i)),

we have that – under an apparent adaptation of Assumption 3,

\frac{f_{\theta}(dx_{\tau_{i}-}|x_{\tau_{i-1}})\otimes p_{\theta}(dZ(i)|x_{\tau_{i-1}},x_{\tau_{i}-})}{\mathrm{Leb}^{\otimes d_{x}}(dx_{\tau_{i}-})\otimes\mathbb{W}(dZ(i))}=p_{\theta}(\mathbf{x}^{\prime}(i)|x_{\tau_{i-1}};\tau_{i}-\tau_{i-1}),

with the latter density $p_{\theta}(\mathbf{x}^{\prime}(i)|x_{\tau_{i-1}};\tau_{i}-\tau_{i-1})$ determined as in (3.7), given clear adjustments. Thus, the density of $p_{\theta}(d\mathbf{x}^{\prime}|x)$ with respect to the reference measure,

\mu(d\mathbf{x}^{\prime}):=\mathbb{L}(dJ)\otimes\Big{[}\otimes_{i=1}^{\kappa+1}\big{\{}\mathrm{Leb}^{\otimes d_{x}}(dx_{\tau_{i}-})\otimes\mathbb{W}(dZ(i))\big{\}}\Big{]},

is equal to,

\frac{p_{\theta}(d\mathbf{x}^{\prime}|x)}{\mu(d\mathbf{x}^{\prime})}=\frac{e^{-\int_{0}^{T}\lambda_{\theta}(t)dt}}{e^{-T}}\cdot\prod_{i=1}^{\kappa}\big{\{}\lambda_{\theta}(\tau_{i})h_{\theta}(b_{i})\big{\}}\times\prod_{i=1}^{\kappa+1}p_{\theta}(\mathbf{x}^{\prime}(i)|x_{\tau_{i-1}};\tau_{i}-\tau_{i-1}).
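The first two factors above involve only the compound Poisson process; a small Julia sketch of their logarithm is given below, with the intensity integral approximated on a grid (the last factor is a product of bridge densities over the inter-jump intervals, computed as in the previous sketch). The function name `log_jump_factor` is ours, for illustration.

```julia
# Log of the compound-Poisson part of the Construct One density:
#   exp(-int_0^T lambda(t) dt) / exp(-T) * prod_i lambda(tau_i) h(b_i),
# with the time integral of the intensity approximated by a midpoint rule.
function log_jump_factor(taus::Vector{Float64}, bs::Vector{Float64},
                         lambda, logh, T; M=1000)
    delta = T / M
    int_lambda = sum(lambda((m - 0.5) * delta) for m in 1:M) * delta
    out = -int_lambda + T
    for (tau, b) in zip(taus, bs)
        out += log(lambda(tau)) + logh(b)
    end
    return out
end

# constant intensity 0.5, uniform jump sizes on (-0.5, 0.5) (density 1)
log_jump_factor([0.3, 0.7], [0.1, -0.2], t -> 0.5, b -> log(1.0), 1.0)
```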

Construct Two

We adopt an idea used – for a different problem – in Gonçalves and Roberts, (2014). Given $x,x^{\prime}\in\mathbb{R}^{d_{x}}$, we define an auxiliary process $\tilde{X}_{t}$ as follows,

d\tilde{X}_{t}=\Big{\{}b_{\theta}(\tilde{X}_{t})+\frac{x^{\prime}-J_{T}-\tilde{X}_{t}+J_{t}}{T-t}\Big{\}}dt+\sigma_{\theta}(\tilde{X}_{t})dW_{t}+dJ_{t},\quad\tilde{X}_{0}=x,   (3.9)

so that $\tilde{X}_{T}=x^{\prime}$, w.p. 1. As with (3.4), we view (3.9) as a transform, projecting a path, $Z$, of the Wiener process and the compound Poisson process, $J$, onto a path, $X$, of the jump diffusion. That is, we consider the 1–1 maps,

(J,Z)\mapsto X=:G_{\theta}(J,Z;x,x^{\prime}),\quad(J,Z)=G^{-1}_{\theta}(X;x,x^{\prime}).

Notice that for the inverse transform, the $J$-part is obtained immediately, whereas for the $Z$-part one uses the expression – well-defined due to Assumption 2 –,

dZ_{t}=\sigma_{\theta}(X_{t})^{-1}\Big{\{}dX_{t}-dJ_{t}-b_{\theta}(X_{t})dt-\frac{x^{\prime}-J_{T}-X_{t}+J_{t}}{T-t}dt\Big{\}}.

We denote by $\overline{\mathbb{P}}_{\theta,x,x^{\prime}}$ the law of the original process in (3.8) conditionally on hitting $x^{\prime}$ at time $T$. Also, we denote the distribution on pathspace induced by (3.9) as $\overline{\mathbb{Q}}_{\theta,x,x^{\prime}}$. Consider the variate,

\mathbf{x}^{\prime}=(x^{\prime},J,Z),

so that,

p_{\theta}(d\mathbf{x}^{\prime}|x):=f_{\theta}(dx^{\prime}|x)\otimes p_{\theta}\big{(}d(J,Z)|x^{\prime},x\big{)}.

Due to the employed 1–1 transforms, we have that,

\frac{p_{\theta}(d\mathbf{x}^{\prime}|x)}{\big{(}\mathrm{Leb}^{\otimes d_{x}}\otimes\mathbb{L}_{\theta}\otimes\mathbb{W}\big{)}(d\mathbf{x}^{\prime})}=f_{\theta}(x^{\prime}|x)\times\frac{d\overline{\mathbb{P}}_{\theta,x,x^{\prime}}}{d\overline{\mathbb{Q}}_{\theta,x,x^{\prime}}}\big{(}G_{\theta}(J,Z;x,x^{\prime})\big{)}.

Thus, using the parameter-free reference measure,

\mu(d\mathbf{x}^{\prime}):=\mathrm{Leb}^{\otimes d_{x}}\otimes\mathbb{L}\otimes\mathbb{W},

one obtains that,

\frac{p_{\theta}(d\mathbf{x}^{\prime}|x)}{\mu(d\mathbf{x}^{\prime})} =f_{\theta}(x^{\prime}|x)\times\frac{e^{-\int_{0}^{T}\lambda_{\theta}(t)dt}}{e^{-T}}\cdot\prod_{i=1}^{\kappa}\big{\{}\lambda_{\theta}(\tau_{i})h_{\theta}(b_{i})\big{\}}\times\frac{d\overline{\mathbb{P}}_{\theta,x,x^{\prime}}}{d\overline{\mathbb{Q}}_{\theta,x,x^{\prime}}}\big{(}G_{\theta}(J,Z;x,x^{\prime})\big{)}.   (3.10)
Remark 2.

Delyon and Hu, (2006) obtained the Radon-Nikodym derivative expression in (3.6) after a great amount of rigorous analysis. A similar development for the case of conditioned jump diffusions does not follow from their work, and can only be the subject of dedicated research at the scale of a separate paper. This is beyond the scope of our work. In practice, one can proceed as follows. For grid size $M\geq 1$, and $\delta=T/M$, let $\overline{\mathbb{P}}^{M}_{\theta,x,x^{\prime}}(X_{\delta},\ldots,X_{(M-1)\delta}\,|\,X_{0}=x,X_{M\delta}=x^{\prime})$ denote the time-discretised Lebesgue density of the $(M-1)$ positions of the conditioned diffusion with law $\overline{\mathbb{P}}_{\theta,x,x^{\prime}}$. Once (3.10) is obtained, a time-discretisation approach will give,

\overline{\mathbb{P}}^{M}_{\theta,x,x^{\prime}}(X_{\delta},\ldots,X_{(M-1)\delta}\,|\,X_{0}=x,X_{M\delta}=x^{\prime})=\frac{\overline{\mathbb{P}}^{M}_{\theta,x,x^{\prime}}(X_{\delta},\ldots,X_{(M-1)\delta},X_{M\delta}=x^{\prime}\,|\,X_{0}=x)}{\overline{\mathbb{P}}^{M}_{\theta,x,x^{\prime}}(X_{M\delta}=x^{\prime}\,|\,X_{0}=x)}.

In this time-discretised setting, $f_{\theta}(x^{\prime}|x)$ in (3.6) will be replaced by $\overline{\mathbb{P}}^{M}_{\theta,x,x^{\prime}}(X_{M\delta}=x^{\prime}\,|\,X_{0}=x)$. Thus, the intractable transition density over the complete time period will cancel out, and one is left with an explicit expression to use on a PC. Compared to the method in Section 3.1, and Construct One in the current section, we do not have explicit theoretical evidence of a density on the pathspace. Yet, all numerical experiments we tried showed that the deduced algorithm was stable under mesh-refinement. We thus adopt the approach (or, conjecture) that the density in (3.10) exists – under assumptions – and can be obtained pending future research.

4 Forward-Only Smoothing for SDEs

4.1 Pathspace Algorithm

We are ready to develop a forward-only particle smoothing method, under the scenario in (1.4)-(1.5), in the pathspace setting. We will work with the pairs of random elements

(x_{k-1},\mathbf{x}_{k}),\quad 1\leq k\leq n,

with $\mathbf{x}_{k}$ as defined in Section 3, i.e. containing pathspace elements, and given by a 1–1 transform of $\{X_{s};s\in[t_{k-1},t_{k}]\}$, such that we can obtain a density for $p_{\theta}(\mathbf{x}_{k}|x_{k-1})$ with respect to a reference measure that does not involve $\theta$. Recall that $p_{\theta}(d\mathbf{x}_{k}|x_{k-1})$ denotes the probability law for the augmented variable $\mathbf{x}_{k}$ given $x_{k-1}$. We also write the corresponding density as,

p_{\theta}(\mathbf{x}_{k}|x_{k-1}):=\frac{p_{\theta}(d\mathbf{x}_{k}|x_{k-1})}{\mu(d\mathbf{x}_{k})}.

The quantity of interest is now,

\mathcal{S}_{\theta,n}=\int S_{\theta}(\mathbf{x}_{0:n})\,p_{\theta}(d\mathbf{x}_{0:n}|y_{0:n}),\quad n\geq 1,

for the class of additive functionals $S_{\theta}(\cdot)$ of the structure,

S_{\theta}(\mathbf{x}_{0:n})=\sum_{k=0}^{n}s_{\theta,k}(x_{k-1},\mathbf{x}_{k}),

under the convention that $x_{-1}=\emptyset$. Notice that we now allow $s_{\theta,k}(\cdot,\cdot)$ to be a function of $x_{k-1}$ and $\mathbf{x}_{k}$; thus, $s_{\theta,k}(\cdot,\cdot)$ can potentially correspond to integrals, or other pathspace functionals. We will work with a transition density on the enlarged space of $\mathbf{x}_{k}$.

Similarly to the discrete-time case in Section 2, we define the functional,

T_{\theta,n}(\mathbf{x}_{n}):=\int S_{\theta}(\mathbf{x}_{0:n})\,p_{\theta}(d\mathbf{x}_{0:n-1}|y_{0:n-1},\mathbf{x}_{n}).
Proposition 1.

We have that,

\mathcal{S}_{\theta,n}=\int T_{\theta,n}(\mathbf{x}_{n})\,p_{\theta}(d\mathbf{x}_{n}|y_{0:n}).
Proof.

See Appendix A. ∎

Critically, as in (2.6), we obtain the following recursion.

Proposition 2.

For any $n\geq 1$, we have that,

T_{\theta,n}(\mathbf{x}_{n}) =\int\big{[}T_{\theta,n-1}(\mathbf{x}_{n-1})+s_{\theta,n}(x_{n-1},\mathbf{x}_{n})\big{]}\,p_{\theta}(d\mathbf{x}_{n-1}|y_{0:n-1},\mathbf{x}_{n})
\equiv\frac{\int\big{[}T_{\theta,n-1}(\mathbf{x}_{n-1})+s_{\theta,n}(x_{n-1},\mathbf{x}_{n})\big{]}\,p_{\theta}(\mathbf{x}_{n}|x_{n-1})\,p_{\theta}(d\mathbf{x}_{n-1}|y_{0:n-1})}{\int p_{\theta}(\mathbf{x}_{n}|x_{n-1})\,p_{\theta}(dx_{n-1}|y_{0:n-1})}.
Proof.

See Appendix B. ∎

Proposition 2 gives rise to a Monte-Carlo methodology for a forward-only, online approximation of the smoothing expectation of interest. This is given in Algorithm 2.

Algorithm 2 Online Forward-Only Particle Smoothing on Pathspace
  1. (i)

    Initialise the particles $\{x_{0}^{(i)},W_{0}^{(i)}\}_{i=1}^{N}$, with $x_{0}^{(i)}\stackrel{\mathrm{i.i.d.}}{\sim}p_{\theta}(dx_{0})$, $W_{0}^{(i)}\propto g_{\theta}(y_{0}|x_{0}^{(i)})$, and functionals $\widehat{T}_{\theta,0}(x_{0}^{(i)})=s_{\theta,0}(\mathbf{x}_{0}^{(i)})$, where $\mathbf{x}_{0}^{(i)}\equiv x_{0}^{(i)}$, $1\leq i\leq N$.

  2. (ii)

    Assume that at time $n-1$, one has a particle approximation $\{\mathbf{x}_{n-1}^{(i)},W_{n-1}^{(i)}\}_{i=1}^{N}$ of the filtering law $p_{\theta}(d\mathbf{x}_{n-1}|y_{0:n-1})$ and estimators $\widehat{T}_{\theta,n-1}(\mathbf{x}_{n-1}^{(i)})$ of $T_{\theta,n-1}(\mathbf{x}_{n-1}^{(i)})$, $1\leq i\leq N$.

  3. (iii)

    At time $n$, sample $\mathbf{x}_{n}^{(i)}$, for $1\leq i\leq N$, from

    \mathbf{x}_{n}^{(i)}\sim\widehat{p}_{\theta}(d\mathbf{x}_{n}|y_{0:n-1})=\sum_{j=1}^{N}W_{n-1}^{(j)}p_{\theta}(d\mathbf{x}_{n}|x_{n-1}^{(j)}),

    and assign particle weights $W_{n}^{(i)}\propto g_{\theta}(y_{n}|y_{n-1},\mathcal{F}_{n}^{(i)})$, $1\leq i\leq N$.

  4. (iv)

    Then set, for $1\leq i\leq N$,

    \widehat{T}_{\theta,n}(\mathbf{x}_{n}^{(i)}) =\sum_{j=1}^{N}\frac{W_{n-1}^{(j)}\,p_{\theta}(\mathbf{x}_{n}^{(i)}|x_{n-1}^{(j)})}{\sum_{l=1}^{N}W_{n-1}^{(l)}\,p_{\theta}(\mathbf{x}_{n}^{(i)}|x_{n-1}^{(l)})}\,\big{[}\widehat{T}_{\theta,n-1}(\mathbf{x}_{n-1}^{(j)})+s_{\theta,n}(x_{n-1}^{(j)},\mathbf{x}_{n}^{(i)})\big{]}.
  5. (v)

    Obtain an estimate of $\mathcal{S}_{\theta,n}$ as,

    \widehat{\mathcal{S}}_{\theta,n}=\sum_{i=1}^{N}W_{n}^{(i)}\,\widehat{T}_{\theta,n}(\mathbf{x}_{n}^{(i)}).
Remark 3.

Algorithm 2 uses a simple bootstrap filter with multinomial resampling applied at each step. The variability of the Monte-Carlo estimates can be further reduced by incorporating: more effective resampling (e.g., systematic resampling (Carpenter et al., 1999), stratified resampling (Kitagawa, 1996)); dynamic resampling via use of the Effective Sample Size; non-blind proposals in the propagation of the particles.
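The weights appearing in step (iv) of Algorithm 2 are ratios of pathspace densities $p_{\theta}(\mathbf{x}_{n}^{(i)}|x_{n-1}^{(j)})$; when implementing the step it can help to work with log-densities and shift by their maximum before exponentiating. A generic Julia sketch of the update for a single new particle is given below; the function name and interface are ours, for illustration.

```julia
# Step (iv) of Algorithm 2 for one new particle i, written generically:
# logp[j] = log p_theta(x_n^(i) | x_{n-1}^(j)), W are the previous weights,
# T the previous functional estimates, s[j] = s_{theta,n}(x_{n-1}^(j), x_n^(i)).
function update_T(logp::Vector{Float64}, W::Vector{Float64},
                  T::Vector{Float64}, s::Vector{Float64})
    w = W .* exp.(logp .- maximum(logp))   # shift for numerical stability
    w ./= sum(w)
    return sum(w .* (T .+ s))
end

update_T([-1.2, -0.4, -2.0], [0.2, 0.5, 0.3], [0.0, 0.1, -0.1], [0.3, 0.2, 0.4])
```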

4.2 Pathspace versus Finite-Dimensional Construct

One can attempt to define an algorithm without reference to the underlying pathspace. That is, in the case of no jumps (for simplicity), an alternative approach can involve working with a regular grid on the period $[0,T]$, say $\{s_{j}=j\delta\}_{j=0}^{M}$, with $\delta=T/M$ for chosen size $M\geq 1$. Then, defining $\mathbf{x}^{\prime}=(x_{\delta},x_{2\delta},\ldots,x_{M\delta})$, and using, e.g., an Euler-Maruyama time-discretisation scheme to obtain the joint density of such an $\mathbf{x}^{\prime}$ given $x=x_{0}$, a Radon-Nikodym derivative, $p_{\theta}^{M}(\mathbf{x}^{\prime}|x)$ on $\mathbb{R}^{M\times d_{x}}$, can be obtained with respect to the Lebesgue reference measure $\mathrm{Leb}^{\otimes(d_{x}\times M)}$, as a product of $M$ conditionally Gaussian densities. As shown e.g. in the motivating example in the Introduction, such an approach would lead to estimates with variability that increases rapidly with $M$, for a fixed number of Monte-Carlo particles $N$. A central argument in this work is that one should develop the algorithm in a manner that respects the probabilistic properties of the SDE pathspace, before applying (necessarily) a time-discretisation for implementation on a PC. This procedure is not followed merely for purposes of mathematical rigour; it has practical effects on algorithmic performance.

4.3 Consistency

For completeness, we provide an asymptotic result for Algorithm 2 following standard results from the literature. We consider the following assumptions.

Assumption 4.

Let $\mathbf{X}$ and $\mathsf{X}$ denote the state spaces of $\mathbf{x}$ and $x$, respectively.

  1. (i)

    For any relevant $y^{\prime}$, $y$, $\mathcal{F}$, $x^{\prime}$, we have that $g_{\theta}(y^{\prime}|y,\mathcal{F})\equiv g_{\theta}(y^{\prime}|x^{\prime})$, where the latter is a positive function such that, for any $y$, $\sup_{x\in\mathsf{X}}g_{\theta}(y|x)<\infty$.

  2. (ii)

    $\sup_{(x,\mathbf{x}^{\prime})\in\mathsf{X}\times\mathbf{X}}p_{\theta}(\mathbf{x}^{\prime}|x)<\infty$.

Proposition 3.
  1. (i)

    Under Assumption 4, for any $n\geq 0$, there exist constants $b_{n},c_{n}>0$, such that for any $\epsilon>0$,

    \mathrm{Prob}\big{[}\,|\mathcal{S}_{\theta,n}-\widehat{\mathcal{S}}_{\theta,n}|>\epsilon\,\big{]}\leq b_{n}e^{-c_{n}N\epsilon^{2}}.
  2. (ii)

    For any $n\geq 0$, $\widehat{\mathcal{S}}_{\theta,n}\rightarrow\mathcal{S}_{\theta,n}$ w.p. 1, as $N\rightarrow\infty$.

Proof.

See Appendix C. ∎

5 Online Parameter/State Estimation for SDEs

In this section, we derive an online gradient-ascent algorithm for partially observed SDEs. Poyiadjis et al., (2011) use the score function estimation methodology to propose an online gradient-ascent algorithm for obtaining an MLE-type parameter estimate, following ideas in LeGland and Mevel, (1997). In more detail, the method is based on a Robbins-Monro-type recursion,

\theta_{n+1} =\theta_{n}+\gamma_{n+1}\nabla\log p_{\theta_{0:n}}(y_{n}|y_{0:n-1})
=\theta_{n}+\gamma_{n+1}\big{\{}\nabla\log p_{\theta_{0:n}}(y_{0:n})-\nabla\log p_{\theta_{0:n-1}}(y_{0:n-1})\big{\}},   (5.1)

where $\{\gamma_{k}\}_{k}$ is a positive decreasing sequence with,

\sum_{k=1}^{\infty}\gamma_{k}=\infty,\quad\sum_{k=1}^{\infty}\gamma_{k}^{2}<\infty.

The meaning of the quantity $\nabla\log p_{\theta_{0:n}}(y_{0:n})$ is that – given a recursive method (in $n$) for the estimation of $\theta\mapsto\nabla\log p_{\theta}(y_{0:n})$ as we describe below and based on the methodology of Algorithm 2 – one uses $\theta_{n-1}$ when incorporating $y_{n-1}$, then $\theta_{n}$ for $y_{n}$, and similarly for $k>n$. See LeGland and Mevel, (1997); Tadic and Doucet, (2018) for analytical studies of the convergence properties of the deduced algorithm, where under strong conditions the recursion is shown to converge to the ‘true’ parameter value, say $\theta^{\dagger}$, as $n\rightarrow\infty$.

Observe that, from Fisher’s identity (see e.g. Poyiadjis et al., (2011)) we have that,

\nabla\log p_{\theta}(y_{0:n})=\int\nabla\log p_{\theta}(\mathbf{x}_{0:n},y_{0:n})\,p_{\theta}(d\mathbf{x}_{0:n}|y_{0:n}).

Thus, in the context of Algorithm 2, estimation of the score function corresponds to the choice,

S_{\theta}(\mathbf{x}_{0:n})=\nabla\log p_{\theta}(\mathbf{x}_{0:n},y_{0:n})\equiv\sum_{k=0}^{n}\nabla\log p_{\theta}(\mathbf{x}_{k},y_{k}|x_{k-1}),

and,

s_{\theta,k}(x_{k-1},\mathbf{x}_{k})\equiv\nabla\log p_{\theta}(\mathbf{x}_{k},y_{k}|x_{k-1}).   (5.2)
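When an analytical expression for the gradient in (5.2) is inconvenient, a central finite-difference approximation of $\theta\mapsto\log p_{\theta}(\mathbf{x}_{k},y_{k}|x_{k-1})$ (evaluated via its time-discretised version) can be substituted. The Julia sketch below is one such generic option, not necessarily the route followed in our examples; the function name is ours.

```julia
# Central finite-difference approximation of the gradient in (5.2),
# given a function theta -> logp(theta) that evaluates the (time-discretised)
# log-density for fixed arguments (x_{k-1}, x_k, y_k).
function fd_gradient(logp, theta::Vector{Float64}; h=1e-5)
    grad = similar(theta)
    for j in eachindex(theta)
        tp = copy(theta); tp[j] += h
        tm = copy(theta); tm[j] -= h
        grad[j] = (logp(tp) - logp(tm)) / (2h)
    end
    return grad
end

fd_gradient(th -> -0.5 * sum(abs2, th), [0.3, -0.2])   # gradient of a toy log-density
```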

Combination of the Robbins-Monro recursion (5.1) with the one in Algorithm 2 delivers Algorithm 3, which we have presented here in some detail for clarity.

Algorithm 3 Online Gradient-Ascent for SDEs via Forward-Only Smoothing
  1. (i)

    Assume that at time $n\geq 0$, one has a current parameter estimate $\hat{\theta}_{n}$, a particle approximation $\{\mathbf{x}_{n}^{(i)},W_{n}^{(i)}\}_{i=1}^{N}$ of the ‘filter’ $p_{\hat{\theta}_{0:n}}(d\mathbf{x}_{n}|y_{0:n})$ and estimators $\widehat{T}_{\hat{\theta}_{0:n},n}(\mathbf{x}_{n}^{(i)})$ of $T_{\hat{\theta}_{0:n},n}(\mathbf{x}_{n}^{(i)})$, for $1\leq i\leq N$.

  2. (ii)

    Apply the iteration,

    \hat{\theta}_{n+1}=\hat{\theta}_{n}+\gamma_{n+1}\big{\{}\nabla\log p_{\hat{\theta}_{0:n}}(y_{0:n})-\nabla\log p_{\hat{\theta}_{0:n-1}}(y_{0:n-1})\big{\}}.
  3. (iii)

    At time $n+1$, sample $\mathbf{x}_{n+1}^{(i)}$, for $1\leq i\leq N$, from the mixture,

    \mathbf{x}_{n+1}^{(i)}\sim\widehat{p}_{\hat{\theta}_{n+1}}(d\mathbf{x}_{n+1}|y_{0:n})=\sum_{j=1}^{N}W_{n}^{(j)}\,p_{\hat{\theta}_{n+1}}(d\mathbf{x}_{n+1}|x_{n}^{(j)}),

    and assign particle weights $W_{n+1}^{(i)}\propto g_{\hat{\theta}_{n+1}}(y_{n+1}|y_{n},\mathcal{F}_{n+1}^{(i)})$, $1\leq i\leq N$.

  4. (iv)

    Then set, for $1\leq i\leq N$,

    \widehat{T}_{\hat{\theta}_{0:n+1},n+1}(\mathbf{x}_{n+1}^{(i)}) =\sum_{j=1}^{N}\frac{W_{n}^{(j)}\,p_{\theta}(\mathbf{x}_{n+1}^{(i)}|x_{n}^{(j)})}{\sum_{l=1}^{N}W_{n}^{(l)}\,p_{\theta}(\mathbf{x}_{n+1}^{(i)}|x_{n}^{(l)})}\,\big{[}\widehat{T}_{\hat{\theta}_{0:n},n}(\mathbf{x}_{n}^{(j)})+s_{\theta,n+1}(x_{n}^{(j)},\mathbf{x}_{n+1}^{(i)})\big{]},

    where, on the right-hand-side we use the parameter,

    \theta=\hat{\theta}_{n+1}.
  5. (v)

    Obtain the estimate,

    \widehat{\mathcal{S}}_{\hat{\theta}_{n+1},n+1}=\sum_{i=1}^{N}W_{n+1}^{(i)}\,\widehat{T}_{\hat{\theta}_{0:n+1},n+1}(\mathbf{x}_{n+1}^{(i)}).
Remark 4.

When the joint density of $(\mathbf{x}_{0:n},y_{0:n})$ is in the exponential family, an online EM algorithm can also be developed; see Del Moral et al., (2010) for the discrete-time case.

6 Numerical Applications

All numerical experiments shown in this section were obtained on a standard PC, with code written in Julia. Unless otherwise stated, executed algorithms discretised the pathspace using the Euler–Maruyama scheme with $M=10$ imputed points between consecutive observations, the latter assumed to be separated by 1 time unit. Algorithmic costs scale linearly with $M$.

To specify the schedule for the scaling parameters {γk}\{\gamma_{k}\} in (5.1), we use the standard adaptive method termed ADAM, by Kingma and Ba, (2014). Assume that after nn steps, we have cn:=(logpθ^0:n(y0:n)logpθ^0:n1(y0:n1))c_{n}:=-(\nabla\log p_{\hat{\theta}_{0:n}}(y_{0:n})-\nabla\log p_{\hat{\theta}_{0:n-1}}(y_{0:n-1})). ADAM applies the following iterative procedure,

mn=mn1β1+(1β1)cn,vn=vn1β2+(1β2)cn2,\displaystyle m_{n}=m_{n-1}\beta_{1}+(1-\beta_{1})c_{n},\quad v_{n}=v_{n-1}\beta_{2}+(1-\beta_{2})c_{n}^{2},
m^n=mn/(1β1n),v^n=vn/(1β2n),\displaystyle\hat{m}_{n}=m_{n}/(1-\beta_{1}^{n}),\quad\hat{v}_{n}=v_{n}/(1-\beta_{2}^{n}),
θ^n+1=θ^nαm^n/(v^n+ϵ),\displaystyle\hat{\theta}_{n+1}=\hat{\theta}_{n}-\alpha\hat{m}_{n}/(\sqrt{\hat{v}_{n}}+\epsilon),

where (β1,β2,α,ϵ)(\beta_{1},\beta_{2},\alpha,\epsilon) are tuning parameters. Convergence properties of ADAM have been widely studied (Goodfellow et al.,, 2016; Kingma and Ba,, 2014; Reddi et al.,, 2019). Following Kingma and Ba, (2014), in all cases below we set (β1,β2,α,ϵ)=(0.9,0.999,0.001,108)(\beta_{1},\beta_{2},\alpha,\epsilon)=(0.9,0.999,0.001,10^{-8}). ADAM is nowadays a standard and very effective addition to the type of recursive inference algorithms we consider here, even more so for an increasing dimension of the unknown parameter vector. See the above references for more motivation and details.
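As an illustration of the above recursion, the following is a minimal Julia sketch of one ADAM step, written for a generic gradient increment c; it is a sketch, not the exact code used in our experiments.

    # Minimal ADAM sketch: state (m, v, n) plus tuning parameters (α, β1, β2, ϵ).
    mutable struct Adam
        α::Float64; β1::Float64; β2::Float64; ϵ::Float64
        m::Vector{Float64}; v::Vector{Float64}; n::Int
    end

    Adam(d; α=0.001, β1=0.9, β2=0.999, ϵ=1e-8) =
        Adam(α, β1, β2, ϵ, zeros(d), zeros(d), 0)

    # One ADAM step: returns the updated parameter vector θ given increment c.
    function step!(opt::Adam, θ::Vector{Float64}, c::Vector{Float64})
        opt.n += 1
        opt.m .= opt.β1 .* opt.m .+ (1 - opt.β1) .* c
        opt.v .= opt.β2 .* opt.v .+ (1 - opt.β2) .* c .^ 2
        mhat = opt.m ./ (1 - opt.β1^opt.n)             # bias-corrected first moment
        vhat = opt.v ./ (1 - opt.β2^opt.n)             # bias-corrected second moment
        return θ .- opt.α .* mhat ./ (sqrt.(vhat) .+ opt.ϵ)
    end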

6.1 Ornstein-Uhlenbeck SDE

We consider the model,

\displaystyle dX_{t}=\theta_{1}(\theta_{2}-X_{t})\,dt+\theta_{3}\,dW_{t}+dJ_{t},\quad X_{0}=0,
yi=xi+ϵi,i1,ϵii.i.d.𝒩(0,0.12),Δ=1,\displaystyle y_{i}=x_{i}+\epsilon_{i},\quad i\geq 1,\quad\epsilon_{i}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}\mathcal{N}(0,0.1^{2}),\quad\Delta=1, (6.1)

where Jt=i=1NtξiJ_{t}=\sum_{i=1}^{N_{t}}\xi_{i}, NtN_{t} is a Poisson process of intensity λ0\lambda\geq 0, and ξii.i.d.𝒰(ζ,ζ)\xi_{i}\overset{i.i.d.}{\sim}\mathcal{U}\left(-\zeta,\zeta\right), ζ>0\zeta>0.
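For reference, the following is a minimal Julia sketch (an illustration, not the exact experimental code) of Euler-Maruyama simulation of (6.1) with M imputed points per unit observation interval.

    # Simulate the jump O-U model (6.1): Euler-Maruyama for the diffusion part,
    # compound Poisson increments for the jumps, Gaussian observation noise.
    using Random, Distributions

    function simulate_ou_jump(θ1, θ2, θ3, λ, ζ, n; M=10, σy=0.1, rng=Random.default_rng())
        δ = 1.0 / M                                   # Euler step within each unit interval
        x = 0.0
        xs = zeros(n); ys = zeros(n)
        for i in 1:n
            for _ in 1:M
                x += θ1 * (θ2 - x) * δ + θ3 * sqrt(δ) * randn(rng)
                k = λ > 0 ? rand(rng, Poisson(λ * δ)) : 0   # number of jumps in the step
                k > 0 && (x += sum(rand(rng, Uniform(-ζ, ζ), k)))
            end
            xs[i] = x
            ys[i] = x + σy * randn(rng)               # noisy observation at integer times
        end
        return xs, ys
    end

    xs, ys = simulate_ou_jump(0.4, 0.0, 0.5, 0.0, 0.5, 1_000)   # e.g. Experiment 1 setting (λ=0)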

O–U: Experiment 1. Here, we choose λ=0\lambda=0, so the score function is analytically available; we also fix θ2=0\theta_{2}=0. We simulated datasets of size n=2,500,5,000,7,500,10,000n=2,500,5,000,7,500,10,000 under true remaining parameter values (θ1,θ3)=(0.4,0.5)(\theta_{1}^{\dagger},\theta_{3}^{\dagger})=(0.4,0.5). We executed Algorithm 2 to approximate the score function with N=50,100,150N=50,100,150 particles, for all of the above datasets. Fig. 6.1 summarises the results of R=50R=50 replications of the estimates of the bivariate score function, with the dashed lines showing the true values.

O–U: Experiment 2. We now present two scenarios: one still without jumps (i.e., λ=0\lambda=0); and one with jumps, where we fix λ=0.5\lambda=0.5, ζ=0.5\zeta=0.5. In both scenarios, the parameters to be estimated are θ=(θ1,θ2,θ3)\theta=(\theta_{1},\theta_{2},\theta_{3}). Beyond the above fixed values for λ\lambda, ζ\zeta, in both scenarios we generated n=20,000n=20,000 data points using the true parameter values θ=(θ1,θ2,θ3)=(0.2,0.0,0.2)\theta^{\dagger}=(\theta_{1}^{\dagger},\theta_{2}^{\dagger},\theta_{3}^{\dagger})=(0.2,0.0,0.2). We applied Algorithm 3 with N=100N=100 particles, and initial parameter values θ^0=(1.0,1.0,1.0)\hat{\theta}_{0}=(1.0,1.0,1.0) in both scenarios. We used Construct 1 for the scenario with jumps. Fig. 6.2 summarises the results.

O–U: Experiment 3. To compare the efficiency of Constructs 1 and 2 for the case of jump processes, as developed in Section 3.2, we applied Algorithm 2 – for each case – to approximate the score function (at the true values for (θ1,θ2,θ3)(\theta_{1},\theta_{2},\theta_{3})), with N=50,100,150,200N=50,100,150,200 particles, for n=10n=10 observations generated from the O–U process (6.1) with fixed λ=0.5\lambda=0.5, ζ=0.5\zeta=0.5, and true remaining parameters (θ1,θ2,θ3)=(0.3,0.0,0.2)(\theta_{1}^{\dagger},\theta_{2}^{\dagger},\theta_{3}^{\dagger})=(0.3,0.0,0.2). Fig. 6.3 shows boxplots summarising the results from R=50R=50 replications of the obtained estimates. We note that the performance of Construct 2 is, in this case, much better than that of Construct 1.

Figure 6.1: O–U: Experiment 1: Boxplots of estimated score functions of θ1\theta_{1} at θ1=0.4\theta_{1}=0.4, over R=50R=50 experiment replications, for the O–U model (6.1), without jumps (λ=0\lambda=0), fixed θ2=0\theta_{2}=0, and free parameters (θ1,θ3)(\theta_{1},\theta_{3}), with true values (θ1,θ3)=(0.4,0.5)(\theta_{1}^{\dagger},\theta_{3}^{\dagger})=(0.4,0.5) used for data generation. The left, middle and right boxplots in each panel correspond to values N=50,100,150N=50,100,150, respectively. The horizontal dashed lines show the correct values (33.1,24.7,154.2,48.7)(-33.1,24.7,-154.2,-48.7) for datasizes n=2,500,5,000,7,500,10,000n=2,500,5,000,7,500,10,000, respectively.
Figure 6.2: O–U: Experiment 2: Trajectories from execution of Algorithm 3 for the estimation of parameters (θ1,θ2,θ3)(\theta_{1},\theta_{2},\theta_{3}) of the O–U model (6.1). We used N=100N=100 particles and initial value θ^0=(1.0,1.0,1.0)\hat{\theta}_{0}=(1.0,1.0,1.0). Left Panel: results for O–U model with jumps (λ=0.5\lambda=0.5, ζ=0.5\zeta=0.5). Right Panel: Results for O–U model without jumps (λ=0\lambda=0). The horizontal dashed lines in the plots show the true parameter values (θ1,θ2,θ3)=(0.2,0.0,0.2)(\theta_{1}^{\dagger},\theta_{2}^{\dagger},\theta_{3}^{\dagger})=(0.2,0.0,0.2) used in both scenarios.
Figure 6.3: O–U: Experiment 3: Boxplots of estimated score function of θ1\theta_{1}, at θ1=0.3\theta_{1}=0.3, over R=50R=50 experiment replications, for O–U model (6.1), with fixed λ=0.5\lambda=0.5, ζ=0.5\zeta=0.5, and true parameters (θ1,θ2,θ3)=(0.3,0.0,0.2)(\theta_{1}^{\dagger},\theta_{2}^{\dagger},\theta_{3}^{\dagger})=(0.3,0.0,0.2), used to generate n=10n=10 observations. Boxplots on the right (left) of each panel correspond to Construct 1 (Construct 2).

6.2 Periodic Drift SDE

We consider the (highly) nonlinear model,

dXt=sin(Xtθ1)dt+θ2dWt,X0=0,\displaystyle dX_{t}=\sin\left(X_{t}-\theta_{1}\right)dt+\theta_{2}dW_{t},\quad X_{0}=0,
yi=xi+ϵi,i1,ϵii.i.d.𝒩(0,0.12),Δ=1,\displaystyle y_{i}=x_{i}+\epsilon_{i},\quad i\geq 1,\quad\epsilon_{i}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}\mathcal{N}(0,0.1^{2}),\quad\Delta=1, (6.2)

for θ1[0,2π)\theta_{1}\in[0,2\pi), θ2>0\theta_{2}>0. We used true values (θ1,θ2)=(π/4,0.9)(\theta_{1}^{\dagger},\theta_{2}^{\dagger})=(\pi/4,0.9), generated n=104n=10^{4} data points, and applied Algorithm 3 with N=100N=100 particles and initial values (θ^1,0,θ^2,0)=(0.1,2)(\hat{\theta}_{1,0},\hat{\theta}_{2,0})=(0.1,2). To illustrate the mesh-free performance of Algorithm 3, we consider two values, M=10M=10 and M=100M=100, and compare the obtained numerical results. Figure 6.4 shows the data (Left Panel) and the trajectories of the estimated parameter values (Right Panel). Note that the SDE in (6.2) makes infrequent transitions between a number of modes. Indeed, in this case, the results are stable when increasing MM from 1010 to 100100.

Figure 6.4: Left Panel: Data generated from model (6.2) for true values θ=(π/4,0.9)\theta^{\dagger}=(\pi/4,0.9). Right Panel: Trajectories obtained from Algorithm 3, with N=100N=100 and initial values (θ^1,0,θ^2,0)=(0.1,2)(\hat{\theta}_{1,0},\hat{\theta}_{2,0})=(0.1,2). Solid lines show the results for M=10M=10 and dotted lines for M=100M=100.

6.3 Heston Model

Consider the Heston model (Heston,, 1993), for price StS_{t} and volatility XtX_{t}, modelled as,

dSt\displaystyle dS_{t} =θ4Stdt+StXt1/2dBt,\displaystyle=\theta_{4}S_{t}dt+S_{t}X_{t}^{1/2}dB_{t},
dXt\displaystyle dX_{t} =θ1(θ2Xt)dt+θ3Xt1/2dWt,X0=θ2,\displaystyle=\theta_{1}(\theta_{2}-X_{t})dt+\theta_{3}X_{t}^{1/2}dW_{t},\quad X_{0}=\theta_{2},

where BtB_{t}, WtW_{t} are independent standard Brownian motions. Define the log-price Ut=log(St)U_{t}=\log(S_{t}), so that application of Itô’s lemma gives,

dUt\displaystyle dU_{t} =(θ4Xt2)dt+Xt1/2dBt,\displaystyle=(\theta_{4}-\tfrac{X_{t}}{2})dt+X_{t}^{1/2}dB_{t},
dXt\displaystyle dX_{t} =θ1(θ2Xt)dt+θ3Xt1/2dWt,X0=θ2.\displaystyle=\theta_{1}(\theta_{2}-X_{t})dt+\theta_{3}X_{t}^{1/2}dW_{t},\quad X_{0}=\theta_{2}. (6.3)

We make the standard assumption 2θ1θ2>θ322\theta_{1}\theta_{2}>\theta_{3}^{2}, so that the CIR process X={Xt}X=\{X_{t}\} will not hit 0; also, we have θ1,θ2,θ3,θ4>0\theta_{1},\theta_{2},\theta_{3},\theta_{4}>0. Process UtU_{t} is observed at discrete times, so that yi=Utiy_{i}=U_{t_{i}}, i1i\geq 1. Thus, we have,

[yi|yi1,{Xs;s[ti1,ti]}]\displaystyle\big{[}\,y_{i}\,\big{|}\,y_{i-1},\,\{X_{s};s\in[t_{i-1},t_{i}]\}\,\big{]} 𝒩(yi;μi,Σi),\displaystyle\sim\mathcal{N}(y_{i};\mu_{i},\Sigma_{i}),

where we have set,

μi=yi1+ti1ti(θ4Xs2)𝑑s,Σi\displaystyle\mu_{i}=y_{i-1}+\int_{t_{i-1}}^{t_{i}}(\theta_{4}-\tfrac{X_{s}}{2})ds,\quad\Sigma_{i} =ti1tiXs𝑑s.\displaystyle=\int_{t_{i-1}}^{t_{i}}X_{s}ds.
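In the time-discretised algorithm, these integrals are approximated over the M imputed volatility values. As a minimal illustration (assuming simple left-point Riemann sums, which is one possible choice and not necessarily the exact scheme used in our code), the Gaussian observation density given an imputed volatility path could be evaluated as follows.

    # Gaussian density of y_i given y_{i-1} and the imputed volatility path on
    # [t_{i-1}, t_i]; μ_i and Σ_i approximated by left-point Riemann sums (an
    # illustrative assumption).
    using Distributions

    function heston_obs_density(yi, yprev, Xpath, θ4, Δ)
        M = length(Xpath)                  # imputed values of X_s on [t_{i-1}, t_i)
        δ = Δ / M
        μ = yprev + sum(θ4 .- Xpath ./ 2) * δ
        Σ = sum(Xpath) * δ
        return pdf(Normal(μ, sqrt(Σ)), yi)
    end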

We chose true parameter value θ=(0.1,1.0,0.2,0.45)\theta^{\dagger}=(0.1,1.0,0.2,0.45), and generated n=104n=10^{4} observations, with Δ=1\Delta=1. We applied Algorithm 3 with θ^0=(0.005,0.1,0.4,0.3)\hat{\theta}_{0}=(0.005,0.1,0.4,0.3) and N=100N=100 particles. The trajectories of the estimated parameters are given in Figure 6.5.

Figure 6.5: Parameter estimate trajectories produced by Algorithm 3, for data generated from the Heston model (6.3). The algorithm used initial values θ^0=(0.005,0.1,0.4,0.3)\hat{\theta}_{0}=(0.005,0.1,0.4,0.3) and N=100N=100 particles. The horizontal dashed lines indicate the true parameter values θ=(0.1,1.0,0.2,0.45)\theta^{\dagger}=(0.1,1.0,0.2,0.45).

6.4 Sequential Model Selection – Real Data

We use our methodology to carry out sequential model selection, motivated by applications in Eraker et al., (2003); Johannes et al., (2009). Recall that BIC (Schwarz,, 1978) for model \mathcal{M}, with parameter vector θ\theta, and data y1:ny_{1:n} is given by,

BIC()\displaystyle BIC(\mathcal{M}) :=2θ^MLE(y1:n)+dim(θ)logn,\displaystyle:=-2\ell_{\hat{\theta}_{MLE}}(y_{1:n})+\mathrm{dim}(\theta)\log n,

where θ^MLE\hat{\theta}_{MLE} denotes the MLE, and ()\ell(\cdot) the log-likelihood. BIC and the (closely related) Bayes Factor are known to have good asymptotic properties, e.g. they are consistent model selection criteria, under the assumption of model-correctness, in the context of nested models, for specific classes of models (see e.g. Chib and Kuffner, (2016); Nishii, (1988); Sin and White, (1996), and Yonekura et al., (2018) for the case of discrete-time nested HMMs). One can plausibly conjecture that such criteria will also perform well for the type of continuous-time HMMs we consider here. See also Eguchi and Masuda, (2018) for a rigorous analysis of BIC for diffusion-type models. Given models a,b\mathcal{M}_{a},\mathcal{M}_{b}, with ab\mathcal{M}_{a}\subset\mathcal{M}_{b}, we use the difference,

\displaystyle BIC(\mathcal{M}_{ab}):=-2\big{(}\ell_{\hat{\theta}_{MLE,a}}(y_{1:n})-\ell_{\hat{\theta}_{MLE,b}}(y_{1:n})\big{)}+(\mathrm{dim}(\theta_{a})-\mathrm{dim}(\theta_{b}))\log n,

with the involved terms defined in the obvious manner, to choose between a\mathcal{M}_{a} and b\mathcal{M}_{b}.

We note that Eraker et al., (2003) use the Bayes Factor for model selection in the context of SDE processes; that work uses MCMC to estimate parameters, and is not sequential (or online). We stress that our methodology allows us to carry out inference for the parameters and the signal part of the model, while simultaneously allowing for model selection, all in an online fashion. Also, it is worth noting that Johannes et al., (2009) use the sequential likelihood ratio for model comparison, but such a quantity can overshoot, i.e. the likelihood ratio will tend to favour the larger model. Moreover, that work uses fixed calibrated parameters, so it does not relate to online parameter inference.

Remark 5.

We will use the running estimate of the parameter vector as a proxy for the MLE given the data taken under consideration so far. Similarly, we use the weights of the particle filter to obtain a running proxy of the log-likelihood evaluated at the MLE, thus, overall, an online approximation of BIC. Such an approach, even if not fully justified theoretically, can provide reasonable practical insights when performing model comparison in an online manner – particularly so, given that, to the best of our knowledge, there is no alternative for such an objective.
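As a minimal illustration of this running approximation, consider the Julia sketch below; here loglik_increment stands for the particle-filter estimate of the log-likelihood increment log p(y_n | y_{0:n-1}) at the current running parameter estimate (e.g. the log of the average unnormalised weight) – this is an assumption made for the purpose of the sketch.

    # Running BIC per model: accumulate the particle-filter log-likelihood
    # increments and penalise by the model dimension.
    mutable struct OnlineBIC
        loglik::Float64
        dim::Int
    end

    update!(b::OnlineBIC, loglik_increment::Float64) = (b.loglik += loglik_increment; b)

    bic(b::OnlineBIC, n::Int) = -2 * b.loglik + b.dim * log(n)

    # Running BIC difference between nested models M_a ⊂ M_b after n observations.
    bic_diff(ba::OnlineBIC, bb::OnlineBIC, n::Int) = bic(ba, n) - bic(bb, n)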

We consider the following family of nested SDE models,

\displaystyle dX_{t}=b_{\theta}^{(i)}(X_{t})\,dt+\theta_{4}\sqrt{X_{t}}\,dW_{t},

where,

1:\displaystyle\mathcal{M}_{1}: bθ(1)(Xt)=θ0+θ1Xt,\displaystyle\quad b_{\theta}^{(1)}(X_{t})=\theta_{0}+\theta_{1}X_{t},
2:\displaystyle\mathcal{M}_{2}: bθ(2)(Xt)=θ0+θ1Xt+θ22Xt2,\displaystyle\quad b_{\theta}^{(2)}(X_{t})=\theta_{0}+\theta_{1}X_{t}+\theta_{2}^{2}X_{t}^{2},
3:\displaystyle\mathcal{M}_{3}: bθ(3)(Xt)=θ0+θ1Xt+θ22Xt2+θ3Xt.\displaystyle\quad b_{\theta}^{(3)}(X_{t})=\theta_{0}+\theta_{1}X_{t}+\theta_{2}^{2}X_{t}^{2}+\frac{\theta_{3}}{X_{t}}.

Such models have been used for short-term interest rates; see Jones, (2003); Durham, (2003); Aït-Sahalia, (1996); Bali and Wu, (2006) and references therein for more details. Motivated by Dellaportas et al., (2006); Stanton, (1997), we applied our methodology to daily 3-month Treasury Bill rates, from 2 January 1970 to 29 December 2000, providing n=7,739n=7,739 observations. The data can be obtained from the Federal Reserve Bank of St. Louis, at the webpage https://fred.stlouisfed.org/series/TB3MS. The dataset is shown in Fig. 6.6.

Figure 6.6: Daily values of 3-Month Treasury Bill rates from 2 Jan 1970 to 29 Dec 2000.

Fig. 6.7 shows the BIC differences, obtained online, for each of the three pairs of models. The results suggest that 2\mathcal{M}_{2} does not provide a good fit to the data – relative to 1\mathcal{M}_{1}, 3\mathcal{M}_{3} – for almost the entire period under consideration. In terms of 1\mathcal{M}_{1} against 3\mathcal{M}_{3}, there is evidence that once the data from around 1993 onwards are taken into consideration, 3\mathcal{M}_{3} is preferred to 1\mathcal{M}_{1}. In general, one can claim that models with non-linear drift should be taken into consideration when fitting daily 3-month Treasury Bill rates, without strong evidence in favour of or against linearity. This non-definite conclusion is in partial agreement with the empirical studies in Chapman and Pearson, (2000); Durham, (2003); Dellaportas et al., (2006).

Figure 6.7: Online estimation of BIC differences. The solid line stands for BIC(32)BIC(\mathcal{M}_{32}), the dashed line for BIC(31)BIC(\mathcal{M}_{31}), and the dotted line for BIC(21)BIC(\mathcal{M}_{21}).

7 Conclusions and Future Directions

We have introduced an online particle smoothing methodology for discretely observed (jump) diffusions with intractable transition densities. Our approach overcomes such intractability by formulating the problem on pathspace, thus delivering an algorithm that – besides regularity conditions – requires only the weak invertibility Assumption 2. We have thus covered a rich family of SDE models, whereas related literature imposes quite strong restrictions. Combining our online smoothing algorithm with a Robbins-Monro-type approach of Recursive Maximum-Likelihood, we have set up an online stochastic gradient-ascent procedure for the likelihood function of the SDEs under consideration. The algorithm delivers a wealth of interesting output that can yield useful insights in statistical applications. The numerical examples indicate that the methodology performs promisingly. Our framework opens up a number of routes for insights and future research, including the ones described below.

  • (i)

    In the case of SDEs with jumps, we have focused on jump dynamics driven by compound Poisson processes. There is great scope for generalisation here: one can extend the algorithm to different classes of jump processes, also characterised by more complex dependencies between the jumps and the paths of the solution of the SDE, X={Xt}X=\{X_{t}\}. Extensions to time-inhomogeneous cases with b(v)=b(t,v)b(v)=b(t,v), σ(v)=σ(t,v)\sigma(v)=\sigma(t,v), are immediate; we have chosen the time-homogeneous setting only for simplicity of presentation.

  • (ii)

    Since the seminal work of Delyon and Hu, (2006), more ‘tuned’ auxiliary bridge processes have appeared in the literature, see e.g. the works of Schauer et al., (2017); van der Meulen and Schauer, (2017). Indicatively, Schauer et al., (2017) consider bridges of the form (one of the several options put forward in that work),

    dX~t={b(X~t)+Σ(X~t)Σ1(x)xX~tTt}dt+σ(X~t)dWt.d\tilde{X}_{t}=\Big{\{}b(\tilde{X}_{t})+\Sigma(\tilde{X}_{t})\Sigma^{-1}(x^{\prime})\frac{x^{\prime}-\tilde{X}_{t}}{T-t}\Big{\}}dt+\sigma(\tilde{X}_{t})dW_{t}. (7.1)

    Auxiliary bridge processes that are closer in dynamics to the diffusion bridges of the given signal are expected to reduce the variability of the Monte-Carlo algorithm, thus progress along the above direction can be immediately incorporated into our methodology to improve its performance. For instance, as noted in Schauer et al., (2017), use of (7.1) will give a Radon-Nikodym derivative where stochastic integrals cancel out. Such a setting is known to considerably reduce the variability of Monte-Carlo methods, see e.g. the numerical examples in Durham and Gallant, (2002) and the discussion in (Papaspiliopoulos and Roberts,, 2012, Section 4). A small simulation sketch of (7.1) is given after this list.

  • (iii)

    The exact specification of the recursion used for the online estimation of unknown parameters is in itself a problem of intensive research in the field of stochastic optimisation. One would ideally aim for the recursive procedure to provide parameter estimates which are as close to the unknown parameter as the data (considered thus far) permit. In our case, we have used a fairly ‘vanilla’ recursion, perhaps with the exception of the ADAM variation. E.g., recent works in the Machine Learning community have pointed to the use of ‘velocity’ components in the recursion to speed up convergence, see, e.g., Sutskever et al., (2013); Yuan et al., (2016).

  • (iv)

    We mentioned in the main text several modifications that can improve algorithmic performance: dynamic resampling, stratified resampling, non-blind proposals in the filtering steps, choice of auxiliary processes. Parallelisation and use of HPC are obvious additions to this list.

  • (v)

    We stress that the algorithm involves a filtering step, and a step that approximates the values of the instrumental function. These two procedures should be thought of separately. One reason for including two approaches in the case of jump diffusions (Constructs 1 and 2) is indeed to highlight this point. The two Constructs are identical in terms of the filtering part. Construct 1 incorporates in 𝐱\mathbf{x}^{\prime} the location of the path at all times of jumps; thus, when the algorithm ‘mixes’ all pairs of {xk1(j)}\{x_{k-1}^{(j)}\}, {𝐱k(i)}\{\mathbf{x}_{k}^{(i)}\}, at the update of the instrumental function (see Step (iv) of Algorithm 2), many of such pairs can be incompatible (simply consider any such pair with iji\neq j, when {𝐱k(i)}\{\mathbf{x}_{k}^{(i)}\} has in fact been generated conditionally on {xk1(i)}\{x_{k-1}^{(i)}\}). Such an effect is even stronger in the case of the standard algorithm applied in the motivating example in the Introduction, and partially explains the inefficiency of that algorithm. In contrast, in Construct 2, 𝐱\mathbf{x}^{\prime} contains less information about the underlying paths, thereby improving the compatibility of pairs selected from particle populations {xk1(j)}\{x_{k-1}^{(j)}\}, {𝐱k(i)}\{\mathbf{x}_{k}^{(i)}\}, thus – not surprisingly – Construct 2 seems more effective than Construct 1.

  • (vi)

    The experiments included in Section 6 of the paper aim to highlight the qualitative aspects of the introduced approach. The key property of robustness to mesh-refinement underlies all numerical applications, and is explicitly illustrated in the experiment in Section 6.2 and within Figure 6.4. In all cases, the code was written in Julia, was not parallelised or optimised and was run on a PC, with execution times that were quite small due to the online nature of the algorithm (the most expensive experiment took about 30 minutes). We have avoided a comparison with potential alternative approaches as we believe this would obscure the main messages we have tried to convey within the numerical results section. Such a detailed numerical study can be the subject of future work.
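Referring to point (ii) above, the following is a minimal Julia sketch (not taken from the paper) of an Euler discretisation of the guided bridge (7.1) in the scalar case; the drift b, the diffusion coefficient σ and the endpoint are user-supplied, and pinning the final point is a choice made here for illustration.

# Euler discretisation of the guided bridge (7.1), scalar case: a path on [0,T]
# started at x0 and guided towards the endpoint xT (illustrative sketch only).
using Random

function guided_bridge(b, σ, x0, xT, T; M=100, rng=Random.default_rng())
    δ = T / M
    path = zeros(M + 1); path[1] = x0
    ΣxT = σ(xT)^2                                  # Σ(x′) in (7.1), scalar case
    for k in 1:M
        t = (k - 1) * δ
        x = path[k]
        drift = b(x) + (σ(x)^2 / ΣxT) * (xT - x) / (T - t)
        path[k + 1] = x + drift * δ + σ(x) * sqrt(δ) * randn(rng)
    end
    path[end] = xT                                 # pin the endpoint of the bridge
    return path
end

# Example: an O-U-type bridge from 0 to 1 over one observation interval.
path = guided_bridge(x -> 0.4 * (0.0 - x), x -> 0.5, 0.0, 1.0, 1.0; M=10)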

Acknowledgments

We are grateful to the AE and the referees for several useful comments that helped us to improve the content of the paper.

References

  • Aït-Sahalia, (1996) Aït-Sahalia, Y. (1996). Testing continuous-time models of the spot interest rate. The Review of Financial Studies, 9(2):385–426.
  • Aït-Sahalia, (2002) Aït-Sahalia, Y. (2002). Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach. Econometrica, 70(1):223–262.
  • Aït-Sahalia, (2008) Aït-Sahalia, Y. (2008). Closed-form likelihood expansions for multivariate diffusions. The Annals of Statistics, 36(2):906–937.
  • Aït-Sahalia et al., (2005) Aït-Sahalia, Y., Mykland, P. A., and Zhang, L. (2005). How often to sample a continuous-time process in the presence of market microstructure noise. The Review of Financial Studies, 18(2):351–416.
  • Aït-Sahalia and Yu, (2008) Aït-Sahalia, Y. and Yu, J. (2008). High frequency market microstructure noise estimates and liquidity measures. Technical report, National Bureau of Economic Research.
  • Andrieu et al., (2010) Andrieu, C., Doucet, A., and Holenstein, R. (2010). Particle Markov Chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342.
  • Azuma, (1967) Azuma, K. (1967). Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, Second Series, 19(3):357–367.
  • Bali and Wu, (2006) Bali, T. G. and Wu, L. (2006). A comprehensive analysis of the short-term interest-rate dynamics. Journal of Banking & Finance, 30(4):1269–1290.
  • Beskos et al., (2006) Beskos, A., Papaspiliopoulos, O., Roberts, G. O., and Fearnhead, P. (2006). Exact and computationally efficient likelihood-based estimation for discretely observed diffusion processes (with discussion). Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):333–382.
  • Cappe, (2009) Cappe, O. (2009). Online sequential Monte Carlo EM algorithm. In 2009 IEEE/SP 15th Workshop on Statistical Signal Processing.
  • Cappé et al., (2005) Cappé, O., Moulines, E., and Ryden, T. (2005). Inference in Hidden Markov Models. Springer Science & Business Media.
  • Carpenter et al., (1999) Carpenter, J., Clifford, P., and Fearnhead, P. (1999). Improved particle filter for nonlinear problems. IEE Proceedings-Radar, Sonar and Navigation, 146(1):2–7.
  • Chapman and Pearson, (2000) Chapman, D. A. and Pearson, N. D. (2000). Is the short rate drift actually nonlinear? The Journal of Finance, 55(1):355–388.
  • Chib and Kuffner, (2016) Chib, S. and Kuffner, T. A. (2016). Bayes factor consistency. arXiv preprint, arXiv:1607.00292.
  • Chib et al., (2004) Chib, S., Pitt, M. K., and Shephard, N. (2004). Likelihood based inference for diffusion driven models.
  • Chopin et al., (2013) Chopin, N., Jacob, P. E., and Papaspiliopoulos, O. (2013). SMC2: an efficient algorithm for sequential analysis of state space models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):397–426.
  • Dahlhaus and Neddermeyer, (2010) Dahlhaus, R. and Neddermeyer, J. C. (2010). Particle filter-based on-line estimation of spot volatility with nonlinear market microstructure noise models. arXiv preprint arXiv:1006.1860.
  • Del Moral et al., (2010) Del Moral, P., Doucet, A., and Singh, S. (2010). Forward smoothing using sequential Monte Carlo. arXiv preprint arXiv:1012.5390.
  • Dellaportas et al., (2006) Dellaportas, P., Friel, N., and Roberts, G. O. (2006). Bayesian model selection for partially observed diffusion models. Biometrika, 93(4):809–825.
  • Delyon and Hu, (2006) Delyon, B. and Hu, Y. (2006). Simulation of conditioned diffusion and application to parameter estimation. Stochastic Processes and their Applications, 116(11):1660–1675.
  • Douc et al., (2011) Douc, R., Garivier, A., Moulines, E., and Olsson, J. (2011). Sequential Monte Carlo smoothing for general state space hidden Markov models. The Annals of Applied Probability, 21(6):2109–2145.
  • Douc et al., (2014) Douc, R., Moulines, E., and Stoffer, D. (2014). Nonlinear time series: Theory, methods and applications with R examples. Chapman and Hall/CRC.
  • Doucet et al., (2000) Doucet, A., Godsill, S., and Andrieu, C. (2000). On sequential Monte Carlo sampling methods for Bayesian filtering. Statistics and Computing, 10(3):197–208.
  • Durham, (2003) Durham, G. B. (2003). Likelihood-based specification analysis of continuous-time models of the short-term interest rate. Journal of Financial Economics, 70(3):463–487.
  • Durham and Gallant, (2002) Durham, G. B. and Gallant, A. R. (2002). Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes. Journal of Business & Economic Statistics, 20(3):297–338.
  • Eguchi and Masuda, (2018) Eguchi, S. and Masuda, H. (2018). Schwarz type model comparison for LAQ models. Bernoulli, 24(3):2278–2327.
  • Eraker et al., (2003) Eraker, B., Johannes, M., and Polson, N. (2003). The impact of jumps in volatility and returns. The Journal of Finance, 58(3):1269–1300.
  • Fearnhead et al., (2008) Fearnhead, P., Papaspiliopoulos, O., and Roberts, G. O. (2008). Particle filters for partially observed diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(4):755–777.
  • Gloaguen et al., (2018) Gloaguen, P., Étienne, M.-P., and Le Corff, S. (2018). Online sequential Monte Carlo smoother for partially observed diffusion processes. EURASIP Journal on Advances in Signal Processing, 2018(1):9.
  • Golightly and Wilkinson, (2008) Golightly, A. and Wilkinson, D. J. (2008). Bayesian inference for nonlinear multivariate diffusion models observed with error. Computational Statistics & Data Analysis, 52(3):1674–1693.
  • Gonçalves and Roberts, (2014) Gonçalves, F. B. and Roberts, G. O. (2014). Exact simulation problems for jump-diffusions. Methodology and Computing in Applied Probability, 16(4):907–930.
  • Goodfellow et al., (2016) Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT press.
  • Gordon et al., (1993) Gordon, N. J., Salmond, D. J., and Smith, A. F. (1993). Novel approach to nonlinear/non-gaussian bayesian state estimation. In IEE Proceedings F (Radar and Signal Processing), volume 140, pages 107–113. IET.
  • Hansen and Scheinkman, (1993) Hansen, L. P. and Scheinkman, J. A. (1993). Back to the future: Generating moment implications for continuous-time Markov processes. Technical report, National Bureau of Economic Research.
  • Hansen and Lunde, (2006) Hansen, P. R. and Lunde, A. (2006). Realized variance and market microstructure noise. Journal of Business & Economic Statistics, 24(2):127–161.
  • Heston, (1993) Heston, S. L. (1993). A closed-form solution for options with stochastic volatility with applications to bond and currency options. The Review of Financial Studies, 6(2):327–343.
  • Johannes et al., (2009) Johannes, M. S., Polson, N. G., and Stroud, J. R. (2009). Optimal filtering of jump diffusions: Extracting latent states from asset prices. The Review of Financial Studies, 22(7):2759–2799.
  • Jones, (2003) Jones, C. S. (2003). Nonlinear mean reversion in the short-term interest rate. The Review of Financial Studies, 16(3):793–843.
  • Kalogeropoulos et al., (2010) Kalogeropoulos, K., Roberts, G. O., and Dellaportas, P. (2010). Inference for stochastic volatility models using time change transformations. The Annals of Statistics, 38(2):784–807.
  • Kantas et al., (2015) Kantas, N., Doucet, A., Singh, S. S., Maciejowski, J., and Chopin, N. (2015). On particle methods for parameter estimation in state-space models. Statistical Science, 30(3):328–351.
  • Kessler, (1997) Kessler, M. (1997). Estimation of an ergodic diffusion from discrete observations. Scandinavian Journal of Statistics, 24(2):211–229.
  • Kessler et al., (2012) Kessler, M., Lindner, A., and Sorensen, M. (2012). Statistical methods for stochastic differential equations. Chapman and Hall/CRC.
  • Kessler and Sørensen, (1999) Kessler, M. and Sørensen, M. (1999). Estimating equations based on eigenfunctions for a discretely observed diffusion process. Bernoulli, 5(2):299–314.
  • Kingma and Ba, (2014) Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kitagawa, (1987) Kitagawa, G. (1987). Non-gaussian state space modeling of nonstationary time series. Journal of the American Statistical Association, 82(400):1032–1041.
  • Kitagawa, (1996) Kitagawa, G. (1996). Monte Carlo filter and smoother for non-Gaussian nonlinear state space models. Journal of Computational and Graphical Statistics, 5(1):1–25.
  • Kloeden and Platen, (2013) Kloeden, P. E. and Platen, E. (2013). Numerical solution of stochastic differential equations, volume 23. Springer Science & Business Media.
  • LeGland and Mevel, (1997) LeGland, F. and Mevel, L. (1997). Recursive estimation in hidden Markov models. In Decision and Control, 1997, Proceedings of the 36th IEEE Conference, volume 4, pages 3468–3473. IEEE.
  • Nishii, (1988) Nishii, R. (1988). Maximum likelihood principle and model selection when the true model is unspecified. Journal of Multivariate Analysis, 27(2):392–403.
  • Øksendal and Sulem, (2007) Øksendal, B. and Sulem, A. (2007). Applied Stochastic Control of Jump Diffusions. Springer Science & Business Media.
  • Olsson and Westerborn, (2017) Olsson, J. and Westerborn, J. (2017). Efficient particle-based online smoothing in general hidden Markov models: the PaRIS algorithm. Bernoulli, 23(3):1951–1996.
  • Papaspiliopoulos and Roberts, (2012) Papaspiliopoulos, O. and Roberts, G. (2012). Importance sampling techniques for estimation of diffusion models. Statistical methods for stochastic differential equations, 124:311–340.
  • Papaspiliopoulos et al., (2013) Papaspiliopoulos, O., Roberts, G. O., and Stramer, O. (2013). Data augmentation for diffusions. Journal of Computational and Graphical Statistics, 22(3):665–688.
  • Poyiadjis et al., (2011) Poyiadjis, G., Doucet, A., and Singh, S. S. (2011). Particle approximations of the score and observed information matrix in state space models with application to parameter estimation. Biometrika, 98(1):65–80.
  • Reddi et al., (2019) Reddi, S. J., Kale, S., and Kumar, S. (2019). On the convergence of Adam and beyond. arXiv preprint arXiv:1904.09237.
  • Roberts and Stramer, (2001) Roberts, G. O. and Stramer, O. (2001). On inference for partially observed nonlinear diffusion models using the Metropolis–Hastings algorithm. Biometrika, 88(3):603–621.
  • Särkkä and Sottinen, (2008) Särkkä, S. and Sottinen, T. (2008). Application of Girsanov theorem to particle filtering of discretely observed continuous-time non-linear systems. Bayesian Analysis, 3(3):555–584.
  • Schauer et al., (2017) Schauer, M., Van Der Meulen, F., and Van Zanten, H. (2017). Guided proposals for simulating multi-dimensional diffusion bridges. Bernoulli, 23(4A):2917–2950.
  • Schwarz, (1978) Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6(2):461–464.
  • Sin and White, (1996) Sin, C.-Y. and White, H. (1996). Information criteria for selecting possibly misspecified parametric models. Journal of Econometrics, 71(1-2):207–225.
  • Stanton, (1997) Stanton, R. (1997). A nonparametric model of term structure dynamics and the market price of interest rate risk. The Journal of Finance, 52(5):1973–2002.
  • Ströjby and Olsson, (2009) Ströjby, J. and Olsson, J. (2009). Efficient particle-based likelihood estimation in partially observed diffusion processes. IFAC Proceedings Volumes, 42(10):1364–1369.
  • Sutskever et al., (2013) Sutskever, I., Martens, J., Dahl, G., and Hinton, G. (2013). On the importance of initialization and momentum in Deep Learning. In International Conference on Machine Learning, pages 1139–1147.
  • Tadic and Doucet, (2018) Tadic, V. Z. and Doucet, A. (2018). Asymptotic properties of recursive maximum likelihood estimation in non-linear state-space models. arXiv preprint arXiv:1806.09571.
  • van der Meulen and Schauer, (2017) van der Meulen, F. and Schauer, M. (2017). Bayesian estimation of discretely observed multi-dimensional diffusion processes using guided proposals. Electronic Journal of Statistics, 11(1):2358–2396.
  • Wagner, (1989) Wagner, W. (1989). Unbiased Monte Carlo estimators for functionals of weak solutions of stochastic differential equations. Stochastics: An International Journal of Probability and Stochastic Processes, 28(1):1–20.
  • Yonekura et al., (2018) Yonekura, S., Beskos, A., and Singh, S. (2018). Asymptotic analysis of model selection criteria for general hidden Markov models. arXiv preprint arXiv:1811.11834.
  • Yuan et al., (2016) Yuan, K., Ying, B., and Sayed, A. H. (2016). On the influence of momentum acceleration on online learning. The Journal of Machine Learning Research, 17(1):6602–6667.

Appendix A Proof of Proposition 1

Proof.

We have the integral,

Tθ,n(𝐱n)pθ(d𝐱n|y0:n)=Sθ(𝐱0:n)pθ(d𝐱0:n1|y0:n1,𝐱n)pθ(d𝐱n|y0:n).\displaystyle\int T_{\theta,n}(\mathbf{x}_{n})\,p_{\theta}(d\mathbf{x}_{n}|y_{0:n})=\int S_{\theta}(\mathbf{x}_{0:n})\,p_{\theta}(d\mathbf{x}_{0:n-1}|y_{0:n-1},\mathbf{x}_{n})\otimes p_{\theta}(d\mathbf{x}_{n}|y_{0:n}). (A.1)

Also, simple calculations give,

p_{\theta}(d\mathbf{x}_{0:n-1}|y_{0:n-1},\mathbf{x}_{n})\otimes p_{\theta}(d\mathbf{x}_{n}|y_{0:n})\equiv p_{\theta}(d\mathbf{x}_{0:n-1}|y_{0:n},\mathbf{x}_{n})\otimes p_{\theta}(d\mathbf{x}_{n}|y_{0:n})=p_{\theta}(d\mathbf{x}_{0:n}|y_{0:n}).

Using this expression in the integral on the right side of (A.1) completes the proof. ∎

Appendix B Proof of Proposition 2

Proof.

Simply note that,

Tθ,n1(𝐱n1)\displaystyle\int T_{\theta,n-1}(\mathbf{x}_{n-1})\, pθ(d𝐱n1|y0:n1,𝐱n)=\displaystyle p_{\theta}(d\mathbf{x}_{n-1}|y_{0:n-1},\mathbf{x}_{n})=
=Sθ,n1(𝐱0:n1)pθ(d𝐱0:n2|y0:n2,𝐱n1)pθ(d𝐱n1|y0:n1,𝐱n).\displaystyle=\int S_{\theta,n-1}(\mathbf{x}_{0:n-1})\,p_{\theta}(d\mathbf{x}_{0:n-2}|y_{0:n-2},\mathbf{x}_{n-1})\,p_{\theta}(d\mathbf{x}_{n-1}|y_{0:n-1},\mathbf{x}_{n}).

Then, we have that,

pθ\displaystyle p_{\theta} (d𝐱0:n2|y0:n2,𝐱n1)pθ(d𝐱n1|y0:n1,𝐱n)\displaystyle(d\mathbf{x}_{0:n-2}|y_{0:n-2},\mathbf{x}_{n-1})\,p_{\theta}(d\mathbf{x}_{n-1}|y_{0:n-1},\mathbf{x}_{n})
pθ(d𝐱0:n2|𝐱n1,y0:n1,𝐱n)pθ(d𝐱n1|y0:n1,𝐱n)\displaystyle\qquad\equiv p_{\theta}(d\mathbf{x}_{0:n-2}|\mathbf{x}_{n-1},y_{0:n-1},\mathbf{x}_{n})\,p_{\theta}(d\mathbf{x}_{n-1}|y_{0:n-1},\mathbf{x}_{n})
pθ(d𝐱0:n1|y0:n1,𝐱n).\displaystyle\qquad\equiv p_{\theta}(d\mathbf{x}_{0:n-1}|y_{0:n-1},\mathbf{x}_{n}).

Replacing the probability measure on the left side of the above equality with its equivalent on the right side, and using the latter in the integral above, completes the proof of the first equation in the statement of the proposition. The second equation follows from a trivial application of Bayes’ rule. ∎

Appendix C Proof of Proposition 3

Proof.

Part (i) follows via the same arguments as in Olsson and Westerborn, (2017, Corollary 2), based on Azuma, (1967) and Douc et al., (2011, Lemma 4). For part (ii), one can proceed as follows. Given n0n\geq 0, we define the event AN(1/j):={|𝒮θ,n𝒮^θ,nN|>1j}A_{N}(1/j):=\{|\mathcal{S}_{\theta,n}-\hat{\mathcal{S}}_{\theta,n}^{N}|>\frac{1}{j}\}, where we added superscript NN to stress the dependency of the estimate on the number of particles. Then,

Prob[limN𝒮^θ,nN=𝒮θ,n]=1Prob\displaystyle\mathrm{Prob}\,\big{[}\,\lim_{N\rightarrow\infty}\hat{\mathcal{S}}_{\theta,n}^{N}=\mathcal{S}_{\theta,n}\,\big{]}=1-\mathrm{Prob} [j=1lim supNAN(1/j)]\displaystyle\,\big{[}\,\cup_{j=1}^{\infty}\limsup_{N\rightarrow\infty}A_{N}(1/j)\,\big{]}
1j=1Prob[lim supNAN(1/j)].\displaystyle\geq 1-\sum_{j=1}^{\infty}\mathrm{Prob}\,\big{[}\,\limsup_{N\rightarrow\infty}A_{N}(1/j)\,\big{]}. (C.1)

From (i), we get Prob[AN(1/j)]bnecnN(1j)2\mathrm{Prob}\,[\,A_{N}(1/j)\,]\leq b_{n}e^{-c_{n}N\left(\frac{1}{j}\right)^{2}}, so N=1Prob[AN(1/j)]<\sum_{N=1}^{\infty}\mathrm{Prob}\,[\,A_{N}(1/j)\,]<\infty. The Borel–Cantelli lemma gives Prob[lim supNAN(1/j)]=0\mathrm{Prob}\,[\,\limsup_{N\rightarrow\infty}A_{N}(1/j)\,]=0, therefore the result follows from (C.1). ∎