
Bayesian Parameter Identification for
Jump Markov Linear Systems

Mark P. Balenzuela (corresponding author: [email protected]), Adrian G. Wills, Christopher Renton, and Brett Ninness
Abstract

This paper presents a Bayesian method for identification of jump Markov linear system parameters. A primary motivation is to provide accurate quantification of parameter uncertainty without relying on asymptotic in data-length arguments. To achieve this, the paper details a particle-Gibbs sampling approach that provides samples from the desired posterior distribution. These samples are produced by utilising a modified discrete particle filter and carefully chosen conjugate priors.

Faculty of Engineering and Built Environment, The University of Newcastle, Callaghan, NSW 2308 Australia

1 Introduction

In many important applications, the underlying system is known to abruptly switch behaviour in a manner that is temporally random. This type of system behaviour has been observed across a broad range of applications including econometrics [39, 33, 50, 12, 26, 34, 40, 20], telecommunications [43], fault detection and isolation [35], and mobile robotics [44], to name just a few.

For the purposes of decision making and control, it is vital to obtain a model of this switched system behaviour that accurately captures both the dynamics and switching events. At the same time, it is well recognised that obtaining such a model based on first principles considerations is often prohibitive.

This latter difficulty, together with the importance of the switched system modelling problem, has been recognised by many other authors, who have taken the approach of deriving algorithms to estimate both the dynamics and the switching times on the basis of observed system behaviour. The vast majority of methods are directed towards providing maximum-likelihood or related estimates of the system [13, 30, 3, 55, 51, 29], including our own work [9]. While maximum-likelihood (and related) estimates have proven to be very useful, their employment within decision making and control should be accompanied by a quantification of uncertainty. Otherwise trust or belief may be apportioned in error, which can lead to devastating outcomes.

Classical results for providing such error quantifications have relied on central limit theorems, which are asymptotic in the limit as the available data length tends to infinity [41, 42, 5, 4]. These results are then used in situations where only finite data lengths are available, which can produce inaccurate error bounds. This raises the question of how to provide error quantification that is accurate for finite-length data sets. A further question concerns the quantification of performance metrics that rely on these system models, such as the performance of a model-based control strategy.

Fortunately, issues relating to model uncertainty can be addressed using the Bayesian framework [48]. The rationale of the Bayesian approach is that it captures the full probabilistic information of the system conditioned on the available data, even for short data lengths. However, the Bayesian approach is not without its challenges. Except for some very special cases, the Bayesian posterior distribution is not straightforward to compute due to the typically large multi-dimensional integrals that are required in forming the solution.

A remarkable alternative approach aims instead to construct a random number generator whose samples are distributed according to the desired Bayesian posterior distribution. The utility of this random number generator is that its samples can be used to approximate, with arbitrary accuracy, user-defined expectation integrals via Monte-Carlo integration (see e.g. [47]). For example, this can provide estimates of the conditional mean, which is known to be the minimum mean-squared error estimate.

Constructing such a random number generator has received significant research attention, and among the possible solutions is the so-called Markov chain approach. As the name suggests, this approach constructs a Markov chain whose stationary distribution coincides with the desired target distribution, and therefore iterates of the Markov chain are in fact samples from the desired target.

Ensuring that this Markov chain does indeed converge to the desired stationary distribution can be achieved using the Metropolis–Hastings (MH) algorithm (see e.g. [47]). Despite achieving this remarkable outcome, the MH algorithm has a significant drawback in that it requires a certain user-supplied “proposal density”. This proposal must be carefully selected so as to deliver realisations that have a reasonable probability of originating from the target distribution.

Fortunately, this latter problem has received significant research attention and there are by now many variants of the MH method. One alternative, the Gibbs sampler, proceeds by sampling from conditional distributions in a given sequence, with the primary benefit that each sample is guaranteed to be accepted. Alternatively, as we will consider in this paper, particle filters may be used to construct samples from the required conditional distribution; the resulting algorithm is aptly referred to as the particle-Gibbs method.

This particle-Gibbs approach has been previously employed for identifying linear state-space models [54, 15], autoregressive (AR) systems [21, 26, 28, 1], change-point models [19, 16, 31], stochastic volatility models [50], and stochastically switched systems operating according to drift processes [33]. Employing the particle-Gibbs sampler for a switched system encounters additional challenges. In essence, these stem from the exponential (in data-length) growth in complexity caused by the possibility that the system switches among a number of dynamic models at each time step [8, 3, 13, 30, 7, 29, 14, 11, 10, 37, 39].

In this work, we focus on applying a particle-Gibbs sampler to an important subclass of switched systems known as jump Markov linear systems (JMLS). This class is attractive due to its relative simplicity and versatility in modelling switched systems. Previous research has considered particle-Gibbs sampling for the JMLS class, each effort with various limitations, including univariate models in [40, 25, 26], a limited number of dynamic modes in [40, 17, 1], and constrained system or noise structure in [16, 33, 31, 28, 23]. We focus on removing these limitations.

To perform particle-Gibbs sampling on this system class, two methods need to be provided:

  1. A method for generating smoothed hybrid state trajectories conditioned on the available data.

  2. A method of updating the prior chosen for the system parameters using the sampled hybrid state trajectory.

The latter could employ the MH algorithm, but in this paper it is addressed using conjugate priors, as this approach is computationally advantageous. The relevant details of this approach are discussed later.

Previous solutions to the former problem have involved sampling the continuous and discrete latent variables in separate stages of the particle-Gibbs algorithm [49, 40, 25, 26, 16, 28, 50, 23]. These approaches rely on [24, 27] for sampling the continuous state trajectory, and on [21, 34, 18, 27] for sampling the discrete state sequence. Methods used to sample the discrete sequence include ‘single-move’ sampling, where the model used at each time step is sampled separately by increasing the number of steps within the particle-Gibbs iteration, and ‘multi-move’ sampling, where the entire discrete trajectory is sampled as a block [27, 40]. However, these approaches have been criticised for poor mixing [53], which slows the convergence rate of the particle-Gibbs algorithm.

Rao-Blackwellization may be used to improve the efficiency of particle filters. A further improvement often paired with Rao-Blackwellization is the discrete particle filter (DPF) [22], which offers more efficient sampling from the discrete state space. The DPF uses a threshold to determine which components/particles are kept deterministically and which should undergo a resampling procedure. Since the deterministically kept particles are those carrying the greatest probability mass, and are therefore never duplicated during resampling, the algorithm produces fewer duplicate particles and is correspondingly more efficient. The DPF also becomes exact as the number of particles reaches the impracticably large total number of components required to represent the probability distribution exactly, since in that case all components are kept deterministically.

In this paper, we apply the particle-Gibbs sampler to the general JMLS class using a modified version of the DPF. As detailed later, this modified version is specifically tailored to the JMLS class and attends to some of the shortcomings in the papers which conceived the idea [22, 53]. Unlike previous particle-Gibbs approaches to JMLS identification, we do not assume the system to be univariate [40, 25, 26], support only a small number of models [40, 17, 1], or constrain the system or noise structure [23, 16, 33, 31, 28]. This is made possible, in part, by the use of an inverse-Wishart conjugate prior, as used within [54, 23], as opposed to the commonly used inverse-Gamma distribution (see e.g. [1, 18, 40, 26, 21]), which only describes scalar probability distributions.

The contributions of this paper are therefore:

  1. A self-contained particle-Gibbs sampler which targets the parameter distribution for JMLS systems, without restriction on state dimension, model or noise structure, or the number of models.

  2. A modified discrete particle filter tailored specifically to JMLS, which is detailed thoroughly. A discussion is also provided clarifying a point of confusion in the original papers [53, 22]. Additionally, a sampling approach is used that, compared to [53], is simpler and unbiased.

2 Problem Formulation

In this paper, we address the modelling of systems which can jump between a finite number of linear dynamic system modes using a discrete-time jump Markov linear system description, which can be expressed as

\begin{bmatrix}y_{k}\\ x_{k+1}\end{bmatrix}=\underbrace{\begin{bmatrix}\mathbf{C}_{z_{k}}&\mathbf{D}_{z_{k}}\\ \mathbf{A}_{z_{k}}&\mathbf{B}_{z_{k}}\end{bmatrix}}_{\triangleq\boldsymbol{\Gamma}_{z_{k}}}\begin{bmatrix}x_{k}\\ u_{k}\end{bmatrix}+\underbrace{\begin{bmatrix}e_{k}\\ v_{k}\end{bmatrix}}_{w_{k}}, \qquad (1a)

where $x_{k}\in\mathbb{R}^{n_{x}}$ is the system state, $y_{k}\in\mathbb{R}^{n_{y}}$ is the system output, $u_{k}\in\mathbb{R}^{n_{u}}$ is the system input, $z_{k}\in\{1,\dots,m\}$ is a discrete random variable called the model index that indicates the active model, and the noise terms $v_{k}$ and $e_{k}$ originate from the Gaussian white noise process

w_{k}\sim\mathcal{N}\left(0,\ \boldsymbol{\Pi}_{z_{k}}\right),\quad\boldsymbol{\Pi}_{z_{k}}=\begin{bmatrix}\mathbf{R}_{z_{k}}&\mathbf{S}^{T}_{z_{k}}\\ \mathbf{S}_{z_{k}}&\mathbf{Q}_{z_{k}}\end{bmatrix}, \qquad (1b)

where

\mathcal{N}(x|\mu,\mathbf{P})=\det(2\pi\mathbf{P})^{-\frac{1}{2}}e^{-\frac{1}{2}(x-\mu)^{T}\mathbf{P}^{-1}(x-\mu)} \qquad (2)

denotes a multivariate Normal distribution.

The system matrices and noise covariances are allowed to randomly jump or switch values as a function of the model index $z_{k}$. A switch event is captured by allowing $z_{k}$ to transition to $z_{k+1}$ stochastically, with the probability of transitioning from the $j$-th model at time-index $k$ to the $i$-th model at time-index $k+1$ given by

\mathbb{P}(z_{k+1}=i|z_{k}=j)=\mathbf{T}_{i,j},\quad\sum_{i=1}^{m}\mathbf{T}_{i,j}=1\quad\forall j. \qquad (3)

The set of model parameters that fully describe the above JMLS class can be conveniently collected into a parameter object $\theta$, defined as

\theta\triangleq\bigl\{\mathbf{T},\{\boldsymbol{\Gamma}_{i},\boldsymbol{\Pi}_{i}\}_{i=1}^{m}\bigr\}. \qquad (4)
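To make the model class concrete, the following Python sketch simulates a JMLS of the form (1)–(3); the two-mode parameter values, function name and array shapes are illustrative assumptions, not taken from the paper.

import numpy as np

def simulate_jmls(Gammas, Pis, T, u, x0, z0, rng):
    """Simulate the JMLS (1)-(3). Hypothetical helper, not from the paper.
    Gammas[i] stacks [C_i, D_i; A_i, B_i] as in (1a); Pis[i] is the joint
    noise covariance in (1b); T[i, j] = P(z_{k+1}=i | z_k=j) as in (3)."""
    nx, N = x0.shape[0], u.shape[0]
    ny = Gammas[0].shape[0] - nx
    x, z = x0.copy(), z0
    y = np.zeros((N, ny))
    for k in range(N):
        w = rng.multivariate_normal(np.zeros(ny + nx), Pis[z])   # w_k = [e_k; v_k]
        out = Gammas[z] @ np.concatenate([x, u[k]]) + w          # [y_k; x_{k+1}]
        y[k], x = out[:ny], out[ny:]
        z = rng.choice(T.shape[0], p=T[:, z])                    # Markov switch (3)
    return y

rng = np.random.default_rng(0)
# Arbitrary two-mode scalar example (m = 2, nx = ny = nu = 1), for illustration only.
Gammas = [np.array([[0.2, -0.9], [0.5, -1.2]]), np.array([[-0.2, 1.7], [-0.2, 1.5]])]
Pis = [np.diag([0.02, 1e-3]), np.diag([0.04, 0.03])]
T = np.array([[0.7, 0.5], [0.3, 0.5]])
y = simulate_jmls(Gammas, Pis, T, rng.standard_normal((200, 1)), np.zeros(1), 0, rng)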

Problem: Presented with a known state dimension $n_{x}$, a known number of system modes $m>0$, and a data record of $N$ outputs and inputs

\mathbf{y}\triangleq y_{1:N}=\{y_{1},\ldots,y_{N}\},\quad\mathbf{u}\triangleq u_{1:N}=\{u_{1},\ldots,u_{N}\},

respectively, the primary focus of this paper is to provide a random number generator that produces samples $\theta^{\ell}$ from the target distribution

\theta^{\ell}\sim p(\theta\mid y_{1:N}). \qquad (5)

The following section details how this distribution can be targeted using a Markov chain Monte Carlo technique called particle-Gibbs sampling.

3 Markov Chain Monte-Carlo Approach

The essential idea underpinning so-called Markov chain Monte-Carlo (MCMC) methods is that they are focussed on computing expectation integrals of the form

I=\int f(\theta)\,p(\theta\mid\mathbf{y})\,\text{d}\theta, \qquad (6)

where the function $f(\cdot)$ may be quite general. For example, $f$ could be as simple as an indicator function for $\theta$ belonging to a given set, or a more complex function such as the gain or phase margins of a $\theta$-dependent control design.

Solving (6) is generally intractable in closed form, and classical quadrature methods are limited to small dimensions only. An alternative computational approach relies on a law of large numbers (LLN) to provide estimates of (6) via so-called Monte-Carlo integration, where samples from $p(\theta\mid\mathbf{y})$ are used to compute a sample average of $f(\cdot)$ according to

\widehat{I}_{M}=\frac{1}{M}\sum_{\ell=1}^{M}f(\theta^{\ell}),\qquad\theta^{\ell}\sim p(\theta\mid\mathbf{y}). \qquad (7)

Under rather general assumptions on $p(\theta\mid\mathbf{y})$ and $f(\cdot)$, it can be shown that

\widehat{I}_{M}\overset{\text{a.s.}}{\to}I\quad\text{as}\quad M\to\infty, \qquad (8)

where a.s. denotes almost sure convergence. Importantly, the rate of convergence of $\widehat{I}_{M}$ to $I$ is maximised when the samples $\theta^{\ell}$ are uncorrelated [46]. This raises the question of how to generate samples from $p(\theta\mid\mathbf{y})$ that have minimal correlation.
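As a minimal, self-contained illustration of the estimator (7), the sketch below averages a test function over posterior samples; the Gaussian "posterior" and indicator function are placeholders, not the JMLS posterior.

import numpy as np

rng = np.random.default_rng(1)
# Stand-in for samples theta^l ~ p(theta | y); here a toy Gaussian posterior.
theta_samples = rng.normal(loc=0.3, scale=0.1, size=100_000)

# f(theta): e.g. an indicator that theta lies in a set of interest.
f = lambda theta: (theta > 0.25).astype(float)

I_hat = f(theta_samples).mean()   # Monte-Carlo estimate (7) of the integral (6)
print(f"Estimated P(theta > 0.25 | y) = {I_hat:.3f}")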

A remarkably effective approach aimed at solving this problem is to construct a Markov chain whose stationary distribution coincides with the target $p(\theta\mid\mathbf{y})$. Equally remarkable is that such a Markov chain can be constructed in a straightforward manner using the Metropolis–Hastings (MH) algorithm [45, 36].

In this work, we will use a variant of the MH algorithm known as the particle-Gibbs sampler. In essence, the main idea of the particle-Gibbs sampler is to reduce a high-dimensional sampling problem into a sequence of low-dimensional sub-problems. The primary aim is to design these sub-problems so that sampling is relatively straightforward.

In detailing this approach for JMLS identification, it is important to extend our target distribution to the joint parameter and state posterior

p(\theta,\xi_{1:N+1}\mid\mathbf{y}) \qquad (9)

where $\xi_{1:N+1}$ is a hybrid continuous-discrete trajectory, i.e.,

\xi_{1:N+1}=\{\xi_{1},\dots,\xi_{N+1}\}, \qquad (10a)
\xi_{k}=\{x_{k},z_{k}\}. \qquad (10b)

The particle-Gibbs algorithm will produce a Markov chain, whose iterates

\{\theta^{1},\xi^{1}_{1:N+1}\},\{\theta^{2},\xi^{2}_{1:N+1}\},\ldots,\{\theta^{\ell},\xi^{\ell}_{1:N+1}\}, \qquad (11)

are samples from the target $p(\theta,\xi_{1:N+1}|\mathbf{y})$. Furthermore, samples from the desired marginal posterior $p(\theta|\mathbf{y})$ are obtained by simply ignoring the $\xi_{1:N+1}$ component of the joint samples.

There are many different ways to design a particle-Gibbs sampler to provide the desired Markov chain in (11), but possibly the most direct choice is to first sample the state trajectory $\xi_{1:N+1}$ based on the assumption of an available $\theta$ value. Then, in a second step, this is reversed and a new sample of $\theta$ is generated based on the sampled state trajectory $\xi_{1:N+1}$. This process is then repeated in order to provide the desired Markov chain. This procedure can be summarised by the following steps, which are indexed by the integer $\ell$ to clarify the iterative nature of this approach (a skeleton of this loop is sketched after the steps below):

  1. Given $\theta^{\ell}$, sample a hybrid trajectory according to

     \xi_{1:N+1}^{\ell}\sim p(\xi_{1:N+1}|\theta^{\ell},y_{1:N}). \qquad (12)

  2. Then, given $\xi_{1:N+1}^{\ell}$, sample a new $\theta^{\ell+1}$

     \theta^{\ell+1}\sim p(\theta|\xi_{1:N+1}^{\ell},y_{1:N}). \qquad (13)
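A bare skeleton of this alternation, under the assumption that the two conditional samplers of (12) and (13) are supplied as callables (hypothetical interfaces, realised later via Algorithms 3, 7 and 1), is sketched below.

def particle_gibbs(theta0, sample_trajectory, sample_parameters, num_iters):
    """Generic two-block (particle-)Gibbs loop implementing (12)-(13).
    sample_trajectory(theta) returns xi_{1:N+1} ~ p(xi_{1:N+1} | theta, y_{1:N});
    sample_parameters(xi) returns theta ~ p(theta | xi_{1:N+1}, y_{1:N}).
    Both callables are assumed interfaces; this is only the outer loop."""
    theta, chain = theta0, []
    for _ in range(num_iters):
        xi = sample_trajectory(theta)    # step 1: conditional trajectory sample (12)
        theta = sample_parameters(xi)    # step 2: conditional parameter sample (13)
        chain.append(theta)
    return chain                         # iterates form the Markov chain (11)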

The key to success for the above procedure is tied to the relative ease of generating samples in each step. Unfortunately, for the JMLS class it is not tractable to construct $p(\xi_{1:N+1}|\theta^{\ell},y_{1:N})$ exactly, as this distribution requires an exponentially increasing number of terms as the data length $N$ increases.

It is tempting to consider simply replacing the exact distribution $p(\xi_{1:N+1}|\theta^{\ell},y_{1:N})$ with a tractable approximation instead. While this may seem appealing from a practical perspective, it in general forfeits the guarantee that samples are from the distribution $p(\xi_{1:N+1}|\theta^{\ell},y_{1:N})$, and therefore the resulting particle-Gibbs sampler does not produce samples from the desired target.

A remarkable result from [2] shows that such a particle-Gibbs sampler can be constructed by using sequential Monte Carlo (SMC) approximations of $p(\xi_{1:N+1}|\theta^{\ell},y_{1:N})$ with guaranteed convergence properties, provided the approximated distribution includes the previously sampled discrete model sequence $z^{\ell-1}_{1:N+1}$. Using this particle approach, a particle filter can sample possible model sequence hypotheses ($z_{1:k}$), indexed by $i$ and $z_{k}$, to calculate the approximate distribution

p(x_{k},z_{k}|y_{1:k})\approx\sum_{i=1}^{M^{f}_{k}(z_{k})}w^{i}_{k|k}(z_{k})\,\mathcal{N}\left(x_{k}\,\big{|}\,\mu^{i}_{k|k}(z_{k}),\mathbf{P}^{i}_{k|k}(z_{k})\right). \qquad (14)

As the continuous state has a closed-form solution when conditioned on a model sequence, the particle filter employed for this problem is used to explore the discrete space only. As such, it is immensely computationally wasteful to directly employ traditional resampling schemes based on the component weights $w^{i}_{k|k}(z_{k})$ with the indices $i$ and $z_{k}$ collapsed. Doing so will likely result in duplicate components and repeated calculations, leaving other model hypotheses that could have been considered unexplored.

This motivates the use of the discrete particle filter, which is employed in the proposed scheme to prevent this phenomenon. As specific details are required to implement a working DPF, it is introduced thoroughly in the following section. This section begins by relating the approximate distribution (14) produced by the DPF to the required samples $\xi_{1:N+1}^{\ell}$.

3.1 Sampling the latent variables

In the proposed solution, the latent variables $\{x^{\ell}_{1:N+1},z^{\ell}_{1:N+1}\}$ are sampled from an approximation of the distribution $p(\xi_{1:N+1}|\theta^{\ell},y_{1:N})$, with the requirement that this approximation considers the model sequence $z^{\ell-1}_{1:N+1}$ [2]. Complying with this requirement is straightforward, as any JMLS forward filter with a pruning/resampling scheme can be modified to always produce hybrid mixtures containing the component related to $z^{\ell-1}_{1:N+1}$. Using the approximated forward distributions (14), the samples $\{x^{\ell}_{1:N+1},z^{\ell}_{1:N+1}\}$ can be generated by first sampling from the prediction distribution $\xi^{\ell}_{N+1}\sim p(\xi_{N+1}|\theta^{\ell},y_{1:N})$, then conditioning the filtered distribution $p(\xi_{k}|\theta^{\ell},y_{1:k})$ on the latest sample and sampling from the resulting distribution. These latter steps are completed for $k=N,\dots,1$, and can be shown to generate samples from the correct target distribution using the equations

p(\xi_{1:N+1}|\theta^{\ell},y_{1:N})=p(\xi_{N+1}|\theta^{\ell},y_{1:N})\prod_{k=1}^{N}p(\xi_{k}|\xi_{k+1:N+1},\theta^{\ell},y_{1:N}),

p(\xi_{k}|\xi_{k+1:N+1}^{\ell},\theta^{\ell},y_{1:N})=\frac{p(\xi_{k+1}^{\ell}|\xi_{k},\theta^{\ell},y_{1:k})\,p(\xi_{k}|\theta^{\ell},y_{1:k})}{p(\xi^{\ell}_{k+1}|\theta^{\ell},y_{1:k})}\propto p(\xi_{k+1}^{\ell}|\xi_{k},\theta^{\ell},y_{1:k})\,p(\xi_{k}|\theta^{\ell},y_{1:k}). \qquad (15)
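For intuition, (15) is the usual forward-filtering backward-sampling recursion: the filtered distribution at time k is reweighted by the transition term evaluated at the already-sampled ξ_{k+1}, and the result is then sampled. The sketch below applies this to a purely discrete Markov chain, a simplification of the hybrid case with assumed inputs, not the paper's Algorithm 7.

import numpy as np

def backward_sample_discrete(filtered, T, rng):
    """Backward sampling for a discrete chain, mirroring (15).
    filtered[k, i] approximates p(z_k = i | y_{1:k}); T[i, j] = P(z_{k+1}=i | z_k=j)."""
    N, m = filtered.shape
    z = np.empty(N, dtype=int)
    z[-1] = rng.choice(m, p=filtered[-1] / filtered[-1].sum())
    for k in range(N - 2, -1, -1):
        # p(z_k | z_{k+1}, y_{1:k}) is proportional to p(z_{k+1} | z_k) p(z_k | y_{1:k})
        w = T[z[k + 1], :] * filtered[k]
        z[k] = rng.choice(m, p=w / w.sum())
    return z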

Care must be taken when choosing the component reduction scheme required to compute $p(\xi_{k}|\theta^{\ell},y_{1:k})$ at a practical computational cost. A merging-based scheme is strictly prohibited, as this voids the convergence guarantees of the particle-Gibbs algorithm. Additionally, traditional resampling schemes for particles exploring discrete spaces are highly inefficient, as multiple particles remaining after resampling are often clones of each other. This is not only computationally wasteful, as the same calculations are performed multiple times, but also does not explore the discrete space effectively.

This motivates the use of the DPF, a particle filter designed to explore a discrete space. In describing the main ideas behind the DPF, it is convenient to define a new set of weights for each time instant, denoted $W_{k}$, that is simply the collection of all the weights $w_{k|k}^{i}(z_{k})$ from (14) for all values of $i$ and $z_{k}$. That is,

W_{k}=\{w_{k|k}^{1}(1),\dots,w_{k|k}^{M_{f}(1)}(1),w_{k|k}^{1}(2),\dots,w_{k|k}^{M_{f}(2)}(2),\dots,w_{k|k}^{1}(m_{k}),\dots,w_{k|k}^{M_{f}(m_{k})}(m_{k})\}. \qquad (16)

We can then index this array of weights using the notation $W_{k}^{j}$ to indicate the $j$-th entry.

To reduce the growing number of hypotheses encountered, the DPF uses a sampling scheme which is partially deterministic, where the number of components is reduced but few (if any) clones exist in the reduced mixture. Essentially, when tasked to reduce a mixture of $n$ components with respective weights $W_{k}^{j}$ satisfying $\sum_{j=1}^{n}W_{k}^{j}=1$ to a mixture with $M$ components, this scheme operates by considering the $c_{k}$ corresponding to the unique solution of

M=\sum_{j=1}^{n}\min(c_{k}W_{k}^{j},1). \qquad (17)

The component reduction algorithm then dictates that components with a weight satisfying

W_{k}^{j}\geq\frac{1}{c_{k}} \qquad (18)

will be kept deterministically, leaving the remainder of the mixture to undergo resampling.
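A minimal sketch of this split is given below: it finds the $c_{k}$ solving (17) by testing how many of the heaviest components can be kept consistently with (18), and returns the deterministically kept and to-be-resampled index sets. This is an illustrative implementation, not the procedure of Appendix C in [22], and it does not resolve exact ties at the threshold.

import numpy as np

def dpf_threshold(W, M):
    """Solve M = sum_j min(c*W_j, 1) as in (17) and split the indices into those
    kept deterministically (W_j >= 1/c, as in (18)) and those to be resampled."""
    W = np.asarray(W, dtype=float)
    n = len(W)
    order = np.argsort(W)[::-1]                 # heaviest components first
    if M >= n:                                  # nothing needs to be discarded
        return np.inf, order, order[:0]
    for k in range(M + 1):                      # k = number kept deterministically
        c = (M - k) / W[order[k:]].sum()        # candidate c if exactly k are kept
        keep_ok = (k == 0) or (c * W[order[k - 1]] >= 1.0)
        rest_ok = c * W[order[k]] < 1.0
        if keep_ok and rest_ok:
            return c, order[:k], order[k:]
    raise RuntimeError("no consistent threshold found (exact tie at the boundary)")

# Equal weights force the all-resampled case: c = M and no component is kept.
c, kept, resample = dpf_threshold(np.full(10, 0.1), 5)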

Interestingly, the exact value of $c_{k}$ does not need to be computed; instead, the inclusion of each of the components can be tested, as detailed in Appendix C of [22]. In providing this solution, a second optimisation problem was conceived which only generates values of $c_{k}$ at the boundaries of the inclusion of each component. Unfortunately, these values of $c$ are in general not equal, and confusion mounts as resampled components are allocated a weight of $1/c$, which provably should refer to the value generated from the second optimisation problem. The two $c$'s have been used interchangeably in the paper, which has led to dependent literature (e.g. [53]) skipping over the second optimisation problem entirely and mistakenly using the $c_{k}$ from the first optimisation problem.

To make matters worse, the following issues exist in the provided solution [22] to the second optimisation problem:

  1. The solution proposed does not handle the case when all components should be resampled, e.g., if reduction needs to be completed on components with equal weight.

  2. A strict inequality has been used mistakenly to form the sets which are used to compute the terms $A_{\kappa}$, $B_{\kappa}$ and $c$ in the paper, misclassifying an equality that will be encountered at every boundary test.

Furthermore, the literature of [22, 53] mistakes systematic sampling for stratified sampling, and the sampling algorithm presented in [22] also contains an error that prevents duplicated components. Statistically, these duplicates should either be allowed, or the component should be issued a modified weight when it is sampled multiple times.
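For reference, a textbook systematic resampler is sketched below (one shared uniform offset, as opposed to stratified resampling's one uniform draw per stratum); duplicates are permitted, in line with the discussion above. This is a standard routine, not the exact algorithm of [22].

import numpy as np

def systematic_resample(weights, M, rng):
    """Systematic resampling: M evenly spaced points with a single random offset.
    Returns the indices of the selected components (duplicates allowed)."""
    w = np.asarray(weights, dtype=float)
    positions = (rng.uniform() + np.arange(M)) / M
    return np.searchsorted(np.cumsum(w / w.sum()), positions)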

The proposed algorithm builds on the work of [22, 53] to provide a simpler DPF algorithm that is free of confusion about the $c$ term and is capable of handling the case when all components should undergo resampling. Like [53], these sequences are produced using a particle-Gibbs algorithm, and the previous discrete state sequence is evaluated in the filter to ensure a guarantee of convergence. In the proposed scheme, this component is kept deterministically, which avoids the overly complex conditional sampling scheme from [53] that provably places biases upon component number and results in some components never being chosen if the algorithm were rerun an infinite number of times. To further distinguish the proposed solution from [53], a two-filter smoothing approach is not used; instead, the direct conditioning on the partially sampled hybrid trajectory described by (15) is used for sampling the remainder of the trajectory.

As the produced algorithms for sampling a hybrid trajectory are quite long and detailed, they have been placed in Appendix A along with practical notes for implementation.

With the approach taken for sampling the latent hybrid state trajectory covered, the choice of conjugate priors that govern the parameter distributions is now discussed. These priors are then updated with the freshly sampled hybrid trajectory $\xi_{1:N+1}^{\ell}$ to form new parameter distributions $p(\theta|\xi^{\ell}_{1:N+1},y_{1:N})$, which are subsequently sampled from to produce $\theta^{\ell+1}$.

3.2 Conditioned parameter distributions

To enable efficient sampling of the JMLS parameters $\theta$, the parameters were assumed to be distributed according to conjugate priors, which allow $p(\theta|\xi_{1:N+1},y_{1:N})$ to be calculated using closed-form solutions. These conjugate priors were assumed to be as follows.

Like [21, 27, 18, 26, 33, 31], the columns of the model transition matrix $\mathbf{T}$ were assumed to be distributed according to independent Dirichlet distributions,

\mathcal{D}(\mathbf{T}|\boldsymbol{\alpha})=\frac{\Gamma(\sum_{i=1}^{m}\boldsymbol{\alpha}_{i,j})}{\prod_{i=1}^{m}\Gamma(\boldsymbol{\alpha}_{i,j})}\prod_{i=1}^{m}(\mathbf{T}_{i,j})^{\boldsymbol{\alpha}_{i,j}-1}, \qquad (19)

where $\boldsymbol{\alpha}$ is a matrix of concentration parameters with elements $\boldsymbol{\alpha}_{i,j}>0$, and $\Gamma(\cdot)$ is the gamma function.

The covariance matrices were assumed to be distributed according to an Inverse-Wishart distribution

\mathcal{W}^{-1}(\boldsymbol{\Pi}_{z}|\boldsymbol{\Lambda}_{z},\nu_{z})=\frac{|\boldsymbol{\Lambda}_{z}|^{\nu_{z}/2}}{2^{(n\nu_{z}/2)}\Gamma_{n}(\frac{\nu_{z}}{2})}|\boldsymbol{\Pi}_{z}|^{-(\nu_{z}+n+1)/2}\exp\left\{-\frac{1}{2}\text{tr}\left[\boldsymbol{\Lambda}_{z}\boldsymbol{\Pi}_{z}^{-1}\right]\right\}, \qquad (20)

where $\boldsymbol{\Pi}_{z}\in\mathbb{R}^{n\times n}$ is the positive definite symmetric matrix argument, $\boldsymbol{\Lambda}_{z}\in\mathbb{R}^{n\times n}$ is the positive definite symmetric scale matrix, $\nu_{z}\in\mathbb{R}$ is the degrees of freedom, which must satisfy $\nu_{z}>n-1$, and $\Gamma_{n}$ is the multivariate gamma function with dimension $n$. A higher degree of freedom indicates greater confidence about the mean value $\frac{\boldsymbol{\Lambda}_{z}}{\nu_{z}-n-1}$, which exists if $\nu_{z}>n+1$.

Finally, the deterministic parameters for a model are assumed to be distributed according to a Matrix-Normal distribution with the form

\mathcal{M}\mathcal{N}(\boldsymbol{\Gamma}(z)|\mathbf{M}_{z},\boldsymbol{\Pi}_{z},\mathbf{V}_{z})=\frac{\exp\left\{-\frac{1}{2}\text{tr}\left[\mathbf{V}_{z}^{-1}(\boldsymbol{\Gamma}(z)-\mathbf{M}_{z})^{T}\boldsymbol{\Pi}_{z}^{-1}(\boldsymbol{\Gamma}(z)-\mathbf{M}_{z})\right]\right\}}{(2\pi)^{np/2}|\mathbf{V}_{z}|^{n/2}|\boldsymbol{\Pi}_{z}|^{p/2}}, \qquad (21)

with mean $\mathbf{M}\in\mathbb{R}^{n\times p}$, positive definite column covariance matrix $\mathbf{V}\in\mathbb{R}^{p\times p}$, and $\text{tr}(A)$ denoting the trace of matrix $A$. The Matrix Normal is related to the Normal distribution by

\mathcal{M}\mathcal{N}(\boldsymbol{\Gamma}|\mathbf{M},\boldsymbol{\Pi},\mathbf{V})=\mathcal{N}(\text{Vec}(\boldsymbol{\Gamma})|\text{Vec}(\mathbf{M}),\mathbf{V}\otimes\boldsymbol{\Pi}). \qquad (22)

Lemma 3.1 provides instructions on how these conjugate prior distributions can be conditioned on the sample $\xi^{\ell}_{1:N+1}$, thus calculating the parameters for the distribution $p(\theta|\xi_{1:N+1},y_{1:N})$.

Lemma 3.1.

The distribution from which the parameters are sampled can be expressed as

p(\{\boldsymbol{\Gamma}_{i},\boldsymbol{\Pi}_{i}\}_{i=1}^{m},\mathbf{T}|x^{\ell}_{1:N+1},z^{\ell}_{1:N+1},y_{1:N})\propto\mathcal{D}(\mathbf{T}|\bar{\boldsymbol{\alpha}})\left(\prod^{m}_{i=1}\mathcal{M}\mathcal{N}(\boldsymbol{\Gamma}_{i}|\bar{\mathbf{M}}_{i},\boldsymbol{\Pi}_{i},\bar{\mathbf{V}}_{i})\,\mathcal{W}^{-1}(\boldsymbol{\Pi}_{i}|\bar{\boldsymbol{\Lambda}}_{i},\bar{\nu}_{i})\right), \qquad (23)

where the parameters $\{\nu_{i},\mathbf{M}_{i},\mathbf{V}_{i},\boldsymbol{\Lambda}_{i},\boldsymbol{\alpha}\}$ define the prior for the $i$-th model, and the corrected parameters $\{\bar{\nu}_{i},\bar{\mathbf{M}}_{i},\bar{\mathbf{V}}_{i},\bar{\boldsymbol{\Lambda}}_{i},\bar{\boldsymbol{\alpha}}\}$ for each $i$ can be calculated using

\bar{\boldsymbol{\Lambda}}_{i}\triangleq\boldsymbol{\Lambda}_{i}+\bar{\boldsymbol{\Phi}}_{i}-\bar{\boldsymbol{\Psi}}_{i}\bar{\boldsymbol{\Sigma}}_{i}^{-1}\bar{\boldsymbol{\Psi}}_{i}^{T}
\bar{\nu}_{i}\triangleq\nu_{i}+N_{i}, \qquad (24a)
\bar{\mathbf{M}}_{i}\triangleq\bar{\boldsymbol{\Psi}}_{i}\bar{\boldsymbol{\Sigma}}_{i}^{-1}, \qquad (24b)
\bar{\mathbf{V}}_{i}\triangleq\bar{\boldsymbol{\Sigma}}_{i}^{-1}, \qquad (24c)
\bar{\boldsymbol{\alpha}}=\boldsymbol{\alpha}+\mathbf{u}, \qquad (24d)

which use the quantities

\bar{\boldsymbol{\Sigma}}_{i}\triangleq\boldsymbol{\Sigma}_{i}+\mathbf{V}_{i}^{-1}, \qquad (24ya)
\bar{\boldsymbol{\Psi}}_{i}\triangleq\boldsymbol{\Psi}_{i}+\mathbf{M}_{i}\mathbf{V}_{i}^{-1}, \qquad (24yb)
\bar{\boldsymbol{\Phi}}_{i}\triangleq\boldsymbol{\Phi}_{i}+\mathbf{M}_{i}\mathbf{V}_{i}^{-1}\mathbf{M}_{i}^{T}, \qquad (24yc)
\boldsymbol{\Phi}_{i}\triangleq\sum_{k\in G_{i}}\begin{bmatrix}y_{k}\\ x_{k+1}^{\ell}\end{bmatrix}\begin{bmatrix}y_{k}\\ x_{k+1}^{\ell}\end{bmatrix}^{T}, \qquad (24yd)
\boldsymbol{\Psi}_{i}\triangleq\sum_{k\in G_{i}}\begin{bmatrix}y_{k}\\ x_{k+1}^{\ell}\end{bmatrix}\begin{bmatrix}x_{k}^{\ell}\\ u_{k}\end{bmatrix}^{T}, \qquad (24ye)
\boldsymbol{\Sigma}_{i}\triangleq\sum_{k\in G_{i}}\begin{bmatrix}x_{k}^{\ell}\\ u_{k}\end{bmatrix}\begin{bmatrix}x_{k}^{\ell}\\ u_{k}\end{bmatrix}^{T}, \qquad (24yf)
N_{i}=\sum_{k\in G_{i}}1, \qquad (24yg)
\mathbf{u}_{j,i}=\sum_{k\in G_{j,i}}1, \qquad (24yh)

where $G_{i}$ is the set of time steps $k$ for which $z^{\ell}_{k}=i$, and $G_{j,i}$ is the set of time steps $k$ for which $z^{\ell}_{k+1}=j$ and $z^{\ell}_{k}=i$. Note that $\mathbf{u}_{j,i}$ denotes the element in the $j$-th row and $i$-th column of the matrix $\mathbf{u}$. Additionally, $A^{T}$ and $A^{-1}$ denote the transpose and inverse of a matrix $A$, respectively.
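A minimal sketch of accumulating the statistics (24yd)–(24yh) from a sampled hybrid trajectory is given below; the array layouts and 0-based model indices are assumptions made for illustration.

import numpy as np

def jmls_sufficient_stats(x, z, y, u, m):
    """Accumulate the Lemma 3.1 statistics from a sampled trajectory.
    x: (N+1, nx) sampled states x_{1:N+1}; z: (N+1,) sampled model indices (0-based);
    y: (N, ny) outputs; u: (N, nu) inputs; m: number of models."""
    N, nx, ny, nu = y.shape[0], x.shape[1], y.shape[1], u.shape[1]
    Phi = [np.zeros((ny + nx, ny + nx)) for _ in range(m)]
    Psi = [np.zeros((ny + nx, nx + nu)) for _ in range(m)]
    Sigma = [np.zeros((nx + nu, nx + nu)) for _ in range(m)]
    Ncount = np.zeros(m, dtype=int)
    trans = np.zeros((m, m), dtype=int)            # trans[j, i] = #{k : z_{k+1}=j, z_k=i}
    for k in range(N):
        i = z[k]
        zeta = np.concatenate([y[k], x[k + 1]])    # [y_k; x_{k+1}]
        phi = np.concatenate([x[k], u[k]])         # [x_k; u_k]
        Phi[i] += np.outer(zeta, zeta)             # (24yd)
        Psi[i] += np.outer(zeta, phi)              # (24ye)
        Sigma[i] += np.outer(phi, phi)             # (24yf)
        Ncount[i] += 1                             # (24yg)
        trans[z[k + 1], i] += 1                    # (24yh)
    return Phi, Psi, Sigma, Ncount, trans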

3.3 Sampling new parameters

As the distribution $p(\theta|\xi^{\ell}_{1:N+1},y_{1:N})$ can now be calculated from Lemma 3.1, we now provide instruction on how $\theta^{\ell+1}$ may be sampled from this distribution.

We begin by sampling $\boldsymbol{\Pi}_{i}$ from an Inverse-Wishart distribution for each model $i=1,\dots,m$, i.e.

\boldsymbol{\Pi}^{\ell+1}_{i}\sim\mathcal{W}^{-1}(\bar{\boldsymbol{\Lambda}}_{i},\bar{\nu}_{i})\quad\forall i=1,\dots,m. \qquad (24z)

With $\boldsymbol{\Pi}^{\ell+1}_{i}$ sampled for $i=1,\dots,m$, it is then possible to sample $\boldsymbol{\Gamma}^{\ell+1}_{i}$ for each model $i=1,\dots,m$ using Lemma 3.2.

Lemma 3.2.

If we let $\boldsymbol{\Gamma}^{\ell+1}_{i}$ be determined by

\boldsymbol{\Gamma}^{\ell+1}_{i}=\bar{\mathbf{M}}_{i}+((\boldsymbol{\Pi}^{\ell+1}_{i})^{1/2})^{T}\mathbf{H}_{i}\bar{\mathbf{V}}_{i}^{1/2} \qquad (24aa)

where $A^{1/2}$ denotes an upper Cholesky factor of $A$, i.e. $(A^{1/2})^{T}A^{1/2}=A$, and each element of $\mathbf{H}_{i}$ is distributed i.i.d. according to

h\sim\mathcal{N}(0,1), \qquad (24ab)

then

\boldsymbol{\Gamma}^{\ell+1}_{i}\sim\mathcal{M}\mathcal{N}(\bar{\mathbf{M}}_{i},\boldsymbol{\Pi}^{\ell+1}_{i},\bar{\mathbf{V}}_{i}). \qquad (24ac)
Proof.

See [32]. ∎

Finally, sampling the transition matrix $\mathbf{T}$ can be completed by sampling from $m$ Dirichlet distributions, all parametrised by $\bar{\boldsymbol{\alpha}}$. This is done by sampling each element of $\mathbf{T}$ from a Gamma distribution with shape parameter $\bar{\boldsymbol{\alpha}}_{i,j}$ and scale parameter of $1$, i.e.,

\tilde{\mathbf{T}}_{i,j}\sim\mathcal{G}(\bar{\boldsymbol{\alpha}}_{i,j},1), \qquad (24ad)

before normalising over each column,

\mathbf{T}_{i,j}^{\ell+1}=\frac{\tilde{\mathbf{T}}_{i,j}}{\sum_{i}\tilde{\mathbf{T}}_{i,j}}. \qquad (24ae)

The complete procedure for sampling the parameter set is further outlined in Algorithm 1.

Algorithm 1 Sampling the Parameters
1: Sample from the Inverse-Wishart distribution $\boldsymbol{\Pi}^{\ell+1}_{i}\sim\mathcal{W}^{-1}(\boldsymbol{\Pi}_{i}|\bar{\boldsymbol{\Lambda}}_{i},\bar{\nu}_{i})\quad\forall i=1,\dots,m$.
2: Using Lemma 3.2, sample from the Matrix-Normal distribution $\boldsymbol{\Gamma}^{\ell+1}_{i}\sim\mathcal{M}\mathcal{N}(\boldsymbol{\Gamma}_{i}|\bar{\mathbf{M}}_{i},\boldsymbol{\Pi}^{\ell+1}_{i},\bar{\mathbf{V}}_{i})\quad\forall i=1,\dots,m$.
3: Sample from the Gamma distribution $\tilde{\mathbf{T}}_{i,j}\sim\mathcal{G}(\bar{\boldsymbol{\alpha}}_{i,j},1)$, with shape parameter $\bar{\boldsymbol{\alpha}}_{i,j}$ and scale parameter of 1, $\forall i=1,\dots,m,\ \forall j=1,\dots,m$.
4: Set $\mathbf{T}_{i,j}^{\ell+1}=\frac{\tilde{\mathbf{T}}_{i,j}}{\sum_{i}\tilde{\mathbf{T}}_{i,j}}\quad\forall i=1,\dots,m,\ \forall j=1,\dots,m$.
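A minimal sketch of Algorithm 1 is given below, using SciPy's inverse-Wishart sampler, the Cholesky construction of Lemma 3.2, and normalised Gamma draws for the Dirichlet columns. The posterior quantities (M_bar, V_bar, Lambda_bar, nu_bar, alpha_bar) are assumed to have been computed via Lemma 3.1; the function name and argument layout are assumptions.

import numpy as np
from scipy.stats import invwishart

def sample_parameters(M_bar, V_bar, Lambda_bar, nu_bar, alpha_bar, rng):
    """Draw {Gamma_i, Pi_i}_{i=1}^m and T as in Algorithm 1 (illustrative sketch)."""
    m = len(M_bar)
    Pi, Gamma = [], []
    for i in range(m):
        # Step 1: Pi_i ~ IW(Lambda_bar_i, nu_bar_i), equation (24z)
        Pi_i = np.atleast_2d(invwishart.rvs(df=nu_bar[i], scale=Lambda_bar[i],
                                            random_state=rng))
        # Step 2 (Lemma 3.2): Gamma_i = M_bar_i + (Pi_i^{1/2})^T H V_bar_i^{1/2}
        H = rng.standard_normal(M_bar[i].shape)
        U_Pi = np.linalg.cholesky(Pi_i).T          # upper factor, U^T U = Pi_i
        U_V = np.linalg.cholesky(V_bar[i]).T       # upper factor, U^T U = V_bar_i
        Gamma.append(M_bar[i] + U_Pi.T @ H @ U_V)
        Pi.append(Pi_i)
    # Steps 3-4: Dirichlet columns via Gamma draws, (24ad)-(24ae)
    T_tilde = rng.gamma(shape=alpha_bar, scale=1.0)
    T = T_tilde / T_tilde.sum(axis=0, keepdims=True)
    return Gamma, Pi, T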

3.4 Algorithm Overview

For clarity, the proposed method is summarised in full by Algorithm 2.

Algorithm 2 Algorithm overview
1: State prior $p(x_{0},z_{0})$, prior on model parameters defined by $\{\nu_{i},\mathbf{M}_{i},\mathbf{V}_{i},\boldsymbol{\Lambda}_{i}\}_{i=1}^{m}$, prior on model transitions defined by $\boldsymbol{\alpha}$, initial guess of $\theta^{1}$, which may be provided using the JMLS EM algorithm [9].
2: for $\ell=1$ to Max iterations do
3:    Calculate the forward-filter distribution, as per Algorithm 3.
4:    Sample a latent trajectory $\xi^{\ell}_{1:N+1}$ using the instruction provided in Algorithm 7.
5:    Calculate the conditional parameter distribution $p(\theta|\xi_{1:N+1}^{\ell},y_{1:N})$ using Lemma 3.1.
6:    Sample and save the new parameter estimate $\theta^{\ell+1}\sim p(\theta|\xi^{\ell}_{1:N+1},y_{1:N})$ using Algorithm 1.
7: end for
8: By the convergence properties of the particle-Gibbs sampler, the samples $\theta^{\ell}$ are now distributed according to $p(\theta|y_{1:N})$.

4 Simulations

In this section we provide two simulations demonstrating the effectiveness of the proposed method.

4.1 Univariate system

In this example we apply the proposed method to a univariate single-input single-output (SISO) JMLS comprising two models ($m=2$). This ensures that parameters are scalar and distributions can appear on 2D plots. The choice to estimate a univariate system also ensures that certain system matrices ($\mathbf{A},\mathbf{D},\mathbf{R}$) are free from a similarity transformation [6, 52, 51, 9], as in general there is not a unique solution for these parameters.

The true system used in this example was parameterised by

\mathbf{A}_{1}=0.4766,\ \mathbf{B}_{1}=-1.207,\ \mathbf{C}_{1}=0.233,\ \mathbf{D}_{1}=-0.8935,\ \mathbf{Q}_{1}=10^{-3},\ \mathbf{R}_{1}=0.0202,\ \mathbf{S}_{1}=0,
\mathbf{A}_{2}=-0.1721,\ \mathbf{B}_{2}=1.5330,\ \mathbf{C}_{2}=-0.1922,\ \mathbf{D}_{2}=1.7449,\ \mathbf{Q}_{2}=0.0340,\ \mathbf{R}_{2}=0.0439,\ \mathbf{S}_{2}=0,
\mathbf{T}=\begin{bmatrix}0.7&0.5\\ 0.3&0.5\end{bmatrix}. \qquad (24af)

Input-output data was generated using these parameters and the input $u_{k}\sim\mathcal{N}(0,1)$ for $N=2000$ time steps, before the proposed method was applied to the dataset. The proposed method used a resampling step allowing $R=5$ hybrid components per time step, and used uninformative priors on the parameters to ensure the PDFs were highly data-driven. The priors chosen were parameterised by

\mathbf{M}_{1}=\mathbf{M}_{2}=\mathbf{0}_{2\times 2}, \qquad (24aga)
\mathbf{V}_{1}=\mathbf{V}_{2}=13\cdot\mathbf{I}_{2\times 2}, \qquad (24agb)
\boldsymbol{\Lambda}_{1}=\boldsymbol{\Lambda}_{2}=10^{-10}\cdot\mathbf{I}_{2\times 2}, \qquad (24agc)
\nu_{1}=\nu_{2}=2, \qquad (24agd)
\boldsymbol{\alpha}=\mathbf{1}_{2\times 2}. \qquad (24age)

Using these priors, the PDFs produced by the proposed method should have good support of the true parameters, and certainty about them should grow with increasing dataset size. Other algorithms [40] can potentially operate on this system but, due to their use of inverse-Gamma distributions, cannot operate on the example within subsection 4.2. Additionally, as a univariate inverse-Wishart distribution is an inverse-Gamma distribution, these algorithms are equivalent for the univariate case.

The initial parameter set used in the particle-Gibbs sampler may be chosen arbitrarily. To avoid a lengthy burn-in procedure, the particle-Gibbs algorithm was initialised with values close to those corresponding to the maximum likelihood solution. For a real-world problem this could be provided using the EM algorithm [9].

After $10^{5}$ iterations of the particle-Gibbs algorithm, the parameter samples $\theta^{\ell}$ were used to construct Figure 1 and Figure 2. Figure 1 shows the distribution of the diagonal elements of the transition matrix, which represent the probability of a model being used for consecutive time steps. As the off-diagonals are constrained by the law of total probability, there is no need to plot them. Figure 2 shows the distribution of the components within $\{\boldsymbol{\Gamma}_{i},\boldsymbol{\Pi}_{i}\}_{i=1}^{m}$ which are free from a similarity transformation.

Figure 1: Estimated probability distributions of the transition matrix $\mathbf{T}$ from Example 1: (a) probability distribution of consecutive use of the first model; (b) probability distribution of consecutive use of the second model. The distribution from the proposed method is shown in solid blue, whereas the true values are indicated by a red dashed vertical line.

Figure 2: Estimated model parameter distributions for Example 1: (a) $\mathbf{A}_{1}$, (b) $\mathbf{A}_{2}$, (c) $\mathbf{D}_{1}$, (d) $\mathbf{D}_{2}$, (e) $\mathbf{R}_{1}$, (f) $\mathbf{R}_{2}$. The distributions of parameters free from a similarity transformation are shown in solid blue, whereas the true values are indicated by a dashed red vertical line.

The proposed method has produced distributions with a large amount of support for the true parameters, which suggests that the algorithm is correctly performing Bayesian identification on the system. As expected, subsequent testing has shown that an increase in data length will tighten the confidence intervals of the parameters.

Generation of both figures for this example required the models to be sorted. Models were sorted or ‘relabelled’ by comparing the magnitudes of the models’ Bode responses. This is not a core part of the proposed algorithm, as reordering is only required for plotting purposes.

4.2 Multivariate system

In this example, we consider identification of a multivariate three-state SISO JMLS comprising three models ($m=3$). To the best of the authors’ knowledge, there are no alternative algorithms suitable for this problem, nor for generating a ground truth.

As this example considers a multivariate system, there are an infinite number of state-space models which have an equivalent system response. Because of this, plotting the distributions of the model parameters themselves would be somewhat arbitrary. Instead, for the analysis, we provide the variation of likely frequency responses of the models. The distribution of the transition matrix, however, is free from similarity transformations, but due to the increased number of models it can no longer appear on a 2D plot.

The system analysed in this example is described by the discrete transfer functions and transition matrix

H_{1}(\text{z})=\frac{217.4\text{z}^{3}+212.9\text{z}^{2}-0.003827\text{z}+4.603\times 10^{-20}}{\text{z}^{3}-1.712\text{z}^{2}+0.9512\text{z}-1.481\times 10^{-6}}, \qquad (24aha)
H_{2}(\text{z})=\frac{0.4184\text{z}^{3}+0.008764\text{z}^{2}+0.1669\text{z}-0.01542}{\text{z}^{3}-2.374\text{z}^{2}+1.929\text{z}-0.5321}, \qquad (24ahb)
H_{3}(\text{z})=\frac{0.2728\text{z}^{3}-0.9506\text{z}^{2}+1.066\text{z}-0.3881}{\text{z}^{3}-2.374\text{z}^{2}+1.929\text{z}-0.5321}, \qquad (24ahc)
and
\mathbf{T}=\begin{bmatrix}0.5&0.25&0.25\\ 0.25&0.5&0.25\\ 0.25&0.25&0.5\end{bmatrix}, \qquad (24ahd)

respectively, and was simulated for $N=5000$ time steps using the input $u_{k}\sim\mathcal{N}(0,1)$. The true system was then used to initialise the proposed procedure to avoid a lengthy burn-in time. For a real-world example, this initial estimate could be obtained using the EM algorithm [9].

The proposed method was then used to identify the system based on the generated input-output data, with the filter being allowed to store $R=5$ hybrid Gaussian mixture components. The uninformative priors chosen for identification were parameterised by

\mathbf{M}_{1}=\mathbf{M}_{2}=\mathbf{M}_{3}=\mathbf{0}_{4\times 4}, \qquad (24aia)
\mathbf{V}_{1}=\mathbf{V}_{2}=\mathbf{V}_{3}=13\cdot\mathbf{I}_{4\times 4}, \qquad (24aib)
\boldsymbol{\Lambda}_{1}=\boldsymbol{\Lambda}_{2}=\boldsymbol{\Lambda}_{3}=10^{-10}\cdot\mathbf{I}_{4\times 4}, \qquad (24aic)
\nu_{1}=\nu_{2}=\nu_{3}=4, \qquad (24aid)
\boldsymbol{\alpha}=\mathbf{1}_{3\times 3}. \qquad (24aie)

After $10^{5}$ particle-Gibbs iterations, the samples $\theta^{\ell}$ were used to construct Figure 3 and Figure 4. Figure 3 shows the estimated distribution of model transition probabilities, and Figure 4 shows the variation of expected model responses. As with Example 1, models were sorted using their frequency response before producing these figures. Both of these figures show good support for the model used to generate the data, demonstrating validity of the proposed algorithm.

Figure 3: Contour plots showing the estimated transition matrix distributions from Example 2: (a) first column, (b) second column, and (c) third column of the transition matrix. The true values are shown with a black +.

Figure 4: Frequency responses (Bode plots) of the three models for Example 2: (a) Model 1, (b) Model 2, (c) Model 3. The blue line is the true system response, while the red line and shaded red region represent the estimated mean response and 3 standard deviation confidence region, respectively.

5 Conclusion

We have developed and demonstrated an effective algorithm for Bayesian parameter identification of JMLS systems. Unlike alternative methods, we have not forced assumptions such as a univariate state or operation according to drift models. The proposed method scales easily to an increase in models and state dimension. This solution was based on a modified version of the discrete particle filter. In developing this method, points of confusion surrounding the documentation for the DPF were discovered, and have been addressed within this paper.

The proposed method was deployed for Bayesian estimation of a multivariate JMLS system in subsection 4.2, yielding distributions with good support of the transition matrix and models used to generate the data. It should be noted that due to a finite data length the true values used are unlikely to align with the maximum a posteriori (MAP) estimate.

It is unfortunate that directly competing approaches do not exist for either of the examples. Regarding the first example, competing approaches use an inverse-Gamma distribution, which is identical to the inverse-Wishart distribution for the univariate case, and therefore yield an identical algorithm. Regarding the second example, to the best of our knowledge, no alternative algorithms are available for the unconstrained multivariate JMLS identification problem.

References

  • [1] James H Albert and Siddhartha Chib. Bayes inference via gibbs sampling of autoregressive time series subject to Markov mean and variance shifts. Journal of Business & Economic Statistics, 11(1):1–15, 1993.
  • [2] Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle markov chain monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
  • [3] Trevor T Ashley and Sean B Andersson. A sequential monte carlo framework for the system identification of jump Markov state space models. In 2014 American Control Conference, pages 1144–1149. IEEE, 2014.
  • [4] Karl Johan Åström and Peter Eykhoff. System identification—a survey. Automatica, 7(2):123–162, 1971.
  • [5] KJ Astrom. Maximum likelihood and prediction error methods. IFAC Proceedings Volumes, 12(8):551–574, 1979.
  • [6] Laurent Bako, Guillaume Mercère, René Vidal, and Stéphane Lecoeuche. Identification of switched linear state space models without minimum dwell time. IFAC Proceedings Volumes, 42(10):569–574, 2009.
  • [7] Hamsa Balakrishnan, Inseok Hwang, Jung Soon Jang, and Claire J Tomlin. Inference methods for autonomous stochastic linear hybrid systems. In International Workshop on Hybrid Systems: Computation and Control, pages 64–79. Springer, 2004.
  • [8] Mark P. Balenzuela, Adrian G. Wills, Christopher Renton, and Brett Ninness. A new smoothing algorithm for jump Markov linear systems. Submitted to Automatica.
  • [9] Mark P. Balenzuela, Adrian G. Wills, Christopher Renton, and Brett Ninness. A variational expectation-maximisation algorithm for learning jump Markov linear systems. Submitted to Automatica.
  • [10] David Barber. Expectation correction for smoothed inference in switching linear dynamical systems. Journal of Machine Learning Research, 7(Nov):2515–2540, 2006.
  • [11] Niclas Bergman and Arnaud Doucet. Markov chain monte carlo data association for target tracking. In Acoustics, Speech, and Signal Processing, 2000. ICASSP’00. Proceedings. 2000 IEEE International Conference on, volume 2, pages II705–II708. IEEE, 2000.
  • [12] Francesco Bianchi. Regime switches, agents’ beliefs, and post-world war ii us macroeconomic dynamics. Review of Economic studies, 80(2):463–490, 2012.
  • [13] Lars Blackmore, Stephanie Gil, Seung Chung, and Brian Williams. Model learning for switching linear systems with autonomous mode transitions. In 2007 46th IEEE Conference on Decision and Control, pages 4648–4655. IEEE, 2007.
  • [14] Henk AP Blom and Yaakov Bar-Shalom. The interacting multiple model algorithm for systems with Markovian switching coefficients. IEEE transactions on Automatic Control, 33(8):780–783, 1988.
  • [15] Chris K Carter and Robert Kohn. On gibbs sampling for state space models. Biometrika, 81(3):541–553, 1994.
  • [16] Christopher K Carter and Robert Kohn. Markov chain monte carlo in conditionally gaussian state space models. Biometrika, 83(3):589–601, 1996.
  • [17] Yoosoon Chang, Junior Maih, and Fei Tan. State space models with endogenous regime switching. 2018.
  • [18] Siddhartha Chib. Calculating posterior distributions and modal estimates in Markov mixture models. Journal of Econometrics, 75(1):79–97, 1996.
  • [19] Siddhartha Chib. Estimation and comparison of multiple change-point models. Journal of econometrics, 86(2):221–241, 1998.
  • [20] Luca Di Persio and Vukasin Jovic. Gibbs sampling approach to Markov switching models in finance. International Journal of Mathematical and Computational Methods, 1, 2016.
  • [21] Jean Diebolt and Christian P Robert. Estimation of finite mixture distributions through bayesian sampling. Journal of the Royal Statistical Society: Series B (Methodological), 56(2):363–375, 1994.
  • [22] Paul Fearnhead and Peter Clifford. On-line inference for hidden Markov models via particle filters. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(4):887–899, 2003.
  • [23] Emily Fox, Erik B Sudderth, Michael I Jordan, and Alan S Willsky. Bayesian nonparametric inference of switching dynamic linear models. IEEE Transactions on Signal Processing, 59(4):1569–1585, 2011.
  • [24] Sylvia Frühwirth-Schnatter. Data augmentation and dynamic linear models. Journal of time series analysis, 15(2):183–202, 1994.
  • [25] Sylvia Frühwirth-Schnatter. Fully bayesian analysis of switching gaussian state space models. Annals of the Institute of Statistical Mathematics, 53(1):31–49, 2001.
  • [26] Sylvia Frühwirth-Schnatter. Markov chain monte carlo estimation of classical and dynamic switching and mixture models. Journal of the American Statistical Association, 96(453):194–209, 2001.
  • [27] Sylvia Frühwirth-Schnatter. Finite mixture and Markov switching models. Springer Science & Business Media, 2006.
  • [28] Richard Gerlach, Chris Carter, and Robert Kohn. Efficient bayesian inference for dynamic mixture models. Journal of the American Statistical Association, 95(451):819–828, 2000.
  • [29] Zoubin Ghahramani and Geoffrey E Hinton. Variational learning for switching state-space models. Neural computation, 12(4):831–864, 2000.
  • [30] Stephanie Gil and Brian Williams. Beyond local optimality: An improved approach to hybrid model learning. In Proceedings of the 48h IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, pages 3938–3945. IEEE, 2009.
  • [31] Paolo Giordani and Robert Kohn. Efficient bayesian inference for multiple change-point and mixture innovation models. Journal of Business & Economic Statistics, 26(1):66–77, 2008.
  • [32] Arjun K Gupta and Daya K Nagar. Matrix variate distributions. Chapman and Hall/CRC, 2018.
  • [33] Markus Hahn, Sylvia Frühwirth-Schnatter, and Jörn Sass. Markov chain monte carlo methods for parameter estimation in multidimensional continuous time Markov switching models. Journal of Financial Econometrics, 8(1):88–121, 2010.
  • [34] James D Hamilton. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica: Journal of the Econometric Society, pages 357–384, 1989.
  • [35] Masafumi Hashimoto, Hiroyuki Kawashima, Takashi Nakagami, and Fuminori Oba. Sensor fault detection and identification in dead-reckoning system of mobile robot: interacting multiple model approach. In Intelligent Robots and Systems, 2001. Proceedings. 2001 IEEE/RSJ International Conference on, volume 3, pages 1321–1326. IEEE, 2001.
  • [36] W. Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 1970.
  • [37] Ronald E Helmick, W Dale Blair, and Scott A Hoffman. Fixed-interval smoothing for Markovian switching systems. IEEE Transactions on Information Theory, 41(6):1845–1855, 1995.
  • [38] Jeroen D Hol. Resampling in particle filters, 2004.
  • [39] Chang-Jin Kim. Dynamic linear models with Markov-switching. Journal of Econometrics, 60(1-2):1–22, 1994.
  • [40] Chang-Jin Kim, Charles R Nelson, et al. State-space models with regime switching: classical and gibbs-sampling approaches with applications. MIT Press Books, 1, 1999.
  • [41] E. L. Lehmann. Theory of Point Estimation. John Wiley & Sons, Inc., New York, 1983.
  • [42] Ljung Lennart. System identification: theory for the user. PTR Prentice Hall, Upper Saddle River, NJ, pages 1–14, 1999.
  • [43] Andrew Logothetis and Vikram Krishnamurthy. Expectation maximization algorithms for MAP estimation of jump Markov linear systems. IEEE Transactions on Signal Processing, 47(8):2139–2156, 1999.
  • [44] Efim Mazor, Amir Averbuch, Yakov Bar-Shalom, and Joshua Dayan. Interacting multiple model methods in target tracking: a survey. IEEE Transactions on aerospace and electronic systems, 34(1):103–123, 1998.
  • [45] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953.
  • [46] Brett Ninness. Strong laws of large numbers under weak assumptions with application. IEEE Transactions on Automatic Control, 45(11):2117–2122, 2000.
  • [47] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer texts in statistics. Springer, New York, USA, second edition, 2004.
  • [48] Christian Robert. The Bayesian choice: from decision-theoretic foundations to computational implementation. Springer Science & Business Media, 2007.
  • [49] Neil Shephard. Partial non-Gaussian state space. Biometrika, 81(1):115–131, 1994.
  • [50] Mike EC P So, Kin Lam, and Wai Keung Li. A stochastic volatility model with Markov switching. Journal of Business & Economic Statistics, 16(2):244–253, 1998.
  • [51] Andreas Svensson, Thomas B Schön, and Fredrik Lindsten. Identification of jump Markov linear models using particle filters. In 53rd IEEE Conference on Decision and Control, pages 6504–6509. IEEE, 2014.
  • [52] René Vidal, Alessandro Chiuso, and Stefano Soatto. Observability and identifiability of jump linear systems. In Proceedings of the 41st IEEE Conference on Decision and Control, volume 4, pages 3614–3619. IEEE, 2002.
  • [53] Nick Whiteley, Christophe Andrieu, and Arnaud Doucet. Efficient Bayesian inference for switching state-space models using discrete particle Markov chain Monte Carlo methods. arXiv preprint arXiv:1011.2437, 2010.
  • [54] Adrian Wills, Thomas B Schön, Fredrik Lindsten, and Brett Ninness. Estimation of linear systems using a Gibbs sampler. IFAC Proceedings Volumes, 45(16):203–208, 2012.
  • [55] Sinan Yildirim, Sumeetpal S Singh, and Arnaud Doucet. An online expectation–maximization algorithm for changepoint models. Journal of Computational and Graphical Statistics, 22(4):906–926, 2013.

Appendix A Hybrid state sampling algorithms

This appendix details a self-contained set of algorithms required for sampling hybrid state trajectories.

Here, the forward filter is described by Algorithm 3, which makes use of Algorithm 4 to control the computational complexity. Algorithm 4 in turn uses the DPF threshold calculated by Algorithm 5 and the systematic resampling scheme detailed in Algorithm 6. Following the forward-filtering pass, Algorithm 7 may be used to sample the required trajectories.

The following notes are intended to assist the practitioner:

  • The forwards filter and the sampling of a latent trajectory are intended to be completed using the $\theta^{\ell}$ parameter set; for readability this conditioning is not written explicitly.

  • If the ancestor component has a weight of numerically zero in Algorithm 3, it no longer needs to be tracked, as it cannot be selected by the backwards simulator; dropping it improves the produced distribution.

  • Algorithm 5 should be implemented using the log-sum-exp (LSE) trick; a minimal sketch of this trick is given after this list.

  • Duplicated components are allowed to be returned from Algorithm 6.

  • When implementing Algorithm 7, it is beneficial (but not required) to use the non-reduced forwards filtered distribution from Algorithm 3.
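
For reference, the log-sum-exp trick mentioned above can be sketched as follows. Python is used purely for illustration and is not part of the paper; the function name is hypothetical.

```python
import numpy as np

def normalise_log_weights(log_w):
    """Normalise a vector of log-weights using the log-sum-exp (LSE) trick.

    Shifting by the maximum before exponentiating avoids numerical underflow
    when the unnormalised weights are extremely small.
    """
    c = np.max(log_w)                                  # largest log-weight
    log_total = c + np.log(np.sum(np.exp(log_w - c)))  # log of the normalising constant
    return np.exp(log_w - log_total)                   # weights that sum to one
```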

Before filtering and backwards simulating each time step using Algorithms 3 and 7, the following transformation is made to efficiently handle the cross-covariance term $\mathbf{S}_{z_{k}}$,

\displaystyle\bar{\mathbf{A}}_{z_{k}}=\mathbf{A}_{z_{k}}-\mathbf{S}_{z_{k}}\mathbf{R}_{z_{k}}^{-1}\mathbf{C}_{z_{k}},
\displaystyle\bar{\mathbf{B}}_{z_{k}}=\begin{bmatrix}\mathbf{B}_{z_{k}}-\mathbf{S}_{z_{k}}\mathbf{R}_{z_{k}}^{-1}\mathbf{D}_{z_{k}}&\ \mathbf{S}_{z_{k}}\mathbf{R}_{z_{k}}^{-1}\end{bmatrix},
\displaystyle\bar{\mathbf{C}}_{z_{k}}=\mathbf{C}_{z_{k}},\quad\bar{\mathbf{D}}_{z_{k}}=\begin{bmatrix}\mathbf{D}_{z_{k}}&\ \mathbf{0}_{n_{y}}\end{bmatrix},
\displaystyle\bar{\mathbf{Q}}_{z_{k}}=\mathbf{Q}_{z_{k}}-\mathbf{S}_{z_{k}}\mathbf{R}_{z_{k}}^{-1}\mathbf{S}^{T}_{z_{k}},\quad\bar{\mathbf{R}}_{z_{k}}=\mathbf{R}_{z_{k}},
\displaystyle\bar{u}_{k}=\begin{bmatrix}u_{k}\\ y_{k}\end{bmatrix}. (24aj)

Notice that the new system uses a different input $\bar{u}_{k}$.
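
As an illustration only, the transformation (24aj) might be implemented as in the following sketch; matrix names mirror the paper for a single model $z_{k}$, shapes are assumed compatible, and $\mathbf{R}_{z_{k}}$ is assumed invertible.

```python
import numpy as np

def decorrelate_noise(A, B, C, D, Q, R, S, u_k, y_k):
    """Sketch of the transformation (24aj) that removes the cross-covariance S."""
    SRinv = S @ np.linalg.inv(R)
    A_bar = A - SRinv @ C
    B_bar = np.hstack([B - SRinv @ D, SRinv])
    C_bar = C
    D_bar = np.hstack([D, np.zeros((D.shape[0], D.shape[0]))])  # [D, 0_{n_y}]
    Q_bar = Q - SRinv @ S.T
    R_bar = R
    u_bar = np.concatenate([u_k, y_k])                          # augmented input [u_k; y_k]
    return A_bar, B_bar, C_bar, D_bar, Q_bar, R_bar, u_bar
```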

Algorithm 3 Forward Filter
1:Prior joint distribution $p(x_{1},z_{1})$ given by
\displaystyle p(x_{1},z_{1})=\sum_{i=1}^{M^{p}_{1}(z_{1})}w^{i}_{1|0}(z_{1})\mathcal{N}\left(x_{1}|\mu^{i}_{1|0}(z_{1}),\mathbf{P}^{i}_{1|0}(z_{1})\right), (24ak)
where $M^{p}_{1}(z_{1})=1$; the maximum number of hybrid components $M$; and the model parameters $\theta$.
2:Set the conditioned ancestor particle index number $a_{1}=1$. Note that if this is the first particle-Gibbs iteration, i.e. $\ell=1$, filtering does not need to be conditioned on a model sequence; alternatively it can be set with any valid random sequence satisfying $z_{k}^{\ell-1}\in\{1,\dots,m\}$ for all $k=1,\dots,N+1$.
3:for $k=1$ to $N$ do
4:    for $z_{k}=1$ to $m$ do
5:         for $i=1$ to $M_{k}^{p}(z_{k})$ do
\displaystyle\tilde{w}^{i}_{k|k}(z_{k})=w^{i}_{k|k-1}(z_{k})\mathcal{N}\left(y_{k}|\eta_{k|k}^{i}(z_{k}),\boldsymbol{\Xi}_{k|k}^{i}(z_{k})\right), (24ala)
\displaystyle\mu^{i}_{k|k}(z_{k})=\mu^{i}_{k|k-1}(z_{k})+\mathbf{K}_{k|k}^{i}(z_{k})[y_{k}-\eta_{k|k}^{i}(z_{k})], (24alb)
\displaystyle\eta_{k|k}^{i}(z_{k})=\bar{\mathbf{C}}_{z_{k}}\mu^{i}_{k|k-1}(z_{k})+\bar{\mathbf{D}}_{z_{k}}\bar{u}_{k}, (24alc)
\displaystyle\mathbf{P}^{i}_{k|k}(z_{k})=\mathbf{P}^{i}_{k|k-1}(z_{k})-\mathbf{K}_{k|k}^{i}(z_{k})\bar{\mathbf{C}}_{z_{k}}\mathbf{P}^{i}_{k|k-1}(z_{k}), (24ald)
\displaystyle\mathbf{K}_{k|k}^{i}(z_{k})=\mathbf{P}^{i}_{k|k-1}(z_{k})\bar{\mathbf{C}}_{z_{k}}^{T}(\boldsymbol{\Xi}_{k|k}^{i}(z_{k}))^{-1}, (24ale)
\displaystyle\boldsymbol{\Xi}_{k|k}^{i}(z_{k})=\bar{\mathbf{C}}_{z_{k}}\mathbf{P}^{i}_{k|k-1}(z_{k})\bar{\mathbf{C}}_{z_{k}}^{T}+\bar{\mathbf{R}}_{z_{k}}. (24alf)
6:         end for (over $i$)
7:    end for (over $z_{k}$)
\displaystyle w^{i}_{k|k}(z_{k})=\frac{\tilde{w}^{i}_{k|k}(z_{k})}{\sum_{z_{k}=1}^{m}\sum_{i=1}^{M_{k}^{p}(z_{k})}\tilde{w}^{i}_{k|k}(z_{k})}\quad\forall z_{k}=1,\dots,m\text{ and }\forall i=1,\dots,M^{p}_{k}(z_{k}) (24am)
8:    if $\sum_{z=1}^{m}M_{k}^{p}(z)>M$ then
9:         Using Algorithm 4, resample the hybrid mixture defined by $\{w_{k|k}(z_{k}),\mu_{k|k}(z_{k}),\mathbf{P}_{k|k}(z_{k})\}_{i=1}^{M^{p}_{k}(z_{k})}$ for all $z_{k}$ to yield a replacement mixture, preserving the conditioned component $\{w_{k|k}^{a_{k}}(z_{k}^{\ell-1}),\mu_{k|k}^{a_{k}}(z_{k}^{\ell-1}),\mathbf{P}_{k|k}^{a_{k}}(z_{k}^{\ell-1})\}$.
10:    end if
11:    for $z_{k+1}=1$ to $m$ do
12:         Set $j=0$.
13:         for $z_{k}=1$ to $m$ do
14:             for $i=1$ to $M^{p}_{k}(z_{k})$ do
15:                 Increment $j$.
16:                 if $i=a_{k}$ and $z_{k}=z_{k}^{\ell-1}$ then
17:                     Set $a_{k+1}=j$.
18:                 end if
\displaystyle{w}^{j}_{k+1|k}(z_{k+1})=\mathbf{T}_{z_{k+1},z_{k}}w_{k|k}^{i}(z_{k}), (24ana)
\displaystyle\mu^{j}_{k+1|k}(z_{k+1})=\bar{\mathbf{A}}_{z_{k}}\mu_{k|k}^{i}(z_{k})+\bar{\mathbf{B}}_{z_{k}}\bar{u}_{k}, (24anb)
\displaystyle\mathbf{P}^{j}_{k+1|k}(z_{k+1})=\bar{\mathbf{A}}_{z_{k}}\mathbf{P}^{i}_{k|k}(z_{k})\bar{\mathbf{A}}_{z_{k}}^{T}+\bar{\mathbf{Q}}_{z_{k}}. (24anc)
19:             end for (over $i$)
20:         end for (over $z_{k}$)
21:    end for (over $z_{k+1}$)
22:end for (over $k$)
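
For a single Gaussian component, the measurement update (24al) and time update (24an) used inside Algorithm 3 can be sketched as below. This is an illustrative fragment only: the surrounding bookkeeping over models and components is omitted, and the function names are hypothetical.

```python
import numpy as np
from scipy.stats import multivariate_normal

def component_correct(w, mu, P, C_bar, D_bar, R_bar, u_bar, y_k):
    """Measurement update (24al) for one Gaussian component of the hybrid mixture."""
    eta = C_bar @ mu + D_bar @ u_bar                    # predicted output, (24alc)
    Xi = C_bar @ P @ C_bar.T + R_bar                    # innovation covariance, (24alf)
    K = P @ C_bar.T @ np.linalg.inv(Xi)                 # gain, (24ale)
    w_tilde = w * multivariate_normal.pdf(y_k, mean=eta, cov=Xi)   # (24ala)
    mu_new = mu + K @ (y_k - eta)                       # (24alb)
    P_new = P - K @ C_bar @ P                           # (24ald)
    return w_tilde, mu_new, P_new

def component_predict(w, mu, P, A_bar, B_bar, Q_bar, u_bar, T_elem):
    """Time update (24an) for one component, given the transition probability T_elem."""
    return (T_elem * w,
            A_bar @ mu + B_bar @ u_bar,
            A_bar @ P @ A_bar.T + Q_bar)
```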
Algorithm 4 DPF Resampling
1:The maximum number of components which may be kept $M$, and a hybrid mixture containing an ancestor component that needs to be reduced.
2:Place the ancestor component into set $S$ and all other components into set $\bar{S}$.
3:Form normalised weights for the $\bar{S}$ set
\displaystyle W^{i}=\frac{w^{i}}{\sum_{i\in\bar{S}}w^{i}}\quad\forall i\in\bar{S} (24ao)
4:Call Algorithm 5 with $\{W^{i}\}_{i\in\bar{S}}$ to determine the number of components to keep deterministically, $L$, with the maximum number of stored components being $K=M-1$.
5:Place the $L$ components with the highest $W^{i}$ values into set $S$ with their original weights $w^{i}$, removing them from set $\bar{S}$.
6:Renormalise the new set $\bar{S}$,
\displaystyle W^{i}=\frac{w^{i}}{v}\quad\forall i\in\bar{S},\text{ where }v=\sum_{i\in\bar{S}}w^{i}. (24ap)
7:Using Algorithm 6, perform systematic sampling [38] to sample $R=M-L-1$ components from set $\bar{S}$ based upon the weights $\{W^{i}\}_{i\in\bar{S}}$, placing them in set $S$ with a new weight of
\displaystyle w^{i}=\frac{v}{R}. (24aq)
8:Return the reduced mixture governed by the set $S$.
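
The reduction performed by Algorithm 4 might be wired together as in the following sketch. It assumes the compute_dpf_threshold and systematic_sample helpers sketched after Algorithms 5 and 6 below, carries the component parameters as a parallel list, and assumes the ancestor component sits at index 0; all of these layout choices are hypothetical.

```python
import numpy as np

def dpf_resample(weights, components, M, compute_dpf_threshold, systematic_sample):
    """Reduce a hybrid mixture to at most M components, following Algorithm 4.

    weights    : 1-D numpy array of component weights, ancestor first (index 0).
    components : list of per-component parameters, e.g. (z, mu, P) tuples.
    The last two arguments are the helper sketches given after Algorithms 5 and 6.
    """
    keep = [0]                                          # set S: ancestor always survives
    rest = np.arange(1, len(weights))                   # set S-bar
    W = weights[rest] / weights[rest].sum()             # normalised S-bar weights, (24ao)
    L = compute_dpf_threshold(W, M - 1)                 # components kept deterministically
    order = np.argsort(W)[::-1]                         # largest weights first
    keep += list(rest[order[:L]])                       # keep top L with original weights
    remaining = rest[order[L:]]
    new_w = list(weights[keep])
    new_c = [components[i] for i in keep]
    R = M - L - 1
    if R > 0 and len(remaining) > 0:
        v = weights[remaining].sum()
        sampled = systematic_sample(weights[remaining] / v, R)  # indices into `remaining`
        new_w += [v / R] * len(sampled)                 # resampled components share weight v/R, (24aq)
        new_c += [components[remaining[i]] for i in sampled]
    return np.array(new_w), new_c
```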
Algorithm 5 Compute DPF Threshold
1:Maximum number of components which may be kept $K$, and mixture weights $W^{i}$ for all $i=1,\dots,n$.
2:Sort the weights $\{W^{i}\}_{i=1}^{n}$ into descending order, such that $W^{1}\geq W^{2}\geq\dots\geq W^{n}$.
3:Set $L=0$.
4:for $j=1$ to $K$ do
5:    if $W^{j}(K-j)\geq\sum_{i=j+1}^{n}W^{i}$ then
6:         Set $L=j$.
7:    else
8:         Return the number of components to keep, $L$.
9:    end if
10:end for
11:Return the number of components to keep, $L$.
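
A minimal sketch of Algorithm 5, assuming normalised weights are supplied and using Python purely for illustration:

```python
import numpy as np

def compute_dpf_threshold(W, K):
    """Number of components L to keep deterministically (Algorithm 5).

    W : 1-D array of normalised weights (n entries); K : maximum number that may be kept.
    """
    W_sorted = np.sort(W)[::-1]               # descending order
    L = 0
    for j in range(1, min(K, len(W_sorted)) + 1):
        tail = W_sorted[j:].sum()             # sum_{i=j+1}^{n} W^i
        if W_sorted[j - 1] * (K - j) >= tail:
            L = j
        else:
            return L
    return L
```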
Algorithm 6 Systematic Sampling
1:Number of components to be sampled $R$, and mixture weights $W^{i}$ for all $i=1,\dots,n$.
2:Sample from a uniform distribution $u\sim\mathcal{U}(0,1)$.
3:Compute the cumulative probability mass function, such that $Q(i)=\sum_{j=1}^{i}W^{j}$ for all $i=1,\dots,n$.
4:for $j=1$ to $R$ do
5:    for $i=1$ to $n$ do
6:         if $Q(i)\geq(j-1+u)/R$ then
7:             Keep component $i$.
8:             Break.
9:         end if
10:    end for
11:end for
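
A corresponding sketch of Algorithm 6 is given below; it returns zero-based indices, whereas the paper counts components from 1.

```python
import numpy as np

def systematic_sample(W, R, rng=None):
    """Systematic resampling (Algorithm 6): draw R component indices from weights W."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform()                          # one uniform draw shared by all R strata
    Q = np.cumsum(W)                           # cumulative probability mass function
    kept = []
    for j in range(1, R + 1):
        threshold = (j - 1 + u) / R
        i = int(np.searchsorted(Q, threshold)) # first index with Q[i] >= threshold
        kept.append(min(i, len(W) - 1))        # guard against floating-point round-off
    return kept
```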
Algorithm 7 Backwards Sampler
1:A forwards filtered distribution from Algorithm 3.
2:Sample $z_{N+1}^{\ell}$, $b_{N+1}$ and $x_{N+1}^{\ell}$
\displaystyle z_{N+1}^{\ell}\sim\mathcal{C}\left(\left\{\sum_{i=1}^{M^{p}_{N+1}(j)}w^{i}_{N+1|N}(j)\right\}_{j=1}^{m}\right), (24ar)
\displaystyle b_{N+1}\sim\mathcal{C}\left(\left\{\frac{w^{i}_{N+1|N}(z_{N+1}^{\ell})}{\sum_{j=1}^{M_{N+1}^{p}(z_{N+1}^{\ell})}w^{j}_{N+1|N}(z_{N+1}^{\ell})}\right\}_{i=1}^{M_{N+1}^{p}(z_{N+1}^{\ell})}\right), (24as)
\displaystyle x_{N+1}^{\ell}\sim\mathcal{N}\left(x_{N+1}|\mu^{b_{N+1}}_{N+1|N}(z_{N+1}^{\ell}),\,\mathbf{P}^{b_{N+1}}_{N+1|N}(z_{N+1}^{\ell})\right) (24at)
3:for $k=N$ to $1$ do
4:    for $z_{k}=1$ to $m$ do
5:         for $i=1$ to $M^{p}_{k}(z_{k})$ do
\displaystyle\tilde{w}^{i}_{k|N}(z_{k})=\mathbf{T}_{z_{k+1}^{\ell},z_{k}}w_{k|k}^{i}(z_{k})\mathcal{N}\left(x_{k+1}^{\ell}|\eta^{i}_{k|N}(z_{k}),\boldsymbol{\Xi}_{k|N}^{i}(z_{k})\right), (24aua)
\displaystyle\mu^{i}_{k|N}(z_{k})=\mu^{i}_{k|k}(z_{k})+\mathbf{K}_{k|N}^{i}(z_{k})[x_{k+1}^{\ell}-\eta^{i}_{k|N}(z_{k})], (24aub)
\displaystyle\eta_{k|N}^{i}(z_{k})=\bar{\mathbf{A}}_{z_{k}}\mu^{i}_{k|k}(z_{k})+\bar{\mathbf{B}}_{z_{k}}\bar{u}_{k}, (24auc)
\displaystyle\mathbf{P}^{i}_{k|N}(z_{k})=\left(\mathbf{I}-\mathbf{K}_{k|N}^{i}(z_{k})\bar{\mathbf{A}}_{z_{k}}\right)\mathbf{P}^{i}_{k|k}(z_{k}), (24aud)
\displaystyle\mathbf{K}_{k|N}^{i}(z_{k})=\mathbf{P}^{i}_{k|k}(z_{k})\bar{\mathbf{A}}^{T}_{z_{k}}(\boldsymbol{\Xi}_{k|N}^{i}(z_{k}))^{-1}, (24aue)
\displaystyle\boldsymbol{\Xi}_{k|N}^{i}(z_{k})=\bar{\mathbf{A}}_{z_{k}}\mathbf{P}^{i}_{k|k}(z_{k})\bar{\mathbf{A}}^{T}_{z_{k}}+\bar{\mathbf{Q}}_{z_{k}}, (24auf)
6:         end for (over $i$)
7:    end for (over $z_{k}$)
\displaystyle{w}_{k|N}^{i}(z_{k})=\frac{\tilde{w}_{k|N}^{i}(z_{k})}{\sum_{z_{k}=1}^{m}\sum_{i=1}^{M^{p}_{k}(z_{k})}\tilde{w}_{k|N}^{i}(z_{k})}\quad\forall z_{k}=1,\dots,m,\text{ and }\forall i=1,\dots,M^{p}_{k}(z_{k}). (24av)
8:    Sample $z_{k}^{\ell}$, $b_{k}$ and $x_{k}^{\ell}$
\displaystyle z_{k}^{\ell}\sim\mathcal{C}\left(\left\{\sum_{i=1}^{M^{p}_{k}(j)}w^{i}_{k|N}(j)\right\}_{j=1}^{m}\right), (24aw)
\displaystyle b_{k}\sim\mathcal{C}\left(\left\{\frac{w^{i}_{k|N}(z_{k}^{\ell})}{\sum_{j=1}^{M^{p}_{k}(z_{k}^{\ell})}w^{j}_{k|N}(z_{k}^{\ell})}\right\}_{i=1}^{M^{p}_{k}(z_{k}^{\ell})}\right), (24ax)
\displaystyle x_{k}^{\ell}\sim\mathcal{N}\left(x_{k}|\mu^{b_{k}}_{k|N}(z_{k}^{\ell}),\,\mathbf{P}^{b_{k}}_{k|N}(z_{k}^{\ell})\right) (24ay)
9:end for (over $k$)
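
The sampling performed in steps 2 and 8 of Algorithm 7 draws a model index, a component index and then a Gaussian state. A minimal sketch, with a hypothetical container layout and zero-based indexing:

```python
import numpy as np

def sample_hybrid_component(weights, means, covs, rng=None):
    """Draw (z, b, x) from a hybrid Gaussian mixture, mirroring steps 2 and 8 of Algorithm 7.

    weights[z] is a 1-D array of component weights for model z; means[z][b] and
    covs[z][b] are the matching Gaussian parameters.
    """
    rng = np.random.default_rng() if rng is None else rng
    model_mass = np.array([w.sum() for w in weights])               # total weight per model, cf. (24aw)
    z = rng.choice(len(weights), p=model_mass / model_mass.sum())
    b = rng.choice(len(weights[z]), p=weights[z] / weights[z].sum())  # within-model weights, cf. (24ax)
    x = rng.multivariate_normal(means[z][b], covs[z][b])            # Gaussian draw, cf. (24ay)
    return z, b, x
```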

Appendix B Proofs

In this appendix we provide proofs for the Algorithms and Lemmata within the paper.

B.1 Proof of Algorithm 7

For readability, in this proof we utilise the shorthand $\theta^{\ell}=\left\{\{\boldsymbol{\Gamma}^{\ell}_{i},\boldsymbol{\Pi}^{\ell}_{i}\}_{i=1}^{m},\mathbf{T}^{\ell}\right\}$. We begin the derivation with

\displaystyle p(x_{1:N+1},z_{1:N+1}|\{\boldsymbol{\Gamma}^{\ell}_{i},\boldsymbol{\Pi}^{\ell}_{i}\}_{i=1}^{m},\mathbf{T}^{\ell},y_{1:N})=p(x_{N+1},z_{N+1}|\theta^{\ell},y_{1:N})\prod_{k=1}^{N}\frac{p(x_{k+1},z_{k+1}|x_{k},z_{k},\theta^{\ell},y_{1:k})\,p(x_{k},z_{k}|\theta^{\ell},y_{1:k})}{p(x_{k+1},z_{k+1}|\theta^{\ell},y_{1:k})}. (24az)

We will now outline computation of the distribution $\frac{p(x_{k+1}^{\ell},z_{k+1}^{\ell}|x_{k},z_{k},\theta^{\ell},y_{1:k})\,p(x_{k},z_{k}|\theta^{\ell},y_{1:k})}{p(x_{k+1}^{\ell},z_{k+1}^{\ell}|\theta^{\ell},y_{1:k})}$. Since the terms in the numerator have a known form,

\displaystyle p(x_{k},z_{k}|\theta^{\ell},y_{1:k})=\sum_{i=1}^{M_{k}^{f}(z_{k})}w_{k|k}^{i}(z_{k})\mathcal{N}(x_{k}|\mu^{i}_{k|k}(z_{k}),\mathbf{P}^{i}_{k|k}(z_{k})), (24ba)

and by removing conditionally independent terms we yield

\displaystyle p(x_{k+1}^{\ell},z_{k+1}^{\ell}|x_{k},z_{k},\theta^{\ell},y_{1:k})=\mathbb{P}(z_{k+1}^{\ell}|z_{k},\theta^{\ell})\,p(x_{k+1}^{\ell}|x_{k},z_{k},\theta^{\ell},y_{1:k})=\mathbf{T}^{\ell}_{z_{k+1}^{\ell},z_{k}}\mathcal{N}(x_{k+1}^{\ell}|\bar{\mathbf{A}}^{\ell}_{z_{k}}x_{k}+\bar{\mathbf{B}}^{\ell}_{z_{k}}\bar{u}_{k},\bar{\mathbf{Q}}^{\ell}_{z_{k}}), (24bb)

we can rewrite and manipulate the numerator as follows,

\displaystyle p(x_{k},z_{k}|\theta^{\ell},y_{1:k})\,p(x_{k+1}^{\ell},z_{k+1}^{\ell}|x_{k},z_{k},\theta^{\ell},y_{1:k})
\displaystyle\quad=\sum_{i=1}^{M_{k}^{f}(z_{k})}w_{k|k}^{i}(z_{k})\mathcal{N}(x_{k}|\mu^{i}_{k|k}(z_{k}),\mathbf{P}^{i}_{k|k}(z_{k}))\,\mathbf{T}^{\ell}_{z_{k+1}^{\ell},z_{k}}\mathcal{N}(x_{k+1}^{\ell}|\bar{\mathbf{A}}^{\ell}_{z_{k}}x_{k}+\bar{\mathbf{B}}^{\ell}_{z_{k}}\bar{u}_{k},\bar{\mathbf{Q}}^{\ell}_{z_{k}})
\displaystyle\quad=\sum_{i=1}^{M_{k}^{f}(z_{k})}\tilde{w}_{k|N}^{i}(z_{k})\mathcal{N}(x_{k}|\mu^{i}_{k|N}(z_{k}),{\mathbf{P}}^{i}_{k|N}(z_{k})), (24bc)

where $\tilde{w}_{k|N}^{i}(z_{k})$, $\mu^{i}_{k|N}(z_{k})$ and ${\mathbf{P}}^{i}_{k|N}(z_{k})$ can be computed straightforwardly, as this pattern of Normal distribution terms has a well-known solution identical to the correction step of the weighted Kalman filter used for forward filtering; the required expressions are those in (24au). With the numerator of $\frac{p(x_{k+1}^{\ell},z_{k+1}^{\ell}|x_{k},z_{k},\theta^{\ell},y_{1:k})\,p(x_{k},z_{k}|\theta^{\ell},y_{1:k})}{p(x_{k+1}^{\ell},z_{k+1}^{\ell}|\theta^{\ell},y_{1:k})}$ now having a closed form, it can be marginalised to yield the denominator

\displaystyle p(x_{k+1}^{\ell},z_{k+1}^{\ell}|\theta^{\ell},y_{1:k})=\sum_{z_{k}=1}^{m}\sum_{i=1}^{M_{k}^{f}(z_{k})}\tilde{w}_{k|N}^{i}(z_{k}). (24bd)

Therefore

\displaystyle p(x_{k},z_{k}|x_{k+1:N+1},z_{k+1:N+1},\theta^{\ell},y_{1:N})=\sum_{i=1}^{M_{k}^{f}(z_{k})}{w}_{k|N}^{i}(z_{k})\mathcal{N}(x_{k}|\mu^{i}_{k|N}(z_{k}),{\mathbf{P}}^{i}_{k|N}(z_{k})), (24be)

where

\displaystyle{w}_{k|N}^{i}(z_{k})=\frac{\tilde{w}_{k|N}^{i}(z_{k})}{\sum_{z_{k}=1}^{m}\sum_{i=1}^{M_{k}^{f}(z_{k})}\tilde{w}_{k|N}^{i}(z_{k})}. (24bf)

Sampling may be performed by introducing an auxiliary variable $b$ of some hybrid Gaussian mixture, which for this application represents a possible model sequence,

\displaystyle p(x,z,b|\cdot)=w^{b}(z)\mathcal{N}(x|\mu^{b}(z),\mathbf{P}^{b}(z)). (24bg)

Normally this variable is not of interest and is marginalised out. Using conditional probability, we express sampling of $x$, $z$ and $b$ as sampling from the distributions

\displaystyle p(x,z,b|\cdot)=p(x|z,b,\cdot)\,\mathbb{P}(b|z,\cdot)\,\mathbb{P}(z|\cdot), (24bh)

where the $b$ component of the sample may be discarded to obtain a sample from $p(x,z|\cdot)$. Next we consider marginalisation of $x$ to yield

\displaystyle\mathbb{P}(z,b|\cdot)=\int p(x,z,b|\cdot)\,dx=\int w^{b}(z)\mathcal{N}(x|\mu^{b}(z),\mathbf{P}^{b}(z))\,dx=w^{b}(z). (24bi)

Considering the $\mathbb{P}(b|z,\cdot)$ and $\mathbb{P}(z|\cdot)$ distributions,

\displaystyle\mathbb{P}(z|\cdot)=\sum_{b=1}^{M(z)}\mathbb{P}(z,b|\cdot)=\sum_{b=1}^{M(z)}w^{b}(z), (24bj)
\displaystyle\mathbb{P}(b|z,\cdot)=\frac{\mathbb{P}(z,b|\cdot)}{\mathbb{P}(z|\cdot)}=\frac{w^{b}(z)}{\sum_{b=1}^{M(z)}w^{b}(z)}, (24bk)

where these are Categorical distributions, from which $z^{\ell}$ and $b$ can easily be sampled. Next it follows from (24bg) and (24bi) that

\displaystyle p(x|z,b,\cdot)=\frac{p(x,z,b|\cdot)}{\mathbb{P}(z,b|\cdot)}=\frac{w^{b}(z)\mathcal{N}(x|\mu^{b}(z),\mathbf{P}^{b}(z))}{w^{b}(z)}=\mathcal{N}(x|\mu^{b}(z),\mathbf{P}^{b}(z)), (24bl)

and therefore sampling $x^{\ell}$ can be completed straightforwardly using

\displaystyle x^{\ell}\sim\mathcal{N}(x|\mu^{b}(z^{\ell}),\mathbf{P}^{b}(z^{\ell})). (24bm)

B.2 Proof of Lemma 3.1

For readability, in this proof we utilise the shorthand $\theta=\left\{\{\boldsymbol{\Gamma}_{i},\boldsymbol{\Pi}_{i}\}_{i=1}^{m},\mathbf{T}\right\}$. We begin with

\displaystyle p(\theta|x^{\ell}_{1:N+1},z^{\ell}_{1:N+1},y_{1:N})\propto p(\theta)\prod_{k=1}^{N}\mathbb{P}(z_{k+1}^{\ell}|x^{\ell}_{k+1},z^{\ell}_{k},x^{\ell}_{k},y_{1:k},\theta)\,p(x^{\ell}_{k+1},y_{k}|x^{\ell}_{k},z^{\ell}_{k},y_{1:k-1},\theta), (24bn)

as $p(x^{\ell}_{1},z^{\ell}_{1})$ is a constant for each iteration. By expanding $\theta$ and exercising conditional independence, we yield

\displaystyle p(\theta|x^{\ell}_{1:N+1},z^{\ell}_{1:N+1},y_{1:N})\propto\left(p(\mathbf{T})\prod_{k=1}^{N}\mathbb{P}(z_{k+1}^{\ell}|z^{\ell}_{k},\mathbf{T})\right)\left(\prod^{m}_{i=1}p(\boldsymbol{\Gamma}_{i}|\boldsymbol{\Pi}_{i})p(\boldsymbol{\Pi}_{i})\right)\prod_{k=1}^{N}p(x^{\ell}_{k+1},y_{k}|x^{\ell}_{k},\boldsymbol{\Gamma}_{z^{\ell}_{k}},\boldsymbol{\Pi}_{z^{\ell}_{k}}). (24bo)

Next we consider each of the columns of the $\mathbf{T}$ matrix, written as $\mathbf{T}_{i}$, to be conditionally independent, so that $p(\mathbf{T})=\prod_{i=1}^{m}p(\mathbf{T}_{i})$. Therefore

\displaystyle p(\theta|x^{\ell}_{1:N+1},z^{\ell}_{1:N+1},y_{1:N})\propto\left(\prod_{i=1}^{m}p(\mathbf{T}_{i})\prod_{k\in G_{i}}\mathbb{P}(z_{k+1}^{\ell}|\mathbf{T}_{i})\right)\left(\prod^{m}_{i=1}p(\boldsymbol{\Gamma}_{i}|\boldsymbol{\Pi}_{i})p(\boldsymbol{\Pi}_{i})\prod_{k\in G_{i}}p(x^{\ell}_{k+1},y_{k}|x^{\ell}_{k},\boldsymbol{\Gamma}_{i},\boldsymbol{\Pi}_{i})\right), (24bp)

where $G_{i}$ is the set of time steps in sample $\ell$ for which model $i$ is active, i.e. $z^{\ell}_{k}=i$ for all $k\in G_{i}$. Substituting the assumed distributions for these terms yields

\displaystyle p(\theta|x^{\ell}_{1:N+1},z^{\ell}_{1:N+1},y_{1:N})\propto\left(\prod_{i=1}^{m}\mathcal{D}(\mathbf{T}_{i}|\boldsymbol{\alpha}_{i})\prod_{k\in G_{i}}\mathcal{C}(z^{\ell}_{k+1}|\mathbf{T}_{i})\right)\left(\prod^{m}_{i=1}\mathcal{M}\mathcal{N}(\boldsymbol{\Gamma}_{i}|\mathbf{M}_{i},\boldsymbol{\Pi}_{i},\mathbf{V}_{i})\,\mathcal{W}^{-1}(\boldsymbol{\Pi}_{i}|\boldsymbol{\Lambda}_{i},\nu_{i})\prod_{k\in G_{i}}\mathcal{N}\left(\begin{bmatrix}y_{k}\\ x^{\ell}_{k+1}\end{bmatrix}\bigg{|}\boldsymbol{\Gamma}_{i}\begin{bmatrix}x_{k}^{\ell}\\ u_{k}\end{bmatrix},\boldsymbol{\Pi}_{i}\right)\right), (24bq)

where $\boldsymbol{\alpha}_{i}$ denotes the $i$-th column of $\boldsymbol{\alpha}$. We can now use well-known results for updating the conjugate prior, see e.g. [54]. This yields a solution of the form

\displaystyle p(\{\boldsymbol{\Gamma}_{i},\boldsymbol{\Pi}_{i}\}_{i=1}^{m},\mathbf{T}|x^{\ell}_{1:N+1},z^{\ell}_{1:N+1},y_{1:N})\propto\left(\prod_{i=1}^{m}\mathcal{D}(\mathbf{T}_{i}|\bar{\boldsymbol{\alpha}}_{i})\right)\left(\prod^{m}_{i=1}\mathcal{M}\mathcal{N}(\boldsymbol{\Gamma}_{i}|\bar{\mathbf{M}}_{i},\boldsymbol{\Pi}_{i},\bar{\mathbf{V}}_{i})\,\mathcal{W}^{-1}(\boldsymbol{\Pi}_{i}|\bar{\boldsymbol{\Lambda}}_{i},\bar{\nu}_{i})\right), (24br)

where the parameters can be calculated using the instructions provided in Lemma 3.1. A sample from this distribution can be obtained by sampling from the Dirichlet, Matrix-Normal and Inverse-Wishart distributions for each model; a minimal sketch follows.
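
The sketch below, intended only as an illustration, draws one such sample using scipy's dirichlet, matrix_normal and invwishart distributions; the per-model container layout of the updated hyperparameters and the row/column-covariance convention of the Matrix-Normal are assumptions, not part of the paper.

```python
import numpy as np
from scipy.stats import dirichlet, invwishart, matrix_normal

def sample_theta(alpha_bar, M_bar, V_bar, Lambda_bar, nu_bar, rng=None):
    """Draw one sample of {Gamma_i, Pi_i}_{i=1}^{m} and T from (24br).

    Each argument is a list with one entry per model, holding the updated
    hyperparameters from Lemma 3.1 (hypothetical container layout).
    """
    m = len(alpha_bar)
    T = np.zeros((m, m))
    Gamma, Pi = [], []
    for i in range(m):
        T[:, i] = dirichlet.rvs(alpha_bar[i], random_state=rng)[0]        # i-th column of T
        Pi_i = np.atleast_2d(invwishart.rvs(df=nu_bar[i], scale=Lambda_bar[i],
                                            random_state=rng))            # Inverse-Wishart draw
        Gamma_i = matrix_normal.rvs(mean=M_bar[i], rowcov=Pi_i, colcov=V_bar[i],
                                    random_state=rng)                     # Matrix-Normal draw
        Gamma.append(Gamma_i)
        Pi.append(Pi_i)
    return Gamma, Pi, T
```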