
Context-tree weighting for real-valued time series:
Bayesian inference with hierarchical mixture models

Ioannis Papageorgiou, University of Cambridge ([email protected])
Ioannis Kontoyiannis, University of Cambridge ([email protected])
Abstract

Real-valued time series are ubiquitous in the sciences and engineering. In this work, a general, hierarchical Bayesian modelling framework is developed for building mixture models for time series. This development is based, in part, on the use of context trees, and it includes a collection of effective algorithmic tools for learning and inference. A discrete context (or ‘state’) is extracted for each sample, consisting of a discretised version of some of the most recent observations preceding it. The set of all relevant contexts is represented as a discrete context tree. At the bottom level, a different real-valued time series model is associated with each context-state, i.e., with each leaf of the tree. This defines a very general framework that can be used in conjunction with any existing model class to build flexible and interpretable mixture models. Extending the idea of context-tree weighting leads to algorithms that allow for efficient, exact Bayesian inference in this setting. The utility of the general framework is illustrated in detail when autoregressive (AR) models are used at the bottom level, resulting in a nonlinear AR mixture model. The associated methods are found to outperform several state-of-the-art techniques in experiments on simulated and real-world data.

I Introduction

Modelling and inference of real-valued time series are critical tasks with important applications throughout the sciences and engineering. A wide range of approaches exist, including classical statistical methods [7, 15] as well as modern machine learning (ML) techniques, notably matrix factorisations [47, 10], Gaussian processes [34, 36, 11], and neural networks [5, 2, 48]. Despite their popularity, there has not been conclusive evidence that in general the latter outperform the former in the time series setting [24, 23, 1]. Motivated in part by the two well-known limitations of neural network models (see, e.g., [5]), namely, their lack of interpretability and their typically very large training data requirements, in this work we propose a general class of flexible hierarchical Bayesian models, which are both naturally interpretable and suitable for applications with limited training data. Also, we provide computationally efficient (linear complexity) algorithms for inference and prediction, offering another important practical advantage over ML methods [24].

The first step in the modelling design is the identification of meaningful discrete states. Importantly, these are observable rather than hidden, and are given by the discretised values of some of the most recent samples. The second step is the assignment of a different time series model to each of these discrete context-based states. In technical terms, we define a hierarchical Bayesian model, which at the top level selects the set of relevant states (which can be viewed as providing an adaptive partition of the state space), and at the bottom level associates an arbitrary time series model with each state. These collections of states (equivalently, the corresponding state space partitions) are naturally represented as discrete context trees [35, 17], which are shown to admit a natural interpretation and to capture important aspects of the structure present in the data. We call the resulting model class the Bayesian Context Trees State Space Model (BCT-SSM).

Although BCT-SSM is referred to as a ‘model’, it is in fact a general framework for building Bayesian mixture models for time series, that can be used in conjunction with any existing model class. The resulting model family is rich, and much more general than the class one starts with. For example, using any of the standard linear families (like the classical AR or ARIMA) leads to much more general models that can capture highly nonlinear trends in the data, and are easily interpretable.

It is demonstrated that employing this particular type of observable state process (as opposed to a conventional hidden state process), also facilitates very effective Bayesian inference. This is achieved by exploiting the structure of context-tree models and extending the ideas of context-tree weighting (CTW) [44] and the Bayesian Context Trees (BCT) framework of [17], which were previously used only in the restricted setting of discrete-valued time series. In this discrete setting, CTW has been used very widely for data compression [44, 43], and BCTs have been used for a range of statistical tasks, including model selection, prediction, entropy estimation, and change-point detection [32, 20, 30, 21, 29, 31]. Context-trees in data compression were also recently studied in [25, 27] and were employed in an optimisation setting in [37].

The resulting tools developed for real-valued data make the BCT-SSM a powerful Bayesian framework, which in fact allows for exact and computationally very efficient Bayesian inference. In particular, the evidence [22] can be computed exactly, with all models and parameters integrated out. Furthermore, the a posteriori most likely (MAP) partition (i.e., the MAP set of discrete states) can be identified, along with its exact posterior probability. It is also shown that these algorithms allow for efficient sequential updates, which are ideally suited for online forecasting, offering an important practical advantage over standard ML approaches.

To illustrate the application of the general framework, the case where AR models are used as the building block for the BCT-SSM is examined in detail. We refer to the resulting model class as the Bayesian context tree autoregressive (BCT-AR) model. The BCT-AR model is a flexible, nonlinear mixture of AR models, which is found to outperform several state-of-the-art methods on simulated and real-world data from standard nonlinear time series applications in economics and finance, in terms of both forecasting accuracy and computational requirements.

Finally, we note that a number of earlier approaches employ discrete patterns in the analysis of real-valued time series [3, 4, 18, 6, 12, 14, 19, 28, 38]. These works illustrate the fact that useful and meaningful information can indeed be extracted from discrete contexts. However, in most cases the methods are either application- or task-specific, and typically resort to ad hoc considerations for performing inference. In contrast, in this work, discrete contexts are used in a natural manner by defining a hierarchical Bayesian modelling structure upon which principled Bayesian inference is performed.

II The Bayesian Context Trees State Space Model

II-A Discrete contexts

Consider a real-valued time series $(x_1, x_2, \ldots)$. The first key element of the present development is the use of an observable state for each $x_n$, based on discretised versions of some of the samples $(\ldots, x_{n-2}, x_{n-1})$ preceding it. We refer to the string consisting of these discretised previous samples as the discrete context; it plays the role of a discrete-valued feature vector that can be used to identify useful nonlinear structure in the data. These contexts are extracted via simple quantisers $Q : \mathbb{R} \to A := \{0, 1, \ldots, m-1\}$, of the form,

\[
Q(x) = \begin{cases} 0, & x < c_1, \\ i, & c_i \leq x \leq c_{i+1},\ 1 \leq i \leq m-2, \\ m-1, & x > c_{m-1}, \end{cases} \tag{1}
\]

where in this section the thresholds $\{c_1, \ldots, c_{m-1}\}$ and the resulting quantiser $Q$ are considered fixed. A systematic way to infer the thresholds from data is described in Section III-B.

This general framework can be used in conjunction with an arbitrary way of extracting discrete features, based on an arbitrary mapping to a discrete alphabet, not necessarily of the form in (1). However, the quantisation needs to be meaningful in order to lead to useful results. Quantisers as in (1) offer a generally reasonable choice although, depending on the application at hand, there are other useful approaches, e.g.,  quantising the percentage differences between samples.
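As a concrete illustration, the following is a minimal sketch (in Python, assuming NumPy is available; the function name quantise is ours) of a quantiser of the form (1). Ties at a threshold are assigned to the cell whose lower endpoint is that threshold, matching the lower bound $c_i \leq x$ in (1); the convention is immaterial for continuous data.

```python
import numpy as np

def quantise(x, thresholds):
    """Map a real value x to a symbol in {0, ..., m-1} via the quantiser (1).

    `thresholds` is the increasing sequence (c_1, ..., c_{m-1}); the returned
    symbol is the number of thresholds that are <= x.
    """
    return int(np.searchsorted(thresholds, x, side="right"))

# Example with m = 3 and thresholds (c_1, c_2) = (-7, 7), as in the IBM example of Section IV-D:
# quantise(-10.0, [-7, 7]) == 0, quantise(0.0, [-7, 7]) == 1, quantise(9.5, [-7, 7]) == 2
```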

II-B Context trees

Figure 1: Example of a context tree $T$.

Given a quantiser $Q$ as in (1), a maximum context length $D \geq 0$, and a proper $m$-ary context tree $T$, the context (or ‘state’) of each sample $x_n$ is obtained as follows. Let $t = (Q(x_{n-1}), \ldots, Q(x_{n-D}))$ be the discretised string of length $D$ preceding $x_n$; the context $s$ of $x_n$ is the unique leaf of $T$ that is a suffix of $t$. For example, for the context tree of Figure 1, if $Q(x_{n-1}) = 0$ and $Q(x_{n-2}) = 1$ then $s = 01$, whereas if $Q(x_{n-1}) = Q(x_{n-2}) = 1$ then $s = 1$.
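Operationally, the context is found by descending the tree from the root, reading one quantised symbol at a time starting from the most recent sample, until a leaf is reached. A minimal sketch (the helper name context_of and the representation of the tree as a set of leaf strings are our own):

```python
def context_of(tree_leaves, quantised_past):
    """Return the context (leaf of T) of the next sample.

    `tree_leaves` is the set of leaves, each written with the most recent
    symbol first (e.g. {'1', '01', '00'} for the tree of Figure 1), and
    `quantised_past` is the list [Q(x_{n-1}), Q(x_{n-2}), ..., Q(x_{n-D})].
    """
    s = ""
    for symbol in quantised_past:
        if s in tree_leaves:       # stop as soon as a leaf of T is reached
            return s
        s += str(symbol)           # otherwise descend one more level
    return s                       # leaf at the maximum depth D

# For the tree of Figure 1:
# context_of({'1', '01', '00'}, [0, 1]) == '01'
# context_of({'1', '01', '00'}, [1, 1]) == '1'
```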

The leaves of the tree define the set of discrete states in the hierarchical model. So, for the example BCT-SSM of Figure 1, the set of states is $\mathcal{S} = \{1, 01, 00\}$. Equivalently, this process can be viewed as defining a partition of $\mathbb{R}^2$ into three regions indexed by the contexts $\mathcal{S}$ in $T$.

To complete the specification of the BCT-SSM, a different time series model $\mathcal{M}_s$ is associated with each leaf $s$ of the context tree $T$, giving a different conditional density for $x_n$. At time $n$, given the context $s$ determined by the past $D$ samples $(x_{n-1}, \ldots, x_{n-D})$, the distribution of $x_n$ is determined by the model $\mathcal{M}_s$ assigned to $s$. Parametric models with parameters $\theta_s$ at each leaf $s$ are considered. Altogether, the BCT-SSM consists of an $m$-ary quantiser $Q$, an $m$-ary tree $T$ that defines the set of discrete states, and a collection of parameter vectors $\theta_s$ for the parametric models at its leaves.

Identifying $T$ with the collection of its leaves $\mathcal{S}$, and writing $x_i^j$ for the segment $(x_i, x_{i+1}, \ldots, x_j)$, the likelihood is,

\[
p(x|\theta, T) = p(x_1^n | T, \theta, x_{-D+1}^0) = \prod_{s \in T}\ \prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1}),
\]

where $B_s$ is the set of indices $i \in \{1, 2, \ldots, n\}$ such that the context of $x_i$ is $s$, and $\theta = \{\theta_s \,;\, s \in T\}$.

II-C Bayesian modelling and inference

For the top level we consider collections of states represented by context trees $T$ in the class $\mathcal{T}(D)$ of all proper $m$-ary trees with depth no greater than $D$, where $T$ is proper if any node in $T$ that is not a leaf has exactly $m$ children.

Prior structure. For the trees $T \in \mathcal{T}(D)$ with maximum depth $D \geq 0$ at the top level of the hierarchical model, we use the Bayesian Context Trees (BCT) prior of [17],

\[
\pi(T) = \pi_D(T; \beta) = \alpha^{|T|-1} \beta^{|T| - L_D(T)}, \tag{2}
\]

where $\beta \in (0,1)$ is a hyperparameter, $\alpha = (1-\beta)^{1/(m-1)}$, $|T|$ is the number of leaves of $T$, and $L_D(T)$ is the number of leaves of $T$ at depth $D$. This prior penalises larger trees by an exponential amount, a desirable property that discourages overfitting. The default value $\beta = 1 - 2^{-m+1}$ [17] is used. Given a tree model $T \in \mathcal{T}(D)$, an independent prior is placed on each $\theta_s$, so that $\pi(\theta|T) = \prod_{s \in T} \pi(\theta_s)$.

Typically, the main obstacle in performing Bayesian inference with a time series $x$ is the computation of the normalising constant (or evidence) $p(x)$ of the posterior distribution,

\[
p(x) = \sum_{T \in \mathcal{T}(D)} \pi(T) \int_{\theta} p(x|T, \theta)\, \pi(\theta|T)\, d\theta. \tag{3}
\]

The power of the proposed Bayesian structure is that, although $\mathcal{T}(D)$ is enormously rich, consisting of doubly-exponentially many models in $D$, it is actually possible to perform exact Bayesian inference efficiently. To that end, we introduce the Continuous Context Tree Weighting (CCTW) algorithm and the Continuous Bayesian Context Tree (CBCT) algorithm, generalising the corresponding algorithms for discrete time series in [17]. It is shown that CCTW computes the normalising constant $p(x)$ exactly (Theorem 1), and that CBCT identifies the MAP tree model (Theorem 2). The main difference from the discrete case in [17], both in the algorithmic descriptions and in the proofs of the theorems (given in Appendix A), is that a new generalised form of estimated probabilities is introduced and used in place of their discrete versions. For a time series $x = x_{-D+1}^n$, these are,

\[
P_e(s, x) = \int \prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\ \pi(\theta_s)\ d\theta_s. \tag{4}
\]

Let $x = x_{-D+1}^n$ be a time series, and let $y_i = Q(x_i)$ denote the corresponding quantised samples.

CCTW: Continuous context-tree weighting algorithm

1. Build the tree $T_{\text{MAX}}$, whose leaves are all the discrete contexts $y_{i-D}^{i-1}$, $i = 1, 2, \ldots, n$. Compute $P_e(s, x)$ as given in (4) for each node $s$ of $T_{\text{MAX}}$.

2. Starting at the leaves and proceeding recursively towards the root, compute:

\[
P_{w,s} = \begin{cases} P_e(s, x), & \text{if } s \text{ is a leaf,} \\ \beta P_e(s, x) + (1-\beta) \prod_{j=0}^{m-1} P_{w, sj}, & \text{otherwise,} \end{cases}
\]

where $sj$ is the concatenation of context $s$ and symbol $j$.
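A minimal sketch of the CCTW recursion, written in log-space for numerical stability (the function name and the dictionary representation of $T_{\text{MAX}}$ are our own; `log_Pe` is assumed to map each context string $s$ of $T_{\text{MAX}}$, including the root `''`, to $\log P_e(s, x)$):

```python
import math

def cctw_log_evidence(log_Pe, max_depth, m, beta):
    """Return log p(x), the weighted probability at the root (Theorem 1)."""
    memo = {}

    def log_Pw(s):
        if s in memo:
            return memo[s]
        if len(s) == max_depth:
            val = log_Pe[s]                      # leaf of T_MAX: P_w = P_e
        else:
            # children that never occur in the data have P_e = P_w = 1,
            # so they contribute 0 in log-space and can simply be skipped
            kids = sum(log_Pw(s + str(j)) for j in range(m) if s + str(j) in log_Pe)
            a = math.log(beta) + log_Pe[s]       # log( beta * P_e(s, x) )
            b = math.log(1.0 - beta) + kids      # log( (1 - beta) * prod_j P_w,sj )
            val = max(a, b) + math.log1p(math.exp(-abs(a - b)))   # log(e^a + e^b)
        memo[s] = val
        return val

    return log_Pw("")
```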

CBCT: Continuous Bayesian context tree algorithm

1. Build the tree $T_{\text{MAX}}$ and compute $P_e(s, x)$ for each node $s$ of $T_{\text{MAX}}$, as in CCTW.

2. Starting at the leaves and proceeding recursively towards the root, compute:

\[
P_{m,s} = \begin{cases} P_e(s, x), & \text{if } s \text{ is a leaf at depth } D, \\ \beta, & \text{if } s \text{ is a leaf at depth } < D, \\ \max\big\{ \beta P_e(s, x),\ (1-\beta) \prod_{j=0}^{m-1} P_{m, sj} \big\}, & \text{otherwise.} \end{cases}
\]

3. Starting at the root and proceeding recursively with its descendants, for each node: if the maximum is achieved by the first term, prune all its descendants from $T_{\text{MAX}}$.
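A corresponding sketch of the CBCT recursion and pruning step, under the same assumptions and representation as the CCTW sketch above; contexts that never occur in the data are handled here under the convention that their estimated probabilities equal 1.

```python
import math

def cbct_map_tree(log_Pe, max_depth, m, beta):
    """Return the leaves of the MAP tree (Theorem 2), as context strings."""
    log_beta, log_1mb = math.log(beta), math.log(1.0 - beta)
    log_Pm, split = {}, {}

    def fill(s):
        if len(s) == max_depth:
            # leaf of T_MAX at depth D; a context that never occurs has P_e = 1
            log_Pm[s], split[s] = log_Pe.get(s, 0.0), False
        elif s not in log_Pe:
            # a context at depth < D that never occurs collapses to P_m = beta
            log_Pm[s], split[s] = log_beta, False
        else:
            kids = 0.0
            for j in range(m):
                fill(s + str(j))
                kids += log_Pm[s + str(j)]
            keep, prune = log_1mb + kids, log_beta + log_Pe[s]
            log_Pm[s], split[s] = max(prune, keep), keep > prune

    fill("")
    leaves, stack = [], [""]
    while stack:                           # top-down pruning pass (step 3)
        s = stack.pop()
        if split[s]:
            stack.extend(s + str(j) for j in range(m))
        else:
            leaves.append(s)
    return leaves
```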

Theorem 1.

The weighted probability $P_{w,s}$ at the root is exactly the normalising constant $p(x)$ of (3).

Theorem 2.

The tree $T^*_1$ produced by the CBCT algorithm is the MAP tree model, as long as $\beta \geq 1/2$.

Even in cases where the integrals in (4) are not tractable, the fact that they are in the form of standard marginal likelihoods makes it possible to compute them approximately using standard methods, e.g., [8, 9, 46]. The above algorithms can then be used with these approximations as a way of performing approximate inference for the BCT-SSM. However, this is not investigated further in this work. Instead, the general principle is illustrated via an interesting example where the estimated probabilities can be computed explicitly and the resulting mixture model is a flexible nonlinear model of practical interest. This is described in the next section, where AR models $\mathcal{M}_s$ are associated with each leaf $s$. We refer to the resulting model as the Bayesian context tree autoregressive (BCT-AR) model, which is a particular instance of the general BCT-SSM.

III The BCT Autoregressive Model

Here we consider the BCT-SSM model class where an AR model of order $p$ is associated with each leaf $s$ of the tree $T$,

\[
x_n = \phi_{s,1} x_{n-1} + \dots + \phi_{s,p} x_{n-p} + e_n = \boldsymbol{\phi}_s^{\text{T}}\, \widetilde{\mathbf{x}}_{n-1} + e_n, \tag{5}
\]

where $e_n \sim \mathcal{N}(0, \sigma_s^2)$, $\widetilde{\mathbf{x}}_{n-1} = (x_{n-1}, \dots, x_{n-p})^{\text{T}}$, and $\boldsymbol{\phi}_s = (\phi_{s,1}, \dots, \phi_{s,p})^{\text{T}}$.

The parameters of the model are the AR coefficients and the noise variance, so that $\theta_s = (\boldsymbol{\phi}_s, \sigma_s^2)$. An inverse-gamma prior is used for the noise variance, and a Gaussian prior is placed on the AR coefficients, so that the joint prior on the parameters is $\pi(\theta_s) = \pi(\boldsymbol{\phi}_s | \sigma_s^2)\, \pi(\sigma_s^2)$, with,

\[
\pi(\sigma_s^2) = \text{Inv-Gamma}(\tau, \lambda), \qquad \pi(\boldsymbol{\phi}_s | \sigma_s^2) = \mathcal{N}(\mu_o, \sigma_s^2 \Sigma_o),
\]

where $(\tau, \lambda, \mu_o, \Sigma_o)$ are the prior hyperparameters. This prior specification allows the exact computation of the estimated probabilities in (4), and gives closed-form posterior distributions for the AR coefficients and the noise variance. These are given in Lemmas 1 and 2, which are proven in Appendix B.

Lemma 1. For the AR model, the estimated probabilities $P_e(s, x)$ of (4) are given by,

\[
P_e(s, x) = C_s^{-1}\ \frac{\Gamma(\tau + |B_s|/2)\ \lambda^{\tau}}{\Gamma(\tau)\ (\lambda + D_s/2)^{\tau + |B_s|/2}}, \tag{6}
\]

where $C_s = \sqrt{(2\pi)^{|B_s|} \det(I + \Sigma_o S_3)}$ and

\[
D_s = s_1 + \mu_o^{\text{T}} \Sigma_o^{-1} \mu_o - (\mathbf{s}_2 + \Sigma_o^{-1} \mu_o)^{\text{T}} (S_3 + \Sigma_o^{-1})^{-1} (\mathbf{s}_2 + \Sigma_o^{-1} \mu_o),
\]

with the sums $s_1, \mathbf{s}_2, S_3$ defined as:

\[
s_1 = \sum_{i \in B_s} x_i^2, \qquad \mathbf{s}_2 = \sum_{i \in B_s} x_i\, \widetilde{\mathbf{x}}_{i-1}, \qquad S_3 = \sum_{i \in B_s} \widetilde{\mathbf{x}}_{i-1} \widetilde{\mathbf{x}}_{i-1}^{\text{T}}.
\]
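For reference, a minimal sketch of the computation in Lemma 1, in log-space (the function name and array-based inputs are our own; `x_s` is assumed to be the vector of samples $(x_i,\ i \in B_s)$ and `X_s` the matrix whose rows are the corresponding lag vectors $\widetilde{\mathbf{x}}_{i-1}$):

```python
import numpy as np
from scipy.special import gammaln

def log_Pe_ar(x_s, X_s, tau, lam, mu0, Sigma0):
    """Compute log P_e(s, x) for the BCT-AR model, as in (6)."""
    n_s = len(x_s)                                   # |B_s|
    s1 = float(x_s @ x_s)                            # s_1
    s2 = X_s.T @ x_s                                 # s_2
    S3 = X_s.T @ X_s                                 # S_3
    Sigma0_inv = np.linalg.inv(Sigma0)
    b = s2 + Sigma0_inv @ mu0
    D_s = s1 + mu0 @ Sigma0_inv @ mu0 - b @ np.linalg.solve(S3 + Sigma0_inv, b)
    # log C_s = (|B_s|/2) log(2*pi) + (1/2) log det(I + Sigma_o S_3)
    log_Cs = (0.5 * n_s * np.log(2.0 * np.pi)
              + 0.5 * np.linalg.slogdet(np.eye(len(mu0)) + Sigma0 @ S3)[1])
    return (-log_Cs + gammaln(tau + n_s / 2.0) + tau * np.log(lam)
            - gammaln(tau) - (tau + n_s / 2.0) * np.log(lam + D_s / 2.0))
```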

Lemma 2. Given a tree model $T$, at each leaf $s$, the posterior distributions of the AR coefficients and the noise variance are,

\[
\pi(\sigma_s^2 | T, x) = \text{Inv-Gamma}\left(\tau + |B_s|/2,\ \lambda + D_s/2\right), \tag{7}
\]
\[
\pi(\boldsymbol{\phi}_s | T, x) = t_{\nu}(\mathbf{m}_s, P_s), \tag{8}
\]

where $t_{\nu}$ denotes a multivariate $t$-distribution with $\nu$ degrees of freedom. Here, $\nu = 2\tau + |B_s|$, and,

\[
\mathbf{m}_s = (S_3 + \Sigma_o^{-1})^{-1} (\mathbf{s}_2 + \Sigma_o^{-1} \mu_o), \qquad P_s^{-1} = \frac{2\tau + |B_s|}{2\lambda + D_s} (S_3 + \Sigma_o^{-1}).
\]

Corollary. The MAP estimators of $\boldsymbol{\phi}_s$ and $\sigma_s^2$ are given by,

\[
\widehat{\boldsymbol{\phi}_s}^{\text{MAP}} = \mathbf{m}_s, \qquad \widehat{\sigma_s^2}^{\text{MAP}} = (2\lambda + D_s)/(2\tau + |B_s| + 2).
\]
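A corresponding sketch of the MAP parameter estimates of the Corollary, with the same inputs and assumptions as `log_Pe_ar` above:

```python
import numpy as np

def map_ar_parameters(x_s, X_s, tau, lam, mu0, Sigma0):
    """Return the MAP estimates (phi_s, sigma_s^2) at leaf s."""
    n_s = len(x_s)
    s1, s2, S3 = float(x_s @ x_s), X_s.T @ x_s, X_s.T @ X_s
    Sigma0_inv = np.linalg.inv(Sigma0)
    b = s2 + Sigma0_inv @ mu0
    phi_map = np.linalg.solve(S3 + Sigma0_inv, b)          # m_s
    D_s = s1 + mu0 @ Sigma0_inv @ mu0 - b @ phi_map
    sigma2_map = (2.0 * lam + D_s) / (2.0 * tau + n_s + 2.0)
    return phi_map, sigma2_map
```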

III-A Computational complexity and sequential updates

For each sample $x_i$ in a time series $x_1^n$, exactly $D+1$ nodes of $T_{\text{MAX}}$ need to be updated, corresponding to its contexts of length $0, 1, \ldots, D$. For each one of these nodes, only the quantities $\{|B_s|, s_1, \mathbf{s}_2, S_3\}$ need to be updated, which can be done efficiently by just adding an extra term to each sum. Using these and Lemma 1, the estimated probabilities $P_e(s, x)$ can be computed for all nodes of $T_{\text{MAX}}$.

Hence, the complexity of both algorithms as a function of $n$ and $D$ is only $\mathcal{O}(nD)$: linear in the length of the time series and the maximum depth. Therefore, the present methods are computationally very efficient and scale well with large numbers of observations. [Taking into account $m$ and $p$ as well, it is easy to see that the complexity is $\mathcal{O}(nD(p^3 + m))$.]

The above discussion also shows that, importantly, all algorithms can be updated sequentially. When observing a new sample $x_{n+1}$, only $D+1$ nodes need to be updated, which requires $\mathcal{O}(D)$ operations, i.e., $\mathcal{O}(1)$ as a function of $n$. In particular, this implies that sequential prediction can be performed very efficiently. Empirical running times in the forecasting experiments show that the present methods are much more efficient than essentially all the alternatives examined. In fact, the difference is quite large, especially when comparing with state-of-the-art ML models that require heavy training and do not allow for efficient sequential updates, usually giving empirical running times that are larger by several orders of magnitude; see also [24] for a relevant review comparing the computational requirements of ML versus statistical techniques.
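A minimal sketch of these sequential updates (the class and function names are ours; `stats` is assumed to be a dictionary mapping context strings to their sufficient statistics):

```python
import numpy as np

class NodeStats:
    """Sufficient statistics {|B_s|, s_1, s_2, S_3} of a single node."""
    def __init__(self, p):
        self.count, self.s1 = 0, 0.0
        self.s2, self.S3 = np.zeros(p), np.zeros((p, p))

    def add(self, x_i, x_lags):
        # x_lags is the lag vector (x_{i-1}, ..., x_{i-p})
        self.count += 1
        self.s1 += x_i ** 2
        self.s2 += x_i * x_lags
        self.S3 += np.outer(x_lags, x_lags)

def observe(stats, x_i, x_lags, quantised_past):
    """Update the D+1 contexts of x_i (lengths 0, 1, ..., D) in O(D) time."""
    p = len(x_lags)
    s = ""
    stats.setdefault(s, NodeStats(p)).add(x_i, x_lags)
    for symbol in quantised_past:            # most recent quantised symbol first
        s += str(symbol)
        stats.setdefault(s, NodeStats(p)).add(x_i, x_lags)
```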

III-B Choosing the quantiser and AR order

Finally, a principled Bayesian approach is introduced for the selection of the quantiser thresholds $\{c_i\}$ of (1) and the AR order $p$. Viewing these as extra parameters on an additional layer above everything else, uniform priors are placed on $\{c_i\}$ and $p$, and Bayesian model selection [22] is performed to obtain their MAP values: the resulting posterior distribution $p(\{c_i\}, p \,|\, x)$ is proportional to the evidence $p(x \,|\, \{c_i\}, p)$, which can be computed exactly using the CCTW algorithm (Theorem 1). So, in order to select appropriate values, suitable ranges of possible $\{c_i\}$ and $p$ are specified, and the values with the highest evidence are selected. For the AR order we take $1 \leq p \leq p_{\text{max}}$ for an appropriate $p_{\text{max}}$ ($p_{\text{max}} = 5$ in our experiments), and for the $\{c_i\}$ we perform a grid search over a reasonable range (e.g., between the 10th and 90th percentiles of the data).
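A minimal sketch of this selection step (the function `log_evidence` is assumed to compute $\log p(x \,|\, \{c_i\}, p)$ via the CCTW recursion of Theorem 1; the candidate threshold sets and `p_max` are user-specified):

```python
import itertools

def select_quantiser_and_order(x, candidate_thresholds, p_max, log_evidence):
    """Grid search over quantiser thresholds {c_i} and AR order p,
    keeping the combination with the highest evidence p(x | {c_i}, p)."""
    best, best_score = None, float("-inf")
    for thresholds, p in itertools.product(candidate_thresholds, range(1, p_max + 1)):
        score = log_evidence(x, thresholds, p)
        if score > best_score:
            best, best_score = (thresholds, p), score
    return best   # the MAP pair ({c_i}, p) under uniform priors
```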

IV Experimental results

The performance of the BCT-AR model is evaluated on simulated and real-world data from standard applications of nonlinear time series in economics and finance. It is compared with the most successful previous approaches for these types of applications, including both classical and modern ML methods. Useful resources include the R package forecast [16] and the Python library ‘GluonTS’ [2], containing implementations of state-of-the-art classical and ML methods, respectively. We briefly discuss the methods used, and refer to the packages’ documentation for more details on the methods themselves and the training procedures carried out. Among classical statistical approaches, we compare with ARIMA and exponential smoothing state space (ETS) models [15], with self-exciting threshold autoregressive (SETAR) models [41], and with mixture autoregressive (MAR) models [45]. Among ML techniques, we compare with the neural network AR (NNAR) model [48], and with the most successful RNN-based approach, ‘deepAR’ [39].

IV-A Simulated data

First an experiment on simulated data is performed, illustrating that the present methods are consistent and effective with data generated by BCT-SSM models. The context tree used is the tree of Figure 1, the quantiser threshold is $c = 0$, and the AR order is $p = 2$. The posterior distribution over trees, $\pi(T|x)$, is examined. On a time series consisting of only $n = 100$ observations, the MAP tree identified by the CBCT algorithm is the empty tree corresponding to a single AR model, with posterior probability 99.9%. This means that the data do not provide sufficient evidence to support a more complex state space partition. With $n = 300$ observations, the MAP tree is now the true underlying model, with posterior probability 57%. With $n = 500$ observations, the posterior probability of the true model is 99.9%. So, the posterior indeed concentrates on the true model, indicating that the BCT-AR inferential framework can be very effective even with limited training data.

Forecasting. The performance of the BCT-AR model is evaluated on out-of-sample 1-step-ahead forecasts, and compared with state-of-the-art approaches on four simulated and three real datasets. The first simulated dataset (sim_1) is generated by the BCT-AR model used above, and the second (sim_2) by a BCT-AR model with a ternary tree of depth 2. The third and fourth ones (sim_3 and sim_4) are generated from SETAR models of orders $p = 1$ and $p = 5$, respectively. In each case, the training set consists of the first 50% of the observations; also, all models are updated at every time-step in the test set. For BCT-AR, the MAP tree with its MAP parameters is used at every time-step, which can be updated efficiently (Section III-A); a minimal sketch of this forecasting step is given after Table I. From Table I, it is observed that the BCT-AR model outperforms the alternatives, achieving the lowest mean squared error (MSE) even on the two datasets generated from SETAR models. As discussed in Section III-A, the BCT-AR model also outperforms the alternatives in terms of empirical running times.

TABLE I: Mean squared error (MSE) of forecasts
BCT-AR ARIMA ETS NNAR deepAR SETAR MAR
sim_1 0.131 0.150 0.178 0.143 0.148 0.141 0.151
sim_2 0.035 0.050 0.054 0.048 0.061 0.050 0.064
sim_3 0.216 0.267 0.293 0.252 0.273 0.243 0.283
sim_4 0.891 1.556 1.614 1.287 1.573 0.951 1.543
unemp 0.034 0.040 0.042 0.036 0.036 0.038 0.037
gnp 0.324 0.364 0.378 0.393 0.473 0.394 0.384
ibm 78.02 82.90 77.52 78.90 75.71 81.07 77.02
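A minimal sketch of the 1-step-ahead BCT-AR forecast at each time-step, under the conventions of the earlier sketches (the MAP tree is represented by its set of leaves, and `map_phi[s]` is assumed to hold the MAP AR coefficients of leaf $s$, e.g. as returned by `map_ar_parameters`; any constant term is omitted for simplicity):

```python
import numpy as np

def one_step_forecast(x_past, map_leaves, map_phi, thresholds, p):
    """Point forecast of x_n given the observed past x_past = (..., x_{n-2}, x_{n-1})."""
    # quantise the recent past, most recent sample first
    quantised = [int(np.searchsorted(thresholds, v, side="right")) for v in x_past[::-1]]
    s = ""
    for symbol in quantised:                 # descend the MAP tree to find the state
        if s in map_leaves:
            break
        s += str(symbol)
    x_lags = np.asarray(x_past[-p:][::-1])   # lag vector (x_{n-1}, ..., x_{n-p})
    return float(map_phi[s] @ x_lags)
```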

IV-B US unemployment rate

An important application of SETAR models is in modelling the US unemployment rate [13, 26, 42]. As described in [42], the unemployment rate moves countercyclically with business cycles, and rises quickly but decays slowly, indicating nonlinear behaviour. Here, the quarterly US unemployment rate in the time period from 1948 to 2019 is examined (dataset unemp, 288 observations). Following [26], the difference series $\Delta x_n = x_n - x_{n-1}$ is considered, and a constant term is included in the AR model. For the quantiser alphabet size, $m = 2$ is a natural choice, as will become apparent below. The threshold selected using the procedure of Section III-B is $c = 0.15$, and the MAP tree is the tree of Figure 1, with depth $d = 2$, leaves $\{1, 01, 00\}$, and posterior probability 91.5%. The fitted BCT-AR model with its MAP parameters is,

\[
\Delta x_n = \begin{cases} \phantom{-}0.09 + 0.72\, \Delta x_{n-1} - 0.30\, \Delta x_{n-2} + 0.42\, e_n, \\ \phantom{-}0.04 + 0.29\, \Delta x_{n-1} - 0.32\, \Delta x_{n-2} + 0.32\, e_n, \\ -0.02 + 0.34\, \Delta x_{n-1} + 0.19\, \Delta x_{n-2} + 0.20\, e_n, \end{cases}
\]

with $e_n \sim \mathcal{N}(0, 1)$, and corresponding regions $s = 1$ if $\Delta x_{n-1} > 0.15$, $s = 01$ if $\Delta x_{n-1} \leq 0.15,\ \Delta x_{n-2} > 0.15$, and $s = 00$ if $\Delta x_{n-1} \leq 0.15,\ \Delta x_{n-2} \leq 0.15$.

Interpretation. The MAP BCT-AR model finds significant structure in the data, providing a very natural interpretation. It identifies 3 meaningful states: First, jumps in the unemployment rate higher than 0.15 signify economic contractions (context 1). If there is no jump at the most recent time-point, the model looks further back to determine the state. Context 00 signifies a stable economy, as there are no jumps in the unemployment rate for two consecutive quarters. Finally, context 01 identifies an intermediate state: “stabilising just after a contraction”. An important feature identified by the BCT-AR model is that the volatility is different in each case: Higher in contractions ($\sigma = 0.42$), smaller in stable economy regions ($\sigma = 0.20$), and in-between for context 01 ($\sigma = 0.32$).

Forecasting. In addition to its appealing interpretation, the BCT-AR model outperforms all benchmarks in forecasting, giving a 6% lower MSE than the second-best method (Table I).

IV-C US Gross National Product

Figure 2: MAP tree.

Another important example of a nonlinear time series is the US Gross National Product (GNP) [33, 13]. The quarterly US GNP in the time period from 1947 to 2019 is examined (291 observations, dataset gnp). Following [33], the difference in the logarithm of the series, $y_n = \log x_n - \log x_{n-1}$, is considered. As above, $m = 2$ is a natural choice for the quantiser size, helping to differentiate economic expansions from contractions – which govern the underlying dynamics. The MAP BCT-AR tree is shown in Figure 2: It has depth $d = 3$, its leaves are $\{0, 10, 110, 111\}$, and its posterior probability is 42.6%.

Interpretation. Compared with the previous example, here the MAP BCT-AR model finds even richer structure in the data and identifies four meaningful states. First, as before, there is a single state corresponding to an economic contraction (which now corresponds to $s = 0$ instead of $s = 1$, as the GNP obviously increases in expansions and decreases in contractions). And again, the model does not look further back whenever a contraction is detected. Here, the model also shows that the effect of a contraction is still present even after three quarters ($s = 110$), and that the exact ‘distance’ from a contraction is also important, with the dynamics changing depending on how much time has elapsed. Finally, the state $s = 111$ corresponds to a flourishing, expanding economy, without a contraction in the recent past. An important feature captured by the BCT-AR model is again that the volatility is different in each case. More specifically, it is found that the volatility strictly decreases with the distance from the last contraction, starting with the maximum $\sigma = 1.23$ for $s = 0$ and decreasing to $\sigma = 0.75$ for $s = 111$.

Forecasting. The BCT-AR model outperforms all benchmarks in forecasting, giving a 12% lower MSE than the second-best method, as presented in Table I.

IV-D The stock price of IBM

Finally, the daily IBM common stock closing price from May 17, 1961 to November 2, 1962 is examined (dataset ibm, 369 observations), taken from [7]. This is a well-studied dataset, with [7] fitting an ARIMA model, [40] fitting a SETAR model, and [45] fitting a MAR model to the data. Following previous approaches, the first-difference series, $\Delta x_n = x_n - x_{n-1}$, is considered. The value $m = 3$ is chosen for the alphabet size of the quantiser, with contexts $\{0, 1, 2\}$ naturally corresponding to the states {down, steady, up} for the stock price. Using the procedure of Section III-B to select the thresholds, the resulting quantiser regions are: $s = 0$ if $\Delta x_{n-1} < -7$, $s = 2$ if $\Delta x_{n-1} > 7$, and $s = 1$ otherwise. The MAP tree is shown in Figure 3: It has depth $d = 2$, and its leaves are $\{0, 2, 10, 11, 12\}$, hence identifying five states. Its posterior probability is 99.3%, suggesting that there is very strong evidence in the data supporting this exact structure, even with only 369 observations.

Figure 3: MAP tree.

Interpretation. The BCT-AR model reveals important information about apparent structure in the data. Firstly, it admits a very simple and natural interpretation: In order to determine the AR model generating the next value, one needs to look back until there is a significant enough price change (corresponding to contexts 0, 2, 10, 12), or until the maximum depth of 2 (context 11) is reached. Another important feature captured by this model is the commonly observed asymmetric response in volatility due to positive and negative shocks, sometimes called the leverage effect [42, 7]. Even though there is no suggestion of that in the prior, the MAP parameter estimates show that negative shocks increase the volatility much more: Context 0 has the highest volatility ($\sigma = 12.3$), with 10 being a close second ($\sigma = 10.8$), showing that the effect of a past shock is still present. In all other cases the volatility is much smaller ($5.17 \leq \sigma \leq 6.86$).

Forecasting. As shown in Table I, in this example deepAR performs marginally better in 1-step ahead forecasts, with BCT-AR, ETS and MAR also having comparable performance.

References

  • [1] N.K. Ahmed, A.F. Atiya, N.E. Gayar, and H. El-Shishiny. An empirical comparison of machine learning models for time series forecasting. Econometric reviews, 29(5-6):594–621, 2010.
  • [2] A. Alexandrov, K. Benidis, M. Bohlke-Schneider, V. Flunkert, J. Gasthaus, T. Januschowski, D.C. Maddix, S. Rangapuram, D. Salinas, J. Schulz, et al. GluonTS: Probabilistic time series models in python. arXiv preprint arXiv:1906.05264, 2019.
  • [3] F.M. Alvarez, A. Troncoso, J.C. Riquelme, and J.S.A. Ruiz. Energy time series forecasting based on pattern sequence similarity. IEEE Transactions on Knowledge and Data Engineering, 23(8):1230–1243, 2010.
  • [4] S. Alvisi, M. Franchini, and A. Marinelli. A short-term, pattern-based model for water-demand forecasting. Journal of Hydroinformatics, 9(1):39–50, 2007.
  • [5] K. Benidis, S.S. Rangapuram, V. Flunkert, B. Wang, D. Maddix, C. Turkmen, J. Gasthaus, M. Bohlke-Schneider, D. Salinas, L. Stella, et al. Neural forecasting: Introduction and literature overview. arXiv preprint arXiv:2004.10240, 2020.
  • [6] D.J. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In KDD Workshop, volume 10, pages 359–370. Seattle, WA, USA, 1994.
  • [7] G.E.P. Box, G.M. Jenkins, G.C. Reinsel, and G.M. Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
  • [8] S. Chib. Marginal likelihood from the Gibbs output. Journal of the american statistical association, 90(432):1313–1321, 1995.
  • [9] S. Chib and I. Jeliazkov. Marginal likelihood from the Metropolis–Hastings output. Journal of the American statistical association, 96(453):270–281, 2001.
  • [10] C. Faloutsos, J. Gasthaus, T. Januschowski, and Y. Wang. Forecasting big time series: old and new. Proceedings of the VLDB Endowment, 11(12):2102–2105, 2018.
  • [11] R. Frigola. Bayesian time series learning with Gaussian processes. PhD thesis, University of Cambridge, 2015.
  • [12] T.C. Fu, F.L. Chung, R. Luk, and C.M. Ng. Stock time series pattern matching: Template-based vs. rule-based approaches. Engineering Applications of Artificial Intelligence, 20(3):347–364, 2007.
  • [13] B.E. Hansen. Threshold autoregression in economics. Statistics and its Interface, 4(2):123–127, 2011.
  • [14] Q. Hu, P. Su, D. Yu, and J. Liu. Pattern-based wind speed prediction based on generalized principal component analysis. IEEE Transactions on Sustainable Energy, 5(3):866–874, 2014.
  • [15] R. Hyndman, A.B. Koehler, J.K. Ord, and R.D. Snyder. Forecasting with exponential smoothing: the state space approach. Springer Science & Business Media, 2008.
  • [16] R.J. Hyndman and Y. Khandakar. Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 26(3):1–22, 2008. https://CRAN.R-project.org/package=forecast.
  • [17] I. Kontoyiannis, L. Mertzanis, A. Panotopoulou, I. Papageorgiou, and M. Skoularidou. Bayesian Context Trees: Modelling and exact inference for discrete time series. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(4):1287–1323, 2022.
  • [18] S.S. Kozat, A.C. Singer, and G.C. Zeitler. Universal piecewise linear prediction via context trees. IEEE Transactions on Signal Processing, 55(7):3730–3745, 2007.
  • [19] X. Liu, Z. Ni, D. Yuan, Y. Jiang, Z. Wu, J. Chen, and Y. Yang. A novel statistical time-series pattern based interval forecasting strategy for activity durations in workflow systems. Journal of Systems and Software, 84(3):354–376, 2011.
  • [20] V. Lungu, I. Papageorgiou, and I. Kontoyiannis. Change-point detection and segmentation of discrete data using Bayesian Context Trees. arXiv preprint arXiv:2203.04341, 2021.
  • [21] V. Lungu, I. Papageorgiou, and I. Kontoyiannis. Bayesian change-point detection via context-tree weighting. In 2022 IEEE Information Theory Workshop (ITW), pages 125–130. IEEE, 2022.
  • [22] D.J.C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.
  • [23] S. Makridakis, E. Spiliotis, and V. Assimakopoulos. The M4 Competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4):802–808, 2018.
  • [24] S. Makridakis, E. Spiliotis, and V. Assimakopoulos. Statistical and machine learning forecasting methods: Concerns and ways forward. PloS one, 13(3):e0194889, 2018.
  • [25] T. Matsushima and S. Hirasawa. Reducing the space complexity of a Bayes coding algorithm using an expanded context tree. In 2009 IEEE International Symposium on Information Theory, pages 719–723. IEEE, 2009.
  • [26] A.L. Montgomery, V. Zarnowitz, R.S. Tsay, and G.C. Tiao. Forecasting the US unemployment rate. Journal of the American Statistical Association, 93(442):478–493, 1998.
  • [27] Y. Nakahara, S. Saito, A. Kamatsuka, and T. Matsushima. Probability distribution on full rooted trees. Entropy, 24(3):328, 2022.
  • [28] G. Ouyang, C. Dang, D.A. Richards, and X. Li. Ordinal pattern based similarity analysis for eeg recordings. Clinical Neurophysiology, 121(5):694–703, 2010.
  • [29] I. Papageorgiou and I. Kontoyiannis. The posterior distribution of Bayesian Context-Tree models: Theory and applications. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 702–707. IEEE, 2022.
  • [30] I. Papageorgiou and I. Kontoyiannis. Posterior representations for Bayesian Context Trees: Sampling, estimation and convergence. Bayesian Analysis, to appear, arXiv preprint arXiv:2202.02239, 2022.
  • [31] I. Papageorgiou and I. Kontoyiannis. Truly Bayesian entropy estimation. arXiv preprint arXiv:2212.06705, 2022.
  • [32] I. Papageorgiou, I. Kontoyiannis, L. Mertzanis, A. Panotopoulou, and M. Skoularidou. Revisiting context-tree weighting for Bayesian inference. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 2906–2911, 2021.
  • [33] S.M. Potter. A nonlinear approach to US GNP. Journal of applied econometrics, 10(2):109–125, 1995.
  • [34] C.E. Rasmussen and C.K.I. Williams. Gaussian processes for machine learning. MIT Press, 2006.
  • [35] J. Rissanen. A universal data compression system. IEEE Transactions on Information Theory, 29(5):656–664, 1983.
  • [36] S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain. Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20110550, 2013.
  • [37] J.J. Ryu, A. Bhatt, and Y.H. Kim. Parameter-free online linear optimization with side information via universal coin betting. In International Conference on Artificial Intelligence and Statistics, pages 6022–6044. PMLR, 2022.
  • [38] E. Sabeti, P.X.K. Song, and A.O. Hero. Pattern-based analysis of time series: Estimation. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 1236–1241, 2020.
  • [39] D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
  • [40] H. Tong. Non-linear time series: a dynamical system approach. Oxford University Press, 1990.
  • [41] H. Tong and K.S. Lim. Threshold autoregression, limit cycles and cyclical data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 42(3):245–268, 1980.
  • [42] R.S. Tsay. Analysis of financial time series, volume 543. John Wiley & Sons, 2005.
  • [43] F.M.J. Willems. The context-tree weighting method: extensions. IEEE Transactions on Information Theory, 44(2):792–798, 1998.
  • [44] F.M.J. Willems, Y.M. Shtarkov, and T.J. Tjalkens. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41(3):653–664, 1995.
  • [45] C. S. Wong and W. K. Li. On a mixture autoregressive model. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1):95–115, 2000.
  • [46] S.N. Wood. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(1):3–36, 2011.
  • [47] H.F. Yu, N. Rao, and I.S. Dhillon. Temporal regularized matrix factorization for high-dimensional time series prediction. Advances in neural information processing systems, 29, 2016.
  • [48] G. Zhang, B.E. Patuwo, and M.Y. Hu. Forecasting with artificial neural networks: The state of the art. International journal of forecasting, 14(1):35–62, 1998.

A. Proofs of Theorems 1 and 2

The important observation here is that, using the new, different form of the estimated probabilities $P_e(s, x)$ in (4), it is still possible to factorise the marginal likelihoods $p(x|T)$ for the general BCT-SSM as,

\[
p(x|T) = \int p(x|\theta, T)\, \pi(\theta|T)\, d\theta = \prod_{s \in T} \bigg( \int \prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\ \pi(\theta_s)\ d\theta_s \bigg) = \prod_{s \in T} P_e(s, x),
\]

where the second equality follows from the likelihood of the general BCT-SSM, and the fact that we use independent priors on the parameters at the leaves, so that $\pi(\theta|T) = \prod_{s \in T} \pi(\theta_s)$. Then, the proofs of Theorems 1 and 2 follow along the same lines as the proofs of the corresponding results for discrete time series in [17], with the main difference being that the new general version of the estimated probabilities $P_e(s, x)$ of (4) needs to be used in place of their discrete versions.

The $k$-BCT algorithm of [17] can be generalised in a similar manner to the way the CTW and BCT algorithms were generalised. The resulting algorithm identifies the top-$k$ a posteriori most likely context trees. The proof of the theorem claiming this is similar to the proof of Theorem 3.3 of [17]. Again, the important difference, both in the algorithm description and in the proof, is that the estimated probabilities $P_e(s, x)$ are used in place of their discrete counterparts $P_e(a_s)$.

B. Proofs of Lemmas 1 and 2

The proofs of these lemmas are mostly based on explicit computations. Recall that, for each context $s$, the set $B_s$ consists of those indices $i \in \{1, 2, \ldots, n\}$ such that the context of $x_i$ is $s$. The important step in the following two proofs is the factorisation of the likelihood using the sets $B_s$. In order to prove the lemmas for the AR model with parameters $\theta_s = (\boldsymbol{\phi}_s, \sigma_s^2)$, we first consider an intermediate step in which the noise variance is assumed to be known and equal to $\sigma^2$.

B.1. Known noise variance

Here, an AR model with known variance $\sigma^2$ is associated with every leaf $s$ of the context tree $T$, so that,

\[
x_n = \phi_{s,1} x_{n-1} + \dots + \phi_{s,p} x_{n-p} + e_n = \boldsymbol{\phi}_s^{\text{T}}\, \widetilde{\mathbf{x}}_{n-1} + e_n, \qquad e_n \sim \mathcal{N}(0, \sigma^2). \tag{9}
\]

In this setting, the parameters of the model are only the AR coefficients: $\theta_s = \boldsymbol{\phi}_s$. For these, a Gaussian prior is used,

\[
\theta_s \sim \mathcal{N}(\mu_o, \Sigma_o), \tag{10}
\]

where $\mu_o, \Sigma_o$ are hyperparameters.

Lemma B. The estimated probabilities $P_e(s, x)$ for the known-variance case are given by,

\[
P_e(s, x) = \frac{1}{(2\pi\sigma^2)^{|B_s|/2}}\ \frac{1}{\sqrt{\det(I + \Sigma_o S_3 / \sigma^2)}}\ \exp\bigg\{ -\frac{E_s}{2\sigma^2} \bigg\}, \tag{11}
\]

where $I$ is the identity matrix and $E_s$ is given by:

\[
E_s = s_1 + \sigma^2 \mu_o^{\text{T}} \Sigma_o^{-1} \mu_o - (\mathbf{s}_2 + \sigma^2 \Sigma_o^{-1} \mu_o)^{\text{T}} (S_3 + \sigma^2 \Sigma_o^{-1})^{-1} (\mathbf{s}_2 + \sigma^2 \Sigma_o^{-1} \mu_o). \tag{12}
\]
Proof.

For the AR model of (9),

\[
p(x_i | T, \theta_s, x_{-D+1}^{i-1}) = \frac{1}{\sqrt{2\pi\sigma^2}}\ \exp\bigg\{ -\frac{1}{2\sigma^2} (x_i - \theta_s^{\text{T}} \widetilde{\mathbf{x}}_{i-1})^2 \bigg\},
\]

so that,

\[
\prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1}) = \frac{1}{(\sqrt{2\pi\sigma^2})^{|B_s|}}\ \exp\bigg\{ -\frac{1}{2\sigma^2} \sum_{i \in B_s} (x_i - \theta_s^{\text{T}} \widetilde{\mathbf{x}}_{i-1})^2 \bigg\}.
\]

Expanding the sum in the exponent gives,

\[
\sum_{i \in B_s} (x_i - \theta_s^{\text{T}} \widetilde{\mathbf{x}}_{i-1})^2 = \sum_{i \in B_s} x_i^2 - 2 \theta_s^{\text{T}} \sum_{i \in B_s} x_i \widetilde{\mathbf{x}}_{i-1} + \theta_s^{\text{T}} \sum_{i \in B_s} \widetilde{\mathbf{x}}_{i-1} \widetilde{\mathbf{x}}_{i-1}^{\text{T}}\, \theta_s = s_1 - 2 \theta_s^{\text{T}} \mathbf{s}_2 + \theta_s^{\text{T}} S_3 \theta_s,
\]

from which we obtain that,

\[
\prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1}) = \frac{1}{(\sqrt{2\pi\sigma^2})^{|B_s|}}\ \exp\bigg\{ -\frac{1}{2\sigma^2} (s_1 - 2\theta_s^{\text{T}} \mathbf{s}_2 + \theta_s^{\text{T}} S_3 \theta_s) \bigg\} = (\sqrt{2\pi})^p \rho_s\ \mathcal{N}(\theta_s; \boldsymbol{\mu}, S),
\]

by completing the square, where $\boldsymbol{\mu} = S_3^{-1} \mathbf{s}_2$, $S = \sigma^2 S_3^{-1}$, and,

\[
\rho_s = \sqrt{\frac{\det(\sigma^2 S_3^{-1})}{(2\pi\sigma^2)^{|B_s|}}}\ \exp\bigg\{ -\frac{1}{2\sigma^2} (s_1 - \mathbf{s}_2^{\text{T}} S_3^{-1} \mathbf{s}_2) \bigg\}. \tag{13}
\]

So, multiplying with the prior,

\[
\prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\, \pi(\theta_s) = (\sqrt{2\pi})^p \rho_s\ \mathcal{N}(\theta_s; \boldsymbol{\mu}, S)\ \mathcal{N}(\theta_s; \mu_o, \Sigma_o) = \rho_s Z_s\ \mathcal{N}(\theta_s; \mathbf{m}, \Sigma),
\]

where $\Sigma^{-1} = \Sigma_o^{-1} + S^{-1}$, $\mathbf{m} = \Sigma\, (\Sigma_o^{-1} \mu_o + S^{-1} \boldsymbol{\mu})$, and,

\[
Z_s = \frac{1}{\sqrt{\det(\Sigma_o + \sigma^2 S_3^{-1})}}\ \exp\bigg\{ -\frac{1}{2} (\mu_o - S_3^{-1} \mathbf{s}_2)^{\text{T}} (\Sigma_o + \sigma^2 S_3^{-1})^{-1} (\mu_o - S_3^{-1} \mathbf{s}_2) \bigg\}. \tag{14}
\]

Therefore,

\[
\prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\, \pi(\theta_s) = \rho_s Z_s\ \mathcal{N}(\theta_s; \mathbf{m}, \Sigma), \tag{15}
\]

and hence,

\[
P_e(s, x) = \int \prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\ \pi(\theta_s)\ d\theta_s = \rho_s Z_s.
\]

Using standard matrix inversion properties, after some algebra the product $\rho_s Z_s$ can be rearranged to give exactly the required expression in (11). ∎

B.2. Proof of Lemma 1

Returning to the original setting described in the main text, the noise variance is considered to be a parameter of the AR model, so that $\theta_s = (\boldsymbol{\phi}_s, \sigma_s^2)$. Here, the joint prior on the parameters is $\pi(\theta_s) = \pi(\boldsymbol{\phi}_s | \sigma_s^2)\, \pi(\sigma_s^2)$, where,

\[
\sigma_s^2 \sim \text{Inv-Gamma}(\tau, \lambda), \qquad \boldsymbol{\phi}_s | \sigma_s^2 \sim \mathcal{N}(\mu_o, \sigma_s^2 \Sigma_o), \tag{16}
\]

and where $(\tau, \lambda, \mu_o, \Sigma_o)$ are hyperparameters. For $P_e(s, x)$, we just need to compute the integral:

\[
P_e(s, x) = \int \prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\ \pi(\theta_s)\ d\theta_s = \int \pi(\sigma_s^2) \left( \int \prod_{i \in B_s} p(x_i | T, \boldsymbol{\phi}_s, \sigma_s^2, x_{-D+1}^{i-1})\ \pi(\boldsymbol{\phi}_s | \sigma_s^2)\ d\boldsymbol{\phi}_s \right) d\sigma_s^2. \tag{17}
\]

The inner integral has exactly the form of the estimated probabilities $P_e(s, x)$ from the previous section, where the noise variance was fixed. The only difference is that the prior $\pi(\boldsymbol{\phi}_s | \sigma_s^2)$ of (16) now has covariance matrix $\sigma_s^2 \Sigma_o$ instead of $\Sigma_o$. So, using (11)-(12), with $\Sigma_o$ replaced by $\sigma_s^2 \Sigma_o$, we get,

\[
P_e(s, x) = \int \pi(\sigma_s^2) \bigg\{ C_s^{-1} \bigg( \frac{1}{\sigma_s^2} \bigg)^{|B_s|/2} \exp\bigg( -\frac{D_s}{2\sigma_s^2} \bigg) \bigg\}\, d\sigma_s^2,
\]

with $C_s$ and $D_s$ as in Lemma 1. And using the inverse-gamma prior $\pi(\sigma_s^2)$ of (16),

\[
P_e(s, x) = C_s^{-1}\ \frac{\lambda^{\tau}}{\Gamma(\tau)}\ \int \bigg( \frac{1}{\sigma_s^2} \bigg)^{\tau' + 1} \exp\bigg( -\frac{\lambda'}{\sigma_s^2} \bigg)\, d\sigma_s^2, \tag{18}
\]

with $\tau' = \tau + \frac{|B_s|}{2}$ and $\lambda' = \lambda + \frac{D_s}{2}$. The integrand in (18) has the form of an (unnormalised) inverse-gamma density with parameters $\tau'$ and $\lambda'$, so the integral can be evaluated in closed form, giving, as required,

\[
P_e(s, x) = C_s^{-1}\ \frac{\lambda^{\tau}}{\Gamma(\tau)}\ \frac{\Gamma(\tau')}{(\lambda')^{\tau'}},
\]

which completes the proof of the lemma.

B.3. Proof of Lemma 2

In order to derive the required expressions for the posterior distributions of $\boldsymbol{\phi}_s$ and $\sigma_s^2$, for a leaf $s$ of model $T$, first consider the joint posterior $\pi(\theta_s | T, x) = \pi(\boldsymbol{\phi}_s, \sigma_s^2 | T, x)$, given by,

\[
\pi(\theta_s | T, x) \propto p(x | T, \theta_s)\, \pi(\theta_s) = \prod_{i=1}^{n} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\, \pi(\theta_s) \propto \prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\, \pi(\theta_s),
\]

where we used the fact that, in the product, only the terms involving indices $i \in B_s$ are functions of $\theta_s$. So,

\[
\pi(\boldsymbol{\phi}_s, \sigma_s^2 | T, x) \propto \left( \prod_{i \in B_s} p(x_i | T, \boldsymbol{\phi}_s, \sigma_s^2, x_{-D+1}^{i-1})\ \pi(\boldsymbol{\phi}_s | \sigma_s^2) \right) \pi(\sigma_s^2).
\]

Here, the first two terms can be computed from (15) of the previous section, where the noise variance was known. Again, the only difference is that we have to replace $\Sigma_o$ with $\sigma_s^2 \Sigma_o$ because of the prior $\pi(\boldsymbol{\phi}_s | \sigma_s^2)$ defined in (16). After some algebra,

\[
\pi(\boldsymbol{\phi}_s, \sigma_s^2 | T, x) \propto \bigg( \frac{1}{\sigma_s^2} \bigg)^{|B_s|/2} \exp\bigg( -\frac{D_s}{2\sigma_s^2} \bigg)\ \mathcal{N}(\boldsymbol{\phi}_s; \mathbf{m}_s, \Sigma_s)\ \pi(\sigma_s^2),
\]

with $\mathbf{m}_s$ defined as in Lemma 2, and $\Sigma_s = \sigma_s^2 (S_3 + \Sigma_o^{-1})^{-1}$. Substituting the prior $\pi(\sigma_s^2)$ in the last expression,

\[
\pi(\boldsymbol{\phi}_s, \sigma_s^2 | T, x) \propto \bigg( \frac{1}{\sigma_s^2} \bigg)^{\tau + 1 + |B_s|/2} \exp\bigg( -\frac{\lambda + D_s/2}{\sigma_s^2} \bigg)\ \mathcal{N}(\boldsymbol{\phi}_s; \mathbf{m}_s, \Sigma_s). \tag{19}
\]

From (19), it is easy to integrate out $\boldsymbol{\phi}_s$ and get the posterior of $\sigma_s^2$,

\[
\pi(\sigma_s^2 | T, x) = \int \pi(\boldsymbol{\phi}_s, \sigma_s^2 | T, x)\ d\boldsymbol{\phi}_s \propto \bigg( \frac{1}{\sigma_s^2} \bigg)^{\tau + 1 + |B_s|/2} \exp\bigg( -\frac{\lambda + D_s/2}{\sigma_s^2} \bigg),
\]

which is of the form of an inverse-gamma distribution with parameters $\tau' = \tau + \frac{|B_s|}{2}$ and $\lambda' = \lambda + \frac{D_s}{2}$, proving the first part of the lemma.

However, as $\Sigma_s$ is a function of $\sigma_s^2$, integrating out $\sigma_s^2$ requires more algebra. We have,

\[
\mathcal{N}(\boldsymbol{\phi}_s; \mathbf{m}_s, \Sigma_s) \propto \frac{1}{\sqrt{\det(\Sigma_s)}}\ \exp\bigg\{ -\frac{1}{2} (\boldsymbol{\phi}_s - \mathbf{m}_s)^{\text{T}} \Sigma_s^{-1} (\boldsymbol{\phi}_s - \mathbf{m}_s) \bigg\} \propto \bigg( \frac{1}{\sigma_s^2} \bigg)^{p/2} \exp\bigg\{ -\frac{1}{2\sigma_s^2} (\boldsymbol{\phi}_s - \mathbf{m}_s)^{\text{T}} (S_3 + \Sigma_o^{-1}) (\boldsymbol{\phi}_s - \mathbf{m}_s) \bigg\},
\]

and substituting this in (19) gives that $\pi(\boldsymbol{\phi}_s, \sigma_s^2 | T, x)$ is proportional to,

\[
\bigg( \frac{1}{\sigma_s^2} \bigg)^{\tau + 1 + \frac{|B_s| + p}{2}} \exp\bigg\{ -\frac{1}{2\sigma_s^2} \bigg( 2\lambda + D_s + (\boldsymbol{\phi}_s - \mathbf{m}_s)^{\text{T}} (S_3 + \Sigma_o^{-1}) (\boldsymbol{\phi}_s - \mathbf{m}_s) \bigg) \bigg\},
\]

which, as a function of $\sigma_s^2$, has the form of an inverse-gamma density, allowing $\sigma_s^2$ to be integrated out. Denoting $L = 2\lambda + D_s + (\boldsymbol{\phi}_s - \mathbf{m}_s)^{\text{T}} (S_3 + \Sigma_o^{-1}) (\boldsymbol{\phi}_s - \mathbf{m}_s)$, and $\widetilde{\tau} = \tau + \frac{|B_s| + p}{2}$,

\[
\pi(\boldsymbol{\phi}_s | T, x) = \int \pi(\boldsymbol{\phi}_s, \sigma_s^2 | T, x)\ d\sigma_s^2 \propto \int \bigg( \frac{1}{\sigma_s^2} \bigg)^{\widetilde{\tau} + 1} \exp\bigg( -\frac{L}{2\sigma_s^2} \bigg)\ d\sigma_s^2 = \frac{\Gamma(\widetilde{\tau})}{(L/2)^{\widetilde{\tau}}}.
\]

So, as a function of $\boldsymbol{\phi}_s$, the posterior $\pi(\boldsymbol{\phi}_s | T, x)$ is,

\begin{align*}
\pi(\boldsymbol{\phi}_s | T, x) \propto L^{-\widetilde{\tau}} &= \Big( 2\lambda + D_s + (\boldsymbol{\phi}_s - \mathbf{m}_s)^{\text{T}} (S_3 + \Sigma_o^{-1}) (\boldsymbol{\phi}_s - \mathbf{m}_s) \Big)^{-\frac{2\tau + |B_s| + p}{2}} \\
&\propto \bigg( 1 + \frac{1}{2\tau + |B_s|}\ (\boldsymbol{\phi}_s - \mathbf{m}_s)^{\text{T}} \frac{(2\tau + |B_s|)(S_3 + \Sigma_o^{-1})}{2\lambda + D_s} (\boldsymbol{\phi}_s - \mathbf{m}_s) \bigg)^{-\frac{2\tau + |B_s| + p}{2}} \\
&\propto \bigg( 1 + \frac{1}{\nu}\ (\boldsymbol{\phi}_s - \mathbf{m}_s)^{\text{T}} P_s^{-1} (\boldsymbol{\phi}_s - \mathbf{m}_s) \bigg)^{-\frac{\nu + p}{2}},
\end{align*}

which is exactly in the form of a multivariate $t$-distribution, with $p$ being the dimension of $\boldsymbol{\phi}_s$, and with $\nu, \mathbf{m}_s$ and $P_s$ exactly as given in Lemma 2, completing the proof. ∎