
Context-tree weighting for real-valued time series:
Bayesian inference with hierarchical mixture models

Ioannis Papageorgiou, University of Cambridge ([email protected])
Ioannis Kontoyiannis, University of Cambridge ([email protected])
Abstract

Real-valued time series are ubiquitous in the sciences and engineering. In this work, a general, hierarchical Bayesian modelling framework is developed for building mixture models for time series. This development is based, in part, on the use of context trees, and it includes a collection of effective algorithmic tools for learning and inference. A discrete context (or ‘state’) is extracted for each sample, consisting of a discretised version of some of the most recent observations preceding it. The set of all relevant contexts is represented as a discrete context tree. At the bottom level, a different real-valued time series model is associated with each context-state, i.e., with each leaf of the tree. This defines a very general framework that can be used in conjunction with any existing model class to build flexible and interpretable mixture models. Extending the idea of context-tree weighting leads to algorithms that allow for efficient, exact Bayesian inference in this setting. The utility of the general framework is illustrated in detail when autoregressive (AR) models are used at the bottom level, resulting in a nonlinear AR mixture model. The associated methods are found to outperform several state-of-the-art techniques in experiments on simulated and real-world data.

I Introduction

Modelling and inference of real-valued time series are critical tasks with important applications throughout the sciences and engineering. A wide range of approaches exist, including classical statistical methods [7, 15] as well as modern machine learning (ML) techniques, notably matrix factorisations [47, 10], Gaussian processes [34, 36, 11], and neural networks [5, 2, 48]. Despite their popularity, there has not been conclusive evidence that in general the latter outperform the former in the time series setting [24, 23, 1]. Motivated in part by the two well-known limitations of neural network models (see, e.g., [5]), namely, their lack of interpretability and their typically very large training data requirements, in this work we propose a general class of flexible hierarchical Bayesian models, which are both naturally interpretable and suitable for applications with limited training data. Also, we provide computationally efficient (linear complexity) algorithms for inference and prediction, offering another important practical advantage over ML methods [24].

The first step in the modelling design is the identification of meaningful discrete states. Importantly, these are observable rather than hidden, and are given by the discretised values of some of the most recent samples. The second step is the assignment of a different time series model to each of these discrete context-based states. In technical terms, we define a hierarchical Bayesian model, which at the top level selects the set of relevant states (which can be viewed as providing an adaptive partition of the state space), and at the bottom level associates an arbitrary time series model with each state. These collections of states (equivalently, the corresponding state space partitions) are naturally represented as discrete context trees [35, 17], which are shown to admit a natural interpretation and to capture important aspects of the structure present in the data. We call the resulting model class the Bayesian Context Trees State Space Model (BCT-SSM).

Although BCT-SSM is referred to as a ‘model’, it is in fact a general framework for building Bayesian mixture models for time series, that can be used in conjunction with any existing model class. The resulting model family is rich, and much more general than the class one starts with. For example, using any of the standard linear families (like the classical AR or ARIMA) leads to much more general models that can capture highly nonlinear trends in the data, and are easily interpretable.

It is demonstrated that employing this particular type of observable state process (as opposed to a conventional hidden state process), also facilitates very effective Bayesian inference. This is achieved by exploiting the structure of context-tree models and extending the ideas of context-tree weighting (CTW) [44] and the Bayesian Context Trees (BCT) framework of [17], which were previously used only in the restricted setting of discrete-valued time series. In this discrete setting, CTW has been used very widely for data compression [44, 43], and BCTs have been used for a range of statistical tasks, including model selection, prediction, entropy estimation, and change-point detection [32, 20, 30, 21, 29, 31]. Context-trees in data compression were also recently studied in [25, 27] and were employed in an optimisation setting in [37].

The resulting tools developed for real-valued data make the BCT-SSM a powerful Bayesian framework, which in fact allows for exact and computationally very efficient Bayesian inference. In particular, the evidence [22] can be computed exactly, with all models and parameters integrated out. Furthermore, the a posteriori most likely (MAP) partition (i.e., the MAP set of discrete states) can be identified, along with its exact posterior probability. It is also shown that these algorithms allow for efficient sequential updates, which are ideally suited for online forecasting, offering an important practical advantage over standard ML approaches.

To illustrate the application of the general framework, the case where AR models are used as the building block for the BCT-SSM is examined in detail. We refer to the resulting model class as the Bayesian context tree autoregressive (BCT-AR) model. The BCT-AR model is a flexible, nonlinear mixture of AR models, which is found to outperform several state-of-the-art methods on simulated and real-world data from standard nonlinear time series applications in economics and finance, in terms of both forecasting accuracy and computational requirements.

Finally, we note that a number of earlier approaches employ discrete patterns in the analysis of real-valued time series [3, 4, 18, 6, 12, 14, 19, 28, 38]. These works illustrate the fact that useful and meaningful information can indeed be extracted from discrete contexts. However, in most cases the methods are either application- or task-specific, and typically resort to ad hoc considerations for performing inference. In contrast, in this work, discrete contexts are used in a natural manner by defining a hierarchical Bayesian modelling structure upon which principled Bayesian inference is performed.

II The Bayesian Context Trees State Space Model

II-A Discrete contexts

Consider a real-valued time series $(x_1, x_2, \ldots)$. The first key element of the present development is the use of an observable state for each $x_n$, based on discretised versions of some of the samples $(\ldots, x_{n-2}, x_{n-1})$ preceding it. We refer to the string consisting of these discretised previous samples as the discrete context; it plays the role of a discrete-valued feature vector that can be used to identify useful nonlinear structure in the data. These contexts are extracted via simple quantisers $Q : \mathbb{R} \to A := \{0, 1, \ldots, m-1\}$, of the form,

\[
Q(x) = \begin{cases} 0, & x < c_1, \\ i, & c_i \leq x \leq c_{i+1},\ 1 \leq i \leq m-2, \\ m-1, & x > c_{m-1}, \end{cases} \tag{1}
\]

where in this section the thresholds $\{c_1, \ldots, c_{m-1}\}$ and the resulting quantiser $Q$ are considered fixed. A systematic way to infer the thresholds from data is described in Section III-B.

This general framework can be used in conjunction with an arbitrary way of extracting discrete features, based on an arbitrary mapping to a discrete alphabet, not necessarily of the form in (1). However, the quantisation needs to be meaningful in order to lead to useful results. Quantisers as in (1) offer a generally reasonable choice although, depending on the application at hand, there are other useful approaches, e.g.,  quantising the percentage differences between samples.
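As a concrete illustration, the following is a minimal sketch (in Python, assuming NumPy is available; the function name quantise is ours) of a quantiser of the form (1). Ties at a threshold are assigned to the cell whose lower endpoint is that threshold, matching the lower bound $c_i \leq x$ in (1); the convention is immaterial for continuous data.

```python
import numpy as np

def quantise(x, thresholds):
    """Map a real value x to a symbol in {0, ..., m-1} via the quantiser (1).

    `thresholds` is the increasing sequence (c_1, ..., c_{m-1}); the returned
    symbol is the number of thresholds that are <= x.
    """
    return int(np.searchsorted(thresholds, x, side="right"))

# Example with m = 3 and thresholds (c_1, c_2) = (-7, 7), as in the IBM example of Section IV-D:
# quantise(-10.0, [-7, 7]) == 0, quantise(0.0, [-7, 7]) == 1, quantise(9.5, [-7, 7]) == 2
```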

II-B Context trees

Figure 1: Example of a context tree $T$.

Given a quantiser $Q$ as in (1), a maximum context length $D \geq 0$, and a proper $m$-ary context tree $T$, the context (or ‘state’) of each sample $x_n$ is obtained as follows. Let $t = (Q(x_{n-1}), \ldots, Q(x_{n-D}))$ be the discretised string of length $D$ preceding $x_n$; the context $s$ of $x_n$ is the unique leaf of $T$ that is a suffix of $t$. For example, for the context tree of Figure 1, if $Q(x_{n-1}) = 0$ and $Q(x_{n-2}) = 1$ then $s = 01$, whereas if $Q(x_{n-1}) = Q(x_{n-2}) = 1$ then $s = 1$.
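Operationally, the context is found by descending the tree from the root, reading one quantised symbol at a time starting from the most recent sample, until a leaf is reached. A minimal sketch (the helper name context_of and the representation of the tree as a set of leaf strings are our own):

```python
def context_of(tree_leaves, quantised_past):
    """Return the context (leaf of T) of the next sample.

    `tree_leaves` is the set of leaves, each written with the most recent
    symbol first (e.g. {'1', '01', '00'} for the tree of Figure 1), and
    `quantised_past` is the list [Q(x_{n-1}), Q(x_{n-2}), ..., Q(x_{n-D})].
    """
    s = ""
    for symbol in quantised_past:
        if s in tree_leaves:       # stop as soon as a leaf of T is reached
            return s
        s += str(symbol)           # otherwise descend one more level
    return s                       # leaf at the maximum depth D

# For the tree of Figure 1:
# context_of({'1', '01', '00'}, [0, 1]) == '01'
# context_of({'1', '01', '00'}, [1, 1]) == '1'
```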

The leaves of the tree define the set of discrete states in the hierarchical model. So, for the example BCT-SSM of Figure 1, the set of states is $\mathcal{S} = \{1, 01, 00\}$. Equivalently, this process can be viewed as defining a partition of $\mathbb{R}^2$ into three regions indexed by the contexts $\mathcal{S}$ in $T$.

To complete the specification of the BCT-SSM, a different time series model $\mathcal{M}_s$ is associated with each leaf $s$ of the context tree $T$, giving a different conditional density for $x_n$. At time $n$, given the context $s$ determined by the past $D$ samples $(x_{n-1}, \ldots, x_{n-D})$, the distribution of $x_n$ is determined by the model $\mathcal{M}_s$ assigned to $s$. Parametric models with parameters $\theta_s$ at each leaf $s$ are considered. Altogether, the BCT-SSM consists of an $m$-ary quantiser $Q$, an $m$-ary tree $T$ that defines the set of discrete states, and a collection of parameter vectors $\theta_s$ for the parametric models at its leaves.

Identifying $T$ with the collection of its leaves $\mathcal{S}$, and writing $x_i^j$ for the segment $(x_i, x_{i+1}, \ldots, x_j)$, the likelihood is,

\[
p(x|\theta, T) = p(x_1^n | T, \theta, x_{-D+1}^0) = \prod_{s \in T}\ \prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1}),
\]

where $B_s$ is the set of indices $i \in \{1, 2, \ldots, n\}$ such that the context of $x_i$ is $s$, and $\theta = \{\theta_s \,;\, s \in T\}$.

II-C Bayesian modelling and inference

For the top level we consider collections of states represented by context trees $T$ in the class $\mathcal{T}(D)$ of all proper $m$-ary trees with depth no greater than $D$, where $T$ is proper if any node in $T$ that is not a leaf has exactly $m$ children.

Prior structure. For the trees $T \in \mathcal{T}(D)$ with maximum depth $D \geq 0$ at the top level of the hierarchical model, we use the Bayesian Context Trees (BCT) prior of [17],

\[
\pi(T) = \pi_D(T; \beta) = \alpha^{|T|-1} \beta^{|T| - L_D(T)}, \tag{2}
\]

where $\beta \in (0,1)$ is a hyperparameter, $\alpha = (1-\beta)^{1/(m-1)}$, $|T|$ is the number of leaves of $T$, and $L_D(T)$ is the number of leaves of $T$ at depth $D$. This prior penalises larger trees by an exponential amount, a desirable property that discourages overfitting. The default value $\beta = 1 - 2^{-m+1}$ [17] is used. Given a tree model $T \in \mathcal{T}(D)$, an independent prior is placed on each $\theta_s$, so that $\pi(\theta|T) = \prod_{s \in T} \pi(\theta_s)$.

Typically, the main obstacle in performing Bayesian inference with a time series $x$ is the computation of the normalising constant (or evidence) $p(x)$ of the posterior distribution,

\[
p(x) = \sum_{T \in \mathcal{T}(D)} \pi(T) \int_{\theta} p(x|T, \theta)\, \pi(\theta|T)\, d\theta. \tag{3}
\]

The power of the proposed Bayesian structure is that, although $\mathcal{T}(D)$ is enormously rich, consisting of doubly-exponentially many models in $D$, it is actually possible to perform exact Bayesian inference efficiently. To that end, we introduce the Continuous Context Tree Weighting (CCTW) algorithm and the Continuous Bayesian Context Tree (CBCT) algorithm, generalising the corresponding algorithms for discrete time series in [17]. It is shown that CCTW computes the normalising constant $p(x)$ exactly (Theorem 1), and that CBCT identifies the MAP tree model (Theorem 2). The main difference from the discrete case in [17], both in the algorithmic descriptions and in the proofs of the theorems (given in Appendix A), is that a new generalised form of estimated probabilities is introduced and used in place of their discrete versions. For a time series $x = x_{-D+1}^n$, these are,

\[
P_e(s, x) = \int \prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\ \pi(\theta_s)\ d\theta_s. \tag{4}
\]

Let $x = x_{-D+1}^n$ be a time series, and let $y_i = Q(x_i)$ denote the corresponding quantised samples.

CCTW: Continuous context-tree weighting algorithm

1. Build the tree $T_{\text{MAX}}$, whose leaves are all the discrete contexts $y_{i-D}^{i-1}$, $i = 1, 2, \ldots, n$. Compute $P_e(s, x)$ as given in (4) for each node $s$ of $T_{\text{MAX}}$.

2. Starting at the leaves and proceeding recursively towards the root, compute:

\[
P_{w,s} = \begin{cases} P_e(s, x), & \text{if } s \text{ is a leaf,} \\ \beta P_e(s, x) + (1-\beta) \prod_{j=0}^{m-1} P_{w, sj}, & \text{otherwise,} \end{cases}
\]

where $sj$ is the concatenation of context $s$ and symbol $j$.
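A minimal sketch of the CCTW recursion, written in log-space for numerical stability (the function name and the dictionary representation of $T_{\text{MAX}}$ are our own; `log_Pe` is assumed to map each context string $s$ of $T_{\text{MAX}}$, including the root `''`, to $\log P_e(s, x)$):

```python
import math

def cctw_log_evidence(log_Pe, max_depth, m, beta):
    """Return log p(x), the weighted probability at the root (Theorem 1)."""
    memo = {}

    def log_Pw(s):
        if s in memo:
            return memo[s]
        if len(s) == max_depth:
            val = log_Pe[s]                      # leaf of T_MAX: P_w = P_e
        else:
            # children that never occur in the data have P_e = P_w = 1,
            # so they contribute 0 in log-space and can simply be skipped
            kids = sum(log_Pw(s + str(j)) for j in range(m) if s + str(j) in log_Pe)
            a = math.log(beta) + log_Pe[s]       # log( beta * P_e(s, x) )
            b = math.log(1.0 - beta) + kids      # log( (1 - beta) * prod_j P_w,sj )
            val = max(a, b) + math.log1p(math.exp(-abs(a - b)))   # log(e^a + e^b)
        memo[s] = val
        return val

    return log_Pw("")
```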

CBCT: Continuous Bayesian context tree algorithm

1. Build the tree $T_{\text{MAX}}$ and compute $P_e(s, x)$ for each node $s$ of $T_{\text{MAX}}$, as in CCTW.

2. Starting at the leaves and proceeding recursively towards the root, compute:

\[
P_{m,s} = \begin{cases} P_e(s, x), & \text{if } s \text{ is a leaf at depth } D, \\ \beta, & \text{if } s \text{ is a leaf at depth } < D, \\ \max\big\{ \beta P_e(s, x),\ (1-\beta) \prod_{j=0}^{m-1} P_{m, sj} \big\}, & \text{otherwise.} \end{cases}
\]

3. Starting at the root and proceeding recursively with its descendants, for each node: if the maximum is achieved by the first term, prune all its descendants from $T_{\text{MAX}}$.
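A corresponding sketch of the CBCT recursion and pruning step, under the same assumptions and representation as the CCTW sketch above; contexts that never occur in the data are handled here under the convention that their estimated probabilities equal 1.

```python
import math

def cbct_map_tree(log_Pe, max_depth, m, beta):
    """Return the leaves of the MAP tree (Theorem 2), as context strings."""
    log_beta, log_1mb = math.log(beta), math.log(1.0 - beta)
    log_Pm, split = {}, {}

    def fill(s):
        if len(s) == max_depth:
            # leaf of T_MAX at depth D; a context that never occurs has P_e = 1
            log_Pm[s], split[s] = log_Pe.get(s, 0.0), False
        elif s not in log_Pe:
            # a context at depth < D that never occurs collapses to P_m = beta
            log_Pm[s], split[s] = log_beta, False
        else:
            kids = 0.0
            for j in range(m):
                fill(s + str(j))
                kids += log_Pm[s + str(j)]
            keep, prune = log_1mb + kids, log_beta + log_Pe[s]
            log_Pm[s], split[s] = max(prune, keep), keep > prune

    fill("")
    leaves, stack = [], [""]
    while stack:                           # top-down pruning pass (step 3)
        s = stack.pop()
        if split[s]:
            stack.extend(s + str(j) for j in range(m))
        else:
            leaves.append(s)
    return leaves
```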

Theorem 1.

The weighted probability $P_{w,s}$ at the root is exactly the normalising constant $p(x)$ of (3).

Theorem 2.

The tree $T^*_1$ produced by the CBCT algorithm is the MAP tree model, as long as $\beta \geq 1/2$.

Even in cases where the integrals in (4) are not tractable, the fact that they are in the form of standard marginal likelihoods makes it possible to compute them approximately using standard methods, e.g., [8, 9, 46]. The above algorithms can then be used with these approximations as a way of performing approximate inference for the BCT-SSM. However, this is not investigated further in this work. Instead, the general principle is illustrated via an interesting example where the estimated probabilities can be computed explicitly and the resulting mixture model is a flexible nonlinear model of practical interest. This is described in the next section, where AR models $\mathcal{M}_s$ are associated with each leaf $s$. We refer to the resulting model as the Bayesian context tree autoregressive (BCT-AR) model, which is a particular instance of the general BCT-SSM.

III The BCT Autoregressive Model

Here we consider the BCT-SSM model class where an AR model of order $p$ is associated with each leaf $s$ of the tree $T$,

\[
x_n = \phi_{s,1} x_{n-1} + \dots + \phi_{s,p} x_{n-p} + e_n = \boldsymbol{\phi}_s^{\text{T}}\, \widetilde{\mathbf{x}}_{n-1} + e_n, \tag{5}
\]

where $e_n \sim \mathcal{N}(0, \sigma_s^2)$, $\widetilde{\mathbf{x}}_{n-1} = (x_{n-1}, \dots, x_{n-p})^{\text{T}}$, and $\boldsymbol{\phi}_s = (\phi_{s,1}, \dots, \phi_{s,p})^{\text{T}}$.

The parameters of the model are the AR coefficients and the noise variance, so that $\theta_s = (\boldsymbol{\phi}_s, \sigma_s^2)$. An inverse-gamma prior is used for the noise variance, and a Gaussian prior is placed on the AR coefficients, so that the joint prior on the parameters is $\pi(\theta_s) = \pi(\boldsymbol{\phi}_s | \sigma_s^2)\, \pi(\sigma_s^2)$, with,

\[
\pi(\sigma_s^2) = \text{Inv-Gamma}(\tau, \lambda), \qquad \pi(\boldsymbol{\phi}_s | \sigma_s^2) = \mathcal{N}(\mu_o, \sigma_s^2 \Sigma_o),
\]

where $(\tau, \lambda, \mu_o, \Sigma_o)$ are the prior hyperparameters. This prior specification allows the exact computation of the estimated probabilities in (4), and gives closed-form posterior distributions for the AR coefficients and the noise variance. These are given in Lemmas 1 and 2, which are proven in Appendix B.

Lemma 1. For the AR model, the estimated probabilities $P_e(s, x)$ of (4) are given by,

\[
P_e(s, x) = C_s^{-1}\ \frac{\Gamma(\tau + |B_s|/2)\ \lambda^{\tau}}{\Gamma(\tau)\ (\lambda + D_s/2)^{\tau + |B_s|/2}}, \tag{6}
\]

where $C_s = \sqrt{(2\pi)^{|B_s|} \det(I + \Sigma_o S_3)}$ and

\[
D_s = s_1 + \mu_o^{\text{T}} \Sigma_o^{-1} \mu_o - (\mathbf{s}_2 + \Sigma_o^{-1} \mu_o)^{\text{T}} (S_3 + \Sigma_o^{-1})^{-1} (\mathbf{s}_2 + \Sigma_o^{-1} \mu_o),
\]

with the sums $s_1, \mathbf{s}_2, S_3$ defined as:

\[
s_1 = \sum_{i \in B_s} x_i^2, \qquad \mathbf{s}_2 = \sum_{i \in B_s} x_i\, \widetilde{\mathbf{x}}_{i-1}, \qquad S_3 = \sum_{i \in B_s} \widetilde{\mathbf{x}}_{i-1} \widetilde{\mathbf{x}}_{i-1}^{\text{T}}.
\]
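For reference, a minimal sketch of the computation in Lemma 1, in log-space (the function name and array-based inputs are our own; `x_s` is assumed to be the vector of samples $(x_i,\ i \in B_s)$ and `X_s` the matrix whose rows are the corresponding lag vectors $\widetilde{\mathbf{x}}_{i-1}$):

```python
import numpy as np
from scipy.special import gammaln

def log_Pe_ar(x_s, X_s, tau, lam, mu0, Sigma0):
    """Compute log P_e(s, x) for the BCT-AR model, as in (6)."""
    n_s = len(x_s)                                   # |B_s|
    s1 = float(x_s @ x_s)                            # s_1
    s2 = X_s.T @ x_s                                 # s_2
    S3 = X_s.T @ X_s                                 # S_3
    Sigma0_inv = np.linalg.inv(Sigma0)
    b = s2 + Sigma0_inv @ mu0
    D_s = s1 + mu0 @ Sigma0_inv @ mu0 - b @ np.linalg.solve(S3 + Sigma0_inv, b)
    # log C_s = (|B_s|/2) log(2*pi) + (1/2) log det(I + Sigma_o S_3)
    log_Cs = (0.5 * n_s * np.log(2.0 * np.pi)
              + 0.5 * np.linalg.slogdet(np.eye(len(mu0)) + Sigma0 @ S3)[1])
    return (-log_Cs + gammaln(tau + n_s / 2.0) + tau * np.log(lam)
            - gammaln(tau) - (tau + n_s / 2.0) * np.log(lam + D_s / 2.0))
```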

Lemma 2. Given a tree model $T$, at each leaf $s$, the posterior distributions of the AR coefficients and the noise variance are,

\[
\pi(\sigma_s^2 | T, x) = \text{Inv-Gamma}\left(\tau + |B_s|/2,\ \lambda + D_s/2\right), \tag{7}
\]
\[
\pi(\boldsymbol{\phi}_s | T, x) = t_{\nu}(\mathbf{m}_s, P_s), \tag{8}
\]

where $t_{\nu}$ denotes a multivariate $t$-distribution with $\nu$ degrees of freedom. Here, $\nu = 2\tau + |B_s|$, and,

\[
\mathbf{m}_s = (S_3 + \Sigma_o^{-1})^{-1} (\mathbf{s}_2 + \Sigma_o^{-1} \mu_o), \qquad P_s^{-1} = \frac{2\tau + |B_s|}{2\lambda + D_s} (S_3 + \Sigma_o^{-1}).
\]

Corollary. The MAP estimators of $\boldsymbol{\phi}_s$ and $\sigma_s^2$ are given by,

\[
\widehat{\boldsymbol{\phi}_s}^{\text{MAP}} = \mathbf{m}_s, \qquad \widehat{\sigma_s^2}^{\text{MAP}} = (2\lambda + D_s)/(2\tau + |B_s| + 2).
\]
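A corresponding sketch of the MAP parameter estimates of the Corollary, with the same inputs and assumptions as `log_Pe_ar` above:

```python
import numpy as np

def map_ar_parameters(x_s, X_s, tau, lam, mu0, Sigma0):
    """Return the MAP estimates (phi_s, sigma_s^2) at leaf s."""
    n_s = len(x_s)
    s1, s2, S3 = float(x_s @ x_s), X_s.T @ x_s, X_s.T @ X_s
    Sigma0_inv = np.linalg.inv(Sigma0)
    b = s2 + Sigma0_inv @ mu0
    phi_map = np.linalg.solve(S3 + Sigma0_inv, b)          # m_s
    D_s = s1 + mu0 @ Sigma0_inv @ mu0 - b @ phi_map
    sigma2_map = (2.0 * lam + D_s) / (2.0 * tau + n_s + 2.0)
    return phi_map, sigma2_map
```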

III-A Computational complexity and sequential updates

For each sample $x_i$ in a time series $x_1^n$, exactly $D+1$ nodes of $T_{\text{MAX}}$ need to be updated, corresponding to its contexts of length $0, 1, \ldots, D$. For each one of these nodes, only the quantities $\{|B_s|, s_1, \mathbf{s}_2, S_3\}$ need to be updated, which can be done efficiently by just adding an extra term to each sum. Using these and Lemma 1, the estimated probabilities $P_e(s, x)$ can be computed for all nodes of $T_{\text{MAX}}$.

Hence, the complexity of both algorithms as a function of $n$ and $D$ is only $\mathcal{O}(nD)$: linear in the length of the time series and the maximum depth. Therefore, the present methods are computationally very efficient and scale well with large numbers of observations. [Taking into account $m$ and $p$ as well, it is easy to see that the complexity is $\mathcal{O}(nD(p^3 + m))$.]

The above discussion also shows that, importantly, all algorithms can be updated sequentially. When observing a new sample $x_{n+1}$, only $D+1$ nodes need to be updated, which requires $\mathcal{O}(D)$ operations, i.e., $\mathcal{O}(1)$ as a function of $n$. In particular, this implies that sequential prediction can be performed very efficiently. Empirical running times in the forecasting experiments show that the present methods are much more efficient than essentially all the alternatives examined. In fact, the difference is quite large, especially when comparing with state-of-the-art ML models that require heavy training and do not allow for efficient sequential updates, usually giving empirical running times that are larger by several orders of magnitude; see also [24] for a relevant review comparing the computational requirements of ML versus statistical techniques.
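A minimal sketch of these sequential updates (the class and function names are ours; `stats` is assumed to be a dictionary mapping context strings to their sufficient statistics):

```python
import numpy as np

class NodeStats:
    """Sufficient statistics {|B_s|, s_1, s_2, S_3} of a single node."""
    def __init__(self, p):
        self.count, self.s1 = 0, 0.0
        self.s2, self.S3 = np.zeros(p), np.zeros((p, p))

    def add(self, x_i, x_lags):
        # x_lags is the lag vector (x_{i-1}, ..., x_{i-p})
        self.count += 1
        self.s1 += x_i ** 2
        self.s2 += x_i * x_lags
        self.S3 += np.outer(x_lags, x_lags)

def observe(stats, x_i, x_lags, quantised_past):
    """Update the D+1 contexts of x_i (lengths 0, 1, ..., D) in O(D) time."""
    p = len(x_lags)
    s = ""
    stats.setdefault(s, NodeStats(p)).add(x_i, x_lags)
    for symbol in quantised_past:            # most recent quantised symbol first
        s += str(symbol)
        stats.setdefault(s, NodeStats(p)).add(x_i, x_lags)
```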

III-B Choosing the quantiser and AR order

Finally, a principled Bayesian approach is introduced for the selection of the quantiser thresholds $\{c_i\}$ of (1) and the AR order $p$. Viewing these as extra parameters on an additional layer above everything else, uniform priors are placed on $\{c_i\}$ and $p$, and Bayesian model selection [22] is performed to obtain their MAP values: the resulting posterior distribution $p(\{c_i\}, p \,|\, x)$ is proportional to the evidence $p(x \,|\, \{c_i\}, p)$, which can be computed exactly using the CCTW algorithm (Theorem 1). So, in order to select appropriate values, suitable ranges of possible $\{c_i\}$ and $p$ are specified, and the values with the highest evidence are selected. For the AR order we take $1 \leq p \leq p_{\text{max}}$ for an appropriate $p_{\text{max}}$ ($p_{\text{max}} = 5$ in our experiments), and for the $\{c_i\}$ we perform a grid search over a reasonable range (e.g., between the 10th and 90th percentiles of the data).
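A minimal sketch of this selection step (the function `log_evidence` is assumed to compute $\log p(x \,|\, \{c_i\}, p)$ via the CCTW recursion of Theorem 1; the candidate threshold sets and `p_max` are user-specified):

```python
import itertools

def select_quantiser_and_order(x, candidate_thresholds, p_max, log_evidence):
    """Grid search over quantiser thresholds {c_i} and AR order p,
    keeping the combination with the highest evidence p(x | {c_i}, p)."""
    best, best_score = None, float("-inf")
    for thresholds, p in itertools.product(candidate_thresholds, range(1, p_max + 1)):
        score = log_evidence(x, thresholds, p)
        if score > best_score:
            best, best_score = (thresholds, p), score
    return best   # the MAP pair ({c_i}, p) under uniform priors
```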

IV Experimental results

The performance of the BCT-AR model is evaluated on simulated and real-world data from standard applications of nonlinear time series in economics and finance. It is compared with the most successful previous approaches for these types of applications, including both classical and modern ML methods. Useful resources include the R package forecast [16] and the Python library ‘GluonTS’ [2], containing implementations of state-of-the-art classical and ML methods, respectively. We briefly discuss the methods used, and refer to the packages’ documentation for more details on the methods themselves and the training procedures carried out. Among classical statistical approaches, we compare with ARIMA and exponential smoothing state space (ETS) models [15], with self-exciting threshold autoregressive (SETAR) models [41], and with mixture autoregressive (MAR) models [45]. Among ML techniques, we compare with the neural network AR (NNAR) model [48], and with the most successful RNN-based approach, ‘deepAR’ [39].

IV-A Simulated data

First an experiment on simulated data is performed, illustrating that the present methods are consistent and effective with data generated by BCT-SSM models. The context tree used is the tree of Figure 1, the quantiser threshold is $c = 0$, and the AR order is $p = 2$. The posterior distribution over trees, $\pi(T|x)$, is examined. On a time series consisting of only $n = 100$ observations, the MAP tree identified by the CBCT algorithm is the empty tree corresponding to a single AR model, with posterior probability 99.9%. This means that the data do not provide sufficient evidence to support a more complex state space partition. With $n = 300$ observations, the MAP tree is now the true underlying model, with posterior probability 57%. With $n = 500$ observations, the posterior probability of the true model is 99.9%. So, the posterior indeed concentrates on the true model, indicating that the BCT-AR inferential framework can be very effective even with limited training data.

Forecasting. The performance of the BCT-AR model is evaluated on out-of-sample 1-step-ahead forecasts, and compared with state-of-the-art approaches on four simulated and three real datasets. The first simulated dataset (sim_1) is generated by the BCT-AR model used above, and the second (sim_2) by a BCT-AR model with a ternary tree of depth 2. The third and fourth ones (sim_3 and sim_4) are generated from SETAR models of orders $p = 1$ and $p = 5$, respectively. In each case, the training set consists of the first 50% of the observations; also, all models are updated at every time-step in the test set. For BCT-AR, the MAP tree with its MAP parameters is used at every time-step, which can be updated efficiently (Section III-A); a minimal sketch of this forecasting step is given after Table I. From Table I, it is observed that the BCT-AR model outperforms the alternatives, achieving the lowest mean squared error (MSE) even on the two datasets generated from SETAR models. As discussed in Section III-A, the BCT-AR model also outperforms the alternatives in terms of empirical running times.

TABLE I: Mean squared error (MSE) of forecasts
BCT-AR ARIMA ETS NNAR deepAR SETAR MAR
sim_1 0.131 0.150 0.178 0.143 0.148 0.141 0.151
sim_2 0.035 0.050 0.054 0.048 0.061 0.050 0.064
sim_3 0.216 0.267 0.293 0.252 0.273 0.243 0.283
sim_4 0.891 1.556 1.614 1.287 1.573 0.951 1.543
unemp 0.034 0.040 0.042 0.036 0.036 0.038 0.037
gnp 0.324 0.364 0.378 0.393 0.473 0.394 0.384
ibm 78.02 82.90 77.52 78.90 75.71 81.07 77.02
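A minimal sketch of the 1-step-ahead BCT-AR forecast at each time-step, under the conventions of the earlier sketches (the MAP tree is represented by its set of leaves, and `map_phi[s]` is assumed to hold the MAP AR coefficients of leaf $s$, e.g. as returned by `map_ar_parameters`; any constant term is omitted for simplicity):

```python
import numpy as np

def one_step_forecast(x_past, map_leaves, map_phi, thresholds, p):
    """Point forecast of x_n given the observed past x_past = (..., x_{n-2}, x_{n-1})."""
    # quantise the recent past, most recent sample first
    quantised = [int(np.searchsorted(thresholds, v, side="right")) for v in x_past[::-1]]
    s = ""
    for symbol in quantised:                 # descend the MAP tree to find the state
        if s in map_leaves:
            break
        s += str(symbol)
    x_lags = np.asarray(x_past[-p:][::-1])   # lag vector (x_{n-1}, ..., x_{n-p})
    return float(map_phi[s] @ x_lags)
```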

IV-B US unemployment rate

An important application of SETAR models is in modelling the US unemployment rate [13, 26, 42]. As described in [42], the unemployment rate moves countercyclically with business cycles, and rises quickly but decays slowly, indicating nonlinear behaviour. Here, the quarterly US unemployment rate in the time period from 1948 to 2019 is examined (dataset unemp, 288 observations). Following [26], the difference series $\Delta x_n = x_n - x_{n-1}$ is considered, and a constant term is included in the AR model. For the quantiser alphabet size, $m = 2$ is a natural choice, as will become apparent below. The threshold selected using the procedure of Section III-B is $c = 0.15$, and the MAP tree is the tree of Figure 1, with depth $d = 2$, leaves $\{1, 01, 00\}$, and posterior probability 91.5%. The fitted BCT-AR model with its MAP parameters is,

\[
\Delta x_n = \begin{cases} \phantom{-}0.09 + 0.72\, \Delta x_{n-1} - 0.30\, \Delta x_{n-2} + 0.42\, e_n, \\ \phantom{-}0.04 + 0.29\, \Delta x_{n-1} - 0.32\, \Delta x_{n-2} + 0.32\, e_n, \\ -0.02 + 0.34\, \Delta x_{n-1} + 0.19\, \Delta x_{n-2} + 0.20\, e_n, \end{cases}
\]

with $e_n \sim \mathcal{N}(0, 1)$, and corresponding regions $s = 1$ if $\Delta x_{n-1} > 0.15$, $s = 01$ if $\Delta x_{n-1} \leq 0.15,\ \Delta x_{n-2} > 0.15$, and $s = 00$ if $\Delta x_{n-1} \leq 0.15,\ \Delta x_{n-2} \leq 0.15$.

Interpretation. The MAP BCT-AR model finds significant structure in the data, providing a very natural interpretation. It identifies 3 meaningful states: First, jumps in the unemployment rate higher than 0.15 signify economic contractions (context 1). If there is no jump at the most recent time-point, the model looks further back to determine the state. Context 00 signifies a stable economy, as there are no jumps in the unemployment rate for two consecutive quarters. Finally, context 01 identifies an intermediate state: “stabilising just after a contraction”. An important feature identified by the BCT-AR model is that the volatility is different in each case: Higher in contractions ($\sigma = 0.42$), smaller in stable economy regions ($\sigma = 0.20$), and in-between for context 01 ($\sigma = 0.32$).

Forecasting. In addition to its appealing interpretation, the BCT-AR model outperforms all benchmarks in forecasting, giving a 6% lower MSE than the second-best method (Table I).

IV-C US Gross National Product

Figure 2: MAP tree.

Another important example of a nonlinear time series is the US Gross National Product (GNP) [33, 13]. The quarterly US GNP in the time period from 1947 to 2019 is examined (291 observations, dataset gnp). Following [33], the difference in the logarithm of the series, $y_n = \log x_n - \log x_{n-1}$, is considered. As above, $m = 2$ is a natural choice for the quantiser size, helping to differentiate economic expansions from contractions – which govern the underlying dynamics. The MAP BCT-AR tree is shown in Figure 2: It has depth $d = 3$, its leaves are $\{0, 10, 110, 111\}$, and its posterior probability is 42.6%.

Interpretation. Compared with the previous example, here the MAP BCT-AR model finds even richer structure in the data and identifies four meaningful states. First, as before, there is a single state corresponding to an economic contraction (which now corresponds to $s = 0$ instead of $s = 1$, as the GNP obviously increases in expansions and decreases in contractions). And again, the model does not look further back whenever a contraction is detected. Here, the model also shows that the effect of a contraction is still present even after three quarters ($s = 110$), and that the exact ‘distance’ from a contraction is also important, with the dynamics changing depending on how much time has elapsed. Finally, the state $s = 111$ corresponds to a flourishing, expanding economy, without a contraction in the recent past. An important feature captured by the BCT-AR model is again that the volatility is different in each case. More specifically, it is found that the volatility strictly decreases with the distance from the last contraction, starting with the maximum $\sigma = 1.23$ for $s = 0$ and decreasing to $\sigma = 0.75$ for $s = 111$.

Forecasting. The BCT-AR model outperforms all benchmarks in forecasting, giving a 12% lower MSE than the second-best method, as presented in Table I.

IV-D The stock price of IBM

Finally, the daily IBM common stock closing price from May 17, 1961 to November 2, 1962 is examined (dataset ibm, 369 observations), taken from [7]. This is a well-studied dataset, with [7] fitting an ARIMA model, [40] fitting a SETAR model, and [45] fitting a MAR model to the data. Following previous approaches, the first-difference series, $\Delta x_n = x_n - x_{n-1}$, is considered. The value $m = 3$ is chosen for the alphabet size of the quantiser, with contexts $\{0, 1, 2\}$ naturally corresponding to the states {down, steady, up} for the stock price. Using the procedure of Section III-B to select the thresholds, the resulting quantiser regions are: $s = 0$ if $\Delta x_{n-1} < -7$, $s = 2$ if $\Delta x_{n-1} > 7$, and $s = 1$ otherwise. The MAP tree is shown in Figure 3: It has depth $d = 2$, and its leaves are $\{0, 2, 10, 11, 12\}$, hence identifying five states. Its posterior probability is 99.3%, suggesting that there is very strong evidence in the data supporting this exact structure, even with only 369 observations.

Figure 3: MAP tree.

Interpretation. The BCT-AR model reveals important information about apparent structure in the data. Firstly, it admits a very simple and natural interpretation: In order to determine the AR model generating the next value, one needs to look back until there is a significant enough price change (corresponding to contexts 0, 2, 10, 12), or until the maximum depth of 2 (context 11) is reached. Another important feature captured by this model is the commonly observed asymmetric response in volatility due to positive and negative shocks, sometimes called the leverage effect [42, 7]. Even though there is no suggestion of that in the prior, the MAP parameter estimates show that negative shocks increase the volatility much more: Context 0 has the highest volatility ($\sigma = 12.3$), with 10 being a close second ($\sigma = 10.8$), showing that the effect of a past shock is still present. In all other cases the volatility is much smaller ($5.17 \leq \sigma \leq 6.86$).

Forecasting. As shown in Table I, in this example deepAR performs marginally better in 1-step ahead forecasts, with BCT-AR, ETS and MAR also having comparable performance.

References

  • [1] N.K. Ahmed, A.F. Atiya, N.E. Gayar, and H. El-Shishiny. An empirical comparison of machine learning models for time series forecasting. Econometric reviews, 29(5-6):594–621, 2010.
  • [2] A. Alexandrov, K. Benidis, M. Bohlke-Schneider, V. Flunkert, J. Gasthaus, T. Januschowski, D.C. Maddix, S. Rangapuram, D. Salinas, J. Schulz, et al. GluonTS: Probabilistic time series models in python. arXiv preprint arXiv:1906.05264, 2019.
  • [3] F.M. Alvarez, A. Troncoso, J.C. Riquelme, and J.S.A. Ruiz. Energy time series forecasting based on pattern sequence similarity. IEEE Transactions on Knowledge and Data Engineering, 23(8):1230–1243, 2010.
  • [4] S. Alvisi, M. Franchini, and A. Marinelli. A short-term, pattern-based model for water-demand forecasting. Journal of Hydroinformatics, 9(1):39–50, 2007.
  • [5] K. Benidis, S.S. Rangapuram, V. Flunkert, B. Wang, D. Maddix, C. Turkmen, J. Gasthaus, M. Bohlke-Schneider, D. Salinas, L. Stella, et al. Neural forecasting: Introduction and literature overview. arXiv preprint arXiv:2004.10240, 2020.
  • [6] D.J. Berndt and J. Clifford. Using dynamic time warping to find patterns in time series. In KDD Workshop, volume 10, pages 359–370. Seattle, WA, USA, 1994.
  • [7] G.E.P. Box, G.M. Jenkins, G.C. Reinsel, and G.M. Ljung. Time series analysis: forecasting and control. John Wiley & Sons, 2015.
  • [8] S. Chib. Marginal likelihood from the Gibbs output. Journal of the american statistical association, 90(432):1313–1321, 1995.
  • [9] S. Chib and I. Jeliazkov. Marginal likelihood from the Metropolis–Hastings output. Journal of the American statistical association, 96(453):270–281, 2001.
  • [10] C. Faloutsos, J. Gasthaus, T. Januschowski, and Y. Wang. Forecasting big time series: old and new. Proceedings of the VLDB Endowment, 11(12):2102–2105, 2018.
  • [11] R. Frigola. Bayesian time series learning with Gaussian processes. PhD thesis, University of Cambridge, 2015.
  • [12] T.C. Fu, F.L. Chung, R. Luk, and C.M. Ng. Stock time series pattern matching: Template-based vs. rule-based approaches. Engineering Applications of Artificial Intelligence, 20(3):347–364, 2007.
  • [13] B.E. Hansen. Threshold autoregression in economics. Statistics and its Interface, 4(2):123–127, 2011.
  • [14] Q. Hu, P. Su, D. Yu, and J. Liu. Pattern-based wind speed prediction based on generalized principal component analysis. IEEE Transactions on Sustainable Energy, 5(3):866–874, 2014.
  • [15] R. Hyndman, A.B. Koehler, J.K. Ord, and R.D. Snyder. Forecasting with exponential smoothing: the state space approach. Springer Science & Business Media, 2008.
  • [16] R.J. Hyndman and Y. Khandakar. Automatic time series forecasting: the forecast package for R. Journal of Statistical Software, 26(3):1–22, 2008. https://CRAN.R-project.org/package=forecast.
  • [17] I. Kontoyiannis, L. Mertzanis, A. Panotopoulou, I. Papageorgiou, and M. Skoularidou. Bayesian Context Trees: Modelling and exact inference for discrete time series. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(4):1287–1323, 2022.
  • [18] S.S. Kozat, A.C. Singer, and G.C. Zeitler. Universal piecewise linear prediction via context trees. IEEE Transactions on Signal Processing, 55(7):3730–3745, 2007.
  • [19] X. Liu, Z. Ni, D. Yuan, Y. Jiang, Z. Wu, J. Chen, and Y. Yang. A novel statistical time-series pattern based interval forecasting strategy for activity durations in workflow systems. Journal of Systems and Software, 84(3):354–376, 2011.
  • [20] V. Lungu, I. Papageorgiou, and I. Kontoyiannis. Change-point detection and segmentation of discrete data using Bayesian Context Trees. arXiv preprint arXiv:2203.04341, 2021.
  • [21] V. Lungu, I. Papageorgiou, and I. Kontoyiannis. Bayesian change-point detection via context-tree weighting. In 2022 IEEE Information Theory Workshop (ITW), pages 125–130. IEEE, 2022.
  • [22] D.J.C. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.
  • [23] S. Makridakis, E. Spiliotis, and V. Assimakopoulos. The M4 Competition: Results, findings, conclusion and way forward. International Journal of Forecasting, 34(4):802–808, 2018.
  • [24] S. Makridakis, E. Spiliotis, and V. Assimakopoulos. Statistical and machine learning forecasting methods: Concerns and ways forward. PloS one, 13(3):e0194889, 2018.
  • [25] T. Matsushima and S. Hirasawa. Reducing the space complexity of a Bayes coding algorithm using an expanded context tree. In 2009 IEEE International Symposium on Information Theory, pages 719–723. IEEE, 2009.
  • [26] A.L. Montgomery, V. Zarnowitz, R.S. Tsay, and G.C. Tiao. Forecasting the US unemployment rate. Journal of the American Statistical Association, 93(442):478–493, 1998.
  • [27] Y. Nakahara, S. Saito, A. Kamatsuka, and T. Matsushima. Probability distribution on full rooted trees. Entropy, 24(3):328, 2022.
  • [28] G. Ouyang, C. Dang, D.A. Richards, and X. Li. Ordinal pattern based similarity analysis for eeg recordings. Clinical Neurophysiology, 121(5):694–703, 2010.
  • [29] I. Papageorgiou and I. Kontoyiannis. The posterior distribution of Bayesian Context-Tree models: Theory and applications. In 2022 IEEE International Symposium on Information Theory (ISIT), pages 702–707. IEEE, 2022.
  • [30] I. Papageorgiou and I. Kontoyiannis. Posterior representations for Bayesian Context Trees: Sampling, estimation and convergence. Bayesian Analysis, to appear, arXiv preprint arXiv:2202.02239, 2022.
  • [31] I. Papageorgiou and I. Kontoyiannis. Truly Bayesian entropy estimation. arXiv preprint arXiv:2212.06705, 2022.
  • [32] I. Papageorgiou, I. Kontoyiannis, L. Mertzanis, A. Panotopoulou, and M. Skoularidou. Revisiting context-tree weighting for Bayesian inference. In 2021 IEEE International Symposium on Information Theory (ISIT), pages 2906–2911, 2021.
  • [33] S.M. Potter. A nonlinear approach to US GNP. Journal of applied econometrics, 10(2):109–125, 1995.
  • [34] C.E. Rasmussen and C.K.I. Williams. Gaussian processes for machine learning. MIT Press, 2006.
  • [35] J. Rissanen. A universal data compression system. IEEE Transactions on Information Theory, 29(5):656–664, 1983.
  • [36] S. Roberts, M. Osborne, M. Ebden, S. Reece, N. Gibson, and S. Aigrain. Gaussian processes for time-series modelling. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 371(1984):20110550, 2013.
  • [37] J.J. Ryu, A. Bhatt, and Y.H. Kim. Parameter-free online linear optimization with side information via universal coin betting. In International Conference on Artificial Intelligence and Statistics, pages 6022–6044. PMLR, 2022.
  • [38] E. Sabeti, P.X.K. Song, and A.O. Hero. Pattern-based analysis of time series: Estimation. In 2020 IEEE International Symposium on Information Theory (ISIT), pages 1236–1241, 2020.
  • [39] D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
  • [40] H. Tong. Non-linear time series: a dynamical system approach. Oxford University Press, 1990.
  • [41] H. Tong and K.S. Lim. Threshold autoregression, limit cycles and cyclical data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 42(3):245–268, 1980.
  • [42] R.S. Tsay. Analysis of financial time series, volume 543. John Wiley & Sons, 2005.
  • [43] F.M.J. Willems. The context-tree weighting method: extensions. IEEE Transactions on Information Theory, 44(2):792–798, 1998.
  • [44] F.M.J. Willems, Y.M. Shtarkov, and T.J. Tjalkens. The context-tree weighting method: basic properties. IEEE Transactions on Information Theory, 41(3):653–664, 1995.
  • [45] C. S. Wong and W. K. Li. On a mixture autoregressive model. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 62(1):95–115, 2000.
  • [46] S.N. Wood. Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(1):3–36, 2011.
  • [47] H.F. Yu, N. Rao, and I.S. Dhillon. Temporal regularized matrix factorization for high-dimensional time series prediction. Advances in neural information processing systems, 29, 2016.
  • [48] G. Zhang, B.E. Patuwo, and M.Y. Hu. Forecasting with artificial neural networks: The state of the art. International journal of forecasting, 14(1):35–62, 1998.

A. Proofs of Theorems 1 and 2

The important observation here is that, using the new, different form of the estimated probabilities $P_e(s, x)$ in (4), it is still possible to factorise the marginal likelihoods $p(x|T)$ for the general BCT-SSM as,

\[
p(x|T) = \int p(x|\theta, T)\, \pi(\theta|T)\, d\theta = \prod_{s \in T} \bigg( \int \prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\ \pi(\theta_s)\ d\theta_s \bigg) = \prod_{s \in T} P_e(s, x),
\]

where the second equality follows from the likelihood of the general BCT-SSM, and the fact that we use independent priors on the parameters at the leaves, so that $\pi(\theta|T) = \prod_{s \in T} \pi(\theta_s)$. Then, the proofs of Theorems 1 and 2 follow along the same lines as the proofs of the corresponding results for discrete time series in [17], with the main difference being that the new general version of the estimated probabilities $P_e(s, x)$ of (4) needs to be used in place of their discrete versions.

The $k$-BCT algorithm of [17] can be generalised in a similar manner to the way the CTW and BCT algorithms were generalised. The resulting algorithm identifies the top-$k$ a posteriori most likely context trees. The proof of the theorem claiming this is similar to the proof of Theorem 3.3 of [17]. Again, the important difference, both in the algorithm description and in the proof, is that the estimated probabilities $P_e(s, x)$ are used in place of their discrete counterparts $P_e(a_s)$.

B. Proofs of Lemmas 1 and 2

The proofs of these lemmas are mostly based on explicit computations. Recall that, for each context $s$, the set $B_s$ consists of those indices $i \in \{1, 2, \ldots, n\}$ such that the context of $x_i$ is $s$. The important step in the following two proofs is the factorisation of the likelihood using the sets $B_s$. In order to prove the lemmas for the AR model with parameters $\theta_s = (\boldsymbol{\phi}_s, \sigma_s^2)$, we first consider an intermediate step in which the noise variance is assumed to be known and equal to $\sigma^2$.

B.1. Known noise variance

Here, an AR model with known variance $\sigma^2$ is associated with every leaf $s$ of the context tree $T$, so that,

\[
x_n = \phi_{s,1} x_{n-1} + \dots + \phi_{s,p} x_{n-p} + e_n = \boldsymbol{\phi}_s^{\text{T}}\, \widetilde{\mathbf{x}}_{n-1} + e_n, \qquad e_n \sim \mathcal{N}(0, \sigma^2). \tag{9}
\]

In this setting, the parameters of the model are only the AR coefficients: $\theta_s = \boldsymbol{\phi}_s$. For these, a Gaussian prior is used,

\[
\theta_s \sim \mathcal{N}(\mu_o, \Sigma_o), \tag{10}
\]

where $\mu_o, \Sigma_o$ are hyperparameters.

Lemma B. The estimated probabilities $P_e(s, x)$ for the known-variance case are given by,

\[
P_e(s, x) = \frac{1}{(2\pi\sigma^2)^{|B_s|/2}}\ \frac{1}{\sqrt{\det(I + \Sigma_o S_3 / \sigma^2)}}\ \exp\bigg\{ -\frac{E_s}{2\sigma^2} \bigg\}, \tag{11}
\]

where $I$ is the identity matrix and $E_s$ is given by:

\[
E_s = s_1 + \sigma^2 \mu_o^{\text{T}} \Sigma_o^{-1} \mu_o - (\mathbf{s}_2 + \sigma^2 \Sigma_o^{-1} \mu_o)^{\text{T}} (S_3 + \sigma^2 \Sigma_o^{-1})^{-1} (\mathbf{s}_2 + \sigma^2 \Sigma_o^{-1} \mu_o). \tag{12}
\]
Proof.

For the AR model of (9),

\[
p(x_i | T, \theta_s, x_{-D+1}^{i-1}) = \frac{1}{\sqrt{2\pi\sigma^2}}\ \exp\bigg\{ -\frac{1}{2\sigma^2} (x_i - \theta_s^{\text{T}} \widetilde{\mathbf{x}}_{i-1})^2 \bigg\},
\]

so that,

\[
\prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1}) = \frac{1}{(\sqrt{2\pi\sigma^2})^{|B_s|}}\ \exp\bigg\{ -\frac{1}{2\sigma^2} \sum_{i \in B_s} (x_i - \theta_s^{\text{T}} \widetilde{\mathbf{x}}_{i-1})^2 \bigg\}.
\]

Expanding the sum in the exponent gives,

\[
\sum_{i \in B_s} (x_i - \theta_s^{\text{T}} \widetilde{\mathbf{x}}_{i-1})^2 = \sum_{i \in B_s} x_i^2 - 2 \theta_s^{\text{T}} \sum_{i \in B_s} x_i \widetilde{\mathbf{x}}_{i-1} + \theta_s^{\text{T}} \sum_{i \in B_s} \widetilde{\mathbf{x}}_{i-1} \widetilde{\mathbf{x}}_{i-1}^{\text{T}}\, \theta_s = s_1 - 2 \theta_s^{\text{T}} \mathbf{s}_2 + \theta_s^{\text{T}} S_3 \theta_s,
\]

from which we obtain that,

\[
\prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1}) = \frac{1}{(\sqrt{2\pi\sigma^2})^{|B_s|}}\ \exp\bigg\{ -\frac{1}{2\sigma^2} (s_1 - 2\theta_s^{\text{T}} \mathbf{s}_2 + \theta_s^{\text{T}} S_3 \theta_s) \bigg\} = (\sqrt{2\pi})^p \rho_s\ \mathcal{N}(\theta_s; \boldsymbol{\mu}, S),
\]

by completing the square, where $\boldsymbol{\mu} = S_3^{-1} \mathbf{s}_2$, $S = \sigma^2 S_3^{-1}$, and,

\[
\rho_s = \sqrt{\frac{\det(\sigma^2 S_3^{-1})}{(2\pi\sigma^2)^{|B_s|}}}\ \exp\bigg\{ -\frac{1}{2\sigma^2} (s_1 - \mathbf{s}_2^{\text{T}} S_3^{-1} \mathbf{s}_2) \bigg\}. \tag{13}
\]

So, multiplying with the prior,

\[
\prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\, \pi(\theta_s) = (\sqrt{2\pi})^p \rho_s\ \mathcal{N}(\theta_s; \boldsymbol{\mu}, S)\ \mathcal{N}(\theta_s; \mu_o, \Sigma_o) = \rho_s Z_s\ \mathcal{N}(\theta_s; \mathbf{m}, \Sigma),
\]

where $\Sigma^{-1} = \Sigma_o^{-1} + S^{-1}$, $\mathbf{m} = \Sigma\, (\Sigma_o^{-1} \mu_o + S^{-1} \boldsymbol{\mu})$, and,

\[
Z_s = \frac{1}{\sqrt{\det(\Sigma_o + \sigma^2 S_3^{-1})}}\ \exp\bigg\{ -\frac{1}{2} (\mu_o - S_3^{-1} \mathbf{s}_2)^{\text{T}} (\Sigma_o + \sigma^2 S_3^{-1})^{-1} (\mu_o - S_3^{-1} \mathbf{s}_2) \bigg\}. \tag{14}
\]

Therefore,

\[
\prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\, \pi(\theta_s) = \rho_s Z_s\ \mathcal{N}(\theta_s; \mathbf{m}, \Sigma), \tag{15}
\]

and hence,

\[
P_e(s, x) = \int \prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\ \pi(\theta_s)\ d\theta_s = \rho_s Z_s.
\]

Using standard matrix inversion properties, after some algebra the product $\rho_s Z_s$ can be rearranged to give exactly the required expression in (11). ∎

B.2. Proof of Lemma 1

Returning to the original setting described in the main text, the noise variance is considered to be a parameter of the AR model, so that $\theta_s = (\boldsymbol{\phi}_s, \sigma_s^2)$. Here, the joint prior on the parameters is $\pi(\theta_s) = \pi(\boldsymbol{\phi}_s | \sigma_s^2)\, \pi(\sigma_s^2)$, where,

\[
\sigma_s^2 \sim \text{Inv-Gamma}(\tau, \lambda), \qquad \boldsymbol{\phi}_s | \sigma_s^2 \sim \mathcal{N}(\mu_o, \sigma_s^2 \Sigma_o), \tag{16}
\]

and where $(\tau, \lambda, \mu_o, \Sigma_o)$ are hyperparameters. For $P_e(s, x)$, we just need to compute the integral:

\[
P_e(s, x) = \int \prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\ \pi(\theta_s)\ d\theta_s = \int \pi(\sigma_s^2) \left( \int \prod_{i \in B_s} p(x_i | T, \boldsymbol{\phi}_s, \sigma_s^2, x_{-D+1}^{i-1})\ \pi(\boldsymbol{\phi}_s | \sigma_s^2)\ d\boldsymbol{\phi}_s \right) d\sigma_s^2. \tag{17}
\]

The inner integral has exactly the form of the estimated probabilities $P_e(s, x)$ from the previous section, where the noise variance was fixed. The only difference is that the prior $\pi(\boldsymbol{\phi}_s | \sigma_s^2)$ of (16) now has covariance matrix $\sigma_s^2 \Sigma_o$ instead of $\Sigma_o$. So, using (11)-(12), with $\Sigma_o$ replaced by $\sigma_s^2 \Sigma_o$, we get,

\[
P_e(s, x) = \int \pi(\sigma_s^2) \bigg\{ C_s^{-1} \bigg( \frac{1}{\sigma_s^2} \bigg)^{|B_s|/2} \exp\bigg( -\frac{D_s}{2\sigma_s^2} \bigg) \bigg\}\, d\sigma_s^2,
\]

with $C_s$ and $D_s$ as in Lemma 1. And using the inverse-gamma prior $\pi(\sigma_s^2)$ of (16),

\[
P_e(s, x) = C_s^{-1}\ \frac{\lambda^{\tau}}{\Gamma(\tau)}\ \int \bigg( \frac{1}{\sigma_s^2} \bigg)^{\tau' + 1} \exp\bigg( -\frac{\lambda'}{\sigma_s^2} \bigg)\, d\sigma_s^2, \tag{18}
\]

with $\tau' = \tau + \frac{|B_s|}{2}$ and $\lambda' = \lambda + \frac{D_s}{2}$. The integrand in (18) has the form of an (unnormalised) inverse-gamma density with parameters $\tau'$ and $\lambda'$, so the integral can be evaluated in closed form, giving, as required,

\[
P_e(s, x) = C_s^{-1}\ \frac{\lambda^{\tau}}{\Gamma(\tau)}\ \frac{\Gamma(\tau')}{(\lambda')^{\tau'}},
\]

which completes the proof of the lemma.

B.3. Proof of Lemma 2

In order to derive the required expressions for the posterior distributions of $\boldsymbol{\phi}_s$ and $\sigma_s^2$, for a leaf $s$ of model $T$, first consider the joint posterior $\pi(\theta_s | T, x) = \pi(\boldsymbol{\phi}_s, \sigma_s^2 | T, x)$, given by,

\[
\pi(\theta_s | T, x) \propto p(x | T, \theta_s)\, \pi(\theta_s) = \prod_{i=1}^{n} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\, \pi(\theta_s) \propto \prod_{i \in B_s} p(x_i | T, \theta_s, x_{-D+1}^{i-1})\, \pi(\theta_s),
\]

where we used the fact that, in the product, only the terms involving indices $i \in B_s$ are functions of $\theta_s$. So,

\[
\pi(\boldsymbol{\phi}_s, \sigma_s^2 | T, x) \propto \left( \prod_{i \in B_s} p(x_i | T, \boldsymbol{\phi}_s, \sigma_s^2, x_{-D+1}^{i-1})\ \pi(\boldsymbol{\phi}_s | \sigma_s^2) \right) \pi(\sigma_s^2).
\]

Here, the first two terms can be computed from (15) of the previous section, where the noise variance was known. Again, the only difference is that we have to replace $\Sigma_o$ with $\sigma_s^2 \Sigma_o$ because of the prior $\pi(\boldsymbol{\phi}_s | \sigma_s^2)$ defined in (16). After some algebra,

\[
\pi(\boldsymbol{\phi}_s, \sigma_s^2 | T, x) \propto \bigg( \frac{1}{\sigma_s^2} \bigg)^{|B_s|/2} \exp\bigg( -\frac{D_s}{2\sigma_s^2} \bigg)\ \mathcal{N}(\boldsymbol{\phi}_s; \mathbf{m}_s, \Sigma_s)\ \pi(\sigma_s^2),
\]

with $\mathbf{m}_s$ defined as in Lemma 2, and $\Sigma_s = \sigma_s^2 (S_3 + \Sigma_o^{-1})^{-1}$. Substituting the prior $\pi(\sigma_s^2)$ in the last expression,

\[
\pi(\boldsymbol{\phi}_s, \sigma_s^2 | T, x) \propto \bigg( \frac{1}{\sigma_s^2} \bigg)^{\tau + 1 + |B_s|/2} \exp\bigg( -\frac{\lambda + D_s/2}{\sigma_s^2} \bigg)\ \mathcal{N}(\boldsymbol{\phi}_s; \mathbf{m}_s, \Sigma_s). \tag{19}
\]

From (19), it is easy to integrate out $\boldsymbol{\phi}_s$ and get the posterior of $\sigma_s^2$,

\[
\pi(\sigma_s^2 | T, x) = \int \pi(\boldsymbol{\phi}_s, \sigma_s^2 | T, x)\ d\boldsymbol{\phi}_s \propto \bigg( \frac{1}{\sigma_s^2} \bigg)^{\tau + 1 + |B_s|/2} \exp\bigg( -\frac{\lambda + D_s/2}{\sigma_s^2} \bigg),
\]

which is of the form of an inverse-gamma distribution with parameters $\tau' = \tau + \frac{|B_s|}{2}$ and $\lambda' = \lambda + \frac{D_s}{2}$, proving the first part of the lemma.

However, as $\Sigma_s$ is a function of $\sigma_s^2$, integrating out $\sigma_s^2$ requires more algebra. We have,

\[
\mathcal{N}(\boldsymbol{\phi}_s; \mathbf{m}_s, \Sigma_s) \propto \frac{1}{\sqrt{\det(\Sigma_s)}}\ \exp\bigg\{ -\frac{1}{2} (\boldsymbol{\phi}_s - \mathbf{m}_s)^{\text{T}} \Sigma_s^{-1} (\boldsymbol{\phi}_s - \mathbf{m}_s) \bigg\} \propto \bigg( \frac{1}{\sigma_s^2} \bigg)^{p/2} \exp\bigg\{ -\frac{1}{2\sigma_s^2} (\boldsymbol{\phi}_s - \mathbf{m}_s)^{\text{T}} (S_3 + \Sigma_o^{-1}) (\boldsymbol{\phi}_s - \mathbf{m}_s) \bigg\},
\]

and substituting this in (19) gives that $\pi(\boldsymbol{\phi}_s, \sigma_s^2 | T, x)$ is proportional to,

\[
\bigg( \frac{1}{\sigma_s^2} \bigg)^{\tau + 1 + \frac{|B_s| + p}{2}} \exp\bigg\{ -\frac{1}{2\sigma_s^2} \bigg( 2\lambda + D_s + (\boldsymbol{\phi}_s - \mathbf{m}_s)^{\text{T}} (S_3 + \Sigma_o^{-1}) (\boldsymbol{\phi}_s - \mathbf{m}_s) \bigg) \bigg\},
\]

which, as a function of $\sigma_s^2$, has the form of an inverse-gamma density, allowing $\sigma_s^2$ to be integrated out. Denoting $L = 2\lambda + D_s + (\boldsymbol{\phi}_s - \mathbf{m}_s)^{\text{T}} (S_3 + \Sigma_o^{-1}) (\boldsymbol{\phi}_s - \mathbf{m}_s)$, and $\widetilde{\tau} = \tau + \frac{|B_s| + p}{2}$,

\[
\pi(\boldsymbol{\phi}_s | T, x) = \int \pi(\boldsymbol{\phi}_s, \sigma_s^2 | T, x)\ d\sigma_s^2 \propto \int \bigg( \frac{1}{\sigma_s^2} \bigg)^{\widetilde{\tau} + 1} \exp\bigg( -\frac{L}{2\sigma_s^2} \bigg)\ d\sigma_s^2 = \frac{\Gamma(\widetilde{\tau})}{(L/2)^{\widetilde{\tau}}}.
\]

So, as a function of $\boldsymbol{\phi}_s$, the posterior $\pi(\boldsymbol{\phi}_s | T, x)$ is,

\begin{align*}
\pi(\boldsymbol{\phi}_s | T, x) \propto L^{-\widetilde{\tau}} &= \Big( 2\lambda + D_s + (\boldsymbol{\phi}_s - \mathbf{m}_s)^{\text{T}} (S_3 + \Sigma_o^{-1}) (\boldsymbol{\phi}_s - \mathbf{m}_s) \Big)^{-\frac{2\tau + |B_s| + p}{2}} \\
&\propto \bigg( 1 + \frac{1}{2\tau + |B_s|}\ (\boldsymbol{\phi}_s - \mathbf{m}_s)^{\text{T}} \frac{(2\tau + |B_s|)(S_3 + \Sigma_o^{-1})}{2\lambda + D_s} (\boldsymbol{\phi}_s - \mathbf{m}_s) \bigg)^{-\frac{2\tau + |B_s| + p}{2}} \\
&\propto \bigg( 1 + \frac{1}{\nu}\ (\boldsymbol{\phi}_s - \mathbf{m}_s)^{\text{T}} P_s^{-1} (\boldsymbol{\phi}_s - \mathbf{m}_s) \bigg)^{-\frac{\nu + p}{2}},
\end{align*}

which is exactly in the form of a multivariate $t$-distribution, with $p$ being the dimension of $\boldsymbol{\phi}_s$, and with $\nu, \mathbf{m}_s$ and $P_s$ exactly as given in Lemma 2, completing the proof. ∎