
Loss-Based Variational Bayes Prediction

David T. Frazier ([email protected])
Rubén Loaiza-Maya ([email protected])
Gael M. Martin ([email protected])
Bonsoo Koo ([email protected])
Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia, 3800
Abstract

We propose a new approach to Bayesian prediction that caters for models with a large number of parameters and is robust to model misspecification. Given a class of high-dimensional (but parametric) predictive models, this new approach constructs a posterior predictive using a variational approximation to a generalized posterior that is directly focused on predictive accuracy. The theoretical behavior of the new prediction approach is analyzed and a form of optimality demonstrated. Applications to both simulated and empirical data using high-dimensional Bayesian neural network and autoregressive mixture models demonstrate that the approach provides more accurate results than various alternatives, including misspecified likelihood-based predictions.

Keywords: Loss-based Bayesian forecasting; variational inference; generalized (Gibbs) posteriors; proper scoring rules; Bayesian neural networks; M4 forecasting competition

1 Introduction

The conventional paradigm for Bayesian prediction is underpinned by the assumption that the true data generating process is either equivalent to the predictive model adopted, or spanned by a finite set of models over which we average. Of late, however, recognition of the unrealistic nature of such an assumption, allied with an increased interest in driving prediction by problem-specific measures of accuracy, or loss, has led to alternative approaches. Whilst antecedents of these new principles are found in the ‘probably approximately correct’ (PAC)-Bayes approach to prediction in the machine learning literature (see Alquier, 2021, for a review), it is in the statistics and econometrics literature that this ‘loss-based prediction’ has come to more formal maturity, including in terms of its theoretical validation. This includes Bayesian work on weighted combinations of predictions, such as, for example, Billio et al. (2013), Casarin et al. (2015), Pettenuzzo and Ravazzolo (2016), Bassetti et al. (2018), Baştürk et al. (2019), McAlinn and West (2019) and McAlinn et al. (2020), where weights are updated via various predictive criteria, and the true model is not assumed to be one of the constituent models - i.e. an $\mathcal{M}$-open state of the world (Bernardo and Smith, 1994) is implicitly adopted. It also includes a contribution by Loaiza-Maya et al. (2020), in which both single models and predictive mixtures are used to generate accurate Bayesian predictions in the presence of model misspecification; with both theoretical and numerical results highlighting the ability of the approach to out-perform conventional likelihood-based Bayesian predictive methods.

However, as a general rule, the existing approaches discussed above do not scale well to complex models with high-dimensional parameter spaces. In contrast, the current paper contributes to the above literature by providing a new approach for producing accurate loss-based predictions in high-dimensional problems. We begin by defining a class of flexible predictive models, conditional on a set of unknown parameters, that are a plausible mechanism for generating probabilistic predictions. A prior distribution is placed over the parameters of this predictive class, and the prior then updated to a posterior via a criterion function that captures a user-specified measure of predictive accuracy. That is, the conventional, and potentially misspecified likelihood-based update is eschewed, in favour of a function that is tailored to the predictive problem at hand; the ultimate goal being to produce accurate predictions according to the measure that matters, without requiring knowledge of the true data generating mechanism.

In the spirit of the various generalized Bayesian inferential methods, in which likelihood functions are also replaced by alternative updating mechanisms (inter alia, Chernozhukov and Hong, 2003, Zhang, 2006a, Zhang, 2006b, Jiang and Tanner, 2008, Bissiri et al., 2016, Giummolè et al., 2017, Knoblauch et al., 2019, Miller and Dunson, 2019, Syring and Martin, 2019, and Pacchiardi and Dutta, 2021), we adopt a coherent update based on the exponential of a scaled sample loss; we refer to Miller (2021) for a thorough discussion of the large sample behavior of generalized Bayesian posteriors. As in Loaiza-Maya et al. (2020) the loss is, in turn, defined by a proper scoring rule (Gneiting et al., 2007; Gneiting and Raftery, 2007) that rewards a given form of predictive accuracy; for example, accurate prediction of extreme values. Given the high-dimensional nature of the resultant posterior, numerical treatment via ‘exact’ Markov Chain Monte Carlo (MCMC) is computationally challenging; hence we adopt an ‘approximate’ approach using variational principles. Since the posterior that results from an exponentiated loss was first denoted as a ‘Gibbs posterior’ by Zhang (2006a), we refer to the variational approximation of this posterior as the Gibbs variational posterior, and the predictive distribution that results from this posterior, via the standard Bayesian calculus, as the Gibbs variational predictive (hereafter, GVP). With a slight abuse of terminology, and when it is clear from the context, we also use the abbreviation GVP to reference the method of Gibbs variational prediction, or loss-based variational prediction per se.

In an artificially simple ‘toy’ example in which MCMC sampling of the exact Gibbs posterior is feasible, we illustrate that there are negligible differences between the out-of-sample results yielded by the GVP and those produced by the predictive based on MCMC sampling from the Gibbs posterior. We establish this result under both correct specification, in which the true data generating process matches the adopted predictive model, and under misspecification of the predictive model. The correspondence between the ‘approximate’ and ‘exact’ predictions in the correct specification case mimics that documented in Frazier et al. (2019), in which posterior approximations are produced by approximate Bayesian computation (ABC) and for the log score update (only). In the misspecified case, ‘strictly coherent’ predictions are produced (Martin et al., 2022), whereby a given GVP, constructed via the use of a particular scoring rule, is shown to perform best out-of-sample according to that same score when compared with a GVP constructed via some alternative scoring rule; with the numerical values of the average out-of-sample scores closely matching those produced by the exact Gibbs predictive. That is, building a generalized posterior via a given scoring rule yields superior predictive accuracy in that rule despite any inaccuracy in the measurement of posterior uncertainty that is induced by the variational approximation.

We then undertake more extensive Monte Carlo experiments to highlight the power of the approach in genuinely high-dimensional problems, with predictives based on an autoregressive mixture model with 20 mixture components, and on a neural network model. An empirical analysis, in which GVP is used to produce accurate prediction intervals for the 4227 daily time series used in the M4 forecasting competition, illustrates the applicability of the method to reasonably large, realistic data sets.

While the numerical toy example demonstrates that little predictive accuracy is lost when using the GVP, relative to the ‘exact’ predictive, we also rigorously compare the theoretical behavior of the GVP and the potentially infeasible exact predictive. Specifically, we demonstrate that in large samples the GVP delivers predictions that are just as accurate as those obtained from the exact Gibbs predictive when measured according to the score used to construct the Gibbs posterior. We do this by proving that the GVP ‘merges’ (in the sense of Blackwell and Dubins, 1962) with the exact Gibbs predictive. This merging result relies on novel posterior concentration results that extend existing concentration results for generalized posteriors (Alquier et al., 2016, Alquier and Ridgway, 2020, and Yang et al., 2020) to temporally dependent, potentially heavy-tailed data.

The remainder of the paper is structured as follows. In Section 2 the loss-based paradigm for Bayesian prediction is defined, with its links to related segments of the literature (as flagged briefly above) detailed. In Section 3 we detail how to construct the GVP, and provide theoretical verification of its accuracy. A numerical illustration of the ability of the variational method to yield essentially equivalent predictive accuracy to that produced via MCMC sampling is given in Section 4, using a low-dimensional example. The approach that we use in the implementation of GVP, including the choice of variational family, is briefly described there, with further computational details of the stochastic gradient ascent (SGA) method used to perform the optimization included in a supplementary appendix to the paper. We then proceed with the numerical illustration of the method in high-dimensional settings - using both artificially simulated and empirical data - in Sections 5 and 6 respectively; with the illustrations highlighting that, overall, the method ‘works’ and ‘works well’. We conclude in Section 7 with a discussion of the implications of our findings, including for future research directions. Proofs of all theoretical results are given in an appendix, and all computational details and prior specifications are included in a supplementary appendix to the paper.

2 Setup and Loss-Based Bayesian Prediction

2.1 Preliminaries and Notation

Consider a stochastic process $\{Y_{t}:\Omega\rightarrow\mathcal{Y},t\in\mathbb{N}\}$ defined on the complete probability space $(\Omega,\mathcal{F},P_{0})$. Let $\mathcal{F}_{t}:=\sigma(Y_{1},\dots,Y_{t})$ denote the natural sigma-field, and let $P_{0}$ denote the infinite-dimensional distribution of the sequence $Y_{1},Y_{2},\dots$. Let $y_{1:n}=(y_{1},\dots,y_{n})^{\prime}$ denote a vector of realizations from the stochastic process.

Our goal is to use a particular collection of statistical models, adapted to $\mathcal{F}_{n}$, that describe the behavior of the observed data, to construct accurate predictions for the random variable $Y_{n+1}$. The parameters of the model are denoted by $\theta_{n}$, and the parameter space by $\Theta_{n}\subseteq\mathbb{R}^{d_{n}}$, where the dimension $d_{n}$ could grow as $n\rightarrow\infty$ and $\Theta_{1}\subseteq\Theta_{2}\subseteq\dots\subseteq\Theta_{n}$. For notational simplicity, we drop the dependence of $\theta_{n}$ and $\Theta_{n}$ on $n$ in what follows. Let $\mathcal{P}^{(n)}$ be a generic class of one-step-ahead predictive models for $Y_{n+1}$, conditioned on the information $\mathcal{F}_{n}$ available at time $n$, such that $\mathcal{P}^{(n)}:=\{P_{\theta}^{(n)},\theta\in\Theta\}$.[1] When $P_{\theta}^{(n)}(\cdot)$ admits a density with respect to the Lebesgue measure, we denote it by $p_{\theta}^{(n)}(\cdot)\equiv p_{\theta}(\cdot|\mathcal{F}_{n})$. The parameter $\theta$ thus indexes values in the predictive class, with $\theta$ taking values in the complete probability space $(\Theta,\mathcal{T},\Pi)$, where $\Pi$ measures our beliefs - either prior or posterior - about the unknown parameter $\theta$; when they exist, we denote the respective densities by $\pi(\theta)$ and $\pi(\theta|y_{1:n})$.

[1] The treatment of scalar $Y_{t}$ and one-step-ahead prediction is for the purpose of illustration only, and all the methodology that follows can easily be extended to multivariate $Y_{t}$ and multi-step-ahead prediction in the usual manner.

Denoting the likelihood function by $p_{\theta}(y_{1:n})$, the conventional approach to Bayesian prediction updates prior beliefs about $\theta$ via Bayes' rule, to form the Bayesian posterior density,

\pi(\theta|y_{1:n})=\frac{p_{\theta}\left(y_{1:n}\right)\pi(\theta)}{\int_{\Theta}p_{\theta}\left(y_{1:n}\right)\pi(\theta)\,d\theta}, \qquad (1)

in which we follow convention and abuse notation by writing this quantity as a density even though, strictly speaking, the density may not exist. The one-step-ahead predictive distribution is then constructed as

P_{\Pi}^{(n)}:=\int_{\Theta}P_{\theta}^{(n)}\,\pi(\theta|y_{1:n})\,\mathrm{d}\theta. \qquad (2)

However, when the class of predictive models indexed by $\Theta$ does not contain the true predictive distribution, there is no sense in which this conventional approach remains the ‘gold standard’. In such cases, the loss that underpins (2) should be replaced by the particular predictive loss that matters for the problem at hand. That is, our prior beliefs about $\theta$ and, hence, about the elements $P_{\theta}^{(n)}$ in $\mathcal{P}^{(n)}$, need to be updated via a criterion function defined by a user-specified measure of predictive loss.

2.2 Bayesian Updating Based on Scoring Rules

Loaiza-Maya et al. (2020) propose a method for producing Bayesian predictions using loss functions that specifically capture the accuracy of density forecasts. For $\mathcal{P}^{(n)}$ a convex class of predictive distributions on $(\Omega,\mathcal{F})$, density prediction accuracy can be measured using the positively-oriented (i.e. higher is better) scoring rule $s:\mathcal{P}^{(n)}\times\mathcal{Y}\mapsto\mathbb{R}$, where the expected scoring rule under the true distribution $P_{0}$ is defined as

\mathbb{S}(\cdot,P_{0}):=\int_{y\in\Omega}s(\cdot,y)\,\mathrm{d}P_{0}(y). \qquad (3)

Since $\mathbb{S}(\cdot,P_{0})$ is unattainable in practice, a sample estimate based on $y_{1:n}$ is used to define a sample criterion: for a given $\theta\in\Theta$, define the sample average score as

S_{n}(\theta):=\sum_{t=0}^{n-1}s(P_{\theta}^{(t)},y_{t+1}). \qquad (4)

Adopting the generalized updating rule proposed by Bissiri et al. (2016) (see also Giummolè et al., 2017, Holmes and Walker, 2017, Lyddon et al., 2019, and Syring and Martin, 2019), Loaiza-Maya et al. (2020) distinguish elements in $\Theta$ using

\pi_{w}(\theta|y_{1:n})=\frac{\exp\left[wS_{n}\left(\theta\right)\right]\pi(\theta)}{\int_{\Theta}\exp\left[wS_{n}\left(\theta\right)\right]\pi(\theta)\,\mathrm{d}\theta}, \qquad (5)

where the scale factor $w$ is obtained in a preliminary step using measures of predictive accuracy. This posterior explicitly weights elements of $\Theta$ according to their predictive accuracy in the scoring rule $s(\cdot,\cdot)$. As such, the one-step-ahead generalized predictive,

P_{\Pi_{w}}^{(n)}:=\int_{\Theta}P_{\theta}^{(n)}\,\pi_{w}(\theta|y_{1:n})\,\mathrm{d}\theta, \qquad (6)

will often outperform, in the chosen rule $s(\cdot,\cdot)$, the likelihood (or log-score)-based predictive $P_{\Pi}^{(n)}$ in cases where the model is misspecified. Given its explicit dependence on the Gibbs posterior in (5), and as noted earlier, we refer to the predictive in (6) as the (exact) Gibbs predictive.[2]

[2] We note that, in general, the choice of $w$ can be $n$-dependent. However, we eschew this dependence to maintain notational brevity.
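To fix ideas, the update in (5) can be evaluated directly in a one-dimensional toy problem. The following sketch is ours, not the paper's: it takes $wS_n(\theta)$ to be the log score of an $N(0,\theta^2)$ predictive with $w=1$, places a flat prior on a grid of $\theta$ values, and computes the normalized Gibbs posterior by simple quadrature.

```python
import numpy as np

# Hypothetical one-parameter illustration of the Gibbs posterior in (5):
# theta is the scale of an N(0, theta^2) predictive, S_n is the log score,
# w = 1, and the prior is flat over the grid. All values are our own.
rng = np.random.default_rng(0)
y = 1.5 * rng.standard_normal(500)            # data with true scale 1.5

def log_score_sum(theta, y):
    """Sample log-score criterion S_n(theta) for N(0, theta^2) predictives."""
    return np.sum(-0.5 * np.log(2 * np.pi * theta**2) - y**2 / (2 * theta**2))

w = 1.0
grid = np.linspace(0.5, 3.0, 400)
dx = grid[1] - grid[0]
log_post = np.array([w * log_score_sum(t, y) for t in grid])
log_post -= log_post.max()                    # stabilize before exponentiating
post = np.exp(log_post)
post /= post.sum() * dx                       # normalize, as in the denominator of (5)
posterior_mean = (grid * post).sum() * dx
```

With a flat prior and the log score, the posterior concentrates near the maximum-likelihood scale, so `posterior_mean` lands close to 1.5.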

3 Gibbs Variational Prediction

3.1 Overview

While $\Pi_{w}(\cdot|y_{1:n})$ is our ideal posterior, it can be difficult to sample from if the dimension of $\theta$ is even moderately large, as occurs in situations with a large number of predictors, or in flexible models. Therefore, the exact predictive itself is not readily available in such cases. However, this does not invalidate the tenet on which $P_{\Pi_{w}}^{(n)}$ is constructed. Viewed in this way, we see that the problem of predictive inference via $P_{\Pi_{w}}^{(n)}$ could be solved if one were able to construct an accurate enough approximation to $\Pi_{w}(\cdot|y_{1:n})$. Herein, we propose to approximate $P_{\Pi_{w}}^{(n)}$ using variational Bayes (VB).

VB seeks to approximate $\Pi_{w}(\cdot|y_{1:n})$ by finding the member of a class of probability measures, denoted $\mathcal{Q}$, that is closest to $\Pi_{w}(\cdot|y_{1:n})$ in a chosen divergence measure. The most common choice of divergence is the Kullback-Leibler (KL) divergence: for probability measures $P,Q$, with $P$ absolutely continuous with respect to $Q$, the KL divergence is given by $\text{D}(P||Q)=\int\log(\mathrm{d}P/\mathrm{d}Q)\,\mathrm{d}P$. VB then attempts to produce a posterior approximation to $\Pi_{w}$ by choosing $Q$ to minimize:

\text{D}\left(Q||\Pi_{w}\right)=\int\log\left[\mathrm{d}Q/\mathrm{d}\Pi_{w}(\cdot|y_{1:n})\right]\mathrm{d}Q.

In practice, VB is often conducted, in an equivalent manner, by maximizing the so-called evidence lower bound (ELBO). In the case where $\Pi_{w}$ and $Q$ admit densities $\pi_{w}$ and $q$, respectively, the ELBO is given by:

\text{ELBO}[Q||\Pi_{w}]:=\mathbb{E}_{q}\left[\log\left\{\exp\left[wS_{n}(\theta)\right]\pi(\theta)\right\}\right]-\mathbb{E}_{q}\left[\log\left\{q(\theta)\right\}\right]. \qquad (7)
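The expectation in (7) is with respect to $q$, so it can be estimated by Monte Carlo using draws from the variational density. The following sketch (our own toy, not the paper's implementation) does this for a scalar $\theta$ and a Gaussian $q$; `loss` stands in for $wS_n(\theta)$ and `log_prior` for $\log\pi(\theta)$.

```python
import numpy as np

# Hypothetical Monte Carlo estimate of the ELBO in (7) for a scalar theta
# and a Gaussian variational density q = N(mu, sigma^2). Names are ours.
rng = np.random.default_rng(1)

def elbo_estimate(mu, sigma, loss, log_prior, n_draws=5000):
    theta = mu + sigma * rng.standard_normal(n_draws)      # draws from q
    log_q = -0.5 * np.log(2 * np.pi * sigma**2) - (theta - mu)**2 / (2 * sigma**2)
    return np.mean(loss(theta) + log_prior(theta) - log_q)

# Toy check: with w * S_n(theta) = -theta^2 / 2 and a flat prior, the Gibbs
# posterior is N(0, 1), so the ELBO is maximized at (mu, sigma) = (0, 1).
loss = lambda th: -0.5 * th**2
flat = lambda th: 0.0
elbo_opt = elbo_estimate(0.0, 1.0, loss, flat)
elbo_sub = elbo_estimate(1.0, 0.5, loss, flat)
```

At the optimum the integrand is constant, so `elbo_opt` equals the log normalizing constant, $\tfrac{1}{2}\log(2\pi)$, and exceeds `elbo_sub`.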

While other classes of divergences are used in variational inference, such as the Rényi-divergence, the KL divergence is the most commonly encountered in the literature. Hence, in what follows we focus on this choice, but note here that other divergences can also be used.

The variational approximation to the Gibbs posterior, and the resulting approximation to $P_{\Pi_{w}}^{(n)}$, can be defined as follows.

Definition 1

For $\mathcal{Q}$ a class of probability distributions, the variational posterior $\widehat{Q}$ satisfies

\widehat{Q}:=\operatorname*{argmin}_{Q\in\mathcal{Q}}\text{D}\left[Q||\Pi_{w}(\cdot|y_{1:n})\right],

and the Gibbs variational predictive (GVP) is defined as

P_{Q}^{(n)}:=\int_{\Theta}P_{\theta}^{(n)}\,\mathrm{d}\widehat{Q}(\theta).
Remark 1

In general, we only require that $\widehat{Q}$ is an approximate minimizer, i.e., $\text{D}[\widehat{Q}||\Pi_{w}(\cdot|y_{1:n})]\leq\inf_{Q\in\mathcal{Q}}\text{D}[Q||\Pi_{w}(\cdot|y_{1:n})]+o_{p}(1)$. The latter may be useful in cases where $\widehat{Q}$ may not be unique, or in cases where it does not exist for a given $n$ but is well-behaved for all $n$ large enough.

The GVP, $P_{Q}^{(n)}$, circumvents the need to construct $P_{\Pi_{w}}^{(n)}$ by sampling from the Gibbs posterior $\Pi_{w}(\cdot|y_{1:n})$. In essence, we replace the sampling problem with an optimization problem, for which reliable methods exist even if $\Theta$ is high-dimensional, and which in turn yields an approximation to $\Pi_{w}(\cdot|y_{1:n})$. Consequently, even though it is not always feasible to access $P_{\Pi_{w}}^{(n)}$ via simulation from $\Pi_{w}(\cdot|y_{1:n})$ when $\Theta$ is high-dimensional, access to the variational predictive $P_{Q}^{(n)}$ remains feasible.
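Once a variational posterior is in hand, forming the GVP amounts to averaging the predictive distributions over draws of $\theta$. A minimal sketch, with a hypothetical $\widehat{Q}=N(0.5,0.1^2)$ and an $N(\theta,1)$ predictive family standing in for the model (both our own choices):

```python
import numpy as np
from scipy.stats import norm

# Illustrative sketch of the mixture in Definition 1: draw theta from the
# variational posterior Q-hat and average the predictive densities P_theta.
# The family N(theta, 1) and Q-hat = N(0.5, 0.1^2) are hypothetical.
rng = np.random.default_rng(4)
theta_draws = rng.normal(0.5, 0.1, size=2000)      # theta ~ Q-hat

def gvp_density(y_grid):
    # average of the predictive densities over the variational draws
    return norm.pdf(y_grid[:, None], loc=theta_draws, scale=1.0).mean(axis=1)

grid = np.linspace(-4.0, 5.0, 901)
dx = grid[1] - grid[0]
dens = gvp_density(grid)
```

The resulting mixture integrates to (approximately) one over the grid and is centered near the variational mean.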

This variational approach to prediction is related to the ‘generalized variational inference’ approach of Knoblauch et al. (2019), but with two main differences. First, our approach is focused on predictive accuracy, not on parameter inference: our only goal is to produce density forecasts that are as accurate as possible in the chosen scoring rule $s(\cdot,y)$.[3] Second, our approach follows the idea of Bissiri et al. (2016) and Loaiza-Maya et al. (2020) and targets the predictions built from the Gibbs posterior in (5), rather than the exponentiated form of some general loss as in Knoblauch et al. (2019). This latter point is certainly critical if inferential accuracy were still deemed to be important, since without the tempering constant $w$ that defines the posterior in (5), the exponentiated loss function can be very flat and the posterior $\Pi_{w}(\cdot|y_{1:n})$ uninformative about $\theta$.[4] It is our experience, however (Loaiza-Maya et al., 2020), that in settings where predictive accuracy is the only goal, the choice of $w$ has little impact on the generation of accurate predictions via $P_{\Pi_{w}}^{(n)}$, as long as the sample size is reasonably large. Preliminary experimentation in the current setting, in which the variational predictive $P_{Q}^{(n)}$ is the target, suggests that this finding remains relevant; as a consequence, we have adopted a default value of $w=1$ in all numerical work. Further research is needed to deduce the precise impact of $w$ on $P_{Q}^{(n)}$, in particular for smaller sample sizes, and we leave this important topic for future work.[5]

[3] Critically, since predictive accuracy is our goal, we are not concerned with the potential over/under-coverage of credible sets for the model unknowns built from generalized posteriors, or VB posteriors; see Miller (2021) for a theoretical discussion of posterior coverage in generalized Bayesian inference.
[4] See Bissiri et al. (2016), Giummolè et al. (2017), Holmes and Walker (2017), Lyddon et al. (2019), Syring and Martin (2019) and Pacchiardi and Dutta (2021) for various approaches to the setting of $w$ in inferential settings.
[5] We note that the work of Wu and Martin (2021) attempts to calibrate the coverage of posterior predictives constructed from power posteriors. While their approach is not directly applicable in our context, we conjecture that an appropriately modified version could be applied in our setting to derive posterior predictives with reasonable predictive coverage.

3.2 Accuracy of the GVP

By driving the updating mechanism with the measure of predictive accuracy that matters, it is hoped that GVP produces accurate predictions without requiring correct model specification. The following result demonstrates that, in large samples, the GVP, $P^{(n)}_{Q}$, produces predictions that are indistinguishable from the exact predictive, $P^{(n)}_{\Pi_{w}}$; thus the use of a variational approximation to the Gibbs posterior has little impact on the accuracy of the resulting posterior predictive. For probability measures $P,Q$, let $d_{\text{TV}}\{P,Q\}$ denote the total variation distance.

Theorem 1

Under Assumptions 1-3 in Appendix A, $d_{\text{TV}}\{P_{Q}^{(n)},P_{\Pi_{w}}^{(n)}\}\rightarrow 0$ in probability.

Theorem 1 demonstrates that distributional predictions made using the GVP and the exact Gibbs predictive agree, in the rule $s(\cdot,\cdot)$, in large samples. This type of result is colloquially known as a ‘merging’ result (Blackwell and Dubins, 1962), and means that if we take the exact Gibbs predictive as our benchmark for (Bayesian) prediction, then the GVP is as accurate as this possibly infeasible benchmark (at least in large samples).

Remark 2

As an intermediate step in the proof of Theorem 1, we must demonstrate posterior concentration of $\widehat{Q}$, and $\Pi_{w}$, in cases where the data is temporally dependent, possibly heavy-tailed, and where $S_{n}(\theta)$ is an arbitrary scoring rule. For the sake of brevity, and to keep the focus on the practical usefulness of this approach, we present these details in Appendix A. However, we note that our focus on the types of data often encountered in forecasting, for example, financial, economic, or atmospheric processes, does not allow us to use existing approaches to guarantee posterior concentration of $\widehat{Q}$ and $\Pi_{w}$. Instead, we must rely on the smoothness of $S_{n}(\theta)$ and the control of certain remainder terms to demonstrate this concentration. We refer the interested reader to Appendix A for full details and discussion.

As a further result, we can demonstrate that GVP is as accurate as the infeasible frequentist ‘optimal [distributional] predictive’ approach suggested in Gneiting and Raftery (2007). Following Gneiting and Raftery (2007), define the ‘optimal’ (frequentist) predictive within the class $\mathcal{P}^{(n)}$, and based on scoring rule $s(\cdot,y)$, as $P_{\star}^{(n)}:=P(\cdot|\mathcal{F}_{n},\theta_{\star})$, where

\theta_{\star}:=\arg\max_{\theta\in\Theta}\mathcal{S}(\theta), \text{ with } \mathcal{S}(\theta)=\lim_{n\rightarrow\infty}\mathbb{E}\left[S_{n}(\theta)/n\right]. \qquad (8)

The following result demonstrates that, in large samples, predictions produced using the GVP are equivalent to those made using the optimal frequentist predictive $P_{\star}^{(n)}$.

Theorem 2

Under Assumptions 1-3 in Appendix A, $d_{\text{TV}}\{P_{Q}^{(n)},P_{\star}^{(n)}\}\rightarrow 0$ in probability.

We remind the reader that, for the sake of brevity, statement of the assumptions, and proofs of all stated results, are given in Appendix A.

4 A Toy Model of Financial Returns

We now illustrate the behavior of GVP in a simple toy example for a financial return, $Y_{t}$, in which the exact Gibbs predictive is also accessible.[6] This example serves two purposes. First, it illustrates Theorem 2 and, in so doing, highlights the benefits of GVP relative to prediction based on a misspecified likelihood function. Second, the GVP is shown to yield almost identical (average) out-of-sample predictions to those produced by the exact Gibbs posterior accessed via MCMC; i.e. numerical support for Theorem 1 is provided.

[6] We reiterate here, and without subsequent repetition, that all details of the prior specifications and the numerical steps required to produce the variational approximations, for this and the following numerical examples, are provided in the supplementary appendices.

The predictive class, $\mathcal{P}^{(n)}$, is defined by a generalized autoregressive conditional heteroscedastic, GARCH(1,1), model with Gaussian errors: $Y_{t}=\theta_{1}+\sigma_{t}\varepsilon_{t}$, $\varepsilon_{t}\overset{i.i.d.}{\sim}N(0,1)$, $\sigma_{t}^{2}=\theta_{2}+\theta_{3}(Y_{t-1}-\theta_{1})^{2}+\theta_{4}\sigma_{t-1}^{2}$, with $\theta=(\theta_{1},\theta_{2},\theta_{3},\theta_{4})^{\prime}$. We adopt three alternative specifications for the true data generating process (DGP): 1) a model that matches the Gaussian GARCH(1,1) predictive class; 2) a stochastic volatility model with leverage:

Y_{t}=\exp\left(\frac{h_{t}}{2}\right)\varepsilon_{t},\quad h_{t}=-2+0.7\left(h_{t-1}-(-2)\right)+\eta_{t},\quad\left(\varepsilon_{t},\eta_{t}\right)^{\prime}\overset{i.i.d.}{\sim}N\left(0,\begin{bmatrix}1&-0.35\\-0.35&0.25\end{bmatrix}\right);

and 3) a stochastic volatility model with a smooth transition in the volatility autoregression:

Y_{t}=\exp\left(\frac{h_{t}}{2}\right)\varepsilon_{t},\quad h_{t}=0.9\,g\left(h_{t-1}\right)h_{t-1}+\eta_{t},\quad\eta_{t}\overset{i.i.d.}{\sim}N\left(0,0.25\right),

where $g(x)=\left(1+\exp\left(-2x\right)\right)^{-1}$. DGP 1) defines a correct specification setting, whilst DGPs 2) and 3) characterize different forms of misspecification.
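For concreteness, the three DGPs can be simulated as follows. This is an illustrative sketch: the function names are ours, and the GARCH(1,1) parameter values for DGP 1 are our own choices, since the text leaves them unspecified; the parameter values for DGPs 2 and 3 follow the displays above.

```python
import numpy as np

# Hedged simulation sketch of the three DGPs; the GARCH parameters in
# sim_garch are illustrative assumptions, not values from the paper.
rng = np.random.default_rng(0)

def sim_garch(n, theta1=0.0, theta2=0.05, theta3=0.1, theta4=0.85):
    """DGP 1: Gaussian GARCH(1,1), matching the predictive class."""
    y = np.empty(n)
    sig2 = theta2 / (1.0 - theta3 - theta4)       # start at unconditional variance
    for t in range(n):
        y[t] = theta1 + np.sqrt(sig2) * rng.standard_normal()
        sig2 = theta2 + theta3 * (y[t] - theta1)**2 + theta4 * sig2
    return y

def sim_sv_leverage(n):
    """DGP 2: stochastic volatility with leverage (contemporaneous shocks)."""
    cov = np.array([[1.0, -0.35], [-0.35, 0.25]])
    eps, eta = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    h, h_prev = np.empty(n), -2.0
    for t in range(n):
        h[t] = -2.0 + 0.7 * (h_prev - (-2.0)) + eta[t]
        h_prev = h[t]
    return np.exp(h / 2.0) * eps

def sim_sv_smooth(n):
    """DGP 3: stochastic volatility with a smooth-transition autoregression."""
    g = lambda x: 1.0 / (1.0 + np.exp(-2.0 * x))
    eps = rng.standard_normal(n)
    eta = np.sqrt(0.25) * rng.standard_normal(n)
    h, h_prev = np.empty(n), 0.0
    for t in range(n):
        h[t] = 0.9 * g(h_prev) * h_prev + eta[t]
        h_prev = h[t]
    return np.exp(h / 2.0) * eps

y_garch = sim_garch(1000)
y_svl = sim_sv_leverage(1000)
y_svs = sim_sv_smooth(1000)
```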

Denoting the predictive density function associated with the Gaussian GARCH(1,1) model, evaluated at the observed $y_{t+1}$, as $p_{\theta}(y_{t+1}|\mathcal{F}_{t})$, we implement GVP using four alternative forms of (positively-oriented) scoring rules:

s^{\text{LS}}\left(P_{\theta}^{(t)},y_{t+1}\right)=\log p_{\theta}(y_{t+1}|\mathcal{F}_{t}), \qquad (9)
s^{\text{CRPS}}\left(P_{\theta}^{(t)},y_{t+1}\right)=-\int_{-\infty}^{\infty}\left[P_{\theta}^{(t)}(y)-I(y\geq y_{t+1})\right]^{2}dy, \qquad (10)
s^{\text{CLS-A}}\left(P_{\theta}^{(t)},y_{t+1}\right)=\log p_{\theta}(y_{t+1}|\mathcal{F}_{t})\,I\left(y_{t+1}\in A\right)+\left[\ln\int_{A^{c}}p_{\theta}(y|\mathcal{F}_{t})dy\right]I\left(y_{t+1}\in A^{c}\right), \qquad (11)
s^{\text{IS}}\left(P_{\theta}^{(t)},y_{t+1}\right)=\left(u_{t+1}-l_{t+1}\right)+\frac{2}{\alpha}\left(l_{t+1}-y_{t+1}\right)I\left(y_{t+1}<l_{t+1}\right)+\frac{2}{\alpha}\left(y_{t+1}-u_{t+1}\right)I\left(y_{t+1}>u_{t+1}\right), \qquad (12)

where $l_{t+1}$ and $u_{t+1}$ denote the $100(\frac{\alpha}{2})\%$ and $100(1-\frac{\alpha}{2})\%$ predictive quantiles. The log-score (LS) in (9) is a ‘local’ scoring rule, attaining a high value if the observed value, $y_{t+1}$, is in the high density region of $p_{\theta}(y_{t+1}|\mathcal{F}_{t})$. The continuously ranked probability score (CRPS) in (10) (Gneiting and Raftery, 2007) is, in contrast, sensitive to distance, and rewards the assignment of high predictive mass near to the realized $y_{t+1}$, rather than just at that value. The score in (11) is the censored likelihood score (CLS) of Diks et al. (2011), which rewards predictive accuracy over any pre-specified region of interest $A$ ($A^{c}$ denoting its complement). We use the score for $A$ defining the lower and upper tails of the predictive distribution, as determined in turn by the 10%, 20%, 80% and 90% percentiles of the empirical distribution of $Y_{t}$, labelling these cases hereafter as CLS10, CLS20, CLS80 and CLS90. A high value of this score in any particular instance thus reflects a predictive distribution that accurately predicts extreme values of the financial return. The last score considered is the interval score (IS) in (12), which is designed to measure the accuracy of the $100(1-\alpha)\%$ predictive interval, where $\alpha=0.05$; this score rewards narrow intervals with accurate coverage. All components of (11) have closed-form solutions for the (conditionally) Gaussian predictive model, as do the integral in (10) and the bounds in (12).
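For a Gaussian one-step predictive with mean $\mu$ and standard deviation $\sigma$, these scores can be sketched as below. The function names are ours; the closed-form Gaussian expression for the CRPS is the standard one, and the lower-tail version of the CLS is shown as one instance of (11). The interval score is coded exactly as written in (12), in which smaller values correspond to narrower, better-calibrated intervals.

```python
import numpy as np
from scipy.stats import norm

# Hedged sketch of the scoring rules (9)-(12) for an N(mu, sigma^2)
# predictive; function names and the lower-tail choice of A are ours.
def s_ls(y, mu, sigma):
    """Log score (9)."""
    return norm.logpdf(y, mu, sigma)

def s_crps(y, mu, sigma):
    """Positively-oriented CRPS (10), Gaussian closed form."""
    z = (y - mu) / sigma
    crps = sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
    return -crps

def s_cls_lower(y, mu, sigma, a):
    """Censored log score (11) with A = (-inf, a] (lower-tail accuracy)."""
    if y <= a:
        return norm.logpdf(y, mu, sigma)
    return np.log(1.0 - norm.cdf(a, mu, sigma))   # predictive mass on A^c

def s_is(y, mu, sigma, alpha=0.05):
    """Interval score, as written in (12)."""
    l = norm.ppf(alpha / 2, mu, sigma)
    u = norm.ppf(1 - alpha / 2, mu, sigma)
    return (u - l) + (2 / alpha) * (l - y) * (y < l) + (2 / alpha) * (y - u) * (y > u)
```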

In total, then, seven distinct scoring rules are used to define the sample criterion function in (4). In what follows we reference these seven criterion functions, and the associated Gibbs posteriors, using the notation $S_{n}^{j}(\theta)=\sum_{t=0}^{n-1}s^{j}(P_{\theta}^{(t)},y_{t+1})$, for $j\in\{\text{LS, CRPS, CLS10, CLS20, CLS80, CLS90, IS}\}$.

4.1 Estimation of the Gibbs Predictives

Given the low dimension of the predictive model, it is straightforward to use an MCMC scheme to sample from the exact Gibbs posterior $\pi_{w}^{j}(\theta|y_{1:n})\propto\exp\{wS_{n}^{j}(\theta)\}\pi(\theta)$, where $\pi_{w}^{j}(\theta|y_{1:n})$ is the exact Gibbs posterior density in (5) computed under scoring rule $j$. As noted earlier, in this and all following numerical work we set $w=1$. For each of the $j$ posteriors, we initialize the chains using a burn-in period of 20000 iterations, and retain the next $M=20000$ draws $\theta^{(m,j)}\sim\pi_{w}^{j}(\theta|y_{1:n})$, $m=1,\dots,M$. The posterior draws are then used to estimate the exact Gibbs predictive in (6) via $\hat{P}_{\Pi_{w}}^{(n,j)}=\frac{1}{M}\sum_{m=1}^{M}P_{\theta^{(m,j)}}^{(n)}$.
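The particular MCMC scheme is not spelled out here, but a generic random-walk Metropolis sampler targeting a Gibbs posterior can be sketched as follows. This is our own toy, not the paper's sampler: $\theta$ is scalar, $wS_n(\theta)$ is the log score of an $N(\theta,1)$ predictive with a flat prior (so the target is $N(\bar{y},1/n)$ and the chain can be checked), and the step size, burn-in, and chain length are illustrative.

```python
import numpy as np

# Hypothetical random-walk Metropolis sketch for a Gibbs posterior
# proportional to exp(w * S_n(theta)) * pi(theta); all settings are ours.
rng = np.random.default_rng(3)
y = 0.3 + rng.standard_normal(400)
w = 1.0

def log_gibbs(theta):
    # log score of an N(theta, 1) predictive, flat prior, constants dropped
    return w * np.sum(-0.5 * (y - theta) ** 2)

theta, cur = 0.0, log_gibbs(0.0)
draws = []
for m in range(25000):
    prop = theta + 0.1 * rng.standard_normal()   # random-walk proposal
    new = log_gibbs(prop)
    if np.log(rng.uniform()) < new - cur:        # Metropolis accept/reject
        theta, cur = prop, new
    if m >= 5000:                                # discard burn-in draws
        draws.append(theta)
draws = np.asarray(draws)
```

For this conjugate-style toy, the retained draws have mean close to $\bar{y}$ and standard deviation close to $1/\sqrt{n}=0.05$.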

To perform GVP, we first need to produce the variational approximation of \pi_{w}^{j}(\theta|y_{1:n}). This is achieved in several steps. First, the parameters of the GARCH(1,1) model are transformed to the real line. With some abuse of notation, the GARCH(1,1) parameters introduced in the previous section are given the superscript r to signal ‘raw’. The parameter vector \theta is then re-defined as \theta=(\theta_{1},\theta_{2},\theta_{3},\theta_{4})'=(\theta_{1}^{r},\log(\theta_{2}^{r}),\Phi_{1}^{-1}(\theta_{3}^{r}),\Phi_{1}^{-1}(\theta_{4}^{r}))', where \Phi_{1} denotes the (univariate) normal cumulative distribution function (cdf). The next step involves approximating \pi_{w}^{j}(\theta|y_{1:n}) for the re-defined \theta. We adopt the mean-field variational family (see for example Blei et al., 2017), with a product-form Gaussian density q_{\lambda}(\theta)=\prod_{i=1}^{4}\phi_{1}(\theta_{i};\mu_{i},d_{i}^{2}), where \lambda=(\mu',d')' is the vector of variational parameters, comprised of the mean and standard deviation vectors \mu=(\mu_{1},\mu_{2},\mu_{3},\mu_{4})' and d=(d_{1},\dots,d_{4})' respectively, and \phi_{1} denotes the (univariate) normal probability density function (pdf). Denote by Q_{\lambda} and \Pi_{w}^{j} the distribution functions associated with q_{\lambda}(\theta) and \pi_{w}^{j}(\theta|y_{1:n}), respectively. The approximation is then calibrated by solving the maximization problem

\tilde{\lambda}=\arg\max_{\lambda\in\Lambda}\mathcal{L}(\lambda), (13)

where \mathcal{L}(\lambda):=\text{ELBO}[Q_{\lambda}||\Pi_{w}^{j}]. Recalling that S_{n}^{j}(\theta) denotes (4) for scoring rule j, (7) (for case j) becomes \mathcal{L}(\lambda)=\mathbb{E}_{q_{\lambda}}[wS_{n}^{j}(\theta)+\log\pi(\theta)-\log q_{\lambda}(\theta)]. Optimization is performed via SGA, as described in Section S2 of the supplementary appendix. Once \tilde{\lambda} has been calibrated, the GVP is estimated as \hat{P}_{Q}^{(n,j)}=\frac{1}{M}\sum_{m=1}^{M}P_{\theta^{(m,j)}}^{(n)}, with \theta^{(m,j)}\overset{i.i.d.}{\sim}q_{\tilde{\lambda}}(\theta). To calibrate \tilde{\lambda}, 10000 VB iterations are used, and to estimate the variational predictive we set M=1000.
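The SGA step can be sketched with a reparameterized gradient estimator for a mean-field Gaussian ELBO. Here `grad_log_h` is a placeholder for the gradient of wS_{n}^{j}(\theta)+\log\pi(\theta); the single-draw estimator, step size and initialization are illustrative choices only (the paper's actual SGA settings are given in its supplementary appendix).

```python
import math
import random

def sga_mean_field(grad_log_h, dim, n_iter=10000, lr=0.01, seed=0):
    """Maximise the ELBO over a mean-field Gaussian q_lambda by SGA,
    using the reparameterisation theta = mu + d * eps, eps ~ N(0, I).

    grad_log_h(theta) is a placeholder for the gradient of
    w * S_n(theta) + log pi(theta)."""
    rng = random.Random(seed)
    mu = [0.0] * dim
    log_d = [math.log(0.1)] * dim          # work with log sd for positivity
    for _ in range(n_iter):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        d = [math.exp(v) for v in log_d]
        theta = [m + s * e for m, s, e in zip(mu, d, eps)]
        g = grad_log_h(theta)
        for i in range(dim):
            mu[i] += lr * g[i]                              # dELBO/dmu_i
            log_d[i] += lr * (g[i] * d[i] * eps[i] + 1.0)   # dELBO/dlog d_i
    return mu, [math.exp(v) for v in log_d]
```

For a standard-normal target (grad_log_h(\theta)=-\theta) the optimum is \mu=0, d=1, which the single-draw estimator approaches up to stochastic-gradient noise; the "+1.0" term is the gradient of the Gaussian entropy with respect to \log d_{i}.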

To produce the numerical results we generate a time series of length T=6000 from the true DGP and perform an expanding-window exercise from n=1000 to n=5999. For j\in\{\text{LS, CRPS, CLS10, CLS20, CLS80, CLS90, IS}\} we construct the predictive densities \hat{P}_{\Pi_{w}}^{(n,j)} and \hat{P}_{Q}^{(n,j)} as outlined above. Then, for i\in\{\text{LS, CRPS, CLS10, CLS20, CLS80, CLS90, IS}\} we compute the measures of out-of-sample predictive accuracy S_{\Pi_{w}}^{i,j,n}=s^{i}(\hat{P}_{\Pi_{w}}^{(n,j)},y_{n+1}) and S_{Q}^{i,j,n}=s^{i}(\hat{P}_{Q}^{(n,j)},y_{n+1}). Finally, we compute the average out-of-sample scores S_{\Pi_{w}}^{i,j}=\frac{1}{5000}\sum_{n=1000}^{5999}S_{\Pi_{w}}^{i,j,n} and S_{Q}^{i,j}=\frac{1}{5000}\sum_{n=1000}^{5999}S_{Q}^{i,j,n}. The results are tabulated and discussed in the following section.
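The expanding-window exercise itself reduces to a simple loop; in the sketch below, `fit_predictive` and `score` are placeholders for the posterior-construction and scoring steps described above.

```python
def expanding_window_scores(y, fit_predictive, score, n_start):
    """Average out-of-sample score from an expanding-window exercise.

    fit_predictive(y[:n]) returns a one-step-ahead predictive object
    conditioned on the first n observations; score(P, y_next) evaluates
    it at the realised value. Both are user-supplied placeholders."""
    total, count = 0.0, 0
    for n in range(n_start, len(y)):
        P = fit_predictive(y[:n])     # condition on y_1, ..., y_n
        total += score(P, y[n])       # score against y_{n+1}
        count += 1
    return total / count
```

With len(y) = 6000 and n_start = 1000, this yields the 5000 one-step-ahead evaluations used in the averages above.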

4.2 Results

The results of the simulation exercise are recorded in Table 1. Panels A to C record the results for Scenarios 1) to 3), with the average out-of-sample scores associated with the exact Gibbs predictive (estimated via MCMC) appearing in the left-hand-side panel and the average scores for GVP appearing in the right-hand-side panel. All values on the diagonal of each sub-panel correspond to the case where i=j. In the misspecified case, numerical validation of the asymptotic result that the GVP concentrates onto the optimal predictive (Theorem 2) occurs if the largest values (bolded) in a column appear on the diagonal (‘strict coherence’ in the language of Martin et al., 2022). In the correctly specified case, in which all proper scoring rules will, for a large enough sample, pick out the one true model (Gneiting and Raftery, 2007), we would expect all values in a given column to be very similar to one another, with differences reflecting sampling variation only. Finally, validation of the theoretical property of merging between the exact and variational Gibbs predictives (Theorem 1) occurs if the corresponding results in all left- and right-hand-side panels are equivalent.

As is clear, there is almost uniform numerical validation of both theoretical results. Under misspecification Scenario 2) (Panel B), the GVP results are strictly coherent (i.e. all bold values lie on the diagonal), with the exact Gibbs predictive results equivalent to the corresponding GVP results to three or four decimal places. The same broad findings obtain under misspecification Scenario 3), apart from the fact that the IS updates are second best (to the log-score, and then only just) in terms of the out-of-sample IS measure. In Panel A, on the other hand, we see the expected (virtual) equivalence of all results in a given column, reflecting the fact that all Gibbs predictives (however estimated) are concentrating onto the true predictive model and, hence, have identical out-of-sample performance. Of course, for a large but still finite number of observations, we would expect the log-score to perform best, due to the efficiency of the implicit maximum likelihood estimator underpinning the results and, to all intents and purposes, this is exactly what we observe in Panels A.1 and A.2.

In summary, GVP performs as anticipated, and reaps distinct benefits in terms of predictive accuracy. Any inaccuracy in the measurement of parameter uncertainty also has negligible impact on the finite sample performance of GVP relative to an exact comparator. In Section 5, we extend the investigation into design settings that mimic the high-dimensional problems to which we would apply the variational approach in practice, followed by an empirical application - again using a high-dimensional predictive model - in Section 6.

Panel A. True DGP: GARCH(1,1)
A.1: Exact Gibbs predictive A.2: GVP
Average out-of-sample score Average out-of-sample score
LS CLS10 CLS20 CLS80 CLS90 CRPS IS LS CLS10 CLS20 CLS80 CLS90 CRPS IS
U.method
LS -0.0433 -0.2214 -0.3087 -0.3029 -0.2150 -0.1438 -1.1853 -0.0434 -0.2216 -0.3088 -0.3028 -0.2150 -0.1438 -1.1857
CLS10 -0.0485 -0.2222 -0.3094 -0.3069 -0.2190 -0.1441 -1.2009 -0.0496 -0.2227 -0.3100 -0.3075 -0.2199 -0.1441 -1.2053
CLS20 -0.0482 -0.2221 -0.3094 -0.3066 -0.2185 -0.1440 -1.2014 -0.0493 -0.2225 -0.3097 -0.3072 -0.2193 -0.1440 -1.2052
CLS80 -0.0558 -0.2305 -0.3198 -0.3032 -0.2150 -0.1446 -1.2087 -0.0567 -0.2311 -0.3204 -0.3033 -0.2152 -0.1446 -1.2107
CLS90 -0.0495 -0.2258 -0.3143 -0.3029 -0.2150 -0.1441 -1.1973 -0.0502 -0.2265 -0.3149 -0.3031 -0.2152 -0.1441 -1.1993
CRPS -0.0462 -0.2239 -0.3116 -0.3027 -0.2147 -0.1438 -1.1874 -0.0474 -0.2239 -0.3112 -0.3041 -0.2162 -0.1439 -1.1918
IS -0.0438 -0.2216 -0.3089 -0.3031 -0.2153 -0.1438 -1.1877 -0.0440 -0.2218 -0.3091 -0.3031 -0.2153 -0.1438 -1.1882
Panel B. True DGP: Stochastic volatility with leverage
B.1: Exact Gibbs predictive B.2: GVP
Average out-of-sample score Average out-of-sample score
LS CLS10 CLS20 CLS80 CLS90 CRPS IS LS CLS10 CLS20 CLS80 CLS90 CRPS IS
U.method
LS -0.5636 -0.3753 -0.5452 -0.3536 -0.2512 -0.2313 -2.3468 -0.5633 -0.3752 -0.545 -0.3535 -0.2511 -0.2313 -2.3467
CLS10 -1.0193 -0.3336 -0.5021 -0.8379 -0.7339 -0.3679 -3.447 -1.0156 -0.3336 -0.502 -0.834 -0.7302 -0.3659 -3.4207
CLS20 -0.806 -0.3354 -0.4968 -0.6291 -0.5267 -0.2863 -2.9923 -0.8055 -0.3355 -0.4969 -0.6282 -0.5259 -0.2861 -2.9853
CLS80 -0.9203 -0.7372 -0.9311 -0.329 -0.2292 -0.2402 -3.3135 -0.9357 -0.7514 -0.9463 -0.329 -0.2292 -0.2402 -3.3248
CLS90 -0.9575 -0.7615 -0.9649 -0.3294 -0.2292 -0.2425 -3.4213 -0.9959 -0.7969 -1.0033 -0.3293 -0.2291 -0.2426 -3.4476
CRPS -0.5692 -0.4029 -0.5671 -0.3431 -0.2419 -0.23 -2.4312 -0.5649 -0.3985 -0.5626 -0.3432 -0.2419 -0.2301 -2.4338
IS -0.6552 -0.3986 -0.6111 -0.3713 -0.248 -0.2604 -2.203 -0.655 -0.3985 -0.6109 -0.3712 -0.2479 -0.2603 -2.2033
Panel C. True DGP: Stochastic volatility with smooth transition
C.1: Exact Gibbs predictive C.2: GVP
Average out-of-sample score Average out-of-sample score
LS CLS10 CLS20 CLS80 CLS90 CRPS IS LS CLS10 CLS20 CLS80 CLS90 CRPS IS
U.method
LS -1.6858 -0.4239 -0.672 -0.6751 -0.4369 -0.7196 -7.0726 -1.686 -0.424 -0.6721 -0.6752 -0.4371 -0.7196 -7.0742
CLS10 -1.8173 -0.4172 -0.6707 -0.8016 -0.5468 -0.821 -7.8217 -1.8145 -0.4172 -0.6706 -0.7987 -0.5437 -0.8183 -7.7717
CLS20 -1.7173 -0.419 -0.6674 -0.7067 -0.4616 -0.7425 -7.2516 -1.7171 -0.4192 -0.6676 -0.7063 -0.4611 -0.742 -7.2481
CLS80 -1.733 -0.465 -0.7199 -0.6681 -0.4304 -0.7511 -7.3087 -1.7339 -0.4654 -0.7204 -0.6684 -0.4307 -0.7513 -7.3159
CLS90 -1.8777 -0.5927 -0.8579 -0.6731 -0.4288 -0.8707 -8.1299 -1.8819 -0.5952 -0.8608 -0.674 -0.4295 -0.8734 -8.1211
CRPS -1.6938 -0.4283 -0.6762 -0.6812 -0.4435 -0.7184 -7.2059 -1.6938 -0.4286 -0.6765 -0.6809 -0.4432 -0.7185 -7.2121
IS -1.6879 -0.4244 -0.6726 -0.6754 -0.4374 -0.7208 -7.0915 -1.6881 -0.4245 -0.6728 -0.6754 -0.4374 -0.7208 -7.093
Table 1: Predictive accuracy of GVP using a Gaussian GARCH(1,1) predictive model for a financial return. Panel A corresponds to the correctly specified case, and Panels B and C to the two different misspecified settings as described in the text.  The rows in each panel refer to the update method (U.method) used. The columns refer to the out-of-sample measure used to compute the average scores. The figures in bold are the largest average scores according to a given out-of-sample measure.

5 Complex Time Series Examples

In this section we demonstrate the application of GVP in two realistic simulated examples. In both cases the assumed predictive model is high-dimensional, and the exact Gibbs posterior, even if accessible in principle via MCMC, is computationally challenging. The mean-field class is adopted in both cases to produce variational approximations to the Gibbs posterior. The simulation design for each example (including the choice of w=1) mimics that for the toy example, apart from obvious changes to the true DGP and the assumed predictive model, some changes to the sizes of the estimation and evaluation periods, and the absence of comparative exact results. For reasons of computational burden, we remove the CRPS update from consideration in the first example.

5.1 Example 1: Autoregressive Mixture Predictive Model

In this example we adopt a true DGP in which YtY_{t} evolves according to the logistic smooth transition autoregressive (LSTAR) process proposed in Teräsvirta (1994):

Y_{t}=\rho_{1}Y_{t-1}+\rho_{2}\left\{\frac{1}{1+\exp\left[-\gamma\left(Y_{t-1}-c\right)\right]}\right\}Y_{t-1}+\sigma_{\varepsilon}\varepsilon_{t}, (14)

where \varepsilon_{t}\overset{i.i.d.}{\sim}t_{\nu}, and t_{\nu} denotes the standardised Student-t distribution with \nu degrees of freedom. This model has the ability to produce data that exhibit a range of complex features. For example, it not only allows for skewness in the marginal density of Y_{t}, but can also produce temporal dependence structures that are asymmetric. We thus use this model as an illustration of a complex DGP whose characteristics are hard to replicate with simple parsimonious models; hence the need to adopt a highly parameterized predictive model, plus the need to acknowledge that even that representation will be misspecified.
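A sketch of this DGP, with a standardised Student-t error built from a normal/chi-squared ratio (assuming integer \nu>2 so that the unit-variance scaling is well-defined), is:

```python
import math
import random

def simulate_lstar(T, rho1=0.0, rho2=0.9, gamma=5.0, c=0.0,
                   sigma_eps=1.0, nu=3, seed=0):
    """Simulate the LSTAR process in (14) with standardised Student-t
    errors (unit variance), assuming integer nu > 2."""
    rng = random.Random(seed)
    scale = math.sqrt((nu - 2.0) / nu)   # standardises t_nu to unit variance

    def std_t():
        # t_nu = N(0,1) / sqrt(chi2_nu / nu), chi2 as a sum of squared normals
        z = rng.gauss(0.0, 1.0)
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
        return scale * z / math.sqrt(chi2 / nu)

    def logistic(u):
        # overflow-safe logistic transition function
        if u >= 0.0:
            return 1.0 / (1.0 + math.exp(-u))
        e = math.exp(u)
        return e / (1.0 + e)

    y = [0.0]
    for _ in range(T - 1):
        g = logistic(gamma * (y[-1] - c))
        y.append(rho1 * y[-1] + rho2 * g * y[-1] + sigma_eps * std_t())
    return y
```

The default arguments match the true parameter values used in Section 5.3.1.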

The assumed predictive model is based on the flexible Bayesian non-parametric structure proposed in Antoniano-Villalobos and Walker (2016). The predictive distribution for Y_{t}, conditional on the observed y_{t-1}, is constructed from a mixture of K=20 Gaussian autoregressive (AR) models of order one as follows:

P_{\theta}^{(t-1)}=\sum_{k=1}^{K}\tau_{k,t}\Phi_{1}\left[Y_{t}-\mu;\beta_{k,0}+\beta_{k,1}(y_{t-1}-\mu),\sigma_{k}^{2}\right], (15)

with time-varying mixture weights

\tau_{k,t}=\frac{\tau_{k}\phi_{1}\left(y_{t-1}-\mu;\mu_{k},s_{k}^{2}\right)}{\sum_{j=1}^{K}\tau_{j}\phi_{1}\left(y_{t-1}-\mu;\mu_{j},s_{j}^{2}\right)},

where \mu_{k}=\frac{\beta_{k,0}}{1-\beta_{k,1}} and s_{k}^{2}=\frac{\sigma_{k}^{2}}{1-\beta_{k,1}^{2}}. Denoting \tau=(\tau_{1},\dots,\tau_{K})', \beta_{0}=(\beta_{1,0},\dots,\beta_{K,0})', \beta_{1}=(\beta_{1,1},\dots,\beta_{K,1})' and \sigma=(\sigma_{1},\dots,\sigma_{K})', the full vector of unknown parameters is \theta=(\mu,\tau',\beta_{0}',\beta_{1}',\sigma')', which comprises 1+(4\times 20)=81 elements. GVP is a natural and convenient alternative to exact Gibbs prediction in this case.
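The time-varying weights and the implied one-step-ahead predictive density (the pdf corresponding to (15)) can be sketched as follows, for illustrative parameter inputs with |\beta_{k,1}|<1:

```python
import math

def norm_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_weights(y_prev, mu, tau, beta0, beta1, sigma):
    # tau_{k,t}: prior weights re-weighted by each component's stationary
    # density at the previous observation; requires |beta_{k,1}| < 1
    K = len(tau)
    mu_k = [beta0[k] / (1.0 - beta1[k]) for k in range(K)]
    s2_k = [sigma[k] ** 2 / (1.0 - beta1[k] ** 2) for k in range(K)]
    num = [tau[k] * norm_pdf(y_prev - mu, mu_k[k], s2_k[k]) for k in range(K)]
    z = sum(num)
    return [v / z for v in num]

def mixture_density(y, y_prev, mu, tau, beta0, beta1, sigma):
    # one-step-ahead predictive pdf of Y_t at y, given y_{t-1}
    w = mixture_weights(y_prev, mu, tau, beta0, beta1, sigma)
    return sum(
        w[k] * norm_pdf(y - mu, beta0[k] + beta1[k] * (y_prev - mu), sigma[k] ** 2)
        for k in range(len(tau))
    )
```

With a single component (K = 1) the density collapses to the Gaussian AR(1) conditional density, which provides a simple correctness check.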

5.2 Example 2: Bayesian Neural Network Predictive Model

In the second example we consider a true DGP in which the dependent variable Y_{t} has a complex non-linear relationship with a set of covariates. Specifically, the time series process \{Y_{t}\}_{t=1}^{T} is determined by a three-dimensional stochastic process \{X_{t}\}_{t=1}^{T}, with X_{t}=(X_{1,t},X_{2,t},X_{3,t})'. The first two covariates are jointly distributed as (X_{1,t},X_{2,t})'\overset{i.i.d.}{\sim}N(0,\Sigma). The third covariate, X_{3,t}, independent of the former two, follows an AR(4) process, X_{3,t}=\sum_{i=1}^{4}\alpha_{i}X_{3,t-i}+\sigma\varepsilon_{t}, with \varepsilon_{t}\overset{i.i.d.}{\sim}N(0,1). The variable Y_{t} is then given by Y_{t}=X_{t}'\beta_{t}, where \beta_{t}=(\beta_{1,t},\beta_{2,t},\beta_{3,t})' is a three-dimensional vector of time-varying coefficients, with \beta_{i,t}=b_{i}+a_{i}F(X_{3,t}), and F denotes the marginal distribution function of X_{3,t} induced by the AR(4) model.
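A sketch of this DGP follows, with the parameter values of Section 5.3.2 hard-coded. Since the stationary distribution of the AR(4) is not derived in the text, the marginal cdf F is approximated here by a Gaussian cdf with an empirically estimated standard deviation; this approximation is our own simplification.

```python
import math
import random

def simulate_dgp2(T, seed=0):
    """Simulate Y_t = X_t' beta_t for the Example 2 design. The marginal
    cdf F of X_{3,t} is approximated by a Gaussian cdf whose standard
    deviation is estimated from a pre-simulated stretch of the AR(4)."""
    rng = random.Random(seed)
    sig11, sig22, sig12 = 1.0, 1.25, 0.5
    sigma = math.sqrt(0.2)
    al = (0.5, 0.2, 0.15, 0.1)
    a = (1.3, -2.6, -1.5)
    b = (0.0, 1.3, 1.5)

    def draw_x12():
        # bivariate normal via the Cholesky factor of Sigma
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = math.sqrt(sig11) * z1
        x2 = (sig12 / math.sqrt(sig11)) * z1 \
            + math.sqrt(sig22 - sig12 ** 2 / sig11) * z2
        return x1, x2

    # AR(4) covariate with a long burn-in
    x3 = [0.0] * 4
    for _ in range(1000 + T):
        x3.append(sum(al[i] * x3[-1 - i] for i in range(4))
                  + sigma * rng.gauss(0.0, 1.0))
    x3 = x3[4:]                       # drop the zero initial conditions
    sd3 = math.sqrt(sum(v * v for v in x3) / len(x3))
    x3 = x3[-T:]

    y = []
    for t in range(T):
        F = 0.5 * (1.0 + math.erf(x3[t] / (sd3 * math.sqrt(2.0))))
        beta = [b[i] + a[i] * F for i in range(3)]
        x1, x2 = draw_x12()
        y.append(beta[0] * x1 + beta[1] * x2 + beta[2] * x3[t])
    return y, x3
```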

This particular choice of DGP has two advantages. First, given the complex dependence structure of the DGP (i.e. non-linear cross-sectional dependence as well as temporal dependence), it would be difficult, once again, to find a simple parsimonious predictive model that would adequately capture this structure; hence the motivation to adopt a high-dimensional model and to resort to GVP. Second, because the DGP includes several covariates, we can assess how GVP performs for varying information sets. For example, we can establish whether the overall performance of GVP improves with an expansion of the conditioning information set, as we would anticipate, and whether the inclusion of a more complete conditioning set affects the occurrence of strict coherence.

A flexible model that has the potential to at least partially capture several features of the true DGP is a Bayesian feed-forward neural network that takes Y_{t} as the dependent variable and some of its lags, along with observed values of X_{t}, x_{t}=(x_{1,t},x_{2,t},x_{3,t})', as the vector of independent inputs, z_{t} (see for instance Hernández-Lobato and Adams, 2015). The predictive distribution is defined as P_{\theta}^{(t-1)}=\Phi_{1}(Y_{t};g(z_{t};\omega),\sigma_{y}^{2}). The mean function g(z_{t};\omega) denotes a feed-forward neural network with q=2 layers, r=3 nodes in each layer, a p-dimensional input vector z_{t}, and a d-dimensional parameter vector \omega with d=r^{2}(q-1)+r(p+q+1)+1. It allows for a flexible non-linear relationship between Y_{t} and the vector of observed covariates z_{t}. Defining c=\log(\sigma_{y}), the full parameter vector of this predictive class, \theta=(\omega',c)', is of dimension d+1.
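The parameter count d=r^{2}(q-1)+r(p+q+1)+1 and the mean function can be sketched as below; the tanh activation is an illustrative choice, as the activation is not specified here.

```python
import math

def nn_param_count(p, q=2, r=3):
    # d = r^2 (q - 1) + r (p + q + 1) + 1
    return r * r * (q - 1) + r * (p + q + 1) + 1

def nn_mean(z, omega, q=2, r=3):
    """Feed-forward mean g(z; omega): q hidden layers of r nodes each and a
    linear output; omega (of length nn_param_count(len(z), q, r)) is
    consumed sequentially as weights then bias for every node."""
    w = list(omega)

    def take(k):
        out = w[:k]
        del w[:k]
        return out

    h = list(z)
    for _ in range(q):
        new_h = []
        for _ in range(r):
            coefs, bias = take(len(h)), take(1)[0]
            new_h.append(math.tanh(sum(c * x for c, x in zip(coefs, h)) + bias))
        h = new_h
    coefs, bias = take(r), take(1)[0]
    return sum(c * x for c, x in zip(coefs, h)) + bias
```

For p = 1, 2 and 3 the count gives d = 22, 25 and 28, i.e. d + 1 = 23, 26 and 29, matching the model dimensions reported in Section 5.3.2.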

5.3 Simulation Results

5.3.1 Example 1

A time series of length T=2500 for Y_{t} is generated from the LSTAR model in (14), with the true parameters set as: \rho_{1}=0, \rho_{2}=0.9, \gamma=5, c=0, \sigma_{\varepsilon}=1 and \nu=3. With the exception of the CRPS, which cannot be evaluated in closed form for the mixture predictive class and is consequently excluded from the exercise, the same scores as in the toy example are considered. With reference to the simulation steps given earlier, the initial estimation window is 500 observations; hence the predictive accuracy results are based on an evaluation sample of size 2000.

Average out-of-sample score
LS CLS10 CLS20 CLS80 CLS90 IS
U.method
LS -1.253 -0.345 -0.497 -0.452 -0.263 -5.589
CLS10 -1.445 -0.346 -0.512 -0.555 -0.321 -7.279
CLS20 -1.445 -0.344 -0.496 -0.589 -0.349 -6.674
CLS80 -1.333 -0.414 -0.571 -0.450 -0.260 -5.831
CLS90 -1.330 -0.407 -0.564 -0.451 -0.259 -5.730
IS -1.410 -0.401 -0.558 -0.474 -0.282 -5.550
Table 2: Predictive accuracy of GVP using an autoregressive mixture model for data generated from an LSTAR model. The rows refer to the update method (U.method) used. The columns refer to the out-of-sample measure used to compute the average scores. The figures in bold are the largest average scores according to a given out-of-sample measure.

The results in Table 2 are clear-cut. With one exception (that of CLS10), the best out-of-sample performance, according to a given measure, is produced by the version of GVP based on that same scoring rule. That is, the GVP results are almost uniformly strictly coherent: a matching of the update rule with the out-of-sample measure produces the best predictive accuracy in that measure, almost always.

5.3.2 Example 2

In this case, we generate a time series of length T=4000 from the model discussed in Section 5.2, with specifications: \Sigma_{11}=1, \Sigma_{22}=1.25, \Sigma_{12}=\Sigma_{21}=0.5, \sigma^{2}=0.2, \alpha_{1}=0.5, \alpha_{2}=0.2, \alpha_{3}=0.15, \alpha_{4}=0.1, a_{1}=1.3, b_{1}=0, a_{2}=-2.6, b_{2}=1.3, a_{3}=-1.5 and b_{3}=1.5. These settings generate data with a negatively skewed empirical distribution, a non-linear relationship between the observations on Y_{t} and X_{3,t}, and autoregressive behavior in Y_{t}.

To assess whether the performance of GVP is affected by varying information sets, we consider four alternative specifications for the input vector z_{t} in the assumed predictive model. These four specifications (labelled Model 1 to Model 4, respectively) are: z_{t}=y_{t-1}, z_{t}=(y_{t-1},x_{1,t})', z_{t}=(y_{t-1},x_{2,t})' and z_{t}=(y_{t-1},x_{1,t},x_{2,t})'. The dimension of the parameter vector for each of these model specifications is d+1=23, d+1=26, d+1=26 and d+1=29, respectively. Given that the assumed predictive class is Gaussian, all scoring rules used in the seven updates, and for all four model specifications, can be evaluated in closed form. Referencing the simulation steps given earlier, the initial estimation window is 2000 observations; hence the predictive accuracy results are based on an evaluation sample of size 2000, as in the previous example.

Panel A: Model 1 (zt=yt1{z}_{t}=y_{t-1}) Panel B: Model 2 (zt=(yt1,x1,t){z}_{t}=\left(y_{t-1},x_{1,t}\right)^{\prime})
Average out-of-sample score Average out-of-sample score
LS CLS10 CLS20 CLS80 CLS90 CRPS IS LS CLS10 CLS20 CLS80 CLS90 CRPS IS
U.method U.method
LS -1.607 -0.489 -0.761 -0.520 -0.310 -0.659 -6.579 LS -1.451 -0.494 -0.733 -0.424 -0.254 -0.550 -6.059
CLS10 -1.914 -0.460 -0.738 -0.852 -0.617 -0.914 -8.174 CLS10 -1.858 -0.435 -0.686 -0.849 -0.626 -0.866 -7.707
CLS20 -1.766 -0.460 -0.732 -0.704 -0.478 -0.772 -7.448 CLS20 -1.683 -0.438 -0.679 -0.686 -0.476 -0.708 -6.963
CLS80 -1.637 -0.514 -0.795 -0.514 -0.303 -0.678 -6.822 CLS80 -1.551 -0.595 -0.850 -0.403 -0.241 -0.587 -6.878
CLS90 -1.686 -0.534 -0.831 -0.521 -0.307 -0.718 -6.912 CLS90 -1.539 -0.540 -0.811 -0.406 -0.241 -0.605 -6.369
CRPS -1.620 -0.513 -0.783 -0.514 -0.304 -0.660 -6.752 CRPS -1.510 -0.585 -0.822 -0.411 -0.244 -0.553 -6.649
IS -1.601 -0.476 -0.753 -0.516 -0.307 -0.660 -6.338 IS -1.452 -0.473 -0.728 -0.414 -0.248 -0.562 -5.671
Panel C: Model 3 (zt=(yt1,x2,t){z}_{t}=\left(y_{t-1},x_{2,t}\right)^{\prime}) Panel D: Model 4 (zt=(yt1,x1,t,x2,t){z}_{t}=\left(y_{t-1},x_{1,t},x_{2,t}\right)^{\prime})
Average out-of-sample score Average out-of-sample score
LS CLS10 CLS20 CLS80 CLS90 CRPS IS LS CLS10 CLS20 CLS80 CLS90 CRPS IS
U.method U.method
LS -1.536 -0.428 -0.688 -0.521 -0.314 -0.617 -5.975 LS -1.390 -0.483 -0.716 -0.389 -0.229 -0.511 -5.890
CLS10 -1.775 -0.396 -0.658 -0.793 -0.555 -0.805 -6.704 CLS10 -1.758 -0.394 -0.650 -0.784 -0.548 -0.790 -6.659
CLS20 -1.646 -0.397 -0.654 -0.664 -0.439 -0.694 -6.277 CLS20 -1.656 -0.394 -0.621 -0.732 -0.515 -0.712 -6.236
CLS80 -1.698 -0.566 -0.858 -0.520 -0.301 -0.714 -7.282 CLS80 -2.127 -1.049 -1.438 -0.356 -0.196 -0.736 -9.798
CLS90 -1.744 -0.581 -0.889 -0.530 -0.304 -0.758 -7.361 CLS90 -2.157 -0.989 -1.401 -0.379 -0.192 -0.782 -9.545
CRPS -1.535 -0.438 -0.695 -0.517 -0.310 -0.615 -6.044 CRPS -1.519 -0.626 -0.872 -0.380 -0.218 -0.537 -6.922
IS -1.549 -0.438 -0.703 -0.517 -0.310 -0.624 -6.054 IS -1.673 -0.592 -0.934 -0.381 -0.203 -0.710 -5.493
Table 3: Predictive accuracy of GVP using a Bayesian neural network for data generated from the dynamic non-linear regression model discussed in Section 5.2. Panels A to D document results for the four different versions of ztz_{t}, in which varying numbers of input variables are included in the predictive model. The rows in each panel refer to the update method (U.method) used. The columns refer to the out-of-sample measure used to compute the average scores. The figures in bold are the largest average scores according to a given out-of-sample measure.

With reference to the results in Table 3, we can make two observations. First, as anticipated, an increase in the information set produces higher average out-of-sample scores for all updates, with the corresponding values increasing steadily and uniformly as one moves from the results in Panel A (based on z_{t}=y_{t-1}) to those in Panel D (based on the largest information set, z_{t}=(y_{t-1},x_{1,t},x_{2,t})'). Second, however, despite the improved performance of all versions of GVP as the information set is increased to better match that used in the true DGP, strict coherence still prevails. That is, ‘focusing’ on the measure that matters in the construction of the GVP update still reaps benefits, despite the reduction in misspecification.

6 Empirical Application: M4 Competition

The Makridakis 4 (M4) forecasting competition was a forecasting event first organised by the University of Nicosia and the New York University Tandon School of Engineering in 2018. The competition sought submissions of point and interval predictions at different time horizons, for a total of 100,000 time series of varying frequencies. The winner of the competition in a particular category (i.e. point prediction or interval prediction) was the submission that achieved the best average out-of-sample predictive accuracy, according to the measure of accuracy that defined that category, over all horizons and all series. Details of all aspects of the competition can be found via the following link:
https://www.m4.unic.ac.cy/wp-content/uploads/2018/03/M4-Competitors-Guide.pdf

We gauge the success of our method of distributional prediction in terms of the measure used to rank the interval forecasts in the competition, namely the IS. We focus on one-step-ahead prediction of each of the 4227 time series of daily frequency. Each of these series is denoted by \{Y_{i,t}\}, where i=1,\dots,4227 and t=1,\dots,n_{i}, and the task is to construct a predictive interval for Y_{i,n_{i}+1} based on GVP. We adopt the mixture distribution in (15) as the predictive model, it being well-suited to capturing the stylized features of high-frequency data. However, the model is only appropriate for stationary time series, and most of the daily series exhibit non-stationary patterns. To account for this, we model the differenced series \{Z_{i,t}=\Delta^{d_{i}}Y_{i,t}\}, where d_{i} indicates the differencing order. The integer d_{i} is selected by sequentially applying the KPSS test (Kwiatkowski et al., 1992) to \Delta^{j}y_{i,t} for j=0,\dots,d_{i}, with d_{i} being the first difference at which the null hypothesis of stationarity is not rejected. (To do this we employ the autoarima function in the forecast R package.) To construct the predictive distribution of Z_{i,n_{i}+1} we first obtain M=5000 draws \{\theta_{i}^{(m)}\}_{m=1}^{M} from the Gibbs variational posterior q_{\tilde{\lambda}}(\theta_{i}) that is based on the IS. (Once again, a value of w=1 is used in defining the Gibbs posterior and, hence, in producing the posterior approximation.) Draws \{z_{i,n_{i}+1}^{(m)}\}_{m=1}^{M} are then obtained from the corresponding predictive distributions \{P_{\theta_{i}^{(m)}}^{(n_{i}+1)}\}_{m=1}^{M}.
The draws of Z_{i,n_{i}+1} are then transformed into draws of Y_{i,n_{i}+1}, and a predictive distribution for Y_{i,n_{i}+1} is produced using kernel density estimation; assessment is performed in terms of the accuracy of the prediction interval for Y_{i,n_{i}+1}. To account for different units of measurement, the IS of each series is then divided by the constant \frac{1}{n_{i}-1}\sum_{t=2}^{n_{i}}|Y_{i,t}-Y_{i,t-1}|.
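The differencing-order selection and the unit-free scaling of the IS reduce to the following sketch. The stationarity check is injected (e.g. a KPSS test at the 5% level) rather than implemented, so `is_stationary` is a placeholder.

```python
def select_diff_order(y, is_stationary, d_max=2):
    """Difference the series until is_stationary(z) passes (or d_max is
    reached), mirroring the sequential KPSS-based selection of d_i;
    is_stationary is an injected placeholder for the test itself."""
    z = list(y)
    d = 0
    while d < d_max and not is_stationary(z):
        z = [b - a for a, b in zip(z, z[1:])]
        d += 1
    return d, z

def normalised_interval_score(y_insample, score):
    # divide by the mean absolute first difference of the series, to
    # render interval scores comparable across units of measurement
    n = len(y_insample)
    denom = sum(abs(y_insample[t] - y_insample[t - 1])
                for t in range(1, n)) / (n - 1)
    return score / denom
```

For instance, a linear-trend series passes such a check after a single difference, so the selected order is d = 1.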

Table 4 documents the predictive performance of our approach relative to the competing methods (see Makridakis et al., 2020, for details on these methods). The first column corresponds to the average IS. In terms of this measure the winner is the method proposed by Smyl (2020), while our method is ranked 11 out of 12. The ranking is 6 out of 12 in terms of the median IS, recorded in column two. However, it is worth remembering that our method aims to achieve high individual, not aggregate (or average), predictive interval accuracy across series. A more appropriate way of judging the effectiveness of our approach is thus to count the number of series for which each method performs best. This number is reported in column three of the table. In terms of this measure, with a total of 858 series, our approach is outperformed only by the method proposed in Doornik et al. (2020). In contrast, although Smyl is the best method in terms of mean IS, it is best at predicting only 139 of the series in the dataset. It is also important to note that most of the competing approaches have more complex assumed predictive classes than our mixture model. Even the simpler predictive classes, like the ARIMA and ETS models, involve model selection steps that allow them to capture richer Markovian processes than the mixture model, which, by construction, is based on an autoregressive process of order one.

In summary, despite a single specification being defined for the predictive class for all 4227 daily series, the process of driving the Bayesian update via the IS rule has still yielded the best predictive results in a very high number of cases, with GVP beaten in this sense by only one other competitor. Whilst the predictive model clearly matters, designing a bespoke updating mechanism to produce the predictive distribution is shown to still reap substantial benefits.

Out-of-sample results
Mean IS Median IS Series
Method
Doornik et al. -9.36 -4.58 2060
Mixture -16.66 -5.85 858
Trotta -11.41 -7.19 376
ARIMA -10.07 -5.67 278
Fiorucci and Louzada -9.20 -5.73 163
Smyl -8.73 -6.65 139
Roubinchtein -13.08 -7.35 97
Ibrahim -38.81 -6.33 88
ETS -9.13 -5.71 67
Petropoulos and Svetunkov -9.77 -5.68 59
Montero-Manso, et al. -10.24 -6.90 23
Segura-Heras, et al. -15.29 -8.48 19
Table 4: One-step-ahead predictive performance for the 4,227 daily series from the M4 competition. The columns ‘Mean IS’ and ‘Median IS’ respectively record the mean and median values of the average out-of-sample IS, across all 4,227 series, for each competing method/team. The column ‘Series’ reports the number of series for which the method/team produces the largest IS value. Some of the competing methods are individually described in Doornik et al. (2020), Fiorucci and Louzada (2020), Smyl (2020), Petropoulos and Svetunkov (2020), and Montero-Manso et al. (2020). For the remaining methods see Makridakis et al. (2020). Our GVP method based on the mixture model is referred to as ‘Mixture’.

7 Discussion

We have developed a new approach for conducting loss-based Bayesian prediction in high-dimensional models. Based on a variational approximation to the Gibbs posterior defined, in turn, by the predictive loss that is germane to the problem at hand, the method is shown to produce predictions that minimize that loss out-of-sample. ‘Loss’ is characterized in this paper by positively-oriented proper scoring rules designed to reward the accuracy of predictive probability density functions for a continuous random variable. Hence, loss minimization translates to maximization of an expected score. However, in principle, any loss function in which predictive accuracy plays a role could be used to define the Gibbs posterior. Critically, we have proven theoretically, and illustrated numerically, that for a large enough sample there is no loss incurred in predictive accuracy as a result of approximating the Gibbs posterior.

In comparison with the standard approach based on a likelihood-based Bayesian posterior, our Gibbs variational predictive approach is ultimately aimed at generating accurate predictions in the realistic empirical setting where the predictive model and, hence, the likelihood function is misspecified. Gibbs variational prediction enables the investigator to break free from the shackles of likelihood-based prediction, and to drive predictive outcomes according to the form of predictive accuracy that matters for the problem at hand; and all with theoretical validity guaranteed.

We have focused in the paper on particular examples where the model used to construct the Gibbs variational predictive is observation-driven. Extensions to parameter-driven models (i.e., hidden Markov models, or state space models) require different approaches with regard to both the implementation of variational Bayes, and in establishing the asymptotic properties of the resulting posteriors and predictives. This is currently the subject of on-going work by the authors, Frazier et al. (2021), and we reference Tran et al. (2017) and Quiroz et al. (2018) for additional discussion of this problem.

This paper develops the theory of Gibbs variational prediction for classes of models that are general and flexible enough to accommodate a wide range of data generating processes, and which, under the chosen loss function, are smooth enough to permit a quadratic expansion. This latter condition restricts the classes of models, and loss functions, under which our results are applicable. For example, the discrete class of models studied in Douc et al. (2013) may not be smooth enough to deduce the validity of such an expansion.

While our approach to prediction focuses on accuracy in a given forecasting rule, which is related to the notion of sharpness in the probabilistic forecasting literature (see Gneiting et al., 2007), it does not pay attention to the calibration of such predictive densities. That is, there is no guarantee that the resulting predictive densities have valid frequentist coverage properties. If calibration is a desired property, it is possible to augment our approach to prediction with the approach recently proposed by Wu and Martin (2021), which calibrates generalized predictives. The amalgamation of these two procedures should produce accurate predictions that are well-calibrated in large samples. We leave an exploration of this possibility to future research.


Acknowledgments

We would like to thank various participants at: the ‘ABC in Svalbard’ Workshop, April 2021, the International Symposium of Forecasting, the Alan Turing Institute, and RIKEN reading group, June 2021, the European Symposium of Bayesian Econometrics, September 2021, and the Vienna University of Business, December 2021, for helpful comments on earlier drafts of the paper. This research has been supported by Australian Research Council (ARC) Discovery Grants DP170100729 and DP200101414. Frazier was also supported by ARC Early Career Researcher Award DE200101070.

A Theoretical Results

In this section, we state several intermediate results and prove Theorems 1 and 2 stated in Section 3.2 of the main paper.

We maintain the following notation throughout the remainder of the paper. We let d:\Theta\times\Theta\rightarrow\mathbb{R}_{\geq 0} denote a generic metric on \Theta\subseteq\mathbb{R}^{d_{n}}, and we take \|\cdot\| and \langle\cdot,\cdot\rangle to be the Euclidean norm and inner product for vectors. In the following section, we let C denote a generic constant, independent of n, that can vary from line to line. For sequences x_{n},y_{n} and some C we write x_{n}\lesssim y_{n} when x_{n}\leq Cy_{n}. If x_{n}\lesssim y_{n} and y_{n}\lesssim x_{n} we write x_{n}\asymp y_{n}. For some positive sequence \delta_{n}, the notations o_{p}(\delta_{n}) and O_{p}(\delta_{n}) have their usual meanings. Unless otherwise stated, all limits are taken as n\rightarrow\infty, so that when no confusion will result we use \lim_{n} to denote \lim_{n\rightarrow\infty}. The random variable Y_{n} takes values in \mathcal{Y} for all n\geq 1, and, for a given n, the vector of observations y_{1:n} is generated from the true probability measure P_{0}, which is not necessarily in \mathcal{P}^{(n)}, and whose dependence on n we suppress to avoid confusion with the predictive model P_{\theta}^{(n)}.

A.1 Assumptions and Discussion

For some positive sequence ϵn0\epsilon_{n}\rightarrow 0, define Θ(ϵn):={θΘ:d(θ,θ)ϵn}\Theta(\epsilon_{n}):=\{\theta\in\Theta:d(\theta,\theta_{\star})\leq\epsilon_{n}\} as an nn-dependent neighbourhood of θ\theta_{\star} and let Θc(ϵn)\Theta^{c}(\epsilon_{n}) denote its complement.

Assumption 1

(i) There exists a non-random function 𝒮(θ)\mathcal{S}(\theta) such that supθΘ|Sn(θ)/n𝒮(θ)|0\sup_{\theta\in\Theta}|S_{n}(\theta)/n-\mathcal{S}(\theta)|\rightarrow 0 in probability. (ii) For any ϵϵn\epsilon\geq\epsilon_{n}, there exists some C>0C>0 such that supΘc(ϵ){𝒮(θ)𝒮(θ)}Cϵ2\sup_{\Theta^{c}(\epsilon)}\left\{\mathcal{S}(\theta)-\mathcal{S}(\theta_{\star})\right\}\leq-C\epsilon^{2}.

Remark 3

Assumption 1 places regularity conditions on the sample and limit counterpart of the expected scoring rules, 𝒮(θ)\mathcal{S}(\theta), and is used to deduce the asymptotic behavior of the Gibbs posterior. The first part of the assumption is a standard uniform law of large numbers, and the second part amounts to an identification condition for θ\theta_{\star}.

We note here that the temporal dependence of the observations, and the use of potentially heavy-tailed data, mean that existing results on the concentration of Gibbs posteriors, and/or their variational approximations, of which we are aware may not be applicable in our setting; see, e.g., Zhang and Gao (2020), Alquier and Ridgway (2020) and Yang et al. (2020).

Instead, we use smoothness of Sn(θ)S_{n}(\theta) (and the underlying model Pθ(n)P_{\theta}^{(n)}) to deduce a posterior concentration result.

Assumption 2

There exists a sequence of d_{n}\times d_{n}-dimensional matrices \Sigma_{n}, and a d_{n}\times 1 random vector \Delta_{n}, such that:

  1. (i)

    Sn(θ)Sn(θ)=n(θθ),Δn/n12Σn1/2n(θθ)2+Rn(θ).S_{n}({\theta})-S_{n}({\theta}_{\star})=\langle\sqrt{n}\left({\theta}-{\ \theta_{\star}}\right),{\Delta}_{n}/\sqrt{n}\rangle-\frac{1}{2}\|\Sigma_{n}^{1/2}\sqrt{n}\left({\theta}-{\theta}_{\star}\right)\|^{2}+R_{n}({\theta}).

  2. (ii)

    For any ϵ>ϵn\epsilon>\epsilon_{n}, any M>0M>0, with M/n0M/\sqrt{n}\rightarrow 0, limsupP0[supd(θ,θ)M/n|Rn(θ)|1+nθθ2>ϵ]=0.\lim\sup P_{0}\left[\sup_{d(\theta,\theta_{\star})\leq M/\sqrt{n}}\frac{\left|R_{n}(\theta)\right|}{1+n\|\theta-\theta_{\star}\|^{2}}>\epsilon\right]=0.

  3. (iii)

    Σn1/2Δn=Op(1)\left\|\Sigma_{n}^{-1/2}\Delta_{n}\right\|=O_{p}(1).
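Assumption 2 is, in essence, a quadratic (local asymptotic normality type) expansion of the sample criterion. The sketch below checks it numerically for an illustrative scale model under the log score, y_t ~ N(0, e^θ), with Δ_n the criterion gradient and Σ_n the negative (per-observation) limit Hessian at the pseudo-true value; the model, constants, and fixed perturbations h are all assumptions made for this toy check, which conveys why the remainder is negligible relative to the quadratic term, rather than verifying 2(ii) verbatim.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
y = rng.normal(0.0, 1.5, size=n)              # data with true variance 2.25

# Log-score criterion S_n(theta) for the toy scale model y_t ~ N(0, exp(theta))
def S_n(theta):
    return np.sum(-0.5 * np.log(2.0 * np.pi) - 0.5 * theta
                  - 0.5 * y ** 2 * np.exp(-theta))

theta_star = np.log(1.5 ** 2)                 # pseudo-true (population) optimiser
Delta_n = -0.5 * n + 0.5 * np.sum(y ** 2) * np.exp(-theta_star)  # dS_n(theta_star)/dtheta
Sigma_n = 0.5                                 # negative limit Hessian at theta_star

# Remainder R_n(theta) of the expansion in Assumption 2(i), normalised as in 2(ii)
ratios = []
for h in (0.2, 0.1, 0.05):
    theta = theta_star + h
    quad = h * Delta_n - 0.5 * Sigma_n * n * h ** 2
    R = S_n(theta) - S_n(theta_star) - quad
    ratios.append(abs(R) / (1.0 + n * h ** 2))
    print(h, ratios[-1])                      # small relative to the quadratic term
```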

We require the following tail control condition on the prior.

Assumption 3

For any ϵ>ϵn\epsilon>\epsilon_{n}, log{Π[Θ(ϵ)]}nϵ2\log\left\{\Pi\left[\Theta(\epsilon)\right]\right\}\gtrsim-n\epsilon^{2}.

Remark 4

The above assumptions ultimately preclude cases where Θ\Theta can be partitioned into global parameters that drive the model, which are fixed for all nn, and local parameters that represent time-dependent latent variables in the model and grow in a one-for-one manner with the sample size. As such, these assumptions are not designed to be applicable to variational inference in classes such as hidden Markov models.

To validate the usefulness and broad applicability of Assumptions 1-2, we prove that these assumptions are satisfied for the GARCH(1,1) model used for the illustrative exercise in Section 4 in the case of the logarithmic scoring rule, S(y,f)=logf(y)S(y,f)=\log f(y), under weak moment assumptions on the observed data.

Lemma 1

Recall the GARCH(1,1) model presented in Section 4, and assume, WLOG, that 𝔼[Yt]=0\mathbb{E}[Y_{t}]=0. Define the rescaled variable Zt=Yt/σtZ_{t}=Y_{t}/\sigma_{t\star}, where σt\sigma_{t\star} denotes the GARCH process evaluated at θ=θ\theta=\theta_{\star}. Assume that ZtZ_{t} satisfies the following conditions:

  1. 1.

    ZtZ_{t} is strictly stationary and ergodic.

  2. 2.

Z_{t}^{2} is a non-degenerate random variable.

  3. 3.

For some \delta>0, there exists a C<\infty such that \mathbb{E}[Z_{t}^{4}|\mathcal{F}_{t-1}]\leq C<\infty.

  4. 4.

The parameter space \Theta is compact, with the elements \theta_{2},\theta_{3},\theta_{4} bounded away from zero, the elements \theta_{3},\theta_{4} bounded above by one, and \theta_{3}+\theta_{4}<1. The pseudo-true value \theta_{\star} is in the interior of \Theta.

If the above are satisfied, then Assumptions (1) and (2) are satisfied for Sn(θ)=t=2nlogpθ(yt|t1)S_{n}(\theta)=\sum_{t=2}^{n}\log p_{\theta}(y_{t}|\mathcal{F}_{t-1}).
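To make the objects in Lemma 1 concrete, the following sketch evaluates the log-score criterion S_n(\theta)=\sum_{t=2}^{n}\log p_{\theta}(y_{t}|\mathcal{F}_{t-1}) for one common GARCH(1,1) parameterisation with Gaussian one-step-ahead predictives; the parameter layout (mean, intercept, ARCH and GARCH coefficients) and the variance initialisation are illustrative assumptions that may differ in detail from the model of Section 4.

```python
import numpy as np

def garch_log_score(theta, y):
    """S_n(theta) = sum_{t=2}^n log p_theta(y_t | F_{t-1}) under Gaussian
    one-step-ahead predictives, for the assumed recursion
    sigma2_t = omega + alpha * (y_{t-1} - mu)**2 + beta * sigma2_{t-1}."""
    mu, omega, alpha, beta = theta
    eps = y - mu
    sigma2 = np.empty_like(y)
    sigma2[0] = np.var(y)                       # illustrative initialisation
    for t in range(1, len(y)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    ll = -0.5 * (np.log(2.0 * np.pi) + np.log(sigma2[1:]) + eps[1:] ** 2 / sigma2[1:])
    return ll.sum()

# simulate a GARCH(1,1) path and evaluate the criterion at the true parameter
rng = np.random.default_rng(2)
n, (mu0, om0, al0, be0) = 2000, (0.0, 0.1, 0.1, 0.8)
y, s2 = np.empty(n), om0 / (1.0 - al0 - be0)    # start at the stationary variance
for t in range(n):
    y[t] = mu0 + np.sqrt(s2) * rng.standard_normal()
    s2 = om0 + al0 * (y[t] - mu0) ** 2 + be0 * s2
print(garch_log_score((mu0, om0, al0, be0), y))
```

For such simulated data the criterion is larger at the data-generating parameter than at distant parameter values, consistent with the identification condition in Assumption 1(ii).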

A.2 Preliminary Result: Posterior Concentration

Before proving the results stated in Section 3.2, we first give two results regarding the posterior concentration of the exact Gibbs posterior and its variational approximation.

Lemma 2

Under Assumptions 1-3, for \epsilon_{n}\rightarrow 0 with n\epsilon_{n}^{2}\rightarrow\infty, any M_{n}\rightarrow\infty, and w_{n}\in(\underline{w},\overline{w}) with probability one, where 0<\underline{w}<\overline{w}<\infty,

Πwn(d(θ,θ)>Mnϵn|y1:n)exp(CMn2ϵn2n),\Pi_{w_{n}}\left(d(\theta,\theta_{\star})>M_{n}\epsilon_{n}|y_{1:n}\right)\lesssim\exp\left(-CM_{n}^{2}\epsilon_{n}^{2}n\right),

with probability converging to one.

Remark 5

The proof of Lemma 2 requires controlling the behavior of \sup_{d(\theta,\theta_{\star})\geq\epsilon}n^{-1}\left\{S_{n}(\theta)-S_{n}(\theta_{\star})\right\} for any \epsilon>0. When the parameter space is compact, this control can be guaranteed by controlling the bracketing entropy (see, e.g., van der Vaart and Wellner, 1996) of \Theta. If the parameter space is not compact, this control can be achieved by imposing more stringent conditions on the prior than Assumption 3. In particular, results for a non-compact parameter space can be obtained by modifying Theorem 4 in Shen and Wasserman (2001) for our setting. That being said, we note that in the case of the toy GARCH(1,1) model in Section 4, the parameter space is compact and this condition is satisfied under the conditions given in Lemma 1.

To transfer the posterior concentration of the Gibbs posterior to its variational approximation, we restrict the class 𝒬\mathcal{Q} used to produce the variational approximation. Following Zhang and Gao (2020), we define κn\kappa_{n} to be the approximation error rate for the variational family:

κn2=1n𝔼P0infQ𝒬D{QΠw(|y1:n)}.\kappa_{n}^{2}=\frac{1}{n}\mathbb{E}_{P_{0}}\inf_{Q\in\mathcal{Q}}\text{D}\left\{Q\|\Pi_{w}(\cdot|y_{1:n})\right\}.
Lemma 3

Under the Assumptions of Lemma 2, for any sequence MnM_{n}\rightarrow\infty,

Q^{d(θ,θ)Mnn(ϵn2+κn2)|y1:n}0\widehat{Q}\left\{d(\theta,\theta_{\star})\geq M_{n}n\left(\epsilon^{2}_{n}+\kappa^{2}_{n}\right)|y_{1:n}\right\}\rightarrow 0

in probability.

Remark 6

While several authors, such as Alquier et al. (2016), Alquier and Ridgway (2020), and Yang et al. (2020), have demonstrated results similar to Lemma 3, existing results generally hinge on the satisfaction of a sub-Gaussian concentration inequality, such as a Hoeffding or Bernstein-type inequality, for the risk function S_{n}(\theta) used to produce the Gibbs posterior. Given that our goal is prediction in possibly non-Markovian, non-stationary, and/or complex models, and with a loss function based on a general scoring rule, it is not clear that such results remain applicable. Therefore, we deviate from the earlier referenced approaches and instead draw inspiration from Miller (2021), using an approach based on a quadratic expansion of the loss function. The latter conditions can be shown to be valid in a wide variety of models used to produce predictions in commonly encountered time series settings, including models with non-Markovian features. Indeed, as Lemma 1 demonstrates, Assumptions 1 and 2 are satisfied in cases where the scoring rule only exhibits a polynomial tail, e.g., in the case of the GARCH(1,1) model, and with data that is non-Markovian.

Remark 7

Lemma 3 gives a similar result to Theorem 2.1 in Zhang and Gao (2020) but for the Gibbs variational posterior and demonstrates that the latter concentrates onto θ\theta_{\star}, with the rate of concentration given by the slower of ϵn\epsilon_{n} and κn\kappa_{n}. We note that, in cases where the dimension of Θ\Theta grows with the sample size, the rate ϵn\epsilon_{n} is ultimately affected by the rate of growth of Θ\Theta and care must be taken to account for this dependence.

Remark 8

Verification of Lemma 3 in any practical example requires choosing the variational family 𝒬\mathcal{Q}. In the case of observation-driven time series models, where the predictive model Pθ(n)P_{\theta}^{(n)} can be constructed analytically, and no explicit treatment of ‘local’ parameters is needed, the variational family can be taken to be a sufficiently rich parametric class. For example, consider the case where the dimension of Θ\Theta is fixed and compact, and assume that Pθ(n)P_{\theta}^{(n)} and Sn(θ)S_{n}(\theta) satisfy the Assumptions of Lemma 2. Then, we can consider as our variational family the mean-field class:

𝒬MF:={Q:q(θ)=i=1dθqi(θi)},\mathcal{Q}_{MF}:=\left\{Q:q(\theta)=\prod_{i=1}^{d_{\theta}}q_{i}(\theta_{i})\right\},

or the Gaussian family

\mathcal{Q}_{G}:=\{\lambda=(\mu^{\prime},\text{vech}(\Sigma)^{\prime})^{\prime}:q(\theta)\propto\mathcal{N}(\theta;\mu,\Sigma)\},

where \mu\in\mathbb{R}^{d}, and \Sigma is a d\times d positive-definite matrix. In either case, the approximation results of Zhang and Gao (2020) and Yang et al. (2020) for these variational families can be used to obtain the rate \kappa_{n}. In particular, when d is fixed these results demonstrate that \kappa_{n}\lesssim\epsilon_{n}, and the Gibbs variational posterior converges at the same rate as the Gibbs posterior. (The rate of concentration obtained by applying Lemma 2 in such an example will be \epsilon_{n}=\log(n)/\sqrt{n}, which is slightly slower than the optimal parametric rate \epsilon_{n}=1/\sqrt{n}. The parametric rate can be achieved by suitably modifying the prior condition in Assumption 3 to cater for the parametric nature of the model. However, given that this is not germane to the main point of the paper, we do not analyze such situations here.)
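To make the use of the Gaussian family concrete, the following sketch fits \mathcal{Q}_{G} to the Gibbs posterior of a toy one-dimensional model (i.i.d. N(\theta,1) under the log score, an assumed N(0, 10) prior, and w = 1) by stochastic-gradient ascent on the ELBO, using the reparameterisation \theta=\mu+e^{\rho}z; all tuning constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(1.0, 1.0, size=200)
w, prior_var = 1.0, 10.0

def grad_log_target(theta):
    # d/dtheta of w * S_n(theta) + log prior, with S_n the Gaussian log score
    return w * np.sum(y - theta) - theta / prior_var

# Gaussian family q(theta) = N(mu, exp(2*rho)); maximise the ELBO
#   E_q[w S_n(theta) + log pi(theta)] + entropy(q)
# by stochastic gradients with the reparameterisation theta = mu + exp(rho) * z.
mu, rho, lr = 0.0, np.log(0.5), 1e-3
for _ in range(5000):
    z = rng.standard_normal()
    sd = np.exp(rho)
    g = grad_log_target(mu + sd * z)
    mu += lr * g                     # d ELBO / d mu
    rho += lr * (g * sd * z + 1.0)   # d ELBO / d rho (the +1 is the entropy term)

print(mu, np.exp(rho))
```

In this conjugate toy case the Gibbs posterior is itself Gaussian, so \mathcal{Q}_{G} contains it exactly and the approximation error \kappa_{n} vanishes; the sketch simply recovers its mean (near the sample mean) and its standard deviation (near 1/\sqrt{wn+1/\text{prior\_var}}).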

As seen from Lemma 3, the variational posterior places increasing mass on the value θ\theta_{\star} that maximizes the limit of the expected scoring rule, 𝒮(θ)\mathcal{S}(\theta). Lemma 3 does not rely on the existence of exponential testing sequences as in much of the Bayesian nonparametrics literature (see, e.g., the treatment in Ghosal and Van der Vaart, 2017). The need to consider an alternative approach is due to the loss-based, i.e., non-likelihood-based, nature of the Gibbs posterior. Instead, the arguments used to demonstrate posterior concentration rely on controlling the behavior of a suitable quadratic approximation to the criterion function in a neighborhood of θ\theta_{\star}.

The result of Lemma 3 allows us to demonstrate that the predictions made using the GVP, P_{Q}^{(n)}, are just as accurate as those obtained by an agent that uses the ‘optimal [frequentist] predictive’ P_{\star}^{(n)} defined in equation (8); this is the key message of Theorem 2, which demonstrates that the GVP yields predictions that are just as accurate as the exact Gibbs predictive (in large samples).

A.3 Proofs of Lemmas

Proof of Lemma 1. We break the proof up by showing the results for Assumptions 1 and 2 separately.


Assumption 1. We remark that Assumptions 1-4 in Lemma 1 are sufficient for conditions (A.1) and (A.2) of Lee and Hansen (1994). To establish Assumption 1(i), we note that Lemma 7(2) in Lee and Hansen (1994) implies that S_{n}(\theta)/n\rightarrow_{p}\mathcal{S}(\theta) for each \theta\in\Theta. Furthermore, since the expected value of \partial\log p_{\theta}(y_{t}|\mathcal{F}_{t-1})/\partial\theta is bounded for each t, for each \theta\in\Theta, by Lemma 8(2) in Lee and Hansen (1994), it follows that S_{n}(\theta)/n\rightarrow_{p}\mathcal{S}(\theta) uniformly over \Theta. Assumption 1(ii) can then be established by using the fact that \mathcal{S}(\theta) is twice continuously differentiable for \theta near \theta_{\star}, with negative-definite second derivative matrix \partial^{2}\mathcal{S}(\theta_{\star})/\partial\theta\partial\theta^{\prime}, by Lemma 11(3) of Lee and Hansen (1994), which yields

\mathcal{S}(\theta)-\mathcal{S}(\theta_{\star})\leq-C\|\theta-\theta_{\star}\|^{2},

for some C>0C>0. Hence, Assumption 1(ii) follows.


Assumption 2. The result again follows from the results of Lee and Hansen (1994). In particular, in this case S_{n}(\theta)=\sum_{t=1}^{n}\log p_{\theta}(y_{t+1}|\mathcal{F}_{t}) is twice continuously differentiable, so that a direct quadratic approximation is valid at \theta=\theta_{\star}. Then, taking \Delta_{n}=\partial S_{n}(\theta_{\star})/\partial\theta and \Sigma_{n}=-\partial^{2}\mathcal{S}(\theta_{\star})/\partial\theta\partial\theta^{\prime} yields the expansion, where the remainder term is given by

Rn(θ)\displaystyle R_{n}(\theta) =n(θθ)[2𝒮(θ)/θθ2𝒮(θ¯)/θθ](θθ)\displaystyle=n(\theta-\theta_{\star})^{\prime}[\partial^{2}\mathcal{S}(\theta_{\star})/\partial\theta\partial\theta^{\prime}-\partial^{2}\mathcal{S}(\bar{\theta})/\partial\theta\partial\theta^{\prime}](\theta-\theta_{\star})
n(θθ)[n12Sn(θ¯)/θθ2𝒮(θ¯)/θθ](θθ),\displaystyle-n(\theta-\theta_{\star})^{\prime}[n^{-1}\partial^{2}S_{n}(\bar{\theta})/\partial\theta\partial\theta^{\prime}-\partial^{2}\mathcal{S}(\bar{\theta})/\partial\theta\partial\theta^{\prime}](\theta-\theta_{\star}),

and θ¯\bar{\theta} an intermediate value such that θθ¯θθ\|\theta-\bar{\theta}\|\leq\|\theta-\theta_{\star}\|.

To verify (ii), we note that 2𝒮(θ¯)/θθ\partial^{2}\mathcal{S}(\bar{\theta})/\partial\theta\partial\theta^{\prime} exists and is continuous by Lemma 11(3) of Lee and Hansen (1994), so that the first term is o(1)o(1) on d(θ,θ)M/nd(\theta,\theta_{\star})\leq M/\sqrt{n}. Further, Lemma 11(3) of Lee and Hansen (1994) demonstrates that

supθΘ[n12Sn(θ¯)/θθ2𝒮(θ¯)/θθ]=op(1),\sup_{\theta\in\Theta}\|[n^{-1}\partial^{2}S_{n}(\bar{\theta})/\partial\theta\partial\theta^{\prime}-\partial^{2}\mathcal{S}(\bar{\theta})/\partial\theta\partial\theta^{\prime}]\|=o_{p}(1),

so that the second term is also o(1)o(1) on d(θ,θ)M/nd(\theta,\theta_{\star})\leq M/\sqrt{n}.

Lastly, (iii) is verified since \Sigma_{n}=-\partial^{2}\mathcal{S}(\theta_{\star})/\partial\theta\partial\theta^{\prime}, and since, by Lemma 9(2) of Lee and Hansen (1994), \Delta_{n}/\sqrt{n}\Rightarrow N(0,\mathbb{E}\left[\partial\log p_{\theta_{\star}}(y_{t}|\mathcal{F}_{t-1})/\partial\theta\,\partial\log p_{\theta_{\star}}(y_{t}|\mathcal{F}_{t-1})/\partial\theta^{\prime}\right]).

 

Proof of Lemma 2. The posterior density is \Pi_{w_{n}}\left(\theta|y_{1:n}\right)\propto\mathrm{d}\Pi(\theta)\exp\left[w_{n}S_{n}(\theta)\right], where we recall that, by hypothesis, w_{n}\in(\underline{w},\overline{w})\subset(0,\infty) with probability one. Define the quasi-likelihood ratio

Zn(θ):=exp[wn{Sn(θ)Sn(θ)12ΔnΣn1Δn}],Z_{n}(\theta):=\exp\left[w_{n}\left\{{S}_{n}\left(\theta\right)-{S}_{n}(\theta_{\star})-\frac{1}{2}\Delta_{n}^{\prime}\Sigma^{-1}_{n}\Delta_{n}\right\}\right],

where, by Assumption 2 (iii), Σn1/2Δn=Op(1)\Sigma_{n}^{-1/2}\Delta_{n}=O_{p}(1). For M>0M>0, the posterior probability over An:={θ:d(θ,θ)>Mϵn}A_{n}:=\{\theta:d(\theta,\theta_{\star})>M\epsilon_{n}\} is

Πwn(An|y1:n)=AndΠ(θ)Zn(θ)ΘdΠ(θ)Zn(θ)=Nn(An)Dn,\displaystyle\Pi_{w_{n}}(A_{n}|y_{1:n})=\frac{\int_{A_{n}}\mathrm{d}\Pi\left(\theta\right)Z_{n}(\theta)}{\int_{\Theta}\mathrm{d}\Pi\left(\theta\right)Z_{n}(\theta)}=\frac{N_{n}(A_{n})}{D_{n}},

where

Nn(An)=AndΠ(θ)Zn(θ),Dn=ΘdΠ(θ)Zn(θ).{N_{n}(A_{n})}=\int_{A_{n}}\mathrm{d}\Pi\left(\theta\right)Z_{n}(\theta),\quad{D_{n}}={\int_{\Theta}\mathrm{d}\Pi\left(\theta\right)Z_{n}(\theta)}.

The result now follows by lower bounding D_{n} and upper bounding N_{n}(A_{n}).

Part 1: DnD_{n} Term. Define Tn:=θ+Σn1ΔnT_{n}:=\theta_{\star}+\Sigma_{n}^{-1}\Delta_{n}, and Gn:={θΘ:12(θTn)[Σn/n](θTn)tn}G_{n}:=\{\theta\in\Theta:\frac{1}{2}(\theta-T_{n})^{\prime}\left[\Sigma_{n}/n\right](\theta-T_{n})\leq t_{n}\}. We show that, for tn0t_{n}\rightarrow 0 , and tnϵn2t_{n}\leq\epsilon_{n}^{2}, wpc1

Dn=ΘdΠ(θ)Zn(θ)>12Π(Gn)e2w¯ntn.D_{n}=\int_{\Theta}\mathrm{d}\Pi(\theta)Z_{n}(\theta)>\frac{1}{2}\Pi(G_{n})e^{-2\overline{w}nt_{n}}.

Proof of Part 1: To lower bound D_{n} we follow an approach similar to that of Lemma 1 in Shen and Wasserman (2001) (see also Syring and Martin, 2020). Over G_{n}, use Assumption 2(i) to rewrite \log\{Z_{n}(\theta)\} as

log{Zn(θ)}\displaystyle\log\{Z_{n}(\theta)\} =log[exp{wn[Sn(θ)Sn(θ)12ΔnΣn1Δn]}]\displaystyle=\log\left[\exp\left\{w_{n}\left[S_{n}(\theta)-S_{n}(\theta_{\star})-\frac{1}{2}\Delta_{n}^{\prime}\Sigma^{-1}_{n}\Delta_{n}\right]\right\}\right]
\displaystyle=-\frac{w_{n}n}{2}(\theta-T_{n})^{\prime}\left[\Sigma_{n}/n\right](\theta-T_{n})+w_{n}R_{n}(\theta). (16)

Define the following sets: 𝒞n:={(θ,y1:n):|Rn(θ)|ntn}\mathcal{C}_{n}:=\{(\theta,y_{1:n}):|R_{n}(\theta)|\geq nt_{n}\}, 𝒞n(θ):={y1:n:(θ,y1:n)𝒞n}\mathcal{C}_{n}(\theta):=\{y_{1:n}:(\theta,y_{1:n})\in\mathcal{C}_{n}\}, and 𝒞n(y1:n):={θ:(θ,y1:n)𝒞n}\mathcal{C}_{n}(y_{1:n}):=\{\theta:(\theta,y_{1:n})\in\mathcal{C}_{n}\}. On the set Gn𝒞nc(y1:n)G_{n}\cap\mathcal{C}^{c}_{n}(y_{1:n}), Zn(θ)Z_{n}(\theta) is bounded in probability, and we can bound DnD_{n} as follows:

\displaystyle D_{n}\geq\int_{G_{n}\cap\mathcal{C}^{c}_{n}(y_{1:n})}\exp\left\{-\frac{w_{n}n}{2}(\theta-T_{n})^{\prime}[\Sigma_{n}/n](\theta-T_{n})+w_{n}R_{n}(\theta)\right\}\mathrm{d}\Pi(\theta)
exp(2ω¯ntn)Π{Gn𝒞nc(y1:n)}\displaystyle\geq\exp\left(-2\overline{\omega}nt_{n}\right)\Pi\left\{G_{n}\cap\mathcal{C}^{c}_{n}(y_{1:n})\right\}
[Π(Gn)Π{Gn𝒞n(y1:n)}]exp(2ω¯ntn).\displaystyle\geq\left[\Pi(G_{n})-\Pi\left\{G_{n}\cap\mathcal{C}_{n}(y_{1:n})\right\}\right]\exp\left(-2\overline{\omega}nt_{n}\right). (17)

From the bound in (17), the remainder follows almost identically to Lemma 1 in Shen and Wasserman (2001). In particular,

𝔼P0{Π[Gn𝒞n(y1:n)]}\displaystyle\mathbb{E}_{P_{0}}\left\{\Pi\left[G_{n}\cap\mathcal{C}_{n}(y_{1:n})\right]\right\} =𝒴Θ𝕀{Gn𝒞n(y1:n)}dΠ(θ)dP0(y1:n)\displaystyle=\int_{\mathcal{Y}}\int_{\Theta}\mathbb{I}\{G_{n}\cap\mathcal{C}_{n}(y_{1:n})\}\mathrm{d}\Pi(\theta)\mathrm{d}P_{0}(y_{1:n})
=𝒴Θ𝕀{Gn}𝕀{𝒞n(y1:n)}dΠ(θ)dP0(y1:n)\displaystyle=\int_{\mathcal{Y}}\int_{\Theta}\mathbb{I}\{G_{n}\}\mathbb{I}\{\mathcal{C}_{n}(y_{1:n})\}\mathrm{d}\Pi(\theta)\mathrm{d}P_{0}(y_{1:n})
1ntnΠ(Gn),\displaystyle\leq\frac{1}{nt_{n}}\Pi(G_{n}),

where the last line follows from Markov’s inequality and the definition of \mathcal{C}_{n}(y_{1:n}). Lastly, we bound the probability that the lower bound on D_{n} fails:

P0{Dn12Π(Gn)e2w¯ntn}\displaystyle P_{0}\left\{D_{n}\leq\frac{1}{2}\Pi(G_{n})e^{-2\overline{w}nt_{n}}\right\} P0(e2w¯ntn[Π(Gn)Π{Gn𝒞n(y1:n)}]12Π(Gn)e2w¯ntn)\displaystyle\leq P_{0}\left(e^{-2\overline{w}nt_{n}}\left[\Pi(G_{n})-\Pi\left\{G_{n}\cap\mathcal{C}_{n}(y_{1:n})\right\}\right]\leq\frac{1}{2}\Pi(G_{n})e^{-2\overline{w}nt_{n}}\right)
=P0[Π{Gn𝒞n(y1:n)}12Π(Gn)]\displaystyle=P_{0}\left[\Pi\{G_{n}\cap\mathcal{C}_{n}(y_{1:n})\}\geq\frac{1}{2}\Pi(G_{n})\right]
\displaystyle\leq 2\,\mathbb{E}_{P_{0}}\Pi\left\{G_{n}\cap\mathcal{C}_{n}(y_{1:n})\right\}/\Pi(G_{n})
2ntn.\displaystyle\leq\frac{2}{nt_{n}}.

Hence, P0{Dn12Π(Gn)e2w¯ntn}12(ntn)1P_{0}\left\{D_{n}\geq\frac{1}{2}\Pi(G_{n})e^{-2\overline{w}nt_{n}}\right\}\geq 1-2(nt_{n})^{-1} and for ntnnt_{n}\rightarrow\infty, we have that

Dn12Π(Gn)e2w¯ntn\displaystyle D_{n}\geq\frac{1}{2}\Pi(G_{n})e^{-2\overline{w}nt_{n}} (18)

except on sets of P0P_{0}-probability converging to zero.  

Then, for GnG_{n} as in that result, for a sequence tn0t_{n}\rightarrow 0 with tnϵn2t_{n}\leq\epsilon_{n}^{2}, by Assumption 3 (wpc1)

Dn12Π(Gn)e2ntnwneC1ntne2ntnwne(C1+2w¯)ntn\displaystyle D_{n}\geq\frac{1}{2}\Pi(G_{n})e^{-2nt_{n}w_{n}}\gtrsim e^{-C_{1}nt_{n}}e^{-2nt_{n}w_{n}}\gtrsim e^{-(C_{1}+2\overline{w})nt_{n}} (19)

Part 2: N_{n}(A_{n}) Term. We show that P_{0}^{(n)}[N_{n}(A_{n})>e^{-M\overline{w}n\epsilon^{2}_{n}}]\leq e^{-C_{2}M\overline{w}nt_{n}} for n large enough.


Proof of Part 2: To bound Nn(An)=AndΠ(θ)Zn(θ)N_{n}(A_{n})=\int_{A_{n}}\mathrm{d}\Pi(\theta)Z_{n}(\theta) from above, use Assumption 1 to deduce that, for any positive ϵn\epsilon_{n},

\displaystyle\sup_{d(\theta,\theta_{\star})\geq\epsilon_{n}}n^{-1}\left\{S_{n}(\theta)-S_{n}(\theta_{\star})\right\}\leq 2\sup_{\theta\in\Theta}|S_{n}(\theta)/n-\mathcal{S}(\theta)|+\sup_{d(\theta,\theta_{\star})\geq\epsilon_{n}}\{\mathcal{S}(\theta)-\mathcal{S}(\theta_{\star})\}
\displaystyle\leq o_{p}(1)-C_{2}\epsilon^{2}_{n}. (20)

For some M>0M>0, we decompose Nn(An)N_{n}(A_{n}):

Nn(An)=Nn(AnΘn(M))+Nn(AnΘnc(M))=Nn(1)+Nn(2),N_{n}(A_{n})=N_{n}(A_{n}\cap\Theta_{n}(M))+N_{n}(A_{n}\cap\Theta_{n}^{c}(M))=N_{n}^{(1)}+N_{n}^{(2)},

where Θn(M)\Theta_{n}(M) is a compact set of θ\theta by construction and Θnc(M)\Theta_{n}^{c}(M) is its complement.

Then, Nn(1)eC2Mw¯nϵn2N_{n}^{(1)}\leq e^{-C_{2}M\overline{w}n\epsilon^{2}_{n}} from (20) due to the compactness of Θn(M)\Theta_{n}(M). For Nn(2)N_{n}^{(2)}, for a sequence tn0t_{n}\rightarrow 0 with tnϵn2t_{n}\leq\epsilon_{n}^{2},

P0(Nn(2)>eMw¯ntn/2)\displaystyle P_{0}(N_{n}^{(2)}>e^{-M\overline{w}nt_{n}/2})
\displaystyle\leq eMw¯ntn/2𝒴AnΘnc(M)exp{wnn2(θTn)[Σn/n](θTn)Rn(θ)}dΠ(θ)dP0(y1:n)\displaystyle e^{M\overline{w}nt_{n}/2}\int_{\mathcal{Y}}\int_{A_{n}\cap\Theta_{n}^{c}(M)}\exp\left\{-\frac{w_{n}n}{2}(\theta-T_{n})^{\prime}[\Sigma_{n}/n](\theta-T_{n})-R_{n}(\theta)\right\}\mathrm{d}\Pi(\theta)\mathrm{d}P_{0}(y_{1:n})
\displaystyle\leq eMω¯ntn/2Π{AnΘnc(M)}eMC2w¯ntn\displaystyle e^{M\overline{\omega}nt_{n}/2}\Pi\left\{A_{n}\cap\Theta_{n}^{c}(M)\right\}\leq e^{-MC_{2}\overline{w}nt_{n}} (21)

The first inequality follows from Markov’s inequality and the definition of Z_{n}(\theta), and the second and third inequalities are due to Assumptions 2 and 3, respectively.  

Using (19), we have the posterior bound

Πwn(An|y1:n)Nn(An)e(C1+2w¯)ntn\displaystyle\Pi_{w_{n}}(A_{n}|y_{1:n})\lesssim N_{n}(A_{n})e^{(C_{1}+2\overline{w})nt_{n}}

Combining Nn(1)eC2Mw¯nϵn2N_{n}^{(1)}\leq e^{-C_{2}M\overline{w}n\epsilon^{2}_{n}}, from (20), and equation (21), we can deduce that (wpc1)

Πwn(An|y1:n)Nn(An)e(C1+2w¯)ntne{MC2w¯C12w¯}nϵn2,\displaystyle\Pi_{w_{n}}(A_{n}|y_{1:n})\lesssim N_{n}(A_{n})e^{(C_{1}+2\overline{w})nt_{n}}\lesssim{e^{-\left\{MC_{2}\overline{w}-C_{1}-2\overline{w}\right\}n\epsilon^{2}_{n}}},

since t_{n}\leq\epsilon_{n}^{2}. The right-hand side vanishes for M large enough, and the result follows.

 

Proof of Lemma 3. The proof begins with an approach similar to that of Corollary 2.1 of Zhang and Gao (2020), but deviates in places given the non-likelihood-based nature of our problem. We also deviate from Zhang and Gao (2020) in using \mathbb{E}_{P_{0}}[f] to denote the expectation of f(\cdot) taken under P_{0}, rather than the notation P_{0}[f].

First, we note that Lemma B.2 in the supplement to Zhang and Gao (2020) (reproduced for convenience as Lemma 4 in Appendix A.5) can be directly applied in this setting: for any a>0a>0 and n1n\geq 1, given observations y1:ny_{1:n},

aQ^{d(θ,θ)}D[Q^Πwn(|y1:n)]+logΠwn(exp{ad(θ,θ)}|y1:n).a\widehat{Q}\left\{d(\theta,\theta_{\star})\right\}\leq\text{D}\left[\widehat{Q}\|\Pi_{w_{n}}\left(\cdot|y_{1:n}\right)\right]+\log\Pi_{w_{n}}\left(\exp\left\{ad(\theta,\theta_{\star})\right\}|y_{1:n}\right).

Similarly, for any Q𝒬Q\in\mathcal{Q}, D[QΠwn(|y1:n)]+op(1)D[Q^Πwn(|y1:n)]\text{D}\left[{Q}\|\Pi_{w_{n}}(\cdot|y_{1:n})\right]+o_{p}(1)\geq\text{D}\left[\widehat{Q}\|\Pi_{w_{n}}(\cdot|y_{1:n})\right] by construction, so that

aQ^{d(θ,θ)}infQ𝒬D[QΠwn(|y1:n)]+logΠwn(exp{ad(θ,θ)}|y1:n).a\widehat{Q}\left\{d(\theta,\theta_{\star})\right\}\leq\inf_{Q\in\mathcal{Q}}\text{D}\left[{Q}\|\Pi_{w_{n}}\left(\cdot|y_{1:n}\right)\right]+\log\Pi_{w_{n}}\left(\exp\left\{ad(\theta,\theta_{\star})\right\}|y_{1:n}\right). (22)

Taking expectations on both sides of (22), and re-arranging terms, yields

𝔼P0Q^{d(θ,θ)}\displaystyle\mathbb{E}_{P_{0}}\widehat{Q}\left\{d(\theta,\theta_{\star})\right\} 1a𝔼P0{infQ𝒬D[QΠwn(|y1:n)]+logΠwn(exp{ad(θ,θ)}|y1:n)}\displaystyle\leq\frac{1}{a}\mathbb{E}_{P_{0}}\left\{\inf_{Q\in\mathcal{Q}}\text{D}\left[{Q}\|\Pi_{w_{n}}\left(\cdot|y_{1:n}\right)\right]+\log\Pi_{w_{n}}\left(\exp\left\{ad(\theta,\theta_{\star})\right\}|y_{1:n}\right)\right\}
1anκn2+1alog[𝔼P0Πwn(exp{ad(θ,θ)}|y1:n)],\displaystyle\leq\frac{1}{a}n\kappa^{2}_{n}+\frac{1}{a}\log\left[\mathbb{E}_{P_{0}}\Pi_{w_{n}}\left(\exp\left\{ad(\theta,\theta_{\star})\right\}|y_{1:n}\right)\right], (23)

where the second inequality follows from the definition of κn2\kappa_{n}^{2}, and Jensen’s inequality.

The second term in equation (23) is bounded by applying Lemma 2. In particular, for all αα0>0\alpha\geq\alpha_{0}>0 and any 0<a12c10<a\leq\frac{1}{2}c_{1}, c1>0c_{1}>0, by Lemma 2 (wpc1),

\Pi_{w_{n}}\left(\exp\left\{ad(\theta,\theta_{\star})\right\}>\alpha|y_{1:n}\right)\lesssim\exp\left(-ac_{1}\alpha\right).

Then, appealing to Lemma B.4 in the supplement to Zhang and Gao (2020) (reproduced for convenience as Lemma 5 in Appendix A.5) we obtain

𝔼P0Πwn(exp{ad(θ,θ)}|y1:n)exp(ac1α0),\mathbb{E}_{P_{0}}\Pi_{w_{n}}\left(\exp\left\{ad(\theta,\theta_{\star})\right\}|y_{1:n}\right)\lesssim\exp\left(ac_{1}\alpha_{0}\right),

for all amin{c1,1}a\leq\min\{c_{1},1\}. Taking a=min{c1,1}a=\min\{c_{1},1\} and α0=nϵn2\alpha_{0}=n\epsilon_{n}^{2}, and applying the above in equation (23), yields, for some M>0M>0,

𝔼P0Q^{d(θ,θ)}\displaystyle\mathbb{E}_{P_{0}}\widehat{Q}\left\{d(\theta,\theta_{\star})\right\} nκn2+log(c1+eac1nϵn2)anκn2+nϵn2+o(1)n(κn2+ϵn2)+o(1).\displaystyle\leq\frac{n\kappa_{n}^{2}+\log\left(c_{1}+e^{ac_{1}n\epsilon_{n}^{2}}\right)}{a}\lesssim{n\kappa_{n}^{2}}+n\epsilon_{n}^{2}+o(1)\lesssim n\left(\kappa_{n}^{2}+\epsilon_{n}^{2}\right)+o(1).

The stated result then follows from Markov’s inequality,

𝔼P0Q^(d(θ,θ)>Mnn(ϵn2+κn2))𝔼P0Q^{d(θ,θ)}Mnn(ϵn2+κn2)Mn10.\mathbb{E}_{P_{0}}\widehat{Q}\left(d(\theta,\theta_{\star})>M_{n}n\left(\epsilon_{n}^{2}+\kappa_{n}^{2}\right)\right)\leq\frac{\mathbb{E}_{P_{0}}\widehat{Q}\left\{d(\theta,\theta_{\star})\right\}}{M_{n}n\left(\epsilon_{n}^{2}+\kappa_{n}^{2}\right)}\lesssim M_{n}^{-1}\rightarrow 0.

 

A.4 Proofs of Theorems in Section 3.2

Using Lemmas 2 and 3, we can now prove Theorems 1 and 2 stated in the main text. A simple proof of Theorem 1 can be given by first proving Theorem 2. Therefore, we first prove Theorem 2.


Proof of Theorem 2. For probability measures PP and QQ, let dH(P,Q)d_{H}(P,Q) denote the Hellinger distance between PP and QQ. Consider a positive sequence ϵn>0\epsilon_{n}>0, and define the set A(ϵn):={Pθ(n)𝒫,θΘ:dH2(Pθ(n),P(n))ϵn}A(\epsilon_{n}):=\{P^{(n)}_{\theta}\in\mathcal{P},\;\theta\in\Theta:d_{H}^{2}(P^{(n)}_{\theta},P_{\star}^{(n)})\geq\epsilon_{n}\}. By convexity of dH2(,)d_{H}^{2}(\cdot,\cdot) in the first argument and Jensen’s inequality,

dH2(PQ(n),P(n))\displaystyle d^{2}_{H}(P^{(n)}_{Q},P^{(n)}_{\star}) ΘdH2(Pθ(n),P(n))dQ^(θ)\displaystyle\leq\int_{\Theta}d_{H}^{2}(P^{(n)}_{\theta},P^{(n)}_{\star})\mathrm{d}\widehat{Q}(\theta)
=Ac(ϵn2)dH2(Pθ(n),P(n))dQ^(θ)+A(ϵn2)dH2(Pθ(n),P(n))dQ^(θ)\displaystyle=\int_{A^{c}(\epsilon^{2}_{n})}d_{H}^{2}(P^{(n)}_{\theta},P^{(n)}_{\star})\mathrm{d}\widehat{Q}(\theta)+\int_{A(\epsilon^{2}_{n})}d_{H}^{2}(P^{(n)}_{\theta},P^{(n)}_{\star})\mathrm{d}\widehat{Q}(\theta)
ϵn2+2Q^{A(ϵn2)}.\displaystyle\leq\epsilon^{2}_{n}+\sqrt{2}\widehat{Q}\{A(\epsilon^{2}_{n})\}.

By Lemma 3, for any ϵn\epsilon_{n} such that nϵn2n\epsilon_{n}^{2}\rightarrow\infty, Q^{A(ϵn)}=op(1)\widehat{Q}\{A(\epsilon_{n})\}=o_{p}(1), so that dH2(PQ(n),P(n))ϵn2d^{2}_{H}(P^{(n)}_{Q},P^{(n)}_{\star})\leq\epsilon^{2}_{n} (wpc1). Using the above and the definition dH(PQ(n),P(n))=dH2(PQ(n),P(n)),d_{H}(P^{(n)}_{Q},P^{(n)}_{\star})=\sqrt{d^{2}_{H}(P^{(n)}_{Q},P^{(n)}_{\star})}, we can conclude that, with probability converging to one,

dH(PQ(n),P(n))ϵn.d_{H}(P^{(n)}_{Q},P^{(n)}_{\star})\leq\epsilon_{n}.

This yields the first stated result. Applying the relationship between total variation and Hellinger distance yields the second: d_{TV}(P^{(n)}_{Q},P^{(n)}_{\star})\leq\sqrt{2}d_{H}(P^{(n)}_{Q},P^{(n)}_{\star}).  

Proof of Theorem 1. A similar argument to the proof of Theorem 2 yields

dH2(PΠwn(n),P(n))\displaystyle d^{2}_{H}(P^{(n)}_{\Pi_{w_{n}}},P^{(n)}_{\star}) ΘdH2(Pθ(n),P(n))dΠwn(θ|𝐲)ϵn2+2Πwn{A(ϵn2)|𝐲}ϵn2,\displaystyle\leq\int_{\Theta}d_{H}^{2}(P^{(n)}_{\theta},P^{(n)}_{\star})\mathrm{d}\Pi_{w_{n}}(\theta|\mathbf{y})\leq\epsilon^{2}_{n}+\sqrt{2}\Pi_{w_{n}}\{A(\epsilon^{2}_{n})|\mathbf{y}\}\leq\epsilon^{2}_{n},

where the last inequality follows from the convergence in Lemma 2 and holds wpc1. Applying the triangle inequality, and the relationship between the Hellinger distance and its square, we see that (wpc1)

\begin{aligned}
d_{H}(P^{(n)}_{Q},P^{(n)}_{\Pi_{w_{n}}})&\leq d_{H}(P^{(n)}_{Q},P^{(n)}_{\star})+d_{H}(P^{(n)}_{\Pi_{w_{n}}},P^{(n)}_{\star})\\
&=\sqrt{d^{2}_{H}(P^{(n)}_{Q},P^{(n)}_{\star})}+\sqrt{d^{2}_{H}(P^{(n)}_{\Pi_{w_{n}}},P^{(n)}_{\star})}\\
&\leq 2\epsilon_{n}.
\end{aligned}

The above holds for any $\epsilon_{n}\rightarrow 0$ with $n\epsilon_{n}^{2}\rightarrow\infty$, and we can conclude that $d_{TV}(P^{(n)}_{Q},P^{(n)}_{\Pi_{w_{n}}})\leq\sqrt{2}d_{H}(P^{(n)}_{Q},P^{(n)}_{\Pi_{w_{n}}})=o_{p}(1)$.

 

A.5 Additional Lemmas

The following lemmas from Zhang and Gao (2020) are used in the proof of Lemma 3, and are reproduced here, without proof, to aid the reader. We again use $\mathbb{E}_{P_{0}}$ to denote expectations taken under $P_{0}$.

Lemma 4

For Q^\widehat{Q} in Definition 1, we have

𝔼P0Q^L(Pθ(n),P0(n))\displaystyle\mathbb{E}_{P_{0}}\widehat{Q}L(P_{\theta}^{(n)},P_{0}^{(n)})
\displaystyle\leq infa>01a(infQ𝒮𝔼P0D(QΠ(|X(n)))+log𝔼P0Π(exp(aL(Pθ(n),P0(n)))|X(n))).\displaystyle\inf_{a>0}\frac{1}{a}\left(\inf_{Q\in\mathcal{S}}\mathbb{E}_{P_{0}}D(Q\|\Pi(\cdot|X^{(n)}))+\log\mathbb{E}_{P_{0}}\Pi(\exp(aL(P_{\theta}^{(n)},P_{0}^{(n)}))|X^{(n)})\right).
Lemma 5

Suppose the random variable XX satisfies

(Xt)c1exp(c2t),\mathbb{P}(X\geq t)\leq c_{1}\exp(-c_{2}t),

for all tt0>0t\geq t_{0}>0. Then, for any 0<a12c20<a\leq\frac{1}{2}c_{2},

𝔼exp(aX)exp(at0)+c1.\mathbb{E}\exp(aX)\leq\exp(at_{0})+c_{1}.
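For completeness, a short verification of Lemma 5 (our sketch; Zhang and Gao, 2020, state the result without proof) writes the expectation as a tail integral:

```latex
\mathbb{E}\exp(aX)=\int_{0}^{\infty}\mathbb{P}\left(\exp(aX)\geq s\right)\mathrm{d}s
\leq\exp(at_{0})+\int_{\exp(at_{0})}^{\infty}c_{1}s^{-c_{2}/a}\,\mathrm{d}s
=\exp(at_{0})+\frac{c_{1}\exp\left(at_{0}(1-c_{2}/a)\right)}{c_{2}/a-1},
```

where the inequality bounds the probability by one for $s\leq\exp(at_{0})$, and uses $\mathbb{P}(X\geq\log s/a)\leq c_{1}s^{-c_{2}/a}$ for $s\geq\exp(at_{0})$. Since $0<a\leq\frac{1}{2}c_{2}$ implies $c_{2}/a-1\geq 1$, and $at_{0}(1-c_{2}/a)\leq 0$, the last term is at most $c_{1}$.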

References

  • Alquier (2021) Pierre Alquier. User-friendly introduction to PAC-Bayes bounds. arXiv preprint arXiv:2110.11216, 2021.
  • Alquier and Ridgway (2020) Pierre Alquier and James Ridgway. Concentration of tempered posteriors and of their variational approximations. Annals of Statistics, 48(3):1475–1497, 2020.
  • Alquier et al. (2016) Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of variational approximations of Gibbs posteriors. The Journal of Machine Learning Research, 17(1):8374–8414, 2016.
  • Antoniano-Villalobos and Walker (2016) Isadora Antoniano-Villalobos and Stephen G Walker. A nonparametric model for stationary time series. Journal of Time Series Analysis, 37(1):126–142, 2016.
  • Bassetti et al. (2018) Federico Bassetti, Roberto Casarin, and Francesco Ravazzolo. Bayesian nonparametric calibration and combination of predictive distributions. Journal of the American Statistical Association, 113(522):675–685, 2018.
  • Baştürk et al. (2019) N Baştürk, A. Borowska, S. Grassi, L. Hoogerheide, and H.K. van Dijk. Forecast density combinations of dynamic models and data driven portfolio strategies. Journal of Econometrics, 210(1):170–186, 2019.
  • Billio et al. (2013) M Billio, R Casarin, F Ravazzolo, and H.K. van Dijk. Time-varying combinations of predictive densities using nonlinear filtering. Journal of Econometrics, 177(2):213–232, 2013.
  • Bissiri et al. (2016) Pier Giovanni Bissiri, Chris C Holmes, and Stephen G Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016.
  • Blackwell and Dubins (1962) David Blackwell and Lester Dubins. Merging of opinions with increasing information. The Annals of Mathematical Statistics, 33(3):882–886, 1962.
  • Blei et al. (2017) David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • Casarin et al. (2015) Roberto Casarin, Fabrizio Leisen, German Molina, and Enrique ter Horst. A Bayesian beta Markov random field calibration of the term structure of implied risk neutral densities. Bayesian Analysis, 10(4):791–819, 2015.
  • Chernozhukov and Hong (2003) Victor Chernozhukov and Han Hong. An MCMC approach to classical estimation. Journal of Econometrics, 115(2):293–346, 2003.
  • Diks et al. (2011) Cees Diks, Valentyn Panchenko, and Dick Van Dijk. Likelihood-based scoring rules for comparing density forecasts in tails. Journal of Econometrics, 163(2):215–230, 2011.
  • Doornik et al. (2020) Jurgen A Doornik, Jennifer L Castle, and David F Hendry. Card forecasts for M4. International Journal of Forecasting, 36(1):129–134, 2020.
  • Douc et al. (2013) Randal Douc, Paul Doukhan, and Eric Moulines. Ergodicity of observation-driven time series models and consistency of the maximum likelihood estimator. Stochastic Processes and their Applications, 123(7):2620–2647, 2013.
  • Fiorucci and Louzada (2020) Jose Augusto Fiorucci and Francisco Louzada. Groec: Combination method via generalized rolling origin evaluation. International Journal of Forecasting, 36(1):105–109, 2020.
  • Frazier et al. (2019) David T Frazier, Worapree Maneesoonthorn, Gael M Martin, and Brendan PM McCabe. Approximate Bayesian forecasting. International Journal of Forecasting, 35(2):521–539, 2019.
  • Frazier et al. (2021) David T Frazier, Ruben Loaiza-Maya, and Gael M Martin. A note on the accuracy of variational Bayes in state space models: Inference and prediction. arXiv preprint arXiv:2106.12262, 2021.
  • Ghosal and Van der Vaart (2017) Subhashis Ghosal and Aad Van der Vaart. Fundamentals of Nonparametric Bayesian Inference, volume 44. Cambridge University Press, 2017.
  • Giummolè et al. (2017) Federica Giummolè, Valentina Mameli, Erlis Ruli, and Laura Ventura. Objective Bayesian inference with proper scoring rules. TEST, pages 1–28, 2017.
  • Gneiting and Raftery (2007) Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
  • Gneiting et al. (2007) Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268, 2007.
  • Hernández-Lobato and Adams (2015) José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869. PMLR, 2015.
  • Holmes and Walker (2017) CC Holmes and SG Walker. Assigning a value to a power likelihood in a general Bayesian model. Biometrika, 104(2):497–503, 2017.
  • Jiang and Tanner (2008) Wenxin Jiang and Martin A. Tanner. Gibbs posterior for variable selection in high-dimensional classification and data-mining. Annals of Statistics, 36(5):2207–2231, 2008.
  • Kingma and Welling (2019) Diederik P Kingma and Max Welling. An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691, 2019.
  • Knoblauch et al. (2019) Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Generalized variational inference: Three arguments for deriving new posteriors. arXiv preprint arXiv:1904.02063, 2019.
  • Kwiatkowski et al. (1992) Denis Kwiatkowski, Peter C.B. Phillips, Peter Schmidt, and Yongcheol Shin. Testing the null hypothesis of stationarity against the alternative of a unit root: How sure are we that economic time series have a unit root? Journal of Econometrics, 54(1):159–178, 1992.
  • Lee and Hansen (1994) Sang-Won Lee and Bruce E Hansen. Asymptotic theory for the GARCH(1, 1) quasi-maximum likelihood estimator. Econometric Theory, 10(1):29–52, 1994.
  • Loaiza-Maya et al. (2020) Ruben Loaiza-Maya, Gael M Martin, and David T Frazier. Focused Bayesian prediction. Forthcoming. Journal of Applied Econometrics, 2020.
  • Lyddon et al. (2019) SP Lyddon, CC Holmes, and SG Walker. General Bayesian updating and the loss-likelihood bootstrap. Biometrika, 106(2):465–478, 2019.
  • Makridakis et al. (2020) Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The M4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1):54–74, 2020.
  • Martin et al. (2022) Gael M Martin, Rubén Loaiza-Maya, Worapree Maneesoonthorn, David T Frazier, and Andrés Ramírez-Hassan. Optimal probabilistic forecasts: When do they work? International Journal of Forecasting, 38(1):384–406, 2022.
  • McAlinn and West (2019) Kenichiro McAlinn and Mike West. Dynamic Bayesian predictive synthesis in time series forecasting. Journal of Econometrics, 210(1):155–169, 2019.
  • McAlinn et al. (2020) Kenichiro McAlinn, Knut Are Aastveit, Jouchi Nakajima, and Mike West. Multivariate Bayesian predictive synthesis in macroeconomic forecasting. Journal of the American Statistical Association, 115(531):1092–1110, 2020.
  • Miller (2021) Jeffrey W Miller. Asymptotic normality, concentration, and coverage of generalized posteriors. Journal of Machine Learning Research, 22(168):1–53, 2021.
  • Miller and Dunson (2019) Jeffrey W Miller and David B Dunson. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527):1113–1125, 2019.
  • Montero-Manso et al. (2020) Pablo Montero-Manso, George Athanasopoulos, Rob J Hyndman, and Thiyanga S Talagala. FFORMA: Feature-based forecast model averaging. International Journal of Forecasting, 36(1):86–92, 2020.
  • Pacchiardi and Dutta (2021) Lorenzo Pacchiardi and Ritabrata Dutta. Generalized Bayesian likelihood-free inference using scoring rules estimators. arXiv preprint arXiv:2104.03889, 2021.
  • Petropoulos and Svetunkov (2020) Fotios Petropoulos and Ivan Svetunkov. A simple combination of univariate models. International Journal of Forecasting, 36(1):110–115, 2020.
  • Pettenuzzo and Ravazzolo (2016) Davide Pettenuzzo and Francesco Ravazzolo. Optimal portfolio choice under decision-based model combinations. Journal of Applied Econometrics, 31(7):1312–1332, 2016.
  • Quiroz et al. (2018) Matias Quiroz, David J Nott, and Robert Kohn. Gaussian variational approximation for high-dimensional state space models. arXiv preprint arXiv:1801.07873, 2018.
  • Shen and Wasserman (2001) Xiaotong Shen and Larry Wasserman. Rates of convergence of posterior distributions. The Annals of Statistics, 29(3):687–714, 2001.
  • Smyl (2020) Slawek Smyl. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting, 36(1):75–85, 2020.
  • Syring and Martin (2019) Nicholas Syring and Ryan Martin. Calibrating general posterior credible regions. Biometrika, 106(2):479–486, 2019.
  • Syring and Martin (2020) Nicholas Syring and Ryan Martin. Gibbs posterior concentration rates under sub-exponential type losses. arXiv preprint arXiv:2012.04505, 2020.
  • Teräsvirta (1994) Timo Teräsvirta. Specification, estimation, and evaluation of smooth transition autoregressive models. Journal of the American Statistical Association, 89(425):208–218, 1994.
  • Tran et al. (2017) Minh-Ngoc Tran, David J Nott, and Robert Kohn. Variational Bayes with intractable likelihood. Journal of Computational and Graphical Statistics, 26(4):873–882, 2017.
  • van der Vaart and Wellner (1996) AW van der Vaart and Jon Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics, 1996.
  • Wu and Martin (2021) Pei-Shien Wu and Ryan Martin. Calibrating generalized predictive distributions. arXiv preprint arXiv:2107.01688, 2021.
  • Yang et al. (2020) Yun Yang, Debdeep Pati, Anirban Bhattacharya, et al. Alpha-variational inference with statistical guarantees. Annals of Statistics, 48(2):886–905, 2020.
  • Zhang and Gao (2020) Fengshuo Zhang and Chao Gao. Convergence rates of variational posterior distributions. Annals of Statistics, 48(4):2180–2207, 2020.
  • Zhang (2006a) Tong Zhang. From ε-entropy to KL-entropy: analysis of minimum information complexity density estimation. Annals of Statistics, 34:2180–2210, 2006a.
  • Zhang (2006b) Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006b.

Supplementary material for “Loss-Based Variational Bayes Prediction”

S1 Details for the Implementation of all Variational Approximations

S1.1 General Notational Matters

Throughout this appendix we will employ the following notation. All gradients are column vectors and their notation starts with the symbol \nabla. For two generic matrices Ad1×d2A_{d_{1}\times d_{2}} and Bd3×d4B_{d_{3}\times d_{4}} we have that

AB=vec(A)vec(B),\frac{\partial A}{\partial B}=\frac{\partial\text{vec}(A)}{\partial\text{vec}(B)},

where vec is the vectorization operation and AB\frac{\partial A}{\partial B} is a matrix of dimension (d1d2)×(d3d4)(d_{1}d_{2})\times(d_{3}d_{4}). Throughout this appendix scalars are treated as matrices of dimension 1×11\times 1.

The variational approximation is the mean-field Gaussian $q_{\lambda}(\theta)=N(\theta;\mu,D^{2})$, with $D=\text{diag}(d)$ and $\lambda=(\mu^{\prime},d^{\prime})^{\prime}$. The gradient of the log density of the approximation is computed as

θlogqλ(θ)=(D2)1(θμ).\nabla_{\theta}\log q_{\lambda}\left({\theta}\right)=-\left(D^{2}\right)^{-1}\left({\theta}-{\mu}\right).

To compute θλ\frac{\partial{\theta}}{\partial{\lambda}} note that

θλ=[θμ,θd],\frac{\partial{\theta}}{\partial{\lambda}}=\left[\frac{\partial{\theta}}{\partial{\mu}},\frac{\partial{\theta}}{\partial{d}}\right],

where

\frac{\partial\theta}{\partial\mu}=I_{r},\qquad\frac{\partial\theta}{\partial d}=\text{diag}(\varepsilon).
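As a concrete check of these expressions, the following sketch (ours, not part of the paper's code; it assumes the mean-field Gaussian family $q_{\lambda}=N(\mu,D^{2})$ with $D=\text{diag}(d)$, and all names are of our choosing) evaluates $\nabla_{\theta}\log q_{\lambda}(\theta)$ and $\partial\theta/\partial\lambda$:

```python
import numpy as np

def grad_log_q(theta, mu, d):
    """Gradient of log N(theta; mu, D^2) with D = diag(d): -(D^2)^{-1}(theta - mu)."""
    return -(theta - mu) / d**2

def dtheta_dlambda(eps, r):
    """Jacobian of theta = mu + d * eps with respect to lambda = (mu', d')':
    the r x 2r matrix [I_r, diag(eps)]."""
    return np.hstack([np.eye(r), np.diag(eps)])

# a single draw from q_lambda via the reparameterization theta = mu + d * eps
rng = np.random.default_rng(0)
r = 4
mu, d = np.zeros(r), np.ones(r)
eps = rng.standard_normal(r)
theta = mu + d * eps
```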

S2 Stochastic Gradient Ascent

The optimization problem in (13) is solved via stochastic gradient ascent (SGA). SGA maximizes $\mathcal{L}(\lambda)$ by first initializing the variational parameters at some vector of values $\lambda^{(0)}$, and then sequentially iterating over

\lambda^{(i+1)}=\lambda^{(i)}+\Delta\lambda^{(i+1)}\left[\widehat{\nabla_{\lambda}\mathcal{L}\left(\lambda^{(i)}\right)},\rho\right].

The step size $\Delta\lambda^{(i+1)}$ is a function of an unbiased estimate of the ELBO gradient, $\widehat{\nabla_{\lambda}\mathcal{L}(\lambda^{(i)})}$, and of the set of tuning parameters that control the learning rates in the optimization problem, denoted by $\rho$. Throughout this paper we employ the ADADELTA method of Zeiler (2012) to update $\Delta\lambda^{(i+1)}$.
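A minimal sketch of the ADADELTA update (ours, following Zeiler, 2012; here the decay rate `rho` and perturbation `eps` play the role of the tuning parameters $\rho$):

```python
import numpy as np

class Adadelta:
    """ADADELTA step size for stochastic gradient ascent (Zeiler, 2012)."""

    def __init__(self, dim, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.Eg2 = np.zeros(dim)   # running average of squared gradients
        self.Edx2 = np.zeros(dim)  # running average of squared steps

    def step(self, grad):
        """Return the increment added to lambda, given an unbiased gradient estimate."""
        self.Eg2 = self.rho * self.Eg2 + (1 - self.rho) * grad**2
        dx = np.sqrt(self.Edx2 + self.eps) / np.sqrt(self.Eg2 + self.eps) * grad
        self.Edx2 = self.rho * self.Edx2 + (1 - self.rho) * dx**2
        return dx
```

so that, when ascending, the iteration reads `lam = lam + opt.step(grad_hat)`.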

Key to achieving fast maximization of (λ)\mathcal{L}(\lambda) via SGA is the use of a variance-reduction technique in producing the unbiased gradient estimate λ(λ(i))^\widehat{\nabla_{\lambda}\mathcal{L}\left({\lambda}^{(i)}\right)}. Here, we follow Kingma and Welling (2014) and Rezende et al. (2014), and make use of the ‘reparameterization trick’. In this approach, a draw θ{\theta} from qλq_{\lambda} is written as a direct function of λ{\lambda}, and a set of random variables ε{\varepsilon} that are invariant with respect to λ{\lambda}. For the mean-field approximation used for this example we can write θ(λ,ε)=μ+dε{\theta}\left({\lambda},{\varepsilon}\right)=\mu+{d}\circ{\varepsilon}, with εN(04,I4){\varepsilon}\sim N\left({0}_{4},I_{4}\right). This reparametrization allows us to re-write (λ)\mathcal{L}(\lambda) as

(λ)=𝔼ε[wSnj(θ(λ,ε))+logπ(θ(λ,ε))logqλ(θ(λ,ε))]\mathcal{L}(\lambda)=\mathbb{E}_{{\varepsilon}}\left[wS_{n}^{j}\left({\ \theta}\left({\lambda},{\varepsilon}\right)\right)+\log\pi\left({\ \theta}\left({\lambda},{\varepsilon}\right)\right)-\log q_{\lambda}\left({\theta}\left({\lambda},{\varepsilon}\right)\right)\right] (S2.1)

and its gradient as

λ(λ)=𝔼ε(θ(λ,ε)λ[wθSnj(θ(λ,ε))+θlogπ(θ(λ,ε))θlogqλ(θ(λ,ε))]).\nabla_{\lambda}\mathcal{L}(\lambda)=\mathbb{E}_{{\varepsilon}}\left(\frac{\partial{\theta}\left({\lambda},{\varepsilon}\right)}{\partial{\ \lambda}}^{\prime}\left[w\nabla_{\theta}S_{n}^{j}\left({\theta}\left({\lambda},{\varepsilon}\right)\right)+\nabla_{\theta}\log\pi\left({\ \theta}\left({\lambda},{\varepsilon}\right)\right)-\nabla_{\theta}\log q_{\lambda}\left({\theta}\left({\lambda},{\varepsilon}\right)\right)\right]\right). (S2.2)

A low-variance unbiased estimate of λ(λ)\nabla_{\lambda}\mathcal{L}(\lambda) can be constructed by numerically computing the expectation in (S2.2) using the (available) closed-form expressions for the derivatives θ(λ,ε)λ\frac{\partial{\theta}\left({\lambda},{\varepsilon}\right)}{\partial{\lambda}}, θlogqλ(θ(λ,ε))\nabla_{\theta}\log q_{\lambda}\left({\theta}\left({\lambda},{\varepsilon}\right)\right), θlogπ(θ(λ,ε))\nabla_{\theta}\log\pi\left({\theta}\left({\lambda},{\varepsilon}\right)\right), and θSnj(θ(λ,ε))\nabla_{\theta}S_{n}^{j}\left({\theta}\left({\lambda},{\varepsilon}\right)\right) for any jj. We follow Kingma and Welling (2019) and use a single draw of ε\varepsilon for the construction of an estimate of (S2.2). The required expressions for the GARCH model are provided in Appendix S3.
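The single-draw estimator of (S2.2) can be sketched as follows (our illustration, not the paper's code: a toy $N(2,1)$ log-kernel stands in for $wS_{n}^{j}(\theta)+\log\pi(\theta)$, and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_elbo_one_draw(lam, grad_target, r):
    """Single-draw reparameterization estimate of the ELBO gradient in (S2.2).

    lam stacks (mu, d); theta = mu + d * eps with eps ~ N(0, I_r);
    grad_target(theta) stands in for w * grad S_n^j(theta) + grad log pi(theta).
    """
    mu, d = lam[:r], lam[r:]
    eps = rng.standard_normal(r)
    theta = mu + d * eps
    grad_log_q = -(theta - mu) / d**2            # nabla_theta log q_lambda(theta)
    inner = grad_target(theta) - grad_log_q      # bracketed term in (S2.2)
    jac = np.hstack([np.eye(r), np.diag(eps)])   # d theta / d lambda
    return jac.T @ inner

# toy target: gradient of the log N(2, 1) kernel
grad_target = lambda theta: -(theta - 2.0)
```

A useful sanity check is that at the exact optimum ($\mu=2$, $d=1$ componentwise for this toy target) the pathwise gradient vanishes for every draw, not just in expectation.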

S3 The GARCH Predictive Class (Section 4)

The predictive class, $\mathcal{P}^{(t)}$, is defined by a generalized autoregressive conditional heteroscedastic, GARCH(1,1), model with Gaussian errors: $Y_{t}=\theta_{1}^{r}+\sigma_{t}\varepsilon_{t}$, $\varepsilon_{t}\overset{i.i.d.}{\sim}N\left(0,1\right)$, $\sigma_{t}^{2}=\theta_{2}^{r}+\theta_{3}^{r}\left(Y_{t-1}-\theta_{1}^{r}\right)^{2}+\theta_{4}^{r}\sigma_{t-1}^{2}$, with $\theta=\left(\theta_{1},\theta_{2},\theta_{3},\theta_{4}\right)^{\prime}=\left(\theta_{1}^{r},\log(\theta_{2}^{r}),\Phi_{1}^{-1}\left(\theta_{3}^{r}\right),\Phi_{1}^{-1}\left(\theta_{4}^{r}\right)\right)^{\prime}$. Note that, since $\theta_{1}=\theta_{1}^{r}$, the two symbols are used interchangeably throughout this section.
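The variance recursion of this class can be sketched as follows (our code; `theta_r` collects the constrained parameters $(\theta_{1}^{r},\theta_{2}^{r},\theta_{3}^{r},\theta_{4}^{r})$, and `sigma2_init` is an assumed value for $\sigma_{0}^{2}$):

```python
import numpy as np

def garch_sigma2(y, theta_r, sigma2_init=1.0):
    """Conditional variances of the Gaussian GARCH(1,1) predictive class.

    Entry t is sigma_{t+1}^2 = theta2 + theta3 * (y_t - theta1)^2 + theta4 * sigma_t^2,
    i.e. the one-step-ahead variance implied after observing y[t]."""
    th1, th2, th3, th4 = theta_r
    s2 = [sigma2_init]
    for yt in y:
        s2.append(th2 + th3 * (yt - th1)**2 + th4 * s2[-1])
    return np.array(s2[1:])
```

With the squared shocks held at zero, the recursion contracts to the fixed point $\theta_{2}^{r}/(1-\theta_{4}^{r})$ whenever $\theta_{4}^{r}<1$.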

S3.1 Priors

We employ the following priors for each of the parameters of the model:

p(θ1)1,p(θ2r)1θ2rI(θ2r>0),θ3rU(0,1),   and θ4rU(0,1).p(\theta_{1})\propto 1,\hskip 28.45274ptp({\theta_{2}^{r}})\propto\frac{1}{{\theta_{2}^{r}}}I({\theta_{2}^{r}}>0),\hskip 28.45274pt{\theta_{3}^{r}}\sim U(0,1)\text{,\ \ \ and }\hskip 14.22636pt{\theta_{4}^{r}}\sim U(0,1).

For the implementation of variational inference, all the parameters are transformed onto the real line as follows:

(i) θ2r is transformed to θ2=log(θ2r);(ii) θ3r is transformed to θ3=Φ11(θ3r);\displaystyle\text{(i) }{\theta_{2}^{r}}\text{ is transformed to }{\theta_{2}}=\log({\theta_{2}^{r}});\hskip 28.45274pt\text{(ii) }{\theta_{3}^{r}}\text{ is transformed to }{\theta_{3}}=\Phi_{1}^{-1}\left({\theta_{3}^{r}}\right);
(iii) θ4r is transformed to θ4=Φ11(θ4r).\displaystyle\text{(iii) }{\theta_{4}^{r}}\text{ is transformed to }{\theta_{4}}=\Phi_{1}^{-1}\left({\theta_{4}^{r}}\right).

After applying these transformations, we have that the prior densities are:

(i) p(θ1)1;(ii) p(θ2)1;(iii) p(θ3)=ϕ1(θ3);(iv) p(θ4)=ϕ1(θ4).\text{(i) }p({\theta_{1}})\propto 1;\hskip 28.45274pt\text{(ii) }p({\theta_{2}})\propto 1;\hskip 28.45274pt\text{(iii) }p({\theta_{3}})=\phi_{1}\left({\theta_{3}}\right);\hskip 18.49411pt\text{(iv) }p({\theta_{4}})=\phi_{1}\left({\theta_{4}}\right).

The gradient of the logarithm of the prior is θlogp(θ)=(0,0,θ3,θ4)\nabla_{\theta}\log p\left({\theta}\right)=\left(0,0,-{\theta_{3}},-\theta_{4}\right)^{{}^{\prime}}.
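The transformations and the log-prior gradient can be sketched as (our code; `Phi` denotes $\Phi_{1}$, which we take to be the standard normal cdf):

```python
import math

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def to_constrained(theta):
    """Map (theta1, theta2, theta3, theta4) back to the original GARCH scale."""
    t1, t2, t3, t4 = theta
    return (t1, math.exp(t2), Phi(t3), Phi(t4))

def log_prior_grad(theta):
    """Gradient of the log prior on the transformed scale: (0, 0, -theta3, -theta4)."""
    t1, t2, t3, t4 = theta
    return (0.0, 0.0, -t3, -t4)
```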

S3.2 Derivation of θSn(θ)\nabla_{\theta}S_{n}\left({\theta}\right) for all scoring rules

We can show that θSn(θ)=t=1nθs(Pθ(t1),yt).\nabla_{\theta}S_{n}\left({\theta}\right)=\sum_{t=1}^{n}\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right). Thus, we must find an expression for θs(Pθ(t1),yt)\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right) for each of the scores. The gradients from all the scores can be written as a function of the recursive derivatives:

(i)σt2θ1=\displaystyle\text{(i)}\ \ \frac{\partial\sigma_{t}^{2}}{\partial{\theta_{1}}}= 2θ3r(yt1θ1)+θ4rσt12θ1;(ii)σt2θ2r=1+θ4rσt12θ2r;\displaystyle-2\theta_{3}^{r}(y_{t-1}-{\theta_{1}})+\theta_{4}^{r}\frac{\partial\sigma_{t-1}^{2}}{\partial{\theta_{1}}};\hskip 28.45274pt\text{(ii)}\ \ \frac{\partial\sigma_{t}^{2}}{\partial\theta_{2}^{r}}=1+\theta_{4}^{r}\frac{\partial\sigma_{t-1}^{2}}{\partial\theta_{2}^{r}};
(iii)σt2θ3r=\displaystyle\text{(iii)}\ \ \frac{\partial\sigma_{t}^{2}}{\partial\theta_{3}^{r}}= (yt1θ1)2+θ4rσt12θ3r;(iv)σt2θ4r=θ4rσt12θ4r+σt12.\displaystyle(y_{t-1}-{\theta_{1}})^{2}+\theta_{4}^{r}\frac{\partial\sigma_{t-1}^{2}}{\partial\theta_{3}^{r}};\hskip 54.06006pt\text{(iv)}\ \ \frac{\partial\sigma_{t}^{2}}{\partial\theta_{4}^{r}}=\theta_{4}^{r}\frac{\partial\sigma_{t-1}^{2}}{\partial\theta_{4}^{r}}+\sigma_{t-1}^{2}.

with initial conditions $\frac{\partial\sigma_{0}^{2}}{\partial\theta_{1}}=\frac{\partial\sigma_{0}^{2}}{\partial\theta_{2}^{r}}=\frac{\partial\sigma_{0}^{2}}{\partial\theta_{3}^{r}}=\frac{\partial\sigma_{0}^{2}}{\partial\theta_{4}^{r}}=0$.
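The four recursions can be implemented jointly with the variance recursion, and verified against finite differences (our sketch, with names of our choosing):

```python
import numpy as np

def sigma2_with_grads(y, theta_r, sigma2_init=1.0):
    """sigma_t^2 and its gradient w.r.t. (theta1, theta2^r, theta3^r, theta4^r),
    computed via recursions (i)-(iv), with all derivatives initialized at zero."""
    th1, th2, th3, th4 = theta_r
    s2, g = sigma2_init, np.zeros(4)
    s2_path, g_path = [], []
    for y_prev in y:  # y_prev plays the role of y_{t-1}
        g = np.array([
            -2.0 * th3 * (y_prev - th1) + th4 * g[0],  # (i)
            1.0 + th4 * g[1],                          # (ii)
            (y_prev - th1)**2 + th4 * g[2],            # (iii)
            th4 * g[3] + s2,                           # (iv), uses sigma_{t-1}^2
        ])
        s2 = th2 + th3 * (y_prev - th1)**2 + th4 * s2
        s2_path.append(s2)
        g_path.append(g)
    return np.array(s2_path), np.array(g_path)
```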

S3.2.1 Logarithmic score (LS)

The gradient for the LS can be written as

\nabla_{\theta}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right)=\left(\nabla_{\theta_{1}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right),\nabla_{\theta_{2}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right),\nabla_{\theta_{3}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right),\nabla_{\theta_{4}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right)\right)^{\prime},

with

\begin{aligned}
\nabla_{\theta_{1}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=-\frac{1}{2\sigma_{t}^{2}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{1}}+\frac{1}{2}\frac{(y_{t}-\theta_{1})^{2}}{\sigma_{t}^{4}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{1}}+\frac{(y_{t}-\theta_{1})}{\sigma_{t}^{2}},\qquad
&\nabla_{\theta_{2}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=\theta_{2}^{r}\left[-\frac{1}{2\sigma_{t}^{2}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{2}^{r}}+\frac{1}{2}\frac{(y_{t}-\theta_{1})^{2}}{\sigma_{t}^{4}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{2}^{r}}\right],\\
\nabla_{\theta_{3}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=\phi_{1}(\theta_{3})\left[-\frac{1}{2\sigma_{t}^{2}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{3}^{r}}+\frac{1}{2}\frac{(y_{t}-\theta_{1})^{2}}{\sigma_{t}^{4}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{3}^{r}}\right],\qquad
&\nabla_{\theta_{4}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=\phi_{1}(\theta_{4})\left[-\frac{1}{2\sigma_{t}^{2}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{4}^{r}}+\frac{1}{2}\frac{(y_{t}-\theta_{1})^{2}}{\sigma_{t}^{4}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{4}^{r}}\right].
\end{aligned}
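As an illustration, the $\theta_{1}$ component can be coded and verified against a finite-difference derivative of the log score (our sketch; the remaining components follow the same pattern with the chain-rule factors $\theta_{2}^{r}$, $\phi_{1}(\theta_{3})$ and $\phi_{1}(\theta_{4})$):

```python
import math

def ls_and_grad_theta1(y, theta_r, sigma2_init=1.0):
    """Sum of Gaussian log scores over t = 2, ..., n and its theta1-gradient."""
    th1, th2, th3, th4 = theta_r
    s2, ds2 = sigma2_init, 0.0
    total, grad = 0.0, 0.0
    for t in range(1, len(y)):
        ds2 = -2.0 * th3 * (y[t - 1] - th1) + th4 * ds2   # recursion (i)
        s2 = th2 + th3 * (y[t - 1] - th1)**2 + th4 * s2
        z2 = (y[t] - th1)**2 / s2
        total += -0.5 * math.log(2 * math.pi * s2) - 0.5 * z2
        # -1/(2 s2) ds2 + (y-th1)^2/(2 s2^2) ds2 + (y-th1)/s2
        grad += (-0.5 / s2 + 0.5 * z2 / s2) * ds2 + (y[t] - th1) / s2
    return total, grad
```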

S3.2.2 Continuously ranked probability score (CRPS)

For the Gaussian GARCH(1,1) predictive class the CRPS can be expressed as

sCRPS(Pθ(t1),yt)=σtBt,s^{\text{CRPS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=-\sigma_{t}B_{t},

with Bt=zt(2Φ1(zt)1)+2ϕ1(zt)1πB_{t}=z_{t}\left(2\Phi_{1}(z_{t})-1\right)+2\phi_{1}(z_{t})-\frac{1}{\sqrt{\pi}} and zt=ytθ1σtz_{t}=\frac{y_{t}-{\theta_{1}}}{\sigma_{t}}. The gradient of the CRPS can be written as

θsCRPS(Pθ(t1),yt)=(θ1sCRPS(Pθt1,yt),θ2sCRPS(Pθt1,yt),θ3sCRPS(Pθt1,yt),θ4sCRPS(Pθt1,yt)).\nabla_{\theta}s^{\text{CRPS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\left(\nabla_{{\theta_{1}}}s^{\text{CRPS}}\left(P_{{\theta}}^{t-1},y_{t}\right),\nabla_{{\theta_{2}}}s^{\text{CRPS}}\left(P_{{\theta}}^{t-1},y_{t}\right),\nabla_{{\theta_{3}}}s^{\text{CRPS}}\left(P_{{\theta}}^{t-1},y_{t}\right),\nabla_{{\theta_{4}}}s^{\text{CRPS}}\left(P_{{\theta}}^{t-1},y_{t}\right)\right)^{\prime}.

The elements of this gradient are given by

\begin{aligned}
\nabla_{\theta_{1}}s^{\text{CRPS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=-\sigma_{t}\frac{\partial B_{t}}{\partial z_{t}}\left[\frac{\partial z_{t}}{\partial\sigma_{t}^{2}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{1}}-\frac{1}{\sigma_{t}}\right]-\frac{B_{t}}{2\sigma_{t}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{1}}\\
\nabla_{\theta_{2}}s^{\text{CRPS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=\left[-\sigma_{t}\frac{\partial B_{t}}{\partial z_{t}}\frac{\partial z_{t}}{\partial\sigma_{t}^{2}}-\frac{B_{t}}{2\sigma_{t}}\right]\frac{\partial\sigma_{t}^{2}}{\partial\theta_{2}^{r}}\,\theta_{2}^{r}\\
\nabla_{\theta_{3}}s^{\text{CRPS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=\left[-\sigma_{t}\frac{\partial B_{t}}{\partial z_{t}}\frac{\partial z_{t}}{\partial\sigma_{t}^{2}}-\frac{B_{t}}{2\sigma_{t}}\right]\frac{\partial\sigma_{t}^{2}}{\partial\theta_{3}^{r}}\,\phi_{1}(\theta_{3})\\
\nabla_{\theta_{4}}s^{\text{CRPS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=\left[-\sigma_{t}\frac{\partial B_{t}}{\partial z_{t}}\frac{\partial z_{t}}{\partial\sigma_{t}^{2}}-\frac{B_{t}}{2\sigma_{t}}\right]\frac{\partial\sigma_{t}^{2}}{\partial\theta_{4}^{r}}\,\phi_{1}(\theta_{4})
\end{aligned}

with $\frac{\partial B_{t}}{\partial z_{t}}=2\Phi_{1}(z_{t})-1$ and $\frac{\partial z_{t}}{\partial\sigma_{t}^{2}}=-\frac{z_{t}}{2\sigma_{t}^{2}}$.
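The closed-form expression above can be checked numerically against the integral definition $\text{CRPS}(F,y)=\int\left(F(x)-I(x\geq y)\right)^{2}\mathrm{d}x$ (our sketch; the score $s^{\text{CRPS}}$ in the text is the negative of this quantity):

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of N(mu, sigma^2) at y: sigma * B_t, with B_t as in the text."""
    z = (y - mu) / sigma
    B = z * (2.0 * Phi(z) - 1.0) + 2.0 * phi(z) - 1.0 / math.sqrt(math.pi)
    return sigma * B
```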

S3.2.3 Censored logarithmic score (CLS)

For some threshold value $y_{q}$, denote the upper tail support of the predictive distribution as $A=\{y_{t}:y_{t}>y_{q}\}$. The gradient for the upper tail CLS can be written as

θsCLS-A(Pθ(t1),yt)\displaystyle\nabla_{\theta}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)
=\displaystyle= (θ1sCLS-A(Pθ(t1),yt),θ2sCLS-A(Pθ(t1),yt),θ3sCLS-A(Pθ(t1),yt),θ4sCLS-A(Pθ(t1),yt)),\displaystyle\left(\nabla_{{\theta_{1}}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\nabla_{{\theta_{2}}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\nabla_{{\theta_{3}}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\nabla_{\theta_{4}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)\right)^{\prime},

with

θ1sCLS-A(Pθ(t1),yt)\displaystyle\nabla_{{\theta_{1}}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) =θ1sLS(Pθ(t1),yt)I(ytA)+ϕ1(yqθ1σt)Pθ(t1)(yq)(yqθ12σt3σt2θ11σt)I(ytAc)\displaystyle=\nabla_{{\theta_{1}}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)+\frac{\phi_{1}(\frac{y_{q}-{\theta_{1}}}{\sigma_{t}})}{P_{{\theta}}^{(t-1)}(y_{q})}\left(-\frac{y_{q}-{\theta_{1}}}{2\sigma_{t}^{3}}\frac{\partial\sigma_{t}^{2}}{\partial{\theta_{1}}}-\frac{1}{\sigma_{t}}\right)I(y_{t}\in A^{c})
θ2sCLS-A(Pθ(t1),yt)\displaystyle\nabla_{{\theta_{2}}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) =θ2sLS(Pθ(t1),yt)I(ytA)+ϕ1(yqθ1σt)Pθ(t1)(yq)(yqθ12σt3σt2θ2r)θ2rI(ytAc)\displaystyle=\nabla_{{\theta_{2}}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)+\frac{\phi_{1}(\frac{y_{q}-\theta_{1}}{\sigma_{t}})}{P_{{\theta}}^{(t-1)}(y_{q})}\left(-\frac{y_{q}-{\theta_{1}}}{2\sigma_{t}^{3}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{2}^{r}}\right){\theta_{2}^{r}}I(y_{t}\in A^{c})
θ3sCLS-A(Pθ(t1),yt)\displaystyle\nabla_{{\theta_{3}}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) =θ3sLS(Pθ(t1),yt)I(ytA)+ϕ1(yqθ1σt)Pθ(t1)(yq)(yqθ12σt3σt2θ3r)ϕ1(θ3)I(ytAc)\displaystyle=\nabla_{{\theta_{3}}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)+\frac{\phi_{1}(\frac{y_{q}-\theta_{1}}{\sigma_{t}})}{P_{{\theta}}^{(t-1)}(y_{q})}\left(-\frac{y_{q}-{\theta_{1}}}{2\sigma_{t}^{3}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{3}^{r}}\right)\phi_{1}({\theta_{3}})I(y_{t}\in A^{c})
θ4sCLS-A(Pθ(t1),yt)\displaystyle\nabla_{{\theta_{4}}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) =θ4sLS(Pθ(t1),yt)I(ytA)+ϕ1(yqθ1σt)Pθ(t1)(yq)(yqθ12σt3σt2θ4r)ϕ1(θ4)I(ytAc)\displaystyle=\nabla_{{\theta_{4}}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)+\frac{\phi_{1}(\frac{y_{q}-\theta_{1}}{\sigma_{t}})}{P_{{\theta}}^{(t-1)}(y_{q})}\left(-\frac{y_{q}-{\theta_{1}}}{2\sigma_{t}^{3}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{4}^{r}}\right)\phi_{1}({\theta_{4}})I(y_{t}\in A^{c})

To obtain the gradient expressions for the lower tail CLS simply replace the term ϕ1(yqθ1σt)Pθ(t1)(yq)\frac{\phi_{1}(\frac{y_{q}-{\theta_{1}}}{\sigma_{t}})}{P_{{\theta}}^{(t-1)}(y_{q})} by the term ϕ1(yqθ1σt)1Pθ(t1)(yq)-\frac{\phi_{1}(\frac{y_{q}-{\theta_{1}}}{\sigma_{t}})}{1-P_{{\theta}}^{(t-1)}(y_{q})}, and redefine the set AA.
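The score whose gradient appears above can itself be sketched as (our code; `Phi` is assumed to be the standard normal cdf, so that $P_{\theta}^{(t-1)}(y_{q})=\Phi_{1}\left((y_{q}-\theta_{1})/\sigma_{t}\right)$):

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cls_upper(y, yq, mu, sigma):
    """Upper-tail censored log score: the full Gaussian log density when y lies in
    A = {y > yq}, and the log of the censored mass P(Y <= yq) otherwise."""
    if y > yq:
        z = (y - mu) / sigma
        return -0.5 * math.log(2.0 * math.pi * sigma**2) - 0.5 * z * z
    return math.log(Phi((yq - mu) / sigma))
```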

S3.2.4 Interval score (IS)

The gradient of the IS can be written as

θsIS(Pθ(t1),yt)=(θ1sIS(Pθ(t1),yt),θ2sIS(Pθ(t1),yt),θ3sIS(Pθ(t1),yt),θ4sIS(Pθ(t1),yt)).\nabla_{\theta}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\left(\nabla_{{\theta_{1}}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\nabla_{{\theta_{2}}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\nabla_{{\theta_{3}}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\nabla_{{\theta_{4}}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)\right).

For θi{θ1,θ2,θ3,θ4}\theta_{i}\in\{{\theta_{1}},{\theta_{2}},{\theta_{3}},\theta_{4}\}, the elements of the gradient can be written as

θisIS(Pθ(t1),yt)=[121qI(yt>ut)]utθi[21qI(yt<lt)1]ltθi,\nabla_{\theta_{i}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=-\left[1-\frac{2}{1-q}I(y_{t}>u_{t})\right]\frac{\partial u_{t}}{\partial\theta_{i}}-\left[\frac{2}{1-q}I(y_{t}<l_{t})-1\right]\frac{\partial l_{t}}{\partial\theta_{i}},

where the derivative utθi\frac{\partial u_{t}}{\partial{\theta_{i}}} can be evaluated as utθi=utαu,tαu,tθi\frac{\partial u_{t}}{\partial\theta_{i}}=-\frac{\partial u_{t}}{\partial\alpha_{u,t}}\frac{\partial\alpha_{u,t}}{\partial\theta_{i}} with αu,t=Pθ(t1)(ut)\alpha_{u,t}=P_{{\theta}}^{(t-1)}(u_{t}). The first term can be computed as utαu,t=1p(ut|y1:t1,θ)\frac{\partial u_{t}}{\partial\alpha_{u,t}}=\frac{1}{p\left(u_{t}|y_{1:t-1},{\theta}\right)}. The second term is parameter specific, and can be computed as

(i) αu,tθ1\displaystyle\text{(i) }\ \ \frac{\partial\alpha_{u,t}}{\partial{\theta_{1}}} =ϕ1(utθ1σt)[(utθ12σt3)σt2θ11σt];(ii) αu,tθ2=ϕ1(utθ1σt)(utθ12σt3)σt2θ2rθ2r;\displaystyle=\phi_{1}\left(\frac{u_{t}-{\theta_{1}}}{\sigma_{t}}\right)\left[\left(-\frac{u_{t}-\theta_{1}}{2\sigma_{t}^{3}}\right)\frac{\partial\sigma_{t}^{2}}{\partial{\theta_{1}}}-\frac{1}{\sigma_{t}}\right];\hskip 14.22636pt\text{(ii) }\ \ \frac{\partial\alpha_{u,t}}{\partial{\theta_{2}}}=\phi_{1}\left(\frac{u_{t}-{\theta_{1}}}{\sigma_{t}}\right)\left(-\frac{u_{t}-{\theta_{1}}}{2\sigma_{t}^{3}}\right)\frac{\partial\sigma_{t}^{2}}{\partial{\theta_{2}^{r}}}{\theta_{2}^{r}};
(iii) αu,tθ3\displaystyle\text{(iii) }\ \ \frac{\partial\alpha_{u,t}}{\partial{\theta_{3}}} =ϕ1(utθ1σt)(utθ12σt3)σt2θ3rϕ1(θ3);(iv) αu,tθ4=ϕ1(utθ1σt)(utθ12σt3)σt2θ4rϕ1(θ4).\displaystyle=\phi_{1}\left(\frac{u_{t}-{\theta_{1}}}{\sigma_{t}}\right)\left(-\frac{u_{t}-{\theta_{1}}}{2\sigma_{t}^{3}}\right)\frac{\partial\sigma_{t}^{2}}{\partial\theta_{3}^{r}}\phi_{1}({\theta_{3}});\hskip 17.07182pt\text{(iv) }\ \ \frac{\partial\alpha_{u,t}}{\partial{\theta_{4}}}=\phi_{1}\left(\frac{u_{t}-{\theta_{1}}}{\sigma_{t}}\right)\left(-\frac{u_{t}-{\theta_{1}}}{2\sigma_{t}^{3}}\right)\frac{\partial\sigma_{t}^{2}}{\partial{\theta_{4}^{r}}}\phi_{1}({\theta_{4}}).

The derivative ltθi\frac{\partial l_{t}}{\partial{\theta_{i}}} is evaluated in the same fashion as described for utθi\frac{\partial u_{t}}{\partial{\theta_{i}}}.
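The (negatively oriented) interval score consistent with this gradient can be sketched as (our code, with $\alpha=1-q$ and $(l_{t},u_{t})$ the interval limits; $s^{\text{IS}}$ in the text is its negative):

```python
def interval_score(y, l, u, alpha):
    """Negatively oriented interval score: width plus scaled penalties for
    observations falling outside the (l, u) interval."""
    penalty_low = (2.0 / alpha) * (l - y) if y < l else 0.0
    penalty_high = (2.0 / alpha) * (y - u) if y > u else 0.0
    return (u - l) + penalty_low + penalty_high
```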

S4 Autoregressive Mixture Predictive Class (Section 5.1)

S4.1 Priors

We employ the following priors for each of the parameters of the model:

βk,0N(0,100002),βk,1U(1,1),vkBeta(1,2),(σk2)1𝒢(1,1),   and μN(0,1002),\beta_{k,0}\sim N(0,10000^{2}),\hskip 14.22636pt\beta_{k,1}\sim U(-1,1),\hskip 14.22636ptv_{k}\sim\text{Beta}(1,2),\hskip 14.22636pt\left(\sigma_{k}^{2}\right)^{-1}\sim\mathcal{G}(1,1)\text{,\ \ \ and }\hskip 14.22636pt\mu\sim N(0,100^{2}),

where the parameter $v_{k}$ arises from the stick-breaking decomposition of the mixture weights, which sets $\tau_{k}=v_{k}\prod_{j=1}^{k-1}(1-v_{j})$ for $k=1,\dots,K$, with $v_{K}=1$. For the implementation of variational inference, all the parameters are transformed to the real line as follows:

(i) βk,1 is transformed to ηk=Φ11(12(βk,1+1));(ii) vk is transformed to ψk=Φ11(vk);\displaystyle\text{(i) }\beta_{k,1}\text{ is transformed to }\eta_{k}=\Phi_{1}^{-1}\left(\frac{1}{2}\left(\beta_{k,1}+1\right)\right);\hskip 28.45274pt\text{(ii) }v_{k}\text{ is transformed to }\psi_{k}=\Phi_{1}^{-1}\left(v_{k}\right);
(iii) σk is transformed to κk=logσk.\displaystyle\text{(iii) }\sigma_{k}\text{ is transformed to }\kappa_{k}=\log\sigma_{k}.
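These mappings are simple to implement. The following is a minimal Python sketch of the stick-breaking construction and the three transformations; the bisection-based `Phi_inv` is an illustrative stand-in for a library quantile function, and all function names are ours, not the paper's.

```python
import math

def stick_breaking(v):
    """Map v_1, ..., v_{K-1} in (0,1) (with v_K = 1) to mixture weights tau."""
    v = list(v) + [1.0]                 # v_K = 1 closes the stick
    tau, remaining = [], 1.0
    for vk in v:
        tau.append(vk * remaining)      # tau_k = v_k * prod_{j<k} (1 - v_j)
        remaining *= 1.0 - vk
    return tau

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p):
    """Inverse standard normal CDF by bisection (illustrative stand-in)."""
    lo, hi = -20.0, 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def to_real(beta1, v, sigma):
    """Transformations of (beta_{k,1}, v_k, sigma_k) to the real line."""
    eta = Phi_inv(0.5 * (beta1 + 1.0))  # beta_{k,1} in (-1, 1)
    psi = Phi_inv(v)                    # v_k in (0, 1)
    kappa = math.log(sigma)             # sigma_k > 0
    return eta, psi, kappa
```

By construction the weights sum to one, and each transformed parameter is unconstrained on the real line.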

After applying these transformations, we have that the prior densities are:

\text{(i) }p(\beta_{k,0})=\phi_{1}\left(\beta_{k,0};0,10000^{2}\right);\qquad\text{(ii) }p(\eta_{k})=\phi_{1}\left(\eta_{k};0,1\right);
\text{(iii) }p(\psi_{k})\propto\left[1-\Phi_{1}\left(\psi_{k}\right)\right]\phi_{1}\left(\psi_{k}\right);\qquad\text{(iv) }p(\kappa_{k})\propto 2\exp\left(-2\kappa_{k}\right)\exp\left[-\exp\left(-2\kappa_{k}\right)\right];
\text{(v) }p(\mu)=\phi_{1}\left(\mu;0,100^{2}\right).
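As a numerical sanity check on (iv): the density of $\kappa_{k}$ follows by a change of variables from the exponential prior on the precision $\lambda_{k}=\sigma_{k}^{-2}=\exp(-2\kappa_{k})$, and should integrate to one. A sketch with illustrative function names:

```python
import math

def p_kappa(kappa):
    """Density of kappa = log(sigma_k) implied by sigma_k^{-2} ~ Gamma(1, 1).

    With precision lam = exp(-2 * kappa) and p(lam) = exp(-lam), the change of
    variables gives p(kappa) = 2 * exp(-2 * kappa) * exp(-exp(-2 * kappa)).
    """
    lam = math.exp(-2.0 * kappa)
    return 2.0 * lam * math.exp(-lam)

def midpoint_integral(f, lo, hi, n=100000):
    """Midpoint-rule quadrature (illustrative)."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))
```

Integrating `p_kappa` over a wide interval such as $[-3, 10]$ recovers one to within quadrature error, since the density is negligible outside that range.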

Computation of θlogp(θ)\nabla_{\theta}\log p\left({\theta}\right)
Denoting $\beta_{0}=\left(\beta_{1,0},\dots,\beta_{K,0}\right)^{\prime}$, $\eta=\left(\eta_{1},\dots,\eta_{K}\right)^{\prime}$, $\psi=\left(\psi_{1},\dots,\psi_{K-1}\right)^{\prime}$ and $\kappa=\left(\kappa_{1},\dots,\kappa_{K}\right)^{\prime}$, the gradient of the log-prior density is

\nabla_{\theta}\log p\left({\theta}\right)=\left[\frac{\partial\log p\left({\theta}\right)}{\partial{\beta}_{0}},\frac{\partial\log p\left({\theta}\right)}{\partial{\eta}},\frac{\partial\log p\left({\theta}\right)}{\partial{\psi}},\frac{\partial\log p\left({\theta}\right)}{\partial{\kappa}},\frac{\partial\log p\left({\theta}\right)}{\partial\mu}\right]^{\prime},

with

(i) logp(θ)β0=(10000)2β0;(ii) logp(θ)η=η;\text{(i) }\frac{\partial\log p\left({\theta}\right)}{\partial{\beta}_{0}}=-(10000)^{-2}{\beta}_{0}^{{}^{\prime}};\ \ \ \ \ \text{(ii) }\frac{\partial\log p\left({\theta}\right)}{\partial{\eta}}=-{\eta}^{{}^{\prime}};
(iii) logp(θ)ψ=(1v)1vψψ;(iv) logp(θ)κ=2κ~2;\text{(iii) }\frac{\partial\log p\left({\theta}\right)}{\partial{\psi}}=-(1-{v}^{\prime})^{-1}\frac{\partial{v}}{\partial\psi}-{\psi}^{{}^{\prime}};\ \ \ \ \ \text{(iv) }\frac{\partial\log p\left({\theta}\right)}{\partial{\kappa}}=2\tilde{{\kappa}}^{{}^{\prime}}-2;
(v) logp(θ)μ=(100)2μ.\text{(v) }\frac{\partial\log p\left({\theta}\right)}{\partial{\mu}}=-(100)^{-2}{\mu}.

and where (1v)1=([1v1]1,,[1vK1]1)(1-{v}^{\prime})^{-1}=\left([1-v_{1}]^{-1},\dots,[1-v_{K-1}]^{-1}\right) and κ~=(exp[2κ1],,exp[2κK])\tilde{{\kappa}}=(\exp[-2\kappa_{1}],\dots,\exp[-2\kappa_{K}])^{\prime}.
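Each block of the gradient can be checked against a finite difference of the log-prior. For example, component (iv), $2\exp(-2\kappa_{k})-2$, can be verified as follows (a sketch with hypothetical function names, using the unnormalized log density):

```python
import math

def log_p_kappa(kappa):
    """Unnormalized log prior: log p(kappa) = -2*kappa - exp(-2*kappa) + const."""
    return -2.0 * kappa - math.exp(-2.0 * kappa)

def grad_log_p_kappa(kappa):
    """Analytic gradient from (iv): 2 * exp(-2*kappa) - 2."""
    return 2.0 * math.exp(-2.0 * kappa) - 2.0

def central_diff(f, x, h=1e-6):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

The gradient vanishes at $\kappa=0$, the mode of the transformed prior.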

S4.2 Derivation of θSn(θ)\nabla_{\theta}S_{n}\left({\theta}\right) for all scoring rules

The focused Bayesian update uses the term Sn(θ)=t=1ns(Pθ(t1),yt)S_{n}\left({\theta}\right)=\sum_{t=1}^{n}s\left(P_{{\theta}}^{(t-1)},y_{t}\right), thus variational inference requires evaluation of

θSn(θ)=t=1nθs(Pθ(t1),yt).\nabla_{\theta}S_{n}\left({\theta}\right)=\sum_{t=1}^{n}\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right).

For the mixture example we consider three alternative choices for $s\left(P_{{\theta}}^{(t-1)},y_{t}\right)$: the IS, the CLS and the LS. Here, we derive an expression for $\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right)$ for each of these scores. As shown below, the gradient $\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right)$ for each score can be expressed solely in terms of $\nabla_{\theta}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)$ and $\nabla_{\theta}P_{{\theta}}^{(t-1)}$, so we focus on the derivation of these two expressions. Below, we denote $\epsilon_{t}=y_{t}-\mu$ and use $r$ to denote the number of elements in $\theta$. For ease of notation we rewrite

p(yt|t1,θ)=c2,tc1,t and Pθ(t1)=c3,tc1,t,p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)=\frac{c_{2,t}}{c_{1,t}}\text{\ \ \ \ \ and \ \ \ \ \ }P_{{\theta}}^{(t-1)}=\frac{c_{3,t}}{c_{1,t}},

where $c_{1,t,k}=\frac{\tau_{k}}{s_{k}}\phi_{1}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)$, $c_{2,t,k}=\frac{1}{\sigma_{k}}\phi_{1}\left(\frac{\epsilon_{t}-\beta_{k,0}-\beta_{k,1}\epsilon_{t-1}}{\sigma_{k}}\right)$, $c_{3,t,k}=\Phi_{1}\left(\frac{\epsilon_{t}-\beta_{k,0}-\beta_{k,1}\epsilon_{t-1}}{\sigma_{k}}\right)$, $c_{1,t}=\sum_{k=1}^{K}c_{1,t,k}$, $c_{2,t}=\sum_{k=1}^{K}c_{1,t,k}c_{2,t,k}$ and $c_{3,t}=\sum_{k=1}^{K}c_{1,t,k}c_{3,t,k}$. We also define the filtered weights $\tau_{k,t}=c_{1,t,k}/c_{1,t}$. The elements of the gradients $\nabla_{\theta}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)=\left[\nabla_{\theta_{1}}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right),\dots,\nabla_{\theta_{r}}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)\right]^{\prime}$ and $\nabla_{\theta}P_{{\theta}}^{(t-1)}=\left[\nabla_{\theta_{1}}P_{{\theta}}^{(t-1)},\dots,\nabla_{\theta_{r}}P_{{\theta}}^{(t-1)}\right]^{\prime}$ can then be computed as

\nabla_{\theta_{i}}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)=\sum_{k=1}^{K}\left[\tau_{k,t}\frac{\partial c_{2,t,k}}{\partial\theta_{i}}+\frac{c_{2,t,k}-p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)}{c_{1,t}}\frac{\partial c_{1,t,k}}{\partial\theta_{i}}\right] (S4.3)

and

\nabla_{\theta_{i}}P_{{\theta}}^{(t-1)}=\sum_{k=1}^{K}\left[\tau_{k,t}\frac{\partial c_{3,t,k}}{\partial\theta_{i}}+\frac{c_{3,t,k}-P_{{\theta}}^{(t-1)}}{c_{1,t}}\frac{\partial c_{1,t,k}}{\partial\theta_{i}}\right]. (S4.4)

Table S1 provides the expressions c1,t,kθi\frac{\partial c_{1,t,k}}{\partial\theta_{i}}, c2,t,kθi\frac{\partial c_{2,t,k}}{\partial\theta_{i}} and c3,t,kθi\frac{\partial c_{3,t,k}}{\partial\theta_{i}} for θi{β0,η,κ,μ}\theta_{i}\in\{{\beta}_{0},{\eta},{\kappa},\mu\}. For θiψ\theta_{i}\in{\psi}, the gradients can be evaluated as

ψp(yt|t1,θ)=[τp(yt|t1,θ)τvvψ],\nabla_{\psi}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)=\left[\nabla_{\tau}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)^{\prime}\frac{\partial{\tau}}{\partial v}\frac{\partial{v}}{\partial\psi}\right]^{\prime},

and

ψPθ(t1)=[τPθ(t1)τvvψ],\nabla_{\psi}P_{{\theta}}^{(t-1)}=\left[{\nabla_{\tau}P_{{\theta}}^{(t-1)}}^{\prime}\frac{\partial{\tau}}{\partial v}\frac{\partial{v}}{\partial\psi}\right]^{\prime},

where vψ\frac{\partial{v}}{\partial\psi} is a diagonal matrix with entries vkψk=ϕ1(ψk)\frac{\partial v_{k}}{\partial\psi_{k}}=\phi_{1}\left(\psi_{k}\right), and τv\frac{\partial{\tau}}{\partial v} is a lower triangular matrix with diagonal elements τkvk=j=1k1(1vj)\frac{\partial\tau_{k}}{\partial v_{k}}=\prod_{j=1}^{k-1}(1-v_{j}) and off-diagonal elements τkvs=τk(1vs)\frac{\partial\tau_{k}}{\partial v_{s}}=-\frac{\tau_{k}}{(1-v_{s})}, for s<ks<k. The expressions needed to evaluate τp(yt|t1,θ)=[τ1p(yt|t1,θ),,τKp(yt|t1,θ)]\nabla_{\tau}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)=\left[\nabla_{\tau_{1}}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right),\dots,\nabla_{\tau_{K}}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)\right]^{\prime} and τPθ(t1)=[τ1Pθ(t1),,τKPθ(t1)]\nabla_{\tau}P_{{\theta}}^{(t-1)}=\left[\nabla_{\tau_{1}}P_{{\theta}}^{(t-1)},\dots,\nabla_{\tau_{K}}P_{{\theta}}^{(t-1)}\right]^{\prime} are also provided in Table S1.
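The structure of $\frac{\partial{\tau}}{\partial v}$ can be checked numerically. The sketch below builds the lower-triangular Jacobian from the closed forms above (for the $K-1$ free weights only, since $v_{K}=1$ is fixed) and compares it with central finite differences; the function names are illustrative:

```python
def tau_of_v(v):
    """Free mixture weights tau_1, ..., tau_{K-1} from v via stick breaking."""
    tau, remaining = [], 1.0
    for vk in v:
        tau.append(vk * remaining)
        remaining *= 1.0 - vk
    return tau

def jacobian_tau_v(v):
    """Lower-triangular d tau_k / d v_s using the closed forms in the text."""
    tau = tau_of_v(v)
    n = len(v)
    J = [[0.0] * n for _ in range(n)]
    prod = 1.0                                # running prod_{j<k} (1 - v_j)
    for k in range(n):
        J[k][k] = prod                        # diagonal element
        for s in range(k):
            J[k][s] = -tau[k] / (1.0 - v[s])  # off-diagonal, s < k
        prod *= 1.0 - v[k]
    return J

def fd_jacobian(v, h=1e-7):
    """Central finite-difference Jacobian, for comparison."""
    n = len(v)
    J = [[0.0] * n for _ in range(n)]
    for s in range(n):
        vp, vm = list(v), list(v)
        vp[s] += h
        vm[s] -= h
        tp, tm = tau_of_v(vp), tau_of_v(vm)
        for k in range(n):
            J[k][s] = (tp[k] - tm[k]) / (2.0 * h)
    return J
```

The two matrices agree entrywise, confirming both the diagonal and off-diagonal expressions.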

With expressions (S4.3) and (S4.4) in hand, we can now derive expressions for $\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right)$ for each of the three scores considered.

S4.2.1 Logarithmic score (LS)

Denote the gradient of the LS in (9) as

\nabla_{\theta}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\left(\nabla_{\theta_{1}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\dots,\nabla_{\theta_{r}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)\right)^{\prime}.

The element θisLS(Pθ(t1),yt)\nabla_{\theta_{i}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) of this gradient can be evaluated as

θisLS(Pθ(t1),yt)=1p(yt|t1,θ)θip(yt|t1,θ)\nabla_{\theta_{i}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\frac{1}{p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)}\nabla_{\theta_{i}}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)

where the derivative θip(yt|t1,θ)\nabla_{\theta_{i}}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right) can be computed using (S4.3).
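The identity $\nabla_{\theta_{i}}s^{\text{LS}}=\nabla_{\theta_{i}}p/p$ is straightforward to verify numerically. The sketch below does so for a hypothetical two-component Gaussian mixture standing in for $p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)$, differentiating with respect to the first component mean:

```python
import math

def phi(x):
    """Standard normal pdf."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def mix_pdf(y, tau, mu, sig):
    """Density of the Gaussian mixture sum_k tau_k N(y; mu_k, sig_k^2)."""
    return sum(t * phi((y - m) / s) / s for t, m, s in zip(tau, mu, sig))

def grad_logp_mu1(y, tau, mu, sig):
    """d log p(y) / d mu_1 = (d p(y) / d mu_1) / p(y)."""
    z = (y - mu[0]) / sig[0]
    dp = tau[0] * phi(z) * z / sig[0] ** 2   # derivative of the first component
    return dp / mix_pdf(y, tau, mu, sig)
```

A central finite difference of `log(mix_pdf)` in the first mean reproduces the analytic value.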

S4.2.2 Censored logarithmic score (CLS)

For some threshold value $y_{q}$, denote the upper-tail support of the predictive distribution as $A=\{y_{t}:y_{t}>y_{q}\}$. The gradient for the upper tail CLS in (11) can be written as

θsCLS-A(Pθ(t1),yt)=(θ1sCLS-A(Pθ(t1),yt),,θrsCLS-A(Pθ(t1),yt)).\nabla_{\theta}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\left(\nabla_{\theta_{1}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\dots,\nabla_{\theta_{r}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)\right)^{\prime}.

We can compute the element θisCLS-A(Pθ(t1),yt)\nabla_{\theta_{i}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) of the upper CLS as:

θisCLS-A(Pθ(t1),yt)=θisLS(Pθ(t1),yt)I(ytA)+1Pθ(t1)(yq)θiPθ(t1)(yq)I(ytAc).\nabla_{\theta_{i}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\nabla_{\theta_{i}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)+\frac{1}{P_{{\theta}}^{(t-1)}(y_{q})}\nabla_{\theta_{i}}P_{{\theta}}^{(t-1)}(y_{q})I(y_{t}\in A^{c}).

The derivatives θiPθ(t1)(yq)\nabla_{\theta_{i}}P_{{\theta}}^{(t-1)}(y_{q}) can be computed using (S4.4). For the lower tail CLS, where A={yt:yt<yq}A=\{y_{t}:y_{t}<y_{q}\} we have that

θisCLS-A(Pθ(t1),yt)=θisLS(Pθ(t1),yt)I(ytA)11Pθ(t1)(yq)θiPθ(t1)(yq)I(ytAc).\nabla_{\theta_{i}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\nabla_{\theta_{i}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)-\frac{1}{1-P_{{\theta}}^{(t-1)}(y_{q})}\nabla_{\theta_{i}}P_{{\theta}}^{(t-1)}(y_{q})I(y_{t}\in A^{c}).
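The score itself, whose gradients appear above, can be sketched for a Gaussian predictive standing in for $P_{{\theta}}^{(t-1)}$ (an assumption of this illustration): outside the region of interest $A$, only the censored mass enters the score.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_pdf(y, m, s):
    """Log density of N(m, s^2) at y."""
    return -math.log(s) - 0.5 * math.log(2.0 * math.pi) - 0.5 * ((y - m) / s) ** 2

def cls_upper(y, yq, m, s):
    """Upper-tail CLS: log p(y) for y > yq, else log P(yq)."""
    if y > yq:
        return log_pdf(y, m, s)
    return math.log(Phi((yq - m) / s))

def cls_lower(y, yq, m, s):
    """Lower-tail CLS: log p(y) for y < yq, else log(1 - P(yq))."""
    if y < yq:
        return log_pdf(y, m, s)
    return math.log(1.0 - Phi((yq - m) / s))
```

Both branches of each score are exercised by placing $y_{t}$ on either side of the threshold.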

S4.2.3 Interval score (IS)

The gradient of the IS in (12) can be written as

\nabla_{\theta}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\left(\nabla_{\theta_{1}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\dots,\nabla_{\theta_{r}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)\right)^{\prime},

with elements θisIS(Pθ(t1),yt)\nabla_{\theta_{i}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) defined as:

θisIS(Pθ(t1),yt)=[121qI(yt>ut)]utθi[21qI(yt<lt)1]ltθi\nabla_{\theta_{i}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=-\left[1-\frac{2}{1-q}I(y_{t}>u_{t})\right]\frac{\partial u_{t}}{\partial\theta_{i}}-\left[\frac{2}{1-q}I(y_{t}<l_{t})-1\right]\frac{\partial l_{t}}{\partial\theta_{i}}

As such, computation of θisIS(Pθ(t1),yt)\nabla_{\theta_{i}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) entails evaluation of utθi\frac{\partial u_{t}}{\partial{\theta_{i}}} and ltθi\frac{\partial l_{t}}{\partial{\theta_{i}}}. The derivative utθi\frac{\partial u_{t}}{\partial{\theta_{i}}} can be evaluated by first noting that αu,t=Pθ(t1)(ut)\alpha_{u,t}=P_{{\theta}}^{(t-1)}(u_{t}). Then, using the triple product rule we know that

utθi=utαu,tαu,tθi,\frac{\partial u_{t}}{\partial\theta_{i}}=-\frac{\partial u_{t}}{\partial\alpha_{u,t}}\frac{\partial\alpha_{u,t}}{\partial\theta_{i}},

where the first term can be computed as utαu,t=1p(ut|t1,θ)\frac{\partial u_{t}}{\partial\alpha_{u,t}}=\frac{1}{p\left(u_{t}|\mathcal{F}_{t-1},{\theta}\right)}. The second term αu,tθi=θiPθ(t1)(ut)\frac{\partial\alpha_{u,t}}{\partial\theta_{i}}=\nabla_{\theta_{i}}P_{{\theta}}^{(t-1)}(u_{t}) is evaluated using (S4.4). The derivative ltθi\frac{\partial l_{t}}{\partial{\theta_{i}}} is evaluated in the same fashion as described for utθi\frac{\partial u_{t}}{\partial{\theta_{i}}}.
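The implicit-differentiation step $\frac{\partial u_{t}}{\partial\theta_{i}}=-\nabla_{\theta_{i}}P_{{\theta}}^{(t-1)}(u_{t})/p\left(u_{t}|\mathcal{F}_{t-1},{\theta}\right)$ can be illustrated with a Gaussian predictive (an assumption of this sketch), for which the quantile shifts one-for-one with a location parameter, so the implicit derivative should equal exactly one:

```python
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def quantile(alpha, m, s):
    """u with P(u) = alpha for a N(m, s^2) predictive, by bisection."""
    lo, hi = m - 50.0 * s, m + 50.0 * s
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi((mid - m) / s) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def du_dm(alpha, m, s):
    """du/dm = -(dP/dm) / p(u): the implicit differentiation step."""
    u = quantile(alpha, m, s)
    dP_dm = -phi((u - m) / s) / s   # d/dm of Phi((u - m) / s)
    p_u = phi((u - m) / s) / s      # predictive density at u
    return -dP_dm / p_u
```

A finite difference of the quantile itself gives the same answer, confirming the rule.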

Expressions for $\frac{\partial c_{1,t,k}}{\partial\theta_{i}}$:

\frac{\partial c_{1,t,k}}{\partial\tau_{k}}=\frac{1}{s_{k}}\phi_{1}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)
\frac{\partial c_{1,t,k}}{\partial\beta_{0,k}}=-\frac{\tau_{k}}{s_{k}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)\frac{1}{(1-\beta_{k,1})s_{k}}
\frac{\partial c_{1,t,k}}{\partial\eta_{k}}=\left[-\frac{\tau_{k}}{s_{k}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)\left(\frac{1}{s_{k}}\frac{\partial\mu_{k}}{\partial\beta_{1,k}}+\frac{1}{s_{k}^{2}}(\epsilon_{t-1}-\mu_{k})\frac{\partial s_{k}}{\partial\beta_{1,k}}\right)-\phi_{1}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)\frac{\tau_{k}}{s_{k}^{2}}\frac{\partial s_{k}}{\partial\beta_{1,k}}\right]\frac{\partial\beta_{1,k}}{\partial\eta_{k}}
\frac{\partial c_{1,t,k}}{\partial\kappa_{k}}=\left[-\frac{\tau_{k}}{s_{k}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)\left(\frac{1}{s_{k}}\frac{\partial\mu_{k}}{\partial\sigma_{k}}+\frac{1}{s_{k}^{2}}(\epsilon_{t-1}-\mu_{k})\frac{\partial s_{k}}{\partial\sigma_{k}}\right)-\phi_{1}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)\frac{\tau_{k}}{s_{k}^{2}}\frac{\partial s_{k}}{\partial\sigma_{k}}\right]\frac{\partial\sigma_{k}}{\partial\kappa_{k}}
\frac{\partial c_{1,t,k}}{\partial\mu}=-\frac{\tau_{k}}{s_{k}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)\frac{1}{s_{k}}

Expressions for $\frac{\partial c_{2,t,k}}{\partial\theta_{i}}$:

\frac{\partial c_{2,t,k}}{\partial\tau_{k}}=0
\frac{\partial c_{2,t,k}}{\partial\beta_{0,k}}=-\frac{1}{\sigma_{k}^{2}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)
\frac{\partial c_{2,t,k}}{\partial\eta_{k}}=-\frac{\epsilon_{t-1}}{\sigma_{k}^{2}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)\frac{\partial\beta_{1,k}}{\partial\eta_{k}}
\frac{\partial c_{2,t,k}}{\partial\kappa_{k}}=\left[-\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}^{3}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)-\frac{1}{\sigma_{k}^{2}}\phi_{1}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)\right]\frac{\partial\sigma_{k}}{\partial\kappa_{k}}
\frac{\partial c_{2,t,k}}{\partial\mu}=\frac{-1+\beta_{1,k}}{\sigma_{k}^{2}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)

Expressions for $\frac{\partial c_{3,t,k}}{\partial\theta_{i}}$:

\frac{\partial c_{3,t,k}}{\partial\tau_{k}}=0
\frac{\partial c_{3,t,k}}{\partial\beta_{0,k}}=-\frac{1}{\sigma_{k}}\phi_{1}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)
\frac{\partial c_{3,t,k}}{\partial\eta_{k}}=-\frac{\epsilon_{t-1}}{\sigma_{k}}\phi_{1}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)\frac{\partial\beta_{1,k}}{\partial\eta_{k}}
\frac{\partial c_{3,t,k}}{\partial\kappa_{k}}=-\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}^{2}}\phi_{1}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)\frac{\partial\sigma_{k}}{\partial\kappa_{k}}
\frac{\partial c_{3,t,k}}{\partial\mu}=\frac{-1+\beta_{1,k}}{\sigma_{k}}\phi_{1}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)
Table S1: Derivatives required to implement reparameterization trick for the mixture model. Other required expressions also include μkβ1,k=β0,k(1β1,k)2\frac{\partial\mu_{k}}{\partial\beta_{1,k}}=\frac{\beta_{0,k}}{(1-\beta_{1,k})^{2}}, skβ1,k=β1,kσk(1β1,k2)3/2\frac{\partial s_{k}}{\partial\beta_{1,k}}=\frac{\beta_{1,k}\sigma_{k}}{(1-\beta_{1,k}^{2})^{3/2}}, skσk=1(1β1,k2)1/2\frac{\partial s_{k}}{\partial\sigma_{k}}=\frac{1}{(1-\beta_{1,k}^{2})^{1/2}}, σkκk=exp(κk)\frac{\partial\sigma_{k}}{\partial\kappa_{k}}=\exp\left(\kappa_{k}\right), β1,kηk=2ϕ1(ηk)\frac{\partial\beta_{1,k}}{\partial\eta_{k}}=2\phi_{1}(\eta_{k}). The cross-component derivatives ci,t,kτj\frac{\partial c_{i,t,k}}{\partial\tau_{j}}, ci,t,kβ0,j\frac{\partial c_{i,t,k}}{\partial\beta_{0,j}}, ci,t,kηj\frac{\partial c_{i,t,k}}{\partial\eta_{j}}, ci,t,kκj\frac{\partial c_{i,t,k}}{\partial\kappa_{j}} are all zero for jkj\neq k. Finally, in this table we denote ϕ1(x)=ϕ1(x)x\phi_{1}^{\prime}\left(x\right)=\frac{\partial\phi_{1}\left(x\right)}{\partial x}.

S5 Bayesian Neural Network Predictive Class (Section 5.2)

S5.1 Priors

The priors of the model parameters are set to be improper and flat: $\omega_{k}\sim N(0,\infty)$ and $p(\sigma_{y}^{2})\propto\frac{1}{\sigma_{y}^{2}}$. The standard deviation parameter $\sigma_{y}$ is transformed to the real line as $c=\log\left(\sigma_{y}\right)$, so that the parameter vector is $\theta=\left(\omega^{\prime},c\right)^{\prime}$. The prior density can then be written as $p(\theta)=p(c)\prod_{k=1}^{d}p(\omega_{k})$, where $p(\omega_{k})\propto 1$ and $p(c)\propto 1$. From this prior density we have that $\nabla_{\theta}\log p\left(\theta\right)=0$.

S5.2 Derivation of θSn(θ)\nabla_{\theta}S_{n}\left({\theta}\right) for all scoring rules

As in the previous section, we can show that $\nabla_{\theta}S_{n}\left({\theta}\right)=\sum_{t=1}^{n}\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right)$. Thus, we must find an expression for $\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right)$ for each of the scores.

S5.2.1 Logarithmic score (LS)

The gradient for the LS can be written as

\nabla_{\theta}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\left(\nabla_{\omega}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)^{\prime},\nabla_{c}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)\right)^{\prime},

with

ωsLS(Pθ(t1),yt)=1σy2[ytg(zt;ω)]ωg(zt;ω) and csLS(Pθ(t1),yt)=1+1σy2[ytg(zt;ω)]2.\nabla_{\omega}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\frac{1}{\sigma_{y}^{2}}\left[y_{t}-g\left({z}_{t};{\omega}\right)\right]\nabla_{\omega}g\left({z}_{t};{\omega}\right)\text{ and }\nabla_{c}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=-1+\frac{1}{\sigma_{y}^{2}}\left[y_{t}-g\left({z}_{t};{\omega}\right)\right]^{2}.

The term $\nabla_{\omega}g\left({z}_{t};{\omega}\right)$ can be evaluated analytically through backpropagation.
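For a fixed input, both analytic gradients can be checked against finite differences of the LS itself. In the sketch below the network output $g\left({z}_{t};{\omega}\right)$ is collapsed to a scalar mean $g$, since the backpropagation factor $\nabla_{\omega}g$ is model-specific; the function names are illustrative:

```python
import math

def log_score(y, g, c):
    """LS of the N(g, exp(2c)) predictive, with c = log(sigma_y)."""
    return -0.5 * math.log(2.0 * math.pi) - c - 0.5 * (y - g) ** 2 * math.exp(-2.0 * c)

def grad_g(y, g, c):
    """Analytic gradient with respect to the mean: (y - g) / sigma_y^2."""
    return (y - g) * math.exp(-2.0 * c)

def grad_c(y, g, c):
    """Analytic gradient with respect to c: -1 + (y - g)^2 / sigma_y^2."""
    return -1.0 + (y - g) ** 2 * math.exp(-2.0 * c)

def central_diff(f, x, h=1e-6):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

Both analytic expressions match their finite-difference counterparts at an arbitrary point.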

S5.2.2 Censored logarithmic score (CLS)

For the upper tail CLS the gradient can be written as

θsCLS-A(Pθ(t1),yt)=θsLS(Pθ(t1),yt)I(ytA)+1Pθ(t1)(yq)θPθ(t1)(yq)I(ytAc),\nabla_{\theta}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\nabla_{\theta}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)+\frac{1}{P_{{\theta}}^{(t-1)}(y_{q})}\nabla_{\theta}P_{{\theta}}^{(t-1)}(y_{q})I(y_{t}\in A^{c}),

where θPθ(t1)(yq)=(ωPθ(t1)(yq),cPθ(t1)(yq))\nabla_{\theta}P_{{\theta}}^{(t-1)}(y_{q})=\left(\nabla_{\omega}P_{{\theta}}^{(t-1)}(y_{q})^{{}^{\prime}},\nabla_{c}P_{{\theta}}^{(t-1)}(y_{q})\right)^{{}^{\prime}}, ωPθ(t1)(yq)=ϕ1(yq;g(zt;ω),σy2)ωg(zt;ω)\nabla_{\omega}P_{{\theta}}^{(t-1)}(y_{q})=-\phi_{1}\left(y_{q};g\left({z}_{t};{\omega}\right),\sigma_{y}^{2}\right)\nabla_{\omega}g\left({z}_{t};{\omega}\right) and cPθ(t1)(yq)=ϕ1(yq;g(zt;ω),σy2)(yqg(zt;ω))\nabla_{c}P_{{\theta}}^{(t-1)}(y_{q})=-\phi_{1}\left(y_{q};g\left({z}_{t};{\omega}\right),\sigma_{y}^{2}\right)(y_{q}-g\left({z}_{t};{\omega}\right)). The gradient for the lower tail CLS is

θsCLS-A(Pθ(t1),yt)=θsLS(Pθ(t1),yt)I(ytA)11Pθ(t1)(yq)θPθ(t1)(yq)I(ytAc).\nabla_{\theta}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\nabla_{\theta}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)-\frac{1}{1-P_{{\theta}}^{(t-1)}(y_{q})}\nabla_{\theta}P_{{\theta}}^{(t-1)}(y_{q})I(y_{t}\in A^{c}).

S5.2.3 Interval score (IS)

The gradient of the IS can be written as

\nabla_{\theta}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\left(\nabla_{\omega}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)^{\prime},\nabla_{c}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)\right)^{\prime},

with elements ωsIS(Pθ(t1),yt)\nabla_{\omega}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) and csIS(Pθ(t1),yt)\nabla_{c}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) defined as:

ωsIS(Pθ(t1),yt)\displaystyle\nabla_{\omega}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) =[121qI(yt>ut)]utω[21qI(yt<lt)1]ltω\displaystyle=-\left[1-\frac{2}{1-q}I(y_{t}>u_{t})\right]\frac{\partial u_{t}}{\partial\omega}^{\prime}-\left[\frac{2}{1-q}I(y_{t}<l_{t})-1\right]\frac{\partial l_{t}}{\partial\omega}^{\prime}
csIS(Pθ(t1),yt)\displaystyle\nabla_{c}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) =[121qI(yt>ut)]utc[21qI(yt<lt)1]ltc\displaystyle=-\left[1-\frac{2}{1-q}I(y_{t}>u_{t})\right]\frac{\partial u_{t}}{\partial c}-\left[\frac{2}{1-q}I(y_{t}<l_{t})-1\right]\frac{\partial l_{t}}{\partial c}

The derivative utω\frac{\partial u_{t}}{\partial{\omega}} can be evaluated by first noting that αu,t=Pθ(t1)(ut)\alpha_{u,t}=P_{{\theta}}^{(t-1)}(u_{t}). Then, using the triple product rule we know that

utω=utαu,tαu,tω,\frac{\partial u_{t}}{\partial\omega}=-\frac{\partial u_{t}}{\partial\alpha_{u,t}}\frac{\partial\alpha_{u,t}}{\partial\omega},

where the first term can be computed as utαu,t=1p(ut|t1,θ)\frac{\partial u_{t}}{\partial\alpha_{u,t}}=\frac{1}{p\left(u_{t}|\mathcal{F}_{t-1},{\theta}\right)}. The second term αu,tω=ωPθ(t1)(ut)\frac{\partial\alpha_{u,t}}{\partial\omega}=\nabla_{\omega}P_{{\theta}}^{(t-1)}(u_{t}) is evaluated as in the subsection above. The derivative ltω\frac{\partial l_{t}}{\partial{\omega}} is evaluated in the same fashion as described for utω\frac{\partial u_{t}}{\partial{\omega}}. The corresponding derivatives for parameter cc can also be computed using similar steps.
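As with the mixture model, the IS gradient can be validated end-to-end on a Gaussian predictive, where $u_{t}$ and $l_{t}$ shift one-for-one with the mean, so the implicit derivatives equal one for the mean parameter. A sketch (the central $q$-coverage interval and all function names are assumptions of this illustration):

```python
import math

def Phi_inv(p):
    """Inverse standard normal CDF by bisection (illustrative)."""
    lo, hi = -40.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def interval_score(y, m, s, q):
    """Positively oriented IS of the central q-coverage interval of N(m, s^2)."""
    a = 1.0 - q
    l = m + s * Phi_inv(a / 2.0)
    u = m + s * Phi_inv(1.0 - a / 2.0)
    penalty = (2.0 / a) * ((l - y) if y < l else 0.0) \
            + (2.0 / a) * ((y - u) if y > u else 0.0)
    return -((u - l) + penalty)

def grad_m(y, m, s, q):
    """Gradient formula from the text with du/dm = dl/dm = 1."""
    a = 1.0 - q
    l = m + s * Phi_inv(a / 2.0)
    u = m + s * Phi_inv(1.0 - a / 2.0)
    return -(1.0 - (2.0 / a) * (1.0 if y > u else 0.0)) \
           - ((2.0 / a) * (1.0 if y < l else 0.0) - 1.0)
```

When $y_{t}$ lies inside the interval the gradient is zero (the interval width does not depend on the mean); outside it, the gradient matches a finite difference of the score.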