
Loss-Based Variational Bayes Prediction

David T. Frazier ([email protected])
Rubén Loaiza-Maya ([email protected])
Gael M. Martin ([email protected])
Bonsoo Koo ([email protected])
Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia, 3800
Abstract

We propose a new approach to Bayesian prediction that caters for models with a large number of parameters and is robust to model misspecification. Given a class of high-dimensional (but parametric) predictive models, this new approach constructs a posterior predictive using a variational approximation to a generalized posterior that is directly focused on predictive accuracy. The theoretical behavior of the new prediction approach is analyzed and a form of optimality demonstrated. Applications to both simulated and empirical data using high-dimensional Bayesian neural network and autoregressive mixture models demonstrate that the approach provides more accurate results than various alternatives, including misspecified likelihood-based predictions.

Keywords: Loss-based Bayesian forecasting; variational inference; generalized (Gibbs) posteriors; proper scoring rules; Bayesian neural networks; M4 forecasting competition

1 Introduction

The conventional paradigm for Bayesian prediction is underpinned by the assumption that the true data generating process is either equivalent to the predictive model adopted, or spanned by a finite set of models over which we average. Of late, however, recognition of the unrealistic nature of such an assumption, allied with an increased interest in driving prediction by problem-specific measures of accuracy, or loss, has led to alternative approaches. Whilst antecedents of these new principles are found in the ‘probably approximately correct’ (PAC)-Bayes approach to prediction in the machine learning literature (see Alquier, 2021, for a review), it is in the statistics and econometrics literature that this ‘loss-based prediction’ has come to more formal maturity, including in terms of its theoretical validation. This includes Bayesian work on weighted combinations of predictions, such as, for example, Billio et al. (2013), Casarin et al. (2015), Pettenuzzo and Ravazzolo (2016), Bassetti et al. (2018), Baştürk et al. (2019), McAlinn and West (2019) and McAlinn et al. (2020), where weights are updated via various predictive criteria, and the true model is not assumed to be one of the constituent models - i.e. an $\mathcal{M}$-open state of the world (Bernardo and Smith, 1994) is implicitly adopted. It also includes a contribution by Loaiza-Maya et al. (2020), in which both single models and predictive mixtures are used to generate accurate Bayesian predictions in the presence of model misspecification; with both theoretical and numerical results highlighting the ability of the approach to out-perform conventional likelihood-based Bayesian predictive methods.

However, as a general rule, the existing approaches discussed above do not scale well to complex models with high-dimensional parameter spaces. In contrast, the current paper contributes to the above literature by providing a new approach for producing accurate loss-based predictions in high-dimensional problems. We begin by defining a class of flexible predictive models, conditional on a set of unknown parameters, that are a plausible mechanism for generating probabilistic predictions. A prior distribution is placed over the parameters of this predictive class, and the prior then updated to a posterior via a criterion function that captures a user-specified measure of predictive accuracy. That is, the conventional, and potentially misspecified likelihood-based update is eschewed, in favour of a function that is tailored to the predictive problem at hand; the ultimate goal being to produce accurate predictions according to the measure that matters, without requiring knowledge of the true data generating mechanism.

In the spirit of the various generalized Bayesian inferential methods, in which likelihood functions are also replaced by alternative updating mechanisms (inter alia, Chernozhukov and Hong, 2003, Zhang, 2006a, Zhang, 2006b, Jiang and Tanner, 2008, Bissiri et al., 2016, Giummolè et al., 2017, Knoblauch et al., 2019, Miller and Dunson, 2019, Syring and Martin, 2019, and Pacchiardi and Dutta, 2021), we adopt a coherent update based on the exponential of a scaled sample loss; we refer to Miller (2021) for a thorough discussion of the large sample behavior of generalized Bayesian posteriors. As in Loaiza-Maya et al. (2020) the loss is, in turn, defined by a proper scoring rule (Gneiting et al., 2007; Gneiting and Raftery, 2007) that rewards a given form of predictive accuracy; for example, accurate prediction of extreme values. Given the high-dimensional nature of the resultant posterior, numerical treatment via ‘exact’ Markov Chain Monte Carlo (MCMC) is computationally challenging; hence we adopt an ‘approximate’ approach using variational principles. Since the posterior that results from an exponentiated loss was first denoted as a ‘Gibbs posterior’ by Zhang (2006a), we refer to the variational approximation of this posterior as the Gibbs variational posterior, and the predictive distribution that results from this posterior, via the standard Bayesian calculus, as the Gibbs variational predictive (hereafter, GVP). With a slight abuse of terminology, and when it is clear from the context, we also use the abbreviation GVP to reference the method of Gibbs variational prediction, or loss-based variational prediction per se.

In an artificially simple ‘toy’ example in which MCMC sampling of the exact Gibbs posterior is feasible, we illustrate that there are negligible differences between the out-of-sample results yielded by the GVP and those produced by the predictive based on MCMC sampling from the Gibbs posterior. We establish this result under both correct specification, in which the true data generating process matches the adopted predictive model, and under misspecification of the predictive model. The correspondence between the ‘approximate’ and ‘exact’ predictions in the correct specification case mimics that documented in Frazier et al. (2019), in which posterior approximations are produced by approximate Bayesian computation (ABC) and for the log score update (only). In the misspecified case, ‘strictly coherent’ predictions are produced (Martin et al., 2022), whereby a given GVP, constructed via the use of a particular scoring rule, is shown to perform best out-of-sample according to that same score when compared with a GVP constructed via some alternative scoring rule; with the numerical values of the average out-of-sample scores closely matching those produced by the exact Gibbs predictive. That is, building a generalized posterior via a given scoring rule yields superior predictive accuracy in that rule despite any inaccuracy in the measurement of posterior uncertainty that is induced by the variational approximation.

We then undertake more extensive Monte Carlo experiments to highlight the power of the approach in genuinely high-dimensional problems, with predictives based on an autoregressive mixture model with 20 mixture components, and on a neural network model. An empirical analysis, in which GVP is used to produce accurate prediction intervals for the 4227 daily time series used in the M4 forecasting competition, illustrates the applicability of the method to reasonably large, realistic data sets.

While the numerical toy example demonstrates that little predictive accuracy is lost when using the GVP, relative to the ‘exact’ predictive, we also rigorously compare the theoretical behavior of the GVP and the potentially infeasible exact predictive. Specifically, we demonstrate that in large samples the GVP delivers predictions that are just as accurate as those obtained from the exact Gibbs predictive when measured according to the score used to construct the Gibbs posterior. We do this by proving that the GVP ‘merges’ (in the sense of Blackwell and Dubins, 1962) with the exact Gibbs predictive. This merging result relies on novel posterior concentration results that extend existing concentration results for generalized posteriors (Alquier et al., 2016, Alquier and Ridgway, 2020, and Yang et al., 2020) to temporally dependent, potentially heavy-tailed data.

The remainder of the paper is structured as follows. In Section 2 the loss-based paradigm for Bayesian prediction is defined, with its links to related segments of the literature (as flagged briefly above) detailed. In Section 3 we detail how to construct the GVP, and provide theoretical verification of its accuracy. A numerical illustration of the ability of the variational method to yield essentially equivalent predictive accuracy to that produced via MCMC sampling is given in Section 4, using a low-dimensional example. The approach that we use in the implementation of GVP, including the choice of variational family, is briefly described there, with further computational details of the stochastic gradient ascent (SGA) method used to perform the optimization included in a supplementary appendix to the paper. We then proceed with the numerical illustration of the method in high-dimensional settings - using both artificially simulated and empirical data - in Sections 5 and 6 respectively; with the illustrations highlighting that, overall, the method ‘works’ and ‘works well’. We conclude in Section 7 with a discussion of the implications of our findings, including for future research directions. Proofs of all theoretical results are given in an appendix, and all computational details and prior specifications are included in a supplementary appendix to the paper.

2 Setup and Loss-Based Bayesian Prediction

2.1 Preliminaries and Notation

Consider a stochastic process $\{Y_{t}:\Omega\rightarrow\mathcal{Y},t\in\mathbb{N}\}$ defined on the complete probability space $(\Omega,\mathcal{F},P_{0})$. Let $\mathcal{F}_{t}:=\sigma(Y_{1},\dots,Y_{t})$ denote the natural sigma-field, and let $P_{0}$ denote the infinite-dimensional distribution of the sequence $Y_{1},Y_{2},\dots$. Let $y_{1:n}=(y_{1},\dots,y_{n})^{\prime}$ denote a vector of realizations from the stochastic process.

Our goal is to use a particular collection of statistical models, adapted to $\mathcal{F}_{n}$, that describe the behavior of the observed data, to construct accurate predictions for the random variable $Y_{n+1}$. The parameters of the model are denoted by $\theta_{n}$, and the parameter space by $\Theta_{n}\subseteq\mathbb{R}^{d_{n}}$, where the dimension $d_{n}$ could grow as $n\rightarrow\infty$ and $\Theta_{1}\subseteq\Theta_{2}\subseteq\dots\subseteq\Theta_{n}$. For notational simplicity, we drop the dependence of $\theta_{n}$ and $\Theta_{n}$ on $n$ in what follows. Let $\mathcal{P}^{(n)}$ be a generic class of one-step-ahead predictive models for $Y_{n+1}$, conditioned on the information $\mathcal{F}_{n}$ available at time $n$, such that $\mathcal{P}^{(n)}:=\{P_{\theta}^{(n)},\theta\in\Theta\}$.[1] When $P_{\theta}^{(n)}(\cdot)$ admits a density with respect to the Lebesgue measure, we denote it by $p_{\theta}^{(n)}(\cdot)\equiv p_{\theta}(\cdot|\mathcal{F}_{n})$. The parameter $\theta$ thus indexes values in the predictive class, with $\theta$ taking values in the complete probability space $(\Theta,\mathcal{T},\Pi)$, where $\Pi$ measures our beliefs - either prior or posterior - about the unknown parameter $\theta$; when they exist, we denote the respective densities by $\pi(\theta)$ and $\pi(\theta|y_{1:n})$.

[1] The treatment of scalar $Y_{t}$ and one-step-ahead prediction is for the purpose of illustration only, and all the methodology that follows can easily be extended to multivariate $Y_{t}$ and multi-step-ahead prediction in the usual manner.

Denoting the likelihood function by $p_{\theta}(y_{1:n})$, the conventional approach to Bayesian prediction updates prior beliefs about $\theta$ via Bayes' rule, to form the Bayesian posterior density,

\pi(\theta|y_{1:n})=\frac{p_{\theta}\left(y_{1:n}\right)\pi(\theta)}{\int_{\Theta}p_{\theta}\left(y_{1:n}\right)\pi(\theta)\,d\theta}, \qquad (1)

in which we follow convention and abuse notation by writing this quantity as a density even though, strictly speaking, the density may not exist. The one-step-ahead predictive distribution is then constructed as

P_{\Pi}^{(n)}:=\int_{\Theta}P_{\theta}^{(n)}\,\pi(\theta|y_{1:n})\,\mathrm{d}\theta. \qquad (2)

However, when the class of predictive models indexed by $\Theta$ does not contain the true predictive distribution, there is no sense in which this conventional approach remains the ‘gold standard’. In such cases, the loss that underpins (2) should be replaced by the particular predictive loss that matters for the problem at hand. That is, our prior beliefs about $\theta$ and, hence, about the elements $P_{\theta}^{(n)}$ in $\mathcal{P}^{(n)}$, need to be updated via a criterion function defined by a user-specified measure of predictive loss.

2.2 Bayesian Updating Based on Scoring Rules

Loaiza-Maya et al. (2020) propose a method for producing Bayesian predictions using loss functions that specifically capture the accuracy of density forecasts. For $\mathcal{P}^{(n)}$ a convex class of predictive distributions on $(\Omega,\mathcal{F})$, density prediction accuracy can be measured using the positively-oriented (i.e. higher is better) scoring rule $s:\mathcal{P}^{(n)}\times\mathcal{Y}\mapsto\mathbb{R}$, where the expected scoring rule under the true distribution $P_{0}$ is defined as

\mathbb{S}(\cdot,P_{0}):=\int_{y\in\Omega}s(\cdot,y)\,\mathrm{d}P_{0}(y). \qquad (3)

Since $\mathbb{S}(\cdot,P_{0})$ is unattainable in practice, a sample estimate based on $y_{1:n}$ is used to define a sample criterion: for a given $\theta\in\Theta$, define the sample average score as

S_{n}(\theta):=\sum_{t=0}^{n-1}s(P_{\theta}^{(t)},y_{t+1}). \qquad (4)

Adopting the generalized updating rule proposed by Bissiri et al. (2016) (see also Giummolè et al., 2017, Holmes and Walker, 2017, Lyddon et al., 2019, and Syring and Martin, 2019), Loaiza-Maya et al. (2020) distinguish elements in $\Theta$ using

\pi_{w}(\theta|y_{1:n})=\frac{\exp\left[wS_{n}\left(\theta\right)\right]\pi(\theta)}{\int_{\Theta}\exp\left[wS_{n}\left(\theta\right)\right]\pi(\theta)\,\mathrm{d}\theta}, \qquad (5)

where the scale factor $w$ is obtained in a preliminary step using measures of predictive accuracy. This posterior explicitly weights elements of $\Theta$ according to their predictive accuracy in the scoring rule $s(\cdot,\cdot)$. As such, the one-step-ahead generalized predictive,

P_{\Pi_{w}}^{(n)}:=\int_{\Theta}P_{\theta}^{(n)}\,\pi_{w}(\theta|y_{1:n})\,\mathrm{d}\theta, \qquad (6)

will often outperform, in the chosen rule $s(\cdot,\cdot)$, the likelihood (or log-score)-based predictive $P_{\Pi}^{(n)}$ in cases where the model is misspecified. Given its explicit dependence on the Gibbs posterior in (5), and as noted earlier, we refer to the predictive in (6) as the (exact) Gibbs predictive.[2]

[2] We note that, in general, the choice of $w$ can be $n$-dependent. However, we eschew this dependence to maintain notational brevity.
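To fix ideas, the update in (5) can be evaluated directly in a one-dimensional toy problem. The following sketch is ours, not the paper's: it takes $wS_n(\theta)$ to be the log score of an $N(0,\theta^2)$ predictive with $w=1$, places a flat prior on a grid of $\theta$ values, and computes the normalized Gibbs posterior by simple quadrature.

```python
import numpy as np

# Hypothetical one-parameter illustration of the Gibbs posterior in (5):
# theta is the scale of an N(0, theta^2) predictive, S_n is the log score,
# w = 1, and the prior is flat over the grid. All values are our own.
rng = np.random.default_rng(0)
y = 1.5 * rng.standard_normal(500)            # data with true scale 1.5

def log_score_sum(theta, y):
    """Sample log-score criterion S_n(theta) for N(0, theta^2) predictives."""
    return np.sum(-0.5 * np.log(2 * np.pi * theta**2) - y**2 / (2 * theta**2))

w = 1.0
grid = np.linspace(0.5, 3.0, 400)
dx = grid[1] - grid[0]
log_post = np.array([w * log_score_sum(t, y) for t in grid])
log_post -= log_post.max()                    # stabilize before exponentiating
post = np.exp(log_post)
post /= post.sum() * dx                       # normalize, as in the denominator of (5)
posterior_mean = (grid * post).sum() * dx
```

With a flat prior and the log score, the posterior concentrates near the maximum-likelihood scale, so `posterior_mean` lands close to 1.5.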

3 Gibbs Variational Prediction

3.1 Overview

While $\Pi_{w}(\cdot|y_{1:n})$ is our ideal posterior, it can be difficult to sample from if the dimension of $\theta$ is even moderately large, as occurs in situations with a large number of predictors, or in flexible models. Therefore, the exact predictive itself is not readily available in such cases. However, this does not invalidate the tenet on which $P_{\Pi_{w}}^{(n)}$ is constructed. Viewed in this way, we see that the problem of predictive inference via $P_{\Pi_{w}}^{(n)}$ could be solved if one were able to construct an accurate enough approximation to $\Pi_{w}(\cdot|y_{1:n})$. Herein, we propose to approximate $P_{\Pi_{w}}^{(n)}$ using variational Bayes (VB).

VB seeks to approximate $\Pi_{w}(\cdot|y_{1:n})$ by finding the member of a class of probability measures, denoted $\mathcal{Q}$, that is closest to $\Pi_{w}(\cdot|y_{1:n})$ in a chosen divergence measure. The most common choice of divergence is the Kullback-Leibler (KL) divergence: for probability measures $P,Q$, with $P$ absolutely continuous with respect to $Q$, the KL divergence is given by $\text{D}(P||Q)=\int\log(\mathrm{d}P/\mathrm{d}Q)\,\mathrm{d}P$. VB then attempts to produce a posterior approximation to $\Pi_{w}$ by choosing $Q$ to minimize:

\text{D}\left(Q||\Pi_{w}\right)=\int\log\left[\mathrm{d}Q/\mathrm{d}\Pi_{w}(\cdot|y_{1:n})\right]\mathrm{d}Q.

In practice, VB is often conducted, in an equivalent manner, by maximizing the so-called evidence lower bound (ELBO). In the case where $\Pi_{w}$ and $Q$ admit densities $\pi_{w}$ and $q$, respectively, the ELBO is given by:

\text{ELBO}[Q||\Pi_{w}]:=\mathbb{E}_{q}\left[\log\left\{\exp\left[wS_{n}(\theta)\right]\pi(\theta)\right\}\right]-\mathbb{E}_{q}\left[\log\left\{q(\theta)\right\}\right]. \qquad (7)
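The expectation in (7) is with respect to $q$, so it can be estimated by Monte Carlo using draws from the variational density. The following sketch (our own toy, not the paper's implementation) does this for a scalar $\theta$ and a Gaussian $q$; `loss` stands in for $wS_n(\theta)$ and `log_prior` for $\log\pi(\theta)$.

```python
import numpy as np

# Hypothetical Monte Carlo estimate of the ELBO in (7) for a scalar theta
# and a Gaussian variational density q = N(mu, sigma^2). Names are ours.
rng = np.random.default_rng(1)

def elbo_estimate(mu, sigma, loss, log_prior, n_draws=5000):
    theta = mu + sigma * rng.standard_normal(n_draws)      # draws from q
    log_q = -0.5 * np.log(2 * np.pi * sigma**2) - (theta - mu)**2 / (2 * sigma**2)
    return np.mean(loss(theta) + log_prior(theta) - log_q)

# Toy check: with w * S_n(theta) = -theta^2 / 2 and a flat prior, the Gibbs
# posterior is N(0, 1), so the ELBO is maximized at (mu, sigma) = (0, 1).
loss = lambda th: -0.5 * th**2
flat = lambda th: 0.0
elbo_opt = elbo_estimate(0.0, 1.0, loss, flat)
elbo_sub = elbo_estimate(1.0, 0.5, loss, flat)
```

At the optimum the integrand is constant, so `elbo_opt` equals the log normalizing constant, $\tfrac{1}{2}\log(2\pi)$, and exceeds `elbo_sub`.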

While other classes of divergences are used in variational inference, such as the Rényi-divergence, the KL divergence is the most commonly encountered in the literature. Hence, in what follows we focus on this choice, but note here that other divergences can also be used.

The variational approximation to the Gibbs posterior, and the resulting approximation to $P_{\Pi_{w}}^{(n)}$, can be defined as follows.

Definition 1

For $\mathcal{Q}$ a class of probability distributions, the variational posterior $\widehat{Q}$ satisfies

\widehat{Q}:=\operatorname*{argmin}_{Q\in\mathcal{Q}}\text{D}\left[Q||\Pi_{w}(\cdot|y_{1:n})\right],

and the Gibbs variational predictive (GVP) is defined as

P_{Q}^{(n)}:=\int_{\Theta}P_{\theta}^{(n)}\,\mathrm{d}\widehat{Q}(\theta).
Remark 1

In general, we only require that $\widehat{Q}$ is an approximate minimizer, i.e., $\text{D}[\widehat{Q}||\Pi_{w}(\cdot|y_{1:n})]\leq\inf_{Q\in\mathcal{Q}}\text{D}[Q||\Pi_{w}(\cdot|y_{1:n})]+o_{p}(1)$. The latter may be useful in cases where $\widehat{Q}$ may not be unique, or in cases where it does not exist for a given $n$ but is well-behaved for all $n$ large enough.

The GVP, $P_{Q}^{(n)}$, circumvents the need to construct $P_{\Pi_{w}}^{(n)}$ by sampling from the Gibbs posterior $\Pi_{w}(\cdot|y_{1:n})$. In essence, we replace the sampling problem with an optimization problem, for which reliable methods exist even if $\Theta$ is high-dimensional, and which in turn yields an approximation to $\Pi_{w}(\cdot|y_{1:n})$. Consequently, even though it is not always feasible to access $P_{\Pi_{w}}^{(n)}$ via simulation from $\Pi_{w}(\cdot|y_{1:n})$ when $\Theta$ is high-dimensional, access to the variational predictive $P_{Q}^{(n)}$ remains feasible.
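Once a variational posterior is in hand, forming the GVP amounts to averaging the predictive distributions over draws of $\theta$. A minimal sketch, with a hypothetical $\widehat{Q}=N(0.5,0.1^2)$ and an $N(\theta,1)$ predictive family standing in for the model (both our own choices):

```python
import numpy as np
from scipy.stats import norm

# Illustrative sketch of the mixture in Definition 1: draw theta from the
# variational posterior Q-hat and average the predictive densities P_theta.
# The family N(theta, 1) and Q-hat = N(0.5, 0.1^2) are hypothetical.
rng = np.random.default_rng(4)
theta_draws = rng.normal(0.5, 0.1, size=2000)      # theta ~ Q-hat

def gvp_density(y_grid):
    # average of the predictive densities over the variational draws
    return norm.pdf(y_grid[:, None], loc=theta_draws, scale=1.0).mean(axis=1)

grid = np.linspace(-4.0, 5.0, 901)
dx = grid[1] - grid[0]
dens = gvp_density(grid)
```

The resulting mixture integrates to (approximately) one over the grid and is centered near the variational mean.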

This variational approach to prediction is related to the ‘generalized variational inference’ approach of Knoblauch et al. (2019), but with two main differences. First, our approach is focused on predictive accuracy, not on parameter inference: our only goal is to produce density forecasts that are as accurate as possible in the chosen scoring rule $s(\cdot,y)$.[3] Second, our approach follows the idea of Bissiri et al. (2016) and Loaiza-Maya et al. (2020) and targets the predictions built from the Gibbs posterior in (5), rather than the exponentiated form of some general loss as in Knoblauch et al. (2019). This latter point is certainly critical if inferential accuracy were still deemed to be important, since without the tempering constant $w$ that defines the posterior in (5), the exponentiated loss function can be very flat and the posterior $\Pi_{w}(\cdot|y_{1:n})$ uninformative about $\theta$.[4] It is our experience, however (Loaiza-Maya et al., 2020), that in settings where predictive accuracy is the only goal, the choice of $w$ has little impact on the generation of accurate predictions via $P_{\Pi_{w}}^{(n)}$, as long as the sample size is reasonably large. Preliminary experimentation in the current setting, in which the variational predictive $P_{Q}^{(n)}$ is the target, suggests that this finding remains relevant; as a consequence, we have adopted a default value of $w=1$ in all numerical work. Further research is needed to deduce the precise impact of $w$ on $P_{Q}^{(n)}$, in particular for smaller sample sizes, and we leave this important topic for future work.[5]

[3] Critically, since predictive accuracy is our goal, we are not concerned with the potential over/under-coverage of credible sets for the model unknowns built from generalized posteriors, or VB posteriors; see Miller (2021) for a theoretical discussion of posterior coverage in generalized Bayesian inference.
[4] See Bissiri et al. (2016), Giummolè et al. (2017), Holmes and Walker (2017), Lyddon et al. (2019), Syring and Martin (2019) and Pacchiardi and Dutta (2021) for various approaches to the setting of $w$ in inferential settings.
[5] We note that the work of Wu and Martin (2021) attempts to calibrate the coverage of posterior predictives constructed from power posteriors. While their approach is not directly applicable in our context, we conjecture that an appropriately modified version could be applied in our setting to derive posterior predictives with reasonable predictive coverage.

3.2 Accuracy of the GVP

By driving the updating mechanism with the measure of predictive accuracy that matters, it is hoped that GVP produces accurate predictions without requiring correct model specification. The following result demonstrates that, in large samples, the GVP, $P^{(n)}_{Q}$, produces predictions that are indistinguishable from the exact predictive, $P^{(n)}_{\Pi_{w}}$; thus the use of a variational approximation to the Gibbs posterior has little impact on the accuracy of the resulting posterior predictive. For probability measures $P,Q$, let $d_{\text{TV}}\{P,Q\}$ denote the total variation distance.

Theorem 1

Under Assumptions 1-3 in Appendix A, $d_{\text{TV}}\{P_{Q}^{(n)},P_{\Pi_{w}}^{(n)}\}\rightarrow 0$ in probability.

Theorem 1 demonstrates that distributional predictions made using the GVP and the exact Gibbs predictive agree, in the rule $s(\cdot,\cdot)$, in large samples. This type of result is colloquially known as a ‘merging’ result (Blackwell and Dubins, 1962), and means that if we take the exact Gibbs predictive as our benchmark for (Bayesian) prediction, then the GVP is as accurate as this possibly infeasible benchmark (at least in large samples).

Remark 2

As an intermediate step in the proof of Theorem 1, we must demonstrate posterior concentration of $\widehat{Q}$, and $\Pi_{w}$, in cases where the data is temporally dependent, possibly heavy-tailed, and where $S_{n}(\theta)$ is an arbitrary scoring rule. For the sake of brevity, and to keep the focus on the practical usefulness of this approach, we present these details in Appendix A. However, we note that our focus on the types of data often encountered in forecasting, for example, financial, economic, or atmospheric processes, does not allow us to use existing approaches to guarantee posterior concentration of $\widehat{Q}$ and $\Pi_{w}$. Instead, we must rely on the smoothness of $S_{n}(\theta)$ and the control of certain remainder terms to demonstrate this concentration. We refer the interested reader to Appendix A for full details and discussion.

As a further result, we can demonstrate that GVP is as accurate as the infeasible frequentist ‘optimal [distributional] predictive’ approach suggested in Gneiting and Raftery (2007). Following Gneiting and Raftery (2007), define the ‘optimal’ (frequentist) predictive within the class $\mathcal{P}^{(n)}$, and based on scoring rule $s(\cdot,y)$, as $P_{\star}^{(n)}:=P(\cdot|\mathcal{F}_{n},\theta_{\star})$, where

\theta_{\star}:=\arg\max_{\theta\in\Theta}\mathcal{S}(\theta), \text{ with } \mathcal{S}(\theta)=\lim_{n\rightarrow\infty}\mathbb{E}\left[S_{n}(\theta)/n\right]. \qquad (8)

The following result demonstrates that, in large samples, predictions produced using the GVP are equivalent to those made using the optimal frequentist predictive $P_{\star}^{(n)}$.

Theorem 2

Under Assumptions 1-3 in Appendix A, $d_{\text{TV}}\{P_{Q}^{(n)},P_{\star}^{(n)}\}\rightarrow 0$ in probability.

We remind the reader that, for the sake of brevity, statement of the assumptions, and proofs of all stated results, are given in Appendix A.

4 A Toy Model of Financial Returns

We now illustrate the behavior of GVP in a simple toy example for a financial return, $Y_{t}$, in which the exact Gibbs predictive is also accessible.[6] This example serves two purposes. First, it illustrates Theorem 2 and, in so doing, highlights the benefits of GVP relative to prediction based on a misspecified likelihood function. Second, the GVP is shown to yield almost identical (average) out-of-sample predictions to those produced by the exact Gibbs posterior accessed via MCMC; i.e. numerical support for Theorem 1 is provided.

[6] We reiterate here, and without subsequent repetition, that all details of the prior specifications and the numerical steps required to produce the variational approximations, for this and the following numerical examples, are provided in the supplementary appendices.

The predictive class, $\mathcal{P}^{(n)}$, is defined by a generalized autoregressive conditional heteroscedastic, GARCH(1,1), model with Gaussian errors: $Y_{t}=\theta_{1}+\sigma_{t}\varepsilon_{t}$, $\varepsilon_{t}\overset{i.i.d.}{\sim}N(0,1)$, $\sigma_{t}^{2}=\theta_{2}+\theta_{3}(Y_{t-1}-\theta_{1})^{2}+\theta_{4}\sigma_{t-1}^{2}$, with $\theta=(\theta_{1},\theta_{2},\theta_{3},\theta_{4})^{\prime}$. We adopt three alternative specifications for the true data generating process (DGP): 1) a model that matches the Gaussian GARCH(1,1) predictive class; 2) a stochastic volatility model with leverage:

Y_{t}=\exp\left(\frac{h_{t}}{2}\right)\varepsilon_{t},\quad h_{t}=-2+0.7\left(h_{t-1}-(-2)\right)+\eta_{t},\quad\left(\varepsilon_{t},\eta_{t}\right)^{\prime}\overset{i.i.d.}{\sim}N\left(0,\begin{bmatrix}1&-0.35\\-0.35&0.25\end{bmatrix}\right);

and 3) a stochastic volatility model with a smooth transition in the volatility autoregression:

Y_{t}=\exp\left(\frac{h_{t}}{2}\right)\varepsilon_{t},\quad h_{t}=0.9\,g\left(h_{t-1}\right)h_{t-1}+\eta_{t},\quad\eta_{t}\overset{i.i.d.}{\sim}N\left(0,0.25\right),

where $g(x)=\left(1+\exp\left(-2x\right)\right)^{-1}$. DGP 1) defines a correct specification setting, whilst DGPs 2) and 3) characterize different forms of misspecification.
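For concreteness, the three DGPs can be simulated as follows. This is an illustrative sketch: the function names are ours, and the GARCH(1,1) parameter values for DGP 1 are our own choices, since the text leaves them unspecified; the parameter values for DGPs 2 and 3 follow the displays above.

```python
import numpy as np

# Hedged simulation sketch of the three DGPs; the GARCH parameters in
# sim_garch are illustrative assumptions, not values from the paper.
rng = np.random.default_rng(0)

def sim_garch(n, theta1=0.0, theta2=0.05, theta3=0.1, theta4=0.85):
    """DGP 1: Gaussian GARCH(1,1), matching the predictive class."""
    y = np.empty(n)
    sig2 = theta2 / (1.0 - theta3 - theta4)       # start at unconditional variance
    for t in range(n):
        y[t] = theta1 + np.sqrt(sig2) * rng.standard_normal()
        sig2 = theta2 + theta3 * (y[t] - theta1)**2 + theta4 * sig2
    return y

def sim_sv_leverage(n):
    """DGP 2: stochastic volatility with leverage (contemporaneous shocks)."""
    cov = np.array([[1.0, -0.35], [-0.35, 0.25]])
    eps, eta = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
    h, h_prev = np.empty(n), -2.0
    for t in range(n):
        h[t] = -2.0 + 0.7 * (h_prev - (-2.0)) + eta[t]
        h_prev = h[t]
    return np.exp(h / 2.0) * eps

def sim_sv_smooth(n):
    """DGP 3: stochastic volatility with a smooth-transition autoregression."""
    g = lambda x: 1.0 / (1.0 + np.exp(-2.0 * x))
    eps = rng.standard_normal(n)
    eta = np.sqrt(0.25) * rng.standard_normal(n)
    h, h_prev = np.empty(n), 0.0
    for t in range(n):
        h[t] = 0.9 * g(h_prev) * h_prev + eta[t]
        h_prev = h[t]
    return np.exp(h / 2.0) * eps

y_garch = sim_garch(1000)
y_svl = sim_sv_leverage(1000)
y_svs = sim_sv_smooth(1000)
```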

Denoting the predictive density function associated with the Gaussian GARCH(1,1) model, evaluated at the observed $y_{t+1}$, as $p_{\theta}(y_{t+1}|\mathcal{F}_{t})$, we implement GVP using four alternative forms of (positively-oriented) scoring rules:

s^{\text{LS}}\left(P_{\theta}^{(t)},y_{t+1}\right)=\log p_{\theta}(y_{t+1}|\mathcal{F}_{t}), \qquad (9)
s^{\text{CRPS}}\left(P_{\theta}^{(t)},y_{t+1}\right)=-\int_{-\infty}^{\infty}\left[P_{\theta}^{(t)}(y)-I(y\geq y_{t+1})\right]^{2}dy, \qquad (10)
s^{\text{CLS-A}}\left(P_{\theta}^{(t)},y_{t+1}\right)=\log p_{\theta}(y_{t+1}|\mathcal{F}_{t})\,I\left(y_{t+1}\in A\right)+\left[\ln\int_{A^{c}}p_{\theta}(y|\mathcal{F}_{t})dy\right]I\left(y_{t+1}\in A^{c}\right), \qquad (11)
s^{\text{IS}}\left(P_{\theta}^{(t)},y_{t+1}\right)=\left(u_{t+1}-l_{t+1}\right)+\frac{2}{\alpha}\left(l_{t+1}-y_{t+1}\right)I\left(y_{t+1}<l_{t+1}\right)+\frac{2}{\alpha}\left(y_{t+1}-u_{t+1}\right)I\left(y_{t+1}>u_{t+1}\right), \qquad (12)

where $l_{t+1}$ and $u_{t+1}$ denote the $100(\frac{\alpha}{2})\%$ and $100(1-\frac{\alpha}{2})\%$ predictive quantiles. The log-score (LS) in (9) is a ‘local’ scoring rule, attaining a high value if the observed value, $y_{t+1}$, is in the high density region of $p_{\theta}(y_{t+1}|\mathcal{F}_{t})$. The continuously ranked probability score (CRPS) in (10) (Gneiting and Raftery, 2007) is, in contrast, sensitive to distance, and rewards the assignment of high predictive mass near to the realized $y_{t+1}$, rather than just at that value. The score in (11) is the censored likelihood score (CLS) of Diks et al. (2011), which rewards predictive accuracy over any pre-specified region of interest $A$ ($A^{c}$ denoting its complement). We use the score for $A$ defining the lower and upper tails of the predictive distribution, as determined in turn by the 10%, 20%, 80% and 90% percentiles of the empirical distribution of $Y_{t}$, labelling these cases hereafter as CLS10, CLS20, CLS80 and CLS90. A high value of this score in any particular instance thus reflects a predictive distribution that accurately predicts extreme values of the financial return. The last score considered is the interval score (IS) in (12), which is designed to measure the accuracy of the $100(1-\alpha)\%$ predictive interval, where $\alpha=0.05$; this score rewards narrow intervals with accurate coverage. All components of (11) have closed-form solutions for the (conditionally) Gaussian predictive model, as do the integral in (10) and the bounds in (12).
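For a Gaussian one-step predictive with mean $\mu$ and standard deviation $\sigma$, these scores can be sketched as below. The function names are ours; the closed-form Gaussian expression for the CRPS is the standard one, and the lower-tail version of the CLS is shown as one instance of (11). The interval score is coded exactly as written in (12), in which smaller values correspond to narrower, better-calibrated intervals.

```python
import numpy as np
from scipy.stats import norm

# Hedged sketch of the scoring rules (9)-(12) for an N(mu, sigma^2)
# predictive; function names and the lower-tail choice of A are ours.
def s_ls(y, mu, sigma):
    """Log score (9)."""
    return norm.logpdf(y, mu, sigma)

def s_crps(y, mu, sigma):
    """Positively-oriented CRPS (10), Gaussian closed form."""
    z = (y - mu) / sigma
    crps = sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
    return -crps

def s_cls_lower(y, mu, sigma, a):
    """Censored log score (11) with A = (-inf, a] (lower-tail accuracy)."""
    if y <= a:
        return norm.logpdf(y, mu, sigma)
    return np.log(1.0 - norm.cdf(a, mu, sigma))   # predictive mass on A^c

def s_is(y, mu, sigma, alpha=0.05):
    """Interval score, as written in (12)."""
    l = norm.ppf(alpha / 2, mu, sigma)
    u = norm.ppf(1 - alpha / 2, mu, sigma)
    return (u - l) + (2 / alpha) * (l - y) * (y < l) + (2 / alpha) * (y - u) * (y > u)
```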

In total, then, seven distinct scoring rules are used to define the sample criterion function in (4). In what follows we reference these seven criterion functions, and the associated Gibbs posteriors, using the notation $S_{n}^{j}(\theta)=\sum_{t=0}^{n-1}s^{j}(P_{\theta}^{(t)},y_{t+1})$, for $j\in\{\text{LS, CRPS, CLS10, CLS20, CLS80, CLS90, IS}\}$.

4.1 Estimation of the Gibbs Predictives

Given the low dimension of the predictive model, it is straightforward to use an MCMC scheme to sample from the exact Gibbs posterior $\pi_{w}^{j}(\theta|y_{1:n})\propto\exp\{wS_{n}^{j}(\theta)\}\pi(\theta)$, where $\pi_{w}^{j}(\theta|y_{1:n})$ is the exact Gibbs posterior density in (5) computed under scoring rule $j$. As noted earlier, in this and all following numerical work we set $w=1$. For each of the $j$ posteriors, we initialize the chains using a burn-in period of 20000 iterations, and retain the next $M=20000$ draws $\theta^{(m,j)}\sim\pi_{w}^{j}(\theta|y_{1:n})$, $m=1,\dots,M$. The posterior draws are then used to estimate the exact Gibbs predictive in (6) via $\hat{P}_{\Pi_{w}}^{(n,j)}=\frac{1}{M}\sum_{m=1}^{M}P_{\theta^{(m,j)}}^{(n)}$.
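The particular MCMC scheme is not spelled out here, but a generic random-walk Metropolis sampler targeting a Gibbs posterior can be sketched as follows. This is our own toy, not the paper's sampler: $\theta$ is scalar, $wS_n(\theta)$ is the log score of an $N(\theta,1)$ predictive with a flat prior (so the target is $N(\bar{y},1/n)$ and the chain can be checked), and the step size, burn-in, and chain length are illustrative.

```python
import numpy as np

# Hypothetical random-walk Metropolis sketch for a Gibbs posterior
# proportional to exp(w * S_n(theta)) * pi(theta); all settings are ours.
rng = np.random.default_rng(3)
y = 0.3 + rng.standard_normal(400)
w = 1.0

def log_gibbs(theta):
    # log score of an N(theta, 1) predictive, flat prior, constants dropped
    return w * np.sum(-0.5 * (y - theta) ** 2)

theta, cur = 0.0, log_gibbs(0.0)
draws = []
for m in range(25000):
    prop = theta + 0.1 * rng.standard_normal()   # random-walk proposal
    new = log_gibbs(prop)
    if np.log(rng.uniform()) < new - cur:        # Metropolis accept/reject
        theta, cur = prop, new
    if m >= 5000:                                # discard burn-in draws
        draws.append(theta)
draws = np.asarray(draws)
```

For this conjugate-style toy, the retained draws have mean close to $\bar{y}$ and standard deviation close to $1/\sqrt{n}=0.05$.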

To perform GVP, we first need to produce the variational approximation of \pi_{w}^{j}(\theta|y_{1:n}). This is achieved in several steps. First, the parameters of the GARCH(1,1) model are transformed to the real line. With some abuse of notation, the GARCH(1,1) parameters introduced in the previous section are given the superscript r to signal ‘raw’. The parameter vector \theta is then re-defined as \theta=(\theta_{1},\theta_{2},\theta_{3},\theta_{4})'=(\theta_{1}^{r},\log(\theta_{2}^{r}),\Phi_{1}^{-1}(\theta_{3}^{r}),\Phi_{1}^{-1}(\theta_{4}^{r}))', where \Phi_{1} denotes the (univariate) normal cumulative distribution function (cdf). The next step involves approximating \pi_{w}^{j}(\theta|y_{1:n}) for the re-defined \theta. We adopt the mean-field variational family (see for example Blei et al., 2017), with a product-form Gaussian density q_{\lambda}(\theta)=\prod_{i=1}^{4}\phi_{1}(\theta_{i};\mu_{i},d_{i}^{2}), where \lambda=(\mu',d')' is the vector of variational parameters, comprised of the mean and standard deviation vectors \mu=(\mu_{1},\mu_{2},\mu_{3},\mu_{4})' and d=(d_{1},\dots,d_{4})' respectively, and \phi_{1} denotes the (univariate) normal probability density function (pdf). Denote by Q_{\lambda} and \Pi_{w}^{j} the distribution functions associated with q_{\lambda}(\theta) and \pi_{w}^{j}(\theta|y_{1:n}), respectively. The approximation is then calibrated by solving the maximization problem

\tilde{\lambda}=\arg\max_{\lambda\in\Lambda}\mathcal{L}(\lambda), (13)

where \mathcal{L}(\lambda):=\text{ELBO}[Q_{\lambda}||\Pi_{w}^{j}]. Recalling that S_{n}^{j}(\theta) denotes (4) for scoring rule j, (7) (for case j) becomes \mathcal{L}(\lambda)=\mathbb{E}_{q_{\lambda}}[wS_{n}^{j}(\theta)+\log\pi(\theta)-\log q_{\lambda}(\theta)]. Optimization is performed via SGA, as described in Section S2 of the supplementary appendix. Once \tilde{\lambda} has been calibrated, the GVP is estimated as \hat{P}_{Q}^{(n,j)}=\frac{1}{M}\sum_{m=1}^{M}P_{\theta^{(m,j)}}^{(n)}, with \theta^{(m,j)}\overset{i.i.d.}{\sim}q_{\tilde{\lambda}}(\theta). To calibrate \tilde{\lambda}, 10000 VB iterations are used, and to estimate the variational predictive we set M=1000.
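The SGA step can be sketched with a reparameterized gradient estimator for a mean-field Gaussian ELBO. Here `grad_log_h` is a placeholder for the gradient of wS_{n}^{j}(\theta)+\log\pi(\theta); the single-draw estimator, step size and initialization are illustrative choices only (the paper's actual SGA settings are given in its supplementary appendix).

```python
import math
import random

def sga_mean_field(grad_log_h, dim, n_iter=10000, lr=0.01, seed=0):
    """Maximise the ELBO over a mean-field Gaussian q_lambda by SGA,
    using the reparameterisation theta = mu + d * eps, eps ~ N(0, I).

    grad_log_h(theta) is a placeholder for the gradient of
    w * S_n(theta) + log pi(theta)."""
    rng = random.Random(seed)
    mu = [0.0] * dim
    log_d = [math.log(0.1)] * dim          # work with log sd for positivity
    for _ in range(n_iter):
        eps = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        d = [math.exp(v) for v in log_d]
        theta = [m + s * e for m, s, e in zip(mu, d, eps)]
        g = grad_log_h(theta)
        for i in range(dim):
            mu[i] += lr * g[i]                              # dELBO/dmu_i
            log_d[i] += lr * (g[i] * d[i] * eps[i] + 1.0)   # dELBO/dlog d_i
    return mu, [math.exp(v) for v in log_d]
```

For a standard-normal target (grad_log_h(\theta)=-\theta) the optimum is \mu=0, d=1, which the single-draw estimator approaches up to stochastic-gradient noise; the "+1.0" term is the gradient of the Gaussian entropy with respect to \log d_{i}.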

To produce the numerical results we generate a time series of length T=6000 from the true DGP and perform an expanding-window exercise from n=1000 to n=5999. For j\in\{\text{LS, CRPS, CLS10, CLS20, CLS80, CLS90, IS}\} we construct the predictive densities \hat{P}_{\Pi_{w}}^{(n,j)} and \hat{P}_{Q}^{(n,j)} as outlined above. Then, for i\in\{\text{LS, CRPS, CLS10, CLS20, CLS80, CLS90, IS}\} we compute the measures of out-of-sample predictive accuracy S_{\Pi_{w}}^{i,j,n}=s^{i}(\hat{P}_{\Pi_{w}}^{(n,j)},y_{n+1}) and S_{Q}^{i,j,n}=s^{i}(\hat{P}_{Q}^{(n,j)},y_{n+1}). Finally, we compute the average out-of-sample scores S_{\Pi_{w}}^{i,j}=\frac{1}{5000}\sum_{n=1000}^{5999}S_{\Pi_{w}}^{i,j,n} and S_{Q}^{i,j}=\frac{1}{5000}\sum_{n=1000}^{5999}S_{Q}^{i,j,n}. The results are tabulated and discussed in the following section.
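The expanding-window exercise itself reduces to a simple loop; in the sketch below, `fit_predictive` and `score` are placeholders for the posterior-construction and scoring steps described above.

```python
def expanding_window_scores(y, fit_predictive, score, n_start):
    """Average out-of-sample score from an expanding-window exercise.

    fit_predictive(y[:n]) returns a one-step-ahead predictive object
    conditioned on the first n observations; score(P, y_next) evaluates
    it at the realised value. Both are user-supplied placeholders."""
    total, count = 0.0, 0
    for n in range(n_start, len(y)):
        P = fit_predictive(y[:n])     # condition on y_1, ..., y_n
        total += score(P, y[n])       # score against y_{n+1}
        count += 1
    return total / count
```

With len(y) = 6000 and n_start = 1000, this yields the 5000 one-step-ahead evaluations used in the averages above.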

4.2 Results

The results of the simulation exercise are recorded in Table 1. Panels A to C record the results for Scenarios 1) to 3), with the average out-of-sample scores associated with the exact Gibbs predictive (estimated via MCMC) appearing in the left-hand-side panel and the average scores for GVP appearing in the right-hand-side panel. All values on the diagonal of each sub-panel correspond to the case where i=j. In the misspecified case, numerical validation of the asymptotic result that the GVP concentrates onto the optimal predictive (Theorem 2) occurs if the largest values (bolded) in a column appear on the diagonal (‘strict coherence’ in the language of Martin et al., 2022). In the correctly specified case, in which all proper scoring rules will, for a large enough sample, pick out the one true model (Gneiting and Raftery, 2007), we would expect all values in a given column to be very similar to one another, with differences reflecting sampling variation only. Finally, validation of the theoretical property of merging between the exact and variational Gibbs predictives (Theorem 1) occurs if the corresponding results in all left- and right-hand-side panels are equivalent.

As is clear, there is almost uniform numerical validation of both theoretical results. Under misspecification Scenario 2) (Panel B), the GVP results are strictly coherent (i.e. all bold values lie on the diagonal), with the exact Gibbs predictive results equivalent to the corresponding GVP results to three or four decimal places. The same broad findings obtain under misspecification Scenario 3), apart from the fact that the IS updates are second best (to the log-score, and then only just) in terms of the out-of-sample IS measure. In Panel A, on the other hand, we see the expected (virtual) equivalence of all results in a given column, reflecting the fact that all Gibbs predictives (however estimated) are concentrating onto the true predictive model and, hence, have identical out-of-sample performance. Of course, for a large but still finite number of observations, we would expect the log-score to perform best, due to the efficiency of the implicit maximum likelihood estimator underpinning the results and, to all intents and purposes, this is exactly what we observe in Panels A.1 and A.2.

In summary, GVP performs as anticipated, and reaps distinct benefits in terms of predictive accuracy. Any inaccuracy in the measurement of parameter uncertainty also has negligible impact on the finite sample performance of GVP relative to an exact comparator. In Section 5, we extend the investigation into design settings that mimic the high-dimensional problems to which we would apply the variational approach in practice, followed by an empirical application - again using a high-dimensional predictive model - in Section 6.

Panel A. True DGP: GARCH(1,1)
A.1: Exact Gibbs predictive A.2: GVP
Average out-of-sample score Average out-of-sample score
LS CLS10 CLS20 CLS80 CLS90 CRPS IS LS CLS10 CLS20 CLS80 CLS90 CRPS IS
U.method
LS -0.0433 -0.2214 -0.3087 -0.3029 -0.2150 -0.1438 -1.1853 -0.0434 -0.2216 -0.3088 -0.3028 -0.2150 -0.1438 -1.1857
CLS10 -0.0485 -0.2222 -0.3094 -0.3069 -0.2190 -0.1441 -1.2009 -0.0496 -0.2227 -0.3100 -0.3075 -0.2199 -0.1441 -1.2053
CLS20 -0.0482 -0.2221 -0.3094 -0.3066 -0.2185 -0.1440 -1.2014 -0.0493 -0.2225 -0.3097 -0.3072 -0.2193 -0.1440 -1.2052
CLS80 -0.0558 -0.2305 -0.3198 -0.3032 -0.2150 -0.1446 -1.2087 -0.0567 -0.2311 -0.3204 -0.3033 -0.2152 -0.1446 -1.2107
CLS90 -0.0495 -0.2258 -0.3143 -0.3029 -0.2150 -0.1441 -1.1973 -0.0502 -0.2265 -0.3149 -0.3031 -0.2152 -0.1441 -1.1993
CRPS -0.0462 -0.2239 -0.3116 -0.3027 -0.2147 -0.1438 -1.1874 -0.0474 -0.2239 -0.3112 -0.3041 -0.2162 -0.1439 -1.1918
IS -0.0438 -0.2216 -0.3089 -0.3031 -0.2153 -0.1438 -1.1877 -0.0440 -0.2218 -0.3091 -0.3031 -0.2153 -0.1438 -1.1882
Panel B. True DGP: Stochastic volatility with leverage
B.1: Exact Gibbs predictive B.2: GVP
Average out-of-sample score Average out-of-sample score
LS CLS10 CLS20 CLS80 CLS90 CRPS IS LS CLS10 CLS20 CLS80 CLS90 CRPS IS
U.method
LS -0.5636 -0.3753 -0.5452 -0.3536 -0.2512 -0.2313 -2.3468 -0.5633 -0.3752 -0.545 -0.3535 -0.2511 -0.2313 -2.3467
CLS10 -1.0193 -0.3336 -0.5021 -0.8379 -0.7339 -0.3679 -3.447 -1.0156 -0.3336 -0.502 -0.834 -0.7302 -0.3659 -3.4207
CLS20 -0.806 -0.3354 -0.4968 -0.6291 -0.5267 -0.2863 -2.9923 -0.8055 -0.3355 -0.4969 -0.6282 -0.5259 -0.2861 -2.9853
CLS80 -0.9203 -0.7372 -0.9311 -0.329 -0.2292 -0.2402 -3.3135 -0.9357 -0.7514 -0.9463 -0.329 -0.2292 -0.2402 -3.3248
CLS90 -0.9575 -0.7615 -0.9649 -0.3294 -0.2292 -0.2425 -3.4213 -0.9959 -0.7969 -1.0033 -0.3293 -0.2291 -0.2426 -3.4476
CRPS -0.5692 -0.4029 -0.5671 -0.3431 -0.2419 -0.23 -2.4312 -0.5649 -0.3985 -0.5626 -0.3432 -0.2419 -0.2301 -2.4338
IS -0.6552 -0.3986 -0.6111 -0.3713 -0.248 -0.2604 -2.203 -0.655 -0.3985 -0.6109 -0.3712 -0.2479 -0.2603 -2.2033
Panel C. True DGP: Stochastic volatility with smooth transition
C.1: Exact Gibbs predictive C.2: GVP
Average out-of-sample score Average out-of-sample score
LS CLS10 CLS20 CLS80 CLS90 CRPS IS LS CLS10 CLS20 CLS80 CLS90 CRPS IS
U.method
LS -1.6858 -0.4239 -0.672 -0.6751 -0.4369 -0.7196 -7.0726 -1.686 -0.424 -0.6721 -0.6752 -0.4371 -0.7196 -7.0742
CLS10 -1.8173 -0.4172 -0.6707 -0.8016 -0.5468 -0.821 -7.8217 -1.8145 -0.4172 -0.6706 -0.7987 -0.5437 -0.8183 -7.7717
CLS20 -1.7173 -0.419 -0.6674 -0.7067 -0.4616 -0.7425 -7.2516 -1.7171 -0.4192 -0.6676 -0.7063 -0.4611 -0.742 -7.2481
CLS80 -1.733 -0.465 -0.7199 -0.6681 -0.4304 -0.7511 -7.3087 -1.7339 -0.4654 -0.7204 -0.6684 -0.4307 -0.7513 -7.3159
CLS90 -1.8777 -0.5927 -0.8579 -0.6731 -0.4288 -0.8707 -8.1299 -1.8819 -0.5952 -0.8608 -0.674 -0.4295 -0.8734 -8.1211
CRPS -1.6938 -0.4283 -0.6762 -0.6812 -0.4435 -0.7184 -7.2059 -1.6938 -0.4286 -0.6765 -0.6809 -0.4432 -0.7185 -7.2121
IS -1.6879 -0.4244 -0.6726 -0.6754 -0.4374 -0.7208 -7.0915 -1.6881 -0.4245 -0.6728 -0.6754 -0.4374 -0.7208 -7.093
Table 1: Predictive accuracy of GVP using a Gaussian GARCH(1,1) predictive model for a financial return. Panel A corresponds to the correctly specified case, and Panels B and C to the two different misspecified settings as described in the text.  The rows in each panel refer to the update method (U.method) used. The columns refer to the out-of-sample measure used to compute the average scores. The figures in bold are the largest average scores according to a given out-of-sample measure.

5 Complex Time Series Examples

In this section we demonstrate the application of GVP in two realistic simulated examples. In both cases the assumed predictive model is high-dimensional, and the exact Gibbs posterior, even if accessible in principle via MCMC, is computationally challenging. The mean-field class is adopted in both cases to produce variational approximations to the Gibbs posterior. The simulation design for each example (including the choice of w=1) mimics that for the toy example, apart from obvious changes to the true DGP and the assumed predictive model, some changes to the sizes of the estimation and evaluation periods, and the absence of comparative exact results. For reasons of computational burden, we remove the CRPS update from consideration in the first example.

5.1 Example 1: Autoregressive Mixture Predictive Model

In this example we adopt a true DGP in which YtY_{t} evolves according to the logistic smooth transition autoregressive (LSTAR) process proposed in Teräsvirta (1994):

Y_{t}=\rho_{1}Y_{t-1}+\rho_{2}\left\{\frac{1}{1+\exp\left[-\gamma\left(Y_{t-1}-c\right)\right]}\right\}Y_{t-1}+\sigma_{\varepsilon}\varepsilon_{t}, (14)

where \varepsilon_{t}\overset{i.i.d.}{\sim}t_{\nu}, and t_{\nu} denotes the standardised Student-t distribution with \nu degrees of freedom. This model has the ability to produce data that exhibit a range of complex features. For example, it not only allows for skewness in the marginal density of Y_{t}, but can also produce temporal dependence structures that are asymmetric. We thus use this model as an illustration of a complex DGP whose characteristics are hard to replicate with simple parsimonious models; hence the need to adopt a highly parameterized predictive model, plus the need to acknowledge that even that representation will be misspecified.
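A sketch of this DGP, with a standardised Student-t error built from a normal/chi-squared ratio (assuming integer \nu>2 so that the unit-variance scaling is well-defined), is:

```python
import math
import random

def simulate_lstar(T, rho1=0.0, rho2=0.9, gamma=5.0, c=0.0,
                   sigma_eps=1.0, nu=3, seed=0):
    """Simulate the LSTAR process in (14) with standardised Student-t
    errors (unit variance), assuming integer nu > 2."""
    rng = random.Random(seed)
    scale = math.sqrt((nu - 2.0) / nu)   # standardises t_nu to unit variance

    def std_t():
        # t_nu = N(0,1) / sqrt(chi2_nu / nu), chi2 as a sum of squared normals
        z = rng.gauss(0.0, 1.0)
        chi2 = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(nu))
        return scale * z / math.sqrt(chi2 / nu)

    def logistic(u):
        # overflow-safe logistic transition function
        if u >= 0.0:
            return 1.0 / (1.0 + math.exp(-u))
        e = math.exp(u)
        return e / (1.0 + e)

    y = [0.0]
    for _ in range(T - 1):
        g = logistic(gamma * (y[-1] - c))
        y.append(rho1 * y[-1] + rho2 * g * y[-1] + sigma_eps * std_t())
    return y
```

The default arguments match the true parameter values used in Section 5.3.1.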

The assumed predictive model is based on the flexible Bayesian non-parametric structure proposed in Antoniano-Villalobos and Walker (2016). The predictive distribution for Y_{t}, conditional on the observed y_{t-1}, is constructed from a mixture of K=20 Gaussian autoregressive (AR) models of order one as follows:

P_{\theta}^{(t-1)}=\sum_{k=1}^{K}\tau_{k,t}\Phi_{1}\left[Y_{t}-\mu;\beta_{k,0}+\beta_{k,1}(y_{t-1}-\mu),\sigma_{k}^{2}\right], (15)

with time-varying mixture weights

\tau_{k,t}=\frac{\tau_{k}\phi_{1}\left(y_{t-1}-\mu;\mu_{k},s_{k}^{2}\right)}{\sum_{j=1}^{K}\tau_{j}\phi_{1}\left(y_{t-1}-\mu;\mu_{j},s_{j}^{2}\right)},

where \mu_{k}=\frac{\beta_{k,0}}{1-\beta_{k,1}} and s_{k}^{2}=\frac{\sigma_{k}^{2}}{1-\beta_{k,1}^{2}}. Denoting \tau=(\tau_{1},\dots,\tau_{K})', \beta_{0}=(\beta_{1,0},\dots,\beta_{K,0})', \beta_{1}=(\beta_{1,1},\dots,\beta_{K,1})' and \sigma=(\sigma_{1},\dots,\sigma_{K})', the full vector of unknown parameters is \theta=(\mu,\tau',\beta_{0}',\beta_{1}',\sigma')', which comprises 1+(4\times 20)=81 elements. GVP is a natural and convenient alternative to exact Gibbs prediction in this case.
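The time-varying weights and the implied one-step-ahead predictive density (the pdf corresponding to (15)) can be sketched as follows, for illustrative parameter inputs with |\beta_{k,1}|<1:

```python
import math

def norm_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def mixture_weights(y_prev, mu, tau, beta0, beta1, sigma):
    # tau_{k,t}: prior weights re-weighted by each component's stationary
    # density at the previous observation; requires |beta_{k,1}| < 1
    K = len(tau)
    mu_k = [beta0[k] / (1.0 - beta1[k]) for k in range(K)]
    s2_k = [sigma[k] ** 2 / (1.0 - beta1[k] ** 2) for k in range(K)]
    num = [tau[k] * norm_pdf(y_prev - mu, mu_k[k], s2_k[k]) for k in range(K)]
    z = sum(num)
    return [v / z for v in num]

def mixture_density(y, y_prev, mu, tau, beta0, beta1, sigma):
    # one-step-ahead predictive pdf of Y_t at y, given y_{t-1}
    w = mixture_weights(y_prev, mu, tau, beta0, beta1, sigma)
    return sum(
        w[k] * norm_pdf(y - mu, beta0[k] + beta1[k] * (y_prev - mu), sigma[k] ** 2)
        for k in range(len(tau))
    )
```

With a single component (K = 1) the density collapses to the Gaussian AR(1) conditional density, which provides a simple correctness check.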

5.2 Example 2: Bayesian Neural Network Predictive Model

In the second example we consider a true DGP in which the dependent variable Y_{t} has a complex non-linear relationship with a set of covariates. Specifically, the time series process \{Y_{t}\}_{t=1}^{T} is determined by a three-dimensional stochastic process \{X_{t}\}_{t=1}^{T}, with X_{t}=(X_{1,t},X_{2,t},X_{3,t})'. The first two covariates are jointly distributed as (X_{1,t},X_{2,t})'\overset{i.i.d.}{\sim}N(0,\Sigma). The third covariate, X_{3,t}, independent of the former two, follows an AR(4) process, X_{3,t}=\sum_{i=1}^{4}\alpha_{i}X_{3,t-i}+\sigma\varepsilon_{t}, with \varepsilon_{t}\overset{i.i.d.}{\sim}N(0,1). The variable Y_{t} is then given by Y_{t}=X_{t}'\beta_{t}, where \beta_{t}=(\beta_{1,t},\beta_{2,t},\beta_{3,t})' is a three-dimensional vector of time-varying coefficients, with \beta_{i,t}=b_{i}+a_{i}F(X_{3,t}), and F denotes the marginal distribution function of X_{3,t} induced by the AR(4) model.
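A sketch of this DGP follows, with the parameter values of Section 5.3.2 hard-coded. Since the stationary distribution of the AR(4) is not derived in the text, the marginal cdf F is approximated here by a Gaussian cdf with an empirically estimated standard deviation; this approximation is our own simplification.

```python
import math
import random

def simulate_dgp2(T, seed=0):
    """Simulate Y_t = X_t' beta_t for the Example 2 design. The marginal
    cdf F of X_{3,t} is approximated by a Gaussian cdf whose standard
    deviation is estimated from a pre-simulated stretch of the AR(4)."""
    rng = random.Random(seed)
    sig11, sig22, sig12 = 1.0, 1.25, 0.5
    sigma = math.sqrt(0.2)
    al = (0.5, 0.2, 0.15, 0.1)
    a = (1.3, -2.6, -1.5)
    b = (0.0, 1.3, 1.5)

    def draw_x12():
        # bivariate normal via the Cholesky factor of Sigma
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        x1 = math.sqrt(sig11) * z1
        x2 = (sig12 / math.sqrt(sig11)) * z1 \
            + math.sqrt(sig22 - sig12 ** 2 / sig11) * z2
        return x1, x2

    # AR(4) covariate with a long burn-in
    x3 = [0.0] * 4
    for _ in range(1000 + T):
        x3.append(sum(al[i] * x3[-1 - i] for i in range(4))
                  + sigma * rng.gauss(0.0, 1.0))
    x3 = x3[4:]                       # drop the zero initial conditions
    sd3 = math.sqrt(sum(v * v for v in x3) / len(x3))
    x3 = x3[-T:]

    y = []
    for t in range(T):
        F = 0.5 * (1.0 + math.erf(x3[t] / (sd3 * math.sqrt(2.0))))
        beta = [b[i] + a[i] * F for i in range(3)]
        x1, x2 = draw_x12()
        y.append(beta[0] * x1 + beta[1] * x2 + beta[2] * x3[t])
    return y, x3
```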

This particular choice of DGP has two advantages. First, given the complex dependence structure of the DGP (i.e. non-linear cross-sectional dependence as well as temporal dependence), it would be difficult, once again, to find a simple parsimonious predictive model that would adequately capture this structure; hence the motivation to adopt a high-dimensional model and to resort to GVP. Second, because the DGP includes several covariates, we can assess how GVP performs for varying information sets. For example, we can establish whether the overall performance of GVP improves with an expansion of the conditioning information set, as we would anticipate, and whether the inclusion of a more complete conditioning set affects the occurrence of strict coherence.

A flexible model that has the potential to at least partially capture several features of the true DGP is a Bayesian feed-forward neural network that takes Y_{t} as the dependent variable and some of its lags, along with observed values of X_{t}, x_{t}=(x_{1,t},x_{2,t},x_{3,t})', as the vector of independent inputs, z_{t} (see for instance Hernández-Lobato and Adams, 2015). The predictive distribution is defined as P_{\theta}^{(t-1)}=\Phi_{1}(Y_{t};g(z_{t};\omega),\sigma_{y}^{2}). The mean function g(z_{t};\omega) denotes a feed-forward neural network with q=2 layers, r=3 nodes in each layer, a p-dimensional input vector z_{t}, and a d-dimensional parameter vector \omega with d=r^{2}(q-1)+r(p+q+1)+1. It allows for a flexible non-linear relationship between Y_{t} and the vector of observed covariates z_{t}. Defining c=\log(\sigma_{y}), the full parameter vector of this predictive class, \theta=(\omega',c)', is of dimension d+1.
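The parameter count d=r^{2}(q-1)+r(p+q+1)+1 and the mean function can be sketched as below; the tanh activation is an illustrative choice, as the activation is not specified here.

```python
import math

def nn_param_count(p, q=2, r=3):
    # d = r^2 (q - 1) + r (p + q + 1) + 1
    return r * r * (q - 1) + r * (p + q + 1) + 1

def nn_mean(z, omega, q=2, r=3):
    """Feed-forward mean g(z; omega): q hidden layers of r nodes each and a
    linear output; omega (of length nn_param_count(len(z), q, r)) is
    consumed sequentially as weights then bias for every node."""
    w = list(omega)

    def take(k):
        out = w[:k]
        del w[:k]
        return out

    h = list(z)
    for _ in range(q):
        new_h = []
        for _ in range(r):
            coefs, bias = take(len(h)), take(1)[0]
            new_h.append(math.tanh(sum(c * x for c, x in zip(coefs, h)) + bias))
        h = new_h
    coefs, bias = take(r), take(1)[0]
    return sum(c * x for c, x in zip(coefs, h)) + bias
```

For p = 1, 2 and 3 the count gives d = 22, 25 and 28, i.e. d + 1 = 23, 26 and 29, matching the model dimensions reported in Section 5.3.2.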

5.3 Simulation Results

5.3.1 Example 1

A time series of length T=2500 for Y_{t} is generated from the LSTAR model in (14), with the true parameters set as: \rho_{1}=0, \rho_{2}=0.9, \gamma=5, c=0, \sigma_{\varepsilon}=1 and \nu=3. With the exception of the CRPS, which cannot be evaluated in closed form for the mixture predictive class and is consequently excluded from the exercise, the same scores as in the toy example are considered. With reference to the simulation steps given earlier, the initial estimation window is 500 observations; hence the predictive accuracy results are based on an evaluation sample of size 2000.

Average out-of-sample score
LS CLS10 CLS20 CLS80 CLS90 IS
U.method
LS -1.253 -0.345 -0.497 -0.452 -0.263 -5.589
CLS10 -1.445 -0.346 -0.512 -0.555 -0.321 -7.279
CLS20 -1.445 -0.344 -0.496 -0.589 -0.349 -6.674
CLS80 -1.333 -0.414 -0.571 -0.450 -0.260 -5.831
CLS90 -1.330 -0.407 -0.564 -0.451 -0.259 -5.730
IS -1.410 -0.401 -0.558 -0.474 -0.282 -5.550
Table 2: Predictive accuracy of GVP using an autoregressive mixture model for data generated from an LSTAR model. The rows refer to the update method (U.method) used. The columns refer to the out-of-sample measure used to compute the average scores. The figures in bold are the largest average scores according to a given out-of-sample measure.

The results in Table 2 are clear-cut. With one exception (that of CLS10), the best out-of-sample performance, according to a given measure, is produced by the version of GVP based on that same scoring rule. That is, the GVP results are almost uniformly strictly coherent: a matching of the update rule with the out-of-sample measure produces the best predictive accuracy in that measure, almost always.

5.3.2 Example 2

In this case, we generate a time series of length T=4000 from the model discussed in Section 5.2, with specifications: \Sigma_{11}=1, \Sigma_{22}=1.25, \Sigma_{12}=\Sigma_{21}=0.5, \sigma^{2}=0.2, \alpha_{1}=0.5, \alpha_{2}=0.2, \alpha_{3}=0.15, \alpha_{4}=0.1, a_{1}=1.3, b_{1}=0, a_{2}=-2.6, b_{2}=1.3, a_{3}=-1.5 and b_{3}=1.5. These settings generate data with a negatively skewed empirical distribution, a non-linear relationship between the observations on Y_{t} and X_{3,t}, and autoregressive behavior in Y_{t}.

To assess whether the performance of GVP is affected by varying information sets, we consider four alternative specifications for the input vector z_{t} in the assumed predictive model. These four specifications (labelled Model 1 to Model 4, respectively) are: z_{t}=y_{t-1}, z_{t}=(y_{t-1},x_{1,t})', z_{t}=(y_{t-1},x_{2,t})' and z_{t}=(y_{t-1},x_{1,t},x_{2,t})'. The dimension of the parameter vector for each of these model specifications is d+1=23, d+1=26, d+1=26 and d+1=29, respectively. Given that the assumed predictive class is Gaussian, all scoring rules used in the seven updates, and for all four model specifications, can be evaluated in closed form. Referencing the simulation steps given earlier, the initial estimation window is 2000 observations; hence the predictive accuracy results are based on an evaluation sample of size 2000, as in the previous example.

Panel A: Model 1 (zt=yt1{z}_{t}=y_{t-1}) Panel B: Model 2 (zt=(yt1,x1,t){z}_{t}=\left(y_{t-1},x_{1,t}\right)^{\prime})
Average out-of-sample score Average out-of-sample score
LS CLS10 CLS20 CLS80 CLS90 CRPS IS LS CLS10 CLS20 CLS80 CLS90 CRPS IS
U.method U.method
LS -1.607 -0.489 -0.761 -0.520 -0.310 -0.659 -6.579 LS -1.451 -0.494 -0.733 -0.424 -0.254 -0.550 -6.059
CLS10 -1.914 -0.460 -0.738 -0.852 -0.617 -0.914 -8.174 CLS10 -1.858 -0.435 -0.686 -0.849 -0.626 -0.866 -7.707
CLS20 -1.766 -0.460 -0.732 -0.704 -0.478 -0.772 -7.448 CLS20 -1.683 -0.438 -0.679 -0.686 -0.476 -0.708 -6.963
CLS80 -1.637 -0.514 -0.795 -0.514 -0.303 -0.678 -6.822 CLS80 -1.551 -0.595 -0.850 -0.403 -0.241 -0.587 -6.878
CLS90 -1.686 -0.534 -0.831 -0.521 -0.307 -0.718 -6.912 CLS90 -1.539 -0.540 -0.811 -0.406 -0.241 -0.605 -6.369
CRPS -1.620 -0.513 -0.783 -0.514 -0.304 -0.660 -6.752 CRPS -1.510 -0.585 -0.822 -0.411 -0.244 -0.553 -6.649
IS -1.601 -0.476 -0.753 -0.516 -0.307 -0.660 -6.338 IS -1.452 -0.473 -0.728 -0.414 -0.248 -0.562 -5.671
Panel C: Model 3 (zt=(yt1,x2,t){z}_{t}=\left(y_{t-1},x_{2,t}\right)^{\prime}) Panel D: Model 4 (zt=(yt1,x1,t,x2,t){z}_{t}=\left(y_{t-1},x_{1,t},x_{2,t}\right)^{\prime})
Average out-of-sample score Average out-of-sample score
LS CLS10 CLS20 CLS80 CLS90 CRPS IS LS CLS10 CLS20 CLS80 CLS90 CRPS IS
U.method U.method
LS -1.536 -0.428 -0.688 -0.521 -0.314 -0.617 -5.975 LS -1.390 -0.483 -0.716 -0.389 -0.229 -0.511 -5.890
CLS10 -1.775 -0.396 -0.658 -0.793 -0.555 -0.805 -6.704 CLS10 -1.758 -0.394 -0.650 -0.784 -0.548 -0.790 -6.659
CLS20 -1.646 -0.397 -0.654 -0.664 -0.439 -0.694 -6.277 CLS20 -1.656 -0.394 -0.621 -0.732 -0.515 -0.712 -6.236
CLS80 -1.698 -0.566 -0.858 -0.520 -0.301 -0.714 -7.282 CLS80 -2.127 -1.049 -1.438 -0.356 -0.196 -0.736 -9.798
CLS90 -1.744 -0.581 -0.889 -0.530 -0.304 -0.758 -7.361 CLS90 -2.157 -0.989 -1.401 -0.379 -0.192 -0.782 -9.545
CRPS -1.535 -0.438 -0.695 -0.517 -0.310 -0.615 -6.044 CRPS -1.519 -0.626 -0.872 -0.380 -0.218 -0.537 -6.922
IS -1.549 -0.438 -0.703 -0.517 -0.310 -0.624 -6.054 IS -1.673 -0.592 -0.934 -0.381 -0.203 -0.710 -5.493
Table 3: Predictive accuracy of GVP using a Bayesian neural network for data generated from the dynamic non-linear regression model discussed in Section 5.2. Panels A to D document results for the four different versions of ztz_{t}, in which varying numbers of input variables are included in the predictive model. The rows in each panel refer to the update method (U.method) used. The columns refer to the out-of-sample measure used to compute the average scores. The figures in bold are the largest average scores according to a given out-of-sample measure.

With reference to the results in Table 3, we can make two observations. First, as anticipated, an increase in the information set produces higher average out-of-sample scores for all updates, with the corresponding values increasing steadily and uniformly as one moves from the results in Panel A (based on z_{t}=y_{t-1}) to those in Panel D (based on the largest information set, z_{t}=(y_{t-1},x_{1,t},x_{2,t})'). Second, however, despite the improved performance of all versions of GVP as the information set is increased to better match that used in the true DGP, strict coherence still prevails. That is, ‘focusing’ on the measure that matters in the construction of the GVP update still reaps benefits, despite the reduction in misspecification.

6 Empirical Application: M4 Competition

The Makridakis 4 (M4) forecasting competition was a forecasting event first organised by the University of Nicosia and the New York University Tandon School of Engineering in 2018. The competition sought submissions of point and interval predictions at different time horizons, for a total of 100,000 time series of varying frequencies. The winner of the competition in a particular category (i.e. point prediction or interval prediction) was the submission that achieved the best average out-of-sample predictive accuracy, according to the measure of accuracy that defined that category, over all horizons and all series. Details of all aspects of the competition can be found via the following link:
https://www.m4.unic.ac.cy/wp-content/uploads/2018/03/M4-Competitors-Guide.pdf

We gauge the success of our method of distributional prediction in terms of the measure used to rank the interval forecasts in the competition, namely the IS. We focus on one-step-ahead prediction of each of the 4227 time series of daily frequency. Each of these series is denoted by \{Y_{i,t}\}, where i=1,\dots,4227 and t=1,\dots,n_{i}, and the task is to construct a predictive interval for Y_{i,n_{i}+1} based on GVP. We adopt the mixture distribution in (15) as the predictive model, it being well-suited to capturing the stylized features of high-frequency data. However, the model is only appropriate for stationary time series, and most of the daily series exhibit non-stationary patterns. To account for this, we model the differenced series \{Z_{i,t}=\Delta^{d_{i}}Y_{i,t}\}, where d_{i} indicates the differencing order. The integer d_{i} is selected by sequentially applying the KPSS test (Kwiatkowski et al., 1992) to \Delta^{j}y_{i,t} for j=0,\dots,d_{i}, with d_{i} being the first difference at which the null hypothesis of stationarity is not rejected. (To do this we employ the autoarima function in the forecast R package.) To construct the predictive distribution of Z_{i,n_{i}+1} we first obtain M=5000 draws \{\theta_{i}^{(m)}\}_{m=1}^{M} from the Gibbs variational posterior q_{\tilde{\lambda}}(\theta_{i}) that is based on the IS. (Once again, a value of w=1 is used in defining the Gibbs posterior and, hence, in producing the posterior approximation.) Draws \{z_{i,n_{i}+1}^{(m)}\}_{m=1}^{M} are then obtained from the corresponding predictive distributions \{P_{\theta_{i}^{(m)}}^{(n_{i}+1)}\}_{m=1}^{M}.
The draws of Z_{i,n_{i}+1} are then transformed into draws of Y_{i,n_{i}+1}, and a predictive distribution for Y_{i,n_{i}+1} is produced using kernel density estimation; assessment is performed in terms of the accuracy of the prediction interval for Y_{i,n_{i}+1}. To account for different units of measurement, the IS of each series is then divided by the constant \frac{1}{n_{i}-1}\sum_{t=2}^{n_{i}}|Y_{i,t}-Y_{i,t-1}|.
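The differencing-order selection and the unit-free scaling of the IS reduce to the following sketch. The stationarity check is injected (e.g. a KPSS test at the 5% level) rather than implemented, so `is_stationary` is a placeholder.

```python
def select_diff_order(y, is_stationary, d_max=2):
    """Difference the series until is_stationary(z) passes (or d_max is
    reached), mirroring the sequential KPSS-based selection of d_i;
    is_stationary is an injected placeholder for the test itself."""
    z = list(y)
    d = 0
    while d < d_max and not is_stationary(z):
        z = [b - a for a, b in zip(z, z[1:])]
        d += 1
    return d, z

def normalised_interval_score(y_insample, score):
    # divide by the mean absolute first difference of the series, to
    # render interval scores comparable across units of measurement
    n = len(y_insample)
    denom = sum(abs(y_insample[t] - y_insample[t - 1])
                for t in range(1, n)) / (n - 1)
    return score / denom
```

For instance, a linear-trend series passes such a check after a single difference, so the selected order is d = 1.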

Table 4 documents the predictive performance of our approach relative to the competing methods (see Makridakis et al., 2020, for details on these methods). The first column corresponds to the average IS. In terms of this measure the winner is the method proposed by Smyl (2020), while our method is ranked 11 out of 12. The ranking is 6 out of 12 in terms of the median IS, recorded in column two. However, it is worth remembering that our method aims to achieve high individual, not aggregate (or average), predictive interval accuracy across series. A more appropriate way of judging the effectiveness of our approach is thus to count the number of series for which each method performs best. This number is reported in column three of the table. In terms of this measure, with a total of 858 series, our approach is outperformed only by the method proposed in Doornik et al. (2020). In contrast, although Smyl is the best method in terms of mean IS, it is best at predicting only 139 of the series in the dataset. It is also important to note that most of the competing approaches have more complex assumed predictive classes than our mixture model. Even the simpler predictive classes, like the ARIMA and ETS models, involve model selection steps that allow them to capture richer Markovian processes than the mixture model, which, by construction, is based on an autoregressive process of order one.

In summary, despite a single specification being defined for the predictive class for all 4227 daily series, the process of driving the Bayesian update via the IS rule has still yielded the best predictive results in a very high number of cases, with GVP beaten in this sense by only one other competitor. Whilst the predictive model clearly matters, designing a bespoke updating mechanism to produce the predictive distribution is shown to still reap substantial benefits.

Out-of-sample results
Mean IS Median IS Series
Method
Doornik et al. -9.36 -4.58 2060
Mixture -16.66 -5.85 858
Trotta -11.41 -7.19 376
ARIMA -10.07 -5.67 278
Fiorucci and Louzada -9.20 -5.73 163
Smyl -8.73 -6.65 139
Roubinchtein -13.08 -7.35 97
Ibrahim -38.81 -6.33 88
ETS -9.13 -5.71 67
Petropoulos and Svetunkov -9.77 -5.68 59
Montero-Manso, et al. -10.24 -6.90 23
Segura-Heras, et al. -15.29 -8.48 19
Table 4: One-step-ahead predictive performance for the 4,227 daily series from the M4 competition. The columns ‘Mean IS’ and ‘Median IS’ respectively record the mean and median values of the average out-of-sample IS, across all 4,227 series, for each competing method/team. The column ‘Series’ reports the number of series for which the method/team produces the largest IS value. Some of the competing methods are individually described in Doornik et al. (2020), Fiorucci and Louzada (2020), Smyl (2020), Petropoulos and Svetunkov (2020), and Montero-Manso et al. (2020). For the remaining methods see Makridakis et al. (2020). Our GVP method based on the mixture model is referred to as ‘Mixture’.

7 Discussion

We have developed a new approach for conducting loss-based Bayesian prediction in high-dimensional models. Based on a variational approximation to the Gibbs posterior defined, in turn, by the predictive loss that is germane to the problem at hand, the method is shown to produce predictions that minimize that loss out-of-sample. ‘Loss’ is characterized in this paper by positively-oriented proper scoring rules designed to reward the accuracy of predictive probability density functions for a continuous random variable. Hence, loss minimization translates to maximization of an expected score. However, in principle, any loss function in which predictive accuracy plays a role could be used to define the Gibbs posterior. Critically, we have proven theoretically, and illustrated numerically, that for a large enough sample there is no loss incurred in predictive accuracy as a result of approximating the Gibbs posterior.

In comparison with the standard approach based on a likelihood-based Bayesian posterior, our Gibbs variational predictive approach is ultimately aimed at generating accurate predictions in the realistic empirical setting where the predictive model and, hence, the likelihood function is misspecified. Gibbs variational prediction enables the investigator to break free from the shackles of likelihood-based prediction, and to drive predictive outcomes according to the form of predictive accuracy that matters for the problem at hand; and all with theoretical validity guaranteed.

We have focused in the paper on particular examples where the model used to construct the Gibbs variational predictive is observation-driven. Extensions to parameter-driven models (i.e., hidden Markov models, or state space models) require different approaches with regard to both the implementation of variational Bayes, and in establishing the asymptotic properties of the resulting posteriors and predictives. This is currently the subject of on-going work by the authors, Frazier et al. (2021), and we reference Tran et al. (2017) and Quiroz et al. (2018) for additional discussion of this problem.

This paper develops the theory of Gibbs variational prediction for classes of models that are general and flexible enough to accommodate a wide range of data generating processes, and which, under the chosen loss function, are smooth enough to permit a quadratic expansion. This latter condition restricts the classes of models, and loss functions, under which our results are applicable. For example, the discrete class of models studied in Douc et al. (2013) may not be smooth enough to deduce the validity of such an expansion.

While our approach to prediction focuses on accuracy in a given forecasting rule, which is related to the notion of sharpness in the probabilistic forecasting literature (see Gneiting et al., 2007), it does not pay attention to the calibration of such predictive densities. That is, there is no guarantee that the resulting predictive densities have valid frequentist coverage properties. If calibration is a desired property, it is possible to augment our approach to prediction with the approach recently proposed by Wu and Martin (2021), which calibrates generalized predictives. The amalgamation of these two procedures should produce accurate predictions that are well-calibrated in large samples. We leave an exploration of this possibility to future research.


Acknowledgments

We would like to thank various participants at: the ‘ABC in Svalbard’ Workshop, April 2021, the International Symposium of Forecasting, the Alan Turing Institute, and RIKEN reading group, June 2021, the European Symposium of Bayesian Econometrics, September 2021, and the Vienna University of Business, December 2021, for helpful comments on earlier drafts of the paper. This research has been supported by Australian Research Council (ARC) Discovery Grants DP170100729 and DP200101414. Frazier was also supported by ARC Early Career Researcher Award DE200101070.

A Theoretical Results

In this section, we state several intermediate results and prove Theorems 1 and 2 stated in Section 3.2 of the main paper.

We maintain the following notation throughout the remainder of the paper. We let d:\Theta\times\Theta\rightarrow\mathbb{R}_{\geq 0} denote a generic metric on \Theta\subseteq\mathbb{R}^{d_{n}}, and we take \|\cdot\| and \langle\cdot,\cdot\rangle to be the Euclidean norm and inner product for vectors. In the following section, we let C denote a generic constant, independent of n, that can vary from line to line. For sequences x_{n},y_{n} and some C we write x_{n}\lesssim y_{n} when x_{n}\leq Cy_{n}. If x_{n}\lesssim y_{n} and y_{n}\lesssim x_{n} we write x_{n}\asymp y_{n}. For some positive sequence \delta_{n}, the notations o_{p}(\delta_{n}) and O_{p}(\delta_{n}) have their usual meanings. Unless otherwise stated, all limits are taken as n\rightarrow\infty, so that when no confusion will result we use \lim_{n} to denote \lim_{n\rightarrow\infty}. The random variable Y_{n} takes values in \mathcal{Y} for all n\geq 1, and, for a given n, the vector of observations y_{1:n} is generated from the true probability measure P_{0}, which is not necessarily in \mathcal{P}^{(n)}, and whose dependence on n we suppress to avoid confusion with the predictive model P_{\theta}^{(n)}.

A.1 Assumptions and Discussion

For some positive sequence ϵn0\epsilon_{n}\rightarrow 0, define Θ(ϵn):={θΘ:d(θ,θ)ϵn}\Theta(\epsilon_{n}):=\{\theta\in\Theta:d(\theta,\theta_{\star})\leq\epsilon_{n}\} as an nn-dependent neighbourhood of θ\theta_{\star} and let Θc(ϵn)\Theta^{c}(\epsilon_{n}) denote its complement.

Assumption 1

(i) There exists a non-random function 𝒮(θ)\mathcal{S}(\theta) such that supθΘ|Sn(θ)/n𝒮(θ)|0\sup_{\theta\in\Theta}|S_{n}(\theta)/n-\mathcal{S}(\theta)|\rightarrow 0 in probability. (ii) For any ϵϵn\epsilon\geq\epsilon_{n}, there exists some C>0C>0 such that supΘc(ϵ){𝒮(θ)𝒮(θ)}Cϵ2\sup_{\Theta^{c}(\epsilon)}\left\{\mathcal{S}(\theta)-\mathcal{S}(\theta_{\star})\right\}\leq-C\epsilon^{2}.

Remark 3

Assumption 1 places regularity conditions on the sample and limit counterpart of the expected scoring rules, 𝒮(θ)\mathcal{S}(\theta), and is used to deduce the asymptotic behavior of the Gibbs posterior. The first part of the assumption is a standard uniform law of large numbers, and the second part amounts to an identification condition for θ\theta_{\star}.

We note here that the temporal dependence of the observations, and the use of potentially heavy-tailed data, mean that existing results on the concentration of Gibbs posteriors, and/or their variational approximations, of which we are aware may not be applicable in our setting; see, e.g., Zhang and Gao (2020), Alquier and Ridgway (2020) and Yang et al. (2020).

Instead, we use smoothness of Sn(θ)S_{n}(\theta) (and the underlying model Pθ(n)P_{\theta}^{(n)}) to deduce a posterior concentration result.

Assumption 2

There exists a sequence of d_{n}\times d_{n}-dimensional matrices \Sigma_{n}, and a d_{n}\times 1 random vector \Delta_{n}, such that:

  1. (i)

    Sn(θ)Sn(θ)=n(θθ),Δn/n12Σn1/2n(θθ)2+Rn(θ).S_{n}({\theta})-S_{n}({\theta}_{\star})=\langle\sqrt{n}\left({\theta}-{\ \theta_{\star}}\right),{\Delta}_{n}/\sqrt{n}\rangle-\frac{1}{2}\|\Sigma_{n}^{1/2}\sqrt{n}\left({\theta}-{\theta}_{\star}\right)\|^{2}+R_{n}({\theta}).

  2. (ii)

    For any ϵ>ϵn\epsilon>\epsilon_{n}, any M>0M>0, with M/n0M/\sqrt{n}\rightarrow 0, limsupP0[supd(θ,θ)M/n|Rn(θ)|1+nθθ2>ϵ]=0.\lim\sup P_{0}\left[\sup_{d(\theta,\theta_{\star})\leq M/\sqrt{n}}\frac{\left|R_{n}(\theta)\right|}{1+n\|\theta-\theta_{\star}\|^{2}}>\epsilon\right]=0.

  3. (iii)

    Σn1/2Δn=Op(1)\left\|\Sigma_{n}^{-1/2}\Delta_{n}\right\|=O_{p}(1).
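Assumption 2 is, in essence, a quadratic (local asymptotic normality type) expansion of the sample criterion. The sketch below checks it numerically for an illustrative scale model under the log score, y_t ~ N(0, e^θ), with Δ_n the criterion gradient and Σ_n the negative (per-observation) limit Hessian at the pseudo-true value; the model, constants, and fixed perturbations h are all assumptions made for this toy check, which conveys why the remainder is negligible relative to the quadratic term, rather than verifying 2(ii) verbatim.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5000
y = rng.normal(0.0, 1.5, size=n)              # data with true variance 2.25

# Log-score criterion S_n(theta) for the toy scale model y_t ~ N(0, exp(theta))
def S_n(theta):
    return np.sum(-0.5 * np.log(2.0 * np.pi) - 0.5 * theta
                  - 0.5 * y ** 2 * np.exp(-theta))

theta_star = np.log(1.5 ** 2)                 # pseudo-true (population) optimiser
Delta_n = -0.5 * n + 0.5 * np.sum(y ** 2) * np.exp(-theta_star)  # dS_n(theta_star)/dtheta
Sigma_n = 0.5                                 # negative limit Hessian at theta_star

# Remainder R_n(theta) of the expansion in Assumption 2(i), normalised as in 2(ii)
ratios = []
for h in (0.2, 0.1, 0.05):
    theta = theta_star + h
    quad = h * Delta_n - 0.5 * Sigma_n * n * h ** 2
    R = S_n(theta) - S_n(theta_star) - quad
    ratios.append(abs(R) / (1.0 + n * h ** 2))
    print(h, ratios[-1])                      # small relative to the quadratic term
```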

We require the following tail control condition on the prior.

Assumption 3

For any ϵ>ϵn\epsilon>\epsilon_{n}, log{Π[Θ(ϵ)]}nϵ2\log\left\{\Pi\left[\Theta(\epsilon)\right]\right\}\gtrsim-n\epsilon^{2}.

Remark 4

The above assumptions ultimately preclude cases where Θ\Theta can be partitioned into global parameters that drive the model, which are fixed for all nn, and local parameters that represent time-dependent latent variables in the model and grow in a one-for-one manner with the sample size. As such, these assumptions are not designed to be applicable to variational inference in classes such as hidden Markov models.

To validate the usefulness and broad applicability of Assumptions 1-2, we prove that these assumptions are satisfied for the GARCH(1,1) model used for the illustrative exercise in Section 4 in the case of the logarithmic scoring rule, S(y,f)=logf(y)S(y,f)=\log f(y), under weak moment assumptions on the observed data.

Lemma 1

Recall the GARCH(1,1) model presented in Section 4, and assume, WLOG, that 𝔼[Yt]=0\mathbb{E}[Y_{t}]=0. Define the rescaled variable Zt=Yt/σtZ_{t}=Y_{t}/\sigma_{t\star}, where σt\sigma_{t\star} denotes the GARCH process evaluated at θ=θ\theta=\theta_{\star}. Assume that ZtZ_{t} satisfies the following conditions:

  1. 1.

    ZtZ_{t} is strictly stationary and ergodic.

  2. 2.

Z_{t}^{2} is a non-degenerate random variable.

  3. 3.

For some \delta>0, there exists a C<\infty such that \mathbb{E}[Z_{t}^{4}|\mathcal{F}_{t-1}]\leq C<\infty.

  4. 4.

The parameter space \Theta is compact, with the elements \theta_{2},\theta_{3},\theta_{4} bounded away from zero, the elements \theta_{3},\theta_{4} bounded above by one, and \theta_{3}+\theta_{4}<1. The pseudo-true value \theta_{\star} is in the interior of \Theta.

If the above are satisfied, then Assumptions (1) and (2) are satisfied for Sn(θ)=t=2nlogpθ(yt|t1)S_{n}(\theta)=\sum_{t=2}^{n}\log p_{\theta}(y_{t}|\mathcal{F}_{t-1}).
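To make the objects in Lemma 1 concrete, the following sketch evaluates the log-score criterion S_n(\theta)=\sum_{t=2}^{n}\log p_{\theta}(y_{t}|\mathcal{F}_{t-1}) for one common GARCH(1,1) parameterisation with Gaussian one-step-ahead predictives; the parameter layout (mean, intercept, ARCH and GARCH coefficients) and the variance initialisation are illustrative assumptions that may differ in detail from the model of Section 4.

```python
import numpy as np

def garch_log_score(theta, y):
    """S_n(theta) = sum_{t=2}^n log p_theta(y_t | F_{t-1}) under Gaussian
    one-step-ahead predictives, for the assumed recursion
    sigma2_t = omega + alpha * (y_{t-1} - mu)**2 + beta * sigma2_{t-1}."""
    mu, omega, alpha, beta = theta
    eps = y - mu
    sigma2 = np.empty_like(y)
    sigma2[0] = np.var(y)                       # illustrative initialisation
    for t in range(1, len(y)):
        sigma2[t] = omega + alpha * eps[t - 1] ** 2 + beta * sigma2[t - 1]
    ll = -0.5 * (np.log(2.0 * np.pi) + np.log(sigma2[1:]) + eps[1:] ** 2 / sigma2[1:])
    return ll.sum()

# simulate a GARCH(1,1) path and evaluate the criterion at the true parameter
rng = np.random.default_rng(2)
n, (mu0, om0, al0, be0) = 2000, (0.0, 0.1, 0.1, 0.8)
y, s2 = np.empty(n), om0 / (1.0 - al0 - be0)    # start at the stationary variance
for t in range(n):
    y[t] = mu0 + np.sqrt(s2) * rng.standard_normal()
    s2 = om0 + al0 * (y[t] - mu0) ** 2 + be0 * s2
print(garch_log_score((mu0, om0, al0, be0), y))
```

For such simulated data the criterion is larger at the data-generating parameter than at distant parameter values, consistent with the identification condition in Assumption 1(ii).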

A.2 Preliminary Result: Posterior Concentration

Before proving the results stated in Section 3.2, we first give two results regarding the posterior concentration of the exact Gibbs posterior and its variational approximation.

Lemma 2

Under Assumptions 1-3, for \epsilon_{n}\rightarrow 0 with n\epsilon_{n}^{2}\rightarrow\infty, any M_{n}\rightarrow\infty, and w_{n}\in(\underline{w},\overline{w}) with probability one, where 0<\underline{w}<\overline{w}<\infty,

Πwn(d(θ,θ)>Mnϵn|y1:n)exp(CMn2ϵn2n),\Pi_{w_{n}}\left(d(\theta,\theta_{\star})>M_{n}\epsilon_{n}|y_{1:n}\right)\lesssim\exp\left(-CM_{n}^{2}\epsilon_{n}^{2}n\right),

with probability converging to one.

Remark 5

The proof of Lemma 2 requires controlling the behavior of \sup_{d(\theta,\theta_{\star})\geq\epsilon}n^{-1}\left\{S_{n}(\theta)-S_{n}(\theta_{\star})\right\} for any \epsilon>0. When the parameter space is compact, this control can be guaranteed by controlling the bracketing entropy (see, e.g., van der Vaart and Wellner, 1996) of \Theta. If the parameter space is not compact, this control can be achieved by imposing more stringent conditions on the prior than Assumption 3. In particular, results for a non-compact parameter space can be obtained by modifying Theorem 4 in Shen and Wasserman (2001) for our setting. That being said, we note that in the case of the toy GARCH(1,1) model in Section 4, the parameter space is compact and this condition is satisfied under the conditions given in Lemma 1.

To transfer the posterior concentration of the Gibbs posterior to its variational approximation, we restrict the class 𝒬\mathcal{Q} used to produce the variational approximation. Following Zhang and Gao (2020), we define κn\kappa_{n} to be the approximation error rate for the variational family:

κn2=1n𝔼P0infQ𝒬D{QΠw(|y1:n)}.\kappa_{n}^{2}=\frac{1}{n}\mathbb{E}_{P_{0}}\inf_{Q\in\mathcal{Q}}\text{D}\left\{Q\|\Pi_{w}(\cdot|y_{1:n})\right\}.
Lemma 3

Under the Assumptions of Lemma 2, for any sequence MnM_{n}\rightarrow\infty,

Q^{d(θ,θ)Mnn(ϵn2+κn2)|y1:n}0\widehat{Q}\left\{d(\theta,\theta_{\star})\geq M_{n}n\left(\epsilon^{2}_{n}+\kappa^{2}_{n}\right)|y_{1:n}\right\}\rightarrow 0

in probability.

Remark 6

While several authors, such as Alquier et al. (2016), Alquier and Ridgway (2020), and Yang et al. (2020), have demonstrated results similar to Lemma 3, existing results generally hinge on the satisfaction of a sub-Gaussian concentration inequality, such as a Hoeffding or Bernstein-type inequality, for the risk function S_{n}(\theta) used to produce the Gibbs posterior. Given that our goal is prediction in possibly non-Markovian, non-stationary, and/or complex models, and with a loss function based on a general scoring rule, it is not clear that such results remain applicable. Therefore, we deviate from the earlier referenced approaches and instead draw inspiration from Miller (2021), using an approach based on a quadratic expansion of the loss function. The latter conditions can be shown to be valid in a wide variety of models used to produce predictions in commonly encountered time series settings, including models with non-Markovian features. Indeed, as Lemma 1 demonstrates, Assumptions 1 and 2 are satisfied in cases where the scoring rule only exhibits a polynomial tail, e.g., in the case of the GARCH(1,1) model, and with data that is non-Markovian.

Remark 7

Lemma 3 gives a similar result to Theorem 2.1 in Zhang and Gao (2020) but for the Gibbs variational posterior and demonstrates that the latter concentrates onto θ\theta_{\star}, with the rate of concentration given by the slower of ϵn\epsilon_{n} and κn\kappa_{n}. We note that, in cases where the dimension of Θ\Theta grows with the sample size, the rate ϵn\epsilon_{n} is ultimately affected by the rate of growth of Θ\Theta and care must be taken to account for this dependence.

Remark 8

Verification of Lemma 3 in any practical example requires choosing the variational family 𝒬\mathcal{Q}. In the case of observation-driven time series models, where the predictive model Pθ(n)P_{\theta}^{(n)} can be constructed analytically, and no explicit treatment of ‘local’ parameters is needed, the variational family can be taken to be a sufficiently rich parametric class. For example, consider the case where the dimension of Θ\Theta is fixed and compact, and assume that Pθ(n)P_{\theta}^{(n)} and Sn(θ)S_{n}(\theta) satisfy the Assumptions of Lemma 2. Then, we can consider as our variational family the mean-field class:

𝒬MF:={Q:q(θ)=i=1dθqi(θi)},\mathcal{Q}_{MF}:=\left\{Q:q(\theta)=\prod_{i=1}^{d_{\theta}}q_{i}(\theta_{i})\right\},

or the Gaussian family

\mathcal{Q}_{G}:=\{\lambda=(\mu^{\prime},\text{vech}(\Sigma)^{\prime})^{\prime}:q(\theta)\propto\mathcal{N}(\theta;\mu,\Sigma)\},

where \mu\in\mathbb{R}^{d}, and \Sigma is a d\times d positive-definite matrix. In either case, the approximation results of Zhang and Gao (2020) and Yang et al. (2020) for these variational families can be used to obtain the rate \kappa_{n}. In particular, when d is fixed these results demonstrate that \kappa_{n}\lesssim\epsilon_{n}, and the Gibbs variational posterior converges at the same rate as the Gibbs posterior. (The rate of concentration obtained by applying Lemma 2 in such an example will be \epsilon_{n}=\log(n)/\sqrt{n}, which is slightly slower than the optimal parametric rate \epsilon_{n}=1/\sqrt{n}. The parametric rate can be achieved by suitably modifying the prior condition in Assumption 3 to cater for the parametric nature of the model. However, given that this is not germane to the main point of the paper, we do not analyze such situations here.)
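To make the use of the Gaussian family concrete, the following sketch fits \mathcal{Q}_{G} to the Gibbs posterior of a toy one-dimensional model (i.i.d. N(\theta,1) under the log score, an assumed N(0, 10) prior, and w = 1) by stochastic-gradient ascent on the ELBO, using the reparameterisation \theta=\mu+e^{\rho}z; all tuning constants are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
y = rng.normal(1.0, 1.0, size=200)
w, prior_var = 1.0, 10.0

def grad_log_target(theta):
    # d/dtheta of w * S_n(theta) + log prior, with S_n the Gaussian log score
    return w * np.sum(y - theta) - theta / prior_var

# Gaussian family q(theta) = N(mu, exp(2*rho)); maximise the ELBO
#   E_q[w S_n(theta) + log pi(theta)] + entropy(q)
# by stochastic gradients with the reparameterisation theta = mu + exp(rho) * z.
mu, rho, lr = 0.0, np.log(0.5), 1e-3
for _ in range(5000):
    z = rng.standard_normal()
    sd = np.exp(rho)
    g = grad_log_target(mu + sd * z)
    mu += lr * g                     # d ELBO / d mu
    rho += lr * (g * sd * z + 1.0)   # d ELBO / d rho (the +1 is the entropy term)

print(mu, np.exp(rho))
```

In this conjugate toy case the Gibbs posterior is itself Gaussian, so \mathcal{Q}_{G} contains it exactly and the approximation error \kappa_{n} vanishes; the sketch simply recovers its mean (near the sample mean) and its standard deviation (near 1/\sqrt{wn+1/\text{prior\_var}}).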

As seen from Lemma 3, the variational posterior places increasing mass on the value θ\theta_{\star} that maximizes the limit of the expected scoring rule, 𝒮(θ)\mathcal{S}(\theta). Lemma 3 does not rely on the existence of exponential testing sequences as in much of the Bayesian nonparametrics literature (see, e.g., the treatment in Ghosal and Van der Vaart, 2017). The need to consider an alternative approach is due to the loss-based, i.e., non-likelihood-based, nature of the Gibbs posterior. Instead, the arguments used to demonstrate posterior concentration rely on controlling the behavior of a suitable quadratic approximation to the criterion function in a neighborhood of θ\theta_{\star}.

The result of Lemma 3 allows us to demonstrate that the predictions made using the GVP, P_{Q}^{(n)}, are just as accurate as those obtained by an agent that uses the ‘optimal [frequentist] predictive’ P_{\star}^{(n)} defined in equation (8); this is the key message of Theorem 2, which demonstrates that the GVP yields predictions that are just as accurate as the exact Gibbs predictive (in large samples).

A.3 Proofs of Lemmas

Proof of Lemma 1. We break the proof up by showing the results for Assumptions 1 and 2 separately.


Assumption 1. We remark that Assumptions 1-4 in Lemma 1 are sufficient for conditions (A.1) and (A.2) of Lee and Hansen (1994). To establish Assumption 1(i), we note that Lemma 7(2) in Lee and Hansen (1994) implies that S_{n}(\theta)/n\rightarrow_{p}\mathcal{S}(\theta) for each \theta\in\Theta. Furthermore, since the expected value of \partial\log p_{\theta}(y_{t}|\mathcal{F}_{t-1})/\partial\theta is bounded for each t, for each \theta\in\Theta, by Lemma 8(2) in Lee and Hansen (1994), it follows that S_{n}(\theta)/n\rightarrow_{p}\mathcal{S}(\theta) uniformly over \Theta. Assumption 1(ii) can then be established by using the fact that \mathcal{S}(\theta) is twice continuously differentiable for \theta near \theta_{\star}, with negative-definite second derivative matrix \partial^{2}\mathcal{S}(\theta_{\star})/\partial\theta\partial\theta^{\prime}, by Lemma 11(3) of Lee and Hansen (1994), which yields

\mathcal{S}(\theta)-\mathcal{S}(\theta_{\star})\leq-C\|\theta-\theta_{\star}\|^{2},

for some C>0C>0. Hence, Assumption 1(ii) follows.


Assumption 2. The result again follows from the results of Lee and Hansen (1994). In particular, in this case S_{n}(\theta)=\sum_{t=1}^{n}\log p_{\theta}(y_{t+1}|\mathcal{F}_{t}) is twice continuously differentiable, so that a direct quadratic approximation is valid at \theta=\theta_{\star}. Then, taking \Delta_{n}=\partial S_{n}(\theta_{\star})/\partial\theta and \Sigma_{n}=-\partial^{2}\mathcal{S}(\theta_{\star})/\partial\theta\partial\theta^{\prime} yields the expansion, where the remainder term is given by

Rn(θ)\displaystyle R_{n}(\theta) =n(θθ)[2𝒮(θ)/θθ2𝒮(θ¯)/θθ](θθ)\displaystyle=n(\theta-\theta_{\star})^{\prime}[\partial^{2}\mathcal{S}(\theta_{\star})/\partial\theta\partial\theta^{\prime}-\partial^{2}\mathcal{S}(\bar{\theta})/\partial\theta\partial\theta^{\prime}](\theta-\theta_{\star})
n(θθ)[n12Sn(θ¯)/θθ2𝒮(θ¯)/θθ](θθ),\displaystyle-n(\theta-\theta_{\star})^{\prime}[n^{-1}\partial^{2}S_{n}(\bar{\theta})/\partial\theta\partial\theta^{\prime}-\partial^{2}\mathcal{S}(\bar{\theta})/\partial\theta\partial\theta^{\prime}](\theta-\theta_{\star}),

and θ¯\bar{\theta} an intermediate value such that θθ¯θθ\|\theta-\bar{\theta}\|\leq\|\theta-\theta_{\star}\|.

To verify (ii), we note that 2𝒮(θ¯)/θθ\partial^{2}\mathcal{S}(\bar{\theta})/\partial\theta\partial\theta^{\prime} exists and is continuous by Lemma 11(3) of Lee and Hansen (1994), so that the first term is o(1)o(1) on d(θ,θ)M/nd(\theta,\theta_{\star})\leq M/\sqrt{n}. Further, Lemma 11(3) of Lee and Hansen (1994) demonstrates that

supθΘ[n12Sn(θ¯)/θθ2𝒮(θ¯)/θθ]=op(1),\sup_{\theta\in\Theta}\|[n^{-1}\partial^{2}S_{n}(\bar{\theta})/\partial\theta\partial\theta^{\prime}-\partial^{2}\mathcal{S}(\bar{\theta})/\partial\theta\partial\theta^{\prime}]\|=o_{p}(1),

so that the second term is also o(1)o(1) on d(θ,θ)M/nd(\theta,\theta_{\star})\leq M/\sqrt{n}.

Lastly, (iii) is verified since \Sigma_{n}=-\partial^{2}\mathcal{S}(\theta_{\star})/\partial\theta\partial\theta^{\prime}, and since, by Lemma 9(2) of Lee and Hansen (1994), \Delta_{n}/\sqrt{n}\Rightarrow N(0,\mathbb{E}\left[\partial\log p_{\theta_{\star}}(y_{t}|\mathcal{F}_{t-1})/\partial\theta\,\partial\log p_{\theta_{\star}}(y_{t}|\mathcal{F}_{t-1})/\partial\theta^{\prime}\right]).

 

Proof of Lemma 2. The posterior density is \Pi_{w_{n}}\left(\theta|y_{1:n}\right)\propto\mathrm{d}\Pi(\theta)\exp\left[w_{n}S_{n}(\theta)\right], where we recall that, by hypothesis, w_{n}\in(\underline{w},\overline{w})\subset(0,\infty) with probability one. Define the quasi-likelihood ratio

Zn(θ):=exp[wn{Sn(θ)Sn(θ)12ΔnΣn1Δn}],Z_{n}(\theta):=\exp\left[w_{n}\left\{{S}_{n}\left(\theta\right)-{S}_{n}(\theta_{\star})-\frac{1}{2}\Delta_{n}^{\prime}\Sigma^{-1}_{n}\Delta_{n}\right\}\right],

where, by Assumption 2 (iii), Σn1/2Δn=Op(1)\Sigma_{n}^{-1/2}\Delta_{n}=O_{p}(1). For M>0M>0, the posterior probability over An:={θ:d(θ,θ)>Mϵn}A_{n}:=\{\theta:d(\theta,\theta_{\star})>M\epsilon_{n}\} is

Πwn(An|y1:n)=AndΠ(θ)Zn(θ)ΘdΠ(θ)Zn(θ)=Nn(An)Dn,\displaystyle\Pi_{w_{n}}(A_{n}|y_{1:n})=\frac{\int_{A_{n}}\mathrm{d}\Pi\left(\theta\right)Z_{n}(\theta)}{\int_{\Theta}\mathrm{d}\Pi\left(\theta\right)Z_{n}(\theta)}=\frac{N_{n}(A_{n})}{D_{n}},

where

Nn(An)=AndΠ(θ)Zn(θ),Dn=ΘdΠ(θ)Zn(θ).{N_{n}(A_{n})}=\int_{A_{n}}\mathrm{d}\Pi\left(\theta\right)Z_{n}(\theta),\quad{D_{n}}={\int_{\Theta}\mathrm{d}\Pi\left(\theta\right)Z_{n}(\theta)}.

The result now follows by lower bounding D_{n} and upper bounding N_{n}(A_{n}).

Part 1: DnD_{n} Term. Define Tn:=θ+Σn1ΔnT_{n}:=\theta_{\star}+\Sigma_{n}^{-1}\Delta_{n}, and Gn:={θΘ:12(θTn)[Σn/n](θTn)tn}G_{n}:=\{\theta\in\Theta:\frac{1}{2}(\theta-T_{n})^{\prime}\left[\Sigma_{n}/n\right](\theta-T_{n})\leq t_{n}\}. We show that, for tn0t_{n}\rightarrow 0 , and tnϵn2t_{n}\leq\epsilon_{n}^{2}, wpc1

Dn=ΘdΠ(θ)Zn(θ)>12Π(Gn)e2w¯ntn.D_{n}=\int_{\Theta}\mathrm{d}\Pi(\theta)Z_{n}(\theta)>\frac{1}{2}\Pi(G_{n})e^{-2\overline{w}nt_{n}}.

Proof of Part 1: To lower bound D_{n} we follow an approach similar to that of Lemma 1 in Shen and Wasserman (2001) (see also Syring and Martin, 2020). Over G_{n}, use Assumption 2(i) to rewrite \log\{Z_{n}(\theta)\} as

log{Zn(θ)}\displaystyle\log\{Z_{n}(\theta)\} =log[exp{wn[Sn(θ)Sn(θ)12ΔnΣn1Δn]}]\displaystyle=\log\left[\exp\left\{w_{n}\left[S_{n}(\theta)-S_{n}(\theta_{\star})-\frac{1}{2}\Delta_{n}^{\prime}\Sigma^{-1}_{n}\Delta_{n}\right]\right\}\right]
\displaystyle=-\frac{w_{n}n}{2}(\theta-T_{n})^{\prime}\left[\Sigma_{n}/n\right](\theta-T_{n})+w_{n}R_{n}(\theta). (16)

Define the following sets: 𝒞n:={(θ,y1:n):|Rn(θ)|ntn}\mathcal{C}_{n}:=\{(\theta,y_{1:n}):|R_{n}(\theta)|\geq nt_{n}\}, 𝒞n(θ):={y1:n:(θ,y1:n)𝒞n}\mathcal{C}_{n}(\theta):=\{y_{1:n}:(\theta,y_{1:n})\in\mathcal{C}_{n}\}, and 𝒞n(y1:n):={θ:(θ,y1:n)𝒞n}\mathcal{C}_{n}(y_{1:n}):=\{\theta:(\theta,y_{1:n})\in\mathcal{C}_{n}\}. On the set Gn𝒞nc(y1:n)G_{n}\cap\mathcal{C}^{c}_{n}(y_{1:n}), Zn(θ)Z_{n}(\theta) is bounded in probability, and we can bound DnD_{n} as follows:

\displaystyle D_{n}\geq\int_{G_{n}\cap\mathcal{C}^{c}_{n}(y_{1:n})}\exp\left\{-\frac{w_{n}n}{2}(\theta-T_{n})^{\prime}[\Sigma_{n}/n](\theta-T_{n})+w_{n}R_{n}(\theta)\right\}\mathrm{d}\Pi(\theta)
exp(2ω¯ntn)Π{Gn𝒞nc(y1:n)}\displaystyle\geq\exp\left(-2\overline{\omega}nt_{n}\right)\Pi\left\{G_{n}\cap\mathcal{C}^{c}_{n}(y_{1:n})\right\}
[Π(Gn)Π{Gn𝒞n(y1:n)}]exp(2ω¯ntn).\displaystyle\geq\left[\Pi(G_{n})-\Pi\left\{G_{n}\cap\mathcal{C}_{n}(y_{1:n})\right\}\right]\exp\left(-2\overline{\omega}nt_{n}\right). (17)

From the bound in (17), the remainder follows almost identically to Lemma 1 in Shen and Wasserman (2001). In particular,

𝔼P0{Π[Gn𝒞n(y1:n)]}\displaystyle\mathbb{E}_{P_{0}}\left\{\Pi\left[G_{n}\cap\mathcal{C}_{n}(y_{1:n})\right]\right\} =𝒴Θ𝕀{Gn𝒞n(y1:n)}dΠ(θ)dP0(y1:n)\displaystyle=\int_{\mathcal{Y}}\int_{\Theta}\mathbb{I}\{G_{n}\cap\mathcal{C}_{n}(y_{1:n})\}\mathrm{d}\Pi(\theta)\mathrm{d}P_{0}(y_{1:n})
=𝒴Θ𝕀{Gn}𝕀{𝒞n(y1:n)}dΠ(θ)dP0(y1:n)\displaystyle=\int_{\mathcal{Y}}\int_{\Theta}\mathbb{I}\{G_{n}\}\mathbb{I}\{\mathcal{C}_{n}(y_{1:n})\}\mathrm{d}\Pi(\theta)\mathrm{d}P_{0}(y_{1:n})
1ntnΠ(Gn),\displaystyle\leq\frac{1}{nt_{n}}\Pi(G_{n}),

where the last line follows from Markov’s inequality and the definition of \mathcal{C}_{n}(y_{1:n}). Lastly, we bound the probability that the lower bound on D_{n} fails:

P0{Dn12Π(Gn)e2w¯ntn}\displaystyle P_{0}\left\{D_{n}\leq\frac{1}{2}\Pi(G_{n})e^{-2\overline{w}nt_{n}}\right\} P0(e2w¯ntn[Π(Gn)Π{Gn𝒞n(y1:n)}]12Π(Gn)e2w¯ntn)\displaystyle\leq P_{0}\left(e^{-2\overline{w}nt_{n}}\left[\Pi(G_{n})-\Pi\left\{G_{n}\cap\mathcal{C}_{n}(y_{1:n})\right\}\right]\leq\frac{1}{2}\Pi(G_{n})e^{-2\overline{w}nt_{n}}\right)
=P0[Π{Gn𝒞n(y1:n)}12Π(Gn)]\displaystyle=P_{0}\left[\Pi\{G_{n}\cap\mathcal{C}_{n}(y_{1:n})\}\geq\frac{1}{2}\Pi(G_{n})\right]
\displaystyle\leq 2\,\mathbb{E}_{P_{0}}\Pi\left\{G_{n}\cap\mathcal{C}_{n}(y_{1:n})\right\}/\Pi(G_{n})
2ntn.\displaystyle\leq\frac{2}{nt_{n}}.

Hence, P0{Dn12Π(Gn)e2w¯ntn}12(ntn)1P_{0}\left\{D_{n}\geq\frac{1}{2}\Pi(G_{n})e^{-2\overline{w}nt_{n}}\right\}\geq 1-2(nt_{n})^{-1} and for ntnnt_{n}\rightarrow\infty, we have that

Dn12Π(Gn)e2w¯ntn\displaystyle D_{n}\geq\frac{1}{2}\Pi(G_{n})e^{-2\overline{w}nt_{n}} (18)

except on sets of P0P_{0}-probability converging to zero.  

Then, for GnG_{n} as in that result, for a sequence tn0t_{n}\rightarrow 0 with tnϵn2t_{n}\leq\epsilon_{n}^{2}, by Assumption 3 (wpc1)

Dn12Π(Gn)e2ntnwneC1ntne2ntnwne(C1+2w¯)ntn\displaystyle D_{n}\geq\frac{1}{2}\Pi(G_{n})e^{-2nt_{n}w_{n}}\gtrsim e^{-C_{1}nt_{n}}e^{-2nt_{n}w_{n}}\gtrsim e^{-(C_{1}+2\overline{w})nt_{n}} (19)

Part 2: N_{n}(A_{n}) Term. We show that P_{0}^{(n)}[N_{n}(A_{n})>e^{-M\overline{w}n\epsilon^{2}_{n}}]\leq e^{-C_{2}M\overline{w}nt_{n}} for n large enough.


Proof of Part 2: To bound Nn(An)=AndΠ(θ)Zn(θ)N_{n}(A_{n})=\int_{A_{n}}\mathrm{d}\Pi(\theta)Z_{n}(\theta) from above, use Assumption 1 to deduce that, for any positive ϵn\epsilon_{n},

\displaystyle\sup_{d(\theta,\theta_{\star})\geq\epsilon_{n}}n^{-1}\left\{S_{n}(\theta)-S_{n}(\theta_{\star})\right\}\leq 2\sup_{\theta\in\Theta}|S_{n}(\theta)/n-\mathcal{S}(\theta)|+\sup_{d(\theta,\theta_{\star})\geq\epsilon_{n}}\{\mathcal{S}(\theta)-\mathcal{S}(\theta_{\star})\}
\displaystyle\leq o_{p}(1)-C_{2}\epsilon^{2}_{n}. (20)

For some M>0M>0, we decompose Nn(An)N_{n}(A_{n}):

Nn(An)=Nn(AnΘn(M))+Nn(AnΘnc(M))=Nn(1)+Nn(2),N_{n}(A_{n})=N_{n}(A_{n}\cap\Theta_{n}(M))+N_{n}(A_{n}\cap\Theta_{n}^{c}(M))=N_{n}^{(1)}+N_{n}^{(2)},

where Θn(M)\Theta_{n}(M) is a compact set of θ\theta by construction and Θnc(M)\Theta_{n}^{c}(M) is its complement.

Then, Nn(1)eC2Mw¯nϵn2N_{n}^{(1)}\leq e^{-C_{2}M\overline{w}n\epsilon^{2}_{n}} from (20) due to the compactness of Θn(M)\Theta_{n}(M). For Nn(2)N_{n}^{(2)}, for a sequence tn0t_{n}\rightarrow 0 with tnϵn2t_{n}\leq\epsilon_{n}^{2},

P0(Nn(2)>eMw¯ntn/2)\displaystyle P_{0}(N_{n}^{(2)}>e^{-M\overline{w}nt_{n}/2})
\displaystyle\leq eMw¯ntn/2𝒴AnΘnc(M)exp{wnn2(θTn)[Σn/n](θTn)Rn(θ)}dΠ(θ)dP0(y1:n)\displaystyle e^{M\overline{w}nt_{n}/2}\int_{\mathcal{Y}}\int_{A_{n}\cap\Theta_{n}^{c}(M)}\exp\left\{-\frac{w_{n}n}{2}(\theta-T_{n})^{\prime}[\Sigma_{n}/n](\theta-T_{n})-R_{n}(\theta)\right\}\mathrm{d}\Pi(\theta)\mathrm{d}P_{0}(y_{1:n})
\displaystyle\leq eMω¯ntn/2Π{AnΘnc(M)}eMC2w¯ntn\displaystyle e^{M\overline{\omega}nt_{n}/2}\Pi\left\{A_{n}\cap\Theta_{n}^{c}(M)\right\}\leq e^{-MC_{2}\overline{w}nt_{n}} (21)

The first inequality follows from Markov’s inequality and the definition of Z_{n}(\theta), and the second and third inequalities are due to Assumptions 2 and 3, respectively.  

Using (19), we have the posterior bound

Πwn(An|y1:n)Nn(An)e(C1+2w¯)ntn\displaystyle\Pi_{w_{n}}(A_{n}|y_{1:n})\lesssim N_{n}(A_{n})e^{(C_{1}+2\overline{w})nt_{n}}

Combining Nn(1)eC2Mw¯nϵn2N_{n}^{(1)}\leq e^{-C_{2}M\overline{w}n\epsilon^{2}_{n}}, from (20), and equation (21), we can deduce that (wpc1)

Πwn(An|y1:n)Nn(An)e(C1+2w¯)ntne{MC2w¯C12w¯}nϵn2,\displaystyle\Pi_{w_{n}}(A_{n}|y_{1:n})\lesssim N_{n}(A_{n})e^{(C_{1}+2\overline{w})nt_{n}}\lesssim{e^{-\left\{MC_{2}\overline{w}-C_{1}-2\overline{w}\right\}n\epsilon^{2}_{n}}},

since t_{n}\leq\epsilon_{n}^{2}. The right-hand side vanishes for M large enough, and the result follows.

 

Proof of Lemma 3. The proof begins with an approach similar to that of Corollary 2.1 of Zhang and Gao (2020), but deviates in places given the non-likelihood-based nature of our problem. We also deviate from Zhang and Gao (2020) in using \mathbb{E}_{P_{0}}[f] to denote the expectation of f(\cdot) taken under P_{0}, rather than the notation P_{0}[f].

First, we note that Lemma B.2 in the supplement to Zhang and Gao (2020) (reproduced for convenience as Lemma 4 in Appendix A.5) can be directly applied in this setting: for any a>0a>0 and n1n\geq 1, given observations y1:ny_{1:n},

aQ^{d(θ,θ)}D[Q^Πwn(|y1:n)]+logΠwn(exp{ad(θ,θ)}|y1:n).a\widehat{Q}\left\{d(\theta,\theta_{\star})\right\}\leq\text{D}\left[\widehat{Q}\|\Pi_{w_{n}}\left(\cdot|y_{1:n}\right)\right]+\log\Pi_{w_{n}}\left(\exp\left\{ad(\theta,\theta_{\star})\right\}|y_{1:n}\right).

Similarly, for any Q𝒬Q\in\mathcal{Q}, D[QΠwn(|y1:n)]+op(1)D[Q^Πwn(|y1:n)]\text{D}\left[{Q}\|\Pi_{w_{n}}(\cdot|y_{1:n})\right]+o_{p}(1)\geq\text{D}\left[\widehat{Q}\|\Pi_{w_{n}}(\cdot|y_{1:n})\right] by construction, so that

aQ^{d(θ,θ)}infQ𝒬D[QΠwn(|y1:n)]+logΠwn(exp{ad(θ,θ)}|y1:n).a\widehat{Q}\left\{d(\theta,\theta_{\star})\right\}\leq\inf_{Q\in\mathcal{Q}}\text{D}\left[{Q}\|\Pi_{w_{n}}\left(\cdot|y_{1:n}\right)\right]+\log\Pi_{w_{n}}\left(\exp\left\{ad(\theta,\theta_{\star})\right\}|y_{1:n}\right). (22)

Taking expectations on both sides of (22), and re-arranging terms, yields

𝔼P0Q^{d(θ,θ)}\displaystyle\mathbb{E}_{P_{0}}\widehat{Q}\left\{d(\theta,\theta_{\star})\right\} 1a𝔼P0{infQ𝒬D[QΠwn(|y1:n)]+logΠwn(exp{ad(θ,θ)}|y1:n)}\displaystyle\leq\frac{1}{a}\mathbb{E}_{P_{0}}\left\{\inf_{Q\in\mathcal{Q}}\text{D}\left[{Q}\|\Pi_{w_{n}}\left(\cdot|y_{1:n}\right)\right]+\log\Pi_{w_{n}}\left(\exp\left\{ad(\theta,\theta_{\star})\right\}|y_{1:n}\right)\right\}
1anκn2+1alog[𝔼P0Πwn(exp{ad(θ,θ)}|y1:n)],\displaystyle\leq\frac{1}{a}n\kappa^{2}_{n}+\frac{1}{a}\log\left[\mathbb{E}_{P_{0}}\Pi_{w_{n}}\left(\exp\left\{ad(\theta,\theta_{\star})\right\}|y_{1:n}\right)\right], (23)

where the second inequality follows from the definition of κn2\kappa_{n}^{2}, and Jensen’s inequality.

The second term in equation (23) is bounded by applying Lemma 2. In particular, for all αα0>0\alpha\geq\alpha_{0}>0 and any 0<a12c10<a\leq\frac{1}{2}c_{1}, c1>0c_{1}>0, by Lemma 2 (wpc1),

\Pi_{w_{n}}\left(\exp\left\{ad(\theta,\theta_{\star})\right\}>\alpha|y_{1:n}\right)\lesssim\exp\left(-ac_{1}\alpha\right).

Then, appealing to Lemma B.4 in the supplement to Zhang and Gao (2020) (reproduced for convenience as Lemma 5 in Appendix A.5) we obtain

𝔼P0Πwn(exp{ad(θ,θ)}|y1:n)exp(ac1α0),\mathbb{E}_{P_{0}}\Pi_{w_{n}}\left(\exp\left\{ad(\theta,\theta_{\star})\right\}|y_{1:n}\right)\lesssim\exp\left(ac_{1}\alpha_{0}\right),

for all amin{c1,1}a\leq\min\{c_{1},1\}. Taking a=min{c1,1}a=\min\{c_{1},1\} and α0=nϵn2\alpha_{0}=n\epsilon_{n}^{2}, and applying the above in equation (23), yields, for some M>0M>0,

𝔼P0Q^{d(θ,θ)}\displaystyle\mathbb{E}_{P_{0}}\widehat{Q}\left\{d(\theta,\theta_{\star})\right\} nκn2+log(c1+eac1nϵn2)anκn2+nϵn2+o(1)n(κn2+ϵn2)+o(1).\displaystyle\leq\frac{n\kappa_{n}^{2}+\log\left(c_{1}+e^{ac_{1}n\epsilon_{n}^{2}}\right)}{a}\lesssim{n\kappa_{n}^{2}}+n\epsilon_{n}^{2}+o(1)\lesssim n\left(\kappa_{n}^{2}+\epsilon_{n}^{2}\right)+o(1).

The stated result then follows from Markov’s inequality,

𝔼P0Q^(d(θ,θ)>Mnn(ϵn2+κn2))𝔼P0Q^{d(θ,θ)}Mnn(ϵn2+κn2)Mn10.\mathbb{E}_{P_{0}}\widehat{Q}\left(d(\theta,\theta_{\star})>M_{n}n\left(\epsilon_{n}^{2}+\kappa_{n}^{2}\right)\right)\leq\frac{\mathbb{E}_{P_{0}}\widehat{Q}\left\{d(\theta,\theta_{\star})\right\}}{M_{n}n\left(\epsilon_{n}^{2}+\kappa_{n}^{2}\right)}\lesssim M_{n}^{-1}\rightarrow 0.

 

A.4 Proofs of Theorems in Section 3.2

Using Lemmas 2 and 3, we can now prove Theorems 1 and 2 stated in the main text. A simple proof of Theorem 1 can be given by first proving Theorem 2. Therefore, we first prove Theorem 2.


Proof of Theorem 2. For probability measures PP and QQ, let dH(P,Q)d_{H}(P,Q) denote the Hellinger distance between PP and QQ. Consider a positive sequence ϵn>0\epsilon_{n}>0, and define the set A(ϵn):={Pθ(n)𝒫,θΘ:dH2(Pθ(n),P(n))ϵn}A(\epsilon_{n}):=\{P^{(n)}_{\theta}\in\mathcal{P},\;\theta\in\Theta:d_{H}^{2}(P^{(n)}_{\theta},P_{\star}^{(n)})\geq\epsilon_{n}\}. By convexity of dH2(,)d_{H}^{2}(\cdot,\cdot) in the first argument and Jensen’s inequality,

dH2(PQ(n),P(n))\displaystyle d^{2}_{H}(P^{(n)}_{Q},P^{(n)}_{\star}) ΘdH2(Pθ(n),P(n))dQ^(θ)\displaystyle\leq\int_{\Theta}d_{H}^{2}(P^{(n)}_{\theta},P^{(n)}_{\star})\mathrm{d}\widehat{Q}(\theta)
=Ac(ϵn2)dH2(Pθ(n),P(n))dQ^(θ)+A(ϵn2)dH2(Pθ(n),P(n))dQ^(θ)\displaystyle=\int_{A^{c}(\epsilon^{2}_{n})}d_{H}^{2}(P^{(n)}_{\theta},P^{(n)}_{\star})\mathrm{d}\widehat{Q}(\theta)+\int_{A(\epsilon^{2}_{n})}d_{H}^{2}(P^{(n)}_{\theta},P^{(n)}_{\star})\mathrm{d}\widehat{Q}(\theta)
ϵn2+2Q^{A(ϵn2)}.\displaystyle\leq\epsilon^{2}_{n}+\sqrt{2}\widehat{Q}\{A(\epsilon^{2}_{n})\}.

By Lemma 3, for any ϵn\epsilon_{n} such that nϵn2n\epsilon_{n}^{2}\rightarrow\infty, Q^{A(ϵn)}=op(1)\widehat{Q}\{A(\epsilon_{n})\}=o_{p}(1), so that dH2(PQ(n),P(n))ϵn2d^{2}_{H}(P^{(n)}_{Q},P^{(n)}_{\star})\leq\epsilon^{2}_{n} (wpc1). Using the above and the definition dH(PQ(n),P(n))=dH2(PQ(n),P(n)),d_{H}(P^{(n)}_{Q},P^{(n)}_{\star})=\sqrt{d^{2}_{H}(P^{(n)}_{Q},P^{(n)}_{\star})}, we can conclude that, with probability converging to one,

dH(PQ(n),P(n))ϵn.d_{H}(P^{(n)}_{Q},P^{(n)}_{\star})\leq\epsilon_{n}.

This yields the first stated result. Applying the relationship between total variation and Hellinger distance yields the second: d_{TV}(P^{(n)}_{Q},P^{(n)}_{\star})\leq\sqrt{2}d_{H}(P^{(n)}_{Q},P^{(n)}_{\star}).  

Proof of Theorem 1. A similar argument to the proof of Theorem 2 yields

dH2(PΠwn(n),P(n))\displaystyle d^{2}_{H}(P^{(n)}_{\Pi_{w_{n}}},P^{(n)}_{\star}) ΘdH2(Pθ(n),P(n))dΠwn(θ|𝐲)ϵn2+2Πwn{A(ϵn2)|𝐲}ϵn2,\displaystyle\leq\int_{\Theta}d_{H}^{2}(P^{(n)}_{\theta},P^{(n)}_{\star})\mathrm{d}\Pi_{w_{n}}(\theta|\mathbf{y})\leq\epsilon^{2}_{n}+\sqrt{2}\Pi_{w_{n}}\{A(\epsilon^{2}_{n})|\mathbf{y}\}\leq\epsilon^{2}_{n},

where the last inequality follows from the convergence in Lemma 2 and holds wpc1. Applying the triangle inequality, and the relationship between the Hellinger distance and its square, we see that (wpc1)

\begin{aligned}
d_{H}(P^{(n)}_{Q},P^{(n)}_{\Pi_{w_{n}}})&\leq d_{H}(P^{(n)}_{Q},P^{(n)}_{\star})+d_{H}(P^{(n)}_{\Pi_{w_{n}}},P^{(n)}_{\star})\\
&=\sqrt{d^{2}_{H}(P^{(n)}_{Q},P^{(n)}_{\star})}+\sqrt{d^{2}_{H}(P^{(n)}_{\Pi_{w_{n}}},P^{(n)}_{\star})}\\
&\leq 2\epsilon_{n}.
\end{aligned}

The above holds for any $\epsilon_{n}\rightarrow 0$ with $n\epsilon_{n}^{2}\rightarrow\infty$, and we can conclude that $d_{TV}(P^{(n)}_{Q},P^{(n)}_{\Pi_{w_{n}}})\leq\sqrt{2}d_{H}(P^{(n)}_{Q},P^{(n)}_{\Pi_{w_{n}}})=o_{p}(1)$.

 

A.5 Additional Lemmas

The following lemmas from Zhang and Gao (2020) are used in the proof of Lemma 3, and are reproduced here, without proof, to aid the reader. We again use $\mathbb{E}_{P_{0}}$ to denote expectations taken under $P_{0}$.

Lemma 4

For Q^\widehat{Q} in Definition 1, we have

𝔼P0Q^L(Pθ(n),P0(n))\displaystyle\mathbb{E}_{P_{0}}\widehat{Q}L(P_{\theta}^{(n)},P_{0}^{(n)})
\displaystyle\leq infa>01a(infQ𝒮𝔼P0D(QΠ(|X(n)))+log𝔼P0Π(exp(aL(Pθ(n),P0(n)))|X(n))).\displaystyle\inf_{a>0}\frac{1}{a}\left(\inf_{Q\in\mathcal{S}}\mathbb{E}_{P_{0}}D(Q\|\Pi(\cdot|X^{(n)}))+\log\mathbb{E}_{P_{0}}\Pi(\exp(aL(P_{\theta}^{(n)},P_{0}^{(n)}))|X^{(n)})\right).
Lemma 5

Suppose the random variable XX satisfies

(Xt)c1exp(c2t),\mathbb{P}(X\geq t)\leq c_{1}\exp(-c_{2}t),

for all tt0>0t\geq t_{0}>0. Then, for any 0<a12c20<a\leq\frac{1}{2}c_{2},

𝔼exp(aX)exp(at0)+c1.\mathbb{E}\exp(aX)\leq\exp(at_{0})+c_{1}.
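For completeness, a short verification of Lemma 5 (our sketch; Zhang and Gao, 2020, state the result without proof) writes the expectation as a tail integral:

```latex
\mathbb{E}\exp(aX)=\int_{0}^{\infty}\mathbb{P}\left(\exp(aX)\geq s\right)\mathrm{d}s
\leq\exp(at_{0})+\int_{\exp(at_{0})}^{\infty}c_{1}s^{-c_{2}/a}\,\mathrm{d}s
=\exp(at_{0})+\frac{c_{1}\exp\left(at_{0}(1-c_{2}/a)\right)}{c_{2}/a-1},
```

where the inequality bounds the probability by one for $s\leq\exp(at_{0})$, and uses $\mathbb{P}(X\geq\log s/a)\leq c_{1}s^{-c_{2}/a}$ for $s\geq\exp(at_{0})$. Since $0<a\leq\frac{1}{2}c_{2}$ implies $c_{2}/a-1\geq 1$, and $at_{0}(1-c_{2}/a)\leq 0$, the last term is at most $c_{1}$.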

References

  • Alquier (2021) Pierre Alquier. User-friendly introduction to PAC-Bayes bounds. arXiv preprint arXiv:2110.11216, 2021.
  • Alquier and Ridgway (2020) Pierre Alquier and James Ridgway. Concentration of tempered posteriors and of their variational approximations. Annals of Statistics, 48(3):1475–1497, 2020.
  • Alquier et al. (2016) Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of variational approximations of Gibbs posteriors. The Journal of Machine Learning Research, 17(1):8374–8414, 2016.
  • Antoniano-Villalobos and Walker (2016) Isadora Antoniano-Villalobos and Stephen G Walker. A nonparametric model for stationary time series. Journal of Time Series Analysis, 37(1):126–142, 2016.
  • Bassetti et al. (2018) Federico Bassetti, Roberto Casarin, and Francesco Ravazzolo. Bayesian nonparametric calibration and combination of predictive distributions. Journal of the American Statistical Association, 113(522):675–685, 2018.
  • Baştürk et al. (2019) N Baştürk, A. Borowska, S. Grassi, L. Hoogerheide, and H.K. van Dijk. Forecast density combinations of dynamic models and data driven portfolio strategies. Journal of Econometrics, 210(1):170–186, 2019.
  • Billio et al. (2013) M Billio, R Casarin, F Ravazzolo, and H.K. van Dijk. Time-varying combinations of predictive densities using nonlinear filtering. Journal of Econometrics, 177(2):213–232, 2013.
  • Bissiri et al. (2016) Pier Giovanni Bissiri, Chris C Holmes, and Stephen G Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016.
  • Blackwell and Dubins (1962) David Blackwell and Lester Dubins. Merging of opinions with increasing information. The Annals of Mathematical Statistics, 33(3):882–886, 1962.
  • Blei et al. (2017) David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • Casarin et al. (2015) Roberto Casarin, Fabrizio Leisen, German Molina, and Enrique ter Horst. A Bayesian beta Markov random field calibration of the term structure of implied risk neutral densities. Bayesian Analysis, 10(4):791–819, 2015.
  • Chernozhukov and Hong (2003) Victor Chernozhukov and Han Hong. An MCMC approach to classical estimation. Journal of Econometrics, 115(2):293–346, 2003.
  • Diks et al. (2011) Cees Diks, Valentyn Panchenko, and Dick Van Dijk. Likelihood-based scoring rules for comparing density forecasts in tails. Journal of Econometrics, 163(2):215–230, 2011.
  • Doornik et al. (2020) Jurgen A Doornik, Jennifer L Castle, and David F Hendry. Card forecasts for M4. International Journal of Forecasting, 36(1):129–134, 2020.
  • Douc et al. (2013) Randal Douc, Paul Doukhan, and Eric Moulines. Ergodicity of observation-driven time series models and consistency of the maximum likelihood estimator. Stochastic Processes and their Applications, 123(7):2620–2647, 2013.
  • Fiorucci and Louzada (2020) Jose Augusto Fiorucci and Francisco Louzada. Groec: Combination method via generalized rolling origin evaluation. International Journal of Forecasting, 36(1):105–109, 2020.
  • Frazier et al. (2019) David T Frazier, Worapree Maneesoonthorn, Gael M Martin, and Brendan PM McCabe. Approximate Bayesian forecasting. International Journal of Forecasting, 35(2):521–539, 2019.
  • Frazier et al. (2021) David T Frazier, Ruben Loaiza-Maya, and Gael M Martin. A note on the accuracy of variational Bayes in state space models: Inference and prediction. arXiv preprint arXiv:2106.12262, 2021.
  • Ghosal and Van der Vaart (2017) Subhashis Ghosal and Aad Van der Vaart. Fundamentals of Nonparametric Bayesian Inference, volume 44. Cambridge University Press, 2017.
  • Giummolè et al. (2017) Federica Giummolè, Valentina Mameli, Erlis Ruli, and Laura Ventura. Objective Bayesian inference with proper scoring rules. TEST, pages 1–28, 2017.
  • Gneiting and Raftery (2007) Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
  • Gneiting et al. (2007) Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268, 2007.
  • Hernández-Lobato and Adams (2015) José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869. PMLR, 2015.
  • Holmes and Walker (2017) CC Holmes and SG Walker. Assigning a value to a power likelihood in a general Bayesian model. Biometrika, 104(2):497–503, 2017.
  • Jiang and Tanner (2008) Wenxin Jiang and Martin A. Tanner. Gibbs posterior for variable selection in high-dimensional classification and data-mining. Annals of Statistics, 36(5):2207–2231, 2008.
  • Kingma and Welling (2019) Diederik P Kingma and Max Welling. An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691, 2019.
  • Knoblauch et al. (2019) Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Generalized variational inference: Three arguments for deriving new posteriors. arXiv preprint arXiv:1904.02063, 2019.
  • Kwiatkowski et al. (1992) Denis Kwiatkowski, Peter C.B. Phillips, Peter Schmidt, and Yongcheol Shin. Testing the null hypothesis of stationarity against the alternative of a unit root: How sure are we that economic time series have a unit root? Journal of Econometrics, 54(1):159–178, 1992.
  • Lee and Hansen (1994) Sang-Won Lee and Bruce E Hansen. Asymptotic theory for the GARCH(1, 1) quasi-maximum likelihood estimator. Econometric Theory, 10(1):29–52, 1994.
  • Loaiza-Maya et al. (2020) Ruben Loaiza-Maya, Gael M Martin, and David T Frazier. Focused Bayesian prediction. Forthcoming. Journal of Applied Econometrics, 2020.
  • Lyddon et al. (2019) SP Lyddon, CC Holmes, and SG Walker. General Bayesian updating and the loss-likelihood bootstrap. Biometrika, 106(2):465–478, 2019.
  • Makridakis et al. (2020) Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The M4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1):54–74, 2020.
  • Martin et al. (2022) Gael M Martin, Rubén Loaiza-Maya, Worapree Maneesoonthorn, David T Frazier, and Andrés Ramírez-Hassan. Optimal probabilistic forecasts: When do they work? International Journal of Forecasting, 38(1):384–406, 2022.
  • McAlinn and West (2019) Kenichiro McAlinn and Mike West. Dynamic Bayesian predictive synthesis in time series forecasting. Journal of Econometrics, 210(1):155–169, 2019.
  • McAlinn et al. (2020) Kenichiro McAlinn, Knut Are Aastveit, Jouchi Nakajima, and Mike West. Multivariate Bayesian predictive synthesis in macroeconomic forecasting. Journal of the American Statistical Association, 115(531):1092–1110, 2020.
  • Miller (2021) Jeffrey W Miller. Asymptotic normality, concentration, and coverage of generalized posteriors. Journal of Machine Learning Research, 22(168):1–53, 2021.
  • Miller and Dunson (2019) Jeffrey W Miller and David B Dunson. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527):1113–1125, 2019.
  • Montero-Manso et al. (2020) Pablo Montero-Manso, George Athanasopoulos, Rob J Hyndman, and Thiyanga S Talagala. FFORMA: Feature-based forecast model averaging. International Journal of Forecasting, 36(1):86–92, 2020.
  • Pacchiardi and Dutta (2021) Lorenzo Pacchiardi and Ritabrata Dutta. Generalized Bayesian likelihood-free inference using scoring rules estimators. arXiv preprint arXiv:2104.03889, 2021.
  • Petropoulos and Svetunkov (2020) Fotios Petropoulos and Ivan Svetunkov. A simple combination of univariate models. International Journal of Forecasting, 36(1):110–115, 2020.
  • Pettenuzzo and Ravazzolo (2016) Davide Pettenuzzo and Francesco Ravazzolo. Optimal portfolio choice under decision-based model combinations. Journal of Applied Econometrics, 31(7):1312–1332, 2016.
  • Quiroz et al. (2018) Matias Quiroz, David J Nott, and Robert Kohn. Gaussian variational approximation for high-dimensional state space models. arXiv preprint arXiv:1801.07873, 2018.
  • Shen and Wasserman (2001) Xiaotong Shen and Larry Wasserman. Rates of convergence of posterior distributions. The Annals of Statistics, 29(3):687–714, 2001.
  • Smyl (2020) Slawek Smyl. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting, 36(1):75–85, 2020.
  • Syring and Martin (2019) Nicholas Syring and Ryan Martin. Calibrating general posterior credible regions. Biometrika, 106(2):479–486, 2019.
  • Syring and Martin (2020) Nicholas Syring and Ryan Martin. Gibbs posterior concentration rates under sub-exponential type losses. arXiv preprint arXiv:2012.04505, 2020.
  • Teräsvirta (1994) Timo Teräsvirta. Specification, estimation, and evaluation of smooth transition autoregressive models. Journal of the American Statistical Association, 89(425):208–218, 1994.
  • Tran et al. (2017) Minh-Ngoc Tran, David J Nott, and Robert Kohn. Variational Bayes with intractable likelihood. Journal of Computational and Graphical Statistics, 26(4):873–882, 2017.
  • van der Vaart and Wellner (1996) AW van der Vaart and Jon Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics, 1996.
  • Wu and Martin (2021) Pei-Shien Wu and Ryan Martin. Calibrating generalized predictive distributions. arXiv preprint arXiv:2107.01688, 2021.
  • Yang et al. (2020) Yun Yang, Debdeep Pati, Anirban Bhattacharya, et al. Alpha-variational inference with statistical guarantees. Annals of Statistics, 48(2):886–905, 2020.
  • Zhang and Gao (2020) Fengshuo Zhang and Chao Gao. Convergence rates of variational posterior distributions. Annals of Statistics, 48(4):2180–2207, 2020.
  • Zhang (2006a) Tong Zhang. From ε-entropy to KL-entropy: analysis of minimum information complexity density estimation. Annals of Statistics, 34:2180–2210, 2006a.
  • Zhang (2006b) Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006b.

Supplementary material for “Loss-Based Variational Bayes Prediction”

S1 Details for the Implementation of all Variational Approximations

S1.1 General Notational Matters

Throughout this appendix we will employ the following notation. All gradients are column vectors and their notation starts with the symbol \nabla. For two generic matrices Ad1×d2A_{d_{1}\times d_{2}} and Bd3×d4B_{d_{3}\times d_{4}} we have that

AB=vec(A)vec(B),\frac{\partial A}{\partial B}=\frac{\partial\text{vec}(A)}{\partial\text{vec}(B)},

where vec is the vectorization operation and AB\frac{\partial A}{\partial B} is a matrix of dimension (d1d2)×(d3d4)(d_{1}d_{2})\times(d_{3}d_{4}). Throughout this appendix scalars are treated as matrices of dimension 1×11\times 1.

The variational approximation is the mean-field Gaussian $q_{\lambda}(\theta)=N(\theta;\mu,D^{2})$, with $D=\text{diag}(d)$ and $\lambda=(\mu^{\prime},d^{\prime})^{\prime}$. The gradient of the log density of the approximation is computed as

θlogqλ(θ)=(D2)1(θμ).\nabla_{\theta}\log q_{\lambda}\left({\theta}\right)=-\left(D^{2}\right)^{-1}\left({\theta}-{\mu}\right).

To compute θλ\frac{\partial{\theta}}{\partial{\lambda}} note that

θλ=[θμ,θd],\frac{\partial{\theta}}{\partial{\lambda}}=\left[\frac{\partial{\theta}}{\partial{\mu}},\frac{\partial{\theta}}{\partial{d}}\right],

where

\frac{\partial\theta}{\partial\mu}=I_{r},\qquad\frac{\partial\theta}{\partial d}=\text{diag}(\varepsilon).
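As a concrete check of these expressions, the following sketch (ours, not part of the paper's code; it assumes the mean-field Gaussian family $q_{\lambda}=N(\mu,D^{2})$ with $D=\text{diag}(d)$, and all names are of our choosing) evaluates $\nabla_{\theta}\log q_{\lambda}(\theta)$ and $\partial\theta/\partial\lambda$:

```python
import numpy as np

def grad_log_q(theta, mu, d):
    """Gradient of log N(theta; mu, D^2) with D = diag(d): -(D^2)^{-1}(theta - mu)."""
    return -(theta - mu) / d**2

def dtheta_dlambda(eps, r):
    """Jacobian of theta = mu + d * eps with respect to lambda = (mu', d')':
    the r x 2r matrix [I_r, diag(eps)]."""
    return np.hstack([np.eye(r), np.diag(eps)])

# a single draw from q_lambda via the reparameterization theta = mu + d * eps
rng = np.random.default_rng(0)
r = 4
mu, d = np.zeros(r), np.ones(r)
eps = rng.standard_normal(r)
theta = mu + d * eps
```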

S2 Stochastic Gradient Ascent

The optimization problem in (13) is solved via stochastic gradient ascent (SGA). SGA maximizes $\mathcal{L}(\lambda)$ by first initializing the variational parameters at some vector of values $\lambda^{(0)}$, and then sequentially iterating over

\lambda^{(i+1)}=\lambda^{(i)}+\Delta\lambda^{(i+1)}\left[\widehat{\nabla_{\lambda}\mathcal{L}\left(\lambda^{(i)}\right)},\rho\right].

The step size $\Delta\lambda^{(i+1)}$ is a function of an unbiased estimate of the ELBO gradient, $\widehat{\nabla_{\lambda}\mathcal{L}(\lambda^{(i)})}$, and of the set of tuning parameters that control the learning rates in the optimization problem, denoted by $\rho$. Throughout this paper we employ the ADADELTA method of Zeiler (2012) to update $\Delta\lambda^{(i+1)}$.
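A minimal sketch of the ADADELTA update (ours, following Zeiler, 2012; here the decay rate `rho` and perturbation `eps` play the role of the tuning parameters $\rho$):

```python
import numpy as np

class Adadelta:
    """ADADELTA step size for stochastic gradient ascent (Zeiler, 2012)."""

    def __init__(self, dim, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.Eg2 = np.zeros(dim)   # running average of squared gradients
        self.Edx2 = np.zeros(dim)  # running average of squared steps

    def step(self, grad):
        """Return the increment added to lambda, given an unbiased gradient estimate."""
        self.Eg2 = self.rho * self.Eg2 + (1 - self.rho) * grad**2
        dx = np.sqrt(self.Edx2 + self.eps) / np.sqrt(self.Eg2 + self.eps) * grad
        self.Edx2 = self.rho * self.Edx2 + (1 - self.rho) * dx**2
        return dx
```

so that, when ascending, the iteration reads `lam = lam + opt.step(grad_hat)`.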

Key to achieving fast maximization of (λ)\mathcal{L}(\lambda) via SGA is the use of a variance-reduction technique in producing the unbiased gradient estimate λ(λ(i))^\widehat{\nabla_{\lambda}\mathcal{L}\left({\lambda}^{(i)}\right)}. Here, we follow Kingma and Welling (2014) and Rezende et al. (2014), and make use of the ‘reparameterization trick’. In this approach, a draw θ{\theta} from qλq_{\lambda} is written as a direct function of λ{\lambda}, and a set of random variables ε{\varepsilon} that are invariant with respect to λ{\lambda}. For the mean-field approximation used for this example we can write θ(λ,ε)=μ+dε{\theta}\left({\lambda},{\varepsilon}\right)=\mu+{d}\circ{\varepsilon}, with εN(04,I4){\varepsilon}\sim N\left({0}_{4},I_{4}\right). This reparametrization allows us to re-write (λ)\mathcal{L}(\lambda) as

(λ)=𝔼ε[wSnj(θ(λ,ε))+logπ(θ(λ,ε))logqλ(θ(λ,ε))]\mathcal{L}(\lambda)=\mathbb{E}_{{\varepsilon}}\left[wS_{n}^{j}\left({\ \theta}\left({\lambda},{\varepsilon}\right)\right)+\log\pi\left({\ \theta}\left({\lambda},{\varepsilon}\right)\right)-\log q_{\lambda}\left({\theta}\left({\lambda},{\varepsilon}\right)\right)\right] (S2.1)

and its gradient as

λ(λ)=𝔼ε(θ(λ,ε)λ[wθSnj(θ(λ,ε))+θlogπ(θ(λ,ε))θlogqλ(θ(λ,ε))]).\nabla_{\lambda}\mathcal{L}(\lambda)=\mathbb{E}_{{\varepsilon}}\left(\frac{\partial{\theta}\left({\lambda},{\varepsilon}\right)}{\partial{\ \lambda}}^{\prime}\left[w\nabla_{\theta}S_{n}^{j}\left({\theta}\left({\lambda},{\varepsilon}\right)\right)+\nabla_{\theta}\log\pi\left({\ \theta}\left({\lambda},{\varepsilon}\right)\right)-\nabla_{\theta}\log q_{\lambda}\left({\theta}\left({\lambda},{\varepsilon}\right)\right)\right]\right). (S2.2)

A low-variance unbiased estimate of λ(λ)\nabla_{\lambda}\mathcal{L}(\lambda) can be constructed by numerically computing the expectation in (S2.2) using the (available) closed-form expressions for the derivatives θ(λ,ε)λ\frac{\partial{\theta}\left({\lambda},{\varepsilon}\right)}{\partial{\lambda}}, θlogqλ(θ(λ,ε))\nabla_{\theta}\log q_{\lambda}\left({\theta}\left({\lambda},{\varepsilon}\right)\right), θlogπ(θ(λ,ε))\nabla_{\theta}\log\pi\left({\theta}\left({\lambda},{\varepsilon}\right)\right), and θSnj(θ(λ,ε))\nabla_{\theta}S_{n}^{j}\left({\theta}\left({\lambda},{\varepsilon}\right)\right) for any jj. We follow Kingma and Welling (2019) and use a single draw of ε\varepsilon for the construction of an estimate of (S2.2). The required expressions for the GARCH model are provided in Appendix S3.
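The single-draw estimator of (S2.2) can be sketched as follows (our illustration, not the paper's code: a toy $N(2,1)$ log-kernel stands in for $wS_{n}^{j}(\theta)+\log\pi(\theta)$, and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_elbo_one_draw(lam, grad_target, r):
    """Single-draw reparameterization estimate of the ELBO gradient in (S2.2).

    lam stacks (mu, d); theta = mu + d * eps with eps ~ N(0, I_r);
    grad_target(theta) stands in for w * grad S_n^j(theta) + grad log pi(theta).
    """
    mu, d = lam[:r], lam[r:]
    eps = rng.standard_normal(r)
    theta = mu + d * eps
    grad_log_q = -(theta - mu) / d**2            # nabla_theta log q_lambda(theta)
    inner = grad_target(theta) - grad_log_q      # bracketed term in (S2.2)
    jac = np.hstack([np.eye(r), np.diag(eps)])   # d theta / d lambda
    return jac.T @ inner

# toy target: gradient of the log N(2, 1) kernel
grad_target = lambda theta: -(theta - 2.0)
```

A useful sanity check is that at the exact optimum ($\mu=2$, $d=1$ componentwise for this toy target) the pathwise gradient vanishes for every draw, not just in expectation.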

S3 The GARCH Predictive Class (Section 4)

The predictive class, $\mathcal{P}^{(t)}$, is defined by a generalized autoregressive conditional heteroscedastic, GARCH(1,1), model with Gaussian errors: $Y_{t}=\theta_{1}^{r}+\sigma_{t}\varepsilon_{t}$, $\varepsilon_{t}\overset{i.i.d.}{\sim}N\left(0,1\right)$, $\sigma_{t}^{2}=\theta_{2}^{r}+\theta_{3}^{r}\left(Y_{t-1}-\theta_{1}^{r}\right)^{2}+\theta_{4}^{r}\sigma_{t-1}^{2}$, with $\theta=\left(\theta_{1},\theta_{2},\theta_{3},\theta_{4}\right)^{\prime}=\left(\theta_{1}^{r},\log(\theta_{2}^{r}),\Phi_{1}^{-1}\left(\theta_{3}^{r}\right),\Phi_{1}^{-1}\left(\theta_{4}^{r}\right)\right)^{\prime}$. Note that, since $\theta_{1}=\theta_{1}^{r}$, the two symbols are used interchangeably throughout this section.
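The variance recursion of this class can be sketched as follows (our code; `theta_r` collects the constrained parameters $(\theta_{1}^{r},\theta_{2}^{r},\theta_{3}^{r},\theta_{4}^{r})$, and `sigma2_init` is an assumed value for $\sigma_{0}^{2}$):

```python
import numpy as np

def garch_sigma2(y, theta_r, sigma2_init=1.0):
    """Conditional variances of the Gaussian GARCH(1,1) predictive class.

    Entry t is sigma_{t+1}^2 = theta2 + theta3 * (y_t - theta1)^2 + theta4 * sigma_t^2,
    i.e. the one-step-ahead variance implied after observing y[t]."""
    th1, th2, th3, th4 = theta_r
    s2 = [sigma2_init]
    for yt in y:
        s2.append(th2 + th3 * (yt - th1)**2 + th4 * s2[-1])
    return np.array(s2[1:])
```

With the squared shocks held at zero, the recursion contracts to the fixed point $\theta_{2}^{r}/(1-\theta_{4}^{r})$ whenever $\theta_{4}^{r}<1$.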

S3.1 Priors

We employ the following priors for each of the parameters of the model:

p(θ1)1,p(θ2r)1θ2rI(θ2r>0),θ3rU(0,1),   and θ4rU(0,1).p(\theta_{1})\propto 1,\hskip 28.45274ptp({\theta_{2}^{r}})\propto\frac{1}{{\theta_{2}^{r}}}I({\theta_{2}^{r}}>0),\hskip 28.45274pt{\theta_{3}^{r}}\sim U(0,1)\text{,\ \ \ and }\hskip 14.22636pt{\theta_{4}^{r}}\sim U(0,1).

For the implementation of variational inference, all the parameters are transformed onto the real line as follows:

(i) θ2r is transformed to θ2=log(θ2r);(ii) θ3r is transformed to θ3=Φ11(θ3r);\displaystyle\text{(i) }{\theta_{2}^{r}}\text{ is transformed to }{\theta_{2}}=\log({\theta_{2}^{r}});\hskip 28.45274pt\text{(ii) }{\theta_{3}^{r}}\text{ is transformed to }{\theta_{3}}=\Phi_{1}^{-1}\left({\theta_{3}^{r}}\right);
(iii) θ4r is transformed to θ4=Φ11(θ4r).\displaystyle\text{(iii) }{\theta_{4}^{r}}\text{ is transformed to }{\theta_{4}}=\Phi_{1}^{-1}\left({\theta_{4}^{r}}\right).

After applying these transformations, we have that the prior densities are:

(i) p(θ1)1;(ii) p(θ2)1;(iii) p(θ3)=ϕ1(θ3);(iv) p(θ4)=ϕ1(θ4).\text{(i) }p({\theta_{1}})\propto 1;\hskip 28.45274pt\text{(ii) }p({\theta_{2}})\propto 1;\hskip 28.45274pt\text{(iii) }p({\theta_{3}})=\phi_{1}\left({\theta_{3}}\right);\hskip 18.49411pt\text{(iv) }p({\theta_{4}})=\phi_{1}\left({\theta_{4}}\right).

The gradient of the logarithm of the prior is θlogp(θ)=(0,0,θ3,θ4)\nabla_{\theta}\log p\left({\theta}\right)=\left(0,0,-{\theta_{3}},-\theta_{4}\right)^{{}^{\prime}}.
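The transformations and the log-prior gradient can be sketched as (our code; `Phi` denotes $\Phi_{1}$, which we take to be the standard normal cdf):

```python
import math

def Phi(x):
    """Standard normal cdf via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def to_constrained(theta):
    """Map (theta1, theta2, theta3, theta4) back to the original GARCH scale."""
    t1, t2, t3, t4 = theta
    return (t1, math.exp(t2), Phi(t3), Phi(t4))

def log_prior_grad(theta):
    """Gradient of the log prior on the transformed scale: (0, 0, -theta3, -theta4)."""
    t1, t2, t3, t4 = theta
    return (0.0, 0.0, -t3, -t4)
```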

S3.2 Derivation of θSn(θ)\nabla_{\theta}S_{n}\left({\theta}\right) for all scoring rules

We can show that θSn(θ)=t=1nθs(Pθ(t1),yt).\nabla_{\theta}S_{n}\left({\theta}\right)=\sum_{t=1}^{n}\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right). Thus, we must find an expression for θs(Pθ(t1),yt)\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right) for each of the scores. The gradients from all the scores can be written as a function of the recursive derivatives:

(i)σt2θ1=\displaystyle\text{(i)}\ \ \frac{\partial\sigma_{t}^{2}}{\partial{\theta_{1}}}= 2θ3r(yt1θ1)+θ4rσt12θ1;(ii)σt2θ2r=1+θ4rσt12θ2r;\displaystyle-2\theta_{3}^{r}(y_{t-1}-{\theta_{1}})+\theta_{4}^{r}\frac{\partial\sigma_{t-1}^{2}}{\partial{\theta_{1}}};\hskip 28.45274pt\text{(ii)}\ \ \frac{\partial\sigma_{t}^{2}}{\partial\theta_{2}^{r}}=1+\theta_{4}^{r}\frac{\partial\sigma_{t-1}^{2}}{\partial\theta_{2}^{r}};
(iii)σt2θ3r=\displaystyle\text{(iii)}\ \ \frac{\partial\sigma_{t}^{2}}{\partial\theta_{3}^{r}}= (yt1θ1)2+θ4rσt12θ3r;(iv)σt2θ4r=θ4rσt12θ4r+σt12.\displaystyle(y_{t-1}-{\theta_{1}})^{2}+\theta_{4}^{r}\frac{\partial\sigma_{t-1}^{2}}{\partial\theta_{3}^{r}};\hskip 54.06006pt\text{(iv)}\ \ \frac{\partial\sigma_{t}^{2}}{\partial\theta_{4}^{r}}=\theta_{4}^{r}\frac{\partial\sigma_{t-1}^{2}}{\partial\theta_{4}^{r}}+\sigma_{t-1}^{2}.

with initial conditions $\frac{\partial\sigma_{0}^{2}}{\partial\theta_{1}}=\frac{\partial\sigma_{0}^{2}}{\partial\theta_{2}^{r}}=\frac{\partial\sigma_{0}^{2}}{\partial\theta_{3}^{r}}=\frac{\partial\sigma_{0}^{2}}{\partial\theta_{4}^{r}}=0$.
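The four recursions can be implemented jointly with the variance recursion, and verified against finite differences (our sketch, with names of our choosing):

```python
import numpy as np

def sigma2_with_grads(y, theta_r, sigma2_init=1.0):
    """sigma_t^2 and its gradient w.r.t. (theta1, theta2^r, theta3^r, theta4^r),
    computed via recursions (i)-(iv), with all derivatives initialized at zero."""
    th1, th2, th3, th4 = theta_r
    s2, g = sigma2_init, np.zeros(4)
    s2_path, g_path = [], []
    for y_prev in y:  # y_prev plays the role of y_{t-1}
        g = np.array([
            -2.0 * th3 * (y_prev - th1) + th4 * g[0],  # (i)
            1.0 + th4 * g[1],                          # (ii)
            (y_prev - th1)**2 + th4 * g[2],            # (iii)
            th4 * g[3] + s2,                           # (iv), uses sigma_{t-1}^2
        ])
        s2 = th2 + th3 * (y_prev - th1)**2 + th4 * s2
        s2_path.append(s2)
        g_path.append(g)
    return np.array(s2_path), np.array(g_path)
```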

S3.2.1 Logarithmic score (LS)

The gradient for the LS can be written as

\nabla_{\theta}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right)=\left(\nabla_{\theta_{1}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right),\nabla_{\theta_{2}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right),\nabla_{\theta_{3}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right),\nabla_{\theta_{4}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right)\right)^{\prime},

with

\begin{aligned}
\nabla_{\theta_{1}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=-\frac{1}{2\sigma_{t}^{2}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{1}}+\frac{1}{2}\frac{(y_{t}-\theta_{1})^{2}}{\sigma_{t}^{4}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{1}}+\frac{(y_{t}-\theta_{1})}{\sigma_{t}^{2}},\qquad
&\nabla_{\theta_{2}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=\theta_{2}^{r}\left[-\frac{1}{2\sigma_{t}^{2}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{2}^{r}}+\frac{1}{2}\frac{(y_{t}-\theta_{1})^{2}}{\sigma_{t}^{4}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{2}^{r}}\right],\\
\nabla_{\theta_{3}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=\phi_{1}(\theta_{3})\left[-\frac{1}{2\sigma_{t}^{2}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{3}^{r}}+\frac{1}{2}\frac{(y_{t}-\theta_{1})^{2}}{\sigma_{t}^{4}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{3}^{r}}\right],\qquad
&\nabla_{\theta_{4}}s^{\text{LS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=\phi_{1}(\theta_{4})\left[-\frac{1}{2\sigma_{t}^{2}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{4}^{r}}+\frac{1}{2}\frac{(y_{t}-\theta_{1})^{2}}{\sigma_{t}^{4}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{4}^{r}}\right].
\end{aligned}
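As an illustration, the $\theta_{1}$ component can be coded and verified against a finite-difference derivative of the log score (our sketch; the remaining components follow the same pattern with the chain-rule factors $\theta_{2}^{r}$, $\phi_{1}(\theta_{3})$ and $\phi_{1}(\theta_{4})$):

```python
import math

def ls_and_grad_theta1(y, theta_r, sigma2_init=1.0):
    """Sum of Gaussian log scores over t = 2, ..., n and its theta1-gradient."""
    th1, th2, th3, th4 = theta_r
    s2, ds2 = sigma2_init, 0.0
    total, grad = 0.0, 0.0
    for t in range(1, len(y)):
        ds2 = -2.0 * th3 * (y[t - 1] - th1) + th4 * ds2   # recursion (i)
        s2 = th2 + th3 * (y[t - 1] - th1)**2 + th4 * s2
        z2 = (y[t] - th1)**2 / s2
        total += -0.5 * math.log(2 * math.pi * s2) - 0.5 * z2
        # -1/(2 s2) ds2 + (y-th1)^2/(2 s2^2) ds2 + (y-th1)/s2
        grad += (-0.5 / s2 + 0.5 * z2 / s2) * ds2 + (y[t] - th1) / s2
    return total, grad
```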

S3.2.2 Continuously ranked probability score (CRPS)

For the Gaussian GARCH(1,1) predictive class the CRPS can be expressed as

sCRPS(Pθ(t1),yt)=σtBt,s^{\text{CRPS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=-\sigma_{t}B_{t},

with Bt=zt(2Φ1(zt)1)+2ϕ1(zt)1πB_{t}=z_{t}\left(2\Phi_{1}(z_{t})-1\right)+2\phi_{1}(z_{t})-\frac{1}{\sqrt{\pi}} and zt=ytθ1σtz_{t}=\frac{y_{t}-{\theta_{1}}}{\sigma_{t}}. The gradient of the CRPS can be written as

θsCRPS(Pθ(t1),yt)=(θ1sCRPS(Pθt1,yt),θ2sCRPS(Pθt1,yt),θ3sCRPS(Pθt1,yt),θ4sCRPS(Pθt1,yt)).\nabla_{\theta}s^{\text{CRPS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\left(\nabla_{{\theta_{1}}}s^{\text{CRPS}}\left(P_{{\theta}}^{t-1},y_{t}\right),\nabla_{{\theta_{2}}}s^{\text{CRPS}}\left(P_{{\theta}}^{t-1},y_{t}\right),\nabla_{{\theta_{3}}}s^{\text{CRPS}}\left(P_{{\theta}}^{t-1},y_{t}\right),\nabla_{{\theta_{4}}}s^{\text{CRPS}}\left(P_{{\theta}}^{t-1},y_{t}\right)\right)^{\prime}.

The elements of this gradient are given by

\begin{aligned}
\nabla_{\theta_{1}}s^{\text{CRPS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=-\sigma_{t}\frac{\partial B_{t}}{\partial z_{t}}\left[\frac{\partial z_{t}}{\partial\sigma_{t}^{2}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{1}}-\frac{1}{\sigma_{t}}\right]-\frac{B_{t}}{2\sigma_{t}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{1}}\\
\nabla_{\theta_{2}}s^{\text{CRPS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=\left[-\sigma_{t}\frac{\partial B_{t}}{\partial z_{t}}\frac{\partial z_{t}}{\partial\sigma_{t}^{2}}-\frac{B_{t}}{2\sigma_{t}}\right]\frac{\partial\sigma_{t}^{2}}{\partial\theta_{2}^{r}}\,\theta_{2}^{r}\\
\nabla_{\theta_{3}}s^{\text{CRPS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=\left[-\sigma_{t}\frac{\partial B_{t}}{\partial z_{t}}\frac{\partial z_{t}}{\partial\sigma_{t}^{2}}-\frac{B_{t}}{2\sigma_{t}}\right]\frac{\partial\sigma_{t}^{2}}{\partial\theta_{3}^{r}}\,\phi_{1}(\theta_{3})\\
\nabla_{\theta_{4}}s^{\text{CRPS}}\left(P_{\theta}^{(t-1)},y_{t}\right)&=\left[-\sigma_{t}\frac{\partial B_{t}}{\partial z_{t}}\frac{\partial z_{t}}{\partial\sigma_{t}^{2}}-\frac{B_{t}}{2\sigma_{t}}\right]\frac{\partial\sigma_{t}^{2}}{\partial\theta_{4}^{r}}\,\phi_{1}(\theta_{4})
\end{aligned}

with $\frac{\partial B_{t}}{\partial z_{t}}=2\Phi_{1}(z_{t})-1$ and $\frac{\partial z_{t}}{\partial\sigma_{t}^{2}}=-\frac{z_{t}}{2\sigma_{t}^{2}}$.
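The closed-form expression above can be checked numerically against the integral definition $\text{CRPS}(F,y)=\int\left(F(x)-I(x\geq y)\right)^{2}\mathrm{d}x$ (our sketch; the score $s^{\text{CRPS}}$ in the text is the negative of this quantity):

```python
import math

def phi(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of N(mu, sigma^2) at y: sigma * B_t, with B_t as in the text."""
    z = (y - mu) / sigma
    B = z * (2.0 * Phi(z) - 1.0) + 2.0 * phi(z) - 1.0 / math.sqrt(math.pi)
    return sigma * B
```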

S3.2.3 Censored logarithmic score (CLS)

For some threshold value $y_{q}$, denote the upper tail support of the predictive distribution as $A=\{y_{t}:y_{t}>y_{q}\}$. The gradient for the upper tail CLS can be written as

θsCLS-A(Pθ(t1),yt)\displaystyle\nabla_{\theta}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)
=\displaystyle= (θ1sCLS-A(Pθ(t1),yt),θ2sCLS-A(Pθ(t1),yt),θ3sCLS-A(Pθ(t1),yt),θ4sCLS-A(Pθ(t1),yt)),\displaystyle\left(\nabla_{{\theta_{1}}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\nabla_{{\theta_{2}}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\nabla_{{\theta_{3}}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\nabla_{\theta_{4}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)\right)^{\prime},

with

θ1sCLS-A(Pθ(t1),yt)\displaystyle\nabla_{{\theta_{1}}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) =θ1sLS(Pθ(t1),yt)I(ytA)+ϕ1(yqθ1σt)Pθ(t1)(yq)(yqθ12σt3σt2θ11σt)I(ytAc)\displaystyle=\nabla_{{\theta_{1}}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)+\frac{\phi_{1}(\frac{y_{q}-{\theta_{1}}}{\sigma_{t}})}{P_{{\theta}}^{(t-1)}(y_{q})}\left(-\frac{y_{q}-{\theta_{1}}}{2\sigma_{t}^{3}}\frac{\partial\sigma_{t}^{2}}{\partial{\theta_{1}}}-\frac{1}{\sigma_{t}}\right)I(y_{t}\in A^{c})
θ2sCLS-A(Pθ(t1),yt)\displaystyle\nabla_{{\theta_{2}}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) =θ2sLS(Pθ(t1),yt)I(ytA)+ϕ1(yqθ1σt)Pθ(t1)(yq)(yqθ12σt3σt2θ2r)θ2rI(ytAc)\displaystyle=\nabla_{{\theta_{2}}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)+\frac{\phi_{1}(\frac{y_{q}-\theta_{1}}{\sigma_{t}})}{P_{{\theta}}^{(t-1)}(y_{q})}\left(-\frac{y_{q}-{\theta_{1}}}{2\sigma_{t}^{3}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{2}^{r}}\right){\theta_{2}^{r}}I(y_{t}\in A^{c})
θ3sCLS-A(Pθ(t1),yt)\displaystyle\nabla_{{\theta_{3}}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) =θ3sLS(Pθ(t1),yt)I(ytA)+ϕ1(yqθ1σt)Pθ(t1)(yq)(yqθ12σt3σt2θ3r)ϕ1(θ3)I(ytAc)\displaystyle=\nabla_{{\theta_{3}}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)+\frac{\phi_{1}(\frac{y_{q}-\theta_{1}}{\sigma_{t}})}{P_{{\theta}}^{(t-1)}(y_{q})}\left(-\frac{y_{q}-{\theta_{1}}}{2\sigma_{t}^{3}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{3}^{r}}\right)\phi_{1}({\theta_{3}})I(y_{t}\in A^{c})
θ4sCLS-A(Pθ(t1),yt)\displaystyle\nabla_{{\theta_{4}}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) =θ4sLS(Pθ(t1),yt)I(ytA)+ϕ1(yqθ1σt)Pθ(t1)(yq)(yqθ12σt3σt2θ4r)ϕ1(θ4)I(ytAc)\displaystyle=\nabla_{{\theta_{4}}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)+\frac{\phi_{1}(\frac{y_{q}-\theta_{1}}{\sigma_{t}})}{P_{{\theta}}^{(t-1)}(y_{q})}\left(-\frac{y_{q}-{\theta_{1}}}{2\sigma_{t}^{3}}\frac{\partial\sigma_{t}^{2}}{\partial\theta_{4}^{r}}\right)\phi_{1}({\theta_{4}})I(y_{t}\in A^{c})

To obtain the gradient expressions for the lower tail CLS simply replace the term ϕ1(yqθ1σt)Pθ(t1)(yq)\frac{\phi_{1}(\frac{y_{q}-{\theta_{1}}}{\sigma_{t}})}{P_{{\theta}}^{(t-1)}(y_{q})} by the term ϕ1(yqθ1σt)1Pθ(t1)(yq)-\frac{\phi_{1}(\frac{y_{q}-{\theta_{1}}}{\sigma_{t}})}{1-P_{{\theta}}^{(t-1)}(y_{q})}, and redefine the set AA.
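The score whose gradient appears above can itself be sketched as (our code; `Phi` is assumed to be the standard normal cdf, so that $P_{\theta}^{(t-1)}(y_{q})=\Phi_{1}\left((y_{q}-\theta_{1})/\sigma_{t}\right)$):

```python
import math

def Phi(x):
    """Standard normal cdf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def cls_upper(y, yq, mu, sigma):
    """Upper-tail censored log score: the full Gaussian log density when y lies in
    A = {y > yq}, and the log of the censored mass P(Y <= yq) otherwise."""
    if y > yq:
        z = (y - mu) / sigma
        return -0.5 * math.log(2.0 * math.pi * sigma**2) - 0.5 * z * z
    return math.log(Phi((yq - mu) / sigma))
```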

S3.2.4 Interval score (IS)

The gradient of the IS can be written as

θsIS(Pθ(t1),yt)=(θ1sIS(Pθ(t1),yt),θ2sIS(Pθ(t1),yt),θ3sIS(Pθ(t1),yt),θ4sIS(Pθ(t1),yt)).\nabla_{\theta}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\left(\nabla_{{\theta_{1}}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\nabla_{{\theta_{2}}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\nabla_{{\theta_{3}}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\nabla_{{\theta_{4}}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)\right).

For θi{θ1,θ2,θ3,θ4}\theta_{i}\in\{{\theta_{1}},{\theta_{2}},{\theta_{3}},\theta_{4}\}, the elements of the gradient can be written as

θisIS(Pθ(t1),yt)=[121qI(yt>ut)]utθi[21qI(yt<lt)1]ltθi,\nabla_{\theta_{i}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=-\left[1-\frac{2}{1-q}I(y_{t}>u_{t})\right]\frac{\partial u_{t}}{\partial\theta_{i}}-\left[\frac{2}{1-q}I(y_{t}<l_{t})-1\right]\frac{\partial l_{t}}{\partial\theta_{i}},

where the derivative utθi\frac{\partial u_{t}}{\partial{\theta_{i}}} can be evaluated as utθi=utαu,tαu,tθi\frac{\partial u_{t}}{\partial\theta_{i}}=-\frac{\partial u_{t}}{\partial\alpha_{u,t}}\frac{\partial\alpha_{u,t}}{\partial\theta_{i}} with αu,t=Pθ(t1)(ut)\alpha_{u,t}=P_{{\theta}}^{(t-1)}(u_{t}). The first term can be computed as utαu,t=1p(ut|y1:t1,θ)\frac{\partial u_{t}}{\partial\alpha_{u,t}}=\frac{1}{p\left(u_{t}|y_{1:t-1},{\theta}\right)}. The second term is parameter specific, and can be computed as

(i) αu,tθ1\displaystyle\text{(i) }\ \ \frac{\partial\alpha_{u,t}}{\partial{\theta_{1}}} =ϕ1(utθ1σt)[(utθ12σt3)σt2θ11σt];(ii) αu,tθ2=ϕ1(utθ1σt)(utθ12σt3)σt2θ2rθ2r;\displaystyle=\phi_{1}\left(\frac{u_{t}-{\theta_{1}}}{\sigma_{t}}\right)\left[\left(-\frac{u_{t}-\theta_{1}}{2\sigma_{t}^{3}}\right)\frac{\partial\sigma_{t}^{2}}{\partial{\theta_{1}}}-\frac{1}{\sigma_{t}}\right];\hskip 14.22636pt\text{(ii) }\ \ \frac{\partial\alpha_{u,t}}{\partial{\theta_{2}}}=\phi_{1}\left(\frac{u_{t}-{\theta_{1}}}{\sigma_{t}}\right)\left(-\frac{u_{t}-{\theta_{1}}}{2\sigma_{t}^{3}}\right)\frac{\partial\sigma_{t}^{2}}{\partial{\theta_{2}^{r}}}{\theta_{2}^{r}};
(iii) αu,tθ3\displaystyle\text{(iii) }\ \ \frac{\partial\alpha_{u,t}}{\partial{\theta_{3}}} =ϕ1(utθ1σt)(utθ12σt3)σt2θ3rϕ1(θ3);(iv) αu,tθ4=ϕ1(utθ1σt)(utθ12σt3)σt2θ4rϕ1(θ4).\displaystyle=\phi_{1}\left(\frac{u_{t}-{\theta_{1}}}{\sigma_{t}}\right)\left(-\frac{u_{t}-{\theta_{1}}}{2\sigma_{t}^{3}}\right)\frac{\partial\sigma_{t}^{2}}{\partial\theta_{3}^{r}}\phi_{1}({\theta_{3}});\hskip 17.07182pt\text{(iv) }\ \ \frac{\partial\alpha_{u,t}}{\partial{\theta_{4}}}=\phi_{1}\left(\frac{u_{t}-{\theta_{1}}}{\sigma_{t}}\right)\left(-\frac{u_{t}-{\theta_{1}}}{2\sigma_{t}^{3}}\right)\frac{\partial\sigma_{t}^{2}}{\partial{\theta_{4}^{r}}}\phi_{1}({\theta_{4}}).

The derivative ltθi\frac{\partial l_{t}}{\partial{\theta_{i}}} is evaluated in the same fashion as described for utθi\frac{\partial u_{t}}{\partial{\theta_{i}}}.
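The (negatively oriented) interval score consistent with this gradient can be sketched as (our code, with $\alpha=1-q$ and $(l_{t},u_{t})$ the interval limits; $s^{\text{IS}}$ in the text is its negative):

```python
def interval_score(y, l, u, alpha):
    """Negatively oriented interval score: width plus scaled penalties for
    observations falling outside the (l, u) interval."""
    penalty_low = (2.0 / alpha) * (l - y) if y < l else 0.0
    penalty_high = (2.0 / alpha) * (y - u) if y > u else 0.0
    return (u - l) + penalty_low + penalty_high
```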

S4 Autoregressive Mixture Predictive Class (Section 5.1)

S4.1 Priors

We employ the following priors for each of the parameters of the model:

βk,0N(0,100002),βk,1U(1,1),vkBeta(1,2),(σk2)1𝒢(1,1),   and μN(0,1002),\beta_{k,0}\sim N(0,10000^{2}),\hskip 14.22636pt\beta_{k,1}\sim U(-1,1),\hskip 14.22636ptv_{k}\sim\text{Beta}(1,2),\hskip 14.22636pt\left(\sigma_{k}^{2}\right)^{-1}\sim\mathcal{G}(1,1)\text{,\ \ \ and }\hskip 14.22636pt\mu\sim N(0,100^{2}),

where the parameter $v_{k}$ arises from the stick-breaking decomposition of the mixture weights, which sets $\tau_{k}=v_{k}\prod_{j=1}^{k-1}(1-v_{j})$ for $k=1,\dots,K$, with $v_{K}=1$. For the implementation of variational inference, all the parameters are transformed to the real line as follows:

(i) βk,1 is transformed to ηk=Φ11(12(βk,1+1));(ii) vk is transformed to ψk=Φ11(vk);\displaystyle\text{(i) }\beta_{k,1}\text{ is transformed to }\eta_{k}=\Phi_{1}^{-1}\left(\frac{1}{2}\left(\beta_{k,1}+1\right)\right);\hskip 28.45274pt\text{(ii) }v_{k}\text{ is transformed to }\psi_{k}=\Phi_{1}^{-1}\left(v_{k}\right);
(iii) σk is transformed to κk=logσk.\displaystyle\text{(iii) }\sigma_{k}\text{ is transformed to }\kappa_{k}=\log\sigma_{k}.
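These mappings are simple to implement. The following is a minimal Python sketch of the stick-breaking construction and the three transformations; the bisection-based `Phi_inv` is an illustrative stand-in for a library quantile function, and all function names are ours, not the paper's.

```python
import math

def stick_breaking(v):
    """Map v_1, ..., v_{K-1} in (0,1) (with v_K = 1) to mixture weights tau."""
    v = list(v) + [1.0]                 # v_K = 1 closes the stick
    tau, remaining = [], 1.0
    for vk in v:
        tau.append(vk * remaining)      # tau_k = v_k * prod_{j<k} (1 - v_j)
        remaining *= 1.0 - vk
    return tau

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def Phi_inv(p):
    """Inverse standard normal CDF by bisection (illustrative stand-in)."""
    lo, hi = -20.0, 20.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi(mid) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def to_real(beta1, v, sigma):
    """Transformations of (beta_{k,1}, v_k, sigma_k) to the real line."""
    eta = Phi_inv(0.5 * (beta1 + 1.0))  # beta_{k,1} in (-1, 1)
    psi = Phi_inv(v)                    # v_k in (0, 1)
    kappa = math.log(sigma)             # sigma_k > 0
    return eta, psi, kappa
```

By construction the weights sum to one, and each transformed parameter is unconstrained on the real line.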

After applying these transformations, we have that the prior densities are:

\text{(i) }p(\beta_{k,0})=\phi_{1}\left(\beta_{k,0};0,10000^{2}\right);\qquad\text{(ii) }p(\eta_{k})=\phi_{1}\left(\eta_{k};0,1\right);
\text{(iii) }p(\psi_{k})\propto\left[1-\Phi_{1}\left(\psi_{k}\right)\right]\phi_{1}\left(\psi_{k}\right);\qquad\text{(iv) }p(\kappa_{k})\propto 2\exp\left(-2\kappa_{k}\right)\exp\left[-\exp\left(-2\kappa_{k}\right)\right];
\text{(v) }p(\mu)=\phi_{1}\left(\mu;0,100^{2}\right).
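As a numerical sanity check on (iv): the density of $\kappa_{k}$ follows by a change of variables from the exponential prior on the precision $\lambda_{k}=\sigma_{k}^{-2}=\exp(-2\kappa_{k})$, and should integrate to one. A sketch with illustrative function names:

```python
import math

def p_kappa(kappa):
    """Density of kappa = log(sigma_k) implied by sigma_k^{-2} ~ Gamma(1, 1).

    With precision lam = exp(-2 * kappa) and p(lam) = exp(-lam), the change of
    variables gives p(kappa) = 2 * exp(-2 * kappa) * exp(-exp(-2 * kappa)).
    """
    lam = math.exp(-2.0 * kappa)
    return 2.0 * lam * math.exp(-lam)

def midpoint_integral(f, lo, hi, n=100000):
    """Midpoint-rule quadrature (illustrative)."""
    h = (hi - lo) / n
    return h * sum(f(lo + (i + 0.5) * h) for i in range(n))
```

Integrating `p_kappa` over a wide interval such as $[-3, 10]$ recovers one to within quadrature error, since the density is negligible outside that range.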

Computation of θlogp(θ)\nabla_{\theta}\log p\left({\theta}\right)
Denoting $\beta_{0}=\left(\beta_{1,0},\dots,\beta_{K,0}\right)^{\prime}$, $\eta=\left(\eta_{1},\dots,\eta_{K}\right)^{\prime}$, $\psi=\left(\psi_{1},\dots,\psi_{K-1}\right)^{\prime}$ and $\kappa=\left(\kappa_{1},\dots,\kappa_{K}\right)^{\prime}$, the gradient of the log-prior density is

\nabla_{\theta}\log p\left({\theta}\right)=\left[\frac{\partial\log p\left({\theta}\right)}{\partial{\beta}_{0}},\frac{\partial\log p\left({\theta}\right)}{\partial{\eta}},\frac{\partial\log p\left({\theta}\right)}{\partial{\psi}},\frac{\partial\log p\left({\theta}\right)}{\partial{\kappa}},\frac{\partial\log p\left({\theta}\right)}{\partial\mu}\right]^{\prime},

with

(i) logp(θ)β0=(10000)2β0;(ii) logp(θ)η=η;\text{(i) }\frac{\partial\log p\left({\theta}\right)}{\partial{\beta}_{0}}=-(10000)^{-2}{\beta}_{0}^{{}^{\prime}};\ \ \ \ \ \text{(ii) }\frac{\partial\log p\left({\theta}\right)}{\partial{\eta}}=-{\eta}^{{}^{\prime}};
(iii) logp(θ)ψ=(1v)1vψψ;(iv) logp(θ)κ=2κ~2;\text{(iii) }\frac{\partial\log p\left({\theta}\right)}{\partial{\psi}}=-(1-{v}^{\prime})^{-1}\frac{\partial{v}}{\partial\psi}-{\psi}^{{}^{\prime}};\ \ \ \ \ \text{(iv) }\frac{\partial\log p\left({\theta}\right)}{\partial{\kappa}}=2\tilde{{\kappa}}^{{}^{\prime}}-2;
(v) logp(θ)μ=(100)2μ.\text{(v) }\frac{\partial\log p\left({\theta}\right)}{\partial{\mu}}=-(100)^{-2}{\mu}.

and where (1v)1=([1v1]1,,[1vK1]1)(1-{v}^{\prime})^{-1}=\left([1-v_{1}]^{-1},\dots,[1-v_{K-1}]^{-1}\right) and κ~=(exp[2κ1],,exp[2κK])\tilde{{\kappa}}=(\exp[-2\kappa_{1}],\dots,\exp[-2\kappa_{K}])^{\prime}.
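Each block of the gradient can be checked against a finite difference of the log-prior. For example, component (iv), $2\exp(-2\kappa_{k})-2$, can be verified as follows (a sketch with hypothetical function names, using the unnormalized log density):

```python
import math

def log_p_kappa(kappa):
    """Unnormalized log prior: log p(kappa) = -2*kappa - exp(-2*kappa) + const."""
    return -2.0 * kappa - math.exp(-2.0 * kappa)

def grad_log_p_kappa(kappa):
    """Analytic gradient from (iv): 2 * exp(-2*kappa) - 2."""
    return 2.0 * math.exp(-2.0 * kappa) - 2.0

def central_diff(f, x, h=1e-6):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

The gradient vanishes at $\kappa=0$, the mode of the transformed prior.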

S4.2 Derivation of θSn(θ)\nabla_{\theta}S_{n}\left({\theta}\right) for all scoring rules

The focused Bayesian update uses the term Sn(θ)=t=1ns(Pθ(t1),yt)S_{n}\left({\theta}\right)=\sum_{t=1}^{n}s\left(P_{{\theta}}^{(t-1)},y_{t}\right), thus variational inference requires evaluation of

θSn(θ)=t=1nθs(Pθ(t1),yt).\nabla_{\theta}S_{n}\left({\theta}\right)=\sum_{t=1}^{n}\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right).

For the mixture example we consider three alternative choices for $s\left(P_{{\theta}}^{(t-1)},y_{t}\right)$: the IS, the CLS and the LS. Here, we derive an expression for $\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right)$ for each of these scores. As shown below, the gradient $\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right)$ for each score can be expressed solely in terms of $\nabla_{\theta}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)$ and $\nabla_{\theta}P_{{\theta}}^{(t-1)}$, so we focus on the derivation of these two expressions. Below, we denote $\epsilon_{t}=y_{t}-\mu$ and use $r$ to denote the number of elements in $\theta$. For ease of notation we rewrite

p(yt|t1,θ)=c2,tc1,t and Pθ(t1)=c3,tc1,t,p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)=\frac{c_{2,t}}{c_{1,t}}\text{\ \ \ \ \ and \ \ \ \ \ }P_{{\theta}}^{(t-1)}=\frac{c_{3,t}}{c_{1,t}},

where $c_{1,t,k}=\frac{\tau_{k}}{s_{k}}\phi_{1}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)$, $c_{2,t,k}=\frac{1}{\sigma_{k}}\phi_{1}\left(\frac{\epsilon_{t}-\beta_{k,0}-\beta_{k,1}\epsilon_{t-1}}{\sigma_{k}}\right)$, $c_{3,t,k}=\Phi_{1}\left(\frac{\epsilon_{t}-\beta_{k,0}-\beta_{k,1}\epsilon_{t-1}}{\sigma_{k}}\right)$, $c_{1,t}=\sum_{k=1}^{K}c_{1,t,k}$, $c_{2,t}=\sum_{k=1}^{K}c_{1,t,k}c_{2,t,k}$ and $c_{3,t}=\sum_{k=1}^{K}c_{1,t,k}c_{3,t,k}$. We also define the filtered weights $\tau_{k,t}=c_{1,t,k}/c_{1,t}$. The elements of the gradients $\nabla_{\theta}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)=\left[\nabla_{\theta_{1}}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right),\dots,\nabla_{\theta_{r}}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)\right]^{\prime}$ and $\nabla_{\theta}P_{{\theta}}^{(t-1)}=\left[\nabla_{\theta_{1}}P_{{\theta}}^{(t-1)},\dots,\nabla_{\theta_{r}}P_{{\theta}}^{(t-1)}\right]^{\prime}$ can then be computed as

\nabla_{\theta_{i}}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)=\sum_{k=1}^{K}\left[\tau_{k,t}\frac{\partial c_{2,t,k}}{\partial\theta_{i}}+\frac{c_{2,t,k}-p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)}{c_{1,t}}\frac{\partial c_{1,t,k}}{\partial\theta_{i}}\right] (S4.3)

and

\nabla_{\theta_{i}}P_{{\theta}}^{(t-1)}=\sum_{k=1}^{K}\left[\tau_{k,t}\frac{\partial c_{3,t,k}}{\partial\theta_{i}}+\frac{c_{3,t,k}-P_{{\theta}}^{(t-1)}}{c_{1,t}}\frac{\partial c_{1,t,k}}{\partial\theta_{i}}\right]. (S4.4)

Table S1 provides the expressions c1,t,kθi\frac{\partial c_{1,t,k}}{\partial\theta_{i}}, c2,t,kθi\frac{\partial c_{2,t,k}}{\partial\theta_{i}} and c3,t,kθi\frac{\partial c_{3,t,k}}{\partial\theta_{i}} for θi{β0,η,κ,μ}\theta_{i}\in\{{\beta}_{0},{\eta},{\kappa},\mu\}. For θiψ\theta_{i}\in{\psi}, the gradients can be evaluated as

ψp(yt|t1,θ)=[τp(yt|t1,θ)τvvψ],\nabla_{\psi}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)=\left[\nabla_{\tau}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)^{\prime}\frac{\partial{\tau}}{\partial v}\frac{\partial{v}}{\partial\psi}\right]^{\prime},

and

ψPθ(t1)=[τPθ(t1)τvvψ],\nabla_{\psi}P_{{\theta}}^{(t-1)}=\left[{\nabla_{\tau}P_{{\theta}}^{(t-1)}}^{\prime}\frac{\partial{\tau}}{\partial v}\frac{\partial{v}}{\partial\psi}\right]^{\prime},

where vψ\frac{\partial{v}}{\partial\psi} is a diagonal matrix with entries vkψk=ϕ1(ψk)\frac{\partial v_{k}}{\partial\psi_{k}}=\phi_{1}\left(\psi_{k}\right), and τv\frac{\partial{\tau}}{\partial v} is a lower triangular matrix with diagonal elements τkvk=j=1k1(1vj)\frac{\partial\tau_{k}}{\partial v_{k}}=\prod_{j=1}^{k-1}(1-v_{j}) and off-diagonal elements τkvs=τk(1vs)\frac{\partial\tau_{k}}{\partial v_{s}}=-\frac{\tau_{k}}{(1-v_{s})}, for s<ks<k. The expressions needed to evaluate τp(yt|t1,θ)=[τ1p(yt|t1,θ),,τKp(yt|t1,θ)]\nabla_{\tau}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)=\left[\nabla_{\tau_{1}}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right),\dots,\nabla_{\tau_{K}}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)\right]^{\prime} and τPθ(t1)=[τ1Pθ(t1),,τKPθ(t1)]\nabla_{\tau}P_{{\theta}}^{(t-1)}=\left[\nabla_{\tau_{1}}P_{{\theta}}^{(t-1)},\dots,\nabla_{\tau_{K}}P_{{\theta}}^{(t-1)}\right]^{\prime} are also provided in Table S1.
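The structure of $\frac{\partial{\tau}}{\partial v}$ can be checked numerically. The sketch below builds the lower-triangular Jacobian from the closed forms above (for the $K-1$ free weights only, since $v_{K}=1$ is fixed) and compares it with central finite differences; the function names are illustrative:

```python
def tau_of_v(v):
    """Free mixture weights tau_1, ..., tau_{K-1} from v via stick breaking."""
    tau, remaining = [], 1.0
    for vk in v:
        tau.append(vk * remaining)
        remaining *= 1.0 - vk
    return tau

def jacobian_tau_v(v):
    """Lower-triangular d tau_k / d v_s using the closed forms in the text."""
    tau = tau_of_v(v)
    n = len(v)
    J = [[0.0] * n for _ in range(n)]
    prod = 1.0                                # running prod_{j<k} (1 - v_j)
    for k in range(n):
        J[k][k] = prod                        # diagonal element
        for s in range(k):
            J[k][s] = -tau[k] / (1.0 - v[s])  # off-diagonal, s < k
        prod *= 1.0 - v[k]
    return J

def fd_jacobian(v, h=1e-7):
    """Central finite-difference Jacobian, for comparison."""
    n = len(v)
    J = [[0.0] * n for _ in range(n)]
    for s in range(n):
        vp, vm = list(v), list(v)
        vp[s] += h
        vm[s] -= h
        tp, tm = tau_of_v(vp), tau_of_v(vm)
        for k in range(n):
            J[k][s] = (tp[k] - tm[k]) / (2.0 * h)
    return J
```

The two matrices agree entrywise, confirming both the diagonal and off-diagonal expressions.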

With expressions (S4.3) and (S4.4) in hand, we can now derive expressions for $\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right)$ for each of the three scores considered.

S4.2.1 Logarithmic score (LS)

Denote the gradient of the LS in (9) as

\nabla_{\theta}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\left(\nabla_{\theta_{1}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\dots,\nabla_{\theta_{r}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)\right)^{\prime}.

The element θisLS(Pθ(t1),yt)\nabla_{\theta_{i}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) of this gradient can be evaluated as

θisLS(Pθ(t1),yt)=1p(yt|t1,θ)θip(yt|t1,θ)\nabla_{\theta_{i}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\frac{1}{p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)}\nabla_{\theta_{i}}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)

where the derivative θip(yt|t1,θ)\nabla_{\theta_{i}}p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right) can be computed using (S4.3).
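The identity $\nabla_{\theta_{i}}s^{\text{LS}}=\nabla_{\theta_{i}}p/p$ is straightforward to verify numerically. The sketch below does so for a hypothetical two-component Gaussian mixture standing in for $p\left(y_{t}|\mathcal{F}_{t-1},{\theta}\right)$, differentiating with respect to the first component mean:

```python
import math

def phi(x):
    """Standard normal pdf."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def mix_pdf(y, tau, mu, sig):
    """Density of the Gaussian mixture sum_k tau_k N(y; mu_k, sig_k^2)."""
    return sum(t * phi((y - m) / s) / s for t, m, s in zip(tau, mu, sig))

def grad_logp_mu1(y, tau, mu, sig):
    """d log p(y) / d mu_1 = (d p(y) / d mu_1) / p(y)."""
    z = (y - mu[0]) / sig[0]
    dp = tau[0] * phi(z) * z / sig[0] ** 2   # derivative of the first component
    return dp / mix_pdf(y, tau, mu, sig)
```

A central finite difference of `log(mix_pdf)` in the first mean reproduces the analytic value.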

S4.2.2 Censored logarithmic score (CLS)

For some threshold value $y_{q}$, denote the upper-tail support of the predictive distribution as $A=\{y_{t}:y_{t}>y_{q}\}$. The gradient for the upper tail CLS in (11) can be written as

θsCLS-A(Pθ(t1),yt)=(θ1sCLS-A(Pθ(t1),yt),,θrsCLS-A(Pθ(t1),yt)).\nabla_{\theta}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\left(\nabla_{\theta_{1}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\dots,\nabla_{\theta_{r}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)\right)^{\prime}.

We can compute the element θisCLS-A(Pθ(t1),yt)\nabla_{\theta_{i}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) of the upper CLS as:

θisCLS-A(Pθ(t1),yt)=θisLS(Pθ(t1),yt)I(ytA)+1Pθ(t1)(yq)θiPθ(t1)(yq)I(ytAc).\nabla_{\theta_{i}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\nabla_{\theta_{i}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)+\frac{1}{P_{{\theta}}^{(t-1)}(y_{q})}\nabla_{\theta_{i}}P_{{\theta}}^{(t-1)}(y_{q})I(y_{t}\in A^{c}).

The derivatives θiPθ(t1)(yq)\nabla_{\theta_{i}}P_{{\theta}}^{(t-1)}(y_{q}) can be computed using (S4.4). For the lower tail CLS, where A={yt:yt<yq}A=\{y_{t}:y_{t}<y_{q}\} we have that

θisCLS-A(Pθ(t1),yt)=θisLS(Pθ(t1),yt)I(ytA)11Pθ(t1)(yq)θiPθ(t1)(yq)I(ytAc).\nabla_{\theta_{i}}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\nabla_{\theta_{i}}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)-\frac{1}{1-P_{{\theta}}^{(t-1)}(y_{q})}\nabla_{\theta_{i}}P_{{\theta}}^{(t-1)}(y_{q})I(y_{t}\in A^{c}).
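The score itself, whose gradients appear above, can be sketched for a Gaussian predictive standing in for $P_{{\theta}}^{(t-1)}$ (an assumption of this illustration): outside the region of interest $A$, only the censored mass enters the score.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def log_pdf(y, m, s):
    """Log density of N(m, s^2) at y."""
    return -math.log(s) - 0.5 * math.log(2.0 * math.pi) - 0.5 * ((y - m) / s) ** 2

def cls_upper(y, yq, m, s):
    """Upper-tail CLS: log p(y) for y > yq, else log P(yq)."""
    if y > yq:
        return log_pdf(y, m, s)
    return math.log(Phi((yq - m) / s))

def cls_lower(y, yq, m, s):
    """Lower-tail CLS: log p(y) for y < yq, else log(1 - P(yq))."""
    if y < yq:
        return log_pdf(y, m, s)
    return math.log(1.0 - Phi((yq - m) / s))
```

Both branches of each score are exercised by placing $y_{t}$ on either side of the threshold.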

S4.2.3 Interval score (IS)

The gradient of the IS in (12) can be written as

\nabla_{\theta}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\left(\nabla_{\theta_{1}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right),\dots,\nabla_{\theta_{r}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)\right)^{\prime},

with elements θisIS(Pθ(t1),yt)\nabla_{\theta_{i}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) defined as:

θisIS(Pθ(t1),yt)=[121qI(yt>ut)]utθi[21qI(yt<lt)1]ltθi\nabla_{\theta_{i}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=-\left[1-\frac{2}{1-q}I(y_{t}>u_{t})\right]\frac{\partial u_{t}}{\partial\theta_{i}}-\left[\frac{2}{1-q}I(y_{t}<l_{t})-1\right]\frac{\partial l_{t}}{\partial\theta_{i}}

As such, computation of θisIS(Pθ(t1),yt)\nabla_{\theta_{i}}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) entails evaluation of utθi\frac{\partial u_{t}}{\partial{\theta_{i}}} and ltθi\frac{\partial l_{t}}{\partial{\theta_{i}}}. The derivative utθi\frac{\partial u_{t}}{\partial{\theta_{i}}} can be evaluated by first noting that αu,t=Pθ(t1)(ut)\alpha_{u,t}=P_{{\theta}}^{(t-1)}(u_{t}). Then, using the triple product rule we know that

utθi=utαu,tαu,tθi,\frac{\partial u_{t}}{\partial\theta_{i}}=-\frac{\partial u_{t}}{\partial\alpha_{u,t}}\frac{\partial\alpha_{u,t}}{\partial\theta_{i}},

where the first term can be computed as utαu,t=1p(ut|t1,θ)\frac{\partial u_{t}}{\partial\alpha_{u,t}}=\frac{1}{p\left(u_{t}|\mathcal{F}_{t-1},{\theta}\right)}. The second term αu,tθi=θiPθ(t1)(ut)\frac{\partial\alpha_{u,t}}{\partial\theta_{i}}=\nabla_{\theta_{i}}P_{{\theta}}^{(t-1)}(u_{t}) is evaluated using (S4.4). The derivative ltθi\frac{\partial l_{t}}{\partial{\theta_{i}}} is evaluated in the same fashion as described for utθi\frac{\partial u_{t}}{\partial{\theta_{i}}}.
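The implicit-differentiation step $\frac{\partial u_{t}}{\partial\theta_{i}}=-\nabla_{\theta_{i}}P_{{\theta}}^{(t-1)}(u_{t})/p\left(u_{t}|\mathcal{F}_{t-1},{\theta}\right)$ can be illustrated with a Gaussian predictive (an assumption of this sketch), for which the quantile shifts one-for-one with a location parameter, so the implicit derivative should equal exactly one:

```python
import math

def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def quantile(alpha, m, s):
    """u with P(u) = alpha for a N(m, s^2) predictive, by bisection."""
    lo, hi = m - 50.0 * s, m + 50.0 * s
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if Phi((mid - m) / s) < alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def du_dm(alpha, m, s):
    """du/dm = -(dP/dm) / p(u): the implicit differentiation step."""
    u = quantile(alpha, m, s)
    dP_dm = -phi((u - m) / s) / s   # d/dm of Phi((u - m) / s)
    p_u = phi((u - m) / s) / s      # predictive density at u
    return -dP_dm / p_u
```

A finite difference of the quantile itself gives the same answer, confirming the rule.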

Expressions for $\frac{\partial c_{1,t,k}}{\partial\theta_{i}}$:

\frac{\partial c_{1,t,k}}{\partial\tau_{k}}=\frac{1}{s_{k}}\phi_{1}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)
\frac{\partial c_{1,t,k}}{\partial\beta_{0,k}}=-\frac{\tau_{k}}{s_{k}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)\frac{1}{(1-\beta_{k,1})s_{k}}
\frac{\partial c_{1,t,k}}{\partial\eta_{k}}=\left[-\frac{\tau_{k}}{s_{k}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)\left(\frac{1}{s_{k}}\frac{\partial\mu_{k}}{\partial\beta_{1,k}}+\frac{1}{s_{k}^{2}}(\epsilon_{t-1}-\mu_{k})\frac{\partial s_{k}}{\partial\beta_{1,k}}\right)-\phi_{1}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)\frac{\tau_{k}}{s_{k}^{2}}\frac{\partial s_{k}}{\partial\beta_{1,k}}\right]\frac{\partial\beta_{1,k}}{\partial\eta_{k}}
\frac{\partial c_{1,t,k}}{\partial\kappa_{k}}=\left[-\frac{\tau_{k}}{s_{k}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)\left(\frac{1}{s_{k}}\frac{\partial\mu_{k}}{\partial\sigma_{k}}+\frac{1}{s_{k}^{2}}(\epsilon_{t-1}-\mu_{k})\frac{\partial s_{k}}{\partial\sigma_{k}}\right)-\phi_{1}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)\frac{\tau_{k}}{s_{k}^{2}}\frac{\partial s_{k}}{\partial\sigma_{k}}\right]\frac{\partial\sigma_{k}}{\partial\kappa_{k}}
\frac{\partial c_{1,t,k}}{\partial\mu}=-\frac{\tau_{k}}{s_{k}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t-1}-\mu_{k}}{s_{k}}\right)\frac{1}{s_{k}}

Expressions for $\frac{\partial c_{2,t,k}}{\partial\theta_{i}}$:

\frac{\partial c_{2,t,k}}{\partial\tau_{k}}=0
\frac{\partial c_{2,t,k}}{\partial\beta_{0,k}}=-\frac{1}{\sigma_{k}^{2}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)
\frac{\partial c_{2,t,k}}{\partial\eta_{k}}=-\frac{\epsilon_{t-1}}{\sigma_{k}^{2}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)\frac{\partial\beta_{1,k}}{\partial\eta_{k}}
\frac{\partial c_{2,t,k}}{\partial\kappa_{k}}=\left[-\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}^{3}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)-\frac{1}{\sigma_{k}^{2}}\phi_{1}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)\right]\frac{\partial\sigma_{k}}{\partial\kappa_{k}}
\frac{\partial c_{2,t,k}}{\partial\mu}=\frac{-1+\beta_{1,k}}{\sigma_{k}^{2}}\phi_{1}^{\prime}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)

Expressions for $\frac{\partial c_{3,t,k}}{\partial\theta_{i}}$:

\frac{\partial c_{3,t,k}}{\partial\tau_{k}}=0
\frac{\partial c_{3,t,k}}{\partial\beta_{0,k}}=-\frac{1}{\sigma_{k}}\phi_{1}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)
\frac{\partial c_{3,t,k}}{\partial\eta_{k}}=-\frac{\epsilon_{t-1}}{\sigma_{k}}\phi_{1}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)\frac{\partial\beta_{1,k}}{\partial\eta_{k}}
\frac{\partial c_{3,t,k}}{\partial\kappa_{k}}=-\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}^{2}}\phi_{1}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)\frac{\partial\sigma_{k}}{\partial\kappa_{k}}
\frac{\partial c_{3,t,k}}{\partial\mu}=\frac{-1+\beta_{1,k}}{\sigma_{k}}\phi_{1}\left(\frac{\epsilon_{t}-\beta_{0,k}-\beta_{1,k}\epsilon_{t-1}}{\sigma_{k}}\right)
Table S1: Derivatives required to implement reparameterization trick for the mixture model. Other required expressions also include μkβ1,k=β0,k(1β1,k)2\frac{\partial\mu_{k}}{\partial\beta_{1,k}}=\frac{\beta_{0,k}}{(1-\beta_{1,k})^{2}}, skβ1,k=β1,kσk(1β1,k2)3/2\frac{\partial s_{k}}{\partial\beta_{1,k}}=\frac{\beta_{1,k}\sigma_{k}}{(1-\beta_{1,k}^{2})^{3/2}}, skσk=1(1β1,k2)1/2\frac{\partial s_{k}}{\partial\sigma_{k}}=\frac{1}{(1-\beta_{1,k}^{2})^{1/2}}, σkκk=exp(κk)\frac{\partial\sigma_{k}}{\partial\kappa_{k}}=\exp\left(\kappa_{k}\right), β1,kηk=2ϕ1(ηk)\frac{\partial\beta_{1,k}}{\partial\eta_{k}}=2\phi_{1}(\eta_{k}). The cross-component derivatives ci,t,kτj\frac{\partial c_{i,t,k}}{\partial\tau_{j}}, ci,t,kβ0,j\frac{\partial c_{i,t,k}}{\partial\beta_{0,j}}, ci,t,kηj\frac{\partial c_{i,t,k}}{\partial\eta_{j}}, ci,t,kκj\frac{\partial c_{i,t,k}}{\partial\kappa_{j}} are all zero for jkj\neq k. Finally, in this table we denote ϕ1(x)=ϕ1(x)x\phi_{1}^{\prime}\left(x\right)=\frac{\partial\phi_{1}\left(x\right)}{\partial x}.

S5 Bayesian Neural Network Predictive Class (Section 5.2)

S5.1 Priors

The priors of the model parameters are set to be improper and flat: $\omega_{k}\sim N(0,\infty)$ and $p(\sigma_{y}^{2})\propto\frac{1}{\sigma_{y}^{2}}$. The standard deviation parameter $\sigma_{y}$ is transformed to the real line as $c=\log\left(\sigma_{y}\right)$, so that the parameter vector is $\theta=\left(\omega^{\prime},c\right)^{\prime}$. The prior density can then be written as $p(\theta)=p(c)\prod_{k=1}^{d}p(\omega_{k})$, where $p(\omega_{k})\propto 1$ and $p(c)\propto 1$. From this prior density we have that $\nabla_{\theta}\log p\left(\theta\right)=0$.

S5.2 Derivation of θSn(θ)\nabla_{\theta}S_{n}\left({\theta}\right) for all scoring rules

As in the previous section, we can show that $\nabla_{\theta}S_{n}\left({\theta}\right)=\sum_{t=1}^{n}\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right)$. Thus, we must find an expression for $\nabla_{\theta}s\left(P_{{\theta}}^{(t-1)},y_{t}\right)$ for each of the scores.

S5.2.1 Logarithmic score (LS)

The gradient for the LS can be written as

\nabla_{\theta}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\left(\nabla_{\omega}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)^{\prime},\nabla_{c}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)\right)^{\prime},

with

ωsLS(Pθ(t1),yt)=1σy2[ytg(zt;ω)]ωg(zt;ω) and csLS(Pθ(t1),yt)=1+1σy2[ytg(zt;ω)]2.\nabla_{\omega}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\frac{1}{\sigma_{y}^{2}}\left[y_{t}-g\left({z}_{t};{\omega}\right)\right]\nabla_{\omega}g\left({z}_{t};{\omega}\right)\text{ and }\nabla_{c}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=-1+\frac{1}{\sigma_{y}^{2}}\left[y_{t}-g\left({z}_{t};{\omega}\right)\right]^{2}.

The term $\nabla_{\omega}g\left({z}_{t};{\omega}\right)$ can be evaluated analytically through backpropagation.
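For a fixed input, both analytic gradients can be checked against finite differences of the LS itself. In the sketch below the network output $g\left({z}_{t};{\omega}\right)$ is collapsed to a scalar mean $g$, since the backpropagation factor $\nabla_{\omega}g$ is model-specific; the function names are illustrative:

```python
import math

def log_score(y, g, c):
    """LS of the N(g, exp(2c)) predictive, with c = log(sigma_y)."""
    return -0.5 * math.log(2.0 * math.pi) - c - 0.5 * (y - g) ** 2 * math.exp(-2.0 * c)

def grad_g(y, g, c):
    """Analytic gradient with respect to the mean: (y - g) / sigma_y^2."""
    return (y - g) * math.exp(-2.0 * c)

def grad_c(y, g, c):
    """Analytic gradient with respect to c: -1 + (y - g)^2 / sigma_y^2."""
    return -1.0 + (y - g) ** 2 * math.exp(-2.0 * c)

def central_diff(f, x, h=1e-6):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)
```

Both analytic expressions match their finite-difference counterparts at an arbitrary point.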

S5.2.2 Censored logarithmic score (CLS)

For the upper tail CLS the gradient can be written as

θsCLS-A(Pθ(t1),yt)=θsLS(Pθ(t1),yt)I(ytA)+1Pθ(t1)(yq)θPθ(t1)(yq)I(ytAc),\nabla_{\theta}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\nabla_{\theta}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)+\frac{1}{P_{{\theta}}^{(t-1)}(y_{q})}\nabla_{\theta}P_{{\theta}}^{(t-1)}(y_{q})I(y_{t}\in A^{c}),

where θPθ(t1)(yq)=(ωPθ(t1)(yq),cPθ(t1)(yq))\nabla_{\theta}P_{{\theta}}^{(t-1)}(y_{q})=\left(\nabla_{\omega}P_{{\theta}}^{(t-1)}(y_{q})^{{}^{\prime}},\nabla_{c}P_{{\theta}}^{(t-1)}(y_{q})\right)^{{}^{\prime}}, ωPθ(t1)(yq)=ϕ1(yq;g(zt;ω),σy2)ωg(zt;ω)\nabla_{\omega}P_{{\theta}}^{(t-1)}(y_{q})=-\phi_{1}\left(y_{q};g\left({z}_{t};{\omega}\right),\sigma_{y}^{2}\right)\nabla_{\omega}g\left({z}_{t};{\omega}\right) and cPθ(t1)(yq)=ϕ1(yq;g(zt;ω),σy2)(yqg(zt;ω))\nabla_{c}P_{{\theta}}^{(t-1)}(y_{q})=-\phi_{1}\left(y_{q};g\left({z}_{t};{\omega}\right),\sigma_{y}^{2}\right)(y_{q}-g\left({z}_{t};{\omega}\right)). The gradient for the lower tail CLS is

θsCLS-A(Pθ(t1),yt)=θsLS(Pθ(t1),yt)I(ytA)11Pθ(t1)(yq)θPθ(t1)(yq)I(ytAc).\nabla_{\theta}s^{\text{CLS-A}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\nabla_{\theta}s^{\text{LS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)I(y_{t}\in A)-\frac{1}{1-P_{{\theta}}^{(t-1)}(y_{q})}\nabla_{\theta}P_{{\theta}}^{(t-1)}(y_{q})I(y_{t}\in A^{c}).

S5.2.3 Interval score (IS)

The gradient of the IS can be written as

\nabla_{\theta}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)=\left(\nabla_{\omega}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)^{\prime},\nabla_{c}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right)\right)^{\prime},

with elements ωsIS(Pθ(t1),yt)\nabla_{\omega}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) and csIS(Pθ(t1),yt)\nabla_{c}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) defined as:

ωsIS(Pθ(t1),yt)\displaystyle\nabla_{\omega}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) =[121qI(yt>ut)]utω[21qI(yt<lt)1]ltω\displaystyle=-\left[1-\frac{2}{1-q}I(y_{t}>u_{t})\right]\frac{\partial u_{t}}{\partial\omega}^{\prime}-\left[\frac{2}{1-q}I(y_{t}<l_{t})-1\right]\frac{\partial l_{t}}{\partial\omega}^{\prime}
csIS(Pθ(t1),yt)\displaystyle\nabla_{c}s^{\text{IS}}\left(P_{{\theta}}^{(t-1)},y_{t}\right) =[121qI(yt>ut)]utc[21qI(yt<lt)1]ltc\displaystyle=-\left[1-\frac{2}{1-q}I(y_{t}>u_{t})\right]\frac{\partial u_{t}}{\partial c}-\left[\frac{2}{1-q}I(y_{t}<l_{t})-1\right]\frac{\partial l_{t}}{\partial c}

The derivative utω\frac{\partial u_{t}}{\partial{\omega}} can be evaluated by first noting that αu,t=Pθ(t1)(ut)\alpha_{u,t}=P_{{\theta}}^{(t-1)}(u_{t}). Then, using the triple product rule we know that

utω=utαu,tαu,tω,\frac{\partial u_{t}}{\partial\omega}=-\frac{\partial u_{t}}{\partial\alpha_{u,t}}\frac{\partial\alpha_{u,t}}{\partial\omega},

where the first term can be computed as utαu,t=1p(ut|t1,θ)\frac{\partial u_{t}}{\partial\alpha_{u,t}}=\frac{1}{p\left(u_{t}|\mathcal{F}_{t-1},{\theta}\right)}. The second term αu,tω=ωPθ(t1)(ut)\frac{\partial\alpha_{u,t}}{\partial\omega}=\nabla_{\omega}P_{{\theta}}^{(t-1)}(u_{t}) is evaluated as in the subsection above. The derivative ltω\frac{\partial l_{t}}{\partial{\omega}} is evaluated in the same fashion as described for utω\frac{\partial u_{t}}{\partial{\omega}}. The corresponding derivatives for parameter cc can also be computed using similar steps.
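As with the mixture model, the IS gradient can be validated end-to-end on a Gaussian predictive, where $u_{t}$ and $l_{t}$ shift one-for-one with the mean, so the implicit derivatives equal one for the mean parameter. A sketch (the central $q$-coverage interval and all function names are assumptions of this illustration):

```python
import math

def Phi_inv(p):
    """Inverse standard normal CDF by bisection (illustrative)."""
    lo, hi = -40.0, 40.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if 0.5 * (1.0 + math.erf(mid / math.sqrt(2.0))) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def interval_score(y, m, s, q):
    """Positively oriented IS of the central q-coverage interval of N(m, s^2)."""
    a = 1.0 - q
    l = m + s * Phi_inv(a / 2.0)
    u = m + s * Phi_inv(1.0 - a / 2.0)
    penalty = (2.0 / a) * ((l - y) if y < l else 0.0) \
            + (2.0 / a) * ((y - u) if y > u else 0.0)
    return -((u - l) + penalty)

def grad_m(y, m, s, q):
    """Gradient formula from the text with du/dm = dl/dm = 1."""
    a = 1.0 - q
    l = m + s * Phi_inv(a / 2.0)
    u = m + s * Phi_inv(1.0 - a / 2.0)
    return -(1.0 - (2.0 / a) * (1.0 if y > u else 0.0)) \
           - ((2.0 / a) * (1.0 if y < l else 0.0) - 1.0)
```

When $y_{t}$ lies inside the interval the gradient is zero (the interval width does not depend on the mean); outside it, the gradient matches a finite difference of the score.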