Loss-Based Variational Bayes Prediction
Abstract
We propose a new approach to Bayesian prediction that caters for models with a large number of parameters and is robust to model misspecification. Given a class of high-dimensional (but parametric) predictive models, this new approach constructs a posterior predictive using a variational approximation to a generalized posterior that is directly focused on predictive accuracy. The theoretical behavior of the new prediction approach is analyzed and a form of optimality demonstrated. Applications to both simulated and empirical data using high-dimensional Bayesian neural network and autoregressive mixture models demonstrate that the approach provides more accurate results than various alternatives, including misspecified likelihood-based predictions.
Keywords: Loss-based Bayesian forecasting; variational inference; generalized (Gibbs) posteriors; proper scoring rules; Bayesian neural networks; M4 forecasting competition
1 Introduction
The conventional paradigm for Bayesian prediction is underpinned by the assumption that the true data generating process is either equivalent to the predictive model adopted, or spanned by a finite set of models over which we average. Of late, however, recognition of the unrealistic nature of such an assumption, allied with an increased interest in driving prediction by problem-specific measures of accuracy, or loss, has led to alternative approaches. Whilst antecedents of these new principles are found in the ‘probably approximately correct’ (PAC)-Bayes approach to prediction in the machine learning literature (see Alquier, 2021, for a review), it is in the statistics and econometrics literature that this ‘loss-based prediction’ has come to more formal maturity, including in terms of its theoretical validation. This includes Bayesian work on weighted combinations of predictions, such as, for example, Billio et al. (2013), Casarin et al. (2015), Pettenuzzo and Ravazzolo (2016), Bassetti et al. (2018), Baştürk et al. (2019), McAlinn and West (2019) and McAlinn et al. (2020), where weights are updated via various predictive criteria, and the true model is not assumed to be one of the constituent models - i.e. an ℳ-open state of the world (Bernardo and Smith, 1994) is implicitly adopted. It also includes a contribution by Loaiza-Maya et al. (2020), in which both single models and predictive mixtures are used to generate accurate Bayesian predictions in the presence of model misspecification; with both theoretical and numerical results highlighting the ability of the approach to out-perform conventional likelihood-based Bayesian predictive methods.
However, as a general rule, the existing approaches discussed above do not scale well to complex models with high-dimensional parameter spaces. In contrast, the current paper contributes to the above literature by providing a new approach for producing accurate loss-based predictions in high-dimensional problems. We begin by defining a class of flexible predictive models, conditional on a set of unknown parameters, that are a plausible mechanism for generating probabilistic predictions. A prior distribution is placed over the parameters of this predictive class, and the prior then updated to a posterior via a criterion function that captures a user-specified measure of predictive accuracy. That is, the conventional, and potentially misspecified likelihood-based update is eschewed, in favour of a function that is tailored to the predictive problem at hand; the ultimate goal being to produce accurate predictions according to the measure that matters, without requiring knowledge of the true data generating mechanism.
In the spirit of the various generalized Bayesian inferential methods, in which likelihood functions are also replaced by alternative updating mechanisms (inter alia, Chernozhukov and Hong, 2003, Zhang, 2006a, Zhang, 2006b, Jiang and Tanner, 2008, Bissiri et al., 2016, Giummolè et al., 2017, Knoblauch et al., 2019, Miller and Dunson, 2019, Syring and Martin, 2019, and Pacchiardi and Dutta, 2021), we adopt a coherent update based on the exponential of a scaled sample loss; we refer to Miller (2021) for a thorough discussion of the large sample behavior of generalized Bayesian posteriors. As in Loaiza-Maya et al. (2020) the loss is, in turn, defined by a proper scoring rule (Gneiting et al., 2007; Gneiting and Raftery, 2007) that rewards a given form of predictive accuracy; for example, accurate prediction of extreme values. Given the high-dimensional nature of the resultant posterior, numerical treatment via ‘exact’ Markov Chain Monte Carlo (MCMC) is computationally challenging; hence we adopt an ‘approximate’ approach using variational principles. Since the posterior that results from an exponentiated loss was first denoted as a ‘Gibbs posterior’ by Zhang (2006a), we refer to the variational approximation of this posterior as the Gibbs variational posterior, and the predictive distribution that results from this posterior, via the standard Bayesian calculus, as the Gibbs variational predictive (hereafter, GVP). With a slight abuse of terminology, and when it is clear from the context, we also use the abbreviation GVP to reference the method of Gibbs variational prediction, or loss-based variational prediction per se.
In an artificially simple ‘toy’ example in which MCMC sampling of the exact Gibbs posterior is feasible, we illustrate that there are negligible differences between the out-of-sample results yielded by the GVP and those produced by the predictive based on MCMC sampling from the Gibbs posterior. We establish this result under both correct specification, in which the true data generating process matches the adopted predictive model, and under misspecification of the predictive model. The correspondence between the ‘approximate’ and ‘exact’ predictions in the correct specification case mimics that documented in Frazier et al. (2019), in which posterior approximations are produced by approximate Bayesian computation (ABC) and for the log score update (only). In the misspecified case, ‘strictly coherent’ predictions are produced (Martin et al., 2022), whereby a given GVP, constructed via the use of a particular scoring rule, is shown to perform best out-of-sample according to that same score when compared with a GVP constructed via some alternative scoring rule; with the numerical values of the average out-of-sample scores closely matching those produced by the exact Gibbs predictive. That is, building a generalized posterior via a given scoring rule yields superior predictive accuracy in that rule despite any inaccuracy in the measurement of posterior uncertainty that is induced by the variational approximation.
We then undertake more extensive Monte Carlo experiments to highlight the power of the approach in genuinely high-dimensional problems, with predictives based on an autoregressive mixture model with 20 mixture components, and on a neural network model used for illustration. An empirical analysis, in which GVP is used to produce accurate prediction intervals for the 4227 daily time series used in the M4 forecasting competition, illustrates the applicability of the method to reasonably large and realistic data sets.
While the numerical toy example demonstrates that little predictive accuracy is lost when using the GVP, relative to the ‘exact’ predictive, we also rigorously compare the theoretical behavior of the GVP and the potentially infeasible exact predictive. Specifically, we demonstrate that in large samples the GVP delivers predictions that are just as accurate as those obtained from the exact Gibbs predictive when measured according to the score used to construct the Gibbs posterior. We do this by proving that the GVP ‘merges’ (in the sense of Blackwell and Dubins, 1962) with the exact Gibbs predictive. This merging result relies on novel posterior concentration results that extend existing concentration results for generalized posteriors (Alquier et al., 2016, Alquier and Ridgway, 2020, and Yang et al., 2020) to temporally dependent, potentially heavy-tailed data.
The remainder of the paper is structured as follows. In Section 2 the loss-based paradigm for Bayesian prediction is defined, with its links to related segments of the literature (as flagged briefly above) detailed. In Section 3 we detail how to construct the GVP, and provide theoretical verification of its accuracy. A numerical illustration of the ability of the variational method to yield essentially equivalent predictive accuracy to that produced via MCMC sampling is given in Section 4, using a low-dimensional example. The approach that we use in the implementation of GVP, including the choice of variational family, is briefly described, with further computational details of the stochastic gradient ascent (SGA) method used to perform the optimization included in a supplementary appendix to the paper. We then proceed with the numerical illustration of the method in high-dimensional settings - using both artificially simulated and empirical data - in Sections 5 and 6 respectively; with the illustrations highlighting that, overall, the method ‘works’ and ‘works well’. We conclude in Section 7 with a discussion of the implications of our findings, including for future research directions. Proofs of all theoretical results are given in an appendix, and all computational details and prior specifications are included in a supplementary appendix to the paper.
2 Setup and Loss-Based Bayesian Prediction
2.1 Preliminaries and Notation
Consider a stochastic process defined on the complete probability space . Let denote the natural sigma-field, and let denote the infinite-dimensional distribution of the sequence . Let denote a vector of realizations from the stochastic process.
Our goal is to use a particular collection of statistical models, adapted to , that describe the behavior of the observed data, to construct accurate predictions for the random variable . The parameters of the model are denoted by , the parameter space by , where the dimension could grow as and . For notational simplicity, we drop the dependence of and on in what follows. Let be a generic class of one-step-ahead predictive models for , conditioned on the information available at time , such that .¹ The treatment of scalar and one-step-ahead prediction is for the purpose of illustration only, and all the methodology that follows can easily be extended to multivariate and multi-step-ahead prediction in the usual manner. When admits a density with respect to the Lebesgue measure, we denote it by . The parameter thus indexes values in the predictive class, with taking values in the complete probability space , and where measures our beliefs - either prior or posterior - about the unknown parameter , and when they exist we denote the respective densities by and .
Denoting the likelihood function by , the conventional approach to Bayesian prediction updates prior beliefs about via Bayes rule, to form the Bayesian posterior density,
(1)
in which we follow convention and abuse notation by writing this quantity as a density even though, strictly speaking, the density may not exist. The one-step-ahead predictive distribution is then constructed as
(2)
However, when the class of predictive models indexed by does not contain the true predictive distribution there is no sense in which this conventional approach remains the ‘gold standard’. In such cases, the loss that underpins (2) should be replaced by the particular predictive loss that matters for the problem at hand. That is, our prior beliefs about and, hence, about the elements in , need to be updated via a criterion function defined by a user-specified measure of predictive loss.
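The Monte Carlo mechanics behind a predictive of the form (2) can be sketched as follows: given draws from a posterior over the parameter, the one-step-ahead predictive density at a point is the average of the conditional model densities over those draws. The Gaussian location model, the stand-in posterior draws, and all numerical values below are illustrative assumptions only, not the paper's model.

```python
import numpy as np

def normal_pdf(y, mu, sigma):
    # density of N(mu, sigma^2) evaluated at y
    return np.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def predictive_density(y, theta_draws, sigma=1.0):
    """Estimate the predictive density in (2) at the point y by averaging
    the conditional density p(y | theta) over posterior draws of theta."""
    return float(np.mean(normal_pdf(y, theta_draws, sigma)))

rng = np.random.default_rng(0)
# stand-in for posterior draws of a location parameter (hypothetical values)
theta_draws = rng.normal(loc=0.3, scale=0.1, size=5000)
p_hat = predictive_density(0.0, theta_draws)
```

The same averaging applies unchanged when the draws come from a generalized (Gibbs) posterior rather than the likelihood-based posterior.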
2.2 Bayesian Updating Based on Scoring Rules
Loaiza-Maya et al. (2020) propose a method for producing Bayesian predictions using loss functions that specifically capture the accuracy of density forecasts. For a convex class of predictive distributions on , density prediction accuracy can be measured using the positively-oriented (i.e. higher is better) scoring rule , where the expected scoring rule under the true distribution is defined as
(3)
Since is unattainable in practice, a sample estimate based on is used to define a sample criterion: for a given , define the sample average score as
(4)
Adopting the generalized updating rule proposed by Bissiri et al. (2016) (see also Giummolè et al., 2017, Holmes and Walker, 2017, Lyddon et al., 2019, and Syring and Martin, 2019), Loaiza-Maya et al. (2020) distinguish elements in using
(5)
where the scale factor is obtained in a preliminary step using measures of predictive accuracy. This posterior explicitly weights elements of according to their predictive accuracy in the scoring rule . As such, the one-step-ahead generalized predictive,
(6)
will often outperform, in the chosen rule , the likelihood (or log-score)-based predictive in cases where the model is misspecified. Given its explicit dependence on the Gibbs posterior in (5), and as noted earlier, we refer to the predictive in (6) as the (exact) Gibbs predictive.² We note that, in general, the choice of can be -dependent. However, we eschew this dependence to maintain notational brevity.
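As a hedged illustration of the update in (5), the sketch below evaluates a Gibbs posterior on a one-dimensional parameter grid by exponentiating the scaled sample log-score and renormalizing against a prior. The i.i.d. Gaussian mean model, the standard normal prior, and the scale value eta = 1 are illustrative choices, not the paper's settings.

```python
import numpy as np

def log_score_sum(theta, y, sigma=1.0):
    # sum of predictive log densities under an i.i.d. Gaussian mean model
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - 0.5 * ((y - theta) ** 2) / sigma**2)

def gibbs_posterior(grid, y, eta=1.0):
    """Gibbs posterior density on a grid: proportional to
    exp(eta * sample score) times the prior, as in (5)."""
    log_prior = -0.5 * grid**2                      # N(0,1) prior, up to a constant
    log_post = np.array([eta * log_score_sum(t, y) for t in grid]) + log_prior
    w = np.exp(log_post - log_post.max())           # stabilize before normalizing
    return w / (w.sum() * (grid[1] - grid[0]))

rng = np.random.default_rng(1)
y = rng.normal(0.5, 1.0, size=200)
grid = np.linspace(-1.0, 2.0, 601)
post = gibbs_posterior(grid, y)
```

With the log score as the criterion and eta = 1 this reduces to an ordinary Bayesian update; other scoring rules simply swap the function inside the exponent.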
3 Gibbs Variational Prediction
3.1 Overview
While is our ideal posterior, it can be difficult to sample from if the dimension of is even moderately large, which occurs in situations with a large number of predictors, or in flexible models. Therefore, the exact predictive itself is not readily available in such cases. However, this does not invalidate the tenet on which is constructed. Viewed in this way, we see that the problem of predictive inference via could be solved if one were able to construct an accurate enough approximation to . Herein, we propose to approximate using variational Bayes (VB).
VB seeks to approximate by finding the closest member in a class of probability measures, denoted as , to in a chosen divergence measure. The most common choice of divergence is the Kullback-Leibler (KL) divergence: for probability measures , and absolutely continuous with respect to , the KL divergence is given by . VB then attempts to produce a posterior approximation to by choosing to minimize:
In practice, VB is often conducted, in an equivalent manner, by maximizing the so-called evidence lower bound (ELBO). In the case where and admit densities and , respectively, the ELBO is given by:
(7)
While other classes of divergences are used in variational inference, such as the Rényi-divergence, the KL divergence is the most commonly encountered in the literature. Hence, in what follows we focus on this choice, but note here that other divergences can also be used.
The variational approximation to the Gibbs posterior, , can be defined as follows.
Definition 1
For a class of probability distributions, the variational posterior satisfies
and the Gibbs variational predictive (GVP) is defined as
Remark 1
In general, we only require that is an approximate minimizer, i.e., . The latter may be useful in cases where may not be unique, or in cases where it does not exist for a given but is well-behaved for all large enough.
The GVP, , circumvents the need to construct via sampling the Gibbs posterior . In essence, we replace the sampling problem with an optimization problem for which reliable methods exist even if is high-dimensional, and which in turn yields an approximation to . Consequently, even though, as discussed earlier, it is not always feasible to simulate from the exact Gibbs posterior in situations where is high-dimensional, access to the variational predictive remains feasible.
This variational approach to prediction is related to the ‘generalized variational inference’ approach of Knoblauch et al. (2019), but with two main differences. Firstly, our approach is focused on predictive accuracy, not on parameter inference. Our only goal is to produce density forecasts that are as accurate as possible in the chosen scoring rule .³ Critically, since predictive accuracy is our goal, we are not concerned with the potential over/under-coverage of credible sets for the model unknowns built from generalized posteriors, or VB posteriors; see Miller (2021) for a theoretical discussion of posterior coverage in generalized Bayesian inference. Secondly, our approach follows the idea of Bissiri et al. (2016) and Loaiza-Maya et al. (2020) and targets the predictions built from the Gibbs posterior in (5), rather than the exponentiated form of some general loss as in Knoblauch et al. (2019). This latter point is certainly critical if inferential accuracy were still deemed to be important, since without the tempering constant that defines the posterior in (5), the exponentiated loss function can be very flat and the posterior uninformative about .⁴ See Bissiri et al. (2016), Giummolè et al. (2017), Holmes and Walker (2017), Lyddon et al. (2019), Syring and Martin (2019) and Pacchiardi and Dutta (2021) for various approaches to the setting of in inferential settings. It is our experience, however (Loaiza-Maya et al., 2020), that in settings where predictive accuracy is the only goal, the choice of actually has little impact on the generation of accurate predictions via , as long as the sample size is reasonably large. Preliminary experimentation in the current setting, in which the variational predictive is the target, suggests that this finding remains relevant, and as a consequence we have adopted a default value of in all numerical work.
Further research is needed to deduce the precise impact of on , in particular for smaller sample sizes, and we leave this important topic for future work.⁵ We note here that the work of Wu and Martin (2021) attempts to calibrate the coverage of posterior predictives constructed from power posteriors. While the approach of Wu and Martin (2021) is not directly applicable in our context, we conjecture that an appropriately modified version of their approach could be applied in our setting to derive posterior predictives with reasonable predictive coverage.
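The variational machinery of Section 3.1 can be sketched concretely: a mean-field Gaussian q(theta) = N(mu, s^2) is calibrated by stochastic gradient ascent on an ELBO of the form (7), using the reparameterization theta = mu + s * z. The model (i.i.d. Gaussian mean), the log-score criterion, the N(0, 10^2) prior, eta = 1, and all tuning constants are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def elbo_gradient_step(lam, y, eta=1.0, n_mc=32, rng=None):
    """One reparameterized Monte Carlo gradient of the ELBO w.r.t.
    lam = (mu, log s) for q = N(mu, s^2)."""
    mu, log_s = lam
    s = np.exp(log_s)
    z = rng.standard_normal(n_mc)
    theta = mu + s * z
    # derivative in theta of [eta * sum_t log N(y_t; theta, 1) + log N(theta; 0, 10^2)]
    dlog_target = eta * np.sum(y[:, None] - theta[None, :], axis=0) - theta / 100.0
    grad_mu = np.mean(dlog_target)                    # entropy term has zero mu-gradient
    grad_log_s = np.mean(dlog_target * z) * s + 1.0   # +1 from the Gaussian entropy
    return np.array([grad_mu, grad_log_s])

rng = np.random.default_rng(2)
y = rng.normal(1.0, 1.0, size=300)
lam = np.array([0.0, 0.0])
for step in range(2000):
    lam = lam + 0.001 * elbo_gradient_step(lam, y, rng=rng)
mu_hat, s_hat = lam[0], np.exp(lam[1])
```

Under these assumptions the exact optimum is (essentially) the conjugate posterior N(sample mean, 1/n), so the sketch can be sanity-checked analytically: mu_hat should sit near the sample mean and s_hat near 1/sqrt(300).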
3.2 Accuracy of the GVP
By driving the updating mechanism with the measure of predictive accuracy that matters, it is hoped that GVP produces accurate predictions without requiring correct model specification. The following result demonstrates that, in large samples, the GVP, , produces predictions that are indistinguishable from the exact predictive, ; thus, in terms of predictive accuracy, the use of a variational approximation to the Gibbs posterior has little impact on the accuracy of the posterior predictives. For probability measures , let denote the total variation distance.
Theorem 1 demonstrates that distributional predictions made using the GVP and the exact Gibbs predictive agree, in the chosen rule, in large samples. This type of result is colloquially known as a ‘merging’ result (Blackwell and Dubins, 1962), and means that if we take the exact Gibbs predictive as our benchmark for (Bayesian) prediction, then the GVP is as accurate as this possibly infeasible benchmark (at least in large samples).
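The total variation metric that underlies the merging statement can be approximated numerically. The sketch below computes the TV distance between two continuous predictives on a grid; the Gaussian densities are illustrative stand-ins for a GVP and an exact Gibbs predictive, chosen only to show the distance shrinking as the two coalesce.

```python
import numpy as np

def tv_distance_on_grid(pdf1, pdf2, grid):
    """Approximate TV(P1, P2) = 0.5 * integral |p1 - p2| via a Riemann sum."""
    dx = grid[1] - grid[0]
    return 0.5 * float(np.sum(np.abs(pdf1(grid) - pdf2(grid)))) * dx

def gauss(mu, s):
    # returns a vectorized N(mu, s^2) density
    return lambda x: np.exp(-0.5 * ((x - mu) / s) ** 2) / (s * np.sqrt(2 * np.pi))

grid = np.linspace(-10.0, 10.0, 20001)
# as two predictives coalesce, the TV distance shrinks toward zero
d_far = tv_distance_on_grid(gauss(0.0, 1.0), gauss(1.0, 1.0), grid)
d_near = tv_distance_on_grid(gauss(0.0, 1.0), gauss(0.01, 1.0), grid)
```

For two unit-variance Gaussians separated by delta, the TV distance is 2*Phi(delta/2) - 1, which the grid computation reproduces to high accuracy.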
Remark 2
As an intermediate step in the proof of Theorem 1, we must demonstrate posterior concentration of , and , in cases where the data is temporally dependent, possibly heavy-tailed, and where is an arbitrary scoring rule. For the sake of brevity, and to keep the focus on the practical usefulness of this approach, we present these details in Appendix A. However, we note that our focus on the types of data often encountered in forecasting, for example, financial, economic, or atmospheric processes, does not allow us to use existing approaches to guarantee posterior concentration of and . Instead, we must rely on the smoothness of and the control of certain remainder terms to demonstrate this concentration. We refer the interested reader to Appendix A for full details and discussion.
As a further result, we can demonstrate that GVP is as accurate as the infeasible frequentist ‘optimal [distributional] predictive’ approach suggested in Gneiting and Raftery (2007). Following Gneiting and Raftery (2007), define the ‘optimal’ (frequentist) predictive within the class , and based on scoring rule , as , where
(8)
The following result demonstrates that, in large samples, predictions produced using the GVP are equivalent to those made using the optimal frequentist predictive .
We remind the reader that, for the sake of brevity, statement of the assumptions, and proofs of all stated results, are given in Appendix A.
4 A Toy Model of Financial Returns
We now illustrate the behavior of GVP in a simple toy example for a financial return, , in which the exact Gibbs predictive is also accessible.⁶ We reiterate here, and without subsequent repetition, that all details of the prior specifications and the numerical steps required to produce the variational approximations, for this and the following numerical examples, are provided in the supplementary appendices. This example serves two purposes. First, it illustrates Theorem 2, and, in so doing, highlights the benefits of GVP relative to prediction based on a misspecified likelihood function. Second, the GVP is shown to yield almost equal (average) out-of-sample predictions to those produced by the exact Gibbs posterior accessed via MCMC; i.e. numerical support for Theorem 1 is provided.
The predictive class, , is defined by a generalized autoregressive conditional heteroscedastic (GARCH(1,1)) model with Gaussian errors, with . We adopt three alternative specifications for the true data generating process (DGP): 1) a model that matches the Gaussian GARCH(1,1) predictive class; 2) a stochastic volatility model with leverage:
and 3) a stochastic volatility model with a smooth transition in the volatility autoregression:
where . DGP 1) defines a correct specification setting, whilst DGPs 2) and 3) characterize different forms of misspecification.
Denoting the predictive density function associated with the Gaussian GARCH(1,1) model, evaluated at the observed , as , we implement GVP using four alternative forms of (positively-oriented) scoring rules:
(9) | |||
(10) | |||
(11) | |||
(12) |
where and denote the and predictive quantiles. The log-score (LS) in (9) is a ‘local’ scoring rule, attaining a high value if the observed value, , is in the high density region of . The continuous ranked probability score (CRPS) in (10) (Gneiting and Raftery, 2007) is, in contrast, sensitive to distance, and rewards the assignment of high predictive mass near the realized , rather than just at that value. The score in (11) is the censored likelihood score (CLS) of Diks et al. (2011), which rewards predictive accuracy over any pre-specified region of interest ( denoting the complement). We use the score for defining the lower and upper tails of the predictive distribution, as determined in turn by the 10%, 20%, 80% and 90% percentiles of the empirical distribution of , labelling these cases hereafter as CLS10, CLS20, CLS80 and CLS90. A high value of this score in any particular instance thus reflects a predictive distribution that accurately predicts extreme values of the financial return. The last score considered is the interval score (IS) in (12), which is designed to measure the accuracy of the predictive interval , where . This score rewards narrow intervals with accurate coverage. All components of (11) have closed-form solutions for the (conditionally) Gaussian predictive model, as do the integral in (10) and the bounds in (12).
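The four positively-oriented scores in (9)-(12) admit simple closed forms for a Gaussian one-step-ahead predictive N(mu, sigma^2), sketched below under stated conventions: the CRPS closed form follows Gneiting and Raftery (2007), the CLS follows Diks et al. (2011) for a lower-tail region y <= r, and the IS is the negative of the usual interval score so that higher is better throughout. The particular thresholds in the usage are illustrative.

```python
from math import log, sqrt, pi
from statistics import NormalDist

def log_score(y, mu, sigma):
    """LS in (9): predictive log density at the realized value."""
    return log(NormalDist(mu, sigma).pdf(y))

def crps_score(y, mu, sigma):
    """Positively-oriented CRPS in (10), Gaussian closed form."""
    z = (y - mu) / sigma
    std = NormalDist()
    return -sigma * (z * (2 * std.cdf(z) - 1) + 2 * std.pdf(z) - 1 / sqrt(pi))

def cls_lower(y, mu, sigma, r):
    """CLS in (11) for the region of interest A = (-inf, r]:
    full log density inside A, log predictive mass of A's complement outside."""
    d = NormalDist(mu, sigma)
    return log(d.pdf(y)) if y <= r else log(1 - d.cdf(r))

def interval_score(y, mu, sigma, alpha=0.05):
    """Positively-oriented IS in (12) for the central (1 - alpha) interval."""
    d = NormalDist(mu, sigma)
    lo, up = d.inv_cdf(alpha / 2), d.inv_cdf(1 - alpha / 2)
    penalty = (2 / alpha) * max(lo - y, 0.0) + (2 / alpha) * max(y - up, 0.0)
    return -((up - lo) + penalty)
```

Averaging any of these over an evaluation sample gives the criterion (4) for that rule; the tail thresholds r used for CLS10 through CLS90 would be empirical percentiles of the data, as in the text.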
In total then, seven distinct scoring rules are used to define the sample criterion function in (4). In what follows we reference these seven criterion functions, and the associated Gibbs posteriors, using the labels LS, CLS10, CLS20, CLS80, CLS90, CRPS and IS.
4.1 Estimation of the Gibbs Predictives
Given the low dimension of the predictive model, it is straightforward to use an MCMC scheme to sample from the exact Gibbs posterior, where is the exact Gibbs posterior density in (5) computed under scoring rule . As noted earlier, in this and all following numerical work we set . For each of the posteriors, we initialize the chains using a burn-in period of periods, and retain the next draws . The posterior draws are then used to estimate the exact Gibbs predictive in (6) via
To perform GVP, we first need to produce the variational approximation of . This is achieved in several steps. First, the parameters of the GARCH(1,1) model are transformed to the real line. With some abuse of notation, we re-define here the GARCH(1,1) parameters introduced in the previous section with the superscript to signal ‘raw’. The parameter vector is then re-defined as , where denotes the (univariate) normal cumulative distribution function (cdf). The next step involves approximating for the re-defined . We adopt the mean-field variational family (see for example Blei et al., 2017), with a product-form Gaussian density , where is the vector of variational parameters, comprised of mean and variance vectors and respectively, and denotes the (univariate) normal probability density function (pdf). Denote as and the distribution functions associated with and , respectively. The approximation is then calibrated by solving the maximization problem
(13)
where . Remembering the use of the notation to denote (4) for scoring rule , (7) (for case ) becomes the corresponding sample criterion. Optimization is performed via SGA, as described in Section S2 of the supplementary appendix. Once calibration of is completed, the GVP is estimated as with . To calibrate VB iterations are used, and to estimate the variational predictive we set .
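The reparameterization step described above can be sketched as follows: ‘raw’ GARCH(1,1) parameters on the real line are mapped to the constrained parameters via the exponential (positivity) and the standard normal cdf (unit-interval constraints), after which the one-step-ahead variance recursion can be run. The exact constraint set and mapping used in the paper may differ; this is one natural choice consistent with the description, and all numerical values are illustrative.

```python
import numpy as np
from statistics import NormalDist

std = NormalDist()

def to_constrained(theta_raw):
    """Map raw (real-line) parameters to GARCH(1,1) parameters:
    omega > 0 via exp; alpha, beta in (0, 1) via the standard normal cdf."""
    w_raw, a_raw, b_raw = theta_raw
    return np.exp(w_raw), std.cdf(a_raw), std.cdf(b_raw)

def garch_one_step_variance(theta_raw, y, h0=1.0):
    """One-step-ahead conditional variance from the GARCH(1,1) recursion
    h_{t+1} = omega + alpha * y_t^2 + beta * h_t."""
    omega, alpha, beta = to_constrained(theta_raw)
    h = h0
    for yt in y:
        h = omega + alpha * yt**2 + beta * h
    return h
```

In the variational scheme, draws of the raw parameters from the mean-field Gaussian q are pushed through this map before evaluating the predictive density, so the Gaussian family never has to respect the constraints directly.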
To produce the numerical results we generate a time series of length from the true DGP. Then, we perform an expanding window exercise from to . For we construct the predictive densities and as outlined above. Then, for we compute the measures of out-of-sample predictive accuracy and . Finally, we compute the average out-of-sample scores and . The results are tabulated and discussed in the following section.
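The expanding-window exercise just described has a simple skeleton: refit on the data up to time t, score the resulting one-step-ahead predictive at the next observation, and average those scores over the evaluation period. The ‘fit’ below is a deliberate placeholder (a Gaussian predictive from the sample mean and standard deviation); in the paper each window would instead yield a GVP or MCMC-based Gibbs predictive, and the series length and split are illustrative.

```python
import numpy as np
from statistics import NormalDist

def fit_gaussian_predictive(y_hist):
    """Placeholder 'fit': a Gaussian predictive from the history's moments."""
    return float(np.mean(y_hist)), float(np.std(y_hist)) + 1e-8

def expanding_window_avg_log_score(y, t0):
    """Average out-of-sample log score over the evaluation period:
    refit on y[0:t+1], score the predictive at y[t+1]."""
    scores = []
    for t in range(t0, len(y) - 1):
        mu, sigma = fit_gaussian_predictive(y[: t + 1])
        scores.append(np.log(NormalDist(mu, sigma).pdf(y[t + 1])))
    return float(np.mean(scores))

rng = np.random.default_rng(3)
y = rng.normal(0.0, 1.0, size=1500)
avg_ls = expanding_window_avg_log_score(y, t0=1000)
```

Swapping the log score for any of the other rules (CRPS, CLS, IS) gives the corresponding average out-of-sample score reported in the tables.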
4.2 Results
The results of the simulation exercise are recorded in Table 1. Panels A to C record the results for Scenarios 1) to 3), with the average out-of-sample scores associated with the exact Gibbs predictive (estimated via MCMC) appearing in the left-hand-side panel and the average scores for GVP appearing in the right-hand-side panel. All values on the diagonal of each sub-panel correspond to the case where . In the misspecified case, numerical validation of the asymptotic result that the GVP concentrates onto the optimal predictive (Theorem 2) occurs if the largest values (bolded) in a column appear on the diagonal (‘strict coherence’ in the language of Martin et al., 2022). In the correctly specified case, in which all proper scoring rules will, for a large enough sample, pick up the one true model (Gneiting and Raftery, 2007), we would expect all values in a given column to be very similar to one another, with differences reflecting sampling variation only. Finally, validation of the theoretical property of merging between the exact and variational Gibbs predictives (Theorem 1) occurs if the corresponding results in all left- and right-hand-side panels are equivalent.
As is clear, there is almost uniform numerical validation of both theoretical results. Under misspecification Scenario 2) (Panel B), the GVP results are strictly coherent (i.e. all bold values lie on the diagonal), with the exact Gibbs predictive results equivalent to the corresponding GVP results to three or four decimal places. The same broad findings obtain under misspecification Scenario 3), apart from the fact that the IS updates are second best (to the log-score; and then only just) in terms of the out-of-sample IS measure. In Panel A, on the other hand, we see the expected (virtual) equivalence of all results in a given column, reflecting the fact that all Gibbs predictives (however estimated) are concentrating onto the true predictive model and, hence, have identical out-of-sample performance. Of course, for a large but still finite number of observations, we would expect the log-score to perform best, due to the efficiency of the implicit maximum likelihood estimator underpinning the results; and, to all intents and purposes, this is exactly what we observe in Panels A.1 and A.2.
In summary, GVP performs as anticipated, and reaps distinct benefits in terms of predictive accuracy. Any inaccuracy in the measurement of parameter uncertainty also has negligible impact on the finite sample performance of GVP relative to an exact comparator. In Section 5, we extend the investigation into design settings that mimic the high-dimensional problems to which we would apply the variational approach in practice, followed by an empirical application - again using a high-dimensional predictive model - in Section 6.
Panel A. True DGP: GARCH(1,1) | |||||||||||||||
A.1: Exact Gibbs predictive | A.2: GVP | ||||||||||||||
Average out-of-sample score | Average out-of-sample score | ||||||||||||||
LS | CLS10 | CLS20 | CLS80 | CLS90 | CRPS | IS | LS | CLS10 | CLS20 | CLS80 | CLS90 | CRPS | IS | ||
U.method | |||||||||||||||
LS | -0.0433 | -0.2214 | -0.3087 | -0.3029 | -0.2150 | -0.1438 | -1.1853 | -0.0434 | -0.2216 | -0.3088 | -0.3028 | -0.2150 | -0.1438 | -1.1857 | |
CLS10 | -0.0485 | -0.2222 | -0.3094 | -0.3069 | -0.2190 | -0.1441 | -1.2009 | -0.0496 | -0.2227 | -0.3100 | -0.3075 | -0.2199 | -0.1441 | -1.2053 | |
CLS20 | -0.0482 | -0.2221 | -0.3094 | -0.3066 | -0.2185 | -0.1440 | -1.2014 | -0.0493 | -0.2225 | -0.3097 | -0.3072 | -0.2193 | -0.1440 | -1.2052 | |
CLS80 | -0.0558 | -0.2305 | -0.3198 | -0.3032 | -0.2150 | -0.1446 | -1.2087 | -0.0567 | -0.2311 | -0.3204 | -0.3033 | -0.2152 | -0.1446 | -1.2107 | |
(Panel A, continued: first seven score columns Exact Gibbs predictive; last seven GVP.)

| U.method | LS | CLS10 | CLS20 | CLS80 | CLS90 | CRPS | IS | LS | CLS10 | CLS20 | CLS80 | CLS90 | CRPS | IS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLS90 | -0.0495 | -0.2258 | -0.3143 | -0.3029 | -0.2150 | -0.1441 | -1.1973 | -0.0502 | -0.2265 | -0.3149 | -0.3031 | -0.2152 | -0.1441 | -1.1993 |
| CRPS | -0.0462 | -0.2239 | -0.3116 | -0.3027 | -0.2147 | -0.1438 | -1.1874 | -0.0474 | -0.2239 | -0.3112 | -0.3041 | -0.2162 | -0.1439 | -1.1918 |
| IS | -0.0438 | -0.2216 | -0.3089 | -0.3031 | -0.2153 | -0.1438 | -1.1877 | -0.0440 | -0.2218 | -0.3091 | -0.3031 | -0.2153 | -0.1438 | -1.1882 |

Panel B. True DGP: Stochastic volatility with leverage
Average out-of-sample score. B.1: Exact Gibbs predictive (first seven score columns); B.2: GVP (last seven score columns).

| U.method | LS | CLS10 | CLS20 | CLS80 | CLS90 | CRPS | IS | LS | CLS10 | CLS20 | CLS80 | CLS90 | CRPS | IS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LS | -0.5636 | -0.3753 | -0.5452 | -0.3536 | -0.2512 | -0.2313 | -2.3468 | -0.5633 | -0.3752 | -0.5450 | -0.3535 | -0.2511 | -0.2313 | -2.3467 |
| CLS10 | -1.0193 | -0.3336 | -0.5021 | -0.8379 | -0.7339 | -0.3679 | -3.4470 | -1.0156 | -0.3336 | -0.5020 | -0.8340 | -0.7302 | -0.3659 | -3.4207 |
| CLS20 | -0.8060 | -0.3354 | -0.4968 | -0.6291 | -0.5267 | -0.2863 | -2.9923 | -0.8055 | -0.3355 | -0.4969 | -0.6282 | -0.5259 | -0.2861 | -2.9853 |
| CLS80 | -0.9203 | -0.7372 | -0.9311 | -0.3290 | -0.2292 | -0.2402 | -3.3135 | -0.9357 | -0.7514 | -0.9463 | -0.3290 | -0.2292 | -0.2402 | -3.3248 |
| CLS90 | -0.9575 | -0.7615 | -0.9649 | -0.3294 | -0.2292 | -0.2425 | -3.4213 | -0.9959 | -0.7969 | -1.0033 | -0.3293 | -0.2291 | -0.2426 | -3.4476 |
| CRPS | -0.5692 | -0.4029 | -0.5671 | -0.3431 | -0.2419 | -0.2300 | -2.4312 | -0.5649 | -0.3985 | -0.5626 | -0.3432 | -0.2419 | -0.2301 | -2.4338 |
| IS | -0.6552 | -0.3986 | -0.6111 | -0.3713 | -0.2480 | -0.2604 | -2.2030 | -0.6550 | -0.3985 | -0.6109 | -0.3712 | -0.2479 | -0.2603 | -2.2033 |

Panel C. True DGP: Stochastic volatility with smooth transition
Average out-of-sample score. C.1: Exact Gibbs predictive (first seven score columns); C.2: GVP (last seven score columns).

| U.method | LS | CLS10 | CLS20 | CLS80 | CLS90 | CRPS | IS | LS | CLS10 | CLS20 | CLS80 | CLS90 | CRPS | IS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LS | -1.6858 | -0.4239 | -0.6720 | -0.6751 | -0.4369 | -0.7196 | -7.0726 | -1.6860 | -0.4240 | -0.6721 | -0.6752 | -0.4371 | -0.7196 | -7.0742 |
| CLS10 | -1.8173 | -0.4172 | -0.6707 | -0.8016 | -0.5468 | -0.8210 | -7.8217 | -1.8145 | -0.4172 | -0.6706 | -0.7987 | -0.5437 | -0.8183 | -7.7717 |
| CLS20 | -1.7173 | -0.4190 | -0.6674 | -0.7067 | -0.4616 | -0.7425 | -7.2516 | -1.7171 | -0.4192 | -0.6676 | -0.7063 | -0.4611 | -0.7420 | -7.2481 |
| CLS80 | -1.7330 | -0.4650 | -0.7199 | -0.6681 | -0.4304 | -0.7511 | -7.3087 | -1.7339 | -0.4654 | -0.7204 | -0.6684 | -0.4307 | -0.7513 | -7.3159 |
| CLS90 | -1.8777 | -0.5927 | -0.8579 | -0.6731 | -0.4288 | -0.8707 | -8.1299 | -1.8819 | -0.5952 | -0.8608 | -0.6740 | -0.4295 | -0.8734 | -8.1211 |
| CRPS | -1.6938 | -0.4283 | -0.6762 | -0.6812 | -0.4435 | -0.7184 | -7.2059 | -1.6938 | -0.4286 | -0.6765 | -0.6809 | -0.4432 | -0.7185 | -7.2121 |
| IS | -1.6879 | -0.4244 | -0.6726 | -0.6754 | -0.4374 | -0.7208 | -7.0915 | -1.6881 | -0.4245 | -0.6728 | -0.6754 | -0.4374 | -0.7208 | -7.0930 |
5 Complex Time Series Examples
In this section we demonstrate the application of GVP in two realistic simulated examples. In both cases the assumed predictive model is high-dimensional and the exact Gibbs posterior, even if accessible in principle via MCMC, is challenging from a computational point of view. The mean-field class is adopted in both cases to produce variational approximations to the Gibbs posterior. The simulation design for each example (including the choice of ) mimics that for the toy example, apart from the obvious changes made to the true DGP and the assumed predictive model, some changes in the size of the estimation and evaluation periods, plus the absence of comparative exact results. For reasons of computational burden we remove the CRPS update from consideration in the first example.
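To fix ideas before the two examples, the following sketch (not the paper's implementation) carries out Gibbs variational prediction in miniature: a Gaussian predictive model, deliberately misspecified for Student-t data, is paired with the log score, and a mean-field Gaussian approximation to the resulting Gibbs posterior is fitted by maximizing a Monte-Carlo estimate of the Gibbs ELBO. The learning rate `w`, the prior, and the optimizer are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
y = rng.standard_t(df=5, size=200)   # "true" DGP: Student-t, so the Gaussian
                                     # predictive model below is misspecified
w = 1.0                              # Gibbs learning rate (assumed value)
eps = rng.standard_normal((50, 2))   # common random numbers for q-draws

def gibbs_energy(thetas):
    """w * (log-score criterion) + log prior, for each draw theta = (mu, log_sigma)."""
    mu, log_sig = thetas[:, :1], thetas[:, 1:]
    loglik = -0.5 * np.sum(
        np.log(2 * np.pi) + 2 * log_sig + ((y[None, :] - mu) / np.exp(log_sig)) ** 2,
        axis=1,
    )
    log_prior = -0.5 * np.sum(thetas ** 2, axis=1) / 100.0  # N(0, 10^2), up to a constant
    return w * loglik + log_prior

def neg_gibbs_elbo(lam):
    """Monte-Carlo estimate of the negative Gibbs ELBO for a mean-field Gaussian q."""
    m, log_s = lam[:2], lam[2:]
    thetas = m + np.exp(log_s) * eps  # reparameterised draws from q
    entropy = np.sum(log_s)           # Gaussian entropy, up to a constant
    return -(gibbs_energy(thetas).mean() + entropy)

res = minimize(neg_gibbs_elbo, np.zeros(4), method="Nelder-Mead",
               options={"maxiter": 2000, "xatol": 1e-6, "fatol": 1e-8})
m_opt = res.x[:2]   # variational means for (mu, log_sigma)
```

Swapping the log-likelihood term for any of the other scoring rules used below (censored log score, CRPS, interval score) changes the update, which is the sense in which the Gibbs posterior is 'focused' on a particular measure of predictive accuracy.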
5.1 Example 1: Autoregressive Mixture Predictive Model
In this example we adopt a true DGP in which evolves according to the logistic smooth transition autoregressive (LSTAR) process proposed in Teräsvirta (1994):
(14)
where , and denotes the standardised Student-t distribution with degrees of freedom. This model has the ability to produce data that exhibits a range of complex features. For example, it not only allows for skewness in the marginal density of , but can also produce temporal dependence structures that are asymmetric. We thus use this model as an illustration of a complex DGP whose characteristics are hard to replicate with simple parsimonious models. Hence the need to adopt a highly parameterized predictive model; plus the need to acknowledge that even that representation will be misspecified.
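A generic two-regime LSTAR(1) of this type can be simulated as below; the parameter values and the exact transition specification are illustrative stand-ins, since the paper's settings are reported separately in Section 5.3.

```python
import numpy as np

def simulate_lstar(T, phi1=0.9, phi2=-0.6, gamma=5.0, c=0.0, df=8, seed=0):
    """Simulate a two-regime LSTAR(1) with standardised Student-t errors.
    All parameter values are illustrative, not the paper's settings."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_t(df, size=T) * np.sqrt((df - 2) / df)  # unit-variance t errors
    y = np.zeros(T)
    for t in range(1, T):
        G = 1.0 / (1.0 + np.exp(-gamma * (y[t - 1] - c)))  # logistic transition weight
        y[t] = (phi1 + G * phi2) * y[t - 1] + eps[t]
    return y

y_sim = simulate_lstar(2000)
```

The asymmetry of the process comes from the transition weight `G`: the effective AR coefficient differs depending on whether the previous observation sits above or below the threshold `c`.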
The assumed predictive model is based on the flexible Bayesian non-parametric structure proposed in Antoniano-Villalobos and Walker (2016). The predictive distribution for , conditional on the observed , is constructed from a mixture of Gaussian autoregressive (AR) models of order one as follows:
(15)
with time-varying mixture weights
where and . Denoting , , and , then the full vector of unknown parameters is , which comprises elements. GVP is a natural and convenient alternative to exact Gibbs prediction in this case.
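The one-step-ahead predictive density of such a mixture can be evaluated along the following lines; note that the softmax-style weight construction used here is an illustrative simplification, and the paper's time-varying weights are constructed differently.

```python
import numpy as np
from scipy.stats import norm

def mixture_predictive_logpdf(y_next, y_prev, mus, phis, sigmas, alphas, betas):
    """Log one-step-ahead predictive density of a K-component mixture of
    Gaussian AR(1) models, with weights that depend on the previous
    observation through a softmax in (alphas + betas * y_prev).
    The weight construction is an illustrative simplification."""
    logits = alphas + betas * y_prev
    wts = np.exp(logits - logits.max())
    wts /= wts.sum()
    comp_logpdf = norm.logpdf(y_next, loc=mus + phis * y_prev, scale=sigmas)
    return np.log(np.sum(wts * np.exp(comp_logpdf)))

# two-component example with arbitrary parameter values
lp = mixture_predictive_logpdf(
    y_next=0.1, y_prev=0.5,
    mus=np.array([0.0, 0.2]), phis=np.array([0.8, -0.3]),
    sigmas=np.array([1.0, 0.5]),
    alphas=np.array([0.0, 0.0]), betas=np.array([1.0, -1.0]),
)
```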
5.2 Example 2: Bayesian Neural Network Predictive Model
In the second example we consider a true DGP in which the dependent variable has a complex non-linear relationship with a set of covariates. Specifically, the time series process , is determined by a three-dimensional stochastic process , with . The first two covariates are jointly distributed as . The third covariate , independent of the former two, is distributed according to an AR(4) process so that , with . The variable is then given by , where is a three dimensional vector of time-varying coefficients, with , and denotes the marginal distribution function of induced by the AR(4) model.
This particular choice of DGP has two advantages. First, given the complex dependence structure of the DGP (i.e. non-linear cross-sectional dependence as well as temporal dependence), it would be difficult, once again, to find a simple parsimonious predictive model that would adequately capture this structure; hence motivating the need to adopt a high-dimensional model and to resort to GVP. Second, because it includes several covariates, we can assess how GVP performs for varying information sets. For example, we can establish if the overall performance of GVP improves with an expansion of the conditioning information set, as we would anticipate; and if the inclusion of a more complete conditioning set affects the occurrence of strict coherence, or not.
A flexible model that has the potential to at least partially capture several features of the true DGP is a Bayesian feed-forward neural network that takes as the dependent variable and some of its lags, along with observed values of , as the vector of independent inputs, (see for instance Hernández-Lobato and Adams, 2015). The predictive distribution is defined as Gaussian, with a mean function given by a feed-forward neural network with layers, nodes in each layer, a -dimensional input vector , and a -dimensional parameter vector with . This allows for a flexible non-linear relationship between and the vector of observed covariates . Defining , the full parameter vector of this predictive class, , is of dimension .
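A minimal version of such a network mean function, with tanh activations and an explicit flat parameter vector, is sketched below; the packing scheme, layer sizes, and activation are our own illustrative choices, not the paper's specification.

```python
import numpy as np

def nn_mean(x, theta, layers=2, nodes=8):
    """Feed-forward network mean function m(x; theta) with tanh activations.
    The flat parameter packing (weights then biases, layer by layer) and the
    layer sizes are illustrative assumptions."""
    h, idx, d_in = x, 0, x.shape[0]
    for _ in range(layers):
        W = theta[idx: idx + nodes * d_in].reshape(nodes, d_in)
        idx += nodes * d_in
        b = theta[idx: idx + nodes]
        idx += nodes
        h = np.tanh(W @ h + b)
        d_in = nodes
    w_out = theta[idx: idx + nodes]
    b_out = theta[idx + nodes]
    return w_out @ h + b_out

# for a 3-dimensional input and 2 layers of 8 nodes:
# 8*3+8 + 8*8+8 + 8+1 = 113 parameters in total
theta = np.random.default_rng(3).standard_normal(113)
m_val = nn_mean(np.ones(3), theta)
```

The quadratic growth of the parameter count in the number of nodes is what makes the full parameter vector high-dimensional, and hence why a variational approximation to the Gibbs posterior is attractive here.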
5.3 Simulation Results
5.3.1 Example 1
A time series of length for is generated from the LSTAR model in (14), with the true parameters set as: , , , , and . With the exception of the CRPS, which cannot be evaluated in closed form for the mixture predictive class and is therefore excluded from the exercise, the same scores as in the toy example are considered. With reference to the simulation steps given earlier, the initial estimation window is observations; hence the predictive accuracy results are based on an evaluation sample of size
Table 2: Average out-of-sample score

| U.method | LS | CLS10 | CLS20 | CLS80 | CLS90 | IS |
|---|---|---|---|---|---|---|
| LS | -1.253 | -0.345 | -0.497 | -0.452 | -0.263 | -5.589 |
| CLS10 | -1.445 | -0.346 | -0.512 | -0.555 | -0.321 | -7.279 |
| CLS20 | -1.445 | -0.344 | -0.496 | -0.589 | -0.349 | -6.674 |
| CLS80 | -1.333 | -0.414 | -0.571 | -0.450 | -0.260 | -5.831 |
| CLS90 | -1.330 | -0.407 | -0.564 | -0.451 | -0.259 | -5.730 |
| IS | -1.410 | -0.401 | -0.558 | -0.474 | -0.282 | -5.550 |
The results in Table 2 are clear-cut. With one exception (that of CLS10), the best out-of-sample performance, according to a given measure, is produced by the version of GVP based on that same scoring rule. That is, the GVP results are almost uniformly strictly coherent: a matching of the update rule with the out-of-sample measure produces the best predictive accuracy in that measure, almost always.
5.3.2 Example 2
In this case, we generate a time series of length from the model discussed in Section 5.2, with specifications: , , , , , , , , , , , , and . These settings generate data with a negatively skewed empirical distribution, a non-linear relationship between the observations on and , and autoregressive behavior in .
To assess if the performance of GVP is affected by varying information sets, we consider four alternative specifications for the input vector in the assumed predictive model. These four specifications (labelled as Model 1, Model 2, Model 3 and Model 4, respectively) are: , , and . The dimension of the parameter vector for each of these model specifications is , , and , respectively. Given that the assumed predictive class is Gaussian, all scoring rules used in the seven updates, and for all four model specifications, can be evaluated in closed form. Referencing the simulation steps given earlier, the initial estimation window is observations; hence the predictive accuracy results are based on an evaluation sample of size as in the previous example.
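As an example of such a closed form, the CRPS of a Gaussian predictive is available analytically (a standard result in the scoring-rule literature); negating it gives a positively oriented score of the kind used throughout the paper.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS of a N(mu, sigma^2) predictive at realisation y.
    This is the conventional negatively oriented CRPS; negate it to obtain
    a positively oriented score."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
```

Averaging `-crps_gaussian(y_t, mu_t, sigma_t)` over an evaluation sample gives the kind of average out-of-sample CRPS entry reported in the tables.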
Table 3: Average out-of-sample score

Panel A: Model 1

| U.method | LS | CLS10 | CLS20 | CLS80 | CLS90 | CRPS | IS |
|---|---|---|---|---|---|---|---|
| LS | -1.607 | -0.489 | -0.761 | -0.520 | -0.310 | -0.659 | -6.579 |
| CLS10 | -1.914 | -0.460 | -0.738 | -0.852 | -0.617 | -0.914 | -8.174 |
| CLS20 | -1.766 | -0.460 | -0.732 | -0.704 | -0.478 | -0.772 | -7.448 |
| CLS80 | -1.637 | -0.514 | -0.795 | -0.514 | -0.303 | -0.678 | -6.822 |
| CLS90 | -1.686 | -0.534 | -0.831 | -0.521 | -0.307 | -0.718 | -6.912 |
| CRPS | -1.620 | -0.513 | -0.783 | -0.514 | -0.304 | -0.660 | -6.752 |
| IS | -1.601 | -0.476 | -0.753 | -0.516 | -0.307 | -0.660 | -6.338 |

Panel B: Model 2

| U.method | LS | CLS10 | CLS20 | CLS80 | CLS90 | CRPS | IS |
|---|---|---|---|---|---|---|---|
| LS | -1.451 | -0.494 | -0.733 | -0.424 | -0.254 | -0.550 | -6.059 |
| CLS10 | -1.858 | -0.435 | -0.686 | -0.849 | -0.626 | -0.866 | -7.707 |
| CLS20 | -1.683 | -0.438 | -0.679 | -0.686 | -0.476 | -0.708 | -6.963 |
| CLS80 | -1.551 | -0.595 | -0.850 | -0.403 | -0.241 | -0.587 | -6.878 |
| CLS90 | -1.539 | -0.540 | -0.811 | -0.406 | -0.241 | -0.605 | -6.369 |
| CRPS | -1.510 | -0.585 | -0.822 | -0.411 | -0.244 | -0.553 | -6.649 |
| IS | -1.452 | -0.473 | -0.728 | -0.414 | -0.248 | 0.562 | -5.671 |

Panel C: Model 3

| U.method | LS | CLS10 | CLS20 | CLS80 | CLS90 | CRPS | IS |
|---|---|---|---|---|---|---|---|
| LS | -1.536 | -0.428 | -0.688 | -0.521 | -0.314 | -0.617 | -5.975 |
| CLS10 | -1.775 | -0.396 | -0.658 | -0.793 | -0.555 | -0.805 | -6.704 |
| CLS20 | -1.646 | -0.397 | -0.654 | -0.664 | -0.439 | -0.694 | -6.277 |
| CLS80 | -1.698 | -0.566 | -0.858 | -0.520 | -0.301 | -0.714 | -7.282 |
| CLS90 | -1.744 | -0.581 | -0.889 | -0.530 | -0.304 | -0.758 | -7.361 |
| CRPS | -1.535 | -0.438 | -0.695 | -0.517 | -0.310 | -0.615 | -6.044 |
| IS | -1.549 | -0.438 | -0.703 | -0.517 | -0.310 | -0.624 | -6.054 |

Panel D: Model 4

| U.method | LS | CLS10 | CLS20 | CLS80 | CLS90 | CRPS | IS |
|---|---|---|---|---|---|---|---|
| LS | -1.390 | -0.483 | -0.716 | -0.389 | -0.229 | -0.511 | -5.890 |
| CLS10 | -1.758 | -0.394 | -0.650 | -0.784 | -0.548 | -0.790 | -6.659 |
| CLS20 | -1.656 | -0.394 | -0.621 | -0.732 | -0.515 | -0.712 | -6.236 |
| CLS80 | -2.127 | -1.049 | -1.438 | -0.356 | -0.196 | -0.736 | -9.798 |
| CLS90 | -2.157 | -0.989 | -1.401 | -0.379 | -0.192 | -0.782 | -9.545 |
| CRPS | -1.519 | -0.626 | -0.872 | -0.380 | -0.218 | -0.537 | -6.922 |
| IS | -1.673 | -0.592 | -0.934 | -0.381 | -0.203 | -0.710 | -5.493 |
With reference to the results in Table 3 we can make two observations. First, as anticipated, an increase in the information set produces higher average scores out-of-sample for all updates, with the corresponding values increasing steadily and uniformly as one moves from the results in Panel A (based on the smallest information set) to those in Panel D (based on the largest information set). Second, however, despite the improved performance of all versions of GVP as the information set is increased to better match that used in the true DGP, strict coherence still prevails. That is, ‘focusing’ on the measure that matters in the construction of the GVP update still reaps benefits, despite the reduction in misspecification.
6 Empirical Application: M4 Competition
The Makridakis 4 (M4) forecasting competition was a forecasting event first organised by the University of Nicosia and the New York University Tandon School of Engineering in 2018. The competition sought submissions of point and interval predictions at different time horizons, for a total of 100,000 time series of varying frequencies. The winner of the competition in a particular category (i.e. point prediction or interval prediction) was the submission that achieved the best average out-of-sample predictive accuracy, according to the measure of accuracy that defined that category, over all horizons and all series. (Details of all aspects of the competition can be found via the following link: https://www.m4.unic.ac.cy/wp-content/uploads/2018/03/M4-Competitors-Guide.pdf)
We gauge the success of our method of distributional prediction in terms of the measure used to rank the interval forecasts in the competition, namely the IS. We focus on one-step-ahead prediction of each of the 4227 time series of daily frequency. Each of these series is denoted by , where and , and the task is to construct a predictive interval for based on GVP. We adopt the mixture distribution in (15) as the predictive model, suitable as it is for capturing the stylized features of high-frequency data. However, the model is only appropriate for stationary time series, and most of the daily time series exhibit non-stationary patterns. To account for this, we model the differenced series , where indicates the differencing order. The integer is selected by sequentially applying the KPSS test (Kwiatkowski et al., 1992) to for , with being the first difference at which the null hypothesis of no unit root is not rejected (to do this we employ the auto.arima function in the forecast R package). To construct the predictive distribution of we first obtain draws from the Gibbs variational posterior that is based on the IS (once again, a value of is used in defining the Gibbs posterior and, hence, in producing the posterior approximation). Draws are then obtained from the corresponding predictive distributions . The draws of are then transformed into draws of , and a predictive distribution for is produced using kernel density estimation; assessment is performed in terms of the accuracy of the prediction interval for . To account for different units of measurement, the IS of each series is then divided by the constant .
Table 4 documents the predictive performance of our approach relative to the competing methods (see Makridakis et al., 2020, for details on these methods). The first column corresponds to the average IS. In terms of this measure the winner is the method proposed by Smyl (2020), while our method is ranked 11th out of 12. The ranking is 6th out of 12 in terms of the median IS, recorded in column two. However, it is worth remembering that our method aims to achieve high individual, not aggregate (or average), predictive interval accuracy across series. A more appropriate way of judging the effectiveness of our approach is thus to count the number of series for which each method performs best. This number is reported in column three of the table. In terms of this measure, with a total number of 858 series, our approach is outperformed only by the method proposed in Doornik et al. (2020). In contrast, although Smyl is the best method in terms of mean IS, it is best at predicting only 130 of the series in the dataset. It is also important to note that most of the competing approaches have more complex assumed predictive classes than our mixture model. Even the simpler predictive classes, such as the ARIMA and ETS models, involve model selection steps that allow them to capture richer Markovian processes than the mixture model, which, by construction, is based on an autoregressive process of order one.
In summary, despite a single specification being defined for the predictive class for all 4227 daily series, the process of driving the Bayesian update via the IS rule has still yielded the best predictive results in a very high number of cases, with GVP beaten in this sense by only one other competitor. Whilst the predictive model clearly matters, designing a bespoke updating mechanism to produce the predictive distribution is shown to still reap substantial benefits.
Table 4: Out-of-sample results

| Method | Mean IS | Median IS | No. series best |
|---|---|---|---|
| Doornik et al. | -9.36 | -4.58 | 2060 |
| Mixture | -16.66 | -5.85 | 858 |
| Trotta | -11.41 | -7.19 | 376 |
| ARIMA | -10.07 | -5.67 | 278 |
| Fiorucci and Louzada | -9.20 | -5.73 | 163 |
| Smyl | -8.73 | -6.65 | 139 |
| Roubinchtein | -13.08 | -7.35 | 97 |
| Ibrahim | -38.81 | -6.33 | 88 |
| ETS | -9.13 | -5.71 | 67 |
| Petropoulos and Svetunkov | -9.77 | -5.68 | 59 |
| Montero-Manso et al. | -10.24 | -6.90 | 23 |
| Segura-Heras et al. | -15.29 | -8.48 | 19 |
7 Discussion
We have developed a new approach for conducting loss-based Bayesian prediction in high-dimensional models. Based on a variational approximation to the Gibbs posterior defined, in turn, by the predictive loss that is germane to the problem at hand, the method is shown to produce predictions that minimize that loss out-of-sample. ‘Loss’ is characterized in this paper by positively-oriented proper scoring rules designed to reward the accuracy of predictive probability density functions for a continuous random variable. Hence, loss minimization translates to maximization of an expected score. However, in principle, any loss function in which predictive accuracy plays a role could be used to define the Gibbs posterior. Critically, we have proven theoretically, and illustrated numerically, that for a large enough sample there is no loss incurred in predictive accuracy as a result of approximating the Gibbs posterior.
In comparison with the standard approach based on a likelihood-based Bayesian posterior, our Gibbs variational predictive approach is ultimately aimed at generating accurate predictions in the realistic empirical setting where the predictive model and, hence, the likelihood function is misspecified. Gibbs variational prediction enables the investigator to break free from the shackles of likelihood-based prediction, and to drive predictive outcomes according to the form of predictive accuracy that matters for the problem at hand; and all with theoretical validity guaranteed.
We have focused in the paper on particular examples where the model used to construct the Gibbs variational predictive is observation-driven. Extensions to parameter-driven models (i.e., hidden Markov models, or state space models) require different approaches with regard to both the implementation of variational Bayes, and in establishing the asymptotic properties of the resulting posteriors and predictives. This is currently the subject of on-going work by the authors, Frazier et al. (2021), and we reference Tran et al. (2017) and Quiroz et al. (2018) for additional discussion of this problem.
This paper develops the theory of Gibbs variational prediction for classes of models that are general and flexible enough to accommodate a wide range of data generating processes, and which, under the chosen loss function, are smooth enough to permit a quadratic expansion. This latter condition restricts the classes of models, and loss functions, under which our results are applicable. For example, the discrete class of models studied in Douc et al. (2013) may not be smooth enough to deduce the validity of such an expansion.
While our approach to prediction focuses on accuracy in a given forecasting rule, which is related to the notion of sharpness in the probabilistic forecasting literature (see Gneiting et al., 2007), it does not pay attention to the calibration of such predictive densities. That is, there is no guarantee that the resulting predictive densities have valid frequentist coverage properties. If calibration is a desired property, it is possible to augment our approach to prediction with the recently proposed approach of Wu and Martin (2021), which calibrates generalized predictives. The amalgamation of these two procedures should produce accurate predictions that are well-calibrated in large samples. We leave an exploration of this possibility to future research.
Acknowledgments
We would like to thank various participants at: the ‘ABC in Svalbard’ Workshop, April 2021, the International Symposium of Forecasting, the Alan Turing Institute, and RIKEN reading group, June 2021, the European Symposium of Bayesian Econometrics, September 2021, and the Vienna University of Business, December 2021, for helpful comments on earlier drafts of the paper. This research has been supported by Australian Research Council (ARC) Discovery Grants DP170100729 and DP200101414. Frazier was also supported by ARC Early Career Researcher Award DE200101070.
A Theoretical Results
In this section, we state several intermediate results and prove Theorems 1 and 2 stated in Section 3.2 of the main paper.
We maintain the following notation throughout the remainder of the paper. We let denote a generic metric on , and we take and to be the Euclidean norm and inner product for vectors. In the following section, we let denote a generic constant independent of that can vary from line to line. For sequences and some we write when . If and we write . For some positive sequence , the notation and have their usual connotation. Unless otherwise stated, all limits are taken as , so that when no confusion will result we use to denote . The random variable takes values in for all , and for a given , the vector of observations are generated from the true probability measure , not necessarily in , and whose dependence on we suppress to avoid confusion with the predictive model .
A.1 Assumptions and Discussion
For some positive sequence , define as an -dependent neighbourhood of and let denote its complement.
Assumption 1
(i) There exists a non-random function such that in probability. (ii) For any , there exists some such that .
Remark 3
Assumption 1 places regularity conditions on the sample and limit counterpart of the expected scoring rules, , and is used to deduce the asymptotic behavior of the Gibbs posterior. The first part of the assumption is a standard uniform law of large numbers, and the second part amounts to an identification condition for .
We note here that the temporal dependence of the observations, and the use of potentially heavy-tailed data means that existing results on the concentration of Gibbs posteriors, and/or their variational approximation, of which we are aware may not be applicable in our setting; see e.g., Zhang and Gao (2020), Alquier and Ridgway (2020) and Yang et al. (2020).
Instead, we use smoothness of (and the underlying model ) to deduce a posterior concentration result.
Assumption 2
There exists a sequence of -dimensional matrices , and a -random vector , such that:
(i)
(ii) For any , any , with ,
(iii) .
We require the following tail control condition on the prior.
Assumption 3
For any , .
Remark 4
The above assumptions ultimately preclude cases where can be partitioned into global parameters that drive the model, which are fixed for all , and local parameters that represent time-dependent latent variables in the model and grow in a one-for-one manner with the sample size. As such, these assumptions are not designed to be applicable to variational inference in classes such as hidden Markov models.
To validate the usefulness and broad applicability of Assumptions 1-2, we prove that these assumptions are satisfied for the GARCH(1,1) model used for the illustrative exercise in Section 4 in the case of the logarithmic scoring rule, , under weak moment assumptions on the observed data.
Lemma 1
Recall the GARCH(1,1) model presented in Section 4, and assume, WLOG, that . Define the rescaled variable , where denotes the GARCH process evaluated at . Assume that satisfies the following conditions:
1. is strictly stationary and ergodic.
2. is a random variable.
3. For some , there exists an such that .
4. The parameter space is compact, with the elements bounded away from zero and the elements bounded above by one, and where . The pseudo-true value is in the interior of .
A.2 Preliminary Result: Posterior Concentration
Before proving the results stated in Section 3.2, we first give two results regarding the posterior concentration of the exact Gibbs posterior and its variational approximation.
Lemma 2
Remark 5
The proof of Lemma 2 requires controlling the behavior of for any . When the parameter space is compact, this control can be guaranteed by controlling the bracketing entropy (see, e.g., van der Vaart and Wellner, 1996) of . If the parameter space is not compact, this control can be achieved by considering more stringent conditions on the prior than Assumption 3. In particular, results where the parameter space is not compact can be obtained by modifying Theorem 4 in Shen and Wasserman (2001) for our setting. That being said, we note that in the case of the toy GARCH(1,1) model in Section 4, the parameter space is compact and this condition is satisfied under the conditions given in Lemma 1.
To transfer the posterior concentration of the Gibbs posterior to its variational approximation, we restrict the class used to produce the variational approximation. Following Zhang and Gao (2020), we define to be the approximation error rate for the variational family:
Lemma 3
Remark 6
While several authors, such as Alquier et al. (2016), Alquier and Ridgway (2020), and Yang et al. (2020), have demonstrated results similar to Lemma 3, existing results generally hinge on the satisfaction of a sub-Gaussian concentration inequality for the resulting risk function, i.e., , used to produce the Gibbs posterior, such as a Hoeffding or Bernstein-type inequality. Given that our goal is prediction in possibly non-Markovian, non-stationary, and/or complex models, and with a loss function based specifically on a general scoring rule, it is not clear that such results remain applicable. Therefore, we deviate from the earlier referenced approaches and instead draw inspiration from Miller (2021), using an approach based on a quadratic expansion of the loss function. The latter conditions can be shown to be valid in a wide variety of models used to produce predictions in commonly encountered time series settings, including models with non-Markovian features. Indeed, as Lemma 1 demonstrates, Assumptions 1 and 2 are satisfied in cases where the scoring rule only exhibits a polynomial tail, e.g., in the case of the GARCH(1,1) model, and with data that are non-Markovian.
Remark 7
Lemma 3 gives a similar result to Theorem 2.1 in Zhang and Gao (2020) but for the Gibbs variational posterior and demonstrates that the latter concentrates onto , with the rate of concentration given by the slower of and . We note that, in cases where the dimension of grows with the sample size, the rate is ultimately affected by the rate of growth of and care must be taken to account for this dependence.
Remark 8
Verification of Lemma 3 in any practical example requires choosing the variational family . In the case of observation-driven time series models, where the predictive model can be constructed analytically, and no explicit treatment of ‘local’ parameters is needed, the variational family can be taken to be a sufficiently rich parametric class. For example, consider the case where the dimension of is fixed and compact, and assume that and satisfy the Assumptions of Lemma 2. Then, we can consider as our variational family the mean-field class:
or the Gaussian family
where , and is a positive-definite matrix. In either case, the approximation results of Zhang and Gao (2020) and Yang et al. (2020) for these variational families can be used to obtain the rate . In particular, when is fixed these results demonstrate that , and the Gibbs variational posterior converges at the same rate as the Gibbs posterior. (The rate of concentration obtained by applying Lemma 2 in such an example will be , which is slightly slower than the optimal parametric rate . The parametric rate can be achieved by suitably modifying the prior condition in Assumption 3 to cater for the parametric nature of the model. However, given that this is not germane to the main point of the paper, we do not analyze such situations here.)
As seen from Lemma 3, the variational posterior places increasing mass on the value that maximizes the limit of the expected scoring rule, . Lemma 3 does not rely on the existence of exponential testing sequences as in much of the Bayesian nonparametrics literature (see, e.g., the treatment in Ghosal and Van der Vaart, 2017). The need to consider an alternative approach is due to the loss-based, i.e., non-likelihood-based, nature of the Gibbs posterior. Instead, the arguments used to demonstrate posterior concentration rely on controlling the behavior of a suitable quadratic approximation to the criterion function in a neighborhood of .
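In generic notation (the symbols below are our own shorthand and may differ in detail from the paper's), the two objects at play are the Gibbs posterior, built from the cumulative sample score with learning rate w, and its best approximation within the variational family:

```latex
% Gibbs posterior built from the cumulative sample score S_n(\theta),
% with learning rate w > 0 and prior \pi(\theta):
\pi_w(\theta \mid y_{1:n}) \propto \exp\{\, w\, S_n(\theta) \,\}\, \pi(\theta),
\qquad
S_n(\theta) = \sum_{t=1}^{n-1} S\big( p(\cdot \mid y_{1:t}, \theta),\, y_{t+1} \big).

% Gibbs variational posterior: the KL-closest member of the family Q:
\hat{q}(\theta)
  = \operatorname*{arg\,min}_{q \in \mathcal{Q}}
    \mathrm{KL}\big( q(\theta) \,\|\, \pi_w(\theta \mid y_{1:n}) \big).
```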
The result of Lemma 3 allows us to demonstrate that the predictions made using the GVP, , are just as accurate as those obtained by an agent that uses the ‘optimal [frequentist] predictive’ defined in equation (8); this is the key message of Theorem 2, which demonstrates that the GVP yields predictions that are just as accurate (in large samples) as those from the exact Gibbs predictive.
A.3 Proofs of Lemmas
Assumption 1. We remark that Assumptions 1-4 in Lemma 1 are sufficient for conditions (A.1) and (A.2) of Lee and Hansen (1994). To establish Assumption 1(i) we note that Lemma 7(2) in Lee and Hansen (1994) implies that for each . Furthermore, since the expected value of is bounded for each , for each , by Lemma 8(2) in Lee and Hansen (1994), it follows that uniformly over . Assumption 1(ii) can then be established by using the fact that is twice continuously differentiable for near , with negative-definite second-derivative matrix , by Lemma 11(3) of Lee and Hansen (1994), which yields
for some . Hence, Assumption 1(ii) follows.
Assumption 2. The result again follows by using the results of Lee and Hansen (1994). In particular, in this case , is twice continuously differentiable so that a direct quadratic approximation is valid at . Then, taking and yields the expansion, where the remainder term is given by
and an intermediate value such that .
To verify (ii), we note that exists and is continuous by Lemma 11(3) of Lee and Hansen (1994), so that the first term is on . Further, Lemma 11(3) of Lee and Hansen (1994) demonstrates that
so that the second term is also on .
Lastly, (iii), is verified since , and since, by Lemma 9(2) of Lee and Hansen (1994), .
Proof of Lemma 2. The posterior density is where we recall that, by hypothesis, with probability one. Define the quasi-likelihood ratio
where, by Assumption 2 (iii), . For , the posterior probability over is
where
The result now follows by lower bounding and upper bounding .
Part 1: Term. Define , and . We show that, for , and , wpc1
Proof of Part 1: To lower bound we follow an approach similar to that of Lemma 1 in Shen and Wasserman (2001) (see also Syring and Martin, 2020). Over , use Assumption 2(i) to rewrite as
(16) |
Define the following sets: , , and . On the set , is bounded in probability, and we can bound as follows:
(17) |
From the bound in (17), the remainder follows almost identically to Lemma 1 in Shen and Wasserman (2001). In particular,
where the last line follows from Markov’s inequality and the definition of . Lastly, consider the probability of the set
Hence, and for , we have that
(18) |
except on sets of -probability converging to zero.
Then, for as in that result, for a sequence with , by Assumption 3 (wpc1)
(19) |
Part 2: Term. We show that for large enough.
Proof of Part 2: To bound from above, use Assumption 1 to deduce that, for any positive ,
(20) |
For some , we decompose :
where is a compact set of by construction and is its complement.
Then, from (20) due to the compactness of . For , for a sequence with ,
(21) |
The first inequality comes from Markov's inequality and the definition of , and the second and third inequalities follow from Assumptions 2 and 3, respectively.
Using (19), we have the posterior bound
Combining , from (20), and equation (21), we can deduce that (wpc1)
since . The right-hand-side vanishes for large enough, and the result follows.
Proof of Lemma 3. The result begins with a similar approach to Corollary 2.1 of Zhang and Gao (2020), but requires particular deviations given the non-likelihood-based nature of our problem. We also deviate from Zhang and Gao (2020) and use to denote expectations of taken under , rather than the notation .
First, we note that Lemma B.2 in the supplement to Zhang and Gao (2020) (reproduced for convenience as Lemma 4 in Appendix A.5) can be directly applied in this setting: for any and , given observations ,
Similarly, for any , by construction, so that
(22) |
Taking expectations on both sides of (22), and re-arranging terms, yields
(23) |
where the second inequality follows from the definition of , and Jensen’s inequality.
The second term in equation (23) is bounded by applying Lemma 2. In particular, for all and any , , by Lemma 2 (wpc1),
Then, appealing to Lemma B.4 in the supplement to Zhang and Gao (2020) (reproduced for convenience as Lemma 5 in Appendix A.5) we obtain
for all . Taking and , and applying the above in equation (23), yields, for some ,
The stated result then follows from Markov’s inequality,
A.4 Proofs of Theorems in Section 3.2
Using Lemmas 2 and 3, we can now prove Theorems 1 and 2 stated in the main text. Since a simple proof of Theorem 1 follows once Theorem 2 is established, we prove Theorem 2 first.
Proof of Theorem 2. For probability measures and , let denote the Hellinger distance between and . Consider a positive sequence , and define the set . By convexity of in the first argument and Jensen’s inequality,
By Lemma 3, for any such that , , so that (wpc1). Using the above and the definition we can conclude that, with probability converging to one,
This yields the first stated result. Applying the relationship between total variation and Hellinger distance then yields the result:
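For completeness, the standard definitions and inequalities invoked in this step (see, e.g., Ghosal and Van der Vaart, 2017) are, for probability measures $P$ and $Q$ with densities $p$ and $q$ relative to a dominating measure $\mu$ (note that conventions differ across references by constant factors):

```latex
d_H^2(P,Q) = \frac{1}{2}\int \big(\sqrt{p} - \sqrt{q}\,\big)^2 \, d\mu,
\qquad
d_H^2(P,Q) \;\le\; d_{TV}(P,Q) \;\le\; \sqrt{2}\, d_H(P,Q).
```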
Proof of Theorem 1. A similar argument to the proof of Theorem 2 yields
where the last line follows from the convergence in Lemma 2 and holds wpc1. Apply the triangle inequality, and the relationship between the Hellinger distance and its square, to see that (wpc1)
The above holds for any with , and we can conclude that .
A.5 Additional Lemmas
The following lemmas from Zhang and Gao (2020) are used in the proof of Lemma 3, and are reproduced here, without proof, to aid the reader. We again use to denote expectations taken under .
Lemma 4
For in Definition 1, we have
Lemma 5
Suppose the random variable satisfies
for all . Then, for any ,
References
- Alquier (2021) Pierre Alquier. User-friendly introduction to PAC-Bayes bounds. arXiv preprint arXiv:2110.11216, 2021.
- Alquier and Ridgway (2020) Pierre Alquier and James Ridgway. Concentration of tempered posteriors and of their variational approximations. Annals of Statistics, 48(3):1475–1497, 2020.
- Alquier et al. (2016) Pierre Alquier, James Ridgway, and Nicolas Chopin. On the properties of variational approximations of Gibbs posteriors. The Journal of Machine Learning Research, 17(1):8374–8414, 2016.
- Antoniano-Villalobos and Walker (2016) Isadora Antoniano-Villalobos and Stephen G Walker. A nonparametric model for stationary time series. Journal of Time Series Analysis, 37(1):126–142, 2016.
- Bassetti et al. (2018) Federico Bassetti, Roberto Casarin, and Francesco Ravazzolo. Bayesian nonparametric calibration and combination of predictive distributions. Journal of the American Statistical Association, 113(522):675–685, 2018.
- Baştürk et al. (2019) N Baştürk, A. Borowska, S. Grassi, L. Hoogerheide, and H.K. van Dijk. Forecast density combinations of dynamic models and data driven portfolio strategies. Journal of Econometrics, 210(1):170–186, 2019.
- Billio et al. (2013) M Billio, R Casarin, F Ravazzolo, and H.K. van Dijk. Time-varying combinations of predictive densities using nonlinear filtering. Journal of Econometrics, 177(2):213–232, 2013.
- Bissiri et al. (2016) Pier Giovanni Bissiri, Chris C Holmes, and Stephen G Walker. A general framework for updating belief distributions. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 78(5):1103–1130, 2016.
- Blackwell and Dubins (1962) David Blackwell and Lester Dubins. Merging of opinions with increasing information. The Annals of Mathematical Statistics, 33(3):882–886, 1962.
- Blei et al. (2017) David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
- Casarin et al. (2015) Roberto Casarin, Fabrizio Leisen, German Molina, and Enrique ter Horst. A Bayesian beta Markov random field calibration of the term structure of implied risk neutral densities. Bayesian Analysis, 10(4):791–819, 2015.
- Chernozhukov and Hong (2003) Victor Chernozhukov and Han Hong. An MCMC approach to classical estimation. Journal of Econometrics, 115(2):293–346, 2003.
- Diks et al. (2011) Cees Diks, Valentyn Panchenko, and Dick Van Dijk. Likelihood-based scoring rules for comparing density forecasts in tails. Journal of Econometrics, 163(2):215–230, 2011.
- Doornik et al. (2020) Jurgen A Doornik, Jennifer L Castle, and David F Hendry. Card forecasts for M4. International Journal of Forecasting, 36(1):129–134, 2020.
- Douc et al. (2013) Randal Douc, Paul Doukhan, and Eric Moulines. Ergodicity of observation-driven time series models and consistency of the maximum likelihood estimator. Stochastic Processes and their Applications, 123(7):2620–2647, 2013.
- Fiorucci and Louzada (2020) Jose Augusto Fiorucci and Francisco Louzada. Groec: Combination method via generalized rolling origin evaluation. International Journal of Forecasting, 36(1):105–109, 2020.
- Frazier et al. (2019) David T Frazier, Worapree Maneesoonthorn, Gael M Martin, and Brendan PM McCabe. Approximate Bayesian forecasting. International Journal of Forecasting, 35(2):521–539, 2019.
- Frazier et al. (2021) David T Frazier, Ruben Loaiza-Maya, and Gael M Martin. A note on the accuracy of variational Bayes in state space models: Inference and prediction. arXiv preprint arXiv:2106.12262, 2021.
- Ghosal and Van der Vaart (2017) Subhashis Ghosal and Aad Van der Vaart. Fundamentals of Nonparametric Bayesian Inference, volume 44. Cambridge University Press, 2017.
- Giummolè et al. (2017) Federica Giummolè, Valentina Mameli, Erlis Ruli, and Laura Ventura. Objective Bayesian inference with proper scoring rules. TEST, pages 1–28, 2017.
- Gneiting and Raftery (2007) Tilmann Gneiting and Adrian E Raftery. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association, 102(477):359–378, 2007.
- Gneiting et al. (2007) Tilmann Gneiting, Fadoua Balabdaoui, and Adrian E Raftery. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(2):243–268, 2007.
- Hernández-Lobato and Adams (2015) José Miguel Hernández-Lobato and Ryan Adams. Probabilistic backpropagation for scalable learning of Bayesian neural networks. In International Conference on Machine Learning, pages 1861–1869. PMLR, 2015.
- Holmes and Walker (2017) CC Holmes and SG Walker. Assigning a value to a power likelihood in a general Bayesian model. Biometrika, 104(2):497–503, 2017.
- Jiang and Tanner (2008) Wenxin Jiang and Martin A. Tanner. Gibbs posterior for variable selection in high-dimensional classification and data-mining. Annals of Statistics, 36(5):2207–2231, 2008.
- Kingma and Welling (2019) Diederik P Kingma and Max Welling. An introduction to variational autoencoders. arXiv preprint arXiv:1906.02691, 2019.
- Knoblauch et al. (2019) Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Generalized variational inference: Three arguments for deriving new posteriors. arXiv preprint arXiv:1904.02063, 2019.
- Kwiatkowski et al. (1992) Denis Kwiatkowski, Peter C.B. Phillips, Peter Schmidt, and Yongcheol Shin. Testing the null hypothesis of stationarity against the alternative of a unit root: How sure are we that economic time series have a unit root? Journal of Econometrics, 54(1):159–178, 1992.
- Lee and Hansen (1994) Sang-Won Lee and Bruce E Hansen. Asymptotic theory for the GARCH(1, 1) quasi-maximum likelihood estimator. Econometric Theory, 10(1):29–52, 1994.
- Loaiza-Maya et al. (2020) Ruben Loaiza-Maya, Gael M Martin, and David T Frazier. Focused Bayesian prediction. Journal of Applied Econometrics, forthcoming, 2020.
- Lyddon et al. (2019) SP Lyddon, CC Holmes, and SG Walker. General Bayesian updating and the loss-likelihood bootstrap. Biometrika, 106(2):465–478, 2019.
- Makridakis et al. (2020) Spyros Makridakis, Evangelos Spiliotis, and Vassilios Assimakopoulos. The M4 competition: 100,000 time series and 61 forecasting methods. International Journal of Forecasting, 36(1):54–74, 2020.
- Martin et al. (2022) Gael M Martin, Rubén Loaiza-Maya, Worapree Maneesoonthorn, David T Frazier, and Andrés Ramírez-Hassan. Optimal probabilistic forecasts: When do they work? International Journal of Forecasting, 38(1):384–406, 2022.
- McAlinn and West (2019) Kenichiro McAlinn and Mike West. Dynamic Bayesian predictive synthesis in time series forecasting. Journal of Econometrics, 210(1):155–169, 2019.
- McAlinn et al. (2020) Kenichiro McAlinn, Knut Are Aastveit, Jouchi Nakajima, and Mike West. Multivariate Bayesian predictive synthesis in macroeconomic forecasting. Journal of the American Statistical Association, 115(531):1092–1110, 2020.
- Miller (2021) Jeffrey W Miller. Asymptotic normality, concentration, and coverage of generalized posteriors. Journal of Machine Learning Research, 22(168):1–53, 2021.
- Miller and Dunson (2019) Jeffrey W Miller and David B Dunson. Robust Bayesian inference via coarsening. Journal of the American Statistical Association, 114(527):1113–1125, 2019.
- Montero-Manso et al. (2020) Pablo Montero-Manso, George Athanasopoulos, Rob J Hyndman, and Thiyanga S Talagala. FFORMA: Feature-based forecast model averaging. International Journal of Forecasting, 36(1):86–92, 2020.
- Pacchiardi and Dutta (2021) Lorenzo Pacchiardi and Ritabrata Dutta. Generalized Bayesian likelihood-free inference using scoring rules estimators. arXiv preprint arXiv:2104.03889, 2021.
- Petropoulos and Svetunkov (2020) Fotios Petropoulos and Ivan Svetunkov. A simple combination of univariate models. International Journal of Forecasting, 36(1):110–115, 2020.
- Pettenuzzo and Ravazzolo (2016) Davide Pettenuzzo and Francesco Ravazzolo. Optimal portfolio choice under decision-based model combinations. Journal of Applied Econometrics, 31(7):1312–1332, 2016.
- Quiroz et al. (2018) Matias Quiroz, David J Nott, and Robert Kohn. Gaussian variational approximation for high-dimensional state space models. arXiv preprint arXiv:1801.07873, 2018.
- Shen and Wasserman (2001) Xiaotong Shen and Larry Wasserman. Rates of convergence of posterior distributions. The Annals of Statistics, 29(3):687–714, 2001.
- Smyl (2020) Slawek Smyl. A hybrid method of exponential smoothing and recurrent neural networks for time series forecasting. International Journal of Forecasting, 36(1):75–85, 2020.
- Syring and Martin (2019) Nicholas Syring and Ryan Martin. Calibrating general posterior credible regions. Biometrika, 106(2):479–486, 2019.
- Syring and Martin (2020) Nicholas Syring and Ryan Martin. Gibbs posterior concentration rates under sub-exponential type losses. arXiv preprint arXiv:2012.04505, 2020.
- Teräsvirta (1994) Timo Teräsvirta. Specification, estimation, and evaluation of smooth transition autoregressive models. Journal of the American Statistical Association, 89(425):208–218, 1994.
- Tran et al. (2017) Minh-Ngoc Tran, David J Nott, and Robert Kohn. Variational Bayes with intractable likelihood. Journal of Computational and Graphical Statistics, 26(4):873–882, 2017.
- van der Vaart and Wellner (1996) AW van der Vaart and Jon Wellner. Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics, 1996.
- Wu and Martin (2021) Pei-Shien Wu and Ryan Martin. Calibrating generalized predictive distributions. arXiv preprint arXiv:2107.01688, 2021.
- Yang et al. (2020) Yun Yang, Debdeep Pati, Anirban Bhattacharya, et al. Alpha-variational inference with statistical guarantees. Annals of Statistics, 48(2):886–905, 2020.
- Zhang and Gao (2020) Fengshuo Zhang and Chao Gao. Convergence rates of variational posterior distributions. Annals of Statistics, 48(4):2180–2207, 2020.
- Zhang (2006a) Tong Zhang. From ε-entropy to KL-entropy: Analysis of minimum information complexity density estimation. Annals of Statistics, 34:2180–2210, 2006a.
- Zhang (2006b) Tong Zhang. Information-theoretic upper and lower bounds for statistical estimation. IEEE Transactions on Information Theory, 52(4):1307–1321, 2006b.
Supplementary material for “Loss-Based Variational Bayes Prediction”
S1 Details for the Implementation of all Variational Approximations
S1.1 General Notational Matters
Throughout this appendix we will employ the following notation. All gradients are column vectors and their notation starts with the symbol . For two generic matrices and we have that
where vec is the vectorization operation and is a matrix of dimension . Throughout this appendix scalars are treated as matrices of dimension .
The gradient of the log density of the approximation is computed as
To compute note that
where
S2 Stochastic Gradient Ascent
The optimization problem in (13) is solved via stochastic gradient ascent (SGA). SGA maximizes by first initializing the variational parameters at some vector of values , and then sequentially iterating over
The step size is a function of an unbiased estimate of the ELBO gradient, , and the set of tuning parameters that control the learning rates in the optimization problem, denoted by . Throughout this paper we employ the ADADELTA method of Zeiler (2012) to update .
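The ADADELTA update of Zeiler (2012) can be sketched as follows; this is an illustrative implementation written for gradient ascent, with class and variable names of our own choosing:

```python
import numpy as np

class Adadelta:
    """ADADELTA step-size rule (Zeiler, 2012), written for gradient ascent."""

    def __init__(self, dim, rho=0.95, eps=1e-6):
        self.rho, self.eps = rho, eps
        self.Eg2 = np.zeros(dim)  # running average of squared gradients
        self.Ed2 = np.zeros(dim)  # running average of squared updates

    def step(self, g):
        # Accumulate squared gradients, form the scaled update, then
        # accumulate the squared update for the next iteration.
        self.Eg2 = self.rho * self.Eg2 + (1 - self.rho) * g**2
        delta = np.sqrt((self.Ed2 + self.eps) / (self.Eg2 + self.eps)) * g
        self.Ed2 = self.rho * self.Ed2 + (1 - self.rho) * delta**2
        return delta  # add this to the current variational parameters
```

The accumulated ratio of update to gradient magnitudes makes the learning rate per-coordinate and largely insensitive to the scale of the gradients, which is convenient when the variational parameters mix means and (log) scales.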
Key to achieving fast maximization of via SGA is the use of a variance-reduction technique in producing the unbiased gradient estimate . Here, we follow Kingma and Welling (2014) and Rezende et al. (2014), and make use of the ‘reparameterization trick’. In this approach, a draw from is written as a direct function of , and a set of random variables that are invariant with respect to . For the mean-field approximation used for this example we can write , with . This reparametrization allows us to rewrite as
(S2.1) |
and its gradient as
(S2.2) |
A low-variance unbiased estimate of can be constructed by numerically computing the expectation in (S2.2) using the (available) closed-form expressions for the derivatives , , , and for any . We follow Kingma and Welling (2019) and use a single draw of for the construction of an estimate of (S2.2). The required expressions for the GARCH model are provided in Appendix S3.
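The mechanics above can be sketched in a minimal example. The target below is a standard Gaussian stand-in (not the paper's Gibbs posterior), the approximation is mean-field Gaussian, and, as in the text, a single draw per iteration is used for the gradient estimate; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def grad_log_target(theta):
    # Illustrative target: standard Gaussian, so d/dtheta log p = -theta.
    return -theta

def elbo_grad(mu, log_sig):
    # Reparameterization trick: theta = mu + exp(log_sig) * eps with
    # eps ~ N(0, I); one draw gives an unbiased ELBO gradient.
    eps = rng.standard_normal(mu.shape)
    sig = np.exp(log_sig)
    g_theta = grad_log_target(mu + sig * eps)
    g_mu = g_theta                          # chain rule: dtheta/dmu = I
    g_log_sig = g_theta * sig * eps + 1.0   # +1 from the Gaussian entropy
    return g_mu, g_log_sig

# Plain SGA with a fixed step (ADADELTA would replace `step`).
mu, log_sig = np.array([2.0, -1.0]), np.array([-1.0, 0.5])
step = 0.05
for _ in range(2000):
    g_mu, g_ls = elbo_grad(mu, log_sig)
    mu += step * g_mu
    log_sig += step * g_ls
# For this target the optimum is mu = 0, sigma = 1 (log_sig = 0).
```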
S3 The GARCH Predictive Class (Section 4)
The predictive class, , is defined by a generalized autoregressive conditionally heteroscedastic (GARCH(1,1)) model with Gaussian errors, with . Note that throughout this section and can be used interchangeably.
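As a minimal sketch of this predictive class, the function below runs the standard GARCH(1,1) conditional-variance recursion and evaluates the average logarithmic score of the one-step-ahead Gaussian predictives; the parameterization shown is the textbook one and the function name is our own:

```python
import numpy as np

def garch11_logscore(y, omega, alpha, beta, sig2_0=None):
    """Gaussian GARCH(1,1): sig2_t = omega + alpha*y_{t-1}^2 + beta*sig2_{t-1}.
    Returns the average log score of the one-step-ahead predictives.
    Assumes alpha + beta < 1 when using the unconditional variance to
    initialize the recursion."""
    n = len(y)
    sig2 = np.empty(n)
    sig2[0] = sig2_0 if sig2_0 is not None else omega / (1 - alpha - beta)
    for t in range(1, n):
        sig2[t] = omega + alpha * y[t - 1] ** 2 + beta * sig2[t - 1]
    # Gaussian log predictive density at each observation.
    ls = -0.5 * (np.log(2 * np.pi * sig2) + y**2 / sig2)
    return ls.mean()
```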
S3.1 Priors
We employ the following priors for each of the parameters of the model:
For the implementation of variational inference, all the parameters are transformed onto the real line as follows:
After applying these transformations, we have that the prior densities are:
The gradient of the logarithm of the prior is .
S3.2 Derivation of for all scoring rules
We can show that Thus, we must find an expression for for each of the scores. The gradients of all the scores can be written as functions of the recursive derivatives:
with , , and .
S3.2.1 Logarithmic score (LS)
The gradient for the LS can be written as
with
S3.2.2 Continuously ranked probability score (CRPS)
For the Gaussian GARCH(1,1) predictive class the CRPS can be expressed as
with and . The gradient of the CRPS can be written as
The elements of this gradient are given by
with and .
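The closed-form CRPS of a Gaussian predictive used here is a standard result (Gneiting and Raftery, 2007) and can be evaluated directly; the function name is illustrative:

```python
import math

def crps_gaussian(y, mu, sigma):
    # Closed-form CRPS of a N(mu, sigma^2) predictive at observation y,
    # negatively oriented (smaller is better):
    #   CRPS = sigma * [ z*(2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi) ],
    # with z = (y - mu)/sigma, Phi and phi the standard normal cdf and pdf.
    z = (y - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return sigma * (z * (2.0 * Phi - 1.0) + 2.0 * phi - 1.0 / math.sqrt(math.pi))
```

In the GARCH case one simply plugs in the one-step-ahead conditional mean and standard deviation for `mu` and `sigma`; the gradient expressions in the text differentiate this closed form with respect to the model parameters.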
S3.2.3 Censored logarithmic score (CLS)
For some threshold value , denote the upper-tail support of the predictive distribution as . The gradient for the upper-tail CLS can be written as
with
To obtain the gradient expressions for the lower tail CLS simply replace the term by the term , and redefine the set .
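The censored logarithmic score itself (in the spirit of Diks et al., 2011) can be sketched for a Gaussian predictive as follows: observations in the region of interest are scored with the full log density, while observations outside it contribute only through the log of the censored probability mass. The function name and arguments are illustrative:

```python
import math

def cls_upper(y, mu, sigma, c):
    """Censored log score focusing on the upper tail A = (c, inf) of a
    N(mu, sigma^2) predictive: log f(y) if y > c, else log F(c)."""
    z = (y - mu) / sigma
    if y > c:
        # Full Gaussian log predictive density on the region of interest.
        return -0.5 * math.log(2.0 * math.pi * sigma**2) - 0.5 * z * z
    # Censored region: lump all mass below c into log F(c).
    zc = (c - mu) / sigma
    Fc = 0.5 * (1.0 + math.erf(zc / math.sqrt(2.0)))
    return math.log(Fc)
```

The lower-tail version swaps the roles of the two branches, replacing F(c) with 1 - F(c), exactly as described in the text.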
S3.2.4 Interval score (IS)
The gradient of the IS can be written as
For , the elements of the gradient can be written as
where the derivative can be evaluated as with . The first term can be computed as . The second term is parameter specific, and can be computed as
The derivative is evaluated in the same fashion as described for .
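The interval score being differentiated above is the standard one of Gneiting and Raftery (2007) for a central (1 − α) predictive interval [l, u]; a direct sketch (negatively oriented, illustrative names):

```python
def interval_score(y, l, u, alpha=0.05):
    # Width penalty plus a miss penalty of 2/alpha per unit of violation
    # whenever the realization y falls outside [l, u].
    s = u - l
    if y < l:
        s += (2.0 / alpha) * (l - y)
    elif y > u:
        s += (2.0 / alpha) * (y - u)
    return s
```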
S4 Autoregressive Mixture Predictive Class (Section 5.1)
S4.1 Priors
We employ the following priors for each of the parameters of the model:
where the parameter is given by the stick-breaking decomposition of the mixture weights, which sets , for and where . For the implementation of variational inference, all the parameters are transformed onto the real line as follows:
After applying these transformations, we have that the prior densities are:
Computation of
Denoting as , , and , the gradient of the prior density is
with
and where and .
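The stick-breaking decomposition of the mixture weights described in Section S4.1 can be sketched as follows (the function name is our own):

```python
import numpy as np

def stick_breaking(v):
    """Map stick-breaking fractions v_1, ..., v_{K-1} in (0, 1) to K
    mixture weights: w_k = v_k * prod_{j<k}(1 - v_j), and the last
    weight takes whatever is left, w_K = prod_j (1 - v_j)."""
    v = np.asarray(v, dtype=float)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - v)))
    return np.append(v, 1.0) * remaining
```

By construction the weights are non-negative and sum to one, which is what makes the unconstrained (logit-transformed) fractions convenient variational parameters.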
S4.2 Derivation of for all scoring rules
The focused Bayesian update uses the term , thus variational inference requires evaluation of
For the mixture example we consider three alternative choices for , namely, the IS, the CLS and LS. Here, we derive an expression for for each of these scores. As will be shown later, the gradient for all the scores can be expressed solely in terms of and , thus we focus on the derivation of these two expressions. Below, we denote and use the scalar to denote the number of elements in . For ease of notation we rewrite
where , , , , and . The elements of the gradients and , can then be computed as
(S4.3) |
and
(S4.4) |
Table S1 provides the expressions , and for . For , the gradients can be evaluated as
and
where is a diagonal matrix with entries , and is a lower triangular matrix with diagonal elements and off-diagonal elements , for . The expressions needed to evaluate and are also provided in Table S1.
With expressions (S4.3) and (S4.4), we can now derive an expression for for each of the three scores considered.
S4.2.1 Logarithmic score (LS)
S4.2.2 Censored logarithmic score (CLS)
S4.2.3 Interval score (IS)
The gradient of the IS in (12) can be written as
with elements defined as:
As such, computation of entails evaluation of and . The derivative can be evaluated by first noting that . Then, using the triple product rule we know that
where the first term can be computed as . The second term is evaluated using (S4.4). The derivative is evaluated in the same fashion as described for .
Table S1: Expressions for the gradient terms referenced above.
S5 Bayesian Neural Network Predictive Class (Section 5.2)
S5.1 Priors
The priors of the model parameters are set as and . The standard deviation parameter is transformed to the real line as , so that the parameter vector is . Then, the prior density can be written as , where and . From this prior density we have that .
S5.2 Derivation of for all scoring rules
As in the previous appendix, we can show that Thus, we must find an expression for for each of the scores.
S5.2.1 Logarithmic score (LS)
The gradient for the LS can be written as
with
The term can be evaluated analytically through backpropagation.
S5.2.2 Censored logarithmic score (CLS)
For the upper tail CLS the gradient can be written as
where , and . The gradient for the lower tail CLS is
S5.2.3 Interval score (IS)
The gradient of the IS can be written as
with elements and defined as:
The derivative can be evaluated by first noting that . Then, using the triple product rule we know that
where the first term can be computed as . The second term is evaluated as in the subsection above. The derivative is evaluated in the same fashion as described for . The corresponding derivatives for parameter can also be computed using similar steps.