
Calibrated Multivariate Regression with Localized PIT Mappings

Lucas Kock$^{1}$, G. S. Rodrigues$^{2*}$, Scott A. Sisson$^{3}$, Nadja Klein$^{4}$ and David J. Nott$^{1}$

Abstract

Calibration ensures that predicted uncertainties align with observed uncertainties. While there is an extensive literature on recalibration methods for univariate probabilistic forecasts, work on calibration for multivariate forecasts is much more limited. This paper introduces a novel post-hoc recalibration approach that addresses multivariate calibration for potentially misspecified models. Our method involves constructing local mappings between vectors of marginal probability integral transform values and the space of observations, providing a flexible and model-free solution applicable to continuous, discrete, and mixed responses. We present two versions of our approach: one uses K-nearest neighbours, and the other uses normalizing flows. Each method has its own strengths in different situations. We demonstrate the effectiveness of our approach on two real data applications: recalibrating a deep neural network's currency exchange rate forecast, and improving a regression model for childhood malnutrition in India, for which the multivariate response has both discrete and continuous components.

Keywords: nearest neighbours, normalizing flows, probabilistic forecasting, regression calibration, uncertainty quantification

Acknowledgments: David Nott’s research was supported by the Ministry of Education, Singapore, under the Academic Research Fund Tier 2 (MOE-T2EP20123-0009), and he is affiliated with the Institute of Operations Research and Analytics at the National University of Singapore. Nadja Klein was supported by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) through the Emmy Noether grant KL 3037/1-1.

1 Department of Statistics and Data Science, National University of Singapore, Singapore
2 Department of Statistics, University of Brasília, Brazil
3 School of Mathematics and Statistics, University of New South Wales, Australia
4 Scientific Computing Center, Karlsruhe Institute of Technology, Germany
Correspondence should be directed to [email protected]

1 Introduction

Historically, notions of calibration have their roots in probabilistic forecasting, with applications in many fields. There are different types of calibration (e.g., Gneiting et al.; 2007), but heuristically a model is considered to be calibrated if its predicted uncertainty matches the observed uncertainty in the data in some sense. For example, a forecast might be considered useful for decision making if an event with a certain forecast probability occurs with the corresponding relative frequency. Calibration of probabilistic models is considered desirable in many scenarios, and has been studied for many types of models and applications, such as neural networks (Dheur and Ben Taieb; 2023; Lakshminarayanan et al.; 2017), regression (Klein et al.; 2021), simulator-based inference (Rodrigues et al.; 2018), and classification (Guo et al.; 2017). In many situations probabilistic forecasts can be multivariate, involving a vector of random variables $\bm{Y}=(Y_{1},\dots,Y_{d})$. However, ensuring multivariate calibration is a challenging task. In this paper, we introduce a novel approach to recalibrating uncertainties obtained from multivariate probabilistic forecasts, based on models which are possibly misspecified. Our approach can be applied post hoc to arbitrary, already fully fitted models, ensuring approximate multivariate calibration while keeping other properties, such as the interpretability of the base model, intact. While we focus on recalibrating an existing base model, it is possible to use a very flexible model at the outset, and there is an extensive literature on flexible regression beyond the mean (Kneib; 2013; Henzi et al.; 2021).

To explain our contribution, it is necessary to discuss different types of calibration. For univariate responses, a common choice is probability calibration (Gneiting et al.; 2007), which can be assessed by checking uniformity of the probability integral transform (PIT) values (Dawid; 1984). A PIT value is the evaluation of the cumulative distribution function (CDF) of the model at an observed data point. Uniformity of the PIT values implies that prediction intervals derived from the probabilistic model have the correct coverage in a frequentist sense. When extending probability calibration from the univariate to the multivariate setting, it is not enough to check uniformity of PIT values for each marginal separately.

Smith (1985) considers joint uniformity of the univariate PIT values under a Rosenblatt transformation (Rosenblatt; 1952) summarizing the joint distribution. Diebold et al. (1998) suggest checking this graphically using histograms and correlograms, and formal tests have been developed in the context of economic forecasting (Corradi and Swanson; 2006; Ko and Park; 2013; Dovern and Manner; 2020). However, this approach is limited, as a Rosenblatt transformation is not readily available for many complex models.

In the context of ensemble forecasts, Gneiting et al. (2008) introduce multivariate rank histograms as a simple graphical check for multivariate calibration. Multivariate rank histograms are extended to copula PIT (CopPIT) values by Ziegel and Gneiting (2014). If all univariate marginals of the prediction model are probability calibrated, the CopPIT values depend only on the copula of the forecast. A formal description of this is given in Section 2. An alternative to PIT and CopPIT values is given by proper scoring rules (Gneiting and Raftery; 2007), which jointly quantify calibration and sharpness of a probabilistic prediction model. Formal tests for multivariate calibration based on scoring rules, complementing PIT-based calibration, can be derived (Knüppel et al.; 2023).

Even models that are useful in practice often suffer from miscalibration induced by model misspecification. For example, computer-based simulators idealise and simplify complex real world phenomena and thus suffer from model misspecification (Ward et al.; 2022); in regression, highly structured but consequently misspecified models can be preferable when the focus is on interpretation rather than prediction; modern deep neural networks suffer from several sources of uncertainty that can be challenging to track through their complex structures (Gawlikowski et al.; 2023). These observations motivate recalibration techniques, which allow a fitted model to be adjusted post hoc.

When it comes to recalibration of already estimated models, a rich literature exists in the univariate context. The approaches are often tailored to a specific subclass of statistical models. In the context of parameter estimation, Menéndez et al. (2014) consider recalibration of confidence intervals using bootstrap-style samples generated from a predictive distribution under the estimated model. In classification, Platt scaling (Platt; 1999), which extends a trained classifier with a logistic regression model to return class probabilities, is a popular approach. Platt scaling has been extended in various directions within the machine learning literature (e.g., Guo et al.; 2017; Kull et al.; 2017). For univariate regression, Kuleshov et al. (2018) suggest learning a transformation of the PIT values to achieve probability calibration. They use isotonic regression; their approach is extended by Dheur and Ben Taieb (2023), who use a kernel density estimator (KDE) of the PIT distribution instead. Our approach can be considered a multivariate extension of this idea, and we review it in more detail in Section 2.1. Our approach is also closely linked to the local recalibration technique for artificial neural networks proposed by Torres et al. (2024). They use non-parametric PIT transformations on a local neighbourhood learned with K-nearest neighbours (KNN), which can be applied to any layer of a deep neural architecture.

Despite the clear need, the literature still lacks general multivariate recalibration techniques that consider a vector of quantities of interest jointly. Heinrich et al. (2021) discuss post-processing methods for multivariate spatio-temporal forecasting models. However, calibration is only one of their many objectives, and the approach does not easily generalize to other model classes. Recently, Wehenkel et al. (2024) considered recalibration for simulation-based inference under model misspecification. Their approach involves learning an optimal transport map between real world observations and the output of the misspecified simulator.

The main contributions of this paper are as follows. (i) We introduce a novel method to achieve multivariate calibration post hoc. The main idea is to construct local mappings between vectors of marginal PIT values and the observation space. Our method thus complements established methods for univariate calibration. (ii) Our approach is general. We are not restricted to continuous data, but can consider discrete and even mixed responses. Therefore the approach can be applied beyond regression to tasks such as clustering, classification, and generalized parameter inference. (iii) Our approach is model-free, as we do not assume a particular structure of the underlying base model. Even though it is helpful if the CDFs of the univariate marginals are available in closed form, our method can be applied as long as samples from the base model can be readily generated. (iv) Our method is simple to use. We introduce two versions of our approach: first, a KNN-based approach similar to Torres et al. (2024), which is then extended to a normalizing flow based approach in which the PIT maps are explicitly learned. Both versions of our approach come with different advantages, and we discuss which method is best suited to which scenario.

We apply our method to two real data examples. First, we recalibrate a one-day-ahead forecast for currency exchange rates based on a deep neural network. Multivariate calibration, where all currencies are considered jointly, is desirable due to the complex dependence structure across currencies. Secondly, we consider a regression task concerning childhood malnutrition in India. The bivariate response vector is mixed, containing a continuous and a discrete response. Multivariate recalibration can be used to combine univariate regression models into one joint predictor. Specifying separate regression models for each response can be easier than constructing a joint model, especially when working with mixed data.

The rest of this paper is organized as follows. First, we give some background on different definitions of calibration and existing recalibration techniques in Section 2. Then, we present our novel recalibration method in Section 3. Section 4 illustrates the good performance of our approach for simulated data in a number of scenarios and Section 5 considers the aforementioned real data examples. Section 6 gives a concluding discussion.

2 Background on Calibration

Let $F(\bm{Y},\bm{X})$ denote the joint distribution of a response $\bm{Y}\in\mathcal{Y}$ and feature vector $\bm{X}\in\mathcal{X}$. In practice, $F(\bm{Y},\bm{X})$ is unknown and the conditional distribution $F(\bm{Y}\mid\bm{X})$ is estimated by some probabilistic model $\widehat{F}(\bm{Y}\mid\bm{X})$ from a training set $\mathcal{D}_{\text{train}}=\{(\bm{y}_{\text{train}}^{(i)},\bm{x}_{\text{train}}^{(i)}),\ i=1,\dots,n_{\text{train}}\}$. Heuristically, the model $\widehat{F}$ is said to be calibrated if it correctly specifies the uncertainty in its own predictions. Since $F$ is not available in practice, calibration can only be assessed based on a validation set $\mathcal{D}_{\text{val}}=\{(\bm{y}_{\text{val}}^{(i)},\bm{x}_{\text{val}}^{(i)}),\ i=1,\dots,n_{\text{val}}\}$ of observations from $F$, which is potentially disjoint from $\mathcal{D}_{\text{train}}$. In this section, we give some background on the notion of calibration, first reviewing univariate calibration in Section 2.1 and then extending it to the multivariate setting in Section 2.2.

2.1 Univariate Calibration

In the univariate case, where $\mathcal{Y}\subseteq\mathbb{R}$, several notions of calibration exist within the literature (e.g., Gneiting and Resin; 2023). Here, we focus on marginal calibration and probability calibration, which are two choices commonly considered in practice.

$\widehat{F}$ is said to be marginally calibrated (Gneiting et al.; 2007) if

\[\mathbb{E}_{\bm{X}\sim F}\left[\widehat{F}(y\mid\bm{X})\right]=\mathbb{P}_{F}(Y\leq y)\qquad\text{for all }y\in\mathcal{Y}.\]

That is, the average predictive CDF $\frac{1}{n_{\text{val}}}\sum_{i=1}^{n_{\text{val}}}\widehat{F}(y\mid\bm{x}_{\text{val}}^{(i)})$ matches the empirical CDF of the observations $\frac{1}{n_{\text{val}}}\sum_{i=1}^{n_{\text{val}}}\mathds{1}\{y_{\text{val}}^{(i)}\leq y\}$ asymptotically for all $y\in\mathcal{Y}$. Hence, Gneiting et al. (2007) suggest plotting the average predictive CDF versus the empirical CDF to graphically assess marginal calibration.
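To make this concrete, the following minimal Python sketch computes both curves on a grid of thresholds; `predictive_cdf` is a hypothetical stand-in for the fitted model's CDF $\widehat{F}(y\mid\bm{x})$, not part of the paper's implementation.

```python
import numpy as np

def marginal_calibration_curve(predictive_cdf, y_val, x_val, grid):
    """Average predictive CDF and empirical CDF over a grid of thresholds y."""
    avg_cdf = np.array([np.mean([predictive_cdf(y, x) for x in x_val])
                        for y in grid])
    emp_cdf = np.array([np.mean(y_val <= y) for y in grid])
    return avg_cdf, emp_cdf  # marginal calibration: the two curves agree
```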

The random variable

\[P=\widehat{F}(Y^{-}\mid\bm{X})+\mathcal{V}\left[\widehat{F}(Y\mid\bm{X})-\widehat{F}(Y^{-}\mid\bm{X})\right]\qquad\text{for }(Y,\bm{X})\sim F(Y,\bm{X}),\tag{1}\]

where $\mathcal{V}\sim\operatorname{U}(0,1)$ and $\widehat{F}(Y^{-}\mid\bm{X})$ is the left-hand limit of $\widehat{F}(y\mid\bm{X})$ as $y$ approaches $Y$ from below, is the randomized PIT value (Czado et al.; 2009). Note that $P$ depends both on $F(Y,\bm{X})$ and on the model $\widehat{F}(Y\mid\bm{X})$. If $\widehat{F}$ is continuous, $P$ is not randomized and reduces to

\[P=\widehat{F}(Y\mid\bm{X}).\]

$\widehat{F}$ is said to be probability calibrated if $P\sim\operatorname{U}(0,1)$. For $i=1,\dots,n_{\text{val}}$, let $\nu^{(i)}$ be independent uniform random variables on $[0,1]$ and write

\[p^{(i)}=\widehat{F}(y_{\text{val}}^{(i)-}\mid\bm{x}_{\text{val}}^{(i)})+\nu^{(i)}\left[\widehat{F}(y_{\text{val}}^{(i)}\mid\bm{x}_{\text{val}}^{(i)})-\widehat{F}(y_{\text{val}}^{(i)-}\mid\bm{x}_{\text{val}}^{(i)})\right].\tag{2}\]

The $p^{(i)}$ are empirical evaluations of (1) across the validation set. Thus, probability calibration can be checked graphically by plotting a histogram of $\{p^{(i)},\ i=1,\dots,n_{\text{val}}\}$. Formal tests for uniformity of the PIT values based on the Wasserstein distance (Zhou et al.; 2021; Zhao et al.; 2020) and the Cramér-von Mises distance (Kuleshov et al.; 2018) are popular alternatives to graphical checks. Gneiting et al. (2007) show that probability calibration is, under mild conditions, equivalent to quantile calibration (Kuleshov et al.; 2018), which requires

\[\frac{1}{n_{\text{val}}}\sum_{i=1}^{n_{\text{val}}}\mathds{1}\left\{y_{\text{val}}^{(i)}\leq\widehat{F}^{-1}(p\mid\bm{x}_{\text{val}}^{(i)})\right\}\to p\qquad\text{almost surely for all }p\in[0,1],\]

where $\widehat{F}^{-1}(\cdot\mid\bm{x}_{\text{val}}^{(i)})$ denotes the generalized inverse of $\widehat{F}(\cdot\mid\bm{x}_{\text{val}}^{(i)})$. This perspective has the nice interpretation that prediction intervals derived from $\widehat{F}$ have the correct coverage.

Histograms of the PIT values can also be used for model criticism as they indicate the type of miscalibration at hand. For example, U-shaped histograms indicate overconfidence, while triangular shapes indicate a biased model (Gneiting et al.; 2007).
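A minimal sketch of computing the randomized PIT values (2) in Python follows; `cdf` and `cdf_left` are hypothetical callables for $\widehat{F}(y\mid\bm{x})$ and its left-hand limit (they coincide for continuous models, in which case the randomization has no effect).

```python
import numpy as np

def randomized_pit(cdf, cdf_left, y_val, x_val, seed=0):
    """Empirical randomized PIT values p^(i) as in (2)."""
    rng = np.random.default_rng(seed)
    p = np.empty(len(y_val))
    for i, (y, x) in enumerate(zip(y_val, x_val)):
        lo, hi = cdf_left(y, x), cdf(y, x)  # jump of the CDF at y
        p[i] = lo + rng.uniform() * (hi - lo)
    return p  # under probability calibration, p is approximately U(0, 1)
```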

Several techniques exist in the literature to recalibrate a potentially miscalibrated model $\widehat{F}$ in a post-hoc step. Here, we describe a simple method for doing this due to Kuleshov et al. (2018). Write $G(p)$ for the distribution function of the PIT values (1), where dependence on $F$ and $\widehat{F}$ is left implicit in the notation. It is easy to check that $G(\widehat{F}(y\mid\bm{x}))$ is a distribution function for every $\bm{x}\in\mathcal{X}$ and is probability calibrated with respect to $F(Y,\bm{X})$. In practice, the distribution function $G(p)$ is not known, and it must be estimated from $\mathcal{D}_{\text{val}}$. Kuleshov et al. (2018) suggest using a method based on isotonic regression. Dheur and Ben Taieb (2023) extend this idea and consider KDEs. Among other choices, they propose to use

\[\widehat{G}(p)=\frac{1}{n_{\text{val}}}\sum_{i=1}^{n_{\text{val}}}\mathds{1}\left\{p^{(i)}\leq p\right\},\]

which is the empirical CDF of the PIT values over the validation set $\mathcal{D}_{\text{val}}$. An alternative approach to recalibration for regression models is given by Song et al. (2019). Recently, Torres et al. (2024) proposed nonparametric local recalibration for neural networks. Their approach uses a fast KNN algorithm to localize the recalibration and can be applied in any layer of the neural network, scaling to potentially high-dimensional feature spaces $\mathcal{X}$.
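As an illustration, the empirical-CDF variant of this recalibration can be sketched in a few lines of Python; the composition $\widehat{G}(\widehat{F}(y\mid\bm{x}))$ then serves as the recalibrated predictive CDF (a sketch, not the authors' code).

```python
import numpy as np

def fit_pit_ecdf(p_val):
    """Return G_hat, the empirical CDF of the validation PIT values."""
    p_sorted = np.sort(p_val)

    def G_hat(p):
        return np.searchsorted(p_sorted, p, side="right") / len(p_sorted)

    return G_hat

# usage: the recalibrated CDF at (y, x) is G_hat(F_hat(y | x))
```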

2.2 Multivariate Calibration

Extending the different notions of calibration from the univariate to the multivariate case, $\mathcal{Y}\subseteq\mathbb{R}^{d}$, is not straightforward. One reason for this is that the multivariate integral transformation

\[F(\bm{y}\mid\bm{x})\qquad\text{for}\qquad(\bm{y},\bm{x})\sim F(\bm{Y},\bm{X})\]

is, in contrast to the univariate case, generally not uniformly distributed (e.g., Genest and Rivest; 2001), but follows the so-called Kendall distribution of $F$. The Kendall distribution depends only on the copula of the multivariate probability measure, and thus summarizes the dependence structure of $F$. Based on this observation, Ziegel and Gneiting (2014) introduce copula probability integral transform (CopPIT) values as the analogue of the univariate PIT values described in (1). The CopPIT values are given as

\[U=\mathcal{K}_{\bm{X}}\left(\widehat{F}(\bm{Y}^{-}\mid\bm{X})\right)+\Upsilon\left[\mathcal{K}_{\bm{X}}\left(\widehat{F}(\bm{Y}\mid\bm{X})\right)-\mathcal{K}_{\bm{X}}\left(\widehat{F}(\bm{Y}^{-}\mid\bm{X})\right)\right],\tag{3}\]

where $\Upsilon\sim\operatorname{U}(0,1)$, $(\bm{Y},\bm{X})\sim F(\bm{Y},\bm{X})$, and $\mathcal{K}_{\bm{X}}$ denotes the Kendall distribution of $\widehat{F}(\bm{Y}\mid\bm{X})$. $\widehat{F}$ is said to be copula calibrated if the CopPIT values are uniformly distributed on the unit interval (Ziegel and Gneiting; 2014). In this way, copula calibration can be seen as a multivariate extension of probability calibration. In particular, for $d=1$, $\mathcal{K}_{\bm{X}}$ is the uniform distribution on $[0,1]$, so that (3) equals (1). As in the univariate case, let

\[u^{(i)}=\mathcal{K}_{\bm{x}_{\text{val}}^{(i)}}\left(\widehat{F}(\bm{y}_{\text{val}}^{(i)-}\mid\bm{x}_{\text{val}}^{(i)})\right)+\upsilon^{(i)}\left[\mathcal{K}_{\bm{x}_{\text{val}}^{(i)}}\left(\widehat{F}(\bm{y}_{\text{val}}^{(i)}\mid\bm{x}_{\text{val}}^{(i)})\right)-\mathcal{K}_{\bm{x}_{\text{val}}^{(i)}}\left(\widehat{F}(\bm{y}_{\text{val}}^{(i)-}\mid\bm{x}_{\text{val}}^{(i)})\right)\right]\]

denote the empirical CopPIT values from the validation set, $i=1,\dots,n_{\text{val}}$, where the $\upsilon^{(i)}$ are independent uniform variates on $[0,1]$. Again, copula calibration can be assessed by checking uniformity of $\{u^{(i)},\ i=1,\dots,n_{\text{val}}\}$. However, interpretation of the CopPIT histograms is more challenging than in the univariate case, as they summarize potential miscalibration not only of the dependence structure but also of the marginal distributions. In the special case that all margins of $\widehat{F}$ are probability calibrated, the CopPIT values summarize miscalibration of the copula of $\widehat{F}$ only (Ziegel and Gneiting; 2014). Thus, in practice it is sensible to assess multivariate calibration by checking for univariate calibration of each marginal in terms of the marginal PIT values (1), and for copula calibration in terms of the CopPIT values (3).

Ziegel and Gneiting (2014) also introduce Kendall calibration,

\[\lim_{n_{\text{val}}\to\infty}\frac{1}{n_{\text{val}}}\sum_{i=1}^{n_{\text{val}}}\mathds{1}\left\{\widehat{F}(\bm{y}^{(i)}_{\text{val}}\mid\bm{x}^{(i)}_{\text{val}})\leq\omega\right\}=\lim_{n_{\text{val}}\to\infty}\frac{1}{n_{\text{val}}}\sum_{i=1}^{n_{\text{val}}}\mathcal{K}_{\bm{x}_{\text{val}}^{(i)}}(\omega)\qquad\text{for all }\omega\in[0,1],\tag{4}\]

as the multivariate analogue of marginal calibration. Kendall calibration can be assessed by a so-called Kendall diagram, which is a scatter plot of the empirical left-hand side versus the empirical right-hand side of (4) for different values of $\omega$.

Both copula calibration and Kendall calibration necessitate the derivation of the Kendall distributions $\mathcal{K}_{\bm{x}}$. The Kendall distribution can be calculated in closed form only in a few special cases (e.g., Genest and Rivest; 2001). So, in practice, $\mathcal{K}_{\bm{x}}$ in (3) and (4) is replaced by an approximation given as the empirical CDF of the pseudo-observations (Barbe et al.; 1996)

\[w_{k}=\frac{1}{m}\sum_{j=1}^{m}\mathds{1}\left\{\bm{y}_{j}\preceq\bm{y}_{k}\right\}\qquad\text{for }k=1,\dots,m,\]

where $\bm{y}_{j}=(y_{j1},\dots,y_{jd})\preceq\bm{y}_{k}=(y_{k1},\dots,y_{kd})$ if $y_{jl}\leq y_{kl}$ for all $l=1,\dots,d$, and $\bm{y}_{1},\dots,\bm{y}_{m}$ is a large sample from $\widehat{F}(\bm{y}\mid\bm{x})$.
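For illustration, the empirical Kendall distribution based on these pseudo-observations might be computed as follows (a minimal sketch, assuming an `(m, d)` sample `y_samples` drawn from $\widehat{F}(\bm{y}\mid\bm{x})$); the CopPIT value is then approximately $\widehat{\mathcal{K}}(\widehat{F}(\bm{y}\mid\bm{x}))$.

```python
import numpy as np

def empirical_kendall_cdf(y_samples):
    """Empirical Kendall distribution from an (m, d) sample of F_hat(. | x)."""
    m = len(y_samples)
    # dominated[k, j] is True iff y_j <= y_k componentwise, i.e. y_j precedes y_k
    dominated = np.all(y_samples[None, :, :] <= y_samples[:, None, :], axis=2)
    w = np.sort(dominated.mean(axis=1))  # pseudo-observations w_1, ..., w_m

    def K_hat(omega):
        return np.searchsorted(w, omega, side="right") / m

    return K_hat
```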

3 Multivariate Calibration via PIT Mappings

This section describes our approach to recalibrating arbitrary probabilistic prediction models $\widehat{F}$. Section 3.1 first considers a simple KNN approach, which can be thought of as a multivariate extension of the recalibration methods of Torres et al. (2024) and Rodrigues et al. (2018); Section 3.2 then presents the novel normalizing flow based method.

3.1 Nearest neighbour recalibration

Suppose that we have a mapping on the feature space, $h:\mathcal{X}\to\mathbb{R}^{d}$. The purpose of the function $h$ is to reduce the dimension of $\bm{x}$, and we use $\lVert h(\bm{x})-h(\bm{x}^{\prime})\rVert$ to measure the similarity of the feature vectors $\bm{x}$ and $\bm{x}^{\prime}$. Let

\[N_{k}(\bm{x})=\{i:h(\bm{x}^{(i)}_{\text{val}})\text{ is one of the $k$ nearest neighbours of }h(\bm{x})\}.\]

If $N_{k}(\bm{x})$ is a sufficiently small neighbourhood around $\bm{x}$, then $\{\bm{p}^{(i)},\ i\in N_{k}(\bm{x})\}$ approximates a sample from $\bm{P}=(P_{1},\dots,P_{d})$, where $P_{l}$ is the PIT value for the $l$-th response of $\widehat{F}(\bm{Y}\mid\bm{x})$ as given in (1). Let $G_{\bm{x}}$ denote the joint distribution of $\bm{P}$ given $\bm{x}$, with marginal distributions $G_{\bm{x},l}$, $l=1,\dots,d$. Theoretical properties of $G_{\bm{x}}$ were studied in Rodrigues et al. (2018). In particular, $G_{\bm{x}}(\widehat{F}(\bm{y}\mid\bm{x}))=\left(G_{\bm{x},1}(\widehat{F}_{1}(y_{1}\mid\bm{x})),\dots,G_{\bm{x},d}(\widehat{F}_{d}(y_{d}\mid\bm{x}))\right)$ has probability calibrated marginals, by the same arguments as for the univariate recalibration techniques described in Section 2.1. Note that the use of $G_{\bm{x},l}$ instead of the global unconditional distribution $G_{l}$, as considered in Kuleshov et al. (2018) and Dheur and Ben Taieb (2023), gives a stronger form of calibration, as the resulting model is locally, that is conditional on $\bm{x}$, probability calibrated. In addition to the marginal information, $G_{\bm{x}}$ also matches the dependence structure of $\widehat{F}(\bm{Y}\mid\bm{x})$ under $F(\bm{Y}\mid\bm{x})$. For a given $\bm{x}\in\mathcal{X}$ and continuous marginals $\widehat{F}_{l}(y\mid\bm{x})$, $p_{l}$ is a non-random transformation of $y_{l}$ and, in particular, $\bm{p}$ is an invertible transformation of $\bm{y}$. The Kendall distribution is invariant under such transformations, and thus the CopPIT value $u\mid\bm{x}$ could be calculated purely from $\bm{p}\mid\bm{x}$, without access to $\bm{y}\mid\bm{x}$. Also, $\bm{P}\mid\bm{x}$ has copula $C_{\bm{x}}$, which is the copula of $F(\bm{Y}\mid\bm{x})$, and, from the arguments above, $G_{\bm{x}}(\widehat{F}(\bm{Y}\mid\bm{x}))$ has probability calibrated marginals. Following Ziegel and Gneiting (2014), this implies copula calibration.

Thus, for a given $\bm{x}\in\mathcal{X}$, a sample of size $k$ from an approximately calibrated predictive distribution $\widetilde{F}(\bm{Y}\mid\bm{x})$ can be generated as

\[\tilde{\bm{y}}^{(i)}=\left(\tilde{y}^{(i)}_{1},\dots,\tilde{y}^{(i)}_{d}\right)=\left(\widehat{F}^{-1}_{1}(p^{(i)}_{1}\mid\bm{x}),\dots,\widehat{F}^{-1}_{d}(p^{(i)}_{d}\mid\bm{x})\right)=\widehat{F}^{-1}(\bm{p}^{(i)}\mid\bm{x}),\quad i\in N_{k}(\bm{x}).\tag{5}\]

Here, $\bm{p}^{(i)}=(p_{1}^{(i)},\dots,p_{d}^{(i)})$, with $p^{(i)}_{l}$ the empirical PIT value for the $l$-th marginal distribution $\widehat{F}_{l}(Y_{l}\mid\bm{X})$ evaluated on the $i$-th entry of the validation set. If $\widehat{F}^{-1}_{l}(\cdot\mid\bm{x})$ is not available in closed form, it can easily be approximated using a sample from $\widehat{F}_{l}(Y_{l}\mid\bm{x})$, making our approach model-free.

However, (5) is only an approximation to $G_{\bm{x}}(\widehat{F}(\bm{y}\mid\bm{x}))$, and we illustrate how well this works in practice on a number of simulated and real data examples in Sections 4 and 5.
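A minimal Python sketch of the sampling step (5) is given below, where `h_val` stores the mapped validation features $h(\bm{x}^{(i)}_{\text{val}})$, `p_val` the matrix of validation PIT vectors, and `marginal_inv_cdfs` is a hypothetical list of quantile functions $\widehat{F}^{-1}_{l}(\cdot\mid\bm{x})$; none of these names come from the paper's implementation.

```python
import numpy as np

def knn_recalibrated_sample(x, h, h_val, p_val, marginal_inv_cdfs, k):
    """k approximate draws from the recalibrated predictive at x, as in (5)."""
    dists = np.linalg.norm(h_val - h(x), axis=1)  # distances in h-space
    neighbours = np.argsort(dists)[:k]            # indices in N_k(x)
    p_local = p_val[neighbours]                   # local PIT vectors
    d = p_val.shape[1]
    # push each local PIT vector through the inverse marginals at x
    return np.column_stack(
        [marginal_inv_cdfs[l](p_local[:, l], x) for l in range(d)]
    )
```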

3.2 Recalibration with normalizing flows

We can think of the nearest neighbour approach introduced in Section 3.1 as obtaining an approximate sample from $\bm{P}\mid\bm{X}=\bm{x}$ for a target feature vector $\bm{x}$, and then transforming back to the original space of the responses to obtain approximate samples of $\bm{Y}\mid\bm{X}=\bm{x}$. In the nearest neighbour approach, no explicit expression for $\widetilde{F}(\bm{Y}\mid\bm{x})$ is constructed, and the number of potential draws from $\widetilde{F}(\bm{Y}\mid\bm{x})$ is restricted by $k$, which in turn depends heavily on $n_{\text{val}}$. However, some applications require calculating complex summary statistics from the potentially intricate distribution $\widetilde{F}$, which necessitates the ability to draw arbitrarily large samples from the recalibrated model. Thus, we propose a method similar to the KNN approach that uses normalizing flows to draw approximate samples from $\bm{P}\mid\bm{x}$.

The basic idea is as follows. Let $\rho(\bm{z})$ be a reference density with respect to the Lebesgue measure on $\mathbb{R}^{d}$, which we take to be the standard normal density. We consider a bijective transformation $T_{\bm{\zeta}}(\bm{z}\mid\bm{x})$ with a set of learnable parameters $\bm{\zeta}$, and transform $\bm{Z}\sim\rho(\bm{z})$ to a random vector $\bm{P}\mid\bm{x}$. The density of $\bm{P}\mid\bm{x}$ is thus approximated as

\[\rho\left(T^{-1}_{\bm{\zeta}}(\bm{p}\mid\bm{x})\right)\left|\det J_{T^{-1}_{\bm{\zeta}}}(\bm{p}\mid\bm{x})\right|,\]

where $T^{-1}_{\bm{\zeta}}(\bm{p}\mid\bm{x})$ is the inverse of $T_{\bm{\zeta}}(\bm{z}\mid\bm{x})$ and $J_{T^{-1}_{\bm{\zeta}}}(\bm{p}\mid\bm{x})$ is its Jacobian matrix. Based on this, the parameter $\bm{\zeta}$ is learned using observations $(\bm{p}^{(i)},\bm{x}^{(i)}_{\text{val}})$, where $\bm{p}^{(i)}$ denotes the vector of PIT values for the marginal distributions evaluated on the validation set. To avoid boundary effects, we consider normalized PIT values $\bm{p}_{N}=\Phi^{-1}(\bm{p})=\left(\Phi^{-1}(p_{1}),\dots,\Phi^{-1}(p_{d})\right)\in\mathbb{R}^{d}$, where $\Phi(\cdot)$ is the CDF of the standard Gaussian distribution, instead of the usual PIT values on $[0,1]$. As in the KNN approach, $\bm{x}$ can be replaced with a lower dimensional representation $h(\bm{x})$ in the construction of $T_{\bm{\zeta}}(\bm{z}\mid\bm{x})$. Having learned $\bm{\zeta}$ as $\hat{\bm{\zeta}}$, samples from $\bm{P}\mid\bm{x}$ can be generated by sampling $\bm{z}\sim\rho(\bm{z})$ and setting $\bm{p}=T_{\hat{\bm{\zeta}}}(\bm{z}\mid\bm{x})$. These samples can then be used, as in (5), to generate samples from the approximately recalibrated predictive distribution $\widetilde{F}(\bm{Y}\mid\bm{x})$. Pseudo code for the full approach is given in Appendix A.

There are many ways to construct suitable transformations $T_{\bm{\zeta}}(\bm{z}\mid\bm{x})$ in the literature on transport maps (e.g., Marzouk et al.; 2016) and normalizing flows (e.g., Rippel and Adams; 2013; Yao et al.; 2023), as well as in standard software for conditional density estimation. Here, we consider the real-valued non-volume preserving (real-NVP) approach of Dinh et al. (2017), where $T_{\bm{\zeta}}(\bm{z}\mid\bm{x})$ is given as a stack of affine coupling layers. We found this to be a satisfactory choice in all examples considered, but alternative approaches might be better suited depending on the structure of the recalibration task at hand.

The normalizing flow can be interpreted as a conditional density estimate for $\bm{P}\mid\bm{x}$. If $d=1$, our approach can therefore be considered a localized version of the KDE-based method of Dheur and Ben Taieb (2023).
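To make the construction concrete, a minimal conditional affine coupling flow in the spirit of real-NVP can be written in PyTorch as below; the architecture (network sizes, number of layers, optimizer settings) is illustrative and not the configuration used in the paper.

```python
import math
import torch
import torch.nn as nn

class ConditionalCoupling(nn.Module):
    """One affine coupling layer conditioned on features x."""
    def __init__(self, d, x_dim, hidden=64):
        super().__init__()
        self.d1 = d // 2  # dimensions passed through unchanged
        self.net = nn.Sequential(
            nn.Linear(self.d1 + x_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 2 * (d - self.d1)),
        )

    def forward(self, z, x):   # z -> p
        z1, z2 = z[:, :self.d1], z[:, self.d1:]
        s, t = self.net(torch.cat([z1, x], dim=1)).chunk(2, dim=1)
        return torch.cat([z1, z2 * torch.exp(s) + t], dim=1)

    def inverse(self, p, x):   # p -> z, with log|det J| of the inverse
        p1, p2 = p[:, :self.d1], p[:, self.d1:]
        s, t = self.net(torch.cat([p1, x], dim=1)).chunk(2, dim=1)
        return torch.cat([p1, (p2 - t) * torch.exp(-s)], dim=1), -s.sum(dim=1)

class ConditionalFlow(nn.Module):
    def __init__(self, d, x_dim, n_layers=4):
        super().__init__()
        self.d = d
        self.layers = nn.ModuleList(
            ConditionalCoupling(d, x_dim) for _ in range(n_layers))

    def log_prob(self, p, x):  # exact log-density of p_N given x
        log_det = torch.zeros(len(p))
        for layer in self.layers:
            p, ld = layer.inverse(p, x)
            log_det = log_det + ld
            p = p.flip(1)      # permute so all dimensions get updated
        base = -0.5 * (p ** 2).sum(1) - 0.5 * self.d * math.log(2 * math.pi)
        return base + log_det

    def sample(self, n, x):    # x: (n, x_dim) tensor of conditioning features
        z = torch.randn(n, self.d)
        for layer in reversed(self.layers):
            z = layer(z.flip(1), x)
        return z

def fit(flow, p_N, x_val, epochs=500, lr=1e-3):
    """Maximum likelihood training on the normalized PIT vectors."""
    opt = torch.optim.Adam(flow.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = -flow.log_prob(p_N, x_val).mean()
        loss.backward()
        opt.step()
    return flow
```

After training, draws at a target feature row tensor `x_obs` of shape `(1, x_dim)` would be obtained as `flow.sample(n, x_obs.repeat(n, 1))` and then mapped back through $\Phi$ and the inverse marginal CDFs as in Algorithm 1.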

4 Simulations

We illustrate the performance of both the KNN-based approach, labeled KNN, and the normalizing flow approach, labeled NF, on a number of simulated examples. First, we reanalyze the illustrative example from Ziegel and Gneiting (2014) to consider forecasts suffering from different kinds of miscalibration. Both KNN and NF achieve probability calibration of the marginals, copula calibration, and Kendall calibration across all scenarios. Secondly, we consider a regression task where $\bm{Y}\mid\bm{X}$ is degenerate, and we investigate the local calibration properties of our approaches. For a given $\bm{x}$, both NF and KNN allow samples to be generated from the recalibrated predictive distribution $\bm{Y}\mid\bm{x}$. KNN can generate only a sample of size much smaller than $n_{\text{val}}$, while NF is computationally more complex but allows an arbitrarily large sample to be drawn from the recalibrated model. In our simulations, samples from both methods are close to the true distribution even for a grossly misspecified base model. More details on the simulation studies can be found in Appendix B.

5 Applications

5.1 Currency exchange rates

Foreign currencies constitute a popular class of assets among investors. In so-called Forex trading, traders exchange currencies with the goal of making a profit. The ability to make reliable predictions for currency exchange rates is crucial for a successful trading strategy. To this end, we analyze five time series of daily exchange rates for five currencies relative to the US dollar: the Australian Dollar (AUD), the Chinese Yuan (CNY), the Euro (EUR), the Pound Sterling (GBP), and the Singapore Dollar (SGD). These data span five years from August 01, 2019, to August 01, 2024, and were sourced from Yahoo Finance. Given the high correlation among currency exchange rates, multivariate calibration is a desirable feature for any currency exchange forecasting model.

Baseline model

We consider a one-day-ahead forecast based on a Long Short-Term Memory (LSTM) neural network with a distributional layer, so that the resulting forecast distribution is multivariate Gaussian with diagonal covariance structure. Even though the resulting probabilistic forecast cannot express correlation, dependencies between the currencies are exploited through the deep LSTM network modelling the five-dimensional time series jointly. LSTMs have been successfully implemented for time series prediction (e.g., Hua et al.; 2019), and the resulting model recovers the general structure of the data well. Note, however, that our main focus is to illustrate the merits of multivariate calibration, not the construction of the forecasting model.

Recalibration in online learning

Every day, as a new data point becomes available, the baseline LSTM model is updated accordingly. The recalibration model follows a similar iterative process. After an initial period (here 100 days), the process is as follows. The LSTM model generates a predictive distribution for the one-day-ahead forecast. This forecast is then recalibrated using the recalibration model. Once the actual data for the next day are obtained, the LSTM model is updated on the now extended dataset. Simultaneously, the new data point yields an updated vector of marginal PIT values, prompting an update to the recalibration model. In each step only one additional data point becomes available, and both the base model and the recalibration model can be updated using a warm start, avoiding the need to retrain from scratch. This drastically reduces the computational resources needed for training. We do not use a separate validation set, but reuse the training data for recalibration of the forecast.

In time series forecasting, a natural assumption is that the more recently an observation was made, the more information it carries about future values. Hence, we consider the KNN approach here, using the most recent 100 PIT values for recalibration at each time step. This way, the recalibration is carried out on a rolling window of PIT values, as sketched below. The KNN approach is preferable to NF here, as it allows us to gradually control the information available to the recalibration method.
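A rolling-window recalibrator of this kind reduces to a short buffer of recent PIT vectors; the sketch below uses hypothetical names (not the paper's code) to illustrate the update-and-sample cycle, with `marginal_inv_cdfs` as in the KNN sketch of Section 3.1.

```python
from collections import deque
import numpy as np

class RollingRecalibrator:
    """KNN-style recalibration on a rolling window of recent PIT vectors."""
    def __init__(self, window=100):
        self.pits = deque(maxlen=window)  # oldest PIT vectors drop out

    def update(self, p_new):
        """Append the marginal PIT vector of the newly observed day."""
        self.pits.append(np.asarray(p_new))

    def sample(self, marginal_inv_cdfs, x):
        """Push buffered PIT vectors through the current inverse marginals."""
        p = np.stack(list(self.pits))
        return np.column_stack(
            [marginal_inv_cdfs[l](p[:, l], x) for l in range(p.shape[1])]
        )
```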

Results

Histograms of the univariate PIT values (Appendix C) indicate that none of the margins of the base model are probability calibrated, and the kind of miscalibration differs drastically between currencies. For example, the marginal PIT values for CNY and SGD are skewed, indicating a biased forecast, while the histograms for AUD and EUR are U-shaped, indicating underdispersion (Gneiting et al.; 2007). The recalibration through KNN drastically improves the overall calibration of the base model. Figure 1A shows the base and the recalibrated forecasts together with the realized values for CNY. The base model underestimates the exchange rate for CNY in 2020 and 2021. This bias is corrected by the recalibration. In the third and fourth quarters of 2023, the base model underestimates the uncertainty of the forecast, resulting in an accumulation of realised values outside the $95\%$ credible band during this period. Our KNN approach detects this local miscalibration and widens the credible band accordingly. Multivariate calibration is especially helpful when estimating functions over multiple margins, as done for example when assessing the risk of a portfolio. To illustrate this, we consider the time series EUR/GBP, which gives the direct exchange rate for EUR relative to GBP. Both the base and the recalibrated forecast models describe an implicit forecast for EUR/GBP. EUR and GBP are strongly correlated, with an estimated Kendall's $\tau$ of $0.68$. Since the base model does not account for this dependence structure, the estimated credible intervals are wider than necessary. Even though KNN does not recalibrate EUR/GBP directly, this is corrected by the multivariate recalibration, as shown in Figure 1B.

Figure 1: Currency exchange. One-day-ahead forecasts for CNY (A) and EUR/GBP (B) under the base model (green) and the recalibrated model (red). The bold lines correspond to the estimated mean values, while $95\%$ credible bands are given by the shaded areas. The true realized time series are given in black.

5.2 Childhood malnutrition

Ending all forms of malnutrition is one of the Sustainable Development Goals of the United Nations (United Nations; 2015). Here, we consider a sample from the Demographic and Health Surveys (www.measuredhs.com) containing $n=24,286$ observations on several factors of undernutrition in India. Following previous analyses of the data (Klein et al.; 2020; Briseño Sanchez et al.; 2024), we consider two responses: the continuous indicator wasting, which reports weight for height as a z-score, and the binary response fever, which indicates fever within the two weeks prior to the interview. Following Klein et al. (2020), we consider the covariables csex, indicating the sex of the child; cage, the age of the child in months; breastfeeding, the duration of breastfeeding in months; mbmi, the body mass index of the mother; and dist, the district in India the child lives in.

Baseline model

We fit separate regression models for the two responses. wasting is modelled through a heteroscedastic Gaussian distribution, where both the mean and the variance parameter are linked to an additive predictor, and we use logistic regression for fever. Non-linear effects for the continuous covariates cage, breastfeeding, and mbmi are modelled with Bayesian P-splines (Eilers and Marx; 1996). We use a linear effect for csex, and a spatial effect with a Gaussian Markov random field prior for dist (Rue and Held; 2005). This results in two highly interpretable distributional regression models.

Multivariate calibration

Figure 2A shows a scatter plot of the normalized PIT values $\bm{p}^{(i)}_{N}$ for the independent baseline regression models. While the PIT values for fever under the logistic regression model are close to the uniform distribution, the PIT values for wasting show clear deviations from uniformity, especially in the upper tail. Since both fever and wasting are indicators of the child's health, a complex dependence structure between the two response variables, not sufficiently accounted for by the baseline model, is expected. Multivariate calibration is used to combine the two univariate distributional regression models into a single multivariate regression model. Since we want to investigate how the recalibration affects the interpretable effects of the univariate regression models, we need to be able to generate large samples from $p(\bm{y}\mid\bm{x})$, which is not possible with the basic KNN approach. We thus consider the NF approach as described in Section 3. We condition the NF on the continuous covariables cage and mbmi, as they are predominant factors in the marginal regression models.

Results

The NF improves both the probability calibration of the marginals and the copula calibration of the bivariate regression model, as described by the PIT and CopPIT values respectively (Appendix C). The World Health Organization defines a child as suffering from wasting if wasting $\leq-2$. Figure 2B shows the left tail of the predictive density of wasting conditional on the median values of all covariables for both the baseline and the recalibrated model. The baseline model overestimates the risk $\text{Pr}(\texttt{wasting}\leq-2\mid\bm{x})$ of the median child suffering from wasting compared to the recalibrated model.

The joint regression model allows us to study the risk of a child having fever and simultaneously suffering from wasting, $\text{Pr}(\texttt{wasting}\leq-2,\texttt{fever}=1)$. Figures 2C and D show the main effects of cage and dist on this risk, respectively. The main effects are calculated by varying the covariable of interest while keeping the other covariables fixed at their median values. The risk increases for children younger than a year and decreases for older children. The likelihood of fever is increasing for $0\leq\texttt{cage}\leq 12$ according to the baseline logistic regression model. Both the baseline and the recalibrated model find similar shapes for the main effect of cage, but in terms of magnitude the recalibrated model estimates a lower risk. Similarly, the estimated main effect for dist is lower under the recalibrated model than under the baseline model in all 438 districts (Appendix C). According to this analysis, the risk of a child suffering simultaneously from wasting and fever is higher in the mid-eastern districts of India. This is consistent with the findings in Briseño Sanchez et al. (2024). Even though the recalibrated model is nonparametric, the interpretability of the baseline regression models can be maintained, making the multivariate recalibration approach a valuable tool for the development of complex multivariate distributional regression models.

Figure 2: Malnutrition. A) Scatter plot of the normalized PIT values with respect to the base model. A contour plot of the bivariate standard Gaussian distribution, which is the reference distribution for NF, is given in grey. B) Log-density for wasting for the base model (dashed) and the recalibrated model (bold). The cut-off value for wasting according to the WHO definition is indicated by the vertical line. C) Main effect for cage for the base model (dashed) and the recalibrated model (bold). The y-axis denotes the risk of a child suffering from wasting and fever in %. D) Main effect for dist under the recalibrated model. The risk of a child suffering simultaneously from wasting and fever in % is indicated by the shade of the region. A darker shade corresponds to a higher risk.

6 Conclusion and Discussion

In this paper, we have introduced a novel approach for recalibrating multivariate models, addressing a critical gap in the calibration literature. Our method involves local mappings between marginal PIT values and the space of the observations, and extends established univariate recalibration techniques to the multivariate case. We discuss two different versions of our approach. The KNN-based method provides simplicity and ease of implementation, but is limited in that it only allows the generation of a small sample from the recalibrated model. While computationally more challenging, the NF-based method overcomes this limitation. The merits of our approach are illustrated on a number of simulated and real data examples. We consider forecasting of a multivariate time series and regression for mixed data, further illustrating the versatility of our approach. However, theoretical properties of the PIT-based mappings are not well investigated. A better theoretical understanding could potentially lead to improved recalibration techniques. We use transformations of the PIT values from the marginal distributions; depending on the structure of the underlying predictor model, however, other univariate distributions that summarize the joint distribution could be considered. Future research could investigate the application of our recalibration technique to additional model types, including more intricate dependence structures and larger datasets. Additionally, our method focuses purely on calibration, and integrating it with other aspects of model evaluation and improvement, such as sharpness and robustness checks, could enhance its utility. Calibration is also an important tool for model criticism. The local nature of our approach could potentially allow areas of model misspecification to be detected, and thus our approach could be developed further into model specification and model selection pipelines.

References

  • Barbe et al. (1996) Barbe, P., Genest, C., Ghoudi, K. and Rémillard, B. (1996). On Kendall’s process, Journal of Multivariate Analysis 58(2): 197–229.
  • Briseño Sanchez et al. (2024) Briseño Sanchez, G., Klein, N., Klinkhammer, H. and Mayr, A. (2024). Boosting distributional copula regression for bivariate binary, discrete and mixed responses, arXiv preprint arXiv:2403.02194 .
  • Corradi and Swanson (2006) Corradi, V. and Swanson, N. R. (2006). Predictive density evaluation, Handbook of Economic Forecasting 1: 197–284.
  • Czado et al. (2009) Czado, C., Gneiting, T. and Held, L. (2009). Predictive model assessment for count data, Biometrics 65(4): 1254–1261.
  • Dawid (1984) Dawid, A. P. (1984). Statistical theory: The prequential approach (with discussion and rejoinder), Journal of the Royal Statistical Society, Series A 147: 278–292.
  • Dheur and Ben Taieb (2023) Dheur, V. and Ben Taieb, S. (2023). A large-scale study of probabilistic calibration in neural network regression, in A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato and J. Scarlett (eds), Proceedings of the 40th International Conference on Machine Learning, Vol. 202 of Proceedings of Machine Learning Research, PMLR, pp. 7813–7836.
  • Diebold et al. (1998) Diebold, F., Gunther, T. A. and Tay, A. S. (1998). Evaluating density forecasts with applications to financial risk management, International Economic Review 39(4): 863–883.
  • Dinh et al. (2017) Dinh, L., Sohl-Dickstein, J. and Bengio, S. (2017). Density estimation using real NVP, 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, OpenReview.net.
    https://openreview.net/forum?id=HkpbnH9lx
  • Dovern and Manner (2020) Dovern, J. and Manner, H. (2020). Order-invariant tests for proper calibration of multivariate density forecasts, Journal of Applied Econometrics 35(4): 440–456.
  • Eilers and Marx (1996) Eilers, P. H. and Marx, B. D. (1996). Flexible smoothing with B-splines and penalties, Statistical Science 11(2): 89–121.
  • Gawlikowski et al. (2023) Gawlikowski, J., Tassi, C. R. N., Ali, M., Lee, J., Humt, M., Feng, J., Kruspe, A., Triebel, R., Jung, P., Roscher, R. et al. (2023). A survey of uncertainty in deep neural networks, Artificial Intelligence Review 56(Suppl 1): 1513–1589.
  • Genest and Rivest (2001) Genest, C. and Rivest, L.-P. (2001). On the multivariate probability integral transformation, Statistics & Probability Letters 53(4): 391–399.
  • Gneiting et al. (2007) Gneiting, T., Balabdaoui, F. and Raftery, A. E. (2007). Probabilistic forecasts, calibration and sharpness, Journal of the Royal Statistical Society Series B: Statistical Methodology 69(2): 243–268.
  • Gneiting and Raftery (2007) Gneiting, T. and Raftery, A. E. (2007). Strictly proper scoring rules, prediction, and estimation, Journal of the American Statistical Association 102(477): 359–378.
  • Gneiting and Resin (2023) Gneiting, T. and Resin, J. (2023). Regression diagnostics meets forecast evaluation: Conditional calibration, reliability diagrams, and coefficient of determination, Electronic Journal of Statistics 17(2): 3226–3286.
  • Gneiting et al. (2008) Gneiting, T., Stanberry, L. I., Grimit, E. P., Held, L. and Johnson, N. A. (2008). Assessing probabilistic forecasts of multivariate quantities, with an application to ensemble predictions of surface winds, Test 17: 211–235.
  • Guo et al. (2017) Guo, C., Pleiss, G., Sun, Y. and Weinberger, K. Q. (2017). On calibration of modern neural networks, in D. Precup and Y. W. Teh (eds), Proceedings of the 34th International Conference on Machine Learning, Vol. 70 of Proceedings of Machine Learning Research, PMLR, pp. 1321–1330.
  • Heinrich et al. (2021) Heinrich, C., Hellton, K. H., Lenkoski, A. and Thorarinsdottir, T. L. (2021). Multivariate postprocessing methods for high-dimensional seasonal weather forecasts, Journal of the American Statistical Association 116(535): 1048–1059.
  • Henzi et al. (2021) Henzi, A., Ziegel, J. F. and Gneiting, T. (2021). Isotonic Distributional Regression, Journal of the Royal Statistical Society Series B: Statistical Methodology 83(5): 963–993.
  • Hua et al. (2019) Hua, Y., Zhao, Z., Li, R., Chen, X., Liu, Z. and Zhang, H. (2019). Deep learning with long short-term memory for time series prediction, IEEE Communications Magazine 57(6): 114–119.
  • Klein et al. (2020) Klein, N., Kneib, T., Marra, G. and Radice, R. (2020). Bayesian mixed binary-continuous copula regression with an application to childhood undernutrition, Flexible Bayesian Regression Modelling, Elsevier, pp. 121–152.
  • Klein et al. (2021) Klein, N., Nott, D. J. and Smith, M. S. (2021). Marginally calibrated deep distributional regression, Journal of Computational and Graphical Statistics 30(2): 467–483.
  • Kneib (2013) Kneib, T. (2013). Beyond mean regression, Statistical Modelling 13(4): 275–303.
  • Knüppel et al. (2023) Knüppel, M., Krüger, F. and Pohle, M.-O. (2023). Score-based calibration testing for multivariate forecast distributions, arXiv preprint arXiv:2211.16362 .
  • Ko and Park (2013) Ko, S. I. and Park, S. Y. (2013). Multivariate density forecast evaluation: a modified approach, International Journal of Forecasting 29(3): 431–441.
  • Kuleshov et al. (2018) Kuleshov, V., Fenner, N. and Ermon, S. (2018). Accurate uncertainties for deep learning using calibrated regression, in J. Dy and A. Krause (eds), Proceedings of the 35th International Conference on Machine Learning, Vol. 80 of Proceedings of Machine Learning Research, PMLR, pp. 2796–2804.
  • Kull et al. (2017) Kull, M., Filho, T. S. and Flach, P. (2017). Beta calibration: a well-founded and easily implemented improvement on logistic calibration for binary classifiers, in A. Singh and J. Zhu (eds), Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Vol. 54 of Proceedings of Machine Learning Research, PMLR, pp. 623–631.
  • Lakshminarayanan et al. (2017) Lakshminarayanan, B., Pritzel, A. and Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles, in I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett (eds), Advances in Neural Information Processing Systems, Vol. 30, Curran Associates, Inc.
  • Marzouk et al. (2016) Marzouk, Y., Moselhy, T., Parno, M. and Spantini, A. (2016). Sampling via measure transport: An introduction, in R. Ghanem, D. Higdon and H. Owhadi (eds), Handbook of Uncertainty Quantification, Springer International Publishing, pp. 1–41.
  • Menéndez et al. (2014) Menéndez, P., Fan, Y., Garthwaite, P. H. and Sisson, S. A. (2014). Simultaneous adjustment of bias and coverage probabilities for confidence intervals, Computational Statistics & Data Analysis 70: 35–44.
  • Platt (1999) Platt, J. C. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods, Advances in large margin classifiers 10(3): 61–74.
  • Rippel and Adams (2013) Rippel, O. and Adams, R. P. (2013). High-dimensional probability estimation with deep density models, arXiv preprint arXiv:1302.5125 .
  • Rodrigues et al. (2018) Rodrigues, G., Prangle, D. and Sisson, S. A. (2018). Recalibration: A post-processing method for approximate Bayesian computation, Computational Statistics & Data Analysis 126: 53–66.
  • Rosenblatt (1952) Rosenblatt, M. (1952). Remarks on a multivariate transformation, The Annals of Mathematical Statistics 23(3): 470–472.
  • Rue and Held (2005) Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications, Chapman and Hall/CRC.
  • Smith (1985) Smith, J. (1985). Diagnostic checks of non-standard time series models, Journal of Forecasting 4(3): 283–291.
  • Song et al. (2019) Song, H., Diethe, T., Kull, M. and Flach, P. (2019). Distribution calibration for regression, in K. Chaudhuri and R. Salakhutdinov (eds), Proceedings of the 36th International Conference on Machine Learning, Vol. 97 of Proceedings of Machine Learning Research, PMLR, pp. 5897–5906.
  • Torres et al. (2024) Torres, R., Nott, D. J., Sisson, S. A., Rodrigues, T., Reis, J. and Rodrigues, G. (2024). Model-free local recalibration of neural networks, arXiv preprint arXiv:2403.05756 .
  • United Nations (2015) United Nations (2015). Transforming our world: the 2030 agenda for sustainable development.
    https://sdgs.un.org/2030agenda
  • Ward et al. (2022) Ward, D., Cannon, P., Beaumont, M., Fasiolo, M. and Schmon, S. (2022). Robust neural posterior estimation and statistical model criticism, in S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho and A. Oh (eds), Advances in Neural Information Processing Systems, Vol. 35, Curran Associates, Inc., pp. 33845–33859.
  • Wehenkel et al. (2024) Wehenkel, A., Gamella, J. L., Sener, O., Behrmann, J., Sapiro, G., Cuturi, M. and Jacobsen, J.-H. (2024). Addressing misspecification in simulation-based inference through data-driven calibration, arXiv preprint arXiv:2405.08719 .
  • Yao et al. (2023) Yao, J.-E., Tsao, L.-Y., Lo, Y.-C., Tseng, R., Chang, C.-C. and Lee, C.-Y. (2023). Local implicit normalizing flow for arbitrary-scale image super-resolution, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1776–1785.
  • Zhao et al. (2020) Zhao, S., Ma, T. and Ermon, S. (2020). Individual calibration with randomized forecasting, in H. Daumé III and A. Singh (eds), Proceedings of the 37th International Conference on Machine Learning, Vol. 119 of Proceedings of Machine Learning Research, PMLR, pp. 11387–11397.
  • Zhou et al. (2021) Zhou, T., Li, Y., Wu, Y. and Carlson, D. (2021). Estimating uncertainty intervals from collaborating networks, Journal of Machine Learning Research 22(257): 1–47.
  • Ziegel and Gneiting (2014) Ziegel, J. F. and Gneiting, T. (2014). Copula calibration, Electronic Journal of Statistics 8: 2619–2638.

Appendix

Appendix A Pseudo Code

Pseudo code for the NF approach is given in Algorithm 1.

A: Calculate normalized PIT values
Input: Validation set $\mathcal{D}_{\text{val}}=\{(\bm{y}_{\text{val}}^{(i)},\bm{x}_{\text{val}}^{(i)}),\ i=1,\dots,n_{\text{val}}\}$; CDF-valued predictive distribution $\widehat{F}(\bm{Y}\mid\bm{X})$ with marginals $\widehat{F}_{1}(Y_{1}\mid\bm{X}),\dots,\widehat{F}_{d}(Y_{d}\mid\bm{X})$;
for $i\leftarrow 1$ to $n_{\text{val}}$ do
       for $l\leftarrow 1$ to $d$ do
             Sample $\nu^{(i)}_{l}\sim\operatorname{U}(0,1)$;
             Set $p^{(i)}_{l}=\widehat{F}_{l}\left(y_{\text{val},l}^{(i)-}\mid\bm{x}_{\text{val}}^{(i)}\right)+\nu_{l}^{(i)}\left[\widehat{F}_{l}\left(y_{\text{val},l}^{(i)}\mid\bm{x}_{\text{val}}^{(i)}\right)-\widehat{F}_{l}\left(y_{\text{val},l}^{(i)-}\mid\bm{x}_{\text{val}}^{(i)}\right)\right]$
       end for
      Let $\bm{p}_{N}^{(i)}=\left(\Phi^{-1}\left(p^{(i)}_{1}\right),\dots,\Phi^{-1}\left(p^{(i)}_{d}\right)\right)$ be the vector of normalized PIT values
end for
B: Train normalizing flow
Input: Data $\left\{(\bm{p}_{N}^{(i)},\bm{x}^{(i)}_{\text{val}}),\ i=1,\dots,n_{\text{val}}\right\}$; an invertible map $T_{\bm{\zeta}}(\bm{z}\mid\bm{x})$;
Set $\hat{\bm{\zeta}}=\arg\max_{\bm{\zeta}}\prod_{i=1}^{n_{\text{val}}}\rho\left(T^{-1}_{\bm{\zeta}}(\bm{p}_{N}^{(i)}\mid\bm{x}_{\text{val}}^{(i)})\right)\left|\det J_{T^{-1}_{\bm{\zeta}}}(\bm{p}_{N}^{(i)}\mid\bm{x}_{\text{val}}^{(i)})\right|$;
C: Sample from the recalibrated model
Input: An invertible map $T_{\bm{\zeta}}(\bm{z}\mid\bm{x})$ with trained parameter $\hat{\bm{\zeta}}$; an observation $\bm{x}_{\text{obs}}$; number of samples to be drawn $n$;
for $j\leftarrow 1$ to $n$ do
       Sample $\bm{z}^{(j)}=\left(z^{(j)}_{1},\dots,z^{(j)}_{d}\right)\sim\operatorname{\mathcal{N}}_{d}(0,I_{d})$;
       Set $\tilde{\bm{p}}_{N}^{(j)}=\left(\tilde{p}^{(j)}_{N,1},\dots,\tilde{p}^{(j)}_{N,d}\right)=T_{\hat{\bm{\zeta}}}(\bm{z}^{(j)}\mid\bm{x}_{\text{obs}})$;
       for $l\leftarrow 1$ to $d$ do
             Set $\tilde{y}^{(j)}_{l}=\widehat{F}^{-1}_{l}\left(\Phi\left(\tilde{p}^{(j)}_{N,l}\right)\mid\bm{x}_{\text{obs}}\right)$;
       end for
      Set $\tilde{\bm{y}}^{(j)}=\left(\tilde{y}^{(j)}_{1},\dots,\tilde{y}^{(j)}_{d}\right)$;
end for
Return $\left\{\tilde{\bm{y}}^{(j)},\ j=1,\dots,n\right\}$;
Algorithm 1: Recalibration with normalizing flows

Appendix B Simulations

This section contains detailed results for the simulation studies.

B.1 Bivariate copula model

To illustrate how our proposed approach handles different miscalibrated forecasts, we reanalyze the illustrative example from Ziegel and Gneiting (2014). The true data generating process (DGP) is a bivariate distribution with normal margins and a Gumbel copula with parameters $\bm{\theta}=\left(\mu_{1},\sigma_{1}^{2},\mu_{2},\sigma_{2}^{2},\tau\right)$, where $\mu_{j}$ is the mean and $\sigma_{j}^{2}$ the variance of the $j$-th margin, $j=1,2$, and $\tau$ is Kendall's $\tau$ parameterizing the Gumbel copula. $\sigma_{1}^{2}=1$ and $\mu_{2}=0$ are fixed, and the remaining parameters depend on a bivariate vector of covariates $\bm{x}=(x_{1},x_{2})$ following independent beta distributions, $x_{1}\sim\operatorname{Beta}(2,5)$ and $x_{2}\sim\operatorname{Beta}(5,2)$. Under the true DGP, $\mu_{1}=2-x_{1}$, $\sigma_{2}^{2}=x_{2}^{-1}$ and $\tau=\frac{x_{1}+x_{2}}{2}$. All forecasts considered specify a Gumbel copula with Gaussian marginals, but potentially misspecify the parameter vector $\bm{\theta}\equiv\bm{\theta}(\bm{x})$. We consider all eight possible combinations of the following three fallacies.

  • The forecast distribution for the first margin is either correctly specified (T) or biased, $\mu_{1}=0.8(2-x_{1})$ (F).

  • The forecast distribution for the second margin is either correctly specified (T) or underdispersed, $\sigma_{2}^{2}=0.8x_{2}^{-1}$ (F).

  • Kendall’s $\tau$ is either correctly specified (T) or underestimated, $\tau=0.6\,\frac{x_{1}+x_{2}}{2}$ (F).

As in Ziegel and Gneiting (2014), we label each forecast with a combination of three letters: the first and second letters indicate whether the first and second margins, respectively, are correctly specified (T) or misspecified (F), and the third letter does the same for the copula. For example, FFT denotes the forecast with both margins misspecified but a correctly specified dependence structure.
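The true DGP can be simulated as in the following sketch, which assumes the copula classes available in statsmodels; the seed and vectorization choices are illustrative.

# Minimal sketch of the true DGP, assuming statsmodels' copula module.
import numpy as np
from scipy.stats import beta, norm
from statsmodels.distributions.copula.api import GumbelCopula

rng = np.random.default_rng(1)
n = 4000

x1 = beta(2, 5).rvs(n, random_state=rng)
x2 = beta(5, 2).rvs(n, random_state=rng)

# True parameters: mu1 = 2 - x1, sigma2^2 = 1/x2, tau = (x1 + x2)/2.
# For the Gumbel copula Kendall's tau = 1 - 1/theta, so theta = 1/(1 - tau).
mu1, sd2, tau = 2 - x1, np.sqrt(1 / x2), (x1 + x2) / 2

y = np.empty((n, 2))
for i in range(n):  # theta varies with x, so sample one pair at a time
    u = GumbelCopula(theta=1 / (1 - tau[i])).rvs(1, random_state=rng)[0]
    y[i, 0] = norm.ppf(u[0], loc=mu1[i], scale=1.0)  # sigma1^2 = 1 fixed
    y[i, 1] = norm.ppf(u[1], loc=0.0, scale=sd2[i])  # mu2 = 0 fixed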

We consider a validation set of $n=4{,}000$ samples and evaluate performance on a hold-out test set of another $4{,}000$ samples. Because the forecasting models are fixed, no training set is needed here. We compare both the KNN approach and the NF approach. For KNN we use the $5\%$ of validation samples nearest in covariate space under the Euclidean norm, as sketched below.
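The KNN recalibration step can be sketched as follows. Here knn_recalibrate is a hypothetical helper, p_val holds the marginal PIT vectors of the validation set under the base model, and F_hat_inv is the same placeholder quantile interface as in the Algorithm 1 sketch above.

# Sketch of the KNN variant: reuse the PIT vectors of the validation
# points nearest to x_obs and push them through the base model's
# marginal quantile functions at x_obs.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_recalibrate(p_val, x_val, F_hat_inv, x_obs, frac=0.05):
    k = max(1, int(frac * len(x_val)))            # 5% nearest samples
    nn = NearestNeighbors(n_neighbors=k).fit(x_val)
    _, idx = nn.kneighbors(np.atleast_2d(x_obs))  # Euclidean by default
    u = p_val[idx[0]]                             # neighbours' PIT vectors
    d = u.shape[1]
    return np.column_stack([F_hat_inv(l, u[:, l], x_obs) for l in range(d)])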

Figure B.3 summarises the results. KNN and NF perform very similarly. Both approaches achieve probability calibration of the margins, as summarized by histograms of the univariate PIT values (columns 1 and 2 of Figure B.3), copula calibration (column 3 of Figure B.3) and Kendall calibration, as indicated by the Kendall plot (column 4 of Figure B.3). The forecast TTT is optimal in the sense that it matches the true DGP exactly, and neither NF nor KNN appears to degrade this forecast.

Figure B.3: Simulations. Performance across the 8 different forecast specifications (rows) for the uncalibrated base model (green), NF recalibration (blue), and KNN recalibration (red). The first two columns show histograms of the marginal PIT values, the third column shows histograms of the CopPIT values, and the fourth column shows the Kendall plot.

B.2 Twisted Gaussians

We consider the following DGP inspired by a related example in Rodrigues et al. (2018):

$X \sim \operatorname{\mathcal{N}}(0,1)$
$Y_{1}\mid X \sim \operatorname{\mathcal{N}}(0,1)$
$Y_{2} = Y_{1}+XY_{1}^{2},$

with $\bm{Y}=(Y_{1},Y_{2})$, from which we draw $n_{\text{train}}=5{,}000$ samples to train the base model and $n_{\text{val}}=5{,}000$ samples for the validation set. Note that $Y_{2}$ is deterministic given $Y_{1}$ and $X$. The base model consists of two independently fitted univariate homoscedastic Gaussian linear models $y_{j}\mid x\sim\operatorname{\mathcal{N}}(\beta_{0j}+x\beta_{1j},\sigma_{j}^{2})$, $j=1,2$; a simulation sketch is given below. Hence, the first margin $Y_{1}\mid X$ of the base model is approximately probability calibrated, while the second margin and the dependence structure are miscalibrated.
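A short sketch of this DGP and the base-model fit follows; the seed and printed summary are illustrative.

# Sketch of the twisted-Gaussians DGP and the misspecified base model.
import numpy as np

rng = np.random.default_rng(0)
n_train = 5000

x = rng.standard_normal(n_train)
y1 = rng.standard_normal(n_train)          # Y1 | X ~ N(0, 1)
y2 = y1 + x * y1**2                        # Y2 is deterministic given Y1, X

# Base model: two independent homoscedastic Gaussian linear models
# y_j | x ~ N(beta_0j + x beta_1j, sigma_j^2), fitted by least squares.
for j, y in enumerate((y1, y2), start=1):
    b1, b0 = np.polyfit(x, y, 1)           # slope, intercept
    s2 = np.mean((y - (b0 + b1 * x)) ** 2)
    print(f"margin {j}: beta0={b0:.3f}, beta1={b1:.3f}, sigma^2={s2:.3f}")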

Figure B.4: Simulation. Twisted Gaussians. A), B) PIT values for $Y_{2}\mid X$ and CopPIT values, respectively, for the base model (green), NF (blue) and KNN (red), calculated on a hold-out test set. C) Estimated densities for $p(y_{2}\mid x=1)$ under the three models (again indicated by colour); the dashed black line gives the density under the true data generating process. D) PIT values for $Y_{2}\mid x=1$. E) Samples from the predictive distribution $p(y_{1},y_{2}\mid x=1)$ under the NF model (blue diamonds) and the KNN model (red dots); for reference, the contour plot of the base model is shown in green. Under the true model all samples should lie on the dashed black line. F) Scatter plot of the PIT values from the validation set with respect to the base model (grey). The PIT values corresponding to the samples shown in panel E are given in colour.

Figures B.4A and B.4B show marginal PIT values for $Y_{2}\mid X$ and CopPIT values for $(Y_{1},Y_{2})\mid X$, respectively, calculated on $n_{\text{test}}=1{,}000$ hold-out samples from the true DGP; both indicate that the KNN and NF approaches result in multivariate calibrated models. Moreover, our approach achieves not only global calibration but also local calibration, in the following sense. Conditional on $x=1$, the base model for $Y_{2}\mid X$ is grossly misspecified: it assumes a Gaussian distribution, while the true predictive distribution is heavily skewed and bounded from below by $-0.25$ (dashed black line in Figure B.4C). The recalibrated models match the shape of the true distribution. Under the NF approach an arbitrarily large sample from the calibrated model can be generated, while the KNN approach is restricted to a fixed sample size depending on $n_{\text{val}}$. Figure B.4D shows PIT values for $Y_{2}\mid x=1$ calculated from $\tilde{n}_{\text{test}}=1{,}000$ samples, which are close to uniformity for both NF and KNN.

The bivariate distribution $(Y_{1},Y_{2})\mid x=1$ is degenerate, as all samples from the true model satisfy $Y_{1}^{2}+Y_{1}-Y_{2}=0$. Figure B.4E shows samples from the calibrated models for $(Y_{1},Y_{2})\mid x=1$, which are virtually indistinguishable between NF and KNN and very close to the true distribution, drastically improving on the base model with its independent margins. Finally, Figure B.4F shows how the joint marginal PIT values $\bm{p}^{(i)}$ from the base model on the validation set (shown in grey) encapsulate the complex dependence structure of the true model. The PIT values used by KNN and NF to generate the samples shown in panel E are marked in colour. Note again that KNN is restricted to selecting PIT values from validation points with covariates sufficiently close to $x\approx 1$, while NF generates samples from an approximation to $(p_{1},p_{2})\mid x=1$; the NF approach could therefore hallucinate information not supported by the data, but it allows an arbitrarily large sample to be drawn from the recalibrated model.

Appendix C Additional Results for the Applications

C.1 Currency exchange rates

Figure C.5 shows histograms of the univariate PIT values for the five currencies. Under the base model, none of the margins is probability calibrated, and the type of miscalibration differs across currencies. Under the recalibrated model, all margins are probability calibrated.

Figure C.5: Currency exchange. Histograms of the univariate PIT values for the one-day-ahead forecasts for the five currencies (A)–(E) under the base model (green) and the recalibrated model (red).

C.2 Childhood malnutrition

Figure C.6 shows histograms of the PIT and CopPIT values for the base and recalibrated models, indicating that NF improves the calibration of the regression model. Figure C.7 shows how the main effect for dist differs between the base and the recalibrated model.

Figure C.6: Malnutrition. PIT values for wasting (A) and CopPIT values (B), respectively, for the base model (green) and the recalibrated model (blue).
Figure C.7: Malnutrition. Effect of the recalibration for the different districts. Shown is the relative change in the main effect for dist between the base model and the recalibrated model. Positive values indicate that the estimated risk is lower under the recalibrated model than under the base model, and the magnitude of the change is indicated by the shade of the region.