
Conformal Methods for Quantifying Uncertainty
in Spatiotemporal Data: A Survey

Sophia Sun
University of California, San Diego
[email protected]
Abstract

Machine learning methods are increasingly used in high-risk settings such as healthcare, transportation, and finance. In these settings, it is important that a model produce calibrated uncertainty that reflects its own confidence, so that failures can be anticipated and avoided. In this paper we survey recent works on uncertainty quantification (UQ) for deep learning, focusing on the distribution-free Conformal Prediction method for its mathematical properties and wide applicability. We cover the theoretical guarantees of conformal methods, introduce techniques that improve calibration and efficiency for UQ in the context of spatiotemporal data, and discuss the role of UQ in safe decision making.

1 Introduction

Let $\mathcal{D}=(z_1,\ldots,z_n)$ be a dataset of size $n$. We denote by $z_i=(x_i,y_i)$ a sample of an input and output pair drawn from the distribution $\mathcal{P}$, where $i$ is the data index. Let the input space $\mathbf{X}$ and target space $\mathbf{Y}$ be two measurable spaces; their Cartesian product $\mathbf{Z}=\mathbf{X}\times\mathbf{Y}$ is the sample space. Consider the following problem: given a desired coverage rate $1-\alpha\in(0,1)$, we want to construct a prediction region $\Gamma^{1-\alpha}:\mathbf{X}\rightarrow\{\text{subsets of }\mathbf{Y}\}$ such that for a new data pair $(X,Y)\sim\mathcal{P}$, we have

$$\mathbb{P}_{(X,Y)\sim\mathcal{P}}(Y\in\Gamma^{1-\alpha}(X))\geq 1-\alpha \quad (1)$$

In the case where $Y\in\Gamma^{1-\alpha}(X)$, we say the prediction region covers $Y$. We say the prediction region is valid if it satisfies Equation 1; hence, Equation 1 is also known as the validity condition or coverage guarantee. A prediction region that we are familiar with in real life is the hurricane forecast (Figure 1): in the spatiotemporal setting, uncertainty quantification can often be visualized as a "cone of uncertainty".

Figure 1: The "cone of uncertainty" for hurricane forecasts. molina2021striving

Quantifying the uncertainty of a model's predicted outcome is critical for risk assessment and decision making. Prediction regions allow us to bound random variables of interest and to know, with a specified probability, where the observation could fall. Consider the case of pandemic forecasts, for example: compared to a predicted number of future cases without a confidence metric, a prediction interval with $99\%$ confidence is much more useful for local health facilities, as they can safely plan for the worst-case scenario.

The problem of building a prediction region can be approached in many ways. One may model the distribution of the data via maximum likelihood or Bayesian methods, or use a distribution-free method to capture the uncertainty - a method is distribution-free when it creates valid prediction sets without making distributional assumptions on the data. Among the latter, Conformal Prediction (CP) is a straightforward algorithm for generating prediction sets around any underlying prediction model.

This survey is organized as follows. We provide a brief overview of uncertainty quantification methods in Section 2. Conformal prediction vovk2005algorithmic allows us to satisfy the validity condition (Equation 1) with no assumptions on the common distribution $\mathcal{P}$, hence the name distribution-free uncertainty quantification. Section 3 introduces the algorithm and theory of conformal prediction, recent advancements for improved calibration and efficiency, and methods for applying conformal prediction to spatiotemporal data. In Section 4 we present methods that utilize conformal prediction for safe decision making, and we discuss the limits of distribution-free UQ methods in Section 4.2. We conclude in Section 5 with several future directions for research on conformal algorithms and safe decision making.

2 Brief Overview of UQ methods

Bayesian Uncertainty Quantification

Bayesian approaches represent uncertainty by estimating a distribution over the model parameters given data, and then marginalizing over these parameters to form a predictive distribution. Bayesian neural networks (BNN) mirikitani2009recursive ; mackay1992bayesian were proposed to learn through Bayesian inference with neural networks. As modern neural networks often contain millions of parameters, the posterior over these parameters becomes highly non-convex, rendering exact inference intractable. Modern approaches perform approximate Bayesian inference by Markov chain Monte Carlo (MCMC) sampling welling2011bayesian ; neal2012bayesian ; chen2014stochastic or variational inference (VI) graves2011practical ; kingma2013auto ; kingma2015variational ; blundell2015weight ; louizos2017multiplicative . In practice, stochastic gradient MCMC methods are prone to approximation errors and can be difficult to tune mandt2017stochastic . Variational inference methods have seen strong performance on moderately sized networks, but are empirically found to be difficult to train on larger architectures he2016deep ; blier2018description . MC-dropout methods gal2016theoretically ; gal2017concrete interpret dropout at test time as approximate variational Bayesian inference. MC-dropout methods have found success in efficiently approximating the posterior kendall2017uncertainties , but calibrating the resulting distribution has been found to be difficult alaa2020frequentist . We refer the readers to recent surveys such as abdar2021review for the rich body of work in Bayesian uncertainty quantification.

Mixed / Re-calibration Methods

In practice, Bayesian uncertainty estimates often fail to capture the true data distribution due to the intractability of exact Bayesian inference lakshminarayanan2017simple ; zadrozny2001obtaining . There exists a line of research exploring re-calibration of neural network models: lakshminarayanan2017simple proposed using ensembles of several networks for enhanced calibration, incorporating an adversarial loss function where possible. guo2017calibration proposed temperature scaling, a procedure which uses a validation set to rescale the logits of deep neural network outputs for enhanced calibration. kuleshov2018accurate and kull2019beyond propose calibrated regression using similar rescaling techniques, the latter via Dirichlet calibration. minderer2021revisiting revisits the calibration of modern neural network architectures.

Frequentist and Distribution-free Uncertainty Quantification

Frequentist UQ methods emphasize robustness against variations in the data. These approaches rely either on resampling the data or on learning an interval bound to encompass the dataset. Among them are ensemble methods such as the bootstrap efron2016 ; alaa2020frequentist and model ensembles lakshminarayanan2017simple ; pearce2018high ; jackknife+after ; interval prediction methods, including quantile regression tagasovska2019single ; gasthaus2019probabilistic ; takeuchi2006nonparametric and interval regression through proper scoring rules kivaranovic2020adaptive ; wu2021quantifying ; rank tests rankedtest1 ; rankedtest2 ; and permutation tests permtest . Many of these frequentist methods satisfy the validity condition asymptotically and can be categorized as distribution-free UQ techniques, as they are (1) agnostic to the model and (2) agnostic to the data distribution.

Conformal Prediction

Conformal prediction is an important example of a distribution-free UQ method - a key difference between conformal methods and the ones introduced above is that conformal prediction achieves the validity condition in finite samples. Building on their studies of finite random sequences, Vladimir Vovk and Glenn Shafer coined the name and developed the core theory for online and split conformal prediction in their 2005 book Algorithmic Learning in a Random World vovk2005algorithmic . Conformal prediction has since become popular because of its simplicity and low computational cost shafer2008tutorial ; angelopoulos2021gentle and has shown promise in many realistic applications eklund2015application ; angelopoulos2022image ; lei2021conformal . Current research on conformal prediction can be roughly split into three branches, shown in Figure 2. The first develops conformal prediction algorithms for different machine learning methods, such as quantile regression romano2019conformalized ; sesia2020comparison , k-Nearest Neighbors papadopoulos2011regression , density estimators izbicki2019flexible , survival analysis teng2021t ; candes2021conformalized , risk control bates2021distribution , et cetera. The second works on relaxing the exchangeability assumption on the data distribution so that guarantees can be made in the context of, for example, distribution shifts tibshirani2019conformal ; podkopaev2021distribution ; barber2022conformal ; hu2020distribution ; gibbs2021adaptive . The third improves the efficiency (i.e., how sharp the prediction regions are while validity holds) of conformal prediction messoudi2020multitarget ; messoudi2021copula ; romano2020classification ; yang2021finite ; sesia2020comparison . The algorithms for time series that we focus on in this survey generally fall into the latter two branches.

Figure 2: Venn diagram of recent research on conformal prediction.

Algorithm | Data | Distribution Assumption | Coverage Guarantee
(split) Conformal vovk2005algorithmic | regression | exchangeable | $1-\alpha$
EnbPI xu2021conformal | single time series | strongly mixing errors | $\approx 1-\alpha$
ACI gibbs2021adaptive | single time series | none (online) | asymptotically $1-\alpha$
CF-RNN Schaar2021conformaltime | independent time series | exchangeable time series | $1-\alpha$
Copula-RNN (ours) | independent time series | exchangeable time series | $1-\alpha$
Table 1: A summary of the algorithms in Sections 3.2 - 3.4.

3 Conformal Prediction Algorithms

3.1 Conformal Prediction

This section introduces the necessary notation and definitions, and formally presents the idea of conformal prediction.

Recall that the goal of conformal prediction is to construct a prediction region $\Gamma^{1-\alpha}:\mathbf{X}\rightarrow\{\text{subsets of }\mathbf{Y}\}$ for a new data pair $(X,Y)\sim\mathcal{P}$ and a desired coverage rate $1-\alpha\in(0,1)$ such that $\Gamma^{1-\alpha}(X)$ is valid, i.e.

$$\mathbb{P}(Y\in\Gamma^{1-\alpha}(X))\geq 1-\alpha \quad (2)$$
Validity and Efficiency

There is a trade-off between validity and efficiency, as one can always set $\Gamma^{1-\alpha}$ to be infinitely large to satisfy Equation 2. In practice, we want the measure of the confidence region (e.g., its area or length) to be as small as possible while the validity condition holds. This is known as the efficiency or sharpness of the confidence region.

Training and calibration datasets

Let $\mathcal{D}=(z_1,\ldots,z_n)$ be a dataset of size $n$. The inductive conformal prediction method begins by splitting $\mathcal{D}$ into two disjoint subsets: the proper training set $\mathcal{D}_{train}=(z_1,\ldots,z_m)$ of size $m<n$ and the calibration set $\mathcal{D}_{cal}$ of size $n-m$.

Nonconformity score

An important component of conformal prediction is the nonconformity score $\mathcal{S}:\mathbf{Z}^m\times\mathbf{Z}\rightarrow\mathbb{R}$, a function that captures how well a sample $z$ conforms to the proper training set. Informally, a high value of $\mathcal{S}((x,y),\mathcal{D}_{train})$ indicates that the point $(x,y)$ is atypical relative to the points in $\mathcal{D}_{train}$. A common choice of the nonconformity score $\mathcal{S}$ in a regression setting is the absolute residual on the test sample $(x,y)$:

$$\mathcal{S}((x,y),\mathcal{D}_{train}):=|y-\hat{f}(x)| \quad (3)$$

where $\hat{f}:\mathbf{X}\rightarrow\mathbf{Y}$ is a prediction model trained on the proper training set $\mathcal{D}_{train}$. In the following sections, we write the nonconformity score of a data point $z$ given a prediction model $\hat{f}$ as $\mathcal{S}(z,\hat{f})$ for simplicity.

Algorithm

The conformal prediction algorithm consists of two parts: the training process and the calibration process. During the training process, we fit a machine learning model, denoted $\hat{f}$, to the training set $\mathcal{D}_{train}$. The validity guarantee of the conformal algorithm holds regardless of the underlying model $\hat{f}$. In the calibration process, we calculate the nonconformity score $s_i=\mathcal{S}(z_i,\hat{f})$ for each $z_i\in\mathcal{D}_{cal}$. We can then create the confidence region based on the nonconformity scores. We outline the full algorithm in Algorithm 1.

Input: Dataset $\mathcal{D}=\{(x_i,y_i)\}_{i=1,\ldots,n}$, regression algorithm $\mathcal{A}$, test point $x_{n+1}$, target confidence level $1-\alpha$
Output: Prediction region $\Gamma^{1-\alpha}(x_{n+1})$.
1: Randomly split dataset $\mathcal{D}$ into training and calibration datasets $\mathcal{D}=\mathcal{D}_{train}\cup\mathcal{D}_{cal}$, where $|\mathcal{D}_{train}|=m$ and $|\mathcal{D}_{cal}|=n-m$;
2: Train machine learning model $\hat{f}=\mathcal{A}(\mathcal{D}_{train})$;
3: Initialize nonconformity scores $\mathbf{s}_{cal}=\{\}$;
4: for $(x_i,y_i)\in\mathcal{D}_{cal}$ do
5:   $\mathbf{s}_{cal}\leftarrow\mathbf{s}_{cal}\cup\{\mathcal{S}((x_i,y_i),\hat{f})\}$;
6: end for
7: Calculate the $(1-\alpha)$-th quantile $q_{1-\alpha}$ of $\mathbf{s}_{cal}\cup\{\infty\}$;
8: return $\Gamma^{1-\alpha}(x_{n+1})=\{y:\mathcal{S}((x_{n+1},y),\hat{f})\leq q_{1-\alpha}\}$
Algorithm 1: Conformal Prediction
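To make Algorithm 1 concrete, the following is a minimal sketch in Python/NumPy (ours, for illustration only; the toy data and the crude binned-mean regressor stand in for an arbitrary $\mathcal{A}$):

```python
import numpy as np

def conformal_quantile(scores, alpha):
    # (1 - alpha) quantile of scores ∪ {inf}: the ceil((n+1)(1-alpha))-th
    # smallest calibration score (line 7 of Algorithm 1).
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.inf if k > n else np.sort(scores)[k - 1]

rng = np.random.default_rng(0)
x = rng.uniform(0, 6, 2000)
y = np.sin(x) + 0.2 * rng.normal(size=x.size)

# Split into proper training and calibration sets.
x_tr, y_tr, x_cal, y_cal = x[:1000], y[:1000], x[1000:], y[1000:]

# Any model works; a binned-mean regressor stands in for A(D_train).
bins = np.linspace(0, 6, 31)
idx_tr = np.digitize(x_tr, bins)
means = np.array([y_tr[idx_tr == b].mean() for b in range(1, len(bins))])
f_hat = lambda q: means[np.clip(np.digitize(q, bins) - 1, 0, len(means) - 1)]

# Calibration with absolute-residual nonconformity scores (Equation 3).
q_hat = conformal_quantile(np.abs(y_cal - f_hat(x_cal)), alpha=0.1)

# Prediction region Gamma(x) = [f_hat(x) - q_hat, f_hat(x) + q_hat].
x_new = 2.5
print(f_hat(x_new) - q_hat, f_hat(x_new) + q_hat)
```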
Theoretical Guarantees

We will prove that the prediction regions generated by Algorithm 1 are valid (Equation 1) if the dataset $\mathcal{D}\sim\mathcal{P}$ is exchangeable (Assumption 1).

Assumption 1 (exchangeability).

Assume that the data points $z_i\in\mathcal{D}$ and the test point $z_{n+1}$ are exchangeable. Formally, the $z_i$ are exchangeable if any permutation leads to the same joint distribution, that is,

$$(z_1,\ldots,z_{n+1})\stackrel{d}{=}(z_{\pi(1)},\ldots,z_{\pi(n+1)})$$

for an arbitrary permutation $\pi$ over $\{1,\ldots,n+1\}$.

Define the empirical $p$-quantile of a set of nonconformity scores $s_{1:n}=\{s_1,\ldots,s_n\}$ as:

$$Q(p,s_{1:n}):=\inf\Big\{s':\Big(\frac{1}{n}\sum_{i=1}^{n}\mathbb{1}_{s_i\leq s'}\Big)\geq p\Big\} \quad (4)$$
Lemma 1 (Lemma 1 in tibshirani2019conformal ).

If $s_1,\ldots,s_{n+1}$ are exchangeable random variables, then for any $p\in(0,1)$, we have

$$\mathbb{P}(s_{n+1}\leq Q(p,s_{1:n}\cup\{\infty\}))\geq p$$

Furthermore, if ties within $s_{1:n+1}$ occur with probability zero, then the above probability is upper bounded by $p+1/(n+1)$.

Proof.

Let $q=Q(p,s_{1:n}\cup\{\infty\})$. Note that if $s_{n+1}>q$, then $Q(p,s_{1:n}\cup\{\infty\})=Q(p,s_{1:n}\cup\{s_{n+1}\})$: replacing $\infty$ with $s_{n+1}$ leaves the quantile unchanged. This fact can be written as:

$$s_{n+1}>Q(p,s_{1:n}\cup\{\infty\})\Longleftrightarrow s_{n+1}>Q(p,s_{1:n+1})$$

and equivalently

$$s_{n+1}\leq Q(p,s_{1:n}\cup\{\infty\})\Longleftrightarrow s_{n+1}\leq Q(p,s_{1:n+1})$$

The latter event holds exactly when $s_{n+1}$ is among the $\lceil p(n+1)\rceil$ smallest of $s_{1:n+1}$. By exchangeability, the rank of $s_{n+1}$ among $s_{1:n+1}$ is uniformly distributed. Therefore,

$$\mathbb{P}(s_{n+1}\leq Q(p,s_{1:n+1}))=\mathbb{P}(s_{n+1}\leq Q(p,s_{1:n}\cup\{\infty\}))=\frac{\lceil p(n+1)\rceil}{n+1}\geq p$$

which proves the lower bound. When there are no ties, we have $\frac{\lceil p(n+1)\rceil}{n+1}\leq p+\frac{1}{n+1}$, which proves the upper bound. ∎
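As a quick sanity check of Lemma 1 (our own illustration, not from tibshirani2019conformal ), one can verify the bound by simulation with any exchangeable score distribution:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, trials = 99, 0.9, 20000
hits = 0
for _ in range(trials):
    s = rng.exponential(size=n + 1)  # i.i.d. draws are in particular exchangeable
    # Q(p, s_{1:n} ∪ {inf}): the ceil(p(n+1))-th smallest of the augmented set.
    q = np.sort(np.append(s[:n], np.inf))[int(np.ceil(p * (n + 1))) - 1]
    hits += s[n] <= q
print(hits / trials)  # ≈ ceil(p(n+1))/(n+1) = 0.90, inside [p, p + 1/(n+1)]
```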

Theorem 1 (Validity of the conformal prediction algorithm vovk2005algorithmic ; lei2018distribution ).

Let $\mathcal{D}\sim\mathcal{P}$ be an exchangeable dataset. Then for any score function $\mathcal{S}$ and any significance level $\alpha\in(0,1)$, the prediction set given by Algorithm 1,

$$\Gamma^{1-\alpha}(x_{n+1})=\{y:\mathcal{S}((x_{n+1},y),\hat{f})\leq Q(1-\alpha,\mathbf{s}_{cal}\cup\{\infty\})\}$$

satisfies the validity condition, i.e.

$$\mathbb{P}(y_{n+1}\in\Gamma^{1-\alpha}(x_{n+1}))\geq 1-\alpha$$
Proof.

Note that if $\mathcal{D}$ is exchangeable, so is $\mathcal{D}_{cal}$, and hence so is the set of nonconformity scores $\mathbf{s}_{cal}$ together with the test score. Therefore, by Lemma 1 we have:

$$\mathbb{P}(y_{n+1}\in\Gamma^{1-\alpha}(x_{n+1}))=\mathbb{P}\left(\mathcal{S}((x_{n+1},y_{n+1}),\hat{f})\leq Q(1-\alpha,\mathbf{s}_{cal}\cup\{\infty\})\right)\geq 1-\alpha$$

∎

It is easy to show that the upper bound in Lemma 1 also applies to the prediction set in Theorem 1. More precisely, if ties between the nonconformity scores $\mathbf{s}_{cal}$ occur with probability zero, then

$$1-\alpha\leq\mathbb{P}(y_{n+1}\in\Gamma^{1-\alpha}(x_{n+1}))\leq 1-\alpha+1/(n+1)$$
Remark on exchangeability

Exchangeability is a weaker assumption on the data than the i.i.d. assumption, and is hence reasonable for many application scenarios. In later sections of this survey we explore conformal algorithms that relax the exchangeability assumption and adapt to data distribution shifts in an online fashion.

Split vs. full conformal regression

The procedure described above and in Algorithm 1 is generally known as inductive conformal prediction or split conformal prediction. It was developed on the basis of full conformal prediction, where we do not split the data into proper training and calibration sets, but instead perform $n$ folds of hold-one-out training and use each held-out sample for calibration. The full conformal prediction method requires training the model $n$ times, and is hence unsuitable for settings where model training is computationally expensive, such as deep learning. The scope of this survey is limited to the regression setting, but conformal prediction can be used for many machine learning tasks, including but not limited to classification, image segmentation, and outlier detection. We refer readers to vovk2005algorithmic ; angelopoulos2021gentle for a more thorough introduction.

Figure 3: A visualization of the conformal prediction interval in a 1-D regression setting, from angelopoulos2021gentle .
Adaptive prediction sets

One may notice that if we choose the nonconformity score $\mathcal{S}((x,y),\hat{f})=|y-\hat{f}(x)|$, then the conformal prediction algorithm (Algorithm 1) yields a prediction set of the same size for every input $x$. From a UQ perspective, this is undesirable - we want the procedure to return larger sets for harder inputs and smaller sets for easier inputs. Adaptability is an active line of research within the conformal prediction community, with approaches including, but not limited to:

  • Normalized nonconformity scores lei2018distribution . The most common method for adaptation is to scale a non-adaptive score function by a heuristic function $u(x)$ that captures the uncertainty of the model; that is, $u(x)$ is large when the model is uncertain and small otherwise. Define the score function as

    $$\mathcal{S}((x,y),\hat{f})=\frac{|y-\hat{f}(x)|}{u(x)}$$

    Some common choices of $u(x)$ angelopoulos2021gentle ; svensson2018conformal are:

    • A fitted estimate of the conditional standard deviation, $u(x):=\hat{\sigma}(x)$.

    • Distance to the training center (in Euclidean space or a learned latent space such as PCA).

    • Nearest neighbors, where uncertainty is derived from the average distance to the $k$-nearest neighbors in a learned latent space such as PCA or t-SNE.

    • An error prediction model norinder2014introducing ; cortes2016improved , where a machine learning model is trained to predict the expected error given the input.

    However, the estimated heuristic $u(x)$ can be (1) unstable and (2) lacking a direct relation to the quantiles $\alpha$ of the target-space distribution.

  • Feature-space conformal regression bates2021distribution ; angelopoulos2021gentle , where the conformal prediction algorithm is performed in the feature space and then projected to the output space. In the deep learning setting, the feature space can be the latent space before the final layer(s) of the neural network. featurecp proved that validity in the feature space implies validity in the output space, given a smoothness assumption on the prediction layer.

  • Conformalized quantile regression (CQR) steinwart2011estimating ; romano2019conformalized . CQR allows interval lengths to vary with the input thanks to the learned quantile functions, and allows for asymmetric prediction intervals because the upper and lower bounds are fitted separately. A quantile regression algorithm takeuchi2006nonparametric attempts to learn the $\gamma$ quantile of $Y_{test}|X_{test}=x$ for each possible value of $x$. Denoting the fitted model as $\hat{t}_{\gamma}(x)$, by definition the probability that $Y_{test}|X_{test}=x$ lies below $\hat{t}_{0.05}(x)$ is $5\%$ and below $\hat{t}_{0.95}(x)$ is $95\%$; hence we expect the interval $[\hat{t}_{0.05}(x),\hat{t}_{0.95}(x)]$ to have $90\%$ coverage. With this intuition, given a desired coverage rate of $1-\alpha$, we define the nonconformity score to be the distance from $y$ to the predicted quantile interval:

    $$\mathcal{S}(x,y)=\max\{\hat{t}_{\alpha/2}(x)-y,\;y-\hat{t}_{1-\alpha/2}(x)\}$$

    Figure 4 illustrates the results of the different adaptive techniques on a synthetic dataset, and a code sketch of CQR follows the figure.

Figure 4: Prediction intervals on simulated data with heteroscedastic noise and outliers, from romano2019conformalized : (a) standard split conformal prediction (Algorithm 1); (b) normalized nonconformity score with $u(x)=\hat{\sigma}(x)$, the mean absolute deviation (MAD) of $|Y-\hat{f}(x)|$ given $X=x$; and (c) conformalized quantile regression (CQR). Observe that while all three methods achieve the $\geq 90\%$ coverage guarantee, the intervals generated by CQR are much more efficient.
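The following is a minimal sketch of CQR on heteroscedastic toy data (ours; the per-bin empirical quantiles stand in for a learned quantile regressor $\hat{t}_{\gamma}$):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 5, 4000)
y = np.sin(x) + 0.1 * (1 + x) * rng.normal(size=x.size)   # noise grows with x
x_tr, y_tr, x_cal, y_cal = x[:2000], y[:2000], x[2000:], y[2000:]

alpha, bins = 0.1, np.linspace(0, 5, 26)

def fit_quantile(level):
    # Crude conditional-quantile "model": per-bin empirical quantiles.
    idx = np.digitize(x_tr, bins)
    q = np.array([np.quantile(y_tr[idx == b], level) for b in range(1, len(bins))])
    return lambda v: q[np.clip(np.digitize(v, bins) - 1, 0, len(q) - 1)]

t_lo, t_hi = fit_quantile(alpha / 2), fit_quantile(1 - alpha / 2)

# CQR nonconformity: signed distance of y to the predicted quantile band.
s_cal = np.maximum(t_lo(x_cal) - y_cal, y_cal - t_hi(x_cal))
n = len(s_cal)
q_hat = np.sort(s_cal)[int(np.ceil((n + 1) * (1 - alpha))) - 1]

# Conformalized interval [t_lo(x) - q_hat, t_hi(x) + q_hat]: width adapts to x.
for x_new in (0.5, 4.5):
    print(x_new, t_lo(x_new) - q_hat, t_hi(x_new) + q_hat)
```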

The adaptability techniques introduced above were developed for the exchangeable setting. While some of them are modular with respect to the specific conformal algorithm (such as choosing a scaled nonconformity score), further research is needed on adaptability in the time series setting. Formalization of the adaptive property and its implications is further discussed in Section 4.2.

3.2 Conformal prediction for time series under distribution shift

In this section, we look at recent developments extending conformal prediction's validity guarantees to the time series setting. This is a nontrivial task, as real-world time series data are often non-exchangeable. Distribution shifts can happen, for example, in financial markets, where market behavior can change dramatically because of new legislation or policy, and in epidemiological data, where there are seasonal and infrastructural shifts.

3.2.1 Sequential Distribution-free Ensemble Batch Prediction Intervals (EnbPI xu2021conformal )

The problem setting is as follows: we have an unknown model $f:\mathbb{R}^d\rightarrow\mathbb{R}$, where $d$ is the dimension of the feature vector, and the data pairs $(x,y)$ are generated in a time-series fashion,

$$Y_t=f(X_t)+\epsilon_t,\quad t=1,2,\ldots$$

where the noise terms $\epsilon_t$ are identically distributed according to a common cumulative distribution function (CDF) $F$ but do not have to be independent. Xu et al. tackle the non-exchangeable problem by (1) training an ensemble of leave-one-out predictors, and (2) deploying a sliding window for calibration that re-calibrates as the time series continues. The algorithm is presented in Algorithm 2. The ensemble training (lines 1-4 in Algorithm 2) is similar to a bootstrap method jackknife+after . Lines 15-20 in Algorithm 2 show the procedure for leveraging recent data for calibration: it allows the prediction intervals to adapt to distribution shifts without having to retrain $\hat{f}$. The authors show that, under the assumption that the error terms $\{\epsilon_t\}_{t\geq 1}$ are strongly mixing and stationary, the proposed algorithm achieves an approximate coverage validity guarantee (see Theorem 1 of xu2021conformal ).

Input: Dataset $\mathcal{D}=\{(x_i,y_i)\}_{i=1}^{T}$, regression algorithm $\mathcal{A}$, target confidence level $1-\alpha$, aggregation function $\phi$, number of bootstrap models $B$, prediction horizon $h$, test data $\{(x_t,y_t)\}_{t=T+1}^{T+T_1}$
Output: Ensemble prediction intervals $\{\Gamma^{\phi,1-\alpha}_{T,t}(x_t)\}_{t=T+1}^{T+T_1}$.
// Train bootstrap predictors
1: for $b=1,\ldots,B$ do
2:   Sample with replacement an index set $S_b=(i_1,\ldots,i_T)$ from the indices $(1,\ldots,T)$;
3:   Compute $\hat{f}^b=\mathcal{A}(\{(x_i,y_i)\,|\,i\in S_b\})$;
4: end for
// Calculate nonconformity scores
5: Initialize $\mathbf{s}=\{\}$;
6: for $i=1,\ldots,T$ do
7:   $\hat{f}^{\phi}_{-i}(x_i)=\phi(\{\hat{f}^{b}(x_i)\,|\,i\notin S_b\})$;
8:   Compute $\hat{s}^{\phi}_i=|y_i-\hat{f}^{\phi}_{-i}(x_i)|$;
9:   $\mathbf{s}=\mathbf{s}\cup\{\hat{s}^{\phi}_i\}$;
10: end for
// Prediction
11: for $t=T+1,\ldots,T+T_1$ do
12:   Let $\hat{f}^{\phi}_{-t}(x_t)$ be the $(1-\alpha)$ quantile of $\{\hat{f}^{\phi}_{-i}(x_t)\}_{i=1}^{T}$;
13:   Let $w^{\phi}_t$ be the $(1-\alpha)$ quantile of $\mathbf{s}$;
14:   return $\Gamma^{\phi,1-\alpha}_{T,t}(x_t)=[\hat{f}^{\phi}_{-t}(x_t)\pm w^{\phi}_t]$;
   // Re-calibration using recent nonconformity scores
15:   if $t-T\equiv 0\ (\mathrm{mod}\ h)$ then
16:     for $j=t-h,\ldots,t-1$ do
17:       Compute $\hat{s}^{\phi}_j=|y_j-\hat{f}^{\phi}_{-j}(x_j)|$;
18:       $\mathbf{s}=(\mathbf{s}\setminus\{\hat{s}^{\phi}_1\})\cup\{\hat{s}^{\phi}_j\}$ and reset the index of $\mathbf{s}$ (slide the window);
19:     end for
20:   end if
21: end for
Algorithm 2: Sequential Distribution-free Ensemble Batch Prediction Intervals (EnbPI)
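A compact sketch of the EnbPI idea is given below (ours, heavily simplified: least-squares fits play the role of the bootstrap learners, the aggregation $\phi$ is the mean, and re-calibration slides by one step, i.e. the $h=1$ variant):

```python
import numpy as np

rng = np.random.default_rng(3)
T, T1, B, alpha = 300, 100, 25, 0.1

# Toy series Y_t = f(X_t) + eps_t.
x = rng.normal(size=(T + T1, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.normal(size=T + T1)

# Train B bootstrap models on index samples S_b.
S = [rng.integers(0, T, T) for _ in range(B)]
models = [np.linalg.lstsq(x[s], y[s], rcond=None)[0] for s in S]

def loo_pred(i, xq):
    # Aggregate (mean) over the models whose bootstrap sample excludes i.
    preds = [xq @ models[b] for b in range(B) if i not in S[b]]
    return np.mean(preds) if preds else np.mean([xq @ w for w in models])

# Leave-one-out residuals on the training period.
scores = [abs(y[i] - loo_pred(i, x[i])) for i in range(T)]

# Predict with a sliding calibration window of the most recent T scores.
covered = 0
for t in range(T, T + T1):
    center = np.mean([x[t] @ w for w in models])
    w_t = np.quantile(scores, 1 - alpha)
    covered += center - w_t <= y[t] <= center + w_t
    scores = scores[1:] + [abs(y[t] - center)]   # slide the window
print("empirical coverage:", covered / T1)
```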

3.2.2 Adaptive Conformal Inference Under Distribution Shift (ACI gibbs2021adaptive )

The EnbPI algorithm adapts to the inherent distribution change of a time series through a fixed-window re-calibration process. One naturally wonders whether there are more principled ways to perform online re-calibration. Gibbs et al.'s approach to a shifting underlying data distribution, Adaptive Conformal Inference, is to re-estimate the nonconformity score function $S_t(\cdot)$ and the corresponding quantile function $\hat{Q}_t(\cdot)$ to align with the most recent observations: assume that at each time $t$ we are given a fitted score function $S_t(\cdot)$ and corresponding quantile function $\hat{Q}_t(\cdot)$. We define the miscoverage rate of the prediction set $\Gamma^{1-\alpha}_t(X_t):=\{y:S_t(X_t,y)\leq\hat{Q}_t(1-\alpha)\}$ to be

$$M_t(\alpha):=\mathbb{P}(S_t(X_t,Y_t)>\hat{Q}_t(1-\alpha))$$

Note that $M_t(\alpha)\leq\alpha$ is equivalent to the validity condition in Equation 1; in other words, we want to keep the miscoverage rate of a level-$\alpha$ prediction region below $\alpha$. We assume the quantile function $\hat{Q}_t(\cdot)$ is continuous and non-decreasing, with $\hat{Q}_t(0)=-\infty$ and $\hat{Q}_t(1)=\infty$. As a result, $M_t(\cdot)$ is non-decreasing on $[0,1]$ with $M_t(0)=0$ and $M_t(1)=1$. Then there exists an $\alpha^{*}_t$ defined as

$$\alpha^{*}_t:=\sup\{\beta\in[0,1]:M_t(\beta)\leq\alpha\}$$

This means that if we correctly calibrate the argument to $\hat{Q}_t(\cdot)$, we can achieve the coverage guarantee. gibbs2021adaptive performs this calibration with an online update: we look at the miscoverage rate on recent observations, increase $\alpha_t$ if the algorithm currently over-covers, and decrease it if $Y_t$ is under-covered:

$$\alpha_{t+1}:=\alpha_t+\gamma(\alpha-\text{err}_t) \quad (5)$$

where

$$\text{err}_t:=\begin{cases}1,&\text{if }Y_t\notin\hat{\Gamma}_t(\alpha_t),\quad\hat{\Gamma}_t(\alpha_t)=\{y:S_t(X_t,y)\leq\hat{Q}_t(1-\alpha_t)\}\\0,&\text{otherwise}\end{cases}$$

and the step size $\gamma$ is a hyper-parameter that trades off adaptability against stability. We essentially re-calibrate our conformal predictor by learning $\alpha^{*}_t$ online. Because the update in Equation 5 keeps $\alpha_t$ confined to a bounded interval, we have the following coverage result for ACI.

Lemma 2 (Validity of ACI; Proposition 4.1 in gibbs2021adaptive ).

With probability one, for all $T\in\mathbb{N}$ we have

$$\left|\frac{1}{T}\sum_{t=1}^{T}\text{err}_t-\alpha\right|\leq\frac{\max\{\alpha_1,1-\alpha_1\}+\gamma}{\gamma T}$$

In particular, asymptotically we have

$$\lim_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}\text{err}_t=\alpha$$
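The core of ACI is just the online update in Equation 5. A self-contained sketch (ours; the score stream and the rolling quantile model are illustrative stand-ins for $S_t$ and $\hat{Q}_t$) shows the long-run miscoverage settling near $\alpha$ even across a distribution shift:

```python
import numpy as np

rng = np.random.default_rng(4)
alpha, gamma, T = 0.1, 0.02, 5000
alpha_t, errs = alpha, []

# Stand-in nonconformity scores with a mid-stream variance shift.
stream = np.abs(np.concatenate([rng.normal(0, 1, T // 2), rng.normal(0, 2, T // 2)]))
window = list(np.abs(rng.normal(0, 1, 200)))   # rolling calibration scores

for t in range(T):
    q = np.quantile(window, np.clip(1 - alpha_t, 0.0, 1.0))  # Q_t(1 - alpha_t)
    err = float(stream[t] > q)                  # 1 iff Y_t is not covered
    alpha_t += gamma * (alpha - err)            # Equation (5)
    errs.append(err)
    window = window[1:] + [stream[t]]           # refit on recent observations
print("long-run miscoverage:", np.mean(errs))   # -> alpha, as in Lemma 2
```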
Figure 5: An illustration of the ACI algorithm in action on county-level election data, figure from gibbs2021adaptive . Local coverage frequencies of adaptive conformal (blue), vanilla conformal (red), and an i.i.d. Bernoulli(0.1) sequence (grey) as a reference for stochasticity. Colored dotted lines show the average coverage across all time points, while the black line indicates the target coverage level of $1-\alpha=0.9$.

3.3 Conformal algorithms for multiple exchangeable time series

The algorithms covered above are designed to learn from a single time series in which distribution shifts are present. In this section, we study the setting where the dataset contains many time series whose shared patterns can potentially be exploited for improved prediction. This setting is commonly seen in medical, traffic, and recommender-system data, where research on probabilistic forecasting has attracted much interest salinas2020deepar . Figure 6 illustrates how the two data paradigms differ.

Figure 6: Figure from Schaar2021conformaltime explaining two different paradigms of time series data. On the left, the data comprise a single time series, whose observations are individual time steps within that series. On the right, the dataset consists of multiple time series, each of which is treated as one observation. Independence of the time series implies that they are exchangeable.

The formal setup for multi-horizon time series is as follows. We denote the time series dataset as $\mathcal{D}=\{(\mathbf{x}^{(i)}_{1:t},\mathbf{y}^{(i)}_{t+1:t+k})\}_{i=1}^{n}$, where $\mathbf{x}_{1:t}\in\mathbf{X}^t$ is $t$ timesteps of input and $\mathbf{y}_{t+1:t+k}\in\mathbf{Y}^k$ is $k$ timesteps of output. In the traditional time series forecasting setting we have $\mathbf{X}=\mathbf{Y}$, but they do not have to be equal. Given the dataset $\mathcal{D}$, a test sample $\mathbf{x}^{(n+1)}_{1:t}$, and a confidence level $1-\alpha$, we want a set of $k$ prediction intervals, $[\Gamma^{1-\alpha}_1,\ldots,\Gamma^{1-\alpha}_k]$, one for each timestep, such that:

$$\mathbb{P}[\,\forall h\in\{1,\ldots,k\},\ \mathbf{y}_{t+h}\in\Gamma^{1-\alpha}_h\,]\geq 1-\alpha \quad (6)$$

3.3.1 Conformal Time-series forecasting

(CF-RNN Schaar2021conformaltime ) The straightforward way to apply conformal prediction to this form of time series is to treat each time step $\mathbf{y}_h$ as a separate prediction task, and Schaar2021conformaltime did exactly this. To ensure that the joint probability over the $k$ prediction time steps satisfies the coverage guarantee in Equation 6, they apply a Bonferroni correction bonferroni1936teoria when calibrating each time step. That is, given the confidence level $1-\alpha$, instead of setting the critical nonconformity score to the $(1-\alpha)$-th quantile of the calibration nonconformity scores as in Algorithm 1, we choose the $(1-\alpha/k)$-th quantile.

Formally, let the subscript $h\in[1,\ldots,k]$ be the timestep index: $\mathbf{s}_h$ denotes the set of calibration-set nonconformity scores at prediction horizon $h$, and $\hat{f}(\mathbf{x})_h$ denotes the prediction of the learned model $\hat{f}$ given input $\mathbf{x}$ at prediction horizon $h$. Schaar2021conformaltime constructs the prediction sets as:

$$\hat{s}_h:=Q(1-\alpha/k,\mathbf{s}_h),\quad h\in[1,\ldots,k] \quad (7)$$
$$\Gamma^{1-\alpha}_h(\mathbf{x}):=\left[\hat{f}(\mathbf{x})_h-\hat{s}_h,\ \hat{f}(\mathbf{x})_h+\hat{s}_h\right],\quad h\in[1,\ldots,k] \quad (8)$$
Lemma 3 (Validity of CF-RNN).

The prediction intervals $[\Gamma^{1-\alpha}_1,\ldots,\Gamma^{1-\alpha}_k]$ constructed with Equation 8 satisfy the coverage guarantee of Equation 6.

Proof.

The proof follows directly from Boole's inequality. Let $E_i$ denote the event $\mathbf{y}_{t+i}\notin\Gamma^{1-\alpha}_i$; by the conformal guarantee at level $1-\alpha/k$, each $\mathbb{P}[E_i]\leq\alpha/k$. Then

$$\mathbb{P}[\,\exists h\in\{1,\ldots,k\},\ \mathbf{y}_{t+h}\notin\Gamma^{1-\alpha}_h\,]=\mathbb{P}\Big[\bigcup_{i=1}^{k}E_i\Big]\leq\sum_{i=1}^{k}\mathbb{P}[E_i]\leq k\cdot\frac{\alpha}{k}=\alpha$$

Taking the complement yields the coverage result in Equation 6. ∎

The Bonferroni correction does not require any assumption about the dependence among the coverage events of individual timesteps. As a result, the predictive intervals are very conservative. If we require an interval of $95\%$ confidence over 5 timesteps, we end up using the $99\%$ quantile of nonconformity scores for each timestep; over 20 timesteps, the quantile for each timestep goes up to $99.75\%$, resulting in large and often uninformative predictive intervals.

Input: Trained prediction model $\hat{f}$ producing $k$-step forecasts, calibration set $\mathcal{D}_{cal}=\{(\mathbf{x}^{(i)}_{1:t},\mathbf{y}^{(i)}_{t+1:t+k})\}_{i=m+1}^{n}$, new input $\mathbf{x}^{(n+1)}_{1:t}$, target miscoverage level $\alpha$
Output: Prediction regions $\Gamma^{1-\alpha}_1,\ldots,\Gamma^{1-\alpha}_k$.
// Calibration Step
1: Initialize nonconformity scores $s_1=\{\},\ldots,s_k=\{\}$
2: for $(\mathbf{x}^{(i)}_{1:t},\mathbf{y}^{(i)}_{t+1:t+k})\in\mathcal{D}_{cal}$ do
3:   $\hat{\mathbf{y}}^{(i)}_{t+1:t+k}\leftarrow\hat{f}(\mathbf{x}^{(i)}_{1:t})$
4:   for $h=1$ to $k$ do
5:     $s_h\leftarrow s_h\cup\{|\hat{\mathbf{y}}^{(i)}_{t+h}-\mathbf{y}^{(i)}_{t+h}|\}$
6:   end for
7: end for
8: for $h=1$ to $k$ do
9:   $\hat{s}_h\leftarrow Q(1-\alpha/k,s_h)$
10: end for
// Prediction Step
11: $\hat{\mathbf{y}}^{(n+1)}_{t+1:t+k}\leftarrow\hat{f}(\mathbf{x}^{(n+1)}_{1:t})$
12: for $h=1$ to $k$ do
13:   $\Gamma^{1-\alpha}_h\leftarrow\{y:\,|y-\hat{y}^{(n+1)}_{t+h}|<\hat{s}_h\}$
14: end for
15: return $\Gamma^{1-\alpha}_1,\ldots,\Gamma^{1-\alpha}_k$
Algorithm 3: Conformal Forecasting RNN (CF-RNN) Schaar2021conformaltime

3.3.2 Copula conformal prediction for multivariate time series (our work)

We note that there are often dependencies between the timesteps of a time series, and that by modeling these dependencies we can improve the efficiency of the prediction intervals.

Introduction to Copulas

We introduce the concept of copulas to leverage such dependencies for sharper prediction regions. A copula is a statistical object that describes the dependency structure of a multivariate distribution; we can use copulas to capture the joint distribution over multiple future timesteps. Copulas are widely used in economic and financial forecasting nelsen2007introduction ; patton2012review ; remillard2012copula and have been explored in the ML multivariate time series literature salinas2019high ; drouin2022tactis .

Definition 1 (Copula).

Given a random vector $(X_1,\cdots,X_k)$, define the marginal cumulative distribution functions (CDFs) as $F_i(x)=P(X_i\leq x)$. The copula of $(X_1,\cdots,X_k)$ is the joint CDF of $(U_1,\cdots,U_k)=(F_1(X_1),\cdots,F_k(X_k))$, meaning

$$C(u_1,\cdots,u_k)=P(U_1\leq u_1,\cdots,U_k\leq u_k)$$

or, alternatively,

$$C(u_1,\cdots,u_k)=P(X_1\leq F_1^{-1}(u_1),\cdots,X_k\leq F_k^{-1}(u_k))$$

In other words, the copula function captures the dependency structure among the variables $X_i$; we can view a $k$-dimensional copula $C:[0,1]^k\rightarrow[0,1]$ as a CDF with uniform marginals. A fundamental result in the theory of copulas is Sklar's theorem.

Theorem 2 (Sklar's theorem).

Given a joint CDF $F(X_1,\cdots,X_k)$ with marginals $F_i(x)$, there exists a copula such that

$$F(x_1,\cdots,x_k)=C(F_1(x_1),\cdots,F_k(x_k))$$

for all $x_i\in[-\infty,\infty]$ and $i=1,\cdots,k$.

Sklar's theorem states that for every multivariate distribution function we can find a copula such that the distribution can be expressed through the copula and its univariate marginal distributions. Copulas can be parametric or non-parametric; the Gaussian copula, for example, is used for pricing in financial markets Haugh2016copula . Figure 7 illustrates an example of a bivariate copula. As a concrete example, when all the $X_i$ are independent, the copula is known as the product copula: $C(u_1,\cdots,u_k)=\Pi_{i=1}^{k}u_i$. We can bound copula functions as follows.

Theorem 3 (The Fréchet-Hoeffding bounds).

Consider a copula $C(u_1,\ldots,u_k)$. Then,

$$\max\Big\{1-k+\sum_{i=1}^{k}u_i,\,0\Big\}\leq C(u_1,\ldots,u_k)\leq\min\{u_1,\ldots,u_k\}$$

The Fréchet-Hoeffding upper and lower bounds are both themselves copulas. Notably, the Bonferroni correction used in CF-RNN (Algorithm 3) corresponds precisely to the lower bound in Theorem 3, i.e., to the worst-case dependence structure; the small numerical check below makes this concrete.
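A small numerical check (ours) of Theorem 3 and the remark above: with the Bonferroni per-step levels $u_i=1-\alpha/k$, the Fréchet-Hoeffding lower bound evaluates to exactly $1-\alpha$, which is why CF-RNN is valid under arbitrary dependence:

```python
import numpy as np

k, alpha = 20, 0.05
u = np.full(k, 1 - alpha / k)        # Bonferroni per-step levels, 0.9975 each

lower = max(1 - k + u.sum(), 0.0)    # Frechet-Hoeffding lower bound
upper = min(u)                       # Frechet-Hoeffding upper bound
indep = np.prod(u)                   # product copula (independent steps)

print(lower)   # 0.95 = 1 - alpha: joint coverage under worst-case dependence
print(indep)   # ~0.9512: independence already gives slightly more
print(upper)   # 0.9975: comonotone (perfectly dependent) steps
```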

Figure 7: An example copula, where we express a bivariate Gaussian with correlation $\rho=0.8$ through two univariate marginal distributions and a copula function $C(u_1,u_2)$.
Algorithm of Copula Conformal Prediction for Time Series

We now introduce how copulas can be leveraged in time series prediction. First, recall from the results in Section 3.1 that given a confidence level $1-\alpha$ and a prediction model $\hat{f}$, the conformal procedure returns a nonconformity score $s_{1-\alpha}$ such that for $z=(x,y)\sim\mathcal{P}$,

$$P(\mathcal{S}(z,\hat{f})\leq s_{1-\alpha})\geq 1-\alpha$$

We can view the procedure as estimating an empirical cumulative distribution function (CDF) for the random variable $\mathcal{S}(z,\hat{f})$:

$$\hat{F}(s):=P(\mathcal{S}(z,\hat{f})\leq s)=\frac{1}{|\mathcal{D}_{cal}|}\sum_{z^i\in\mathcal{D}_{cal}}\mathbb{1}_{\mathcal{S}(z^i,\hat{f})\leq s} \quad (9)$$

For the multi-step conformal prediction algorithm, we estimate this empirical CDF for each step of the time series, denoted as

$$\hat{F}_t(s_t):=P(\mathcal{S}(z_t,\hat{f})\leq s_t)\quad\text{for }t\in 1,\ldots,k \quad (10)$$

For each prediction timestep $1,\ldots,k$, we want to find a significance level $\epsilon_t$ such that the entire predicted trajectory is covered by the intervals with confidence $1-\epsilon$. Following Sklar's theorem (Theorem 2):

$$F(s_1,\ldots,s_k)=C(F_1(s_1),\ldots,F_k(s_k))=C(1-\epsilon_1,\ldots,1-\epsilon_k)=1-\epsilon$$

In this work, we adopt the empirical copula empiricalcop1 as our default copula. The empirical copula is a nonparametric method that estimates the copula directly from observations, and hence does not introduce bias. For the joint distribution of a time series with $k$ timesteps, the empirical copula of a vector of probabilities $\mathbf{u}\in[0,1]^k$ is defined as

$$C_{\text{empirical}}(\mathbf{u})=\frac{1}{|\mathcal{D}_{cal}|}\sum_{i\in\mathcal{D}_{cal}}\mathbb{1}_{\mathbf{u}^i<\mathbf{u}}=\frac{1}{|\mathcal{D}_{cal}|}\sum_{i\in\mathcal{D}_{cal}}\prod_{t=1}^{k}\mathbb{1}_{\mathbf{u}^i_t<\mathbf{u}_t} \quad (11)$$

Here, the $\mathbf{u}^i$ are the cumulative probabilities for each data point in the calibration set $\mathcal{D}_{cal}$ of size $n-m$:

$$\mathbf{u}^i=(u^i_1,\ldots,u^i_k)=(\hat{F}_1(s^i_1),\ldots,\hat{F}_k(s^i_k)),\quad i\in\{m+1,\ldots,n\}$$

messoudi2021copula demonstrated that the empirical copula performs well for multi-target regression and is flexible across various distributions. One may instead pick a parametric copula, such as the Gaussian copula, to introduce inductive bias and improve sample efficiency when calibration data is scarce.

Now, to fulfill the validity condition of Equation 1, we only need to find

$$\mathbf{u}^{*}=(\hat{F}_1(s^{*}_1),\ldots,\hat{F}_k(s^{*}_k))\quad\text{such that}\quad C_{\text{empirical}}(\mathbf{u}^{*})\geq 1-\epsilon$$

We can find the $s^{*}$ values through any search algorithm. The prediction region for each timestep $h$ is then constructed as the set of all $y\in\mathbf{Y}$ whose nonconformity score is less than $s^{*}_h$. Algorithm 4 summarizes the CopulaCPTS algorithm. The difference between CF-RNN (Algorithm 3) and CopulaCPTS (Algorithm 4) is lines 8-10 of the latter: instead of calculating a quantile of the nonconformity scores for each timestep individually, CopulaCPTS searches for the nonconformity scores that jointly ensure validity through the copula. A minimal sketch of this calibration-and-search step follows.
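Below is a minimal sketch of that step (ours; the synthetic dependent scores and the one-parameter search over a shared quantile level are illustrative simplifications of the general search over $s^{*}_1,\ldots,s^{*}_k$):

```python
import numpy as np

rng = np.random.default_rng(5)
n_cal, k, eps = 500, 5, 0.1

# Stand-in calibration scores, shape (n_cal, k), dependent across horizons.
base = np.abs(rng.normal(size=(n_cal, 1)))
s_cal = base + 0.2 * np.abs(rng.normal(size=(n_cal, k)))

def joint_coverage(thresholds):
    # Empirical copula (Equation 11) evaluated at u = (F_1(s_1),...,F_k(s_k)):
    # the fraction of calibration series covered jointly at all k horizons.
    return np.mean(np.all(s_cal <= thresholds, axis=1))

# One-parameter search: raise a shared per-step quantile level until the
# empirical copula reaches 1 - eps (any multivariate search also works).
for level in np.linspace(0.8, 1.0, 2001):
    s_star = np.quantile(s_cal, level, axis=0)   # candidate s*_h per horizon
    if joint_coverage(s_star) >= 1 - eps:
        break

print("shared quantile level:", round(level, 4))
print("thresholds s*:", np.round(s_star, 3))
# Prediction regions: Gamma_h = { y : |y - yhat_h| < s*_h } for each horizon h.
```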

Input: Trained prediction model $\hat{f}$ producing $k$-step forecasts, calibration set $\mathcal{D}_{cal}=\{(\mathbf{x}^{(i)}_{1:t},\mathbf{y}^{(i)}_{t+1:t+k})\}_{i=m+1}^{n}$, new input $\mathbf{x}^{(n+1)}_{1:t}$, target significance level $\epsilon$
Output: Prediction regions $\Gamma^{\epsilon}_1,\ldots,\Gamma^{\epsilon}_k$.
// Calibration Step
1: Initialize nonconformity scores $s_1=\{\},\ldots,s_k=\{\}$
2: for $(\mathbf{x}^{(i)}_{1:t},\mathbf{y}^{(i)}_{t+1:t+k})\in\mathcal{D}_{cal}$ do
3:   $\hat{\mathbf{y}}^{(i)}_{t+1:t+k}\leftarrow\hat{f}(\mathbf{x}^{(i)}_{1:t})$
4:   for $h=1$ to $k$ do
5:     $s_h\leftarrow s_h\cup\{|\hat{\mathbf{y}}^{(i)}_{t+h}-\mathbf{y}^{(i)}_{t+h}|\}$
6:   end for
7: end for
8: Construct $\hat{F}_1,\ldots,\hat{F}_k$ as in Equation 10.
9: Construct the copula $C(\mathbf{u})$ as in Equation 11.
10: Search for $s^{*}_1,\ldots,s^{*}_k$ such that $C(\hat{F}_1(s^{*}_1),\ldots,\hat{F}_k(s^{*}_k))\geq 1-\epsilon$
// Prediction Step
11: $\hat{\mathbf{y}}^{(n+1)}_{t+1:t+k}\leftarrow\hat{f}(\mathbf{x}^{(n+1)}_{1:t})$
12: for $h=1$ to $k$ do
13:   $\Gamma^{\epsilon}_h\leftarrow\{y:\,|y-\hat{y}^{(n+1)}_{t+h}|<s^{*}_h\}$
14: end for
15: return $\Gamma^{\epsilon}_1,\ldots,\Gamma^{\epsilon}_k$
Algorithm 4: Copula Conformal Prediction for Time Series (CopulaCPTS)
Validity of CopulaCPTS
Lemma 4 (Validity of CopulaCPTS).

The prediction regions $[\Gamma^{\epsilon}_1,\ldots,\Gamma^{\epsilon}_k]$ provided by Algorithm 4 satisfy the coverage guarantee $\mathbb{P}[\,\forall h\in\{1,\ldots,k\},\ \mathbf{y}_{t+h}\in\Gamma^{\epsilon}_h\,]\geq 1-\epsilon$.

Proof.

We estimated $\mathbf{u}^{*}=(\hat{F}_1(s^{*}_1),\ldots,\hat{F}_k(s^{*}_k))$ such that

$$C_{\text{empirical}}(\mathbf{u}^{*})=\mathop{\mathbb{E}}_{\mathbf{u}}\prod_{t=1}^{k}\mathbb{1}_{\mathbf{u}^{*}_t<\mathbf{u}_t}\geq 1-\epsilon \quad (12)$$

Let $(\mathbf{x}^{(n+1)}_{1:t},\mathbf{y}^{(n+1)}_{t+1:t+k})\sim\mathcal{P}$ be a new data point, let $\hat{\mathbf{y}}=\hat{f}(\mathbf{x}^{(n+1)}_{1:t})$ be the prediction given by the trained model, and let $s^{n+1}_h$ be the resulting nonconformity score at horizon $h$. The event $\{\forall h,\ \mathbf{y}_{t+h}\in\Gamma^{\epsilon}_h\}$ is exactly the event $\{\forall h,\ s^{n+1}_h\leq s^{*}_h\}$, so by Sklar's theorem (Theorem 2),

$$P(\,\forall h\in\{1,\ldots,k\},\ \mathbf{y}_{t+h}\in\Gamma^{\epsilon}_h\,)=F(s^{*}_1,\ldots,s^{*}_k)=C(\hat{F}_1(s^{*}_1),\ldots,\hat{F}_k(s^{*}_k))\geq 1-\epsilon$$

∎

Figure 8: Visualization of the predictions of the copula method: (a) Copula-EncDec, (b) MC Dropout, (c) CF-RNN. The underlying models are trained on a drone simulation dataset with dynamics noise; the prediction regions have confidence level $1-\alpha=99\%$. With the copula, our method (a) produces a more consistent, expanding cone of uncertainty compared to MC dropout (b), and a sharper one compared to CF-RNN (c).

4 Decision making under (non-parametric) uncertainty

4.1 Sample-efficient Safety Assurances with CP

Many important problems involve decision making under uncertainty, including vehicle collision avoidance, wildfire management, and disaster response. When designing automated systems for making or recommending decisions, it is important to account for the uncertainty inherent in the system or in the learned machine learning predictions. There is rich literature on incorporating UQ in a decision-making framework; exemplary works include utility theory fishburn1968utility , risk-based decision making lounis2016risk , robust optimization ben2009robust , and reinforcement learning chua2018deep . We will explore methods that leverage conformal prediction to achieve provably safe decision making.

The properties of conformal prediction, namely that it is (1) agnostic to the prediction model, (2) agnostic to the underlying data distribution, and (3) endowed with finite-sample guarantees, make it a particularly useful technique for safety assurance. Existing work on probabilistic safety guarantees most commonly adopts a Bayesian framework dhiman2021control ; fisac2018probabilistically , but much recent work has explored conformal methods for better-calibrated UQ of learned dynamics. sun2020learning , for example, uses conformal prediction to give probabilistic guarantees on the convergence of the tracking error. farid2022failure showed that conformal methods achieve much better sample efficiency than PAC-Bayes guarantees for predicting failures in vision-based robot control. johnstone2021conformal explored the use of conformal prediction sets for robust optimization and found that they are both better calibrated and more efficient than uncertainty sets based on normality assumptions. eliades2019applying leverage the confidence provided by CP to make the control of an exoskeleton more robust.

Definition 2 ($\epsilon$-safety).

Let $Z$ denote the state of an agent and $\phi$ a safety score with threshold $\phi_0$. For some $0<\epsilon<1$, we say the agent is $\epsilon$-safe with respect to $Z,\phi$ and $\phi_0$ if

$$\mathbb{P}[\phi(Z)\leq\phi_0]\geq 1-\epsilon$$

One can also make guarantees about the false-positive rate of a warning system: luo2021sample introduced the paradigm of achieving $\epsilon$-safety of a warning system with regard to a surrogate safety score $\phi$ (Definition 3). The authors deployed conformal prediction in experiments on a driver-alert safety system and a robotic grasping system, showing that the conformal guarantees hold in practice while issuing very few ($<1\%$) false-positive alerts.

Definition 3 ($\epsilon$-safety of warning systems).

Let $Z$ denote the state of an agent and $\phi$ a safety score with threshold $\phi_0$. For some $0<\epsilon<1$, we say a warning system $w$ is $\epsilon$-safe with respect to $Z,\phi$ and $\phi_0$ if

$$\mathbb{P}[w(Z)=1\ |\ \phi(Z)\leq\phi_0]\geq 1-\epsilon$$

We briefly note the difference between the safety guarantees provided in the PAC learning framework and by conformal prediction (see luo2021sample for the algorithmic and mathematical details). PAC learning assumes i.i.d. data and requires $\Theta(1/\epsilon^2)$ samples to achieve an i.i.d. $\epsilon$-safety guarantee with probability $1-\delta$; conformal prediction, on the other hand, only assumes that the data are exchangeable, and requires $\Theta(1/\epsilon)$ samples to achieve a marginal $\epsilon$-safety guarantee that always holds. We elaborate on these differences in Section 4.2. To summarize, their use cases differ in that conformal learning requires much weaker assumptions and fewer samples, whereas PAC learning offers stronger guarantees when its assumptions and sample-complexity requirements are met. The sketch below illustrates the basic conformal calibration recipe for such a safety threshold.
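As a minimal sketch of this recipe (ours, not the algorithm of luo2021sample ; the gamma-distributed surrogate scores are purely illustrative), one can conformally calibrate an alarm threshold so that a new exchangeable safety score exceeds it with marginal probability at most $\epsilon$:

```python
import numpy as np

def calibrate_alarm_threshold(phi_scores, eps):
    # Conformal threshold tau with P(phi(Z) > tau) <= eps for a new
    # exchangeable score: the ceil((n+1)(1-eps))-th smallest score.
    n = len(phi_scores)
    k = int(np.ceil((n + 1) * (1 - eps)))
    return np.inf if k > n else np.sort(phi_scores)[k - 1]

rng = np.random.default_rng(6)
cal = rng.gamma(2.0, 1.0, size=999)          # surrogate safety scores phi(Z_i)
tau = calibrate_alarm_threshold(cal, eps=0.05)

# Alarm rule w(Z) = 1{phi(Z) > tau}; exceedances occur at rate <= eps.
new = rng.gamma(2.0, 1.0, size=10000)
print("exceedance rate:", np.mean(new > tau))  # ~ <= 0.05
```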

4.2 The Limits of Distribution-free UQ: Marginal vs Conditional Coverage

An important distinction must be made between marginal coverage guarantees and conditional coverage guarantees vovk2012conditional . So far we have focused on the marginal coverage guarantee formalized in Equation 1:

$$\mathbb{P}_{(X,Y)\sim\mathcal{P}}(Y\in\Gamma^{1-\alpha}(X))\geq 1-\alpha \quad (13)$$

This means that the probability that $\Gamma^{1-\alpha}(X)$ covers the true test value $Y$ is at least $1-\alpha$ on average, over a random draw of the training and test data from the distribution $(X,Y)\sim\mathcal{P}$. In other words, the marginal coverage guarantee is a guarantee on the average case. This is to be distinguished from conditional coverage, where

$$\mathbb{P}_{(X,Y)\sim\mathcal{P}}(Y\in\Gamma^{1-\alpha}(X)\,|\,X=x)\geq 1-\alpha \quad (14)$$

This, on the other hand, means that the probability of $\Gamma^{1-\alpha}$ covering at a fixed test point $X=x$ is at least $1-\alpha$. An illustration of the difference can be found in Figure 9. In practical settings, marginal coverage is often not sufficient, as it leaves open the possibility that entire regions of test points receive inaccurate predictions - a robot would reasonably hope that the predictions it receives are accurate for its specific circumstances, and would not be comforted by knowing that any inaccurate information it receives is balanced out by some other robot's highly precise predictions romano2019malice . Conditionally valid predictions are also tied to some definitions of algorithmic fairness bastani2022practical .

Figure 9: An illustration of the difference between marginal and conditional coverage, from angelopoulos2021gentle .

Conditional coverage is a stronger property than the marginal coverage that conformal prediction is guaranteed to achieve. It is proven in vovk2012conditional ; lei2014distribution that, if we place no assumptions on $\mathcal{P}$, an exact conditional coverage guarantee is impossible to achieve. We use $\mathcal{P}_X$ to denote the marginal distribution of $\mathcal{P}$ on $X$, and the function $Leb(\cdot)$ to denote the Lebesgue measure in $\mathbb{R}^d$, $d\geq 1$. Theorem 4 shows that at almost all nonatomic points $x\sim\mathcal{P}_X$, a conditionally valid prediction interval has infinite expected length. Intuitively, this means the coverage guarantees of conformal prediction apply only to the average case and not to the worst case, and that there may exist subspaces of the sample space $\mathbf{Z}$ where the prediction regions have poor coverage.

Theorem 4 (Impossibility of finite-sample conditional validity lei2014distribution ).

Let $\Gamma^{1-\alpha}$ be a function such that

$$\mathbb{P}_{(X,Y)\sim\mathcal{P}}(Y\in\Gamma^{1-\alpha}(X)\,|\,X=x)\geq 1-\alpha$$

Then for all distributions $\mathcal{P}$ it holds that

$$\mathbb{E}\left[Leb(\Gamma^{1-\alpha}(x))\right]=\infty$$

at almost all points $x$ aside from the atoms of $\mathcal{P}_X$.

This is known as the "limits of distribution-free conditional predictive inference" foygel2021limits . This negative result has motivated the community to consider approximate versions of the conditional coverage property. romano2019malice provides group-conditional guarantees for disjoint groups by calibrating separately on each group; a minimal sketch of this idea follows. foygel2021limits provides guarantees that are valid conditional on membership in intersecting subgroups $\mathcal{G}$. bastani2022practical proposes an algorithm with a convergence-rate lower bound on conditional coverage for arbitrary, possibly intersecting, subsets of the input feature space, conditional on membership in each of these subsets; they relax the exchangeability assumption by considering the adversarial setting, where the order of $\mathcal{D}$ is chosen by an adversary, thereby proving guarantees for the worst case. We refer readers to these works for a more detailed exposition of their algorithms and theory.
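As a minimal sketch of the group-wise idea (ours, in the spirit of romano2019malice but not their code), one simply runs the calibration step of Algorithm 1 once per disjoint group:

```python
import numpy as np

def groupwise_thresholds(scores, groups, alpha):
    # One conformal quantile per disjoint group: coverage then holds
    # conditionally on group membership (for exchangeable data per group).
    taus = {}
    for g in np.unique(groups):
        s = np.sort(scores[groups == g])
        k = int(np.ceil((len(s) + 1) * (1 - alpha)))
        taus[g] = np.inf if k > len(s) else s[k - 1]
    return taus

rng = np.random.default_rng(7)
groups = rng.integers(0, 3, size=3000)                      # 3 subpopulations
scores = np.abs(rng.normal(scale=1.0 + groups, size=3000))  # noisier groups
print(groupwise_thresholds(scores, groups, alpha=0.1))      # larger tau for noisier groups
```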

5 Conclusion and Future Work

This survey introduced conformal algorithms for uncertainty quantification in the spatiotemporal setting. Section 2 provided an overview of existing uncertainty quantification methods; Section 3 presented the theory of conformal prediction and studied in detail several conformal prediction algorithms developed for the time series setting, covering respectively (1) data generated from one time series with distribution shift and (2) multiple exchangeable time series. In Section 4, we explored how conformal methods are used in decision making and where their limits lie.

We highlight the merit of conformal prediction algorithms: they construct prediction sets with finite-sample validity guarantees that hold for any prediction algorithm and any underlying data distribution. Moreover, as we have shown in this survey, the prediction sets can be re-calibrated in an online fashion in the presence of distribution shifts. Many opportunities for future work are present on both the algorithmic and the application front, as we elaborate below.

In Section 3.3.2 we presented our work on improving the efficiency of prediction sets in the independent time series setting, motivated by work on probabilistic modeling of vehicle trajectories. Future work includes developing an online re-calibration procedure for the copula and proving its effectiveness. This setting can be further extended to producing joint prediction sets for multiple targets or agents whose relationships evolve over time, for example in multi-agent trajectory prediction.

Developing methods that leverage distribution-free UQ techniques for automated decision making is also an important future direction. Conformal prediction has found extensive use in medical research and personalized medicine izbicki2019flexible ; Schaar2021conformaltime ; teng2021t ; lei2012distribution ; eklund2015application , where scientists and physicians can use the calibrated intervals to understand model outputs and make better decisions. Using CP for automated decision making, on the other hand, remains a large space for exploration. For example, in a multi-agent planning setting, one could use conformal prediction sets to produce trajectories with safety guarantees similar to the $\epsilon$-safety introduced in Section 4. The online adaptive time series methods introduced in this survey will also be meaningful in the safety setting.

References

  • [1] M. Abdar, F. Pourpanah, S. Hussain, D. Rezazadegan, L. Liu, M. Ghavamzadeh, P. Fieguth, X. Cao, A. Khosravi, U. R. Acharya, et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76:243–297, 2021.
  • [2] A. Alaa and M. van der Schaar. Frequentist uncertainty in recurrent neural networks via blockwise influence functions. In ICML, 2020.
  • [3] A. N. Angelopoulos and S. Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.
  • [4] A. N. Angelopoulos, A. P. Kohli, S. Bates, M. Jordan, J. Malik, T. Alshaabi, S. Upadhyayula, and Y. Romano. Image-to-image regression with distribution-free uncertainty quantification and applications in imaging. In International Conference on Machine Learning, pages 717–730. PMLR, 2022.
  • [5] A. Anonymous. Adaptive predictive inference with feature conformal prediction. Under review at Advances in Neural Information Processing Systems, 2022.
  • [6] R. F. Barber, E. J. Candes, A. Ramdas, and R. J. Tibshirani. Conformal prediction beyond exchangeability. arXiv preprint arXiv:2202.13415, 2022.
  • [7] O. Bastani, V. Gupta, C. Jung, G. Noarov, R. Ramalingam, and A. Roth. Practical adversarial multivalid conformal prediction. arXiv preprint arXiv:2206.01067, 2022.
  • [8] S. Bates, A. Angelopoulos, L. Lei, J. Malik, and M. Jordan. Distribution-free, risk-controlling prediction sets. Journal of the ACM (JACM), 68(6):1–34, 2021.
  • [9] A. Ben-Tal, L. El Ghaoui, and A. Nemirovski. Robust optimization, volume 28. Princeton University Press, 2009.
  • [10] L. Blier and Y. Ollivier. The description length of deep learning models. Advances in Neural Information Processing Systems, 31, 2018.
  • [11] C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight uncertainty in neural networks. In International Conference on Machine Learning, pages 1613–1622. PMLR, 2015.
  • [12] C. Bonferroni. Teoria statistica delle classi e calcolo delle probabilita. Pubblicazioni del R Istituto Superiore di Scienze Economiche e Commericiali di Firenze, 8:3–62, 1936.
  • [13] E. J. Candès, L. Lei, and Z. Ren. Conformalized survival analysis. arXiv preprint arXiv:2103.09763, 2021.
  • [14] T. Chen, E. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning, pages 1683–1691. PMLR, 2014.
  • [15] K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. Advances in Neural Information Processing Systems, 31, 2018.
  • [16] E. Chung and J. P. Romano. Exact and asymptotically robust permutation tests. The Annals of Statistics, 41(2):484–507, 2013.
  • [17] I. Cortés-Ciriano, G. J. Van Westen, G. Bouvier, M. Nilges, J. P. Overington, A. Bender, and T. E. Malliavin. Improved large-scale prediction of growth inhibition patterns using the NCI60 cancer cell line panel. Bioinformatics, 32(1):85–95, 2016.
  • [18] V. Dhiman, M. J. Khojasteh, M. Franceschetti, and N. Atanasov. Control barriers in Bayesian learning of system dynamics. IEEE Transactions on Automatic Control, 2021.
  • [19] A. Drouin, É. Marcotte, and N. Chapados. TACTiS: Transformer-attentional copulas for time series. arXiv preprint arXiv:2202.03528, 2022.
  • [20] B. Efron and T. Hastie. Computer Age Statistical Inference, volume 5. Cambridge University Press, 2016.
  • [21] M. Eklund, U. Norinder, S. Boyer, and L. Carlsson. The application of conformal prediction to the drug discovery process. Annals of Mathematics and Artificial Intelligence, 74(1):117–132, 2015.
  • [22] C. Eliades and H. Papadopoulos. Applying conformal prediction to control an exoskeleton. In Conformal and Probabilistic Prediction and Applications, pages 214–227. PMLR, 2019.
  • [23] A. Farid, D. Snyder, A. Z. Ren, and A. Majumdar. Failure prediction with statistical guarantees for vision-based robot control. arXiv preprint arXiv:2202.05894, 2022.
  • [24] J. F. Fisac, A. Bajcsy, S. L. Herbert, D. Fridovich-Keil, S. Wang, C. J. Tomlin, and A. D. Dragan. Probabilistically safe robot planning with confidence-based human predictions. arXiv preprint arXiv:1806.00109, 2018.
  • [25] P. C. Fishburn. Utility theory. Management Science, 14(5):335–378, 1968.
  • [26] R. Foygel Barber, E. J. Candes, A. Ramdas, and R. J. Tibshirani. The limits of distribution-free conditional predictive inference. Information and Inference: A Journal of the IMA, 10(2):455–482, 2021.
  • [27] Y. Gal and Z. Ghahramani. A theoretically grounded application of dropout in recurrent neural networks. Advances in Neural Information Processing Systems, 29, 2016.
  • [28] Y. Gal, J. Hron, and A. Kendall. Concrete dropout. In NIPS, pages 3581–3590, 2017.
  • [29] J. Gasthaus, K. Benidis, Y. Wang, S. S. Rangapuram, D. Salinas, V. Flunkert, and T. Januschowski. Probabilistic forecasting with spline quantile function RNNs. In AISTATS, pages 1901–1910, 2019.
  • [30] I. Gibbs and E. Candes. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34, 2021.
  • [31] A. Graves. Practical variational inference for neural networks. Advances in Neural Information Processing Systems, 24, 2011.
  • [32] C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning, pages 1321–1330. PMLR, 2017.
  • [33] M. B. Haugh. An introduction to copulas. Lecture notes for IEOR E4602: Quantitative Risk Management at Columbia University, 2016.
  • [34] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [35] X. Hu and J. Lei. A distribution-free test of covariate shift using conformal prediction. arXiv preprint arXiv:2010.07147, 2020.
  • [36] R. Izbicki, G. T. Shimizu, and R. B. Stern. Flexible distribution-free conditional predictive bands using density estimators. arXiv preprint arXiv:1910.05575, 2019.
  • [37] C. Johnstone and B. Cox. Conformal uncertainty sets for robust optimization. In Conformal and Probabilistic Prediction and Applications, pages 72–90. PMLR, 2021.
  • [38] A. Kendall and Y. Gal. What uncertainties do we need in Bayesian deep learning for computer vision? Advances in Neural Information Processing Systems, 30, 2017.
  • [39] B. Kim, C. Xu, and R. Barber. Predictive inference is free with the jackknife+-after-bootstrap. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 4138–4149. Curran Associates, Inc., 2020.
  • [40] D. P. Kingma, T. Salimans, and M. Welling. Variational dropout and the local reparameterization trick. Advances in Neural Information Processing Systems, 28, 2015.
  • [41] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [42] D. Kivaranovic, K. D. Johnson, and H. Leeb. Adaptive, distribution-free prediction intervals for deep networks. In AISTATS, pages 4346–4356, 2020.
  • [43] V. Kuleshov, N. Fenner, and S. Ermon. Accurate uncertainties for deep learning using calibrated regression. In International Conference on Machine Learning, pages 2796–2804. PMLR, 2018.
  • [44] M. Kull, M. Perello Nieto, M. Kängsepp, T. Silva Filho, H. Song, and P. Flach. Beyond temperature scaling: Obtaining well-calibrated multi-class probabilities with Dirichlet calibration. Advances in Neural Information Processing Systems, 32, 2019.
  • [45] B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in Neural Information Processing Systems, 30, 2017.
  • [46] J. Lei, M. G’Sell, A. Rinaldo, R. J. Tibshirani, and L. Wasserman. Distribution-free predictive inference for regression. Journal of the American Statistical Association, 113(523):1094–1111, 2018.
  • [47] J. Lei and L. Wasserman. Distribution free prediction bands. arXiv preprint arXiv:1203.5422, 2012.
  • [48] J. Lei and L. Wasserman. Distribution-free prediction bands for non-parametric regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):71–96, 2014.
  • [49] L. Lei and E. J. Candès. Conformal inference of counterfactuals and individual treatment effects. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2021.
  • [50] C. Louizos and M. Welling. Multiplicative normalizing flows for variational bayesian neural networks. In International Conference on Machine Learning, pages 2218–2227. PMLR, 2017.
  • [51] Z. Lounis and T. P. McAllister. Risk-based decision making for sustainable and resilient infrastructure systems. Journal of Structural Engineering, 142(9):F4016005, 2016.
  • [52] R. Luo, S. Zhao, J. Kuck, B. Ivanovic, S. Savarese, E. Schmerling, and M. Pavone. Sample-efficient safety assurances using conformal prediction. arXiv preprint arXiv:2109.14082, 2021.
  • [53] D. J. MacKay. Bayesian interpolation. Neural Computation, 4(3):415–447, 1992.
  • [54] S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic gradient descent as approximate Bayesian inference. arXiv preprint arXiv:1704.04289, 2017.
  • [55] H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics, pages 50–60, 1947.
  • [56] S. Messoudi, S. Destercke, and S. Rousseau. Conformal multi-target regression using neural networks. In Conformal and Probabilistic Prediction and Applications, pages 65–83. PMLR, 2020.
  • [57] S. Messoudi, S. Destercke, and S. Rousseau. Copula-based conformal prediction for multi-target regression. Pattern Recognition, 120:108101, 2021.
  • [58] M. Minderer, J. Djolonga, R. Romijnders, F. Hubis, X. Zhai, N. Houlsby, D. Tran, and M. Lucic. Revisiting the calibration of modern neural networks. Advances in Neural Information Processing Systems, 34:15682–15694, 2021.
  • [59] D. T. Mirikitani and N. Nikolaev. Recursive Bayesian recurrent neural networks for time-series modeling. IEEE Transactions on Neural Networks, 21(2):262–274, 2009.
  • [60] R. Molina, D. Letson, B. McNoldy, P. Mozumder, and M. Varkony. Striving for improvement: The perceived value of improving hurricane forecast accuracy. Bulletin of the American Meteorological Society, 102(7):E1408–E1423, 2021.
  • [61] R. M. Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
  • [62] R. B. Nelsen. An introduction to copulas. Springer Science & Business Media, 2007.
  • [63] U. Norinder, L. Carlsson, S. Boyer, and M. Eklund. Introducing conformal prediction in predictive modeling: A transparent and flexible alternative to applicability domain determination. Journal of Chemical Information and Modeling, 54(6):1596–1603, 2014.
  • [64] H. Papadopoulos, V. Vovk, and A. Gammerman. Regression conformal prediction with nearest neighbours. Journal of Artificial Intelligence Research, 40:815–840, 2011.
  • [65] A. J. Patton. A review of copula models for economic time series. Journal of Multivariate Analysis, 110:4–18, 2012.
  • [66] T. Pearce, A. Brintrup, M. Zaki, and A. Neely. High-quality prediction intervals for deep learning: A distribution-free, ensembled approach. In ICML, pages 4075–4084. PMLR, 2018.
  • [67] A. Podkopaev and A. Ramdas. Distribution-free uncertainty quantification for classification under label shift. In Uncertainty in Artificial Intelligence, pages 844–853. PMLR, 2021.
  • [68] B. Rémillard, N. Papageorgiou, and F. Soustra. Copula-based semiparametric models for multivariate time series. Journal of Multivariate Analysis, 110:30–42, 2012.
  • [69] Y. Romano, R. F. Barber, C. Sabatti, and E. J. Candès. With malice towards none: Assessing uncertainty via equalized coverage. arXiv preprint arXiv:1908.05428, 2019.
  • [70] Y. Romano, E. Patterson, and E. Candes. Conformalized quantile regression. Advances in Neural Information Processing Systems, 32, 2019.
  • [71] Y. Romano, M. Sesia, and E. Candes. Classification with valid and adaptive coverage. Advances in Neural Information Processing Systems, 33:3581–3591, 2020.
  • [72] L. Ruschendorf. Asymptotic distributions of multivariate rank order statistics. The Annals of Statistics, pages 912–923, 1976.
  • [73] D. Salinas, M. Bohlke-Schneider, L. Callot, R. Medico, and J. Gasthaus. High-dimensional multivariate forecasting with low-rank Gaussian copula processes. Advances in Neural Information Processing Systems, 32, 2019.
  • [74] D. Salinas, V. Flunkert, J. Gasthaus, and T. Januschowski. DeepAR: Probabilistic forecasting with autoregressive recurrent networks. International Journal of Forecasting, 36(3):1181–1191, 2020.
  • [75] M. Sesia and E. J. Candès. A comparison of some conformal quantile regression methods. Stat, 9(1):e261, 2020.
  • [76] G. Shafer and V. Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.
  • [77] Z. Sidak, P. K. Sen, and J. Hajek. Theory of rank tests. Elsevier, 1999.
  • [78] K. Stankevičiūtė, A. Alaa, and M. van der Schaar. Conformal time-series forecasting. In Advances in Neural Information Processing Systems, 2021.
  • [79] I. Steinwart and A. Christmann. Estimating conditional quantiles with the help of the pinball loss. Bernoulli, 17(1):211–225, 2011.
  • [80] D. Sun, S. Jha, and C. Fan. Learning certified control using contraction metric. arXiv preprint arXiv:2011.12569, 2020.
  • [81] F. Svensson, N. Aniceto, U. Norinder, I. Cortes-Ciriano, O. Spjuth, L. Carlsson, and A. Bender. Conformal regression for quantitative structure–activity relationship modeling—quantifying prediction uncertainty. Journal of Chemical Information and Modeling, 58(5):1132–1140, 2018.
  • [82] N. Tagasovska and D. Lopez-Paz. Single-model uncertainties for deep learning. In NeurIPS, pages 6417–6428, 2019.
  • [83] I. Takeuchi, Q. Le, T. Sears, A. Smola, et al. Nonparametric quantile estimation. MIT Press, 2006.
  • [84] J. Teng, Z. Tan, and Y. Yuan. T-SCI: A two-stage conformal inference algorithm with guaranteed coverage for Cox-MLP. In International Conference on Machine Learning, pages 10203–10213. PMLR, 2021.
  • [85] R. J. Tibshirani, R. Foygel Barber, E. Candes, and A. Ramdas. Conformal prediction under covariate shift. Advances in Neural Information Processing Systems, 32, 2019.
  • [86] V. Vovk. Conditional validity of inductive conformal predictors. In Asian Conference on Machine Learning, pages 475–490. PMLR, 2012.
  • [87] V. Vovk, A. Gammerman, and G. Shafer. Algorithmic learning in a random world. Springer Science & Business Media, 2005.
  • [88] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, pages 681–688, 2011.
  • [89] D. Wu, L. Gao, M. Chinazzi, X. Xiong, A. Vespignani, Y.-A. Ma, and R. Yu. Quantifying uncertainty in deep spatiotemporal forecasting. In Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery & Data Mining, pages 1841–1851, 2021.
  • [90] C. Xu and Y. Xie. Conformal prediction interval for dynamic time-series. In International Conference on Machine Learning. PMLR, 2021.
  • [91] Y. Yang and A. K. Kuchibhotla. Finite-sample efficient conformal prediction. arXiv preprint arXiv:2104.13871, 2021.
  • [92] B. Zadrozny and C. Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, volume 1, pages 609–616. Citeseer, 2001.