Transformer Conformal Prediction for Time Series
Abstract
We present a conformal prediction method for time series using the Transformer architecture to capture long-memory and long-range dependencies. Specifically, we use the Transformer decoder as a conditional quantile estimator to predict the quantiles of prediction residuals, which are used to estimate the prediction interval. We hypothesize that the Transformer decoder benefits the estimation of the prediction interval by learning temporal dependencies across past prediction residuals. Our comprehensive experiments using simulated and real data empirically demonstrate the superiority of the proposed method compared to the existing state-of-the-art conformal prediction methods.
1 Introduction
Uncertainty quantification has become crucial in many scientific domains where black-box machine learning models are often used (angelopoulos2021gentle). Conformal prediction has emerged as a popular and modern technique for uncertainty quantification by providing valid predictive inference for those black-box models (shafer2008tutorial; barber2023conformal).
Time series prediction aims to forecast future values based on a sequence of observations that are sequentially ordered in time (box2015time). With recent advances in machine learning, numerous models have been proposed and adopted for various time series prediction tasks. The increased use of black-box machine learning models necessitates uncertainty quantification, particularly in high-stakes time series prediction tasks such as medical event prediction, stock prediction, and weather forecasting.
While conformal prediction can provide valid predictive inference for uncertainty quantification, applying conformal prediction to time series is challenging since time series data often violate the exchangeability assumption. Additionally, real-world time series data typically exhibit significant stochastic variations and strong temporal correlations. Many efforts have been made to develop valid and effective conformal prediction methods for time series (xu2023conformal). Sequential Predictive Conformal Inference (SPCI), a recently proposed conformal prediction framework for time series, has shown state-of-the-art performance by using Quantile Random Forest (meinshausen2006quantile) as a conditional quantile estimator to predict the quantiles of future prediction residuals, which are used to estimate the prediction interval (xu2023sequential).
In this study, we employ the Transformer decoder (vaswani2017attention; radford2018improving) as the conditional quantile estimator in the SPCI framework. Specifically, the Transformer decoder takes a sequence of past residuals and features as input to predict the quantiles of future residuals. Given that the decoder-only Transformer architecture has shown impressive performance in many sequence modeling tasks, we hypothesize that using it in the SPCI framework benefits the estimation of the prediction interval by learning temporal dependencies between the residuals. We empirically demonstrate the superiority of the proposed method through experiments with simulated and real data, comparing it to state-of-the-art conformal prediction methods.
2 Problem setup
Consider a sequence of observations $\{(X_t, Y_t)\}_{t=1,2,\dots}$, where $X_t \in \mathbb{R}^d$ denotes a $d$-dimensional feature vector and $Y_t \in \mathbb{R}$ denotes a continuous scalar outcome. Assume that the first $T$ samples are used for training and validation. We also assume that we have a point prediction method that provides a point prediction for $Y_t$ as $\hat{Y}_t$.
Our goal is to sequentially construct a prediction interval $\hat{C}_{t-1}(X_t)$, starting from time $t = T+1$, that contains the true outcome $Y_t$ with probability at least $1-\alpha$, where the significance level $\alpha$ is pre-defined. $\hat{C}_{t-1}(X_t)$ is constructed using the past observations and predictions up to time $t-1$.
We use the prediction residual (i.e., the prediction error) as the non-conformity score, which is defined as:

$$\hat{\epsilon}_t = Y_t - \hat{Y}_t. \qquad (1)$$
There are two types of coverage guarantees that prediction intervals should satisfy: marginal coverage and conditional coverage. Marginal coverage is defined as follows:

$$\mathbb{P}\big(Y_t \in \hat{C}_{t-1}(X_t)\big) \geq 1 - \alpha. \qquad (2)$$
Conditional coverage is a stronger guarantee that, for each $X_t$, the true observation $Y_t$ is included in $\hat{C}_{t-1}(X_t)$ with probability at least $1-\alpha$, which can be defined as follows:

$$\mathbb{P}\big(Y_t \in \hat{C}_{t-1}(X_t) \mid X_t\big) \geq 1 - \alpha. \qquad (3)$$
If $\hat{C}_{t-1}(X_t)$ satisfies Eq. (2) or Eq. (3), it is called marginally valid or conditionally valid, respectively. The desired aim is to construct $\hat{C}_{t-1}(X_t)$ that is both marginally and conditionally valid.
While infinite-width prediction intervals always satisfy both coverage guarantees, such intervals are pointless since they do not carry any information to quantify uncertainty. Therefore, we aim to minimize the width of prediction intervals while satisfying coverage.
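For concreteness, these two evaluation criteria (empirical coverage and average interval width) can be computed from a sequence of prediction intervals as in the following minimal sketch; the function and variable names are ours, not from any released implementation.

```python
import numpy as np

def coverage_and_width(y_true, lower, upper):
    """Empirical coverage and mean width of prediction intervals.

    y_true, lower, upper: 1-D arrays aligned over the evaluation horizon.
    """
    y_true, lower, upper = map(np.asarray, (y_true, lower, upper))
    covered = (y_true >= lower) & (y_true <= upper)
    coverage = covered.mean()        # ideally close to (or above) 1 - alpha
    width = (upper - lower).mean()   # smaller is better at a fixed coverage
    return coverage, width
```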
3 Method
3.1 Sequential Predictive Conformal Inference
Sequential Predictive Conformal Inference (SPCI) is a conformal prediction framework for time series proposed by (xu2023sequential). SPCI adopts Quantile Random Forest (meinshausen2006quantile) as a conditional quantile estimator to sequentially estimate the conditional quantiles of the prediction residuals, leveraging the dependencies among past residuals. Specifically, SPCI estimates prediction intervals as follows:
$$\hat{C}_{t-1}(X_t) = \left[\hat{Y}_t + \hat{Q}_t(\hat{\beta}),\ \hat{Y}_t + \hat{Q}_t(1 - \alpha + \hat{\beta})\right], \qquad (4)$$

where $\hat{Q}_t(\cdot)$ denotes a conditional quantile estimator, $\hat{Q}_t(p)$ is the estimate of the true $p$-th quantile of $\hat{\epsilon}_t$, and $\hat{\beta}$ denotes the value that minimizes the width of the prediction interval:

$$\hat{\beta} = \operatorname*{arg\,min}_{\beta \in [0, \alpha]} \left(\hat{Q}_t(1 - \alpha + \beta) - \hat{Q}_t(\beta)\right).$$
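As an illustration, a minimal sketch of this interval construction is given below, assuming a fitted conditional quantile estimator is available as a callable `q_hat(p)` returning the estimated $p$-th residual quantile; the grid search over $\beta$ follows Eq. (4), and all names are ours.

```python
import numpy as np

def spci_interval(y_point, q_hat, alpha, n_grid=100):
    """Construct the SPCI-style prediction interval around a point forecast.

    y_point : point prediction for Y_t
    q_hat   : callable, q_hat(p) -> estimated p-th quantile of the residual
    alpha   : significance level (target coverage 1 - alpha)
    """
    betas = np.linspace(0.0, alpha, n_grid)
    widths = [q_hat(1 - alpha + b) - q_hat(b) for b in betas]
    beta_star = betas[int(np.argmin(widths))]   # beta minimizing interval width
    lower = y_point + q_hat(beta_star)
    upper = y_point + q_hat(1 - alpha + beta_star)
    return lower, upper
```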
Table 1: Empirical coverage and interval width of SPCI-T and the baselines on the two simulated datasets.

| | Non-stationary | | Heteroskedasticity | |
|---|---|---|---|---|
| | Coverage | Width | Coverage | Width |
| SPCI-T | | | | |
| SPCI | | | | |
| EnbPI | | | | |
| NexCP | | | | |
Table 2: Multi-step prediction results of SPCI-T on the two simulated datasets.

| | Non-stationary | | Heteroskedasticity | |
|---|---|---|---|---|
| | Coverage | Width | Coverage | Width |
Table 3: Empirical coverage and interval width of SPCI-T and the baselines on the three real datasets (first window-length setting).

| | Wind | | Electricity | | Solar | |
|---|---|---|---|---|---|---|
| | Coverage | Width | Coverage | Width | Coverage | Width |
| SPCI-T | | | | | | |
| SPCI | | | | | | |
| EnbPI | | | | | | |
| NexCP | | | | | | |
Table 4: Empirical coverage and interval width of SPCI-T and the baselines on the three real datasets (second window-length setting).

| | Wind | | Electricity | | Solar | |
|---|---|---|---|---|---|---|
| | Coverage | Width | Coverage | Width | Coverage | Width |
| SPCI-T | | | | | | |
| SPCI | | | | | | |
| EnbPI | | | | | | |
| NexCP | | | | | | |
3.2 Conformal Prediction for Time Series with Transformer
In this study, we employ the Transformer (vaswani2017attention) as a conditional quantile estimator to estimate the true quantiles of $\hat{\epsilon}_t$ within the SPCI framework. Specifically, we use the decoder-only architecture (radford2018improving) since it can generalize to variable-length sequences without strictly partitioning the sequences for encoding and decoding. Throughout the paper, we refer to the proposed method as Sequential Predictive Conformal Inference with Transformer (SPCI-T).
The past residuals and features are used as input for the model to predict the quantiles of the future unobserved residuals. Specifically, the past window of residuals $(\hat{\epsilon}_{t-w}, \dots, \hat{\epsilon}_{t-1})$ is used to predict the quantiles of $\hat{\epsilon}_t$, where $w$ denotes the window length. A fully connected layer without activation converts the input to the input representation of the model dimension. Note that the input can include other features that are useful for prediction besides the residuals. Figure 1 visually describes the model architecture.
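A minimal PyTorch sketch of such a decoder-only quantile model is shown below, covering both the input projection described above and the output projection described next. It assumes a GPT-style stack implemented as a causally masked `nn.TransformerEncoder` with learned positional embeddings; layer sizes, the positional-encoding choice, and all names are illustrative rather than the exact released implementation.

```python
import torch
import torch.nn as nn

class DecoderQuantileNet(nn.Module):
    """Maps a window of (residual, feature) inputs to predicted residual quantiles."""

    def __init__(self, in_dim, d_model=16, n_heads=4, n_layers=4,
                 n_quantiles=3, dropout=0.2, max_len=512):
        super().__init__()
        self.in_proj = nn.Linear(in_dim, d_model)      # FC layer, no activation
        self.pos_emb = nn.Embedding(max_len, d_model)  # learned positions (assumed)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model,
            dropout=dropout, batch_first=True)
        # A causal self-attention mask turns the encoder stack into a decoder-only model.
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, n_quantiles)  # FC layer, no activation

    def forward(self, x):
        # x: (batch, seq_len, in_dim) of past residuals and features
        seq_len = x.size(1)
        pos = torch.arange(seq_len, device=x.device)
        h = self.in_proj(x) + self.pos_emb(pos)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(x.device)
        h = self.blocks(h, mask=mask)
        # One set of quantile predictions per position, i.e. for the next residual.
        return self.out_proj(h)      # (batch, seq_len, n_quantiles)
```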
A fully connected layer without activation transforms the output representation into the predicted quantiles of $\hat{\epsilon}_t$. Training is done by sequentially minimizing the quantile loss as follows:

$$\mathcal{L}_\rho(\hat{q}, \hat{\epsilon}_t) = \rho \max(\hat{\epsilon}_t - \hat{q}, 0) + (1 - \rho) \max(\hat{q} - \hat{\epsilon}_t, 0), \qquad (5)$$

where $\rho$ is the target quantile level and $\hat{q}$ is the predicted value of $\hat{\epsilon}_t$ corresponding to the target quantile.
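The quantile (pinball) loss in Eq. (5) can be implemented for a batch of predicted quantiles as in the following sketch; the set of target levels (e.g., lower bound, median, upper bound) and the function name are assumptions on our part.

```python
import torch

def pinball_loss(pred, target, quantiles):
    """Average pinball loss over a set of target quantile levels.

    pred      : (..., n_quantiles) predicted residual quantiles
    target    : (...,) observed residuals
    quantiles : levels rho in (0, 1), e.g. [alpha / 2, 0.5, 1 - alpha / 2]
    """
    losses = []
    for i, rho in enumerate(quantiles):
        diff = target - pred[..., i]
        # rho * (y - q) if y >= q, otherwise (1 - rho) * (q - y)
        losses.append(torch.maximum(rho * diff, (rho - 1) * diff))
    return torch.stack(losses).mean()
```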
We hypothesize that using the Transformer decoder as a conditional quantile estimator can offer the following advantages:
- The Transformer decoder can effectively learn temporal dependencies, including long-term dependencies, across residuals.
- We can incorporate additional features, such as $X_t$, for conditional quantile estimation, allowing the Transformer to learn potential dependencies between these additional features and the residuals.
- The Transformer decoder can perform multi-step predictions using known features through generative inference, without needing explicit residuals as input for prediction (see the sketch following this list).
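As one possible reading of the last point, the following hedged sketch rolls the model forward over several steps using only known future features, feeding the predicted median residual back in place of the unobserved one; the exact feedback rule used in the original implementation may differ.

```python
import torch

@torch.no_grad()
def multi_step_quantiles(model, window, future_feats, median_idx=1):
    """Predict residual quantiles several steps ahead via generative inference.

    window       : (1, w, in_dim) tensor of past [residual, feature] inputs
    future_feats : (s, in_dim - 1) tensor of known features for the next s steps
    median_idx   : index of the median quantile in the model output (assumed)
    """
    preds = []
    for feat in future_feats:
        q = model(window)[:, -1, :]       # (1, n_quantiles) for the next residual
        preds.append(q[0])
        # The true residual is unobserved, so feed the predicted median back in.
        next_step = torch.cat([q[0, median_idx:median_idx + 1], feat])
        window = torch.cat([window[:, 1:], next_step.view(1, 1, -1)], dim=1)
    return torch.stack(preds)             # (s, n_quantiles)
```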
4 Experiments
We evaluate SPCI-T and baselines on simulated and real data. The code for all experiments is available at anonymousurl. Hyperparameters and implementation details are available in Appendix A.
4.1 Setup
We first obtain point predictions $\hat{Y}_t$ and the corresponding residuals $\hat{\epsilon}_t$ for all $t$ by using leave-one-out (LOO) point predictors in all experiments. We use an ensemble of 25 random forests as the LOO point predictor.
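One common way to build such a leave-one-out ensemble, in the spirit of EnbPI, is to bootstrap the training set and, for each training point, aggregate only the forests whose bootstrap sample excluded that point; the sketch below follows this recipe, but the exact aggregation in the original code may differ.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def loo_ensemble(X_train, y_train, X_test, n_models=25, seed=0):
    """Bootstrap ensemble of random forests with leave-one-out training predictions.

    X_train, y_train, X_test are numpy arrays.
    """
    rng = np.random.default_rng(seed)
    n = len(X_train)
    in_bag = np.zeros((n_models, n), dtype=bool)
    models = []
    for b in range(n_models):
        idx = rng.integers(0, n, size=n)                 # bootstrap resample
        in_bag[b, np.unique(idx)] = True
        models.append(RandomForestRegressor(random_state=b).fit(X_train[idx], y_train[idx]))
    train_preds = np.stack([m.predict(X_train) for m in models])   # (B, n)
    test_preds = np.stack([m.predict(X_test) for m in models])     # (B, n_test)
    # Leave-one-out: average only over forests that did not see point i in training.
    oob = ~in_bag
    loo_train = np.where(oob.any(axis=0),
                         (train_preds * oob).sum(0) / np.maximum(oob.sum(0), 1),
                         train_preds.mean(0))
    residuals = y_train - loo_train
    return loo_train, test_preds.mean(0), residuals
```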
We use state-of-the-art conformal prediction methods as baselines, which include SPCI (xu2023sequential), EnbPI (xu2021conformal), and NexCP (barber2023conformal). We evaluate SPCI-T and the baselines in terms of interval coverage and width. We also evaluate the multi-step prediction of SPCI-T. In the multi-step prediction setup, we aim to estimate prediction intervals several steps ahead, assuming that only the known features are available for multi-step prediction.
Table 5: Multi-step prediction results of SPCI-T on the electricity and solar datasets.

| | Electricity | | Solar | |
|---|---|---|---|---|
| | Coverage | Width | Coverage | Width |
4.2 Simulation
Dataset
We generate two simulated time series datasets. The data generating process follows $Y_t = f(X_t) + \epsilon_t$. The first dataset is a non-stationary time series, which contains periodicity and autoregressive noise. The second dataset is a time series with heteroskedastic errors, where the variance of $\epsilon_t$ depends on $X_t$. Details on the simulated data are provided in Appendix A.
Results
Table 1 shows the empirical coverage and interval width of SPCI-T and the baselines on the two simulated datasets. We observe that SPCI-T achieves the narrowest interval width without losing coverage. Figure 2 displays the prediction intervals estimated by SPCI-T and the baselines on the non-stationary dataset, confirming that SPCI-T obtains a significantly narrower interval width compared to the baselines. Table 2 presents the multi-step prediction results of SPCI-T. While SPCI-T maintains its prediction interval width, it loses coverage for multi-step prediction as the prediction horizon grows.
4.3 Real Data Experiments
Dataset
We use three real-world time series datasets: solar, wind, and electricity. The solar dataset (zhang2021solar) contains solar radiation information in downtown Atlanta measured in terms of diffuse horizontal irradiance, provided by the United States National Solar Radiation Database. The wind dataset consists of wind speed records measured by the Midcontinent Independent System Operator every 15 minutes over a one-week period in September 2020 (zhu2021multi). The electricity dataset contains electricity usage and pricing in the states of New South Wales and Victoria in Australia, observed between 1996 and 1999 (harries1999splice). All three are widely adopted benchmark datasets in the conformal prediction literature. Details of the three datasets are provided in Appendix A.
Results
Tables 3 and 4 show the empirical coverage and interval width of SPCI-T and the baselines on the three real datasets under the two past-window-length settings, respectively. We observe that SPCI-T consistently outperforms all other baselines. We also observe that the interval width of SPCI-T is narrower with the longer window, which shows that SPCI-T utilizes the long history of residuals by learning dependencies. Table 5 shows the multi-step prediction results of SPCI-T on the electricity and solar datasets. We exclude the results on the wind dataset since its coverage deteriorated significantly in the multi-step setting. Similarly to the simulation, we observed that SPCI-T lost coverage as the prediction horizon increased.
We add additional time features to each $X_t$ in the solar dataset to see how these additional features influence the performance of SPCI-T. Table 6 presents the empirical coverage and interval width of SPCI-T and the baselines on the solar dataset with the additional features. SPCI-T again consistently outperforms the baselines and shows significantly improved performance compared to the experiment on the solar dataset without the additional features. These results empirically demonstrate the advantage of SPCI-T in utilizing features together with the residuals for conditional quantile estimation.
Table 6: Empirical coverage and interval width of SPCI-T and the baselines on the solar dataset with additional time features.

| | Solar with additional features | | | |
|---|---|---|---|---|
| | Coverage | Width | Coverage | Width |
| SPCI-T | | | | |
| SPCI | | | | |
| EnbPI | | | | |
| NexCP | | | | |
5 Conclusion
In this study, we propose SPCI-T, which incorporates a Transformer decoder into the recently developed SPCI framework. SPCI-T uses a Transformer decoder to learn temporal dependencies between the residuals and features to predict the conditional quantile of future residuals, which are then used to estimate the prediction interval. Our simulated and real data experiments empirically demonstrate the superiority of SPCI-T. Future directions include tailoring the model architecture for the time series conformal prediction task and conducting more comprehensive evaluations with state-of-the-art methods.
Accessibility
Software and Data
Datasets used in simulation and real data experiments are available at anonymousurl.
Acknowledgements
Anonymous acknowledgments.
Appendix A Additional Details of Experiments
A.1 Details of Simulated Data
Non-stationary time series
We fix the feature dimension and the length of the series. We first generate each dimension of $X_t$ independently for all $t$. Then, $Y_t$ is obtained by combining a periodic component in $t$ with a term that depends on $X_t$ through a sparse coefficient vector $\beta$, plus an autoregressive noise term $\epsilon_t$. We sample $\epsilon_t$ from an AR(1) process as $\epsilon_t = \phi\,\epsilon_{t-1} + w_t$, where $w_t$ is i.i.d. generated from a normal distribution with zero mean and unit variance. $\beta$ is a sparse vector having only 20% non-zero elements, with the non-zero elements drawn at random.
Time series with heteroskedastic error
We generate $X_t$ similarly to the non-stationary time series for all $t$. Then, $Y_t$ is obtained from $X_t$ in a similar way as in the non-stationary time series. We also sample $\epsilon_t$ from an AR(1) process as in the non-stationary case, but with the variance of $\epsilon_t$ dependent on $X_t$.
A.2 Details of Real Data
Solar data
The solar dataset contains solar radiation information recorded every 30 minutes in 2018 in downtown Atlanta (zhang2021solar), provided by the United States National Solar Radiation Database (sengupta2018national). We used seven covariates: Direct Normal Irradiance (DNI), dew point, surface albedo, wind speed, relative humidity, temperature, and pressure. The outcome variable is Diffuse Horizontal Irradiance (DHI), which reflects radiation levels. For the additional time features, we converted the 24-hour period into 24 hourly one-hot encoded features, adding 24 additional features.
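For example, the hourly one-hot time features could be constructed as in this short sketch, assuming timestamps are available as a datetime-like column; the column names are illustrative.

```python
import numpy as np
import pandas as pd

def add_hourly_onehot(df, time_col="timestamp"):
    """Append 24 one-hot columns encoding the hour of day to the feature frame."""
    hours = pd.to_datetime(df[time_col]).dt.hour.to_numpy()
    onehot = np.eye(24, dtype=float)[hours]              # (n_samples, 24)
    hour_cols = [f"hour_{h}" for h in range(24)]
    hour_df = pd.DataFrame(onehot, columns=hour_cols, index=df.index)
    return pd.concat([df, hour_df], axis=1)
```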
Wind data
The wind dataset contains wind speed data (measured in m/s) from wind farms operated by the Midcontinent Independent System Operator (MISO) in the United States (zhu2021multi). This 10-dimensional dataset recorded wind speed every 15 minutes over a one-week period in September 2020.
Electricity data
The electricity dataset recorded electricity usage and pricing in the states of New South Wales and Victoria in Australia, every 30 minutes over a 2.5 year period between 1996 and 1999 (harries1999splice). We used four covariates: nswprice and vicprice, the price of electricity in each of the two states; and nswdemand and vicdemand, the usage demand in each of the two states. The outcome variable is transfer, which is the quantity of electricity transferred between the two states.
A.3 Hyperparameters and Implementation Details
Table 7 shows the hyperparameters chosen for SPCI-T. We conducted a grid search using the training and validation sets to find the optimal hyperparameters. The model dimension, number of heads, and number of layers were chosen at the point where performance plateaued. Since SPCI-T requires a validation set to select the best model during training, we split the datasets into training, validation, and test sets with an 8:1:1 ratio for SPCI-T. For all other baselines, the datasets were split into training and test sets with a 9:1 ratio. For a fair comparison in terms of data usage, we additionally trained SPCI-T on the validation set after it was initially trained on the training set; this additional training was conducted for 10% of the number of epochs used for the initial training.
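A minimal sketch of the chronological split and the short additional training pass described above is given below; the split ratios come from the text, while everything else (names, epoch counts) is illustrative.

```python
def chronological_split(n, train_frac=0.8, val_frac=0.1):
    """Index slices for a time-ordered train/validation/test split (8:1:1 by default)."""
    t_end = int(n * train_frac)
    v_end = int(n * (train_frac + val_frac))
    return slice(0, t_end), slice(t_end, v_end), slice(v_end, n)

# Illustrative usage: after training on the training slice for num_epochs,
# continue training on the validation slice for 10% as many epochs.
# extra_epochs = max(1, int(0.1 * num_epochs))
```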
Table 7: Hyperparameters of SPCI-T for each dataset (the first two columns are the simulated datasets; the remaining four are the real datasets).

| | Non-stationary | Heteroskedastic | Wind | Electricity | Solar | Solar w/ add. features |
|---|---|---|---|---|---|---|
| batch size | 4 | 4 | 4 | 4 | 4 | 4 |
| learning rate | 0.0001 | 0.0001 | 0.0001 | 0.0001 | 0.0005 | 0.0005 |
| model dimension | 16 | 16 | 32 | 16 | 16 | 32 |
| number of heads | 4 | 4 | 4 | 4 | 4 | 4 |
| number of layers | 4 | 4 | 4 | 4 | 4 | 4 |
| dropout | 0.2 | 0.2 | 0.1 | 0.2 | 0.2 | 0.2 |
| additional training | N | N | Y | Y | Y | Y |