
Uncertainty Intervals for Prediction Errors in Time Series Forecasting

Hui Xu Department of Statistics, Stanford University. E-mail: [email protected]    Song Mei Department of Statistics and EECS, University of California, Berkeley. E-mail: [email protected]    Stephen Bates Department of Statistics and EECS, University of California, Berkeley. E-mail: [email protected]    Jonathan Taylor Department of Statistics, Stanford University. E-mail: [email protected]    Robert Tibshirani Departments of Biomedical Data Science and Statistics, Stanford University. E-mail: [email protected]
Abstract

Inference for prediction errors is critical in time series forecasting pipelines. However, providing statistically meaningful uncertainty intervals for prediction errors remains relatively under-explored. Practitioners often resort to forward cross-validation (FCV) for obtaining point estimators and constructing confidence intervals based on the Central Limit Theorem (CLT). The naive version assumes independence, a condition that is usually invalid due to time correlation. These approaches lack statistical interpretations and theoretical justifications even under stationarity.

This paper systematically investigates uncertainty intervals for prediction errors in time series forecasting. We first distinguish two key inferential targets: the stochastic test error over near future data points, and the expected test error as the expectation of the former. The stochastic test error is often more relevant in applications needing to quantify uncertainty over individual time series instances. To construct prediction intervals for the stochastic test error, we propose the quantile-based forward cross-validation (QFCV) method. Under an ergodicity assumption, QFCV intervals have asymptotically valid coverage and are shorter than marginal empirical quantiles. In addition, we also illustrate why naive CLT-based FCV intervals fail to provide valid uncertainty intervals, even with certain corrections. For non-stationary time series, we further provide rolling intervals by combining QFCV with adaptive conformal prediction to give time-average coverage guarantees. Overall, we advocate the use of QFCV procedures and demonstrate their coverage and efficiency through simulations and real data examples.

Code for our experiments is available at https://github.com/huixu18/QFCV.

1 Introduction

Inference for prediction errors is crucial in time series forecasting [Ham20], with applications in a variety of domains including economics and finance [FVD00, Coc97], healthcare [LLC+15], supply chain [Avi03, Gil05], and energy forecasting [DZY+17]. While many point estimators for prediction error have been proposed [Rac00, BB11, CTM20, PAA+22, HAB23], uncertainty quantification for prediction error in time series remains relatively under-explored. Indeed, existing naive methods for uncertainty quantification of prediction errors tend to lack statistically meaningful interpretations.

In particular, a widely used approach for inference for prediction error is cross-validation (CV) [All74, Gei75, Sto74]. Cross-validation is a resampling method that uses the prediction error of sub-sampled folds to infer the true prediction error. Although widely studied for independent, identically distributed (IID) data, cross-validation is often adapted to time series settings. To account for the temporal correlation in time series, each fold usually comprises continuous time windows instead of IID-resampled time indices. After dividing into multiple folds, practitioners usually use the mean and standard deviation of the validation errors to give an uncertainty interval for the prediction error, similar to the IID case. We call these CLT-based uncertainty intervals the “naive forward cross-validation” (FCV) intervals. However, it is not clear whether naive FCV intervals provide statistically meaningful coverage for prediction errors.

To provide statistically meaningful uncertainty quantification, we should first clarify our inferential target and define prediction error precisely in the time series setting. Suppose we are given a forecaster trained on historical data. A natural notion of the inferential target is the forecaster’s average loss over near future test points, a random variable called the stochastic test error (Errsto\textup{Err}_{\rm sto}). Another natural target is Err=𝔼[Errsto]\textup{Err}={\mathbb{E}}[\textup{Err}_{\rm sto}], the expectation of Errsto\textup{Err}_{\rm sto} over all randomness including training and test data, called the expected test error. The mathematical formulations of these targets are in Section 2. Note that these two prediction errors are often very close in the IID setting per the law of large numbers (LLN), since we are often interested in predicting a large number of IID test points. However, in time series settings, we are typically not interested in the long-term performance of forecasters, and the LLN often does not apply. To distinguish the two, we call uncertainty intervals for Errsto\textup{Err}_{\rm sto} prediction intervals, and those for Err confidence intervals.

In time series applications, providing coverage of the stochastic test error Errsto\textup{Err}_{\rm sto} is often more intuitive and useful than coverage of the expected test error Err. For example, energy companies are typically interested in the fluctuation of the actual prediction error of power demand for the next few days, rather than the range of the hypothetical expected error. This paper therefore focuses primarily on providing prediction intervals for Errsto\textup{Err}_{\rm sto}, but also discusses confidence intervals for Err.

The second important ingredient in providing statistically meaningful uncertainty quantification is to make appropriate statistical assumptions on the time series. While strong parametric assumptions like ARMA models are undesirable, certain distributional assumptions are needed for statistical inference. In this paper, we start with the assumption that the time series is stationary. This assumption seems very strong, but to our knowledge, there are no existing statistically meaningful uncertainty intervals for prediction errors in stationary time series. Starting with stationarity allows an initial meaningful step towards uncertainty quantification for prediction errors in time series.

After clarifying the inferential target and statistical assumptions, we are ready to discuss inferential procedures for prediction errors. In this paper, we introduce and advocate quantile-based forward cross-validation (QFCV) prediction intervals, which provide statistically meaningful intervals with asymptotically valid coverage for prediction errors (Section 3). We will also illustrate why naive FCV intervals fail to provide valid uncertainty intervals, even with certain corrections (Section 4). Finally, we will consider non-stationary time series, and provide rolling intervals by combining QFCV with adaptive conformal prediction methods to give certain coverage guarantees under non-stationarity (Section 5).

1.1 An illustrative example

Figure 1: Stochastic and expected test errors, along with naive FCV and QFCV intervals for 50 simulation instances. Black points show the stochastic test error per instance. The black dashed line is the expected test error. Red and blue bars represent naive FCV and QFCV intervals, respectively, for each run. Orange dashed lines indicate the 5%5\% and 95%95\% empirical quantiles of stochastic test errors over 500500 instances.

We provide a simulation example to illustrate the difference between stochastic and expected test errors in time series forecasting, and compare naive FCV and QFCV methods, where the former is a naive CLT-based uncertainty interval (Eq (12) in Section 4.1) as a baseline and the latter is our proposed uncertainty interval that will be elaborated in Algorithm 1 in Section 3. At each time step t{1,,1000}t\in\{1,\ldots,1000\}, we observe a feature vector xtpx_{t}\in{\mathbb{R}}^{p} with p=20p=20 and IID standard normal entries, and an outcome yt=xt𝖳β+εty_{t}=x_{t}^{\sf{T}}\beta+\varepsilon_{t} following a linear model. Here β=(1,1,1,1,0,)𝖳p\beta=(1,1,1,1,0,\ldots)^{\sf{T}}\in\mathbb{R}^{p}, and the noise {εt}t1\{\varepsilon_{t}\}_{t\geq 1} is an AR(1) process with parameter ϕ=0.5\phi=0.5. The forecaster is a Lasso regression trained on the last 4040 time indices (961961 to 10001000). Our target is the prediction error over the unseen future 2020 data points (10011001 to 10201020).
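For concreteness, the following is a minimal sketch of this data-generating process and forecaster, not the released experiment code; the Lasso regularization level and the random seed are assumptions made for illustration.

```python
# Sketch of the illustrative example: AR(1) noise, linear signal, Lasso forecaster.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)          # seed is an arbitrary choice
n, p, phi = 1000, 20, 0.5
n_tr, n_te = 40, 20

beta = np.zeros(p)
beta[:4] = 1.0

# AR(1) noise: eps_t = phi * eps_{t-1} + eta_t with eta_t ~ N(0, 1).
eta = rng.standard_normal(n + n_te)
eps = np.zeros(n + n_te)
for t in range(1, n + n_te):
    eps[t] = phi * eps[t - 1] + eta[t]

X = rng.standard_normal((n + n_te, p))
y = X @ beta + eps

# Lasso forecaster trained on the most recent n_tr observed points.
model = Lasso(alpha=0.1)                # regularization level is an assumption
model.fit(X[n - n_tr:n], y[n - n_tr:n])

# Stochastic test error over the next n_te unseen points (squared loss).
err_sto = np.mean((model.predict(X[n:n + n_te]) - y[n:n + n_te]) ** 2)
```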

We calculate the coverage of naive FCV and QFCV intervals over 500500 simulation instances. Figure 1 shows these intervals for the first 50 instances. Numerical results demonstrate that QFCV attains the 90%90\% nominal coverage for the stochastic test error, which we theoretically justify in Section 3. In contrast, FCV severely undercovers the stochastic test error at only 12.8%12.8\% coverage, and slightly undercovers the expected test error at 78.0%78.0\% coverage. In Section 4, we explain the issues with naive FCV and investigate certain correction approaches. Overall, this example illustrates the difference between stochastic and expected test errors in time series, and shows that QFCV provides valid coverage while naive FCV does not.

1.2 Our contribution

This paper provides a systematic investigation of inference for prediction errors in time series forecasting. We identify two inferential targets, the stochastic test error and the expected test error, and discuss different methods for providing uncertainty intervals. Our contribution is three-fold.

  • We propose quantile-based forward cross-validation (QFCV) methods to provide prediction intervals for stochastic test error (Section 3). We prove that QFCV methods have asymptotically valid coverage under the assumption of ergodicity. Through numerical simulations, we show that QFCV methods give well-calibrated coverages. In particular, when choosing an appropriate auxiliary function, QFCV provides shorter intervals than the naive empirical quantile method, especially for smooth time series.

  • We examine forward cross-validation (FCV) methods, which are CLT-based procedures adapted from cross-validation in the IID setting (Section 4). These methods seem like a natural choice for practitioners. We identify issues with the naive FCV method that underlie its failure to cover the stochastic and expected test errors. We also investigate two modifications to naive FCV that partially address these problems. However, these modifications do not perfectly restore valid coverage without strong assumptions.

  • For non-stationary time series, we provide rolling intervals by combining the QFCV method with a variant of adaptive conformal inference, which we name the AQFCV method (Section 5). The AQFCV method produces rolling intervals with asymptotically valid time-average coverage under arbitrary non-stationarity. We articulate the difference between time-average coverage and the frequentist’s notion of instance-average coverage. Finally, we demonstrate AQFCV’s performance on both simulated and real-world time series data.

1.3 Related work

Cross validation with independent data

Cross-validation (CV) [All74, Gei75, Sto74] is used ubiquitously to estimate the prediction error of a model. A comprehensive review of CV methods is presented in [AC10]. Despite the simplicity of CV, the seemingly basic question “what is the inferential target of cross-validation?” has engendered considerable debate [Zha95, HTFF09, You20, RT19, Wag20, BHT21].

Turning to the question of inference for prediction error, extensive studies have provided various intervals of prediction error based on CV [Die98, NB99, MTBH05, Efr83, ET97, You21, LPvdL15, BPvdL20, AZ20, BBJM20, BHT21]. In particular, [Die98, NB99, MTBH05] provide estimators of the CV standard error based on either the sample splitting method or the moment method. Works such as [Efr83, ET97, You21] provide estimators of the standard error of the bootstrap estimate of the prediction error, based on the influence function method. Asymptotically consistent estimators of the CV standard error were provided in [LPvdL15, BPvdL20, AZ20, BBJM20]. Furthermore, [BHT21] introduced the nested CV method, which produces an unbiased estimator of the mean squared error for the CV point estimator. However, all these estimators are designed for datasets with independent data points and do not extend to time series datasets, the setting considered in this work.

Finally, we remark that CV is often used for comparing predictive models, such as in model selection or determining optimal values for hyperparameters of learning algorithms [SEJS86, Sha93, Zha93, Yan07, FH20, VS06, TT09, Wag20, Lei20].

Inference for test error in a time series dataset

In the time series setting, many variants of cross-validation methods have been proposed to provide point estimators for prediction errors [BCN94, Rac00, BB11, BB12, BCB14, BHK18, CTM20, PAA+22, HAB23]. The basic idea is to only use past data as training data and future data as test data. Time series prediction error estimation has several different setups (see, for example, [HAB23]). The setup we adopt is the “rolling window” approach, wherein a fixed size of the training dataset is used to fit the forecaster. This setup is quite common in scenarios where the available dataset size is large [MWMC18]. Furthermore, in this setup, inference methods could have meaningful statistical interpretation under only the stationarity assumption. Another setup, the “expanding window” setting, uses an incrementally expanding training dataset to fit the forecaster over time. In this context, it is challenging to confer meaningful statistical interpretation to any prediction error estimator, even under stationary conditions.

In this work, instead of providing point estimators, our primary focus is to provide prediction intervals and confidence intervals for prediction errors within the context of time series analysis. To the best of our knowledge, there seems to be no existing literature addressing this particular topic.

Conformal inference in time series

A recent line of work focuses on predictive inference (i.e., providing intervals for prediction) in the time series setting [GC21, GC22, FBR22, BWXB23, ZFG+22, XX22, SY22, LTS22]. These works are mostly based on adaptive conformal inference (ACI) [GC21, GC22], a term derived from “conformal prediction” methods [VGS99, SV08, AB21]. Utilizing online learning techniques, the ACI method produces a sequence of prediction intervals with an asymptotic guarantee of valid time-average coverage.

Our work differs in its focus, with the inferential target being the prediction error, as opposed to the prediction value, the target of the ACI method [GC21]. However, our AQFCV method borrows the techniques of ACI to provide prediction intervals for prediction error, also providing an asymptotic guarantee of valid time-average coverage.

2 Settings and notations

Figure 2: Diagram illustration of prediction error evaluation in time series forecasting.

In the context of time series forecasting, we are provided with a stream of data points {zt}t1𝒵𝒳×𝒴\{z_{t}\}_{t\geq 1}\subseteq{\mathcal{Z}}\equiv{\mathcal{X}}\times{\mathcal{Y}}, with zt=(xt,yt)z_{t}=(x_{t},y_{t}). In time series without additional context xtx_{t}, we can consider the context xtx_{t} as a constant value of 11. By the time index n{n}, we have observed the first n{n} pairs z1:n𝒵nz_{1:{n}}\in{\mathcal{Z}}^{n}, and we aim to sequentially predict the forthcoming n𝗍𝖾{n_{\sf te}} outcomes yn+1:n+n𝗍𝖾y_{{n}+1:{n}+{n_{\sf te}}} using the contexts xn+1:n+n𝗍𝖾x_{{n}+1:{n}+{n_{\sf te}}}.

Stochastic prediction error.

We begin with a function f^:𝒳×𝒵n𝗍𝗋𝒴\hat{f}:{\mathcal{X}}\times{\mathcal{Z}}^{{n_{\sf tr}}}\to{\mathcal{Y}} that forecasts future responses, along with a loss function :𝒴×𝒴\ell:{\mathcal{Y}}\times{\mathcal{Y}}\to{\mathbb{R}}. An interesting inferential target is the stochastic prediction error Errsto\textup{Err}_{\rm sto}, representing the average loss over the next n𝗍𝖾{n_{\sf te}} data points, where the forecaster is trained on the most recent n𝗍𝗋{n_{\sf tr}} data points:

Errsto=1n𝗍𝖾t=n+1n+n𝗍𝖾(f^(xt;znn𝗍𝗋+1:n),yt).\textup{Err}_{\rm sto}=\frac{1}{{n_{\sf te}}}\sum_{t={n}+1}^{{n}+{n_{\sf te}}}\ell(\hat{f}(x_{t};z_{{n}-{n_{\sf tr}}+1:{n}}),y_{t}). (1)

See Figure 2 for a diagram illustration of prediction error evaluation in time series forecasting.

In a more general scenario, we permit the forecaster to provide varying forecasts for different future time steps. In other words, we have a vector function f^:𝒳n𝗍𝖾×𝒵n𝗍𝗋𝒴n𝗍𝖾\hat{f}:{\mathcal{X}}^{{n_{\sf te}}}\times{\mathcal{Z}}^{n_{\sf tr}}\to{\mathcal{Y}}^{{n_{\sf te}}} that is trained on the last n𝗍𝗋{n_{\sf tr}} data points and is designed to predict the next n𝗍𝖾{n_{\sf te}} data points. The stochastic prediction error, in this case, can be similarly expressed as:

Errsto=\displaystyle\textup{Err}_{\rm sto}= 1n𝗍𝖾t=n+1n+n𝗍𝖾(f^tn(xn+1:n+n𝗍𝖾;znn𝗍𝗋+1:n),yt).\leavevmode\nobreak\ \frac{1}{{n_{\sf te}}}\sum_{t={n}+1}^{{n}+{n_{\sf te}}}\ell(\hat{f}_{t-{n}}(x_{{n}+1:{n}+{n_{\sf te}}};z_{{n}-{n_{\sf tr}}+1:{n}}),y_{t}). (2)

Note that (2) is slightly more general than (1), albeit with a more complicated notation. Throughout the paper, we will assume that our forecasters are identical at each future prediction step and primarily focus on the (1) formulation of the stochastic prediction error. However, it is important to keep in mind that our methods can accommodate the general case of a non-identical prediction function.
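For concreteness, Eq. (1) maps directly to code: train on the most recent n𝗍𝗋{n_{\sf tr}} points and average the loss over the next n𝗍𝖾{n_{\sf te}} points. Below is a hedged sketch; the forecaster interface, the least-squares fit, and the squared loss are illustrative assumptions rather than fixed choices of the paper.

```python
# Sketch of the stochastic test error in Eq. (1). The fit/loss interface here
# is an illustrative assumption; any forecaster and loss can be plugged in.
import numpy as np

def stochastic_test_error(fit, loss, X, y, n, n_tr, n_te):
    """fit(X_tr, y_tr) returns a predict function; loss gives per-point losses."""
    predict = fit(X[n - n_tr:n], y[n - n_tr:n])        # train on z_{n-n_tr+1:n}
    y_hat = predict(X[n:n + n_te])                      # forecast for t = n+1, ..., n+n_te
    return np.mean(loss(y_hat, y[n:n + n_te]))          # average loss over the test window

# Example plug-ins: ordinary least squares and squared loss (assumptions).
def ols_fit(X_tr, y_tr):
    coef, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return lambda X_new: X_new @ coef

def sq_loss(y_hat, y):
    return (y_hat - y) ** 2
```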

Expected prediction error.

Another potential inferential target is the expected prediction error, denoted as Err. Assuming a distribution over {zt}t1\{z_{t}\}_{t\geq 1}, the expected prediction error is defined as the expectation of the stochastic prediction error, Errsto\textup{Err}_{\rm sto},

Err=𝔼[Errsto].\textup{Err}={\mathbb{E}}[\textup{Err}_{\rm sto}]. (3)

The expectation is taken with respect to all sources of randomness, namely z1:n+n𝗍𝖾z_{1:{n}+{n_{\sf te}}} and any randomness present in the forecaster f^\hat{f}. Notably, due to time correlation, in general

Err𝔼[(f^(xn+1;znn𝗍𝗋+1:n),yn+1)],ErrlimT1Tnt=n+1T𝔼[(f^(xt;znn𝗍𝗋+1:n),yt)].\displaystyle\textup{Err}\neq{\mathbb{E}}\Big{[}\ell(\hat{f}(x_{n+1};z_{{n}-{n_{\sf tr}}+1:{n}}),y_{n+1})\Big{]},\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \textup{Err}\neq\lim_{T\to\infty}\frac{1}{T-{n}}\sum_{t=n+1}^{T}{\mathbb{E}}\Big{[}\ell(\hat{f}(x_{t};z_{{n}-{n_{\sf tr}}+1:{n}}),y_{t})\Big{]}.

Indeed, the expected prediction error Err also depends on the number of test points n𝗍𝖾{n_{\sf te}}, different from the IID setting. Additionally, we should not expect n𝗍𝖾{n_{\sf te}} to be a significantly large number, since we are typically concerned with the forecaster’s performance in the near future. In such a scenario, we should not expect Errsto\textup{Err}_{\rm sto} to concentrate around Err: as is illustrated in Figure 1, the black dashed line represents the expected prediction error Err, while the black dots represent the stochastic prediction error Errsto\textup{Err}_{\rm sto}.

Prediction intervals and confidence intervals.

In this paper, our primary focus is quantifying the uncertainty for both stochastic and expected prediction errors. When the stochastic prediction error Errsto\textup{Err}_{\rm sto} is our inferential target, we aim to offer a prediction interval (PI), denoted as PIα{\rm PI}^{\alpha}\subseteq{\mathbb{R}}, based on the observed data z1:nz_{1:{n}}. Conversely, when the expected prediction error Err is our target, our goal is to provide a confidence interval (CI), denoted as CIα{\rm CI}^{\alpha}\subseteq{\mathbb{R}}, also based on the observed data z1:nz_{1:{n}}:

(ErrstoPIα)=1α,(ErrCIα)=1α.{\mathbb{P}}\Big{(}\textup{Err}_{\rm sto}\in{\rm PI}^{\alpha}\Big{)}\stackrel{{\scriptstyle\cdot}}{{=}}1-\alpha,\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ {\mathbb{P}}\Big{(}\textup{Err}\in{\rm CI}^{\alpha}\Big{)}\stackrel{{\scriptstyle\cdot}}{{=}}1-\alpha. (4)

In many practical scenarios, our interest leans more towards providing a prediction interval for the stochastic prediction error Errsto\textup{Err}_{\rm sto}. This preference stems from the fact that the prediction interval of Errsto\textup{Err}_{\rm sto} incorporates the fluctuation of the realized loss that we will incur on future unseen data. Capturing this fluctuation is crucial for risk control and uncertainty quantification. In contrast, a confidence interval for the expected prediction error Err marginalizes over the randomness in Errsto\textup{Err}_{\rm sto}, providing less information about the variability of the test error in a single time series instance.

3 Quantile-based procedures for prediction intervals

In this section, we first present the Quantile-based Forward Cross-Validation (QFCV) method used for generating prediction intervals for the stochastic prediction error Errsto\textup{Err}_{\rm sto} (Section 3.1). We then demonstrate in Section 3.2, under the stationary and ergodicity assumptions, that the QFCV prediction interval guarantees an asymptotically valid coverage of Errsto\textup{Err}_{\rm sto}. Following this, Section 3.3 provides a series of numerical simulations, while Section 3.4 offers a detailed discussion.

3.1 The QFCV prediction interval

Algorithm 1 Quantile-based forward cross-validation (QFCV)
0:  Dataset {zt}t[n]𝒵\{z_{t}\}_{t\in[n]}\subseteq{\mathcal{Z}}. Auxiliary function 𝒜:𝒵n𝗍𝗋×𝒵n𝗏𝖺𝗅m{\mathcal{A}}:{\mathcal{Z}}^{{n_{\sf tr}}}\times{\mathcal{Z}}^{{n_{\sf val}}}\to{\mathbb{R}}^{m}. Coverage tolerance α\alpha. Function class {m}{\mathcal{F}}\subseteq\{{\mathbb{R}}^{m}\to{\mathbb{R}}\}. Let K=(nn𝗍𝗋n𝗏𝖺𝗅n𝗍𝖾)/Δ+1K=\lfloor({n}-{n_{\sf tr}}-{n_{\sf val}}-{n_{\sf te}})/\Delta\rfloor+1.
1:  Compute {(Errival,Erritest)}i[K]\{(\textup{Err}_{i}^{{\rm val}},\textup{Err}_{i}^{{\rm test}})\}_{i\in[K]} and Errval\textup{Err}_{\star}^{{\rm val}} using Eq. (6).
2:  For β=α/2\beta=\alpha/2 and β=1α/2\beta=1-\alpha/2, compute
f^β=argminf1Ki=1K𝗉𝗂𝗇𝖻𝖺𝗅𝗅β(Erritestf(Errival)),\hat{f}^{\beta}=\arg\min_{f\in{\mathcal{F}}}\frac{1}{K}\sum_{i=1}^{K}{\sf pinball}_{\beta}\Big{(}\textup{Err}_{i}^{{\rm test}}-f(\textup{Err}_{i}^{{\rm val}})\Big{)},
where 𝗉𝗂𝗇𝖻𝖺𝗅𝗅β(t)=t(1{t0}β){\sf pinball}_{\beta}(t)=-t\cdot(1\{t\leq 0\}-\beta) is the pinball loss with quantile parameter β\beta.
3:  Output PIQFCVα=[f^α/2(Errval),f^1α/2(Errval)]{\rm PI}^{\alpha}_{\rm QFCV}=[\hat{f}^{\alpha/2}(\textup{Err}^{{\rm val}}_{\star}),\hat{f}^{1-\alpha/2}(\textup{Err}^{{\rm val}}_{\star})].

In stationary time series, a straightforward approach for providing a prediction interval involves estimating the quantiles of the stochastic prediction error distribution using the observed dataset. We introduce a more refined method named Quantile-based Forward Cross-Validation (QFCV). In a nutshell, QFCV estimates the quantile function of the stochastic prediction error, Errsto\textup{Err}_{\rm sto}, based on certain user-specified features. These features are expected to have high predictive power for Errsto\textup{Err}_{\rm sto}, an example being the validation error.

To introduce QFCV, we organize the time indices into multiple possibly overlapping time windows, which we denote as {Di,Vi,Di,Ti}i[K]\{D_{i},V_{i},D_{i\star},T_{i}\}_{i\in[K]} and {D,V,D,T}\{D,V,D_{\star},T\}, as depicted in Figure 3. The time windows {Di,Di,D,D}\{D_{i},D_{i\star},D,D_{\star}\} share the same size, denoted as n𝗍𝗋{n_{\sf tr}}; {Vi,V}\{V_{i},V\} share the same size, denoted as n𝗏𝖺𝗅{n_{\sf val}}; {Ti,T}\{T_{i},T\} share the same size, denoted as n𝗍𝖾{n_{\sf te}}. In addition, for each i[K]i\in[K], each time window in {Di+1,Vi+1,D(i+1),Ti+1}\{D_{i+1},V_{i+1},D_{(i+1)\star},T_{i+1}\} shifts from {Di,Vi,Di,Ti}\{D_{i},V_{i},D_{i\star},T_{i}\} by Δ\Delta indices, where Δ(K1)+n𝗍𝗋+n𝗏𝖺𝗅+n𝗍𝖾n\Delta(K-1)+{n_{\sf tr}}+{n_{\sf val}}+{n_{\sf te}}\leq{n}. Spelling out the elements of each time window, we have

Di=\displaystyle D_{i}= {(i1)Δ+1,,(i1)Δ+n𝗍𝗋},D={nn𝗍𝗋n𝗏𝖺𝗅+1,,nn𝗏𝖺𝗅},\displaystyle\leavevmode\nobreak\ \{(i-1)\Delta+1,\ldots,{(i-1)\Delta+{n_{\sf tr}}}\},\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ D=\{{n}-{n_{\sf tr}}-{n_{\sf val}}+1,\ldots,{n}-{n_{\sf val}}\}, (5)
Vi=\displaystyle V_{i}= {(i1)Δ+n𝗍𝗋+1,,(i1)Δ+n𝗍𝗋+n𝗏𝖺𝗅},V={nn𝗏𝖺𝗅+1,,n},\displaystyle\leavevmode\nobreak\ \{(i-1)\Delta+{n_{\sf tr}}+1,\ldots,(i-1)\Delta+{n_{\sf tr}}+{n_{\sf val}}\},\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ V=\{{n}-{n_{\sf val}}+1,\ldots,{n}\},
Di=\displaystyle D_{i\star}= {(i1)Δ+n𝗏𝖺𝗅+1,,(i1)Δ+n𝗍𝗋+n𝗏𝖺𝗅},D={nn𝗍𝗋+1,,n},\displaystyle\leavevmode\nobreak\ \{(i-1)\Delta+{n_{\sf val}}+1,\ldots,{(i-1)\Delta+{n_{\sf tr}}+{n_{\sf val}}}\},\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ D_{\star}=\{{n}-{n_{\sf tr}}+1,\ldots,{n}\},
Ti=\displaystyle T_{i}= {(i1)Δ+n𝗍𝗋+n𝗏𝖺𝗅+1,,(i1)Δ+n𝗍𝗋+n𝗏𝖺𝗅+n𝗍𝖾},T={n+1,,n+n𝗍𝖾}.\leavevmode\nobreak\ \{(i-1)\Delta+{n_{\sf tr}}+{n_{\sf val}}+1,\ldots,(i-1)\Delta+{n_{\sf tr}}+{n_{\sf val}}+{n_{\sf te}}\},\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ T=\{{n}+1,\ldots,{n}+{n_{\sf te}}\}.

Furthermore, given a user-specified auxiliary function that generates a set of covariates for predicting the stochastic prediction error, denoted as 𝒜:𝒵n𝗍𝗋×𝒵n𝗏𝖺𝗅m{\mathcal{A}}:{\mathcal{Z}}^{{n_{\sf tr}}}\times{\mathcal{Z}}^{{n_{\sf val}}}\to{\mathbb{R}}^{m}, we define tuples {(Errival,Erritest)}i[K]\{(\textup{Err}_{i}^{{\rm val}},\textup{Err}_{i}^{{\rm test}})\}_{i\in[K]} and (Errval,Errtest)(\textup{Err}_{\star}^{\rm val},\textup{Err}_{\star}^{{\rm test}}) by the equations below:

Errival=\displaystyle\textup{Err}_{i}^{{\rm val}}= 𝒜(zDi,zVi)m,\displaystyle\leavevmode\nobreak\ {\mathcal{A}}(z_{D_{i}},z_{V_{i}})\in{\mathbb{R}}^{m},\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ Erritest=\displaystyle\textup{Err}_{i}^{{\rm test}}= 1|Ti|tTi(f^(xt;zDi),yt),\displaystyle\leavevmode\nobreak\ \textstyle\frac{1}{|T_{i}|}\sum_{t\in T_{i}}\ell(\hat{f}(x_{t};z_{D_{i\star}}),y_{t})\in{\mathbb{R}}, (6)
Errval=\displaystyle\textup{Err}_{\star}^{{\rm val}}= 𝒜(zD,zV)m,\displaystyle\leavevmode\nobreak\ {\mathcal{A}}(z_{D},z_{V})\in{\mathbb{R}}^{m},\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ Errtest=\displaystyle\textup{Err}_{\star}^{{\rm test}}= 1|T|tT(f^(xt;zD),yt).\displaystyle\leavevmode\nobreak\ \textstyle\frac{1}{|T|}\sum_{t\in T}\ell(\hat{f}(x_{t};z_{D_{\star}}),y_{t})\in{\mathbb{R}}.

Here, zDz_{D} is a condensed notation for {zt}tD\{z_{t}\}_{t\in D}. It is important to note that {(Errival,Erritest)}i[K]\{(\textup{Err}_{i}^{{\rm val}},\textup{Err}_{i}^{{\rm test}})\}_{i\in[K]} and Errval\textup{Err}_{\star}^{\rm val} can be computed from the observed dataset {zt}t[n]\{z_{t}\}_{t\in[{n}]}, but Errtest\textup{Err}_{\star}^{{\rm test}} depends on future unobserved data points. Indeed, Errtest\textup{Err}_{\star}^{{\rm test}} is the same as the stochastic prediction error Errsto\textup{Err}_{\rm sto}, as defined in Eq. (1). Assuming the time series is stationary, it is evident that {(Errival,Erritest)}i[K]\{(\textup{Err}_{i}^{{\rm val}},\textup{Err}_{i}^{{\rm test}})\}_{i\in[K]} and (Errval,Errtest)(\textup{Err}_{\star}^{\rm val},\textup{Err}_{\star}^{{\rm test}}) are identically distributed, though they are not independent due to time correlation.

The QFCV method, given these definitions, can be summarized in the three steps below. First, calculate {(Errival,Erritest)}\{(\textup{Err}_{i}^{{\rm val}},\textup{Err}_{i}^{{\rm test}})\} and Errval\textup{Err}_{\star}^{\rm val} given by Eq. (6). Then, predict quantiles of {Erritest}\{\textup{Err}_{i}^{{\rm test}}\} based on feature vectors {Errival}\{\textup{Err}_{i}^{{\rm val}}\} by minimizing the empirical quantile loss to obtain quantile functions f^α/2\hat{f}^{\alpha/2} and f^1α/2\hat{f}^{1-\alpha/2}. Finally, the QFCV prediction interval is given by PIQFCVα=[f^α/2(Errval),f^1α/2(Errval)]{\rm PI}^{\alpha}_{\rm QFCV}=[\hat{f}^{\alpha/2}(\textup{Err}^{{\rm val}}_{\star}),\hat{f}^{1-\alpha/2}(\textup{Err}^{{\rm val}}_{\star})]. Detailed steps are provided in Algorithm 1.
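For concreteness, the following is a hedged sketch of Algorithm 1 using the average validation error as the single auxiliary feature (the QFCV(1)\text{QFCV}(1) choice discussed later) and linear quantile regression; the Lasso forecaster, squared loss, and regularization level are illustrative assumptions, not the released implementation.

```python
# Hedged sketch of Algorithm 1 (QFCV). Window indices follow Eq. (5) in
# 0-based form; the forecaster and loss are illustrative assumptions.
import numpy as np
from sklearn.linear_model import Lasso, QuantileRegressor

def qfcv_interval(X, y, n, n_tr, n_val, n_te, delta=1, alpha=0.1):
    def avg_loss(tr_idx, te_idx):
        m = Lasso(alpha=0.1).fit(X[tr_idx], y[tr_idx])        # forecaster (assumed)
        return np.mean((m.predict(X[te_idx]) - y[te_idx]) ** 2)

    K = (n - n_tr - n_val - n_te) // delta + 1
    err_val, err_test = [], []
    for i in range(K):
        s = i * delta
        D_i = np.arange(s, s + n_tr)                           # training window D_i
        V_i = np.arange(s + n_tr, s + n_tr + n_val)            # validation window V_i
        D_is = np.arange(s + n_val, s + n_tr + n_val)          # refit window D_{i*}
        T_i = np.arange(s + n_tr + n_val, s + n_tr + n_val + n_te)  # pseudo-test window T_i
        err_val.append(avg_loss(D_i, V_i))                     # Err_i^val via Eq. (7)
        err_test.append(avg_loss(D_is, T_i))                   # Err_i^test via Eq. (6)
    err_val = np.array(err_val).reshape(-1, 1)
    err_test = np.array(err_test)

    # Err_*^val: validation error of the window ending at the present time n.
    D = np.arange(n - n_tr - n_val, n - n_val)
    V = np.arange(n - n_val, n)
    err_val_star = np.array([[avg_loss(D, V)]])

    # Step 2: linear quantile regression at levels alpha/2 and 1 - alpha/2.
    lo = QuantileRegressor(quantile=alpha / 2, alpha=0.0, solver="highs")
    hi = QuantileRegressor(quantile=1 - alpha / 2, alpha=0.0, solver="highs")
    lo.fit(err_val, err_test)
    hi.fit(err_val, err_test)

    # Step 3: evaluate the fitted quantile functions at Err_*^val.
    return lo.predict(err_val_star)[0], hi.predict(err_val_star)[0]
```

With n𝗍𝗋=40{n_{\sf tr}}=40 and n𝗏𝖺𝗅=n𝗍𝖾=20{n_{\sf val}}={n_{\sf te}}=20, this mirrors the QFCV(1)\text{QFCV}(1) construction used in the experiments below, up to implementation details.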

Figure 3: Diagram illustration for an example of QFCV method.
Choice of auxiliary function.

There is a certain amount of flexibility in choosing the auxiliary function 𝒜{\mathcal{A}}. As we will show in the upcoming section, regardless of the specific choice of the auxiliary function, the QFCV algorithm will yield an asymptotically valid prediction interval, under the assumption of ergodicity. In practice, we aim to choose 𝒜(zDi,zVi){\mathcal{A}}(z_{D_{i}},z_{V_{i}}) such that it is as predictive of Erritest\textup{Err}_{i}^{{\rm test}} as possible. Our preferred choice for 𝒜{\mathcal{A}} is given by the validation error, a sample splitting estimator of the test error:

𝒜(zDi,zVi)=1|Vi|tVi(f^(xt;zDi),yt).\textstyle{\mathcal{A}}(z_{D_{i}},z_{V_{i}})=\frac{1}{|V_{i}|}\sum_{t\in V_{i}}\ell(\hat{f}(x_{t};z_{D_{i}}),y_{t}). (7)

We illustrate the QFCV method in Figure 4, where the QFCV prediction interval is illustrated with red dashed lines.

Figure 4: Scatter plot of test and validation errors of the true population and QFCV. We set n𝗍𝗋=40{n_{\sf tr}}=40 and n𝗏𝖺𝗅=n𝗍𝖾=20{n_{\sf val}}={n_{\sf te}}=20. True population points represent validation and test error pairs (Errval,Errtest)(\textup{Err}_{\star}^{{\rm val}},\textup{Err}_{\star}^{{\rm test}}), derived from 500500 simulation runs. QFCV points are from the {(Errival,Erritest)}i[K]\{(\textup{Err}_{i}^{{\rm val}},\textup{Err}_{i}^{{\rm test}})\}_{i\in[K]} set, computed from a single time series via (6). The colored dashed lines are the fitted lines for the 5%5\% and 95%95\% quantile regressions, with the red dashed lines representing the QFCV prediction interval.

3.2 Asymptotic validity of QFCV prediction intervals

We provide theoretical guarantees on the coverage of QFCV prediction intervals, under the assumption of a stationary time series. We present methods for relaxing the stationarity assumption in Section 5.

Assumption 1.

The time series {zt}t1\{z_{t}\}_{t\geq 1} is a stationary ergodic stochastic process, such that for any function g:𝒵kg:{\mathcal{Z}}^{k}\to{\mathbb{R}} with 𝔼[|g(z1:k)|]<{\mathbb{E}}[|g(z_{1:k})|]<\infty and ε>0\varepsilon>0, we have

limn(|1nt=1ng(zt:t+k1)𝔼[g(z1:k)]|ε)=0.\lim_{n\to\infty}{\mathbb{P}}\Big{(}\Big{|}\frac{1}{n}\sum_{t=1}^{n}g(z_{t:t+k-1})-{\mathbb{E}}[g(z_{1:k})]\Big{|}\geq\varepsilon\Big{)}=0.

Given this assumption, (Errtval,Errttest)t1{(\textup{Err}_{t}^{{\rm val}},\textup{Err}_{t}^{{\rm test}})}_{t\geq 1} is likewise stationary ergodic and shares the same distribution as (Errval,Errtest)(\textup{Err}_{\star}^{{\rm val}},\textup{Err}_{\star}^{{\rm test}}). Let \mathcal{L} denote this common distribution. For simplicity, we will assume \mathcal{L} has a density with respect to the Lebesgue measure. While this is a strong assumption, it could be relaxed by more sophisticated analysis.

Assumption 2.

The distribution {\mathcal{L}} of (Errval,Errtest)(\textup{Err}_{\star}^{{\rm val}},\textup{Err}_{\star}^{{\rm test}}) has a density with respect to the Lebesgue measure.

Moreover, we denote β{\mathcal{F}}_{\beta} as the class of β\beta-quantiles of the conditional distribution of Errtest\textup{Err}_{\star}^{{\rm test}} given Errval\textup{Err}_{\star}^{{\rm val}}. This class can be expressed as minimizers of the pinball loss,

β={gargming:m𝔼[𝗉𝗂𝗇𝖻𝖺𝗅𝗅β(Errtestg(Errval))]}.{\mathcal{F}}_{\beta}=\Big{\{}g^{*}\in\arg\min_{g:{\mathbb{R}}^{m}\to{\mathbb{R}}}{\mathbb{E}}_{{\mathcal{L}}}\big{[}{\sf pinball}_{\beta}\big{(}\textup{Err}^{{\rm test}}-g(\textup{Err}^{{\rm val}})\big{)}\big{]}\Big{\}}.

We assume the function class {\mathcal{F}} used in the QFCV algorithm (Algorithm 1) can realize the α/2\alpha/2 and 1α/21-\alpha/2 quantile classes, α/2{\mathcal{F}}_{\alpha/2} and 1α/2{\mathcal{F}}_{1-\alpha/2}. Furthermore, for simplicity, we assume that {\mathcal{F}} is finite, a strong assumption that could also be relaxed.

Assumption 3.

Assume that {\mathcal{F}} is a finite function class, and α/2{\mathcal{F}}_{\alpha/2}\cap{\mathcal{F}}\neq\emptyset, 1α/2{\mathcal{F}}_{1-\alpha/2}\cap{\mathcal{F}}\neq\emptyset.

We are ready to state the asymptotically valid coverage for the QFCV method, as detailed in Theorem 3.1. The proof of this theorem can be found in Appendix A. A more general statement going beyond the finite function class assumption (Assumption 3) can be found in Appendix A.1.

Theorem 3.1.

Let Assumptions 1, 2 and 3 hold. Let PIQFCVα{\rm PI}^{\alpha}_{\rm QFCV} be the output of Algorithm 1. Then we have

limn(ErrstoPIQFCVα)=1α.\lim_{n\to\infty}{\mathbb{P}}\Big{(}\textup{Err}_{\rm sto}\in{\rm PI}^{\alpha}_{\rm QFCV}\Big{)}=1-\alpha. (8)

We remark that the choice of the auxiliary function 𝒜:𝒵n𝗍𝗋×𝒵n𝗏𝖺𝗅m{\mathcal{A}}:{\mathcal{Z}}^{{n_{\sf tr}}}\times{\mathcal{Z}}^{{n_{\sf val}}}\to{\mathbb{R}}^{m} does not affect the asymptotic validity of QFCV prediction intervals. However, when the dimension mm is large, more samples will often be required for the asymptotic regime to take effect. In practice, we would like to choose mm to be moderately small and 𝒜{\mathcal{A}} to be a function such that Errval\textup{Err}_{\star}^{{\rm val}} and Errtest\textup{Err}_{\star}^{{\rm test}} have large mutual information.

3.3 Comparing QFCV methods via numerical simulation

We evaluate QFCV methods with different choices of auxiliary functions 𝒜{\mathcal{A}} through a simulation study. Recalling the setting of Section 2, we generate a time series with n=2000{n}=2000 historical data points, train a model on a training set of size n𝗍𝗋=40{n_{\sf tr}}=40, and aim to construct prediction intervals for the test error over the next n𝗍𝖾{5,20}{n_{\sf te}}\in\{5,20\} unseen points (see Figure 2 for a diagram of this dataset division). The simulated time series {zt}t1\{z_{t}\}_{t\geq 1} is generated as follows: for each time step tt, zt=(xt,yt)p×z_{t}=(x_{t},y_{t})\in{\mathbb{R}}^{p}\times{\mathbb{R}} with p=20p=20, where

yt=xt𝖳β+εt,xtiid𝒩(0,Ip),{εt}t1ARMA(a,b).\displaystyle y_{t}=x_{t}^{\sf{T}}\beta+\varepsilon_{t},\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ x_{t}\sim_{iid}{\mathcal{N}}(0,I_{p}),\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \{\varepsilon_{t}\}_{t\geq 1}\sim{\rm ARMA}(a,b). (9)

We fix the true coefficient vector as β=(1,1,1,1,0,,0)𝖳p\beta=(1,1,1,1,0,\ldots,0)^{\sf{T}}\in\mathbb{R}^{p}, where the first four coordinates are 11, and the rest of the coordinates are 0. The noise process {εt}t1\{\varepsilon_{t}\}_{t\geq 1} follows a standard autoregressive-moving-average model ARMA(a,b)\text{ARMA}(a,b) with AR coefficients ϕ=(ϕ1,,ϕa)𝖳{\boldsymbol{\phi}}=(\phi_{1},\ldots,\phi_{a})^{\sf{T}} and MA coefficients 𝜽=(θ1,,θb)𝖳{\boldsymbol{\theta}}=(\theta_{1},\ldots,\theta_{b})^{\sf{T}}. That is

εt=k=1aϕkεtk+ηt+i=1bθiηti,{ηt}t1iid𝒩(0,1).\textstyle\varepsilon_{t}=\sum_{k=1}^{a}\phi_{k}\cdot\varepsilon_{t-k}+\eta_{t}+\sum_{i=1}^{b}\theta_{i}\cdot\eta_{t-i},\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \{\eta_{t}\}_{t\geq 1}\sim_{iid}{\mathcal{N}}(0,1). (10)

We consider two ARMA parameter settings:

  • (1)

    ARMA(1,0)\text{ARMA}(1,0) with ϕ=ϕ[0,1]{\boldsymbol{\phi}}=\phi\in[0,1].

  • (2)

    ARMA(1,20)\text{ARMA}(1,20) with ϕ=ϕ[0,1]{\boldsymbol{\phi}}=\phi\in[0,1] and 𝜽=(0.1,0.2,,0.9,1,1,0.9,,0.1)𝖳20{\boldsymbol{\theta}}=(0.1,0.2,\ldots,0.9,1,1,0.9,\ldots,0.1)^{\sf{T}}\in\mathbb{R}^{20}.

Both processes are stationary, but setting (2) exhibits smoother sample paths than setting (1), as visualized in Figure 5, which plots realizations of {εt}t1\{\varepsilon_{t}\}_{t\geq 1} for the first 200 time steps. We posit that setting (2), as a smoother noise process, will yield a higher correlation between validation and test errors (Errval,Errtest)(\textup{Err}_{\star}^{{\rm val}},\textup{Err}_{\star}^{{\rm test}}).

Figure 5: Instances of {εt}t1\{\varepsilon_{t}\}_{t\geq 1} following two different ARMA(a,b) processes. The left panel plots ARMA(1,0) and the right panel plots 0.2×0.2\,\times\,ARMA(1,20). In both plots, we choose ϕ=0.5\phi=0.5. In the right plot, 𝜽=(0.1,0.2,,1,1,,0.1){\boldsymbol{\theta}}=(0.1,0.2,\ldots,1,1,\ldots,0.1) is a wedge-shaped list of weights, resulting in a smoother stationary process.
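The ARMA(a,b)\text{ARMA}(a,b) recursion (10) can be simulated directly; below is an illustrative sketch (the burn-in length and random seed are assumptions), shown with the wedge-shaped MA weights of setting (2).

```python
# Sketch of the ARMA(a, b) noise of Eq. (10), written from the recursion.
import numpy as np

def arma_noise(n, phi, theta, rng=None, burn=200):
    """phi: AR coefficients (phi_1, ..., phi_a); theta: MA coefficients (theta_1, ..., theta_b)."""
    rng = rng or np.random.default_rng(0)
    a, b = len(phi), len(theta)
    eta = rng.standard_normal(n + burn)
    eps = np.zeros(n + burn)
    for t in range(n + burn):
        ar = sum(phi[k] * eps[t - 1 - k] for k in range(a) if t - 1 - k >= 0)
        ma = sum(theta[i] * eta[t - 1 - i] for i in range(b) if t - 1 - i >= 0)
        eps[t] = ar + eta[t] + ma
    return eps[burn:]            # drop the burn-in so the sample is near-stationary

# Setting (2): ARMA(1, 20) with a wedge-shaped MA weight vector.
theta = np.concatenate([np.linspace(0.1, 1.0, 10), np.linspace(1.0, 0.1, 10)])
eps = arma_noise(2000 + 20, phi=[0.5], theta=theta)
```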

We consider QFCV methods utilizing the following auxiliary functions. For m{1,,n𝗏𝖺𝗅}m\in\{1,\ldots,{n_{\sf val}}\}, we split the validation set ViV_{i} into mm contiguous, non-overlapping subsets Vi1,,VimViV_{i1},\ldots,V_{im}\subseteq V_{i}. We define the auxiliary function 𝒜m{\mathcal{A}}_{m} of dimension mm as the vector of validation errors on each subset:

𝒜m(zDi,zVi)=(1|Vi1|tVi1(f^(xt;zDi),yt),,1|Vim|tVim(f^(xt;zDi),yt))𝖳m.{\mathcal{A}}_{m}(z_{D_{i}},z_{V_{i}})=\left(\frac{1}{|V_{i1}|}\sum_{t\in V_{i1}}\ell(\hat{f}(x_{t};z_{D_{i}}),y_{t}),\ldots,\frac{1}{|V_{im}|}\sum_{t\in V_{im}}\ell(\hat{f}(x_{t};z_{D_{i}}),y_{t})\right)^{\sf{T}}\in\mathbb{R}^{m}.

For m=0m=0, we use a trivial auxiliary function 𝒜0(zDi,zVi)1{\mathcal{A}}_{0}(z_{D_{i}},z_{V_{i}})\equiv 1. We denote by QFCV(m)\text{QFCV}(m) the QFCV method utilizing auxiliary function 𝒜m{\mathcal{A}}_{m}. We provide a few special examples as follows.

  • QFCV(0)\text{QFCV}(0): 𝒜0(zDi,zVi)=1{\mathcal{A}}_{0}(z_{D_{i}},z_{V_{i}})=1. Here, QFCV essentially calculates the empirical quantile of {Erritest}\{\textup{Err}_{i}^{{\rm test}}\}, equivalent to the intercept-only quantile regression.

  • QFCV(1)\text{QFCV}(1): 𝒜1(zDi,zVi)=1|Vi|tVi(f^(xt;zDi),yt){\mathcal{A}}_{1}(z_{D_{i}},z_{V_{i}})=\frac{1}{|V_{i}|}\sum_{t\in V_{i}}\ell(\hat{f}(x_{t};z_{D_{i}}),y_{t})\in\mathbb{R}. Here, QFCV uses quantile regression with the single feature given by the average validation error.

  • QFCV(n𝗏𝖺𝗅)\text{QFCV}({n_{\sf val}}): 𝒜n𝗏𝖺𝗅(zDi,zVi)=((f^(xt;zDi),yt))tVin𝗏𝖺𝗅{\mathcal{A}}_{{n_{\sf val}}}(z_{D_{i}},z_{V_{i}})=(\ell(\hat{f}(x_{t};z_{D_{i}}),y_{t}))_{t\in V_{i}}\in\mathbb{R}^{n_{\sf val}}. Here, QFCV uses quantile regression with n𝗏𝖺𝗅{n_{\sf val}} features given by the individual validation errors.

In all cases, we use linear quantile regression of {Erritest}\{\textup{Err}_{i}^{{\rm test}}\} against {Errival}\{\textup{Err}_{i}^{{\rm val}}\}, choosing {\mathcal{F}} as the class of linear functions.
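These feature maps are simple to compute; below is a hedged sketch (names illustrative) that takes the vector of per-point validation losses and returns the mm block-wise averages.

```python
# Sketch of the auxiliary feature map A_m: block-wise validation errors.
import numpy as np

def aux_features(val_losses, m):
    """val_losses: per-point losses l(f_hat(x_t; z_{D_i}), y_t) for t in V_i."""
    if m == 0:
        return np.array([1.0])                   # QFCV(0): trivial (intercept-only) feature
    blocks = np.array_split(val_losses, m)       # m contiguous, non-overlapping blocks
    return np.array([b.mean() for b in blocks])  # one average validation error per block

# m = 1 recovers the average validation error of Eq. (7); m = n_val returns
# the individual per-point validation losses.
```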

(a) ARMA(1,0), n𝗏𝖺𝗅=5{n_{\sf val}}=5.
(b) ARMA(1,20), n𝗏𝖺𝗅=5{n_{\sf val}}=5.
(c) ARMA(1,0), n𝗏𝖺𝗅=20{n_{\sf val}}=20.
(d) ARMA(1,20), n𝗏𝖺𝗅=20{n_{\sf val}}=20.
Figure 6: Performance comparison of QFCV prediction intervals on simulated time series (9). The time series length is fixed at n=2000{n}=2000, with n𝗍𝗋=40{n_{\sf tr}}=40, n𝗏𝖺𝗅=n𝗍𝖾{5,20}{n_{\sf val}}={n_{\sf te}}\in\{5,20\}, and p=20p=20. The x-axis represents the autocorrelation parameter ϕ[0,1]\phi\in[0,1]. Each panel displays two plots: (i) actual coverage of the 90%90\% nominal QFCV intervals, and (ii) ratio of the QFCV interval length to the true marginal quantiles interval length. The true marginal quantiles interval is defined by the 5%5\% and 95%95\% quantiles of the true test error distribution. Each curve displays the mean and standard error over 500500 simulation instances.

Figure 6 reports the coverage and length of QFCV intervals under different simulation settings. The time series is generated according to model (9), where the noise {εt}\{\varepsilon_{t}\} follows either an ARMA(1,0)\text{ARMA}(1,0) process (panels (a) and (c)) or ARMA(1,20)\text{ARMA}(1,20) process (panels (b) and (d)). For both noise processes, we fix the MA coefficients 𝜽{\boldsymbol{\theta}} and vary the AR coefficient ϕ\phi along the x-axis over [0,1][0,1] to modulate autocorrelation. The nominal coverage level is 90%90\%. To normalize the interval lengths, we divide the QFCV lengths by the true marginal quantiles interval length, defined as the difference between the 95%95\% and 5%5\% quantiles of the true test error distribution. Each curve displays the mean and standard error over 500500 simulation instances.

Figure 7: Scatter plots comparing interval length ratios versus true test error over 500500 simulation instances. The noise process is ARMA(1,20)\text{ARMA}(1,20) with ϕ=0.9\phi=0.9 and n𝗏𝖺𝗅=5{n_{\sf val}}=5. The left panel compares QFCV(0) and QFCV(1)\text{QFCV}(1). The right panel illustrates coverage patterns for each QFCV method, with points colored by whether the true test error falls inside or outside of the constructed interval.

Figure 6 shows that QFCV(m)\text{QFCV}(m) achieves close to 90%90\% coverage at the nominal level when mm is less than or equal to 22. However, for m=20m=20 in panels (c) and (d), the coverage deviates further from the nominal level, likely because quantile regression requires more samples to be consistent with a 2020-dimensional feature vector. In terms of interval lengths, all QFCV methods produce similar lengths to the true marginal quantiles on data with non-smooth ARMA(1,0)\text{ARMA}(1,0) noise (panels (a) and (c)). However, for smooth ARMA(1,20)\text{ARMA}(1,20) noise (panels (b) and (d)), QFCV(m>0)\text{QFCV}(m>0) can substantially outperform the true marginal quantiles length, especially when n𝗏𝖺𝗅=5{n_{\sf val}}=5 (panel (b)). This occurs because with smooth noise, the validation error Errval\textup{Err}_{\star}^{{\rm val}} becomes more predictive of the actual test error Errtest\textup{Err}_{\star}^{{\rm test}}.

To explain why QFCV(m>0)\text{QFCV}(m>0) produces smaller average interval lengths than QFCV(0)\text{QFCV}(0), Figure 7 presents scatter plots of the length ratios versus true test errors over 500500 simulation instances. The noise process is taken to be ARMA(1,20)\text{ARMA}(1,20) with ϕ=0.9\phi=0.9 and n𝗏𝖺𝗅=5{n_{\sf val}}=5. The left panel shows that QFCV(0)\text{QFCV}(0) (the empirical quantile method) generates interval lengths close to the true marginal quantiles for all ranges of the true test error. In contrast, QFCV(1)\text{QFCV}(1) assigns smaller intervals for small true test errors. The right panel colors points by whether they are mis-covered. We see that while QFCV(0)\text{QFCV}(0) tends to mis-cover points with large test errors, the mis-covered points for QFCV(>0)\text{QFCV}(>0) are more uniformly distributed across the test error range. This indicates that by adapting to the validation error, QFCV(>0)\text{QFCV}(>0) produces prediction intervals that are overall shorter while maintaining coverage.

Figure 8: Actual coverage of QFCV methods as a function of the time series length n{n}. The noise process is ARMA(1,0)\text{ARMA}(1,0) with autocorrelation parameter ϕ=0.5\phi=0.5 and n𝗏𝖺𝗅=5{n_{\sf val}}=5.

Figure 8 shows how the actual coverage of QFCV methods varies with the time series length n{n}, for an ARMA(1,0)\text{ARMA}(1,0) noise process with ϕ=0.5\phi=0.5 and n𝗏𝖺𝗅=5{n_{\sf val}}=5. We find that all QFCV variants achieve close to the nominal 90%90\% coverage when n>1000{n}>1000. However, using a larger number mm of features for quantile regression leads to a larger estimation error, requiring more samples to attain the nominal coverage level. This highlights the need to avoid excessively large mm values in practice.

3.4 Discussion

Let us take a moment to reflect on and discuss our main proposal of QFCV prediction intervals for time series forecasting.

Quantile-based methods versus CLT-based methods.

QFCV produces intervals using quantiles, which provides asymptotic validity relying on the law of large numbers for the ergodic process of test errors. While quantile-based methods are a natural choice in hindsight, CLT-based methods like forward cross-validation (FCV; see Section 4) may seem like a more instinctive first approach for practitioners. However, as we will see in Section 4, FCV prediction intervals have certain limitations that quantile-based QFCV avoids. By modeling the conditional quantiles of the test error distribution, QFCV does not make Gaussian assumptions and can provide valid coverage where naive CLT-based methods fail.

QFCV(1)\text{QFCV}(1) as the advocated approach.

Our simulations illustrate the tradeoff in selecting the number of features mm for the auxiliary function 𝒜{\mathcal{A}}. Adding more features can yield shorter intervals but requires more samples to attain valid coverage. We advocate using QFCV(1)\text{QFCV}(1), which takes the average validation error (7) as the single feature. As shown in our experiments, QFCV(1)\text{QFCV}(1) improves on the interval length of QFCV(0)\text{QFCV}(0) while matching the length of QFCV(2)\text{QFCV}(2). Critically, QFCV(1)\text{QFCV}(1) maintains valid coverage across all settings. In later sections, we will refer to the QFCV method as QFCV(1)\text{QFCV}(1) by default.

QFCV point estimators.

While we have focused on QFCV for constructing prediction intervals, point estimators can also be naturally derived from the QFCV framework. We include more details of QFCV point estimators, as well as other algorithm design choices of QFCV intervals, in Appendix D.

4 CLT-based forward cross-validation

Many inference methods for prediction error under the IID assumption produce confidence intervals aiming to cover the expected test error Err based on the central limit theorem (CLT). Such methods typically have the form:

CI=(Err^z1α/2SE^,Err^+z1α/2SE^),{\rm CI}=\big{(}\widehat{\textup{Err}}-z_{1-\alpha/2}{\widehat{\rm SE}},\widehat{\textup{Err}}+z_{1-\alpha/2}{\widehat{\rm SE}}\big{)},

where Err^\widehat{\textup{Err}} is a point estimate of Err, SE^{\widehat{\rm SE}} is the estimated standard error of Err^\widehat{\textup{Err}}, and zαz_{\alpha} is the α\alpha-quantile of a standard normal distribution. In time series forecasting, practitioners usually adopt a naive adaptation of cross-validation, called forward cross-validation (FCV), as a natural candidate for constructing confidence intervals targeting the expected error Err.

In this section, we examine FCV for constructing confidence and prediction intervals. We illustrate when it succeeds or fails, comparing it to QFCV. We show that, as a confidence interval covering the expected error Err, naive FCV lacks valid coverage unless adjusted by an autocovariance correction. Furthermore, as a prediction interval aiming to cover the stochastic error Errsto\textup{Err}_{\rm sto}, scaling-corrected FCV provides valid coverage only under strong Gaussianity assumptions, unlike QFCV.

4.1 Forward cross-validation procedures

In FCV, we split the historical data sequence z1:nz_{1:{n}} into KK possibly overlapping time windows. Each window is shifted Δ\Delta time steps from the previous one. For the ii-th window, we further divide it into a training set Di={(i1)Δ+1,,(i1)Δ+n𝗍𝗋}D_{i}=\{(i-1)\Delta+1,\ldots,(i-1)\Delta+{n_{\sf tr}}\} of size n𝗍𝗋{n_{\sf tr}} and a validation set Vi={(i1)Δ+n𝗍𝗋+1,,(i1)Δ+n𝗍𝗋+n𝗏𝖺𝗅}V_{i}=\{(i-1)\Delta+{n_{\sf tr}}+1,\ldots,(i-1)\Delta+{n_{\sf tr}}+{n_{\sf val}}\} of size n𝗏𝖺𝗅{n_{\sf val}}. See Figure 9 for a diagram illustration. The FCV point estimate of the prediction error is:

Err^KFCV=1Ki=1K1n𝗏𝖺𝗅tVi(f^(xt;zDi),yt)=1Ki=1KEi, where Ei=1n𝗏𝖺𝗅tVi(f^(xt;zDi),yt).{\widehat{\rm Err}}_{K}^{\rm FCV}=\frac{1}{K}\sum_{i=1}^{K}\frac{1}{{n_{\sf val}}}\sum_{t\in V_{i}}\ell(\hat{f}(x_{t};z_{D_{i}}),y_{t})=\frac{1}{K}\sum_{i=1}^{K}E_{i},\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \text{ where }E_{i}=\frac{1}{{n_{\sf val}}}\sum_{t\in V_{i}}\ell(\hat{f}(x_{t};z_{D_{i}}),y_{t}). (11)

Under the stationary assumption, Err^KFCV{\widehat{\rm Err}}_{K}^{\rm FCV} is an unbiased estimator for the expected target error Err, as stated in Lemma 4.1 below.

Lemma 4.1 (Unbiasedness of FCV estimator).

Assume {zt}t1\{z_{t}\}_{t\geq 1} is stationary, and let n𝗍𝖾=n𝗏𝖺𝗅{n_{\sf te}}={n_{\sf val}}. Then the FCV estimator is unbiased for the expected test error (3), that is,

𝔼[Err^KFCV]=Err.{\mathbb{E}}[{\widehat{\rm Err}}_{K}^{\rm FCV}]=\textup{Err}.
Proof.

By direct calculation,

𝔼[Err^KFCV]=𝔼[K1i=1KEi]=(1)𝔼[Errsto]=(2)Err,\textstyle{\mathbb{E}}[{\widehat{\rm Err}}_{K}^{\rm FCV}]={\mathbb{E}}[K^{-1}\sum_{i=1}^{K}E_{i}]\overset{(1)}{=}{\mathbb{E}}[\textup{Err}_{\rm sto}]\overset{(2)}{=}\textup{Err},

where (1) uses stationarity and (2) is by definition of Err. ∎

Figure 9: Diagram illustration for forward cross-validation (FCV)
Naive FCV interval.

Following the strategy of constructing confidence intervals in the IID setting, the naive FCV interval based on CLT is:

CIKFCV=(Err^KFCVz1α/2SE^KFCV,Err^KFCV+z1α/2SE^KFCV),{\rm CI}_{K}^{{\rm FCV}}=\Big{(}{\widehat{\rm Err}}_{K}^{\rm FCV}-z_{1-\alpha/2}\cdot{\widehat{\rm SE}}_{K}^{\rm FCV},{\widehat{\rm Err}}_{K}^{\rm FCV}+z_{1-\alpha/2}\cdot{\widehat{\rm SE}}_{K}^{\rm FCV}\Big{)}, (12)

where

SE^KFCV=1K1Ki=1K(EiE¯)2.\textstyle{\widehat{\rm SE}}_{K}^{\rm FCV}=\frac{1}{\sqrt{K}}\sqrt{\frac{1}{K}\sum_{i=1}^{K}(E_{i}-\bar{E})^{2}}. (13)

We present the algorithm for computing the naive FCV interval in Algorithm 2.
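As a concrete reference, a hedged sketch of the naive FCV interval of Eqs. (12)-(13) is given below (variable names are illustrative); the fold errors EiE_{i} are assumed to have been computed as in Eq. (11).

```python
# Sketch of the naive FCV interval, Eqs. (12)-(13).
import numpy as np
from scipy.stats import norm

def naive_fcv_interval(E, alpha=0.1):
    """E: array of the K fold validation errors E_1, ..., E_K from Eq. (11)."""
    K = len(E)
    err_hat = E.mean()                      # FCV point estimate, Eq. (11)
    se_hat = E.std(ddof=0) / np.sqrt(K)     # naive standard error, Eq. (13)
    z = norm.ppf(1 - alpha / 2)
    return err_hat - z * se_hat, err_hat + z * se_hat
```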

The naive FCV interval has some limitations in serving as both a confidence interval for the expected test error Err and a prediction interval for the stochastic test error Errsto\textup{Err}_{\rm sto}. When treated as a confidence interval for the expected test error Err, naive FCV often underestimates the true variance of the FCV point estimator since it does not account for the time correlation. When treated as a prediction interval for the stochastic test error Errsto\textup{Err}_{\rm sto}, naive FCV captures the variance of the FCV point estimator rather than the variance of Errsto\textup{Err}_{\rm sto}. Therefore, the naive FCV interval will undercover both Err as a confidence interval and Errsto\textup{Err}_{\rm sto} as a prediction interval.

To address these issues, we describe two modifications to the naive FCV intervals. In the first modification, an autocovariance correction is adopted to derive a consistent variance estimate for the FCV point estimator. This results in an autocovariance-corrected FCV that becomes a valid confidence interval for expected test error Err. In the second modification, the standard error estimator is rescaled, making it an estimate of the standard error of the stochastic test error. This results in a scaling-corrected FCV interval whose length is on the scale of the standard deviation of the stochastic prediction error.

Autocovariance-corrected FCV confidence interval.

Assuming that the time series is stationary, the variance of the FCV point estimator is

Var(Err^KFCV)\displaystyle{\rm Var}({\widehat{\rm Err}}_{K}^{\rm FCV}) =1K2Var[i=1KEi]=1K2i=1KVar[Ei]+1K2ijCov[Ei,Ej]=1K[γ(0)+2s=1K1(1sK)γ(s)].\displaystyle=\frac{1}{K^{2}}{\rm Var}\Big{[}\sum_{i=1}^{K}E_{i}\Big{]}=\frac{1}{K^{2}}\sum_{i=1}^{K}{\rm Var}[E_{i}]+\frac{1}{K^{2}}\sum_{i\neq j}{\rm Cov}[E_{i},E_{j}]=\frac{1}{K}\Big{[}\gamma(0)+2\sum_{s=1}^{K-1}\Big{(}1-\frac{s}{K}\Big{)}\gamma(s)\Big{]}.

Here, γ(s)=Cov(E1,E1+s)\gamma(s)={\rm Cov}(E_{1},E_{1+s}) is the autocovariance function of the stationary process {Ei}\{E_{i}\}, which can be estimated from the data using the sample autocovariance γ^(s)\hat{\gamma}(s),

γ^(s)=1Ksi=1Ks(EiE¯)(Ei+sE¯),E¯=1Ki=1KEi=Err^KFCV.\hat{\gamma}(s)=\frac{1}{K-s}\sum_{i=1}^{K-s}(E_{i}-\bar{E})(E_{i+s}-\bar{E}),\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \bar{E}=\frac{1}{K}\sum_{i=1}^{K}E_{i}={\widehat{\rm Err}}_{K}^{\rm FCV}.

The adjusted standard error estimator, denoted as SE^KFCV(c){\widehat{\rm SE}}^{{\rm FCV}(c)}_{K} where the cc indicates covariance adjustment, is thus based on a truncated sum of the sample autocovariance function:

SE^KFCV(c)=1Kγ^(0)+2s=1K𝗍𝗋𝗎𝗇(1sK)γ^(s).\displaystyle\textstyle{\widehat{\rm SE}}^{{\rm FCV}(c)}_{K}=\frac{1}{\sqrt{K}}\sqrt{\hat{\gamma}(0)+2\sum_{s=1}^{{K_{\sf trun}}}\Big{(}1-\frac{s}{K}\Big{)}\hat{\gamma}(s)}. (14)

Here, K𝗍𝗋𝗎𝗇<K{K_{\sf trun}}<K truncates the summation to the correlation length of the stationary process {Ei}i[K]\{E_{i}\}_{i\in[K]}. This is needed since γ^(s)\hat{\gamma}(s) converges to γ(s)\gamma(s) very slowly for large ss. We expect that when ss exceeds some threshold K𝗍𝗋𝗎𝗇{K_{\sf trun}}, the autocovariance becomes very small and can be neglected.

Given the adjusted standard error estimator, the autocovariance-adjusted FCV confidence interval is

CIKFCV(c)=(Err^KFCVz1α/2SE^KFCV(c),Err^KFCV+z1α/2SE^KFCV(c)).{\rm CI}_{K}^{{\rm FCV}(c)}=\Big{(}{\widehat{\rm Err}}_{K}^{\rm FCV}-z_{1-\alpha/2}\cdot{\widehat{\rm SE}}_{K}^{{\rm FCV}(c)},{\widehat{\rm Err}}_{K}^{\rm FCV}+z_{1-\alpha/2}\cdot{\widehat{\rm SE}}_{K}^{{\rm FCV}(c)}\Big{)}.
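A hedged sketch of the corrected standard error (14) and the resulting confidence interval follows; the truncation level K𝗍𝗋𝗎𝗇{K_{\sf trun}} and the non-negativity guard on the variance estimate are assumptions made for illustration.

```python
# Sketch of the autocovariance-corrected FCV confidence interval, Eq. (14).
import numpy as np
from scipy.stats import norm

def autocov_corrected_interval(E, alpha=0.1, K_trun=10):
    """E: fold errors E_1, ..., E_K; K_trun < K truncates the autocovariance sum."""
    K = len(E)
    E_bar = E.mean()
    gamma = lambda s: np.mean((E[:K - s] - E_bar) * (E[s:] - E_bar))   # hat{gamma}(s)
    var_hat = gamma(0) + 2 * sum((1 - s / K) * gamma(s) for s in range(1, K_trun + 1))
    se_c = np.sqrt(max(var_hat, 0.0)) / np.sqrt(K)    # guard against a negative estimate
    z = norm.ppf(1 - alpha / 2)
    return E_bar - z * se_c, E_bar + z * se_c
```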

Under certain assumptions on the stationary process {Ei}i[K]\{E_{i}\}_{i\in[K]}, it can be shown via the central limit theorem that Err^KFCV(c){\widehat{\rm Err}}_{K}^{{\rm FCV}(c)} is asymptotically normal with asymptotic standard deviation consistently estimated by SE^KFCV(c){\widehat{\rm SE}}^{{\rm FCV}(c)}_{K}. Therefore, CIKFCV(c){\rm CI}_{K}^{{\rm FCV}(c)} is an asymptotically valid confidence interval for Err. This leads to the following informal proposition, with a rigorous statement presented in Appendix B.1.

Proposition 4.2 (Asymptotic normality of the autocovariance-corrected FCV estimator; Informal).

Under certain assumptions of the stationary process {Ei}i[K]\{E_{i}\}_{i\in[K]}, as KK\to\infty, we have convergence in distribution

(Err^KFCVErr)/SE^KFCV(c)d𝒩(0,1).\big{(}{\widehat{\rm Err}}_{K}^{\rm FCV}-\textup{Err}\big{)}/{\widehat{\rm SE}}^{{\rm FCV}(c)}_{K}\stackrel{{\scriptstyle d}}{{\longrightarrow}}{\mathcal{N}}(0,1).

Consequently, CIKFCV(c){\rm CI}_{K}^{{\rm FCV}(c)} has asymptotically 1α1-\alpha coverage over Err.

Algorithm 2 Forward cross-validation: naive, autocovariance-corrected, and scaling-corrected
0:  Dataset {zt}t[n]𝒵\{z_{t}\}_{t\in[{n}]}\subseteq{\mathcal{Z}}, loss \ell, size of training set n𝗍𝗋{n_{\sf tr}}, size of validation set n𝗏𝖺𝗅{n_{\sf val}}, number of folds KK, prediction algorithm f^:𝒳×𝒵n𝗍𝗋𝒴\hat{f}:{\mathcal{X}}\times{\mathcal{Z}}^{{n_{\sf tr}}}\to{\mathcal{Y}}. Parameter K𝗍𝗋𝗎𝗇{K_{\sf trun}} for autocovariance-corrected FCV.
1:  Compute Ei=(1/n𝗏𝖺𝗅)tVi(f^(xt;zDi),yt)E_{i}=(1/{n_{\sf val}})\sum_{t\in V_{i}}\ell(\hat{f}(x_{t};z_{D_{i}}),y_{t}) for i[K]i\in[K].
2:  Compute Err^K=K1i=1KEi{\widehat{\rm Err}}_{K}=K^{-1}\sum_{i=1}^{K}E_{i} as in Eq. (11).
3:  Compute SE^K{\widehat{\rm SE}}_{K} using Eq. (13) for naive FCV, using Eq. (14) for autocovariance-corrected FCV, or using Eq. (15) for scaling-corrected FCV.
4:  Return Err^K±z1α/2SE^K{\widehat{\rm Err}}_{K}\pm z_{1-\alpha/2}{\widehat{\rm SE}}_{K}.
Scaling-corrected FCV prediction interval.

Both SE^KFCV{\widehat{\rm SE}}_{K}^{\rm FCV} in (13) and SE^KFCV(c){\widehat{\rm SE}}^{{\rm FCV}(c)}_{K} in (14) estimate the variance of Err^KFCV{\widehat{\rm Err}}_{K}^{\rm FCV} instead of the variance of Errsto\textup{Err}_{\rm sto}, and thus do not provide prediction intervals for Errsto\textup{Err}_{\rm sto}. To estimate the variance of Errsto\textup{Err}_{\rm sto}, a commonly used approach is a scaling correction. In particular, we define the scaling-corrected standard error estimator as SE^KFCV(p){\widehat{\rm SE}}_{K}^{{\rm FCV}(p)}, where pp indicates this is for a prediction interval:

SE^KFCV(p)=KSE^KFCV=1Ki=1K(EiE¯)2.\textstyle{\widehat{\rm SE}}_{K}^{{\rm FCV}(p)}=\sqrt{K}\cdot{\widehat{\rm SE}}_{K}^{{\rm FCV}}=\sqrt{\frac{1}{K}\sum_{i=1}^{K}(E_{i}-\bar{E})^{2}}. (15)

The scaling-corrected FCV prediction interval is then:

PIKFCV(p)=(Err^KFCVz1α/2SE^KFCV(p),Err^KFCV+z1α/2SE^KFCV(p)).{\rm PI}_{K}^{{\rm FCV}(p)}=\Big{(}{\widehat{\rm Err}}_{K}^{\rm FCV}-z_{1-\alpha/2}\cdot{\widehat{\rm SE}}_{K}^{{\rm FCV}(p)},{\widehat{\rm Err}}_{K}^{\rm FCV}+z_{1-\alpha/2}\cdot{\widehat{\rm SE}}_{K}^{{\rm FCV}(p)}\Big{)}. (16)
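As a concrete illustration, the following minimal Python sketch implements the three CLT-based constructions of Algorithm 2, taking the fold errors E_1, ..., E_K as given; the nominal level and the truncation lag K_trun below are illustrative choices rather than values prescribed by the text.

import numpy as np
from scipy.stats import norm

def fcv_intervals(E, alpha=0.1, K_trun=5):
    """CLT-based FCV intervals from fold validation errors E_1, ..., E_K (sketch of Algorithm 2)."""
    E = np.asarray(E, dtype=float)
    K = len(E)
    E_bar = E.mean()                                    # point estimator Err_hat, Eq. (11)
    def gamma_hat(s):
        # sample autocovariance at lag s, normalized by K - s
        return np.mean((E[:K - s] - E_bar) * (E[s:] - E_bar))
    z = norm.ppf(1 - alpha / 2)
    se_naive = np.sqrt(gamma_hat(0) / K)                # naive FCV standard error, so that Eq. (15) = sqrt(K) * se_naive
    var_c = gamma_hat(0) + 2 * sum((1 - s / K) * gamma_hat(s) for s in range(1, K_trun + 1))
    se_cov = np.sqrt(max(var_c, 0.0) / K)               # autocovariance-corrected standard error, Eq. (14)
    se_pred = np.sqrt(gamma_hat(0))                     # scaling-corrected standard error, Eq. (15)
    return {name: (E_bar - z * se, E_bar + z * se)
            for name, se in [("FCV", se_naive), ("FCV(c)", se_cov), ("FCV(p)", se_pred)]}

Each entry of the returned dictionary holds the lower and upper endpoints of the corresponding interval.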

Under the strong assumption that {Ei}i[K]\{E_{i}\}_{i\in[K]} follows a Gaussian process, we can show PIKFCV(p){\rm PI}_{K}^{{\rm FCV}(p)} gives approximately the α/2\alpha/2 and 1α/21-\alpha/2 quantiles of the stochastic test error distribution. This gives the following proposition, with the proof in Appendix B.2.

Proposition 4.3.

Assume {Ei}i[K]\{E_{i}\}_{i\in[K]} is a stationary ergodic Gaussian process. Let V^K=K1i=1K(EiE¯)2\hat{V}_{K}=K^{-1}\sum_{i=1}^{K}(E_{i}-\bar{E})^{2}. Then as KK\to\infty, we have convergence in probability

V^KpVar(E1).\hat{V}_{K}\stackrel{{\scriptstyle p}}{{\longrightarrow}}{\rm Var}(E_{1}). (17)

In this case, we have convergence in probability

Err^KFCVz1α/2SE^KFCV(p)pqα/2(Errsto),Err^KFCV+z1α/2SE^KFCV(p)pq1α/2(Errsto).{\widehat{\rm Err}}_{K}^{\rm FCV}-z_{1-\alpha/2}\cdot{\widehat{\rm SE}}_{K}^{{\rm FCV}(p)}\stackrel{{\scriptstyle p}}{{\longrightarrow}}q_{\alpha/2}(\textup{Err}_{\rm sto}),\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ {\widehat{\rm Err}}_{K}^{\rm FCV}+z_{1-\alpha/2}\cdot{\widehat{\rm SE}}_{K}^{{\rm FCV}(p)}\stackrel{{\scriptstyle p}}{{\longrightarrow}}q_{1-\alpha/2}(\textup{Err}_{\rm sto}). (18)

That is, PIKFCV(p){\rm PI}_{K}^{{\rm FCV}(p)} is an asymptotically valid 1α1-\alpha prediction interval for Errsto\textup{Err}_{\rm sto},

limK(ErrstoPIKFCV(p))=1α.\lim_{K\to\infty}{\mathbb{P}}\Big{(}\textup{Err}_{\rm sto}\in{\rm PI}_{K}^{{\rm FCV}(p)}\Big{)}=1-\alpha. (19)

It is worth noting that QFCV(0)\text{QFCV}(0), the empirical quantile method, will also give the α/2\alpha/2 and 1α/21-\alpha/2 quantiles of the stochastic test error distribution. Consequently, under the Gaussian process assumption, the scaling-corrected FCV prediction interval is asymptotically equivalent to QFCV(0)\text{QFCV}(0). As shown in Section 3, QFCV(0)\text{QFCV}(0) has a larger interval length than our advocated QFCV(1)\text{QFCV}(1) method. Thus, we would also expect scaling-corrected FCV to have a larger interval length than QFCV(1)\text{QFCV}(1).
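As a point of reference, QFCV(0) reduces to the empirical quantiles of the validation errors; a minimal sketch:

import numpy as np

def qfcv0_interval(E, alpha=0.1):
    # empirical alpha/2 and 1 - alpha/2 quantiles of the K validation errors
    lo, hi = np.quantile(np.asarray(E, dtype=float), [alpha / 2, 1 - alpha / 2])
    return lo, hi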

We note that the Gaussian process assumption for {Ei}i[K]\{E_{i}\}_{i\in[K]} is strong and may not hold in practice. Figure 10 provides an example where Gaussianity is violated. The bottom panel shows the population-level stochastic test error distribution obtained with 1000 samples, in a setting that matches Figure 1 except with a smaller test set size of 5. With small test sets, the errors are no longer Gaussian: the distribution is truncated at zero and skewed. Therefore, the scaling-corrected FCV interval may not have valid coverage when the Gaussian assumption fails.

In contrast, QFCV methods do not require the Gaussian assumption. As seen in the top and middle panels of Figure 10, the empirical validation error distribution from a single FCV run converges to the population test error distribution as the number of folds KK increases. The QFCV(0)\text{QFCV}(0) method bypasses the Gaussianity assumption by approximating the oracle stochastic test error distribution with the empirical distribution of validation errors and reporting empirical quantiles for prediction intervals. Other QFCV variants, such as QFCV(1)\text{QFCV}(1), exploit the time correlation between past validation error and future test error to yield shorter intervals than QFCV(0)\text{QFCV}(0).

Since Gaussianity often fails in practice, and scaling-corrected FCV constructs its intervals from Gaussian quantiles, it carries no general guarantee of asymptotically valid coverage. We therefore do not advocate it, and instead recommend QFCV, which does not rely on this assumption.

Figure 10: Empirical error distributions of EiE_{i} with K=30K=30 validation errors and with K=300K=300 validation errors, together with the true population-level stochastic test error distribution. Solid lines are density estimates of the histograms.

4.2 Comparing FCV and QFCV via numerical simulations

We perform numerical simulations to compare CLT-based FCV intervals (Algorithm 2) with QFCV methods (Algorithm 1). Specifically, we examine naive FCV intervals (FCV), autocorrelation-corrected FCV intervals (FCV(c)), and scaling-corrected FCV intervals (FCV(p)) for the CLT-based methods. For the QFCV method, we use QFCV(1)\text{QFCV}(1), where the auxiliary function is the validation error (7). The simulations follow the same setup as in Section 3.3, with time series generated from a linear model (9) with ARMA(a,b)\text{ARMA}(a,b) noise. We investigate the coverage and length ratio of these intervals. The results, presented in Figure 11, illustrate the performance of the different methods.

The results in Figure 11 demonstrate that the naive FCV and FCV(c) methods undercover the stochastic test error Errsto\textup{Err}_{\rm sto} across the simulation settings. In contrast, the FCV(p) and QFCV methods achieve valid coverage. However, the QFCV intervals are shorter than those from FCV(p), particularly for time series with smoother noise processes. This aligns with our expectations, as we anticipated FCV(p) would have similar performance to QFCV(0). Since QFCV(1) utilizes features informative to the stochastic test error during quantile regression, it outperforms QFCV(0), and hence should outperform FCV(p).

Table 1 provides a further comparison of the four methods across the simulation settings. The first row for each setting shows the mis-coverage percentages of the stochastic test error Errsto\textup{Err}_{\rm sto} at the upper and lower ends; QFCV has more balanced mis-coverage than FCV(p), as expected, since QFCV relies on quantiles rather than the Gaussian assumption underlying FCV(p). The second row for each setting gives the mis-coverage percentages of the expected test error Err; it shows that FCV(c), through its autocovariance correction, covers Err better than naive FCV and thus serves as a better confidence interval. The third row reports the average interval lengths. The fourth row for each setting gives the mean squared error of the FCV and QFCV point estimators of the stochastic test error Errsto\textup{Err}_{\rm sto}; the QFCV point estimator has a smaller MSE than the FCV point estimator.

(a) ARMA(1,0), n𝗏𝖺𝗅=5{n_{\sf val}}=5.
(b) ARMA(1,20), n𝗏𝖺𝗅=5{n_{\sf val}}=5.
(c) ARMA(1,0), n𝗏𝖺𝗅=20{n_{\sf val}}=20.
(d) ARMA(1,20), n𝗏𝖺𝗅=20{n_{\sf val}}=20.
Figure 11: Performance comparison of QFCV and CLT-based FCV prediction intervals on simulated time series (9). We examine naive FCV (FCV), autocovariance-corrected FCV (FCV(c)\text{FCV}(c)), scaling-corrected FCV (FCV(p)\text{FCV}(p)), and QFCV(1)\text{QFCV}(1). The time series generative process follows the same setup as in Figure 6. The x-axis represents the autocorrelation parameter ϕ[0,1]\phi\in[0,1]. We plot the coverage and length ratio averaged over 500500 simulation runs. The nominal coverage level is 90%90\%, and the length ratio is the prediction interval length divided by the length of the interval formed by the true marginal quantiles of the test error distribution.
Data  Target                         FCV           FCV(c)        FCV(p)        QFCV
                                     Hi     Lo     Hi     Lo     Hi     Lo     Hi     Lo
(a)   Miscoverage of Err_sto (%)     37.2   56.2   36.8   55.0   7.8    0.0    5.8    4.0
      Miscoverage of Err (%)         18.8   4.8    13.4   2.4    0.0    0.0    0.0    0.0
      Interval Length                0.211         0.273         5.13          4.69
      MSE of Err_sto                 2.39          -             -             2.40
(b)   Miscoverage of Err_sto (%)     30.4   64.0   27.8   61.2   7.2    0.0    5.0    4.8
      Miscoverage of Err (%)         34.6   10.4   15.6   0.6    0.0    0.0    0.4    0.0
      Interval Length                0.287         0.572         6.99          4.59
      MSE of Err_sto                 5.24          -             -             3.30
(c)   Miscoverage of Err_sto (%)     39.2   49.8   38.4   47.8   7.8    1.0    5.6    6.8
      Miscoverage of Err (%)         18.2   4.4    11.8   1.8    0.0    0.0    0.0    0.0
      Interval Length                0.239         0.293         2.91          2.75
      MSE of Err_sto                 0.881         -             -             0.841
(d)   Miscoverage of Err_sto (%)     30.2   57.8   28.4   55.6   5.6    0.0    4.8    4.4
      Miscoverage of Err (%)         13.8   13.8   7.8    6.6    0.0    0.0    0.0    0.2
      Interval Length                0.492         0.660         5.99          5.12
      MSE of Err_sto                 3.22          -             -             2.99
Table 1: Comparison of point estimates and prediction intervals for QFCV and CLT-based procedures. The simulations follow the same settings as Figure 11, with autocorrelation ϕ=0.5\phi=0.5. The four cases (a)-(d) correspond to the same settings as in Figure 11. The values are averaged over 500500 simulations. The nominal coverage level is 90%90\%, with 5%5\% nominal mis-coverage on each end. The point estimator for QFCV is computed by linear regression of {(Errival,Erritest)}\{(\textup{Err}_{i}^{{\rm val}},\textup{Err}_{i}^{{\rm test}})\} as detailed in Appendix D.

5 Rolling intervals under nonstationarity

In previous sections, we provided inference methods for test errors in stationary time series. However, constructing valid prediction intervals becomes more difficult when the time series is non-stationary. With arbitrary non-stationarity, the observed data may contain little information about the distribution of future data points, and obtaining prediction intervals is therefore challenging. A more feasible goal is to provide rolling intervals for the test error that have asymptotically valid time-average coverage.

5.1 ACI with delayed feedback for rolling intervals

Rolling intervals.

Suppose that we observe the time series {zt=(xt,yt)}t1𝒵=𝒳×𝒴\{z_{t}=(x_{t},y_{t})\}_{t\geq 1}\subseteq{\mathcal{Z}}={\mathcal{X}}\times{\mathcal{Y}} arriving sequentially, where the {zt}t1\{z_{t}\}_{t\geq 1} are treated as deterministic, unlike the stochastic settings in previous sections. We are provided with a forecasting function f^:𝒳×𝒵n𝗍𝗋𝒴\hat{f}:{\mathcal{X}}\times{\mathcal{Z}}^{{n_{\sf tr}}}\to{\mathcal{Y}} and a loss function :𝒴×𝒴\ell:{\mathcal{Y}}\times{\mathcal{Y}}\to{\mathbb{R}}. Define Errstot+1\textup{Err}_{\rm sto}^{t+1} as the (t+1)(t+1)-th test error,

Errstot+1=1n𝗍𝖾s=t+1t+n𝗍𝖾(f^(xs;ztn𝗍𝗋+1:t),ys).\textup{Err}_{\rm sto}^{t+1}=\frac{1}{{n_{\sf te}}}\sum_{s=t+1}^{t+{n_{\sf te}}}\ell(\hat{f}(x_{s};z_{t-{n_{\sf tr}}+1:t}),y_{s}). (20)

This is the rolling version of Errsto\textup{Err}_{\rm sto} in Eq. (1), where we add the superscript t+1t+1 to emphasize its dependence on the time index. Our goal is to construct rolling intervals {RIα,t}t1\{{\rm RI}^{\alpha,t}\}_{t\geq 1}\subseteq{\mathbb{R}} where each RIα,t{\rm RI}^{\alpha,t} depends on z1:t1z_{1:t-1}, the time series up to time t1t-1. We hope these intervals will have time-average coverage 1α1-\alpha over the set of time indices ST,Δ={Δ,2Δ,3Δ,,T/ΔΔ}S_{T,\Delta}=\{\Delta,2\Delta,3\Delta,\ldots,\lfloor T/\Delta\rfloor\cdot\Delta\} spaced by Δ1\Delta\geq 1,

limT1|ST,Δ|tST,Δ1{ErrstotRIα,t}=1α.\lim_{T\to\infty}\frac{1}{|S_{T,\Delta}|}\sum_{t\in S_{T,\Delta}}1\{\textup{Err}_{\rm sto}^{t}\in{\rm RI}^{\alpha,t}\}=1-\alpha. (21)

We emphasize that time-average coverage differs from statistical coverage in (4). We will revisit this distinction later.

Adaptive conformal inference with delayed feedback (ACI-DF).

We introduce the ACI-DF algorithm for constructing rolling prediction intervals, described in Algorithm 3. The ACI-DF algorithm adapts the ACI algorithm [GC21] to the delayed feedback setting. The algorithm takes as input a sequence of prediction interval constructing functions PIt:𝒵t1×{\rm PI}^{t}:{\mathcal{Z}}^{t-1}\times{\mathbb{R}}\subseteq{\mathbb{R}}, and sets the rolling interval at step tt to be PIt(z1:t1;θt){\rm PI}^{t}(z_{1:t-1};\theta_{t}). The parameter θt\theta_{t} is updated every Δ\Delta steps using online gradient descent on the pinball loss (line 7), but with delayed feedback since Errstot\textup{Err}_{\rm sto}^{t} depends on zt:n𝗍𝖾+t1z_{t:{n_{\sf te}}+t-1}. So we update θt\theta_{t} based on the most recent computable 0-1 loss of coverage.

Algorithm 3 ACI-DF for rolling intervals
0:  Dataset {zt=(xt,yt)}t1\{z_{t}=(x_{t},y_{t})\}_{t\geq 1}. Coverage level α\alpha. Prediction interval constructing functions PIt:(𝒵t1,)2𝒴{\rm PI}^{t}:({\mathcal{Z}}^{t-1},{\mathbb{R}})\to 2^{\mathcal{Y}}. Number of test samples n𝗍𝖾{n_{\sf te}}. Lag parameter Δ\Delta. Stepsize γ>0\gamma>0.
1:  Initialize θ0=0\theta_{0}=0. Take k=argmins+{sΔn𝗍𝖾}k=\arg\min_{s\in{\mathbb{N}}_{+}}\{s\Delta\geq{n_{\sf te}}\}.
2:  for all t=1,2,,Tt=1,2,\ldots,T do
3:     Construct prediction set: RIα,t=PIt(z1:t1;θt){\rm RI}^{\alpha,t}={\rm PI}^{t}(z_{1:t-1};\theta_{t}).
4:     Obtain zt=(xt,yt)z_{t}=(x_{t},y_{t}).
5:     Compute ctn𝗍𝖾+1=1{Errstotn𝗍𝖾+1RItn𝗍𝖾+1}c_{t-{n_{\sf te}}+1}=1\{\textup{Err}_{\rm sto}^{t-{n_{\sf te}}+1}\in{\rm RI}^{t-{n_{\sf te}}+1}\}. // Errstotn𝗍𝖾+1\textup{Err}_{\rm sto}^{t-{n_{\sf te}}+1} depends on ztn𝗍𝖾+1:tz_{t-{n_{\sf te}}+1:t} and is computable.
6:     if mod(t,Δ)=0(t,\Delta)=0 and t>kΔt>k\Delta then
7:        θt+1=θtΔ+1+γ(1αctkΔ+1)\theta_{t+1}=\theta_{t-\Delta+1}+\gamma(1-\alpha-c_{t-k\Delta+1}).
8:     else
9:        θt+1=θt\theta_{t+1}=\theta_{t}.
10:     end if
11:  end for
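To make the delayed update schedule concrete, here is a minimal Python sketch of the ACI-DF loop. The functions base_interval(t, theta) and err_sto(s) are hypothetical placeholders standing in for PI^t(z_{1:t-1}; theta) and for the test error Err_sto^s once it becomes computable; they are supplied by the user rather than defined by the algorithm.

import numpy as np

def aci_df(base_interval, err_sto, T, alpha=0.1, n_te=5, Delta=5, gamma=0.01):
    """Sketch of Algorithm 3 (ACI-DF) with delayed feedback."""
    theta = 0.0
    k = int(np.ceil(n_te / Delta))                   # smallest k with k * Delta >= n_te (line 1)
    intervals, covered = {}, {}
    for t in range(1, T + 1):
        intervals[t] = base_interval(t, theta)       # line 3: RI^{alpha, t} = PI^t(z_{1:t-1}; theta_t)
        s = t - n_te + 1
        if s >= 1:                                   # line 5: Err_sto^s depends on z_{s:t} and is now computable
            lo, hi = intervals[s]
            covered[s] = float(lo <= err_sto(s) <= hi)
        if t % Delta == 0 and t > k * Delta:         # lines 6-7: delayed update on the most recent computable indicator
            theta += gamma * (1 - alpha - covered[t - k * Delta + 1])
    return intervals, covered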

We show that the ACI-DF algorithm has the following time-average coverage guarantee, whose proof follows similarly to [GC21] and is given in Appendix C.

Theorem 5.1 (ACI with delayed feedback).

Suppose PIt:𝒵t1×{\rm PI}^{t}:{\mathcal{Z}}^{t-1}\times{\mathbb{R}}\to{\mathbb{R}} is a sequence of set constructing functions. Suppose there exists mm and MM such that for all z1:t1𝒵t1z_{1:t-1}\in{\mathcal{Z}}^{t-1}, we have PIt(z1:t1,θ)={\rm PI}^{t}(z_{1:t-1},\theta)={\mathbb{R}} for all θ>M\theta>M, and PIt(z1:t1,θ)={\rm PI}^{t}(z_{1:t-1},\theta)=\emptyset for all θ<m\theta<m. Then the rolling intervals {RIα,t}t1\{{\rm RI}^{\alpha,t}\}_{t\geq 1} from Algorithm 3 satisfy

limT1|ST,Δ|tST,Δ1{ErrstotRIα,t}=1α.\lim_{T\to\infty}\frac{1}{|S_{T,\Delta}|}\sum_{t\in S_{T,\Delta}}1\{\textup{Err}_{\rm sto}^{t}\in{\rm RI}^{\alpha,t}\}=1-\alpha. (22)

We note that, similar to adaptive conformal inference, we have flexibility in choosing the prediction interval constructing functions. In particular, to achieve the time-average coverage guarantee, we are not required to use a statistically valid prediction interval construction function. Indeed, any prediction interval PIt(z1:t1)=[𝗅𝗈(z1:t1),𝗁𝗂(z1:t1)]{\rm PI}^{t}(z_{1:t-1})=[{\sf lo}(z_{1:t-1}),{\sf hi}(z_{1:t-1})] can serve as the base procedure, by simply taking PIt(z1:t1;θ)=[𝗅𝗈(z1:t1)θ,𝗁𝗂(z1:t1)+θ]{\rm PI}^{t}(z_{1:t-1};\theta)=[{\sf lo}(z_{1:t-1})-\theta,{\sf hi}(z_{1:t-1})+\theta] to be the constructing function. However, if we hope for the average interval length to be small, we need a good base procedure. One suitable choice is the QFCV intervals from Section 3.1. That is, at iteration tt, we take PIt(z1:t1;θt)=PIQFCVα+θt{\rm PI}^{t}(z_{1:t-1};\theta_{t})={\rm PI}^{\alpha+\theta_{t}}_{{\rm QFCV}}, where PIQFCVα+θt{\rm PI}^{\alpha+\theta_{t}}_{{\rm QFCV}} is the QFCV interval (Algorithm 1) with n=t{n}=t and nominal mis-coverage α+θt\alpha+\theta_{t}. We call the resulting algorithm Adaptive QFCV, and the rolling intervals are denoted as {RIQFCVα,t}t1\{{\rm RI}_{{\rm QFCV}}^{\alpha,t}\}_{t\geq 1}.
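Two concrete choices of the interval-constructing function are sketched below: additive inflation of an arbitrary base interval, and the Adaptive QFCV choice of rerunning QFCV at nominal mis-coverage alpha + theta. The helper qfcv_interval is a hypothetical stand-in for Algorithm 1 run on z_{1:t-1} at a given nominal level.

import numpy as np

def inflate_interval(base_lo_hi, theta):
    # generic wrapper: PI^t(z_{1:t-1}; theta) = [lo - theta, hi + theta]
    lo, hi = base_lo_hi
    return lo - theta, hi + theta

def adaptive_qfcv_interval(z_hist, theta, alpha=0.1, qfcv_interval=None):
    # Adaptive QFCV: QFCV interval at nominal mis-coverage alpha + theta,
    # clipped so the level stays strictly between 0 and 1
    level = float(np.clip(alpha + theta, 1e-3, 1 - 1e-3))
    return qfcv_interval(z_hist, level)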

Time-average coverage versus instance-average coverage.

We note that the time-average coverage differs from the frequentist notion of instance-average coverage. To understand the difference, we define ciT=1{Errstot,iRIα,t,i}c_{iT}=1\{\textup{Err}_{\rm sto}^{t,i}\in{\rm RI}^{\alpha,t,i}\}, where i[n𝗌𝗂𝗆]i\in[{n_{\sf sim}}] is the index of the simulation instance, and t[T]t\in[T] is the time index. Different simulation instances are assumed to be IID. Then the coverage indicators form the following matrix:

[c11c1Tcn𝗌𝗂𝗆1cn𝗌𝗂𝗆T],\begin{bmatrix}c_{11}&\ldots&c_{1T}\\ &\ldots&\\ c_{{n_{\sf sim}}1}&\ldots&c_{{n_{\sf sim}}T}\end{bmatrix},

where each row is the coverage of a fixed simulation instance across different time indices, and each column is the coverage of a fixed time index across different simulation instances. The two different coverage notions are then:

  • Time-average coverage of instance ii: the row average limT1Tt=1Tcit\lim_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}c_{it}.

  • Instance-average coverage of time tt: the column average limn𝗌𝗂𝗆1n𝗌𝗂𝗆i=1n𝗌𝗂𝗆cit=(ErrstotRIα,t)\lim_{{n_{\sf sim}}\to\infty}\frac{1}{{n_{\sf sim}}}\sum_{i=1}^{{n_{\sf sim}}}c_{it}={\mathbb{P}}(\textup{Err}_{\rm sto}^{t}\in{\rm RI}^{\alpha,t}).

In practice, both notions of coverage are useful depending on the application scenario.
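In simulation, both notions can be read directly off the coverage matrix above; a minimal numpy sketch, assuming c is the n_sim-by-|S_{T,Delta}| array of indicators c_it:

import numpy as np

def coverage_summaries(c):
    time_avg_per_instance = c.mean(axis=1)      # row averages: one value per simulation instance i
    instance_avg_per_time = c.mean(axis=0)      # column averages: one value per time index t
    return time_avg_per_instance, instance_avg_per_time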

We further remark that valid time-average coverage does not imply valid instance-average coverage, and vice versa. The ACI-DF method yields valid time-average coverage, as shown in Theorem 5.1, while QFCV prediction intervals have asymptotically valid instance-average coverage under ergodicity, as shown in Theorem 3.1. We therefore expect AQFCV to perform well under both notions, although only its time-average coverage is guaranteed; there is no formal guarantee of valid instance-average coverage.

5.2 Numerical simulations of AQFCV

Comparing QFCV and AQFCV

We perform simulations to illustrate the coverage properties of QFCV and AQFCV (Adaptive conformal inference wrapped with QFCV) in both stationary and nonstationary time series. We generate n𝗌𝗂𝗆=500{n_{\sf sim}}=500 IID instances of time series with T=3000T=3000 steps each. The training set size for the (fixed window size) forecasting function is n𝗍𝗋=40{n_{\sf tr}}=40, and we evaluate the test error for the next n𝗍𝖾=5{n_{\sf te}}=5 steps. Furthermore, we generate the rolling interval every Δ=5\Delta=5 steps, starting from step 500500. So in total, there are (3000500)/5=500(3000-500)/5=500 rolling intervals. The time series are generated from the linear model (9) with p=20p=20 and β=(1,1,1,1,0,,0)𝖳p\beta=(1,1,1,1,0,\ldots,0)^{\sf{T}}\in\mathbb{R}^{p}. In the stationary setting, the noise process εt\varepsilon_{t} is an AR(1)\text{AR}(1) process with parameter ϕ=0.5\phi=0.5. In the non-stationary setting, εt\varepsilon_{t} is the sum of an ARIMA(1,1,0)\text{ARIMA}(1,1,0) process with AR parameter ϕ=0.99\phi=0.99 and an independent Gaussian sequence ηtiid𝒩(0,t4)\eta_{t}\sim_{iid}\mathcal{N}(0,t^{4}). This choice results in a noise process far from stationarity. In Figure 12(a) and 12(c), we report the instance-average coverage 1n𝗌𝗂𝗆i=1n𝗌𝗂𝗆cit\frac{1}{{n_{\sf sim}}}\sum_{i=1}^{n_{\sf sim}}c_{it} for tST,Δt\in S_{T,\Delta}, under the stationary and nonstationary settings. In Figure 12(b) and 12(d), we calculate the time-average coverage 1|ST,Δ|tST,Δcit\frac{1}{|S_{T,\Delta}|}\sum_{t\in S_{T,\Delta}}c_{it} for each instance i[n𝗌𝗂𝗆]i\in[{n_{\sf sim}}] and report the 5%5\% and 95%95\% quantiles over the 500500 instances.

Figure 12 shows that AQFCV results in asymptotically valid time-average coverage for both stationary and non-stationary settings. In stationary settings where QFCV is already instance-average valid, ACI-DF does not affect the instance-average validity. In nonstationary settings where QFCV slightly undercovers, ACI-DF can correct the interval to give correct time-average coverage. We note that AQFCV can also obtain valid instance-average coverage in nonstationary settings, an interesting phenomenon beyond our theories that deserves further investigation.

Comparing AFCV and AQFCV

We compare the performance of ACI-DF wrapped with two base methods: CLT-based FCV intervals (AFCV) and QFCV prediction intervals (AQFCV). Theorem 5.1 provides an asymptotic time-average coverage guarantee when ACI-DF is combined with any base method satisfying mild assumptions. However, AQFCV may have advantages over AFCV despite both having valid asymptotic coverage.

Through simulation, we illustrate why practitioners may prefer AQFCV over wrapping ACI-DF around a less carefully constructed prediction interval such as the FCV-based interval. In Figure 13, we compare the rolling intervals from AFCV and AQFCV on a generated time series with T=3000T=3000 steps. The forecasting function uses a fixed training window of n𝗍𝗋=40{n_{\sf tr}}=40 steps to generate forecasts for the next n𝗍𝖾=5{n_{\sf te}}=5 steps, with rolling intervals for the prediction error produced every Δ=5\Delta=5 steps starting at step 500500, for 500500 intervals in total. The time series comes from a 2020-dimensional linear model (9) with an ARMA(1,20)\text{ARMA}(1,20) noise process at ϕ=0.5\phi=0.5. This is the same generative process used in case (b) of Figures 6 and 11 and in Table 1.

Figure 13 shows that AQFCV provides more adaptive intervals than AFCV, with wider intervals in volatile regions and narrower intervals when volatility is low. This adaptiveness of AQFCV stems from the base QFCV method exploiting the time correlation between validation and test errors. We note that this phenomenon may not always occur, especially if error correlations are weak. Still, we recommend AQFCV: a good base method comes at low computational cost and brings benefits such as potential instance-average coverage, and quantile-based methods naturally handle arbitrary nonstationarity through a flexible quantile-based prediction interval constructing function.

(a) Instance-average coverage under stationarity
(b) Time-average coverage under stationarity
(c) Instance-average coverage under nonstationarity
(d) Time-average coverage under nonstationarity
Figure 12: Instance-average and time-average coverages for a fixed time and rolling window under stationary and nonstationary time series.
Figure 13: Comparison of adaptive QFCV and naive FCV intervals by wrapping around adaptive conformal inference over one time series instance.

6 Real data examples

6.1 French electricity dataset

We first evaluate QFCV and AQFCV using the French electricity price forecast dataset from [ZFG+22]. The dataset contains French electricity spot prices from 2016 to 2019, with (3×365+366)×24=35064(3\times 365+366)\times 24=35064 hourly observations of forecast consumptions and prices. The goal of the forecaster is to predict, at day DD, the 24 hourly prices of day D+1D+1. The forecaster uses the following features: the day-ahead forecast consumption, the day of the week, the 2424 prices of day D1D-1, and the 2424 prices of day D7D-7.

(a) QFCV
(b) AQFCV
Figure 14: French electricity prediction error time-average coverage with QFCV and AQFCV methods. Linear ridge regression is trained on data in the last 7 months to predict the next 24 hours of electricity prices so that n𝗍𝗋=24×7×30=5040{n_{\sf tr}}=24\times 7\times 30=5040. We choose n𝗏𝖺𝗅=Δ=n𝗍𝖾=24{n_{\sf val}}=\Delta={n_{\sf te}}=24, γ=0.01\gamma=0.01. Top row plots true prediction errors and calculated prediction intervals across time. The bottom row plots the time-average coverage of uncertainty intervals of prediction error.

We use linear ridge regression with an appropriately chosen fixed ridge parameter as the prediction algorithm f^\hat{f}, trained on the data of the last 77 months, so that n𝗍𝗋=24×7×30=5040{n_{\sf tr}}=24\times 7\times 30=5040. The test error is evaluated on the next 2424 hours of electricity prices, so n𝗍𝖾=24{n_{\sf te}}=24. We use QFCV and AQFCV to provide prediction intervals for the test error. For both methods, we set n𝗏𝖺𝗅=Δ=n𝗍𝖾=24{n_{\sf val}}=\Delta={n_{\sf te}}=24 and γ=0.01\gamma=0.01. The results are provided in Figure 14.
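A minimal sketch of the rolling evaluation behind Figure 14, assuming an hourly feature matrix X and price vector y aligned as described above, squared error as the loss for illustration, and a hypothetical ridge penalty:

import numpy as np
from sklearn.linear_model import Ridge

def rolling_test_errors(X, y, n_tr=5040, n_te=24, ridge_penalty=1.0):
    """Train ridge on the last n_tr hours, evaluate the mean squared error on the
    next n_te hours, then slide forward by n_te hours (one day at a time)."""
    errors = []
    for t in range(n_tr, len(y) - n_te + 1, n_te):
        model = Ridge(alpha=ridge_penalty).fit(X[t - n_tr:t], y[t - n_tr:t])
        pred = model.predict(X[t:t + n_te])
        errors.append(np.mean((pred - y[t:t + n_te]) ** 2))   # Err_sto for this day
    return np.array(errors)

The resulting error sequence is what QFCV's quantile regression step and the ACI-DF update of Section 5 operate on.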

By visual inspection, the error process of the French electricity dataset is not stationary, owing to large spikes of varying strengths. The AQFCV method, which wraps ACI-DF around QFCV, attains the asymptotically nominal 90%90\% coverage rate promised in Theorem 5.1. Although QFCV has no time-average coverage guarantee under nonstationarity, it covers adequately in this particular example, in fact over-covering somewhat. This contrasts with the simulation example in Section 5.2, where QFCV undercovers in nonstationary time series forecasting (see Figure 12). The simulation example is a special case where forecasting difficulty increases over time, so intervals built from historical data underestimate future errors, causing under-coverage. In the French electricity dataset, the nonstationarity arises from error spikes that also appear in the historical errors fed to the QFCV quantile regression step; treating such spikes as recurring under stationarity, QFCV outputs wide intervals and therefore over-covers.

6.2 Stock Market Volatility dataset

(a) Nvidia
(b) AMD
(c) BlackBerry
(d) Fannie Mae
Figure 15: Stock market volatility prediction error time-average coverage with QFCV and AQFCV methods. Garch(1,1)\text{Garch}(1,1) model is trained on data in the last 1000 business days to predict the next 77 days of realized volatility, such that n𝗍𝗋=1000{n_{\sf tr}}=1000. We choose n𝗏𝖺𝗅=n𝗍𝖾=7{n_{\sf val}}={n_{\sf te}}=7, Δ=1\Delta=1, and γ=0.01\gamma=0.01. Top row plots true prediction errors across time. Bottom row plots the time-average coverage of uncertainty intervals.

Next, we evaluate QFCV and AQFCV using the publicly available stock market volatility dataset from The Wall Street Journal, the same dataset adopted in [GC21]. As in [GC21], we select the same four stocks and use their daily minimum prices {Pt}1tT\{P_{t}\}_{1\leq t\leq T} from January 2005 to July 2023. For each stock we calculate daily returns Rt=(PtPt1)/Pt1R_{t}=(P_{t}-P_{t-1})/P_{t-1} for t2t\geq 2 and thus the realized volatility Vt=Rt2V_{t}=R_{t}^{2}. We consider the task of multiperiod forecasting of market volatility yt+1:t+n𝗍𝖾:=(Vt+1,,Vt+n𝗍𝖾)y_{t+1:t+{n_{\sf te}}}:=(V_{t+1},\ldots,V_{t+{n_{\sf te}}}) using previously observed returns xt:={Rs}1stx_{t}:=\{R_{s}\}_{1\leq s\leq t}.

Similar to [GC21], we adopt the Garch(1,1)\text{Garch}(1,1) model [Bol86] as the prediction algorithm f^\hat{f}, which assumes Rt=σtεtR_{t}=\sigma_{t}\varepsilon_{t} where εt\varepsilon_{t} are IID sampled from 𝒩(0,1)\mathcal{N}(0,1) and σt2=ω+τVt1+βσt12\sigma_{t}^{2}=\omega+\tau V_{t-1}+\beta\sigma_{t-1}^{2} with (ω,τ,β)(\omega,\tau,\beta) as trainable parameters. For a given time stage tt, we train on the data from the last n𝗍𝗋{n_{\sf tr}} business days {Rs}tn𝗍𝗋<st\{R_{s}\}_{t-{n_{\sf tr}}<s\leq t} to obtain coefficient estimates (ω^t,τ^t,β^t)(\hat{\omega}_{t},\hat{\tau}_{t},\hat{\beta}_{t}) and past standard deviation estimates {σ^st}tn𝗍𝗋<st\{\hat{\sigma}^{t}_{s}\}_{t-{n_{\sf tr}}<s\leq t}. Then we give point estimates of multiperiod volatility in the next n𝗍𝖾{n_{\sf te}} days as follows,

f^(r;{Rs}tn𝗍𝗋<st)=(σ^t+rt)2:={ω^t+τ^tVt1+β^t(σ^t1t)2,r=1,ω^t+(τ^t+β^t)(σ^t+r1t)2,1<rn𝗍𝖾.\hat{f}(r;\{R_{s}\}_{t-{n_{\sf tr}}<s\leq t})=(\hat{\sigma}_{t+r}^{t})^{2}:=\begin{cases}\hat{\omega}_{t}+\hat{\tau}_{t}V_{t-1}+\hat{\beta}_{t}(\hat{\sigma}_{t-1}^{t})^{2},&r=1,\\ \hat{\omega}_{t}+(\hat{\tau}_{t}+\hat{\beta}_{t})(\hat{\sigma}_{t+r-1}^{t})^{2},&1<r\leq{n_{\sf te}}.\end{cases}
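The following small sketch mirrors this recursion, with the fitted coefficients and the last in-sample quantities treated as given (in practice they come from the Garch(1,1) fit on the last n_tr business days):

import numpy as np

def realized_volatility(P):
    """Daily returns R_t = (P_t - P_{t-1}) / P_{t-1} and realized volatility V_t = R_t^2."""
    R = np.diff(P) / P[:-1]
    return R, R ** 2

def garch_multistep_forecast(omega, tau, beta, V_prev, sigma2_prev, n_te=7):
    """Multi-step variance forecasts: V_prev plays the role of V_{t-1} and
    sigma2_prev the role of (sigma_{t-1}^t)^2 in the display above."""
    sigma2 = omega + tau * V_prev + beta * sigma2_prev   # r = 1
    out = [sigma2]
    for _ in range(1, n_te):                             # 1 < r <= n_te
        sigma2 = omega + (tau + beta) * sigma2
        out.append(sigma2)
    return np.array(out)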

We then want to give a prediction interval for the test error incurred in predicting the volatility over the next n𝗍𝖾{n_{\sf te}} days. That is, we want to find intervals {RItα}t[T]\{\text{RI}_{t}^{\alpha}\}_{t\in[T]} such that

1Tt=1T1{1n𝗍𝖾r=1n𝗍𝖾((σ^t+rt)2Vt+r)2RItα}1α.\frac{1}{T}\sum_{t=1}^{T}1\Big{\{}\frac{1}{{n_{\sf te}}}\sum_{r=1}^{n_{\sf te}}\big{(}(\hat{\sigma}_{t+r}^{t})^{2}-V_{t+r}\big{)}^{2}\in\text{RI}_{t}^{\alpha}\Big{\}}\approx 1-\alpha.

For both QFCV and AQFCV, we set n𝗍𝗋=1000{n_{\sf tr}}=1000, n𝗏𝖺𝗅=n𝗍𝖾=7{n_{\sf val}}={n_{\sf te}}=7, Δ=1\Delta=1, and γ=0.01\gamma=0.01. Time-average coverages for all four stocks are presented in Figure 15. For all four stocks, the AQFCV method attains the asymptotically nominal 90%90\% coverage rate, as established in Theorem 5.1. On the other hand, we do not expect QFCV to have a time-average coverage guarantee under nonstationarity. In settings (a), (c), and (d) of Figure 15 (corresponding to Nvidia, BlackBerry, and Fannie Mae), QFCV undercovers, with an approximately 85%85\% coverage rate, in line with the simulation results in Section 5. In setting (c), QFCV recovers the nominal coverage rate in years with fewer spikes in the true error process. In setting (b) (corresponding to AMD), QFCV achieves the nominal coverage rate in more recent years, where the error process exhibits fewer and smaller spikes than in earlier years.

7 Conclusion and discussion

This paper provides a systematic investigation of inference for prediction errors in time series forecasting. We propose the quantile-based forward cross-validation (QFCV) method to construct prediction intervals for the stochastic test error Errsto\textup{Err}_{\rm sto}, which have asymptotically valid coverage under stationarity assumptions. Through simulations, we find that QFCV provides well-calibrated prediction intervals that can be substantially shorter than naive empirical quantiles when the validation error is predictive of the test error. For non-stationary settings, we propose an adaptive QFCV procedure (AQFCV) that combines QFCV intervals with adaptive conformal prediction using delayed feedback. We prove that AQFCV provides prediction intervals with asymptotic time-average coverage guarantees. Experiments on simulated and real-world time series demonstrate that AQFCV adapts efficiently, producing narrower intervals during periods of low volatility. Overall, we advocate the use of QFCV and AQFCV procedures for their statistical validity, computational simplicity, and flexibility across different time series characteristics.

Our work opens up several promising research avenues. One direction is proving that AQFCV retains QFCV's instance-average coverage guarantee when the underlying data-generating process is stationary. Another direction is developing statistically principled inference methods for prediction errors that are compatible with the expanding-window forecasting setup. More broadly, this work underscores the value of rigorous uncertainty quantification when evaluating predictive models on temporal data. We hope that our proposed methodology helps catalyze further research into validated approaches for quantifying uncertainty in time series forecasting tasks.

Acknowledgement

The authors would like to thank Ying Jin, Trevor Hastie, Art Owen, Anthony Norcia, and Shuangping Li for their helpful discussions. Song Mei was supported in part by NSF DMS-2210827 and CCF-2315725. Robert Tibshirani was supported by the National Institutes of Health (5R01 EB001988-16) and the National Science Foundation (19 DMS1208164).

References

  • [AB21] Anastasios N Angelopoulos and Stephen Bates. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv preprint arXiv:2107.07511, 2021.
  • [AC10] Sylvain Arlot and Alain Celisse. A survey of cross-validation procedures for model selection. 2010.
  • [All74] David M Allen. The relationship between variable selection and data augmentation and a method for prediction. Technometrics, 16(1):125–127, 1974.
  • [Avi03] Yossi Aviv. A time-series framework for supply-chain inventory management. Operations Research, 51(2):210–227, 2003.
  • [AZ20] Morgane Austern and Wenda Zhou. Asymptotics of cross-validation. arXiv preprint arXiv:2001.11111, 2020.
  • [BB11] Christoph Bergmeir and Jose M Benitez. Forecaster performance evaluation with cross-validation and variants. In 2011 11th International Conference on Intelligent Systems Design and Applications, pages 849–854. IEEE, 2011.
  • [BB12] Christoph Bergmeir and José M Benítez. On the use of cross-validation for time series predictor evaluation. Information Sciences, 191:192–213, 2012.
  • [BBJM20] Pierre Bayle, Alexandre Bayle, Lucas Janson, and Lester Mackey. Cross-validation confidence intervals for test error. Advances in Neural Information Processing Systems, 33:16339–16350, 2020.
  • [BCB14] Christoph Bergmeir, Mauro Costantini, and José M Benítez. On the usefulness of cross-validation for directional forecast evaluation. Computational Statistics & Data Analysis, 76:132–143, 2014.
  • [BCN94] Prabir Burman, Edmond Chow, and Deborah Nolan. A cross-validatory method for dependent data. Biometrika, 81(2):351–358, 1994.
  • [BHK18] Christoph Bergmeir, Rob J Hyndman, and Bonsoo Koo. A note on the validity of cross-validation for evaluating autoregressive time series prediction. Computational Statistics & Data Analysis, 120:70–83, 2018.
  • [BHT21] Stephen Bates, Trevor Hastie, and Robert Tibshirani. Cross-validation: what does it estimate and how well does it do it? arXiv preprint arXiv:2104.00673, 2021.
  • [Bol86] Tim Bollerslev. Generalized autoregressive conditional heteroskedasticity. Journal of econometrics, 31(3):307–327, 1986.
  • [BPvdL20] David Benkeser, Maya Petersen, and Mark J van der Laan. Improved small-sample estimation of nonlinear cross-validated prediction metrics. Journal of the American Statistical Association, 115(532):1917–1932, 2020.
  • [BWXB23] Aadyot Bhatnagar, Huan Wang, Caiming Xiong, and Yu Bai. Improved online conformal prediction via strongly adaptive online learning. arXiv preprint arXiv:2302.07869, 2023.
  • [Coc97] John H Cochrane. Time series for macroeconomics and finance, 1997.
  • [CTM20] Vitor Cerqueira, Luis Torgo, and Igor Mozetič. Evaluating time series forecasting models: An empirical study on performance estimation methods. Machine Learning, 109:1997–2028, 2020.
  • [Die98] Thomas G Dietterich. Approximate statistical tests for comparing supervised classification learning algorithms. Neural computation, 10(7):1895–1923, 1998.
  • [DZY+17] Chirag Deb, Fan Zhang, Junjing Yang, Siew Eang Lee, and Kwok Wei Shah. A review on time series forecasting techniques for building energy consumption. Renewable and Sustainable Energy Reviews, 74:902–924, 2017.
  • [Efr83] Bradley Efron. Estimating the error rate of a prediction rule: improvement on cross-validation. Journal of the American statistical association, 78(382):316–331, 1983.
  • [ET97] Bradley Efron and Robert Tibshirani. Improvements on cross-validation: the 632+ bootstrap method. Journal of the American Statistical Association, 92(438):548–560, 1997.
  • [FBR22] Shai Feldman, Stephen Bates, and Yaniv Romano. Conformalized online learning: Online calibration without a holdout set. arXiv preprint arXiv:2205.09095, 2022.
  • [FH20] Edwin Fong and Chris C Holmes. On the marginal likelihood and cross-validation. Biometrika, 107(2):489–496, 2020.
  • [FVD00] Philip Hans Franses and Dick Van Dijk. Non-linear time series models in empirical finance. Cambridge university press, 2000.
  • [GC21] Isaac Gibbs and Emmanuel Candès. Adaptive conformal inference under distribution shift. Advances in Neural Information Processing Systems, 34:1660–1672, 2021.
  • [GC22] Isaac Gibbs and Emmanuel Candès. Conformal inference for online prediction with arbitrary distribution shifts. arXiv preprint arXiv:2208.08401, 2022.
  • [Gei75] Seymour Geisser. The predictive sample reuse method with applications. Journal of the American statistical Association, 70(350):320–328, 1975.
  • [Gil05] Kenneth Gilbert. An arima supply chain model. Management Science, 51(2):305–310, 2005.
  • [HAB23] Hansika Hewamalage, Klaus Ackermann, and Christoph Bergmeir. Forecast evaluation for data scientists: common pitfalls and best practices. Data Mining and Knowledge Discovery, 37(2):788–832, 2023.
  • [Ham20] James Douglas Hamilton. Time series analysis. Princeton university press, 2020.
  • [Hay11] Fumio Hayashi. Econometrics. Princeton University Press, 2011.
  • [HTFF09] Trevor Hastie, Robert Tibshirani, Jerome H Friedman, and Jerome H Friedman. The elements of statistical learning: data mining, inference, and prediction, volume 2. Springer, 2009.
  • [Lei20] Jing Lei. Cross-validation with confidence. Journal of the American Statistical Association, 115(532):1978–1997, 2020.
  • [LLC+15] Bo Liu, Jianqiang Li, Cheng Chen, Wei Tan, Qiang Chen, and MengChu Zhou. Efficient motif discovery for large-scale time series in healthcare. IEEE Transactions on Industrial Informatics, 11(3):583–590, 2015.
  • [LPvdL15] Erin LeDell, Maya Petersen, and Mark van der Laan. Computationally efficient confidence intervals for cross-validated area under the roc curve estimates. Electronic journal of statistics, 9(1):1583, 2015.
  • [LTS22] Zhen Lin, Shubhendu Trivedi, and Jimeng Sun. Conformal prediction intervals with temporal dependence. arXiv preprint arXiv:2205.12940, 2022.
  • [MTBH05] Marianthi Markatou, Hong Tian, Shameek Biswas, and George M Hripcsak. Analysis of variance of cross-validation estimators of the generalization error. 2005.
  • [MWMC18] Pavol Mulinka, Sarah Wassermann, Gonzalo Marin, and Pedro Casas. Remember the good, forget the bad, do it fast-continuous learning over streaming data. In Continual Learning Workshop at NeurIPS 2018, 2018.
  • [NB99] Claude Nadeau and Yoshua Bengio. Inference for the generalization error. Advances in neural information processing systems, 12, 1999.
  • [PAA+22] Fotios Petropoulos, Daniele Apiletti, Vassilios Assimakopoulos, Mohamed Zied Babai, Devon K Barrow, Souhaib Ben Taieb, Christoph Bergmeir, Ricardo J Bessa, Jakub Bijak, John E Boylan, et al. Forecasting: theory and practice. International Journal of Forecasting, 2022.
  • [Rac00] Jeff Racine. Consistent cross-validatory model-selection for dependent data: hv-block cross-validation. Journal of econometrics, 99(1):39–61, 2000.
  • [RT19] Saharon Rosset and Ryan J Tibshirani. From fixed-x to random-x regression: Bias-variance decompositions, covariance penalties, and prediction error estimation. Journal of the American Statistical Association, 2019.
  • [SC11] Ingo Steinwart and Andreas Christmann. Estimating conditional quantiles with the help of the pinball loss. 2011.
  • [SEJS86] Petre Stoica, Pieter Eykhoff, Peter Janssen, and Torsten Söderström. Model-structure selection by cross-validation. International Journal of Control, 43(6):1841–1878, 1986.
  • [Sha93] Jun Shao. Linear model selection by cross-validation. Journal of the American statistical Association, 88(422):486–494, 1993.
  • [Sto74] Mervyn Stone. Cross-validatory choice and assessment of statistical predictions. Journal of the royal statistical society: Series B (Methodological), 36(2):111–133, 1974.
  • [SV08] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. Journal of Machine Learning Research, 9(3), 2008.
  • [SY22] Sophia Sun and Rose Yu. Copula conformal prediction for multi-step time series forecasting. arXiv preprint arXiv:2212.03281, 2022.
  • [TT09] Ryan J Tibshirani and Robert Tibshirani. A bias correction for the minimum error rate in cross-validation. 2009.
  • [VGS99] Volodya Vovk, Alexander Gammerman, and Craig Saunders. Machine-learning applications of algorithmic randomness. 1999.
  • [VS06] Sudhir Varma and Richard Simon. Bias in error estimation when using cross-validation for model selection. BMC bioinformatics, 7(1):1–8, 2006.
  • [Wag20] Stefan Wager. Cross-validation, risk estimation, and model selection: Comment on a paper by rosset and tibshirani. Journal of the American Statistical Association, 115(529):157–160, 2020.
  • [XX22] Chen Xu and Yao Xie. Sequential predictive conformal inference for time series. arXiv preprint arXiv:2212.03463, 2022.
  • [Yan07] Yuhong Yang. Consistency of cross validation for comparing regression procedures. 2007.
  • [You20] Waleed A Yousef. A leisurely look at versions and variants of the cross validation estimator. stat, 1050:9, 2020.
  • [You21] Waleed A Yousef. Estimating the standard error of cross-validation-based estimators of classifier performance. Pattern Recognition Letters, 146:115–125, 2021.
  • [ZFG+22] Margaux Zaffran, Olivier Féron, Yannig Goude, Julie Josse, and Aymeric Dieuleveut. Adaptive conformal predictions for time series. In International Conference on Machine Learning, pages 25834–25866. PMLR, 2022.
  • [Zha93] Ping Zhang. Model selection via multifold cross validation. The annals of statistics, pages 299–313, 1993.
  • [Zha95] Ping Zhang. Assessing prediction error in non-parametric regression. Scandinavian journal of statistics, pages 83–94, 1995.

Appendix A Proof of Theorem 3.1

Proof of Theorem 3.1.

Throughout the proof, we take Δ=1\Delta=1. For general Δ>1\Delta>1, the proof follows similarly. The proof uses the uniform convergence argument and relies on the following lemma.

Lemma A.1.

Let (E1,E2)(E_{1},E_{2}) be random variables that have a joint density with respect to the Lebesgue measure. Define the risk function

Rβ(f)=𝔼(E1,E2)[𝗉𝗂𝗇𝖻𝖺𝗅𝗅β(E2f(E1))].R_{\beta}(f)={\mathbb{E}}_{(E_{1},E_{2})}\Big{[}{\sf pinball}_{\beta}\Big{(}E_{2}-f(E_{1})\Big{)}\Big{]}.

Then for any function fargming:mRβ(g)f_{\star}\in\arg\min_{g:{\mathbb{R}}^{m}\to{\mathbb{R}}}R_{\beta}(g), we have,

(E1,E2)(E2f(E1))=β.{\mathbb{P}}_{(E_{1},E_{2})}(E_{2}\leq f_{\star}(E_{1}))=\beta.

We take K=nn𝗍𝗋n𝗍𝖾n𝗏𝖺𝗅+1K={n}-{n_{\sf tr}}-{n_{\sf te}}-{n_{\sf val}}+1 and define the empirical risk associated to the β\beta-pinball loss as

R^β(f)=1Ki=1K𝗉𝗂𝗇𝖻𝖺𝗅𝗅β(Erritestf(Errival)).\widehat{R}_{\beta}(f)=\frac{1}{K}\sum_{i=1}^{K}{\sf pinball}_{\beta}\Big{(}\textup{Err}_{i}^{{\rm test}}-f(\textup{Err}_{i}^{{\rm val}})\Big{)}.

For β=α/2\beta=\alpha/2 and β=1α/2\beta=1-\alpha/2, and for any quantile function qββq_{\beta}\in{\mathcal{F}}\cap{\mathcal{F}}_{\beta}, we define

εβ=13[inffβRβ(f)Rβ(qβ)].\varepsilon_{\beta}=\frac{1}{3}\Big{[}\inf_{f\in{\mathcal{F}}\setminus{\mathcal{F}}_{\beta}}R_{\beta}(f)-R_{\beta}(q_{\beta})\Big{]}. (23)

By the definition of β{\mathcal{F}}_{\beta}, we have εβ\varepsilon_{\beta} is well-defined. By Assumption 3 that {\mathcal{F}} is a finite set, we have εβ>0\varepsilon_{\beta}>0. By Assumption 1, for any fixed function ff\in{\mathcal{F}}, we have point convergence of R^β(f)\widehat{R}_{\beta}(f) to Rβ(f)R_{\beta}(f),

limn(|R^β(f)Rβ(f)|εβ)=1.\lim_{n\to\infty}{\mathbb{P}}\Big{(}\Big{|}\widehat{R}_{\beta}(f)-R_{\beta}(f)\Big{|}\leq\varepsilon_{\beta}\Big{)}=1. (24)

By Assumption 3 that {\mathcal{F}} is a finite set, applying union bound gives uniform convergence

limn(supf|R^β(f)Rβ(f)|εβ)=1.\lim_{n\to\infty}{\mathbb{P}}\Big{(}\sup_{f\in{\mathcal{F}}}\Big{|}\widehat{R}_{\beta}(f)-R_{\beta}(f)\Big{|}\leq\varepsilon_{\beta}\Big{)}=1. (25)

As a consequence, by Assumption 3 that qα/2,q1α/2\exists q_{\alpha/2},q_{1-\alpha/2}\in{\mathcal{F}} and by the definition of εβ\varepsilon_{\beta} as in Eq. (23), when the good event in Eq. (25) happens, we have

inffβR^β(f)R^β(qβ)\displaystyle\inf_{f\in{\mathcal{F}}\setminus{\mathcal{F}}_{\beta}}\widehat{R}_{\beta}(f)-\widehat{R}_{\beta}(q_{\beta})\geq [inffβRβ(f)Rβ(qβ)]2supf|R^β(f)Rβ(f)|3εβ2εβ=εβ.\displaystyle\leavevmode\nobreak\ \Big{[}\inf_{f\in{\mathcal{F}}\setminus{\mathcal{F}}_{\beta}}R_{\beta}(f)-R_{\beta}(q_{\beta})\Big{]}-2\sup_{f\in{\mathcal{F}}}\Big{|}\widehat{R}_{\beta}(f)-R_{\beta}(f)\Big{|}\geq 3\varepsilon_{\beta}-2\varepsilon_{\beta}=\varepsilon_{\beta}.

In words, any minimizer of the empirical risk should be contained in the set β{\mathcal{F}}_{\beta}. By the fact that f^nβ\hat{f}^{\beta}_{n} is a minimizer of the empirical risk R^β(f)\widehat{R}_{\beta}(f), we get that f^nβ\hat{f}^{\beta}_{n} is contained in the set β{\mathcal{F}}_{\beta} with high probability. That is, for β=α/2\beta=\alpha/2 or β=1α/2\beta=1-\alpha/2, we have

limn(f^nββ)=1.\lim_{n\to\infty}{\mathbb{P}}\Big{(}\hat{f}_{n}^{\beta}\in{\mathcal{F}}_{\beta}\Big{)}=1.

Then by Lemma A.1, we have

limn(Errsto[f^nα/2(Errval),f^n1α/2(Errval)])=1α.\lim_{n\to\infty}{\mathbb{P}}\Big{(}\textup{Err}_{\rm sto}\in[\hat{f}_{n}^{\alpha/2}(\textup{Err}_{\star}^{{\rm val}}),\hat{f}_{n}^{1-\alpha/2}(\textup{Err}_{\star}^{{\rm val}})]\Big{)}=1-\alpha.

This finishes the proof of the theorem. ∎

We next give a proof of Lemma A.1.

Proof of Lemma A.1.

Let p(E1,E2)(e1,e2)p_{(E_{1},E_{2})}(e_{1},e_{2}) denote the density of (E1,E2)(E_{1},E_{2}), pE1(e1)p_{E_{1}}(e_{1}) denote the marginal density of E1E_{1}, and pE2|E1(e2|e1)p_{E_{2}|E_{1}}(e_{2}|e_{1}) denote the conditional density of E2E_{2} given E1=e1E_{1}=e_{1}.

Step 1. Show that (E2f(e1)|E1=e1)=β{\mathbb{P}}(E_{2}\leq f_{\star}(e_{1})|E_{1}=e_{1})=\beta for almost every e1supp{pE1}e_{1}\in{\rm supp}\{p_{E_{1}}\}. For any fargming:mRβ(g)f_{\star}\in\arg\min_{g:{\mathbb{R}}^{m}\to{\mathbb{R}}}R_{\beta}(g), we claim that (E2f(e1)|E1=e1)=β{\mathbb{P}}(E_{2}\leq f_{\star}(e_{1})|E_{1}=e_{1})=\beta for almost every e1supp{pE1}e_{1}\in{\rm supp}\{p_{E_{1}}\}. Otherwise, suppose that for certain DD\subseteq{\mathbb{R}} with (E1D)>0{\mathbb{P}}(E_{1}\in D)>0, we have (E2f(e1)|E1=e1)β{\mathbb{P}}(E_{2}\leq f_{\star}(e_{1})|E_{1}=e_{1})\neq\beta, by the fact that the minimizer of the expected β\beta-pinball loss gives the β\beta-quantile, we have

𝔼E2|E1[𝗉𝗂𝗇𝖻𝖺𝗅𝗅β(E2f(e1))|E1=e1]>infh𝔼E2|E1[𝗉𝗂𝗇𝖻𝖺𝗅𝗅β(E2h)|E1=e1].{\mathbb{E}}_{E_{2}|E_{1}}\Big{[}{\sf pinball}_{\beta}\Big{(}E_{2}-f_{\star}(e_{1})\Big{)}\Big{|}E_{1}=e_{1}\Big{]}>\inf_{h}{\mathbb{E}}_{E_{2}|E_{1}}\Big{[}{\sf pinball}_{\beta}\Big{(}E_{2}-h\Big{)}\Big{|}E_{1}=e_{1}\Big{]}.

Then we can take g:Dg:D\to{\mathbb{R}} with g(e1)argminh𝔼E2|E1[𝗉𝗂𝗇𝖻𝖺𝗅𝗅β(E2h)|E1=e1]g(e_{1})\in\arg\min_{h\in{\mathbb{R}}}{\mathbb{E}}_{E_{2}|E_{1}}[{\sf pinball}_{\beta}(E_{2}-h)|E_{1}=e_{1}] for e1De_{1}\in D, and define f~(e1)=f(e1)1{e1D}+g(e1)1{e1D}\tilde{f}(e_{1})=f_{\star}(e_{1})1\{e_{1}\not\in D\}+g(e_{1})1\{e_{1}\in D\}. Then we have

𝔼(E1,E2)[𝗉𝗂𝗇𝖻𝖺𝗅𝗅β(E2f~(E1))]𝔼(E1,E2)[𝗉𝗂𝗇𝖻𝖺𝗅𝗅β(E2f(E1))]\displaystyle\leavevmode\nobreak\ {\mathbb{E}}_{(E_{1},E_{2})}\Big{[}{\sf pinball}_{\beta}\Big{(}E_{2}-\tilde{f}(E_{1})\Big{)}\Big{]}-{\mathbb{E}}_{(E_{1},E_{2})}\Big{[}{\sf pinball}_{\beta}\Big{(}E_{2}-f_{\star}(E_{1})\Big{)}\Big{]}
=\displaystyle= D{𝔼E2|E1[𝗉𝗂𝗇𝖻𝖺𝗅𝗅β(E2f~(e1))|E1=e1]𝔼E2|E1[𝗉𝗂𝗇𝖻𝖺𝗅𝗅β(E2f(e1))|E1=e1]}pE1(e1)𝑑e1<0,\displaystyle\leavevmode\nobreak\ \int_{D}\Big{\{}{\mathbb{E}}_{E_{2}|E_{1}}\Big{[}{\sf pinball}_{\beta}\Big{(}E_{2}-\tilde{f}(e_{1})\Big{)}\Big{|}E_{1}=e_{1}\Big{]}-{\mathbb{E}}_{E_{2}|E_{1}}\Big{[}{\sf pinball}_{\beta}\Big{(}E_{2}-f_{\star}(e_{1})\Big{)}\Big{|}E_{1}=e_{1}\Big{]}\Big{\}}p_{E_{1}}(e_{1})de_{1}<0,

which contradicts the fact that fargming:mRβ(g)f_{\star}\in\arg\min_{g:{\mathbb{R}}^{m}\to{\mathbb{R}}}R_{\beta}(g).

Step 2. Conclude the proof. We have

(E2f(E1))=(E2f(e1)|E1=e1)pE1(e1)𝑑e1=β,{\mathbb{P}}(E_{2}\leq f_{\star}(E_{1}))=\int{\mathbb{P}}(E_{2}\leq f_{\star}(e_{1})|E_{1}=e_{1})p_{E_{1}}(e_{1})de_{1}=\beta,

where the last equality is by Step 1. This proves the lemma. ∎

A.1 General result going beyond finite function class

In this section, we state and prove a more general version of Theorem 3.1, which goes beyond the assumption that {\mathcal{F}} is a finite function class. We start by stating the additional required assumptions.

Assumption 4.

(Errval,Errtest)(\textup{Err}_{\star}^{{\rm val}},\textup{Err}_{\star}^{{\rm test}}) is independent of {(Errtval,Errttest)}t1\{(\textup{Err}_{t}^{{\rm val}},\textup{Err}_{t}^{{\rm test}})\}_{t\geq 1}.

For a distribution QQ and for α(0,1)\alpha\in(0,1), the α\alpha-quantile of QQ is the set

qα(Q)={t:Q((,t])α,Q([t,+))1α}.q_{\alpha}(Q)=\{t\in{\mathbb{R}}:Q((-\infty,t])\geq\alpha,Q([t,+\infty))\geq 1-\alpha\}.

We write qα,min(Q)minqα(Q)q_{\alpha,\min}(Q)\equiv\min q_{\alpha}(Q) and qα,max(Q)maxqα(Q)q_{\alpha,\max}(Q)\equiv\max q_{\alpha}(Q). Consider the stationary distribution 𝒫(×){\mathcal{L}}\in{\mathcal{P}}({\mathbb{R}}\times{\mathbb{R}}) of {(Errtval,Errttest)}t1\{(\textup{Err}_{t}^{{\rm val}},\textup{Err}_{t}^{{\rm test}})\}_{t\geq 1} and let (X,Y)(X,Y)\sim{\mathcal{L}}. We denote X{\mathcal{L}}_{X} to be the marginal distribution of XX, and Y|x{\mathcal{L}}_{Y|x} to be the conditional distribution of YY given X=xX=x.

Assumption 5.

For X{\mathcal{L}}_{X}-almost-every xx\in{\mathbb{R}}, Y|x{\mathcal{L}}_{Y|x} has a density that is uniformly upper-bounded by CC.

Assumption 6 (Definition 2.6 of [SC11]).

We have suppY|x[0,B]{\rm supp}{\mathcal{L}}_{Y|x}\in[0,B] for X{\mathcal{L}}_{X}-almost-every xx\in{\mathbb{R}}. Furthermore, for X{\mathcal{L}}_{X}-almost-every xx\in{\mathbb{R}}, there exists c(x)(0,2]c(x)\in(0,2] and b(x)>0b(x)>0 such that for all s[0,c(x)]s\in[0,c(x)], we have

Y|x((qβ,min(Y|x)s,qβ,min(Y|x)))b(x)s,\displaystyle\leavevmode\nobreak\ {\mathcal{L}}_{Y|x}((q_{\beta,\min}({\mathcal{L}}_{Y|x})-s,q_{\beta,\min}({\mathcal{L}}_{Y|x})))\geq b(x)s,
Y|x((qβ,max(Y|x),qβ,max(Y|x)+s))b(x)s,β{α/2,1α/2}.\displaystyle\leavevmode\nobreak\ {\mathcal{L}}_{Y|x}((q_{\beta,\max}({\mathcal{L}}_{Y|x}),q_{\beta,\max}({\mathcal{L}}_{Y|x})+s))\geq b(x)s,\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \forall\beta\in\{\alpha/2,1-\alpha/2\}.

Moreover, denoting γ(x)=1/[b(x)c(x)]\gamma(x)=1/[b(x)\cdot c(x)], we have γL1(X)\gamma\in L^{1}({\mathcal{L}}_{X}).

Recall that we have

β={gargming:m𝔼[𝗉𝗂𝗇𝖻𝖺𝗅𝗅β(Errtestg(Errval))]}.{\mathcal{F}}_{\beta}=\Big{\{}g^{*}\in\arg\min_{g:{\mathbb{R}}^{m}\to{\mathbb{R}}}{\mathbb{E}}_{{\mathcal{L}}}\big{[}{\sf pinball}_{\beta}\big{(}\textup{Err}^{{\rm test}}-g(\textup{Err}^{{\rm val}})\big{)}\big{]}\Big{\}}.
Assumption 7.

Assume that (,)({\mathcal{F}},\|\cdot\|_{\infty}) is a compact space, and α/2{\mathcal{F}}_{\alpha/2}\cap{\mathcal{F}}\neq\emptyset, 1α/2{\mathcal{F}}_{1-\alpha/2}\cap{\mathcal{F}}\neq\emptyset.

Given these assumptions, we are ready to state the general asymptotic validity result.

Theorem A.2 (General version of Theorem 3.1).

Let Assumption 1, 4, 5, 6 and 7 hold. Let PIQFCVα{\rm PI}^{\alpha}_{\rm QFCV} be the output of Algorithm 1. Then we have

limn(Errsto[f^α/2(Errval),f^1α/2(Errval)])=1α.\lim_{n\to\infty}{\mathbb{P}}\Big{(}\textup{Err}_{\rm sto}\in[\hat{f}^{\alpha/2}(\textup{Err}^{{\rm val}}_{\star}),\hat{f}^{1-\alpha/2}(\textup{Err}^{{\rm val}}_{\star})]\Big{)}=1-\alpha. (26)
Proof of Theorem A.2.

For K=nn𝗍𝗋n𝗍𝖾n𝗏𝖺𝗅+1K={n}-{n_{\sf tr}}-{n_{\sf te}}-{n_{\sf val}}+1, define

R^β(f)=1Ki=1K𝗉𝗂𝗇𝖻𝖺𝗅𝗅β(Erritestf(Errival)).\widehat{R}_{\beta}(f)=\frac{1}{K}\sum_{i=1}^{K}{\sf pinball}_{\beta}\Big{(}\textup{Err}_{i}^{{\rm test}}-f(\textup{Err}_{i}^{{\rm val}})\Big{)}.

For any ε>0\varepsilon>0, by the fact that (,)({\mathcal{F}},\|\cdot\|_{\infty}) is compact, there exists an ε\varepsilon-cover N(ε;,)N(\varepsilon;{\mathcal{F}},\|\cdot\|_{\infty}) which is a finite set that satisfies the following: for any ff\in{\mathcal{F}}, there exists fN(ε;,)f^{\prime}\in N(\varepsilon;{\mathcal{F}},\|\cdot\|_{\infty}), such that ff<ε\|f-f^{\prime}\|_{\infty}<\varepsilon. By Assumption 1, for any fixed function fN(ε;,)f\in N(\varepsilon;{\mathcal{F}},\|\cdot\|_{\infty}), we have

limn(|R^β(f)Rβ(f)|ε)=1.\lim_{n\to\infty}{\mathbb{P}}\Big{(}\Big{|}\widehat{R}_{\beta}(f)-R_{\beta}(f)\Big{|}\leq\varepsilon\Big{)}=1. (27)

Applying union bound, we get uniform convergence over the ε\varepsilon-cover

limn(supfN(ε;,)|R^β(f)Rβ(f)|ε)=1.\lim_{n\to\infty}{\mathbb{P}}\Big{(}\sup_{f\in N(\varepsilon;{\mathcal{F}},\|\cdot\|_{\infty})}\Big{|}\widehat{R}_{\beta}(f)-R_{\beta}(f)\Big{|}\leq\varepsilon\Big{)}=1. (28)

Notice that we have |R^β(f)R^β(f)|ff|\widehat{R}_{\beta}(f)-\widehat{R}_{\beta}(f^{\prime})|\leq\|f-f^{\prime}\|_{\infty} and |Rβ(f)Rβ(f)|ff|R_{\beta}(f)-R_{\beta}(f^{\prime})|\leq\|f-f^{\prime}\|_{\infty} for any f,ff,f^{\prime}. Then we have uniform convergence over the whole space {\mathcal{F}}

limn(supf|R^β(f)Rβ(f)|3ε)=1.\lim_{n\to\infty}{\mathbb{P}}\Big{(}\sup_{f\in{\mathcal{F}}}\Big{|}\widehat{R}_{\beta}(f)-R_{\beta}(f)\Big{|}\leq 3\varepsilon\Big{)}=1. (29)

Denote f^βargminfR^β(f)\hat{f}_{\beta}\in\arg\min_{f\in{\mathcal{F}}}\widehat{R}_{\beta}(f) and fβargminfRβ(f)f^{\star}_{\beta}\in\arg\min_{f\in{\mathcal{F}}}R_{\beta}(f) (choose one if there are multiple of them). Then by standard decomposition, we have

Rβ(f^β)=\displaystyle R_{\beta}(\hat{f}_{\beta})= Rβ(f^β)R^β(f^β)+R^β(f^β)R^β(fβ)+R^β(fβ)Rβ(fβ)+Rβ(fβ)\displaystyle\leavevmode\nobreak\ R_{\beta}(\hat{f}_{\beta})-\widehat{R}_{\beta}(\hat{f}_{\beta})+\widehat{R}_{\beta}(\hat{f}_{\beta})-\widehat{R}_{\beta}(f_{\beta})+\widehat{R}_{\beta}(f_{\beta})-R_{\beta}(f_{\beta})+R_{\beta}(f_{\beta})
\displaystyle\leq 2supf|Rβ(f)R^β(f)|+Rβ(fβ).\displaystyle\leavevmode\nobreak\ 2\sup_{f\in{\mathcal{F}}}|R_{\beta}(f)-\widehat{R}_{\beta}(f)|+R_{\beta}(f_{\beta}).

Combining with Eq. (29), we have

limn(Rβ(f^β)minfRβ(f)+6ε)=1.\lim_{n\to\infty}{\mathbb{P}}\Big{(}R_{\beta}(\hat{f}_{\beta})\leq\min_{f\in{\mathcal{F}}}R_{\beta}(f)+6\varepsilon\Big{)}=1.

Furthermore, by Theorem 2.7 of [SC11] and by Assumption 7, we have

minqβf^βqL1(X)2γL1(X)Rβ(f^β)inffRβ(f).\min_{q\in{\mathcal{F}}_{\beta}}\|\hat{f}_{\beta}-q\|_{L^{1}({\mathcal{L}}_{X})}\leq 2\sqrt{\|\gamma\|_{L^{1}({\mathcal{L}}_{X})}}\sqrt{R_{\beta}(\hat{f}_{\beta})-\inf_{f\in{\mathcal{F}}}R_{\beta}(f)}.

This implies that for any ε>0\varepsilon>0 (may be different from the ε\varepsilon above), we have

limn(minqβf^βqL1(X)ε)=1.\lim_{n\to\infty}{\mathbb{P}}\Big{(}\min_{q\in{\mathcal{F}}_{\beta}}\|\hat{f}_{\beta}-q\|_{L^{1}({\mathcal{L}}_{X})}\leq\varepsilon\Big{)}=1.

Finally, by Assumption 5 that the density of Y|x{\mathcal{L}}_{Y|x} is uniformly bounded by CC, we have the inequality that |X,Y(Yf(X))X,Y(Yg(X))|CfgL1(X)|{\mathbb{P}}_{X,Y}(Y\leq f(X))-{\mathbb{P}}_{X,Y}(Y\leq g(X))|\leq C\|f-g\|_{L^{1}({\mathcal{L}}_{X})} for any fixed ff and gg. By Assumption 4 that (Errval,Errtest=Errsto)(\textup{Err}_{\star}^{{\rm val}},\textup{Err}_{\star}^{{\rm test}}=\textup{Err}_{\rm sto}) is independent of {(Errtval,Errttest)}t1\{(\textup{Err}_{t}^{{\rm val}},\textup{Err}_{t}^{{\rm test}})\}_{t\geq 1} but has the same distribution {\mathcal{L}}, taking X=ErrvalX=\textup{Err}_{\star}^{{\rm val}} and Y=Errtest=ErrstoY=\textup{Err}_{\star}^{{\rm test}}=\textup{Err}_{\rm sto}, we have

|X,Y(Y[f^α/2(X),f^1α/2(X)])X,Y(Y[fα/2(X),f1α/2(X)])|\displaystyle\leavevmode\nobreak\ \Big{|}{\mathbb{P}}_{X,Y}(Y\in[\hat{f}_{\alpha/2}(X),\hat{f}_{1-\alpha/2}(X)])-{\mathbb{P}}_{X,Y}(Y\in[f_{\alpha/2}(X),f_{1-\alpha/2}(X)])\Big{|}
\displaystyle\leq C[f^α/2fα/2L1(X)+f^1α/2f1α/2L1(X)].\displaystyle\leavevmode\nobreak\ C\Big{[}\|\hat{f}_{\alpha/2}-f_{\alpha/2}\|_{L^{1}({\mathcal{L}}_{X})}+\|\hat{f}_{1-\alpha/2}-f_{1-\alpha/2}\|_{L^{1}({\mathcal{L}}_{X})}\Big{]}.

Notice that X,Y(Y[fα/2(X),f1α/2(X)])=1α{\mathbb{P}}_{X,Y}(Y\in[f_{\alpha/2}(X),f_{1-\alpha/2}(X)])=1-\alpha. This gives that for any ε>0\varepsilon>0, we have

limn(|X,Y(Y[f^α/2(X),f^1α/2(X)])(1α)|ε)=1.\lim_{n\to\infty}{\mathbb{P}}\Big{(}\Big{|}{\mathbb{P}}_{X,Y}(Y\in[\hat{f}_{\alpha/2}(X),\hat{f}_{1-\alpha/2}(X)])-(1-\alpha)\Big{|}\leq\varepsilon\Big{)}=1.

This concludes the proof of Theorem A.2. ∎

Appendix B Proofs for Section 4

B.1 Formal statement and proof of Proposition 4.2

Proposition B.1 (Asymptotic normality of FCV(c) (restatement)).

Assume that {Ei}i0\{E_{i}\}_{i\geq 0} as defined in (11) is stationary ergodic, and satisfies

  • (1)

    K𝗍𝗋𝗎𝗇{K_{\sf trun}}-independence: {Ei}i[k]\{E_{i}\}_{i\in[k]} and {Ei}ik+K𝗍𝗋𝗎𝗇+1\{E_{i}\}_{i\geq k+{K_{\sf trun}}+1} are independent for any kk\in{\mathbb{N}}.

  • (2)

    The variance is finite: Var(E1)<{\rm Var}(E_{1})<\infty.

  • (3)

    Let t=σ(Es,st){\mathcal{F}}_{t}=\sigma(E_{s},s\leq t) be a filtration. We have limsVar[𝔼[Et+s|t]]=0\lim_{s\to\infty}{\rm Var}[{\mathbb{E}}[E_{t+s}|{\mathcal{F}}_{t}]]=0 for any tt\in{\mathbb{N}}.

  • (4)

    We have s=0Var(𝔼[Et+s|t]𝔼[Et+s|t1])1/2<\sum_{s=0}^{\infty}{\rm Var}({\mathbb{E}}[E_{t+s}|{\mathcal{F}}_{t}]-{\mathbb{E}}[E_{t+s}|{\mathcal{F}}_{t-1}])^{1/2}<\infty for any tt\in{\mathbb{N}}.

Then we have convergence in distribution as KK\to\infty

(Err^KFCVErr)/SE^KFCV(c)d𝒩(0,1),\big{(}{\widehat{\rm Err}}_{K}^{\rm FCV}-\textup{Err}\big{)}/{\widehat{\rm SE}}^{{\rm FCV}(c)}_{K}\stackrel{{\scriptstyle d}}{{\longrightarrow}}{\mathcal{N}}(0,1), (30)

where ${\widehat{\rm Err}}_{K}^{\rm FCV}$ is as defined in (11), and ${\widehat{\rm SE}}^{{\rm FCV}(c)}_{K}$ is as defined in (14).

Proof of Proposition B.1.

The proof of Proposition B.1 relies on the following two lemmas.

Lemma B.2 (Convergence of sample autocovariance).

Suppose that $\{E_{t}\}_{t\in[n]}$ is stationary ergodic with average $\bar{E}=n^{-1}\sum_{t=1}^{n}E_{t}$. Denote its autocovariance function and sample autocovariance function by

\gamma(k)={\mathbb{E}}[(E_{1}-{\mathbb{E}}[E_{1}])(E_{k+1}-{\mathbb{E}}[E_{k+1}])],\qquad\hat{\gamma}_{n}(k)=\frac{1}{n-k}\sum_{t=1}^{n-k}(E_{t}-\bar{E})(E_{t+k}-\bar{E}).

Then for any fixed $k\in{\mathbb{N}}$, as $n\to\infty$, we have convergence in probability

\hat{\gamma}_{n}(k)\stackrel{p}{\rightarrow}\gamma(k).

Lemma B.3 (Gordin’s central limit theorem; [Hay11], pages 402–405).

Suppose that $\{M_{t}\}_{t\in{\mathbb{Z}}}$ is mean-zero stationary ergodic, with autocovariance function $\gamma(k)={\mathbb{E}}[M_{1}M_{k+1}]$. Assume that:

  • (1) The variance is finite: ${\rm Var}[M_{t}]<\infty$.

  • (2) Let ${\mathcal{F}}_{t}=\sigma(M_{s},s\leq t)$ be a filtration. We have $\lim_{s\to\infty}{\rm Var}[{\mathbb{E}}[M_{t+s}|{\mathcal{F}}_{t}]]=0$ for any $t\in{\mathbb{N}}$.

  • (3) We have $\sum_{s=0}^{\infty}{\rm Var}({\mathbb{E}}[M_{t+s}|{\mathcal{F}}_{t}]-{\mathbb{E}}[M_{t+s}|{\mathcal{F}}_{t-1}])^{1/2}<\infty$ for any $t\in{\mathbb{N}}$.

Then $\sum_{k=0}^{\infty}|\gamma(k)|<\infty$, and as $n\to\infty$, we have convergence in distribution

\frac{1}{\sqrt{n}}\sum_{t=1}^{n}M_{t}\stackrel{d}{\rightarrow}{\mathcal{N}}\Big(0,\gamma(0)+2\sum_{k=1}^{\infty}\gamma(k)\Big).

By the assumptions of Proposition B.1 and by Gordin’s central limit theorem (Lemma B.3), as $K\to\infty$, we have convergence in distribution

\sqrt{K}\Big({\widehat{\rm Err}}_{K}^{\rm FCV}-\textup{Err}\Big)=K^{-1/2}\sum_{t=1}^{K}\big(E_{t}-{\mathbb{E}}[E_{t}]\big)\stackrel{d}{\longrightarrow}{\mathcal{N}}(0,{\rm SE}^{2}). \qquad (31)

Here, taking $\gamma(k)$ to be the autocovariance function of $\{E_{i}\}_{i\in[K]}$, we have

{\rm SE}^{2}=\gamma(0)+2\sum_{k=1}^{\infty}\gamma(k)=\gamma(0)+2\sum_{k=1}^{{K_{\sf trun}}}\gamma(k),

where the last equality follows from the assumption that $\{E_{i}\}_{i\in[K]}$ is ${K_{\sf trun}}$-independent.

Furthermore, by the definition of ${\widehat{\rm SE}}^{{\rm FCV}(c)}_{K}$ in (14) and by Lemma B.2, as $K\to\infty$, we have

K\cdot\big({\widehat{\rm SE}}^{{\rm FCV}(c)}_{K}\big)^{2}=\hat{\gamma}(0)+2\sum_{s=1}^{{K_{\sf trun}}}\Big(1-\frac{s}{K}\Big)\hat{\gamma}(s)\stackrel{p}{\rightarrow}\gamma(0)+2\sum_{s=1}^{{K_{\sf trun}}}\gamma(s)={\rm SE}^{2}. \qquad (32)

As a consequence, by Slutsky’s theorem, (31) and (32) imply that (30) holds. ∎
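To make the FCV(c) construction concrete, the following minimal Python sketch computes the FCV point estimate and the truncated, tapered standard error of Eq. (32) from validation errors $E_{1},\ldots,E_{K}$. It assumes the validation errors have already been computed; the function name fcv_c_interval and its signature are illustrative and not taken from the released code.

```python
import numpy as np
from scipy.stats import norm

def fcv_c_interval(E, k_trun, alpha=0.1):
    """Sketch of the FCV point estimate and the CLT-based FCV(c) interval.

    E       : array of validation errors E_1, ..., E_K
    k_trun  : truncation lag K_trun for the sample autocovariances
    """
    E = np.asarray(E, dtype=float)
    K = len(E)
    err_hat = E.mean()                       # FCV point estimate, Eq. (11)
    Ec = E - err_hat
    # Sample autocovariances gamma_hat(0), ..., gamma_hat(k_trun), as in Lemma B.2.
    gam = np.array([Ec[: K - s] @ Ec[s:] / (K - s) for s in range(k_trun + 1)])
    # Truncated, tapered long-run variance, matching Eq. (32).
    lrv = gam[0] + 2.0 * sum((1.0 - s / K) * gam[s] for s in range(1, k_trun + 1))
    se = np.sqrt(max(lrv, 0.0) / K)          # SE_hat^{FCV(c)}_K
    z = norm.ppf(1.0 - alpha / 2.0)
    return err_hat, (err_hat - z * se, err_hat + z * se)
```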

B.2 Proof of Proposition 4.3

Since $\{E_{i}\}_{i\in[K]}$ is a stationary ergodic process as in Assumption 1, we have that for any $\varepsilon>0$,

\lim_{K\to\infty}{\mathbb{P}}\Big(\Big|\frac{1}{K}\sum_{i=1}^{K}E_{i}-{\mathbb{E}}[E_{1}]\Big|\geq\varepsilon\Big)=0,\qquad\lim_{K\to\infty}{\mathbb{P}}\Big(\Big|\frac{1}{K}\sum_{i=1}^{K}\big(E_{i}-{\mathbb{E}}[E_{1}]\big)^{2}-{\rm Var}(E_{1})\Big|\geq\varepsilon\Big)=0.

This proves Eq. (17). Since the $E_{i}$’s follow Gaussian distributions, Eqs. (18) and (19) follow immediately.
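For comparison with the FCV(c) sketch above, the naive CLT-based FCV interval treats the validation errors as if they were IID and uses the usual mean-plus/minus-standard-error construction. A minimal sketch follows, with an illustrative function name; it is included only to make the object analyzed here concrete.

```python
import numpy as np
from scipy.stats import norm

def fcv_naive_interval(E, alpha=0.1):
    """Sketch of the naive CLT-based FCV interval: mean +/- z * sd / sqrt(K).

    This ignores the temporal correlation among the validation errors
    E_1, ..., E_K, which is exactly why the naive interval can miscover.
    """
    E = np.asarray(E, dtype=float)
    K = len(E)
    err_hat = E.mean()
    se = E.std(ddof=1) / np.sqrt(K)          # IID-style standard error
    z = norm.ppf(1.0 - alpha / 2.0)
    return err_hat - z * se, err_hat + z * se
```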

Appendix C Proof of Theorem 5.1

Proof of Theorem 5.1.

Throughout the proof, we assume that $\Delta=1$; the proof for general $\Delta>1$ follows similarly. Our argument parallels Proposition 4.1 in [GC21] and Theorem 1 in [FBR22].

Lemma C.1.

Under the conditions of Theorem 5.1, for all $t\in{\mathbb{N}}$ we have $\theta_{t}\in[m-{n_{\sf te}}\gamma,M+{n_{\sf te}}\gamma]$.

Proof of Lemma C.1.

Assume for the sake of contradiction that there exists $t\in{\mathbb{N}}$ such that $\theta_{t}>M+{n_{\sf te}}\gamma$ (the complementary case is similar). Further assume that $t$ is the first such index, so that $\theta_{t^{\prime}}\leq M+{n_{\sf te}}\gamma$ for all $t^{\prime}<t$. Since $c_{t},\alpha\in[0,1]$, we get that

\theta_{t-{n_{\sf te}}}=\theta_{t}-\gamma\sum_{s=t-2{n_{\sf te}}+1}^{t-{n_{\sf te}}}(1-\alpha-c_{s})\geq\theta_{t}-{n_{\sf te}}\gamma>M+{n_{\sf te}}\gamma-{n_{\sf te}}\gamma=M.

Therefore, $\theta_{t-{n_{\sf te}}}>M$. Since ${\rm PI}^{t}(*;\theta)={\mathbb{R}}$ for $\theta>M$, we get $c_{t-{n_{\sf te}}}=1\{\textup{Err}_{\rm sto}^{t-{n_{\sf te}}}\in{\rm PI}^{t-{n_{\sf te}}}(*;\theta_{t-{n_{\sf te}}})\}=1\{\textup{Err}_{\rm sto}^{t-{n_{\sf te}}}\in{\mathbb{R}}\}=1>1-\alpha$. As a result, $\theta_{t}=\theta_{t-1}+\gamma(1-\alpha-c_{t-{n_{\sf te}}})<\theta_{t-1}\leq M+{n_{\sf te}}\gamma$, which contradicts our assumption. ∎

We next prove Theorem 5.1. By Lemma C.1, $\theta_{t}\in[m-{n_{\sf te}}\gamma,M+{n_{\sf te}}\gamma]$ for all $t\in{\mathbb{N}}$. Denote $m^{\prime}=m-{n_{\sf te}}\gamma$ and $M^{\prime}=M+{n_{\sf te}}\gamma$. We follow the proof of [GC21] and expand the recursion:

[m^{\prime},M^{\prime}]\ni\theta_{T+1}=\theta_{n_{\sf te}}+\sum_{s={n_{\sf te}}}^{T}\gamma(1-\alpha-c_{s}).

Rearranging, we get:

\frac{m^{\prime}-\theta_{n_{\sf te}}}{T\gamma}\leq\frac{1}{T}\sum_{s={n_{\sf te}}}^{T}(1-\alpha-c_{s})=\frac{\theta_{T+1}-\theta_{n_{\sf te}}}{T\gamma}\leq\frac{M^{\prime}-\theta_{n_{\sf te}}}{T\gamma}.

Therefore we get

\Big|\frac{1}{T}\sum_{s={n_{\sf te}}}^{T}(1-\alpha-c_{s})\Big|\leq\frac{(M-m)+2{n_{\sf te}}\gamma}{T\gamma},

and

\Big|\frac{1}{T}\sum_{s=1}^{T}(1-\alpha-c_{s})\Big|\leq\frac{(M-m)+3{n_{\sf te}}\gamma}{T\gamma}.

Lastly, the definition of the loss, $c_{t}=1\{\textup{Err}_{\rm sto}^{t}\in{\rm RI}^{\alpha,t}\}$, gives us the risk statement. ∎
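To illustrate the delayed-feedback update analyzed above, here is a minimal Python sketch of the $\theta_{t}$ recursion with step size $\gamma$ and feedback delay $n_{\sf te}$ (the $\Delta=1$ case). The callable in_interval, which returns the coverage indicator $c_{t}$, and all names are placeholders: forming the rolling interval ${\rm RI}^{\alpha,t}$ itself requires the QFCV machinery, so this is a sketch under those assumptions rather than the released implementation.

```python
import numpy as np

def rolling_theta_path(in_interval, T, alpha=0.1, gamma=0.05, n_te=5, theta0=0.5):
    """Sketch of the update theta_t = theta_{t-1} + gamma * (1 - alpha - c_{t - n_te}).

    in_interval(t, theta) should return c_t = 1{Err_sto^t in PI^t(*; theta)};
    it is left to the caller. 1-based time indices mirror the proof.
    """
    theta = {1: float(theta0)}
    c = {}
    for t in range(1, T + 1):
        c[t] = in_interval(t, theta[t])      # coverage indicator at time t
        if t + 1 > n_te:
            # Feedback arrives with delay n_te: Err_sto^{t+1-n_te} is only revealed
            # once its test window has elapsed.
            theta[t + 1] = theta[t] + gamma * (1 - alpha - c[t + 1 - n_te])
        else:
            theta[t + 1] = theta[t]
    # Time-average miscoverage gap; Theorem 5.1 bounds it by ((M - m) + 3*n_te*gamma) / (T*gamma).
    gap = abs(np.mean([1 - alpha - c[t] for t in range(1, T + 1)]))
    return theta, c, gap
```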

Appendix D The QFCV point estimator and variations of QFCV intervals

In this section, we first provide QFCV point estimators for the stochastic test error. We then discuss two variations of the QFCV method: one that uses a longer memory span in the quantile regression, and one for the "expanding window" setting, which uses an incrementally expanding training dataset to fit the forecaster over time.

D.1 QFCV point estimators

We adopt the notation of Section 3.1. Recall that the tuples $\{(\textup{Err}_{i}^{{\rm val}},\textup{Err}_{i}^{{\rm test}})\}_{i\in[K]}$ and $(\textup{Err}_{\star}^{{\rm val}},\textup{Err}_{\star}^{{\rm test}})$ are defined in Eq. (6). Importantly, $\{(\textup{Err}_{i}^{{\rm val}},\textup{Err}_{i}^{{\rm test}})\}_{i\in[K]}$ and $\textup{Err}_{\star}^{{\rm val}}$ are computable from the observed dataset, and our goal is to estimate $\textup{Err}_{\star}^{{\rm test}}$. To provide a point estimator, we compute

\hat{f}=\arg\min_{f\in{\mathcal{F}}}\frac{1}{K}\sum_{i=1}^{K}\ell\big(\textup{Err}_{i}^{{\rm test}},f(\textup{Err}_{i}^{{\rm val}})\big),

where $\ell(e,\hat{e})$ is a loss function, such as the absolute or square loss. The QFCV point estimator is then given by

{\widehat{\rm Err}}_{K}^{\rm QFCV}=\hat{f}(\textup{Err}_{\star}^{{\rm val}}).

Note that $\textup{Err}_{\star}^{{\rm test}}$ is a random variable rather than a fixed parameter, so ${\widehat{\rm Err}}_{K}^{\rm QFCV}$ is not a point estimator in the classical statistical decision-theoretic sense.

In the simulation examples of Section 4.2, we adopt the square loss to produce point estimators. Table 1 shows that QFCV point estimators attain lower mean squared error (MSE) than the classical forward cross-validation estimator ${\widehat{\rm Err}}_{K}^{\rm FCV}$, especially when the time series is smoother, so that past validation errors give stronger signals about future test errors.
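As an illustration, the following minimal Python sketch computes the QFCV point estimator when $\ell$ is the square loss and ${\mathcal{F}}$ is taken, for concreteness, to be the class of linear functions; the linear function class, the function name, and the argument names are our assumptions rather than the paper's released code.

```python
import numpy as np

def qfcv_point_estimate(err_val, err_test, err_val_star):
    """Sketch of the QFCV point estimator: square loss, linear function class.

    err_val, err_test : arrays of (Err_i^val, Err_i^test), i = 1, ..., K
    err_val_star      : the most recent validation error Err_*^val
    """
    err_val = np.asarray(err_val, dtype=float)
    err_test = np.asarray(err_test, dtype=float)
    X = np.column_stack([np.ones(len(err_val)), err_val])   # intercept + Err^val
    coef, *_ = np.linalg.lstsq(X, err_test, rcond=None)     # f_hat minimizes the square loss over linear f
    return coef[0] + coef[1] * err_val_star                 # Err_hat_K^QFCV = f_hat(Err_*^val)
```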

D.2 Variations of the QFCV method

D.2.1 Longer memory span

The construction of validation and test sets is flexible and can be tailored to practitioners’ needs (recall the diagram for QFCV in Figure 3). Our theoretical guarantee for QFCV will hold as long as the covariates and targets are the same across folds. For instance, we can modify the number of covariates in quantile regression by including more past time windows. That is, we can replace step 2 in Algorithm 1 with:

\hat{f}^{\beta}=\arg\min_{f\in{\mathcal{F}}}\frac{1}{K}\sum_{i=m}^{K}{\sf pinball}_{\beta}\Big(\textup{Err}_{i}^{{\rm test}}-f(\textup{Err}_{i}^{{\rm val}},\ldots,\textup{Err}_{i-m+1}^{{\rm val}})\Big),

and replace the algorithm output with:

{\rm PI}^{\alpha}_{\rm QFCV}=\big[\hat{f}^{\alpha/2}(\textup{Err}^{{\rm val}}_{\star},\ldots,\textup{Err}_{K-m+2}^{{\rm val}}),\ \hat{f}^{1-\alpha/2}(\textup{Err}^{{\rm val}}_{\star},\ldots,\textup{Err}_{K-m+2}^{{\rm val}})\big].

The original QFCV method is the special case with $m=1$. Increasing the number of past windows $m$ may improve efficiency when there is long-range correlation between past validation errors and future test errors. However, the tradeoff is slower mixing and slower convergence rates for the quantile regression.
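A minimal sketch of this longer-memory variant is below, assuming a linear function class for the quantile regression and using the statsmodels QuantReg backend; both choices, as well as the function and argument names, are ours and not the paper's released code.

```python
import numpy as np
import statsmodels.api as sm

def qfcv_interval_lagged(err_val, err_test, err_val_recent, m=2, alpha=0.1):
    """Sketch of the longer-memory QFCV interval with m lagged validation errors.

    err_val, err_test : arrays of (Err_i^val, Err_i^test) for i = 1, ..., K
    err_val_recent    : the m most recent validation errors, newest first,
                        i.e. (Err_*^val, Err_K^val, ..., Err_{K-m+2}^val)
    """
    err_val = np.asarray(err_val, dtype=float)
    err_test = np.asarray(err_test, dtype=float)
    K = len(err_val)
    # Lagged design: row i holds (Err_i^val, ..., Err_{i-m+1}^val), for i = m, ..., K.
    X = np.column_stack([err_val[m - 1 - j: K - j] for j in range(m)])
    y = err_test[m - 1:]
    X = sm.add_constant(X)
    x_star = sm.add_constant(np.asarray(err_val_recent, dtype=float)[None, :],
                             has_constant="add")
    # Linear quantile regression with the pinball loss at levels alpha/2 and 1 - alpha/2.
    lo = sm.QuantReg(y, X).fit(q=alpha / 2).predict(x_star)[0]
    hi = sm.QuantReg(y, X).fit(q=1 - alpha / 2).predict(x_star)[0]
    return lo, hi
```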

Figure 16: Diagram illustrating QFCV with expanding time windows.

D.2.2 Expanding window setup

Figure 17: Comparison of different variants of QFCV methods for expanding time windows. Panels: (a) ARMA(1,0), $n_{\sf val}=5$; (b) ARMA(1,20), $n_{\sf val}=5$; (c) ARMA(1,0), $n_{\sf val}=20$; (d) ARMA(1,20), $n_{\sf val}=20$. The simulation setting is the same as in Figure 6, where the time series is generated from (9). Specifically, data are generated by $y_{t}=x_{t}^{\sf T}\beta+\varepsilon_{t}$, where $\{\varepsilon_{t}\}_{t=1}^{n}$ follows an ARMA(a,b) process. The time series length is fixed at $n=2000$, with $n_{\sf tr}=40$, $n_{\sf val}={n_{\sf te}}\in\{5,20\}$, and $p=20$. The x-axis represents the autocorrelation parameter $\phi\in[0,1]$. Each panel displays two plots: (i) actual coverage of the $90\%$ nominal QFCV intervals, and (ii) the ratio of the QFCV interval length to the true marginal quantiles interval length. The true marginal quantiles interval is defined by the $5\%$ and $95\%$ quantiles of the true test error distribution. Each curve displays the mean and standard error over $500$ simulation instances.

The QFCV method in Figure 3 uses a "rolling window" approach, in which a fixed-size training dataset is used to fit the forecaster. Another common setup is the "expanding window" setting, which uses an incrementally expanding training dataset to fit the forecaster over time. The QFCV method can be adapted to this setting; an illustrative diagram is shown in Figure 16. Specifically, we consider the following sets:

D_{i}=\{1,\ldots,(i-1)\Delta+{n_{\sf tr}}\},\qquad D=\{1,\ldots,{n}-{n_{\sf val}}\}, \qquad (33)
V_{i}=\{(i-1)\Delta+{n_{\sf tr}}+1,\ldots,(i-1)\Delta+{n_{\sf tr}}+{n_{\sf val}}\},\qquad V=\{{n}-{n_{\sf val}}+1,\ldots,{n}\},
D_{i\star}=\{1,\ldots,(i-1)\Delta+{n_{\sf tr}}+{n_{\sf val}}\},\qquad D_{\star}=\{1,\ldots,{n}\},
T_{i}=\{(i-1)\Delta+{n_{\sf tr}}+{n_{\sf val}},\ldots,(i-1)\Delta+{n_{\sf tr}}+{n_{\sf val}}+{n_{\sf te}}-1\},\qquad T=\{{n}+1,\ldots,{n}+{n_{\sf te}}\}.

With these sets $\{D_{i},V_{i},D_{i\star},T_{i}\}_{i\in[K]}$ and $\{D,V,D_{\star},T\}$, we can compute the tuples $\{(\textup{Err}_{i}^{{\rm val}},\textup{Err}_{i}^{{\rm test}})\}_{i\in[K]}$ and $\textup{Err}_{\star}^{\rm val}$ as defined in Equation (6), and then perform quantile regression to construct a prediction interval for $\textup{Err}_{\star}^{\rm test}$ as detailed in Algorithm 1. Due to the varying training size, we do not have stationarity for $\{(\textup{Err}_{t}^{{\rm val}},\textup{Err}_{t}^{{\rm test}})\}_{t\geq 1}$, even if the original data sequence $\{z_{t}\}_{t\geq 1}$ is stationary and ergodic. However, we may have approximate stationarity if the parameter estimates converge over time, which provides some justification for using the QFCV method in the expanding window setting.
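For concreteness, the following minimal Python sketch enumerates the expanding-window index sets of Eq. (33); the helper name expanding_window_folds is ours, and the indices follow Eq. (33) as written (1-based).

```python
def expanding_window_folds(n, K, n_tr, n_val, n_te, Delta=1):
    """Sketch of the expanding-window index sets in Eq. (33), with 1-based indices.

    Returns the per-fold sets (D_i, V_i, D_{i*}, T_i) for i = 1, ..., K,
    together with the final sets (D, V, D_*, T).
    """
    folds = []
    for i in range(1, K + 1):
        end_tr = (i - 1) * Delta + n_tr
        D_i = list(range(1, end_tr + 1))                           # expanding training set
        V_i = list(range(end_tr + 1, end_tr + n_val + 1))          # validation window
        D_istar = list(range(1, end_tr + n_val + 1))               # training + validation
        T_i = list(range(end_tr + n_val, end_tr + n_val + n_te))   # test window, as written in Eq. (33)
        folds.append((D_i, V_i, D_istar, T_i))
    D = list(range(1, n - n_val + 1))
    V = list(range(n - n_val + 1, n + 1))
    D_star = list(range(1, n + 1))
    T = list(range(n + 1, n + n_te + 1))
    return folds, (D, V, D_star, T)
```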

In Figure 17, we repeat the same simulation experiments as in Figure 6, but using the expanding window setting. Figure 17 shows that QFCV with different choices of $m$ generally achieves roughly the $90\%$ nominal coverage when $m$ is not too large, although we have no theoretical guarantee in this setting. Notably, in panel (b), with the $\text{ARMA}(1,20)$ model and $n_{\sf val}=5$, the QFCV method produces prediction intervals at the nominal coverage level with only half the interval length of the true marginal quantiles interval. Note also that the AQFCV method retains its time-average coverage guarantee in the expanding window setup.