Multi-step ahead prediction intervals for non-parametric autoregressions via bootstrap: consistency, debiasing and pertinence
Abstract
To address the difficult problem of multi-step ahead prediction for non-parametric autoregressions, we consider a forward bootstrap approach. Employing a local constant estimator, we can analyze a general type of non-parametric time series model, and show that the proposed point predictions are consistent for the true optimal predictor. We construct a quantile prediction interval that is asymptotically valid. Moreover, using a debiasing technique, we can asymptotically approximate the distribution of the multi-step ahead non-parametric estimator by bootstrap. As a result, we can build bootstrap prediction intervals that are pertinent, i.e., that capture the model estimation variability, thus improving upon the standard quantile prediction intervals. Simulation studies illustrate the finite-sample performance of our point predictions and pertinent prediction intervals.
1 Introduction
To model the asymmetry of financial returns, the volatility of stock markets, switching regimes, etc., non-linear time series models have attracted attention since the 1980s. Compared to linear time series models, non-linear models are more capable of depicting the underlying data-generating mechanism; see the review of politis2009financial for example. However, unlike linear models, where the one-step ahead predictor can simply be iterated, multi-step ahead prediction for non-linear models is cumbersome, since the intermediate innovations interact with the non-linearity and strongly influence the forecast.
In this paper, by combining the forward bootstrap of politis2015model with non-parametric estimation, we develop multi-step ahead (conditional) predictive inference for the general model:
$$X_t = m(X_{t-1}, \ldots, X_{t-p}) + \sigma(X_{t-1}, \ldots, X_{t-q})\,\epsilon_t; \qquad (1)$$
here, the $\epsilon_t$ are assumed to be independent, identically distributed (i.i.d.) with mean 0 and variance 1, and $m(\cdot)$ and $\sigma(\cdot)$ are functions that satisfy some smoothness conditions. We will also assume that the time series $\{X_t\}$ satisfying Eq. 1 is geometrically ergodic and causal, i.e., that for any $t$, $\epsilon_t$ is independent of $\{X_s,\ s < t\}$.
In Eq. 1, we have the trend/regression function $m(\cdot)$ depending on the last $p$ data points, while the standard deviation/volatility function $\sigma(\cdot)$ depends on the last $q$ data points; in many situations, $p$ and $q$ are taken to be equal for simplicity. Some special cases deserve mention, e.g., if $\sigma(\cdot) \equiv \sigma$ (constant), Eq. 1 yields a non-linear/non-parametric autoregressive model with homoscedastic innovations. The well-known ARCH/GARCH models are a special case of Eq. 1 with $m(\cdot) \equiv 0$ and $\sigma(\cdot)$ of a specific parametric form.
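To fix ideas, the following short Python sketch simulates a path from one hypothetical instance of model Eq. 1 with $p = q = 1$; the particular choices of $m$ and $\sigma$ (a sine trend and an ARCH-type volatility) are illustrative assumptions only, not examples taken from the paper.

```python
import numpy as np

def simulate_nonpar_ar(n, m, sigma, burn_in=200, rng=None):
    """Simulate X_t = m(X_{t-1}) + sigma(X_{t-1}) * eps_t with i.i.d. standard normal eps_t."""
    rng = np.random.default_rng(rng)
    x = np.empty(n + burn_in)
    x[0] = 0.0
    eps = rng.standard_normal(n + burn_in)          # i.i.d. innovations, mean 0, variance 1
    for t in range(1, n + burn_in):
        x[t] = m(x[t - 1]) + sigma(x[t - 1]) * eps[t]
    return x[burn_in:]                              # drop the burn-in to approximate stationarity

# Hypothetical trend and volatility functions (illustrative only):
m_fun = lambda x: np.sin(x)
s_fun = lambda x: np.sqrt(0.5 + 0.25 * x ** 2)

X = simulate_nonpar_ar(500, m_fun, s_fun, rng=42)   # a sample path of length 500
```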
Although the $L_2$-optimal one-step ahead prediction under Eq. 1 is trivial when we know the regression function $m(\cdot)$ or have a consistent estimator of it, the multi-step ahead prediction is not easy to obtain. In addition, it is non-trivial to find the $L_1$-optimal prediction even for one-step ahead forecasting. In several applied areas, e.g., econometrics, climate modeling, water resources management, etc., the data might not possess a finite 2nd moment, in which case optimizing $L_2$ loss is vacuous. For all such cases, but also of independent interest, prediction that is optimal with respect to $L_1$ loss should receive more attention in practice; see the detailed discussion in Ch. 10 of politis2015model. Later, we will show that our method is compatible with both $L_1$- and $L_2$-optimal multi-step ahead predictions.
The efforts to overcome the difficulty of forecasting non-linear time series can be traced back to the work of pemberton1987exact, where a numerical approach was proposed to compute the exact conditional $h$-step ahead optimal prediction of $X_{n+h}$ for a homoscedastic version of Eq. 1. However, this method is computationally intractable for long prediction horizons, and it requires knowledge of the distribution of the innovations and of the regression function, which is not realistic in practice.
Consequently, practitioners started to investigate suboptimal methods to perform multi-step ahead prediction. Generally speaking, these methods take one of two avenues: (1) direct prediction or (2) iterative prediction. The first idea involves working with a different (‘direct’) model, specific to $h$-step ahead prediction, namely:
$$X_{t+h} = m_h(X_t, \ldots, X_{t-p+1}) + \sigma_h(X_t, \ldots, X_{t-q+1})\,\xi_{t+h}. \qquad (2)$$
Even though $m_h$ and $\sigma_h$ are unknown to us, we can construct non-parametric estimators $\hat{m}_h$ and $\hat{\sigma}_h$, and plug them into Eq. 2 to perform $h$-step ahead prediction. lee2003new give a review of this approach. However, as pointed out by chen2004nonparametric, a drawback of this approach is that the information contained in the intermediate observations is disregarded. Furthermore, if the $\epsilon_t$ of Eq. 1 are i.i.d., then the innovations $\xi_{t+h}$ of Eq. 2 cannot be i.i.d. In other words, a practitioner must employ the (estimated) dependence structure of the $\xi_{t+h}$ of Eq. 2 in order to perform the prediction in an optimal fashion.
The second idea is “iterative prediction”, which employs one-step ahead predictors in a sequential way to perform a multi-step ahead forecast. For example, consider a 2-step ahead prediction under model Eq. 1 with $p = q = 1$; first note that the $L_2$-optimal predictor of $X_{n+1}$ is $m(X_n)$. The $L_2$-optimal predictor of $X_{n+2}$ is $E[m(X_{n+1}) \mid X_n]$, but since $X_{n+1}$ is unknown, it is tempting to plug in its predictor $m(X_n)$ in its place, yielding $m(m(X_n))$. This plug-in idea can be extended to multi-step ahead forecasts, but it does not lead to the optimal predictor except in the special case where the function $m(\cdot)$ is linear, e.g., in the case of a linear auto-regressive (LAR) model.
Remark.
Since neither of the above two approaches is satisfactory, we propose to approximate the distribution of the future value via a particular type of simulation when the model is known or, more generally, by bootstrap. To describe this approach, we rewrite Eq. 1 as
$$X_t = G(\mathbf{Y}_{t-1}, \epsilon_t),$$
where $\mathbf{Y}_{t-1}$ is a vector which represents the lagged values $(X_{t-1}, \ldots, X_{t-\max(p,q)})'$ and $G$ is some appropriate function. Then, when the model and innovation information are known to us, we can create a pseudo value of $X_{n+h}$. Taking a three-step ahead prediction as an example, the pseudo values can be defined as below:
$$X^*_{n+1} = G(\mathbf{Y}_{n}, \epsilon^*_{n+1}), \qquad X^*_{n+2} = G(\mathbf{Y}^*_{n+1}, \epsilon^*_{n+2}), \qquad X^*_{n+3} = G(\mathbf{Y}^*_{n+2}, \epsilon^*_{n+3}); \qquad (3)$$
here, $\epsilon^*_{n+1}, \epsilon^*_{n+2}, \epsilon^*_{n+3}$ are simulated as i.i.d. from $F_\epsilon$, the distribution of the innovations, and $\mathbf{Y}^*_{n+1}, \mathbf{Y}^*_{n+2}$ are the lag vectors updated with the simulated pseudo values. Repeating this process $M$ times to obtain pseudo values $X^*_{n+3,1}, \ldots, X^*_{n+3,M}$, the $L_2$-optimal prediction of $X_{n+3}$ can be estimated by the mean of these pseudo values. As already discussed, constructing the $L_1$-optimal predictor may also be required, since sometimes the $L_2$ loss is not well-defined; in our simulation framework, we can construct the $L_1$-optimal prediction by taking the median of the pseudo values. Moreover, we can even build a prediction interval (PI) to measure the forecasting accuracy based on quantile values of the simulated pseudo values. The extension of this algorithm to longer prediction horizons is illustrated in Section 2.
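The following Python sketch implements this simulation idea for the three-step ahead case under a known model with $p = q = 1$; it reuses the hypothetical functions m_fun, s_fun and the simulated series X from the earlier sketch, and assumes standard normal innovations for concreteness.

```python
import numpy as np

def simulate_h_step(x_n, m, sigma, h=3, M=5000, rng=None):
    """Approximate the conditional distribution of X_{n+h} given X_n = x_n,
    when m, sigma and the innovation law (here standard normal) are known."""
    rng = np.random.default_rng(rng)
    paths = np.full(M, x_n, dtype=float)
    for _ in range(h):                                # iterate Eq. (3) forward, step by step
        eps = rng.standard_normal(M)                  # fresh i.i.d. innovations at each step
        paths = m(paths) + sigma(paths) * eps
    return paths                                      # M pseudo values of X_{n+h}

pseudo = simulate_h_step(X[-1], m_fun, s_fun, h=3, rng=0)
l2_pred = pseudo.mean()                               # L2-optimal point prediction (mean)
l1_pred = np.median(pseudo)                           # L1-optimal point prediction (median)
pi_95 = np.quantile(pseudo, [0.025, 0.975])           # 95% interval from the pseudo values
```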
Realistically, practitioners would not know $m$, $\sigma$ and $F_\epsilon$. In this situation, the first step is to estimate these quantities and plug them into the above simulation, which then turns into a bootstrap method. In the spirit of this idea, several studies were done adopting different bootstrap techniques. thombs1990bootstrap proposed a backward bootstrap trick to predict AR models. The advantage of the backward method is that each bootstrap prediction is naturally conditional on the latest observations, which coincides with conditional prediction in the real world. However, this method cannot handle non-linear time series whose backward representation may not exist. Later, pascual2004bootstrap proposed a strategy to generate bootstrap series forward. To resolve the conditional prediction issue, they fixed the last bootstrap values to be the true observations and computed predictions iteratively in the bootstrap world starting from there. They then extended this procedure to forecast the GARCH model in pascual2006bootstrap.
Sharing a similar idea, pan2016bootstrap defined the forward bootstrap to do prediction, but they proposed a different PI format which empirically has better performance in terms of coverage rate (CVR) and length (LEN), compared to the PI of pascual2004bootstrap. Although pan2016bootstrap covered the forecasting of a non-linear and/or non-parametric time series model, only one-step ahead prediction was considered. The case of multi-step ahead prediction of non-linear (but parametric) time series models was recently addressed in wu2023bootstrap. In the paper at hand, we address the case of multi-step ahead prediction of non-parametric time series models as in Eq. 1. Beyond discussing $L_1$- and $L_2$-optimal point predictions, we consider two types of PI: the Quantile PI (QPI) and the Pertinent PI (PPI). As already mentioned, the former can be approximated by taking quantile values of the future value’s distribution in the bootstrap world. The PPI requires a more complicated and computationally heavy procedure, as it attempts to capture the variability of parameter estimation. This additional effort results in improved finite-sample coverage as compared to the QPI.
As in most non-parametric estimation problems, the issue of bias becomes important. We will show that debiasing the inherent bias-type terms of the non-parametric estimation is necessary to guarantee the pertinence of a PI when multi-step ahead predictions are required. Although the QPI and PPI are asymptotically equivalent, the PPI renders better CVR in finite samples; see the formal definition of the PPI in the work of politis2015model and pan2016bootstrap. Analogously to the successful construction of PIs in the work of politis2013model, we may employ predictive (as opposed to fitted) residuals in the bootstrap process to further alleviate the finite-sample undercoverage of bootstrap PIs in practice.
The paper is organized as follows. In Section 2, forward bootstrap prediction algorithms with local constant estimators are given. The asymptotic properties of point predictions and PIs are discussed in Section 3. Simulations are given in LABEL:Sec:simulation to substantiate the finite-sample performance of our methods. Conclusions are given in LABEL:Sec:conclusion. All proofs can be found in Appendix A. Discussions on debiasing and pertinence related to building PIs are presented in Appendices B through D.
2 Non-parametric forward bootstrap prediction
As discussed in the Remark of Section 1, we can apply the simulation or bootstrap technique to approximate the distribution of a future value. In general, this idea works for any geometrically ergodic auto-regressive model, whether linear or non-linear. For example, if we have the known general model Eq. 1 at hand, we can do $h$-step ahead predictions following the same logic as the three-step ahead prediction example of that Remark.
To elaborate, we need to simulate $\epsilon^*_{n+1}, \ldots, \epsilon^*_{n+h}$ as i.i.d. from $F_\epsilon$ and then compute the pseudo value $X^*_{n+h}$ iteratively with the simulated innovations as below:
$$X^*_{n+i} = G(\mathbf{Y}^*_{n+i-1}, \epsilon^*_{n+i}), \quad i = 1, \ldots, h, \qquad \text{with } \mathbf{Y}^*_{n} = \mathbf{Y}_{n}, \qquad (4)$$
where $\mathbf{Y}^*_{n+i-1}$ denotes the lag vector built from the observed data and the previously simulated pseudo values.
Repeating this procedure $M$ times, we can make prediction inference with the empirical distribution of $X^*_{n+h}$. Similarly, if the model and innovation distribution are unknown to us, we can do the estimation first to obtain $\hat{m}$, $\hat{\sigma}$ and $\hat{F}_\epsilon$. Then, the above simulation-based algorithm turns into a bootstrap-based algorithm. More specifically, we bootstrap $\epsilon^*$ from $\hat{F}_\epsilon$ and calculate the pseudo value $X^*_{n+h}$ iteratively with $\hat{m}$ and $\hat{\sigma}$. The prediction inference can again be conducted with the empirical distribution of $X^*_{n+h}$.
The simulation/bootstrap idea of the Remark in Section 1 was recently implemented by wu2023bootstrap in the case where the model is either known or parametrically specified. In what follows, we will focus on the case of the non-parametric model Eq. 1 and will analyze the asymptotic properties of the point predictor and prediction interval. For the sake of simplicity, we consider only the case $p = q = 1$; the general case can be handled similarly, but the notation is considerably more cumbersome. Assume we observe $n$ data points and denote them as $X_1, \ldots, X_n$; our goal is prediction inference of $X_{n+h}$ for some $h \geq 1$. If we know $m$, $\sigma$ and $F_\epsilon$, we can take a simulation approach to develop prediction inference as explained in Section 1. When $m$, $\sigma$ and $F_\epsilon$ are unknown, we start by estimating $m$ and $\sigma$; we then estimate $F_\epsilon$ based on the empirical distribution of residuals. Subsequently, we can deploy a bootstrap-based method to approximate the distribution of future values. Several algorithms are given for this purpose in what follows.
2.1 Bootstrap algorithm for point prediction and QPI
For concreteness, we focus on local constant estimators, i.e., kernel-smoothed estimators of Nadaraya-Watson type; other estimators can be applied similarly. The local constant estimators of $m(x)$ and $\sigma^2(x)$ are respectively defined as:
$$\hat{m}(x) = \frac{\sum_{t=2}^{n} X_t\, K\!\left(\frac{x - X_{t-1}}{b}\right)}{\sum_{t=2}^{n} K\!\left(\frac{x - X_{t-1}}{b}\right)}, \qquad \hat{\sigma}^2(x) = \frac{\sum_{t=2}^{n} \left(X_t - \hat{m}(X_{t-1})\right)^2 K\!\left(\frac{x - X_{t-1}}{b}\right)}{\sum_{t=2}^{n} K\!\left(\frac{x - X_{t-1}}{b}\right)}; \qquad (5)$$
here, $K(\cdot)$ is a non-negative kernel function that satisfies some regularity assumptions; see Section 3 for details. We use $b$ to represent the bandwidth of the kernel functions, although $b$ may take a different value for the mean and variance estimators. Due to theoretical and practical issues, we need to truncate the above local constant estimators as follows:
$$\hat{m}_T(x) = \max\!\big(\!\min(\hat{m}(x),\, C_m),\, -C_m\big), \qquad \hat{\sigma}_T(x) = \max\!\big(\!\min(\hat{\sigma}(x),\, C_\sigma),\, c_\sigma\big); \qquad (6)$$
here, $C_m$ and $C_\sigma$ are large enough and $c_\sigma > 0$ is small enough.
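As a concrete illustration, here is a minimal Python sketch of the truncated local constant estimators of Eqs. 5-6 for the case $p = q = 1$; the Gaussian kernel and the particular truncation constants are illustrative assumptions rather than prescriptions from the paper.

```python
import numpy as np

def nw_estimators(X, b, m_cap=1e6, s_floor=1e-3, s_cap=1e6):
    """Truncated local constant (Nadaraya-Watson) estimators of m(x) and sigma(x),
    cf. Eqs. (5)-(6), using a Gaussian kernel with bandwidth b. Returns two callables."""
    x_lag, x_resp = X[:-1], X[1:]                      # scatter-plot pairs (X_{t-1}, X_t)

    def kern(u):
        return np.exp(-0.5 * u ** 2)                   # unnormalized Gaussian kernel

    def m_hat(x):
        w = kern((np.asarray(x, dtype=float)[..., None] - x_lag) / b)
        return np.clip((w * x_resp).sum(-1) / w.sum(-1), -m_cap, m_cap)

    def s_hat(x):
        w = kern((np.asarray(x, dtype=float)[..., None] - x_lag) / b)
        sq_res = (x_resp - m_hat(x_lag)) ** 2          # squared residuals at the design points
        return np.clip(np.sqrt((w * sq_res).sum(-1) / w.sum(-1)), s_floor, s_cap)

    return m_hat, s_hat
```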
Using $\hat{m}_T$ and $\hat{\sigma}_T$ in Eq. 1, we can obtain the fitted residuals, which are defined as:
$$\hat{\epsilon}_t = \frac{X_t - \hat{m}_T(X_{t-1})}{\hat{\sigma}_T(X_{t-1})}, \quad t = 2, \ldots, n. \qquad (7)$$
Later in Section 3, we will show that the innovation distribution $F_\epsilon$ can be consistently estimated by the centered empirical distribution of the fitted residuals, i.e., the empirical distribution of $\hat{\epsilon}_t - \bar{\hat{\epsilon}}$ where $\bar{\hat{\epsilon}} = (n-1)^{-1}\sum_{t=2}^{n}\hat{\epsilon}_t$, under some standard assumptions. We now have all the ingredients to present the bootstrap-based Algorithm 1 that yields the point prediction and QPI of $X_{n+h}$.
Step 1 | With data $\{X_1, \ldots, X_n\}$, construct the estimators $\hat{m}_T$ and $\hat{\sigma}_T$ by formula Eq. 6.
Step 2 | Compute fitted residuals $\{\hat{\epsilon}_t\}_{t=2}^{n}$ based on Eq. 7, and let $\bar{\hat{\epsilon}} = (n-1)^{-1}\sum_{t=2}^{n}\hat{\epsilon}_t$. Denote by $\hat{F}_\epsilon$ the empirical distribution of the centered residuals $\hat{\epsilon}_t - \bar{\hat{\epsilon}}$ for $t = 2, \ldots, n$.
Step 3 | Generate $\epsilon^*_{n+1}, \ldots, \epsilon^*_{n+h}$ i.i.d. from $\hat{F}_\epsilon$. Then, construct bootstrap pseudo-values $X^*_{n+1}, \ldots, X^*_{n+h}$ iteratively, i.e., $$X^*_{n+i} = \hat{m}_T(X^*_{n+i-1}) + \hat{\sigma}_T(X^*_{n+i-1})\,\epsilon^*_{n+i}, \quad i = 1, \ldots, h, \quad \text{with } X^*_{n} = X_n. \qquad (8)$$ For example, $X^*_{n+1} = \hat{m}_T(X_n) + \hat{\sigma}_T(X_n)\,\epsilon^*_{n+1}$ and $X^*_{n+2} = \hat{m}_T(X^*_{n+1}) + \hat{\sigma}_T(X^*_{n+1})\,\epsilon^*_{n+2}$.
Step 4 | Repeating Step 3 $M$ times, we obtain $M$ pseudo-value replicates of $X^*_{n+h}$ that we denote $X^*_{n+h,1}, \ldots, X^*_{n+h,M}$. Then, the $L_2$- and $L_1$-optimal predictors can be approximated by the sample mean and the sample median of these replicates, respectively. Furthermore, a $(1-\alpha)100$% QPI can be built as $[Q^*_{\alpha/2},\, Q^*_{1-\alpha/2}]$, where $Q^*_{\alpha/2}$ and $Q^*_{1-\alpha/2}$ denote the $\alpha/2$ and $1-\alpha/2$ sample quantiles of the values $X^*_{n+h,1}, \ldots, X^*_{n+h,M}$.
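A compact Python sketch of Algorithm 1 follows, assuming $p = q = 1$ and reusing the nw_estimators helper and simulated series X from the earlier sketches; the bandwidth b and the number of replicates M are user choices.

```python
import numpy as np

def algorithm1(X, b, h, M=2000, alpha=0.05, rng=None):
    """Point predictions and QPI for X_{n+h} via forward bootstrap (sketch of Algorithm 1)."""
    rng = np.random.default_rng(rng)

    # Step 1: truncated local constant estimators (Eq. 6).
    m_hat, s_hat = nw_estimators(X, b)

    # Step 2: centered fitted residuals (Eq. 7) and their empirical distribution.
    resid = (X[1:] - m_hat(X[:-1])) / s_hat(X[:-1])
    resid = resid - resid.mean()

    # Step 3: iterate Eq. (8) forward from the last observation X_n, drawing bootstrap
    # innovations from the centered residuals; all M replicates are propagated at once.
    paths = np.full(M, X[-1], dtype=float)
    for _ in range(h):
        eps_star = rng.choice(resid, size=M, replace=True)
        paths = m_hat(paths) + s_hat(paths) * eps_star

    # Step 4: point predictors and the quantile prediction interval.
    qpi = np.quantile(paths, [alpha / 2, 1 - alpha / 2])
    return paths.mean(), np.median(paths), qpi          # L2 predictor, L1 predictor, QPI

l2_hat, l1_hat, qpi_95 = algorithm1(X, b=0.4, h=3, rng=1)   # b = 0.4 is an arbitrary choice
```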
Remark.
To construct the QPI of Algorithm 1, we may employ the MSE-optimal bandwidth rate, i.e., $b \sim n^{-1/5}$. However, in practice with a small sample size, the QPI achieves better empirical CVR for multi-step ahead predictions by adopting an under-smoothing bandwidth; see Appendix B for related discussions and see LABEL:Sec:simulation for simulation comparisons between applying the optimal and under-smoothing bandwidths to the QPI.
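For illustration only, a rule-of-thumb way to generate such bandwidths in code might look as follows; the scaling constant and the specific under-smoothing rate $n^{-1/3}$ are assumptions made for this sketch, not the paper's prescription.

```python
import numpy as np

n = len(X)
b_opt = 1.06 * X.std() * n ** (-1 / 5)     # rule-of-thumb bandwidth at the MSE-optimal rate
b_under = 1.06 * X.std() * n ** (-1 / 3)   # an under-smoothing choice (faster-vanishing rate)
```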
In the next section, we will show the conditional asymptotic consistency of our optimal point predictions and of the QPI. In particular, we verify that our point predictions converge to the oracle optimal point predictors in probability, conditional on the observed data. In addition, we look for an asymptotically valid PI with CVR $1-\alpha$ to measure the prediction accuracy conditional on the latest observed data, which is defined as:
$$P\left(L_n \leq X_{n+h} \leq U_n\right) \longrightarrow 1 - \alpha \ \text{ as } n \to \infty, \qquad (9)$$
where $L_n$ and $U_n$ are the lower and upper PI bounds, respectively. Although not explicitly denoted, the probability should be understood as the conditional probability given the latest observation $X_n$. Later, based on a sequence of sets that contain the observed sample with probability tending to 1, we will show how to build a prediction interval that is asymptotically valid by the bootstrap technique even when the model information is unknown.
Although asymptotically correct, in finite samples the QPI typically suffers from undercoverage; see the discussion in politis2015model and pan2016bootstrap. To improve the CVR in practice, we consider using predictive residuals in the bootstrap process. To derive such predictive residuals, we need to estimate the model based on the delete-$t$ dataset, i.e., the data available for the scatter plot of $X_s$ vs. $X_{s-1}$ for $s = 2, \ldots, n$, excluding the single pair at $s = t$. More specifically, we define the delete-$t$ local constant estimators as:
$$\hat{m}^{(t)}(x) = \frac{\sum_{s=2, s \neq t}^{n} X_s\, K\!\left(\frac{x - X_{s-1}}{b}\right)}{\sum_{s=2, s \neq t}^{n} K\!\left(\frac{x - X_{s-1}}{b}\right)}, \qquad \hat{\sigma}^{(t)2}(x) = \frac{\sum_{s=2, s \neq t}^{n} \left(X_s - \hat{m}^{(t)}(X_{s-1})\right)^2 K\!\left(\frac{x - X_{s-1}}{b}\right)}{\sum_{s=2, s \neq t}^{n} K\!\left(\frac{x - X_{s-1}}{b}\right)}. \qquad (10)$$
Similarly, the truncated delete-$t$ local estimators $\hat{m}^{(t)}_T$ and $\hat{\sigma}^{(t)}_T$ can be defined according to Eq. 6. We now construct the so-called predictive residuals as:
$$\tilde{\epsilon}_t = \frac{X_t - \hat{m}^{(t)}_T(X_{t-1})}{\hat{\sigma}^{(t)}_T(X_{t-1})}, \quad t = 2, \ldots, n. \qquad (11)$$
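A direct (if computationally naive) Python sketch of Eqs. 10-11 is shown below, again with a Gaussian kernel and a small volatility floor standing in for the truncation of Eq. 6; both are illustrative assumptions.

```python
import numpy as np

def predictive_residuals(X, b, s_floor=1e-3):
    """Predictive residuals (Eq. 11): for each pair (X_{t-1}, X_t), refit the local
    constant estimators with that pair deleted (Eq. 10), then standardize X_t."""
    x_lag, x_resp = X[:-1], X[1:]
    n_pairs = len(x_resp)

    def nw(x, lag, resp):                              # local constant fit at point(s) x
        w = np.exp(-0.5 * ((np.asarray(x, dtype=float)[..., None] - lag) / b) ** 2)
        return (w * resp).sum(-1) / w.sum(-1)

    out = np.empty(n_pairs)
    for i in range(n_pairs):
        keep = np.arange(n_pairs) != i                 # delete the i-th scatter-plot pair
        lag_k, resp_k = x_lag[keep], x_resp[keep]
        m_del = nw(x_lag[i], lag_k, resp_k)            # delete-t regression estimate at X_{t-1}
        sq_res = (resp_k - nw(lag_k, lag_k, resp_k)) ** 2
        s_del = np.sqrt(nw(x_lag[i], lag_k, sq_res))   # delete-t volatility estimate at X_{t-1}
        out[i] = (x_resp[i] - m_del) / max(float(s_del), s_floor)
    return out
```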
The $h$-step ahead prediction of $X_{n+h}$ with predictive residuals is depicted in Algorithm 2. Although Algorithms 1 and 2 are asymptotically equivalent, Algorithm 2 gives a QPI with better CVR for finite samples; see the simulation comparisons of these two approaches in LABEL:Sec:simulation.
Step 1 | Same as Step 1 of Algorithm 1.
Step 2 | Compute predictive residuals $\{\tilde{\epsilon}_t\}_{t=2}^{n}$ based on Eq. 11. Let $\tilde{F}_\epsilon$ denote the empirical distribution of the centered predictive residuals $\tilde{\epsilon}_t - \bar{\tilde{\epsilon}}$.
Steps 3-4 | Replace $\hat{F}_\epsilon$ by $\tilde{F}_\epsilon$ in Algorithm 1. All the rest stays the same.
2.2 Bootstrap algorithm for PPI
To improve the CVR of a PI, we can try to take the variability of the model estimation into account when we build the PI, i.e., we need to mimic the estimation process in the bootstrap world. Employing this idea results in a Pertinent PI (PPI) as discussed in Section 1; see also wang2021model.
Algorithm 3 outlines the procedure to build a PPI. Although this algorithm is computationally heavier, the advantage is that the PPI gives better CVR than the QPI in practice, i.e., with finite samples; see the examples in LABEL:Sec:simulation.
Step 1 | With data $\{X_1, \ldots, X_n\}$, construct the estimators $\hat{m}_T$ and $\hat{\sigma}_T$ by formula Eq. 6. Furthermore, compute the fitted residuals based on Eq. 7. Denote the empirical distribution of the centered residuals by $\hat{F}_\epsilon$.
Step 2 | Construct the $L_1$- or $L_2$-optimal point prediction $\hat{X}_{n+h}$ using Algorithm 1.
Step 3 | (a) Resample (with replacement) the residuals from $\hat{F}_\epsilon$ to create pseudo-errors $\epsilon^*_2, \ldots, \epsilon^*_n$ and $\epsilon^*_{n+1}, \ldots, \epsilon^*_{n+h}$.
(b) Let $X^*_1 = X_I$, where $I$ is generated as a discrete random variable uniform on the values $\{1, \ldots, n\}$. Then, create bootstrap pseudo-data $X^*_2, \ldots, X^*_n$ in a recursive manner from the formula $$X^*_t = \hat{m}_T(X^*_{t-1}) + \hat{\sigma}_T(X^*_{t-1})\,\epsilon^*_t, \quad t = 2, \ldots, n. \qquad (12)$$
(c) Based on the bootstrap data $X^*_1, \ldots, X^*_n$, re-estimate the regression and variance functions according to Eq. 6 and get $\hat{m}^*_T$ and $\hat{\sigma}^*_T$; we use the same bandwidth $b$ as for the original estimators.
(d) Guided by the idea of forward bootstrap, re-define the latest bootstrap value to match the original, i.e., re-define $X^*_n = X_n$.
(e) With the original estimators $\hat{m}_T$ and $\hat{\sigma}_T$, the bootstrap data $X^*_1, \ldots, X^*_n$, and the pseudo-errors $\epsilon^*_{n+1}, \ldots, \epsilon^*_{n+h}$, use Eq. 12 to generate recursively the future bootstrap data $X^*_{n+1}, \ldots, X^*_{n+h}$.
(f) With the bootstrap data $X^*_1, \ldots, X^*_n$ and the estimators $\hat{m}^*_T$ and $\hat{\sigma}^*_T$, utilize Algorithm 1 to compute the $L_1$- or $L_2$-optimal bootstrap prediction, which is denoted by $\hat{X}^*_{n+h}$; to generate bootstrap innovations, we still use $\hat{F}_\epsilon$.
(g) Determine the bootstrap predictive root: $X^*_{n+h} - \hat{X}^*_{n+h}$.
Step 4 | Repeat Step 3 $B$ times; the bootstrap root replicates are collected in the form of an empirical distribution whose $\gamma$-quantile is denoted by $q_\gamma$. The equal-tailed $(1-\alpha)100$% prediction interval for $X_{n+h}$ centered at $\hat{X}_{n+h}$ is then estimated by $[\hat{X}_{n+h} + q_{\alpha/2},\ \hat{X}_{n+h} + q_{1-\alpha/2}]$.
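The sketch below strings Steps 1-4 of Algorithm 3 together in Python, reusing the nw_estimators and algorithm1 helpers from the earlier sketches; for simplicity it uses a single bandwidth b throughout, whereas the remark that follows discusses over- and under-smoothing choices.

```python
import numpy as np

def algorithm3_ppi(X, b, h, B=500, M=1000, alpha=0.05, rng=None):
    """Pertinent PI for X_{n+h} via bootstrap predictive roots (sketch of Algorithm 3)."""
    rng = np.random.default_rng(rng)
    n = len(X)

    # Step 1: estimators and centered fitted residuals on the original data.
    m_hat, s_hat = nw_estimators(X, b)
    resid = (X[1:] - m_hat(X[:-1])) / s_hat(X[:-1])
    resid = resid - resid.mean()

    # Step 2: L2-optimal point prediction from Algorithm 1.
    x_pred = algorithm1(X, b, h, M=M, rng=rng)[0]

    roots = np.empty(B)
    for j in range(B):                                  # Step 3, repeated B times
        eps = rng.choice(resid, size=n + h, replace=True)
        # (a)-(b): regenerate a bootstrap series of length n from the fitted model.
        xb = np.empty(n)
        xb[0] = rng.choice(X)                           # random starting value from the data
        for t in range(1, n):
            xb[t] = m_hat(xb[t - 1]) + s_hat(xb[t - 1]) * eps[t - 1]
        # (c): re-estimate the model on the bootstrap series (same bandwidth in this sketch).
        m_star, s_star = nw_estimators(xb, b)
        # (d)-(e): restart at the observed X_n and generate the future bootstrap path with the
        # ORIGINAL estimators, which play the role of the truth in the bootstrap world.
        x_future = X[-1]
        for i in range(h):
            x_future = m_hat(x_future) + s_hat(x_future) * eps[n - 1 + i]
        # (f): bootstrap point prediction using the RE-ESTIMATED model (Algorithm 1 logic).
        paths = np.full(M, X[-1], dtype=float)
        for _ in range(h):
            paths = m_star(paths) + s_star(paths) * rng.choice(resid, size=M, replace=True)
        # (g): bootstrap predictive root.
        roots[j] = x_future - paths.mean()

    # Step 4: center the quantiles of the roots at the original point prediction.
    q_lo, q_hi = np.quantile(roots, [alpha / 2, 1 - alpha / 2])
    return x_pred + q_lo, x_pred + q_hi
```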
Remark 2.1 (Bandwidth choices).
In Step 3 (b) of Algorithm 3, we may generate the bootstrap time series using estimators computed with an over-smoothing bandwidth, while keeping the optimal bandwidth for the re-estimation in Step 3 (c), so that the asymptotically non-random bias-type term of the non-parametric estimation can be captured by the forward bootstrap; see the application in franke2002bootstrap. We can also apply an under-smoothing bandwidth (and then use the same bandwidth throughout) to render the bias term negligible. It turns out that both approaches work well for one-step ahead prediction, although applying the over-smoothing bandwidth may be slightly better. However, taking under-smoothing bandwidth(s) is notably better for multi-step ahead prediction. The reason is that the bias term cannot be captured appropriately for multi-step ahead estimation with an over-smoothing bandwidth. On the other hand, with an under-smoothing bandwidth the bias term is negligible; see LABEL:Subsec:PPI for more discussion, and see politisstudentization for a related discussion. The simulation studies in Appendix C explore the differences between these two bandwidth strategies.
As Algorithm 2 was a version of Algorithm 1 using predictive (as opposed to fitted) residuals, we now propose Algorithm 4 that constructs a PPI with predictive residuals.
Step 1 | With data $\{X_1, \ldots, X_n\}$, construct the estimators $\hat{m}_T$ and $\hat{\sigma}_T$ by formula Eq. 6. Furthermore, compute the predictive residuals based on Eq. 11. Denote the empirical distribution of the centered predictive residuals by $\tilde{F}_\epsilon$.
Steps 3-4 | Same as in Algorithm 3, but change the residual distribution from $\hat{F}_\epsilon$ to $\tilde{F}_\epsilon$, and change the application of Algorithm 1 to Algorithm 2.
3 Asymptotic properties
In this section, we provide the theoretical substantiation for our non-parametric bootstrap prediction methods, namely Algorithms 1, 2, 3 and 4. We start by analyzing the $L_1$- and $L_2$-optimal point predictions and the QPI based on Algorithms 1 and 2.
Remark.
Since the effect of leaving out one data pair is asymptotically negligible for large $n$, the delete-$t$ estimators $\hat{m}^{(t)}$ and $\hat{\sigma}^{(t)}$ are asymptotically equal to $\hat{m}$ and $\hat{\sigma}$, respectively. Then, the predictive residual $\tilde{\epsilon}_t$ is asymptotically the same as the fitted residual $\hat{\epsilon}_t$; see Lemma 5.5 of pan2016bootstrap for a formal comparison of these two types of estimators and residuals. Thus, we only give theorems that guarantee the asymptotic properties of point predictions and PIs with fitted residuals; the asymptotic properties of the variants with predictive residuals hold as well.