
Volatility prediction comparison via robust volatility proxies: An empirical deviation perspective

Weichen Wang, Ran An, Ziwei Zhu
Innovation and Information Management, HKU Business School
Department of Statistics, University of Michigan
Address: Innovation and Information Management, HKU Business School, The University of Hong Kong, Hong Kong. E-mail: [email protected]. Address: Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA. E-mail: [email protected], [email protected]. The research was partially supported by NSF grant DMS-2015366.
Abstract

Volatility forecasting is crucial to risk management and portfolio construction. One particular challenge of assessing volatility forecasts is how to construct a robust proxy for the unknown true volatility. In this work, we show that the empirical loss comparison between two volatility predictors hinges on the deviation of the volatility proxy from the true volatility. We then establish non-asymptotic deviation bounds for three robust volatility proxies, two of which are based on clipped data, and the third of which is based on exponentially weighted Huber loss minimization. In particular, in order for the Huber approach to adapt to non-stationary financial returns, we propose to solve a tuning-free weighted Huber loss minimization problem to jointly estimate the volatility and the optimal robustification parameter at each time point. We then inflate this robustification parameter and use it to update the volatility proxy to achieve optimal balance between the bias and variance of the global empirical loss. We also extend this Huber method to construct volatility predictors. Finally, we exploit the proposed robust volatility proxy to compare different volatility predictors on the Bitcoin market data. It turns out that when the sample size is limited, applying the robust volatility proxy gives more consistent and stable evaluation of volatility forecasts.

Keywords: Volatility forecasting, Robust loss function, Huber minimization, Risk management, Crypto market.

1 Introduction

Volatility forecasting is a central task for financial practitioners, who care to understand the risk levels of their financial instruments or portfolios. There have been countless studies on improving volatility modeling for financial time series, including the famous ARCH/GARCH model for better modeling volatility clustering, its many variants, and more general stochastic volatility models (Engle, 1982; Bollerslev, 1986; Baillie et al., 1996; Taylor, 1994), as well as on proposing better volatility predictors under different model settings and objectives (Poon and Granger, 2003; Brailsford and Faff, 1996; Andersen et al., 2005; Brooks and Persand, 2003; Christoffersen and Diebold, 2000). This list of volatility forecasting literature is only illustrative and far from complete, given the large body of research on this topic.

The prediction ideas range from the simplest Exponentially Weighted Moving Average (EWMA) (Taylor, 2004), which is adopted by J. P. Morgan’s RiskMetrics, to more complicated time series and volatility models including GARCH (Brandt and Jones, 2006; Park, 2002), to option-based or macro-based volatility forecasting (Lamoureux and Lastrapes, 1993; Vasilellis and Meade, 1996; Christiansen et al., 2012), and to more advanced machine learning techniques such as the nearest neighbor truncation (Andersen et al., 2012) and Recurrent Neural Networks (RNN) (Guo et al., 2016). Correspondingly, the underlying model assumption ranges from mere smoothness of nearby volatilities, to different versions of GARCH, to the Black-Scholes model (Black and Scholes, 2019) and its complicated extensions. The data distribution assumption can also vary: data may be normally distributed, heavy-tailed with a known distribution, e.g., the t-distribution, or generally non-normal. When data are generally non-normal, researchers have proposed to use quasi maximum likelihood estimation (QMLE) (Bollerslev and Wooldridge, 1992; Charles and Darné, 2019; Carnero et al., 2012) and its robust standard error for inference, but the theoretical results are typically asymptotic. Despite its good theoretical guarantees, industry practitioners seldom apply QMLE and tend to employ the naive approach of truncating the returns at an ad-hoc level and then applying EWMA.

In this work, we consider a model assumption requiring only smoothness of volatilities. For simplicity, we also assume the volatility time series is given a priori, and that, after conditioning on the volatilities, the return innovations are independent. We choose this simple setting for the following reasons. Firstly, our main focus of study is on building effective robust proxies rather than testing volatility models and constructing fancy volatility predictors. Secondly, although we ignore the weak dependency between return innovations (think of ARMA models (Brockwell and Davis, 2009) for weak dependency), the EWMA predictors and proxies can still have strong temporal dependency due to the data overlapping of a rolling window, so our analysis is still nontrivial. Also note that we allow the return time series to be non-stationary. Thirdly, the motivating example for us is volatility forecasting for the crypto market. Charles and Darné (2019) applied several versions of GARCH models characterized by short memory, asymmetric effects, or long-run and short-run movements, and concluded that none of them seems appropriate for modeling Bitcoin returns. Therefore, starting from conditionally independent data without imposing an overly detailed model, e.g., GARCH, may be a good general starting point for the study of robust proxies.

Besides the naive EWMA predictor as our comparison benchmark, we consider a type of robust volatility predictor for when the instrument returns exhibit heavy tails in their distributions. Specifically, we only require the returns to have a finite fourth moment. We consider weighted Huber loss minimization, which turns out to be a nontrivial extension of the equal-weighted Huber loss minimization. To achieve the desired rate of convergence, the optimal Huber truncation level for each sample should also depend on the sample weight. In addition, we apply a tuning-free approach following Wang et al. (2020) to tune the Huber truncation level adaptively and automatically. Unlike QMLE, our results focus on non-asymptotic empirical deviation bounds. Therefore, although the main contribution of the paper is on robust proxy construction, we also claim a separate contribution on applying Huber minimization in the EWMA fashion.

Now, given two volatility predictors, evaluating their performance is often quite challenging due to two things: (1) the selection of loss functions, and (2) the selection of proxies, since obviously we cannot observe the true volatilities. The selection of loss functions has been studied by Patton (2011). In Patton’s insightful paper, he defined a class of robust losses with the ideal property that, for any unbiased proxy, the ranking of two predictors using one of the robust losses will always be consistent in terms of the long-run expectation. This property is desirable because it tells risk managers to select a robust loss and then not worry much about designing proxies. As long as the proxy is unbiased, everything should just work out. Commonly used robust losses include the mean-squared error (MSE) and the quasi-likelihood loss (QL). However, there is one weakness of Patton’s approach which has not been emphasized much in previous literature: the evaluation has to be in long-run expectation. The deviation of the empirical loss, which is what people actually use in practice, from the expected loss may still cause a large variance due to a bad choice of volatility proxy. Put another way, his theory does not tell risk managers how much an empirical loss can differ from its expected counterpart.

In this work, our main message is that besides the selection of a robust loss, the selection of a good proxy also matters for effective comparison of predictors, especially when the sample size $T$ is not large enough. For a single time point, we show that the probability of making a false comparison can be very high. So the natural question is: by averaging the performance comparison over $T$ time points, are we able to get a faithful comparison of two predictors with high probability, so that the empirical loss ranking does reflect the population loss ranking? The answer is that we need robust proxies in order to have this kind of guarantee.

We propose three robust proxies and compare them. The first choice uses the clipped squared return at the single current time $t$ as the proxy. This may be the simplest practical choice of a robust proxy. However, it cannot achieve the desired convergence in terms of empirical risk comparison, due to the large variance of using only a single time point. The second option mimics the EWMA proxy: we now clip and average over multiple time points close to $t$. To find the proper clipping, we first run the EWMA tuning-free Huber loss minimization on local data for each time $t$. This gives a truncation level adaptive to the unknown volatility. The clipping bound is then rescaled to reflect the total sample size. According to the literature on Huber minimization (Catoni, 2012; Fan et al., 2016; Sun et al., 2020), the truncation level needs to scale with the square root of the sample size to balance the bias and variance optimally. Therefore, it is natural to rescale the clipping bound by the square root of the ratio of the total sample size $T$ to the local effective sample size. The third proxy exactly solves the EWMA Huber minimization, again with the rescaled truncation. Compared to the first and second proxies, this gives a further improvement in the deviation bound of the proxy, which now depends on the central kurtosis rather than the absolute kurtosis. We will illustrate the above claims in more detail in later sections.

The Huber loss minimization was proposed by Huber (1964) under Huber’s $\epsilon$-contamination model, and its asymptotic properties were studied in Huber (1973). At that time, the truncation level was set as fixed according to the $95\%$ asymptotic efficiency rule, and “robustness” meant achieving minimax optimality under the $\epsilon$-contamination model (Chen et al., 2018). More recently, Huber’s M-estimator has been revisited in the regression setting under the assumption of general heavy-tailed distributions (Catoni, 2012; Fan et al., 2016). Here “robustness” slightly changes its meaning to achieving a sub-Gaussian non-asymptotic deviation bound under the heavy-tailed data assumption. In this setting, the truncation level grows with the sample size, and the resulting M-estimator is still asymptotically unbiased even when the data distribution is asymmetric. Huber’s estimator fits the goal of robust volatility prediction and robust proxy construction very well, as squared returns indeed have asymmetric distributions. Since Catoni (2012), a new literature revealing deeper understanding of Huber’s M-estimator has sprung up. For example, Sun et al. (2020) proved the necessity of a finite fourth moment for volatility estimation if we hope to achieve a sub-Gaussian type of deviation bound; Wang et al. (2020) proposed the tuning-free Huber procedure; Chen et al. (2018) and Minsker (2018) extended the Huber methodology to robust covariance matrix estimation.

Robustness is indeed an important concern for real-data volatility forecasting. It has been widely observed that financial returns have fat tails. When it comes to crypto markets, e.g., the Bitcoin product (BTC), the issue gets more serious, as crypto traders frequently experience huge jumps in the BTC price. For example, BTC plummeted more than 20% in a single day in March 2020. The lack of government regulation probably leaves the market far from efficient. This poses a stronger need for robust methodology to estimate and forecast volatility in crypto markets. Some recent works include Catania et al. (2018); Trucíos (2019); Charles and Darné (2019).

With the BTC returns, we will compare the non-robust EWMA predictor with the robust Huber predictor, with different decays, and evaluate their performance using the non-robust forward EWMA proxy and the robust forward Huber proxy. Both the predictors and the proxies will be rolled forward and compared at the end of each day. We apply two robust losses, MSE and QL, to evaluate their performance. Interestingly, we will see that when the sample size $T$ is large, our proposed robust proxy is very close to the forward EWMA proxy, and both lead to sensible and similar comparisons. However, when $T$ is small, the non-robust proxy can lead to a higher probability of drawing wrong conclusions, whereas the robust proxy, which automatically adapts to the total sample size and the time-varying volatilities, still works as expected. This matches our theoretical findings and provides new insights into applying robust proxies for practical risk evaluation.

The rest of the paper is organized as follows. In Section 2, we first review the definition of robust loss by Patton (2011) and explain our analyzing strategy for high probability bound of the empirical loss. We bridge the empirical loss and the unconditional expected loss, by the conditional expected loss conditioning on proxies. In Section 3, we propose three robust proxies and prove that they can all achieve the correct ranking with high probability, if measured by the conditional expected loss. However, the proxy based on Huber loss minimization will have the smallest probability of making false comparison, if measured by the empirical loss. In Section 4, we will discuss robust predictors and see why the above claim is true and why comparing robust predictors with non-robust predictors can be a valid thing to do. Simulation studies as well as an interesting case study on BTC volatility forecasting are presented in Section 5. We finally conclude the paper with some discussions in Section 6. All the proofs are relegated to the appendix.

2 Evaluation of volatility forecast

In this section, we first review the key conclusions of Patton (2011) on robust loss functions for volatility forecast comparison. We then use examples to see why we also care about the randomness from proxy deviation beyond picking a robust loss.

2.1 Robust loss functions

Suppose we have a time series of returns $(X_{i})_{-\infty<i<\infty}$ of a financial instrument. Let $\mathcal{F}_{t-1}$ denote the $\sigma$-algebra generated from $(X_{i})_{i\leq t-1}$. Consider a volatility predictor $h_{t}$, computed at time $t$ based on $\mathcal{F}_{t-1}$, that targets $\sigma^{2}_{t}:=\mathrm{var}(X_{t})$. We use a loss function $L(\sigma^{2}_{t},h_{t})$ to gauge the prediction error of $h_{t}$. In practice, we never observe $\sigma^{2}_{t}$; therefore, in order to evaluate the loss function $L(\sigma^{2}_{t},h_{t})$, we have to substitute $\sigma_{t}^{2}$ therein with a proxy $\widehat{\sigma}_{t}^{2}$, which is computed based on $\mathcal{G}_{t}$, the $\sigma$-algebra generated from the future returns $(X_{t},\dots,X_{\infty})$.

Following Patton (2011), to achieve reliable evaluation of volatility forecasts, we wish to have the loss function $L$ satisfy the following three desirable properties:

  • (a)

    Mean-pursuit: $h_{t}^{*}:=\operatorname*{argmin}_{h\in\mathcal{H}}\mathbb{E}[L(\widehat{\sigma}_{t}^{2},h)|\mathcal{F}_{t-1}]=\mathbb{E}[\widehat{\sigma}_{t}^{2}|\mathcal{F}_{t-1}]$. This says that the optimal predictor is exactly the conditional expectation of the proxy.

  • (b)

    Proxy-robust: Given any two predictors $h_{1t}$ and $h_{2t}$ and any unbiased proxy $\widehat{\sigma}_{t}^{2}$, i.e., $\mathbb{E}[\widehat{\sigma}_{t}^{2}|\mathcal{F}_{t-1}]=\sigma_{t}^{2}$, we have $\mathbb{E}\{L(\sigma_{t}^{2},h_{1t})\}\leq\mathbb{E}\{L(\sigma_{t}^{2},h_{2t})\}\iff\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})\}\leq\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})\}$. This means that the forecast ranking is robust to the choice of the proxy.

  • (c)

    Homogeneous: $L$ is a homogeneous loss function of order $k$, i.e., $L(a\sigma^{2},ah)=a^{k}L(\sigma^{2},h)$ for any $a>0$. This ensures that the ranking of two predictors is invariant to the re-scaling of data.

Define the mean squared error (MSE) loss and quasi-likelihood (QL) loss as

\mathrm{MSE}(\sigma^{2},h)=(\sigma^{2}-h)^{2} \quad\text{and}\quad \mathrm{QL}(\sigma^{2},h)=\frac{\sigma^{2}}{h}-\log\bigg(\frac{\sigma^{2}}{h}\bigg)-1   (2.1)

respectively. Here the QL loss can be viewed, up to an affine transformation, as the negative log-likelihood function of $X$ that follows $\mathcal{N}(0,h)$ when we observe that $X^{2}=\sigma^{2}$. Besides, QL is always positive, and a Taylor expansion gives $\mathrm{QL}(\sigma^{2},h)\approx(\sigma^{2}/h-1)^{2}/2$ when $\sigma^{2}/h$ is around $1$. Patton (2011) shows that among many commonly used loss functions, MSE and QL are the only two that satisfy all three properties above. Specifically, Proposition 1 of Patton (2011) says that given that $L$ satisfies property (a) and some regularity conditions, $L$ further satisfies property (b) if and only if $L$ takes the form:

L(\sigma^{2},h)=\tilde{C}(h)+B(\sigma^{2})+C(h)(\sigma^{2}-h),   (2.2)

where $C(h)$ is the derivative function of $\tilde{C}(h)$ and is monotonically decreasing. Proposition 2 in Patton (2011) establishes that MSE is the only proxy-robust loss that depends on $\sigma^{2}-h$ and that QL is the only proxy-robust loss that depends on $\sigma^{2}/h$. Finally, Proposition 4 in Patton (2011) gives the entire family of proxy-robust and homogeneous loss functions, which include QL and MSE (MSE and QL are homogeneous of order 2 and 0, respectively). Given such nice properties of MSE and QL, we mainly use MSE and QL to evaluate and compare volatility forecasts throughout this work.
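To make the two losses concrete, here is a minimal Python sketch of (2.1) together with a numerical check of the homogeneity orders; the function names and test values are ours, for illustration only.

```python
import numpy as np

def mse_loss(sigma2, h):
    """MSE loss from (2.1); proxy-robust and homogeneous of order 2."""
    return (sigma2 - h) ** 2

def ql_loss(sigma2, h):
    """QL loss from (2.1); proxy-robust and homogeneous of order 0."""
    r = sigma2 / h
    return r - np.log(r) - 1.0

# Check the homogeneity orders: rescaling both arguments by a > 0
# multiplies MSE by a^2 and leaves QL unchanged.
s2, h, a = 0.04, 0.05, 3.0
assert np.isclose(mse_loss(a * s2, a * h), a**2 * mse_loss(s2, h))
assert np.isclose(ql_loss(a * s2, a * h), ql_loss(s2, h))
```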

2.2 The empirical deviation perspective

Besides selecting a robust loss as Patton (2011) suggested, one also has to nail down the proxy selection for the prediction loss computation. Patton (2011)’s framework did not separate the randomness of the predictors from that of the proxies, and the proxy-robust property (b) compares two predictors in long-term unconditional expectation, which averages over both sources of randomness. However, it is not clear from Patton (2011), for a given proxy, what the probability is of ending up with a wrong comparison of two predictors. How does the random deviation of a proxy affect the comparison? Can some proxies outperform others by having a smaller probability of making mistakes in finite samples?

In practice, one has to use the empirical risk to approximate the expected risk when evaluating volatility forecasts. This exposes one important issue that property (b) neglects: property (b) concerns only the expected risk and ignores the deviation of the empirical risk from its expectation. Such empirical deviation is further exacerbated by replacing the true volatility with its proxies, jeopardizing accurate evaluation of volatility forecasts. Our strategy of analysis is as follows: we first link the empirical risk to the conditional risk (conditioning on the selected proxy), claiming that they are close with high probability (see formal arguments in Section 4), and then study the relationship between comparisons under the unconditional risk and the conditional risk.

Specifically, we are interested in comparing the accuracy of two series of volatility forecasts $\{h_{1t}\}_{t\in[T]}$ and $\{h_{2t}\}_{t\in[T]}$. For notational convenience, we drop the subscript “$t\in[T]$” when we refer to a time series unless specified otherwise. Define $\eta_{t}=\mathbb{E}\{L(\sigma_{t}^{2},h_{2t})-L(\sigma_{t}^{2},h_{1t})\}>0$. Without loss of generality, suppose that $\{h_{1t}\}$ outperforms $\{h_{2t}\}$ in terms of expected loss, i.e.,

\frac{1}{T}\sum_{t=1}^{T}\Big[\mathbb{E}L(\sigma_{t}^{2},h_{2t})-\mathbb{E}L(\sigma_{t}^{2},h_{1t})\Big]=\frac{1}{T}\sum_{t=1}^{T}\eta_{t}>0.   (2.3)

The empirical loss comparison can be decomposed into the conditional loss comparison plus the difference between the empirical loss and the conditional loss:

\frac{1}{T}\sum_{t=1}^{T}\Big(L(\widehat{\sigma}_{t}^{2},h_{2t})-L(\widehat{\sigma}_{t}^{2},h_{1t})\Big)=\frac{1}{T}\sum_{t=1}^{T}\Big[\Big(\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}\Big)
+\Big(L(\widehat{\sigma}_{t}^{2},h_{2t})-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}\Big)-\Big(L(\widehat{\sigma}_{t}^{2},h_{1t})-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}\Big)\Big].

Therefore, we study the following two probabilities for any $\varepsilon>0$:

\text{I}:=\mathbb{P}\bigg[\frac{1}{T}\sum_{t=1}^{T}\Big(\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}\Big)<\frac{1}{T}\sum_{t=1}^{T}\eta_{t}-\varepsilon\bigg],   (2.4)
\text{II}:=\mathbb{P}\bigg[\bigg|\frac{1}{T}\sum_{t=1}^{T}\Big(L(\widehat{\sigma}_{t}^{2},h_{t})-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{t})|\mathcal{G}_{t}\}\Big)\bigg|>\varepsilon\bigg].   (2.5)

We aim to select stable proxies that make I small, so that the probability of obtaining a false ranking of the empirical risk is small. Meanwhile, with a selected proxy, we hope II can be well controlled for the predictors we care to compare. Note that only the randomness from the proxy matters in I, so we can focus on proxy design by studying this quantity. We then make sure the difference between the empirical risk and the conditional risk is indeed small by studying II. The probability in II is with respect to both the proxy and the predictor. By following this analysis strategy, we separate the randomness of the predictor from that of the proxy, and eventually give results on empirical deviation rather than in expectation.

2.3 False comparison due to proxy randomness

To illustrate this issue, we first focus on a single time point $t$. We compare two volatility forecasts $h_{1t}$ and $h_{2t}$ satisfying $\mathbb{E}\{L(\sigma_{t}^{2},h_{1t})\}<\mathbb{E}\{L(\sigma_{t}^{2},h_{2t})\}$. We are interested in the probability of having a reversed rank of forecast precision between $h_{1t}$ and $h_{2t}$, conditioning on the selected proxy, i.e., $\mathbb{P}_{\mathcal{G}_{t}}\big[\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}>\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}\big]$. Note that this probability is with respect to the randomness of the proxy $\widehat{\sigma}_{t}^{2}$; in the sequel, we show that it may not be small for a general proxy. But if we can select a good proxy that controls this probability well, we can ensure a correct comparison with high probability.

Now consider MSE and QL as the loss functions, so that we can derive explicitly the condition for the empirical risk comparison to be consistent with the expected risk comparison. For simplicity, assume that $\mathcal{F}_{t-1}$ and $\mathcal{G}_{t}$ are independent. Recall that $\eta_{t}=\mathbb{E}\{L(\sigma_{t}^{2},h_{2t})-L(\sigma_{t}^{2},h_{1t})\}>0$. We wish to calculate $\mathbb{P}_{\mathcal{G}_{t}}[\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}<\eta_{t}-\varepsilon_{t}]$ for some $\varepsilon_{t}>|\eta_{t}|$, i.e., the probability that the forecast rank in conditional expectation is opposite to the rank in unconditional expectation. When $L$ is chosen to be MSE, we have

\eta_{t}=\mathbb{E}L(\sigma_{t}^{2},h_{2t})-\mathbb{E}L(\sigma_{t}^{2},h_{1t})=\mathbb{E}(h_{2t}^{2}-h_{1t}^{2})+2\sigma_{t}^{2}\mathbb{E}(h_{1t}-h_{2t})   (2.6)

and

\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}=\mathbb{E}(h_{2t}^{2}-h_{1t}^{2})+2\widehat{\sigma}_{t}^{2}\mathbb{E}(h_{1t}-h_{2t}).

Therefore,

\mathbb{P}_{\mathcal{G}_{t}}[\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}<\eta_{t}-\varepsilon_{t}]
=\mathbb{P}_{\mathcal{G}_{t}}\{2(\widehat{\sigma}_{t}^{2}-\sigma_{t}^{2})\mathbb{E}(h_{1t}-h_{2t})<-\varepsilon_{t}\}
=\mathbb{P}_{\mathcal{G}_{t}}[(\widehat{\sigma}_{t}^{2}/\sigma_{t}^{2}-1)\{\eta_{t}-\mathbb{E}(h_{2t}^{2}-h_{1t}^{2})\}<-\varepsilon_{t}].

For illustration purposes, consider a deterministic scenario where $h_{1t}=\sigma_{t}^{2}$ is the oracle predictor and $h_{2t}=\sigma_{t}^{2}+\sqrt{\eta_{t}}$ (so that (2.6) holds). Then

\mathbb{P}_{\mathcal{G}_{t}}\bigl[\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}<\eta_{t}-\varepsilon_{t}\bigr]=\mathbb{P}_{\mathcal{G}_{t}}\bigg(\frac{\widehat{\sigma}_{t}^{2}}{\sigma_{t}^{2}}-1>\frac{\varepsilon_{t}}{2\sigma_{t}^{2}\sqrt{\eta_{t}}}\bigg).

Similarly, if $h_{2t}=\sigma_{t}^{2}-\sqrt{\eta_{t}}$, we have

\mathbb{P}_{\mathcal{G}_{t}}\bigl[\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}<\eta_{t}-\varepsilon_{t}\bigr]=\mathbb{P}_{\mathcal{G}_{t}}\bigg(\frac{\widehat{\sigma}_{t}^{2}}{\sigma_{t}^{2}}-1<-\frac{\varepsilon_{t}}{2\sigma_{t}^{2}\sqrt{\eta_{t}}}\bigg).

We can see from the two equations above that a large deviation of $\widehat{\sigma}_{t}^{2}$ from $\sigma_{t}^{2}$ gives rise to inconsistency between forecast comparisons based on the empirical risk and the expected risk. When we choose $L$ to be QL, we have

\eta_{t}=\mathbb{E}L(\sigma_{t}^{2},h_{2t})-\mathbb{E}L(\sigma_{t}^{2},h_{1t})=\mathbb{E}(\log h_{2t}-\log h_{1t})+\sigma_{t}^{2}\mathbb{E}\bigg(\frac{1}{h_{2t}}-\frac{1}{h_{1t}}\bigg)

and that

\mathbb{P}_{\mathcal{G}_{t}}\bigl[\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}<\eta_{t}-\varepsilon_{t}\bigr]
=\mathbb{P}_{\mathcal{G}_{t}}\bigg\{(\widehat{\sigma}_{t}^{2}-\sigma_{t}^{2})\mathbb{E}\bigg(\frac{1}{h_{2t}}-\frac{1}{h_{1t}}\bigg)<-\varepsilon_{t}\bigg\}
=\mathbb{P}_{\mathcal{G}_{t}}\big[(\widehat{\sigma}_{t}^{2}/\sigma_{t}^{2}-1)\{\eta_{t}-\mathbb{E}(\log h_{2t}-\log h_{1t})\}<-\varepsilon_{t}\big].

Similarly, we consider a deterministic setup where $h_{1t}=\sigma_{t}^{2}$, and where $h_{2t}=m\sigma_{t}^{2}$ with a misspecified scale. To ensure that $\mathbb{E}L(\sigma_{t}^{2},h_{2t})-\mathbb{E}L(\sigma_{t}^{2},h_{1t})=\eta_{t}$, we have $\eta_{t}=1/m-\log(1/m)-1\approx(1/m-1)^{2}/2$ when $1/m\approx 1$. In this case, we deduce that

\mathbb{P}_{\mathcal{G}_{t}}\{\mathbb{E}(L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t})-\mathbb{E}(L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t})<\eta_{t}-\varepsilon_{t}\}=\mathbb{P}_{\mathcal{G}_{t}}\{(\widehat{\sigma}_{t}^{2}/\sigma_{t}^{2}-1)(\eta_{t}-\log m)<-\varepsilon_{t}\}
=\mathbb{P}_{\mathcal{G}_{t}}\{(\widehat{\sigma}_{t}^{2}/\sigma_{t}^{2}-1)(1/m-1)<-\varepsilon_{t}\}\approx\begin{cases}\mathbb{P}_{\mathcal{G}_{t}}\bigl\{\widehat{\sigma}_{t}^{2}/\sigma_{t}^{2}-1>\frac{\varepsilon_{t}}{\sqrt{2\eta_{t}}}\bigr\}&\text{if } m>1,\\ \mathbb{P}_{\mathcal{G}_{t}}\bigl\{\widehat{\sigma}_{t}^{2}/\sigma_{t}^{2}-1<-\frac{\varepsilon_{t}}{\sqrt{2\eta_{t}}}\bigr\}&\text{if } m\leq 1.\end{cases}

Similarly, we can see that the volatility forecast rank will be flipped once the deviation of $\widehat{\sigma}_{t}^{2}$ from $\sigma_{t}^{2}$ is large.

Note again that in the derivation above, the probability of reversing the expected forecast rank is evaluated at a single time point $t$, which is far from enough to yield a reliable comparison between volatility predictors. The common practice is to compute the empirical average loss of the predictors over time for their performance evaluation. Two natural questions arise: Does the empirical average completely resolve the instability of forecast evaluation due to the deviation of volatility proxies? If not, how should we robustify our volatility proxies to mitigate their empirical deviation?
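To see how often a single-point proxy can flip the ranking, here is a small Monte Carlo sketch of the MSE scenario above, with $h_{1t}=\sigma_{t}^{2}$ the oracle and $h_{2t}=\sigma_{t}^{2}+\sqrt{\eta_{t}}$; the $t$-distributed returns and all parameter values are hypothetical choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, eta = 1.0, 0.05          # true variance; expected-loss gap
h1 = sigma2                      # oracle predictor
h2 = sigma2 + np.sqrt(eta)       # inferior predictor, so that (2.6) holds

# Heavy-tailed returns: t(5) innovations rescaled to variance sigma2.
nu = 5
x = rng.standard_t(nu, size=100_000) * np.sqrt(sigma2 * (nu - 2) / nu)
proxy = x**2                     # single-point proxy: unbiased but noisy

# Frequency with which the conditional MSE ranking is reversed at one t.
flip = np.mean((proxy - h2) ** 2 < (proxy - h1) ** 2)
print(f"P(rank flipped at a single time point) ~ {flip:.2f}")  # far from negligible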

3 Robust volatility proxies

3.1 Problem setup

Our goal in this section is to construct robust volatility proxies $\{\widehat{\sigma}^{2}_{t}\}$ to ensure that $\{h_{1t}\}$ maintains its empirical superiority with high probability; more precisely, that $\mathbb{P}_{\mathcal{G}_{t}}\big[\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}<0\big]$ is small, given $\frac{1}{T}\sum_{t=1}^{T}\eta_{t}>0$. We first present our assumption on the data generating process.

Assumption 1.

Given the true volatility series $\{\sigma_{t}\}$, the instrument returns $\{X_{t}\}$ are independent with $\mathbb{E}X_{t}=0$ and $\mathrm{Var}(X_{t})=\sigma_{t}^{2}$. The central and absolute fourth moments of $X_{t}$, denoted by $\kappa_{t}=\mathbb{E}\{(X_{t}^{2}-\sigma_{t}^{2})^{2}\}$ and $\tilde{\kappa}_{t}=\mathbb{E}X_{t}^{4}$, are both finite.

Now we introduce some quantities that frequently appear in the sequel. At time $t$, define the smoothness parameters

\delta_{0,t}:=\sum_{s=t}^{t+m}w_{s,t}\sigma_{s}^{2}-\sigma_{t}^{2},\qquad \delta_{1,t}:=\sum_{s=t}^{t+m}w_{s,t}^{2}|\sigma_{s}^{2}-\sigma_{t}^{2}|^{2},   (3.1)
\Delta_{0,t}:=\sum_{s=t-m}^{t-1}\nu_{s,t}\sigma_{s}^{2}-\sigma_{t}^{2} \quad\text{ and }\quad \Delta_{1,t}:=\sum_{s=t-m}^{t-1}\nu_{s,t}^{2}|\sigma_{s}^{2}-\sigma_{t}^{2}|^{2},

where $w_{s,t}=\lambda^{s-t}/\sum_{j=t}^{t+m}\lambda^{j-t}$ is the forward exponential-decay weight at time $s$ from time $t$ with rate $\lambda$, and $\nu_{s,t}=\lambda^{t-1-s}/\sum_{j=t-m}^{t-1}\lambda^{t-1-j}$ is the backward exponential-decay weight with rate $\lambda$. These smoothness parameters characterize how fast the distribution of the volatility varies as time evolves, and our theory explicitly derives their impact. As we shall see, our robust volatility proxies yield desirable statistical performance as long as these smoothness parameters are small, meaning that the variation of the volatility distribution is slow. Besides, define the forward and backward effective sample sizes as

n^{\dagger}_{\mathrm{eff}}:=1\Big/\sum_{s=t}^{t+m}w_{s,t}^{2} \quad\text{and}\quad n^{\ddagger}_{\mathrm{eff}}:=1\Big/\sum_{s=t-m}^{t-1}\nu_{s,t}^{2}   (3.2)

respectively, and define the forward and backward exponential-weighted moving average (EWMA) of the central fourth moment as

\kappa^{\dagger}_{t}=\sum_{s=t}^{t+m}w_{s,t}^{2}\kappa_{s}\Big/\sum_{s=t}^{t+m}w_{s,t}^{2} \quad\text{and}\quad \kappa^{\ddagger}_{t}=\sum_{s=t-m}^{t-1}\nu_{s,t}^{2}\kappa_{s}\Big/\sum_{s=t-m}^{t-1}\nu_{s,t}^{2}   (3.3)

respectively. Similarly, we define $\tilde{\kappa}^{\dagger}_{t}$ and $\tilde{\kappa}^{\ddagger}_{t}$ as the forward and backward EWMA of the absolute fourth moment.

Consider a mean-pursuit and proxy-robust loss function that takes the form (2.2):

L(\sigma^{2},h)=\tilde{C}(h)+B(\sigma^{2})+C(h)(\sigma^{2}-h)=:f(h)+B(\sigma^{2})+C(h)\sigma^{2},

where we write $f(h)=\int_{a}^{h}C(x)\,dx-C(h)h$ for some constant $a$. When $C(h)=-2h$ and $f(h)=h^{2}$ ($a=0$), $L$ is MSE. When $C(h)=1/h$ and $f(h)=\log h$ ($a=e^{-1}$), $L$ becomes QL. Under Assumption 1, $h_{it}$ and $\widehat{\sigma}_{t}^{2}$ are independent for $i=1,2$. Therefore, $\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{it})|\mathcal{G}_{t}\}=\mathbb{E}[f(h_{it})]+B(\widehat{\sigma}_{t}^{2})+\mathbb{E}[C(h_{it})]\widehat{\sigma}_{t}^{2}$. Given (2.3), we wish to show that $\{h_{1t}\}$ outperforms $\{h_{2t}\}$ in conditional risk with high probability, i.e., that I is small. Recall that

\text{I}=\mathbb{P}\bigg[\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}<\frac{1}{T}\sum_{t=1}^{T}\eta_{t}-\varepsilon\bigg]
=\mathbb{P}\bigg[\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\{f(h_{2t})-f(h_{1t})\}+\mathbb{E}\{C(h_{2t})-C(h_{1t})\}\widehat{\sigma}_{t}^{2}<\frac{1}{T}\sum_{t=1}^{T}\eta_{t}-\varepsilon\bigg]
=\mathbb{P}\bigg[\frac{1}{T}\sum_{t=1}^{T}(\widehat{\sigma}_{t}^{2}/\sigma_{t}^{2}-1)(\eta_{t}-A_{t})<-\varepsilon\bigg],

where $\varepsilon>0$ is a deviation parameter that may exceed $T^{-1}\sum_{t\in[T]}\eta_{t}$, where $A_{t}:=\mathbb{E}\{f(h_{2t})-f(h_{1t})\}$, and where the last equality follows from the fact that $\eta_{t}=\mathbb{E}[f(h_{2t})-f(h_{1t})]+\mathbb{E}[C(h_{2t})-C(h_{1t})]\sigma_{t}^{2}$.

3.2 Exponentially weighted Huber estimator

We first review the tuning-free adaptive Huber estimator proposed in Wang et al. (2020). Define the Huber loss function $\ell_{\tau}(x)$ with robustification parameter $\tau$ as

\ell_{\tau}(x):=\begin{cases}\tau x-\tau^{2}/2,&\text{if } x>\tau;\\ x^{2}/2,&\text{if } |x|\leq\tau;\\ -\tau x-\tau^{2}/2,&\text{if } x<-\tau.\end{cases}

Suppose we have $n$ independent observations $\{Y_{i}\}_{i\in[n]}$ of $Y$ satisfying $\mathbb{E}Y=\mu$ and $\mathrm{var}(Y)=\sigma^{2}$. The Huber mean estimator $\widehat{\mu}$ is obtained by solving the following optimization problem:

\widehat{\mu}:=\operatorname*{argmin}_{\theta\in\mathbb{R}}\sum_{i=1}^{n}\ell_{\tau}(Y_{i}-\theta).

Fan et al. (2017) show that when $\tau$ is of order $\sigma\sqrt{n}$, $\widehat{\mu}$ achieves the optimal statistical rate with a sub-Gaussian deviation bound:

\mathbb{P}\big(|\widehat{\mu}-\mu|\lesssim\sigma\sqrt{z/n}\big)\geq 1-2e^{-z},\quad\forall z>0.   (3.4)

In practice, $\sigma$ is unknown, and one therefore has to rely on cross-validation (CV) to tune $\tau$, which incurs a loss of sample efficiency. Wang et al. (2020) propose a data-driven principle to estimate $\mu$ and the optimal $\tau$ jointly by iteratively solving the following two equations:

\left\{\begin{aligned}&\frac{1}{n}\sum_{i=1}^{n}\min(|Y_{i}-\theta|,\tau)\,\mathrm{sgn}(Y_{i}-\theta)=0;\\ &\frac{1}{n}\sum_{i=1}^{n}\frac{\min(|Y_{i}-\theta|^{2},\tau^{2})}{\tau^{2}}-\frac{z}{n}=0,\end{aligned}\right.   (3.5)

where $z$ is the same deviation parameter as in (3.4). Specifically, we start with $\widehat{\theta}^{(0)}=\bar{Y}$ and solve the second equation for $\tau^{(1)}$. We then plug $\tau=\tau^{(1)}$ into the first equation to get $\widehat{\theta}^{(1)}$. We repeat these two steps until the algorithm converges and use the final value of $\widehat{\theta}$ as the estimator for $\mu$. Wang et al. (2020) proved that (i) if $\theta=\mu$ in the second equation above, then its solution gives $\tau=\sigma\sqrt{n/z}$ with probability approaching $1$; (ii) if we choose $\tau=\sigma\sqrt{n/z}$ in the first equation above, its solution satisfies (3.4), even when $Y$ is asymmetrically distributed with heavy tails. Note that Wang et al. (2020) call the above procedure tuning-free in the sense that knowledge of $\sigma$ is not needed, though we still have the deviation parameter $z$ to control the exception probability; the paper suggests using $z=\log n$ in practice.
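A minimal Python sketch of this alternating scheme follows; the fixed-point updates and starting values are our own implementation choices, not prescriptions from Wang et al. (2020).

```python
import numpy as np

def huber_mean_tuning_free(y, z=None, n_iter=50, tol=1e-8):
    """Alternating solver for (3.5): jointly estimate the mean theta and the
    robustification parameter tau, with z = log(n) by default."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    z = np.log(n) if z is None else z
    theta = y.mean()                    # theta^(0) = sample mean
    tau = y.std() * np.sqrt(n / z)      # rough starting value for tau
    for _ in range(n_iter):
        # Second equation of (3.5): given theta, solve for tau by fixed point.
        for _ in range(n_iter):
            tau_new = np.sqrt(np.mean(np.minimum((y - theta) ** 2, tau**2)) * n / z)
            if abs(tau_new - tau) < tol:
                break
            tau = tau_new
        # First equation of (3.5): given tau, update theta; the estimating
        # equation says the clipped residuals must average to zero.
        theta_new = theta + np.mean(np.clip(y - theta, -tau, tau))
        if abs(theta_new - theta) < tol:
            return theta_new, tau
        theta = theta_new
    return theta, tau
```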

In the context of volatility forecasting, $\sigma_{t}$ always varies across time. The well-known phenomenon of volatility clustering in financial markets implies that $\sigma_{t}$ typically changes slowly, so that we can borrow data around time $t$ to help estimate $\sigma_{t}^{2}$ with little bias. A common practice in quantitative finance is to exploit an exponentially weighted average of $\{X^{2}_{s}\}_{s=t}^{t+m}$ to estimate $\sigma_{t}^{2}$, thereby discounting the importance of data that are distant from time $t$. To accommodate such exponential-decay weights, we now propose a sample-weighted variant of the Huber estimator for volatility estimation as follows:

\widehat{\mu}:=\operatorname*{argmin}_{\theta\in\mathbb{R}}\sum_{s=t}^{t+m}w_{s,t}\,\ell_{\tau_{t}/w_{s,t}}(X_{s}^{2}-\theta)=:\operatorname*{argmin}_{\theta\in\mathbb{R}}\mathcal{L}_{\tau_{t}}(\theta;\{w_{s,t}\}_{s=t}^{t+m}),   (3.6)

where $\{w_{s,t}\}_{s\in\{t,t+1,\ldots,t+m\}}$ are the sample weights. Note that the robustification parameters for different observations can differ: intuitively, the higher the sample weight, the lower $\tau$ should be, so that we can better guard against heavy-tailed deviations of important data points. More technical justification for this choice of robustification parameters is given after Theorem 1. Correspondingly, to adaptively tune $\tau$, we iteratively solve the following two equations for $\tau_{t}$ and $\theta_{t}$ until convergence:

\left\{\begin{aligned}&\sum_{s=t}^{t+m}w_{s,t}\min\bigg(|X_{s}^{2}-\theta_{t}|,\frac{\tau_{t}}{w_{s,t}}\bigg)\,\mathrm{sgn}(X_{s}^{2}-\theta_{t})=0;\\ &\sum_{s=t}^{t+m}w_{s,t}^{2}\min\bigg(|X_{s}^{2}-\theta_{t}|^{2},\frac{\tau_{t}^{2}}{w_{s,t}^{2}}\bigg)\Big/\tau_{t}^{2}-z=0.\end{aligned}\right.   (3.7)
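The weighted system (3.7) can be solved by the same alternating strategy; below is a sketch, where the update rules and the function name are again our own choices and $\theta$ plays the role of $\theta_{t}$.

```python
import numpy as np

def weighted_huber_vol(x2, w, z, n_iter=100, tol=1e-10):
    """Sketch of the sample-weighted tuning-free solver (3.7): x2 holds the
    squared returns X_s^2, w the forward weights w_{s,t} (summing to one);
    returns (sigma2_hat, tau_t)."""
    x2, w = np.asarray(x2, float), np.asarray(w, float)
    theta = np.sum(w * x2)                       # start at the plain EWMA estimate
    tau = x2.std() * np.sqrt(np.sum(w**2) / z)   # rough start: tau ~ sqrt(kappa/(n_eff z))
    for _ in range(n_iter):
        r = x2 - theta
        # Second equation of (3.7): sum_s min(w_s^2 r_s^2, tau^2) = z * tau^2.
        for _ in range(n_iter):
            tau_new = np.sqrt(np.sum(np.minimum((w * r) ** 2, tau**2)) / z)
            if abs(tau_new - tau) < tol:
                break
            tau = tau_new
        # First equation of (3.7): weighted clipped residuals must sum to zero.
        theta_new = theta + np.sum(np.clip(w * r, -tau, tau))
        if abs(theta_new - theta) < tol:
            return theta_new, tau
        theta = theta_new
    return theta, tau
```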

Our first theorem shows that the solution to the first equation of (3.7) yields a sub-Gaussian estimator of $\sigma_{t}^{2}$, provided that $\tau_{t}$ is well tuned and that the distribution of the volatility series evolves sufficiently slowly.

Theorem 1.

Under Assumption 1, if $\tau_{t}=\sqrt{\frac{\kappa^{\dagger}_{t}}{n^{\dagger}_{\mathrm{eff}}z}}$, we have for $n^{\dagger}_{\mathrm{eff}}\geq 16z$ that

\mathbb{P}\bigg(|\widehat{\sigma}_{t}^{2}-\sigma_{t}^{2}|\leq 4\sqrt{\kappa^{\dagger}_{t}z/n^{\dagger}_{\mathrm{eff}}}\bigg)\geq 1-2e^{-z+|\delta_{0,t}|\sqrt{n^{\dagger}_{\mathrm{eff}}z/\kappa^{\dagger}_{t}}+2\delta_{1,t}n^{\dagger}_{\mathrm{eff}}z/\kappa^{\dagger}_{t}},   (3.8)

where $\widehat{\sigma}_{t}^{2}$ equals the $\theta_{t}$ that solves the first equation of (3.7).

Remark 1.

Given that $\{w_{s,t}\}_{s=t}^{t+m}$ are forward exponential-decay weights, we have that

n^{\dagger}_{\mathrm{eff}}=\frac{(1+\lambda)(1-\lambda^{m+1})}{(1-\lambda)(1+\lambda^{m+1})}=\frac{1+\lambda}{1+\lambda^{m+1}}\sum_{i=0}^{m}\lambda^{i},

which converges to $(1+\lambda)/(1-\lambda)$ as $m\to\infty$, and to $m+1$ as $\lambda\to 1$. Therefore, $n^{\dagger}_{\mathrm{eff}}\to\infty$ requires both $m\to\infty$ and $\lambda\to 1$. As $n^{\dagger}_{\mathrm{eff}}\to\infty$, when $\delta_{0,t}=o(\sqrt{1/n^{\dagger}_{\mathrm{eff}}})$ and $\delta_{1,t}=o(1/n^{\dagger}_{\mathrm{eff}})$, the exception probability is of order $e^{-z+o(z)}$, which converges to $0$ as $z\to\infty$. Therefore, if we choose $z=\log n^{\dagger}_{\mathrm{eff}}$, then $|\widehat{\sigma}_{t}^{2}-\sigma_{t}^{2}|=O_{\mathbb{P}}\{(\log n^{\dagger}_{\mathrm{eff}}/n^{\dagger}_{\mathrm{eff}})^{1/2}\}$ as $n^{\dagger}_{\mathrm{eff}}\to\infty$.
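For intuition, a short sketch computing $n^{\dagger}_{\mathrm{eff}}$ for exponential-decay weights; the parameter values are arbitrary.

```python
import numpy as np

def n_eff_forward(lam, m):
    """Forward effective sample size from (3.2) for exponential-decay weights."""
    w = lam ** np.arange(m + 1)
    w /= w.sum()
    return 1.0 / np.sum(w**2)

print(n_eff_forward(0.94, 250))   # ~32.3, close to (1 + 0.94)/(1 - 0.94)
print(n_eff_forward(0.999, 250))  # ~250, approaching m + 1 = 251
```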

Remark 2.

One crucial step of our proof is to use Bernstein’s inequality to control the derivative of the weighted Huber loss with respect to $\theta$, i.e.,

\mathcal{L}_{\tau_{t}}^{\prime}(\theta;\{w_{s,t}\})=\sum_{s=t}^{t+m}w_{s,t}\min(|X_{s}^{2}-\theta|,\tau_{t}/w_{s,t})\,\mathrm{sgn}(X_{s}^{2}-\theta).

By setting $\tau_{t}/w_{s,t}$ as the robustification parameter for the data at time $s$, we ensure that the corresponding summand in the derivative is bounded by $\tau_{t}$ in absolute value, which allows us to apply Bernstein’s inequality. This justifies, from a technical perspective, our choice of robustification parameters for different sample weights.

Remark 3.

Theorem 1 is in fact not limited to exponential-decay weights $\{w_{s,t}\}_{s=t}^{t+m}$: bound (3.8) applies to any sample weight series.

Our next theorem provides theoretical justification for the second equation of (3.7). It basically says that the solution to that equation gives an appropriate value of $\tau_{t}$.

Theorem 2.

On top of Assumption 1, we further assume that $\delta_{1,t}n^{\dagger}_{\mathrm{eff}}\leq c_{1}\kappa^{\dagger}_{t}$ and $w_{s,t}\asymp 1/m$ for all $t\leq s\leq t+m$. If $\theta_{t}=\sigma_{t}^{2}$ in the second equation of (3.7), then as $n^{\dagger}_{\mathrm{eff}},z\to\infty$ with $z=o(n^{\dagger}_{\mathrm{eff}})$, its solution satisfies $\tau_{t}\asymp\sqrt{\frac{\kappa^{\dagger}_{t}}{n^{\dagger}_{\mathrm{eff}}z}}$ with probability approaching $1$.

Remark 4.

Define the half-life parameter $l:=\log(1/2)/\log(\lambda)$. If we fix $m/l=C$ for a universal constant $C$, which is common practice in volatility forecasting, then we can ensure that $\{w_{s,t}\}_{t\leq s\leq t+m}$ are all of order $1/m$.

3.3 Average deviation of volatility proxies

We are now in a position to evaluate I, which concerns the average deviation of the volatility proxies over all $T$ time points. To illustrate the advantage of the sample-weighted Huber volatility proxy proposed in (3.6), we first introduce and investigate two benchmark volatility proxies that are widely used in practice. We then present our average deviation analysis of the sample-weighted Huber proxy.

The first benchmark volatility proxy, which we denote by $(\widehat{\sigma}_{c})_{t}^{2}$, is simply a clipped squared return:

\left\{\begin{aligned}&(\widehat{\sigma}_{c})^{2}_{t}=\min(X_{t}^{2},c_{t})=\min(X_{t}^{2},\tau_{t}\sqrt{n^{\dagger}_{\mathrm{eff}}T});\\ &\sum_{s=t}^{t+m}w_{s,t}^{2}\min\bigg(X_{s}^{4},\frac{\tau_{t}^{2}}{w_{s,t}^{2}}\bigg)\Big/\tau_{t}^{2}-z=0,\end{aligned}\right.   (3.9)

where $c_{t}$ is the clipping threshold, and $z$ is a deviation parameter similar to that in (3.7). Here $\tau_{t}$ is tuned similarly as in (3.7), except that now the second equation of (3.9) does not depend on $(\widehat{\sigma}_{c})^{2}_{t}$ and thus can be solved separately. Following Theorem 2, we can deduce that $\tau_{t}\asymp\sqrt{\frac{\tilde{\kappa}^{\dagger}_{t}}{n^{\dagger}_{\mathrm{eff}}z}}$ and thus that $c_{t}\asymp\sqrt{\frac{\tilde{\kappa}^{\dagger}_{t}T}{z}}$. The main purpose of choosing such a rate for $c_{t}$ is to balance the bias and variance of the average of $(\widehat{\sigma}_{c})_{t}^{2}$ over $T$ time points. The following theorem develops a non-asymptotic bound for the average relative deviation of $\widehat{\sigma}_{c}^{2}$.

Theorem 3.

Under Assumption 1, if $c_{t}=\sqrt{\tilde{\kappa}^{\dagger}_{t}T/z}$ in (3.9), then for any bounded series $\{q_{t}\}_{t\in[T]}$ such that $\max_{t\in[T]}|q_{t}|\leq Q$ and any $z>0$, we have

\mathbb{P}\bigg(\frac{1}{T}\sum_{t=1}^{T}\{(\widehat{\sigma}_{c})_{t}^{2}/\sigma_{t}^{2}-1\}q_{t}\leq CQ\sqrt{z/T}\bigg)\geq 1-e^{-z},

where $C$ depends on $\max_{t\in[T]}\{(\tilde{\kappa}_{t}/\sigma_{t}^{4})\vee(\tilde{\kappa}^{\dagger}_{t}/\sigma_{t}^{4})\vee(\tilde{\kappa}_{t}/\tilde{\kappa}^{\dagger}_{t})\}$.
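A sketch of how $(\widehat{\sigma}_{c})_{t}^{2}$ in (3.9) might be computed is given below; the fixed-point solver for $\tau_{t}$ mirrors the one sketched for (3.7) and is our own choice.

```python
import numpy as np

def clipped_proxy(x2_future, w, z, T, n_iter=100, tol=1e-10):
    """Sketch of (3.9): x2_future = (X_t^2, ..., X_{t+m}^2), w = forward
    weights w_{s,t}; returns the clipped squared return at time t."""
    x2, w = np.asarray(x2_future, float), np.asarray(w, float)
    tau = np.sqrt(np.sum((w * x2) ** 2) / z)       # starting value
    for _ in range(n_iter):                        # second equation of (3.9)
        tau_new = np.sqrt(np.sum(np.minimum((w * x2) ** 2, tau**2)) / z)
        if abs(tau_new - tau) < tol:
            break
        tau = tau_new
    n_eff = 1.0 / np.sum(w**2)
    c_t = tau * np.sqrt(n_eff * T)                 # clipping threshold
    return min(x2[0], c_t)
```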

The second benchmark volatility proxy, which we denote by $(\widehat{\sigma}_{e})^{2}_{t}$, is defined as

\left\{\begin{aligned}&(\widehat{\sigma}_{e})_{t}^{2}=\sum_{s=t}^{t+m}w_{s,t}\min\bigg(X_{s}^{2},\frac{c_{t}}{w_{s,t}}\bigg)=\sum_{s=t}^{t+m}\min\big(w_{s,t}X_{s}^{2},\tau_{t}\sqrt{T/n^{\dagger}_{\mathrm{eff}}}\big),\\ &\sum_{s=t}^{t+m}w_{s,t}^{2}\min\bigg(X_{s}^{4},\frac{\tau_{t}^{2}}{w_{s,t}^{2}}\bigg)\Big/\tau_{t}^{2}-z=0.\end{aligned}\right.   (3.10)

The second equation of (3.10) is the same as that of (3.9). The only difference between $\widehat{\sigma}^{2}_{e}$ and $\widehat{\sigma}^{2}_{c}$ is that $\widehat{\sigma}_{e}^{2}$ exploits not just a single time point but multiple data points in the near future to construct the volatility proxy. Accordingly, the clipping threshold is updated to $\tau_{t}\sqrt{T/n^{\dagger}_{\mathrm{eff}}}$. The following theorem characterizes the average relative deviation of $\widehat{\sigma}^{2}_{e}$.

Theorem 4.

Under Assumption 1, for any $(z,c)>0$ satisfying

  1. $\max_{t\in[T]}\big\{2|\delta_{0,t}|/\sqrt{\tilde{\kappa}^{\dagger}_{t}}+4\delta_{1,t}n^{\dagger}_{\mathrm{eff}}/(\tilde{\kappa}^{\dagger}_{t}\sqrt{T})\big\}\leq c$;

  2. $n^{\dagger}_{\mathrm{eff}}\geq 2c^{-1}\{1+(\log 2T)/z\}\sqrt{T}$;

  3. $\sqrt{T}\geq 16cz$ and $T\geq 16z$;

and any bounded series $\{q_{t}\}_{t\in[T]}$ such that $\max_{t\in[T]}|q_{t}|\leq Q$, letting $c_{t}=\sqrt{\tilde{\kappa}^{\dagger}_{t}T/(n^{\dagger 2}_{\mathrm{eff}}z)}$, we have

\mathbb{P}\bigg[\frac{1}{T}\sum_{t=1}^{T}\{(\widehat{\sigma}_{e})_{t}^{2}/\sigma_{t}^{2}-1\}q_{t}\leq CQ\sqrt{\frac{z}{T}}+\frac{1}{T}\sum_{t=1}^{T}\frac{\delta_{0,t}q_{t}}{\sigma_{t}^{2}}\bigg]\geq 1-2e^{-z},

where $C$ depends on $\max_{t\in[T],u\in[m]}\{\tilde{\kappa}_{t+u}/\sigma_{t}^{4}\}$.

Remark 5.

Let $z=\log T$. To achieve the optimal rate of $\sqrt{\log T/T}$, Theorem 4 requires that $\frac{1}{T}\sum_{t=1}^{T}\frac{\delta_{0,t}q_{t}}{\sigma_{t}^{2}}$ be of order $\sqrt{\log T/T}$. We also require $n^{\dagger}_{\mathrm{eff}}$ to be at least of order $\sqrt{T}$.

Remark 6.

One technical challenge in proving Theorem 4 lies in the overlap of the squared returns used to construct neighboring proxies, which induces temporal dependence across $\{(\widehat{\sigma}_{e})^{2}_{t}\}_{t\in[T]}$. To resolve this issue, we apply a more sophisticated variant of Bernstein’s inequality for time series data (Zhang, 2021); see Lemma 1 in the appendix.
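Before moving on to the Huber proxy, here is a sketch of the computation of $(\widehat{\sigma}_{e})_{t}^{2}$ in (3.10), again with our own fixed-point solver for $\tau_{t}$.

```python
import numpy as np

def ewma_clipped_proxy(x2_future, w, z, T, n_iter=100, tol=1e-10):
    """Sketch of (3.10): same tau equation as (3.9), but the proxy now
    averages the clipped weighted squared returns over s = t, ..., t+m."""
    x2, w = np.asarray(x2_future, float), np.asarray(w, float)
    tau = np.sqrt(np.sum((w * x2) ** 2) / z)
    for _ in range(n_iter):
        tau_new = np.sqrt(np.sum(np.minimum((w * x2) ** 2, tau**2)) / z)
        if abs(tau_new - tau) < tol:
            break
        tau = tau_new
    n_eff = 1.0 / np.sum(w**2)
    cap = tau * np.sqrt(T / n_eff)                 # per-summand clipping level
    return np.sum(np.minimum(w * x2, cap))
```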

We now move on to the Huber volatility proxy. At time $t$, denote the solution to (3.7) by $(\widehat{\theta}_{t},\widehat{\tau}_{t})$. Note that $\widehat{\tau}_{t}$ is tuned based on just $m$ data points. Given that our focus is now on the average deviation of the volatility proxies over $T$ data points, we need to raise the robustification parameter to reduce the bias of the Huber proxy and rebalance the bias and variance of the average deviation. After all, averaging over a large $T$ mitigates the impact of possible tail events, so that we can relax the thresholding effect of the Huber loss. Specifically, let $c_{t}=\widehat{\tau}_{t}\sqrt{T/n^{\dagger}_{\mathrm{eff}}}$, which is of order $\sqrt{\kappa^{\dagger}_{t}T/(n^{\dagger 2}_{\mathrm{eff}}z)}$ according to Theorem 2. Then we substitute $\tau_{t}=c_{t}$ into the first equation of (3.7) and solve for $\sigma_{t}^{2}$ therein to obtain the adjusted proxy; that is to say, the final $(\widehat{\sigma}_{H})_{t}^{2}$ satisfies the following:

\sum_{s=t}^{t+m}w_{s,t}\min\bigg(|X_{s}^{2}-(\widehat{\sigma}_{H})_{t}^{2}|,\frac{c_{t}}{w_{s,t}}\bigg)\,\mathrm{sgn}\big(X_{s}^{2}-(\widehat{\sigma}_{H})_{t}^{2}\big)=0.   (3.11)
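Operationally, one can first run the local tuning-free fit and then re-solve the first equation with the inflated threshold; a sketch, reusing the hypothetical weighted_huber_vol from the sketch after (3.7):

```python
import numpy as np

def huber_proxy(x2_future, w, z, T, n_iter=100, tol=1e-10):
    """Sketch of (3.11): fit (theta_t, tau_t) locally via (3.7), inflate the
    threshold to c_t = tau_t * sqrt(T / n_eff), then re-solve the first
    equation of (3.7) with tau_t replaced by c_t."""
    x2, w = np.asarray(x2_future, float), np.asarray(w, float)
    theta, tau = weighted_huber_vol(x2, w, z)      # local tuning-free fit
    n_eff = 1.0 / np.sum(w**2)
    c_t = tau * np.sqrt(T / n_eff)                 # inflated robustification level
    for _ in range(n_iter):                        # re-solve with c_t fixed
        step = np.sum(np.clip(w * (x2 - theta), -c_t, c_t))
        if abs(step) < tol:
            break
        theta += step
    return theta
```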

The inflation factor $\sqrt{T/n^{\dagger}_{\mathrm{eff}}}$ in $c_{t}$ implies that the larger the sample size, the closer the corresponding Huber loss is to the squared loss. This justifies the use of the vanilla EWMA proxy, the most common practice in the financial industry, when the total evaluation period is long. However, when $T$ is not sufficiently large, the Huber proxy yields a more robust estimate of the true volatility. The following theorem characterizes the average relative deviation of the Huber proxies.

Theorem 5.

Under Assumption 1, for any $(c,z)>0$ satisfying

  1. $\max_{t\in[T]}\big\{2|\delta_{0,t}|/\sqrt{\kappa^{\dagger}_{t}}+4\delta_{1,t}n^{\dagger}_{\mathrm{eff}}/(\kappa^{\dagger}_{t}\sqrt{T})\big\}\leq c$;

  2. $n^{\dagger}_{\mathrm{eff}}\geq 2c^{-1}\{1+(\log 2T)/z\}\sqrt{T}$;

  3. $\sqrt{T}\geq 16cz$ and $T\geq 16z$;

  4. for any $t\in[T]$, $\mathcal{L}_{c_{t}}(\theta;\{w_{s,t}\}_{s=t}^{t+m})$ is $\alpha$-strongly convex for $\theta\in\big(\sigma_{t}^{2}-(2c+2)\sqrt{\kappa_{t}^{\dagger}z},\,\sigma_{t}^{2}+(2c+2)\sqrt{\kappa_{t}^{\dagger}z}\big)$;

and any bounded series $\{q_{t}\}_{t\in[T]}$ such that $\max_{t\in[T]}|q_{t}|\leq Q$, letting $c_{t}=\sqrt{\kappa^{\dagger}_{t}T/(n^{\dagger 2}_{\mathrm{eff}}z)}$, we have

\mathbb{P}\bigg[\frac{1}{T}\sum_{t=1}^{T}\{(\widehat{\sigma}_{H})_{t}^{2}/\sigma_{t}^{2}-1\}q_{t}\leq CQ\sqrt{\frac{z}{T}}+\frac{1}{T}\sum_{t=1}^{T}\frac{(\delta_{0,t}+4\delta_{1,t})q_{t}}{\sigma_{t}^{2}}\bigg]\geq 1-2e^{-z},

where $C$ depends on $\max_{t\in[T],u\in[m]}\{\kappa_{t+u}/\sigma_{t}^{4}\}$ and $\alpha$.

Remark 7.

Compared with the previous two benchmark proxies, the main advantage of the Huber volatility proxy is that its average deviation error depends on the central fourth moment of the returns instead of the absolute one.

Remark 8.

$\alpha$-strong convexity is a standard assumption that can be verified for the Huber loss. For example, Proposition B.1 of Chen and Zhou (2020) shows that the equally weighted Huber loss $\mathcal{L}_{\tau_{t}}(\theta;\{1/(m+1)\}_{s=t}^{t+m})$ enjoys strong convexity with $\alpha=1/4$ in the region $(\sigma_{t}^{2}-r,\sigma_{t}^{2}+r)$, where $r\asymp\tau$, with probability at least $1-e^{-t}$ when $m\geq C(1+t)$. Such strong convexity paves the way for applying Lemma 1 to the Huber proxies, so that we can establish their Bernstein-type concentration; please refer to Lemma 2 in the appendix for details.

Remark 9.

To achieve the optimal $\sqrt{\log T/T}$ rate of convergence when we choose $z\asymp\log T$, we additionally need $\frac{1}{T}\sum_{t=1}^{T}\frac{(\delta_{0,t}+4\delta_{1,t})q_{t}}{\sigma_{t}^{2}}$ to be of a smaller order, which in practice requires a certain smoothness of the volatility.

4 Robust predictors

In this section, we further take into account the randomness from the predictors and study how to bound II, the difference between the empirical risk and the conditional risk.

4.1 Robust volatility predictor

We essentially follow (3.6), the sample-weighted Huber mean estimator, to construct robust volatility predictors. The only difference is that now we cannot touch any data beyond time $t$; we can only look backward at time $t$. Consider the following volatility predictor based on the past $m$ data points with backward exponential-decay weights:

h_{t}:=\operatorname*{argmin}_{\theta\in\mathbb{R}}\sum_{s=t-m}^{t-1}\nu_{s,t}\,\ell_{\tau_{t}/\nu_{s,t}}(X_{s}^{2}-\theta)=:\operatorname*{argmin}_{\theta\in\mathbb{R}}\mathcal{L}_{\tau_{t}}(\theta;\{\nu_{s,t}\}_{s=t-m}^{t-1}).   (4.1)

Similarly to (3.7), we iteratively solve the following two equations for $h_{t}$ and $\tau_{t}$ simultaneously:

\left\{\begin{aligned}&\sum_{s=t-m}^{t-1}\nu_{s,t}\min\bigg(|X_{s}^{2}-h_{t}|,\frac{\tau_{t}}{\nu_{s,t}}\bigg)\,\mathrm{sgn}(X_{s}^{2}-h_{t})=0;\\ &\sum_{s=t-m}^{t-1}\nu_{s,t}^{2}\min\bigg(|X_{s}^{2}-h_{t}|^{2},\frac{\tau_{t}^{2}}{\nu_{s,t}^{2}}\bigg)\Big/\tau_{t}^{2}-z=0,\end{aligned}\right.   (4.2)

where we recall that $\nu_{s,t}=\lambda^{t-1-s}/\sum_{j=t-m}^{t-1}\lambda^{t-1-j}$. Theorem 1 showed concentration of $\widehat{\sigma}_{t}^{2}$ around $\sigma_{t}^{2}$, i.e., for the MSE loss. More generally, we hope to give results on $\mathbb{P}(L(\sigma_{t}^{2},h_{t})-\min_{h}L(\sigma_{t}^{2},h)>\varepsilon)$. According to property (a) in Section 2.1, $\min_{h}L(\sigma_{t}^{2},h)=f(\sigma_{t}^{2})+B(\sigma_{t}^{2})+C(\sigma_{t}^{2})\sigma_{t}^{2}$. Therefore, we hope to bound

\mathbb{P}(|L(\sigma_{t}^{2},h_{t})-L(\sigma_{t}^{2},\sigma_{t}^{2})|>\varepsilon).

This is easy to control if we assume smoothness of the loss function. We give the following theorem.

Theorem 6.

Assume there exist $b,B$ such that $\sup_{|h-\sigma_{t}^{2}|\leq b}|\partial L(\sigma_{t}^{2},h)/\partial h|\leq B$. If $\tau_{t}=\sqrt{\frac{\kappa^{\ddagger}_{t}}{n^{\ddagger}_{\mathrm{eff}}z}}$, then under Assumption 1, for $n^{\ddagger}_{\mathrm{eff}}\geq 16(1\vee\kappa^{\ddagger}_{t}/b^{2})z$, we have

\mathbb{P}\bigg(L(\sigma_{t}^{2},h_{t})-\min_{h}L(\sigma_{t}^{2},h)\leq 4B\sqrt{\kappa^{\ddagger}_{t}z/n^{\ddagger}_{\mathrm{eff}}}\bigg)\geq 1-2e^{-z+|\Delta_{0,t}|\sqrt{n^{\ddagger}_{\mathrm{eff}}z/\kappa^{\ddagger}_{t}}+2\Delta_{1,t}n^{\ddagger}_{\mathrm{eff}}z/\kappa^{\ddagger}_{t}},

where $h_{t}$ is the solution to the first equation of (4.2).

Remark 10.

Here, for notational simplicity, we used the same estimation horizon $m$ and the same exponential decay $\lambda$ for constructing both predictors and proxies. In practice, of course, they need not be the same. In our real data example, we will use a slower decay for constructing predictors and a faster decay for proxies, which is common practice for real financial data, where we typically use more data for constructing predictors and less data for constructing proxies. We will stick to $m$ equal to twice the half-life, so that, equivalently, we use a longer window for predictors and a shorter window for proxies.

Remark 11.

In addition, we also simplify the theoretical results by assuming the same $z$ for constructing proxies and predictors. In practice, we need not use the same $z$ to control the tail probability for predictors and proxies. For predictors, as we focus on local performance, it is more natural to use $z=\log n^{\ddagger}_{\mathrm{eff}}$ following Wang et al. (2020). For proxies, as we focus on overall evaluation, for a given $T$ we can take $z=\log T$. Sometimes we want to monitor the risk evaluation as $T$ grows; then a changing $z$ may not be a good choice, since we do not want to re-solve the local weighted tuning-free Huber problem every time $T$ changes. Therefore, we recommend using $z=C\log n^{\dagger}_{\mathrm{eff}}$ for a slightly larger $C$, e.g., $z=2\log n^{\dagger}_{\mathrm{eff}}$ as in our real data analysis.
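Putting the pieces together, a hypothetical rolling setup might look as follows, reusing the weighted_huber_vol and huber_proxy sketches from Section 3; the decay rates, window length, and $z$ choices follow Remarks 10 and 11, but the specific numbers are ours.

```python
import numpy as np

def ewma_weights(lam, m, backward=True):
    """Exponential-decay weights: backward gives nu_{s,t} for s = t-m..t-1,
    forward gives w_{s,t} for s = t..t+m."""
    expo = np.arange(m - 1, -1, -1) if backward else np.arange(m + 1)
    w = lam ** expo
    return w / w.sum()

# Slower decay (longer effective window) for the predictor, faster decay for
# the proxy; z chosen per Remarks 10-11. All numbers below are illustrative.
lam_pred, lam_proxy, m, T = 0.97, 0.94, 120, 100
nu = ewma_weights(lam_pred, m, backward=True)      # length m, for s = t-m..t-1
w = ewma_weights(lam_proxy, m, backward=False)     # length m+1, for s = t..t+m
rng = np.random.default_rng(1)
x = rng.standard_t(5, size=500) * 0.01             # toy heavy-tailed returns
t = 300
h_t, _ = weighted_huber_vol(x[t - m:t] ** 2, nu, z=np.log(1 / np.sum(nu**2)))
proxy_t = huber_proxy(x[t:t + m + 1] ** 2, w, z=2 * np.log(1 / np.sum(w**2)), T=T)
```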

4.2 Concentration of robust and non-robust predictors

Recall that a robust loss satisfies $\mathbb{E}\{L(\widehat{\sigma}_t^2,h_{it})\,|\,\mathcal{G}_t\}=\mathbb{E}[f(h_{it})]+B(\widehat{\sigma}_t^2)+\mathbb{E}[C(h_{it})]\widehat{\sigma}_t^2$. So we further bound II as follows:

\[\begin{aligned}
\text{II}&=\mathbb{P}\bigg[\bigg|\frac{1}{T}\sum_{t=1}^T\{f(h_t)-\mathbb{E}f(h_t)\}+\frac{1}{T}\sum_{t=1}^T\{C(h_t)-\mathbb{E}C(h_t)\}\widehat{\sigma}_t^2\bigg|>\varepsilon\bigg]\\
&\leq\mathbb{P}\bigg[\bigg|\frac{1}{T}\sum_{t=1}^T f(h_t)-\mathbb{E}f(h_t)+\{C(h_t)-\mathbb{E}C(h_t)\}\sigma_t^2\bigg|+\bigg|\frac{1}{T}\sum_{t=1}^T(C(h_t)-\mathbb{E}C(h_t))(\widehat{\sigma}_t^2-\sigma_t^2)\bigg|>\varepsilon\bigg]\\
&\leq\underbrace{\mathbb{P}\bigg[\bigg|\frac{1}{T}\sum_{t=1}^T(L(\sigma_t^2,h_t)-\mathbb{E}[L(\sigma_t^2,h_t)])\bigg|>\frac{\varepsilon}{2}\bigg]}_{\Delta_A}+\underbrace{\mathbb{P}\bigg[\bigg|\frac{1}{T}\sum_{t=1}^T(C(h_t)-\mathbb{E}[C(h_t)])(\widehat{\sigma}_t^2-\sigma_t^2)\bigg|>\frac{\varepsilon}{2}\bigg]}_{\Delta_B}\,.
\end{aligned}\]

We wish to show that both $\Delta_A$ and $\Delta_B$ achieve the desired rate $\sqrt{\log T/T}$ for a broad class of predictors and loss functions. When $h_t$ is the proposed robust predictor of Section 4.1, we obtain sharp rates for $\Delta_A$ and $\Delta_B$ as expected. Moreover, for the vanilla (non-robust) EWMA predictor $h_t=\sum_{s=t-m}^{t-1}\nu_{s,t}X_s^2$, we also obtain the same sharp rates for Lipschitz robust losses. A third option is to truncate the predictor, i.e., $h_t=(\sum_{s=t-m}^{t-1}\nu_{s,t}X_s^2)\wedge M$ for some large constant $M$, in which case we can control the two terms for general robust losses. The bottom line is that, for non-robust predictors, we must keep the loss from growing out of control, either by shrinking the predictor's effect on the loss (bounded Lipschitz losses) or by clipping the predictor directly.

The interesting observation is that bounding II, the difference between the empirical risk and the conditional risk, requires only minimal assumptions: most reasonable predictors, say of the M-estimator form, satisfy the concentration bound with a proper choice of robust proxy, although we do require the loss not to become too wild (see Theorem 7 for details). Technically, the concentration here only involves controlling variance; it does not involve the bias between $\mathbb{E}[L(\sigma_t^2,h_t)]$ and $L(\sigma_t^2,\sigma_t^2)$, or between $\mathbb{E}[C(h_t)]$ and $C(h_t)$. There is thus no need to carefully choose the truncation threshold to balance variance and bias optimally.

Theorem 7.

For any $t\in[T]$, suppose ${\cal L}_{\tau_t}(\theta;\{\nu_{s,t}\}_{s=t-m}^{t-1})$ is $\alpha$-strongly convex for $\theta\in(\sigma_t^2/2,3\sigma_t^2/2)$. Choose $\widehat{\sigma}_t=(\widehat{\sigma}_e)_t$ (or $(\widehat{\sigma}_H)_t$) with robustification parameter $c_t$ specified in Theorem 4 (or Theorem 5). Under the assumptions of Theorem 4 (or Theorem 5), the bound

\[\mathbb{P}\bigg(\bigg|\frac{1}{T}\sum_{t=1}^T L(\widehat{\sigma}_t^2,h_t)-\frac{1}{T}\sum_{t=1}^T\Big(\mathbb{E}f(h_t)+B(\widehat{\sigma}_t^2)+\mathbb{E}C(h_t)\widehat{\sigma}_t^2\Big)\bigg|<C'\sqrt{z/T}\bigg)\geq 1-(2T+5)e^{-z/2}\]

holds for all the following three cases:

  1. For the robust predictors proposed in Section 4.1, assume $\sup_{|h-\sigma_t^2|\leq b}|\partial L(\sigma_t^2,h)/\partial h|\leq B_t(b)$. If $|\Delta_{0,t}|\sqrt{n^{\ddagger}_{\rm eff}/\kappa^{\ddagger}_t}+2\Delta_{1,t}n^{\ddagger}_{\rm eff}/\kappa^{\ddagger}_t<1/2$ and $n^{\ddagger}_{\rm eff}\geq 64z\max_t\kappa^{\ddagger}_t/\sigma_t^4$, the above bound holds with $C'$ depending on $\max_t B_t(\sigma_t^2/2)$ and $C$ in Theorem 4 (or Theorem 5).

  2. For the clipped vanilla non-robust exponentially weighted moving average predictors $h_t=(\sum_{s=t-m}^{t-1}\nu_{s,t}X_s^2)\wedge M$, assume $\sup_{\sigma_t^2/2\leq h\leq b}|\partial L(\sigma_t^2,h)/\partial h|\leq B_t(b)$. If $|\Delta_{0,t}|\sqrt{n^{\ddagger}_{\rm eff}/\tilde{\kappa}^{\ddagger}_t}+2\Delta_{1,t}n^{\ddagger}_{\rm eff}/\tilde{\kappa}^{\ddagger}_t<1/2$ and $n^{\ddagger}_{\rm eff}\geq 64z\max_t\tilde{\kappa}^{\ddagger}_t/\sigma_t^4$, the above bound holds with $C'$ depending on $\max_t B_t(M)$ and $C$ in Theorem 4 (or Theorem 5).

  3. For the vanilla non-robust exponentially weighted moving average predictors $h_t=\sum_{s=t-m}^{t-1}\nu_{s,t}X_s^2$, assume $|\partial L(\sigma_t^2,h)/\partial h|\leq B_0$ and $|L(\sigma_t^2,h)|\leq M_0$ for all $\sigma_t^2,h$. If $|\Delta_{0,t}|\sqrt{n^{\ddagger}_{\rm eff}/\tilde{\kappa}^{\ddagger}_t}+2\Delta_{1,t}n^{\ddagger}_{\rm eff}/\tilde{\kappa}^{\ddagger}_t<1/2$ and $n^{\ddagger}_{\rm eff}\geq 64z\max_t\tilde{\kappa}^{\ddagger}_t/\sigma_t^4$, the above bound holds with $C'$ depending on $B_0,M_0$ and $C$ in Theorem 4 (or Theorem 5).

Remark 12.

The proof extends easily to more general robust or non-robust predictors of the M-estimator form. Theorem 7 tells us that, provided we employ proper robust proxies, it is indeed valid to compare robust predictors whose truncation is optimized for a single time point (rather than adjusted as in constructing proxies) against non-robust predictors (either with a rough overall truncation when using a general loss, or without any truncation when using a truncated loss).

Remark 13.

Although the first proxy achieves the optimal rate of convergence for comparing the average conditional loss in Theorem 3, we did not manage to show that it is valid for comparing the average empirical loss in Theorem 7. The reason is that single-time clipping has no concentration guarantee like Theorem 1 at a single time point, and therefore cannot ensure that $|(\widehat{\sigma}_c)_t^2-\sigma_t^2|$ is bounded with high probability for all $t$, which is needed to ensure the sub-exponential tail in the Bernstein inequality does not dominate the sub-Gaussian tail. Taking into account the central fourth moment versus the absolute fourth moment (see Remark 7), we recommend the third proxy as the best practical choice among our three proposals.

5 Numerical study

In this section, we first verify through simulations the advantage of the Huber mean estimator over the truncated mean for estimating the variance. As illustrated by Theorem 5, the statistical error of the Huber mean estimator depends on the central moment, while that of the truncated mean depends on the absolute moment. We then apply the proposed robust proxies to volatility forecasting comparison, using data from the cryptocurrency market. Specifically, we focus on the returns of Bitcoin (BTC) quoted in Tether (USDT), a stablecoin pegged to the US Dollar, over the years 2019 and 2020, which witnessed dramatic volatility of Bitcoin.

5.1 Simulations

We first examine numerically the finite-sample performance of the adaptive Huber estimator (Wang et al., 2020) for variance estimation; that is, we solve (3.5) iteratively for $\theta$ given the data and $z$ until convergence. We first draw an independent sample $Y_1,\dots,Y_n$ of $Y$ from a heavy-tailed distribution. We investigate the following two distributions:

  1. Log-normal distribution $\mathrm{LN}(0,1)$, that is, $\log Y\sim{\cal N}(0,1)$.

  2. Student's t distribution with $\mathrm{df}=3$ degrees of freedom.

Given that $\sigma^2:=\operatorname{Var}(Y)=\mathbb{E}(Y^2)-(\mathbb{E}Y)^2$, we estimate $\mathbb{E}(Y^2)$ and $\mathbb{E}Y$ separately and plug these mean estimators into the variance formula to estimate $\sigma^2$. Besides the Huber mean estimator, we investigate two alternative mean estimators as benchmarks: (a) the naive sample mean; (b) the sample mean of data truncated at their upper and lower $\alpha$-percentiles. We use MSE and QL (see (2.1) for the definitions) to assess the accuracy of variance estimation. In our simulation, we set $n=100$ and evaluate the three methods in 2000 independent Monte Carlo experiments.
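As an illustration of this design, here is a minimal sketch (our own; huber_mean is the equal-weight special case of the solver from Section 4.1, and we read "truncated" as clipping at the empirical percentiles) of the log-normal experiment:

    import numpy as np
    from scipy.optimize import brentq

    def huber_mean(y, z, n_iter=20):
        """Tuning-free Huber mean: jointly solve for the location theta
        and truncation level tau, with equal weights 1/n (cf. (3.5))."""
        n = len(y)
        theta = y.mean()
        tau = y.std() / np.sqrt(n * z) + 1e-12
        for _ in range(n_iter):
            theta = brentq(
                lambda th: np.sum(np.minimum(np.abs(y - th) / n, tau)
                                  * np.sign(y - th)),
                y.min() - 1.0, y.max() + 1.0)
            r2 = (y - theta) ** 2 / n**2
            tau = brentq(lambda t: np.sum(np.minimum(r2, t * t)) / (t * t) - z,
                         1e-12, np.sqrt(r2.sum() / z) + 1e-12)
        return theta

    def truncated_mean(y, alpha):
        """Sample mean after clipping at the lower/upper alpha-percentiles."""
        lo, hi = np.quantile(y, [alpha, 1.0 - alpha])
        return float(np.mean(np.clip(y, lo, hi)))

    rng = np.random.default_rng(0)
    n, n_mc, z, alpha = 100, 2000, 1.5, 0.05
    true_var = np.exp(1.0) * (np.exp(1.0) - 1.0)   # Var(Y) for log Y ~ N(0,1)
    se_huber, se_trunc = [], []
    for _ in range(n_mc):
        y = rng.lognormal(0.0, 1.0, n)
        v_h = huber_mean(y**2, z) - huber_mean(y, z) ** 2
        v_t = truncated_mean(y**2, alpha) - truncated_mean(y, alpha) ** 2
        se_huber.append((v_h - true_var) ** 2)
        se_trunc.append((v_t - true_var) ** 2)
    print("MSE  Huber: %.3f  truncated: %.3f"
          % (np.mean(se_huber), np.mean(se_trunc)))

Sweeping $z$ and $\alpha$ over grids and recording the resulting MSE traces out the smile curves reported below.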

Figure 1: MSE comparisons between the truncated mean and the Huber mean.

Figure 1 compares the MSE of the truncated and Huber variance estimators under the log-normal distribution (left) and the t-distribution (right). The red curve represents the MSE of the truncated method for different $\alpha$ values on the top x-axis, and the blue curve represents the MSE of the tuning-free Huber method for different $z$ values on the bottom x-axis. The error bars in both panels represent the standard errors of the MSE. For convenience of comparison, we focus on the ranges of $\alpha$ and $z$ that exhibit the smile shapes of the MSE of the two methods. Note that the MSE of the sample variance, which is 40.14 under the log-normal distribution and 10.17 under the t-distribution, is too large to be presented in the plot. Figure 1 shows that the Huber variance estimator outperforms the optimally tuned truncated method for $z$ roughly between $(1.5,3.5)$ under the log-normal distribution and between $(1.5,4)$ under the Student's t distribution. The performance gap is particularly large under the t distribution, where the optimal Huber method achieves around 20% less MSE than the optimal truncated method.

Figure 2: QL comparisons between truncation and Huber minimization.

Figure 2 assesses the QL loss of the truncated and Huber estimators and displays a similar pattern to Figure 1. Again, the naive sample variance is much worse, with QL loss 0.1765 under the log-normal distribution and 0.1201 under the t-distribution; we therefore do not present it in the plots. The Huber approach continues to defeat the optimally tuned truncation method for any $z\in(1,2)$ under both distributions of our focus. Together with the $z$ ranges where the Huber approach is superior in terms of MSE, our results suggest that $z=1.5$ can be a good practical choice, at least as a starting point. Such a universal practical choice of $z$ demonstrates the adaptivity of the tuning-free Huber method.

5.2 BTC/USDT volatility forecasting

We use the BTC/USDT daily returns to demonstrate the benefit of using robust proxies in volatility forecasting comparison.

Figure 3: Time series, histogram and Q-Q plot of the BTC/USDT returns.

Figure 3 presents the time series, histogram and normal Q-Q plot of the daily BTC returns from 2019-01-01 to 2021-01-01. It is clear that the distribution of the returns is heavy-tailed, that the volatility is clustered, and that there are extreme daily returns beyond 10% or even 20%. The empirical mean of the returns over this 2-year period is 38 basis points, which is quite close to zero compared with the volatility. We thus assume the population mean of the return is zero, so that the variance of the return reduces to the mean of the squared return. In the sequel, we focus on robust estimation of the mean of the squared returns.

5.2.1 Construction of volatility predictors and proxies

Let $r_t$ denote the daily return of BTC from the end of day $t-1$ to the end of day $t$. We emphasize that a volatility predictor $h_t$ must be ex-ante. Here we construct $h_t$ based on $r_{t-m_b},\dots,r_{t-1}$ in a backward window of size $m_b$ and evaluate it at the end of day $t-1$. Our proxy $\widehat{\sigma}_t^2$ for the unobserved variance $\sigma_t^2$ of $r_t$ is instead based on $r_t,\dots,r_{t+m_f}$ in a forward window of size $m_f$.

We consider two volatility prediction approaches: (i) the vanilla EWMA of the backward squared returns, i.e., $\sum_{s=t-m_b}^{t-1}\nu_{s,t}X_s^2$; (ii) the exponentially weighted Huber predictor proposed in Section 4.1. Each approach is evaluated with half-lives equal to 7 days (1 week) and 14 days (2 weeks), giving rise to four predictors, referred to as EWMA_HL7, EWMA_HL14, Huber_HL7 and Huber_HL14. We always choose $m_b$ to be twice the corresponding half-life and set $z=n^{\ddagger}_{\rm eff}$ for the two Huber predictors. As for volatility proxies, we similarly consider two methods: (i) the vanilla forward EWMA proxy, i.e., $\sum_{s=t}^{t+m_f}w_{s,t}X_s^2$; (ii) the robust Huber proxy proposed in Section 3.3. We set the half-life of the exponential decay weights to be always 7 days, $m_f=14$ and $z=2\log n^{\dagger}_{\rm eff}$. We evaluate the Huber approach on two time series of different lengths, $T=720$ and $180$, which imply two different $c_t$ values used in (3.11). We refer to the two corresponding Huber proxies as Huber_720 and Huber_180. Given the theoretical advantages of the Huber proxy demonstrated in Remarks 7 and 13, we do not investigate the first two proxies proposed in Section 3.3.
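For concreteness, here is a small sketch (ours; the helper names are illustrative) of the half-life weights and the two vanilla EWMA constructions just described:

    import numpy as np

    def decay_weights(n_lags, half_life):
        """Normalized exponential weights for n_lags lags, most recent
        first; lambda = 0.5**(1/half_life) halves the weight every
        half_life days, and the weights sum to one."""
        lam = 0.5 ** (1.0 / half_life)
        w = lam ** np.arange(n_lags)
        return w / w.sum()

    def ewma_predictor(r, t, half_life):
        """Backward EWMA variance predictor h_t from r_{t-m_b},...,r_{t-1},
        with m_b = 2 * half_life as in the text (ex-ante: it uses only
        data available at the end of day t-1)."""
        m_b = 2 * half_life
        nu = decay_weights(m_b, half_life)
        return float(np.dot(nu, r[t - m_b:t][::-1] ** 2))  # r_{t-1} first

    def ewma_proxy(r, t, half_life=7, m_f=14):
        """Forward EWMA proxy for sigma_t^2 from r_t,...,r_{t+m_f}."""
        w = decay_weights(m_f + 1, half_life)
        return float(np.dot(w, r[t:t + m_f + 1] ** 2))     # r_t first

Replacing the weighted averages with the joint Huber solver sketched after Remark 11 (backward weights $\nu_{s,t}$ for Huber_HL7/Huber_HL14, forward weights $w_{s,t}$ with the inflated $c_t$ for Huber_720/Huber_180) would yield the robust counterparts.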

The cryptocurrency market trades 24 hours a day without interruption, which gives us 732 daily returns from 2019-01-01 to 2021-01-01. After removing the first 27 days used for predictor priming and the last 13 days used for proxy priming, we are left with 691 data points. For each day, we compute the four predictors and three proxies described above. We plot the series of squared volatility (variance) proxies in Figure 4. The vanilla EWMA proxy (blue line) is clearly the most volatile, reaching a peak variance of 0.016, or equivalently a volatility of 12.6%, in March 2020, when the outbreak of COVID-19 in the US sparked a flash crash of the crypto market. In contrast, the Huber proxies react in a much milder manner, and the smaller the $T$ we consider, the stronger the truncation effect on the Huber proxies.

Figure 4: Three volatility proxies: the non-robust EWMA proxy (blue) and the robust Huber proxies Huber_720 (orange) and Huber_180 (green). The plotted values are variances.

5.2.2 Volatility forecasting comparison with large TT

With the predictors and proxies computed, we are ready to conduct volatility prediction evaluation and comparison. We first emphasize one issue crucial to the evaluation procedure: the global scale of the predictors. Different loss functions may prefer different global scales of volatility forecasts. For example, QL penalizes underestimation much more than overestimation, since the predictor enters the denominator in the formula of QL; in other words, QL typically favors relatively high forecast values. To remove the impact of scale and focus on the capability of capturing the relative variation of volatility, we also compute optimally scaled versions of our predictors and evaluate their empirical loss. Specifically, we first seek the optimal scale by solving

\[\widehat{\beta}:=\operatorname*{argmin}_{\beta\in\mathbb{R}}\,T^{-1}\sum_{t=1}^T L(\widehat{\sigma}_t^2,\beta h_t)\]

and then use $\{\widehat{\beta}h_t\}_{t\in[T]}$ for prediction. By comparing the empirical risk of the optimally scaled predictors, we can completely eliminate the discrimination of the loss against different global scales. Some algebra yields that for MSE the optimal scale is $\widehat{\beta}_{\mathrm{MSE}}=\sum_t h_t\widehat{\sigma}_t^2/\sum_t h_t^2$, and for QL it is $\widehat{\beta}_{\mathrm{QL}}=T^{-1}\sum_t(\widehat{\sigma}_t^2/h_t)$. Table 1 reports the losses of the four predictors and their optimally scaled versions based on all 691 time points, with the non-robust EWMA proxy and the robust proxy Huber_720. Several interesting observations are in order.

Table 1: Losses of the four predictors (original and optimally scaled) with robust and non-robust proxies, evaluated with all T=691 time points.

MSE
                       EWMA Proxy                               Huber_720 Proxy
             Orig (1e-6)  Scaled (1e-6)  β̂_MSE       Orig (1e-6)  Scaled (1e-6)  β̂_MSE
EWMA_HL14    4.115        3.365          0.55         3.285        2.386          0.50
Huber_HL14   3.162        3.161          1.03         2.233        2.228          0.94
EWMA_HL7     4.824        3.395          0.46         3.930        2.364          0.44
Huber_HL7    3.112        3.110          1.05         2.134        2.133          0.98

QL
                       EWMA Proxy                               Huber_720 Proxy
             Orig         Scaled         β̂_QL        Orig         Scaled         β̂_QL
EWMA_HL14    0.804        0.647          1.67         0.584        0.548          1.29
Huber_HL14   1.352        0.567          2.82         0.831        0.450          2.14
EWMA_HL7     1.239        0.792          2.26         0.720        0.595          1.59
Huber_HL7    2.382        0.702          4.09         1.396        0.532          2.94
  • Using the longer half-life of 14 days gives a smaller QL loss, regardless of whether the predictor is robust or non-robust, original or optimally scaled, and regardless of whether the proxy is robust or non-robust. In terms of MSE, the half-life comparison is mixed: Huber_HL7 is slightly better than Huber_HL14, but EWMA_HL14 is better than EWMA_HL7. We focus on the longer half-life from now on.

  • Looking at the original predictors without optimal scaling, it is clear that MSE favors the robust predictor and QL favors the non-robust predictor, regardless of whether we use robust or non-robust proxies. This confirms that different loss functions can lead to very different comparison results.

  • However, the above inconsistency between MSE and QL is mostly due to scaling, as the column of optimal scales $\widehat{\beta}$ clearly shows. For MSE, the optimal scale of the EWMA predictor is around 0.5, while that of the Huber predictor is around 1. In contrast, for QL, the optimal scale needs to be much larger than 1.0, and Huber needs an even larger scale. Looking at the losses of the optimally scaled predictors, it is interesting to see that the Huber predictor outperforms the EWMA predictor in terms of both MSE (slightly) and QL (substantially). This means the Huber predictor is more capable of capturing the relative change of time-varying volatility than the non-robust predictor.

  • Last but not least, when the sample size $T$ is large compared with $n^{\dagger}_{\rm eff}$ (here $T/n^{\dagger}_{\rm eff}=691/24.87=27.78$), the difference between the EWMA and Huber proxies is small, which explains why they give consistent comparison results. When $T$ is not large enough, as in the next subsection, we will see that the robust proxies give more sensible conclusions.

5.2.3 Volatility forecasting comparison with small TT

Figure 5: 180-day rolling loss difference between EWMA_HL14 and Huber_HL14 with robust and non-robust proxies. The upper panel corresponds to MSE and the lower one to QL.

Figure 6: 180-day rolling loss difference between optimally scaled EWMA_HL14 and Huber_HL14 with robust and non-robust proxies. The upper panel corresponds to MSE and the lower one to QL.

Figure 7: Optimal scales of robust and non-robust predictors with robust and non-robust proxies. The upper panel is MSE and the lower one is QL.

Now suppose we only have $T=180$ data points to evaluate and compare volatility forecasts. In Figure 5, we present the 180-day rolling loss difference, i.e., $T^{-1}\sum_{s=t-179}^{t}\{L(\widehat{\sigma}_s^2,h_{\text{Huber\_HL14},s})-L(\widehat{\sigma}_s^2,h_{\text{EWMA\_HL14},s})\}$ with $t$ ranging from 180 to 691, where $\widehat{\sigma}_s^2$ is either the EWMA proxy or Huber_180. A positive loss difference at $t$ indicates that the EWMA predictor outperformed the Huber predictor over the past 180 days. We see that most of the time, Huber_HL14 defeats EWMA_HL14 (negative loss difference) in terms of MSE, while EWMA_HL14 defeats Huber_HL14 (positive loss difference) in terms of QL. In terms of MSE, robust proxies tend to yield a more consistent comparison between the two predictors throughout the entire period: the upper panel of Figure 5 shows that the period over which EWMA_HL14 outperforms Huber_HL14 is much shorter with the robust proxy (orange curve) than with the EWMA proxy (blue curve). In terms of QL with the EWMA proxy, the lower panel of Figure 5 shows the robust predictor performing much worse than the non-robust predictor, especially towards the end of 2020. However, the small MSE difference at the end of 2020 suggests that the EWMA proxy overestimates the true volatility there and exaggerates the performance gap in terms of QL. With the Huber proxy, the loss gap between the two predictors is much narrower, suggesting that the Huber proxy is more robust against huge volatility.
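The rolling comparison is straightforward to compute once the per-day losses are in hand; a minimal sketch (ours), assuming the usual MSE and QL forms of (2.1):

    import numpy as np

    mse = lambda s2, h: (s2 - h) ** 2
    ql = lambda s2, h: s2 / h - np.log(s2 / h) - 1.0

    def rolling_loss_diff(proxy, h_a, h_b, loss, window=180):
        """window-day rolling mean of L(proxy_s, h_a_s) - L(proxy_s, h_b_s);
        negative values mean predictor a beat predictor b over the window."""
        d = loss(proxy, h_a) - loss(proxy, h_b)
        return np.convolve(d, np.ones(window) / window, mode="valid")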

Figure 6 presents the 180-day rolling loss difference between the optimally scaled Huber_HL14 and EWMA_HL14, based on the robust and EWMA proxies respectively. For MSE, we know from the previous subsection that once the optimal scaling is applied, the two predictors do not differ much in overall loss, with the Huber predictor only slightly ahead. In the upper panel of Figure 6, the robust-proxy-based curve is closer to zero than the EWMA-proxy-based curve, displaying more consistency with our result based on large $T$. For QL, the loss differences using the robust and non-robust proxies look quite similar. We also plot $\widehat{\beta}_{\text{Huber},t}$ and $\widehat{\beta}_{\text{EWMA},t}$ against time $t$, based on robust and non-robust proxies, in Figure 7. For both MSE and QL losses, using the robust proxy leads to more stable optimal scales, which practitioners always prefer.

In a nutshell, we have seen how the proposed robust Huber proxy can lead to better interpretability and more sensible comparison of volatility predictors. When the total sample size is small compared to the local effective sample size, using a robust proxy is necessary and reduces the probability of a misleading forecast evaluation and comparison. When the total sample size is large enough, the proposed robust proxy automatically truncates less and resembles the EWMA proxy; this also provides justification for using a non-robust EWMA proxy when the sample size is large. We still recommend the proposed robust proxy, which adapts to the sample size and the time-varying volatilities: even if the robust proxy truncates data only during a short volatile period, this can still make a significant difference to the risk evaluation.

6 Discussions

Compared with the literature on modeling and predicting volatility, evaluating volatility predictions has not received enough attention. Part of the reason is the lack of a good framework for its study, which makes practical volatility forecast comparison quite subjective and less systematic in terms of loss selection and proxy selection. Patton (2011) is a pioneering work providing a framework based on long-term expectations and guidance for loss selection, while our work gives a new framework based on an empirical deviation perspective and further provides guidance on proxy selection. In our framework, we focus on predictors that achieve the desired probability bound for II, so that the empirical loss is close to the conditional expected loss. The correct comparison, with large probability, of the conditional expected losses of two predictors then relies on a good control of I, which imposes requirements on the proxies.

Within this framework, we proposed three robust proxies, each of which guarantees a good bound for I when the data have only finite fourth moments. Although all three proxies attain the optimal rate of convergence for bounding I, we recommend the exponentially weighted tuning-free Huber proxy. It improves on the clipped squared returns by leveraging the smoothness of neighboring volatilities, and it improves on the proxy based on direct truncation through a better constant in the deviation bound, depending only on the central moment. To construct this proxy, we solve an exponentially weighted Huber loss minimization in which, perhaps surprisingly, the truncation level of each sample must also vary with its weight.

We then applied this proxy to a real BTC volatility forecasting comparison and reached several interesting observations. First, robust predictors, with their better variance control, may use a faster decay to reduce the approximation bias. Second, different losses can lead to drastically different comparisons, so even when restricting to robust losses, loss selection remains a meaningful topic in practice. Third, rescaling predictors according to the loss function is necessary and can further extract the value of robust predictors. Finally, the proposed robust Huber proxy adapts to both the time-varying volatility and the total sample size: when the overall sample size is much larger than the local effective sample size, the robust Huber proxy barely truncates, which justifies even using the EWMA proxy for prediction evaluation, while the robust Huber proxy in theory still gives a high probability of concluding the correct comparison.

There remain limitations of the current work and open questions to be addressed. Assumption 1 excludes the situation where $\sigma_t^2$ depends on previous returns, as in GARCH models. We require $n^{\dagger}_{\rm eff}\to\infty$ for the local performance guarantee, which forces a potentially slow decay; in practice, however, it is hard to know how fast the volatility changes. We also ignored the auto-correlation of returns and assumed temporal independence of the innovations for simplicity. Extensions of the current framework to time series models and more relaxed assumptions are of practical value to investment managers and financial analysts. Our framework may also have nontrivial implications for cross-validation with heavy-tailed data, where validation data are used to construct proxies for the unknown quantity to be estimated robustly: subjectively choosing a truncation level for proxy construction could clearly favor a particular truncation level used by a robust predictor. Motivated by our study, rescaling the optimal truncation level for one data split according to the total (effective) number of sample splits sounds like an interesting idea worth further investigation.

References

  • Andersen, T. G., Bollerslev, T., Christoffersen, P. and Diebold, F. X. (2005). Volatility forecasting.
  • Andersen, T. G., Dobrev, D. and Schaumburg, E. (2012). Jump-robust volatility estimation using nearest neighbor truncation. Journal of Econometrics 169 75–93.
  • Baillie, R. T., Bollerslev, T. and Mikkelsen, H. O. (1996). Fractionally integrated generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 74 3–30.
  • Black, F. and Scholes, M. (2019). The pricing of options and corporate liabilities. In World Scientific Reference on Contingent Claims Analysis in Corporate Finance: Volume 1: Foundations of CCA and Equity Valuation. World Scientific, 3–21.
  • Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31 307–327.
  • Bollerslev, T. and Wooldridge, J. M. (1992). Quasi-maximum likelihood estimation and inference in dynamic models with time-varying covariances. Econometric Reviews 11 143–172.
  • Brailsford, T. J. and Faff, R. W. (1996). An evaluation of volatility forecasting techniques. Journal of Banking & Finance 20 419–438.
  • Brandt, M. W. and Jones, C. S. (2006). Volatility forecasting with range-based EGARCH models. Journal of Business & Economic Statistics 24 470–486.
  • Brockwell, P. J. and Davis, R. A. (2009). Time Series: Theory and Methods. Springer Science & Business Media.
  • Brooks, C. and Persand, G. (2003). Volatility forecasting for risk management. Journal of Forecasting 22 1–22.
  • Bubeck, S. (2014). Convex optimization: Algorithms and complexity. arXiv preprint arXiv:1405.4980.
  • Carnero, M. A., Peña, D. and Ruiz, E. (2012). Estimating GARCH volatility in the presence of outliers. Economics Letters 114 86–90.
  • Catania, L., Grassi, S. and Ravazzolo, F. (2018). Predicting the volatility of cryptocurrency time-series. In Mathematical and Statistical Methods for Actuarial Sciences and Finance. Springer, 203–207.
  • Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. Annales de l'Institut Henri Poincaré 48 1148–1185.
  • Charles, A. and Darné, O. (2019). Volatility estimation for Bitcoin: Replication and robustness. International Economics 157 23–32.
  • Chen, M., Gao, C. and Ren, Z. (2018). Robust covariance and scatter matrix estimation under Huber's contamination model. The Annals of Statistics 46 1932–1960.
  • Chen, X. and Zhou, W.-X. (2020). Robust inference via multiplier bootstrap. The Annals of Statistics 48 1665–1691.
  • Christiansen, C., Schmeling, M. and Schrimpf, A. (2012). A comprehensive look at financial volatility prediction by economic variables. Journal of Applied Econometrics 27 956–977.
  • Christoffersen, P. F. and Diebold, F. X. (2000). How relevant is volatility forecasting for financial risk management? Review of Economics and Statistics 82 12–22.
  • Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50 987–1007.
  • Fan, J., Li, Q. and Wang, Y. (2016). Robust estimation of high-dimensional mean regression. Journal of the Royal Statistical Society, Series B.
  • Fan, J., Li, Q. and Wang, Y. (2017). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society, Series B 79 247.
  • Guo, T., Xu, Z., Yao, X., Chen, H., Aberer, K. and Funaya, K. (2016). Robust online time series prediction with recurrent neural networks. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE.
  • Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics 35 73–101.
  • Huber, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. The Annals of Statistics 1 799–821.
  • Lamoureux, C. G. and Lastrapes, W. D. (1993). Forecasting stock-return variance: Toward an understanding of stochastic implied volatilities. The Review of Financial Studies 6 293–326.
  • Minsker, S. (2018). Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. The Annals of Statistics 46 2871–2903.
  • Park, B.-J. (2002). An outlier robust GARCH model and forecasting volatility of exchange rate returns. Journal of Forecasting 21 381–393.
  • Patton, A. J. (2011). Volatility forecast comparison using imperfect volatility proxies. Journal of Econometrics 160 246–256.
  • Poon, S.-H. and Granger, C. W. (2003). Forecasting volatility in financial markets: A review. Journal of Economic Literature 41 478–539.
  • Sun, Q., Zhou, W.-X. and Fan, J. (2020). Adaptive Huber regression. Journal of the American Statistical Association 115 254–265.
  • Taylor, J. W. (2004). Volatility forecasting with smooth transition exponential smoothing. International Journal of Forecasting 20 273–286.
  • Taylor, S. J. (1994). Modeling stochastic volatility: A review and comparative study. Mathematical Finance 4 183–204.
  • Trucíos, C. (2019). Forecasting Bitcoin risk measures: A robust approach. International Journal of Forecasting 35 836–847.
  • Vasilellis, G. A. and Meade, N. (1996). Forecasting volatility for portfolio selection. Journal of Business Finance & Accounting 23 125–143.
  • Wang, L., Zheng, C., Zhou, W. and Zhou, W.-X. (2020). A new principle for tuning-free Huber regression. Statistica Sinica.
  • Zhang, D. (2021). Robust estimation of the mean and covariance matrix for high dimensional time series. Statistica Sinica 31 797–820.

Appendix A Proofs

This section provides proof details for all the theorems in the main text.

Proof of Theorem 1.

The proof follows Theorem 5 of Fan et al. (2016). Denote $\phi(x)=\min(|x|,1)\,\mathrm{sgn}(x)$; we have

\[-\log(1-x+x^2)\leq\phi(x)\leq\log(1+x+x^2)\,.\]

Define $r(\theta)=\sum_{s=t}^{t+m}\min(w_{s,t}|X_s^2-\theta|,\tau_t)\,\mathrm{sgn}(X_s^2-\theta)$, so that $\widehat{\sigma}_t^2$ is the solution to $r(\theta)=0$. We bound the moment generating function of $r(\theta)$:

\[\begin{aligned}
\mathbb{E}\{\exp[r(\theta)/\tau_t]\}&\leq\prod_{s=t}^{t+m}\mathbb{E}\{\exp[\phi(w_{s,t}(X_s^2-\theta)/\tau_t)]\}\\
&\leq\prod_{s=t}^{t+m}\mathbb{E}\{1+w_{s,t}(X_s^2-\theta)/\tau_t+w_{s,t}^2(X_s^2-\theta)^2/\tau_t^2\}\\
&\leq\prod_{s=t}^{t+m}\{1+w_{s,t}(\sigma_s^2-\theta)/\tau_t+w_{s,t}^2(\kappa_s+(\sigma_s^2-\theta)^2)/\tau_t^2\}\\
&\leq\exp\bigg\{\sum_{s=t}^{t+m}\big\{w_{s,t}(\sigma_s^2-\theta)/\tau_t+w_{s,t}^2(\kappa_s+(\sigma_s^2-\theta)^2)/\tau_t^2\big\}\bigg\}\,.
\end{aligned}\]

Define $\bar{\sigma_t^2}=\sum_{s=t}^{t+m}w_{s,t}\sigma_s^2=\sigma_t^2+\delta_{0,t}$. The right-hand side can be further bounded by

\[\begin{aligned}
\mathbb{E}\{\exp[r(\theta)/\tau_t]\}&\leq\exp\bigg\{(\bar{\sigma_t^2}-\theta)/\tau_t+\sum_{s=t}^{t+m}w_{s,t}^2(\kappa^{\dagger}_t+2(\sigma_s^2-\sigma_t^2)^2+2(\sigma_t^2-\theta)^2)/\tau_t^2\bigg\}\\
&=\exp\bigg\{(\sigma_t^2-\theta)/\tau_t+\sum_{s=t}^{t+m}w_{s,t}^2(\kappa^{\dagger}_t+2(\sigma_t^2-\theta)^2)/\tau_t^2+\delta_{0,t}/\tau_t+2\delta_{1,t}/\tau_t^2\bigg\}\\
&=\exp\big\{(\sigma_t^2-\theta)/\tau_t+(n^{\dagger}_{\rm eff})^{-1}(\kappa^{\dagger}_t+2(\sigma_t^2-\theta)^2)/\tau_t^2+\delta_{0,t}/\tau_t+2\delta_{1,t}/\tau_t^2\big\}\,.
\end{aligned}\]

Similarly, we can prove that $\mathbb{E}\{\exp[-r(\theta)/\tau_t]\}\leq\exp\{-(\sigma_t^2-\theta)/\tau_t+(n^{\dagger}_{\rm eff})^{-1}(\kappa^{\dagger}_t+2(\sigma_t^2-\theta)^2)/\tau_t^2-\delta_{0,t}/\tau_t+2\delta_{1,t}/\tau_t^2\}$. Define

\[B_+(\theta)=(\sigma_t^2-\theta)+(\kappa^{\dagger}_t+2(\sigma_t^2-\theta)^2)/(n^{\dagger}_{\rm eff}\tau_t)+\tau_t z\,,\qquad B_-(\theta)=(\sigma_t^2-\theta)-(\kappa^{\dagger}_t+2(\sigma_t^2-\theta)^2)/(n^{\dagger}_{\rm eff}\tau_t)-\tau_t z\,.\]

By Chebyshev's inequality,

\[\mathbb{P}(r(\theta)>B_+(\theta))\leq\exp\{-B_+(\theta)/\tau_t\}\,\mathbb{E}\exp\{r(\theta)/\tau_t\}=\exp\{-z+|\delta_{0,t}|/\tau_t+2\delta_{1,t}/\tau_t^2\}\,,\]

and similarly $\mathbb{P}(r(\theta)<B_-(\theta))\leq\exp\{-z+|\delta_{0,t}|/\tau_t+2\delta_{1,t}/\tau_t^2\}$. Following the same argument as Fan et al. (2016), we can show that for $n^{\dagger}_{\rm eff}$ large enough that $8/(n^{\dagger}_{\rm eff}\tau_t)\cdot(\kappa^{\dagger}_t/(n^{\dagger}_{\rm eff}\tau_t)+\tau_t z)\leq 1$, the root $\theta_+$ of $B_+(\theta)$ satisfies

\[\theta_+\leq\sigma_t^2+2(\kappa^{\dagger}_t/(n^{\dagger}_{\rm eff}\tau_t)+\tau_t z)\,,\]

and the root $\theta_-$ of $B_-(\theta)$ satisfies

\[\theta_-\geq\sigma_t^2-2(\kappa^{\dagger}_t/(n^{\dagger}_{\rm eff}\tau_t)+\tau_t z)\,.\]

With the choice $\tau_t=\sqrt{\kappa^{\dagger}_t/(n^{\dagger}_{\rm eff}z)}$ given in Theorem 1, the two terms in the bracket balance, $\kappa^{\dagger}_t/(n^{\dagger}_{\rm eff}\tau_t)=\tau_t z=\sqrt{\kappa^{\dagger}_t z/n^{\dagger}_{\rm eff}}$, so we obtain $\mathbb{P}(|\widehat{\sigma}_t^2-\sigma_t^2|\leq 4\sqrt{\kappa^{\dagger}_t z/n^{\dagger}_{\rm eff}})\geq 1-2e^{-z+|\delta_{0,t}|/\tau_t+2\delta_{1,t}/\tau_t^2}$, and the requirement on the effective sample size becomes $n^{\dagger}_{\rm eff}\geq 16z$. ∎

Proof of Theorem 2.

We extend Theorem 2.1 of Wang et al. (2020) to the weighted case. Note again that we are solving the following equation for $\widehat{\tau}_t$:

\[\sum_{s=t}^{t+m}\frac{\min\big(w_{s,t}^2|X_s^2-\sigma_t^2|^2,\tau_t^2\big)}{\tau_t^2}-z=0\,.\]

Also define $\tau_t$ as the solution of the corresponding population equation:

\[\sum_{s=t}^{t+m}\frac{\mathbb{E}\big[\min\big(w_{s,t}^2|X_s^2-\sigma_t^2|^2,\tau_t^2\big)\big]}{\tau_t^2}-z=0\,.\]

We first show that (a) $\tau_t\asymp\sqrt{\kappa^{\dagger}_t/(n^{\dagger}_{\rm eff}z)}$, and then (b) with probability approaching 1, $|\widehat{\tau}_t/\tau_t-1|\leq c_0$ for a small fixed $c_0$. To prove (a), it is straightforward to see that

\[\tau_t^2 z\leq\sum_{s=t}^{t+m}w_{s,t}^2\,\mathbb{E}|X_s^2-\sigma_t^2|^2=\sum_{s=t}^{t+m}w_{s,t}^2(\kappa_s+(\sigma_s^2-\sigma_t^2)^2)=\frac{\kappa^{\dagger}_t+\delta_{1,t}n^{\dagger}_{\rm eff}}{n^{\dagger}_{\rm eff}}\leq\frac{(1+c_1)\kappa^{\dagger}_t}{n^{\dagger}_{\rm eff}}\,.\]

Furthermore,

\[\tau_t^2 z=\sum_{s=t}^{t+m}\mathbb{E}\big[(w_{s,t}^2|X_s^2-\sigma_t^2|^2)\wedge\tau_t^2\big]\geq\sum_{s=t}^{t+m}\tau_t^2\,\mathbb{P}\big(w_{s,t}^2|X_s^2-\sigma_t^2|^2>\tau_t^2\big)\,.\]

Therefore, $z\geq\sum_{s=t}^{t+m}\mathbb{P}(w_{s,t}^2|X_s^2-\sigma_t^2|^2>\tau_t^2)$. Consider the solution $q_s(a)$ of the equation $\mathbb{P}(|X_s^2-\sigma_t^2|^2>qa)=a^{-1}$ in the variable $q$; the solution is unique. Since all the $w_{s,t}$ are of the same order, we know that $w_{s,t}\asymp 1/m$ and $n^{\dagger}_{\rm eff}\asymp m$. Let $a=a_s=(\sum_{s=t}^{t+m}w_{s,t}^2)/(zw_{s,t}^2)\asymp m/z$, with corresponding solution $q_s(a_s)$, and define $q_{\min}=\min_s q_s(a_s)$. Then

\[\mathbb{P}\bigg[w_{s,t}^2|X_s^2-\sigma_t^2|^2>\frac{q_{\min}}{z}\bigg(\sum_{s=t}^{t+m}w_{s,t}^2\bigg)\bigg]\geq\mathbb{P}\bigg[|X_s^2-\sigma_t^2|^2>q_s(a_s)\bigg(\sum_{s=t}^{t+m}w_{s,t}^2\bigg)\Big/(zw_{s,t}^2)\bigg]=\frac{zw_{s,t}^2}{\sum_{s=t}^{t+m}w_{s,t}^2}\,.\]

Let $\tau_0^2=\frac{q_{\min}}{z}\sum_{s=t}^{t+m}w_{s,t}^2=\frac{q_{\min}}{n^{\dagger}_{\rm eff}z}$, so we have shown that $\sum_{s=t}^{t+m}\mathbb{P}(w_{s,t}^2|X_s^2-\sigma_t^2|^2>\tau_0^2)\geq z$, and therefore $\tau_t\geq\tau_0$. From $\mathbb{P}(|X_s^2-\sigma_t^2|^2>qa)=a^{-1}$, we know that for any $s$, $q_s(a_s)\asymp a_s^{-1}\asymp z/m$, and so is $q_{\min}$. Therefore $\tau_t/w_{s,t}\geq\tau_0/w_{s,t}\asymp m\tau_0\asymp\sqrt{q_{\min}m/z}\asymp 1$; write $\tau_t/w_{s,t}\geq c_2$ for some constant $c_2>0$. Then

\[\tau_t^2 z=\sum_{s=t}^{t+m}\mathbb{E}\big[(w_{s,t}^2|X_s^2-\sigma_t^2|^2)\wedge\tau_t^2\big]\geq\sum_{s=t}^{t+m}w_{s,t}^2\,\mathbb{E}\big[|X_s^2-\sigma_t^2|^2\wedge c_2^2\big]\asymp\frac{\kappa^{\dagger}_t}{n^{\dagger}_{\rm eff}}\,.\]

So we have shown that (a) holds, that is, $\tau_t\asymp\sqrt{\kappa^{\dagger}_t/(n^{\dagger}_{\rm eff}z)}$.

Next, we show (b), so that the solution $\widehat{\tau}_t$ of the empirical equation gives the desired optimal truncation rate. To this end, we follow the proof of Theorem 1 of Wang et al. (2020) closely. Specifically, define $Y_s=w_{s,t}|X_s^2-\sigma_t^2|$ and, in their notation,

\[p_n(t)=\sum_s\frac{Y_s^2 I(Y_s\leq t)}{t^2}\,,\qquad q_n(t)=\sum_s\frac{Y_s^2\wedge t^2}{t^2}\,,\]

with population versions

\[p(t)=\sum_s\frac{\mathbb{E}[Y_s^2 I(Y_s\leq t)]}{t^2}\,,\qquad q(t)=\sum_s\frac{\mathbb{E}[Y_s^2\wedge t^2]}{t^2}\,.\]

One important fact is that $q_n'(t)=-2t^{-1}p_n(t)$ and $q'(t)=-2t^{-1}p(t)$, which is the key to the proof of Theorem 1 of Wang et al. (2020). The only difference in our setting is that we do not assume the $Y_s$ are identically distributed. So when applying Bernstein's inequality as in (S1.8) of Wang et al. (2020), we need the version for non-identically distributed variables and must bound the sum of the individual variances. Specifically, defining $\zeta_s=\frac{Y_s^2\wedge\tau_t^2}{\tau_t^2}$, we have $0\leq\zeta_s\leq\min\{1,(Y_s\wedge\tau_t)/\tau_t\}$, and hence $\sum_s\mathbb{E}[\zeta_s^2]\leq\sum_s\mathbb{E}\big[\frac{Y_s^2\wedge\tau_t^2}{\tau_t^2}\big]=q(\tau_t)=z$. So we can indeed apply Bernstein's inequality to $\sum_s\zeta_s$. For more details, we refer the interested reader to Wang et al. (2020). ∎

Proof of Theorem 3.

Let $\widehat{\sigma}_t=(\widehat{\sigma}_c)_t$. Then

\[\begin{aligned}
\mathbb{P}\bigg[\sum_{t=1}^T(\widehat{\sigma}_t^2/\sigma_t^2-1)q_t\geq y\bigg]&=\mathbb{P}\bigg[\sum_{t=1}^T\frac{q_t}{\sigma_t^2}(X_t^2\wedge c_t-\mathbb{E}[X_t^2\wedge c_t])+\sum_{t=1}^T\bigg(\frac{\mathbb{E}[X_t^2\wedge c_t]}{\sigma_t^2}-1\bigg)q_t\geq y\bigg]\\
&\leq\mathbb{P}\bigg[\sum_{t=1}^T\frac{q_t}{\sigma_t^2}(X_t^2\wedge c_t-\mathbb{E}[X_t^2\wedge c_t])\geq y/2\bigg]+\mathbb{P}\bigg[\sum_{t=1}^T\bigg(\frac{\mathbb{E}[X_t^2\wedge c_t]}{\sigma_t^2}-1\bigg)q_t\geq y/2\bigg]\,.
\end{aligned}\]

To bound the first term, we apply the Bernstein inequality to $\sum_t Y_t$ with $Y_t=\frac{q_t}{\sigma_t^2}(X_t^2\wedge c_t-\mathbb{E}[X_t^2\wedge c_t])$. Note that $\mathbb{E}[Y_t^2]\leq\tilde{\kappa}_t q_t^2/\sigma_t^4\leq\tilde{\kappa}_t Q^2/\sigma_t^4$ and $|Y_t|\leq 2c_t Q/\sigma_t^2$, so we can choose $y=CQ\sqrt{zT}$ to bound the first term by $e^{-z}$.

To bound the second term, note that

\[\big|\mathbb{E}[X_t^2\wedge c_t]-\sigma_t^2\big|=\mathbb{E}[X_t^2 I(X_t^2>c_t)]-c_t\,\mathbb{P}(X_t^2>c_t)\leq\mathbb{E}\bigg[X_t^2\cdot\frac{X_t^2}{c_t}\bigg]=\frac{\mathbb{E}[X_t^4]}{c_t}=\tilde{\kappa}_t\sqrt{\frac{z}{\tilde{\kappa}^{\dagger}_t T}}\,.\]

Here we can again choose $y=CQ\sqrt{zT}$ for a large enough $C$ to make the second probability equal to 0. ∎

Lemma 1.

Let $\{Y_t\}$ be a process such that $Y_t=h_t(X_t,X_{t-1},\dots)$. Define

\[\gamma_j:=\max_t\|h_t(X_t,X_{t-1},\dots)-h_t(X_t,\dots,X_{t-j+1},X_{t-j}',X_{t-j-1},\dots)\|_2\]

for any $j\geq 0$, where $X_{t-j}'$ is an iid copy of $X_{t-j}$ and $\{X_t\}$ are independent random innovations satisfying Assumption 1. Assume $\mathbb{E}[Y_t]=0$ and $|Y_t|\leq M$ for all $t$, and that there exists a constant $\rho\in(0,1)$ such that

\[\|Y_\cdot\|_2:=\sup_{k\geq 0}\rho^{-k}\sum_{j=k}^{\infty}\gamma_j<\infty\,.\]

Also assume $T\geq 4\vee(\log(\rho^{-1})/2)$. Then for $y>0$,

\[\mathbb{P}\bigg(\sum_{t=1}^T Y_t\geq y\bigg)\leq\exp\bigg\{-\frac{y^2}{4C_1(T\|Y_\cdot\|_2^2+M^2)+2C_2M(\log T)^2y}\bigg\}\,,\]

where $C_1=2\max\{(e^4-5)/4,\,[\rho(1-\rho)\log(\rho^{-1})]^{-1}\}\cdot(8\vee\log(\rho^{-1}))^2$ and $C_2=\max\{(c\log 2)^{-1},\,[1\vee(\log(\rho^{-1})/8)]\}$ with $c=[\log(\rho^{-1})/8]\wedge\sqrt{(\log 2)\log(\rho^{-1})/4}$.

The proof of Lemma 1 follows closely Theorem 2.1 of Zhang (2021). The only extension is that we do not require $X_t$, or even $X_t/\sigma_t$, to be identically distributed, so there is no stationarity assumption on the process $Y_t$; in exchange, we require a stronger assumption on the maximal perturbation of each $Y_t=h_t(X_t,X_{t-1},\dots)$. The entire proof of Zhang (2021) goes through with this new definition of $\|Y_\cdot\|_2$ and $\gamma_j$, and we omit the details.

Proof of Theorem 4.

Let $\widehat{\sigma}_t=(\widehat{\sigma}_e)_t$ and define $Y_t=\frac{q_t}{\sigma_t^2}(\widehat{\sigma}_t^2-\mathbb{E}[\widehat{\sigma}_t^2])$. It is not hard to see that for $j\leq m$,

\[\begin{aligned}
\gamma_j^2&=\max_t\mathbb{E}\big\{[(w_{j,0}X_{t+j}^2)\wedge c_t-(w_{j,0}{X_{t+j}'}^2)\wedge c_t]^2\big\}\frac{q_t^2}{\sigma_t^4}\\
&\leq\max_t\mathbb{E}\big\{(w_{j,0}X_{t+j}^2-w_{j,0}{X_{t+j}'}^2)^2+(w_{j,0}X_{t+j}^2-c_t)^2 I(w_{j,0}X_{t+j}^2<c_t,\,w_{j,0}{X_{t+j}'}^2>c_t)\\
&\qquad\quad+(w_{j,0}{X_{t+j}'}^2-c_t)^2 I(w_{j,0}X_{t+j}^2>c_t,\,w_{j,0}{X_{t+j}'}^2<c_t)\big\}\frac{Q^2}{\sigma_t^4}\\
&\leq 4w_{j,0}^2Q^2\max_t\tilde{\kappa}_{t+j}/\sigma_t^4\leq 4w_{j,0}^2Q^2\max_{t,u\leq m}\tilde{\kappa}_{t+u}/\sigma_t^4\,,
\end{aligned}\]

and for $j>m$, $\gamma_j=0$. Therefore,

\[\|Y_\cdot\|_2=\sup_k\rho^{-k}\sum_{j=k}^{\infty}\gamma_j\leq 2Q\sqrt{\max_{t,u\leq m}\tilde{\kappa}_{t+u}/\sigma_t^4}\,\sup_k\rho^{-k}\sum_{j=k}^{m}w_{j,0}\leq 2Q\sqrt{\max_{t,u\leq m}\tilde{\kappa}_{t+u}/\sigma_t^4}<\infty\,,\]

for any fixed $\rho\in(0,1)$.

In addition, we claim that $|Y_t|\leq CQ\sqrt{z}$ with high probability. To prove this, we need the following result: when $n^{\dagger}_{\rm eff}\geq 16\tilde{z}$ and $(n^{\dagger}_{\rm eff}c_t)^2\geq 16\tilde{\kappa}^{\dagger}_t$,

\[\mathbb{P}\big(|\widehat{\sigma}_t^2-\sigma_t^2|\leq 2(\tilde{\kappa}^{\dagger}_t/(n^{\dagger}_{\rm eff}c_t)+c_t\tilde{z})\big)\geq 1-2e^{-\tilde{z}+|\delta_{0,t}|/c_t+2\delta_{1,t}/c_t^2}\,.\]

This follows from a proof similar to that of Theorem 1, so we omit the details. Note that here $c_t$ is not chosen to optimize the error bound: the average over $T$ takes care of the extra variance in $\widehat{\sigma}_t^2$, so we only need $c_t$ to keep the error bound of order $\sqrt{z}$. Taking $c_t=\sqrt{\frac{\tilde{\kappa}^{\dagger}_t T}{n^{\dagger 2}_{\rm eff}z}}$ and $\tilde{z}=czn^{\dagger}_{\rm eff}/\sqrt{T}$ indeed does the job, since the exception probability is $2e^{-\frac{c}{2}n^{\dagger}_{\rm eff}z/\sqrt{T}}$ under the assumption that $\max_t|\delta_{0,t}|/\sqrt{\tilde{\kappa}^{\dagger}_t}+2\delta_{1,t}n^{\dagger}_{\rm eff}/(\tilde{\kappa}^{\dagger}_t\sqrt{T})\leq c/2$. We require $|Y_t|\leq CQ\sqrt{z}$ to hold at all time points, so the exception probability over all events is bounded by $2Te^{-\frac{c}{2}n^{\dagger}_{\rm eff}z/\sqrt{T}}$; when $n^{\dagger}_{\rm eff}>2c^{-1}(1+\log(2T)/z)\sqrt{T}$, this is further bounded by $e^{-z}$. Finally, $n^{\dagger}_{\rm eff}\geq 16\tilde{z}$ and $(n^{\dagger}_{\rm eff}c_t)^2\geq 16\tilde{\kappa}^{\dagger}_t$ translate into the requirements $\sqrt{T}\geq 16cz$ and $T\geq 16z$. Now, conditioning on $|Y_t|\leq CQ\sqrt{z}$, we are ready to apply Lemma 1 to $Y_t$: choosing $y=CQ\sqrt{zT}$ in Lemma 1 makes the exception probability smaller than $e^{-z}$. In total, the exception probability is $2e^{-z}$.

Next, for the bias term, $\mathbb{E}[\widehat{\sigma}_t^2]-\sigma_t^2=\sum_{s=t}^{t+m}w_{s,t}\mathbb{E}[\min(X_s^2,c_t/w_{s,t})-\sigma_s^2]+(\sum_{s=t}^{t+m}w_{s,t}\sigma_s^2-\sigma_t^2)$. From the proof of Theorem 3, we know that $\mathbb{E}[\min(X_s^2,c_t/w_{s,t})-\sigma_s^2]\leq 2w_{s,t}\tilde{\kappa}_s/c_t$. Therefore,

\[\frac{\mathbb{E}[\widehat{\sigma}_t^2]}{\sigma_t^2}-1\leq\frac{2}{\sigma_t^2}\sum_{s=t}^{t+m}w_{s,t}^2\tilde{\kappa}_s/c_t+\frac{\delta_{0,t}}{\sigma_t^2}=2\sqrt{\frac{\tilde{\kappa}^{\dagger}_t z}{\sigma_t^4 T}}+\frac{\delta_{0,t}}{\sigma_t^2}\,.\]

Thus the bias term does not affect the total error bound stated in the theorem. ∎

Lemma 2.

Assume the weighted Huber loss $F(\theta)={\cal L}_{c_t}(\theta;\{w_{s,t}\}_{s=t}^{t+m})$ is $\alpha$-strongly convex for some $\alpha\in(0,1/2)$ in a local neighborhood $\Theta$ around $\theta^*\in\Theta$, and let $\tilde{F}(\theta)$ be a perturbed version of $F(\theta)$. If $\theta^*=\operatorname*{argmin}_\theta F(\theta)$, $\tilde{\theta}^*=\operatorname*{argmin}_\theta\tilde{F}(\theta)$ and $\tilde{\theta}^*\in\Theta$, then

\[|\tilde{\theta}^*-\theta^*|\leq\frac{\alpha+1}{\alpha}\sup_\theta|\tilde{F}'(\theta)-F'(\theta)|\,.\]
Proof of Lemma 2.

Besides strong convexity, $F(\theta)$ is $\beta$-smooth with $\beta=1$; that is,

\[F(\theta_1)-F(\theta_2)-F'(\theta_2)(\theta_1-\theta_2)\leq\frac{\beta}{2}(\theta_1-\theta_2)^2\,,\]

where $\beta=1$ is obvious given that the second derivative of $F(\theta)$ is bounded by 1. From Lemma 3.11 of Bubeck (2014), for $\theta_1,\theta_2\in\Theta$,

\[\frac{\alpha}{\alpha+1}(\theta_1-\theta_2)^2\leq(F'(\theta_1)-F'(\theta_2))(\theta_1-\theta_2)\leq\frac{\alpha}{2(\alpha+1)}(\theta_1-\theta_2)^2+\frac{\alpha+1}{2\alpha}(F'(\theta_1)-F'(\theta_2))^2\,.\]

Choosing $\theta_1=\tilde{\theta}^*$ and $\theta_2=\theta^*$, we have $F'(\theta_2)=0=\tilde{F}'(\theta_1)$, so

\[\frac{\alpha}{\alpha+1}(\tilde{\theta}^*-\theta^*)^2\leq\frac{\alpha+1}{\alpha}(F'(\tilde{\theta}^*)-\tilde{F}'(\tilde{\theta}^*))^2\,,\]

which concludes the proof. ∎

Proof of Theorem 5.

Let $\widehat{\sigma}_t=(\widehat{\sigma}_H)_t$ and define $Y_t=\frac{q_t}{\sigma_t^2}(\widehat{\sigma}_t^2-\mathbb{E}[\widehat{\sigma}_t^2])$. In order to apply Lemma 1, we employ Lemma 2 to bound the perturbation of the Huber loss minimizer via the perturbation of the Huber loss derivative. Similarly to the proof of Theorem 4, we can show that $|\widehat{\sigma}_t^2-\sigma_t^2|\leq C\sqrt{z}$ for all $t$ with probability at least $1-e^{-z}$. In fact, following the proof of Theorem 4, we can write out the bound explicitly: $|\widehat{\sigma}_t^2-\sigma_t^2|\leq 2(\sqrt{\kappa_t^{\dagger}z/T}+c\sqrt{\kappa_t^{\dagger}z})<(2c+2)\sqrt{\kappa_t^{\dagger}z}$, which means the Huber loss minimizer does fall into the region where we have strong convexity, by the assumptions of Theorem 5. The bound on $|\widehat{\sigma}_t^2-\sigma_t^2|$ also implies that $|Y_t|\leq CQ\sqrt{z}$ with exception probability $e^{-z}$.

In addition, we check that $\|Y_\cdot\|_2<\infty$. For $j>m$, $\gamma_j=0$, and for $j\leq m$,

\[\begin{aligned}
\gamma_j^2&\leq\frac{(\alpha+1)^2}{\alpha^2}\max_t\max_{\sigma_t^2}\mathbb{E}\big\{\big[(w_{j,0}|X_{t+j}^2-\sigma_t^2|)\wedge c_t-(w_{j,0}|{X_{t+j}'}^2-\sigma_t^2|)\wedge c_t\big]^2\big\}\frac{q_t^2}{\sigma_t^4}\\
&\leq\frac{(\alpha+1)^2}{\alpha^2}\max_t\max_{\sigma_t^2}\mathbb{E}\big\{(w_{j,0}|X_{t+j}^2-\sigma_t^2|-w_{j,0}|{X_{t+j}'}^2-\sigma_t^2|)^2\\
&\qquad+(w_{j,0}|X_{t+j}^2-\sigma_t^2|-c_t)^2 I(w_{j,0}|X_{t+j}^2-\sigma_t^2|<c_t,\,w_{j,0}|{X_{t+j}'}^2-\sigma_t^2|>c_t)\\
&\qquad+(w_{j,0}|{X_{t+j}'}^2-\sigma_t^2|-c_t)^2 I(w_{j,0}|X_{t+j}^2-\sigma_t^2|>c_t,\,w_{j,0}|{X_{t+j}'}^2-\sigma_t^2|<c_t)\big\}\frac{Q^2}{\sigma_t^4}\\
&\leq 4w_{j,0}^2Q^2\max_t\kappa_{t+j}/\sigma_t^4\leq 4w_{j,0}^2Q^2\max_{t,u\leq m}\kappa_{t+u}/\sigma_t^4\,.
\end{aligned}\]

Therefore $\|Y_\cdot\|_2<\infty$ for any fixed $\rho\in(0,1)$, and we can indeed apply Lemma 1 to bound the sum of the $Y_t$, which is of order $CQ\sqrt{zT}$ with exception probability another $e^{-z}$.

Finally, we bound the bias term $\mathbb{E}[\widehat{\sigma}_t^2]/\sigma_t^2-1$. Note that

\[\begin{aligned}
0&=\sum_{s=t}^{t+m}w_{s,t}\mathbb{E}\bigg[\min\bigg(|X_s^2-\widehat{\sigma}_t^2|,\frac{c_t}{w_{s,t}}\bigg)\mathrm{sgn}(X_s^2-\widehat{\sigma}_t^2)\bigg]\\
&=\sum_{s=t}^{t+m}w_{s,t}\mathbb{E}\bigg[(X_s^2-\widehat{\sigma}_t^2)I\bigg(|X_s^2-\widehat{\sigma}_t^2|\leq\frac{c_t}{w_{s,t}}\bigg)+\frac{c_t}{w_{s,t}}\,\mathrm{sgn}(X_s^2-\widehat{\sigma}_t^2)I\bigg(|X_s^2-\widehat{\sigma}_t^2|>\frac{c_t}{w_{s,t}}\bigg)\bigg]\\
&=\delta_{0,t}+\mathbb{E}[X_t^2-\widehat{\sigma}_t^2]+\sum_{s=t}^{t+m}w_{s,t}\mathbb{E}\bigg[-(X_s^2-\widehat{\sigma}_t^2)I\bigg(|X_s^2-\widehat{\sigma}_t^2|>\frac{c_t}{w_{s,t}}\bigg)+\frac{c_t}{w_{s,t}}\,\mathrm{sgn}(X_s^2-\widehat{\sigma}_t^2)I\bigg(|X_s^2-\widehat{\sigma}_t^2|>\frac{c_t}{w_{s,t}}\bigg)\bigg]\,.
\end{aligned}\]

Similar to the proof of Theorem 3, the $s$-th component of the third term can be bounded by $2w_{s,t}\mathbb{E}[(X_{s}^{2}-\widehat{\sigma}_{t}^{2})^{2}]/c_{t}\leq 2w_{s,t}(2\kappa_{s}+2(\sigma_{s}^{2}-\sigma_{t}^{2})^{2}+2\mathbb{E}[(\sigma_{t}^{2}-\widehat{\sigma}_{t}^{2})^{2}])/c_{t}$. Therefore

\begin{aligned}
\frac{\mathbb{E}[\widehat{\sigma}_{t}^{2}]}{\sigma_{t}^{2}}-1 &\leq\frac{4}{\sigma_{t}^{2}}\sum_{s=t}^{t+m}w_{s,t}^{2}(\kappa_{s}+\mathbb{E}[(\sigma_{t}^{2}-\widehat{\sigma}_{t}^{2})^{2}])/c_{t}+\frac{\delta_{0,t}+4\delta_{1,t}}{\sigma_{t}^{2}}\\
&=4\sqrt{\frac{\kappa^{\dagger}_{t}z}{\sigma_{t}^{4}T}}+\frac{\delta_{0,t}+4\delta_{1,t}}{\sigma_{t}^{2}}+\frac{4\mathbb{E}[(\sigma_{t}^{2}-\widehat{\sigma}_{t}^{2})^{2}]}{\sigma_{t}^{4}}\sqrt{\frac{\sigma_{t}^{4}z}{\kappa^{\dagger}_{t}T}}.
\end{aligned}

Furthermore, it is not hard to show that $\mathbb{E}[(\sigma_{t}^{2}-\widehat{\sigma}_{t}^{2})^{2}]$ is bounded, using Lemma 2. Therefore, the bias is indeed of order $\sqrt{z/T}$, up to the additional approximation error $\frac{\delta_{0,t}+4\delta_{1,t}}{\sigma_{t}^{2}}$. The proof is now complete. ∎
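For concreteness, the sample analogue of the first-order condition above can be solved by bisection, since the weighted Huber score is non-increasing in the volatility argument. The sketch below is illustrative only: the exponential weights, the robustification level $c$, and the simulated heavy-tailed data are assumptions, not the paper's tuning-free procedure.

```python
import numpy as np

def weighted_huber_score(theta, x2, w, c):
    # sample version of sum_s w_s * min(|X_s^2 - theta|, c/w_s) * sgn(X_s^2 - theta)
    r = x2 - theta
    return np.sum(w * np.minimum(np.abs(r), c / w) * np.sign(r))

def solve_sigma2(x2, w, c, tol=1e-10):
    # the score is non-increasing in theta, so bisection on [min(x2), max(x2)] works
    lo, hi = float(x2.min()), float(x2.max())
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weighted_huber_score(mid, x2, w, c) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
x2 = rng.standard_t(df=5, size=500) ** 2       # heavy-tailed squared returns
w = 0.97 ** np.arange(500); w /= w.sum()       # exponential weights (assumed)
print(solve_sigma2(x2, w, c=5.0 * w.max()))    # robust volatility estimate
```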

Proof of Theorem 6.

Following the same proof as that of Theorem 1, we have

\mathbb{P}\big(|h_{t}-\sigma_{t}^{2}|\leq 4\sqrt{\kappa^{\ddagger}_{t}z/n^{\ddagger}_{\mathrm{eff}}}\big)\geq 1-2e^{-z+|\Delta_{0,t}|\sqrt{n^{\ddagger}_{\mathrm{eff}}z/\kappa^{\ddagger}_{t}}+2\Delta_{1,t}n^{\ddagger}_{\mathrm{eff}}z/\kappa^{\ddagger}_{t}}.

Conditioning on this event, $L$ is Lipschitz in the second argument, and the error bound is thus enlarged by a factor of $B$. ∎

Proof of Theorem 7.

Recall that

\begin{aligned}
\text{II} &\leq\mathbb{P}\bigg(\bigg|\frac{1}{T}\sum_{t=1}^{T}(L(\sigma_{t}^{2},h_{t})-\mathbb{E}[L(\sigma_{t}^{2},h_{t})])\bigg|>\frac{\varepsilon}{2}\bigg)+\mathbb{P}\bigg(\bigg|\frac{1}{T}\sum_{t=1}^{T}(C(h_{t})-\mathbb{E}[C(h_{t})])(\widehat{\sigma}_{t}^{2}-\sigma_{t}^{2})\bigg|>\frac{\varepsilon}{2}\bigg)\\
&=:\Delta_{A}+\Delta_{B}\,.
\end{aligned}

Now let us prove (i). We first bound $\Delta_{A}$, again applying Lemma 1 for the concentration. Let $Y_{t}=L(\sigma_{t}^{2},h_{t})-\mathbb{E}[L(\sigma_{t}^{2},h_{t})]$. From the proof of Theorem 6, we know that

\mathbb{P}\big(|h_{t}-\sigma_{t}^{2}|\leq 4\sqrt{\kappa^{\ddagger}_{t}z/n^{\ddagger}_{\mathrm{eff}}}\big)\geq 1-2e^{-z/2}\,.

So $|h_{t}-\sigma_{t}^{2}|\leq 4\sqrt{\kappa^{\ddagger}_{t}z/n^{\ddagger}_{\mathrm{eff}}}\leq\sigma_{t}^{2}/2$ with exceptional probability $2e^{-z/2}$. Applying the union bound, we get $|h_{t}-\sigma_{t}^{2}|\leq\sigma_{t}^{2}/2$ for all $t$ with probability at least $1-2Te^{-z/2}$. On this event, we have $|Y_{t}|\leq C$ because $|L(\sigma_{t}^{2},h_{t})-L(\sigma_{t}^{2},\sigma_{t}^{2})|\leq B(\sigma_{t}^{2}/2)|h_{t}-\sigma_{t}^{2}|\leq B(\sigma_{t}^{2}/2)\sigma_{t}^{2}/2<\infty$. Similar to the previous proofs, except that now we look at the data backward, for $j\geq 0$,

\begin{aligned}
\gamma_{j}^{2} &=\max_{t}\mathbb{E}\Big\{[L(\sigma_{t}^{2},h_{t}(X_{t-1},\dots,X_{t-m}))-L(\sigma_{t}^{2},h_{t}(X_{t-1},\dots,X_{t-j-1}^{\prime},\dots,X_{t-m}))]^{2}\Big\}\\
&\leq\max_{t}B^{2}(\sigma_{t}^{2}/2)\max_{t}\mathbb{E}\Big\{[h_{t}(X_{t-1},\dots,X_{t-m})-h_{t}(X_{t-1},\dots,X_{t-j-1}^{\prime},\dots,X_{t-m})]^{2}\Big\}\\
&\leq 4\nu_{-j,1}^{2}\max_{t}B^{2}(\sigma_{t}^{2}/2)\frac{(\alpha+1)^{2}}{\alpha^{2}}\max_{t}\kappa_{t}\,.
\end{aligned}

The last inequality can be shown similarly to the proof of Theorem 5, using the assumption that the Huber loss is locally $\alpha$-strongly convex. Therefore, $\|Y_{\cdot}\|_{2}<\infty$ for any fixed $\rho\in(0,1)$. We apply Lemma 1 to $Y_{t}$ and again pick $y=C\sqrt{zT}$ to make the exceptional probability $2e^{-z}$, which gives $\mathbb{P}(|\Delta_{A}|\leq C\sqrt{z/T})\geq 1-2(T+1)e^{-z}$.

Now let us apply Lemma 1 to $\Delta_{B}$. Let $Y_{t}=(C(h_{t})-\mathbb{E}[C(h_{t})])(\widehat{\sigma}_{t}^{2}-\sigma_{t}^{2})$. Since $C(\cdot)$ is non-increasing and $h_{t}\geq\sigma_{t}^{2}/2$ on the above event, $C(h_{t})\leq C(\sigma_{t}^{2}/2)$ is bounded. If we use the second or third proxy, the proofs of Theorems 4 and 5 have shown that $|\widehat{\sigma}_{t}^{2}-\sigma_{t}^{2}|\leq C\sqrt{z}$ for all $t$ with exceptional probability at most $e^{-z}$. Therefore, we conclude $|Y_{t}|\leq C\sqrt{z}$ for all $t$ with exceptional probability at most $e^{-z}$. Now, for bounding $\gamma_{j}$, note that $Y_{t}$ is actually a function of $X_{t-m},\dots,X_{t-1},X_{t},\dots,X_{t+m}$, with the first $m$ data points constructing the predictor and the remaining $m+1$ constructing the proxy. Hence, for $j<m$, it is not hard to show $\gamma_{j}^{2}\leq C\nu_{-(m-j),1}^{2}$ for some $C>0$, and for $m\leq j\leq 2m$, $\gamma_{j}^{2}\leq Cw_{j-m,0}^{2}$. So we have $\|Y_{\cdot}\|_{2}\leq 2\sqrt{C}<\infty$. Applying Lemma 1 again gives $\mathbb{P}(|\Delta_{B}|\leq C\sqrt{z/T})\geq 1-3e^{-z}$.

Combining the results for $\Delta_{A}$ and $\Delta_{B}$ and choosing $\varepsilon=C\sqrt{z/T}$ for large enough $C$, we conclude (i) for bounding II.
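The practical content of (i) is that ranking volatility predictors by their empirical loss against a robust proxy should agree with ranking them against the unobservable true volatility. The Monte Carlo sketch below illustrates this; the squared loss, the EWMA and naive predictors, the clipping level, and the data-generating process are all assumptions of the sketch, not the paper's specification.

```python
import numpy as np

# Rank an EWMA predictor h1 against a noisier naive predictor h2, once via a
# clipped forward-looking proxy and once via the true volatility; the two
# rankings should coincide.
rng = np.random.default_rng(3)
T, m = 5000, 60
sigma2 = 1.0 + 0.5 * np.sin(np.arange(T + m + 1) / 200.0)    # slowly varying truth
X = rng.standard_t(df=5, size=T + m + 1) * np.sqrt(sigma2 * 3.0 / 5.0)
x2 = X ** 2                                                   # Var(X_t) = sigma2_t

w = 0.94 ** np.arange(1, m + 1); w /= w.sum()   # backward predictor weights
v = 0.94 ** np.arange(m + 1);    v /= v.sum()   # forward proxy weights
loss1 = loss2 = oracle1 = oracle2 = 0.0
for t in range(m, T):
    h1 = np.dot(w, x2[t - m:t][::-1])           # EWMA over the past m points
    h2 = x2[t - 1]                              # naive last-squared-return
    proxy = np.dot(v, np.minimum(x2[t:t + m + 1], 10.0 / v))  # clipped proxy
    loss1 += (proxy - h1) ** 2; loss2 += (proxy - h2) ** 2
    oracle1 += (sigma2[t] - h1) ** 2; oracle2 += (sigma2[t] - h2) ** 2
print(loss1 < loss2, oracle1 < oracle2)         # expect: True True
```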

Next we prove (ii) and (iii). The proofs follow exactly the same arguments as (i), except for a few bounding details.

Firstly, in bounding $\Delta_{A}$, we need $L(\sigma_{t}^{2},h_{t})$ to be bounded. In (iii), $h_{t}$ is not necessarily bounded, but we directly work with a bounded loss $L(\sigma_{t}^{2},h_{t})\leq M_{0}$. In (ii), $h_{t}\leq M$ and we claim $h_{t}\geq\sigma_{t}^{2}/2$ with probability at least $1-2e^{-z}$, so that $L(\sigma_{t}^{2},h_{t})\leq B_{t}(M)M$. To see why the claim holds, define $\check{h}_{t}=\sum_{s=t-m}^{t-1}\nu_{s,t}\min(X_{s}^{2},\tau_{t}/\nu_{s,t})$. The robust predictor proposed in Section 4.1 attains its rate with the central fourth moment, while $\check{h}_{t}$ attains the same rate of convergence with the absolute fourth moment; this mirrors the difference between the second and third proxy options. Similar to the proof of Theorem 6, we can show

\mathbb{P}\Big(|\check{h}_{t}-\sigma_{t}^{2}|\leq 4\sqrt{\tilde{\kappa}^{\ddagger}_{t}z/n^{\ddagger}_{\mathrm{eff}}}\leq\sigma_{t}^{2}/2\Big)\geq 1-2e^{-z/2}\,.

So we know that $\check{h}_{t}\geq\sigma_{t}^{2}/2$ with probability at least $1-2e^{-z}$. Interestingly, $h_{t}\geq\check{h}_{t}$, so we always have good control of the left tail, thanks to the positivity of the squared data.
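The monotonicity $h_{t}\geq\check{h}_{t}$ can also be verified numerically: the weighted Huber score evaluated at $\check{h}_{t}$ is non-negative, and the score is non-increasing, so the minimizer lies to the right of $\check{h}_{t}$. The sketch below assumes, for illustration, that $h_{t}$ is a weighted-Huber-type minimizer with clipping levels $\tau_{t}/\nu_{s,t}$; the weights and $\tau_{t}$ are arbitrary choices.

```python
import numpy as np

def huber_score(h, x2, nu, tau):
    # weighted Huber score: sum_s nu_s * min(|X_s^2 - h|, tau/nu_s) * sgn(X_s^2 - h)
    r = x2 - h
    return np.sum(nu * np.minimum(np.abs(r), tau / nu) * np.sign(r))

rng = np.random.default_rng(1)
ok = True
for _ in range(1000):
    x2 = rng.standard_t(df=5, size=50) ** 2
    nu = 0.94 ** np.arange(50); nu /= nu.sum()       # illustrative weights
    tau = 3.0 * nu.max()                             # illustrative clipping level
    h_check = np.sum(nu * np.minimum(x2, tau / nu))  # clipped-data predictor
    # non-negative score at h_check implies the Huber minimizer is >= h_check
    ok &= bool(huber_score(h_check, x2, nu, tau) >= -1e-12)
print("h_t >= h_check_t in all trials:", ok)
```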

Secondly, in bounding $\Delta_{A}$, we need $\gamma_{j}^{2}\leq C\nu_{-j,1}^{2}$. In (iii), since the loss is Lipschitz over the whole region, we can easily see that $\gamma_{j}^{2}\leq 4\nu_{-j,1}^{2}B_{0}^{2}\max_{t}\tilde{\kappa}_{t}$. In (ii), the loss is Lipschitz only in the local region $\sigma_{t}^{2}/2\leq h_{t}\leq M$, but we know that with high probability the clipped predictor indeed falls into this region, so $\gamma_{j}^{2}\leq 4\nu_{-j,1}^{2}\max_{t}B_{t}^{2}(M)\max_{t}\tilde{\kappa}_{t}$. Therefore, we have no problem bounding $\Delta_{A}$.

Thirdly, in bounding $\Delta_{B}$, we require $C(h_{t})$ to be bounded. Since $h_{t}\geq\sigma_{t}^{2}/2$ with high probability, we have $C(h_{t})\leq C(\sigma_{t}^{2}/2)$ even when we use the non-robust predictors in (ii) and (iii). So we can bound $\Delta_{B}$ as desired too.

Finally, putting everything together and choosing $\varepsilon=C\sqrt{z/T}$ for large enough $C$, we conclude (ii) and (iii) for bounding II. ∎