
Volatility prediction comparison via robust volatility proxies: An empirical deviation perspective

Weichen Wang, Ran An, Ziwei Zhu
Innovation and Information Management, HKU Business School
Department of Statistics, University of Michigan
Address: Innovation and Information Management, HKU Business School, The University of Hong Kong, Hong Kong. E-mail: [email protected]. Address: Department of Statistics, University of Michigan, Ann Arbor, MI 48109, USA. E-mail: [email protected], [email protected]. The research was partially supported by NSF grant DMS-2015366.
Abstract

Volatility forecasting is crucial to risk management and portfolio construction. One particular challenge of assessing volatility forecasts is how to construct a robust proxy for the unknown true volatility. In this work, we show that the empirical loss comparison between two volatility predictors hinges on the deviation of the volatility proxy from the true volatility. We then establish non-asymptotic deviation bounds for three robust volatility proxies, two of which are based on clipped data, and the third of which is based on exponentially weighted Huber loss minimization. In particular, in order for the Huber approach to adapt to non-stationary financial returns, we propose to solve a tuning-free weighted Huber loss minimization problem to jointly estimate the volatility and the optimal robustification parameter at each time point. We then inflate this robustification parameter and use it to update the volatility proxy to achieve optimal balance between the bias and variance of the global empirical loss. We also extend this Huber method to construct volatility predictors. Finally, we exploit the proposed robust volatility proxy to compare different volatility predictors on the Bitcoin market data. It turns out that when the sample size is limited, applying the robust volatility proxy gives more consistent and stable evaluation of volatility forecasts.

Keywords: Volatility forecasting, Robust loss function, Huber minimization, Risk management, Crypto market.

1 Introduction

Volatility forecasting is a central task for financial practitioners, who care to understand the risk levels of their financial instruments or portfolios. There have been countless studies on improving volatility modeling for financial time series, including the famous ARCH/GARCH model for better modeling volatility clustering, its many variants, and more general stochastic volatility models (Engle, 1982; Bollerslev, 1986; Baillie et al., 1996; Taylor, 1994), as well as on proposing better volatility predictors under different model settings and objectives (Poon and Granger, 2003; Brailsford and Faff, 1996; Andersen et al., 2005; Brooks and Persand, 2003; Christoffersen and Diebold, 2000). This list of volatility forecasting literature is only illustrative and far from complete, given the large body of research on this topic.

The prediction ideas range from the simplest Exponentially Weighted Moving Average (EWMA) (Taylor, 2004), which is adopted by J. P. Morgan’s RiskMetrics, to more complicated time series and volatility models including GARCH (Brandt and Jones, 2006; Park, 2002), to option-based or macro-based volatility forecasting (Lamoureux and Lastrapes, 1993; Vasilellis and Meade, 1996; Christiansen et al., 2012), and to more advanced machine learning techniques such as the nearest neighbor truncation (Andersen et al., 2012) and Recurrent Neural Networks (RNN) (Guo et al., 2016). Correspondingly, the underlying model assumption ranges from mere smoothness of nearby volatilities, to different versions of GARCH, to the Black-Scholes model (Black and Scholes, 2019) and its complicated extensions. The data distribution assumption can also vary: data may be normally distributed, heavy-tailed with a known distribution, e.g., the t-distribution, or generally non-normal. When data are generally non-normal, researchers have proposed to use quasi maximum likelihood estimation (QMLE) (Bollerslev and Wooldridge, 1992; Charles and Darné, 2019; Carnero et al., 2012) and its robust standard error for inference, but the theoretical results are typically asymptotic. Despite its good theoretical guarantees, industry practitioners seldom apply QMLE and tend to employ the naive approach of truncating the returns at an ad-hoc level and then applying EWMA.

In this work, we consider a model assumption requiring only smoothness of volatilities. For simplicity, we also assume the volatility time series is given a priori, and that, after conditioning on the volatilities, the return innovations are independent. We choose this simple setting for the following reasons. Firstly, our main focus of study is on building effective robust proxies rather than testing volatility models and constructing fancy volatility predictors. Secondly, although we ignore the weak dependency between return innovations (think of ARMA models (Brockwell and Davis, 2009) for weak dependency), the EWMA predictors and proxies can still have strong temporal dependency due to the data overlapping of a rolling window, so our analysis is still nontrivial. Also note that we allow the return time series to be non-stationary. Thirdly, the motivating example for us is volatility forecasting for the crypto market. Charles and Darné (2019) applied several versions of GARCH models characterized by short memory, asymmetric effects, or long-run and short-run movements, and concluded that none of them seems appropriate for modeling Bitcoin returns. Therefore, starting from conditionally independent data without imposing an overly detailed model, e.g., GARCH, may be a good general starting point for the study of robust proxies.

Besides the naive EWMA predictor as our comparison benchmark, we consider a type of robust volatility predictor for when the instrument returns exhibit heavy tails in their distributions. Specifically, we only require the returns to have a finite fourth moment. We consider weighted Huber loss minimization, which turns out to be a nontrivial extension of the equal-weighted Huber loss minimization. To achieve the desired rate of convergence, the optimal Huber truncation level for each sample should also depend on the sample weight. In addition, we apply a tuning-free approach following Wang et al. (2020) to tune the Huber truncation level adaptively and automatically. Unlike QMLE, our results focus on non-asymptotic empirical deviation bounds. Therefore, although the main contribution of the paper is on robust proxy construction, we also claim a separate contribution on applying Huber minimization in the EWMA fashion.

Now, given two volatility predictors, evaluating their performance is often quite challenging due to two things: (1) the selection of loss functions, and (2) the selection of proxies, since obviously we cannot observe the true volatilities. The selection of loss functions has been studied by Patton (2011). In Patton’s insightful paper, he defined a class of robust losses with the ideal property that, for any unbiased proxy, the ranking of two predictors using one of the robust losses will always be consistent in terms of the long-run expectation. This property is desirable because it tells risk managers to select a robust loss and then not worry much about designing proxies. As long as the proxy is unbiased, everything should just work out. Commonly used robust losses include the mean-squared error (MSE) and the quasi-likelihood loss (QL). However, there is one weakness of Patton’s approach which has not been emphasized much in previous literature: the evaluation has to be in long-run expectation. The deviation of the empirical loss, which is what people actually use in practice, from the expected loss may still cause a large variance due to a bad choice of volatility proxy. Put another way, his theory does not tell risk managers how much an empirical loss can differ from its expected counterpart.

In this work, our main message is that besides the selection of a robust loss, the selection of a good proxy also matters for effective comparison of predictors, especially when the sample size $T$ is not large enough. For a single time point, we show that the probability of making a false comparison can be very high. So the natural question is: by averaging the performance comparison over $T$ time points, are we able to get a faithful comparison of two predictors with high probability, so that the empirical loss ranking does reflect the population loss ranking? The answer is that we need robust proxies in order to have this kind of guarantee.

We propose three robust proxies and compare them. The first choice uses the clipped squared return at the single current time $t$ as the proxy. This may be the simplest practical choice of a robust proxy. However, it cannot achieve the desired convergence in terms of empirical risk comparison, due to the large variance of using only a single time point. The second option mimics the EWMA proxy: we now clip and average over multiple time points close to $t$. To find the proper clipping, we first run the EWMA tuning-free Huber loss minimization on local data for each time $t$. This gives a truncation level adaptive to the unknown volatility. The clipping bound is then rescaled to reflect the total sample size. According to the literature on Huber minimization (Catoni, 2012; Fan et al., 2016; Sun et al., 2020), the truncation level needs to scale with the square root of the sample size to balance the bias and variance optimally. Therefore, it is natural to rescale the clipping bound by the square root of the ratio of the total sample size $T$ to the local effective sample size. The third proxy exactly solves the EWMA Huber minimization, again with the rescaled truncation. Compared to the first and second proxies, this gives a further improvement in the deviation bound of the proxy, which now depends on the central kurtosis rather than the absolute kurtosis. We will illustrate the above claims in more detail in later sections.

The Huber loss minimization was proposed by Huber (1964) under Huber’s $\epsilon$-contamination model, and its asymptotic properties were studied in Huber (1973). At that time, the truncation level was set as fixed according to the $95\%$ asymptotic efficiency rule, and “robustness” meant achieving minimax optimality under the $\epsilon$-contamination model (Chen et al., 2018). More recently, Huber’s M-estimator has been revisited in the regression setting under the assumption of general heavy-tailed distributions (Catoni, 2012; Fan et al., 2016). Here “robustness” slightly changes its meaning to achieving a sub-Gaussian non-asymptotic deviation bound under the heavy-tailed data assumption. In this setting, the truncation level grows with the sample size, and the resulting M-estimator is still asymptotically unbiased even when the data distribution is asymmetric. Huber’s estimator fits the goal of robust volatility prediction and robust proxy construction very well, as squared returns indeed have asymmetric distributions. Since Catoni (2012), a new literature revealing deeper understanding of Huber’s M-estimator has sprung up. For example, Sun et al. (2020) proved the necessity of a finite fourth moment for volatility estimation if we hope to achieve a sub-Gaussian type of deviation bound; Wang et al. (2020) proposed the tuning-free Huber procedure; Chen et al. (2018) and Minsker (2018) extended the Huber methodology to robust covariance matrix estimation.

Robustness is indeed an important concern for real-data volatility forecasting. It has been widely observed that financial returns have fat tails. When it comes to crypto markets, e.g., the Bitcoin product (BTC), the issue gets more serious, as crypto traders frequently experience huge jumps in the BTC price. For example, BTC plummeted more than 20% in a single day in March 2020. The lack of government regulation probably leaves the market far from efficient. This poses a stronger need for robust methodology to estimate and forecast volatility in crypto markets. Some recent works include Catania et al. (2018); Trucíos (2019); Charles and Darné (2019).

With the BTC returns, we will compare the non-robust EWMA predictor with the robust Huber predictor, with different decays, and evaluate their performance using the non-robust forward EWMA proxy and the robust forward Huber proxy. Both the predictors and the proxies will be rolled forward and compared at the end of each day. We apply two robust losses, MSE and QL, to evaluate their performance. Interestingly, we will see that when the sample size $T$ is large, our proposed robust proxy is very close to the forward EWMA proxy, and both lead to sensible and similar comparisons. However, when $T$ is small, the non-robust proxy can lead to a higher probability of drawing wrong conclusions, whereas the robust proxy, which automatically adapts to the total sample size and the time-varying volatilities, still works as expected. This matches our theoretical findings and provides new insights into applying robust proxies for practical risk evaluation.

The rest of the paper is organized as follows. In Section 2, we first review the definition of robust loss by Patton (2011) and explain our analyzing strategy for high probability bound of the empirical loss. We bridge the empirical loss and the unconditional expected loss, by the conditional expected loss conditioning on proxies. In Section 3, we propose three robust proxies and prove that they can all achieve the correct ranking with high probability, if measured by the conditional expected loss. However, the proxy based on Huber loss minimization will have the smallest probability of making false comparison, if measured by the empirical loss. In Section 4, we will discuss robust predictors and see why the above claim is true and why comparing robust predictors with non-robust predictors can be a valid thing to do. Simulation studies as well as an interesting case study on BTC volatility forecasting are presented in Section 5. We finally conclude the paper with some discussions in Section 6. All the proofs are relegated to the appendix.

2 Evaluation of volatility forecast

In this section, we first review the key conclusions of Patton (2011) on robust loss functions for volatility forecast comparison. We then use examples to see why we also care about the randomness from proxy deviation beyond picking a robust loss.

2.1 Robust loss functions

Suppose we have a time series of returns $(X_{i})_{-\infty<i<\infty}$ of a financial instrument. Let $\mathcal{F}_{t-1}$ denote the $\sigma$-algebra generated from $(X_{i})_{i\leq t-1}$. Consider a volatility predictor $h_{t}$, computed at time $t$ based on $\mathcal{F}_{t-1}$, that targets $\sigma^{2}_{t}:=\mathrm{var}(X_{t})$. We use a loss function $L(\sigma^{2}_{t},h_{t})$ to gauge the prediction error of $h_{t}$. In practice, we never observe $\sigma^{2}_{t}$; therefore, in order to evaluate the loss function $L(\sigma^{2}_{t},h_{t})$, we have to substitute $\sigma_{t}^{2}$ therein with a proxy $\widehat{\sigma}_{t}^{2}$, which is computed based on $\mathcal{G}_{t}$, the $\sigma$-algebra generated from the future returns $(X_{t},\dots,X_{\infty})$.

Following Patton (2011), to achieve reliable evaluation of volatility forecasts, we wish to have the loss function $L$ satisfy the following three desirable properties:

  • (a)

    Mean-pursuit: $h_{t}^{*}:=\operatorname*{argmin}_{h\in\mathcal{H}}\mathbb{E}[L(\widehat{\sigma}_{t}^{2},h)|\mathcal{F}_{t-1}]=\mathbb{E}[\widehat{\sigma}_{t}^{2}|\mathcal{F}_{t-1}]$. This says that the optimal predictor is exactly the conditional expectation of the proxy.

  • (b)

    Proxy-robust: Given any two predictors $h_{1t}$ and $h_{2t}$ and any unbiased proxy $\widehat{\sigma}_{t}^{2}$, i.e., $\mathbb{E}[\widehat{\sigma}_{t}^{2}|\mathcal{F}_{t-1}]=\sigma_{t}^{2}$, we have $\mathbb{E}\{L(\sigma_{t}^{2},h_{1t})\}\leq\mathbb{E}\{L(\sigma_{t}^{2},h_{2t})\}\iff\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})\}\leq\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})\}$. This means that the forecast ranking is robust to the choice of the proxy.

  • (c)

    Homogeneous: $L$ is a homogeneous loss function of order $k$, i.e., $L(a\sigma^{2},ah)=a^{k}L(\sigma^{2},h)$ for any $a>0$. This ensures that the ranking of two predictors is invariant to the re-scaling of data.

Define the mean squared error (MSE) loss and quasi-likelihood (QL) loss as

\mathrm{MSE}(\sigma^{2},h)=(\sigma^{2}-h)^{2} \quad\text{and}\quad \mathrm{QL}(\sigma^{2},h)=\frac{\sigma^{2}}{h}-\log\bigg(\frac{\sigma^{2}}{h}\bigg)-1   (2.1)

respectively. Here the QL loss can be viewed, up to an affine transformation, as the negative log-likelihood function of $X$ that follows $\mathcal{N}(0,h)$ when we observe that $X^{2}=\sigma^{2}$. Besides, QL is always positive, and a Taylor expansion gives $\mathrm{QL}(\sigma^{2},h)\approx(\sigma^{2}/h-1)^{2}/2$ when $\sigma^{2}/h$ is around $1$. Patton (2011) shows that among many commonly used loss functions, MSE and QL are the only two that satisfy all three properties above. Specifically, Proposition 1 of Patton (2011) says that given that $L$ satisfies property (a) and some regularity conditions, $L$ further satisfies property (b) if and only if $L$ takes the form:

L(\sigma^{2},h)=\tilde{C}(h)+B(\sigma^{2})+C(h)(\sigma^{2}-h),   (2.2)

where $C(h)$ is the derivative function of $\tilde{C}(h)$ and is monotonically decreasing. Proposition 2 in Patton (2011) establishes that MSE is the only proxy-robust loss that depends on $\sigma^{2}-h$ and that QL is the only proxy-robust loss that depends on $\sigma^{2}/h$. Finally, Proposition 4 in Patton (2011) gives the entire family of proxy-robust and homogeneous loss functions, which include QL and MSE (MSE and QL are homogeneous of order 2 and 0, respectively). Given such nice properties of MSE and QL, we mainly use MSE and QL to evaluate and compare volatility forecasts throughout this work.
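To make the two losses concrete, here is a minimal Python sketch of (2.1) together with a numerical check of the homogeneity orders; the function names and test values are ours, for illustration only.

```python
import numpy as np

def mse_loss(sigma2, h):
    """MSE loss from (2.1); proxy-robust and homogeneous of order 2."""
    return (sigma2 - h) ** 2

def ql_loss(sigma2, h):
    """QL loss from (2.1); proxy-robust and homogeneous of order 0."""
    r = sigma2 / h
    return r - np.log(r) - 1.0

# Check the homogeneity orders: rescaling both arguments by a > 0
# multiplies MSE by a^2 and leaves QL unchanged.
s2, h, a = 0.04, 0.05, 3.0
assert np.isclose(mse_loss(a * s2, a * h), a**2 * mse_loss(s2, h))
assert np.isclose(ql_loss(a * s2, a * h), ql_loss(s2, h))
```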

2.2 The empirical deviation perspective

Besides selecting a robust loss as Patton (2011) suggested, one also has to nail down the proxy selection for the prediction loss computation. Patton (2011)’s framework did not separate the randomness of the predictors from that of the proxies, and the proxy-robust property (b) compares two predictors in long-term unconditional expectation, which averages over both sources of randomness. However, it is not clear from Patton (2011), for a given proxy, what the probability is of ending up with a wrong comparison of two predictors. How does the random deviation of a proxy affect the comparison? Can some proxies outperform others by having a smaller probability of making mistakes in finite samples?

In practice, one has to use the empirical risk to approximate the expected risk when evaluating volatility forecasts. This exposes one important issue that property (b) neglects: property (b) concerns only the expected risk and ignores the deviation of the empirical risk from its expectation. Such empirical deviation is further exacerbated by replacing the true volatility with its proxies, jeopardizing accurate evaluation of volatility forecasts. Our strategy of analysis is as follows: we first link the empirical risk to the conditional risk (conditioning on the selected proxy), claiming that they are close with high probability (see formal arguments in Section 4), and then study the relationship between comparisons under the unconditional risk and the conditional risk.

Specifically, we are interested in comparing the accuracy of two series of volatility forecasts $\{h_{1t}\}_{t\in[T]}$ and $\{h_{2t}\}_{t\in[T]}$. For notational convenience, we drop the subscript “$t\in[T]$” when we refer to a time series unless specified otherwise. Define $\eta_{t}=\mathbb{E}\{L(\sigma_{t}^{2},h_{2t})-L(\sigma_{t}^{2},h_{1t})\}>0$. Without loss of generality, suppose that $\{h_{1t}\}$ outperforms $\{h_{2t}\}$ in terms of expected loss, i.e.,

\frac{1}{T}\sum_{t=1}^{T}\Big[\mathbb{E}L(\sigma_{t}^{2},h_{2t})-\mathbb{E}L(\sigma_{t}^{2},h_{1t})\Big]=\frac{1}{T}\sum_{t=1}^{T}\eta_{t}>0.   (2.3)

The empirical loss comparison can be decomposed into the conditional loss comparison plus the difference between the empirical loss and the conditional loss:

\frac{1}{T}\sum_{t=1}^{T}\Big(L(\widehat{\sigma}_{t}^{2},h_{2t})-L(\widehat{\sigma}_{t}^{2},h_{1t})\Big)=\frac{1}{T}\sum_{t=1}^{T}\Big[\Big(\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}\Big)
+\Big(L(\widehat{\sigma}_{t}^{2},h_{2t})-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}\Big)-\Big(L(\widehat{\sigma}_{t}^{2},h_{1t})-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}\Big)\Big].

Therefore, we study the following two probabilities for any $\varepsilon>0$:

\text{I}:=\mathbb{P}\bigg[\frac{1}{T}\sum_{t=1}^{T}\Big(\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}\Big)<\frac{1}{T}\sum_{t=1}^{T}\eta_{t}-\varepsilon\bigg],   (2.4)
\text{II}:=\mathbb{P}\bigg[\bigg|\frac{1}{T}\sum_{t=1}^{T}\Big(L(\widehat{\sigma}_{t}^{2},h_{t})-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{t})|\mathcal{G}_{t}\}\Big)\bigg|>\varepsilon\bigg].   (2.5)

We aim to select stable proxies that make I small, so that the probability of obtaining a false ranking of the empirical risk is small. Meanwhile, with a selected proxy, we hope II can be well controlled for the predictors we care to compare. Note that only the randomness from the proxy matters in I, so we can focus on proxy design by studying this quantity. We then make sure the difference between the empirical risk and the conditional risk is indeed small by studying II. The probability in II is with respect to both the proxy and the predictor. By following this analysis strategy, we separate the randomness of the predictor from that of the proxy, and eventually give results on empirical deviation rather than in expectation.

2.3 False comparison due to proxy randomness

To illustrate this issue, we first focus on a single time point $t$. We compare two volatility forecasts $h_{1t}$ and $h_{2t}$ satisfying $\mathbb{E}\{L(\sigma_{t}^{2},h_{1t})\}<\mathbb{E}\{L(\sigma_{t}^{2},h_{2t})\}$. We are interested in the probability of having a reversed rank of forecast precision between $h_{1t}$ and $h_{2t}$, conditioning on the selected proxy, i.e., $\mathbb{P}_{\mathcal{G}_{t}}\big[\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}>\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}\big]$. Note that this probability is with respect to the randomness of the proxy $\widehat{\sigma}_{t}^{2}$; in the sequel, we show that it may not be small for a general proxy. But if we can select a good proxy that controls this probability well, we can ensure a correct comparison with high probability.

Now consider MSE and QL as the loss functions, so that we can derive explicitly the condition for the empirical risk comparison to be consistent with the expected risk comparison. For simplicity, assume that $\mathcal{F}_{t-1}$ and $\mathcal{G}_{t}$ are independent. Recall that $\eta_{t}=\mathbb{E}\{L(\sigma_{t}^{2},h_{2t})-L(\sigma_{t}^{2},h_{1t})\}>0$. We wish to calculate $\mathbb{P}_{\mathcal{G}_{t}}[\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}<\eta_{t}-\varepsilon_{t}]$ for some $\varepsilon_{t}>|\eta_{t}|$, i.e., the probability that the forecast rank in conditional expectation is opposite to the rank in unconditional expectation. When $L$ is chosen to be MSE, we have

\eta_{t}=\mathbb{E}L(\sigma_{t}^{2},h_{2t})-\mathbb{E}L(\sigma_{t}^{2},h_{1t})=\mathbb{E}(h_{2t}^{2}-h_{1t}^{2})+2\sigma_{t}^{2}\mathbb{E}(h_{1t}-h_{2t})   (2.6)

and

\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}=\mathbb{E}(h_{2t}^{2}-h_{1t}^{2})+2\widehat{\sigma}_{t}^{2}\mathbb{E}(h_{1t}-h_{2t}).

Therefore,

\mathbb{P}_{\mathcal{G}_{t}}[\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}<\eta_{t}-\varepsilon_{t}]
=\mathbb{P}_{\mathcal{G}_{t}}\{2(\widehat{\sigma}_{t}^{2}-\sigma_{t}^{2})\mathbb{E}(h_{1t}-h_{2t})<-\varepsilon_{t}\}
=\mathbb{P}_{\mathcal{G}_{t}}[(\widehat{\sigma}_{t}^{2}/\sigma_{t}^{2}-1)\{\eta_{t}-\mathbb{E}(h_{2t}^{2}-h_{1t}^{2})\}<-\varepsilon_{t}].

For illustration purposes, consider a deterministic scenario where $h_{1t}=\sigma_{t}^{2}$ is the oracle predictor and $h_{2t}=\sigma_{t}^{2}+\sqrt{\eta_{t}}$ (so that (2.6) holds). Then

\mathbb{P}_{\mathcal{G}_{t}}\bigl[\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}<\eta_{t}-\varepsilon_{t}\bigr]=\mathbb{P}_{\mathcal{G}_{t}}\bigg(\frac{\widehat{\sigma}_{t}^{2}}{\sigma_{t}^{2}}-1>\frac{\varepsilon_{t}}{2\sigma_{t}^{2}\sqrt{\eta_{t}}}\bigg).

Similarly, if $h_{2t}=\sigma_{t}^{2}-\sqrt{\eta_{t}}$, we have

\mathbb{P}_{\mathcal{G}_{t}}\bigl[\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}<\eta_{t}-\varepsilon_{t}\bigr]=\mathbb{P}_{\mathcal{G}_{t}}\bigg(\frac{\widehat{\sigma}_{t}^{2}}{\sigma_{t}^{2}}-1<-\frac{\varepsilon_{t}}{2\sigma_{t}^{2}\sqrt{\eta_{t}}}\bigg).

We can see from the two equations above that a large deviation of $\widehat{\sigma}_{t}^{2}$ from $\sigma_{t}^{2}$ gives rise to inconsistency between forecast comparisons based on the empirical risk and the expected risk. When we choose $L$ to be QL, we have

\eta_{t}=\mathbb{E}L(\sigma_{t}^{2},h_{2t})-\mathbb{E}L(\sigma_{t}^{2},h_{1t})=\mathbb{E}(\log h_{2t}-\log h_{1t})+\sigma_{t}^{2}\mathbb{E}\bigg(\frac{1}{h_{2t}}-\frac{1}{h_{1t}}\bigg)

and that

\mathbb{P}_{\mathcal{G}_{t}}\bigl[\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}<\eta_{t}-\varepsilon_{t}\bigr]
=\mathbb{P}_{\mathcal{G}_{t}}\bigg\{(\widehat{\sigma}_{t}^{2}-\sigma_{t}^{2})\mathbb{E}\bigg(\frac{1}{h_{2t}}-\frac{1}{h_{1t}}\bigg)<-\varepsilon_{t}\bigg\}
=\mathbb{P}_{\mathcal{G}_{t}}\big[(\widehat{\sigma}_{t}^{2}/\sigma_{t}^{2}-1)\{\eta_{t}-\mathbb{E}(\log h_{2t}-\log h_{1t})\}<-\varepsilon_{t}\big].

Similarly, we consider a deterministic setup where $h_{1t}=\sigma_{t}^{2}$, and where $h_{2t}=m\sigma_{t}^{2}$ with a misspecified scale. To ensure that $\mathbb{E}L(\sigma_{t}^{2},h_{2t})-\mathbb{E}L(\sigma_{t}^{2},h_{1t})=\eta_{t}$, we have $\eta_{t}=1/m-\log(1/m)-1\approx(1/m-1)^{2}/2$ when $1/m\approx 1$. In this case, we deduce that

\mathbb{P}_{\mathcal{G}_{t}}\{\mathbb{E}(L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t})-\mathbb{E}(L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t})<\eta_{t}-\varepsilon_{t}\}=\mathbb{P}_{\mathcal{G}_{t}}\{(\widehat{\sigma}_{t}^{2}/\sigma_{t}^{2}-1)(\eta_{t}-\log m)<-\varepsilon_{t}\}
=\mathbb{P}_{\mathcal{G}_{t}}\{(\widehat{\sigma}_{t}^{2}/\sigma_{t}^{2}-1)(1/m-1)<-\varepsilon_{t}\}\approx\begin{cases}\mathbb{P}_{\mathcal{G}_{t}}\bigl\{\widehat{\sigma}_{t}^{2}/\sigma_{t}^{2}-1>\frac{\varepsilon_{t}}{\sqrt{2\eta_{t}}}\bigr\}&\text{if } m>1,\\ \mathbb{P}_{\mathcal{G}_{t}}\bigl\{\widehat{\sigma}_{t}^{2}/\sigma_{t}^{2}-1<-\frac{\varepsilon_{t}}{\sqrt{2\eta_{t}}}\bigr\}&\text{if } m\leq 1.\end{cases}

Similarly, we can see that the volatility forecast rank will be flipped once the deviation of $\widehat{\sigma}_{t}^{2}$ from $\sigma_{t}^{2}$ is large.

Note again that in the derivation above, the probability of reversing the expected forecast rank is evaluated at a single time point $t$, which is far from enough to yield a reliable comparison between volatility predictors. The common practice is to compute the empirical average loss of the predictors over time for their performance evaluation. Two natural questions arise: Does the empirical average completely resolve the instability of forecast evaluation due to the deviation of volatility proxies? If not, how should we robustify our volatility proxies to mitigate their empirical deviation?
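To see how often a single-point proxy can flip the ranking, here is a small Monte Carlo sketch of the MSE scenario above, with $h_{1t}=\sigma_{t}^{2}$ the oracle and $h_{2t}=\sigma_{t}^{2}+\sqrt{\eta_{t}}$; the $t$-distributed returns and all parameter values are hypothetical choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, eta = 1.0, 0.05          # true variance; expected-loss gap
h1 = sigma2                      # oracle predictor
h2 = sigma2 + np.sqrt(eta)       # inferior predictor, so that (2.6) holds

# Heavy-tailed returns: t(5) innovations rescaled to variance sigma2.
nu = 5
x = rng.standard_t(nu, size=100_000) * np.sqrt(sigma2 * (nu - 2) / nu)
proxy = x**2                     # single-point proxy: unbiased but noisy

# Frequency with which the conditional MSE ranking is reversed at one t.
flip = np.mean((proxy - h2) ** 2 < (proxy - h1) ** 2)
print(f"P(rank flipped at a single time point) ~ {flip:.2f}")  # far from negligible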

3 Robust volatility proxies

3.1 Problem setup

Our goal in this section is to construct robust volatility proxies $\{\widehat{\sigma}^{2}_{t}\}$ to ensure that $\{h_{1t}\}$ maintains its empirical superiority with high probability; more precisely, that $\mathbb{P}_{\mathcal{G}_{t}}\big[\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}<0\big]$ is small, given $\frac{1}{T}\sum_{t=1}^{T}\eta_{t}>0$. We first present our assumption on the data generating process.

Assumption 1.

Given the true volatility series $\{\sigma_{t}\}$, the instrument returns $\{X_{t}\}$ are independent with $\mathbb{E}X_{t}=0$ and $\mathrm{Var}(X_{t})=\sigma_{t}^{2}$. The central and absolute fourth moments of $X_{t}$, denoted by $\kappa_{t}=\mathbb{E}\{(X_{t}^{2}-\sigma_{t}^{2})^{2}\}$ and $\tilde{\kappa}_{t}=\mathbb{E}X_{t}^{4}$, are both finite.

Now we introduce some quantities that frequently appear in the sequel. At time $t$, define the smoothness parameters

\delta_{0,t}:=\sum_{s=t}^{t+m}w_{s,t}\sigma_{s}^{2}-\sigma_{t}^{2},\qquad \delta_{1,t}:=\sum_{s=t}^{t+m}w_{s,t}^{2}|\sigma_{s}^{2}-\sigma_{t}^{2}|^{2},   (3.1)
\Delta_{0,t}:=\sum_{s=t-m}^{t-1}\nu_{s,t}\sigma_{s}^{2}-\sigma_{t}^{2} \quad\text{ and }\quad \Delta_{1,t}:=\sum_{s=t-m}^{t-1}\nu_{s,t}^{2}|\sigma_{s}^{2}-\sigma_{t}^{2}|^{2},

where $w_{s,t}=\lambda^{s-t}/\sum_{j=t}^{t+m}\lambda^{j-t}$ is the forward exponential-decay weight at time $s$ from time $t$ with rate $\lambda$, and $\nu_{s,t}=\lambda^{t-1-s}/\sum_{j=t-m}^{t-1}\lambda^{t-1-j}$ is the backward exponential-decay weight with rate $\lambda$. These smoothness parameters characterize how fast the distribution of the volatility varies as time evolves, and our theory explicitly derives their impact. As we shall see, our robust volatility proxies yield desirable statistical performance as long as these smoothness parameters are small, meaning that the variation of the volatility distribution is slow. Besides, define the forward and backward effective sample sizes as

n^{\dagger}_{\mathrm{eff}}:=1\Big/\sum_{s=t}^{t+m}w_{s,t}^{2} \quad\text{and}\quad n^{\ddagger}_{\mathrm{eff}}:=1\Big/\sum_{s=t-m}^{t-1}\nu_{s,t}^{2}   (3.2)

respectively, and define the forward and backward exponential-weighted moving average (EWMA) of the central fourth moment as

\kappa^{\dagger}_{t}=\sum_{s=t}^{t+m}w_{s,t}^{2}\kappa_{s}\Big/\sum_{s=t}^{t+m}w_{s,t}^{2} \quad\text{and}\quad \kappa^{\ddagger}_{t}=\sum_{s=t-m}^{t-1}\nu_{s,t}^{2}\kappa_{s}\Big/\sum_{s=t-m}^{t-1}\nu_{s,t}^{2}   (3.3)

respectively. Similarly, we define $\tilde{\kappa}^{\dagger}_{t}$ and $\tilde{\kappa}^{\ddagger}_{t}$ as the forward and backward EWMA of the absolute fourth moment.

Consider a mean-pursuit and proxy-robust loss function that takes the form (2.2):

L(\sigma^{2},h)=\tilde{C}(h)+B(\sigma^{2})+C(h)(\sigma^{2}-h)=:f(h)+B(\sigma^{2})+C(h)\sigma^{2},

where we write $f(h)=\int_{a}^{h}C(x)\,dx-C(h)h$ for some constant $a$. When $C(h)=-2h$ and $f(h)=h^{2}$ ($a=0$), $L$ is MSE. When $C(h)=1/h$ and $f(h)=\log h$ ($a=e^{-1}$), $L$ becomes QL. Under Assumption 1, $h_{it}$ and $\widehat{\sigma}_{t}^{2}$ are independent for $i=1,2$. Therefore, $\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{it})|\mathcal{G}_{t}\}=\mathbb{E}[f(h_{it})]+B(\widehat{\sigma}_{t}^{2})+\mathbb{E}[C(h_{it})]\widehat{\sigma}_{t}^{2}$. Given (2.3), we wish to show that $\{h_{1t}\}$ outperforms $\{h_{2t}\}$ in conditional risk with high probability, i.e., that I is small. Recall that

\text{I}=\mathbb{P}\bigg[\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{2t})|\mathcal{G}_{t}\}-\mathbb{E}\{L(\widehat{\sigma}_{t}^{2},h_{1t})|\mathcal{G}_{t}\}<\frac{1}{T}\sum_{t=1}^{T}\eta_{t}-\varepsilon\bigg]
=\mathbb{P}\bigg[\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\{f(h_{2t})-f(h_{1t})\}+\mathbb{E}\{C(h_{2t})-C(h_{1t})\}\widehat{\sigma}_{t}^{2}<\frac{1}{T}\sum_{t=1}^{T}\eta_{t}-\varepsilon\bigg]
=\mathbb{P}\bigg[\frac{1}{T}\sum_{t=1}^{T}(\widehat{\sigma}_{t}^{2}/\sigma_{t}^{2}-1)(\eta_{t}-A_{t})<-\varepsilon\bigg],

where $\varepsilon>0$ is a deviation parameter that may exceed $T^{-1}\sum_{t\in[T]}\eta_{t}$, where $A_{t}:=\mathbb{E}\{f(h_{2t})-f(h_{1t})\}$, and where the last equality follows from the fact that $\eta_{t}=\mathbb{E}[f(h_{2t})-f(h_{1t})]+\mathbb{E}[C(h_{2t})-C(h_{1t})]\sigma_{t}^{2}$.

3.2 Exponentially weighted Huber estimator

We first review the tuning-free adaptive Huber estimator proposed in Wang et al. (2020). Define the Huber loss function $\ell_{\tau}(x)$ with robustification parameter $\tau$ as

\ell_{\tau}(x):=\begin{cases}\tau x-\tau^{2}/2,&\text{if } x>\tau;\\ x^{2}/2,&\text{if } |x|\leq\tau;\\ -\tau x-\tau^{2}/2,&\text{if } x<-\tau.\end{cases}

Suppose we have $n$ independent observations $\{Y_{i}\}_{i\in[n]}$ of $Y$ satisfying $\mathbb{E}Y=\mu$ and $\mathrm{var}(Y)=\sigma^{2}$. The Huber mean estimator $\widehat{\mu}$ is obtained by solving the following optimization problem:

\widehat{\mu}:=\operatorname*{argmin}_{\theta\in\mathbb{R}}\sum_{i=1}^{n}\ell_{\tau}(Y_{i}-\theta).

Fan et al. (2017) show that when $\tau$ is of order $\sigma\sqrt{n}$, $\widehat{\mu}$ achieves the optimal statistical rate with a sub-Gaussian deviation bound:

\mathbb{P}\big(|\widehat{\mu}-\mu|\lesssim\sigma\sqrt{z/n}\big)\geq 1-2e^{-z},\quad\forall z>0.   (3.4)

In practice, $\sigma$ is unknown, and one therefore has to rely on cross-validation (CV) to tune $\tau$, which incurs a loss of sample efficiency. Wang et al. (2020) propose a data-driven principle to estimate $\mu$ and the optimal $\tau$ jointly by iteratively solving the following two equations:

\left\{\begin{aligned}&\frac{1}{n}\sum_{i=1}^{n}\min(|Y_{i}-\theta|,\tau)\,\mathrm{sgn}(Y_{i}-\theta)=0;\\ &\frac{1}{n}\sum_{i=1}^{n}\frac{\min(|Y_{i}-\theta|^{2},\tau^{2})}{\tau^{2}}-\frac{z}{n}=0,\end{aligned}\right.   (3.5)

where $z$ is the same deviation parameter as in (3.4). Specifically, we start with $\widehat{\theta}^{(0)}=\bar{Y}$ and solve the second equation for $\tau^{(1)}$. We then plug $\tau=\tau^{(1)}$ into the first equation to get $\widehat{\theta}^{(1)}$. We repeat these two steps until the algorithm converges and use the final value of $\widehat{\theta}$ as the estimator for $\mu$. Wang et al. (2020) proved that (i) if $\theta=\mu$ in the second equation above, then its solution gives $\tau=\sigma\sqrt{n/z}$ with probability approaching $1$; (ii) if we choose $\tau=\sigma\sqrt{n/z}$ in the first equation above, its solution satisfies (3.4), even when $Y$ is asymmetrically distributed with heavy tails. Note that Wang et al. (2020) call the above procedure tuning-free in the sense that knowledge of $\sigma$ is not needed, though we still have the deviation parameter $z$ to control the exception probability; the paper suggests using $z=\log n$ in practice.
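A minimal Python sketch of this alternating scheme follows; the fixed-point updates and starting values are our own implementation choices, not prescriptions from Wang et al. (2020).

```python
import numpy as np

def huber_mean_tuning_free(y, z=None, n_iter=50, tol=1e-8):
    """Alternating solver for (3.5): jointly estimate the mean theta and the
    robustification parameter tau, with z = log(n) by default."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    z = np.log(n) if z is None else z
    theta = y.mean()                    # theta^(0) = sample mean
    tau = y.std() * np.sqrt(n / z)      # rough starting value for tau
    for _ in range(n_iter):
        # Second equation of (3.5): given theta, solve for tau by fixed point.
        for _ in range(n_iter):
            tau_new = np.sqrt(np.mean(np.minimum((y - theta) ** 2, tau**2)) * n / z)
            if abs(tau_new - tau) < tol:
                break
            tau = tau_new
        # First equation of (3.5): given tau, update theta; the estimating
        # equation says the clipped residuals must average to zero.
        theta_new = theta + np.mean(np.clip(y - theta, -tau, tau))
        if abs(theta_new - theta) < tol:
            return theta_new, tau
        theta = theta_new
    return theta, tau
```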

In the context of volatility forecasting, $\sigma_{t}$ always varies across time. The well-known phenomenon of volatility clustering in financial markets implies that $\sigma_{t}$ typically changes slowly, so that we can borrow data around time $t$ to help estimate $\sigma_{t}^{2}$ with little bias. A common practice in quantitative finance is to exploit an exponentially weighted average of $\{X^{2}_{s}\}_{s=t}^{t+m}$ to estimate $\sigma_{t}^{2}$, thereby discounting the importance of data that are distant from time $t$. To accommodate such exponential-decay weights, we now propose a sample-weighted variant of the Huber estimator for volatility estimation as follows:

\widehat{\mu}:=\operatorname*{argmin}_{\theta\in\mathbb{R}}\sum_{s=t}^{t+m}w_{s,t}\,\ell_{\tau_{t}/w_{s,t}}(X_{s}^{2}-\theta)=:\operatorname*{argmin}_{\theta\in\mathbb{R}}\mathcal{L}_{\tau_{t}}(\theta;\{w_{s,t}\}_{s=t}^{t+m}),   (3.6)

where $\{w_{s,t}\}_{s\in\{t,t+1,\ldots,t+m\}}$ are the sample weights. Note that the robustification parameters for different observations can differ: intuitively, the higher the sample weight, the lower $\tau$ should be, so that we can better guard against heavy-tailed deviations of important data points. More technical justification for this choice of robustification parameters is given after Theorem 1. Correspondingly, to adaptively tune $\tau$, we iteratively solve the following two equations for $\tau_{t}$ and $\theta_{t}$ until convergence:

\left\{\begin{aligned}&\sum_{s=t}^{t+m}w_{s,t}\min\bigg(|X_{s}^{2}-\theta_{t}|,\frac{\tau_{t}}{w_{s,t}}\bigg)\,\mathrm{sgn}(X_{s}^{2}-\theta_{t})=0;\\ &\sum_{s=t}^{t+m}w_{s,t}^{2}\min\bigg(|X_{s}^{2}-\theta_{t}|^{2},\frac{\tau_{t}^{2}}{w_{s,t}^{2}}\bigg)\Big/\tau_{t}^{2}-z=0.\end{aligned}\right.   (3.7)
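The weighted system (3.7) can be solved by the same alternating strategy; below is a sketch, where the update rules and the function name are again our own choices and $\theta$ plays the role of $\theta_{t}$.

```python
import numpy as np

def weighted_huber_vol(x2, w, z, n_iter=100, tol=1e-10):
    """Sketch of the sample-weighted tuning-free solver (3.7): x2 holds the
    squared returns X_s^2, w the forward weights w_{s,t} (summing to one);
    returns (sigma2_hat, tau_t)."""
    x2, w = np.asarray(x2, float), np.asarray(w, float)
    theta = np.sum(w * x2)                       # start at the plain EWMA estimate
    tau = x2.std() * np.sqrt(np.sum(w**2) / z)   # rough start: tau ~ sqrt(kappa/(n_eff z))
    for _ in range(n_iter):
        r = x2 - theta
        # Second equation of (3.7): sum_s min(w_s^2 r_s^2, tau^2) = z * tau^2.
        for _ in range(n_iter):
            tau_new = np.sqrt(np.sum(np.minimum((w * r) ** 2, tau**2)) / z)
            if abs(tau_new - tau) < tol:
                break
            tau = tau_new
        # First equation of (3.7): weighted clipped residuals must sum to zero.
        theta_new = theta + np.sum(np.clip(w * r, -tau, tau))
        if abs(theta_new - theta) < tol:
            return theta_new, tau
        theta = theta_new
    return theta, tau
```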

Our first theorem shows that the solution to the first equation of (3.7) yields a sub-Gaussian estimator of $\sigma_{t}^{2}$, provided that $\tau_{t}$ is well tuned and that the distribution of the volatility series evolves sufficiently slowly.

Theorem 1.

Under Assumption 1, if $\tau_{t}=\sqrt{\frac{\kappa^{\dagger}_{t}}{n^{\dagger}_{\mathrm{eff}}z}}$, we have for $n^{\dagger}_{\mathrm{eff}}\geq 16z$ that

\mathbb{P}\bigg(|\widehat{\sigma}_{t}^{2}-\sigma_{t}^{2}|\leq 4\sqrt{\kappa^{\dagger}_{t}z/n^{\dagger}_{\mathrm{eff}}}\bigg)\geq 1-2e^{-z+|\delta_{0,t}|\sqrt{n^{\dagger}_{\mathrm{eff}}z/\kappa^{\dagger}_{t}}+2\delta_{1,t}n^{\dagger}_{\mathrm{eff}}z/\kappa^{\dagger}_{t}},   (3.8)

where $\widehat{\sigma}_{t}^{2}$ equals the $\theta_{t}$ that solves the first equation of (3.7).

Remark 1.

Given that $\{w_{s,t}\}_{s=t}^{t+m}$ are forward exponential-decay weights, we have that

n^{\dagger}_{\mathrm{eff}}=\frac{(1+\lambda)(1-\lambda^{m+1})}{(1-\lambda)(1+\lambda^{m+1})}=\frac{1+\lambda}{1+\lambda^{m+1}}\sum_{i=0}^{m}\lambda^{i},

which converges to $(1+\lambda)/(1-\lambda)$ as $m\to\infty$, and to $m+1$ as $\lambda\to 1$. Therefore, $n^{\dagger}_{\mathrm{eff}}\to\infty$ requires both $m\to\infty$ and $\lambda\to 1$. As $n^{\dagger}_{\mathrm{eff}}\to\infty$, when $\delta_{0,t}=o(\sqrt{1/n^{\dagger}_{\mathrm{eff}}})$ and $\delta_{1,t}=o(1/n^{\dagger}_{\mathrm{eff}})$, the exception probability is of order $e^{-z+o(z)}$, which converges to $0$ as $z\to\infty$. Therefore, if we choose $z=\log n^{\dagger}_{\mathrm{eff}}$, then $|\widehat{\sigma}_{t}^{2}-\sigma_{t}^{2}|=O_{\mathbb{P}}\{(\log n^{\dagger}_{\mathrm{eff}}/n^{\dagger}_{\mathrm{eff}})^{1/2}\}$ as $n^{\dagger}_{\mathrm{eff}}\to\infty$.
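For intuition, a short sketch computing $n^{\dagger}_{\mathrm{eff}}$ for exponential-decay weights; the parameter values are arbitrary.

```python
import numpy as np

def n_eff_forward(lam, m):
    """Forward effective sample size from (3.2) for exponential-decay weights."""
    w = lam ** np.arange(m + 1)
    w /= w.sum()
    return 1.0 / np.sum(w**2)

print(n_eff_forward(0.94, 250))   # ~32.3, close to (1 + 0.94)/(1 - 0.94)
print(n_eff_forward(0.999, 250))  # ~250, approaching m + 1 = 251
```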

Remark 2.

One crucial step of our proof is to use Bernstein’s inequality to control the derivative of the weighted Huber loss with respect to $\theta$, i.e.,

\mathcal{L}_{\tau_{t}}^{\prime}(\theta;\{w_{s,t}\})=\sum_{s=t}^{t+m}w_{s,t}\min(|X_{s}^{2}-\theta|,\tau_{t}/w_{s,t})\,\mathrm{sgn}(X_{s}^{2}-\theta).

By setting $\tau_{t}/w_{s,t}$ as the robustification parameter for the data at time $s$, we ensure that the corresponding summand in the derivative is bounded by $\tau_{t}$ in absolute value, which allows us to apply Bernstein’s inequality. This justifies, from a technical perspective, our choice of robustification parameters for different sample weights.

Remark 3.

Theorem 1 is in fact not limited to exponential-decay weights $\{w_{s,t}\}_{s=t}^{t+m}$: bound (3.8) applies to any sample weight series.

Our next theorem provides theoretical justification for the second equation of (3.7). It basically says that the solution to that equation gives an appropriate value of $\tau_{t}$.

Theorem 2.

On top of Assumption 1, we further assume that $\delta_{1,t}n^{\dagger}_{\mathrm{eff}}\leq c_{1}\kappa^{\dagger}_{t}$ and $w_{s,t}\asymp 1/m$ for all $t\leq s\leq t+m$. If $\theta_{t}=\sigma_{t}^{2}$ in the second equation of (3.7), then as $n^{\dagger}_{\mathrm{eff}},z\to\infty$ with $z=o(n^{\dagger}_{\mathrm{eff}})$, its solution satisfies $\tau_{t}\asymp\sqrt{\frac{\kappa^{\dagger}_{t}}{n^{\dagger}_{\mathrm{eff}}z}}$ with probability approaching $1$.

Remark 4.

Define the half-life parameter $l:=\log(1/2)/\log(\lambda)$. If we fix $m/l=C$ for a universal constant $C$, which is common practice in volatility forecasting, then we can ensure that $\{w_{s,t}\}_{t\leq s\leq t+m}$ are all of order $1/m$.

3.3 Average deviation of volatility proxies

We are now in a position to evaluate I, which concerns the average deviation of the volatility proxies over all $T$ time points. To illustrate the advantage of the sample-weighted Huber volatility proxy proposed in (3.6), we first introduce and investigate two benchmark volatility proxies that are widely used in practice. We then present our average deviation analysis of the sample-weighted Huber proxy.

The first benchmark volatility proxy, which we denote by $(\widehat{\sigma}_{c})_{t}^{2}$, is simply a clipped squared return:

\left\{\begin{aligned}&(\widehat{\sigma}_{c})^{2}_{t}=\min(X_{t}^{2},c_{t})=\min(X_{t}^{2},\tau_{t}\sqrt{n^{\dagger}_{\mathrm{eff}}T});\\ &\sum_{s=t}^{t+m}w_{s,t}^{2}\min\bigg(X_{s}^{4},\frac{\tau_{t}^{2}}{w_{s,t}^{2}}\bigg)\Big/\tau_{t}^{2}-z=0,\end{aligned}\right.   (3.9)

where $c_{t}$ is the clipping threshold, and $z$ is a deviation parameter similar to that in (3.7). Here $\tau_{t}$ is tuned similarly as in (3.7), except that now the second equation of (3.9) does not depend on $(\widehat{\sigma}_{c})^{2}_{t}$ and thus can be solved separately. Following Theorem 2, we can deduce that $\tau_{t}\asymp\sqrt{\frac{\tilde{\kappa}^{\dagger}_{t}}{n^{\dagger}_{\mathrm{eff}}z}}$ and thus that $c_{t}\asymp\sqrt{\frac{\tilde{\kappa}^{\dagger}_{t}T}{z}}$. The main purpose of choosing such a rate for $c_{t}$ is to balance the bias and variance of the average of $(\widehat{\sigma}_{c})_{t}^{2}$ over $T$ time points. The following theorem develops a non-asymptotic bound for the average relative deviation of $\widehat{\sigma}_{c}^{2}$.

Theorem 3.

Under Assumption 1, if $c_{t}=\sqrt{\tilde{\kappa}^{\dagger}_{t}T/z}$ in (3.9), then for any bounded series $\{q_{t}\}_{t\in[T]}$ such that $\max_{t\in[T]}|q_{t}|\leq Q$ and any $z>0$, we have

\mathbb{P}\bigg(\frac{1}{T}\sum_{t=1}^{T}\{(\widehat{\sigma}_{c})_{t}^{2}/\sigma_{t}^{2}-1\}q_{t}\leq CQ\sqrt{z/T}\bigg)\geq 1-e^{-z},

where $C$ depends on $\max_{t\in[T]}\{(\tilde{\kappa}_{t}/\sigma_{t}^{4})\vee(\tilde{\kappa}^{\dagger}_{t}/\sigma_{t}^{4})\vee(\tilde{\kappa}_{t}/\tilde{\kappa}^{\dagger}_{t})\}$.
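A sketch of how $(\widehat{\sigma}_{c})_{t}^{2}$ in (3.9) might be computed is given below; the fixed-point solver for $\tau_{t}$ mirrors the one sketched for (3.7) and is our own choice.

```python
import numpy as np

def clipped_proxy(x2_future, w, z, T, n_iter=100, tol=1e-10):
    """Sketch of (3.9): x2_future = (X_t^2, ..., X_{t+m}^2), w = forward
    weights w_{s,t}; returns the clipped squared return at time t."""
    x2, w = np.asarray(x2_future, float), np.asarray(w, float)
    tau = np.sqrt(np.sum((w * x2) ** 2) / z)       # starting value
    for _ in range(n_iter):                        # second equation of (3.9)
        tau_new = np.sqrt(np.sum(np.minimum((w * x2) ** 2, tau**2)) / z)
        if abs(tau_new - tau) < tol:
            break
        tau = tau_new
    n_eff = 1.0 / np.sum(w**2)
    c_t = tau * np.sqrt(n_eff * T)                 # clipping threshold
    return min(x2[0], c_t)
```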

The second benchmark volatility proxy, which we denote by $(\widehat{\sigma}_{e})^{2}_{t}$, is defined as

\left\{\begin{aligned}&(\widehat{\sigma}_{e})_{t}^{2}=\sum_{s=t}^{t+m}w_{s,t}\min\bigg(X_{s}^{2},\frac{c_{t}}{w_{s,t}}\bigg)=\sum_{s=t}^{t+m}\min\big(w_{s,t}X_{s}^{2},\tau_{t}\sqrt{T/n^{\dagger}_{\mathrm{eff}}}\big),\\ &\sum_{s=t}^{t+m}w_{s,t}^{2}\min\bigg(X_{s}^{4},\frac{\tau_{t}^{2}}{w_{s,t}^{2}}\bigg)\Big/\tau_{t}^{2}-z=0.\end{aligned}\right.   (3.10)

The second equation of (3.10) is the same as that of (3.9). The only difference between $\widehat{\sigma}^{2}_{e}$ and $\widehat{\sigma}^{2}_{c}$ is that $\widehat{\sigma}_{e}^{2}$ exploits not just a single time point but multiple data points in the near future to construct the volatility proxy. Accordingly, the clipping threshold is updated to $\tau_{t}\sqrt{T/n^{\dagger}_{\mathrm{eff}}}$. The following theorem characterizes the average relative deviation of $\widehat{\sigma}^{2}_{e}$.

Theorem 4.

Under Assumption 1, for any $(z,c)>0$ satisfying

  1. $\max_{t\in[T]}\big\{2|\delta_{0,t}|/\sqrt{\tilde{\kappa}^{\dagger}_{t}}+4\delta_{1,t}n^{\dagger}_{\mathrm{eff}}/(\tilde{\kappa}^{\dagger}_{t}\sqrt{T})\big\}\leq c$;

  2. $n^{\dagger}_{\mathrm{eff}}\geq 2c^{-1}\{1+(\log 2T)/z\}\sqrt{T}$;

  3. $\sqrt{T}\geq 16cz$ and $T\geq 16z$;

and any bounded series $\{q_{t}\}_{t\in[T]}$ such that $\max_{t\in[T]}|q_{t}|\leq Q$, letting $c_{t}=\sqrt{\tilde{\kappa}^{\dagger}_{t}T/(n^{\dagger 2}_{\mathrm{eff}}z)}$, we have

\mathbb{P}\bigg[\frac{1}{T}\sum_{t=1}^{T}\{(\widehat{\sigma}_{e})_{t}^{2}/\sigma_{t}^{2}-1\}q_{t}\leq CQ\sqrt{\frac{z}{T}}+\frac{1}{T}\sum_{t=1}^{T}\frac{\delta_{0,t}q_{t}}{\sigma_{t}^{2}}\bigg]\geq 1-2e^{-z},

where $C$ depends on $\max_{t\in[T],u\in[m]}\{\tilde{\kappa}_{t+u}/\sigma_{t}^{4}\}$.

Remark 5.

Let $z=\log T$. To achieve the optimal rate of $\sqrt{\log T/T}$, Theorem 4 requires that $\frac{1}{T}\sum_{t=1}^{T}\frac{\delta_{0,t}q_{t}}{\sigma_{t}^{2}}$ be of order $\sqrt{\log T/T}$. We also require $n^{\dagger}_{\mathrm{eff}}$ to be at least of order $\sqrt{T}$.

Remark 6.

One technical challenge in proving Theorem 4 lies in the overlap of the squared returns used to construct neighboring proxies, which induces temporal dependence across $\{(\widehat{\sigma}_{e})^{2}_{t}\}_{t\in[T]}$. To resolve this issue, we apply a more sophisticated variant of Bernstein’s inequality for time series data (Zhang, 2021); see Lemma 1 in the appendix.
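Before moving on to the Huber proxy, here is a sketch of the computation of $(\widehat{\sigma}_{e})_{t}^{2}$ in (3.10), again with our own fixed-point solver for $\tau_{t}$.

```python
import numpy as np

def ewma_clipped_proxy(x2_future, w, z, T, n_iter=100, tol=1e-10):
    """Sketch of (3.10): same tau equation as (3.9), but the proxy now
    averages the clipped weighted squared returns over s = t, ..., t+m."""
    x2, w = np.asarray(x2_future, float), np.asarray(w, float)
    tau = np.sqrt(np.sum((w * x2) ** 2) / z)
    for _ in range(n_iter):
        tau_new = np.sqrt(np.sum(np.minimum((w * x2) ** 2, tau**2)) / z)
        if abs(tau_new - tau) < tol:
            break
        tau = tau_new
    n_eff = 1.0 / np.sum(w**2)
    cap = tau * np.sqrt(T / n_eff)                 # per-summand clipping level
    return np.sum(np.minimum(w * x2, cap))
```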

We now move on to the Huber volatility proxy. At time $t$, denote the solution to (3.7) by $(\widehat{\theta}_{t},\widehat{\tau}_{t})$. Note that $\widehat{\tau}_{t}$ is tuned based on just $m$ data points. Given that our focus is now on the average deviation of the volatility proxies over $T$ data points, we need to raise the robustification parameter to reduce the bias of the Huber proxy and rebalance the bias and variance of the average deviation. After all, averaging over a large $T$ mitigates the impact of possible tail events, so that we can relax the thresholding effect of the Huber loss. Specifically, let $c_{t}=\widehat{\tau}_{t}\sqrt{T/n^{\dagger}_{\mathrm{eff}}}$, which is of order $\sqrt{\kappa^{\dagger}_{t}T/(n^{\dagger 2}_{\mathrm{eff}}z)}$ according to Theorem 2. Then we substitute $\tau_{t}=c_{t}$ into the first equation of (3.7) and solve for $\sigma_{t}^{2}$ therein to obtain the adjusted proxy; that is to say, the final $(\widehat{\sigma}_{H})_{t}^{2}$ satisfies the following:

\sum_{s=t}^{t+m}w_{s,t}\min\bigg(|X_{s}^{2}-(\widehat{\sigma}_{H})_{t}^{2}|,\frac{c_{t}}{w_{s,t}}\bigg)\,\mathrm{sgn}\big(X_{s}^{2}-(\widehat{\sigma}_{H})_{t}^{2}\big)=0.   (3.11)
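Operationally, one can first run the local tuning-free fit and then re-solve the first equation with the inflated threshold; a sketch, reusing the hypothetical weighted_huber_vol from the sketch after (3.7):

```python
import numpy as np

def huber_proxy(x2_future, w, z, T, n_iter=100, tol=1e-10):
    """Sketch of (3.11): fit (theta_t, tau_t) locally via (3.7), inflate the
    threshold to c_t = tau_t * sqrt(T / n_eff), then re-solve the first
    equation of (3.7) with tau_t replaced by c_t."""
    x2, w = np.asarray(x2_future, float), np.asarray(w, float)
    theta, tau = weighted_huber_vol(x2, w, z)      # local tuning-free fit
    n_eff = 1.0 / np.sum(w**2)
    c_t = tau * np.sqrt(T / n_eff)                 # inflated robustification level
    for _ in range(n_iter):                        # re-solve with c_t fixed
        step = np.sum(np.clip(w * (x2 - theta), -c_t, c_t))
        if abs(step) < tol:
            break
        theta += step
    return theta
```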

The inflation factor $\sqrt{T/n^{\dagger}_{\mathrm{eff}}}$ in $c_{t}$ implies that the larger the sample size, the closer the corresponding Huber loss is to the squared loss. This justifies the use of the vanilla EWMA proxy, the most common practice in the financial industry, when the total evaluation period is long. However, when $T$ is not sufficiently large, the Huber proxy yields a more robust estimate of the true volatility. The following theorem characterizes the average relative deviation of the Huber proxies.

Theorem 5.

Under Assumption 1, for any $(c,z)>0$ satisfying

  1. $\max_{t\in[T]}\big\{2|\delta_{0,t}|/\sqrt{\kappa^{\dagger}_{t}}+4\delta_{1,t}n^{\dagger}_{\mathrm{eff}}/(\kappa^{\dagger}_{t}\sqrt{T})\big\}\leq c$;

  2. $n^{\dagger}_{\mathrm{eff}}\geq 2c^{-1}\{1+(\log 2T)/z\}\sqrt{T}$;

  3. $\sqrt{T}\geq 16cz$ and $T\geq 16z$;

  4. for any $t\in[T]$, $\mathcal{L}_{c_{t}}(\theta;\{w_{s,t}\}_{s=t}^{t+m})$ is $\alpha$-strongly convex for $\theta\in\big(\sigma_{t}^{2}-(2c+2)\sqrt{\kappa_{t}^{\dagger}z},\,\sigma_{t}^{2}+(2c+2)\sqrt{\kappa_{t}^{\dagger}z}\big)$;

and any bounded series $\{q_{t}\}_{t\in[T]}$ such that $\max_{t\in[T]}|q_{t}|\leq Q$, letting $c_{t}=\sqrt{\kappa^{\dagger}_{t}T/(n^{\dagger 2}_{\mathrm{eff}}z)}$, we have

\mathbb{P}\bigg[\frac{1}{T}\sum_{t=1}^{T}\{(\widehat{\sigma}_{H})_{t}^{2}/\sigma_{t}^{2}-1\}q_{t}\leq CQ\sqrt{\frac{z}{T}}+\frac{1}{T}\sum_{t=1}^{T}\frac{(\delta_{0,t}+4\delta_{1,t})q_{t}}{\sigma_{t}^{2}}\bigg]\geq 1-2e^{-z},

where $C$ depends on $\max_{t\in[T],u\in[m]}\{\kappa_{t+u}/\sigma_{t}^{4}\}$ and $\alpha$.

Remark 7.

Compared with the previous two benchmark proxies, the main advantage of the Huber volatility proxy is that its average deviation error depends on the central fourth moment of the returns instead of the absolute one.

Remark 8.

$\alpha$-strong convexity is a standard assumption that can be verified for the Huber loss. For example, Proposition B.1 of Chen and Zhou (2020) shows that the equally weighted Huber loss $\mathcal{L}_{\tau_{t}}(\theta;\{1/(m+1)\}_{s=t}^{t+m})$ enjoys strong convexity with $\alpha=1/4$ in the region $(\sigma_{t}^{2}-r,\sigma_{t}^{2}+r)$, where $r\asymp\tau$, with probability at least $1-e^{-t}$ when $m\geq C(1+t)$. Such strong convexity paves the way for applying Lemma 1 to the Huber proxies, so that we can establish their Bernstein-type concentration; please refer to Lemma 2 in the appendix for details.

Remark 9.

To achieve the optimal $\sqrt{\log T/T}$ rate of convergence when we choose $z\asymp\log T$, we additionally need $\frac{1}{T}\sum_{t=1}^{T}\frac{(\delta_{0,t}+4\delta_{1,t})q_{t}}{\sigma_{t}^{2}}$ to be of a smaller order, which in practice requires a certain smoothness of the volatility.

4 Robust predictors

In this section, we further take into account the randomness from the predictors and study how to bound II, the difference between the empirical risk and the conditional risk.

4.1 Robust volatility predictor

We essentially follow (3.6), the sample-weighted Huber mean estimator, to construct robust volatility predictors. The only difference is that now we cannot touch any data beyond time $t$; we can only look backward at time $t$. Consider the following volatility predictor based on the past $m$ data points with backward exponential-decay weights:

h_{t}:=\operatorname*{argmin}_{\theta\in\mathbb{R}}\sum_{s=t-m}^{t-1}\nu_{s,t}\,\ell_{\tau_{t}/\nu_{s,t}}(X_{s}^{2}-\theta)=:\operatorname*{argmin}_{\theta\in\mathbb{R}}\mathcal{L}_{\tau_{t}}(\theta;\{\nu_{s,t}\}_{s=t-m}^{t-1}).   (4.1)

Similarly to (3.7), we iteratively solve the following two equations for $h_{t}$ and $\tau_{t}$ simultaneously:

\left\{\begin{aligned}&\sum_{s=t-m}^{t-1}\nu_{s,t}\min\bigg(|X_{s}^{2}-h_{t}|,\frac{\tau_{t}}{\nu_{s,t}}\bigg)\,\mathrm{sgn}(X_{s}^{2}-h_{t})=0;\\ &\sum_{s=t-m}^{t-1}\nu_{s,t}^{2}\min\bigg(|X_{s}^{2}-h_{t}|^{2},\frac{\tau_{t}^{2}}{\nu_{s,t}^{2}}\bigg)\Big/\tau_{t}^{2}-z=0,\end{aligned}\right.   (4.2)

where we recall that $\nu_{s,t}=\lambda^{t-1-s}/\sum_{j=t-m}^{t-1}\lambda^{t-1-j}$. Theorem 1 showed concentration of $\widehat{\sigma}_{t}^{2}$ around $\sigma_{t}^{2}$, i.e., for the MSE loss. More generally, we hope to give results on $\mathbb{P}(L(\sigma_{t}^{2},h_{t})-\min_{h}L(\sigma_{t}^{2},h)>\varepsilon)$. According to property (a) in Section 2.1, $\min_{h}L(\sigma_{t}^{2},h)=f(\sigma_{t}^{2})+B(\sigma_{t}^{2})+C(\sigma_{t}^{2})\sigma_{t}^{2}$. Therefore, we hope to bound

\mathbb{P}(|L(\sigma_{t}^{2},h_{t})-L(\sigma_{t}^{2},\sigma_{t}^{2})|>\varepsilon).

This is easy to control if we assume smoothness of the loss function. We give the following theorem.

Theorem 6.

Assume there exist $b,B$ such that $\sup_{|h-\sigma_{t}^{2}|\leq b}|\partial L(\sigma_{t}^{2},h)/\partial h|\leq B$. If $\tau_{t}=\sqrt{\frac{\kappa^{\ddagger}_{t}}{n^{\ddagger}_{\mathrm{eff}}z}}$, then under Assumption 1, for $n^{\ddagger}_{\mathrm{eff}}\geq 16(1\vee\kappa^{\ddagger}_{t}/b^{2})z$, we have

\mathbb{P}\bigg(L(\sigma_{t}^{2},h_{t})-\min_{h}L(\sigma_{t}^{2},h)\leq 4B\sqrt{\kappa^{\ddagger}_{t}z/n^{\ddagger}_{\mathrm{eff}}}\bigg)\geq 1-2e^{-z+|\Delta_{0,t}|\sqrt{n^{\ddagger}_{\mathrm{eff}}z/\kappa^{\ddagger}_{t}}+2\Delta_{1,t}n^{\ddagger}_{\mathrm{eff}}z/\kappa^{\ddagger}_{t}},

where $h_{t}$ is the solution to the first equation of (4.2).

Remark 10.

Here, for notational simplicity, we used the same estimation horizon $m$ and the same exponential decay $\lambda$ for constructing both predictors and proxies. In practice, of course, they need not be the same. In our real data example, we will use a slower decay for constructing predictors and a faster decay for proxies, which is common practice for real financial data, where we typically use more data for constructing predictors and less data for constructing proxies. We will stick to $m$ equal to twice the half-life, so that, equivalently, we use a longer window for predictors and a shorter window for proxies.

Remark 11.

In addition, we also simplify the theoretical results by assuming the same $z$ for constructing proxies and predictors. In practice, we need not use the same $z$ to control the tail probability for predictors and proxies. For predictors, as we focus on local performance, it is more natural to use $z=\log n^{\ddagger}_{\mathrm{eff}}$ following Wang et al. (2020). For proxies, as we focus on overall evaluation, for a given $T$ we can take $z=\log T$. Sometimes we want to monitor the risk evaluation as $T$ grows; then a changing $z$ may not be a good choice, since we do not want to re-solve the local weighted tuning-free Huber problem every time $T$ changes. Therefore, we recommend using $z=C\log n^{\dagger}_{\mathrm{eff}}$ for a slightly larger $C$, e.g., $z=2\log n^{\dagger}_{\mathrm{eff}}$ as in our real data analysis.
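Putting the pieces together, a hypothetical rolling setup might look as follows, reusing the weighted_huber_vol and huber_proxy sketches from Section 3; the decay rates, window length, and $z$ choices follow Remarks 10 and 11, but the specific numbers are ours.

```python
import numpy as np

def ewma_weights(lam, m, backward=True):
    """Exponential-decay weights: backward gives nu_{s,t} for s = t-m..t-1,
    forward gives w_{s,t} for s = t..t+m."""
    expo = np.arange(m - 1, -1, -1) if backward else np.arange(m + 1)
    w = lam ** expo
    return w / w.sum()

# Slower decay (longer effective window) for the predictor, faster decay for
# the proxy; z chosen per Remarks 10-11. All numbers below are illustrative.
lam_pred, lam_proxy, m, T = 0.97, 0.94, 120, 100
nu = ewma_weights(lam_pred, m, backward=True)      # length m, for s = t-m..t-1
w = ewma_weights(lam_proxy, m, backward=False)     # length m+1, for s = t..t+m
rng = np.random.default_rng(1)
x = rng.standard_t(5, size=500) * 0.01             # toy heavy-tailed returns
t = 300
h_t, _ = weighted_huber_vol(x[t - m:t] ** 2, nu, z=np.log(1 / np.sum(nu**2)))
proxy_t = huber_proxy(x[t:t + m + 1] ** 2, w, z=2 * np.log(1 / np.sum(w**2)), T=T)
```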

4.2 Concentration of robust and non-robust predictors

Recall that a robust loss satisfies $\mathbb{E}\{L(\widehat{\sigma}_t^2,h_{it})\,|\,\mathcal{G}_t\}=\mathbb{E}[f(h_{it})]+B(\widehat{\sigma}_t^2)+\mathbb{E}[C(h_{it})]\widehat{\sigma}_t^2$. So we further bound II as follows:

\[\begin{aligned}
\text{II}&=\mathbb{P}\bigg[\bigg|\frac{1}{T}\sum_{t=1}^T\{f(h_t)-\mathbb{E}f(h_t)\}+\frac{1}{T}\sum_{t=1}^T\{C(h_t)-\mathbb{E}C(h_t)\}\widehat{\sigma}_t^2\bigg|>\varepsilon\bigg]\\
&\leq\mathbb{P}\bigg[\bigg|\frac{1}{T}\sum_{t=1}^T f(h_t)-\mathbb{E}f(h_t)+\{C(h_t)-\mathbb{E}C(h_t)\}\sigma_t^2\bigg|+\bigg|\frac{1}{T}\sum_{t=1}^T(C(h_t)-\mathbb{E}C(h_t))(\widehat{\sigma}_t^2-\sigma_t^2)\bigg|>\varepsilon\bigg]\\
&\leq\underbrace{\mathbb{P}\bigg[\bigg|\frac{1}{T}\sum_{t=1}^T(L(\sigma_t^2,h_t)-\mathbb{E}[L(\sigma_t^2,h_t)])\bigg|>\frac{\varepsilon}{2}\bigg]}_{\Delta_A}+\underbrace{\mathbb{P}\bigg[\bigg|\frac{1}{T}\sum_{t=1}^T(C(h_t)-\mathbb{E}[C(h_t)])(\widehat{\sigma}_t^2-\sigma_t^2)\bigg|>\frac{\varepsilon}{2}\bigg]}_{\Delta_B}\,.
\end{aligned}\]

We wish to show that both $\Delta_A$ and $\Delta_B$ achieve the desired rate $\sqrt{\log T/T}$ for a broad class of predictors and loss functions. When $h_t$ is the proposed robust predictor of Section 4.1, we obtain sharp rates for $\Delta_A$ and $\Delta_B$ as expected. Moreover, for the vanilla (non-robust) EWMA predictor $h_t=\sum_{s=t-m}^{t-1}\nu_{s,t}X_s^2$, we also obtain the same sharp rates for Lipschitz robust losses. A third option is to truncate the predictor, i.e., $h_t=(\sum_{s=t-m}^{t-1}\nu_{s,t}X_s^2)\wedge M$ for some large constant $M$, in which case we can control the two terms for general robust losses. The bottom line is that, for non-robust predictors, we must keep the loss from growing out of control, either by shrinking the predictor's effect on the loss (bounded Lipschitz losses) or by clipping the predictor directly.

The interesting observation is that bounding II, the difference between the empirical risk and the conditional risk, requires only minimal assumptions: most reasonable predictors, say of the M-estimator form, satisfy the concentration bound with a proper choice of robust proxy, although we do require the loss not to become too wild (see Theorem 7 for details). Technically, the concentration here only involves controlling variance; it does not involve the bias between $\mathbb{E}[L(\sigma_t^2,h_t)]$ and $L(\sigma_t^2,\sigma_t^2)$, or between $\mathbb{E}[C(h_t)]$ and $C(h_t)$. There is thus no need to carefully choose the truncation threshold to balance variance and bias optimally.

Theorem 7.

For any $t\in[T]$, suppose ${\cal L}_{\tau_t}(\theta;\{\nu_{s,t}\}_{s=t-m}^{t-1})$ is $\alpha$-strongly convex for $\theta\in(\sigma_t^2/2,3\sigma_t^2/2)$. Choose $\widehat{\sigma}_t=(\widehat{\sigma}_e)_t$ (or $(\widehat{\sigma}_H)_t$) with robustification parameter $c_t$ specified in Theorem 4 (or Theorem 5). Under the assumptions of Theorem 4 (or Theorem 5), the bound

\[\mathbb{P}\bigg(\bigg|\frac{1}{T}\sum_{t=1}^T L(\widehat{\sigma}_t^2,h_t)-\frac{1}{T}\sum_{t=1}^T\Big(\mathbb{E}f(h_t)+B(\widehat{\sigma}_t^2)+\mathbb{E}C(h_t)\widehat{\sigma}_t^2\Big)\bigg|<C'\sqrt{z/T}\bigg)\geq 1-(2T+5)e^{-z/2}\]

holds for all the following three cases:

  1. For the robust predictors proposed in Section 4.1, assume $\sup_{|h-\sigma_t^2|\leq b}|\partial L(\sigma_t^2,h)/\partial h|\leq B_t(b)$. If $|\Delta_{0,t}|\sqrt{n^{\ddagger}_{\rm eff}/\kappa^{\ddagger}_t}+2\Delta_{1,t}n^{\ddagger}_{\rm eff}/\kappa^{\ddagger}_t<1/2$ and $n^{\ddagger}_{\rm eff}\geq 64z\max_t\kappa^{\ddagger}_t/\sigma_t^4$, the above bound holds with $C'$ depending on $\max_t B_t(\sigma_t^2/2)$ and $C$ in Theorem 4 (or Theorem 5).

  2. For the clipped vanilla non-robust exponentially weighted moving average predictors $h_t=(\sum_{s=t-m}^{t-1}\nu_{s,t}X_s^2)\wedge M$, assume $\sup_{\sigma_t^2/2\leq h\leq b}|\partial L(\sigma_t^2,h)/\partial h|\leq B_t(b)$. If $|\Delta_{0,t}|\sqrt{n^{\ddagger}_{\rm eff}/\tilde{\kappa}^{\ddagger}_t}+2\Delta_{1,t}n^{\ddagger}_{\rm eff}/\tilde{\kappa}^{\ddagger}_t<1/2$ and $n^{\ddagger}_{\rm eff}\geq 64z\max_t\tilde{\kappa}^{\ddagger}_t/\sigma_t^4$, the above bound holds with $C'$ depending on $\max_t B_t(M)$ and $C$ in Theorem 4 (or Theorem 5).

  3. For the vanilla non-robust exponentially weighted moving average predictors $h_t=\sum_{s=t-m}^{t-1}\nu_{s,t}X_s^2$, assume $|\partial L(\sigma_t^2,h)/\partial h|\leq B_0$ and $|L(\sigma_t^2,h)|\leq M_0$ for all $\sigma_t^2,h$. If $|\Delta_{0,t}|\sqrt{n^{\ddagger}_{\rm eff}/\tilde{\kappa}^{\ddagger}_t}+2\Delta_{1,t}n^{\ddagger}_{\rm eff}/\tilde{\kappa}^{\ddagger}_t<1/2$ and $n^{\ddagger}_{\rm eff}\geq 64z\max_t\tilde{\kappa}^{\ddagger}_t/\sigma_t^4$, the above bound holds with $C'$ depending on $B_0,M_0$ and $C$ in Theorem 4 (or Theorem 5).

Remark 12.

The proof extends easily to more general robust or non-robust predictors of the M-estimator form. Theorem 7 tells us that, provided we employ proper robust proxies, it is indeed valid to compare robust predictors whose truncation is optimized for a single time point (rather than adjusted as in constructing proxies) against non-robust predictors (either with a rough overall truncation when using a general loss, or without any truncation when using a truncated loss).

Remark 13.

Although the first proxy achieves the optimal rate of convergence for comparing the average conditional loss in Theorem 3, we did not manage to show that it is valid for comparing the average empirical loss in Theorem 7. The reason is that single-time clipping has no concentration guarantee like Theorem 1 at a single time point, and therefore cannot ensure that $|(\widehat{\sigma}_c)_t^2-\sigma_t^2|$ is bounded with high probability for all $t$, which is needed to ensure the sub-exponential tail in the Bernstein inequality does not dominate the sub-Gaussian tail. Taking into account the central fourth moment versus the absolute fourth moment (see Remark 7), we recommend the third proxy as the best practical choice among our three proposals.

5 Numerical study

In this section, we first verify through simulations the advantage of the Huber mean estimator over the truncated mean for estimating the variance. As illustrated by Theorem 5, the statistical error of the Huber mean estimator depends on the central moment, while that of the truncated mean depends on the absolute moment. We then apply the proposed robust proxies to volatility forecasting comparison, using data from the cryptocurrency market. Specifically, we focus on the returns of Bitcoin (BTC) quoted in Tether (USDT), a stablecoin pegged to the US Dollar, over the years 2019 and 2020, which witnessed dramatic volatility of Bitcoin.

5.1 Simulations

We first examine numerically the finite-sample performance of the adaptive Huber estimator (Wang et al., 2020) for variance estimation; that is, we solve (3.5) iteratively for $\theta$ given the data and $z$ until convergence. We first draw an independent sample $Y_1,\dots,Y_n$ of $Y$ from a heavy-tailed distribution. We investigate the following two distributions:

  1. Log-normal distribution $\mathrm{LN}(0,1)$, that is, $\log Y\sim{\cal N}(0,1)$.

  2. Student's t distribution with $\mathrm{df}=3$ degrees of freedom.

Given that $\sigma^2:=\operatorname{Var}(Y)=\mathbb{E}(Y^2)-(\mathbb{E}Y)^2$, we estimate $\mathbb{E}(Y^2)$ and $\mathbb{E}Y$ separately and plug these mean estimators into the variance formula to estimate $\sigma^2$. Besides the Huber mean estimator, we investigate two alternative mean estimators as benchmarks: (a) the naive sample mean; (b) the sample mean of data truncated at their upper and lower $\alpha$-percentiles. We use MSE and QL (see (2.1) for the definitions) to assess the accuracy of variance estimation. In our simulation, we set $n=100$ and evaluate the three methods in 2000 independent Monte Carlo experiments.
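As an illustration of this design, here is a minimal sketch (our own; huber_mean is the equal-weight special case of the solver from Section 4.1, and we read "truncated" as clipping at the empirical percentiles) of the log-normal experiment:

    import numpy as np
    from scipy.optimize import brentq

    def huber_mean(y, z, n_iter=20):
        """Tuning-free Huber mean: jointly solve for the location theta
        and truncation level tau, with equal weights 1/n (cf. (3.5))."""
        n = len(y)
        theta = y.mean()
        tau = y.std() / np.sqrt(n * z) + 1e-12
        for _ in range(n_iter):
            theta = brentq(
                lambda th: np.sum(np.minimum(np.abs(y - th) / n, tau)
                                  * np.sign(y - th)),
                y.min() - 1.0, y.max() + 1.0)
            r2 = (y - theta) ** 2 / n**2
            tau = brentq(lambda t: np.sum(np.minimum(r2, t * t)) / (t * t) - z,
                         1e-12, np.sqrt(r2.sum() / z) + 1e-12)
        return theta

    def truncated_mean(y, alpha):
        """Sample mean after clipping at the lower/upper alpha-percentiles."""
        lo, hi = np.quantile(y, [alpha, 1.0 - alpha])
        return float(np.mean(np.clip(y, lo, hi)))

    rng = np.random.default_rng(0)
    n, n_mc, z, alpha = 100, 2000, 1.5, 0.05
    true_var = np.exp(1.0) * (np.exp(1.0) - 1.0)   # Var(Y) for log Y ~ N(0,1)
    se_huber, se_trunc = [], []
    for _ in range(n_mc):
        y = rng.lognormal(0.0, 1.0, n)
        v_h = huber_mean(y**2, z) - huber_mean(y, z) ** 2
        v_t = truncated_mean(y**2, alpha) - truncated_mean(y, alpha) ** 2
        se_huber.append((v_h - true_var) ** 2)
        se_trunc.append((v_t - true_var) ** 2)
    print("MSE  Huber: %.3f  truncated: %.3f"
          % (np.mean(se_huber), np.mean(se_trunc)))

Sweeping $z$ and $\alpha$ over grids and recording the resulting MSE traces out the smile curves reported below.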

Figure 1: MSE comparisons between the truncated mean and the Huber mean.

Figure 1 compares the MSE of the truncated and Huber variance estimators under the log-normal distribution (left) and the t-distribution (right). The red curve represents the MSE of the truncated method for different $\alpha$ values on the top x-axis, and the blue curve represents the MSE of the tuning-free Huber method for different $z$ values on the bottom x-axis. The error bars in both panels represent the standard errors of the MSE. For convenience of comparison, we focus on the ranges of $\alpha$ and $z$ that exhibit the smile shapes of the MSE of the two methods. Note that the MSE of the sample variance, which is 40.14 under the log-normal distribution and 10.17 under the t-distribution, is too large to be presented in the plot. Figure 1 shows that the Huber variance estimator outperforms the optimally tuned truncated method for $z$ roughly between $(1.5,3.5)$ under the log-normal distribution and between $(1.5,4)$ under the Student's t distribution. The performance gap is particularly large under the t distribution, where the optimal Huber method achieves around 20% less MSE than the optimal truncated method.

Figure 2: QL comparisons between truncation and Huber minimization.

Figure 2 assesses the QL loss of the truncated and Huber estimators and displays a similar pattern to Figure 1. Again, the naive sample variance is much worse, with QL loss 0.1765 under the log-normal distribution and 0.1201 under the t-distribution; we therefore do not present it in the plots. The Huber approach continues to defeat the optimally tuned truncation method for any $z\in(1,2)$ under both distributions of our focus. Together with the $z$ ranges where the Huber approach is superior in terms of MSE, our results suggest that $z=1.5$ can be a good practical choice, at least as a starting point. Such a universal practical choice of $z$ demonstrates the adaptivity of the tuning-free Huber method.

5.2 BTC/USDT volatility forecasting

We use the BTC/USDT daily returns to demonstrate the benefit of using robust proxies in volatility forecasting comparison.

Figure 3: Time series, histogram and Q-Q plot of the BTC/USDT returns.

Figure 3 presents the time series, histogram and normal Q-Q plot of the daily BTC returns from 2019-01-01 to 2021-01-01. It is clear that the distribution of the returns is heavy-tailed, that the volatility is clustered, and that there are extreme daily returns beyond 10% or even 20%. The empirical mean of the returns over this 2-year period is 38 basis points, which is quite close to zero compared with the volatility. We thus assume the population mean of the return is zero, so that the variance of the return reduces to the mean of the squared return. In the sequel, we focus on robust estimation of the mean of the squared returns.

5.2.1 Construction of volatility predictors and proxies

Let $r_t$ denote the daily return of BTC from the end of day $t-1$ to the end of day $t$. We emphasize that a volatility predictor $h_t$ must be ex-ante. Here we construct $h_t$ based on $r_{t-m_b},\dots,r_{t-1}$ in a backward window of size $m_b$ and evaluate it at the end of day $t-1$. Our proxy $\widehat{\sigma}_t^2$ for the unobserved variance $\sigma_t^2$ of $r_t$ is instead based on $r_t,\dots,r_{t+m_f}$ in a forward window of size $m_f$.

We consider two volatility prediction approaches: (i) the vanilla EWMA of the backward squared returns, i.e., $\sum_{s=t-m_b}^{t-1}\nu_{s,t}X_s^2$; (ii) the exponentially weighted Huber predictor proposed in Section 4.1. Each approach is evaluated with half-lives equal to 7 days (1 week) and 14 days (2 weeks), giving rise to four predictors, referred to as EWMA_HL7, EWMA_HL14, Huber_HL7 and Huber_HL14. We always choose $m_b$ to be twice the corresponding half-life and set $z=n^{\ddagger}_{\rm eff}$ for the two Huber predictors. As for volatility proxies, we similarly consider two methods: (i) the vanilla forward EWMA proxy, i.e., $\sum_{s=t}^{t+m_f}w_{s,t}X_s^2$; (ii) the robust Huber proxy proposed in Section 3.3. We set the half-life of the exponential decay weights to be always 7 days, $m_f=14$ and $z=2\log n^{\dagger}_{\rm eff}$. We evaluate the Huber approach on two time series of different lengths, $T=720$ and $180$, which imply two different $c_t$ values used in (3.11). We refer to the two corresponding Huber proxies as Huber_720 and Huber_180. Given the theoretical advantages of the Huber proxy demonstrated in Remarks 7 and 13, we do not investigate the first two proxies proposed in Section 3.3.
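For concreteness, here is a small sketch (ours; the helper names are illustrative) of the half-life weights and the two vanilla EWMA constructions just described:

    import numpy as np

    def decay_weights(n_lags, half_life):
        """Normalized exponential weights for n_lags lags, most recent
        first; lambda = 0.5**(1/half_life) halves the weight every
        half_life days, and the weights sum to one."""
        lam = 0.5 ** (1.0 / half_life)
        w = lam ** np.arange(n_lags)
        return w / w.sum()

    def ewma_predictor(r, t, half_life):
        """Backward EWMA variance predictor h_t from r_{t-m_b},...,r_{t-1},
        with m_b = 2 * half_life as in the text (ex-ante: it uses only
        data available at the end of day t-1)."""
        m_b = 2 * half_life
        nu = decay_weights(m_b, half_life)
        return float(np.dot(nu, r[t - m_b:t][::-1] ** 2))  # r_{t-1} first

    def ewma_proxy(r, t, half_life=7, m_f=14):
        """Forward EWMA proxy for sigma_t^2 from r_t,...,r_{t+m_f}."""
        w = decay_weights(m_f + 1, half_life)
        return float(np.dot(w, r[t:t + m_f + 1] ** 2))     # r_t first

Replacing the weighted averages with the joint Huber solver sketched after Remark 11 (backward weights $\nu_{s,t}$ for Huber_HL7/Huber_HL14, forward weights $w_{s,t}$ with the inflated $c_t$ for Huber_720/Huber_180) would yield the robust counterparts.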

The cryptocurrency market trades 24 hours a day without interruption, which gives us 732 daily returns from 2019-01-01 to 2021-01-01. After removing the first 27 days used for predictor priming and the last 13 days used for proxy priming, we are left with 691 data points. For each day, we compute the four predictors and three proxies described above. We plot the series of squared volatility (variance) proxies in Figure 4. The vanilla EWMA proxy (blue line) is clearly the most volatile, reaching a peak variance of 0.016, or equivalently a volatility of 12.6%, in March 2020, when the outbreak of COVID-19 in the US sparked a flash crash of the crypto market. In contrast, the Huber proxies react in a much milder manner, and the smaller the $T$ we consider, the stronger the truncation effect on the Huber proxies.

Figure 4: Three volatility proxies: the non-robust EWMA proxy (blue) and the robust Huber proxies Huber_720 (orange) and Huber_180 (green). The plotted values are variances.

5.2.2 Volatility forecasting comparison with large TT

With the predictors and proxies computed, we are ready to conduct volatility prediction evaluation and comparison. We first emphasize one issue crucial to the evaluation procedure: the global scale of the predictors. Different loss functions may prefer different global scales of volatility forecasts. For example, QL penalizes underestimation much more than overestimation, since the predictor enters the denominator in the formula of QL; in other words, QL typically favors relatively high forecast values. To remove the impact of scale and focus on the capability of capturing the relative variation of volatility, we also compute optimally scaled versions of our predictors and evaluate their empirical loss. Specifically, we first seek the optimal scale by solving

\[\widehat{\beta}:=\operatorname*{argmin}_{\beta\in\mathbb{R}}\,T^{-1}\sum_{t=1}^T L(\widehat{\sigma}_t^2,\beta h_t)\]

and then use $\{\widehat{\beta}h_t\}_{t\in[T]}$ for prediction. By comparing the empirical risk of the optimally scaled predictors, we can completely eliminate the discrimination of the loss against different global scales. Some algebra yields that for MSE the optimal scale is $\widehat{\beta}_{\mathrm{MSE}}=\sum_t h_t\widehat{\sigma}_t^2/\sum_t h_t^2$, and for QL it is $\widehat{\beta}_{\mathrm{QL}}=T^{-1}\sum_t(\widehat{\sigma}_t^2/h_t)$. Table 1 reports the losses of the four predictors and their optimally scaled versions based on all 691 time points, with the non-robust EWMA proxy and the robust proxy Huber_720. Several interesting observations are in order.

Table 1: Losses of the four predictors (original and optimally scaled) with robust and non-robust proxies, evaluated with all T=691 time points.

MSE
                       EWMA Proxy                               Huber_720 Proxy
             Orig (1e-6)  Scaled (1e-6)  β̂_MSE       Orig (1e-6)  Scaled (1e-6)  β̂_MSE
EWMA_HL14    4.115        3.365          0.55         3.285        2.386          0.50
Huber_HL14   3.162        3.161          1.03         2.233        2.228          0.94
EWMA_HL7     4.824        3.395          0.46         3.930        2.364          0.44
Huber_HL7    3.112        3.110          1.05         2.134        2.133          0.98

QL
                       EWMA Proxy                               Huber_720 Proxy
             Orig         Scaled         β̂_QL        Orig         Scaled         β̂_QL
EWMA_HL14    0.804        0.647          1.67         0.584        0.548          1.29
Huber_HL14   1.352        0.567          2.82         0.831        0.450          2.14
EWMA_HL7     1.239        0.792          2.26         0.720        0.595          1.59
Huber_HL7    2.382        0.702          4.09         1.396        0.532          2.94
  • Using the longer half-life of 14 days gives a smaller QL loss, regardless of whether the predictor is robust or non-robust, original or optimally scaled, and regardless of whether the proxy is robust or non-robust. In terms of MSE, the half-life comparison is mixed: Huber_HL7 is slightly better than Huber_HL14, but EWMA_HL14 is better than EWMA_HL7. We focus on the longer half-life from now on.

  • Looking at the original predictors without optimal scaling, it is clear that MSE favors the robust predictor and QL favors the non-robust predictor, regardless of whether we use robust or non-robust proxies. This confirms that different loss functions can lead to very different comparison results.

  • However, the above inconsistency between MSE and QL is mostly due to scaling, as the column of optimal scales $\widehat{\beta}$ clearly shows. For MSE, the optimal scale of the EWMA predictor is around 0.5, while that of the Huber predictor is around 1. In contrast, for QL, the optimal scale needs to be much larger than 1.0, and Huber needs an even larger scale. Looking at the losses of the optimally scaled predictors, it is interesting to see that the Huber predictor outperforms the EWMA predictor in terms of both MSE (slightly) and QL (substantially). This means the Huber predictor is more capable of capturing the relative change of time-varying volatility than the non-robust predictor.

  • Last but not least, when the sample size $T$ is large compared with $n^{\dagger}_{\rm eff}$ (here $T/n^{\dagger}_{\rm eff}=691/24.87=27.78$), the difference between the EWMA and Huber proxies is small, which explains why they give consistent comparison results. When $T$ is not large enough, as in the next subsection, we will see that the robust proxies give more sensible conclusions.

5.2.3 Volatility forecasting comparison with small TT

Figure 5: 180-day rolling loss difference between EWMA_HL14 and Huber_HL14 with robust and non-robust proxies. The upper panel corresponds to MSE and the lower one to QL.

Figure 6: 180-day rolling loss difference between optimally scaled EWMA_HL14 and Huber_HL14 with robust and non-robust proxies. The upper panel corresponds to MSE and the lower one to QL.

Figure 7: Optimal scales of robust and non-robust predictors with robust and non-robust proxies. The upper panel is MSE and the lower one is QL.

Now suppose we only have $T=180$ data points to evaluate and compare volatility forecasts. In Figure 5, we present the 180-day rolling loss difference, i.e., $T^{-1}\sum_{s=t-179}^{t}\{L(\widehat{\sigma}_s^2,h_{\text{Huber\_HL14},s})-L(\widehat{\sigma}_s^2,h_{\text{EWMA\_HL14},s})\}$ with $t$ ranging from 180 to 691, where $\widehat{\sigma}_s^2$ is either the EWMA proxy or Huber_180. A positive loss difference at $t$ indicates that the EWMA predictor outperformed the Huber predictor over the past 180 days. We see that most of the time, Huber_HL14 defeats EWMA_HL14 (negative loss difference) in terms of MSE, while EWMA_HL14 defeats Huber_HL14 (positive loss difference) in terms of QL. In terms of MSE, robust proxies tend to yield a more consistent comparison between the two predictors throughout the entire period: the upper panel of Figure 5 shows that the period over which EWMA_HL14 outperforms Huber_HL14 is much shorter with the robust proxy (orange curve) than with the EWMA proxy (blue curve). In terms of QL with the EWMA proxy, the lower panel of Figure 5 shows the robust predictor performing much worse than the non-robust predictor, especially towards the end of 2020. However, the small MSE difference at the end of 2020 suggests that the EWMA proxy overestimates the true volatility there and exaggerates the performance gap in terms of QL. With the Huber proxy, the loss gap between the two predictors is much narrower, suggesting that the Huber proxy is more robust against huge volatility.
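The rolling comparison is straightforward to compute once the per-day losses are in hand; a minimal sketch (ours), assuming the usual MSE and QL forms of (2.1):

    import numpy as np

    mse = lambda s2, h: (s2 - h) ** 2
    ql = lambda s2, h: s2 / h - np.log(s2 / h) - 1.0

    def rolling_loss_diff(proxy, h_a, h_b, loss, window=180):
        """window-day rolling mean of L(proxy_s, h_a_s) - L(proxy_s, h_b_s);
        negative values mean predictor a beat predictor b over the window."""
        d = loss(proxy, h_a) - loss(proxy, h_b)
        return np.convolve(d, np.ones(window) / window, mode="valid")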

Figure 6 presents the 180-day rolling loss difference between the optimally scaled Huber_HL14 and EWMA_HL14, based on the robust and EWMA proxies respectively. For MSE, we know from the previous subsection that once the optimal scaling is applied, the two predictors do not differ much in overall loss, with the Huber predictor only slightly ahead. In the upper panel of Figure 6, the robust-proxy-based curve is closer to zero than the EWMA-proxy-based curve, displaying more consistency with our result based on large $T$. For QL, the loss differences using the robust and non-robust proxies look quite similar. We also plot $\widehat{\beta}_{\text{Huber},t}$ and $\widehat{\beta}_{\text{EWMA},t}$ against time $t$, based on robust and non-robust proxies, in Figure 7. For both MSE and QL losses, using the robust proxy leads to more stable optimal scales, which practitioners always prefer.

In a nutshell, we have seen how the proposed robust Huber proxy can lead to better interpretability and more sensible comparison of volatility predictors. When the total sample size is small compared to the local effective sample size, using a robust proxy is necessary and reduces the probability of a misleading forecast evaluation and comparison. When the total sample size is large enough, the proposed robust proxy automatically truncates less and resembles the EWMA proxy; this also provides justification for using a non-robust EWMA proxy when the sample size is large. We still recommend the proposed robust proxy, which adapts to the sample size and the time-varying volatilities: even if the robust proxy truncates data only during a short volatile period, this can still make a significant difference to the risk evaluation.

6 Discussions

Compared with the literature on modeling and predicting volatility, evaluating volatility predictions has not received enough attention. Part of the reason is the lack of a good framework for its study, which makes practical volatility forecast comparison quite subjective and less systematic in terms of loss selection and proxy selection. Patton (2011) is a pioneering work providing a framework based on long-term expectations and guidance for loss selection, while our work gives a new framework based on an empirical deviation perspective and further provides guidance on proxy selection. In our framework, we focus on predictors that achieve the desired probability bound for II, so that the empirical loss is close to the conditional expected loss. The correct comparison, with large probability, of the conditional expected losses of two predictors then relies on a good control of I, which imposes requirements on the proxies.

Within this framework, we proposed three robust proxies, each of which guarantees a good bound for I when the data have only finite fourth moments. Although all three proxies attain the optimal rate of convergence for bounding I, we recommend the exponentially weighted tuning-free Huber proxy. It improves on the clipped squared returns by leveraging the smoothness of neighboring volatilities, and it improves on the proxy based on direct truncation through a better constant in the deviation bound, depending only on the central moment. To construct this proxy, we solve an exponentially weighted Huber loss minimization in which, perhaps surprisingly, the truncation level of each sample must also vary with its weight.

We then applied this proxy to a real BTC volatility forecasting comparison and reached several interesting observations. First, robust predictors, with their better variance control, may use a faster decay to reduce the approximation bias. Second, different losses can lead to drastically different comparisons, so even when restricting to robust losses, loss selection remains a meaningful topic in practice. Third, rescaling predictors according to the loss function is necessary and can further extract the value of robust predictors. Finally, the proposed robust Huber proxy adapts to both the time-varying volatility and the total sample size: when the overall sample size is much larger than the local effective sample size, the robust Huber proxy barely truncates, which justifies even using the EWMA proxy for prediction evaluation, while the robust Huber proxy in theory still gives a high probability of concluding the correct comparison.

There remain limitations of the current work and open questions to be addressed. Assumption 1 excludes the situation where $\sigma_t^2$ depends on previous returns, as in GARCH models. We require $n^{\dagger}_{\rm eff}\to\infty$ for the local performance guarantee, which forces a potentially slow decay; in practice, however, it is hard to know how fast the volatility changes. We also ignored the auto-correlation of returns and assumed temporal independence of the innovations for simplicity. Extensions of the current framework to time series models and more relaxed assumptions are of practical value to investment managers and financial analysts. Our framework may also have nontrivial implications for cross-validation with heavy-tailed data, where validation data are used to construct proxies for the unknown quantity to be estimated robustly: subjectively choosing a truncation level for proxy construction could clearly favor a particular truncation level used by a robust predictor. Motivated by our study, rescaling the optimal truncation level for one data split according to the total (effective) number of sample splits sounds like an interesting idea worth further investigation.

References

  • Andersen, T. G., Bollerslev, T., Christoffersen, P. and Diebold, F. X. (2005). Volatility forecasting.
  • Andersen, T. G., Dobrev, D. and Schaumburg, E. (2012). Jump-robust volatility estimation using nearest neighbor truncation. Journal of Econometrics 169 75–93.
  • Baillie, R. T., Bollerslev, T. and Mikkelsen, H. O. (1996). Fractionally integrated generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 74 3–30.
  • Black, F. and Scholes, M. (2019). The pricing of options and corporate liabilities. In World Scientific Reference on Contingent Claims Analysis in Corporate Finance: Volume 1: Foundations of CCA and Equity Valuation. World Scientific, 3–21.
  • Bollerslev, T. (1986). Generalized autoregressive conditional heteroskedasticity. Journal of Econometrics 31 307–327.
  • Bollerslev, T. and Wooldridge, J. M. (1992). Quasi-maximum likelihood estimation and inference in dynamic models with time-varying covariances. Econometric Reviews 11 143–172.
  • Brailsford, T. J. and Faff, R. W. (1996). An evaluation of volatility forecasting techniques. Journal of Banking & Finance 20 419–438.
  • Brandt, M. W. and Jones, C. S. (2006). Volatility forecasting with range-based EGARCH models. Journal of Business & Economic Statistics 24 470–486.
  • Brockwell, P. J. and Davis, R. A. (2009). Time Series: Theory and Methods. Springer Science & Business Media.
  • Brooks, C. and Persand, G. (2003). Volatility forecasting for risk management. Journal of Forecasting 22 1–22.
  • Bubeck, S. (2014). Convex optimization: Algorithms and complexity. arXiv preprint arXiv:1405.4980.
  • Carnero, M. A., Peña, D. and Ruiz, E. (2012). Estimating GARCH volatility in the presence of outliers. Economics Letters 114 86–90.
  • Catania, L., Grassi, S. and Ravazzolo, F. (2018). Predicting the volatility of cryptocurrency time-series. In Mathematical and Statistical Methods for Actuarial Sciences and Finance. Springer, 203–207.
  • Catoni, O. (2012). Challenging the empirical mean and empirical variance: a deviation study. Annales de l'Institut Henri Poincaré 48 1148–1185.
  • Charles, A. and Darné, O. (2019). Volatility estimation for Bitcoin: Replication and robustness. International Economics 157 23–32.
  • Chen, M., Gao, C. and Ren, Z. (2018). Robust covariance and scatter matrix estimation under Huber's contamination model. The Annals of Statistics 46 1932–1960.
  • Chen, X. and Zhou, W.-X. (2020). Robust inference via multiplier bootstrap. The Annals of Statistics 48 1665–1691.
  • Christiansen, C., Schmeling, M. and Schrimpf, A. (2012). A comprehensive look at financial volatility prediction by economic variables. Journal of Applied Econometrics 27 956–977.
  • Christoffersen, P. F. and Diebold, F. X. (2000). How relevant is volatility forecasting for financial risk management? Review of Economics and Statistics 82 12–22.
  • Engle, R. F. (1982). Autoregressive conditional heteroscedasticity with estimates of the variance of United Kingdom inflation. Econometrica 50 987–1007.
  • Fan, J., Li, Q. and Wang, Y. (2016). Robust estimation of high-dimensional mean regression. Journal of the Royal Statistical Society, Series B.
  • Fan, J., Li, Q. and Wang, Y. (2017). Estimation of high dimensional mean regression in the absence of symmetry and light tail assumptions. Journal of the Royal Statistical Society, Series B 79 247.
  • Guo, T., Xu, Z., Yao, X., Chen, H., Aberer, K. and Funaya, K. (2016). Robust online time series prediction with recurrent neural networks. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA). IEEE.
  • Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics 35 73–101.
  • Huber, P. J. (1973). Robust regression: asymptotics, conjectures and Monte Carlo. The Annals of Statistics 1 799–821.
  • Lamoureux, C. G. and Lastrapes, W. D. (1993). Forecasting stock-return variance: Toward an understanding of stochastic implied volatilities. The Review of Financial Studies 6 293–326.
  • Minsker, S. (2018). Sub-Gaussian estimators of the mean of a random matrix with heavy-tailed entries. The Annals of Statistics 46 2871–2903.
  • Park, B.-J. (2002). An outlier robust GARCH model and forecasting volatility of exchange rate returns. Journal of Forecasting 21 381–393.
  • Patton, A. J. (2011). Volatility forecast comparison using imperfect volatility proxies. Journal of Econometrics 160 246–256.
  • Poon, S.-H. and Granger, C. W. (2003). Forecasting volatility in financial markets: A review. Journal of Economic Literature 41 478–539.
  • Sun, Q., Zhou, W.-X. and Fan, J. (2020). Adaptive Huber regression. Journal of the American Statistical Association 115 254–265.
  • Taylor, J. W. (2004). Volatility forecasting with smooth transition exponential smoothing. International Journal of Forecasting 20 273–286.
  • Taylor, S. J. (1994). Modeling stochastic volatility: A review and comparative study. Mathematical Finance 4 183–204.
  • Trucíos, C. (2019). Forecasting Bitcoin risk measures: A robust approach. International Journal of Forecasting 35 836–847.
  • Vasilellis, G. A. and Meade, N. (1996). Forecasting volatility for portfolio selection. Journal of Business Finance & Accounting 23 125–143.
  • Wang, L., Zheng, C., Zhou, W. and Zhou, W.-X. (2020). A new principle for tuning-free Huber regression. Statistica Sinica.
  • Zhang, D. (2021). Robust estimation of the mean and covariance matrix for high dimensional time series. Statistica Sinica 31 797–820.

Appendix A Proofs

This section provides proof details for all the theorems in the main text.

Proof of Theorem 1.

The proof follows Theorem 5 of Fan et al. (2016). Denote $\phi(x)=\min(|x|,1)\,\mathrm{sgn}(x)$; we have

\[-\log(1-x+x^2)\leq\phi(x)\leq\log(1+x+x^2)\,.\]

Define $r(\theta)=\sum_{s=t}^{t+m}\min(w_{s,t}|X_s^2-\theta|,\tau_t)\,\mathrm{sgn}(X_s^2-\theta)$, so that $\widehat{\sigma}_t^2$ is the solution to $r(\theta)=0$. We bound the moment generating function of $r(\theta)$:

\[\begin{aligned}
\mathbb{E}\{\exp[r(\theta)/\tau_t]\}&\leq\prod_{s=t}^{t+m}\mathbb{E}\{\exp[\phi(w_{s,t}(X_s^2-\theta)/\tau_t)]\}\\
&\leq\prod_{s=t}^{t+m}\mathbb{E}\{1+w_{s,t}(X_s^2-\theta)/\tau_t+w_{s,t}^2(X_s^2-\theta)^2/\tau_t^2\}\\
&\leq\prod_{s=t}^{t+m}\{1+w_{s,t}(\sigma_s^2-\theta)/\tau_t+w_{s,t}^2(\kappa_s+(\sigma_s^2-\theta)^2)/\tau_t^2\}\\
&\leq\exp\bigg\{\sum_{s=t}^{t+m}\big\{w_{s,t}(\sigma_s^2-\theta)/\tau_t+w_{s,t}^2(\kappa_s+(\sigma_s^2-\theta)^2)/\tau_t^2\big\}\bigg\}\,.
\end{aligned}\]

Define $\bar{\sigma_t^2}=\sum_{s=t}^{t+m}w_{s,t}\sigma_s^2=\sigma_t^2+\delta_{0,t}$. The right-hand side can be further bounded by

\[\begin{aligned}
\mathbb{E}\{\exp[r(\theta)/\tau_t]\}&\leq\exp\bigg\{(\bar{\sigma_t^2}-\theta)/\tau_t+\sum_{s=t}^{t+m}w_{s,t}^2(\kappa^{\dagger}_t+2(\sigma_s^2-\sigma_t^2)^2+2(\sigma_t^2-\theta)^2)/\tau_t^2\bigg\}\\
&=\exp\bigg\{(\sigma_t^2-\theta)/\tau_t+\sum_{s=t}^{t+m}w_{s,t}^2(\kappa^{\dagger}_t+2(\sigma_t^2-\theta)^2)/\tau_t^2+\delta_{0,t}/\tau_t+2\delta_{1,t}/\tau_t^2\bigg\}\\
&=\exp\big\{(\sigma_t^2-\theta)/\tau_t+(n^{\dagger}_{\rm eff})^{-1}(\kappa^{\dagger}_t+2(\sigma_t^2-\theta)^2)/\tau_t^2+\delta_{0,t}/\tau_t+2\delta_{1,t}/\tau_t^2\big\}\,.
\end{aligned}\]

Similarly, we can prove that $\mathbb{E}\{\exp[-r(\theta)/\tau_t]\}\leq\exp\{-(\sigma_t^2-\theta)/\tau_t+(n^{\dagger}_{\rm eff})^{-1}(\kappa^{\dagger}_t+2(\sigma_t^2-\theta)^2)/\tau_t^2-\delta_{0,t}/\tau_t+2\delta_{1,t}/\tau_t^2\}$. Define

\[B_+(\theta)=(\sigma_t^2-\theta)+(\kappa^{\dagger}_t+2(\sigma_t^2-\theta)^2)/(n^{\dagger}_{\rm eff}\tau_t)+\tau_t z\,,\qquad B_-(\theta)=(\sigma_t^2-\theta)-(\kappa^{\dagger}_t+2(\sigma_t^2-\theta)^2)/(n^{\dagger}_{\rm eff}\tau_t)-\tau_t z\,.\]

By Chebyshev's inequality,

\[\mathbb{P}(r(\theta)>B_+(\theta))\leq\exp\{-B_+(\theta)/\tau_t\}\,\mathbb{E}\exp\{r(\theta)/\tau_t\}=\exp\{-z+|\delta_{0,t}|/\tau_t+2\delta_{1,t}/\tau_t^2\}\,,\]

and similarly $\mathbb{P}(r(\theta)<B_-(\theta))\leq\exp\{-z+|\delta_{0,t}|/\tau_t+2\delta_{1,t}/\tau_t^2\}$. Following the same argument as Fan et al. (2016), we can show that for $n^{\dagger}_{\rm eff}$ large enough that $8/(n^{\dagger}_{\rm eff}\tau_t)\cdot(\kappa^{\dagger}_t/(n^{\dagger}_{\rm eff}\tau_t)+\tau_t z)\leq 1$, the root $\theta_+$ of $B_+(\theta)$ satisfies

\[\theta_+\leq\sigma_t^2+2(\kappa^{\dagger}_t/(n^{\dagger}_{\rm eff}\tau_t)+\tau_t z)\,,\]

and the root $\theta_-$ of $B_-(\theta)$ satisfies

\[\theta_-\geq\sigma_t^2-2(\kappa^{\dagger}_t/(n^{\dagger}_{\rm eff}\tau_t)+\tau_t z)\,.\]

With the choice $\tau_t=\sqrt{\kappa^{\dagger}_t/(n^{\dagger}_{\rm eff}z)}$ given in Theorem 1, the two terms in the bracket balance, $\kappa^{\dagger}_t/(n^{\dagger}_{\rm eff}\tau_t)=\tau_t z=\sqrt{\kappa^{\dagger}_t z/n^{\dagger}_{\rm eff}}$, so we obtain $\mathbb{P}(|\widehat{\sigma}_t^2-\sigma_t^2|\leq 4\sqrt{\kappa^{\dagger}_t z/n^{\dagger}_{\rm eff}})\geq 1-2e^{-z+|\delta_{0,t}|/\tau_t+2\delta_{1,t}/\tau_t^2}$, and the requirement on the effective sample size becomes $n^{\dagger}_{\rm eff}\geq 16z$. ∎

Proof of Theorem 2.

We extend Theorem 2.1 of Wang et al. (2020) to the weighted case. Note again that we are solving the following equation for $\widehat{\tau}_t$:

\[\sum_{s=t}^{t+m}\frac{\min\big(w_{s,t}^2|X_s^2-\sigma_t^2|^2,\tau_t^2\big)}{\tau_t^2}-z=0\,.\]

Also define $\tau_t$ as the solution of the corresponding population equation:

\[\sum_{s=t}^{t+m}\frac{\mathbb{E}\big[\min\big(w_{s,t}^2|X_s^2-\sigma_t^2|^2,\tau_t^2\big)\big]}{\tau_t^2}-z=0\,.\]

We first show that (a) $\tau_t\asymp\sqrt{\kappa^{\dagger}_t/(n^{\dagger}_{\rm eff}z)}$, and then (b) with probability approaching 1, $|\widehat{\tau}_t/\tau_t-1|\leq c_0$ for a small fixed $c_0$. To prove (a), it is straightforward to see that

\[\tau_t^2 z\leq\sum_{s=t}^{t+m}w_{s,t}^2\,\mathbb{E}|X_s^2-\sigma_t^2|^2=\sum_{s=t}^{t+m}w_{s,t}^2(\kappa_s+(\sigma_s^2-\sigma_t^2)^2)=\frac{\kappa^{\dagger}_t+\delta_{1,t}n^{\dagger}_{\rm eff}}{n^{\dagger}_{\rm eff}}\leq\frac{(1+c_1)\kappa^{\dagger}_t}{n^{\dagger}_{\rm eff}}\,.\]

Furthermore,

\[\tau_t^2 z=\sum_{s=t}^{t+m}\mathbb{E}\big[(w_{s,t}^2|X_s^2-\sigma_t^2|^2)\wedge\tau_t^2\big]\geq\sum_{s=t}^{t+m}\tau_t^2\,\mathbb{P}\big(w_{s,t}^2|X_s^2-\sigma_t^2|^2>\tau_t^2\big)\,.\]

Therefore, $z\geq\sum_{s=t}^{t+m}\mathbb{P}(w_{s,t}^2|X_s^2-\sigma_t^2|^2>\tau_t^2)$. Consider the solution $q_s(a)$ of the equation $\mathbb{P}(|X_s^2-\sigma_t^2|^2>qa)=a^{-1}$ in the variable $q$; the solution is unique. Since all the $w_{s,t}$ are of the same order, we know that $w_{s,t}\asymp 1/m$ and $n^{\dagger}_{\rm eff}\asymp m$. Let $a=a_s=(\sum_{s=t}^{t+m}w_{s,t}^2)/(zw_{s,t}^2)\asymp m/z$, with corresponding solution $q_s(a_s)$, and define $q_{\min}=\min_s q_s(a_s)$. Then

\[\mathbb{P}\bigg[w_{s,t}^2|X_s^2-\sigma_t^2|^2>\frac{q_{\min}}{z}\bigg(\sum_{s=t}^{t+m}w_{s,t}^2\bigg)\bigg]\geq\mathbb{P}\bigg[|X_s^2-\sigma_t^2|^2>q_s(a_s)\bigg(\sum_{s=t}^{t+m}w_{s,t}^2\bigg)\Big/(zw_{s,t}^2)\bigg]=\frac{zw_{s,t}^2}{\sum_{s=t}^{t+m}w_{s,t}^2}\,.\]

Let $\tau_0^2=\frac{q_{\min}}{z}\sum_{s=t}^{t+m}w_{s,t}^2=\frac{q_{\min}}{n^{\dagger}_{\rm eff}z}$, so we have shown that $\sum_{s=t}^{t+m}\mathbb{P}(w_{s,t}^2|X_s^2-\sigma_t^2|^2>\tau_0^2)\geq z$, and therefore $\tau_t\geq\tau_0$. From $\mathbb{P}(|X_s^2-\sigma_t^2|^2>qa)=a^{-1}$, we know that for any $s$, $q_s(a_s)\asymp a_s^{-1}\asymp z/m$, and so is $q_{\min}$. Therefore $\tau_t/w_{s,t}\geq\tau_0/w_{s,t}\asymp m\tau_0\asymp\sqrt{q_{\min}m/z}\asymp 1$; write $\tau_t/w_{s,t}\geq c_2$ for some constant $c_2>0$. Then

\[\tau_t^2 z=\sum_{s=t}^{t+m}\mathbb{E}\big[(w_{s,t}^2|X_s^2-\sigma_t^2|^2)\wedge\tau_t^2\big]\geq\sum_{s=t}^{t+m}w_{s,t}^2\,\mathbb{E}\big[|X_s^2-\sigma_t^2|^2\wedge c_2^2\big]\asymp\frac{\kappa^{\dagger}_t}{n^{\dagger}_{\rm eff}}\,.\]

So we have shown that (a) holds, that is, $\tau_t\asymp\sqrt{\kappa^{\dagger}_t/(n^{\dagger}_{\rm eff}z)}$.

Next, we show (b), so that the solution $\widehat{\tau}_t$ of the empirical equation gives the desired optimal truncation rate. To this end, we follow the proof of Theorem 1 of Wang et al. (2020) closely. Specifically, define $Y_s=w_{s,t}|X_s^2-\sigma_t^2|$ and, in their notation,

\[p_n(t)=\sum_s\frac{Y_s^2 I(Y_s\leq t)}{t^2}\,,\qquad q_n(t)=\sum_s\frac{Y_s^2\wedge t^2}{t^2}\,,\]

with population versions

\[p(t)=\sum_s\frac{\mathbb{E}[Y_s^2 I(Y_s\leq t)]}{t^2}\,,\qquad q(t)=\sum_s\frac{\mathbb{E}[Y_s^2\wedge t^2]}{t^2}\,.\]

One important fact is that $q_n'(t)=-2t^{-1}p_n(t)$ and $q'(t)=-2t^{-1}p(t)$, which is the key to the proof of Theorem 1 of Wang et al. (2020). The only difference in our setting is that we do not assume the $Y_s$ are identically distributed. So when applying Bernstein's inequality as in (S1.8) of Wang et al. (2020), we need the version for non-identically distributed variables and must bound the sum of the individual variances. Specifically, defining $\zeta_s=\frac{Y_s^2\wedge\tau_t^2}{\tau_t^2}$, we have $0\leq\zeta_s\leq\min\{1,(Y_s\wedge\tau_t)/\tau_t\}$, and hence $\sum_s\mathbb{E}[\zeta_s^2]\leq\sum_s\mathbb{E}\big[\frac{Y_s^2\wedge\tau_t^2}{\tau_t^2}\big]=q(\tau_t)=z$. So we can indeed apply Bernstein's inequality to $\sum_s\zeta_s$. For more details, we refer the interested reader to Wang et al. (2020). ∎

Proof of Theorem 3.

Let $\widehat{\sigma}_t=(\widehat{\sigma}_c)_t$. Then

\[\begin{aligned}
\mathbb{P}\bigg[\sum_{t=1}^T(\widehat{\sigma}_t^2/\sigma_t^2-1)q_t\geq y\bigg]&=\mathbb{P}\bigg[\sum_{t=1}^T\frac{q_t}{\sigma_t^2}(X_t^2\wedge c_t-\mathbb{E}[X_t^2\wedge c_t])+\sum_{t=1}^T\bigg(\frac{\mathbb{E}[X_t^2\wedge c_t]}{\sigma_t^2}-1\bigg)q_t\geq y\bigg]\\
&\leq\mathbb{P}\bigg[\sum_{t=1}^T\frac{q_t}{\sigma_t^2}(X_t^2\wedge c_t-\mathbb{E}[X_t^2\wedge c_t])\geq y/2\bigg]+\mathbb{P}\bigg[\sum_{t=1}^T\bigg(\frac{\mathbb{E}[X_t^2\wedge c_t]}{\sigma_t^2}-1\bigg)q_t\geq y/2\bigg]\,.
\end{aligned}\]

To bound the first term, we apply the Bernstein inequality to $\sum_t Y_t$ with $Y_t=\frac{q_t}{\sigma_t^2}(X_t^2\wedge c_t-\mathbb{E}[X_t^2\wedge c_t])$. Note that $\mathbb{E}[Y_t^2]\leq\tilde{\kappa}_t q_t^2/\sigma_t^4\leq\tilde{\kappa}_t Q^2/\sigma_t^4$ and $|Y_t|\leq 2c_t Q/\sigma_t^2$, so we can choose $y=CQ\sqrt{zT}$ to bound the first term by $e^{-z}$.

To bound the second term, note that

\[\big|\mathbb{E}[X_t^2\wedge c_t]-\sigma_t^2\big|=\mathbb{E}[X_t^2 I(X_t^2>c_t)]-c_t\,\mathbb{P}(X_t^2>c_t)\leq\mathbb{E}\bigg[X_t^2\cdot\frac{X_t^2}{c_t}\bigg]=\frac{\mathbb{E}[X_t^4]}{c_t}=\tilde{\kappa}_t\sqrt{\frac{z}{\tilde{\kappa}^{\dagger}_t T}}\,.\]

Here we can again choose $y=CQ\sqrt{zT}$ for a large enough $C$ to make the second probability equal to 0. ∎

Lemma 1.

Let $\{Y_t\}$ be a process such that $Y_t=h_t(X_t,X_{t-1},\dots)$. Define

\[\gamma_j:=\max_t\|h_t(X_t,X_{t-1},\dots)-h_t(X_t,\dots,X_{t-j+1},X_{t-j}',X_{t-j-1},\dots)\|_2\]

for any $j\geq 0$, where $X_{t-j}'$ is an iid copy of $X_{t-j}$ and $\{X_t\}$ are independent random innovations satisfying Assumption 1. Assume $\mathbb{E}[Y_t]=0$ and $|Y_t|\leq M$ for all $t$, and that there exists a constant $\rho\in(0,1)$ such that

\[\|Y_\cdot\|_2:=\sup_{k\geq 0}\rho^{-k}\sum_{j=k}^{\infty}\gamma_j<\infty\,.\]

Also assume $T\geq 4\vee(\log(\rho^{-1})/2)$. Then for $y>0$,

\[\mathbb{P}\bigg(\sum_{t=1}^T Y_t\geq y\bigg)\leq\exp\bigg\{-\frac{y^2}{4C_1(T\|Y_\cdot\|_2^2+M^2)+2C_2M(\log T)^2y}\bigg\}\,,\]

where $C_1=2\max\{(e^4-5)/4,\,[\rho(1-\rho)\log(\rho^{-1})]^{-1}\}\cdot(8\vee\log(\rho^{-1}))^2$ and $C_2=\max\{(c\log 2)^{-1},\,[1\vee(\log(\rho^{-1})/8)]\}$ with $c=[\log(\rho^{-1})/8]\wedge\sqrt{(\log 2)\log(\rho^{-1})/4}$.

The proof of Lemma 1 follows closely Theorem 2.1 of Zhang (2021). The only extension is that we do not require $X_t$, or even $X_t/\sigma_t$, to be identically distributed, so there is no stationarity assumption on the process $Y_t$; in exchange, we require a stronger assumption on the maximal perturbation of each $Y_t=h_t(X_t,X_{t-1},\dots)$. The entire proof of Zhang (2021) goes through with this new definition of $\|Y_\cdot\|_2$ and $\gamma_j$, and we omit the details.

Proof of Theorem 4.

Let $\widehat{\sigma}_t=(\widehat{\sigma}_e)_t$ and define $Y_t=\frac{q_t}{\sigma_t^2}(\widehat{\sigma}_t^2-\mathbb{E}[\widehat{\sigma}_t^2])$. It is not hard to see that for $j\leq m$,

\[\begin{aligned}
\gamma_j^2&=\max_t\mathbb{E}\big\{[(w_{j,0}X_{t+j}^2)\wedge c_t-(w_{j,0}{X_{t+j}'}^2)\wedge c_t]^2\big\}\frac{q_t^2}{\sigma_t^4}\\
&\leq\max_t\mathbb{E}\big\{(w_{j,0}X_{t+j}^2-w_{j,0}{X_{t+j}'}^2)^2+(w_{j,0}X_{t+j}^2-c_t)^2 I(w_{j,0}X_{t+j}^2<c_t,\,w_{j,0}{X_{t+j}'}^2>c_t)\\
&\qquad\quad+(w_{j,0}{X_{t+j}'}^2-c_t)^2 I(w_{j,0}X_{t+j}^2>c_t,\,w_{j,0}{X_{t+j}'}^2<c_t)\big\}\frac{Q^2}{\sigma_t^4}\\
&\leq 4w_{j,0}^2Q^2\max_t\tilde{\kappa}_{t+j}/\sigma_t^4\leq 4w_{j,0}^2Q^2\max_{t,u\leq m}\tilde{\kappa}_{t+u}/\sigma_t^4\,,
\end{aligned}\]

and for $j>m$, $\gamma_j=0$. Therefore,

\[\|Y_\cdot\|_2=\sup_k\rho^{-k}\sum_{j=k}^{\infty}\gamma_j\leq 2Q\sqrt{\max_{t,u\leq m}\tilde{\kappa}_{t+u}/\sigma_t^4}\,\sup_k\rho^{-k}\sum_{j=k}^{m}w_{j,0}\leq 2Q\sqrt{\max_{t,u\leq m}\tilde{\kappa}_{t+u}/\sigma_t^4}<\infty\,,\]

for any fixed $\rho\in(0,1)$.

In addition, we claim that $|Y_t|\leq CQ\sqrt{z}$ with high probability. To prove this, we need the following result: when $n^{\dagger}_{\rm eff}\geq 16\tilde{z}$ and $(n^{\dagger}_{\rm eff}c_t)^2\geq 16\tilde{\kappa}^{\dagger}_t$,

\[\mathbb{P}\big(|\widehat{\sigma}_t^2-\sigma_t^2|\leq 2(\tilde{\kappa}^{\dagger}_t/(n^{\dagger}_{\rm eff}c_t)+c_t\tilde{z})\big)\geq 1-2e^{-\tilde{z}+|\delta_{0,t}|/c_t+2\delta_{1,t}/c_t^2}\,.\]

This follows from a proof similar to that of Theorem 1, so we omit the details. Note that here $c_t$ is not chosen to optimize the error bound: the average over $T$ takes care of the extra variance in $\widehat{\sigma}_t^2$, so we only need $c_t$ to keep the error bound of order $\sqrt{z}$. Taking $c_t=\sqrt{\frac{\tilde{\kappa}^{\dagger}_t T}{n^{\dagger 2}_{\rm eff}z}}$ and $\tilde{z}=czn^{\dagger}_{\rm eff}/\sqrt{T}$ indeed does the job, since the exception probability is $2e^{-\frac{c}{2}n^{\dagger}_{\rm eff}z/\sqrt{T}}$ under the assumption that $\max_t|\delta_{0,t}|/\sqrt{\tilde{\kappa}^{\dagger}_t}+2\delta_{1,t}n^{\dagger}_{\rm eff}/(\tilde{\kappa}^{\dagger}_t\sqrt{T})\leq c/2$. We require $|Y_t|\leq CQ\sqrt{z}$ to hold at all time points, so the exception probability over all events is bounded by $2Te^{-\frac{c}{2}n^{\dagger}_{\rm eff}z/\sqrt{T}}$; when $n^{\dagger}_{\rm eff}>2c^{-1}(1+\log(2T)/z)\sqrt{T}$, this is further bounded by $e^{-z}$. Finally, $n^{\dagger}_{\rm eff}\geq 16\tilde{z}$ and $(n^{\dagger}_{\rm eff}c_t)^2\geq 16\tilde{\kappa}^{\dagger}_t$ translate into the requirements $\sqrt{T}\geq 16cz$ and $T\geq 16z$. Now, conditioning on $|Y_t|\leq CQ\sqrt{z}$, we are ready to apply Lemma 1 to $Y_t$: choosing $y=CQ\sqrt{zT}$ in Lemma 1 makes the exception probability smaller than $e^{-z}$. In total, the exception probability is $2e^{-z}$.

Next, for the bias term, $\mathbb{E}[\widehat{\sigma}_t^2]-\sigma_t^2=\sum_{s=t}^{t+m}w_{s,t}\mathbb{E}[\min(X_s^2,c_t/w_{s,t})-\sigma_s^2]+(\sum_{s=t}^{t+m}w_{s,t}\sigma_s^2-\sigma_t^2)$. From the proof of Theorem 3, we know that $\mathbb{E}[\min(X_s^2,c_t/w_{s,t})-\sigma_s^2]\leq 2w_{s,t}\tilde{\kappa}_s/c_t$. Therefore,

\[\frac{\mathbb{E}[\widehat{\sigma}_t^2]}{\sigma_t^2}-1\leq\frac{2}{\sigma_t^2}\sum_{s=t}^{t+m}w_{s,t}^2\tilde{\kappa}_s/c_t+\frac{\delta_{0,t}}{\sigma_t^2}=2\sqrt{\frac{\tilde{\kappa}^{\dagger}_t z}{\sigma_t^4 T}}+\frac{\delta_{0,t}}{\sigma_t^2}\,.\]

Thus the bias term does not affect the total error bound stated in the theorem. ∎

Lemma 2.

Assume the weighted Huber loss $F(\theta)={\cal L}_{c_t}(\theta;\{w_{s,t}\}_{s=t}^{t+m})$ is $\alpha$-strongly convex for some $\alpha\in(0,1/2)$ in a local neighborhood $\Theta$ around $\theta^*\in\Theta$, and let $\tilde{F}(\theta)$ be a perturbed version of $F(\theta)$. If $\theta^*=\operatorname*{argmin}_\theta F(\theta)$, $\tilde{\theta}^*=\operatorname*{argmin}_\theta\tilde{F}(\theta)$ and $\tilde{\theta}^*\in\Theta$, then

\[|\tilde{\theta}^*-\theta^*|\leq\frac{\alpha+1}{\alpha}\sup_\theta|\tilde{F}'(\theta)-F'(\theta)|\,.\]
Proof of Lemma 2.

Besides strong convexity, $F(\theta)$ is $\beta$-smooth with $\beta=1$; that is,

\[F(\theta_1)-F(\theta_2)-F'(\theta_2)(\theta_1-\theta_2)\leq\frac{\beta}{2}(\theta_1-\theta_2)^2\,,\]

where $\beta=1$ is obvious given that the second derivative of $F(\theta)$ is bounded by 1. From Lemma 3.11 of Bubeck (2014), for $\theta_1,\theta_2\in\Theta$,

\[\frac{\alpha}{\alpha+1}(\theta_1-\theta_2)^2\leq(F'(\theta_1)-F'(\theta_2))(\theta_1-\theta_2)\leq\frac{\alpha}{2(\alpha+1)}(\theta_1-\theta_2)^2+\frac{\alpha+1}{2\alpha}(F'(\theta_1)-F'(\theta_2))^2\,.\]

Choosing $\theta_1=\tilde{\theta}^*$ and $\theta_2=\theta^*$, we have $F'(\theta_2)=0=\tilde{F}'(\theta_1)$, so

\[\frac{\alpha}{\alpha+1}(\tilde{\theta}^*-\theta^*)^2\leq\frac{\alpha+1}{\alpha}(F'(\tilde{\theta}^*)-\tilde{F}'(\tilde{\theta}^*))^2\,,\]

which concludes the proof. ∎

Proof of Theorem 5.

Let $\widehat{\sigma}_t=(\widehat{\sigma}_H)_t$ and define $Y_t=\frac{q_t}{\sigma_t^2}(\widehat{\sigma}_t^2-\mathbb{E}[\widehat{\sigma}_t^2])$. In order to apply Lemma 1, we employ Lemma 2 to bound the perturbation of the Huber loss minimizer via the perturbation of the Huber loss derivative. Similarly to the proof of Theorem 4, we can show that $|\widehat{\sigma}_t^2-\sigma_t^2|\leq C\sqrt{z}$ for all $t$ with probability at least $1-e^{-z}$. In fact, following the proof of Theorem 4, we can write out the bound explicitly: $|\widehat{\sigma}_t^2-\sigma_t^2|\leq 2(\sqrt{\kappa_t^{\dagger}z/T}+c\sqrt{\kappa_t^{\dagger}z})<(2c+2)\sqrt{\kappa_t^{\dagger}z}$, which means the Huber loss minimizer does fall into the region where we have strong convexity, by the assumptions of Theorem 5. The bound on $|\widehat{\sigma}_t^2-\sigma_t^2|$ also implies that $|Y_t|\leq CQ\sqrt{z}$ with exception probability $e^{-z}$.

In addition, we check that $\|Y_\cdot\|_2<\infty$. For $j>m$, $\gamma_j=0$, and for $j\leq m$,

\[\begin{aligned}
\gamma_j^2&\leq\frac{(\alpha+1)^2}{\alpha^2}\max_t\max_{\sigma_t^2}\mathbb{E}\big\{\big[(w_{j,0}|X_{t+j}^2-\sigma_t^2|)\wedge c_t-(w_{j,0}|{X_{t+j}'}^2-\sigma_t^2|)\wedge c_t\big]^2\big\}\frac{q_t^2}{\sigma_t^4}\\
&\leq\frac{(\alpha+1)^2}{\alpha^2}\max_t\max_{\sigma_t^2}\mathbb{E}\big\{(w_{j,0}|X_{t+j}^2-\sigma_t^2|-w_{j,0}|{X_{t+j}'}^2-\sigma_t^2|)^2\\
&\qquad+(w_{j,0}|X_{t+j}^2-\sigma_t^2|-c_t)^2 I(w_{j,0}|X_{t+j}^2-\sigma_t^2|<c_t,\,w_{j,0}|{X_{t+j}'}^2-\sigma_t^2|>c_t)\\
&\qquad+(w_{j,0}|{X_{t+j}'}^2-\sigma_t^2|-c_t)^2 I(w_{j,0}|X_{t+j}^2-\sigma_t^2|>c_t,\,w_{j,0}|{X_{t+j}'}^2-\sigma_t^2|<c_t)\big\}\frac{Q^2}{\sigma_t^4}\\
&\leq 4w_{j,0}^2Q^2\max_t\kappa_{t+j}/\sigma_t^4\leq 4w_{j,0}^2Q^2\max_{t,u\leq m}\kappa_{t+u}/\sigma_t^4\,.
\end{aligned}\]

Therefore $\|Y_\cdot\|_2<\infty$ for any fixed $\rho\in(0,1)$, and we can indeed apply Lemma 1 to bound the sum of the $Y_t$, which is of order $CQ\sqrt{zT}$ with exception probability another $e^{-z}$.

Finally, we bound the bias term $\mathbb{E}[\widehat{\sigma}_t^2]/\sigma_t^2-1$. Note that

\[\begin{aligned}
0&=\sum_{s=t}^{t+m}w_{s,t}\mathbb{E}\bigg[\min\bigg(|X_s^2-\widehat{\sigma}_t^2|,\frac{c_t}{w_{s,t}}\bigg)\mathrm{sgn}(X_s^2-\widehat{\sigma}_t^2)\bigg]\\
&=\sum_{s=t}^{t+m}w_{s,t}\mathbb{E}\bigg[(X_s^2-\widehat{\sigma}_t^2)I\bigg(|X_s^2-\widehat{\sigma}_t^2|\leq\frac{c_t}{w_{s,t}}\bigg)+\frac{c_t}{w_{s,t}}\,\mathrm{sgn}(X_s^2-\widehat{\sigma}_t^2)I\bigg(|X_s^2-\widehat{\sigma}_t^2|>\frac{c_t}{w_{s,t}}\bigg)\bigg]\\
&=\delta_{0,t}+\mathbb{E}[X_t^2-\widehat{\sigma}_t^2]+\sum_{s=t}^{t+m}w_{s,t}\mathbb{E}\bigg[-(X_s^2-\widehat{\sigma}_t^2)I\bigg(|X_s^2-\widehat{\sigma}_t^2|>\frac{c_t}{w_{s,t}}\bigg)+\frac{c_t}{w_{s,t}}\,\mathrm{sgn}(X_s^2-\widehat{\sigma}_t^2)I\bigg(|X_s^2-\widehat{\sigma}_t^2|>\frac{c_t}{w_{s,t}}\bigg)\bigg]\,.
\end{aligned}\]

Similar to the proof of Theorem 3, the $s$-th component of the third term can be bounded by $2w_{s,t}\mathbb{E}[(X_{s}^{2}-\widehat{\sigma}_{t}^{2})^{2}]/c_{t}\leq 2w_{s,t}(2\kappa_{s}+2(\sigma_{s}^{2}-\sigma_{t}^{2})^{2}+2\mathbb{E}[(\sigma_{t}^{2}-\widehat{\sigma}_{t}^{2})^{2}])/c_{t}$. Therefore

\begin{aligned}
\frac{\mathbb{E}[\widehat{\sigma}_{t}^{2}]}{\sigma_{t}^{2}}-1 &\leq\frac{4}{\sigma_{t}^{2}}\sum_{s=t}^{t+m}w_{s,t}^{2}(\kappa_{s}+\mathbb{E}[(\sigma_{t}^{2}-\widehat{\sigma}_{t}^{2})^{2}])/c_{t}+\frac{\delta_{0,t}+4\delta_{1,t}}{\sigma_{t}^{2}}\\
&=4\sqrt{\frac{\kappa^{\dagger}_{t}z}{\sigma_{t}^{4}T}}+\frac{\delta_{0,t}+4\delta_{1,t}}{\sigma_{t}^{2}}+\frac{4\mathbb{E}[(\sigma_{t}^{2}-\widehat{\sigma}_{t}^{2})^{2}]}{\sigma_{t}^{4}}\sqrt{\frac{\sigma_{t}^{4}z}{\kappa^{\dagger}_{t}T}}.
\end{aligned}

Furthermore, it is not hard to show that $\mathbb{E}[(\sigma_{t}^{2}-\widehat{\sigma}_{t}^{2})^{2}]$ is bounded, using Lemma 2. Therefore, the bias is indeed of order $\sqrt{z/T}$, up to the additional approximation error $\frac{\delta_{0,t}+4\delta_{1,t}}{\sigma_{t}^{2}}$. The proof is now complete. ∎
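For concreteness, the sample analogue of the first-order condition above can be solved by bisection, since the weighted Huber score is non-increasing in the volatility argument. The sketch below is illustrative only: the exponential weights, the robustification level $c$, and the simulated heavy-tailed data are assumptions, not the paper's tuning-free procedure.

```python
import numpy as np

def weighted_huber_score(theta, x2, w, c):
    # sample version of sum_s w_s * min(|X_s^2 - theta|, c/w_s) * sgn(X_s^2 - theta)
    r = x2 - theta
    return np.sum(w * np.minimum(np.abs(r), c / w) * np.sign(r))

def solve_sigma2(x2, w, c, tol=1e-10):
    # the score is non-increasing in theta, so bisection on [min(x2), max(x2)] works
    lo, hi = float(x2.min()), float(x2.max())
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if weighted_huber_score(mid, x2, w, c) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
x2 = rng.standard_t(df=5, size=500) ** 2       # heavy-tailed squared returns
w = 0.97 ** np.arange(500); w /= w.sum()       # exponential weights (assumed)
print(solve_sigma2(x2, w, c=5.0 * w.max()))    # robust volatility estimate
```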

Proof of Theorem 6.

Following the same proof as that of Theorem 1, we have

\mathbb{P}\big(|h_{t}-\sigma_{t}^{2}|\leq 4\sqrt{\kappa^{\ddagger}_{t}z/n^{\ddagger}_{\mathrm{eff}}}\big)\geq 1-2e^{-z+|\Delta_{0,t}|\sqrt{n^{\ddagger}_{\mathrm{eff}}z/\kappa^{\ddagger}_{t}}+2\Delta_{1,t}n^{\ddagger}_{\mathrm{eff}}z/\kappa^{\ddagger}_{t}}.

Conditioning on this event, $L$ is Lipschitz in the second argument, and the error bound is thus enlarged by a factor of $B$. ∎

Proof of Theorem 7.

Recall that

\begin{aligned}
\text{II} &\leq\mathbb{P}\bigg(\bigg|\frac{1}{T}\sum_{t=1}^{T}(L(\sigma_{t}^{2},h_{t})-\mathbb{E}[L(\sigma_{t}^{2},h_{t})])\bigg|>\frac{\varepsilon}{2}\bigg)+\mathbb{P}\bigg(\bigg|\frac{1}{T}\sum_{t=1}^{T}(C(h_{t})-\mathbb{E}[C(h_{t})])(\widehat{\sigma}_{t}^{2}-\sigma_{t}^{2})\bigg|>\frac{\varepsilon}{2}\bigg)\\
&=:\Delta_{A}+\Delta_{B}\,.
\end{aligned}

Now let us prove (i). We first bound $\Delta_{A}$, again applying Lemma 1 for the concentration. Let $Y_{t}=L(\sigma_{t}^{2},h_{t})-\mathbb{E}[L(\sigma_{t}^{2},h_{t})]$. From the proof of Theorem 6, we know that

\mathbb{P}\big(|h_{t}-\sigma_{t}^{2}|\leq 4\sqrt{\kappa^{\ddagger}_{t}z/n^{\ddagger}_{\mathrm{eff}}}\big)\geq 1-2e^{-z/2}\,.

So $|h_{t}-\sigma_{t}^{2}|\leq 4\sqrt{\kappa^{\ddagger}_{t}z/n^{\ddagger}_{\mathrm{eff}}}\leq\sigma_{t}^{2}/2$ with exceptional probability $2e^{-z/2}$. Applying the union bound, we get $|h_{t}-\sigma_{t}^{2}|\leq\sigma_{t}^{2}/2$ for all $t$ with probability at least $1-2Te^{-z/2}$. On this event, we have $|Y_{t}|\leq C$ because $|L(\sigma_{t}^{2},h_{t})-L(\sigma_{t}^{2},\sigma_{t}^{2})|\leq B(\sigma_{t}^{2}/2)|h_{t}-\sigma_{t}^{2}|\leq B(\sigma_{t}^{2}/2)\sigma_{t}^{2}/2<\infty$. Similar to the previous proofs, except that now we look at the data backward, for $j\geq 0$,

\begin{aligned}
\gamma_{j}^{2} &=\max_{t}\mathbb{E}\Big\{[L(\sigma_{t}^{2},h_{t}(X_{t-1},\dots,X_{t-m}))-L(\sigma_{t}^{2},h_{t}(X_{t-1},\dots,X_{t-j-1}^{\prime},\dots,X_{t-m}))]^{2}\Big\}\\
&\leq\max_{t}B^{2}(\sigma_{t}^{2}/2)\max_{t}\mathbb{E}\Big\{[h_{t}(X_{t-1},\dots,X_{t-m})-h_{t}(X_{t-1},\dots,X_{t-j-1}^{\prime},\dots,X_{t-m})]^{2}\Big\}\\
&\leq 4\nu_{-j,1}^{2}\max_{t}B^{2}(\sigma_{t}^{2}/2)\frac{(\alpha+1)^{2}}{\alpha^{2}}\max_{t}\kappa_{t}\,.
\end{aligned}

The last inequality can be shown similarly to the proof of Theorem 5, using the assumption that the Huber loss is locally $\alpha$-strongly convex. Therefore, $\|Y_{\cdot}\|_{2}<\infty$ for any fixed $\rho\in(0,1)$. We apply Lemma 1 to $Y_{t}$ and again pick $y=C\sqrt{zT}$ to make the exceptional probability $2e^{-z}$, which gives $\mathbb{P}(|\Delta_{A}|\leq C\sqrt{z/T})\geq 1-2(T+1)e^{-z}$.

Now let us apply Lemma 1 to $\Delta_{B}$. Let $Y_{t}=(C(h_{t})-\mathbb{E}[C(h_{t})])(\widehat{\sigma}_{t}^{2}-\sigma_{t}^{2})$. Since $C(\cdot)$ is non-increasing and $h_{t}\geq\sigma_{t}^{2}/2$ on the above event, $C(h_{t})\leq C(\sigma_{t}^{2}/2)$ is bounded. If we use the second or third proxy, the proofs of Theorems 4 and 5 have shown that $|\widehat{\sigma}_{t}^{2}-\sigma_{t}^{2}|\leq C\sqrt{z}$ for all $t$ with exceptional probability at most $e^{-z}$. Therefore, we conclude $|Y_{t}|\leq C\sqrt{z}$ for all $t$ with exceptional probability at most $e^{-z}$. Now, for bounding $\gamma_{j}$, note that $Y_{t}$ is actually a function of $X_{t-m},\dots,X_{t-1},X_{t},\dots,X_{t+m}$, with the first $m$ data points constructing the predictor and the remaining $m+1$ constructing the proxy. Hence, for $j<m$, it is not hard to show $\gamma_{j}^{2}\leq C\nu_{-(m-j),1}^{2}$ for some $C>0$, and for $m\leq j\leq 2m$, $\gamma_{j}^{2}\leq Cw_{j-m,0}^{2}$. So we have $\|Y_{\cdot}\|_{2}\leq 2\sqrt{C}<\infty$. Applying Lemma 1 again gives $\mathbb{P}(|\Delta_{B}|\leq C\sqrt{z/T})\geq 1-3e^{-z}$.

Combining the results for $\Delta_{A}$ and $\Delta_{B}$ and choosing $\varepsilon=C\sqrt{z/T}$ for large enough $C$, we conclude (i) for bounding II.
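The practical content of (i) is that ranking volatility predictors by their empirical loss against a robust proxy should agree with ranking them against the unobservable true volatility. The Monte Carlo sketch below illustrates this; the squared loss, the EWMA and naive predictors, the clipping level, and the data-generating process are all assumptions of the sketch, not the paper's specification.

```python
import numpy as np

# Rank an EWMA predictor h1 against a noisier naive predictor h2, once via a
# clipped forward-looking proxy and once via the true volatility; the two
# rankings should coincide.
rng = np.random.default_rng(3)
T, m = 5000, 60
sigma2 = 1.0 + 0.5 * np.sin(np.arange(T + m + 1) / 200.0)    # slowly varying truth
X = rng.standard_t(df=5, size=T + m + 1) * np.sqrt(sigma2 * 3.0 / 5.0)
x2 = X ** 2                                                   # Var(X_t) = sigma2_t

w = 0.94 ** np.arange(1, m + 1); w /= w.sum()   # backward predictor weights
v = 0.94 ** np.arange(m + 1);    v /= v.sum()   # forward proxy weights
loss1 = loss2 = oracle1 = oracle2 = 0.0
for t in range(m, T):
    h1 = np.dot(w, x2[t - m:t][::-1])           # EWMA over the past m points
    h2 = x2[t - 1]                              # naive last-squared-return
    proxy = np.dot(v, np.minimum(x2[t:t + m + 1], 10.0 / v))  # clipped proxy
    loss1 += (proxy - h1) ** 2; loss2 += (proxy - h2) ** 2
    oracle1 += (sigma2[t] - h1) ** 2; oracle2 += (sigma2[t] - h2) ** 2
print(loss1 < loss2, oracle1 < oracle2)         # expect: True True
```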

Next we prove (ii) and (iii). The proofs follow exactly the same arguments as (i), except for a few bounding details.

Firstly, in bounding $\Delta_{A}$, we need $L(\sigma_{t}^{2},h_{t})$ to be bounded. In (iii), $h_{t}$ is not necessarily bounded, but we directly work with a bounded loss $L(\sigma_{t}^{2},h_{t})\leq M_{0}$. In (ii), $h_{t}\leq M$ and we claim $h_{t}\geq\sigma_{t}^{2}/2$ with probability at least $1-2e^{-z}$, so that $L(\sigma_{t}^{2},h_{t})\leq B_{t}(M)M$. To see why the claim holds, define $\check{h}_{t}=\sum_{s=t-m}^{t-1}\nu_{s,t}\min(X_{s}^{2},\tau_{t}/\nu_{s,t})$. The robust predictor proposed in Section 4.1 attains its rate with the central fourth moment, while $\check{h}_{t}$ attains the same rate of convergence with the absolute fourth moment; this mirrors the difference between the second and third proxy options. Similar to the proof of Theorem 6, we can show

\mathbb{P}\Big(|\check{h}_{t}-\sigma_{t}^{2}|\leq 4\sqrt{\tilde{\kappa}^{\ddagger}_{t}z/n^{\ddagger}_{\mathrm{eff}}}\leq\sigma_{t}^{2}/2\Big)\geq 1-2e^{-z/2}\,.

So we know that $\check{h}_{t}\geq\sigma_{t}^{2}/2$ with probability at least $1-2e^{-z}$. Interestingly, $h_{t}\geq\check{h}_{t}$, so we always have good control of the left tail, thanks to the positivity of the squared data.
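The monotonicity $h_{t}\geq\check{h}_{t}$ can also be verified numerically: the weighted Huber score evaluated at $\check{h}_{t}$ is non-negative, and the score is non-increasing, so the minimizer lies to the right of $\check{h}_{t}$. The sketch below assumes, for illustration, that $h_{t}$ is a weighted-Huber-type minimizer with clipping levels $\tau_{t}/\nu_{s,t}$; the weights and $\tau_{t}$ are arbitrary choices.

```python
import numpy as np

def huber_score(h, x2, nu, tau):
    # weighted Huber score: sum_s nu_s * min(|X_s^2 - h|, tau/nu_s) * sgn(X_s^2 - h)
    r = x2 - h
    return np.sum(nu * np.minimum(np.abs(r), tau / nu) * np.sign(r))

rng = np.random.default_rng(1)
ok = True
for _ in range(1000):
    x2 = rng.standard_t(df=5, size=50) ** 2
    nu = 0.94 ** np.arange(50); nu /= nu.sum()       # illustrative weights
    tau = 3.0 * nu.max()                             # illustrative clipping level
    h_check = np.sum(nu * np.minimum(x2, tau / nu))  # clipped-data predictor
    # non-negative score at h_check implies the Huber minimizer is >= h_check
    ok &= bool(huber_score(h_check, x2, nu, tau) >= -1e-12)
print("h_t >= h_check_t in all trials:", ok)
```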

Secondly, in bounding $\Delta_{A}$, we need $\gamma_{j}^{2}\leq C\nu_{-j,1}^{2}$. In (iii), since the loss is Lipschitz over the whole region, we can easily see that $\gamma_{j}^{2}\leq 4\nu_{-j,1}^{2}B_{0}^{2}\max_{t}\tilde{\kappa}_{t}$. In (ii), the loss is Lipschitz only in the local region $\sigma_{t}^{2}/2\leq h_{t}\leq M$, but we know that with high probability the clipped predictor indeed falls into this region, so $\gamma_{j}^{2}\leq 4\nu_{-j,1}^{2}\max_{t}B_{t}^{2}(M)\max_{t}\tilde{\kappa}_{t}$. Therefore, we have no problem bounding $\Delta_{A}$.

Thirdly, in bounding $\Delta_{B}$, we require $C(h_{t})$ to be bounded. Since $h_{t}\geq\sigma_{t}^{2}/2$ with high probability, we have $C(h_{t})\leq C(\sigma_{t}^{2}/2)$ even when we use the non-robust predictors in (ii) and (iii). So we can bound $\Delta_{B}$ as desired too.

Finally, putting everything together and choosing $\varepsilon=C\sqrt{z/T}$ for large enough $C$, we conclude (ii) and (iii) for bounding II. ∎