
Online Prediction of Extreme Conditional Quantiles via B-Spline Interpolation

Zhengpin Li ([email protected]) · Jian Wang ([email protected]) · Yanxi Hou ([email protected])

School of Data Science, Fudan University, Yangpu District, Shanghai, China
Abstract

Extreme quantiles are critical for understanding the behavior of data in the tail of a distribution. They are challenging to estimate, particularly when few observations fall in the tail. In such cases, extreme value theory offers a solution by approximating the tail distribution with the Generalized Pareto Distribution (GPD), which permits extrapolation beyond the range of observed data and makes it a valuable tool for various applications. In conditional settings, however, where estimation relies on covariates, existing methods may require a computationally expensive GPD fit for each observation. This burden becomes prohibitive as the number of observations grows, potentially without bound. To address this issue, we propose an interpolation-based algorithm named EMI, which enables online prediction of extreme conditional quantiles from finite offline observations. Combining quantile regression with GPD-based extrapolation, EMI is formulated as a bilevel programming problem that can be solved efficiently with classic optimization methods. Once estimates for the offline observations are obtained, EMI applies B-spline interpolation to the covariate-dependent parameters, enabling estimation for online observations with only finitely many GPD fits. Simulations and real data analysis demonstrate the effectiveness of EMI across various scenarios.

keywords:
Extreme conditional quantile, generalized Pareto distribution, quantile regression, B-splines.

1 Introduction

In fields of statistical analysis such as risk management, finance, and environmental analysis, quantiles play a fundamental role in understanding data distributions and their various characteristics (Allen et al, 2012). Quantiles act as dividing points, segmenting the range of a probability distribution into intervals of equal probability. Among the commonly used quantiles, the extreme quantile assumes a critical role, especially when rare events carry significant implications. It sheds light on the tail behavior of a distribution and finds wide application in decision-making, policy formulation, and strategic planning (Resnick, 2008). Given a dataset of $N$ identically distributed observations, a natural estimator of an extreme quantile is the sample quantile. This non-parametric approach performs admirably when the quantile level $\tau_N$ is only moderately high, i.e., $\tau_N \rightarrow 1$ with $N(1-\tau_N) \rightarrow \infty$, so that a sufficient number of observations lie above the level $\tau_N$. However, when the sample is not large enough and observations above the more extreme level $\tau_N$ become scarce, in the sense that $N(1-\tau_N) \rightarrow c$ for some $c \in [0,\infty)$, non-parametric estimators are no longer efficient and may cause serious underestimation (He et al, 2022). This is a concern for accurately assessing risk, making informed financial decisions, and devising effective environmental strategies.

An alternative way to estimate extreme quantiles involves extrapolation beyond the range of observed data. Extreme Value Theory (EVT) offers a systematic framework for precisely this purpose, allowing us to analyze the tail of a distribution and model the excess distribution above a prefixed threshold. The core principle used here is the Pickands-Balkema-de Haan theorem (Balkema and De Haan, 1974), which states that under mild conditions the distribution of excesses over a high threshold is well approximated by the Generalized Pareto Distribution (GPD). The excess distribution, denoted $F_u(x)$, is defined as

F_u(x)=\mathbb{P}(X-u\leq x\mid X>u),\quad 0\leq x<x_F-u,

where $x_F$ is the right endpoint of the distribution function $F(x)$, and $u$ serves as a large threshold. According to the Pickands-Balkema-de Haan theorem, as $u$ approaches the right endpoint, i.e., $u\rightarrow x_F$, the excess distribution converges to the GPD:

\lim_{u\rightarrow x_F}F_u(x)=G_{\gamma,\sigma}(x),\quad 0\leq x<x_F-u,

where $G_{\gamma,\sigma}(x)=1-(1+\gamma x/\sigma)_+^{-1/\gamma}$ is the cumulative distribution function of the GPD. Here, $\gamma\in\mathbb{R}$ and $\sigma>0$ are called the shape and scale parameters, respectively. Various methods, such as the Pickands estimator (Pickands III, 1975), the probability-weighted moments method (Hosking et al, 1985), and the maximum likelihood method (Smith, 1987), are available for estimating these parameters.
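For concreteness, the following minimal Python sketch (helper names ours, assuming numpy) evaluates $G_{\gamma,\sigma}$ and its inverse; the $\gamma=0$ branches are the exponential limit:

```python
import numpy as np

def gpd_cdf(x, gamma, sigma):
    """CDF of the GPD: G(x) = 1 - (1 + gamma*x/sigma)_+^(-1/gamma), x >= 0."""
    if abs(gamma) < 1e-12:                     # gamma -> 0: exponential tail
        return 1.0 - np.exp(-x / sigma)
    return 1.0 - np.maximum(1.0 + gamma * x / sigma, 0.0) ** (-1.0 / gamma)

def gpd_quantile(q, gamma, sigma):
    """Inverse of gpd_cdf on 0 <= q < 1."""
    if abs(gamma) < 1e-12:
        return -sigma * np.log(1.0 - q)
    return sigma / gamma * ((1.0 - q) ** (-gamma) - 1.0)
```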

In light of the limiting distribution of extreme values, extensive research has been devoted to estimating extreme quantiles through extrapolation. Among these methods, the Block Maxima Method (BMM) is a widely used approach (Naveau et al, 2009). BMM partitions the dataset into non-overlapping blocks and selects the maximum observation within each block; the distribution of these block maxima is then fitted to the generalized extreme value distribution to estimate extreme quantiles. A more general and flexible alternative, the Peaks Over Threshold (POT) method, has gained more attention (Ferreira and De Haan, 2015). Unlike BMM, which considers only the maximum observation in each block, POT broadens its scope to encompass all sufficiently large values. To facilitate this, POT employs a threshold that isolates the exceedances, ensuring that they fall within the tail of the distribution. POT then models these exceedances with the GPD, allowing extreme quantiles to be estimated with increased flexibility and accuracy.

While BMM and POT have proven effective for estimating extreme quantiles, they share a limitation: they do not account for dependence on covariates. Consequently, these methods may not be suitable for scenarios where extreme events are influenced by complex interplays of covariates. To address this limitation, conditional quantile regression emerges as a powerful extension of classical regression, offering a comprehensive approach to modeling the conditional quantiles of a response given covariates (Koenker and Hallock, 2001). Consider the setting where the distribution of a response variable $Y$ depends on a set of covariates $\mathbf{X}\in\mathbb{R}^p$, where $p$ denotes the feature dimension. The conditional quantile is defined as $Q_Y(\tau|\mathbf{x})=F_Y^{-1}(\tau|\mathbf{x})$, where $F_Y^{-1}(\cdot|\mathbf{x})$ is the generalized inverse of the conditional distribution function of $Y$. When the quantile level $\tau$ is moderately high, $Q_Y(\tau|\mathbf{x})$ can be estimated efficiently via simple models such as linear conditional quantile regression. However, when the quantile $\tau=\tau_N$ satisfies

N(1-\tau_N)\rightarrow c,\quad c\in[0,\infty),

these estimation methods may introduce substantial biases.

In such cases, the combination of EVT and conditional quantile regression becomes essential. One common approach relies on the fact that, under mild conditions,

\lim_{t\rightarrow\infty}\frac{U(tz|\mathbf{x})-U(t|\mathbf{x})}{a(t)}=\frac{z^{\gamma}-1}{\gamma},\quad z>0,

where $\gamma$ is a real parameter, $U(\cdot|\mathbf{x})$ is the inverse function of $1/(1-F(\cdot|\mathbf{x}))$, and $a(\cdot)$ is a suitable positive function; see Li and Wang (2019), Wang and Li (2013), Wang et al (2012), and Xu et al (2022) for more details. In this paper, we focus on an alternative approach that employs the GPD to approximate the tail distribution. The resulting methods for extreme conditional quantile regression typically involve two key components: i) estimating the intermediate conditional quantile $Q_Y(\tau_0|\mathbf{x})$ at an intermediate level $\tau=\tau_0$ through classic conditional quantile regression, and ii) modeling the data exceeding the threshold $u=\widehat{Q}_Y(\tau_0|\mathbf{x})$ using the GPD. Notable examples of such methods can be found in the work of Hou et al (2022).

While existing extreme conditional quantile regression methods have achieved significant success, they exhibit certain limitations. These limitations become apparent when the response variable $Y$ is influenced by covariates $\mathbf{X}$, as the conditional quantile $Q_Y(\tau_0|\mathbf{x})$ then depends on $\mathbf{X}$. In other words, the threshold $u=Q_Y(\tau_0|\mathbf{x})$ varies across observations $\mathbf{x}$, resulting in different sets of exceedances used to fit the GPD. Consequently, both the scale and shape parameters of the GPD also become covariate-dependent. To estimate these parameters, one must carry out conditional quantile regression and extreme distribution fitting once per observation. This is computationally demanding, since both steps are time-consuming in most existing methods (Velthoen et al, 2019). More critically, practical scenarios often involve online streaming data, where observations continually arrive over time. In such cases, the dataset's size can become substantial or effectively unbounded (Granello and Wheaton, 2004). Under these circumstances, applying existing methods becomes impractical, as performing a GPD fit for every observation is computationally prohibitive.

To solve this issue, we propose an algorithm named EMI (Extrapolation via Mathematical programs with equilibrium constraints and B-spline Interpolation). Given finitely many offline historical observations, the algorithm proceeds in two stages. First, EMI employs linear conditional quantile regression to estimate an intermediate quantile $Q_Y(\tau_0|\mathbf{x})$, which serves as the threshold; the exceedances beyond this threshold are modeled by a GPD whose scale and shape parameters depend on the covariates. Notably, EMI formulates these regression and fitting tasks jointly as a bilevel programming problem, where the lower- and upper-level objectives are the check function and the log-likelihood, respectively. For each offline observation, the corresponding parameter estimates are obtained by solving this bilevel program with classic optimization methods, such as interior point or active set algorithms. Second, EMI applies B-spline interpolation to the scale and shape parameters, carried out on each feature dimension of the covariates. For online observations whose features fall outside the range of the offline data, EMI leverages the nearest available data points for estimation. This approach eliminates the need for repeated GPD fitting as new online data arrive. Numerical simulations and real-world analysis further illustrate the effectiveness of EMI in extreme conditional quantile estimation.

2 Methodology

2.1 Conditional Quantile Estimation via GPD Approximation

Conditional quantile estimation is a statistical technique for modeling and analyzing the conditional quantiles of a response variable with respect to one or more predictor variables. The techniques involved include quantile regression and various non-parametric approaches. The motivation behind conditional quantile estimation typically differs from that of traditional mean-based modeling: while traditional regression techniques, like linear regression, estimate the conditional mean of the response, conditional quantile methods estimate quantiles, which offer a more comprehensive view of the conditional distribution. This is particularly useful when dealing with non-normally distributed data or outliers, or when the interest lies in understanding variability at different points of the distribution.

We consider the setting where the response $Y\in\mathbb{R}$ depends on covariates $\mathbf{X}\in\mathbb{R}^p$. Denote the conditional distribution of $Y$ by $F_Y(\cdot|\mathbf{X})$ and the marginal distribution of $Y$ by $F_Y$. Suppose we have observations $(y_i,\mathbf{x}_i)$, $i=1,2,\ldots,N$, where $y_i$ is generated from $F_Y(\cdot|\mathbf{x}_i)$ given $\mathbf{X}=\mathbf{x}_i$. The conditional quantile of $Y$ given $\mathbf{X}=\mathbf{x}$ at level $\tau\in(0,1)$ is defined as $Q_Y(\tau|\mathbf{x})=F_Y^{-1}(\tau|\mathbf{x})$, where $F_Y^{-1}(\cdot|\mathbf{x})$ is the generalized inverse of the conditional distribution function of $Y$. The conditional quantile $Q_Y(\tau|\mathbf{x})$ solves the optimization problem

Q_Y(\tau|\mathbf{x})=\arg\min_{q\in\mathbb{R}}\mathbb{E}\left[\rho_{\tau}(Y-q)\,|\,\mathbf{X}=\mathbf{x}\right], \quad (2.1)

where $\rho_\tau(u)=(\tau-I(u\leq 0))u$ is the so-called check function (Koenker and Bassett Jr, 1978). Quantile regression is one of the most widely used techniques for conditional quantile estimation; it extends traditional regression by estimating conditional quantiles directly. We follow the standard parametric assumption that the true conditional quantile has the linear form

Q_Y(\tau|\mathbf{x})=\alpha(\tau)+\boldsymbol{\beta}^T(\tau)\mathbf{x}, \quad (2.2)

where $(\alpha(\tau),\boldsymbol{\beta}^T(\tau))$ is a parameter vector associated with the level $\tau$ and $(\alpha_0(\tau),\boldsymbol{\beta}_0^T(\tau))$ denotes the true parameter. With observations $(y_i,\mathbf{x}_i)$, $i=1,2,\ldots,N$, the parameter can be estimated by solving the empirical counterpart of (2.1):

(\widehat{\alpha}(\tau),\widehat{\boldsymbol{\beta}}^T(\tau))=\arg\min_{\alpha,\boldsymbol{\beta}}L^{(\text{QR})}_{N\tau}(\alpha,\boldsymbol{\beta})=\arg\min_{\alpha,\boldsymbol{\beta}}\sum_{i=1}^{N}\rho_{\tau}(y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}_i). \quad (2.3)

The conditional quantile of $Y$ given $\mathbf{X}=\mathbf{x}$ is then estimated as

\widehat{Q}_Y(\tau|\mathbf{x})=\widehat{\alpha}(\tau)+\widehat{\boldsymbol{\beta}}^T(\tau)\mathbf{x}. \quad (2.4)
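As an illustration of (2.3)-(2.4), here is a minimal sketch using the off-the-shelf QuantReg implementation in statsmodels; the synthetic data-generating lines are purely illustrative:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))                    # toy covariates
y = 0.5 + X @ np.full(3, 0.5) + rng.exponential(size=500)

tau0 = 0.8
Xd = sm.add_constant(X)                           # prepend the intercept column
fit = sm.QuantReg(y, Xd).fit(q=tau0)              # minimizes the check loss (2.3)
q_hat = Xd @ fit.params                           # estimate (2.4) at each x_i
```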

The empirical linear estimator of the conditional quantile works effectively when the quantile level $\tau=\tau_0$ is moderately high. However, as the quantile level $\tau=\tau_N$ approaches $1$ with $N(1-\tau_N)\rightarrow c$, $c\in[0,\infty)$, the expected number of observations exceeding $Q_Y(\tau_N|\mathbf{x})$ becomes insufficient. In such cases, empirical linear estimation of the conditional quantile is no longer feasible, necessitating extrapolation beyond the observed data. Extrapolation is the process of estimating or predicting values beyond the range of the available data. It is common practice in fields including finance, environmental science, insurance, and reliability analysis, where the primary interest lies in understanding and managing extreme events such as financial market crashes, natural disasters, rare disease occurrences, or equipment failures. To support such extrapolation, we turn to the GPD, which provides an excellent approximation of the tail of the conditional distribution of $Y$ given $\mathbf{X}=\mathbf{x}$. More specifically, given a large threshold $u$, $\mathbf{X}=\mathbf{x}$, and $z>0$, the exceedance $Y-u\mid Y>u$ given $\mathbf{X}=\mathbf{x}$ satisfies

P\left(Y>u+z\,|\,Y>u,\mathbf{X}=\mathbf{x}\right)=\begin{cases}\left(1+\frac{\gamma_0(\mathbf{x})z}{\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))}\right)^{-1/\gamma_0(\mathbf{x})},&\gamma_0(\mathbf{x})\neq 0,\\ \exp\left(-\frac{z}{\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))}\right),&\gamma_0(\mathbf{x})=0,\end{cases} \quad (2.5)

where $\tau_u(\mathbf{x})=F_Y(u|\mathbf{x})\in(0,1)$ and we require $1+\gamma_0(\mathbf{x})z/\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))>0$ when $\gamma_0(\mathbf{x})\neq 0$. Here $\gamma_0(\mathbf{x})$ is the extreme value index, and $\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))$ is a scale parameter tied to the quantile level of the threshold $u$. When $\gamma_0(\mathbf{x})<0$, the distribution of $Y$ has a finite right endpoint $u^*=u-\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))/\gamma_0(\mathbf{x})$, that is, $F_Y(y|\mathbf{x})=1$ for all $y\geq u^*$. When $\gamma_0(\mathbf{x})=0$, the exceedance $Y-u\mid Y>u$ given $\mathbf{X}=\mathbf{x}$ has an exponential distribution with mean $\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))$. When $\gamma_0(\mathbf{x})>0$, the exceedance has a heavy tail with finite moments up to order $1/\gamma_0(\mathbf{x})$. Note that for any higher threshold $u'>u$, the exceedance $Y-u'\mid Y>u'$ given $\mathbf{X}=\mathbf{x}$ again follows a generalized Pareto distribution with the same shape parameter $\gamma_0(\mathbf{x})$ but a different scale parameter.

Once the threshold $u$ is set, the next step is to estimate the parameters of the GPD. In the absence of covariates, a standard approach is maximum likelihood, which yields asymptotically normal estimators in the unconditional case when $\gamma>-1/2$ (Drees et al, 2004). To estimate the covariate-dependent parameters $\gamma_0(\mathbf{x})$ and $\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))$, we also use maximum likelihood. Specifically, given observations $(y_i,\mathbf{x}_i)$, $i=1,2,\ldots,N$, and the threshold $u$, we define the exceedances above the threshold as

z_i=\left(y_i-u\right)_+,\quad i=1,\ldots,N, \quad (2.6)

where $(a)_+=\max\{0,a\}$; in other words, $z_i=0$ whenever $y_i$ lies below the threshold. Denote the log-likelihood function by

l(\gamma,\sigma|z)=-\left\{\frac{1+\gamma}{\gamma}\log\left(1+\frac{\gamma z}{\sigma}\right)+\log\sigma\right\}I(\gamma\neq 0)-\left\{\frac{z}{\sigma}+\log\sigma\right\}I(\gamma=0), \quad (2.7)

and define the sample log-likelihood, computed only on the positive $z_i$, as

L^{(\text{MLE})}_N(\gamma,\sigma|u)=\sum_{i=1}^{N}l(\gamma,\sigma|z_i)\,I(z_i>0).

Then, to estimate $\gamma_0(\mathbf{x})$ and $\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))$, we apply the censored maximum likelihood method by solving the optimization problem

\max_{\gamma,\sigma}L^{(\text{MLE})}_N(\gamma,\sigma|u). \quad (2.8)
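A minimal sketch of solving (2.8) for a fixed threshold, maximizing the log-likelihood (2.7) over the positive exceedances with scipy (we parameterize $\log\sigma$ to keep the scale positive; the initial values are arbitrary assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def fit_gpd(z, gamma_init=0.1, sigma_init=1.0):
    """Maximize the sum of l(gamma, sigma | z_i) over exceedances z_i > 0, Eq. (2.8)."""
    z = z[z > 0]

    def neg_loglik(theta):
        gamma, log_sigma = theta
        sigma = np.exp(log_sigma)
        if abs(gamma) < 1e-8:                  # exponential branch of (2.7)
            return np.sum(z / sigma + log_sigma)
        t = 1.0 + gamma * z / sigma
        if np.any(t <= 0.0):                   # outside the GPD support
            return np.inf
        return np.sum((1.0 + gamma) / gamma * np.log(t) + log_sigma)

    res = minimize(neg_loglik, x0=[gamma_init, np.log(sigma_init)],
                   method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])          # (gamma_hat, sigma_hat)
```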

Inverting the distribution function in (2.5), we obtain an approximation of the quantile at a level $\tau_N>\tau_0$ by taking $u=Q_Y(\tau_0|\mathbf{x})$ and $z=Q_Y(\tau_N|\mathbf{x})-Q_Y(\tau_0|\mathbf{x})$. As a result,

Q_Y(\tau_N|\mathbf{x})=\begin{cases}Q_Y(\tau_0|\mathbf{x})+\frac{\sigma_0(\mathbf{x};\tau_0)}{\gamma_0(\mathbf{x})}\left(\tau_{0,N}^{\gamma_0(\mathbf{x})}-1\right),&\gamma_0(\mathbf{x})\neq 0,\\ Q_Y(\tau_0|\mathbf{x})+\sigma_0(\mathbf{x};\tau_0)\log\tau_{0,N},&\gamma_0(\mathbf{x})=0,\end{cases} \quad (2.9)

where for simplicity, we define

\tau_{0,N}=\frac{1-\tau_0}{1-\tau_N}. \quad (2.10)

We denote the solution of problem (2.8) by $\widehat{\gamma}(\mathbf{x})$ and $\widehat{\sigma}(\mathbf{x};\tau_0)$. Therefore, using plug-in estimators, the prediction of the extreme conditional quantile $Q_Y(\tau_N|\mathbf{x})$ is

\widehat{Q}_Y(\tau_0|\mathbf{x})=\widehat{\alpha}(\tau_0)+\widehat{\boldsymbol{\beta}}^T(\tau_0)\mathbf{x},\qquad \widehat{Q}_Y(\tau_N|\mathbf{x})=\begin{cases}\widehat{Q}_Y(\tau_0|\mathbf{x})+\frac{\widehat{\sigma}(\mathbf{x};\tau_0)}{\widehat{\gamma}(\mathbf{x})}\left(\tau_{0,N}^{\widehat{\gamma}(\mathbf{x})}-1\right),&\widehat{\gamma}(\mathbf{x})\neq 0,\\ \widehat{Q}_Y(\tau_0|\mathbf{x})+\widehat{\sigma}(\mathbf{x};\tau_0)\log\tau_{0,N},&\widehat{\gamma}(\mathbf{x})=0.\end{cases} \quad (2.11)
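The plug-in extrapolation (2.11) is then a few lines of code; a sketch, assuming estimates from the previous steps:

```python
import numpy as np

def extrapolate_quantile(q_tau0, gamma, sigma, tau0, tau_N):
    """Extreme quantile from the intermediate one via the GPD tail, Eq. (2.11)."""
    r = (1.0 - tau0) / (1.0 - tau_N)           # tau_{0,N} of Eq. (2.10)
    if abs(gamma) < 1e-8:                      # gamma = 0 branch
        return q_tau0 + sigma * np.log(r)
    return q_tau0 + sigma / gamma * (r ** gamma - 1.0)
```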

It is important to note that the estimations in (2.3) and (2.8) are usually implemented separately, as in stages two and three of the algorithm in [12]. However, the combination of (2.3) and (2.8) is essentially a bilevel programming problem when the threshold $u$ in (2.8) equals the conditional quantile $Q_Y(\tau|\mathbf{x})$ determined by (2.2) and (2.3). In the following, we apply bilevel programming algorithms to obtain the joint estimation in (2.3) and (2.8).

2.2 Stage 1: Offline Quantile Estimation via Bilevel Programming

In this section, we formulate the extreme quantile estimation via GPD approximation as a bilevel programming problem. Bilevel programming, a class of optimization problems characterized by their hierarchical structure, has found wide-ranging applications across diverse domains including economics, engineering, transportation, and environmental management (Sinha et al, 2017). These problems feature a unique interplay between two optimization tasks: an upper-level decision-maker seeks to optimize a given objective while factoring in the response of a lower-level decision-maker, who, in turn, optimizes their objective in response to the upper-level decisions. To illustrate this concept in our context, we combine (2.3) and (2.8) and obtain

\max_{\gamma,\sigma}\;L^{(\text{MLE})}_N(\gamma,\sigma\,|\,Q_Y(\tau_0|\mathbf{x}))\quad \text{s.t.}\quad \min_{\alpha,\boldsymbol{\beta}}\;L^{(\text{QR})}_{N\tau_0}(\alpha,\boldsymbol{\beta}), \quad (2.12)

where $\mathbf{x}$ and the prefixed level $\tau_0$ are given. The conditional quantile regression objective and the log-likelihood serve as the lower- and upper-level functions of (2.12), respectively.

The hierarchical structure of bilevel programming introduces unique challenges for existing solution methods. For instance, penalty function methods address bilevel optimization by solving a sequence of unconstrained problems, derived by introducing a penalty term that quantifies the extent of constraint violation. The penalty term typically relies on a parameter, taking the value zero for feasible solutions and a positive value (under minimization) for infeasible ones. However, many penalty function methods require the upper-level function to be convex, which conflicts with the non-convex nature of the log-likelihood function (2.7); indeed, the Hessian of $l(\gamma,\sigma|z)$ with respect to $(\gamma,\sigma)$ is not positive semi-definite. Consequently, penalty methods such as those proposed by Aiyoshi and Shimizu (1981) and Lv et al (2007) tend to be ineffective here. Additionally, nested evolutionary algorithms have gained popularity for bilevel problems (Sinha et al, 2014). These algorithms solve a lower-level optimization problem for each upper-level member and are typically employed in two main ways: the first combines an evolutionary algorithm at the upper level with classical algorithms at the lower level (Mathieu et al, 1994), while the second uses evolutionary algorithms at both levels (Angelo et al, 2013). While these nested strategies can be effective, they carry a significant computational burden, rendering them impractical for large-scale bilevel problems.

To handle problem (2.12) effectively, we first reformulate the lower-level function as

\min_{\alpha,\boldsymbol{\beta}}\sum_{y_i>\alpha+\boldsymbol{\beta}^T\mathbf{x}_i}\tau_0\left(y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}_i\right)+\sum_{y_i<\alpha+\boldsymbol{\beta}^T\mathbf{x}_i}(\tau_0-1)\left(y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}_i\right).

The above problem can be transformed into a linear program as

\min_{\alpha,\boldsymbol{\beta},\lambda_i^+,\lambda_i^-}\;\sum_{i=1}^{N}\tau_0\lambda_i^+ +(1-\tau_0)\lambda_i^-\quad \text{s.t.}\quad y_i=\alpha+\boldsymbol{\beta}^T\mathbf{x}_i+\lambda_i^+-\lambda_i^-,\;\; \lambda_i^+,\lambda_i^-\geq 0,\;\; i=1,\ldots,N, \quad (2.13)

where $\lambda_i^+,\lambda_i^-$, $i=1,2,\ldots,N$, are nonnegative slack variables capturing the positive and negative parts of the residuals; a small solver sketch is given below.
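As a sanity check on the reformulation, the lower-level LP (2.13) can be solved directly with an off-the-shelf solver; a minimal sketch using scipy.optimize.linprog, with variables ordered $(\alpha,\boldsymbol{\beta},\lambda^+,\lambda^-)$:

```python
import numpy as np
from scipy.optimize import linprog

def quantile_lp(y, X, tau0):
    """Solve the LP (2.13); returns (alpha, beta)."""
    N, p = X.shape
    # Objective: 0 on (alpha, beta), tau0 on lambda+, (1 - tau0) on lambda-.
    c = np.concatenate([np.zeros(1 + p), np.full(N, tau0), np.full(N, 1.0 - tau0)])
    # Equality rows: y_i = alpha + beta' x_i + lambda_i^+ - lambda_i^-.
    A_eq = np.hstack([np.ones((N, 1)), X, np.eye(N), -np.eye(N)])
    bounds = [(None, None)] * (1 + p) + [(0, None)] * (2 * N)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[0], res.x[1:1 + p]
```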

With (2.13), the bilevel programming (2.12) can be rewritten as

\max_{\alpha,\boldsymbol{\beta},\gamma,\sigma}\;\sum_{i=1}^{N}l\left(\gamma,\sigma\,|\,y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}\right)I\left(y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}>0\right)\quad \text{s.t.}\quad \{\alpha,\boldsymbol{\beta},\lambda_i^+,\lambda_i^-\}\in\mathcal{A},\quad i=1,2,\ldots,N, \quad (2.14)

where $\mathcal{A}$ denotes the solution set of the linear program (2.13), and the exceedances in (2.7) are $z_i=(y_i-u)_+$ with threshold $u=\alpha+\boldsymbol{\beta}^T\mathbf{x}$. Next, we replace the constraint in problem (2.14) by its Karush-Kuhn-Tucker (KKT) conditions. These conditions manifest as Lagrangian and complementarity constraints, reducing the bilevel optimization problem to a single-level constrained optimization problem. Define the augmented Lagrangian (AL) function as

L_N^{(\text{AL})}=\sum_{i=1}^{N}\tau_0\lambda_i^+ +(1-\tau_0)\lambda_i^- +\sum_{i=1}^{N}u_i\left(y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}_i-\lambda_i^+ +\lambda_i^-\right)-\sum_{i=1}^{N}\left(t_i^+\lambda_i^+ +t_i^-\lambda_i^-\right),

where $t_i^+,t_i^-$, $i=1,2,\ldots,N$, are the Lagrange multipliers. Setting the derivatives of $L_N^{(\text{AL})}$ with respect to $\alpha,\boldsymbol{\beta},\lambda_i^+,\lambda_i^-$ to zero yields:

\nabla_{\alpha}L_N^{(\text{AL})}=0\;\Rightarrow\;\sum_{i=1}^{N}u_i=0,\qquad \nabla_{\boldsymbol{\beta}}L_N^{(\text{AL})}=0\;\Rightarrow\;\sum_{i=1}^{N}u_i\mathbf{x}_i=0,\qquad \nabla_{\lambda_i^+}L_N^{(\text{AL})}=0\;\Rightarrow\;\tau_0-u_i-t_i^+=0,\qquad \nabla_{\lambda_i^-}L_N^{(\text{AL})}=0\;\Rightarrow\;1-\tau_0+u_i-t_i^-=0. \quad (2.15)

Therefore, the problem (2.14) is reformulated as

\max_{\Theta_{\mathbf{x},\tau_0}}\;\sum_{i=1}^{N}l\left(\gamma,\sigma\,|\,y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}\right)I\left(y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}>0\right)
\text{s.t.}\quad y_i=\alpha+\boldsymbol{\beta}^T\mathbf{x}_i+\lambda_i^+-\lambda_i^-,\quad i=1,\ldots,N,
t_i^+,t_i^-,\lambda_i^+,\lambda_i^-\geq 0,\quad i=1,\ldots,N,
t_i^+\lambda_i^+=t_i^-\lambda_i^-=0,\quad i=1,\ldots,N,
\sum_{i=1}^{N}u_i=0,\qquad \sum_{i=1}^{N}u_i\mathbf{x}_i=0,
\tau_0-u_i-t_i^+=0,\quad i=1,\ldots,N,
1-\tau_0+u_i-t_i^-=0,\quad i=1,\ldots,N, \quad (2.16)

where $\Theta_{\mathbf{x},\tau_0}=(\gamma,\sigma,\alpha,\boldsymbol{\beta},\lambda_i^+,\lambda_i^-,t_i^+,t_i^-,u_i)$ denotes the set of decision variables for a prefixed $\mathbf{x}$ and level $\tau_0$. Problem (2.16) is a special case of a mathematical program with equilibrium constraints (MPEC), in which the constraints include the complementarity conditions

t_i^+\lambda_i^+=t_i^-\lambda_i^-=0,\quad i=1,2,\ldots,N.

MPECs play an important role in various fields, such as Stackelberg games in the economic sciences, and problem (2.16) can be solved efficiently via interior point or active set algorithms; for details on these algorithms we refer readers to [16]. In the following, we omit the Lagrange multipliers and slack variables from the solution and denote the solution of problem (2.16) by $\Theta_{\mathbf{x},\tau_0}=(\widetilde{\alpha}(\tau_0),\widetilde{\boldsymbol{\beta}}(\tau_0),\widetilde{\gamma}(\mathbf{x}),\widetilde{\sigma}(\mathbf{x};\tau_0))$ for simplicity.

2.3 Stage 2: Online Prediction via B-Spline Interpolation

Note that the solution to (2.16) is derived for a specific covariate $\mathbf{x}$, so the estimates $\widetilde{\gamma}(\mathbf{x})$ and $\widetilde{\sigma}(\mathbf{x};\tau_0)$ are inherently covariate-dependent. To see why, recall that $\widetilde{\gamma}(\mathbf{x})$ and $\widetilde{\sigma}(\mathbf{x};\tau_0)$ model the exceedances over a threshold estimated through (2.4) at a prefixed $\tau_0$. The estimated threshold is therefore intricately linked to the covariates $\mathbf{x}$; this, in turn, produces different exceedances for different values of $\mathbf{x}$, and ultimately distinct GPD models and covariate-dependent estimates $\widetilde{\gamma}(\mathbf{x})$ and $\widetilde{\sigma}(\mathbf{x};\tau_0)$.

In practice, we often encounter situations where we need to deal with an infinite number of future observations, in addition to the finite historical observations at hand. A typical example is the realm of online streaming data, continually generated from various sources (Gaber et al, 2005). Online streaming data finds applications across diverse industries, including finance, transportation, and e-commerce. Notably, streaming data volumes can be vast, occasionally approaching infinity. When it comes to the task of conditional quantile estimation with these infinite observations from online streaming data, applying (2.16) entails solving the problem an infinite number of times. This introduces a significant computational complexity that poses a substantial challenge.

To alleviate the computational burden of solving (2.16) for a potentially unbounded number of observations, we employ B-spline interpolation. To set notation, denote the finite offline observations by $(y_i^{(\text{off})},\mathbf{x}_i^{(\text{off})})$ for $i=1,\ldots,N^{(\text{off})}$ and the future online observations by $\mathbf{x}_i^{(\text{on})}$ for $i=1,\ldots,N^{(\text{on})}$, where we allow $N^{(\text{on})}$ to tend to infinity. We assume the covariates of the two observation sets are distinct. B-spline interpolation lets us solve (2.16) only $N^{(\text{off})}$ times while obtaining estimates for any number of online points, even as $N^{(\text{on})}$ approaches infinity. More specifically, suppose we have obtained $\Theta_{\mathbf{x}_i^{(\text{off})},\tau_0}=(\widetilde{\alpha}(\tau_0),\widetilde{\boldsymbol{\beta}}(\tau_0),\widetilde{\gamma}(\mathbf{x}_i^{(\text{off})}),\widetilde{\sigma}(\mathbf{x}_i^{(\text{off})};\tau_0))$ for $i=1,\ldots,N^{(\text{off})}$ by solving (2.16) on the offline observations. Notably, the estimates $\widetilde{\alpha}(\tau_0)$ and $\widetilde{\boldsymbol{\beta}}(\tau_0)$ depend only on the choice of $\tau_0$, which is assumed to be the same for offline and online observations in this paper. As a result, for $\mathbf{x}_i^{(\text{on})}$, the estimates of $\alpha$ and $\boldsymbol{\beta}$ remain those obtained offline. Therefore, we only need B-spline interpolation to estimate the extreme value index $\gamma(\mathbf{x})$ and the scale $\sigma(\mathbf{x};\tau_0)$ at $\mathbf{x}=\mathbf{x}_i^{(\text{on})}$.

B-splines, short for "basis splines," play a pivotal role in numerical analysis and computer-aided design. They provide a powerful method for representing and approximating complex curves and surfaces, as well as for solving various mathematical and engineering problems. B-splines are piecewise polynomials, and the positions where the pieces meet are known as knots, defined as a non-decreasing sequence of real numbers

\boldsymbol{\xi}=\left\{\xi_i\right\}_{i=1}^{M}=\left\{\xi_1\leq\xi_2\leq\cdots\leq\xi_M\right\},

where $M$ is the number of knots. Provided that $M\geq d+2$, we can define B-splines of degree $d\geq 0$ over the knots $\boldsymbol{\xi}$ through a recursive formulation. Specifically, the $j$-th B-spline $B_{j,d,\boldsymbol{\xi}}(x):\mathbb{R}\rightarrow\mathbb{R}$ is given by

B_{j,d,\boldsymbol{\xi}}(x):=\frac{x-\xi_j}{\xi_{j+d}-\xi_j}B_{j,d-1,\boldsymbol{\xi}}(x)+\frac{\xi_{j+d+1}-x}{\xi_{j+d+1}-\xi_{j+1}}B_{j+1,d-1,\boldsymbol{\xi}}(x),

where $\xi_j\leq\xi_{j+1}\leq\cdots\leq\xi_{j+d+1}$ are $d+2$ consecutive knots taken from the sequence $\boldsymbol{\xi}$; when $\xi_j=\xi_{j+d+1}$, $B_{j,d,\boldsymbol{\xi}}(x)$ is identically zero. The recursion starts from the degree-zero B-splines:

B_{i,0,\boldsymbol{\xi}}(x):=\begin{cases}1,&\text{if }x\in\left[\xi_i,\xi_{i+1}\right),\\ 0,&\text{otherwise}.\end{cases}

In simpler terms, a B-spline is built over the $M$ knots, which must be in non-decreasing order; it is nonzero only within the range defined by its first and last knots and vanishes elsewhere. Two B-splines may share some of their knots, but two B-splines defined over exactly the same knots are identical: a B-spline is uniquely characterized by its knot sequence. A minimal implementation of the recursion is sketched below.
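A direct transcription of the recursion into Python (0-based indices; the conventional 0/0 := 0 guards made explicit):

```python
def bspline_basis(j, d, xi, x):
    """Evaluate B_{j,d,xi}(x) by the Cox-de Boor recursion; xi is the knot list."""
    if d == 0:
        return 1.0 if xi[j] <= x < xi[j + 1] else 0.0
    left = right = 0.0
    if xi[j + d] > xi[j]:                       # skip 0/0 terms on repeated knots
        left = (x - xi[j]) / (xi[j + d] - xi[j]) * bspline_basis(j, d - 1, xi, x)
    if xi[j + d + 1] > xi[j + 1]:
        right = ((xi[j + d + 1] - x) / (xi[j + d + 1] - xi[j + 1])
                 * bspline_basis(j + 1, d - 1, xi, x))
    return left + right
```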

Algorithm 1 The proposed EMI algorithm
1: Input: observations $(y_i^{(\text{off})},\mathbf{x}_i^{(\text{off})})$, $i=1,2,\ldots,N^{(\text{off})}$; spline degree $d$; knot number $M$; moderate quantile level $\tau_0$; extreme quantile level $\tau_N$.
2: Output: conditional quantile estimate $\widetilde{Q}_Y(\tau_N|\mathbf{x}^{(\text{on})})$.
3: Extrapolation.
4: for $i=1,2,\ldots,N^{(\text{off})}$ do
5:     Solve the MPEC problem (2.16) with $\mathbf{x}_i^{(\text{off})}$ and obtain $(\widetilde{\alpha}(\tau_0),\widetilde{\boldsymbol{\beta}}(\tau_0),\widetilde{\gamma}(\mathbf{x}_i^{(\text{off})}),\widetilde{\sigma}(\mathbf{x}_i^{(\text{off})};\tau_0))$.
6: end for
7: Interpolation.
8: Compute the coefficients $\{\widetilde{t}_{\gamma,j}\}_{j=1}^{J}$ for $\gamma$ via Eq. (2.17).
9: Compute the coefficients $\{\widetilde{t}_{\sigma,j}\}_{j=1}^{J}$ for $\sigma$ via Eq. (2.20).
10: Prediction.
11: Compute $\widetilde{\gamma}(\mathbf{x}^{(\text{on})})$ by B-spline interpolation via Eq. (2.18).
12: Compute $\widetilde{\sigma}(\mathbf{x}^{(\text{on})};\tau_0)$ by B-spline interpolation via Eq. (2.19).
13: Compute $\widetilde{Q}_Y(\tau_N|\mathbf{x}^{(\text{on})})$ via Eq. (2.21).

B-splines provide a powerful method for approximating complex functions. When the covariates are multi-dimensional, we approximate the covariate-dependent index $\gamma(\mathbf{x})$ and scale $\sigma(\mathbf{x};\tau_0)$ as

\gamma(\mathbf{x})\approx\sum_{i=1}^{p}\sum_{j=1}^{J}B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\mathbf{x}(i)\right)t_{\gamma,j},\qquad \sigma(\mathbf{x};\tau_0)\approx\sum_{i=1}^{p}\sum_{j=1}^{J}B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\mathbf{x}(i)\right)t_{\sigma,j}.

Here, $\boldsymbol{\xi}^{(1)},\boldsymbol{\xi}^{(2)},\ldots,\boldsymbol{\xi}^{(p)}$ are $p$ sets of knots, each containing $M$ knots, and $\mathbf{x}(i)$ is assumed to fall within the range defined by the boundary knots $\xi_1^{(i)}$ and $\xi_M^{(i)}$. The choice of the degree $D$ is crucial: higher degrees tend to yield more accurate approximations but come with increased computational cost. We opt for $D=3$, the well-known cubic spline interpolation. To determine the coefficients $t_{\gamma,j}$, $j=1,\ldots,J$, we introduce, for $\mathbf{s}=[s_1,s_2,\ldots,s_N]^\top\in\mathbb{R}^N$, the B-spline matrix $\mathbf{B}_{d,\boldsymbol{\xi}}^{\mathbf{s}}\in\mathbb{R}^{N\times J}$ as

\mathbf{B}_{d,\boldsymbol{\xi}}^{\mathbf{s}}=\begin{bmatrix}B_{1,d,\boldsymbol{\xi}}(s_1)&B_{2,d,\boldsymbol{\xi}}(s_1)&\cdots&B_{J,d,\boldsymbol{\xi}}(s_1)\\ B_{1,d,\boldsymbol{\xi}}(s_2)&B_{2,d,\boldsymbol{\xi}}(s_2)&\cdots&B_{J,d,\boldsymbol{\xi}}(s_2)\\ \vdots&\vdots&\ddots&\vdots\\ B_{1,d,\boldsymbol{\xi}}(s_N)&B_{2,d,\boldsymbol{\xi}}(s_N)&\cdots&B_{J,d,\boldsymbol{\xi}}(s_N)\end{bmatrix}.

We further define the feature vector $\mathbf{v}_j$ as

\mathbf{v}_j=[\mathbf{x}_1^{(\text{off})}(j),\mathbf{x}_2^{(\text{off})}(j),\ldots,\mathbf{x}_{N^{(\text{off})}}^{(\text{off})}(j)]^\top\in\mathbb{R}^{N^{(\text{off})}},

where $j=1,2,\ldots,p$ and $\mathbf{x}_i^{(\text{off})}(j)$ denotes the $j$-th entry of observation $\mathbf{x}_i^{(\text{off})}$. Given the pairs $(\widetilde{\gamma}(\mathbf{x}_i^{(\text{off})}),\mathbf{x}_i^{(\text{off})})$, $i=1,2,\ldots,N^{(\text{off})}$, we determine the coefficients $t_{\gamma,j}$, $j=1,\ldots,J$, by solving the matrix equation

\left(\sum_{i=1}^{p}\mathbf{B}_{D,\boldsymbol{\xi}^{(i)}}^{\mathbf{v}_i}\right)\begin{bmatrix}t_{\gamma,1}\\ t_{\gamma,2}\\ \vdots\\ t_{\gamma,J}\end{bmatrix}=\begin{bmatrix}\widetilde{\gamma}(\mathbf{x}_1^{(\text{off})})\\ \widetilde{\gamma}(\mathbf{x}_2^{(\text{off})})\\ \vdots\\ \widetilde{\gamma}(\mathbf{x}_{N^{(\text{off})}}^{(\text{off})})\end{bmatrix}, \quad (2.17)

with the boundary knots satisfying

\xi_1^{(i)}=\min_{s}\,\mathbf{x}_s^{(\text{off})}(i),\qquad \xi_M^{(i)}=\max_{s}\,\mathbf{x}_s^{(\text{off})}(i),

for $i=1,\ldots,p$. We denote the solution of this equation by $\widetilde{t}_{\gamma,j}$, $j=1,\ldots,J$. Then, for a new observation $\mathbf{x}^{(\text{on})}$, we estimate the corresponding extreme value index by

\widetilde{\gamma}(\mathbf{x}^{(\text{on})})=\sum_{i=1}^{p}\sum_{j=1}^{J}B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\mathbf{x}^{(\text{on})}(i)\right)\widetilde{t}_{\gamma,j},

whenever $\xi_1^{(i)}\leq\mathbf{x}^{(\text{on})}(i)\leq\xi_M^{(i)}$. For coordinates $\mathbf{x}^{(\text{on})}(i)$ that fall outside the range of the boundary knots $\xi_1^{(i)}$ and $\xi_M^{(i)}$, the extreme value index is estimated as

\widetilde{\gamma}(\mathbf{x}^{(\text{on})})=\sum_{i=1}^{p}\sum_{j=1}^{J}B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\widetilde{\mathbf{x}}_{\gamma}(i)\right)\widetilde{t}_{\gamma,j},

where $\widetilde{\mathbf{x}}_{\gamma}(i)$ is the nearest offline feature value, obtained by minimizing the absolute difference:

\widetilde{\mathbf{x}}_{\gamma}(i)=\arg\min_{s}\left|\mathbf{x}^{(\text{on})}(i)-s\right|,\quad s\in\{\mathbf{x}_n^{(\text{off})}(i)\}_{n=1}^{N^{(\text{off})}}.

In summary, we estimate $\gamma(\mathbf{x})$ as

\widetilde{\gamma}(\mathbf{x}^{(\text{on})})=\sum_{i=1}^{p}\sum_{j=1}^{J}\begin{cases}B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\mathbf{x}^{(\text{on})}(i)\right)\widetilde{t}_{\gamma,j},&\text{if }\xi_1^{(i)}\leq\mathbf{x}^{(\text{on})}(i)\leq\xi_M^{(i)},\\ B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\widetilde{\mathbf{x}}_{\gamma}(i)\right)\widetilde{t}_{\gamma,j},&\text{otherwise}.\end{cases} \quad (2.18)
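Combining (2.17) and (2.18), the sketch below fits the coefficients offline and evaluates them online, reusing bspline_basis from the earlier sketch. The open-uniform knot construction and the least-squares solve of (2.17) are our assumptions; clamping an out-of-range coordinate to the offline range reproduces the nearest-point rule coordinate-wise:

```python
import numpy as np

D = 3                                              # cubic splines

def knot_vector(v, J):
    """Open uniform knots over the offline range of one feature (M = J + D + 1)."""
    inner = np.linspace(v.min(), v.max(), J + D + 1 - 2 * D)
    return np.concatenate([np.full(D, v.min()), inner, np.full(D, v.max())])

def basis_row(s, xi, J):
    return np.array([bspline_basis(j, D, xi, s) for j in range(J)])

def fit_coefficients(X_off, gamma_off, J=12):
    """Solve Eq. (2.17) (in the least-squares sense) for the coefficients t_gamma."""
    p = X_off.shape[1]
    knots = [knot_vector(X_off[:, i], J) for i in range(p)]
    B = sum(np.array([basis_row(s, knots[i], J) for s in X_off[:, i]])
            for i in range(p))
    t_gamma = np.linalg.lstsq(B, gamma_off, rcond=None)[0]
    return knots, t_gamma

def predict_gamma(x_on, knots, t_gamma):
    """Evaluate Eq. (2.18); out-of-range coordinates are clamped to the range."""
    J = len(t_gamma)
    total = 0.0
    for i, xi in enumerate(knots):
        s = min(max(x_on[i], xi[0]), xi[-1] - 1e-12)  # stay in half-open support
        total += basis_row(s, xi, J) @ t_gamma
    return total
```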

Similarly, the scale parameter $\sigma(\mathbf{x};\tau_0)$ is estimated by B-spline interpolation as

\widetilde{\sigma}(\mathbf{x}^{(\text{on})};\tau_0)=\sum_{i=1}^{p}\sum_{j=1}^{J}\begin{cases}B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\mathbf{x}^{(\text{on})}(i)\right)\widetilde{t}_{\sigma,j},&\text{if }\xi_1^{(i)}\leq\mathbf{x}^{(\text{on})}(i)\leq\xi_M^{(i)},\\ B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\widetilde{\mathbf{x}}_{\sigma}(i)\right)\widetilde{t}_{\sigma,j},&\text{otherwise}.\end{cases} \quad (2.19)

In this expression, the coefficients $\widetilde{t}_{\sigma,j}$, $j=1,2,\ldots,J$, are determined by solving the matrix equation

\left(\sum_{i=1}^{p}\mathbf{B}_{D,\boldsymbol{\xi}^{(i)}}^{\mathbf{v}_i}\right)\begin{bmatrix}t_{\sigma,1}\\ t_{\sigma,2}\\ \vdots\\ t_{\sigma,J}\end{bmatrix}=\begin{bmatrix}\widetilde{\sigma}(\mathbf{x}_1^{(\text{off})};\tau_0)\\ \widetilde{\sigma}(\mathbf{x}_2^{(\text{off})};\tau_0)\\ \vdots\\ \widetilde{\sigma}(\mathbf{x}_{N^{(\text{off})}}^{(\text{off})};\tau_0)\end{bmatrix}, \quad (2.20)

and $\widetilde{\mathbf{x}}_{\sigma}(i)$ satisfies

\widetilde{\mathbf{x}}_{\sigma}(i)=\arg\min_{s}\left|\mathbf{x}^{(\text{on})}(i)-s\right|,\quad s\in\{\mathbf{x}_n^{(\text{off})}(i)\}_{n=1}^{N^{(\text{off})}}.

In this way, the extreme conditional quantile for an online observation $\mathbf{x}^{(\text{on})}$ is estimated as

\widetilde{Q}_Y(\tau_0|\mathbf{x}^{(\text{on})})=\widetilde{\alpha}(\tau_0)+\widetilde{\boldsymbol{\beta}}^T(\tau_0)\mathbf{x}^{(\text{on})},\qquad \widetilde{Q}_Y(\tau_N|\mathbf{x}^{(\text{on})})=\begin{cases}\widetilde{Q}_Y(\tau_0|\mathbf{x}^{(\text{on})})+\frac{\widetilde{\sigma}(\mathbf{x}^{(\text{on})};\tau_0)}{\widetilde{\gamma}(\mathbf{x}^{(\text{on})})}\left(\tau_{0,N}^{\widetilde{\gamma}(\mathbf{x}^{(\text{on})})}-1\right),&\widetilde{\gamma}(\mathbf{x}^{(\text{on})})\neq 0,\\ \widetilde{Q}_Y(\tau_0|\mathbf{x}^{(\text{on})})+\widetilde{\sigma}(\mathbf{x}^{(\text{on})};\tau_0)\log\tau_{0,N},&\widetilde{\gamma}(\mathbf{x}^{(\text{on})})=0.\end{cases} \quad (2.21)

We summarize our method in Algorithm 1.

It is important to emphasize that B-spline interpolation, while highly effective for approximating the scale and shape parameters, does introduce a degree of bias into the estimation. This trade-off is worthwhile because it drastically reduces computation time for online observations: in this setting there is no need to solve the censored maximum likelihood problem for GPD fitting on online data. Instead, we capitalize on the historical offline data to perform the approximation via B-spline interpolation. Notably, the solutions of equations such as (2.17) and (2.20) rely exclusively on the historical dataset, so they can be obtained well before new online data arrive. Once an online observation is received, we simply substitute the interpolated parameters into (2.21), drastically reducing the computational burden of online data processing.

3 Experiments

[Figure 1 about here. Panel columns, left to right: $N^{(\text{on})}=1000,p=10$; $N^{(\text{on})}=2000,p=10$; $N^{(\text{on})}=2000,p=20$; $N^{(\text{on})}=2000,p=30$.]
Figure 1: Boxplots of the ARSE for the compared methods at different extreme quantile levels. The top, middle, and bottom rows show the results for $\tau_N=0.99$, $\tau_N=0.995$, and $\tau_N=0.999$, respectively. The noise variable $\epsilon_i$ is drawn from the t-distribution.
[Figure 2 about here. Panel columns, left to right: $N^{(\text{on})}=1000,p=10$; $N^{(\text{on})}=2000,p=10$; $N^{(\text{on})}=2000,p=20$; $N^{(\text{on})}=2000,p=30$.]
Figure 2: Boxplots of the ARSE for the compared methods at different extreme quantile levels. The top, middle, and bottom rows show the results for $\tau_N=0.99$, $\tau_N=0.995$, and $\tau_N=0.999$, respectively. The noise variable $\epsilon_i$ is drawn from the GPD.

3.1 Setup

In this section, we conduct simulations to investigate the performance of the proposed EMI. We generate offline data $(y_i^{(\text{off})},\mathbf{x}_i^{(\text{off})})$, $i=1,2,\ldots,N^{(\text{off})}$, and online data $(y_i^{(\text{on})},\mathbf{x}_i^{(\text{on})})$, $i=1,2,\ldots,N^{(\text{on})}$, in the same way. Specifically, we draw the covariates $\mathbf{x}_i\in\mathbb{R}^p$ independently from the standard uniform distribution. The response $y_i$ is generated by

y_i=\alpha+\boldsymbol{\beta}^\top\mathbf{x}_i+\epsilon_i,

where we consider two models for the univariate noise $\epsilon_i$:

• Model 1: $\epsilon_i$ is the absolute value of a random variable drawn from the t-distribution with location zero and degrees of freedom $\psi_i$ satisfying

    \epsilon_i\sim\mathrm{T}(\psi_i),\quad \psi_i=\exp\left(\boldsymbol{\eta}_1^\top\mathbf{x}_i\right).

• Model 2: $\epsilon_i$ is the absolute value of a random variable drawn from the GPD with location $1$, scale $0.25$, and shape parameter $\phi_i$ satisfying

    \epsilon_i\sim\mathrm{GPD}(1,0.25,\phi_i),\quad \phi_i=1/(\boldsymbol{\eta}_2^\top\mathbf{x}_i).

Note that both models are heavy-tailed, and the degrees of freedom of the t-distribution and the shape parameters of the GPD depend on the covariates. We set $\boldsymbol{\eta}_1=[0.2,\ldots,0.2]^\top\in\mathbb{R}^p$ and $\boldsymbol{\eta}_2=[0.6,\ldots,0.6]^\top\in\mathbb{R}^p$. Without loss of generality, we set $\alpha=0.5$ and $\boldsymbol{\beta}=[0.5,\ldots,0.5]^\top\in\mathbb{R}^p$. We aim to estimate the conditional quantile $Q_Y(\tau_N|\mathbf{x})$ at extreme probability levels $\tau_N\in\{0.99,0.995,0.999\}$ with $N^{(\text{off})}=1000$. The intermediate quantile level $\tau_0$ is set to $0.8$.
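For reproducibility, a sketch of the Model 1 data-generating process (seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
N_off, p = 1000, 10
alpha, beta = 0.5, np.full(p, 0.5)
eta1 = np.full(p, 0.2)

X_off = rng.uniform(size=(N_off, p))            # covariates ~ U(0, 1)
psi = np.exp(X_off @ eta1)                      # covariate-dependent df
eps = np.abs(rng.standard_t(df=psi))            # Model 1 noise: |T(psi_i)|
y_off = alpha + X_off @ beta + eps
```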

To assess performance, we conduct $100$ replications and calculate the average relative squared prediction error (ARSE) on the online data $(y_i^{(\text{on})},\mathbf{x}_i^{(\text{on})})$, $i=1,2,\ldots,N^{(\text{on})}$. For the $j$-th replication, $\text{ARSE}(\tau_N,j)$ is defined as

\text{ARSE}\left(\tau_N,j\right)=\frac{1}{N^{(\text{on})}}\sum_{i=1}^{N^{(\text{on})}}\left(\frac{\widetilde{Q}_Y(\tau_N|\mathbf{x}_i^{(\text{on})})}{Q_Y(\tau_N|\mathbf{x}_i^{(\text{on})})}-1\right)^2.

Here $\widetilde{Q}_Y(\tau_N|\mathbf{x}_i^{(\text{on})})$ is the estimator computed according to Eq. (2.21) and $Q_Y(\tau_N|\mathbf{x}_i^{(\text{on})})$ is the true conditional quantile. For comparison, we apply the traditional method (denoted Linear), which solves

\min_{\alpha,\boldsymbol{\beta}}\sum_{i=1}^{N^{(\text{on})}}\rho_{\tau}(y_i^{(\text{on})}-\alpha-\boldsymbol{\beta}^T\mathbf{x}_i^{(\text{on})}), \quad (3.1)

with $\tau=\tau_N$. Linear estimates the conditional quantile as a linear function of $\mathbf{x}_i^{(\text{on})}$:

\widetilde{Q}_Y'(\tau_N|\mathbf{x}_i^{(\text{on})})=\widetilde{\alpha}'(\tau_N)+\widetilde{\boldsymbol{\beta}}'^{\top}(\tau_N)\mathbf{x}_i^{(\text{on})},\quad i=1,\ldots,N^{(\text{on})},

where $\widetilde{\alpha}'(\tau_N)$ and $\widetilde{\boldsymbol{\beta}}'(\tau_N)$ denote the solutions to (3.1).
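Under Model 1 the true conditional quantile needed in the ARSE has a closed form: since the noise is $|T_{\psi_i}|$, its $\tau$-quantile is the $(1+\tau)/2$ quantile of the t-distribution. A sketch (helper names ours):

```python
import numpy as np
from scipy.stats import t as student_t

def true_quantile_model1(X, tau, alpha=0.5, beta=0.5, eta=0.2):
    """Q_Y(tau | x) = alpha + beta' x + quantile of |T_psi| under Model 1."""
    p = X.shape[1]
    psi = np.exp(X @ np.full(p, eta))
    # |T| <= q iff T in [-q, q], so the tau-quantile of |T| is the
    # (1 + tau)/2 quantile of the t-distribution with df psi.
    return alpha + X @ np.full(p, beta) + student_t.ppf((1.0 + tau) / 2.0, df=psi)

def arse(q_tilde, q_true):
    """Average relative squared error of one replication."""
    return np.mean((q_tilde / q_true - 1.0) ** 2)
```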

[Figure 3 about here. Panels: (a) offline data size; (b) choice of threshold $\tau_0$.]
Figure 3: Boxplots of the ARSE with $N^{(\text{on})}=1000$, $p=30$. The noise variable $\epsilon_i$ is drawn from the t-distribution. (a) ARSE under different offline data sizes with $\tau_N=0.999$. (b) ARSE under different choices of the threshold $\tau_0$ with $\tau_N=0.99$.

3.2 Estimation Results

The comparison of the proposed EMI with the Linear approach is presented in Fig. 1 and Fig. 2. For each extreme quantile level $\tau_N$, we generated data with various combinations of online data size $N^{(\text{on})}$ and feature dimension $p$. We observe that as $\tau_N$ approaches $1$, the ARSE of both methods grows. This is expected since observations above the $\tau_N$ quantile become scarce. However, EMI clearly outperforms the Linear approach, with significantly lower ARSE. This superior performance can be attributed to two factors. First, EMI leverages the GPD to approximate the tail distribution, enabling extrapolation beyond the observation range. Second, when sufficient offline historical data are available, B-spline interpolation effectively captures the relationship between $\mathbf{x}$ and the covariate-dependent parameters, yielding accurate interpolation.

To further analyze the impact of the offline data size on the estimation, we conduct experiments with offline data sizes ranging from $500$ to $2000$, as shown in Fig. 3(a). The results reveal that, as the offline data size increases, the ARSE of EMI gradually decreases while exhibiting reduced variance. This trend arises because a larger offline sample improves the B-spline interpolation, particularly when $\mathbf{x}_i^{(\text{on})}$ falls beyond the observation range. Additionally, we examine the influence of the threshold $\tau_0$ on the estimation: Fig. 3(b) shows that across the considered threshold choices, EMI consistently yields better estimates than the Linear approach.

[Figure 4 about here. Panels: (a) $\tau_N=0.99$; (b) $\tau_N=0.995$; (c) $\tau_N=0.999$.]
Figure 4: Boxplots of the ARSE for different methods at different extreme quantile levels. The noise variable $\epsilon_i$ is drawn from the t-distribution. We set $N^{(\text{on})}=2000$ and $p=10$.
[Figure 5 about here. Panels: (a) $\tau_N=0.99$; (b) $\tau_N=0.995$; (c) $\tau_N=0.999$.]
Figure 5: Boxplots of the ARSE for different methods at different extreme quantile levels. The noise variable $\epsilon_i$ is drawn from the t-distribution. We set $N^{(\text{on})}=2000$ and $p=20$.
[Figure 6 about here. Panels: (a) $\tau_N=0.99$; (b) $\tau_N=0.995$; (c) $\tau_N=0.999$.]
Figure 6: Boxplots of the ARSE for different methods at different extreme quantile levels. The noise variable $\epsilon_i$ is drawn from the t-distribution. We set $N^{(\text{on})}=2000$ and $p=30$.
[Figure 7 about here. Panels: (a)-(c) $N^{(\text{on})}=2000$ with $p=10,20,30$; (d) $N^{(\text{on})}=2000$, $p=10$.]
Figure 7: Boxplots of the ARSE for different methods. We set $N^{(\text{off})}=1000$. (a)-(c) ARSE under different covariate dimensions with $\tau_N=0.99$. (d) ARSE with $\tau_N=0.995$.
[Figure 8 about here. Panels: (a) $N^{(\text{on})}=2000$, $p=20$; (b) $N^{(\text{on})}=2000$, $p=30$.]
Figure 8: Boxplots of the ARSE for different methods with $\tau_N=0.995$. We set $N^{(\text{off})}=1000$.
[Figure 9 about here. Panels: (a)-(c) $N^{(\text{on})}=2000$ with $p=10,20,30$.]
Figure 9: Boxplots of the ARSE for different methods under different covariate dimensions. We set $N^{(\text{off})}=1000$ and $\tau_N=0.995$.

3.3 Bias Analysis

We now study the bias introduced by the B-spline interpolation. For comparison, for each online observation $\mathbf{x}_i^{(\text{on})}$ we skip the B-spline interpolation and instead fit the tail distribution with a GPD directly, estimating the shape and scale parameters by solving the bilevel program

\max_{\gamma,\sigma}\;L^{(\text{MLE})}_N(\gamma,\sigma\,|\,Q_Y(\tau_0|\mathbf{x}_i^{(\text{on})}))\quad \text{s.t.}\quad \min_{\alpha,\boldsymbol{\beta}}\;L^{(\text{QR})}_{N\tau_0}(\alpha,\boldsymbol{\beta}). \quad (3.2)

Denote the solutions to problem (3.2) as (ᾱ(τ_0), β̄(τ_0), γ̄(x_i^(on)), σ̄(x_i^(on); τ_0)). Then the conditional quantile is estimated as

\begin{split}
\bar{Q}_{Y}(\tau_{0}\,|\,\mathbf{x}^{(\text{on})}) &= \bar{\alpha}(\tau_{0})+\bar{\boldsymbol{\beta}}^{T}(\tau_{0})\,\mathbf{x}^{(\text{on})},\\
\bar{Q}_{Y}(\tau_{N}\,|\,\mathbf{x}^{(\text{on})}) &=
\begin{cases}
\bar{Q}_{Y}(\tau_{0}\,|\,\mathbf{x}^{(\text{on})})+\dfrac{\bar{\sigma}(\mathbf{x}^{(\text{on})},\tau_{0})}{\bar{\gamma}(\mathbf{x}^{(\text{on})})}\left(\tau_{0,N}^{\bar{\gamma}(\mathbf{x}^{(\text{on})})}-1\right), & \bar{\gamma}(\mathbf{x}^{(\text{on})})\neq 0,\\[6pt]
\bar{Q}_{Y}(\tau_{0}\,|\,\mathbf{x}^{(\text{on})})+\bar{\sigma}(\mathbf{x}^{(\text{on})},\tau_{0})\log\tau_{0,N}, & \bar{\gamma}(\mathbf{x}^{(\text{on})})=0.
\end{cases}
\end{split} \qquad (3.3)

We denote the above method as EMI w/o. Interp., that is, EMI without B-spline interpolation. We conduct performance tests under various extreme quantile levels τ_N, and the results are summarized in Fig. 4, Fig. 5, and Fig. 6. Notably, EMI performs slightly worse than EMI w/o. Interp., which coincides with our expectations, since the interpolation introduces an additional approximation. However, it is important to emphasize that B-spline interpolation allows us to bypass the costly GPD fitting for every online observation, which makes the method far more practical.
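For concreteness, here is a minimal sketch of this baseline under stated assumptions: the lower level of (3.2) is solved as a linear quantile regression at τ_0, the upper level as a GPD maximum likelihood fit on the exceedances, and (3.3) is applied with τ_{0,N} read as (1 - τ_0)/(1 - τ_N). The simulated data and the constant (covariate-free) GPD parameters are simplifications for illustration, not the paper's implementation.

```python
# A sketch of EMI w/o. Interp. for one online point: quantile regression at
# tau0 (lower level of (3.2)), GPD MLE on the exceedances (upper level),
# then the extrapolation (3.3). Data and constants are assumptions.
import numpy as np
import statsmodels.api as sm
from scipy.stats import genpareto

rng = np.random.default_rng(0)
N, p, tau0, tauN = 1000, 5, 0.9, 0.999
X = rng.normal(size=(N, p))
y = 1.0 + X @ np.ones(p) + rng.standard_t(df=3, size=N)  # heavy-tailed noise

Xd = sm.add_constant(X)
coef = sm.QuantReg(y, Xd).fit(q=tau0).params              # Q_Y(tau0 | x)
exceed = y - Xd @ coef
exceed = exceed[exceed > 0]                               # exceedances

gamma, _, sigma = genpareto.fit(exceed, floc=0)           # shape, loc=0, scale

x_on = rng.normal(size=p)                                 # a new online point
q0 = coef @ np.r_[1.0, x_on]
r = (1.0 - tau0) / (1.0 - tauN)                           # tau_{0,N} (assumed)
q_ext = (q0 + sigma / gamma * (r**gamma - 1.0)
         if abs(gamma) > 1e-8
         else q0 + sigma * np.log(r))
print(f"gamma = {gamma:.3f}, sigma = {sigma:.3f}, Q({tauN}|x) = {q_ext:.3f}")
```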

Figure 10: Illustrations of online scenarios at (a) τ_N = 0.99, (b) τ_N = 0.995, (c) τ_N = 0.999. We set N^(on) = 1000 and p = 10.
Figure 11: Illustrations of online scenarios at (a) τ_N = 0.99, (b) τ_N = 0.995, (c) τ_N = 0.999. We set N^(on) = 1000 and p = 20.

3.4 Results of Online Data

In our final scenario, we consider a more realistic situation in which online data arrive incrementally, one point at a time. At each time step t, we observe a new pair (y_t^(on), x_t^(on)). The observation set at time t, denoted (𝒴^(t), 𝒳^(t)), is the union of the new observation (y_t^(on), x_t^(on)) and the observation set from the previous time step t-1:

\begin{split}
\mathcal{Y}^{(t)} &= \mathcal{Y}^{(t-1)}\cup\{y_{t}^{(\text{on})}\},\\
\mathcal{X}^{(t)} &= \mathcal{X}^{(t-1)}\cup\{\mathbf{x}_{t}^{(\text{on})}\},
\end{split}

for t = 1, …, T. The initial observation set at time t = 0 consists of the offline data only, i.e., 𝒴^(0) = {y_i^(off)} and 𝒳^(0) = {x_i^(off)}. For comparison, the Linear method solves the following linear quantile regression problem at each time step t:

\min_{\alpha,\boldsymbol{\beta}} \sum_{(y,\mathbf{x})\in(\mathcal{Y}^{(t)},\mathcal{X}^{(t)})} \rho_{\tau_{N}}\!\left(y-\alpha-\boldsymbol{\beta}^{T}\mathbf{x}\right).

On the other hand, the EMI w/o. Interp. method applies extrapolation based on the GPD and solves the following optimization problem at each time step tt:

\begin{split}
\min_{\gamma,\sigma}\quad & -\sum_{z_{i}\in\mathcal{Z}^{(t)}} l(\gamma,\sigma\,|\,z_{i})\\
\text{s.t.}\quad & \min_{\alpha,\boldsymbol{\beta}} \sum_{(y,\mathbf{x})\in(\mathcal{Y}^{(t)},\mathcal{X}^{(t)})} \rho_{\tau_{0}}\!\left(y-\alpha-\boldsymbol{\beta}^{T}\mathbf{x}\right).
\end{split} \qquad (3.4)

Here, 𝒵^(t) is the set of transformed observations z_i = (y_i - α̃″(τ_0) - β̃″(τ_0)^T x_i)_+ for each pair (y_i, x_i) ∈ (𝒴^(t), 𝒳^(t)), where α̃″(τ_0) and β̃″(τ_0) solve the lower-level problem in (3.4). The results are shown in Fig. 7, Fig. 8, and Fig. 9. Comparing the EMI method with the Linear method, we observe that the performance of EMI is consistently better. Compared to EMI w/o. Interp., the performance of EMI is slightly worse, which is consistent with the previous experimental results. In Fig. 10 and Fig. 11, we present visualizations of extreme conditional quantile estimates for online data streams; to keep the presentation concise, we select only a portion of the data. These figures demonstrate EMI's effectiveness in capturing changes in the conditional quantiles.
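A runnable toy of this incremental protocol, under illustrative assumptions about the data-generating process and sizes, is sketched below: each step appends the new pair to the observation set, refits the Linear baseline at level τ_N on all data seen so far, and rebuilds the exceedance set 𝒵^(t) for the GPD refit of (3.4).

```python
# A toy of the online loop: grow (Y^(t), X^(t)), refit the Linear baseline,
# then solve (3.4) sequentially (tau0-level regression, GPD MLE on Z^(t)).
# All sizes and the data-generating process are illustrative assumptions.
import numpy as np
import statsmodels.api as sm
from scipy.stats import genpareto

rng = np.random.default_rng(1)
tau0, tauN, p = 0.9, 0.995, 3
X = list(rng.normal(size=(500, p)))                      # offline covariates
Y = list(1.0 + np.asarray(X) @ np.ones(p) + rng.standard_t(3, 500))

for t in range(1, 4):                                    # a few online steps
    x_t = rng.normal(size=p)
    y_t = 1.0 + x_t @ np.ones(p) + rng.standard_t(3)
    X.append(x_t); Y.append(y_t)                         # grow the observation set
    Xd, Yd = sm.add_constant(np.asarray(X)), np.asarray(Y)
    # Linear baseline: refit the tau_N-level quantile regression from scratch.
    q_lin = sm.QuantReg(Yd, Xd).fit(q=tauN).params @ np.r_[1.0, x_t]
    # EMI w/o. Interp.: lower level at tau0, then rebuild Z^(t) and refit GPD.
    coef = sm.QuantReg(Yd, Xd).fit(q=tau0).params
    Z = Yd - Xd @ coef
    Z = Z[Z > 0]                                         # positive parts only
    g, _, s = genpareto.fit(Z, floc=0)
    print(f"t={t}: n={Yd.size}, |Z|={Z.size}, Linear={q_lin:.2f}, gamma={g:.2f}")
```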

4 Application

Table 1: Summary statistics of covariates X.

| X   | Name                      | Mean   | Standard deviation | Skewness | Min.    | Max.   |
|-----|---------------------------|--------|--------------------|----------|---------|--------|
| x_1 | Market return             | 0.16   | 2.23               | -0.29    | -15.35  | 12.00  |
| x_2 | Three-month yield change  | -0.22  | 24.43              | -0.57    | -182.00 | 192.00 |
| x_3 | Equity volatility         | 0.83   | 0.43               | 3.06     | 0.28    | 4.70   |
| x_4 | Credit spread change      | -0.02  | 8.86               | 0.74     | -48.00  | 60.00  |
| x_5 | Term spread change        | 0.15   | 20.66              | 0.09     | -168.00 | 146.00 |
| x_6 | TED spread                | 121.52 | 94.42              | 1.74     | 6.34    | 591.00 |
| x_7 | Real estate excess return | -0.08  | 2.38               | 0.04     | -14.49  | 16.58  |
Table 2: Summary statistics of Y.

| Y    | Company              | Start time        | End time            |
|------|----------------------|-------------------|---------------------|
| S&P  | S&P 500              | January 1st, 1971 | December 28th, 2002 |
| AAPL | Apple Inc.           | January 1st, 1981 | December 28th, 2002 |
| BA   | The Boeing Company   | January 1st, 1971 | December 28th, 2002 |
| JPM  | JPMorgan Chase & Co. | January 1st, 1981 | December 28th, 2002 |

In this section, we apply our algorithm EMI to estimate the extreme conditional quantiles of a real dataset consisting of the daily stock prices of the S&P 500 index and three companies (see Table 2). Weekly returns are derived from the daily observations. The variables employed in our analysis include ret (weekly return), marketret (weekly market return), yield3m (three-month yield change), mktsd (equity volatility), credit (credit spread change), term (term spread change), ted (short-term TED spread), and housing (real estate excess return), as documented in Adrian and Brunnermeier (2016). More specifically, the variables are defined as:

• Y: The weekly loss, namely the negative weekly return;

• x_1: The weekly market return of the S&P 500;

• x_2: The change in the three-month yield from the Federal Reserve Board's H.15 release. We use the change in the three-month Treasury bill rate because we find that the change, not the level, is most significant in explaining the tails of financial sector market-valued asset returns;

• x_3: Equity volatility, computed as the 22-day rolling standard deviation of the daily CRSP equity market return;

• x_4: The change in the credit spread between Moody's Baa-rated bonds and the ten-year Treasury rate from the Federal Reserve Board's H.15 release;

• x_5: The change in the slope of the yield curve, measured by the spread between the composite long-term bond yield and the three-month bill rate obtained from the Federal Reserve Board's H.15 release;

• x_6: A short-term TED spread, defined as the difference between the three-month LIBOR rate and the three-month secondary market Treasury bill rate. This spread measures short-term funding liquidity risk. We use the three-month LIBOR rate available from the British Bankers' Association and obtain the three-month Treasury rate from the Federal Reserve Bank of New York;

• x_7: The weekly real estate sector return in excess of the market financial sector return (from real estate companies with SIC codes 65-66).

Figure 12: Summary of the weekly returns: (a) S&P, (b) AAPL, (c) BA, (d) JPM.
Figure 13: Results for S&P. The green curve is the estimated conditional quantile, and the blue curve is the true response.

We present a summary of the variables in Table 1. At time t, we assume that the response is related to the covariates at time t-1. As a result, the quantile of the response Y_t depends on the covariate X_{t-1} in a linear form:

Q_{Y_{t}}(\tau\,|\,\mathbf{X}_{t-1})=\alpha(\tau)+\boldsymbol{\beta}^{T}(\tau)\,\mathbf{X}_{t-1}.

In our data analysis, our primary focus is on estimating the extreme conditional quantile at level τ = 0.95 for the S&P 500 and τ = 0.99 for the three companies using EMI. Initially, we conduct a brief visual assessment of the suitability of the heavy-tail assumption for our data. As depicted in Fig. 12, the returns exhibit heavy tails on both ends. Since our emphasis lies in scrutinizing significant losses, we direct the analysis toward the upper tail of the weekly losses, that is, the extremely large losses. For the S&P 500 and each company, we follow the setting in Section 3.4 and estimate the extreme conditional quantile of future weekly returns from historical data. To be specific, let 𝒴 = {y_i}_{i=1}^{T} and 𝒳 = {x_i}_{i=1}^{T}, where T represents the total time span and (y_i, x_i) denotes the i-th weekly observation. At each time point t = 521, 525, …, T, the observation set is defined as:

\mathcal{Y}^{(t)}=\{y_{i}\}_{i=t-520}^{t},\qquad \mathcal{X}^{(t)}=\{\mathbf{x}_{i}\}_{i=t-520}^{t}.

In essence, we leverage the data from the past ten years to estimate the extreme conditional quantile of the subsequent four weekly returns.
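The sketch below mirrors this rolling-window protocol; weekly_loss, covariates, and emi_quantile are hypothetical placeholders (the last stands in for the full EMI estimate described earlier), not the actual dataset or implementation.

```python
# A sketch of the rolling-window scheme: a trailing ten-year (521-week)
# window is refit every four weeks and used to estimate the extreme
# conditional quantile of the next four weekly losses. All arrays and the
# estimator below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
T = 1200                                     # hypothetical number of weeks
weekly_loss = rng.standard_t(3, T)           # placeholder for the loss series
covariates = rng.normal(size=(T, 7))         # placeholder for Table 1 variables

def emi_quantile(y_hist, x_hist, x_new, tau):
    """Placeholder for the EMI estimate of Q_Y(tau | x_new)."""
    return np.quantile(y_hist, tau)          # unconditional stand-in

estimates = {}
for t in range(521, T, 4):                   # t = 521, 525, ...
    y_win = weekly_loss[t - 520:t + 1]       # Y^(t): trailing 521-week window
    x_win = covariates[t - 520:t + 1]
    for s in range(t + 1, min(t + 5, T)):    # the next four weeks
        # the response at week s depends on the covariates at week s - 1
        estimates[s] = emi_quantile(y_win, x_win, covariates[s - 1], 0.99)
```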

Figure 14: Results for AAPL. The green curve is the estimated conditional quantile, and the blue curve is the true response.
Figure 15: Results for BA. The green curve is the estimated conditional quantile, and the blue curve is the true response.
Figure 16: Results for JPM. The green curve is the estimated conditional quantile, and the blue curve is the true response.

We summarize our results in Fig. 13 (S&P), Fig. 14 (AAPL), Fig. 15 (BA), and Fig. 16 (JPM). We observe that the S&P 500 index exhibits the lowest conditional quantiles, indicating that it has the most robust capability to resist risk. Among the three companies studied, the conditional quantile of the losses of AAPL is much larger than those of BA and JPM, probably reflecting the fact that technology companies suffer more risk during crisis periods. In conclusion, our empirical analysis demonstrates that the proposed extreme conditional quantile estimator serves as a good risk measure, particularly when considering loss values in the tail regions.

5 Conclusion

In this paper, we introduce an algorithm called EMI designed for estimating extreme conditional quantiles. Operating with finite offline observations, our approach approximates the exceedances above a predetermined threshold using the GPD. We formulate the extrapolation process as a bilevel programming problem, making it amenable to efficient solution through conventional optimization techniques. To enhance flexibility and computational efficiency, we incorporate B-spline interpolation for the covariate-dependent parameters. This interpolation enables EMI to estimate conditional quantiles for a potentially infinite stream of online data, avoiding the computational cost of repeated GPD fitting. Empirical experiments demonstrate that the proposed EMI consistently outperforms the linear quantile regression baseline.

6 Acknowledgment

Yanxi Hou’s work was supported by the MOE Laboratory for National Development and Intelligent Governance, Fudan University, the National Natural Science Foundation of China Grant 72171055, and the Natural Science Foundation of Shanghai Grant 20ZR1403900.

References

  • Adrian and Brunnermeier [2016] Adrian T, Brunnermeier MK (2016) CoVaR. American Economic Review 106(7):1705–1741
  • Aiyoshi and Shimizu [1981] Aiyoshi E, Shimizu K (1981) Hierarchical decentralized systems and its new solution by a barrier method. IEEE Transactions on Systems, Man and Cybernetics 11(6):444–449
  • Allen et al [2012] Allen L, Bali TG, Tang Y (2012) Does systemic risk in the financial sector predict future economic downturns? The Review of Financial Studies 25(10):3000–3036
  • Angelo et al [2013] Angelo JS, Krempser E, Barbosa HJ (2013) Differential evolution for bilevel programming. In: 2013 IEEE Congress on Evolutionary Computation, pp 470–477
  • Balkema and De Haan [1974] Balkema AA, De Haan L (1974) Residual life time at great age. The Annals of Probability 2(5):792–804
  • Drees et al [2004] Drees H, Ferreira A, De Haan L (2004) On maximum likelihood estimation of the extreme value index. Annals of Applied Probability 14(3):1179–1201
  • Ferreira and De Haan [2015] Ferreira A, De Haan L (2015) On the block maxima method in extreme value theory: PWM estimators. The Annals of Statistics 43(1):276–298
  • Gaber et al [2005] Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: A review. ACM Sigmod Record 34(2):18–26
  • Granello and Wheaton [2004] Granello DH, Wheaton JE (2004) Online data collection: Strategies for research. Journal of Counseling & Development 82(4):387–393
  • He et al [2022] He Y, Peng L, Zhang D, et al (2022) Risk analysis via generalized pareto distributions. Journal of Business & Economic Statistics 40(2):852–867
  • Hosking et al [1985] Hosking JRM, Wallis JR, Wood EF (1985) Estimation of the generalized extreme-value distribution by the method of probability-weighted moments. Technometrics 27(3):251–261
  • Hou et al [2022] Hou Y, Kang SK, Lo CC, et al (2022) Three-step risk inference in insurance ratemaking. Insurance: Mathematics and Economics 105:1–13
  • Koenker and Bassett Jr [1978] Koenker R, Bassett Jr G (1978) Regression quantiles. Econometrica 46(1):33–50
  • Koenker and Hallock [2001] Koenker R, Hallock KF (2001) Quantile regression. Journal of Economic Perspectives 15(4):143–156
  • Li and Wang [2019] Li D, Wang HJ (2019) Extreme quantile estimation for autoregressive models. Journal of Business & Economic Statistics 37(4):661–670
  • Luo et al [1996] Luo ZQ, Pang JS, Ralph D (1996) Mathematical Programs with Equilibrium Constraints. Cambridge University Press, Cambridge, England
  • Lv et al [2007] Lv Y, Hu T, Wang G, et al (2007) A penalty function method based on Kuhn–Tucker condition for solving linear bilevel programming. Applied Mathematics and Computation 188(1):808–813
  • Mathieu et al [1994] Mathieu R, Pittard L, Anandalingam G (1994) Genetic algorithm based approach to bi-level linear programming. Operations Research 28(1):1–21
  • Naveau et al [2009] Naveau P, Guillou A, Cooley D, et al (2009) Modelling pairwise dependence of maxima in space. Biometrika 96(1):1–17
  • Pickands III [1975] Pickands III J (1975) Statistical inference using extreme order statistics. The Annals of Statistics 3(1):119–131
  • Resnick [2008] Resnick SI (2008) Extreme Values, Regular Variation, and Point Processes. Springer Science & Business Media, Berlin, Germany
  • Sinha et al [2014] Sinha A, Malo P, Frantsev A, et al (2014) Finding optimal strategies in a multi-period multi-leader–follower Stackelberg game using an evolutionary algorithm. Computers & Operations Research 41:374–385
  • Sinha et al [2017] Sinha A, Malo P, Deb K (2017) A review on bilevel optimization: From classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation 22(2):276–295
  • Smith [1987] Smith RL (1987) Estimating tails of probability distributions. The Annals of Statistics 15(3):1174–1207
  • Velthoen et al [2019] Velthoen J, Cai JJ, Jongbloed G, et al (2019) Improving precipitation forecasts using extreme quantile regression. Extremes 22:599–622
  • Wang and Li [2013] Wang HJ, Li D (2013) Estimation of extreme conditional quantiles through power transformation. Journal of the American Statistical Association 108(503):1062–1074
  • Wang et al [2012] Wang HJ, Li D, He X (2012) Estimation of high conditional quantiles for heavy-tailed distributions. Journal of the American Statistical Association 107(500):1453–1464
  • Xu et al [2022] Xu W, Hou Y, Li D (2022) Prediction of extremal expectile based on regression models with heteroscedastic extremes. Journal of Business & Economic Statistics 40(2):522–536