
Online Prediction of Extreme Conditional Quantiles via B-Spline Interpolation

Zhengpin Li ([email protected]) · Jian Wang ([email protected]) · Yanxi Hou ([email protected])

School of Data Science, Fudan University, Yangpu District, Shanghai, China
Abstract

Extreme quantiles are critical for understanding the behavior of data in the tail of a distribution. They are challenging to estimate, particularly when few observations fall in the tail. In such cases, extreme value theory offers a solution by approximating the tail distribution with the Generalized Pareto Distribution (GPD), which permits extrapolation beyond the range of observed data and makes it a valuable tool for various applications. In conditional settings, however, where estimation relies on covariates, existing methods may require a computationally expensive GPD fit for each observation. This burden becomes prohibitive as the number of observations grows, potentially without bound. To address this issue, we propose an interpolation-based algorithm named EMI, which enables online prediction of extreme conditional quantiles from finite offline observations. Combining quantile regression with GPD-based extrapolation, EMI is formulated as a bilevel programming problem that can be solved efficiently with classic optimization methods. Once estimates for the offline observations are obtained, EMI applies B-spline interpolation to the covariate-dependent parameters, enabling estimation for online observations with only finitely many GPD fits. Simulations and real data analysis demonstrate the effectiveness of EMI across various scenarios.

keywords:
Extreme conditional quantile, generalized Pareto distribution, quantile regression, B-splines.

1 Introduction

In fields of statistical analysis such as risk management, finance, and environmental analysis, quantiles play a fundamental role in understanding data distributions and their various characteristics (Allen et al, 2012). Quantiles act as dividing points, segmenting the range of a probability distribution into intervals of equal probability. Among the commonly used quantiles, the extreme quantile assumes a critical role, especially when rare events carry significant implications. It sheds light on the tail behavior of a distribution and finds wide application in decision-making, policy formulation, and strategic planning (Resnick, 2008). Given a dataset of $N$ identically distributed observations, a natural estimator of an extreme quantile is the sample quantile. This non-parametric approach performs admirably when the quantile level $\tau_N$ is only moderately high, i.e., $\tau_N \rightarrow 1$ with $N(1-\tau_N) \rightarrow \infty$, so that a sufficient number of observations lie above the level $\tau_N$. However, when the sample is not large enough and observations above the more extreme level $\tau_N$ become scarce, in the sense that $N(1-\tau_N) \rightarrow c$ for some $c \in [0,\infty)$, non-parametric estimators are no longer efficient and may cause serious underestimation (He et al, 2022). This is a concern for accurately assessing risk, making informed financial decisions, and devising effective environmental strategies.

An alternative way to estimate extreme quantiles involves extrapolation beyond the range of observed data. Extreme Value Theory (EVT) offers a systematic framework for precisely this purpose, allowing us to analyze the tail of a distribution and model the excess distribution above a prefixed threshold. The core principle used here is the Pickands-Balkema-de Haan theorem (Balkema and De Haan, 1974), which states that under mild conditions the distribution of excesses over a high threshold is well approximated by the Generalized Pareto Distribution (GPD). The excess distribution, denoted $F_u(x)$, is defined as

F_u(x)=\mathbb{P}(X-u\leq x\mid X>u),\quad 0\leq x<x_F-u,

where $x_F$ is the right endpoint of the distribution function $F(x)$, and $u$ serves as a large threshold. According to the Pickands-Balkema-de Haan theorem, as $u$ approaches the right endpoint, i.e., $u\rightarrow x_F$, the excess distribution converges to the GPD:

\lim_{u\rightarrow x_F}F_u(x)=G_{\gamma,\sigma}(x),\quad 0\leq x<x_F-u,

where $G_{\gamma,\sigma}(x)=1-(1+\gamma x/\sigma)_+^{-1/\gamma}$ is the cumulative distribution function of the GPD. Here, $\gamma\in\mathbb{R}$ and $\sigma>0$ are called the shape and scale parameters, respectively. Various methods, such as the Pickands estimator (Pickands III, 1975), the probability-weighted moments method (Hosking et al, 1985), and the maximum likelihood method (Smith, 1987), are available for estimating these parameters.
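For concreteness, the following minimal Python sketch (helper names ours, assuming numpy) evaluates $G_{\gamma,\sigma}$ and its inverse; the $\gamma=0$ branches are the exponential limit:

```python
import numpy as np

def gpd_cdf(x, gamma, sigma):
    """CDF of the GPD: G(x) = 1 - (1 + gamma*x/sigma)_+^(-1/gamma), x >= 0."""
    if abs(gamma) < 1e-12:                     # gamma -> 0: exponential tail
        return 1.0 - np.exp(-x / sigma)
    return 1.0 - np.maximum(1.0 + gamma * x / sigma, 0.0) ** (-1.0 / gamma)

def gpd_quantile(q, gamma, sigma):
    """Inverse of gpd_cdf on 0 <= q < 1."""
    if abs(gamma) < 1e-12:
        return -sigma * np.log(1.0 - q)
    return sigma / gamma * ((1.0 - q) ** (-gamma) - 1.0)
```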

In light of the limiting distribution of extreme values, extensive research has been devoted to estimating extreme quantiles through extrapolation. Among these methods, the Block Maxima Method (BMM) is a widely used approach (Naveau et al, 2009). BMM partitions the dataset into non-overlapping blocks and selects the maximum observation within each block; the distribution of these block maxima is then fitted to the generalized extreme value distribution to estimate extreme quantiles. A more general and flexible alternative, the Peaks Over Threshold (POT) method, has gained more attention (Ferreira and De Haan, 2015). Unlike BMM, which considers only the maximum observation in each block, POT broadens its scope to encompass all sufficiently large values. To facilitate this, POT employs a threshold that isolates the exceedances, ensuring that they fall within the tail of the distribution. POT then models these exceedances with the GPD, allowing extreme quantiles to be estimated with increased flexibility and accuracy.

While BMM and POT have proven effective for estimating extreme quantiles, they share a limitation: they do not account for dependence on covariates. Consequently, these methods may not be suitable for scenarios where extreme events are influenced by complex interplays of covariates. To address this limitation, conditional quantile regression emerges as a powerful extension of classical regression, offering a comprehensive approach to modeling the conditional quantiles of a response given covariates (Koenker and Hallock, 2001). Consider the setting where the distribution of a response variable $Y$ depends on a set of covariates $\mathbf{X}\in\mathbb{R}^p$, where $p$ denotes the feature dimension. The conditional quantile is defined as $Q_Y(\tau|\mathbf{x})=F_Y^{-1}(\tau|\mathbf{x})$, where $F_Y^{-1}(\cdot|\mathbf{x})$ is the generalized inverse of the conditional distribution function of $Y$. When the quantile level $\tau$ is moderately high, $Q_Y(\tau|\mathbf{x})$ can be estimated efficiently via simple models such as linear conditional quantile regression. However, when the quantile $\tau=\tau_N$ satisfies

N(1-\tau_N)\rightarrow c,\quad c\in[0,\infty),

these estimation methods may introduce substantial biases.

In such cases, the combination of EVT and conditional quantile regression becomes essential. One common approach relies on the fact that, under mild conditions,

\lim_{t\rightarrow\infty}\frac{U(tz|\mathbf{x})-U(t|\mathbf{x})}{a(t)}=\frac{z^{\gamma}-1}{\gamma},\quad z>0,

where $\gamma$ is a real parameter, $U(\cdot|\mathbf{x})$ is the inverse function of $1/(1-F(\cdot|\mathbf{x}))$, and $a(\cdot)$ is a suitable positive function; see Li and Wang (2019), Wang and Li (2013), Wang et al (2012), and Xu et al (2022) for more details. In this paper, we focus on an alternative approach that employs the GPD to approximate the tail distribution. The resulting methods for extreme conditional quantile regression typically involve two key components: i) estimating the intermediate conditional quantile $Q_Y(\tau_0|\mathbf{x})$ at an intermediate level $\tau=\tau_0$ through classic conditional quantile regression, and ii) modeling the data exceeding the threshold $u=\widehat{Q}_Y(\tau_0|\mathbf{x})$ using the GPD. Notable examples of such methods can be found in the work of Hou et al (2022).

While existing extreme conditional quantile regression methods have achieved significant success, they exhibit certain limitations. These limitations become apparent when the response variable $Y$ is influenced by covariates $\mathbf{X}$, as the conditional quantile $Q_Y(\tau_0|\mathbf{x})$ then depends on $\mathbf{X}$. In other words, the threshold $u=Q_Y(\tau_0|\mathbf{x})$ varies across observations $\mathbf{x}$, resulting in different sets of exceedances used to fit the GPD. Consequently, both the scale and shape parameters of the GPD also become covariate-dependent. To estimate these parameters, one must carry out conditional quantile regression and extreme distribution fitting once per observation. This is computationally demanding, since both steps are time-consuming in most existing methods (Velthoen et al, 2019). More critically, practical scenarios often involve online streaming data, where observations continually arrive over time. In such cases, the dataset's size can become substantial or effectively unbounded (Granello and Wheaton, 2004). Under these circumstances, applying existing methods becomes impractical, as performing a GPD fit for every observation is computationally prohibitive.

To solve this issue, we propose an algorithm named EMI (Extrapolation via Mathematical programs with equilibrium constraints and B-spline Interpolation). Given finitely many offline historical observations, the algorithm proceeds in two stages. First, EMI employs linear conditional quantile regression to estimate an intermediate quantile $Q_Y(\tau_0|\mathbf{x})$, which serves as the threshold; the exceedances beyond this threshold are modeled by a GPD whose scale and shape parameters depend on the covariates. Notably, EMI formulates these regression and fitting tasks jointly as a bilevel programming problem, where the lower- and upper-level objectives are the check function and the log-likelihood, respectively. For each offline observation, the corresponding parameter estimates are obtained by solving this bilevel program with classic optimization methods, such as interior point or active set algorithms. Second, EMI applies B-spline interpolation to the scale and shape parameters, carried out on each feature dimension of the covariates. For online observations whose features fall outside the range of the offline data, EMI leverages the nearest available data points for estimation. This approach eliminates the need for repeated GPD fitting as new online data arrive. Numerical simulations and real-world analysis further illustrate the effectiveness of EMI in extreme conditional quantile estimation.

2 Methodology

2.1 Conditional Quantile Estimation via GPD Approximation

Conditional quantile estimation is a statistical technique for modeling and analyzing the conditional quantiles of a response variable with respect to one or more predictor variables. The techniques involved include quantile regression and various non-parametric approaches. The motivation behind conditional quantile estimation typically differs from that of traditional mean-based modeling: while traditional regression techniques, like linear regression, estimate the conditional mean of the response, conditional quantile methods estimate quantiles, which offer a more comprehensive view of the conditional distribution. This is particularly useful when dealing with non-normally distributed data or outliers, or when the interest lies in understanding variability at different points of the distribution.

We consider the setting where the response $Y\in\mathbb{R}$ depends on covariates $\mathbf{X}\in\mathbb{R}^p$. Denote the conditional distribution of $Y$ by $F_Y(\cdot|\mathbf{X})$ and the marginal distribution of $Y$ by $F_Y$. Suppose we have observations $(y_i,\mathbf{x}_i)$, $i=1,2,\ldots,N$, where $y_i$ is generated from $F_Y(\cdot|\mathbf{x}_i)$ given $\mathbf{X}=\mathbf{x}_i$. The conditional quantile of $Y$ given $\mathbf{X}=\mathbf{x}$ at level $\tau\in(0,1)$ is defined as $Q_Y(\tau|\mathbf{x})=F_Y^{-1}(\tau|\mathbf{x})$, where $F_Y^{-1}(\cdot|\mathbf{x})$ is the generalized inverse of the conditional distribution function of $Y$. The conditional quantile $Q_Y(\tau|\mathbf{x})$ solves the optimization problem

Q_Y(\tau|\mathbf{x})=\arg\min_{q\in\mathbb{R}}\mathbb{E}\left[\rho_{\tau}(Y-q)\,|\,\mathbf{X}=\mathbf{x}\right], \quad (2.1)

where $\rho_\tau(u)=(\tau-I(u\leq 0))u$ is the so-called check function (Koenker and Bassett Jr, 1978). Quantile regression is one of the most widely used techniques for conditional quantile estimation; it extends traditional regression by estimating conditional quantiles directly. We follow the standard parametric assumption that the true conditional quantile has the linear form

Q_Y(\tau|\mathbf{x})=\alpha(\tau)+\boldsymbol{\beta}^T(\tau)\mathbf{x}, \quad (2.2)

where $(\alpha(\tau),\boldsymbol{\beta}^T(\tau))$ is a parameter vector associated with the level $\tau$ and $(\alpha_0(\tau),\boldsymbol{\beta}_0^T(\tau))$ denotes the true parameter. With observations $(y_i,\mathbf{x}_i)$, $i=1,2,\ldots,N$, the parameter can be estimated by solving the empirical counterpart of (2.1):

(\widehat{\alpha}(\tau),\widehat{\boldsymbol{\beta}}^T(\tau))=\arg\min_{\alpha,\boldsymbol{\beta}}L^{(\text{QR})}_{N\tau}(\alpha,\boldsymbol{\beta})=\arg\min_{\alpha,\boldsymbol{\beta}}\sum_{i=1}^{N}\rho_{\tau}(y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}_i). \quad (2.3)

The conditional quantile of $Y$ given $\mathbf{X}=\mathbf{x}$ is then estimated as

\widehat{Q}_Y(\tau|\mathbf{x})=\widehat{\alpha}(\tau)+\widehat{\boldsymbol{\beta}}^T(\tau)\mathbf{x}. \quad (2.4)
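As an illustration of (2.3)-(2.4), here is a minimal sketch using the off-the-shelf QuantReg implementation in statsmodels; the synthetic data-generating lines are purely illustrative:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 3))                    # toy covariates
y = 0.5 + X @ np.full(3, 0.5) + rng.exponential(size=500)

tau0 = 0.8
Xd = sm.add_constant(X)                           # prepend the intercept column
fit = sm.QuantReg(y, Xd).fit(q=tau0)              # minimizes the check loss (2.3)
q_hat = Xd @ fit.params                           # estimate (2.4) at each x_i
```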

The empirical linear estimator of the conditional quantile works effectively when the quantile level $\tau=\tau_0$ is moderately high. However, as the quantile level $\tau=\tau_N$ approaches $1$ with $N(1-\tau_N)\rightarrow c$, $c\in[0,\infty)$, the expected number of observations exceeding $Q_Y(\tau_N|\mathbf{x})$ becomes insufficient. In such cases, empirical linear estimation of the conditional quantile is no longer feasible, necessitating extrapolation beyond the observed data. Extrapolation is the process of estimating or predicting values beyond the range of the available data. It is common practice in fields including finance, environmental science, insurance, and reliability analysis, where the primary interest lies in understanding and managing extreme events such as financial market crashes, natural disasters, rare disease occurrences, or equipment failures. To support such extrapolation, we turn to the GPD, which provides an excellent approximation of the tail of the conditional distribution of $Y$ given $\mathbf{X}=\mathbf{x}$. More specifically, given a large threshold $u$, $\mathbf{X}=\mathbf{x}$, and $z>0$, the exceedance $Y-u\mid Y>u$ given $\mathbf{X}=\mathbf{x}$ satisfies

P\left(Y>u+z\,|\,Y>u,\mathbf{X}=\mathbf{x}\right)=\begin{cases}\left(1+\frac{\gamma_0(\mathbf{x})z}{\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))}\right)^{-1/\gamma_0(\mathbf{x})},&\gamma_0(\mathbf{x})\neq 0,\\ \exp\left(-\frac{z}{\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))}\right),&\gamma_0(\mathbf{x})=0,\end{cases} \quad (2.5)

where $\tau_u(\mathbf{x})=F_Y(u|\mathbf{x})\in(0,1)$ and we require $1+\gamma_0(\mathbf{x})z/\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))>0$ when $\gamma_0(\mathbf{x})\neq 0$. Here $\gamma_0(\mathbf{x})$ is the extreme value index, and $\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))$ is a scale parameter tied to the quantile level of the threshold $u$. When $\gamma_0(\mathbf{x})<0$, the distribution of $Y$ has a finite right endpoint $u^*=u-\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))/\gamma_0(\mathbf{x})$, that is, $F_Y(y|\mathbf{x})=1$ for all $y\geq u^*$. When $\gamma_0(\mathbf{x})=0$, the exceedance $Y-u\mid Y>u$ given $\mathbf{X}=\mathbf{x}$ has an exponential distribution with mean $\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))$. When $\gamma_0(\mathbf{x})>0$, the exceedance has a heavy tail with finite moments up to order $1/\gamma_0(\mathbf{x})$. Note that for any higher threshold $u'>u$, the exceedance $Y-u'\mid Y>u'$ given $\mathbf{X}=\mathbf{x}$ again follows a generalized Pareto distribution with the same shape parameter $\gamma_0(\mathbf{x})$ but a different scale parameter.

Once the threshold $u$ is set, the next step is to estimate the parameters of the GPD. In the absence of covariates, a standard approach is maximum likelihood, which yields asymptotically normal estimators in the unconditional case when $\gamma>-1/2$ (Drees et al, 2004). To estimate the covariate-dependent parameters $\gamma_0(\mathbf{x})$ and $\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))$, we also use maximum likelihood. Specifically, given observations $(y_i,\mathbf{x}_i)$, $i=1,2,\ldots,N$, and the threshold $u$, we define the exceedances above the threshold as

z_i=\left(y_i-u\right)_+,\quad i=1,\ldots,N, \quad (2.6)

where $(a)_+=\max\{0,a\}$; in other words, $z_i=0$ whenever $y_i$ lies below the threshold. Denote the log-likelihood function by

l(\gamma,\sigma|z)=-\left\{\frac{1+\gamma}{\gamma}\log\left(1+\frac{\gamma z}{\sigma}\right)+\log\sigma\right\}I(\gamma\neq 0)-\left\{\frac{z}{\sigma}+\log\sigma\right\}I(\gamma=0), \quad (2.7)

and define the sample log-likelihood, computed only on the positive $z_i$, as

L^{(\text{MLE})}_N(\gamma,\sigma|u)=\sum_{i=1}^{N}l(\gamma,\sigma|z_i)\,I(z_i>0).

Then, to estimate $\gamma_0(\mathbf{x})$ and $\sigma_0(\mathbf{x};\tau_u(\mathbf{x}))$, we apply the censored maximum likelihood method by solving the optimization problem

\max_{\gamma,\sigma}L^{(\text{MLE})}_N(\gamma,\sigma|u). \quad (2.8)
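A minimal sketch of solving (2.8) for a fixed threshold, maximizing the log-likelihood (2.7) over the positive exceedances with scipy (we parameterize $\log\sigma$ to keep the scale positive; the initial values are arbitrary assumptions):

```python
import numpy as np
from scipy.optimize import minimize

def fit_gpd(z, gamma_init=0.1, sigma_init=1.0):
    """Maximize the sum of l(gamma, sigma | z_i) over exceedances z_i > 0, Eq. (2.8)."""
    z = z[z > 0]

    def neg_loglik(theta):
        gamma, log_sigma = theta
        sigma = np.exp(log_sigma)
        if abs(gamma) < 1e-8:                  # exponential branch of (2.7)
            return np.sum(z / sigma + log_sigma)
        t = 1.0 + gamma * z / sigma
        if np.any(t <= 0.0):                   # outside the GPD support
            return np.inf
        return np.sum((1.0 + gamma) / gamma * np.log(t) + log_sigma)

    res = minimize(neg_loglik, x0=[gamma_init, np.log(sigma_init)],
                   method="Nelder-Mead")
    return res.x[0], np.exp(res.x[1])          # (gamma_hat, sigma_hat)
```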

Inverting the distribution function in (2.5), we obtain an approximation of the quantile at a level $\tau_N>\tau_0$ by taking $u=Q_Y(\tau_0|\mathbf{x})$ and $z=Q_Y(\tau_N|\mathbf{x})-Q_Y(\tau_0|\mathbf{x})$. As a result,

Q_Y(\tau_N|\mathbf{x})=\begin{cases}Q_Y(\tau_0|\mathbf{x})+\frac{\sigma_0(\mathbf{x};\tau_0)}{\gamma_0(\mathbf{x})}\left(\tau_{0,N}^{\gamma_0(\mathbf{x})}-1\right),&\gamma_0(\mathbf{x})\neq 0,\\ Q_Y(\tau_0|\mathbf{x})+\sigma_0(\mathbf{x};\tau_0)\log\tau_{0,N},&\gamma_0(\mathbf{x})=0,\end{cases} \quad (2.9)

where for simplicity, we define

\tau_{0,N}=\frac{1-\tau_0}{1-\tau_N}. \quad (2.10)

We denote the solution of problem (2.8) by $\widehat{\gamma}(\mathbf{x})$ and $\widehat{\sigma}(\mathbf{x};\tau_0)$. Therefore, using plug-in estimators, the prediction of the extreme conditional quantile $Q_Y(\tau_N|\mathbf{x})$ is

\widehat{Q}_Y(\tau_0|\mathbf{x})=\widehat{\alpha}(\tau_0)+\widehat{\boldsymbol{\beta}}^T(\tau_0)\mathbf{x},\qquad \widehat{Q}_Y(\tau_N|\mathbf{x})=\begin{cases}\widehat{Q}_Y(\tau_0|\mathbf{x})+\frac{\widehat{\sigma}(\mathbf{x};\tau_0)}{\widehat{\gamma}(\mathbf{x})}\left(\tau_{0,N}^{\widehat{\gamma}(\mathbf{x})}-1\right),&\widehat{\gamma}(\mathbf{x})\neq 0,\\ \widehat{Q}_Y(\tau_0|\mathbf{x})+\widehat{\sigma}(\mathbf{x};\tau_0)\log\tau_{0,N},&\widehat{\gamma}(\mathbf{x})=0.\end{cases} \quad (2.11)
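The plug-in extrapolation (2.11) is then a few lines of code; a sketch, assuming estimates from the previous steps:

```python
import numpy as np

def extrapolate_quantile(q_tau0, gamma, sigma, tau0, tau_N):
    """Extreme quantile from the intermediate one via the GPD tail, Eq. (2.11)."""
    r = (1.0 - tau0) / (1.0 - tau_N)           # tau_{0,N} of Eq. (2.10)
    if abs(gamma) < 1e-8:                      # gamma = 0 branch
        return q_tau0 + sigma * np.log(r)
    return q_tau0 + sigma / gamma * (r ** gamma - 1.0)
```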

It is important to note that the estimations in (2.3) and (2.8) are usually implemented separately, as in stages two and three of the algorithm in [12]. However, the combination of (2.3) and (2.8) is essentially a bilevel programming problem when the threshold $u$ in (2.8) equals the conditional quantile $Q_Y(\tau|\mathbf{x})$ determined by (2.2) and (2.3). In the following, we apply bilevel programming algorithms to obtain the joint estimation in (2.3) and (2.8).

2.2 Stage 1: Offline Quantile Estimation via Bilevel Programming

In this section, we formulate the extreme quantile estimation via GPD approximation as a bilevel programming problem. Bilevel programming, a class of optimization problems characterized by their hierarchical structure, has found wide-ranging applications across diverse domains including economics, engineering, transportation, and environmental management (Sinha et al, 2017). These problems feature a unique interplay between two optimization tasks: an upper-level decision-maker seeks to optimize a given objective while factoring in the response of a lower-level decision-maker, who, in turn, optimizes their objective in response to the upper-level decisions. To illustrate this concept in our context, we combine (2.3) and (2.8) and obtain

\max_{\gamma,\sigma}\;L^{(\text{MLE})}_N(\gamma,\sigma\,|\,Q_Y(\tau_0|\mathbf{x}))\quad \text{s.t.}\quad \min_{\alpha,\boldsymbol{\beta}}\;L^{(\text{QR})}_{N\tau_0}(\alpha,\boldsymbol{\beta}), \quad (2.12)

where $\mathbf{x}$ and the prefixed level $\tau_0$ are given. The conditional quantile regression objective and the log-likelihood serve as the lower- and upper-level functions of (2.12), respectively.

The hierarchical structure of bilevel programming introduces unique challenges for existing solution methods. For instance, penalty function methods address bilevel optimization by solving a sequence of unconstrained problems, derived by introducing a penalty term that quantifies the extent of constraint violation. The penalty term typically relies on a parameter, taking the value zero for feasible solutions and a positive value (under minimization) for infeasible ones. However, many penalty function methods require the upper-level function to be convex, which conflicts with the non-convex nature of the log-likelihood function (2.7); indeed, the Hessian of $l(\gamma,\sigma|z)$ with respect to $(\gamma,\sigma)$ is not positive semi-definite. Consequently, penalty methods such as those proposed by Aiyoshi and Shimizu (1981) and Lv et al (2007) tend to be ineffective here. Additionally, nested evolutionary algorithms have gained popularity for bilevel problems (Sinha et al, 2014). These algorithms solve a lower-level optimization problem for each upper-level member and are typically employed in two main ways: the first combines an evolutionary algorithm at the upper level with classical algorithms at the lower level (Mathieu et al, 1994), while the second uses evolutionary algorithms at both levels (Angelo et al, 2013). While these nested strategies can be effective, they carry a significant computational burden, rendering them impractical for large-scale bilevel problems.

To handle problem (2.12) effectively, we first reformulate the lower-level function as

\min_{\alpha,\boldsymbol{\beta}}\sum_{y_i>\alpha+\boldsymbol{\beta}^T\mathbf{x}_i}\tau_0\left(y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}_i\right)+\sum_{y_i<\alpha+\boldsymbol{\beta}^T\mathbf{x}_i}(\tau_0-1)\left(y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}_i\right).

The above problem can be transformed into a linear program as

\min_{\alpha,\boldsymbol{\beta},\lambda_i^+,\lambda_i^-}\;\sum_{i=1}^{N}\tau_0\lambda_i^+ +(1-\tau_0)\lambda_i^-\quad \text{s.t.}\quad y_i=\alpha+\boldsymbol{\beta}^T\mathbf{x}_i+\lambda_i^+-\lambda_i^-,\;\; \lambda_i^+,\lambda_i^-\geq 0,\;\; i=1,\ldots,N, \quad (2.13)

where $\lambda_i^+,\lambda_i^-$, $i=1,2,\ldots,N$, are nonnegative slack variables capturing the positive and negative parts of the residuals; a small solver sketch is given below.
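As a sanity check on the reformulation, the lower-level LP (2.13) can be solved directly with an off-the-shelf solver; a minimal sketch using scipy.optimize.linprog, with variables ordered $(\alpha,\boldsymbol{\beta},\lambda^+,\lambda^-)$:

```python
import numpy as np
from scipy.optimize import linprog

def quantile_lp(y, X, tau0):
    """Solve the LP (2.13); returns (alpha, beta)."""
    N, p = X.shape
    # Objective: 0 on (alpha, beta), tau0 on lambda+, (1 - tau0) on lambda-.
    c = np.concatenate([np.zeros(1 + p), np.full(N, tau0), np.full(N, 1.0 - tau0)])
    # Equality rows: y_i = alpha + beta' x_i + lambda_i^+ - lambda_i^-.
    A_eq = np.hstack([np.ones((N, 1)), X, np.eye(N), -np.eye(N)])
    bounds = [(None, None)] * (1 + p) + [(0, None)] * (2 * N)
    res = linprog(c, A_eq=A_eq, b_eq=y, bounds=bounds, method="highs")
    return res.x[0], res.x[1:1 + p]
```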

With (2.13), the bilevel programming (2.12) can be rewritten as

\max_{\alpha,\boldsymbol{\beta},\gamma,\sigma}\;\sum_{i=1}^{N}l\left(\gamma,\sigma\,|\,y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}\right)I\left(y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}>0\right)\quad \text{s.t.}\quad \{\alpha,\boldsymbol{\beta},\lambda_i^+,\lambda_i^-\}\in\mathcal{A},\quad i=1,2,\ldots,N, \quad (2.14)

where $\mathcal{A}$ denotes the solution set of the linear program (2.13), and the exceedances in (2.7) are $z_i=(y_i-u)_+$ with threshold $u=\alpha+\boldsymbol{\beta}^T\mathbf{x}$. Next, we replace the constraint in problem (2.14) by its Karush-Kuhn-Tucker (KKT) conditions. These conditions manifest as Lagrangian and complementarity constraints, reducing the bilevel optimization problem to a single-level constrained optimization problem. Define the augmented Lagrangian (AL) function as

L_N^{(\text{AL})}=\sum_{i=1}^{N}\tau_0\lambda_i^+ +(1-\tau_0)\lambda_i^- +\sum_{i=1}^{N}u_i\left(y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}_i-\lambda_i^+ +\lambda_i^-\right)-\sum_{i=1}^{N}\left(t_i^+\lambda_i^+ +t_i^-\lambda_i^-\right),

where $t_i^+,t_i^-$, $i=1,2,\ldots,N$, are the Lagrange multipliers. Setting the derivatives of $L_N^{(\text{AL})}$ with respect to $\alpha,\boldsymbol{\beta},\lambda_i^+,\lambda_i^-$ to zero yields:

\nabla_{\alpha}L_N^{(\text{AL})}=0\;\Rightarrow\;\sum_{i=1}^{N}u_i=0,\qquad \nabla_{\boldsymbol{\beta}}L_N^{(\text{AL})}=0\;\Rightarrow\;\sum_{i=1}^{N}u_i\mathbf{x}_i=0,\qquad \nabla_{\lambda_i^+}L_N^{(\text{AL})}=0\;\Rightarrow\;\tau_0-u_i-t_i^+=0,\qquad \nabla_{\lambda_i^-}L_N^{(\text{AL})}=0\;\Rightarrow\;1-\tau_0+u_i-t_i^-=0. \quad (2.15)

Therefore, the problem (2.14) is reformulated as

\max_{\Theta_{\mathbf{x},\tau_0}}\;\sum_{i=1}^{N}l\left(\gamma,\sigma\,|\,y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}\right)I\left(y_i-\alpha-\boldsymbol{\beta}^T\mathbf{x}>0\right)
\text{s.t.}\quad y_i=\alpha+\boldsymbol{\beta}^T\mathbf{x}_i+\lambda_i^+-\lambda_i^-,\quad i=1,\ldots,N,
t_i^+,t_i^-,\lambda_i^+,\lambda_i^-\geq 0,\quad i=1,\ldots,N,
t_i^+\lambda_i^+=t_i^-\lambda_i^-=0,\quad i=1,\ldots,N,
\sum_{i=1}^{N}u_i=0,\qquad \sum_{i=1}^{N}u_i\mathbf{x}_i=0,
\tau_0-u_i-t_i^+=0,\quad i=1,\ldots,N,
1-\tau_0+u_i-t_i^-=0,\quad i=1,\ldots,N, \quad (2.16)

where $\Theta_{\mathbf{x},\tau_0}=(\gamma,\sigma,\alpha,\boldsymbol{\beta},\lambda_i^+,\lambda_i^-,t_i^+,t_i^-,u_i)$ denotes the set of decision variables for a prefixed $\mathbf{x}$ and level $\tau_0$. Problem (2.16) is a special case of a mathematical program with equilibrium constraints (MPEC), in which the constraints include the complementarity conditions

t_i^+\lambda_i^+=t_i^-\lambda_i^-=0,\quad i=1,2,\ldots,N.

MPECs play an important role in various fields, such as Stackelberg games in the economic sciences, and problem (2.16) can be solved efficiently via interior point or active set algorithms; for details on these algorithms we refer readers to [16]. In the following, we omit the Lagrange multipliers and slack variables from the solution and denote the solution of problem (2.16) by $\Theta_{\mathbf{x},\tau_0}=(\widetilde{\alpha}(\tau_0),\widetilde{\boldsymbol{\beta}}(\tau_0),\widetilde{\gamma}(\mathbf{x}),\widetilde{\sigma}(\mathbf{x};\tau_0))$ for simplicity.

2.3 Stage 2: Online Prediction via B-Spline Interpolation

Note that the solution to (2.16) is derived for a specific covariate $\mathbf{x}$, so the estimates $\widetilde{\gamma}(\mathbf{x})$ and $\widetilde{\sigma}(\mathbf{x};\tau_0)$ are inherently covariate-dependent. To see why, recall that $\widetilde{\gamma}(\mathbf{x})$ and $\widetilde{\sigma}(\mathbf{x};\tau_0)$ model the exceedances over a threshold estimated through (2.4) at a prefixed $\tau_0$. The estimated threshold is therefore intricately linked to the covariates $\mathbf{x}$; this, in turn, produces different exceedances for different values of $\mathbf{x}$, and ultimately distinct GPD models and covariate-dependent estimates $\widetilde{\gamma}(\mathbf{x})$ and $\widetilde{\sigma}(\mathbf{x};\tau_0)$.

In practice, we often encounter situations where we need to deal with an infinite number of future observations, in addition to the finite historical observations at hand. A typical example is the realm of online streaming data, continually generated from various sources (Gaber et al, 2005). Online streaming data finds applications across diverse industries, including finance, transportation, and e-commerce. Notably, streaming data volumes can be vast, occasionally approaching infinity. When it comes to the task of conditional quantile estimation with these infinite observations from online streaming data, applying (2.16) entails solving the problem an infinite number of times. This introduces a significant computational complexity that poses a substantial challenge.

To alleviate the computational burden of solving (2.16) for a potentially unbounded number of observations, we employ B-spline interpolation. To set notation, denote the finite offline observations by $(y_i^{(\text{off})},\mathbf{x}_i^{(\text{off})})$ for $i=1,\ldots,N^{(\text{off})}$ and the future online observations by $\mathbf{x}_i^{(\text{on})}$ for $i=1,\ldots,N^{(\text{on})}$, where we allow $N^{(\text{on})}$ to tend to infinity. We assume the covariates of the two observation sets are distinct. B-spline interpolation lets us solve (2.16) only $N^{(\text{off})}$ times while obtaining estimates for any number of online points, even as $N^{(\text{on})}$ approaches infinity. More specifically, suppose we have obtained $\Theta_{\mathbf{x}_i^{(\text{off})},\tau_0}=(\widetilde{\alpha}(\tau_0),\widetilde{\boldsymbol{\beta}}(\tau_0),\widetilde{\gamma}(\mathbf{x}_i^{(\text{off})}),\widetilde{\sigma}(\mathbf{x}_i^{(\text{off})};\tau_0))$ for $i=1,\ldots,N^{(\text{off})}$ by solving (2.16) on the offline observations. Notably, the estimates $\widetilde{\alpha}(\tau_0)$ and $\widetilde{\boldsymbol{\beta}}(\tau_0)$ depend only on the choice of $\tau_0$, which is assumed to be the same for offline and online observations in this paper. As a result, for $\mathbf{x}_i^{(\text{on})}$, the estimates of $\alpha$ and $\boldsymbol{\beta}$ remain those obtained offline. Therefore, we only need B-spline interpolation to estimate the extreme value index $\gamma(\mathbf{x})$ and the scale $\sigma(\mathbf{x};\tau_0)$ at $\mathbf{x}=\mathbf{x}_i^{(\text{on})}$.

B-splines, short for "basis splines," play a pivotal role in numerical analysis and computer-aided design. They provide a powerful method for representing and approximating complex curves and surfaces, as well as for solving various mathematical and engineering problems. B-splines are piecewise polynomials, and the positions where the pieces meet are known as knots, defined as a non-decreasing sequence of real numbers

\boldsymbol{\xi}=\left\{\xi_i\right\}_{i=1}^{M}=\left\{\xi_1\leq\xi_2\leq\cdots\leq\xi_M\right\},

where $M$ is the number of knots. Provided that $M\geq d+2$, we can define B-splines of degree $d\geq 0$ over the knots $\boldsymbol{\xi}$ through a recursive formulation. Specifically, the $j$-th B-spline $B_{j,d,\boldsymbol{\xi}}(x):\mathbb{R}\rightarrow\mathbb{R}$ is given by

B_{j,d,\boldsymbol{\xi}}(x):=\frac{x-\xi_j}{\xi_{j+d}-\xi_j}B_{j,d-1,\boldsymbol{\xi}}(x)+\frac{\xi_{j+d+1}-x}{\xi_{j+d+1}-\xi_{j+1}}B_{j+1,d-1,\boldsymbol{\xi}}(x),

where $\xi_j\leq\xi_{j+1}\leq\cdots\leq\xi_{j+d+1}$ are $d+2$ consecutive knots taken from the sequence $\boldsymbol{\xi}$; when $\xi_j=\xi_{j+d+1}$, $B_{j,d,\boldsymbol{\xi}}(x)$ is identically zero. The recursion starts from the degree-zero B-splines:

B_{i,0,\boldsymbol{\xi}}(x):=\begin{cases}1,&\text{if }x\in\left[\xi_i,\xi_{i+1}\right),\\ 0,&\text{otherwise}.\end{cases}

In simpler terms, a B-spline is built over the $M$ knots, which must be in non-decreasing order; it is nonzero only within the range defined by its first and last knots and vanishes elsewhere. Two B-splines may share some of their knots, but two B-splines defined over exactly the same knots are identical: a B-spline is uniquely characterized by its knot sequence. A minimal implementation of the recursion is sketched below.
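A direct transcription of the recursion into Python (0-based indices; the conventional 0/0 := 0 guards made explicit):

```python
def bspline_basis(j, d, xi, x):
    """Evaluate B_{j,d,xi}(x) by the Cox-de Boor recursion; xi is the knot list."""
    if d == 0:
        return 1.0 if xi[j] <= x < xi[j + 1] else 0.0
    left = right = 0.0
    if xi[j + d] > xi[j]:                       # skip 0/0 terms on repeated knots
        left = (x - xi[j]) / (xi[j + d] - xi[j]) * bspline_basis(j, d - 1, xi, x)
    if xi[j + d + 1] > xi[j + 1]:
        right = ((xi[j + d + 1] - x) / (xi[j + d + 1] - xi[j + 1])
                 * bspline_basis(j + 1, d - 1, xi, x))
    return left + right
```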

Algorithm 1 The proposed EMI algorithm
1: Input: observations $(y_i^{(\text{off})},\mathbf{x}_i^{(\text{off})})$, $i=1,2,\ldots,N^{(\text{off})}$; spline degree $d$; knot number $M$; moderate quantile level $\tau_0$; extreme quantile level $\tau_N$.
2: Output: conditional quantile estimate $\widetilde{Q}_Y(\tau_N|\mathbf{x}^{(\text{on})})$.
3: Extrapolation.
4: for $i=1,2,\ldots,N^{(\text{off})}$ do
5:     Solve the MPEC problem (2.16) with $\mathbf{x}_i^{(\text{off})}$ and obtain $(\widetilde{\alpha}(\tau_0),\widetilde{\boldsymbol{\beta}}(\tau_0),\widetilde{\gamma}(\mathbf{x}_i^{(\text{off})}),\widetilde{\sigma}(\mathbf{x}_i^{(\text{off})};\tau_0))$.
6: end for
7: Interpolation.
8: Compute the coefficients $\{\widetilde{t}_{\gamma,j}\}_{j=1}^{J}$ for $\gamma$ via Eq. (2.17).
9: Compute the coefficients $\{\widetilde{t}_{\sigma,j}\}_{j=1}^{J}$ for $\sigma$ via Eq. (2.20).
10: Prediction.
11: Compute $\widetilde{\gamma}(\mathbf{x}^{(\text{on})})$ by B-spline interpolation via Eq. (2.18).
12: Compute $\widetilde{\sigma}(\mathbf{x}^{(\text{on})};\tau_0)$ by B-spline interpolation via Eq. (2.19).
13: Compute $\widetilde{Q}_Y(\tau_N|\mathbf{x}^{(\text{on})})$ via Eq. (2.21).

B-splines provide a powerful method for approximating complex functions. When the covariates are multi-dimensional, we approximate the covariate-dependent index $\gamma(\mathbf{x})$ and scale $\sigma(\mathbf{x};\tau_0)$ as

\gamma(\mathbf{x})\approx\sum_{i=1}^{p}\sum_{j=1}^{J}B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\mathbf{x}(i)\right)t_{\gamma,j},\qquad \sigma(\mathbf{x};\tau_0)\approx\sum_{i=1}^{p}\sum_{j=1}^{J}B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\mathbf{x}(i)\right)t_{\sigma,j}.

Here, $\boldsymbol{\xi}^{(1)},\boldsymbol{\xi}^{(2)},\ldots,\boldsymbol{\xi}^{(p)}$ are $p$ sets of knots, each containing $M$ knots, and $\mathbf{x}(i)$ is assumed to fall within the range defined by the boundary knots $\xi_1^{(i)}$ and $\xi_M^{(i)}$. The choice of the degree $D$ is crucial: higher degrees tend to yield more accurate approximations but come with increased computational cost. We opt for $D=3$, the well-known cubic spline interpolation. To determine the coefficients $t_{\gamma,j}$, $j=1,\ldots,J$, we introduce, for $\mathbf{s}=[s_1,s_2,\ldots,s_N]^\top\in\mathbb{R}^N$, the B-spline matrix $\mathbf{B}_{d,\boldsymbol{\xi}}^{\mathbf{s}}\in\mathbb{R}^{N\times J}$ as

\mathbf{B}_{d,\boldsymbol{\xi}}^{\mathbf{s}}=\begin{bmatrix}B_{1,d,\boldsymbol{\xi}}(s_1)&B_{2,d,\boldsymbol{\xi}}(s_1)&\cdots&B_{J,d,\boldsymbol{\xi}}(s_1)\\ B_{1,d,\boldsymbol{\xi}}(s_2)&B_{2,d,\boldsymbol{\xi}}(s_2)&\cdots&B_{J,d,\boldsymbol{\xi}}(s_2)\\ \vdots&\vdots&\ddots&\vdots\\ B_{1,d,\boldsymbol{\xi}}(s_N)&B_{2,d,\boldsymbol{\xi}}(s_N)&\cdots&B_{J,d,\boldsymbol{\xi}}(s_N)\end{bmatrix}.

We further define the feature vector $\mathbf{v}_j$ as

\mathbf{v}_j=[\mathbf{x}_1^{(\text{off})}(j),\mathbf{x}_2^{(\text{off})}(j),\ldots,\mathbf{x}_{N^{(\text{off})}}^{(\text{off})}(j)]^\top\in\mathbb{R}^{N^{(\text{off})}},

where $j=1,2,\ldots,p$ and $\mathbf{x}_i^{(\text{off})}(j)$ denotes the $j$-th entry of observation $\mathbf{x}_i^{(\text{off})}$. Given the pairs $(\widetilde{\gamma}(\mathbf{x}_i^{(\text{off})}),\mathbf{x}_i^{(\text{off})})$, $i=1,2,\ldots,N^{(\text{off})}$, we determine the coefficients $t_{\gamma,j}$, $j=1,\ldots,J$, by solving the matrix equation

\left(\sum_{i=1}^{p}\mathbf{B}_{D,\boldsymbol{\xi}^{(i)}}^{\mathbf{v}_i}\right)\begin{bmatrix}t_{\gamma,1}\\ t_{\gamma,2}\\ \vdots\\ t_{\gamma,J}\end{bmatrix}=\begin{bmatrix}\widetilde{\gamma}(\mathbf{x}_1^{(\text{off})})\\ \widetilde{\gamma}(\mathbf{x}_2^{(\text{off})})\\ \vdots\\ \widetilde{\gamma}(\mathbf{x}_{N^{(\text{off})}}^{(\text{off})})\end{bmatrix}, \quad (2.17)

with the boundary knots satisfying

\xi_1^{(i)}=\min_{s}\,\mathbf{x}_s^{(\text{off})}(i),\qquad \xi_M^{(i)}=\max_{s}\,\mathbf{x}_s^{(\text{off})}(i),

for $i=1,\ldots,p$. We denote the solution of this equation by $\widetilde{t}_{\gamma,j}$, $j=1,\ldots,J$. Then, for a new observation $\mathbf{x}^{(\text{on})}$, we estimate the corresponding extreme value index by

\widetilde{\gamma}(\mathbf{x}^{(\text{on})})=\sum_{i=1}^{p}\sum_{j=1}^{J}B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\mathbf{x}^{(\text{on})}(i)\right)\widetilde{t}_{\gamma,j},

whenever $\xi_1^{(i)}\leq\mathbf{x}^{(\text{on})}(i)\leq\xi_M^{(i)}$. For coordinates $\mathbf{x}^{(\text{on})}(i)$ that fall outside the range of the boundary knots $\xi_1^{(i)}$ and $\xi_M^{(i)}$, the extreme value index is estimated as

\widetilde{\gamma}(\mathbf{x}^{(\text{on})})=\sum_{i=1}^{p}\sum_{j=1}^{J}B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\widetilde{\mathbf{x}}_{\gamma}(i)\right)\widetilde{t}_{\gamma,j},

where $\widetilde{\mathbf{x}}_{\gamma}(i)$ is the nearest offline feature value, obtained by minimizing the absolute difference:

\widetilde{\mathbf{x}}_{\gamma}(i)=\arg\min_{s}\left|\mathbf{x}^{(\text{on})}(i)-s\right|,\quad s\in\{\mathbf{x}_n^{(\text{off})}(i)\}_{n=1}^{N^{(\text{off})}}.

In summary, we estimate $\gamma(\mathbf{x})$ as

\widetilde{\gamma}(\mathbf{x}^{(\text{on})})=\sum_{i=1}^{p}\sum_{j=1}^{J}\begin{cases}B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\mathbf{x}^{(\text{on})}(i)\right)\widetilde{t}_{\gamma,j},&\text{if }\xi_1^{(i)}\leq\mathbf{x}^{(\text{on})}(i)\leq\xi_M^{(i)},\\ B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\widetilde{\mathbf{x}}_{\gamma}(i)\right)\widetilde{t}_{\gamma,j},&\text{otherwise}.\end{cases} \quad (2.18)
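Combining (2.17) and (2.18), the sketch below fits the coefficients offline and evaluates them online, reusing bspline_basis from the earlier sketch. The open-uniform knot construction and the least-squares solve of (2.17) are our assumptions; clamping an out-of-range coordinate to the offline range reproduces the nearest-point rule coordinate-wise:

```python
import numpy as np

D = 3                                              # cubic splines

def knot_vector(v, J):
    """Open uniform knots over the offline range of one feature (M = J + D + 1)."""
    inner = np.linspace(v.min(), v.max(), J + D + 1 - 2 * D)
    return np.concatenate([np.full(D, v.min()), inner, np.full(D, v.max())])

def basis_row(s, xi, J):
    return np.array([bspline_basis(j, D, xi, s) for j in range(J)])

def fit_coefficients(X_off, gamma_off, J=12):
    """Solve Eq. (2.17) (in the least-squares sense) for the coefficients t_gamma."""
    p = X_off.shape[1]
    knots = [knot_vector(X_off[:, i], J) for i in range(p)]
    B = sum(np.array([basis_row(s, knots[i], J) for s in X_off[:, i]])
            for i in range(p))
    t_gamma = np.linalg.lstsq(B, gamma_off, rcond=None)[0]
    return knots, t_gamma

def predict_gamma(x_on, knots, t_gamma):
    """Evaluate Eq. (2.18); out-of-range coordinates are clamped to the range."""
    J = len(t_gamma)
    total = 0.0
    for i, xi in enumerate(knots):
        s = min(max(x_on[i], xi[0]), xi[-1] - 1e-12)  # stay in half-open support
        total += basis_row(s, xi, J) @ t_gamma
    return total
```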

Similarly, the scale parameter $\sigma(\mathbf{x};\tau_0)$ is estimated by B-spline interpolation as

\widetilde{\sigma}(\mathbf{x}^{(\text{on})};\tau_0)=\sum_{i=1}^{p}\sum_{j=1}^{J}\begin{cases}B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\mathbf{x}^{(\text{on})}(i)\right)\widetilde{t}_{\sigma,j},&\text{if }\xi_1^{(i)}\leq\mathbf{x}^{(\text{on})}(i)\leq\xi_M^{(i)},\\ B_{j,D,\boldsymbol{\xi}^{(i)}}\left(\widetilde{\mathbf{x}}_{\sigma}(i)\right)\widetilde{t}_{\sigma,j},&\text{otherwise}.\end{cases} \quad (2.19)

In this expression, the coefficients $\widetilde{t}_{\sigma,j}$, $j=1,2,\ldots,J$, are determined by solving the matrix equation

\left(\sum_{i=1}^{p}\mathbf{B}_{D,\boldsymbol{\xi}^{(i)}}^{\mathbf{v}_i}\right)\begin{bmatrix}t_{\sigma,1}\\ t_{\sigma,2}\\ \vdots\\ t_{\sigma,J}\end{bmatrix}=\begin{bmatrix}\widetilde{\sigma}(\mathbf{x}_1^{(\text{off})};\tau_0)\\ \widetilde{\sigma}(\mathbf{x}_2^{(\text{off})};\tau_0)\\ \vdots\\ \widetilde{\sigma}(\mathbf{x}_{N^{(\text{off})}}^{(\text{off})};\tau_0)\end{bmatrix}, \quad (2.20)

and $\widetilde{\mathbf{x}}_{\sigma}(i)$ satisfies

\widetilde{\mathbf{x}}_{\sigma}(i)=\arg\min_{s}\left|\mathbf{x}^{(\text{on})}(i)-s\right|,\quad s\in\{\mathbf{x}_n^{(\text{off})}(i)\}_{n=1}^{N^{(\text{off})}}.

In this way, the extreme conditional quantile for an online observation $\mathbf{x}^{(\text{on})}$ is estimated as

\widetilde{Q}_Y(\tau_0|\mathbf{x}^{(\text{on})})=\widetilde{\alpha}(\tau_0)+\widetilde{\boldsymbol{\beta}}^T(\tau_0)\mathbf{x}^{(\text{on})},\qquad \widetilde{Q}_Y(\tau_N|\mathbf{x}^{(\text{on})})=\begin{cases}\widetilde{Q}_Y(\tau_0|\mathbf{x}^{(\text{on})})+\frac{\widetilde{\sigma}(\mathbf{x}^{(\text{on})};\tau_0)}{\widetilde{\gamma}(\mathbf{x}^{(\text{on})})}\left(\tau_{0,N}^{\widetilde{\gamma}(\mathbf{x}^{(\text{on})})}-1\right),&\widetilde{\gamma}(\mathbf{x}^{(\text{on})})\neq 0,\\ \widetilde{Q}_Y(\tau_0|\mathbf{x}^{(\text{on})})+\widetilde{\sigma}(\mathbf{x}^{(\text{on})};\tau_0)\log\tau_{0,N},&\widetilde{\gamma}(\mathbf{x}^{(\text{on})})=0.\end{cases} \quad (2.21)

We summarize our method in Algorithm 1.

It is important to emphasize that B-spline interpolation, while highly effective for approximating the scale and shape parameters, does introduce a degree of bias into the estimation. This trade-off is worthwhile because it drastically reduces computation time for online observations: in this setting there is no need to solve the censored maximum likelihood problem for GPD fitting on online data. Instead, we capitalize on the historical offline data to perform the approximation via B-spline interpolation. Notably, the solutions of equations such as (2.17) and (2.20) rely exclusively on the historical dataset, so they can be obtained well before new online data arrive. Once an online observation is received, we simply substitute the interpolated parameters into (2.21), drastically reducing the computational burden of online data processing.

3 Experiments

[Figure 1 about here. Panel columns, left to right: $N^{(\text{on})}=1000,p=10$; $N^{(\text{on})}=2000,p=10$; $N^{(\text{on})}=2000,p=20$; $N^{(\text{on})}=2000,p=30$.]
Figure 1: Boxplots of the ARSE for the compared methods at different extreme quantile levels. The top, middle, and bottom rows show the results for $\tau_N=0.99$, $\tau_N=0.995$, and $\tau_N=0.999$, respectively. The noise variable $\epsilon_i$ is drawn from the t-distribution.
[Figure 2 about here. Panel columns, left to right: $N^{(\text{on})}=1000,p=10$; $N^{(\text{on})}=2000,p=10$; $N^{(\text{on})}=2000,p=20$; $N^{(\text{on})}=2000,p=30$.]
Figure 2: Boxplots of the ARSE for the compared methods at different extreme quantile levels. The top, middle, and bottom rows show the results for $\tau_N=0.99$, $\tau_N=0.995$, and $\tau_N=0.999$, respectively. The noise variable $\epsilon_i$ is drawn from the GPD.

3.1 Setup

In this section, we conduct simulations to investigate the performance of the proposed EMI. We generate offline data $(y_i^{(\text{off})},\mathbf{x}_i^{(\text{off})})$, $i=1,2,\ldots,N^{(\text{off})}$, and online data $(y_i^{(\text{on})},\mathbf{x}_i^{(\text{on})})$, $i=1,2,\ldots,N^{(\text{on})}$, in the same way. Specifically, we draw the covariates $\mathbf{x}_i\in\mathbb{R}^p$ independently from the standard uniform distribution. The response $y_i$ is generated by

y_i=\alpha+\boldsymbol{\beta}^\top\mathbf{x}_i+\epsilon_i,

where we consider two models for the univariate noise $\epsilon_i$:

• Model 1: $\epsilon_i$ is the absolute value of a random variable drawn from the t-distribution with location zero and degrees of freedom $\psi_i$ satisfying

    \epsilon_i\sim\mathrm{T}(\psi_i),\quad \psi_i=\exp\left(\boldsymbol{\eta}_1^\top\mathbf{x}_i\right).

• Model 2: $\epsilon_i$ is the absolute value of a random variable drawn from the GPD with location $1$, scale $0.25$, and shape parameter $\phi_i$ satisfying

    \epsilon_i\sim\mathrm{GPD}(1,0.25,\phi_i),\quad \phi_i=1/(\boldsymbol{\eta}_2^\top\mathbf{x}_i).

Note that both models are heavy-tailed, and the degrees of freedom of the t-distribution and the shape parameters of the GPD depend on the covariates. We set $\boldsymbol{\eta}_1=[0.2,\ldots,0.2]^\top\in\mathbb{R}^p$ and $\boldsymbol{\eta}_2=[0.6,\ldots,0.6]^\top\in\mathbb{R}^p$. Without loss of generality, we set $\alpha=0.5$ and $\boldsymbol{\beta}=[0.5,\ldots,0.5]^\top\in\mathbb{R}^p$. We aim to estimate the conditional quantile $Q_Y(\tau_N|\mathbf{x})$ at extreme probability levels $\tau_N\in\{0.99,0.995,0.999\}$ with $N^{(\text{off})}=1000$. The intermediate quantile level $\tau_0$ is set to $0.8$.
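For reproducibility, a sketch of the Model 1 data-generating process (seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
N_off, p = 1000, 10
alpha, beta = 0.5, np.full(p, 0.5)
eta1 = np.full(p, 0.2)

X_off = rng.uniform(size=(N_off, p))            # covariates ~ U(0, 1)
psi = np.exp(X_off @ eta1)                      # covariate-dependent df
eps = np.abs(rng.standard_t(df=psi))            # Model 1 noise: |T(psi_i)|
y_off = alpha + X_off @ beta + eps
```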

To assess performance, we conduct $100$ replications and calculate the average relative squared prediction error (ARSE) on the online data $(y_i^{(\text{on})},\mathbf{x}_i^{(\text{on})})$, $i=1,2,\ldots,N^{(\text{on})}$. For the $j$-th replication, $\text{ARSE}(\tau_N,j)$ is defined as

\text{ARSE}\left(\tau_N,j\right)=\frac{1}{N^{(\text{on})}}\sum_{i=1}^{N^{(\text{on})}}\left(\frac{\widetilde{Q}_Y(\tau_N|\mathbf{x}_i^{(\text{on})})}{Q_Y(\tau_N|\mathbf{x}_i^{(\text{on})})}-1\right)^2.

Here $\widetilde{Q}_Y(\tau_N|\mathbf{x}_i^{(\text{on})})$ is the estimator computed according to Eq. (2.21) and $Q_Y(\tau_N|\mathbf{x}_i^{(\text{on})})$ is the true conditional quantile. For comparison, we apply the traditional method (denoted Linear), which solves

\min_{\alpha,\boldsymbol{\beta}}\sum_{i=1}^{N^{(\text{on})}}\rho_{\tau}(y_i^{(\text{on})}-\alpha-\boldsymbol{\beta}^T\mathbf{x}_i^{(\text{on})}), \quad (3.1)

with $\tau=\tau_N$. Linear estimates the conditional quantile as a linear function of $\mathbf{x}_i^{(\text{on})}$:

\widetilde{Q}_Y'(\tau_N|\mathbf{x}_i^{(\text{on})})=\widetilde{\alpha}'(\tau_N)+\widetilde{\boldsymbol{\beta}}'^{\top}(\tau_N)\mathbf{x}_i^{(\text{on})},\quad i=1,\ldots,N^{(\text{on})},

where $\widetilde{\alpha}'(\tau_N)$ and $\widetilde{\boldsymbol{\beta}}'(\tau_N)$ denote the solutions to (3.1).
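Under Model 1 the true conditional quantile needed in the ARSE has a closed form: since the noise is $|T_{\psi_i}|$, its $\tau$-quantile is the $(1+\tau)/2$ quantile of the t-distribution. A sketch (helper names ours):

```python
import numpy as np
from scipy.stats import t as student_t

def true_quantile_model1(X, tau, alpha=0.5, beta=0.5, eta=0.2):
    """Q_Y(tau | x) = alpha + beta' x + quantile of |T_psi| under Model 1."""
    p = X.shape[1]
    psi = np.exp(X @ np.full(p, eta))
    # |T| <= q iff T in [-q, q], so the tau-quantile of |T| is the
    # (1 + tau)/2 quantile of the t-distribution with df psi.
    return alpha + X @ np.full(p, beta) + student_t.ppf((1.0 + tau) / 2.0, df=psi)

def arse(q_tilde, q_true):
    """Average relative squared error of one replication."""
    return np.mean((q_tilde / q_true - 1.0) ** 2)
```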

[Figure 3 about here. Panels: (a) offline data size; (b) choice of threshold $\tau_0$.]
Figure 3: Boxplots of the ARSE with $N^{(\text{on})}=1000$, $p=30$. The noise variable $\epsilon_i$ is drawn from the t-distribution. (a) ARSE under different offline data sizes with $\tau_N=0.999$. (b) ARSE under different choices of the threshold $\tau_0$ with $\tau_N=0.99$.

3.2 Estimation Results

The comparison of the proposed EMI with the Linear approach is presented in Fig. 1 and Fig. 2. For each extreme quantile level $\tau_N$, we generated data with various combinations of online data size $N^{(\text{on})}$ and feature dimension $p$. We observe that as $\tau_N$ approaches $1$, the ARSE of both methods grows. This is expected since observations above the $\tau_N$ quantile become scarce. However, EMI clearly outperforms the Linear approach, with significantly lower ARSE. This superior performance can be attributed to two factors. First, EMI leverages the GPD to approximate the tail distribution, enabling extrapolation beyond the observation range. Second, when sufficient offline historical data are available, B-spline interpolation effectively captures the relationship between $\mathbf{x}$ and the covariate-dependent parameters, yielding accurate interpolation.

To further analyze the impact of the offline data size on the estimation, we conduct experiments with offline data sizes ranging from $500$ to $2000$, as shown in Fig. 3(a). The results reveal that, as the offline data size increases, the ARSE of EMI gradually decreases while exhibiting reduced variance. This trend arises because a larger offline sample improves the B-spline interpolation, particularly when $\mathbf{x}_i^{(\text{on})}$ falls beyond the observation range. Additionally, we examine the influence of the threshold $\tau_0$ on the estimation: Fig. 3(b) shows that across the considered threshold choices, EMI consistently yields better estimates than the Linear approach.

[Figure 4 about here. Panels: (a) $\tau_N=0.99$; (b) $\tau_N=0.995$; (c) $\tau_N=0.999$.]
Figure 4: Boxplots of the ARSE for different methods at different extreme quantile levels. The noise variable $\epsilon_i$ is drawn from the t-distribution. We set $N^{(\text{on})}=2000$ and $p=10$.
[Figure 5 about here. Panels: (a) $\tau_N=0.99$; (b) $\tau_N=0.995$; (c) $\tau_N=0.999$.]
Figure 5: Boxplots of the ARSE for different methods at different extreme quantile levels. The noise variable $\epsilon_i$ is drawn from the t-distribution. We set $N^{(\text{on})}=2000$ and $p=20$.
[Figure 6 about here. Panels: (a) $\tau_N=0.99$; (b) $\tau_N=0.995$; (c) $\tau_N=0.999$.]
Figure 6: Boxplots of the ARSE for different methods at different extreme quantile levels. The noise variable $\epsilon_i$ is drawn from the t-distribution. We set $N^{(\text{on})}=2000$ and $p=30$.
[Figure 7 about here. Panels: (a)-(c) $N^{(\text{on})}=2000$ with $p=10,20,30$; (d) $N^{(\text{on})}=2000$, $p=10$.]
Figure 7: Boxplots of the ARSE for different methods. We set $N^{(\text{off})}=1000$. (a)-(c) ARSE under different covariate dimensions with $\tau_N=0.99$. (d) ARSE with $\tau_N=0.995$.
[Figure 8 about here. Panels: (a) $N^{(\text{on})}=2000$, $p=20$; (b) $N^{(\text{on})}=2000$, $p=30$.]
Figure 8: Boxplots of the ARSE for different methods with $\tau_N=0.995$. We set $N^{(\text{off})}=1000$.
[Figure 9 about here. Panels: (a)-(c) $N^{(\text{on})}=2000$ with $p=10,20,30$.]
Figure 9: Boxplots of the ARSE for different methods under different covariate dimensions. We set $N^{(\text{off})}=1000$ and $\tau_N=0.995$.

3.3 Bias Analysis

We now study the bias introduced by the B-spline interpolation. For comparison, for each online observation $\mathbf{x}_i^{(\text{on})}$ we skip the B-spline interpolation and instead fit the tail distribution with a GPD directly, estimating the shape and scale parameters by solving the bilevel program

\max_{\gamma,\sigma}\;L^{(\text{MLE})}_N(\gamma,\sigma\,|\,Q_Y(\tau_0|\mathbf{x}_i^{(\text{on})}))\quad \text{s.t.}\quad \min_{\alpha,\boldsymbol{\beta}}\;L^{(\text{QR})}_{N\tau_0}(\alpha,\boldsymbol{\beta}). \quad (3.2)

Denote the solutions to problem (3.2) as (ᾱ(τ_0), β̄(τ_0), γ̄(x_i^(on)), σ̄(x_i^(on); τ_0)). Then the conditional quantile is estimated as

\begin{split}
\bar{Q}_{Y}(\tau_{0}\,|\,\mathbf{x}^{(\text{on})}) &= \bar{\alpha}(\tau_{0})+\bar{\boldsymbol{\beta}}^{T}(\tau_{0})\,\mathbf{x}^{(\text{on})},\\
\bar{Q}_{Y}(\tau_{N}\,|\,\mathbf{x}^{(\text{on})}) &=
\begin{cases}
\bar{Q}_{Y}(\tau_{0}\,|\,\mathbf{x}^{(\text{on})})+\dfrac{\bar{\sigma}(\mathbf{x}^{(\text{on})},\tau_{0})}{\bar{\gamma}(\mathbf{x}^{(\text{on})})}\left(\tau_{0,N}^{\bar{\gamma}(\mathbf{x}^{(\text{on})})}-1\right), & \bar{\gamma}(\mathbf{x}^{(\text{on})})\neq 0,\\[6pt]
\bar{Q}_{Y}(\tau_{0}\,|\,\mathbf{x}^{(\text{on})})+\bar{\sigma}(\mathbf{x}^{(\text{on})},\tau_{0})\log\tau_{0,N}, & \bar{\gamma}(\mathbf{x}^{(\text{on})})=0.
\end{cases}
\end{split} \qquad (3.3)

We denote the above method as EMI w/o. Interp., that is, EMI without B-spline interpolation. We conduct performance tests under various extreme quantile levels τ_N, and the results are summarized in Fig. 4, Fig. 5, and Fig. 6. Notably, EMI performs slightly worse than EMI w/o. Interp., which coincides with our expectations, since the interpolation introduces an additional approximation. However, it is important to emphasize that B-spline interpolation allows us to bypass the costly GPD fitting for every online observation, which makes the method far more practical.
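For concreteness, here is a minimal sketch of this baseline under stated assumptions: the lower level of (3.2) is solved as a linear quantile regression at τ_0, the upper level as a GPD maximum likelihood fit on the exceedances, and (3.3) is applied with τ_{0,N} read as (1 - τ_0)/(1 - τ_N). The simulated data and the constant (covariate-free) GPD parameters are simplifications for illustration, not the paper's implementation.

```python
# A sketch of EMI w/o. Interp. for one online point: quantile regression at
# tau0 (lower level of (3.2)), GPD MLE on the exceedances (upper level),
# then the extrapolation (3.3). Data and constants are assumptions.
import numpy as np
import statsmodels.api as sm
from scipy.stats import genpareto

rng = np.random.default_rng(0)
N, p, tau0, tauN = 1000, 5, 0.9, 0.999
X = rng.normal(size=(N, p))
y = 1.0 + X @ np.ones(p) + rng.standard_t(df=3, size=N)  # heavy-tailed noise

Xd = sm.add_constant(X)
coef = sm.QuantReg(y, Xd).fit(q=tau0).params              # Q_Y(tau0 | x)
exceed = y - Xd @ coef
exceed = exceed[exceed > 0]                               # exceedances

gamma, _, sigma = genpareto.fit(exceed, floc=0)           # shape, loc=0, scale

x_on = rng.normal(size=p)                                 # a new online point
q0 = coef @ np.r_[1.0, x_on]
r = (1.0 - tau0) / (1.0 - tauN)                           # tau_{0,N} (assumed)
q_ext = (q0 + sigma / gamma * (r**gamma - 1.0)
         if abs(gamma) > 1e-8
         else q0 + sigma * np.log(r))
print(f"gamma = {gamma:.3f}, sigma = {sigma:.3f}, Q({tauN}|x) = {q_ext:.3f}")
```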

Figure 10: Illustrations of online scenarios at (a) τ_N = 0.99, (b) τ_N = 0.995, (c) τ_N = 0.999. We set N^(on) = 1000 and p = 10.
Figure 11: Illustrations of online scenarios at (a) τ_N = 0.99, (b) τ_N = 0.995, (c) τ_N = 0.999. We set N^(on) = 1000 and p = 20.

3.4 Results of Online Data

In our final scenario, we consider a more realistic situation in which online data arrive incrementally, one point at a time. At each time step t, we observe a new pair (y_t^(on), x_t^(on)). The observation set at time t, denoted (𝒴^(t), 𝒳^(t)), is the union of the new observation (y_t^(on), x_t^(on)) and the observation set from the previous time step t-1:

\begin{split}
\mathcal{Y}^{(t)} &= \mathcal{Y}^{(t-1)}\cup\{y_{t}^{(\text{on})}\},\\
\mathcal{X}^{(t)} &= \mathcal{X}^{(t-1)}\cup\{\mathbf{x}_{t}^{(\text{on})}\},
\end{split}

for t = 1, …, T. The initial observation set at time t = 0 consists of the offline data only, i.e., 𝒴^(0) = {y_i^(off)} and 𝒳^(0) = {x_i^(off)}. For comparison, the Linear method solves the following linear quantile regression problem at each time step t:

\min_{\alpha,\boldsymbol{\beta}} \sum_{(y,\mathbf{x})\in(\mathcal{Y}^{(t)},\mathcal{X}^{(t)})} \rho_{\tau_{N}}\!\left(y-\alpha-\boldsymbol{\beta}^{T}\mathbf{x}\right).

On the other hand, the EMI w/o. Interp. method applies extrapolation based on the GPD and solves the following optimization problem at each time step tt:

\begin{split}
\min_{\gamma,\sigma}\quad & -\sum_{z_{i}\in\mathcal{Z}^{(t)}} l(\gamma,\sigma\,|\,z_{i})\\
\text{s.t.}\quad & \min_{\alpha,\boldsymbol{\beta}} \sum_{(y,\mathbf{x})\in(\mathcal{Y}^{(t)},\mathcal{X}^{(t)})} \rho_{\tau_{0}}\!\left(y-\alpha-\boldsymbol{\beta}^{T}\mathbf{x}\right).
\end{split} \qquad (3.4)

Here, 𝒵^(t) is the set of transformed observations z_i = (y_i - α̃″(τ_0) - β̃″(τ_0)^T x_i)_+ for each pair (y_i, x_i) ∈ (𝒴^(t), 𝒳^(t)), where α̃″(τ_0) and β̃″(τ_0) solve the lower-level problem in (3.4). The results are shown in Fig. 7, Fig. 8, and Fig. 9. Comparing the EMI method with the Linear method, we observe that the performance of EMI is consistently better. Compared to EMI w/o. Interp., the performance of EMI is slightly worse, which is consistent with the previous experimental results. In Fig. 10 and Fig. 11, we present visualizations of extreme conditional quantile estimates for online data streams; to keep the presentation concise, we select only a portion of the data. These figures demonstrate EMI's effectiveness in capturing changes in the conditional quantiles.
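A runnable toy of this incremental protocol, under illustrative assumptions about the data-generating process and sizes, is sketched below: each step appends the new pair to the observation set, refits the Linear baseline at level τ_N on all data seen so far, and rebuilds the exceedance set 𝒵^(t) for the GPD refit of (3.4).

```python
# A toy of the online loop: grow (Y^(t), X^(t)), refit the Linear baseline,
# then solve (3.4) sequentially (tau0-level regression, GPD MLE on Z^(t)).
# All sizes and the data-generating process are illustrative assumptions.
import numpy as np
import statsmodels.api as sm
from scipy.stats import genpareto

rng = np.random.default_rng(1)
tau0, tauN, p = 0.9, 0.995, 3
X = list(rng.normal(size=(500, p)))                      # offline covariates
Y = list(1.0 + np.asarray(X) @ np.ones(p) + rng.standard_t(3, 500))

for t in range(1, 4):                                    # a few online steps
    x_t = rng.normal(size=p)
    y_t = 1.0 + x_t @ np.ones(p) + rng.standard_t(3)
    X.append(x_t); Y.append(y_t)                         # grow the observation set
    Xd, Yd = sm.add_constant(np.asarray(X)), np.asarray(Y)
    # Linear baseline: refit the tau_N-level quantile regression from scratch.
    q_lin = sm.QuantReg(Yd, Xd).fit(q=tauN).params @ np.r_[1.0, x_t]
    # EMI w/o. Interp.: lower level at tau0, then rebuild Z^(t) and refit GPD.
    coef = sm.QuantReg(Yd, Xd).fit(q=tau0).params
    Z = Yd - Xd @ coef
    Z = Z[Z > 0]                                         # positive parts only
    g, _, s = genpareto.fit(Z, floc=0)
    print(f"t={t}: n={Yd.size}, |Z|={Z.size}, Linear={q_lin:.2f}, gamma={g:.2f}")
```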

4 Application

Table 1: Summary statistics of covariates X.

| X   | Name                      | Mean   | Standard deviation | Skewness | Min.    | Max.   |
|-----|---------------------------|--------|--------------------|----------|---------|--------|
| x_1 | Market return             | 0.16   | 2.23               | -0.29    | -15.35  | 12.00  |
| x_2 | Three-month yield change  | -0.22  | 24.43              | -0.57    | -182.00 | 192.00 |
| x_3 | Equity volatility         | 0.83   | 0.43               | 3.06     | 0.28    | 4.70   |
| x_4 | Credit spread change      | -0.02  | 8.86               | 0.74     | -48.00  | 60.00  |
| x_5 | Term spread change        | 0.15   | 20.66              | 0.09     | -168.00 | 146.00 |
| x_6 | TED spread                | 121.52 | 94.42              | 1.74     | 6.34    | 591.00 |
| x_7 | Real estate excess return | -0.08  | 2.38               | 0.04     | -14.49  | 16.58  |
Table 2: Summary statistics of Y.

| Y    | Company              | Start time        | End time            |
|------|----------------------|-------------------|---------------------|
| S&P  | S&P 500              | January 1st, 1971 | December 28th, 2002 |
| AAPL | Apple Inc.           | January 1st, 1981 | December 28th, 2002 |
| BA   | The Boeing Company   | January 1st, 1971 | December 28th, 2002 |
| JPM  | JPMorgan Chase & Co. | January 1st, 1981 | December 28th, 2002 |

In this section, we apply our algorithm EMI to estimate the extreme conditional quantiles of a real dataset consisting of the daily stock prices of the S&P 500 index and three companies (see Table 2). Weekly returns are derived from the daily observations. The variables employed in our analysis include ret (weekly return), marketret (weekly market return), yield3m (three-month yield change), mktsd (equity volatility), credit (credit spread change), term (term spread change), ted (short-term TED spread), and housing (real estate excess return), as documented in Adrian and Brunnermeier (2016). More specifically, the variables are defined as:

• Y: The weekly loss, namely the negative weekly return;

• x_1: The weekly market return of the S&P 500;

• x_2: The change in the three-month yield from the Federal Reserve Board's H.15 release. We use the change in the three-month Treasury bill rate because we find that the change, not the level, is most significant in explaining the tails of financial sector market-valued asset returns;

• x_3: Equity volatility, computed as the 22-day rolling standard deviation of the daily CRSP equity market return;

• x_4: The change in the credit spread between Moody's Baa-rated bonds and the ten-year Treasury rate from the Federal Reserve Board's H.15 release;

• x_5: The change in the slope of the yield curve, measured by the spread between the composite long-term bond yield and the three-month bill rate obtained from the Federal Reserve Board's H.15 release;

• x_6: A short-term TED spread, defined as the difference between the three-month LIBOR rate and the three-month secondary market Treasury bill rate. This spread measures short-term funding liquidity risk. We use the three-month LIBOR rate available from the British Bankers' Association and obtain the three-month Treasury rate from the Federal Reserve Bank of New York;

• x_7: The weekly real estate sector return in excess of the market financial sector return (from real estate companies with SIC codes 65-66).

Figure 12: Summary of the weekly returns: (a) S&P, (b) AAPL, (c) BA, (d) JPM.
Figure 13: Results for S&P. The green curve is the estimated conditional quantile, and the blue curve is the true response.

We present a summary of the variables in Table 1. At time t, we assume that the response is related to the covariates at time t-1. As a result, the quantile of the response Y_t depends on the covariate X_{t-1} in a linear form:

Q_{Y_{t}}(\tau\,|\,\mathbf{X}_{t-1})=\alpha(\tau)+\boldsymbol{\beta}^{T}(\tau)\,\mathbf{X}_{t-1}.

In our data analysis, our primary focus is on estimating the extreme conditional quantile at level τ = 0.95 for the S&P 500 and τ = 0.99 for the three companies using EMI. Initially, we conduct a brief visual assessment of the suitability of the heavy-tail assumption for our data. As depicted in Fig. 12, the returns exhibit heavy tails on both ends. Since our emphasis lies in scrutinizing significant losses, we direct the analysis toward the upper tail of the weekly losses, that is, the extremely large losses. For the S&P 500 and each company, we follow the setting in Section 3.4 and estimate the extreme conditional quantile of future weekly returns from historical data. To be specific, let 𝒴 = {y_i}_{i=1}^{T} and 𝒳 = {x_i}_{i=1}^{T}, where T represents the total time span and (y_i, x_i) denotes the i-th weekly observation. At each time point t = 521, 525, …, T, the observation set is defined as:

\mathcal{Y}^{(t)}=\{y_{i}\}_{i=t-520}^{t},\qquad \mathcal{X}^{(t)}=\{\mathbf{x}_{i}\}_{i=t-520}^{t}.

In essence, we leverage the data from the past ten years to estimate the extreme conditional quantile of the subsequent four weekly returns.
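The sketch below mirrors this rolling-window protocol; weekly_loss, covariates, and emi_quantile are hypothetical placeholders (the last stands in for the full EMI estimate described earlier), not the actual dataset or implementation.

```python
# A sketch of the rolling-window scheme: a trailing ten-year (521-week)
# window is refit every four weeks and used to estimate the extreme
# conditional quantile of the next four weekly losses. All arrays and the
# estimator below are hypothetical placeholders.
import numpy as np

rng = np.random.default_rng(0)
T = 1200                                     # hypothetical number of weeks
weekly_loss = rng.standard_t(3, T)           # placeholder for the loss series
covariates = rng.normal(size=(T, 7))         # placeholder for Table 1 variables

def emi_quantile(y_hist, x_hist, x_new, tau):
    """Placeholder for the EMI estimate of Q_Y(tau | x_new)."""
    return np.quantile(y_hist, tau)          # unconditional stand-in

estimates = {}
for t in range(521, T, 4):                   # t = 521, 525, ...
    y_win = weekly_loss[t - 520:t + 1]       # Y^(t): trailing 521-week window
    x_win = covariates[t - 520:t + 1]
    for s in range(t + 1, min(t + 5, T)):    # the next four weeks
        # the response at week s depends on the covariates at week s - 1
        estimates[s] = emi_quantile(y_win, x_win, covariates[s - 1], 0.99)
```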

Figure 14: Results for AAPL. The green curve is the estimated conditional quantile, and the blue curve is the true response.
Figure 15: Results for BA. The green curve is the estimated conditional quantile, and the blue curve is the true response.
Figure 16: Results for JPM. The green curve is the estimated conditional quantile, and the blue curve is the true response.

We summarize our results in Fig. 13 (S&P), Fig. 14 (AAPL), Fig. 15 (BA), and Fig. 16 (JPM). We observe that the S&P 500 index exhibits the lowest conditional quantiles, indicating that it has the most robust capability to resist risk. Among the three companies studied, the conditional quantile of the losses of AAPL is much larger than those of BA and JPM, probably reflecting the fact that technology companies suffer more risk during crisis periods. In conclusion, our empirical analysis demonstrates that the proposed extreme conditional quantile estimator serves as a good risk measure, particularly when considering loss values in the tail regions.

5 Conclusion

In this paper, we introduce an algorithm called EMI designed for estimating extreme conditional quantiles. Operating with finite offline observations, our approach approximates the exceedances above a predetermined threshold using the GPD. We formulate the extrapolation process as a bilevel programming problem, making it amenable to efficient solution through conventional optimization techniques. To enhance flexibility and computational efficiency, we incorporate B-spline interpolation for the covariate-dependent parameters. This interpolation enables EMI to estimate conditional quantiles for a potentially infinite stream of online data, avoiding the computational cost of repeated GPD fitting. Empirical experiments demonstrate that the proposed EMI consistently outperforms the linear quantile regression baseline.

6 Acknowledgment

Yanxi Hou’s work was supported by the MOE Laboratory for National Development and Intelligent Governance, Fudan University, the National Natural Science Foundation of China Grant 72171055, and the Natural Science Foundation of Shanghai Grant 20ZR1403900.

References

  • Adrian and Brunnermeier [2016] Adrian T, Brunnermeier MK (2016) CoVaR. American Economic Review 106(7):1705–1741
  • Aiyoshi and Shimizu [1981] Aiyoshi E, Shimizu K (1981) Hierarchical decentralized systems and its new solution by a barrier method. IEEE Transactions on Systems, Man and Cybernetics 11(6):444–449
  • Allen et al [2012] Allen L, Bali TG, Tang Y (2012) Does systemic risk in the financial sector predict future economic downturns? The Review of Financial Studies 25(10):3000–3036
  • Angelo et al [2013] Angelo JS, Krempser E, Barbosa HJ (2013) Differential evolution for bilevel programming. In: 2013 IEEE Congress on Evolutionary Computation, pp 470–477
  • Balkema and De Haan [1974] Balkema AA, De Haan L (1974) Residual life time at great age. The Annals of Probability 2(5):792–804
  • Drees et al [2004] Drees H, Ferreira A, De Haan L (2004) On maximum likelihood estimation of the extreme value index. Annals of Applied Probability 14(3):1179–1201
  • Ferreira and De Haan [2015] Ferreira A, De Haan L (2015) On the block maxima method in extreme value theory: PWM estimators. The Annals of Statistics 43(1):276–298
  • Gaber et al [2005] Gaber MM, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: A review. ACM Sigmod Record 34(2):18–26
  • Granello and Wheaton [2004] Granello DH, Wheaton JE (2004) Online data collection: Strategies for research. Journal of Counseling & Development 82(4):387–393
  • He et al [2022] He Y, Peng L, Zhang D, et al (2022) Risk analysis via generalized pareto distributions. Journal of Business & Economic Statistics 40(2):852–867
  • Hosking et al [1985] Hosking JRM, Wallis JR, Wood EF (1985) Estimation of the generalized extreme-value distribution by the method of probability-weighted moments. Technometrics 27(3):251–261
  • Hou et al [2022] Hou Y, Kang SK, Lo CC, et al (2022) Three-step risk inference in insurance ratemaking. Insurance: Mathematics and Economics 105:1–13
  • Koenker and Bassett Jr [1978] Koenker R, Bassett Jr G (1978) Regression quantiles. Econometrica 46(1):33–50
  • Koenker and Hallock [2001] Koenker R, Hallock KF (2001) Quantile regression. Journal of Economic Perspectives 15(4):143–156
  • Li and Wang [2019] Li D, Wang HJ (2019) Extreme quantile estimation for autoregressive models. Journal of Business & Economic Statistics 37(4):661–670
  • Luo et al [1996] Luo ZQ, Pang JS, Ralph D (1996) Mathematical Programs with Equilibrium Constraints. Cambridge University Press, Cambridge, England
  • Lv et al [2007] Lv Y, Hu T, Wang G, et al (2007) A penalty function method based on Kuhn–Tucker condition for solving linear bilevel programming. Applied Mathematics and Computation 188(1):808–813
  • Mathieu et al [1994] Mathieu R, Pittard L, Anandalingam G (1994) Genetic algorithm based approach to bi-level linear programming. Operations Research 28(1):1–21
  • Naveau et al [2009] Naveau P, Guillou A, Cooley D, et al (2009) Modelling pairwise dependence of maxima in space. Biometrika 96(1):1–17
  • Pickands III [1975] Pickands III J (1975) Statistical inference using extreme order statistics. The Annals of Statistics 3(1):119–131
  • Resnick [2008] Resnick SI (2008) Extreme Values, Regular Variation, and Point Processes. Springer Science & Business Media, Berlin, Germany
  • Sinha et al [2014] Sinha A, Malo P, Frantsev A, et al (2014) Finding optimal strategies in a multi-period multi-leader–follower Stackelberg game using an evolutionary algorithm. Computers & Operations Research 41:374–385
  • Sinha et al [2017] Sinha A, Malo P, Deb K (2017) A review on bilevel optimization: From classical to evolutionary approaches and applications. IEEE Transactions on Evolutionary Computation 22(2):276–295
  • Smith [1987] Smith RL (1987) Estimating tails of probability distributions. The Annals of Statistics 15(3):1174–1207
  • Velthoen et al [2019] Velthoen J, Cai JJ, Jongbloed G, et al (2019) Improving precipitation forecasts using extreme quantile regression. Extremes 22:599–622
  • Wang and Li [2013] Wang HJ, Li D (2013) Estimation of extreme conditional quantiles through power transformation. Journal of the American Statistical Association 108(503):1062–1074
  • Wang et al [2012] Wang HJ, Li D, He X (2012) Estimation of high conditional quantiles for heavy-tailed distributions. Journal of the American Statistical Association 107(500):1453–1464
  • Xu et al [2022] Xu W, Hou Y, Li D (2022) Prediction of extremal expectile based on regression models with heteroscedastic extremes. Journal of Business & Economic Statistics 40(2):522–536