
On Uniform Confidence Intervals for the Tail Index and the Extreme Quantile

Yuya Sasaki (Vanderbilt University; VU Station B #351819, 2301 Vanderbilt Place, Nashville, TN 37235-1819, USA)
Yulong Wang (Syracuse University; 110 Eggers Hall, Syracuse, NY 13244-1020, USA)
(October 10, 2022)
Abstract

This paper presents two results concerning uniform confidence intervals for the tail index and the extreme quantile. First, we show that it is impossible to construct a length-optimal confidence interval satisfying the correct uniform coverage over a local non-parametric family of tail distributions. Second, in light of the impossibility result, we construct honest confidence intervals that are uniformly valid by incorporating the worst-case bias in the local non-parametric family. The proposed method is applied to simulated data and a real data set of National Vital Statistics from the National Center for Health Statistics.

Keywords: honest confidence interval, extreme quantile, impossibility, tail index, uniform inference

1 Introduction

Suppose that one is interested in constructing a confidence interval (CI) for the true tail index $\xi_0 \in \mathbb{R}^+$ of a distribution. To define this parameter, assume that the distribution function (d.f.), denoted by $F$, has a well-defined density function $f$ and satisfies the well-known von Mises condition:

$$g(x) \equiv \frac{1 - F(x)}{x f(x)} \rightarrow \xi \qquad (1)$$

as $x \rightarrow \infty$, where $\xi$ is called the tail index. Point-wise CIs for $\xi_0$ have been proposed by a number of papers, including Cheng and Peng (2001), Lu and Peng (2002), Peng and Qi (2006), and Haeusler and Segers (2007). More recently, Carpentier and Kim (2014b) develop an adaptive CI for $\xi_0$ that is uniformly valid over a parametric family of tail distributions indexed by the second-order parameter.

Since the seminal paper by Hill (1975), the literature has investigated numerous estimators for $\xi_0$ as well as their asymptotic properties; see the recent reviews by Gomes and Guillou (2015) and Fedotenkov (2020). Drees (2001) obtains the worst-case optimal convergence rate, i.e., the min-max bound, in estimation over a local non-parametric family of tail distributions. Carpentier and Kim (2014a) construct an adaptive and minimax optimal estimator over the parametric family of second-order Pareto-type distributions. Motivated by these results, a natural question is whether a length-optimal CI for $\xi_0$ can achieve uniformly correct coverage probabilities over the non-parametric family of tail distributions. To the best of our knowledge, the existing literature has not answered this important question, although analogous problems have been investigated in other important contexts in statistics, e.g., for CIs of non-parametric densities (e.g., Low, 1997; Hoffmann and Nickl, 2011; Bull and Nickl, 2013; Carpentier, 2013), non-parametric regressions (e.g., Li, 1989; Genovese and Wasserman, 2008), and high-dimensional regression models (e.g., van de Geer et al., 2014; Wu et al., 2021), to list but a few.

In this paper, we first show that it is in fact impossible to construct a length-optimal CI for the true tail index $\xi_0$ satisfying the uniformly correct coverage over the local non-parametric family considered by Drees (2001). Specifically, any CI enjoying the uniform coverage property can be no shorter than the worst-case bias over the non-parametric family, up to a constant. This negative result is analogous to those of Low (1997) and Genovese and Wasserman (2008) in the contexts of non-parametric densities and regressions, but is novel in the context of the tail index.

Second, we derive the asymptotic distribution of Hill's estimator uniformly over the local non-parametric family of tail distributions. Given the above impossibility result, it is imperative to account for the possible range of bias over the non-parametric family. Hence, we construct an honest CI for the tail index that is locally uniformly valid by incorporating the worst-case bias over the local non-parametric family, in addition to the asymptotic randomness of the estimator. We also demonstrate that this honest CI for the tail index extends to one for extreme quantiles.

Simulation studies support our theoretical predictions. While the naïve length-optimal CI, which does not account for the possible range of bias over the local non-parametric family, suffers from severe under-coverage, our proposed CIs achieve correct coverage. We apply the proposed method to the National Vital Statistics data from the National Center for Health Statistics, and construct CIs for quantiles of extremely low infant birth weights across a variety of mothers' demographic characteristics and maternal behaviors. We find that, even after accounting for the possible range of bias over the local non-parametric family, having no prenatal visit during pregnancy remains a strong risk factor for low infant birth weight.

Organization: Section 2 introduces the setup, notations, and definitions. Section 3 establishes the impossibility result. Section 4 proposes a uniformly valid CI. Section 5 presents an application to extreme quantiles. Sections 6 and 7 present simulation studies and real data analyses, respectively. Section 8 summarizes the paper. Mathematical proofs are collected in the Appendix.

2 Setup, Notations, and Definitions

2.1 Non-parametric Families in the Tail

Any distribution function $F$ satisfying the von Mises condition (1) can be equivalently characterized in the right tail in terms of its inverse $F^{-1}$ by

$$F^{-1}(1-t) = c\, t^{-\xi} \exp\left(\int_t^1 \frac{\eta(s)}{s}\, ds\right)$$

for some $c > 0$ and $\eta(s) = g\left(F^{-1}(1-s)\right) - \xi$, where $\eta(s)$ tends to zero as $s \rightarrow 0$. The standard Pareto distribution falls in this family as a trivial special case, in which $\eta$ is the zero function and $g(x) = \xi$ for all $x$.

To maintain a non-parametric setup in statistical inference about the true tail index $\xi_0$, we follow Drees (2001) and consider the following family of d.f.'s:

$$\mathcal{F}(\xi_0, c, \varepsilon, u) \equiv \left\{ F \text{ is a d.f. } \middle|\; \begin{array}{c} F^{-1}(1-t) = c\, t^{-\xi} \exp\left(\int_t^1 \frac{\eta(s)}{s}\, ds\right), \\ |\xi - \xi_0| \le \varepsilon,\;\; |\eta(t)| \le u(t),\;\; t \in (0,1] \end{array} \right\}$$

where $\xi_0 > \varepsilon > 0$, $c > 0$, and $u(t) = A t^{\rho}$ for some constants $A, \rho > 0$.

Two remarks are in order about this family $\mathcal{F}(\xi_0, c, \varepsilon, u)$. First, let $F_0$ denote the Pareto distribution function with true tail index $\xi_0 > 0$, i.e., $F_0^{-1}(1-t) = t^{-\xi_0}$. This $F_0$ is the center of localization of the family $\mathcal{F}(\xi_0, c, \varepsilon, u)$. The factor $\exp\left(\int_t^1 \frac{\eta(s)}{s}\, ds\right)$ represents a deviation from this center. If we set $\eta = u$, then the model essentially becomes parametric, since the deviation is fully determined by $u(t) = A t^{\rho}$ and hence by the constants $A$ and $\rho$. In this parametric setup, Hall and Welsh (1985) establish the optimal uniform rates of convergence over a family of d.f.'s with densities of the form $f(x) = c x^{-(1/\xi + 1)}(1 + r(x))$, where $|r(x)| \le A x^{-\rho/\xi}$. In contrast, we consider a non-parametric family like $\mathcal{F}(\xi_0, c, \varepsilon, u)$, in which the function $u(\cdot)$ serves merely as an upper bound for deviations from the center. We focus on the classic estimator by Hill (1975). Since Hill's estimator is scale invariant, we hereafter assume $c = 1$ without loss of generality and for brevity.

Second, using the fact that $\int_t^1 \frac{\eta(0)}{s}\, ds = -\eta(0) \log t$, we can rewrite the quantile characterization $F^{-1}(1-t) = t^{-\xi_0} \exp\left(\int_t^1 \frac{\eta(s)}{s}\, ds\right)$ of an element $F \in \mathcal{F}(\xi_0, c, \varepsilon, u)$ as

$$t^{-\xi_0} \exp\left(\int_t^1 \frac{\eta(s)}{s}\, ds\right) = t^{-\xi_0 - \eta(0)} \exp\left(\int_t^1 \frac{\eta(s) - \eta(0)}{s}\, ds\right). \qquad (2)$$

To construct uniformly valid inference, we again follow Drees (2001) and consider a sequence of families of data-generating processes with tail index converging to $\xi_0$ at a rate $d_n := u(t_n) \rightarrow 0$ for some $t_n \rightarrow 0$ as $n \rightarrow \infty$. Specifically, we consider a sequence $t_n$ satisfying

$$\lim_{n \rightarrow \infty} u(t_n)\, (n t_n)^{1/2} = 1. \qquad (3)$$

This sequence essentially entails the optimal choice of the tuning parameter (de Haan and Ferreira, 2006, p.77), which we introduce later. We thus obtain a drifting sequence of local families consisting of $(n,h)$-indexed elements $F_{n,h}$ whose quantile functions satisfy

$$F_{n,h}^{-1}(1-t) \equiv t^{-\xi_0} \exp\left(\int_t^1 \frac{d_n h(t_n^{-1} v)}{v}\, dv\right) = t^{-(\xi_0 + d_n h(0))} \exp\left(\int_t^1 \frac{d_n \left(h(t_n^{-1} v) - h(0)\right)}{v}\, dv\right), \quad t \in (0,1], \qquad (4)$$

where the second equality is due to (2). The corresponding tail index is given by $\xi_{n,h} = \xi_0 + d_n h(0)$, and $h(\cdot) - h(0)$ characterizes the local deviation from the standard Pareto distribution.

Given the above reparametrization, we translate the local non-parametric family $\mathcal{F}(\xi_0, c, \varepsilon, u)$ of d.f.'s into a family of deviation functions $h$. Setting the upper bound to $u(s) = A s^{\rho}$, we consider the family

$$\mathcal{H}(A, \rho) \equiv \left\{ h \in L_2[0,1] \;\middle|\; \sup_{s>0} |h(s)| < \infty,\;\; |h(s) - h(0)| \le A s^{\rho},\; s > 0 \right\},$$

which contains all square integrable functions $h(\cdot)$ that are uniformly bounded and satisfy the bound $|h(s) - h(0)| \le A s^{\rho}$. The family $\mathcal{H}(A, \rho)$ induces a local counterpart of the non-parametric family $\mathcal{F}(\xi_0, 1, \varepsilon, u)$, namely

$$\mathcal{F}_n(\xi_0, \mathcal{H}(A,\rho), u) = \left\{ F_{n,h} \text{ is a d.f. } \middle|\; \begin{array}{c} F_{n,h}^{-1}(1-t) = t^{-(\xi_0 + d_n h(0))} \exp\left(\int_t^1 \frac{\eta(v) - \eta(0)}{v}\, dv\right), \\ \eta(v) = d_n h(t_n^{-1} v),\;\; d_n = u(t_n),\;\; h \in \mathcal{H}(A,\rho) \end{array} \right\}$$

indexed by $t_n^{-1} \rightarrow \infty$ as a function of the sample size $n$. Under this local reparametrization, $h(\cdot)$ represents a deviation function, and the associated tail index is $\xi_{n,h} = \xi_0 + d_n h(0)$.

2.2 Hill’s Estimator

Let $\{Y_1, \ldots, Y_n\}$ be a random sample from $F_{n,h}$, and let $Y_{n:n-j}$ denote the $(j+1)$-th largest order statistic in this sample. With these notations, Hill's estimator is defined by

$$\hat{\xi}(n,k) = \frac{1}{k} \sum_{j=0}^{k-1} \left[\log\left(Y_{n:n-j}\right) - \log\left(Y_{n:n-k}\right)\right].$$

In practice, a researcher often implements this estimator for an interval of values of the tuning parameter $k$ to demonstrate ad hoc robustness. This common practice can be formally accommodated by allowing for a sequence of intervals, i.e., $k = k_n = r \overline{k}_n$ for $r \in [\underline{r}, 1] \subset (0,1]$, similarly to Drees, Resnick, and de Haan (2000).
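For concreteness, the following minimal Python sketch (using only numpy) implements Hill's estimator and evaluates it over a range of tuning parameters; the function name hill_estimator and the simulated Pareto sample are ours, for illustration only.

```python
import numpy as np

def hill_estimator(y, k):
    """Hill's estimator xi_hat(n, k) based on the k largest order statistics."""
    y_desc = np.sort(np.asarray(y, dtype=float))[::-1]  # Y_{n:n}, Y_{n:n-1}, ...
    # (1/k) * sum_{j=0}^{k-1} [log Y_{n:n-j} - log Y_{n:n-k}]
    return np.mean(np.log(y_desc[:k])) - np.log(y_desc[k])

# Example: evaluate over k = r * k_bar for r in [1/2, 1], mimicking the common
# practice of inspecting an interval of tuning-parameter values.
rng = np.random.default_rng(0)
y = rng.uniform(size=1000) ** -0.5        # standard Pareto sample with xi_0 = 0.5
k_bar = 100
estimates = {k: hill_estimator(y, k) for k in range(k_bar // 2, k_bar + 1)}
```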

Define the functional $T_r(z)$ by

$$T_r(z) \equiv r^{-1} \int_0^r \log\left(z(t)/z(r)\right) dt$$

for any measurable function $z : [0,1] \rightarrow \mathbb{R}$. If we substitute the quantile function of the standard Pareto distribution, that is, $z(t) = F_0^{-1}(1-t) = t^{-\xi_0}$, then we have

$$T_r\left(t^{-\xi_0}\right) = r^{-1}\int_0^r \log\left(t^{-\xi_0}/r^{-\xi_0}\right) dt = \xi_0\, r^{-1}\int_0^r \log(r/t)\, dt = \xi_0,$$

identifying the true $\xi_0$ for any $r \in (0,1]$, where the last equality uses $\int_0^r \log(r/t)\, dt = r$. Define the tail empirical quantile function

$$Q_{n, r\overline{k}_n, F_{n,h}} = Y_{n:n-\lfloor r\overline{k}_n \rfloor}$$

for $r \in (0,1]$, where $\lfloor \cdot \rfloor$ denotes the largest integer not exceeding its argument. With these auxiliary notations, as implied by Example 3.1 in Drees (1998a), Hill's estimator $\hat{\xi}(n, r\overline{k}_n)$ can be equivalently rewritten as $T_r\left(Q_{n, r\overline{k}_n, F_{n,h}}\right)$.

3 Asymptotic Impossibility

This section presents the first of the two main theoretical results of this paper. In light of the min-max result for estimation (cf. Drees, 2001), a researcher would naturally be interested in a length-optimal confidence interval satisfying uniform coverage over a non-parametric family, such as the one introduced in Section 2.1. This section shows that this objective is impossible to achieve.

Specifically, we aim to establish that the length of any confidence interval $H(Y^{(n)}) = H(Y_1, \ldots, Y_n)$ that covers $\xi_0$ uniformly for all $h \in \mathcal{H}(A, \rho)$ is no shorter than a constant multiple of the worst-case bias $\left|T_r(F_{n,h}^{-1}) - T_r(F_0^{-1})\right|$ over the non-parametric family $\mathcal{H}(A, \rho)$. This implies that we cannot find a length-optimal confidence interval satisfying the uniform coverage without accounting for this worst-case bias. Such an impossibility result parallels in spirit the one about non-parametric density estimation established by Low (1997) and the one about non-parametric regression estimation established by Genovese and Wasserman (2008), among others.

Define the modulus of continuity of $T_r(\cdot)$ by

$$\omega\left(\varepsilon, F_0^{-1}, n\right) \equiv \sup\left\{ \left|T_r\left(F_{n,h}^{-1}\right) - T_r\left(F_0^{-1}\right)\right| \;\middle|\; h \in \mathcal{H}(A, \rho),\;\; \|h - h(0)\| \le \varepsilon \right\}. \qquad (5)$$

This is the worst-case bias in absolute value. Let $F_0^n$ and $F_{n,h}^n$ denote the joint distributions of $n$ i.i.d. draws from $F_0$ and $F_{n,h}$, respectively, and let $\mathbb{E}_{F_0^n}[\cdot]$ denote the expectation with respect to the product measure $F_0^n$. The following theorem establishes the impossibility result: the expected length of a uniformly valid confidence interval is no shorter than a constant multiple of the modulus of continuity of $T_r(\cdot)$.

Theorem 1 (Impossibility)

For $\varepsilon > 0$, suppose that a confidence interval $H(Y^{(n)})$ for $\xi_0$ has coverage probability of at least $1 - \beta$ uniformly for all $h \in \mathcal{H}(A, \rho) \cap \{h : \|h - h(0)\| \le \varepsilon\}$. Then, there exist $N$ and $C$, depending only on $\varepsilon$ and $\xi_0$, such that $1 - \beta - C > 0$ and

$$\mathbb{E}_{F_0^n}\left[\mu\left(H\left(Y^{(n)}\right)\right)\right] \ge (1 - \beta - C)\, \omega\left(\varepsilon, F_0^{-1}, n\right) \qquad (6)$$

for all $n > N$.

As already mentioned above, this result is analogous to those established by Low (1997), Cai and Low (2004), Genovese and Wasserman (2008), and Armstrong and Kolesár (2018), to list but a few, in other important contexts of statistics, but it is novel in the context of the tail index. To understand the lower bound in (6), we now derive a concrete expression for the element in the definition (5) of the modulus of continuity $\omega\left(\varepsilon, F_0^{-1}, n\right)$. Note that

$$\begin{aligned}
T_r\left(F_{n,h}^{-1}\right) - T_r\left(F_0^{-1}\right)
&= r^{-1}\int_0^r \log\left(\frac{t^{-\xi_0}\exp\left(\int_t^1 \frac{d_n h(t_n^{-1} v)}{v}\, dv\right)}{r^{-\xi_0}\exp\left(\int_r^1 \frac{d_n h(t_n^{-1} v)}{v}\, dv\right)}\right) dt - \xi_0 \\
&= r^{-1}\int_0^r \left[-\xi_0 \log(t/r) + d_n\left(\int_t^1 \frac{h(t_n^{-1} v)}{v}\, dv - \int_r^1 \frac{h(t_n^{-1} v)}{v}\, dv\right)\right] dt - \xi_0 \\
&= d_n r^{-1}\int_0^r \left(\int_t^1 \frac{h(t_n^{-1} v)}{v}\, dv - \int_r^1 \frac{h(t_n^{-1} v)}{v}\, dv\right) dt \\
&= d_n r^{-1}\int_0^r \left(\int_t^r \frac{h(t_n^{-1} v)}{v}\, dv\right) dt \\
&= d_n r^{-1}\int_0^r \left(\int_t^r \frac{h(t_n^{-1} v) - h(0)}{v}\, dv\right) dt + d_n h(0).
\end{aligned}$$

The first term in the last line characterizes the bias due to the deviation from the standard Pareto distribution, and the second term characterizes the asymptotic randomness. As a consequence, to obtain a feasible and uniformly valid confidence interval, we will set an upper bound for the first term and adjust the critical value based on the second term. We obtain such a uniform confidence interval in the following section.

4 Uniform Confidence Interval

Given that $\rho > 0$, it has been established that the optimal rate of convergence of Hill's estimator is $n^{-\rho/(2\rho+1)}$; see, for example, Remark 3.2.7 in de Haan and Ferreira (2006). Such an optimal rate entails a non-negligible asymptotic bias, as characterized in Theorem 2 below. To achieve this rate, we let $\overline{k}_n \approx n t_n$, where $A \approx B$ means $\lim_{n \rightarrow \infty} A/B = 1$. Then, the restriction (3) implies that $u(t_n) \approx A^{1/(2\rho+1)} n^{-\rho/(2\rho+1)}$ and $\overline{k}_n^{1/2} u(t_n) = \overline{k}_n^{1/2} d_n \rightarrow 1$ as $n \rightarrow \infty$. We formally summarize these conditions below.

Condition 1

As $n \rightarrow \infty$, $A^{1/(2\rho+1)}\, \overline{k}_n^{1/2}\, n^{-\rho/(2\rho+1)} \rightarrow 1$.

As the second of the two main theoretical results of this paper, the following theorem derives the asymptotic distribution of $\hat{\xi}(n, r\overline{k}_n)$ uniformly for all $r \in [\underline{r}, 1]$ and $h \in \mathcal{H}(A, \rho)$ by exploiting the features of the functional $T_r(\cdot)$ defined in Section 2.2. Let $\mathbb{P}^n_{n,h}$ denote the distribution of $n$ i.i.d. draws from the distribution $F_{n,h}$.

Theorem 2 (Uniform Asymptotic Distribution)

If Condition 1 is satisfied, then

$$\sqrt{\overline{k}_n}\left(\hat{\xi}\left(n, r\overline{k}_n\right) - \xi_{n,h}\right) = \xi_0\, \mathbb{G}(r) + \mathbb{B}(r; h) + o_{\mathbb{P}^n_{n,h}}(1)$$

holds under random sampling from (4) uniformly for all $r \in [\underline{r}, 1]$ and $h \in \mathcal{H}(A, \rho)$, where $\mathbb{G}$ and $\mathbb{B}$ are defined in terms of a standard Brownian motion $W$ by

$$\mathbb{G}(r) \equiv r^{-1} \int_0^r \left(s^{-1} W(s) - r^{-1} W(r)\right) ds \qquad (7)$$

and

$$\mathbb{B}(r; h) \equiv r^{-1} \int_0^r \left(\int_s^r \frac{h(v) - h(0)}{v}\, dv\right) ds, \qquad (8)$$

respectively.

A proof can be found in the Appendix. The convergence of Hill's estimator as a function of $r$ has been established in the literature (e.g., Resnick and Stărică, 1997; Drees et al., 2000). In comparison, Theorem 2 contributes to the literature by establishing the asymptotic distribution uniformly over both $r \in [\underline{r}, 1]$ and the local non-parametric function class. The terms $\mathbb{G}(\cdot)$ and $\mathbb{B}(\cdot; h)$ characterize the asymptotic randomness and bias, respectively. It follows that

$$\sqrt{r\overline{k}_n}\left(\hat{\xi}\left(n, r\overline{k}_n\right) - \xi_{n,h}\right) \Rightarrow \xi_0 \sqrt{r}\, \mathbb{G}(r) + \sqrt{r}\, \mathbb{B}(r; h). \qquad (9)$$

To conduct statistical inference based on (9), we need to compute the bound

$$\sup_{r \in [\underline{r}, 1]}\; \sup_{h \in \mathcal{H}(A, \rho)} \sqrt{r}\, \mathbb{B}(r; h)$$

of the bias. To this end, note that $|h(v) - h(0)| \le A v^{\rho}$ for all $h \in \mathcal{H}(A, \rho)$. Therefore, for any $h \in \mathcal{H}(A, \rho)$,

$$\begin{aligned}
\left|\mathbb{B}(r; h)\right| &= \left| r^{-1} \int_0^r \left(\int_s^r \frac{h(v) - h(0)}{v}\, dv\right) ds \right| \\
&\le r^{-1} \int_0^r \left(\int_s^r \frac{|h(v) - h(0)|}{v}\, dv\right) ds \\
&\le r^{-1} \int_0^r \left(\int_s^r A v^{\rho - 1}\, dv\right) ds \\
&= A \frac{r^{\rho}}{1 + \rho}. \qquad (10)
\end{aligned}$$

This bound is tight; it is attained when, for example, $h(v) - h(0) = A v^{\rho} \ge 0$.

With this tight bias bound taken into account, in a spirit similar to Armstrong and Kolesár (2020), a locally uniformly valid confidence interval for the tail index is given by

$$H^{O}\left(n, r\overline{k}_n\right) = \left[\hat{\xi} - \frac{\hat{\xi}\, q_{1-\beta/2} + \overline{B}(A, \rho)}{\sqrt{r\overline{k}_n}},\;\; \hat{\xi} + \frac{\hat{\xi}\, q_{1-\beta/2} + \overline{B}(A, \rho)}{\sqrt{r\overline{k}_n}}\right] \qquad (11)$$

for $r \in [\underline{r}, 1]$, where $\hat{\xi}$ is short for $\hat{\xi}(n, r\overline{k}_n)$,

$$\overline{B}(A, \rho) \equiv \sup_{r \in (0,1]} \sqrt{r}\, \frac{A r^{\rho}}{1 + \rho} = \frac{A}{1 + \rho}$$

is the upper bound of the bias, and $q_{1-\beta/2}$ denotes the $1-\beta/2$ quantile of $\sup_{r \in [\underline{r}, 1]} \sqrt{r}\, \mathbb{G}(r)$, whose values can be found in Table 1.

$\underline{r}$ \ $\beta$    0.10    0.05    0.01
1            1.64    1.96    2.56
10/11        1.87    2.19    2.76
5/6          1.95    2.27    2.86
2/3          2.09    2.42    3.01
1/2          2.22    2.54    3.12
1/3          2.33    2.66    3.23
1/4          2.41    2.71    3.27
1/5          2.46    2.74    3.34
1/10         2.58    2.85    3.44
1/20         2.67    2.92    3.51
1/50         2.75    3.01    3.57
1/100        2.80    3.08    3.61

Table 1: The $1-\beta/2$ quantile of $\sup_{r \in [\underline{r}, 1]} \sqrt{r}\, \mathbb{G}(r)$, computed from 20,000 simulation draws. The Gaussian process is approximated with 50,000 steps.
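The quantiles in Table 1 can be approximated by simulating the Gaussian process directly. The following Python sketch is one such discretization; the function name, step count, and draw count are ours (Table 1 itself uses 50,000 steps and 20,000 draws), so treat it as illustrative rather than as the exact procedure behind the table.

```python
import numpy as np

def sup_rG_quantile(r_lo, beta, n_steps=5_000, n_draws=2_000, seed=0):
    """Approximate the 1 - beta/2 quantile of sup_{r in [r_lo, 1]} sqrt(r) * G(r),
    where G(r) = r^{-1} int_0^r (W(s)/s - W(r)/r) ds for a Brownian motion W."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    s = dt * np.arange(1, n_steps + 1)          # grid s_1, ..., s_N = 1
    sups = np.empty(n_draws)
    for b in range(n_draws):
        W = np.cumsum(rng.normal(0.0, np.sqrt(dt), n_steps))  # W(s_i)
        integral = np.cumsum(W / s) * dt        # int_0^{s_i} W(v)/v dv
        G = integral / s - W / s                # G(r) on the grid r = s_i
        keep = s >= r_lo
        sups[b] = np.max(np.sqrt(s[keep]) * G[keep])
    return np.quantile(sups, 1.0 - beta / 2.0)

# e.g. sup_rG_quantile(1.0, 0.05) should be close to 1.96 (Table 1, r_lo = 1),
# and sup_rG_quantile(0.5, 0.05) close to 2.54 (Table 1, r_lo = 1/2).
```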

As a final remark, we discuss the choice of the higher-order parameters $\rho$ and $A$. Both depend on the underlying distribution and are hence unknown. While this feature may appear to be a disadvantage of our method, the impossibility result in Theorem 1 implies that it cannot be avoided, regardless of how we construct the interval. The existing literature has proposed several estimators of $\rho$ and $A$, whose consistency requires additional assumptions on the underlying function, and equivalently further restrictions on the class $\mathcal{H}$. See Carpentier and Kim (2014a), Cheng and Peng (2001), and Haeusler and Segers (2007) for some data-driven methods of choosing the higher-order parameters. The corresponding confidence intervals are no longer uniformly valid, but remain pointwise valid.

As an alternative, we propose a rule-of-thumb choice of $\rho$ and $A$ and proceed with the proposed interval. In particular, if the underlying distribution is Student-t with $1/\xi_0$ degrees of freedom, we know that $\rho = 2\xi_0$. Furthermore, for the bias upper bound $\overline{B} = A/(1+\rho)$, we set $A = 0.1\, \xi_0 (1 + 2\xi_0) \sqrt{\overline{k}_n}$, so that $\overline{B}/\sqrt{\overline{k}_n}$ is at most $10\%$ of the true tail index to be estimated. In practice, we replace $\xi_0$ with the estimator $\hat{\xi}(n, \overline{k}_n)$. This rule of thumb is reminiscent of Silverman's bandwidth choice in kernel density estimation, where the reference is the Gaussian distribution. We examine the performance of this rule by simulations in Section 6.
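Combining (11) with this rule of thumb, a minimal sketch of the resulting interval might look as follows; honest_ci_tail_index is our name, and hill_estimator refers to the earlier sketch in Section 2.2.

```python
import numpy as np

def honest_ci_tail_index(y, k_bar, q, r=1.0):
    """Honest CI (11) for the tail index with the rule-of-thumb bias bound.

    q is the 1 - beta/2 quantile from Table 1 (q = 1.96 for r = 1, beta = 0.05).
    Rule of thumb: rho = 2 * xi_hat and A = 0.1 * xi_hat * (1 + 2 xi_hat) * sqrt(r k_bar),
    so that B_bar / sqrt(r * k_bar) is 10% of the estimated tail index."""
    k = int(r * k_bar)
    xi = hill_estimator(y, k)                 # sketch from Section 2.2
    rho = 2.0 * xi                            # Student-t reference: rho = 2 * xi_0
    A = 0.1 * xi * (1.0 + 2.0 * xi) * np.sqrt(r * k_bar)
    B_bar = A / (1.0 + rho)                   # worst-case bias bound A / (1 + rho)
    half = (xi * q + B_bar) / np.sqrt(r * k_bar)
    return xi - half, xi + half
```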

5 Extreme Quantiles

In this section, we apply Theorem 2 to uniform inference about extreme quantiles. To characterize the extremeness, we focus on the sequence of $1 - p_n$ quantiles, where $p_n \rightarrow 0$ as $n \rightarrow \infty$. Consider the extreme quantile estimator of Weissman (1978):

$$\hat{F}_{n,h}^{-1}(1 - p_n) = Y_{n:n-\lfloor r\overline{k}_n \rfloor}\left(\frac{r\overline{k}_n}{n p_n}\right)^{\hat{\xi}(n, r\overline{k}_n)},$$

where $\hat{\xi}(n, k)$ is Hill's estimator. Recall that the true quantile under the local drifting sequence is

$$F_{n,h}^{-1}(1-t) = t^{-\xi_0} \exp\left(\int_t^1 \frac{d_n h(t_n^{-1} v)}{v}\, dv\right)$$

as in (4), where $t_n$ satisfies $\lim_{n \rightarrow \infty} \overline{k}_n/(n t_n) = 1$ as in Condition 1. We now aim to asymptotically approximate the distribution of

$$\frac{\hat{F}_{n,h}^{-1}(1 - p_n)}{F_{n,h}^{-1}(1 - p_n)} - 1.$$

To this end, we state the following condition on the relation among $n$, $\overline{k}_n$, and $p_n$ (e.g., de Haan and Ferreira, 2006, Theorem 4.3.1).

Condition 2

$n p_n = o\left(\overline{k}_n\right)$ and $\log(n p_n) = o\left(\sqrt{\overline{k}_n}\right)$ as $n \rightarrow \infty$.

Theorem 3 (Extreme Quantiles)

Under Conditions 1 and 2 with random sampling, uniformly for all $r \in [\underline{r}, 1]$ and $h \in \mathcal{H}(A, \rho)$,

$$\frac{\sqrt{r\overline{k}_n}}{\log\left(r\overline{k}_n/(n p_n)\right)}\left(\frac{\hat{F}_{n,h}^{-1}(1 - p_n)}{F_{n,h}^{-1}(1 - p_n)} - 1\right) - \sqrt{r\overline{k}_n}\left(\hat{\xi}\left(n, r\overline{k}_n\right) - \xi_{n,h}\right) = o_{\mathbb{P}^n_{n,h}}(1).$$

Theorem 3 implies that the asymptotic distribution of the extreme quantile estimator is the same as that of the tail index estimator. Therefore, a uniformly valid confidence interval can be constructed similarly. In particular, a robust confidence interval for $F_{n,h}^{-1}(1 - p_n)$ with nominal $1 - \beta$ uniform coverage probability accounting for the bias is constructed as

$$I^{O}(n,k) = \left[\hat{F}_{n,h}^{-1}(1-p_n)\left\{1 - \frac{\log d(n, r\overline{k}_n)}{\sqrt{r\overline{k}_n}}\left(\hat{\xi}\, q_{1-\beta/2} + \overline{B}(A,\rho)\right)\right\},\;\; \hat{F}_{n,h}^{-1}(1-p_n)\left\{1 + \frac{\log d(n, r\overline{k}_n)}{\sqrt{r\overline{k}_n}}\left(\hat{\xi}\, q_{1-\beta/2} + \overline{B}(A,\rho)\right)\right\}\right] \qquad (12)$$

where again $\hat{\xi}$ is short for $\hat{\xi}(n, r\overline{k}_n)$, $q_{1-\beta/2}$ denotes the quantile found in Table 1, and $d(n, r\overline{k}_n) = r\overline{k}_n/(n p_n)$.
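A sketch of the Weissman estimator and the interval (12), under the same caveats as the earlier sketches (function names are ours; hill_estimator is the sketch from Section 2.2):

```python
import numpy as np

def weissman_quantile(y, k, p):
    """Weissman (1978) estimator of the 1 - p quantile from the top k order statistics."""
    y_desc = np.sort(np.asarray(y, dtype=float))[::-1]
    xi = hill_estimator(y, k)
    return y_desc[k] * (k / (len(y) * p)) ** xi, xi

def honest_ci_quantile(y, k_bar, p, q, A, rho, r=1.0):
    """Honest CI (12) for the extreme 1 - p quantile."""
    k = int(r * k_bar)
    q_hat, xi = weissman_quantile(y, k, p)
    # half-width factor: log d(n, r k_bar) / sqrt(r k_bar) * (xi * q + B_bar)
    half = np.log(k / (len(y) * p)) / np.sqrt(r * k_bar) * (xi * q + A / (1.0 + rho))
    return q_hat * (1.0 - half), q_hat * (1.0 + half)
```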

6 Simulation Studies

In this section, we use simulated data to evaluate the finite-sample performance of our proposed confidence intervals in comparison with the naïve length-optimal confidence interval. Sections 6.1 and 6.2 focus on inference about the tail index and extreme quantiles, respectively.

6.1 Tail Index

The following simulation design is employed for our analysis. We generate $n$ independent standard uniform random variables $U_i$ and construct the observations $Y_i = F^{-1}(1 - U_i)$, where $F^{-1}(1-t) = t^{-\xi_0} \exp\left(\int_t^1 s^{-1} \eta(s)\, ds\right)$. We set $\eta(t) = c\, t^{\rho}$, so that

$$F^{-1}(1-t) = t^{-\xi_0} \exp\left(\frac{c\left(1 - t^{\rho}\right)}{\rho}\right), \qquad (13)$$

where the constants $c$ and $\rho$ characterize the scale and the shape, respectively, of the deviation from the Pareto distribution. For ease of comparison, we set $\rho = 2\xi_0$, which corresponds to the Student-t distribution with $1/\xi_0$ degrees of freedom. We then set $c = c_0 \xi_0/(1 + 2\xi_0)$ for normalization, where we vary $c_0 \in \{0, 0.5, 1\}$ across sets of simulations.
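Since (13) is an explicit quantile function, this design can be sampled by inversion. A minimal sketch, with our own function name:

```python
import numpy as np

def draw_sample(n, xi0, c0, rng):
    """Draw Y_i = F^{-1}(1 - U_i) from (13), with rho = 2 * xi0 and
    c = c0 * xi0 / (1 + 2 * xi0); c0 = 0 recovers the exact Pareto case."""
    rho = 2.0 * xi0
    c = c0 * xi0 / (1.0 + 2.0 * xi0)
    u = rng.uniform(size=n)                  # U_i ~ Uniform(0, 1)
    return u ** (-xi0) * np.exp(c * (1.0 - u ** rho) / rho)

y = draw_sample(1000, xi0=0.5, c0=1.0, rng=np.random.default_rng(0))
```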

To construct the optimal confidence interval, we need a choice of $\overline{k}_n$. We use the data-driven algorithm proposed by Guillou and Hall (2001), which we briefly summarize in Appendix B for the convenience of readers. This choice of $\overline{k}_n$ has the optimal rate, as established in their Theorem 2. Specifically, we select $\overline{k}_n$ according to (33) in Appendix B with the restriction that $\overline{k}_n \in [0.01n, 0.99n]$.

We implement three confidence intervals for comparison. The first is the naïve length-optimal confidence interval, which does not account for a possible bias over the local non-parametric family, that is,

$$H^{N}(n, \overline{k}_n) = \left[\hat{\xi}(n, \overline{k}_n) \pm 1.96\, \overline{k}_n^{-1/2}\, \hat{\xi}(n, \overline{k}_n)\right], \qquad (14)$$

where $\overline{k}_n$ is selected according to the procedure described above. Our impossibility result predicts that this interval fails to achieve correct coverage uniformly over the non-parametric class encompassing our simulation design.

The second is $H^{O}(n, \overline{k}_n)$, i.e., our proposed confidence interval given in (11) with $r = 1$, where $\overline{k}_n$ is selected according to the procedure described above. The bias upper bound $\overline{B} = A/(1+\rho)$ is chosen following the rule of thumb described in Section 4.

The third is based on $k$ snooping. Specifically, given $\overline{k}_n$ selected according to the procedure described above, consider the range $[\underline{r}\,\overline{k}_n, \overline{k}_n]$ containing $m$ integers denoted by $\{k_1, \ldots, k_m\}$. The $k$-snooping interval is constructed by

$$H^{S}(n, \underline{r}) = \bigcap_{j=1}^{m} H^{O}(n, k_j), \qquad (15)$$

where $H^{O}(\cdot, \cdot)$ is defined in (11) and the lower bound $\underline{r}$ of the $k$ snooping is set to $1/2$. For the bias upper bound $\overline{B} = A/(1+\rho)$, we set $A = 0.1\, \xi_0 (1 + 2\xi_0) \sqrt{r\overline{k}_n}$ and replace $\xi_0$ with the estimator $\hat{\xi}(n, k_j)$ for each $j \in \{1, \ldots, m\}$.
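A sketch of the snooping construction (15), intersecting the honest intervals from the Section 4 sketch over the grid of $k$ values; here q should be the Table 1 quantile for the chosen $\underline{r}$ (e.g., 2.54 for $\underline{r} = 1/2$ and $\beta = 0.05$):

```python
import numpy as np

def snooping_ci(y, k_bar, q, r_lo=0.5):
    """k-snooping interval (15): intersect H^O(n, k) over k in [r_lo*k_bar, k_bar].
    Each H^O uses xi_hat(n, k_j) in its rule-of-thumb bias bound, as in the text."""
    lo, hi = -np.inf, np.inf
    for k in range(int(r_lo * k_bar), int(k_bar) + 1):
        l, u = honest_ci_tail_index(y, k_bar, q, r=k / k_bar)  # Section 4 sketch
        lo, hi = max(lo, l), min(hi, u)
    return lo, hi
```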

Tail Index: Coverage Probabilities

                       $n = 250$              $n = 500$              $n = 1000$
$(\xi_0, c_0)$    $H^N$  $H^O$  $H^S$    $H^N$  $H^O$  $H^S$    $H^N$  $H^O$  $H^S$
(1, 0)             0.92   0.99   0.98     0.92   0.99   0.99     0.91   0.99   0.99
(1, 0.5)           0.88   0.98   0.97     0.81   0.99   0.98     0.73   0.99   0.99
(1, 1)             0.77   0.98   0.97     0.64   0.98   0.98     0.65   0.99   0.99
(0.5, 0)           0.91   0.99   0.98     0.91   0.99   0.99     0.91   0.99   0.99
(0.5, 0.5)         0.70   0.98   0.98     0.55   0.98   0.99     0.55   0.98   0.98
(0.5, 1)           0.51   0.87   0.94     0.59   0.86   0.92     0.61   0.93   0.95

Tail Index: Average Lengths

                       $n = 250$              $n = 500$              $n = 1000$
$(\xi_0, c_0)$    $H^N$  $H^O$  $H^S$    $H^N$  $H^O$  $H^S$    $H^N$  $H^O$  $H^S$
(1, 0)             0.29   0.49   0.53     0.20   0.40   0.43     0.14   0.34   0.36
(1, 0.5)           0.30   0.50   0.54     0.21   0.42   0.43     0.16   0.36   0.37
(1, 1)             0.31   0.52   0.54     0.23   0.44   0.45     0.18   0.38   0.39
(0.5, 0)           0.14   0.24   0.26     0.10   0.20   0.22     0.07   0.17   0.18
(0.5, 0.5)         0.16   0.27   0.28     0.12   0.22   0.23     0.09   0.20   0.20
(0.5, 1)           0.18   0.29   0.31     0.15   0.25   0.26     0.12   0.22   0.23

Table 2: The coverage probabilities (top panel) and the average lengths (bottom panel) of the 95% confidence intervals for $\xi_0$. $H^N$ stands for the naïve confidence interval defined in (14). $H^O$ stands for our proposed confidence interval defined in (11) with $r = 1$. $H^S$ stands for our proposed confidence interval with snooping defined in (15). The results are based on 5000 simulation draws.

From Theorem 2, we expect that both $H^{O}(n, \overline{k}_n)$ and $H^{S}(n, \underline{r})$ will deliver asymptotically correct (uniform) coverage, whereas $H^{N}(n, \overline{k}_n)$ will not. Table 2 presents the coverage probabilities (top panel) and the average lengths (bottom panel) of the 95% confidence intervals based on 5000 Monte Carlo iterations. Key findings can be summarized as follows. First, both $H^{O}$ and $H^{S}$ have the correct coverage probability for most of the distributions, consistent with our theory. When $\xi_0 = 0.5$ and $c_0 = 1$, the deviation of the tail distribution from the Pareto distribution is the most severe. In this case, $H^{O}$ suffers from some undercoverage when $n$ is small (e.g., $n = 250$ and $500$), but achieves more satisfactory coverage as $n$ becomes large (e.g., $n = 1000$). Second, the coverage by $H^{N}$ is inadequate throughout, even when the deviation from the Pareto distribution is relatively small. Furthermore, the undercoverage by $H^{N}$ tends to worsen as the sample size $n$ increases. Finally, the lengths of $H^{S}$ are slightly larger than those of $H^{O}$ when $n$ is 250, but they become almost identical as $n$ grows. From these findings, we prefer $H^{S}$ and $H^{O}$ to $H^{N}$. As the sample size becomes larger, $H^{S}$ and $H^{O}$ become equally preferable, while $H^{N}$ consistently underperforms.

6.2 Extreme Quantiles

We now turn to extreme quantiles. The data generating process is the same as in the previous subsection. The object of interest is the 99% quantile, so that $p_n = 0.01$. As in the previous subsection, we compare three confidence intervals.

The first is the naïve confidence interval, which does not account for a possible bias in the non-parametric family, that is,

$$I^{N}\left(n, \overline{k}_n\right) = \left[\hat{Q}(1-p_n)\left(1 - \frac{1.96}{\sqrt{\overline{k}_n}}\,\hat{\xi}(n,\overline{k}_n)\right),\;\; \hat{Q}(1-p_n)\left(1 + \frac{1.96}{\sqrt{\overline{k}_n}}\,\hat{\xi}(n,\overline{k}_n)\right)\right], \qquad (16)$$

where $\hat{Q}(1-p_n)$ is short for the Weissman estimator $\hat{F}_{n,h}^{-1}(1-p_n)$.

The second is our proposed confidence interval $I^{O}(n, \overline{k}_n)$ in (12) with $r = 1$. The third is our proposed confidence interval with $k$ snooping, that is,

$$I^{S}(n, \underline{r}) = \bigcap_{j=1}^{m} I^{O}(n, k_j), \qquad (17)$$

where $\{k_1, \ldots, k_m\}$ are constructed in the same way as in (15), $I^{O}(n, k_j)$ is defined as in (12), and we set $\underline{r} = 1/2$.

Extreme Quantile: Coverage Probabilities

                       $n = 250$              $n = 500$              $n = 1000$
$(\xi_0, c_0)$    $I^N$  $I^O$  $I^S$    $I^N$  $I^O$  $I^S$    $I^N$  $I^O$  $I^S$
(1, 0)             0.90   0.96   0.94     0.92   0.98   0.97     0.93   0.99   0.99
(1, 0.5)           0.92   0.96   0.93     0.93   0.98   0.97     0.89   0.99   0.98
(1, 1)             0.92   0.95   0.93     0.89   0.97   0.95     0.83   0.99   0.97
(0.5, 0)           0.92   0.98   0.97     0.93   0.99   0.98     0.93   0.99   0.99
(0.5, 0.5)         0.91   0.97   0.95     0.86   0.98   0.96     0.78   0.99   0.98
(0.5, 1)           0.86   0.96   0.94     0.81   0.97   0.96     0.85   0.98   0.97

Extreme Quantile: Average Lengths

                       $n = 250$              $n = 500$              $n = 1000$
$(\xi_0, c_0)$    $I^N$  $I^O$  $I^S$    $I^N$  $I^O$  $I^S$    $I^N$  $I^O$  $I^S$
(1, 0)             136    232    240      91     183    189      62     152    156
(1, 0.5)           170    291    277      110    225    213      76     185    174
(1, 1)             199    342    302      132    263    235      90     209    189
(0.5, 0)           6.3    10.9   11.6     4.4    8.9    9.3      3.0    7.5    7.7
(0.5, 0.5)         8.3    14.4   14.3     5.8    11.5   11.3     4.2    9.4    9.1
(0.5, 1)           10.9   18.3   17.6     7.5    13.7   13.3     5.3    10.6   10.4

Table 3: The coverage probabilities (top panel) and the average lengths (bottom panel) of the 95% confidence intervals for the 99% quantile. $I^N$ stands for the naïve confidence interval defined in (16). $I^O$ stands for our proposed confidence interval defined in (12) with $r = 1$. $I^S$ stands for our proposed confidence interval with snooping defined in (17). The results are based on 5000 simulation draws.

Table 3 presents the coverage probabilities (top panel) and the average lengths (bottom panel) of the 95% confidence intervals based on 1000 Monte Carlo iterations. The findings are similar to those in Table 2. In particular, both $I^{O}$ and $I^{S}$ deliver correct coverage probabilities for all distributions, while $I^{N}$ suffers from undercoverage in general. Regarding the lengths, $I^{O}$ and $I^{S}$ are both longer than $I^{N}$ to allow for the correct coverage. Furthermore, when $\xi_0$ is 0.5, $I^{O}$ has approximately the same lengths as $I^{S}$. When $\xi_0 = 1$, the lengths of $I^{S}$ are shorter than those of $I^{O}$, especially when the model deviates substantially from the Pareto distribution. This is because the true quantile is substantially larger, so the effect of adapting the critical value becomes more significant. From these observations, we prefer $I^{O}$ and $I^{S}$ to $I^{N}$.

7 Real Data Analysis

This section illustrates an application of the proposed method to an analysis of extremely low infant birth weights. The relation between such weights and mothers' demographic characteristics and maternal behaviors addresses important research questions. We use the detailed natality data (Vital Statistics) published by the National Center for Health Statistics, which has been used by prior studies including Abrevaya (2001) among many others. Our sample consists of repeated cross sections from 1989 to 2002. Using the data from each of these years, we construct 95% confidence intervals for the tail index in the left tail and for the first percentile, following the same computational procedure as in Section 6. Details of our implementation with the current empirical data set are as follows.

We follow previous studies (e.g., Abrevaya, 2001) in choosing the variables for mothers' demographic characteristics and maternal behaviors. The variable of interest is the infant birth weight measured in kilograms. For the purpose of comparison, we set a benchmark subsample in which the infant is a boy and the mother is younger than the median age in the full sample, is white and married, has a level of education lower than a high school degree, had her first prenatal visit in the first trimester (natal1), and did not smoke during the pregnancy. In addition to this benchmark subsample (benchmark), we also consider seven alternative subsamples, each corresponding to one and only one of the following scenarios: the mother has at least a high school diploma (high school); the infant is a girl (girl); the mother is unmarried (unmarried); the mother is black (black); the mother had no prenatal visit during pregnancy (no pre-visit); the mother smokes ten cigarettes per day on average (smoke); and the mother's age is above the median age in the full sample (older).

For each of these subsamples, we construct the 95% confidence intervals $H^{N}$, $H^{O}$, and $H^{S}$ for the tail index in the left tail in the same way as in Section 6.1, and the 95% confidence intervals $I^{N}$, $I^{O}$, and $I^{S}$ for the first percentile as in Section 6.2. Since we are interested in the left tail (extremely low birth weights), we consider only birth weights, denoted $B_i$, that are less than some cutoff value $T$, and take $Y_i = T - B_i$ as the input in our computational procedure for inference. We choose $T = 4$ based on the prior findings in Abrevaya (2001), who finds that the relationship between the infant birth weight and the mother's demographics changes substantially at the 90th percentile of birth weight in the full sample, which is approximately 4 kilograms. Once $I^{N}$, $I^{O}$, and $I^{S}$ are constructed for the 99th percentile of this transformed variable, we in turn multiply by $-1$ and add $T$ back to restore the interval for the original first percentile, so as to conduct inference about the extremely low birth weights.
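Concretely, the left-tail transformation and the back-transformation of the resulting interval can be sketched as follows; left_tail_interval is our name, and honest_ci_quantile refers to the sketch in Section 5.

```python
import numpy as np

def left_tail_interval(b, T=4.0, **ci_kwargs):
    """CI for the 1st percentile of birth weight B via the right tail of Y = T - B.
    A CI [lo, hi] for the 99th percentile of Y maps back to [T - hi, T - lo]."""
    b = np.asarray(b, dtype=float)
    y = T - b[b < T]                 # keep only B_i < T; left tail becomes right tail
    lo_y, hi_y = honest_ci_quantile(y, p=0.01, **ci_kwargs)  # Section 5 sketch
    return T - hi_y, T - lo_y        # interval endpoints flip under x -> T - x
```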

Tail Index
Subsample        $H^N$           $H^O$           $H^S$
benchmark        [0.27, 0.29]    [0.24, 0.31]    [0.25, 0.31]
high school      [0.29, 0.30]    [0.26, 0.33]    [0.27, 0.30]
girl             [0.25, 0.27]    [0.23, 0.29]    [0.23, 0.28]
unmarried        [0.29, 0.31]    [0.26, 0.34]    [0.27, 0.32]
black            [0.24, 0.29]    [0.21, 0.32]    [0.21, 0.33]
no pre-visit     [0.34, 0.41]    [0.31, 0.45]    [0.31, 0.46]
smoke            [0.22, 0.27]    [0.20, 0.30]    [0.19, 0.29]
older            [0.29, 0.31]    [0.26, 0.34]    [0.26, 0.32]

First Percentile in Kilograms
Subsample        $I^N$           $I^O$           $I^S$
benchmark        [1.46, 1.56]    [1.30, 1.72]    [1.35, 1.68]
high school      [1.48, 1.52]    [1.31, 1.69]    [1.43, 1.68]
girl             [1.50, 1.59]    [1.35, 1.74]    [1.43, 1.72]
unmarried        [1.20, 1.30]    [0.99, 1.51]    [1.10, 1.48]
black            [0.99, 1.39]    [0.80, 1.58]    [0.79, 1.58]
no pre-visit     [0.00, 0.63]    [0.00, 1.08]    [0.00, 1.09]
smoke            [1.34, 1.59]    [1.21, 1.72]    [1.27, 1.71]
older            [1.30, 1.45]    [1.13, 1.62]    [1.24, 1.62]

Table 4: The 95% confidence intervals for the tail index ($H$) and the first percentile ($I$) of the birth weight. The superscript $N$ stands for the naïve interval in (14) and (16). The superscript $O$ stands for the interval in (11) and (12) with $r = 1$. The superscript $S$ stands for the interval with snooping in (15) and (17). See the main text for details about the data and the definitions of subsamples.

Table 4 presents the results for the 2002 sample. The results for other years are similar and hence omitted to save space. Key empirical findings can be summarized as follows. First, $H^{O}$ and $H^{S}$ are similar in length for the tail index, while $I^{O}$ tends to be slightly longer than $I^{S}$ for the first percentile. Second, both are substantially longer than the naïve intervals $H^{N}$ and $I^{N}$, suggesting that ignoring the bias can lead to misleadingly short intervals. Third, compared with the benchmark subsample, mothers who had no prenatal visit during pregnancy bear a substantially higher risk of extremely low infant birth weights. This observation remains true even after accounting for the possible bias.

8 Summary

In this paper, we present two theoretical results concerning uniform confidence intervals for the tail index and extreme quantiles. First, we show that it is impossible to construct a length-optimal confidence interval satisfying the correct uniform coverage over the local non-parametric family of tail distributions. Second, in light of the impossibility result, we construct an honest confidence interval that is uniformly valid by accounting for the worst-case bias over the local non-parametric class. Simulation studies support our theoretical results: while the naïve length-optimal confidence interval suffers from severe under-coverage, our proposed confidence intervals achieve correct coverage. Applying the proposed method to the National Vital Statistics data from the National Center for Health Statistics, we find that, even after accounting for the worst-case bias bound, having no prenatal visit during pregnancy remains a strong risk factor for low infant birth weight. This demonstrates that, despite the impossibility result, robust yet informative statistical inference about the tail index and extreme quantiles is possible.

References

  • Abrevaya, J. (2001): "The effects of demographics and maternal behavior on the distribution of birth outcomes," Empirical Economics, 26, 247–257.
  • Armstrong, T. B. and M. Kolesár (2018): "Optimal inference in a class of regression models," Econometrica, 86, 655–683.
  • Armstrong, T. B. and M. Kolesár (2020): "Simple and honest confidence intervals in nonparametric regression," Quantitative Economics, 11, 1–39.
  • Bull, A. D. and R. Nickl (2013): "Adaptive confidence sets in $L^2$," Probability Theory and Related Fields, 156, 889–919.
  • Cai, T. T. and M. G. Low (2004): "An adaptation theory for nonparametric confidence intervals," Annals of Statistics, 32, 1805–1840.
  • Carpentier, A. (2013): "Honest and adaptive confidence sets in $L_p$," Electronic Journal of Statistics, 7, 2875–2923.
  • Carpentier, A. and A. K. H. Kim (2014a): "Adaptive and minimax optimal estimation of the tail coefficient," Statistica Sinica, 25, 1133–1144.
  • Carpentier, A. and A. K. H. Kim (2014b): "Adaptive confidence intervals for the tail coefficient in a wide second order class of Pareto models," Electronic Journal of Statistics, 8, 2066–2110.
  • Cheng, S. and L. Peng (2001): "Confidence intervals for the tail index," Bernoulli, 7, 751–760.
  • Danielsson, J., L. de Haan, L. Peng, and C. G. de Vries (2001): "Using a bootstrap method to choose the sample fraction in tail index estimation," Journal of Multivariate Analysis, 76, 226–248.
  • Danielsson, J., L. M. Ergun, L. de Haan, and C. G. de Vries (2016): "Tail index estimation: Quantile driven threshold selection," Working Paper.
  • de Haan, L. and A. Ferreira (2006): Extreme Value Theory: An Introduction, Springer.
  • Drees, H. (1998a): "A general class of estimators of the extreme value index," Journal of Statistical Planning and Inference, 66, 95–112.
  • Drees, H. (1998b): "On smooth statistical tail functionals," Scandinavian Journal of Statistics, 25, 187–210.
  • Drees, H. (2001): "Minimax risk bounds in extreme value theory," Annals of Statistics, 29, 266–294.
  • Drees, H. and E. Kaufmann (1998): "Selecting the optimal sample fraction in univariate extreme value estimation," Stochastic Processes and their Applications, 75, 149–172.
  • Drees, H., S. I. Resnick, and L. de Haan (2000): "How to make a Hill plot," Annals of Statistics, 28, 254–274.
  • Fedotenkov, I. (2020): "A review of more than one hundred Pareto-tail index estimators," Statistica, 80, 245–299.
  • Geluk, J. L. and L. Peng (2000): "An adaptive optimal estimate of the tail index for MA(1) time series," Statistics & Probability Letters, 46, 217–227.
  • Genovese, C. and L. Wasserman (2008): "Adaptive confidence bands," Annals of Statistics, 36, 875–905.
  • Gomes, M. I. and A. Guillou (2015): "Extreme value theory and statistics of univariate extremes: A review," International Statistical Review, 83, 263–292.
  • Guillou, A. and P. Hall (2001): "A diagnostic for selecting the threshold in extreme value analysis," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63, 293–305.
  • Haeusler, E. and J. Segers (2007): "Assessing confidence intervals for the tail index by Edgeworth expansions for the Hill estimator," Bernoulli, 13, 175–194.
  • Hall, P. and A. H. Welsh (1985): "Adaptive estimates of parameters of regular variation," Annals of Statistics, 13, 331–341.
  • Hill, B. M. (1975): "A simple general approach to inference about the tail of a distribution," Annals of Statistics, 3, 1163–1174.
  • Hoffmann, M. and R. Nickl (2011): "On adaptive inference and confidence bands," Annals of Statistics, 39, 2383–2409.
  • Li, K.-C. (1989): "Honest confidence regions for nonparametric regression," Annals of Statistics, 17, 1001–1008.
  • Low, M. G. (1997): "On nonparametric confidence intervals," Annals of Statistics, 25, 2547–2554.
  • Lu, J.-C. and L. Peng (2002): "Likelihood based confidence intervals for the tail index," Extremes, 5, 337–352.
  • Peng, L. and Y. Qi (2006): "A new calibration method of constructing empirical likelihood-based confidence intervals for the tail index," Australian & New Zealand Journal of Statistics, 48, 59–66.
  • Reiss, R.-D. (1989): Approximate Distributions of Order Statistics, Springer, New York.
  • Resnick, S. I. (2007): Heavy-Tail Phenomena: Probabilistic and Statistical Modeling, Springer Science & Business Media.
  • Resnick, S. I. and C. Stărică (1997): "Smoothing the Hill estimator," Advances in Applied Probability, 29, 271–293.
  • van de Geer, S., P. Bühlmann, Y. Ritov, and R. Dezeure (2014): "On asymptotically optimal confidence regions and tests for high-dimensional models," Annals of Statistics, 42, 1166–1202.
  • Weissman, I. (1978): "Estimation of parameters and large quantiles based on the k largest observations," Journal of the American Statistical Association, 73, 812–815.
  • Wu, Y., L. Wang, and H. Fu (2021): "Model-assisted uniformly honest inference for optimal treatment regimes in high dimension," Journal of the American Statistical Association, forthcoming.

Supplementary Appendix

On Uniform Confidence Intervals for the Tail Index and the Extreme Quantile

Appendix A Proofs

A.1 Proof of Theorem 1

We need the following two auxiliary lemmas to prove Theorem 1. Throughout, suppose that the distributions $F_{n,h}$ and $F_0$ are absolutely continuous with density functions denoted by $f_{n,h}$ and $f_0$, respectively.

Lemma 1

We have

$$\int\left(\frac{f_{n,h}^{1/2}(y)}{f_0^{1/2}(y)} - 1\right)^2 f_0(y)\, dy = o(1).$$
Proof of Lemma 1.

By the definitions of $F_{n,h}^{-1}$ and $F_0^{-1}$, we can write

$$f_{n,h}\left(F_{n,h}^{-1}(1-t)\right) = \frac{t}{\left(\xi_0 + d_n h(t_n^{-1} t)\right) F_{n,h}^{-1}(1-t)} \qquad\text{and}\qquad f_0\left(F_{n,h}^{-1}(1-t)\right) = \left[F_{n,h}^{-1}(1-t)\right]^{-1-1/\xi_0}/\xi_0.$$

Hence, it follows that

$$\frac{f_0\left(F_{n,h}^{-1}(1-t)\right)}{f_{n,h}\left(F_{n,h}^{-1}(1-t)\right)} = \left(1 + \xi_0^{-1} d_n h(t_n^{-1} t)\right) \frac{\left[F_{n,h}^{-1}(1-t)\right]^{-1/\xi_0}}{t} = \left(1 + \xi_0^{-1} d_n h(t_n^{-1} t)\right) \exp\left(-\frac{d_n}{\xi_0}\int_t^1 \frac{h(t_n^{-1} v)}{v}\, dv\right).$$

The change of variables $y = F_{n,h}^{-1}(1 - t_n t)$ yields

$$\begin{aligned}
&\int\left(\frac{f_{n,h}^{1/2}(y)}{f_0^{1/2}(y)} - 1\right)^2 f_0(y)\, dy \\
&= t_n \int_0^{t_n^{-1}}\left(\frac{f_{n,h}^{1/2}\left(F_{n,h}^{-1}(1 - t_n t)\right)}{f_0^{1/2}\left(F_{n,h}^{-1}(1 - t_n t)\right)} - 1\right)^2 \frac{f_0\left(F_{n,h}^{-1}(1 - t_n t)\right)}{f_{n,h}\left(F_{n,h}^{-1}(1 - t_n t)\right)}\, dt \\
&= t_n \int_0^{t_n^{-1}}\left[\exp\left(\frac{d_n}{2\xi_0}\int_t^{t_n^{-1}} \frac{h(s)}{s}\, ds\right) - \left(1 + d_n \frac{h(t)}{\xi_0}\right)^{1/2}\right]^2 \exp\left(-\frac{d_n}{\xi_0}\int_t^{t_n^{-1}} \frac{h(s)}{s}\, ds\right) dt \\
&\overset{(i)}{=} t_n \int_0^{t_n^{-1}}\left[1 - \left(1 + d_n \frac{h(t)}{\xi_0}\right)^{1/2} + \frac{d_n}{2\xi_0}\int_t^{t_n^{-1}} \frac{h(s)}{s}\, ds + o(d_n)\right]^2 \left(1 - \frac{d_n}{\xi_0}\int_t^{t_n^{-1}} \frac{h(s)}{s}\, ds + o(d_n)\right) dt \\
&\overset{(ii)}{=} t_n \int_0^{t_n^{-1}}\left[\frac{d_n}{2\xi_0}\int_t^{t_n^{-1}} \frac{h(s)}{s}\, ds - d_n \frac{h(t)}{2\xi_0} + o(d_n)\right]^2 \left(1 + o(1)\right) dt \\
&\overset{(iii)}{=} O(d_n t_n) \overset{(iv)}{=} o(1),
\end{aligned}$$

where equality (i) follows from $\exp(x) = 1 + x + o(x)$ as $x \rightarrow 0$, equality (ii) follows from $(1+x)^{1/2} = 1 + x/2 + o(x)$ as $x \rightarrow 0$, equality (iii) follows from the assumptions that $h$ is uniformly bounded and square integrable for all $h \in \mathcal{H}(A, \rho)$, and equality (iv) follows from the fact that $d_n t_n \approx A t_n^{\rho+1}$ with $t_n \rightarrow 0$ (under Condition 1) and $\rho > 0$. ∎


Lemma 2

The Hellinger distance $H^2(f_0, f_{n,h})$ between $f_0$ and $f_{n,h}$ satisfies

$$H^2\left(f_0, f_{n,h}\right) \equiv \int\left(\sqrt{f_0(y)} - \sqrt{f_{n,h}(y)}\right)^2 dy = \frac{\|h\|^2}{4 n \xi_0^2}\left(1 + o(1)\right). \qquad (18)$$
Proof of Lemma 2.

First, Proposition 2.1 in Drees (2001) yields

$$\int\left[n^{1/2}\left(f_{n,h}^{1/2}(y) - f_0^{1/2}(y)\right) - \frac{1}{2}\, g_{n,h}(y)\, f_0^{1/2}(y)\right]^2 dy = o(1), \qquad (19)$$

where

$$g_{n,h}(y) \equiv \frac{n^{1/2} d_n}{\xi_0}\int_{y^{-1/\xi_0}}^{\infty} \frac{h(t_n^{-1} s)}{s}\, ds - h\left(t_n^{-1} y^{-1/\xi_0}\right), \qquad y \ge 1.$$

Note by Drees (2001, p.286) that $g_{n,h}$ satisfies

$$\int g_{n,h}(x)\, f_0(x)\, dx = 0 \qquad (20)$$

and

$$\int g_{n,h}^2(x)\, f_0(x)\, dx \rightarrow \frac{\|h\|^2}{\xi_0^2}. \qquad (21)$$

Therefore, by expanding the square in (19), the equality (18) follows once we establish

$$\int f_{n,h}^{1/2}(y)\, f_0^{1/2}(y)\, g_{n,h}(y)\, dy = o(1).$$

This equality follows as

 fn1/2(y)f01/2(y)gn,h(y)𝑑y\displaystyle\text{ \ \ }\int f_{n}^{1/2}\left(y\right)f_{0}^{1/2}\left(y\right)g_{n,h}\left(y\right)dy
=(fn1/2(y)f01/2(y)1)f0(x)gn,h(y)𝑑y\displaystyle=\int\left(\frac{f_{n}^{1/2}\left(y\right)}{f_{0}^{1/2}\left(y\right)}-1\right)f_{0}\left(x\right)g_{n,h}\left(y\right)dy
(i)((fn1/2(y)f01/2(y)1)2f0(y)𝑑y)1/2(gn,h2(y)f0(y)𝑑y)1/2\displaystyle\overset{(i)}{\leq}\left(\int\left(\frac{f_{n}^{1/2}\left(y\right)}{f_{0}^{1/2}\left(y\right)}-1\right)^{2}f_{0}\left(y\right)dy\right)^{1/2}\left(\int g_{n,h}^{2}\left(y\right)f_{0}\left(y\right)dy\right)^{1/2}
=(ii)o(1)hξ0(1+o(1))1/2=o(1),\displaystyle\overset{(ii)}{=}o(1)\frac{\left|\left|h\right|\right|}{\xi_{0}}(1+o(1))^{1/2}=o(1)\text{,}

where the first equality follows from (20), inequality (i) follows from the Cauchy–Schwarz inequality, and equality (ii) follows from Lemma 1 and (21). ∎


Proof of Theorem 1.

We first use Lemma 2 to translate the Hellinger distance between $f_{0}$ and $f_{n,h}$ into the $L_{1}$-distance between the $n$-fold product densities $f_{0}^{n}$ and $f_{n,h}^{n}$. By Equation (17) in Low (1997),

L_{1}\left(f_{0}^{n},f_{n,h}^{n}\right)\equiv\int\left|f_{0}^{n}\left(y^{\left(n\right)}\right)-f_{n,h}^{n}\left(y^{\left(n\right)}\right)\right|dy^{\left(n\right)}
\leq 2\left(2-2\exp\left(-\frac{\left|\left|h\right|\right|^{2}}{8\xi_{0}^{2}}\right)\right)^{1/2}\left(1+o(1)\right)
\leq 2\left(2-2\exp\left(-\frac{\varepsilon^{2}}{8\xi_{0}^{2}}\right)\right)^{1/2}\left(1+o(1)\right)
\equiv C\left(\varepsilon,\xi_{0}\right)\left(1+o(1)\right).
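Although it plays no role in the formal argument, the $L_{1}$-versus-Hellinger inequality invoked here is easy to check numerically. The sketch below (our own illustration, with assumed values of $\xi_{0}$, $\xi_{1}$, and $n$) uses two nearby Pareto distributions, for which the Hellinger affinity $\rho=\int\sqrt{f_{0}f_{1}}\,dy=2\sqrt{\xi_{0}\xi_{1}}/(\xi_{0}+\xi_{1})$ is available in closed form, and confirms by Monte Carlo that $L_{1}(f_{0}^{n},f_{1}^{n})\leq 2(2-2\rho^{n})^{1/2}$, the product-measure bound underlying the display above.

```python
import numpy as np

rng = np.random.default_rng(0)
xi0, xi1, n, reps = 0.50, 0.52, 200, 20000   # assumed tail indices and sample size

# Hellinger affinity of f_xi(y) = y**(-1/xi - 1)/xi on [1, inf), in closed form
rho = 2 * np.sqrt(xi0 * xi1) / (xi0 + xi1)
bound = 2 * np.sqrt(2 - 2 * rho ** n)        # L1 bound for the n-fold product measures

# Monte Carlo: L1(f0^n, f1^n) = E_{f0^n} | 1 - prod_i f1(Y_i)/f0(Y_i) |
Y = rng.uniform(size=(reps, n)) ** (-xi0)    # i.i.d. draws from f0
loglik = np.log(xi0 / xi1) + (1 / xi0 - 1 / xi1) * np.log(Y)
L1_mc = np.mean(np.abs(1 - np.exp(loglik.sum(axis=1))))
print(L1_mc, bound)                          # the estimate stays below the bound
```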

Next, let $\delta>0$ be arbitrary. By the definition of $\omega\left(\varepsilon,F_{0}^{-1},n\right)$, there exists an $h\in\mathcal{H}_{\rho}\left(\overline{h}\right)$ with $\left|\left|h\right|\right|\leq\varepsilon$ such that

\left|T_{r}\left(F_{n,h}^{-1}\right)-T_{r}\left(F_{0}^{-1}\right)\right|\geq\omega\left(\varepsilon,F_{0}^{-1},n\right)-\delta.

Let $F_{n,\lambda h}^{-1}=Q_{n,\overline{k}_{n},F_{n,\lambda h}}$ as shorthand notation. Also, let $f_{n,\lambda h}$ and $\mathbb{P}_{F_{n,\lambda h}^{n}}$ denote the corresponding density and probability measure, respectively. Since $H\left(Y^{\left(n\right)}\right)$ has coverage probability at least $1-\beta$ uniformly over $f_{n,\lambda h}$, it follows that, for any $\lambda\in\left[0,1\right]$,

Fn,λhn(Tr(Fn,λh1)H(Y(n)))1β\mathbb{P}_{F_{n,\lambda h}^{n}}\left(T_{r}\left(F_{n,\lambda h}^{-1}\right)\in H\left(Y^{\left(n\right)}\right)\right)\geq 1-\beta

and

λhλhλε.\left|\left|\lambda h\right|\right|\leq\lambda\left|\left|h\right|\right|\leq\lambda\varepsilon.

Then, we have

\mathbb{P}_{F_{0}^{n}}\left(T_{r}\left(F_{n,\lambda h}^{-1}\right)\in H\left(Y^{\left(n\right)}\right)\right)
=\mathbb{E}_{F_{0}^{n}}\left[1\left[T_{r}\left(F_{n,\lambda h}^{-1}\right)\in H\left(Y^{\left(n\right)}\right)\right]\right]
=\mathbb{E}_{F_{n,\lambda h}^{n}}\left[1\left[T_{r}\left(F_{n,\lambda h}^{-1}\right)\in H\left(Y^{\left(n\right)}\right)\right]\left(1+\frac{f_{0}^{n}-f_{n,\lambda h}^{n}}{f_{n,\lambda h}^{n}}\right)\right]
\geq 1-\beta+\mathbb{E}_{F_{n,\lambda h}^{n}}\left[1\left[T_{r}\left(F_{n,\lambda h}^{-1}\right)\in H\left(Y^{\left(n\right)}\right)\right]\frac{f_{0}^{n}-f_{n,\lambda h}^{n}}{f_{n,\lambda h}^{n}}\right]
\geq 1-\beta-\int\left|f_{n,\lambda h}^{n}\left(y\right)-f_{0}^{n}\left(y\right)\right|dy
\geq 1-\beta-\lambda C\left(\varepsilon,\xi_{0}\right).

By the same lines of argument as in equations (12)–(15) in Low (1997), we obtain

\mathbb{E}_{F_{0}^{n}}\left[\mu\left(H\left(Y^{\left(n\right)}\right)\right)\right]
=\mathbb{E}_{F_{0}^{n}}\left[\int_{\mathbb{R}}1\left[t\in H\left(y^{\left(n\right)}\right)\right]dt\right]
\geq\mathbb{E}_{F_{0}^{n}}\int_{0}^{1}\left(T_{r}\left(F_{n,h}^{-1}\right)-T_{r}\left(F_{0}^{-1}\right)\right)1\left[T_{r}\left(F_{n,\lambda h}^{-1}\right)\in H\left(y^{\left(n\right)}\right)\right]d\lambda
=\left(T_{r}\left(F_{n,h}^{-1}\right)-T_{r}\left(F_{0}^{-1}\right)\right)\int_{0}^{1}\mathbb{P}_{F_{0}^{n}}\left(T_{r}\left(F_{n,\lambda h}^{-1}\right)\in H\left(Y^{\left(n\right)}\right)\right)d\lambda
\geq\left(T_{r}\left(F_{n,h}^{-1}\right)-T_{r}\left(F_{0}^{-1}\right)\right)\int_{0}^{1}\left(1-\beta-\lambda C\left(\varepsilon,\xi_{0}\right)\right)d\lambda
\geq\left(\omega\left(\varepsilon,F_{0}^{-1},n\right)-\delta\right)\left(1-\beta-\frac{C\left(\varepsilon,\xi_{0}\right)}{2}\right).

The inequality (6) now follows since $\delta$ is arbitrary. ∎

A.2 Proof of Theorem 2

Proof.

First, we approximate the empirical tail quantile function $Q_{n,r\bar{k}_{n},F_{n,h}}$ by a partial-sum process of standard exponential random variables. Let $\eta_{i}$, $i=1,2,\ldots$, be i.i.d. standard exponential random variables, let $S_{i}\equiv\sum_{j=1}^{i}\eta_{j}$, and let $\tilde{Q}_{n,r\bar{k}_{n},F}\equiv F^{-1}\left(1-S_{\left[\bar{k}_{n}r\right]+1}/n\right)$ for $r\in(0,1]$. Let $\mathbb{P}_{n,h}^{n}$ denote the joint distribution of $n$ i.i.d. draws from the distribution $F_{n,h}$. By Reiss (1989, Theorem 5.4.3) – see also Drees (2001, eq.(5.12)) – the variational distance between the distribution of $Q_{n,r\bar{k}_{n},F_{n,h}}$ under $\mathbb{P}_{n,h}^{n}$ and the distribution of $\tilde{Q}_{n,r\bar{k}_{n},F}$ vanishes uniformly as $n\rightarrow\infty$, that is,

||(Qn,rk¯n,Fn,h|n,hn)(Q~n,rk¯n,F)||=O(k¯n/n)=o(1),\left|\left|\mathcal{L}\left(Q_{n,r\bar{k}_{n},F_{n,h}}|\mathbb{P}_{n,h}^{n}\right)-\mathcal{L}\left(\tilde{Q}_{n,r\bar{k}_{n},F}\right)\right|\right|=O\left(\bar{k}_{n}/n\right)=o(1), (22)

which implies that

||(Tr(Qn,rk¯n,Fn,h)|n,hn)(Tr(Q~n,rk¯n,F))||=o(1)\left|\left|\mathcal{L}\left(T_{r}\left(Q_{n,r\bar{k}_{n},F_{n,h}}\right)|\mathbb{P}_{n,h}^{n}\right)-\mathcal{L}\left(T_{r}\left(\tilde{Q}_{n,r\bar{k}_{n},F}\right)\right)\right|\right|=o(1) (23)

uniformly for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$. Recall from Section 2.2 that $T_{r}\left(z\right)=r^{-1}\int_{0}^{r}\log\left(z\left(t\right)/z\left(r\right)\right)dt$ characterizes Hill's estimator. Hence it suffices to approximate $\tilde{Q}_{n,r\bar{k}_{n},F}$. To this end, we employ a strong approximation of $\tilde{Q}_{n,r\bar{k}_{n},F}$, normalized by $F_{n,h}^{-1}\left(1-\overline{k}_{n}/n\right)$. Specifically, using Drees (2001, eq.(5.13)), we obtain

suph(A,ρ)supr(0,1]rξ0+1/2+ε|Q~n,rk¯n,FFn,h1(1k¯n/n)(rξ0ξ0r(ξ0+1)W(k¯nr)k¯n\displaystyle\sup_{h\in\mathcal{H}\left(A,\rho\right)}\sup_{r\in(0,1]}r^{\xi_{0}+1/2+\varepsilon}\Bigg{|}\frac{\tilde{Q}_{n,r\bar{k}_{n},F}}{F_{n,h}^{-1}\left(1-\overline{k}_{n}/n\right)}-\Bigg{(}r^{-\xi_{0}}-\xi_{0}r^{-\left(\xi_{0}+1\right)}\frac{W\left(\overline{k}_{n}r\right)}{\overline{k}_{n}}
+dnrξ0r1h(v)vdv)|=o(k¯n1/2) a.s.\displaystyle+d_{n}r^{-\xi_{0}}\int_{r}^{1}\frac{h\left(v\right)}{v}dv\Bigg{)}\Bigg{|}=o\left(\overline{k}_{n}^{-1/2}\right)\text{ a.s.} (24)

for all $\varepsilon\in\left(0,1/2\right)$, where $W\left(\cdot\right)$ denotes the standard Wiener process.
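As a concrete sanity check of the claim that $T_{r}$ characterizes Hill's estimator (an illustration of our own, not part of the proof), one can evaluate $T_{r}(z)=r^{-1}\int_{0}^{r}\log(z(t)/z(r))dt$ at the empirical tail quantile function $Q(t)=Y_{n:n-\lfloor\bar{k}_{n}t\rfloor}$ by numerical integration and compare it with the Hill estimator computed directly from the top $\lfloor r\bar{k}_{n}\rfloor$ order statistics; the two agree up to grid error.

```python
import numpy as np

rng = np.random.default_rng(0)
xi0, n, k = 0.5, 10000, 400                      # assumed sample size and threshold

Y = np.sort(rng.uniform(size=n) ** (-xi0))       # Pareto sample: P(Y > y) = y**(-1/xi0)

def Q(t):
    # empirical tail quantile function Q(t) = Y_{n:n - floor(k t)}, t in (0, 1]
    return Y[n - 1 - np.floor(k * np.asarray(t)).astype(int)]

def T(r, m=400000):
    # T_r(z) = r^{-1} int_0^r log(z(t)/z(r)) dt via a midpoint grid on (0, r)
    t = (np.arange(m) + 0.5) * r / m
    return np.mean(np.log(Q(t) / Q(r)))

for r in (1.0, 0.5):
    j = int(np.floor(r * k))
    hill = np.mean(np.log(Y[n - j:] / Y[n - j - 1]))   # Hill with the top j order stats
    print(r, T(r), hill)
```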

In the second step, we exploit two features of the functional $T_{r}\left(\cdot\right)$: scale invariance and Hadamard differentiability. Since $T_{r}\left(\cdot\right)$ is scale invariant, with $T_{r}\left(az\right)=T_{r}\left(z\right)$ for any constant $a>0$, we have

Tr(Q~n,rk¯n,FFn,h1(1k¯n/n))=Tr(Q~n,rk¯n,F).T_{r}\left(\frac{\tilde{Q}_{n,r\bar{k}_{n},F}}{F_{n,h}^{-1}\left(1-\overline{k}_{n}/n\right)}\right)=T_{r}\left(\tilde{Q}_{n,r\bar{k}_{n},F}\right).

Moreover, $T_{r}\left(\cdot\right)$ is Hadamard differentiable at $z_{\xi_{0}}:t\longmapsto t^{-\xi_{0}}$ (cf. Drees, 1998b, Condition 3) in the sense that

Tr(zξ0+dnyn)Tr(zξ0)dnTr(y)\frac{T_{r}\left(z_{\xi_{0}}+d_{n}y_{n}\right)-T_{r}\left(z_{\xi_{0}}\right)}{d_{n}}\rightarrow T_{r}^{\prime}\left(y\right) (25)

uniformly for all functions $y_{n}$ with $\sup_{t\in(0,1]}t^{\xi_{0}+1/2+\varepsilon}\left|y_{n}\left(t\right)\right|\leq 1$ and $y_{n}\rightarrow y$. To derive the expression of $T_{r}^{\prime}\left(\cdot\right)$, we write

Tr(zξ0+dnyn)Tr(zξ0)=\displaystyle T_{r}\left(z_{\xi_{0}}+d_{n}y_{n}\right)-T_{r}\left(z_{\xi_{0}}\right)= r10rlog(tξ0+dnyn(t)rξ0+dnyn(r)rξ0tξ0)𝑑t\displaystyle r^{-1}\int_{0}^{r}\log\left(\frac{t^{-\xi_{0}}+d_{n}y_{n}\left(t\right)}{r^{-\xi_{0}}+d_{n}y_{n}\left(r\right)}\frac{r^{-\xi_{0}}}{t^{-\xi_{0}}}\right)dt
=\displaystyle= r10rlog(1+dnxn(t))𝑑t,\displaystyle r^{-1}\int_{0}^{r}\log\left(1+d_{n}x_{n}\left(t\right)\right)dt,

where

xn(t)=tξ0yn(t)rξ0yn(r)1+dnrξ0yn(r)tξ0y(t)rξ0y(r).x_{n}\left(t\right)=\frac{t^{\xi_{0}}y_{n}\left(t\right)-r^{\xi_{0}}y_{n}\left(r\right)}{1+d_{n}r^{\xi_{0}}y_{n}\left(r\right)}\rightarrow t^{\xi_{0}}y\left(t\right)-r^{\xi_{0}}y\left(r\right).

Following the derivation in Drees (1998a, p.104), we obtain

T_{r}^{\prime}\left(y\right)=r^{-1}\int_{0}^{r}\left(t^{\xi_{0}}y\left(t\right)-r^{\xi_{0}}y\left(r\right)\right)dt.
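This derivative formula can be checked by finite differences. For the test direction $y(t)=t^{-\xi_{0}}(1-t)$ (our choice purely for illustration; it satisfies the weighted-sup constraint above), the formula gives $T_{r}^{\prime}(y)=r^{-1}\int_{0}^{r}\left((1-t)-(1-r)\right)dt=r/2$, while $r^{-1}\int_{0}^{r}\log(r/t)\,dt=1$ gives $T_{r}(z_{\xi_{0}})=\xi_{0}$; the following sketch confirms both numerically.

```python
import numpy as np

xi0, r, d, m = 0.5, 0.8, 1e-4, 200000
t = (np.arange(m) + 0.5) * r / m            # midpoint grid on (0, r)

def T(z):
    # T_r(z) = r^{-1} int_0^r log(z(t)/z(r)) dt via the midpoint rule
    return np.mean(np.log(z(t) / z(r)))

z0 = lambda s: s ** (-xi0)                  # z_{xi0}(s) = s^{-xi0}
y = lambda s: s ** (-xi0) * (1.0 - s)       # test direction with T'_r(y) = r/2

print(T(z0), xi0)                           # T_r(z_{xi0}) = xi0 = 0.5
print((T(lambda s: z0(s) + d * y(s)) - T(z0)) / d, r / 2)   # both ≈ 0.4
```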

In the final step, we substitute $z_{\xi_{0}}\left(t\right)=t^{-\xi_{0}}$, $d_{n}\approx\bar{k}_{n}^{-1/2}$, and

yn(t)=ξ0t(ξ0+1)W(k¯nt)k¯n1/2+tξ0t1h(v)v𝑑v.y_{n}\left(t\right)=\xi_{0}t^{-\left(\xi_{0}+1\right)}\frac{W\left(\overline{k}_{n}t\right)}{\overline{k}_{n}^{1/2}}+t^{-\xi_{0}}\int_{t}^{1}\frac{h\left(v\right)}{v}dv.

By (24), (25), and the functional delta method, we have

\sup_{h\in\mathcal{H}_{\rho}}\sup_{r\in(0,1]}\left|T_{r}\left(\tilde{Q}_{n,r\bar{k}_{n},F}\right)-\left(\xi_{0}+\bar{k}_{n}^{-1/2}\xi_{0}T_{r}^{\prime}\left(z_{\xi_{0}+1}W_{n}\right)+d_{n}T_{r}^{\prime}\left(t^{-\xi_{0}}\int_{t}^{1}\frac{h\left(v\right)}{v}dv\right)\right)\right|=o\left(\overline{k}_{n}^{-1/2}\right)\text{ a.s.} (26)

where $W_{n}:t\longmapsto-\bar{k}_{n}^{-1/2}W\left(\bar{k}_{n}t\right)$. Using the definition of $T_{r}^{\prime}\left(\cdot\right)$, we obtain

Tr(zξ0+1Wn)=1r0r(s1W(s)r1W(r))𝑑s𝔾(r)T_{r}^{\prime}\left(z_{\xi_{0}+1}W_{n}\right)=\frac{1}{r}\int_{0}^{r}\left(s^{-1}W\left(s\right)-r^{-1}W\left(r\right)\right)ds\equiv\mathbb{G}\left(r\right) (27)

and

T_{r}^{\prime}\left(t^{-\xi_{0}}\int_{t}^{1}\frac{h\left(v\right)}{v}dv\right)-h\left(0\right)=r^{-1}\int_{0}^{r}\left(\int_{s}^{1}\frac{h\left(v\right)}{v}dv-\int_{r}^{1}\frac{h\left(v\right)}{v}dv\right)ds-h\left(0\right)
=\displaystyle= r10r(srh(v)v𝑑v)𝑑sh(0)\displaystyle r^{-1}\int_{0}^{r}\left(\int_{s}^{r}\frac{h\left(v\right)}{v}dv\right)ds-h\left(0\right)
=\displaystyle= r10r(srh(v)h(0)v𝑑v)𝑑s\displaystyle r^{-1}\int_{0}^{r}\left(\int_{s}^{r}\frac{h\left(v\right)-h\left(0\right)}{v}dv\right)ds
\displaystyle\equiv 𝔹(r;h),\displaystyle\mathbb{B}\left(r;h\right), (28)

where we used the equality $r^{-1}\int_{0}^{r}\left(\log r-\log s\right)ds=1$. Thus, in view of $\xi_{n,h}=\xi_{0}+d_{n}h\left(0\right)$ and combining (26)–(28), we find

T_{r}\left(\tilde{Q}_{n,r\bar{k}_{n},F}\right)-\xi_{n,h}=\bar{k}_{n}^{-1/2}\xi_{0}\mathbb{G}\left(r\right)+d_{n}\mathbb{B}\left(r;h\right)+o_{\text{a.s.}}(\overline{k}_{n}^{-1/2}),

where $o_{\text{a.s.}}(\cdot)$ is uniform for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$. We conclude from (23) that

suph(A,ρ)supr[r¯,1]supx|n,hn(Tr(Qn,rk¯n,Fn,h)ξn,hdnx)(ξ0𝔾(r)+𝔹(r;h)x)|\displaystyle\sup_{h\in\mathcal{H}\left(A,\rho\right)}\sup_{r\in[\underline{r},1]}\sup_{x\in\mathbb{R}}\left|\mathbb{P}_{n,h}^{n}\left(T_{r}\left({Q}_{n,r\bar{k}_{n},F_{n,h}}\right)-\xi_{n,h}\leq d_{n}x\right)-\mathbb{P}\left(\xi_{0}\mathbb{G}\left(r\right)+\mathbb{B}\left(r;h\right)\leq x\right)\right|
0.\displaystyle\rightarrow 0.

This implies that

k¯n1/2(Tr(Qn,rk¯n,Fn,h)ξn,h)=ξ0𝔾(r)+𝔹(r;h)+on,hn(1),\overline{k}_{n}^{1/2}\left(T_{r}\left(Q_{n,r\bar{k}_{n},F_{n,h}}\right)-\xi_{n,h}\right)=\xi_{0}\mathbb{G}\left(r\right)+\mathbb{B}\left(r;h\right)+o_{\mathbb{P}_{n,h}^{n}}\left(1\right),

where $o_{\mathbb{P}_{n,h}^{n}}\left(1\right)$ is uniform for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$. This completes the proof. ∎
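Although the proof is complete, we note that the limit process $\mathbb{G}$ in (27) is straightforward to simulate, which is one way critical values for uniform-in-$r$ inference can be tabulated. The sketch below (our own illustration; the grid size and the truncation of the integral near the origin are discretization choices) draws paths of $\mathbb{G}$ and checks that $\mathrm{Var}(\mathbb{G}(r))=1/r$, consistent with the $N(0,\xi_{0}^{2}/r)$ limit of the Hill estimator based on $\lfloor r\bar{k}_{n}\rfloor$ order statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
m, reps = 20000, 5000                       # grid size on (0, 1] and number of paths
s = np.arange(1, m + 1) / m                 # grid points s_j = j/m (origin truncated)
ds = 1.0 / m
r_grid = np.array([0.25, 0.5, 1.0])
idx = (r_grid * m).astype(int) - 1

G = np.empty((reps, r_grid.size))
for b in range(reps):
    W = np.cumsum(rng.normal(scale=np.sqrt(ds), size=m))  # Wiener path on the grid
    I = np.cumsum(W / s) * ds                             # int_0^s u^{-1} W(u) du
    G[b] = (I[idx] - W[idx]) / r_grid                     # G(r) = r^{-1}(I(r) - W(r))

print(G.var(axis=0), 1.0 / r_grid)          # empirical variances ≈ (4, 2, 1)
# quantiles of max_r |G(r)| over an r-grid would give uniform critical values
```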

A.3 Proof of Theorem 3

Proof.

We use the notation $g_{n}=\overline{k}_{n}/\left(np_{n}\right)$. Condition 2 guarantees that $g_{n}\rightarrow\infty$. Using the tail quantile process approximation (24), we obtain

rk¯nlog(gn)(F^n,h1(1pn)Fn,h1(1pn)1)\displaystyle\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\left(\frac{\hat{F}_{n,h}^{-1}\left(1-p_{n}\right)}{F_{n,h}^{-1}\left(1-p_{n}\right)}-1\right)
=\displaystyle= rk¯nlog(gn)(Yn,nrk¯nFn,h1(1rk¯n/n)gnξ^(n,rk¯n)Fn,h1(1rk¯n/n)Fn,h1(1pn)1)\displaystyle\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\left(\frac{Y_{n,n-\left\lfloor r\overline{k}_{n}\right\rfloor}}{F_{n,h}^{-1}\left(1-r\overline{k}_{n}/n\right)}g_{n}^{\hat{\xi}\left(n,r\overline{k}_{n}\right)}\frac{F_{n,h}^{-1}\left(1-r\overline{k}_{n}/n\right)}{F_{n,h}^{-1}\left(1-p_{n}\right)}-1\right)
=\displaystyle= gnξn,hFn,h1(1rk¯n/n)Fn,h1(1pn){rk¯nlog(gn)(gnξ^(n,rk¯n)ξn,h1)\displaystyle g_{n}^{\xi_{n,h}}\frac{F_{n,h}^{-1}\left(1-r\overline{k}_{n}/n\right)}{F_{n,h}^{-1}\left(1-p_{n}\right)}\left\{\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\left(g_{n}^{\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}}-1\right)\right.
+rk¯nlog(gn)(Yn,nrk¯nFn,h1(1rk¯n/n)1)gnξ^(n,rk¯n)ξn,h\displaystyle+\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\left(\frac{Y_{n,n-\lfloor r\overline{k}_{n}\rfloor}}{F_{n,h}^{-1}\left(1-r\overline{k}_{n}/n\right)}-1\right)g_{n}^{\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}}
rk¯nlog(gn)(Fn,h1(1pn)Fn,h1(1rk¯n/n)gnξn,h1)}\displaystyle\left.-\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\left(\frac{F_{n,h}^{-1}\left(1-p_{n}\right)}{F_{n,h}^{-1}\left(1-r\overline{k}_{n}/n\right)}g_{n}^{-\xi_{n,h}}-1\right)\right\}
\displaystyle\equiv C1n(C2n+C3nC4n).\displaystyle C_{1n}\left(C_{2n}+C_{3n}-C_{4n}\right).

We aim to establish

C1n=\displaystyle C_{1n}= 1+o(1),\displaystyle 1+o(1), (29)
C2n=\displaystyle C_{2n}= rk¯n(ξ^(n,rk¯n)ξn,h)+on,hn(1),\displaystyle\sqrt{r\overline{k}_{n}}\left(\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}\right)+o_{\mathbb{P}^{n}_{n,h}}(1), (30)
C3n=\displaystyle C_{3n}= on,hn(1),and\displaystyle o_{\mathbb{P}^{n}_{n,h}}(1),\qquad\text{and} (31)
C4n=\displaystyle C_{4n}= o(1),\displaystyle o(1), (32)

where the $o(1)$ and $o_{\mathbb{P}^{n}_{n,h}}(1)$ terms are all uniform over both $h\in\mathcal{H}\left(A,\rho\right)$ and $r\in[\underline{r},1]$.

First, Condition 1 yields that $r\overline{k}_{n}/\left(nt_{n}\right)\rightarrow r$. Thus, (29) follows from

C1n\displaystyle C_{1n} =(i)exp(rk¯n/n1dn(h(tn1v)h(0))v𝑑vpn1dn(h(tn1v)h(0))v𝑑v)\displaystyle\overset{(i)}{=}\exp\left(\int_{r\overline{k}_{n}/n}^{1}\frac{d_{n}\left(h\left({t_{n}^{-1}}v\right)-h\left(0\right)\right)}{v}dv-\int_{p_{n}}^{1}\frac{d_{n}\left(h\left({t_{n}^{-1}}v\right)-h\left(0\right)\right)}{v}dv\right)
=exp(dnmin{rk¯n/n,pn}max{rk¯n/n,pn}h(tn1v)h(0)v𝑑v)\displaystyle=\exp\left(d_{n}\int_{\min\{r\overline{k}_{n}/n,p_{n}\}}^{\max\{r\overline{k}_{n}/n,p_{n}\}}\frac{h\left({t_{n}^{-1}}v\right)-h\left(0\right)}{v}dv\right)
\overset{(ii)}{=}\exp\left(d_{n}\int_{\min\{r\overline{k}_{n}/(nt_{n}),\,p_{n}t_{n}^{-1}\}}^{\max\{r\overline{k}_{n}/(nt_{n}),\,p_{n}t_{n}^{-1}\}}\frac{h\left(s\right)-h\left(0\right)}{s}ds\right)
exp(dn01|h(s)h(0)|s𝑑s)\displaystyle\leq\exp\left(d_{n}\int_{0}^{1}\frac{\left|h\left(s\right)-h\left(0\right)\right|}{s}ds\right)
(iii)exp(dn01Asρs𝑑s)\displaystyle\overset{(iii)}{\leq}\exp\left(d_{n}\int_{0}^{1}\frac{As^{\rho}}{s}ds\right)
=(iv)1+O(dn),\displaystyle\overset{(iv)}{=}1+O(d_{n}),

where equality (i) follows from (4), equality (ii) follows from the change of variables $s=t_{n}^{-1}v$, the subsequent inequality holds for all sufficiently large $n$ because $r\overline{k}_{n}/\left(nt_{n}\right)\rightarrow r\leq 1$ and $p_{n}t_{n}^{-1}\rightarrow 0$ (by Conditions 1 and 2), inequality (iii) follows from $\left|h\left(s\right)-h\left(0\right)\right|\leq As^{\rho}$ for $h\in\mathcal{H}\left(A,\rho\right)$, and equality (iv) follows from $d_{n}\rightarrow 0$ and $\exp\left(a_{n}\right)=1+a_{n}+O(a_{n}^{2})$ for any sequence $a_{n}\rightarrow 0$.

Next, consider $C_{2n}$. Note that $(\hat{\xi}(n,r\overline{k}_{n})-\xi_{n,h})\log g_{n}=\sqrt{r\overline{k}_{n}}(\hat{\xi}(n,r\overline{k}_{n})-\xi_{n,h})\times\log g_{n}/\sqrt{r\overline{k}_{n}}=o_{\mathbb{P}^{n}_{n,h}}(1)$ uniformly for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$ by Theorem 2 and Condition 2. Then, using $\exp(a_{n})=1+a_{n}+O(a_{n}^{2})$ for any generic sequence $a_{n}\rightarrow 0$, we have that

rk¯nlog(gn)(gnξ^(n,rk¯n)ξn,h1)\displaystyle\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\left(g_{n}^{\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}}-1\right)
=\displaystyle= rk¯n(ξ^(n,rk¯n)ξn,h)exp((ξ^(n,rk¯n)ξn,h)loggn)1(ξ^(n,rk¯n)ξn,h)loggn\displaystyle\sqrt{r\overline{k}_{n}}\left(\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}\right)\frac{\exp\left(\left(\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}\right)\log g_{n}\right)-1}{\left(\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}\right)\log g_{n}}
=\displaystyle= rk¯n(ξ^(n,rk¯n)ξn,h)(1+O((ξ^(n,rk¯n)ξn,h)loggn))\displaystyle\sqrt{r\overline{k}_{n}}\left(\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}\right)\left(1+O\left(\left(\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}\right)\log g_{n}\right)\right)
=\displaystyle= rk¯n(ξ^(n,rk¯n)ξn,h)(1+on,hn(1))\displaystyle\sqrt{r\overline{k}_{n}}\left(\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}\right)\left(1+o_{\mathbb{P}^{n}_{n,h}}(1)\right)

uniformly for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$. This yields (30).

Next, consider $C_{3n}$. The equivalence (22) and the tail quantile approximation (24) imply that

rk¯n(Yn,nrk¯nFn,h1(1rk¯n/n)1)=On,hn(1)\sqrt{r\overline{k}_{n}}\left(\frac{Y_{n,n-\lfloor r\overline{k}_{n}\rfloor}}{F_{n,h}^{-1}\left(1-r\overline{k}_{n}/n\right)}-1\right)=O_{\mathbb{P}^{n}_{n,h}}(1)

uniformly for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$. The previous step deriving (30) also implies that $g_{n}^{\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}}=1+o_{\mathbb{P}^{n}_{n,h}}(1)$ uniformly over the same ranges. Therefore, $C_{3n}=\left(\log g_{n}\right)^{-1}O_{\mathbb{P}^{n}_{n,h}}(1)\left(1+o_{\mathbb{P}^{n}_{n,h}}(1)\right)=o_{\mathbb{P}^{n}_{n,h}}(1)$ uniformly for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$, implying (31).

Finally, consider $C_{4n}$. We have

C4n\displaystyle C_{4n} =(i)rk¯nlog(gn)(exp(dnmin{pn,rk¯n/n}max{pn,rk¯n/n}(h(tn1v)h(0))v𝑑v)1)\displaystyle\overset{(i)}{=}\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\left(\exp\left(d_{n}\int_{\min\{p_{n},r\overline{k}_{n}/n\}}^{\max\{p_{n},r\overline{k}_{n}/n\}}\frac{\left(h\left({t_{n}^{-1}}v\right)-h\left(0\right)\right)}{v}dv\right)-1\right)
=(ii)rk¯nlog(gn)×O(dn)\displaystyle\overset{(ii)}{=}\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\times O\left(d_{n}\right)
=(iii)o(1)\displaystyle\overset{(iii)}{=}o(1)

uniformly for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$, where equality (i) follows from substituting (4), equality (ii) follows from the derivation of (29), and equality (iii) follows since $\sqrt{r\overline{k}_{n}}d_{n}=O(1)$ by Condition 1 and $\log(g_{n})\rightarrow\infty$ by Condition 2. The proof is complete upon combining (29)–(32). ∎
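To make the objects in this decomposition concrete, the following sketch (illustrative only: it fixes $r=1$ and uses an exactly Pareto sample, so that $h=0$ and the bias term vanishes) computes the extreme quantile estimator $\hat{F}^{-1}(1-p_{n})=Y_{n,n-\overline{k}_{n}}\,g_{n}^{\hat{\xi}(n,\overline{k}_{n})}$ implicit in the first display of this proof, and compares it with the true quantile $F^{-1}(1-p_{n})$.

```python
import numpy as np

rng = np.random.default_rng(0)
xi0, n, k, p = 0.5, 100000, 500, 1e-5        # p = p_n with n*p_n = 1 (far in the tail)

Y = np.sort(rng.uniform(size=n) ** (-xi0))   # Pareto sample with F^{-1}(1 - q) = q**(-xi0)

hill = np.mean(np.log(Y[n - k:] / Y[n - k - 1]))   # Hill estimator xi-hat(n, k)
g = k / (n * p)                                    # g_n = k/(n p_n) = 500
q_hat = Y[n - k - 1] * g ** hill                   # Y_{n,n-k} * g_n**(xi-hat)
print(q_hat, p ** (-xi0))                          # estimate vs. true quantile ≈ 316.23
```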

Appendix B: Choice of $\overline{k}_{n}$

In this appendix section, we present a data-driven choice rule for $\overline{k}_{n}$, following Guillou and Hall (2001), for completeness and for the reader's convenience. For other methods of choosing the tail threshold, see, for example, Drees and Kaufmann (1998), Geluk and Peng (2000), Danielsson et al. (2001), and Danielsson et al. (2016). See also Resnick (2007, Chapter 4.4) for a review.

We use the shorthand notation $k=\overline{k}_{n}$ in this section. Define $Z_{i}=i\log(Y_{n:n-i+1}/Y_{n:n-i})$ for $i=1,\ldots,n-1$. If $Y_{i}$ is exactly Pareto distributed with exponent $1/\xi_{0}$, then the $Z_{i}$ are i.i.d. exponentially distributed with $\mathbb{E}\left[Z_{i}\right]=\xi_{0}$, by Rényi's representation of exponential order statistics. Given this observation, we further construct antisymmetric weights $\{w_{j}\}_{j=1}^{k}$ such that $w_{j}=-w_{k-j+1}$ and hence $\sum_{j=1}^{k}w_{j}=0$. Then, the statistic

𝒯k=(j=1kwj2)1/2ξ^(n,k)1Uk where Uk=j=1kwjZj\mathcal{T}_{k}=\left(\sum_{j=1}^{k}w_{j}^{2}\right)^{-1/2}\hat{\xi}(n,k)^{-1}U_{k}\text{ where }U_{k}=\sum_{j=1}^{k}w_{j}Z_{j}

has zero mean and unit variance, provided that $Y_{i}$ is exactly Pareto distributed and $\hat{\xi}(n,k)=\xi_{0}$.

To evaluate deviations from this approximation, we define the following criterion based on a moving average of $\mathcal{T}_{k}^{2}$:

𝒞k=((2l+1)1j=ll𝒯k+j2)1/2,\mathcal{C}_{k}=\left(\left(2l+1\right)^{-1}\sum_{j=-l}^{l}\mathcal{T}_{k+j}^{2}\right)^{1/2},

where $l$ equals the integer part of $k/2$. Intuitively, the larger $k$ is, the larger the bias in the Pareto tail approximation, and hence the more $\mathcal{C}_{k}$ exceeds one. To obtain an implementable rule, we follow Guillou and Hall (2001) in using $w_{j}=\mathrm{sgn}\left(k-2j+1\right)\left|k-2j+1\right|$ and propose to choose the smallest $k$ satisfying $\mathcal{C}_{t}>c_{\text{crit}}$ for all $t\geq k$, where $c_{\text{crit}}$ is a pre-specified constant. Again following Guillou and Hall (2001), we set $c_{\text{crit}}=1.25$. For convenience of reference, we write the rule explicitly as

k^=min1kn{k:𝒞t>ccrit for all tk}.\hat{k}=\min_{1\leq k\leq n}\{k:\mathcal{C}_{t}>c_{\text{crit}}\text{ for all }t\geq k\}. (33)
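For readers who wish to implement (33), the following sketch codes the rule end to end. It is our own translation into Python under stated simplifications: $\mathrm{sgn}(k-2j+1)\left|k-2j+1\right|$ is computed as $k-2j+1$, the search is capped at a user-chosen k_max for speed, and the moving-average window is truncated at the boundaries.

```python
import numpy as np

def guillou_hall_k(Y, c_crit=1.25, k_max=None):
    """Data-driven threshold k-hat as in (33); a minimal sketch."""
    Y = np.sort(np.asarray(Y, dtype=float))[::-1]      # descending order statistics
    n = len(Y)
    k_max = n - 2 if k_max is None else min(k_max, n - 2)
    i = np.arange(1, n)
    Z = i * np.log(Y[:-1] / Y[1:])                     # Z_i = i log(Y_{n:n-i+1}/Y_{n:n-i})
    xi_hat = np.cumsum(Z) / i                          # Hill estimates, xi_hat[k-1] = xi-hat(n, k)

    T = np.full(k_max + 1, np.nan)
    for k in range(2, k_max + 1):
        w = k - 2.0 * np.arange(1, k + 1) + 1          # antisymmetric weights, sum to zero
        T[k] = (w @ Z[:k]) / (np.sqrt(w @ w) * xi_hat[k - 1])

    C = np.full(k_max + 1, np.nan)
    for k in range(2, k_max + 1):
        l = k // 2
        block = T[max(2, k - l): min(k_max, k + l) + 1]
        C[k] = np.sqrt(np.mean(block ** 2))            # moving average of T^2 (truncated window)

    exceed = C > c_crit                                # NaN entries compare as False
    all_above = np.logical_and.accumulate(exceed[::-1])[::-1]  # C_t > c_crit for all t >= k
    hits = np.flatnonzero(all_above)
    return int(hits[0]) if hits.size else k_max

# Example on a simulated Pareto sample with xi0 = 0.5:
rng = np.random.default_rng(0)
print(guillou_hall_k(rng.uniform(size=5000) ** (-0.5), k_max=1000))
```

The returned $\hat{k}$ can then be plugged into the Hill estimator $\hat{\xi}(n,\hat{k})$ used throughout the paper.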