
On Uniform Confidence Intervals for the Tail Index and the Extreme Quantile

Yuya Sasaki (Vanderbilt University; VU Station B #351819, 2301 Vanderbilt Place, Nashville, TN 37235-1819, USA)
Yulong Wang (Syracuse University; 110 Eggers Hall, Syracuse, NY 13244-1020, USA)
(October 10, 2022)
Abstract

This paper presents two results concerning uniform confidence intervals for the tail index and the extreme quantile. First, we show that it is impossible to construct a length-optimal confidence interval satisfying the correct uniform coverage over a local non-parametric family of tail distributions. Second, in light of the impossibility result, we construct honest confidence intervals that are uniformly valid by incorporating the worst-case bias in the local non-parametric family. The proposed method is applied to simulated data and a real data set of National Vital Statistics from the National Center for Health Statistics.

Keywords: honest confidence interval, extreme quantile, impossibility, tail index, uniform inference

1 Introduction

Suppose that one is interested in constructing a confidence interval (CI) for the true tail index $\xi_0 \in \mathbb{R}^+$ of a distribution. To define this parameter, assume that the distribution function (d.f.), denoted by $F$, has a well-defined density function $f$ and satisfies the well-known von Mises condition:

$$g(x) \equiv \frac{1 - F(x)}{x f(x)} \rightarrow \xi \qquad (1)$$

as $x \rightarrow \infty$, where $\xi$ is called the tail index. Point-wise CIs for $\xi_0$ have been proposed by a number of papers, including Cheng and Peng (2001), Lu and Peng (2002), Peng and Qi (2006), and Haeusler and Segers (2007). More recently, Carpentier and Kim (2014b) develop an adaptive CI for $\xi_0$ that is uniformly valid over a parametric family of tail distributions indexed by the second-order parameter.

Since the seminal paper by Hill (1975), the literature has investigated numerous estimators for $\xi_0$ as well as their asymptotic properties; see the recent reviews by Gomes and Guillou (2015) and Fedotenkov (2020). Drees (2001) obtains the worst-case optimal convergence rate, i.e., the min-max bound, in estimation over a local non-parametric family of tail distributions. Carpentier and Kim (2014a) construct an adaptive and minimax optimal estimator over the parametric family of second-order Pareto-type distributions. Motivated by these results, a natural question is whether a length-optimal CI for $\xi_0$ can achieve uniformly correct coverage probabilities over the non-parametric family of tail distributions. To the best of our knowledge, the existing literature has not answered this important question, although analogous problems have been investigated in other important contexts in statistics, e.g., for CIs of non-parametric densities (e.g., Low, 1997; Hoffmann and Nickl, 2011; Bull and Nickl, 2013; Carpentier, 2013), non-parametric regressions (e.g., Li, 1989; Genovese and Wasserman, 2008), and high-dimensional regression models (e.g., van de Geer et al., 2014; Wu et al., 2021), to list but a few.

In this paper, we first show that it is in fact impossible to construct a length-optimal CI for the true tail index $\xi_0$ satisfying the uniformly correct coverage over the local non-parametric family considered by Drees (2001). Specifically, any CI enjoying the uniform coverage property can be no shorter than the worst-case bias over the non-parametric family, up to a constant. This negative result is analogous to those of Low (1997) and Genovese and Wasserman (2008) in the contexts of non-parametric densities and regressions, but is novel in the context of the tail index.

Second, we derive the asymptotic distribution of Hill's estimator uniformly over the local non-parametric family of tail distributions. Given the above impossibility result, it is imperative to account for the possible range of bias over the non-parametric family. Hence, we construct an honest CI for the tail index that is locally uniformly valid by incorporating the worst-case bias over the local non-parametric family, in addition to the asymptotic randomness of the estimator. We also demonstrate that this honest CI for the tail index extends to one for extreme quantiles.

Simulation studies support our theoretical predictions. While the naïve length-optimal CI, which does not account for the possible range of bias over the local non-parametric family, suffers from severe under-coverage, our proposed CIs achieve correct coverage. We apply the proposed method to the National Vital Statistics data from the National Center for Health Statistics, and construct CIs for quantiles of extremely low infant birth weights across a variety of mothers' demographic characteristics and maternal behaviors. We find that, even after accounting for the possible range of bias over the local non-parametric family, having no prenatal visit during pregnancy remains a strong risk factor for low infant birth weight.

Organization: Section 2 introduces the setup, notations, and definitions. Section 3 establishes the impossibility result. Section 4 proposes a uniformly valid CI. Section 5 presents an application to extreme quantiles. Sections 6 and 7 present simulation studies and real data analyses, respectively. Section 8 summarizes the paper. Mathematical proofs are collected in the Appendix.

2 Setup, Notations, and Definitions

2.1 Non-parametric Families in the Tail

Any distribution function $F$ satisfying the von Mises condition (1) can be equivalently characterized in the right tail in terms of its inverse $F^{-1}$ by

$$F^{-1}(1-t) = c\, t^{-\xi} \exp\left(\int_t^1 \frac{\eta(s)}{s}\, ds\right)$$

for some $c > 0$ and $\eta(s) = g\left(F^{-1}(1-s)\right) - \xi$, where $\eta(s)$ tends to zero as $s \rightarrow 0$. The standard Pareto distribution falls in this family as a trivial special case, in which $\eta$ is the zero function and $g(x) = \xi$ for all $x$.

To maintain a non-parametric setup in statistical inference about the true tail index $\xi_0$, we follow Drees (2001) and consider the following family of d.f.'s:

$$\mathcal{F}(\xi_0, c, \varepsilon, u) \equiv \left\{ F \text{ is a d.f. } \middle|\; \begin{array}{c} F^{-1}(1-t) = c\, t^{-\xi} \exp\left(\int_t^1 \frac{\eta(s)}{s}\, ds\right), \\ |\xi - \xi_0| \le \varepsilon,\;\; |\eta(t)| \le u(t),\;\; t \in (0,1] \end{array} \right\}$$

where $\xi_0 > \varepsilon > 0$, $c > 0$, and $u(t) = A t^{\rho}$ for some constants $A, \rho > 0$.

Two remarks are in order about this family $\mathcal{F}(\xi_0, c, \varepsilon, u)$. First, let $F_0$ denote the Pareto distribution function with true tail index $\xi_0 > 0$, i.e., $F_0^{-1}(1-t) = t^{-\xi_0}$. This $F_0$ is the center of localization of the family $\mathcal{F}(\xi_0, c, \varepsilon, u)$. The factor $\exp\left(\int_t^1 \frac{\eta(s)}{s}\, ds\right)$ represents a deviation from this center. If we set $\eta = u$, then the model essentially becomes parametric, since the deviation is fully determined by $u(t) = A t^{\rho}$ and hence by the constants $A$ and $\rho$. In this parametric setup, Hall and Welsh (1985) establish the optimal uniform rates of convergence over a family of d.f.'s with densities of the form $f(x) = c x^{-(1/\xi + 1)}(1 + r(x))$, where $|r(x)| \le A x^{-\rho/\xi}$. In contrast, we consider a non-parametric family like $\mathcal{F}(\xi_0, c, \varepsilon, u)$, in which the function $u(\cdot)$ serves merely as an upper bound for deviations from the center. We focus on the classic estimator by Hill (1975). Since Hill's estimator is scale invariant, we hereafter assume $c = 1$ without loss of generality and for brevity.

Second, using the fact that $\int_t^1 \frac{\eta(0)}{s}\, ds = -\eta(0) \log t$, we can rewrite the quantile characterization $F^{-1}(1-t) = t^{-\xi_0} \exp\left(\int_t^1 \frac{\eta(s)}{s}\, ds\right)$ of an element $F \in \mathcal{F}(\xi_0, c, \varepsilon, u)$ as

$$t^{-\xi_0} \exp\left(\int_t^1 \frac{\eta(s)}{s}\, ds\right) = t^{-\xi_0 - \eta(0)} \exp\left(\int_t^1 \frac{\eta(s) - \eta(0)}{s}\, ds\right). \qquad (2)$$

To construct uniformly valid inference, we again follow Drees (2001) and consider a sequence of families of data-generating processes with tail index converging to $\xi_0$ at a rate $d_n := u(t_n) \rightarrow 0$ for some $t_n \rightarrow 0$ as $n \rightarrow \infty$. Specifically, we consider a sequence $t_n$ satisfying

$$\lim_{n \rightarrow \infty} u(t_n)\, (n t_n)^{1/2} = 1. \qquad (3)$$

This sequence essentially entails the optimal choice of the tuning parameter (de Haan and Ferreira, 2006, p.77), which we introduce later. We thus obtain a drifting sequence of local families consisting of $(n,h)$-indexed elements $F_{n,h}$ whose quantile functions satisfy

$$F_{n,h}^{-1}(1-t) \equiv t^{-\xi_0} \exp\left(\int_t^1 \frac{d_n h(t_n^{-1} v)}{v}\, dv\right) = t^{-(\xi_0 + d_n h(0))} \exp\left(\int_t^1 \frac{d_n \left(h(t_n^{-1} v) - h(0)\right)}{v}\, dv\right), \quad t \in (0,1], \qquad (4)$$

where the second equality is due to (2). The corresponding tail index is given by $\xi_{n,h} = \xi_0 + d_n h(0)$, and $h(\cdot) - h(0)$ characterizes the local deviation from the standard Pareto distribution.

Given the above reparametrization, we translate the local non-parametric family $\mathcal{F}(\xi_0, c, \varepsilon, u)$ of d.f.'s into a family of deviation functions $h$. Setting the upper bound to $u(s) = A s^{\rho}$, we consider the family

$$\mathcal{H}(A, \rho) \equiv \left\{ h \in L_2[0,1] \;\middle|\; \sup_{s>0} |h(s)| < \infty,\;\; |h(s) - h(0)| \le A s^{\rho},\; s > 0 \right\},$$

which contains all square integrable functions $h(\cdot)$ that are uniformly bounded and satisfy the bound $|h(s) - h(0)| \le A s^{\rho}$. The family $\mathcal{H}(A, \rho)$ induces a local counterpart of the non-parametric family $\mathcal{F}(\xi_0, 1, \varepsilon, u)$, namely

$$\mathcal{F}_n(\xi_0, \mathcal{H}(A,\rho), u) = \left\{ F_{n,h} \text{ is a d.f. } \middle|\; \begin{array}{c} F_{n,h}^{-1}(1-t) = t^{-(\xi_0 + d_n h(0))} \exp\left(\int_t^1 \frac{\eta(v) - \eta(0)}{v}\, dv\right), \\ \eta(v) = d_n h(t_n^{-1} v),\;\; d_n = u(t_n),\;\; h \in \mathcal{H}(A,\rho) \end{array} \right\}$$

indexed by $t_n^{-1} \rightarrow \infty$ as a function of the sample size $n$. Under this local reparametrization, $h(\cdot)$ represents a deviation function, and the associated tail index is $\xi_{n,h} = \xi_0 + d_n h(0)$.

2.2 Hill’s Estimator

Let $\{Y_1, \ldots, Y_n\}$ be a random sample from $F_{n,h}$, and let $Y_{n:n-j}$ denote the $(j+1)$-th largest order statistic in this sample. With these notations, Hill's estimator is defined by

$$\hat{\xi}(n,k) = \frac{1}{k} \sum_{j=0}^{k-1} \left[\log\left(Y_{n:n-j}\right) - \log\left(Y_{n:n-k}\right)\right].$$

In practice, a researcher often implements this estimator for an interval of values of the tuning parameter $k$ to demonstrate ad hoc robustness. This common practice can be formally accommodated by allowing for a sequence of intervals, i.e., $k = k_n = r \overline{k}_n$ for $r \in [\underline{r}, 1] \subset (0,1]$, similarly to Drees, Resnick, and de Haan (2000).
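For concreteness, the following minimal Python sketch (using only numpy) implements Hill's estimator and evaluates it over a range of tuning parameters; the function name hill_estimator and the simulated Pareto sample are ours, for illustration only.

```python
import numpy as np

def hill_estimator(y, k):
    """Hill's estimator xi_hat(n, k) based on the k largest order statistics."""
    y_desc = np.sort(np.asarray(y, dtype=float))[::-1]  # Y_{n:n}, Y_{n:n-1}, ...
    # (1/k) * sum_{j=0}^{k-1} [log Y_{n:n-j} - log Y_{n:n-k}]
    return np.mean(np.log(y_desc[:k])) - np.log(y_desc[k])

# Example: evaluate over k = r * k_bar for r in [1/2, 1], mimicking the common
# practice of inspecting an interval of tuning-parameter values.
rng = np.random.default_rng(0)
y = rng.uniform(size=1000) ** -0.5        # standard Pareto sample with xi_0 = 0.5
k_bar = 100
estimates = {k: hill_estimator(y, k) for k in range(k_bar // 2, k_bar + 1)}
```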

Define the functional $T_r(z)$ by

$$T_r(z) \equiv r^{-1} \int_0^r \log\left(z(t)/z(r)\right) dt$$

for any measurable function $z : [0,1] \rightarrow \mathbb{R}$. If we substitute the quantile function of the standard Pareto distribution, that is, $z(t) = F_0^{-1}(1-t) = t^{-\xi_0}$, then we have

$$T_r\left(t^{-\xi_0}\right) = r^{-1}\int_0^r \log\left(t^{-\xi_0}/r^{-\xi_0}\right) dt = \xi_0\, r^{-1}\int_0^r \log(r/t)\, dt = \xi_0,$$

identifying the true $\xi_0$ for any $r \in (0,1]$, where the last equality uses $\int_0^r \log(r/t)\, dt = r$. Define the tail empirical quantile function

$$Q_{n, r\overline{k}_n, F_{n,h}} = Y_{n:n-\lfloor r\overline{k}_n \rfloor}$$

for $r \in (0,1]$, where $\lfloor \cdot \rfloor$ denotes the largest integer not exceeding its argument. With these auxiliary notations, as implied by Example 3.1 in Drees (1998a), Hill's estimator $\hat{\xi}(n, r\overline{k}_n)$ can be equivalently rewritten as $T_r\left(Q_{n, r\overline{k}_n, F_{n,h}}\right)$.

3 Asymptotic Impossibility

This section presents the first of the two main theoretical results of this paper. In light of the min-max result for estimation (cf. Drees, 2001), a researcher would naturally be interested in a length-optimal confidence interval satisfying uniform coverage over a non-parametric family, such as the one introduced in Section 2.1. This section shows that this objective is impossible to achieve.

Specifically, we aim to establish that the length of any confidence interval $H(Y^{(n)}) = H(Y_1, \ldots, Y_n)$ that covers $\xi_0$ uniformly for all $h \in \mathcal{H}(A, \rho)$ is no shorter than a constant multiple of the worst-case bias $\left|T_r(F_{n,h}^{-1}) - T_r(F_0^{-1})\right|$ over the non-parametric family $\mathcal{H}(A, \rho)$. This implies that we cannot find a length-optimal confidence interval satisfying the uniform coverage without accounting for this worst-case bias. Such an impossibility result parallels in spirit the one about non-parametric density estimation established by Low (1997) and the one about non-parametric regression estimation established by Genovese and Wasserman (2008), among others.

Define the modulus of continuity of $T_r(\cdot)$ by

$$\omega\left(\varepsilon, F_0^{-1}, n\right) \equiv \sup\left\{ \left|T_r\left(F_{n,h}^{-1}\right) - T_r\left(F_0^{-1}\right)\right| \;\middle|\; h \in \mathcal{H}(A, \rho),\;\; \|h - h(0)\| \le \varepsilon \right\}. \qquad (5)$$

This is the worst-case bias in absolute value. Let $F_0^n$ and $F_{n,h}^n$ denote the joint distributions of $n$ i.i.d. draws from $F_0$ and $F_{n,h}$, respectively, and let $\mathbb{E}_{F_0^n}[\cdot]$ denote the expectation with respect to the product measure $F_0^n$. The following theorem establishes the impossibility result: the expected length of a uniformly valid confidence interval is no shorter than a constant multiple of the modulus of continuity of $T_r(\cdot)$.

Theorem 1 (Impossibility)

For $\varepsilon > 0$, suppose that a confidence interval $H(Y^{(n)})$ for $\xi_0$ has coverage probability of at least $1 - \beta$ uniformly for all $h \in \mathcal{H}(A, \rho) \cap \{h : \|h - h(0)\| \le \varepsilon\}$. Then, there exist $N$ and $C$, depending only on $\varepsilon$ and $\xi_0$, such that $1 - \beta - C > 0$ and

$$\mathbb{E}_{F_0^n}\left[\mu\left(H\left(Y^{(n)}\right)\right)\right] \ge (1 - \beta - C)\, \omega\left(\varepsilon, F_0^{-1}, n\right) \qquad (6)$$

for all $n > N$.

As already mentioned above, this result is analogous to those established by Low (1997), Cai and Low (2004), Genovese and Wasserman (2008), and Armstrong and Kolesár (2018), to list but a few, in other important contexts of statistics, but it is novel in the context of the tail index. To understand the lower bound in (6), we now derive a concrete expression for the element in the definition (5) of the modulus of continuity $\omega\left(\varepsilon, F_0^{-1}, n\right)$. Note that

$$\begin{aligned}
T_r\left(F_{n,h}^{-1}\right) - T_r\left(F_0^{-1}\right)
&= r^{-1}\int_0^r \log\left(\frac{t^{-\xi_0}\exp\left(\int_t^1 \frac{d_n h(t_n^{-1} v)}{v}\, dv\right)}{r^{-\xi_0}\exp\left(\int_r^1 \frac{d_n h(t_n^{-1} v)}{v}\, dv\right)}\right) dt - \xi_0 \\
&= r^{-1}\int_0^r \left[-\xi_0 \log(t/r) + d_n\left(\int_t^1 \frac{h(t_n^{-1} v)}{v}\, dv - \int_r^1 \frac{h(t_n^{-1} v)}{v}\, dv\right)\right] dt - \xi_0 \\
&= d_n r^{-1}\int_0^r \left(\int_t^1 \frac{h(t_n^{-1} v)}{v}\, dv - \int_r^1 \frac{h(t_n^{-1} v)}{v}\, dv\right) dt \\
&= d_n r^{-1}\int_0^r \left(\int_t^r \frac{h(t_n^{-1} v)}{v}\, dv\right) dt \\
&= d_n r^{-1}\int_0^r \left(\int_t^r \frac{h(t_n^{-1} v) - h(0)}{v}\, dv\right) dt + d_n h(0).
\end{aligned}$$

The first term in the last line characterizes the bias due to the deviation from the standard Pareto distribution, and the second term characterizes the asymptotic randomness. As a consequence, to obtain a feasible and uniformly valid confidence interval, we will set an upper bound for the first term and adjust the critical value based on the second term. We obtain such a uniform confidence interval in the following section.

4 Uniform Confidence Interval

Given that $\rho > 0$, it has been established that the optimal rate of convergence of Hill's estimator is $n^{-\rho/(2\rho+1)}$; see, for example, Remark 3.2.7 in de Haan and Ferreira (2006). Such an optimal rate entails a non-negligible asymptotic bias, as characterized in Theorem 2 below. To achieve this rate, we let $\overline{k}_n \approx n t_n$, where $A \approx B$ means $\lim_{n \rightarrow \infty} A/B = 1$. Then, the restriction (3) implies that $u(t_n) \approx A^{1/(2\rho+1)} n^{-\rho/(2\rho+1)}$ and $\overline{k}_n^{1/2} u(t_n) = \overline{k}_n^{1/2} d_n \rightarrow 1$ as $n \rightarrow \infty$. We formally summarize these conditions below.

Condition 1

As $n \rightarrow \infty$, $A^{1/(2\rho+1)}\, \overline{k}_n^{1/2}\, n^{-\rho/(2\rho+1)} \rightarrow 1$.

As the second of the two main theoretical results of this paper, the following theorem derives the asymptotic distribution of $\hat{\xi}(n, r\overline{k}_n)$ uniformly for all $r \in [\underline{r}, 1]$ and $h \in \mathcal{H}(A, \rho)$ by exploiting the features of the functional $T_r(\cdot)$ defined in Section 2.2. Let $\mathbb{P}^n_{n,h}$ denote the distribution of $n$ i.i.d. draws from the distribution $F_{n,h}$.

Theorem 2 (Uniform Asymptotic Distribution)

If Condition 1 is satisfied, then

$$\sqrt{\overline{k}_n}\left(\hat{\xi}\left(n, r\overline{k}_n\right) - \xi_{n,h}\right) = \xi_0\, \mathbb{G}(r) + \mathbb{B}(r; h) + o_{\mathbb{P}^n_{n,h}}(1)$$

holds under random sampling from (4) uniformly for all $r \in [\underline{r}, 1]$ and $h \in \mathcal{H}(A, \rho)$, where $\mathbb{G}$ and $\mathbb{B}$ are defined in terms of a standard Brownian motion $W$ by

$$\mathbb{G}(r) \equiv r^{-1} \int_0^r \left(s^{-1} W(s) - r^{-1} W(r)\right) ds \qquad (7)$$

and

$$\mathbb{B}(r; h) \equiv r^{-1} \int_0^r \left(\int_s^r \frac{h(v) - h(0)}{v}\, dv\right) ds, \qquad (8)$$

respectively.

A proof can be found in the Appendix. The convergence of Hill's estimator as a function of $r$ has been established in the literature (e.g., Resnick and Stărică, 1997; Drees et al., 2000). In comparison, Theorem 2 contributes to the literature by establishing the asymptotic distribution uniformly over both $r \in [\underline{r}, 1]$ and the local non-parametric function class. The terms $\mathbb{G}(\cdot)$ and $\mathbb{B}(\cdot; h)$ characterize the asymptotic randomness and bias, respectively. It follows that

$$\sqrt{r\overline{k}_n}\left(\hat{\xi}\left(n, r\overline{k}_n\right) - \xi_{n,h}\right) \Rightarrow \xi_0 \sqrt{r}\, \mathbb{G}(r) + \sqrt{r}\, \mathbb{B}(r; h). \qquad (9)$$

To conduct statistical inference based on (9), we need to compute the bound

$$\sup_{r \in [\underline{r}, 1]}\; \sup_{h \in \mathcal{H}(A, \rho)} \sqrt{r}\, \mathbb{B}(r; h)$$

of the bias. To this end, note that $|h(v) - h(0)| \le A v^{\rho}$ for all $h \in \mathcal{H}(A, \rho)$. Therefore, for any $h \in \mathcal{H}(A, \rho)$,

$$\begin{aligned}
\left|\mathbb{B}(r; h)\right| &= \left| r^{-1} \int_0^r \left(\int_s^r \frac{h(v) - h(0)}{v}\, dv\right) ds \right| \\
&\le r^{-1} \int_0^r \left(\int_s^r \frac{|h(v) - h(0)|}{v}\, dv\right) ds \\
&\le r^{-1} \int_0^r \left(\int_s^r A v^{\rho - 1}\, dv\right) ds \\
&= A \frac{r^{\rho}}{1 + \rho}. \qquad (10)
\end{aligned}$$

This bound is tight; it is attained when, for example, $h(v) - h(0) = A v^{\rho} \ge 0$.

With this tight bias bound taken into account, in a spirit similar to Armstrong and Kolesár (2020), a locally uniformly valid confidence interval for the tail index is given by

$$H^{O}\left(n, r\overline{k}_n\right) = \left[\hat{\xi} - \frac{\hat{\xi}\, q_{1-\beta/2} + \overline{B}(A, \rho)}{\sqrt{r\overline{k}_n}},\;\; \hat{\xi} + \frac{\hat{\xi}\, q_{1-\beta/2} + \overline{B}(A, \rho)}{\sqrt{r\overline{k}_n}}\right] \qquad (11)$$

for $r \in [\underline{r}, 1]$, where $\hat{\xi}$ is short for $\hat{\xi}(n, r\overline{k}_n)$,

$$\overline{B}(A, \rho) \equiv \sup_{r \in (0,1]} \sqrt{r}\, \frac{A r^{\rho}}{1 + \rho} = \frac{A}{1 + \rho}$$

is the upper bound of the bias, and $q_{1-\beta/2}$ denotes the $1-\beta/2$ quantile of $\sup_{r \in [\underline{r}, 1]} \sqrt{r}\, \mathbb{G}(r)$, whose values can be found in Table 1.

$\underline{r}$ \ $\beta$    0.10    0.05    0.01
1            1.64    1.96    2.56
10/11        1.87    2.19    2.76
5/6          1.95    2.27    2.86
2/3          2.09    2.42    3.01
1/2          2.22    2.54    3.12
1/3          2.33    2.66    3.23
1/4          2.41    2.71    3.27
1/5          2.46    2.74    3.34
1/10         2.58    2.85    3.44
1/20         2.67    2.92    3.51
1/50         2.75    3.01    3.57
1/100        2.80    3.08    3.61

Table 1: The $1-\beta/2$ quantile of $\sup_{r \in [\underline{r}, 1]} \sqrt{r}\, \mathbb{G}(r)$, computed from 20,000 simulation draws. The Gaussian process is approximated with 50,000 steps.
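The quantiles in Table 1 can be approximated by simulating the Gaussian process directly. The following Python sketch is one such discretization; the function name, step count, and draw count are ours (Table 1 itself uses 50,000 steps and 20,000 draws), so treat it as illustrative rather than as the exact procedure behind the table.

```python
import numpy as np

def sup_rG_quantile(r_lo, beta, n_steps=5_000, n_draws=2_000, seed=0):
    """Approximate the 1 - beta/2 quantile of sup_{r in [r_lo, 1]} sqrt(r) * G(r),
    where G(r) = r^{-1} int_0^r (W(s)/s - W(r)/r) ds for a Brownian motion W."""
    rng = np.random.default_rng(seed)
    dt = 1.0 / n_steps
    s = dt * np.arange(1, n_steps + 1)          # grid s_1, ..., s_N = 1
    sups = np.empty(n_draws)
    for b in range(n_draws):
        W = np.cumsum(rng.normal(0.0, np.sqrt(dt), n_steps))  # W(s_i)
        integral = np.cumsum(W / s) * dt        # int_0^{s_i} W(v)/v dv
        G = integral / s - W / s                # G(r) on the grid r = s_i
        keep = s >= r_lo
        sups[b] = np.max(np.sqrt(s[keep]) * G[keep])
    return np.quantile(sups, 1.0 - beta / 2.0)

# e.g. sup_rG_quantile(1.0, 0.05) should be close to 1.96 (Table 1, r_lo = 1),
# and sup_rG_quantile(0.5, 0.05) close to 2.54 (Table 1, r_lo = 1/2).
```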

As a final remark, we discuss the choice of the higher-order parameters $\rho$ and $A$. Both depend on the underlying distribution and are hence unknown. While this feature may appear to be a disadvantage of our method, the impossibility result in Theorem 1 implies that it cannot be avoided, regardless of how we construct the interval. The existing literature has proposed several estimators of $\rho$ and $A$, whose consistency requires additional assumptions on the underlying function, and equivalently further restrictions on the class $\mathcal{H}$. See Carpentier and Kim (2014a), Cheng and Peng (2001), and Haeusler and Segers (2007) for some data-driven methods of choosing the higher-order parameters. The corresponding confidence intervals are no longer uniformly valid, but remain pointwise valid.

As an alternative, we propose a rule-of-thumb choice of $\rho$ and $A$ and proceed with the proposed interval. In particular, if the underlying distribution is Student-t with $1/\xi_0$ degrees of freedom, we know that $\rho = 2\xi_0$. Furthermore, for the bias upper bound $\overline{B} = A/(1+\rho)$, we set $A = 0.1\, \xi_0 (1 + 2\xi_0) \sqrt{\overline{k}_n}$, so that $\overline{B}/\sqrt{\overline{k}_n}$ is at most $10\%$ of the true tail index to be estimated. In practice, we replace $\xi_0$ with the estimator $\hat{\xi}(n, \overline{k}_n)$. This rule of thumb is reminiscent of Silverman's bandwidth choice in kernel density estimation, where the reference is the Gaussian distribution. We examine the performance of this rule by simulations in Section 6.
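Combining (11) with this rule of thumb, a minimal sketch of the resulting interval might look as follows; honest_ci_tail_index is our name, and hill_estimator refers to the earlier sketch in Section 2.2.

```python
import numpy as np

def honest_ci_tail_index(y, k_bar, q, r=1.0):
    """Honest CI (11) for the tail index with the rule-of-thumb bias bound.

    q is the 1 - beta/2 quantile from Table 1 (q = 1.96 for r = 1, beta = 0.05).
    Rule of thumb: rho = 2 * xi_hat and A = 0.1 * xi_hat * (1 + 2 xi_hat) * sqrt(r k_bar),
    so that B_bar / sqrt(r * k_bar) is 10% of the estimated tail index."""
    k = int(r * k_bar)
    xi = hill_estimator(y, k)                 # sketch from Section 2.2
    rho = 2.0 * xi                            # Student-t reference: rho = 2 * xi_0
    A = 0.1 * xi * (1.0 + 2.0 * xi) * np.sqrt(r * k_bar)
    B_bar = A / (1.0 + rho)                   # worst-case bias bound A / (1 + rho)
    half = (xi * q + B_bar) / np.sqrt(r * k_bar)
    return xi - half, xi + half
```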

5 Extreme Quantiles

In this section, we apply Theorem 2 to uniform inference about extreme quantiles. To characterize the extremeness, we focus on the sequence of $1 - p_n$ quantiles, where $p_n \rightarrow 0$ as $n \rightarrow \infty$. Consider the extreme quantile estimator of Weissman (1978):

$$\hat{F}_{n,h}^{-1}(1 - p_n) = Y_{n:n-\lfloor r\overline{k}_n \rfloor}\left(\frac{r\overline{k}_n}{n p_n}\right)^{\hat{\xi}(n, r\overline{k}_n)},$$

where $\hat{\xi}(n, k)$ is Hill's estimator. Recall that the true quantile under the local drifting sequence is

$$F_{n,h}^{-1}(1-t) = t^{-\xi_0} \exp\left(\int_t^1 \frac{d_n h(t_n^{-1} v)}{v}\, dv\right)$$

as in (4), where $t_n$ satisfies $\lim_{n \rightarrow \infty} \overline{k}_n/(n t_n) = 1$ as in Condition 1. We now aim to asymptotically approximate the distribution of

$$\frac{\hat{F}_{n,h}^{-1}(1 - p_n)}{F_{n,h}^{-1}(1 - p_n)} - 1.$$

To this end, we state the following condition on the relation among $n$, $\overline{k}_n$, and $p_n$ (e.g., de Haan and Ferreira, 2006, Theorem 4.3.1).

Condition 2

$n p_n = o\left(\overline{k}_n\right)$ and $\log(n p_n) = o\left(\sqrt{\overline{k}_n}\right)$ as $n \rightarrow \infty$.

Theorem 3 (Extreme Quantiles)

Under Conditions 1 and 2 with random sampling, uniformly for all $r \in [\underline{r}, 1]$ and $h \in \mathcal{H}(A, \rho)$,

$$\frac{\sqrt{r\overline{k}_n}}{\log\left(r\overline{k}_n/(n p_n)\right)}\left(\frac{\hat{F}_{n,h}^{-1}(1 - p_n)}{F_{n,h}^{-1}(1 - p_n)} - 1\right) - \sqrt{r\overline{k}_n}\left(\hat{\xi}\left(n, r\overline{k}_n\right) - \xi_{n,h}\right) = o_{\mathbb{P}^n_{n,h}}(1).$$

Theorem 3 implies that the asymptotic distribution of the extreme quantile estimator is the same as that of the tail index estimator. Therefore, a uniformly valid confidence interval can be constructed similarly. In particular, a robust confidence interval for $F_{n,h}^{-1}(1 - p_n)$ with nominal $1 - \beta$ uniform coverage probability accounting for the bias is constructed as

$$I^{O}(n,k) = \left[\hat{F}_{n,h}^{-1}(1-p_n)\left\{1 - \frac{\log d(n, r\overline{k}_n)}{\sqrt{r\overline{k}_n}}\left(\hat{\xi}\, q_{1-\beta/2} + \overline{B}(A,\rho)\right)\right\},\;\; \hat{F}_{n,h}^{-1}(1-p_n)\left\{1 + \frac{\log d(n, r\overline{k}_n)}{\sqrt{r\overline{k}_n}}\left(\hat{\xi}\, q_{1-\beta/2} + \overline{B}(A,\rho)\right)\right\}\right] \qquad (12)$$

where again $\hat{\xi}$ is short for $\hat{\xi}(n, r\overline{k}_n)$, $q_{1-\beta/2}$ denotes the quantile found in Table 1, and $d(n, r\overline{k}_n) = r\overline{k}_n/(n p_n)$.
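A sketch of the Weissman estimator and the interval (12), under the same caveats as the earlier sketches (function names are ours; hill_estimator is the sketch from Section 2.2):

```python
import numpy as np

def weissman_quantile(y, k, p):
    """Weissman (1978) estimator of the 1 - p quantile from the top k order statistics."""
    y_desc = np.sort(np.asarray(y, dtype=float))[::-1]
    xi = hill_estimator(y, k)
    return y_desc[k] * (k / (len(y) * p)) ** xi, xi

def honest_ci_quantile(y, k_bar, p, q, A, rho, r=1.0):
    """Honest CI (12) for the extreme 1 - p quantile."""
    k = int(r * k_bar)
    q_hat, xi = weissman_quantile(y, k, p)
    # half-width factor: log d(n, r k_bar) / sqrt(r k_bar) * (xi * q + B_bar)
    half = np.log(k / (len(y) * p)) / np.sqrt(r * k_bar) * (xi * q + A / (1.0 + rho))
    return q_hat * (1.0 - half), q_hat * (1.0 + half)
```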

6 Simulation Studies

In this section, we use simulated data to evaluate the finite-sample performance of our proposed confidence intervals in comparison with the naïve length-optimal confidence interval. Sections 6.1 and 6.2 focus on inference about the tail index and extreme quantiles, respectively.

6.1 Tail Index

The following simulation design is employed for our analysis. We generate $n$ independent standard uniform random variables $U_i$ and construct the observations $Y_i = F^{-1}(1 - U_i)$, where $F^{-1}(1-t) = t^{-\xi_0} \exp\left(\int_t^1 s^{-1} \eta(s)\, ds\right)$. We set $\eta(t) = c\, t^{\rho}$, so that

$$F^{-1}(1-t) = t^{-\xi_0} \exp\left(\frac{c\left(1 - t^{\rho}\right)}{\rho}\right), \qquad (13)$$

where the constants $c$ and $\rho$ characterize the scale and the shape, respectively, of the deviation from the Pareto distribution. For ease of comparison, we set $\rho = 2\xi_0$, which corresponds to the Student-t distribution with $1/\xi_0$ degrees of freedom. We then set $c = c_0 \xi_0/(1 + 2\xi_0)$ for normalization, where we vary $c_0 \in \{0, 0.5, 1\}$ across sets of simulations.
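Since (13) is an explicit quantile function, this design can be sampled by inversion. A minimal sketch, with our own function name:

```python
import numpy as np

def draw_sample(n, xi0, c0, rng):
    """Draw Y_i = F^{-1}(1 - U_i) from (13), with rho = 2 * xi0 and
    c = c0 * xi0 / (1 + 2 * xi0); c0 = 0 recovers the exact Pareto case."""
    rho = 2.0 * xi0
    c = c0 * xi0 / (1.0 + 2.0 * xi0)
    u = rng.uniform(size=n)                  # U_i ~ Uniform(0, 1)
    return u ** (-xi0) * np.exp(c * (1.0 - u ** rho) / rho)

y = draw_sample(1000, xi0=0.5, c0=1.0, rng=np.random.default_rng(0))
```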

To construct the optimal confidence interval, we need a choice of $\overline{k}_n$. We use the data-driven algorithm proposed by Guillou and Hall (2001), which we briefly summarize in Appendix B for the convenience of readers. This choice of $\overline{k}_n$ has the optimal rate, as established in their Theorem 2. Specifically, we select $\overline{k}_n$ according to (33) in Appendix B with the restriction that $\overline{k}_n \in [0.01n, 0.99n]$.

We implement three confidence intervals for comparison. The first is the naïve length-optimal confidence interval, which does not account for a possible bias over the local non-parametric family, that is,

$$H^{N}(n, \overline{k}_n) = \left[\hat{\xi}(n, \overline{k}_n) \pm 1.96\, \overline{k}_n^{-1/2}\, \hat{\xi}(n, \overline{k}_n)\right], \qquad (14)$$

where $\overline{k}_n$ is selected according to the procedure described above. Our impossibility result predicts that this interval fails to achieve correct coverage uniformly over the non-parametric class encompassing our simulation design.

The second is $H^{O}(n, \overline{k}_n)$, i.e., our proposed confidence interval given in (11) with $r = 1$, where $\overline{k}_n$ is selected according to the procedure described above. The bias upper bound $\overline{B} = A/(1+\rho)$ is chosen following the rule of thumb described in Section 4.

The third is based on $k$ snooping. Specifically, given $\overline{k}_n$ selected according to the procedure described above, consider the range $[\underline{r}\,\overline{k}_n, \overline{k}_n]$ containing $m$ integers denoted by $\{k_1, \ldots, k_m\}$. The $k$-snooping interval is constructed by

$$H^{S}(n, \underline{r}) = \bigcap_{j=1}^{m} H^{O}(n, k_j), \qquad (15)$$

where $H^{O}(\cdot, \cdot)$ is defined in (11) and the lower bound $\underline{r}$ of the $k$ snooping is set to $1/2$. For the bias upper bound $\overline{B} = A/(1+\rho)$, we set $A = 0.1\, \xi_0 (1 + 2\xi_0) \sqrt{r\overline{k}_n}$ and replace $\xi_0$ with the estimator $\hat{\xi}(n, k_j)$ for each $j \in \{1, \ldots, m\}$.
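A sketch of the snooping construction (15), intersecting the honest intervals from the Section 4 sketch over the grid of $k$ values; here q should be the Table 1 quantile for the chosen $\underline{r}$ (e.g., 2.54 for $\underline{r} = 1/2$ and $\beta = 0.05$):

```python
import numpy as np

def snooping_ci(y, k_bar, q, r_lo=0.5):
    """k-snooping interval (15): intersect H^O(n, k) over k in [r_lo*k_bar, k_bar].
    Each H^O uses xi_hat(n, k_j) in its rule-of-thumb bias bound, as in the text."""
    lo, hi = -np.inf, np.inf
    for k in range(int(r_lo * k_bar), int(k_bar) + 1):
        l, u = honest_ci_tail_index(y, k_bar, q, r=k / k_bar)  # Section 4 sketch
        lo, hi = max(lo, l), min(hi, u)
    return lo, hi
```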

Tail Index: Coverage Probabilities

                       $n = 250$              $n = 500$              $n = 1000$
$(\xi_0, c_0)$    $H^N$  $H^O$  $H^S$    $H^N$  $H^O$  $H^S$    $H^N$  $H^O$  $H^S$
(1, 0)             0.92   0.99   0.98     0.92   0.99   0.99     0.91   0.99   0.99
(1, 0.5)           0.88   0.98   0.97     0.81   0.99   0.98     0.73   0.99   0.99
(1, 1)             0.77   0.98   0.97     0.64   0.98   0.98     0.65   0.99   0.99
(0.5, 0)           0.91   0.99   0.98     0.91   0.99   0.99     0.91   0.99   0.99
(0.5, 0.5)         0.70   0.98   0.98     0.55   0.98   0.99     0.55   0.98   0.98
(0.5, 1)           0.51   0.87   0.94     0.59   0.86   0.92     0.61   0.93   0.95

Tail Index: Average Lengths

                       $n = 250$              $n = 500$              $n = 1000$
$(\xi_0, c_0)$    $H^N$  $H^O$  $H^S$    $H^N$  $H^O$  $H^S$    $H^N$  $H^O$  $H^S$
(1, 0)             0.29   0.49   0.53     0.20   0.40   0.43     0.14   0.34   0.36
(1, 0.5)           0.30   0.50   0.54     0.21   0.42   0.43     0.16   0.36   0.37
(1, 1)             0.31   0.52   0.54     0.23   0.44   0.45     0.18   0.38   0.39
(0.5, 0)           0.14   0.24   0.26     0.10   0.20   0.22     0.07   0.17   0.18
(0.5, 0.5)         0.16   0.27   0.28     0.12   0.22   0.23     0.09   0.20   0.20
(0.5, 1)           0.18   0.29   0.31     0.15   0.25   0.26     0.12   0.22   0.23

Table 2: The coverage probabilities (top panel) and the average lengths (bottom panel) of the 95% confidence intervals for $\xi_0$. $H^N$ stands for the naïve confidence interval defined in (14). $H^O$ stands for our proposed confidence interval defined in (11) with $r = 1$. $H^S$ stands for our proposed confidence interval with snooping defined in (15). The results are based on 5000 simulation draws.

From Theorem 2, we expect that both $H^{O}(n, \overline{k}_n)$ and $H^{S}(n, \underline{r})$ will deliver asymptotically correct (uniform) coverage, whereas $H^{N}(n, \overline{k}_n)$ will not. Table 2 presents the coverage probabilities (top panel) and the average lengths (bottom panel) of the 95% confidence intervals based on 5000 Monte Carlo iterations. Key findings can be summarized as follows. First, both $H^{O}$ and $H^{S}$ have the correct coverage probability for most of the distributions, consistent with our theory. When $\xi_0 = 0.5$ and $c_0 = 1$, the deviation of the tail distribution from the Pareto distribution is the most severe. In this case, $H^{O}$ suffers from some undercoverage when $n$ is small (e.g., $n = 250$ and $500$), but achieves more satisfactory coverage as $n$ becomes large (e.g., $n = 1000$). Second, the coverage by $H^{N}$ is inadequate throughout, even when the deviation from the Pareto distribution is relatively small. Furthermore, the undercoverage by $H^{N}$ tends to worsen as the sample size $n$ increases. Finally, the lengths of $H^{S}$ are slightly larger than those of $H^{O}$ when $n$ is 250, but they become almost identical as $n$ grows. From these findings, we prefer $H^{S}$ and $H^{O}$ to $H^{N}$. As the sample size becomes larger, $H^{S}$ and $H^{O}$ become equally preferable, while $H^{N}$ consistently underperforms.

6.2 Extreme Quantiles

We now turn to extreme quantiles. The data generating process is the same as in the previous subsection. The object of interest is the 99% quantile, so that $p_n = 0.01$. As in the previous subsection, we compare three confidence intervals.

The first is the naïve confidence interval, which does not account for a possible bias in the non-parametric family, that is,

$$I^{N}\left(n, \overline{k}_n\right) = \left[\hat{Q}(1-p_n)\left(1 - \frac{1.96}{\sqrt{\overline{k}_n}}\,\hat{\xi}(n,\overline{k}_n)\right),\;\; \hat{Q}(1-p_n)\left(1 + \frac{1.96}{\sqrt{\overline{k}_n}}\,\hat{\xi}(n,\overline{k}_n)\right)\right], \qquad (16)$$

where $\hat{Q}(1-p_n)$ is short for the Weissman estimator $\hat{F}_{n,h}^{-1}(1-p_n)$.

The second is our proposed confidence interval $I^{O}(n, \overline{k}_n)$ in (12) with $r = 1$. The third is our proposed confidence interval with $k$ snooping, that is,

$$I^{S}(n, \underline{r}) = \bigcap_{j=1}^{m} I^{O}(n, k_j), \qquad (17)$$

where $\{k_1, \ldots, k_m\}$ are constructed in the same way as in (15), $I^{O}(n, k_j)$ is defined as in (12), and we set $\underline{r} = 1/2$.

Extreme Quantile: Coverage Probabilities

                       $n = 250$              $n = 500$              $n = 1000$
$(\xi_0, c_0)$    $I^N$  $I^O$  $I^S$    $I^N$  $I^O$  $I^S$    $I^N$  $I^O$  $I^S$
(1, 0)             0.90   0.96   0.94     0.92   0.98   0.97     0.93   0.99   0.99
(1, 0.5)           0.92   0.96   0.93     0.93   0.98   0.97     0.89   0.99   0.98
(1, 1)             0.92   0.95   0.93     0.89   0.97   0.95     0.83   0.99   0.97
(0.5, 0)           0.92   0.98   0.97     0.93   0.99   0.98     0.93   0.99   0.99
(0.5, 0.5)         0.91   0.97   0.95     0.86   0.98   0.96     0.78   0.99   0.98
(0.5, 1)           0.86   0.96   0.94     0.81   0.97   0.96     0.85   0.98   0.97

Extreme Quantile: Average Lengths

                       $n = 250$              $n = 500$              $n = 1000$
$(\xi_0, c_0)$    $I^N$  $I^O$  $I^S$    $I^N$  $I^O$  $I^S$    $I^N$  $I^O$  $I^S$
(1, 0)             136    232    240      91     183    189      62     152    156
(1, 0.5)           170    291    277      110    225    213      76     185    174
(1, 1)             199    342    302      132    263    235      90     209    189
(0.5, 0)           6.3    10.9   11.6     4.4    8.9    9.3      3.0    7.5    7.7
(0.5, 0.5)         8.3    14.4   14.3     5.8    11.5   11.3     4.2    9.4    9.1
(0.5, 1)           10.9   18.3   17.6     7.5    13.7   13.3     5.3    10.6   10.4

Table 3: The coverage probabilities (top panel) and the average lengths (bottom panel) of the 95% confidence intervals for the 99% quantile. $I^N$ stands for the naïve confidence interval defined in (16). $I^O$ stands for our proposed confidence interval defined in (12) with $r = 1$. $I^S$ stands for our proposed confidence interval with snooping defined in (17). The results are based on 5000 simulation draws.

Table 3 presents the coverage probabilities (top panel) and the average lengths (bottom panel) of the 95% confidence intervals based on 1000 Monte Carlo iterations. The findings are similar to those in Table 2. In particular, both $I^{O}$ and $I^{S}$ deliver correct coverage probabilities for all distributions, while $I^{N}$ suffers from undercoverage in general. Regarding the lengths, $I^{O}$ and $I^{S}$ are both longer than $I^{N}$ to allow for the correct coverage. Furthermore, when $\xi_0$ is 0.5, $I^{O}$ has approximately the same lengths as $I^{S}$. When $\xi_0 = 1$, the lengths of $I^{S}$ are shorter than those of $I^{O}$, especially when the model deviates substantially from the Pareto distribution. This is because the true quantile is substantially larger, so the effect of adapting the critical value becomes more significant. From these observations, we prefer $I^{O}$ and $I^{S}$ to $I^{N}$.

7 Real Data Analysis

This section illustrates an application of the proposed method to an analysis of extremely low infant birth weights. The relation between such weights and mothers' demographic characteristics and maternal behaviors addresses important research questions. We use the detailed natality data (Vital Statistics) published by the National Center for Health Statistics, which has been used by prior studies including Abrevaya (2001) among many others. Our sample consists of repeated cross sections from 1989 to 2002. Using the data from each of these years, we construct 95% confidence intervals for the tail index in the left tail and for the first percentile, following the same computational procedure as in Section 6. Details of our implementation with the current empirical data set are as follows.

We follow previous studies (e.g., Abrevaya, 2001) in choosing the variables for mothers' demographic characteristics and maternal behaviors. The variable of interest is the infant birth weight measured in kilograms. For the purpose of comparison, we set a benchmark subsample in which the infant is a boy and the mother is younger than the median age in the full sample, is white and married, has a level of education lower than a high school degree, had her first prenatal visit in the first trimester (natal1), and did not smoke during the pregnancy. In addition to this benchmark subsample (benchmark), we also consider seven alternative subsamples, each corresponding to one and only one of the following scenarios: the mother has at least a high school diploma (high school); the infant is a girl (girl); the mother is unmarried (unmarried); the mother is black (black); the mother had no prenatal visit during pregnancy (no pre-visit); the mother smokes ten cigarettes per day on average (smoke); and the mother's age is above the median age in the full sample (older).

For each of these subsamples, we construct the 95% confidence intervals $H^{N}$, $H^{O}$, and $H^{S}$ for the tail index in the left tail in the same way as in Section 6.1, and the 95% confidence intervals $I^{N}$, $I^{O}$, and $I^{S}$ for the first percentile as in Section 6.2. Since we are interested in the left tail (extremely low birth weights), we consider only birth weights, denoted $B_i$, that are less than some cutoff value $T$, and take $Y_i = T - B_i$ as the input in our computational procedure for inference. We choose $T = 4$ based on the prior findings in Abrevaya (2001), who finds that the relationship between the infant birth weight and the mother's demographics changes substantially at the 90th percentile of birth weight in the full sample, which is approximately 4 kilograms. Once $I^{N}$, $I^{O}$, and $I^{S}$ are constructed for the 99th percentile of this transformed variable, we in turn multiply by $-1$ and add $T$ back to restore the interval for the original first percentile, so as to conduct inference about the extremely low birth weights.
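Concretely, the left-tail transformation and the back-transformation of the resulting interval can be sketched as follows; left_tail_interval is our name, and honest_ci_quantile refers to the sketch in Section 5.

```python
import numpy as np

def left_tail_interval(b, T=4.0, **ci_kwargs):
    """CI for the 1st percentile of birth weight B via the right tail of Y = T - B.
    A CI [lo, hi] for the 99th percentile of Y maps back to [T - hi, T - lo]."""
    b = np.asarray(b, dtype=float)
    y = T - b[b < T]                 # keep only B_i < T; left tail becomes right tail
    lo_y, hi_y = honest_ci_quantile(y, p=0.01, **ci_kwargs)  # Section 5 sketch
    return T - hi_y, T - lo_y        # interval endpoints flip under x -> T - x
```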

Tail Index
Subsample        $H^N$           $H^O$           $H^S$
benchmark        [0.27, 0.29]    [0.24, 0.31]    [0.25, 0.31]
high school      [0.29, 0.30]    [0.26, 0.33]    [0.27, 0.30]
girl             [0.25, 0.27]    [0.23, 0.29]    [0.23, 0.28]
unmarried        [0.29, 0.31]    [0.26, 0.34]    [0.27, 0.32]
black            [0.24, 0.29]    [0.21, 0.32]    [0.21, 0.33]
no pre-visit     [0.34, 0.41]    [0.31, 0.45]    [0.31, 0.46]
smoke            [0.22, 0.27]    [0.20, 0.30]    [0.19, 0.29]
older            [0.29, 0.31]    [0.26, 0.34]    [0.26, 0.32]

First Percentile in Kilograms
Subsample        $I^N$           $I^O$           $I^S$
benchmark        [1.46, 1.56]    [1.30, 1.72]    [1.35, 1.68]
high school      [1.48, 1.52]    [1.31, 1.69]    [1.43, 1.68]
girl             [1.50, 1.59]    [1.35, 1.74]    [1.43, 1.72]
unmarried        [1.20, 1.30]    [0.99, 1.51]    [1.10, 1.48]
black            [0.99, 1.39]    [0.80, 1.58]    [0.79, 1.58]
no pre-visit     [0.00, 0.63]    [0.00, 1.08]    [0.00, 1.09]
smoke            [1.34, 1.59]    [1.21, 1.72]    [1.27, 1.71]
older            [1.30, 1.45]    [1.13, 1.62]    [1.24, 1.62]

Table 4: The 95% confidence intervals for the tail index ($H$) and the first percentile ($I$) of the birth weight. The superscript $N$ stands for the naïve interval in (14) and (16). The superscript $O$ stands for the interval in (11) and (12) with $r = 1$. The superscript $S$ stands for the interval with snooping in (15) and (17). See the main text for details about the data and the definitions of subsamples.

Table 4 presents the results for the 2002 sample. The results for other years are similar and hence omitted to save space. Key empirical findings can be summarized as follows. First, $H^{O}$ and $H^{S}$ are similar in length for the tail index, while $I^{O}$ tends to be slightly longer than $I^{S}$ for the first percentile. Second, both are substantially longer than the naïve intervals $H^{N}$ and $I^{N}$, suggesting that ignoring the bias can lead to misleadingly short intervals. Third, compared with the benchmark subsample, mothers who had no prenatal visit during pregnancy bear a substantially higher risk of extremely low infant birth weights. This observation remains true even after accounting for the possible bias.

8 Summary

In this paper, we present two theoretical results concerning uniform confidence intervals for the tail index and extreme quantiles. First, we show that it is impossible to construct a length-optimal confidence interval satisfying the correct uniform coverage over the local non-parametric family of tail distributions. Second, in light of the impossibility result, we construct an honest confidence interval that is uniformly valid by accounting for the worst-case bias over the local non-parametric class. Simulation studies support our theoretical results: while the naïve length-optimal confidence interval suffers from severe under-coverage, our proposed confidence intervals achieve correct coverage. Applying the proposed method to the National Vital Statistics data from the National Center for Health Statistics, we find that, even after accounting for the worst-case bias bound, having no prenatal visit during pregnancy remains a strong risk factor for low infant birth weight. This demonstrates that, despite the impossibility result, robust yet informative statistical inference about the tail index and extreme quantiles is possible.

References

  • Abrevaya, J. (2001): "The effects of demographics and maternal behavior on the distribution of birth outcomes," Empirical Economics, 26, 247–257.
  • Armstrong, T. B. and M. Kolesár (2018): "Optimal inference in a class of regression models," Econometrica, 86, 655–683.
  • Armstrong, T. B. and M. Kolesár (2020): "Simple and honest confidence intervals in nonparametric regression," Quantitative Economics, 11, 1–39.
  • Bull, A. D. and R. Nickl (2013): "Adaptive confidence sets in $L^2$," Probability Theory and Related Fields, 156, 889–919.
  • Cai, T. T. and M. G. Low (2004): "An adaptation theory for nonparametric confidence intervals," Annals of Statistics, 32, 1805–1840.
  • Carpentier, A. (2013): "Honest and adaptive confidence sets in $L_p$," Electronic Journal of Statistics, 7, 2875–2923.
  • Carpentier, A. and A. K. H. Kim (2014a): "Adaptive and minimax optimal estimation of the tail coefficient," Statistica Sinica, 25, 1133–1144.
  • Carpentier, A. and A. K. H. Kim (2014b): "Adaptive confidence intervals for the tail coefficient in a wide second order class of Pareto models," Electronic Journal of Statistics, 8, 2066–2110.
  • Cheng, S. and L. Peng (2001): "Confidence intervals for the tail index," Bernoulli, 7, 751–760.
  • Danielsson, J., L. de Haan, L. Peng, and C. G. de Vries (2001): "Using a bootstrap method to choose the sample fraction in tail index estimation," Journal of Multivariate Analysis, 76, 226–248.
  • Danielsson, J., L. M. Ergun, L. de Haan, and C. G. de Vries (2016): "Tail index estimation: Quantile driven threshold selection," Working Paper.
  • de Haan, L. and A. Ferreira (2006): Extreme Value Theory: An Introduction, Springer.
  • Drees, H. (1998a): "A general class of estimators of the extreme value index," Journal of Statistical Planning and Inference, 66, 95–112.
  • Drees, H. (1998b): "On smooth statistical tail functionals," Scandinavian Journal of Statistics, 25, 187–210.
  • Drees, H. (2001): "Minimax risk bounds in extreme value theory," Annals of Statistics, 29, 266–294.
  • Drees, H. and E. Kaufmann (1998): "Selecting the optimal sample fraction in univariate extreme value estimation," Stochastic Processes and their Applications, 75, 149–172.
  • Drees, H., S. I. Resnick, and L. de Haan (2000): "How to make a Hill plot," Annals of Statistics, 28, 254–274.
  • Fedotenkov, I. (2020): "A review of more than one hundred Pareto-tail index estimators," Statistica, 80, 245–299.
  • Geluk, J. L. and L. Peng (2000): "An adaptive optimal estimate of the tail index for MA(1) time series," Statistics & Probability Letters, 46, 217–227.
  • Genovese, C. and L. Wasserman (2008): "Adaptive confidence bands," Annals of Statistics, 36, 875–905.
  • Gomes, M. I. and A. Guillou (2015): "Extreme value theory and statistics of univariate extremes: A review," International Statistical Review, 83, 263–292.
  • Guillou, A. and P. Hall (2001): "A diagnostic for selecting the threshold in extreme value analysis," Journal of the Royal Statistical Society: Series B (Statistical Methodology), 63, 293–305.
  • Haeusler, E. and J. Segers (2007): "Assessing confidence intervals for the tail index by Edgeworth expansions for the Hill estimator," Bernoulli, 13, 175–194.
  • Hall, P. and A. H. Welsh (1985): "Adaptive estimates of parameters of regular variation," Annals of Statistics, 13, 331–341.
  • Hill, B. M. (1975): "A simple general approach to inference about the tail of a distribution," Annals of Statistics, 3, 1163–1174.
  • Hoffmann, M. and R. Nickl (2011): "On adaptive inference and confidence bands," Annals of Statistics, 39, 2383–2409.
  • Li, K.-C. (1989): "Honest confidence regions for nonparametric regression," Annals of Statistics, 17, 1001–1008.
  • Low, M. G. (1997): "On nonparametric confidence intervals," Annals of Statistics, 25, 2547–2554.
  • Lu, J.-C. and L. Peng (2002): "Likelihood based confidence intervals for the tail index," Extremes, 5, 337–352.
  • Peng, L. and Y. Qi (2006): "A new calibration method of constructing empirical likelihood-based confidence intervals for the tail index," Australian & New Zealand Journal of Statistics, 48, 59–66.
  • Reiss, R.-D. (1989): Approximate Distributions of Order Statistics, Springer, New York.
  • Resnick, S. I. (2007): Heavy-Tail Phenomena: Probabilistic and Statistical Modeling, Springer Science & Business Media.
  • Resnick, S. I. and C. Stărică (1997): "Smoothing the Hill estimator," Advances in Applied Probability, 29, 271–293.
  • van de Geer, S., P. Bühlmann, Y. Ritov, and R. Dezeure (2014): "On asymptotically optimal confidence regions and tests for high-dimensional models," Annals of Statistics, 42, 1166–1202.
  • Weissman, I. (1978): "Estimation of parameters and large quantiles based on the k largest observations," Journal of the American Statistical Association, 73, 812–815.
  • Wu, Y., L. Wang, and H. Fu (2021): "Model-assisted uniformly honest inference for optimal treatment regimes in high dimension," Journal of the American Statistical Association, forthcoming.

Supplementary Appendix

On Uniform Confidence Intervals for the Tail Index and the Extreme Quantile

Appendix A Proofs

A.1 Proof of Theorem 1

We need the following two auxiliary lemmas to prove Theorem 1. Throughout, suppose that the distributions $F_{n,h}$ and $F_0$ are absolutely continuous with density functions denoted by $f_{n,h}$ and $f_0$, respectively.

Lemma 1

We have

$$\int\left(\frac{f_{n,h}^{1/2}(y)}{f_0^{1/2}(y)} - 1\right)^2 f_0(y)\, dy = o(1).$$
Proof of Lemma 1.

By the definitions of $F_{n,h}^{-1}$ and $F_0^{-1}$, we can write

$$f_{n,h}\left(F_{n,h}^{-1}(1-t)\right) = \frac{t}{\left(\xi_0 + d_n h(t_n^{-1} t)\right) F_{n,h}^{-1}(1-t)} \qquad\text{and}\qquad f_0\left(F_{n,h}^{-1}(1-t)\right) = \left[F_{n,h}^{-1}(1-t)\right]^{-1-1/\xi_0}/\xi_0.$$

Hence, it follows that

$$\frac{f_0\left(F_{n,h}^{-1}(1-t)\right)}{f_{n,h}\left(F_{n,h}^{-1}(1-t)\right)} = \left(1 + \xi_0^{-1} d_n h(t_n^{-1} t)\right) \frac{\left[F_{n,h}^{-1}(1-t)\right]^{-1/\xi_0}}{t} = \left(1 + \xi_0^{-1} d_n h(t_n^{-1} t)\right) \exp\left(-\frac{d_n}{\xi_0}\int_t^1 \frac{h(t_n^{-1} v)}{v}\, dv\right).$$

The change of variables $y = F_{n,h}^{-1}(1 - t_n t)$ yields

$$\begin{aligned}
&\int\left(\frac{f_{n,h}^{1/2}(y)}{f_0^{1/2}(y)} - 1\right)^2 f_0(y)\, dy \\
&= t_n \int_0^{t_n^{-1}}\left(\frac{f_{n,h}^{1/2}\left(F_{n,h}^{-1}(1 - t_n t)\right)}{f_0^{1/2}\left(F_{n,h}^{-1}(1 - t_n t)\right)} - 1\right)^2 \frac{f_0\left(F_{n,h}^{-1}(1 - t_n t)\right)}{f_{n,h}\left(F_{n,h}^{-1}(1 - t_n t)\right)}\, dt \\
&= t_n \int_0^{t_n^{-1}}\left[\exp\left(\frac{d_n}{2\xi_0}\int_t^{t_n^{-1}} \frac{h(s)}{s}\, ds\right) - \left(1 + d_n \frac{h(t)}{\xi_0}\right)^{1/2}\right]^2 \exp\left(-\frac{d_n}{\xi_0}\int_t^{t_n^{-1}} \frac{h(s)}{s}\, ds\right) dt \\
&\overset{(i)}{=} t_n \int_0^{t_n^{-1}}\left[1 - \left(1 + d_n \frac{h(t)}{\xi_0}\right)^{1/2} + \frac{d_n}{2\xi_0}\int_t^{t_n^{-1}} \frac{h(s)}{s}\, ds + o(d_n)\right]^2 \left(1 - \frac{d_n}{\xi_0}\int_t^{t_n^{-1}} \frac{h(s)}{s}\, ds + o(d_n)\right) dt \\
&\overset{(ii)}{=} t_n \int_0^{t_n^{-1}}\left[\frac{d_n}{2\xi_0}\int_t^{t_n^{-1}} \frac{h(s)}{s}\, ds - d_n \frac{h(t)}{2\xi_0} + o(d_n)\right]^2 \left(1 + o(1)\right) dt \\
&\overset{(iii)}{=} O(d_n t_n) \overset{(iv)}{=} o(1),
\end{aligned}$$

where equality (i) follows from $\exp(x) = 1 + x + o(x)$ as $x \rightarrow 0$, equality (ii) follows from $(1+x)^{1/2} = 1 + x/2 + o(x)$ as $x \rightarrow 0$, equality (iii) follows from the assumptions that $h$ is uniformly bounded and square integrable for all $h \in \mathcal{H}(A, \rho)$, and equality (iv) follows from the fact that $d_n t_n \approx A t_n^{\rho+1}$ with $t_n \rightarrow 0$ (under Condition 1) and $\rho > 0$. ∎


Lemma 2

The Hellinger distance $H^2(f_0, f_{n,h})$ between $f_0$ and $f_{n,h}$ satisfies

$$H^2\left(f_0, f_{n,h}\right) \equiv \int\left(\sqrt{f_0(y)} - \sqrt{f_{n,h}(y)}\right)^2 dy = \frac{\|h\|^2}{4 n \xi_0^2}\left(1 + o(1)\right). \qquad (18)$$
Proof of Lemma 2.

First, Proposition 2.1 in Drees (2001) yields

$$\int\left[n^{1/2}\left(f_{n,h}^{1/2}(y) - f_0^{1/2}(y)\right) - \frac{1}{2}\, g_{n,h}(y)\, f_0^{1/2}(y)\right]^2 dy = o(1), \qquad (19)$$

where

$$g_{n,h}(y) \equiv \frac{n^{1/2} d_n}{\xi_0}\int_{y^{-1/\xi_0}}^{\infty} \frac{h(t_n^{-1} s)}{s}\, ds - h\left(t_n^{-1} y^{-1/\xi_0}\right), \qquad y \ge 1.$$

Note by Drees (2001, p.286) that $g_{n,h}$ satisfies

$$\int g_{n,h}(x)\, f_0(x)\, dx = 0 \qquad (20)$$

and

$$\int g_{n,h}^2(x)\, f_0(x)\, dx \rightarrow \frac{\|h\|^2}{\xi_0^2}. \qquad (21)$$

Therefore, by expanding the square in (19), the equality (18) follows once we establish

$$\int f_{n,h}^{1/2}(y)\, f_0^{1/2}(y)\, g_{n,h}(y)\, dy = o(1).$$

This equality follows as

 fn1/2(y)f01/2(y)gn,h(y)𝑑y\displaystyle\text{ \ \ }\int f_{n}^{1/2}\left(y\right)f_{0}^{1/2}\left(y\right)g_{n,h}\left(y\right)dy
=(fn1/2(y)f01/2(y)1)f0(x)gn,h(y)𝑑y\displaystyle=\int\left(\frac{f_{n}^{1/2}\left(y\right)}{f_{0}^{1/2}\left(y\right)}-1\right)f_{0}\left(x\right)g_{n,h}\left(y\right)dy
(i)((fn1/2(y)f01/2(y)1)2f0(y)𝑑y)1/2(gn,h2(y)f0(y)𝑑y)1/2\displaystyle\overset{(i)}{\leq}\left(\int\left(\frac{f_{n}^{1/2}\left(y\right)}{f_{0}^{1/2}\left(y\right)}-1\right)^{2}f_{0}\left(y\right)dy\right)^{1/2}\left(\int g_{n,h}^{2}\left(y\right)f_{0}\left(y\right)dy\right)^{1/2}
=(ii)o(1)hξ0(1+o(1))1/2=o(1),\displaystyle\overset{(ii)}{=}o(1)\frac{\left|\left|h\right|\right|}{\xi_{0}}(1+o(1))^{1/2}=o(1)\text{,}

where the first equality follows from (20), inequality (i) follows from the Cauchy–Schwarz inequality, and equality (ii) follows from Lemma 1 and (21). ∎


Proof of Theorem 1.

We first use Lemma 2 to translate the Hellinger distance between $f_{0}$ and $f_{n,h}$ into the $L_{1}$-distance between the $n$-fold product densities $f_{0}^{n}$ and $f_{n,h}^{n}$. By Equation (17) in Low (1997),

L_{1}\left(f_{0}^{n},f_{n,h}^{n}\right)\equiv\int\left|f_{0}^{n}\left(y^{\left(n\right)}\right)-f_{n,h}^{n}\left(y^{\left(n\right)}\right)\right|dy^{\left(n\right)}
\leq 2\left(2-2\exp\left(-\frac{\left|\left|h\right|\right|^{2}}{8\xi_{0}^{2}}\right)\right)^{1/2}\left(1+o(1)\right)
\leq 2\left(2-2\exp\left(-\frac{\varepsilon^{2}}{8\xi_{0}^{2}}\right)\right)^{1/2}\left(1+o(1)\right)
\equiv C\left(\varepsilon,\xi_{0}\right)\left(1+o(1)\right).
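Although it plays no role in the formal argument, the $L_{1}$-versus-Hellinger inequality invoked here is easy to check numerically. The sketch below (our own illustration, with assumed values of $\xi_{0}$, $\xi_{1}$, and $n$) uses two nearby Pareto distributions, for which the Hellinger affinity $\rho=\int\sqrt{f_{0}f_{1}}\,dy=2\sqrt{\xi_{0}\xi_{1}}/(\xi_{0}+\xi_{1})$ is available in closed form, and confirms by Monte Carlo that $L_{1}(f_{0}^{n},f_{1}^{n})\leq 2(2-2\rho^{n})^{1/2}$, the product-measure bound underlying the display above.

```python
import numpy as np

rng = np.random.default_rng(0)
xi0, xi1, n, reps = 0.50, 0.52, 200, 20000   # assumed tail indices and sample size

# Hellinger affinity of f_xi(y) = y**(-1/xi - 1)/xi on [1, inf), in closed form
rho = 2 * np.sqrt(xi0 * xi1) / (xi0 + xi1)
bound = 2 * np.sqrt(2 - 2 * rho ** n)        # L1 bound for the n-fold product measures

# Monte Carlo: L1(f0^n, f1^n) = E_{f0^n} | 1 - prod_i f1(Y_i)/f0(Y_i) |
Y = rng.uniform(size=(reps, n)) ** (-xi0)    # i.i.d. draws from f0
loglik = np.log(xi0 / xi1) + (1 / xi0 - 1 / xi1) * np.log(Y)
L1_mc = np.mean(np.abs(1 - np.exp(loglik.sum(axis=1))))
print(L1_mc, bound)                          # the estimate stays below the bound
```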

Next, let $\delta>0$ be arbitrary. By the definition of $\omega\left(\varepsilon,F_{0}^{-1},n\right)$, there exists an $h\in\mathcal{H}_{\rho}\left(\overline{h}\right)$ with $\left|\left|h\right|\right|\leq\varepsilon$ such that

\left|T_{r}\left(F_{n,h}^{-1}\right)-T_{r}\left(F_{0}^{-1}\right)\right|\geq\omega\left(\varepsilon,F_{0}^{-1},n\right)-\delta.

Let $F_{n,\lambda h}^{-1}=Q_{n,\overline{k}_{n},F_{n,\lambda h}}$ as shorthand notation. Also, let $f_{n,\lambda h}$ and $\mathbb{P}_{F_{n,\lambda h}^{n}}$ denote the corresponding density and probability measure, respectively. Since $H\left(Y^{\left(n\right)}\right)$ has coverage probability at least $1-\beta$ uniformly over $f_{n,\lambda h}$, it follows that, for any $\lambda\in\left[0,1\right]$,

Fn,λhn(Tr(Fn,λh1)H(Y(n)))1β\mathbb{P}_{F_{n,\lambda h}^{n}}\left(T_{r}\left(F_{n,\lambda h}^{-1}\right)\in H\left(Y^{\left(n\right)}\right)\right)\geq 1-\beta

and

λhλhλε.\left|\left|\lambda h\right|\right|\leq\lambda\left|\left|h\right|\right|\leq\lambda\varepsilon.

Then, we have

\mathbb{P}_{F_{0}^{n}}\left(T_{r}\left(F_{n,\lambda h}^{-1}\right)\in H\left(Y^{\left(n\right)}\right)\right)
=\mathbb{E}_{F_{0}^{n}}\left[1\left[T_{r}\left(F_{n,\lambda h}^{-1}\right)\in H\left(Y^{\left(n\right)}\right)\right]\right]
=\mathbb{E}_{F_{n,\lambda h}^{n}}\left[1\left[T_{r}\left(F_{n,\lambda h}^{-1}\right)\in H\left(Y^{\left(n\right)}\right)\right]\left(1+\frac{f_{0}^{n}-f_{n,\lambda h}^{n}}{f_{n,\lambda h}^{n}}\right)\right]
\geq 1-\beta+\mathbb{E}_{F_{n,\lambda h}^{n}}\left[1\left[T_{r}\left(F_{n,\lambda h}^{-1}\right)\in H\left(Y^{\left(n\right)}\right)\right]\frac{f_{0}^{n}-f_{n,\lambda h}^{n}}{f_{n,\lambda h}^{n}}\right]
\geq 1-\beta-\int\left|f_{n,\lambda h}^{n}\left(y\right)-f_{0}^{n}\left(y\right)\right|dy
\geq 1-\beta-\lambda C\left(\varepsilon,\xi_{0}\right).

By the same lines of argument as in equations (12)–(15) in Low (1997), we obtain

\mathbb{E}_{F_{0}^{n}}\left[\mu\left(H\left(Y^{\left(n\right)}\right)\right)\right]
=\mathbb{E}_{F_{0}^{n}}\left[\int_{\mathbb{R}}1\left[t\in H\left(y^{\left(n\right)}\right)\right]dt\right]
\geq\mathbb{E}_{F_{0}^{n}}\int_{0}^{1}\left(T_{r}\left(F_{n,h}^{-1}\right)-T_{r}\left(F_{0}^{-1}\right)\right)1\left[T_{r}\left(F_{n,\lambda h}^{-1}\right)\in H\left(y^{\left(n\right)}\right)\right]d\lambda
=\left(T_{r}\left(F_{n,h}^{-1}\right)-T_{r}\left(F_{0}^{-1}\right)\right)\int_{0}^{1}\mathbb{P}_{F_{0}^{n}}\left(T_{r}\left(F_{n,\lambda h}^{-1}\right)\in H\left(Y^{\left(n\right)}\right)\right)d\lambda
\geq\left(T_{r}\left(F_{n,h}^{-1}\right)-T_{r}\left(F_{0}^{-1}\right)\right)\int_{0}^{1}\left(1-\beta-\lambda C\left(\varepsilon,\xi_{0}\right)\right)d\lambda
\geq\left(\omega\left(\varepsilon,F_{0}^{-1},n\right)-\delta\right)\left(1-\beta-\frac{C\left(\varepsilon,\xi_{0}\right)}{2}\right).

The inequality (6) now follows since $\delta$ is arbitrary. ∎

A.2 Proof of Theorem 2

Proof.

First, we approximate the empirical tail quantile function $Q_{n,r\bar{k}_{n},F_{n,h}}$ by a partial-sum process of standard exponential random variables. Let $\eta_{i}$, $i=1,2,\ldots$, be i.i.d. standard exponential random variables, let $S_{i}\equiv\sum_{j=1}^{i}\eta_{j}$, and let $\tilde{Q}_{n,r\bar{k}_{n},F}\equiv F^{-1}\left(1-S_{\left[\bar{k}_{n}r\right]+1}/n\right)$ for $r\in(0,1]$. Let $\mathbb{P}_{n,h}^{n}$ denote the joint distribution of $n$ i.i.d. draws from the distribution $F_{n,h}$. By Reiss (1989, Theorem 5.4.3) – see also Drees (2001, eq.(5.12)) – the variational distance between the distribution of $Q_{n,r\bar{k}_{n},F_{n,h}}$ under $\mathbb{P}_{n,h}^{n}$ and the distribution of $\tilde{Q}_{n,r\bar{k}_{n},F}$ vanishes uniformly as $n\rightarrow\infty$, that is,

||(Qn,rk¯n,Fn,h|n,hn)(Q~n,rk¯n,F)||=O(k¯n/n)=o(1),\left|\left|\mathcal{L}\left(Q_{n,r\bar{k}_{n},F_{n,h}}|\mathbb{P}_{n,h}^{n}\right)-\mathcal{L}\left(\tilde{Q}_{n,r\bar{k}_{n},F}\right)\right|\right|=O\left(\bar{k}_{n}/n\right)=o(1), (22)

which implies that

||(Tr(Qn,rk¯n,Fn,h)|n,hn)(Tr(Q~n,rk¯n,F))||=o(1)\left|\left|\mathcal{L}\left(T_{r}\left(Q_{n,r\bar{k}_{n},F_{n,h}}\right)|\mathbb{P}_{n,h}^{n}\right)-\mathcal{L}\left(T_{r}\left(\tilde{Q}_{n,r\bar{k}_{n},F}\right)\right)\right|\right|=o(1) (23)

uniformly for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$. Recall from Section 2.2 that $T_{r}\left(z\right)=r^{-1}\int_{0}^{r}\log\left(z\left(t\right)/z\left(r\right)\right)dt$ characterizes Hill's estimator. Hence it suffices to approximate $\tilde{Q}_{n,r\bar{k}_{n},F}$. To this end, we employ a strong approximation of $\tilde{Q}_{n,r\bar{k}_{n},F}$, normalized by $F_{n,h}^{-1}\left(1-\overline{k}_{n}/n\right)$. Specifically, using Drees (2001, eq.(5.13)), we obtain

suph(A,ρ)supr(0,1]rξ0+1/2+ε|Q~n,rk¯n,FFn,h1(1k¯n/n)(rξ0ξ0r(ξ0+1)W(k¯nr)k¯n\displaystyle\sup_{h\in\mathcal{H}\left(A,\rho\right)}\sup_{r\in(0,1]}r^{\xi_{0}+1/2+\varepsilon}\Bigg{|}\frac{\tilde{Q}_{n,r\bar{k}_{n},F}}{F_{n,h}^{-1}\left(1-\overline{k}_{n}/n\right)}-\Bigg{(}r^{-\xi_{0}}-\xi_{0}r^{-\left(\xi_{0}+1\right)}\frac{W\left(\overline{k}_{n}r\right)}{\overline{k}_{n}}
+dnrξ0r1h(v)vdv)|=o(k¯n1/2) a.s.\displaystyle+d_{n}r^{-\xi_{0}}\int_{r}^{1}\frac{h\left(v\right)}{v}dv\Bigg{)}\Bigg{|}=o\left(\overline{k}_{n}^{-1/2}\right)\text{ a.s.} (24)

for all $\varepsilon\in\left(0,1/2\right)$, where $W\left(\cdot\right)$ denotes the standard Wiener process.
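As a concrete sanity check of the claim that $T_{r}$ characterizes Hill's estimator (an illustration of our own, not part of the proof), one can evaluate $T_{r}(z)=r^{-1}\int_{0}^{r}\log(z(t)/z(r))dt$ at the empirical tail quantile function $Q(t)=Y_{n:n-\lfloor\bar{k}_{n}t\rfloor}$ by numerical integration and compare it with the Hill estimator computed directly from the top $\lfloor r\bar{k}_{n}\rfloor$ order statistics; the two agree up to grid error.

```python
import numpy as np

rng = np.random.default_rng(0)
xi0, n, k = 0.5, 10000, 400                      # assumed sample size and threshold

Y = np.sort(rng.uniform(size=n) ** (-xi0))       # Pareto sample: P(Y > y) = y**(-1/xi0)

def Q(t):
    # empirical tail quantile function Q(t) = Y_{n:n - floor(k t)}, t in (0, 1]
    return Y[n - 1 - np.floor(k * np.asarray(t)).astype(int)]

def T(r, m=400000):
    # T_r(z) = r^{-1} int_0^r log(z(t)/z(r)) dt via a midpoint grid on (0, r)
    t = (np.arange(m) + 0.5) * r / m
    return np.mean(np.log(Q(t) / Q(r)))

for r in (1.0, 0.5):
    j = int(np.floor(r * k))
    hill = np.mean(np.log(Y[n - j:] / Y[n - j - 1]))   # Hill with the top j order stats
    print(r, T(r), hill)
```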

In the second step, we exploit two features of the functional $T_{r}\left(\cdot\right)$: scale invariance and Hadamard differentiability. Since $T_{r}\left(\cdot\right)$ is scale invariant, with $T_{r}\left(az\right)=T_{r}\left(z\right)$ for any constant $a>0$, we have

Tr(Q~n,rk¯n,FFn,h1(1k¯n/n))=Tr(Q~n,rk¯n,F).T_{r}\left(\frac{\tilde{Q}_{n,r\bar{k}_{n},F}}{F_{n,h}^{-1}\left(1-\overline{k}_{n}/n\right)}\right)=T_{r}\left(\tilde{Q}_{n,r\bar{k}_{n},F}\right).

Moreover, $T_{r}\left(\cdot\right)$ is Hadamard differentiable at $z_{\xi_{0}}:t\longmapsto t^{-\xi_{0}}$ (cf. Drees, 1998b, Condition 3) in the sense that

Tr(zξ0+dnyn)Tr(zξ0)dnTr(y)\frac{T_{r}\left(z_{\xi_{0}}+d_{n}y_{n}\right)-T_{r}\left(z_{\xi_{0}}\right)}{d_{n}}\rightarrow T_{r}^{\prime}\left(y\right) (25)

uniformly for all functions $y_{n}$ with $\sup_{t\in(0,1]}t^{\xi_{0}+1/2+\varepsilon}\left|y_{n}\left(t\right)\right|\leq 1$ and $y_{n}\rightarrow y$. To derive the expression of $T_{r}^{\prime}\left(\cdot\right)$, we write

Tr(zξ0+dnyn)Tr(zξ0)=\displaystyle T_{r}\left(z_{\xi_{0}}+d_{n}y_{n}\right)-T_{r}\left(z_{\xi_{0}}\right)= r10rlog(tξ0+dnyn(t)rξ0+dnyn(r)rξ0tξ0)𝑑t\displaystyle r^{-1}\int_{0}^{r}\log\left(\frac{t^{-\xi_{0}}+d_{n}y_{n}\left(t\right)}{r^{-\xi_{0}}+d_{n}y_{n}\left(r\right)}\frac{r^{-\xi_{0}}}{t^{-\xi_{0}}}\right)dt
=\displaystyle= r10rlog(1+dnxn(t))𝑑t,\displaystyle r^{-1}\int_{0}^{r}\log\left(1+d_{n}x_{n}\left(t\right)\right)dt,

where

xn(t)=tξ0yn(t)rξ0yn(r)1+dnrξ0yn(r)tξ0y(t)rξ0y(r).x_{n}\left(t\right)=\frac{t^{\xi_{0}}y_{n}\left(t\right)-r^{\xi_{0}}y_{n}\left(r\right)}{1+d_{n}r^{\xi_{0}}y_{n}\left(r\right)}\rightarrow t^{\xi_{0}}y\left(t\right)-r^{\xi_{0}}y\left(r\right).

Following the derivation in Drees (1998a, p.104), we obtain

T_{r}^{\prime}\left(y\right)=r^{-1}\int_{0}^{r}\left(t^{\xi_{0}}y\left(t\right)-r^{\xi_{0}}y\left(r\right)\right)dt.
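This derivative formula can be checked by finite differences. For the test direction $y(t)=t^{-\xi_{0}}(1-t)$ (our choice purely for illustration; it satisfies the weighted-sup constraint above), the formula gives $T_{r}^{\prime}(y)=r^{-1}\int_{0}^{r}\left((1-t)-(1-r)\right)dt=r/2$, while $r^{-1}\int_{0}^{r}\log(r/t)\,dt=1$ gives $T_{r}(z_{\xi_{0}})=\xi_{0}$; the following sketch confirms both numerically.

```python
import numpy as np

xi0, r, d, m = 0.5, 0.8, 1e-4, 200000
t = (np.arange(m) + 0.5) * r / m            # midpoint grid on (0, r)

def T(z):
    # T_r(z) = r^{-1} int_0^r log(z(t)/z(r)) dt via the midpoint rule
    return np.mean(np.log(z(t) / z(r)))

z0 = lambda s: s ** (-xi0)                  # z_{xi0}(s) = s^{-xi0}
y = lambda s: s ** (-xi0) * (1.0 - s)       # test direction with T'_r(y) = r/2

print(T(z0), xi0)                           # T_r(z_{xi0}) = xi0 = 0.5
print((T(lambda s: z0(s) + d * y(s)) - T(z0)) / d, r / 2)   # both ≈ 0.4
```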

In the final step, we substitute $z_{\xi_{0}}\left(t\right)=t^{-\xi_{0}}$, $d_{n}\approx\bar{k}_{n}^{-1/2}$, and

yn(t)=ξ0t(ξ0+1)W(k¯nt)k¯n1/2+tξ0t1h(v)v𝑑v.y_{n}\left(t\right)=\xi_{0}t^{-\left(\xi_{0}+1\right)}\frac{W\left(\overline{k}_{n}t\right)}{\overline{k}_{n}^{1/2}}+t^{-\xi_{0}}\int_{t}^{1}\frac{h\left(v\right)}{v}dv.

By (24), (25), and the functional delta method, we have

\sup_{h\in\mathcal{H}_{\rho}}\sup_{r\in(0,1]}\left|T_{r}\left(\tilde{Q}_{n,r\bar{k}_{n},F}\right)-\left(\xi_{0}+\bar{k}_{n}^{-1/2}\xi_{0}T_{r}^{\prime}\left(z_{\xi_{0}+1}W_{n}\right)+d_{n}T_{r}^{\prime}\left(t^{-\xi_{0}}\int_{t}^{1}\frac{h\left(v\right)}{v}dv\right)\right)\right|=o\left(\overline{k}_{n}^{-1/2}\right)\text{ a.s.} (26)

where $W_{n}:t\longmapsto-\bar{k}_{n}^{-1/2}W\left(\bar{k}_{n}t\right)$. Using the definition of $T_{r}^{\prime}\left(\cdot\right)$, we obtain

Tr(zξ0+1Wn)=1r0r(s1W(s)r1W(r))𝑑s𝔾(r)T_{r}^{\prime}\left(z_{\xi_{0}+1}W_{n}\right)=\frac{1}{r}\int_{0}^{r}\left(s^{-1}W\left(s\right)-r^{-1}W\left(r\right)\right)ds\equiv\mathbb{G}\left(r\right) (27)

and

T_{r}^{\prime}\left(t^{-\xi_{0}}\int_{t}^{1}\frac{h\left(v\right)}{v}dv\right)-h\left(0\right)=r^{-1}\int_{0}^{r}\left(\int_{s}^{1}\frac{h\left(v\right)}{v}dv-\int_{r}^{1}\frac{h\left(v\right)}{v}dv\right)ds-h\left(0\right)
=\displaystyle= r10r(srh(v)v𝑑v)𝑑sh(0)\displaystyle r^{-1}\int_{0}^{r}\left(\int_{s}^{r}\frac{h\left(v\right)}{v}dv\right)ds-h\left(0\right)
=\displaystyle= r10r(srh(v)h(0)v𝑑v)𝑑s\displaystyle r^{-1}\int_{0}^{r}\left(\int_{s}^{r}\frac{h\left(v\right)-h\left(0\right)}{v}dv\right)ds
\displaystyle\equiv 𝔹(r;h),\displaystyle\mathbb{B}\left(r;h\right), (28)

where we used the equality $r^{-1}\int_{0}^{r}\left(\log r-\log s\right)ds=1$. Thus, in view of $\xi_{n,h}=\xi_{0}+d_{n}h\left(0\right)$ and combining (26)–(28), we find

T_{r}\left(\tilde{Q}_{n,r\bar{k}_{n},F}\right)-\xi_{n,h}=\bar{k}_{n}^{-1/2}\xi_{0}\mathbb{G}\left(r\right)+d_{n}\mathbb{B}\left(r;h\right)+o_{\text{a.s.}}(\overline{k}_{n}^{-1/2}),

where $o_{\text{a.s.}}(\cdot)$ is uniform for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$. We conclude from (23) that

suph(A,ρ)supr[r¯,1]supx|n,hn(Tr(Qn,rk¯n,Fn,h)ξn,hdnx)(ξ0𝔾(r)+𝔹(r;h)x)|\displaystyle\sup_{h\in\mathcal{H}\left(A,\rho\right)}\sup_{r\in[\underline{r},1]}\sup_{x\in\mathbb{R}}\left|\mathbb{P}_{n,h}^{n}\left(T_{r}\left({Q}_{n,r\bar{k}_{n},F_{n,h}}\right)-\xi_{n,h}\leq d_{n}x\right)-\mathbb{P}\left(\xi_{0}\mathbb{G}\left(r\right)+\mathbb{B}\left(r;h\right)\leq x\right)\right|
0.\displaystyle\rightarrow 0.

This implies that

k¯n1/2(Tr(Qn,rk¯n,Fn,h)ξn,h)=ξ0𝔾(r)+𝔹(r;h)+on,hn(1),\overline{k}_{n}^{1/2}\left(T_{r}\left(Q_{n,r\bar{k}_{n},F_{n,h}}\right)-\xi_{n,h}\right)=\xi_{0}\mathbb{G}\left(r\right)+\mathbb{B}\left(r;h\right)+o_{\mathbb{P}_{n,h}^{n}}\left(1\right),

where $o_{\mathbb{P}_{n,h}^{n}}\left(1\right)$ is uniform for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$. This completes the proof. ∎
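Although the proof is complete, we note that the limit process $\mathbb{G}$ in (27) is straightforward to simulate, which is one way critical values for uniform-in-$r$ inference can be tabulated. The sketch below (our own illustration; the grid size and the truncation of the integral near the origin are discretization choices) draws paths of $\mathbb{G}$ and checks that $\mathrm{Var}(\mathbb{G}(r))=1/r$, consistent with the $N(0,\xi_{0}^{2}/r)$ limit of the Hill estimator based on $\lfloor r\bar{k}_{n}\rfloor$ order statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
m, reps = 20000, 5000                       # grid size on (0, 1] and number of paths
s = np.arange(1, m + 1) / m                 # grid points s_j = j/m (origin truncated)
ds = 1.0 / m
r_grid = np.array([0.25, 0.5, 1.0])
idx = (r_grid * m).astype(int) - 1

G = np.empty((reps, r_grid.size))
for b in range(reps):
    W = np.cumsum(rng.normal(scale=np.sqrt(ds), size=m))  # Wiener path on the grid
    I = np.cumsum(W / s) * ds                             # int_0^s u^{-1} W(u) du
    G[b] = (I[idx] - W[idx]) / r_grid                     # G(r) = r^{-1}(I(r) - W(r))

print(G.var(axis=0), 1.0 / r_grid)          # empirical variances ≈ (4, 2, 1)
# quantiles of max_r |G(r)| over an r-grid would give uniform critical values
```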

A.3 Proof of Theorem 3

Proof.

We use the notation $g_{n}=\overline{k}_{n}/\left(np_{n}\right)$. Condition 2 guarantees that $g_{n}\rightarrow\infty$. Using the tail quantile process approximation (24), we obtain

rk¯nlog(gn)(F^n,h1(1pn)Fn,h1(1pn)1)\displaystyle\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\left(\frac{\hat{F}_{n,h}^{-1}\left(1-p_{n}\right)}{F_{n,h}^{-1}\left(1-p_{n}\right)}-1\right)
=\displaystyle= rk¯nlog(gn)(Yn,nrk¯nFn,h1(1rk¯n/n)gnξ^(n,rk¯n)Fn,h1(1rk¯n/n)Fn,h1(1pn)1)\displaystyle\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\left(\frac{Y_{n,n-\left\lfloor r\overline{k}_{n}\right\rfloor}}{F_{n,h}^{-1}\left(1-r\overline{k}_{n}/n\right)}g_{n}^{\hat{\xi}\left(n,r\overline{k}_{n}\right)}\frac{F_{n,h}^{-1}\left(1-r\overline{k}_{n}/n\right)}{F_{n,h}^{-1}\left(1-p_{n}\right)}-1\right)
=\displaystyle= gnξn,hFn,h1(1rk¯n/n)Fn,h1(1pn){rk¯nlog(gn)(gnξ^(n,rk¯n)ξn,h1)\displaystyle g_{n}^{\xi_{n,h}}\frac{F_{n,h}^{-1}\left(1-r\overline{k}_{n}/n\right)}{F_{n,h}^{-1}\left(1-p_{n}\right)}\left\{\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\left(g_{n}^{\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}}-1\right)\right.
+rk¯nlog(gn)(Yn,nrk¯nFn,h1(1rk¯n/n)1)gnξ^(n,rk¯n)ξn,h\displaystyle+\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\left(\frac{Y_{n,n-\lfloor r\overline{k}_{n}\rfloor}}{F_{n,h}^{-1}\left(1-r\overline{k}_{n}/n\right)}-1\right)g_{n}^{\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}}
rk¯nlog(gn)(Fn,h1(1pn)Fn,h1(1rk¯n/n)gnξn,h1)}\displaystyle\left.-\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\left(\frac{F_{n,h}^{-1}\left(1-p_{n}\right)}{F_{n,h}^{-1}\left(1-r\overline{k}_{n}/n\right)}g_{n}^{-\xi_{n,h}}-1\right)\right\}
\displaystyle\equiv C1n(C2n+C3nC4n).\displaystyle C_{1n}\left(C_{2n}+C_{3n}-C_{4n}\right).

We aim to establish

C1n=\displaystyle C_{1n}= 1+o(1),\displaystyle 1+o(1), (29)
C2n=\displaystyle C_{2n}= rk¯n(ξ^(n,rk¯n)ξn,h)+on,hn(1),\displaystyle\sqrt{r\overline{k}_{n}}\left(\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}\right)+o_{\mathbb{P}^{n}_{n,h}}(1), (30)
C3n=\displaystyle C_{3n}= on,hn(1),and\displaystyle o_{\mathbb{P}^{n}_{n,h}}(1),\qquad\text{and} (31)
C4n=\displaystyle C_{4n}= o(1),\displaystyle o(1), (32)

where the $o(1)$ and $o_{\mathbb{P}^{n}_{n,h}}(1)$ terms are all uniform over both $h\in\mathcal{H}\left(A,\rho\right)$ and $r\in[\underline{r},1]$.

First, Condition 1 yields that $r\overline{k}_{n}/\left(nt_{n}\right)\rightarrow r$. Thus, (29) follows from

C1n\displaystyle C_{1n} =(i)exp(rk¯n/n1dn(h(tn1v)h(0))v𝑑vpn1dn(h(tn1v)h(0))v𝑑v)\displaystyle\overset{(i)}{=}\exp\left(\int_{r\overline{k}_{n}/n}^{1}\frac{d_{n}\left(h\left({t_{n}^{-1}}v\right)-h\left(0\right)\right)}{v}dv-\int_{p_{n}}^{1}\frac{d_{n}\left(h\left({t_{n}^{-1}}v\right)-h\left(0\right)\right)}{v}dv\right)
=exp(dnmin{rk¯n/n,pn}max{rk¯n/n,pn}h(tn1v)h(0)v𝑑v)\displaystyle=\exp\left(d_{n}\int_{\min\{r\overline{k}_{n}/n,p_{n}\}}^{\max\{r\overline{k}_{n}/n,p_{n}\}}\frac{h\left({t_{n}^{-1}}v\right)-h\left(0\right)}{v}dv\right)
\overset{(ii)}{=}\exp\left(d_{n}\int_{\min\{r\overline{k}_{n}/(nt_{n}),\,p_{n}t_{n}^{-1}\}}^{\max\{r\overline{k}_{n}/(nt_{n}),\,p_{n}t_{n}^{-1}\}}\frac{h\left(s\right)-h\left(0\right)}{s}ds\right)
exp(dn01|h(s)h(0)|s𝑑s)\displaystyle\leq\exp\left(d_{n}\int_{0}^{1}\frac{\left|h\left(s\right)-h\left(0\right)\right|}{s}ds\right)
(iii)exp(dn01Asρs𝑑s)\displaystyle\overset{(iii)}{\leq}\exp\left(d_{n}\int_{0}^{1}\frac{As^{\rho}}{s}ds\right)
=(iv)1+O(dn),\displaystyle\overset{(iv)}{=}1+O(d_{n}),

where equality (i) follows from (4), equality (ii) follows from the change of variables $s=t_{n}^{-1}v$, the subsequent inequality holds for all sufficiently large $n$ because $r\overline{k}_{n}/\left(nt_{n}\right)\rightarrow r\leq 1$ and $p_{n}t_{n}^{-1}\rightarrow 0$ (by Conditions 1 and 2), inequality (iii) follows from $\left|h\left(s\right)-h\left(0\right)\right|\leq As^{\rho}$ for $h\in\mathcal{H}\left(A,\rho\right)$, and equality (iv) follows from $d_{n}\rightarrow 0$ and $\exp\left(a_{n}\right)=1+a_{n}+O(a_{n}^{2})$ for any sequence $a_{n}\rightarrow 0$.

Next, consider $C_{2n}$. Note that $(\hat{\xi}(n,r\overline{k}_{n})-\xi_{n,h})\log g_{n}=\sqrt{r\overline{k}_{n}}(\hat{\xi}(n,r\overline{k}_{n})-\xi_{n,h})\times\log g_{n}/\sqrt{r\overline{k}_{n}}=o_{\mathbb{P}^{n}_{n,h}}(1)$ uniformly for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$ by Theorem 2 and Condition 2. Then, using $\exp(a_{n})=1+a_{n}+O(a_{n}^{2})$ for any generic sequence $a_{n}\rightarrow 0$, we have that

rk¯nlog(gn)(gnξ^(n,rk¯n)ξn,h1)\displaystyle\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\left(g_{n}^{\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}}-1\right)
=\displaystyle= rk¯n(ξ^(n,rk¯n)ξn,h)exp((ξ^(n,rk¯n)ξn,h)loggn)1(ξ^(n,rk¯n)ξn,h)loggn\displaystyle\sqrt{r\overline{k}_{n}}\left(\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}\right)\frac{\exp\left(\left(\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}\right)\log g_{n}\right)-1}{\left(\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}\right)\log g_{n}}
=\displaystyle= rk¯n(ξ^(n,rk¯n)ξn,h)(1+O((ξ^(n,rk¯n)ξn,h)loggn))\displaystyle\sqrt{r\overline{k}_{n}}\left(\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}\right)\left(1+O\left(\left(\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}\right)\log g_{n}\right)\right)
=\displaystyle= rk¯n(ξ^(n,rk¯n)ξn,h)(1+on,hn(1))\displaystyle\sqrt{r\overline{k}_{n}}\left(\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}\right)\left(1+o_{\mathbb{P}^{n}_{n,h}}(1)\right)

uniformly for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$. This yields (30).

Next, consider $C_{3n}$. The equivalence (22) and the tail quantile approximation (24) imply that

rk¯n(Yn,nrk¯nFn,h1(1rk¯n/n)1)=On,hn(1)\sqrt{r\overline{k}_{n}}\left(\frac{Y_{n,n-\lfloor r\overline{k}_{n}\rfloor}}{F_{n,h}^{-1}\left(1-r\overline{k}_{n}/n\right)}-1\right)=O_{\mathbb{P}^{n}_{n,h}}(1)

uniformly for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$. The previous step deriving (30) also implies that $g_{n}^{\hat{\xi}\left(n,r\overline{k}_{n}\right)-\xi_{n,h}}=1+o_{\mathbb{P}^{n}_{n,h}}(1)$ uniformly over the same ranges. Therefore, $C_{3n}=\left(\log g_{n}\right)^{-1}O_{\mathbb{P}^{n}_{n,h}}(1)\left(1+o_{\mathbb{P}^{n}_{n,h}}(1)\right)=o_{\mathbb{P}^{n}_{n,h}}(1)$ uniformly for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$, implying (31).

Finally, consider $C_{4n}$. We have

C4n\displaystyle C_{4n} =(i)rk¯nlog(gn)(exp(dnmin{pn,rk¯n/n}max{pn,rk¯n/n}(h(tn1v)h(0))v𝑑v)1)\displaystyle\overset{(i)}{=}\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\left(\exp\left(d_{n}\int_{\min\{p_{n},r\overline{k}_{n}/n\}}^{\max\{p_{n},r\overline{k}_{n}/n\}}\frac{\left(h\left({t_{n}^{-1}}v\right)-h\left(0\right)\right)}{v}dv\right)-1\right)
=(ii)rk¯nlog(gn)×O(dn)\displaystyle\overset{(ii)}{=}\frac{\sqrt{r\overline{k}_{n}}}{\log\left(g_{n}\right)}\times O\left(d_{n}\right)
=(iii)o(1)\displaystyle\overset{(iii)}{=}o(1)

uniformly for all $r\in[\underline{r},1]$ and $h\in\mathcal{H}\left(A,\rho\right)$, where equality (i) follows from substituting (4), equality (ii) follows from the derivation of (29), and equality (iii) follows since $\sqrt{r\overline{k}_{n}}d_{n}=O(1)$ by Condition 1 and $\log(g_{n})\rightarrow\infty$ by Condition 2. The proof is complete upon combining (29)–(32). ∎
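To make the objects in this decomposition concrete, the following sketch (illustrative only: it fixes $r=1$ and uses an exactly Pareto sample, so that $h=0$ and the bias term vanishes) computes the extreme quantile estimator $\hat{F}^{-1}(1-p_{n})=Y_{n,n-\overline{k}_{n}}\,g_{n}^{\hat{\xi}(n,\overline{k}_{n})}$ implicit in the first display of this proof, and compares it with the true quantile $F^{-1}(1-p_{n})$.

```python
import numpy as np

rng = np.random.default_rng(0)
xi0, n, k, p = 0.5, 100000, 500, 1e-5        # p = p_n with n*p_n = 1 (far in the tail)

Y = np.sort(rng.uniform(size=n) ** (-xi0))   # Pareto sample with F^{-1}(1 - q) = q**(-xi0)

hill = np.mean(np.log(Y[n - k:] / Y[n - k - 1]))   # Hill estimator xi-hat(n, k)
g = k / (n * p)                                    # g_n = k/(n p_n) = 500
q_hat = Y[n - k - 1] * g ** hill                   # Y_{n,n-k} * g_n**(xi-hat)
print(q_hat, p ** (-xi0))                          # estimate vs. true quantile ≈ 316.23
```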

Appendix B: Choice of $\overline{k}_{n}$

In this appendix section, we present a data-driven choice rule for $\overline{k}_{n}$, following Guillou and Hall (2001), for completeness and for the reader's convenience. For other methods of choosing the tail threshold, see, for example, Drees and Kaufmann (1998), Geluk and Peng (2000), Danielsson et al. (2001), and Danielsson et al. (2016). See also Resnick (2007, Chapter 4.4) for a review.

We use the shorthand notation $k=\overline{k}_{n}$ in this section. Define $Z_{i}=i\log(Y_{n:n-i+1}/Y_{n:n-i})$ for $i=1,\ldots,n-1$. If $Y_{i}$ is exactly Pareto distributed with exponent $1/\xi_{0}$, then the $Z_{i}$ are i.i.d. exponentially distributed with $\mathbb{E}\left[Z_{i}\right]=\xi_{0}$, by Rényi's representation of exponential order statistics. Given this observation, we further construct antisymmetric weights $\{w_{j}\}_{j=1}^{k}$ such that $w_{j}=-w_{k-j+1}$ and hence $\sum_{j=1}^{k}w_{j}=0$. Then, the statistic

𝒯k=(j=1kwj2)1/2ξ^(n,k)1Uk where Uk=j=1kwjZj\mathcal{T}_{k}=\left(\sum_{j=1}^{k}w_{j}^{2}\right)^{-1/2}\hat{\xi}(n,k)^{-1}U_{k}\text{ where }U_{k}=\sum_{j=1}^{k}w_{j}Z_{j}

has zero mean and unit variance, provided that $Y_{i}$ is exactly Pareto distributed and $\hat{\xi}(n,k)=\xi_{0}$.

To evaluate deviations from this approximation, we define the following criterion based on a moving average of $\mathcal{T}_{k}^{2}$:

𝒞k=((2l+1)1j=ll𝒯k+j2)1/2,\mathcal{C}_{k}=\left(\left(2l+1\right)^{-1}\sum_{j=-l}^{l}\mathcal{T}_{k+j}^{2}\right)^{1/2},

where $l$ equals the integer part of $k/2$. Intuitively, the larger $k$ is, the larger the bias in the Pareto tail approximation, and hence the more $\mathcal{C}_{k}$ exceeds one. To obtain an implementable rule, we follow Guillou and Hall (2001) in using $w_{j}=\mathrm{sgn}\left(k-2j+1\right)\left|k-2j+1\right|$ and propose to choose the smallest $k$ satisfying $\mathcal{C}_{t}>c_{\text{crit}}$ for all $t\geq k$, where $c_{\text{crit}}$ is a pre-specified constant. Again following Guillou and Hall (2001), we set $c_{\text{crit}}=1.25$. For convenience of reference, we write the rule explicitly as

k^=min1kn{k:𝒞t>ccrit for all tk}.\hat{k}=\min_{1\leq k\leq n}\{k:\mathcal{C}_{t}>c_{\text{crit}}\text{ for all }t\geq k\}. (33)
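For readers who wish to implement (33), the following sketch codes the rule end to end. It is our own translation into Python under stated simplifications: $\mathrm{sgn}(k-2j+1)\left|k-2j+1\right|$ is computed as $k-2j+1$, the search is capped at a user-chosen k_max for speed, and the moving-average window is truncated at the boundaries.

```python
import numpy as np

def guillou_hall_k(Y, c_crit=1.25, k_max=None):
    """Data-driven threshold k-hat as in (33); a minimal sketch."""
    Y = np.sort(np.asarray(Y, dtype=float))[::-1]      # descending order statistics
    n = len(Y)
    k_max = n - 2 if k_max is None else min(k_max, n - 2)
    i = np.arange(1, n)
    Z = i * np.log(Y[:-1] / Y[1:])                     # Z_i = i log(Y_{n:n-i+1}/Y_{n:n-i})
    xi_hat = np.cumsum(Z) / i                          # Hill estimates, xi_hat[k-1] = xi-hat(n, k)

    T = np.full(k_max + 1, np.nan)
    for k in range(2, k_max + 1):
        w = k - 2.0 * np.arange(1, k + 1) + 1          # antisymmetric weights, sum to zero
        T[k] = (w @ Z[:k]) / (np.sqrt(w @ w) * xi_hat[k - 1])

    C = np.full(k_max + 1, np.nan)
    for k in range(2, k_max + 1):
        l = k // 2
        block = T[max(2, k - l): min(k_max, k + l) + 1]
        C[k] = np.sqrt(np.mean(block ** 2))            # moving average of T^2 (truncated window)

    exceed = C > c_crit                                # NaN entries compare as False
    all_above = np.logical_and.accumulate(exceed[::-1])[::-1]  # C_t > c_crit for all t >= k
    hits = np.flatnonzero(all_above)
    return int(hits[0]) if hits.size else k_max

# Example on a simulated Pareto sample with xi0 = 0.5:
rng = np.random.default_rng(0)
print(guillou_hall_k(rng.uniform(size=5000) ** (-0.5), k_max=1000))
```

The returned $\hat{k}$ can then be plugged into the Hill estimator $\hat{\xi}(n,\hat{k})$ used throughout the paper.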