

Quantile index regression

Yingying Zhang$^{a}$, Yuefeng Si$^{b}$, Guodong Li$^{b}$ and Chih-Ling Tsai$^{c}$
$^{a}$East China Normal University, $^{b}$University of Hong Kong and $^{c}$University of California at Davis
Abstract

Estimating the structure of high or low quantiles has become an important subject and has attracted increasing attention across numerous fields. However, due to data sparsity at the tails, it is usually challenging to obtain reliable estimates, especially for high-dimensional data. This paper suggests imposing a flexible parametric structure on the tails, which enables us to conduct estimation at quantile levels with rich observations and then extrapolate the fitted structure to the far tails. The proposed model depends on some quantile indices and is hence called the quantile index regression. Moreover, the composite quantile regression method is employed to obtain non-crossing quantile estimators, and this paper further establishes their theoretical properties, including asymptotic normality for the case with low-dimensional covariates and non-asymptotic error bounds for that with high-dimensional covariates. Simulation studies and an empirical example are presented to illustrate the usefulness of the new model.

Keywords: Asymptotic normality; High-dimensional analysis; Non-asymptotic property; Partially parametric model; Quantile regression.

1 Introduction

Quantile regression, proposed by Koenker and Bassett (1978), has been widely used across various fields such as biological science, ecology, economics, finance, and machine learning; see, e.g., Cade and Noon (2003), Yu et al. (2003), Meinshausen and Ridgeway (2006), Linton and Xiao (2017) and Koenker (2017). More references on quantile regression can be found in the books of Koenker (2005) and Davino et al. (2014). Quantile regression has also been studied for high-dimensional data; see, e.g., Belloni and Chernozhukov (2011), Wang et al. (2012b) and Zheng et al. (2015). On the other hand, due to practical needs, estimating the structure at high or low quantiles is becoming an increasingly popular subject, with examples including the risk of large losses on investments in finance (Kuester et al., 2006; Zheng et al., 2018), high tropical cyclone intensity and extreme waves in climatology (Elsner et al., 2008; Jagger and Elsner, 2008; Lobeto et al., 2021), and low infant birth weights in medicine (Abrevaya, 2001; Chernozhukov et al., 2020). It is hence natural to make inferences at these extreme quantiles for high-dimensional data, although this remains an open problem.

There are two types of approaches in the literature to modeling the structures at tails. The first is based on the conditional distribution function (CDF) of the response $Y$ given a set of covariates $\boldsymbol{X}$, which is usually assumed to have a semiparametric structure at the tails; see, e.g., the Pareto-type structures in Beirlant and Goegebeur (2004) and Wang and Tsai (2009). However, this method cannot provide conditional quantiles in explicit form. Later, Noufaily and Jones (2013) considered a fully parametric form, the generalized gamma distribution, for the CDF and then inverted the fitted distribution to obtain conditional quantiles. However, as indicated in Racine and Li (2017), indirect inverse-CDF-based estimators may not be efficient in tail regions when the data have unbounded support.

The second approach is extremal quantile regression, which combines quantile regression with extreme value theory to estimate the conditional quantile at a very high or low level $\tau_{n}$ satisfying $(1-\tau_{n})=O(n^{-1})$, with $n$ being the sample size; see Chernozhukov (2005). Specifically, this is a two-stage approach: (i) perform the estimation at intermediate quantiles $\tau_{n}^{*}$ with $(1-\tau_{n}^{*})^{-1}=o(n)$; and (ii) extrapolate the fitted quantile structures to extreme quantiles via the extreme value index associated with the tails of the conditional distributions; see Wang et al. (2012a) and Wang and Li (2013) for details. The key to this method is to exploit the feasible estimation at intermediate levels, since there are relatively more observations around these levels. However, intermediate quantiles are still in the far tails, and the corresponding data points may not be rich enough for the case with many covariates.

In order to handle the case with high-dimensional covariates, along the lines of extremal quantile regression, this paper suggests conducting estimation at quantile levels with much richer observations, say some fixed levels around $\tau_{0}$, and then extrapolating the estimated results to extreme quantiles by fully or partially assuming a form of the conditional distribution or quantile function on $[\tau_{0},1]$. Note that many quantile functions have explicit forms, such as those of the generalized lambda and Burr XII distributions (Gilchrist, 2000). In particular, the generalized lambda distribution can provide a very accurate approximation to some Pareto-type and extreme value distributions, as well as to commonly used distributions such as the Gaussian (Vasicek, 1976; Gilchrist, 2000). These flexible parametric forms can be assumed for the quantile function on $[\tau_{0},1]$, thereby avoiding the drawback of inverting a distribution function.

Specifically, for a predetermined interval $\mathcal{I}\subset[0,1]$, the quantile function of the response $Y$ is assumed to have an explicit form $Q(\tau,\boldsymbol{\theta})$ for each level $\tau\in\mathcal{I}$, up to unknown parameters or indices $\boldsymbol{\theta}$. By further letting $\boldsymbol{\theta}$ be a function of the covariates $\boldsymbol{X}$, we can then define the conditional quantile function as follows:

Q_{Y}(\tau|\boldsymbol{X})=\inf\{y:F_{Y}(y|\boldsymbol{X})\geq\tau\}=Q(\tau,\boldsymbol{\theta}(\boldsymbol{X})),\quad\tau\in\mathcal{I},   (1.1)

where $F_{Y}(\cdot|\boldsymbol{X})$ is the distribution of $Y$ conditional on $\boldsymbol{X}$, and $\boldsymbol{\theta}(\boldsymbol{X})$ is a $d$-dimensional parametric function. Note that $\boldsymbol{\theta}(\boldsymbol{X})$ can be regarded as $d$ indices, and model (1.1) is hence called the quantile index regression (QIR) model for simplicity. In practice, to handle high quantiles, we may take $\mathcal{I}=[\tau_{0},1]$ with a fixed value of $\tau_{0}$ and then conduct a composite quantile regression (CQR) estimation for model (1.1) at levels within $\mathcal{I}$ that have richer observations. Subsequently, the fitted QIR model can be used to predict extreme quantiles. More importantly, since the estimation is conducted at fixed quantile levels, there is no difficulty in handling the case with high-dimensional covariates. In addition, compared with the two types of approaches in the literature mentioned above, the proposed method can not only estimate quantile regression functions effectively but also forecast extreme quantiles directly. Finally, the QIR model naturally yields non-crossing quantile regression estimators since its quantile function is nondecreasing in $\tau$.

The proposed model is introduced in detail in Section 2, and our three main contributions are summarized below:

  • (a)

    When conducting the CQR estimation, we encounter the challenge of model identification; this problem is carefully studied for the flexible Tukey lambda distribution in Section 2.2.

  • (b)

    Section 2.2 also derives the asymptotic normality of the CQR estimators for the case with low-dimensional covariates. This is a challenging task since the corresponding objective function is non-convex and non-differentiable, and we overcome the difficulty by adopting the bracketing method of Pollard (1985).

  • (c)

    Section 2.3 establishes non-asymptotic properties of a regularized high-dimensional estimation. This is also nontrivial, owing to the non-convexity and non-differentiability noted in (b).

The rest of this paper is organized as follows. Section 3 discusses some implementation issues in searching for these estimators. Numerical studies, including simulation experiments and a real data analysis, are given in Sections 4 and 5, and Section 6 provides a short conclusion and discussion. All technical details are relegated to the Appendix.

For the sake of convenience, this paper denotes vectors and matrices by boldface letters, e.g., $\mathbf{X}$ and $\mathbf{Y}$, and scalars by regular letters, e.g., $X$ and $Y$. In addition, for any two real-valued sequences $\{a_{n}\}$ and $\{b_{n}\}$, we denote $a_{n}\gtrsim b_{n}$ (or $a_{n}\lesssim b_{n}$) if there exists a constant $c>0$ such that $a_{n}\geq cb_{n}$ (or $a_{n}\leq cb_{n}$) for all $n$, and $a_{n}\asymp b_{n}$ if $a_{n}\gtrsim b_{n}$ and $a_{n}\lesssim b_{n}$. For a generic vector $\mathbf{X}$ and matrix $\mathbf{Y}$, let $\|\mathbf{X}\|$, $\|\mathbf{X}\|_{1}$ and $\|\mathbf{Y}\|_{\mathrm{F}}$ represent the Euclidean norm, $\ell_{1}$-norm and Frobenius norm, respectively.

2 Quantile index regression

2.1 Quantile index regression model

Consider a response $Y$ and a $p$-dimensional vector of covariates $\boldsymbol{X}=(X_{1},\ldots,X_{p})^{\prime}$. We then rewrite the quantile function of $Y$ conditional on $\boldsymbol{X}$ at (1.1) with an explicit form $\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta})$ below:

Q_{Y}(\tau|\boldsymbol{X})=Q(\tau,\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta})),\quad\tau\in\mathcal{I},   (2.1)

where $\mathcal{I}\subset[0,1]$ is an interval or the union of multiple disjoint intervals, $\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta})=(\theta_{1}(\boldsymbol{X},\boldsymbol{\beta}),\ldots,\theta_{d}(\boldsymbol{X},\boldsymbol{\beta}))^{\prime}$, $\boldsymbol{\beta}=(\boldsymbol{\beta}_{1}^{\prime},\ldots,\boldsymbol{\beta}_{d}^{\prime})^{\prime}$, $\boldsymbol{\beta}_{j}=(\beta_{j1},\ldots,\beta_{jp})^{\prime}$, $\theta_{j}(\boldsymbol{X},\boldsymbol{\beta})=g_{j}(\boldsymbol{X}^{\prime}\boldsymbol{\beta}_{j})$, the link functions $g_{j}^{-1}(\cdot)$ are monotonic for $1\leq j\leq d$, and an intercept term can be included by letting $X_{1}=1$. We call model (2.1) the quantile index regression (QIR) model for simplicity, and two examples of $Q(\cdot,\cdot)$ are first introduced below to illustrate the new model.

Example 1.

Consider the location shift model, $Q(\tau,\theta)=\theta+Q_{\Phi}(\tau)$ for all $\tau\in[\tau_{0},1]$, where $\tau_{0}\in(0,1)$ is a fixed level, $\theta\in\mathbb{R}$ is the location index and $Q_{\Phi}(\tau)$ is the quantile function of the standard normal distribution. Under the identity link, $\theta(\boldsymbol{X},\boldsymbol{\beta})=\boldsymbol{X}^{\prime}\boldsymbol{\beta}$, we obtain a linear quantile regression model at $\tau_{0}$. Then, after estimation, we can make a prediction at any level $\tau\in(\tau_{0},1)$.
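As a quick illustration of this extrapolation idea, here is a minimal sketch in Python; the function name and the coefficient values are hypothetical, and only the model structure follows Example 1.

```python
# A minimal sketch of Example 1: fit at tau0, then extrapolate to any tau in
# (tau0, 1) by shifting the standard normal quantile function.
import numpy as np
from scipy.stats import norm

def location_shift_quantile(tau, x, beta_hat):
    """Q_Y(tau | X = x) = x'beta + Phi^{-1}(tau) under the location shift model."""
    return x @ beta_hat + norm.ppf(tau)

beta_hat = np.array([1.0, 0.5])   # hypothetical coefficients fitted at level tau0
x = np.array([1.0, 0.2])          # X_1 = 1 carries the intercept
print(location_shift_quantile(0.995, x, beta_hat))   # extrapolated far-tail quantile
```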

Example 2.

Consider the Tukey lambda distribution (Vasicek, 1976), which is defined by its quantile function

Q(\tau,\boldsymbol{\theta})=\theta_{1}+\theta_{2}\frac{\tau^{\theta_{3}}-(1-\tau)^{\theta_{3}}}{\theta_{3}},

where $\boldsymbol{\theta}=(\theta_{1},\theta_{2},\theta_{3})^{\prime}$, and $\theta_{1}\in\mathbb{R}$, $\theta_{2}>0$ and $\theta_{3}\leq 1$ are the location, scale and tail indices, respectively. Due to its flexibility, the Tukey lambda distribution can provide an accurate approximation to many commonly used distributions, such as the normal, logistic, Weibull, uniform and Cauchy distributions; see Gilchrist (2000). Model (2.1) is then expected to perform better when combined with this distribution.
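Since the Tukey lambda quantile function is central to what follows, a minimal sketch is given below; the demo values follow the discussion in Section 4.1, where a tail index near 0.14 gives normal-like tail behavior (up to scale) and smaller values give heavier tails.

```python
# A minimal sketch of the Tukey lambda quantile function (theta3 != 0 required).
import numpy as np

def tukey_lambda_quantile(tau, theta1, theta2, theta3):
    """Q(tau, theta) = theta1 + theta2 * {tau^theta3 - (1 - tau)^theta3} / theta3."""
    tau = np.asarray(tau, dtype=float)
    return theta1 + theta2 * (tau**theta3 - (1.0 - tau)**theta3) / theta3

taus = np.array([0.95, 0.99, 0.995])
print(tukey_lambda_quantile(taus, 0.0, 1.0, 0.14))   # normal-like tail decay
print(tukey_lambda_quantile(taus, 0.0, 1.0, -0.5))   # much heavier tails
```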

In the literature, many quantile functions have explicit forms, such as those of the generalized lambda and Burr XII distributions (Gilchrist, 2000; Fournier et al., 2007). For example, the generalized lambda distribution has the form

Q(\tau,\boldsymbol{\theta})=\theta_{1}+\theta_{2}\left(\frac{\tau^{\theta_{3}}-1}{\theta_{3}}-\frac{(1-\tau)^{\theta_{4}}-1}{\theta_{4}}\right),

where the indices $\theta_{3}$ and $\theta_{4}$ control the right and left tails, respectively. Note that it reduces to the Tukey lambda distribution when $\theta_{3}=\theta_{4}$. This indicates that the generalized lambda distribution can be considered if we focus on quantiles over the full range, i.e. $\mathcal{I}=[0,1]$. On the other hand, the Tukey lambda distribution may be a better choice if our interest is in quantiles on one side only, i.e. $\mathcal{I}\subset[0,0.5)$ or $\mathcal{I}\subset(0.5,1]$.
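A companion sketch of the generalized lambda quantile function; when $\theta_{3}=\theta_{4}$ the two $-1/\theta$ terms cancel and the Tukey lambda form is recovered exactly. The function name is ours.

```python
# A sketch of the generalized lambda quantile function with separate tail indices.
import numpy as np

def generalized_lambda_quantile(tau, theta1, theta2, theta3, theta4):
    tau = np.asarray(tau, dtype=float)
    right = (tau**theta3 - 1.0) / theta3            # theta3 controls the right tail
    left = ((1.0 - tau)**theta4 - 1.0) / theta4     # theta4 controls the left tail
    return theta1 + theta2 * (right - left)

# With theta3 == theta4 this coincides with the Tukey lambda quantile function.
print(generalized_lambda_quantile(0.99, 0.0, 1.0, 0.2, 0.2))
```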

2.2 Low-dimensional composite quantile regression estimation

Denote the observed data by $\{(Y_{i},\boldsymbol{X}_{i}^{\prime})^{\prime},i=1,\ldots,n\}$, which are independent and identically distributed (i.i.d.) samples of the random vector $(Y,\boldsymbol{X})$, where $Y_{i}$ is the response, $\boldsymbol{X}_{i}=(X_{i1},\ldots,X_{ip})^{\prime}$ contains the $p$ covariates, and $n$ is the number of observations.

Let $\tau_{1}\leq\tau_{2}\leq\ldots\leq\tau_{K}$ be $K$ fixed quantile levels with $\tau_{k}\in\mathcal{I}$ for all $1\leq k\leq K$. To achieve higher efficiency, we consider the composite quantile regression (CQR) estimator below:

\widehat{\boldsymbol{\beta}}_{n}=\arg\min_{\boldsymbol{\beta}}L_{n}(\boldsymbol{\beta})\quad\text{and}\quad L_{n}(\boldsymbol{\beta})=\sum_{k=1}^{K}\sum_{i=1}^{n}\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X}_{i},\boldsymbol{\beta}))\},   (2.2)

where $\rho_{\tau}(x)=x\{\tau-I(x<0)\}$ is the quantile check function; see Zou and Yuan (2008) and Kai et al. (2010). Note that $Q(\tau,\boldsymbol{\theta}(\mathbf{X},\widehat{\boldsymbol{\beta}}_{n}))$ has the non-crossing property with respect to $\tau$ since, for each $\boldsymbol{\theta}$, $Q(\tau,\boldsymbol{\theta})$ is a nondecreasing quantile function.
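The objective at (2.2) is straightforward to code. Below is a minimal sketch, where `quantile_fn` and `theta_fn` are hypothetical user-supplied stand-ins for $Q(\tau,\boldsymbol{\theta})$ and $\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta})$.

```python
# A minimal sketch of the CQR objective L_n(beta) at (2.2).
import numpy as np

def check_loss(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - I(u < 0))."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

def cqr_loss(beta, Y, X, taus, quantile_fn, theta_fn):
    """Composite check loss summed over the K levels in `taus`."""
    theta = theta_fn(X, beta)    # n x d matrix of indices theta(X_i, beta)
    return sum(check_loss(Y - quantile_fn(tau, theta), tau).sum() for tau in taus)
```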

To study the asymptotic properties of $\widehat{\boldsymbol{\beta}}_{n}$, we first consider $\boldsymbol{\beta}_{0}=(\boldsymbol{\beta}_{01}^{\prime},\ldots,\boldsymbol{\beta}_{0d}^{\prime})^{\prime}=\arg\min_{\boldsymbol{\beta}}\bar{L}(\boldsymbol{\beta})$, where $\bar{L}(\boldsymbol{\beta})=E[\sum_{k=1}^{K}\rho_{\tau_{k}}\{Y-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))\}]$ is the population loss function. This minimizer may not be unique, i.e. the CQR estimation at (2.2) may suffer from an identification problem. For the sake of illustration, consider the case without covariates $\mathbf{X}$, i.e. we estimate $\boldsymbol{\theta}$ with a sequence of quantile levels $\tau_{k}\in\mathcal{I}$, $1\leq k\leq K$. Identification then requires that two different values of $\boldsymbol{\theta}$ cannot yield the same quantile function $Q(\tau_{k},\boldsymbol{\theta})$ across all $K$ levels. In other words, if there exists $\boldsymbol{\theta}\neq\boldsymbol{\theta}^{*}$ such that $Q(\tau_{k},\boldsymbol{\theta})=Q(\tau_{k},\boldsymbol{\theta}^{*})$ for all $K$ levels, then $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^{*}$ are not identifiable. In sum, to guarantee that $\boldsymbol{\beta}_{0}$ is the unique minimizer of the population loss, we make the following assumption on the quantile function $Q(\tau,\boldsymbol{\theta})$.

Assumption 1.

For any two index vectors $\boldsymbol{\theta}_{1}\neq\boldsymbol{\theta}_{2}$, there exists at least one $k$ with $1\leq k\leq K$ such that $Q(\tau_{k},\boldsymbol{\theta}_{1})\neq Q(\tau_{k},\boldsymbol{\theta}_{2})$.

Intuitively, for any quantile function $Q(\tau,\boldsymbol{\theta})$, one can always increase the number of quantile levels $K$ to make Assumption 1 hold. However, the required $K$ may also depend on the structure of the quantile function and the number of indices. For the sake of illustration, we state conditions under which the Tukey lambda distribution satisfies Assumption 1.

Lemma 1.

For the Tukey lambda distribution defined in Example 2, we have that (i) for $\tau_{k}\in(0,1)$ with $1\leq k\leq K$, Assumption 1 holds if $K\geq 4$; and (ii) for $\tau_{k}\in(0.5,1)$ (or $\tau_{k}\in(0,0.5)$) with $1\leq k\leq K$, Assumption 1 holds if and only if $K\geq 3$.

Assumption 1, together with an additional assumption on the covariates $\mathbf{X}$, allows us to show that $\boldsymbol{\beta}_{0}$ is the unique minimizer of $\bar{L}(\boldsymbol{\beta})$; see the following theorem, which is critical for establishing the asymptotic properties of $\widehat{\boldsymbol{\beta}}_{n}$.

Theorem 1.

Suppose that $E(\boldsymbol{X}\boldsymbol{X}^{\prime})$ is finite and positive definite. If Assumption 1 holds, then $\boldsymbol{\beta}_{0}$ is the unique minimizer of $\bar{L}(\boldsymbol{\beta})$.

To demonstrate the consistency of $\widehat{\boldsymbol{\beta}}_{n}$ below, we assume that the parameter space $\Theta\subset\mathbb{R}^{dp}$ is compact and that the true parameter vector $\boldsymbol{\beta}_{0}$ is an interior point of $\Theta$.

Theorem 2.

Suppose that $E\{\max_{1\leq k\leq K}\sup_{\boldsymbol{\beta}\in\Theta}\|\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))/\partial\boldsymbol{\beta}\|\}<\infty$. If the conditions of Theorem 1 hold, then $\widehat{\boldsymbol{\beta}}_{n}\rightarrow\boldsymbol{\beta}_{0}$ in probability as $n\rightarrow\infty$.

The moment condition assumed in the above theorem allows us to adopt the uniform consistency theorem of Andrews (1987) in our technical proofs. To derive the asymptotic distribution of $\widehat{\boldsymbol{\beta}}_{n}$, we introduce two additional assumptions.

Assumption 2.

For all $1\leq k\leq K$,

E\left\|\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))}{\partial\boldsymbol{\beta}}\right\|^{3}<\infty\quad\text{and}\quad E\sup_{\boldsymbol{\beta}\in\Theta}\left\|\frac{\partial^{2}Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))}{\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^{\prime}}\right\|_{\mathrm{F}}^{2}<\infty.
Assumption 3.

The conditional density $f_{Y}(y|\boldsymbol{X})$ is bounded and continuous uniformly over $\boldsymbol{X}$.

Assumption 2 is used to prove Lemma A.2 in the Appendix; see also Zhu and Ling (2011). Assumption 3 is commonly used in the quantile regression literature (Koenker, 2005; Belloni et al., 2019a), and it can be relaxed at the cost of more complicated and lengthy technical details (Kato et al., 2012; Chernozhukov et al., 2015; Galvao and Kato, 2016).

Denote

\Omega_{0}=\sum_{k^{\prime}=1}^{K}\sum_{k=1}^{K}\min\{\tau_{k},\tau_{k^{\prime}}\}\left(1-\max\{\tau_{k},\tau_{k^{\prime}}\}\right)E\left[\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))}{\partial\boldsymbol{\beta}}\frac{\partial Q(\tau_{k^{\prime}},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))}{\partial\boldsymbol{\beta}^{\prime}}\right]

and

\Omega_{1}=\sum_{k=1}^{K}E\left[f_{Y}\{Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))|\mathbf{X}\}\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))}{\partial\boldsymbol{\beta}}\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))}{\partial\boldsymbol{\beta}^{\prime}}\right].
Theorem 3.

Suppose that Assumptions 2 and 3 hold and that $\Omega_{1}$ is positive definite. If the conditions of Theorem 2 are satisfied, then $\sqrt{n}(\widehat{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0})\rightarrow N(\boldsymbol{0},\Omega_{1}^{-1}\Omega_{0}\Omega_{1}^{-1})$ in distribution as $n\rightarrow\infty$.

Note that the objective function $L_{n}(\boldsymbol{\beta})$ is non-convex and non-differentiable, which makes it challenging to establish the asymptotic normality of $\widehat{\boldsymbol{\beta}}_{n}$. We overcome the difficulty by making use of the bracketing method of Pollard (1985). Moreover, to estimate the asymptotic variance in Theorem 3, we first apply the nonparametric method of Hendricks and Koenker (1991) to estimate the quantities $f_{Y}\{Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))|\mathbf{X}\}$ for $1\leq k\leq K$, and the matrices $\Omega_{0}$ and $\Omega_{1}$ can then be approximated by sample averages.
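As a rough illustration of the density estimation step, the Hendricks and Koenker (1991) idea can be sketched as a difference quotient of the fitted quantile curve; the bandwidth `h` and the function names are our own choices, not the paper's implementation.

```python
# A sketch of the Hendricks-Koenker style density estimate: the conditional
# density at the tau-th quantile is approximated by a difference quotient of
# the fitted quantile curve. The bandwidth h is a user choice.
def density_at_quantile(tau, x, beta_hat, cond_quantile, h=0.05):
    """f_Y(Q(tau|x) | x) ~ 2h / {Q(tau + h | x) - Q(tau - h | x)}."""
    denom = cond_quantile(tau + h, x, beta_hat) - cond_quantile(tau - h, x, beta_hat)
    return 2.0 * h / denom
```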

In addition, based on the estimator $\widehat{\boldsymbol{\beta}}_{n}$, one can use $Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widehat{\boldsymbol{\beta}}_{n}))$ to predict the conditional quantile at any level $\tau^{*}\in\mathcal{I}$, and the corresponding theoretical justification can be established by directly applying the delta method (van der Vaart, 1998, Chapter 3).

Corollary 1.

Suppose that the conditions of Theorem 3 are satisfied. Then, for any $\tau^{*}\in\mathcal{I}$,

\sqrt{n}\{Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widehat{\boldsymbol{\beta}}_{n}))-Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))\}\rightarrow N(0,\boldsymbol{\delta}^{\prime}\Omega_{1}^{-1}\Omega_{0}\Omega_{1}^{-1}\boldsymbol{\delta})

in distribution as $n\rightarrow\infty$, where $\boldsymbol{\delta}=E[\partial Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))/\partial\boldsymbol{\beta}]\in\mathbb{R}^{dp}$.

2.3 High-dimensional regularized estimation

This subsection considers the case with high-dimensional covariates, i.e., $p\gg n$, where the true parameter vector $\boldsymbol{\beta}_{0}$ is assumed to be $s$-sparse, i.e. the number of nonzero elements in $\boldsymbol{\beta}_{0}$ is at most $s>0$. A regularized CQR estimation can then be introduced:

\widetilde{\boldsymbol{\beta}}_{n}=\arg\min_{\boldsymbol{\beta}\in\Theta}n^{-1}L_{n}(\boldsymbol{\beta})+\sum_{j=1}^{d}p_{\lambda}(\boldsymbol{\beta}_{j}),   (2.3)

where $\Theta$ is given in Theorem 4, and $p_{\lambda}$ is a penalty function depending on a tuning (regularization) parameter $\lambda\in\mathbb{R}^{+}$ with $\mathbb{R}^{+}=(0,\infty)$.

Consider the loss function $L_{n}(\boldsymbol{\beta})=\sum_{k=1}^{K}\sum_{i=1}^{n}\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X}_{i},\boldsymbol{\beta}))\}$ defined at (2.2); $Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X}_{i},\boldsymbol{\beta}))$ is usually nonconvex with respect to $\boldsymbol{\beta}$. As a result, $L_{n}(\boldsymbol{\beta})$ will be nonconvex even though the check loss $\rho_{\tau}(\cdot)$ is convex, so there is no additional harm in using nonconvex penalty functions. Specifically, we consider the component-wise penalization

\sum_{j=1}^{d}p_{\lambda}(\boldsymbol{\beta}_{j})=\sum_{j=1}^{d}\sum_{l=1}^{p}p_{\lambda}(\beta_{jl}),

where $p_{\lambda}(\cdot)$ is possibly nonconvex and satisfies the following assumption.

Assumption 4.

The univariate function $p_{\lambda}(\cdot)$ satisfies the following conditions: (i) it is symmetric around zero with $p_{\lambda}(0)=0$; (ii) it is nondecreasing on the nonnegative real line; (iii) the function $p_{\lambda}(t)/t$ is nonincreasing with respect to $t\in\mathbb{R}^{+}$; (iv) it is differentiable for all $t\neq 0$ and subdifferentiable at $t=0$, with $\lim_{t\rightarrow 0^{+}}p_{\lambda}^{\prime}(t)=\lambda L$ and $L$ being a constant; (v) there exists $\mu>0$ such that $p_{\lambda,\mu}(t)=p_{\lambda}(t)+\frac{\mu}{2}t^{2}$ is convex.

The above is the $\mu$-amenability assumption given in Loh and Wainwright (2015) and Loh (2017), and it requires the penalty function to be not too far from convex. Note that popular penalty functions, including the SCAD (Fan and Li, 2001) and MCP (Zhang, 2010), satisfy the above properties.
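For concreteness, a minimal sketch of the SCAD penalty is given below; the closed form follows Fan and Li (2001), with the shape parameter `a` (commonly set to 3.7) playing the role of $\alpha$ in Section 3.1.

```python
# A minimal sketch of the SCAD penalty of Fan and Li (2001).
import numpy as np

def scad_penalty(t, lam, a=3.7):
    t = np.abs(np.asarray(t, dtype=float))
    small = lam * t                                            # |t| <= lam
    mid = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))    # lam < |t| <= a*lam
    large = lam**2 * (a + 1) / 2                               # |t| > a*lam
    return np.where(t <= lam, small, np.where(t <= a * lam, mid, large))
```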

In the literature on nonconvex penalized quantile regression, Jiang et al. (2012) studied nonlinear quantile regression with the SCAD regularizer from an asymptotic viewpoint, but it can only handle the case with $p=o(n^{1/3})$. Wang et al. (2012b) and Sherwood et al. (2016) considered the case where $p$ grows exponentially with $n$, but their proof techniques depend heavily on the condition that the loss function can be represented as a difference of two convex functions. However, $L_{n}(\boldsymbol{\beta})$ does not meet this requirement since the quantile function $Q(\tau,\boldsymbol{\theta})$ can be nonconvex.

On the other hand, non-asymptotic properties have recently attracted considerable attention in the theory of high-dimensional analysis; see, e.g., Belloni and Chernozhukov (2011), Sivakumar and Banerjee (2017) and Pan and Zhou (2021). This subsection attempts to study such properties for our proposed quantile estimators, which is a nontrivial task since existing results focus only on linear quantile regression. Loh and Wainwright (2015) and Loh (2017) studied the non-asymptotic properties of M-estimators with both nonconvex losses and regularizers, but they required the loss function to be twice differentiable. The technical proofs in the Appendix follow the framework of Loh and Wainwright (2015) and Loh (2017), and some new techniques are developed to tackle the nondifferentiability of the quantile check function.

Let $\boldsymbol{\theta}(\boldsymbol{\gamma})=(g_{1}(\gamma_{1}),\ldots,g_{d}(\gamma_{d}))^{\prime}$ with the $g_{j}^{-1}(\cdot)$'s being link functions and $\boldsymbol{\gamma}=(\gamma_{1},\ldots,\gamma_{d})^{\prime}$, and we can then denote $Q(\tau,\boldsymbol{\gamma}):=Q(\tau,\boldsymbol{\theta}(\boldsymbol{\gamma}))$. Moreover, by letting $\gamma_{j}(\mathbf{X},\boldsymbol{\beta})=\mathbf{X}^{\prime}\boldsymbol{\beta}_{j}$ for $1\leq j\leq d$ and $\boldsymbol{\gamma}(\mathbf{X},\boldsymbol{\beta})=(\gamma_{1}(\mathbf{X},\boldsymbol{\beta}),\ldots,\gamma_{d}(\mathbf{X},\boldsymbol{\beta}))^{\prime}$, we can further denote $Q(\tau,\boldsymbol{\gamma}(\mathbf{X},\boldsymbol{\beta})):=Q(\tau,\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))$.

Assumption 5.

The quantile function $Q(\tau,\boldsymbol{\gamma})$ is differentiable with respect to $\boldsymbol{\gamma}$, and there exist two positive constants $L_{Q}$ and $C_{X}$ such that $\max_{1\leq k\leq K}\|\partial Q(\tau_{k},\boldsymbol{\gamma})/\partial\boldsymbol{\gamma}\|\leq L_{Q}$ and $\|\mathbf{X}\|_{\infty}\leq C_{X}$.

The differentiability assumption on the quantile function allows us to use the Lipschitz property and the multivariate contraction theorem. The boundedness of the covariates ensures that the bounded difference inequality can be applied, and it can be relaxed at the cost of more complicated and lengthy technical details (Wang and He, 2021).

Denote by $\mathcal{B}_{R}(\boldsymbol{\beta}_{0})=\{\boldsymbol{\beta}\in\mathbb{R}^{dp}:\|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\|\leq R\}$ the Euclidean ball centered at $\boldsymbol{\beta}_{0}$ with radius $R>0$, and let $\lambda_{\min}(\boldsymbol{\beta})$ be the smallest eigenvalue of the matrix

\Omega_{2}(\boldsymbol{\beta})=\sum_{k=1}^{K}E\left[\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))}{\partial\boldsymbol{\beta}}\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))}{\partial\boldsymbol{\beta}^{\prime}}\right].
Assumption 6.

There exists a fixed $R>0$ such that $\lambda_{\min}^{0}=\inf_{\boldsymbol{\beta}\in\mathcal{B}_{R}(\boldsymbol{\beta}_{0})}\lambda_{\min}(\boldsymbol{\beta})>0$ and $f_{\min}=\min_{1\leq k\leq K}\inf_{\boldsymbol{\beta}\in\mathcal{B}_{R}(\boldsymbol{\beta}_{0})}f_{Y}\{Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))|\mathbf{X}\}>0$.

The above assumption guarantees that the population loss $\bar{L}(\boldsymbol{\beta})=E[n^{-1}L_{n}(\boldsymbol{\beta})]$ is strongly convex around the true parameter vector $\boldsymbol{\beta}_{0}$. Specifically, let $\bar{\mathcal{E}}(\Delta)=\bar{L}(\boldsymbol{\beta}_{0}+\Delta)-\bar{L}(\boldsymbol{\beta}_{0})-\Delta^{\prime}\partial\bar{L}(\boldsymbol{\beta}_{0})/\partial\boldsymbol{\beta}$ be the first-order Taylor remainder. Then, by Assumption 6, we have $\bar{\mathcal{E}}(\Delta)\geq 0.5f_{\min}\lambda_{\min}^{0}\|\Delta\|^{2}$ for all $\Delta$ with $\|\Delta\|\leq R$; see Lemma A.5 in the Appendix for details. We next obtain the non-asymptotic estimation bounds for $\widetilde{\boldsymbol{\beta}}_{n}$.

Theorem 4.

Suppose that Assumptions 1 and 4-6 hold, $2f_{\min}\lambda_{\min}^{0}>\mu$, $n\gtrsim\log p$ and $\lambda\gtrsim\sqrt{\log p/n}$. Then the minimizer $\widetilde{\boldsymbol{\beta}}_{n}$ of (2.3) with $\Theta=\mathcal{B}_{R}(\boldsymbol{\beta}_{0})$ satisfies the error bounds

\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|\leq\frac{6L\sqrt{s}\lambda}{4\alpha-\mu}\quad\text{and}\quad\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|_{1}\leq\frac{24Ls\lambda}{4\alpha-\mu},

with probability at least $1-c_{1}p^{-c_{2}}-K\max\{\log p,\log n\}p^{-c^{2}}$ for any $c>1$, where $\alpha=0.5f_{\min}\lambda_{\min}^{0}$, $\mu$ and $L$ are defined in Assumption 4, $s$ is the number of nonzero elements in $\boldsymbol{\beta}_{0}$, and the constants $c_{1}$ and $c_{2}>0$ are given in Lemma A.4 of the Appendix.

In practice, we can choose $\lambda\asymp\sqrt{\log p/n}$, and it then holds that $\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|\lesssim\sqrt{s\log p/n}$, which is the standard rate of error bounds; see, e.g., Loh (2017). Moreover, for the predicted conditional quantile $Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widetilde{\boldsymbol{\beta}}_{n}))$ at any level $\tau^{*}\in\mathcal{I}$, it can be readily verified that $|Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widetilde{\boldsymbol{\beta}}_{n}))-Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))|$ has the same convergence rate as $\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|$. Finally, the above theorem requires the minimization at (2.3) to be conducted over $\Theta=\mathcal{B}_{R}(\boldsymbol{\beta}_{0})$, which is unknown but fixed; this can be handled in practice by random initialization in the optimizing algorithms.

3 Implementation issues

3.1 Optimizing algorithms

This subsection provides algorithms to search for the CQR estimator at (2.2) and the regularized estimator at (2.3).

For the CQR estimation without penalty at (2.2), we employ the commonly used gradient descent algorithm to search for the estimators, and the $(r+1)$th update is given by

\boldsymbol{\beta}^{(r+1)}=\boldsymbol{\beta}^{(r)}-\eta^{(r)}\,\nabla L_{n}(\boldsymbol{\beta}^{(r)}),

where $\boldsymbol{\beta}^{(r)}$ is the value from the $r$th iteration and $\eta^{(r)}$ is the step size. Note that the quantile check loss is nondifferentiable at zero, so $\nabla L_{n}(\boldsymbol{\beta}^{(r)})$ above refers to a subgradient (Moon et al., 2021). In practice, too small a step size causes the algorithm to converge slowly, while too large a step size may cause it to diverge. We choose the step size by the backtracking line search (BLS) method, which is simple and effective; see Bertsekas (2016). Specifically, the algorithm starts with a large step size and, at the $(r+1)$th update, keeps multiplying it by a fraction $b$ until $L_{n}(\boldsymbol{\beta}^{(r+1)})-L_{n}(\boldsymbol{\beta}^{(r)})<-a\eta^{(r)}\|\nabla L_{n}(\boldsymbol{\beta}^{(r)})\|_{2}^{2}$, where $a$ is another hyper-parameter. The simulation experiments in Section 4 work well with the setting $(a,b)=(0.3,0.5)$.
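A minimal sketch of this scheme is given below, assuming user-supplied `loss` and `grad` (a subgradient of the composite check loss); the stopping rule and the iteration cap are our own choices.

```python
# Gradient descent with backtracking line search; (a, b) = (0.3, 0.5) follows
# the setting used in Section 4.
import numpy as np

def gradient_descent_bls(beta0, loss, grad, a=0.3, b=0.5,
                         eta0=1.0, max_iter=500, tol=1e-8):
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        g = grad(beta)
        eta = eta0
        # shrink the step until the sufficient-decrease condition holds
        while loss(beta - eta * g) - loss(beta) >= -a * eta * np.dot(g, g):
            eta *= b
            if eta < 1e-12:          # no acceptable step; treat as converged
                return beta
        beta_new = beta - eta * g
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```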

For the regularized estimation at (2.3), we adopt the composite gradient descent algorithm (Loh and Wainwright, 2015), which is designed for nonconvex problems and fits our objective function well. Consider the SCAD penalty, which satisfies Assumption 4 with $L=1$ and $\mu=1/(\alpha-1)$, where $\alpha$ is the SCAD shape parameter. We can then rewrite the optimization problem at (2.3) as

\widetilde{\boldsymbol{\beta}}_{n}=\arg\min_{\boldsymbol{\beta}\in\Theta}\underbrace{\{n^{-1}L_{n}(\boldsymbol{\beta})-\mu\|\boldsymbol{\beta}\|_{2}^{2}/2\}}_{\widetilde{L}_{n}(\boldsymbol{\beta})}+\lambda g(\boldsymbol{\beta}),

where, from Assumption 4, $g(\boldsymbol{\beta})=\{\sum_{j=1}^{d}p_{\lambda}(\boldsymbol{\beta}_{j})+\mu\|\boldsymbol{\beta}\|_{2}^{2}/2\}/\lambda$ is convex. As a result, similar to the composite gradient descent algorithm in Loh and Wainwright (2015), the $(r+1)$th update can be calculated by

\boldsymbol{\beta}^{(r+1)}=\arg\min\left\{\|\boldsymbol{\beta}-(\boldsymbol{\beta}^{(r)}-\eta\nabla\widetilde{L}_{n}(\boldsymbol{\beta}^{(r)}))\|_{2}^{2}/2+\lambda\eta g(\boldsymbol{\beta})\right\},

which has a closed-form solution of

\boldsymbol{\beta}^{(r+1)}=\begin{cases}0, & 0\leq|z|\leq\nu\lambda,\\ z-\mathrm{sign}(z)\,\nu\lambda, & \nu\lambda<|z|\leq(\nu+1)\lambda,\\ \left\{z-\mathrm{sign}(z)\,\frac{\alpha\nu\lambda}{\alpha-1}\right\}\Big/\left\{1-\frac{\nu}{\alpha-1}\right\}, & (\nu+1)\lambda<|z|\leq\alpha\lambda,\\ z, & |z|>\alpha\lambda,\end{cases}

with $z=(\boldsymbol{\beta}^{(r)}-\eta\nabla\widetilde{L}_{n}(\boldsymbol{\beta}^{(r)}))/(1+\mu\eta)$ and $\nu=\eta/(1+\mu\eta)$, where the step size $\eta$ is chosen by the BLS method.
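The closed-form update above can be coded element-wise as follows; here `a` stands for the SCAD parameter written as $\alpha$ in the display, and `grad_tilde` is $\nabla\widetilde{L}_{n}(\boldsymbol{\beta}^{(r)})$.

```python
# An element-wise sketch of the composite gradient (proximal) update above.
import numpy as np

def scad_prox_update(beta, grad_tilde, eta, lam, mu, a=3.7):
    z = (beta - eta * grad_tilde) / (1.0 + mu * eta)
    nu = eta / (1.0 + mu * eta)
    absz, s = np.abs(z), np.sign(z)
    out = np.where(absz <= nu * lam, 0.0, z - s * nu * lam)      # first two pieces
    mid = (z - s * a * nu * lam / (a - 1)) / (1.0 - nu / (a - 1))
    out = np.where(absz > (nu + 1) * lam, mid, out)              # third piece
    out = np.where(absz >= a * lam, z, out)                      # fourth piece
    return out
```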

3.2 Hyper-parameter selection

There are two types of hyper-parameters in the penalized estimation at (2.3): the tuning parameter $\lambda$ and the quantile levels $\tau_{k}$ with $1\leq k\leq K$. We can employ validation methods to select the tuning parameter $\lambda$ such that the composite quantile check loss is minimized.

The selection of the $\tau_{k}$'s is another important task since it affects the efficiency of the resulting estimators. Suppose that we are interested in some high quantiles $\tau_{m}^{*}$ with $1\leq m\leq M$; the QIR model can then be assumed on the interval $\mathcal{I}=[\tau_{0},1]$, which contains all the $\tau_{m}^{*}$'s. We may further choose a suitable interval $[\tau_{L},\tau_{U}]\subset\mathcal{I}$ over which the $\tau_{k}$'s are equally spaced, i.e. $\tau_{k}=\tau_{L}+k(\tau_{U}-\tau_{L})/(K+1)$ for $1\leq k\leq K$, where we may set $\tau_{0}=\tau_{L}$. As a result, the selection of the $\tau_{k}$'s is equivalent to that of $[\tau_{L},\tau_{U}]$.

We may choose $\tau_{U}$ to be close to the $\tau_{m}^{*}$'s while a reliable estimation can still be afforded at this level. The selection of $\tau_{L}$ is a trade-off between estimation efficiency and model misspecification; see Wang et al. (2012a) and Wang and Tsai (2009). On the one hand, to improve estimation efficiency, we may choose $\tau_{L}$ close to 0.5 since, for most real data, observations are richest in the middle. On the other hand, the parametric structure must then be assumed over the whole interval $[\tau_{L},1]$, i.e. more restrictions are imposed on the real example. The criterion of prediction errors (PEs) is hence introduced,

PE=\frac{1}{M}\sum_{m=1}^{M}\frac{\sqrt{n}}{\sqrt{\tau_{m}^{*}(1-\tau_{m}^{*})}}\left|\frac{1}{n}\sum_{i=1}^{n}I\{y_{i}<\widehat{Q}_{Y}(\tau_{m}^{*}|\mathbf{X}_{i})\}-\tau_{m}^{*}\right|,

and we choose the $\tau_{L}$ with the minimum value of PE; see also Wang et al. (2012a).
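The PE criterion transcribes directly into code; in the sketch below, `preds[m]` is assumed to hold the predicted $\tau_{m}^{*}$-quantiles for all $n$ observations, and the names are ours.

```python
# A direct transcription of the PE criterion above.
import numpy as np

def prediction_error(y, preds, tau_stars):
    y = np.asarray(y, dtype=float)
    n = len(y)
    pe = 0.0
    for tau, pred in zip(tau_stars, preds):
        coverage = np.mean(y < pred)           # empirical coverage at level tau
        pe += np.sqrt(n) * abs(coverage - tau) / np.sqrt(tau * (1.0 - tau))
    return pe / len(tau_stars)
```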

In practice, cross-validation can be used to select $\lambda$ and $\tau_{L}$ simultaneously. Specifically, the composite quantile check loss and the PEs are both evaluated on validation sets. For each candidate interval $[\tau_{L},\tau_{U}]$, the tuning parameter $\lambda$ is selected according to the composite quantile check loss, and the corresponding value of PE is recorded. We then choose the value of $\tau_{L}$ corresponding to the minimum value of PE among all candidate intervals.

4 Simulation Studies

4.1 Composite quantile regression estimation

This subsection conducts simulation experiments to evaluate the finite-sample performance of the low-dimensional composite quantile regression (CQR) estimation at (2.2).

The Tukey lambda distribution in Example 2 is used to generate the i.i.d. sample:

Y_{i}=Q_{Y}(U_{i},\boldsymbol{\theta}(\mathbf{X}_{i},\boldsymbol{\beta}))=\theta_{1}(\mathbf{X}_{i},\boldsymbol{\beta})+\theta_{2}(\mathbf{X}_{i},\boldsymbol{\beta})\frac{U_{i}^{\theta_{3}(\mathbf{X}_{i},\boldsymbol{\beta})}-(1-U_{i})^{\theta_{3}(\mathbf{X}_{i},\boldsymbol{\beta})}}{\theta_{3}(\mathbf{X}_{i},\boldsymbol{\beta})},
\theta_{1}(\mathbf{X}_{i},\boldsymbol{\beta})=g_{1}(\mathbf{X}_{i}^{\prime}\boldsymbol{\beta}_{1}),\quad\theta_{2}(\mathbf{X}_{i},\boldsymbol{\beta})=g_{2}(\mathbf{X}_{i}^{\prime}\boldsymbol{\beta}_{2}),\quad\theta_{3}(\mathbf{X}_{i},\boldsymbol{\beta})=g_{3}(\mathbf{X}_{i}^{\prime}\boldsymbol{\beta}_{3}),   (4.1)

where the $\{U_{i}\}$ are independent and follow $\text{Uniform}(0,1)$, $\mathbf{X}_{i}=(1,X_{i1},X_{i2})^{\prime}$, and $\{(X_{i1},X_{i2})^{\prime}\}$ is an i.i.d. sequence from the 2-dimensional standard normal distribution. The true parameter vector is $\boldsymbol{\beta}_{0}=(\boldsymbol{\beta}_{01}^{\prime},\boldsymbol{\beta}_{02}^{\prime},\boldsymbol{\beta}_{03}^{\prime})^{\prime}$, and we set the location parameters to $\boldsymbol{\beta}_{01}=(1,0.5,-1)^{\prime}$, the scale parameters to $\boldsymbol{\beta}_{02}=(1,0.5,-1)^{\prime}$ and the tail parameters to $\boldsymbol{\beta}_{03}=(1,-1,1)^{\prime}$. For the tail index $\theta_{3}(\mathbf{X}_{i},\boldsymbol{\beta})$, before generating the data, we first rescale each covariate into the range $[-0.5,0.5]$ so that a relatively stable sample can be generated. In addition, $g_{1}$, $g_{2}$ and $g_{3}$ are the inverses of the link functions. We choose the identity link for the location index and softplus-related links for the scale and tail indices, i.e., $g_{1}(x)=x$, $g_{2}(x)=\mathrm{softplus}(x)$ and $g_{3}(x)=1-\mathrm{softplus}(x)$, where $\mathrm{softplus}(x)=\log(1+\exp(x))$ is a smoothed version of $x_{+}=\max\{0,x\}$, hence the name. Note that $g_{2}(x)>0$ and $g_{3}(x)<1$. We consider three sample sizes, $n=500$, 1000 and 2000, and there are 500 replications for each sample size.
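To make the design concrete, a sketch of this data generating process is given below; the min-max rescaling of the tail-index covariates to $[-0.5,0.5]$ is one reading of the rescaling step described above.

```python
# A sketch of the DGP at (4.1) with the links of Section 4.1: identity for the
# location index, softplus for the scale, 1 - softplus for the tail.
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def simulate_qir(n, beta1, beta2, beta3, seed=None):
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
    Xs = X.copy()
    for j in (1, 2):   # min-max rescale the non-intercept covariates
        col = X[:, j]
        Xs[:, j] = (col - col.min()) / (col.max() - col.min()) - 0.5
    t1 = X @ beta1                       # location index
    t2 = softplus(X @ beta2)             # scale index, positive
    t3 = 1.0 - softplus(Xs @ beta3)      # tail index, below 1 (zero w.p. zero)
    U = rng.uniform(size=n)
    Y = t1 + t2 * (U**t3 - (1.0 - U)**t3) / t3
    return Y, X

Y, X = simulate_qir(1000, np.array([1.0, 0.5, -1.0]),
                    np.array([1.0, 0.5, -1.0]), np.array([1.0, -1.0, 1.0]))
```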

The algorithm for the CQR estimation in Section 3 is applied with $K=10$ and the $\tau_{k}$'s equally spaced over $[\tau_{L},\tau_{U}]$. We consider three quantile ranges, $(\tau_{L},\tau_{U})=(0.5,0.99)$, $(0.7,0.99)$ and $(0.9,0.99)$, and first evaluate the estimation efficiency. Figure 1 gives the boxplots of the three fitted location parameters $\widehat{\boldsymbol{\beta}}_{1n}=(\widehat{\beta}_{1,1},\widehat{\beta}_{1,2},\widehat{\beta}_{1,3})^{\prime}$. It can be seen that both the bias and the standard deviation decrease as the sample size increases. Moreover, when $\tau_{L}$ decreases, quantile levels with richer observations are used in the estimation and, as expected, both the bias and the standard deviation decrease. Boxplots for the fitted scale and tail parameters show a similar pattern and hence are omitted to save space.

We next evaluate the prediction performance of $Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widehat{\boldsymbol{\beta}}_{n}))$ at two quantile levels of interest, $\tau^{*}=0.991$ and 0.995. Consider two values of the covariates, $\boldsymbol{X}=(1,0.1,-0.2)^{\prime}$ and $(1,0,0)^{\prime}$, for which the corresponding tail indices are $\theta_{3}(\mathbf{X},\boldsymbol{\beta}_{0})=-0.1139$ and $-0.3132$, respectively. Note that the Tukey lambda distribution provides a good approximation to the Cauchy and normal distributions when the tail index is $-1$ and $0.14$, respectively, and it becomes more heavy-tailed as the tail index decreases (Freimer et al., 1988). The prediction error in terms of squared loss (PES), $[Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widehat{\boldsymbol{\beta}}_{n}))-Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))]^{2}$, is calculated for each replication, and the corresponding sample mean is the commonly used mean square error. Table 1 presents both the sample mean and the standard deviation of the PESs across the 500 replications, as in Wang et al. (2012a). A clear trend of improvement can be observed as the sample size becomes larger, and the prediction is more accurate at the level $\tau^{*}=0.991$ in almost all cases.

4.2 High-dimensional regularized estimation

This subsection conducts simulation experiments to evaluate the finite-sample performance of the high-dimensional regularized estimation at (2.3).

For the data generating process at (4.1), we consider three dimensions, $p=50$, 100 and 150, and the true parameter vectors are extended from those in Section 4.1 by adding zeros, i.e. $\boldsymbol{\beta}_{01}=(1,0.5,-1,0,\ldots,0)^{\prime}$, $\boldsymbol{\beta}_{02}=(1,0.5,-1,0,\ldots,0)^{\prime}$ and $\boldsymbol{\beta}_{03}=(1,-1,1,0,\ldots,0)^{\prime}$, which are vectors of length $p$ with 3 nonzero entries each. As a result, the full parameter vector $\boldsymbol{\beta}_{0}=(\boldsymbol{\beta}_{01}^{\prime},\boldsymbol{\beta}_{02}^{\prime},\boldsymbol{\beta}_{03}^{\prime})^{\prime}$ has length $3p$ with $s=9$ nonzero entries. The sample size is chosen as $n=\lfloor cs\log p\rfloor$ with $c=5$, 10, 20, 30, 40 and 50, where $\lfloor x\rfloor$ denotes the largest integer less than or equal to $x$. All other settings are the same as in the previous subsection.

The algorithm for the regularized estimation in Section 3 is used to search for the estimators, and we generate an independent validation set of size $5n$ to select the tuning parameter $\lambda$ by minimizing the composite quantile check loss; see also Wang et al. (2012b). Figure 2 gives the estimation errors $\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|$. It can be seen that $\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|$ is roughly proportional to the quantity $\sqrt{s\log p/n}$, which confirms the convergence rate in Theorem 4. Moreover, the estimation errors approach zero as the sample size $n$ increases, and we can then conclude the consistency of $\widetilde{\boldsymbol{\beta}}_{n}$. Finally, when $\tau_{L}$ increases, quantile levels with fewer observations are used in the estimation procedure, and hence larger estimation errors are observed.

We next evaluate the prediction performance at the quantile levels $\tau^{*}=0.991$ and 0.995, with the covariates $\mathbf{X}$ taking the values $(1,0.1,-0.2,0,\ldots,0)^{\prime}$ and $(1,0,0,0,\ldots,0)^{\prime}$, similar to those in the previous subsection. Table 2 gives the mean square errors of the predicted conditional quantiles $Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widetilde{\boldsymbol{\beta}}_{n}))$, as well as the sample standard deviations of the prediction errors in squared loss, for $p=50$. It can be seen that a larger sample size leads to much smaller mean square errors. Moreover, when $\tau_{L}$ is larger, the prediction also becomes worse, which may be due to the lower estimation efficiency. Finally, similar to the experiments in the previous subsection, the prediction at $\tau^{*}=0.991$ is more accurate in almost all cases. The results for $p=100$ and 150 are similar and hence omitted.

Finally, we consider the following criteria to evaluate the performance of variable selection: the average number of selected active coefficients (size), the percentage of replications in which the active and inactive coefficients are both correctly selected simultaneously ($\textrm{P}_{\textrm{AI}}$), the percentage of active coefficients correctly selected ($\textrm{P}_{\textrm{A}}$), the percentage of inactive coefficients correctly selected ($\textrm{P}_{\textrm{I}}$), the false positive rate (FP), and the false negative rate (FN). Table 3 reports the selection results for $p=50$ and $c=10$, 30 and 50. When $\tau_{L}$ is larger, both $\textrm{P}_{\textrm{AI}}$ and $\textrm{P}_{\textrm{A}}$ decrease, indicating a decrease in selection accuracy. In addition, the performance improves as the sample size gets larger. The results for $p=100$ and 150 are similar and hence omitted.
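Under one plausible per-replication reading of these criteria, they can be computed as below, with the reported percentages obtained by averaging over replications; the boolean-mask representation is our own.

```python
# A sketch of the selection metrics; `active` and `active_hat` are boolean
# masks of truly and estimated nonzero coefficients for one replication.
import numpy as np

def selection_metrics(active, active_hat):
    size = int(active_hat.sum())              # number of selected coefficients
    p_a = np.mean(active_hat[active])         # share of active correctly kept
    p_i = np.mean(~active_hat[~active])       # share of inactive correctly dropped
    p_ai = float(p_a == 1.0 and p_i == 1.0)   # exact support recovery
    fp = np.mean(active_hat[~active])         # false positive rate
    fn = np.mean(~active_hat[active])         # false negative rate
    return dict(size=size, PAI=p_ai, PA=p_a, PI=p_i, FP=fp, FN=fn)
```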

5 Application to Childhood Malnutrition

Childhood malnutrition is well known to be one of the most urgent problems in developing countries. The Demographic and Health Surveys (DHS) program has conducted nationally representative surveys on child and maternal health, family planning, child survival, and so on, resulting in many datasets for research purposes. The dataset for India was first analyzed by Koenker (2011) and can be downloaded from http://www.econ.uiuc.edu/~roger/research/bandaids/bandaids.html. It has also been studied by many researchers (Fenske et al., 2011; Belloni et al., 2019b) for the childhood malnutrition problem in India, where quantile regression with low- or high-dimensional covariates was conducted at the levels $\tau=0.1$ and 0.05. The proposed model enables us to consider much lower quantiles, corresponding to more severe childhood malnutrition.

The child's height is chosen as the indicator of malnutrition, as in Belloni et al. (2019b). Specifically, the response is set to $Y=-100\log(\text{child's height in centimeters})$, and we then consider high quantiles to study the childhood malnutrition problem, which is consistent with the previous sections. The other variables include seven continuous and 13 categorical ones, covering both biological and socioeconomic factors possibly related to childhood malnutrition. Examples of biological factors include the child's age, gender and duration of breastfeeding in months, and the mother's age and body-mass index (BMI); socioeconomic factors include the mother's employment status, religion, residence, and the availability of electricity. All seven continuous variables are standardized to have mean zero and variance one, and two-way interactions between all variables are also included. Moreover, we concentrate on the samples from poor families. As a result, there are $p=328$ covariates in total after removing variables with all elements being zero, and the sample size is $n=6858$. Denote the full model size by $(328,328,328)$, where the components correspond to the location, scale and tail indices, respectively. Furthermore, as in the simulation experiments, the covariates are rescaled to the range $[-0.5,0.5]$ for the tail index.

We aim at two high quantiles, $\tau^{*}=0.991$ and 0.995, and the algorithm for the high-dimensional regularized estimation in Section 3 is first applied to select the interval $[\tau_{L},\tau_{U}]$. Specifically, the value of $\tau_{U}$ is fixed at 0.99, and that of $\tau_{L}$ is selected among $\tau_{L}=0.9+0.01j$ with $1\leq j\leq 8$. The value of $K$ is set to 10, and the $\tau_{k}$'s with $1\leq k\leq K$ are equally spaced over $[\tau_{L},\tau_{U}]$. For each $\tau_{L}$, the whole sample is randomly split into five parts of equal size, except that one part is short by two observations, and 5-fold cross-validation is used to select the tuning parameter $\lambda$. To stabilize the process, we conduct the random splitting five times and choose the value of $\lambda$ that minimizes the composite check loss over all five splittings. The averaged value of the PEs is also calculated over all five splittings, and the corresponding plot is presented in Figure 3. As a result, we choose $\tau_{L}=0.96$ since it corresponds to the minimum value of the PEs.

We next apply the QIR model to the whole dataset with $[\tau_{L},\tau_{U}]=[0.96,0.99]$, where the tuning parameter $\lambda$ is scaled by $\sqrt{4/5}$ since the sample size changes from $4n/5$ to $n$. The fitted model has size $(14,16,19)$, and we can predict the conditional quantile structure at any level $\tau^{*}\in(0.96,1)$. For example, consider the variable of child's age, where we are interested in children aged 20, 30 and 40 months. The duration of breastfeeding is set equal to the child's age, since the age is always at least the duration of breastfeeding, and we set the values of all other variables in $\boldsymbol{X}$ to those of the 460th observation, whose response value is the sample median. Figure 4 plots the predicted quantile curves for the three ages. It can be seen that younger children may have extremely low heights, and we may conclude that younger children are more easily affected by malnutrition.

Figure 4 also draws the quantile curves for the mother's education, the child's gender and the mother's unemployment status, where the values of the 460th observation are again used for the non-focal covariates in the prediction. For the child's gender, baby boys are usually taller than baby girls, as observed in Koenker (2011), while the difference vanishes at much higher quantiles. In addition, the quantile curves for the mother's education are almost parallel, while those for the mother's unemployment status cross. More importantly, all these new insights concern very high quantiles, which confirms the necessity of the proposed model.

Finally, we compare the proposed QIR model with two commonly used methods in the literature: (i) linear quantile regression at the level $\tau^{*}$ with $\ell_{1}$ penalty, as in Belloni et al. (2019b), and (ii) extremal quantile regression (Wang et al., 2012a) adapted to high-dimensional data. The prediction performance at $\tau^{*}=0.991$ and 0.995 is considered for the comparison, and we fix $[\tau_{L},\tau_{U}]=[0.96,0.99]$. For method (ii), we consider $K=4.5n^{1/3}$ quantile levels, equally spaced over $[0.96,0.99]$, and the linear quantile regression with $\ell_{1}$ penalty is conducted at each level. As in Wang et al. (2012a), we can estimate the extreme value index, and the fitted structures can hence be extrapolated to the level $\tau^{*}$. Note that there is no theoretical justification for method (ii) in the literature. As in the simulation experiments, the tuning parameter $\lambda$ in the above three methods is selected by minimizing the composite check loss on the testing set. We randomly split the data 100 times, and one value of PE is obtained from each splitting. Figure 3 gives the boxplots of the PEs from our model and the two competing methods, and the advantages of our model can be observed at both target levels, $\tau^{*}=0.991$ and 0.995.

6 Conclusions and discussions

This paper proposes a reliable method for inference at extreme quantiles with both low- and high-dimensional covariates. The main idea is to first conduct a composite quantile regression at fixed quantile levels and then extrapolate the estimated results to extreme quantiles by assuming a parametric structure at the tails. The Tukey lambda structure can be used due to its flexibility and the explicit form of its quantile function, and the success of the proposed methodology has been demonstrated by extensive numerical studies.

This paper can be extended in the following two directions. On the one hand, in the proposed model, a parametric structure is assumed over the interval $[\tau_{0},1]$. Although the criterion of PE is suggested in Section 3 to balance estimation efficiency and model misspecification, it would be interesting to provide a statistical tool for assessing goodness-of-fit. Dong et al. (2019) introduced a goodness-of-fit test for parametric quantile regression at a fixed quantile level, and it can be adapted to our problem by extending the test statistic from a fixed level to the interval $[\tau_{0},1]$. We leave this for future research. On the other hand, the idea of this paper is general and can be applied to many other scenarios. For example, for conditionally heteroscedastic time series models, it is usually difficult to conduct quantile estimation at both the median and extreme quantiles. The difficulty at extreme quantiles is due to the sparse data at the tails, while that at the median is due to the tiny values of the fitted parameters (Zhu et al., 2018; Zhu and Li, 2021). Our idea can certainly be used to alleviate this problem to some extent.

Appendix: Technical details

A.1 Proofs for low-dimensional CQR Estimation

This subsection gives the technical proofs of Lemma 1 and Theorems 1-3 in Section 2.2. Two auxiliary lemmas are also presented at the end of this subsection: Lemma A.1 is used in the proof of Lemma 1, and Lemma A.2 in that of Theorem 3. The proofs of these two auxiliary lemmas are given in a supplementary file.

Proof of Lemma 1.

The Tukey lambda distribution has the form of

Q(\tau,\boldsymbol{\theta})=\theta_{1}+\theta_{2}\left(\frac{\tau^{\theta_{3}}-(1-\tau)^{\theta_{3}}}{\theta_{3}}\right),

where $\theta_{1}\in\mathbb{R}$, $\theta_{2}>0$ and $\theta_{3}\neq 0$. We prove Lemma 1 for $\theta_{3}<1$. Consider four arbitrary quantile levels $0<\tau_{1}<\tau_{2}<\tau_{3}<\tau_{4}<1$ and two arbitrary index vectors $\widetilde{\boldsymbol{\theta}}=(\widetilde{\theta}_{1},\widetilde{\theta}_{2},\widetilde{\theta}_{3})^{\prime}$ and $\boldsymbol{\theta}=(\theta_{1},\theta_{2},\theta_{3})^{\prime}$ such that

Q(\tau_{j},\widetilde{\boldsymbol{\theta}})=Q(\tau_{j},\boldsymbol{\theta})\text{ for all }1\leq j\leq 4.   (A.1)

We show that $\widetilde{\boldsymbol{\theta}}=\boldsymbol{\theta}$ in the following.

The first step is to prove $\widetilde{\theta}_{3}=\theta_{3}$ by contradiction. Suppose that $\widetilde{\theta}_{3}\neq\theta_{3}$ and, without loss of generality, assume that $\theta_{3}<\widetilde{\theta}_{3}<0$. Denote $f_{j}(\theta_{3})=\tau_{j}^{\theta_{3}}-(1-\tau_{j})^{\theta_{3}}$ for $1\leq j\leq 4$. From (A.1), we have

\frac{f_{1}(\widetilde{\theta}_{3})-f_{2}(\widetilde{\theta}_{3})}{f_{3}(\widetilde{\theta}_{3})-f_{2}(\widetilde{\theta}_{3})}-\frac{f_{1}(\theta_{3})-f_{2}(\theta_{3})}{f_{3}(\theta_{3})-f_{2}(\theta_{3})}=0   (A.2)

and

\frac{f_{4}(\widetilde{\theta}_{3})-f_{2}(\widetilde{\theta}_{3})}{f_{3}(\widetilde{\theta}_{3})-f_{2}(\widetilde{\theta}_{3})}-\frac{f_{4}(\theta_{3})-f_{2}(\theta_{3})}{f_{3}(\theta_{3})-f_{2}(\theta_{3})}=0.   (A.3)

Let us fix $\theta_{3}$, $\widetilde{\theta}_{3}$, $\tau_{2}$ and $\tau_{3}$. As a result, $\kappa_{1}=f_{2}(\widetilde{\theta}_{3})$, $\kappa_{2}=f_{2}(\theta_{3})$, $\kappa_{3}=f_{3}(\theta_{3})-f_{2}(\theta_{3})$ and $\kappa_{4}=f_{3}(\widetilde{\theta}_{3})-f_{2}(\widetilde{\theta}_{3})$ are all fixed values. Denote

F(τ)=κ3{τθ~3(1τ)θ~3κ1}κ4{τθ3(1τ)θ3κ2},\displaystyle F(\tau)=\kappa_{3}\left\{\tau^{\widetilde{\theta}_{3}}-(1-\tau)^{\widetilde{\theta}_{3}}-\kappa_{1}\right\}-\kappa_{4}\left\{\tau^{\theta_{3}}-(1-\tau)^{\theta_{3}}-\kappa_{2}\right\}, (A.4)
F˙(τ)=κ3θ~3{τθ~31+(1τ)θ~31}κ4θ3{τθ31+(1τ)θ31},\dot{F}(\tau)=\kappa_{3}\widetilde{\theta}_{3}\{\tau^{\widetilde{\theta}_{3}-1}+(1-\tau)^{\widetilde{\theta}_{3}-1}\}-\kappa_{4}\theta_{3}\{\tau^{{\theta}_{3}-1}+(1-\tau)^{{\theta}_{3}-1}\},

and

G(τ)=τθ~31+(1τ)θ~31τθ31+(1τ)θ31,G(\tau)=\frac{\tau^{\widetilde{\theta}_{3}-1}+(1-\tau)^{\widetilde{\theta}_{3}-1}}{\tau^{{\theta}_{3}-1}+(1-\tau)^{{\theta}_{3}-1}},

where F˙()\dot{F}(\cdot) is the derivative function of F()F(\cdot), and F˙(τ)=0\dot{F}(\tau)=0 if and only if G(τ)=κ4θ3/(κ3θ~3)G(\tau)=\kappa_{4}\theta_{3}/(\kappa_{3}\widetilde{\theta}_{3}). Note that equations (A.2) and (A.3) correspond to F(τ1)=0F(\tau_{1})=0 and F(τ4)=0F(\tau_{4})=0, respectively. Moreover, it can be verified that F(τ2)=0F(\tau_{2})=0 and F(τ3)=0F(\tau_{3})=0, i.e. the equation F(τ)=0F(\tau)=0 has at least four different solutions. As a result, the equation F˙(τ)=0\dot{F}(\tau)=0 or G(τ)=κ4θ3/(κ3θ~3)G(\tau)=\kappa_{4}\theta_{3}/(\kappa_{3}\widetilde{\theta}_{3}) has at least three different solutions. While it is implied by Lemma A.1 that the equation G(τ)=κ4θ3/(κ3θ~3)G(\tau)=\kappa_{4}\theta_{3}/(\kappa_{3}\widetilde{\theta}_{3}) has at most two different solutions. Due to the contradiction, we prove that θ~3=θ3\widetilde{\theta}_{3}=\theta_{3}, and it is readily to further verify that (θ~1,θ~2)=(θ1,θ2)(\widetilde{\theta}_{1},\widetilde{\theta}_{2})=(\theta_{1},\theta_{2}). We hence accomplish the proof of Lemma 1(i). The result of Lemma 1(ii) can be proved similarly. ∎
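The root-counting step above can also be checked numerically. In the Python sketch below, the parameter values are illustrative assumptions; it verifies that F(\tau_{2})=F(\tau_{3})=0 by construction and counts the sign changes of F over a fine grid, which should not exceed three, in line with Lemma A.1.

```python
import numpy as np

def f(tau, theta):
    return tau**theta - (1.0 - tau)**theta

# illustrative values with theta3 < theta3_tilde < 0, as in the proof
theta3, theta3_t = -0.8, -0.3
tau2, tau3 = 0.4, 0.6

kappa1, kappa2 = f(tau2, theta3_t), f(tau2, theta3)
kappa3 = f(tau3, theta3) - kappa2
kappa4 = f(tau3, theta3_t) - kappa1

def F(tau):
    return (kappa3 * (f(tau, theta3_t) - kappa1)
            - kappa4 * (f(tau, theta3) - kappa2))

print(F(tau2), F(tau3))                         # both zero by construction
grid = np.linspace(1e-4, 1.0 - 1e-4, 200001)
changes = np.sum(np.diff(np.sign(F(grid))) != 0)
print("sign changes of F on (0,1):", changes)   # should be at most three
```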

Proof of Theorem 1.

We first prove the uniqueness of \boldsymbol{\beta}_{0}. Denote \bar{L}_{k}(\boldsymbol{\beta})=E[\rho_{\tau_{k}}\{Y-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))\}]. From model (2.1), \boldsymbol{\beta}_{0} is the minimizer not only of \bar{L}(\boldsymbol{\beta})=\sum_{k=1}^{K}\bar{L}_{k}(\boldsymbol{\beta}) but also of \bar{L}_{k}(\boldsymbol{\beta}) for each 1\leq k\leq K. Suppose that \boldsymbol{\beta}_{0}^{*} is another minimizer of \bar{L}(\boldsymbol{\beta}); then it is also a minimizer of \bar{L}_{k}(\boldsymbol{\beta}) for all 1\leq k\leq K.

Note that, for u\neq 0,

\rho_{\tau}(u-v)-\rho_{\tau}(u)=-v\psi_{\tau}(u)+(u-v)[I(0>u>v)-I(0<u<v)]=-v\psi_{\tau}(u)+\int_{0}^{v}[I(u\leq s)-I(u\leq 0)]ds, (A.5)

where \psi_{\tau}(u)=\tau-I(u<0); see Knight, (1998). Let U^{(k)}=Y-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0})) and V^{(k)}=Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}^{*}))-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0})). Since E\{\psi_{\tau_{k}}(U^{(k)})|\mathbf{X}\}=0, the term E\{V^{(k)}\psi_{\tau_{k}}(U^{(k)})\} vanishes, and it holds that, for each 1\leq k\leq K,

0=\bar{L}_{k}(\boldsymbol{\beta}_{0}^{*})-\bar{L}_{k}(\boldsymbol{\beta}_{0})=E\{(U^{(k)}-V^{(k)})[I(0>U^{(k)}>V^{(k)})-I(0<U^{(k)}<V^{(k)})]\},

whose integrand is nonnegative almost surely. This implies that V^{(k)}=Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}^{*}))-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))=0 with probability one, which, together with the conditions in Lemma 1 and the monotone link functions, leads to \mathbf{X}^{\prime}\boldsymbol{\beta}_{j0}^{*}=\mathbf{X}^{\prime}\boldsymbol{\beta}_{j0} for all 1\leq j\leq d. We then have \boldsymbol{\beta}_{0}^{*}=\boldsymbol{\beta}_{0} since E(\boldsymbol{X}\boldsymbol{X}^{\prime}) is positive definite. This accomplishes the proof of the uniqueness of \boldsymbol{\beta}_{0}. ∎
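Knight's identity at (A.5) is purely algebraic and can be verified numerically. Below is a quick Python check, a minimal sketch with simulated scalar inputs; note that u=0 occurs with probability zero for continuous draws, matching the restriction u\neq 0.

```python
import numpy as np

rng = np.random.default_rng(0)

def rho(u, tau):                 # check function: rho_tau(u) = u * (tau - I(u < 0))
    return u * (tau - (u < 0))

tau = 0.9
u = rng.normal(size=100000)
v = rng.normal(size=100000)

lhs = rho(u - v, tau) - rho(u, tau)
psi = tau - (u < 0)              # psi_tau(u) = tau - I(u < 0)
rhs = -v * psi + (u - v) * (((0 > u) & (u > v)).astype(float)
                            - ((0 < u) & (u < v)).astype(float))
print(np.max(np.abs(lhs - rhs)))  # ~0: the two sides agree up to rounding
```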

Proof of Theorem 2.

Note that L_{n}(\boldsymbol{\beta})=n^{-1}\sum_{k=1}^{K}\sum_{i=1}^{n}\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X}_{i},\boldsymbol{\beta}))\} and \bar{L}(\boldsymbol{\beta})=E[\sum_{k=1}^{K}\rho_{\tau_{k}}\{Y-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))\}]. From Knight's identity at (A.5),

|\bar{L}(\boldsymbol{\beta})-\bar{L}(\boldsymbol{\beta}_{0})|\leq\sum_{k=1}^{K}E|Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))|\leq\sum_{k=1}^{K}E\sup_{\boldsymbol{\beta}\in\Theta}\left\|\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))}{\partial\boldsymbol{\beta}}\right\|\|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\|,

which, together with the condition E\max_{1\leq k\leq K}\sup_{\boldsymbol{\beta}\in\Theta}\|\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))/\partial\boldsymbol{\beta}\|<\infty and Corollary 2 of Andrews, (1987), implies that

\sup_{\boldsymbol{\beta}\in\Theta}|L_{n}(\boldsymbol{\beta})-L_{n}(\boldsymbol{\beta}_{0})-[\bar{L}(\boldsymbol{\beta})-\bar{L}(\boldsymbol{\beta}_{0})]|=o_{p}(1). (A.6)

Note that \bar{L}(\boldsymbol{\beta}) is continuous in \boldsymbol{\beta} and, from Theorem 1, \boldsymbol{\beta}_{0} is the unique minimizer of \bar{L}(\boldsymbol{\beta}). As a result, for each \delta>0,

\epsilon=\inf_{\boldsymbol{\beta}\in B_{\delta}^{c}(\boldsymbol{\beta}_{0})}\bar{L}(\boldsymbol{\beta})-\bar{L}(\boldsymbol{\beta}_{0})>0, (A.7)

where B_{\delta}(\boldsymbol{\beta}_{0})=\{\boldsymbol{\beta}:\|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\|\leq\delta\} and B_{\delta}^{c}(\boldsymbol{\beta}_{0}) is its complement, and hence

\left\{\inf_{\boldsymbol{\beta}\in B_{\delta}^{c}(\boldsymbol{\beta}_{0})}L_{n}(\boldsymbol{\beta})\leq L_{n}(\boldsymbol{\beta}_{0})\right\}\subseteq\left\{\sup_{\boldsymbol{\beta}\in B_{\delta}^{c}(\boldsymbol{\beta}_{0})}|L_{n}(\boldsymbol{\beta})-L_{n}(\boldsymbol{\beta}_{0})-[\bar{L}(\boldsymbol{\beta})-\bar{L}(\boldsymbol{\beta}_{0})]|\geq\epsilon\right\}. (A.8)

Note that

1=P\left\{L_{n}(\widehat{\boldsymbol{\beta}}_{n})\leq L_{n}(\boldsymbol{\beta}_{0})\right\}\leq P\left\{\widehat{\boldsymbol{\beta}}_{n}\in B_{\delta}(\boldsymbol{\beta}_{0})\right\}+P\left\{\inf_{\boldsymbol{\beta}\in B_{\delta}^{c}(\boldsymbol{\beta}_{0})}L_{n}(\boldsymbol{\beta})\leq L_{n}(\boldsymbol{\beta}_{0})\right\},

which, together with (A.6) and (A.8), implies that

P\left\{\|\widehat{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|\leq\delta\right\}\geq 1-P\left\{\inf_{\boldsymbol{\beta}\in B_{\delta}^{c}(\boldsymbol{\beta}_{0})}L_{n}(\boldsymbol{\beta})\leq L_{n}(\boldsymbol{\beta}_{0})\right\}\rightarrow 1

as n\rightarrow\infty. This accomplishes the proof of consistency. ∎

Proof of Theorem 3.

For simplicity, we denote Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X}_{i},\boldsymbol{\beta})) by Q_{ik}(\boldsymbol{\beta}) throughout the proof of this theorem. Let S_{n}(\boldsymbol{\beta})=L_{n}(\boldsymbol{\beta})-L_{n}(\boldsymbol{\beta}_{0}) and, from Knight's identity at (A.5),

S_{n}(\boldsymbol{\beta})=\sum_{k=1}^{K}\sum_{i=1}^{n}\left[\rho_{\tau_{k}}\{Y_{i}-Q_{ik}(\boldsymbol{\beta})\}-\rho_{\tau_{k}}\{Y_{i}-Q_{ik}(\boldsymbol{\beta}_{0})\}\right]=\sum_{k=1}^{K}\sum_{i=1}^{n}\Big[\{Q_{ik}(\boldsymbol{\beta})-Q_{ik}(\boldsymbol{\beta}_{0})\}\{I(e_{ik}<0)-\tau_{k}\}+\int_{0}^{Q_{ik}(\boldsymbol{\beta})-Q_{ik}(\boldsymbol{\beta}_{0})}\{I(e_{ik}\leq s)-I(e_{ik}\leq 0)\}ds\Big]. (A.9)

Note that, by Taylor expansion, Q_{ik}(\boldsymbol{\beta})-Q_{ik}(\boldsymbol{\beta}_{0})=(\boldsymbol{\beta}-\boldsymbol{\beta}_{0})^{\prime}\partial Q_{ik}(\boldsymbol{\beta}_{0})/\partial\boldsymbol{\beta}+\frac{1}{2}(\boldsymbol{\beta}-\boldsymbol{\beta}_{0})^{\prime}\{\partial^{2}Q_{ik}(\boldsymbol{\beta}^{*})/\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^{\prime}\}(\boldsymbol{\beta}-\boldsymbol{\beta}_{0}), where \boldsymbol{\beta}^{*} lies between \boldsymbol{\beta}_{0} and \boldsymbol{\beta}. Let u=\boldsymbol{\beta}-\boldsymbol{\beta}_{0},

q_{1ik}(u)=u^{\prime}\frac{\partial Q_{ik}(\boldsymbol{\beta}_{0})}{\partial\boldsymbol{\beta}}\quad\text{and}\quad q_{2ik}(u)=\frac{1}{2}u^{\prime}\frac{\partial^{2}Q_{ik}(\boldsymbol{\beta}^{*})}{\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^{\prime}}u.

We then decompose S_{n}(\boldsymbol{\beta}) into

S_{n}(\boldsymbol{\beta})=\sum_{k=1}^{K}\sum_{i=1}^{n}\Big[\{q_{1ik}(u)+q_{2ik}(u)\}\{I(e_{ik}<0)-\tau_{k}\} (A.10)
+\int_{0}^{q_{1ik}(u)+q_{2ik}(u)}\{I(e_{ik}\leq s)-I(e_{ik}\leq 0)\}ds\Big] (A.11)
=-u^{\prime}T_{n}+\Pi_{1n}(u)+\Pi_{2n}(u)+\Pi_{3n}(u), (A.12)

where

T_{n}=\sum_{k=1}^{K}\sum_{i=1}^{n}\frac{\partial Q_{ik}(\boldsymbol{\beta}_{0})}{\partial\boldsymbol{\beta}}\{\tau_{k}-I(e_{ik}\leq 0)\},

\xi_{ik}(u)=\int_{0}^{q_{1ik}(u)}\{I(e_{ik}\leq s)-I(e_{ik}\leq 0)\}ds,

\Pi_{1n}(u)=\sum_{k=1}^{K}\sum_{i=1}^{n}\left[\xi_{ik}(u)-E\{\xi_{ik}(u)|\mathbf{X}_{i}\}\right],

\Pi_{2n}(u)=\sum_{k=1}^{K}\sum_{i=1}^{n}E\{\xi_{ik}(u)|\mathbf{X}_{i}\},

and

\Pi_{3n}(u)=\sum_{k=1}^{K}\sum_{i=1}^{n}\Big[q_{2ik}(u)\{I(e_{ik}<0)-\tau_{k}\}+\int_{q_{1ik}(u)}^{q_{1ik}(u)+q_{2ik}(u)}\{I(e_{ik}\leq s)-I(e_{ik}\leq 0)\}ds\Big].

First, by the central limit theorem, we can show that

\frac{1}{\sqrt{n}}T_{n}=\frac{1}{\sqrt{n}}\sum_{k=1}^{K}\sum_{i=1}^{n}\frac{\partial Q_{ik}(\boldsymbol{\beta}_{0})}{\partial\boldsymbol{\beta}}\{\tau_{k}-I(e_{ik}\leq 0)\}\rightarrow N(0,\Omega_{0}) (A.13)

in distribution as n\rightarrow\infty. Note that, from Theorem 2, \widehat{u}_{n}=\widehat{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}=o_{p}(1) and then, by applying Lemma A.2,

S_{n}(\widehat{\boldsymbol{\beta}}_{n})=-\widehat{u}_{n}^{\prime}T_{n}+\Pi_{1n}(\widehat{u}_{n})+\Pi_{2n}(\widehat{u}_{n})+\Pi_{3n}(\widehat{u}_{n})=-\widehat{u}_{n}^{\prime}T_{n}+\frac{1}{2}(\sqrt{n}\widehat{u}_{n})^{\prime}\Omega_{1}(\sqrt{n}\widehat{u}_{n})+o_{p}(\sqrt{n}\|\widehat{u}_{n}\|+n\|\widehat{u}_{n}\|^{2})\geq-\|\sqrt{n}\widehat{u}_{n}\|\left\{\left\|\frac{1}{\sqrt{n}}T_{n}\right\|+o_{p}(1)\right\}+n\|\widehat{u}_{n}\|^{2}\left\{\frac{\lambda_{\min}(\Omega_{1})}{2}+o_{p}(1)\right\}, (A.14)

where \lambda_{\min}(\Omega_{1}) is the minimum eigenvalue of \Omega_{1}. This, together with the fact that S_{n}(\widehat{\boldsymbol{\beta}}_{n})=L_{n}(\widehat{\boldsymbol{\beta}}_{n})-L_{n}(\boldsymbol{\beta}_{0})\leq 0, implies that

\sqrt{n}\|\widehat{u}_{n}\|\leq\left\{\frac{\lambda_{\min}(\Omega_{1})}{2}+o_{p}(1)\right\}^{-1}\left\{\left\|\frac{1}{\sqrt{n}}T_{n}\right\|+o_{p}(1)\right\}=O_{p}(1). (A.15)

Denote u_{n}^{\star}=n^{-1}\Omega_{1}^{-1}T_{n} and, from (A.14) and (A.15),

S_{n}(\widehat{\boldsymbol{\beta}}_{n})=\frac{1}{2}(\sqrt{n}\widehat{u}_{n})^{\prime}\Omega_{1}(\sqrt{n}\widehat{u}_{n})-(\sqrt{n}\widehat{u}_{n})^{\prime}\Omega_{1}(\sqrt{n}u_{n}^{\star})+o_{p}(1). (A.16)

Moreover, since \sqrt{n}u_{n}^{\star}=O_{p}(1), equation (A.14) still holds when \widehat{u}_{n} is replaced by u_{n}^{\star}, and then

S_{n}(\boldsymbol{\beta}_{0}+u_{n}^{\star})=-\frac{1}{2}(\sqrt{n}u_{n}^{\star})^{\prime}\Omega_{1}(\sqrt{n}u_{n}^{\star})+o_{p}(1), (A.17)

which leads to

0\geq S_{n}(\widehat{\boldsymbol{\beta}}_{n})-S_{n}(\boldsymbol{\beta}_{0}+u_{n}^{\star})=\frac{1}{2}(\sqrt{n}\widehat{u}_{n}-\sqrt{n}u_{n}^{\star})^{\prime}\Omega_{1}(\sqrt{n}\widehat{u}_{n}-\sqrt{n}u_{n}^{\star})+o_{p}(1)\geq\frac{\lambda_{\min}(\Omega_{1})}{2}\|\sqrt{n}\widehat{u}_{n}-\sqrt{n}u_{n}^{\star}\|^{2}+o_{p}(1).

As a result, from (A.13),

\sqrt{n}\widehat{u}_{n}=\sqrt{n}u_{n}^{\star}+o_{p}(1)=\Omega_{1}^{-1}n^{-1/2}T_{n}+o_{p}(1)\rightarrow N(\boldsymbol{0},\Omega_{1}^{-1}\Omega_{0}\Omega_{1}^{-1})

in distribution as n\rightarrow\infty. ∎
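The limit at (A.13) can be illustrated by simulation. In the Python sketch below, an Exp(1) location-only toy model is an assumption made purely for illustration, with the derivative \partial Q_{ik}(\boldsymbol{\beta}_{0})/\partial\boldsymbol{\beta} playing the role of a constant; the empirical variance of n^{-1/2}T_{n} is then compared with the theoretical value \sum_{k,k^{\prime}}\{\min(\tau_{k},\tau_{k^{\prime}})-\tau_{k}\tau_{k^{\prime}}\}.

```python
import numpy as np

rng = np.random.default_rng(1)
taus = np.array([0.5, 0.7, 0.9])             # illustrative quantile levels
n, n_rep = 2000, 2000

# Exp(1) has closed-form quantiles, so e_ik = Y_i - Q(tau_k) is exact.
q_true = -np.log(1.0 - taus)

# theoretical Var of sum_k psi_{tau_k}(e_k):
cov = np.minimum.outer(taus, taus) - np.outer(taus, taus)
var_theory = cov.sum()

stats = np.empty(n_rep)
for r in range(n_rep):
    y = rng.exponential(size=n)
    psi = taus[None, :] - (y[:, None] <= q_true[None, :])  # n x K score matrix
    stats[r] = psi.sum() / np.sqrt(n)                      # n^{-1/2} T_n
print(stats.var(), var_theory)               # the two should be close
```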

Lemma A.1.

Consider the function G(\tau) defined in the proof of Lemma 1 with \theta_{3}<\widetilde{\theta}_{3}<0. It holds that (1) G(\tau) is strictly decreasing for \tau>0.5, and (2) G(\tau) is strictly increasing for \tau<0.5.

Lemma A.2.

Suppose that the conditions of Theorem 3 hold. For any sequence of random variables \{u_{n}\} with u_{n}=o_{p}(1), it holds that

(a) \Pi_{1n}(u_{n})=o_{p}(\sqrt{n}\|u_{n}\|+n\|u_{n}\|^{2}),

(b) \Pi_{2n}(u_{n})=(1/2)(\sqrt{n}u_{n})^{\prime}\Omega_{1}(\sqrt{n}u_{n})+o_{p}(n\|u_{n}\|^{2}), and

(c) \Pi_{3n}(u_{n})=o_{p}(n\|u_{n}\|^{2}),

where \Pi_{1n}(u), \Pi_{2n}(u) and \Pi_{3n}(u) are defined in the proof of Theorem 3.

A.2 Proofs for high-dimensional regularized estimation

This subsection first conducts a deterministic analysis in Lemma A.3 and then a stochastic analysis in Lemmas A.4 and A.5; the proof of Theorem 4 follows by combining these results.

We first treat the observed data, \{(Y_{i},\boldsymbol{X}_{i}^{\prime})^{\prime},i=1,\ldots,n\}, as deterministic. Consider the loss function L_{n}(\boldsymbol{\beta})=\sum_{i=1}^{n}\sum_{k=1}^{K}\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}))\} and its first-order Taylor-series error

\mathcal{E}_{n}(\Delta)=n^{-1}L_{n}(\boldsymbol{\beta}_{0}+\Delta)-n^{-1}L_{n}(\boldsymbol{\beta}_{0})-\langle n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0}),\Delta\rangle,

where \Delta\in\mathbb{R}^{dp}, \langle\cdot,\cdot\rangle is the inner product, e_{ik}=Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}_{0})), \psi_{\tau}(u)=\tau-I(u<0), and \nabla L_{n}(\boldsymbol{\beta}_{0})=-\sum_{i=1}^{n}\sum_{k=1}^{K}\psi_{\tau_{k}}(e_{ik})\partial Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}_{0}))/\partial\boldsymbol{\beta} is a subgradient of L_{n}(\cdot) at \boldsymbol{\beta}_{0}.

Definition 1.

The loss function L_{n}(\cdot) satisfies the local restricted strong convexity (LRSC) condition if

\mathcal{E}_{n}(\Delta)\geq\alpha\|\Delta\|_{2}^{2}-\eta\sqrt{\frac{\log p}{n}}\|\Delta\|_{1}\quad\text{for all }\Delta\text{ such that }0<r\leq\|\Delta\|_{2}\leq R,

where \alpha,\eta>0, and \|\cdot\|_{1} and \|\cdot\|_{2} are the \ell_{1} and \ell_{2} norms, respectively.

The above LRSC condition has a larger tolerance term than that in Loh and Wainwright, (2015), which takes the form (\log p/n)\|\Delta\|_{1}^{2}; a similar tolerance term can also be found in Fan et al., (2019) for high-dimensional generalized trace regression. We are now ready to establish an upper bound on the estimation error when the tuning parameter \lambda is appropriately selected.
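For intuition, the first-order Taylor-series error \mathcal{E}_{n}(\Delta) can be computed directly for a toy model. The following Python sketch uses a linear specification Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}))=\boldsymbol{X}_{i}^{\prime}\boldsymbol{\beta} as a stand-in for the quantile index structure, and the constants \alpha=0.25 and \eta=1 are illustrative assumptions, not the values delivered by Lemma A.5.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 20
taus = np.array([0.5, 0.7, 0.9])
X = rng.standard_normal((n, p))
beta0 = np.zeros(p); beta0[:3] = 1.0
y = X @ beta0 + rng.standard_normal(n)

def rho(u, tau):                      # check function
    return u * (tau - (u < 0))

def loss(beta):                       # n^{-1} L_n(beta) for the toy linear model
    u = y - X @ beta
    return sum(rho(u, t).sum() for t in taus) / n

def grad(beta):                       # n^{-1} subgradient of L_n at beta
    u = y - X @ beta
    psi = np.stack([t - (u < 0) for t in taus])      # K x n scores
    return -(psi.sum(axis=0) @ X) / n

def taylor_error(delta):              # E_n(Delta) from the display above
    return loss(beta0 + delta) - loss(beta0) - grad(beta0) @ delta

delta = 0.5 * rng.standard_normal(p) / np.sqrt(p)
alpha, eta = 0.25, 1.0                # illustrative constants for the LRSC bound
lower = alpha * (delta @ delta) - eta * np.sqrt(np.log(p) / n) * np.abs(delta).sum()
print(taylor_error(delta), lower)     # E_n(Delta) exceeds the lower bound here
```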

Lemma A.3.

Suppose that the regularizer p_{\lambda}(\cdot) satisfies Assumption 4, and the loss function L_{n}(\cdot) satisfies the LRSC condition with \alpha>\mu/4 and r=\frac{12\eta\sqrt{s}}{(4\alpha-\mu)L}\sqrt{\frac{\log p}{n}}. If the tuning parameter \lambda satisfies

\lambda\geq\frac{4}{L}\max\left\{\|n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0})\|_{\infty},\eta\sqrt{\frac{\log p}{n}}\right\}, (A.18)

then the minimizer \widetilde{\boldsymbol{\beta}}_{n} over the set \Theta=\mathcal{B}_{R}(\boldsymbol{\beta}_{0}) satisfies the error bounds

\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|_{2}\leq\frac{6\sqrt{s}\lambda L}{4\alpha-\mu}\quad\text{and}\quad\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|_{1}\leq\frac{24s\lambda L}{4\alpha-\mu}. (A.19)

Proof.

Denote \widetilde{\Delta}_{n}=\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}, and it holds that \|\widetilde{\Delta}_{n}\|_{2}\leq R. Note that the lemma holds trivially if \|\widetilde{\Delta}_{n}\|_{2}\leq(3\sqrt{s}\lambda)/(4\alpha-\mu). As a result, we only need to consider the case with (3\sqrt{s}\lambda)/(4\alpha-\mu)\leq\|\widetilde{\Delta}_{n}\|_{2}\leq R, and it can be verified that

r=\frac{12\eta\sqrt{s}}{(4\alpha-\mu)L}\sqrt{\frac{\log p}{n}}\leq\|\widetilde{\Delta}_{n}\|_{2}\leq R. (A.20)

Note that n^{-1}L_{n}(\widetilde{\boldsymbol{\beta}}_{n})+p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})\leq n^{-1}L_{n}(\boldsymbol{\beta}_{0})+p_{\lambda}(\boldsymbol{\beta}_{0}), and then \mathcal{E}_{n}(\widetilde{\Delta}_{n})\leq p_{\lambda}(\boldsymbol{\beta}_{0})-p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})-\langle n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0}),\widetilde{\Delta}_{n}\rangle. This, together with the LRSC condition and Hölder's inequality, implies that

\alpha\|\widetilde{\Delta}_{n}\|_{2}^{2}\leq p_{\lambda}(\boldsymbol{\beta}_{0})-p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})+\left(\eta\sqrt{\frac{\log p}{n}}+\|n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0})\|_{\infty}\right)\|\widetilde{\Delta}_{n}\|_{1} (A.21)

\leq p_{\lambda}(\boldsymbol{\beta}_{0})-p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})+\frac{\lambda L}{2}\|\widetilde{\Delta}_{n}\|_{1} (A.22)

\leq p_{\lambda}(\boldsymbol{\beta}_{0})-p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})+\frac{1}{2}\left\{p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0})+\frac{\mu}{2}\|\widetilde{\Delta}_{n}\|_{2}^{2}\right\} (A.23)

\leq p_{\lambda}(\boldsymbol{\beta}_{0})-p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})+\frac{1}{2}\left\{p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})+p_{\lambda}(\boldsymbol{\beta}_{0})+\frac{\mu}{2}\|\widetilde{\Delta}_{n}\|_{2}^{2}\right\}, (A.24)

where the last inequality follows from the subadditivity of p_{\lambda}(\cdot), while the penultimate inequality is by Assumption 4; see also Lemma 4 in Loh and Wainwright, (2015). As a result,

0<\left(\alpha-\frac{\mu}{4}\right)\|\widetilde{\Delta}_{n}\|_{2}^{2}\leq\frac{3}{2}p_{\lambda}(\boldsymbol{\beta}_{0})-\frac{1}{2}p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n}). (A.25)

Moreover, from Lemma 5 in Loh and Wainwright, (2015), it holds that

0\leq 3p_{\lambda}(\boldsymbol{\beta}_{0})-p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})\leq\lambda L(3\|(\widetilde{\Delta}_{n})_{A}\|_{1}-\|(\widetilde{\Delta}_{n})_{A^{c}}\|_{1}), (A.26)

where A is the index set of the s largest elements of \widetilde{\boldsymbol{\beta}}_{n} in magnitude. Combining (A.25) and (A.26), we have

\left(\alpha-\frac{\mu}{4}\right)\|\widetilde{\Delta}_{n}\|_{2}^{2}\leq\frac{3\lambda L}{2}\|(\widetilde{\Delta}_{n})_{A}\|_{1}\leq\frac{3\lambda L\sqrt{s}}{2}\|\widetilde{\Delta}_{n}\|_{2}. (A.27)

As a result,

\|\widetilde{\Delta}_{n}\|_{2}\leq\frac{6\sqrt{s}\lambda L}{4\alpha-\mu}. (A.28)

It is also implied by (A.26) that \|(\widetilde{\Delta}_{n})_{A^{c}}\|_{1}\leq 3\|(\widetilde{\Delta}_{n})_{A}\|_{1}, which leads to

\|\widetilde{\Delta}_{n}\|_{1}\leq\|(\widetilde{\Delta}_{n})_{A}\|_{1}+\|(\widetilde{\Delta}_{n})_{A^{c}}\|_{1}\leq 4\|(\widetilde{\Delta}_{n})_{A}\|_{1}\leq 4\sqrt{s}\|\widetilde{\Delta}_{n}\|_{2}. (A.29)

This accomplishes the proof of this lemma. ∎
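As a quick sanity check on the bounds at (A.19), one can plug in constants; the numbers below are illustrative assumptions only. Note that the \ell_{1} bound equals 4\sqrt{s} times the \ell_{2} bound, matching (A.29).

```python
import numpy as np

# illustrative constants: sparsity s, tuning lambda, and (L, alpha, mu) from
# Assumption 4 and the LRSC condition, chosen here only for demonstration
s, lam, L, alpha, mu = 9, 0.05, 1.0, 0.5, 0.1

l2_bound = 6 * np.sqrt(s) * lam * L / (4 * alpha - mu)
l1_bound = 24 * s * lam * L / (4 * alpha - mu)
print(l2_bound, l1_bound, l1_bound / l2_bound)   # ratio is 4*sqrt(s) = 12
```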

We next conduct the stochastic analysis, verifying in Lemmas A.4 and A.5 that the “good” event at (A.18) and the LRSC condition hold with high probability, respectively.

Lemma A.4.

If Assumption 5 holds, then

\|n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0})\|_{\infty}\leq C_{S}\sqrt{\frac{\log p}{n}} (A.30)

with probability at least 1-c_{1}p^{-c_{2}} for some positive constants c_{1}, c_{2} and C_{S}.

Proof.

Note that

n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0})=-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\psi_{\tau_{k}}(e_{ik})\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}_{0}))}{\partial\boldsymbol{\gamma}}\otimes\boldsymbol{X}_{i}, (A.31)

where e_{ik}=Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}_{0})), \psi_{\tau}(u)=\tau-I(u<0), and \boldsymbol{X}_{i}=(X_{1i},\ldots,X_{pi})^{\prime}. For 1\leq j\leq d, denote \xi_{j}(\boldsymbol{X}_{i})=\partial Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}_{0}))/\partial\gamma_{j} and, from Assumption 5, it holds that |\xi_{j}(\boldsymbol{X}_{i})|\leq L_{Q}.

It can be verified that, conditional on \boldsymbol{X}_{i}, \psi_{\tau_{k}}(e_{ik}) is sub-Gaussian with parameter 0.5, and hence, for any \delta>0,

E\exp[\delta n^{-1}\psi_{\tau_{k}}(e_{ik})\xi_{j}(\boldsymbol{X}_{i})X_{li}]\leq E\exp\{[\delta n^{-1}\xi_{j}(\boldsymbol{X}_{i})X_{li}]^{2}/8\}\leq\exp\{[\delta n^{-1}L_{Q}C_{X}]^{2}/8\}.

As a result, for each t>0, 1\leq j\leq d and 1\leq l\leq p,

P\left(\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\psi_{\tau_{k}}(e_{ik})\xi_{j}(\boldsymbol{X}_{i})X_{li}>t\right) (A.32)
\leq\inf_{\delta>0}\exp(-\delta t)\prod_{i=1}^{n}\prod_{k=1}^{K}E\exp[\delta n^{-1}\psi_{\tau_{k}}(e_{ik})\xi_{j}(\boldsymbol{X}_{i})X_{li}] (A.33)
\leq\inf_{\delta>0}\exp\left(\frac{n^{-1}K(L_{Q}C_{X})^{2}}{8}\delta^{2}-\delta t\right)\leq\exp\left(\frac{-2nt^{2}}{K(L_{Q}C_{X})^{2}}\right), (A.34)

which implies that

P\left(\|n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0})\|_{\infty}\geq t\right)\leq\exp\left(\frac{-2nt^{2}}{K(L_{Q}C_{X})^{2}}+\log(2dp)\right).

By letting t=C_{S}\sqrt{\log p/n} with C_{S}>\sqrt{0.5K}L_{Q}C_{X}, we accomplish the proof with c_{1}=2d and c_{2}=2C_{S}^{2}/[K(L_{Q}C_{X})^{2}]-1>0. ∎
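The maximal inequality at (A.30) can also be visualized by simulation. The Python sketch below takes \xi_{j}\equiv 1 and bounded Uniform(-1,1) covariates, so that L_{Q}=C_{X}=1; these choices are illustrative assumptions, and the printed maximum is typically dominated by C_{S}\sqrt{\log p/n} with C_{S}>\sqrt{0.5K}.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, taus = 1000, 200, np.array([0.5, 0.7, 0.9])
q_true = -np.log(1.0 - taus)                   # Exp(1) quantiles (closed form)

y = rng.exponential(size=n)
X = rng.uniform(-1.0, 1.0, size=(n, p))        # bounded covariates, C_X = 1
psi = (taus[None, :] - (y[:, None] <= q_true[None, :])).sum(axis=1)   # length n
score = np.abs(psi @ X).max() / n              # sup-norm of the averaged score
C_S = np.sqrt(0.5 * len(taus)) + 0.1           # any C_S > sqrt(0.5 K) works here
print(score, C_S * np.sqrt(np.log(p) / n))     # score typically falls below the bound
```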

Lemma A.5.

Suppose that Assumptions 4-6 hold. Given the sample size n\geq c^{\prime}\log p for some c^{\prime}>0, it holds that

\mathcal{E}_{n}(\Delta)\geq\alpha\|\Delta\|_{2}^{2}-\eta\sqrt{\frac{\log p}{n}}\|\Delta\|_{1}\quad\text{for all }r\leq\|\Delta\|_{2}\leq R

with probability at least 1-c_{1}p^{-c_{2}}-K\log(\sqrt{dp}R/r)p^{-c^{2}} for any c>1, where \alpha=0.5f_{\min}\lambda_{\min}^{0} and \eta=KC_{E}d2^{d+1}+2KL_{Q}C_{X}c+C_{S}.

Proof.

We first show the strong convexity of \bar{L}(\boldsymbol{\beta})=E[n^{-1}L_{n}(\boldsymbol{\beta})]=E[\sum_{k=1}^{K}\rho_{\tau_{k}}\{Y-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))\}]. Let Q^{*}(\boldsymbol{\beta})=(Q(\tau_{1},\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta})),\ldots,Q(\tau_{K},\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta})))^{\prime} and, by Taylor expansion,

E\|Q^{*}(\boldsymbol{\beta})-Q^{*}(\boldsymbol{\beta}_{0})\|_{2}^{2}=E\left\|\frac{\partial Q^{*}(\boldsymbol{\beta}^{*})}{\partial\boldsymbol{\beta}}\Delta\right\|_{2}^{2}=\Delta^{\prime}\Omega_{2}(\boldsymbol{\beta}^{*})\Delta,

where \Delta=\boldsymbol{\beta}-\boldsymbol{\beta}_{0}, and \boldsymbol{\beta}^{*} lies between \boldsymbol{\beta}_{0} and \boldsymbol{\beta}. Note that \partial\bar{L}(\boldsymbol{\beta}_{0})/\partial\boldsymbol{\beta}=0 and hence, by Knight's identity at (A.5) and Assumption 6, it can be verified that

\bar{\mathcal{E}}(\Delta):=\bar{L}(\boldsymbol{\beta})-\bar{L}(\boldsymbol{\beta}_{0})-\langle\Delta,\partial\bar{L}(\boldsymbol{\beta}_{0})/\partial\boldsymbol{\beta}\rangle=\bar{L}(\boldsymbol{\beta})-\bar{L}(\boldsymbol{\beta}_{0})=E\left(\sum_{k=1}^{K}\int_{0}^{Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta}))-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta}_{0}))}\{F_{Y|\boldsymbol{X}}(Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta}_{0}))+s)-F_{Y|\boldsymbol{X}}(Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta}_{0})))\}ds\right)\geq 0.5f_{\min}E\|Q^{*}(\boldsymbol{\beta})-Q^{*}(\boldsymbol{\beta}_{0})\|_{2}^{2}\geq 0.5f_{\min}\lambda_{\min}^{0}\|\Delta\|_{2}^{2} (A.35)

uniformly over \{\Delta\in\mathbb{R}^{dp}:\|\Delta\|_{2}\leq R\}.

For 1\leq k\leq K, denote L_{n}^{(k)}(\boldsymbol{\beta})=\sum_{i=1}^{n}\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}))\} and \bar{L}_{k}(\boldsymbol{\beta})=E[n^{-1}L_{n}^{(k)}(\boldsymbol{\beta})]=E[\rho_{\tau_{k}}\{Y-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))\}]. Note that L_{n}(\boldsymbol{\beta})=\sum_{k=1}^{K}L_{n}^{(k)}(\boldsymbol{\beta}) and \bar{L}(\boldsymbol{\beta})=\sum_{k=1}^{K}\bar{L}_{k}(\boldsymbol{\beta}). Let \mathcal{E}_{k}^{*}(\Delta)=|n^{-1}L_{n}^{(k)}(\boldsymbol{\beta})-n^{-1}L_{n}^{(k)}(\boldsymbol{\beta}_{0})-\{\bar{L}_{k}(\boldsymbol{\beta})-\bar{L}_{k}(\boldsymbol{\beta}_{0})\}|, and we next prove that, uniformly for r\leq\|\Delta\|_{2}\leq R,

\mathcal{E}_{k}^{*}(\Delta)\leq C_{\mathcal{E}}\sqrt{\frac{\log p}{n}}\|\Delta\|_{1} (A.36)

with probability at least 1-\log(\sqrt{dp}R/r)p^{-c^{2}} for any c>1, where C_{\mathcal{E}}=C_{E}d2^{d+1}+2L_{Q}C_{X}c. As in Theorem 9.34 of Wainwright, (2019), we use the peeling argument, a common strategy in empirical process theory.

Tail bound for a fixed radius: Define the set \mathcal{C}(r_{1}):=\{\Delta\in\mathbb{R}^{dp}:\|\Delta\|_{1}\leq r_{1}\} for a fixed radius r_{1}>0, and the random variable A_{n}(r_{1})=r_{1}^{-1}\sup_{\Delta\in\mathcal{C}(r_{1})}\mathcal{E}_{k}^{*}(\Delta). We next show that, for any t>0,

A_{n}(r_{1})\leq C_{E}d2^{d}\sqrt{\frac{\log p}{n}}+L_{Q}C_{X}\sqrt{\frac{t}{n}} (A.37)

with probability at least 1-e^{-t}.

For 1\leq i\leq n, denote \boldsymbol{W}_{i}=(Y_{i},\boldsymbol{X}_{i}^{\prime})^{\prime}. Note that the random variable A_{n}(r_{1}) is of the form f(\boldsymbol{W}_{1},\ldots,\boldsymbol{W}_{n}), and it is guaranteed by Assumption 5 that

|f(\boldsymbol{W}_{1},\ldots,\boldsymbol{W}_{i},\ldots,\boldsymbol{W}_{n})-f(\boldsymbol{W}_{1},\ldots,\boldsymbol{W}_{i}^{\prime},\ldots,\boldsymbol{W}_{n})|\leq n^{-1}L_{Q}C_{X},

i.e., if we replace \boldsymbol{W}_{i} by \boldsymbol{W}_{i}^{\prime} while keeping the other \boldsymbol{W}_{j}'s fixed, then A_{n}(r_{1}) changes by at most n^{-1}L_{Q}C_{X}. As a result, by the bounded differences inequality, for any t>0,

A_{n}(r_{1})\leq E[A_{n}(r_{1})]+L_{Q}C_{X}\sqrt{\frac{t}{n}} (A.38)

with probability at least 1-e^{-t}.

In addition, it is implied by Assumption 5 that, for all \boldsymbol{\beta},\widetilde{\boldsymbol{\beta}}\in\mathbb{R}^{dp},

|\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}))\}-\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\widetilde{\boldsymbol{\beta}}))\}|\leq L_{Q}\sum_{l=1}^{d}|\boldsymbol{X}_{i}^{\prime}(\boldsymbol{\beta}_{l}-\widetilde{\boldsymbol{\beta}}_{l})|,

which leads to

E[A_{n}(r_{1})]\leq\frac{2}{nr_{1}}E\left(\sup_{\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\in\mathcal{C}(r_{1})}\left|\sum_{i=1}^{n}\epsilon_{i}\left[\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}))\}-\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}_{0}))\}\right]\right|\right)\leq\frac{C2^{d}}{nr_{1}}E\left(\sup_{\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\in\mathcal{C}(r_{1})}\left|\sum_{l=1}^{d}\sum_{i=1}^{n}V_{il}\{\boldsymbol{X}_{i}^{\prime}(\boldsymbol{\beta}_{l}-\boldsymbol{\beta}_{0l})\}\right|\right)\leq\frac{C2^{d}}{n}E\left(\sum_{l=1}^{d}\left\|\sum_{i=1}^{n}V_{il}\boldsymbol{X}_{i}\right\|_{\infty}\right)\leq\frac{Cd2^{d}}{n}E\left(\left\|\sum_{i=1}^{n}V_{i1}\boldsymbol{X}_{i}\right\|_{\infty}\right)\leq C_{E}d2^{d}\sqrt{\frac{\log p}{n}}, (A.39)

where \{\epsilon_{i}\} and \{V_{il}\} are i.i.d. Rademacher and standard Gaussian random variables, respectively; the first inequality is due to the symmetrization theorem (Loh and Wainwright,, 2015, Lemma 12), the second is by the multivariate contraction theorem (van de Geer,, 2016, Theorem 16.3), the third is due to the fact that |\boldsymbol{X}^{\prime}\boldsymbol{\beta}|\leq\|\boldsymbol{X}\|_{\infty}\|\boldsymbol{\beta}\|_{1}, and the last is by Lemma 16 of Loh and Wainwright, (2015) given the sample size n\geq c^{\prime}\log p for some c^{\prime}>0. The upper bound at (A.37) then follows by combining (A.38) and (A.39).

Extension to uniform radii via peeling: Define a sequence of sets \Theta_{l}:=\{\Delta\in\mathbb{R}^{dp}:2^{l-1}r\leq\|\Delta\|_{1}\leq 2^{l}r\} with 1\leq l\leq N=\log(\sqrt{dp}R/r). It can be verified that \{\Delta\in\mathbb{R}^{dp}:r\leq\|\Delta\|_{2}\leq R\}\subseteq\{\Delta\in\mathbb{R}^{dp}:r\leq\|\Delta\|_{1}\leq\sqrt{dp}R\}\subseteq\cup_{l=1}^{N}\Theta_{l}. As a result,

P\left(\mathcal{E}_{k}^{*}(\Delta)\geq C_{\mathcal{E}}\sqrt{\frac{\log p}{n}}\|\Delta\|_{1}\text{ for some }\Delta\in\cup_{l=1}^{N}\Theta_{l}\right)\leq\sum_{l=1}^{N}P\left(\mathcal{E}_{k}^{*}(\Delta)\geq 2^{l-1}rC_{\mathcal{E}}\sqrt{\frac{\log p}{n}}\text{ for some }\Delta\in\Theta_{l}\right)

=\sum_{l=1}^{N}P\left(\mathcal{E}_{k}^{*}(\Delta)\geq(C_{E}d2^{d})(2^{l}r)\sqrt{\frac{\log p}{n}}+L_{Q}C_{X}(2^{l}r)\sqrt{\frac{c^{2}\log p}{n}}\text{ for some }\Delta\in\Theta_{l}\right)

\leq\sum_{l=1}^{N}P\left(A_{n}(2^{l}r)\geq C_{E}d2^{d}\sqrt{\frac{\log p}{n}}+L_{Q}C_{X}\sqrt{\frac{c^{2}\log p}{n}}\right),

where C_{\mathcal{E}}=C_{E}d2^{d+1}+2L_{Q}C_{X}c. By applying (A.37) with t=c^{2}\log p, it holds that

P\left(\mathcal{E}_{k}^{*}(\Delta)\geq C_{\mathcal{E}}\sqrt{\frac{\log p}{n}}\|\Delta\|_{1}\text{ for some }r\leq\|\Delta\|_{2}\leq R\right)\leq\sum_{l=1}^{N}e^{-c^{2}\log p}=\log(\sqrt{dp}R/r)p^{-c^{2}},

i.e., (A.36) holds.

Finally, from (A.35), (A.36) and Lemma A.4,

\mathcal{E}_{n}(\Delta)\geq\bar{\mathcal{E}}(\Delta)-\sum_{k=1}^{K}\mathcal{E}_{k}^{*}(\Delta)-\|n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0})\|_{\infty}\|\Delta\|_{1}\geq 0.5f_{\min}\lambda_{\min}^{0}\|\Delta\|_{2}^{2}-(KC_{\mathcal{E}}+C_{S})\sqrt{\frac{\log p}{n}}\|\Delta\|_{1}

uniformly for r\leq\|\Delta\|_{2}\leq R with probability at least 1-c_{1}p^{-c_{2}}-K\log(\sqrt{dp}R/r)p^{-c^{2}} for any c>1. This accomplishes the proof. ∎

Proof of Theorem 4.

The stochastic analysis in Lemmas A.4 and A.5 shows that, with high probability, the selected tuning parameter \lambda satisfies (A.18) and the loss function satisfies the LRSC condition. The claimed error bounds then follow from the deterministic analysis in Lemma A.3. ∎

References

  • Abrevaya, (2001) Abrevaya, J. (2001). The effect of demographics and maternal behavior on the distribution of birth outcomes. Empirical Economics, 26:247–259.
  • Andrews, (1987) Andrews, D. W. K. (1987). Consistency in nonlinear econometric models: A generic uniform law of large numbers. Econometrica, 55:1465–1471.
  • Beirlant and Goegebeur, (2004) Beirlant, J. and Goegebeur, Y. (2004). Local polynomial maximum likelihood estimation for Pareto-type distributions. Journal of Multivariate Analysis, 89:97–118.
  • Belloni and Chernozhukov, (2011) Belloni, A. and Chernozhukov, V. (2011). l1l_{1}-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39:82–130.
  • Belloni et al., (2019a) Belloni, A., Chernozhukov, V., Chetverikov, D., and Fernandez-Val, I. (2019a). Conditional quantile processes based on series or many regressors. Journal of Econometrics, 213:4–29.
  • Belloni et al., (2019b) Belloni, A., Chernozhukov, V., and Kato, K. (2019b). Valid post-selection inference in high-dimensional approximately sparse quantile regression models. Journal of the American Statistical Association, 114:749–758.
  • Bertsekas, (2016) Bertsekas, D. P. (2016). Nonlinear Programming. Athena Scientific.
  • Cade and Noon, (2003) Cade, B. S. and Noon, B. R. (2003). A gentle introduction to quantile regression for ecologists. Frontiers in Ecology and the Environment, 1:412–420.
  • Chernozhukov, (2005) Chernozhukov, V. (2005). Extremal quantile regression. The Annals of Statistics, 33:806–839.
  • Chernozhukov et al., (2015) Chernozhukov, V., Fernández-Val, I., and Kowalski, A. E. (2015). Quantile regression with censoring and endogeneity. Journal of Econometrics, 186:201–221.
  • Chernozhukov et al., (2020) Chernozhukov, V., Fernández-Val, I., and Melly, B. (2020). Fast algorithms for the quantile regression process. Empirical Economics.
  • Davino et al., (2014) Davino, C., Furno, M., and Vistocco, D. (2014). Quantile Regression: Theory and Applications. Wiley, New York.
  • Dong et al., (2019) Dong, C., Li, G., and Feng, X. (2019). Lack-of-fit tests for quantile regression models. Journal of the Royal Statistical Society, Series B, 81:629–648.
  • Elsner et al., (2008) Elsner, J. B., Kossin, J. P., and Jagger, T. H. (2008). The increasing intensity of the strongest tropical cyclones. Nature, 455:92–95.
  • Fan et al., (2019) Fan, J., Gong, W., and Zhu, Z. (2019). Generalized high-dimensional trace regression via nuclear norm regularization. Journal of Econometrics, 212:177–202.
  • Fan and Li, (2001) Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360.
  • Fenske et al., (2011) Fenske, N., Kneib, T., and Hothorn, T. (2011). Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression. Journal of the American Statistical Association, 106:494–510.
  • Fournier et al., (2007) Fournier, B., Rupin, N., Bigerelle, M., Najjar, D., Iost, A., and Wilcox, R. (2007). Estimating the parameters of a generalized lambda distribution. Computational Statistics & Data Analysis, 51:2813–2835.
  • Freimer et al., (1988) Freimer, M., Kollia, G., Mudholkar, G. S., and Lin, C. T. (1988). A study of the generalized Tukey lambda family. Communications in Statistics-Theory and Methods, 17:3547–3567.
  • Galvao and Kato, (2016) Galvao, A. F. and Kato, K. (2016). Smoothed quantile regression for panel data. Journal of Econometrics, 193:92–112.
  • Gilchrist, (2000) Gilchrist, W. G. (2000). Statistical Modelling with Quantile Functions. Chapman & Hall/CRC, London.
  • Hendricks and Koenker, (1991) Hendricks, W. and Koenker, R. (1991). Hierarchical spline models for conditional quantiles and the demand for electricity. Journal of the American Statistical Association, 87:58–68.
  • Jagger and Elsner, (2008) Jagger, T. H. and Elsner, J. B. (2008). Modeling tropical cyclone intensity with quantile regression. International Journal of Climatology, 29:1351–1361.
  • Jiang et al., (2012) Jiang, X., Jiang, J., and Song, X. (2012). Oracle model selection for nonlinear models based on weighted composite quantile regression. Statistica Sinica, 22:1479–1506.
  • Kai et al., (2010) Kai, B., Li, R., and Zou, H. (2010). Local composite quantile regression smoothing: An efficient and safe alternative to local polynomial regression. Journal of the Royal Statistical Society, Series B, 72:49–69.
  • Kato et al., (2012) Kato, K., Galvao, A. F., and Montes-Rojas, G. V. (2012). Asymptotics for panel quantile regression models with individual effects. Journal of Econometrics, 170:76–91.
  • Knight, (1998) Knight, K. (1998). Limiting distributions for L1 regression estimators under general conditions. The Annals of Statistics, 26:755–770.
  • Koenker, (2005) Koenker, R. (2005). Quantile Regression. Cambridge University Press, Cambridge.
  • Koenker, (2011) Koenker, R. (2011). Additive models for quantile regression: Model selection and confidence bandaids. Brazilian Journal of Probability and Statistics, 25:239–262.
  • Koenker, (2017) Koenker, R. (2017). Quantile regression: 40 years on. Annual Review of Economics, 9:155–176.
  • Koenker and Bassett, (1978) Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica, 46:33–50.
  • Kuester et al., (2006) Kuester, K., Mittnik, S., and Paolella, M. S. (2006). Value-at-Risk prediction: A comparison of alternative strategies. Journal of Financial Econometrics, 4:53–89.
  • Linton and Xiao, (2017) Linton, O. and Xiao, Z. (2017). Quantile regression applications in finance. In Handbook of Quantile Regression, pages 381–407. Chapman and Hall/CRC.
  • Lobeto et al., (2021) Lobeto, H., Menendez, M., and Losada, I. J. (2021). Future behavior of wind wave extremes due to climate change. Scientific Reports, 11:1–12.
  • Loh, (2017) Loh, P. (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, 45:866–896.
  • Loh and Wainwright, (2015) Loh, P. and Wainwright, M. J. (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559–616.
  • Meinshausen and Ridgeway, (2006) Meinshausen, N. and Ridgeway, G. (2006). Quantile regression forests. Journal of Machine Learning Research, 7:983–999.
  • Moon et al., (2021) Moon, S. J., Jeon, J.-J., Lee, J. S. H., and Kim, Y. (2021). Learning multiple quantiles with neural networks. Journal of Computational and Graphical Statistics, In press.
  • Noufaily and Jones, (2013) Noufaily, A. and Jones, M. C. (2013). Parametric quantile regression based on the generalized gamma distribution. Journal of the Royal Statistical Society, Series C, 62:723–740.
  • Pan and Zhou, (2021) Pan, X. and Zhou, W.-X. (2021). Multiplier bootstrap for quantile regression: Non-asymptotic theory under random design. Information and Inference, 3:813–861.
  • Pollard, (1985) Pollard, D. (1985). New ways to prove central limit theorems. Econometric Theory, 1:295–313.
  • Racine and Li, (2017) Racine, J. S. and Li, K. (2017). Nonparametric conditional quantile estimation: a locally weighted quantile kernel approach. Journal of Econometrics, 201:72–94.
  • Sherwood et al., (2016) Sherwood, B., Wang, L., et al. (2016). Partially linear additive quantile regression in ultra-high dimension. The Annals of Statistics, 44:288–317.
  • Sivakumar and Banerjee, (2017) Sivakumar, V. and Banerjee, A. (2017). High-dimensional structured quantile regression. In International Conference on Machine Learning, pages 3220–3229.
  • van de Geer, (2016) van de Geer, S. A. (2016). Estimation and testing under sparsity. Springer.
  • van der Vaart, (1998) van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.
  • Vasicek, (1976) Vasicek, O. (1976). A test for normality based on sample entropy. Journal of the Royal Statistical Society, Series B, 38:54–59.
  • Wainwright, (2019) Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
  • Wang and Tsai, (2009) Wang, H. and Tsai, C.-L. (2009). Tail index regression. Journal of the American Statistical Association, 104:1233–1240.
  • Wang and Li, (2013) Wang, H. J. and Li, D. (2013). Estimation of extreme conditional quantiles through power transformation. Journal of the American Statistical Association, 108:1062–1074.
  • Wang et al., (2012a) Wang, H. J., Li, D., and He, X. (2012a). Estimation of high conditional quantiles for heavy-tailed distributions. Journal of the American Statistical Association, 107:1453–1464.
  • Wang and He, (2021) Wang, L. and He, X. (2021). Analysis of global and local optima of regularized quantile regression in high dimension: a subgradient approach. Technical report, Miami Herbert Business School, University of Miami.
  • Wang et al., (2012b) Wang, L., Wu, Y., and Li, R. (2012b). Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association, 107:214–222.
  • Yu et al., (2003) Yu, K., Lu, Z., and Stander, J. (2003). Quantile regression: applications and current research areas. Journal of the Royal Statistical Society: Series D, 52:331–350.
  • Zhang, (2010) Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38:894–942.
  • Zheng et al., (2015) Zheng, Q., Peng, L., and He, X. (2015). Globally adaptive quantile regression with ultra-high dimensional data. The Annals of Statistics, 43:2225–2258.
  • Zheng et al., (2018) Zheng, Y., Zhu, Q., Li, G., and Xiao, Z. (2018). Hybrid quantile regression estimation for time series models with conditional heteroscedasticity. Journal of the Royal Statistical Society, Series B, 80:975–993.
  • Zhu and Ling, (2011) Zhu, K. and Ling, S. (2011). Global self-weighted and local quasi-maximum exponential likelihood estimators for ARMA-GARCH/IGARCH models. The Annals of Statistics, 39:2131–2163.
  • Zhu and Li, (2021) Zhu, Q. and Li, G. (2021). Quantile double autoregression. Econometric Theory, In press.
  • Zhu et al., (2018) Zhu, Q., Zheng, Y., and Li, G. (2018). Linear double autoregression. Journal of Econometrics, 207:162–174.
  • Zou and Yuan, (2008) Zou, H. and Yuan, M. (2008). Composite quantile regression and the oracle model selection theory. The Annals of Statistics, 36:1108–1126.
Figure 1: Boxplots for the fitted location parameters \widehat{\boldsymbol{\beta}}_{1,1} (left panel), \widehat{\boldsymbol{\beta}}_{1,2} (middle panel), and \widehat{\boldsymbol{\beta}}_{1,3} (right panel). The sample size is n=500, 1000 or 2000, and the lower bound of the quantile range [\tau_{L},\tau_{U}] is \tau_{L}=0.5, 0.7 or 0.9.
Figure 2: Estimation errors \|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\| against the quantities \sqrt{(s\log p)/n} (left panel) and n/(s\log p) (right panel), respectively.
Table 1: Mean square errors of the predicted conditional quantile Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widehat{\boldsymbol{\beta}}_{n})) at the level \tau^{*}=0.991 or 0.995. The values in brackets refer to the corresponding sample standard deviations of prediction errors in squared loss.

                               \boldsymbol{X}=(1,0.1,-0.2)^{T}          \boldsymbol{X}=(1,0,0)^{T}
  n      [\tau_L,\tau_U]       0.991         0.995                      0.991         0.995
  True                         10.34         11.83                      15.13         18.84
  500    [0.5,0.99]            1.32(2.17)    2.35(4.11)                 5.41(11.41)   12.13(28.19)
         [0.7,0.99]            1.42(2.20)    2.55(4.07)                 5.42(10.95)   12.12(26.73)
         [0.9,0.99]            2.00(3.92)    3.64(7.56)                 6.18(12.86)   14.10(32.78)
  1000   [0.5,0.99]            0.77(1.68)    1.39(3.29)                 2.67(5.28)    5.93(12.61)
         [0.7,0.99]            0.80(1.39)    1.44(2.64)                 2.62(4.27)    5.75(9.75)
         [0.9,0.99]            1.31(2.53)    2.44(5.07)                 3.22(5.08)    7.23(11.78)
  2000   [0.5,0.99]            0.32(0.47)    0.57(0.85)                 1.03(1.56)    2.25(3.49)
         [0.7,0.99]            0.36(0.49)    0.64(0.90)                 1.05(1.47)    2.31(3.24)
         [0.9,0.99]            0.70(1.34)    1.30(2.44)                 1.34(1.75)    3.05(4.06)
Table 2: Mean square errors of the predicted conditional quantile Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widetilde{\boldsymbol{\beta}}_{n})) at the level \tau^{*}=0.991 or 0.995 with p=50 and n=\lfloor ck\log p\rfloor. The values in brackets refer to the corresponding sample standard deviations of prediction errors in squared loss.

                               \boldsymbol{X}=(1,0.1,-0.2,0,\cdots,0)^{T}   \boldsymbol{X}=(1,0,0,0,\cdots,0)^{T}
  c      [\tau_L,\tau_U]       0.991         0.995                      0.991         0.995
  True                         10.34         11.83                      15.13         18.84
  10     [0.5,0.99]            1.82(2.61)    3.23(4.82)                 6.75(9.47)    14.83(21.98)
         [0.7,0.99]            2.05(6.00)    3.78(13.10)                6.11(8.22)    13.32(18.56)
         [0.9,0.99]            2.92(7.19)    5.44(15.60)                7.24(10.36)   15.91(25.05)
  30     [0.5,0.99]            0.65(1.59)    1.15(3.12)                 1.97(3.08)    4.26(6.90)
         [0.7,0.99]            0.65(1.49)    1.18(2.94)                 1.96(2.85)    4.26(6.31)
         [0.9,0.99]            0.92(2.15)    1.66(4.08)                 2.22(2.95)    4.86(6.40)
  50     [0.5,0.99]            0.33(0.49)    0.58(0.84)                 1.17(1.77)    2.52(3.88)
         [0.7,0.99]            0.39(0.57)    0.69(1.01)                 1.28(1.93)    2.78(4.26)
         [0.9,0.99]            0.54(0.86)    0.99(1.58)                 1.55(2.39)    3.51(5.57)
Table 3: Selection results for the regularized estimation with p=50 and n=\lfloor ck\log p\rfloor. The values in brackets are the corresponding standard errors.

  [\tau_L,\tau_U]   c    size         P_AI    P_A    P_I    FP           FN
  [0.5,0.99]        10   9.04(0.99)   91.6    96     95.6   0.06(0.68)   0.47(2.34)
                    30   9.00(0.00)   100     100    100    0.00(0.00)   0.00(0.00)
                    50   9.00(0.00)   100     100    100    0.00(0.00)   0.00(0.00)
  [0.7,0.99]        10   8.91(1.25)   79      82.6   95.4   0.07(0.84)   2.02(4.52)
                    30   9.00(0.08)   99.4    99.6   99.8   0.00(0.03)   0.04(0.70)
                    50   9.00(0.00)   100     100    100    0.00(0.00)   0.00(0.00)
  [0.9,0.99]        10   8.56(0.99)   48.4    54.6   90.4   0.10(0.41)   6.51(8.43)
                    30   8.88(0.38)   87.8    88.4   99.2   0.01(0.06)   1.42(4.10)
                    50   8.96(0.23)   96.4    96.6   99.8   0.00(0.03)   0.44(2.50)
Figure 3: Plot of PEs against \tau_{L} (left panel), and boxplots of PEs from the extreme quantile regression (EQR), linear quantile regression (Linear) and QIR models at the two target levels \tau^{*}=0.991 and 0.995 (right panel).
Figure 4: Quantile curves for child's age in months (top left) and mother's education in years (top right) at the three target quantiles, together with the effects of child's sex (bottom left) and mother's unemployment status (bottom right).