

Quantile index regression

Yingying Zhang$^{a}$, Yuefeng Si$^{b}$, Guodong Li$^{b}$ and Chih-Ling Tsai$^{c}$
$^{a}$East China Normal University, $^{b}$University of Hong Kong and $^{c}$University of California at Davis
Abstract

Estimating the structure of high or low quantiles has become an important subject and has attracted increasing attention across numerous fields. However, due to data sparsity at the tails, it is usually challenging to obtain reliable estimates, especially for high-dimensional data. This paper suggests imposing a flexible parametric structure on the tails, which enables us to conduct estimation at quantile levels with rich observations and then extrapolate the fitted structure to the far tails. The proposed model depends on some quantile indices and is hence called the quantile index regression. Moreover, the composite quantile regression method is employed to obtain non-crossing quantile estimators, and this paper further establishes their theoretical properties, including asymptotic normality for the case with low-dimensional covariates and non-asymptotic error bounds for that with high-dimensional covariates. Simulation studies and an empirical example are presented to illustrate the usefulness of the new model.

Keywords: Asymptotic normality; High-dimensional analysis; Non-asymptotic property; Partially parametric model; Quantile regression.

1 Introduction

Quantile regression, proposed by Koenker and Bassett (1978), has been widely used across various fields such as biological science, ecology, economics, finance, and machine learning; see, e.g., Cade and Noon (2003), Yu et al. (2003), Meinshausen and Ridgeway (2006), Linton and Xiao (2017) and Koenker (2017). More references on quantile regression can be found in the books of Koenker (2005) and Davino et al. (2014). Quantile regression has also been studied for high-dimensional data; see, e.g., Belloni and Chernozhukov (2011), Wang et al. (2012b) and Zheng et al. (2015). On the other hand, due to practical needs, estimating the structure at high or low quantiles is becoming an increasingly popular subject, with examples including the risk of large losses on investments in finance (Kuester et al., 2006; Zheng et al., 2018), high tropical cyclone intensity and extreme waves in climatology (Elsner et al., 2008; Jagger and Elsner, 2008; Lobeto et al., 2021), and low infant birth weights in medicine (Abrevaya, 2001; Chernozhukov et al., 2020). It is hence natural to make inferences at these extreme quantiles for high-dimensional data, although this remains an open problem.

There are two types of approaches in the literature to modeling the structures at tails. The first is based on the conditional distribution function (CDF) of the response $Y$ given a set of covariates $\boldsymbol{X}$, which is usually assumed to have a semiparametric structure at the tails; see, e.g., the Pareto-type structures in Beirlant and Goegebeur (2004) and Wang and Tsai (2009). However, this method cannot provide conditional quantiles in explicit form. Later, Noufaily and Jones (2013) considered a fully parametric form, the generalized gamma distribution, for the CDF and then inverted the fitted distribution to obtain conditional quantiles. However, as indicated in Racine and Li (2017), indirect inverse-CDF-based estimators may not be efficient in tail regions when the data have unbounded support.

The second approach is extremal quantile regression, which combines quantile regression with extreme value theory to estimate the conditional quantile at a very high or low level $\tau_{n}$ satisfying $(1-\tau_{n})=O(n^{-1})$, with $n$ being the sample size; see Chernozhukov (2005). Specifically, this is a two-stage approach: (i) perform the estimation at intermediate quantiles $\tau_{n}^{*}$ with $(1-\tau_{n}^{*})^{-1}=o(n)$; and (ii) extrapolate the fitted quantile structures to extreme quantiles via the extreme value index associated with the tails of the conditional distributions; see Wang et al. (2012a) and Wang and Li (2013) for details. The key to this method is to exploit the feasible estimation at intermediate levels, since there are relatively more observations around these levels. However, intermediate quantiles are still in the far tails, and the corresponding data points may not be rich enough for the case with many covariates.

In order to handle the case with high-dimensional covariates, along the lines of extremal quantile regression, this paper suggests conducting estimation at quantile levels with much richer observations, say some fixed levels around $\tau_{0}$, and then extrapolating the estimated results to extreme quantiles by fully or partially assuming a form of the conditional distribution or quantile function on $[\tau_{0},1]$. Note that many quantile functions have explicit forms, such as those of the generalized lambda and Burr XII distributions (Gilchrist, 2000). In particular, the generalized lambda distribution can provide a very accurate approximation to some Pareto-type and extreme value distributions, as well as to commonly used distributions such as the Gaussian (Vasicek, 1976; Gilchrist, 2000). These flexible parametric forms can be assumed for the quantile function on $[\tau_{0},1]$, thereby avoiding the drawback of inverting a distribution function.

Specifically, for a predetermined interval $\mathcal{I}\subset[0,1]$, the quantile function of the response $Y$ is assumed to have an explicit form $Q(\tau,\boldsymbol{\theta})$ for each level $\tau\in\mathcal{I}$, up to unknown parameters or indices $\boldsymbol{\theta}$. By further letting $\boldsymbol{\theta}$ be a function of the covariates $\boldsymbol{X}$, we can then define the conditional quantile function as follows:

Q_{Y}(\tau|\boldsymbol{X})=\inf\{y:F_{Y}(y|\boldsymbol{X})\geq\tau\}=Q(\tau,\boldsymbol{\theta}(\boldsymbol{X})),\quad\tau\in\mathcal{I},   (1.1)

where $F_{Y}(\cdot|\boldsymbol{X})$ is the distribution of $Y$ conditional on $\boldsymbol{X}$, and $\boldsymbol{\theta}(\boldsymbol{X})$ is a $d$-dimensional parametric function. Note that $\boldsymbol{\theta}(\boldsymbol{X})$ can be regarded as $d$ indices, and model (1.1) is hence called the quantile index regression (QIR) model for simplicity. In practice, to handle high quantiles, we may take $\mathcal{I}=[\tau_{0},1]$ with a fixed value of $\tau_{0}$ and then conduct a composite quantile regression (CQR) estimation for model (1.1) at levels within $\mathcal{I}$ that have richer observations. Subsequently, the fitted QIR model can be used to predict extreme quantiles. More importantly, since the estimation is conducted at fixed quantile levels, there is no difficulty in handling the case with high-dimensional covariates. In addition, compared with the two types of approaches in the literature mentioned above, the proposed method can not only estimate quantile regression functions effectively but also forecast extreme quantiles directly. Finally, the QIR model naturally yields non-crossing quantile regression estimators since its quantile function is nondecreasing in $\tau$.

The proposed model is introduced in detail in Section 2, and our three main contributions are summarized below:

  • (a)

    When conducting the CQR estimation, we encounter the challenge of model identification; this problem is carefully studied for the flexible Tukey lambda distribution in Section 2.2.

  • (b)

    Section 2.2 also derives the asymptotic normality of the CQR estimators for the case with low-dimensional covariates. This is a challenging task since the corresponding objective function is non-convex and non-differentiable, and we overcome the difficulty by adopting the bracketing method of Pollard (1985).

  • (c)

    Section 2.3 establishes non-asymptotic properties of a regularized high-dimensional estimation. This is also nontrivial, owing to the non-convexity and non-differentiability noted in (b).

The rest of this paper is organized as follows. Section 3 discusses some implementation issues in searching for these estimators. Numerical studies, including simulation experiments and a real data analysis, are given in Sections 4 and 5, and Section 6 provides a short conclusion and discussion. All technical details are relegated to the Appendix.

For the sake of convenience, this paper denotes vectors and matrices by boldface letters, e.g., $\mathbf{X}$ and $\mathbf{Y}$, and scalars by regular letters, e.g., $X$ and $Y$. In addition, for any two real-valued sequences $\{a_{n}\}$ and $\{b_{n}\}$, we denote $a_{n}\gtrsim b_{n}$ (or $a_{n}\lesssim b_{n}$) if there exists a constant $c>0$ such that $a_{n}\geq cb_{n}$ (or $a_{n}\leq cb_{n}$) for all $n$, and $a_{n}\asymp b_{n}$ if $a_{n}\gtrsim b_{n}$ and $a_{n}\lesssim b_{n}$. For a generic vector $\mathbf{X}$ and matrix $\mathbf{Y}$, let $\|\mathbf{X}\|$, $\|\mathbf{X}\|_{1}$ and $\|\mathbf{Y}\|_{\mathrm{F}}$ represent the Euclidean norm, $\ell_{1}$-norm and Frobenius norm, respectively.

2 Quantile index regression

2.1 Quantile index regression model

Consider a response $Y$ and a $p$-dimensional vector of covariates $\boldsymbol{X}=(X_{1},\ldots,X_{p})^{\prime}$. We then rewrite the quantile function of $Y$ conditional on $\boldsymbol{X}$ at (1.1) with an explicit form $\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta})$ below:

Q_{Y}(\tau|\boldsymbol{X})=Q(\tau,\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta})),\quad\tau\in\mathcal{I},   (2.1)

where $\mathcal{I}\subset[0,1]$ is an interval or the union of multiple disjoint intervals, $\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta})=(\theta_{1}(\boldsymbol{X},\boldsymbol{\beta}),\ldots,\theta_{d}(\boldsymbol{X},\boldsymbol{\beta}))^{\prime}$, $\boldsymbol{\beta}=(\boldsymbol{\beta}_{1}^{\prime},\ldots,\boldsymbol{\beta}_{d}^{\prime})^{\prime}$, $\boldsymbol{\beta}_{j}=(\beta_{j1},\ldots,\beta_{jp})^{\prime}$, $\theta_{j}(\boldsymbol{X},\boldsymbol{\beta})=g_{j}(\boldsymbol{X}^{\prime}\boldsymbol{\beta}_{j})$, the link functions $g_{j}^{-1}(\cdot)$ are monotonic for $1\leq j\leq d$, and an intercept term can be included by letting $X_{1}=1$. We call model (2.1) the quantile index regression (QIR) model for simplicity, and two examples of $Q(\cdot,\cdot)$ are first introduced below to illustrate the new model.

Example 1.

Consider the location shift model, $Q(\tau,\theta)=\theta+Q_{\Phi}(\tau)$ for all $\tau\in[\tau_{0},1]$, where $\tau_{0}\in(0,1)$ is a fixed level, $\theta\in\mathbb{R}$ is the location index and $Q_{\Phi}(\tau)$ is the quantile function of the standard normal distribution. Under the identity link, $\theta(\boldsymbol{X},\boldsymbol{\beta})=\boldsymbol{X}^{\prime}\boldsymbol{\beta}$, we obtain a linear quantile regression model at $\tau_{0}$. Then, after estimation, we can make a prediction at any level $\tau\in(\tau_{0},1)$.
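As a quick illustration of this extrapolation idea, here is a minimal sketch in Python; the function name and the coefficient values are hypothetical, and only the model structure follows Example 1.

```python
# A minimal sketch of Example 1: fit at tau0, then extrapolate to any tau in
# (tau0, 1) by shifting the standard normal quantile function.
import numpy as np
from scipy.stats import norm

def location_shift_quantile(tau, x, beta_hat):
    """Q_Y(tau | X = x) = x'beta + Phi^{-1}(tau) under the location shift model."""
    return x @ beta_hat + norm.ppf(tau)

beta_hat = np.array([1.0, 0.5])   # hypothetical coefficients fitted at level tau0
x = np.array([1.0, 0.2])          # X_1 = 1 carries the intercept
print(location_shift_quantile(0.995, x, beta_hat))   # extrapolated far-tail quantile
```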

Example 2.

Consider the Tukey lambda distribution (Vasicek, 1976), which is defined by its quantile function

Q(\tau,\boldsymbol{\theta})=\theta_{1}+\theta_{2}\frac{\tau^{\theta_{3}}-(1-\tau)^{\theta_{3}}}{\theta_{3}},

where $\boldsymbol{\theta}=(\theta_{1},\theta_{2},\theta_{3})^{\prime}$, and $\theta_{1}\in\mathbb{R}$, $\theta_{2}>0$ and $\theta_{3}\leq 1$ are the location, scale and tail indices, respectively. Due to its flexibility, the Tukey lambda distribution can provide an accurate approximation to many commonly used distributions, such as the normal, logistic, Weibull, uniform and Cauchy distributions; see Gilchrist (2000). Model (2.1) is then expected to perform better when combined with this distribution.
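Since the Tukey lambda quantile function is central to what follows, a minimal sketch is given below; the demo values follow the discussion in Section 4.1, where a tail index near 0.14 gives normal-like tail behavior (up to scale) and smaller values give heavier tails.

```python
# A minimal sketch of the Tukey lambda quantile function (theta3 != 0 required).
import numpy as np

def tukey_lambda_quantile(tau, theta1, theta2, theta3):
    """Q(tau, theta) = theta1 + theta2 * {tau^theta3 - (1 - tau)^theta3} / theta3."""
    tau = np.asarray(tau, dtype=float)
    return theta1 + theta2 * (tau**theta3 - (1.0 - tau)**theta3) / theta3

taus = np.array([0.95, 0.99, 0.995])
print(tukey_lambda_quantile(taus, 0.0, 1.0, 0.14))   # normal-like tail decay
print(tukey_lambda_quantile(taus, 0.0, 1.0, -0.5))   # much heavier tails
```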

In the literature, many quantile functions have explicit forms, such as those of the generalized lambda and Burr XII distributions (Gilchrist, 2000; Fournier et al., 2007). For example, the generalized lambda distribution has the form

Q(\tau,\boldsymbol{\theta})=\theta_{1}+\theta_{2}\left(\frac{\tau^{\theta_{3}}-1}{\theta_{3}}-\frac{(1-\tau)^{\theta_{4}}-1}{\theta_{4}}\right),

where the indices $\theta_{3}$ and $\theta_{4}$ control the right and left tails, respectively. Note that it reduces to the Tukey lambda distribution when $\theta_{3}=\theta_{4}$. This indicates that the generalized lambda distribution can be considered if we focus on quantiles over the full range, i.e. $\mathcal{I}=[0,1]$. On the other hand, the Tukey lambda distribution may be a better choice if our interest is in quantiles on one side only, i.e. $\mathcal{I}\subset[0,0.5)$ or $\mathcal{I}\subset(0.5,1]$.
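A companion sketch of the generalized lambda quantile function; when $\theta_{3}=\theta_{4}$ the two $-1/\theta$ terms cancel and the Tukey lambda form is recovered exactly. The function name is ours.

```python
# A sketch of the generalized lambda quantile function with separate tail indices.
import numpy as np

def generalized_lambda_quantile(tau, theta1, theta2, theta3, theta4):
    tau = np.asarray(tau, dtype=float)
    right = (tau**theta3 - 1.0) / theta3            # theta3 controls the right tail
    left = ((1.0 - tau)**theta4 - 1.0) / theta4     # theta4 controls the left tail
    return theta1 + theta2 * (right - left)

# With theta3 == theta4 this coincides with the Tukey lambda quantile function.
print(generalized_lambda_quantile(0.99, 0.0, 1.0, 0.2, 0.2))
```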

2.2 Low-dimensional composite quantile regression estimation

Denote the observed data by $\{(Y_{i},\boldsymbol{X}_{i}^{\prime})^{\prime},i=1,\ldots,n\}$, which are independent and identically distributed (i.i.d.) samples of the random vector $(Y,\boldsymbol{X})$, where $Y_{i}$ is the response, $\boldsymbol{X}_{i}=(X_{i1},\ldots,X_{ip})^{\prime}$ contains the $p$ covariates, and $n$ is the number of observations.

Let $\tau_{1}\leq\tau_{2}\leq\ldots\leq\tau_{K}$ be $K$ fixed quantile levels with $\tau_{k}\in\mathcal{I}$ for all $1\leq k\leq K$. To achieve higher efficiency, we consider the composite quantile regression (CQR) estimator below:

\widehat{\boldsymbol{\beta}}_{n}=\arg\min_{\boldsymbol{\beta}}L_{n}(\boldsymbol{\beta})\quad\text{and}\quad L_{n}(\boldsymbol{\beta})=\sum_{k=1}^{K}\sum_{i=1}^{n}\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X}_{i},\boldsymbol{\beta}))\},   (2.2)

where $\rho_{\tau}(x)=x\{\tau-I(x<0)\}$ is the quantile check function; see Zou and Yuan (2008) and Kai et al. (2010). Note that $Q(\tau,\boldsymbol{\theta}(\mathbf{X},\widehat{\boldsymbol{\beta}}_{n}))$ has the non-crossing property with respect to $\tau$ since, for each $\boldsymbol{\theta}$, $Q(\tau,\boldsymbol{\theta})$ is a nondecreasing quantile function.
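The objective at (2.2) is straightforward to code. Below is a minimal sketch, where `quantile_fn` and `theta_fn` are hypothetical user-supplied stand-ins for $Q(\tau,\boldsymbol{\theta})$ and $\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta})$.

```python
# A minimal sketch of the CQR objective L_n(beta) at (2.2).
import numpy as np

def check_loss(u, tau):
    """Quantile check function rho_tau(u) = u * (tau - I(u < 0))."""
    u = np.asarray(u, dtype=float)
    return u * (tau - (u < 0))

def cqr_loss(beta, Y, X, taus, quantile_fn, theta_fn):
    """Composite check loss summed over the K levels in `taus`."""
    theta = theta_fn(X, beta)    # n x d matrix of indices theta(X_i, beta)
    return sum(check_loss(Y - quantile_fn(tau, theta), tau).sum() for tau in taus)
```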

To study the asymptotic properties of $\widehat{\boldsymbol{\beta}}_{n}$, we first consider $\boldsymbol{\beta}_{0}=(\boldsymbol{\beta}_{01}^{\prime},\ldots,\boldsymbol{\beta}_{0d}^{\prime})^{\prime}=\arg\min_{\boldsymbol{\beta}}\bar{L}(\boldsymbol{\beta})$, where $\bar{L}(\boldsymbol{\beta})=E[\sum_{k=1}^{K}\rho_{\tau_{k}}\{Y-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))\}]$ is the population loss function. This minimizer may not be unique, i.e. the CQR estimation at (2.2) may suffer from an identification problem. For the sake of illustration, consider the case without covariates $\mathbf{X}$, i.e. we estimate $\boldsymbol{\theta}$ with a sequence of quantile levels $\tau_{k}\in\mathcal{I}$, $1\leq k\leq K$. Identification then requires that two different values of $\boldsymbol{\theta}$ cannot yield the same quantile function $Q(\tau_{k},\boldsymbol{\theta})$ across all $K$ levels. In other words, if there exists $\boldsymbol{\theta}\neq\boldsymbol{\theta}^{*}$ such that $Q(\tau_{k},\boldsymbol{\theta})=Q(\tau_{k},\boldsymbol{\theta}^{*})$ for all $K$ levels, then $\boldsymbol{\theta}$ and $\boldsymbol{\theta}^{*}$ are not identifiable. In sum, to guarantee that $\boldsymbol{\beta}_{0}$ is the unique minimizer of the population loss, we make the following assumption on the quantile function $Q(\tau,\boldsymbol{\theta})$.

Assumption 1.

For any two index vectors $\boldsymbol{\theta}_{1}\neq\boldsymbol{\theta}_{2}$, there exists at least one $k$ with $1\leq k\leq K$ such that $Q(\tau_{k},\boldsymbol{\theta}_{1})\neq Q(\tau_{k},\boldsymbol{\theta}_{2})$.

Intuitively, for any quantile function $Q(\tau,\boldsymbol{\theta})$, one can always increase the number of quantile levels $K$ to make Assumption 1 hold. However, the required $K$ may also depend on the structure of the quantile function and the number of indices. For the sake of illustration, we state conditions under which the Tukey lambda distribution satisfies Assumption 1.

Lemma 1.

For the Tukey lambda distribution defined in Example 2, we have that (i) for $\tau_{k}\in(0,1)$ with $1\leq k\leq K$, Assumption 1 holds if $K\geq 4$; and (ii) for $\tau_{k}\in(0.5,1)$ (or $\tau_{k}\in(0,0.5)$) with $1\leq k\leq K$, Assumption 1 holds if and only if $K\geq 3$.

Assumption 1, together with an additional assumption on the covariates $\mathbf{X}$, allows us to show that $\boldsymbol{\beta}_{0}$ is the unique minimizer of $\bar{L}(\boldsymbol{\beta})$; see the following theorem, which is critical for establishing the asymptotic properties of $\widehat{\boldsymbol{\beta}}_{n}$.

Theorem 1.

Suppose that $E(\boldsymbol{X}\boldsymbol{X}^{\prime})$ is finite and positive definite. If Assumption 1 holds, then $\boldsymbol{\beta}_{0}$ is the unique minimizer of $\bar{L}(\boldsymbol{\beta})$.

To demonstrate the consistency of $\widehat{\boldsymbol{\beta}}_{n}$ below, we assume that the parameter space $\Theta\subset\mathbb{R}^{dp}$ is compact and that the true parameter vector $\boldsymbol{\beta}_{0}$ is an interior point of $\Theta$.

Theorem 2.

Suppose that $E\{\max_{1\leq k\leq K}\sup_{\boldsymbol{\beta}\in\Theta}\|\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))/\partial\boldsymbol{\beta}\|\}<\infty$. If the conditions of Theorem 1 hold, then $\widehat{\boldsymbol{\beta}}_{n}\rightarrow\boldsymbol{\beta}_{0}$ in probability as $n\rightarrow\infty$.

The moment condition assumed in the above theorem allows us to adopt the uniform consistency theorem of Andrews (1987) in our technical proofs. To derive the asymptotic distribution of $\widehat{\boldsymbol{\beta}}_{n}$, we introduce two additional assumptions.

Assumption 2.

For all $1\leq k\leq K$,

E\left\|\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))}{\partial\boldsymbol{\beta}}\right\|^{3}<\infty\quad\text{and}\quad E\sup_{\boldsymbol{\beta}\in\Theta}\left\|\frac{\partial^{2}Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))}{\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^{\prime}}\right\|_{\mathrm{F}}^{2}<\infty.
Assumption 3.

The conditional density $f_{Y}(y|\boldsymbol{X})$ is bounded and continuous uniformly over $\boldsymbol{X}$.

Assumption 2 is used to prove Lemma A.2 in the Appendix; see also Zhu and Ling (2011). Assumption 3 is commonly used in the quantile regression literature (Koenker, 2005; Belloni et al., 2019a), and it can be relaxed at the cost of more complicated and lengthy technical details (Kato et al., 2012; Chernozhukov et al., 2015; Galvao and Kato, 2016).

Denote

\Omega_{0}=\sum_{k^{\prime}=1}^{K}\sum_{k=1}^{K}\min\{\tau_{k},\tau_{k^{\prime}}\}\left(1-\max\{\tau_{k},\tau_{k^{\prime}}\}\right)E\left[\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))}{\partial\boldsymbol{\beta}}\frac{\partial Q(\tau_{k^{\prime}},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))}{\partial\boldsymbol{\beta}^{\prime}}\right]

and

\Omega_{1}=\sum_{k=1}^{K}E\left[f_{Y}\{Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))|\mathbf{X}\}\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))}{\partial\boldsymbol{\beta}}\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))}{\partial\boldsymbol{\beta}^{\prime}}\right].
Theorem 3.

Suppose that Assumptions 2 and 3 hold and that $\Omega_{1}$ is positive definite. If the conditions of Theorem 2 are satisfied, then $\sqrt{n}(\widehat{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0})\rightarrow N(\boldsymbol{0},\Omega_{1}^{-1}\Omega_{0}\Omega_{1}^{-1})$ in distribution as $n\rightarrow\infty$.

Note that the objective function $L_{n}(\boldsymbol{\beta})$ is non-convex and non-differentiable, which makes it challenging to establish the asymptotic normality of $\widehat{\boldsymbol{\beta}}_{n}$. We overcome the difficulty by making use of the bracketing method of Pollard (1985). Moreover, to estimate the asymptotic variance in Theorem 3, we first apply the nonparametric method of Hendricks and Koenker (1991) to estimate the quantities $f_{Y}\{Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))|\mathbf{X}\}$ for $1\leq k\leq K$, and the matrices $\Omega_{0}$ and $\Omega_{1}$ can then be approximated by sample averages.
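As a rough illustration of the density estimation step, the Hendricks and Koenker (1991) idea can be sketched as a difference quotient of the fitted quantile curve; the bandwidth `h` and the function names are our own choices, not the paper's implementation.

```python
# A sketch of the Hendricks-Koenker style density estimate: the conditional
# density at the tau-th quantile is approximated by a difference quotient of
# the fitted quantile curve. The bandwidth h is a user choice.
def density_at_quantile(tau, x, beta_hat, cond_quantile, h=0.05):
    """f_Y(Q(tau|x) | x) ~ 2h / {Q(tau + h | x) - Q(tau - h | x)}."""
    denom = cond_quantile(tau + h, x, beta_hat) - cond_quantile(tau - h, x, beta_hat)
    return 2.0 * h / denom
```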

In addition, based on the estimator $\widehat{\boldsymbol{\beta}}_{n}$, one can use $Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widehat{\boldsymbol{\beta}}_{n}))$ to predict the conditional quantile at any level $\tau^{*}\in\mathcal{I}$, and the corresponding theoretical justification can be established by directly applying the delta method (van der Vaart, 1998, Chapter 3).

Corollary 1.

Suppose that the conditions of Theorem 3 are satisfied. Then, for any $\tau^{*}\in\mathcal{I}$,

\sqrt{n}\{Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widehat{\boldsymbol{\beta}}_{n}))-Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))\}\rightarrow N(0,\boldsymbol{\delta}^{\prime}\Omega_{1}^{-1}\Omega_{0}\Omega_{1}^{-1}\boldsymbol{\delta})

in distribution as $n\rightarrow\infty$, where $\boldsymbol{\delta}=E[\partial Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))/\partial\boldsymbol{\beta}]\in\mathbb{R}^{dp}$.

2.3 High-dimensional regularized estimation

This subsection considers the case with high-dimensional covariates, i.e., $p\gg n$, where the true parameter vector $\boldsymbol{\beta}_{0}$ is assumed to be $s$-sparse, i.e. the number of nonzero elements in $\boldsymbol{\beta}_{0}$ is at most $s>0$. A regularized CQR estimation can then be introduced:

\widetilde{\boldsymbol{\beta}}_{n}=\arg\min_{\boldsymbol{\beta}\in\Theta}n^{-1}L_{n}(\boldsymbol{\beta})+\sum_{j=1}^{d}p_{\lambda}(\boldsymbol{\beta}_{j}),   (2.3)

where $\Theta$ is given in Theorem 4, and $p_{\lambda}$ is a penalty function depending on a tuning (regularization) parameter $\lambda\in\mathbb{R}^{+}$ with $\mathbb{R}^{+}=(0,\infty)$.

Consider the loss function $L_{n}(\boldsymbol{\beta})=\sum_{k=1}^{K}\sum_{i=1}^{n}\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X}_{i},\boldsymbol{\beta}))\}$ defined at (2.2); $Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X}_{i},\boldsymbol{\beta}))$ is usually nonconvex with respect to $\boldsymbol{\beta}$. As a result, $L_{n}(\boldsymbol{\beta})$ will be nonconvex even though the check loss $\rho_{\tau}(\cdot)$ is convex, so there is no additional harm in using nonconvex penalty functions. Specifically, we consider the component-wise penalization

\sum_{j=1}^{d}p_{\lambda}(\boldsymbol{\beta}_{j})=\sum_{j=1}^{d}\sum_{l=1}^{p}p_{\lambda}(\beta_{jl}),

where $p_{\lambda}(\cdot)$ is possibly nonconvex and satisfies the following assumption.

Assumption 4.

The univariate function $p_{\lambda}(\cdot)$ satisfies the following conditions: (i) it is symmetric around zero with $p_{\lambda}(0)=0$; (ii) it is nondecreasing on the nonnegative real line; (iii) the function $p_{\lambda}(t)/t$ is nonincreasing with respect to $t\in\mathbb{R}^{+}$; (iv) it is differentiable for all $t\neq 0$ and subdifferentiable at $t=0$, with $\lim_{t\rightarrow 0^{+}}p_{\lambda}^{\prime}(t)=\lambda L$ and $L$ being a constant; (v) there exists $\mu>0$ such that $p_{\lambda,\mu}(t)=p_{\lambda}(t)+\frac{\mu}{2}t^{2}$ is convex.

The above is the $\mu$-amenability assumption given in Loh and Wainwright (2015) and Loh (2017), and it requires the penalty function to be not too far from convex. Note that popular penalty functions, including the SCAD (Fan and Li, 2001) and MCP (Zhang, 2010), satisfy the above properties.
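For concreteness, a minimal sketch of the SCAD penalty is given below; the closed form follows Fan and Li (2001), with the shape parameter `a` (commonly set to 3.7) playing the role of $\alpha$ in Section 3.1.

```python
# A minimal sketch of the SCAD penalty of Fan and Li (2001).
import numpy as np

def scad_penalty(t, lam, a=3.7):
    t = np.abs(np.asarray(t, dtype=float))
    small = lam * t                                            # |t| <= lam
    mid = (2 * a * lam * t - t**2 - lam**2) / (2 * (a - 1))    # lam < |t| <= a*lam
    large = lam**2 * (a + 1) / 2                               # |t| > a*lam
    return np.where(t <= lam, small, np.where(t <= a * lam, mid, large))
```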

In the literature on nonconvex penalized quantile regression, Jiang et al. (2012) studied nonlinear quantile regression with the SCAD regularizer from an asymptotic viewpoint, but it can only handle the case with $p=o(n^{1/3})$. Wang et al. (2012b) and Sherwood et al. (2016) considered the case where $p$ grows exponentially with $n$, but their proof techniques depend heavily on the condition that the loss function can be represented as a difference of two convex functions. However, $L_{n}(\boldsymbol{\beta})$ does not meet this requirement since the quantile function $Q(\tau,\boldsymbol{\theta})$ can be nonconvex.

On the other hand, non-asymptotic properties have recently attracted considerable attention in the theory of high-dimensional analysis; see, e.g., Belloni and Chernozhukov (2011), Sivakumar and Banerjee (2017) and Pan and Zhou (2021). This subsection attempts to study such properties for our proposed quantile estimators, which is a nontrivial task since existing results focus only on linear quantile regression. Loh and Wainwright (2015) and Loh (2017) studied the non-asymptotic properties of M-estimators with both nonconvex losses and regularizers, but they required the loss function to be twice differentiable. The technical proofs in the Appendix follow the framework of Loh and Wainwright (2015) and Loh (2017), and some new techniques are developed to tackle the nondifferentiability of the quantile check function.

Let $\boldsymbol{\theta}(\boldsymbol{\gamma})=(g_{1}(\gamma_{1}),\ldots,g_{d}(\gamma_{d}))^{\prime}$ with the $g_{j}^{-1}(\cdot)$'s being link functions and $\boldsymbol{\gamma}=(\gamma_{1},\ldots,\gamma_{d})^{\prime}$, and we can then denote $Q(\tau,\boldsymbol{\gamma}):=Q(\tau,\boldsymbol{\theta}(\boldsymbol{\gamma}))$. Moreover, by letting $\gamma_{j}(\mathbf{X},\boldsymbol{\beta})=\mathbf{X}^{\prime}\boldsymbol{\beta}_{j}$ for $1\leq j\leq d$ and $\boldsymbol{\gamma}(\mathbf{X},\boldsymbol{\beta})=(\gamma_{1}(\mathbf{X},\boldsymbol{\beta}),\ldots,\gamma_{d}(\mathbf{X},\boldsymbol{\beta}))^{\prime}$, we can further denote $Q(\tau,\boldsymbol{\gamma}(\mathbf{X},\boldsymbol{\beta})):=Q(\tau,\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))$.

Assumption 5.

The quantile function $Q(\tau,\boldsymbol{\gamma})$ is differentiable with respect to $\boldsymbol{\gamma}$, and there exist two positive constants $L_{Q}$ and $C_{X}$ such that $\max_{1\leq k\leq K}\|\partial Q(\tau_{k},\boldsymbol{\gamma})/\partial\boldsymbol{\gamma}\|\leq L_{Q}$ and $\|\mathbf{X}\|_{\infty}\leq C_{X}$.

The differentiability assumption on the quantile function allows us to use the Lipschitz property and the multivariate contraction theorem. The boundedness of the covariates ensures that the bounded difference inequality can be applied, and it can be relaxed at the cost of more complicated and lengthy technical details (Wang and He, 2021).

Denote by $\mathcal{B}_{R}(\boldsymbol{\beta}_{0})=\{\boldsymbol{\beta}\in\mathbb{R}^{dp}:\|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\|\leq R\}$ the Euclidean ball centered at $\boldsymbol{\beta}_{0}$ with radius $R>0$, and let $\lambda_{\min}(\boldsymbol{\beta})$ be the smallest eigenvalue of the matrix

\Omega_{2}(\boldsymbol{\beta})=\sum_{k=1}^{K}E\left[\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))}{\partial\boldsymbol{\beta}}\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))}{\partial\boldsymbol{\beta}^{\prime}}\right].
Assumption 6.

There exists a fixed $R>0$ such that $\lambda_{\min}^{0}=\inf_{\boldsymbol{\beta}\in\mathcal{B}_{R}(\boldsymbol{\beta}_{0})}\lambda_{\min}(\boldsymbol{\beta})>0$ and $f_{\min}=\min_{1\leq k\leq K}\inf_{\boldsymbol{\beta}\in\mathcal{B}_{R}(\boldsymbol{\beta}_{0})}f_{Y}\{Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))|\mathbf{X}\}>0$.

The above assumption guarantees that the population loss $\bar{L}(\boldsymbol{\beta})=E[n^{-1}L_{n}(\boldsymbol{\beta})]$ is strongly convex around the true parameter vector $\boldsymbol{\beta}_{0}$. Specifically, let $\bar{\mathcal{E}}(\Delta)=\bar{L}(\boldsymbol{\beta}_{0}+\Delta)-\bar{L}(\boldsymbol{\beta}_{0})-\Delta^{\prime}\partial\bar{L}(\boldsymbol{\beta}_{0})/\partial\boldsymbol{\beta}$ be the first-order Taylor remainder. Then, by Assumption 6, we have $\bar{\mathcal{E}}(\Delta)\geq 0.5f_{\min}\lambda_{\min}^{0}\|\Delta\|^{2}$ for all $\Delta$ with $\|\Delta\|\leq R$; see Lemma A.5 in the Appendix for details. We next obtain the non-asymptotic estimation bounds for $\widetilde{\boldsymbol{\beta}}_{n}$.

Theorem 4.

Suppose that Assumptions 1 and 4-6 hold, $2f_{\min}\lambda_{\min}^{0}>\mu$, $n\gtrsim\log p$ and $\lambda\gtrsim\sqrt{\log p/n}$. Then the minimizer $\widetilde{\boldsymbol{\beta}}_{n}$ of (2.3) with $\Theta=\mathcal{B}_{R}(\boldsymbol{\beta}_{0})$ satisfies the error bounds

\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|\leq\frac{6L\sqrt{s}\lambda}{4\alpha-\mu}\quad\text{and}\quad\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|_{1}\leq\frac{24Ls\lambda}{4\alpha-\mu},

with probability at least $1-c_{1}p^{-c_{2}}-K\max\{\log p,\log n\}p^{-c^{2}}$ for any $c>1$, where $\alpha=0.5f_{\min}\lambda_{\min}^{0}$, $\mu$ and $L$ are defined in Assumption 4, $s$ is the number of nonzero elements in $\boldsymbol{\beta}_{0}$, and the constants $c_{1}$ and $c_{2}>0$ are given in Lemma A.4 of the Appendix.

In practice, we can choose $\lambda\asymp\sqrt{\log p/n}$, and it then holds that $\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|\lesssim\sqrt{s\log p/n}$, which is the standard rate of error bounds; see, e.g., Loh (2017). Moreover, for the predicted conditional quantile $Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widetilde{\boldsymbol{\beta}}_{n}))$ at any level $\tau^{*}\in\mathcal{I}$, it can be readily verified that $|Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widetilde{\boldsymbol{\beta}}_{n}))-Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))|$ has the same convergence rate as $\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|$. Finally, the above theorem requires the minimization at (2.3) to be conducted over $\Theta=\mathcal{B}_{R}(\boldsymbol{\beta}_{0})$, which is unknown but fixed; this can be handled in practice by random initialization in the optimizing algorithms.

3 Implementation issues

3.1 Optimizing algorithms

This subsection provides algorithms to search for the CQR estimator at (2.2) and the regularized estimator at (2.3).

For the CQR estimation without penalty at (2.2), we employ the commonly used gradient descent algorithm to search for the estimators, and the $(r+1)$th update is given by

\boldsymbol{\beta}^{(r+1)}=\boldsymbol{\beta}^{(r)}-\eta^{(r)}\,\nabla L_{n}(\boldsymbol{\beta}^{(r)}),

where $\boldsymbol{\beta}^{(r)}$ is the value from the $r$th iteration and $\eta^{(r)}$ is the step size. Note that the quantile check loss is nondifferentiable at zero, so $\nabla L_{n}(\boldsymbol{\beta}^{(r)})$ above refers to a subgradient (Moon et al., 2021). In practice, too small a step size causes the algorithm to converge slowly, while too large a step size may cause it to diverge. We choose the step size by the backtracking line search (BLS) method, which is simple and effective; see Bertsekas (2016). Specifically, the algorithm starts with a large step size and, at the $(r+1)$th update, keeps multiplying it by a fraction $b$ until $L_{n}(\boldsymbol{\beta}^{(r+1)})-L_{n}(\boldsymbol{\beta}^{(r)})<-a\eta^{(r)}\|\nabla L_{n}(\boldsymbol{\beta}^{(r)})\|_{2}^{2}$, where $a$ is another hyper-parameter. The simulation experiments in Section 4 work well with the setting $(a,b)=(0.3,0.5)$.
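A minimal sketch of this scheme is given below, assuming user-supplied `loss` and `grad` (a subgradient of the composite check loss); the stopping rule and the iteration cap are our own choices.

```python
# Gradient descent with backtracking line search; (a, b) = (0.3, 0.5) follows
# the setting used in Section 4.
import numpy as np

def gradient_descent_bls(beta0, loss, grad, a=0.3, b=0.5,
                         eta0=1.0, max_iter=500, tol=1e-8):
    beta = np.asarray(beta0, dtype=float)
    for _ in range(max_iter):
        g = grad(beta)
        eta = eta0
        # shrink the step until the sufficient-decrease condition holds
        while loss(beta - eta * g) - loss(beta) >= -a * eta * np.dot(g, g):
            eta *= b
            if eta < 1e-12:          # no acceptable step; treat as converged
                return beta
        beta_new = beta - eta * g
        if np.linalg.norm(beta_new - beta) < tol:
            return beta_new
        beta = beta_new
    return beta
```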

For the regularized estimation at (2.3), we adopt the composite gradient descent algorithm (Loh and Wainwright, 2015), which is designed for nonconvex problems and fits our objective function well. Consider the SCAD penalty, which satisfies Assumption 4 with $L=1$ and $\mu=1/(\alpha-1)$, where $\alpha$ is the SCAD shape parameter. We can then rewrite the optimization problem at (2.3) as

\widetilde{\boldsymbol{\beta}}_{n}=\arg\min_{\boldsymbol{\beta}\in\Theta}\underbrace{\{n^{-1}L_{n}(\boldsymbol{\beta})-\mu\|\boldsymbol{\beta}\|_{2}^{2}/2\}}_{\widetilde{L}_{n}(\boldsymbol{\beta})}+\lambda g(\boldsymbol{\beta}),

where, from Assumption 4, $g(\boldsymbol{\beta})=\{\sum_{j=1}^{d}p_{\lambda}(\boldsymbol{\beta}_{j})+\mu\|\boldsymbol{\beta}\|_{2}^{2}/2\}/\lambda$ is convex. As a result, similar to the composite gradient descent algorithm in Loh and Wainwright (2015), the $(r+1)$th update can be calculated by

\boldsymbol{\beta}^{(r+1)}=\arg\min\left\{\|\boldsymbol{\beta}-(\boldsymbol{\beta}^{(r)}-\eta\nabla\widetilde{L}_{n}(\boldsymbol{\beta}^{(r)}))\|_{2}^{2}/2+\lambda\eta g(\boldsymbol{\beta})\right\},

which has a closed-form solution of

\boldsymbol{\beta}^{(r+1)}=\begin{cases}0, & 0\leq|z|\leq\nu\lambda,\\ z-\mathrm{sign}(z)\,\nu\lambda, & \nu\lambda<|z|\leq(\nu+1)\lambda,\\ \left\{z-\mathrm{sign}(z)\,\frac{\alpha\nu\lambda}{\alpha-1}\right\}\Big/\left\{1-\frac{\nu}{\alpha-1}\right\}, & (\nu+1)\lambda<|z|\leq\alpha\lambda,\\ z, & |z|>\alpha\lambda,\end{cases}

with $z=(\boldsymbol{\beta}^{(r)}-\eta\nabla\widetilde{L}_{n}(\boldsymbol{\beta}^{(r)}))/(1+\mu\eta)$ and $\nu=\eta/(1+\mu\eta)$, where the step size $\eta$ is chosen by the BLS method.
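The closed-form update above can be coded element-wise as follows; here `a` stands for the SCAD parameter written as $\alpha$ in the display, and `grad_tilde` is $\nabla\widetilde{L}_{n}(\boldsymbol{\beta}^{(r)})$.

```python
# An element-wise sketch of the composite gradient (proximal) update above.
import numpy as np

def scad_prox_update(beta, grad_tilde, eta, lam, mu, a=3.7):
    z = (beta - eta * grad_tilde) / (1.0 + mu * eta)
    nu = eta / (1.0 + mu * eta)
    absz, s = np.abs(z), np.sign(z)
    out = np.where(absz <= nu * lam, 0.0, z - s * nu * lam)      # first two pieces
    mid = (z - s * a * nu * lam / (a - 1)) / (1.0 - nu / (a - 1))
    out = np.where(absz > (nu + 1) * lam, mid, out)              # third piece
    out = np.where(absz >= a * lam, z, out)                      # fourth piece
    return out
```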

3.2 Hyper-parameter selection

There are two types of hyper-parameters in the penalized estimation at (2.3): the tuning parameter $\lambda$ and the quantile levels $\tau_{k}$ with $1\leq k\leq K$. We can employ validation methods to select the tuning parameter $\lambda$ such that the composite quantile check loss is minimized.

The selection of the $\tau_{k}$'s is another important task since it affects the efficiency of the resulting estimators. Suppose that we are interested in some high quantiles $\tau_{m}^{*}$ with $1\leq m\leq M$; the QIR model can then be assumed on the interval $\mathcal{I}=[\tau_{0},1]$, which contains all the $\tau_{m}^{*}$'s. We may further choose a suitable interval $[\tau_{L},\tau_{U}]\subset\mathcal{I}$ over which the $\tau_{k}$'s are equally spaced, i.e. $\tau_{k}=\tau_{L}+k(\tau_{U}-\tau_{L})/(K+1)$ for $1\leq k\leq K$, where we may set $\tau_{0}=\tau_{L}$. As a result, the selection of the $\tau_{k}$'s is equivalent to that of $[\tau_{L},\tau_{U}]$.

We may choose $\tau_{U}$ to be close to the $\tau_{m}^{*}$'s while a reliable estimation can still be afforded at this level. The selection of $\tau_{L}$ is a trade-off between estimation efficiency and model misspecification; see Wang et al. (2012a) and Wang and Tsai (2009). On the one hand, to improve estimation efficiency, we may choose $\tau_{L}$ close to 0.5 since, for most real data, observations are richest in the middle. On the other hand, the parametric structure must then be assumed over the whole interval $[\tau_{L},1]$, i.e. more restrictions are imposed on the real example. The criterion of prediction errors (PEs) is hence introduced,

PE=\frac{1}{M}\sum_{m=1}^{M}\frac{\sqrt{n}}{\sqrt{\tau_{m}^{*}(1-\tau_{m}^{*})}}\left|\frac{1}{n}\sum_{i=1}^{n}I\{y_{i}<\widehat{Q}_{Y}(\tau_{m}^{*}|\mathbf{X}_{i})\}-\tau_{m}^{*}\right|,

and we choose the $\tau_{L}$ with the minimum value of PE; see also Wang et al. (2012a).
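The PE criterion transcribes directly into code; in the sketch below, `preds[m]` is assumed to hold the predicted $\tau_{m}^{*}$-quantiles for all $n$ observations, and the names are ours.

```python
# A direct transcription of the PE criterion above.
import numpy as np

def prediction_error(y, preds, tau_stars):
    y = np.asarray(y, dtype=float)
    n = len(y)
    pe = 0.0
    for tau, pred in zip(tau_stars, preds):
        coverage = np.mean(y < pred)           # empirical coverage at level tau
        pe += np.sqrt(n) * abs(coverage - tau) / np.sqrt(tau * (1.0 - tau))
    return pe / len(tau_stars)
```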

In practice, cross-validation can be used to select $\lambda$ and $\tau_{L}$ simultaneously. Specifically, the composite quantile check loss and the PEs are both evaluated on validation sets. For each candidate interval $[\tau_{L},\tau_{U}]$, the tuning parameter $\lambda$ is selected according to the composite quantile check loss, and the corresponding value of PE is recorded. We then choose the value of $\tau_{L}$ corresponding to the minimum value of PE among all candidate intervals.

4 Simulation Studies

4.1 Composite quantile regression estimation

This subsection conducts simulation experiments to evaluate the finite-sample performance of the low-dimensional composite quantile regression (CQR) estimation at (2.2).

The Tukey lambda distribution in Example 2 is used to generate the i.i.d. sample:

Y_{i}=Q_{Y}(U_{i},\boldsymbol{\theta}(\mathbf{X}_{i},\boldsymbol{\beta}))=\theta_{1}(\mathbf{X}_{i},\boldsymbol{\beta})+\theta_{2}(\mathbf{X}_{i},\boldsymbol{\beta})\frac{U_{i}^{\theta_{3}(\mathbf{X}_{i},\boldsymbol{\beta})}-(1-U_{i})^{\theta_{3}(\mathbf{X}_{i},\boldsymbol{\beta})}}{\theta_{3}(\mathbf{X}_{i},\boldsymbol{\beta})},
\theta_{1}(\mathbf{X}_{i},\boldsymbol{\beta})=g_{1}(\mathbf{X}_{i}^{\prime}\boldsymbol{\beta}_{1}),\quad\theta_{2}(\mathbf{X}_{i},\boldsymbol{\beta})=g_{2}(\mathbf{X}_{i}^{\prime}\boldsymbol{\beta}_{2}),\quad\theta_{3}(\mathbf{X}_{i},\boldsymbol{\beta})=g_{3}(\mathbf{X}_{i}^{\prime}\boldsymbol{\beta}_{3}),   (4.1)

where the $\{U_{i}\}$ are independent and follow $\text{Uniform}(0,1)$, $\mathbf{X}_{i}=(1,X_{i1},X_{i2})^{\prime}$, and $\{(X_{i1},X_{i2})^{\prime}\}$ is an i.i.d. sequence from the 2-dimensional standard normal distribution. The true parameter vector is $\boldsymbol{\beta}_{0}=(\boldsymbol{\beta}_{01}^{\prime},\boldsymbol{\beta}_{02}^{\prime},\boldsymbol{\beta}_{03}^{\prime})^{\prime}$, and we set the location parameters to $\boldsymbol{\beta}_{01}=(1,0.5,-1)^{\prime}$, the scale parameters to $\boldsymbol{\beta}_{02}=(1,0.5,-1)^{\prime}$ and the tail parameters to $\boldsymbol{\beta}_{03}=(1,-1,1)^{\prime}$. For the tail index $\theta_{3}(\mathbf{X}_{i},\boldsymbol{\beta})$, before generating the data, we first rescale each covariate into the range $[-0.5,0.5]$ so that a relatively stable sample can be generated. In addition, $g_{1}$, $g_{2}$ and $g_{3}$ are the inverses of the link functions. We choose the identity link for the location index and softplus-related links for the scale and tail indices, i.e., $g_{1}(x)=x$, $g_{2}(x)=\mathrm{softplus}(x)$ and $g_{3}(x)=1-\mathrm{softplus}(x)$, where $\mathrm{softplus}(x)=\log(1+\exp(x))$ is a smoothed version of $x_{+}=\max\{0,x\}$, hence the name. Note that $g_{2}(x)>0$ and $g_{3}(x)<1$. We consider three sample sizes, $n=500$, 1000 and 2000, and there are 500 replications for each sample size.
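To make the design concrete, a sketch of this data generating process is given below; the min-max rescaling of the tail-index covariates to $[-0.5,0.5]$ is one reading of the rescaling step described above.

```python
# A sketch of the DGP at (4.1) with the links of Section 4.1: identity for the
# location index, softplus for the scale, 1 - softplus for the tail.
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def simulate_qir(n, beta1, beta2, beta3, seed=None):
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.standard_normal((n, 2))])
    Xs = X.copy()
    for j in (1, 2):   # min-max rescale the non-intercept covariates
        col = X[:, j]
        Xs[:, j] = (col - col.min()) / (col.max() - col.min()) - 0.5
    t1 = X @ beta1                       # location index
    t2 = softplus(X @ beta2)             # scale index, positive
    t3 = 1.0 - softplus(Xs @ beta3)      # tail index, below 1 (zero w.p. zero)
    U = rng.uniform(size=n)
    Y = t1 + t2 * (U**t3 - (1.0 - U)**t3) / t3
    return Y, X

Y, X = simulate_qir(1000, np.array([1.0, 0.5, -1.0]),
                    np.array([1.0, 0.5, -1.0]), np.array([1.0, -1.0, 1.0]))
```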

The algorithm for the CQR estimation in Section 3 is applied with $K=10$ and the $\tau_{k}$'s equally spaced over $[\tau_{L},\tau_{U}]$. We consider three quantile ranges, $(\tau_{L},\tau_{U})=(0.5,0.99)$, $(0.7,0.99)$ and $(0.9,0.99)$, and first evaluate the estimation efficiency. Figure 1 gives the boxplots of the three fitted location parameters $\widehat{\boldsymbol{\beta}}_{1n}=(\widehat{\beta}_{1,1},\widehat{\beta}_{1,2},\widehat{\beta}_{1,3})^{\prime}$. It can be seen that both the bias and the standard deviation decrease as the sample size increases. Moreover, when $\tau_{L}$ decreases, quantile levels with richer observations are used in the estimation and, as expected, both the bias and the standard deviation decrease. Boxplots for the fitted scale and tail parameters show a similar pattern and hence are omitted to save space.

We next evaluate the prediction performance of $Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widehat{\boldsymbol{\beta}}_{n}))$ at two quantile levels of interest, $\tau^{*}=0.991$ and 0.995. Consider two values of the covariates, $\boldsymbol{X}=(1,0.1,-0.2)^{\prime}$ and $(1,0,0)^{\prime}$, for which the corresponding tail indices are $\theta_{3}(\mathbf{X},\boldsymbol{\beta}_{0})=-0.1139$ and $-0.3132$, respectively. Note that the Tukey lambda distribution provides a good approximation to the Cauchy and normal distributions when the tail index is $-1$ and $0.14$, respectively, and it becomes more heavy-tailed as the tail index decreases (Freimer et al., 1988). The prediction error in terms of squared loss (PES), $[Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widehat{\boldsymbol{\beta}}_{n}))-Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))]^{2}$, is calculated for each replication, and the corresponding sample mean is the commonly used mean square error. Table 1 presents both the sample mean and the standard deviation of the PESs across the 500 replications, as in Wang et al. (2012a). A clear trend of improvement can be observed as the sample size becomes larger, and the prediction is more accurate at the level $\tau^{*}=0.991$ in almost all cases.

4.2 High-dimensional regularized estimation

This subsection conducts simulation experiments to evaluate the finite-sample performance of the high-dimensional regularized estimation at (2.3).

For the data generating process at (4.1), we consider three dimensions, $p=50$, 100 and 150, and the true parameter vectors are extended from those in Section 4.1 by adding zeros, i.e. $\boldsymbol{\beta}_{01}=(1,0.5,-1,0,\ldots,0)^{\prime}$, $\boldsymbol{\beta}_{02}=(1,0.5,-1,0,\ldots,0)^{\prime}$ and $\boldsymbol{\beta}_{03}=(1,-1,1,0,\ldots,0)^{\prime}$, which are vectors of length $p$ with 3 nonzero entries each. As a result, the full parameter vector $\boldsymbol{\beta}_{0}=(\boldsymbol{\beta}_{01}^{\prime},\boldsymbol{\beta}_{02}^{\prime},\boldsymbol{\beta}_{03}^{\prime})^{\prime}$ has length $3p$ with $s=9$ nonzero entries. The sample size is chosen as $n=\lfloor cs\log p\rfloor$ with $c=5$, 10, 20, 30, 40 and 50, where $\lfloor x\rfloor$ denotes the largest integer less than or equal to $x$. All other settings are the same as in the previous subsection.

The algorithm for the regularized estimation in Section 3 is used to search for the estimators, and we generate an independent validation set of size $5n$ to select the tuning parameter $\lambda$ by minimizing the composite quantile check loss; see also Wang et al. (2012b). Figure 2 gives the estimation errors $\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|$. It can be seen that $\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|$ is roughly proportional to the quantity $\sqrt{s\log p/n}$, which confirms the convergence rate in Theorem 4. Moreover, the estimation errors approach zero as the sample size $n$ increases, and we can then conclude the consistency of $\widetilde{\boldsymbol{\beta}}_{n}$. Finally, when $\tau_{L}$ increases, quantile levels with fewer observations are used in the estimation procedure, and hence larger estimation errors are observed.

We next evaluate the prediction performance at the quantile levels $\tau^{*}=0.991$ and 0.995, with the covariates $\mathbf{X}$ taking the values $(1,0.1,-0.2,0,\ldots,0)^{\prime}$ and $(1,0,0,0,\ldots,0)^{\prime}$, similar to those in the previous subsection. Table 2 gives the mean square errors of the predicted conditional quantiles $Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widetilde{\boldsymbol{\beta}}_{n}))$, as well as the sample standard deviations of the prediction errors in squared loss, for $p=50$. It can be seen that a larger sample size leads to much smaller mean square errors. Moreover, when $\tau_{L}$ is larger, the prediction also becomes worse, which may be due to the lower estimation efficiency. Finally, similar to the experiments in the previous subsection, the prediction at $\tau^{*}=0.991$ is more accurate in almost all cases. The results for $p=100$ and 150 are similar and hence omitted.

Finally, we consider the following criteria to evaluate the performance of variable selection: the average number of selected active coefficients (size), the percentage of replications in which the active and inactive coefficients are both correctly selected simultaneously ($\textrm{P}_{\textrm{AI}}$), the percentage of active coefficients correctly selected ($\textrm{P}_{\textrm{A}}$), the percentage of inactive coefficients correctly selected ($\textrm{P}_{\textrm{I}}$), the false positive rate (FP), and the false negative rate (FN). Table 3 reports the selection results for $p=50$ and $c=10$, 30 and 50. When $\tau_{L}$ is larger, both $\textrm{P}_{\textrm{AI}}$ and $\textrm{P}_{\textrm{A}}$ decrease, indicating a decrease in selection accuracy. In addition, the performance improves as the sample size gets larger. The results for $p=100$ and 150 are similar and hence omitted.
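Under one plausible per-replication reading of these criteria, they can be computed as below, with the reported percentages obtained by averaging over replications; the boolean-mask representation is our own.

```python
# A sketch of the selection metrics; `active` and `active_hat` are boolean
# masks of truly and estimated nonzero coefficients for one replication.
import numpy as np

def selection_metrics(active, active_hat):
    size = int(active_hat.sum())              # number of selected coefficients
    p_a = np.mean(active_hat[active])         # share of active correctly kept
    p_i = np.mean(~active_hat[~active])       # share of inactive correctly dropped
    p_ai = float(p_a == 1.0 and p_i == 1.0)   # exact support recovery
    fp = np.mean(active_hat[~active])         # false positive rate
    fn = np.mean(~active_hat[active])         # false negative rate
    return dict(size=size, PAI=p_ai, PA=p_a, PI=p_i, FP=fp, FN=fn)
```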

5 Application to Childhood Malnutrition

Childhood malnutrition is well known to be one of the most urgent problems in developing countries. The Demographic and Health Surveys (DHS) program has conducted nationally representative surveys on child and maternal health, family planning, child survival, and so on, resulting in many datasets for research purposes. The dataset for India was first analyzed by Koenker (2011) and can be downloaded from http://www.econ.uiuc.edu/~roger/research/bandaids/bandaids.html. It has also been studied by many researchers (Fenske et al., 2011; Belloni et al., 2019b) for the childhood malnutrition problem in India, where quantile regression with low- or high-dimensional covariates was conducted at the levels $\tau=0.1$ and 0.05. The proposed model enables us to consider much lower quantiles, corresponding to more severe childhood malnutrition.

The child's height is chosen as the indicator of malnutrition, as in Belloni et al. (2019b). Specifically, the response is set to $Y=-100\log(\text{child's height in centimeters})$, and we then consider high quantiles to study the childhood malnutrition problem, which is consistent with the previous sections. The other variables include seven continuous and 13 categorical ones, covering both biological and socioeconomic factors possibly related to childhood malnutrition. Examples of biological factors include the child's age, gender and duration of breastfeeding in months, and the mother's age and body-mass index (BMI); socioeconomic factors include the mother's employment status, religion, residence, and the availability of electricity. All seven continuous variables are standardized to have mean zero and variance one, and two-way interactions between all variables are also included. Moreover, we concentrate on the samples from poor families. As a result, there are $p=328$ covariates in total after removing variables with all elements being zero, and the sample size is $n=6858$. Denote the full model size by $(328,328,328)$, where the components correspond to the location, scale and tail indices, respectively. Furthermore, as in the simulation experiments, the covariates are rescaled to the range $[-0.5,0.5]$ for the tail index.

We aim at two high quantiles, $\tau^{*}=0.991$ and 0.995, and the algorithm for the high-dimensional regularized estimation in Section 3 is first applied to select the interval $[\tau_{L},\tau_{U}]$. Specifically, the value of $\tau_{U}$ is fixed at 0.99, and that of $\tau_{L}$ is selected among $\tau_{L}=0.9+0.01j$ with $1\leq j\leq 8$. The value of $K$ is set to 10, and the $\tau_{k}$'s with $1\leq k\leq K$ are equally spaced over $[\tau_{L},\tau_{U}]$. For each $\tau_{L}$, the whole sample is randomly split into five parts of equal size, except that one part is short by two observations, and 5-fold cross-validation is used to select the tuning parameter $\lambda$. To stabilize the process, we conduct the random splitting five times and choose the value of $\lambda$ that minimizes the composite check loss over all five splittings. The averaged value of the PEs is also calculated over all five splittings, and the corresponding plot is presented in Figure 3. As a result, we choose $\tau_{L}=0.96$ since it corresponds to the minimum value of the PEs.

We next apply the QIR model to the whole dataset with $[\tau_{L},\tau_{U}]=[0.96,0.99]$, where the tuning parameter $\lambda$ is scaled by $\sqrt{4/5}$ since the sample size changes from $4n/5$ to $n$. The fitted model has size $(14,16,19)$, and we can predict the conditional quantile structure at any level $\tau^{*}\in(0.96,1)$. For example, consider the variable of child's age, where we are interested in children aged 20, 30 and 40 months. The duration of breastfeeding is set equal to the child's age, since the age is always at least the duration of breastfeeding, and we set the values of all other variables in $\boldsymbol{X}$ to those of the 460th observation, whose response value is the sample median. Figure 4 plots the predicted quantile curves for the three ages. It can be seen that younger children may have extremely low heights, and we may conclude that younger children are more easily affected by malnutrition.

Figure 4 also draws the quantile curves for the mother's education, the child's gender and the mother's unemployment status, where the values of the 460th observation are again used for the non-focal covariates in the prediction. For the child's gender, baby boys are usually taller than baby girls, as observed in Koenker (2011), while the difference vanishes at much higher quantiles. In addition, the quantile curves for the mother's education are almost parallel, while those for the mother's unemployment status cross. More importantly, all these new insights concern very high quantiles, which confirms the necessity of the proposed model.

Finally, we compare the proposed QIR model with two commonly used methods in the literature: (i) linear quantile regression at the level $\tau^{*}$ with $\ell_{1}$ penalty, as in Belloni et al. (2019b), and (ii) extremal quantile regression (Wang et al., 2012a) adapted to high-dimensional data. The prediction performance at $\tau^{*}=0.991$ and 0.995 is considered for the comparison, and we fix $[\tau_{L},\tau_{U}]=[0.96,0.99]$. For method (ii), we consider $K=4.5n^{1/3}$ quantile levels, equally spaced over $[0.96,0.99]$, and the linear quantile regression with $\ell_{1}$ penalty is conducted at each level. As in Wang et al. (2012a), we can estimate the extreme value index, and the fitted structures can hence be extrapolated to the level $\tau^{*}$. Note that there is no theoretical justification for method (ii) in the literature. As in the simulation experiments, the tuning parameter $\lambda$ in the above three methods is selected by minimizing the composite check loss on the testing set. We randomly split the data 100 times, and one value of PE is obtained from each splitting. Figure 3 gives the boxplots of the PEs from our model and the two competing methods, and the advantages of our model can be observed at both target levels, $\tau^{*}=0.991$ and 0.995.

6 Conclusions and discussions

This paper proposes a reliable method for inference at extreme quantiles with both low- and high-dimensional covariates. The main idea is to first conduct a composite quantile regression at fixed quantile levels and then extrapolate the estimated results to extreme quantiles by assuming a parametric structure at the tails. The Tukey lambda structure can be used due to its flexibility and the explicit form of its quantile function, and the success of the proposed methodology has been demonstrated by extensive numerical studies.

This paper can be extended in the following two directions. On the one hand, in the proposed model, a parametric structure is assumed over the interval $[\tau_{0},1]$. Although the criterion of PE is suggested in Section 3 to balance estimation efficiency and model misspecification, it would be interesting to provide a statistical tool for assessing goodness-of-fit. Dong et al. (2019) introduced a goodness-of-fit test for parametric quantile regression at a fixed quantile level, and it can be adapted to our problem by extending the test statistic from a fixed level to the interval $[\tau_{0},1]$. We leave this for future research. On the other hand, the idea of this paper is general and can be applied to many other scenarios. For example, for conditionally heteroscedastic time series models, it is usually difficult to conduct quantile estimation at both the median and extreme quantiles. The difficulty at extreme quantiles is due to the sparse data at the tails, while that at the median is due to the tiny values of the fitted parameters (Zhu et al., 2018; Zhu and Li, 2021). Our idea can certainly be used to alleviate this problem to some extent.

Appendix: Technical details

A.1 Proofs for low-dimensional CQR Estimation

This subsection gives the technical proofs of Lemma 1 and Theorems 1-3 in Section 2.2. Two auxiliary lemmas are also presented at the end of this subsection: Lemma A.1 is used in the proof of Lemma 1, and Lemma A.2 in that of Theorem 3. The proofs of these two auxiliary lemmas are given in a supplementary file.

Proof of Lemma 1.

The Tukey lambda distribution has the form of

Q(\tau,\boldsymbol{\theta})=\theta_{1}+\theta_{2}\left(\frac{\tau^{\theta_{3}}-(1-\tau)^{\theta_{3}}}{\theta_{3}}\right),

where $\theta_{1}\in\mathbb{R}$, $\theta_{2}>0$ and $\theta_{3}\neq 0$. We prove Lemma 1 for $\theta_{3}<1$. Consider four arbitrary quantile levels $0<\tau_{1}<\tau_{2}<\tau_{3}<\tau_{4}<1$ and two arbitrary index vectors $\widetilde{\boldsymbol{\theta}}=(\widetilde{\theta}_{1},\widetilde{\theta}_{2},\widetilde{\theta}_{3})^{\prime}$ and $\boldsymbol{\theta}=(\theta_{1},\theta_{2},\theta_{3})^{\prime}$ such that

Q(\tau_{j},\widetilde{\boldsymbol{\theta}})=Q(\tau_{j},\boldsymbol{\theta})\text{ for all }1\leq j\leq 4.   (A.1)

We show that $\widetilde{\boldsymbol{\theta}}=\boldsymbol{\theta}$ in the following.

The first step is to prove $\widetilde{\theta}_{3}=\theta_{3}$ by contradiction. Suppose that $\widetilde{\theta}_{3}\neq\theta_{3}$ and, without loss of generality, assume that $\theta_{3}<\widetilde{\theta}_{3}<0$. Denote $f_{j}(\theta_{3})=\tau_{j}^{\theta_{3}}-(1-\tau_{j})^{\theta_{3}}$ for $1\leq j\leq 4$. From (A.1), we have

\frac{f_{1}(\widetilde{\theta}_{3})-f_{2}(\widetilde{\theta}_{3})}{f_{3}(\widetilde{\theta}_{3})-f_{2}(\widetilde{\theta}_{3})}-\frac{f_{1}(\theta_{3})-f_{2}(\theta_{3})}{f_{3}(\theta_{3})-f_{2}(\theta_{3})}=0   (A.2)

and

\frac{f_{4}(\widetilde{\theta}_{3})-f_{2}(\widetilde{\theta}_{3})}{f_{3}(\widetilde{\theta}_{3})-f_{2}(\widetilde{\theta}_{3})}-\frac{f_{4}(\theta_{3})-f_{2}(\theta_{3})}{f_{3}(\theta_{3})-f_{2}(\theta_{3})}=0.   (A.3)

Let us fix $\theta_{3}$, $\widetilde{\theta}_{3}$, $\tau_{2}$ and $\tau_{3}$. As a result, $\kappa_{1}=f_{2}(\widetilde{\theta}_{3})$, $\kappa_{2}=f_{2}(\theta_{3})$, $\kappa_{3}=f_{3}(\theta_{3})-f_{2}(\theta_{3})$ and $\kappa_{4}=f_{3}(\widetilde{\theta}_{3})-f_{2}(\widetilde{\theta}_{3})$ are all fixed values. Denote

F(τ)=κ3{τθ~3(1τ)θ~3κ1}κ4{τθ3(1τ)θ3κ2},\displaystyle F(\tau)=\kappa_{3}\left\{\tau^{\widetilde{\theta}_{3}}-(1-\tau)^{\widetilde{\theta}_{3}}-\kappa_{1}\right\}-\kappa_{4}\left\{\tau^{\theta_{3}}-(1-\tau)^{\theta_{3}}-\kappa_{2}\right\}, (A.4)
F˙(τ)=κ3θ~3{τθ~31+(1τ)θ~31}κ4θ3{τθ31+(1τ)θ31},\dot{F}(\tau)=\kappa_{3}\widetilde{\theta}_{3}\{\tau^{\widetilde{\theta}_{3}-1}+(1-\tau)^{\widetilde{\theta}_{3}-1}\}-\kappa_{4}\theta_{3}\{\tau^{{\theta}_{3}-1}+(1-\tau)^{{\theta}_{3}-1}\},

and

G(τ)=τθ~31+(1τ)θ~31τθ31+(1τ)θ31,G(\tau)=\frac{\tau^{\widetilde{\theta}_{3}-1}+(1-\tau)^{\widetilde{\theta}_{3}-1}}{\tau^{{\theta}_{3}-1}+(1-\tau)^{{\theta}_{3}-1}},

where F˙()\dot{F}(\cdot) is the derivative function of F()F(\cdot), and F˙(τ)=0\dot{F}(\tau)=0 if and only if G(τ)=κ4θ3/(κ3θ~3)G(\tau)=\kappa_{4}\theta_{3}/(\kappa_{3}\widetilde{\theta}_{3}). Note that equations (A.2) and (A.3) correspond to F(τ1)=0F(\tau_{1})=0 and F(τ4)=0F(\tau_{4})=0, respectively. Moreover, it can be verified that F(τ2)=0F(\tau_{2})=0 and F(τ3)=0F(\tau_{3})=0, i.e. the equation F(τ)=0F(\tau)=0 has at least four different solutions. As a result, the equation F˙(τ)=0\dot{F}(\tau)=0 or G(τ)=κ4θ3/(κ3θ~3)G(\tau)=\kappa_{4}\theta_{3}/(\kappa_{3}\widetilde{\theta}_{3}) has at least three different solutions. While it is implied by Lemma A.1 that the equation G(τ)=κ4θ3/(κ3θ~3)G(\tau)=\kappa_{4}\theta_{3}/(\kappa_{3}\widetilde{\theta}_{3}) has at most two different solutions. Due to the contradiction, we prove that θ~3=θ3\widetilde{\theta}_{3}=\theta_{3}, and it is readily to further verify that (θ~1,θ~2)=(θ1,θ2)(\widetilde{\theta}_{1},\widetilde{\theta}_{2})=(\theta_{1},\theta_{2}). We hence accomplish the proof of Lemma 1(i). The result of Lemma 1(ii) can be proved similarly. ∎
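The root-counting step above can also be checked numerically. In the Python sketch below, the parameter values are illustrative assumptions; it verifies that F(\tau_{2})=F(\tau_{3})=0 by construction and counts the sign changes of F over a fine grid, which should not exceed three, in line with Lemma A.1.

```python
import numpy as np

def f(tau, theta):
    return tau**theta - (1.0 - tau)**theta

# illustrative values with theta3 < theta3_tilde < 0, as in the proof
theta3, theta3_t = -0.8, -0.3
tau2, tau3 = 0.4, 0.6

kappa1, kappa2 = f(tau2, theta3_t), f(tau2, theta3)
kappa3 = f(tau3, theta3) - kappa2
kappa4 = f(tau3, theta3_t) - kappa1

def F(tau):
    return (kappa3 * (f(tau, theta3_t) - kappa1)
            - kappa4 * (f(tau, theta3) - kappa2))

print(F(tau2), F(tau3))                         # both zero by construction
grid = np.linspace(1e-4, 1.0 - 1e-4, 200001)
changes = np.sum(np.diff(np.sign(F(grid))) != 0)
print("sign changes of F on (0,1):", changes)   # should be at most three
```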

Proof of Theorem 1.

We first prove the uniqueness of \boldsymbol{\beta}_{0}. Denote \bar{L}_{k}(\boldsymbol{\beta})=E[\rho_{\tau_{k}}\{Y-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))\}]. From model (2.1), \boldsymbol{\beta}_{0} is the minimizer not only of \bar{L}(\boldsymbol{\beta})=\sum_{k=1}^{K}\bar{L}_{k}(\boldsymbol{\beta}) but also of \bar{L}_{k}(\boldsymbol{\beta}) for each 1\leq k\leq K. Suppose that \boldsymbol{\beta}_{0}^{*} is another minimizer of \bar{L}(\boldsymbol{\beta}); then it is also a minimizer of \bar{L}_{k}(\boldsymbol{\beta}) for all 1\leq k\leq K.

Note that, for u\neq 0,

\rho_{\tau}(u-v)-\rho_{\tau}(u)=-v\psi_{\tau}(u)+(u-v)[I(0>u>v)-I(0<u<v)]=-v\psi_{\tau}(u)+\int_{0}^{v}[I(u\leq s)-I(u\leq 0)]ds, (A.5)

where \psi_{\tau}(u)=\tau-I(u<0); see Knight, (1998). Let U^{(k)}=Y-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0})) and V^{(k)}=Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}^{*}))-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0})). Since E\{\psi_{\tau_{k}}(U^{(k)})|\mathbf{X}\}=0, the term E\{V^{(k)}\psi_{\tau_{k}}(U^{(k)})\} vanishes, and it holds that, for each 1\leq k\leq K,

0=\bar{L}_{k}(\boldsymbol{\beta}_{0}^{*})-\bar{L}_{k}(\boldsymbol{\beta}_{0})=E\{(U^{(k)}-V^{(k)})[I(0>U^{(k)}>V^{(k)})-I(0<U^{(k)}<V^{(k)})]\},

whose integrand is nonnegative almost surely. This implies that V^{(k)}=Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}^{*}))-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))=0 with probability one, which, together with the conditions in Lemma 1 and the monotone link functions, leads to \mathbf{X}^{\prime}\boldsymbol{\beta}_{j0}^{*}=\mathbf{X}^{\prime}\boldsymbol{\beta}_{j0} for all 1\leq j\leq d. We then have \boldsymbol{\beta}_{0}^{*}=\boldsymbol{\beta}_{0} since E(\boldsymbol{X}\boldsymbol{X}^{\prime}) is positive definite. This accomplishes the proof of the uniqueness of \boldsymbol{\beta}_{0}. ∎
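Knight's identity at (A.5) is purely algebraic and can be verified numerically. Below is a quick Python check, a minimal sketch with simulated scalar inputs; note that u=0 occurs with probability zero for continuous draws, matching the restriction u\neq 0.

```python
import numpy as np

rng = np.random.default_rng(0)

def rho(u, tau):                 # check function: rho_tau(u) = u * (tau - I(u < 0))
    return u * (tau - (u < 0))

tau = 0.9
u = rng.normal(size=100000)
v = rng.normal(size=100000)

lhs = rho(u - v, tau) - rho(u, tau)
psi = tau - (u < 0)              # psi_tau(u) = tau - I(u < 0)
rhs = -v * psi + (u - v) * (((0 > u) & (u > v)).astype(float)
                            - ((0 < u) & (u < v)).astype(float))
print(np.max(np.abs(lhs - rhs)))  # ~0: the two sides agree up to rounding
```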

Proof of Theorem 2.

Note that L_{n}(\boldsymbol{\beta})=n^{-1}\sum_{k=1}^{K}\sum_{i=1}^{n}\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X}_{i},\boldsymbol{\beta}))\} and \bar{L}(\boldsymbol{\beta})=E[\sum_{k=1}^{K}\rho_{\tau_{k}}\{Y-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))\}]. From Knight's identity at (A.5),

|\bar{L}(\boldsymbol{\beta})-\bar{L}(\boldsymbol{\beta}_{0})|\leq\sum_{k=1}^{K}E|Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}_{0}))|\leq\sum_{k=1}^{K}E\sup_{\boldsymbol{\beta}\in\Theta}\left\|\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))}{\partial\boldsymbol{\beta}}\right\|\|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\|,

which, together with the condition E\max_{1\leq k\leq K}\sup_{\boldsymbol{\beta}\in\Theta}\|\partial Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))/\partial\boldsymbol{\beta}\|<\infty and Corollary 2 of Andrews, (1987), implies that

\sup_{\boldsymbol{\beta}\in\Theta}|L_{n}(\boldsymbol{\beta})-L_{n}(\boldsymbol{\beta}_{0})-[\bar{L}(\boldsymbol{\beta})-\bar{L}(\boldsymbol{\beta}_{0})]|=o_{p}(1). (A.6)

Note that \bar{L}(\boldsymbol{\beta}) is continuous in \boldsymbol{\beta} and, from Theorem 1, \boldsymbol{\beta}_{0} is the unique minimizer of \bar{L}(\boldsymbol{\beta}). As a result, for each \delta>0,

\epsilon=\inf_{\boldsymbol{\beta}\in B_{\delta}^{c}(\boldsymbol{\beta}_{0})}\bar{L}(\boldsymbol{\beta})-\bar{L}(\boldsymbol{\beta}_{0})>0, (A.7)

where B_{\delta}(\boldsymbol{\beta}_{0})=\{\boldsymbol{\beta}:\|\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\|\leq\delta\} and B_{\delta}^{c}(\boldsymbol{\beta}_{0}) is its complement, and hence

\left\{\inf_{\boldsymbol{\beta}\in B_{\delta}^{c}(\boldsymbol{\beta}_{0})}L_{n}(\boldsymbol{\beta})\leq L_{n}(\boldsymbol{\beta}_{0})\right\}\subseteq\left\{\sup_{\boldsymbol{\beta}\in B_{\delta}^{c}(\boldsymbol{\beta}_{0})}|L_{n}(\boldsymbol{\beta})-L_{n}(\boldsymbol{\beta}_{0})-[\bar{L}(\boldsymbol{\beta})-\bar{L}(\boldsymbol{\beta}_{0})]|\geq\epsilon\right\}. (A.8)

Note that

1=P\left\{L_{n}(\widehat{\boldsymbol{\beta}}_{n})\leq L_{n}(\boldsymbol{\beta}_{0})\right\}\leq P\left\{\widehat{\boldsymbol{\beta}}_{n}\in B_{\delta}(\boldsymbol{\beta}_{0})\right\}+P\left\{\inf_{\boldsymbol{\beta}\in B_{\delta}^{c}(\boldsymbol{\beta}_{0})}L_{n}(\boldsymbol{\beta})\leq L_{n}(\boldsymbol{\beta}_{0})\right\},

which, together with (A.6) and (A.8), implies that

P\left\{\|\widehat{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|\leq\delta\right\}\geq 1-P\left\{\inf_{\boldsymbol{\beta}\in B_{\delta}^{c}(\boldsymbol{\beta}_{0})}L_{n}(\boldsymbol{\beta})\leq L_{n}(\boldsymbol{\beta}_{0})\right\}\rightarrow 1

as n\rightarrow\infty. This accomplishes the proof of consistency. ∎

Proof of Theorem 3.

For simplicity, we denote Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X}_{i},\boldsymbol{\beta})) by Q_{ik}(\boldsymbol{\beta}) throughout the proof of this theorem. Let S_{n}(\boldsymbol{\beta})=L_{n}(\boldsymbol{\beta})-L_{n}(\boldsymbol{\beta}_{0}) and, from Knight's identity at (A.5),

S_{n}(\boldsymbol{\beta})=\sum_{k=1}^{K}\sum_{i=1}^{n}\left[\rho_{\tau_{k}}\{Y_{i}-Q_{ik}(\boldsymbol{\beta})\}-\rho_{\tau_{k}}\{Y_{i}-Q_{ik}(\boldsymbol{\beta}_{0})\}\right]=\sum_{k=1}^{K}\sum_{i=1}^{n}\Big[\{Q_{ik}(\boldsymbol{\beta})-Q_{ik}(\boldsymbol{\beta}_{0})\}\{I(e_{ik}<0)-\tau_{k}\}+\int_{0}^{Q_{ik}(\boldsymbol{\beta})-Q_{ik}(\boldsymbol{\beta}_{0})}\{I(e_{ik}\leq s)-I(e_{ik}\leq 0)\}ds\Big]. (A.9)

Note that, by Taylor expansion, Q_{ik}(\boldsymbol{\beta})-Q_{ik}(\boldsymbol{\beta}_{0})=(\boldsymbol{\beta}-\boldsymbol{\beta}_{0})^{\prime}\partial Q_{ik}(\boldsymbol{\beta}_{0})/\partial\boldsymbol{\beta}+\frac{1}{2}(\boldsymbol{\beta}-\boldsymbol{\beta}_{0})^{\prime}\{\partial^{2}Q_{ik}(\boldsymbol{\beta}^{*})/\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^{\prime}\}(\boldsymbol{\beta}-\boldsymbol{\beta}_{0}), where \boldsymbol{\beta}^{*} lies between \boldsymbol{\beta}_{0} and \boldsymbol{\beta}. Let u=\boldsymbol{\beta}-\boldsymbol{\beta}_{0},

q_{1ik}(u)=u^{\prime}\frac{\partial Q_{ik}(\boldsymbol{\beta}_{0})}{\partial\boldsymbol{\beta}}\quad\text{and}\quad q_{2ik}(u)=\frac{1}{2}u^{\prime}\frac{\partial^{2}Q_{ik}(\boldsymbol{\beta}^{*})}{\partial\boldsymbol{\beta}\partial\boldsymbol{\beta}^{\prime}}u.

We then decompose S_{n}(\boldsymbol{\beta}) into

S_{n}(\boldsymbol{\beta})=\sum_{k=1}^{K}\sum_{i=1}^{n}\Big[\{q_{1ik}(u)+q_{2ik}(u)\}\{I(e_{ik}<0)-\tau_{k}\} (A.10)
+\int_{0}^{q_{1ik}(u)+q_{2ik}(u)}\{I(e_{ik}\leq s)-I(e_{ik}\leq 0)\}ds\Big] (A.11)
=-u^{\prime}T_{n}+\Pi_{1n}(u)+\Pi_{2n}(u)+\Pi_{3n}(u), (A.12)

where

T_{n}=\sum_{k=1}^{K}\sum_{i=1}^{n}\frac{\partial Q_{ik}(\boldsymbol{\beta}_{0})}{\partial\boldsymbol{\beta}}\{\tau_{k}-I(e_{ik}\leq 0)\},

\xi_{ik}(u)=\int_{0}^{q_{1ik}(u)}\{I(e_{ik}\leq s)-I(e_{ik}\leq 0)\}ds,

\Pi_{1n}(u)=\sum_{k=1}^{K}\sum_{i=1}^{n}\left[\xi_{ik}(u)-E\{\xi_{ik}(u)|\mathbf{X}_{i}\}\right],

\Pi_{2n}(u)=\sum_{k=1}^{K}\sum_{i=1}^{n}E\{\xi_{ik}(u)|\mathbf{X}_{i}\},

and

\Pi_{3n}(u)=\sum_{k=1}^{K}\sum_{i=1}^{n}\Big[q_{2ik}(u)\{I(e_{ik}<0)-\tau_{k}\}+\int_{q_{1ik}(u)}^{q_{1ik}(u)+q_{2ik}(u)}\{I(e_{ik}\leq s)-I(e_{ik}\leq 0)\}ds\Big].

First, by the central limit theorem, we can show that

\frac{1}{\sqrt{n}}T_{n}=\frac{1}{\sqrt{n}}\sum_{k=1}^{K}\sum_{i=1}^{n}\frac{\partial Q_{ik}(\boldsymbol{\beta}_{0})}{\partial\boldsymbol{\beta}}\{\tau_{k}-I(e_{ik}\leq 0)\}\rightarrow N(0,\Omega_{0}) (A.13)

in distribution as n\rightarrow\infty. Note that, from Theorem 2, \widehat{u}_{n}=\widehat{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}=o_{p}(1) and then, by applying Lemma A.2,

S_{n}(\widehat{\boldsymbol{\beta}}_{n})=-\widehat{u}_{n}^{\prime}T_{n}+\Pi_{1n}(\widehat{u}_{n})+\Pi_{2n}(\widehat{u}_{n})+\Pi_{3n}(\widehat{u}_{n})=-\widehat{u}_{n}^{\prime}T_{n}+\frac{1}{2}(\sqrt{n}\widehat{u}_{n})^{\prime}\Omega_{1}(\sqrt{n}\widehat{u}_{n})+o_{p}(\sqrt{n}\|\widehat{u}_{n}\|+n\|\widehat{u}_{n}\|^{2})\geq-\|\sqrt{n}\widehat{u}_{n}\|\left\{\left\|\frac{1}{\sqrt{n}}T_{n}\right\|+o_{p}(1)\right\}+n\|\widehat{u}_{n}\|^{2}\left\{\frac{\lambda_{\min}(\Omega_{1})}{2}+o_{p}(1)\right\}, (A.14)

where \lambda_{\min}(\Omega_{1}) is the minimum eigenvalue of \Omega_{1}. This, together with the fact that S_{n}(\widehat{\boldsymbol{\beta}}_{n})=L_{n}(\widehat{\boldsymbol{\beta}}_{n})-L_{n}(\boldsymbol{\beta}_{0})\leq 0, implies that

\sqrt{n}\|\widehat{u}_{n}\|\leq\left\{\frac{\lambda_{\min}(\Omega_{1})}{2}+o_{p}(1)\right\}^{-1}\left\{\left\|\frac{1}{\sqrt{n}}T_{n}\right\|+o_{p}(1)\right\}=O_{p}(1). (A.15)

Denote u_{n}^{\star}=n^{-1}\Omega_{1}^{-1}T_{n} and, from (A.14) and (A.15),

S_{n}(\widehat{\boldsymbol{\beta}}_{n})=\frac{1}{2}(\sqrt{n}\widehat{u}_{n})^{\prime}\Omega_{1}(\sqrt{n}\widehat{u}_{n})-(\sqrt{n}\widehat{u}_{n})^{\prime}\Omega_{1}(\sqrt{n}u_{n}^{\star})+o_{p}(1). (A.16)

Moreover, since \sqrt{n}u_{n}^{\star}=O_{p}(1), equation (A.14) still holds when \widehat{u}_{n} is replaced by u_{n}^{\star}, and then

S_{n}(\boldsymbol{\beta}_{0}+u_{n}^{\star})=-\frac{1}{2}(\sqrt{n}u_{n}^{\star})^{\prime}\Omega_{1}(\sqrt{n}u_{n}^{\star})+o_{p}(1), (A.17)

which leads to

0\geq S_{n}(\widehat{\boldsymbol{\beta}}_{n})-S_{n}(\boldsymbol{\beta}_{0}+u_{n}^{\star})=\frac{1}{2}(\sqrt{n}\widehat{u}_{n}-\sqrt{n}u_{n}^{\star})^{\prime}\Omega_{1}(\sqrt{n}\widehat{u}_{n}-\sqrt{n}u_{n}^{\star})+o_{p}(1)\geq\frac{\lambda_{\min}(\Omega_{1})}{2}\|\sqrt{n}\widehat{u}_{n}-\sqrt{n}u_{n}^{\star}\|^{2}+o_{p}(1).

As a result, from (A.13),

\sqrt{n}\widehat{u}_{n}=\sqrt{n}u_{n}^{\star}+o_{p}(1)=\Omega_{1}^{-1}n^{-1/2}T_{n}+o_{p}(1)\rightarrow N(\boldsymbol{0},\Omega_{1}^{-1}\Omega_{0}\Omega_{1}^{-1})

in distribution as n\rightarrow\infty. ∎
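The limit at (A.13) can be illustrated by simulation. In the Python sketch below, an Exp(1) location-only toy model is an assumption made purely for illustration, with the derivative \partial Q_{ik}(\boldsymbol{\beta}_{0})/\partial\boldsymbol{\beta} playing the role of a constant; the empirical variance of n^{-1/2}T_{n} is then compared with the theoretical value \sum_{k,k^{\prime}}\{\min(\tau_{k},\tau_{k^{\prime}})-\tau_{k}\tau_{k^{\prime}}\}.

```python
import numpy as np

rng = np.random.default_rng(1)
taus = np.array([0.5, 0.7, 0.9])             # illustrative quantile levels
n, n_rep = 2000, 2000

# Exp(1) has closed-form quantiles, so e_ik = Y_i - Q(tau_k) is exact.
q_true = -np.log(1.0 - taus)

# theoretical Var of sum_k psi_{tau_k}(e_k):
cov = np.minimum.outer(taus, taus) - np.outer(taus, taus)
var_theory = cov.sum()

stats = np.empty(n_rep)
for r in range(n_rep):
    y = rng.exponential(size=n)
    psi = taus[None, :] - (y[:, None] <= q_true[None, :])  # n x K score matrix
    stats[r] = psi.sum() / np.sqrt(n)                      # n^{-1/2} T_n
print(stats.var(), var_theory)               # the two should be close
```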

Lemma A.1.

Consider the function G(\tau) defined in the proof of Lemma 1 with \theta_{3}<\widetilde{\theta}_{3}<0. It holds that (1) G(\tau) is strictly decreasing for \tau>0.5, and (2) G(\tau) is strictly increasing for \tau<0.5.

Lemma A.2.

Suppose that the conditions of Theorem 3 hold. For any sequence of random variables \{u_{n}\} with u_{n}=o_{p}(1), it holds that

(a) \Pi_{1n}(u_{n})=o_{p}(\sqrt{n}\|u_{n}\|+n\|u_{n}\|^{2}),

(b) \Pi_{2n}(u_{n})=(1/2)(\sqrt{n}u_{n})^{\prime}\Omega_{1}(\sqrt{n}u_{n})+o_{p}(n\|u_{n}\|^{2}), and

(c) \Pi_{3n}(u_{n})=o_{p}(n\|u_{n}\|^{2}),

where \Pi_{1n}(u), \Pi_{2n}(u) and \Pi_{3n}(u) are defined in the proof of Theorem 3.

A.2 Proofs for high-dimensional regularized estimation

This subsection first conducts a deterministic analysis in Lemma A.3 and then a stochastic analysis in Lemmas A.4 and A.5; the proof of Theorem 4 follows by combining these results.

We first treat the observed data, \{(Y_{i},\boldsymbol{X}_{i}^{\prime})^{\prime},i=1,\ldots,n\}, as deterministic. Consider the loss function L_{n}(\boldsymbol{\beta})=\sum_{i=1}^{n}\sum_{k=1}^{K}\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}))\} and its first-order Taylor-series error

\mathcal{E}_{n}(\Delta)=n^{-1}L_{n}(\boldsymbol{\beta}_{0}+\Delta)-n^{-1}L_{n}(\boldsymbol{\beta}_{0})-\langle n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0}),\Delta\rangle,

where \Delta\in\mathbb{R}^{dp}, \langle\cdot,\cdot\rangle is the inner product, e_{ik}=Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}_{0})), \psi_{\tau}(u)=\tau-I(u<0), and \nabla L_{n}(\boldsymbol{\beta}_{0})=-\sum_{i=1}^{n}\sum_{k=1}^{K}\psi_{\tau_{k}}(e_{ik})\partial Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}_{0}))/\partial\boldsymbol{\beta} is a subgradient of L_{n}(\cdot) at \boldsymbol{\beta}_{0}.

Definition 1.

The loss function L_{n}(\cdot) satisfies the local restricted strong convexity (LRSC) condition if

\mathcal{E}_{n}(\Delta)\geq\alpha\|\Delta\|_{2}^{2}-\eta\sqrt{\frac{\log p}{n}}\|\Delta\|_{1}\quad\text{for all }\Delta\text{ such that }0<r\leq\|\Delta\|_{2}\leq R,

where \alpha,\eta>0, and \|\cdot\|_{1} and \|\cdot\|_{2} are the \ell_{1} and \ell_{2} norms, respectively.

The above LRSC condition has a larger tolerance term than that in Loh and Wainwright, (2015), which takes the form (\log p/n)\|\Delta\|_{1}^{2}; a similar tolerance term can also be found in Fan et al., (2019) for high-dimensional generalized trace regression. We are now ready to establish an upper bound on the estimation error when the tuning parameter \lambda is appropriately selected.
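For intuition, the first-order Taylor-series error \mathcal{E}_{n}(\Delta) can be computed directly for a toy model. The following Python sketch uses a linear specification Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}))=\boldsymbol{X}_{i}^{\prime}\boldsymbol{\beta} as a stand-in for the quantile index structure, and the constants \alpha=0.25 and \eta=1 are illustrative assumptions, not the values delivered by Lemma A.5.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 20
taus = np.array([0.5, 0.7, 0.9])
X = rng.standard_normal((n, p))
beta0 = np.zeros(p); beta0[:3] = 1.0
y = X @ beta0 + rng.standard_normal(n)

def rho(u, tau):                      # check function
    return u * (tau - (u < 0))

def loss(beta):                       # n^{-1} L_n(beta) for the toy linear model
    u = y - X @ beta
    return sum(rho(u, t).sum() for t in taus) / n

def grad(beta):                       # n^{-1} subgradient of L_n at beta
    u = y - X @ beta
    psi = np.stack([t - (u < 0) for t in taus])      # K x n scores
    return -(psi.sum(axis=0) @ X) / n

def taylor_error(delta):              # E_n(Delta) from the display above
    return loss(beta0 + delta) - loss(beta0) - grad(beta0) @ delta

delta = 0.5 * rng.standard_normal(p) / np.sqrt(p)
alpha, eta = 0.25, 1.0                # illustrative constants for the LRSC bound
lower = alpha * (delta @ delta) - eta * np.sqrt(np.log(p) / n) * np.abs(delta).sum()
print(taylor_error(delta), lower)     # E_n(Delta) exceeds the lower bound here
```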

Lemma A.3.

Suppose that the regularizer p_{\lambda}(\cdot) satisfies Assumption 4, and the loss function L_{n}(\cdot) satisfies the LRSC condition with \alpha>\mu/4 and r=\frac{12\eta\sqrt{s}}{(4\alpha-\mu)L}\sqrt{\frac{\log p}{n}}. If the tuning parameter \lambda satisfies

\lambda\geq\frac{4}{L}\max\left\{\|n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0})\|_{\infty},\eta\sqrt{\frac{\log p}{n}}\right\}, (A.18)

then the minimizer \widetilde{\boldsymbol{\beta}}_{n} over the set \Theta=\mathcal{B}_{R}(\boldsymbol{\beta}_{0}) satisfies the error bounds

\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|_{2}\leq\frac{6\sqrt{s}\lambda L}{4\alpha-\mu}\quad\text{and}\quad\|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\|_{1}\leq\frac{24s\lambda L}{4\alpha-\mu}. (A.19)

Proof.

Denote \widetilde{\Delta}_{n}=\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}, and it holds that \|\widetilde{\Delta}_{n}\|_{2}\leq R. Note that the lemma holds trivially if \|\widetilde{\Delta}_{n}\|_{2}\leq(3\sqrt{s}\lambda)/(4\alpha-\mu). As a result, we only need to consider the case with (3\sqrt{s}\lambda)/(4\alpha-\mu)\leq\|\widetilde{\Delta}_{n}\|_{2}\leq R, and it can be verified that

r=\frac{12\eta\sqrt{s}}{(4\alpha-\mu)L}\sqrt{\frac{\log p}{n}}\leq\|\widetilde{\Delta}_{n}\|_{2}\leq R. (A.20)

Note that n^{-1}L_{n}(\widetilde{\boldsymbol{\beta}}_{n})+p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})\leq n^{-1}L_{n}(\boldsymbol{\beta}_{0})+p_{\lambda}(\boldsymbol{\beta}_{0}), and then \mathcal{E}_{n}(\widetilde{\Delta}_{n})\leq p_{\lambda}(\boldsymbol{\beta}_{0})-p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})-\langle n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0}),\widetilde{\Delta}_{n}\rangle. This, together with the LRSC condition and Hölder's inequality, implies that

\alpha\|\widetilde{\Delta}_{n}\|_{2}^{2}\leq p_{\lambda}(\boldsymbol{\beta}_{0})-p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})+\left(\eta\sqrt{\frac{\log p}{n}}+\|n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0})\|_{\infty}\right)\|\widetilde{\Delta}_{n}\|_{1} (A.21)

\leq p_{\lambda}(\boldsymbol{\beta}_{0})-p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})+\frac{\lambda L}{2}\|\widetilde{\Delta}_{n}\|_{1} (A.22)

\leq p_{\lambda}(\boldsymbol{\beta}_{0})-p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})+\frac{1}{2}\left\{p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0})+\frac{\mu}{2}\|\widetilde{\Delta}_{n}\|_{2}^{2}\right\} (A.23)

\leq p_{\lambda}(\boldsymbol{\beta}_{0})-p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})+\frac{1}{2}\left\{p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})+p_{\lambda}(\boldsymbol{\beta}_{0})+\frac{\mu}{2}\|\widetilde{\Delta}_{n}\|_{2}^{2}\right\}, (A.24)

where the last inequality follows from the subadditivity of p_{\lambda}(\cdot), while the penultimate inequality is by Assumption 4; see also Lemma 4 in Loh and Wainwright, (2015). As a result,

0<\left(\alpha-\frac{\mu}{4}\right)\|\widetilde{\Delta}_{n}\|_{2}^{2}\leq\frac{3}{2}p_{\lambda}(\boldsymbol{\beta}_{0})-\frac{1}{2}p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n}). (A.25)

Moreover, from Lemma 5 in Loh and Wainwright, (2015), it holds that

0\leq 3p_{\lambda}(\boldsymbol{\beta}_{0})-p_{\lambda}(\widetilde{\boldsymbol{\beta}}_{n})\leq\lambda L(3\|(\widetilde{\Delta}_{n})_{A}\|_{1}-\|(\widetilde{\Delta}_{n})_{A^{c}}\|_{1}), (A.26)

where A is the index set of the s largest elements of \widetilde{\boldsymbol{\beta}}_{n} in magnitude. Combining (A.25) and (A.26), we have

\left(\alpha-\frac{\mu}{4}\right)\|\widetilde{\Delta}_{n}\|_{2}^{2}\leq\frac{3\lambda L}{2}\|(\widetilde{\Delta}_{n})_{A}\|_{1}\leq\frac{3\lambda L\sqrt{s}}{2}\|\widetilde{\Delta}_{n}\|_{2}. (A.27)

As a result,

\|\widetilde{\Delta}_{n}\|_{2}\leq\frac{6\sqrt{s}\lambda L}{4\alpha-\mu}. (A.28)

It is also implied by (A.26) that \|(\widetilde{\Delta}_{n})_{A^{c}}\|_{1}\leq 3\|(\widetilde{\Delta}_{n})_{A}\|_{1}, which leads to

\|\widetilde{\Delta}_{n}\|_{1}\leq\|(\widetilde{\Delta}_{n})_{A}\|_{1}+\|(\widetilde{\Delta}_{n})_{A^{c}}\|_{1}\leq 4\|(\widetilde{\Delta}_{n})_{A}\|_{1}\leq 4\sqrt{s}\|\widetilde{\Delta}_{n}\|_{2}. (A.29)

This accomplishes the proof of this lemma. ∎
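As a quick sanity check on the bounds at (A.19), one can plug in constants; the numbers below are illustrative assumptions only. Note that the \ell_{1} bound equals 4\sqrt{s} times the \ell_{2} bound, matching (A.29).

```python
import numpy as np

# illustrative constants: sparsity s, tuning lambda, and (L, alpha, mu) from
# Assumption 4 and the LRSC condition, chosen here only for demonstration
s, lam, L, alpha, mu = 9, 0.05, 1.0, 0.5, 0.1

l2_bound = 6 * np.sqrt(s) * lam * L / (4 * alpha - mu)
l1_bound = 24 * s * lam * L / (4 * alpha - mu)
print(l2_bound, l1_bound, l1_bound / l2_bound)   # ratio is 4*sqrt(s) = 12
```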

We next conduct the stochastic analysis, verifying in Lemmas A.4 and A.5 that the “good” event at (A.18) and the LRSC condition hold with high probability, respectively.

Lemma A.4.

If Assumption 5 holds, then

\|n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0})\|_{\infty}\leq C_{S}\sqrt{\frac{\log p}{n}} (A.30)

with probability at least 1-c_{1}p^{-c_{2}} for some positive constants c_{1}, c_{2} and C_{S}.

Proof.

Note that

n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0})=-\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\psi_{\tau_{k}}(e_{ik})\frac{\partial Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}_{0}))}{\partial\boldsymbol{\gamma}}\otimes\boldsymbol{X}_{i}, (A.31)

where e_{ik}=Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}_{0})), \psi_{\tau}(u)=\tau-I(u<0), and \boldsymbol{X}_{i}=(X_{1i},\ldots,X_{pi})^{\prime}. For 1\leq j\leq d, denote \xi_{j}(\boldsymbol{X}_{i})=\partial Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}_{0}))/\partial\gamma_{j} and, from Assumption 5, it holds that |\xi_{j}(\boldsymbol{X}_{i})|\leq L_{Q}.

It can be verified that, conditional on \boldsymbol{X}_{i}, \psi_{\tau_{k}}(e_{ik}) is sub-Gaussian with parameter 0.5, and hence, for any \delta>0,

E\exp[\delta n^{-1}\psi_{\tau_{k}}(e_{ik})\xi_{j}(\boldsymbol{X}_{i})X_{li}]\leq E\exp\{[\delta n^{-1}\xi_{j}(\boldsymbol{X}_{i})X_{li}]^{2}/8\}\leq\exp\{[\delta n^{-1}L_{Q}C_{X}]^{2}/8\}.

As a result, for each t>0, 1\leq j\leq d and 1\leq l\leq p,

P\left(\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K}\psi_{\tau_{k}}(e_{ik})\xi_{j}(\boldsymbol{X}_{i})X_{li}>t\right) (A.32)
\leq\inf_{\delta>0}\exp(-\delta t)\prod_{i=1}^{n}\prod_{k=1}^{K}E\exp[\delta n^{-1}\psi_{\tau_{k}}(e_{ik})\xi_{j}(\boldsymbol{X}_{i})X_{li}] (A.33)
\leq\inf_{\delta>0}\exp\left(\frac{n^{-1}K(L_{Q}C_{X})^{2}}{8}\delta^{2}-\delta t\right)\leq\exp\left(\frac{-2nt^{2}}{K(L_{Q}C_{X})^{2}}\right), (A.34)

which implies that

P\left(\|n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0})\|_{\infty}\geq t\right)\leq\exp\left(\frac{-2nt^{2}}{K(L_{Q}C_{X})^{2}}+\log(2dp)\right).

By letting t=C_{S}\sqrt{\log p/n} with C_{S}>\sqrt{0.5K}L_{Q}C_{X}, we accomplish the proof with c_{1}=2d and c_{2}=2C_{S}^{2}/[K(L_{Q}C_{X})^{2}]-1>0. ∎
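The maximal inequality at (A.30) can also be visualized by simulation. The Python sketch below takes \xi_{j}\equiv 1 and bounded Uniform(-1,1) covariates, so that L_{Q}=C_{X}=1; these choices are illustrative assumptions, and the printed maximum is typically dominated by C_{S}\sqrt{\log p/n} with C_{S}>\sqrt{0.5K}.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, taus = 1000, 200, np.array([0.5, 0.7, 0.9])
q_true = -np.log(1.0 - taus)                   # Exp(1) quantiles (closed form)

y = rng.exponential(size=n)
X = rng.uniform(-1.0, 1.0, size=(n, p))        # bounded covariates, C_X = 1
psi = (taus[None, :] - (y[:, None] <= q_true[None, :])).sum(axis=1)   # length n
score = np.abs(psi @ X).max() / n              # sup-norm of the averaged score
C_S = np.sqrt(0.5 * len(taus)) + 0.1           # any C_S > sqrt(0.5 K) works here
print(score, C_S * np.sqrt(np.log(p) / n))     # score typically falls below the bound
```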

Lemma A.5.

Suppose that Assumptions 4-6 hold. Given the sample size n\geq c^{\prime}\log p for some c^{\prime}>0, it holds that

\mathcal{E}_{n}(\Delta)\geq\alpha\|\Delta\|_{2}^{2}-\eta\sqrt{\frac{\log p}{n}}\|\Delta\|_{1}\quad\text{for all }r\leq\|\Delta\|_{2}\leq R

with probability at least 1-c_{1}p^{-c_{2}}-K\log(\sqrt{dp}R/r)p^{-c^{2}} for any c>1, where \alpha=0.5f_{\min}\lambda_{\min}^{0} and \eta=KC_{E}d2^{d+1}+2KL_{Q}C_{X}c+C_{S}.

Proof.

We first show the strong convexity of \bar{L}(\boldsymbol{\beta})=E[n^{-1}L_{n}(\boldsymbol{\beta})]=E[\sum_{k=1}^{K}\rho_{\tau_{k}}\{Y-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))\}]. Let Q^{*}(\boldsymbol{\beta})=(Q(\tau_{1},\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta})),\ldots,Q(\tau_{K},\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta})))^{\prime} and, by Taylor expansion,

E\|Q^{*}(\boldsymbol{\beta})-Q^{*}(\boldsymbol{\beta}_{0})\|_{2}^{2}=E\left\|\frac{\partial Q^{*}(\boldsymbol{\beta}^{*})}{\partial\boldsymbol{\beta}}\Delta\right\|_{2}^{2}=\Delta^{\prime}\Omega_{2}(\boldsymbol{\beta}^{*})\Delta,

where \Delta=\boldsymbol{\beta}-\boldsymbol{\beta}_{0}, and \boldsymbol{\beta}^{*} lies between \boldsymbol{\beta}_{0} and \boldsymbol{\beta}. Note that \partial\bar{L}(\boldsymbol{\beta}_{0})/\partial\boldsymbol{\beta}=0 and hence, by Knight's identity at (A.5) and Assumption 6, it can be verified that

\bar{\mathcal{E}}(\Delta):=\bar{L}(\boldsymbol{\beta})-\bar{L}(\boldsymbol{\beta}_{0})-\langle\Delta,\partial\bar{L}(\boldsymbol{\beta}_{0})/\partial\boldsymbol{\beta}\rangle=\bar{L}(\boldsymbol{\beta})-\bar{L}(\boldsymbol{\beta}_{0})=E\left(\sum_{k=1}^{K}\int_{0}^{Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta}))-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta}_{0}))}\{F_{Y|\boldsymbol{X}}(Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta}_{0}))+s)-F_{Y|\boldsymbol{X}}(Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X},\boldsymbol{\beta}_{0})))\}ds\right)\geq 0.5f_{\min}E\|Q^{*}(\boldsymbol{\beta})-Q^{*}(\boldsymbol{\beta}_{0})\|_{2}^{2}\geq 0.5f_{\min}\lambda_{\min}^{0}\|\Delta\|_{2}^{2} (A.35)

uniformly over \{\Delta\in\mathbb{R}^{dp}:\|\Delta\|_{2}\leq R\}.

For 1\leq k\leq K, denote L_{n}^{(k)}(\boldsymbol{\beta})=\sum_{i=1}^{n}\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}))\} and \bar{L}_{k}(\boldsymbol{\beta})=E[n^{-1}L_{n}^{(k)}(\boldsymbol{\beta})]=E[\rho_{\tau_{k}}\{Y-Q(\tau_{k},\boldsymbol{\theta}(\mathbf{X},\boldsymbol{\beta}))\}]. Note that L_{n}(\boldsymbol{\beta})=\sum_{k=1}^{K}L_{n}^{(k)}(\boldsymbol{\beta}) and \bar{L}(\boldsymbol{\beta})=\sum_{k=1}^{K}\bar{L}_{k}(\boldsymbol{\beta}). Let \mathcal{E}_{k}^{*}(\Delta)=|n^{-1}L_{n}^{(k)}(\boldsymbol{\beta})-n^{-1}L_{n}^{(k)}(\boldsymbol{\beta}_{0})-\{\bar{L}_{k}(\boldsymbol{\beta})-\bar{L}_{k}(\boldsymbol{\beta}_{0})\}|, and we next prove that, uniformly for r\leq\|\Delta\|_{2}\leq R,

\mathcal{E}_{k}^{*}(\Delta)\leq C_{\mathcal{E}}\sqrt{\frac{\log p}{n}}\|\Delta\|_{1} (A.36)

with probability at least 1-\log(\sqrt{dp}R/r)p^{-c^{2}} for any c>1, where C_{\mathcal{E}}=C_{E}d2^{d+1}+2L_{Q}C_{X}c. As in Theorem 9.34 of Wainwright, (2019), we use the peeling argument, a common strategy in empirical process theory.

Tail bound for a fixed radius: Define the set \mathcal{C}(r_{1}):=\{\Delta\in\mathbb{R}^{dp}:\|\Delta\|_{1}\leq r_{1}\} for a fixed radius r_{1}>0, and the random variable A_{n}(r_{1})=r_{1}^{-1}\sup_{\Delta\in\mathcal{C}(r_{1})}\mathcal{E}_{k}^{*}(\Delta). We next show that, for any t>0,

A_{n}(r_{1})\leq C_{E}d2^{d}\sqrt{\frac{\log p}{n}}+L_{Q}C_{X}\sqrt{\frac{t}{n}} (A.37)

with probability at least 1-e^{-t}.

For 1\leq i\leq n, denote \boldsymbol{W}_{i}=(Y_{i},\boldsymbol{X}_{i}^{\prime})^{\prime}. Note that the random variable A_{n}(r_{1}) is of the form f(\boldsymbol{W}_{1},\ldots,\boldsymbol{W}_{n}), and it is guaranteed by Assumption 5 that

|f(\boldsymbol{W}_{1},\ldots,\boldsymbol{W}_{i},\ldots,\boldsymbol{W}_{n})-f(\boldsymbol{W}_{1},\ldots,\boldsymbol{W}_{i}^{\prime},\ldots,\boldsymbol{W}_{n})|\leq n^{-1}L_{Q}C_{X},

i.e., if we replace \boldsymbol{W}_{i} by \boldsymbol{W}_{i}^{\prime} while keeping the other \boldsymbol{W}_{j}'s fixed, then A_{n}(r_{1}) changes by at most n^{-1}L_{Q}C_{X}. As a result, by the bounded differences inequality, for any t>0,

A_{n}(r_{1})\leq E[A_{n}(r_{1})]+L_{Q}C_{X}\sqrt{\frac{t}{n}} (A.38)

with probability at least 1-e^{-t}.

In addition, it is implied by Assumption 5 that, for all \boldsymbol{\beta},\widetilde{\boldsymbol{\beta}}\in\mathbb{R}^{dp},

|\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}))\}-\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\widetilde{\boldsymbol{\beta}}))\}|\leq L_{Q}\sum_{l=1}^{d}|\boldsymbol{X}_{i}^{\prime}(\boldsymbol{\beta}_{l}-\widetilde{\boldsymbol{\beta}}_{l})|,

which leads to

E[A_{n}(r_{1})]\leq\frac{2}{nr_{1}}E\left(\sup_{\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\in\mathcal{C}(r_{1})}\left|\sum_{i=1}^{n}\epsilon_{i}\left[\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}))\}-\rho_{\tau_{k}}\{Y_{i}-Q(\tau_{k},\boldsymbol{\theta}(\boldsymbol{X}_{i},\boldsymbol{\beta}_{0}))\}\right]\right|\right)\leq\frac{C2^{d}}{nr_{1}}E\left(\sup_{\boldsymbol{\beta}-\boldsymbol{\beta}_{0}\in\mathcal{C}(r_{1})}\left|\sum_{l=1}^{d}\sum_{i=1}^{n}V_{il}\{\boldsymbol{X}_{i}^{\prime}(\boldsymbol{\beta}_{l}-\boldsymbol{\beta}_{0l})\}\right|\right)\leq\frac{C2^{d}}{n}E\left(\sum_{l=1}^{d}\left\|\sum_{i=1}^{n}V_{il}\boldsymbol{X}_{i}\right\|_{\infty}\right)\leq\frac{Cd2^{d}}{n}E\left(\left\|\sum_{i=1}^{n}V_{i1}\boldsymbol{X}_{i}\right\|_{\infty}\right)\leq C_{E}d2^{d}\sqrt{\frac{\log p}{n}}, (A.39)

where \{\epsilon_{i}\} and \{V_{il}\} are i.i.d. Rademacher and standard Gaussian random variables, respectively; the first inequality is due to the symmetrization theorem (Loh and Wainwright,, 2015, Lemma 12), the second is by the multivariate contraction theorem (van de Geer,, 2016, Theorem 16.3), the third is due to the fact that |\boldsymbol{X}^{\prime}\boldsymbol{\beta}|\leq\|\boldsymbol{X}\|_{\infty}\|\boldsymbol{\beta}\|_{1}, and the last is by Lemma 16 of Loh and Wainwright, (2015) given the sample size n\geq c^{\prime}\log p for some c^{\prime}>0. The upper bound at (A.37) then follows by combining (A.38) and (A.39).

Extension to uniform radii via peeling: Define a sequence of sets \Theta_{l}:=\{\Delta\in\mathbb{R}^{dp}:2^{l-1}r\leq\|\Delta\|_{1}\leq 2^{l}r\} with 1\leq l\leq N=\log(\sqrt{dp}R/r). It can be verified that \{\Delta\in\mathbb{R}^{dp}:r\leq\|\Delta\|_{2}\leq R\}\subseteq\{\Delta\in\mathbb{R}^{dp}:r\leq\|\Delta\|_{1}\leq\sqrt{dp}R\}\subseteq\cup_{l=1}^{N}\Theta_{l}. As a result,

P\left(\mathcal{E}_{k}^{*}(\Delta)\geq C_{\mathcal{E}}\sqrt{\frac{\log p}{n}}\|\Delta\|_{1}\text{ for some }\Delta\in\cup_{l=1}^{N}\Theta_{l}\right)\leq\sum_{l=1}^{N}P\left(\mathcal{E}_{k}^{*}(\Delta)\geq 2^{l-1}rC_{\mathcal{E}}\sqrt{\frac{\log p}{n}}\text{ for some }\Delta\in\Theta_{l}\right)

=\sum_{l=1}^{N}P\left(\mathcal{E}_{k}^{*}(\Delta)\geq(C_{E}d2^{d})(2^{l}r)\sqrt{\frac{\log p}{n}}+L_{Q}C_{X}(2^{l}r)\sqrt{\frac{c^{2}\log p}{n}}\text{ for some }\Delta\in\Theta_{l}\right)

\leq\sum_{l=1}^{N}P\left(A_{n}(2^{l}r)\geq C_{E}d2^{d}\sqrt{\frac{\log p}{n}}+L_{Q}C_{X}\sqrt{\frac{c^{2}\log p}{n}}\right),

where C_{\mathcal{E}}=C_{E}d2^{d+1}+2L_{Q}C_{X}c. By applying (A.37) with t=c^{2}\log p, it holds that

P\left(\mathcal{E}_{k}^{*}(\Delta)\geq C_{\mathcal{E}}\sqrt{\frac{\log p}{n}}\|\Delta\|_{1}\text{ for some }r\leq\|\Delta\|_{2}\leq R\right)\leq\sum_{l=1}^{N}e^{-c^{2}\log p}=\log(\sqrt{dp}R/r)p^{-c^{2}},

i.e., (A.36) holds.

Finally, from (A.35), (A.36) and Lemma A.4,

\mathcal{E}_{n}(\Delta)\geq\bar{\mathcal{E}}(\Delta)-\sum_{k=1}^{K}\mathcal{E}_{k}^{*}(\Delta)-\|n^{-1}\nabla L_{n}(\boldsymbol{\beta}_{0})\|_{\infty}\|\Delta\|_{1}\geq 0.5f_{\min}\lambda_{\min}^{0}\|\Delta\|_{2}^{2}-(KC_{\mathcal{E}}+C_{S})\sqrt{\frac{\log p}{n}}\|\Delta\|_{1}

uniformly for r\leq\|\Delta\|_{2}\leq R with probability at least 1-c_{1}p^{-c_{2}}-K\log(\sqrt{dp}R/r)p^{-c^{2}} for any c>1. This accomplishes the proof. ∎

Proof of Theorem 4.

The stochastic analysis in Lemmas A.4 and A.5 shows that, with high probability, the selected tuning parameter \lambda satisfies (A.18) and the loss function satisfies the LRSC condition. The claimed error bounds then follow from the deterministic analysis in Lemma A.3. ∎

References

  • Abrevaya, (2001) Abrevaya, J. (2001). The effect of demographics and maternal behavior on the distribution of birth outcomes. Empirical Economics, 26:247–259.
  • Andrews, (1987) Andrews, D. W. K. (1987). Consistency in nonlinear econometric models: A generic uniform law of large numbers. Econometrica, 55:1465–1471.
  • Beirlant and Goegebeur, (2004) Beirlant, J. and Goegebeur, Y. (2004). Local polynomial maximum likelihood estimation for Pareto-type distributions. Journal of Multivariate Analysis, 89:97–118.
  • Belloni and Chernozhukov, (2011) Belloni, A. and Chernozhukov, V. (2011). l1l_{1}-penalized quantile regression in high-dimensional sparse models. The Annals of Statistics, 39:82–130.
  • Belloni et al., (2019a) Belloni, A., Chernozhukov, V., Chetverikov, D., and Fernandez-Val, I. (2019a). Conditional quantile processes based on series or many regressors. Journal of Econometrics, 213:4–29.
  • Belloni et al., (2019b) Belloni, A., Chernozhukov, V., and Kato, K. (2019b). Valid post-selection inference in high-dimensional approximately sparse quantile regression models. Journal of the American Statistical Association, 114:749–758.
  • Bertsekas, (2016) Bertsekas, D. P. (2016). Nonlinear Programming. Athena Scientific.
  • Cade and Noon, (2003) Cade, B. S. and Noon, B. R. (2003). A gentle introduction to quantile regression for ecologists. Frontiers in Ecology and the Environment, 1:412–420.
  • Chernozhukov, (2005) Chernozhukov, V. (2005). Extremal quantile regression. The Annals of Statistics, 33:806–839.
  • Chernozhukov et al., (2015) Chernozhukov, V., Fernández-Val, I., and Kowalski, A. E. (2015). Quantile regression with censoring and endogeneity. Journal of Econometrics, 186:201–221.
  • Chernozhukov et al., (2020) Chernozhukov, V., Fernández-Val, I., and Melly, B. (2020). Fast algorithms for the quantile regression process. Empirical Economics.
  • Davino et al., (2014) Davino, C., Furno, M., and Vistocco, D. (2014). Quantile Regression: Theory and Applications. Wiley, New York.
  • Dong et al., (2019) Dong, C., Li, G., and Feng, X. (2019). Lack-of-fit tests for quantile regression models. Journal of the Royal Statistical Society, Series B, 81:629–648.
  • Elsner et al., (2008) Elsner, J. B., Kossin, J. P., and Jagger, T. H. (2008). The increasing intensity of the strongest tropical cyclones. Nature, 455:92–95.
  • Fan et al., (2019) Fan, J., Gong, W., and Zhu, Z. (2019). Generalized high-dimensional trace regression via nuclear norm regularization. Journal of Econometrics, 212:177–202.
  • Fan and Li, (2001) Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360.
  • Fenske et al., (2011) Fenske, N., Kneib, T., and Hothorn, T. (2011). Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression. Journal of the American Statistical Association, 106:494–510.
  • Fournier et al., (2007) Fournier, B., Rupin, N., Bigerelle, M., Najjar, D., Iost, A., and Wilcox, R. (2007). Estimating the parameters of a generalized lambda distribution. Computational Statistics & Data Analysis, 51:2813–2835.
  • Freimer et al., (1988) Freimer, M., Kollia, G., Mudholkar, G. S., and Lin, C. T. (1988). A study of the generalized Tukey lambda family. Communications in Statistics-Theory and Methods, 17:3547–3567.
  • Galvao and Kato, (2016) Galvao, A. F. and Kato, K. (2016). Smoothed quantile regression for panel data. Journal of Econometrics, 193:92–112.
  • Gilchrist, (2000) Gilchrist, W. G. (2000). Statistical Modelling with Quantile Functions. Chapman & Hall/CRC, London.
  • Hendricks and Koenker, (1991) Hendricks, W. and Koenker, R. (1991). Hierarchical spline models for conditional quantiles and the demand for electricity. Journal of the American Statistical Association, 87:58–68.
  • Jagger and Elsner, (2008) Jagger, T. H. and Elsner, J. B. (2008). Modeling tropical cyclone intensity with quantile regression. International Journal of Climatology, 29:1351–1361.
  • Jiang et al., (2012) Jiang, X., Jiang, J., and Song, X. (2012). Oracle model selection for nonlinear models based on weighted composite quantile regression. Statistica Sinica, 22:1479–1506.
  • Kai et al., (2010) Kai, B., Li, R., and Zou, H. (2010). Local composite quantile regression smoothing: An efficient and safe alternative to local polynomial regression. Journal of the Royal Statistical Society, Series B, 72:49–69.
  • Kato et al., (2012) Kato, K., Galvao, A. F., and Montes-Rojas, G. V. (2012). Asymptotics for panel quantile regression models with individual effects. Journal of Econometrics, 170:76–91.
  • Knight, (1998) Knight, K. (1998). Limiting distributions for L1 regression estimators under general conditions. The Annals of Statistics, 26:755–770.
  • Koenker, (2005) Koenker, R. (2005). Quantile Regression. Cambridge University Press, Cambridge.
  • Koenker, (2011) Koenker, R. (2011). Additive models for quantile regression: Model selection and confidence bandaids. Brazilian Journal of Probability and Statistics, 25:239–262.
  • Koenker, (2017) Koenker, R. (2017). Quantile regression: 40 years on. Annual Review of Economics, 9:155–176.
  • Koenker and Bassett, (1978) Koenker, R. and Bassett, G. (1978). Regression quantiles. Econometrica, 46:33–50.
  • Kuester et al., (2006) Kuester, K., Mittnik, S., and Paolella, M. S. (2006). Value-at-Risk prediction: A comparison of alternative strategies. Journal of Financial Econometrics, 4:53–89.
  • Linton and Xiao, (2017) Linton, O. and Xiao, Z. (2017). Quantile regression applications in finance. In Handbook of Quantile Regression, pages 381–407. Chapman and Hall/CRC.
  • Lobeto et al., (2021) Lobeto, H., Menendez, M., and Losada, I. J. (2021). Future behavior of wind wave extremes due to climate change. Scientific Reports, 11:1–12.
  • Loh, (2017) Loh, P. (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. The Annals of Statistics, 45:866–896.
  • Loh and Wainwright, (2015) Loh, P. and Wainwright, M. J. (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research, 16:559–616.
  • Meinshausen and Ridgeway, (2006) Meinshausen, N. and Ridgeway, G. (2006). Quantile regression forests. Journal of Machine Learning Research, 7:983–999.
  • Moon et al., (2021) Moon, S. J., Jeon, J.-J., Lee, J. S. H., and Kim, Y. (2021). Learning multiple quantiles with neural networks. Journal of Computational and Graphical Statistics, In press.
  • Noufaily and Jones, (2013) Noufaily, A. and Jones, M. C. (2013). Parametric quantile regression based on the generalized gamma distribution. Journal of the Royal Statistical Society, Series C, 62:723–740.
  • Pan and Zhou, (2021) Pan, X. and Zhou, W.-X. (2021). Multiplier bootstrap for quantile regression: Non-asymptotic theory under random design. Information and Inference, 3:813–861.
  • Pollard, (1985) Pollard, D. (1985). New ways to prove central limit theorems. Econometric Theory, 1:295–313.
  • Racine and Li, (2017) Racine, J. S. and Li, K. (2017). Nonparametric conditional quantile estimation: a locally weighted quantile kernel approach. Journal of Econometrics, 201:72–94.
  • Sherwood et al., (2016) Sherwood, B., Wang, L., et al. (2016). Partially linear additive quantile regression in ultra-high dimension. The Annals of Statistics, 44:288–317.
  • Sivakumar and Banerjee, (2017) Sivakumar, V. and Banerjee, A. (2017). High-dimensional structured quantile regression. In International Conference on Machine Learning, pages 3220–3229.
  • van de Geer, (2016) van de Geer, S. A. (2016). Estimation and testing under sparsity. Springer.
  • van der Vaart, (1998) van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press, Cambridge.
  • Vasicek, (1976) Vasicek, O. (1976). A test for normality based on sample entropy. Journal of the Royal Statistical Society, Series B, 38:54–59.
  • Wainwright, (2019) Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge University Press.
  • Wang and Tsai, (2009) Wang, H. and Tsai, C.-L. (2009). Tail index regression. Journal of the American Statistical Association, 104:1233–1240.
  • Wang and Li, (2013) Wang, H. J. and Li, D. (2013). Estimation of extreme conditional quantiles through power transformation. Journal of the American Statistical Association, 108:1062–1074.
  • Wang et al., (2012a) Wang, H. J., Li, D., and He, X. (2012a). Estimation of high conditional quantiles for heavy-tailed distributions. Journal of the American Statistical Association, 107:1453–1464.
  • Wang and He, (2021) Wang, L. and He, X. (2021). Analysis of global and local optima of regularized quantile regression in high dimension: a subgradient approach. Technical report, Miami Herbert Business School, University of Miami.
  • Wang et al., (2012b) Wang, L., Wu, Y., and Li, R. (2012b). Quantile regression for analyzing heterogeneity in ultra-high dimension. Journal of the American Statistical Association, 107:214–222.
  • Yu et al., (2003) Yu, K., Lu, Z., and Stander, J. (2003). Quantile regression: applications and current research areas. Journal of the Royal Statistical Society: Series D, 52:331–350.
  • Zhang, (2010) Zhang, C.-H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38:894–942.
  • Zheng et al., (2015) Zheng, Q., Peng, L., and He, X. (2015). Globally adaptive quantile regression with ultra-high dimensional data. The Annals of Statistics, 43:2225–2258.
  • Zheng et al., (2018) Zheng, Y., Zhu, Q., Li, G., and Xiao, Z. (2018). Hybrid quantile regression estimation for time series models with conditional heteroscedasticity. Journal of the Royal Statistical Society, Series B, 80:975–993.
  • Zhu and Ling, (2011) Zhu, K. and Ling, S. (2011). Global self-weighted and local quasi-maximum exponential likelihood estimators for ARMA-GARCH/IGARCH models. The Annals of Statistics, 39:2131–2163.
  • Zhu and Li, (2021) Zhu, Q. and Li, G. (2021). Quantile double autoregression. Econometric Theory, In press.
  • Zhu et al., (2018) Zhu, Q., Zheng, Y., and Li, G. (2018). Linear double autoregression. Journal of Econometrics, 207:162–174.
  • Zou and Yuan, (2008) Zou, H. and Yuan, M. (2008). Composite quantile regression and the oracle model selection theory. The Annals of Statistics, 36:1108–1126.
Figure 1: Boxplots for the fitted location parameters \widehat{\boldsymbol{\beta}}_{1,1} (left panel), \widehat{\boldsymbol{\beta}}_{1,2} (middle panel), and \widehat{\boldsymbol{\beta}}_{1,3} (right panel). The sample size is n=500, 1000 or 2000, and the lower bound of the quantile range [\tau_{L},\tau_{U}] is \tau_{L}=0.5, 0.7 or 0.9.
Figure 2: Estimation errors \|\widetilde{\boldsymbol{\beta}}_{n}-\boldsymbol{\beta}_{0}\| against the quantities \sqrt{(s\log p)/n} (left panel) and n/(s\log p) (right panel), respectively.
Table 1: Mean square errors of the predicted conditional quantile Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widehat{\boldsymbol{\beta}}_{n})) at the level \tau^{*}=0.991 or 0.995. The values in brackets refer to the corresponding sample standard deviations of prediction errors in squared loss.

                               \boldsymbol{X}=(1,0.1,-0.2)^{T}          \boldsymbol{X}=(1,0,0)^{T}
  n      [\tau_L,\tau_U]       0.991         0.995                      0.991         0.995
  True                         10.34         11.83                      15.13         18.84
  500    [0.5,0.99]            1.32(2.17)    2.35(4.11)                 5.41(11.41)   12.13(28.19)
         [0.7,0.99]            1.42(2.20)    2.55(4.07)                 5.42(10.95)   12.12(26.73)
         [0.9,0.99]            2.00(3.92)    3.64(7.56)                 6.18(12.86)   14.10(32.78)
  1000   [0.5,0.99]            0.77(1.68)    1.39(3.29)                 2.67(5.28)    5.93(12.61)
         [0.7,0.99]            0.80(1.39)    1.44(2.64)                 2.62(4.27)    5.75(9.75)
         [0.9,0.99]            1.31(2.53)    2.44(5.07)                 3.22(5.08)    7.23(11.78)
  2000   [0.5,0.99]            0.32(0.47)    0.57(0.85)                 1.03(1.56)    2.25(3.49)
         [0.7,0.99]            0.36(0.49)    0.64(0.90)                 1.05(1.47)    2.31(3.24)
         [0.9,0.99]            0.70(1.34)    1.30(2.44)                 1.34(1.75)    3.05(4.06)
Table 2: Mean square errors of the predicted conditional quantile Q(\tau^{*},\boldsymbol{\theta}(\mathbf{X},\widetilde{\boldsymbol{\beta}}_{n})) at the level \tau^{*}=0.991 or 0.995 with p=50 and n=\lfloor ck\log p\rfloor. The values in brackets refer to the corresponding sample standard deviations of prediction errors in squared loss.

                               \boldsymbol{X}=(1,0.1,-0.2,0,\cdots,0)^{T}   \boldsymbol{X}=(1,0,0,0,\cdots,0)^{T}
  c      [\tau_L,\tau_U]       0.991         0.995                      0.991         0.995
  True                         10.34         11.83                      15.13         18.84
  10     [0.5,0.99]            1.82(2.61)    3.23(4.82)                 6.75(9.47)    14.83(21.98)
         [0.7,0.99]            2.05(6.00)    3.78(13.10)                6.11(8.22)    13.32(18.56)
         [0.9,0.99]            2.92(7.19)    5.44(15.60)                7.24(10.36)   15.91(25.05)
  30     [0.5,0.99]            0.65(1.59)    1.15(3.12)                 1.97(3.08)    4.26(6.90)
         [0.7,0.99]            0.65(1.49)    1.18(2.94)                 1.96(2.85)    4.26(6.31)
         [0.9,0.99]            0.92(2.15)    1.66(4.08)                 2.22(2.95)    4.86(6.40)
  50     [0.5,0.99]            0.33(0.49)    0.58(0.84)                 1.17(1.77)    2.52(3.88)
         [0.7,0.99]            0.39(0.57)    0.69(1.01)                 1.28(1.93)    2.78(4.26)
         [0.9,0.99]            0.54(0.86)    0.99(1.58)                 1.55(2.39)    3.51(5.57)
Table 3: Selection results for the regularized estimation with p=50 and n=\lfloor ck\log p\rfloor. The values in brackets are the corresponding standard errors.

  [\tau_L,\tau_U]   c    size         P_AI    P_A    P_I    FP           FN
  [0.5,0.99]        10   9.04(0.99)   91.6    96     95.6   0.06(0.68)   0.47(2.34)
                    30   9.00(0.00)   100     100    100    0.00(0.00)   0.00(0.00)
                    50   9.00(0.00)   100     100    100    0.00(0.00)   0.00(0.00)
  [0.7,0.99]        10   8.91(1.25)   79      82.6   95.4   0.07(0.84)   2.02(4.52)
                    30   9.00(0.08)   99.4    99.6   99.8   0.00(0.03)   0.04(0.70)
                    50   9.00(0.00)   100     100    100    0.00(0.00)   0.00(0.00)
  [0.9,0.99]        10   8.56(0.99)   48.4    54.6   90.4   0.10(0.41)   6.51(8.43)
                    30   8.88(0.38)   87.8    88.4   99.2   0.01(0.06)   1.42(4.10)
                    50   8.96(0.23)   96.4    96.6   99.8   0.00(0.03)   0.44(2.50)
Figure 3: Plot of PEs against \tau_{L} (left panel), and boxplots of PEs from the extreme quantile regression (EQR), linear quantile regression (Linear) and QIR models at the two target levels \tau^{*}=0.991 and 0.995 (right panel).
Figure 4: Quantile curves for child's age in months (top left) and mother's education in years (top right) at the three target quantiles, together with the effects of child's sex (bottom left) and mother's unemployment status (bottom right).