
Variable Selection in Doubly Truncated Regression

Ming Zheng, Chanjuan Lin, and Wen Yu
Department of Statistics, School of Management, Fudan University
Abstract

Doubly truncated data arise in many areas such as astronomy, econometrics, and medical studies. For regression analysis with doubly truncated response variables, the double truncation may bias the estimation as well as affect variable selection. We propose a simultaneous estimation and variable selection procedure for doubly truncated regression, allowing a diverging number of regression parameters. To remove the bias introduced by the double truncation, a Mann-Whitney-type loss function is used. The adaptive LASSO penalty is then added to the loss function to achieve simultaneous estimation and variable selection. An iterative algorithm is designed to optimize the resulting objective function. We establish the consistency and the asymptotic normality of the proposed estimator. The oracle property of the proposed selection procedure is also obtained. Simulation studies are conducted to show the finite sample performance of the proposed approach. We also apply the method to analyze a real astronomical data set.

Keywords: Adaptive LASSO; Double truncation; Diverging number of parameters; Least absolute deviation; Oracle property; Variable selection.

Accepted by SCIENTIA SINICA Mathematica (in Chinese)

1 Introduction

Truncated data arise in astronomy, econometrics, and survival analysis, among other fields. Under truncation, only those subjects that fall within an interval can be observed, along with the interval. Subjects that fall outside their respective intervals are not known to exist and, consequently, have no chance of being observed. When the truncation interval is unbounded from above, the truncation is called left-truncation; when the interval is unbounded from below, it is called right-truncation. Double truncation occurs when the truncation interval is bounded on both sides.

Truncation may bring systematic bias to statistical analysis if it is not dealt with appropriately. For nonparametric distribution estimation, Turnbull (1976) developed a general algorithm for finding the nonparametric maximum likelihood estimator for arbitrarily grouped, censored, or truncated data. Lynden-Bell (1971) obtained a similar estimator for singly truncated data. Counting process techniques were then adopted by Wang et al. (1986), Keiding and Gill (1990), Lai and Ying (1991a), among others. For regression analysis with single truncation, one can refer to Bhattacharya et al. (1983) for an extended Mann-Whitney approach, Tsui et al. (1988) for iterative bias adjustment, Tsai (1990) for Kendall's tau correlation, Lai and Ying (1991b) for rank-based estimation, and so on. Some more recent developments include Greene (2012) for econometrics and Kim et al. (2013) and Liu et al. (2016) for general biased sampling.

Double truncation is technically more challenging to deal with than single truncation, so fewer results are found in the existing literature. For distribution estimation, Efron and Petrosian (1999) and Shen (2010) developed nonparametric maximum likelihood estimation. Moreira and Una-Alvarez (2012) proposed a kernel-type density estimation approach. For the two-sample problem, Bilker and Wang (1996) and Shen (2013a) extended the Mann-Whitney test. For regression analysis with doubly truncated responses, Moreira and Una-Alvarez (2016) proposed a kernel-type approach for low dimensional covariates. Shen (2013b) considered estimating a class of semiparametric transformation models. More recently, Ying et al. (2020) proposed an extended Mann-Whitney-type loss function for parameter estimation in the linear regression model with fixed covariate dimension. A Mann-Whitney-type rank estimator is then defined as its minimizer.

Besides parameter estimation, double truncation may also complicate variable selection in regression analysis. To the best of our knowledge, no existing work has studied variable selection for linear regression models with doubly truncated responses. Based on the extended Mann-Whitney-type loss function proposed by Ying et al. (2020), we adopt the adaptive LASSO penalty (Zou, 2006) to develop a simultaneous estimation and variable selection procedure for doubly truncated linear models. The proposed procedure can not only deal with fixed-dimensional covariates, but is also applicable to the diverging model size where the dimension of the covariates goes to infinity with the sample size. We show that the adaptive LASSO penalty leads to the oracle properties of the selection procedure. Meanwhile, an iterative algorithm is designed for minimizing the proposed objective function. In each iteration, a least absolute deviation (LAD) optimization is involved and can be solved efficiently through a standard LAD algorithm. A modified BIC approach is used for selecting the tuning parameter. We also develop a random weighting approach for estimating the standard errors of the non-zero regression parameters. Numerical studies are conducted to illustrate the finite sample behavior of the proposed approach.

The rest of the paper is organized as follows. The next section introduces the necessary notation and the estimation procedure for doubly truncated regression models. In Section 3, we present the main results of the variable selection procedure, including the basic idea of the selection method, the algorithm, the oracle properties of the selection procedure, and the approach to determine the tuning parameter. In Section 4, the results of simulation studies are presented to show the finite sample behavior of the proposed approach and a real dataset is analyzed to illustrate the approach. Some concluding remarks are given in Section 5. All the technical details are summarized in the Appendix.

2 Notation, model specification, and estimation

First, some notation is introduced. Let $\tilde{Y}$ be the response variable and $\tilde{X}$ be a $p$-dimensional covariate vector. We consider the linear location-shift model

$$\tilde{Y}=\beta^{\top}\tilde{X}+\tilde{\varepsilon}, \qquad (1)$$

where $\beta$ is the $p$-dimensional regression coefficient vector and $\tilde{\varepsilon}$ is the error term independent of $\tilde{X}$. The response $\tilde{Y}$ is subject to double truncation. Let $\tilde{L}$ and $\tilde{R}$ be the left and right truncation variables, respectively. $(\tilde{Y},\tilde{L},\tilde{R},\tilde{X})$ can be observed if and only if $\tilde{Y}$ falls into the interval $(\tilde{L},\tilde{R})$. We also assume that $\tilde{Y}$ is conditionally independent of $(\tilde{L},\tilde{R})$ given $\tilde{X}$. This is a common assumption in truncated data analysis. Under model (1), this assumption is equivalent to assuming that $\tilde{\varepsilon}$ is independent of $(\tilde{L},\tilde{R})$. The distribution and density functions of $\tilde{\varepsilon}$ are denoted by $F$ and $f$, respectively.

Let $(\tilde{Y}_{i},\tilde{L}_{i},\tilde{R}_{i},\tilde{X}_{i})$, $i=1,\ldots,\tilde{n}$, be independent and identically distributed (i.i.d.) copies of $(\tilde{Y},\tilde{L},\tilde{R},\tilde{X})$. As mentioned, $\tilde{Y}_{i}$ is observed if and only if $\tilde{L}_{i}<\tilde{Y}_{i}<\tilde{R}_{i}$, together with $(\tilde{L}_{i},\tilde{R}_{i},\tilde{X}_{i})$. Denote the number of observed data points by $n$, that is, $n=\#\{i:\tilde{L}_{i}<\tilde{Y}_{i}<\tilde{R}_{i}\}$. We use $(Y_{i},L_{i},R_{i},X_{i})$, $i=1,\ldots,n$, to denote the observed data points. Moreover, for $i=1,\ldots,n$, let $e_{i}(\beta)=Y_{i}-\beta^{\top}X_{i}$, $L_{i}(\beta)=L_{i}-\beta^{\top}X_{i}$, and $R_{i}(\beta)=R_{i}-\beta^{\top}X_{i}$.

To correct the bias brought by the double truncation, Ying et al. (2020) proposed the modified Mann-Whitney-type rank estimating equation

$$\sum_{i=1}^{n}\sum_{j=1}^{n}I\{L_{j}(\beta)<e_{i}(\beta)<R_{j}(\beta),\,L_{i}(\beta)<e_{j}(\beta)<R_{i}(\beta)\}(X_{i}-X_{j})\,\mathrm{sgn}\{e_{i}(\beta)-e_{j}(\beta)\}=0, \qquad (2)$$

where $I\{\cdot\}$ is the indicator function and $\mathrm{sgn}\{\cdot\}$ is the sign function. By introducing the indicators, the estimating function sums over all the so-called comparable pairs and becomes unbiased when $\beta$ takes the true value. Meanwhile, solving (2) is equivalent to minimizing the following loss function

$$l_{n}(\beta)=\sum_{i=1}^{n}\sum_{j=1}^{n}\left|\left[(e_{i}(\beta)-e_{j}(\beta))\wedge(R_{j}-Y_{j})\wedge(Y_{i}-L_{i})\right]\vee(L_{j}-Y_{j})\vee(Y_{i}-R_{i})\right|, \qquad (3)$$

where $\wedge$ and $\vee$ are the minimum and maximum operators, respectively. The minimizer of $l_{n}(\beta)$, denoted by $\hat{\beta}$, is the estimator of $\beta$ proposed by Ying et al. (2020). They showed that the estimator is consistent and asymptotically normal under some regularity conditions.
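To make the form of (3) concrete, the following R sketch computes $l_{n}(\beta)$ directly from the observed data; the function name and argument layout are illustrative and not part of the original paper.

```r
# Mann-Whitney-type loss l_n(beta) of Ying et al. (2020), eq. (3).
# Y, L, R: observed response and truncation bounds; X: n x p covariate matrix.
# A minimal sketch; the double loop is O(n^2) and kept for clarity.
loss_dt <- function(beta, Y, L, R, X) {
  n <- length(Y)
  e <- Y - as.vector(X %*% beta)   # residuals e_i(beta)
  total <- 0
  for (i in 1:n) {
    for (j in 1:n) {
      # [(e_i - e_j) ^ (R_j - Y_j) ^ (Y_i - L_i)] v (L_j - Y_j) v (Y_i - R_i)
      inner <- min(e[i] - e[j], R[j] - Y[j], Y[i] - L[i])
      total <- total + abs(max(inner, L[j] - Y[j], Y[i] - R[i]))
    }
  }
  total
}
```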

3 Main results of variable selection

3.1 The method of variable selection

We consider variable selection for model (1) based on the observed data points. The loss function (3) motivates a penalized approach for variable selection. Instead of restricting to a fixed dimension of $X_{i}$, we allow the dimension to go to infinity with $n$, that is, $p$ depends on $n$ and is denoted by $p_{n}$. Then all the quantities that are functions of the covariates depend on $n$. For simplicity of notation, we suppress the subscript $n$ when there is no confusion. Meanwhile, $n/\tilde{n}$ is assumed to converge to a positive constant less than 1 as $\tilde{n}$ goes to infinity.

Because of the $L_{1}$ form of the loss function (3), LASSO-type penalties are naturally considered. Specifically, the objective function with the usual LASSO penalty is given by

$$l_{n}(\beta)+\lambda_{n}\sum_{j=1}^{p_{n}}|\beta_{j}|,$$

where $\lambda_{n}$ is a (data-dependent) tuning parameter. There are two main concerns if we want to use the above LASSO penalty. The first one is computational. One can see that $l_{n}(\beta)$ is not convex in $\beta$, so the optimization is not a convex problem. However, we note that $l_{n}(\beta)$ can be represented as

$$l_{n}(\beta)=\sum_{i=1}^{n}\sum_{j=1}^{n}I\{L_{j}(\beta)<e_{i}(\beta)<R_{j}(\beta),\,L_{i}(\beta)<e_{j}(\beta)<R_{i}(\beta)\}|e_{i}(\beta)-e_{j}(\beta)|.$$

The non-convexity comes from the indicator function, while $|e_{i}(\beta)-e_{j}(\beta)|$ is convex in $\beta$. Thus, one can develop an iterative algorithm in which the parameter in the indicator function is kept fixed in each iteration step and the optimization is carried out over the remaining part of the objective function. More details of the computation are discussed in Section 3.2 and Appendix A.1.

The second concern is that the LASSO estimator is known to lack the unbiasedness property, and there are scenarios where LASSO selection is not consistent. To overcome these drawbacks, Zou (2006) proposed the adaptive LASSO penalty, in which proper weights are introduced for penalizing different coefficients in the LASSO penalty. The weights used by Zou (2006) are defined based on a root-$n$-consistent estimator of $\beta$. He showed that with a proper choice of the tuning parameter $\lambda_{n}$, the adaptive LASSO possesses the oracle properties when the dimension of $\beta$ is fixed.

To retain the oracle properties, we also adopt the adaptive LASSO approach for variable selection. Specifically, the objective function with the adaptive LASSO penalty is given by

$$l_{n}^{AL}(\beta)=l_{n}(\beta)+\lambda_{n}\sum_{j=1}^{p_{n}}\hat{w}_{j}|\beta_{j}|, \qquad (4)$$

where $\hat{w}_{j}$ is the adaptive weight of $\beta_{j}$. Originally, Zou (2006) proposed to use $\hat{w}_{j}=1/|\tilde{\beta}_{j}|^{\gamma}$, where $\tilde{\beta}_{j}$ is a root-$n$-consistent estimator of $\beta_{j}$ and $\gamma$ is a predetermined positive constant. In our case, even when $p_{n}$ diverges with $n$, we can still obtain $\hat{\beta}$ by minimizing $l_{n}(\beta)$ as long as $p_{n}$ does not grow too fast. Based on $\hat{\beta}$, we define $\hat{w}_{j}=1/|\hat{\beta}_{j}|^{\gamma}$ in (4), where $\hat{\beta}_{j}$ stands for the $j$-th component of $\hat{\beta}$. The proposed adaptive LASSO estimator, conceptually, is defined to be the minimizer of $l_{n}^{AL}(\beta)$.

3.2 The algorithm

As mentioned in Section 3.1, $l_{n}(\beta)$ is not convex, and neither is $l_{n}^{AL}(\beta)$, so we turn to an iterative procedure to obtain the proposed estimator. Specifically, we modify $l_{n}^{AL}(\beta)$ to

$$l_{n}^{AL}(\beta,b)=\sum_{i=1}^{n}\sum_{j=1}^{n}I\{L_{j}(b)<e_{i}(b)<R_{j}(b),\,L_{i}(b)<e_{j}(b)<R_{i}(b)\}|e_{i}(\beta)-e_{j}(\beta)|+\lambda_{n}\sum_{j=1}^{p_{n}}\hat{w}_{j}|\beta_{j}|.$$

Let $\hat{\beta}^{(0)}$ be an initial estimate; for example, one may use $\hat{\beta}$ as the starting value. Then, in the $k$-th iteration step, let $\hat{\beta}^{(k)}=\mathrm{argmin}_{\beta}\,l_{n}^{AL}(\beta,\hat{\beta}^{(k-1)})$ for $k\geqslant 1$. If $\hat{\beta}^{(k)}$ converges to a limit as $k\to\infty$, then we use the limit as the proposed estimator.

The main reason for using the above iterative algorithm is that in the $k$-th iteration, the objective function can be written as

$$l_{n}^{AL}(\beta,\hat{\beta}^{(k-1)})=\sum_{(i,j)\in{\cal S}_{k}}|e_{i}(\beta)-e_{j}(\beta)|+\lambda_{n}\sum_{j=1}^{p_{n}}\hat{w}_{j}|\beta_{j}|, \qquad (5)$$

where ${\cal S}_{k}=\{(i,j):L_{j}(\hat{\beta}^{(k-1)})<e_{i}(\hat{\beta}^{(k-1)})<R_{j}(\hat{\beta}^{(k-1)}),\,L_{i}(\hat{\beta}^{(k-1)})<e_{j}(\hat{\beta}^{(k-1)})<R_{i}(\hat{\beta}^{(k-1)})\}$. Thus, the optimization problem becomes a linear programming problem which can be solved efficiently. Moreover, after some algebraic operations, the optimization can be transformed into a least absolute deviation (LAD) problem which can be solved efficiently through a standard LAD procedure, such as the quantreg package in R. More details on the transformation are provided in Appendix A.1.
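The following R sketch illustrates one iteration of this scheme: it builds the comparable-pair set ${\cal S}_{k}$ at the current estimate and then minimizes the pairwise $L_1$ criterion via a standard LAD (median regression) fit from the quantreg package. The function name is illustrative; the penalty term is handled later through the augmented design of Appendix A.1.

```r
library(quantreg)  # for the standard LAD (median regression) solver

# One iteration of the algorithm in Section 3.2: fix the comparable-pair set
# S_k at the current value b, then minimize the pairwise L1 criterion.
# Sketch only; the adaptive LASSO penalty is added via the augmented design
# described in Appendix A.1.
update_beta <- function(b, Y, L, R, X) {
  e  <- Y - as.vector(X %*% b)   # e_i(b)
  eL <- L - as.vector(X %*% b)   # L_i(b)
  eR <- R - as.vector(X %*% b)   # R_i(b)
  n  <- length(Y)
  idx <- expand.grid(i = 1:n, j = 1:n)
  comparable <- with(idx, i != j &
                       eL[j] < e[i] & e[i] < eR[j] &
                       eL[i] < e[j] & e[j] < eR[i])
  S  <- idx[comparable, ]
  Yd <- Y[S$i] - Y[S$j]                              # pairwise response differences
  Xd <- X[S$i, , drop = FALSE] - X[S$j, , drop = FALSE]
  # Unpenalized LAD (median regression without intercept) over the pairs
  rq.fit(Xd, Yd, tau = 0.5)$coefficients
}
```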

3.3 Large sample properties

Next we establish the oracle property of the proposed adaptive LASSO approach under some regularity conditions. The oracle property requires the consistency of $\hat{\beta}$. Although Ying et al. (2020) proved the root-$n$-consistency of $\hat{\beta}$ under suitable conditions, their result is confined to fixed dimension. When $p_{n}$ diverges with $n$, the consistency of $\hat{\beta}$ needs further investigation. Let $\beta_{0}$ denote the true value of $\beta$. The following proposition, proved in Appendix A.2, establishes the consistency of $\hat{\beta}$ with diverging dimension.

Proposition 1.

Assume that conditions C1 to C6 listed in Appendix A.2 hold. If $p_{n}\log n/\sqrt{n}=o(1)$, then $\|\hat{\beta}-\beta_{0}\|=O_{p}((p_{n}/n)^{1/2})$.

We then show that an adaptive LASSO estimator exists and is consistent with convergence rate $O_{p}((p_{n}/n)^{1/2})$. The following theorem, proved in Appendix A.2, states the result.

Theorem 1.

Assume that conditions C1 to C6 listed in Appendix A.2 hold. If $\sqrt{n}\lambda_{n}=o(1)$ and $p_{n}\log n/\sqrt{n}=o(1)$, then with probability tending to 1 there exists a local minimizer of $l_{n}^{AL}(\beta)$, denoted by $\hat{\beta}^{AL}$, satisfying $\|\hat{\beta}^{AL}-\beta_{0}\|=O_{p}((p_{n}/n)^{1/2})$.

To present the oracle properties, we need some more notation. Let $\xi_{ij}(\beta)=I\{L_{j}(\beta)<e_{i}(\beta)<R_{j}(\beta),\,L_{i}(\beta)<e_{j}(\beta)<R_{i}(\beta)\}(X_{i}-X_{j})\,\mathrm{sgn}\{e_{i}(\beta)-e_{j}(\beta)\}$, $h_{ij}(\beta)=\left|\left[(e_{i}(\beta)-e_{j}(\beta))\wedge(R_{j}-Y_{j})\wedge(Y_{i}-L_{i})\right]\vee(L_{j}-Y_{j})\vee(Y_{i}-R_{i})\right|$, and $\bar{l}(\beta)=\mathsf{E}(h_{ij}(\beta))$, where the expectation is taken with respect to the distribution of the observed data. We define two $p_{n}\times p_{n}$ matrices $V=\mathsf{E}(\xi_{ij}(\beta_{0})\xi_{ik}^{\top}(\beta_{0}))$ and $A=\partial^{2}\bar{l}(\beta)/\partial\beta\partial\beta^{\top}|_{\beta=\beta_{0}}$. We further partition $\beta_{0}$ into $\beta_{0,1}$ and $\beta_{0,2}$, where $\beta_{0,1}$ stands for the nonzero components of $\beta_{0}$ and $\beta_{0,2}$ for the zero components. The dimension of $\beta_{0,1}$ is denoted by $p_{n,1}$. Let $\hat{\beta}^{AL}_{1}$ and $\hat{\beta}^{AL}_{2}$ be the adaptive LASSO estimators of $\beta_{0,1}$ and $\beta_{0,2}$, respectively. Moreover, let $V_{11}$ denote the $p_{n,1}\times p_{n,1}$ submatrix of $V$ corresponding to $\beta_{0,1}$ and $A_{11}$ the corresponding $p_{n,1}\times p_{n,1}$ submatrix of $A$.

Theorem 2.

Assume that conditions C1 to C6 listed in Appendix A.2 hold. If $\sqrt{n}\lambda_{n}=o(1)$ and $p_{n}\log n/\sqrt{n}=o(1)$, then with probability tending to 1 the adaptive LASSO estimator $\hat{\beta}^{AL}$ has the following properties: i) $\hat{\beta}^{AL}_{2}=0$; ii) for any nonzero $p_{n,1}\times 1$ constant vector $u_{n}$ with $\|u_{n}\|=1$, $\sqrt{n}u_{n}^{\top}(\hat{\beta}^{AL}_{1}-\beta_{0,1})/[2(u_{n}^{\top}A_{11}^{-1}V_{11}A_{11}^{-1}u_{n})^{1/2}]$ converges in distribution to $N(0,1)$.

The proof of Theorem 2 is given in Appendix A.2. By choosing a suitable $u_{n}$, it can be seen from Theorem 2 that the estimator of each individual component of $\beta_{0,1}$ is asymptotically normal. The result can be used to make inference on individual coefficients. The estimation of the variance is discussed in Section 3.4.
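As a simple illustration of how Theorem 2 supports inference, a Wald-type confidence interval for one selected coefficient could be formed as in the following sketch, with the standard error taken from the random weighting procedure of Section 3.4; the helper is illustrative.

```r
# Wald-type confidence interval for the j-th selected coefficient, using a
# standard error from the random weighting scheme of Section 3.4
# (see the rw_variance sketch below). Illustrative helper only.
wald_ci <- function(beta_hat_j, se_j, level = 0.95) {
  z <- qnorm(1 - (1 - level) / 2)
  c(lower = beta_hat_j - z * se_j, upper = beta_hat_j + z * se_j)
}
```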

3.4 The selection of tuning parameter and variance estimation

Typical methods for selecting the tuning parameter include information criteria such as AIC and BIC, and data-driven procedures such as $K$-fold cross-validation. Because of the computational burden, we consider information criteria here. It is well known that BIC usually leads to consistent selection in fixed dimension scenarios. However, as we allow the dimension of the parameter to diverge, the standard BIC does not guarantee selection consistency. Thus, we consider the following modified BIC:

$$\mathrm{BIC}(\lambda)=l_{n}(\hat{\beta}^{AL}_{\lambda})+\frac{\log n}{n}C_{n}\,\mathrm{df}_{\lambda},$$

where $\hat{\beta}^{AL}_{\lambda}$ represents the adaptive LASSO estimate obtained at the tuning parameter level $\lambda$, $C_{n}$ is a sequence going to infinity with $(\log n/n)C_{n}\to 0$, and $\mathrm{df}_{\lambda}$ stands for the number of non-zero components of the estimate at level $\lambda$. We find that $C_{n}=\log\log p_{n}$ works well in this paper. The tuning parameter we choose is given by

$$\hat{\lambda}=\mathrm{argmin}_{\lambda}\,\mathrm{BIC}(\lambda).$$
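The modified BIC search over a grid of candidate $\lambda$ values might be organized as in the following sketch; fit_adaptive_lasso is a hypothetical wrapper around the iterative algorithm of Section 3.2 that returns the penalized estimate for a given $\lambda$, and loss_dt is the loss sketch given in Section 2.

```r
# Modified BIC tuning-parameter selection (Section 3.4). A sketch assuming
# fit_adaptive_lasso(lambda, ...) returns the adaptive LASSO estimate for a
# given lambda and loss_dt() computes l_n(beta) as in (3).
select_lambda <- function(lambda_grid, Y, L, R, X) {
  n  <- length(Y)
  pn <- ncol(X)
  Cn <- log(log(pn))                                  # C_n = log log p_n
  bic <- sapply(lambda_grid, function(lam) {
    beta_lam <- fit_adaptive_lasso(lam, Y, L, R, X)   # hypothetical wrapper
    df_lam   <- sum(beta_lam != 0)                    # size of non-zero part
    loss_dt(beta_lam, Y, L, R, X) + (log(n) / n) * Cn * df_lam
  })
  lambda_grid[which.min(bic)]                         # selected lambda-hat
}
```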

The proposed modified BIC yields consistent selection. Let ${\cal M}^{\ast}$ and ${\cal M}_{\hat{\lambda}}$ be the true model and the selected model, respectively. We have the following result on selection consistency.

Theorem 3.

Assume that conditions C1 to C6 listed in Appendix A.2 hold. If $\sqrt{n}\lambda_{n}=o(1)$, $\lambda_{n}n/p_{n}\to\infty$, $p_{n}\log n/\sqrt{n}=o(1)$, and $\sqrt{n/\{C_{n}p_{n}\log n\}}\times\liminf_{n\to\infty}\{\min_{j\in{\cal M}^{\ast}}|\beta_{0,j}|\}\to\infty$, then $\mathsf{P}({\cal M}_{\hat{\lambda}}={\cal M}^{\ast})\to 1$ as $n\to\infty$.

Finally, we need to estimate the variance of the proposed estimator of the nonzero components of $\beta_{0}$. Note that the asymptotic variance-covariance matrix of $\hat{\beta}^{AL}_{1}$ involves the density function of $\tilde{\varepsilon}$ and is complicated to estimate directly. We therefore use a random weighting method to obtain the variance estimate. To be specific, we define the perturbed penalized estimator, denoted by $\hat{\beta}^{\ast}$, to be the minimizer of the following perturbed version of the objective function $l^{AL}_{n}(\beta)$:

$$l_{n}^{\ast}(\beta)=\sum_{i=1}^{n}\sum_{j=1}^{n}(W_{i}+W_{j})h_{ij}(\beta)+\hat{\lambda}\sum_{j=1}^{p_{n}}\hat{w}_{j}|\beta_{j}|,$$

where $W_{i}$, $i=1,\ldots,n$, are i.i.d. non-negative and bounded random variables with mean 0.5 and variance 1, independent of the observed data. Let $\hat{\beta}^{\ast}_{1}$ be the component of $\hat{\beta}^{\ast}$ corresponding to $\hat{\beta}^{AL}_{1}$, i.e., the non-zero part of $\hat{\beta}^{AL}$. It can be shown that under suitable conditions, given the observed data, the conditional asymptotic distribution of $\hat{\beta}^{\ast}_{1}$ is the same as the asymptotic distribution of $\hat{\beta}^{AL}_{1}$. Thus, by repeatedly generating the sequence $W_{i}$, $i=1,\ldots,n$, we can obtain a large number of replications of $\hat{\beta}^{\ast}_{1}$ and then approximate the variance of $\hat{\beta}^{AL}_{1}$.
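A minimal sketch of the random weighting scheme is given below. It assumes a hypothetical solver fit_perturbed(W, lambda, ...) that minimizes the perturbed objective $l_{n}^{\ast}(\beta)$ for a given weight vector; the rescaled Bernoulli weights used here are just one admissible choice satisfying the stated moment and boundedness requirements.

```r
# Random weighting variance estimation (Section 3.4). Sketch assuming a
# hypothetical solver fit_perturbed(W, lambda, ...) that minimizes l_n*(beta).
rw_variance <- function(B, lambda_hat, Y, L, R, X, nonzero_idx) {
  n <- length(Y)
  reps <- replicate(B, {
    # One admissible weight choice: W = 2.5 * Bernoulli(0.2) is non-negative,
    # bounded, with mean 0.5 and variance 1, as required.
    W <- 2.5 * rbinom(n, size = 1, prob = 0.2)
    fit_perturbed(W, lambda_hat, Y, L, R, X)[nonzero_idx]
  })
  reps <- rbind(reps)   # ensure a (k x B) matrix even when only one coefficient
  cov(t(reps))          # k x k covariance estimate for the non-zero components
}
```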

4 Numerical results

4.1 Simulation studies

We conduct simulation studies to illustrate the finite sample performance of the proposed approach. We set $p_{n}=[7\tilde{n}^{1/5}]$, where $\tilde{n}$ is the size of the full sample, as defined in Section 2. The number of non-zero components is set to be $p_{n,1}=[p_{n}/3]$. We consider two scenarios with $\tilde{n}=300$ and $\tilde{n}=500$, respectively. In the first scenario, $p_{n}=21$ and $p_{n,1}=7$. We set $\beta_{1,0}=(3.12,2.20,-0.86,0.92,-2.49,1.95,-1.32)^{\top}$. For the covariates $\tilde{X}=(\tilde{X}_{1},\ldots,\tilde{X}_{21})$, we independently generate $\tilde{X}_{1}$ from Binomial(0.25) (the Binomial distribution with success probability 0.25), $\tilde{X}_{2}$ from Binomial(0.8), $\tilde{X}_{8}$ from Uniform(0,2) (the uniform distribution on (0,2)), $\tilde{X}_{9}$ from Binomial(0.5), and $\tilde{X}_{10}$ from Uniform(-2,0). All the remaining covariates are generated from the standard normal distribution, with the correlation between $\tilde{X}_{i}$ and $\tilde{X}_{j}$ being $0.3^{|i-j|}$. In the second scenario, $p_{n}=24$ and $p_{n,1}=8$. We set $\beta_{1,0}=(3.12,2.20,-0.86,0.92,-2.49,1.95,-1.32,-2.13)^{\top}$. We generate $\tilde{X}_{1}$ from Binomial(0.25), $\tilde{X}_{2}$ from Binomial(0.8), $\tilde{X}_{9}$ from Uniform(0,2), $\tilde{X}_{10}$ from Binomial(0.5), $\tilde{X}_{11}$ from Uniform(-2,0), and the rest from the standard normal distribution, with the correlation between $\tilde{X}_{i}$ and $\tilde{X}_{j}$ being $0.3^{|i-j|}$. For the error distribution in the linear model, we consider the standard normal ($N(0,1)$) distribution and the extreme value distribution with location parameter 0 and scale parameter 1 ($EV(0,1)$). The left truncation variable $\tilde{L}$ is generated from Uniform($a$, $b^{\top}X$) and the right one is $\tilde{R}=\tilde{L}+c$, where $a$, $b$, and $c$ are constants chosen to yield truncation percentages of 30% and 40%. The percentages of left and right truncation are almost the same.
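A doubly truncated sample along the lines of the first scenario could be generated as in the following sketch. The truncation constants a, b_const, and c_const are illustrative placeholders rather than the values actually tuned in the paper to reach the reported truncation percentages, and the correlated normal block is overwritten column-wise to obtain the non-normal covariates.

```r
# Generate one doubly truncated sample roughly following Scenario 1 of
# Section 4.1 with N(0,1) errors. The constants a, b_const, c_const are
# illustrative; in the paper they are tuned to the target truncation rates.
gen_scenario1 <- function(n_full = 300, a = -8, b_const = 0.1, c_const = 10) {
  pn <- 21
  beta0 <- c(3.12, 2.20, -0.86, 0.92, -2.49, 1.95, -1.32, rep(0, pn - 7))
  Sigma <- 0.3^abs(outer(1:pn, 1:pn, "-"))          # corr 0.3^|i-j|
  X <- matrix(rnorm(n_full * pn), n_full, pn) %*% chol(Sigma)
  # Overwrite the non-normal covariates as described in the text
  X[, 1]  <- rbinom(n_full, 1, 0.25)
  X[, 2]  <- rbinom(n_full, 1, 0.8)
  X[, 8]  <- runif(n_full, 0, 2)
  X[, 9]  <- rbinom(n_full, 1, 0.5)
  X[, 10] <- runif(n_full, -2, 0)
  Y <- as.vector(X %*% beta0) + rnorm(n_full)       # N(0,1) errors
  L <- runif(n_full, a, as.vector(X %*% rep(b_const, pn)))  # left truncation
  R <- L + c_const                                  # right truncation
  keep <- (L < Y) & (Y < R)                         # only these are observed
  list(Y = Y[keep], L = L[keep], R = R[keep], X = X[keep, , drop = FALSE])
}
```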

Under each scenario, 1000 replications are carried out. To assess the variable selection performance, we use several criteria. The first one is the model error (ME), defined as ME $=(\tilde{\beta}-\beta_{0})^{\top}\mathsf{E}(X_{i}X_{i}^{\top})(\tilde{\beta}-\beta_{0})$, where $\tilde{\beta}$ is an estimator of $\beta_{0}$ derived from a specific approach. The median and median absolute deviation (MAD) of the 1000 MEs are recorded to evaluate the prediction performance of the different procedures. The second one is the average number of total estimated zero coefficients (TN), that is, the average number of zero estimates obtained in the 1000 replications, split into the correctly estimated zero number (CN) and the incorrectly estimated zero number (IN). For CN, the closer it is to $p_{n}-p_{n,1}$ the better a procedure performs, while for IN, the closer it is to 0 the better. The third one is the ratio of correctly identified models (RCM), that is, the proportion of the 1000 replications that identify the true model correctly. The closer it is to 100%, the better a procedure performs. We consider three procedures: the proposed selection procedure (Proposed), the variable selection procedure with the adaptive LASSO penalty that ignores the truncation (Naive), and the oracle procedure where the correct subset of covariates is used to fit the model (Oracle). The results are summarized in the following tables.
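For a single replication, the evaluation criteria can be computed as in this sketch, where $\mathsf{E}(X_{i}X_{i}^{\top})$ is replaced by an empirical counterpart supplied by the user (e.g., from the full generated sample); the function is illustrative.

```r
# Performance criteria for one replication (Section 4.1). Sigma_X is an
# estimate of E(X_i X_i^T), e.g., crossprod(X_full) / nrow(X_full).
eval_selection <- function(beta_hat, beta0, Sigma_X) {
  d  <- beta_hat - beta0
  ME <- as.numeric(t(d) %*% Sigma_X %*% d)      # model error
  CN <- sum(beta_hat == 0 & beta0 == 0)         # correctly estimated zeros
  IN <- sum(beta_hat == 0 & beta0 != 0)         # incorrectly estimated zeros
  correct_model <- all((beta_hat != 0) == (beta0 != 0))  # counts toward RCM
  list(ME = ME, CN = CN, IN = IN, correct_model = correct_model)
}
```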

[Insert Table 1 here]

[Insert Table 2 here]

From the tables, we see that the proposed method outperforms the naive approach significantly in terms of ME. This is expected since the naive approach yields biased estimators of the regression coefficients. For TN and RCM, the proposed method also gives better results than the naive approach, although the extent of the improvement is not very large. TN and RCM are criteria for variable selection performance. Although selecting important variables seems to be less affected by the double truncation, correcting the bias induced by double truncation still helps to increase the precision of the variable selection procedure.

4.2 A real data example

We analyze astronomical data from the 46,420-quasar catalog produced by the Sloan Digital Sky Survey (SDSS) team. The main purpose is to predict the red shift of a quasar using related features. Thus, the linear model is used and the dependent variable is the red shift of the quasars. Compared with the complete SDSS catalog, the data set we use here omits some technical columns and therefore contains 23 covariates. More details about the data set can be found at https://astrostatistics.psu.edu/datasets.

It might not be straightforward to see why there exists double truncation in the red shift data. The truncation mainly comes from the observational characteristics of the astronomical telescope. Because of the aperture of the telescope, the red shift of a quasar one can observe is limited from above. For the current data, we consider the maximum value of the red shift (5.4135) as the right truncation bound of the dependent variable. That is, red shifts larger than this value are right truncated. For low red shifts, the measurement might be unreliable: a value should be omitted if the red shift is lower than its estimated error. The estimated error is therefore treated as the left truncation bound of the red shift observation. Consequently, the red shift of a quasar is considered to be doubly truncated. Among the 23 covariates, some are measurement errors of the brightness, which are determined by the SDSS team from knowledge of the observing conditions, detector background, and other technical considerations. These covariates are not included in the model. We use 13 main covariates (shown in Table 3) to fit the linear model. Due to computational limitations, we do not use the full data to fit the model. A random sample of size 929 is drawn to illustrate the proposed method. The sampled data are split into a training part and a testing part in the proportion of 7:3. We apply the proposed variable selection procedure to the training set to obtain a list of "important" covariates and then use the selected model for prediction on the testing set. The results of the variable selection are summarized in Table 4.

[Insert Table 3 here]

[Insert Table 4 here]

The proposed variable selection procedure identifies 5 "important" covariates with non-zero coefficients. We use the selected covariates to perform prediction on the testing set. In Figure 1, we draw a line plot to compare the predicted values and the true response values. We see that the predicted values (shown by the dotted line) are quite close to the true values (shown by the solid line). The selected covariates yield reasonable predictions.

[Insert Figure 1 here]

5 Concluding remarks

We combine the Mann-Whitney-type loss function and the adaptive LASSO penalty to develop a simultaneous estimation and variable selection procedure for doubly truncated linear models, which allows the number of regression parameters to grow to infinity with the sample size. To overcome the non-convexity of the objective function, an iterative algorithm is designed. In each iteration, the objective function to be optimized can be transformed into an LAD problem. The oracle property of the proposed variable selection procedure is proved. We use a modified BIC to select the tuning parameter and adopt a random weighting approach for variance estimation. Simulation studies show reasonable finite sample performance of the proposed approach.

One of the main contributions of the current approach is to allow the dimension of the covariates to go to infinity with the sample size. However, with the penalty-based method, the growth rate of the covariate dimension cannot be very large. For ultra-high dimensional situations, some screening approach is needed first. This is an interesting direction for future study.

Appendix A

A.1 More details on the algorithm

Here we give some more details on the algorithm discussed in Section 3.2. Define $\check{Y}=Y_{i}-Y_{j}$ and $\check{X}=X_{i}-X_{j}$ for $(i,j)\in{\cal S}_{k}$. Let $m$ index the elements of ${\cal S}_{k}$ and $N$ denote its size. Thus $\sum_{(i,j)\in{\cal S}_{k}}|e_{i}(\beta)-e_{j}(\beta)|=\sum_{m=1}^{N}|\check{Y}_{m}-\beta^{\top}\check{X}_{m}|$. Further define the matrices

$$\check{\mathbf{Y}}=\begin{pmatrix}Y_{i_{1}}-Y_{j_{1}}\\ \vdots\\ Y_{i_{N}}-Y_{j_{N}}\end{pmatrix},\qquad \check{\mathbf{X}}(\beta)=\begin{pmatrix}\beta^{\top}(X_{i_{1}}-X_{j_{1}})\\ \vdots\\ \beta^{\top}(X_{i_{N}}-X_{j_{N}})\end{pmatrix}$$

and the augmented matrices

$$\check{\mathbf{Y}}_{\mathrm{aug}}=\begin{pmatrix}\check{\mathbf{Y}}/n(n-1)\\ \mathbf{0}_{p_{n}}\end{pmatrix},\qquad \check{\mathbf{X}}_{\mathrm{aug}}(\beta,\lambda_{n})=\begin{pmatrix}\check{\mathbf{X}}(\beta)/n(n-1)\\ \mathbf{diag}(\lambda_{n}\hat{w})\end{pmatrix},$$

where $\mathbf{0}_{p_{n}}$ is a $p_{n}$-vector of zeros and $\mathbf{diag}(\lambda_{n}\hat{w})$ is a $p_{n}\times p_{n}$ diagonal matrix with elements $(\lambda_{n}\hat{w}_{1},\cdots,\lambda_{n}\hat{w}_{p_{n}})$. Based on this notation, minimizing the objective function in (5) can be formulated as minimizing $\|\check{\mathbf{Y}}_{\mathrm{aug}}-\check{\mathbf{X}}_{\mathrm{aug}}(\beta,\lambda_{n})\|_{1}$ with respect to $\beta$. This is a linear programming problem and can be solved quite efficiently through a standard unpenalized LAD procedure, such as the quantreg package in R. The estimation procedure can be summarized as follows.

(i) Get an initial estimate, for instance, the one obtained by ignoring the double truncation and the regularization.

(ii) Iteratively solve the unpenalized optimization problem until convergence to get $\hat{\beta}$.

(iii) Use $\hat{\beta}$ to construct the objective function $l_{n}^{AL}(\beta)$ and the corresponding modified one $l_{n}^{AL}(\beta,b)$.

(iv) Using $\hat{\beta}$ from step (ii) as the initial value $\hat{\beta}^{(0)}$, iteratively solve the optimization problem $\min_{\beta}\|\check{\mathbf{Y}}_{\mathrm{aug}}-\check{\mathbf{X}}_{\mathrm{aug}}(\beta,\lambda_{n})\|_{1}$ by linear programming until convergence to get $\hat{\beta}^{AL}$, as illustrated in the sketch below.
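A sketch of steps (iii)-(iv) in R, assuming the pairwise differences have already been formed over the comparable-pair set (as in the update_beta sketch of Section 3.2); the function name is illustrative.

```r
library(quantreg)

# Steps (iii)-(iv) in Appendix A.1: cast the penalized pairwise LAD problem
# as an ordinary LAD fit on augmented data. Yd and Xd are the pairwise
# differences over S_k, w_hat the adaptive weights, lambda the tuning level.
penalized_lad_step <- function(Yd, Xd, w_hat, lambda, n) {
  pn    <- ncol(Xd)
  scale <- n * (n - 1)
  Y_aug <- c(Yd / scale, rep(0, pn))                  # augmented response
  X_aug <- rbind(Xd / scale, diag(lambda * w_hat))    # augmented design
  # Minimizing ||Y_aug - X_aug beta||_1 reproduces the objective in (5)
  rq.fit(X_aug, Y_aug, tau = 0.5)$coefficients
}
```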

A.2 Proof of the large sample properties

To obtain the large sample properties presented in Section 3.3, some regularity conditions are needed. Let $Z=(Y,R,L,X)$ denote the population version of the observed data. Let ${\cal N}$ be an $o(1)$ neighborhood of $\beta_{0}$ and $S$ be the sample space of $Z$. We use the notation $h(Z_{i},Z_{j};\beta)$ to represent $h_{ij}(\beta)$, where $Z_{i}=(Y_{i},R_{i},L_{i},X_{i})$, $i=1,\ldots,n$. Define $\tau(z;\beta)=\mathsf{E}[h(z,Z;\beta)]$. We use $\nabla_{k}$ to represent the $k$-th order derivative of a function of $\beta$ with respect to $\beta$. We assume the following conditions hold.

C1. The density function $f$ is bounded and has a bounded and continuous derivative.

C2. The parameter space $\Theta$ is compact and the true parameter value $\beta_{0}$ is an interior point of $\Theta$.

C3. Each component of the covariate vector has a bounded second moment.

C4. For all $z\in S$, $\nabla_{2}\tau(z;\beta)$ exists on ${\cal N}$ and is continuous at $\beta_{0}$.

C5. Let $I(z)=\sup_{\alpha\in S_{p}}\alpha^{\top}\nabla_{2}\tau(z;\beta_{0})\alpha$ with $S_{p}=\{\alpha\in R^{p_{n}}:\|\alpha\|=1\}$. Then $\mathsf{E}[I^{2}(Z)]<\infty$.

C6. There exist positive constants $C_{1}$, $C_{2}$, $C_{3}$, and $C_{4}$ such that for all $n$,

$$0<C_{1}<\lambda_{\min}(A)\leqslant\lambda_{\max}(A)<C_{2}<\infty,\qquad 0<C_{3}<\lambda_{\min}(V)\leqslant\lambda_{\max}(V)<C_{4}<\infty, \qquad (6)$$

where $\lambda_{\min}(\cdot)$ and $\lambda_{\max}(\cdot)$ denote the smallest and largest eigenvalues of a matrix, respectively.

C1 and C2 are mild conditions. C2 is used to prove the consistency of $\hat{\beta}$, which is crucial to guarantee the consistency of the proposed estimator $\hat{\beta}^{AL}$. C3 is sufficient to meet Condition 5 in Wang et al. (2019), so that results on high dimensional Hoeffding decomposition can be used to obtain the quadratic approximation of $l_{n}(\beta)-l_{n}(\beta_{0})$. C4 is needed for the Taylor expansion of $\tau(z;\beta)$ at $\beta_{0}$. C5 is used to bound the empirical term in the Hoeffding decomposition of $l_{n}(\beta)-l_{n}(\beta_{0})$. C6 is used to control the Hessian matrix of the quadratic approximation of the objective function and to ensure that the covariance matrix of the score function is positive definite with uniformly bounded eigenvalues for all $n$. This provides justification for the component-wise asymptotic normality of $\hat{\beta}^{AL}$. Similar conditions can be found in related literature, e.g., Peng and Fan (2004), Cai et al. (2005), Cho and Qu (2013), and Ni et al. (2016).

We now give the proofs of the proposition and the theorems.

Proof of Proposition 1: Consider the U-process of order 2 with kernel function $h\in{\cal H}$, where ${\cal H}=\{h(\cdot;\beta):\beta\in\Theta\}$ is a measurable class of symmetric functions on $S^{2}\equiv S\otimes S$ with a non-negative envelope. Denote the set $\mathbf{h}_{\omega}(\cdot;\beta)=\{h(Z_{1}(\omega),Z_{2}(\omega);\beta),\cdots,h(Z_{n-1}(\omega),Z_{n}(\omega);\beta)\}$ with length $n(n-1)$ and the class ${\cal H}_{\omega}=\{\mathbf{h}_{\omega}(\cdot;\beta):\beta\in\Theta\}$. According to Example 1 in Wang et al. (2019) and Lemma 4.4 in Pollard (1990), the pseudo-dimension of ${\cal H}_{\omega}$ is at most $100p_{n}$ for every $\omega$. The condition $p_{n}\log n/\sqrt{n}=o(1)$ is sufficient to guarantee the condition on the pseudo-dimension in Proposition 1 of Wang et al. (2019). Hence, we have the following quadratic approximation

$$l_{n}(\beta)-l_{n}(\beta_{0})=\frac{1}{2}(\beta-\beta_{0})^{\top}A(\beta-\beta_{0})+(\beta-\beta_{0})^{\top}\hat{U}_{n}+o_{p}(\|\beta-\beta_{0}\|^{2})+o_{p}\left(\frac{\sqrt{p_{n}}\|\beta-\beta_{0}\|}{\sqrt{n}}\right),$$

where $\hat{U}_{n}=2\sum_{i=1}^{n}\xi_{ij|i}/n$ with $\xi_{ij|i}=\mathsf{E}[\xi_{ij}(\beta_{0})|Z_{i}]$. By constraining $p_{n}$, the pseudo-dimension is of order $o(\sqrt{n}/\log n)$. Therefore, all the conditions listed in Theorem 3 of Wang et al. (2019) are satisfied, and the consistency of $\hat{\beta}$ follows.∎

To prove Theorem 1, a lemma is needed.

Lemma A.1.

Assume that conditions C1 to C6 hold. Then for all $b\in R^{p_{n}}$ with $\|b\|_{2}=1$, $\sqrt{n}(b^{\top}Vb)^{-1/2}b^{\top}\hat{U}_{n}/2$ converges in distribution to $N(0,1)$.

Proof: It is not difficult to see that

$$\frac{\sqrt{n}}{2}(b^{\top}Vb)^{-1/2}b^{\top}\hat{U}_{n}=\frac{1}{\sqrt{n}}\sum_{i=1}^{n}(b^{\top}Vb)^{-1/2}b^{\top}\xi_{ij|i}=:I_{1}.$$

Note that $\mathsf{E}(\xi_{ij|i})=\mathsf{E}(\xi(Z_{i},Z_{j},\beta_{0}))=0$ and $\mathrm{Var}(I_{1})=1$. By C3 and C4, $b^{\top}Vb$ is bounded. By the Cauchy-Schwarz inequality, we have that

$$\frac{1}{n^{2}}\sum_{i=1}^{n}\mathsf{E}\left|(b^{\top}Vb)^{-1/2}\cdot b^{\top}\xi_{ij|i}\right|^{4} = O(1)n^{-2}\sum_{i=1}^{n}\mathsf{E}\left|b^{\top}\xi_{ij|i}\right|^{4}\leqslant O(1)n^{-2}\sum_{i=1}^{n}\|\xi_{ij|i}\|_{2}^{4} = O_{p}(n^{-2}\cdot n\cdot p_{n}^{2})=O_{p}\left(\frac{p_{n}^{2}}{n}\right)=o_{p}(1).$$

Thus, the Lyapunov condition holds for $I_{1}$. By the Lyapunov central limit theorem, $I_{1}$ converges in distribution to $N(0,1)$.∎

Proof of Theorem 1: Let $\alpha_{n}=(p_{n}/n)^{1/2}$. It can be shown that for any $\epsilon>0$, there exists a large enough constant $C$ such that, for constant vectors $u_{n}$ with $\|u_{n}\|=C$,

$$\mathsf{P}\left\{\inf_{\|u_{n}\|=C}l^{AL}_{n}(\beta_{0}+\alpha_{n}u_{n})>l^{AL}_{n}(\beta_{0})\right\}\geqslant 1-\epsilon.$$

This implies the existence of a local minimizer $\hat{\beta}^{AL}$ such that $\|\hat{\beta}^{AL}-\beta_{0}\|=O_{p}(\alpha_{n})$. To establish the above inequality, we write

$$\begin{aligned} l^{AL}_{n}(\beta_{0}+\alpha_{n}u_{n})-l^{AL}_{n}(\beta_{0}) &= l_{n}(\beta_{0}+\alpha_{n}u_{n})-l_{n}(\beta_{0})+\sum_{j=1}^{p_{n}}\{P_{\lambda_{n}j}(|\beta_{0j}+\alpha_{n}u_{nj}|)-P_{\lambda_{n}j}(|\beta_{0j}|)\}\\ &\geqslant \{l_{n}(\beta_{0}+\alpha_{n}u_{n})-l_{n}(\beta_{0})\}+\sum_{j=1}^{p_{n,1}}\{P_{\lambda_{n}j}(|\beta_{0j}+\alpha_{n}u_{nj}|)-P_{\lambda_{n}j}(|\beta_{0j}|)\}\\ &=: U_{1}-U_{2}. \end{aligned}$$

We write that

$$U_{1}=\alpha_{n}u_{n}^{\top}\hat{U}_{n}+\frac{1}{2}\alpha^{2}_{n}u_{n}^{\top}Au_{n}+(\|u_{n}\|^{2}+\|u_{n}\|)o_{p}(\alpha^{2}_{n})=:U_{11}+U_{12}+U_{13},$$

where

$$|U_{11}|=|\alpha_{n}u_{n}^{\top}\hat{U}_{n}|\leqslant\alpha_{n}\|u_{n}\|\|\hat{U}_{n}\|=\alpha_{n}\|u_{n}\|O_{p}\left(\sqrt{\frac{p_{n}}{n}}\right)=\|u_{n}\|O_{p}(\alpha_{n}^{2})$$

and

$$|U_{12}|=\left|\frac{1}{2}\alpha^{2}_{n}u_{n}^{\top}Au_{n}\right|\geqslant\frac{1}{2}\alpha^{2}_{n}\|u_{n}\|^{2}\lambda_{\min}(A)>\alpha^{2}_{n}\|u_{n}\|^{2}C_{1}/2.$$

Therefore, for large enough $\|u_{n}\|$, $|U_{12}|$ dominates $|U_{11}|$ and $|U_{13}|$ in $U_{1}$. Meanwhile,

$$|U_{2}| = \left|\sum_{j=1}^{p_{n,1}}\frac{\lambda_{n}}{|\hat{\beta}_{j}|}|\beta_{0j}+\alpha_{n}u_{nj}|-\frac{\lambda_{n}}{|\hat{\beta}_{j}|}|\beta_{0j}|\right|\leqslant\alpha_{n}\lambda_{n}\left|\sum_{j=1}^{p_{n,1}}\frac{|u_{nj}|}{|\hat{\beta}_{j}|}\right| = \alpha_{n}\lambda_{n}\sqrt{p_{n,1}}O_{p}(1)\|u_{n}\| = \|u_{n}\|o_{p}(\alpha^{2}_{n}).$$

For large enough $C=\|u_{n}\|$, $|U_{12}|$ dominates $|U_{2}|$. Since $A$ is positive definite, $U_{12}>0$. Consequently, $l^{AL}_{n}(\beta_{0}+\alpha_{n}u_{n})-l^{AL}_{n}(\beta_{0})>0$ with probability tending to 1 as $n\to\infty$ for large enough $C$.∎

To prove Theorem 2, another lemma is needed.

Lemma A.2.

Assume that conditions C1 to C6 hold. If $\sqrt{n}\lambda_{n}=o(1)$, $\lambda_{n}n/p_{n}\to\infty$, and $p_{n}\log n/\sqrt{n}=o(1)$, then with probability tending to 1, for any given $\beta_{1}$ satisfying $\|\beta_{1}-\beta_{0,1}\|=O(p_{n}^{1/2}n^{-1/2})$ and any constant $C$, $l^{AL}_{n}\{(\beta_{1}^{\top},0^{\top})^{\top}\}=\min_{\|\beta_{2}\|\leqslant Cp_{n}^{1/2}n^{-1/2}}l^{AL}_{n}\{(\beta_{1}^{\top},\beta_{2}^{\top})^{\top}\}$.

Proof: It is sufficient to show that with probability tending to 1, for any $\beta_{1}$ satisfying $\|\beta_{1}-\beta_{0,1}\|=O(p_{n}^{1/2}n^{-1/2})$ and $\|\beta_{2}\|\leqslant Cp_{n}^{1/2}n^{-1/2}$, $\partial l^{AL}_{n}(\beta)/\partial\beta_{j}$ and $\beta_{j}$ have the same sign for $j=(p_{n,1}+1),\cdots,p_{n}$. We have that

$$\frac{\partial l^{AL}_{n}(\beta_{n})}{\partial\beta_{j}} = \hat{U}_{n}(\beta_{0})_{j}+\sum_{k=1}^{p_{n}}A_{jk}(\beta_{k}-\beta_{0k})+o_{p}\left(\|\beta_{n}-\beta_{0}\|+\sqrt{\frac{p_{n}}{n}}\right)+P^{\prime}_{\lambda_{n}j}(|\beta_{j}|)\,\mathrm{sign}(\beta_{j}) =: V_{1}+V_{2}+V_{3}+V_{4}.$$

It can be shown that

$$V_{1}=O_{p}\left(\frac{1}{\sqrt{n}}\right)=o_{p}\left(\sqrt{\frac{p_{n}}{n}}\right),$$
$$|V_{2}|=\left|\sum_{k=1}^{p_{n}}A_{jk}(\beta_{k}-\beta_{0k})\right|\leqslant\|\beta-\beta_{0}\|\left(\sum_{k=1}^{p_{n}}A_{jk}^{2}\right)^{\frac{1}{2}}=O_{p}\left(\sqrt{\frac{p_{n}}{n}}\right)O(1)=O_{p}\left(\sqrt{\frac{p_{n}}{n}}\right),$$

and $|V_{3}|=o_{p}(\sqrt{p_{n}/n})$. Thus, $V_{1}+V_{2}+V_{3}=O_{p}(\sqrt{p_{n}/n})$. It follows that

$$\frac{\partial l^{AL}_{n}(\beta_{n})}{\partial\beta_{j}} = P^{\prime}_{\lambda_{n}j}(|\beta_{j}|)\,\mathrm{sign}(\beta_{j})+O_{p}\left(\sqrt{\frac{p_{n}}{n}}\right)=\frac{\lambda_{n}}{|\hat{\beta}_{nj}|}\mathrm{sign}(\beta_{j})+O_{p}\left(\sqrt{\frac{p_{n}}{n}}\right) = \frac{\lambda_{n}}{|\hat{\beta}_{nj}|}\left\{\mathrm{sign}(\beta_{j})+O_{p}\left(\frac{p_{n}}{n\lambda_{n}}\right)\right\}.$$

Since $\lambda_{n}n/p_{n}\to\infty$, $\mathrm{sign}(\beta_{j})$ dominates $O_{p}(p_{n}/(n\lambda_{n}))$. Therefore, $\partial l^{AL}_{n}(\beta_{n})/\partial\beta_{j}$ and $\beta_{j}$ have the same sign for $j=(p_{n,1}+1),\cdots,p_{n}$ with probability tending to 1 as $n\to\infty$.∎

Proof of Theorem 2: Part i) follows directly from Lemma A.2. To prove ii), first denote $B_{n}=\{P^{\prime}_{\lambda_{n}1}(|\beta_{01}|)\mathrm{sign}(\beta_{01}),\cdots,P^{\prime}_{\lambda_{n}p_{n,1}}(|\beta_{0p_{n,1}}|)\mathrm{sign}(\beta_{0p_{n,1}})\}^{\top}$. It is easy to see that $B_{n}$ converges to 0 as $n\to\infty$. Since $\hat{\beta}^{AL}$ is the minimizer of $l^{AL}_{n}(\beta)$, $\partial l^{AL}_{n}(\hat{\beta}^{AL})/\partial\beta_{1}=0$. Combining this with $\hat{\beta}^{AL}_{2}=0$, we get that

$$\hat{U}_{1}+A_{11}(\hat{\beta}^{AL}_{1}-\beta_{0,1})+o_{p}\left(\|\beta_{n}-\beta_{0}\|+\sqrt{\frac{p_{n}}{n}}\right)+B_{n}=0, \qquad (7)$$

where $\hat{U}_{1}$ consists of the first $p_{n,1}$ components of $\hat{U}_{n}$. Rearranging (7) gives

$$\sqrt{n}(\hat{\beta}^{AL}_{1}-\beta_{0,1})=-\sqrt{n}A_{11}^{-1}\hat{U}_{1}-\sqrt{n}A_{11}^{-1}B_{n}+o_{p}(\sqrt{p_{n}}).$$

By the condition $\sqrt{n}\lambda_{n}=o(1)$, we have $\sqrt{n}A_{11}^{-1}B_{n}\to 0$. By C1 and C3, it can be seen that $\mathsf{E}|\nabla_{1j}\tau(Z_{1};\beta_{0})|^{2}<\infty$ for $j=1,\cdots,p_{n}$. Thus, $\|A_{11}^{-1/2}\hat{U}_{1}\|=O_{p}(\sqrt{p_{n}/n})$ and

$$\sqrt{n}(\hat{\beta}^{AL}_{1}-\beta_{0,1})=-\sqrt{n}A_{11}^{-1}\hat{U}_{1}(1+o_{p}(1)).$$

For any nonzero $p_{n,1}\times 1$ constant vector $u_{n}$ with $\|u_{n}\|=1$, $\sqrt{n}u_{n}^{\top}A_{11}^{-1}\hat{U}_{1}/[2(u_{n}^{\top}A_{11}^{-1}V_{11}A_{11}^{-1}u_{n})^{1/2}]$ converges in distribution to $N(0,1)$ by an argument similar to Lemma A.1. Thus, by Slutsky's theorem, we conclude that $\sqrt{n}u_{n}^{\top}(\hat{\beta}^{AL}_{1}-\beta_{0,1})/[2(u_{n}^{\top}A_{11}^{-1}V_{11}A_{11}^{-1}u_{n})^{1/2}]$ converges in distribution to $N(0,1)$.∎

Proof of Theorem 3: To state the proof clearly, some more notation is needed. The underlying true model is denoted by $\mathcal{M}^{\ast}=\{j:\beta_{0j}\neq 0,\ j=1,\ldots,p_{n}\}$. If we use $\hat{\beta}_{\lambda}$ to represent the estimator at the tuning parameter value $\lambda$, then the corresponding model is denoted by $\mathcal{M}_{\lambda}=\{j:\hat{\beta}_{\lambda,j}\neq 0,\ j=1,\ldots,p_{n}\}$. For any model $\mathcal{M}=\{j_{1},\ldots,j_{m}\}$ with size $m$, the corresponding estimated coefficients are $\hat{\beta}_{\mathcal{M}}=(\beta_{j_{1}},\ldots,\beta_{j_{m}})$. Note that $\hat{\beta}_{\mathcal{M}}$ and $\hat{\beta}_{\lambda}$ are two different objects and need to be distinguished carefully. Next, we divide the range of $\lambda$ into the following three disjoint regions: i) $\Omega_{-}=\{\lambda>0:\mathcal{M}_{\lambda}\not\supset\mathcal{M}^{\ast}\}$, ii) $\Omega_{0}=\{\lambda>0:\mathcal{M}_{\lambda}=\mathcal{M}^{\ast}\}$, and iii) $\Omega_{+}=\{\lambda>0:\mathcal{M}_{\lambda}\supset\mathcal{M}^{\ast},\mathcal{M}_{\lambda}\neq\mathcal{M}^{\ast}\}$. These three regions correspond to the under-fitted, correctly fitted, and over-fitted models, respectively. For the under-fitted model, there exists $j\in\mathcal{M}^{\ast}$ with $j\notin\mathcal{M}_{\lambda}$. The over-fitted model refers to the case where $j\in\mathcal{M}_{\lambda}$ for all $j\in\mathcal{M}^{\ast}$, but there exists $k\in\mathcal{M}_{\lambda}$ with $k\notin\mathcal{M}^{\ast}$. We prove the theorem by considering the two parts: the under-fitted model and the over-fitted model.

For the under-fitted model part, when $\lambda\in\Omega_{-}$, the length of the vector $\hat{\beta}^{AL}_{S}$ does not match that of $\beta_{0}$. In order to make $\hat{\beta}^{AL}_{S}$ and $\beta_{0}$ comparable, we slightly modify the definition of $\hat{\beta}_{S}$ here. In this part, for any model $S$ we define

$$\hat{\beta}_{S}=\mathrm{argmin}_{\beta\in R^{p_{n}}:\,\beta_{j}=0,\ \forall j\notin S}\ l_{n}(\beta).$$

By definition,

$$\begin{aligned} \inf_{\lambda\in\Omega_{-}}\mathrm{BIC}_{\lambda} &= \inf_{\lambda\in\Omega_{-}}l_{n}(\hat{\beta}_{\lambda})+\frac{\log n}{n}C_{n}\cdot\mathrm{df}_{\lambda}\geqslant\inf_{\lambda\in\Omega_{-}}l_{n}(\hat{\beta}_{\lambda})\\ &= \inf_{\lambda\in\Omega_{-}}l_{n}(\beta_{0})+\frac{1}{2}(\hat{\beta}_{\lambda}^{\ast}-\beta_{0})^{\top}A(\hat{\beta}_{\lambda}^{\ast}-\beta_{0})\geqslant l_{n}(\beta_{0})+\frac{1}{2}\lambda_{\min}(A)\inf_{\lambda\in\Omega_{-}}\|\hat{\beta}_{\lambda}^{\ast}-\beta_{0}\|^{2}\\ &\geqslant l_{n}(\beta_{0})+\frac{1}{2}\lambda_{\min}(A)\inf_{S\not\supset\mathcal{M}^{\ast}}\|\hat{\beta}_{S}-\beta_{0}\|^{2}\geqslant l_{n}(\beta_{0})+\frac{1}{2}\lambda_{\min}(A)\min_{j\in\mathcal{M}^{\ast}}(\beta_{0,j}^{2}). \end{aligned}$$

By C6, we can obtain that

$$\inf_{\lambda\in\Omega_{-}}\mathrm{BIC}_{\lambda}-l_{n}(\beta_{0})\geqslant\frac{1}{2}C_{1}\min_{j\in\mathcal{M}^{\ast}}(\beta_{0,j}^{2}). \qquad (8)$$

By the condition $\sqrt{n/\{C_{n}p_{n}\log n\}}\times\liminf_{n\to\infty}\{\min_{j\in\mathcal{M}^{\ast}}|\beta_{0,j}|\}\to\infty$, the term on the right side of (8) is of larger order than $C_{n}p_{n}\log n/n$, which equals $\log\log p_{n}\times p_{n}\log n/n$ since we choose $C_{n}=\log\log p_{n}$ here. Similarly, we derive that

$$\begin{aligned} \mathrm{BIC}_{\hat{\lambda}} &= l_{n}(\hat{\beta})+\frac{\log n}{n}C_{n}\cdot\mathrm{df}\leqslant l_{n}(\hat{\beta})+\frac{p_{n}\log n\log\log p_{n}}{n}\\ &= l_{n}(\beta_{0})+\frac{1}{2}(\hat{\beta}-\beta_{0})^{\top}A(\hat{\beta}-\beta_{0})+\frac{p_{n}\log n\log\log p_{n}}{n}\\ &\leqslant l_{n}(\beta_{0})+\frac{1}{2}\lambda_{\max}(A)\|\hat{\beta}-\beta_{0}\|^{2}+\frac{p_{n}\log n\log\log p_{n}}{n}\\ &= l_{n}(\beta_{0})+O_{p}\left(\frac{p_{n}}{n}\right)+\frac{p_{n}\log n\log\log p_{n}}{n}\\ &= l_{n}(\beta_{0})+o_{p}\left(\frac{p_{n}\log n\log\log p_{n}}{n}\right)+\frac{p_{n}\log n\log\log p_{n}}{n} \end{aligned}$$

and

$$\mathrm{BIC}_{\hat{\lambda}}-l_{n}(\beta_{0})=O_{p}\left(\frac{p_{n}\log n\log\log p_{n}}{n}\right).$$

Combining the two results, we can obtain that

$$\mathsf{P}\left(\inf_{\lambda\in\Omega_{-}}\mathrm{BIC}_{\lambda}>\mathrm{BIC}_{\hat{\lambda}}\right)\to 1. \qquad (9)$$

For the over-fitted model part, for any $\lambda\in\Omega_{+}$,

$$\begin{aligned} \inf_{\lambda\in\Omega_{+}}\left(\mathrm{BIC}_{\lambda}-\mathrm{BIC}_{\hat{\lambda}}\right) &= \inf_{\lambda\in\Omega_{+}}l_{n}(\hat{\beta}_{\lambda})-l_{n}(\hat{\beta})+(\mathrm{df}_{\lambda}-\mathrm{df}_{\hat{\lambda}})\frac{\log n}{n}C_{n}\\ &= \inf_{\lambda\in\Omega_{+}}\frac{1}{2}(\hat{\beta}_{\lambda}-\beta_{0})^{\top}A(\hat{\beta}_{\lambda}-\beta_{0})-\frac{1}{2}(\hat{\beta}-\beta_{0})^{\top}A(\hat{\beta}-\beta_{0})+(\mathrm{df}_{\lambda}-\mathrm{df}_{\hat{\lambda}})\frac{\log n}{n}C_{n}\\ &\geqslant -\frac{1}{2}\lambda_{\max}(A)\|\hat{\beta}-\beta_{0}\|^{2}+(\mathrm{df}_{\lambda}-\mathrm{df}_{\hat{\lambda}})\frac{\log n}{n}C_{n}\\ &\geqslant -\frac{1}{2}C_{2}\|\hat{\beta}-\beta_{0}\|^{2}+\frac{\log n}{n}C_{n}. \end{aligned}$$

The last line holds since any $\lambda\in\Omega_{+}$ must satisfy $\mathrm{df}_{\lambda}-\mathrm{df}_{\hat{\lambda}}\geqslant 1$. Furthermore, since $\|\hat{\beta}^{AL}-\beta_{0}\|^{2}=O_{p}(p_{n}/n)$,

$$\inf_{\lambda\in\Omega_{+}}\left(\mathrm{BIC}_{\lambda}-\mathrm{BIC}_{\hat{\lambda}}\right)\geqslant(1+o_{p}(1))\frac{p_{n}\log n\log\log p_{n}}{n}>0.$$

This means that

$$\mathsf{P}\left(\inf_{\lambda\in\Omega_{+}}\mathrm{BIC}_{\lambda}>\mathrm{BIC}_{\hat{\lambda}}\right)\to 1. \qquad (10)$$

Combining (9) and (10), we conclude that

$$\mathsf{P}(\mathcal{M}_{\hat{\lambda}}=\mathcal{M}^{\ast})\to 1$$

as $n\to\infty$.∎


Acknowledgment

The research of Ming Zheng was supported by the National Natural Science Foundation of China Grant 11771095. The research of Wen Yu was supported by the National Natural Science Foundation of China Grant 12071088.

References

[1] Bhattacharya, P. K., Chernoff, H., and Yang, S. S. (1983), Nonparametric estimation of the slope of a truncated regression, The Annals of Statistics, 11, 505-514.

[2] Bilker, W., and Wang, M.-C. (1996), Generalized Wilcoxon statistics in semiparametric truncation models, Biometrics, 52, 10-20.

[3] Cai, J., Fan, J., Li, R., and Zhou, H. (2005), Variable selection for multivariate failure time data, Biometrika, 92, 303-316.

[4] Cho, H. and Qu, A. (2013), Model selection for correlated data with diverging number of parameters, Statistica Sinica, 23, 901-927.

[5] Efron, B., and Petrosian, V. (1999), Nonparametric methods for doubly truncated data, Journal of the American Statistical Association, 94, 824-834.

[6] Greene, W. H. (2012), Econometric Analysis (7th Ed.), Prentice Hall, Upper Saddle River, NJ.

[7] Keiding, N., and Gill, R. D. (1990), Random truncation models and Markov processes, The Annals of Statistics, 18, 582-602.

[8] Kim, J. P., Lu, W., Sit, T., and Ying, Z. (2013), A unified approach to semiparametric transformation models under generalized biased sampling schemes, Journal of the American Statistical Association, 108, 217-227.

[9] Lai, T. L., and Ying, Z. (1991a), Estimating a distribution function with truncated and censored data, The Annals of Statistics, 19, 417-442.

[10] Lai, T. L., and Ying, Z. (1991b), Rank regression methods for left-truncated and right-censored data, The Annals of Statistics, 19, 531-556.

[11] Liu, H., Ning, J., Qin, J., and Shen, Y. (2016), Semiparametric maximum likelihood inference for truncated or biased-sampling data, Statistica Sinica, 26, 1087-1115.

[12] Lynden-Bell, D. (1971), A method of allowing for known observational selection in small samples applied to 3CR quasars, Monthly Notices of the Royal Astronomical Society, 155, 95-118.

[13] Moreira, C. and Una-Alvarez, J. (2012), Kernel density estimation with doubly truncated data, Electronic Journal of Statistics, 6, 501-521.

[14] Moreira, C. and Una-Alvarez, J. (2016), Nonparametric regression with doubly truncated data, Computational Statistics and Data Analysis, 93, 294-307.

[15] Ni, A., Cai, J., and Zeng, D. (2016), Variable selection for case-cohort studies with failure time outcome, Biometrika, 103, 547-562.

[16] Peng, H. and Fan, J. (2004), Nonconcave penalized likelihood with a diverging number of parameters, The Annals of Statistics, 32, 928-961.

[17] Pollard, D. (1990), Empirical Processes: Theory and Applications, IMS, Hayward, CA.

[18] Shen, P.-S. (2010), Nonparametric analysis of doubly truncated data, Annals of the Institute of Statistical Mathematics, 62, 835-853.

[19] Shen, P.-S. (2013a), A class of rank-based tests for doubly-truncated data, TEST, 22, 83-102.

[20] Shen, P.-S. (2013b), Regression analysis of interval censored data and doubly truncated data, Computational Statistics, 28, 581-596.

[21] Tsai, W.-Y. (1990), Testing the independence of truncation time and failure time, Biometrika, 77, 167-177.

[22] Tsui, K.-L., Jewell, N. P., and Wu, C. F. J. (1988), A nonparametric approach to the truncated regression problem, Journal of the American Statistical Association, 83, 785-792.

[23] Turnbull, B. W. (1976), The empirical distribution function with arbitrarily grouped, censored and truncated data, Journal of the Royal Statistical Society, Ser. B, 38, 290-295.

[24] Wang, M.-C., Jewell, N. P., and Tsai, W.-Y. (1986), Asymptotic properties of the product limit estimate under random truncation, The Annals of Statistics, 14, 1597-1605.

[25] Wang, Z., Lin, Y., Liu, W., and Shao, Q. (2019), U-processes with increasing dimensional parameters, working paper.

[26] Ying, Z., Yu, W., Zhao, Z., and Zheng, M. (2020), Regression analysis of doubly truncated data, Journal of the American Statistical Association, 115, 810-821.

[27] Zou, H. (2006), The adaptive lasso and its oracle properties, Journal of the American Statistical Association, 101, 1418-1429.

Table 1: Simulation results for variable selection when the random error follows the standard normal distribution.

$\tilde{n}$  Truncation percentage  Method    ME Median  ME MAD  TN: CN  TN: IN  RCM (%)
300          0.3                    Proposed  0.220      0.137   13.921  0.000   92.836
300          0.3                    Naive     0.487      0.242   13.908  0.000   91.364
300          0.3                    Oracle    0.109      0.076   14.000  0.000   100.000
300          0.4                    Proposed  0.385      0.232   13.862  0.000   88.679
300          0.4                    Naive     1.048      0.423   13.821  0.000   84.608
300          0.4                    Oracle    0.158      0.112   14.000  0.000   100.000
500          0.3                    Proposed  0.133      0.073   15.995  0.000   99.503
500          0.3                    Naive     0.289      0.135   15.966  0.000   96.723
500          0.3                    Oracle    0.062      0.043   16.000  0.000   100.000
500          0.4                    Proposed  0.217      0.120   15.977  0.000   98.014
500          0.4                    Naive     0.611      0.229   15.938  0.000   94.048
500          0.4                    Oracle    0.084      0.057   16.000  0.000   100.000
Table 2: Simulation results for variable selection when the random error follows the extreme minimum value distribution.

$\tilde{n}$  Truncation percentage  Method    ME Median  ME MAD  TN: CN  TN: IN  RCM (%)
300          0.3                    Proposed  0.226      0.162   13.85   0.00    86.905
300          0.3                    Naive     0.780      0.353   13.82   0.00    83.730
300          0.3                    Oracle    0.129      0.098   14.00   0.00    100.000
300          0.4                    Proposed  0.380      0.293   13.81   0.00    83.765
300          0.4                    Naive     1.232      0.565   13.72   0.00    75.896
300          0.4                    Oracle    0.184      0.132   14.00   0.00    100.000
500          0.3                    Proposed  0.118      0.087   15.98   0.00    98.214
500          0.3                    Naive     0.489      0.208   15.96   0.00    95.933
500          0.3                    Oracle    0.075      0.048   16.00   0.00    100.000
500          0.4                    Proposed  0.197      0.146   15.90   0.00    91.144
500          0.4                    Naive     0.989      0.369   15.88   0.00    89.154
500          0.4                    Oracle    0.107      0.072   16.00   0.00    100.000
Table 3: Description of the 13 covariates used to fit the linear model for the SDSS data

Name    Description
R.A.    Right Ascension
Dec.    Declination
u_mag   Brightness in the u (ultraviolet) band in magnitudes
g_mag   Brightness in the g (green) band
r_mag   Brightness in the r (red) band
i_mag   Brightness in the i (more red) band
z_mag   Brightness in the z (even more red) band
Radio   Brightness in the radio band
X-ray   Brightness in the X-ray band
J_mag   Brightness in the near-infrared J band
H_mag   Brightness in the near-infrared H band
K_mag   Brightness in the near-infrared K band
M_i     The absolute magnitude in the i band
Table 4: Estimation and variable selection results for the SDSS data

Covariate  EST     SEE
R.A.       0       -
Dec.       0       -
u_mag      0.340   0.024
g_mag      -0.320  0.046
r_mag      0       -
i_mag      0.500   0.043
z_mag      0       -
Radio      0       -
X-ray      0       -
J_mag      0.017   0.007
H_mag      0       -
K_mag      0       -
M_i        -0.803  0.017

EST: estimated coefficient; SEE: estimated standard error.

Figure 1: The predicted and original red shift values for the testing set