Debiased Lasso After Sample Splitting for Estimation and Inference in High Dimensional Generalized Linear Models
Abstract
We consider random sample splitting for estimation and inference in high dimensional generalized linear models, where we first apply the lasso to select a submodel using one subsample and then apply the debiased lasso to fit the selected model using the remaining subsample. We show that, whether or not a prespecified subset of regression coefficients is forced into the selected submodel, the debiased lasso estimator of the selected submodel after a single split is asymptotically normal. Furthermore, for a set of prespecified regression coefficients, we show that a multiple splitting procedure based on the debiased lasso can address the loss of efficiency associated with sample splitting and produce asymptotically normal estimates under mild conditions. Our simulation results indicate that using the debiased lasso instead of the standard maximum likelihood estimator in the estimation stage can vastly reduce the bias and variance of the resulting estimates. We illustrate the proposed multiple splitting debiased lasso method with an analysis of the smoking data from the Mid-South Tobacco Case-Control Study.
Keywords: Asymptotic normality, genetic marker, high-dimensional inference, single nucleotide polymorphism (SNP), sparse regression.
1 Introduction
Recent technological advances have allowed biomedical researchers to collect an extraordinary amount of genetic data from patients. Studies that seek to characterize associations between these genetic factors and medical outcomes face numerous challenges.
Firstly, the standard maximum likelihood estimator (MLE) is not well-defined for classical statistical models such as linear and generalized linear models (GLMs) when the covariates are high dimensional, in particular, when the number of covariates exceeds the sample size. Several variable selection and estimation procedures are available for sparse models with high dimensional covariates. The commonly used lasso estimator (Tibshirani, 1996) introduces a penalty that shrinks many coefficient estimates to exactly zero, thus performing simultaneous variable selection and coefficient estimation. Similar approaches include the adaptive lasso (Zou, 2006), the smoothly clipped absolute deviation (SCAD) estimator (Fan and Li, 2001), and elastic net regularization (Zou and Hastie, 2005). Another common approach, which only performs variable selection, is sure independence screening (Fan and Lv, 2008; Fan and Song, 2010).
Secondly, although penalized regression methods can produce coefficient estimates for GLMs, they typically have substantial biases and do not quantify the uncertainty around these estimates. Thus they cannot be used directly to construct confidence intervals or to perform hypothesis tests in the high dimensional setting. More recent progress on statistical inference for high dimensional data includes debiased lasso estimators (Zhang and Zhang, 2014; van de Geer et al., 2014), MLE-based sample splitting approaches (Fei and Li, 2021; Fei et al., 2019), and the decorrelated score test (Ning and Liu, 2017). Other work on hypothesis testing without coefficient estimation includes the sample splitting approach for p-values by Meinshausen et al. (2009) and the test developed by Zhu and Bradic (2018) for linear models without imposing sparsity assumptions on the regression parameters.
Lastly, we note that some post-model-selection inference procedures, such as the simultaneous inference method of Kuchibhotla et al. (2020) and the sample splitting and bootstrap method of Rinaldo et al. (2019), are applicable to high dimensional covariates, although these particular methods only address linear models. The goal of such methods is distinct from ours: they are designed to be valid for arbitrary model selection procedures and under potential model misspecification. This robustness comes at the cost of conservative inference when a model selection procedure, such as the lasso, can select a superset of the correct model with high probability, which is the setting that we consider below.
Our proposed method is based on sample splitting, a two-stage approach where, first, one part of the data is used to select the covariates to be included in the model, and then the other part is used to estimate the now lower-dimensional model (whose dimension is smaller than the sample size but may grow with it). Similar methods have been recently developed for linear models (Fei et al., 2019; Wang et al., 2020) and generalized linear models (Fei and Li, 2021) based on the MLE, as well as for Cox proportional hazards models (Zhang et al., 2022) based on the decorrelated score function. Compared to Fei and Li (2021), we use the lasso instead of sure independence screening for model selection, which allows for milder theoretical assumptions, and the debiased lasso in place of the MLE, which can substantially reduce the bias and variance and improve the confidence interval coverage of the resulting estimators, as shown in our simulations below. Note that debiased lasso estimation would be equivalent to using the MLE in sample splitting approaches for linear models. Our contributions are as follows.
For the model selection component, following Huang and Zhang (2012) we show that, under mild sufficient conditions, the lasso screens out a large portion of the noise variables and selects a model containing the true signal variables with high probability in the case of random design. Existing work on the selected model size and estimation error of the lasso for linear regression includes Bickel et al. (2009), Zhang and Huang (2008), Meinshausen and Yu (2009), and Zhao and Yu (2006), under differing assumptions on the covariates. Similar results were shown for generalized linear models by van de Geer (2008) and Huang and Zhang (2012), only the latter of which addressed model selection consistency and the number of selected noise variables. Their results imply that the lasso selected model size is of the same order as the number of nonzero regression coefficients, with high probability. This is essential because if the selected model size is too large, then its design matrix may not have full rank, even asymptotically, and thus the regression coefficients will still not be estimable via the MLE or even the refined debiased lasso of Xia et al. (2021).
For the lower-dimensional model estimation stage, while a naive approach would be to use the standard maximum likelihood estimator with the selected covariates, as is common in the sample splitting literature, we instead apply the recently developed refined debiased lasso approach of Xia et al. (2021). This method is better suited for models that contain a substantial number of noise variables, as is typically expected after variable selection. This is a more desirable approach because the conditions for the GLM lasso to not select any noise variables, for example, are known to be quite stringent (Huang and Zhang, 2012). We illustrate the potentially large difference in performance through simulations, where the MLE exhibits a strong bias that inflates the size of estimated coefficients, and this bias increases with the true signal strength, in contrast to the approximately unbiased estimates from the debiased lasso. Such tendencies are discussed further by Xia et al. (2021) in the lower-dimensional setting.
For a set of prespecified coefficients, since their estimators based on a single sample split suffer from a loss of efficiency due to only using part of the data for estimation, we further investigate the idea of averaging estimates across multiple sample splits. Fei et al. (2019) proposed a multiple splitting procedure for linear models, where they required model selection consistency for their asymptotic theory. Wang et al. (2020) discussed multiple splitting for a single treatment variable in linear models, under much milder assumptions. Recently, Fei and Li (2021) proposed a multiple splitting procedure for generalized linear models based on the MLE under a partial orthogonality condition, which requires that the signal variables be independent of the noise variables. For our proposed multiple splitting procedure, we apply the debiased lasso estimator instead of the MLE in the estimation stage, and show through simulations that our procedure yields approximately unbiased estimates with substantially reduced variability compared to the MLE-based multiple splitting, which is often biased. For the theoretical development, we adapt the mild conditions of Wang et al. (2020) to GLMs and show the asymptotic normality of our proposed approach. As evidenced by simulations, our multiple splitting estimator can produce confidence intervals with the nominal coverage.
2 Methods
We first provide brief overviews of lasso and debiased lasso methods, then introduce our proposed sample splitting debiased lasso methods for GLMs.
2.1 Notation
For a positive integer $q$, we denote the size of any index set $S \subseteq \{1, \dots, q\}$ by $|S|$. For a $q$-dimensional vector $v$ and a $q \times q$ matrix $M$, let $v_S$ denote the $|S|$-dimensional subvector with entries indexed by $S$, and similarly let $M_{S,S}$ denote the submatrix with rows and columns indexed by $S$. The $\ell_r$ norm of a vector $v$ is denoted by $\|v\|_r$ for $1 \le r \le \infty$. For positive sequences $a_n$ and $b_n$, we write $a_n \lesssim b_n$ if there exist a constant $C$ and an integer $N$ such that $a_n \le C b_n$ for all $n \ge N$; we write $a_n = o(b_n)$ if $a_n / b_n \to 0$ as $n \to \infty$; and we write $a_n \asymp b_n$ if $a_n \lesssim b_n$ and $b_n \lesssim a_n$.
Let $(y_i, x_i)$, $i = 1, \dots, n$, be independent and identically distributed copies of $(y, x)$, where $y$ is a scalar-valued response variable and $x$ is a $p$-dimensional vector of covariates. We consider the high dimensional setting where $p$ can be much larger than $n$. Let $Y = (y_1, \dots, y_n)^\top$ and denote the design matrix by $X$, with $i$th row $x_i^\top$ and $j$th column $X_j$. We assume without loss of generality that $x$ has mean zero. For any function $f(y, x)$, we write $\mathbb{P}_n f = n^{-1} \sum_{i=1}^{n} f(y_i, x_i)$ for its empirical average.
We consider generalized linear models (McCullagh and Nelder, 2019) with canonical link function and known dispersion parameter. Denote the conditional density of $y$ given $x$ by
$$f(y \mid x; \beta^0) \propto \exp\left\{ y\, x^\top \beta^0 - b(x^\top \beta^0) \right\},$$
for a known twice-differentiable function $b(\cdot)$, and denote its first and second derivatives with respect to the linear predictor $\theta = x^\top \beta$ by $b'(\theta)$ and $b''(\theta)$, respectively. The negative log-likelihood for $\beta$ is then $\ell_n(\beta) = -n^{-1}\sum_{i=1}^{n}\left\{ y_i x_i^\top \beta - b(x_i^\top \beta) \right\}$, with score function $\dot\ell_n(\beta) = -n^{-1}\sum_{i=1}^{n} x_i\left\{ y_i - b'(x_i^\top \beta) \right\}$ and Hessian matrix $\ddot\ell_n(\beta) = n^{-1}\sum_{i=1}^{n} b''(x_i^\top \beta)\, x_i x_i^\top$. Denote $\Sigma = E\{\ddot\ell_n(\beta^0)\}$. The $p$-dimensional unknown true coefficient vector $\beta^0$ is assumed to be sparse, and we denote the index set of signal variables by $S_0 = \{j : \beta^0_j \neq 0\}$, with $s_0 = |S_0|$. The quantities $p$, $s_0$, $S_0$, and $\beta^0$ are allowed to change with $n$. For practical applications, one would generally consider the nonzero coefficient values and $s_0$ to be fixed, so that the regression coefficients have the usual interpretation in terms of the conditional mean as $p$ grows with $n$; letting the coefficient values grow with $n$ does not maintain the log odds ratio interpretation of $\beta^0_j$, for example, in a logistic regression model.
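For concreteness, logistic regression, the canonical-link GLM used in our simulations and data analysis, corresponds to the standard special case
$$ b(\theta) = \log\{1 + \exp(\theta)\}, \qquad b'(\theta) = \frac{\exp(\theta)}{1 + \exp(\theta)} = \Pr(y = 1 \mid x), \qquad b''(\theta) = b'(\theta)\{1 - b'(\theta)\}, $$
with $\theta = x^\top \beta^0$, so that each coefficient $\beta^0_j$ retains its usual log odds ratio interpretation.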
2.2 The lasso estimator for generalized linear models
The lasso estimator for the GLM parameters is, given a tuning parameter $\lambda > 0$, a minimizer of the penalized negative log-likelihood,
$$\hat\beta_{\text{lasso}} = \arg\min_{\beta} \left\{ \ell_n(\beta) + \lambda \|\beta\|_1 \right\}.$$
In general there is no closed form solution for $\hat\beta_{\text{lasso}}$, but the objective function is convex and can be optimized efficiently with widely available software (Friedman et al., 2010). The penalization shrinks the estimated coefficients towards zero, with many of their values set exactly to zero, resulting in a sparse estimator $\hat\beta_{\text{lasso}}$. Model selection may then be performed by keeping only the covariates with nonzero estimated coefficients. In practice, a small subset with a finite number of coefficients may be left unpenalized so that they are never excluded from the selected model. In our simulations and data analysis, we do not penalize the intercept.
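As a concrete illustration, the R sketch below fits the GLM lasso with the tuning parameter chosen by 10-fold cross validation, as done in our simulations, and records the selected model; the data dimensions and the binomial family are placeholder assumptions rather than the exact settings used later.

```r
library(glmnet)

## Placeholder data: n observations and p covariates (sizes are illustrative only).
set.seed(1)
n <- 200; p <- 1000
x <- matrix(rnorm(n * p), n, p)
y <- rbinom(n, 1, plogis(0.5 * x[, 1] - 0.5 * x[, 2]))

## GLM lasso with lambda chosen by 10-fold cross validation;
## glmnet leaves the intercept unpenalized by default.
cv_fit <- cv.glmnet(x, y, family = "binomial", nfolds = 10)

## Selected model: covariates with nonzero coefficients at the chosen lambda.
beta_hat <- as.vector(as.matrix(coef(cv_fit, s = "lambda.min")))[-1]  # drop the intercept
S_hat <- which(beta_hat != 0)
```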
2.3 The debiased lasso estimator
Zhang and Zhang (2014) and van de Geer et al. (2014) proposed debiased, also called desparsified, lasso estimators for linear models and GLMs, respectively. This procedure may be used to obtain coefficient estimates and confidence intervals across the entire parameter vector in high dimensional settings. It does not rely on sample splitting or variable selection. Instead, using the lasso estimator as an initial value, it performs a single Newton iteration for minimizing the negative log-likelihood $\ell_n(\beta)$, using an approximate inverse of the Hessian matrix, such as a nodewise lasso estimator. The resulting desparsified lasso estimator is no longer sparse and is, under further conditions, asymptotically unbiased with a normal distribution.
A key assumption for accurately estimating the inverse of the Hessian matrix in high dimensional settings, however, is a sparsity assumption on that inverse. For generalized linear models, each entry of this matrix generally depends on all of the signal variables due to the non-identity link function, so the sparsity assumption is difficult to interpret and often may not hold in practice.
Such issues were discussed further by Xia et al. (2021), who proposed a refined debiased lasso method for lower dimensional GLMs with a diverging number of covariates. In that setting and under some standard conditions, the Hessian matrix $\ddot\ell_n(\hat\beta_{\text{lasso}})$ is directly invertible with probability going to one, and the resulting estimator
$$\hat b = \hat\beta_{\text{lasso}} - \left\{ \ddot\ell_n(\hat\beta_{\text{lasso}}) \right\}^{-1} \dot\ell_n(\hat\beta_{\text{lasso}}), \qquad (1)$$
with estimated covariance matrix $n^{-1}\{\ddot\ell_n(\hat\beta_{\text{lasso}})\}^{-1}$, was shown to be asymptotically normal. Additionally, it outperforms both the original desparsified lasso estimator and the MLE in several simulation settings of lower-dimensional GLMs with sparse regression coefficients. It can also be computed using standard GLM software packages for the lasso and MLE.
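The one-step construction in (1) can be carried out with standard R tools, as noted above. The following sketch uses the glm control and start arguments to take a single iteration from a lasso initial value in a low-dimensional logistic model; the data and dimensions are illustrative assumptions.

```r
library(glmnet)

## Placeholder low-dimensional data (sizes are illustrative only).
set.seed(2)
n <- 300; q <- 10
x <- matrix(rnorm(n * q), n, q)
y <- rbinom(n, 1, plogis(x[, 1] - 0.8 * x[, 2]))

## Initial lasso fit, with lambda chosen by cross validation.
cv_fit <- cv.glmnet(x, y, family = "binomial")
beta_init <- as.vector(as.matrix(coef(cv_fit, s = "lambda.min")))  # intercept + q slopes

## One IRLS step from the lasso initial value (for a canonical link this is the
## Newton step in (1)); maxit = 1 triggers a harmless non-convergence warning.
dat <- data.frame(y = y, x)
fit_db <- suppressWarnings(
  glm(y ~ ., data = dat, family = binomial(),
      start = beta_init, control = glm.control(maxit = 1))
)

b_hat <- coef(fit_db)                 # debiased estimates
V_hat <- vcov(fit_db)                 # estimated covariance matrix
ci <- cbind(b_hat - 1.96 * sqrt(diag(V_hat)),
            b_hat + 1.96 * sqrt(diag(V_hat)))   # 95% Wald confidence intervals
```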
2.4 The debiased lasso after single sample splitting
We propose a debiased lasso after sample splitting procedure that uses the lasso for model selection and the refined debiased lasso for estimation and inference. For a fixed splitting proportion $\kappa \in (0, 1)$, we randomly split the sample into two subsets: $\mathcal{D}_1$, containing $\lfloor \kappa n \rfloor$ individuals to be used for variable selection, and $\mathcal{D}_2$, containing the remaining individuals to be used for estimation. For a pre-specified fixed set of coefficient indices $M$, which is allowed to be empty, do the following:

1. using the subsample $\mathcal{D}_1$, fit a GLM lasso estimator to obtain the selected model $\widehat{S} \supseteq M$;

2. using the subsample $\mathcal{D}_2$, compute the debiased lasso estimator $\widehat{b}$ of $\beta^0_{\widehat{S}}$ based on a GLM with covariates $x_{\widehat{S}}$, and use the estimated covariance matrix of $\widehat{b}$ to construct $(1 - \alpha)100\%$ confidence intervals for any contrast of $\beta^0_{\widehat{S}}$ based on a normal approximation.
Because the two subsamples are independent, we can make asymptotically valid statistical inference on any covariate in $\widehat{S}$ as long as this set contains the true signal variables when $n$ is large enough. In practice, covariates with small coefficients may not be selected in the first stage, so drawing conclusions about the statistical significance of variables omitted in step 1 is not advised. The large bias that can potentially occur when $M$ is empty is illustrated through simulations in Section 4. Furthermore, although the fitted model must be low-dimensional, this does not prevent one from obtaining estimates for each individual coefficient. For example, step 2 may be repeated for each individual coefficient, i.e., with $M = \{j\}$ for $j = 1, \dots, p$, along with an appropriate correction for multiple testing when computing p-values and confidence intervals.
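A compact, self-contained R sketch of the two steps above for logistic regression is given below; the splitting proportion, the forced-in index set, and all object names are illustrative assumptions.

```r
library(glmnet)

## Single-split debiased lasso: lasso selection on D1, one-step debiasing on D2.
## 'keep' contains indices of pre-specified covariates that are always retained.
## (The sketch assumes at least two covariates end up selected, as required by glmnet.)
single_split_dblasso <- function(x, y, keep = integer(0), prop = 0.5) {
  n <- nrow(x)
  idx1 <- sample(n, floor(prop * n))      # D1: selection subsample
  idx2 <- setdiff(seq_len(n), idx1)       # D2: estimation subsample

  ## Step 1: lasso selection on D1 (lambda by 10-fold cross validation).
  cv1 <- cv.glmnet(x[idx1, ], y[idx1], family = "binomial", nfolds = 10)
  b1 <- as.vector(as.matrix(coef(cv1, s = "lambda.min")))[-1]
  S_hat <- sort(union(which(b1 != 0), keep))

  ## Step 2: refined debiased lasso on D2 for the selected covariates:
  ## a lasso initial value followed by a single Newton/IRLS step via glm.
  x2 <- x[idx2, S_hat, drop = FALSE]
  cv2 <- cv.glmnet(x2, y[idx2], family = "binomial")
  init <- as.vector(as.matrix(coef(cv2, s = "lambda.min")))
  fit <- suppressWarnings(
    glm(y ~ ., data = data.frame(y = y[idx2], x2), family = binomial(),
        start = init, control = glm.control(maxit = 1))
  )
  est <- coef(fit)
  se <- sqrt(diag(vcov(fit)))
  list(selected = S_hat, estimate = est, se = se,
       ci = cbind(lower = est - 1.96 * se, upper = est + 1.96 * se))
}
```

Calling this function with keep set to a single prespecified index corresponds to the setting examined in our simulations, and keep = integer(0) corresponds to the post-model selection analysis of Table 3.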
2.5 The debiased lasso after multiple sample splitting
As previously noted, the single sample splitting estimator suffers from a loss of efficiency due to only using a fraction of the total sample to estimate $\beta^0$. The following multiple splitting procedure addresses this issue. For a pre-specified fixed set of coefficient indices $M$, we now generate $B$ random splits of the sample, denoted $(\mathcal{D}_{b1}, \mathcal{D}_{b2})$ for $b = 1, \dots, B$, and do the following:

1. for $b = 1, \dots, B$,

(a) using the subsample $\mathcal{D}_{b1}$, fit a GLM lasso estimator to obtain the selected model $\widehat{S}_b \supseteq M$;

(b) using the subsample $\mathcal{D}_{b2}$, compute the debiased lasso estimator $\widehat{\beta}^{(b)}_M$ of $\beta^0_M$ based on a GLM with covariates $x_{\widehat{S}_b}$;

2. finally, the multiple splitting estimators are $\widehat{\beta}^{\mathrm{MS}}_j = B^{-1}\sum_{b=1}^{B} \widehat{\beta}^{(b)}_j$ for $j \in M$.
Note that all target coefficients to be estimated must be pre-specified, in contrast to the single split procedure. For exploratory analysis, where step 1(b) may be repeated for each individual coefficient, the multiple splitting estimator typically faces a substantially stricter significance threshold than single split inference on $\widehat{S}$, because it must correct for $p$, rather than $|\widehat{S}|$, multiple comparisons.
We apply the same variance estimator as Fei and Li (2021), based on the nonparametric delta method with a bias correction for controlling the Monte Carlo error (Efron, 2014; Wager et al., 2014). Letting $e_b$ denote the $n$-vector of sampling indicators such that $e_{bi}$ is equal to one if observation $i$ belongs to the estimation subsample $\mathcal{D}_{b2}$, and is zero otherwise, the variance of $\widehat{\beta}^{\mathrm{MS}}_j$ is estimated by summing, over observations, the squared empirical covariances (across the $B$ splits) between $e_{bi}$ and $\widehat{\beta}^{(b)}_j$, and then subtracting a finite-$B$ bias correction term; $(1 - \alpha)100\%$ confidence intervals may then be constructed based on a normal approximation. The estimated covariance matrix for a vector of target coefficients is obtained analogously.
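A self-contained R sketch of the multiple splitting procedure for a single prespecified coefficient is shown below, including an infinitesimal-jackknife style variance with a finite-B bias correction in the spirit of Efron (2014) and Wager et al. (2014); the exact variance formula of Fei and Li (2021) may differ in its details, and all object names, defaults, and the indicator convention are illustrative assumptions.

```r
library(glmnet)

## Multiple-splitting debiased lasso for one prespecified coefficient index 'target'.
## The variance uses an infinitesimal-jackknife form with a finite-B bias correction;
## this is a sketch, not necessarily the exact formula used in the paper.
multi_split_dblasso <- function(x, y, target, B = 100, prop = 0.5) {
  n <- nrow(x)
  theta <- numeric(B)        # per-split debiased estimates of the target coefficient
  E <- matrix(0L, B, n)      # E[b, i] = 1 if observation i is in the estimation subsample

  for (b in seq_len(B)) {
    idx1 <- sample(n, floor(prop * n))     # selection subsample
    idx2 <- setdiff(seq_len(n), idx1)      # estimation subsample
    E[b, idx2] <- 1L

    ## Lasso selection on the first subsample, always keeping the target covariate.
    cv1 <- cv.glmnet(x[idx1, ], y[idx1], family = "binomial")
    b1 <- as.vector(as.matrix(coef(cv1, s = "lambda.min")))[-1]
    S <- sort(union(which(b1 != 0), target))

    ## Refined debiased lasso on the second subsample (one IRLS step from a lasso start).
    x2 <- x[idx2, S, drop = FALSE]
    cv2 <- cv.glmnet(x2, y[idx2], family = "binomial")
    init <- as.vector(as.matrix(coef(cv2, s = "lambda.min")))
    fit <- suppressWarnings(
      glm(y ~ ., data = data.frame(y = y[idx2], x2), family = binomial(),
          start = init, control = glm.control(maxit = 1))
    )
    theta[b] <- coef(fit)[1 + which(S == target)]   # skip the intercept
  }

  est <- mean(theta)
  ## Infinitesimal jackknife: per-observation covariance between inclusion and estimates.
  cov_i <- colMeans((E - matrix(colMeans(E), B, n, byrow = TRUE)) * (theta - est))
  v_ij <- sum(cov_i^2)
  v_bc <- v_ij - (n / B^2) * sum((theta - est)^2)   # finite-B bias correction (sketch)
  se <- sqrt(max(v_bc, 0))
  list(estimate = est, se = se, ci = est + c(-1.96, 1.96) * se)
}
```

For a set of target coefficients, the same loop can record a vector per split, and the covariance matrix estimate follows from the analogous formula applied entrywise.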
Both single and multiple sample splitting procedures require that the Hessian matrices computed from each estimation split are invertible. One way to encourage this is to set a hard limit on the selected model size, which corresponds to a lower bound on the tuning parameter $\lambda$ in the lasso. In our simulations and real data analysis we do not restrict $\lambda$, which is chosen by cross validation, and instead allow the glm function in R to follow its default behavior of automatically dropping covariates from the model when the design matrix is ill-conditioned.
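If a hard cap on the selected model size is preferred, one convenient option (not used in our analyses) is the dfmax argument of glmnet, sketched below with placeholder data and a placeholder cap.

```r
library(glmnet)

## Placeholder selection subsample.
set.seed(3)
x1 <- matrix(rnorm(100 * 500), 100, 500)
y1 <- rbinom(100, 1, plogis(x1[, 1]))

## Cap the number of selected covariates so that the downstream GLM Hessian
## is more likely to be invertible; the cap of n/4 is a placeholder choice.
max_size <- floor(nrow(x1) / 4)
cv_fit <- cv.glmnet(x1, y1, family = "binomial", dfmax = max_size)
S_hat <- which(as.vector(as.matrix(coef(cv_fit, s = "lambda.min")))[-1] != 0)
```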
As the number of random splits $B$ grows, the gains in efficiency typically diminish relative to the increased computational cost. Although our theoretical results consider the maximum possible value of $B$, our simulation results indicate that a moderate number of splits is sufficient for many practical settings with up to several hundreds of individuals in the sample.
3 Theoretical Results
For the single split estimator, we make the following assumptions:
1. The covariates $x_i$ are sub-Gaussian random vectors and $\|x_i\|_\infty$ is bounded almost surely by some constant.
2. The matrices $E(x x^\top)$ and $\Sigma = E\{b''(x^\top \beta^0)\, x x^\top\}$ are positive definite with eigenvalues bounded from above and away from zero.
3. The first and second derivatives of the negative log-likelihood contribution with respect to the linear predictor exist for all values of the response, the second derivative is locally Lipschitz in a neighborhood of the true linear predictor $x^\top \beta^0$, and there exist constants bounding these derivatives in that neighborhood.
4. The linear predictor $|x^\top \beta^0|$ is bounded above almost surely.
5. The sparsity $s_0$ grows slowly enough with $n$ (at the rate specified in Theorem 1), and the nonzero regression parameters are large enough that the minimum signal strength dominates the lasso estimation error as $n \to \infty$.
Assumptions 1, 2, and 4 are typical in the high dimensional literature. Bounded covariates are common in practice, including dummy variables for categorical covariates, minor allele counts, and physical measurements. Assumption 3 is required in order to apply the results for the refined debiased lasso estimator of Xia et al. (2021). All but the boundedness condition on the first derivative are satisfied when the function $b$ is twice continuously differentiable, as is the case with commonly used GLMs, due to Assumption 4. This first derivative, which involves $y - b'(x^\top \beta^0)$, is also bounded if the response variable has bounded support, such as in logistic regression, and this condition can be relaxed to include other common GLMs by instead assuming sub-exponential tails, as in Lemma 3.4 of Ning and Liu (2017). Assumption 5 guarantees that, with high probability, model selection based on the lasso does not miss any signal variables, so that the selected model is not misspecified. This is a standard condition for sample splitting procedures that is not required by the desparsified lasso (van de Geer et al., 2014).
For multiple splitting, we make the following additional assumptions for the set of target covariates $M$, expressed through a contrast vector $c$, a $p$-vector such that $c_j = 0$ for $j \notin M$ and $\|c\|_2 = 1$. Let $\mathcal{D}$ denote the entire data set.
6. For independent sampling indicator vectors $e$ and $e'$, with corresponding fitted models $\widehat{S}(e)$ and $\widehat{S}(e')$, define the split-averaged projection vectors discussed in Appendices C and D, where the expectations are taken with respect to the sampling indicators, conditional on the data. We assume that the effect of always excluding a single observation from model selection on these averages is asymptotically negligible, and that the Hessian matrix is invertible when computed from any estimation subsample.
7. There exists a random $p$-vector, independent of the data, that is bounded and to which the split-averaged projection vector of Assumption 6 converges in probability at a moderate rate.
Similar conditions were used and discussed further by Wang et al. (2020) in the context of linear models. Since, for low-dimensional GLMs, the refined debiased lasso estimator is asymptotically equivalent to the MLE, in the sense that their difference is $o_p(n^{-1/2})$ componentwise, our theoretical arguments also apply to multiple splitting with the MLE, using the lasso for model selection. Fei and Li (2021) used a stronger partial orthogonality condition, where the signal variables are independent of the noise variables, to derive the asymptotic normality of their MLE-based multiple splitting estimator. We further discuss the motivation behind Assumptions 6 and 7 and their relationship to the assumptions used in other work on sample splitting procedures in Appendix D, but provide a brief overview here.
Under Assumptions 1-5 and the sparsity rates required in Theorem 3 below, it can be shown that the multiple splitting estimator admits an asymptotically linear representation whose leading term involves, for each observation, a projection vector averaged across splits. Therefore, in order to prove asymptotic normality, we need to control the randomness of this vector averaged across all sample splits that place the $i$th observation in the estimation subsample (i.e., exclude it from model selection). The ideal case, in which this average is deterministic, can be achieved under model selection consistency, so that the selected model is fixed at $S_0 \cup M$ for each sample split, or if the covariates that are always included in the fitted model are independent of all remaining covariates, essentially giving a block diagonal structure. Either of these two conditions implies that our less restrictive assumptions hold, where we only require that the effect of always excluding a single observation from model selection is asymptotically negligible (Assumption 6) and that the average converges in probability to a bounded random vector at a moderate rate (Assumption 7).
The variable screening properties and asymptotic normality of our estimators are presented in the following three theorems, with their proofs given in Appendix A through Appendix C, respectively. Note that any variable selection procedure with the same screening properties as the lasso may be used in our sample splitting procedures.
Theorem 1.
Theorem 2.
4 Simulations
We apply the proposed sample splitting estimators in several simulation settings with high dimensional covariates in order to assess the potential benefits of using the lasso for model selection and the debiased lasso in place of the MLE for statistical inference, and of averaging estimates across multiple sample splits (MS) as opposed to using a single split (SS) estimator for a fixed set of coefficients. Simulation results are all from logistic regression models, where the benefits of using the debiased lasso are particularly apparent. This may be due to the numerical instability of the Hessian matrix of the negative log-likelihood, $n^{-1}\sum_{i=1}^{n} \sigma(x_i^\top\beta)\{1 - \sigma(x_i^\top\beta)\}\, x_i x_i^\top$, where $\sigma(\cdot)$ is the cdf of the standard logistic distribution. Note that $\sigma(x_i^\top\beta)$ will be close to zero or one for large coefficient values, which can result in a near-singular Hessian matrix. Therefore the debiased lasso approach of performing a single-step maximization of the log-likelihood starting from an initial lasso estimator that is biased towards zero can help alleviate this issue.
In all simulations, samples are generated from a logistic regression model with a high dimensional set of covariates. The covariates are generated from a multivariate normal distribution before being truncated at a fixed threshold. The covariance matrix has ones on the diagonal and either an AR(1) correlation structure or a compound symmetry (exchangeable) correlation structure. Further simulations that demonstrate the robustness of our procedures under higher covariate correlations are presented in Appendix E. There are six signal covariates, and their index set was randomly chosen. The corresponding coefficient values are $\pm 0.5$, $\pm 1$, and $\pm 1.5$.
We consider all signal covariates together with two randomly chosen noise covariates. For each such coefficient $\beta^0_j$, we assess the single and multiple splitting debiased lasso estimators fit with $M = \{j\}$, as well as the corresponding MLE-based estimators. The splitting proportion is one half. We also provide results from the oracle MLE, fit on the true signal set together with the target covariate, for reference, using either the entire sample or half of the sample. Model selection is performed using lasso estimators with the tuning parameter $\lambda$ chosen by 10-fold cross validation. The multiple splitting estimators average over $B$ random splits, with $B$ fixed in advance. We use the R package glmnet for lasso estimation and the built-in glm function for MLE and debiased lasso estimation, using the control and start arguments to specify a single iteration starting at an initial lasso estimate for the latter.
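For reference, the following R sketch generates data of the type described above; the sample size, dimension, correlation parameter, truncation point, and signal positions are placeholder assumptions, not necessarily the values used for Table 1.

```r
## Placeholder simulated design with AR(1) (or compound symmetry) correlation.
set.seed(2024)
n <- 400; p <- 1000; rho <- 0.5; trunc <- 3     # illustrative values only

Sigma_ar1 <- rho^abs(outer(1:p, 1:p, "-"))      # AR(1) correlation matrix
# Sigma_cs <- matrix(rho, p, p); diag(Sigma_cs) <- 1   # compound symmetry alternative

R <- chol(Sigma_ar1)
x <- matrix(rnorm(n * p), n, p) %*% R           # multivariate normal covariates
x <- pmin(pmax(x, -trunc), trunc)               # truncate extreme values

beta <- numeric(p)
signal_idx <- sample(p, 6)                      # six randomly placed signal covariates
beta[signal_idx] <- c(-1.5, -1, -0.5, 0.5, 1, 1.5)

y <- rbinom(n, 1, plogis(drop(x %*% beta)))     # logistic responses
```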
The simulation results for the AR(1) correlation structure are summarized in Table 1. The single split MLE exhibits a bias that greatly inflates the size of the estimated coefficients, and the magnitude of this bias increases with the signal strength. In contrast, the single split debiased lasso estimator is approximately unbiased. Averaging across multiple splits does not improve the bias of the MLE, which is also apparent to a lesser extent in the logistic regression simulations of Fei and Li, (2021) summarized in their Table 3. The MLE has substantially higher variability than the debiased lasso, even for the multiple splitting estimators. The 95% confidence interval coverage for noise variables is roughly the same across all considered methods. For signal variables, however, the MLE has poor coverage that actually worsens after multiple splitting. This issue appears to be particularly severe in logistic regression models and is mild in some other GLMs such as Poisson regression. In contrast, the debiased lasso after multiple splitting performs well in achieving the nominal 95% coverage for all considered coefficients. All single split standard errors tend to be underestimated for signal variables, leading to slight undercoverage, while the multiple splitting standard errors are approximately unbiased. Multiple splitting drastically lowers the variability of estimates from either the MLE or debiased lasso compared to a single split. This produces a dramatic improvement in rejection rate for small coefficients. Note that the rejection rates for the MLE estimators are inflated due to their bias, which partially offsets the wider confidence interval length. In summary, for each pre-specified coefficient, the multiple splitting debiased lasso estimator provides the best performance in terms of bias, variability, and confidence interval coverage in this simulation setting, where the correlation between covariates decays rather quickly.
Table 1: Simulation results for the AR(1) correlation structure.

Metric | Estimator | -1.50 | -1.00 | -0.50 | 0.00 | 0.00 | 0.50 | 1.00 | 1.50
---|---|---|---|---|---|---|---|---|---
Selection Rate | 1.00 | 1.00 | 0.64 | 0.13 | 0.07 | 0.69 | 1.00 | 1.00 | |
Bias | Debiased SS | 0.01 | 0.06 | 0.00 | -0.04 | -0.01 | -0.03 | 0.00 | 0.00 |
MLE SS | -0.52 | -0.30 | -0.24 | -0.06 | -0.01 | 0.18 | 0.37 | 0.54 | |
Debiased MS | 0.01 | 0.04 | 0.02 | -0.02 | 0.00 | -0.02 | -0.01 | 0.00 | |
MLE MS | -0.59 | -0.37 | -0.21 | -0.02 | -0.01 | 0.21 | 0.40 | 0.60 | |
Oracle | -0.09 | -0.05 | 0.00 | -0.01 | 0.00 | 0.04 | 0.07 | 0.09 | |
Oracle | -0.05 | -0.01 | -0.01 | -0.02 | 0.00 | 0.01 | 0.04 | 0.05 | |
Coverage | Debiased SS | 0.84 | 0.89 | 0.92 | 0.94 | 0.95 | 0.88 | 0.87 | 0.89 |
MLE SS | 0.73 | 0.86 | 0.84 | 0.90 | 0.92 | 0.86 | 0.77 | 0.69 | |
Debiased MS | 0.94 | 0.94 | 0.93 | 0.95 | 0.97 | 0.96 | 0.96 | 0.95 | |
MLE MS | 0.39 | 0.64 | 0.81 | 0.94 | 0.94 | 0.79 | 0.58 | 0.40 | |
Oracle | 0.95 | 0.96 | 0.94 | 0.95 | 0.96 | 0.94 | 0.96 | 0.97 | |
Oracle | 0.96 | 0.95 | 0.94 | 0.92 | 0.94 | 0.95 | 0.96 | 0.96 | |
Rejection Rate | Debiased SS | 1.00 | 0.99 | 0.74 | 0.06 | 0.05 | 0.62 | 1.00 | 1.00 |
() | MLE SS | 1.00 | 0.99 | 0.77 | 0.10 | 0.08 | 0.71 | 1.00 | 1.00 |
Debiased MS | 1.00 | 1.00 | 0.92 | 0.05 | 0.03 | 0.94 | 1.00 | 1.00 | |
MLE MS | 1.00 | 1.00 | 0.94 | 0.06 | 0.06 | 0.95 | 1.00 | 1.00 | |
Oracle | 1.00 | 1.00 | 0.74 | 0.05 | 0.04 | 0.80 | 1.00 | 1.00 | |
Oracle | 1.00 | 1.00 | 0.97 | 0.09 | 0.06 | 0.98 | 1.00 | 1.00 | |
Standard Error | Debiased SS | 0.22 | 0.20 | 0.19 | 0.21 | 0.18 | 0.19 | 0.20 | 0.22 |
MLE SS | 0.37 | 0.30 | 0.28 | 0.29 | 0.25 | 0.27 | 0.31 | 0.37 | |
Debiased MS | 0.18 | 0.15 | 0.14 | 0.14 | 0.12 | 0.14 | 0.16 | 0.18 | |
MLE MS | 0.27 | 0.22 | 0.20 | 0.22 | 0.19 | 0.20 | 0.22 | 0.27 | |
Oracle | 0.26 | 0.22 | 0.20 | 0.22 | 0.19 | 0.20 | 0.22 | 0.26 | |
Oracle | 0.18 | 0.15 | 0.14 | 0.15 | 0.13 | 0.14 | 0.15 | 0.18 | |
Empirical SD | Debiased SS | 0.30 | 0.23 | 0.21 | 0.22 | 0.18 | 0.24 | 0.26 | 0.29 |
MLE SS | 0.58 | 0.41 | 0.36 | 0.38 | 0.31 | 0.40 | 0.48 | 0.56 | |
Debiased MS | 0.17 | 0.14 | 0.14 | 0.14 | 0.12 | 0.14 | 0.14 | 0.18 | |
MLE MS | 0.29 | 0.22 | 0.22 | 0.23 | 0.20 | 0.21 | 0.24 | 0.30 | |
Oracle | 0.28 | 0.22 | 0.20 | 0.22 | 0.19 | 0.20 | 0.22 | 0.26 | |
Oracle | 0.17 | 0.14 | 0.14 | 0.16 | 0.14 | 0.14 | 0.14 | 0.17 |
The simulation results for the compound symmetry correlation structure are summarized in Table 2. The same trends concerning the bias and large variability of the MLE seen in Table 1 are also present in this setting. Again the debiased lasso estimators are approximately unbiased, and the multiple splitting debiased lasso estimator has approximately 95% coverage for each coefficient. Multiple splitting again greatly reduces the variability of the single split estimators, resulting in narrower confidence intervals with more power for detecting small coefficient values.
Table 2: Simulation results for the compound symmetry correlation structure.

Metric | Estimator | -1.50 | -1.00 | -0.50 | 0.00 | 0.00 | 0.50 | 1.00 | 1.50
---|---|---|---|---|---|---|---|---|---
Selection Rate | 1.00 | 0.96 | 0.51 | 0.07 | 0.05 | 0.44 | 0.98 | 1.00 | |
Bias | Debiased SS | 0.01 | 0.04 | 0.04 | -0.02 | 0.00 | 0.00 | -0.02 | -0.05 |
MLE SS | -0.42 | -0.27 | -0.13 | -0.02 | -0.01 | 0.19 | 0.30 | 0.37 | |
Debiased MS | 0.02 | 0.03 | 0.03 | -0.01 | 0.00 | -0.02 | -0.02 | -0.02 | |
MLE MS | -0.39 | -0.26 | -0.14 | -0.01 | 0.00 | 0.15 | 0.26 | 0.39 | |
Oracle | -0.05 | -0.04 | -0.02 | 0.00 | 0.00 | 0.01 | 0.04 | 0.07 | |
Oracle | -0.03 | -0.02 | -0.02 | -0.01 | 0.00 | 0.02 | 0.02 | 0.03 | |
Coverage | Debiased SS | 0.88 | 0.92 | 0.92 | 0.94 | 0.96 | 0.92 | 0.90 | 0.89 |
MLE SS | 0.80 | 0.89 | 0.91 | 0.90 | 0.93 | 0.86 | 0.85 | 0.82 | |
Debiased MS | 0.94 | 0.96 | 0.94 | 0.96 | 0.96 | 0.94 | 0.94 | 0.96 | |
MLE MS | 0.68 | 0.84 | 0.90 | 0.96 | 0.94 | 0.92 | 0.81 | 0.76 | |
Oracle | 0.94 | 0.94 | 0.95 | 0.91 | 0.94 | 0.92 | 0.96 | 0.94 | |
Oracle | 0.96 | 0.98 | 0.96 | 0.92 | 0.92 | 0.96 | 0.96 | 0.98 | |
Rejection Rate | Debiased SS | 1.00 | 0.99 | 0.49 | 0.06 | 0.04 | 0.57 | 0.98 | 0.99 |
() | MLE SS | 1.00 | 1.00 | 0.56 | 0.10 | 0.07 | 0.63 | 0.98 | 0.99 |
Debiased MS | 1.00 | 1.00 | 0.82 | 0.04 | 0.04 | 0.87 | 1.00 | 1.00 | |
MLE MS | 1.00 | 1.00 | 0.84 | 0.04 | 0.06 | 0.88 | 1.00 | 1.00 | |
Oracle | 1.00 | 1.00 | 0.66 | 0.09 | 0.06 | 0.66 | 1.00 | 1.00 | |
Oracle | 1.00 | 1.00 | 0.92 | 0.07 | 0.07 | 0.92 | 1.00 | 1.00 | |
Standard Error | Debiased SS | 0.25 | 0.23 | 0.23 | 0.23 | 0.23 | 0.23 | 0.23 | 0.25 |
MLE SS | 0.36 | 0.32 | 0.30 | 0.28 | 0.29 | 0.30 | 0.32 | 0.36 | |
Debiased MS | 0.19 | 0.17 | 0.16 | 0.15 | 0.15 | 0.16 | 0.17 | 0.19 | |
MLE MS | 0.27 | 0.23 | 0.21 | 0.21 | 0.21 | 0.21 | 0.23 | 0.27 | |
Oracle | 0.27 | 0.24 | 0.22 | 0.22 | 0.22 | 0.22 | 0.24 | 0.27 | |
Oracle | 0.18 | 0.17 | 0.15 | 0.15 | 0.15 | 0.15 | 0.17 | 0.18 | |
Empirical SD | Debiased SS | 0.29 | 0.25 | 0.26 | 0.22 | 0.24 | 0.26 | 0.27 | 0.29 |
MLE SS | 0.56 | 0.43 | 0.39 | 0.34 | 0.38 | 0.40 | 0.50 | 0.53 | |
Debiased MS | 0.18 | 0.16 | 0.16 | 0.15 | 0.15 | 0.16 | 0.17 | 0.18 | |
MLE MS | 0.28 | 0.24 | 0.22 | 0.22 | 0.22 | 0.21 | 0.24 | 0.27 | |
Oracle | 0.28 | 0.25 | 0.23 | 0.25 | 0.23 | 0.23 | 0.25 | 0.29 | |
Oracle | 0.18 | 0.15 | 0.16 | 0.16 | 0.16 | 0.15 | 0.16 | 0.17 |
Lastly, we assess the performance of the post-model selection procedure that does not pre-specify any covariates of interest. This simulation setting is identical to that of Table 1 with the AR(1) correlation structure, but the estimates are now all based on a single model fit on only the selected covariates. For covariates that are not selected, their coefficient estimate and standard error are both set to zero. The oracle MLE results we present are estimated on the true signal set $S_0$ rather than $S_0 \cup \{j\}$.
Table 3: Simulation results for the post-model selection procedure with no prespecified covariates (AR(1) correlation structure).

Metric | Estimator | -1.50 | -1.00 | -0.50 | 0.00 | 0.00 | 0.50 | 1.00 | 1.50
---|---|---|---|---|---|---|---|---|---
Selection Rate | 1.00 | 1.00 | 0.64 | 0.13 | 0.07 | 0.69 | 1.00 | 1.00 | |
Bias | Debiased SS | 0.00 | 0.06 | 0.18 | -0.01 | 0.00 | -0.18 | 0.00 | 0.00 |
MLE SS | -0.66 | -0.35 | 0.02 | -0.02 | 0.01 | -0.03 | 0.44 | 0.66 | |
Oracle | -0.09 | -0.05 | 0.00 | 0.00 | 0.00 | 0.04 | 0.07 | 0.09 | |
Coverage | Debiased SS | 0.83 | 0.90 | 0.57 | 0.99 | 1.00 | 0.59 | 0.87 | 0.89 |
MLE SS | 0.70 | 0.83 | 0.51 | 0.98 | 1.00 | 0.57 | 0.75 | 0.66 | |
Oracle | 0.95 | 0.96 | 0.94 | 1.00 | 1.00 | 0.94 | 0.96 | 0.97 | |
Rejection Rate | Debiased SS | 1.00 | 0.99 | 0.47 | 0.01 | 0.00 | 0.42 | 1.00 | 1.00 |
() | MLE SS | 1.00 | 0.99 | 0.49 | 0.02 | 0.00 | 0.45 | 1.00 | 1.00 |
Oracle | 1.00 | 1.00 | 0.74 | 0.00 | 0.00 | 0.80 | 1.00 | 1.00 | |
Standard Error | Debiased SS | 0.22 | 0.20 | 0.12 | 0.03 | 0.01 | 0.13 | 0.20 | 0.22 |
MLE SS | 0.40 | 0.32 | 0.18 | 0.04 | 0.02 | 0.20 | 0.33 | 0.40 | |
Oracle | 0.26 | 0.22 | 0.20 | 0.00 | 0.00 | 0.20 | 0.22 | 0.26 | |
Empirical SD | Debiased SS | 0.32 | 0.23 | 0.30 | 0.10 | 0.04 | 0.30 | 0.26 | 0.29 |
MLE SS | 0.83 | 0.49 | 0.47 | 0.18 | 0.07 | 0.50 | 0.55 | 0.75 | |
Oracle | 0.28 | 0.22 | 0.20 | 0.00 | 0.00 | 0.20 | 0.22 | 0.26 |
The simulation results are summarized in Table 3. For small coefficients, there is poor confidence interval coverage across all non-oracle estimators due to the randomness associated with their inclusion in each model. For larger coefficients that are nearly always selected by the lasso, the performance resembles that of Table 1. These results demonstrate the importance of only analyzing coefficients that are included in the fitted model. For larger sample sizes, the lasso selection performance can improve substantially, leading to improved coverage and bias correction of the debiased lasso after model selection; see Appendix E for additional simulation results with a larger sample size. We also provide additional simulation results with higher correlations among covariates in Appendix E.
5 Real Data Example: The Mid-South Tobacco Case-Control Study
We apply the proposed method to a dataset of single nucleotide polymorphisms (SNPs) from a sample of African-American participants in the Mid-South Tobacco Case-Control study population (Jiang et al., 2019; Xu et al., 2020; Han et al., 2022) to assess genetic risk factors for nicotine dependence. The dataset is publicly available with GEO accession GSE148375. It originally contained 242,901 SNPs measured across 3399 individuals. We excluded SNPs that were insertions or deletions, had a call rate less than 95%, were not in Hardy-Weinberg equilibrium, or had a minor allele frequency of less than 0.01. Subjects missing more than 1% of the remaining SNPs were excluded, and missing values were then imputed using the observed allele frequencies for each SNP. After data cleaning, the covariates consist of 32,557 SNPs as well as gender and age. The response variable is a binary indicator of smoking status (1 = smoker), where 1607 of the 3317 participants are smokers.
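The quality-control filters just described can be expressed compactly in R; the sketch below uses a small fake genotype matrix, a simple chi-square test for Hardy-Weinberg equilibrium, and a placeholder HWE p-value cutoff, so it illustrates the steps rather than reproducing the exact pipeline used for the study data.

```r
## Fake genotype matrix (individuals x SNPs, minor allele counts 0/1/2, NA = missing).
set.seed(3)
geno <- matrix(rbinom(200 * 50, 2, 0.3), 200, 50)
geno[sample(length(geno), 50)] <- NA

## SNP-level filters: call rate, minor allele frequency, and a chi-square HWE test.
call_rate <- colMeans(!is.na(geno))
maf <- colMeans(geno, na.rm = TRUE) / 2
maf <- pmin(maf, 1 - maf)
hwe_p <- apply(geno, 2, function(g) {
  g <- g[!is.na(g)]
  m <- length(g); pa <- mean(g) / 2
  obs <- c(sum(g == 0), sum(g == 1), sum(g == 2))
  expc <- m * c((1 - pa)^2, 2 * pa * (1 - pa), pa^2)
  pchisq(sum((obs - expc)^2 / pmax(expc, 1e-12)), df = 1, lower.tail = FALSE)
})
keep_snp <- call_rate >= 0.95 & maf >= 0.01 & hwe_p > 1e-6   # HWE cutoff is a placeholder
geno <- geno[, keep_snp, drop = FALSE]

## Drop subjects missing more than 1% of the remaining SNPs, then impute missing
## genotypes with the observed mean (twice the allele frequency) for each SNP.
geno <- geno[rowMeans(is.na(geno)) <= 0.01, , drop = FALSE]
for (j in seq_len(ncol(geno))) {
  miss <- is.na(geno[, j])
  geno[miss, j] <- mean(geno[!miss, j])
}
```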
Prior research has identified several genetic regions with SNPs that are associated with nicotine dependence, including 15q25.1 (CHRNA5/A3/B4), 8p11.21 (CHRNB3/A6), and 19q13.2 (CYP2A6/A7/B6, EGLN2, RAB4B, and NUMBL). See, for example, Yang and Li (2016) and the references therein. We choose the target covariates to be the 16 measured SNPs that lie in these three regions, as well as the demographic variables.
The results from our multiple splitting debiased lasso estimator are presented in Table 4. After adjusting for multiple testing using the Holm procedure (Holm, 1979), none of the SNPs have a significant association with smoking status. The SNP rs3733829 has the largest coefficient estimate, with an estimated 37% increase in the odds of being a smoker and unadjusted 95% confidence interval (9%, 73%), controlling for all other covariates. The association of rs3733829 with increased cigarette consumption has been discussed by Kim et al. (2010) and Bloom et al. (2014).
Table 4: Multiple splitting debiased lasso estimates for the Mid-South Tobacco Case-Control Study data.

Covariate | Gene | Estimate | SE | Holm P-value | Odds Ratio (95% CI)
---|---|---|---|---|---
Intercept | | -0.01 | 0.18 | 1.00 | 0.99 (0.70, 1.41)
Age | | 0.03 | 0.03 | 1.00 | 1.03 (0.96, 1.09)
Male | | 0.45 | 0.07 | | 1.58 (1.38, 1.79)
rs35327613 | CHRNB3 | -0.06 | 0.11 | 1.00 | 0.94 (0.76, 1.17) |
rs61740655 | CHRNA5 | -0.10 | 0.13 | 1.00 | 0.90 (0.70, 1.17) |
rs79109919 | CHRNA5 | -0.06 | 0.16 | 1.00 | 0.94 (0.68, 1.29) |
rs16969968 | CHRNA5 | -0.09 | 0.11 | 1.00 | 0.91 (0.74, 1.13) |
rs938682 | CHRNA3 | 0.10 | 0.10 | 1.00 | 1.11 (0.91, 1.35) |
rs8042374 | CHRNA3 | -0.18 | 0.12 | 1.00 | 0.83 (0.66, 1.05) |
rs61737502 | CHRNB4 | 0.04 | 0.10 | 1.00 | 1.04 (0.85, 1.28) |
rs56218866 | CHRNB4 | -0.10 | 0.14 | 1.00 | 0.91 (0.68, 1.21) |
rs950776 | CHRNB4 | -0.02 | 0.08 | 1.00 | 0.98 (0.83, 1.15) |
rs12440298 | CHRNB4 | -0.03 | 0.06 | 1.00 | 0.97 (0.87, 1.09) |
rs3865452 | COQ8B | -0.06 | 0.06 | 1.00 | 0.94 (0.84, 1.04) |
rs3733829 | EGLN2 | 0.32 | 0.12 | 0.13 | 1.37 (1.09, 1.73) |
rs75152309 | CYP2A7 | 0.07 | 0.09 | 1.00 | 1.07 (0.90, 1.28) |
rs73032311 | CYP2A7 | -0.09 | 0.08 | 1.00 | 0.91 (0.78, 1.07) |
rs28399499 | CYP2B6 | 0.00 | 0.10 | 1.00 | 1.00 (0.82, 1.22) |
rs7260329 | CYP2B6 | -0.16 | 0.08 | 0.78 | 0.86 (0.73, 1.00) |
Acknowledgements
This work was supported in part by NIH Grants R01AG056764 and RF1AG075107, and by NSF Grant DMS-1915711.
References
- Bickel et al., (2009) Bickel, P. J., Ritov, Y., and Tsybakov, A. B. (2009). Simultaneous analysis of lasso and dantzig selector. The Annals of Statistics, 37(4):1705–1732.
- Bloom et al., (2014) Bloom, A. J., Baker, T. B., Chen, L.-S., Breslau, N., Hatsukami, D., Bierut, L. J., and Goate, A. (2014). Variants in two adjacent genes, egln2 and cyp2a6, influence smoking behavior related to disease risk via different mechanisms. Human molecular genetics, 23(2):555–561.
- Efron, (2014) Efron, B. (2014). Estimation and accuracy after model selection. Journal of the American Statistical Association, 109(507):991–1007.
- Fan and Li, (2001) Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American statistical Association, 96(456):1348–1360.
- Fan and Lv, (2008) Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(5):849–911.
- Fan and Song, (2010) Fan, J. and Song, R. (2010). Sure independence screening in generalized linear models with np-dimensionality. The Annals of Statistics, 38(6):3567–3604.
- Fei and Li, (2021) Fei, Z. and Li, Y. (2021). Estimation and inference for high dimensional generalized linear models: A splitting and smoothing approach. Journal of Machine Learning Research, 22(58):1–32.
- Fei et al., (2019) Fei, Z., Zhu, J., Banerjee, M., and Li, Y. (2019). Drawing inferences for high-dimensional linear models: A selection-assisted partial regression and smoothing approach. Biometrics, 75(2):551–561.
- Friedman et al., (2010) Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of statistical software, 33(1):1.
- Han et al., (2022) Han, H., Xu, M., Wen, L., Chen, J., Liu, Q., Wang, J., Yang, Z., and Li, M. D. (2022). Identification of a novel functional nonsynonymous single nucleotide polymorphism in frizzled 6 (fzd6) gene for involvement in depressive symptoms. Frontiers in molecular neuroscience, page 348.
- Holm, (1979) Holm, S. (1979). A simple sequentially rejective multiple test procedure. Scandinavian journal of statistics, pages 65–70.
- Huang and Zhang, (2012) Huang, J. and Zhang, C.-H. (2012). Estimation and selection via absolute penalized convex minimization and its multistage adaptive applications. The Journal of Machine Learning Research, 13(1):1839–1864.
- Jiang et al., (2019) Jiang, K., Yang, Z., Cui, W., Su, K., Ma, J. Z., Payne, T. J., and Li, M. D. (2019). An exome-wide association study identifies new susceptibility loci for age of smoking initiation in african-and european-american populations. Nicotine and Tobacco Research, 21(6):707–713. (data available at NCBI Gene Expression Omnibus database, accession GSE148375).
- Kim et al., (2010) Kim, Y., Dackor, J., Boerwinkle, E., Franceschini, N., Ardissino, D., Bernardinelli, L., Mannucci, P., Mauri, F., Merlini, P., Absher, D., et al. (2010). Genome-wide meta-analyses identify multiple loci associated with smoking behavior. Nature genetics, 42(5):441–447.
- Kuchibhotla et al., (2020) Kuchibhotla, A. K., Brown, L. D., Buja, A., Cai, J., George, E. I., and Zhao, L. H. (2020). Valid post-selection inference in model-free linear regression. The Annals of Statistics, 48(5):2953–2981.
- Loh and Wainwright, (2012) Loh, P.-L. and Wainwright, M. J. (2012). High-dimensional regression with noisy and missing data: Provable guarantees with nonconvexity. The Annals of Statistics, 40(3):1637.
- McCullagh and Nelder, (2019) McCullagh, P. and Nelder, J. A. (2019). Generalized linear models. Routledge.
- Meinshausen et al., (2009) Meinshausen, N., Meier, L., and Bühlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association, 104(488):1671–1681.
- Meinshausen and Yu, (2009) Meinshausen, N. and Yu, B. (2009). Lasso-type recovery of sparse representations for high-dimensional data. The annals of statistics, 37(1):246–270.
- Ning and Liu, (2017) Ning, Y. and Liu, H. (2017). A general theory of hypothesis tests and confidence regions for sparse high dimensional models. The Annals of Statistics, 45(1):158–195.
- Rinaldo et al., (2019) Rinaldo, A., Wasserman, L., and G’Sell, M. (2019). Bootstrapping and sample splitting for high-dimensional, assumption-lean inference. The Annals of Statistics, 47(6):3438–3469.
- Tibshirani, (1996) Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288.
- van de Geer et al., (2014) van de Geer, S., Bühlmann, P., Ritov, Y., and Dezeure, R. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202.
- van de Geer, (2008) van de Geer, S. A. (2008). High-dimensional generalized linear models and the lasso. The Annals of Statistics, 36(2):614–645.
- van de Geer and Bühlmann, (2009) van de Geer, S. A. and Bühlmann, P. (2009). On the conditions used to prove oracle results for the lasso. Electronic Journal of Statistics, 3:1360–1392.
- Wager et al., (2014) Wager, S., Hastie, T., and Efron, B. (2014). Confidence intervals for random forests: The jackknife and the infinitesimal jackknife. The Journal of Machine Learning Research, 15(1):1625–1651.
- Wang et al., (2020) Wang, J., He, X., and Xu, G. (2020). Debiased inference on treatment effect in a high-dimensional model. Journal of the American Statistical Association, 115(529):442–454.
- Xia et al., (2021) Xia, L., Nan, B., and Li, Y. (2021). Debiased lasso for generalized linear models with a diverging number of covariates. Biometrics.
- Xu et al., (2020) Xu, Y., Cao, L., Zhao, X., Yao, Y., Liu, Q., Zhang, B., Wang, Y., Mao, Y., Ma, Y., Ma, J. Z., et al. (2020). Prediction of smoking behavior from single nucleotide polymorphisms with machine learning approaches. Frontiers in psychiatry, 11:416.
- Yang and Li, (2016) Yang, J. and Li, M. D. (2016). Converging findings from linkage and association analyses on susceptibility genes for smoking and other addictions. Molecular psychiatry, 21(8):992–1008.
- Zhang and Huang, (2008) Zhang, C.-H. and Huang, J. (2008). The sparsity and bias of the lasso selection in high-dimensional linear regression. The Annals of Statistics, 36(4):1567–1594.
- Zhang and Zhang, (2014) Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242.
- Zhang et al., (2022) Zhang, H., Huang, J., and Sun, L. (2022). Projection-based and cross-validated estimation in high-dimensional cox model. Scandinavian Journal of Statistics, 49(1):353–372.
- Zhao and Yu, (2006) Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. The Journal of Machine Learning Research, 7:2541–2563.
- Zhu and Bradic, (2018) Zhu, Y. and Bradic, J. (2018). Linear hypothesis testing in dense high-dimensional linear models. Journal of the American Statistical Association, 113(524):1583–1600.
- Zou, (2006) Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American statistical association, 101(476):1418–1429.
- Zou and Hastie, (2005) Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the royal statistical society: series B (statistical methodology), 67(2):301–320.
Appendix A Proof of Theorem 1: Results for the GLM lasso selected model
In order to show the model selection properties of the GLM lasso under Assumptions 1-5, we consider the results on false negatives and false positives (i.e. selected model size) separately in Lemmas 1 and 2 below. These are each based on oracle inequalities from Huang and Zhang, (2012), specifically using the unweighted lasso, target vector , and target set that includes the intercept. We use and as generic constants in our probability bounds and denote the sample size as .
Lemma 1.
Under Assumptions 1-5, if the diagonal entries of are bounded above and away from zero and the compatibility constant (Huang and Zhang,, 2012) is bounded below, then the GLM lasso has no false negatives with probability at least for constants .
This lemma is a direct result of Theorem 9 (iii) of Huang and Zhang, (2012), which gives an bound for the GLM lasso estimation error. We proceed by verifying that our assumptions imply the required conditions in Theorem 9 of Huang and Zhang, (2012) are satisfied.
We first use the bound , to verify the required Lipschitz condition
for all , where the denominator is bounded below since is bounded above almost surely and is a positive function.
Next, we can choose penalty levels each of order so that (31) of Huang and Zhang, (2012) is satisfied for large when the compatibility constant is bounded below and , as we have assumed.
Lastly, to verify condition (29) of Huang and Zhang, (2012), note that the positive quantities are bounded almost surely, so the diagonal elements of are bounded above and away from zero due to the assumed corresponding bounds on for all .
Now that we have verified the conditions, by Theorem 9 of Huang and Zhang, (2012) using the seminorm , with probability at least for constants , we have for some constant . Therefore, on this event, there are no false negatives if , as we have assumed.
Lemma 2.
Under the conditions of Lemma 1, the GLM lasso selected model size has an upper bound proportional to the true model size , with probability at least for some constants .
This lemma is a direct result of Theorem 19 of Huang and Zhang, (2012), which gives an upper bound for the number of false positives selected by the GLM lasso in terms of a compatibility constant and restricted upper eigenvalue.
We begin by showing that the restricted upper eigenvalue
over the cone is bounded above with probability tending to one when is proportional to . For large and small , the term involving is bounded almost surely by our Lipschitz assumptions since so in the cone . Since the intercept is included in , and the eigenvalues of are bounded above, we can apply Lemma 15 of Loh and Wainwright, (2012) for zero-mean subgaussian random vectors to show that is bounded above by a constant with probability at least for some constants when and . Using this upper bound for the upper restricted eigenvalue and the assumed lower bound for the compatibility constant, Theorem 19 of Huang and Zhang, (2012) implies that the number of false positives, and thus the selected model size, is at most of order .
We now are ready to prove Theorem 1. We begin by providing probability bounds for the events assumed in Lemmas 1 and 2. First, van de Geer and Bühlmann, (2009) showed that the compatibility constant is bounded away from zero with probability at least for some constants .
Next, we seek to bound from above and away from zero uniformly over . This is trivial for the intercept, and the rest of the covariates are each bounded almost surely with mean zero and a variance of at least . We apply Hoeffding’s inequality and a union bound to get
Since , for large we can fix a and the probability bound will be of order for some positive constants and .
Lastly, let denote the union of events that the compatibility constant is bounded, the diagonal entries of are bounded, and the upper restricted eigenvalue (see Lemma 2) is bounded. Since the probability bound on each of these events has the same order, by a union bound and rescaling of the constants involved we have for some . Then for the event with not depending on and some constants we have
for sufficiently large, since .
Appendix B Proof of Theorem 2: Asymptotic results for the debiased lasso after a single sample split
Suppose the sample of individuals is split into a subsample of size and a subsample of size , where we assume is an integer and . We apply a model selection procedure to such that the selected model satisfies the event with probability tending to one as , such as the lasso. For the selected model , which includes the pre-specified set , let denote the debiased lasso estimator of based on the subsample , and similarly denote the asymptotic covariance estimator by . We decompose
where, conditional on any sequence of selected models independent of and satisfying the event , by the theoretical results of Xia et al., (2021). Hence for any ,
where as by Xia et al., (2021) as mentioned above, so unconditionally as . Asymptotic normality, conditional on , also follows from the low-dimensional results of Xia et al., (2021) using the Lindeberg-Feller Central Limit Theorem.
Appendix C Proof of Theorem 3: Asymptotic results for the debiased lasso after multiple sample splits
First we apply the results from the proof of Theorem 2 to the -vector with , i.e. a contrast of only the pre-specified covariates . The single split debiased lasso estimator satisfies
where is the fitted model, is the matrix such that for any -vector , and is the indicator variable equal to one for .
Now consider the multiple splitting estimator obtained by averaging across all possible splits , . Again letting denote the fitted model from data split by sampling indicators , the multiple splitting estimator satisfies
where the expectations are with respect to the random splits , conditional on the data.
To establish asymptotic normality, we first show that . Let be another sampling indicator vector independent of , with fitted model . The expectations in the following decomposition are taken with respect to and , i.e. over , conditional on the data:
and, by independence, with . Since this quantity does not depend on , we can write
with so by Assumption (6). Also, since has bounded entries with mean zero, we apply Hoeffding’s inequality and a union bound to obtain
for some constant and , so and, applying Assumption (7), we have
In summary,
so we have asymptotic normality by the Lindeberg-Feller Central Limit Theorem. Note that
since is a unit vector, has bounded eigenvalues, and with probability tending to one. Thus the Lindeberg condition can be verified by standard arguments for bounded covariates.
Appendix D Further discussion of the multiple splitting assumptions
The asymptotic representation used in our multiple splitting proof is distinct from the one used by Fei and Li, (2021) who, in their theoretical arguments, applied a different decomposition of based on comparing the single split estimators with their respective oracle estimators. Under their partial orthogonality condition, which requires that the signal variables are independent of the noise variables and thus that is block diagonal (reordering the covariates without loss of generality), they argue that has the same asymptotic normal distribution as the oracle estimator fit on . Under different assumptions than Fei and Li, (2021) we also showed that is asymptotically normal, but with variance that is possibly larger than the oracle variance . Our assumptions instead concern the limiting behavior of the average (across all random splits that exclude from model selection, i.e. such that ) , for . Note that, for each , the corresponding average is independent of . Since the samples are i.i.d., the effect of always excluding the th observation specifically from model selection should be asymptotically negligible (Assumption 6), and furthermore each average should converge in probability to a bounded random vector , which is independent of the data, with moderate rate (Assumption 7).
We can achieve the oracle-level result of Fei and Li, (2021) under either of two stronger assumptions that also imply our convergence conditions in Assumptions (6) and (7). First, under model selection consistency we have for all random splits, so , a deterministic vector, for all and we can directly set . By (Appendix C) we get . Then, using the fact that , the asymptotic variance becomes . Alternatively, if the covariates are independent of then has a block diagonal structure that also implies for all , using the fact that is a zero-vector. Therefore this independence assumption also implies oracle-level variance. Note that, in general, this independence assumption is difficult to justify in practice when the true signal variables are unknown. An exception is in linear models, where it is sufficient to assume the pre-specified covariates are independent of , e.g. by randomized treatment assignment, since is a constant. In this sense our Assumptions (6) and (7), adapted from Wang et al., (2020) for linear models, are less restrictive than those used by others to justify multiple splitting procedures such as Fei and Li, (2021) for GLMs and Fei et al., (2019) for linear models.
Appendix E Additional simulation results
Table 5 summarizes results from the same simulation setting as the post-model selection procedure of Table 3, but with a larger sample size and number of covariates. We see that the selection rate for small coefficients improves, resulting in lower bias and better confidence interval coverage, and the debiased lasso estimator still has a substantial advantage over the MLE.
Table 5: Post-model selection simulation results with a larger sample size.

Metric | Estimator | -1.50 | -1.00 | -0.50 | 0.00 | 0.00 | 0.50 | 1.00 | 1.50
---|---|---|---|---|---|---|---|---|---
Selection Rate | 1.00 | 1.00 | 1.00 | 0.11 | 0.03 | 1.00 | 1.00 | 1.00 | |
Bias | Debiased SS | -0.01 | -0.01 | 0.01 | -0.00 | 0.00 | -0.01 | 0.02 | 0.01 |
MLE SS | -0.24 | -0.16 | -0.08 | -0.00 | 0.00 | 0.07 | 0.17 | 0.24 | |
Oracle | -0.03 | -0.02 | -0.00 | 0.00 | 0.00 | 0.01 | 0.03 | 0.03 | |
Coverage | Debiased SS | 0.89 | 0.90 | 0.98 | 0.99 | 1.00 | 0.91 | 0.93 | 0.92 |
MLE SS | 0.64 | 0.71 | 0.90 | 0.99 | 1.00 | 0.86 | 0.74 | 0.65 | |
Oracle | 0.96 | 0.93 | 0.97 | 1.00 | 1.00 | 0.94 | 0.95 | 0.95 | |
Rejection Rate | Debiased SS | 1.00 | 1.00 | 1.00 | 0.01 | 0.00 | 1.00 | 1.00 | 1.00 |
() | MLE SS | 1.00 | 1.00 | 1.00 | 0.01 | 0.00 | 1.00 | 1.00 | 1.00 |
Oracle | 1.00 | 1.00 | 1.00 | 0.00 | 0.00 | 1.00 | 1.00 | 1.00 | |
Standard Error | Debiased SS | 0.11 | 0.10 | 0.09 | 0.01 | 0.00 | 0.09 | 0.10 | 0.11 |
MLE SS | 0.15 | 0.13 | 0.11 | 0.01 | 0.00 | 0.11 | 0.13 | 0.15 | |
Oracle | 0.12 | 0.11 | 0.09 | 0.00 | 0.00 | 0.09 | 0.11 | 0.12 | |
Empirical SD | Debiased SS | 0.14 | 0.11 | 0.10 | 0.03 | 0.01 | 0.11 | 0.11 | 0.13 |
MLE SS | 0.21 | 0.17 | 0.12 | 0.05 | 0.02 | 0.14 | 0.16 | 0.22 | |
Oracle | 0.12 | 0.11 | 0.09 | 0.00 | 0.00 | 0.10 | 0.10 | 0.12 |
We also present additional simulation results that have stronger correlation between covariates. Table 6 summarizes results from an AR(1) correlation structure with parameter 0.8, and Table 7 summarizes results from a compound symmetry correlation structure with parameter 0.8. We see that the multiple splitting debiased lasso estimator is fairly robust to high correlations between covariates, even as the model selection performance deteriorates. The debiased lasso continues to outperform the MLE, having lower bias and variance as well as better confidence interval coverage.
Table 6: Simulation results for the AR(1) correlation structure with correlation parameter 0.8.

Metric | Estimator | -1.50 | -1.00 | -0.50 | 0.00 | 0.00 | 0.50 | 1.00 | 1.50
---|---|---|---|---|---|---|---|---|---
Selection Rate | 1.00 | 0.97 | 0.56 | 0.24 | 0.05 | 0.64 | 0.98 | 1.00 | |
Bias | Debiased SS | 0.04 | 0.03 | 0.00 | -0.01 | -0.03 | -0.02 | -0.02 | -0.02 |
MLE SS | -0.54 | -0.37 | -0.24 | 0.02 | -0.05 | 0.21 | 0.39 | 0.55 | |
Debiased MS | 0.01 | 0.03 | 0.01 | -0.02 | -0.01 | -0.01 | -0.01 | -0.01 | |
MLE MS | -0.54 | -0.35 | -0.23 | -0.01 | -0.01 | 0.22 | 0.39 | 0.55 | |
Oracle | -0.08 | -0.03 | -0.02 | -0.03 | -0.01 | 0.02 | 0.05 | 0.08 | |
Oracle | -0.04 | -0.02 | -0.02 | -0.01 | -0.01 | 0.01 | 0.03 | 0.04 | |
Coverage | Debiased SS | 0.89 | 0.93 | 0.92 | 0.96 | 0.96 | 0.90 | 0.91 | 0.90 |
MLE SS | 0.77 | 0.82 | 0.85 | 0.93 | 0.93 | 0.88 | 0.81 | 0.74 | |
Debiased MS | 0.93 | 0.92 | 0.93 | 0.96 | 0.95 | 0.94 | 0.90 | 0.92 | |
MLE MS | 0.58 | 0.73 | 0.86 | 0.97 | 0.94 | 0.85 | 0.70 | 0.53 | |
Oracle | 0.93 | 0.96 | 0.95 | 0.93 | 0.94 | 0.96 | 0.94 | 0.95 | |
Oracle | 0.94 | 0.96 | 0.92 | 0.96 | 0.96 | 0.94 | 0.93 | 0.96 | |
Rejection Rate | Debiased SS | 0.99 | 0.92 | 0.59 | 0.04 | 0.04 | 0.57 | 0.95 | 1.00 |
() | MLE SS | 0.99 | 0.90 | 0.64 | 0.07 | 0.07 | 0.60 | 0.94 | 0.99 |
Debiased MS | 1.00 | 0.99 | 0.90 | 0.04 | 0.05 | 0.88 | 1.00 | 1.00 | |
MLE MS | 1.00 | 0.99 | 0.90 | 0.03 | 0.06 | 0.89 | 1.00 | 1.00 | |
Oracle | 1.00 | 1.00 | 0.74 | 0.07 | 0.06 | 0.78 | 1.00 | 1.00 | |
Oracle | 1.00 | 1.00 | 0.98 | 0.04 | 0.04 | 0.98 | 1.00 | 1.00 | |
Standard Error | Debiased SS | 0.27 | 0.26 | 0.24 | 0.32 | 0.20 | 0.23 | 0.25 | 0.27 |
MLE SS | 0.43 | 0.39 | 0.33 | 0.44 | 0.27 | 0.33 | 0.38 | 0.44 | |
Debiased MS | 0.19 | 0.17 | 0.15 | 0.20 | 0.12 | 0.15 | 0.17 | 0.19 | |
MLE MS | 0.30 | 0.26 | 0.22 | 0.32 | 0.19 | 0.22 | 0.26 | 0.30 | |
Oracle | 0.26 | 0.22 | 0.20 | 0.32 | 0.19 | 0.20 | 0.22 | 0.25 | |
Oracle | 0.17 | 0.15 | 0.14 | 0.22 | 0.13 | 0.14 | 0.15 | 0.17 | |
Empirical SD | Debiased SS | 0.33 | 0.30 | 0.26 | 0.30 | 0.18 | 0.26 | 0.29 | 0.33 |
MLE SS | 0.72 | 0.59 | 0.43 | 0.55 | 0.31 | 0.44 | 0.52 | 0.69 | |
Debiased MS | 0.21 | 0.20 | 0.17 | 0.21 | 0.12 | 0.17 | 0.19 | 0.20 | |
MLE MS | 0.35 | 0.32 | 0.26 | 0.33 | 0.19 | 0.26 | 0.30 | 0.34 | |
Oracle | 0.29 | 0.22 | 0.20 | 0.34 | 0.19 | 0.20 | 0.23 | 0.26 | |
Oracle | 0.19 | 0.15 | 0.14 | 0.23 | 0.13 | 0.15 | 0.17 | 0.18 |
Table 7: Simulation results for the compound symmetry correlation structure with correlation parameter 0.8.

Metric | Estimator | -1.50 | -1.00 | -0.50 | 0.00 | 0.00 | 0.50 | 1.00 | 1.50
---|---|---|---|---|---|---|---|---|---
Selection Rate | 0.98 | 0.62 | 0.24 | 0.01 | 0.03 | 0.22 | 0.68 | 0.96 | |
Bias | Debiased SS | 0.04 | 0.04 | 0.03 | -0.02 | 0.00 | -0.05 | -0.08 | -0.06 |
MLE SS | -0.14 | -0.09 | -0.04 | -0.02 | 0.01 | 0.01 | 0.04 | 0.11 | |
Debiased MS | 0.03 | 0.08 | 0.02 | -0.02 | 0.00 | -0.02 | -0.07 | -0.07 | |
MLE MS | -0.17 | -0.07 | -0.07 | -0.03 | -0.01 | 0.06 | 0.08 | 0.13 | |
Oracle | -0.07 | 0.03 | -0.07 | -0.03 | 0.00 | 0.05 | 0.03 | 0.02 | |
Oracle | -0.03 | 0.02 | -0.03 | -0.01 | 0.00 | 0.03 | 0.01 | 0.01 | |
Coverage | Debiased SS | 0.90 | 0.91 | 0.91 | 0.91 | 0.90 | 0.94 | 0.91 | 0.91 |
MLE SS | 0.89 | 0.89 | 0.91 | 0.91 | 0.89 | 0.93 | 0.88 | 0.90 | |
Debiased MS | 0.90 | 0.94 | 0.91 | 0.96 | 0.96 | 0.93 | 0.92 | 0.90 | |
MLE MS | 0.86 | 0.90 | 0.93 | 0.96 | 0.95 | 0.93 | 0.88 | 0.88 | |
Oracle | 0.94 | 0.94 | 0.94 | 0.96 | 0.96 | 0.94 | 0.96 | 0.96 | |
Oracle | 0.94 | 0.94 | 0.92 | 0.96 | 0.93 | 0.94 | 0.92 | 0.94 | |
Rejection Rate | Debiased SS | 1.00 | 0.82 | 0.33 | 0.09 | 0.10 | 0.33 | 0.77 | 0.99 |
() | MLE SS | 1.00 | 0.85 | 0.36 | 0.09 | 0.11 | 0.34 | 0.78 | 0.98 |
Debiased MS | 1.00 | 0.98 | 0.59 | 0.04 | 0.04 | 0.56 | 0.98 | 1.00 | |
MLE MS | 1.00 | 0.98 | 0.58 | 0.04 | 0.05 | 0.58 | 0.98 | 1.00 | |
Oracle | 1.00 | 0.88 | 0.47 | 0.04 | 0.04 | 0.44 | 0.90 | 1.00 | |
Oracle | 1.00 | 1.00 | 0.67 | 0.04 | 0.07 | 0.71 | 1.00 | 1.00 | |
Standard Error | Debiased SS | 0.32 | 0.32 | 0.31 | 0.31 | 0.31 | 0.31 | 0.32 | 0.32 |
MLE SS | 0.38 | 0.37 | 0.35 | 0.34 | 0.35 | 0.35 | 0.36 | 0.38 | |
Debiased MS | 0.23 | 0.23 | 0.22 | 0.21 | 0.21 | 0.22 | 0.23 | 0.23 | |
MLE MS | 0.27 | 0.26 | 0.25 | 0.25 | 0.25 | 0.26 | 0.26 | 0.26 | |
Oracle | 0.34 | 0.31 | 0.30 | 0.31 | 0.31 | 0.31 | 0.32 | 0.33 | |
Oracle | 0.23 | 0.22 | 0.21 | 0.21 | 0.21 | 0.21 | 0.22 | 0.23 | |
Empirical SD | Debiased SS | 0.36 | 0.37 | 0.34 | 0.33 | 0.35 | 0.35 | 0.39 | 0.38 |
MLE SS | 0.51 | 0.46 | 0.40 | 0.40 | 0.41 | 0.42 | 0.50 | 0.49 | |
Debiased MS | 0.26 | 0.25 | 0.24 | 0.20 | 0.22 | 0.23 | 0.25 | 0.26 | |
MLE MS | 0.35 | 0.32 | 0.29 | 0.24 | 0.26 | 0.27 | 0.31 | 0.34 | |
Oracle | 0.32 | 0.33 | 0.33 | 0.29 | 0.30 | 0.34 | 0.32 | 0.34 | |
Oracle | 0.24 | 0.23 | 0.24 | 0.20 | 0.23 | 0.22 | 0.24 | 0.24 |