Distributed Iterative Hard Thresholding for Variable Selection in Tobit Models
Abstract
While extensive research has been conducted on high-dimensional data and on regression with left-censored responses, simultaneously addressing these complexities remains challenging, and only a few methods are available. In this paper, we apply the Iterative Hard Thresholding (IHT) algorithm to the Tobit model in this setting. Theoretical analysis demonstrates that our estimator converges at a near-optimal minimax rate. Additionally, we extend the method to a distributed setting, requiring only a few rounds of communication while retaining the estimation rate of the centralized version. Simulation results show that the IHT algorithm for the Tobit model achieves superior accuracy in prediction and subset selection, with the performance of the distributed estimator closely matching that of the centralized estimator. When applied to high-dimensional left-censored HIV viral load data, our method also exhibits similar superiority.
Keywords— Censored regression; Distributed optimization; Hard thresholding; High-dimensional statistics; Linear convergence.
1 Introduction
The analysis of left-censored data is a significant statistical focus and has attracted considerable research attention in recent years. Such data often arise from the lower detection limit of an assay, posing a shared challenge across fields such as biology, chemistry, and the environmental sciences. For example, biological assays used to measure Human Immunodeficiency Virus (HIV) viral load in plasma may be unable to detect concentrations below specific thresholds. This censoring renders commonly used linear regression methods ineffective. Furthermore, with advances in modern data collection, these challenges often arise in high-dimensional settings. In the context of HIV infection, there is a critical need to explore the association between viral load and extremely high-dimensional gene expression values. Various methodologies have been proposed to address such challenges; among these, the Tobit model has proved valuable for modeling left-censored responses.
In recent years, significant progress has been made on high-dimensional censored data. A useful strategy for tackling high-dimensionality is to construct penalized estimators such as lasso-type estimators. Müller and van de Geer (2016) and Zhou and Liu (2016) provide theoretical insights into the least absolute deviation estimator with the lasso penalty. Soret et al. (2018) propose the lasso-regularized Buckley-James least squares algorithm, extending the estimator of Buckley and James (1979). Jacobson and Zou (2024) were the first to consider the high-dimensional Tobit model, optimizing the likelihood function with a nonconvex penalty (specifically the SCAD penalty of Fan and Li (2001)). In this paper, we concentrate on the IHT approach for variable selection. One advantage of IHT is that the user can directly specify the number of variables to be retained, which may be useful in some scientific investigations.
IHT-style methods, which combine gradient descent with projection operations, have gained popularity in the literature for sparse recovery. Various algorithms have been proposed, such as standard IHT introduced in Blumensath and Davies (2009), GraDeS in Garg and Khandekar (2009), and Hard Thresholding Pursuit (HTP) in Foucart (2011). Jain et al. (2014) demonstrated that the IHT procedure can achieve linear convergence to the optimal solution under the conditions of strong convexity and strong smoothness in high-dimensional settings. Extending this result, Wang et al. (2023) considered nonsmooth loss functions under the less restrictive assumption of a locally positive-definite population Hessian. We will demonstrate that the linear convergence result also holds for the high-dimensional Tobit model.
Additionally, we consider a scenario where data are distributed across multiple locations and develop a distributed variant of our method. With advancements in data collection technology, gathering similar types of data from various regions has become increasingly common. Directly aggregating all data at a central site typically faces practical hurdles related to storage, communication, and privacy. To tackle these challenges, divide-and-conquer approaches are commonly employed. Many early approaches use one-shot methods, where estimators computed on local machines are transmitted to a central node and aggregated to form a global estimate, as in Zhang et al. (2013) and Lee et al. (2017). However, such methods suffer from some drawbacks, as discussed by Jordan et al. (2019), who proposed a communication-efficient distributed algorithm. Fortunately, the gradient descent-type update of the IHT method is well suited to integration with this algorithm. Building upon this idea, we introduce a communication-efficient estimator with an IHT-type update. To our knowledge, this is the first exploration of the high-dimensional Tobit model in a distributed setting. Our theoretical analysis and numerical results demonstrate that this estimator's convergence rate matches that achieved by pooling all data together, while incurring low communication costs.
2 The Tobit model and IHT
2.1 Local IHT for Tobit model
We consider the Tobit model, introduced in Tobin (1958), a crucial tool for modeling a left-censored response. We assume the existence of a latent response variable $y_i^*$ such that $y_i = \max(y_i^*, 0)$, where $y_i^*$ follows a linear model
$$y_i^* = x_i^\top \beta^* + \epsilon_i, \qquad i = 1, \dots, n.$$
Here, $x_i \in \mathbb{R}^p$ is the covariate vector, $\beta^* \in \mathbb{R}^p$ is the true coefficient vector, and $\epsilon_i \sim N(0, \sigma^2)$ is independent of $x_i$. We focus on a high-dimensional scenario where the dimension $p$ is much larger than the sample size $n$, with $\beta^*$ being $s^*$-sparse, that is, $\|\beta^*\|_0 = s^* \ll n$. Without loss of generality, we assume the censoring point is zero throughout the following discussions.
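To make the data-generating mechanism concrete, the following Python sketch simulates a small data set from this model; the dimensions, sparsity level, and noise scale are illustrative choices, not settings used in our experiments.

```python
import numpy as np

def simulate_tobit(n=200, p=500, s_star=5, sigma=1.0, seed=0):
    """Generate (X, y) from a Tobit model left-censored at zero."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))       # sub-Gaussian design
    beta = np.zeros(p)
    beta[:s_star] = 1.0                   # s*-sparse true coefficients
    y_latent = X @ beta + sigma * rng.standard_normal(n)
    y = np.maximum(y_latent, 0.0)         # left-censoring at zero
    return X, y, beta

X, y, beta_true = simulate_tobit()
print("proportion censored:", np.mean(y == 0.0))
```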
Let $\{(x_i, y_i)\}_{i=1}^n$ represent an independent and identically distributed (i.i.d.) sample from the Tobit model, and define the indicator $d_i = \mathbb{1}\{y_i > 0\}$. The likelihood function for the censored response in the Tobit model can be expressed as
$$L(\beta, \sigma) = \prod_{i=1}^n \left[\frac{1}{\sigma}\phi\!\left(\frac{y_i - x_i^\top\beta}{\sigma}\right)\right]^{d_i}\left[1 - \Phi\!\left(\frac{x_i^\top\beta}{\sigma}\right)\right]^{1-d_i},$$
where $\phi$ and $\Phi$ are the standard normal density and cumulative distribution function, respectively. While the log-likelihood is not concave in $(\beta, \sigma)$, Olsen (1978) found that the reparameterization $\delta = \beta/\sigma$ and $\gamma = 1/\sigma$ results in a concave log-likelihood. Denoting $\theta = (\delta^\top, \gamma)^\top$, after dropping ignorable constants, the negative log-likelihood is given by
$$\mathcal{L}_n(\theta) = \frac{1}{n}\sum_{i=1}^n \left\{ d_i\left[\frac{1}{2}\left(\gamma y_i - x_i^\top\delta\right)^2 - \log\gamma\right] - (1-d_i)\log\Phi\!\left(-x_i^\top\delta\right)\right\}.$$
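The following minimal Python sketch evaluates this reparameterized negative log-likelihood (with additive constants dropped); the function name and the reliance on scipy.stats.norm are our own choices.

```python
import numpy as np
from scipy.stats import norm

def tobit_negloglik(delta, gamma, X, y):
    """Average negative log-likelihood under Olsen's reparameterization
    delta = beta / sigma, gamma = 1 / sigma (additive constants dropped)."""
    d = (y > 0).astype(float)                          # censoring indicators
    resid = gamma * y - X @ delta
    uncensored = d * (0.5 * resid**2 - np.log(gamma))
    censored = -(1.0 - d) * norm.logcdf(-(X @ delta))  # -log P(y* <= 0 | x)
    return np.mean(uncensored + censored)
```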
In the high-dimensional setting where $p$ is large, possibly even larger than $n$, variable selection is desired. In this paper, we consider the hard sparsity constraint $\|\delta\|_0 \le s$, where $s$ is a known upper bound of $s^*$. To optimize this constrained objective function, we use the IHT method, which is designed for high-dimensional regression. This method combines gradient descent with projection operations, making it computationally efficient. Furthermore, we slightly modify it for the Tobit model to prevent $\gamma$ from getting too small or even negative. Specifically, denoting the initial value as $\theta^{(0)} = (\delta^{(0)\top}, \gamma^{(0)})^\top$, for $t = 0, 1, 2, \dots$, our IHT algorithm can be formulated as
$$\delta^{(t+1)} = P_s\!\left(\delta^{(t)} - \eta\,\nabla_\delta \mathcal{L}_n(\theta^{(t)})\right), \qquad \gamma^{(t+1)} = \max\!\left\{\gamma^{(t)} - \eta\,\nabla_\gamma \mathcal{L}_n(\theta^{(t)}),\ c_0\right\}, \tag{1}$$
where $\eta$ is the step size, $P_s(\cdot)$ is the projection operator that retains only the $s$ entries with the largest absolute values while setting the other entries to zero, and $\nabla_\delta \mathcal{L}_n$ and $\nabla_\gamma \mathcal{L}_n$ denote the gradients of $\mathcal{L}_n$ with respect to $\delta$ and $\gamma$. Note that we have excluded $\gamma$ from the projection operation, as $\gamma$ is not part of the parameter for variable selection. Additionally, we introduce a positive lower bound $c_0$ to prevent $\gamma^{(t)}$ from becoming nonpositive during the update process.
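A minimal sketch of update (1) is given below; the gradient expressions follow from the reparameterized negative log-likelihood above, while the step size, iteration count, initialization, and the lower bound `gamma_floor` are illustrative placeholders rather than the values analyzed in Section 3 or the exact contents of Algorithm 1.

```python
import numpy as np
from scipy.stats import norm

def hard_threshold(v, s):
    """P_s: keep the s largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def tobit_iht(X, y, s, eta=0.1, gamma_floor=1e-3, n_iter=200):
    """Local IHT for the Tobit model in the (delta, gamma) parameterization."""
    n, p = X.shape
    d = (y > 0).astype(float)
    delta, gamma = np.zeros(p), 1.0               # simple initialization
    for _ in range(n_iter):
        xb = X @ delta
        resid = gamma * y - xb
        # inverse Mills ratio phi(u) / Phi(-u), computed on the log scale
        mills = np.exp(norm.logpdf(xb) - norm.logcdf(-xb))
        grad_delta = X.T @ (-d * resid + (1.0 - d) * mills) / n
        grad_gamma = np.mean(d * (resid * y - 1.0 / gamma))
        delta = hard_threshold(delta - eta * grad_delta, s)   # projection step
        gamma = max(gamma - eta * grad_gamma, gamma_floor)    # keep gamma positive
    return delta, gamma
```

In practice, the sparsity level `s` can either be fixed in advance or tuned, for example by comparing the held-out negative log-likelihood over a grid of candidate values.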
In practical applications, the values of $s^*$ and $\gamma^*$ are typically unknown, so $s$ and $c_0$ are treated as tuning parameters. There are two main approaches for determining $s$. First, the user can directly specify it based on a predefined desired sparsity, such as selecting the top 10 variables of interest. An alternative is to treat the sparsity as a tuning parameter that can be optimized through cross-validation. The lower bound $c_0$ can be set to a sufficiently small value to ensure it is smaller than $\gamma^* = 1/\sigma$. We present the entire IHT algorithm in Algorithm 1 (called local IHT to distinguish it from the distributed case). Theorem 1 in Section 3 provides a bound on the estimation error of the local IHT algorithm (after a sufficient number of iterations) of order $\sqrt{s\log p/n}$, which is well known as the minimax rate for linear regression under the sparsity condition (provided $s \ge s^*$ and $s$ and $s^*$ are of the same order).
2.2 Distributed IHT for Tobit model
In this section, we delve into the distributed implementation of the high-dimensional Tobit model across machines. This scenario holds significance in settings characterized by large volumes of training data. In such contexts, aggregating all raw data directly becomes impractical due to constraints like limited storage capacity, high communication costs, and privacy considerations. Hence, we are motivated to explore the development of a communication-efficient distributed IHT method.
For simplicity, we assume that each machine stores data of the same sample size $n$. Let $\mathcal{D}_j$ denote the subsample with sample size $n$ stored in the $j$-th machine $\mathcal{M}_j$, for $j = 1, \dots, m$, and let $N = mn$ denote the global sample size. The local and global negative log-likelihood functions take the following form:
$$\mathcal{L}_j(\theta) = \frac{1}{n}\sum_{i \in \mathcal{D}_j}\ell(\theta; x_i, y_i), \qquad \mathcal{L}_N(\theta) = \frac{1}{m}\sum_{j=1}^m \mathcal{L}_j(\theta),$$
where $\ell(\theta; x_i, y_i) = d_i\left[\frac{1}{2}(\gamma y_i - x_i^\top\delta)^2 - \log\gamma\right] - (1-d_i)\log\Phi(-x_i^\top\delta)$ denotes the negative log-likelihood contribution of observation $i$.
Our target is to find
$$\hat\theta = \operatorname*{arg\,min}_{\theta = (\delta^\top, \gamma)^\top:\ \|\delta\|_0 \le s,\ \gamma \ge c_0} \mathcal{L}_N(\theta). \tag{2}$$
When direct aggregation of all data to optimize the global loss function is infeasible, a common approach is to approximate the global loss function via inter-machine communication. Subsequently, this approximation is optimized based on local data. Jordan et al. (2019) demonstrated that this approach can achieve optimal statistical precision with just a few rounds of communication. Following suit, we replace the original global negative log-likelihood function with an approximation. Specifically, given the current estimate $\bar\theta$, we first collect the gradients $\nabla\mathcal{L}_j(\bar\theta)$ from the different machines and then approximate the global negative log-likelihood function as
$$\tilde{\mathcal{L}}(\theta) = \mathcal{L}_1(\theta) - \left\langle \nabla\mathcal{L}_1(\bar\theta) - \nabla\mathcal{L}_N(\bar\theta),\ \theta\right\rangle, \tag{3}$$
where $\mathcal{L}_1$ is the local empirical negative log-likelihood function on the first machine $\mathcal{M}_1$ (treated as the central machine that can communicate with all others) and $\nabla\mathcal{L}_N(\bar\theta) = m^{-1}\sum_{j=1}^m \nabla\mathcal{L}_j(\bar\theta)$. The gradient of the approximate loss is $\nabla\tilde{\mathcal{L}}(\theta) = \nabla\mathcal{L}_1(\theta) - \nabla\mathcal{L}_1(\bar\theta) + \nabla\mathcal{L}_N(\bar\theta)$. Thus, based on the approximate global negative log-likelihood function, the IHT algorithm is, starting from $\theta^{(0)} = \bar\theta$,
$$\delta^{(t+1)} = P_s\!\left(\delta^{(t)} - \eta\,\nabla_\delta \tilde{\mathcal{L}}(\theta^{(t)})\right), \qquad \gamma^{(t+1)} = \max\!\left\{\gamma^{(t)} - \eta\,\nabla_\gamma \tilde{\mathcal{L}}(\theta^{(t)}),\ c_0\right\}. \tag{4}$$
After (4) converges, $\mathcal{M}_1$ sends the new estimate to all other machines to obtain the updated approximate global negative log-likelihood function. Algorithm 2 presents the details of the distributed IHT algorithm for the Tobit model. Our theoretical results will show that the distributed IHT method can achieve the same statistical rate as the centralized version, and our simulation results further confirm that it offers statistical precision comparable to its non-distributed counterpart.
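The sketch below mimics this procedure in a single process: the list `data` stands in for the per-machine subsamples, machine `data[0]` plays the role of the central machine $\mathcal{M}_1$, and in an actual deployment the local gradients would be transmitted over the network rather than computed in place. The numbers of rounds and inner iterations are placeholders, not the stopping rules of Algorithm 2.

```python
import numpy as np
from scipy.stats import norm

def tobit_grad(delta, gamma, X, y):
    """Gradient of the local (Olsen-parameterized) negative log-likelihood."""
    n = len(y)
    d = (y > 0).astype(float)
    xb = X @ delta
    resid = gamma * y - xb
    mills = np.exp(norm.logpdf(xb) - norm.logcdf(-xb))
    g_delta = X.T @ (-d * resid + (1.0 - d) * mills) / n
    g_gamma = np.mean(d * (resid * y - 1.0 / gamma))
    return g_delta, g_gamma

def hard_threshold(v, s):
    out = np.zeros_like(v)
    keep = np.argsort(np.abs(v))[-s:]
    out[keep] = v[keep]
    return out

def distributed_tobit_iht(data, s, eta=0.1, gamma_floor=1e-3,
                          n_rounds=5, n_inner=100):
    """Communication-efficient IHT: `data` is a list of (X_j, y_j) per machine."""
    p = data[0][0].shape[1]
    delta, gamma = np.zeros(p), 1.0
    for _ in range(n_rounds):
        # one communication round: collect local gradients at the current estimate
        grads = [tobit_grad(delta, gamma, X, y) for X, y in data]
        gN_delta = np.mean([g[0] for g in grads], axis=0)
        gN_gamma = np.mean([g[1] for g in grads])
        g1_delta, g1_gamma = tobit_grad(delta, gamma, *data[0])
        shift_delta = gN_delta - g1_delta     # gradient correction in (3)
        shift_gamma = gN_gamma - g1_gamma
        # inner IHT loop on the central machine, using the surrogate gradient (4)
        for _ in range(n_inner):
            g_delta, g_gamma = tobit_grad(delta, gamma, *data[0])
            delta = hard_threshold(delta - eta * (g_delta + shift_delta), s)
            gamma = max(gamma - eta * (g_gamma + shift_gamma), gamma_floor)
    return delta, gamma
```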
3 Convergence and statistical rates
We present convergence guarantees for both the local IHT and the distributed IHT in this section. In the rest of the paper, $C$ and $c$ denote positive constants independent of $n$, $p$, and $s$, whose values may vary across instances. In the following, for positive sequences $a_n$ and $b_n$, $a_n \lesssim b_n$ or $a_n = O(b_n)$ means $a_n \le C b_n$ for some constant $C$, $a_n \gtrsim b_n$ means $b_n \lesssim a_n$, $a_n \asymp b_n$ means $a_n \lesssim b_n$ and $b_n \lesssim a_n$, and $a_n = o(b_n)$ is the same as $a_n/b_n \to 0$. We make the following assumptions.
- Assumption 1 (Truncation and signal strength): $c_0$ is a constant that satisfies $0 < c_0 < \gamma^* = 1/\sigma$. In addition, $\gamma^*$ and $\|\delta^*\|_2$ are bounded above by a constant.
- Assumption 2 (Sparse eigenvalue condition): There exist positive constants $\kappa_-$ and $\kappa_+$ such that:
  (i) $\kappa_- \le u^\top \mathbb{E}\left[x_i x_i^\top\right] u \le \kappa_+$ for all $(2s + s^*)$-sparse unit vectors $u$.
  (ii) $s \ge C\kappa^2 s^*$ holds, where $\kappa = \kappa_+/\kappa_-$, and $C$ is a sufficiently large constant.
- Assumption 3 (Sub-Gaussian designs): For some constant $\sigma_x > 0$, the $x_i$ are i.i.d. and $\sigma_x$-sub-Gaussian, denoted as $x_i \sim \mathrm{subG}(\sigma_x^2)$. That is, for all unit vectors $v \in \mathbb{R}^p$ and all $t \in \mathbb{R}$,
$$\mathbb{E}\exp\!\left\{t\, v^\top x_i\right\} \le \exp\!\left\{\sigma_x^2 t^2/2\right\}.$$
- Assumption 4 (Proper initialization): $\|\delta^{(0)}\|_0 \le s$ and $\|\theta^{(0)} - \theta^*\|_2$ is bounded by a constant.
Remark 1: To satisfy Assumption 1, we need to use a sufficiently small $c_0$. The sparse eigenvalue assumption is frequently used in high-dimensional data analysis. Note that here the assumption is imposed on the population quantity, so even the standard eigenvalue assumption without constraining $u$ to be sparse is reasonable. Examining the form of the population Hessian matrix (see the Appendix), it is easy to see that Assumptions 1 and 2 (i) actually imply that the Hessian matrix has eigenvalues bounded between two positive constants. Assumption 3 is also a standard assumption in high-dimensional regression analysis, making it possible to derive tight bounds on different quantities (Jordan et al., 2019; Belloni and Chernozhukov, 2011). Combined with Assumption 1 and the fact that the noise is Gaussian, it also implies that $y_i$ is sub-Gaussian. Finally, a trivial initial value satisfying Assumption 4 is $\delta^{(0)} = 0$, $\gamma^{(0)} = c_0$.
The following theorem provides a bound on the estimation error of the local IHT algorithm, confirming the linear convergence of the IHT algorithm for the Tobit model.
Theorem 1.
(Local IHT for Tobit) Under Assumptions 1–4, choosing the step size $\eta$ to be a suitable small constant (specified in the proof), there exists a contraction factor $0 < \rho < 1$ such that
$$\left\|\theta^{(t+1)} - \theta^*\right\|_2 \le \rho\left\|\theta^{(t)} - \theta^*\right\|_2 + C\sqrt{\frac{s\log p}{n}}$$
with high probability. In particular, if $n \gtrsim s\log p$ and $T \gtrsim \log n$, after $T$ iterations, we have
$$\left\|\theta^{(T)} - \theta^*\right\|_2 \le C\sqrt{\frac{s\log p}{n}}$$
with high probability.
The specific value of $\eta$ used in the proof is not the only option; we opt for it for convenience, and our proof reveals that only a weaker joint condition on the step size and the sparsity level is actually required. For initialization, we recommend the Tobit-lasso estimator, as it offers computational simplicity and has been shown to be consistent in Jacobson and Zou (2024).
Theorem 1 establishes the convergence of the local IHT algorithm. The following theorem concerns the convergence of the inner loop in the distributed setting and is very similar to Theorem 1. This is not surprising, since the approximate likelihood is expected to be close to the true likelihood.
Theorem 2.
(Inner loop for Distributed IHT) Given the current estimate $\bar\theta$ in Algorithm 2, under Assumptions 1–4, choosing the step size $\eta$ to be a suitable small constant, if $n \gtrsim s\log p$ and $T \gtrsim \log n$, then after $T$ iterations of (4), we have
$$\left\|\theta^{(T)} - \theta^*\right\|_2 \le C\left(\sqrt{\frac{s\log p}{n}}\,\left\|\bar\theta - \theta^*\right\|_2 + \sqrt{\frac{s\log p}{N}}\right) \tag{5}$$
with high probability.
The above result shows that, after the inner loop ends, we obtain a possibly better estimate than the one we started the loop with. Combining such bounds over multiple outer-loop stages yields the overall convergence guarantee for the distributed IHT algorithm, stated as Theorem 3 and proved in the Appendix.
4 Conclusion
In this paper, we studied the application of the IHT algorithm to the high-dimensional Tobit model and extended it to distributed environments spanning multiple machines. This marks the first effort to study the high-dimensional Tobit model in a distributed setting. We have shown that both the local and the distributed estimators achieve a near-minimax convergence rate. Simulation results validate our theoretical findings and demonstrate the good performance of the proposed method across diverse settings. When applied to high-dimensional left-censored HIV viral load data, the IHT method also produced accurate predictions.
Regarding future directions, it is possible to extend the IHT algorithm to decentralized distributed algorithms, such as distributed gradient descent (DGD) introduced in Nedic and Ozdaglar (2009). DGD algorithms are extensively studied in the literature and offer advantages such as avoiding a single point of failure, thereby addressing a limitation of centralized frameworks. Relaxing the requirement of a sufficient local sample size by leveraging information from neighboring nodes in decentralized networks is also possible, as discussed in Ji et al. (2023). Exploring combinations with other distributed methods also remains an intriguing area for future research. Moreover, we can consider a more complex class of missing-data models known as zero-inflated models, where the unobserved values are characterized as arising from a two-component mixture model. The presence of latent components requires a more careful analysis.
Appendix: Proofs
The proof strategies for Theorem 1 and Theorem 2 share similarities. Therefore, we will first provide a detailed proof for Theorem 2, which is expected to be more challenging. We introduce some additional notations used in the proof.
- Let $\mathcal{L}_1(\theta)$ denote the negative log-likelihood of the first machine, and similarly for $\mathcal{L}_j(\theta)$ and $\mathcal{L}_N(\theta)$. The gradient of the negative log-likelihood can be expressed as
$$\nabla\mathcal{L}_1(\theta) = \frac{1}{n}\sum_{i\in\mathcal{D}_1}\begin{pmatrix} -d_i\left(\gamma y_i - x_i^\top\delta\right)x_i + (1-d_i)\,\lambda\!\left(x_i^\top\delta\right)x_i \\ d_i\left[\left(\gamma y_i - x_i^\top\delta\right)y_i - 1/\gamma\right]\end{pmatrix}, \tag{6}$$
where $\lambda(u) = \phi(u)/\Phi(-u)$ is the inverse Mills ratio.
- Let $H(\theta)$ denote the population Hessian matrix, given by
$$H(\theta) = \mathbb{E}\begin{pmatrix} \left[d_i + (1-d_i)\,\lambda'\!\left(x_i^\top\delta\right)\right]x_i x_i^\top & -d_i\, y_i\, x_i \\ -d_i\, y_i\, x_i^\top & d_i\left(y_i^2 + 1/\gamma^2\right)\end{pmatrix}, \tag{7}$$
where $\lambda'(u) = \lambda(u)\left(\lambda(u) - u\right)$.
- Define $F = \hat{S} \cup S^*$, where $\hat{S}$ and $S^*$ represent the support (indices of nonzero entries) of $\delta^{(t)}$ and the true $\delta^*$, respectively.
- Given a vector $v$, a matrix $A$, and an index set $I$, $v_I$ denotes the vector obtained from $v$ by setting the components not in $I$ to zero, while $A_I$ represents the matrix obtained from $A$ by setting the rows not in $I$ to zero. $|I|$ represents the cardinality of the set $I$.
- We say that a random variable $X$ is sub-exponential with parameter $\nu$ if the $\psi_1$-norm of $X$ satisfies
$$\|X\|_{\psi_1} = \sup_{k\ge 1} k^{-1}\left(\mathbb{E}|X|^k\right)^{1/k} \le \nu.$$
- Define $\tilde{s} = 2s + s^*$, $\Delta^{(t)} = \theta^{(t)} - \theta^*$, and $\bar{\Delta} = \bar{\theta} - \theta^*$.
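As a concrete illustration of the restriction notation $v_I$ and $A_I$, here is a small numpy helper; the function names are ours and purely illustrative.

```python
import numpy as np

def restrict_vector(v, index_set):
    """v_I: keep the components of v indexed by I, set the rest to zero."""
    out = np.zeros_like(v)
    idx = list(index_set)
    out[idx] = v[idx]
    return out

def restrict_rows(A, index_set):
    """A_I: keep the rows of A indexed by I, set the remaining rows to zero."""
    out = np.zeros_like(A)
    idx = list(index_set)
    out[idx, :] = A[idx, :]
    return out
```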
Proof of Theorem 2
In the inner loop, we first bound the error $\|\theta^{(t+1)} - \theta^*\|_2$ via two inequalities. The first inequality holds because, under Assumption 1, $c_0 \le \gamma^*$, so it is evident that $|\max\{a, c_0\} - \gamma^*| \le |a - \gamma^*|$ for any $a$, meaning that removing the thresholding with $c_0$ can only increase the error. The second inequality is due to Lemma 1 in Jain et al. (2014). Then, by adding and subtracting terms, we decompose the resulting bound into three terms, $I_1$, $I_2$, and $I_3$.
For $I_1$, by Assumption 2 and Taylor's expansion, we get the contraction bound (8), where the intermediate point $\tilde\theta$ lies on the line segment between $\theta^{(t)}$ and $\theta^*$.
For $I_2$, by Lemma 2, we have, with high probability, the bound (9), and thus $I_2$ is controlled by a term of order $\sqrt{s\log p/n}\,\|\bar\theta - \theta^*\|_2$.
For $I_3$, by Lemma 3, we have, with high probability, a bound of order $\sqrt{s\log p/N}$.
Therefore, by choosing the step size $\eta$ as specified and combining the bounds for $I_1$, $I_2$, and $I_3$, we obtain, with high probability, a linear-convergence recursion for $\|\theta^{(t+1)} - \theta^*\|_2$ in terms of $\|\theta^{(t)} - \theta^*\|_2$ and $\|\bar\theta - \theta^*\|_2$.
Thus, if $n \gtrsim s\log p$ and the number of inner iterations satisfies $T \gtrsim \log n$, then after $T$ iterations the claimed bound (10) holds with high probability.
The proof is completed.
Proof of Theorem 1
In fact, the proof of Theorem 1 is essentially the same as the proof of Theorem 2, with the approximate loss $\tilde{\mathcal{L}}$ replaced by the full-sample loss $\mathcal{L}_n$. Our goal is to provide an upper bound for $\|\theta^{(t+1)} - \theta^*\|_2$. As before, we decompose the error after the gradient step into three terms, $I_1$, $I_2$, and $I_3$.
For $I_1$, by Assumption 2 and Taylor's expansion, we can obtain the same contraction bound as before.
For $I_2$, similar to (9), with high probability, we have an analogous bound.
For $I_3$, as shown in the proof of Lemma 3, a bound of order $\sqrt{s\log p/n}$ holds with high probability.
Combining the bounds and choosing the step size $\eta$ as before, we get a linear-convergence recursion that holds with high probability. Thus, if $n \gtrsim s\log p$ and $T \gtrsim \log n$, after $T$ iterations we obtain, with high probability, the bound stated in Theorem 1.
Proof of Theorem 3.
Theorem 2 implies that each outer-loop round contracts the estimation error $\|\bar\theta - \theta^*\|_2$ by a factor of order $\sqrt{s\log p/n}$, up to an additive term of order $\sqrt{s\log p/N}$. This implies that after a few outer-loop iterations (note that if $N \le n^{C}$ for some constant $C$, the required number of rounds is a constant), we obtain an estimator with error of order $\sqrt{s\log p/N}$.
Technical Lemmas
Lemma 1.
Given $\theta = (\delta^\top, \gamma)^\top$ satisfying $\|\delta\|_0 \le s$ and $\gamma \ge c_0$, under Assumptions 1, 2 and 3, with high probability, the remainder term arising in the Taylor expansion of the gradient around $\theta^*$ (see the proof of Lemma 2) is bounded by a quantity of smaller order than the leading terms.
Proof. Denote the remainder term under consideration by $R$. The proof bounds each coordinate block of $R$ separately, where the last inequality in each blockwise bound follows from the assumptions; combining the blockwise bounds then yields the claim with high probability.
Lemma 2.
For $\theta = (\delta^\top, \gamma)^\top$ satisfying $\|\delta\|_0 \le s$ and $\gamma \ge c_0$, under Assumptions 1, 2, and 3, with high probability, the difference between the empirical and population versions of the gradient increment $\nabla\mathcal{L}_1(\theta) - \nabla\mathcal{L}_1(\theta^*)$, restricted to sparse directions, is bounded by a term of order $\sqrt{s\log p/n}\,\|\theta - \theta^*\|_2$.
Proof. We have, by Taylor’s expansion,
By Lemma 1, the second term above as well as its expectation (can be shown following exactly the same lines) is bounded by . For the 1st term above, we first derive a bound for
(11) |
where we denote for a matrix and we recall that by our notation . We analyze each entry of the empirical Hessian matrix .
- Since $0 < \lambda'(u) < 1$ for all $u$ (Sampford, 1953), we see that each entry of the $\delta\delta$-block, $\left[d_i + (1-d_i)\,\lambda'(x_i^\top\delta)\right]x_{ij}x_{ik}$, is bounded in absolute value by $|x_{ij}x_{ik}|$, a product of two sub-Gaussian variables. Thus, it is sub-exponential.
- Each entry of the $\delta\gamma$-block is $-d_i\, y_i\, x_{ij}$, a product of two sub-Gaussian variables. It is easy to see that it is sub-exponential.
- The $\gamma\gamma$-entry is $d_i\left(y_i^2 + 1/\gamma^2\right)$, and since $y_i$ is sub-Gaussian, it is again sub-exponential.
With the aid of the concentration inequality for sub-exponential variables and a union bound over the entries, (11) is bounded by $C\sqrt{\log p/n}$ with high probability. Consequently, restricting to $\tilde{s}$-sparse directions yields the claimed bound. This completes the proof.
Lemma 3.
Given $\theta = (\delta^\top, \gamma)^\top$ satisfying $\|\delta\|_0 \le s$ and $\gamma \ge c_0$, under Assumptions 1, 2, and 3, with high probability, the relevant gradient term, restricted to $\tilde{s}$-sparse coordinates, is bounded by a term of order $\sqrt{s\log p/N}$.
References
- Belloni, A., and Chernozhukov, V. (2011), "l1-penalized quantile regression in high-dimensional sparse models," The Annals of Statistics, 39(1), 82–130.
- Blumensath, T., and Davies, M. E. (2009), "Iterative hard thresholding for compressed sensing," Applied and Computational Harmonic Analysis, 27(3), 265–274.
- Buckley, J., and James, I. (1979), "Linear regression with censored data," Biometrika, 66(3), 429–436.
- Fan, J., and Li, R. (2001), "Variable selection via nonconcave penalized likelihood and its oracle properties," Journal of the American Statistical Association, 96(456), 1348–1360.
- Foucart, S. (2011), "Hard thresholding pursuit: an algorithm for compressive sensing," SIAM Journal on Numerical Analysis, 49(6), 2543–2563.
- Garg, R., and Khandekar, R. (2009), "Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property," in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 337–344.
- Jacobson, T., and Zou, H. (2024), "High-dimensional censored regression via the penalized Tobit likelihood," Journal of Business & Economic Statistics, 42(1), 286–297.
- Jain, P., Tewari, A., and Kar, P. (2014), "On iterative hard thresholding methods for high-dimensional M-estimation," in Advances in Neural Information Processing Systems.
- Ji, Y., Scutari, G., Sun, Y., and Honnappa, H. (2023), "Distributed sparse regression via penalization," Journal of Machine Learning Research, 24(272), 1–62.
- Jordan, M. I., Lee, J. D., and Yang, Y. (2019), "Communication-efficient distributed statistical inference," Journal of the American Statistical Association, 114(526), 668–681.
- Lee, J. D., Liu, Q., Sun, Y., and Taylor, J. E. (2017), "Communication-efficient sparse regression," Journal of Machine Learning Research, 18(5), 1–30.
- Müller, P., and van de Geer, S. (2016), "Censored linear model in high dimensions," TEST, 25, 75–92.
- Nedic, A., and Ozdaglar, A. (2009), "Distributed subgradient methods for multi-agent optimization," IEEE Transactions on Automatic Control, 54(1), 48–61.
- Olsen, R. J. (1978), "Note on the uniqueness of the maximum likelihood estimator for the Tobit model," Econometrica, 46(5), 1211–1215.
- Sampford, M. R. (1953), "Some inequalities on Mill's ratio and related functions," The Annals of Mathematical Statistics, 24(1), 130–132.
- Soret, P., Avalos, M., Wittkop, L., Commenges, D., and Thiébaut, R. (2018), "Lasso regularization for left-censored Gaussian outcome and high-dimensional predictors," BMC Medical Research Methodology, 18(1), 159.
- Tobin, J. (1958), "Estimation of relationships for limited dependent variables," Econometrica, 26(1), 24–36.
- Wang, Y., Lu, W., and Lian, H. (2023), "Best subset selection for high-dimensional non-smooth models using iterative hard thresholding," Information Sciences, 625, 36–48.
- Zhang, Y., Duchi, J. C., and Wainwright, M. J. (2013), "Communication-efficient algorithms for statistical optimization," Journal of Machine Learning Research, 14(68), 3321–3363.
- Zhou, X., and Liu, G. (2016), "LAD-Lasso variable selection for doubly censored median regression models," Communications in Statistics - Theory and Methods, 45(12), 3658–3667.