
Distributed Iterative Hard Thresholding for Variable Selection in Tobit Models

Changxin Yang$^{a}$, Zhongyi Zhu$^{a}$ and Heng Lian$^{b}$
$^{a}$Department of Statistics, Fudan University, Shanghai, China; $^{b}$Department of Mathematics, City University of Hong Kong, Hong Kong, China
Abstract

While extensive research has been conducted on high-dimensional data and on regression with left-censored responses, simultaneously addressing these complexities remains challenging, with only a few proposed methods available. In this paper, we utilize the Iterative Hard Thresholding (IHT) algorithm on the Tobit model in such a setting. Theoretical analysis demonstrates that our estimator converges with a near-optimal minimax rate. Additionally, we extend the method to a distributed setting, requiring only a few rounds of communication while retaining the estimation rate of the centralized version. Simulation results show that the IHT algorithm for the Tobit model achieves superior accuracy in predictions and subset selection, with the distributed estimator closely matching that of the centralized estimator. When applied to high-dimensional left-censored HIV viral load data, our method also exhibits similar superiority.

Keywords— Censored regression; Distributed optimization; Hard thresholding; High-dimensional statistics; Linear convergence.

1 Introduction

The analysis of left-censored data is a significant statistical focus and has attracted considerable research attention in recent years. Such data often arise due to the lower detection limit of an assay, posing a shared challenge across various fields such as biology, chemistry, and environmental sciences. For example, biological assays used to measure Human Immunodeficiency Virus (HIV) viral load in plasma may be unable to detect concentrations below specific thresholds. Such censoring renders commonly used linear regression methods ineffective. Furthermore, with advancements in modern data collection, these challenges often manifest in high-dimensional scenarios. In the context of HIV infection, there is a critical need to explore the association between viral load and extremely high-dimensional gene expression values. In addressing such challenges, various methodologies have been proposed in prior research. Among these, the Tobit model has proved to be valuable in modeling left-censored responses.

In recent years, significant progress has been made in research on high-dimensional censored data. A useful strategy to tackle high-dimensionality challenges involves constructing penalized estimators such as lasso-type estimators. Müller and van de Geer (2016) and Zhou and Liu (2016) provide theoretical insights into the least absolute deviation estimator with the lasso penalty. Soret et al. (2018) propose the Lasso-regularized Buckley-James least squares algorithm, extending the estimator in Buckley and James (1979). Jacobson and Zou (2024) were the first to consider the high-dimensional Tobit model, optimizing the likelihood function with a nonconvex penalty (specifically the SCAD penalty of Fan and Li (2001)). In this paper, we concentrate on the IHT approach for variable selection. One advantage of using IHT is that the user can directly specify the number of variables to be retained, which may be useful in some scientific investigations.

IHT-style methods, which combine gradient descent with projection operations, have gained popularity in the literature for sparse recovery. Various algorithms have been proposed, such as standard IHT introduced in Blumensath and Davies (2009), GraDeS in Garg and Khandekar (2009), and Hard Thresholding Pursuit (HTP) in Foucart (2011). Jain et al. (2014) demonstrated that the IHT procedure can achieve linear convergence to the optimal solution under the conditions of strong convexity and strong smoothness in high-dimensional settings. Extending this result, Wang et al. (2023) considered nonsmooth loss functions under the less restrictive assumption of a locally positive-definite population Hessian. We will demonstrate that the linear convergence result also holds for the high-dimensional Tobit model.

Additionally, we consider a scenario where data are distributed across multiple locations and develop a distributed variant of our method. With advancements in data collection technology, gathering similar types of data from various regions has become increasingly common. Typically, directly aggregating all data at a central site faces practical hurdles related to storage, communication, and privacy concerns. To tackle these challenges, divide-and-conquer approaches are commonly employed. Many early approaches utilize one-shot methods, where estimators computed on local machines are transmitted to a central node and aggregated to form a global estimate, as in Zhang et al. (2013) and Lee et al. (2017). However, such methods suffer from some drawbacks, as discussed by Jordan et al. (2019), who proposed a communication-efficient distributed algorithm. Fortunately, the gradient descent-type update in the IHT method is conducive to integration with this algorithm. Building upon this concept, we introduce a communication-efficient estimator with an IHT-type update process. To our knowledge, this represents the first exploration of the high-dimensional Tobit model in a distributed setting. Our theoretical analysis and numerical results demonstrate that this estimator’s convergence rate aligns with that achieved when pooling all data together while incurring low communication costs.

This article is structured as follows. In Section 2, we review the Tobit model and then develop both local (centralized) IHT and the distributed IHT. Section 3 is dedicated to our theoretical analysis. Section 4 draws some conclusions. All proofs are relegated to the Appendix.

2 The Tobit model and IHT

2.1 Local IHT for Tobit model

We consider the Tobit model, introduced in Tobin (1958), a crucial tool for modeling a left-censored response. We assume the existence of a latent response variable $y^{*}$ such that $y=\max\left\{y^{*},c_{0}\right\}$, where $y^{*}$ follows a linear model

y^{*}=\mathbf{x}^{\prime}\bm{\beta}^{*}+\epsilon.

Here, $\mathbf{x}=\left(1,x_{1},\ldots,x_{d}\right)^{\prime}\in\mathbb{R}^{d+1}$, $\bm{\beta}^{*}=\left(\beta_{0}^{*},\beta_{1}^{*},\ldots,\beta_{d}^{*}\right)^{\prime}\in\mathbb{R}^{d+1}$, and $\epsilon\sim\mathcal{N}\left(0,(\sigma^{*})^{2}\right)$. We focus on a high-dimensional scenario where the dimension $d$ is much larger than the sample size $n$, with $\left\|\bm{\beta}^{*}\right\|_{0}=s_{0}$. Without loss of generality, we assume $c_{0}=0$ throughout the following discussions.

Let $\{\mathbf{z}_{i}=\left(\mathbf{x}_{i},y_{i}\right)\}_{i=1}^{n}$ represent an independent and identically distributed (i.i.d.) sample from the Tobit model, and define the indicator $d_{i}=I_{y_{i}>0}$. The likelihood function for the censored response in the Tobit model can be expressed as

L_{n}\left(\bm{\beta},\sigma^{2},\{\mathbf{z}_{i}\}_{i=1}^{n}\right)=\prod_{i=1}^{n}\left[{1\over\sqrt{2\pi}\sigma}\exp\left\{-{1\over 2\sigma^{2}}\left(y_{i}-\mathbf{x}_{i}^{\prime}\bm{\beta}\right)^{2}\right\}\right]^{d_{i}}\left[\Phi\left({-\mathbf{x}_{i}^{\prime}\bm{\beta}\over\sigma}\right)\right]^{1-d_{i}},

where $\Phi(\cdot)$ is the standard normal cumulative distribution function. While $L_{n}\left(\bm{\beta},\sigma^{2}\right)$ is not concave in $\left(\bm{\beta},\sigma^{2}\right)$, Olsen (1978) found that the reparameterization $\bm{\delta}=\bm{\beta}/\sigma$ and $\gamma^{2}=\sigma^{-2}$ results in a concave log-likelihood. Denoting $\bm{\theta}=(\bm{\delta},\gamma)$, after dropping ignorable constants, the negative log-likelihood is given by

\widehat{\mathcal{L}}(\bm{\theta})={1\over n}\sum_{i=1}^{n}{\mathcal{L}}(\bm{\theta},\mathbf{z}_{i})={1\over n}\sum_{i=1}^{n}d_{i}\left[-{\rm log}(\gamma)+{1\over 2}\left(\gamma y_{i}-\mathbf{x}_{i}^{\prime}\bm{\delta}\right)^{2}\right]-\left(1-d_{i}\right){\rm log}\left(\Phi\left(-\mathbf{x}_{i}^{\prime}\bm{\delta}\right)\right).
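For concreteness, the reparameterized negative log-likelihood above and its gradient (see expression (6) in the Appendix) can be evaluated along the following lines. This is a minimal NumPy/SciPy sketch; the names tobit_negloglik and tobit_grad are ours and do not come from any package.

```python
import numpy as np
from scipy.stats import norm

def tobit_negloglik(theta, X, y):
    """Average negative log-likelihood of the Tobit model in the Olsen
    parameterization theta = (delta, gamma), with censoring point c0 = 0."""
    delta, gamma = theta[:-1], theta[-1]
    d = (y > 0).astype(float)                      # uncensored indicator d_i
    xb = X @ delta                                 # x_i' delta
    uncensored = d * (-np.log(gamma) + 0.5 * (gamma * y - xb) ** 2)
    censored = -(1.0 - d) * norm.logcdf(-xb)       # -log Phi(-x_i' delta)
    return np.mean(uncensored + censored)

def tobit_grad(theta, X, y):
    """Gradient of tobit_negloglik with respect to (delta, gamma)."""
    delta, gamma = theta[:-1], theta[-1]
    d = (y > 0).astype(float)
    xb = X @ delta
    r = gamma * y - xb
    g = np.exp(norm.logpdf(-xb) - norm.logcdf(-xb))   # g(a) = phi(a)/Phi(a) at a = -x'delta
    grad_delta = X.T @ (-d * r + (1.0 - d) * g) / len(y)
    grad_gamma = np.mean(-d * (1.0 / gamma - y * r))
    return np.concatenate([grad_delta, [grad_gamma]])
```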

In the high-dimensional setting where $d$ is large, and possibly even larger than $n$, variable selection is desired. In this paper, we consider the hard constraint $\|\bm{\delta}\|_{0}\leq s$, where $s$ is a known upper bound of $s_{0}$. To optimize this constrained objective function, we use the IHT method, which is designed for high-dimensional regression. This method combines gradient descent with projection operations, making it computationally efficient. Furthermore, we slightly modify it for the Tobit model to prevent $\gamma$ from getting too small or even negative. Specifically, denoting the initial value as $\bm{\theta}^{0}$, for $t=0,1,\ldots$, our IHT algorithm can be formulated as

\bm{\theta}^{t+1}=P_{s,C^{*}}\left(\bm{\theta}^{t}-\eta\nabla\widehat{\mathcal{L}}\left(\bm{\theta}^{t}\right)\right)\doteq\left\{\begin{array}{l}\bm{\delta}^{t+1}=P_{s}\left(\bm{\delta}^{t}-\eta\nabla_{\bm{\delta}}\widehat{\mathcal{L}}\left(\bm{\theta}^{t}\right)\right)\\ \gamma^{t+1}=\mathcal{T}_{C^{*}}\left(\gamma^{t}-\eta\nabla_{\gamma}\widehat{\mathcal{L}}\left(\bm{\theta}^{t}\right)\right)\end{array}\right. (1)

where $\eta>0$ is the step size, $P_{s}(\cdot)$ is the projection operator that retains only the $s$ entries with the largest absolute values while setting other entries to zero, and

\mathcal{T}_{C^{*}}(\gamma^{t})=\begin{cases}\gamma^{t}-\eta\nabla_{\gamma}\widehat{\mathcal{L}}\left(\bm{\theta}^{t}\right)&\text{if }\gamma^{t}-\eta\nabla_{\gamma}\widehat{\mathcal{L}}\left(\bm{\theta}^{t}\right)\geq C^{*}\\ C^{*}&\text{otherwise}.\end{cases}
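In code, the combined projection $P_{s,C^{*}}$ amounts to hard-thresholding the $\bm{\delta}$ block and clipping $\gamma$ from below. A minimal NumPy sketch (the name project_s_C is our own choice) is given below.

```python
import numpy as np

def project_s_C(theta, s, C_star):
    """P_{s,C*}: keep the s largest-magnitude entries of the delta block,
    zero out the rest, and clip gamma (the last entry) from below at C_star."""
    delta, gamma = theta[:-1], theta[-1]
    keep = np.argsort(np.abs(delta))[-s:]
    delta_proj = np.zeros_like(delta)
    delta_proj[keep] = delta[keep]
    return np.concatenate([delta_proj, [max(gamma, C_star)]])
```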
Algorithm 1 Local IHT for Tobit Regression
Input: sparsity level $s$, number of iterations $T$, step size $\eta$, lower bound $C^{*}$.
Initialize $\bm{\theta}^{0}_{local}$.
for $t=0,\ldots,T-1$ do
    Calculate the gradient $\nabla\widehat{\mathcal{L}}(\bm{\theta}^{t}_{local})$.
    Update $\bm{\theta}^{t+1}_{local}=P_{s,C^{*}}\left(\bm{\theta}^{t}_{local}-\eta\nabla\widehat{\mathcal{L}}\left(\bm{\theta}_{local}^{t}\right)\right)$.
end for
return $\bm{\theta}^{T}_{local}=(\bm{\delta}^{T}_{local},\gamma^{T}_{local})$

Note that we have excluded $\gamma$ from the projection operation, as $\gamma$ is not part of the parameter for variable selection. Additionally, we introduce a positive lower bound denoted as $C^{*}$ to prevent $\gamma\leq 0$ during the update process.

In practical applications, the values of $s$ and $C^{*}$ are typically unknown and are treated as tuning parameters. There are two main approaches for determining $s$. First, the user can directly specify it based on a predefined desired sparsity, such as selecting the top 10 variables of interest. An alternative is to treat sparsity as a tuning parameter that can be optimized through cross-validation. $C^{*}$ can be set to a sufficiently small value to ensure it is smaller than $\gamma^{*}$. We present the entire IHT algorithm in Algorithm 1 (called local IHT to distinguish it from the distributed cases). Theorem 1 in Section 3 provides a bound on the estimation error for the local IHT algorithm (after a sufficient number of iterations) of order $O\left(\sqrt{{s{\rm log}(n\vee d)\over n}}\right)$, which is well-known as the minimax rate for linear regression under the sparsity condition (if $s_{0}\ll d$, and $s$ and $s_{0}$ are of the same order).
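To illustrate, Algorithm 1 might be implemented as the short loop below. This is a sketch only: it reuses tobit_grad and project_s_C from the earlier sketches, and the default values for $\eta$, $T$ and $C^{*}$ are arbitrary placeholders rather than recommended choices. The zero default start corresponds to the trivial initialization $\bm{\theta}_{int}=(\mathbf{0},C^{*})$ mentioned in Remark 1 below; a warm start such as the Tobit-lasso estimator can be passed instead.

```python
import numpy as np

def local_iht_tobit(X, y, s, T=200, eta=0.1, C_star=1e-2, theta0=None):
    """Algorithm 1 (local IHT): iterate theta <- P_{s,C*}(theta - eta * gradient).
    Relies on tobit_grad and project_s_C defined in the earlier sketches."""
    p = X.shape[1]
    # trivial initialization (0, C*) unless a warm start is supplied
    theta = (np.concatenate([np.zeros(p), [C_star]])
             if theta0 is None else np.asarray(theta0, float).copy())
    for _ in range(T):
        theta = project_s_C(theta - eta * tobit_grad(theta, X, y), s, C_star)
    return theta
```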

2.2 Distributed IHT for Tobit model

In this section, we delve into the distributed implementation of the high-dimensional Tobit model across MM machines. This scenario holds significance in settings characterized by large volumes of training data. In such contexts, aggregating all raw data directly becomes impractical due to constraints like limited storage capacity, high communication costs, and privacy considerations. Hence, we are motivated to explore the development of a communication-efficient distributed IHT method.

For simplicity, we assume that each machine stores data of the same sample size $n$. Let $\mathcal{H}_{m}$ denote the subsample with a sample size of $n$ stored in the $m$-th machine $\mathcal{M}_{m}$, for $m=1,\ldots,M$, and denote the global sample size $N=nM$. The local and global negative log-likelihood functions take the following form:

\mathcal{L}_{m}(\bm{\theta})={1\over n}\sum_{i\in\mathcal{H}_{m}}{\mathcal{L}}(\bm{\theta},\mathbf{z}_{i}),\qquad\mathcal{L}_{N}(\bm{\theta})={1\over M}\sum_{m=1}^{M}\mathcal{L}_{m}(\bm{\theta}).

Our target is to find

{\bm{\theta}}^{*}=\underset{\bm{\theta}:\|\bm{\theta}\|_{0}\leq s+1}{\operatorname{argmin}}\ \mathcal{L}^{*}(\bm{\theta})\doteq E(\mathcal{L}_{N}(\bm{\theta})). (2)

When direct aggregation of all data to optimize the global loss function is infeasible, a common approach is to approximate the global loss function via inter-machine communication. Subsequently, this approximation is used for optimization based on local data. Jordan et al. (2019) demonstrated that this approach could achieve optimal statistical precision with just a few rounds of communication. Following suit, we replace the original global negative log-likelihood function with an approximation. Specifically, given the current estimate $\overline{\bm{\theta}}$, we first collect the gradients $\nabla\mathcal{L}_{m}(\overline{\bm{\theta}})$ from different machines and then approximate the global negative log-likelihood function as

\widetilde{\mathcal{L}}(\bm{\theta}):=\mathcal{L}_{1}(\bm{\theta})-\left\langle\bm{\theta},\nabla\mathcal{L}_{1}(\overline{\bm{\theta}})-\nabla\mathcal{L}_{N}(\overline{\bm{\theta}})\right\rangle, (3)

where $\mathcal{L}_{1}(\bm{\theta})$ is the local empirical negative log-likelihood function on the first machine $\mathcal{M}_{1}$ (treated as the central machine that can communicate with all others) and $\nabla\mathcal{L}_{N}(\overline{\bm{\theta}})={1\over M}\sum_{m=1}^{M}\nabla\mathcal{L}_{m}(\overline{\bm{\theta}})$. The gradient of the approximate loss is $\nabla\widetilde{\mathcal{L}}(\bm{\theta}):=\nabla\mathcal{L}_{1}(\bm{\theta})-(\nabla\mathcal{L}_{1}(\overline{\bm{\theta}})-\nabla\mathcal{L}_{N}(\overline{\bm{\theta}}))$. Thus, based on the approximate global negative log-likelihood function, the IHT algorithm is, starting from $\bm{\theta}^{0}=\overline{\bm{\theta}}$,

\bm{\theta}^{t+1}=P_{s,C^{*}}\left(\bm{\theta}^{t}-\eta\nabla\widetilde{\mathcal{L}}(\bm{\theta}^{t})\right). (4)

After (4) converges, $\mathcal{M}_{1}$ sends the new estimate to all other machines to obtain the updated approximate global negative log-likelihood function. Algorithm 2 presents the details of the distributed IHT algorithm for the Tobit model. Our theoretical results will show that the distributed IHT method can achieve the same statistical rate as the centralized version. Our simulation results further validate that it offers comparable statistical precision compared to its non-distributed counterpart.

Algorithm 2 Distributed IHT for Tobit Regression
Input: sparsity level $s$, number of iterations in the inner loop $T$, number of iterations for the outer loop $Q$, step size $\eta$, lower bound $C^{*}$.
Initialize $\bm{\theta}^{0}_{dis}$.
for $q=0,\ldots,Q-1$ do
    Let $\bar{\bm{\theta}}=\bm{\theta}^{q}_{dis}$ and $\mathcal{M}_{1}$ sends $\bar{\bm{\theta}}$ to all other machines.
    Each machine $\mathcal{M}_{m}$ computes $\nabla\mathcal{L}_{m}(\bar{\bm{\theta}})$ and sends it to $\mathcal{M}_{1}$.
    $\mathcal{M}_{1}$ calculates $\nabla\mathcal{L}_{N}(\bar{\bm{\theta}})={1\over M}\sum_{m=1}^{M}\nabla\mathcal{L}_{m}(\bar{\bm{\theta}})$.
    On $\mathcal{M}_{1}$, let $\bm{\theta}^{0}_{inner}=\bar{\bm{\theta}}$ and execute the following inner loop.
    for $t=0,\ldots,T-1$ do
        $\bm{\theta}^{t+1}_{inner}=P_{s,C^{*}}\left(\bm{\theta}_{inner}^{t}-\eta\nabla\widetilde{\mathcal{L}}(\bm{\theta}_{inner}^{t})\right)$.
    end for
    Let $\bm{\theta}^{q+1}_{dis}=\bm{\theta}^{T}_{inner}$.
end for
return $\bm{\theta}^{Q}_{dis}=(\bm{\delta}^{Q}_{dis},\gamma^{Q}_{dis})$
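As an illustration of Algorithm 2, the sketch below simulates the $M$ machines as a list of (X, y) shards held in one process; in a real deployment the gradient-averaging step would be an actual communication round. It reuses tobit_grad and project_s_C from the earlier sketches, and the defaults for $Q$, $T$, $\eta$ and $C^{*}$ are placeholders, not recommended values.

```python
import numpy as np

def distributed_iht_tobit(shards, s, Q=5, T=200, eta=0.1, C_star=1e-2, theta0=None):
    """Algorithm 2 (distributed IHT): one gradient-communication round per outer
    iteration, then an inner IHT loop on machine M_1 using the surrogate
    gradient of (3).  Relies on tobit_grad and project_s_C from earlier sketches."""
    X1, y1 = shards[0]                                    # machine M_1 (central)
    p = X1.shape[1]
    theta = (np.concatenate([np.zeros(p), [C_star]])
             if theta0 is None else np.asarray(theta0, float).copy())
    for _ in range(Q):                                    # outer (communication) loop
        theta_bar = theta.copy()
        # communication round: average the local gradients evaluated at theta_bar
        grad_N = np.mean([tobit_grad(theta_bar, X, y) for X, y in shards], axis=0)
        shift = tobit_grad(theta_bar, X1, y1) - grad_N    # grad L_1(bar) - grad L_N(bar)
        for _ in range(T):                                # inner loop, local data only
            surrogate_grad = tobit_grad(theta, X1, y1) - shift
            theta = project_s_C(theta - eta * surrogate_grad, s, C_star)
    return theta
```

Only gradients and the current iterate are exchanged, so each outer iteration costs one round of communication of $O(d)$ numbers per machine.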

3 Convergence and statistical rates

We present convergence guarantees for both the local IHT and the distributed IHT in this section. In the rest of the paper, $c$ and $C$ denote positive constants independent of $n$, $d$, $M$, whose values may vary across instances. In the following, for positive sequences $a_{n}$ and $b_{n}$, $a_{n}=O(b_{n})$ or $a_{n}\lesssim b_{n}$ means $a_{n}/b_{n}\leq c$ for some constant $c$, $a_{n}=o(b_{n})$ means $a_{n}/b_{n}\rightarrow 0$, $a_{n}\asymp b_{n}$ means $a_{n}=O(b_{n})$ and $b_{n}=O(a_{n})$, and $a_{n}\ll b_{n}$ or $b_{n}\gg a_{n}$ is the same as $a_{n}=o(b_{n})$. We make the following assumptions.

• Assumption 1 (Truncation and signal strength): $C^{*}$ is a constant that satisfies $0<C^{*}<\gamma^{*}$. $\|\bm{\beta}^{*}\|_{0}=s_{0}$ and $\|\bm{\beta}^{*}\|_{2}\leq C$.

• Assumption 2 (Sparse eigenvalue condition): There exist positive constants $C_{1}$ and $\tilde{C}_{2}$ such that:

  (i) C_{1}\leq E\left[\mathbf{a}^{\prime}\begin{pmatrix}I_{y>0}\mathbf{x}\mathbf{x}^{\prime}&-I_{y>0}\mathbf{x}y\\ -I_{y>0}y\mathbf{x}^{\prime}&I_{y>0}y^{2}\end{pmatrix}\mathbf{a}\right]<E\left[\mathbf{a}^{\prime}\begin{pmatrix}\mathbf{x}\mathbf{x}^{\prime}&-I_{y>0}\mathbf{x}y\\ -I_{y>0}y\mathbf{x}^{\prime}&I_{y>0}y^{2}\end{pmatrix}\mathbf{a}\right]\leq\tilde{C}_{2},
  for all $(2s+s_{0}+1)$-sparse unit vectors $\mathbf{a}$.

  (ii) $\left(1+\sqrt{{s_{0}\over s}}\right){C_{2}-C_{1}\over C_{2}+C_{1}}<1$ holds, where $C_{2}\doteq\tilde{C}_{2}+{1\over\left(C^{*}\right)^{2}}$, and $s=O(s_{0})$.

• Assumption 3 (Sub-Gaussian designs): For some constant $C$, the $\mathbf{x}_{i}$ are i.i.d. and $C$-sub-Gaussian, denoted as $\operatorname{subG}(C)$. That is, for all $\alpha\in\mathbb{R}^{d}$,

  \mathbb{E}[\exp(\alpha^{\prime}\mathbf{x})]\leq\exp\left({C^{2}\|\alpha\|_{2}^{2}\over 2}\right).

• Assumption 4 (Proper initialization): $\|\bm{\theta}_{int}\|_{0}\leq(s+1)$ and $\gamma_{int}\geq C^{*}$.

Remark 1: To satisfy Assumption 1, we need to use a sufficiently small $C^{*}$. The sparse eigenvalue assumption is frequently used in high-dimensional data analysis. Note that here the assumption is imposed on the population quantity, so even the standard eigenvalue assumption without constraining $\mathbf{a}$ to be sparse is reasonable. Examining the form of the population Hessian matrix $H(\bm{\theta})$ (given in the Appendix), it is easy to see that Assumptions 1 and 2(i) actually imply the Hessian matrix has eigenvalues between $C_{1}$ and $C_{2}$. Assumption 3 is also a standard assumption in high-dimensional regression analysis, making it possible to derive tight bounds on different quantities (Jordan et al., 2019; Belloni and Chernozhukov, 2011). Combined with the assumption $\|\bm{\beta}^{*}\|_{2}\leq C$ and the fact that the noise is Gaussian, it also implies $y_{i}$ is sub-Gaussian. Finally, a trivial initial value satisfying Assumption 4 is $\bm{\theta}_{int}=(\mathbf{0},C^{*})$.

The following theorem provides a bound for the $\ell_{2}$ estimation error of the local IHT algorithm, confirming the linear convergence of the IHT algorithm for the Tobit model.

Theorem 1.

(Local IHT for Tobit) Under Assumptions 1–4, choosing the step size $\eta={2\over C_{1}+C_{2}}$,

\|\bm{\theta}_{local}^{t+1}-\bm{\theta}^{*}\|_{2}\leq\left(\left(1+\sqrt{{s_{0}\over s}}\right){C_{2}-C_{1}\over C_{2}+C_{1}}+Cs\sqrt{{{\rm log}(d\vee n)\over n}}\right)\|\bm{\theta}_{local}^{t}-\bm{\theta}^{*}\|_{2}+C\sqrt{s}\|\bm{\theta}_{local}^{t}-\bm{\theta}^{*}\|_{2}^{2}+C\sqrt{{s{\rm log}(d\vee n)\over n}},

with probability at least $1-(d\vee n)^{-C}$. In particular, if ${s^{2}{\rm log}(d\vee n)\over n}=o(1)$ and $\sqrt{s}\|\bm{\theta}_{int}-\bm{\theta}^{*}\|=o(1)$, after $T_{1}\asymp{\rm log}\left({n\left\|\bm{\theta}_{int}-\bm{\theta}^{*}\right\|^{2}_{2}\over s{\rm log}(d\vee n)}\right)$ iterations, we have

\left\|\bm{\theta}_{local}^{T_{1}}-\bm{\theta}^{*}\right\|_{2}\lesssim\sqrt{{s{\rm log}(d\vee n)\over n}},

with probability at least $1-(d\vee n)^{-C}$.

The specific value chosen for $\eta$ is not the only option, and we opt for it for convenience in the proof. Our proof reveals that the requirement for $\eta$ and $s$ is actually that $\left(1+\sqrt{{s_{0}\over s}}\right)\max\left\{\left|1-\eta C_{1}\right|,\left|1-\eta C_{2}\right|\right\}<1$. For initialization, we recommend the Tobit-lasso estimator as it offers computational simplicity and has been shown to be consistent in Jacobson and Zou (2024).

Theorem 1 establishes the convergence of the local IHT algorithm. The following theorem concerns the convergence of the inner loop in the distributed setting, which is very similar to Theorem 1. This is not surprising since the approximate likelihood is expected to be close to the true likelihood.

Theorem 2.

(Inner loop for Distributed IHT) Given $\bar{\bm{\theta}}=\bm{\theta}_{int}$ in Algorithm 2, under Assumptions 1–4, choosing the step size $\eta={2\over C_{1}+C_{2}}$, if ${s^{2}{\rm log}(d\vee n)\over n}=o(1)$ and $\sqrt{s}\|\bar{\bm{\theta}}-\bm{\theta}^{*}\|=o(1)$, then after $T_{2}\asymp{\rm log}\left({n\|\bar{\bm{\theta}}-\bm{\theta}^{*}\|^{2}\over s{\rm log}(d\vee n)}\right)$ iterations, we have

\left\|\bm{\theta}^{T_{2}}_{inner}-\bm{\theta}^{*}\right\|_{2}\lesssim\sqrt{s}\left(\sqrt{{s{\rm log}(d\vee n)\over n}}\left\|\bar{\bm{\theta}}-\bm{\theta}^{*}\right\|_{2}+\|\bar{\bm{\theta}}-\bm{\theta}^{*}\|_{2}^{2}+\sqrt{{{\rm log}(d\vee n)\over N}}\right), (5)

with probability at least $1-(d\vee n)^{-C}$.

The above result shows that after the inner loop ends, we have possibly a better estimate compared to when the loop starts. Combining such bounds over multiple stages $Q$, we get the following result.

Theorem 3.

(Distributed IHT for Tobit) Under the assumptions of Theorem 2, after $Q\asymp{{\rm log}N\over{\rm log}n}$ iterations in the outer loop (with the number of iterations in the inner loop as in the last statement of Theorem 2), with probability at least $1-(d\vee n)^{-C}$, we have

\left\|\bm{\theta}^{Q}_{dis}-\bm{\theta}^{*}\right\|_{2}\lesssim\sqrt{{s{\rm log}(d\vee n)\over N}}.

4 Conclusion

In this paper, we delved into applying the IHT algorithm to the high-dimensional Tobit model and expanded its utility to distributed environments spanning multiple machines. This marks the first effort to study the Tobit model under the distributed setting in high-dimensional scenarios. We have shown that achieving a minimax convergence rate for local and distributed estimators is feasible. Simulation results validate our theoretical findings and demonstrate the good performance of our proposed method across diverse settings. When applied to high-dimensional left-censored HIV viral load data, the IHT method also effectively produced accurate predictions.

Regarding future directions, it is possible to explore extending the IHT algorithm to decentralized distributed algorithms, such as distributed gradient descent (DGD) as introduced in Nedic and Ozdaglar (2009). DGD algorithms are extensively studied in the literature and offer advantages such as avoiding a single point of failure, addressing the limitations of centralized frameworks. Relaxing the requirement for sufficient local sample size by leveraging information from neighboring nodes in decentralized networks is also possible, as discussed in Ji et al. (2023). Exploring the combination with other distributed methods also remains an intriguing area for future research. Moreover, we can also consider a more complex class of missing data models known as Zero-Inflated models, where missing values are characterized as arising from a two-component mixture model. The presence of latent components in missing values requires a more careful analysis.

Appendix: Proofs

The proof strategies for Theorem 1 and Theorem 2 share similarities. Therefore, we will first provide a detailed proof for Theorem 2, which is expected to be more challenging. We introduce some additional notations used in the proof.

• Let $\phi(\cdot)$ and $\Phi(\cdot)$ denote the standard normal probability density function (PDF) and cumulative distribution function (CDF), respectively. Let $g(a)={\phi(a)\over\Phi(a)}$. Define $h(a)=g(a)(a+g(a))=-g^{\prime}(a)$. According to Sampford (1953) and Jacobson and Zou (2024), we have $0<h(a)<1$ and $-4.3<h^{\prime}(a)<0$.

• Let $\nabla\mathcal{L}_{1}^{t}=\nabla\mathcal{L}_{1}\left(\bm{\theta}^{t}\right)$ denote the gradient of the negative log-likelihood on the first machine, and similarly for $\nabla\tilde{\mathcal{L}}^{t}=\nabla\tilde{\mathcal{L}}(\bm{\theta}^{t})$. The gradient $\nabla\mathcal{L}_{1}(\bm{\theta})$ can be expressed as

{1\over n}\sum_{i\in{\cal H}_{1}}\left[\begin{array}{c}-I_{y_{i}>0}\mathbf{x}_{i}\left(\gamma y_{i}-\mathbf{x}_{i}^{\prime}\bm{\delta}\right)+I_{y_{i}\leq 0}\,\mathbf{x}_{i}\,g\left(-\mathbf{x}^{\prime}_{i}\bm{\delta}\right)\\ -I_{y_{i}>0}\left(\gamma^{-1}-y_{i}\left(\gamma y_{i}-\mathbf{x}_{i}^{\prime}\bm{\delta}\right)\right)\end{array}\right]. (6)

• Let $H(\bm{\theta})$ denote the population Hessian matrix, given by

E\left(\begin{bmatrix}I_{y>0}\mathbf{xx}^{\prime}+I_{y\leq 0}\mathbf{xx}^{\prime}h\left(-\mathbf{x}^{\prime}\bm{\delta}\right)&-I_{y>0}\mathbf{x}y\\ -I_{y>0}y\mathbf{x}^{\prime}&I_{y>0}(y^{2}+\gamma^{-2})\end{bmatrix}\right). (7)

• Define $I^{t}=S^{t}\cup S_{0}$, where $S^{t}$ and $S_{0}$ represent the support (indices of nonzero entries) of $\bm{\theta}^{t}$ and the true $\bm{\theta}^{*}$, respectively.

• Given a vector $\bm{\theta}$, a matrix $B$, and an index set $I\subseteq\{1,\ldots,d+2\}$, $\bm{\theta}_{I}$ denotes the vector obtained from $\bm{\theta}$ by setting components not in $I$ to zero, while $B_{I}$ represents the matrix obtained from $B$ by setting the components of rows not in $I$ to zero. $|I|$ represents the cardinality of the set $I$.

• We say that a random variable $X$ is sub-exponential with parameter $C$ if the $L^{p}$ norms of $X$ satisfy $\|X\|_{L^{p}}=\left({E}|X|^{p}\right)^{1/p}\leq Cp$ for all $p\geq 1$.

• Define $\mathbf{x}_{\max}=\max_{i\in\{1,\dots,n\}}\|\mathbf{x}_{i}\|_{\infty}$ and $y_{\max}=\max_{i\in\{1,\dots,n\}}|y_{i}|$.

Proof of Theorem 2

In the inner loop, we have

e_{t+1}\doteq\left\|\bm{\theta}^{t+1}-\bm{\theta}^{*}\right\|_{2}=\left\|\left(P_{s,C^{*}}\left(\bm{\theta}^{t}-\eta\nabla\tilde{\mathcal{L}}^{t}\right)\right)_{I^{t+1}}-\bm{\theta}^{*}_{I^{t+1}}\right\|_{2}
\leq\left\|\left(P_{s}\left(\bm{\theta}^{t}-\eta\nabla\tilde{\mathcal{L}}^{t}\right)\right)_{I^{t+1}}-\left(\bm{\theta}^{t}-\eta\nabla\tilde{\mathcal{L}}^{t}\right)_{I^{t+1}}\right\|_{2}+\left\|\left(\bm{\theta}^{t}-\eta\nabla\tilde{\mathcal{L}}^{t}\right)_{I^{t+1}}-\bm{\theta}^{*}_{I^{t+1}}\right\|_{2}
\leq\left(1+\sqrt{{\left|I^{t+1}\right|-s\over\left|I^{t+1}\right|-s_{0}}}\right)\left\|\left(\bm{\theta}^{t}-\eta\nabla\tilde{\mathcal{L}}^{t}\right)_{I^{t+1}}-\bm{\theta}^{*}_{I^{t+1}}\right\|_{2}.

The first inequality above holds because, under Assumption 1, if $\gamma^{t}-\eta\nabla_{\gamma}\tilde{\mathcal{L}}\left(\bm{\theta}^{t}\right)<C^{*}$, then it is evident that $\left|\gamma^{*}-C^{*}\right|\leq\left|\gamma^{*}-\left(\gamma^{t}-\eta\nabla_{\gamma}\tilde{\mathcal{L}}\left(\bm{\theta}^{t}\right)\right)\right|$, meaning removing the thresholding with $C^{*}$ can only increase the error. The second inequality is due to Lemma 1 in Jain et al. (2014).

Then, by adding and subtracting terms, we have

\left\|\left({\bm{\theta}}^{t}-\eta\nabla\tilde{\mathcal{L}}^{t}\right)_{I^{t+1}}-\bm{\theta}^{*}_{I^{t+1}}\right\|_{2}
\leq\left\|{\bm{\theta}}_{I^{t+1}}^{t}-\bm{\theta}^{*}_{I^{t+1}}-\eta E\left(\nabla\tilde{\mathcal{L}}^{t}-\nabla\tilde{\mathcal{L}}\left(\bm{\theta}^{*}\right)\right)_{I^{t+1}}\right\|_{2}+\eta\left\|E\left(\nabla\tilde{\mathcal{L}}^{t}-\nabla\tilde{\mathcal{L}}\left(\bm{\theta}^{*}\right)\right)_{I^{t+1}}-\left(\nabla\tilde{\mathcal{L}}^{t}-\nabla\tilde{\mathcal{L}}\left(\bm{\theta}^{*}\right)\right)_{I^{t+1}}\right\|_{2}+\eta\left\|\left(\nabla\tilde{\mathcal{L}}\left(\bm{\theta}^{*}\right)\right)_{I^{t+1}}\right\|_{2}
=\left\|{\bm{\theta}}_{I^{t+1}}^{t}-\bm{\theta}^{*}_{I^{t+1}}-\eta E\left(\nabla\mathcal{L}_{1}^{t}-\nabla{\mathcal{L}_{1}}\left(\bm{\theta}^{*}\right)\right)_{I^{t+1}}\right\|_{2}+\eta\left\|E\left(\nabla\mathcal{L}_{1}^{t}-\nabla{\mathcal{L}_{1}}\left(\bm{\theta}^{*}\right)\right)_{I^{t+1}}-\left(\nabla\mathcal{L}_{1}^{t}-\nabla{\mathcal{L}_{1}}\left(\bm{\theta}^{*}\right)\right)_{I^{t+1}}\right\|_{2}+\eta\left\|\left(\nabla\tilde{\mathcal{L}}\left(\bm{\theta}^{*}\right)\right)_{I^{t+1}}\right\|_{2}
\doteq\|P_{1}\|_{2}+\eta\|P_{2}\|_{2}+\eta\|P_{3}\|_{2}.

For $P_{1}$, by Assumption 2 and Taylor's expansion, we get

\left\|\bm{\theta}_{I^{t+1}}^{t}-\bm{\theta}^{*}_{I^{t+1}}-\eta\left(E\nabla\mathcal{L}_{1}^{t}-E\nabla\mathcal{L}_{1}\left(\bm{\theta}^{*}\right)\right)_{I^{t+1}}\right\|_{2}\leq\left\|(I-\eta H(\bm{\alpha}))_{I^{t+1}\cup S^{t}}\left(\bm{\theta}_{I^{t+1}\cup S^{t}}^{t}-\bm{\theta}^{*}_{I^{t+1}\cup S^{t}}\right)\right\|_{2}\leq\max\left\{\left|1-\eta C_{1}\right|,\left|1-\eta C_{2}\right|\right\}\left\|\bm{\theta}^{t}-\bm{\theta}^{*}\right\|_{2}, (8)

where $\bm{\alpha}$ lies on the line segment between $\bm{\theta}^{t}$ and $\bm{\theta}^{*}$.

For $P_{2}$, by Lemma 2, with probability at least $1-(d\vee n)^{-C}$, we have

\|P_{2}\|_{\infty}\lesssim\sqrt{{s{\rm log}(d\vee n)\over n}}e_{t}+e_{t}^{2}, (9)

and thus $\|P_{2}\|_{2}\lesssim s\sqrt{{{\rm log}(d\vee n)\over n}}e_{t}+\sqrt{s}e_{t}^{2}$.

For $P_{3}$, by Lemma 3, we have, with probability at least $1-(d\vee n)^{-C}$,

\|P_{3}\|_{2}\lesssim s\sqrt{{{\rm log}(d\vee n)\over n}}e_{0}+\sqrt{s}e_{0}^{2}+\sqrt{{s{\rm log}(d\vee n)\over N}}.

Therefore, by choosing $\eta={2\over C_{2}+C_{1}}$ and combining the bounds for $P_{1}$, $P_{2}$, and $P_{3}$, we have

e_{t+1}\leq\left(\left(1+\sqrt{{s_{0}\over s}}\right){C_{2}-C_{1}\over C_{2}+C_{1}}+Cs\sqrt{{{\rm log}(d\vee n)\over n}}\right)e_{t}+\sqrt{s}e_{t}^{2}+Cs\sqrt{{{\rm log}(d\vee n)\over n}}e_{0}+C\sqrt{s}e_{0}^{2}+\sqrt{{s{\rm log}(d\vee n)\over N}},

with probability at least $1-(d\vee n)^{-C}$.

Thus, if $s\sqrt{{{\rm log}(d\vee n)\over n}}=o(1)$ and $\sqrt{s}e_{0}=o(1)$, after $T\asymp{\rm log}\left({ne_{0}^{2}\over s{\rm log}(d\vee n)}\right)$ iterations, with probability at least $1-(d\vee n)^{-C}$,

e_{T}\lesssim\sqrt{s}\left(\sqrt{{s{\rm log}(d\vee n)\over n}}\left\|\bar{\bm{\theta}}-\bm{\theta}^{*}\right\|_{2}+\|\bar{\bm{\theta}}-\bm{\theta}^{*}\|_{2}^{2}+\sqrt{{{\rm log}(d\vee n)\over N}}\right). (10)

The proof is completed.

Proof of Theorem 1

In fact, the proof of Theorem 1 is essentially the same as the proof of Theorem 2. Our goal is to provide an upper bound for

e_{t+1}\doteq\left\|\bm{\theta}^{t+1}-\bm{\theta}^{*}\right\|_{2}\leq\left(1+\sqrt{{\left|I^{t+1}\right|-s\over\left|I^{t+1}\right|-s_{0}}}\right)\left\|\left(\bm{\theta}^{t}-\eta\nabla\widehat{\mathcal{L}}^{t}\right)_{I^{t+1}}-\bm{\theta}^{*}_{I^{t+1}}\right\|_{2}.

We have

\left\|\left({\bm{\theta}}^{t}-\eta\nabla\widehat{\mathcal{L}}^{t}\right)_{I^{t+1}}-\bm{\theta}^{*}_{I^{t+1}}\right\|_{2}
\leq\left\|{\bm{\theta}}_{I^{t+1}}^{t}-\bm{\theta}^{*}_{I^{t+1}}-\eta E\left(\nabla\widehat{\mathcal{L}}^{t}-\nabla\widehat{\mathcal{L}}\left(\bm{\theta}^{*}\right)\right)_{I^{t+1}}\right\|_{2}+\eta\left\|E\left(\nabla\widehat{\mathcal{L}}^{t}-\nabla\widehat{\mathcal{L}}\left(\bm{\theta}^{*}\right)\right)_{I^{t+1}}-\left(\nabla\widehat{\mathcal{L}}^{t}-\nabla\widehat{\mathcal{L}}\left(\bm{\theta}^{*}\right)\right)_{I^{t+1}}\right\|_{2}+\eta\left\|\left(\nabla\widehat{\mathcal{L}}\left(\bm{\theta}^{*}\right)\right)_{I^{t+1}}\right\|_{2}
\doteq\|P_{1}\|_{2}+\eta\|P_{2}\|_{2}+\eta\|P_{3}\|_{2}.

For $P_{1}$, by Assumption 2 and Taylor's expansion, we can obtain that

\left\|P_{1}\right\|_{2}\leq\max\left\{\left|1-\eta C_{1}\right|,\left|1-\eta C_{2}\right|\right\}e_{t}.

For $P_{2}$, similar to (9), with probability at least $1-(d\vee n)^{-C}$, we have

\left\|P_{2}\right\|_{2}\lesssim s\sqrt{{{\rm log}(d\vee n)\over n}}e_{t}+\sqrt{s}e_{t}^{2}.

For $P_{3}$, as shown in the proof of Lemma 3, $\left\|\nabla\widehat{\mathcal{L}}\left(\bm{\theta}^{*}\right)\right\|_{2}\lesssim\sqrt{{s{\rm log}(d\vee n)\over n}}$ with probability at least $1-(d\vee n)^{-C}$.

Combining the bounds and choosing $\eta={2\over C_{2}+C_{1}}$, we get

e_{t+1}\leq\left(\left(1+\sqrt{{s_{0}\over s}}\right){C_{2}-C_{1}\over C_{2}+C_{1}}+Cs\sqrt{{{\rm log}(d\vee n)\over n}}\right)e_{t}+C\sqrt{s}e_{t}^{2}+C\sqrt{{s{\rm log}(d\vee n)\over n}},

with probability at least $1-(d\vee n)^{-C}$.

Thus, if $s\sqrt{{{\rm log}(d\vee n)\over n}}=o(1)$ and $\sqrt{s}e_{0}=o(1)$, after $T_{1}\asymp{\rm log}\left({n\left\|\bm{\theta}^{0}-\bm{\theta}^{*}\right\|^{2}_{2}\over s{\rm log}(d\vee n)}\right)$ iterations, with probability at least $1-(d\vee n)^{-C}$, we have

e_{T_{1}}\lesssim\sqrt{{s{\rm log}(d\vee n)\over n}}.

Proof of Theorem 3.

Theorem 2 implies

\left\|\bm{\theta}_{dis}^{1}-\bm{\theta}^{*}\right\|_{2}\lesssim s\sqrt{{{\rm log}(d\vee n)\over n}}\left\|{\bm{\theta}_{dis}^{0}}-\bm{\theta}^{*}\right\|_{2}+\sqrt{s}\left\|{\bm{\theta}_{dis}^{0}}-\bm{\theta}^{*}\right\|_{2}^{2}+\sqrt{{s{\rm log}(d\vee n)\over N}}.

This implies that after $Q\asymp{{\rm log}N\over{\rm log}n}$ iterations of the outer loop (note that if $n\geq N^{c}$ for some constant $c\in(0,1)$, $Q$ is a constant) we obtain

\left\|\bm{\theta}_{dis}^{Q}-\bm{\theta}^{*}\right\|_{2}\lesssim\sqrt{{s{\rm log}(d\vee n)\over N}}.

Technical Lemmas

Lemma 1.

Given $\bm{\theta}=(\bm{\delta},\gamma)$ satisfying $\|\bm{\theta}-\bm{\theta}^{*}\|_{0}\leq(2s+s_{0}+1)$ and $\gamma\geq C^{*}$, under Assumptions 1, 2 and 3, with probability at least $1-(d\vee n)^{-C}$,

\left\|\left(\nabla^{2}\mathcal{L}_{1}\left({\bm{\theta}}\right)-\nabla^{2}\mathcal{L}_{1}\left(\bm{\theta}^{*}\right)\right)(\bm{\theta}-\bm{\theta}^{*})\right\|_{\infty}\lesssim\|{\bm{\theta}}-\bm{\theta}^{*}\|_{2}^{2}.

Proof. Denote $\bm{\theta}-\bm{\theta}^{*}=\bm{\Delta}=\left(\bm{\Delta}_{\bm{\delta}},\bm{\Delta}_{\gamma}\right)$.

• For $j=1,\dots,d+1$,

\left[\left(\nabla^{2}\mathcal{L}_{1}\left(\bm{\theta}^{*}+\bm{\Delta}\right)-\nabla^{2}\mathcal{L}_{1}\left(\bm{\theta}^{*}\right)\right)\bm{\Delta}\right]_{j}={1\over n}\sum_{i=1}^{n}\left(1-d_{i}\right)x_{ij}\mathbf{x}_{i}^{\prime}\bm{\Delta}_{\bm{\delta}}\left(h\left(\mathbf{x}_{i}^{\prime}\bm{\delta}^{*}\right)-h\left(\mathbf{x}_{i}^{\prime}\bm{\delta}\right)\right)\lesssim{1\over n}\sum_{i}|x_{ij}|({\bf x}_{i}^{\prime}\bm{\Delta}_{\bm{\delta}})^{2}\lesssim\|\bm{\Delta}_{\bm{\delta}}\|_{2}^{2},

where we use $|h^{\prime}(a)|<4.3$ for all $a\in\mathbb{R}$ (Jacobson and Zou (2024)).

• For $j=d+2$,

\left[\left(\nabla^{2}\mathcal{L}_{1}\left(\bm{\theta}^{*}+\bm{\Delta}\right)-\nabla^{2}\mathcal{L}_{1}\left(\bm{\theta}^{*}\right)\right)\bm{\Delta}\right]_{j}={1\over n}\sum_{i=1}^{n}d_{i}\left(\gamma^{-2}-\left(\gamma^{*}\right)^{-2}\right)\bm{\Delta}_{\gamma}\leq{2\over\left(C^{*}\right)^{3}}\bm{\Delta}_{\gamma}^{2},

where the last inequality holds because $\gamma\geq C^{*}$.

Thus, with probability at least $1-(d\vee n)^{-C}$,

\left\|\left(\nabla^{2}\mathcal{L}_{1}(\bm{\theta})-\nabla^{2}\mathcal{L}_{1}\left(\bm{\theta}^{*}\right)\right)\bm{\Delta}\right\|_{\infty}\lesssim\|\bm{\Delta}\|_{2}^{2}.
Lemma 2.

For $\bm{\theta}=(\bm{\delta},\gamma)$ satisfying $\|\bm{\theta}-\bm{\theta}^{*}\|_{0}\leq(2s+s_{0}+1)$ and $\gamma\geq C^{*}$, under Assumptions 1, 2, and 3, with probability at least $1-(d\vee n)^{-C}$, we have

\|\nabla{\cal L}_{1}(\bm{\theta})-\nabla{\cal L}_{1}(\bm{\theta}^{*})-E[\nabla{\cal L}_{1}(\bm{\theta})-\nabla{\cal L}_{1}(\bm{\theta}^{*})]\|_{\infty}\leq C\sqrt{{s{\rm log}(d\vee n)\over n}}\|\bm{\theta}-\bm{\theta}^{*}\|_{2}+C\|\bm{\theta}-\bm{\theta}^{*}\|^{2}_{2}.

Proof. We have, by Taylor’s expansion,

\nabla\mathcal{L}_{1}({\bm{\theta}})-\nabla\mathcal{L}_{1}\left(\bm{\theta}^{*}\right)=\nabla^{2}\mathcal{L}_{1}\left(\bm{\theta}^{*}\right)\left({\bm{\theta}}-\bm{\theta}^{*}\right)+\int_{0}^{1}\left(\nabla^{2}\mathcal{L}_{1}\left(\bm{\theta}^{*}+u\left({\bm{\theta}}-\bm{\theta}^{*}\right)\right)-\nabla^{2}\mathcal{L}_{1}\left(\bm{\theta}^{*}\right)\right)\left({\bm{\theta}}-\bm{\theta}^{*}\right)du.

By Lemma 1, the second term above, as well as its expectation (which can be shown following exactly the same lines), is bounded by $C\|\bm{\theta}-\bm{\theta}^{*}\|_{2}^{2}$. For the first term above, we first derive a bound for

\left\|\nabla^{2}\mathcal{L}_{1}\left(\bm{\theta}^{*}\right)-\nabla^{2}\mathcal{L}^{*}\left(\bm{\theta}^{*}\right)\right\|_{\max}, (11)

where we denote $\|H\|_{\max}\doteq\max|H_{ij}|$ for a matrix $H$ and we recall that by our notation $\mathcal{L}^{*}(\bm{\theta}^{*})=E[\mathcal{L}_{1}(\bm{\theta}^{*})]$. We analyze each entry of the empirical Hessian matrix $\nabla^{2}{\cal L}_{1}(\bm{\theta}^{*})$.

• ${\partial^{2}\over\partial\bm{\delta}_{j}\partial\bm{\delta}_{k}}\mathcal{L}_{1}(\bm{\theta})={1\over n}\sum_{i=1}^{n}x_{ij}x_{ik}\left[d_{i}+\left(1-d_{i}\right)h\left(-\mathbf{x}_{i}^{\prime}\bm{\delta}\right)\right]$. Since $0<h(s)<1$ for all $s\in\mathbb{R}$, we see that

|x_{ij}x_{ik}\left[d_{i}+\left(1-d_{i}\right)h\left(-\mathbf{x}_{i}^{\prime}\bm{\delta}\right)\right]|\leq|x_{ij}x_{ik}|.

Thus, $x_{ij}x_{ik}\left[d_{i}+\left(1-d_{i}\right)h\left(-\mathbf{x}_{i}^{\prime}\bm{\delta}\right)\right]$ is sub-exponential.

• ${\partial^{2}\over\partial\bm{\delta}_{j}\partial\gamma}\mathcal{L}_{1}(\bm{\theta})=-{1\over n}\sum_{i=1}^{n}d_{i}y_{i}x_{ij}$. It is easy to see that $d_{i}y_{i}x_{ij}$ is sub-exponential.

• ${\partial^{2}\over\partial\gamma^{2}}\mathcal{L}_{1}(\bm{\theta})={1\over n}\sum_{i=1}^{n}d_{i}\left(\gamma^{-2}+y_{i}^{2}\right)$, and $d_{i}\left(\gamma^{-2}+y_{i}^{2}\right)$ is again sub-exponential.

With the aid of the concentration inequality for sub-exponential variables, (11) is bounded by $C\sqrt{{{\rm log}(d\vee n)\over n}}$ with probability at least $1-(d\vee n)^{-C}$. Consequently,

\left\|(\nabla^{2}\mathcal{L}_{1}\left(\bm{\theta}^{*}\right)-\nabla^{2}\mathcal{L}^{*}\left(\bm{\theta}^{*}\right))({\bm{\theta}}-\bm{\theta}^{*})\right\|_{\infty}\leq C\sqrt{{{\rm log}(d\vee n)\over n}}\|{\bm{\theta}}-\bm{\theta}^{*}\|_{1}\leq C\sqrt{{s{\rm log}(d\vee n)\over n}}\|{\bm{\theta}}-\bm{\theta}^{*}\|_{2}.

This completes the proof.

Lemma 3.

Given $\bar{\bm{\theta}}=(\bar{\bm{\delta}},\bar{\gamma})$ satisfying $\|\bar{\bm{\theta}}-\bm{\theta}^{*}\|_{0}\leq(2s+s_{0}+1)$ and $\bar{\gamma}\geq C^{*}$, under Assumptions 1, 2, and 3, with probability at least $1-(d\vee n)^{-C}$, we have

\left\|\nabla\widetilde{\mathcal{L}}\left(\bm{\theta}^{*}\right)\right\|_{\infty}\lesssim\sqrt{{s{\rm log}(d\vee n)\over n}}\left\|\overline{\bm{\theta}}-\bm{\theta}^{*}\right\|_{2}+\|\overline{\bm{\theta}}-\bm{\theta}^{*}\|_{2}^{2}+\sqrt{{{\rm log}(d\vee n)\over N}}.

Proof.

\nabla\tilde{\mathcal{L}}\left(\bm{\theta}^{*}\right)=\nabla\mathcal{L}_{1}\left(\bm{\theta}^{*}\right)-\nabla\mathcal{L}_{1}(\bar{\bm{\theta}})+\nabla\mathcal{L}_{N}(\bar{\bm{\theta}})=\left(\nabla\mathcal{L}_{N}(\bar{\bm{\theta}})-\nabla\mathcal{L}_{N}\left(\bm{\theta}^{*}\right)\right)-\left(\nabla\mathcal{L}_{1}(\bar{\bm{\theta}})-\nabla\mathcal{L}_{1}\left(\bm{\theta}^{*}\right)\right)+\nabla\mathcal{L}_{N}\left(\bm{\theta}^{*}\right).

By Lemma 2, with probability at least $1-(d\vee n)^{-C}$, we have

\left\|\nabla\widetilde{\mathcal{L}}\left(\bm{\theta}^{*}\right)\right\|_{\infty}\lesssim\sqrt{{s{\rm log}(d\vee n)\over n}}\left\|\overline{\bm{\theta}}-\bm{\theta}^{*}\right\|_{2}+\|\overline{\bm{\theta}}-\bm{\theta}^{*}\|_{2}^{2}+\left\|\nabla\mathcal{L}_{N}\left(\bm{\theta}^{*}\right)\right\|_{\infty}.

Recall that $0<|g^{\prime}(s)|<1$, and thus $|g(-\mathbf{x}^{\prime}\bm{\delta})-g(0)|\leq|{\bf x}^{\prime}\bm{\delta}|$, implying $g(-\mathbf{x}^{\prime}\bm{\delta})$ is sub-Gaussian. Thus, using the same arguments as in the proof of Lemma 2, it is easy to see, based on the expression (6) for the gradient, that the components of $\nabla\mathcal{L}_{N}(\bm{\theta}^{*})$ are all sub-exponential. Thus $\left\|\nabla\mathcal{L}_{N}\left(\bm{\theta}^{*}\right)\right\|_{\infty}\lesssim\sqrt{{{\rm log}(d\vee n)\over N}}$ holds with probability at least $1-(d\vee n)^{-C}$.

References

  • Belloni and Chernozhukov (2011) Belloni, A., and Chernozhukov, V. (2011), “l1-penalized quantile regression in high-dimensional sparse models,” The Annals of Statistics, 39(1), 82–130.
  • Blumensath and Davies (2009) Blumensath, T., and Davies, M. E. (2009), “Iterative hard thresholding for compressed sensing,” Applied and Computational Harmonic Analysis, 27(3), 265–274.
  • Buckley and James (1979) Buckley, J., and James, I. (1979), “Linear regression with censored data,” Biometrika, 66(3), 429–436.
  • Fan and Li (2001) Fan, J., and Li, R. (2001), “Variable selection via nonconcave penalized likelihood and its oracle properties,” Journal of the American Statistical Association, 96(456), 1348–1360.
  • Foucart (2011) Foucart, S. (2011), “Hard thresholding pursuit: an algorithm for compressive sensing,” SIAM Journal on Numerical Analysis, 49(6), 2543–2563.
  • Garg and Khandekar (2009) Garg, R., and Khandekar, R. (2009), “Gradient descent with sparsification: an iterative algorithm for sparse recovery with restricted isometry property,” in Proceedings of the 26th Annual International Conference on Machine Learning, pp. 337–344.
  • Jacobson and Zou (2024) Jacobson, T., and Zou, H. (2024), “High-dimensional censored regression via the penalized Tobit likelihood,” Journal of Business & Economic Statistics, 42(1), 286–297.
  • Jain et al. (2014) Jain, P., Tewari, A., and Kar, P. (2014), “On iterative hard thresholding methods for high-dimensional M-estimation,” in Advances in Neural Information Processing Systems.
  • Ji et al. (2023) Ji, Y., Scutari, G., Sun, Y., and Honnappa, H. (2023), “Distributed sparse regression via penalization,” Journal of Machine Learning Research, 24(272), 1–62.
  • Jordan et al. (2019) Jordan, M. I., Lee, J. D., and Yang, Y. (2019), “Communication-efficient distributed statistical inference,” Journal of the American Statistical Association, 114(526), 668–681.
  • Lee et al. (2017) Lee, J. D., Liu, Q., Sun, Y., and Taylor, J. E. (2017), “Communication-efficient sparse regression,” Journal of Machine Learning Research, 18(5), 1–30.
  • Müller and van de Geer (2016) Müller, P., and van de Geer, S. (2016), “Censored linear model in high dimensions,” TEST, 25, 75–92.
  • Nedic and Ozdaglar (2009) Nedic, A., and Ozdaglar, A. (2009), “Distributed subgradient methods for multi-agent optimization,” IEEE Transactions on Automatic Control, 54(1), 48–61.
  • Olsen (1978) Olsen, R. J. (1978), “Note on the uniqueness of the maximum likelihood estimator for the Tobit model,” Econometrica, 46(5), 1211–1215.
  • Sampford (1953) Sampford, M. R. (1953), “Some inequalities on Mill’s ratio and related functions,” The Annals of Mathematical Statistics, 24(1), 130 – 132.
  • Soret et al. (2018) Soret, P., Avalos, M., Wittkop, L., Commenges, D., and Thiébaut, R. (2018), “Lasso regularization for left-censored Gaussian outcome and high-dimensional predictors,” BMC Medical Research Methodology, 18(1), 159.
  • Tobin (1958) Tobin, J. (1958), “Estimation of relationships for limited dependent variables,” Econometrica, 26(1), 24–36.
  • Wang et al. (2023) Wang, Y., Lu, W., and Lian, H. (2023), “Best subset selection for high-dimensional non-smooth models using iterative hard thresholding,” Information Sciences, 625, 36–48.
  • Zhang et al. (2013) Zhang, Y., Duchi, J. C., and Wainwright, M. J. (2013), “Communication-efficient algorithms for statistical optimization,” Journal of Machine Learning Research, 14(68), 3321–3363.
  • Zhou and Liu (2016) Zhou, X., and Liu, G. (2016), “LAD-Lasso variable selection for doubly censored median regression models,” Communications in Statistics - Theory and Methods, 45(12), 3658–3667.