Truncated Linear Regression in High Dimensions
Abstract
As in standard linear regression, in truncated linear regression we are given access to observations $(x_i, y_i)$ whose dependent variable equals $y_i = w^{*\top} x_i + \varepsilon_i$, where $w^*$ is some fixed unknown vector of interest and $\varepsilon_i$ is independent noise; except we are only given an observation if its dependent variable $y_i$ lies in some “truncation set” $S \subseteq \mathbb{R}$. The goal is to recover $w^*$ under some favorable conditions on the $x_i$'s and the noise distribution. We prove that there exists a computationally and statistically efficient method for recovering $k$-sparse $d$-dimensional vectors $w^*$ from $n$ truncated samples, which attains an optimal $\ell_2$ reconstruction error of $O(\sqrt{(k \log d)/n})$. As a corollary, our guarantees imply a computationally efficient and information-theoretically optimal algorithm for compressed sensing with truncation, which may arise from measurement saturation effects. Our result follows from a statistical and computational analysis of the Stochastic Gradient Descent (SGD) algorithm for solving a natural adaptation of the LASSO optimization problem that accommodates truncation. This generalizes the works of both: (1) Daskalakis et al. [9], where no regularization is needed due to the low dimensionality of the data, and (2) Wainwright [26], where the objective function is simple due to the absence of truncation. In order to deal with both truncation and high dimensionality at the same time, we develop new techniques that not only generalize the existing ones but, we believe, are of independent interest.
1 Introduction
In the vanilla linear regression setting, we are given observations of the form $y_i = w^{*\top} x_i + \varepsilon_i$, where $x_i \in \mathbb{R}^d$, $w^* \in \mathbb{R}^d$ is some unknown coefficient vector that we wish to recover, and $\varepsilon_i$ is random noise, independent and identically distributed across different observations. Under favorable conditions on the $x_i$'s and the distribution of the noise, it is well known that $w^*$ can be recovered to within $\ell_2$-reconstruction error $O(\sqrt{d/n})$ from $n$ observations.
The classical model and its associated guarantees might, however, be inadequate to address many situations which frequently arise in both theory and practice. We focus on two common and widely studied deviations from the standard model. First, it is often the case that $n \ll d$, i.e. the number of observations is much smaller than the dimension of the unknown vector $w^*$. In this “under-determined” regime, it is fairly clear that it is impossible to expect a non-trivial reconstruction of the underlying $w^*$, since there are infinitely many vectors $w$ satisfying $w^\top x_i = w^{*\top} x_i$ for all $i$. To sidestep this impossibility, we must exploit additional structural properties that we might know $w^*$ satisfies. One such property might be sparsity, i.e. that $w^*$ has at most $k$ non-zero coordinates. Linear regression under sparsity assumptions has been widely studied, motivated by applications such as model selection and compressed sensing; see e.g. the celebrated works of [23, 6, 11, 26] on this topic. It is known, in particular, that a $k$-sparse $w^*$ can be recovered to within error $O(\sqrt{(k \log d)/n})$ when the $x_i$'s are drawn from the standard multivariate Normal, or satisfy other favorable conditions [26]. The recovery algorithm solves a least squares optimization problem with $\ell_1$ regularization, i.e. what is called LASSO optimization in Statistics, in order to reward sparsity.
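For concreteness, the LASSO program alluded to here has the generic form below; the regularization weight $\lambda > 0$ is stated generically, and the specific tuning used in [23, 26] is more refined:

$$\hat{w} \in \arg\min_{w \in \mathbb{R}^d} \; \frac{1}{2n} \sum_{i=1}^n \left(y_i - w^\top x_i\right)^2 + \lambda \|w\|_1 .$$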
Another common deviation from the standard model is the presence of truncation. Truncation occurs when the sample $(x_i, y_i)$ is not observed whenever $y_i$ falls outside of a subset $S \subseteq \mathbb{R}$. Truncation arises quite often in practice as a result of saturation of measurement devices, bad data collection practices, incorrect experimental design, and legal or privacy constraints which might preclude the use of some of the data. Truncation is known to affect linear regression in counter-intuitive ways, as illustrated in Fig. 1,

where the linear fits obtained via least squares regression before and after truncation of the data based on the value of the response variable are also shown. More broadly, it is well understood that naive statistical inference using truncated data commonly leads to bias. Accordingly, a long line of research in Statistics and Econometrics has strived to develop regression methods that are robust to truncation [24, 1, 15, 18, 16, 5, 14]. This line of work falls into the broader field of Truncated Statistics [22, 8, 2], which finds its roots in the early works of [3], [13], [19, 20], and [12]. Despite this voluminous work, computationally and statistically efficient methods for truncated linear regression have only recently been obtained in [9], where it was shown that, under favorable assumptions about the $x_i$'s and the truncation set $S$, and assuming the $\varepsilon_i$'s are drawn from a Gaussian, the negative log-likelihood of the truncated sample can be optimized efficiently and approximately recovers the true parameter vector with an $\ell_2$ reconstruction error of $O(\sqrt{d/n})$.
Our contribution.
In this work, we solve the general problem, addressing both of the aforedescribed challenges together. Namely, we provide efficient algorithms for the high-dimensional ($n \ll d$) truncated linear regression problem. This problem arises frequently, including in compressed sensing applications with measurement saturation, as studied e.g. in [10, 17].
Under standard conditions on the design matrix and the noise distribution (namely, that the $x_i$'s and $\varepsilon_i$'s are sampled from independent Gaussian distributions before truncation), and under mild assumptions on the truncation set (roughly, that it permits a constant fraction of the samples to survive truncation), we show that the SGD algorithm on the truncated LASSO optimization program, our proposed adaptation of the standard LASSO optimization to accommodate truncation, is a computationally and statistically efficient method for recovering $w^*$, attaining an optimal reconstruction error of $O(\sqrt{(k \log d)/n})$, where $k$ is the sparsity of $w^*$.
1.1 Overview of proofs and techniques
The problem that we solve in this paper encompasses the two difficulties of the problems considered in: (1) Wainwright [26], which tackles the problem of high-dimensional sparse linear regression with Gaussian noise, and (2) Daskalakis et al. [9], which tackles the problem of truncated linear regression. The tools developed in those papers do not suffice to solve our problem, since each difficulty interferes with the other. Hence, we introduce new ideas and develop new interesting tools that allow us to bridge the gap between [26] and [9]. We begin our overview in this section with a brief description of the approaches of [26, 9] and subsequently outline the additional challenges that arise in our setting, and how we address them.
Wainwright [26] uses as an estimator the solution of the $\ell_1$-regularized least squares program, also called LASSO, to handle the high dimensionality of the data. Wainwright then uses a primal-dual witness method to bound the number of samples that are needed in order for the solution of the LASSO program to be close to the true coefficient vector $w^*$. The computational task is not discussed in detail in Wainwright [26], since the objective function of the LASSO program is very simple and standard convex optimization tools can be used.
Daskalakis et al. [9] use as their estimator the solution to the log-likelihood maximization problem. In contrast to [26], their convex optimization problem takes a very complicated form due to the presence of truncation, which introduces an intractable log-partition function term in the log-likelihood. The main idea of Daskalakis et al. [9] to overcome this difficulty is identifying a convex set such that: (1) it contains the true coefficient vector $w^*$, (2) their objective function is strongly convex inside this set, (3) inside this set there exists an efficient rejection sampling algorithm to compute unbiased estimates of the gradients of their objective function, (4) the norm of the stochastic gradients inside this set is bounded, and (5) projecting onto this set is efficient. These five properties are essentially what they need to prove that SGD with this projection set converges quickly to a good estimate of $w^*$.
Our reconstruction algorithm is inspired by both [26] and [9]. We formulate our optimization program as the $\ell_1$-regularized version of the negative log-likelihood function in the truncated setting, which we call the truncated LASSO. In particular, our objective contains an intractable log-partition function term. Our proof then consists of two parts. First, we show statistical recovery, i.e. we upper bound the number of samples that are needed for the solution of the truncated LASSO program to be close to the true coefficient vector $w^*$. Second, we show that this optimization problem can be solved efficiently. The cornerstones of our proof are the two seminal approaches that we mentioned: the primal-dual witness method for statistical recovery in high dimensions, and the projected SGD method for efficient maximum likelihood estimation in the presence of truncation. Unfortunately, these two techniques are not a priori compatible with each other.
Roughly speaking, the technique of [9] relies heavily on the very carefully chosen projection set to which the SGD is restricted, as we explained above. This projection set cannot be used in high dimensions because it effectively requires knowing the low-dimensional subspace in which the true solution lies. The projection set was the key to the aforementioned nice properties: strong convexity, efficient gradient estimation, and bounded gradient. In its absence, we need to deal with each of these issues individually. The primal-dual witness method of [26] cannot be applied directly in our setting either. In our case the gradient of the truncated LASSO does not have a nice closed form, and hence finding the correct way to construct the primal-dual witness requires a more delicate argument. Our proof manages to overcome all of these issues. In a nutshell, the architecture of our full proof is the following.
1. Optimality on the low-dimensional subspace. The first thing that we need to prove is that the optimum of the truncated LASSO program, when restricted to the low-dimensional subspace defined by the non-zero coordinates of $w^*$, is close to the true solution. This step of the proof was unnecessary in [9] due to the lack of regularization in their objective, and was trivial in [26] due to the simple loss function, i.e. the $\ell_1$-regularized least squares.
2. Optimality on the high-dimensional space. We prove that the optimum of the truncated LASSO program in the low-dimensional subspace is also optimal for the whole space. This step is done using the primal-dual witness method as in [26]. However, in our case the expression of the gradient is much more complicated due to the very convoluted objective function. Hence, we find a more general way to prove this step that does not rely on the exact expression of the gradient.
These two steps of the proof suffice to upper bound the number of samples that we need to recover the coefficient vector via the truncated LASSO program. Next, we provide a computationally efficient method to solve the truncated LASSO program.
3. Initialization of SGD. The first step of our algorithm to solve the truncated LASSO is finding a good initial point for the SGD. This was unnecessary in [26] due to the simple objective and in [9] due to the existence of the projection set (where efficient projection onto it immediately gave an initial point). We propose the simple answer of bootstrapping: start with the solution of the $\ell_1$-regularized ordinary least squares program. This is a biased estimate, but we show it is good enough for initialization.
4. Projection of SGD. Next, we need to choose a projection set to make sure that Projected SGD (PSGD) converges. The projection set chosen in [9] is not helpful in our case unless we know a priori the set of non-zero coordinates of $w^*$. Hence, we define a different, simpler set which admits efficient projection algorithms. As a necessary side effect, in contrast to [9], our set cannot guarantee many of the important properties that we need to prove fast convergence of SGD.
5. Lack of strong convexity and gradient estimation. Our different projection set cannot guarantee the strong convexity and efficient gradient estimation enjoyed in [9]. There are two problems here:
First, PSGD is guaranteed to converge to a point with small loss; but since strong convexity fails in high dimensions, a small loss does not immediately imply that this point is close to the optimum. We provide a workaround to resolve this issue that can be applied to other regularized programs with stochastic access to the gradient function.
Second, computing unbiased estimates of the gradient is now difficult. The prior work employed rejection sampling, but in our setting this may take exponential time. For this reason we provide a more explicit method for estimating the gradient much faster, whenever the truncation set is reasonably well-behaved.
An important tool that we leverage repeatedly in our analysis and we have not mentioned above is a strong isometry property for our measurement matrix, which has truncated Gaussian rows. Similar properties have been explored in the compressed sensing literature for matrices with i.i.d. Gaussian and sub-Gaussian entries [25].
We refer to Section 5 for a more detailed overview of the proofs of our main results.
2 High-dimensional truncated linear regression model
Notation.
Let refer to a standard normal random variable. For and measurable , let refer to the truncated normal . Let . Let . Additionally, for let refer to (or if is a row vector), and let refer to . For a matrix , let be the random vector with . For sets and , let refer to the submatrix . For we treat the row as a row vector. In a slight abuse of notation, we will often write (or sometimes, ); this will always mean . By we mean ). For , define to be the set of indices such that .
2.1 Model
Let $w^* \in \mathbb{R}^d$ be the unknown parameter vector which we are trying to recover. We assume that it is $k$-sparse; that is, its support $\{j : w^*_j \neq 0\}$ has cardinality at most $k$. Let $S$ be a measurable subset of the real line. The main focus of this paper is the setting of Gaussian noise: we assume that we are given $n$ truncated samples generated by the following process:
1. Pick $x \in \mathbb{R}^d$ according to the standard normal distribution $N(0, I_{d \times d})$.
2. Sample $\varepsilon \sim N(0, 1)$ and compute $y$ as
$$y = w^{*\top} x + \varepsilon. \qquad (1)$$
3. If $y \in S$, then return the sample $(x, y)$. Otherwise restart the process from step 1.
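For intuition, the following sketch simulates this data-generating process when $S$ is given as a union of intervals; the helper names and the specific parameter values are illustrative assumptions, not part of the formal model.

```python
import numpy as np

def in_union_of_intervals(y, intervals):
    """Check membership of y in S, where S is given as a list of (a, b) intervals."""
    return any(a <= y <= b for a, b in intervals)

def sample_truncated_regression(w_star, n, intervals, rng):
    """Draw n samples (x, y) from Process 1: keep (x, y) only if y lands in S."""
    d = len(w_star)
    X, ys = [], []
    while len(ys) < n:
        x = rng.standard_normal(d)                 # step 1: x ~ N(0, I_d)
        y = w_star @ x + rng.standard_normal()     # step 2: y = <w*, x> + eps
        if in_union_of_intervals(y, intervals):    # step 3: keep only if y in S
            X.append(x)
            ys.append(y)
    return np.array(X), np.array(ys)

# illustrative usage: a 3-sparse vector in d = 50 dimensions, S = [0, inf)
rng = np.random.default_rng(0)
w_star = np.zeros(50)
w_star[:3] = 1.0
X, y = sample_truncated_regression(w_star, n=200, intervals=[(0.0, np.inf)], rng=rng)
```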
We also briefly discuss the setting of arbitrary noise, in which $\varepsilon$ may be arbitrary and we are interested in approximations to $w^*$ whose guarantees are bounded in terms of the magnitude of the noise.
Together, the $n$ samples define a pair $(X, y)$, where $X \in \mathbb{R}^{n \times d}$ is the matrix whose rows are the $x_i$'s and $y \in \mathbb{R}^n$ is the vector of responses. We make the following assumptions about the set $S$.
Assumption I (Constant Survival Probability).
Taking expectation over vectors $x \sim N(0, I_{d \times d})$, we have $\mathbb{E}_x\left[\Pr_{\varepsilon \sim N(0,1)}\left[w^{*\top} x + \varepsilon \in S\right]\right] \geq \alpha$ for a constant $\alpha > 0$.
Assumption II (Efficient Sampling).
There is an algorithm which takes as input $\mu \in \mathbb{R}$ and produces an unbiased sample from the normal distribution $N(\mu, 1)$ conditioned on the set $S$; we denote its running time by $T$.
We do not require that $T$ is a constant, but it will affect the efficiency of our algorithm. To be precise, our algorithm will make polynomially many queries to the sampling algorithm. As we explain in Lemma K.4 in Section K, if the set $S$ is a union of intervals, then Assumption II is satisfied with polynomial running time. We state the theorems below under the assumption that $S$ is a union of intervals, in which case the algorithms have polynomial running time, but all the statements below can be restated under the more general Assumption II, with the running time scaling accordingly with $T$.
3 Statistically and computationally efficient recovery
In this section we formally state our main results for recovery of a sparse high-dimensional coefficient vector from truncated linear regression samples. In Section 3.1, we present our result under the standard assumption that the error distribution is Gaussian, whereas in Section 3.2 we present our results for the case of adversarial error.
3.1 Gaussian noise
In the setting of Gaussian noise (before truncation), we prove the following theorem.
Theorem 3.1.
From now on, we will use the term “with high probability” when the rate of decay is not of importance. This phrase means “with probability $1 - o(1)$” as the number of samples $n$ grows.
Observe that even without the added difficulty of truncation (e.g. if $S = \mathbb{R}$), sparse linear regression requires $\Omega(k \log(d/k))$ samples by known information-theoretic arguments [26]. Thus, our sample complexity is information-theoretically optimal.
In one sentence, the algorithm optimizes the $\ell_1$-regularized sample negative log-likelihood via projected SGD. The negative log-likelihood of $w$ for a single sample $(x, y)$ is, up to an additive constant that does not depend on $w$,
$$\mathrm{nll}(w; x, y) = \frac{1}{2}\left(y - w^\top x\right)^2 + \log \int_S \exp\!\left(-\tfrac{1}{2}\left(z - w^\top x\right)^2\right) dz.$$
Given $n$ samples $(x_1, y_1), \dots, (x_n, y_n)$, we can then write $\overline{\mathrm{nll}}(w) = \frac{1}{n} \sum_{i=1}^n \mathrm{nll}(w; x_i, y_i)$. We also define the regularized negative log-likelihood by adding an $\ell_1$ penalty $\lambda \|w\|_1$. We claim that optimizing the following program approximately recovers the true parameter vector $w^*$ with high probability, for sufficiently many samples and an appropriate regularization constant $\lambda$:
$$\min_{w \in \mathbb{R}^d} \; \overline{\mathrm{nll}}(w) + \lambda \|w\|_1 \qquad\qquad (2)$$
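To make the objective of Program (2) concrete, the following sketch evaluates it numerically when $S$ is a union of intervals, in which case the log-partition term reduces to a sum of Gaussian CDF differences. The function names are ours and the evaluation is only for intuition (it differs from the formula above by an additive constant independent of $w$); it is not the optimization routine analyzed in the paper.

```python
import numpy as np
from scipy.stats import norm

def log_survival_mass(mu, intervals):
    """log of the N(mu, 1) mass of S, for S a union of (a, b) intervals."""
    mass = sum(norm.cdf(b, loc=mu) - norm.cdf(a, loc=mu) for a, b in intervals)
    return np.log(mass)

def truncated_lasso_objective(w, X, y, intervals, lam):
    """(1/n) sum_i [ (y_i - <w, x_i>)^2 / 2 + log Z(<w, x_i>) ] + lam * ||w||_1,
    up to an additive constant independent of w."""
    mu = X @ w
    nll = 0.5 * (y - mu) ** 2 + np.array([log_survival_mass(m, intervals) for m in mu])
    return nll.mean() + lam * np.abs(w).sum()
```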
The first step is to show that any solution to Program (2) will be near the true solution $w^*$. To this end, we prove the following theorem, which already shows that $O(k \log d)$ samples are sufficient to solve the problem of statistical recovery of $w^*$:
Proposition 3.2.
Suppose that Assumption I holds. There are constants $c_1$, $c_2$, and $c_3$ with the following property. (In the entirety of this paper, constants may depend on the constant of Assumption I.) Suppose that $n \geq c_1 k \log d$, and let $(X, y)$ be samples drawn from Process 1. Let $\hat{w}$ be any optimal solution to Program (2) with regularization constant $\lambda = c_2 \sqrt{\log d / n}$. Then $\|\hat{w} - w^*\|_2 \leq c_3 \sqrt{(k \log d)/n}$ with high probability.
Then it remains to show that Program (2) can be solved efficiently.
Proposition 3.3.
Suppose that Assumption I holds, let $(X, y)$ be $n$ samples drawn from Process 1, and let $\hat{w}$ be any optimal solution to Program (2). There exists a constant $c$ such that if $n \geq c\, k \log d$, then there is an algorithm which outputs a point $\tilde{w}$ satisfying $\|\tilde{w} - \hat{w}\|_2 = O(\sqrt{(k \log d)/n})$ with high probability. Furthermore, if the survival set $S$ is a union of intervals, the running time of our algorithm is polynomial in the size of the input.
We present a more detailed description of the algorithm that we use in Section 4.
3.2 Adversarial noise
In the setting of arbitrary noise, optimizing negative log-likelihood no longer makes sense, and indeed our results from the setting of Gaussian noise no longer hold. However, we may apply results from compressed sensing which describe sufficient conditions on the measurement matrix for recovery to be possible in the face of adversarial error. We obtain the following theorem:
Theorem 3.4.
Suppose that Assumption I holds and let . There are constants and such that if , , and minimizes in the region , then .
The proof is a corollary of our result that the (appropriately rescaled) measurement matrix $X$ satisfies the Restricted Isometry Property from [6] with high probability even when we only observe truncated samples; see Corollary G.6 and the subsequent discussion in Section G.
The remainder of the paper is dedicated to the case where the noise is Gaussian before truncation.
4 The efficient estimation algorithm
Define . To solve Program 2, our algorithm is Projected Stochastic Gradient Descent (PSGD) with projection set , for an appropriate constant (specified in Lemma 5.4). We pick an initial feasible point by computing
Subsequently, the algorithm performs updates, where . Define a random update to as follows. Pick uniformly at random. Sample . Then set
Finally, the algorithm outputs .
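The following sketch spells out one way this PSGD loop could look in code. The projection step is abstracted behind a `project` callback and the truncated-normal sampler behind `sample_conditional`; the gradient form $(z - y_i)\,x_i$ plus an $\ell_1$ subgradient term reflects the expectation formula discussed in Section 5.2. The function names, the fixed step size, and the averaging of iterates are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

def psgd_truncated_lasso(X, y, lam, w_init, project, sample_conditional,
                         step_size, num_steps, rng):
    """Projected SGD on the l1-regularized truncated negative log-likelihood.

    project(w)             -- Euclidean projection onto the (convex) projection set
    sample_conditional(mu) -- one sample z ~ N(mu, 1) conditioned on the survival set S
    The constant step size and iterate averaging follow the generic PSGD analysis
    (Theorem 5.3 / H.1); the exact constants in the paper may differ.
    """
    n, d = X.shape
    w = w_init.copy()
    w_sum = np.zeros(d)
    for _ in range(num_steps):
        i = rng.integers(n)                          # pick a uniformly random sample
        mu = X[i] @ w
        z = sample_conditional(mu)                   # z ~ N(mu, 1; S)
        grad = (z - y[i]) * X[i] + lam * np.sign(w)  # stochastic subgradient
        w = project(w - step_size * grad)            # gradient step + projection
        w_sum += w
    return w_sum / num_steps                         # average of the iterates
```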
5 Overview of proofs and techniques
This section outlines our techniques. The first step is proving Proposition 3.2. The second step is proving Proposition 3.3, by showing that the algorithm described in Section 4 efficiently recovers an approximate solution to Program 2.
5.1 Statistical recovery
Our approach to proving Proposition 3.2 is the Primal-Dual Witness (PDW) method introduced in [26]. Namely, we are interested in showing that the solution of Program (2) is near $w^*$ with high probability. Let us denote by $\mathrm{supp}(w^*)$ the (unknown) support of the true parameter vector, i.e. the set of its non-zero coordinates. Define the following (hypothetical) program in which the solution space is restricted to vectors with support in $\mathrm{supp}(w^*)$:
$$\min_{w \in \mathbb{R}^d:\ \mathrm{supp}(w) \subseteq \mathrm{supp}(w^*)} \; \overline{\mathrm{nll}}(w) + \lambda \|w\|_1 \qquad\qquad (3)$$
In the untruncated setting, the PDW method is to apply the subgradient optimality condition to the solution of this restricted program, which is by definition sparse. Proving that this solution satisfies subgradient optimality for the original program implies that the original program has a sparse solution and, under mild extra conditions, that it is the unique solution. Thus, the original program recovers the true basis. We use the PDW method for a slightly different purpose: we apply subgradient optimality to the restricted solution to show that it is close to $w^*$, and then use this to prove that it also solves the original program.
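For reference, the subgradient optimality (stationarity) condition invoked here is the generic one for $\ell_1$-regularized convex programs (cf. Lemma C.1(a)); we record it only to fix notation. A vector $\hat{w}$ is optimal for Program (2) if and only if there exists $\hat{z} \in \partial\|\hat{w}\|_1$ with

$$\nabla \overline{\mathrm{nll}}(\hat{w}) + \lambda \hat{z} = 0, \qquad \hat{z}_j = \mathrm{sign}(\hat{w}_j) \ \text{if } \hat{w}_j \neq 0, \quad |\hat{z}_j| \leq 1 \ \text{otherwise}.$$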
Truncation introduces its own challenges. While Program 2 is still convex [9], it is much more convoluted than ordinary least squares. In particular, the gradient and Hessian of the negative log-likelihood have the following form (see Section C for the proof).
Lemma 5.1.
For all $w \in \mathbb{R}^d$, the gradient of the negative log-likelihood is
$$\nabla \overline{\mathrm{nll}}(w) = \frac{1}{n} \sum_{i=1}^n \left( \mathbb{E}_{z \sim N(w^\top x_i, 1; S)}[z] - y_i \right) x_i,$$
where $N(\mu, 1; S)$ denotes the distribution $N(\mu, 1)$ conditioned on $S$. The Hessian is
$$\nabla^2 \overline{\mathrm{nll}}(w) = \frac{1}{n} \sum_{i=1}^n \mathrm{Var}_{z \sim N(w^\top x_i, 1; S)}[z] \; x_i x_i^\top.$$
We now state the two key facts which make the PDW method work. First, the solution to the restricted program must have a zero subgradient in all directions within the support of $w^*$. Second, if this subgradient can be extended to a valid subgradient on all of $\mathbb{R}^d$, then the restricted solution is optimal for the original program. Formally:
Lemma 5.2.
5.2 Computational recovery
For Proposition 3.3, we want to solve Program (2), i.e. minimize the $\ell_1$-regularized negative log-likelihood $\overline{\mathrm{nll}}(w) + \lambda \|w\|_1$. The gradient of $\overline{\mathrm{nll}}$ does not have a closed form, but it can be written cleanly as an expectation:
$$\nabla \overline{\mathrm{nll}}(w) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{z \sim N(w^\top x_i, 1; S)}\left[(z - y_i)\, x_i\right].$$
Let us assume that these conditional (truncated) normal distributions can be sampled efficiently. Then we may hope to optimize the objective by stochastic gradient descent. But, problematically, in our high-dimensional setting the objective is nowhere strongly convex. So while we can apply the following general result from convex optimization, it has several strings attached:
Theorem 5.3.
Let be a convex function achieving its optimum at . Let be a sequence of random vectors in . Suppose that where . Set . Then
In particular, to apply this result, we need to solve three technical problems:
1. We need to efficiently find an initial point with bounded distance from the optimum.
2. The gradient does not have bounded norm for arbitrary points. Thus, we need to pick a projection set in which the bound holds.
3. Since the objective is not strongly convex, we need to convert a bound on the suboptimality in function value into a bound on the distance to the optimum in parameter space.
As defined in Section 4, our solution is a projection set determined by an appropriate constant. To pick an initial point in this set, we solve an $\ell_1$-regularized least squares program. This estimate is biased due to the truncation, but the key point is that, by classical results from compressed sensing, it has bounded distance from $w^*$ (and therefore from the optimum of Program (2)).
The algorithm then consists of projected stochastic gradient descent with projection set . To bound the number of update steps required for the algorithm to converge to a good estimate of , we need to solve several statistical problems (which are direct consequences of assumptions in Theorem 5.3).
Properties of the optimum.
First, the point we are trying to reach must be feasible and at a bounded distance from the initial point (due to the high dimensionality, the projection set is unbounded, so this is not immediate). The following lemmas formalize this; see Sections J.11 and J.12 for the respective proofs. Lemma 5.4 specifies the choice of the constant defining the projection set.
Lemma 5.4.
With high probability, for an appropriate constant .
Lemma 5.5.
With high probability, .
Second, the SGD updates at points within must be unbiased estimates of the gradient, with bounded square-norm in expectation. The following lemma shows that the updates defined in Section 4 satisfy this property. See Section J.13 for the proof.
Lemma 5.6.
With high probability over , the following statement holds. Let . Then , and .
Addressing the lack of strong convexity.
Third, we need to show that the algorithm converges in parameter space and not just in loss. That is, if the excess loss of a point is small, then we want to show that its distance to the optimum is small as well. Note that the objective is not strongly convex, even inside the projection set, due to the high dimension. So we need a more careful approach. Restricted to the subspace spanned by the support of $w^*$, the objective is indeed strongly convex near the optimum, as shown in the following lemma (proof in Section J.14):
Lemma 5.7.
There is a constant such that with high probability over , for all with and .
But we need a bound for all feasible points, not only those supported on the support of $w^*$. The idea is to prove a lower bound on the excess loss for points near the optimum, and then use convexity to scale the bound linearly in the distance from the optimum. The above lemma provides such a lower bound for nearby points whose support lies in the support of $w^*$, and we need to show that adding a component outside this support increases the loss proportionally. This is precisely the content of Theorem B.4. Putting these pieces together we obtain the following lemma. See Section J.15 for the full proof.
Lemma 5.8.
There are constants and such that with high probability over the following holds. Let . If , then
Convergence of PSGD.
It now follows from the above lemmas and Theorem 5.3 that the PSGD algorithm, as outlined above and described in Section 4, converges to a good approximation of in a polynomial number of updates. The following theorem formalizes the guarantee. See Section J.16 for the proof.
Theorem 5.9.
With high probability over and over the execution of the algorithm, we get
Efficient implementation.
Finally, in Section F we prove that the initialization and each update step are efficient. Efficient gradient estimation within the projection set (i.e. sampling from the truncated normals appearing in the stochastic gradient) does not follow from the prior work, since our projection set is by necessity laxer than that of the prior work [9]. So we replace their rejection sampling procedure with a novel approximate sampling procedure that works under mild assumptions about the truncation set. Together with the convergence bound claimed in Theorem 5.9, these prove Proposition 3.3.
Proof of Proposition 3.3.
Acknowledgments
This research was supported by NSF Awards IIS-1741137, CCF-1617730 and CCF-1901292, by a Simons Investigator Award, by the DOE PhILMs project (No. DE-AC05-76RL01830), by the DARPA award HR00111990021, by the MIT Undergraduate Research Opportunities Program, and by a Google PhD Fellowship.
References
- [1] Takeshi Amemiya. Regression analysis when the dependent variable is truncated normal. Econometrica: Journal of the Econometric Society, pages 997–1016, 1973.
- [2] N Balakrishnan and Erhard Cramer. The art of progressive censoring. Springer, 2014.
- [3] Daniel Bernoulli. Essai d’une nouvelle analyse de la mortalité causée par la petite vérole, et des avantages de l’inoculation pour la prévenir. Histoire de l’Acad., Roy. Sci.(Paris) avec Mem, pages 1–45, 1760.
- [4] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
- [5] Richard Breen et al. Regression models: Censored, sample selected, or truncated data, volume 111. Sage, 1996.
- [6] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
- [7] Sylvain Chevillard. The functions erf and erfc computed with arbitrary precision and explicit error bounds. Information and Computation, 216:72–95, 2012.
- [8] A Clifford Cohen. Truncated and censored samples: theory and applications. CRC press, 2016.
- [9] Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Computationally and Statistically Efficient Truncated Regression. In the 32nd Conference on Learning Theory (COLT), 2019.
- [10] Mark A Davenport, Jason N Laska, Petros T Boufounos, and Richard G Baraniuk. A simple proof that random matrices are democratic. arXiv preprint arXiv:0911.0736, 2009.
- [11] David L Donoho. For most large underdetermined systems of linear equations the minimal $\ell_1$-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(6):797–829, 2006.
- [12] RA Fisher. Properties and applications of Hh functions. Mathematical tables, 1:815–852, 1931.
- [13] Francis Galton. An examination into the registered speeds of american trotting horses, with remarks on their value as hereditary data. Proceedings of the Royal Society of London, 62(379-387):310–315, 1897.
- [14] Vassilis A Hajivassiliou and Daniel L McFadden. The method of simulated scores for the estimation of ldv models. Econometrica, pages 863–896, 1998.
- [15] Jerry A Hausman and David A Wise. Social experimentation, truncated distributions, and efficient estimation. Econometrica: Journal of the Econometric Society, pages 919–938, 1977.
- [16] Michael P Keane. 20 simulation estimation for panel data models with limited dependent variables. 1993.
- [17] Jason N Laska. Democracy in action: Quantization, saturation, and compressive sensing. PhD thesis, 2010.
- [18] Gangadharrao S Maddala. Limited-dependent and qualitative variables in econometrics. Number 3. Cambridge university press, 1986.
- [19] Karl Pearson. On the systematic fitting of frequency curves. Biometrika, 2:2–7, 1902.
- [20] Karl Pearson and Alice Lee. On the generalised probable error in multiple normal correlation. Biometrika, 6(1):59–68, 1908.
- [21] Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the International Congress of Mathematicians 2010 (ICM 2010) (In 4 Volumes) Vol. I: Plenary Lectures and Ceremonies Vols. II–IV: Invited Lectures, pages 1576–1602. World Scientific, 2010.
- [22] Helmut Schneider. Truncated and censored samples from normal populations. Marcel Dekker, Inc., 1986.
- [23] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
- [24] James Tobin. Estimation of relationships for limited dependent variables. Econometrica: journal of the Econometric Society, pages 24–36, 1958.
- [25] Vladislav Voroninski and Zhiqiang Xu. A strong restricted isometry property, with an application to phaseless compressed sensing. Applied and Computational Harmonic Analysis, 40(2):386 – 395, 2016.
- [26] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (lasso). IEEE transactions on information theory, 55(5):2183–2202, 2009.
Appendix A Bounding solution of restricted program
In this section we prove that is small with high probability, where is a solution to Program 3. Specifically, we use regularization parameter , and prove that .
The proof is motivated by the following rephrasing of part (a) of Lemma 5.2:
(4) |
where . For intuition, consider the untruncated setting: then , so the equation is simply
where . Since is independent of and has norm , each entry of is Gaussian with variance , so has norm . Additionally, . Finally, is a -isometry, so we get the desired bound on .
Returning to the truncated setting, one bound still holds, namely . The remainder of the above sketch breaks down for two reasons. First, is no longer independent of . Second, bounding no longer implies a bound on .
The first problem is not so hard to work around; we can still bound as follows; see Section J.1 for the proof.
Lemma A.1.
With high probability over and ,
So in equation 4, the last term is with high probability. The first term is always , since . So we know that has small norm. Unfortunately this does not imply that has small norm, but as motivation, assume that we have such a bound.
Since is a -isometry, bounding is equivalent to bounding . To relate this quantity to , our approach is to lower bound the derivative of with respect to . The derivative turns out to have the following elegant form (proof in Section J.2):
Lemma A.2.
For any , .
Crucially, is nonnegative, and relates to survival probability. By integrating a lower bound on the derivative, we get the following lower bound on in terms of . The bound is linear for small , but flattens out as grows. See Section J.3 for the proof.
Lemma A.3.
Let . Then . Additionally, for any constant there is a constant such that if , then
If we want to use this lemma to prove that is at least a constant multiple of , we face two obstacles: (1) may not be large for all , and (2) the lemma only gives linear scaling if : but this is essentially what we’re trying to prove!
To deal with obstacle (1), we restrict to the rows for which is large. To deal with obstacle (2), we have a two-step proof. In the first step, we use the -lower bound provided by Lemma A.3 to show that (so that on average). In the second step, we use this to get linear scaling in Lemma A.3, and complete the proof, showing that .
Formally, define to be the set of indices such that and . In the following lemmas we show that contains a constant fraction of the indices, so by the isometry properties we retain a constant fraction of when restricting to . See Appendices J.4 and J.5 for the proofs of Lemmas A.4 and A.5 respectively.
Lemma A.4.
With high probability, .
Lemma A.5.
For some constant , we have that with high probability, .
We now prove the weaker, first-step bound on . But there is one glaring issue we must address: we made a simplifying assumption that is small. All we actually know is that is small. And has a nontrivial null space.
Here is a sketch of how we resolve this issue. Let and ; we want to show that if is large then is large. Geometrically, is approximately proportional to the distance from to the subspace . Oversimplifying for clarity, we know that for all . This is by itself insufficient. The key observation is that we also know for all . Thus, lies in a hyperoctant shifted to have corner . Since lies in the row space of , it’s perpendicular to , so the closest point to in the shifted hyperoctant should be .
Formalizing this geometric intuition yields the last piece of the proofs of the following theorems. See Section J.6 for the full proofs.
Theorem A.6.
There are positive constants , , and with the following property. Suppose that and . Then with high probability, .
Theorem A.7.
There are positive constants , , and with the following property. Suppose that and . Then with high probability.
Appendix B Proof of statistical recovery
Extend to by defining
We would like to show that . Since is independent of , each entry of is Gaussian with standard deviation . It turns out that a bound of suffices. To get this bound, we decompose
and bound each term separately. Here we are defining , and and similarly.
Lemma B.1.
There is a constant such that under the conditions of Theorem A.7, with high probability over ,
Lemma B.2.
There is a constant such that with high probability.
Combining the above lemmas with the bound on from the previous section, we get the desired theorem. See Section J.9 for the full proof.
Theorem B.3.
There are constants , , and with the following property. Suppose , and . Then with high probability we have .
As an aside that we’ll use later, this proof can be extended to any random vector near with support contained in (proof in Section J.10).
Theorem B.4.
There are constants , , and with the following property. Suppose and . If is a random variable with always, and with high probability, then with high probability .
Returning to the goal of this section, it remains to show that is invertible with high probability. But this follows from the isometry guarantee of Theorem G.1. Our main statistical result, Proposition 3.2, now follows.
Proof of Proposition 3.2.
Take , , and as in the statement of Theorem B.3. Let and . Let be any optimal solution to the regularized program, and let be any solution to the restricted program. By Theorem B.3, with high probability we have and ; and by Theorem G.1, is invertible. So by Lemma 5.2, it follows that . Therefore . ∎
Appendix C Primal-dual witness method
Proof of Lemma 5.1.
For a single sample , the partial derivative in direction is
where expectation is taken over the random variable (for fixed ). Simplifying yields the expression
The second partial derivative of in directions and is therefore
We conclude that
Averaging over all samples yields the claimed result. ∎
The following lemma collects several useful facts that are needed for the PDW method. Parts (a) and (b) are generically true for any $\ell_1$-regularized convex program; part (c) is a holdover from the untruncated setting that is still true. The proof is essentially due to [26], although part (c) now requires slightly more work.
Lemma C.1.
Fix any .
(a) A vector is optimal for Program 2 if and only if there exists some such that
(b) Suppose that are as in (a), and furthermore for all . Then necessarily for any optimal solution to Program 2.
(c) Suppose that are as in (b), with . If is invertible, then is the unique optimal solution to Program 2.
Proof.
Part (a) is simply the subgradient optimality condition in a convex program.
Part (b) is a standard fact about duality; we provide a proof here. Let be any optimal solution to Program 2. We claim that . To see this, first note that , since always holds by definition of a subgradient for the norm. Now, by optimality of and , we have for all . Therefore by convexity, for all . Since is the sum of two convex functions, both must be linear on the line segment between and . Therefore
for all . We conclude that
Since by subgradient optimality, it follows that . Hence, . Since for all , if for some then necessarily for equality to hold.
For part (c), if is invertible, then it is (strictly) positive definite. The Hessian of Program 3 is
Since is always positive, there is some (not necessarily a constant) such that
Thus, the Hessian of the restricted program is positive definite, so the restricted program is strictly convex. Therefore the restricted program has a unique solution. By part (b), any solution to the original program has support in , so the original program also has a unique solution, which must be . ∎
As with the previous lemma, the following proof is essentially due to [26] (with a different subgradient optimality condition).
Proof of Lemma 5.2.
By part (a) of Lemma C.1, a vector is optimal for Program 2 if and only if there is some such that
This vector equality can be written in block form as follows:
Since is optimal in , there is some such that satisfy the first of the two block equations. This is precisely part (a). If furthermore is zero-extended to , and is extended as in part (b), and satisfies , then since for all , we have that is a subgradient for . Therefore is optimal for Program 2. If and is invertible, then is the unique solution to Program 2 by parts (b) and (c) of Lemma C.1. ∎
Appendix D Sparse recovery from the Restricted Isometry Property
In this section we restate a theorem due to [6] about sparse recovery in the presence of noise. Our statement is slightly generalized to allow a trade-off between the isometry constants and the sparsity. That is, as the sparsity decreases relative to the isometry order , the isometry constants are allowed to worsen.
Theorem D.1 ([6]).
Let be a matrix satisfying the -Restricted Isometry Property
for all -sparse . Let be -sparse for some , and let satisfy . Then
where .
Proof.
Let and let . Then
so . Without loss of generality assume that , and for all . Divide into sets of size respecting this order:
Then the Restricted Isometry Property gives
(5) |
For any and , we have , so that
Summing over all , we get
The triangle inequality implies that . Returning to Equation 5, it follows that
as claimed. ∎
Appendix E Summary of the algorithm
Appendix F Algorithm details
In this section we fill in the missing details about the algorithm’s efficiency. Since we have already seen that the algorithm converges in update steps, all that remains is to show that the following algorithmic problems can be solved efficiently:
1. (Initial point) Compute
2. (Stochastic gradient) Given and , compute a sample , where .
3. (Projection) Given , compute .
Initial point.
To obtain the initial point , we need to solve the program
This program has come up previously in the compressed sensing literature (see, e.g., [6]). It can be recast as a Second-Order Cone Program (SOCP) by introducing variables :
Thus, it can be solved in polynomial time by interior-point methods (see [4]).
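For illustration, this initialization program (in the $\ell_1$-regularized least squares form described in Section 1.1) can be handed to an off-the-shelf convex solver. The sketch below uses CVXPY; the regularization weight `lam` is a placeholder rather than the paper's specific constant, and the equivalent constrained formulation from [6] could be expressed just as easily.

```python
import cvxpy as cp
import numpy as np

def initial_point(X, y, lam):
    """l1-regularized least squares: the truncation-oblivious warm start.

    Solved with a generic interior-point/conic solver via CVXPY; lam is a
    placeholder for the constant prescribed by the analysis.
    """
    n, d = X.shape
    w = cp.Variable(d)
    objective = cp.Minimize(cp.sum_squares(X @ w - y) / (2 * n) + lam * cp.norm1(w))
    cp.Problem(objective).solve()
    return np.asarray(w.value)
```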
Stochastic gradient.
In computing an unbiased estimate of the gradient, the only challenge is sampling from . By Assumption II, this takes time. We know from Lemma I.4 that . Since , we have from Lemma I.2 that
Thus, the time complexity of computing the stochastic gradient is .
In the special case when the truncation set is a union of intervals, there is a sampling algorithm with time complexity (Lemma K.4). Hence, in this case the time complexity of computing the stochastic gradient is .
To be more precise, we instantiate Lemma K.4 with accuracy , where is the number of update steps performed. This gives some sampling algorithm . In each step, ’s output distribution is within of the true distribution . Consider a hypothetical sampling algorithm in which is run, and then the output is altered by rejection to match the true distribution. Alteration occurs with probability . Thus, running the PSGD algorithm with , the probability that any alteration occurs is at most . As shown by Theorem H.1, PSGD with succeeds with high probability. Hence, PSGD with succeeds with high probability as well.
Projection.
The other problem we need to solve is projection onto set :
This is a convex QCQP, and therefore solvable in polynomial time by interior point methods (see [4]).
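The projection step admits the same treatment. The sketch below performs a Euclidean projection onto a convex set described by quadratic constraints via CVXPY; the constraint-building callback is left abstract because it stands in for the specific projection set of Section 4, whose exact definition and constant we do not reproduce here.

```python
import cvxpy as cp
import numpy as np

def project_onto_set(v, make_constraints):
    """Euclidean projection of v onto a convex set given by quadratic constraints.

    make_constraints(w) should return a list of CVXPY constraints describing the
    projection set (a convex QCQP feasible region); this is kept abstract since
    it stands in for the set defined in Section 4.
    """
    w = cp.Variable(v.shape[0])
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w - v)), make_constraints(w))
    problem.solve()
    return np.asarray(w.value)

# illustrative usage with a placeholder constraint ||X(w - w0)||_2^2 <= c * n
# (this particular form is an assumption, not the paper's exact definition):
# project_onto_set(v, lambda w: [cp.sum_squares(X @ (w - w0)) <= c * n])
```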
Appendix G Isometry properties
Let consist of samples from Process 1. In this section we prove the following theorem:
Theorem G.1.
For every there are constants , , and with the following property. Let . Suppose that . With probability at least over , for every subset with , the submatrix satisfies
We start with the upper bound, for which it suffices to take .
Lemma G.2.
Let . Suppose that . There is a constant such that
Proof.
In the process for generating , consider the matrix obtained by not discarding any of the samples . Then is a matrix for a random variable ; each row of is a spherical Gaussian independent of all previous rows, but depends on the rows. Nonetheless, by a Chernoff bound, In this event, is a submatrix of matrix with i.i.d. Gaussian entries. By [21],
for some absolute constants . Since is a submatrix of with high probability, and is a submatrix of , it follows that
as desired. ∎
For the lower bound, we use an -net argument.
Lemma G.3.
Let and let with . Let . Then
Proof.
From the constant survival probability assumption,
But , so Taking yields the desired bound. ∎
Lemma G.4.
Let . Fix and fix with . There are positive constants and such that
Proof.
For each let be the indicator random variable for the event that . Let . By Lemma G.3, . Each is independent, so by a Chernoff bound,
In the event , for any with it holds that
So the event in the lemma statement occurs with probability at most ∎
Now we can prove the isometry property claimed in Theorem G.1.
Proof of Theorem G.1.
Let . Let . Take , where is the constant in the statement of Lemma G.4. Let be the -dimensional unit ball. Let be a maximal packing of by radius- balls with centers on the unit sphere. By a volume argument,
Applying Lemma G.4 to each and taking a union bound,
So with probability , the complement of this event holds. And by Lemma G.2, the event holds with probability . In these events we claim that the conclusion of the theorem holds. Take any with , and take any with . Since is maximal, there is some with . Then
But . For sufficiently large , we get . Taking yields the claimed lower bound. ∎
As a corollary, we get that is a -isometry on its row space up to constants (of course, this holds for any with , but we only need it for ).
Corollary G.5.
With high probability, for every ,
Proof.
By Theorem G.1, with high probability all eigenvalues of lie in the interval . Hence, all eigenvalues of lie in the interval . But then
The upper bound is similar. ∎
We also get a Restricted Isometry Property, by applying Theorem G.1 to all subsets of a fixed size.
Corollary G.6 (Restricted Isometry Property).
There is a constant such that for any , if , then with high probability, for every with ,
Proof.
We apply Theorem G.1 to all with , and take a union bound over the respective failure events. The probability that there exists some set of size such that the isometry fails is at most
If for a sufficiently large constant , then this probability is . ∎
From this corollary, our main result for adversarial noise (Theorem 3.4) follows almost immediately:
Proof of Theorem 3.4.
Let be the constant in Corollary G.6. Let , and let . Finally, let .
Let . Suppose that and . Then , so by Corollary G.6, satisfies the -Restricted Isometry Property.
By definition, satisfies and (by feasibility of ). Finally, is -sparse. We conclude from Theorem D.1 and our choice of that
But by the triangle inequality. Thus, ∎
Appendix H Projected Stochastic Gradient Descent
In this section we present the exact PSGD convergence theorem which we use, together with a proof for completeness.
Theorem H.1.
Let be a convex function achieving its optimum at . Let be a convex set containing . Let be arbitrary. For define a random variable by
where and is fixed. Then
where .
Proof.
Fix . We can write
since projecting onto cannot increase the distance to .
Taking expectation over for fixed , we have
where the last inequality uses the fact that is a subgradient for at . Rearranging and taking expectation over , we get that
But now summing over , the right-hand side of the inequality telescopes, giving
This is the desired bound. ∎
Appendix I Survival probability
In this section we collect useful lemmas about truncated Gaussian random variables and survival probabilities.
Lemma I.1 ([9]).
Let and let be a measurable set. Then for a constant .
Lemma I.2 ([9]).
For ,
Lemma I.3 ([9]).
For ,
Lemma I.4.
With high probability,
Proof.
Let for , and let . Since are independent and identically distributed,
Therefore
by Markov’s inequality. ∎
Appendix J Omitted proofs
J.1 Proof of Lemma A.1
We need the following computation:
Lemma J.1.
For any and ,
Proof.
Now we can prove Lemma A.1.
Proof of Lemma A.1.
We prove that the bound holds in expectation, and then apply Markov’s inequality. We need to show that each entry of the vector has expected square . A single entry of the vector is
so its expected square is
For any , the cross-term is
But for fixed , the two terms in the product are independent, and they both have mean , so the cross-term is . Thus,
Then
Hence with probability at least ,
by Markov’s inequality. ∎
J.2 Proof of Lemma A.2
We can write
By the quotient rule,
as desired.
J.3 Proof of Lemma A.3
The fact that follows immediately from the fact that (Lemma A.2).
J.4 Proof of Lemma A.4
Since and is always at most , we have . Since the samples are rejection sampled on , it follows that as well. So by a Chernoff bound, with high probability, the number of such that is at most .
The condition that is clearly satisfied by at most indices.
J.5 Proof of Lemma A.5
J.6 Proof of Theorems A.6 and A.7
Proof of Theorem A.6.
Let and let . Our aim is to show that if is large, then is large, which would contradict Equation 4. Since is not an isometry, we can’t simply show that is large. Instead, we write an orthogonal decomposition for some and with . We’ll show that is large. Since , and is an isometry on the row space of , this suffices.
For every with , we have by Lemma A.3 that
where is the constant which makes Lemma A.3 work for indices with . Take , and suppose that the theorem’s conclusion is false, i.e. . Also suppose that the events of Lemmas A.1 and A.5 hold.
Then by the bound for we get
(6) |
where . We assumed earlier that but Equation 6 certainly also holds when .
By Lemma A.3, and have the same sign for all . So for all . Moreover, together with Equation 6, the sign constraint implies that for ,
Summing over we get
By Lemma A.5 on the LHS and Cauchy-Schwarz on the RHS, we get
Hence . But then . On the other hand, Equation 4 implies that
since event (2) holds. This is a contradiction for sufficiently large and sufficiently small. So either the assumption is false, or the events of Lemma A.1 or A.5 fail. But the latter two events fail with probability . So with high probability.
∎
Now that we know that , we can bootstrap to show that . While the previous proof relied on the constant regime of the lower bound provided by Lemma A.3, the following proof relies on the linear regime.
Proof of Theorem A.7.
As before, let and . Suppose that the conclusion of Theorem A.6 holds, i.e. . Also suppose that the events stated in Lemmas A.1 and A.5 holds. We can make these assumptions with high probability. For , we now know that . Thus,
where . By the same argument as in the proof of Theorem A.6, except replacing by , we get that
J.7 Proof of Lemma B.1
J.8 Proof of Lemma B.2
Draw samples from the distribution as follows: pick and . Keep sample if ; otherwise reject. We want to bound . Now consider the following revised process: keep all the samples, but stop only once samples satisfy . Let be the (random) stopping point; then the random variable defined by the new process stochastically dominates the random variable defined by the original process.
But in the new process, each is Gaussian and independent of . With high probability, by a Chernoff bound. And if are independent then
with high probability, by concentration of norms of Gaussian vectors. Therefore with high probability as well.
J.9 Proof of Theorem B.3
Set , where and are the constants in Lemmas B.1 and B.2. Set . Note that is chosen sufficiently large that .
By Theorem A.7, we have with high probability that the following event holds, which we call :
Now notice that
If holds, then by Theorem G.1, we get . By Lemma B.1, with high probability . And by Lemma B.2, with high probability . Therefore
where the last inequality is by choice of and . Thus, the event
occurs with high probability.
Suppose that event occurs. Now note that has independent Gaussian entries. Fix any ; since is independent of , , and , the dot product
is Gaussian with variance . Hence, is Gaussian with variance at most . So
By a union bound,
So the event holds with high probability.
J.10 Proof of Theorem B.4
We know from Theorem B.3 that . So it suffices to show that
Thus, we need to show that
for all . Fix one such . Then by Lemma A.2,
By Lemma I.4, we have
with high probability over . Assume that this inequality holds, and assume that and , so that . Then by Theorem G.1, . By Lemma I.2, for every and every ,
for a constant . Hence, by Lemma I.3,
We conclude that
Additionally, with high probability. Thus, Cauchy-Schwarz entails that
for large .
J.11 Proof of Lemma 5.4
J.12 Proof of Lemma 5.5
J.13 Proof of Lemma 5.6
Note that is a subgradient for at . Furthermore, for fixed ,
It follows that
is a subgradient for at .
We proceed to bounding . By definition of ,
where is uniformly random, and . Since it remains to bound the other term. We have that
With high probability, for all . Thus,
Now
The first term is bounded by since . Additionally,
since . Therefore the second term is bounded as
where the first and second inequalities are by Lemmas I.3 and Lemma I.2, and the third inequality is by Lemma I.4. Putting together the two bounds, we get
from which we conclude that . The law of total expectation implies that as well.
J.14 Proof of Lemma 5.7
We need to show that is -strongly convex near . Since is convex, it suffices to show that is -strongly convex near . The Hessian of is
Hence, it suffices to show that
for all with and . Call this region . With high probability over we can deduce the following.
(i) By Theorem A.7, we have . As for all , we get for all .
(ii) By the proof of Lemma A.4, the number of such that is at most .
Fix , and define to be the set of indices
For any ,
Thus,
Let denote this lower bound—a positive constant. By (i) and (ii), , so by Theorem G.1,
as desired.
J.15 Proof of Lemma 5.8
Let . Define . Also define . Then , so
Therefore for all , so
Additionally, since and , we know that
Adding the second inequality to the square of the first inequality, and lower bounding the norm by norm,
Since , by convexity as well. Hence,
(7) |
We distinguish two cases:
- 1.
-
2.
If , then , and thus . By convexity and this bound,
which contradicts the lemma’s assumption for a sufficiently small constant .
J.16 Proof of Theorem 5.9
Appendix K Efficient sampling for union of intervals
In this section, in Lemma K.4, we show that when $S$ is a union of intervals, Assumption II holds. The only difference is that instead of exact sampling we have approximate sampling, but the approximation error is exponentially small in total variation distance and hence it cannot affect any algorithm that runs in polynomial time.
Definition K.1 (Evaluation Oracle).
Let $f$ be an arbitrary real function. We define the evaluation oracle of $f$ as an oracle that, given a number $x$ and a target accuracy $\epsilon$, computes an $\epsilon$-approximate value of $f(x)$, that is, a value $\tilde{f}(x)$ with $|\tilde{f}(x) - f(x)| \leq \epsilon$.
Lemma K.2.
Let be a real increasing and differentiable function and an evaluation oracle of . Let for some . Then we can construct an algorithm that implements the evaluation oracle of , i.e. . This implementation on input and input accuracy runs in time and uses at most calls to the evaluation oracle with inputs with representation length and input accuracy , with .
Proof of Lemma K.2.
Given a value our goal is to find an such that . Using doubling we can find two numbers such that and for some to be determined later. Because of the lower bound on the derivative of we have that this step will take steps. Then we can use binary search in the interval where in each step we make a call to the oracle with accuracy and we can find a point such that . Because of the upper bound on the derivative of we have that is -Lipschitz and hence this binary search will need time and oracle calls. Now using the mean value theorem we get that for some it holds that which implies that , so if we set , the lemma follows. ∎
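A minimal sketch of this doubling-plus-bisection inversion, for an increasing function given through an evaluation oracle, is the following; it ignores the explicit oracle-accuracy bookkeeping of the lemma, and the function and tolerance names are ours.

```python
def invert_increasing(f, target, tol=1e-12):
    """Find x with f(x) approximately equal to target, for an increasing real f.

    Doubling to bracket the target, then bisection; mirrors the proof of
    Lemma K.2 at a high level, assuming a bracketing point exists.
    """
    lo, hi = -1.0, 1.0
    while f(lo) > target:       # doubling step to find a lower bracket
        lo *= 2
    while f(hi) < target:       # doubling step to find an upper bracket
        hi *= 2
    while hi - lo > tol:        # bisection on the bracketed interval
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```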
Lemma K.3.
Let be a closed interval and such that . Then there exists an algorithm that runs in time and returns a sample of a distribution , such that .
Proof Sketch.
The sampling algorithm follows these steps: (1) from the cumulative distribution function of the truncated distribution, define a monotone map from the interval to $[0, 1]$; (2) sample a number uniformly from $[0, 1]$; (3) using an evaluation oracle for the error function, as per Proposition 3 in [7], and using Lemma K.2, compute with exponential accuracy the preimage of the sampled number under this map, and return this as the desired sample. ∎
Lemma K.4.
Let , , , be closed intervals and such that . Then there exists an algorithm that runs in time and returns a sample of a distribution , such that .
Proof Sketch.
Using Proposition 3 in [7] we can compute, with exponential accuracy, an estimate of the Gaussian mass of every interval. If the estimated mass of an interval is negligibly small, we do not consider that interval in the next step. From the remaining intervals we choose one with probability proportional to its estimated mass. Because of the high accuracy in the computation of these masses, this is close in total variation distance to choosing an interval proportionally to its true mass. Finally, after choosing an interval, we run the algorithm of Lemma K.3 within it with exponential accuracy, and hence the overall total variation distance from the target truncated distribution is exponentially small. ∎
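Putting Lemmas K.3 and K.4 together, a sketch of the resulting sampler could look as follows. It uses library Gaussian CDF/quantile routines in place of the arbitrary-precision erf evaluation of [7], and does not reproduce the numerical-precision guarantees of the lemmas; the function names are ours.

```python
import numpy as np
from scipy.stats import norm

def sample_truncated_normal(mu, intervals, rng):
    """One sample from N(mu, 1) conditioned on a union of closed intervals.

    Chooses an interval with probability proportional to its Gaussian mass,
    then inverts the CDF within that interval (Lemmas K.3/K.4, without the
    explicit accuracy analysis).
    """
    masses = np.array([norm.cdf(b, loc=mu) - norm.cdf(a, loc=mu) for a, b in intervals])
    idx = rng.choice(len(intervals), p=masses / masses.sum())  # pick interval by mass
    a, b = intervals[idx]
    u = rng.uniform(norm.cdf(a, loc=mu), norm.cdf(b, loc=mu))  # uniform in CDF range
    return norm.ppf(u, loc=mu)                                 # inverse CDF within [a, b]
```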