Truncated Linear Regression in High Dimensions
Abstract
As in standard linear regression, in truncated linear regression we are given access to observations $(x_i, y_i)$ whose dependent variable equals $y_i = w^{*\top} x_i + \varepsilon_i$, where $w^*$ is some fixed unknown vector of interest and $\varepsilon_i$ is independent noise; except we are only given an observation if its dependent variable $y_i$ lies in some “truncation set” $S \subseteq \mathbb{R}$. The goal is to recover $w^*$ under some favorable conditions on the $x_i$'s and the noise distribution. We prove that there exists a computationally and statistically efficient method for recovering $k$-sparse $d$-dimensional vectors $w^*$ from $n$ truncated samples, which attains an optimal $\ell_2$ reconstruction error of $O(\sqrt{(k \log d)/n})$. As a corollary, our guarantees imply a computationally efficient and information-theoretically optimal algorithm for compressed sensing with truncation, which may arise from measurement saturation effects. Our result follows from a statistical and computational analysis of the Stochastic Gradient Descent (SGD) algorithm for solving a natural adaptation of the LASSO optimization problem that accommodates truncation. This generalizes the works of both: (1) Daskalakis et al. [9], where no regularization is needed due to the low dimensionality of the data, and (2) Wainwright [26], where the objective function is simple due to the absence of truncation. In order to deal with both truncation and high dimensionality at the same time, we develop new techniques that not only generalize the existing ones but, we believe, are of independent interest.
1 Introduction
In the vanilla linear regression setting, we are given observations of the form $y_i = w^{*\top} x_i + \varepsilon_i$, where $x_i \in \mathbb{R}^d$, $w^* \in \mathbb{R}^d$ is some unknown coefficient vector that we wish to recover, and $\varepsilon_i$ is random noise, independent and identically distributed across different observations. Under favorable conditions on the $x_i$'s and the distribution of the noise, it is well known that $w^*$ can be recovered to within $\ell_2$-reconstruction error $O(\sqrt{d/n})$ from $n$ observations.
The classical model and its associated guarantees might, however, be inadequate to address many situations which frequently arise in both theory and practice. We focus on two common and widely studied deviations from the standard model. First, it is often the case that $n \ll d$, i.e. the number of observations is much smaller than the dimension of the unknown vector $w^*$. In this “under-determined” regime, it is fairly clear that it is impossible to expect a non-trivial reconstruction of the underlying $w^*$, since there are infinitely many vectors $w$ satisfying $w^\top x_i = w^{*\top} x_i$ for all $i$. To sidestep this impossibility, we must exploit additional structural properties that we might know $w^*$ satisfies. One such property might be sparsity, i.e. that $w^*$ has at most $k$ non-zero coordinates. Linear regression under sparsity assumptions has been widely studied, motivated by applications such as model selection and compressed sensing; see e.g. the celebrated works of [23, 6, 11, 26] on this topic. It is known, in particular, that a $k$-sparse $w^*$ can be recovered to within error $O(\sqrt{(k \log d)/n})$ when the $x_i$'s are drawn from the standard multivariate Normal, or satisfy other favorable conditions [26]. The recovery algorithm solves a least squares optimization problem with $\ell_1$ regularization, i.e. what is called LASSO optimization in Statistics, in order to reward sparsity.
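For concreteness, the LASSO program alluded to here has the generic form below; the regularization weight $\lambda > 0$ is stated generically, and the specific tuning used in [23, 26] is more refined:

$$\hat{w} \in \arg\min_{w \in \mathbb{R}^d} \; \frac{1}{2n} \sum_{i=1}^n \left(y_i - w^\top x_i\right)^2 + \lambda \|w\|_1 .$$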
Another common deviation from the standard model is the presence of truncation. Truncation occurs when the sample $(x_i, y_i)$ is not observed whenever $y_i$ falls outside of a subset $S \subseteq \mathbb{R}$. Truncation arises quite often in practice as a result of saturation of measurement devices, bad data collection practices, incorrect experimental design, and legal or privacy constraints which might preclude the use of some of the data. Truncation is known to affect linear regression in counter-intuitive ways, as illustrated in Fig. 1,

where the linear fits obtained via least squares regression before and after truncation of the data based on the value of the response variable are also shown. More broadly, it is well understood that naive statistical inference using truncated data commonly leads to bias. Accordingly, a long line of research in Statistics and Econometrics has strived to develop regression methods that are robust to truncation [24, 1, 15, 18, 16, 5, 14]. This line of work falls into the broader field of Truncated Statistics [22, 8, 2], which finds its roots in the early works of [3], [13], [19, 20], and [12]. Despite this voluminous work, computationally and statistically efficient methods for truncated linear regression have only recently been obtained in [9], where it was shown that, under favorable assumptions about the $x_i$'s and the truncation set $S$, and assuming the $\varepsilon_i$'s are drawn from a Gaussian, the negative log-likelihood of the truncated sample can be optimized efficiently and approximately recovers the true parameter vector with an $\ell_2$ reconstruction error of $O(\sqrt{d/n})$.
Our contribution.
In this work, we solve the general problem, addressing both of the aforedescribed challenges together. Namely, we provide efficient algorithms for the high-dimensional ($n \ll d$) truncated linear regression problem. This problem arises frequently, including in compressed sensing applications with measurement saturation, as studied e.g. in [10, 17].
Under standard conditions on the design matrix and the noise distribution (namely, that the $x_i$'s and $\varepsilon_i$'s are sampled from independent Gaussian distributions before truncation), and under mild assumptions on the truncation set (roughly, that it permits a constant fraction of the samples to survive truncation), we show that the SGD algorithm on the truncated LASSO optimization program, our proposed adaptation of the standard LASSO optimization to accommodate truncation, is a computationally and statistically efficient method for recovering $w^*$, attaining an optimal reconstruction error of $O(\sqrt{(k \log d)/n})$, where $k$ is the sparsity of $w^*$.
1.1 Overview of proofs and techniques
The problem that we solve in this paper encompasses the two difficulties of the problems considered in: (1) Wainwright [26], which tackles the problem of high-dimensional sparse linear regression with Gaussian noise, and (2) Daskalakis et al. [9], which tackles the problem of truncated linear regression. The tools developed in those papers do not suffice to solve our problem, since each difficulty interferes with the other. Hence, we introduce new ideas and develop new interesting tools that allow us to bridge the gap between [26] and [9]. We begin our overview in this section with a brief description of the approaches of [26, 9] and subsequently outline the additional challenges that arise in our setting, and how we address them.
Wainwright [26] uses as an estimator the solution of the $\ell_1$-regularized least squares program, also called LASSO, to handle the high dimensionality of the data. Wainwright then uses a primal-dual witness method to bound the number of samples that are needed in order for the solution of the LASSO program to be close to the true coefficient vector $w^*$. The computational task is not discussed in detail in Wainwright [26], since the objective function of the LASSO program is very simple and standard convex optimization tools can be used.
Daskalakis et al. [9] use as their estimator the solution to the log-likelihood maximization problem. In contrast to [26], their convex optimization problem takes a very complicated form due to the presence of truncation, which introduces an intractable log-partition function term in the log-likelihood. The main idea of Daskalakis et al. [9] to overcome this difficulty is identifying a convex set such that: (1) it contains the true coefficient vector $w^*$, (2) their objective function is strongly convex inside this set, (3) inside this set there exists an efficient rejection sampling algorithm to compute unbiased estimates of the gradients of their objective function, (4) the norm of the stochastic gradients inside this set is bounded, and (5) projecting onto this set is efficient. These five properties are essentially what they need to prove that SGD with this projection set converges quickly to a good estimate of $w^*$.
Our reconstruction algorithm is inspired by both [26] and [9]. We formulate our optimization program as the $\ell_1$-regularized version of the negative log-likelihood function in the truncated setting, which we call the truncated LASSO. In particular, our objective contains an intractable log-partition function term. Our proof then consists of two parts. First, we show statistical recovery, i.e. we upper bound the number of samples that are needed for the solution of the truncated LASSO program to be close to the true coefficient vector $w^*$. Second, we show that this optimization problem can be solved efficiently. The cornerstones of our proof are the two seminal approaches that we mentioned: the primal-dual witness method for statistical recovery in high dimensions, and the projected SGD method for efficient maximum likelihood estimation in the presence of truncation. Unfortunately, these two techniques are not a priori compatible with each other.
Roughly speaking, the technique of [9] relies heavily on the very carefully chosen projection set to which the SGD is restricted, as we explained above. This projection set cannot be used in high dimensions because it effectively requires knowing the low-dimensional subspace in which the true solution lies. The projection set was the key to the aforementioned nice properties: strong convexity, efficient gradient estimation, and bounded gradient. In its absence, we need to deal with each of these issues individually. The primal-dual witness method of [26] cannot be applied directly in our setting either. In our case the gradient of the truncated LASSO does not have a nice closed form, and hence finding the correct way to construct the primal-dual witness requires a more delicate argument. Our proof manages to overcome all of these issues. In a nutshell, the architecture of our full proof is the following.
1. Optimality on the low-dimensional subspace. The first thing that we need to prove is that the optimum of the truncated LASSO program, when restricted to the low-dimensional subspace defined by the non-zero coordinates of $w^*$, is close to the true solution. This step of the proof was unnecessary in [9] due to the lack of regularization in their objective, and was trivial in [26] due to the simple loss function, i.e. the $\ell_1$-regularized least squares.
2. Optimality on the high-dimensional space. We prove that the optimum of the truncated LASSO program in the low-dimensional subspace is also optimal for the whole space. This step is done using the primal-dual witness method as in [26]. However, in our case the expression of the gradient is much more complicated due to the very convoluted objective function. Hence, we find a more general way to prove this step that does not rely on the exact expression of the gradient.
These two steps of the proof suffice to upper bound the number of samples that we need to recover the coefficient vector via the truncated LASSO program. Next, we provide a computationally efficient method to solve the truncated LASSO program.
3. Initialization of SGD. The first step of our algorithm to solve the truncated LASSO is finding a good initial point for the SGD. This was unnecessary in [26] due to the simple objective and in [9] due to the existence of the projection set (where efficient projection onto it immediately gave an initial point). We propose the simple answer of bootstrapping: start with the solution of the $\ell_1$-regularized ordinary least squares program. This is a biased estimate, but we show it is good enough for initialization.
4. Projection of SGD. Next, we need to choose a projection set to make sure that Projected SGD (PSGD) converges. The projection set chosen in [9] is not helpful in our case unless we know a priori the set of non-zero coordinates of $w^*$. Hence, we define a different, simpler set which admits efficient projection algorithms. As a necessary side effect, in contrast to [9], our set cannot guarantee many of the important properties that we need to prove fast convergence of SGD.
5. Lack of strong convexity and gradient estimation. Our different projection set cannot guarantee the strong convexity and efficient gradient estimation enjoyed in [9]. There are two problems here:
First, PSGD is guaranteed to converge to a point with small loss; but since strong convexity fails in high dimensions, a small loss does not immediately imply that this point is close to the optimum. We provide a workaround to resolve this issue that can be applied to other regularized programs with stochastic access to the gradient function.
Second, computing unbiased estimates of the gradient is now difficult. The prior work employed rejection sampling, but in our setting this may take exponential time. For this reason we provide a more explicit method for estimating the gradient much faster, whenever the truncation set is reasonably well-behaved.
An important tool that we leverage repeatedly in our analysis and we have not mentioned above is a strong isometry property for our measurement matrix, which has truncated Gaussian rows. Similar properties have been explored in the compressed sensing literature for matrices with i.i.d. Gaussian and sub-Gaussian entries [25].
We refer to Section 5 for a more detailed overview of the proofs of our main results.
2 High-dimensional truncated linear regression model
Notation.
Let refer to a standard normal random variable. For and measurable , let refer to the truncated normal . Let . Let . Additionally, for let refer to (or if is a row vector), and let refer to . For a matrix , let be the random vector with . For sets and , let refer to the submatrix . For we treat the row as a row vector. In a slight abuse of notation, we will often write (or sometimes, ); this will always mean . By we mean ). For , define to be the set of indices such that .
2.1 Model
Let $w^* \in \mathbb{R}^d$ be the unknown parameter vector which we are trying to recover. We assume that it is $k$-sparse; that is, its support $\{j : w^*_j \neq 0\}$ has cardinality at most $k$. Let $S$ be a measurable subset of the real line. The main focus of this paper is the setting of Gaussian noise: we assume that we are given $n$ truncated samples generated by the following process:
1. Pick $x \in \mathbb{R}^d$ according to the standard normal distribution $N(0, I_{d \times d})$.
2. Sample $\varepsilon \sim N(0, 1)$ and compute $y$ as
$$y = w^{*\top} x + \varepsilon. \qquad (1)$$
3. If $y \in S$, then return the sample $(x, y)$. Otherwise restart the process from step 1.
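For intuition, the following sketch simulates this data-generating process when $S$ is given as a union of intervals; the helper names and the specific parameter values are illustrative assumptions, not part of the formal model.

```python
import numpy as np

def in_union_of_intervals(y, intervals):
    """Check membership of y in S, where S is given as a list of (a, b) intervals."""
    return any(a <= y <= b for a, b in intervals)

def sample_truncated_regression(w_star, n, intervals, rng):
    """Draw n samples (x, y) from Process 1: keep (x, y) only if y lands in S."""
    d = len(w_star)
    X, ys = [], []
    while len(ys) < n:
        x = rng.standard_normal(d)                 # step 1: x ~ N(0, I_d)
        y = w_star @ x + rng.standard_normal()     # step 2: y = <w*, x> + eps
        if in_union_of_intervals(y, intervals):    # step 3: keep only if y in S
            X.append(x)
            ys.append(y)
    return np.array(X), np.array(ys)

# illustrative usage: a 3-sparse vector in d = 50 dimensions, S = [0, inf)
rng = np.random.default_rng(0)
w_star = np.zeros(50)
w_star[:3] = 1.0
X, y = sample_truncated_regression(w_star, n=200, intervals=[(0.0, np.inf)], rng=rng)
```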
We also briefly discuss the setting of arbitrary noise, in which $\varepsilon$ may be arbitrary and we are interested in approximations to $w^*$ whose guarantees are bounded in terms of the magnitude of the noise.
Together, the $n$ samples define a pair $(X, y)$, where $X \in \mathbb{R}^{n \times d}$ is the matrix whose rows are the $x_i$'s and $y \in \mathbb{R}^n$ is the vector of responses. We make the following assumptions about the set $S$.
Assumption I (Constant Survival Probability).
Taking expectation over vectors $x \sim N(0, I_{d \times d})$, we have $\mathbb{E}_x\left[\Pr_{\varepsilon \sim N(0,1)}\left[w^{*\top} x + \varepsilon \in S\right]\right] \geq \alpha$ for a constant $\alpha > 0$.
Assumption II (Efficient Sampling).
There is an algorithm which takes as input $\mu \in \mathbb{R}$ and produces an unbiased sample from the normal distribution $N(\mu, 1)$ conditioned on the set $S$; we denote its running time by $T$.
We do not require that $T$ is a constant, but it will affect the efficiency of our algorithm. To be precise, our algorithm will make polynomially many queries to the sampling algorithm. As we explain in Lemma K.4 in Section K, if the set $S$ is a union of intervals, then Assumption II is satisfied with polynomial running time. We state the theorems below under the assumption that $S$ is a union of intervals, in which case the algorithms have polynomial running time, but all the statements below can be restated under the more general Assumption II, with the running time scaling accordingly with $T$.
3 Statistically and computationally efficient recovery
In this section we formally state our main results for recovery of a sparse high-dimensional coefficient vector from truncated linear regression samples. In Section 3.1, we present our result under the standard assumption that the error distribution is Gaussian, whereas in Section 3.2 we present our results for the case of adversarial error.
3.1 Gaussian noise
In the setting of Gaussian noise (before truncation), we prove the following theorem.
Theorem 3.1.
From now on, we will use the term “with high probability” when the rate of decay is not of importance. This phrase means “with probability $1 - o(1)$” as the number of samples $n$ grows.
Observe that even without the added difficulty of truncation (e.g. if $S = \mathbb{R}$), sparse linear regression requires $\Omega(k \log(d/k))$ samples by known information-theoretic arguments [26]. Thus, our sample complexity is information-theoretically optimal.
In one sentence, the algorithm optimizes the $\ell_1$-regularized sample negative log-likelihood via projected SGD. The negative log-likelihood of $w$ for a single sample $(x, y)$ is, up to an additive constant that does not depend on $w$,
$$\mathrm{nll}(w; x, y) = \frac{1}{2}\left(y - w^\top x\right)^2 + \log \int_S \exp\!\left(-\tfrac{1}{2}\left(z - w^\top x\right)^2\right) dz.$$
Given $n$ samples $(x_1, y_1), \dots, (x_n, y_n)$, we can then write $\overline{\mathrm{nll}}(w) = \frac{1}{n} \sum_{i=1}^n \mathrm{nll}(w; x_i, y_i)$. We also define the regularized negative log-likelihood by adding an $\ell_1$ penalty $\lambda \|w\|_1$. We claim that optimizing the following program approximately recovers the true parameter vector $w^*$ with high probability, for sufficiently many samples and an appropriate regularization constant $\lambda$:
$$\min_{w \in \mathbb{R}^d} \; \overline{\mathrm{nll}}(w) + \lambda \|w\|_1 \qquad\qquad (2)$$
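To make the objective of Program (2) concrete, the following sketch evaluates it numerically when $S$ is a union of intervals, in which case the log-partition term reduces to a sum of Gaussian CDF differences. The function names are ours and the evaluation is only for intuition (it differs from the formula above by an additive constant independent of $w$); it is not the optimization routine analyzed in the paper.

```python
import numpy as np
from scipy.stats import norm

def log_survival_mass(mu, intervals):
    """log of the N(mu, 1) mass of S, for S a union of (a, b) intervals."""
    mass = sum(norm.cdf(b, loc=mu) - norm.cdf(a, loc=mu) for a, b in intervals)
    return np.log(mass)

def truncated_lasso_objective(w, X, y, intervals, lam):
    """(1/n) sum_i [ (y_i - <w, x_i>)^2 / 2 + log Z(<w, x_i>) ] + lam * ||w||_1,
    up to an additive constant independent of w."""
    mu = X @ w
    nll = 0.5 * (y - mu) ** 2 + np.array([log_survival_mass(m, intervals) for m in mu])
    return nll.mean() + lam * np.abs(w).sum()
```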
The first step is to show that any solution to Program (2) will be near the true solution $w^*$. To this end, we prove the following theorem, which already shows that $O(k \log d)$ samples are sufficient to solve the problem of statistical recovery of $w^*$:
Proposition 3.2.
Suppose that Assumption I holds. There are constants $c_1$, $c_2$, and $c_3$ with the following property. (In the entirety of this paper, constants may depend on the constant of Assumption I.) Suppose that $n \geq c_1 k \log d$, and let $(X, y)$ be samples drawn from Process 1. Let $\hat{w}$ be any optimal solution to Program (2) with regularization constant $\lambda = c_2 \sqrt{\log d / n}$. Then $\|\hat{w} - w^*\|_2 \leq c_3 \sqrt{(k \log d)/n}$ with high probability.
Then it remains to show that Program (2) can be solved efficiently.
Proposition 3.3.
Suppose that Assumption I holds, let $(X, y)$ be $n$ samples drawn from Process 1, and let $\hat{w}$ be any optimal solution to Program (2). There exists a constant $c$ such that if $n \geq c\, k \log d$, then there is an algorithm which outputs a point $\tilde{w}$ satisfying $\|\tilde{w} - \hat{w}\|_2 = O(\sqrt{(k \log d)/n})$ with high probability. Furthermore, if the survival set $S$ is a union of intervals, the running time of our algorithm is polynomial in the size of the input.
We present a more detailed description of the algorithm that we use in Section 4.
3.2 Adversarial noise
In the setting of arbitrary noise, optimizing negative log-likelihood no longer makes sense, and indeed our results from the setting of Gaussian noise no longer hold. However, we may apply results from compressed sensing which describe sufficient conditions on the measurement matrix for recovery to be possible in the face of adversarial error. We obtain the following theorem:
Theorem 3.4.
Suppose that Assumption I holds and let . There are constants and such that if , , and minimizes in the region , then .
The proof is a corollary of our result that the (appropriately rescaled) measurement matrix $X$ satisfies the Restricted Isometry Property from [6] with high probability even when we only observe truncated samples; see Corollary G.6 and the subsequent discussion in Section G.
The remainder of the paper is dedicated to the case where the noise is Gaussian before truncation.
4 The efficient estimation algorithm
Define . To solve Program 2, our algorithm is Projected Stochastic Gradient Descent (PSGD) with projection set , for an appropriate constant (specified in Lemma 5.4). We pick an initial feasible point by computing
Subsequently, the algorithm performs updates, where . Define a random update to as follows. Pick uniformly at random. Sample . Then set
Finally, the algorithm outputs .
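The following sketch spells out one way this PSGD loop could look in code. The projection step is abstracted behind a `project` callback and the truncated-normal sampler behind `sample_conditional`; the gradient form $(z - y_i)\,x_i$ plus an $\ell_1$ subgradient term reflects the expectation formula discussed in Section 5.2. The function names, the fixed step size, and the averaging of iterates are illustrative assumptions rather than the paper's exact specification.

```python
import numpy as np

def psgd_truncated_lasso(X, y, lam, w_init, project, sample_conditional,
                         step_size, num_steps, rng):
    """Projected SGD on the l1-regularized truncated negative log-likelihood.

    project(w)             -- Euclidean projection onto the (convex) projection set
    sample_conditional(mu) -- one sample z ~ N(mu, 1) conditioned on the survival set S
    The constant step size and iterate averaging follow the generic PSGD analysis
    (Theorem 5.3 / H.1); the exact constants in the paper may differ.
    """
    n, d = X.shape
    w = w_init.copy()
    w_sum = np.zeros(d)
    for _ in range(num_steps):
        i = rng.integers(n)                          # pick a uniformly random sample
        mu = X[i] @ w
        z = sample_conditional(mu)                   # z ~ N(mu, 1; S)
        grad = (z - y[i]) * X[i] + lam * np.sign(w)  # stochastic subgradient
        w = project(w - step_size * grad)            # gradient step + projection
        w_sum += w
    return w_sum / num_steps                         # average of the iterates
```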
5 Overview of proofs and techniques
This section outlines our techniques. The first step is proving Proposition 3.2. The second step is proving Proposition 3.3, by showing that the algorithm described in Section 4 efficiently recovers an approximate solution to Program 2.
5.1 Statistical recovery
Our approach to proving Proposition 3.2 is the Primal-Dual Witness (PDW) method introduced in [26]. Namely, we are interested in showing that the solution of Program (2) is near $w^*$ with high probability. Let us denote by $\mathrm{supp}(w^*)$ the (unknown) support of the true parameter vector, i.e. the set of its non-zero coordinates. Define the following (hypothetical) program in which the solution space is restricted to vectors with support in $\mathrm{supp}(w^*)$:
$$\min_{w \in \mathbb{R}^d:\ \mathrm{supp}(w) \subseteq \mathrm{supp}(w^*)} \; \overline{\mathrm{nll}}(w) + \lambda \|w\|_1 \qquad\qquad (3)$$
In the untruncated setting, the PDW method is to apply the subgradient optimality condition to the solution of this restricted program, which is by definition sparse. Proving that this solution satisfies subgradient optimality for the original program implies that the original program has a sparse solution and, under mild extra conditions, that it is the unique solution. Thus, the original program recovers the true basis. We use the PDW method for a slightly different purpose: we apply subgradient optimality to the restricted solution to show that it is close to $w^*$, and then use this to prove that it also solves the original program.
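For reference, the subgradient optimality (stationarity) condition invoked here is the generic one for $\ell_1$-regularized convex programs (cf. Lemma C.1(a)); we record it only to fix notation. A vector $\hat{w}$ is optimal for Program (2) if and only if there exists $\hat{z} \in \partial\|\hat{w}\|_1$ with

$$\nabla \overline{\mathrm{nll}}(\hat{w}) + \lambda \hat{z} = 0, \qquad \hat{z}_j = \mathrm{sign}(\hat{w}_j) \ \text{if } \hat{w}_j \neq 0, \quad |\hat{z}_j| \leq 1 \ \text{otherwise}.$$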
Truncation introduces its own challenges. While Program 2 is still convex [9], it is much more convoluted than ordinary least squares. In particular, the gradient and Hessian of the negative log-likelihood have the following form (see Section C for the proof).
Lemma 5.1.
For all $w \in \mathbb{R}^d$, the gradient of the negative log-likelihood is
$$\nabla \overline{\mathrm{nll}}(w) = \frac{1}{n} \sum_{i=1}^n \left( \mathbb{E}_{z \sim N(w^\top x_i, 1; S)}[z] - y_i \right) x_i,$$
where $N(\mu, 1; S)$ denotes the distribution $N(\mu, 1)$ conditioned on $S$. The Hessian is
$$\nabla^2 \overline{\mathrm{nll}}(w) = \frac{1}{n} \sum_{i=1}^n \mathrm{Var}_{z \sim N(w^\top x_i, 1; S)}[z] \; x_i x_i^\top.$$
We now state the two key facts which make the PDW method work. First, the solution to the restricted program must have a zero subgradient in all directions within the support of $w^*$. Second, if this subgradient can be extended to a valid subgradient on all of $\mathbb{R}^d$, then the restricted solution is optimal for the original program. Formally:
Lemma 5.2.
5.2 Computational recovery
For Proposition 3.3, we want to solve Program (2), i.e. minimize the $\ell_1$-regularized negative log-likelihood $\overline{\mathrm{nll}}(w) + \lambda \|w\|_1$. The gradient of $\overline{\mathrm{nll}}$ does not have a closed form, but it can be written cleanly as an expectation:
$$\nabla \overline{\mathrm{nll}}(w) = \frac{1}{n} \sum_{i=1}^n \mathbb{E}_{z \sim N(w^\top x_i, 1; S)}\left[(z - y_i)\, x_i\right].$$
Let us assume that these conditional (truncated) normal distributions can be sampled efficiently. Then we may hope to optimize the objective by stochastic gradient descent. But, problematically, in our high-dimensional setting the objective is nowhere strongly convex. So while we can apply the following general result from convex optimization, it has several strings attached:
Theorem 5.3.
Let be a convex function achieving its optimum at . Let be a sequence of random vectors in . Suppose that where . Set . Then
In particular, to apply this result, we need to solve three technical problems:
1. We need to efficiently find an initial point with bounded distance from the optimum.
2. The gradient does not have bounded norm for arbitrary points. Thus, we need to pick a projection set in which the bound holds.
3. Since the objective is not strongly convex, we need to convert a bound on the suboptimality in function value into a bound on the distance to the optimum in parameter space.
As defined in Section 4, our solution is a projection set determined by an appropriate constant. To pick an initial point in this set, we solve an $\ell_1$-regularized least squares program. This estimate is biased due to the truncation, but the key point is that, by classical results from compressed sensing, it has bounded distance from $w^*$ (and therefore from the optimum of Program (2)).
The algorithm then consists of projected stochastic gradient descent with projection set . To bound the number of update steps required for the algorithm to converge to a good estimate of , we need to solve several statistical problems (which are direct consequences of assumptions in Theorem 5.3).
Properties of the optimum.
First, the point we are trying to reach must be feasible and at a bounded distance from the initial point (due to the high dimensionality, the projection set is unbounded, so this is not immediate). The following lemmas formalize this; see Sections J.11 and J.12 for the respective proofs. Lemma 5.4 specifies the choice of the constant defining the projection set.
Lemma 5.4.
With high probability, for an appropriate constant .
Lemma 5.5.
With high probability, .
Second, the SGD updates at points within must be unbiased estimates of the gradient, with bounded square-norm in expectation. The following lemma shows that the updates defined in Section 4 satisfy this property. See Section J.13 for the proof.
Lemma 5.6.
With high probability over , the following statement holds. Let . Then , and .
Addressing the lack of strong convexity.
Third, we need to show that the algorithm converges in parameter space and not just in loss. That is, if the excess loss of a point is small, then we want to show that its distance to the optimum is small as well. Note that the objective is not strongly convex, even inside the projection set, due to the high dimension. So we need a more careful approach. Restricted to the subspace spanned by the support of $w^*$, the objective is indeed strongly convex near the optimum, as shown in the following lemma (proof in Section J.14):
Lemma 5.7.
There is a constant such that with high probability over , for all with and .
But we need a bound for all feasible points, not only those supported on the support of $w^*$. The idea is to prove a lower bound on the excess loss for points near the optimum, and then use convexity to scale the bound linearly in the distance from the optimum. The above lemma provides such a lower bound for nearby points whose support lies in the support of $w^*$, and we need to show that adding a component outside this support increases the loss proportionally. This is precisely the content of Theorem B.4. Putting these pieces together we obtain the following lemma. See Section J.15 for the full proof.
Lemma 5.8.
There are constants and such that with high probability over the following holds. Let . If , then
Convergence of PSGD.
It now follows from the above lemmas and Theorem 5.3 that the PSGD algorithm, as outlined above and described in Section 4, converges to a good approximation of in a polynomial number of updates. The following theorem formalizes the guarantee. See Section J.16 for the proof.
Theorem 5.9.
With high probability over and over the execution of the algorithm, we get
Efficient implementation.
Finally, in Section F we prove that the initialization and each update step are efficient. Efficient gradient estimation within the projection set (i.e. sampling from the truncated normals appearing in the stochastic gradient) does not follow from the prior work, since our projection set is by necessity laxer than that of the prior work [9]. So we replace their rejection sampling procedure with a novel approximate sampling procedure that works under mild assumptions about the truncation set. Together with the convergence bound claimed in Theorem 5.9, these prove Proposition 3.3.
Proof of Proposition 3.3.
Acknowledgments
This research was supported by NSF Awards IIS-1741137, CCF-1617730 and CCF-1901292, by a Simons Investigator Award, by the DOE PhILMs project (No. DE-AC05-76RL01830), by the DARPA award HR00111990021, by the MIT Undergraduate Research Opportunities Program, and by a Google PhD Fellowship.
References
- [1] Takeshi Amemiya. Regression analysis when the dependent variable is truncated normal. Econometrica: Journal of the Econometric Society, pages 997–1016, 1973.
- [2] N Balakrishnan and Erhard Cramer. The art of progressive censoring. Springer, 2014.
- [3] Daniel Bernoulli. Essai d’une nouvelle analyse de la mortalité causée par la petite vérole, et des avantages de l’inoculation pour la prévenir. Histoire de l’Acad., Roy. Sci.(Paris) avec Mem, pages 1–45, 1760.
- [4] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
- [5] Richard Breen et al. Regression models: Censored, sample selected, or truncated data, volume 111. Sage, 1996.
- [6] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, 2006.
- [7] Sylvain Chevillard. The functions erf and erfc computed with arbitrary precision and explicit error bounds. Information and Computation, 216:72–95, 2012.
- [8] A Clifford Cohen. Truncated and censored samples: theory and applications. CRC press, 2016.
- [9] Constantinos Daskalakis, Themis Gouleakis, Christos Tzamos, and Manolis Zampetakis. Computationally and Statistically Efficient Truncated Regression. In the 32nd Conference on Learning Theory (COLT), 2019.
- [10] Mark A Davenport, Jason N Laska, Petros T Boufounos, and Richard G Baraniuk. A simple proof that random matrices are democratic. arXiv preprint arXiv:0911.0736, 2009.
- [11] David L Donoho. For most large underdetermined systems of linear equations the minimal $\ell_1$-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(6):797–829, 2006.
- [12] RA Fisher. Properties and applications of Hh functions. Mathematical tables, 1:815–852, 1931.
- [13] Francis Galton. An examination into the registered speeds of american trotting horses, with remarks on their value as hereditary data. Proceedings of the Royal Society of London, 62(379-387):310–315, 1897.
- [14] Vassilis A Hajivassiliou and Daniel L McFadden. The method of simulated scores for the estimation of ldv models. Econometrica, pages 863–896, 1998.
- [15] Jerry A Hausman and David A Wise. Social experimentation, truncated distributions, and efficient estimation. Econometrica: Journal of the Econometric Society, pages 919–938, 1977.
- [16] Michael P Keane. 20 simulation estimation for panel data models with limited dependent variables. 1993.
- [17] Jason N Laska. Democracy in action: Quantization, saturation, and compressive sensing. PhD thesis, 2010.
- [18] Gangadharrao S Maddala. Limited-dependent and qualitative variables in econometrics. Number 3. Cambridge university press, 1986.
- [19] Karl Pearson. On the systematic fitting of frequency curves. Biometrika, 2:2–7, 1902.
- [20] Karl Pearson and Alice Lee. On the generalised probable error in multiple normal correlation. Biometrika, 6(1):59–68, 1908.
- [21] Mark Rudelson and Roman Vershynin. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the International Congress of Mathematicians 2010 (ICM 2010) (In 4 Volumes) Vol. I: Plenary Lectures and Ceremonies Vols. II–IV: Invited Lectures, pages 1576–1602. World Scientific, 2010.
- [22] Helmut Schneider. Truncated and censored samples from normal populations. Marcel Dekker, Inc., 1986.
- [23] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
- [24] James Tobin. Estimation of relationships for limited dependent variables. Econometrica: journal of the Econometric Society, pages 24–36, 1958.
- [25] Vladislav Voroninski and Zhiqiang Xu. A strong restricted isometry property, with an application to phaseless compressed sensing. Applied and Computational Harmonic Analysis, 40(2):386 – 395, 2016.
- [26] Martin J Wainwright. Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (lasso). IEEE transactions on information theory, 55(5):2183–2202, 2009.
Appendix A Bounding solution of restricted program
In this section we prove that is small with high probability, where is a solution to Program 3. Specifically, we use regularization parameter , and prove that .
The proof is motivated by the following rephrasing of part (a) of Lemma 5.2:
(4) |
where . For intuition, consider the untruncated setting: then , so the equation is simply
where . Since is independent of and has norm , each entry of is Gaussian with variance , so has norm . Additionally, . Finally, is a -isometry, so we get the desired bound on .
Returning to the truncated setting, one bound still holds, namely . The remainder of the above sketch breaks down for two reasons. First, is no longer independent of . Second, bounding no longer implies a bound on .
The first problem is not so hard to work around; we can still bound as follows; see Section J.1 for the proof.
Lemma A.1.
With high probability over and ,
So in equation 4, the last term is with high probability. The first term is always , since . So we know that has small norm. Unfortunately this does not imply that has small norm, but as motivation, assume that we have such a bound.
Since is a -isometry, bounding is equivalent to bounding . To relate this quantity to , our approach is to lower bound the derivative of with respect to . The derivative turns out to have the following elegant form (proof in Section J.2):
Lemma A.2.
For any , .
Crucially, is nonnegative, and relates to survival probability. By integrating a lower bound on the derivative, we get the following lower bound on in terms of . The bound is linear for small , but flattens out as grows. See Section J.3 for the proof.
Lemma A.3.
Let . Then . Additionally, for any constant there is a constant such that if , then
If we want to use this lemma to prove that is at least a constant multiple of , we face two obstacles: (1) may not be large for all , and (2) the lemma only gives linear scaling if : but this is essentially what we’re trying to prove!
To deal with obstacle (1), we restrict to the rows for which is large. To deal with obstacle (2), we have a two-step proof. In the first step, we use the -lower bound provided by Lemma A.3 to show that (so that on average). In the second step, we use this to get linear scaling in Lemma A.3, and complete the proof, showing that .
Formally, define to be the set of indices such that and . In the following lemmas we show that contains a constant fraction of the indices, so by the isometry properties we retain a constant fraction of when restricting to . See Appendices J.4 and J.5 for the proofs of Lemmas A.4 and A.5 respectively.
Lemma A.4.
With high probability, .
Lemma A.5.
For some constant , we have that with high probability, .
We now prove the weaker, first-step bound on . But there is one glaring issue we must address: we made a simplifying assumption that is small. All we actually know is that is small. And has a nontrivial null space.
Here is a sketch of how we resolve this issue. Let and ; we want to show that if is large then is large. Geometrically, is approximately proportional to the distance from to the subspace . Oversimplifying for clarity, we know that for all . This is by itself insufficient. The key observation is that we also know for all . Thus, lies in a hyperoctant shifted to have corner . Since lies in the row space of , it’s perpendicular to , so the closest point to in the shifted hyperoctant should be .
Formalizing this geometric intuition yields the last piece of the proofs of the following theorems. See Section J.6 for the full proofs.
Theorem A.6.
There are positive constants , , and with the following property. Suppose that and . Then with high probability, .
Theorem A.7.
There are positive constants , , and with the following property. Suppose that and . Then with high probability.
Appendix B Proof of statistical recovery
Extend to by defining
We would like to show that . Since is independent of , each entry of is Gaussian with standard deviation . It turns out that a bound of suffices. To get this bound, we decompose
and bound each term separately. Here we are defining , and and similarly.
Lemma B.1.
There is a constant such that under the conditions of Theorem A.7, with high probability over ,
Lemma B.2.
There is a constant such that with high probability.
Combining the above lemmas with the bound on from the previous section, we get the desired theorem. See Section J.9 for the full proof.
Theorem B.3.
There are constants , , and with the following property. Suppose , and . Then with high probability we have .
As an aside that we’ll use later, this proof can be extended to any random vector near with support contained in (proof in Section J.10).
Theorem B.4.
There are constants , , and with the following property. Suppose and . If is a random variable with always, and with high probability, then with high probability .
Returning to the goal of this section, it remains to show that is invertible with high probability. But this follows from the isometry guarantee of Theorem G.1. Our main statistical result, Proposition 3.2, now follows.
Proof of Proposition 3.2.
Take , , and as in the statement of Theorem B.3. Let and . Let be any optimal solution to the regularized program, and let be any solution to the restricted program. By Theorem B.3, with high probability we have and ; and by Theorem G.1, is invertible. So by Lemma 5.2, it follows that . Therefore . ∎
Appendix C Primal-dual witness method
Proof of Lemma 5.1.
For a single sample , the partial derivative in direction is
where expectation is taken over the random variable (for fixed ). Simplifying yields the expression
The second partial derivative of in directions and is therefore
We conclude that
Averaging over all samples yields the claimed result. ∎
The following lemma collects several useful facts that are needed for the PDW method. Parts (a) and (b) are generically true for any $\ell_1$-regularized convex program; part (c) is a holdover from the untruncated setting that is still true. The proof is essentially due to [26], although part (c) now requires slightly more work.
Lemma C.1.
Fix any .
(a) A vector is optimal for Program 2 if and only if there exists some such that
(b) Suppose that are as in (a), and furthermore for all . Then necessarily for any optimal solution to Program 2.
(c) Suppose that are as in (b), with . If is invertible, then is the unique optimal solution to Program 2.
Proof.
Part (a) is simply the subgradient optimality condition in a convex program.
Part (b) is a standard fact about duality; we provide a proof here. Let be any optimal solution to Program 2. We claim that . To see this, first note that , since always holds by definition of a subgradient for the norm. Now, by optimality of and , we have for all . Therefore by convexity, for all . Since is the sum of two convex functions, both must be linear on the line segment between and . Therefore
for all . We conclude that
Since by subgradient optimality, it follows that . Hence, . Since for all , if for some then necessarily for equality to hold.
For part (c), if is invertible, then it is (strictly) positive definite. The Hessian of Program 3 is
Since is always positive, there is some (not necessarily a constant) such that
Thus, the Hessian of the restricted program is positive definite, so the restricted program is strictly convex. Therefore the restricted program has a unique solution. By part (b), any solution to the original program has support in , so the original program also has a unique solution, which must be . ∎
As with the previous lemma, the following proof is essentially due to [26] (with a different subgradient optimality condition).
Proof of Lemma 5.2.
By part (a) of Lemma C.1, a vector is optimal for Program 2 if and only if there is some such that
This vector equality can be written in block form as follows:
Since is optimal in , there is some such that satisfy the first of the two block equations. This is precisely part (a). If furthermore is zero-extended to , and is extended as in part (b), and satisfies , then since for all , we have that is a subgradient for . Therefore is optimal for Program 2. If and is invertible, then is the unique solution to Program 2 by parts (b) and (c) of Lemma C.1. ∎
Appendix D Sparse recovery from the Restricted Isometry Property
In this section we restate a theorem due to [6] about sparse recovery in the presence of noise. Our statement is slightly generalized to allow a trade-off between the isometry constants and the sparsity. That is, as the sparsity decreases relative to the isometry order , the isometry constants are allowed to worsen.
Theorem D.1 ([6]).
Let be a matrix satisfying the -Restricted Isometry Property
for all -sparse . Let be -sparse for some , and let satisfy . Then
where .
Proof.
Let and let . Then
so . Without loss of generality assume that , and for all . Divide into sets of size respecting this order:
Then the Restricted Isometry Property gives
(5) |
For any and , we have , so that
Summing over all , we get
The triangle inequality implies that . Returning to Equation 5, it follows that
as claimed. ∎
Appendix E Summary of the algorithm
Appendix F Algorithm details
In this section we fill in the missing details about the algorithm’s efficiency. Since we have already seen that the algorithm converges in update steps, all that remains is to show that the following algorithmic problems can be solved efficiently:
1. (Initial point) Compute
2. (Stochastic gradient) Given and , compute a sample , where .
3. (Projection) Given , compute .
Initial point.
To obtain the initial point , we need to solve the program
This program has come up previously in the compressed sensing literature (see, e.g., [6]). It can be recast as a Second-Order Cone Program (SOCP) by introducing variables :
Thus, it can be solved in polynomial time by interior-point methods (see [4]).
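For illustration, this initialization program (in the $\ell_1$-regularized least squares form described in Section 1.1) can be handed to an off-the-shelf convex solver. The sketch below uses CVXPY; the regularization weight `lam` is a placeholder rather than the paper's specific constant, and the equivalent constrained formulation from [6] could be expressed just as easily.

```python
import cvxpy as cp
import numpy as np

def initial_point(X, y, lam):
    """l1-regularized least squares: the truncation-oblivious warm start.

    Solved with a generic interior-point/conic solver via CVXPY; lam is a
    placeholder for the constant prescribed by the analysis.
    """
    n, d = X.shape
    w = cp.Variable(d)
    objective = cp.Minimize(cp.sum_squares(X @ w - y) / (2 * n) + lam * cp.norm1(w))
    cp.Problem(objective).solve()
    return np.asarray(w.value)
```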
Stochastic gradient.
In computing an unbiased estimate of the gradient, the only challenge is sampling from . By Assumption II, this takes time. We know from Lemma I.4 that . Since , we have from Lemma I.2 that
Thus, the time complexity of computing the stochastic gradient is .
In the special case when the truncation set is a union of intervals, there is a sampling algorithm with time complexity (Lemma K.4). Hence, in this case the time complexity of computing the stochastic gradient is .
To be more precise, we instantiate Lemma K.4 with accuracy , where is the number of update steps performed. This gives some sampling algorithm . In each step, ’s output distribution is within of the true distribution . Consider a hypothetical sampling algorithm in which is run, and then the output is altered by rejection to match the true distribution. Alteration occurs with probability . Thus, running the PSGD algorithm with , the probability that any alteration occurs is at most . As shown by Theorem H.1, PSGD with succeeds with high probability. Hence, PSGD with succeeds with high probability as well.
Projection.
The other problem we need to solve is projection onto set :
This is a convex QCQP, and therefore solvable in polynomial time by interior point methods (see [4]).
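The projection step admits the same treatment. The sketch below performs a Euclidean projection onto a convex set described by quadratic constraints via CVXPY; the constraint-building callback is left abstract because it stands in for the specific projection set of Section 4, whose exact definition and constant we do not reproduce here.

```python
import cvxpy as cp
import numpy as np

def project_onto_set(v, make_constraints):
    """Euclidean projection of v onto a convex set given by quadratic constraints.

    make_constraints(w) should return a list of CVXPY constraints describing the
    projection set (a convex QCQP feasible region); this is kept abstract since
    it stands in for the set defined in Section 4.
    """
    w = cp.Variable(v.shape[0])
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w - v)), make_constraints(w))
    problem.solve()
    return np.asarray(w.value)

# illustrative usage with a placeholder constraint ||X(w - w0)||_2^2 <= c * n
# (this particular form is an assumption, not the paper's exact definition):
# project_onto_set(v, lambda w: [cp.sum_squares(X @ (w - w0)) <= c * n])
```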
Appendix G Isometry properties
Let consist of samples from Process 1. In this section we prove the following theorem:
Theorem G.1.
For every there are constants , , and with the following property. Let . Suppose that . With probability at least over , for every subset with , the submatrix satisfies
We start with the upper bound, for which it suffices to take .
Lemma G.2.
Let . Suppose that . There is a constant such that
Proof.
In the process for generating , consider the matrix obtained by not discarding any of the samples . Then is a matrix for a random variable ; each row of is a spherical Gaussian independent of all previous rows, but depends on the rows. Nonetheless, by a Chernoff bound, In this event, is a submatrix of matrix with i.i.d. Gaussian entries. By [21],
for some absolute constants . Since is a submatrix of with high probability, and is a submatrix of , it follows that
as desired. ∎
For the lower bound, we use an -net argument.
Lemma G.3.
Let and let with . Let . Then
Proof.
From the constant survival probability assumption,
But , so Taking yields the desired bound. ∎
Lemma G.4.
Let . Fix and fix with . There are positive constants and such that
Proof.
For each let be the indicator random variable for the event that . Let . By Lemma G.3, . Each is independent, so by a Chernoff bound,
In the event , for any with it holds that
So the event in the lemma statement occurs with probability at most ∎
Now we can prove the isometry property claimed in Theorem G.1.
Proof of Theorem G.1.
Let . Let . Take , where is the constant in the statement of Lemma G.4. Let be the -dimensional unit ball. Let be a maximal packing of by radius- balls with centers on the unit sphere. By a volume argument,
Applying Lemma G.4 to each and taking a union bound,
So with probability , the complement of this event holds. And by Lemma G.2, the event holds with probability . In these events we claim that the conclusion of the theorem holds. Take any with , and take any with . Since is maximal, there is some with . Then
But . For sufficiently large , we get . Taking yields the claimed lower bound. ∎
As a corollary, we get that is a -isometry on its row space up to constants (of course, this holds for any with , but we only need it for ).
Corollary G.5.
With high probability, for every ,
Proof.
By Theorem G.1, with high probability all eigenvalues of lie in the interval . Hence, all eigenvalues of lie in the interval . But then
The upper bound is similar. ∎
We also get a Restricted Isometry Property, by applying Theorem G.1 to all subsets of a fixed size.
Corollary G.6 (Restricted Isometry Property).
There is a constant such that for any , if , then with high probability, for every with ,
Proof.
We apply Theorem G.1 to all with , and take a union bound over the respective failure events. The probability that there exists some set of size such that the isometry fails is at most
If for a sufficiently large constant , then this probability is . ∎
From this corollary, our main result for adversarial noise (Theorem 3.4) follows almost immediately:
Proof of Theorem 3.4.
Let be the constant in Corollary G.6. Let , and let . Finally, let .
Let . Suppose that and . Then , so by Corollary G.6, satisfies the -Restricted Isometry Property.
By definition, satisfies and (by feasibility of ). Finally, is -sparse. We conclude from Theorem D.1 and our choice of that
But by the triangle inequality. Thus, ∎
Appendix H Projected Stochastic Gradient Descent
In this section we present the exact PSGD convergence theorem which we use, together with a proof for completeness.
Theorem H.1.
Let be a convex function achieving its optimum at . Let be a convex set containing . Let be arbitrary. For define a random variable by
where and is fixed. Then
where .
Proof.
Fix . We can write
since projecting onto cannot increase the distance to .
Taking expectation over for fixed , we have
where the last inequality uses the fact that is a subgradient for at . Rearranging and taking expectation over , we get that
But now summing over , the right-hand side of the inequality telescopes, giving
This is the desired bound. ∎
Appendix I Survival probability
In this section we collect useful lemmas about truncated Gaussian random variables and survival probabilities.
Lemma I.1 ([9]).
Let and let be a measurable set. Then for a constant .
Lemma I.2 ([9]).
For ,
Lemma I.3 ([9]).
For ,
Lemma I.4.
With high probability,
Proof.
Let for , and let . Since are independent and identically distributed,
Therefore
by Markov’s inequality. ∎
Appendix J Omitted proofs
J.1 Proof of Lemma A.1
We need the following computation:
Lemma J.1.
For any and ,
Proof.
Now we can prove Lemma A.1.
Proof of Lemma A.1.
We prove that the bound holds in expectation, and then apply Markov’s inequality. We need to show that each entry of the vector has expected square . A single entry of the vector is
so its expected square is
For any , the cross-term is
But for fixed , the two terms in the product are independent, and they both have mean , so the cross-term is . Thus,
Then
Hence with probability at least ,
by Markov’s inequality. ∎
J.2 Proof of Lemma A.2
We can write
By the quotient rule,
as desired.
J.3 Proof of Lemma A.3
The fact that follows immediately from the fact that (Lemma A.2).
J.4 Proof of Lemma A.4
Since and is always at most , we have . Since the samples are rejection sampled on , it follows that as well. So by a Chernoff bound, with high probability, the number of such that is at most .
The condition that is clearly satisfied by at most indices.
J.5 Proof of Lemma A.5
J.6 Proof of Theorems A.6 and A.7
Proof of Theorem A.6.
Let and let . Our aim is to show that if is large, then is large, which would contradict Equation 4. Since is not an isometry, we can’t simply show that is large. Instead, we write an orthogonal decomposition for some and with . We’ll show that is large. Since , and is an isometry on the row space of , this suffices.
For every with , we have by Lemma A.3 that
where is the constant which makes Lemma A.3 work for indices with . Take , and suppose that the theorem’s conclusion is false, i.e. . Also suppose that the events of Lemmas A.1 and A.5 hold.
Then by the bound for we get
(6) |
where . We assumed earlier that but Equation 6 certainly also holds when .
By Lemma A.3, and have the same sign for all . So for all . Moreover, together with Equation 6, the sign constraint implies that for ,
Summing over we get
By Lemma A.5 on the LHS and Cauchy-Schwarz on the RHS, we get
Hence . But then . On the other hand, Equation 4 implies that
since event (2) holds. This is a contradiction for sufficiently large and sufficiently small. So either the assumption is false, or the events of Lemma A.1 or A.5 fail. But the latter two events fail with probability . So with high probability.
∎
Now that we know that , we can bootstrap to show that . While the previous proof relied on the constant regime of the lower bound provided by Lemma A.3, the following proof relies on the linear regime.
Proof of Theorem A.7.
As before, let and . Suppose that the conclusion of Theorem A.6 holds, i.e. . Also suppose that the events stated in Lemmas A.1 and A.5 holds. We can make these assumptions with high probability. For , we now know that . Thus,
where . By the same argument as in the proof of Theorem A.6, except replacing by , we get that
J.7 Proof of Lemma B.1
J.8 Proof of Lemma B.2
Draw samples from the distribution as follows: pick and . Keep sample if ; otherwise reject. We want to bound . Now consider the following revised process: keep all the samples, but stop only once samples satisfy . Let be the (random) stopping point; then the random variable defined by the new process stochastically dominates the random variable defined by the original process.
But in the new process, each is Gaussian and independent of . With high probability, by a Chernoff bound. And if are independent then
with high probability, by concentration of norms of Gaussian vectors. Therefore with high probability as well.
J.9 Proof of Theorem B.3
Set , where and are the constants in Lemmas B.1 and B.2. Set . Note that is chosen sufficiently large that .
By Theorem A.7, we have with high probability that the following event holds, which we call :
Now notice that
If holds, then by Theorem G.1, we get . By Lemma B.1, with high probability . And by Lemma B.2, with high probability . Therefore
where the last inequality is by choice of and . Thus, the event
occurs with high probability.
Suppose that event occurs. Now note that has independent Gaussian entries. Fix any ; since is independent of , , and , the dot product
is Gaussian with variance . Hence, is Gaussian with variance at most . So
By a union bound,
So the event holds with high probability.
J.10 Proof of Theorem B.4
We know from Theorem B.3 that . So it suffices to show that
Thus, we need to show that
for all . Fix one such . Then by Lemma A.2,
By Lemma I.4, we have
with high probability over . Assume that this inequality holds, and assume that and , so that . Then by Theorem G.1, . By Lemma I.2, for every and every ,
for a constant . Hence, by Lemma I.3,
We conclude that
Additionally, with high probability. Thus, Cauchy-Schwarz entails that
for large .
J.11 Proof of Lemma 5.4
J.12 Proof of Lemma 5.5
J.13 Proof of Lemma 5.6
Note that is a subgradient for at . Furthermore, for fixed ,
It follows that
is a subgradient for at .
We proceed to bounding . By definition of ,
where is uniformly random, and . Since it remains to bound the other term. We have that
With high probability, for all . Thus,
Now
The first term is bounded by since . Additionally,
since . Therefore the second term is bounded as
where the first and second inequalities are by Lemmas I.3 and Lemma I.2, and the third inequality is by Lemma I.4. Putting together the two bounds, we get
from which we conclude that . The law of total expectation implies that as well.
J.14 Proof of Lemma 5.7
We need to show that is -strongly convex near . Since is convex, it suffices to show that is -strongly convex near . The Hessian of is
Hence, it suffices to show that
for all with and . Call this region . With high probability over we can deduce the following.
(i) By Theorem A.7, we have . As for all , we get for all .
(ii) By the proof of Lemma A.4, the number of such that is at most .
Fix , and define to be the set of indices
For any ,
Thus,
Let denote this lower bound—a positive constant. By (i) and (ii), , so by Theorem G.1,
as desired.
J.15 Proof of Lemma 5.8
Let . Define . Also define . Then , so
Therefore for all , so
Additionally, since and , we know that
Adding the second inequality to the square of the first inequality, and lower bounding the norm by norm,
Since , by convexity as well. Hence,
(7) |
We distinguish two cases:
- 1.
-
2.
If , then , and thus . By convexity and this bound,
which contradicts the lemma’s assumption for a sufficiently small constant .
J.16 Proof of Theorem 5.9
Appendix K Efficient sampling for union of intervals
In this section, in Lemma K.4, we show that when $S$ is a union of intervals, Assumption II holds. The only difference is that instead of exact sampling we have approximate sampling, but the approximation error is exponentially small in total variation distance and hence it cannot affect any algorithm that runs in polynomial time.
Definition K.1 (Evaluation Oracle).
Let $f$ be an arbitrary real function. We define the evaluation oracle of $f$ as an oracle that, given a number $x$ and a target accuracy $\epsilon$, computes an $\epsilon$-approximate value of $f(x)$, that is, a value $\tilde{f}(x)$ with $|\tilde{f}(x) - f(x)| \leq \epsilon$.
Lemma K.2.
Let be a real increasing and differentiable function and an evaluation oracle of . Let for some . Then we can construct an algorithm that implements the evaluation oracle of , i.e. . This implementation on input and input accuracy runs in time and uses at most calls to the evaluation oracle with inputs with representation length and input accuracy , with .
Proof of Lemma K.2.
Given a value our goal is to find an such that . Using doubling we can find two numbers such that and for some to be determined later. Because of the lower bound on the derivative of we have that this step will take steps. Then we can use binary search in the interval where in each step we make a call to the oracle with accuracy and we can find a point such that . Because of the upper bound on the derivative of we have that is -Lipschitz and hence this binary search will need time and oracle calls. Now using the mean value theorem we get that for some it holds that which implies that , so if we set , the lemma follows. ∎
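A minimal sketch of this doubling-plus-bisection inversion, for an increasing function given through an evaluation oracle, is the following; it ignores the explicit oracle-accuracy bookkeeping of the lemma, and the function and tolerance names are ours.

```python
def invert_increasing(f, target, tol=1e-12):
    """Find x with f(x) approximately equal to target, for an increasing real f.

    Doubling to bracket the target, then bisection; mirrors the proof of
    Lemma K.2 at a high level, assuming a bracketing point exists.
    """
    lo, hi = -1.0, 1.0
    while f(lo) > target:       # doubling step to find a lower bracket
        lo *= 2
    while f(hi) < target:       # doubling step to find an upper bracket
        hi *= 2
    while hi - lo > tol:        # bisection on the bracketed interval
        mid = 0.5 * (lo + hi)
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```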
Lemma K.3.
Let be a closed interval and such that . Then there exists an algorithm that runs in time and returns a sample of a distribution , such that .
Proof Sketch.
The sampling algorithm follows these steps: (1) from the cumulative distribution function of the truncated distribution, define a monotone map from the interval to $[0, 1]$; (2) sample a number uniformly from $[0, 1]$; (3) using an evaluation oracle for the error function, as per Proposition 3 in [7], and using Lemma K.2, compute with exponential accuracy the preimage of the sampled number under this map, and return this as the desired sample. ∎
Lemma K.4.
Let , , , be closed intervals and such that . Then there exists an algorithm that runs in time and returns a sample of a distribution , such that .
Proof Sketch.
Using Proposition 3 in [7] we can compute, with exponential accuracy, an estimate of the Gaussian mass of every interval. If the estimated mass of an interval is negligibly small, we do not consider that interval in the next step. From the remaining intervals we choose one with probability proportional to its estimated mass. Because of the high accuracy in the computation of these masses, this is close in total variation distance to choosing an interval proportionally to its true mass. Finally, after choosing an interval, we run the algorithm of Lemma K.3 within it with exponential accuracy, and hence the overall total variation distance from the target truncated distribution is exponentially small. ∎
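Putting Lemmas K.3 and K.4 together, a sketch of the resulting sampler could look as follows. It uses library Gaussian CDF/quantile routines in place of the arbitrary-precision erf evaluation of [7], and does not reproduce the numerical-precision guarantees of the lemmas; the function names are ours.

```python
import numpy as np
from scipy.stats import norm

def sample_truncated_normal(mu, intervals, rng):
    """One sample from N(mu, 1) conditioned on a union of closed intervals.

    Chooses an interval with probability proportional to its Gaussian mass,
    then inverts the CDF within that interval (Lemmas K.3/K.4, without the
    explicit accuracy analysis).
    """
    masses = np.array([norm.cdf(b, loc=mu) - norm.cdf(a, loc=mu) for a, b in intervals])
    idx = rng.choice(len(intervals), p=masses / masses.sum())  # pick interval by mass
    a, b = intervals[idx]
    u = rng.uniform(norm.cdf(a, loc=mu), norm.cdf(b, loc=mu))  # uniform in CDF range
    return norm.ppf(u, loc=mu)                                 # inverse CDF within [a, b]
```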